Thursday, 12 October 2023

When privacy rules protect fraudsters

I was recently contacted with what I thought was a simple request: could I check the Oxford University Gazette to confirm that a person, X, had undergone an oral examination (viva) for a doctorate a few years ago? The request came indirectly from a third party, Y, via a colleague who knew that, on the one hand, I was interested in scientific fraud and, on the other, that I was based at Oxford.

My first thought was that this was a rather cumbersome way of checking someone's credentials. For a start, as Y had discovered, you can consult the online University Gazette only if you have an official affiliation with the university. In theory, when someone has a viva, the internal examiner notifies the University Gazette, which announces the details in advance so that members of the university can attend if they so wish. In practice, it is vanishingly rare for an audience to turn up, and the formal notification to the Gazette may get overlooked.

But why, I wondered, didn't Y just check the official records of Oxford University listing names and dates of degrees? Well, to my surprise, it turned out that you can't do that. The university website is clear that to verify someone's qualifications you need to meet two conditions. First, the request can only be made by "employers, prospective employers, other educational institutions, funding bodies or recognised voluntary organisations". Second, "the student's permission ... should be acquired prior to making any verification request".

Anyhow, I found evidence online that X had been a graduate student at the university, but when I checked the Gazette I could find no mention of X having had an oral examination. The other source of evidence would be the University Library, where a copy of the thesis should be deposited for every higher degree. I couldn't find it in the catalogue. I suggested that Y might check further, but they were already ahead of me and had confirmed with the librarian that no thesis had been deposited in that name.

Now, I have no idea whether X is fraudulently claiming to have an Oxford doctorate, but I'm concerned that it is so hard for a private individual to validate someone's credentials. As far as I can tell, the justification comes from data protection regulations, which control what information organisations can hold about individuals. This is not an Oxford-specific interpretation of the rules: I checked a few other UK universities, and the same processes apply.

Having said that, Y pointed out to me that there is a precedent for Oxford University providing information when there is media interest in a high-profile case: in response to a freedom of information request, the university confirmed that Ferdinand Marcos Jr did not have the degree he was claiming.

There will always be tension between openness and the individual's right to privacy, but the way the rules are interpreted means that anyone could claim to have a degree from a UK university and it would be impossible to check this. Is there a solution? I'm no lawyer, but I would have thought it should be trivial to require that, on receipt of a degree, the student is asked to give signed permission for their name, degree and date of degree to be recorded on a publicly searchable database. I can't see a downside to this, and going forward it would save a lot of administrative time spent dealing with verification requests.

Something like this does seem to work outside Europe. I only did a couple of spot checks, but found this for York University (Ontario):

"It is the University's policy to make information about the degrees or credentials conferred by the University and the dates of conferral routinely available. In order to protect our alumni information as much as possible, YU Verify will give users a result only if the search criteria entered matches a unique record. The service will not display a list of names which may match criteria and allow you to select."

And for Macquarie University, Australia, there is exactly the kind of searchable website that I'd assumed Oxford would have.

I'd be interested if anyone can think of unintended bad consequences of this approach. I had a bit of to-and-fro on Twitter about this with someone who argued that it was best to keep as much information as possible out of the public domain. I remain unconvinced: academic qualifications are important for providing someone with credentials as an expert, and if we make it easy for anyone to pretend to have a degree from a prestigious institution, I think the potential for harm is far greater than any harms caused by lack of privacy. Or have I missed something? 

N.B. Comments on the blog are moderated, so they may only appear after a delay.


P.S. Some thoughts via Mastodon from Martin Vueilleme on potential drawbacks of a directory:

Far-fetched, but I could see the following reasons:

- You live in an oppressive country that targets academics, intellectuals
- Hiding your university helps prevent stalkers (or other predators) from getting further information on you
- Hiding your university background to fit in a group
- Your thesis is on a sensitive topic or a topic forbidden from being studied where you live
- Hiding your university degree because you were technically not allowed to get it (e.g. women)

My (DB) response is that, in balancing these risks against the risk of fraudsters benefiting from a lack of checking, the case for the open directory is strengthened, as these risks seem very slight for UK universities (at least for now!). The other cost/benefit analysis concerns finances, where an open directory would also seem superior: it costs money to maintain the directory, but that has to be done anyhow, whereas currently there are extra costs for staff employed to respond to verification requests.

Monday, 2 October 2023

Spitting out the AI Gobbledegook sandwich: a suggestion for publishers

The past couple of years have been momentous for some academic publishers. As documented in a preprint this week, they have grown rapidly, largely via "special issues" of journals, dramatically increasing the number of articles they publish while making enormous profits. A recent guest post by Huanzi Zhang, however, showed this has not been without problems. Unscrupulous operators of so-called "papermills" saw an opportunity to boost their own profits by selling authorship slots and then placing fraudulent articles in special issues that were controlled by complicit editors. Gradually, publishers realised they had a problem and started to retract fraudulent articles. Hindawi alone has retracted over 5000 articles since 2021*. As described in Huanzi's blogpost, this has made shareholders nervous and dented the profits of parent company Wiley.

There are numerous papermills, and we only know about the less competent ones, whose dodgy articles are relatively easy to detect. For a deep dive into papermills in Hindawi journals, see this blogpost by the anonymous sleuth Parashorea tomentella. At least one papermill is the source of a series of articles that follow a template I have termed the "AI gobbledegook sandwich". See, for instance, my comments here on an article that has yet to be retracted. For further examples, search the PubPeer website for "gobbledegook sandwich".

After studying a number of these articles, I have formed the impression that they are created as follows. You start with a genuine article. Most of these look like student projects. The topics are various, but in general they are weak on scientific content. They may be a review of an area or, if data are gathered, it is likely to be some kind of simple survey. In some cases, reference is made to a public dataset. To create a paper for submission, the following steps are taken:

- The title is changed to include terms that relate to the topic of a special issue, such as "Internet of Things" or "Big data".

- Phrases mentioning these terms are scattered through the Abstract and Introduction.

- A technical section describing the method to be used is embedded in the middle of the original piece. Typically this is full of technical equations. I suspect these are usually correct, in that they use standard formulae from areas such as machine learning, and in some cases they can be traced to Wikipedia or another source. It is not uncommon to see very basic definitions, e.g. the formulae for sensitivity and specificity of prediction (reproduced after this list).

- A results section is created showing figures that purport to demonstrate how the AI method has been applied to the data. This often reveals that the paper is problematic, as the plots are at best unclear and at worst bear no relationship to anything that has gone before. Labels for figures and axes tend to be vague. A typical claim is that the prediction from the AI model is better than the results from other, competing models. It is usually hard to work out what is being predicted from what.

- The original essay resumes for a Conclusions section, but with a sentence added to say how AI methods have been useful in improving our understanding.

- An optional additional step is to sprinkle irrelevant citations through the text: we know that papermills collect further income by selling citations, and new papers can act as vehicles for these.
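Since very basic definitions like sensitivity and specificity come up repeatedly, here they are in the standard notation, where TP, FP, TN and FN are the counts of true positives, false positives, true negatives and false negatives:

\[
\text{sensitivity} = \frac{TP}{TP + FN}, \qquad
\text{specificity} = \frac{TN}{TN + FP}
\]

These are textbook formulae that can be copied from anywhere, which is exactly why their presence in a methods section carries no evidential weight about whether any analysis was actually done.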

Papermills have got away with this because the content of these articles is sufficiently technical and complex that the fraud may only be detectable on close reading. Where I am confident there is fraud, I will use the term "gobbledegook sandwich" in my report on PubPeer, but there are many, many papers where my suspicions are raised, yet it would take more time than it is worth for me to comb through the article to find compelling evidence.

For a papermill, the beauty of the AI gobbledegook sandwich is that AI methods can be applied to almost any topic, and there are so many different algorithms to choose from that a potentially infinite number of papers can be written to this template. The ones I have documented cover topics as varied as educational methods, hotel management, sports, art, archaeology, Chinese medicine, music, building design, mental health and the promotion of Marxist ideology. In none of these papers did the application of AI methods make any sense, and they would not get past a competent editor or reviewers; but once a complicit editor is planted in a journal, they can accept numerous articles.

Recently, Hindawi has ramped up its integrity operations and is employing many more staff to try to shut this particular stable door. But Hindawi is surely not the only publisher infected by this kind of fraud, and we need a solution that can be used by all journals. My simple suggestion is to focus on prevention rather than cure, by requiring that all articles reporting work that uses AI/ML methods adopt the reporting standards being developed for machine-learning-based science, as described on this website. These require computational reproducibility, i.e. data and scripts must be provided so that all results can be reproduced. That would be a logical impossibility for AI gobbledegook sandwiches.
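As a sketch of what enforcement could look like in practice (the file names, metrics and tolerance below are my own invention, not part of any published standard): a submission would deposit its data, its analysis script and its reported results, and the journal, or a bot acting for it, could rerun the analysis and compare the output with the paper's claims.

# Hypothetical reproducibility check for a submitted paper.
# Assumes the authors deposited analysis.py, data.csv and
# reported_results.json; all names are invented for illustration.
import json
import subprocess

# Rerun the deposited analysis end to end (a fixed random seed
# inside analysis.py is assumed, so results are deterministic).
subprocess.run(["python", "analysis.py", "--data", "data.csv",
                "--out", "rerun_results.json"], check=True)

with open("reported_results.json") as f:
    reported = json.load(f)   # the numbers claimed in the paper
with open("rerun_results.json") as f:
    rerun = json.load(f)      # the numbers just recomputed

for metric, claimed in reported.items():
    recomputed = rerun[metric]
    status = "OK" if abs(claimed - recomputed) < 1e-6 else "MISMATCH"
    print(f"{metric}: claimed={claimed}, recomputed={recomputed} -> {status}")

A gobbledegook sandwich could not survive even this crude check, because behind its figures there is no genuine analysis to rerun.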

Open science practices were developed with the aim of improving the reproducibility and credibility of science but, as I've argued elsewhere, they could also be highly effective in preventing fraud. Mandating reporting standards could be an important step which, if accompanied by open peer review, would make life much harder for the papermillers.

*The source is a spreadsheet maintained by the anonymous sleuth Parashorea tomentella.

N.B. Comments on this blog are moderated, so there may be a delay before they appear.