Saturday, 31 December 2022

New Year's Eve Quiz 2022

Dodgy journals special

With so much happening in the world this year, it’s easy to miss recent developments in academic publishing. Test your knowledge here to see how alert you are to news from the dark underbelly of research communication.

 

1. Which of these is part of a paper mill¹?

[Images A, B and C]

2. How many of these tortured phrases² can you decode?

 

a) In the context of a chemistry experiment: “watery arrangements”
b) In the context of pharmaceuticals: “medication conveyance”
c) In the context of statistics: “irregular esteem”
d) In the context of medicine: “bosom peril”
e) In the context of optical sensors: “wellspring of blunder”

 

 

3. Which journal published a paper beginning with the sentence:

Persistent harassment is a major source of inefficiency and your growth will likely increase over the next several years.

 

and ending with:

The method outlined here can be used to easily illuminate clinical beginnings about confinement in appropriate treatment, sensitivity and the number of treatment sessions, and provides an incentive to investigate the brain regions of two mice and humans

 

a) Proceedings of the National Academy of Sciences
b) Acta Scientifica
c) Neurosciences and Brain Imaging
d) Serbian Journal of Management

 

 

4. What have these authors got in common?  

 

Georges Chastellain, Jean Bodel, Suzanne Lilar, Henri Michaux, and Pierre Mertens

 

a) They are all eminent French literary figures
b) They all had a cat called Fifi
c) They are authors of papers in the Research Journal of Oncology, vol 6, issue 5
d) They were born in November

 

5. What kind of statistical test would be appropriate for these data? 

a) t-test

b) no-way analysis of variance

c) subterranean insect optimisation

d) flag to commotion ratio

 

6. Many eminent authors have published in one of these Prime Scholars journals:

i) Polymer Sciences
ii) Journal of Autacoids
iii) Journal of HIV and Retrovirus
iv) British Journal of Research

 

Can you match the author to the journal?

a) Jane Austen
b) Kurt Vonnegut
c) Walt Whitman
d) Herman Hesse
e) Tennessee Williams
f) Ayn Rand

 

 

7. Some poor authors have their names badly mangled by those who appropriate their identities while attempting to avoid plagiarism checks. Can you reconstruct the correct versions of these two names (and the affiliation for author 1)?

 

[Images of the two mangled names]

Final thoughts

 

While the absurdity of dodgy journals can make us laugh, there is, of course, a dark side to all of this that cannot be ignored. The huge demand for places to publish has not only created obviously predatory publishers, who will publish anything for money, but has also allowed fraud to infiltrate supposedly reputable publishers. Paper mills are seen as a growing problem, and all kinds of fraud abound, even in some of the upper echelons of academia. As I argued in my last blogpost, it’s far too easy to get away with academic misconduct, and the incentives for researchers to fake data and publications are growing all the time. My New Year’s wish is that funders, academic societies and universities start to grapple with this problem more urgently, so that there won’t be material for such a quiz in 2023.

 

 

References

 

1 COPE & STM. (2022). Paper mills: Research report from COPE & STM. Committee on Publication Ethics and STM. https://doi.org/10.24318/jtbG8IHL

 

2 Cabanac, G., Labbé, C., & Magazinov, A. (2021). Tortured phrases: A dubious writing style emerging in science. Evidence of critical issues affecting established journals (arXiv:2107.06751). arXiv. https://doi.org/10.48550/arXiv.2107.06751

 

 

ANSWERS

 

1. B is a solicitation for an academic paper mill. A is a flour mill and C is a paper mill of the more regular kind. B was discussed here.

 

2. Who knows? Best guesses are:

a) aqueous solutions

b) drug delivery

c) random value

d) breast cancer

e) source of error

 

If you enjoy this sort of word game, you can help by typing "tortured phrases" into PubPeer and checking out the papers that have been detected by the Problematic Paper Screener.  
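For anyone curious how such phrases can be flagged automatically, here is a toy sketch in Python of the kind of lookup involved, using only the phrase pairs from the answers above. The real Problematic Paper Screener is considerably more sophisticated than this; the sample sentence is invented for illustration.

```python
# Toy illustration of tortured-phrase spotting by simple string matching,
# using the phrase pairs from the quiz answers above. The real Problematic
# Paper Screener uses a much larger curated list and additional checks.
TORTURED_PHRASES = {
    "watery arrangements": "aqueous solutions",
    "medication conveyance": "drug delivery",
    "irregular esteem": "random value",
    "bosom peril": "breast cancer",
    "wellspring of blunder": "source of error",
}

def flag_tortured_phrases(text):
    """Return (tortured phrase, likely original) pairs found in the text."""
    lowered = text.lower()
    return [(bad, good) for bad, good in TORTURED_PHRASES.items() if bad in lowered]

# Hypothetical example sentence, not taken from any real paper
sample = "The medication conveyance profile was assessed in watery arrangements."
for bad, good in flag_tortured_phrases(sample):
    print(f"'{bad}' -> probably '{good}'")
```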

 

3. c) https://www.primescholars.com/articles/a-short-note-on-mechanism-of-brain-in-animals-and-humans.pdf 

 

4. c) see https://www.primescholars.com/archive/iprjo-volume-6-issue-5-year-2022.html 

If you answered (a), you were misled by the Poirot fallacy – all of them except Bodel are Belgian. 

 

5. Fortunately the paper has been retracted, so no answer is required. For further details, see here.

c) is the tortured-phrase version of “ant colony optimisation” (which is a real thing!) and d) is a reference to “signal-to-noise ratio”.

 

 

6.        

Jane Austen  (ii) and (iv)

Kurt Vonnegut (ii) and (iii)

Walt Whitman (i) (ii) and (iv)

Herman Hesse (iv)

Tennessee Williams (ii)

Ayn Rand (iii)

 

See: https://www.primescholars.com/archive/jac-volume-3-issue-2-year-2022.html 

https://www.primescholars.com/archive/ipbjr-volume-9-issue-7-year-2022.html 

https://www.primescholars.com/archive/ipps-volume-7-issue-4-year-2022.html 

https://www.primescholars.com/archive/ipps-volume-7-issue-2-year-2022.html 

https://www.primescholars.com/archive/ipbjr-volume-9-issue-9-year-2022.html 


 

7.

This article is available here: https://www.primescholars.com/archive/jac-volume-2-issue-3-year-2020.html. A genuine email address has been added to the paper and is the clue to the person whose identity was used: Williams, GM, with an address at New York Medical College, Valhalla campus. Given the mangling of his name, I suspect he is no more aware of his involvement in the paper than Jane Austen or Kurt Vonnegut.

For the second example, see https://pubpeer.com/publications/B7E65FDF7565448A0507B32123E4D8 




Friday, 16 December 2022

When there are no consequences for misconduct: Parallels between politics and science

 

Gustave Doré: Illustration for Paradise Lost

(Updated 17 Dec 2022) 

As children, we grow up with stories of the battle between good and evil, in which good ultimately triumphs. In adulthood, we know things can be more complicated: bad people can get into positions of power and make everyone suffer. And yet, we tell ourselves, we have a strong legal framework, there are checks and balances, and a political system that aspires to be free and fair.

 

During the last decade, I started for the first time to have serious doubts about those assumptions. In both the UK and the US, the same pattern is seen repeatedly: the media report on a scandal involving the government or a public figure, there is a brief period of public outrage, but then things continue as before.

 

In the UK we have become accustomed to politicians lying to Parliament and failing to correct the record, to bullying by senior politicians, and to safety regulations being ignored.  The current scandal is a case of disaster capitalism where government cronies made vast fortunes from the Covid pandemic by gaining contracts for personal protective equipment – which was not only provided at inflated prices, but then could not be used as it was substandard.

 

These are all shocking stories, but even more shocking is the lack of any serious consequences for those who are guilty. In the past, politicians would have resigned for minor peccadilloes, with pressure from the Prime Minister if need be. During Boris Johnson’s premiership, however, the Prime Minister was part of the problem. 

 

During the Trump presidency in the US, Sarah Kendzior wrote about “saviour syndrome” – the belief that someone would come along and put things right. As she noted: “Mr. Trump has openly committed crimes and even confessed to crimes: What is at stake is whether anyone would hold him accountable.” And, sadly, the answer has been no.

 

No consequences for scientific fraud

So what has this got to do with science?  Well, I get the same sinking feeling that there is a major problem, everyone can see there's a problem, but nobody is going to rescue us. Researchers who engage in obvious malpractice repeatedly get away with no consequences.  This has been a recurring theme from those who have exposed academic papermills (Byrne et al., 2021) and/or reported manipulation of figures in journal articles (Bik et al., 2016).  For instance, when Bik was interviewed by Nature, she noted that 60-70% of the 800 papers she had reported to journals had not been dealt with within 5 years. That matches my more limited experience; if one points out academic malpractice to publishers or institutions, there is often no reply. Those who do reply typically say they will investigate, but then you hear no more.

 

At a recent symposium on Research Integrity at Liverpool Medical Institution*, David Sanders (Purdue University) told of repeated experiences of being given the brush-off by journals and institutions when reporting suspect papers. For instance, he reported an article that had simply recycled a table from a previous paper on a different topic. The response was “We will look into it”. “What”, said David incredulously, “is there to look into?”. This is the concern – that there can be blatant evidence of malpractice within a paper, yet the complainant is ignored. In this case, nothing happened. There are honourable exceptions, but it seems shocking that serious and obvious errors in work are not dealt with in a prompt and professional manner.

 

At the same seminar, there was a searing presentation by Peter Wilmshurst, whose experiences of exposing medical fraud by powerful individuals and organisations have led him to be the subject of numerous libel complaints.  Here are a few details of two of the cases he presented:

 

Paolo Macchiarini: Convicted in 2022 of causing bodily harm with an experimental transplant of a synthetic windpipe that he performed between 2011 and 2012. Wilmshurst noted that the descriptions of the experimental surgery in journals were incorrect. For a summary, see this BMJ article. A 2008 paper by Macchiarini and colleagues remains in the Lancet, despite demands for it to be retracted. 

 

Don Poldermans: An eminent cardiologist who conducted a series of studies on perioperative beta-blockers, leading them to be recommended in guidelines from the European Society of Cardiology, whose task force he chaired. A meta-analysis challenged that conclusion, showing that mortality increased; an investigation found that work by Poldermans had serious integrity problems, and he was fired. Nevertheless, the papers have not been retracted. Wilmshurst estimated that thousands of deaths would have resulted from physicians following the guidelines recommending beta-blockers.

 

The week before the Liverpool meeting, there was a session on Correcting the Record at AIMOS2022. The four speakers, John Loadsman (anaesthesiology), Ben Mol (obstetrics and gynaecology), Lisa Parker (oncology) and Jana Christopher (image integrity), covered the topic from a range of different angles, but in every single talk the message came through loud and clear: it’s not enough to flag up cases of fraud – you then have to get someone to act on them, and that is far more difficult than it should be.

 

And then on the same day as the Liverpool meeting, Le Monde ran a piece about a researcher whose body of work contained numerous problems: the same graphs were used across different articles that purported to show different experiments, and other figures had signs of manipulation.  There was an investigation by the institution and by the funder, Centre National de la Recherche Scientifique (CNRS), which concluded that there had been several breaches of scientific integrity. However, it seems that the recommendation was simply that the papers should be “corrected”.

 

Why is scientific fraud not taken seriously?

There are several factors that conspire to get scientific fraud brushed under the carpet.

1.     Accusations of fraud may be unfounded. In science, as in politics, there may be individuals or organisations who target people unfairly – either for personal reasons, or because they don’t like their message. Furthermore, everyone makes mistakes and it would be dangerous to vilify researchers for honest errors. So it is vital to do due diligence and establish the facts. In practice, however, this typically means giving the accused the benefit of the doubt, even when the evidence of misconduct is strong.  While it is not always easy to demonstrate intent, there are many cases, such as those noted above, where a pattern of repeated transgressions is evident in published papers – and yet nothing is done.  

2.     Conflict of interest. Institutions may be reluctant to accept that someone is fraudulent if that person occupies a high-ranking role in the organisation, especially if they bring in grant income. Worries about reputational risk also create conflict of interest. The Printeger project is a set of case studies of individual research misconduct cases, which illustrates just how inconsistently these are handled in different countries, especially with regard to transparency vs confidentiality of process. It concluded: “The reflex of research organisations to immediately contain and preferably minimise misconduct cases is remarkable”.

3.     Passing the buck. Publishers may be reluctant to retract papers unless there is an institutional finding of misconduct, even if there is clear evidence that the published work is wrong. I discussed this here. My view is that leaving flawed research in the public record is analogous to a store selling poisoned cookies to customers – when the evidence is clear, you have a responsibility to correct the record as soon as possible to avoid harm to consumers. Funders might also be expected to play a role in correcting the record when research they have funded is shown to be flawed. Where public money is concerned, funders surely have a moral responsibility to ensure it is not wasted on fraudulent or sloppy research. Yet in her introduction to the Liverpool seminar, Patricia Murray noted that the new UK Committee on Research Integrity (CORI) does not regard investigation of research misconduct as within its purview.  

4.     Concerns about litigation. Organisations often have concerns that they will be sued if they make investigations of misconduct public, even if they are confident that misconduct occurred. These concerns are justified, as can be seen from the lawsuits that most of the sleuths who spoke at AIMOS and Liverpool have been subjected to.  My impression is that, provided there is clear evidence of misconduct, the fraudsters typically lose libel actions, but I’d be interested in more information on that point.

 

 

Consequences when misconduct goes unpunished

 

The lack of consequences for misconduct has many corrosive impacts on society. 

 

1.     Political and scientific institutions can only operate properly if there is trust. If lack of integrity is seen to be rewarded, this erodes public confidence. 

 

2.     People depend on us getting things right. We are confronting major challenges to health and to our environment. If we can’t trust researchers to be honest, then we all suffer as scientific progress stalls.  Over-hyped findings that make it into the literature can lead subsequent generations of researchers to waste time pursuing false leads.  Ultimately, people are harmed if we don’t fix fraud.

 

3.     Misconduct leads to waste of resources. It is depressing to think of all the research that could have been supported by the funds that have been spent on fraudulent studies.

 

4.     People engage in misconduct because in a competitive system, it brings them personal benefits, in terms of prestige, tenure, power and salary. If the fraudsters are not tackled, they end up in positions of power, where they will perpetuate a corrupt system; it is not in their interests to promote those who might challenge them.

 

5.     The new generation entering the profession will become cynical if they see that one needs to behave corruptly in order to succeed. They are left with the stark choice of joining in the corruption or leaving the field.

 

 

What can be done?

 

There’s no single solution, but I think there are several actions that are needed to help clean up the mess.

 

1.     Appreciate the scale of the problem.

When fraud is talked about in scientific circles, you typically get the response that “fraud is rare” and “science is self-correcting”. A hole has been blown in the first assumption by the emergence of industrial-scale fraud in the form of academic paper mills. The large publishers are now worried enough about this to be taking concerted action to detect papermill activity, and some of them have engaged in mass retractions of fraudulent work (see, e.g., the case of IEEE retractions here). Yet I have documented on PubPeer numerous new papermill articles in Hindawi special issues that have appeared since September of this year, when the publisher announced it would retract 500 papers. It’s as if the publisher is trying to clean up with a mop while a fire-hose is spewing out fraudulent content. This kind of fraud is different from that reported by Wilmshurst, but it illustrates just how slow the business of correcting the scientific record can be – even when the evidence for fraud is unambiguous. 

Publishers trying to mop up papermill outputs
 

Yes, self-correction will ultimately happen in science, when people find they cannot replicate the flawed research on which they try to build. But the time-scale for such self-correction is often far longer than it needs to be.  We have to understand just how much waste of time and money is caused by reliance on a passive, natural evolution of self-correction, rather than a more proactive system to root out fraud.  

 

2.     Full transparency

There’s been a fair bit of debate about open data, and now it is recognised that we also need open code (scripts to generate figures etc.) to properly evaluate results. I would go further, though, and say we also need open peer review. This need not mean that the peer reviewer is identified, but just that their report is available for others to read. I have found open peer reviews very useful in identifying papermill products.

 

3.     Develop shared standards

Organisations such as the Committee on Publication Ethics (COPE) give recommendations for editors about how to respond when an accusation of misconduct occurs.  Although this looks like a start in specifying standards to which reputable journals should adhere, several speakers at the AIMOS meeting suggested that COPE guidelines were not suited for dealing with papermills and could actually delay and obfuscate investigations. Furthermore, COPE has no regulatory power and publishers are under no obligation to follow the guidelines (even if they state they will do so).

 

4.     National bodies for promoting scientific integrity

The Printeger project (cited above) noted that “A typical reaction of a research organisation facing unfamiliar research misconduct without appropriate procedures is to set up ad hoc investigative committees, usually consisting of in-house senior researchers…. Generally, this does not go well.”

In response to some high-profile cases that did not go well, some countries have set up national bodies for promoting scientific integrity. These are growing in number, but those who report cases to them often complain that they are not much help when fraud is discovered – sometimes this is because they lack the funding to defend a legal challenge. But, as with shared standards, this is at least a start, and they may help gather data on the scale and nature of the problem.  

 

5.     Transparent discussion of breaches of research integrity

Perhaps the most effective way of persuading institutions, publishers and funders to act is by publicising when they have failed to respond adequately to complaints.  David Sanders described a case where journals and institutions took no action despite multiple examples of image manipulation and plagiarism from one lab.  He only got a response when the case was featured in the New York Times.

Nevertheless, as the Printeger project noted, relying on the media to highlight fraud is far from ideal – there can be a tendency to sensationalise and simplify the story, with potential for disproportionate damage to both accused and whistleblowers. If we had trustworthy and official channels to report suspected research misconduct, then whistleblowers would be less likely to seek publicity through other means.

 

6.     Protect whistleblowers

In her introduction to the Liverpool Research Integrity seminar, Patricia Murray noted the lack of consistency in institutional guidelines on research integrity. In some cases, the approach to whistleblowers seemed hostile, with the guidelines emphasising that they would be guilty of misconduct if they were found to have made frivolous, vexatious and/or malicious allegations. This, of course, is fair enough, but it needs to be countered by recommendations that allow for whistleblowers who are none of these things, who are doing the institution a service by casting light on serious problems. Indeed, Prof Murray noted that in her institution, failure to report an incident that gives reasonable suspicion of research misconduct is itself regarded as misconduct.  At present, whistleblowers are often treated as nuisances or cranks who need to be shut down. As was evident from the cases of both Sanders and Wilmshurst, they are at risk of litigation, and careers may be put in jeopardy if they challenge senior figures.

 

7.     Changing the incentive structure in science

It’s well-appreciated that if you really want to stop a problem, you should understand what causes it and stop it at source. People do fraudulent research because the potential benefits are large and the costs seem negligible. We can change that balance by, on the one hand, having serious and public sanctions for those who commit fraud, and, on the other, rewarding scientists who emphasise integrity, transparency and accuracy in their work, rather than those who get flashy, eye-catching results.

 

I'm developing my ideas on this topic and I welcome thoughts on these suggestions. Comments are moderated and so do not appear immediately, but I will post any that are on topic and constructive.  



Update 17th December 2022  


Jennifer Byrne suggested one further recommendation, as follows:

To change the incentive structure in scientific publishing. Journals are presently rewarded for publishing, as publishing drives both income (through subscriptions and/or open access charges) and the journal impact factor. In contrast, journals and publishers do not earn income and are not otherwise rewarded for correcting the literature that they publish. This means that the (seemingly rare) journals that work hard to correct, flag and retract erroneous papers are rewarded identically to journals that appear to do very little. Proactive journals appear to represent a minority, but while there are no incentives for journals to take a proactive approach to published errors and misinformation, it should not be surprising that few journals join their efforts. Until publication and correction are recognized as two sides of the same coin, and valued as such, it seems inevitable that we will see a continued drive towards publishing more and correcting very little, or continuing to value publication quantity over quality.

 

Bibliography 

I'll also add here additional resources. I'm certainly not the first to have made the points in this post, and it may be useful to have other articles gathered together in one place.  

 

Besançon, L., Bik, E., Heathers, J., & Meyerowitz-Katz, G. (2022). Correction of scientific literature: Too little, too late! PLOS Biology, 20(3), e3001572. https://doi.org/10.1371/journal.pbio.3001572   

 

Byrne, J. A., Park, Y., Richardson, R. A. K., Pathmendra, P., Sun, M., & Stoeger, T. (2022). Protection of the human gene research literature from contract cheating organizations known as research paper mills. Nucleic Acids Research, gkac1139. https://doi.org/10.1093/nar/gkac1139 

 

Christian, K., Larkins, J., & Doran, M. R. (2022). The Australian academic STEMM workplace post-COVID: a picture of disarray. BioRxiv. https://doi.org/10.1101/2022.12.06.519378 

 

Lévy, R. (2022, December 15). Is it somebody else’s problem to correct the scientific literature? Rapha-z-Lab. https://raphazlab.wordpress.com/2022/12/15/is-it-somebody-elses-problem-to-correct-the-scientific-literature/ 

 

Research misconduct: Theory & Pratico – For Better Science. (n.d.). Retrieved 17 December 2022, from https://forbetterscience.com/2022/08/31/research-misconduct-theory-pratico/   


Star marine ecologist committed misconduct, university says. (n.d.). Retrieved 17 December 2022, from https://www.science.org/content/article/star-marine-ecologist-committed-misconduct-university-says  


Additions on 18th December: Yet more relevant stuff coming to my attention! 

 

Naudet, Florian (2022) Lecture: Busting two zombie trials in a post-COVID world.   

 

Wilmshurst, Peter (2022) Blog: Has COPE membership become a way for unprincipled journals to buy a fake badge of integrity?


 *Addition on 20th December

Liverpool Medical Institution seminar on Research Integrity: the introduction by Patricia Murray, talk by Peter Wilmshurst, and Q&A are now available on YouTube.


And finally.... 

A couple of sobering thoughts:

 

Alexander Trevelyan on Twitter noted  a great quote from the anonymous @mumumouse (author of Research misconduct blogpost above): “To imagine what it’s like to be a whistleblower in the science community, imagine you are trying to report a Ponzi scheme, but instead of receiving help you are told, nonchalantly, to call Bernie Madoff, if you wish." 

 

Peter Wilmshurst started his talk by relaying a conversation he had had with Patricia Murray in the run-up to the meeting. He said he planned to talk about the 3 Fs: fabrication, falsification and honesty.

 To which Patricia replied, “There is no F in honesty”. 

(This may take a few moments to appreciate).

 

 

Tuesday, 6 December 2022

Biomarkers to screen for autism (again)


Diagnosis of autism from biomarkers is a holy grail for biomedical researchers. The days when it was thought we would find “the autism gene” are long gone, and it’s clear that both the biology and the psychology of autism are highly complex and heterogeneous. One approach is to search for individual genes where mutations are more likely in those with autism. Another is to address the complexity head-on by looking for combinations of biomarkers that could predict who has autism. The latter approach is adopted in a paper by Bao et al (2022), who claimed that an ensemble of gene expression measures taken from blood samples could accurately predict which toddlers were autistic (ASD) and which were typically-developing (TD). An anonymous commenter on PubPeer queried whether the method was as robust as the authors claimed, arguing that there was evidence for “overfitting”. I was asked for my thoughts by a journalist, and they were complicated enough to merit a blogpost. The bottom line is that there are reasons to be cautious about the conclusion of the authors that they have developed “an innovative and accurate ASD gene expression classifier”.

 

Some of the points I raise here applied to a previous biomarker study that I blogged about in 2019. These are general issues about the mismatch between what is done in typical studies in this area and what is needed for a clinically useful screening test.

 

Base rates

Consider first how a screening test might be used. One possibility is that there might be a move towards universal screening, allowing early diagnosis that might help ensure intervention starts young. But for effective screening in that context, you need extremely high diagnostic accuracy, and the predictive value of a positive result depends on the frequency of autism in the population. I discussed this back in 2010. The levels of accurate classification reported by Bao et al would be of no use for population screening because there would be an extremely high rate of false positives, given that most children don’t have autism.
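To see why base rates matter so much, here is a back-of-envelope calculation, sketched in Python. The sensitivity, specificity and prevalence figures are illustrative assumptions of my own, not values from the paper.

```python
# Illustrative calculation (not from the paper): how base rates limit the
# usefulness of a classifier for universal screening. Sensitivity and
# specificity of 0.85 are hypothetical values roughly consistent with an
# AUC-ROC around .80; a prevalence of 2% is an assumed population figure.

def positive_predictive_value(prevalence, sensitivity, specificity):
    """Probability that a child flagged by the screener actually has autism."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

# Compare universal screening (2%) with progressively higher-risk referral groups
for prevalence in (0.02, 0.20, 0.50):
    ppv = positive_predictive_value(prevalence, sensitivity=0.85, specificity=0.85)
    print(f"prevalence {prevalence:.0%}: PPV = {ppv:.2f}")
```

With these assumed numbers, at 2% prevalence only about one in ten positives is a true positive, whereas the same test looks far more useful in a high-risk clinic sample.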

 

Diagnostic specificity

But, you may say, we aren’t talking about universal screening.  The test might be particularly useful for those who either (a) already have an older child with autism, or (b) are concerned about their child’s development.  Here the probability of a positive autism diagnosis is higher than in the general population.  However, if that’s what we are interested in, then we need a different comparison group – not typically-developing toddlers, but unaffected siblings of children with autism, and/or children with other neurodevelopmental disorders.   

When I had a look at the code that the authors deposited for data analysis, it implied that they did have data on children with more general developmental delays, and sibs of those with autism, but they are not reported in this paper. 

 

The analyses done by the researchers are extremely complex and time-consuming, and it is understandable that they may prefer to start out with the clearest case of comparing autism with typically-developing children. But the acid test of the suitability of the classifier for clinical use would be a demonstration that it could distinguish children with autism from unaffected siblings, and from nonautistic children with intellectual disability.

 

Reliability of measures

If you run a diagnostic test, an obvious question is whether you’d get the same result on a second test run.  With biological and psychological measures the answer is almost always no, but the key issue for a screener is just how much change there is. Gene expression levels could vary from occasion to occasion depending on time of day or what you’d eaten – I have no idea how important this might be, but it's not possible to evaluate in this paper, where measures come from a single blood sample. My personal view is that the whole field of biomedical research needs to wake up to the importance of reliability of measurement so that researchers don’t waste time exploring the predictive power of measures that may be too unreliable to be useful.  Information about stability of measures over time is a basic requirement for any diagnostic measure.
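As a rough illustration of why this matters, here is a toy simulation. The reliability values and scores are entirely hypothetical and simulated; nothing here comes from the paper. It simply shows that a fixed cutoff applied to a biomarker with modest test-retest reliability will classify many children differently on a second blood draw.

```python
# Toy simulation (hypothetical numbers, simulated data): if a blood-based
# score has limited test-retest reliability, a child classified as "positive"
# on one draw may well fall on the other side of the cutoff on the next.
import numpy as np

rng = np.random.default_rng(1)
n_children = 1000  # arbitrary simulated sample size

def classification_agreement(reliability):
    """Proportion of children falling on the same side of the cutoff on two occasions."""
    stable = rng.normal(size=n_children)    # trait-like component of the score
    noise1 = rng.normal(size=n_children)    # occasion-specific noise, draw 1
    noise2 = rng.normal(size=n_children)    # occasion-specific noise, draw 2
    day1 = np.sqrt(reliability) * stable + np.sqrt(1 - reliability) * noise1
    day2 = np.sqrt(reliability) * stable + np.sqrt(1 - reliability) * noise2
    cutoff = np.median(day1)                # fixed diagnostic threshold
    return np.mean((day1 > cutoff) == (day2 > cutoff))

for r in (0.9, 0.5, 0.2):
    print(f"test-retest reliability {r}: "
          f"classification agrees on {classification_agreement(r):.0%} of repeat draws")
```

Even with a respectable reliability of .5, roughly a third of simulated children switch classification between occasions, which is why information about stability over time is so important for a diagnostic measure.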

 

A related issue concerns comparability of procedures for autism and TD groups. Were blood samples collected by the same clinicians over the same period and processed in the same lab for these two groups? Were the blood analyses automated and/or done blind? It’s crucial to be confident that minor differences in clinical or lab procedures do not bias results in this kind of study.

 

Overfitting

Overfitting is really just a polite way of saying that a model may be capturing noise rather than signal. If you run enough analyses, something is bound to look significant, just by chance. In the first step of the analysis, the researchers ran 42,840 models on “training” data from 93 autistic and 82 TD children and found that 1,822 of them performed better than .80 on a measure that reflects diagnostic accuracy (AUC-ROC – which roughly corresponds to the proportion correctly classified: .50 is chance, and 1.00 is perfect classification). So we can see that just over 4% of the models (1822/42840) performed this well.

 

The researchers were aware of the possibility of overfitting, and they addressed it head-on, saying: “To test this, we permuted the sample labels (i.e., ASD and TD) for all subjects in our Training set and ran the pipeline to test all feature engineering and classification methods. Importantly, we tested all 42,840 candidate models and found the median AUC-ROC score was 0.5101 with the 95th CI (0.42–0.65) on the randomized samples. As expected, only rare chance instances of good 'classification' occurred.”  The distribution of scores is shown in Figure 2b. 

 

 


Figure 2b from Bao et al (2022)

 

They then ran a further analysis on a “test set” of 34 autistic and 31 TD children who had been held out of the original analysis, and found that 742 of the 1822 models performed better than .80 in classification. That’s 40% of the tested models. Assuming I have understood the methods correctly, that does look meaningful and hard to explain just in terms of statistical noise. In effect, they have run a replication study and found that a substantial subset of the identified models continue to separate autism and TD groups when new children are considered. The claim is that there is substantial overlap between the models falling in the right-hand tails of the red and pink distributions.

 

The PubPeer commenter seems concerned that results look too good to be true. In particular, Figure 2b suggests the models perform a bit better in the test set than in the training set. But the figure shows the distribution of scores for all the models (not just the selected models) and, given the small sample sizes, the differences between distributions do not seem large to me. I was more surprised by the relatively tight distribution of AUC-ROC values obtained in the permutation analysis, as I would have anticipated that some models would have given high classification accuracy just by chance in a sample of this size.

The researchers went on to present data for the set of models that achieved .8 classification in both training and test sets. This seemed a reasonable approach to me. The PubPeer commenter is correct in arguing that there will be some bias caused by selecting models this way, and that one would expect somewhat worse performance in a completely new sample, but the two-stage selection of models would seem to ensure there is not "massive overfitting". I think there would be a problem if only 4% of the 1822 selected models had given accurate classification, but the good rate of agreement between the models selected in the training and test samples, coupled with the lack of good models in the permuted data, suggests there is a genuine effect here. 
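To make the logic concrete, here is a minimal sketch in Python, on simulated data, of the kind of two-stage check discussed above: screen many simple candidate models on a training set, count how many survivors also clear the AUC-ROC threshold on held-out data, and repeat the exercise with permuted labels. This is emphatically not the authors’ pipeline – the data, the “candidate models” (one small logistic regression per feature pair) and all the numbers are invented for illustration – but it shows why survival on a held-out set, combined with a near-empty permutation run, argues against pure overfitting.

```python
# Sketch of the two-stage logic (simulated data, NOT Bao et al.'s pipeline):
# screen candidate models on training data, check survivors on held-out data,
# and compare with what happens when training labels are permuted.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n_subjects, n_features = 240, 300                 # arbitrary simulated dimensions
X = rng.normal(size=(n_subjects, n_features))
y = rng.integers(0, 2, size=n_subjects)
X[y == 1, :10] += 1.0                             # a genuine group difference in 10 features only

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

def count_surviving_models(train_labels, threshold=0.80):
    """Fit one simple classifier per feature pair (a crude stand-in for 'candidate
    models'); count how many exceed the AUC threshold on the training set, and how
    many of those also exceed it on the held-out test set."""
    passed_train = passed_both = 0
    for i in range(0, n_features, 2):
        cols = [i, i + 1]
        clf = LogisticRegression().fit(X_train[:, cols], train_labels)
        if roc_auc_score(train_labels, clf.predict_proba(X_train[:, cols])[:, 1]) > threshold:
            passed_train += 1
            if roc_auc_score(y_test, clf.predict_proba(X_test[:, cols])[:, 1]) > threshold:
                passed_both += 1
    return passed_train, passed_both

print("real labels (passed train, passed both):    ", count_surviving_models(y_train))
print("permuted labels (passed train, passed both):", count_surviving_models(rng.permutation(y_train)))
```

With a real (simulated) group difference, a reasonable share of the models selected on the training set also clear the bar on the test set, whereas with permuted labels essentially nothing survives – the qualitative pattern the authors report.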

 

Conclusion

So, in sum, I think that the results can’t just be attributed to overfitting, but I nevertheless have reservations about whether they would be useful for screening for autism.  And one of the first things I’d check if I were the researchers would be the reliability of the diagnostic classification in repeated blood samples taken on different occasions, as that would need to be high for the test to be of clinical use.

 

Note: I'd welcome comments or corrections on this post. Please note, comments are moderated to avoid spam, and so may not appear immediately. If you post a comment and it has not appeared in 24 hr, please email me and I'll ensure it gets posted. 

 PS. See comment from original PubPeer poster attached. 

Also, 8th Dec 2022, I added a further PubPeer comment asking authors to comment on Figure 2B, which does seem odd. 

https://pubpeer.com/publications/B693366B2B51D143C713359F151F7B#4  


PPS. 10th Dec 2022. 

Author Eric Courchesne has responded to several of the points made in this blogpost on Pubpeer: https://pubpeer.com/publications/B693366B2B51D143C713359F151F7B#5