One of my favourite articles is a piece by Nissen et al (2016) called "Publication bias and the canonization of false facts". In it, the authors model how false information can masquerade as overwhelming evidence, if, over cycles of experimentation, positive results are more likely to be published than null ones. But their article is not just about publication bias: they go on to show how p-hacking magnifies this effect, because it leads to a false positive rate that is much higher than the nominal rate (typically .05).
I was reminded of this when looking at some literature on polyunsaturated fatty acids and children's cognition. This was a topic I'd had a passing interest in years ago when fish oil was being promoted for children with dyslexia and ADHD. I reviewed the literature back in 2008 for a talk at the British Dyslexia Association (slides here). What was striking then was that, whilst there were studies claiming positive effects of dietary supplements, no two of them obtained the same findings. It looked suspicious to me, as if authors kept combing through their data, dividing it up every way possible, in order to find something positive to report – in other words, p-hacking seemed rife in this field.
My interest in this area was piqued more recently simply because I was looking at articles that had been flagged up because they contained "tortured phrases". These are verbal expressions that seem to have been selected to avoid plagiarism detectors: they are often unintentionally humorous, because attempts to generate synonyms misfire. For instance, in this article by Khalid et al, published in Taylor and Francis' International Journal of Food Properties, we are told:
"Parkinson’s infection is a typical neurodegenerative sickness. The mix of hereditary and natural variables might be significant in delivering unusual protein inside explicit neuronal gatherings, prompting cell brokenness and later demise"
And, regarding autism:
"Chemical imbalance range problem is a term used to portray various beginning stage social correspondence issues and tedious sensorimotor practices identified with a solid hereditary part and different reasons."
The paper was interesting, though, for another reason. It contained a table summarising results from ten randomized controlled trials of polyunsaturated fatty acid supplementation in pregnant women and young children. This was not a systematic review, and it was unclear how the studies had been selected. As I documented on PubPeer, there were errors in the descriptions of some of the studies, and the interpretation was superficial. But as I checked over the studies, I was also struck by the fact that all studies concluded with a claim of a positive finding, even when the planned analyses gave null results. But, as with the studies I'd looked at in 2008, no two studies found the same thing. All the indicators were that this field is characterised by a mixture of p-hacking and hype, which creates the impression that the benefits of dietary supplementation are well-established, when a more dispassionate look at the evidence suggests considerable scepticism is warranted.
Three questionable research practices were prominent. The first is testing a large number of 'primary research outcomes' without any correction for multiple comparisons. Three of the papers cited by Khalid did this, and they are marked in Table 1 below with "hmm" in the main analysis column. Two of them argued against using a method such as Bonferroni correction:
"Owing to the exploratory nature of this study, we did not wish to exclude any important relationships by using stringent correction factors for multiple analyses, and we recognised the potential for a type 1 error." (Dunstan et al, 2008)
"Although multiple comparisons are inevitable in studies of this nature, the statistical corrections that are often employed to address this (e.g. Bonferroni correction) infer that multiple relationships (even if consistent and significant) detract from each other, and deal with this by adjustments that abolish any findings without extremely significant levels (P values). However, it has been validly argued that where there are consistent, repeated, coherent and biologically plausible patterns, the results ‘reinforce’ rather than detract from each other (even if P values are significant but not very large)" (Meldrum et al, 2012)

While it is correct that Bonferroni correction is overconservative with correlated outcome measures, there are other methods for protecting the analysis from inflated type I error that should be applied in such cases (Bishop, 2023).
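One less conservative alternative to Bonferroni is the Holm step-down procedure, which controls the familywise error rate while rejecting at least as many hypotheses as Bonferroni. A minimal sketch in Python (the function name and interface are my own, for illustration; in practice one would use a library implementation such as statsmodels' `multipletests`):

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down correction.

    Returns a list of booleans, in the original order of p_values,
    indicating which null hypotheses are rejected at familywise alpha.
    """
    m = len(p_values)
    # Sort the p-values, remembering their original positions
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # Compare the (rank+1)-th smallest p-value against alpha / (m - rank)
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values also fail
    return reject
```

For example, with p-values [0.01, 0.04, 0.03, 0.005] the procedure rejects the first and fourth hypotheses but not the second and third, because after the two smallest p-values pass their thresholds, 0.03 exceeds its threshold of .05/2 = .025 and the procedure stops.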
The second practice is conducting subgroup analyses: the initial analysis finds nothing, so a way is found to divide up the sample into a subgroup that does show the effect. There is a nice paper by Peto that explains the dangers of doing this. The third practice is looking for correlations between variables rather than main effects of intervention: with sufficient variables, it is always possible to find something 'significant' if you don't employ any correction for multiple comparisons. This inflation of false positives by correlational analysis is a well-recognised problem in the field of neuroscience (e.g. Vul et al., 2008).
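The correlation-fishing problem is easy to demonstrate by simulation. The sketch below (sample sizes, variable counts and function names are my own choices, purely for illustration) generates pure-noise data and counts how often at least one of 20 irrelevant variables correlates 'significantly' with an outcome at the conventional .05 level:

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def fishing_expedition(n_subjects=30, n_vars=20, n_sims=2000, seed=1):
    """Proportion of simulated studies in which at least one of n_vars
    pure-noise predictors correlates 'significantly' (p < .05, two-tailed)
    with a pure-noise outcome."""
    rng = random.Random(seed)
    t_crit = 2.048  # approximate two-tailed .05 critical t for df = 28
    hits = 0
    for _ in range(n_sims):
        outcome = [rng.gauss(0, 1) for _ in range(n_subjects)]
        for _ in range(n_vars):
            predictor = [rng.gauss(0, 1) for _ in range(n_subjects)]
            r = pearson_r(outcome, predictor)
            t = r * math.sqrt((n_subjects - 2) / (1 - r * r))
            if abs(t) > t_crit:
                hits += 1
                break  # the 'significant' finding gets written up
    return hits / n_sims
```

With 20 independent noise variables, the expected hit rate is roughly 1 − .95^20 ≈ .64: nearly two out of three such studies will have something 'significant' to report, even though nothing real is going on.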
Given that such practices were normative in my own field of psychology for many years, I suspect that those who adopt them here are unaware of how serious a risk they run of finding spurious positive results. For instance, if you compare two groups on ten unrelated outcome measures, then the probability that something will give you a 'significant' p-value below .05 is not 5% but 40%. (The probability that none of the 10 results is significant is .95^10, which is about .6. So the probability that at least one is below .05 is 1 − .6 = .4). Dividing a sample into subgroups in the hope of finding something 'significant' is another way to multiply the rate of false positive findings.
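The arithmetic above generalises to any number of independent tests. A one-line check (the function name is mine, for illustration):

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false positive across n_tests
    independent tests, each run at significance level alpha."""
    return 1 - (1 - alpha) ** n_tests
```

For 10 outcome measures this gives 1 − .95^10 ≈ .40, as in the example above; for 20 it is already about .64. The assumption of independence is a simplification: correlated outcomes inflate the rate somewhat less, which is precisely why Bonferroni is conservative in that case.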
In many fields, p-hacking is virtually impossible to detect because authors will selectively report their 'significant' findings, so the true false positive rate can't be estimated. In randomised controlled trials, the situation is a bit better, provided the study has been registered on a trial registry – this is now standard practice, precisely because it's recognised as an important way to avoid, or at least increase detection of, analytic flexibility and outcome switching. Accordingly, I catalogued, for the 10 studies reviewed by Khalid et al, how many found a significant effect of intervention on their planned, primary outcome measure, and how many focused on other results. The results are depressing. Flexible analyses are universal. Some authors emphasised the provisional nature of findings from exploratory analyses, but many did not. And my suspicion is that, even if the authors add a word of caution, those citing the work will ignore it.
Table 1: Reporting outcomes for 10 studies cited by Khalid et al (2022)
| Khalid # | Register | N | Main result* | Subgrp | Correlatn | Abs -ve | Abs +ve |
|---|---|---|---|---|---|---|---|
Key: Main result coded as NS (nonsignificant), yes (significant) or hmm (not significant if Bonferroni corrected); Subgrp and Correlatn coded yes or no depending on whether post hoc subgroup or correlational analyses conducted. Abs -ve coded yes if negative results reported in abstract, no if not, and NA if no negative results obtained. Abs +ve coded yes if positive results mentioned in abstract.
I don't know if the Khalid et al review will have any effect – it is so evidently flawed that I hope it will be retracted. But the problems it reveals are not just a feature of the odd rogue review: there is a systemic problem with this area of science, whereby the desire to find positive results, coupled with questionable research practices and publication bias, has led to the construction of a huge edifice of evidence based on extremely shaky foundations. The resulting waste in researcher time and funding that comes from pursuing phantom findings is a scandal that can only be addressed by researchers prioritising rigour, honesty and scholarship over fast and flashy science.