Showing posts with label publication bias.

Sunday, 24 March 2024

Just make it stop! When will we say that further research isn't needed?

 

I have a lifelong interest in laterality, which is a passion that few people share. Accordingly, I am grateful to René Westerhausen who runs the Oslo Virtual Laterality Colloquium, with monthly presentations on topics as diverse as chiral variation in snails and laterality of gesture production.

On Friday we had a great presentation from Lottie Anstee who told us about her Masters project on handedness and musicality. There have been various studies on this topic over the years, some claiming that left-handers have superior musical skills, but samples have been small and results have been mixed. Lottie described a study of an impressively large sample (nearly 3000 children aged 10-18 years) whose musical abilities were evaluated with a detailed music assessment battery that included self-report and perceptual measures. The result was convincingly null, with no handedness effect on musicality.

What happened next was what always happens in my experience when someone reports a null result. The audience made helpful suggestions for reasons why the result had not been positive and suggested modifications of the sampling, measures or analysis that might be worth trying. The measure of handedness was, as Lottie was the first to admit, very simple - perhaps a more nuanced measure would reveal an association? Should the focus be on skilled musicians rather than schoolchildren? Maybe it would be worth looking at nonlinear rather than linear associations? And even though the music assessment was pretty comprehensive, maybe it missed some key factor - amount of music instruction, or experience of specific instruments. 

After a bit of to and fro, I asked the question that always bothers me. What evidence would we need to convince us that there is really no association between musicality and handedness? The earliest study that Lottie reviewed was from 1922, so we've had over 100 years to study this topic. Shouldn't there be some kind of stop rule? This led to an interesting discussion about the impossibility of proving a negative and whether we should be using Bayes Factors, and what would be the smallest effect size of interest.  
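
For readers who like to see the machinery, here is a minimal sketch in Python of how a Bayes factor could be used to quantify support for the null. The data are simulated stand-ins (not Lottie's), and the BIC approximation used here (Wagenmakers, 2007) is only one of several ways to obtain such a factor:

```python
import numpy as np

rng = np.random.default_rng(1)
right = rng.normal(100, 15, 2500)  # hypothetical musicality scores, right-handers
left = rng.normal(100, 15, 300)    # hypothetical musicality scores, left-handers

scores = np.concatenate([right, left])
n = scores.size

# Residual sums of squares for a null model (one common mean) and an
# alternative model (separate means for each handedness group)
ss_null = np.sum((scores - scores.mean()) ** 2)
ss_alt = np.sum((right - right.mean()) ** 2) + np.sum((left - left.mean()) ** 2)

# BIC = n*log(SS/n) + k*log(n), up to constants that cancel in the difference
bic_null = n * np.log(ss_null / n) + 2 * np.log(n)  # parameters: mean, variance
bic_alt = n * np.log(ss_alt / n) + 3 * np.log(n)    # parameters: two means, variance

# BF01 > 1 favours the null; by convention, values above 3 count as
# 'moderate' evidence that handedness makes no difference
bf01 = np.exp((bic_alt - bic_null) / 2)
print(f"Approximate BF01 = {bf01:.1f}")
```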

My own view is that further investigation of this association would prove fruitless. In part, this is because I think the old literature (and to some extent the current literature!) on factors associated with handedness is at particular risk of bias, so even the messy results from a meta-analysis are likely to be over-optimistic. More than 30 years ago, I pointed out that laterality research is particularly susceptible to what we now call p-hacking - post hoc selection of cut-offs and criteria for forming subgroups, which dramatically increase the chances of finding something significant. In addition, I noted that measurement of handedness by questionnaire is simple enough to be included in a study as a "bonus factor", just in case something interesting emerges. This increases the likelihood that the literature will be affected by publication bias - the handedness data will be reported if a significant result is obtained, but otherwise can be disregarded at little cost. So I suspect that most of the exciting ideas about associations between handedness and cognitive or personality traits are built on shaky foundations, and would not replicate if tested in well-powered, preregistered studies.  But somehow, the idea that there is some kind of association remains alive, even if we have a well-designed study that gives a null result.  

Laterality is not the only area where there is no apparent stop rule. I've complained of similar trends in studies of association between genetic variants and psychological traits, for instance, where instead of abandoning an idea after a null study, researchers slightly change the methods and try again. In 2019, Lisa Feldman Barrett wrote amusingly about zombie ideas in psychology, noting that some theories are so attractive that they seem impossible to kill. I hope that as preregistration becomes more normative, we may see more null results getting published, and learn to appreciate their value. But I wonder just what it takes to get people to conclude that a research seam has been mined to the point of exhaustion. 


Monday, 4 September 2023

Polyunsaturated fatty acids and children's cognition: p-hacking and the canonisation of false facts

One of my favourite articles is a piece by Nissen et al (2016) called "Publication bias and the canonization of false facts". In it, the authors model how false information can masquerade as overwhelming evidence, if, over cycles of experimentation, positive results are more likely to be published than null ones. But their article is not just about publication bias: they go on to show how p-hacking magnifies this effect, because it leads to a false positive rate that is much higher than the nominal rate (typically .05).
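
The logic is easy to simulate. The toy sketch below is not Nissen et al's actual model, but it captures the core mechanism: a claim that is in fact false is tested repeatedly, positive results are always published while most null results stay in the file drawer, and readers update their belief as if every published study were an unbiased test at the nominal alpha level:

```python
import numpy as np

rng = np.random.default_rng(42)

alpha_nominal = 0.05       # false positive rate readers assume each study has
alpha_actual = 0.30        # actual false positive rate once p-hacking is allowed
power = 0.80               # power each study would have if the claim were true
pub_prob_negative = 0.20   # probability a null result gets published at all

odds = 1 / 10              # readers start out sceptical: 10 to 1 against the claim

for study in range(20):
    # The claim is actually false, so any 'positive' result is a false positive
    positive = rng.random() < alpha_actual
    published = positive or (rng.random() < pub_prob_negative)
    if not published:
        continue  # the file drawer swallows the null result
    if positive:
        odds *= power / alpha_nominal              # naive update on a published positive
    else:
        odds *= (1 - power) / (1 - alpha_nominal)  # naive update on a published null
    belief = odds / (1 + odds)
    print(f"study {study + 1:2d}: published a {'positive' if positive else 'null'} result, "
          f"belief that the claim is true = {belief:.2f}")
```

With these made-up settings, a handful of spurious positives is enough to push the community's belief close to certainty: the false fact is canonised.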

I was reminded of this when looking at some literature on polyunsaturated fatty acids and children's cognition. This was a topic I'd had a passing interest in years ago when fish oil was being promoted for children with dyslexia and ADHD. I reviewed the literature back in 2008 for a talk at the British Dyslexia Association (slides here). What was striking then was that, whilst there were studies claiming positive effects of dietary supplements, they all obtained different findings. It looked suspicious to me, as if authors would keep looking in their data, and divide it up every way possible, in order to find something positive to report – in other words, p-hacking seemed rife in this field.

My interest in this area was piqued more recently simply because I was looking at articles that had been flagged up because they contained "tortured phrases". These are verbal expressions that seem to have been selected to avoid plagiarism detectors: they are often unintentionally humorous, because attempts to generate synonyms misfire. For instance, in this article by Khalid et al, published in Taylor and Francis' International Journal of Food Properties we are told: 

"Parkinson’s infection is a typical neurodegenerative sickness. The mix of hereditary and natural variables might be significant in delivering unusual protein inside explicit neuronal gatherings, prompting cell brokenness and later demise" 

And, regarding autism: 

"Chemical imbalance range problem is a term used to portray various beginning stage social correspondence issues and tedious sensorimotor practices identified with a solid hereditary part and different reasons."

The paper was interesting, though, for another reason. It contained a table summarising results from ten randomized controlled trials of polyunsaturated fatty acid supplementation in pregnant women and young children. This was not a systematic review, and it was unclear how the studies had been selected. As I documented on PubPeer,  there were errors in the descriptions of some of the studies, and the interpretation was superficial. But as I checked over the studies, I was also struck by the fact that all studies concluded with a claim of a positive finding, even when the planned analyses gave null results. But, as with the studies I'd looked at in 2008, no two studies found the same thing. All the indicators were that this field is characterised by a mixture of p-hacking and hype, which creates the impression that the benefits of dietary supplementation are well-established, when a more dispassionate look at the evidence suggests considerable scepticism is warranted.

Three questionable research practices were prominent. The first is testing a large number of 'primary research outcomes' without any correction for multiple comparisons. Three of the papers cited by Khalid did this; they are marked with "hmm" in the Main result column of Table 1 below. Two of them explicitly argued against using a method such as Bonferroni correction:

"Owing to the exploratory nature of this study, we did not wish to exclude any important relationships by using stringent correction factors for multiple analyses, and we recognised the potential for a type 1 error." (Dunstan et al, 2008)

"Although multiple comparisons are inevitable in studies of this nature, the statistical corrections that are often employed to address this (e.g. Bonferroni correction) infer that multiple relationships (even if consistent and significant) detract from each other, and deal with this by adjustments that abolish any findings without extremely significant levels (P values). However, it has been validly argued that where there are consistent, repeated, coherent and biologically plausible patterns, the results ‘reinforce’ rather than detract from each other (even if P values are significant but not very large)" (Meldrum et al, 2012)

While it is correct that Bonferroni correction is overconservative with correlated outcome measures, there are other methods for protecting the analysis from inflated type I error that should be applied in such cases (Bishop, 2023).
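
By way of illustration, here is what one standard alternative looks like in practice: the Holm step-down procedure, which controls the familywise error rate under any pattern of correlation among outcomes while being uniformly less conservative than Bonferroni. The p-values below are made up, and I am not suggesting this is the only acceptable option:

```python
from statsmodels.stats.multitest import multipletests

# Ten made-up p-values from ten outcome measures in a single trial
p_values = [0.003, 0.020, 0.041, 0.049, 0.110, 0.240, 0.380, 0.520, 0.700, 0.940]

# Holm's step-down procedure: controls the familywise error rate under
# arbitrary dependence between the tests
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method='holm')

for p, p_adj, sig in zip(p_values, p_adjusted, reject):
    print(f"raw p = {p:.3f}   adjusted p = {p_adj:.3f}   significant: {sig}")
```

Run on these numbers, only the smallest p-value survives correction: the three raw values just under .05 would all have been trumpeted as 'significant' under the uncorrected approach.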

The second practice is conducting subgroup analyses: the initial analysis finds nothing, so a way is found to divide up the sample to find a subgroup that does show the effect. There is a nice paper by Peto that explains the dangers of doing this. The third practice is looking for correlations between variables rather than main effects of intervention: with enough variables, it is always possible to find something 'significant' if you don't apply any correction for multiple comparisons. This inflation of false positives by correlational analysis is a well-recognised problem in the field of neuroscience (e.g. Vul et al., 2008).

Given that such practices were normative in my own field of psychology for many years, I suspect that those who adopt them here are unaware of how serious a risk they run of finding spurious positive results. For instance, if you compare two groups on ten unrelated outcome measures, then the probability that something will give you a 'significant' p-value below .05 is not 5% but 40%. (The probability that none of the 10 results is significant is .95^10, which is .6. So the probability that at least one is below .05 is 1-.6 = .4). Dividing a sample into subgroups in the hope of finding something 'significant' is another way to multiply the rate of false positive findings. 
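
The arithmetic is easy to verify, and the same formula, 1 - .95^k, applies to any number k of independent outcome measures; a quick check by simulation:

```python
import numpy as np

rng = np.random.default_rng(0)
for k in (1, 3, 5, 10, 20):
    analytic = 1 - 0.95 ** k
    # Monte Carlo check: many 'studies', each testing k true-null outcomes,
    # whose p-values are uniform on [0, 1] under the null hypothesis
    p = rng.uniform(size=(100_000, k))
    simulated = np.mean((p < 0.05).any(axis=1))
    print(f"{k:2d} outcomes: analytic = {analytic:.2f}, simulated = {simulated:.2f}")
```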

In many fields, p-hacking is virtually impossible to detect because authors will selectively report their 'significant' findings, so the true false positive rate can't be estimated. In randomised controlled trials, the situation is a bit better, provided the study has been registered on a trial registry – this is now standard practice, precisely because it's recognised as an important way to avoid, or at least increase detection of, analytic flexibility and outcome switching. Accordingly, I catalogued, for the 10 studies reviewed by Khalid et al, how many found a significant effect of intervention on their planned, primary outcome measure, and how many focused on other results. The results are depressing. Flexible analyses are universal. Some authors emphasised the provisional nature of findings from exploratory analyses, but many did not. And my suspicion is that, even if the authors add a word of caution, those citing the work will ignore it.  


Table 1: Reporting outcomes for 10 studies cited by Khalid et al (2022)

Khalid #   Register   N      Main result*   Subgrp   Correlatn   Abs -ve   Abs +ve
41         yes        86     NS             yes      no          no        yes
42         no         72     hmm            no       no          no        yes
43         no         420    hmm            no       no          yes       yes
44         yes        90     NS             no       yes         yes       yes
45         no         90     yes            no       yes         NA        yes
46         yes        150    hmm            no       no          yes       yes
47         yes        175    NS             no       yes         yes       yes
48         no         107    NS             yes      no          yes       yes
49         yes        1094   NS             yes      no          yes       yes
50         no         27     yes            no       no          yes       yes

Key: Main result coded as NS (nonsignificant), yes (significant) or hmm (not significant if Bonferroni corrected); Subgrp and Correlatn coded yes or no depending on whether post hoc subgroup or correlational analyses conducted. Abs -ve coded yes if negative results reported in abstract, no if not, and NA if no negative results obtained. Abs +ve coded yes if positive results mentioned in abstract.

I don't know if the Khalid et al review will have any effect – it is so evidently flawed that I hope it will be retracted. But the problems it reveals are not just a feature of the odd rogue review: there is a systemic problem with this area of science, whereby the desire to find positive results, coupled with questionable research practices and publication bias, has led to the construction of a huge edifice of evidence based on extremely shaky foundations. The resulting waste in researcher time and funding that comes from pursuing phantom findings is a scandal that can only be addressed by researchers prioritising rigour, honesty and scholarship over fast and flashy science.

Wednesday, 26 October 2011

Accentuate the negative

Suppose you run a study to compare two groups of children: say a dyslexic group and a control group. Your favourite theory predicts a difference in auditory perception, but you find no difference between the groups. What to do? You may feel a further study is needed: perhaps there were floor or ceiling effects that masked true differences. Maybe you need more participants to detect a small effect. But what if you can’t find flaws in the study and decide to publish the result? You’re likely to hit problems. Quite simply, null results are much harder to publish than positive findings. In effect, you are telling the world “Here’s an interesting theory that could explain dyslexia, but it’s wrong.” It’s not exactly an inspirational message, unless the theory is so prominent and well-accepted that the null finding is surprising. And if that is the case, then it’s unlikely that your single study is going to be convincing enough to topple the status quo. It has been recognised for years that this “file drawer problem” leads to distortion of the research literature, creating an impression that positive results are far more robust than they really are (Rosenthal, 1979).
The medical profession has become aware of the issue and it’s now becoming common practice for clinical trials to be registered before a study commences, and for journals to undertake to publish the results of methodologically strong studies regardless of outcome. In the past couple of years, two early-intervention studies with null results have been published, on autism (Green et al, 2010) and late talkers (Wake et al, 2011). Neither study creates a feel-good sensation: it’s disappointing that so much effort and good intentions failed to make a difference. But it’s important to know that, to avoid raising false hopes and wasting scarce resources on things that aren’t effective. Yet it’s unlikely that either study would have found space in a high-impact journal in the days before trial registration.
Registration can also exert an important influence in cases where conflict of interest or other factors make researchers reluctant to publish null results. For instance, in 2007, Cyhlarova et al published a study relating membrane fatty acid levels to dyslexia in adults. This research group has a particular interest in fatty acids and neurodevelopmental disabilities, and the senior author has written a book on this topic. The researchers argued that the balance of omega 3 and omega 6 fatty acids differed between dyslexics and non-dyslexics, and concluded: “To gain a more precise understanding of the effects of omega-3 HUFA treatment, the results of this study need to be confirmed by blood biochemical analysis before and after supplementation”. They further stated that a randomised controlled trial was underway. Yet four years later, no results have been published and requests for information about the findings are met with silence. If the trial had been registered, the authors would have been required to report the results, or explain why they could not do so.
Advance registration of research is not a feasible option for most areas of psychology, so what steps can we take to reduce publication bias? Many years ago a wise journal editor told me that publication decisions should be based on evaluation of just the Introduction and Methods sections of a paper: if an interesting hypothesis had been identified, and the methods were appropriate to test it, then the paper should be published, regardless of the results.
People often respond to this idea by saying that it would just mean the literature would be full of boring stuff. But remember, I'm not suggesting that any old rubbish should get published: a good case for doing the study has to be made in the Introduction, and the Methods have to be strong. Also, some kinds of boring results are important: minimally, publication of a null result may save some hapless graduate student from spending three years trying to demonstrate an effect that's not there. Estimates of effect sizes in meta-analyses are compromised if only positive findings get reported. More seriously, if we are talking about research with clinical implications, then over-estimation of effects can lead to inappropriate interventions being adopted.
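
To see how badly a meta-analysis can be skewed, here is a toy simulation (made-up numbers, not based on any real dataset): the true effect is tiny, but if only studies reaching p < .05 in the expected direction get written up, the average published effect is several times larger than the truth:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_per_group, n_studies, true_d = 20, 2000, 0.1   # a genuinely tiny effect

all_d, published_d = [], []
for _ in range(n_studies):
    treated = rng.normal(true_d, 1, n_per_group)
    control = rng.normal(0.0, 1, n_per_group)
    t, p = stats.ttest_ind(treated, control)
    d = (treated.mean() - control.mean()) / np.sqrt(
        (treated.var(ddof=1) + control.var(ddof=1)) / 2)  # Cohen's d
    all_d.append(d)
    if p < 0.05 and d > 0:  # only positive, 'significant' results reach the journals
        published_d.append(d)

print(f"true effect: d = {true_d}")
print(f"mean d across all studies run:   {np.mean(all_d):.2f}")
print(f"mean d across published studies: {np.mean(published_d):.2f} "
      f"({len(published_d)} of {n_studies} published)")
```
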
Things are slowly changing and it’s getting easier to publish null results. The advent of electronic journals has made a big difference because there is no longer such pressure on page space. The electronic journal PLOS One adopts a publication policy that is pretty close to that proposed by the wise editor: they state they will publish all papers that are technically sound. So my advice to those of you who have null data from well-designed experiments languishing in that file drawer: get your findings out there in the public domain.

References

Cyhlarova, E., Bell, J., Dick, J., MacKinlay, E., Stein, J., & Richardson, A. (2007). Membrane fatty acids, reading and spelling in dyslexic and non-dyslexic adults. European Neuropsychopharmacology, 17(2), 116-121. DOI: 10.1016/j.euroneuro.2006.07.003

Green, J., Charman, T., McConachie, H., Aldred, C., Slonims, V., Howlin, P., Le Couteur, A., Leadbitter, K., Hudry, K., Byford, S., Barrett, B., Temple, K., Macdonald, W., & Pickles, A. (2010). Parent-mediated communication-focused treatment in children with autism (PACT): a randomised controlled trial. The Lancet, 375(9732), 2152-2160. DOI: 10.1016/S0140-6736(10)60587-9

Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86(3), 638-641. DOI: 10.1037/0033-2909.86.3.638

Wake, M., Tobin, S., Girolametto, L., Ukoumunne, O. C., Gold, L., Levickis, P., Sheehan, J., Goldfeld, S., & Reilly, S. (2011). Outcomes of population based language promotion for slow to talk toddlers at ages 2 and 3 years: Let's Learn Language cluster randomised controlled trial. BMJ, 343. PMID: 21852344