
Friday, 26 July 2013

Why we need pre-registration


There has been a chorus of disapproval this week at the suggestion that researchers should 'pre-register' their studies with journals and spell out in advance the methods and analyses that they plan to do. Those who wish to follow the debate should look at this critique by Sophie Scott, with associated comments, and the responses to it collated here by Pete Etchells. They should also read the explanation of the pre-registration proposals and FAQ by Chris Chambers - something that many participants in the debate appear not to have done.

Quite simply, pre-registration is designed to tackle two problems in scientific publishing:
  • Bias against publication of null results
  • A failure to distinguish hypothesis-generating (exploratory) from hypothesis-testing analyses
Either of these alone is bad for science: the combined effect of both of them is catastrophic, and has led to a situation where research is failing to do its job in terms of providing credible answers to scientific questions.

Null results

Let's start with the bias against null results. Much has been written about this, including by me. But the heavy guns in the argument have been wielded by Ben Goldacre, who has pointed out that, in the clinical trials field, if we only see the positive findings, then we get a completely distorted view of what works, and as a result, people may die. In my field of psychology, the stakes are not normally as high, but the fact remains that there can be massive distortion in our perception of evidence.

Pre-registration would fix this by guaranteeing publication of a paper regardless of how the results turn out. In fact, there is another, less bureaucratic, way the null result problem could be fixed, and that would be by having reviewers decide on a paper's publishability solely on the basis of the introduction and methods. But that would not fix the second problem.

Blurring the boundaries between exploratory and hypothesis-testing analyses

A big problem is that nearly all data analysis is presented as if it is hypothesis-testing when in fact much of it is exploratory.

In an exploratory analysis, you take a dataset and look at it flexibly to see what's there. Like many scientists, I love exploratory analyses, because you don't know what you will find, and it can be important and exciting. I suspect it is also something that you get better at as you get more experienced, and more able to see the possibilities in the numbers. But my love of exploratory analyses is coupled with a nervousness: whatever you find, you can never be sure it wasn't just a chance result. Perhaps I was lucky in having this brought home to me early in my career, when I had an alphabetically ordered list of stroke patients I was planning to study, and I happened to notice that those with names in the first half of the alphabet had left hemisphere lesions and those with names in the second half had right hemisphere lesions. I even did a chi-square test and found it was highly significant. Clearly this was nonsense, just one of those spurious things that can turn up by chance.

These days it is easy to see how often meaningless 'significant' results occur by running analyses on simulated data - see this blogpost for instance. In my view, all statistics classes should include such exercises.
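For readers who want to try this, here is a minimal Python sketch of the kind of classroom exercise I have in mind (the specifics - 40 patients, a chi-square test, 10,000 simulations - are arbitrary choices for illustration, not taken from any particular study). It generates data in which two attributes are, by construction, completely unrelated, and counts how often a test nevertheless declares the association 'significant':

import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(42)
n_patients = 40
n_sims = 10_000
false_positives = 0

for _ in range(n_sims):
    # Two attributes that cannot possibly be related: which half of the
    # alphabet the surname falls in, and which hemisphere is lesioned.
    name_half = rng.integers(0, 2, n_patients)
    lesion_side = rng.integers(0, 2, n_patients)
    table = np.zeros((2, 2))
    for name, lesion in zip(name_half, lesion_side):
        table[name, lesion] += 1
    _, p, _, _ = chi2_contingency(table, correction=False)  # plain chi-square test
    if p < .05:
        false_positives += 1

print(f"'Significant' results from pure noise: {false_positives / n_sims:.1%}")  # roughly 5%

Roughly 1 in 20 of these entirely meaningless associations comes out 'significant' at p < .05 - which is exactly what the p-value promises, but is easy to forget when the spurious result happens to be an interesting one.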

So you've done your exploratory analysis, got an exciting finding, but are nervous as to whether it is real. What do you do? The answer is that you need a confirmatory study. In the field of genetics, failure to realise this led to several years of stasis, cogently described by Flint et al. (2010). Genetics really highlights the problem, because of the huge numbers of possible analyses that can be conducted. What was quickly learned was that most exciting effects don't replicate. The bar has accordingly been set much higher, and most genetics journals won't consider publishing a genetic association unless replication has been demonstrated (Munafo & Flint, 2011). This is tough, but it has meant that we can now place confidence in genetics results. (It has also had the positive side-effect of encouraging more collaboration between research groups.) Unfortunately, many of those outside the field of genetics are unaware of these developments, and we are seeing increasing numbers of genetic association studies being published in the neuroscience literature, with tiny samples and no replication.

The important point to grasp is that the meaning of a p-value is completely different if it emerges when testing an a priori prediction, compared with when it is found in the course of conducting numerous analyses of a dataset. Here, for instance, are outputs from 15 runs of a 4-way ANOVA on random data, as described here:
[Figure: each row shows the p-values (main effects, then interactions) for one run of a 4-way ANOVA on a new set of random data. A slightly more legible version is available here.]

If I approached a dataset specifically testing the hypothesis that there would be an interaction between group and task, then the chance of a p-value of .05 or less would be 1 in 20 (as can be confirmed by repeating the simulation thousands of times; with only a handful of runs this is harder to see). But if I just looked for significant findings, it's not hard to find something on most of these runs. An exploratory analysis is not without value, but its value is in generating hypotheses that can then be tested in an a priori design.
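For anyone who wants to reproduce this, the sketch below is a rough Python reconstruction of the simulation (my own version, not the original script). It assumes a balanced design with two levels per factor, runs a 4-way ANOVA on pure noise many times, and contrasts the false-positive rate for one pre-specified interaction (A:B, standing in for group by task) with the rate of finding at least one 'significant' effect among the 15 on offer:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(2013)
n_per_cell = 5      # assumed cell size; the point holds for any small n
n_runs = 1000       # number of simulated datasets

# Balanced 2x2x2x2 design: 4 main effects + 11 interactions = 15 effects per run
levels = ['lo', 'hi']
design = pd.DataFrame(
    [(a, b, c, d) for a in levels for b in levels for c in levels for d in levels],
    columns=['A', 'B', 'C', 'D'])
design = pd.concat([design] * n_per_cell, ignore_index=True)

hit_specific = 0    # pre-specified A:B interaction 'significant'
hit_any = 0         # at least one of the 15 effects 'significant'

for _ in range(n_runs):
    df = design.copy()
    df['y'] = rng.normal(size=len(df))    # pure noise: no real effects anywhere
    model = ols('y ~ C(A) * C(B) * C(C) * C(D)', data=df).fit()
    pvals = sm.stats.anova_lm(model, typ=2)['PR(>F)'].dropna()
    hit_specific += pvals['C(A):C(B)'] < .05
    hit_any += (pvals < .05).any()

print(f"Pre-specified interaction 'significant': {hit_specific / n_runs:.1%}")  # about 5%
print(f"At least one effect 'significant':       {hit_any / n_runs:.1%}")       # over 50%

The single pre-specified test behaves itself, coming out 'significant' in about 1 run in 20; trawling the full set of 15 effects turns up something 'significant' in well over half the runs.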

So replication is needed to deal with the uncertainties around exploratory analysis. How does pre-registration fit into the picture? Quite simply, it makes explicit the distinction between hypothesis-generating (exploratory) and hypothesis-testing research, a distinction that is currently completely blurred. As in the example above, if you tell me in advance what hypothesis you are testing, then I can place confidence in the uncorrected statistical probabilities associated with the predicted effects. If you haven't predicted anything in advance, then I can't.

This doesn't mean that the results from exploratory analyses are necessarily uninteresting, untrue, or unpublishable, but it does mean we should interpret them as what they are: hypothesis-generating rather than hypothesis-testing.

I'm not surprised at the outcry against pre-registration. This is mega. It would require most of us to change our behaviour radically. It would turn on its head the criteria used to evaluate findings: well-conducted replication studies, currently often unpublishable, would be seen as important, regardless of their results. On the other hand, it would no longer be possible to report exploratory analyses as if they are hypothesis-testing. In my view, unless we do this we will continue to waste time and precious research funding chasing illusory truths.

References

Flint, J., Greenspan, R. J., & Kendler, K. S. (2010). How Genes Influence Behavior. Oxford University Press.

Munafo, M. R., & Flint, J. (2011). Dissecting the genetic architecture of human personality. Trends in Cognitive Sciences, 15(9), 395-400. doi: 10.1016/j.tics.2011.07.007

Thursday, 21 March 2013

Blogging as post-publication peer review: reasonable or unfair?



In a previous blogpost, I criticised a recent paper claiming that playing action video games improved reading in dyslexics. In a series of comments below the blogpost, two of the authors, Andrea Facoetti and Simone Gori, have responded to my criticisms. I thank them for taking the trouble to spell out their views and giving readers the opportunity to see another point of view. I am, however, not persuaded by their arguments, which make two main points. First, that their study was not methodologically weak and so Current Biology was right to publish it, and second, that it is unfair, and indeed unethical, to criticise a scientific paper in a blog, rather than through the regular scientific channels.
Regarding the study methodology: as I noted in that post, the principal problem with the study by Franceschini et al. was that it was underpowered, with just 10 participants per group. The authors reply with an argument ad populum, i.e. that many other studies have used equally small samples. This is undoubtedly true, but it doesn’t make it right. They dismiss the paper I cited by Christley (2010) on the grounds that it was published in a low-impact journal. But the serious drawbacks of underpowered studies have been known about for years, and written about in high- as well as low-impact journals (see references below).
The response by Facoetti and Gori illustrates the problem I had highlighted. In effect, they are saying that we should believe their result because it appeared in a high-impact journal, and now that it is published, the onus must be on other people to demonstrate that it is wrong. I can appreciate that it must be deeply irritating for them to have me expressing doubt about the replicability of their result, given that their paper passed peer review in a major journal and the results reach conventional levels of statistical significance. But in the field of clinical trials, the non-replicability of large initial effects from small trials has been demonstrated on numerous occasions, using empirical data - see in particular the work of Ioannidis, referenced below. The reasons for this ‘winner’s curse’ have been much discussed, but its reality is not in doubt. This is why I maintain that the paper would not have been published if it had been reviewed by scientists who had expertise in clinical trials methodology. They would have demanded more evidence than this.
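The winner’s curse is easy to demonstrate for oneself. The sketch below is my own illustration (not taken from the Ioannidis papers), with an assumed true effect of d = 0.5: it simulates thousands of two-group trials with 10 participants per group, and shows both how few reach p < .05 and how much those that do overestimate the true effect:

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_per_group = 10      # as in the Franceschini et al. study
true_d = 0.5          # assumed true standardised effect size (illustrative only)
n_trials = 20_000

all_estimates, significant_estimates = [], []
for _ in range(n_trials):
    treated = rng.normal(true_d, 1, n_per_group)
    control = rng.normal(0.0, 1, n_per_group)
    t, p = ttest_ind(treated, control)
    pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)
    d_hat = (treated.mean() - control.mean()) / pooled_sd   # this trial's effect estimate
    all_estimates.append(d_hat)
    if p < .05:
        significant_estimates.append(d_hat)

print(f"Trials reaching p < .05 (i.e. power): {len(significant_estimates) / n_trials:.1%}")
print(f"Mean estimated effect, all trials:           {np.mean(all_estimates):.2f}")
print(f"Mean estimated effect, 'significant' trials: {np.mean(significant_estimates):.2f}")

With these numbers, fewer than one trial in five reaches significance, and the trials that do suggest an effect roughly twice as large as the one that was actually simulated. A literature that publishes only the ‘significant’ small trials therefore systematically exaggerates how well treatments work.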
The response by the authors highlights another issue: now that the paper has been published, the expectation is that anyone who has doubts, such as me, should be responsible for checking the veracity of the findings. As we say in Britain, I should put up or shut up. Indeed, I could try to get a research grant to do a further study. However, my local ethics committee would probably not allow me to run a study on such a small sample; it would take a year or so to complete, and it would distract me from my other research. Given that I have reservations about the likelihood of a positive result, this is not an attractive option. My view is that journal editors should have recognised this as a pilot study and asked the authors to do a more extensive replication, rather than dashing into print on the basis of such slender evidence. In publishing this study, Current Biology has created a situation where other scientists must now spend time and resources to establish whether the results hold up.
To see just how damaging this can be, consider the case of the FastForword intervention, developed on the basis of a small trial initially reported in Science in 1996. After the Science paper, the authors went directly into commercialisation of the intervention, and subsequently reported only uncontrolled trials. It took until 2010 for there to be enough reasonably-sized independent randomised controlled trials to evaluate the intervention properly in a meta-analysis, at which point it was concluded that it had no beneficial effect. By this time, tens of thousands of children had been through the intervention, and hundreds of thousands of research dollars had been spent on studies evaluating FastForword.
I appreciate that those reporting exciting findings from small trials are motivated by the best of intentions – to tell the world about something that seems to help children. But the reality is that, if the initial trial is not adequately powered, it can be detrimental both to science and to the children it is designed to help, by giving such an imprecise and uncertain estimate of the effectiveness of treatment.
Finally, a word on whether it is fair to comment on a research article in a blog, rather than going through the usual procedure of submitting an article to a journal and having it peer-reviewed prior to publication. The authors’ reactions to my blogpost are reminiscent of Felicia Wolfe-Simon’s response to blog-based criticisms of a paper she published in Science: “The items you are presenting do not represent the proper way to engage in a scientific discourse”. Unlike Wolfe-Simon, who simply refused to engage with bloggers, Facoetti and Gori show willingness to discuss matters further and present their side of the story, but it is nevertheless clear that they do not regard a blog as an appropriate place to debate scientific studies.
I could not disagree more. As was readily demonstrated in the Wolfe-Simon case, what has come to be known as ‘post-publication peer review’ via the blogosphere can allow for new research to be rapidly discussed and debated in a way that would be quite impossible via traditional journal publishing. In addition, it brings the debate to the attention of a much wider readership. Facoetti and Gori feel I have picked on them unfairly: in fact, I found out about their paper because I was asked for my opinion by practitioners who worked with dyslexic children. They felt the results from the Current Biology study sounded too good to be true, but they could not access the paper from behind its paywall, and in any case they felt unable to evaluate it properly. I don’t enjoy criticising colleagues, but I feel that it is entirely proper for me to put my opinion out in the public domain, so that this broader readership can hear a different perspective from those put out in the press releases. And the value of blogging is that it does allow for immediate reaction, both positive and negative. I don’t censor comments, provided they are polite and on-topic, so my readers have the opportunity to read the reaction of Facoetti and Gori. 
I should emphasise that I have no axe to grind with the study's authors, whom I do not know personally. I’d be happy to revise my opinion if convincing arguments are put forward, but I think it is important that this discussion takes place in the public domain, because the issues it raises go well beyond this specific study.

References
Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafo, M. R. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, advance online publication. doi: 10.1038/nrn3475
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124. doi: 10.1371/journal.pmed.0020124
Ioannidis, J. P. (2008). Why most discovered true associations are inflated. Epidemiology, 19(5), 640-648.
Ioannidis, J. P., Pereira, T. V., & Horwitz, R. I. (2013). Emergence of large treatment effects from small trials - reply. JAMA, 309(8), 768-769. PMID: 23443435