Tuesday 22 March 2016

Better control of the publication time-line: A further benefit of Registered Reports

I’ve blogged previously about waste in science. There are numerous studies that are completed but never see the light of day. When I wrote about this previously, I focused on issues such as reluctance of journals to publish null results, and the problem of writing up a study while applying for the next new grant. But here I want to focus on another factor: the protracted and unpredictable process of peer review that can lead researchers to just give up on a paper.

Sample Gantt chart. Source: http://www.crp.kk.usm.my/pages/jepem.htm
The sample Gantt chart above nicely illustrates a typical scenario. Let's suppose we have a postdoc with 30 months’ funding. Amazingly, she is not held up by patient recruitment issues, or ethics approvals, and everything goes according to plan, so 24 months in, she writes up the study and submits it to a journal. At the same time, she may be applying for further funding or positions. She may plan to start a family at the end of her fellowship. Depending on her area of study it may take anything from two weeks to six months to hear back from the journal*. The decision is likely to be revise and resubmit. If she’s lucky, she’ll be able to do the revisions and get the paper accepted to coincide with the end of her fellowship. All too often, though, the revisions the reviewers suggest are more demanding. If she's very unlucky they may demand additional experiments, which she has no funding for. If they just want changes to the text, that's usually do-able, but often they will suggest further analyses that take time, and she may only get to the point of resubmitting the manuscript when her money runs out. Then the odds are that the paper will go back to the reviewers – or even to new reviewers – who now have further ideas of how the paper can be improved. But now our researcher might have started a new job, have just given birth, or be unemployed and desperately applying for further funds.

The thing about this scenario, which will be all too familiar to seasoned researchers (see a nice example here), is that it is totally unpredictable. Your paper may be accepted quickly, or it may get endlessly delayed. The demands of the reviewers may involve another six months’ work on the paper, at a point when the researcher just doesn’t have the time. I’ve seen dedicated, hardworking, enthusiastic young researchers completely ground down by this situation, faced by the choice of either abandoning a project that has consumed a huge amount of energy and money, or somehow creating time out of thin air. It’s particularly harsh on those who are naturally careful and obsessive, who will be unhappy at the idea of doing a quick and dirty fix to just get the paper out. That paper which started out as their pride and joy, representing their best efforts over a period of years, is now reduced to a millstone around the neck.

But there is an alternative. I’ve recently, with a graduate student, Hannah Hobson, put my toe in the waters of Registered Reports, with a paper submitted to Cortex looking at an electrophysiological phenomenon known as mu suppression. The key difference from the normal publication route is that the paper is reviewed before the study is conducted, on the basis of an introduction and protocol detailing the methods and analysis plan. This, of course, takes time – reviewing always does. But if and when the paper is approved by reviewers, it is provisionally accepted for publication, provided the researchers do what they said they would.

One advantage of this process is that, after you have provisional acceptance of the submission, the timing is largely under your own control. Before the study is done, the introduction and methods are already written up, and so once the study is done, you just add the results and discussion. You are not prohibited from doing additional analyses that weren’t pre-registered, but they are clearly identified as such. Once the study is written up, the paper goes back to reviewers. They may make further suggestions for improving the paper, but what they can’t do is to require you to do a whole load of new analyses or experiments. Obviously, if a reviewer spots a fatal error in the paper, that is another matter. But reviewers can’t at this point start dictating that the authors do further analyses or experiments that may be interesting but not essential.

We found that the reviewer comments on our completed study were helpful: they advised on how to present the data and made suggestions about how to frame the discussion. One reviewer suggested additional analyses that would have been nice to include but were not critical; as Hannah was working to tight deadlines for thesis completion and starting a new job, we realised it would not be possible to do these, but because we have deposited the data for this paper (another requirement for a Registered Report), the door is left open for others to do further analysis.

I always liked the idea of Registered Reports, but this experience has made me even more enthusiastic for the approach. I can imagine how different the process would have been had we gone down the conventional publishing route. Hannah would have started her data collection much sooner, as we wouldn’t have had to wait for reviewer comments. So the paper might have been submitted many months earlier. But then we would have started along the long uncertain road to publication. No doubt reviewers would have asked why we didn’t include different control conditions, why we didn’t use current source density analysis, why we weren’t looking at a different frequency band, and whether our exclusionary criteria for participants were adequate. They may have argued that our null results arose because the study was underpowered. (In the pre-registered route, these were all issues that were raised in the reviews of our protocol, so had been incorporated in the study). We would have been at risk of an outright rejection at worst, or requirement for major revisions at best. We could then have spent many months responding to reviewer recommendations and then resubmitting, only to be asked for yet more analyses.  Instead, we had a pretty clear idea of the timeline for publication, and could be confident it would not be enormously protracted.

This is not a rant against peer reviewers. The role of the reviewer is to look at someone else’s work and see how it could be improved. My own papers have been massively helped by reviewer suggestions, and I am on record as defending the peer review system against attacks. It is more a rant against the way in which things are ordered in our current publication system. The uncertainty inherent in the peer review process generates an enormous amount of waste, as publications, and sometimes careers, are abandoned. There is another way, via Registered Reports, and I hope that more journals will start to offer this option.

*Less than two weeks suggests a problem! See here for an example.

Saturday 5 March 2016

There is a reproducibility crisis in psychology and we need to act on it

The Müller-Lyer illusion: a highly reproducible effect. The central lines are the same length but the presence of the fins induces a perception that the left-hand line is longer.

The debate about whether psychological research is reproducible is getting heated. In 2015, Brian Nosek and his colleagues in the Open Science Collaboration showed that they could not replicate effects for over 50 per cent of studies published in top journals. Now we have a paper by Dan Gilbert and colleagues saying that this is misleading because Nosek’s study was flawed, and actually psychology is doing fine. More specifically: “Our analysis completely invalidates the pessimistic conclusions that many have drawn from this landmark study.” This has stimulated a set of rapid responses, mostly in the blogosphere. As Jon Sutton memorably tweeted: “I guess it's possible the paper that says the paper that says psychology is a bit shit is a bit shit is a bit shit.”
So now the folks in the media are confused and don’t know what to think.
The bulk of debate has been focused on what exactly we mean by reproducibility in statistical terms. That makes sense because many of the arguments hinge on statistics, but I think that ignores the more basic issue, which is whether psychology has a problem. My view is that we do have a problem, though psychology is no worse than many other disciplines that use inferential statistics.
In my undergraduate degree I learned about stuff that was on the one hand non-trivial and on the other hand solidly reproducible. Take, for instance, various phenomena in short-term memory. Effects like the serial position effect, the phonological confusability effect, and the superiority of memory for words over nonwords are solid and robust. In perception, we have striking visual effects such as the Müller-Lyer illusion, which demonstrate how our eyes can deceive us. In animal learning, the partial reinforcement effect is solid. In psycholinguistics, the difficulty adults have discriminating sound contrasts that are not distinctive in their native language is solid. In neuropsychology, the dichotic right ear advantage for verbal material is solid. In developmental psychology, it has been shown over and over again that poor readers have deficits in phonological awareness. These are just some of the numerous phenomena studied by psychologists that are reproducible in the sense that most people understand it, i.e. if I were to run an undergraduate practical class to demonstrate the effect, I’d be pretty confident that we’d get it. They are also non-trivial, in that a lay person would not just conclude that the result could have been predicted in advance.
The Reproducibility Project showed that many effects described in contemporary literature are not like that. But was it ever thus? I’d love to see the reproducibility project rerun with psychology studies reported in the literature from the 1970s – have we really got worse, or am I aware of the reproducible work just because that stuff has stood the test of time, while other work is forgotten?
My bet is that things have got worse, and I suspect there are a number of reasons for this:
1. Most of the phenomena I describe above were in areas of psychology where it was usual to report a series of experiments that demonstrated the effect and attempted to gain a better understanding of it by exploring the conditions under which it was obtained. Replication was built in to the process. That is not common in many of the areas where reproducibility of effects is contested.
2. It’s possible that all the low-hanging fruit has been plucked, and we are now focused on much smaller effects – i.e., where the signal of the effect is low in relation to background noise. That’s where statistics assumes importance. Something like the phonological confusability effect in short-term memory or a Müller-Lyer illusion is so strong that it can be readily demonstrated in very small samples. Indeed, abnormal patterns of performance on short-term memory tests can be used diagnostically with individual patients. If you have a small effect, you need much bigger samples to be confident that what you are observing is signal rather than noise. Unfortunately, the field has been slow to appreciate the importance of sample size and many studies are just too underpowered to be convincing.

3. Gilbert et al raise the possibility that the effects that are observed are not just small but also more fragile, in that they can be very dependent on contextual factors. Get these wrong, and you lose the effect. Where this occurs, I think we should regard it as an opportunity, rather than a problem, because manipulating experimental conditions to discover how they influence an effect can be the key to understanding it. It can be difficult to distinguish a fragile effect from a false positive, and it is understandable that this can lead to ill-will between original researchers and those who fail to replicate their finding. But the rational response is not to dismiss the failure to replicate, but to first do adequately powered studies to demonstrate the effect and then conduct further studies to understand the boundary conditions for observing the phenomenon. To take one of the examples I used above, the link between phonological awareness and learning to read is particularly striking in English and less so in some other languages. Comparisons between languages thus provide a rich source of information for understanding how children become literate. Another of the effects, the right ear advantage in dichotic listening, holds at the population level, but there are individuals for whom it is absent or reversed. Understanding this variability is part of the research process.
4. Psychology, unlike many other biomedical disciplines, involves training in statistics. In principle, this is a thoroughly good thing, but in practice it can be a disaster if the psychologist is simply fixated on finding p-values less than .05 – and assumes that any effect associated with such a p-value is true. I’ve blogged about this extensively, so won’t repeat myself here, other than to say that statistical training should involve exploring simulated datasets so that the student starts to appreciate the ease with which low p-values can occur by chance when one has a large number of variables and a flexible approach to data analysis. Virtually all psychologists misunderstand p-values associated with interaction terms in analysis of variance – as I myself did until working with simulated datasets. I think in the past this was not such an issue, simply because it was not so easy to conduct statistical analyses on large datasets – one of my early papers describes how to compare regression coefficients using a pocket calculator, which at the time was an advance on other methods available! If you have to put in hours of work calculating statistics by hand, then you think hard about the analysis you need to do. Currently, you can press a few buttons on a menu and generate a vast array of numbers – which can encourage the researcher to just scan the output and highlight those where p falls below the magic threshold of .05. Those who do this are generally unaware of how problematic this is, in terms of raising the likelihood of false positive findings.
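The kind of simulation exercise mentioned in point 4 might look something like the following. This is only a minimal sketch under stated assumptions: two groups, no true effect on any variable, independent variables with a known standard deviation of 1 (so a simple z-test stands in for the t-tests or ANOVAs a real analysis would use), and hypothetical function names chosen for illustration. It estimates how often a study with no real effects still yields at least one p-value below .05.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def any_significant(rng, n_variables, n_per_group, alpha=0.05):
    """Simulate one study with NO true effects.

    Returns True if at least one of the variables gives p < alpha.
    """
    # Standard error of a difference between two group means when sd = 1
    # (assumed known here, which is a simplification).
    se = math.sqrt(2 / n_per_group)
    for _ in range(n_variables):
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        diff = sum(group_a) / n_per_group - sum(group_b) / n_per_group
        if two_sided_p(diff / se) < alpha:
            return True
    return False


def false_positive_rate(n_variables, n_per_group=20, n_studies=1000, seed=1):
    """Proportion of simulated null studies with at least one 'significant' result."""
    rng = random.Random(seed)
    hits = sum(
        any_significant(rng, n_variables, n_per_group) for _ in range(n_studies)
    )
    return hits / n_studies


if __name__ == "__main__":
    for k in (1, 5, 20):
        print(k, "variables:", false_positive_rate(k))
```

With a single variable the rate should sit near the nominal 5 per cent. With 20 independent variables the theoretical value is 1 − 0.95^20, or about 64 per cent, so scanning a large output for anything below .05 will usually turn something up even when nothing is there.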
Nosek et al have demonstrated that much work in psychology is not reproducible in the everyday sense that if I try to repeat your experiment I can be confident of getting the same effect. Implicit in the critique by Gilbert et al is the notion that many studies are focused on effects that are both small and fragile, and so it is to be expected they will be hard to reproduce. They may well be right, but if so, the solution is not to deny we have a problem, but to recognise that under those circumstances there is an urgent need for our field to tackle the methodological issues of inadequate power and p-hacking, so we can distinguish genuine effects from false positives.
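The issue of inadequate power can be explored with a simulation along the same lines. The sketch below rests on assumptions that are not from the studies discussed: a two-group design, a known standard deviation of 1 (so a z-test is used in place of a t-test), effect sizes expressed in standard deviation units, and a hypothetical function name. It estimates how often a study of a given size detects a true effect at p < .05.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def simulated_power(effect_size, n_per_group, n_studies=1000, alpha=0.05, seed=1):
    """Proportion of simulated studies that detect a TRUE effect at p < alpha.

    effect_size is the true group difference in standard deviation units.
    """
    rng = random.Random(seed)
    # Standard error of the mean difference, with sd = 1 assumed known.
    se = math.sqrt(2 / n_per_group)
    hits = 0
    for _ in range(n_studies):
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        treated = [rng.gauss(effect_size, 1) for _ in range(n_per_group)]
        diff = sum(treated) / n_per_group - sum(control) / n_per_group
        if two_sided_p(diff / se) < alpha:
            hits += 1
    return hits / n_studies


if __name__ == "__main__":
    print("large effect, n = 20 per group: ", simulated_power(1.0, 20))
    print("small effect, n = 20 per group: ", simulated_power(0.2, 20))
    print("small effect, n = 400 per group:", simulated_power(0.2, 400))
```

Under these assumptions, a large effect is usually detected with 20 participants per group, whereas a small effect of 0.2 standard deviations is detected only around one time in ten at that sample size, and needs several hundred participants per group to be detected about 80 per cent of the time.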