BishopBlog: There is a reproducibility crisis in psychology and we need to act on it

Saturday, 5 March 2016

The Müller-Lyer illusion: a highly reproducible effect. The central lines are the same length but the presence of the fins induces a perception that the left-hand line is longer.

The debate about whether psychological research is reproducible is getting heated. In 2015, Brian Nosek and his colleagues in the Open Science Collaboration showed that they could not replicate effects for over 50 per cent of studies published in top journals. Now we have a paper by Dan Gilbert and colleagues saying that this is misleading because Nosek’s study was flawed, and actually psychology is doing fine. More specifically: “Our analysis completely invalidates the pessimistic conclusions that many have drawn from this landmark study.” This has stimulated a set of rapid responses, mostly in the blogosphere. As Jon Sutton memorably tweeted: “I guess it's possible the paper that says the paper that says psychology is a bit shit is a bit shit is a bit shit.”

So now the folks in the media are confused and don’t know what to think.

The bulk of debate has been focused on what exactly we mean by reproducibility in statistical terms. That makes sense because many of the arguments hinge on statistics, but I think that ignores the more basic issue, which is whether psychology has a problem. My view is that we do have a problem, though psychology is no worse than many other disciplines that use inferential statistics.

In my undergraduate degree I learned about stuff that was on the one hand non-trivial and on the other hand solidly reproducible. Take, for instance, various phenomena in short-term memory. Effects like the serial position effect, the phonological confusability effect, and the superiority of memory for words over nonwords are solid and robust. In perception, we have striking visual effects such as the Müller-Lyer illusion, which demonstrate how our eyes can deceive us. In animal learning, the partial reinforcement effect is solid. In psycholinguistics, the difficulty adults have discriminating sound contrasts that are not distinctive in their native language is solid. In neuropsychology, the dichotic right ear advantage for verbal material is solid. In developmental psychology, it has been shown over and over again that poor readers have deficits in phonological awareness. These are just some of the numerous phenomena studied by psychologists that are reproducible in the sense that most people understand it, i.e. if I were to run an undergraduate practical class to demonstrate the effect, I’d be pretty confident that we’d get it. They are also non-trivial, in that a lay person would not just conclude that the result could have been predicted in advance.

The Reproducibility Project showed that many effects described in contemporary literature are not like that. But was it ever thus? I’d love to see the Reproducibility Project rerun with psychology studies reported in the literature from the 1970s – have we really got worse, or am I aware of the reproducible work just because that stuff has stood the test of time, while other work is forgotten?

My bet is that things have got worse, and I suspect there are a number of reasons for this:

1. Most of the phenomena I describe above were in areas of psychology where it was usual to report a series of experiments that demonstrated the effect and attempted to gain a better understanding of it by exploring the conditions under which it was obtained. Replication was built in to the process. That is not common in many of the areas where reproducibility of effects is contested.

2. It’s possible that all the low-hanging fruit has been plucked, and we are now focused on much smaller effects – i.e., where the signal of the effect is low in relation to background noise. That’s where statistics assumes importance. Something like the phonological confusability effect in short-term memory or a Müller-Lyer illusion is so strong that it can be readily demonstrated in very small samples. Indeed, abnormal patterns of performance on short-term memory tests can be used diagnostically with individual patients. If you have a small effect, you need much bigger samples to be confident that what you are observing is signal rather than noise. Unfortunately, the field has been slow to appreciate the importance of sample size and many studies are just too underpowered to be convincing.

3. Gilbert et al raise the possibility that the effects that are observed are not just small but also more fragile, in that they can be very dependent on contextual factors. Get these wrong, and you lose the effect. Where this occurs, I think we should regard it as an opportunity, rather than a problem, because manipulating experimental conditions to discover how they influence an effect can be the key to understanding it. It can be difficult to distinguish a fragile effect from a false positive, and it is understandable that this can lead to ill-will between original researchers and those who fail to replicate their finding. But the rational response is not to dismiss the failure to replicate, but to first do adequately powered studies to demonstrate the effect and then conduct further studies to understand the boundary conditions for observing the phenomenon. To take one of the examples I used above, the link between phonological awareness and learning to read is particularly striking in English and less so in some other languages. Comparisons between languages thus provide a rich source of information for understanding how children become literate. Another of the effects, the right ear advantage in dichotic listening, holds at the population level, but there are individuals for whom it is absent or reversed. Understanding this variability is part of the research process.

4. Psychology, unlike many other biomedical disciplines, involves training in statistics. In principle, this is a thoroughly good thing, but in practice it can be a disaster if the psychologist is simply fixated on finding p-values less than .05 – and assumes that any effect associated with such a p-value is true. I’ve blogged about this extensively, so won’t repeat myself here, other than to say that statistical training should involve exploring simulated datasets so that the student starts to appreciate the ease with which low p-values can occur by chance when one has a large number of variables and a flexible approach to data analysis. Virtually all psychologists misunderstand p-values associated with interaction terms in analysis of variance – as I myself did until working with simulated datasets. I think in the past this was not such an issue, simply because it was not so easy to conduct statistical analyses on large datasets – one of my early papers describes how to compare regression coefficients using a pocket calculator, which at the time was an advance on other methods available! If you have to put in hours of work calculating statistics by hand, then you think hard about the analysis you need to do. Currently, you can press a few buttons on a menu and generate a vast array of numbers – which can encourage the researcher to just scan the output and highlight those where p falls below the magic threshold of .05. Those who do this are generally unaware of how problematic this is, in terms of raising the likelihood of false positive findings.
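
To make the point about simulated datasets concrete, here is a minimal sketch in Python of the kind of exercise meant in point 4. Everything specific in it is an illustrative assumption rather than something taken from a real study: two groups of 20, outcome measures that are pure noise, and a simple z-test that treats the standard deviation as known (a simplification of the t-tests and ANOVAs used in practice). The function names are hypothetical. Each simulated study measures several unrelated variables and counts as a "finding" if any one of them gives p < .05.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def smallest_p(n_per_group, n_variables, rng):
    """Simulate one study in which no variable has a true effect.

    Returns the smallest p-value across all the variables tested.
    """
    # Standard error of a difference between two means when SD is known to be 1.
    se = math.sqrt(2 / n_per_group)
    p_values = []
    for _ in range(n_variables):
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        diff = sum(group_a) / n_per_group - sum(group_b) / n_per_group
        p_values.append(two_sided_p(diff / se))
    return min(p_values)


def false_positive_rate(n_variables, n_studies=2000, n_per_group=20, seed=1):
    """Proportion of null studies with at least one p-value below .05."""
    rng = random.Random(seed)
    hits = sum(
        smallest_p(n_per_group, n_variables, rng) < 0.05
        for _ in range(n_studies)
    )
    return hits / n_studies


if __name__ == "__main__":
    for k in (1, 5, 20):
        expected = 1 - 0.95 ** k  # chance of at least one hit in k independent tests
        print(f"{k:2d} variables: simulated {false_positive_rate(k):.3f}, "
              f"expected {expected:.3f}")
```

With a single variable the rate of spurious "findings" should sit close to the nominal 5 per cent. With 20 independent variables the expected rate is 1 - 0.95^20, or about 64 per cent, even though nothing real is there to be found.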

Nosek et al have demonstrated that much work in psychology is not reproducible in the everyday sense that if I try to repeat your experiment I can be confident of getting the same effect. Implicit in the critique by Gilbert et al is the notion that many studies are focused on effects that are both small and fragile, and so it is to be expected they will be hard to reproduce. They may well be right, but if so, the solution is not to deny we have a problem, but to recognise that under those circumstances there is an urgent need for our field to tackle the methodological issues of inadequate power and p-hacking, so we can distinguish genuine effects from false positives.
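
The issue of inadequate power can be explored with the same kind of simulation. The sketch below is a rough illustration under stated assumptions, not an analysis of any real dataset: effects are expressed in standard deviation units, the values 0.3 ("small") and 1.5 ("large") are chosen purely for illustration, the test is a z-test with the standard deviation treated as known, and the function names are hypothetical. It estimates how often a study of a given size would detect a true effect at p < .05.

```python
import math
import random


def study_is_significant(effect_size, n_per_group, rng, alpha=0.05):
    """Simulate one two-group study with a true effect; True if p < alpha."""
    control = [rng.gauss(0, 1) for _ in range(n_per_group)]
    treated = [rng.gauss(effect_size, 1) for _ in range(n_per_group)]
    diff = sum(treated) / n_per_group - sum(control) / n_per_group
    # z-test for a difference in means, with SD assumed known and equal to 1.
    z = diff / math.sqrt(2 / n_per_group)
    p = math.erfc(abs(z) / math.sqrt(2))
    return p < alpha


def estimated_power(effect_size, n_per_group, n_studies=2000, seed=1):
    """Proportion of simulated studies that detect the effect."""
    rng = random.Random(seed)
    hits = sum(
        study_is_significant(effect_size, n_per_group, rng)
        for _ in range(n_studies)
    )
    return hits / n_studies


if __name__ == "__main__":
    for effect_size, n in ((1.5, 20), (0.3, 20), (0.3, 175)):
        print(f"effect {effect_size}, n = {n} per group: "
              f"power about {estimated_power(effect_size, n):.2f}")
```

Under these assumptions, a large effect is detected almost every time with 20 participants per group, whereas an effect of 0.3 should be detected in only around one study in six at that sample size, and needs something like 175 per group to reach roughly 80 per cent power.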

13 comments:

Sam Schwarzkopf, 5 March 2016 at 12:41
Great post, I pretty much agree completely, especially with your point 3, which is what I also discussed on my blog. I don't think there is anything wrong with studying phenomena that are fragile and subject to complex interactions with the environment. If anything, we need to understand those phenomena better. All I've really meant to say is that the psychology of complex phenomena (say, like measuring whether someone talking rudely about affirmative action makes people look more at Black people when they think they can hear the comments) should be studied in a way that gets away from the snowflake problem and charts out the boundary conditions. If you publish a single finding under certain conditions, you are implicitly saying that it should generalise. If it then doesn't replicate it is not enough to point to one parameter and say that this is inconclusive. It is fine to declare your intention to use this failure to replicate as motivation for a new study testing this boundary condition. It is not fine to say "This effect replicates just fine, thank you."

Since you mention the Mueller-Lyer illusion, even effects like this are supposed to show cultural dependency. So even for robust phenomena you can demonstrate with a simple example there is plenty of room for moderators. I find it difficult to see how the idea that social psychology findings may be more complex and fragile is controversial to anyone :P

deevybee, 5 March 2016 at 13:21
Many thanks Sam. I don't know the cross-cultural literature on Mueller-Lyer beyond a dim memory of an account in Eye and Brain, but it's a perfect example of how you can use extension to other populations to understand a phenomenon better (as I recall the argument was that if your environment did not include corners and angles, you didn't get such a strong effect, suggesting it was driven more by experience than innate perceptual bias?). Indeed, I think that for most of my 'solid' effects, there are literatures looking at impact of variations on the method, and generalisation to different populations, so yes, moderators are everywhere and are to be expected.

Unknown, 5 March 2016 at 14:00
Excellent post. Following on from Sam's comment, there might be scope for researchers being more explicit/honest about the boundaries of their studies. Researchers making it clear what the limits of a study are will hopefully limit misrepresentation/misunderstandings by the media (as some research suggests this misrepresentation starts with the researchers themselves: http://www.bmj.com/content/349/bmj.g7015 & http://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1001308). Though of course this honesty may be difficult in the current climate of impressive effects being more likely to get published etc.

deevybee, 5 March 2016 at 14:59
Thanks both.
Re the 'impressive effects' idea, I wonder whether it would be worth assembling some catalogue of effects in psychology that are generally agreed to be solid.
One reason I wanted to emphasise things like Mueller-Lyer was because some people attack the Reproducibility Project because they are scared that people will conclude all psychology is crap. In fact, we have loads of findings that meet a high standard of reproducibility.

A, 6 March 2016 at 13:58
Thanks for this post! I was part of the Reproducibility Project, and it was really nice to hear you weighing in on the debate. The idea that software packages might enable fishing for effects (as opposed to pocket calculators) has not crossed my mind before - possibly because I have never used pocket calculators for statistics.

Dean D'Souza, 7 March 2016 at 20:25
I found this quite interesting:

http://www.slate.com/articles/health_and_science/cover_story/2016/03/ego_depletion_an_influential_theory_in_psychology_may_have_just_been_debunked.single.html

I wonder what's next - neonates don't imitate adults?

Dominik Lukeš, 7 March 2016 at 20:41
I think there's another even more insidious issue which is the focus on effects in the first place as the aim of psychological research. But these are only interesting and relevant if the effects are amazingly strong such as Müller-Lyer or the non-native phonological discrimination.

But in the vast majority of the other phenomena, it is the patterns of variation within populations that psychology should be setting as its aim. I see all this research on effects of interventions which almost always only report on average effectiveness but rarely on the spread of the effects. But in many cases, they have that average effect on literally nobody. So using increasingly more sophisticated statistical methods to detect small effects across populations may in fact be taking us in the wrong direction.

We should be looking at modeling the distribution of individual effects rather than average effects - this seems to apply even to some of the stronger effects like the phonological foundations of reading difficulties where the replications are much more variable than with something like many of the visual phenomena.