Saturday, 5 March 2016

There is a reproducibility crisis in psychology and we need to act on it

The Müller-Lyer illusion: a highly reproducible effect. The central lines are the same length but the presence of the fins induces a perception that the left-hand line is longer.

The debate about whether psychological research is reproducible is getting heated. In 2015, Brian Nosek and his colleagues in the Open Science Collaboration showed that they could not replicate effects for over 50 per cent of studies published in top journals. Now we have a paper by Dan Gilbert and colleagues saying that this is misleading because Nosek’s study was flawed, and actually psychology is doing fine. More specifically: “Our analysis completely invalidates the pessimistic conclusions that many have drawn from this landmark study.” This has stimulated a set of rapid responses, mostly in the blogosphere. As Jon Sutton memorably tweeted: “I guess it's possible the paper that says the paper that says psychology is a bit shit is a bit shit is a bit shit.”
So now the folks in the media are confused and don’t know what to think.
The bulk of debate has been focused on what exactly we mean by reproducibility in statistical terms. That makes sense because many of the arguments hinge on statistics, but I think that ignores the more basic issue, which is whether psychology has a problem. My view is that we do have a problem, though psychology is no worse than many other disciplines that use inferential statistics.
In my undergraduate degree I learned about stuff that was on the one hand non-trivial and on the other hand solidly reproducible. Take, for instance, various phenomena in short-term memory: effects like the serial position effect, the phonological confusability effect, and the superiority of memory for words over nonwords are solid and robust. In perception, we have striking visual effects such as the Müller-Lyer illusion, which demonstrate how our eyes can deceive us. In animal learning, the partial reinforcement effect is solid. In psycholinguistics, the difficulty adults have discriminating sound contrasts that are not distinctive in their native language is solid. In neuropsychology, the dichotic right ear advantage for verbal material is solid. In developmental psychology, it has been shown over and over again that poor readers have deficits in phonological awareness. These are just some of the numerous phenomena studied by psychologists that are reproducible in the sense that most people understand it, i.e. if I were to run an undergraduate practical class to demonstrate the effect, I’d be pretty confident that we’d get it. They are also non-trivial, in that a lay person would not just conclude that the result could have been predicted in advance.
The Reproducibility Project showed that many effects described in contemporary literature are not like that. But was it ever thus? I’d love to see the reproducibility project rerun with psychology studies reported in the literature from the 1970s – have we really got worse, or am I aware of the reproducible work just because that stuff has stood the test of time, while other work is forgotten?
My bet is that things have got worse, and I suspect there are a number of reasons for this:
1. Most of the phenomena I describe above were in areas of psychology where it was usual to report a series of experiments that demonstrated the effect and attempted to gain a better understanding of it by exploring the conditions under which it was obtained. Replication was built in to the process. That is not common in many of the areas where reproducibility of effects is contested.
2. It’s possible that all the low-hanging fruit has been plucked, and we are now focused on much smaller effects – i.e., where the signal of the effect is low in relation to background noise. That’s where statistics assumes importance. Something like the phonological confusability effect in short-term memory or a Müller-Lyer illusion is so strong that it can be readily demonstrated in very small samples. Indeed, abnormal patterns of performance on short-term memory tests can be used diagnostically with individual patients. If you have a small effect, you need much bigger samples to be confident that what you are observing is signal rather than noise. Unfortunately, the field has been slow to appreciate the importance of sample size and many studies are just too underpowered to be convincing.

3. Gilbert et al raise the possibility that the effects that are observed are not just small but also more fragile, in that they can be very dependent on contextual factors. Get these wrong, and you lose the effect. Where this occurs, I think we should regard it as an opportunity, rather than a problem, because manipulating experimental conditions to discover how they influence an effect can be the key to understanding it. It can be difficult to distinguish a fragile effect from a false positive, and it is understandable that this can lead to ill-will between original researchers and those who fail to replicate their finding. But the rational response is not to dismiss the failure to replicate, but to first do adequately powered studies to demonstrate the effect and then conduct further studies to understand the boundary conditions for observing the phenomenon. To take one of the examples I used above, the link between phonological awareness and learning to read is particularly striking in English and less so in some other languages. Comparisons between languages thus provide a rich source of information for understanding how children become literate. Another of the effects, the right ear advantage in dichotic listening, holds at the population level, but there are individuals for whom it is absent or reversed. Understanding this variability is part of the research process.
4. Psychology, unlike many other biomedical disciplines, involves training in statistics. In principle, this is a thoroughly good thing, but in practice it can be a disaster if the psychologist is simply fixated on finding p-values less than .05 – and assumes that any effect associated with such a p-value is true. I’ve blogged about this extensively, so won’t repeat myself here, other than to say that statistical training should involve exploring simulated datasets so that the student starts to appreciate the ease with which low p-values can occur by chance when one has a large number of variables and a flexible approach to data analysis. Virtually all psychologists misunderstand p-values associated with interaction terms in analysis of variance – as I myself did until working with simulated datasets. I think in the past this was not such an issue, simply because it was not so easy to conduct statistical analyses on large datasets – one of my early papers describes how to compare regression coefficients using a pocket calculator, which at the time was an advance on other methods available! If you have to put in hours of work calculating statistics by hand, then you think hard about the analysis you need to do. Currently, you can press a few buttons on a menu and generate a vast array of numbers – which can encourage the researcher to just scan the output and highlight those where p falls below the magic threshold of .05. Those who do this are generally unaware of how problematic this is, in terms of raising the likelihood of false positive findings.
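The point in item 4 about low p-values arising by chance can be illustrated with a small simulation. What follows is only a sketch under simplifying assumptions that are not from the post: two groups with no true difference on any variable, independent variables, a known standard deviation of 1 (so a z-test can be used with the Python standard library alone), and hypothetical function names. It estimates how often a study that measures several variables finds at least one with p below .05.

```python
import random
from statistics import NormalDist, fmean


def two_sided_p(group_a, group_b, sd=1.0):
    """Two-sided z-test p-value for a difference in means (sd assumed known)."""
    se = sd * (1 / len(group_a) + 1 / len(group_b)) ** 0.5
    z = (fmean(group_a) - fmean(group_b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def chance_of_any_hit(n_variables, n_per_group=20, n_studies=1000,
                      alpha=0.05, seed=1):
    """Share of null studies in which at least one variable gives p < alpha."""
    rng = random.Random(seed)
    studies_with_hit = 0
    for _ in range(n_studies):
        for _ in range(n_variables):
            # Both groups are drawn from the same distribution: no real effect.
            a = [rng.gauss(0, 1) for _ in range(n_per_group)]
            b = [rng.gauss(0, 1) for _ in range(n_per_group)]
            if two_sided_p(a, b) < alpha:
                studies_with_hit += 1
                break  # one "significant" variable is enough to report
    return studies_with_hit / n_studies


for k in (1, 5, 20):
    simulated = chance_of_any_hit(k)
    expected = 1 - (1 - 0.05) ** k  # exact rate for k independent tests
    print(f"{k:2d} variables: simulated {simulated:.2f}, expected {expected:.2f}")
```

With independent tests the expected rate is 1 - 0.95^k, which is about 64% for 20 variables, so scanning a large output for values below .05 will usually turn something up even when nothing is there. Real variables are rarely independent, so the exact figure in practice will differ.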
Nosek et al have demonstrated that much work in psychology is not reproducible in the everyday sense that if I try to repeat your experiment I can be confident of getting the same effect. Implicit in the critique by Gilbert et al is the notion that many studies are focused on effects that are both small and fragile, and so it is to be expected they will be hard to reproduce. They may well be right, but if so, the solution is not to deny we have a problem, but to recognise that under those circumstances there is an urgent need for our field to tackle the methodological issues of inadequate power and p-hacking, so we can distinguish genuine effects from false positives.
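To make the issue of inadequate power concrete, here is a minimal simulation sketch. The effect sizes, sample sizes, and function name are illustrative assumptions rather than values from any of the studies discussed, and the standard deviation is again assumed known so that a z-test suffices. It estimates how often a two-group study detects a true effect of a given size.

```python
import random
from statistics import NormalDist, fmean


def estimated_power(effect_size, n_per_group, n_studies=1000,
                    alpha=0.05, seed=1):
    """Share of simulated studies reaching p < alpha when the effect is real.

    effect_size is the true difference in means in sd units (sd = 1, known).
    """
    rng = random.Random(seed)
    se = (2 / n_per_group) ** 0.5  # standard error of the mean difference
    significant = 0
    for _ in range(n_studies):
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        treated = [rng.gauss(effect_size, 1) for _ in range(n_per_group)]
        z = (fmean(treated) - fmean(control)) / se
        p = 2 * (1 - NormalDist().cdf(abs(z)))
        if p < alpha:
            significant += 1
    return significant / n_studies


# A large effect with a small sample, then a small effect with a small
# and a much larger sample.
for d, n in [(1.0, 20), (0.2, 20), (0.2, 400)]:
    print(f"effect {d}, n per group {n}: power ~ {estimated_power(d, n):.2f}")
```

Under these assumptions a large effect is detected most of the time with 20 participants per group, whereas a small effect is usually missed at that sample size and needs hundreds per group before a study can be expected to find it reliably.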


  1. Great post, I pretty much agree completely, especially with your point 3, which is what I also discussed on my blog. I don't think there is anything wrong with studying phenomena that are fragile and subject to complex interactions with the environment. If anything, we need to understand those phenomena better. All I've really meant to say is that the psychology of complex phenomena (say, like measuring whether someone talking rudely about affirmative action makes people look more at Black people when they think they can hear the comments) should be studied in a way that gets away from the snowflake problem and charts out the boundary conditions. If you publish a single finding under certain conditions, you are implicitly saying that it should generalise. If it then doesn't replicate, it is not enough to point to one parameter and say that this is inconclusive. It is fine to declare your intention to use this failure to replicate as motivation for a new study testing this boundary condition. It is not fine to say "This effect replicates just fine, thank you."

    Since you mention the Mueller-Lyer illusion, even effects like this are supposed to show cultural dependency. So even for robust phenomena you can demonstrate with a simple example there is plenty of room for moderators. I find it difficult to see how the idea that social psychology findings may be more complex and fragile is controversial to anyone :P

  2. Many thanks Sam. I don't know the cross-cultural literature on Mueller-Lyer beyond a dim memory of an account in Eye and Brain, but it's a perfect example of how you can use extension to other populations to understand a phenomenon better (as I recall the argument was that if your environment did not include corners and angles, you didn't get such a strong effect, suggesting it was driven more by experience than innate perceptual bias?). Indeed, I think that for most of my 'solid' effects, there are literatures looking at impact of variations on the method, and generalisation to different populations, so yes, moderators are everywhere and are to be expected.

    1. Yes that's exactly right. There are other studies suggesting the same and also the opposite that certain cultures are more prone to illusions like this. The reasons behind that remain still very controversial and I find some accounts more convincing than others. Also, I think it's a perfect case of where decision/cognitive factors may play a strong role and skew our estimate of the participant's subjective experience. This is all the topic of on-going research and as you say it's a perfectly good example of how moderating factors are worth studying.

      That said, it is obviously a great example of solid effects nevertheless. If someone told me that they couldn't replicate the Mueller-Lyer in the Netherlands or at a different time of day or whatever, I would be very sceptical they did it right. When I publish something on the Mueller-Lyer (which I might soon... :) or on the Ebbinghaus/Delboeuf (which I have done and continue to do) then I work on the assumption that it is a reasonably general effect and if people repeatedly fail to replicate the results, I would certainly be inclined to believe that my finding was a fluke (barring any obvious methodological problems at least).

  3. This comment has been removed by the author.

  4. Excellent post. Following on from Sam's comment, there might be scope for researchers being more explicit/honest about the boundaries of their studies. Researchers making it clear what the limits of a study are will hopefully limit misrepresentation/misunderstandings by the media (as some research suggests this misrepresentation starts with the researchers themselves). Though of course this honesty may be difficult in the current climate of impressive effects being more likely to get published, etc.

  5. Thanks both.
    Re the 'impressive effects' idea, I wonder whether it would be worth assembling some catalogue of effects in psychology that are generally agreed to be solid.
    One reason I wanted to emphasise things like Mueller-Lyer was because some people attack the Reproducibility Project because they are scared that people will conclude all psychology is crap. In fact, we have loads of findings that meet a high standard of reproducibility.

    1. Some people seem to have an axe to grind and will conclude that psychology is crap no matter what we do. If we ignore the problem they will say that we are being disingenuous. If we acknowledge it, they will consider themselves vindicated. In my view, we should ignore those who won't be satisfied no matter the outcome, and focus instead on improving what we can.

  6. Thanks for this post! I was part of the Reproducibility Project, and it was really nice to hear you weighing in on the debate. The idea that software packages might enable fishing for effects (as opposed to pocket calculators) has not crossed my mind before - possibly because I have never used pocket calculators for statistics.

    1. Try doing a Manova or Factor Analysis by hand. :(

      I believe that there were many statistical analyses available pre-computer days (well pre easy-access to computers) that were impractical before the advent of statistical packages and easy access to computer time.

      One of my professors had Cattell as an adviser and said a lot of the graduate students' time was spent on mechanical calculators doing work on factor analysis. He characterized it as "punch the keys, crank the handle; punch the keys, crank the handle; repeat".

      I am envious of deevybee: "She had a hand calculator". I had a pencil and a pad of paper.

      I may have just had a couple of poor stats instructors in undergrad, but immense amounts of time seemed to be spent teaching people how to use calculation formulae as opposed to conceptual instruction on what was actually happening. Now one can download R and generate a 100 x 100 correlation matrix in moments.

      On a different note, I think a lot of psychology is underpowered due to the difficulty of obtaining subjects, particularly human subjects. Some other disciplines seem to be able to do a study with large N's and replicate with no problem. Mice are cheap and a phone call can have them delivered to a lab in no time. Bacteria probably are even cheaper.

      Then try doing a study in a school board. It can take 6 months to a year to set up and execute, and all one needs in a simple study is to have a flu epidemic wipe out half your subject pool on the planned testing day. Repeat this in another school board if you want a replication. Or try getting two or three hundred young men or women for a sexual dysfunction study.

      Some studies, if done properly, are also incredibly costly by psychology standards. I was once costing out a psychomotor skills study done in-house in a large corporation. IIRC, salary costs for subjects were projected at a little over a quarter of a million dollars in 1990 $CDN.

      I think psychology has a history of underpowered studies because of the cost and time to get a decent sample size and psychologists have not yet realized the p-value problem when the effect-sizes are likely to be small and the data is noisy.

      On the other hand, maybe we should all become Bayesians.

  7. I found this quite interesting:

    I wonder what's next - neonates don't imitate adults?

  8. I think there's another even more insidious issue which is the focus on effects in the first place as the aim of psychological research. But these are only interesting and relevant if the effects are amazingly strong such as Müller-Lyer or the non-native phonological discrimination.

    But in the vast majority of the other phenomena, it is the patterns of variation within populations that psychology should be setting as its aim. I see all this research on effects of interventions which almost always only report on average effectiveness but rarely on the spread of the effects. But in many cases, they have that average effect on literally nobody. So using increasingly more sophisticated statistical methods to detect small effects across populations may in fact be taking us in the wrong direction.

    We should be looking at modeling the distribution of individual effects rather than average effects - this seems to apply even to some of the stronger effects like the phonological foundations of reading difficulties where the replications are much more variable than with something like many of the visual phenomena.
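The contrast between an average effect and the distribution of individual effects can be sketched with made-up numbers. Everything here is hypothetical (the 30% responder rate, the 10-point gain, the function name): an intervention that helps a minority a lot and does nothing for everyone else yields a positive average that describes almost no individual.

```python
import random
from statistics import fmean


def simulate_individual_effects(n_people=1000, responder_rate=0.3,
                                responder_gain=10.0, noise_sd=0.5, seed=1):
    """Hypothetical per-person effects: responders gain about
    responder_gain points, non-responders gain about 0."""
    rng = random.Random(seed)
    effects = []
    for _ in range(n_people):
        true_gain = responder_gain if rng.random() < responder_rate else 0.0
        effects.append(rng.gauss(true_gain, noise_sd))
    return effects


effects = simulate_individual_effects()
average = fmean(effects)
# Count people whose own effect lies within 1 point of the group average.
near_average = sum(abs(e - average) < 1.0 for e in effects)

print(f"average effect: {average:.1f}")
print(f"people within 1 point of the average: {near_average} of {len(effects)}")
```

Reporting only the mean would hide the two clusters entirely, which is why a histogram or a mixture model of individual effects can be more informative than a single average in cases like this.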
