[Figure: The Müller-Lyer illusion, a highly reproducible effect. The central lines are the same length, but the presence of the fins induces a perception that the left-hand line is longer.]
The debate about whether psychological research is
reproducible is getting heated. In 2015, Brian Nosek
and his colleagues in the Open Science Collaboration showed that they could
not replicate effects for over 50 per cent of studies published in top journals.
Now we have a
paper by Dan Gilbert and colleagues saying that this is misleading because
Nosek’s study was flawed, and actually psychology is doing fine. More
specifically: “Our
analysis completely invalidates the pessimistic conclusions that many have drawn
from this landmark study.” This has stimulated a set of rapid responses,
mostly in the blogosphere. As Jon Sutton
memorably tweeted: “I guess it's possible the paper that says the paper
that says psychology is a bit shit is a bit shit is a bit shit.”
So now the folks in the media are confused and don’t know
what to think.
The bulk of debate has been focused on what exactly we mean
by reproducibility in statistical terms. That makes sense because many of the
arguments hinge on statistics, but I think that ignores the more basic issue,
which is whether psychology has a problem. My view is that we do have a
problem, though psychology is no worse than many other disciplines that use
inferential statistics.
In my undergraduate degree I learned about stuff that was on
the one hand non-trivial and on the other hand solidly reproducible. Take, for
instance, various phenomena in short-term
memory. Effects like the serial position effect, the phonological
confusability effect, and the superiority of memory for words over nonwords are
solid and robust. In perception, we have striking visual effects such as the Müller-Lyer
illusion, which demonstrate how our eyes can deceive us. In animal learning, the
partial reinforcement effect is solid. In psycholinguistics, the difficulty
adults have discriminating sound contrasts that are not distinctive in their
native language is solid. In neuropsychology, the dichotic
right ear advantage for verbal material is solid. In developmental psychology,
it has been shown over and over again that poor readers have deficits
in phonological awareness. These are just some of the numerous phenomena
studied by psychologists that are reproducible in the sense that most people
understand it, i.e. if I were to run an undergraduate practical class to
demonstrate the effect, I’d be pretty confident that we’d get it. They are also
non-trivial, in that a lay person would not just conclude that the result could
have been predicted in advance.
The Reproducibility Project showed that many effects
described in contemporary literature are not like that. But was it ever thus? I’d
love to see the reproducibility project rerun with psychology studies reported
in the literature from the 1970s – have we really got worse, or am I aware of the
reproducible work just because that stuff has stood the test of time, while
other work is forgotten?
My bet is that things have got worse, and I suspect there
are a number of reasons for this:
1. Most of the phenomena I describe above were in
areas of psychology where it was usual to report a series of experiments that
demonstrated the effect and attempted to gain a better understanding of it by
exploring the conditions under which it was obtained. Replication was built in
to the process. That is not common in many of the areas where reproducibility of
effects is contested.
2. It’s possible that all the low-hanging fruit has
been plucked, and we are now focused on much smaller effects – i.e., where the
signal of the effect is low in relation to background noise. That’s where
statistics assumes importance. Something like the phonological confusability
effect in short-term memory or a Müller-Lyer illusion is so strong that it can
be readily demonstrated in very small samples. Indeed, abnormal patterns of performance
on short-term
memory tests can be used diagnostically with individual patients. If you
have a small effect, you need much bigger samples to be confident that what you
are observing is signal rather than noise. Unfortunately, the field has been
slow to appreciate the importance of sample size and many studies
are just too underpowered to be convincing.
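To make that point concrete, here is a minimal simulation sketch of my own (not from any published analysis; the function name and the particular effect sizes are just illustrative). It estimates how often a simple two-group comparison detects a true effect of a given size: a whopping effect of the Müller-Lyer kind is picked up with a handful of participants, whereas a small effect needs far bigger samples.

```python
# A sketch, not a prescription: estimate statistical power by simulation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2016)

def estimated_power(d, n_per_group, n_sims=5000, alpha=0.05):
    """Proportion of simulated studies in which a true effect of size d
    (in standard-deviation units) reaches p < alpha in a two-sample t-test."""
    detected = 0
    for _ in range(n_sims):
        control = rng.normal(0.0, 1.0, n_per_group)
        treated = rng.normal(d, 1.0, n_per_group)
        if stats.ttest_ind(control, treated).pvalue < alpha:
            detected += 1
    return detected / n_sims

for d in (1.5, 0.3):                 # a huge effect vs a small one
    for n in (10, 50, 200):          # participants per group
        print(f"d = {d}, n per group = {n}: power ~ {estimated_power(d, n):.2f}")
```

Standard power calculators give the same answer, of course, but running the simulation makes the signal-to-noise trade-off tangible.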
3. Gilbert et al raise the possibility that the effects that are observed are not just small but also more fragile, in that they can be very dependent on contextual factors. Get these wrong, and you lose the effect. Where this occurs, I think we should regard it as an opportunity, rather than a problem, because manipulating experimental conditions to discover how they influence an effect can be the key to understanding it. It can be difficult to distinguish a fragile effect from a false positive, and it is understandable that this can lead to ill-will between original researchers and those who fail to replicate their finding. But the rational response is not to dismiss the failure to replicate, but to first do adequately powered studies to demonstrate the effect and then conduct further studies to understand the boundary conditions for observing the phenomenon. To take one of the examples I used above, the link between phonological awareness and learning to read is particularly striking in English and less so in some other languages. Comparisons between languages thus provide a rich source of information for understanding how children become literate. Another of the effects, the right ear advantage in dichotic listening holds at the population level, but there are individuals for whom it is absent or reversed. Understanding this variability is part of the research process.
4. Psychology, unlike many other biomedical disciplines,
involves training in statistics. In principle, this is a thoroughly good thing, but
in practice it can be a disaster if the psychologist is simply fixated on
finding p-values less than .05 – and assumes that any effect associated with
such a p-value is true. I’ve blogged about this extensively, so won’t repeat
myself here, other than to say that statistical
training should involve exploring simulated datasets so that the student
starts to appreciate the
ease with which low p-values can occur by chance when one has a large
number of variables and a flexible approach to data analysis. Virtually all
psychologists misunderstand p-values associated
with interaction terms in analysis of variance – as
I myself did until working with simulated datasets. I think in the past
this was not such an issue, simply because it was not so easy to conduct
statistical analyses on large datasets – one of my early papers
describes how to compare regression coefficients using a pocket calculator,
which at the time was an advance on other methods available! If you have to put
in hours of work calculating statistics by hand, then you think hard about the
analysis you need to do. Currently, you can press a few buttons on a menu and
generate a vast array of numbers – which can encourage the researcher to just
scan the output and highlight those where p falls below the magic threshold of
.05. Those who do this are generally unaware of how problematic this is, in
terms of raising the likelihood of false positive findings.
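Exploring simulated datasets, as suggested above, need not be elaborate. Here is a minimal sketch of my own (the function name and numbers are illustrative, not drawn from any real study): generate pure noise for a two-group study with several outcome measures and count how often at least one comparison comes out 'significant' at p < .05.

```python
# A sketch of the multiple-comparisons problem: pure noise, many outcomes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2016)

def false_positive_rate(n_per_group=20, n_outcomes=10, n_studies=5000, alpha=0.05):
    """Proportion of simulated null studies reporting at least one p < alpha."""
    hits = 0
    for _ in range(n_studies):
        # Two groups, several uncorrelated measures, no true effect anywhere.
        group_a = rng.normal(size=(n_per_group, n_outcomes))
        group_b = rng.normal(size=(n_per_group, n_outcomes))
        p_values = stats.ttest_ind(group_a, group_b).pvalue  # one p per outcome
        if (p_values < alpha).any():
            hits += 1
    return hits / n_studies

print(false_positive_rate(n_outcomes=1))   # ~0.05, the nominal rate
print(false_positive_rate(n_outcomes=10))  # ~0.40: something is nearly always 'significant'
```

With a single outcome the false-positive rate sits at the nominal 5 per cent; with ten outcomes and a willingness to report whichever one 'works', roughly four in ten studies of nothing at all yield something that looks publishable.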
Nosek et al have demonstrated that much work in psychology
is not reproducible in the everyday sense that if I try to repeat your
experiment I can be confident of getting the same effect. Implicit in the critique
by Gilbert et al is the notion that many studies are focused on effects that
are both small and fragile, and so it is to be expected they will be hard to
reproduce. They may well be right, but if so, the solution is not to deny we
have a problem, but to recognise that under those circumstances there is an
urgent need for our field to tackle the methodological issues of inadequate
power and p-hacking, so we can distinguish genuine effects from false
positives.
Great post, I pretty much agree completely, especially with your point 3, which is what I also discussed on my blog. I don't think there is anything wrong with studying phenomena that are fragile and subject to complex interactions with the environment. If anything, we need to understand those phenomena better. All I've really meant to say is that the psychology of complex phenomena (say, measuring whether someone talking rudely about affirmative action makes people look more at Black people when they think they can hear the comments) should be studied in a way that gets away from the snowflake problem and charts out the boundary conditions. If you publish a single finding under certain conditions, you are implicitly saying that it should generalise. If it then doesn't replicate, it is not enough to point to one parameter and say that this is inconclusive. It is fine to declare your intention to use this failure to replicate as motivation for a new study testing this boundary condition. It is not fine to say "This effect replicates just fine, thank you."
Since you mention the Mueller-Lyer illusion, even effects like this are supposed to show cultural dependency. So even for robust phenomena that you can demonstrate with a simple example, there is plenty of room for moderators. I find it difficult to see how the idea that social psychology findings may be more complex and fragile is controversial to anyone :P
Many thanks Sam. I don't know the cross-cultural literature on Mueller-Lyer beyond a dim memory of an account in Eye and Brain, but it's a perfect example of how you can use extension to other populations to understand a phenomenon better (as I recall the argument was that if your environment did not include corners and angles, you didn't get such a strong effect, suggesting it was driven more by experience than innate perceptual bias?). Indeed, I think that for most of my 'solid' effects, there are literatures looking at impact of variations on the method, and generalisation to different populations, so yes, moderators are everywhere and are to be expected.
Yes, that's exactly right. There are other studies suggesting the same, and also the opposite: that certain cultures are more prone to illusions like this. The reasons behind that remain very controversial and I find some accounts more convincing than others. Also, I think it's a perfect case of where decision/cognitive factors may play a strong role and skew our estimate of the participant's subjective experience. This is all the topic of ongoing research and, as you say, it's a perfectly good example of how moderating factors are worth studying.
That said, it is obviously a great example of solid effects nevertheless. If someone told me that they couldn't replicate the Mueller-Lyer in the Netherlands or at a different time of day or whatever, I would be very sceptical they did it right. When I publish something on the Mueller-Lyer (which I might soon... :) or on the Ebbinghaus/Delboeuf (which I have done and continue to do) then I work on the assumption that it is a reasonably general effect and if people repeatedly fail to replicate the results, I would certainly be inclined to believe that my finding was a fluke (barring any obvious methodological problems at least).
Excellent post. Following on from Sam's comment, there might be scope for researchers being more explicit/honest about the boundaries of their studies. Researchers making it clear what the limits of a study are will hopefully reduce misrepresentation/misunderstandings by the media (as some research suggests this misrepresentation starts with the researchers themselves: http://www.bmj.com/content/349/bmj.g7015 & http://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1001308). Though of course this honesty may be difficult in the current climate of impressive effects being more likely to get published etc.
Thanks both.
Re the 'impressive effects' idea, I wonder whether it would be worth assembling some catalogue of effects in psychology that are generally agreed to be solid.
One reason I wanted to emphasise things like Mueller-Lyer was because some people attack the Reproducibility Project because they are scared that people will conclude all psychology is crap. In fact, we have loads of findings that meet a high standard of reproducibility.
Some people seem to have an axe to grind and will conclude that psychology is crap no matter what we do. If we ignore the problem they will say that we are being disingenuous. If we acknowledge it, they will consider themselves vindicated. In my view, we should ignore those who won't be satisfied no matter the outcome, and focus instead on improving what we can.
Thanks for this post! I was part of the Reproducibility Project, and it was really nice to hear you weighing in on the debate. The idea that software packages might enable fishing for effects (as opposed to pocket calculators) has not crossed my mind before - possibly because I have never used pocket calculators for statistics.
Try doing a MANOVA or factor analysis by hand. :(
I believe there were many statistical analyses that existed in pre-computer days (well, pre easy access to computers) but were impractical before the advent of statistical packages and easy access to computer time.
One of my professors had Cattell as an adviser and said a lot of the graduate students' time was spent on mechanical calculators doing work on factor analysis. He characterized it as "punch the keys, crank the handle; punch the keys, crank the handle; repeat".
I am envious of deevybee: "She had a hand calculator". I had a pencil and a pad of paper.
I may have just had a couple of poor stats instructors in undergrad, but immense amounts of time seemed to be spent teaching people how to use calculation formulae as opposed to conceptual instruction on what was actually happening. Now one can download R and generate a 100 x 100 correlation matrix in moments.
On a different note, I think a lot of psychology is underpowered due to the difficulty of obtaining subjects, particularly human subjects. Some other disciplines seem to be able to do a study with large Ns and replicate with no problem. Mice are cheap and a phone call can have them delivered to a lab in no time. Bacteria are probably even cheaper.
Then try doing a study in a school board. It can take six months to a year to set up and execute, and all it takes to wreck a simple study is a flu epidemic wiping out half your subject pool on the planned testing day. Repeat this in another school board if you want a replication. Or try getting two or three hundred young men or women for a sexual dysfunction study.
Some studies, if done properly, are also incredibly costly by psychology standards. I once was costing out a psychomotor skills study done in-house at a large corporation. IIRC, salary costs for subjects were projected at a little over a quarter of a million dollars in 1990 $CDN.
I think psychology has a history of underpowered studies because of the cost and time needed to get a decent sample size, and psychologists have not yet fully grasped the p-value problem when effect sizes are likely to be small and the data are noisy.
On the other hand, maybe we should all become Bayesians.
I found this quite interesting:
http://www.slate.com/articles/health_and_science/cover_story/2016/03/ego_depletion_an_influential_theory_in_psychology_may_have_just_been_debunked.single.html
I wonder what's next - neonates don't imitate adults?
"what's next - neonates don't imitate adults?"
I'll just leave this here.
I think there's another, even more insidious issue, which is the focus on effects in the first place as the aim of psychological research. But these are only interesting and relevant if the effects are amazingly strong, such as the Müller-Lyer illusion or non-native phonological discrimination.
But for the vast majority of the other phenomena, it is the patterns of variation within populations that psychology should be setting as its aim. I see all this research on effects of interventions which almost always reports only average effectiveness but rarely the spread of the effects. But in many cases, they have that average effect on literally nobody. So using increasingly sophisticated statistical methods to detect small effects across populations may in fact be taking us in the wrong direction.
We should be looking at modeling the distribution of individual effects rather than average effects - this seems to apply even to some of the stronger effects, like the phonological foundations of reading difficulties, where replications are much more variable than for many of the visual phenomena.