Sunday, 27 May 2018

Sowing seeds of doubt: how Gilbert et al’s critique of the reproducibility project has played out



In Merchants of Doubt, Erik Conway and Naomi Oreskes describe how raising doubt can be used as an effective weapon against inconvenient science. On topics such as the effects of tobacco on health, climate change and the causes of acid rain, it has been possible to delay or curb action to tackle problems simply by emphasising the lack of scientific consensus. This is always an option, because science is characterised by uncertainty, and indeed we move forward by challenging one another’s findings: only a dead science would have no disagreements. But those raising concerns wield a two-edged sword: spurious and discredited criticisms can disrupt scientific progress, especially if the arguments are complex and technical. People will be left with a sense that they cannot trust the findings, even if they don’t fully understand the matters under debate.

The parallels with Merchants of Doubt occurred to me as I re-read the critique by Gilbert et al of the classic paper by the Open Science Collaboration (OSC) on ‘Estimating the reproducibility of psychological science’. I was prompted to do so because we were discussing the OSC paper in a journal club* and inevitably the question arose as to whether we needed to worry about reproducibility, in the light of the remarkable claim by Gilbert et al: ‘We show that OSC's article contains three major statistical errors and, when corrected, provides no evidence of a replication crisis. Indeed, the evidence is also consistent with the opposite conclusion -- that the reproducibility of psychological science is quite high and, in fact, statistically indistinguishable from 100%.’

The Gilbert et al critique has, in turn, been the subject of considerable criticism, as well as a response by a subset of the OSC group. I summarise the main points of contention in Table 1; at times Gilbert et al seem to be making a defeatist argument that we don’t need to worry because replication in psychology is bound to be poor: something I have disputed.

But my main focus in this post is simply to consider the impact of the critique on the reproducibility debate by looking at citations of the original article and the critique. A quick check on Web of Science found 797 citations of the OSC paper, 67 citations of Gilbert et al, and 33 citations of the response by Anderson et al.

The next thing I did, admittedly in a very informal fashion, was to download the details of the articles citing Gilbert et al and code them according to the content of what they said, as either supporting Gilbert et al’s view, rejecting the criticism, or being neutral. I discovered I needed a fourth category for papers where the citation seemed wrong or so vague as to be unclassifiable. I discarded any papers where the relevant information could not be readily accessed – I can access most journals via Oxford University but a few were behind paywalls, others were not in English, or did not appear to cite Gilbert et al. This left 44 citing papers that focused on the commentary on the OSC study. Nine of these were supportive of Gilbert et al, two noted problems with their analysis, but 33 were categorised as ‘neutral’, because the citation read something like this: 

“Because of the current replicability crisis in psychological science (e.g., Open Science Collaboration, 2015; but see Gilbert, King, Pettigrew, & Wilson, 2016)….”

The strong impression was that the authors of these papers lacked either the appetite or the ability to engage with the detailed arguments in the critique, but had a sense that there was a debate and felt that they should flag this up. That’s when I started to think about Merchants of Doubt: whether intentionally or not, Gilbert et al had created an atmosphere of uncertainty to suggest there is no consensus on whether or not psychology has a reproducibility problem - people are left thinking that it's all very complicated and depends on arguments that are only of interest to statisticians. This makes it easier for those who are reluctant to take action to deal with the issue.
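As a concrete footnote to the informal coding exercise described above, here is a minimal sketch of the final tallying step, assuming the hand-coding of each citing paper has already been done. The category labels are invented for illustration; the counts are simply those reported above (9 supportive, 2 noting problems, 33 neutral).

```python
# Minimal sketch of tallying hand-coded citation categories.
# Category names are hypothetical; counts match those reported in the post.
from collections import Counter

# One hand-assigned category per citing paper that could be classified
codings = (
    ["supports_gilbert"] * 9
    + ["notes_problems_with_gilbert"] * 2
    + ["neutral_both_sides_cited"] * 33
)

tally = Counter(codings)
total = sum(tally.values())
for category, n in tally.most_common():
    print(f"{category:28s} {n:3d} ({n / total:.0%})")
print(f"{'total':28s} {total:3d}")
```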

Fortunately, it looks as if Gilbert et al’s critique has been less successful than might have been expected, given the eminence of the authors. This may in part be because the arguments in favour of change are founded not just on demonstrations such as the OSC project, but also on logical analyses of statistical practices and publication biases that have been known about for years (see slides 15-20 of my presentation here). Furthermore, as evidenced in the footnotes to Table 1, social media allows a rapid evaluation of claims and counter-claims that hitherto was not possible when debate was restricted to and controlled by journals. The publication this week of three more big replication studies just heaps on further empirical evidence that we have a problem that needs addressing. Those who are saying ‘nothing to see here, move along’ cannot retain any credibility.

Table 1: Gilbert et al’s main criticisms and the rejoinders to them

Criticism 1: ‘many of OSC’s replication studies drew their samples from different populations than the original studies did’
Rejoinder:
- ‘Many’ implies the majority; no attempt to quantify, just gives examples
- Did not show that this feature affected replication rate

Criticism 2: ‘many of OSC’s replication studies used procedures that differed from the original study’s procedures in substantial ways.’
Rejoinder:
- ‘Many’ implies the majority; no attempt to quantify, just gives examples
- OSC showed that this did not affect replication rate
- The most striking example used by Gilbert et al is given a detailed explanation by Nosek (1)

Criticism 3: ‘How many of their replication studies should we expect to have failed by chance alone? Making this estimate requires having data from multiple replications of the same original study.’ Gilbert et al used data from pairwise comparisons of studies in the Many Labs project to argue that a low rate of agreement is to be expected.
Rejoinder:
- Ignores the impact of publication bias on the original studies (2, 3)
- Gilbert et al misinterpret confidence intervals (3, 4)
- Gilbert et al fail to take sample size/power into account, though this is a crucial determinant of confidence interval width (3, 4); see the simulation sketch below the table
- ‘Gilbert et al.’s focus on the CI measure of reproducibility neither addresses nor can account for the facts that the OSC2015 replication effect sizes were about half the size of the original studies on average, and 83% of replications elicited smaller effect sizes than the original studies.’ (2)

Criticism 4: Results depended on whether original authors endorsed the protocol for the replication: ‘This strongly suggests that the infidelities did not just introduce random error but instead biased the replication studies toward failure.’
Rejoinder:
- Use of the term ‘the infidelities’ assumes the only reason for lack of endorsement is departure from the original protocol (2)
- Lack of endorsement also included non-response from the original authors (3)
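To make the sample-size point in the rejoinders concrete, here is a small simulation sketch. It is not taken from any of the papers cited above and all numbers in it are hypothetical; it simply illustrates that whether a replication’s confidence interval ‘captures’ the original point estimate depends heavily on the replication’s sample size, even when both studies estimate exactly the same true effect.

```python
# Illustrative simulation (hypothetical numbers throughout): how often does a
# replication's 95% CI contain the original study's point estimate when both
# studies measure the same true effect, as a function of replication sample size?
import numpy as np

rng = np.random.default_rng(1)
true_d = 0.4       # assumed true standardized effect (hypothetical)
n_original = 30    # per-group n in the 'original' study (hypothetical)
n_sims = 10_000

def simulate_effect(n, d, rng):
    """Two-group study: return observed Cohen's d and a rough 95% CI for it."""
    a = rng.normal(d, 1.0, n)
    b = rng.normal(0.0, 1.0, n)
    d_obs = (a.mean() - b.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    se = np.sqrt(2 / n)            # large-sample approximation to the SE of d
    return d_obs, (d_obs - 1.96 * se, d_obs + 1.96 * se)

for n_rep in (30, 60, 120, 500):
    captured = 0
    for _ in range(n_sims):
        d_orig, _ = simulate_effect(n_original, true_d, rng)
        _, rep_ci = simulate_effect(n_rep, true_d, rng)
        captured += rep_ci[0] <= d_orig <= rep_ci[1]
    print(f"replication n = {n_rep:3d} per group: original estimate falls inside "
          f"the replication CI in {captured / n_sims:.0%} of simulations")
```

Under these assumptions the capture rate falls as the replication sample grows, because the replication CI narrows while the original estimate stays noisy: a CI-based criterion of replication ‘success’ therefore cannot be interpreted without considering power.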


References
Anderson, C. J., Bahník, Š., Barnett-Cowan, M., et al. (2016). Response to Comment on "Estimating the reproducibility of psychological science". Science, 351(6277).
Gilbert, D. T., King, G., Pettigrew, S., & Wilson, T. D. (2016). Comment on "Estimating the reproducibility of psychological science". Science, 351(6277).
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. doi:10.1126/science.aac4716


*Thanks to the enthusiastic efforts of some of our grad students, and the support of Reproducible Research Oxford, we’ve had a series of ReproducibiliTea journal clubs in our department this term. I can recommend this as a great – and relatively cheap and easy – way of raising awareness of issues around reproducibility in a department: something that is sorely needed if a recent Twitter survey by Dan Lakens is anything to go by.

1 comment:

  1. Your "fourth category for papers where the citation seemed wrong or so vague" doesn't surprise me.

    I commonly look at the references in the, mostly medical, journal articles I read if the author has written something surprising and usually find the reference cannot justify the article text. One typical example is the statement "hand washing saves lives" which referenced the "Marsden Manual of Nursing" - hardly a primary paper. No idea what the peer reviewers were supposed to be doing.
