Tuesday, 10 September 2019

Responding to the replication crisis: reflections on Metascience2019

Talk by Yang Yang at Metascience 2019
I'm just back from MetaScience 2019. It was an immense privilege to be invited to speak at such a stimulating and timely meeting, and I would like to thank the Fetzer Franklin Fund, who not only generously funded the meeting, but also ensured the impressively smooth running of a packed schedule. The organisers - Brian Nosek, Jonathan Schooler, Jon Krosnick, Leif Nelson and Jan Walleczek - did a great job in bringing together speakers on a range of topics, and the quality of talks was outstanding. For me, highlights were hearing great presentations from people well outside my usual orbit, such as Melissa Schilling on 'Where do breakthrough ideas come from?', Carl Bergstrom on modelling grant funding systems, and Callin O'Connor on scientific polarisation.

The talks were recorded, but I gather it may be some months before the film is available. Meanwhile, slides of many of the presentations are available here, and there is a copious Twitter stream on the hashtag #metascience2019. Special thanks are due to Joseph Fridman (@joseph_fridman): if you look at his timeline, you can pretty well reconstruct the entire meeting from live tweets. Noah Haber (@NoahHaber) also deserves special mention for extensive commentary, including a post-conference reflection starting here.  It is a sign of a successful meeting, I think, if it gets people, like Noah, raising more general questions about the direction the field is going in, and it is in that spirit I would like to share some of my own thoughts.

In the past 15 years or so, we have made enormous progress in documenting problems with credibility of research findings, not just in psychology, but in many areas of science. Metascience studies have helped us quantify the extent of the problem and begun to shed light on the underlying causes. We now have to confront the question of what we do next. That would seem to be a no-brainer: we need to concentrate on fixing the problem. But there is a real danger of rushing in with well-intentioned solutions that may be ineffective at best or have unintended consequences at worst.

One question is whether we should be continuing with a focus on replication studies. Noah Haber was critical of the number of talks that focused on replication, but I had a rather different take on this: it depends on what the purpose of a replication study is. I think further replication initiatives, in the style of the original Reproducibility Project, can be invaluable in highlighting problems (or not) in a field. Tim Errington's talk about the Cancer Biology Reproducibility Project demonstrated beautifully how a systematic attempt to replicate findings can reveal major problems in a field. Studies in this area are often dependent on specialised procedures and materials, which are either poorly described or unavailable. In such circumstances it becomes impossible for other labs to reproduce the methods, let alone replicate the results. The mindset of many researchers in this area is also unhelpful – the sense is that competition dominates, and open science ideals are not part of the training of scientists. But these are problems that can be fixed.

As was evident from my questions after the talk, I was less enthused by the idea of doing a large, replication of Darryl Bem's studies on extra-sensory perception. Zoltán Kekecs and his team have put in a huge amount of work to ensure that this study meets the highest standards of rigour, and it is a model of collaborative planning, ensuring input into the research questions and design from those with very different prior beliefs. I just wondered what the point was. If you want to put in all that time, money and effort, wouldn't it be better to investigate a hypothesis about something that doesn't contradict the laws of physics? There were two responses to this. Zoltán's view was that the study would tell us more than whether or not precognition exists: it would provide a model of methods that could be extended to other questions. That seems reasonable: some of the innovations, in terms of automated methods and collaborative working could be applied in other contexts to ensure original research was done to the highest standards. Jonathan Schooler, on the other hand, felt it was unscientific of me to prejudge the question, given a large previous literature of positive findings on ESP, including a meta-analysis. Given that I come from a field where there are numerous phenomena that have been debunked after years of apparent positive evidence, I was not swayed by this argument. (See for instance this blogpost on 5-HTTLPR and depression). If the study by Kekecs et al sets such a high standard that the results will be treated as definitive, then I guess it might be worthwhile. But somehow I doubt that a null finding in this study will convince believers to abandon this line of work.

Another major concern I had was the widespread reliance on proxy indicators of research quality. One talk that exemplified this was Yang Yang's presentation on machine intelligence approaches to predicting replicability of studies. He started by noting that non-replicable results get cited just as much as replicable ones: a depressing finding indeed, and one that motivated the study he reported. His talk was clever at many levels. It was ingenious to use the existing results from the Reproducibility Project as a database that could be mined to identify characteristics of results that replicated. I'm not qualified to comment on the machine learning approach, which involved using ngrams extracted from texts to predict a binary category of replicable or not. But implicit in this study was the idea that the results from this exercise could be useful in future in helping us identify, just on the basis of textual analysis, which studies were likely to be replicable.

Now, this seems misguided on several levels. For a start, as we know from the field of medical screening, the usefulness of a screening test depends on the base rate of the condition you are screening for, the extent to which the sample you develop the test on is representative of the population, and the accuracy of prediction. I would be frankly amazed if the results of this exercise yielded a useful screener. But even if they did, then Goodhart's law would kick in: as soon as researchers became aware that there was a formula being used to predict how replicable their research was, they'd write their papers in a way that would maximise their score. One can even imagine whole new companies springing up who would take your low-scoring research paper and, for a price, revise it to get a better score. I somehow don't think this would benefit science. In defence of this approach, it was argued that it would allow us to identify characteristics of replicable work, and encourage people to emulate these. But this seems back-to-front logic. Why try to optimise an indirect, weak proxy for what makes good science (ngram characteristics of the write-up) rather than optimising, erm, good scientific practices. Recommended readings in this area include Philip Stark's short piece on Preproducibility, as well as Florian Markowetz's 'Five selfish reasons to work reproducibly'.

My reservations here are an extension of broader concerns about reliance on text-mining in meta-science (see e.g. https://peerj.com/articles/1715/https://peerj.com/articles/1715/). We have this wonderful ability to pull in mountains of data from online literature to see patterns that might be undetectable otherwise, But ultimately, the information that we extract cannot give more than a superficial sense of the content. It seems sometimes that we're moving to a situation where science will be done by bots, leaving the human brain out of the process altogether. This would, to my mind, be a mistake.

No comments:

Post a Comment