Tuesday, 10 September 2019

Responding to the replication crisis: reflections on Metascience2019

Talk by Yang Yang at Metascience 2019
I'm just back from Metascience 2019. It was an immense privilege to be invited to speak at such a stimulating and timely meeting, and I would like to thank the Fetzer Franklin Fund, who not only generously funded the meeting but also ensured the impressively smooth running of a packed schedule. The organisers - Brian Nosek, Jonathan Schooler, Jon Krosnick, Leif Nelson and Jan Walleczek - did a great job of bringing together speakers on a range of topics, and the quality of talks was outstanding. For me, the highlights were hearing great presentations from people well outside my usual orbit, such as Melissa Schilling on 'Where do breakthrough ideas come from?', Carl Bergstrom on modelling grant funding systems, and Cailin O'Connor on scientific polarisation.

The talks were recorded, but I gather it may be some months before the footage is available. Meanwhile, slides of many of the presentations are available here, and there is a copious Twitter stream on the hashtag #metascience2019. Special thanks are due to Joseph Fridman (@joseph_fridman): if you look at his timeline, you can pretty well reconstruct the entire meeting from his live tweets. Noah Haber (@NoahHaber) also deserves special mention for extensive commentary, including a post-conference reflection starting here. It is a sign of a successful meeting, I think, if it gets people like Noah raising more general questions about the direction the field is going in, and it is in that spirit that I would like to share some of my own thoughts.

In the past 15 years or so, we have made enormous progress in documenting problems with credibility of research findings, not just in psychology, but in many areas of science. Metascience studies have helped us quantify the extent of the problem and begun to shed light on the underlying causes. We now have to confront the question of what we do next. That would seem to be a no-brainer: we need to concentrate on fixing the problem. But there is a real danger of rushing in with well-intentioned solutions that may be ineffective at best or have unintended consequences at worst.

One question is whether we should be continuing with a focus on replication studies. Noah Haber was critical of the number of talks that focused on replication, but I had a rather different take on this: it depends on what the purpose of a replication study is. I think further replication initiatives, in the style of the original Reproducibility Project, can be invaluable in highlighting problems (or not) in a field. Tim Errington's talk about the Cancer Biology Reproducibility Project demonstrated beautifully how a systematic attempt to replicate findings can reveal major problems in a field. Studies in this area are often dependent on specialised procedures and materials, which are either poorly described or unavailable. In such circumstances it becomes impossible for other labs to reproduce the methods, let alone replicate the results. The mindset of many researchers in this area is also unhelpful – the sense is that competition dominates, and open science ideals are not part of the training of scientists. But these are problems that can be fixed.

As was evident from my questions after the talk, I was less enthused by the idea of doing a large-scale replication of Daryl Bem's studies on extra-sensory perception. Zoltán Kekecs and his team have put in a huge amount of work to ensure that this study meets the highest standards of rigour, and it is a model of collaborative planning, ensuring input into the research questions and design from those with very different prior beliefs. I just wondered what the point was. If you want to put in all that time, money and effort, wouldn't it be better to investigate a hypothesis about something that doesn't contradict the laws of physics? There were two responses to this. Zoltán's view was that the study would tell us more than whether or not precognition exists: it would provide a model of methods that could be extended to other questions. That seems reasonable: some of the innovations, in terms of automated methods and collaborative working, could be applied in other contexts to ensure original research was done to the highest standards. Jonathan Schooler, on the other hand, felt it was unscientific of me to prejudge the question, given a large previous literature of positive findings on ESP, including a meta-analysis. Given that I come from a field where there are numerous phenomena that have been debunked after years of apparent positive evidence, I was not swayed by this argument. (See for instance this blogpost on 5-HTTLPR and depression). If the study by Kekecs et al sets such a high standard that the results will be treated as definitive, then I guess it might be worthwhile. But somehow I doubt that a null finding in this study will convince believers to abandon this line of work.

Another major concern I had was the widespread reliance on proxy indicators of research quality. One talk that exemplified this was Yang Yang's presentation on machine intelligence approaches to predicting replicability of studies. He started by noting that non-replicable results get cited just as much as replicable ones: a depressing finding indeed, and one that motivated the study he reported. His talk was clever at many levels. It was ingenious to use the existing results from the Reproducibility Project as a database that could be mined to identify characteristics of results that replicated. I'm not qualified to comment on the machine learning approach, which involved using ngrams extracted from texts to predict a binary category of replicable or not. But implicit in this study was the idea that the results from this exercise could be useful in future in helping us identify, just on the basis of textual analysis, which studies were likely to be replicable.
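To make concrete what 'using ngrams to predict a binary category' involves, here is a deliberately simplified sketch in pure Python. This is not Yang Yang's actual pipeline (which used machine learning on the Reproducibility Project data); the texts, labels and scoring rule below are all invented for illustration:

```python
from collections import Counter

def word_ngrams(text, n=2):
    """Extract word-level ngrams (here: bigrams) from a text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy "training" texts labelled replicable (1) or not (0) -- invented examples
corpus = [
    ("large preregistered sample with preregistered analysis", 1),
    ("small exploratory sample with flexible analysis", 0),
]

# Tally how strongly each bigram is associated with each label
weights = Counter()
for text, label in corpus:
    for ng in word_ngrams(text):
        weights[ng] += 1 if label == 1 else -1

def score(text):
    """Naive score: sum of bigram weights; positive leans 'replicable'."""
    return sum(weights[ng] for ng in word_ngrams(text))

print(score("preregistered analysis of a large sample"))  # positive score
```

A real system would use a proper classifier over many thousands of such features, but the principle is the same: the prediction rests entirely on surface features of the write-up, not on the underlying science, which is where my worry begins.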

Now, this seems misguided on several levels. For a start, as we know from the field of medical screening, the usefulness of a screening test depends on the base rate of the condition you are screening for, the extent to which the sample you develop the test on is representative of the population, and the accuracy of prediction. I would be frankly amazed if the results of this exercise yielded a useful screener. But even if they did, Goodhart's law would kick in: as soon as researchers became aware that there was a formula being used to predict how replicable their research was, they'd write their papers in a way that would maximise their score. One can even imagine whole new companies springing up that would take your low-scoring research paper and, for a price, revise it to get a better score. I somehow don't think this would benefit science. In defence of this approach, it was argued that it would allow us to identify characteristics of replicable work and encourage people to emulate these. But this seems back-to-front logic. Why try to optimise an indirect, weak proxy for what makes good science (ngram characteristics of the write-up) rather than optimising, erm, good scientific practices? Recommended readings in this area include Philip Stark's short piece on Preproducibility, as well as Florian Markowetz's 'Five selfish reasons to work reproducibly'.
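The base-rate point can be made concrete with Bayes' rule. The numbers below are illustrative assumptions, not figures from the talk: even a screener with respectable 80% sensitivity and 90% specificity flags mostly false positives when the condition being screened for (here, non-replicability) affects only 10% of papers:

```python
def ppv(prevalence, sensitivity, specificity):
    """Positive predictive value via Bayes' rule:
    P(condition present | test positive)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Illustrative assumption: 10% of papers are non-replicable
print(round(ppv(0.10, 0.80, 0.90), 2))  # 0.47: under half of flagged papers truly non-replicable
```

With a higher base rate the same screener looks far better, which is exactly why its usefulness cannot be judged without knowing the prevalence in the literature it will be applied to.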

My reservations here are an extension of broader concerns about reliance on text-mining in meta-science (see e.g. https://peerj.com/articles/1715/). We have this wonderful ability to pull in mountains of data from the online literature to see patterns that might be undetectable otherwise. But ultimately, the information that we extract cannot give more than a superficial sense of the content. It seems sometimes that we're moving to a situation where science will be done by bots, leaving the human brain out of the process altogether. This would, to my mind, be a mistake.

Sunday, 8 September 2019

Voting in the EU Referendum: Ignorance, deceit and folly

As a Remainer, I am baffled as to what Brexiteers want. If you ask them, as I sometimes do on Twitter, they mostly give you slogans such as "Taking Back Control". I'm more interested in specifics, i.e. what things do people think will be better for them if we leave. It is clear that things that matter to me – the economy, the health service, scientific research, my own freedom of movement in Europe - will be damaged by Brexit. I would put up with that if there was some compensating factor that would benefit other people, but I'm not convinced there is. In fact, all the indications are that people who voted to leave will suffer the negatives of Brexit just as much as those who voted to remain.

But are people who want to leave really so illogical? Brexit and its complexities is a long way from my expertise. I've tried to educate myself so that I can understand the debates about different options, but I'm aware that, despite being highly educated, I don't know much about the EU. I recently decided that, as someone who is interested in evidence, I should take a look at some of the surveys on this topic. We all suffer from confirmation bias, the tendency to seek out, process and remember just that information that agrees with our preconceptions, so I wanted to approach this as dispassionately as I could. The UK in a Changing Europe project is a useful starting place. They are funded by the Economic and Social Research Council, and appear to be even-handed. I have barely begun to scratch the surface of the content of their website, but I found their report Brexit and Public Opinion 2019 provides a useful and readable summary of recent academic research.

One paper summarised in the Brexit and Public Opinion 2019 report caught my attention in particular. Carl, Richards and Heath (2019) reported results from a survey of over 3000 people, selected to be broadly representative of the British population, who were asked 15 questions about the EU. Overall, there was barely any difference between Leave and Remain voters in the accuracy of answers. The authors noted that the results counteracted a common belief, put forward by some prominent commentators – that Leave voters had, on average, a weaker understanding of what they voted for than Remain voters. Interestingly, Carl et al did confirm, as other surveys had done, that those voting Leave were less well-educated than Remain voters, and indeed, in their study, the Leave voters did less well on a test of probabilistic reasoning. But this was largely unrelated to their responses to the EU survey. The one factor that did differentiate Leave and Remain voters was how they responded to a subset of questions that were deemed 'ideologically convenient' for their position: I have replotted the data below*. As an aside, I'm not entirely convinced by the categorisation of certain items as ideologically convenient - shown in the figure with £ and € symbols - but that is a minor point.
Responses to survey items from Carl et al (2019) Table 1.  Items marked £ were regarded as ideologically convenient for Brexit voters; those marked € as convenient for Remain voters
I took a rather different message away from the survey, however. I have to start by saying that I was rather disappointed when I read the survey items, because they didn't focus on the implications of EU membership for individuals. I would have liked to see items probing knowledge of how leaving the EU might affect trade, immigration and travel, and relations between England and the rest of the UK. The survey questions instead tested factual knowledge about the EU, which could be scored using a simple Yes/No response format. When seeking evidence for the validity of the referendum, it would perhaps have been more relevant to assess how accurately people estimated the costs and benefits of EU membership.

With that caveat, the most striking thing to me was how poorly people did on the survey, regardless of whether they voted Leave or Remain. There were 15 two-choice questions. If people were just guessing at random, they would be expected to score on average 7.5, with 95% of people scoring between 4 and 11.  Carl et al plotted the distribution of scores (Figure 2) and noted that the average score was only 8.8, not much higher than what would be expected if people were just guessing. Only 11.2% of Leave voters and 13.1% of Remain voters scored 12 or more. However, the item-level responses indicate that people weren't just guessing, because there were systematic differences from item to item. On some items, people did better than chance. But, as Carl et al noted, there were four items where people performed below chance. Three of these items had been designated as "ideologically convenient" for the Remain position, and one as convenient for the Leave position.
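The chance-level figures quoted above follow directly from the binomial distribution for 15 two-choice items; a quick check, using only the numbers stated in the text:

```python
from math import comb

N, P = 15, 0.5  # 15 two-choice questions, answered by pure guessing

def prob(k):
    """P(exactly k correct) under pure guessing: binomial probability."""
    return comb(N, k) * P**k * (1 - P)**(N - k)

mean = sum(k * prob(k) for k in range(N + 1))     # expected score
p_mid = sum(prob(k) for k in range(4, 12))        # P(4 <= score <= 11)
p_high = sum(prob(k) for k in range(12, N + 1))   # P(score >= 12)

print(mean)              # 7.5
print(round(p_mid, 3))   # 0.965: ~95% of guessers score between 4 and 11
print(round(p_high, 3))  # 0.018: under 2% of guessers would score 12+
```

So the 11.2% of Leave voters and 13.1% of Remain voters who scored 12 or more are several times the rate expected under guessing, consistent with the item-level evidence that responses were systematic rather than random.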

Figure 1 from Carl et al (2019). Distributions of observed scores and scores expected under guessing.

Carl et al cited a book by Jason Brennan, Against Democracy, which argues that "political decisions are presumed to be unjust if they are made incompetently, or in bad faith, or by a generally incompetent decision-making body". I haven't read the book yet, but that seems a reasonable point.

However, having introduced us to Brennan's argument, Carl et al explained: "Although our study did not seek to determine whether voters overall were sufficiently well informed to satisfy Brennan's (2016) ‘competence principle’, it did seek to determine whether there was a significant disparity in knowledge between Leave and Remain voters, something which––if present––could also be considered grounds for questioning the legitimacy of the referendum result."

My view is that, while Carl et al may not have set out to test the competence principle, their study nevertheless provided evidence highly relevant to the principle, evidence that challenges the validity of the referendum. If one accepts the EU questionnaire as an indicator of competence, then both Leave and Remain voters are severely lacking. Not only do they show a woeful ignorance of the EU, they also, in some respects, show evidence of systematic misunderstanding. 72% of Leave voters and 50% of Remain voters endorsed the statement that "More than ten per cent of British government spending goes to the EU" (Item M in Figure 1). According to the Europa.eu website, the correct figure is 0.28%. So the majority of people think that we send the EU at least 36 times more money than is the case. The lack of an overall difference between Leave and Remain voters is of interest, but the level of ignorance or systematic misunderstanding on key issues is striking in both groups. I don't exclude myself from this generalisation: I scored only 10 out of 15 in the survey, and there were some lucky guesses among my answers.

I have previously made a suggestion that seems in line with Jason Brennan's ideas – that if we were to have another referendum, people should first have to pass a simple quiz to demonstrate that they have a basic understanding of what they are voting for. The results of Carl et al suggest, however, that this would disenfranchise most of the population. Given how ignorant we are about the EU, it does seem remarkable that we are now in a position where we have a deeply polarised population, with people self-identifying as Brexit or Remain voters more strongly than they identify with political parties (Evans & Schaffner, 2019).

*I would like to thank Lindsay Richards for making the raw data available to me, in a very clear and well-documented format. 


Carl, N., Richards, L., & Heath, A. (2019). Leave and Remain voters' knowledge of the EU after the referendum of 2016. Electoral Studies, 57, 90-98. https://doi.org/10.1016/j.electstud.2018.11.003

Evans, G. & Schaffner, F. (2019). Brexit identity vs party identity. In A. Menon (Ed). Brexit and public opinion 2019.