Wednesday, 29 June 2022

A proposal for data-sharing that discourages p-hacking

Open data is a great way of giving confidence in the reproducibility of research findings. Although we are still a long way from having adequate implementation of data-sharing in psychology journals (see, for example, this commentary by Kathy Rastle, editor of Journal of Memory and Language), things are moving in the right direction, with an increasing number of journals and funders requiring sharing of data and code. But there is a downside, and I've been thinking about it this week, as we've just published a big paper on language lateralisation, with all the code and data available on Open Science Framework.

One problem is p-hacking. If you put a large and complex dataset in the public domain, anyone can download it and run multiple unconstrained analyses until they find something, which is then retrospectively fitted to a plausible-sounding hypothesis. The potential to generate non-replicable false positives by such a process is extremely high - far higher than many scientists recognise. I illustrated this with a fictitious example here.
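To get a sense of the scale of the problem, here is a minimal simulation sketch (not the fictitious example linked above; the numbers of participants, measures and repetitions are arbitrary). It generates pure-noise data, tests every measure against an arbitrary grouping variable, and counts how often at least one test comes out "significant":

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2022)

n_subjects = 100      # participants in the hypothetical shared dataset
n_variables = 20      # unrelated measures available to trawl through
n_simulations = 1000  # repeats of the whole "study"
alpha = 0.05

false_positive_runs = 0
for _ in range(n_simulations):
    # Pure noise: no measure is truly related to the grouping variable.
    data = rng.standard_normal((n_subjects, n_variables))
    group = rng.integers(0, 2, n_subjects).astype(bool)
    # Unconstrained trawl: test every measure and keep the smallest p-value.
    pvals = [stats.ttest_ind(data[group, i], data[~group, i]).pvalue
             for i in range(n_variables)]
    if min(pvals) < alpha:
        false_positive_runs += 1

print(f"Proportion of datasets yielding at least one 'significant' result: "
      f"{false_positive_runs / n_simulations:.2f}")
# With 20 independent tests at alpha = .05, expect roughly 1 - 0.95**20, i.e. about 0.64.
```

In other words, an analyst who trawls through just 20 unrelated measures will find something "significant" in most datasets they touch, even when there is nothing there.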

Another problem is self-imposed publication bias: the researcher runs a whole set of analyses to test promising theories, but forgets about them as soon as they turn up a null result. With both of these processes in operation, data sharing becomes a poisoned chalice: instead of increasing scientific progress by encouraging novel analyses of existing data, it just means more unreliable dross is deposited in the literature. So how can we prevent this? 

In this Commentary paper, I noted several solutions. One is to require anyone accessing the data to submit a protocol which specifies the hypotheses and the analyses that will be used to test them. In effect this amounts to preregistration of secondary data analysis. This is the method used for some big epidemiological and medical databases. But it is cumbersome and also costly - you need the funding to support additional infrastructure for gatekeeping and registration. For many psychology projects, this is not going to be feasible. 

A simpler solution would be to split the data into two halves - those doing secondary data analysis initially have access only to part A, which allows them to do exploratory analyses, after which they can see whether any findings hold up in part B. Statistical power will be reduced by this approach, but with large datasets it may still be high enough to detect effects of interest. I wonder if it would be relatively easy to incorporate this option into Open Science Framework: i.e. someone who commits a preregistration of a secondary data analysis, on the basis of exploratory analysis of half a dataset, then receives a code that unlocks the second half of the data (the hold-out sample). A rough outline of how this might work is shown in Figure 1.

Figure 1: A possible flowchart for secondary data analysis on a platform such as OSF
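As a rough sketch of how the data owner might implement the split (purely illustrative: the file names are placeholders, and the idea of publishing a checksum of the hold-out file so that analysts can later verify it was fixed in advance is my own assumption, not an existing OSF feature):

```python
import hashlib
import pandas as pd

def split_for_sharing(csv_path, seed=2022):
    """Split a dataset into an open exploratory half (part A) and a
    hold-out half (part B) to be released only after preregistration."""
    data = pd.read_csv(csv_path)
    part_a = data.sample(frac=0.5, random_state=seed)  # random half of the rows
    part_b = data.drop(part_a.index)                   # the remaining rows
    part_a.to_csv("part_A_exploratory.csv", index=False)
    part_b.to_csv("part_B_holdout.csv", index=False)
    # Publishing a hash of the hold-out file alongside part A would let
    # analysts verify, once part B is released, that it was fixed in
    # advance and has not been altered.
    with open("part_B_holdout.csv", "rb") as f:
        print("Hold-out checksum:", hashlib.sha256(f.read()).hexdigest())

# Hypothetical usage:
# split_for_sharing("lateralisation_data.csv")
```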

An alternative that has been discussed by MacCoun and Perlmutter is blind analysis - "temporarily and judiciously removing data labels and altering data values to fight bias and error". The idea is that you can explore a dataset and develop a planned analysis on it, but the results cannot bias your analytic decisions, because the data have been altered and you do not know which values are correct. A variant of this approach would be to deposit multiple datasets with shuffled data in all but one of them. The shuffling would be similar to what is done in permutation analysis - so there might be ten versions of the dataset deposited, with only one containing the original, unshuffled data. Those downloading the data would not know whether or not they had the correct version; only after they had decided on an analysis plan would they be told which dataset it should be run on.
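A minimal sketch of how the shuffled decoy datasets might be generated (illustrative only: I am assuming the data owner decides which outcome columns to shuffle, and keeps the position of the true version secret until an analysis plan has been registered):

```python
import numpy as np
import pandas as pd

def make_decoy_datasets(data: pd.DataFrame, outcome_cols, n_versions=10, seed=2022):
    """Return n_versions copies of the data, only one of which (at a random,
    undisclosed position) keeps the true outcome values; in the others the
    outcome columns are independently shuffled across rows, as in a
    permutation test."""
    rng = np.random.default_rng(seed)
    true_position = int(rng.integers(n_versions))  # kept secret by the data owner
    versions = []
    for i in range(n_versions):
        copy = data.copy()
        if i != true_position:
            for col in outcome_cols:
                copy[col] = rng.permutation(copy[col].to_numpy())
        versions.append(copy)
    return versions, true_position
```

An analyst could explore any one of the ten versions and commit to an analysis plan; only then would the data owner reveal which version it should be run on.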

I don't know if these methods would work, but I think they have potential for keeping people honest in secondary data analysis, while minimising bureaucracy and cost. On a platform such as Open Science Framework it is already possible to create a time-stamped preregistration of an analysis plan. I assume that within OSF there is already a log that indicates who has downloaded a dataset. So someone who wanted to do things right would download just one dataset (either a random half, or one of a set of shuffled datasets) and would then need a mechanism that gave them access to the full, correct data once they had preregistered an analysis, along the lines outlined above.

These methods are not foolproof. Two researchers could collude - or one researcher could adopt multiple personas - so that they get to see the correct data as person A and then start a new process as person B, in which they can preregister an analysis whose results are already known. But my sense is that there are many honest researchers who would welcome this approach - precisely because it would keep them honest. Many of us enjoy exploring datasets, but it is all too easy to fool yourself into thinking you've turned up something exciting when it is really just a fluke arising in the course of excessive data-mining.

Like a lot of my blogposts, this is just a brain dump of an idea that is not fully thought through. I hope by sharing it, I will encourage people to come up with criticisms that I haven't thought of, or alternatives that might work better. Comments on the blog are moderated to prevent spam, but please do not be deterred - I will post any that are on topic. 

P.S. 5th July 2022 

Florian Naudet drew my attention to this very relevant paper:

Baldwin, J. R., Pingault, J.-B., Schoeler, T., Sallis, H. M., & Munafò, M. R. (2022). Protecting against researcher bias in secondary data analysis: Challenges and potential solutions. European Journal of Epidemiology, 37(1), 1–10. https://doi.org/10.1007/s10654-021-00839-0