Saturday 27 March 2021

Open data: We know what's needed - now let's make it happen



The week after the end of University term and before the Easter break is chock-a-block with conferences and meetings. A great thing about them all being virtual is that you can go to far more. My brain is zinging (and my to-do list sadly neglected) after 3 days of the Computational Research Integrity Conference (CRI-Conf) hosted at Syracuse University, and then a 1.5 hr webinar organised by the Center for Biomedical Research Transparency (CBMRT) on Accelerating Science in the Age of COVID-19. Synthesising all the discussion in my head gives one clear message: we need to get a grip on the failure of authors to make their data, and analysis scripts, open.

For a great overview of CRI-Conf, see this blog by one of the presenters, Debora Weber-Wulff. The meeting focused mainly on fraud, and the development of computational methods to detect it, with most emphasis on doctored images. There were presentations by those who detect manipulated data (notably Elisabeth Bik and Jennifer Byrne), by technology experts developing means of automating image analysis, by publishers and research integrity officers who attempt to deal with the problem, and by those who have created less conventional means to counteract the tide of dodgy work (Boris Barbour from PubPeer, and Ivan Oransky from Retraction Watch). The shocking thing was the discovery that fabricated data in the literature is not down to a few bad apples: there are "paper mills" that generate papers for sale, which are readily snapped up by those who need them for professional advancement.

CRI-Conf brought together people who viewed the problem from very different perspectives, with a definite tension between those representing "the system" - publishers, institutions and funders - and those on the outside - the data sleuths, PubPeer, Retraction Watch. The latter are impatient at the failure of the system to act promptly to remove fraudulent papers from the literature; the former respond that they are doing a lot already, but the problem is immense and due process must be followed. There was, however, one point of agreement. Life would be easier for all of us if data were routinely published with papers. Research integrity officers in particular noted that a great deal of time in investigations is spent tracking down data.

The CBMRT webinar yesterday was particularly focused on the immense amount of research that has been generated by the COVID-19 pandemic. Ida Sim from Vivli noted that only 70 of 924 authors of COVID-19 clinical trials agreed to share their data within 6 months. John Inglis, co-founder of bioRxiv and medRxiv, cited Marty Lakary's summary of preprints: "a great disruptor of a clunky, slow system never designed for a pandemic". Deborah Dixon from Oxford University Press noted how open science assumed particular importance in the pandemic: open data not only make it possible to check a study's findings, but can also be used fruitfully as secondary data for new analyses. Finally, a clinician's perspective was provided by Sandra Petty. Her view is that there are too many small underpowered studies: we need large collaborative trials. My favourite quote: "Noise from smaller studies rapidly released to the public domain can create a public health issue in itself".

Everyone agreed that open data could be a game-changer, but clearly it was still the exception rather than the rule. I asked whether it should be made mandatory, not just for journal articles but also for preprints. The replies were not encouraging. Ida Sim, who understands the issues around data-sharing better than most, noted that there were numerous barriers - there may be legal hurdles to overcome, and those doing trials may not have the expertise, let alone the time, to get their data into appropriate format. John Inglis noted it would be difficult for moderators of preprint servers to implement a data-sharing requirement for pre-prints, and that many authors would find it challenging.

I am not, however, convinced. It's clear that there is a tsunami of research on COVID-19, much of it of very poor quality. This is creating problems for journals, reviewers, and for readers, who have to sift through a mountain of papers to try and extract a signal from the noise. Setting the bar for publication - or indeed pre-prints - higher, so that the literature only contains papers that can be (a) checked and (b) used for meta-analysis and secondary studies, would reduce the mountain to a hillock and allow us to focus on the serious stuff. 

The pandemic has indeed accelerated the pace of research, but things done in a hurry are more likely to contain errors, so it is more important than ever to be able to check findings, rather than just trusting authors to get it right. I'm thinking of the recent example where an apparent excess of COVID-19 in toddlers was found to be due to restricting age to a 2-digit number, so someone aged 102 would be listed as a 2-year-old. We may be aghast, but I feel "there but for the grace of God go I". Unintentional errors are everywhere, and when the stakes are as high as they are now, we need to be able to check and double-check the findings of studies that are going to be translated into clinical practice. That means sharing analysis code as well as data. As Philip Stark memorably said, "Science should be 'show me', not 'trust me'".
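
To see how easily the age error could arise, here is a minimal Python sketch. I don't know the details of how the real dataset was processed, so everything here is an assumption made for illustration: the function names are hypothetical, and I simply suppose that age was stored in a fixed two-character field that kept only the last two digits. The second function shows the kind of check that would have flagged the problem rather than silently turning a centenarian into a toddler.

```python
def store_age_two_digits(age: int) -> str:
    """Hypothetical lossy storage: keep only the last two digits of the age."""
    # Zero-pad to at least two digits, then keep the final two characters.
    return f"{age:02d}"[-2:]


def store_age_checked(age: int, width: int = 2) -> str:
    """Hypothetical safer storage: refuse values that do not fit the field."""
    text = str(age)
    if age < 0 or len(text) > width:
        raise ValueError(f"age {age} does not fit in a {width}-digit field")
    return text.zfill(width)


if __name__ == "__main__":
    for age in (7, 45, 102):
        stored = store_age_two_digits(age)
        print(f"true age {age:>3} -> stored '{stored}' -> read back as {int(stored)}")

    try:
        store_age_checked(102)
    except ValueError as err:
        print("checked version refused the value:", err)
```

With the lossy version, ages 7 and 45 survive the round trip but 102 comes back as 2. Anyone with access to the raw data and the script could spot this in minutes, which is the point of asking for both to be shared.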

All in all, my sense is that we still have a long way to go before we shift the balance in publishing from a focus on the needs of authors (to get papers out rapidly) to an emphasis on users of research, for whom the integrity of the science is paramount. As Besançon et al put it: Open Science saves lives.

1 comment:

  1. Good points, but we first need open media to publish. Like always research output is controlled by big publishers and their businesses. Before we could not openly access the impact factors of journals without WoS subscription. There are unlimited data in published articles, but limited access for big data analysis.