Sunday, 15 November 2015

Who's afraid of Open Data

Cartoon by John R. McKiernan, downloaded from:

I was at a small conference last year, catching up on gossip over drinks, and somehow the topic moved on to journals, and the pros and cons of publishing in different outlets. I was doing my best to advocate for open access, and to challenge the obsession with journal impact factors. I was getting the usual stuff about how early-career scientists couldn't hope to have a career unless they had papers in Nature and Science, but then the conversation took an interesting turn.

"Anyhow," said eminent Professor X. "One of my postdocs had a really bad experience with a PLOS journal."

Everyone was agog. Nothing better at conference drinks than a new twist on the story of evil reviewer 3.  We waited for him to continue. But the problem was not with the reviewers.

"Yup. She published this paper in PLOS Biology, and of course she signed all their forms. She then gave a talk about the study, and there was this man in the audience, someone from a poky little university that nobody had ever heard of, who started challenging her conclusions. She debated with him, but then, when she gets back she has an email from him asking for her data."

We wait with bated breath for the next revelation.

"Well, she refused of course, but then this despicable person wrote to the journal, and they told her that she had to give it to him! It was in the papers she had signed."

Murmurs of sympathy from those gathered round. Except, of course, me. I just waited for the denouement. What had happened next, I asked.

"She had to give him the data. It was really terrible. I mean, she's just a young researcher starting out."

I was still waiting for the denouement. Except that there was no more. That was it! Being made to give your data to someone was a terrible thing. So, being me, I asked, why was that a problem? Several people looked at me as if I was crazy.

"Well, how would you like it if you had spent years of your life gathering data, data which you might want to analyse further, and some person you have never heard of comes out of nowhere demanding to have it?"

"Well, they won't stop you analysing it," I said.

"But they may scoop you and find something interesting in it before you have a chance to publish it!"

I was reminded of all of this at a small meeting that we had in Oxford last week, following up on the publication of a report of a symposium I'd chaired on Reproducibility and Reliability of Biomedical Research. Thanks to funding by the St John's College Research Centre, a small group of us were able to get together to consider ways in which we could take forward some of the ideas in the report for enhancing reproducibility. We covered a number of topics, but the one I want to focus on here is data-sharing.

A move toward making data and analyses open is being promoted in a top-down fashion by several journals, and universities and publishers have been developing platforms to make this possible. But many scientists are resisting this process, and putting forward all kinds of argument against it. I think we have to take such concerns seriously: it is all too easy to mandate new actions for scientists to follow that have unintended consequences and just lead to time-wasting, bureaucracy or perverse incentives. But in this case I don't think the objections withstand scrutiny. Here are the main ones we identified at our meeting:

1.  Lack of time to curate data: data are only useful if they are understandable, and documenting a dataset adequately is a non-trivial task;

2.  Personal investment - sense of not wanting to give away data that had taken time and trouble to collect to other researchers who are perceived as freeloaders;

3. Concerns about being scooped before the analysis is complete;

4.  Fear of errors being found in the data;

5.  Ethical concerns about confidentiality of personal data, especially in the context of clinical research;

6.  Possibility that others with a different agenda may misuse the data, e.g. perform a selective analysis that misrepresents the findings.

These have partial overlap with points raised by Giorgio Ascoli (2015) when describing NeuroMorpho.Org, an online data-sharing repository for digital reconstructions of neuronal morphology. Despite the great success of the repository, it is still the case that many people fail to respond to requests to share their data, and points 1 and 2 seemed the most common reasons.

As Ascoli noted, however, there are huge benefits to data-sharing, which outweigh the time costs. Shared data can be used for studies that go beyond the scope of the original work, with particular benefits arising when there is pooling of datasets. Some illustrative examples from the field of brain imaging were provided by Thomas Nichols at our meeting (slides here), where a range of initiatives is being developed to facilitate open data. Data-sharing is also beneficial for reproducibility: researchers will check data more carefully when they are to be shared, and even if nobody consults the data, the fact that they are available gives confidence in the findings. Shared data can also be invaluable for hands-on training. A nice example comes from Nicole Janz, who teaches a replication workshop in social sciences in Cambridge, where students pick a recently published article in their field and try to obtain the data so they can replicate the analysis and results.

These are mostly benefits to the scientific community, but what about the 'freeloader' argument? Why should others benefit when you have done all the hard work? In fact, when we consider that scientists are usually receiving public money to make scientific discoveries, this line of argument does not appear morally defensible. But in any case, it is not true that the scientists who do the sharing have no benefits. For a start, they will see an increase in citations, as others use their data. And another point, often overlooked, is that uncurated data often become unusable by the original researcher, let alone other scientists, if they are not documented properly and stored on a safe digital site. Like many others, I've had the irritating experience of going back to some old data only to find I can't remember what some of the variable names refer to, or whether I should be focusing on the version called final, finalfinal, or ultimate. I've also had the experience of data being stored on a kind of floppy disk, or coded by a software package that had a brief flowering of life for around 5 years before disappearing completely.

Concerns about being scooped are frequently cited, but are seldom justified. Indeed, if we move to a situation where a dataset is a publication with its own identifier, then the original researcher will get credit every time someone else uses the dataset. And in general, having more than one person doing an analysis is an important safeguard, ensuring that results are truly replicable and not just a consequence of a particular analytic decision (see this article for an illustration of how re-analysis can change conclusions).

The 'fear of errors' argument is, of course, understandable but not defensible. The way to respond is to say of course there will be errors – there always are. We have to change our culture so that we do not regard it as a source of shame to publish data in which there are errors, but rather as an inevitability that is best dealt with by making the data public so the errors can be tracked down.

Ethical concerns about confidentiality of personal data are a different matter. In some cases, participants in a study have been given explicit reassurances that their data will not be shared: this was standard practice for many years before it was recognised that such blanket restrictions were unhelpful and typically went way beyond what most participants wanted – which was that their identifiable data would not be shared.  With training in sophisticated anonymization procedures, it is usually possible to create a dataset that can be shared safely without any risk to the privacy of personal information; researchers should be anticipating such usage and ensuring that participants are given the option to sign up to it.

Fears about misuse of data can be well-justified when researchers are working on controversial areas where they are subject to concerted attacks by groups with vested interests or ideological objections to their work. There are some instructive examples here and here. Nevertheless, my view is that such threats are best dealt with by making the data totally open. If this is done, any attempt to cherrypick or distort the results will be evident to any reputable scientist who scrutinises the data. This can take time and energy, but ultimately an unscientific attempt to discredit a scientist by alternative analysis will rebound on those who make it.  In that regard, science really is self-correcting. If the data are available, then different analyses may give different results, but a consensus of the competent should emerge in the long run, leaving the valid conclusions stronger than before.

I'd welcome comments from those who have started to use open data and to hear your experiences, good or bad.

P.S. As I was finalising this post, I came across some great tweets from the OpenCon meeting taking place right now in Brussels. Anyone seeking inspiration and guidance for moving to an open science model should follow the #opencon hashtag, which links to materials such as these: Slides from keynote by Erin McKiernan, and resources at