Sunday 26 May 2024

Are commitments to open data policies worth the paper they are written on?


As Betteridge's law of headlines states: "Any headline that ends in a question mark can be answered by the word no."  So you know where I am going with this.  


I'm a longstanding fan of open data - in fact, I first blogged about this back in 2015. So I've been gratified to see the needle shift on this, in the sense that over the past decade, in a rush to present themselves as good guys, various institutions and publishers have published policies supporting open data. The problem is that when you actually ask them to implement those policies, they back down.   


I discussed arguments for and against data-sharing in a Commentary article in 2016. I divided the issues according to whether they focused on the impact of data-sharing on researchers or on research participants. Table 1 from that article, entitled "Conflict between interests of researchers and advancement of science" is reproduced here:




1. Lack of time to curate data.

Response: Unless adequately curated, data will over time become unusable, including by the original researcher.

2. Personal investment — reluctance to give data to freeloaders.

Response: Reuse of data increases its value, and the researcher benefits from additional citations. There is also an ethical case for maximizing use of data obtained via public funding.

3. Concerns about being scooped before the analysis is complete.

Response: This is a common concern, though there are few attested cases. A time-limited period of privileged use by the study team can be specified to avoid scooping.

4. Fear of errors being found in the data.

Response: Culture change is needed to recognize that errors are inevitable in any large dataset and should not be a reason for reputational damage. Data-sharing allows errors to be found and corrected.


I then went on to discuss two other concerns which focused on implications of data-sharing for human participants, viz:

5.  Ethical concerns about confidentiality of personal data, especially in the context of clinical research

6.  Possibility that others with a different agenda may misuse the data, e.g. perform selective analyses that misrepresent the findings.


These last two issues raise complex concerns and there's plenty to discuss on how to address them, but I'll put that to one side for now, as the case I want to comment on concerns a simple dataset where there is limited scope for secondary analyses and where no human participants are involved.


My interest was piqued by comments on PubPeer about a paper entitled "Magnetic field screening in hydrogen-rich high-temperature superconductors".  The thread on PubPeer starts with this extraordinary comment by J. E. Hirsch:


I requested the underlying data for Figs. 3a, 3e, 3b, 3f of this paper on Jan 11, 2023. This is because the published data for Figs. 3a and 3e, as well as for Figs. 3b and 3f, are nominally the same but incompatible with each other, and I would like to understand why that is. I asked the authors to explain, but they did not provide an explanation. Neither did they supply the data. The journal told me that it had received the data from the authors but will not share them with me because they are "confidential". I requested that the journal posts an Editor Note informing readers that data are unavailable to readers. The journal responded that because data were share with editors they "cannot write an editorial note on the published article stating the data is unavailable as this would be factually incorrect".


Pseudonymous commenter Orchestes quercus drew attention to the Data Availability statement in the article: "The data that support the findings of this study are available from the corresponding authors upon reasonable request".


J. E. Hirsch then added a further comment: 


The underlying data are still not available, the editor says the author deems the request "unreasonable" but it cannot divulge the reasoning behind it, nor can the journal publish an editor note that there are restrictions on data availability because the data were provided to the journal.  Springer Nature's Research Integrity Director wrote to me in September 2023 that "we recognize the right of the authors to not share the data with you, in line with the authors' chosen data availability statement", and that "As Springer Nature considers the correspondence with the authors confidential, we cannot share with you any further details."


Now, I know nothing whatsoever about superconductors or J. E. Hirsch, but I think the editors, publisher and the authors are making themselves look very silly, and indeed suspicious, by refusing to share the data.  They can't plead patient confidentiality or ethical restrictions - it seems they are just refusing to comply because they don't want to.  


To up the ante, Orchestes quercus extracted data from the figures and did further analyses, which confirmed that J. E. Hirsch had a point - the data did not appear to be internally consistent.


Meanwhile, I had joined the PubPeer thread, pointing out:


The authors and editor appear to be in breach of the policy of Nature Portfolio journals, stated here, viz:

An inherent principle of publication is that others should be able to replicate and build upon the authors' published claims. A condition of publication in a Nature Portfolio journal is that authors are required to make materials, data, code, and associated protocols promptly available to readers without undue qualifications. Any restrictions on the availability of materials or information must be disclosed to the editors at the time of submission. Any restrictions must also be disclosed in the submitted manuscript.

After publication, readers who encounter refusal by the authors to comply with these policies should contact the chief editor of the journal. In cases where editors are unable to resolve a complaint, the journal may refer the matter to the authors' funding institution and/or publish a formal statement of correction, attached online to the publication, stating that readers have been unable to obtain necessary materials to replicate the findings.

I also noted that two of the authors are based at a Max Planck Institute. The Max Planck Gesellschaft is a signatory to the Berlin Declaration on Open Access to Knowledge in the Sciences and Humanities.  On the website it states:

the Max Planck Society (MPG) is committed to the goal of providing free and open access to all publications and data from scholarly research (my emphasis).


Well, the redoubtable J. E. Hirsch had already thought of that, and in a subsequent PubPeer comment made public various exchanges he had had with luminaries from the Max Planck Institutes.


All I can say to the Max Planck Gesellschaft is that this is not a good look. Hirsch has noted an inconsistency in the published figures.  This has been confirmed by another reader and needs to be explained. The longer people dig in defensively, attacking the person making the request rather than just showing the raw data, the more it looks as if something fishy is going on here.


Why am I so hung up on data-sharing? The reason is simple. The more I share my own data, or use data shared by others, the more I appreciate the value of doing so. Errors are ubiquitous, even when researchers are careful, but we'll never know about them if data are locked away.


Furthermore, it is a sad reality that fraudulent papers are on the rise, and open data is one way of defending against them. It's not a perfect defence: people can invent raw data as well as summary data, but realistic data are not so easy to fake, and requiring open data would slow down the fraudsters and make them easier to catch.


Having said that, asking for data is not tantamount to accusing researchers of fraud: it should be accepted as normal scientific practice to make data available in order that others can check the reproducibility of findings. If someone treats such a request as an accusation, or deems it "unreasonable", then I'm afraid it just makes me suspicious.  


And if organisations like Springer Nature and Max Planck Gesellschaft won't back up their policies with action, then I think they should delete them from their websites. They are presenting themselves as champions of open, reproducible science, while acting as defenders of non-transparent, secret practices. As we say in the UK, fine words butter no parsnips.   


P.S. 27th May: A comprehensive account of the superconductivity affair has just appeared on the website For Better Science. This suggests things are even worse than I thought.


In addition, you can see Jorge Hirsch explain his arduous journey in attempting to access the data here.    


NOTE ON COMMENTS: Many thanks to those who have commented. Comments are moderated to prevent spam, so there is a delay before they appear, but I will accept on-topic comments in due course.


  1. I wonder if the editors are exploiting (or confused by) the parts of the guidelines that contradict the grandiose principle:

    "Any restrictions on the availability of materials or information must be disclosed to the editors at the time of submission. Any restrictions must also be disclosed in the submitted manuscript."

    "The data availability statement must make the conditions of access to the “minimum dataset” that are necessary to interpret, verify and extend the research in the article, transparent to readers."

  2. No doubt that's what they are doing, but why? They can give no good reason for not sharing, so it just looks as if they are worried about something being found out if they do share.

  3. A lot of things become much easier to explain if you assume that in a great many cases, the data either do not exist, or are obviously fabricated. We need to start applying the same levels of skepticism as the criminal justice system, from police officers who do not necessarily immediately accept the first excuse offered, to judges who understand that people frequently lie, obfuscate, and confabulate.

  4. I asked for the data supporting this study last year:

    The paper says the data are available on "reasonable request". No one at the journal or the publisher was willing to do anything to encourage the authors to share the data.

    Upon careful examination, it is very suspicious. Never mind! It was only given a press release by BMJ and published on the BBC.

  5. When reviewing papers that say 'data available on request' (not even reasonable request) I've taken to pointing out that this is a euphemism for 'not available'. I then suggest either making the data publicly available or saying 'data not available'. Surprisingly (to me), authors have always opted for the latter. At least they had to be honest about it, I guess.

  6. It is time that journals enforce raw data deposition prior to publication, that is, at the peer-review stage. The Nat. Comm. example just shows that a sentence like "data are available upon reasonable request" means that data will be made available only to certain categories of people, and certainly not to the average citizen/taxpayer. Publishers could dedicate a small fraction of their outrageous APCs to ensuring long-term preservation of raw data in their own databases, including persistent DOIs in the published pdf. This would be more beneficial than burning energy in their stupid IFs. Btw, some publishers have been doing it for decades, even in the pre-internet era: I recently requested raw data from the IUCr (International Union of Crystallography) for a structure published 40 years ago (structure factors, necessary to check the correctness of a published X-ray structure). Five days later, I received a link to the pdf of a scan including all data deposited in 1984 (hardcopy at that time, 40 pages or so). And yes, the structure was fine :) Ah, but wait... If I am not mistaken, the IUCr is a charity, not a for-profit organization like Springer-Nature.