Sunday, 15 November 2015

Who's afraid of Open Data

Cartoon by John R. McKiernan, downloaded from:

I was at a small conference last year, catching up on gossip over drinks, and somehow the topic moved on to journals, and the pros and cons of publishing in different outlets. I was doing my best to advocate for open access, and to challenge the obsession with journal impact factors. I was getting the usual stuff about how early-career scientists couldn't hope to have a career unless they had papers in Nature and Science, but then the conversation took an interesting turn.

"Anyhow," said eminent Professor X. "One of my postdocs had a really bad experience with a PLOS journal."

Everyone was agog. Nothing better at conference drinks than a new twist on the story of evil reviewer 3.  We waited for him to continue. But the problem was not with the reviewers.

"Yup. She published this paper in PLOS Biology, and of course she signed all their forms. She then gave a talk about the study, and there was this man in the audience, someone from a poky little university that nobody had ever heard of, who started challenging her conclusions. She debated with him, but then, when she gets back she has an email from him asking for her data."

We wait with bated breath for the next revelation.

"Well, she refused of course, but then this despicable person wrote to the journal, and they told her that she had to give it to him! It was in the papers she had signed."

Murmurs of sympathy from those gathered round. Except, of course, me. I just waited for the denouement. What had happened next, I asked.

"She had to give him the data. It was really terrible. I mean, she's just a young researcher starting out."

I was still waiting for the denouement. Except that there was no more. That was it! Being made to give your data to someone was a terrible thing. So, being me, I asked, why was that a problem? Several people looked at me as if I was crazy.

"Well, how would you like it if you had spent years of your life gathering data, data which you might want to analyse further, and some person you have never heard comes out of nowhere demanding to have it?"

"Well, they won't stop you analysing it," I said.

"But they may scoop you and find something interesting in it before you have a chance to publish it!"

I was reminded of all of this at a small meeting that we had in Oxford last week, following up on the publication of a report of a symposium I'd chaired on Reproducibility and Reliability of Biomedical Research. Thanks to funding by the St John's College Research Centre, a small group of us were able to get together to consider ways in which we could take forward some of the ideas in the report for enhancing reproducibility. We covered a number of topics, but the one I want to focus on here is data-sharing.

A move toward making data and analyses open is being promoted in a top-down fashion by several journals, and universities and publishers have been developing platforms to make this possible. But many scientists are resisting this process, and putting forward all kinds of argument against it. I think we have to take such concerns seriously: it is all too easy to mandate new actions for scientists to follow that have unintended consequences and just lead to time-wasting, bureaucracy or perverse incentives. But in this case I don't think the objections withstand scrutiny. Here are the main ones we identified at our meeting:

1.  Lack of time to curate data: data are only useful if they are understandable, and documenting a dataset adequately is a non-trivial task;

2.  Personal investment - sense of not wanting to give away data that had taken time and trouble to collect to other researchers who are perceived as freeloaders;

3. Concerns about being scooped before the analysis is complete;

4.  Fear of errors being found in the data;

5.  Ethical concerns about confidentiality of personal data, especially in the context of clinical research;

6.  Possibility that others with a different agenda may misuse the data, e.g. by performing selective analyses that misrepresent the findings.

These have partial overlap with points raised by Giorgio Ascoli (2015) when describing NeuroMorpho.Org, an online data-sharing repository for digital reconstructions of neuronal morphology. Despite the great success of the repository, it is still the case that many people fail to respond to requests to share their data, and points 1 and 2 seemed the most common reasons.

As Ascoli noted, however, there are huge benefits to data-sharing, which outweigh the time costs. Shared data can be used for studies that go beyond the scope of the original work, with particular benefits arising when there is pooling of datasets. Some illustrative examples from the field of brain imaging were provided by Thomas Nichols at our meeting (slides here), where a range of initiatives is being developed to facilitate open data. Data-sharing is also beneficial for reproducibility: researchers will check data more carefully when it is to be shared, and even if nobody consults the data, the fact it is available gives confidence in the findings. Shared data can also be invaluable for hands-on training. A nice example comes from Nicole Janz, who teaches a replication workshop in social sciences in Cambridge, where students pick a recently published article in their field and try to obtain the data so they can replicate the analysis and results.

These are mostly benefits to the scientific community, but what about the 'freeloader' argument? Why should others benefit when you have done all the hard work? In fact, when we consider that scientists are usually receiving public money to make scientific discoveries, this line of argument does not appear morally defensible. But in any case, it is not true that the scientists who do the sharing have no benefits. For a start, they will see an increase in citations, as others use their data. And another point, often overlooked, is that uncurated data often become unusable by the original researcher, let alone other scientists, if they are not documented properly and stored on a safe digital site. Like many others, I've had the irritating experience of going back to some old data only to find I can't remember what some of the variable names refer to, or whether I should be focusing on the version called final, finalfinal, or ultimate. I've also had the experience of data being stored on a kind of floppy disk, or coded by a software package that had a brief flowering of life for around 5 years before disappearing completely.

Concerns about being scooped are frequently cited, but are seldom justified. Indeed, if we move to a situation where a dataset is a publication with its own identifier, then the original researcher will get credit every time someone else uses the dataset. And in general, having more than one person doing an analysis is an important safeguard, ensuring that results are truly replicable and not just a consequence of a particular analytic decision (see this article for an illustration of how re-analysis can change conclusions).

The 'fear of errors' argument is understandable but not defensible. The way to respond is to say of course there will be errors – there always are. We have to change our culture so that we do not regard it as a source of shame to publish data in which there are errors, but rather as an inevitability that is best dealt with by making the data public so the errors can be tracked down.

Ethical concerns about confidentiality of personal data are a different matter. In some cases, participants in a study have been given explicit reassurances that their data will not be shared: this was standard practice for many years before it was recognised that such blanket restrictions were unhelpful and typically went way beyond what most participants wanted – which was that their identifiable data would not be shared.  With training in sophisticated anonymization procedures, it is usually possible to create a dataset that can be shared safely without any risk to the privacy of personal information; researchers should be anticipating such usage and ensuring that participants are given the option to sign up to it.
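To make this a little more concrete, here is a minimal sketch in Python of a first de-identification pass. The field names (`participant_id`, `name`, `postcode`, `date_of_birth`), the made-up record and the salt are all assumptions chosen for illustration, not taken from any real study. The sketch only drops direct identifiers, coarsens date of birth to a year and replaces the ID with a salted hash. That is pseudonymisation rather than full anonymisation, and a real dataset would still need checking for indirect identifiers such as rare combinations of characteristics.

```python
import hashlib

# Hypothetical column names for direct identifiers (illustration only).
DIRECT_IDENTIFIERS = {"name", "postcode"}


def pseudonym(participant_id: str, salt: str) -> str:
    """Replace an ID with a salted hash: records stay linkable, but the
    original ID cannot be read off the shared file."""
    digest = hashlib.sha256((salt + participant_id).encode("utf-8"))
    return digest.hexdigest()[:10]


def deidentify(record: dict, salt: str) -> dict:
    """Return a copy of one participant record that is safer to share."""
    # Drop fields that identify a person directly.
    shared = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    # Swap the study ID for a pseudonym.
    shared["participant_id"] = pseudonym(str(record["participant_id"]), salt)
    # Coarsen an ISO date of birth ("YYYY-MM-DD") to the year only.
    dob = shared.pop("date_of_birth", None)
    if dob:
        shared["birth_year"] = int(dob[:4])
    return shared


# A made-up record, purely for demonstration.
record = {
    "participant_id": "P017",
    "name": "Jane Doe",
    "postcode": "AB1 2CD",
    "date_of_birth": "1984-06-02",
    "score": 42,
}
shared = deidentify(record, salt="keep-this-secret")
print(shared)
```

The salt has to be kept private by the research team; anyone who holds it could regenerate the pseudonyms from a list of known IDs.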

Fears about misuse of data can be well-justified when researchers are working on controversial areas where they are subject to concerted attacks by groups with vested interests or ideological objections to their work. There are some instructive examples here and here. Nevertheless, my view is that such threats are best dealt with by making the data totally open. If this is done, any attempt to cherrypick or distort the results will be evident to any reputable scientist who scrutinises the data. This can take time and energy, but ultimately an unscientific attempt to discredit a scientist by alternative analysis will rebound on those who make it.  In that regard, science really is self-correcting. If the data are available, then different analyses may give different results, but a consensus of the competent should emerge in the long run, leaving the valid conclusions stronger than before.

I'd welcome comments from those who have started to use open data and to hear your experiences, good or bad.

P.S. As I was finalising this post, I came across some great tweets from the OpenCon meeting taking place right now in Brussels. Anyone seeking inspiration and guidance for moving to an open science model should follow the #opencon hashtag, which links to materials such as these: Slides from keynote by Erin McKiernan, and resources at


  1. One aspect I found difficult is data curation - RCUK is now requesting data management plans but we are not educated to curate data, and tools and standards are scarce - I'm not looking for an excuse, just trying to see why we are still struggling with it and what we can do.

    1. I agree. There is a Digital Data Curation centre and perhaps they could help?

  2. I've made several excuses over recent years not to put my data freely online, but Number 4 is my biggest fear. I have now made a personal commitment to open science, and part of the commitment is to share all data, which I have started to do. I must say, though, that I live in fear that someone is going to find some errors and paper retractions will ensue! I am really grateful to someone of your standing for pushing this important issue. It makes it easier for us early career researchers to follow suit.

    1. Thanks Jim. When I wrote about this issue before, the comments were really helpful in getting things into perspective: especially on how programmers know that there are always bugs, and they share code precisely because you have to do that to find all the bugs.

  3. Actually it seems that highly experienced groups manage to get around the PLoS rules about open data in a variety of ways. They claim that their data are 'open' in the sense that you can contact them and say what you want to work on, and if it's not already one of their planned analyses, then they will collaborate with you and they will run the analyses, but you won't get the actual data just their output. Or other strategies.. So it's not really that "open" but they kind of act like it is. And I just saw a paper in one of the PLoS journals that used data from the MESA study and it actually said right in PLoS that 'all the data were in the paper' which seems totally ridiculous on the face of it. The paper just had the usual type of tables and there is no way the data set is somehow 'in the paper.'

    1. Yes, I agree. It is not really policed very thoroughly. I was quite surprised when I reviewed for PLOS that I, as reviewer, was asked whether the data sharing was adequate. I would have thought that scrutiny step should be carried out in a systematic way by the journal. And in the case I was involved in, the editor accepted a weak excuse from authors for not sharing.


  4. I have been putting up my data and code online for years now. I wish everyone would just do the same. We also created the Mind Research Repository with data and code from published papers:

    A lot of people don't share data in my field, even if one asks them. The most common reason is lack of time to assemble it for public release. If everyone used tools like knitr and made the paper the documentation for the analysis, we would not have that problem.

    I notice some hesitancy among non-tenured researchers to release data and code. I can maybe understand that, but tenured professors have no good reason to hold on to data, in my opinion.

    1. Thanks. I am just at the outset of doing these things and I think we do all need a lot more training in data curation and planning for data sharing, as it is much easier if you plan to do it than if you aim to do so post hoc.

  5. Thanks for a very nice summary of the issues of sharing data. I've written up an example of some of the benefits of sharing data:
    I think the data management issue is a common hurdle and it might be partly overcome by people using good practices from the start of the study. So do the hard yards on documentation and version control from start to finish.

  6. Funny, published the very same day is "Managing risks when publishing open data" ... another take on similar issues.

  7. Journals who want to encourage data sharing could always insist on the step(s) being taken before the paper can be published. A list of suitable repositories could be provided by the journal. Nothing like a checklist of author instructions to ensure compliance.

    1. I think some journals already do this. Just from memory PeerJ, J. Neurodevelopmental Disorders.

  8. Has anyone considered any potential inequities of research where those with resources publish and those without don't? Question. Are the researchers at smaller institutions strong advocates for making data freely available? Large, wealthier research institutions have more research tools and funding and manpower to churn out research faster than the smaller institutions. What if the small researchers publish one paper and then the big institutions take their data and churn out many more? And then the research group at the small institution loses their funding and the institutional support and they have to scurry to find another research topic to save their lab. Sadly, my colleagues say, "It's great that your paper is being cited, but what are you working on now? That was yesterday." And just try following up on your research when you're now teaching four classes a semester because hiring is slowed. I would not wish to delay new knowledge being discovered in the world, I just wish my career didn't have to be a casualty in the process.

    1. I just don't really buy this argument - mainly because I have so many people talking of the advantages of data sharing (see Brembs' comment below), and I simply haven't come across those from large institutions exploiting those from smaller ones in this way. If anyone knows of such an example, they'd be eligible for this challenge and could at least get a free t-shirt:

    2. Also, it means that you have really really mean colleagues. It might not be a bad idea to get rid of them, if you can afford it.

  9. A fascinating post, and I agree with everything you’ve said. There’s a key perspective missing, though: that of patients who positively want their data to be shared.

    The PACE trial of CBT and graded exercise therapy for chronic fatigue syndrome (CFS) is now rapidly becoming notorious. The PACE investigators abandoned all their main outcome measures and their criteria for “recovery” partway through the trial and replaced them with new ones. The results of the planned analyses were never reported.

    The new threshold for recovering physical function (SF-36) is so low that it’s below the level of trial entry – that is, you could get worse during the trial and be considered “recovered”.

    There are many problems in PACE and patients (including patients who were scientists before they got too sick to work) have been calling for the planned outcomes and/or the raw, anonymised data for years but their FOI requests have been refused.

    Six prominent scientists have now written to The Lancet demanding independent reanalysis:

    and a 10,000-strong petition is calling for the retraction of the misleading claims of improvement and recovery based on the altered analyses. The short, clear background pages will make fascinating reading for any scientist:

    But these efforts shouldn’t be necessary. As an ME patient, I was offered a place on PACE and refused. But if I had taken part I would be horrified at the travesty of science that this trial has become. I wouldn’t have wanted to have risked my health in a clinical trial only for the study authors to publish bizarre and misleading analyses and then hide the data so that others couldn’t challenge them.

    Professor Michael Sharpe, one of the principal investigators on PACE, is at Oxford – I hope he and his colleagues have read your blog.

    1. I'm aware that there's a lot of debate going on around the PACE trial, and, as you will gather, my advice would be to make the data available.
      However, as I note in my blogpost, open data is still a long way from being the norm, and refusal to deposit data seems to be more common than not (see the Ascoli paper I cited). In addition, there was a time when every study that I got ethics approval for had to tell the participants explicitly that their data would *not* be publicly available and would *only* be seen by members of the research team. This was done to protect patients and because of concerns about confidentiality.
      Things are now changing, not least because some patient groups have objected to over-restrictive ethics statements that preclude re-use of data, so studies that are starting out now are more likely to include consent forms and information sheets that make it clear that data-sharing will occur. In such cases, it is of course vital that adequate anonymisation is carried out and that the data does not contain information such as postcodes or dates of birth that could link back to individuals. Even particular symptom patterns could identify someone in some cases. None of these are insuperable problems if data collection and ethics procedures plan for them in advance. But they are issues that affect older studies and can be hard to deal with retrospectively.
      Anyhow, having said all that, I hope it does become possible to fully anonymise and release the PACE data; given the high tension around this study, people are going to think there is some kind of cover-up if the data are withheld.

    2. It is important to differentiate PII (personally identifiable information) from other information that cannot be traced back to individuals. There has been a huge amount of work on privacy within the computer science field and it is important that this is not ignored in the medical field.

      What matters is could someone trace back from individual readings to a patient and given the size of the PACE trial and the number of ME patients in the UK and the nature of the outcome data it would seem like an impossible job to trace back (assuming the normal identifiers such as postcode, age, name etc are removed).

      From an ethical standpoint we need to think about the various ethical issues and where the tradeoffs lie. How does a patient's right to have adequate data to make treatment decisions trade off against privacy rights? Given that I believe data can be adequately anonymized, I don't really see this as an issue. But again, safe sharing of data is something that is being looked at within the computer science field, where ideas of data stewardship are explored. Techniques are being explored to allow someone to specify a function to run on data, look at the amount of data being released and the correctness of the function (either by additional embedded computations or through trusted computing). Again the medical establishment should look to build on this work.

      Was it ethical for PACE to make such changes to the protocol (of an unblinded trial) and not give the outcomes associated with the original ones? How does this compare to the privacy risks of releasing outcome data without personal identifiers?

      The last ethical point is how the data is presented. Your blog talks about concerns that others may misrepresent data, but part of the strength of open data is that it stops the data owner misrepresenting their own data to fit their beliefs (whether consciously or not).

      But maybe more care from publishers and university press offices is also necessary. Take the latest PACE paper by Prof Sharpe from Oxford - that was spun as supporting his favoured treatments. Yet the data (even that published in the paper) didn't support this conclusion. Yet the University of Oxford press office and the Lancet were happy to have press releases and paper abstracts pushing this line. Prof Coyne covers this issue in detail

      Perhaps the controversy over PACE and the way the medical establishment has been quiet or supportive of data suppression is a good guide for the robustness of beliefs in data sharing.

  10. Actually, the opposite of the fears you describe happened to me very recently.
    After we had published a paper on how Drosophila strains that are referred to by the same name in the literature, but came from different laboratories, behaved completely differently in a particular behavioral experiment,
    Casey Bergman from Manchester contacted me, asking if we shouldn't sequence the genomes of these five fly strains to find out how they differ. So I went and behaviorally tested the strains again, extracted the DNA from the 100 individuals I had just tested and sent the material to him. The data I published immediately on our GitHub project page:
    Casey sequenced the strains and made the sequences available:
    A few weeks later, we were both contacted by Nelson Lau at Brandeis, showing us his bioinformatics analyses of our publicly posted genome data. He asked us to be co-authors as a reward for posting our data and as incentive for others to let go of their fears and also post their data online.
    The work is now in press and both Casey and I are now co-authors even though I initially protested as my contribution was so minuscule. I was finally persuaded to accept the co-author position only for the incentive/policy reason.
    I hope this story will make others share their data and code as well.

    Will write all this up in more detail as the paper becomes available...

  11. Here is the link to the promised blog post:

  12. Replies
    1. Our corresponding author has posted another reply to your question:

  13. It would be interesting to compare openness in the medical field with that in other fields. For example, look at the history of open source software and how it was initially thought of as a marginal business by the industry but now is a major thing. It brought considerable culture change.

    Or look at vulnerability reporting and the debates that have happened over the disclosure of issues with software. A suitable compromise seems to have been achieved. But it is interesting to compare approaches: this is an industry that pays bounties to those finding issues with its products (hopefully before criminals find them), and where academics can get considerable kudos by breaking a protocol or crypto system, whereas the medical industry seems very sensitive to disclosure and to discussing problems.

  14. For example the "European Code of Conduct for Research Integrity" (ESF/ALLEA, 2011, see states: "All primary and secondary data should be stored in secure and accessible form, documented and archived for a substantial period. It should be placed at the disposal of colleagues."
    Such a statement is also part of Principle 3 ("Verifiability") of the VSNU "The Netherlands Code of Conduct for Academic Practice" ( The Dutch Code states: "Raw research data are stored for at least ten years. These data are made available to other academic practitioners upon request, unless legal provisions dictate otherwise."
    All researchers at any of the 14 Dutch research universities must always work fully in line with this VSNU Code, and have had to do so since 1 January 2005. Complaints can be filed when there are indications that a researcher is violating any of the rules of the Code. See and for English versions of the current guidelines at RUG (University of Groningen).
    Frank van Kolfschooten, a science journalist, reported on 1 July 2015 in the Dutch newspaper NRC about a recent case at RUG of researchers who refused to share raw research data of a PhD thesis with colleagues. RUG concluded that these researchers, Dr Anouk van Eerden and Dr Mik van Es, had violated the rules of research integrity, because they refused to share the raw research data with others (Dr Peter-Arno Coppen [Radboud University Nijmegen], Dr Carel Jansen [RUG] and Dr Marc van Oostendorp [Leiden University]). These three researchers had filed a complaint at RUG when both researchers of RUG were unwilling to provide them access to the raw research data.
    At RUG, all PhD candidates are even obliged to promise, in public and during the PhD graduation ceremony, that they will always work according to the VSNU Code of Conduct. This has been the case for around two years.
    I am currently confronted with a very persistent refusal of a researcher of Oxford University, Dr Adrian Pont, to give me access to the raw research data of a questionable paper. Details are listed at Dr Pont is the Associate Editor of the journal in question, but is persistently refusing to start a scientific dialogue with me about this case. There have as well been multiple contacts from my side with (officials at) Oxford University. I was turned down, and already a few times. I fail to understand how the current acting of Oxford University and the current behaviour of Dr Pont is in line with for example

  15. Congratulations, Dorothy, on one more succinct framing of an important issue. The comments are very illuminating. We are clearly in the middle of a paradigm shift. In a few years, scientists will look back and wonder what all the fuss was about.

  16. "Indeed, if we move to a situation where a dataset is a publication with its own identifier, then the original researcher will get credit every time someone else uses the dataset. " is a key point. There is no justification for not providing data in order to check the claims made in the paper. However, in many fields, the tedious and unpaid/underpaid labor associated with data collection is motivated by the scientific discoveries, rewards and credit that emerge from its analysis, and until full scientific credit (both material credit and soft credit for the contribution to the process) is explicitly allocated, full unrestricted data-sharing will have undesirable side-effects. A mere citation to the published paper is not adequate reward, and this is why, as in Bjorn Brembs' example above, experimenters are often included as middle co-authors in papers that emerge from analysis of their data. But having a Data-Use field associated with each dataset, and giving credit for providing data that proved to be useful will go a long way towards addressing this issue.

    1. I think that this is a valid concern, possibly addressable with a data time-capsule e.g., data to be released after 2-5 years post-publication.

  17. I am from a different field from most of you - socioeconomic history. 30 years ago, I put together a data set for Philadelphia in 1790 out of 3 different sets of records: the 1790 federal census, a city directory published in 1791, and the 1789 manuscript tax records. I ended up with 10,000 records of 55,000 people, and 20 variables. It took years.

    I was able to publish a first article with maps showing the distribution of various occupations and other social identifiers.

    I also had found - to my surprise - rent and density gradients with R2's comparable to modern cities - except I was using blocks, not miles. That was also in the article.

    I published it in the J of Interdisc. History in the summer of 1993, planning to work more on the data set later.

    Unfortunately - and this is MOST ironic - I never finished anything else using that data set, because I collapsed on October 24, 1994, with a severe case of M.E. Yes. THAT M.E. The one that the PACE trial was about. FWIW, I have been in many studies, and my own case has to do with immune defects and persistent viruses, including HHV-6A and CMV active in my spinal fluid. I was extremely sick, unable to read - unable to brush my own teeth. I can write this because I'm on experimental immune medicine that has brought me back from being mostly bedridden to high-functioning.

    But - back to my data set. I had no research assistants. I had to go to a computer room to input the data (in fact, the first runs were done with punchcards). I did it all myself. 10,000 records with 20 variables. And all I got out of it was one article.

    I got better on immune medicine, but I was never well enough to go back to the data set and finish the study.

    About 15 years later, when I was in a relapse, I was called by a graduate student who seemed surprised to find me still alive. He wanted to know if I still had the data set. I said yes - in fact, if you'd like to use it, that would be great.

    But I wanted a co-authorship.

    And then, I began to worry. It was a long time ago. The tape the information was on was not usable now - though there was a printout of everything that had been on it; wouldn't be difficult for me to key it into a laptop today. And I had done some squirrely things - there were arrows from this to that - I don't know what that was about. I was too sick anyway, so I didn't really pursue it. And he didn't want it.

    Imagine my surprise - imagine my dismay - to find that somebody who had been at the same research institute I was at when I first did that study, who KNEW I had been working on it - had gotten a grant to ... do what I had done! Except he wasn't as thorough. He presented a chapter of his book-to-be at the institute, and I had come to the seminar expecting to find he had used my article and pushed further from it, or even argued against it - but used it. Imagine my chagrin to find he had not cited my article at all. I had really died in 1994, as far as the profession was concerned. I gave him a copy of the article and said I hope you read it!

    What is the moral of this tale? Sometimes if you share data, the data can live on after you've used it. I wish I had. (I really wish I hadn't gotten so sick at the age of 44 - but I could have been hit by a cement truck on I-95. You never know.)

    Since my "research" world is now medicine (I don't DO research, but I'm IN a lot of research ....), I've learned that when a medical project is published, they give a co-authorship to the physician whose patients were in the study. So if I were to hand over my data set, my years of work - I think it's fair to say I ought to get a co-authorship out of it, or at least the offer of a co-authorship. Might make people less reluctant to share.

    So that's my suggestion here - why not give co-authorship status to the person(s) who put together the data set?

    As for the PACE study. Well, we'll see what history has to say about that.

  18. Thanks for all the comments on what is turning into a very interesting debate. Re authorship: what I really dislike is trading authorship for data, i.e. people who say "you can only have my data if I can be an author on your paper". It can be nice if someone who reuses data offers to collaborate, as in the Bjorn Brembs example above, but it should not be a condition for providing data.
    I guess there might be variations depending on the nature and extent of the dataset. I have been involved in consortia of the kind Mary refers to, where a researcher assembles clinical cases and those who provide the cases get authorship; personally I am uncomfortable with that, especially if there is then a whole series of papers based on the composite dataset, and those who provided the patients do very little beyond the initial contribution of helping form the sample. I think it is reasonable to recognise their role on the initial publication, but not beyond that - unless they are actively engaged. Most formal authorship criteria talk of requiring an 'intellectual contribution', though of course that is hard to define. I think the current conventions can be unfair on the poor postdoc who has done 90% of the analysis and write-up of a paper but who just ends up being one of twenty or more 'authors', some of whom did no more than provide a handful of cases some years previously.
    I would add that people who are authors by virtue of providing data can come unstuck: I would be very cautious about having my name on a paper unless I had enough input to ensure the work was of decent quality.
    It may be that for the future the solution will be to do away with the idea of 'authorship' altogether, and to recognise more explicitly different types of contribution. (I've seen mention of ideas for this on social media but can't track down the source, so if anyone knows it, please provide a link).

    1. An acknowledgement is often recommended by journals if a contributor has not fulfilled the requirements for authorship.

      e.g. "Contributors who meet fewer than all 4 of the above criteria for authorship should not be listed as authors, but they should be acknowledged. Examples of activities that alone (without other contributions) do not qualify a contributor for authorship are acquisition of funding; general supervision of a research group or general administrative support; and writing assistance, technical editing, language editing, and proofreading. Those whose contributions do not justify authorship may be acknowledged individually or together as a group under a single heading (e.g. "Clinical Investigators" or "Participating Investigators"), and their contributions should be specified (e.g., "served as scientific advisors," "critically reviewed the study proposal," "collected data," "provided and cared for study patients", "participated in writing or technical editing of the manuscript")."

    2. If we had a halfway modern infrastructure where we'd all login with our identifier (think, e.g. ORCID) and where we'd all contribute our research objects online, it would be easy to attribute research objects to individuals:

      I described this idea in a more narrative format:

  19. Thanks for some interesting perspectives. And in principle, I'm very much supportive of moves towards greater transparency in data and the basis for the arguments in a paper.

    Sometimes, though, I wonder why there is so much surprise at the resistance to open data. How many times, at least in the UK, have you come across UG/MSc research projects where participants are told in the consent form by the researcher "no-one apart from myself and my supervisor will have access to the data"? I see this all the time and despite protestations over it, I suspect the reality is students absorb this as participants and as early researchers as a philosophy that prioritises security. And we do very little to educate and train students on the advantages or importance of open data and data management, and the consequences (e.g. awareness of indirect identifiers in "anonymous" data). Then we suddenly expect individuals at PhD level and beyond to be part of much more open data, and wonder why it's not obvious.

    Surely we should / could do more to nurture good practices as early as possible, and tackle misconceptions and anxieties before they become entrenched?
    John Towse

  20. The Pubmed entry of the 2011 paper in The Lancet about the PACE trial lists 20 authors:
    B J Angus, H L Baber, J Bavinton, M Burgess, T Chalder, L V Clark, D L Cox, J C DeCesare, K A Goldsmith, A L Johnson, P McCrone, G Murphy, M Murphy, H O’Dowd, PACE trial management group, L Potts, M Sharpe, R Walwyn, R Walwyn and P D White (re-arranged in alphabetical order).
    . (and also ) lists 19 authors.
    Both entries state that all 19 authors are acting 'on behalf of the PACE trial management group†'. and "†Members listed at end of paper".
    The end of the paper (page 835) states: "PACE trial group."
    This term is not identical to "PACE trial management group". In total another 19 names are listed: Hiroko Akagi, Mansel Aylward, Barbara Bowman, Jenny Butler, Chris Clark, Janet Darbyshire, Paul Dieppe, Patrick Doherty, Charlotte Feinmann, Deborah Fleetwood, Astrid Fletcher, Stella Law, M Llewelyn, Alastair Miller, Tom Sensky, Peter Spencer, Gavin Spickett, Stephen Stansfeld and Alison Wearden.
    So are all these 19 people also some sort of co-author of this paper?
    There is no overlap with the first 19 people who are listed as authors of the paper. So how many people can claim to be an author of this paper?
    Do all 38 people endorse the full contents of the paper?
    The paper has also many affiliations:
    * Queen Mary University of London, UK
    * King’s College London, UK
    * University of Cambridge, UK
    * University of Cumbria, UK
    * University of Oxford, UK
    * University of Edinburgh, UK
    * Medical Research Council Clinical Trials Unit, London, UK
    * South London and Maudsley NHS Foundation Trust, London, UK
    * The John Radcliffe Hospital, Oxford, UK
    * Royal Free Hospital NHS Trust, London, UK
    * Barts and the London NHS Trust, London, UK
    * Frenchay Hospital NHS Trust, Bristol, UK
    * Western General Hospital, Edinburgh, UK
    Am I right to assume that all 38 people (names see above) and all affiliations / institutes plainly refuse to give critics / other scientists / patients / patient groups (etc.) access to the raw research data of this paper and am I right with my assumption that it is therefore impossible for all other scientists (etc.) to conduct re-calculations, check all statements with the raw data, etc?
    Excuse me very much, but I totally fail to understand how this is possible, (1) in the Dutch context, (2) in the context of the policy of the funders, (3) in the context of the public interest in this topic, (4) in the context of how to teach students how to become a good researcher, etc.

  21. In general, I am in favour of releasing both data and code when a paper is published. One could argue that with some data sets there might be a rationale to embargo the release of the data for 1 or 2 years to let the original researchers do a bit more data mining, but it would require a good reason.

    An interesting case of being able to use the data is evolving on Andrew Gelman's blog ( plus several follow-up posts, about some of the results of a study by Case & Deaton (

    As far as I can see, the blog analyses and discussion do nothing to change the overall conclusions of the original study but concentrate on a subsection of the data and show that one can extract some very interesting findings (particularly if you are American, White and between 45 and 54). Here, the datasets were public property and could be downloaded from CDC, but it does show the usefulness of having the raw data immediately available for further analysis.

    On the other hand, it looks to me that not releasing data can have some very bad effects. Here it is not just a researcher or two looking embarrassed: national-level policy decisions were almost certainly based on what appear to be totally erroneous results. The data here were finally released, but only after what looks like 2 or 3 years and after the study's conclusions were used in testimony before the United States Senate Budget Committee.

    In this last instance there were a number of 'interesting' analysis decisions made and what appears to be a straightforward range mistake in the Excel spreadsheet.

    This last point, the simple Excel errors, interested me, since I have come to the conclusion that spreadsheets (MS Excel, Apache OpenOffice Calc and so on) are the worst tools one could use for data analysis.

  23. The Open Medicine Foundation is studying patients who are severely disabled by ME/CFS. The head of research, geneticist Ronald W. Davis, said last summer, "My plan is to take a collection of patients and collect more data on them than has ever been collected on a human being before."

    Yesterday blogger Cort Johnson reported that Davis is "committed to posting the results from the tests on the internet within 24 hours of getting them."


  24. Perhaps I can express a differing view on both access to data and authorship. We sometimes work with big data sets in a health care field. The data is rarely available in usable form. Perhaps 90% of the work is ‘cleaning’ and matching the data, and dealing with missing data. Running the analysis is relatively quick by comparison. Giving an applicant the ‘cleaned data’ means we have done 90% of the work on the data: why would we not be entitled to authorship on a resulting paper? But the situation is much worse than that. In order to fund the work on the uncleaned data, we had to win a competitive grant process. Writing that successful grant application took a great deal of expert work (and should be seen against our general success rate of about 1 in 3 applications – better than average, I think, but still very time consuming). We also had to take the project through several ethics procedures: again very time consuming. Factoring these in means that, in my ballpark calculation, we have done at least 95% of the work in producing the cleaned data.
    It would of course be much easier as a strategy for us to parasitise other research teams who had done this hard work: but this is not a successful long term strategy for good research in the field as a whole.
    There are two further considerations.
    First, the applicant for access to the data may be of low capability. I’ve seen a data access request which demonstrated that the applicants did not understand the absolute basics or background in the field. If they had applied for a grant, they would have been turned down outright on quality grounds, nor would a good ethics committee have approved their project. This, to my mind, is more significant than an advance ideological bias regarding the data. I would willingly release data to an expert research group who had expressed advance disagreement with our conclusions. But to release data to people who have revealed in their request that they don’t understand the ABC of the field seems to me to invite only bad and confusing outcomes.
    Second, we do not live in a rational scientific world, but a political one. Even if it were true that science is self-correcting in real life, bad data interpretation can have political consequences which are long term, and can even prevent further investigation. And the University who pays my wages and determines my promotion prospects regards itself as in competition with other Universities through REF etc., not in collaboration with the scientific enterprise as a whole.
    If another research team wishes to work in this area, I can see two approaches which avoid these difficulties, yet remain ethical in terms of access to original data. They could approach the original data source directly for the raw data, and work with it themselves – perhaps they will handle it in a more informative way than we did. Or they can come and work in our centre with the cleaned data.

    1. Thanks Anon. These are helpful perspectives, especially as I am brewing up to write something more formal on these issues.
      I still don't agree with you on authorship, but that is really because I don't see it as an unmitigated bonus to be an author on a paper. If I am judging candidates for a job, etc, I would actually rate them less highly if they were associated with low-quality publications. So I think it's worth being quite discriminating about what you put your name to. I wouldn't want authorship unless I'd been involved in evaluating the ideas as well as generating the data.
      I also found it interesting that you used the word 'parasitise', which really casts those who might re-use your data in a very negative light. My view is again different: there are finite resources for doing research and we want to get the most we can out of any expenditure - if that means reusing a dataset, I think that's an efficient use of resources. Indeed, I think some research that is being done will be unnecessary because datasets may well exist already that would answer the question.
      Where I am more in sympathy is when you come to the question of who you share your data with, noting that there may be people who are totally incompetent. In fact, it is worse than that: in some research areas, there are people with vested interests or odd viewpoints who think they already know the answer to a question before they see the data - and are determined to use your data to prove it. And it is typically the case that with enough degrees of freedom and flexibility in approach to data analysis, you can come to different conclusions from the same dataset.
      I'm currently writing something focused on how best to handle such situations: there are precedents for having some restrictions on who can access data and what they do with it, or even having an 'honest broker' as an intermediary to do the analysis if this is a controversial area where researchers suspect the motives of those requesting the data.