Monday 26 May 2014

Data sharing: Exciting but scary


Yesterday I did something I've never done before in many years of publishing. When I submitted a revised manuscript of a research report to a journal, I also posted the dataset on the web, together with the script I'd used to extract the summary results. It was exciting. It felt as if I was part of a scientific revolution that has been gathering pace over the past two or three years, which culminated in adoption of a data policy by PLOS journals last February. This specified that authors were required to make the data underlying their scientific findings available publicly immediately upon publication of the article. As it happens, my paper is not submitted to PLOS, and so I'm not obliged to do this, but I wanted to, having considered the pros and cons. My decision was also influenced by the Wellcome Trust, who fund my work and encourage data sharing.

The benefits are potentially huge. People usually think about the value to other researchers, who may be able to extract useful information from your data, and there's no doubt this is a factor.  Particularly with large datasets, it's often the case that researchers only use a subset of the data, and so valuable information is squandered and may be lost forever.  More than once I've had someone ask me for an old dataset, only to find it is inaccessible, because it was stored on a floppy disk or an ancient, non-networked computer and so is no longer readable.  Even if you think that you've extracted all you can from a dataset, it may still be worth preserving for potential inclusion in future meta-analyses.

Another value of open data is less often emphasised: when you share data you are forced to ensure it is accurate and properly documented. I enjoy data analysis, but I'm not naturally well-disciplined about keeping everything tidy and well-organised. I've been alarmed on occasion to return to a dataset and find I have no idea what some of the variables are, because I failed to document them properly.  If I know the world at large will see my dataset then I won't want to be embarrassed by it, and so I will take more care to keep it neat and tidy with everything clearly labelled. This can only be good.

But here's the scary thing: data sharing exposes researchers to the risk of being found out to be sloppy or inaccurate. To my horror, shortly before I posted my dataset on the internet yesterday I found I'd made a mistake in the calculation of one of my variables. It was a silly error, caused by basing a computation on the wrong column of data. Fortunately, it did not have a serious effect on my paper, though I did have to go through redoing all the tables and making some changes to the text. But it seemed like pure chance that I picked up on this error – I could very easily have posted the dataset on the internet with the error still there. And it was an error that would have been detected by anyone eagle-eyed enough to look at the numbers carefully. Needless to say, I'm nervous that there may well be other errors in there that I did not pick up. But at least it's not as bad as an apocryphal case of a distinguished research group whose dramatic (and published) results arose because someone forgot to designate 9 as a missing value code. When I heard about that I shuddered, as I could see how easily it could happen.
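
To see how easily a forgotten missing-value code can manufacture a result, here is a minimal sketch in Python. Everything in it is made up for illustration: the 1-5 rating scale, the two groups, the numbers, and the helper name `clean` are all assumptions, not anything from a real dataset.

```python
from statistics import mean

MISSING_CODE = 9  # hypothetical code meaning "no response"

# Hypothetical ratings on a 1-5 scale; 9 marks a missing answer.
group_a = [2, 3, 9, 2, 9, 3, 9, 2]
group_b = [2, 3, 3, 2, 2, 3, 3, 2]


def clean(values, missing=MISSING_CODE):
    """Drop entries equal to the missing-value code."""
    return [v for v in values if v != missing]


# Difference in group means when the 9s are wrongly treated as real scores.
naive_diff = mean(group_a) - mean(group_b)

# Difference in group means once the 9s are excluded as missing.
clean_diff = mean(clean(group_a)) - mean(clean(group_b))

print(f"difference with 9 treated as data:    {naive_diff:.2f}")
print(f"difference with 9 treated as missing: {clean_diff:.2f}")
```

With these invented numbers the apparent group difference is large when the 9s are counted as scores and close to zero once they are excluded. A quick check of each variable's range against the values the scale allows would catch this kind of slip.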

This is why Open Data is important for science but difficult for scientists. In the past, I've found mistakes in my datasets, but this has been a private experience. To date, as far as I am aware, no serious errors have got into my published papers – though I did have another close shave last year when I found a wrongly-reported set of means at the proofs stage, and there have been a couple of instances where minor errata have had to be published. But the one thing I've learned as I wiped the egg off my face is that error is inevitable, however careful you try to be. The best way to flush out these errors is to make the data public. This will inevitably lead to some embarrassment when mistakes are found, but at the end of the day, our goal must be to find out what is the case, rather than to save face.

I'm aware that not everyone agrees with me on this. There are concerns that open data sharing could lead to scientists getting scooped, will take up too much time, and could be used to impose ever more draconian regulation on beleaguered scientists. As DrugMonkey memorably put it: "Data depository obsession gets us a little closer to home because the psychotics are the Open Access Eleventy waccaloons who, presumably, started out as nice, normal, reasonable scientists." But I think this misses the point. DrugMonkey seems to think this is all about imposing regulations to prevent fraud and other dubious practices. I don't think this is so. The counter-arguments were well articulated in a blogpost by Tal Yarkoni. In brief, it's about moving to a point where it is accepted practice to make data publicly available, to improve scientific transparency, accuracy and collaboration.


  1. Hi Dorothy,

    Great post! I completely agree that posting data online also improves one's own approach to the analysis. The simple act of preparing the data for someone else to understand is a useful debugging tool, gives you a different perspective and forces you to be a bit more organised/systematic than you might otherwise be. Russ Poldrack made a similar point recently, in a nice post dissecting a coding error he only detected when he shared his analysis scripts. Basically, error is inevitable, especially for bespoke script-based analyses, so we really do need another pair of eyes. I am always struck by how much scrutiny we give the text of a manuscript, sending it around to colleagues and co-authors for endless proof-reads, edits and corrections, while we rarely show anyone else the original working out of analyses. This must be the wrong way around. Moreover, I worry that error is not random. We are far more likely to double check anomalies that contradict our hypotheses than nice publishable results (biased debugging). Preparing data (and analysis scripts) for public scrutiny is a great way to improve the reliability of research findings.


    1. Thanks, Mark. My experience exactly.
      Also think this post by Betsy Levy Paluck is worth a read as a riposte to those who say they don't have time to prepare data for sharing. She agrees that it slows you down, but points out that this is no bad thing:

  2. The Reinhart-Rogoff error – or how not to Excel at economics

    Richard Tol
    Errors in estimates of the aggregate economic impacts of climate change

    Financial Times Finds “Many” Errors in Piketty Analysis, Argues They Undermine His Thesis

    1. Speaking of making errors, I just wiped out all my comments. If the preceding seems a bit out of context, that's because this was supposed to precede it:

      Well done Dr. Bishop.

      Yes, it is all too easy to make a mistake. I remember as a graduate student, some (cough) years ago, running a correlation with the result of r = 1. I was quite excited until common sense took hold. One should not correlate line numbers with sequential ID numbers.

      Recently in the economics field there seem to have been some rather dismaying data entry and analysis errors. None of them look deliberate, but they are very distressing, particularly in the Reinhart-Rogoff paper, which has had a very significant influence on government policy in many countries. It was several years later (4-5?) before they released the data to a grad student, who proceeded to point out a multitude of errors.

      Having the data released immediately probably would have allowed some immediate corrections and damage control, rather than having it help set monetary policy for a country like the USA.

      I doubt that economists are naturally more prone to these mistakes than any other researchers, but they have been having a bit of a rough time at the moment (see below). I will point out that Piketty, for his book Capital in the Twenty-First Century, did publish his data at the same time as the book.

      As a pet peeve of mine, it looks like all three examples used Excel as their main analysis tool. I personally feel that a spreadsheet has no place in serious, or even frivolous, data analysis.

  3. Thanks for your comment. Re Excel: I think it has its uses. I will often do a preliminary quick and dirty look at a dataset in Excel, not least because it is so easy to see data and plots alongside one another. I then do the serious analysis in R or SPSS, but having the Excel version provides a good double check, and I have trapped errors when different approaches give discrepant results.
    I am fascinated by the current debate on Piketty - I had only been very vaguely aware of this until someone on Twitter asked if my post was inspired by the Piketty case. I can now see why - very interesting parallels in terms of error detection. I liked this account of the story, which I think is v balanced:

    1. Thanks for directing me to Nate's website. He makes some good points, and I have not yet had a chance to read Chris Giles' article as my local university library does not have an electronic version of the FT. I am going to have to track it down in hard copy. It's got to be in the university library somewhere--I'm not likely to be able to buy a copy here in a small city in Canada.

      Re Excel. I gave up on it some time ago simply because I don't like the new interface (and I was a beta tester for the Mac version back in the 80's). When I really need a spreadsheet I use Apache OpenOffice.

      I find it's usually a lot faster and easier to just go directly to R for any analyses. Matter of taste I guess although I really don't trust the Excel stats routines. I know of one instance, some years ago, where someone ran a linear regression and ended up with a negative Rsq.

    2. Re : Piketty

      A rather good (devastating?) response by Thomas Piketty to Chris Giles' criticisms in the Financial Times.

      One of the things that is very impressive is that he appears to have made every scrap of data available on line.

  4. An example of data sharing at its best:

  5. Out of curiosity, where did you post the data set and R code? GitHub?

    1. I was experimenting with LabArchives but am waiting until the paper is accepted before making it live. Since then I have also been experimenting with the Open Science Framework.