Yesterday I did something I've never done before in many years of publishing. When I submitted a revised manuscript of a research report to a journal, I also posted the dataset on the web, together with the script I'd used to extract the summary results. It was exciting. It felt as if I was part of a scientific revolution that has been gathering pace over the past two or three years, which culminated in adoption of a data policy by PLOS journals last February. This specified that authors were required to make the data underlying their scientific findings available publicly immediately upon publication of the article. As it happens, my paper is not submitted to PLOS, and so I'm not obliged to do this, but I wanted to, having considered the pros and cons. My decision was also influenced by the Wellcome Trust, who fund my work and encourage data sharing.
The benefits are potentially huge. People usually think about the value to other researchers, who may be able to extract useful information from your data, and there's no doubt this is a factor. Particularly with large datasets, it's often the case that researchers only use a subset of the data, and so valuable information is squandered and may be lost forever. More than once I've had someone ask me for an old dataset, only to find it is inaccessible, because it was stored on a floppy disk or an ancient, non-networked computer and so is no longer readable. Even if you think that you've extracted all you can from a dataset, it may still be worth preserving for potential inclusion in future meta-analyses.
Another value of open data is less often emphasised: when you share data you are forced to ensure it is accurate and properly documented. I enjoy data analysis, but I'm not naturally well-disciplined about keeping everything tidy and well-organised. I've been alarmed on occasion to return to a dataset and find I have no idea what some of the variables are, because I failed to document them properly. If I know the world at large will see my dataset then I won't want to be embarrassed by it, and so I will take more care to keep it neat and tidy with everything clearly labelled. This can only be good.
But here's the scary thing. data sharing exposes researchers to the risk of being found out to be sloppy or inaccurate. To my horror, shortly before I posted my dataset on the internet yesterday I found I'd made a mistake in the calculation of one of my variables. It was a silly error, caused by basing a computation on the wrong column of data. Fortunately, it did not have a serious effect on my paper, though I did have to go through redoing all the tables and making some changes to the text. But it seemed like pure chance that I picked up on this error – I could very easily have posted the dataset on the internet with the error still there. And it was an error that would have been detected by anyone eagle-eyed enough to look at the numbers carefully. Needless to say, I'm nervous that there may well be other errors in there that I did not pick up. But at least it's not as bad as an apocryphal case of a distinguished research group whose dramatic (and published) results arose because someone forgot to designate 9 as a missing value code. When I heard about that I shuddered, as I could see how easily it could happen.
This is why Open Data is both important for science but difficult for scientists. In the past, I've found mistakes in my datasets, but this has been a private experience. To date, as far as I am aware, no serious errors have got into my published papers – though I did have another close shave last year when I found a wrongly-reported set of means at the proofs stage, and there have been a couple of instances where minor errata have had to be published. But the one thing I've learned as I wiped the egg off my face is that error is inevitable and unavoidable, however careful you try to be. The best way to flush out these errors is to make the data public. This will inevitably lead to some embarrassment when mistakes are found, but at the end of the day, our goal must be to find out what is the case, rather than to save face.
I'm aware that not everyone agrees with me on this. There are concerns that open data sharing could lead to scientists getting scooped, will take up too much time, and could be used to impose ever more draconian regulation on beleaguered scientists: as DrugMonkey memorably put it: "Data depository obsession gets us a little closer to home because the psychotics are the Open Access Eleventy waccaloons who, presumably, started out as nice, normal, reasonable scientists." But I think this misses the point. Drug Monkey seems to think this is all about imposing regulations to prevent fraud and other dubious practices. I don't think this is so. The counter-arguments were well articulated in a blogpost by Tal Yarkoni. In brief, it's about moving to a point where it is accepted practice to make data publicly available, to improve scientific transparency, accuracy and collaboration.