Sunday 6 December 2015
Open code: not just data and publications
I had a nice exchange on Twitter this week.
Nick Fay had found a tweet I had posted over a year ago, asking for advice on an arcane aspect of statistical analysis.
I'd had some replies, but they hadn’t really helped. In the end, I’d worked out that there was an error in the stats bible written by Rand Wilcox, which was leading me astray. Once I’d overcome that, I managed to get the analysis to work.
It was clear Nick was now having the same problem and going round in exactly the same circles I had experienced.
My initial thought was that I could probably dig out the analysis and reconstruct what I’d done, but my heart sank at the prospect. Then I had a cheerful thought: I had deposited the analysis scripts for my project on the Open Science Framework, here. I checked, and the script was pretty well annotated, and as a bonus there was a script showing how to make a nice swarm plot.
This experience comes hard on the heels of another interaction, this time around a paper I’m writing with Paul Thompson on p-curve analysis (latest preprint is here). Here there’s no raw data, just simulations, and it’s been refreshing to interact with reviewers who not only look at the code you have deposited, but also make their own code available. There’ve been disagreements with the reviewers about aspects of our paper, and it helped enormously that we could examine one another’s code. The nice thing is that if code is available, you get to really understand what someone has done and also learn a great deal about coding.
These two examples illustrate the importance of making code open. It probably didn’t matter much when everyone was doing very simple and straightforward analyses. A t-test or correlation can easily be re-run from any package given a well-annotated dataset. But the trend in science is for analyses to get more and more complicated. I struggle to understand the methods of many current papers in neuroscience and genetics – fields where replication is sorely needed but impossible to achieve if everyone does things differently and the methods are only incompletely described. Even in less data-intensive areas such as psycholinguistics, there has been a culture change away from reliance on ANOVAs towards much fancier multilevel modelling approaches.
My experience leads me to recommend sharing of analysis code as well as data: it will help establish reproducibility of your findings, provide a training tool for others, and ensure your work is in a safe haven if you need to revisit it.
Finally, this is a further endorsement of Twitter as an academic tool. Without Twitter I wouldn't have discovered the Open Science Framework or PeerJ, both of which are great for those who want to embrace open science. And my interchange with Nick was not the end of the story: others chipped in with helpful comments.
Here is another story of victory for Open Data, just yesterday, from the excellent Ed Yong.