Saturday 3 August 2019

Corrigendum: a word you may hope never to encounter


This week I submitted a 'corrigendum' for an article published in the American Journal of Medical Genetics B (Bishop et al, 2006). 'Corrigendum' is just a fancy word for 'correction', and journals use it contrastively with 'erratum'. Basically, if the journal messes up and prints something wrong, it's an erratum. If the author is responsible for the mistake, it's a corrigendum.

I'm trying to remember how many corrigenda I've written over the 40-odd years I've been publishing: there have been at least three previous cases that I can remember, but there could be more. I think this one was the worst; previous errors have tended to just affect numbers in a minor way. In this case, a whole table of numbers (Table II) was thrown out, and although the main findings were upheld, there were some changes in the details.

I discovered the error when someone asked for the data for a meta-analysis. I was initially worried I would not be able to find the files, but fortunately, I had archived the dataset on a server, and eventually tracked it down. But it was not well-documented, and I then had the task of trawling through a number of cryptically-named files to try and work out which one was the basis for the data in the paper. My brain slowly reconstructed what the variable names meant and I got to the point of thinking I'd better check that this was the correct dataset by rerunning the analysis. Alas, although I could recreate most of what was published, I had the chilling realisation that there was a problem with Table II.

Table II was the one place in the analysis where, in trying to avoid one problem with the data (non-independence), I created a whole new problem (wrong numbers). I had data on siblings of children with autism, and in some cases there were two or three siblings in the family. These days I would have considered using a multilevel model to take family structure into account, but in 2005 I didn't know how to do that, and instead I decided to take a mean value for each family. So if there was one child, I used their score, but if there were 2 or 3, then I averaged them. The N was then the number of families, not the number of children.

And here, dear Reader, is where I made a fatal mistake. I thought the simplest way to do this would be by creating a new column in my Excel spreadsheet which had the mean for each family, computing this by manually entering a formula based on the row numbers for the siblings in that family. The number of families was small enough for this to be feasible, and all seemed well. However, I noticed when I opened the file that I had pasted a comment in red on the top row that said 'DO NOT SORT THIS FILE!'. Clearly, I had already run into problems with my method, which would be totally messed up if the rows were reordered. Despite my warning message to myself, somewhere along the line, it seems that a change was made to the numbering, and this meant that a few children had been assigned to the wrong family. And that's why Table II had gremlins in it and needed correcting.
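To make the failure mode concrete, here is a loose Python analogue of a family mean keyed to row positions. It is only a sketch: the scores, the family labels and the names `rows`, `row_positions` and `means_by_position` are all made up for illustration, and it does not reproduce the original spreadsheet or its formulas.

```python
from statistics import mean

# Made-up data for illustration only: (family label, score) per child.
rows = [
    ("A", 4.0),
    ("B", 10.0), ("B", 12.0),
    ("C", 7.0), ("C", 9.0), ("C", 11.0),
]

# Hypothetical stand-in for hand-typed formulas: each family's mean is
# defined by fixed row positions, not by the family label on the row.
row_positions = {"A": [0], "B": [1, 2], "C": [3, 4, 5]}


def means_by_position(table, positions):
    """Average the scores found at the listed row positions for each family."""
    return {
        family: mean(table[i][1] for i in indices)
        for family, indices in positions.items()
    }


before = means_by_position(rows, row_positions)

# Reorder the rows (here, sorted by score) while the positions stay fixed.
reordered = sorted(rows, key=lambda row: row[1])
after = means_by_position(reordered, row_positions)

print(before)
print(after)
```

Once the rows are reordered, the fixed positions pick up scores from children in other families, and nothing in the output flags that anything has gone wrong.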

I now know that doing computations in Excel is almost always a bad idea, but in those days, I was innocent enough to be impressed with its computational possibilities. Now I use R, and life is transformed. The problem of computing a mean for each family can be scripted pretty easily, and then you have a lasting record of the analysis, which can be reproduced at any time. In my current projects, I aim to store data with a data dictionary and scripts on a repository such as Open Science Framework, with a link in the paper, so anyone can reconstruct the analysis, and I can find it easily if someone asks for the data. I wish I had learned about this years ago, but at least I can now use this approach with any new data – and I also aim to archive some old datasets as well.
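As a rough illustration of what scripting the family mean looks like, here is a minimal sketch. It is written in Python rather than R purely for the sake of the example, and the records, the column names (`family_id`, `child_id`, `score`) and the function `family_means` are invented; they are not taken from the actual dataset or analysis.

```python
from collections import defaultdict
from statistics import mean

# Made-up records for illustration. Every row carries its own family ID,
# so the grouping does not depend on where the row sits in the file.
records = [
    {"family_id": "F01", "child_id": "F01-a", "score": 10.0},
    {"family_id": "F02", "child_id": "F02-a", "score": 6.0},
    {"family_id": "F01", "child_id": "F01-b", "score": 14.0},
    {"family_id": "F03", "child_id": "F03-a", "score": 9.0},
    {"family_id": "F02", "child_id": "F02-b", "score": 8.0},
    {"family_id": "F02", "child_id": "F02-c", "score": 7.0},
]


def family_means(rows):
    """Return one mean score per family, grouping rows by family_id."""
    scores_by_family = defaultdict(list)
    for row in rows:
        scores_by_family[row["family_id"]].append(row["score"])
    return {family: mean(scores) for family, scores in scores_by_family.items()}


means = family_means(records)
n_families = len(means)  # N is the number of families, not of children

# Sorting the rows any way you like leaves the result unchanged.
shuffled = sorted(records, key=lambda row: row["score"])
assert family_means(shuffled) == means

print(n_families, means)
```

Because the grouping is done on an explicit family identifier, reordering the file cannot reassign a child to the wrong family, and the script itself is the record of how the means were computed.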

For a journal, a corrigendum is a nuisance: corrigenda cost time and money to produce, and are usually pretty hard to link up to the original article, so the whole exercise may be seen as a bit pointless. This is especially so given that a corrigendum is only appropriate if the error is not major. If an error would alter the conclusions that you'd draw from the data, then the paper will need to be retracted. Nevertheless, it is important for the scientific record to be accurate, and I'm pleased to say that the American Journal of Medical Genetics took this seriously. They responded promptly to my email documenting the problem, suggesting I write a corrigendum, which I have now done.

I thought it worth blogging about this to show how much easier my life would have been if I had been using the practices of data management and analysis that I now am starting to adopt. I also felt it does no harm to write about making mistakes, which is usually a taboo subject. I've argued previously that we should be open about errors, to encourage others to report them, and to demonstrate how everyone makes mistakes, even when trying hard to be accurate (Bishop, 2018). So yes, mistakes happen, but you do learn from them.

References 
Bishop, D. V. M. (2018). Fallibility in science: Responding to errors in the work of oneself and others (Commentary). Advances in Methods and Practices in Psychological Science, 1(3), 432-438. doi:10.1177/2515245918776632. (For free preprint see: https://peerj.com/preprints/3486/)

Bishop, D. V. M., Maybery, M., Wong, D., Maley, A., & Hallmayer, J. (2006). Characteristics of the broader phenotype in autism: a study of siblings using the Children's Communication Checklist - 2. American Journal of Medical Genetics Part B (Neuropsychiatric Genetics), 141B, 117-122.

1 comment:

  1. Thank you for this post, and for setting an example of how to respond with transparency and solutions (i.e., make a correction, and then continue to strive for best data management practices) rather than trying to hide from a mistake to prevent potential embarrassment. Was a thoughtful post.
