BishopBlog: 2016

Thursday, 22 December 2016

Controversial statues: remove or revise?

The Rhodes Must Fall campaign in Oxford ignited an impassioned debate about the presence of monuments to historical figures in our Universities. On the one hand, there are those who find it offensive that a major university should continue to commemorate a person such as Cecil Rhodes, given the historical reappraisal of his role in colonialism and suppression of African people. On the other hand, there are those who worry that removal of the Rhodes statue could be the thin end of a wedge that could lead to demands for Nelson to be removed from Trafalgar Square or Henry VIII from King’s College Cambridge. There are competing petitions online to remove and retain the Rhodes statue: with both having similar numbers of supporters.

The Rhodes Must Fall campaign was back in the spotlight last week, when the Times Higher ran a lengthy article covering a range of controversial statues in Universities across the globe. A day before the article appeared, I had happened upon the Explorer's Monument in Fremantle, Australia. The original monument, dating to 1913, commemorated explorers who had been killed by 'treachorous natives' in 1864. As I read the plaque, I was thinking that this was one-sided, to put it mildly.

Source: https://en.wikipedia.org/wiki/Explorers%27_Monument

But then, reading on, I came to the next plaque, below the first, which was added to give the view of those who were offended by the original statue and plaque.

Source: Source: https://en.wikipedia.org/wiki/Explorers%27_Monument

I like this solution. It does not airbrush controversial figures and events out of history. Rather, it forces one to think about the ways in which a colonial perspective damaged many indigenous people - and perhaps to question other things that are just taken for granted. It also creates a lasting reminder of the issues currently under debate – whereas if a statue is removed, all could be forgotten in a few years’ time.

Obviously, taken to extremes, this approach could get out of control – one can imagine a never-ending sequence of plaques like the comments section on a Guardian article. But used judiciously, this approach seems to me to be a good solution to this debate.

Friday, 16 December 2016

When is a replication not a replication?

-->

Replication studies have been much in the news lately, particularly in the field of psychology, where a great deal of discussion has been stimulated by the Reproducibility Project spearheaded by Brian Nosek.

Replication of a study is an important way to test the the reproducibility and generalisability of the results. It has been a standard requirement for publication in reputable journals in the field of genetics for several years (see Kraft et al, 2009). However, at interdisciplinary boundaries, the need for replication may not be appreciated, especially where researchers from other disciplines include genetic associations in their analyses. I’m interested in documenting how far replications are routinely included in genetics papers that are published in neuroscience journals, and so I attempted to categorise a set of papers on this basis.

I’ve encountered many unanticipated obstacles in the course of this study (unintelligible papers and uncommunicative authors, to name just two I have blogged about), but I had not expected to find it difficult to make this binary categorisation. But it has become clear that there are nuances to the idea of replication. Here are two of those I have encountered:

a) Studies which include a straightforward Discovery and Replication sample, but which fail to reproduce the original result in the Replication sample. The authors then proceed to analyse the data with both samples combined and conclude that the original result is still there, so all is okay. Now, as far as I am concerned, you can’t treat this as a successful replication; the best you can say of it is that it is an extension of the original study to a larger sample size. But if, as is typically the case, the original result was afflicted by the Winner’s Curse, then the combined result will be biased.

b) Studies which use different phenotypes for Discovery and Replication samples. On the one hand, one can argue that such studies are useful for identifying how generalizable the initial result is to changes in measures. It may also be the only practical solution if using pre-existing samples for replication, as one has to use what measures are available. The problem is that there is an asymmetry in terms of how the results are then treated. If the same result is obtained with a new sample using different measures, this can be taken as strong evidence that the genotype is influencing a trait regardless of how it is measured. But when the Replication sample fails to reproduce the original result, one is left with uncertainty as to whether it was type I error, or a finding that is sensitive to how it is measured. I’ve found that people are very reluctant to treat failures to replicate as undermining the original finding in this circumstance.

I’m reminded of arguments in the field of social psychology, where failures to reproduce well-known phenomena are often attributed to minor changes in the procedures or lack of ‘flair’ of experimenters. The problem is that while this interpretation could be valid, there is another, less palatable, interpretation, which is that the original finding was a type I error. This is particularly likely when the original study was underpowered or the phenotype was measured using an unreliable instrument.

There is no simple solution, but as a start, I’d suggest that researchers in this field should, where feasible, use the same phenotype measures in Discovery and Replication samples. Where that is not feasible, the could pre-register their predictions for a Replication Sample prior to looking at the data, taking into account the reliability of the measures of the phenotype and the power of the Replication Sample to detect the original effect, based on the sample size

Tuesday, 13 December 2016

When scientific communication is a one-way street

Together with some colleagues, I am reviewing a set of papers that combine genetic and neuroscience methods. We had noticed wide variation in methodological practices and thought it would be useful to evaluate the state of the field. Our ultimate aim of identifying both problems and instances of best practice, so that we could make some recommendations.

I had anticipated that there would be wide differences between studies in statistical approaches and completeness of reporting, but I had not realised just what a daunting task it would be to review a set of papers. We had initially planned to include 50 papers, but we had to prune it down to 30, on realising just how much time we would need to spend reading and re-reading each article, just to extract some key statistics for a summary.

In part the problem is the complexity that arises when you bring together two or more subject areas, each of which deals with complex, big datasets. I blogged recently about this. Another issue is incomplete reporting. Trying to find out whether the researchers followed a specific procedure can mean wading through pages of manuscript and supplementary material: if you don’t find it, you then worry that you may have missed it, and so you re-read it all again. The search for key details is not so much looking for a needle in a haystack as being presented with a haystack which may or may not have a needle in it.

I realised that it would make sense to contact authors of the papers we were including in the review, so I sent an email, copied to each first and last author, attaching a summary template of the details that had been extracted from their paper, and simply asking them to check if it was an accurate account. I realised everyone is busy and I did not anticipate an immediate response, but I suggested an end of month deadline, which gave people 3-4 weeks to reply. I then sent out a reminder a week before the deadline to those who had not replied, offering more time if needed.

Overall, the outcome was as follows:

15 out of 30 authors responded, either to confirm our template was correct, or to make changes. The tone varied from friendly to suspicious, but all gave useful feedback.
5 authors acknowledged our request and promised to get back but didn’t.
1 author said an error had been found in the data, which did not affect conclusions, and they planned to correct it and send us updated data – but they didn’t.
1 author sent questions about what we were doing, to which I replied, but they did not confirm whether or not our summary of their study was correct.
8 did not reply to either of my emails.

I was rather disappointed that only half the authors ultimately gave us a useable response. Admittedly, the response rate is better than has been reported for people who request data from authors (see, e.g. Wicherts et al, 2011) – but providing data involves much more work than checking a summary. Our summary template was very short (effectively less than 20 details to check), and in only a minority of cases had we asked authors to provide specific information that we could not find in the paper, or confirmation of means/SDs that had been extracted from a digitised figure.

We are continuing to work on our analysis, and will aim to publish it regardless, but I remain curious about the reasons why so many authors were unwilling to do a simple check. It could just be pressure of work: we are all terribly busy and I can appreciate this kind of request might just seem a nuisance. Or are some authors really not interested in what people make of their paper, provided they get it published in a top journal?

Friday, 28 October 2016

The allure of autism for researchers


Data on $K spend on neurodevelopmental disorder research by NIH: from Bishop, D. V. M. (2010). Which neurodevelopmental disorders get researched and why? PLOS One, 5(11), e15112. doi: 10.1371/journal.pone.0015112

Every year I hear from students interested in doing postgraduate study with me at Oxford. Most of them express a strong research interest in autism spectrum disorder (ASD). At one level, this is not surprising: if you want to work on autism and you look at the University website, you will find me as one of the people listed as affiliated with the Oxford Autism Research Centre. But if you look at my publication list, you find that autism research is a rather minor part of what I do: 13% of my papers have autism as a keyword, and only 6% have autism or ASD in the title. And where I have published on autism, it is usually in the context of comparing language in ASD with developmental language disorder (DLD, aka specific language impairment, SLI). And, indeed in the publication referenced in the graph above, I concluded that there was disproportionate amounts of research, and research funding, going to ASD relative to other neurodevelopmental disorders.

Now, I don’t want to knock autism research. ASD is an intriguing condition which can have major effects on the lives of affected individuals and their families. It was great to see the recent publication of a study by Jonathan Green and his colleagues showing that a parent-based treatment with autistic toddlers could produce long-lasting reduction in severity of symptoms. Conducting a rigorous study of this size is hugely difficult to do and only possible with substantial research funding.

But I do wonder why there is such a skew in interest towards autism, when many children have other developmental disorders that have long-term impacts. Where are all the enthusiastic young researchers who want to work on developmental language disorders? Why is it that children with general learning disabilities (intellectual retardation) are so often excluded from research, or relegated to be a control group against which ASD is assessed?

Together with colleagues Becky Clark, Gina Conti-Ramsden, Maggie Snowling, and Courtenay Norbury, I started the RALLI campaign in 2012 to raise awareness of children’s language impairments, mainly focused on a YouTube channel where we post videos providing brief summaries of key information, with links to more detailed evidence. This year we also completed a study that brought together a multidisciplinary, multinational panel of experts with the goal of producing consensus statements on criteria and terminology for children’s language disorders – leading to one published paper and another currently in preprint stage. We hope that increased consistency in how we define and refer to developmental language disorders will lead to improved recognition.

We still have a long way to go in raising awareness. I doubt we will ever achieve a level of interest to parallel that of autism. And I suspect this is because autism fascinates because it does not appear just to involve cognitive deficits, but rather a qualitatively different way of thinking and interacting with the world. But I would urge those considering pursuing research in this field to think more broadly and recognise that there are many fascinating conditions about which we still know very little. Finding ways to understand and eventually ameliorate language problems or learning disabilities could help a huge number of children and we need more of our brightest and best students to recognise this potential.

Saturday, 1 October 2016

On the incomprehensibility of much neurogenetics research

Together with some colleagues, I am carrying out an analysis of methodological issues such as statistical power in papers in top neuroscience journals. Our focus is on papers that compare brain and/or behaviour measures in people who vary on common genetic variants.

I'm learning a lot by being forced to read research outside my area, but I'm struck by how difficult many of these papers are to follow. I'm neither a statistician nor a geneticist, but I have nodding acquaintance with both disciplines, as well as with neuroscience, yet in many cases I find myself struggling to make sense of what researchers did and what they found. Some papers that have taken hours of reading and re-reading to just get at the key information that we are seeking for our analysis, i.e. what was the largest association that was reported.

This is worrying for the field, because the number of people competent to review such papers will be extremely small. Good editors will, of course, try to cover all bases by finding reviewers with complementary skill sets, but this can be hard, and people will be understandably reluctant to review a highly complex paper that contains a lot of material beyond their expertise. I remember a top geneticist on Twitter a while ago lamenting that when reviewing papers they often had to just take the statistics on trust, because they had gone beyond the comprehension of all but a small set of people. The same is true, I suspect, for neuroscience. Put the two disciplines together and you have a big problem.

I'm not sure what the solution is. Making raw data available may help, in that it allows people to check analyses using more familiar methods, but that is very time-consuming and only for the most dedicated reviewer.

Do others agree we have a problem, or is it inevitable that as things get more complex the number of people who can understand scientific papers will contract to a very small set?

Saturday, 3 September 2016

Some thoughts on the Statcheck project

Yesterday, a piece in Retractionwatch covered a new study, in which results of automated statistics checks on 50,000 psychology papers are to be made public on the PubPeer website.

I had advance warning, because a study of mine had been included in what was presumably a dry run, and this led to me receiving an email on 26^th August as follows:

Assuming someone had a critical comment on this paper, I duly clicked on the link, and had a moment of double-take when I read the comment.

Now, this seemed like overkill to me, and I posted a rather grumpy tweet about it. There was a bit of to and fro on Twitter with Chris Hartgerink, one of the researchers on the Statcheck project, and with the folks at Pubpeer, where I explained why I was grumpy and they defended their approach; as far as I was concerned it was not a big deal, and if nobody else found this odd, I was prepared to let it go.

But then a couple of journalists got interested, and I sent them a more detailed thoughts.

I was quoted in the Retraction Watch piece, but I thought it worth reporting my response in full here, because the quotes could be interpreted as indicating I disapprove of the Statcheck project and am defensive about errors in my work. Neither of those is true. I think the project is an interesting piece of work; my concern is solely with the way in which feedback to authors is being implemented. So here is the email I sent to journalists in full:

I am in general a strong supporter of the reproducibility movement and I agree it could be useful to document the extent to which the existing psychology literature contains statistical errors.

However, I think there are 2 problems with how this is being done in the PubPeer study.

1. The tone of the PubPeer comments will, I suspect alienate many people. As I argued on Twitter, I found it irritating to get an email saying a paper of mine had been discussed on PubPeer, only to find that this referred to a comment stating that zero errors had been found in the statistics of that paper.

I don't think we need to be told that - by all means report somewhere a list of the papers that were checked and found to be error-free, but you don't need to personally contact all the authors and clog up PubPeer with comments of this kind.

My main concern was that during an exceptionally busy period, this was just another distraction from other things. Chris Hartgerink replied that I was free to ignore the email, but that would be extremely rash because a comment on PubPeer usually means that someone has a criticism of your paper.

As someone who works on language, I also found the pragmatics of the communication non-optimal. If you write and tell someone that you've found zero errors in their paper, the implication is that this is surprising, because you don't go around stating the obvious*. And indeed, the final part of the comment basically said that your work may well have errors in it and even though they hadn't found them, we couldn't trust it.

Now at the same time as having that reaction, I appreciate this was a computer-generated message, written by non-native English speakers, that I should not take it personally, and no slur on my work was intended. And I would like to know if errors were found in my stats, and it is entirely possible that there are some, since none of us is perfect. So I don't want to over-react, but I think that if I, as someone basically sympathetic to this agenda, was irritated by the style of the communication, then the odds are this will stoke real hostility for those who are already dubious about what has been termed 'bullying' and so on by people interested in reproducibility.

2. I'll be interested to see how this pans out for people where errors are found.

My personal view is that the focus should be on errors that do change the conclusions of the paper.

I think at least a sample of these should be hand-checked so we have some idea of the error rate - I'm not sure if this has been done, but the PubPeer comment certainly gave no indication of that - it just basically said there's probably an error in your stats but we can't guarantee that there is, putting the onus on the author to then check it out.

If it's known that on 99% of occasions the automated check is accurate, then fine. If the accuracy is only 90% I'd be really unhappy about the current process as it would be leading to lots of people putting time into checking their papers on the basis of an insufficiently sensitive diagnostic. It would make the authors of the comments look frankly lazy in stirring up doubts about someone's work and then leaving them to check it out.

In epidemiology the terms sensitivity and specificity are used to refer to the accuracy of a diagnostic test. Minimally if the sensitivity and specificity of the automated stats check is known, then those figures should be provided with the automated message.

The above was written before Dalmeet drew my attention to the second paper, in which errors had been found. Here’s how I responded to that:

I hadn't seen the 2nd paper - presumably because I was not the corresponding author on that one. It's immediately apparent that the problem is that F ratios have been reported with one degree of freedom, when there should be two. In fact, it's not clear how the automated program could assign any p-value in this situation.

I'll communicate with the first author, Thalia Eley, about this, as it does need fixing for the scientific record, but, given the sample size (on which the second, missing, degree of freedom is based), the reported p-values would appear to be accurate.

I have added a comment to this effect on the PubPeer site.

* I was thinking here of Gricean maxims, especially maxim of relation.

Thursday, 1 September 2016

Why I still use Excel

The Microsoft application, Excel, was in the news for all the wrong reasons last week. A paper in Genome Biology documented how numerous scientific papers had errors in their data because they had used default settings in Excel, which had unhelpfully converted gene names to dates or floating point numbers. It was hard to spot as it didn't do it to all gene names, but, for instance, the gene Septin 2, with acronym SEPT2 would be turned into 2006/09/02. This is not new: this paper in 2004 documented the problem, but it seems many people weren't aware of it, and it is now estimated that the literature on genetics is riddled with errors as a consequence.

This isn't the only way Excel can mess up your data. If you want to enter a date, you need to be very careful to ensure you have the correct setting. If you are in the UK and you enter a date like 23/4/16, then it will be correctly entered as 23rd April, regardless of the setting. But if you enter 12/4/16, it will be treated as 4^th December if you are on US settings and as 12^th April if you are on UK settings.

Then there is the dreaded autocomplete function. This can really screw things up by assuming that if you start typing text into a cell, you want it the same as a previous entry in that column that begins with the same sequence of letters. Can be a boon and a time-saver in some circumstances, but a way to introduce major errors in others.

I've also experienced odd bugs in Excel's autofill function, which makes it easy to copy a formula across columns or rows. It's possible for a file to become corrupted so that the cells referenced in the formula are wrong. Such errors are also often introduced by users, but I've experienced corrupted files containing formulae, which is pretty scary.

The response to this by many people is to say serious scientists shouldn't use Excel. It's just too risky having software that can actively introduce errors into your data entry or computations. But people, including me, persist in using it, and we have to consider why.

So what are the advantages of keeping going with Excel?

Well, first, it usually comes ~~for free~~ with Microsoft computers, so it is widely available ~~free of charge~~*. This means most people will have some familiarity with it –though few both to learn how to use it properly.

Second, you can scan a whole dataset easily: it's very direct scrolling through rows or columns. You can use Freeze Panes to keep column and row headers static, and you can hide columns or rows that you don't want getting in the way.

Third, you can format a worksheet to facilitate data entry. A lot of people dismiss colour coding of columns as prettification, but it can help ensure you keep the right data in the right place. Data validation is easily added and can ensure that only valid values are entered.

Fourth, you can add textual comments – either as a row in their own right, or using the Comment function.

Fifth, you can very easily plot data. Better still, you can do so dynamically, as it is easy to create a plot and then change the data range it refers to.

Sixth, you can use lookup functions. In my line of work we need to convert raw scores to standard scores based on normative data. This is typically done using tables of numbers in a manual, which makes it very easy to introduce human error. I have found it is worth investing time to get the large table of numbers entered as a separate worksheet, so we can then automate the lookup functions.

Many of my datasets are slowly generated over a period of years: we gather large amounts of data on individuals, record responses on paper, and then enter the data as it comes in. The people doing the data entry are mostly research assistants who are relatively inexperienced. So having a very transparent method of data entry, which can include clear instructions on the worksheet, and data validation, is important. I'm not sure there are other options of software that would suit my needs.

But I'm concerned about errors and need strategies to avoid them. So here are the working rules I have developed so far.

1. Before you begin, turn off any fancy Excel defaults you don't need. And if entering gene names, ensure they are entered as text.

2. Double data entry is crucial: have the data re-entered from scratch when the whole dataset is in, and cross-check the data files. This costs money but is important for data quality. There are always errors.

3. Once you have the key data entered and checked, export it to a simple, robust format such as tab-separated text. It can then be read and re-used by people working with other packages.

4. The main analysis should be done using software that generates a script that means the whole analysis can be reproduced. Excel is therefore not suitable. I increasingly use R, though SPSS is another option, provided you keep a syntax file.

5. I still like to cross-check analyses using Excel – even if it is just to do a quick plot to ensure that the pattern of results is consistent with an analysis done in R.

Now, I am not an expert data scientist – far from it. I'm just someone who has been analysing data for many years and learned a few things along the way. Like most people, I tend to stick with what I know, as there are costs in mastering new skills, but I will change if I can see benefits. I've become convinced that R is the way to go for data analysis, but I do think Excel still has its uses, as a complement to other methods for storing, checking and analysing data. But, given the recent crisis in genetics, I'd be interested to hear what others think about optimal, affordable approaches to data entry and data analysis – with or without Excel.

*P.S. I have been corrected on Twitter by people who have told me it is NOT free; the price for Microsoft products may be bundled in with the cost of the machine, but someone somewhere is paying for it!

Update: 2nd September 2016
There was a surprising amount of support for this post on Twitter, mixed in with anticipated criticism from those who just told me Excel is rubbish. What's interesting is that very few of the latter group could suggest a useable alternative for data entry (and some had clearly not read my post and thought I was advocating using Excel for data analysis). And yes, I don't regard Access as a usable alternative: been there tried that, and it just induced a lot of swearing.
There was, however, one suggestion that looks very promising and which I will chase up
@stephenelane suggested I look at REDcap.
Website here: https://projectredcap.org/

Meanwhile, here's a very useful link on setting up Excel worksheets to avoid later problems that came in via @tjmahr on Twitter
http://kbroman.org/dataorg/

Update 4th October 2016
Just to say we have trialled REDCap, and I love it. Very friendly interface. Extremely limited for any data manipulation/computation, but that doesn't matter, as you can readily import/export information into other applications for processing. It's free but institution needs to be signed up for it: Oxford is not yet fully functional with it, but we were able to use it via a colleague's server for a pilot.