Wednesday 24 October 2018

Has the Society for Neuroscience lost its way?

The tl;dr version: The Society for Neuroscience (SfN) makes humongous amounts of money from its journal and meetings, but spends very little on helping its members, while treating overseas researchers with indifference bordering on disdain.

I first became concerned about the Society for Neuroscience back in 2010 when I submitted a paper to the Journal of Neuroscience. The instructions to authors explained that there was a submission fee (at the time about $50). Although I'd never come across such a practice before, I reckoned it was not a large sum, and so went ahead. The Instructions for Authors explained that there was a veto on citation of unpublished work. I wanted to cite a paper of mine that had been ‘accepted in principle’ but needed minor changes, and I explained this in my cover letter. Nevertheless, the paper was desk-rejected because of this violation. A week later, after the other paper was accepted, I updated the manuscript and resubmitted it, but was told that I had to pay another submission fee. I got pretty grumbly at this point, but given that we'd spent a lot of time formatting the paper for J Neuroscience, I continued with the process. We had an awful editor (the Automaton described here), but excellent reviewers, and the paper was ultimately accepted.

But then we were confronted with the publication fee and charges for using colour figures. These were substantial – I can’t remember the details but it was so much that it turned out cheaper for all the authors to join the SfN, which made us eligible for reduced rates on publication fees. So for one year I became a member of the society. 

The journal’s current policy on fees can be found here. Basically, the submission fee is now $140, but this is waived if first and last authors are SfN members (at a cost of $200 per annum for full members, going down to $150 for postdocs and $70 for postgrads). The publication fee is $1,260 for members and $1,890 for non-members, with an extra $2,965 if you want the paper to be made open access.

There are some reductions for those working in resource-restricted countries, but the sums involved are still high enough to act as a deterrent. I used Web of Science to look at country of origin for Journal of Neuroscience papers since 2014, and there’s no sign that those from resource-restricted countries are taking advantage of the magnanimous offer to reduce publication fees by up to 50%.
Countries of origin from papers in Journal of Neuroscience (2014-2018)
The justification given for these fees is that ‘The submission fee covers a portion of the costs associated with peer review’, with the implication that the society is subsidising the other portion of the costs. Yet, when we look at their financial statements (download pdf here), they tell a rather different story. As we can see in the table on p 4, in 2017 the expenses associated with scientific publications came to $4.84 million, whereas the income from this source was $7.09 million.

But maybe the society uses journal income to subsidise other activities that benefit its nearly 36,000 members? That’s a common model in societies I’m involved in. But, no, the same financial report shows that the cost of the annual SfN meeting in 2017 was $9.5 million, but the income was $14.8 million. If we add in other sources of income, such as membership dues, we can start to understand how it is that the net assets of the society increased from $46.6 million in 2016 to $58.7 million in 2017.
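
For anyone who wants to check the arithmetic, here is a minimal Python sketch using only the figures quoted above from the 2017 financial statements. The variable and function names are illustrative, and all amounts are in millions of US dollars.

```python
# Figures in millions of US dollars, as quoted above from the 2017 report.
publications = {"income": 7.09, "expenses": 4.84}
annual_meeting = {"income": 14.8, "expenses": 9.5}
net_assets = {"2016": 46.6, "2017": 58.7}


def surplus(item):
    """Income minus expenses, rounded to avoid floating-point noise."""
    return round(item["income"] - item["expenses"], 2)


publications_surplus = surplus(publications)
meeting_surplus = surplus(annual_meeting)
net_assets_growth = round(net_assets["2017"] - net_assets["2016"], 2)

print(f"Publications surplus: ${publications_surplus}m")
print(f"Annual meeting surplus: ${meeting_surplus}m")
print(f"Growth in net assets, 2016 to 2017: ${net_assets_growth}m")
```

On these numbers, publications and the annual meeting together bring in about $7.5 million more than they cost, and net assets grew by about $12 million in a single year.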

This year, SfN has had a new challenge, which is that significant numbers of scientists are being denied visas to attend the annual meeting, as described in this piece by the Canadian Association for Neuroscience. This has led to calls for the annual meeting to be held outside the US in future years. The SfN President has denounced the visa restrictions as a thoroughly bad thing. However, it seems that SfN has not been sympathetic to would-be attendees who joined the society in order to attend the meeting, only to find that they would not be able to do so. I was first alerted to this on Twitter by this tweet:

This attracted a fair bit of adverse publicity for SfN, and just over a week later Chris heard back from the Executive Director of SfN, who explained that whereas they could refund registration fees for those who could not attend, they were not willing to refund membership fees. No doubt for an organisation that is sitting on long-term investments of $71.2 million (see table below), the $70 membership fee for a student is chicken feed. But I suspect it doesn’t feel like that to the student, who has probably also incurred costs for submitting an unsuccessful visa application.

Table from p 8 of SfN Annual Financial Report for 2017; The 'alternative investments' are mostly offshore funds in the Cayman Islands and elsewhere
There appears to be a mismatch between the lofty ideals described in SfN's mission statement and their behaviour. They seem to have lost their way: instead of being an organisation that exists to promote neuroscience and help their members, the members are rather regarded as nothing but a source of income, which is then stashed away in investments. It’s interesting to see that under Desired Outcomes, the Financial Reserve Strategy section of the mission statement has: ‘Strive to achieve end of year financial results that generate net revenues between $500,000 and $1 million in annual net operating surplus.’ That is reasonable and prudent for a large organisation with employees and property. But SfN is not achieving that goal: they are making considerably more money than their own mission statement recommends.

That money could be put to good use. In particular, given SfN’s stated claim of wanting to support neuroscience globally, they could offer grants for scientists in resource-poor countries to buy equipment, pay for research assistants or attend meetings. Quite small sums could be transformational in such a context. As far as I can see, SfN currently offers a few awards, but some of these are paid for by external donations, and, in relation to their huge reserves, the sums are paltry. My impression is that other, much smaller, societies do far more with limited funds than SfN does with its bloated income.

Maybe I’m missing something. I’m no longer a member of SfN, so it’s hard to judge. Are there SfN members out there who think the society does a good job for its membership?

Saturday 13 October 2018

Working memories: a brief review of Alan Baddeley's memoir

This post was prompted by Tom Hartley, who asked if I would be willing to feature an interview with Alan Baddeley on my blog.  This was excellent timing, as I'd just received a copy of Working Memories from Alan, and had planned to take it on holiday with me. It proved to be a fascinating read. Tom's interview, which you can find here, gives a taster of the content.

The book was of particular interest to me, as Alan played a big role in my career by appointing me to a post I held at the MRC Applied Psychology Unit (APU) from 1991 to 1998, and so I'm familiar with many of the characters and the ideas that he talks about in the book. His work covered a huge range of topics and collaborations, and the book, written at the age of 84, works both as a history of cognitive psychology and as a scientific autobiography.

Younger readers may be encouraged to hear that Alan's early attempts at a career were not very successful, and his career took off only after a harrowing period as a hospital porter and a schoolteacher, followed by a post at the Burden Neurological Institute, studying the effects of alcohol, where his funds were abruptly cut off because of a dispute between his boss and another senior figure. He was relieved to be offered a place at the MRC Applied Psychology Unit (APU) in Cambridge, eventually doing a doctorate there under the supervision of Conrad (whose life I wrote about here), experimenting on memory skills in sailors and postmen.

I had known that Alan's work covered a wide range of areas, but was still surprised to find just how broad his interests were. In particular, I was aware he had done work on memory in divers, but had thought that was just a minor aspect of his interests. That was quite wrong: this was Alan's main research interest over a period of years, where he did a series of studies to determine how far factors like cold, anxiety and air quality during deep dives affected reasoning and memory: questions of considerable interest to the Royal Navy among others.

After periods working at the Universities of Sussex and Stirling, Alan was appointed in 1974 as Director of the MRC APU, where he had a long and distinguished career until his formal retirement in 1995. Under his direction, the Unit flourished, pursuing a much wider range of research, with strong external links. Alan enjoyed working with others, and had collaborations around the world.  After leaving Cambridge,  he took up a research chair at the University of Bristol, before settling at the University of York, where he is currently based.

I was particularly interested in Alan's thoughts on applied versus theoretical research. The original  APU was a kind of institution that I think no longer exists: the staff were expected to apply their research skills to address questions that outside agencies, especially government, were concerned with. The earliest work was focused on topics of importance during wartime: e.g., how could vigilance be maintained by radar operators, who had the tedious task of monitoring a screen for rare but important events. Subsequently, unit staff were concerned with issues affecting efficiency of government operations during peacetime: how could postcodes be designed to be memorable? Was it safe to use a mobile phone while driving? Did lead affect children's cognitive development?  These days, applied problems are often seen as relatively pedestrian, but it is clear that if you take highly intelligent researchers with good experimental skills and pose them this kind of challenge, the work that ensues will not only answer the question, but may also lead to broader theoretical insights.

Although Alan's research included some work with neurological patients, he would definitely call himself a cognitive psychologist, and not a neuroscientist. He notes that his initial enthusiasm for functional brain imaging died down after finding that effects of interest were seldom clearcut and often failed to replicate. His own experimental approaches to evaluate aspects of memory and cognition seemed to throw more light than neuroimaging on deficits experienced by patients.

The book is strongly recommended for anyone interested in the history of psychology. As with all of Alan's writing, it is immensely readable because of his practice of writing books by dictation as he goes on long country walks: this makes for a direct and engaging style. His reflections on the 'cognitive revolution' and its impact on psychology are highly relevant for today's psychologists. As Alan says in the interview "... It's important to know where our ideas come from. It's all too tempting to think that whatever happened in the last two or three years is the cutting edge and that's all you need to know. In fact, it's probably the crest of a breaking wave and what you need to know is where that wave came from."

Saturday 15 September 2018

An index of neighbourhood advantage from English postcode data

Screenshot from
Densely packed postcodes appear grey: you need to expand the map to see colours
The Ministry of Housing, Communities and Local Government has a website which provides an ‘index of multiple deprivation’ for every postcode in England.  This is a composite index based on typical income, employment, education, health, crime, housing and living environment for each of 32,844 postcodes in 2015. You can also extract indices for the component factors that contribute to the index, which are explained further here. And there is a fascinating interactive website where you can explore the indices on a map of England.

Researchers have used the index of multiple deprivation as an overall measure of environmental factors that might affect child development, but it has one major drawback. The number that the website gives you is a rank from 1 to 32,844. This means it is not normally distributed, and not easy to interpret. You are also given decile bands, but these are just less precise versions of the ranks – and like ranks, have a rectangular, rather than a normal distribution (with each band containing 10% of the postcodes). If you want to read more about why rectangularly distributed data are problematic, see this earlier blogpost.

I wanted to use this index, but felt it would make sense to convert the ranks into z-scores. This is easily done, as z-scores are simply rescaled proportions. Here’s what you do:

Use the website to convert the postcode to an index of deprivation: in fact, it’s easiest to paste in a list of postcodes and you then get a set of indices for each one, which you can download as either a .csv or an .xlsx file. The index of multiple deprivation is given in the fifth column.

To illustrate, I put in the street address where I grew up, IG38NP, which corresponds to a multiple deprivation index of 12596.

In Excel, you can just divide the multiple deprivation index by 32844, to get a value of .3835, which you can then convert to a z-score using the NORMSINV function. Or, to do this in one step, if you have your index of multiple deprivation in cell A2, you type
=NORMSINV(A2/32844)

This gives a value of -0.296, which is the corresponding z-score. I suggest calling it the ‘neighbourhood advantage score’ – so it’s clear that a high score is good and a low score is bad.

If you are working in R, you can just use the command:
neighbz = qnorm(deprivation_index/depmax)
where neighbz is the neighbourhood advantage score,  depmax has been assigned to 32844 and deprivation_index is the index of multiple deprivation.

Obviously, I’ve presented simplified commands here, but in either Excel or R it is easy to convert a whole set of postcodes in one go.
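
For those who prefer Python, the following is a minimal sketch of the same conversion applied to a whole list of ranks at once. It uses only the standard library; the function name and the second and third ranks in the list are made up for illustration, and 32844 is the maximum rank given above.

```python
from statistics import NormalDist

DEPMAX = 32844  # maximum rank, as given above


def neighbourhood_advantage(deprivation_index, depmax=DEPMAX):
    """Convert a deprivation rank to a z-score, mirroring qnorm(rank / depmax)."""
    proportion = deprivation_index / depmax
    # inv_cdf is only defined for 0 < p < 1, so the very top rank
    # (proportion exactly 1) raises statistics.StatisticsError here,
    # whereas R's qnorm would return Inf.
    return NormalDist().inv_cdf(proportion)


# Hypothetical list of ranks; the first is the worked example above.
ranks = [12596, 16422, 3284]
scores = [round(neighbourhood_advantage(r), 3) for r in ranks]
print(scores)
```

The first score reproduces the -0.296 obtained above. A rank exactly halfway up the scale gives a score of zero, and ranks below the midpoint give negative scores.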

It is, of course, important to keep in mind that this is a measure of the neighbourhood a person lives in, and not of the characteristics of the individual. Postcode indicators may be misleading in mixed neighbourhoods, e.g. where gentrification has occurred, so rich and poor live side by side. And the different factors contributing to the index may be dissociated. Nevertheless, I think this index can be useful for providing an indication of whether a sample of individuals is representative of the population of England. In psychology studies, volunteers tend to come from more advantaged backgrounds, and this provides one way to quantify this effect.

Sunday 26 August 2018

Should editors edit reviewers?

How Einstein dealt with peer review: from

This all started with a tweet from Jesse Shapiro under the #shareyourrejections hashtag:

JS: Reviewer 2: “The best thing these authors [me and @ejalm] could do to benefit this field of study would be to leave the field and never work on this topic again.” Paraphrasing only slightly.

This was quickly followed by another example:
Bill Hanage: #ShareYourRejections “this paper is not suitable for publication in PNAS, or indeed anywhere.”

Now, both of these are similarly damning, but there is an important difference. The first one criticises the authors, the second one criticises the paper. Several people replied to Jesse’s tweet with sympathy, for instance:

Jenny Rohn: My condolences. But Reviewer 2 is shooting him/herself in the foot - most sensible editors will take a referee's opinion less seriously if it's laced with ad hominem attacks.

I took a different tack, though:
DB: A good editor would not relay that comment to the author, and would write to the reviewer to tell them it is inappropriate. I remember doing that when I was an editor - not often, thankfully. And reviewer apologised.

This started an interesting discussion on Twitter:

Ben Jones: I handled papers where a reviewer was similarly vitriolic and ad hominem. I indicated to the reviewer and authors that I thought it was very inappropriate and unprofessional. I’ve always been very reluctant to censor reviewer comments, but maybe should reconsider that view

DB: You're the editor. I think it's entirely appropriate to protect authors from ad hominem and spiteful attacks. As well as preventing unnecessary pain to authors, it helps avoid damage to the reputation of your journal

Chris Chambers: Editing reviews is dangerous ground imo. In this situation, if the remainder of the review contained useful content, I'd either leave the review intact but inform the authors to disregard the ad hom (& separately I'd tell reviewer it's not on) or dump the whole review.

DB: I would inform reviewer, but I don’t think it is part of editor’s job to relay abuse to people, esp. if they are already dealing with pain of rejection.

CC: IMO this sets a dangerous precedent for editing out content that the editor might dislike. I'd prefer to keep reviews unbiased by editorial input or drop them entirely if they're junk. Also, an offensive remark or tone could in some cases be embedded w/i a valid scientific point.

Kate Jeffery: I agree that editing reviewer comments without permission is dodgy but also agree that inappropriate comments should not be passed back to authors. A simple solution is for editor to revise the offending sentence(s) and ask reviewer to approve change. I doubt many would decline.

A middle road was offered by Lisa DeBruine:
LdB: My solution is to contact the reviewer if I think something is wrong with their review (in either factual content or professional tone) and ask them to remove or rephrase it before I send it to the authors. I’ve never had one decline (but it doesn’t happen very often).

I was really surprised by how many people felt strongly that the reviewer’s report was in some sense sacrosanct and could and should not be altered. I’ve pondered this further, but am not swayed by the arguments.

I feel strongly that editors should be able to distinguish personal abuse from robust critical comment, and that, far from being inappropriate, it is their duty to remove the former from reviewer reports. And as for Chris’s comment: ‘an offensive remark or tone could in some cases be embedded w/i a valid scientific point’ – the answer is simple. You rewrite to remove the offensive remark; e.g. ‘The authors seem clueless about the appropriate way to run a multilevel model’ could be rewritten to ‘The authors should take advice from a statistician about their multilevel model, which is not properly specified’. And to be absolutely clear, I am not talking about editing out comments that are critical of the science, or which the editor happens to disagree with. If a reviewer got something just plain wrong, I’m okay with giving a clear steer in the editor’s letter, e.g.: ‘Reviewer A suggests you include age as a covariate. I notice you have already done that in the analysis on p x, so please ignore that comment.’ I am specifically addressing comments that are made about the authors rather than the content of what they have written. A good editor should find that an easy distinction to make. From the perspective of an author, being called out for getting something wrong is never comfortable: being told you are a useless person because you got something wrong just adds unnecessary pain.

Why do I care about this? It’s not just because I think we should all be kind to each other (though, in general, I think that’s a good idea). There’s a deeper issue at stake here. As editors, we should work to reinforce the idea that personal disputes should have no place in science. Yes, we are all human beings, and often respond with strong emotions to the work of others. I can get infuriated when I review a paper where the authors appear to have been sloppy or stupid. But we all make mistakes, and are good at deluding ourselves. One of the problems when you start out is that you don’t know what you don’t know: I learned a lot from having my errors pointed out by reviewers, but I was far more likely to learn from this process if the reviewer did not adopt a contemptuous attitude. So, as reviewers, we should calm down and self-edit, and not put ad hominem comments in our reviews. Editors can play a role in training reviewers in this respect.

For those who feel uncomfortable with my approach - i.e. edit the review and tell the reviewer why you have done so – I would recommend Lisa DeBruine’s solution of raising the issue with the reviewer and asking them to amend their review. Indeed, in today’s world where everything is handled by automated systems, that may be the only way of ensuring that an insulting review does not go to the author (assuming the automated system lets you do that!).

Finally, everyone agreed that this does not seem to be a common problem, so it is perhaps not worth devoting much space to, but I'm curious to know how other editors respond to this issue.

Monday 20 August 2018

Matlab vs open source: Costs and benefits to scientists and society

An interesting twitter thread came along yesterday, started by this query from Jan Wessel (@wessel_lab):

Quick thread of (honest) questions for the numerous people on here that subscribe to the position that sharing code in MATLAB ($) is bad open-science practice compared to open source languages (e.g., Python). What should I do as a PI that runs a lab whose entire coding structure is based (publicly shared) MATLAB code? Some say I should learn an open-source language and change my lab’s procedures over to it. But how would that work in practice? 

When I resort to blogging, it’s often because someone has raised a question that has captured my interest because it does not have a simple answer. I have made a Twitter moment to store the rest of Jan’s thread and some of the responses to it, as they raise important points which have broad application.

In part, this is an argument about costs and benefits to the individual scientist and the community. Sometimes these can be aligned, but in this case, there is some conflict, because those who can’t afford Matlab would not be able to run Jan’s code. If he were to move to Python, then anyone would be able to do so.

His argument is that he has invested a lot of time in learning Matlab, has a good understanding of how Matlab code works, and feels competent to advise his trainees in it. Furthermore, he works in the field of EEG, where there are whole packages developed to do the complex analysis involved, and Matlab is the default in this field. So moving to another programming language would not only be a big time sink, but would also make him out of step with the rest of the field.

There was a fair bit of division of opinion in the replies. On the one hand, there were those who thought this was a non-issue. It was far more important to share code than to worry about whether it was written in a proprietary language. And indeed, if you are well-enough supported to be doing EEG research, then it’s likely your lab can afford the licensing costs.

I agree with the first premise: just having the code available can be helpful in understanding how an analysis was done, even if you can’t run it. And certainly, most of those in EEG research are using Matlab. However, I’m also aware that for those in resource-limited countries, EEG is a relatively cheap technology for doing cognitive neuroscience, so I guess there will be those who would be able to get EEG equipment, but for whom the Matlab licensing costs are prohibitive.

But the replies emphasised another point: the landscape is continually changing. People have been encouraging me to learn Python, and I’m resisting only because I’m starting to feel too old to learn yet another programming language. But over the years, I’ve had to learn Basic, Matlab and R, as well as some arcane stuff for generating auditory stimuli whose name I can’t even remember. But I’ve looked at Jan’s photo on the web, and he looks pretty young, so he doesn’t have my excuse. So on that basis, I’d agree with those advising to consider making a switch. Not just to be a good open scientist, but in his own interests, which involves keeping up to date. As some on the thread noted, many undergrads are now getting training in Python or R, and sooner or later open source will become the default.

In the replies there were some helpful suggestions from people who were encouraging Jan to move to open source but in the least painful way possible. And there was reassurance that there are huge savings in learning a new language: it’s really not like going back to square one. That’s my experience: in fact, my knowledge of Basic was surprisingly useful when learning Matlab.

So the bottom line seems to be, don’t beat yourself up about it. Posting Matlab code is far better than not posting any code. But be aware that things are changing, and sooner rather than later, you’ll need to adapt. The time costs of learning a new language may prove trivial in the long term, against the costs of being out of date. But I can state with total confidence that learning Python will not be the end of it: give it a few years and something else will come along.

When I was first embarking on an academic career, I remember looking at the people who were teaching me, who, at the age of around 40, looked very old indeed. And I thought it must be nice for them, because they have worked hard, learned stuff, and now they know it all and can just do research and teach. When I got to 40, I had the awful realisation that the field was changing so fast, that unless I kept learning new stuff, I would get left behind. And it hasn't stopped over the past 25 years!

Saturday 11 August 2018

More haste less speed in calls for grant proposals

Helpful advice from the World Bank

This blogpost was prompted by a funding call announced this week by the Economic and Social Research Council (ESRC), which included the following key dates:
  • Opening date for proposals – 6 August 2018 
  • Closing date for proposals – 18 September 2018 
  • PI response invited – 23 October 2018 
  • PI response due – 29 October 2018 
  • Panel – 3 December 2018 
  • Grants start – 14 February 2019 
As pointed out by Adam Golberg (@cash4questions), Research Development Manager at Nottingham University, on Twitter, this is very short notice to prepare an application for substantial funding:
I make this about 30 working days notice. For a call issued in August. For projects of 36 months, up to £900k - substantial, for social sciences. With only one bid allowed to be led from each institution, so likely requiring an internal sift. 
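
Adam's estimate is easy to check. The following is a minimal Python sketch that counts Monday-to-Friday days between the opening and closing dates listed above. The function name is illustrative, and the holiday set rests on my assumption that the only public holiday in the window is the English summer bank holiday on 27 August 2018.

```python
from datetime import date, timedelta


def working_days(start, end, holidays=()):
    """Count Monday-to-Friday days from start to end inclusive, skipping holidays."""
    count = 0
    day = start
    while day <= end:
        if day.weekday() < 5 and day not in holidays:
            count += 1
        day += timedelta(days=1)
    return count


call_opens = date(2018, 8, 6)
call_closes = date(2018, 9, 18)

# Assumption: the only public holiday in this window is the English
# summer bank holiday on 27 August 2018.
bank_holidays = {date(2018, 8, 27)}

notice = working_days(call_opens, call_closes, bank_holidays)
print(notice)  # 31
```

Counting both the opening and closing dates, that gives 31 working days, in line with the "about 30" in the tweet, and that is before allowing for any internal sift or annual leave.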

I thought it worth raising this with ESRC, and they replied promptly, saying:
To access funds for this call we’ve had to adhere to a very tight spending timeframe. We’ve had to balance the call opening time with a robust peer review process and a Feb 2019 project start. We know this is a challenge, but it was a now or never funding opportunity for us.
They suggested I email them for more information, and I’ve done that, so will update this post if I hear more. I’m particularly curious about what is the reason for the tight spending timeframe and the inflexible February 2019 start.

This exchange led to discussion on Twitter which I have gathered together here.

It’s clear from the responses that this kind of time-frame is not unusual, and I have been sent some other examples. For instance, this ESRC Leadership Fellowship (£100,000 for 12 months) had a call for proposals issued on 16th November 2017, with a deadline for submissions of 3 January. When you factor in that most universities shut down from late December until early January, and so this would need to be with administrators before the Christmas break, this gives applicants around 30 days to construct a competitive proposal. But it’s not only ESRC that does this, and I am less interested in pointing the finger at a particular funder – who may well be working under pressures outside their control - than just raising the issue of why this needs a rethink. I see five problems with these short lead times:

1. Poorer quality of proposals 
The most obvious problem is that a hastily written proposal is likely to be weaker than one that is given more detailed consideration. The only good thing you might say about the time pressure is that it is likely to reduce the number of proposals, which reduces the load on the funder’s administration. It’s not clear, however, whether this is an intended consequence.

2. Stress on academic staff 
There is ample evidence that academic staff in the UK have high stress levels, often linked to a sense of increasing demands and high workload. A good academic shows high attention to detail and is at pains to get things right: research is not something that can be done well under tight time pressure. So holding up the offer of a large grant with only a short time period to prepare a proposal is bound to increase stress: do you drop everything else to focus on grant-writing, or pass by the opportunity to enter the competition?

Where the interval between the funding call and the deadline occurs over a holiday period, some might find this beneficial, as other demands such as teaching are lower. But many people plan to take a vacation, and should be able to have a complete escape from work for at least a week or two. Others will have scheduled the time for preparing lectures, doing research, or writing papers. Having to defer those activities in order to meet a tight deadline just induces more sense of overload and guilt at having a growing backlog of work.

3. Equity issues 
These points about vacations are particularly pertinent for those with children at home during the holidays, as pointed out in a series of tweets by Melissa Terras, Professor of Digital Cultural Heritage at Edinburgh University, who said:
I complained once to the AHRC about a call announced in November with a closing date of early January - giving people the chance to work over the Xmas shutdown on it. I wasn't applying to the call myself, but pointed out that it meant people with - say - school age kids - wouldn't have a "clear" Xmas shutdown to work on it, so it was prejudice against that cohort. They listened, apologised, and extended the deadline for a month, which I was thankful for. But we shouldn't have to explain this to them. Have RCUK done their implicit bias training?

4. Stress on administrative staff 
One person who contacted me via email pointed out that many funders, including ESRC, ask institutions to filter out uncompetitive proposals through internal review. That could mean senior research administrators organising exploratory workshops, soliciting input from potential PIs, having people present their ideas, and considering collaborations with other institutions. None of that will be possible in a 30-day time frame. The administrators who do the routine work of checking funding bids for accuracy and for compliance with university and funder requirements are, I suspect, used to dealing with stressed researchers who expect all of this to be done with rapid turnaround; a funding scheme that virtually guarantees everything is done in a rush just makes this worse.

5. Perception of unfairness 
Adding to this toxic mix, we have the possibility of diminished trust in the funding process. My own interest in this issue stems from a time a few years ago when there was a funding call for a rather specific project in my area. The call came just before Christmas, with a deadline in mid January. I had a postdoc who was interested in applying, but after discussing it, we decided not to put in a bid. Part of the reason was that we had both planned a bit of time off over Christmas, but in addition I was suspicious about the combination of short time-scale and specific topic. This made me wonder whether a decision had already been made about who to award the funds to, and the exercise was just to fulfil requirements and give an illusion of fairness and transparency.

Responses on Twitter again indicate that others have had similar concerns. For instance, Jon May, Professor in Psychology at the University of Plymouth, wrote:
I suspect these short deadline calls follow ‘sandboxes’ where a favoured person has invited their (i.e his) friends to pitch ideas for the call. Favoured person cannot bid but friends can and have written the call.
And an anonymous correspondent on email noted:
I think unfairness (or the perception of unfairness) is really dangerous – a lot of people I talk to either suspect a stitch-up in terms of who gets the money, or an uneven playing field in terms of who knew this was coming.

So what’s the solution? One option would be to insist that, at least for those dispensing public money, there should be a minimum time between a call for proposals and the submission date: about 3 months would seem reasonable to me.

Comments will be open on this post for a limited time (2 months, since we are in holiday season!) so please add your thoughts.

P.S. Just as I was about to upload this blogpost, I was alerted on Twitter to this call from the World Bank, which is a beautiful illustration of point 5 - if you weren't already well aware this was coming, there would be no hope of applying. Apparently, this is not a 'grant' but a 'contract', but the same problems noted above would apply. The website is dated 2nd August, the closing date is 15th August. There is reference to a webinar for applicants dated 9th July, so presumably some information has been previously circulated, but still with a remarkably short time lag, given that there need to be at least two collaborating institutions (including middle- and low-income countries), with letters of support from all collaborators and all end users. Oh, and you are advised ‘Please do not wait until the last minute to submit your proposal’.

Update: 17th August 2018
An ESRC spokesperson sent this reply to my query:

Thank you for getting in touch with us with your concerns about the short call opening time for the recently announced Management Practices and Employee Engagement call, and the fact that it has opened in August.

We welcome feedback from our community on the administration of funding programmes, and we will think carefully about how to respond to these concerns as we design and plan future programmes.

To provide some background to this call. It builds on an open-invite scoping workshop we held in February 2018, at which we sought input from the academic, policy and third-sector communities on the shape of a (then) potential research investment on management practices and employee engagement. We subsequently flagged the likelihood of a funding call around the topic area this summer, both at the scoping workshop itself, as well as in our ongoing engagements with the academic community.

We do our best to make sure that calls are open for as long as possible. We have to balance call opening times with a robust and appropriately timetabled peer review process, feasible project start dates, the right safeguards and compliances, and, in certain cases such as this one, a requirement to spend funds within the financial year. 

We take the concerns that you raise in your email and in your blog post of 11 August 2018 extremely seriously. The high standard of the UK's research is a result of the work of our academic community, and we are committed to delivering a system that respects and responds to their needs. As part of this, we are actively looking into ways to build in longer call lead times and/or pre-announcements of funding opportunities for potential future managed calls in this and other areas.

I would also like to stress that applicants can still submit proposals on the topic of management practices and employee engagement through our standard research grant process, which is open all year round. The peer review system and the Grant Assessment Panel does not take into account the fact that a managed call is open on a topic when awarding funding: decisions are taken based on the excellence of the proposal.

Update: 23rd August 2018
A spokesperson for the World Bank has written to note that the grant scheme alluded to in my postscript did in fact have a 2 month period between the call and submission date. I have apologised to them for suggesting it was shorter than this, and also apologise to readers for providing misleading information. The duration still seems short to me for a call of this nature, but my case is clearly not helped by providing wrong information, and I should have taken greater care to check details. Text of the response from the World Bank is below:
We noticed with some concern that in your Aug. 11 blog post, you had singled out a World Bank call for proposals as a “beautiful illustration” of a type of funding call that appears designed to favor an inside candidate. This characterization is entirely inaccurate and appears based on a misperception of the time lag between the announcement of the proposal and the deadline.
Your reference to the 2018 Call for Proposals for Collaborative Data Innovations for Sustainable Development by the World Bank and the Global Partnership for Sustainable Development Data as undermining faith in the funding process seems based on the mistaken assumption that the call was issued on or about August 2. It was not.
The call was announced June 19 on the websites of the World Bank and the GPSDD. This was two months before the closing date, a period we have deemed fair to applicants but also appropriate given our own time constraints. An online seminar was offered to assist prospective applicants, as you note, on July 9.
The seminar drew 127 attendees for whom we provided answers to 147 questions. We are still reviewing submissions for the most recent call for proposals for this project, but our call for the 2017 version elicited 228 proposals, of which 195 met criteria for external review.
As the response to the seminar and the record of submissions indicate, this funding call has been widely seen and provided numerous applicants the opportunity to respond.  To suggest that this has not been an open and fair process does not do it justice.

Here are the links with the announcement dates of June 19th

Friday 20 July 2018

Standing on the shoulders of giants, or slithering around on jellyfish: Why reviews need to be systematic

Yesterday I had the pleasure of hearing George Davey Smith (aka @mendel_random) talk. In the course of a wide-ranging lecture he recounted his experiences with conducting a systematic review. This caught my interest, as I’d recently considered the question of literature reviews when writing about fallibility in science. George’s talk confirmed my concerns that cherry-picking of evidence can be a massive problem for many fields of science.

Together with Mark Petticrew, George had reviewed the evidence on the impact of stress and social hierarchies on coronary artery disease in non-human primates. They found 14 studies on the topic, and revealed a striking mismatch between how the literature was cited and what it actually showed. Studies in this area are of interest to those attempting to explain the well-known socioeconomic gradient in health. It’s hard to unpack this in humans, because there are so many correlated characteristics that could potentially explain the association. The primate work has been cited to support psychosocial accounts of the link; i.e., the idea that socioeconomic influences on health operate primarily through psychological and social mechanisms. Demonstration of such an impact in primates is  particularly convincing, because stress and social status can be experimentally manipulated in a way that is not feasible in humans.

The conclusion from the review was stark: ‘Overall, non-human primate studies present only limited evidence for an association between social status and coronary artery disease. Despite this, there is selective citation of individual non-human primate studies in reviews and commentaries relating to human disease aetiology’ (p. e27937).

The relatively bland account in the written paper belies the stress that George and his colleague went through in doing this work. Before I tried doing one myself, I thought that a systematic review was a fairly easy and humdrum exercise. It could be if the literature were not so unruly. In practice, however, you not only have to find and synthesise the relevant evidence, but also to read and re-read papers to work out what exactly was done. Often, it’s not just a case of computing an effect size: finding the numbers that match the reported result can be challenging. One paper in the review that was particularly highly-cited in the epidemiology literature turned out to have data that were problematic: the raw data shown in scattergraphs are hard to reconcile with the adjusted means reported in a summary (see Figure below). Correspondence sent to the author apparently did not elicit a reply, let alone an explanation.

Figure 2 from Shively and Thompson (1994) Arteriosclerosis and Thrombosis Vol 14, No 5. Yellow bar added to show mean plaque areas as reported in Figure 3 (adjusted for preexperimental thigh circumference and TPC-HDL cholesterol ratio)
Even if there were no concerns about the discrepant means, the small sample size and influential outliers in this study should temper any conclusions. But those using this evidence to draw conclusions about human health focused on the ‘five-fold increase’ in coronary disease in dominant animals who became subordinate.

So what impact has the systematic review achieved? Well, the first point to note is that the authors had a great deal of difficulty getting it accepted for publication: it would be sent to reviewers who worked on stress in monkeys, and they would recommend rejection. This went on for some years: the abstract was first published in 2003, but the full paper did not appear until 2012.

The second, disappointing conclusion comes from looking at citations of the original studies reviewed by Petticrew and Davey Smith in the human health literature since their review appeared. The systematic review garnered 4 citations in the period 2013-2015 and just one during 2016-2018. The mean number of citations for the 14 articles covered in their meta-analysis was 2.36 for 2013-2015, and 3.00 for 2016-2018. The article that was the source of the Figure above had six citations in the human health literature in 2013-2015 and four in 2016-2018. These numbers aren’t sufficient for more than impressionistic interpretation, and I only did a superficial trawl through abstracts of citing papers, so I am not in a position to determine if all of these articles accepted the study authors’ conclusions. However, the pattern of citations fits with past experience in other fields showing that when cherry-picked facts fit a nice story, they will continue to be cited, without regard to subsequent corrections, criticism or even retraction.

The reason why this worries me is that the stark conclusion would appear to be that we can’t trust citations of the research literature unless they are based on well-conducted systematic reviews. Iain Chalmers has been saying this for years, and in his field of clinical trials these are more common than in other disciplines. But there are still many fields where it is seen as entirely appropriate to write an introduction to a paper that only cites supportive evidence and ignores a swathe of literature that shows null or opposite results. Most postgraduates have an initial thesis chapter that reviews the literature, but it's rare, at least in psychology, to see a systematic review - perhaps because this is so time-consuming and can be soul-destroying. But if we continue to cherry-pick evidence that suits us, then we are not so much standing on the shoulders of giants as slithering around on jellyfish, and science will not progress.

Thursday 12 July 2018

One big study or two small studies? Insights from simulations

At a recent conference, someone posed a question that had been intriguing me for a while: suppose you have limited resources, with the potential to test N participants. Would it be better to do two studies, each with N/2 participants, or one big study with all N?

I've been on the periphery of conversations about this topic, but never really delved into it, so I gave a rather lame answer. I remembered hearing that statisticians would recommend the one big study option, but my intuition was that I'd trust a result that replicated more than one which was a one-off, even if the latter was from a bigger sample. Well, I've done the simulations and it's clear that my intuition is badly flawed.

Here's what I did. I adapted a script that is described in my recent slides that give hands-on instructions for beginners on how to simulate data. The script, Simulation_2_vs_1_study_b.R, which can be found here, generates data for a simple two-group comparison using a t-test. In this version, on each run of the simulation, you get output for one study where all subjects are divided into two groups of size N, and for two smaller studies each with half the number of subjects. I ran it with various settings to vary both the sample size and the effect size (Cohen's d). I included the case where there is no real difference between groups (d = 0), so I could estimate the false positive rate as well as the power to detect a true effect.

I used a one-tailed t-test, as I had pre-specified that group B had the higher mean when d > 0. I used a traditional approach with p-value cutoffs for statistical significance (and yes, I can hear many readers tut-tutting, but this is useful for this demonstration….) to see how often I got a result that met each of three different criteria:
  • a) Single study, p < .05 
  • b) Split sample, p < .05 replicated in both studies 
  • c) Single study, p < .005
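For readers who would like to try this without R, here is a minimal Python sketch of the same idea. It is not the original script: the function names (`t_upper_tail`, `one_tailed_p`, `simulate`) are my own, it assumes a pooled-variance t-test with population SD of 1, and it splits the same sample into halves rather than drawing fresh small samples. The t-distribution tail is integrated numerically so that only the standard library is needed.

```python
import math
import random
import statistics


def t_upper_tail(t, df, steps=400):
    """P(T > t) for Student's t with df degrees of freedom (Simpson's rule)."""
    if t < 0:
        return 1.0 - t_upper_tail(-t, df, steps)
    # Log of the normalising constant of the t density.
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    h = t / steps  # steps must be even for Simpson's rule
    area = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    # Half the distribution lies above 0; subtract the area between 0 and t.
    return max(0.0, 0.5 - area * h / 3)


def one_tailed_p(a, b):
    """Pooled-variance t-test of H1: mean(b) > mean(a) (an assumption here)."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * statistics.variance(a)
           + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    t = (statistics.fmean(b) - statistics.fmean(a)) / math.sqrt(sp2 * (1 / na + 1 / nb))
    return t_upper_tail(t, na + nb - 2)


def simulate(n_per_group, d, runs, rng):
    """Proportion of runs meeting each of the three criteria a), b), c)."""
    hits = {"single_05": 0, "split_05": 0, "single_005": 0}
    half = n_per_group // 2
    for _ in range(runs):
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(d, 1) for _ in range(n_per_group)]
        p_big = one_tailed_p(a, b)
        p1 = one_tailed_p(a[:half], b[:half])   # first small study
        p2 = one_tailed_p(a[half:], b[half:])   # second small study
        hits["single_05"] += int(p_big < .05)
        hits["split_05"] += int(p1 < .05 and p2 < .05)
        hits["single_005"] += int(p_big < .005)
    return {name: count / runs for name, count in hits.items()}


if __name__ == "__main__":
    rng = random.Random(2018)
    for d in (0.0, 0.3, 0.5):
        print(d, simulate(48, d, 200, rng))
```

With only a few hundred runs per setting the proportions are noisy, so treat the output as a rough guide and increase `runs` if you want smoother estimates.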

Figure 1 summarises the results.
Figure 1

The figure is pretty busy but worth taking a while to unpack. Power is just the proportion of runs of the simulation where the significance criterion was met. It's conventional to adopt a power cutoff of .8 when deciding on how big a sample to use in a study. Sample size is colour coded, and refers to the number of subjects per group for the single study. So for the split replication, each group has half this number of subjects. The continuous line shows the proportion of results where p < .05 for the single study, the dotted line has results from the split replication, and the dashed line has results from the single study with more stringent significance criterion, p < .005 .

It's clear that for all sample sizes and all effect sizes, the one single sample is much better powered than the split replication.

But I then realised what had been bugging me and why my intuition was different. Look at the bottom left of the figure, where the x-axis is zero: the continuous lines (i.e., big sample, p < .05) all cross the y-axis at .05. This is inevitable: by definition, if you set p < .05, there's a one in 20 chance that you'll get a significant result when there's really no group difference in the population, regardless of the sample size. In contrast, the dotted lines cross the y-axis close to zero, reflecting the fact that when the null hypothesis is true, the chance of two samples both giving p < .05 in a replication study is one in 400 (.05^2 = .0025). So I had been thinking more like a Bayesian: given a significant result, how likely was it to have come from a population with a true effect rather than a null effect? This is a very different thing from what a simple p-value tells you*.

Initially, I thought I was onto something. If we just stick with p < .05, then it could be argued that from a Bayesian perspective, the split replication approach is preferable. Although you are less likely to see a significant effect with this approach, when you do, you can be far more confident it is a real effect. In formal terms, the likelihood ratio for a true vs null hypothesis, given p < .05, will be much higher for the replication.
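To put rough numbers on that likelihood ratio, here is a small Python sketch. It is an approximation rather than anything taken from my script: it treats the test statistic as normal with a known SD of 1 instead of using the t-distribution, and the function names are just illustrative. The ratio is the probability of meeting the criterion when there is a true effect (power) divided by the probability of meeting it under the null.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal


def approx_power(n_per_group, d, alpha):
    """Normal-approximation power of a one-tailed two-group comparison (SD = 1)."""
    z_crit = Z.inv_cdf(1 - alpha)
    # The standardised group difference has mean d * sqrt(n / 2) under a true effect.
    return Z.cdf(d * (n_per_group / 2) ** 0.5 - z_crit)


def likelihood_ratios(n_per_group, d, alpha=0.05):
    """Return (single-study ratio, split-replication ratio) for p < alpha."""
    single = approx_power(n_per_group, d, alpha) / alpha
    half_power = approx_power(n_per_group // 2, d, alpha)
    # Both small studies must be significant, so the probabilities multiply.
    split = half_power ** 2 / alpha ** 2
    return single, split


if __name__ == "__main__":
    single, split = likelihood_ratios(48, 0.5)
    print(f"single study: {single:.1f}, split replication: {split:.1f}")
```

The split replication pays for its higher ratio with lower power, which is the trade-off described above.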

My joy at having my insight confirmed was, however, short-lived. I realised that this benefit of the replication approach could be exceeded with the single big sample simply by reducing the p-value so that the odds of a false positive are minimal. That's why Figure 1 also shows the scenario for one big sample with p < .005: a threshold that has recently been proposed as a general recommendation for claims of new discoveries (Benjamin et al, 2018)**.

None of this will surprise expert statisticians: Figure 1 just reflects basic facts about statistical power that were popularised by Jacob Cohen in 1977. But I'm glad to have my intuitions now more aligned with reality, and I'd encourage others to try simulation as a great way to get more insights into statistical methods.

Here are the conclusions I've drawn from the simulation:
  • First, even when the two groups come from populations with different means, it's unlikely that you'll get a clear result from a single small study unless the effect size is at least moderate; and the odds of finding a replicated significant effect are substantially lower than this.  None of the dotted lines achieves 80% power for a replication if effect size is less than .3 - and many effects in psychology are no bigger than that. 
  • Second, from a statistical perspective, testing an a priori hypothesis in a larger sample with a lower p-value is more efficient than subdividing the sample and replicating the study using a less stringent p-value.
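As a rough illustration of how demanding the replication criterion is, the sketch below estimates the sample size needed to reach a given power. It uses the standard normal-approximation formula for a one-tailed two-group comparison, not the simulation, so the numbers will differ slightly from those read off Figure 1; the function names are my own. For the split replication, each of the two studies needs power equal to the square root of the target, because both have to come out significant.

```python
import math
from statistics import NormalDist

Z = NormalDist()  # standard normal


def n_per_group(d, power, alpha=0.05):
    """Approximate subjects per group for a one-tailed two-group test (SD = 1)."""
    z = Z.inv_cdf(1 - alpha) + Z.inv_cdf(power)
    return math.ceil(2 * (z / d) ** 2)


def n_per_group_each_replication(d, power_both, alpha=0.05):
    """Subjects per group in EACH of two studies so both are significant with power_both."""
    # Two independent studies: power_each ** 2 == power_both.
    return n_per_group(d, math.sqrt(power_both), alpha)


if __name__ == "__main__":
    d = 0.3
    single = n_per_group(d, 0.8)
    each = n_per_group_each_replication(d, 0.8)
    print(f"single study: {single} per group")
    print(f"split replication: {each} per group in each study, {2 * each} in total")
```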
I'm not a stats expert, and I'm aware that there's been considerable debate out there about p-values - especially regarding the recommendations of Benjamin et al (2018). I have previously sat on the fence as I've not felt confident about the pros and cons. But on the basis of this simulation, I'm warming to the idea of p < .005. I'd welcome comments and corrections.

*David Colquhoun (2017) discusses these issues in his paper 'The reproducibility of research and the misinterpretation of p-values' (Royal Society Open Science, 4(171085), doi:10.1098/rsos.171085), and notes that we also need to consider the prior likelihood of the null hypothesis being true: something that is unknowable and can only be estimated on the basis of past experience and intuition.
**The proposal for adopting p < .005 as a more stringent statistical threshold for new discoveries can be found here: Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E. J., Berk, R., . . . Johnson, V. E. (2018). Redefine statistical significance. Nature Human Behaviour, 2(1), 6-10. doi:10.1038/s41562-017-0189-z

Postscript, 15th July 2018

This blogpost has generated a lot of discussion, mostly on Twitter. One point that particularly interested me was a comment that I hadn’t done a fair comparison between the one-study and two-study situation, because the plot showed a one-off two group study with an alpha at .005, versus a replication study (half sample size in each group) with alpha at .05. For a fair comparison, it was argued, I should equate the probabilities between the two situations, i.e. the alpha for the one-off study should be .05 squared = .0025.

So I took a look at the fair comparison: Figure 2 shows the situation when comparing one study with alpha set to .0025 vs a split replication with alpha of .05. The intuition of many people on Twitter was that these should be identical, but they aren’t. Why not? We have the same information in the two samples. (In fact, I modified the script so that this was literally true and the same sample was tested singly and again split into two – previously I’d just resampled to get the smaller samples. This makes no difference – the single sample with more extreme alpha still gives higher power).

Figure 2: Power for one-off study with alpha .0025 (dashed lines) vs. split replication with p < .05
To look at it another way, in one version of the simulation there were 1600 simulated experiments with a true effect (including all the simulated sample sizes and effect sizes). Of these, 581 were identified as ‘significant’ both by the one-off study with p < .0025 and by the split replication, where both small studies gave p < .05. Only 5 were identified by the split replication alone, but 134 were identified by the one-off study alone.

I think I worked out why this is the case, though I’d appreciate having a proper statistical opinion. It seems to have to do with accuracy of estimating the standard deviation. If you have a split sample and you estimate the mean from each half (A and B), then the average of mean A and mean B will be the same as for the big sample of AB combined. But when it comes to estimating the standard deviation – which is a key statistic when computing group differences – the estimate is more accurate and precise with the large sample. This is because the standard deviation is computed by measuring the difference of each value from its own sample mean. Means for A and B will fluctuate due to sampling error, and this will make the estimated SDs less reliable. You can estimate the pooled standard deviation for two samples by taking the square root of the average of the variances. However, that value is less precise than the SD from the single large sample. I haven’t done a large number of runs, but a quick check suggests that whereas both the one-off study and the split replication give pooled estimates of the SD at around the true value of 1.0, the standard deviation of the standard deviation (we are getting very meta here!) is around .01 for the one-off study but .14 for the split replication. Again, I’m reporting results from across all the simulated trials, including the full range of sample sizes and effect sizes.

Figure 3: Distribution of estimates of pooled SD; The range is narrower for the one-off study (pink) than for the split replication studies (blue). Purple shows area of overlap of distributions
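If you want to check the point about SD estimates for yourself, here is a minimal Python sketch. It is not the code behind Figure 3: it uses a single sample size rather than pooling across all the simulated conditions, so the spread it reports will not match the figures quoted above, and the function names are my own. The pooled SD is computed as described, as the square root of the average of the two group variances.

```python
import math
import random
import statistics


def pooled_sd(a, b):
    """Square root of the average of the two group variances (equal group sizes)."""
    return math.sqrt((statistics.variance(a) + statistics.variance(b)) / 2)


def sd_estimate_spread(n_per_group, runs, rng):
    """SD of the pooled-SD estimates for the one-off study and for the half-size studies."""
    big, small = [], []
    half = n_per_group // 2
    for _ in range(runs):
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        big.append(pooled_sd(a, b))                   # one-off study
        small.append(pooled_sd(a[:half], b[:half]))   # first small study
        small.append(pooled_sd(a[half:], b[half:]))   # second small study
    return statistics.stdev(big), statistics.stdev(small)


if __name__ == "__main__":
    big, small = sd_estimate_spread(48, 1000, random.Random(15))
    print(f"one-off study: {big:.3f}, split replication: {small:.3f}")
```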

This has been an intriguing puzzle to investigate, but in the original post, I hadn’t really been intending to do this kind of comparison - my interest was rather in making the more elementary point which is that there's a very low probability of achieving a replication when sample size and effect size are both relatively small.

Returning to that issue, another commentator said that they’d have far more confidence in five small studies all showing the same effect than in one giant study. This is exactly the view I would have taken before I looked into this with simulations; but I now realise this idea has a serious flaw, which is that you’re very unlikely to get those five replications, even if you are reasonably well powered, because – the tl;dr message implicit in this post – when we’re talking about replications, we have to multiply the probabilities, and they rapidly get very low. So, if you look at the figure, suppose you have a moderate effect size, around .5, then you need a sample of 48 per group to get 80% power. But if you repeat the study five times, then the chance of getting a positive result in all five cases is .8^5, which is .33. So most of the time you’d get a mixture of null and positive results. Even if you doubled the sample size to increase power to around .95, the chance of all five studies coming out positive is still only .95^5, which is .77.
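The multiplication is easy to check for yourself. The sketch below assumes the studies are independent and all have the same power; the function names are just illustrative. It also shows the full spread of outcomes, i.e. how likely it is that exactly m out of k studies come out significant.

```python
from math import comb


def prob_all_significant(power, k):
    """Chance that all k independent studies are significant."""
    return power ** k


def prob_exactly(power, k, m):
    """Binomial chance that exactly m of k independent studies are significant."""
    return comb(k, m) * power ** m * (1 - power) ** (k - m)


if __name__ == "__main__":
    for power in (0.8, 0.95):
        print(f"power {power}: all five significant = {prob_all_significant(power, 5):.3f}")
        for m in range(6):
            print(f"  exactly {m} of 5: {prob_exactly(power, 5, m):.3f}")
```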

Finally, another suggestion from Twitter is that a meta-analysis of several studies should give the same result as a single big sample. I’m afraid I have no expertise in meta-analysis, so I don’t know how well it handles the issue of more variable SD estimates in small samples, but I’d be interested to hear more from any readers who are up to speed with this.

Tuesday 26 June 2018

Preprint publication as karaoke

 Doing research, analysing the results, and writing it up is a prolonged and difficult process. Submitting the paper to a journal is an anxious moment. Of course, you hope the editor and reviewers will love it and thank you for giving them the opportunity to read your compelling research. And of course, that never happens. More often you get comments from reviewers pointing out the various inadequacies of your grasp of the literature, your experimental design and your reasoning, leading to further angst as you consider how to reply. But worse than this is silence. You hear nothing. You enquire. You are told that the journal is still seeking reviewers. If you go through that loop a few times, you start to feel like the Jane Austen heroine who, having dressed up in her finery for the ball, spends the evening being ignored by all the men, while other, superficial and gaudy women are snapped up as dance partners.

There have been some downcast tweets in my timeline about papers getting stuck in this kind of journal limbo. When I suggested that it might help to post papers as preprints, several people asked how this worked, so I thought a short account might be useful.

To continue the analogy, a preprint server offers you a more modern world where you can try karaoke. You don't wait to be asked: you grab the microphone and do your thing. I now routinely post all my papers as preprints before submitting them to a journal. It gets the work out there, so even if journals are unduly slow, it can be read and you can get feedback on it.

So how does it work? Pre-prints are electronic articles that are not peer-reviewed. I hope those who know more about the history will be able to comment on this, as I'm hazy on the details, but the idea started with physicists, to whom the thought of waiting around for an editorial process to complete seemed ridiculous. Physicists have been routinely posting their work on arXiv (pronounced 'archive') for years to ensure rapid evaluation and exchange of ideas. They do still publish in journals, which creates a formal version of record, but the arXiv is what most of them read. The success of arXiv led to the development of BioRxiv, and then more recently PsyArXiv and SocArXiv. Some journals also host preprints - I have had good experiences with PeerJ, where you can deposit an article as a preprint, with the option of then updating it to a full submission to the journal if you wish*.

All of these platforms operate some basic quality control. For instance, the BioRxiv website states: 'all articles undergo a basic screening process for offensive and/or non-scientific content and for material that might pose a health or biosecurity risk and are checked for plagiarism'. However, once they have passed screening, articles are deposited immediately without further review.

Contrary to popular opinion, publishing a preprint does not usually conflict with journal policies. You can check the policy of the journal on the Sherpa/ROMEO database: most allow preprints prior to submission.

Sometimes concerns are expressed that if you post a preprint your work might be stolen by someone who'll then publish a journal article before you. In fact, it's quite the opposite. A preprint has a digital object identifier (DOI) and establishes your precedence, so guards against scooping. If you are in a fast-moving field where an evil reviewer will deliberately hold up your paper so they can get in ahead, pre-printing is the answer.

So when should you submit a preprint? I would normally recommend doing this a week or two before submitting to a journal, to allow for the possibility of incorporating feedback into the submitted manuscript, but, given that you will inevitably be asked for revisions by journal reviewers, if you post a preprint immediately before submission you will still have an opportunity to take on board other comments.

So what are the advantages of posting preprints?

1. The most obvious one is that people can access your work in a timely fashion. Preprints are freely available to all: a particularly welcome feature if you work in an area that has implications for clinical practice or policy, where practitioners may not have access to academic journals.

2. There have been cases where authors of a preprint have been invited to submit the work to a journal by an editor. This has never happened to me, but it's nice to know it's a possibility!

3. You can cite a preprint on a job application: it won't count as much as a peer-reviewed publication, but it does make it clear that the work is completed, and your evaluators can read it. This is preferable to just citing work as 'submitted'. Some funders are now also allowing preprints to be cited.

4. Psychologically, for the author, it can be good to have a sense that the work is 'out there'. You have at least some control over the dissemination of your research, whereas waiting for editors and reviewers is depressing because you just feel powerless.

5. You can draw attention to a preprint on social media and explicitly request feedback. This is particularly helpful if you don't have colleagues to hand who are willing to read your paper. If you put out a request on Twitter, it doesn't mean people will necessarily reply, but you could get useful suggestions for improvement and/or make contact with others interested in your field.

On this final point, it is worth noting that there are several reasons why papers linger in journal limbo: it does not necessarily mean that the journal administration or editor is incompetent (though that can happen!). The best of editors can have a hard job finding reviewers: it's not uncommon to have to invite ten reviewers to find two who agree to review. If your paper is in a niche area then it gets even harder. For these reasons it is crucial to make your title and abstract as clear and interesting as possible: these are the only parts of the paper that potential reviewers will see, and if you are getting a lot of refusals to review, it could be that your abstract is a turn-off. So asking for feedback on a preprint may help you rewrite it in a way that encourages more interest from reviewers.

*Readers: please feel free to add other suggestions while comments are open. (I close comments once the invasion of spammers starts - typically 3-4 weeks after posting).

Saturday 9 June 2018

Developmental language disorder: the need for a clinically relevant definition

There's been debate over the new terminology for Developmental Language Disorder (DLD) at a meeting (SRCLD) in the USA. I've not got any of the nuance here, but I feel I should make a quick comment on one issue I was specifically asked about, viz:

As background: the field of children's language disorders has been a terminological minefield. The term Specific Language Impairment (SLI) began to be used widely in the 1980s as a diagnosis for children who had problems acquiring language for no apparent reason. One criterion for the diagnosis was that the child's language problems should be out of line with other aspects of development, and hence 'specific', and this was interpreted as requiring normal range nonverbal IQ (nviq).

The term SLI was never adopted by the two main diagnostic systems - WHO's International Classification of Diseases (ICD) and the American Psychiatric Association's Diagnostic and Statistical Manual (DSM) - but the notion that IQ should play a part in the diagnosis became prevalent.

In 2016-7 I headed up the CATALISE project with the specific goal of achieving some consensus about the diagnostic criteria and terminology for children's language disorders: the published papers about this are openly available for all to read (see below). The consensus of a group of experts from a range of professions and countries was to reject SLI in favour of the term DLD.

Any child who meets criteria for SLI will meet criteria for DLD: the main difference is that the use of an IQ cutoff is no longer part of the definition. This does not mean that all children with language difficulties are regarded as having DLD: those who meet criteria for intellectual disability, known syndromes or biomedical conditions are treated separately (see these slides for summary).

The tweet seems to suggest we should retain the term SLI, with its IQ cutoff, because it allows us to do neatly controlled research studies. I realise a brief, second-hand tweet about Rice's views may not be a fair portrayal of what she said, but it does emphasise a bone of contention that was thoroughly gnawed in the discussions of the CATALISE panel, namely, what is the purpose of diagnostic terminology? I would argue its primary purpose is clinical, and clinical considerations are not well-served by research criteria.

The traditional approach to selecting groups for research is to find 'pure' cases - quite simply, if you include children who have other problems beyond language (including other neurodevelopmental difficulties) then it is much harder to know how far you are assessing correlates or causes of language problems: things get messy and associations get hard to interpret. The importance of controlling for nonverbal IQ has been particularly emphasised over many years: if you compare language-impaired vs comparison (typically-developing, or TD) children on a language or cognitive measure, and the language-impaired group has lower nonverbal ability, then it may be that you are looking at a correlate of nonverbal ability rather than language. Restricting consideration to those who meet stringent IQ criteria to equalise the groups is one way of addressing the issue.

However, there are three big problems with this approach:

1. A child's nonverbal IQ can vary from time to time and it will depend on the test that is used. However, although this is problematic, it's not the main reason for dropping IQ cutoffs; the strongest arguments concern validity rather than reliability of an IQ-based approach.

2. The use of IQ-cutoffs ignores the fact that pure cases of language impairment are the exception rather than the rule. In CATALISE we looked at the evidence and concluded that if we were going to insist that you could only get a diagnosis of DLD if you had no developmental problems beyond language, then we'd exclude many children with language problems (see also this old blogpost). If our main purpose is to get a diagnostic system that is clinically workable, it should be applicable to the children who turn up in our clinics - not just a rarefied few who meet research criteria. An analogy can be drawn with medicine: imagine if your doctor identified you with high blood pressure but refused to treat you unless you were in every other regard fit and healthy. That would seem both unfair and ill-judged. Presence of co-occurring conditions might be important for tracking down underlying causes and determining a treatment path, but it's not a reason for excluding someone from receiving services.

3. Even for research purposes, it is not clear that a focus on highly specific disorders makes sense. An underlying assumption, which I remember starting out with, was the idea that the specific cases were in some important sense different from those who had additional problems. Yet, as noted in the CATALISE papers, the evidence for this assumption is missing: nonverbal IQ has very little bearing on a child's clinical profile, response to intervention, or aetiology. For me, what really knocked my belief in the reality of SLI as a category was doing twin studies: typically, I'd find that identical twins were very similar in their language abilities, but they sometimes differed in nonverbal ability, to the extent that one met criteria for SLI and the other did not. Researchers who treat SLI as a distinct category are at risk of doing research that has no application to the real world.

There is nothing to stop researchers focusing on 'pure' cases of language disorder to answer research questions of theoretical interest, such as questions about the modularity of language. This kind of research uses children with a language disorder as a kind of 'natural experiment' that may inform our understanding of broader issues. It is, however, important not to confuse such research with work whose goal is to discover clinically relevant information.

If practitioners let the theoretical interests of researchers dictate their diagnostic criteria, then they are doing a huge disservice to the many children who end up in a no-man's-land, without either diagnosis or access to intervention. 


Bishop, D. V. M. (2017). Why is it so hard to reach agreement on terminology? The case of developmental language disorder (DLD). International Journal of Language & Communication Disorders, 52(6), 671-680. doi:10.1111/1460-6984.12335

Bishop, D. V. M., Snowling, M. J., Thompson, P. A., Greenhalgh, T., & CATALISE Consortium. (2016). CATALISE: a multinational and multidisciplinary Delphi consensus study. Identifying language impairments in children. PLOS One, 11(7), e0158753. doi:10.1371/journal.pone.0158753

Bishop, D. V. M., Snowling, M. J., Thompson, P. A., Greenhalgh, T., & CATALISE Consortium. (2017). Phase 2 of CATALISE: a multinational and multidisciplinary Delphi consensus study of problems with language development: Terminology. Journal of Child Psychology and Psychiatry, 58(10), 1068-1080. doi:10.1111/jcpp.12721

Sunday 27 May 2018

Sowing seeds of doubt: how Gilbert et al’s critique of the reproducibility project has played out

In Merchants of Doubt, Erik Conway and Naomi Oreskes describe how raising doubt can be used as an effective weapon against inconvenient science. On topics such as the effects of tobacco on health, climate change and causes of acid rain, it has been possible to delay or curb action to tackle problems by simply emphasising the lack of scientific consensus. This is always an option, because science is characterised by uncertainty, and indeed, we move forward by challenging one another’s findings: only a dead science would have no disagreements. But those raising concerns wield a two-edged sword: spurious and discredited criticisms can disrupt scientific progress, especially if the arguments are complex and technical: people will be left with a sense that they cannot trust the findings, even if they don’t fully understand the matters under debate.

The parallels with Merchants of Doubt occurred to me as I re-read the critique by Gilbert et al of the classic paper by the Open Science Collaboration (OSC) on ‘Estimating the reproducibility of psychological science’. I was prompted to do so because we were discussing the OSC paper in a journal club* and inevitably the question arose as to whether we needed to worry about reproducibility, in the light of the remarkable claim by Gilbert et al: ‘We show that OSC's article contains three major statistical errors and, when corrected, provides no evidence of a replication crisis. Indeed, the evidence is also consistent with the opposite conclusion -- that the reproducibility of psychological science is quite high and, in fact, statistically indistinguishable from 100%.’

The Gilbert et al critique has, in turn, been the subject of considerable criticism, as well as a response by a subset of the OSC group. I summarise the main points of contention in Table 1. At times Gilbert et al seem to be making a defeatist argument that we don’t need to worry because replication in psychology is bound to be poor: something I have disputed.

But my main focus in this post is simply to consider the impact of the critique on the reproducibility debate by looking at citations of the original article and the critique. A quick check on Web of Science found 797 citations of the OSC paper, 67 citations of Gilbert et al, and 33 citations of the response by Anderson et al.

The next thing I did, admittedly in a very informal fashion, was to download the details of the articles citing Gilbert et al and code them according to the content of what they said, as either supporting Gilbert et al’s view, rejecting the criticism, or being neutral. I discovered I needed a fourth category for papers where the citation seemed wrong or so vague as to be unclassifiable. I discarded any papers where the relevant information could not be readily accessed – I can access most journals via Oxford University but a few were behind paywalls, others were not in English, or did not appear to cite Gilbert et al. This left 44 citing papers that focused on the commentary on the OSC study. Nine of these were supportive of Gilbert et al, two noted problems with their analysis, but 33 were categorised as ‘neutral’, because the citation read something like this: 

“Because of the current replicability crisis in psychological science (e.g., Open Science Collaboration, 2015; but see Gilbert, King, Pettigrew, & Wilson, 2016)….”

The strong impression was that the authors of these papers lacked either the appetite or the ability to engage with the detailed arguments in the critique, but had a sense that there was a debate and felt that they should flag this up. That’s when I started to think about Merchants of Doubt: whether intentionally or not, Gilbert et al had created an atmosphere of uncertainty to suggest there is no consensus on whether or not psychology has a reproducibility problem - people are left thinking that it's all very complicated and depends on arguments that are only of interest to statisticians. This makes it easier for those who are reluctant to take action to deal with the issue.
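For anyone who wants the informal tallies above as proportions, here is a minimal Python sketch. The counts are the ones reported in this post; the category labels and the helper name `shares` are shorthand introduced only for this illustration, and the coding itself was done by hand rather than by any script.

```python
# Counts as reported in the post; labels are illustrative shorthand.
coded = {"supportive": 9, "critical": 2, "neutral": 33}

# Web of Science citation counts quoted earlier in the post.
citations = {"OSC": 797, "Gilbert et al": 67, "Anderson et al": 33}


def shares(counts):
    """Return each category as a percentage of the total, to one decimal place."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


total_coded = sum(coded.values())  # should match the 44 papers that were coded
coded_shares = shares(coded)

# Citations of the critique relative to citations of the original paper.
critique_ratio = round(100 * citations["Gilbert et al"] / citations["OSC"], 1)

print(total_coded)
print(coded_shares)
print(critique_ratio)
```

On these figures, three quarters of the coded citations fall into the neutral category, which is the pattern described above.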

Fortunately, it looks as if Gilbert et al’s critique has been less successful than might have been expected, given the eminence of the authors. This may in part be because the arguments in favour of change are founded not just on demonstrations such as the OSC project, but also on logical analyses of statistical practices and publication biases that have been known about for years (see slides 15-20 of my presentation here). Furthermore, as evidenced in the footnotes to Table 1, social media allows a rapid evaluation of claims and counter-claims that hitherto was not possible when debate was restricted to and controlled by journals. The publication this week of three more big replication studies just heaps on further empirical evidence that we have a problem that needs addressing. Those who are saying ‘nothing to see here, move along’ cannot retain any credibility.

Table 1

‘many of OSC’s replication studies drew their samples from different populations than the original studies did’
·     ‘Many’ implies the majority. No attempt to quantify – just gives examples
·     Did not show that this feature affected replication rate

‘many of OSC’s replication studies used procedures that differed from the original study’s procedures in substantial ways.’
·     ‘Many’ implies the majority. No attempt to quantify – just gives examples
·     OSC showed that this did not affect replication rate
·     Most striking example used by Gilbert et al is given detailed explanation by Nosek (1)

‘How many of their replication studies should we expect to have failed by chance alone? Making this estimate requires having data from multiple replications of the same original study.’
Used data from pairwise comparisons of studies from the Many Labs project to argue a low rate of agreement is to be expected.
·     Ignores publication bias impact on original studies (2, 3)
·     G et al misinterpret confidence intervals (3, 4)
·     G et al fail to take sample size/power into account, though this is crucial determinant of confidence interval (3, 4)
·     ‘Gilbert et al.’s focus on the CI measure of reproducibility neither addresses nor can account for the facts that the OSC2015 replication effect sizes were about half the size of the original studies on average, and 83% of replications elicited smaller effect sizes than the original studies.’ (2)

Results depended on whether original authors endorsed the protocol for the replication: ‘This strongly suggests that the infidelities did not just introduce random error but instead biased the replication studies toward failure.’
·     Use of term ‘the infidelities’ assumes the only reason for lack of endorsement is departure from original protocol. (2)
·     Lack of endorsement included non-response from original authors (3)

Anderson, C. J., Bahnik, S., Barnett-Cowan, M., et al. (2016). Response to Comment on "Estimating the reproducibility of psychological science". Science, 351(6277).
Gilbert, D. T., King, G., Pettigrew, S., & Wilson, T. D. (2016). Comment on "Estimating the reproducibility of psychological science". Science, 351(6277).
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251). doi:10.1126/science.aac4716

*Thanks to the enthusiastic efforts of some of our grad students, and the support of Reproducible Research Oxford, we’ve had a series of Reproducibilitea journal clubs in our department this term.  I can recommend this as a great – and relatively cheap and easy - way of raising awareness of issues around reproducibility in a department: something that is sorely needed if a recent Twitter survey by Dan Lakens is anything to go by.