An interesting twitter thread came along yesterday, started by this query from Jan Wessel (@wessel_lab):
Quick thread of (honest) questions for the numerous people on here that subscribe to the position that sharing code in MATLAB ($) is bad open-science practice compared to open source languages (e.g., Python). What should I do as a PI that runs a lab whose entire coding structure is based (publicly shared) MATLAB code? Some say I should learn an open-source language and change my lab’s procedures over to it. But how would that work in practice?
When I resort to blogging, it’s often because someone has raised a question that has captured my interest because it does not have a simple answer. I have made a Twitter moment to store the rest of Jan’s thread and some of the responses to it, as they raise important points which have broad application.
In part, this is an argument about costs and benefits to the individual scientist and the community. Sometimes these can be aligned, but in this case, there is some conflict, because those who can’t afford Matlab would not be able to run Jan’s code. If he were to move to Python, then anyone would be able to do so.
His argument is that he has invested a lot of time in learning Matlab, has a good understanding of how Matlab code works, and feels competent to advise his trainees in it. Furthermore, he works in the field of EEG, where there are whole packages developed to do the complex analysis involved, and Matlab is the default in this field. So moving to another programming language would not only be a big time sink, but would also make him out of step with the rest of the field.
There was a fair bit of division of opinion in the replies. On the one hand, there were those who thought this was a non-issue. It was far more important to share code than to worry about whether it was written in a proprietary language. And indeed, if you are well-enough supported to be doing EEG research, then it’s likely your lab can afford the licensing costs.
I agree with the first premise: just having the code available can be helpful in understanding how an analysis was done, even if you can’t run it. And certainly, most of those in EEG research are using Matlab. However, I’m also aware that for those in resource-limited countries, EEG is a relatively cheap technology for doing cognitive neuroscience, so I guess there will be those who would be able to get EEG equipment, but for whom the Matlab licensing costs are prohibitive.
But the replies emphasised another point: the landscape is continually changing. People have been encouraging me to learn Python, and I’m resisting only because I’m starting to feel too old to learn yet another programming language. But over the years, I’ve had to learn Basic, Matlab and R, as well as some arcane stuff for generating auditory stimuli whose name I can’t even remember. But I’ve looked at Jan’s photo on the web, and he looks pretty young, so he doesn’t have my excuse. So on that basis, I’d agree with those advising to consider making a switch. Not just to be a good open scientist, but in his own interests, which involves keeping up to date. As some on the thread noted, many undergrads are now getting training in Python or R, and sooner or later open source will become the default.
In the replies there were some helpful suggestions from people who were encouraging Jan to move to open source but in the least painful way possible. And there was reassurance that there are huge savings in learning a new language: it’s really not like going back to square one. That’s my experience: in fact, my knowledge of Basic was surprisingly useful when learning Matlab.
So the bottom line seems to be, don’t beat yourself up about it. Posting Matlab code is far better than not posting any code. But be aware that things are changing, and sooner rather than later, you’ll need to adapt. The time costs of learning a new language may prove trivial in the long term, against the costs of being out of date. But I can state with total confidence that learning Python will not be the end of it: give it a few years and something else will come along.
When I was first embarking on an academic career, I remember looking at the people who were teaching me, who, at the age of around 40, looked very old indeed. And I thought it must be nice for them, because they have worked hard, learned stuff, and now they know it all and can just do research and teach. When I got to 40, I had the awful realisation that the field was changing so fast, that unless I kept learning new stuff, I would get left behind. And it hasn't stopped over the past 25 years!
Monday, 20 August 2018
Saturday, 11 August 2018
More haste less speed in calls for grant proposals
[Image: Helpful advice from the World Bank]
This blogpost was prompted by a funding call announced this week by the Economic and Social Research Council (ESRC), which included the following key dates:
- Opening date for proposals – 6 August 2018
- Closing date for proposals – 18 September 2018
- PI response invited – 23 October 2018
- PI response due – 29 October 2018
- Panel – 3 December 2018
- Grants start – 14 February 2019
I make this about 30 working days' notice. For a call issued in August. For projects of 36 months, up to £900k - substantial, for social sciences. With only one bid allowed to be led from each institution, so likely requiring an internal sift.
I thought it worth raising this with ESRC, and they replied promptly, saying:
To access funds for this call we’ve had to adhere to a very tight spending timeframe. We’ve had to balance the call opening time with a robust peer review process and a Feb 2019 project start. We know this is a challenge, but it was a now or never funding opportunity for us.
They suggested I email them for more information, and I’ve done that, so will update this post if I hear more. I’m particularly curious about the reason for the tight spending timeframe and the inflexible February 2019 start.
This exchange led to discussion on Twitter which I have gathered together here.
It’s clear from the responses that this kind of time-frame is not unusual, and I have been sent some other examples. For instance, this ESRC Leadership Fellowship (£100,000 for 12 months) had a call for proposals issued on 16th November 2017, with a deadline for submissions of 3 January. When you factor in that most universities shut down from late December until early January, so that this would need to be with administrators before the Christmas break, this gives applicants around 30 days to construct a competitive proposal. But it’s not only ESRC that does this, and I am less interested in pointing the finger at a particular funder – who may well be working under pressures outside their control - than just raising the issue of why this needs a rethink. I see five problems with these short lead times:
1. Poorer quality of proposals
The most obvious problem is that a hastily written proposal is likely to be weaker than one that is given more detailed consideration. The only good thing you might say about the time pressure is that it is likely to reduce the number of proposals, which reduces the load on the funder’s administration. It’s not clear, however, whether this is an intended consequence.
2. Stress on academic staff
There is ample evidence that academic staff in the UK have high stress levels, often linked to a sense of increasing demands and high workload. A good academic shows high attention to detail and is at pains to get things right: research is not something that can be done well under tight time pressure. So holding up the offer of a large grant with only a short time period to prepare a proposal is bound to increase stress: do you drop everything else to focus on grant-writing, or pass by the opportunity to enter the competition?
Where the interval between the funding call and the deadline occurs over a holiday period, some might find this beneficial, as other demands such as teaching are lower. But many people plan to take a vacation, and should be able to have a complete escape from work for at least a week or two. Others will have scheduled the time for preparing lectures, doing research, or writing papers. Having to defer those activities in order to meet a tight deadline just induces more sense of overload and guilt at having a growing backlog of work.
3. Equity issues
These points about vacations are particularly pertinent for those with children at home during the holidays, as pointed out in a series of tweets by Melissa Terras, Professor of Digital Cultural Heritage at Edinburgh University, who said:
I complained once to the AHRC about a call announced in November with a closing date of early January - giving people the chance to work over the Xmas shutdown on it. I wasn't applying to the call myself, but pointed out that it meant people with - say - school age kids - wouldn't have a "clear" Xmas shutdown to work on it, so it was prejudice against that cohort. They listened, apologised, and extended the deadline for a month, which I was thankful for. But we shouldn't have to explain this to them. Have RCUK done their implicit bias training?
4. Stress on administrative staff
One person who contacted me via email pointed out that many funders, including ESRC, ask institutions to filter out uncompetitive proposals through internal review. That could mean senior research administrators organising exploratory workshops, soliciting input from potential PIs, having people present their ideas, and considering collaborations with other institutions. None of that will be possible in a 30-day time frame. Then there are the administrators who do the routine work of checking funding bids for accuracy and for compliance with university and funder requirements. I suspect it’s not unusual for them to be dealing with a stressed researcher who expects all of this with rapid turnaround, and a funding scheme that virtually guarantees everything is done in a rush just makes this worse.
5. Perception of unfairness
Adding to this toxic mix, we have the possibility of diminished trust in the funding process. My own interest in this issue stems from a time a few years ago when there was a funding call for a rather specific project in my area. The call came just before Christmas, with a deadline in mid January. I had a postdoc who was interested in applying, but after discussing it, we decided not to put in a bid. Part of the reason was that we had both planned a bit of time off over Christmas, but in addition I was suspicious about the combination of short time-scale and specific topic. This made me wonder whether a decision had already been made about who to award the funds to, and the exercise was just to fulfil requirements and give an illusion of fairness and transparency.
Responses on Twitter again indicate that others have had similar concerns. For instance, Jon May, Professor in Psychology at the University of Plymouth, wrote:
I suspect these short deadline calls follow ‘sandboxes’ where a favoured person has invited their (i.e his) friends to pitch ideas for the call. Favoured person cannot bid but friends can and have written the call.
And an anonymous correspondent on email noted:
I think unfairness (or the perception of unfairness) is really dangerous – a lot of people I talk to either suspect a stitch-up in terms of who gets the money, or an uneven playing field in terms of who knew this was coming.
So what’s the solution? One option would be to insist that, at least for those dispensing public money, there should be a minimum time between a call for proposals and the submission date: about 3 months would seem reasonable to me.
Comments will be open on this post for a limited time (2 months, since we are in holiday season!) so please add your thoughts.
P.S. Just as I was about to upload this blogpost, I was alerted on Twitter to this call from the World Bank, which is a beautiful illustration of point 5 - if you weren't already well aware this was coming, there would be no hope of applying. Apparently, this is not a 'grant' but a 'contract', but the same problems noted above would apply. The website is dated 2nd August, the closing date is 15th August. There is reference to a webinar for applicants dated 9th July, so presumably some information has been previously circulated, but still with a remarkably short time lag, given that there need to be at least two collaborating institutions (including middle- and low-income countries).
Update: 17th August 2018
An ESRC spokesperson sent this reply to my query:
Thank you for getting in touch with us with your concerns about the short call opening time for the recently announced Management Practices and Employee Engagement call, and the fact that it has opened in August.
We welcome feedback from our community on the administration of funding programmes, and we will think carefully about how to respond to these concerns as we design and plan future programmes.
To provide some background to this call. It builds on an open-invite scoping workshop we held in February 2018, at which we sought input from the academic, policy and third-sector communities on the shape of a (then) potential research investment on management practices and employee engagement. We subsequently flagged the likelihood of a funding call around the topic area this summer, both at the scoping workshop itself, as well as in our ongoing engagements with the academic community.
We do our best to make sure that calls are open for as long as possible. We have to balance call opening times with a robust and appropriately timetabled peer review process, feasible project start dates, the right safeguards and compliances, and, in certain cases such as this one, a requirement to spend funds within the financial year.
We take the concerns that you raise in your email and in your blog post of 11 August 2018 extremely seriously. The high standard of the UK's research is a result of the work of our academic community, and we are committed to delivering a system that respects and responds to their needs. As part of this, we are actively looking into ways to build in longer call lead times and/or pre-announcements of funding opportunities for potential future managed calls in this and other areas.
I would also like to stress that applicants can still submit proposals on the topic of management practices and employee engagement through our standard research grant process, which is open all year round. The peer review system and the Grant Assessment Panel does not take into account the fact that a managed call is open on a topic when awarding funding: decisions are taken based on the excellence of the proposal.
Update: 23rd August 2018
A spokesperson for the World Bank has written to note that the grant scheme alluded to in my postscript did in fact have a 2 month period between the call and submission date. I have apologised to them for suggesting it was shorter than this, and also apologise to readers for providing misleading information. The duration still seems short to me for a call of this nature, but my case is clearly not helped by providing wrong information, and I should have taken greater care to check details. Text of the response from the World Bank is below:
We noticed with some concern that in your Aug. 11 blog post, you had singled out a World Bank call for proposals as a “beautiful illustration” of a type of funding call that appears designed to favor an inside candidate. This characterization is entirely inaccurate and appears based on a misperception of the time lag between the announcement of the proposal and the deadline.
Your reference to the 2018 Call for Proposals for Collaborative Data Innovations for Sustainable Development by the World Bank and the Global Partnership for Sustainable Development Data as undermining faith in the funding process seems based on the mistaken assumption that the call was issued on or about August 2. It was not.
The call was announced June 19 on the websites of the World Bank and the GPSDD. This was two months before the closing date, a period we have deemed fair to applicants but also appropriate given our own time constraints. An online seminar was offered to assist prospective applicants, as you note, on July 9.
The seminar drew 127 attendees for whom we provided answers to 147 questions. We are still reviewing submissions for the most recent call for proposals for this project, but our call for the 2017 version elicited 228 proposals, of which 195 met criteria for external review.
As the response to the seminar and the record of submissions indicate, this funding call has been widely seen and provided numerous applicants the opportunity to respond. To suggest that this has not been an open and fair process does not do it justice.
Here are the links with the announcement dates of June 19th
Labels:
academics,
corruption,
equity,
fairness,
grants,
higher education,
research funding,
stress,
UKRI
Friday, 20 July 2018
Standing on the shoulders of giants, or slithering around on jellyfish: Why reviews need to be systematic
Yesterday I had the pleasure of hearing George Davey Smith (aka @mendel_random) talk. In the course of a wide-ranging lecture he recounted his experiences with conducting a systematic review. This caught my interest, as I’d recently considered the question of literature reviews when writing about fallibility in science. George’s talk confirmed my concerns that cherry-picking of evidence can be a massive problem for many fields of science.
Together with Mark Petticrew, George had reviewed the evidence on the impact of stress and social hierarchies on coronary artery disease in non-human primates. They found 14 studies on the topic, and revealed a striking mismatch between how the literature was cited and what it actually showed. Studies in this area are of interest to those attempting to explain the well-known socioeconomic gradient in health. It’s hard to unpack this in humans, because there are so many correlated characteristics that could potentially explain the association. The primate work has been cited to support psychosocial accounts of the link; i.e., the idea that socioeconomic influences on health operate primarily through psychological and social mechanisms. Demonstration of such an impact in primates is particularly convincing, because stress and social status can be experimentally manipulated in a way that is not feasible in humans.
The conclusion from the review was stark: ‘Overall, non-human primate studies present only limited evidence for an association between social status and coronary artery disease. Despite this, there is selective citation of individual non-human primate studies in reviews and commentaries relating to human disease aetiology’(p. e27937).
The relatively bland account in the written paper belies the stress that George and his colleague went through in doing this work. Before I tried doing one myself, I thought that a systematic review was a fairly easy and humdrum exercise. It could be if the literature were not so unruly. In practice, however, you not only have to find and synthesise the relevant evidence, but also to read and re-read papers to work out what exactly was done. Often, it’s not just a case of computing an effect size: finding the numbers that match the reported result can be challenging. One paper in the review that was particularly highly-cited in the epidemiology literature turned out to have data that were problematic: the raw data shown in scattergraphs are hard to reconcile with the adjusted means reported in a summary (see Figure below). Correspondence sent to the author apparently did not receive a reply, let alone an explanation.
using this evidence to draw conclusions about human health focused on the ‘five-fold increase’ in coronary disease in dominant animals who became subordinate.
So what impact has the systematic review achieved? Well, the first point to note is that the authors had a great deal of difficulty getting it accepted for publication: it would be sent to reviewers who worked on stress in monkeys, and they would recommend rejection. This went on for some years: the abstract was first published in 2003, but the full paper did not appear until 2012.
The second, disappointing conclusion comes from looking at citations of the original studies reviewed by Petticrew and Davey Smith in the human health literature since their review appeared. The systematic review garnered 4 citations in the period 2013-2015 and just one during 2016-2018. The mean number of citations for the 14 articles covered in their meta-analysis was 2.36 for 2013-2015, and 3.00 for 2016-2018. The article that was the source of the Figure above had six citations in the human health literature in 2013-2015 and four in 2016-2018. These numbers aren’t sufficient for more than impressionistic interpretation, and I only did a superficial trawl through abstracts of citing papers, so I am not in a position to determine if all of these articles accepted the study authors’ conclusions. However, the pattern of citations fits with past experience in other fields showing that when cherry-picked facts fit a nice story, they will continue to be cited, without regard to subsequent corrections, criticism or even retraction.
The reason why this worries me is that the stark conclusion would appear to be that we can’t trust citations of the research literature unless they are based on well-conducted systematic reviews. Iain Chalmers has been saying this for years, and in his field of clinical trials these are more common than in other disciplines. But there are still many fields where it is seen as entirely appropriate to write an introduction to a paper that only cites supportive evidence and ignores a swathe of literature that shows null or opposite results. Most postgraduates have an initial thesis chapter that reviews the literature, but it's rare, at least in psychology, to see a systematic review - perhaps because this is so time-consuming and can be soul-destroying. But if we continue to cherry-pick evidence that suits us, then we are not so much standing on the shoulders of giants as slithering around on jellyfish, and science will not progress.
Labels:
coronary artery disease,
primates,
stress,
systematic review
Thursday, 12 July 2018
One big study or two small studies? Insights from simulations
At a recent conference, someone posed a question that had been intriguing me for a while: suppose you have limited resources, with the potential to test N participants. Would it be better to do two studies, each with N/2 participants, or one big study with all N?
I've been on the periphery of conversations about this topic, but never really delved into it, so I gave a rather lame answer. I remembered hearing that statisticians would recommend the one big study option, but my intuition was that I'd trust a result that replicated more than one which was a one-off, even if the latter was from a bigger sample. Well, I've done the simulations and it's clear that my intuition is badly flawed.
Here's what I did. I adapted a script that is described in my recent slides that give hands-on instructions for beginners on how to simulate data. The script, Simulation_2_vs_1_study_b.R, which can be found here, generates data for a simple two-group comparison using a t-test. In this version, on each run of the simulation, you get output for one study where all subjects are divided into two groups of size N, and for two smaller studies each with half the number of subjects. I ran it with various settings to vary both the sample size and the effect size (Cohen's d). I included the case where there is no real difference between groups (d = 0), so I could estimate the false positive rate as well as the power to detect a true effect.
I used a one-tailed t-test, as I had pre-specified that group B had the higher mean when d > 0. I used a traditional approach with p-value cutoffs for statistical significance (and yes, I can hear many readers tut-tutting, but this is useful for this demonstration….) to see how often I got a result that met each of three different criteria:
- a single study with p < .05
- a split replication in which both studies gave p < .05
- a single study with the more stringent criterion of p < .005
Figure 1 summarises the results.
The figure is pretty busy but worth taking a while to unpack. Power is just the proportion of runs of the simulation where the significance criterion was met. It's conventional to adopt a power cutoff of .8 when deciding on how big a sample to use in a study. Sample size is colour coded, and refers to the number of subjects per group for the single study. So for the split replication, each group has half this number of subjects. The continuous line shows the proportion of results where p < .05 for the single study, the dotted line has results from the split replication, and the dashed line has results from the single study with the more stringent significance criterion, p < .005.
It's clear that for all sample sizes and all effect sizes, the one single sample is much better powered than the split replication.
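The pattern in Figure 1 can be roughly reproduced without running any simulation. What follows is only a sketch in Python, not the R script described above: it swaps the simulated t-tests for an analytic normal approximation, and the function names (`power_one_tailed`, `compare_designs`) are made up for illustration.

```python
from statistics import NormalDist

# Standard normal distribution, used here as a stand-in for the t distribution
# (an assumption of this sketch; reasonable when groups are not tiny).
STD_NORMAL = NormalDist()


def power_one_tailed(d, n_per_group, alpha):
    """Approximate power of a one-tailed two-group comparison.

    d is Cohen's d and n_per_group is the number of subjects in each group.
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)
    # Expected z statistic: d divided by the standard error sqrt(2 / n).
    noncentrality = d * (n_per_group / 2) ** 0.5
    return 1 - STD_NORMAL.cdf(z_crit - noncentrality)


def compare_designs(d, n_per_group):
    """Return power for: single study p < .05, split replication, single study p < .005."""
    single_05 = power_one_tailed(d, n_per_group, 0.05)
    single_005 = power_one_tailed(d, n_per_group, 0.005)
    # Split replication: two independent studies with half the subjects each,
    # and both have to reach p < .05.
    split_05 = power_one_tailed(d, n_per_group / 2, 0.05) ** 2
    return single_05, split_05, single_005


if __name__ == "__main__":
    for d in (0.0, 0.3, 0.5, 0.8):
        single_05, split_05, single_005 = compare_designs(d, n_per_group=40)
        print(f"d={d:.1f}  single p<.05: {single_05:.3f}  "
              f"split replication: {split_05:.3f}  single p<.005: {single_005:.3f}")
```

With d = 0 the three values are just the false positive rates for each criterion; the exact numbers for d > 0 will differ a little from a t-test simulation because of the normal approximation.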
But I then realised what had been bugging me and why my intuition was different. Look at the bottom left of the figure, where the x-axis is zero: the continuous lines (i.e., big sample, p < .05) all cross the y-axis at .05. This is inevitable: by definition, if you set p < .05, there's a one in 20 chance that you'll get a significant result when there's really no group difference in the population, regardless of the sample size. In contrast, the dotted lines cross the y-axis close to zero, reflecting the fact that when the null hypothesis is true, the chance of two samples both giving p < .05 in a replication study is one in 400 (.05^2 = .0025). So I had been thinking more like a Bayesian: given a significant result, how likely was it to have come from a population with a true effect rather than a null effect? This is a very different thing from what a simple p-value tells you*.
Initially, I thought I was onto something. If we just stick with p < .05, then it could be argued that from a Bayesian perspective, the split replication approach is preferable. Although you are less likely to see a significant effect with this approach, when you do, you can be far more confident it is a real effect. In formal terms, the likelihood ratio for a true vs null hypothesis, given p < .05, will be much higher for the replication.
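To make the likelihood ratio concrete, here is a minimal sketch in Python. It assumes the ratio is taken as the probability of a significant result under a true effect (power) divided by the probability under the null (the false positive rate), and it uses a normal approximation rather than the simulated t-tests. The names `power_one_tailed` and `likelihood_ratios` are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # normal approximation, an assumption of this sketch


def power_one_tailed(d, n_per_group, alpha):
    """Approximate power of a one-tailed two-group comparison."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)
    noncentrality = d * (n_per_group / 2) ** 0.5
    return 1 - STD_NORMAL.cdf(z_crit - noncentrality)


def likelihood_ratios(d, n_per_group, alpha=0.05):
    """Return P(significant | true effect) / P(significant | null) for each design."""
    # Single study with all subjects: false positive rate is alpha.
    lr_single = power_one_tailed(d, n_per_group, alpha) / alpha
    # Split replication: both half-sized studies must be significant,
    # so both the power and the false positive rate are squared.
    power_half = power_one_tailed(d, n_per_group / 2, alpha)
    lr_split = power_half ** 2 / alpha ** 2
    return lr_single, lr_split


if __name__ == "__main__":
    lr_single, lr_split = likelihood_ratios(d=0.5, n_per_group=40)
    print(f"single study, p < .05:      likelihood ratio {lr_single:.1f}")
    print(f"split replication, p < .05: likelihood ratio {lr_split:.1f}")
```

The replication comes out ahead on this measure because squaring shrinks the false positive rate (from .05 to .0025) proportionally more than it shrinks the power.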
My joy at having my insight confirmed was, however, short-lived. I realised that this benefit of the replication approach could be exceeded with the single big sample simply by reducing the p-value so that the odds of a false positive are minimal. That's why Figure 1 also shows the scenario for one big sample with p < .005: a threshold that has recently been proposed as a general recommendation for claims of new discoveries (Benjamin et al, 2018)**.
None of this will surprise expert statisticians: Figure 1 just reflects basic facts about statistical power that were popularised by Jacob Cohen in 1977. But I'm glad to have my intuitions now more aligned with reality, and I'd encourage others to try simulation as a great way to get more insights into statistical methods.
Here are the conclusions I've drawn from the simulation:
*In his paper The reproducibility of research and the misinterpretation of p-values. Royal Society Open Science, 4(171085). doi:10.1098/rsos.171085 David Colquhoun (2017) discusses these issues and notes that we also need to consider the prior likelihood of the null hypothesis being true: something that is unknowable and can only be estimated on the basis of past experience and intuition.
**The proposal for adopting p < .005 as a more stringent statistical threshold for new discoveries can be found here: Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E. J., Berk, R., . . . Johnson, V. E. (2018). Redefine statistical significance. Nature Human Behaviour, 2(1), 6-10. doi:10.1038/s41562-017-0189-z
This blogpost has generated a lot of discussion, mostly on Twitter. One point that particularly interested me was a comment that I hadn’t done a fair comparison between the one-study and two-study situation, because the plot showed a one-off two group study with an alpha at .005, versus a replication study (half sample size in each group) with alpha at .05. For a fair comparison, it was argued, I should equate the probabilities between the two situations, i.e. the alpha for the one-off study should be .05 squared = .0025.
So I took a look at the fair comparison: Figure 2 shows the situation when comparing one study with alpha set to .0025 vs a split replication with alpha of .05. The intuition of many people on Twitter was that these should be identical, but they aren’t. Why not? We have the same information in the two samples. (In fact, I modified the script so that this was literally true and the same sample was tested singly and again split into two – previously I’d just resampled to get the smaller samples. This makes no difference – the single sample with more extreme alpha still gives higher power).
To look at it another way, in one version of the simulation there were 1600 simulated experiments with a true effect (including all the simulated sample sizes and effect sizes). Of these, 581 were identified as ‘significant’ by the one-off study with p < .0025 and were also replicated in two small studies with p < .05. Only 5 were identified by the split replication alone, but 134 were identified by the one-off study alone.
I've been on the periphery of conversations about this topic, but never really delved into it, so I gave a rather lame answer. I remembered hearing that statisticians would recommend the one big study option, but my intuition was that I'd trust a result that replicated more than one which was a one-off, even if the latter was from a bigger sample. Well, I've done the simulations and it's clear that my intuition is badly flawed.
Here's what I did. I adapted a script that is described in my recent slides that give hands-on instructions for beginners on how to simulate data. The script, Simulation_2_vs_1_study_b.R, which can be found here, generates data for a simple two-group comparison using a t-test. In this version, on each run of the simulation, you get output for one study where all subjects are divided into two groups of size N, and for two smaller studies each with half the number of subjects. I ran it with various settings to vary both the sample size and the effect size (Cohen's d). I included the case where there is no real difference between groups (d = 0), so I could estimate the false positive rate as well as the power to detect a true effect.
I used a one-tailed t-test, as I had pre-specified that group B had the higher mean when d > 0. I used a traditional approach with p-value cutoffs for statistical significance (and yes, I can hear many readers tut-tutting, but this is useful for this demonstration….) to see how often I got a result that met each of three different criteria:
- a) Single study, p < .05
- b) Split sample, p < .05 replicated in both studies
- c) Single study, p < .005
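The procedure just described can be sketched in a few lines. This is not the R script mentioned above: it is a minimal Python illustration, and several details are my assumptions rather than facts about that script. It assumes normally distributed scores with SD 1, an equal-variance (pooled) one-tailed t-test, and a split made by cutting the same sample in half. The function names (`t_upper_tail`, `one_tailed_p`, `simulate`) are hypothetical.

```python
import math
import random


def t_upper_tail(t, df, steps=200):
    """Upper-tail probability P(T > t) for Student's t, using Simpson's rule."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    h = abs(t) / steps  # steps must be even for Simpson's rule
    area = pdf(0.0) + pdf(abs(t))
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    area *= h / 3  # integral of the density from 0 to |t|
    p = 0.5 - area if t >= 0 else 0.5 + area
    return min(1.0, max(0.0, p))


def mean_var(x):
    """Sample mean and unbiased sample variance."""
    n = len(x)
    m = sum(x) / n
    return m, sum((xi - m) ** 2 for xi in x) / (n - 1)


def one_tailed_p(a, b):
    """p-value for H1: mean(b) > mean(a); pooled-variance t-test (an assumption)."""
    na, nb = len(a), len(b)
    ma, va = mean_var(a)
    mb, vb = mean_var(b)
    pooled_var = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    t = (mb - ma) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t_upper_tail(t, na + nb - 2)


def simulate(n_per_group, d, runs=2000, seed=1):
    """Proportion of runs meeting criteria a), b) and c) from the list above."""
    rng = random.Random(seed)
    hits = {"single_05": 0, "split_05": 0, "single_005": 0}
    half = n_per_group // 2
    for _ in range(runs):
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]  # group A, mean 0
        b = [rng.gauss(d, 1) for _ in range(n_per_group)]  # group B, mean d
        p_big = one_tailed_p(a, b)
        # Split the same data into two half-sized studies (assumed design)
        p1 = one_tailed_p(a[:half], b[:half])
        p2 = one_tailed_p(a[half:], b[half:])
        hits["single_05"] += p_big < .05
        hits["split_05"] += (p1 < .05 and p2 < .05)
        hits["single_005"] += p_big < .005
    return {k: v / runs for k, v in hits.items()}
```

Calling `simulate(48, 0.5)` gives the estimated power under each criterion for 48 subjects per group and d = .5, and `simulate(48, 0.0)` gives the corresponding false positive rates. The estimates vary with the seed and the number of runs.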
Figure 1 summarises the results.
Figure 1
The figure is pretty busy but worth taking a while to unpack. Power is just the proportion of runs of the simulation where the significance criterion was met. It's conventional to adopt a power cutoff of .8 when deciding on how big a sample to use in a study. Sample size is colour coded, and refers to the number of subjects per group for the single study. So for the split replication, each group has half this number of subjects. The continuous line shows the proportion of results where p < .05 for the single study, the dotted line has results from the split replication, and the dashed line has results from the single study with more stringent significance criterion, p < .005.
It's clear that for all sample sizes and all effect sizes, the one single sample is much better powered than the split replication.
But I then realised what had been bugging me and why my intuition was different. Look at the bottom left of the figure, where the x-axis is zero: the continuous lines (i.e., big sample, p < .05) all cross the y-axis at .05. This is inevitable: by definition, if you set p < .05, there's a one in 20 chance that you'll get a significant result when there's really no group difference in the population, regardless of the sample size. In contrast, the dotted lines cross the y-axis close to zero, reflecting the fact that when the null hypothesis is true, the chance of two samples both giving p < .05 in a replication study is one in 400 (.05^2 = .0025). So I had been thinking more like a Bayesian: given a significant result, how likely was it to have come from a population with a true effect rather than a null effect? This is a very different thing from what a simple p-value tells you*.
Initially, I thought I was onto something. If we just stick with p < .05, then it could be argued that from a Bayesian perspective, the split replication approach is preferable. Although you are less likely to see a significant effect with this approach, when you do, you can be far more confident it is a real effect. In formal terms, the likelihood ratio for a true vs null hypothesis, given p < .05, will be much higher for the replication.
My joy at having my insight confirmed was, however, short-lived. I realised that this benefit of the replication approach could be exceeded with the single big sample simply by lowering the p-value threshold so that the odds of a false positive are minimal. That's why Figure 1 also shows the scenario for one big sample with p < .005: a threshold that has recently been proposed as a general recommendation for claims of new discoveries (Benjamin et al, 2018)**.
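To make the likelihood-ratio argument concrete, here is a rough Python sketch. It takes the ratio to be P(criterion met | true effect of size d) / P(criterion met | null), and it uses a normal approximation to one-tailed power rather than the t-test simulation, so the numbers are only indicative. The function names and the `strict_alpha` parameter are hypothetical. The ordering of the three ratios depends on the sample size, the effect size and the threshold chosen, so it is worth trying several settings.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal


def approx_power(n_per_group, d, alpha):
    """Normal approximation to one-tailed power for a two-group comparison."""
    noncentrality = d * (n_per_group / 2) ** 0.5
    return Z.cdf(noncentrality - Z.inv_cdf(1 - alpha))


def likelihood_ratios(n_per_group, d, strict_alpha=.005):
    """Ratio of P(significant | effect d) to P(significant | null) per criterion."""
    half = n_per_group // 2
    return {
        "single_05": approx_power(n_per_group, d, .05) / .05,
        # Both half-sized studies must reach p < .05, so probabilities multiply
        "split_05": approx_power(half, d, .05) ** 2 / .05 ** 2,
        "single_strict": approx_power(n_per_group, d, strict_alpha) / strict_alpha,
    }
```

For example, `likelihood_ratios(48, 0.5)` compares the three criteria for 48 subjects per group and d = .5, and passing `strict_alpha=.0025` equates the false positive rate of the single study with that of the split replication.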
None of this will surprise expert statisticians: Figure 1 just reflects basic facts about statistical power that were popularised by Jacob Cohen in 1977. But I'm glad to have my intuitions now more aligned with reality, and I'd encourage others to try simulation as a great way to get more insights into statistical methods.
Here are the conclusions I've drawn from the simulation:
- First, even when the two groups come from populations with different means, it's unlikely that you'll get a clear result from a single small study unless the effect size is at least moderate; and the odds of finding a replicated significant effect are substantially lower than this. None of the dotted lines achieves 80% power for a replication if effect size is less than .3 - and many effects in psychology are no bigger than that.
- Second, from a statistical perspective, testing an a priori hypothesis in a larger sample with a lower p-value is more efficient than subdividing the sample and replicating the study using a less stringent p-value.
*David Colquhoun (2017) discusses these issues in his paper 'The reproducibility of research and the misinterpretation of p-values' (Royal Society Open Science, 4(171085), doi:10.1098/rsos.171085), and notes that we also need to consider the prior likelihood of the null hypothesis being true: something that is unknowable and can only be estimated on the basis of past experience and intuition.
**The proposal for adopting p < .005 as a more stringent statistical threshold for new discoveries can be found here: Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E. J., Berk, R., . . . Johnson, V. E. (2018). Redefine statistical significance. Nature Human Behaviour, 2(1), 6-10. doi:10.1038/s41562-017-0189-z
Postscript, 15th July 2018
This blogpost has generated a lot of discussion, mostly on Twitter. One point that particularly interested me was a comment that I hadn’t done a fair comparison between the one-study and two-study situation, because the plot showed a one-off two group study with an alpha at .005, versus a replication study (half sample size in each group) with alpha at .05. For a fair comparison, it was argued, I should equate the probabilities between the two situations, i.e. the alpha for the one-off study should be .05 squared = .0025.
So I took a look at the fair comparison: Figure 2 shows the situation when comparing one study with alpha set to .0025 vs a split replication with alpha of .05. The intuition of many people on Twitter was that these should be identical, but they aren’t. Why not? We have the same information in the two samples. (In fact, I modified the script so that this was literally true and the same sample was tested singly and again split into two – previously I’d just resampled to get the smaller samples. This makes no difference – the single sample with more extreme alpha still gives higher power).
Figure 2: Power for one-off study with alpha .0025 (dashed lines) vs. split replication with p < .05
I think I worked out why this is the case, though I’d appreciate having a proper statistical opinion. It seems to have to do with accuracy of estimating the standard deviation. If you have a split sample and you estimate the mean from each half (A and B), then the average of mean A and mean B will be the same as for the big sample of AB combined. But when it comes to estimating the standard deviation – which is a key statistic when computing group differences – the estimate is more accurate and precise with the large sample. This is because the standard deviation is computed by measuring the difference of each value from its own sample mean. Means for A and B will fluctuate due to sampling error, and this will make the estimated SDs less reliable. You can estimate the pooled standard deviation for two samples by taking the square root of the average of the variances. However, that value is less precise than the SD from the single large sample. I haven’t done a large number of runs, but a quick check suggests that whereas both the one-off study and the split replication give pooled estimates of the SD at around the true value of 1.0, the standard deviation of the standard deviation (we are getting very meta here!) is around .01 for the one-off study but .14 for the split replication. Again, I’m reporting results from across all the simulated trials, including the full range of sample sizes and effect sizes.
Figure 3: Distribution of estimates of pooled SD. The range is narrower for the one-off study (pink) than for the split replication studies (blue). Purple shows area of overlap of distributions
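The point about the precision of the SD estimate can be checked with a short simulation. The sketch below is a Python illustration, not the code behind Figure 3. It assumes two groups drawn from a normal population with true SD 1 and a single fixed sample size rather than the full range used in the post. It computes the pooled SD as the square root of the average of the two variances, and the function names are hypothetical.

```python
import math
import random
from statistics import mean, stdev, variance


def pooled_sd(a, b):
    """Square root of the average of the two group variances (equal group sizes)."""
    return math.sqrt((variance(a) + variance(b)) / 2)


def sd_of_sd(n_per_group, runs=2000, seed=1):
    """Mean and SD of pooled-SD estimates: full-sized study vs half-sized study."""
    rng = random.Random(seed)
    half = n_per_group // 2
    one_off, split_half = [], []
    for _ in range(runs):
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        one_off.append(pooled_sd(a, b))
        # One of the two half-sized replication studies, cut from the same data
        split_half.append(pooled_sd(a[:half], b[:half]))
    return {
        "one_off": (mean(one_off), stdev(one_off)),
        "split_half": (mean(split_half), stdev(split_half)),
    }
```

Each entry in the result is a (mean, SD) pair for the pooled-SD estimates. The second element is the 'standard deviation of the standard deviation', which is expected to be larger for the half-sized study.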
This has been an intriguing puzzle to investigate, but in the original post, I hadn’t really been intending to do this kind of comparison - my interest was rather in making the more elementary point which is that there's a very low probability of achieving a replication when sample size and effect size are both relatively small.
Returning to that issue, another commentator said that they’d have far more confidence in five small studies all showing the same effect than in one giant study. This is exactly the view I would have taken before I looked into this with simulations; but I now realise this idea has a serious flaw, which is that you’re very unlikely to get those five replications, even if you are reasonably well powered, because – the tl;dr message implicit in this post – when we’re talking about replications, we have to multiply the probabilities, and they rapidly get very low. So, if you look at the figure, suppose you have a moderate effect size, around .5, then you need a sample of 48 per group to get 80% power. But if you repeat the study five times, then the chance of getting a positive result in all five cases is .8^5, which is .33. So most of the time you’d get a mixture of null and positive results. Even if you doubled the sample size to increase power to around .95, the chance of all five studies coming out positive is still only .95^5 (77%).
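The multiplication of probabilities is easy to verify directly. This small Python sketch assumes the studies are independent and all have the same power. It also gives the chance of getting exactly m significant results out of k, using the binomial distribution. The function names are hypothetical.

```python
from math import comb


def prob_all_significant(power, k):
    """Chance that all k independent studies with the same power are significant."""
    return power ** k


def prob_exactly(power, k, m):
    """Binomial chance that exactly m of k independent studies are significant."""
    return comb(k, m) * power ** m * (1 - power) ** (k - m)


if __name__ == "__main__":
    for power in (.8, .95):
        print(power, round(prob_all_significant(power, 5), 2))
    # Distribution of the number of significant results in five studies at 80% power
    print([round(prob_exactly(.8, 5, m), 3) for m in range(6)])
```

Listing the whole distribution, rather than just the all-five case, shows how often a set of adequately powered studies would be expected to produce a mixture of null and positive results.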
Finally, another suggestion from Twitter is that a meta-analysis of several studies should give the same result as a single big sample. I’m afraid I have no expertise in meta-analysis, so I don’t know how well it handles the issue of more variable SD estimates in small samples, but I’d be interested to hear more from any readers who are up to speed with this.
Tuesday, 26 June 2018
Preprint publication as karaoke
Doing research, analysing the results, and writing it up is a prolonged and difficult process. Submitting the paper to a journal is an anxious moment. Of course, you hope the editor and reviewers will love it and thank you for giving them the opportunity to read your compelling research. And of course, that never happens. More often you get comments from reviewers pointing out the various inadequacies of your grasp of the literature, your experimental design and your reasoning, leading to further angst as you consider how to reply. But worse than this is silence. You hear nothing. You enquire. You are told that the journal is still seeking reviewers. If you go through that loop a few times, you start to feel like the Jane Austen heroine who, having dressed up in her finery for the ball, spends the evening being ignored by all the men, while other, superficial and gaudy women are snapped up as dance partners.
There have been some downcast tweets in my timeline about papers getting stuck in this kind of journal limbo. When I suggested that it might help to post papers as preprints, several people asked how this worked, so I thought a short account might be useful.
To continue the analogy, a preprint server offers you a more modern world where you can try karaoke. You don't wait to be asked: you grab the microphone and do your thing. I now routinely post all my papers as preprints before submitting them to a journal. It gets the work out there, so even if journals are unduly slow, it can be read and you can get feedback on it.
So how does it work? Pre-prints are electronic articles that are not peer-reviewed. I hope those who know more about the history will be able to comment on this, as I'm hazy on the details, but the idea started with physicists, to whom the thought of waiting around for an editorial process to complete seemed ridiculous. Physicists have been routinely posting their work on arXiv (pronounced 'archive') for years to ensure rapid evaluation and exchange of ideas. They do still publish in journals, which creates a formal version of record, but the arXiv is what most of them read. The success of arXiv led to the development of BioRxiv, and then more recently PsyArXiv and SocArXiv. Some journals also host preprints - I have had good experiences with PeerJ, where you can deposit an article as a preprint, with the option of then updating it to a full submission to the journal if you wish*.
All of these platforms operate some basic quality control. For instance, the BioRxiv website states: 'all articles undergo a basic screening process for offensive and/or non-scientific content and for material that might pose a health or biosecurity risk and are checked for plagiarism'. However, once they have passed screening, articles are deposited immediately without further review.
Contrary to popular opinion, publishing a preprint does not usually conflict with journal policies. You can check the policy of the journal on the Sherpa/ROMEO database: most allow preprints prior to submission.
Sometimes concerns are expressed that if you post a preprint your work might be stolen by someone who'll then publish a journal article before you. In fact, it's quite the opposite. A preprint has a digital object identifier (DOI) and establishes your precedence, so guards against scooping. If you are in a fast-moving field where an evil reviewer will deliberately hold up your paper so they can get in ahead, pre-printing is the answer.
So when should you submit a preprint? I would normally recommend doing this a week or two before submitting to a journal, to allow for the possibility of incorporating feedback into the submitted manuscript, but, given that you will inevitably be asked for revisions by journal reviewers, if you post a preprint immediately before submission you will still have an opportunity to take on board other comments.
So what are the advantages of posting preprints?
1. The most obvious one is that people can access your work in a timely fashion. Preprints are freely available to all: a particularly welcome feature if you work in an area that has implications for clinical practice or policy, where practitioners may not have access to academic journals.
2. There have been cases where authors of a preprint have been invited to submit the work to a journal by an editor. This has never happened to me, but it's nice to know it's a possibility!
3. You can cite a preprint on a job application: it won't count as much as a peer-reviewed publication, but it does make it clear that the work is completed, and your evaluators can read it. This is preferable to just citing work as 'submitted'. Some funders are now also allowing preprints to be cited: https://wellcome.ac.uk/news/we-now-accept-preprints-grant-applications
4. Psychologically, for the author, it can be good to have a sense that the work is 'out there'. You have at least some control over the dissemination of your research, whereas waiting for editors and reviewers is depressing because you just feel powerless.
5. You can draw attention to a preprint on social media and explicitly request feedback. This is particularly helpful if you don't have colleagues to hand who are willing to read your paper. If you put out a request on Twitter, it doesn't mean people will necessarily reply, but you could get useful suggestions for improvement and/or make contact with others interested in your field.
On this final point, it is worth noting that there are several reasons why papers linger in journal limbo: it does not necessarily mean that the journal administration or editor is incompetent (though that can happen!). The best of editors can have a hard job finding reviewers: it's not uncommon to have to invite ten reviewers to find two who agree to review. If your paper is in a niche area then it gets even harder. For these reasons it is crucial to make your title and abstract as clear and interesting as possible: these are the only parts of the paper that potential reviewers will see, and if you are getting a lot of refusals to review, it could be that your abstract is a turn-off. So asking for feedback on a preprint may help you rewrite it in a way that encourages more interest from reviewers.
*Readers: please feel free to add other suggestions while comments are open. (I close comments once the invasion of spammers starts - typically 3-4 weeks after posting).
Saturday, 9 June 2018
Developmental language disorder: the need for a clinically relevant definition
There's been debate over the new terminology for Developmental Language Disorder (DLD) at a meeting (SRCLD) in the USA. I've not got any of the nuance here, but I feel I should make a quick comment on one issue I was specifically asked about, viz:
As background: the field of children's language disorders has been a terminological minefield. The term Specific Language Impairment (SLI) began to be used widely in the 1980s as a diagnosis for children who had problems acquiring language for no apparent reason. One criterion for the diagnosis was that the child's language problems should be out of line with other aspects of development, and hence 'specific', and this was interpreted as requiring normal range nonverbal IQ (nviq).
The term SLI was never adopted by either of the two main diagnostic systems, WHO's International Classification of Diseases (ICD) or the American Psychiatric Association's Diagnostic and Statistical Manual (DSM), but the notion that IQ should play a part in the diagnosis became prevalent.
In 2016-7 I headed up the CATALISE project with the specific goal of achieving some consensus about the diagnostic criteria and terminology for children's language disorders: the published papers about this are openly available for all to read (see below). The consensus of a group of experts from a range of professions and countries was to reject SLI in favour of the term DLD.
Any child who meets criteria for SLI will meet criteria for DLD: the main difference is that the use of an IQ cutoff is no longer part of the definition. This does not mean that all children with language difficulties are regarded as having DLD: those who meet criteria for intellectual disability, known syndromes or biomedical conditions are treated separately (see these slides for summary).
The tweet seems to suggest we should retain the term SLI, with its IQ cutoff, because it allows us to do neatly controlled research studies. I realise a brief, second-hand tweet about Rice's views may not be a fair portrayal of what she said, but it does emphasise a bone of contention that was thoroughly gnawed in the discussions of the CATALISE panel, namely, what is the purpose of diagnostic terminology? I would argue its primary purpose is clinical, and clinical considerations are not well-served by research criteria.
The traditional approach to selecting groups for research is to find 'pure' cases - quite simply, if you include children who have other problems beyond language (including other neurodevelopmental difficulties) then it is much harder to know how far you are assessing correlates or causes of language problems: things get messy and associations get hard to interpret. The importance of controlling for nonverbal IQ has been particularly emphasised over many years: quite simply, if you compare language-impaired vs comparison (typically-developing, or td) children on a language or cognitive measure, and the language-impaired group has lower nonverbal ability, then it may be that you are looking at a correlate of nonverbal ability rather than language. Restricting consideration to those who meet stringent IQ criteria to equalise the groups is one way of addressing the issue.
However, there are three big problems with this approach:
1. A child's nonverbal IQ can vary from time to time and it will depend on the test that is used. However, although this is problematic, it's not the main reason for dropping IQ cutoffs; the strongest arguments concern validity rather than reliability of an IQ-based approach.
2. The use of IQ-cutoffs ignores the fact that pure cases of language impairment are the exception rather than the rule. In CATALISE we looked at the evidence and concluded that if we were going to insist that you could only get a diagnosis of DLD if you had no developmental problems beyond language, then we'd exclude many children with language problems (see also this old blogpost). If our main purpose is to get a diagnostic system that is clinically workable, it should be applicable to the children who turn up in our clinics - not just a rarefied few who meet research criteria. An analogy can be drawn with medicine: imagine if your doctor identified you with high blood pressure but refused to treat you unless you were in every other regard fit and healthy. That would seem both unfair and ill-judged. Presence of co-occurring conditions might be important for tracking down underlying causes and determining a treatment path, but it's not a reason for excluding someone from receiving services.
3. Even for research purposes, it is not clear that a focus on highly specific disorders makes sense. An underlying assumption, which I remember starting out with, was the idea that the specific cases were in some important sense different from those who had additional problems. Yet, as noted in the CATALISE papers, the evidence for this assumption is missing: nonverbal IQ has very little bearing on a child's clinical profile, response to intervention, or aetiology. For me, what really knocked my belief in the reality of SLI as a category was doing twin studies: typically, I'd find that identical twins were very similar in their language abilities, but they sometimes differed in nonverbal ability, to the extent that one met criteria for SLI and the other did not. Researchers who treat SLI as a distinct category are at risk of doing research that has no application to the real world.
There is nothing to stop researchers focusing on 'pure' cases of language disorder to answer research questions of theoretical interest, such as questions about the modularity of language. This kind of research uses children with a language disorder as a kind of 'natural experiment' that may inform our understanding of broader issues. It is, however, important not to confuse such research with work whose goal is to discover clinically relevant information.
If practitioners let the theoretical interests of researchers dictate their diagnostic criteria, then they are doing a huge disservice to the many children who end up in a no-man's-land, without either diagnosis or access to intervention.
References
Bishop, D. V. M. (2017). Why is it so hard to reach agreement on terminology? The case of developmental language disorder (DLD). International Journal of Language & Communication Disorders, 52(6), 671-680. doi:10.1111/1460-6984.12335
Bishop, D. V. M., Snowling, M. J., Thompson, P. A., Greenhalgh, T., & CATALISE Consortium. (2016). CATALISE: a multinational and multidisciplinary Delphi consensus study. Identifying language impairments in children. PLOS One, 11(7), e0158753. doi:10.1371/journal.pone.0158753
Bishop, D. V. M., Snowling, M. J., Thompson, P. A., Greenhalgh, T., & CATALISE Consortium. (2017). Phase 2 of CATALISE: a multinational and multidisciplinary Delphi consensus study of problems with language development: Terminology. Journal of Child Psychology and Psychiatry, 58(10), 1068-1080. doi:10.1111/jcpp.12721
There is nothing to stop researchers focusing on 'pure' cases of language disorder to answer research questions of theoretical interest, such as questions about the modularity of language. This kind of research uses children with a language disorder as a kind of 'natural experiment' that may inform our understanding of broader issues. It is, however, important not to confuse such research with work whose goal is to discover clinically relevant information.
If practitioners let the theoretical interests of researchers dictate their diagnostic criteria, then they are doing a huge disservice to the many children who end up in a no-man's-land, without either diagnosis or access to intervention.
References
Bishop, D. V. M. (2017). Why is it so hard to reach agreement on terminology? The case of developmental language disorder (DLD). International Journal of Language & Communication Disorders, 52(6), 671-680. doi:10.1111/1460-6984.12335
Bishop, D. V. M., Snowling, M. J., Thompson, P. A., Greenhalgh, T., & CATALISE Consortium. (2016). CATALISE: a multinational and multidisciplinary Delphi consensus study. Identifying language impairments in children. PLOS One, 11(7), e0158753. doi:10.1371/journal.pone.0158753
Bishop, D. V. M., Snowling, M. J., Thompson, P. A., Greenhalgh, T., & CATALISE Consortium. (2017). Phase 2 of CATALISE: a multinational and multidisciplinary Delphi consensus study of problems with language development: Terminology. Journal of Child Psychology and Psychiatry, 58(10), 1068-1080. doi:10.1111/jcpp.12721
Sunday, 27 May 2018
Sowing seeds of doubt: how Gilbert et al’s critique of the reproducibility project has played out
In Merchants of Doubt, Erik Conway and Naomi Oreskes describe how raising
doubt can be used as an effective weapon against inconvenient science. On
topics such as the effects of tobacco on health, climate change and causes of
acid rain, it has been possible to delay or curb action to tackle problems by
simply emphasising the lack of scientific consensus. This is always an option,
because science is characterised by uncertainty, and indeed, we move forward by
challenging one another’s findings: only a dead science would have no
disagreements. But those raising concerns wield a two-edged sword: spurious and discredited criticisms can disrupt scientific progress,
especially if the arguments are complex and technical: people will be left with
a sense that they cannot trust the findings, even if they don’t fully
understand the matters under debate.
The parallels with Merchants of Doubt occurred to me as I
re-read the critique
by Gilbert et al of the classic paper by the Open Science Collaboration (OSC)
on ‘Estimating
the reproducibility of psychological science’. I was prompted to do so
because we were discussing the OSC paper in a journal club* and inevitably the
question arose as to whether we needed to worry about reproducibility, in the
light of the remarkable claim by Gilbert et al: ‘We show
that OSC's article contains three major statistical errors and, when corrected,
provides no evidence of a replication crisis. Indeed, the evidence is also
consistent with the opposite conclusion -- that the reproducibility of
psychological science is quite high and, in fact, statistically
indistinguishable from 100%.’
The Gilbert et al critique has, in turn, been the subject of
considerable criticism, as well as a response by a
subset of the OSC group. I summarise the main points of contention in Table
1. At times Gilbert et al seem to be making a defeatist argument that we don’t
need to worry because replication in psychology is bound to be poor: something I
have disputed.
But my main focus in this post is simply to consider the
impact of the critique on the reproducibility debate by looking at citations of
the original article and the critique. A quick check on Web of Science found
797 citations of the OSC paper, 67 citations of Gilbert et al, and 33 citations
of the response by Anderson et al.
The next thing I did, admittedly in a very informal fashion,
was to download the details of the articles citing Gilbert et al and code them
according to the content of what they said, as either supporting Gilbert et
al’s view, rejecting the criticism, or being neutral. I discovered I needed a
fourth category for papers where the citation seemed wrong or so vague as to be unclassifiable. I discarded any papers where the relevant information could
not be readily accessed – I can access most journals via Oxford University but
a few were behind paywalls, others were not in English, or did not appear to
cite Gilbert et al. This left 44 citing papers that focused on the commentary
on the OSC study. Nine of these were supportive of Gilbert et al, two noted
problems with their analysis, but 33 were categorised as ‘neutral’, because the
citation read something like this:
“Because of the
current replicability crisis in psychological science (e.g., Open Science
Collaboration, 2015; but see Gilbert, King, Pettigrew, & Wilson, 2016)….”
The strong impression was that the authors of these papers lacked
either the appetite or the ability to engage with the detailed arguments in the
critique, but had a sense that there was a debate and felt that they should
flag this up. That’s when I started to think about Merchants of Doubt: whether intentionally or not, Gilbert et al had created an atmosphere of uncertainty to suggest there is no consensus on whether or not psychology has a reproducibility problem - people are left thinking that it's all very complicated and depends on arguments that are only of interest to statisticians. This makes it easier for those who are reluctant to take action to deal with the issue.
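To make the balance of these citations concrete, the counts reported above can be turned into proportions. This is only a minimal sketch: the category labels are shorthand I have chosen for illustration, and the numbers are simply the informal tallies given in the text, not a formal analysis.

```python
# Counts of citing papers by category, as reported in the text above.
# The labels are illustrative shorthand, not a formal coding scheme.
counts = {"supportive": 9, "critical": 2, "neutral": 33}

total = sum(counts.values())  # number of citing papers that were coded

# Share of each category as a percentage, rounded to one decimal place.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

for label, pct in shares.items():
    print(f"{label}: {counts[label]} of {total} ({pct}%)")
```

On these figures, three quarters of the coded papers cite the critique without taking a position on it.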
Fortunately, it looks as if Gilbert et al’s critique has
been less successful than might have been expected, given the eminence of the
authors. This may in part be because the arguments in favour of change are
founded not just on demonstrations such as the OSC project, but also on logical
analyses of statistical practices and publication biases that have been known
about for years (see slides 15-20 of my presentation here). Furthermore, as evidenced in the footnotes to Table 1, social
media allows a rapid evaluation of claims and counter-claims that hitherto was
not possible when debate was restricted to and controlled by journals. The publication this
week of three more big replication studies
just heaps on further empirical evidence that we have a problem that
needs addressing. Those who are saying ‘nothing to see here, move along’ cannot
retain any credibility.
Table 1
| Criticism | Rejoinder |
| --- | --- |
| ‘many of OSC’s replication studies drew their samples from different populations than the original studies did’ | · ‘Many’ implies the majority. No attempt to quantify – just gives examples · Did not show that this feature affected replication rate |
| ‘many of OSC’s replication studies used procedures that differed from the original study’s procedures in substantial ways.’ | · ‘Many’ implies the majority. No attempt to quantify – just gives examples · OSC showed that this did not affect replication rate · Most striking example used by Gilbert et al is given detailed explanation by Nosek (1) |
| ‘How many of their replication studies should we expect to have failed by chance alone? Making this estimate requires having data from multiple replications of the same original study.’ Used data from pairwise comparisons of studies from the Many Labs project to argue a low rate of agreement is to be expected. | · Ignores publication bias impact on original studies (2, 3) · G et al misinterpret confidence intervals (3, 4) · G et al fail to take sample size/power into account, though this is crucial determinant of confidence interval (3, 4) · ‘Gilbert et al.’s focus on the CI measure of reproducibility neither addresses nor can account for the facts that the OSC2015 replication effect sizes were about half the size of the original studies on average, and 83% of replications elicited smaller effect sizes than the original studies.’ (2) |
| Results depended on whether original authors endorsed the protocol for the replication: ‘This strongly suggests that the infidelities did not just introduce random error but instead biased the replication studies toward failure.’ | · Use of term ‘the infidelities’ assumes the only reason for lack of endorsement is departure from original protocol. (2) · Lack of endorsement included non-response from original authors (3) |
(1) Nosek: https://retractionwatch.com/2016/03/07/lets-not-mischaracterize-replication-studies-authors/
(2) Anderson et al: http://science.sciencemag.org/content/351/6277/1037.3
(3) Srivastava: https://thehardestscience.com/2016/03/03/evaluating-a-new-critique-of-the-reproducibility-project/
References
Anderson, C. J., Bahnik, S., Barnett-Cowan, M., et al. (2016). Response to Comment on "Estimating the reproducibility of psychological science". Science, 351(6277).
Gilbert, D. T., King, G., Pettigrew, S., & Wilson, T. D. (2016). Comment on "Estimating the reproducibility of psychological science". Science, 351(6277).
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251). doi:10.1126/science.aac4716
*Thanks to the enthusiastic efforts of some of our grad
students, and the support of Reproducible
Research Oxford, we’ve had a series of Reproducibilitea
journal clubs in our department this term.
I can recommend this as a great – and relatively cheap and easy – way of
raising awareness of issues around reproducibility in a department: something
that is sorely needed if a recent Twitter survey by Dan Lakens
is anything to go by.
Sunday, 13 May 2018
How to survive on Twitter – a simple rule to reduce stress
In recent weeks, I’ve seen tweets from a handful of people I
follow saying they are thinking of giving up Twitter because it has become so
negative. Of course they are entitled to do so, and they may find that it frees
up time and mental space that could be better used for other things. The
problem, though, is that I detect a sense of regret. And this is appropriate because Twitter, used judiciously,
has great potential for good.
For me as an academic, the benefits include:
· Finding out about latest papers and other developments relevant to my work
· Discovering new people with interesting points of view – often these aren’t eminent or well-known and I’d never have come across them if I hadn’t been on social media
· Being able to ask for advice from experts – sometimes getting a remarkably quick and relevant response
· Being able to interact with non-academics who are interested in the same stuff as me
· Getting a much better sense of the diversity of views in the broader community about topics I take for granted – this often influences how I go about public engagement
· Having fun – there are lots of witty people who brighten my day with their tweets
The bad side, of course, is that some people say things on
Twitter that they would not dream of saying to your face. They can be rude,
abusive, and cruel, and sometimes mind-bogglingly impervious to reason. We now
know that some of them are not even real people – they are just bots set up by
those who want to sow discord among those with different political views. So
how do we deal with that?
Well, I have a pretty simple rule that works for me, which
is that if I find someone rude, obnoxious, irritating or tedious, I mute them.
Muting differs from blocking in that the person doesn’t know they are muted. So
they may continue hurling abuse or provocations at you, unaware that they are
now screaming into the void.
A few years ago, when I first got into a situation where I
was attacked by a group of unpleasant alt-right people (who I now realise were
probably mostly bots), it didn’t feel right to ignore them, for three reasons:
· First, they were publicly maligning me, and I felt I should defend myself.
· Second, we’ve been told to beware the Twitter bubble. If we only interact on social media with those who are like-minded, it can create a totally false impression of what the world is like.
· Third, walking away from an argument is not a thing a good academic does: we are trained experts in reasoned debate, and our whole instinct is to engage with those who disagree with us, examine what they say and make a counterargument.
But I soon learned that some people on social media don’t
play by the rules of academic engagement. They are not sincere in their desire
to discuss topics: they have a viewpoint that nothing will change, and they
will use any method they can find to discredit an opponent. This includes ad
hominem attacks, lying and wilful misrepresentation of what you say. It's not cowardly to avoid these people: it's just a sensible reaction. So I now just mute anyone where I get a
whiff of such behaviour – directed either towards me or anyone else.
The thing is, social media is so different from normal
face-to-face interaction, that it needs different rules. Just imagine if you
were sitting with friends at the pub, having a chat, and someone barged in and
started shouting at you aggressively. Or someone sat down next to you,
uninvited, and proceeded to drone on about a very boring topic, impervious to
the impact they are having. People may have different ways of extricating
themselves from these situations, but one thing you can be sure of: when you
next go to the pub, you would not seek these individuals out and try to engage them
in discussion.
So my rule boils down to this: Ask yourself, if I was
talking to this person in the pub, would I want to prolong the interaction? Or,
if there was a button that I could press to make them disappear, would I use
it? Well, on social media, there is such
a button, and I recommend taking advantage of it.*
*I should make it clear that there are situations
when a person is subject to such a volume of abuse that this isn’t going to be
effective. Avoidance of Twitter for a while may be the only sensible option in
such cases. My advice is intended for those who aren’t the centre of a
vitriolic campaign, but who are turned off Twitter because of the stress it causes to observe or participate in hostile Twitter exchanges.