Thursday, 12 July 2018

One big study or two small studies? Insights from simulations

At a recent conference, someone posed a question that had been intriguing me for a while: suppose you have limited resources, with the potential to test N participants. Would it be better to do two studies, each with N/2 participants, or one big study with all N?

I've been on the periphery of conversations about this topic, but never really delved into it, so I gave a rather lame answer. I remembered hearing that statisticians would recommend the one big study option, but my intuition was that I'd trust a result that replicated more than one which was a one-off, even if the latter was from a bigger sample. Well, I've done the simulations and it's clear that my intuition is badly flawed.

Here's what I did. I adapted a script described in my recent slides, which give hands-on instructions for beginners on how to simulate data. The script, Simulation_2_vs_1_study_b.R, which can be found here, generates data for a simple two-group comparison using a t-test. In this version, each run of the simulation gives output for one big study in which all subjects are divided into two groups of size N, and for two smaller studies, each with half that number of subjects. I ran it with various settings to vary both the sample size and the effect size (Cohen's d). I included the case where there is no real difference between groups (d = 0), so I could estimate the false positive rate as well as the power to detect a true effect.

I used a one-tailed t-test, as I had pre-specified that group B had the higher mean when d > 0. I used a traditional approach with p-value cutoffs for statistical significance (and yes, I can hear many readers tut-tutting, but this is useful for this demonstration….) to see how often I got a result that met each of three different criteria:
  • a) Single study, p < .05 
  • b) Split sample, p < .05 replicated in both studies 
  • c) Single study, p < .005
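The script itself is in R, but the logic is easy to sketch in Python. This is a hedged illustration of the same idea, not the original script: the function name, the seed, and the settings are my own, and criterion (c) is implemented as alpha/10 (i.e. .005 when alpha = .05).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def power_estimate(n_per_group, d, n_sims=5000, alpha=0.05):
    """Proportion of simulation runs meeting each criterion:
    (a) single study p < alpha, (b) both half-sized studies p < alpha,
    (c) single study p < alpha/10."""
    hits = {"single_05": 0, "replicated": 0, "single_005": 0}
    for _ in range(n_sims):
        a = rng.normal(0, 1, n_per_group)   # group A
        b = rng.normal(d, 1, n_per_group)   # group B, shifted by d
        # one-tailed test: group B pre-specified to have the higher mean
        p_big = stats.ttest_ind(b, a, alternative="greater").pvalue
        half = n_per_group // 2
        p1 = stats.ttest_ind(b[:half], a[:half], alternative="greater").pvalue
        p2 = stats.ttest_ind(b[half:], a[half:], alternative="greater").pvalue
        hits["single_05"] += p_big < alpha
        hits["replicated"] += (p1 < alpha) and (p2 < alpha)
        hits["single_005"] += p_big < alpha / 10
    return {k: v / n_sims for k, v in hits.items()}
```

Running this across a grid of n_per_group and d values reproduces the pattern plotted in Figure 1.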

Figure 1 summarises the results.
Figure 1


The figure is pretty busy but worth taking a while to unpack. Power is simply the proportion of simulation runs in which the significance criterion was met. It's conventional to adopt a power of .8 as the target when deciding how big a sample to use in a study. Sample size is colour coded, and refers to the number of subjects per group for the single study; for the split replication, each group has half this number of subjects. The continuous line shows the proportion of results where p < .05 for the single study, the dotted line shows results for the split replication, and the dashed line shows results for the single study with the more stringent significance criterion, p < .005.

It's clear that for all sample sizes and all effect sizes, the single big sample is much better powered than the split replication.

But I then realised what had been bugging me and why my intuition was different. Look at the bottom left of the figure, where the x-axis is zero: the continuous lines (i.e., big sample, p < .05) all cross the y-axis at .05. This is inevitable: by definition, if you set p < .05, there's a one in 20 chance that you'll get a significant result when there's really no group difference in the population, regardless of the sample size. In contrast, the dotted lines cross the y-axis close to zero, reflecting the fact that when the null hypothesis is true, the chance of two samples both giving p < .05 in a replication study is one in 400 (.05^2 = .0025). So I had been thinking more like a Bayesian: given a significant result, how likely is it to have come from a population with a true effect rather than a null effect? This is a very different thing from what a simple p-value tells you*.

Initially, I thought I was onto something. If we just stick with p < .05, then it could be argued that from a Bayesian perspective, the split replication approach is preferable. Although you are less likely to see a significant effect with this approach, when you do, you can be far more confident it is a real effect. In formal terms, the likelihood ratio for a true vs null hypothesis, given p < .05, will be much higher for the replication.
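For a simple "effect of size d vs null" comparison, this likelihood ratio is just power divided by the false positive rate. A small sketch with illustrative numbers (the power values are assumptions for the example, not outputs of the simulation):

```python
def likelihood_ratio(power, alpha):
    # P(significant | true effect) / P(significant | null)
    return power / alpha

# Illustrative values: a big study with 80% power at p < .05, vs a
# split replication where each half-sized study has 50% power.
lr_single = likelihood_ratio(0.80, 0.05)            # 0.80 / 0.05   = 16
lr_replic = likelihood_ratio(0.50 ** 2, 0.05 ** 2)  # 0.25 / 0.0025 = 100
```

When both hypotheses are simple, the posterior odds of a true effect are just the prior odds multiplied by this ratio, so the replication criterion, when met, shifts belief much further.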

My joy at having my insight confirmed was, however, short-lived. I realised that this benefit of the replication approach could be exceeded with the single big sample simply by reducing the p-value cutoff, so that the odds of a false positive become minimal. That's why Figure 1 also shows the scenario for one big sample with p < .005: a threshold that has recently been proposed as a general recommendation for claims of new discoveries (Benjamin et al, 2018)**.

None of this will surprise expert statisticians: Figure 1 just reflects basic facts about statistical power that were popularised by Jacob Cohen in 1977. But I'm glad to have my intuitions now more aligned with reality, and I'd encourage others to try simulation as a great way to get more insights into statistical methods.

Here are the conclusions I've drawn from the simulation:
  • First, even when the two groups come from populations with different means, you're unlikely to get a clear result from a single small study unless the effect size is at least moderate; and the odds of finding a replicated significant effect are substantially lower still. None of the dotted lines achieves 80% power for a replication if the effect size is less than .3 - and many effects in psychology are no bigger than that.
  • Second, from a statistical perspective, testing an a priori hypothesis in a larger sample with a lower p-value is more efficient than subdividing the sample and replicating the study using a less stringent p-value.
I'm not a stats expert, and I'm aware that there's been considerable debate out there about p-values - especially regarding the recommendations of Benjamin et al (2018). I have previously sat on the fence as I've not felt confident about the pros and cons. But on the basis of this simulation, I'm warming to the idea of p < .005. I'd welcome comments and corrections.

*In his paper 'The reproducibility of research and the misinterpretation of p-values' (Royal Society Open Science, 4, 171085. doi:10.1098/rsos.171085), David Colquhoun (2017) discusses these issues and notes that we also need to consider the prior likelihood of the null hypothesis being true: something that is unknowable and can only be estimated on the basis of past experience and intuition.
**The proposal for adopting p < .005 as a more stringent statistical threshold for new discoveries can be found here: Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E. J., Berk, R., . . . Johnson, V. E. (2018). Redefine statistical significance. Nature Human Behaviour, 2(1), 6-10. doi:10.1038/s41562-017-0189-z


Postscript, 15th July 2018


This blogpost has generated a lot of discussion, mostly on Twitter. One point that particularly interested me was a comment that I hadn't done a fair comparison between the one-study and two-study situations, because the plot showed a one-off two-group study with alpha at .005, versus a replication study (half the sample size in each group) with alpha at .05. For a fair comparison, it was argued, I should equate the false positive probabilities in the two situations, i.e. the alpha for the one-off study should be .05 squared = .0025.

So I took a look at the fair comparison: Figure 2 shows the situation when comparing one study with alpha set to .0025 vs a split replication with alpha of .05. The intuition of many people on Twitter was that these should be identical, but they aren't. Why not? After all, the two analyses use the same information. (In fact, I modified the script so that this was literally true: the same sample was tested as a whole and then again split into two – previously I'd just resampled to get the smaller samples. This makes no difference – the single sample with the more extreme alpha still gives higher power.)
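A minimal sketch of this fair comparison, again in Python rather than the original R, analysing each simulated sample both ways (the sample size, effect size and seed in the usage below are illustrative, not the settings behind Figure 2):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def fair_comparison(n_per_group, d, n_sims=3000):
    """One sample per run, analysed two ways: a single one-tailed test
    on the full sample at alpha = .0025, vs requiring both half-sized
    studies to reach p < .05 (matched .05^2 = .0025 false positive rate)."""
    single = split = 0
    for _ in range(n_sims):
        a = rng.normal(0, 1, n_per_group)
        b = rng.normal(d, 1, n_per_group)
        p_all = stats.ttest_ind(b, a, alternative="greater").pvalue
        h = n_per_group // 2
        p1 = stats.ttest_ind(b[:h], a[:h], alternative="greater").pvalue
        p2 = stats.ttest_ind(b[h:], a[h:], alternative="greater").pvalue
        single += p_all < 0.0025
        split += (p1 < 0.05) and (p2 < 0.05)
    return single / n_sims, split / n_sims
```

Because each run re-analyses the very same data, any difference in the two power estimates reflects the analysis strategy, not sampling noise.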

Figure 2: Power for one-off study with alpha .0025 (dashed lines) vs. split replication with p < .05
To look at it another way, in one version of the simulation there were 1600 simulated experiments with a true effect (including all the simulated sample sizes and effect sizes). Of these, 581 were identified as 'significant' both by the one-off study with p < .0025 and by the split replication with p < .05. Only 5 were identified by the split replication alone, but 134 were identified by the one-off study alone.

I think I worked out why this is the case, though I’d appreciate having a proper statistical opinion. It seems to have to do with accuracy of estimating the standard deviation. If you have a split sample and you estimate the mean from each half (A and B), then the average of mean A and mean B will be the same as for the big sample of AB combined. But when it comes to estimating the standard deviation – which is a key statistic when computing group differences – the estimate is more accurate and precise with the large sample. This is because the standard deviation is computed by measuring the difference of each value from its own sample mean. Means for A and B will fluctuate due to sampling error, and this will make the estimated SDs less reliable. You can estimate the pooled standard deviation for two samples by taking the square root of the average of the variances. However, that value is less precise than the SD from the single large sample. I haven’t done a large number of runs, but a quick check suggests that whereas both the one-off study and the split replication give pooled estimates of the SD at around the true value of 1.0, the standard deviation of the standard deviation (we are getting very meta here!) is around .01 for the one-off study but .14 for the split replication. Again, I’m reporting results from across all the simulated trials, including the full range of sample sizes and effect sizes.
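Readers can check the precision of the two SD estimates with a sketch like this, assuming standard normal data and a single illustrative sample size; the exact spread will depend on the settings used, so the numbers here are not intended to match those quoted above:

```python
import numpy as np

rng = np.random.default_rng(0)

n, n_sims = 48, 10000
sd_full = np.empty(n_sims)
sd_pooled = np.empty(n_sims)
for i in range(n_sims):
    x = rng.normal(0, 1, n)                 # true SD = 1
    sd_full[i] = x.std(ddof=1)              # SD from the whole sample
    v1 = x[: n // 2].var(ddof=1)            # variances of the two halves,
    v2 = x[n // 2:].var(ddof=1)             # each about its own mean
    sd_pooled[i] = np.sqrt((v1 + v2) / 2)   # pooled SD from the halves

# Both estimators centre on 1.0; compare their spread across runs
print(sd_full.mean(), sd_pooled.mean())
print(sd_full.std(), sd_pooled.std())
```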

Figure 3: Distribution of estimates of pooled SD. The range is narrower for the one-off study (pink) than for the split replication studies (blue); purple shows the area of overlap of the distributions.

This has been an intriguing puzzle to investigate, but in the original post I hadn't really been intending to do this kind of comparison - my interest was rather in making the more elementary point that there's a very low probability of achieving a replication when sample size and effect size are both relatively small.

Returning to that issue, another commentator said that they'd have far more confidence in five small studies all showing the same effect than in one giant study. This is exactly the view I would have taken before I looked into this with simulations; but I now realise this idea has a serious flaw, which is that you're very unlikely to get those five replications, even if you are reasonably well powered, because – the tl;dr message implicit in this post – when we're talking about replications, we have to multiply the probabilities, and they rapidly get very low. So, if you look at the figure, suppose you have a moderate effect size, around .5: then you need a sample of 48 per group to get 80% power. But if you repeat the study five times, the chance of getting a positive result in all five cases is .8^5, which is .33. So most of the time you'd get a mixture of null and positive results. Even if you doubled the sample size to increase power to around .95, the chance of all five studies coming out positive is still only .95^5 (77%).
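The arithmetic of multiplied probabilities is easy to verify:

```python
# Chance that all k independent studies reach significance,
# if each one has the stated power
def all_significant(power, k=5):
    return power ** k

round(all_significant(0.80), 2)  # five studies at 80% power: 0.33
round(all_significant(0.95), 2)  # even at 95% power: 0.77
```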

Finally, another suggestion from Twitter is that a meta-analysis of several studies should give the same result as a single big sample. I’m afraid I have no expertise in meta-analysis, so I don’t know how well it handles the issue of more variable SD estimates in small samples, but I’d be interested to hear more from any readers who are up to speed with this.

Tuesday, 26 June 2018

Preprint publication as karaoke


 Doing research, analysing the results, and writing it up is a prolonged and difficult process. Submitting the paper to a journal is an anxious moment. Of course, you hope the editor and reviewers will love it and thank you for giving them the opportunity to read your compelling research. And of course, that never happens. More often you get comments from reviewers pointing out the various inadequacies of your grasp of the literature, your experimental design and your reasoning, leading to further angst as you consider how to reply. But worse than this is silence. You hear nothing. You enquire. You are told that the journal is still seeking reviewers. If you go through that loop a few times, you start to feel like the Jane Austen heroine who, having dressed up in her finery for the ball, spends the evening being ignored by all the men, while other, superficial and gaudy women are snapped up as dance partners.

There have been some downcast tweets in my timeline about papers getting stuck in this kind of journal limbo. When I suggested that it might help to post papers as preprints, several people asked how this worked, so I thought a short account might be useful.

To continue the analogy, a preprint server offers you a more modern world where you can try karaoke. You don't wait to be asked: you grab the microphone and do your thing. I now routinely post all my papers as preprints before submitting them to a journal. It gets the work out there, so even if journals are unduly slow, it can be read and you can get feedback on it.

So how does it work? Preprints are electronic articles that have not been peer-reviewed. I hope those who know more about the history will be able to comment on this, as I'm hazy on the details, but the idea started with physicists, to whom the thought of waiting around for an editorial process to complete seemed ridiculous. Physicists have been routinely posting their work on arXiv (pronounced 'archive') for years to ensure rapid evaluation and exchange of ideas. They do still publish in journals, which creates a formal version of record, but arXiv is what most of them read. The success of arXiv led to the development of bioRxiv, and then more recently PsyArXiv and SocArXiv. Some journals also host preprints - I have had good experiences with PeerJ, where you can deposit an article as a preprint, with the option of then updating it to a full submission to the journal if you wish*.

All of these platforms operate some basic quality control. For instance, the bioRxiv website states: 'all articles undergo a basic screening process for offensive and/or non-scientific content and for material that might pose a health or biosecurity risk and are checked for plagiarism'. However, once they have passed screening, articles are deposited immediately without further review.

Contrary to popular opinion, publishing a preprint does not usually conflict with journal policies. You can check the policy of the journal on the Sherpa/ROMEO database: most allow preprints prior to submission.

Sometimes concerns are expressed that if you post a preprint your work might be stolen by someone who'll then publish a journal article before you. In fact, it's quite the opposite. A preprint has a digital object identifier (DOI) and establishes your precedence, so guards against scooping. If you are in a fast-moving field where an evil reviewer will deliberately hold up your paper so they can get in ahead, pre-printing is the answer.

So when should you submit a preprint? I would normally recommend doing this a week or two before submitting to a journal, to allow for the possibility of incorporating feedback into the submitted manuscript, but, given that you will inevitably be asked for revisions by journal reviewers, if you post a preprint immediately before submission you will still have an opportunity to take on board other comments.

So what are the advantages of posting preprints?

1. The most obvious one is that people can access your work in a timely fashion. Preprints are freely available to all: a particularly welcome feature if you work in an area that has implications for clinical practice or policy, where practitioners may not have access to academic journals.

2. There have been cases where authors of a preprint have been invited to submit the work to a journal by an editor. This has never happened to me, but it's nice to know it's a possibility!

3. You can cite a preprint on a job application: it won't count as much as a peer-reviewed publication, but it does make it clear that the work is completed, and your evaluators can read it. This is preferable to just citing work as 'submitted'. Some funders now also allow preprints to be cited in grant applications: https://wellcome.ac.uk/news/we-now-accept-preprints-grant-applications

4. Psychologically, for the author, it can be good to have a sense that the work is 'out there'. You have at least some control over the dissemination of your research, whereas waiting for editors and reviewers is depressing because you just feel powerless.

5. You can draw attention to a preprint on social media and explicitly request feedback. This is particularly helpful if you don't have colleagues to hand who are willing to read your paper. If you put out a request on Twitter, it doesn't mean people will necessarily reply, but you could get useful suggestions for improvement and/or make contact with others interested in your field.

On this final point, it is worth noting that there are several reasons why papers linger in journal limbo: it does not necessarily mean that the journal administration or editor is incompetent (though that can happen!). The best of editors can have a hard job finding reviewers: it's not uncommon to have to invite ten reviewers to find two who agree to review. If your paper is in a niche area, it gets even harder. For these reasons it is crucial to make your title and abstract as clear and interesting as possible: these are the only parts of the paper that potential reviewers will see, and if you are getting a lot of refusals to review, it could be that your abstract is a turn-off. So asking for feedback on a preprint may help you rewrite it in a way that encourages more interest from reviewers.

*Readers: please feel free to add other suggestions while comments are open. (I close comments once the invasion of spammers starts - typically 3-4 weeks after posting).

Saturday, 23 June 2018

Bishopblog catalogue (updated 23 June 2018)

Source: http://www.weblogcartoons.com/2008/11/23/ideas/

Those of you who follow this blog may have noticed a lack of thematic coherence. I write about whatever is exercising my mind at the time, which can range from technical aspects of statistics to the design of bathroom taps. I decided it might be helpful to introduce a bit of order into this chaotic melange, so here is a catalogue of posts by topic.

Language impairment, dyslexia and related disorders
The common childhood disorders that have been left out in the cold (1 Dec 2010) What's in a name? (18 Dec 2010) Neuroprognosis in dyslexia (22 Dec 2010) Where commercial and clinical interests collide: Auditory processing disorder (6 Mar 2011) Auditory processing disorder (30 Mar 2011) Special educational needs: will they be met by the Green paper proposals? (9 Apr 2011) Is poor parenting really to blame for children's school problems? (3 Jun 2011) Early intervention: what's not to like? (1 Sep 2011) Lies, damned lies and spin (15 Oct 2011) A message to the world (31 Oct 2011) Vitamins, genes and language (13 Nov 2011) Neuroscientific interventions for dyslexia: red flags (24 Feb 2012) Phonics screening: sense and sensibility (3 Apr 2012) What Chomsky doesn't get about child language (3 Sept 2012) Data from the phonics screen (1 Oct 2012) Auditory processing disorder: schisms and skirmishes (27 Oct 2012) High-impact journals (Action video games and dyslexia: critique) (10 Mar 2013) Overhyped genetic findings: the case of dyslexia (16 Jun 2013) The arcuate fasciculus and word learning (11 Aug 2013) Changing children's brains (17 Aug 2013) Raising awareness of language learning impairments (26 Sep 2013) Good and bad news on the phonics screen (5 Oct 2013) What is educational neuroscience? (25 Jan 2014) Parent talk and child language (17 Feb 2014) My thoughts on the dyslexia debate (20 Mar 2014) Labels for unexplained language difficulties in children (23 Aug 2014) International reading comparisons: is England really doing so poorly? (14 Sep 2014) Our early assessments of schoolchildren are misleading and damaging (4 May 2015) Opportunity cost: a new red flag for evaluating interventions (30 Aug 2015) The STEP Physical Literacy programme: have we been here before? (2 Jul 2017) Prisons, developmental language disorder, and base rates (3 Nov 2017) Reproducibility and phonics: necessary but not sufficient (27 Nov 2017) Developmental language disorder: the need for a clinically relevant definition (9 Jun 2018)

Autism
Autism diagnosis in cultural context (16 May 2011) Are our ‘gold standard’ autism diagnostic instruments fit for purpose? (30 May 2011) How common is autism? (7 Jun 2011) Autism and hypersystematising parents (21 Jun 2011) An open letter to Baroness Susan Greenfield (4 Aug 2011) Susan Greenfield and autistic spectrum disorder: was she misrepresented? (12 Aug 2011) Psychoanalytic treatment for autism: Interviews with French analysts (23 Jan 2012) The ‘autism epidemic’ and diagnostic substitution (4 Jun 2012) How wishful thinking is damaging Peta's cause (9 June 2014)

Developmental disorders/paediatrics
The hidden cost of neglected tropical diseases (25 Nov 2010) The National Children's Study: a view from across the pond (25 Jun 2011) The kids are all right in daycare (14 Sep 2011) Moderate drinking in pregnancy: toxic or benign? (21 Nov 2012) Changing the landscape of psychiatric research (11 May 2014)

Genetics
Where does the myth of a gene for things like intelligence come from? (9 Sep 2010) Genes for optimism, dyslexia and obesity and other mythical beasts (10 Sep 2010) The X and Y of sex differences (11 May 2011) Review of How Genes Influence Behaviour (5 Jun 2011) Getting genetic effect sizes in perspective (20 Apr 2012) Moderate drinking in pregnancy: toxic or benign? (21 Nov 2012) Genes, brains and lateralisation (22 Dec 2012) Genetic variation and neuroimaging (11 Jan 2013) Have we become slower and dumber? (15 May 2013) Overhyped genetic findings: the case of dyslexia (16 Jun 2013) Incomprehensibility of much neurogenetics research ( 1 Oct 2016) A common misunderstanding of natural selection (8 Jan 2017) Sample selection in genetic studies: impact of restricted range (23 Apr 2017) Pre-registration or replication: the need for new standards in neurogenetic studies (1 Oct 2017)

Neuroscience
Neuroprognosis in dyslexia (22 Dec 2010) Brain scans show that… (11 Jun 2011)  Time for neuroimaging (and PNAS) to clean up its act (5 Mar 2012) Neuronal migration in language learning impairments (2 May 2012) Sharing of MRI datasets (6 May 2012) Genetic variation and neuroimaging (1 Jan 2013) The arcuate fasciculus and word learning (11 Aug 2013) Changing children's brains (17 Aug 2013) What is educational neuroscience? ( 25 Jan 2014) Changing the landscape of psychiatric research (11 May 2014) Incomprehensibility of much neurogenetics research ( 1 Oct 2016)

Reproducibility
Accentuate the negative (26 Oct 2011) Novelty, interest and replicability (19 Jan 2012) High-impact journals: where newsworthiness trumps methodology (10 Mar 2013) Who's afraid of open data? (15 Nov 2015) Blogging as post-publication peer review (21 Mar 2013) Research fraud: More scrutiny by administrators is not the answer (17 Jun 2013) Pressures against cumulative research (9 Jan 2014) Why does so much research go unpublished? (12 Jan 2014) Replication and reputation: Whose career matters? (29 Aug 2014) Open code: not just data and publications (6 Dec 2015) Why researchers need to understand poker (26 Jan 2016) Reproducibility crisis in psychology (5 Mar 2016) Further benefit of registered reports (22 Mar 2016) Would paying by results improve reproducibility? (7 May 2016) Serendipitous findings in psychology (29 May 2016) Thoughts on the Statcheck project (3 Sep 2016) When is a replication not a replication? (16 Dec 2016) Reproducible practices are the future for early career researchers (1 May 2017) Which neuroimaging measures are useful for individual differences research? (28 May 2017) Prospecting for kryptonite: the value of null results (17 Jun 2017) Pre-registration or replication: the need for new standards in neurogenetic studies (1 Oct 2017) Citing the research literature: the distorting lens of memory (17 Oct 2017) Reproducibility and phonics: necessary but not sufficient (27 Nov 2017) Improving reproducibility: the future is with the young (9 Feb 2018) Sowing seeds of doubt: how Gilbert et al's critique of the reproducibility project has played out (27 May 2018)

Statistics
Book review: biography of Richard Doll (5 Jun 2010) Book review: the Invisible Gorilla (30 Jun 2010) The difference between p < .05 and a screening test (23 Jul 2010) Three ways to improve cognitive test scores without intervention (14 Aug 2010) A short nerdy post about the use of percentiles (13 Apr 2011) The joys of inventing data (5 Oct 2011) Getting genetic effect sizes in perspective (20 Apr 2012) Causal models of developmental disorders: the perils of correlational data (24 Jun 2012) Data from the phonics screen (1 Oct 2012) Moderate drinking in pregnancy: toxic or benign? (21 Nov 2012) Flaky chocolate and the New England Journal of Medicine (13 Nov 2012) Interpreting unexpected significant results (7 June 2013) Data analysis: Ten tips I wish I'd known earlier (18 Apr 2014) Data sharing: exciting but scary (26 May 2014) Percentages, quasi-statistics and bad arguments (21 July 2014) Why I still use Excel (1 Sep 2016) Sample selection in genetic studies: impact of restricted range (23 Apr 2017) Prospecting for kryptonite: the value of null results (17 Jun 2017) Prisons, developmental language disorder, and base rates (3 Nov 2017) How Analysis of Variance Works (20 Nov 2017) ANOVA, t-tests and regression: different ways of showing the same thing (24 Nov 2017) Using simulations to understand the importance of sample size (21 Dec 2017) Using simulations to understand p-values (26 Dec 2017)

Journalism/science communication
Orwellian prize for scientific misrepresentation (1 Jun 2010) Journalists and the 'scientific breakthrough' (13 Jun 2010) Science journal editors: a taxonomy (28 Sep 2010) Orwellian prize for journalistic misrepresentation: an update (29 Jan 2011) Academic publishing: why isn't psychology like physics? (26 Feb 2011) Scientific communication: the Comment option (25 May 2011)  Publishers, psychological tests and greed (30 Dec 2011) Time for academics to withdraw free labour (7 Jan 2012) 2011 Orwellian Prize for Journalistic Misrepresentation (29 Jan 2012) Time for neuroimaging (and PNAS) to clean up its act (5 Mar 2012) Communicating science in the age of the internet (13 Jul 2012) How to bury your academic writing (26 Aug 2012) High-impact journals: where newsworthiness trumps methodology (10 Mar 2013)  A short rant about numbered journal references (5 Apr 2013) Schizophrenia and child abuse in the media (26 May 2013) Why we need pre-registration (6 Jul 2013) On the need for responsible reporting of research (10 Oct 2013) A New Year's letter to academic publishers (4 Jan 2014) Journals without editors: What is going on? (1 Feb 2015) Editors behaving badly? (24 Feb 2015) Will Elsevier say sorry? (21 Mar 2015) How long does a scientific paper need to be? (20 Apr 2015) Will traditional science journals disappear? (17 May 2015) My collapse of confidence in Frontiers journals (7 Jun 2015) Publishing replication failures (11 Jul 2015) Psychology research: hopeless case or pioneering field? (28 Aug 2015) Desperate marketing from J. Neuroscience ( 18 Feb 2016) Editorial integrity: publishers on the front line ( 11 Jun 2016) When scientific communication is a one-way street (13 Dec 2016) Breaking the ice with buxom grapefruits: Pratiques de publication and predatory publishing (25 Jul 2017)

Social Media
A gentle introduction to Twitter for the apprehensive academic (14 Jun 2011) Your Twitter Profile: The Importance of Not Being Earnest (19 Nov 2011) Will I still be tweeting in 2013? (2 Jan 2012) Blogging in the service of science (10 Mar 2012) Blogging as post-publication peer review (21 Mar 2013) The impact of blogging on reputation ( 27 Dec 2013) WeSpeechies: A meeting point on Twitter (12 Apr 2014) Email overload ( 12 Apr 2016) How to survive on Twitter - a simple rule to reduce stress (13 May 2018)

Academic life
An exciting day in the life of a scientist (24 Jun 2010) How our current reward structures have distorted and damaged science (6 Aug 2010) The challenge for science: speech by Colin Blakemore (14 Oct 2010) When ethics regulations have unethical consequences (14 Dec 2010) A day working from home (23 Dec 2010) Should we ration research grant applications? (8 Jan 2011) The one hour lecture (11 Mar 2011) The expansion of research regulators (20 Mar 2011) Should we ever fight lies with lies? (19 Jun 2011) How to survive in psychological research (13 Jul 2011) So you want to be a research assistant? (25 Aug 2011) NHS research ethics procedures: a modern-day Circumlocution Office (18 Dec 2011) The REF: a monster that sucks time and money from academic institutions (20 Mar 2012) The ultimate email auto-response (12 Apr 2012) Well, this should be easy…. (21 May 2012) Journal impact factors and REF2014 (19 Jan 2013) An alternative to REF2014 (26 Jan 2013) Postgraduate education: time for a rethink (9 Feb 2013) Ten things that can sink a grant proposal (19 Mar 2013) Blogging as post-publication peer review (21 Mar 2013) The academic backlog (9 May 2013) Discussion meeting vs conference: in praise of slower science (21 Jun 2013) Why we need pre-registration (6 Jul 2013) Evaluate, evaluate, evaluate (12 Sep 2013) High time to revise the PhD thesis format (9 Oct 2013) The Matthew effect and REF2014 (15 Oct 2013) The University as big business: the case of King's College London (18 June 2014) Should vice-chancellors earn more than the prime minister? (12 July 2014) Some thoughts on use of metrics in university research assessment (12 Oct 2014) Tuition fees must be high on the agenda before the next election (22 Oct 2014) Blaming universities for our nation's woes (24 Oct 2014) Staff satisfaction is as important as student satisfaction (13 Nov 2014) Metricophobia among academics (28 Nov 2014) Why evaluating scientists by grant income is stupid (8 Dec 2014) Dividing up the pie in relation to REF2014 (18 Dec 2014) Shaky foundations of the TEF (7 Dec 2015) A lamentable performance by Jo Johnson (12 Dec 2015) More misrepresentation in the Green Paper (17 Dec 2015) The Green Paper's level playing field risks becoming a morass (24 Dec 2015) NSS and teaching excellence: wrong measure, wrongly analysed (4 Jan 2016) Lack of clarity of purpose in REF and TEF (2 Mar 2016) Who wants the TEF? (24 May 2016) Cost benefit analysis of the TEF (17 Jul 2016) Alternative providers and alternative medicine (6 Aug 2016) We know what's best for you: politicians vs. experts (17 Feb 2017) Advice for early career researchers re job applications: Work 'in preparation' (5 Mar 2017) Should research funding be allocated at random? (7 Apr 2018) Power, responsibility and role models in academia (3 May 2018) My response to the EPA's 'Strengthening Transparency in Regulatory Science' (9 May 2018)

Celebrity scientists/quackery
Three ways to improve cognitive test scores without intervention (14 Aug 2010) What does it take to become a Fellow of the RSM? (24 Jul 2011) An open letter to Baroness Susan Greenfield (4 Aug 2011) Susan Greenfield and autistic spectrum disorder: was she misrepresented? (12 Aug 2011) How to become a celebrity scientific expert (12 Sep 2011) The kids are all right in daycare (14 Sep 2011)  The weird world of US ethics regulation (25 Nov 2011) Pioneering treatment or quackery? How to decide (4 Dec 2011) Psychoanalytic treatment for autism: Interviews with French analysts (23 Jan 2012) Neuroscientific interventions for dyslexia: red flags (24 Feb 2012) Why most scientists don't take Susan Greenfield seriously (26 Sept 2014)

Women
Academic mobbing in cyberspace (30 May 2010) What works for women: some useful links (12 Jan 2011) The burqua ban: what's a liberal response (21 Apr 2011) C'mon sisters! Speak out! (28 Mar 2012) Psychology: where are all the men? (5 Nov 2012) Should Rennard be reinstated? (1 June 2014) How the media spun the Tim Hunt story (24 Jun 2015)

Politics and Religion
Lies, damned lies and spin (15 Oct 2011) A letter to Nick Clegg from an ex liberal democrat (11 Mar 2012) BBC's 'extensive coverage' of the NHS bill (9 Apr 2012) Schoolgirls' health put at risk by Catholic view on vaccination (30 Jun 2012) A letter to Boris Johnson (30 Nov 2013) How the government spins a crisis (floods) (1 Jan 2014) The alt-right guide to fielding conference questions (18 Feb 2017) We know what's best for you: politicians vs. experts (17 Feb 2017) Barely a good word for Donald Trump in Houses of Parliament (23 Feb 2017) Do you really want another referendum? Be careful what you wish for (12 Jan 2018) My response to the EPA's 'Strengthening Transparency in Regulatory Science' (9 May 2018)

Humour and miscellaneous Orwellian prize for scientific misrepresentation (1 Jun 2010) An exciting day in the life of a scientist (24 Jun 2010) Science journal editors: a taxonomy (28 Sep 2010) Parasites, pangolins and peer review (26 Nov 2010) A day working from home (23 Dec 2010) The one hour lecture (11 Mar 2011) The expansion of research regulators (20 Mar 2011) Scientific communication: the Comment option (25 May 2011) How to survive in psychological research (13 Jul 2011) Your Twitter Profile: The Importance of Not Being Earnest (19 Nov 2011) 2011 Orwellian Prize for Journalistic Misrepresentation (29 Jan 2012) The ultimate email auto-response (12 Apr 2012) Well, this should be easy…. (21 May 2012) The bewildering bathroom challenge (19 Jul 2012) Are Starbucks hiding their profits on the planet Vulcan? (15 Nov 2012) Forget the Tower of Hanoi (11 Apr 2013) How do you communicate with a communications company? ( 30 Mar 2014) Noah: A film review from 32,000 ft (28 July 2014) The rationalist spa (11 Sep 2015) Talking about tax: weasel words ( 19 Apr 2016) Controversial statues: remove or revise? (22 Dec 2016) The alt-right guide to fielding conference questions (18 Feb 2017) My most popular posts of 2016 (2 Jan 2017)

Saturday, 9 June 2018

Developmental language disorder: the need for a clinically relevant definition

There's been debate at a meeting (SRCLD) in the USA over the new terminology for Developmental Language Disorder (DLD). I haven't caught all the nuance, but I feel I should make a quick comment on one issue I was specifically asked about, viz. the following tweet:

[Embedded tweet, not reproduced here, reporting Rice's argument for retaining the term SLI]
As background: the field of children's language disorders has been a terminological minefield. The term Specific Language Impairment (SLI) came into wide use in the 1980s as a diagnosis for children who had problems acquiring language for no apparent reason. One criterion for the diagnosis was that the child's language problems should be out of line with other aspects of development, and hence 'specific'; in practice this was interpreted as requiring a nonverbal IQ in the normal range.

The term SLI was never adopted by either of the two main diagnostic systems, the WHO's International Classification of Diseases (ICD) or the American Psychiatric Association's Diagnostic and Statistical Manual (DSM), but the notion that IQ should play a part in the diagnosis became prevalent.

In 2016-17 I headed up the CATALISE project, with the specific goal of achieving some consensus on the diagnostic criteria and terminology for children's language disorders: the published papers are openly available for all to read (see below). The consensus of a group of experts from a range of professions and countries was to reject SLI in favour of the term DLD.

Any child who meets criteria for SLI will meet criteria for DLD: the main difference is that the use of an IQ cutoff is no longer part of the definition. This does not mean that all children with language difficulties are regarded as having DLD: those who meet criteria for intellectual disability, known syndromes or biomedical conditions are treated separately (see these slides for summary).
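To make the set relationship concrete, here is a minimal sketch (hypothetical Python, not the official CATALISE decision flow; the IQ cutoff of 85 is an assumed conventional value): every child meeting the SLI criteria necessarily meets the DLD criteria, because SLI simply adds an IQ requirement on top of them.

```python
# Hypothetical sketch of the set relationship between SLI and DLD.
# Not the official CATALISE flowchart; the cutoff of 85 is an assumption.

def meets_dld(language_impaired: bool, biomedical_condition: bool) -> bool:
    # DLD: persistent language problems, not accounted for by an
    # associated biomedical condition (intellectual disability,
    # known syndrome, etc.), which is treated separately
    return language_impaired and not biomedical_condition

def meets_sli(language_impaired: bool, biomedical_condition: bool,
              nonverbal_iq: int, cutoff: int = 85) -> bool:
    # SLI adds a normal-range nonverbal IQ requirement to the DLD criteria
    return (meets_dld(language_impaired, biomedical_condition)
            and nonverbal_iq >= cutoff)

# A language-impaired child with nonverbal IQ of 80 meets DLD criteria
# but would have been excluded from an SLI diagnosis
print(meets_dld(True, False), meets_sli(True, False, 80))
```

The point of the sketch is simply that `meets_sli` can never be true when `meets_dld` is false: dropping the IQ clause widens, and never narrows, the group covered.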

The tweet seems to suggest we should retain the term SLI, with its IQ cutoff, because it allows us to do neatly controlled research studies. I realise a brief, second-hand tweet about Rice's views may not be a fair portrayal of what she said, but it does emphasise a bone of contention that was thoroughly gnawed in the discussions of the CATALISE panel, namely, what is the purpose of diagnostic terminology? I would argue its primary purpose is clinical, and clinical considerations are not well-served by research criteria.

The traditional approach to selecting groups for research is to find 'pure' cases. Quite simply, if you include children who have other problems beyond language (including other neurodevelopmental difficulties), it becomes much harder to know whether you are assessing correlates or causes of language problems: things get messy and associations become hard to interpret. The importance of controlling for nonverbal IQ has been particularly emphasised over many years. If you compare language-impaired and comparison (typically-developing, or TD) children on a language or cognitive measure, and the language-impaired group has lower nonverbal ability, then you may be looking at a correlate of nonverbal ability rather than of language. Restricting consideration to those who meet stringent IQ criteria, so as to equalise the groups, is one way of addressing the issue.
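The logic of that confound can be illustrated with a small simulation (a hypothetical Python sketch with made-up effect sizes, not taken from any real study): a measure that depends only on nonverbal ability, and not at all on language status, will still show a group difference if the language-impaired group happens to have lower nonverbal IQ.

```python
import random
random.seed(1)

def simulate_child(nviq_mean):
    # Nonverbal IQ as a z-score for this child
    nviq = random.gauss(nviq_mean, 1.0)
    # A cognitive measure that depends on nonverbal ability
    # but NOT on language status (weight of 0.5 is arbitrary)
    return 0.5 * nviq + random.gauss(0, 1.0)

n = 5000
# Language-impaired group sampled with lower average nonverbal IQ
# (a mean of -0.7 z-units is an assumed value for illustration)
li = [simulate_child(-0.7) for _ in range(n)]
# Typically-developing comparison group
td = [simulate_child(0.0) for _ in range(n)]

diff = sum(td) / n - sum(li) / n
print(f"TD - LI difference on a language-unrelated measure: {diff:.2f}")
```

The groups differ reliably on the measure even though language status plays no causal role in it: the difference is carried entirely by the nonverbal IQ gap, which is exactly why matched-IQ designs were thought necessary.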

However, there are three big problems with this approach:

1. A child's nonverbal IQ can vary from time to time, and it will depend on the test that is used. This is problematic, but it is not the main reason for dropping IQ cutoffs; the strongest arguments concern the validity rather than the reliability of an IQ-based approach.

2. The use of IQ cutoffs ignores the fact that pure cases of language impairment are the exception rather than the rule. In CATALISE we looked at the evidence and concluded that if a diagnosis of DLD required having no developmental problems beyond language, we would exclude many children with language problems (see also this old blogpost). If our main purpose is a diagnostic system that is clinically workable, it should be applicable to the children who turn up in our clinics, not just a rarefied few who meet research criteria. An analogy can be drawn with medicine: imagine if your doctor diagnosed you with high blood pressure but refused to treat you unless you were in every other regard fit and healthy. That would seem both unfair and ill-judged. Presence of co-occurring conditions might be important for tracking down underlying causes and determining a treatment path, but it is not a reason for excluding someone from receiving services.

3. Even for research purposes, it is not clear that a focus on highly specific disorders makes sense. An underlying assumption, one I started out with myself, was that the specific cases were in some important sense different from those who had additional problems. Yet, as noted in the CATALISE papers, the evidence for this assumption is missing: nonverbal IQ has very little bearing on a child's clinical profile, response to intervention, or aetiology. For me, what really knocked my belief in the reality of SLI as a category was doing twin studies: typically, I'd find that identical twins were very similar in their language abilities, but they sometimes differed in nonverbal ability, to the extent that one met criteria for SLI and the other did not. Researchers who treat SLI as a distinct category risk doing research that has no application to the real world.
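The twin observation can also be illustrated numerically. Here is a hypothetical Python sketch (the twin correlation of 0.8, the mean nonverbal IQ of 95 and the cutoff of 85 are all assumed values for illustration, not estimates from my twin data): even with highly correlated nonverbal scores, a substantial proportion of identical twin pairs fall on opposite sides of an IQ cutoff, so one twin qualifies for 'SLI' and the other does not.

```python
import random
random.seed(42)

CUTOFF = 85       # conventional nonverbal-IQ criterion for SLI (assumed)
R = 0.8           # assumed twin correlation for nonverbal IQ
MEAN, SD = 95, 15 # assumed NVIQ distribution in language-impaired twins

def twin_pair():
    # Shared and unique components give a correlation of R between twins
    shared = random.gauss(0, R ** 0.5)
    t1 = MEAN + SD * (shared + random.gauss(0, (1 - R) ** 0.5))
    t2 = MEAN + SD * (shared + random.gauss(0, (1 - R) ** 0.5))
    return t1, t2

n = 20000
# Count pairs where exactly one twin falls below the IQ cutoff
discordant = sum(1 for t1, t2 in (twin_pair() for _ in range(n))
                 if (t1 >= CUTOFF) != (t2 >= CUTOFF))
rate = discordant / n
print(f"Twin pairs discordant for the IQ criterion: {rate:.0%}")
```

Under these (made-up but plausible) parameters, roughly a fifth of pairs are discordant for the criterion, despite the twins being genetically identical: a cutoff applied to a continuous, imperfectly correlated score inevitably splits similar children into different categories.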

There is nothing to stop researchers focusing on 'pure' cases of language disorder to answer research questions of theoretical interest, such as questions about the modularity of language. This kind of research uses children with a language disorder as a kind of 'natural experiment' that may inform our understanding of broader issues. It is, however, important not to confuse such research with work whose goal is to discover clinically relevant information.

If practitioners let the theoretical interests of researchers dictate their diagnostic criteria, then they are doing a huge disservice to the many children who end up in a no-man's-land, without either diagnosis or access to intervention. 

References

Bishop, D. V. M. (2017). Why is it so hard to reach agreement on terminology? The case of developmental language disorder (DLD). International Journal of Language & Communication Disorders, 52(6), 671-680. doi:10.1111/1460-6984.12335

Bishop, D. V. M., Snowling, M. J., Thompson, P. A., Greenhalgh, T., & CATALISE Consortium. (2016). CATALISE: a multinational and multidisciplinary Delphi consensus study. Identifying language impairments in children. PLOS ONE, 11(7), e0158753. doi:10.1371/journal.pone.0158753

Bishop, D. V. M., Snowling, M. J., Thompson, P. A., Greenhalgh, T., & CATALISE Consortium. (2017). Phase 2 of CATALISE: a multinational and multidisciplinary Delphi consensus study of problems with language development: Terminology. Journal of Child Psychology and Psychiatry, 58(10), 1068-1080. doi:10.1111/jcpp.12721

Sunday, 27 May 2018

Sowing seeds of doubt: how Gilbert et al.'s critique of the reproducibility project has played out



In Merchants of Doubt, Erik Conway and Naomi Oreskes describe how raising doubt can be used as an effective weapon against inconvenient science. On topics such as the effects of tobacco on health, climate change and the causes of acid rain, it has been possible to delay or curb action simply by emphasising the lack of scientific consensus. This is always an option, because science is characterised by uncertainty; indeed, we move forward by challenging one another's findings, and only a dead science would have no disagreements. But those raising concerns wield a two-edged sword: spurious and discredited criticisms can disrupt scientific progress, especially if the arguments are complex and technical, because people will be left with a sense that they cannot trust the findings, even if they don't fully understand the matters under debate.

The parallels with Merchants of Doubt occurred to me as I re-read the critique by Gilbert et al of the classic paper by the Open Science Collaboration (OSC) on ‘Estimating the reproducibility of psychological science’. I was prompted to do so because we were discussing the OSC paper in a journal club*, and inevitably the question arose as to whether we needed to worry about reproducibility, in the light of the remarkable claim by Gilbert et al: ‘We show that OSC's article contains three major statistical errors and, when corrected, provides no evidence of a replication crisis. Indeed, the evidence is also consistent with the opposite conclusion -- that the reproducibility of psychological science is quite high and, in fact, statistically indistinguishable from 100%.’

The Gilbert et al critique has, in turn, been the subject of considerable criticism, as well as a response by a subset of the OSC group. I summarise the main points of contention in Table 1: at times they seem to be making a defeatist argument that we don’t need to worry because replication in psychology is bound to be poor: something I have disputed.

But my main focus in this post is simply to consider the impact of the critique on the reproducibility debate by looking at citations of the original article and the critique. A quick check on Web of Science found 797 citations of the OSC paper, 67 citations of Gilbert et al, and 33 citations of the response by Anderson et al.

The next thing I did, admittedly in a very informal fashion, was to download the details of the articles citing Gilbert et al and code them according to content: supporting Gilbert et al's view, rejecting the criticism, or neutral. I found I needed a fourth category for papers where the citation seemed wrong or too vague to classify. I discarded any papers where the relevant information could not be readily accessed: I can access most journals via Oxford University, but a few were behind paywalls, others were not in English, or did not in fact cite Gilbert et al. This left 44 citing papers that focused on the commentary on the OSC study. Nine of these were supportive of Gilbert et al, two noted problems with their analysis, and 33 were categorised as ‘neutral’, because the citation read something like this:

“Because of the current replicability crisis in psychological science (e.g., Open Science Collaboration, 2015; but see Gilbert, King, Pettigrew, & Wilson, 2016)…”

The strong impression was that the authors of these papers lacked either the appetite or the ability to engage with the detailed arguments in the critique, but had a sense that there was a debate and felt they should flag this up. That's when I started to think about Merchants of Doubt: whether intentionally or not, Gilbert et al had created an atmosphere of uncertainty, suggesting there is no consensus on whether psychology has a reproducibility problem. People are left thinking that it's all very complicated and depends on arguments that are only of interest to statisticians, and this makes it easier for those who are reluctant to take action to deal with the issue.
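For what it's worth, the informal tally described above works out as follows (a trivial Python calculation using the counts reported in this post):

```python
# Counts from my informal coding of the 44 classifiable papers
# citing Gilbert et al
counts = {"supportive": 9, "critical": 2, "neutral": 33}
total = sum(counts.values())

for label, n in counts.items():
    print(f"{label:>10}: {n:2d} ({n / total:.0%})")
```

Three-quarters of the classifiable citations, in other words, simply gesture at the existence of a debate without taking a position on it.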

Fortunately, it looks as if Gilbert et al’s critique has been less successful than might have been expected, given the eminence of the authors. This may in part be because the arguments in favour of change are founded not just on demonstrations such as the OSC project, but also on logical analyses of statistical practices and publication biases that have been known about for years (see slides 15-20 of my presentation here). Furthermore, as evidenced in the footnotes to Table 1, social media allows a rapid evaluation of claims and counter-claims that hitherto was not possible when debate was restricted to and controlled by journals. The publication this week of three more big replication studies  just heaps on further empirical evidence that we have a problem that needs addressing. Those who are saying ‘nothing to see here, move along’ cannot retain any credibility.

Table 1

Criticism 1: ‘many of OSC’s replication studies drew their samples from different populations than the original studies did’
Rejoinders:
- ‘Many’ implies the majority; no attempt to quantify, only examples given
- Gilbert et al did not show that this feature affected the replication rate

Criticism 2: ‘many of OSC’s replication studies used procedures that differed from the original study’s procedures in substantial ways.’
Rejoinders:
- ‘Many’ implies the majority; no attempt to quantify, only examples given
- OSC showed that this did not affect the replication rate
- The most striking example used by Gilbert et al is given a detailed explanation by Nosek (1)

Criticism 3: ‘How many of their replication studies should we expect to have failed by chance alone? Making this estimate requires having data from multiple replications of the same original study.’ Gilbert et al used data from pairwise comparisons of studies in the Many Labs project to argue that a low rate of agreement is to be expected.
Rejoinders:
- Ignores the impact of publication bias on the original studies (2, 3)
- Gilbert et al misinterpret confidence intervals (3, 4)
- Gilbert et al fail to take sample size/power into account, though this is a crucial determinant of the confidence interval (3, 4)
- ‘Gilbert et al.’s focus on the CI measure of reproducibility neither addresses nor can account for the facts that the OSC2015 replication effect sizes were about half the size of the original studies on average, and 83% of replications elicited smaller effect sizes than the original studies.’ (2)

Criticism 4: Results depended on whether the original authors endorsed the protocol for the replication: ‘This strongly suggests that the infidelities did not just introduce random error but instead biased the replication studies toward failure.’
Rejoinders:
- Use of the term ‘the infidelities’ assumes the only reason for lack of endorsement was departure from the original protocol (2)
- Lack of endorsement included non-response from the original authors (3)


References
Anderson, C. J., Bahnik, S., Barnett-Cowan, M., et al. (2016). Response to Comment on "Estimating the reproducibility of psychological science". Science, 351(6277).
Gilbert, D. T., King, G., Pettigrew, S., & Wilson, T. D. (2016). Comment on "Estimating the reproducibility of psychological science". Science, 351(6277).
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. doi:10.1126/science.aac4716


*Thanks to the enthusiastic efforts of some of our grad students, and the support of Reproducible Research Oxford, we've had a series of ReproducibiliTea journal clubs in our department this term. I can recommend this as a great (and relatively cheap and easy) way of raising awareness of issues around reproducibility in a department: something that is sorely needed if a recent Twitter survey by Dan Lakens is anything to go by.