Sunday 10 March 2013

High-impact journals: where newsworthiness trumps methodology

Here’s a paradox: most scientists would give their eye teeth to get a paper into a high-impact journal such as Nature, Science, or Proceedings of the National Academy of Sciences. Yet these journals have had a bad press lately, with claims that the papers they publish are more likely to be retracted than papers in journals with more moderate impact factors (Brembs & Munafò, 2013). It’s been suggested that this is because high-impact journals treat newsworthiness as an important criterion for accepting a paper. Newsworthiness is high when a finding is both of general interest and surprising, but surprising findings have a nasty habit of being wrong.

A new slant on this topic comes from Tressoldi et al (2013), who compared the statistical standards of papers in high-impact journals with those of three respectable but lower-impact journals. It’s often assumed that high-impact journals have a very high rejection rate because they adopt particularly rigorous standards, but this appears not to be the case. Tressoldi et al focused specifically on whether papers reported effect sizes, confidence intervals, power analysis or model-fitting. Medical journals fared much better than the others, but Science and Nature did poorly on these criteria. Certainly my own experience squares with the conclusions of Tressoldi et al (2013), as I described in the course of discussion of an earlier blogpost.

Last week a paper appeared in Current Biology (impact factor = 9.65) with the confident title: “Action video games make dyslexic children read better.” It's a classic example of a paper that is on the one hand highly newsworthy, but on the other, methodologically weak. I’m not usually a betting person, but I’d be prepared to put money on the main effect failing to replicate if the study were repeated with improved methodology. In saying this, I’m not suggesting that the authors are in any way dishonest. I have no doubt that they got the results they reported and that they genuinely believe they have discovered an important intervention for dyslexia. Furthermore, I’d be absolutely delighted to be proved wrong: There could be no better news for children with dyslexia than to find that they can overcome their difficulties by playing enjoyable computer games rather than slogging away with books. But there are good reasons to believe this is unlikely to be the case.

An interesting way to evaluate any study is to read just the Introduction and Methods, without looking at Results and Discussion. This allows you to judge whether the authors have identified an interesting question and adopted an appropriate methodology to evaluate it, without being swayed by the sexiness of the results. For the Current Biology paper, it’s not so easy to do this, because the Methods section has to be downloaded separately as Supplementary Material. (This in itself speaks volumes about the attitude of Current Biology editors to the papers they publish: Methods are seen as much less important than Results). On the basis of just Introduction and Methods, we can ask whether the paper would be publishable in a reputable journal regardless of the outcome of the study.

By that criterion, I would argue that the Current Biology paper is problematic purely on grounds of sample size. There were 10 Italian children aged 7 to 13 years in each of two groups: one group played ‘action’ computer games and the other was a control group playing non-action games (all taken from the Wii game Rayman Raving Rabbids - see here for examples). Children were trained in nine 80-minute sessions spread over two weeks. Unfortunately, the study was seriously underpowered: in plain language, with a sample this small, even a sizeable effect of intervention would be hard to detect. Most interventions for dyslexia have small-to-moderate effects, i.e. they improve performance in the treated group by .2 to .5 standard deviations. With 10 children per group, the power is less than .2, i.e. there’s a less than one in five chance of detecting a true effect of this magnitude. In clinical trials, it is generally recommended that the sample size be set to achieve power of around .8. With a total sample of 20 children, this is only possible if the true effect of intervention is enormous – around 1.2 SD, meaning there would be little overlap between the two groups’ reading scores after intervention. Before doing this study there was no reason to anticipate such a massive effect of this intervention, so the use of only 10 participants per group was inadequate. Indeed, in the context of clinical trials, such a study would be rejected by many ethics committees (IRBs), because it would be deemed unethical to recruit participants for a study with such a small chance of detecting a true effect (Halpern, Karlawish, & Berlin, 2002).
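
For anyone who wants to check these figures, here is a minimal power calculation in Python (using statsmodels), assuming a simple comparison of two independent groups with a t-test at α = .05. The exact numbers shift a little depending on whether a one- or two-tailed test is assumed, so treat this as a sketch rather than a reanalysis of the paper.

```python
# Sketch of the power calculation for two groups of 10, assuming an
# independent-samples t-test at alpha = .05 (illustrative only).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Power to detect small-to-moderate effects (d = .2 to .5) with n = 10 per group
for d in (0.2, 0.5):
    power = analysis.power(effect_size=d, nobs1=10, alpha=0.05,
                           alternative='two-sided')
    print(f"d = {d}: power = {power:.2f}")     # roughly .07 and .19

# Smallest true effect detectable with power = .8 and 10 children per group
for alt in ('two-sided', 'larger'):
    d_min = analysis.solve_power(effect_size=None, nobs1=10, alpha=0.05,
                                 power=0.8, alternative=alt)
    print(f"{alt}: d = {d_min:.2f}")           # about 1.3 two-tailed, 1.2 one-tailed
```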

But, I hear you saying, this study did find a significant effect of intervention, despite being underpowered. So isn’t that all the more convincing? Sadly, the answer is no. As Christley (2010) has demonstrated, positive findings in underpowered studies are particularly likely to be false positives when they are surprising – i.e., when we have no good reason to suppose that there will be a true effect of intervention. This seems particularly pertinent in the case of the Current Biology study: if playing action computer games really did massively enhance children’s reading, we might have expected to see a dramatic improvement in reading levels in the general population in the years since such games became widely available.
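
To make the arithmetic behind Christley’s argument concrete, here is a small sketch, with illustrative numbers of my own choosing rather than figures from his paper, of how the proportion of false positives among ‘significant’ results depends on power and on the prior probability that a real effect exists.

```python
# Sketch: what proportion of 'significant' results are false positives,
# as a function of power and the prior probability of a true effect
# (illustrative numbers only).
def false_positives_among_significant(alpha, power, prior):
    """P(no true effect | result is significant), by Bayes' rule."""
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return false_positives / (true_positives + false_positives)

for power in (0.8, 0.2):
    for prior in (0.5, 0.1):
        fdr = false_positives_among_significant(0.05, power, prior)
        print(f"power = {power}, prior = {prior}: "
              f"P(false positive | significant) = {fdr:.2f}")

# With power = .8 and a 50:50 prior, about 6% of significant findings are false;
# with power = .2 and a surprising hypothesis (prior = .1), it is nearer 70%.
```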

The small sample size is not the only problem with the Current Biology study. There are other ways in which it departs from the usual methodological requirements of a clinical trial:

- it is not clear how children were assigned to treatments, or whether assessment was blind to treatment status;
- no data were provided on drop-outs;
- on some measures there were substantial differences in the variances of the two groups;
- no adjustment appears to have been made for the non-normality of some outcome measures;
- the follow-up analysis was confined to six children in the intervention group.

Finally, neither group showed significant improvement in reading accuracy, where scores remained 2 to 3 SD below the population mean (Tables S1 and S3): the group differences were seen only for measures of reading speed.

Will any damage be done? Probably not much – some false hopes may be raised, but the stakes are not nearly as high as they are for medical trials, where serious harm or even death can follow from erroneous findings. Quite apart from the implications for families of children with reading problems, however, there is another issue here: the publication policies of high-impact journals. These journals wield immense power. It is not overstating the case to say that a person’s career may depend on having a publication in a journal like Current Biology (see Lawrence, 2007 – published, as it happens, in Current Biology!). But, as the dyslexia example illustrates, a home in a high-impact journal is no guarantee of methodological quality. Perhaps this should not surprise us: I looked at the published criteria for papers on the websites of Nature, Science, PNAS and Current Biology. None of them mentioned the need for strong methodology or replicability; all of them emphasised the “importance” of the findings.

Methods are not a boring detail to be consigned to a supplement: they are crucial for evaluating research. My fear is that the primary goal of some journals is media coverage, with the result that science is being reduced to journalism – and is suffering for it.

References

Brembs, B., & Munafò, M. R. (2013). Deep impact: Unintended consequences of journal rank. arXiv:1301.3748.

Christley, R. M. (2010). Power and error: increased risk of false positive results in underpowered studies. The Open Epidemiology Journal, 3, 16-19.

Halpern, S. D., Karlawish, J. T., & Berlin, J. A. (2002). The continuing unethical conduct of underpowered clinical trials. Journal of the American Medical Association, 288(3), 358-362. doi: 10.1001/jama.288.3.358

Lawrence, P. A. (2007). The mismeasurement of science. Current Biology, 17(15), R583-R585. doi: 10.1016/j.cub.2007.06.014

Tressoldi, P., Giofré, D., Sella, F., & Cumming, G. (2013). High impact = high statistical standards? Not necessarily so. PLoS ONE, 8(2): e56180. doi: 10.1371/journal.pone.0056180

21 comments:

  1. Great article in general, but in regularly spelled languages it's really hard to get any changes in accuracy – it's usually at ceiling – so you have to measure speed (see my 2000 article in Applied Psycholinguistics).

  2. Hi Katie
    They didn't have ceiling effects on accuracy as far as I can tell. Assuming the numbers in table S2 are proportions correct, the accuracy on reading pseudowords is well off ceiling (though they are up near ceiling on real words, as you indicate). Their main outcome measure combined real and pseudowords.

  3. I'm just about to start the dissertation for my Psychological Research Methods MSc; working out the required sample size was a key step in getting approval for my proposal. As a novice researcher it is really helpful to read this and consider how much of it might apply to the papers I'm currently reading.

  4. Charlie Wilson (@crewilson), 10 March 2013 at 18:52

    "The primary goal of some journals is media coverage" and more to the point profit in general, which is why, particularly in the case of these high profile magazine-journals, they understandably search for spectacle and newsworthiness in their articles.

    I'd have thought the only way to stop that is to say that if you want public research money, you have to publish in not-for-profit journals...

  5. Dovetails nicely with another CurrBiol paper from my field which I recently covered:
    http://bjoern.brembs.net/comment-n899.html

  6. It's pretty amazing that a study with such a small sample should achieve such prominence. Especially given your important point:

    "there are good reasons to believe this is unlikely to be the case."

    Remarkable game-changing findings are rare in careful science, and when they occur the first question on anyone's lips should be "how likely is this to be a false positive?" Any psychology undergrad should know how to answer that question, let alone a reviewer for a major journal.

    At a recent talk about the many retractions in the social priming literature, a point that emerged was that fields where there is a strong, consistent body of theory and evidence should be less vulnerable to "paradigm hijacks" like this. Here it looks like the editors of CB might not have consulted any experts, or even considered whether the study was capable of addressing the question.

  7. Your blog post prompted me to read Christley's (2010) article, as I was intrigued by its title. To me, it suggested that underpowered studies somehow had Type I error rates that are higher than the nominal level, which would be newsworthy indeed. What Christley talks about, though, is that IF you reject H0, then this decision is more likely to be an error if your study has low statistical power (which he makes explicit in about every other paragraph). Therefore, your interpretation of his article ("positive findings in underpowered studies are particularly likely to be false positives when they are surprising") is, of course, wholly correct.

    That said, I thought I'd add a footnote to point out that Type I error rates in the actual sense, i.e. the probability of incorrectly rejecting H0 when it is in fact correct, are at their nominal level in underpowered studies. What is dependent on power is the proportion of Type I errors among significant findings. John Ioannidis made the same point (and a host of others) in 2005 (http://dx.doi.org/10.1371/journal.pmed.0020124).

    Replies
    1. It is, perhaps, worth further distinguishing between the errors raised in this comment. The "actual" type I error that you describe occurs when a null hypothesis is rejected when it is in fact true. The problem we face is that, when undertaking statistical analysis to "test" a hypothesis, we do not know whether or not the H0 is true. Hence, in practice this definition of type I error is of less interest than the effects that such error may have on our conclusions based on a study.

      Hence, there is a tendency to rely on interpretation of the p-value, and to treat it as if it can be interpreted in the same way as the type I error. My argument in the cited article is that the conclusion that the null hypothesis is likely to be inconsistent with the observed data (and hence rejected) is more likely to be a false conclusion than is often recognised, and that this probability is inversely related to the study's power. As noted in the paper, the commonly accepted values for alpha and beta (0.05 and 0.2) will result in fewer than 1 in 20 false positives only when the prior probability that there is a difference is in excess of approximately 50%. Hence, if the ‘significant’ result is unexpected, there may be an unacceptably high probability of a false positive conclusion.

      As noted, this finding has been reported elsewhere, but as yet has not been seen as newsworthy in most disciplines. Old habits die hard.

  8. Great article and discussion. I think it is also worth saying that most of these problems of interpreting an underpowered significant effect could be avoided if scientists were required to provide confidence intervals rather than p-values.

    Replies
    1. While confidence intervals have the advantage of focussing attention on effect size, they are only another way of expressing the same information you get from P values. They tell you nothing about the false discovery rate, which, I maintain, is what matters. See http://www.dcscience.net/?p=6518 and http://arxiv.org/abs/1407.5296

  9. Dear Prof. Bishop,
    We are really surprised by the post on your blog: bashing other people’s work without any collected data that prove, or at least support, your claims seems to us really unfair, honestly.

    We really appreciate your long and consistent scientific output, and we do not wish by any means to start a fight with you, especially in a blog. We are not used to writing in blogs at all – sorry about that, we should probably update ourselves :-) – but we believe that the proper arena for a scientific discussion is the peer-reviewed international journals.

    However, we can’t ignore the fact that you call out our paper, and consequently us, and that you did so with several statements that we believe are unsupported by the facts.

    So please don’t get us wrong: we really have a lot of respect and appreciation for you, but we are forced to reply in order to give your readers the chance to evaluate the situation correctly.

    We would also like to make clear that we are not willing to get into a useless and perhaps endless back-and-forth discussion here; we really just want to provide some evidence for the legitimacy of our results, which we believe (and luckily we are in good company) are really relevant and useful for dyslexia research.

    First, the claim that our paper, with this sample size, was accepted only because it was sent to a high-impact journal that you suppose to be more interested in newsworthiness than in science is clearly unsupported, for the simple reason that the size of our sample is more or less the standard found in many previous studies of dyslexia treatments, published in journals of all impact factors. In other words, a sample of around ten dyslexics is very common (with notable exceptions, of course) in studies that attempt to see whether a dyslexia treatment works, and this is completely independent of the impact factor of the journal. You could, of course, start to bash all the scientific production in the field to date, which might perhaps be legitimate, but why it is exactly our paper, rather than what was produced in the last 30 years, that caught your attention is at least a legitimate question too.

    Here are just a few examples of previous works in the field of dyslexia remediation with an experimental-group sample size of around 10, and the journals that published them:

    - Hohn & Ehri, 1983, Journal of Educational Psychology (experimental group N = 8);
    - Wynne, 1997, Topics in Language Disorders (experimental group N of about 12);
    - Brennan & Ireson, 1997, Reading & Writing (experimental group N = 12);
    - Goldstein, 1976, Journal of Educational Psychology (experimental group N = 11);
    - Fox & Routh, 1984, Journal of Educational Psychology (experimental group N = 10);
    - Williams, 1980, Journal of Educational Psychology (experimental group N = 2-8 in Exp 1 and N = 8 in Exp 2);
    - Judica et al., 2001, Neuropsychological Rehabilitation (experimental group N = 9);
    - Cunningham, 1990, Journal of Experimental Child Psychology (experimental group N = 14);
    - Treiman & Baron, 1983, Memory & Cognition (experimental group N = 8 in Exp 1 and N = 12 in Exp 2);
    - Lovett et al., 1994, Brain & Language (experimental group N = 5);
    - Facoetti et al., 2003, Cognitive Brain Research (experimental group N = 12);
    - Spironelli et al., 2010, Brain (experimental group N = 14);
    - Eden et al., 2004, Neuron (experimental group N = 9);
    - Bradley & Bryant, 1983, Nature (experimental group N = 13).

    Now that it is clear that our sample size is not at all an exception, other points should be clarified.

  10. We had a solid basis for believing that an attentional treatment would produce a large effect on reading abilities. We have published more than 30 papers on the link between attention and dyslexia (one even with SLI), most of them (more than 90%, for sure) in what you would probably call reputable journals with middle to low impact factors. Last year we published another paper in Current Biology (Franceschini et al., 2012, a longitudinal study with an initial sample of 96 pre-reader children) in which we showed that an attentional deficit is a core deficit in dyslexia. More importantly, from that study we can make a rough estimate of the effect of a possible attention training on reading abilities. In our longitudinal data, a simple correlation between attentional skills at the pre-reading stage and future reading abilities in the second grade of primary school gives r = .55. This correlation coefficient is very high: if we had based the video-game study on that alone, then, using the software GPower 3.1.6 (a very useful and free tool for estimating sample size), the required sample (both groups together) would be 14; we used 20 children instead. Moreover, the effect sizes (Cohen’s d) of previously published remediation studies for dyslexia (in several journals with different impact factors) were between .38 and .88 (see for example Bus and van IJzendoorn, 1999). Finally, we would remind you that a simple visual manipulation that facilitates attentional distribution over a single letter, without any training at all, increases reading speed by about 0.3 syll/sec and accuracy by more than 50% (Zorzi et al., 2012, PNAS, sample size n = 96).
    In other words, there was no solid reason to believe that the effect of an attentional training on reading skills would be as small as you wrote; we actually believed that the effect could be quite big, and our expectation is supported by the data.

    So, as a partial summary, we can easily say that our sample size is pretty standard in comparison with previous studies of dyslexia remediation, and the claim that our study was underpowered is really more than questionable.

    Moreover, while it is commonly accepted that Type II error is largely influenced by sample size, it seems that the idea that Type I error is also so influenced is borrowed from a paper published in a journal that has no impact factor and is not even listed by ISI (one of the main indexes of scientific journals). We are not willing to discuss what makes a journal reliable or not, but this is just to make the facts clear. Even if we fully believe that paper, and we agree that a small sample size can introduce some uncertainty into the results, we would like to remind you that hundreds of thousands of scientific articles, published across the whole range of impact factors, have smaller sample sizes and are still relevant in science (some of them are actually milestones); this can hardly be denied.

    Moreover, claiming that our specific results are prone to Type I error is presumptuous: we provided the p value, which already indicates the probability that our results are just a false positive; we planned the sample size before the study; and we analysed the data only after the data from both samples were collected. Consequently, the probability that our results are a Type I error is at least similar to the standard of published articles, regardless of the impact factor of the journal, which we can easily call minimal. Talking about betting, it is clear that if we had made a bet with you, before analysing the data, about the significance of this result, we would have won that bet… and if you want to talk about the result of a future bet, please try to collect some data on it, and we’ll be happy to discuss the results together, whatever the outcome.

  11. We would also like to point out that we reported the single-subject analyses, which are really clear, and that we reported the correlation analysis between attentional improvements and reading improvements, carried out on the whole sample (N = 20); again, the results are very clear: improving attentional skills makes dyslexics read better.

    Let us also reply to the other supposed problems with our paper. The drop-out was zero – no drop-out at all; guess what, the children love it :-) – which is a very important point for future remediation programmes, since, as you know, drop-out is a serious issue in treatments. The assignment of the children to the two groups was blind, and the children did not know the aim of the study. The variances of the two groups were reported in the Tables, and homogeneity of variance was checked. Regarding the follow-up, we correctly reported in the text that, because this was an experimental training, 4 children started a traditional training before we could recall them for the follow-up, and so we could not retest them; however, all 6 children whom we did test again showed the same reading abilities after two months, both as a group and at the single-child level, which is a good result, at least. Regarding accuracy, it is true that it did not improve significantly, but there is an important improvement in reading speed without any cost in accuracy, which means that reading efficiency improved overall. Reading speed is notoriously important for dyslexics, also because it helps comprehension, which is impaired in dyslexia primarily because of the speed and accuracy deficit and not because of a specific comprehension impairment. It is worth noting that increasing reading speed is much harder than increasing accuracy in the traditional treatment of dyslexia. Ignoring such a large improvement in reading speed, without any cost to accuracy, seems to us a very biased approach that can only be detrimental for dyslexic children and their families.

    Moreover, we find it really unfair that you talk about creating false hopes for the children and their families. Most of the so-called traditional treatments for dyslexia are far from being scientifically validated; many of them have not even been tested to see whether they deserve to be applied. The reason they are commonly accepted is often only that many believe dyslexia is exclusively a language problem (even though the literature clearly shows that this is not the case), so training language and/or reading abilities must work, no matter what the scientific literature says. It is interesting to note that, according to the post on your blog entitled “Neuroscientific intervention for dyslexia: red flags” of 24th February 2012, our paper does not have a single red flag…

    In other words: who is giving false hopes to the children and families? Not us, for sure. If you read all the press material we released after our study was published, caution was the key word: we never suggested quitting a training to start playing action video games, and we never suggested playing video games without the supervision of an expert. The point of our study is to show that a training of attention can change reading abilities; it is not a typical clinical trial (did you read “clinical trial” anywhere in our study?) but an experimental study that opens up future possibilities for dyslexia remediation and, even more importantly at this stage, clearly demonstrates the role of the attentional deficit in dyslexia.

    We believe that, based on our results, an attentional training could be added to the typical trainings in order to maximise the chance of reducing the problems of dyslexia, which should be our common goal.

  12. In sum: our study is far from methodologically poor; we passed three rounds of revision (peer review) with three of the world’s most relevant experts in the field (Current Biology is a top-ranked journal for several reasons…); there is no reason to claim that our data would not have been published by a lower-impact journal; and it is really unfair to say that we are giving false hopes to the families of children with dyslexia, or that we were not cautious in describing our results to the media.

    We believe that before saying a study will not replicate, some new data have to be collected; bashing a study without any proof is easy and can be done to every single study, but it is also unethical at best, and we are really surprised by your attack. We believe that the reason for your attack should be sought elsewhere: maybe it is hard to accept that dyslexia is not only a language problem, but this kind of resistance to the scientific data can only produce a bad outcome – delaying knowledge about the disorder and consequently reducing the chance of finding an effective treatment for it.

    In conclusion, if you want to bash high-impact journals for some personal reason that does not interest us, go ahead; but please look elsewhere than our paper – or, again, please try to collect some serious data, publish them in a reputable peer-reviewed journal, and after that we’ll be happy to discuss them.
    Until then, well, the only data that are around support our view.

    We appreciate your concern about science that risks becoming journalism, but we are even more worried that science could become blog material, without any peer review and without even any data to support its claims. Discrediting a study without any new data is very easy on the internet; doing a better study is often much more difficult, but also more correct.

    In the hope that you understand that there is nothing personal in this, and that our respect and appreciation for you really remain as solid as before, we hope to see you soon at some conference around the world, where we can have the chance to chat together, maybe over a good cup of tea :-)

    All the best,
    Andrea Facoetti & Simone Gori

  13. I think the danger with relying on the sample size of previous studies is that it systematically exerts downward pressure on statistical power.

    If a study achieves a p value of 0.05, and we assume that the observed effect size is an accurate estimate of the true effect size, then the power of a replication sample with exactly the same N will be only 50% (the situation is better if the p value is smaller, but it illustrates the danger of justifying sample size in this way; see the sketch at the end of this comment).

    However, we also know that early studies tend to over-estimate effects (or may be false positives) - John Ioannidis has written extensively about this, and it has been observed in multiple fields. Therefore, in epidemiology (for example), best practice is that replication studies should ideally be much larger than initial (discovery) studies.

    All other things being equal, larger studies will allow us to estimate any real effects more precisely, and will protect us from these problems. Unfortunately at the moment we're incentivised to publish in a way that reduces the motivation to spend a long time running a large study. The Impact Factor issue is relevant to this last point.
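
    A quick check of the 50% figure above, under a normal approximation (an added sketch, not the commenter’s own calculation):

```python
# If the original study only just reached p = .05 (two-sided), its observed
# effect corresponds to z = 1.96. If the true effect really is that size, the
# test statistic of a same-sized replication is distributed roughly as
# N(1.96, 1), so its chance of reaching significance is about 50%.
from scipy.stats import norm

z_obs = norm.ppf(1 - 0.05 / 2)                      # 1.96
replication_power = 1 - norm.cdf(z_obs, loc=z_obs)  # P(z > 1.96 | true z = 1.96)
print(round(replication_power, 2))                  # 0.5
```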

  14. Puzzled. Without going into pros and cons of this case, one does not have to gather and publish other data to evaluate and critique a piece of work. This is a great service to those of us without the time or skills to do it.

    Also, I think blogs are in fact a much more useful and free forum for debate than formal letters in stuffy journals – often inaccessible to non-subscribers or lay people.

  15. To Drs Facoetti and Gori: Thank you for your considered response. Like you, I found the word limit on comments inadequate for a reply, so I have responded to you in a new blog post. I am sorry that you feel I have been unfair to you, but hope this makes my motivation for writing clear. I will be very happy to have further reaction from you – if need be I can append it to the post to save you writing several comments. See http://deevybee.blogspot.co.uk/2013/03/blogging-as-post-publication-peer.html

  16. There was a time when New Scientist would publish, it seems, just about any article that said something insulting about religious people. I'm told that's less so now, but this is one reader they've lost.

  17. This comment has been removed by the author.

  18. 10 kids seems very small, just intuitively (based on human experience).

    I would also be interested in controls such as no intervention, dedicated reading instruction, or dedicated reading time, rather than just Call of Duty versus Hello Kitty.

  19. Reminds me of the 'classical music helps reading' silliness from a few years ago. There's a pattern here of junk science in education research.

    ReplyDelete