Saturday 14 August 2010

Three ways to improve cognitive test scores without intervention

My research focuses on neurodevelopmental disorders such as autism, specific language impairment and dyslexia, where more or less miraculous cures are heralded at regular intervals. These claims are often unfounded, being based on anecdote or uncontrolled studies. I often feel like a kill-joy, picking holes in reports that offer hope to parents and their children. But I carry on because I don't like to see people being misled, often at considerable personal expense, and sometimes even being encouraged to subject their child to practices that are at best boring and at worst painful or dangerous.

Psychological or educational interventions have much in common with medical treatments, and the same randomised-controlled trial (RCT) methodology is needed to obtain convincing evidence of efficacy. But what is an RCT controlling for?  In drug treatments, the emphasis is on the need to control for (a) spontaneous improvement or (b) placebo effects, i.e. the improvement that you get just from being the focus of medical attention.  In psychological and educational interventions there are three other important sources of potential bias that can give a misleading impression of treatment efficacy unless they are controlled for: developmental change, practice effects and regression to the mean.

Developmental change

This is the simplest to understand. Basically, children get better at most things as they get older. If someone gives me a brain training game that is supposed to cure dyslexia, and my child does the exercises every day for a year, he may well be able to read more words at the end of the intervention than he could at the start. However, this would hardly be good evidence for effectiveness of the games, since most children learn more words as they grow older. Indeed, his feet will also be bigger at the end of the intervention, but I would not attribute that to the brain training. This is so obvious that it may seem unnecessary to state it, except for the fact that with some conditions, people assume that development is static.  For instance, the National Autistic Society website states "Autism is a lifelong developmental disability".  Now, this does not mean that children with autism will not change with age, yet it is easy to see how someone might misinterpret it. In fact, diagnostic instruments for autism focus on the child's behaviour at 4 to 5 years of age, precisely because that is when autistic features are most evident.  As the child grows older, these features may become milder or take a different form. See, for instance, the detailed account by Clara Claiborne Park of her autistic daughter, who barely used speech at 4 years, but could produce simple sentences by the age of 8. If your child has some intervention around the age of 4 years and then shows improvements, it is easy to assume that the intervention was responsible.  However, the improvement may well have occurred anyway.  The same is true for children who are late to start talking. This was brought home to me by a study conducted in the 1980s evaluating home-based  therapy with late talkers in an inner city area. The treated children showed a dramatic improvement.  However, so too did an untreated control group.  
Without the evidence of the control group it would be easy to be misled into thinking that all late talkers should be given the home-based intervention, but a very different recommendation would follow when the controls are taken into account.  The data suggested that it would be better to adopt a 'watchful waiting' policy, because most children improved spontaneously. I would not want to imply we should not treat children with language difficulties: my own research has been concerned with identifying which children won't catch up by themselves, so we can target intervention appropriately.  But it is easy to underestimate the amount of change that can occur without intervention, especially in very young children.

Practice effects

On many tests, people get better with practice. Many people will have experienced the WiiFit machine, which gives you a series of tests to estimate your 'fitness age'. These involve activities such as standing on a balance board and trying to keep a dot representing your centre of gravity in a central position, or doing complex memory or calculation tests under time pressure. The first time I did this, my fitness age was estimated at 76. However, just one day later, it had come down to 60 years, and with further practice there was a steep decline. I don't think my ineffectual and limited attempts at WiiFit exercises had reversed 16 years' worth of decrepitude: rather, I got better because I had a clearer idea of what I was supposed to be doing. The first time I had no strategy for performing the tests, and in some cases completely misunderstood what was required. So it was hardly surprising that I improved dramatically just by my second attempt. It is worth noting that balance tests are used by Dore centres to evaluate the success of their training program for neurodevelopmental disorders. Some of these tests show substantial practice effects (Liao et al., 2001). Parents can see their child's performance improving, and gain the impression that this indicates that important brain changes are occurring. However, without correction for practice effects this is misleading. (As far as I can establish, Dore centres don't correct for developmental changes in balance either, though these are well documented on such tests.)

Cognitive tests vary in the extent to which they show practice effects, but most will show some benefits of prior exposure. Take a couple of well-known tests of non-verbal ability, matrices and object assembly. Matrices involves working out the logic behind visual patterns. Once you have worked out how a specific pattern is constructed, it is much easier to do it on the next occasion, because you don't have to entertain a wide range of hypotheses about the correct answer. Object assembly is a jigsaw-like task where you assemble pieces to make a meaningful figure, but you don't know what the meaningful figure is – at least not the first time you do it. Once you have done it, you are likely to show gains in performance on a second occasion, even after some years. Over a short interval, Rapport et al. (1997) showed gains in IQ of 5-10 points after one previous test. I studied short-term practice effects on a language comprehension test I have developed: children showed significant gains in scores on the second administration, even though I used different test items (but the same format). McArthur has specifically looked at the extent to which improvements are seen in untreated groups of children on tests of reading and language, and has documented substantial gains.

Practice effects mean that gains in cognitive test scores after an intervention are usually uninterpretable unless we have comparison information from individuals who were given the same schedule of testing without the intervening intervention.   Again, it sounds obvious, but it is a point that is often missed by researchers, as well as by non-scientists.  A striking example comes from a study of interventions for children with speech-language problems by Gillam and colleagues. This study was designed as a randomised controlled trial, and had four groups: three of them were given specific treatments designed to improve children's language, and the fourth was a control group who were given a  computerised training of general 'academic skills'. Everyone improved, but the amount of improvement was no greater for the three treatment groups than for the control group.  The conventional interpretation of such a result would be to conclude that the treatments were ineffective, because the changes associated with them were no greater than for the control group.  However, impressed by the increase in scores and apparently unaware of the possibility of practice effects, the authors of the study concluded instead that all treatments were effective, and spent some time discussing why the control treatment might be good at boosting language.

One further point: it is sometimes assumed that we don't need to worry about practice effects if a test has good test-retest reliability. But that is not the case. High reliability just means that the rank ordering of a group of children will be stable across testing occasions. But reliability indices are insensitive to the absolute level of performance. So if we took a group of children's scores at time 1 and gave each of them a score 10 points higher at time 2, we would have perfect reliability because their rank ordering would be the same. Reliability is, however, an important factor when considering the last topic, regression toward the mean.
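This is easy to demonstrate with a minimal sketch in plain Python (the scores are made-up numbers, not data from any real test): give every child exactly 10 extra points at time 2, and the test-retest correlation stays at a perfect 1.0 even though the whole group has improved.

```python
import random

# Hypothetical scores for 50 children tested twice. Every child gains
# exactly 10 points at time 2, so the rank ordering (and hence the
# reliability coefficient) is unchanged despite the across-the-board gain.
random.seed(1)
time1 = [random.gauss(100, 15) for _ in range(50)]
time2 = [s + 10 for s in time1]

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

print(round(pearson(time1, time2), 3))              # 1.0 -- perfect reliability
print(round(sum(time2) / 50 - sum(time1) / 50, 3))  # 10.0 -- yet everyone improved
```

So a high reliability coefficient is entirely compatible with a large practice effect: the coefficient tracks rank order, not level.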

Regression toward the mean

Regression toward the mean is a horribly difficult concept to understand, even for those with statistical training. 
In their masterly introduction to the subject, Campbell and Kenny noted that regression to the mean has confused many people, including Nobel prize winners. More specifically, they stated that: “Social scientists incorrectly estimate the effects of ameliorative interventions ... and snake-oil peddlers earn a healthy living all because our intuition fails when trying to comprehend regression toward the mean”.

Perhaps the easiest way to get a grasp of what it entails is to suppose we had a group of 10 people and asked them each to throw a dice 10 times and total the score. Let's then divide them into the 5 people who got the lowest scores and the 5 who got the highest scores, and repeat the experiment. What do we expect? Well, assuming we don't believe that anything other than chance affects scores (no supernatural forces or 'winning streaks'), we'd expect the average score of the low scorers to improve, and the average score of the high scorers to decline. This is because, for any one person, the expected result of any new set of dice throws is simply an average score. So that's the simplest case, when we have a score that is determined only by chance.
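The dice example can be simulated directly. A sketch in plain Python (the experiment is repeated many times so that chance averages out): the low scorers improve on retest and the high scorers decline, with no intervention anywhere in sight.

```python
import random

random.seed(42)

def total_of_ten_throws():
    """One player's score: the sum of ten dice throws (expected value 35)."""
    return sum(random.randint(1, 6) for _ in range(10))

low_gain, high_gain = 0.0, 0.0
n_experiments = 2000
for _ in range(n_experiments):
    # First round: 10 players, split into the 5 lowest and 5 highest scorers.
    first = sorted(total_of_ten_throws() for _ in range(10))
    low5, high5 = first[:5], first[5:]
    # Second round: each player's new total is an independent set of throws.
    low2 = [total_of_ten_throws() for _ in low5]
    high2 = [total_of_ten_throws() for _ in high5]
    low_gain += sum(low2) / 5 - sum(low5) / 5
    high_gain += sum(high2) / 5 - sum(high5) / 5

print(low_gain / n_experiments > 0)    # True: low scorers improve on average
print(high_gain / n_experiments < 0)   # True: high scorers decline on average
```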

Cognitive test scores are interesting here because they are typically thought of as comprising two parts: a 'true' score, which reflects how good you really are at whatever is being measured, and an 'error' score, which reflects random influences. Suppose, for instance, that you test a child's reading ability. In practice, the child is a very good reader, in the top 10% for her age, but the exact number of words she gets right will depend on all sorts of things: the particular words selected for the test (she may know 'catacomb' but not 'pharynx'), the mood she is in on the day, whether she is lucky or unlucky at guessing at words she is uncertain about. All these factors would be implicated in the 'error' score, which is treated just like an additional chance factor or throw of the dice affecting a test score. A good test is mostly determined by the 'true' score, with only a very small contribution of an 'error' score, and we can identify it by the fact that children will be ranked by the test in a very similar way from one test occasion to the next, i.e. there will be good test-retest reliability. In other words, the correlation from time 1 to time 2 will be high.

Simulated test scores for 21 children on tests varying in reliability
These figures show simulated scores for a group of 21 children on tests that vary in test-retest reliability.  In each case individuals are colour-coded depending on whether they fall in the bottom (blue), middle (purple) or top (red) third of scores at time 1. The simulations assume no systematic differences between time 1 and 2 - i.e. no intervention effect, maturation or practice.  Scores are simulated as random numbers at time 1, with time 2 scores then set to give a specific correlation between time 1 and 2, with no change in average score for the group as a whole.

Now suppose we select children because they have a particularly low score at time 1. Insofar as chance contributes to their scores, then at time 2 we would expect the average score of such a group to improve, because chance pushes the group average towards the overall mean score. The left-hand panel shows the situation when reliability (i.e., the correlation between time 1 and time 2 scores) is zero, so scores are determined just by chance, like throwing a dice. The mean scores for the blue, purple and red cases at time 1 are, by definition, different (they are selected to be low, medium and high scorers). But at time 2 they are all equivalent. The upshot is that the mean at time 2 for those with initial low scores (blue) goes up, whereas for those that start out with high scores, the time 2 mean comes down.
The middle and right-hand panels show a more realistic situation, where the test score is a mixture of true score and error. With very high reliability (right-hand panel) the effect of regression to the mean is small, but with medium reliability (middle panel) it is detectable by eye even for this very small sample.

The key point here is that if we select individuals on the basis of low scores on a test (as is often done, e.g. when identifying children with poor reading scores for a study of a dyslexia treatment), then, unless we have a highly reliable test with a negligible error term, the expectation is that the group's average score will improve on a second test session, for purely statistical reasons.  In general, psychometric tests are designed to have reasonable reliability, but this varies from test to test and is seldom higher than .75-.8.
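That key point can be sketched numerically (a minimal simulation in plain Python, not the figures above; the reliability of .75 is a hypothetical value at the top of the typical range). Scores are standardised true score plus error; we select the bottom third at time 1, exactly as a dyslexia treatment study might, and watch the group mean drift back toward zero at time 2 by a factor of roughly the reliability.

```python
import random

# Each child has a fixed 'true' score plus independent 'error' at each
# session; error variance is set so test-retest reliability is .75.
random.seed(0)
n = 3000
rel = 0.75
sd_true = rel ** 0.5          # true-score SD
sd_err = (1 - rel) ** 0.5     # error SD, so total variance is 1

true_scores = [random.gauss(0, sd_true) for _ in range(n)]
t1 = [t + random.gauss(0, sd_err) for t in true_scores]
t2 = [t + random.gauss(0, sd_err) for t in true_scores]

# Select the bottom third at time 1, with no intervention before time 2.
cutoff = sorted(t1)[n // 3]
low = [i for i in range(n) if t1[i] < cutoff]
mean1 = sum(t1[i] for i in low) / len(low)
mean2 = sum(t2[i] for i in low) / len(low)

print(round(mean1, 2), round(mean2, 2))  # time-2 mean is closer to zero
print(round(mean2 / mean1, 2))           # ratio is close to the reliability
```

The selected group 'improves' for purely statistical reasons, and the size of the apparent gain is governed by the reliability: the lower the reliability, the bigger the spurious improvement.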

So regression to the mean is a real issue in longitudinal studies.  It is yet another reason why scores will change over time. Zhang and Tomblin (2003) noted that we can overcome this problem by using different tests to select children for a study and to measure their improvement.  Or we can allow for regression to the mean if our study includes a control group, who will be subject to the same regression to the mean as the intervention group.

So, if you started reading this blog thinking that people who demand control groups for intervention studies were just a kind of methodological Stasi whose obsession with RCTs was misplaced, I hope I have convinced you that the concerns are real. There are just so many ways that we can mislead ourselves into thinking that an intervention has led to improvement, when in fact it is making no difference. It is depressingly hard to demonstrate positive effects of treatment, but the only way to do it is by using properly designed studies that can overcome the biases I have noted here by use of appropriate controls. 

Additional resources on this topic can be found in this section of my website; see especially:

  7th BDA International Conference – 2008: Dyslexia: Making Links
  2nd UK Paediatric Neuropsychology Symposium
  DysTalk online (video)

And a couple of short articles on how not to do intervention studies:
BISHOP, D. V. M.  2007. Curing dyslexia and ADHD by training motor co-ordination: Miracle or myth? Journal of Paediatrics and Child Health, 43, 653-655.
BISHOP, D. V. M. 2008. Criteria for evaluating behavioural interventions for neurodevelopmental disorders (Letter). Journal of Paediatrics and Child Health, 44, 520-521.

Friday 6 August 2010

How our current reward structures have distorted and damaged science

Two things almost everyone would agree with:

1. Good scientists do research and publish their results, which then have impact on other scientists and the broader community.

2. Science is a competitive business: there is not enough research funding for everyone, not enough academic jobs in science, and not enough space in scientific journals. We therefore need ways of ensuring that the limited resources go to the best people.

When I started in research in the 1970s, research evaluation focused on individuals. If you wrote a grant proposal, applied for a job, or submitted a paper to a journal, evaluation depended on peer review, a process that is known to be flawed and subject to whim and bias, but is nevertheless regarded by many as the best option we have.

What has changed in my lifetime is the increasing emphasis on evaluating institutions rather than individuals. The 1980s saw the introduction of the Research Assessment Exercise, used to evaluate Universities in terms of their research excellence in order to have a more objective and rational basis for allocating central funds (quality weighted research funding or QR) by the national funding council (HEFCE in England). The methods for evaluating institutions evolved over the next 20 years, and are still a matter of debate, but they have subtly influenced the whole process of evaluation of individual academics, because of the need to use standard metrics.

This is inevitable, because the panel evaluating a subject area can't be expected to read all the research produced by staff at an institution, but they would be criticised for operating an 'old boy network', or favouring their own speciality, if they relied just on personal knowledge of who is doing good work – which was what tended to happen before the RAE. Therefore they are forced into using metrics. The two obvious things that can be counted are research income and number of publications. But number of publications was early on recognised as problematic, as it would mean that someone with three parochial reports in the journal of a national society would look better than someone with a major breakthrough published in a top journal. There has therefore been an attempt to move from quantity to quality, by taking into account the impact factor of the journals that papers are published in.

Evaluation systems always change the behaviour of those being evaluated, as people attempt to maximise rewards. Recognising that institutional income depends on getting a good RAE score, vice-chancellors and department heads in many institutions now set overt quotas for their staff in terms of expected grant income and number of publications in high impact journals. The jobs market is also affected, as it becomes clear that employability depends on how good one looks on the RAE metrics.

The problem with all of this is that it means that the tail starts to wag the dog. Consider first how the process of grant funding has changed. The motivation to get a grant ought to be that one has an interesting idea and needs money to investigate it. Instead, it has turned into a way of funding the home institution and enhancing employability. Furthermore, the bigger the grant, the more the kudos, and so the pressure is on to do large-scale expensive studies. If individuals were assessed, not in terms of grant income, but in terms of research output relative to grant income, many would change status radically, as cheap, efficient research projects would rise up the scale. In psychology, there has been a trend to bolt on expensive but often unmotivated brain imaging to psychological studies, ensuring that the cost of each experiment is multiplied at least 10-fold. Junior staff are under pressure to obtain a minimum level of research funding, and consequently spend a great deal of time writing grant proposals, and the funding agencies are overwhelmed with applications. In my experience, applications that are written because someone tells you to write one are typically of poor quality, and just waste the time of both applicants and reviewers. The scientist who is successful in meeting their quota is likely to be managing several grants. This may be a good thing if they are really talented, or have superb staff, but in my experience research is done best if the principal investigator puts serious thought and time into the day-to-day running of the project, and that becomes impossible with multiple grants.

Regarding publications, I am old enough to have been publishing before the RAE, and I'm in the fortunate but unusual position of having had full-time research funding for my whole career. In the current system I am relatively safe, and I look good on an RAE return. But most people aren't so fortunate: they are trying to juggle doing research with teaching and administration, raising children and other distractions, yet feel under intense pressure to publish. The worry about the current system is that it will encourage people to cut corners, to favour research that is quick and easy. Sometimes one is lucky, and a simple study leads to an interesting result that can be published quickly. But the best work typically requires a large investment of time and thought. The studies I am proudest of are ones which have taken years rather than months to complete: in some cases, the time is just on data collection, but in others, the time has involved reading, thinking, and working out ways of analysing and interpreting data. But this kind of paper is getting increasingly rare. As a reviewer, I frequently see piecemeal publication: if you suggest that a further analysis would strengthen the paper, you are told that it has been done, but is the subject of another paper. Scholarship and contact with prior literature have become extremely rare: prior research is cited without being read – or not cited at all – and the notion of research building on prior work has been eroded to the point that I sometimes think we are all so busy writing papers that we have no time to read them. There are growing complaints about an 'avalanche' of low-quality publications.

As noted above, in response to this, there has been a move to focus on quality rather than quantity of publications, with researchers being told that their work will only count if it is published in a high-impact journal. Some departments will produce lists of acceptable journals and will discourage staff from publishing elsewhere. In effect, impact factor is being used as a proxy for likelihood that a paper will be cited in future, and I'm sure that is generally true. But just because a paper in a high impact journal is likely to be highly cited, it does not mean that all highly-cited papers appear in high impact journals. In general, my own most highly-cited papers appeared in middle-ranking journals in my field. Moreover, the highest impact journals have several limitations:

1. They only take very short papers. Yes, it is usually possible to put extra information in 'supplementary material', but what you can't do is take up space putting the work in context or discussing alternative interpretations. When I started in the field, it was not uncommon to publish a short paper in Nature, followed up with a more detailed account in another, lowlier, journal. But that no longer happens. We only get the brief account.

2. Demand for page space outstrips supply. To handle a flood of submissions, these journals operate a triage system, where the editor determines whether the paper should go out for review. This can have the benefit that rejection is rapid, but it puts a lot of power in the hands of editors, who are unlikely to be specialists in the subject area of the paper, and in some cases will be explicit in their preference for papers with a 'wow' factor. It also means that one gets no useful feedback from reviewers: witness my recent experience with the New England Journal of Medicine, where I submitted a paper that I thought had all the features they'd find attractive – originality, clinical relevance and a link between genes and behaviour. It was bounced without review, and I emailed, not to appeal, but just to ask if I could have a bit more information about the criteria on which they based their rejection. I was told that they could not give me any feedback as they had not sent it out for review.

3. If the paper does go out for review, the subsequent review process can be very slow. There's an account of the trials and tribulations of dealing with Nature and Science which makes for depressing reading. Slow reviewing is clearly not a problem restricted to high impact journals; my experience is that lower-impact journals can be even worse. But the impression from the comments on FemaleScienceProfessor's blog is that reviewers can be unduly picky when the stakes are high.

So what can be done? I'd like to see us return to a time when the purpose of publishing was to communicate, and the purpose of research funding was to enable a scientist to pursue interesting ideas. The current methods of evaluation have encouraged an unstoppable tide of publications and grant proposals, many of which are poor quality. Many scientists are spending time on writing doomed proposals and papers when they would be better off engaging in research and scholarship in a less frenetic and more considered manner. But they won't do that so long as the pressures are on them to bring in grants and generate publications. I'll conclude with a few thoughts on how the system might be improved.

1. My first suggestions, regarding publications, are already adopted widely in the UK, but my impression is they may be less common elsewhere. Requiring applicants for jobs or fellowships to specify their five best publications rather than providing a full list rewards those who publish significant pieces of work, and punishes piecemeal publication. Use of the H-index as an evaluation metric, rather than either number of publications or journal impact factor, is another way to encourage people to focus on producing substantial papers rather than a flood of trivial pieces, as papers with low citations have no impact whatever on the H-index. There are downsides: we have the lag problem, which makes the H-index pretty useless for evaluating junior people, and in its current form the index does not take into account the contribution of authors, thereby encouraging multiple authorship, since anyone who can get their name on a highly-cited paper will boost their H-index, regardless of whether they are a main investigator or a freeloader.

2. Younger researchers should be made aware that a sole focus on publishing in very high impact journals may be counter-productive. Rapid publication in an Open Access journal (many of which have perfectly respectable impact factors) may be more beneficial to one's career, because the work is widely accessible and so more likely to be cited. A further benefit of the PLOS journals, for instance, is that they don't impose strict length limits, so research can be properly described and put in context, rather than being restricted to the soundbite format required by very high impact journals.

3. Instead of using metrics based on grant income, those doing evaluations should use those based on efficiency, i.e. an input/output function. Two problems here: the lag in output is considerable, and the best metric for measuring output is unclear. The lag means it would be necessary to rely on track record, which can be problematic for those starting out in the field. Nevertheless, a move in this direction would at least encourage applicants and funders to think more about value for money, rather than maximising the size of a grant – a trend that has been exacerbated by Full Economic Costing (don't get me started on that). And it might make grant-holders and their bosses see the value of putting less time and energy into writing multiple proposals and more into getting a project done well, so that it will generate good outputs on a reasonable time scale.

4. The most radical suggestion is that we abandon formal institutional rankings (i.e. the successor to the RAE, the REF). I've been asking colleagues who were around before the RAE what they think it achieved. The general view was that the first ever RAE was a useful exercise that exposed weaknesses in institutions and individuals and got everyone to sharpen up their act. But the costs of subsequent RAEs (especially in terms of time) have not been justified by any benefit. I remember a speech given by Prof Colin Blakemore at the British Association for the Advancement of Science some years ago where he made this point, arguing that rankings changed rather little after the first exercise, and certainly not enough to justify the mammoth bureaucratic task involved. When I talk to people who have not known life without an RAE, they find it hard to imagine such a thing, but nobody has put forward a good argument that has convinced me it should be retained. I'd be interested to see what others think.