Psychological or educational interventions have much in common with medical treatments, and the same randomised controlled trial (RCT) methodology is needed to obtain convincing evidence of efficacy. But what is an RCT controlling for? In drug treatments, the emphasis is on the need to control for (a) spontaneous improvement or (b) placebo effects, i.e. the improvement that you get just from being the focus of medical attention. In psychological and educational interventions there are three other important sources of potential bias that can give a misleading impression of treatment efficacy unless they are controlled for: developmental change, practice effects and regression to the mean.
Developmental change
This is the simplest to understand. Basically, children get better at most things as they get older. If someone gives me a brain training game that is supposed to cure dyslexia, and my child does the exercises every day for a year, he may well be able to read more words at the end of the intervention than he could at the start. However, this would hardly be good evidence for effectiveness of the games, since most children learn more words as they grow older. Indeed, his feet will also be bigger at the end of the intervention, but I would not attribute that to the brain training. This is so obvious that it may seem unnecessary to state it, except for the fact that with some conditions, people assume that development is static. For instance, the National Autistic Society website states "Autism is a lifelong developmental disability". Now, this does not mean that children with autism will not change with age, yet it is easy to see how someone might misinterpret it. In fact, diagnostic instruments for autism focus on the child's behaviour at 4 to 5 years of age, precisely because that is when autistic features are most evident. As the child grows older, these features may become milder or take a different form. See, for instance, the detailed account by Clara Claiborne Park of her autistic daughter, who barely used speech at 4 years, but could produce simple sentences by the age of 8. If your child has some intervention around the age of 4 years and then shows improvements, it is easy to assume that the intervention was responsible. However, the improvement may well have occurred anyway. The same is true for children who are late to start talking. This was brought home to me by a study conducted in the 1980s evaluating home-based therapy with late talkers in an inner city area. The treated children showed a dramatic improvement. However, so too did an untreated control group. 
Without the evidence of the control group it would be easy to be misled into thinking that all late talkers should be given the home-based intervention, but a very different recommendation would follow when the controls are taken into account. The data suggested that it would be better to adopt a 'watchful waiting' policy, because most children improved spontaneously. I would not want to imply we should not treat children with language difficulties: my own research has been concerned with identifying which children won't catch up by themselves, so we can target intervention appropriately. But it is easy to underestimate the amount of change that can occur without intervention, especially in very young children.
Practice effects
On many tests, people get better with practice. Many people will have experienced the WiiFit machine, which gives you a series of tests to estimate your 'fitness age'. These involve activities such as standing on a balance board and trying to keep a dot representing your centre of gravity in a central position, or doing complex memory or calculation tests under time pressure. The first time I did this, my fitness age was estimated at 76. However, just one day later, it had come down to 60 years, and with further practice there was a steep decline. I don't think my ineffectual and limited attempts at WiiFit exercises had reversed 16 years' worth of decrepitude: rather, I got better because I had a clearer idea of what I was supposed to be doing. The first time I had no strategy for performing the tests, and in some cases completely misunderstood what I was supposed to be doing. So it was hardly surprising that I improved dramatically just by my second attempt. It is worth noting that balance tests are used by Dore centres to evaluate the success of their training program for neurodevelopmental disorders. Some of these tests show substantial practice effects (Liao et al, 2001). Parents can see their child's performance improving, and gain the impression that this indicates that important brain changes are occurring. However, without correction for practice effects this is misleading. (As far as I can establish, Dore centres don't correct for developmental changes in balance either, though these are well documented on such tests, e.g. http://www.ncbi.nlm.nih.gov/pubmed/17105679.)
Cognitive tests vary in the extent to which they show practice effects, but most will show some benefits of prior exposure. Take a couple of well-known tests of non-verbal ability, matrices and object assembly. Matrices involves working out the logic behind visual patterns. Once you have worked out how a specific pattern is constructed, it is much easier to do it on the next occasion, because you don't have to entertain a wide range of hypotheses about the correct answer. Object assembly involves a jigsaw-like task where you assemble pieces to make a meaningful figure, but you don't know what the meaningful figure is, at least not the first time you do it. Once you have done it, you are likely to show gains in performance on a second occasion even after some years. Over a short interval, Rapport et al (1997) showed gains in IQ of 5-10 points after one previous test. I studied short-term practice effects on a language comprehension test I have developed: children showed significant gains in scores on the second administration, even though I used different test items (but the same format). McArthur has specifically looked at the extent to which improvements are seen in untreated groups of children on tests of reading and language, and has documented substantial gains.
Practice effects mean that gains in cognitive test scores after an intervention are usually uninterpretable unless we have comparison information from individuals who were given the same schedule of testing without the intervening intervention. Again, it sounds obvious, but it is a point that is often missed by researchers, as well as by non-scientists. A striking example comes from a study of interventions for children with speech-language problems by Gillam and colleagues. This study was designed as a randomised controlled trial, and had four groups: three of them were given specific treatments designed to improve children's language, and the fourth was a control group who were given a computerised training of general 'academic skills'. Everyone improved, but the amount of improvement was no greater for the three treatment groups than for the control group. The conventional interpretation of such a result would be to conclude that the treatments were ineffective, because the changes associated with them were no greater than for the control group. However, impressed by the increase in scores and apparently unaware of the possibility of practice effects, the authors of the study concluded instead that all treatments were effective, and spent some time discussing why the control treatment might be good at boosting language.
One further point: it is sometimes assumed that we don't need to worry about practice effects if a test has good test-retest reliability. But that is not the case. High reliability just means that the rank ordering of a group of children will be stable across testing occasions. But reliability indices are insensitive to absolute level of performance. So if we took a group of children's scores at time 1 and gave each of them a score 10 points higher at time 2, then we would have perfect reliability because their rank ordering would be the same. Reliability is, however, an important factor when considering the last topic, regression toward the mean.
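The point about reliability can be checked in a few lines. This is only a sketch: the eight time 1 scores are made-up numbers, and `pearson_r` is a small helper written for this illustration rather than anything from a test manual.

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical time 1 scores for eight children (invented for illustration).
time1 = [82, 91, 95, 100, 104, 108, 113, 120]
# Every child scores exactly 10 points higher at time 2: a pure practice effect.
time2 = [score + 10 for score in time1]

reliability = pearson_r(time1, time2)
mean_gain = mean(time2) - mean(time1)

print(f"test-retest correlation: {reliability:.2f}")
print(f"mean gain from time 1 to time 2: {mean_gain:.1f}")
```

The correlation stays at its maximum even though every child has gained 10 points, so a high reliability coefficient tells us nothing about whether the group mean has shifted.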
Regression toward the mean
Regression toward the mean is a horribly difficult concept to understand, even for those with statistical training. In their masterly introduction to the subject, Campbell and Kenny noted that regression to the mean has confused many people, including Nobel prize winners. More specifically, they stated that: “Social scientists incorrectly estimate the effects of ameliorative interventions ... and snake-oil peddlers earn a healthy living all because our intuition fails when trying to comprehend regression toward the mean”.
Perhaps the easiest way to get a grasp of what it entails is to suppose we had a group of 10 people and asked them each to throw a dice 10 times and total the score. Let's then divide them into the 5 people who got the lowest scores and the 5 who got the highest scores and repeat the experiment. What do we expect? Well, assuming we don't believe that anything other than chance affects scores (no supernatural forces or 'winning streaks'), we'd expect the average score of the low-scorers to improve, and the average score of the high scorers to decline. This is because a person's first total tells us nothing about their second: on any one set of throws, everyone's expected total is the same overall average, whichever half they were in the first time. So that's the simplest case, when we have a score that is determined only by chance.
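The dice experiment is easy to simulate. The sketch below follows the description above (10 people, 10 throws each, split into the 5 lowest and 5 highest totals), with two additions of my own for illustration: the whole experiment is repeated 2000 times so that the averages are stable, and the random seed is fixed arbitrarily so the run is repeatable.

```python
import random
from statistics import mean

def total_of_throws(rng, n_throws=10):
    """Total score from throwing one fair six-sided dice n_throws times."""
    return sum(rng.randint(1, 6) for _ in range(n_throws))

def run_once(rng, n_people=10):
    """One experiment: everyone throws twice; split on the FIRST total only."""
    first = [total_of_throws(rng) for _ in range(n_people)]
    second = [total_of_throws(rng) for _ in range(n_people)]
    order = sorted(range(n_people), key=lambda i: first[i])
    low, high = order[: n_people // 2], order[n_people // 2:]
    return (
        mean(first[i] for i in low), mean(second[i] for i in low),
        mean(first[i] for i in high), mean(second[i] for i in high),
    )

rng = random.Random(1)   # arbitrary seed, for repeatability only
n_repeats = 2000         # assumption: enough repeats to smooth out noise
results = [run_once(rng) for _ in range(n_repeats)]
low_first, low_second, high_first, high_second = (mean(col) for col in zip(*results))

print(f"low scorers:  first {low_first:.1f}, second {low_second:.1f}")
print(f"high scorers: first {high_first:.1f}, second {high_second:.1f}")
```

On the second set of throws both groups should end up close to 35, the expected total for 10 throws of a fair dice, so the low scorers appear to "improve" and the high scorers to "decline" although nothing but chance is at work.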
Cognitive test scores are interesting here because they are typically thought of as comprising two parts: a 'true' score, which reflects how good you really are at whatever is being measured, and an 'error' score, which reflects random influences. Suppose, for instance, that you test a child's reading ability. In practice, the child is a very good reader, in the top 10% for her age, but the exact number of words she gets right will depend on all sorts of things: the particular words selected for the test (she may know 'catacomb' but not 'pharynx'), the mood she is in on the day, whether she is lucky or unlucky in guessing words she is uncertain about. All these factors would be implicated in the 'error' score, which is treated just like an additional chance factor or throw of the dice affecting a test score. A good test is mostly determined by the 'true' score, with only a very small contribution of an 'error' score, and we can identify it by the fact that children will be ranked by the test in a very similar way from one test occasion to the next, i.e. there will be good test-retest reliability. In other words, the correlation from time 1 to time 2 will be high.
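To make the true-score-plus-error idea concrete, here is a minimal simulation. It assumes the standard classical test theory set-up, in which the error is independent of the true score and is drawn afresh on each occasion; under that assumption the test-retest correlation should be close to the proportion of score variance that is due to true scores. The standard deviations and the sample size are arbitrary values chosen for illustration.

```python
import random
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(7)   # arbitrary seed, for repeatability only
TRUE_SD = 12.0           # assumed spread of children's true ability
ERROR_SD = 6.0           # assumed spread of the chance ('error') component
N_CHILDREN = 20000       # assumed sample size, large so estimates are stable

# Each child has one fixed true score and a fresh error on each test occasion.
true_scores = [rng.gauss(100, TRUE_SD) for _ in range(N_CHILDREN)]
time1 = [t + rng.gauss(0, ERROR_SD) for t in true_scores]
time2 = [t + rng.gauss(0, ERROR_SD) for t in true_scores]

expected_reliability = TRUE_SD ** 2 / (TRUE_SD ** 2 + ERROR_SD ** 2)
observed_reliability = pearson_r(time1, time2)

print(f"expected reliability: {expected_reliability:.2f}")
print(f"observed test-retest correlation: {observed_reliability:.2f}")
```

Increasing `ERROR_SD` relative to `TRUE_SD` lowers the correlation, which is the sense in which a test with a large error component is an unreliable one.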
[Figure: Simulated test scores for 21 children on tests varying in reliability]
Now suppose we select children because they have a particularly low score at time 1. Insofar as chance contributes to their scores, then at time 2, we would expect the average score of such a group to improve, because chance pushes the group average towards the overall mean score. The left-hand panel shows the situation when reliability (i.e., correlation between time 1 and time 2 scores) is zero, so scores are determined just by chance, like throwing a dice. The mean scores for the blue, purple and red cases at time 1 are, by definition, different (they are selected to be low, medium and high scorers). But at time 2 they are all equivalent. The upshot is that the mean at time 2 for those with initial low scores (blue) goes up, whereas for those that start out with high scores, the time 2 mean comes down.
The middle and right-hand panels show a more realistic situation, where the test score is a mixture of true score and error. With very high reliability (right-hand panel) the effect of regression to the mean is small, but with medium reliability (middle panel) it is detectable by eye even for this very small sample.
The key point here is that if we select individuals on the basis of low scores on a test (as is often done, e.g. when identifying children with poor reading scores for a study of a dyslexia treatment), then, unless we have a highly reliable test with a negligible error term, the expectation is that the group's average score will improve on a second test session, for purely statistical reasons. In general, psychometric tests are designed to have reasonable reliability, but this varies from test to test and is seldom higher than .75-.8.
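The dependence on reliability can be sketched with a simulation. Everything numerical here is an assumption made for illustration: scores are scaled to a mean of 100 and SD of 15, the 'poor scorers' are the bottom 20% at time 1, and the four reliability values are arbitrary. No treatment and no practice effect is included, so any gain is purely statistical.

```python
import random
from statistics import mean

def simulate_gain(reliability, n_children=20000, cutoff_fraction=0.2, seed=0):
    """Mean time 2 minus time 1 score for children selected as low at time 1."""
    rng = random.Random(seed)
    # Split a total SD of 15 into true-score and error parts so that
    # true variance / total variance equals the requested reliability.
    true_sd = 15 * reliability ** 0.5
    error_sd = 15 * (1 - reliability) ** 0.5
    children = []
    for _ in range(n_children):
        true_score = rng.gauss(100, true_sd)
        time1 = true_score + rng.gauss(0, error_sd)
        time2 = true_score + rng.gauss(0, error_sd)
        children.append((time1, time2))
    children.sort()  # lowest time 1 scores first
    selected = children[: int(n_children * cutoff_fraction)]
    return mean(t2 for _, t2 in selected) - mean(t1 for t1, _ in selected)

gains = {r: simulate_gain(r) for r in (0.0, 0.5, 0.8, 0.95)}
for r, gain in gains.items():
    print(f"reliability {r:.2f}: selected group gains {gain:.1f} points")
```

The selected group gains points at every reliability level, and the gain shrinks as reliability rises, which is why a test with a reliability of .75-.8 still leaves room for a noticeable spurious improvement.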
So regression to the mean is a real issue in longitudinal studies. It is yet another reason why scores will change over time. Zhang and Tomblin (2003) noted that we can overcome this problem by using different tests to select children for a study and to measure their improvement. Or we can allow for regression to the mean if our study includes a control group, who will be subject to the same regression to the mean as the intervention group.
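A simulation shows why the control group does the job. This is a sketch under stated assumptions, not a model of any real study: children are selected for low scores at time 1, randomly split into two groups, and the 'treatment' is given an effect of exactly zero. The reliability, the size of the practice effect and the group sizes are all invented values.

```python
import random
from statistics import mean

rng = random.Random(42)   # arbitrary seed, for repeatability only
RELIABILITY = 0.7         # assumed test-retest reliability
PRACTICE_GAIN = 3.0       # assumed points gained just from re-testing
TREATMENT_EFFECT = 0.0    # the intervention does nothing at all

true_sd = 15 * RELIABILITY ** 0.5
error_sd = 15 * (1 - RELIABILITY) ** 0.5

def observed(true_score):
    """One test administration: true score plus fresh random error."""
    return true_score + rng.gauss(0, error_sd)

# Screen a large population and keep the lowest 20% at time 1.
population = [rng.gauss(100, true_sd) for _ in range(20000)]
screened = sorted((observed(t), t) for t in population)
poor_scorers = screened[:4000]

# Random allocation to treatment or control.
rng.shuffle(poor_scorers)
treated, control = poor_scorers[:2000], poor_scorers[2000:]

def mean_gain(group, extra):
    """Mean time 2 minus time 1 score; time 2 includes practice plus any extra."""
    return mean(observed(t) + PRACTICE_GAIN + extra - t1 for t1, t in group)

treated_gain = mean_gain(treated, TREATMENT_EFFECT)
control_gain = mean_gain(control, 0.0)
difference = treated_gain - control_gain

print(f"treated group gain: {treated_gain:.1f}")
print(f"control group gain: {control_gain:.1f}")
print(f"difference (treated - control): {difference:.1f}")
```

Looked at alone, the treated group shows a healthy gain. Set against the controls, who gain about as much from regression to the mean and practice, the difference is close to zero, which is the correct verdict on a treatment that does nothing.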
So, if you started reading this blog thinking that people who demand control groups for intervention studies were just a kind of methodological Stasi whose obsession with RCTs was misplaced, I hope I have convinced you that the concerns are real. There are just so many ways that we can mislead ourselves into thinking that an intervention has led to improvement, when in fact it is making no difference. It is depressingly hard to demonstrate positive effects of treatment, but the only way to do it is by using properly designed studies that can overcome the biases I have noted here by use of appropriate controls.
Additional resources on this topic can be found on my website; see especially:
7th BDA International Conference – 2008: Dyslexia: Making Links
2nd UK Paediatric Neuropsychology Symposium
DysTalk online (video)
BISHOP, D. V. M. 2008. Criteria for evaluating behavioural interventions for neurodevelopmental disorders (Letter). Journal of Paediatrics and Child Health, 44, 520-521.