Wednesday, 22 December 2010

“Neuroprognosis” in dyslexia

Every week seems to bring a new finding about the brains of people with various neurodevelopmental disorders. The problem is that the methods get ever more technical, and so the scientific papers get harder and harder to follow, even if, like me, you have some background in neuropsychology.  I don’t normally blog about specific papers, but various people have asked what I think of an article that was published this week by Hoeft et al in Proceedings of the National Academy of Sciences, and so I thought I’d have a shot at (a) summarising what they found in understandable language, and (b) giving my personal evaluation of the study. The bottom line is that the paper seems methodologically sound, but I’m nevertheless sceptical because (a) I’m always sceptical, and (b) there are some aspects of the results that just seem a bit hard to make sense of.

What question did the researchers ask?

Fig 1: location of IFG
The team, headed by researchers from Stanford University School of Medicine, took as their starting point two prior observations.
First, previous studies have found that children with dyslexia look different from normal readers when they do reading tasks in a brain scanner. As well as showing under-activation of regions normally involved in reading, dyslexics often show over-activation in the frontal lobes, specifically in the inferior frontal gyri (IFG) (Figure 1).
The researchers were interested in the idea that this IFG activation could be a sign that dyslexics were using different brain regions to compensate for their dyslexia. They reasoned that, if so, then the amount of IFG activity observed on one occasion might predict the amount of reading improvement at a later occasion.
So the specific question was: Does greater involvement of the IFG in reading predict future long-term gains in reading for children with dyslexia?

Who did they study?
The main focus was on 25 teenagers with dyslexia, whose average age was 14 years at the start of the study. The standard deviation was 1.96 years, indicating that most would have been 12 to 16 years old.  There were 12 boys and 13 girls. They were followed up 2.5 years later, at which point they were subdivided, according to how much progress they’d made on one of the reading measures, into a group of 12 with ‘no-gain’ and 13 with ‘reading-gain’.

The criteria for dyslexia were that, at time 1, (a) performance was in the bottom 25% for age on a composite measure based on two subtests of the Test of Word Reading Efficiency (TOWRE), and (b) performance on a nonverbal IQ subtest (WASI Matrices) was within normal limits (within 1 SD of average).  The TOWRE is a speeded reading test with two parts: reading of real words, and reading of made-up words – the latter subtest picks up difficulties with converting letters into sounds. The average scores are given in the Supporting Materials, and confirm that these children had a mean nonverbal ability score of 103 and a mean TOWRE score of 80. This is well in line with how dyslexia is often defined, and confirms they were children of normal ability with significant reading difficulties.

In addition, a ‘control’ group of normal readers was recruited, but they don’t feature much in the paper. The aim of including them was to see whether the same brain measures that predicted reading improvement in dyslexics would also predict reading improvement in normal readers.  However, the control children were rather above average in reading to start with, and, perhaps not surprisingly, they did not show any improvement over time beyond that which you’d expect for their age.

How was reading measured?

Children were given a fairly large battery of tests of reading, spelling and related skills, in addition to the TOWRE, which had been used to diagnose dyslexia. These included measures of reading accuracy (how many words are read correctly from a word list), reading speed, and reading comprehension (how far the child understands what is read). The key measure used to evaluate reading improvement over time was the Word Identification subtest (WID) from the Woodcock Reading Mastery Test.

It is important to realise that all test scores are shown as age-scaled scores. This allows us to ignore the child’s age, as the score just indicates how good or bad the child is relative to others of the same age. For most measures, the scaling is set so that 100 indicates an average score, with standard deviation (SD) of 15. You can tell how abnormal a score is by seeing how many SDs it is from the mean; around 16% of children get a score of 85 or less (1 SD below average), but only 2-3% score 70 or less (2 SD below average).  At Time 1, the average scores of the dyslexics were mostly in the high 70s to mid 80s, confirming that these children are doing poorly for their age.
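For readers who like to see the arithmetic, here is a minimal sketch of how those percentages fall out of the normal distribution, using only Python’s standard library (the variable name is my own, not from the paper):

```python
from statistics import NormalDist

# Age-scaled scores: population mean 100, SD 15.
scores = NormalDist(mu=100, sigma=15)

# Proportion of children scoring at or below 85 (1 SD below the mean),
# and at or below 70 (2 SD below the mean).
print(round(scores.cdf(85), 3))  # ~0.159, i.e. about 16%
print(round(scores.cdf(70), 3))  # ~0.023, i.e. roughly 2-3%
```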

When using age-scaled scores, the expectation is that, if a child doesn’t get better or worse relative to other children over time, then the score will stay the same. So a static score does not mean the child has learned nothing: rather they have just not changed their position relative to other children in terms of reading ability.

Another small point: scaled scores can be transformed so that, for instance, instead of being based on a population average of 100 and SD of 15, the average is set at 10 and the SD at 3. The measures used in this study varied in the scaling they used, but I transformed them so they are all on the same scale: average 100, SD 15. This makes it easier to compare effect sizes across different measures (see below).
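The transformation itself is simple: convert the score to a z-score on its own scale, then express that z-score on the new scale. A minimal sketch (the function name is my own):

```python
def rescale(score, old_mean, old_sd, new_mean=100.0, new_sd=15.0):
    """Re-express a scaled score on a new scale via its z-score."""
    z = (score - old_mean) / old_sd
    return new_mean + z * new_sd

# A subtest score of 7 on a mean-10, SD-3 scale is 1 SD below average,
# so it maps to 85 on the mean-100, SD-15 scale.
print(rescale(7, old_mean=10, old_sd=3))  # 85.0
```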

How was brain activity measured?
Functional magnetic resonance imaging (fMRI) was used to measure brain activity while the child did a reading task in a brain scanner at Time 1. The Wikipedia account of fMRI gives a pretty good introduction for the non-specialist, though my readers may be able to recommend better sources.  The reading task involved judging whether two written words rhymed, e.g. bait-gate (YES) or price-miss (NO). Brain activity was also measured during rest periods, when no task was presented, and this was subtracted from the activity during the rhyme task. This is a standard procedure in fMRI that allows one to see what activation is specifically associated with task performance. Activity is measured across the whole brain, which is subdivided into cubes measuring 2 x 2 x 2 mm (voxels). For each voxel, a measure indicates the amount of activation in that region. There are thousands of voxels, and so a huge amount of data is generated for each person.

The researchers also did another kind of brain imaging measurement, diffusion tensor imaging (DTI). This measures connectivity between different brain regions, and reflects aspects of underlying brain structure. The DTI results are of some interest, but not a critical part of the paper and I won’t say more about them here.

No brain imaging was done at Time 2 (or if it was it was not reported). This was because the goal of the study was to see whether imaging at one point in time could predict outcome later on.

How were the data analysed?

The dyslexics were subdivided into two groups, using a median split based on improvement on the WID test. In other words, those showing the least improvement formed one group (with 12 children) and those with most improvement formed the other (13 children).

The aim, then, was to see how far (a) behavioural measures, such as initial reading test scores, or (b) fMRI results were able to predict which group children came from.

Readers may have come across a method that is often used to do this kind of classification, known as discriminant function analysis. The basic logic is that you take a bunch of measures, and allocate a weighting to each measure according to how well it distinguishes the two groups. So if the measure had the same average score for both groups, the weighting would be zero, but if it was excellent at distinguishing them, the weighting might be 1.0. You then add together all the measures, multiplied by their weightings, with the aim of getting a total score that will do the best possible job at distinguishing groups.  You can then use this total score to predict, for each person, which group they belong to. This way you can tell how good the prediction is, e.g. what percentage of people are accurately classified.
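That weighted-sum logic can be sketched in a few lines of code. This is purely illustrative: the measures, weights and cut-off are all invented, not taken from the paper:

```python
# Toy sketch of the weighted-sum logic behind discriminant classification.
# Measure values and weights are invented for illustration.
measures = {"reading_speed": 82.0, "phoneme_awareness": 90.0, "nonverbal_iq": 103.0}
weights  = {"reading_speed": 0.9,  "phoneme_awareness": 0.4,  "nonverbal_iq": 0.0}

# Weighted sum of all measures: a zero weight means the measure does not
# help separate the groups, so it contributes nothing to the total.
total = sum(weights[m] * measures[m] for m in measures)

threshold = 100.0  # hypothetical cut-off separating the two groups
predicted_group = "reading-gain" if total > threshold else "no-gain"
print(round(total, 1), predicted_group)
```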

The extension of this kind of logic to brain imaging is known as multivariate pattern analysis (MVPA). It is nicely explained, with diagrams, on Neuroskeptic’s blog.

It has long been recognised that there’s a potential problem with this approach, as it can give you spuriously good predictions, because the method will capitalise on chance fluctuations in the data that are not really meaningful. This is known as ‘over-fitting’. One way of getting around this is to use the leave-one-out method.  You repeatedly run the analysis, leaving out data from one participant, and then see if you could predict that person’s group status from the function derived from all the other participants. This is what was done in this study, and it is an accepted method for protecting against spurious findings.
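To make the leave-one-out idea concrete, here is a minimal sketch on invented one-dimensional data, where each left-out case is classified by whichever group mean it is nearer to, with the means computed from the remaining cases only:

```python
# Minimal leave-one-out cross-validation on toy 1-D data (values invented).
data = [(3.1, "A"), (2.8, "A"), (3.5, "A"), (6.0, "B"), (5.7, "B"), (6.4, "B")]

correct = 0
for i, (x, label) in enumerate(data):
    # Fit using everyone except participant i.
    rest = data[:i] + data[i + 1:]
    mean = {g: sum(v for v, lab in rest if lab == g) /
               sum(1 for v, lab in rest if lab == g) for g in ("A", "B")}
    # Classify the held-out participant by the nearer group mean.
    predicted = min(("A", "B"), key=lambda g: abs(x - mean[g]))
    correct += predicted == label

print(f"leave-one-out accuracy: {correct}/{len(data)}")
```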

Another way of checking that the results aren’t invalid is to directly estimate how likely it would be to get this result if you just had random data. To do this, you assign all your participants a new group code that is entirely arbitrary, using random numbers. So every person in the study has a 50% chance of being in group A or group B. You then re-run the analysis and see whether you can predict whether a person is an A or B on the basis of the same brain data. If you can, this would indicate you are in trouble, as the groups you have put in to the analysis are arbitrary. Typically, one re-runs this kind of arbitrary analysis many times, in what is called a permutation analysis; if you do it enough times, occasionally you will get a good classification result by chance, but that does not matter, so long as it happens only very rarely, say in less than 1 in 1000 runs.  For readers with statistical training, we can say that the permutation analysis is a nice way of getting a direct estimate of the p-value associated with the analysis done with the original groups.
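A permutation test is easy to sketch with invented data. Here the "analysis" is just the difference between group means, and the p-value is the proportion of random relabellings that separate the scores as well as the real labels do:

```python
import random

# Toy permutation test (scores and labels invented for illustration).
random.seed(1)  # for reproducibility
scores = [55, 60, 58, 62, 80, 85, 83, 88]
labels = ["A"] * 4 + ["B"] * 4

def mean_diff(scores, labels):
    a = [s for s, lab in zip(scores, labels) if lab == "A"]
    b = [s for s, lab in zip(scores, labels) if lab == "B"]
    return abs(sum(b) / len(b) - sum(a) / len(a))

observed = mean_diff(scores, labels)

# How often does a random relabelling do as well as the real one?
n_runs, extreme = 10000, 0
for _ in range(n_runs):
    shuffled = random.sample(labels, len(labels))
    if mean_diff(scores, shuffled) >= observed:
        extreme += 1

print(f"permutation p-value ~ {extreme / n_runs}")  # a small p means the real grouping is unlikely to be chance
```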

So what did they find?

Fig 2: discriminant function (y-axis) vs reading gain

The classification accuracy of the method using the whole-brain fMRI data was reported as an impressive 92%, which was well above chance.  Also, the score on the function used to separate groups was correlated .73 with the amount of reading improvement. The brain regions that contributed most to the classification included the right IFG and left prefrontal cortex, where greater improvement was associated with higher activation. The left parietotemporal region showed the opposite pattern, with greater improvement in those who showed less activation.

So could the researchers have saved themselves a lot of time and got the same result if they’d just used the time 1 behavioural data as predictors? They argue not. The prediction from the behavioural measures was highly significant, but not as strong, with accuracy reported (figure S1 of Supporting Materials) as less than 60%.  Also, once the brain measures had been entered into the equation, adding behavioural measures did not improve the prediction.

And what conclusions did they draw?

  • Variation in brain function predicts reading improvement in children with dyslexia. In particular, activation of the right IFG during a reading task predicts improvement. However, taking a single brain region alone does not give as good a prediction as combining information across the whole brain.
  • Brain measures are better than behavioural measures at predicting future gains in reading.
  • This suggests that children with dyslexia can use the right IFG to compensate for their reading difficulties.
  • Dyslexics learn to read by using different neural mechanisms than those used by normal readers.

Did they note any limitations of the study?
  • It’s possible that different behavioural measures might have done a better job in predicting outcomes.
  • It’s also possible that a different kind of brain activation task could have given different results.
  • Some children had received remediation during the course of the study: but this didn’t affect their outcomes. (Bad news for those doing the remediation!).
  • Children varied in IQ, age, etc, but this didn’t differentiate those who improved and those who didn’t.

Questions I have about the study

Just how good is the prediction from the brain classifier?

Figure 2 (above) shows on the y-axis the discriminant function (the hyperplane), which is the weighted sum of voxels that does the best job of distinguishing groups. The x-axis shows the reading gain. As you can see clearly, there are two individuals who fall in the lower right quadrant, i.e. they have low hyperplane scores, and so would be predicted to be no-gain cases, but actually they make positive gains. The figure of 92% appears to come by treating these as cases where prediction failed, i.e. accurate prediction for the remainder gives 23/25 = 92% correct.
Fig 3: Vertical line dividing groups moved

However, this is not quite what the authors said they did.  They divided the sample into two equal-sized groups (or as equal as you can get with an odd number) in order to do the analysis, which means that the ‘no-gain’ group contains four additional cases, and that the dividing line for quadrants needs to be moved to the right, as shown in Figure 3.  Once again, accurate prediction occurs for those who fall in the top right quadrant, or the bottom left. Prediction is now rather less good, with 4 cases misclassified (three in the top left quadrant, one in the bottom right), i.e. 21/25 = 84% correct. However, it must be accepted that this is still good prediction.

Why do the reading gain group improve on only some measures?
One odd feature of the data is the rather selective nature of the reading improvement seen in the reading-gain group. Table 1 shows the data, after standardising all measures to a mean of 100, SD 15. The analysis used the WRMT-WID test, which is shown in pink. On this test, and on the other WRMT tests, the reading-gain group do make impressively bigger gains than the no-gain group. But the two groups look very similar on the TOWRE measures, which were used to diagnose dyslexia, and also on the Gray Oral Reading Test (GORT).  Of course, it’s possible that there is something critical about the content of the different tests – the GORT involves passage-reading, and the TOWRE involves speeded reading of lists of items.  But I’d have been a bit more convinced of the generalisability of the findings if the reading improvement in the reading-gain group had been evident across a wider range of measures. (Note that we also should not assume all gain is meaningful: see my earlier blog for explanation of why.)

Why do the control group show similar levels of right IFG activation to dyslexics?
The authors conclude that the involvement of certain brain regions, notably the right IFG, is indicative of an alternative reading strategy to that adopted by typical readers. Yet the control group appear to show as wide a range of activation of this area as the dyslexics, as shown in Figure 4. The authors don’t present statistics on this, but eyeballing the data doesn’t suggest much group difference.
Figure 4: activation in dyslexics (red) and controls (blue)

If involvement of the right IFG improves reading, why don’t dyslexic groups differ at time 1?

This is more of a logical issue than anything else, but it goes like this. Children who improve in reading by time 2 showed a different pattern of brain activation at time 1. The authors argue that right IFG activation predicts better reading. But at time 1, the groups did not differ on the reading measures – or indeed on their performance on the reading task in the scanner. This would be compatible with some kind of ‘sleeper’ effect, whereby the benefits of using the right IFG take time to trickle through. But what makes me uneasy is that this implies the researchers had been lucky enough to just catch the children at the point where they’d started to use the right IFG, but before this had had any beneficial effect.  So I find myself asking what would have happened if they’d started with younger children?

Overall evaluation
This is an interesting attempt to use neuroimaging to throw light on mechanisms behind compensatory changes in brains of people with dyslexia.  The methodology appears very sound and clearly described (albeit highly technical in places). The idea that the IFG is involved in compensation fits with some other studies in the field.

There are, however, a few features of the data that I find a bit difficult to make sense of, and that makes me wonder about generalisability of this result.

Having said that, this kind of study is challenging. It is not easy to do scanning with children, and just collecting and assessing a well-documented sample can take many months. One then has to wait to follow them up more than two years later. The analyses are highly demanding.  I think we should see this as an important step in the direction of understanding brain mechanisms in dyslexia, but it’s far from being conclusive.

Hoeft F, McCandliss BD, Black JM, Gantman A, Zakerani N, Hulme C, Lyytinen H, Whitfield-Gabrieli S, Glover GH, Reiss AL, & Gabrieli JD (2011). Neural systems predicting long-term outcome in dyslexia. Proceedings of the National Academy of Sciences of the United States of America, 108 (1), 361-6 PMID: 21173250


  1. Dorothy, I have a lot of questions about this study. I don't have access to PNAS so I have to ask you.

    I live next to Stanford. I am assuming that the subjects in the study resided near Stanford.

    I am familiar with the school districts surrounding Stanford, the various reading programs they use, and the remediation programs that are available locally (some effective, some not).

    Did the paper discuss what reading instruction the subjects (both dyslexic & non) had had?

    Did the dyslexic subjects have any additional remediation over the course of the study?

    Did the authors discuss why they enrolled older children rather than say 8 year olds?

    Those are a few of the questions I have. I've asked a friend who is an academic for a copy of the paper.

  2. Liz:
    The paper is Open Access, so anyone can get it!

    The researchers noted that some of the participants had additional interventions over and above regular teaching, but these were not controlled, so differed from child to child. Whether or not the child had intervention made no difference to the amount of improvement seen.

    The paper doesn't say where the participants were recruited from.

    I don't know why this age band was selected, but can think of a few possible reasons:
    a) In the Supporting Materials it says that around 3/4 of participants had taken part in a previous study; it's not clear which study this was (a reference number is given but it refers to a British study on adults, so I think that's a typo). But it's likely that the researchers used an existing sample that had already undergone reading assessments, as this is far more efficient than recruiting a whole new sample.
    b) It's much easier to do brain scanning with teenagers than younger children. Not impossible with 8-year-olds, but they have to lie still and comply with instructions. It's also possible that objections might be raised by ethics committees (IRBs) to scanning younger children for research purposes (though most would find it acceptable these days, if done with appropriate safeguards).
    c) With older dyslexic children it's easier to devise a reading task that they'll be able to do in the scanner and that is reasonably challenging.

  3. PS the correct reference for the earlier study that most of the participants came from is stated in the text:
    Hoeft F, et al. (2007) Functional and morphometric brain dissociation between
    dyslexia and reading ability. Proc Natl Acad Sci USA 104:4234–4239.