Sunday 24 June 2012

Causal models of developmental disorders: the perils of correlational data

Experimental psychology depends heavily on statistics, but psychologists don’t always agree about the best ways of analyzing data. Take the following problem:
I have two groups each of 30 children, dyslexics and controls. I give them a test of auditory discrimination and find a significant difference between the groups, with the dyslexic mean being lower. I want to see whether reading ability is related to the auditory task. I compute the correlation between the auditory measure and reading, and find it is .42, which in a sample of 60 cases is significant at the .001 level.
I write up the results, concluding that poor auditory skill is a risk factor for poor reading. But reviewers are critical. So what’s wrong with this?
I’ll deal quickly with two obvious points. First, there is the well-worn phrase that correlation does not equal causation. The correlation could reflect a causal link from auditory deficit to poor reading, but we need also to consider other causal routes, as I’ll illustrate further below. This is an issue about interpretation rather than data analysis.
A second point concerns the need to look at the data rather than just computing the correlation statistic. Correlations can be sensitive to distributional properties of the data and can be heavily influenced by outliers. There are statistical ways of checking for such effects, but a good first step is just plotting a scatterplot to see whether the data look orderly. A tip for students: if your supervisor asks to see your project data, don’t just turn up with numerical output from the analysis: be ready to show some plots.
Figure 1: Fictitious data showing spurious correlation between height and reading ability
A less familiar point concerns the pooling of data across the dyslexic and control groups. Some people have strong views about this, yet, as far as I’m aware, it hasn’t been discussed much in the context of developmental disorders. I therefore felt it would be good to give it an airing on my blog and see what others think.
Let’s start with a fictitious example that illustrates the dangers of pooling data from two groups. Figure 1 is a scatterplot showing the correlation between height and reading ability in groups of 6-year-olds and 10-year-olds. If I pool across groups, I’m likely to see a strong correlation between height and reading ability, whereas within any one age group the correlation is negligible. This is a clear case of spurious correlation, as illustrated in Figure 2. Here the case against pooling is unambiguous, and it's clear that if you look at the correlation within either age band, there is no relationship between reading ability and height.
Figure 2: Model showing how a spurious correlation between height and reading arises because both are affected by age
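The height and reading example can be sketched in a few lines of Python. This is only an illustration: the group means, standard deviations, sample size of 200 per age band and the random seed are made-up assumptions, not the values behind Figure 1. Within each age band, height and reading are generated independently, so any pooled correlation arises purely because both variables increase with age.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def simulate_age_group(rng, n, mean_height, mean_reading):
    """Draw height and reading independently within one age band."""
    heights = [rng.gauss(mean_height, 5.0) for _ in range(n)]
    reading = [rng.gauss(mean_reading, 8.0) for _ in range(n)]
    return heights, reading

rng = random.Random(1)  # arbitrary seed, for reproducibility only
# Hypothetical means: height in cm, reading as a raw test score.
h6, r6 = simulate_age_group(rng, 200, mean_height=115.0, mean_reading=20.0)
h10, r10 = simulate_age_group(rng, 200, mean_height=138.0, mean_reading=60.0)

r_six = pearson_r(h6, r6)                  # within 6-year-olds
r_ten = pearson_r(h10, r10)                # within 10-year-olds
r_pooled = pearson_r(h6 + h10, r6 + r10)   # both age bands pooled

print(f"6-year-olds:  r = {r_six:.2f}")
print(f"10-year-olds: r = {r_ten:.2f}")
print(f"pooled:       r = {r_pooled:.2f}")
```

With these settings the within-group correlations hover around zero while the pooled correlation is large, which is the pattern the model in Figure 2 predicts.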

Examples such as this have led some people to argue that you shouldn’t pool data in studies such as the dyslexic vs. control example. Or, to be more precise, the recommendation is usually that you should check the correlations within each group, and avoid pooling if they don’t look consistent with the pooled correlation. I’ve always been a bit uneasy about this logic and have been giving some thought as to why.
First, there is the simple issue of power. If you halve your sample size, then you increase the standard error of estimate for a correlation coefficient, making it more likely that it will be nonsignificant. Figure 3 shows the 95% confidence intervals around a correlation of .5 depending on sample size, and you can readily see that the intervals are much wider for small samples than for large ones. There's a nice website by Stan Brown that gives relevant formulae in Excel.
Figure 3: 95% confidence interval around estimated correlation of .5, with different sample sizes
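For readers who would rather compute these intervals than read them off a figure, here is a minimal Python sketch. It uses the standard Fisher z approximation, which may not be the exact method used to draw Figure 3; the function name and the particular sample sizes in the loop are just illustrative choices.

```python
import math

def correlation_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for a Pearson r via Fisher's z transformation.

    z_crit=1.96 gives a two-sided 95% interval; n must be greater than 3.
    """
    z = math.atanh(r)                # Fisher transform of r
    se = 1.0 / math.sqrt(n - 3)      # approximate standard error of z
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

for n in (15, 30, 60, 200):
    lo, hi = correlation_ci(0.5, n)
    print(f"N = {n:3d}: 95% CI [{lo:.2f}, {hi:.2f}]")
```

The interval is not symmetric around r, because the back-transformation from z to r compresses values near 1.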

A less obvious point is that the data in Figure 1 look analogous to the dyslexic vs. control example, but there is an important difference. We know where we are with age: it is unambiguous to define and measure. But dyslexia is more tricky. Suppose we substitute dyslexia for age, and auditory processing for height, in the model of spurious correlation in Figure 2. We have a problem: there is no independent diagnostic test for dyslexia. It is actually defined in terms of one of our correlated variables, reading ability. Thus, the criterion used to allocate children to groups is not independent of the measures that are entered into the correlation. This creates distortions in within-group correlations, as follows.
If we define our groups in terms of their scores on one variable, we effectively restrict the range of values obtained by each group, and this lowers the correlation. Furthermore, the restriction will be less for the controls than for the dyslexic group, who are typically selected as scoring below a low cutoff, such as one SD below the mean. Figure 4 shows simulated data for two groups selected from a population where the true correlation between variables A and B is .5. Thirty individuals (dyslexics) are selected as scoring more than 1 SD below average on variable A, and another 30 (controls) are selected as scoring above this level.
Figure 4: Correlations obtained in samples of dyslexic (red) and controls (blue) for 20 runs of simulation with N = 30 per group.
The Figure shows correlations from twenty runs of this simulation. For both groups, the average correlation is less than the true value of .5, because of the restricted range of scores on variable A. However, because the range is more restricted for the dyslexic group, their average correlation is lower than that of the controls. A correlation of about .36 corresponds to the two-tailed .05 significance level for a sample of this size (N = 30), and the controls are more likely to exceed this value than the dyslexic group. All these results are just artefacts of the way in which the groups were selected: both groups come from the same population where r = .5.
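A rough Python version of this kind of simulation is sketched below. The population correlation of .5, the cutoff of 1 SD below the mean and the group size of 30 come from the description above; the use of standard normal scores, the seed, the helper names and the 2000 runs (rather than the 20 shown in Figure 4, so that the averages are stable) are my own assumptions.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def sample_group(rng, rho, n, keep):
    """Rejection-sample n (A, B) pairs whose A score satisfies keep(a)."""
    pairs = []
    while len(pairs) < n:
        a = rng.gauss(0.0, 1.0)
        # B correlates rho with A in the unselected population.
        b = rho * a + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        if keep(a):
            pairs.append((a, b))
    return pairs

def mean_within_group_r(rng, rho, n, keep, runs):
    """Average the within-group correlation over repeated samples."""
    total = 0.0
    for _ in range(runs):
        pairs = sample_group(rng, rho, n, keep)
        total += pearson_r([a for a, _ in pairs], [b for _, b in pairs])
    return total / runs

rng = random.Random(2012)   # arbitrary seed
RHO, N, CUTOFF = 0.5, 30, -1.0
RUNS = 2000                 # assumption: many more runs than in Figure 4

mean_r_dyslexic = mean_within_group_r(rng, RHO, N, lambda a: a < CUTOFF, RUNS)
mean_r_control = mean_within_group_r(rng, RHO, N, lambda a: a >= CUTOFF, RUNS)

print(f"mean r, dyslexic group: {mean_r_dyslexic:.2f}")
print(f"mean r, control group:  {mean_r_control:.2f}")
```

Both averages fall below .5, and the group selected from the narrow low tail falls furthest, even though every pair was drawn from the same population.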
What can we conclude from all this? Well, the bottom line is that if we find non-significant within-group correlations this does not necessarily invalidate a causal model. The simulation shows that we may find that within-group correlations look quite different in dyslexic and control groups, even if they come from a common distribution.
So where does this leave us?! It would seem that in general, within-group data are unlikely to help us distinguish between causal and non-causal models: they may be compatible with both. So how should we proceed?
There’s no simple solution, but here are some suggestions:
1. If considering correlational data, always report the 95% confidence interval. Usually people (including me!) just report the correlation coefficient, degrees of freedom and p-value. It’s so uncommon to add confidence intervals that I suspect most psychologists don’t know how to compute them. Do not assume that two correlations are meaningfully different just because one is significant and the other is not. This website can be used to test for the significance of the difference between correlations. I would, however, advise against interpreting such a comparison if your data are affected by the kinds of restriction of range discussed above.
2. Study the relationship between key variables in a large unselected sample covering a wide range of scores. This is a more tractable solution, but is seldom done. Typically, people recruit an equivalent number of cases and controls, with a sample size that is inadequate for getting a precise estimate of a correlation in either group. If your underlying model predicts a linear relationship between, say, auditory processing and phonological awareness, then with a sample of 200 cases, a fairly precise estimate can be obtained. With this approach, one can also identify whether the relationship is linear.
3. More generally, it’s important to be explicit about what models you are testing. For instance, I’ve identified four underlying models of the relationship between auditory deficit and language impairment, as shown in Figure 5. In general, correlational data on these two skills won’t distinguish between these models, but specifying the alternatives may help you think of other data that could be informative. 
Figure 5: Models of causal relationships underlying observed correlation between auditory deficit and language impairment
For instance:
  • We found that, when studying heritable conditions, it is useful to include data on parents or siblings. Models differ in predictions about how measures of genetic risk - for instance, family history, or presence of specific genetic variants - relate to A (auditory deficit) and B (language impairment) in the child. This approach is illustrated in this paper. Interestingly, we found that the causal model that is often implicitly assumed, which we termed the Endophenotype model, did not fit the data, but nor did the spurious correlation model, which corresponds here to the Pleiotropy model.
  • There may be other groups that can be informative: for instance, if you think auditory deficits are key in causing language problems, it may be worth including children with hearing loss in a study - see this paper for an example of this approach using converging evidence.
  • Longitudinal data can help distinguish whether A causes B or B causes A.
  • Training studies are particularly powerful, in allowing one to manipulate A and see if it changes B.
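Returning to suggestion 1: the point that "one significant, one not" does not imply a real difference can be checked with the standard Fisher z test for two independent correlations. The sketch below is a generic version of that test, not necessarily what the website mentioned above implements, and the two correlations fed into it are hypothetical values chosen for illustration. The caveat about restriction of range applies to this comparison too.

```python
import math

def compare_independent_correlations(r1, n1, r2, n2):
    """Fisher z test for the difference between two independent correlations.

    Returns the z statistic and a two-tailed p-value.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-tailed normal p-value
    return z, p

# Hypothetical example: with N = 30 per group, r = .40 is significant at
# the two-tailed .05 level and r = .25 is not.
z, p = compare_independent_correlations(0.40, 30, 0.25, 30)
print(f"z = {z:.2f}, two-tailed p = {p:.2f}")
```

Here the difference between the two correlations is nowhere near significant, even though only one of them passes the .05 threshold on its own.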
So what’s the bottom line? In general, correlational data from small samples of clinical and control groups are inadequate for testing causal models. They can lead to type I errors, where pooling data leads to a spurious association between variables, but also to type II errors, where a genuine association is discounted because it isn’t evident within subject groups. For the field to move forward, we need to go beyond correlational data.

P.S. 9th July 2012
I've written a little tutorial on simulating data using R to illustrate some of these points. No prior knowledge of R required. see:

Bishop DV, Hardiman MJ, & Barry JG (2012). Auditory deficit as a consequence rather than endophenotype of specific language impairment: electrophysiological evidence. PLoS ONE, 7(5). PMID: 22662112

If you liked this post, you may also be interested in my other posts on statistical topics:
Getting genetic effect sizes in perspective
The joys of inventing data
A short nerdy post about the use of percentiles
The difference between p < .05 and a screening test


  1. Great post.

    I guess what it really comes down to is what the extra correlational analysis is supposed to be conveying. If the correlation is entirely driven by group differences then you're not really adding any new information, beyond what you've already figured out from a t-test.

    We get this all the time in autism research, where someone will report that, not only were there group differences on test X, but X also correlated with scores on the ADOS/SRS/SCQ that necessarily differ between groups. Often the conclusion is that X determines autism *severity*. But I'd argue that here you really should be looking only within the autism group.

    Alternatively, I wonder whether there might be justification in taking a stepwise regression approach. So put in group first and then see if adding your continuous severity measure as a second step significantly improves the fit of the regression model.

    But I guess the big question is whether the correlational analysis is telling you something that isn't obvious from group comparison. For example, in a study of dyslexia, you might find group differences in Y but correlational analyses tell you that Y correlates with word reading but not nonword reading. That might be theoretically interesting and there'd be a case that you could pool the data.

  2. thanks Jon.
    I think it would be a big mistake to put Group into a regression, unless you have an indicator of group that is not logically dependent on the variables you are correlating.
    For conditions like SLI and dyslexia, which are defined in terms of language/reading measures, it would be seriously confusing to put the diagnostic category first in stepwise regression, if the correlated variables included a reading/language-related measure. It would amount to seeing if auditory/reading were correlated after taking out the effect of reading....I guess I could simulate that to confirm my intuition here, but it seems seriously confounded.
    It would be different if you had an independent marker, such as a genetic variant, which could then be entered in an analysis in a similar way that Age would be in my example above. It was thinking along the lines of needing something that was logically distinct from our language measures that got me onto using data from relatives.
    When you say "what is it telling you" the answer is, if you do have an independent marker of diagnostic status, the predicted pattern of covariances will be different depending on whether you have causality or spurious correlation - or something else.
    Apart from this, what I've tried to argue is that if you find a significant within-group correlation, then that might be meaningful, especially if it survives correction for the multiple tests that are often done in this situation. But we typically have small samples and limited power, and a lack of significant correlation is pretty uninterpretable in that situation. My concern is when people conclude a cross-group correlation must be spurious, just because it's not significant within groups. If you think it's spurious, you need to model the third (causal) factor, I think, and show it explains the results.
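The simulation mentioned in this comment could look something like the Python sketch below. It is a guess at the scenario, not an analysis anyone in this thread has run: A stands for reading and B for the auditory measure, they correlate .5 in the population, "group" is defined by a cutoff of 1 SD below the mean on A, and 2000 cases are sampled per group so that the estimates are stable. The partial correlation of A and B with group removed stands in for what remains once group has been entered first in the regression.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

rng = random.Random(7)                        # arbitrary seed
RHO, CUTOFF, N_PER_GROUP = 0.5, -1.0, 2000    # N_PER_GROUP is an assumption

reading, auditory, group = [], [], []
n_low = n_high = 0
while n_low < N_PER_GROUP or n_high < N_PER_GROUP:
    a = rng.gauss(0.0, 1.0)
    b = RHO * a + math.sqrt(1.0 - RHO ** 2) * rng.gauss(0.0, 1.0)
    is_low = a < CUTOFF                       # "diagnosis" is defined by A itself
    if is_low and n_low < N_PER_GROUP:
        n_low += 1
    elif not is_low and n_high < N_PER_GROUP:
        n_high += 1
    else:
        continue                              # that group is already full
    reading.append(a)
    auditory.append(b)
    group.append(1.0 if is_low else 0.0)

r_pooled = pearson_r(reading, auditory)
r_partial = partial_r(r_pooled,
                      pearson_r(reading, group),
                      pearson_r(auditory, group))

print(f"pooled r(A, B):            {r_pooled:.2f}")
print(f"partial r(A, B | group):   {r_partial:.2f}")
```

Under these assumptions, partialling out group removes a sizeable share of a correlation that is entirely genuine in the population, which is consistent with the worry that the analysis is confounded when group is defined by one of the correlated variables.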

  3. The PLoS One (2012) paper is fascinating but a bit hard for me to interpret because it seems to me that (1) I can’t tell if the kids in the clinical sample have speech sound disorders or specific language impairment; (2) the Language Impairment latent variable is a mix of language and phonological processing measures and (3) the Auditory Deficit latent variable is a mix of auditory and phonological measures. Clearly the endophenotype model does not work if Auditory Deficits are considered to be the endophenotype that causes SLI but maybe it would if Phonological Processing is considered to be the endophenotype that contributes to Speech Sound Disorders and Dyslexia. It makes sense to me that the Phonological Processing endophenotype reflects both an inherited problem with speech perception (not auditory processing) and experience with language (with trade offs between speech perception and vocabulary knowledge as we have shown in our model for children with speech sound disorders). The downstream link from phonological processing to auditory deficits is predicted by Ramus’ model whereby cortical impairments may result in downstream sensory impairments under certain conditions. So ultimately what is required is a melding of the endophenotype and neuroplasticity models with integration of environmental effects. (All of this without getting into the fact that so many “normal” children fail to show significant Ta peaks – I don’t really know how to interpret ERP studies in any case!). Thanx for an interesting Monday morning (Fête Nationale in Québec, better than reading about the «printemps québécois» in all the papers, that’s for sure).

  4. Thanks Susan for taking time to spell out some points needing clarification.
    Re (1) The participants have SLI, i.e. language problems extending beyond, though often encompassing, phonological problems. If there is a difference here in our use of terminology let me know - I'd typically use speech sound disorder to refer to children who make phonological errors in spontaneous expressive speech. Some children in our sample had expressive speech sound problems but most didn't, and poor performance on nonword repetition and NEPSY oromotor skills (which is rather misleadingly named I think) is often found in children with SLI who have a normal phonological repertoire in spontaneous speech. We've also shown those two tests to be the most sensitive to residual language difficulties in parents of children with SLI, who sound entirely normal when you talk to them. The difficulties seem to arise when you ask people to repeat novel phonological forms that are long and complex. I would regard these phonological problems as part of a language impairment.
    (2) The decision to create a latent variable that includes language and phonological measures was based on interrelationships in the data. Yes, logically you can distinguish them, but in this sample, consisting of children receiving special help for language problems, they typically go hand in hand. In response to a referee we did subclassify the types of language difficulty to see if this could predict who had abnormal T-complex, but we did not see any systematic relationships. So we decided to take a composite measure, because the ERP abnormality appeared to be a correlate of rather general language problems, rather than one specific aspect of language.
    (3) The decision to use responses to both tones and syllables in an 'auditory' composite was likewise motivated by the data: the deficit we found in T-complex in SLI did not interact with stimulus type, and the different T-complex measures were intercorrelated.
    I'd be very happy if we could show more orderly relationships between particular linguistic deficits and specific ERP components, but it doesn't tend to come out that way. But it's certainly a problem that the phenotype is so variable. I was really quite surprised that we replicated Shafer's T-complex result because I'd thought SLI was too messy a phenotype to show anything clearcut. Why on earth we get typical-language kids who don't show clear T-complex is also a mystery. It's possible that ERPs are just too messy at the individual level to give reliable categorisation.
    Incorporating environmental measures - would be great, and could be incorporated in modeling, but which ones?

  5. Great post!
    I tend to be cautious about using simple correlations, and comparisons between correlations, when comparing typically developing and disorder groups. This is because the assumptions involved in fitting linear functions to data are too easily rendered implicit with correlations. Instead I explicitly model the linear trajectories involved – between performance and chronological age, between performance and mental age, or between performance on any pair of tasks. Modelling linear trajectories encourages plotting of data and checking of assumptions, and allows for group comparisons that evaluate whether relationships differ in typically developing and disorder groups.

    I developed some fairly rudimentary methods to do this. Conceptually, the methods are similar to standard Analyses of Variance (ANOVA) but instead of testing the difference between group means, one evaluates the difference between the straight lines used to depict the developmental trajectory in each group (technically, the method uses a fully-factorial ANCOVA).

    This approach allows between-subjects, within-subjects, mixed designs, as per ANOVA. You can explicitly test whether task performance develops at a slower rate in the disorder group than control group, or whether task A develops at a faster rate than task B within a given group, or whether the relationship between the development of two tasks differs between disorder and control groups.

    When using ANCOVA, one has to carry out a basic check to see that the groups overlap on the covariate (for the data in Fig. 1, they do not). If you have no reliable relationship in disorder and control groups on their own, but you see one when the groups are combined, in my experience this is most often an issue of statistical power. If you have a relationship in one group but not the other (and not both combined), you should be able to test for the significance of the group difference by an interaction between group and covariate. Sometimes, there is no reliable interaction even though a relationship appears present in one group but not the other. This is sometimes because of unequal variance between the groups, where the greater variability in the no-effect group washes out the relationship in the effect group when the two are combined. Other times, it is because the confidence intervals for the gradients of the two trajectories overlap. I don’t have a problem with reporting within-group effects separately, so long as the comparison was planned at the design stage (i.e., it’s not a fishing expedition). But for the latter case, the conclusion has to be a cautious one: there’s a reliable relationship in one group but not the other, and no reliable difference in the relationship between groups.

    There’s an interesting complication when using linear regression that a flat line and a data cloud report the same effect. Thus you can’t distinguish the case where development has plateaued (doesn’t change over the age range considered) from the situation where there is indeed no relationship between performance and age. I came up with a method to try to distinguish these two cases statistically by rotating the data by 45 degrees and re-testing the relationship – plateaued development should now produce a correlation of 1, but a data cloud will remain a data cloud. Of course, if one shows a flat trajectory, one still has to interpret the result – it could be development at asymptote, or it could be an issue to do with measurement sensitivity, in that one group is at ceiling or floor on the measure.

    When you say, 'correlational data from small samples of clinical and control groups are inadequate for testing causal models', do you think using categorical data gets you out of this problem? (i.e., comparing group means for disorder and control groups across discrete experimental conditions using ANOVA).

  6. Fred - just looking at your comment again and realise I did not answer the question at the end.
    Answer is a clear no! The issue is not about categorical vs continuous data. It's about the fact that a causal model from disorder X to deficit Y is only one of several models that need to be considered, and to disambiguate them you will typically need additional data.

  7. Is there anybody there now? If so, comments on this would be appreciated. Imagine that an adequate sample size of children reveals a strong association (ie, correlation - with intelligence controlled) between reading and, say, visual tracking ability. Is it a more interesting result if the sample is comprised of children with a range of disorders - or more interesting if the sample is restricted only to those diagnosed with Dyslexia?? Richard.

    1. That's an intriguing question, and like most questions, I guess it forces you to think harder about what exactly you are testing. I think that if a correlation only showed up in one group, and not in another, that would be interesting in suggesting some qualitative difference in the affected group. In general, I don't think there are many examples of that, though I can think of cases where skills that are correlated in typical children are not correlated in a clinical group. You see that in some studies of autism, where you seem to get fractionation of things that usually go together.
      But overall, I think it is often a good move to move beyond studies that just contrast one clinical group with a typically-developing control group, to look at a wider range. This can really overturn your ideas: for instance, see Courtenay Norbury's research showing that deficits that are thought to characterise autism actually are correlated with language impairment rather than autistic symptoms.

    2. Dear Deevybee, thank you so much for finding the time to comment on my conundrum. Very much appreciated!
      I must have a look at Courtenay Norbury's paper - Richard