Friday, 23 July 2010

The difference between p < .05 and a screening test

Here's a fairly standard science story. A researcher compares a group of 30 children with autism/dyslexia/depression/chilblains and a matched control group on something: genotype/reactions to pharmaceuticals/behaviour profile/brain activation. A statistically significant difference is found between the groups. The result is published in a high-profile journal.
The media want to know the relevance of this finding for their readers. It is not clear that this will lead to immediate treatment for autism/dyslexia/depression/chilblains, but they do feel justified in saying that the research will be useful in improving screening. After all, we all know that the earlier we can identify something, the easier it is to treat.

What's wrong with this? Well, alas, everything. And before going further, it is important to note that it is not just the media that get it wrong; many scientists blithely assert that their findings will have application in screening with no understanding of what is required to make a good screening test.

The four key considerations for a screening programme are take-up, costs, accuracy and intervention. Take-up can be surprisingly high, even for embarrassing and uncomfortable procedures. In the UK, most women go along with screening for cervical and breast cancer because, hey, having a total stranger introduce your private parts to cold metallic objects is, on balance, less bad than finding too late that you have a terminal disease.

But there are costs, of course. First, to the individual being screened, who has to take time out of a busy life to go along for screening, and second to the NHS, which must pay for the staff, equipment, facilities and infrastructure to do the screening. None of this is going to be worthwhile unless the screening is very accurate. If you are screening for cancer, you cannot afford to make errors of omission: giving the all-clear when in fact the person has the disease. But equally, you will be very unpopular if you make errors of commission – telling a healthy person they have failed the screen, causing anxiety and leading them to have further tests that then fail to confirm the diagnosis. This has been a problem for the PSA screening test for prostate cancer.

Quite apart from the accuracy of diagnosis, the screening would not be worthwhile if the disorder you are screening for were not readily treatable – or preventable. There would be little point in identifying large numbers of people with disease X, even if you could do so accurately, if nothing could be done about X. Indeed, it could be counterproductive.

In this light, let us consider recent stories in the media about using automated analysis of vocalisations to screen for autism. The study, by Oller and colleagues, was published in Proceedings of the National Academy of Sciences in July 2010. It is an impressive and innovative study: the researchers had children wear body-microphones so that all their utterances could be recorded, and they developed software that was able, with a high degree of accuracy, to distinguish the child's own utterances from those of other people and background noise. Using their extensive knowledge of the acoustic characteristics of babble and early language, they derived algorithms to characterise aspects such as pitch and rhythmic features in the vocalisations. They compared results for 106 typically developing children, 77 children with autism and 49 children with language delay, and demonstrated that on a combination of these indices the groups were well discriminated. They concluded: "We interpret this as a proof of concept that automated analysis can now be included in the tool repertoire for research on vocal development and should be expected soon to yield further scientific benefits with clinical implications." Screening is mentioned twice in the paper: the abstract states that "automated analysis should soon be able to contribute to screening and diagnosis procedures for early disorders", and the introduction notes that autism is currently diagnosed mainly in terms of negative features, such as joint attention deficits and communication deficits, and that "abnormal vocal characteristics would constitute a positive marker that might enhance precision of screening or diagnosis." Note that these claims are pretty cautious: the vocal analysis is seen as 'contributing to' and 'enhancing precision' of screening. Nevertheless, the potential of the method as a screening tool has been emphasised in some media reports.

So how accurate is it? The key statistics for any screening instrument are sensitivity and specificity. Sensitivity is the proportion of 'hits', i.e. cases with autism who are correctly identified as having autism. Specificity is the proportion of 'true negatives', i.e. non-autistic children who are correctly identified as non-autistic. These two indices can only be defined in relation to a specific population, and their interpretation depends crucially on the base rate of the disorder in that population.

Oller et al reported that discrimination of autism from typical development showed a sensitivity of 0.75 and a specificity of 0.98. This corresponds to the situation shown below: 104/106 (98%) of typical children are accurately diagnosed, and 58/77 (75%) of autistic children are accurately diagnosed.




                                  Actual diagnosis
                                Typical    Autistic
Diagnosis        Typical            104          19
from vocal       Autistic             2          58
analysis         Total              106          77     (42% of sample autistic)
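To spell out the arithmetic, here is a minimal Python sketch (my own illustration, not the authors' code) that recovers the sensitivity and specificity from the table above:

```python
# Confusion matrix from the table above (Oller et al, autism vs typical)
true_positives = 58    # autistic children classified as autistic
false_negatives = 19   # autistic children classified as typical
true_negatives = 104   # typical children classified as typical
false_positives = 2    # typical children classified as autistic

# Sensitivity: proportion of autistic children correctly identified
sensitivity = true_positives / (true_positives + false_negatives)
# Specificity: proportion of typical children correctly identified
specificity = true_negatives / (true_negatives + false_positives)

print(f"sensitivity = {sensitivity:.2f}")  # 0.75
print(f"specificity = {specificity:.2f}")  # 0.98
```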

But their study sample was selected to contain a high proportion of children with autism, 42%. In the general population, rates of autism are much lower than this. The most liberal estimate, which includes children with far less severe problems than those included in this study, is around 1% (Baird et al, 2006). If we apply the sensitivity and specificity figures from Oller et al to a population with a base rate of autism of 1% – just by multiplying the number of typical children while keeping specificity constant – then the numbers look like this:



                                  Actual diagnosis
                                Typical    Autistic
Diagnosis        Typical           7471          19
from vocal       Autistic           152          58
analysis         Total             7623          77     (1% of sample autistic)

We see that in this simulated general population sample around three quarters of those who'd be diagnosed as autistic on the basis of vocal analysis are non-autistic. In reality, accuracy is likely to be lower than this in such a sample, because it would include children with milder cases of autism, including those with Asperger syndrome, who were excluded from the study by Oller et al.
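To check the arithmetic, here is a small Python sketch of my extrapolation (not anything from the paper): hold the reported sensitivity and specificity constant, scale the typical group so that autism has a 1% base rate, and compute the positive predictive value.

```python
# Hold Oller et al's reported accuracy fixed, assume a 1% base rate
sensitivity, specificity = 0.75, 0.98
n_autistic = 77
n_typical = 7623                   # chosen so 77 / (77 + 7623) = 1%

true_positives = sensitivity * n_autistic        # ~58
false_positives = (1 - specificity) * n_typical  # ~152

# Positive predictive value: of those flagged as autistic, how many are?
ppv = true_positives / (true_positives + false_positives)
print(f"PPV = {ppv:.2f}")   # ~0.27, i.e. roughly three quarters of
                            # positive screens would be false positives
```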


It gets worse. A population sample would also include other non-autistic children with poor communication skills, including those with specific language difficulties, or conditions associated with more general developmental delay. Oller et al did include a non-autistic sample with language impairments, and it is clear from the presented data that there was considerable overlap between the autism and language-impaired groups; my reading of the plotted data suggests around 43% of children with language impairment fell below the cutoff for abnormality on the vocal measures. The notion that vocal abnormality is a 'unique signature' for autism, as reported by some of the media, is clearly false.

Might there be situations where the vocal analysis could be clinically useful? Potentially, yes. Autism diagnosis is particularly difficult in young, low-functioning children, and even our gold-standard diagnostic tools can have difficulty distinguishing autism from non-autistic mental handicap in children under 18 months of age (Lord et al, 1993). The vocal measures of Oller et al might provide useful supplementary information in such cases, but bear in mind that the data presented so far suggest the vocal abnormalities can occur in conditions other than autism. Another context where the vocal analysis could be of value is in the evaluation of very young children at high risk of autism, in particular young siblings of children with autism, around 10-20% of whom will develop autism or a related neurodevelopmental condition. It is possible that vocal analysis will be useful in identifying those likely to have problems. However, as noted by Zwaigenbaum et al (2006), even if we are able to identify such children early on, we then have to confront the fact that little is known about effective early intervention for autism. A further issue in the study of infant sibs is that vocal behaviours could be abnormal in such a sample if children imitate their older autistic sibling. Overall, automated vocal analysis could prove useful in early diagnosis of high-risk groups, but we cannot be certain without further research. And even if we did find it could be used in this role, we would want to demonstrate that it provided information that could not be obtained by other, simpler means. For instance, a brief parental questionnaire can be pretty effective in identifying autism (Berument et al., 1999).

Despite these caveats, I want to stress that I think the Oller et al study is an excellent piece of science. The method they have developed could provide important insights into the nature of communication problems in young children, and also be used to study how typical children learn to speak. It is a scholarly piece of work built on many years of painstaking research.

I've used the example of the Oller et al study because it provides a recent example of a good study whose clinical implications have been overstated, but there are many, many other instances. Take, for instance, media coverage of findings of genes associated with risk for dyslexia. A quick Google of 'genes dyslexia screening' turned up this article in the New York Times, which is typical of how genetic associations are reported in the media. The newspaper article states: "Researchers said a genetic test for dyslexia should be available within a year or less. Children in families that have a history of the disorder could then be tested, with a cheek swab, before they are exposed to reading instruction. If children carry a genetic risk, they could be placed in early intervention programs". Now, I recently reviewed the research on genetic studies of dyslexia (Bishop, 2009). Even though I thought I knew the literature, I was pretty surprised by what I found. I looked at one of the more robust results in the field, on a gene called KIAA0319, described in a paper with the title "Strong evidence that KIAA0319 is a susceptibility gene for developmental dyslexia". The authors compared children with different haplotypes (allele combinations) in relation to dyslexia status. Here's my summary:

"The most common form with alleles 1–1 was equally frequent for affected versus unaffected cases, but two other common forms, 1–2 and 2–1, showed contrasting effects. (Conventionally, 1 is the more frequent allele, and 2 is the less frequent). The 1–2 haplotype was found in 35% of those with dyslexia and 27% of those without, whereas for 2–1, the figures were 24% in those with dyslexia and 36% in unaffected controls.... Extrapolating to the general population (i.e. taking base rates into account), one would expect that most individuals with the 1–2 risk haplotype would not be dyslexic, and most dyslexic individuals would not have the 1– 2 haplotype."

Note that, even without taking base rates into account, the strength of the association with genotype is considerably lower than the Oller et al association between vocal behaviour and autism. This result causes excitement among geneticists because it has been replicated, and the evidence for association is strong in terms of p-values in a very large sample. The finding can potentially tell us something about the development of the brain regions that confer risk for dyslexia, and it indicates that biology plays a role in dyslexia: it is not just the result of poor parenting or poor teaching. But there is a difference between "strong evidence" and "strong effect": in fact, the effect size for individual genotypes is very small – far too small to play a useful role in screening. Since the 1970s we have had reasonably accurate ways of predicting which preschoolers are at risk for reading difficulties, based on simple behavioural tests and family background (see e.g. Satz & Friel, 1978). They are unlikely to be surpassed by any genetic test in the foreseeable future.
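To make the base-rate point concrete, here is a quick Bayes' theorem sketch using the haplotype frequencies quoted above. The 5% dyslexia prevalence is my own illustrative assumption, not a figure from the paper, but the conclusion is not sensitive to the exact value:

```python
# P(dyslexia | 1-2 risk haplotype) via Bayes' theorem
p_dys = 0.05             # assumed population base rate (illustrative)
p_hap_given_dys = 0.35   # 1-2 haplotype frequency in the dyslexic group
p_hap_given_not = 0.27   # 1-2 haplotype frequency in controls

p_hap = p_hap_given_dys * p_dys + p_hap_given_not * (1 - p_dys)
p_dys_given_hap = p_hap_given_dys * p_dys / p_hap

print(f"P(dyslexia | risk haplotype) = {p_dys_given_hap:.2f}")  # ~0.06
```

On these (assumed) figures, even carriers of the 'risk' haplotype would have only around a 6% chance of being dyslexic, which is why such a genotype is so unpromising as a screen.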

Nevertheless, the belief that genes are destiny, and somehow more accurate than simple behavioural or observational measures in predicting problems, is persistent. This encourages disturbing developments, such as firms offering genetic screening to identify your child's talents. Apparently you can pay to have your child tested for genes related to a host of personal characteristics, ranging from intelligence to optimism to dancing. Just one small problem: for many of these characteristics there are no published data linking gene variants to the trait. And where such data do exist, the effects are tiny.

The arguments I present here are very old; I was taught about the importance of base rates by the late A. E. Maxwell when training as a clinical psychologist in 1975. However, knowledge of this aspect of epidemiology is not widespread, even among academic psychologists who are otherwise sophisticated users of statistics. I recently asked a group of MSc students if any of them knew about base rates, and they thought I was talking about the Bank of England.

So, the bottom line: just because someone finds a statistically significant difference between two groups, don't assume that will translate into a useful screening test.


Cited references
Baird, G., Simonoff, E., Pickles, A., Chandler, S., Loucas, T., Meldrum, D., et al. (2006). Prevalence of disorders of the autism spectrum in a population cohort of children in South Thames: the Special Needs and Autism Project (SNAP). Lancet, 368 (9531), 210-215.
Berument, S. K., Rutter, M., Lord, C., Pickles, A., & Bailey, A. (1999). Autism screening questionnaire: Diagnostic validity. British Journal of Psychiatry, 175, 444-451.
Bishop, D. V. M. (2009). Genes, cognition and communication: insights from neurodevelopmental disorders. The Year in Cognitive Neuroscience: Annals of the New York Academy of Sciences, 1156, 1-18.
Lord, C., Storoschuk, S., Rutter, M., & Pickles, A. (1993). Using the ADI-R to diagnose autism in preschool children. Infant Mental Health Journal, 14, 234-252.
Satz, P., & Friel, J. (1978). Predictive validity of an abbreviated screening battery. Journal of Learning Disabilities, 11(6), 347-351.
Zwaigenbaum, L., Thurm, A., Stone, W., Baranek, G., Bryson, S., Iverson, J., et al. (2006). Studying the emergence of autism spectrum disorders in high-risk infants: Methodological and practical issues. Journal of Autism and Developmental Disorders, 37, 466-480.

Further reading
Loong, T. W. (2003). Understanding sensitivity and specificity with the right side of the brain. British Medical Journal, 327(7417), 716-719. doi:10.1136/bmj.327.7417.716


Postscript, 12th August 2010

An article by Ecker et al that was published yesterday in the Journal of Neuroscience comprehensively illustrates the points above. The researchers gathered MRI brain scans on 20 adults with autism, 19 adults with ADHD, and 20 control adults with no neuropsychiatric diagnosis. Previous research has reported various differences between the brains of people with autism and controls, but there have been few findings that are sufficiently strong and convincing to discriminate the groups. Indeed, it has been surprisingly difficult to find any consistent abnormalities on brain scanning, even in the case of children with severe impairments. Ecker et al adopted a different approach, arguing that a constellation of features would be more effective than individual indices in distinguishing autistic from control brains.

The three groups did not differ on three global measures of brain size: intracranial volume, total brain volume, and grey matter volume.

They used five measures to derive a 'classifier': an equation that puts the measures together in a way that maximally distinguishes the autism and control groups. These measures were (as far as I understand the highly technical account):

  1. The average convexity or concavity of large-scale features of the brain surface; this measures sulcal depth and gyral height.
  2. Mean (radial) curvature, which assesses the number of smaller convolutions on the surface of the brain.
  3. A measure of the degree of cortical folding, in terms of metric distortion relative to an average template.
  4. Cortical thickness.
  5. Pial area, a measure of the grey matter surface.

To avoid the bias that can arise when the same data are used both to derive and to test a classifier, they adopted a 'leave one out' method, whereby one person with autism and one control were omitted when estimating the equation; the resulting equation was then applied to classify the left-out cases (a toy sketch of this logic follows the table below). This was done separately for the left and right sides of the brain. Only the left-sided classifier was effective, giving the following results.



                                  Actual diagnosis
                               Autistic     Control
Diagnosis        Autistic            18           4
from MRI         Control              2          16
classifier       Total               20          20
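For readers unfamiliar with the procedure, here is a toy sketch of leave-one-out cross-validation, assuming scikit-learn. Ecker et al left out one case from each group per fold and used their own classifier, so this shows only the general logic, not their pipeline:

```python
# Toy leave-one-out cross-validation on random data
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))   # 40 'brains' x 5 cortical measures (random)
y = np.repeat([0, 1], 20)      # 0 = control, 1 = autism

correct = 0
for train_idx, test_idx in LeaveOneOut().split(X):
    clf = SVC(kernel="linear")           # refit without the held-out case
    clf.fit(X[train_idx], y[train_idx])
    correct += int(clf.predict(X[test_idx])[0] == y[test_idx][0])

print(f"leave-one-out accuracy: {correct / len(y):.2f}")
# ~0.5, i.e. chance, as expected on random data; real structure in
# the measures is what raises this above chance
```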


When the same left-hemisphere classifier was applied to the ADHD group, the picture was similar: 4 of the 19 cases were classified as autistic, and the remainder as non-autistic.
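Put back into the same terms as before (my arithmetic, not the authors'), the counts work out as follows; the widely reported '90 per cent accuracy' presumably reflects the sensitivity figure.

```python
sensitivity = 18 / 20   # autistic adults correctly classified: 90%
specificity = 16 / 20   # controls correctly classified: 80%
adhd_flagged = 4 / 19   # ADHD adults misclassified as autistic: ~21%
print(f"{sensitivity:.0%}, {specificity:.0%}, {adhd_flagged:.0%}")
```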

In the discussion of the results, the authors note the potential for clinical application as follows:

"... the existence of an ASD biomarker such as brain anatomy might be useful to facilitate and guide the behavioral diagnosis. This would, however, require further extensive exploration in the clinical setting, particularly with regards to classifier specificity to ASD rather than neurodevelopmental conditions in general. "

In other words, the classifier needs to be validated on other samples, and it is as yet unclear whether it would be as accurate if people with autism were contrasted, say, with people with language impairments. The authors also noted that the adult sample used here was largely composed of high-functioning adults with Asperger syndrome, and it was not clear whether results would generalise to other cases of autism.  In the final sentence of the paper it is stated: "classification values and specific patterns we report must be considered as preliminary".

In sum, the paper illustrates a promising approach to understanding more about brain development in autism, but it does not demonstrate the utility of this approach for differential diagnosis of autism from other conditions, nor for assisting diagnosis in marginal and difficult cases.

Encouraged by an enthusiastic press release from the Medical Research Council, who funded this work, the media picked up the story yesterday. It was featured prominently on BBC television and radio news, made the front-page story in the Independent, and was given extensive coverage in other national newspapers. The media accounts showed little of the caution and qualification of the published paper. In this case, I don't think the media can be blamed for misrepresenting the work, as much of what was said was taken from the MRC press release. Here are some statements from the Independent, under the headline "Brain scan promises to identify the hidden sufferers of autism":

  • Autism could in future be diagnosed in 15 minutes from a brain scan
  • Scientists ... have devised a method which can distinguish the autistic brain from the normal brain with 90 per cent accuracy.
  • The method could lead to the introduction of screening for the disorder in children.
Dr Ecker was quoted as saying:
    "The [computer programmme] says yes this is autism or no it is not, and also gives an indication of the severity."  She also contrasted the 15 minutes it took to do an MRI scan with conventional diagnostic procedures which "takes the whole day and involves asking embarrassing questions such as how many friends they have and how they are doing at school"

Dr Ecker seems to be implying that we should stop using diagnostic interviews to diagnose autism, because a brain scan is more accurate. Since autism is currently defined in terms of a pattern of impairments in different domains of functioning, this is a remarkably bold claim, especially since her method awarded an autism diagnosis to 20% of controls.

Much is also made of the potential of the method for diagnosing autism in children. It's possible this may be effective, but children's brains are very different from adult brains, and one would need a whole new study to evaluate this. Furthermore, although Dr Ecker says of brain scanning "It doesn't hurt and you can even go to sleep", it is not an easy procedure to apply to young children without sedation. Lying still in a very noisy tube is not something that comes naturally to a child, even if you can explain the procedure clearly so they know what to expect. Most children with autism have poor language comprehension and may also become anxious in an unfamiliar situation. It is not impossible to do brain scanning studies with them, but it certainly is not easy.

A final thought. In recent years, brain research on autism has increased exponentially. There are other MRI datasets for adults and children with autism and other clinical groups as well as controls. It should be possible for Ecker and colleagues to cross-validate their findings fairly rapidly against other datasets. I hope that others will be willing to share their data to make this possible.

5 comments:

    1. Excellent analysis as usual! My cynical suspicion is that journal editors encourage statements about screening etc. in order to attract more media attention. Unfortunately, as in this case, this strategy detracts from the more probable, and very exciting, research applications of the Oller et al. work.

      Another point about early identification: studies of infants at risk of ASD are revealing what we've known about specific language impairments for a long time - many children who look impaired at age 2 appear to right themselves and look absolutely typical at age 5ish, with very little additional input. So even if we had a superb early intervention that we could apply to all children at risk (and we really don't), we may be spending lots of time and money treating children who really don't need it. In the age of austerity this is a message that needs some attention - supporting families at risk and deciding when and how best to intervene should be at the top of the agenda.

    2. This falls in the same category as the way levels of toxins in food/toys etc. can nowadays be measured down to an absurdly tiny fraction. Sure, it's great science to be able to measure minute trace amounts, but mostly it only serves to cause anxiety.

      Scientific progress has a tendency to create more problems than it solves in certain areas.

    3. Thank you, DVB, for this clear and cogent explanation! It's a great place for clinicians to send parents to help them understand the issues in these highly touted new "screening" procedures!

    4. The base rate issue – or, equivalently, the prediction of rare events – was pointed out very clearly by Paul Meehl and Albert Rosen in 'Antecedent Probability etc', Psychological Bulletin, Vol 52, 1955. Well worth rereading for its potent elementary algebra from Bayes' theorem. Only 55 years ago, but this elementary point is missed regularly. You would think editors would know better.

    5. Thank you, Professor Bishop, for this particular post.

      No scan machine will ever be able to find out the kinds of information that psychological tests and developmental interviews can obtain in the process of assessment and diagnosis. Autism is more than a set of shapes noted in the structures of tracts of brain tissue: the diagnostically-characteristic behaviours of the autistic person do not come from brain anatomy alone - they come from all biological factors and all environmental factors and the interactions between them ... B=f(P, E), as Lewin famously expressed it.

      In a sense, making a diagnosis is like developing a grounded theory about someone's behaviour. It takes time, and a lot of work. No machine, operated for 15 minutes, can ever replace that process.
