Monday 30 May 2011

Are our ‘gold standard’ autism diagnostic instruments fit for purpose?

In 1985, Simon Baron-Cohen, Alan Leslie and Uta Frith published a landmark paper entitled “Does the autistic child have a theory of mind?” It described a small study that Simon Baron-Cohen completed for his doctoral thesis which, according to Google Scholar, has been cited 2800 times. If the same paper were submitted for publication today, most journals would reject it. Why? The paper stated: “The 20 autistic children had been diagnosed according to established criteria (Rutter, 1978)”. Nowadays this would be deemed inadequate. Many editors and reviewers insist that studies of autism use two diagnostic procedures, the Autism Diagnostic Interview - Revised (ADI-R) and the Autism Diagnostic Observation Schedule - Generic (ADOS-G). According to the NIH National Database for Autism Research, requiring people to use these “gold standard” assessments will “help accelerate scientific discovery”. But use of these instruments adds hugely to the time and money costs of research. Small-scale studies of autism by PhD students have become nonviable, and large-scale genetic and epidemiological studies are bogged down by the need to spend hours just establishing the phenotype for each case. Researchers from countries where ADI-R and ADOS-G are not available are at a serious disadvantage. And, as I shall argue, the end result is not a clearcut diagnosis.

Autism has three key defining features: impairments in communication, social interaction and behavioural repertoire. The latter encompasses both repetitive behaviours such as stereotyped movements, and restricted interests, e.g., an obsessive fascination with aeroplanes. After autism was first described by Leo Kanner in 1943, the diagnosis quickly became popular, but there were concerns that it was over-used. There was a clear need to translate Kanner’s clinical descriptions into more objective diagnostic criteria. The first step was to develop checklists of symptoms, and these were included for the first time in the 1980 version of the Diagnostic and Statistical Manual of the American Psychiatric Association, DSM-III. However, this still left room for uncertainty: clinicians might, for instance, disagree about interpretation of terms such as: “Pervasive lack of responsiveness to other people”.

The Autism Diagnostic Interview (ADI) was designed to address this problem. It was first published in 1989, with a revised form, ADI-R, appearing in 1994. The ADI-R typically takes 1.5 to 3 hours to administer, and covers all the symptoms of autism and related conditions. Items are coded on a 4-point scale, from 0 (absent) to 3 (present in extreme form). The scores from a subset of items are then combined to give a total for each of the three autism domains, and a diagnosis of autism is given if scores on all three domains are above cutoffs, and onset was evident by 36 months. Validation of the ADI-R was carried out by comparing scores for 25 preschool children who had clinically diagnosed autism and 25 nonautistic children with intellectual retardation or language impairment.
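
To make the scoring logic concrete, here is a minimal sketch of the kind of rule described above: sum the codes for the algorithm items within each domain, then require every domain total to reach its cutoff and onset to be evident by 36 months. The item names and cutoff values are hypothetical placeholders, not the published ADI-R values, and treating "above cutoff" as "at or above" is an assumption.

```python
# Hypothetical algorithm items per domain; the real ADI-R item set differs.
EXAMPLE_ALGORITHM_ITEMS = {
    "social": ["s1", "s2", "s3", "s4"],
    "communication": ["c1", "c2", "c3"],
    "repetitive": ["r1", "r2"],
}

# Hypothetical cutoffs for illustration only; NOT the published ADI-R values.
HYPOTHETICAL_CUTOFFS = {"social": 10, "communication": 8, "repetitive": 3}

ONSET_LIMIT_MONTHS = 36  # from the text: onset must be evident by 36 months


def domain_totals(item_scores, algorithm_items):
    """Sum the 0-3 codes of the algorithm items within each domain.

    Items that are coded in the interview but are not listed in
    algorithm_items are ignored, mirroring the point that only a subset
    of items feeds the totals.
    """
    return {
        domain: sum(item_scores[name] for name in names)
        for domain, names in algorithm_items.items()
    }


def meets_autism_criteria(item_scores, algorithm_items, onset_months,
                          cutoffs=HYPOTHETICAL_CUTOFFS):
    """Return True only if every domain reaches its cutoff and onset is early."""
    totals = domain_totals(item_scores, algorithm_items)
    all_domains_reached = all(totals[d] >= cutoffs[d] for d in cutoffs)
    return all_domains_reached and onset_months <= ONSET_LIMIT_MONTHS
```

Note that the rule is all-or-none: a child who scores just below the cutoff in a single domain gets the same answer as a child who scores zero throughout.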

Right from the outset, however, there was concern that diagnosis of autism should not be made on the basis of parental report alone. Some parents are poor informants. On the one hand, they may fail to remember key features of their child’s behaviour; on the other hand, their memories may have been coloured by reading about autism. Parental report therefore needs to be backed up by observation of the child. The Autism Diagnostic Observation Schedule (ADOS), published in 1989, was designed for this purpose. It exposes the child to a range of situations designed to elicit autistic features, and particular behaviours, such as eye contact, are then coded by a trained examiner. ADOS-G, a generic version, was published in 2000, and covers a wide age range, from toddlers through to adults.

ADI-R and ADOS-G quickly became the instruments of choice for autism diagnosis. It was generally appreciated that if we use standard instruments, researchers and clinicians should be able to communicate about autism with a fair degree of confidence that they are referring to individuals who meet the same diagnostic criteria.

There is, however, a downside. ADI-R and ADOS-G were designed to be comprehensive, but they were not designed to be efficient. As noted above, ADI-R takes up to 3 hours to administer and score. ADOS-G takes about 45 minutes. In addition, testers must be trained to use each instrument, and it may take some months to find a place on a training course. Each course lasts around one week, and the trainee then has to do further assessments which are recorded and sent for validation by experts. This process can easily add another 6 months. For anyone under time pressure, such as a doctoral student or grant-holder with research assistants on fixed term contracts, the training requirements can make a study impossible to do. Inclusion of both ADI-R and ADOS-G in a test protocol can double or treble the duration of a project, especially where it is necessary to travel to interview parents who may be available only during anti-social hours.

So, does autism diagnosis need to involve such a lengthy process? This was a question I was prompted to consider when I was asked to speak at a roundtable debate on diagnostic tools for autism at the International Meeting For Autism Research (IMFAR) in London in 2008. I concluded that the answer is almost certainly no. As Matson et al (2007) put it: “Some measures emphasize the fact that they are very detailed. We would argue that detail equals time. From a pragmatic perspective, our view is that a major priority should be to develop the balance between obtaining relevant information to make a diagnosis, while parsing out items that do not enhance that goal”. (p. 49). I was surprised when I first undertook ADI-R training to find that the interview included many items that did not feature in the final algorithm. When I (and others) queried this, we were told that the interview worked in its entirety, and to pull out selected items would disrupt a natural flow. Also, the non-algorithm items might be useful for diagnosing conditions other than autism. While both points may be important in a clinical setting, they have much less force for the poor graduate student who is doing a doctorate on, say, perceptual processing in autism, and only wants to use the ADI-R to confirm a diagnosis that has already been made by a clinician. There was a short form, I was told, but it was not recommended and should only be used by clinicians, not researchers. I was even more surprised to find how the algorithm was devised. I was familiar with discriminant function analysis, whereby you take a set of scores on two (or more) groups, and find the best weighted sum of scores to discriminate the groups. You can then use the correlations between items to drop items from the algorithm successively until you get to the point where accuracy of diagnostic assignment declines if further items are dropped. 
I had assumed that this kind of statistical data reduction had been used to identify the optimal set of items for identifying autism. I was wrong. It seemed that the items were selected on the basis of their match to clinical descriptions of autism, and no attempt had been made to test the efficiency of the algorithm by dropping items.
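
The item-dropping idea can be sketched in a few lines. This is not a real discriminant function analysis: as a simplified stand-in, it classifies on an unweighted sum score with the best single cutoff, and greedily drops whichever item costs least until accuracy would fall below that of the full item set. The function names and the toy data are made up for illustration.

```python
def best_threshold_accuracy(totals, labels):
    """Accuracy of the best single cutoff, predicting 'case' when total >= cutoff."""
    best = 0.0
    for cutoff in sorted(set(totals)) + [max(totals) + 1]:
        correct = sum((t >= cutoff) == bool(y) for t, y in zip(totals, labels))
        best = max(best, correct / len(labels))
    return best


def accuracy_for_items(data, labels, items):
    """Classification accuracy using the sum of the selected item columns."""
    totals = [sum(row[i] for i in items) for row in data]
    return best_threshold_accuracy(totals, labels)


def backward_eliminate(data, labels, tolerance=0.0):
    """Greedily drop items while accuracy stays within tolerance of the full set."""
    items = list(range(len(data[0])))
    baseline = accuracy_for_items(data, labels, items)
    while len(items) > 1:
        # Try removing each remaining item; keep the least harmful removal.
        candidates = [
            (accuracy_for_items(data, labels, [i for i in items if i != drop]), drop)
            for drop in items
        ]
        accuracy, drop = max(candidates)
        if accuracy < baseline - tolerance:
            break  # any further removal would reduce diagnostic accuracy
        items.remove(drop)
    return items, accuracy_for_items(data, labels, items)


# Toy data (made up): item 0 separates the groups, items 1 and 2 carry no information.
TOY_DATA = [[3, 1, 2], [2, 1, 2], [3, 1, 2],   # clinically diagnosed cases
            [0, 1, 2], [1, 1, 2], [0, 1, 2]]   # non-autistic comparison group
TOY_LABELS = [1, 1, 1, 0, 0, 0]

kept_items, kept_accuracy = backward_eliminate(TOY_DATA, TOY_LABELS)
```

On real data one would want weighted scores and cross-validation, since accuracy estimated on the same sample used to select items will be optimistic.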

There is good reason to believe that a much shorter and simpler procedure would be feasible. In 1999, a study was published comparing diagnostic accuracy of the ADI with a 40-item screening questionnaire. The accuracy of the questionnaire was as good as the interview. My impression is that this finding did not lead to rejoicing at the prospect of a shorter diagnostic procedure, but rather alarm that diagnosis could be reduced to a trivial box-checking exercise. While I have some sympathy with that view, I feel these results should have made the researchers pause to consider whether a much shorter and more efficient approach to diagnosis might be feasible.

But the problems get worse. As Jon Brock recently pointed out on his blog, Kanner’s view of autism as a distinct syndrome is no longer accepted. It’s clear that autism symptoms can occur in milder form, and that they do not necessarily all go together. The broader term ‘Autism Spectrum Disorder’ (ASD) is nowadays used to encompass these cases. The schematic in Figure 1 illustrates the diagnostic problem: one has to decide where to place boundaries on the figure to distinguish ASD from normality, when in reality, all three domains of the autism triad of symptoms shade into normality, with no sharp cutoffs. The ADI-R algorithm specifies whether or not you have autism, and does not give cutoffs for milder forms of ASD. The ADOS-G does have cutoffs for milder forms, but is inappropriate for detailed assessment of repetitive behaviours/restricted interests, so it is not watertight. This means that, when it comes to diagnosis of ASD, a great deal is left to clinicians’ judgements. These, rather than algorithm scores, are used to arrive at diagnoses.

Figure 1: Schematic of autism as a spectrum disorder. Circles correspond to areas of deficit: red = social impairment; blue = communication difficulties; yellow = repetitive behaviour/restricted interests, with depth of colour indicating severity of impairment. Individuals with all three features (centre of the figure) meet full diagnostic criteria for autism, but those falling outside this region, who have milder or partial difficulties, are candidates for a diagnosis of autism spectrum disorder. Note, however, that there is no clearcut boundary between autism spectrum disorder and normal variation.

There were quite a few researchers at the IMFAR meeting who complained that their papers were rejected by journals because they relied on a diagnosis made by an expert, rather than ADI-R and ADOS-G. They would be baffled to see that several recent state-of-the-art epidemiological studies use consensus judgement by expert clinicians to make their diagnoses - and these diagnoses don’t necessarily agree with ADI-R and ADOS-G. So, for instance, Baird et al had 81 cases who met a consensus clinical diagnosis of childhood autism, but only 53 (65%) were classified as autistic by the algorithms of both the ADOS-G and the ADI-R. They also identified 77 children with consensus diagnosis of ‘other ASD’ of whom 69% met criteria for autism on the ADI-R, and 38% met cutoff for PDD or autism on the ADOS. On the ADOS-G, 10% of non-autistic children scored above cutoff for either ASD or autism. This fits my experience: high scores can reflect lack of engagement, shyness, or language difficulties. Similarly, Baron-Cohen et al identified four cases of autism and seven with other ASDs in a population screening of children aged 5 to 9 years in Cambridgeshire. All the autism cases met autism criteria on both ADI-R and ADOS-G. Of the other ASD cases, five met criteria for autism on ADI-R but not ADOS-G, and two did not meet criteria for autism or ASD on either instrument. In describing these findings, I am not criticising the authors of these studies, whose methods were transparently reported and are consistent with practice as it has evolved in the field. But it is ironic that we seem to have come full circle. ADI-R and ADOS-G were developed to make diagnosis more objective, but because they aren’t geared up to diagnose ASD, we are thrown back on ‘expert clinical opinion’. 
This is far from reassuring, given that a recent study reported that, after months of training, researchers agreed well on scoring standardized instruments, but “consistent differences between sites in overall clinical impression were reported”.

My own recommendation is for a two-step procedure. The first step would involve a much briefer version of the ADI-R, which would be designed to pick up clear-cut cases of autism that everyone would agree on and distinguish them from clearly non-autistic cases. It’s an empirical question, but I suspect that if we were to do a stepwise discriminant analysis to identify a minimum set of diagnostic items, this would be considerably shorter than the current set used in ADI-R. The interview may then need redesigning so that it still flows fluently and follows a logical course, but this should not be impossible. In clinical settings, those identified might require further direct assessment to confirm diagnosis and identify specific needs, but this would not be necessary for determining who should be included in a research study. This would leave a group of children in whom autism was suspected but not confirmed. The question here is whether we will ever arrive at a diagnostic procedure that will clearly separate such children into ASD and non-ASD. I was involved in a study with just such a group of ‘marginal’ cases a few years ago, where we administered ADI-R and ADOS-G. The results were all over the place: some children looked autistic on ADI-R and not on ADOS-G and others showed the opposite pattern. Some had evidence of marked change in behaviour between preschool and school-age years. When I asked an autism expert how such cases should be categorised, he suggested I get an expert clinical opinion. Yet expert clinical opinion is not seen as adequate by many journal editors! And there is documented evidence that even experienced clinicians will disagree in cases where the child has a confusing pattern of symptoms, and that expert diagnoses are not stable over time. My suggestion is that in our current state of knowledge it makes no sense to try and get reliable cutoffs for identifying ASD. 
Instead we should aim to assess the nature and degree of impairments in different domains. Assessments such as the 3Di or Social Responsiveness Scale, which treat autistic features as dimensions rather than all-or-none symptoms, seem better suited to this task than the existing gold standards.

Finally, I must emphasise that, although I think the ADI-R and ADOS-G are not optimal for diagnosing ASD for research purposes, they nevertheless have value. They distill a great deal of clinical wisdom in the assessment process, and are cleverly crafted to pinpoint the key features of autism. Anyone who undergoes training in their use will come away with a far greater understanding of autism than they had when they started. However, these instruments are not well suited for addressing the NIH aim of “accelerating scientific discovery”. In research contexts they have the opposite effect, by making researchers go through an unnecessarily long and complex diagnostic process which does not yield suitable quantitative results for assessing the dimensional aspect of ASD.

Rondeau E, Klein LS, Masse A, Bodeau N, Cohen D, & Guilé JM (2010). Is Pervasive Developmental Disorder Not Otherwise Specified Less Stable Than Autistic Disorder? A Meta-Analysis. Journal of autism and developmental disorders PMID: 21153874


  1. I am an adult with (so-called) "high functioning autism" (HFA) or, if one asks a different person, I have "Asperger's Syndrome." I am 46 years of age, and autism was first applied to me by a PhD clinical psychologist around 1990. Prior to that, I had discovered a category that seemed a lot like me, but there were differences significant enough that I knew that "Schizoid" was not exactly "it." When I learned of the CORE issues of autism (whether "high" or "low" functioning), I knew that the clinical psychologist had pegged my case.

    Of course, the bane of my existence has too often been the horribly subjective definitions of "functioning." While a level of physical functioning is rather obvious, and an IQ can be assessed, the adjectives are much, much trickier to apply to the parts and processes of a person that are invisible. In fact, even if the end result passes as "normal enough," the result can be a huge lie relative to the wretched anxiety, terror, painfulness and confusion, etc., of the process endured that produced that "normal enough" result.

    All of that to ask:

    After all of the diagnostic tools (e.g., the ADOS-G) and the "expert" opinion, at what point does the testimony of an adult regarding their own lifelong experiences enter the diagnostic conversation? Should it ever enter? What might be the implications of excluding it?

    I would, for example, have much to say about how adjectives are applied, and why. I would argue, for example, that "normal enough results" (e.g., some eye-contact is made; some casual social interaction occurs on some days; a 'special interest' isn't a WEIRD topic or object, etc.) DO NOT suffice to ascertain one's true "level of functioning." In fact, my true "level of functioning" seems to occur in realms that are not "measurable" because it occurs much less with my body than it does with my brain and my soul.

    In fact, my true "level of functioning" is so intangible and deep that I am accused of lying about the extent of my difficulties. Also, I am usually told that I am "too smart" to possibly have such difficulties.

    If only what researchers and diagnosticians can see or measure objectively and (they presume) properly interpret is considered valid for confirming or disconfirming autism in a person (forget the gradations of "functioning"; I am referring to the CORE elements of autism as expressed here: 'Autism has three key defining features: impairments in communication, social interaction and behavioural repertoire. The latter encompasses both repetitive behaviours such as stereotyped movements, and restricted interests, e.g., an obsessive fascination with aeroplanes.'), then I submit to you that most of what is "autism" will forever be beyond your reach. Therefore, what people with autism TELL ABOUT THEMSELVES must gain genuine and substantial validity with professionals.

    Will some people lie? Probably. As for me, I could never have made this up.

  2. This discussion is incredibly brave, important and also overdue. Let's face it: there is no gold standard for the diagnosis of ASD. This applies to all currently available diagnostic instruments, including interviews and behavioural observations.

    I like the suggestion of a separation of diagnosis for different purposes. For clinical purposes, time taken to interview parents and observe patients should not be curtailed by considerations of efficiency. For research purposes however, efficiency is a critical consideration. The suggested procedure of using quantitative scores on relatively brief questionnaires is a most useful start.

    We should not pretend there is a gold standard for diagnosing ASD. Nevertheless, I urge researchers, clinicians, parents and affected individuals to work together to develop more sensitive tests, adapted to different ages and abilities. I for one have not given up hope that the 'wisdom' of clinical and introspective experience can be translated into objective tests.

    Thank you, Dorothy, for opening the discussion on this thorny issue. It will surely benefit many struggling researchers and perhaps will educate journal editors.

  3. Bravo, Dorothy! In addition to making "gold standard" diagnoses unreasonably onerous, insistence on ADOS/ADI data also makes it difficult to confirm diagnoses of relatively high-functioning children as they develop. With cases we have followed from toddlerhood, we find that high functioning children who were clearly ASD at 2 often fail to meet ADOS criteria at 5, when their relatively strong verbal skills and the relatively low demands of social interaction at this age level allow them to "pass." Often, they will meet criteria again at later ages.
    In addition, the use of ADOS/ADI criteria across cultures can be problematic. Kim et al.'s (2011) recent epidemiological study of ASD in Korea, using ADOS/ADI, reports 1 in 38 children have the syndrome! It seems possible that cultural differences in talkativeness and rules for relating to adults might be influencing this figure. In addition, we have often seen children who are extremely shy, or who have some psychotic symptoms score high on the ADOS, although clinical judgement by experienced clinicians does not see them as ASD. Just as in the case of SLI, we have some distance to go before really finding a gold standard for these disorders.

  4. As Dorothy said "Autism has three key defining features: impairments in communication, social interaction and behavioural repertoire" and as Jon's blog says, autism symptoms do not necessarily all go together. So if scientists say they are studying autism, what is it that they are studying?

    This issue was dealt with re acquired dyslexia and aphasia in the 1970s and re developmental dyslexia in the 1990s. In these fields many people came to the view that the proper topic for scientific study here is the symptom, not the syndrome. So for example many people stopped saying they were studying Broca's aphasia and began instead to study agrammatic sentence production, or agrammatic sentence comprehension.

    Will there come a time when scientists interested in autism stop saying "I am studying autism" and start saying "I am studying the kind of impaired communication that is often seen in people with autism: what is the nature of this impaired communication and what causes it?"

    And can one really refer to the three features listed by Dorothy as "defining features" if none of them is always present in every child with autism?

  5. If I’ve understood Max correctly, he's saying that signs and symptoms are the specific characteristics shown by individuals.

    The syndrome is impaired social interaction, communication and repetitive and stereotyped behaviours - the intersection in Dorothy’s Venn diagram.

    Sometimes individuals exhibiting the syndrome have very similar signs and symptoms - sometimes they don’t.

    But the syndrome is often conflated with the signs and symptoms - people talk about impaired social interaction, communication and repetitive and stereotyped behaviours as ‘the symptoms of autism’.

    In addition, the syndrome is often conflated with the underlying cause – people showing the syndrome are implicitly assumed to have some causal commonality. So researchers end up looking for factors common to everybody exhibiting the syndrome, regardless of their individual signs and symptoms.

    What researchers keep coming up with is findings such as individual differences in displacement of foveal fixation, auditory filter bandwidth, auditory ERPs, urinary peptide profiles etc etc, but if these differences do not discriminate people-diagnosed-with-autism-as-a-group from controls-as-a-group, a promising line of research gets abandoned.

    Today, we need naming of parts….

  6. Thanks for all the comments. I do have a piece on the Korean epidemiological study in the pipeline at Guardian science blogs, which I hope will be posted soon. But I'm interested in the different perspectives revealed by the 3 commentators: isen101 regards autism as a real condition but isn't happy with current diagnostic methods; Max and Sue seem to question whether autism is a single coherent condition. I think that etiologically it's becoming clear that there isn't one cause, and there's certainly much heterogeneity in presentation, especially if you add developmental change into the mix. But I nevertheless think that if you have someone in whom all 3 domains of impairment overlap then an autism diagnosis is helpful as a shorthand for communication and getting them to appropriate services. I wonder why we need quite such a time-consuming method to demonstrate this. As I've argued elsewhere, for the cases outside that central area, labelling has huge implications, both for how seriously any impairments are taken, and for access to services.

  7. Not sure quite what services a diagnosis gives access to, Dorothy, not in the UK at least.

    And even if everyone with a diagnosis of autism manifests the syndrome (the three overlapping domains of impairment), what they need support with, surely, is their specific social, communicative and behavioural signs and symptoms - not the signs and symptoms of the average person with autism. So I'm still unsure of the purpose of a 'diagnosis' per se.

  8. I do agree that the term "autism" is helpful as a shorthand for communication, just as the term "Broca's aphasia". If someone said to me "The person you are about to meet has Broca's aphasia", that gives me a clear picture of what types of language symptom I am likely to encounter.

    So these terms are clearly clinically useful. I was saying that they aren't scientifically useful. If there is no symptom that everyone with autism shows, then there is no particular part of cognition which is affected in all cases of autism, so what do people think they are studying when they are studying cognition in autism?

    In contrast, if you select a group of people with autism on the basis that they ALL exhibit a difficulty in understanding pragmatic aspects of language, then there is at least a chance that all of them have the same part of cognition affected, namely, that part of cognition which controls our ability to understand pragmatic aspects of language. So studying this group might tell us something about this part of cognition. Just as studying a group of people in whom head injury has impaired the understanding of pragmatic aspects of language. Studying these two different groups might even inform us about the same part of language cognition, even though the etiologies are so different.


  9. The more I understand non-autism the more I can see that the symptoms of autism are secondary and there is an underlying representational theme. Communication difficulties are due to lack of shared context, specialised interests and physical symptoms are due to information filtering, etc. All these symptoms can vanish in the right circumstances, and are mitigated to varying degrees in individuals.

    Maybe diagnosing autism is meaningless if you aren't asking for government money to treat symptoms, but out of scientific curiosity I'd like to find out more about what appears to be two distinct strands of humanity in our recent development. I LOVE this two question test that seems to probe semantic substructure: - if anyone can work out what's going on there you'd get close to far more efficient testing.

  10. Hi, I am late coming to this discussion and the question I have doesn't relate precisely to the subject that is being addressed here.

    My psychiatrist has suggested that I may have 'mild/high functioning autism'. I am a female adult with lifelong severe anxiety, OCD and eating disorder. These behaviours started when I was very young and I do recall that as a child I found eye contact very difficult (too intense, frightening and intrusive). I didn't participate in pretend play with other children and found other children puzzling. I did make friends in primary school, but at puberty I 'fell apart' mentally. I couldn't cope with the changes that were happening to my body and I couldn't understand why my female peers became interested in clothes, make up and boys. I isolated myself and ended up developing anorexia nervosa.

    I am now recovered from anorexia nervosa, but I still have OCD. I accept that I do show a number of characteristics of ASD, but the one thing that confuses me about this suggested diagnosis is that I do not lack empathy. I recognise that empathy is complex, but if, for example, I see a person or animal in pain, bullied or badly treated it causes me terrible distress. I can feel that other being's pain and I want to stop their pain. I don't always know the right response to other people's distress, but I certainly feel it. Many people with ASD deny that they lack empathy.

  11. What a great blog and an intriguing discussion. I applaud you for questioning the "Gold Standard". Our lab has been skeptical about this claim for some time. My question is this: What exactly are the criteria for "gold standard"? As someone with a background in evaluating tools and tests, I would think that psychometric evidence (e.g. how a test does compared to others, size of the norming sample in terms of # of items, and ability to discriminate between similar conditions) might be some of the criteria in a gold standard (our review discussing some of these considerations for adult Asperger syndrome tools can be found here - sorry it is not open access, but email me if you'd like a copy of the paper:

    In perusing the manual for the ADI-R, one notes some very basic issues that do not uphold "Gold Standards" in test development and this leads to questions about what the term really means in the field of autism research and practice. How did these tools become the "Gold Standard"? Was it language used to market the test in the early stages? Did expert consensus put it forth? If so, was this a research-based consensus, or just a prevailing opinion? Or did people just say "Gold Standard" enough times that it was accepted? I am so pleased to see others questioning this claim, though to be clear, I would have no difficulty with it if the actual criteria were detailed. What is it about these particular tools that makes them "gold standard"? Can we test this claim out (yes, and our lab is working on it)? It really is time we sorted this issue out. Thank you for some great posts!

    Now, just a quick response to Anonymous:
    People with ASD can certainly have intact empathy. Indeed, many individuals I see clinically report such intense feelings in emotional interactions that they avoid situations that may 'trigger' intense emotional reactions. In my doctoral work, I examined Emotional Intelligence (EI) in young adults diagnosed with Asperger syndrome, and we found that while some aspects of EI were problematic (particularly real-time social interactions that involve EI and feelings about those interactions), actual complex understanding about emotions was intact, and actually significantly better developed than typically developing controls and norm groups. While this is not a direct assessment of empathy, it points to the same conclusion you present: that people with ASD do indeed have empathy. It is, however, how they respond in 'real-time' that seems to be a bigger issue. If you are interested in the study, an early article is here, tho more recent investigations are available thru libraries: (we are working on plain language summaries on my webpage-feel free to check it out: )

    Thank you again for this great blog!