Sunday, 3 March 2019

Benchmarking in the TEF: Something doesn't add up (v.2)

Update: March 6th 
This is version 2 of this blogpost, taking into account new insights into the weird z-scores used in TEF.  I had originally suggested there might be an algebraic error in the formula used to derive z-scores: I now realise there is a simpler explanation, which is that the z-scores used in TEF are not calculated in the usual way, with the standard deviation as denominator, but rather with the standard error of measurement as denominator. 
In exploring this issue, I've greatly benefited from working openly with an R Markdown script on GitHub, as that has allowed others with statistical expertise to propose alternative analyses and explanations. This process is continuing, and those interested in technical details can follow developments as they happen on GitHub, see benchmarking_Feb2019.rmd.
Maybe my experience will encourage OfS to adopt reproducible working practices.

I'm a long-term critic of the Teaching Excellence and Student Outcomes Framework (TEF). I've put forward a swathe of arguments against the rationale for TEF in this lecture, as well as blogging for the Council for Defence of British Universities (CDBU) about problems with its rationale and statistical methods. But this week, things got even more interesting. In poking about in the data behind the TEF, I stumbled upon some anomalies that suggest to me that the TEF is not just misguided, but is also based on a foundation of statistical error.

Statistical critiques of TEF are not new. This week, the Royal Statistical Society wrote a scathing report on the statistical limitations of TEF, complaining that their previous evidence to TEF evaluations had been ignored, and stating: 'We are extremely worried about the entire benchmarking concept and implementation. It is at the heart of TEF and has an inordinately large influence on the final TEF outcome'. They expressed particular concern about the lack of clarity regarding the benchmarking methodology, which made it impossible to check results.

This reflects concerns I have had, which have led me to do further analyses of the publicly available TEF datasets. The conclusion I have come to is that the way in which z-scores are defined is very different from the usual interpretation, and leads to massive overdiagnosis of under- and over-performing institutions.

Needless to say, this is all quite technical, but even if you don't follow the maths, I suggest you just consider the analyses reported below, in which I compare the benchmarking output from the Draper and Gittoes method with that from an alternative approach.

Draper & Gittoes (2004): a toy example

Benchmarking is intended to provide a way of comparing institutions on some metric, while taking into account differences between institutions in characteristics that might be expected to affect their performance, such as the subjects of study, and the social backgrounds of students. I will refer to these as 'contextual factors'.

The method used to do benchmarking comes from Draper and Gittoes, 2004, and is explained in this document by the Higher Education Statistics Agency: HESA. A further discussion of the method can be found in this pdf of slides from a talk by Draper (2006).

Draper (2006) provides a 'small world' example with 5 universities and 2 binary contextual categories, age and gender, to yield four combinations of contextual factors. The numbers in the top part of the chart are the proportions in each contextual (PCF) category meeting the criterion of student continuation.  The numbers in the bottom part are the numbers of students in each contextual category.

Table 1. Small world example from Draper 2006, showing % passing benchmark (top) and N students (bottom)

Essentially, the obtained score (weighted mean column) for an institution is an average of indicator values for each combination of contextual factors, weighted by the numbers with each combination of contextual factors in the institution. The benchmarked score is computed by taking the average score for each combination across all institutions (bottom row of top table) and then for each institution creating a mean score, weighted by the number in each category for that institution. Though cumbersome (and hard to explain in words!) it is not difficult to compute.  You can find an R markdown script that does the computation here (see benchmarking_Feb2019.rmd, benchmark_function). The difference between obtained and benchmarked values can then be computed, to see if the institution is scoring above expectation (positive difference) or below expectation (negative difference).  Results for the small world example are shown in Table 2.
Table 2. Benchmarks (Ei) computed for small world example
The column headed Oi is the observed proportion with a pass mark on the indicator (student continuation), Ei is the benchmark (expected) value for each institution, and Di is the difference between the two.
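
For readers who find code easier to follow than words, here is a minimal Python sketch of the benchmark calculation described above. The numbers are made up for illustration (they are not Draper's), the function names are hypothetical, and I am assuming that the "average score for each combination across all institutions" is a student-weighted (pooled) average. The R markdown script on Github remains the reference version.

```python
# Hypothetical toy data (NOT Draper's numbers): 3 institutions x 4 contextual categories.
# rates[i][j]  = proportion of students at institution i in category j who continue
# counts[i][j] = number of students at institution i in category j
rates = [
    [0.90, 0.80, 0.95, 0.85],
    [0.85, 0.70, 0.90, 0.80],
    [0.95, 0.90, 0.98, 0.92],
]
counts = [
    [100, 50, 200, 150],
    [300, 100, 50, 50],
    [20, 30, 400, 50],
]


def weighted_mean(values, weights):
    """Mean of values, weighted by the matching entries of weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def benchmark(rates, counts):
    """Return a list of (O_i, E_i, D_i) tuples, one per institution."""
    n_categories = len(rates[0])
    # Sector-wide rate for each category, pooling students across institutions
    # (assumption: the sector average is weighted by student numbers).
    sector = [
        weighted_mean([r[j] for r in rates], [c[j] for c in counts])
        for j in range(n_categories)
    ]
    results = []
    for inst_rates, inst_counts in zip(rates, counts):
        observed = weighted_mean(inst_rates, inst_counts)  # O_i
        expected = weighted_mean(sector, inst_counts)      # E_i (the benchmark)
        results.append((observed, expected, observed - expected))
    return results


for o, e, d in benchmark(rates, counts):
    print(f"O = {o:.3f}  E = {e:.3f}  D = {d:+.3f}")
```

One property worth noting: under the pooled-average assumption, the differences D_i, weighted by institution size, sum to zero, so one institution can only sit above its benchmark if others sit below theirs.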

Computing standard errors of difference scores

The next step is far more complex. A z-score is computed by dividing the difference between observed and expected values on an indicator (Di) by a denominator, which is variously referred to as a standard deviation and a standard error in the documents on benchmarking.

For those who are not trained in statistics, the basic logic here is that the estimate of an institution's performance will be more labile if it is based on a small sample. If the institution takes on only 5 students each year, then estimates of completion rates from year to year will be variable - in a year where one student drops out, then the completion rate is only 80%, but if none drop out it will be 100%. You would not expect it to be constant, because random factors outside the control of the institution will affect student drop-outs. In contrast, for an institution with 1000 students, we will see much less variation from year to year. The standard error provides an estimate of the extent to which we expect the estimate of average drop-out to vary from year to year, taking size of population into account. 
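
To put rough numbers on this, here is a small Python sketch using the textbook standard error of a proportion, sqrt(p(1-p)/n). This is only an illustration of how sample size affects precision: it is not the Draper and Gittoes formula, and the underlying completion rate of 90% is an assumption chosen for the example.

```python
import math


def se_proportion(p, n):
    """Textbook (binomial) standard error of a proportion p based on n students."""
    return math.sqrt(p * (1 - p) / n)


# Assumed underlying completion rate of 90% (hypothetical value).
for n in (5, 1000):
    print(f"n = {n:4d}: SE = {se_proportion(0.9, n):.3f}")
```

With 5 students the standard error is around 13 percentage points; with 1000 students it is under 1 percentage point.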

To interpret benchmarked scores we need a way of estimating the standard error of the difference between the observed score on a metric (such as completion rate) and the benchmarked score, reflecting how much we would expect this to vary from one occasion to another. Only then can we judge whether the institution's performance is in line with expectation. 

Draper (2006) walks the reader through a standard method for computing the standard errors, based on the rather daunting formulae of Figure 1. The values in the SE column of table 2 are computed this way, and the z-scores are obtained by dividing each Di value by its corresponding SE.

Figure 1. Formulae 5 to 8, used to compute difference scores and standard errors (Draper, 2006)
Now anyone familiar with z-scores will notice something pretty odd about the values in Table 2. The absolute z-scores given by this method seem remarkably large: In this toy example, we see z-scores with absolute values of 5, 9 and 20.  Usually z-scores range from about -3 to 3. (Draper noted this point).

Z-scores in real TEF data

Next, I downloaded some real TEF data, so I could see whether the distribution of z-scores was unusual. Data from Year 2 (2017) in .csv format can be downloaded from this website.
The z-scores here have been computed by HESA. Here is the distribution of core z-scores for one of the metrics (Non-continuation) for the 233 institutions with data on FullTime students.

The distribution is completely out of line with what we would expect from a z-score distribution.  Absolute z-scores greater than 3, which should be vanishingly rare, are common - with the exact proportion varying across the six available metrics, but ranging from 33% to 58%.

Yet, they are interpreted in TEF as if a large z-score is an indicator of abnormally good or poor performance:

From p. 42  of this pdf giving Technical Specifications:

"In TEF metrics the number of standard deviations that the indicator is from the benchmark is given as the Z-score. Differences from a benchmark with a Z-score +/-1.9623 will be considered statistically significant. This is equivalent to a 95% confidence interval (that is, we can have 95% confidence that the difference is not due to chance)."

What does the z-score represent?

Z-scores feature heavily in my line of work: in psychological assessment they are used to identify people whose problems are outside the normal range. However, they aren't computed like the TEF z-scores, because they involve dividing the difference between an individual's score and the population mean by the standard deviation, rather than by the standard error.

It's easiest to explain this by an analogy. I'm 169 cm tall. Suppose you want to find out if that's out of line with the population of women in Oxford. You measure 10,000 women and find their mean height is 170 cm, with a standard deviation of 3. On a conventional z-score, my height is unremarkable. You just take the difference between my height and the population mean and divide by the standard deviation, -1/3, to give a z-score of -0.33. That's well within the normal limits used by TEF of -1.96 to 1.96.

Now let's see how precisely we have estimated the population mean. To do that we compute the standard error, which is the standard deviation divided by the square root of the sample size, which gives 3/100 or .03. From that information we can get an estimate of the precision of our estimate of the population mean: we multiply the SE by 1.96, and add and subtract that value from the mean to get 95% confidence limits, which are 169.94 and 170.06. If we were to compute the z-score corresponding to my height using the SE instead of the SD, I would seem to be alarmingly short: the value would be -1/.03 = -33.33.

So what does that mean? Well the second z-score based on the SE does not test whether my height is in line with the population of 10,000 women. It tests whether my height can be regarded as equivalent to that of the average from that population. Because the population is very large, the estimate of the average is very precise, and my height is outside the error of measurement for the mean.
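
The arithmetic of the height analogy can be checked in a few lines of Python. The variable names are mine; the numbers are those given in the analogy.

```python
import math

height = 169.0    # my height in cm
pop_mean = 170.0  # mean height of the 10,000 women measured
pop_sd = 3.0      # standard deviation of their heights
n = 10_000        # sample size

# Conventional z-score: how unusual is this individual in the population?
z_sd = (height - pop_mean) / pop_sd

# Standard error of the mean: how precisely is the population mean estimated?
se = pop_sd / math.sqrt(n)
ci_low, ci_high = pop_mean - 1.96 * se, pop_mean + 1.96 * se

# SE-based "z-score": does this individual differ from the estimated mean?
z_se = (height - pop_mean) / se

print(f"SD-based z = {z_sd:.2f}")
print(f"SE = {se:.2f}, 95% limits = {ci_low:.2f} to {ci_high:.2f}")
print(f"SE-based z = {z_se:.2f}")
```

The two scores differ by a factor of the square root of the sample size (here 100), which is why the SE-based version grows as the population gets larger.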

The problem with the TEF data is that they use the latter, SE-based method to evaluate differences from the benchmark value, but appear to interpret it as if it were a conventional SD-based z-score:

E.g. in the Technical Specifications document (5.63):

As a test of the likelihood that a difference between a provider’s benchmark and its indicator is due to chance alone, a z-score +/- 3.0 means the likelihood of the difference being due to chance alone has reduced substantially and is negligible.

As illustrated with the height analogy, the SE-based method seems designed to over-identify high and low-achieving institutions. The only step taken to counteract this trend is an ad hoc one: because large institutions are particularly prone to obtain extreme scores, a large absolute z-score is only flagged as 'significant' if the absolute difference score is also greater than 2 or 3 percentage points. Nevertheless, the number of flagged institutions for each metric is still far higher than would be the case if conventional z-scores based on the SD were used.

Relationship between SE-based and SD-based z-scores
(N.B. Thanks to Jonathan Mellon who noted an error in my script for computing the true z-scores. 
This update and correction made 20.20 p.m. on 6 March 2019).
I computed conventional z-scores by dividing each institution's difference from benchmark by the SD of the difference scores for all institutions and plotted them against the TEF z-scores. An example for one of the metrics is shown below. The range is in line with expectation (most values between -3 and +3) for the conventional z-scores, but much bigger for the TEF z-scores.
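
In code, the conventional z-score calculation just described is very short. This is a Python sketch rather than the R code from my script: the difference scores are made up, the function name is hypothetical, and the use of the sample (n-1) standard deviation is an assumption.

```python
import statistics


def conventional_z(differences):
    """Divide each difference-from-benchmark by the SD of all the differences."""
    sd = statistics.stdev(differences)  # sample SD (n-1 denominator); an assumption
    return [d / sd for d in differences]


# Hypothetical difference scores (percentage points) for eight institutions.
diffs = [-4.1, -2.0, -0.8, -0.3, 0.2, 0.9, 1.7, 3.5]
for d, z in zip(diffs, conventional_z(diffs)):
    print(f"difference = {d:+.1f}  z = {z:+.2f}")
```

Because every institution is divided by the same SD, the size of the institution no longer inflates the score, unlike the SE-based version.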

Conversion of z-scores into flags

In TEF benchmarking, TEF z-scores are converted into 'flags', ranging from - - or -, to denote performance below expectation, up to + or ++ for performance above expectation, with = used to indicate performance in line with expectation. It is these flags that the TEF panel considers when deciding which award (Gold, Silver or Bronze) to give.

Draper-Gittoes z-scores are flagged for significance as follows:
  •  - - z-score of -3 or less, AND an absolute difference between observed and expected values of 3%. 
  •  - z-score of -2 or less, AND an absolute difference between observed and expected values of 2%. 
  •  + z-score of 2 or more, AND an absolute difference between observed and expected values of 2%.   
  • ++ z-score of 3 or more, AND an absolute difference between observed and expected values of 3%. 
Given the problems with the method outlined above, this method is likely to massively overdiagnose both problems and good performance.
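
The flagging rules listed above can be written as a small Python function. This is a sketch of the rules as I have summarised them, not official TEF code: the function name is hypothetical, and I am assuming the thresholds are inclusive ('at least' 2 or 3 percentage points).

```python
def tef_flag(z, diff):
    """Return a flag for a TEF z-score and a difference from benchmark.

    z    : Draper-Gittoes (SE-based) z-score
    diff : observed minus expected value, in percentage points
    Thresholds are treated as inclusive, which is an assumption.
    """
    size = abs(diff)
    if z <= -3 and size >= 3:
        return "--"
    if z <= -2 and size >= 2:
        return "-"
    if z >= 3 and size >= 3:
        return "++"
    if z >= 2 and size >= 2:
        return "+"
    return "="


# A very large z-score with a small absolute difference is not flagged.
print(tef_flag(20.0, 1.0))
# A z-score beyond -3 with a difference between 2 and 3 points gets a single flag.
print(tef_flag(-3.5, -2.5))
```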

Using quantiles rather than TEF z-scores

Given that the z-scores obtained with the Draper-Gittoes method are so extreme, it could be argued that flags should be based on quantiles rather than z-score cutoffs, omitting the additional absolute difference criterion. For instance, for the Year 2 TEF data (Core z-scores) we can find cutoffs corresponding to the most extreme 5% or 1%.  If flags were based on these, then we would award extreme flags (- - or ++) only to those with negative z-scores of -13.7 or less, or positive z-scores of 14.6 or more; less extreme flags would be awarded to those with negative z-scores of -7 or less (- flag), or positive z-scores of 8.6 or more (+).
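
Here is a minimal Python sketch of how quantile-based cutoffs could be derived and applied. The z-scores are simulated rather than real TEF data, the function names are hypothetical, and I am assuming that 'most extreme 5% or 1%' means 5% or 1% in each tail; other readings would just change which percentiles are picked out.

```python
import random
import statistics


def quantile_cutoffs(z_scores):
    """Cutoffs at the 1st, 5th, 95th and 99th percentiles of the z-scores."""
    pct = statistics.quantiles(z_scores, n=100)  # 99 cut points; pct[k-1] is percentile k
    return {"--": pct[0], "-": pct[4], "+": pct[94], "++": pct[98]}


def quantile_flag(z, cut):
    """Flag a z-score using quantile cutoffs, with no absolute-difference rule."""
    if z <= cut["--"]:
        return "--"
    if z <= cut["-"]:
        return "-"
    if z >= cut["++"]:
        return "++"
    if z >= cut["+"]:
        return "+"
    return "="


# Simulated, over-dispersed z-scores for 233 institutions (NOT the real data).
random.seed(0)
z_scores = [random.gauss(0, 5) for _ in range(233)]

cut = quantile_cutoffs(z_scores)
flags = [quantile_flag(z, cut) for z in z_scores]
print(cut)
print({f: flags.count(f) for f in ("--", "-", "=", "+", "++")})
```

By construction, roughly 10% of institutions are flagged whatever the spread of the z-scores, which is the point of using quantiles.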

Update 6th March: An alternative way of achieving the same end would be to use the TEF cutoffs with conventional z-scores; this would achieve a very similar result.

Deriving award levels from flags

It is interesting to consider how this change in procedure would affect the allocation of awards. In TEF, the mapping from raw data to awards is complex and involves more than just a consideration of flags: qualitative information is also taken into account. Furthermore, as well as the core metrics, which we have looked at above, the split metrics are also considered - i.e. flags are also awarded for subcategories, such as male/female, disabled/non-disabled: in all there are around 130 flags awarded across the six metrics for each institution. But not all flags are treated equally: the three metrics based on the National Student Survey are given half the weight of other metrics.
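
To make the weighting concrete, here is one way a weighted flag score for the core metrics could be computed in Python. This is a simplified sketch under my own assumptions, not the TEF panel's procedure: I treat single and double flags as counting equally, I net positive against negative flags, and the example flags are invented.

```python
def flag_score(flags):
    """Net weighted flag score for a list of (flag, is_nss) pairs.

    NSS-based metrics get half weight. Treating '+' and '++' (and '-' and
    '--') as equal, and summing positives and negatives, are assumptions.
    """
    sign = {"--": -1, "-": -1, "=": 0, "+": 1, "++": 1}
    return sum(sign[flag] * (0.5 if is_nss else 1.0) for flag, is_nss in flags)


# Invented example: three NSS metrics followed by three other metrics.
example = [("-", True), ("=", True), ("=", True),
           ("=", False), ("-", False), ("=", False)]
print(flag_score(example))
```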

Not surprisingly, if we were to recompute flag scores based on quantiles, rather than using the computed z-scores, the proportions of institutions with Bronze or Gold awards drop massively.

When TEF awards were first announced, there was a great deal of publicity around the award of Bronze to certain high-profile institutions, in particular the London School of Economics, Southampton University, University of Liverpool and the School of Oriental and African Studies. On the basis of quantile scores for Core metrics, none of these would meet criteria for Bronze: their flag scores would be -1, 0, -.5 and 0 respectively. But these are not the only institutions to see a change in award when quantiles are used. The majority of smaller institutions awarded Bronze obtain flag scores of zero.

The same is true of Gold Awards. Most institutions that were deemed to significantly outperform their benchmarks no longer do so if quantiles are used.


Should we therefore change the criteria used in benchmarking and adopt quantile scores? Because I think there are other conceptual problems with benchmarking, and indeed with TEF in general, I would not make that recommendation. I would prefer to see TEF abandoned. I hope the current analysis can at least draw people's attention to the questionable use of statistics used in deriving z-scores and their corresponding flags. The difference between a Bronze, Silver and Gold can potentially have a large impact on an institution's reputation. The current system for allocating these awards is not, to my mind, defensible.

I will, of course, be receptive to attempts to defend it or to show errors in my analysis, which is fully documented with scripts on github, benchmarking_Feb2019.rmd.

Saturday, 9 February 2019

The Paper-in-a-Day Approach

Guest post by
Jennifer L. Tackett
Northwestern University; Personality Across Development Lab

The PiaD approach was born of a desire to figure out a way, some way, any way, to tackle that ever-growing project list of studies-that-should-get-done-but-never-do. I’m guessing we all have these lists. These are the projects that come up when you’re sitting in a conference talk and lean over to your grad student and whisper (“You know, we actually have the data to test that thing they can’t test, we should do that!”), and your grad student sort of nods a little but also kind of looks like she wants to kill you. Or, you’re sitting in lab meeting talking about ideas, and suddenly shout, “Hey, we totally have data to test that! We should do that! Someone, add it to the list!” and people’s initial look of enthusiasm is quickly replaced by a somewhat sinister side-eye (or perhaps a look of desperation and panic; apparently it depends on who you ask). Essentially, anytime you come up with a project idea and think – Hey, that would be cool, we already have the data, and it wouldn’t be too onerous/lengthy, maybe someone wants to just write that up! – you may have a good PiaD paper.

In other words, the PiaD approach was apparently born out of a desire to finally get these papers written without my grad students killing me. Seems as reasonable a motivation as any.

The initial idea was simple.

-       You have a project idea that is circumscribed and straightforward.

-       You have data to test the idea.

-       The analyses to do so are not overly complex or novel.

-       The project topic is in an area that everyone in the lab¹ is at least somewhat (to very) familiar with.

What would happen if we all locked ourselves in the same room, with no other distractions, for a full day, and worked our tails off? Surely we could write this paper, right?

The answer was: somewhat, and at least sometimes, yes.
But even better were all the things we learned along the way.

We have been scheduling an annual PiaD since 2013. Our process has evolved a LOT along the way. Rather than giving a historical recounting, I thought I would summarize where we are at now – the current working process we have arrived at, and some of the unanticipated benefits and challenges that have come up for us over the years.

Our Current PiaD Approach

Front-End Work:
We write our PiaD papers in the late spring/early summer. Sometime in the fall, we decide as a group what the focus of the PiaD paper will be and who will be first author (see also: benefits and challenges). Then, in the months leading up to PiaD, the first author (and senior author, if not one-and-the-same) take care of some front-end tasks.² Accomplishing the front-end tasks is essential for making sure we can all hit the ground running on the day of. So, here are the things we do in advance:

1.              Write the present study paragraph: what exactly do we want to do, and why/how? (Now, we write this as a preregistration! But in the olden days, a thorough present study paragraph would do.)

2.              Run a first pass of the analyses (again, remember – data are already available and analyses are straightforward and familiar).

3.              Populate a literature review folder. We now use a shared reference manager library (Zotero) to facilitate this step and later populating references.

4.              Create a game plan – a list of the target outlet with journal submission guidelines, a list of all the tasks that must be accomplished on the actual DAY, a list of all the people on the team and preliminary assignments. The planning stage of PiaD is key – it can make or break the success of the approach. One aspect of this is being really thoughtful about task assignments. Someone used other data from that sample for a recent study? Put them on the Methods section. Someone used similar analyses in a recent project? Put them on re-running and checking analyses (first pass is always done by the first author in advance; another team member checks syntax and runs a fresh pass on the day. We also have multiple checks built in for examining final output). Someone has expertise in a related literature? Assign them appropriate sections of the Intro/Discussion. You get the idea. Leverage people’s strengths and expertise in the task assignments.

5.              Email a link to a Dropbox folder with all of the above, and attach 2-3 key references, to everyone on the team, a couple of weeks before the DAY. All team members are expected to read the key papers and familiarize themselves with the Dropbox folder ahead of time.

The DAY:
Because this process is pretty intense, and every paper is different, our PiaD DAYs always evolve a bit differently. Here are some key components for us:

1.     Start with coffee.

2.     Then start with the Game Plan. Make sure everyone understands the goal of the paper, the nature of the analyses, and their assigned tasks. Talk through the structure of the Introduction section at a broad level for feedback/discussion.


4.     Take a lunch break. Leave where you are. Turn your computers off. Eat some food. For the most part, we tend to talk about the paper. It’s nice for us to have this opportunity to process more openly mid-day, see where people are at, how the paper is shaping up, what else we should be thinking about, etc. The chance for free and open discussion is really important, after being in such a narrow task-focused state.


6.     Throughout the working chunks, we are constantly renegotiating the task list. Someone finishes their task more quickly, crosses it off the Game Plan (we use this as an active collaborative document to track our work in real time), and claims the next task they plan to move to.

7.     Although we have a “no distraction” work space³ for PiaD, we absolutely talk to one another throughout the day. This is one of the biggest benefits of PiaD – the ability to ask questions and get immediate answers, to have all the group minds tackling challenges as they arise. It’s a huge time efficiency to work in this way, and absolutely makes end decisions of much higher quality than the typical fragmented writing approach.

8.         Similarly, we have group check-ins about every 1-1.5 hours – where is everyone on their task? What will they move to work on next?

9.         Over the years, some PiaD members have found walks helpful, too. Feeling stuck? Peel someone off to go walk through your stuck-ness with you. Come back fresher and clearer.

10.       About an hour before end time, we take stock – how close are we to meeting our goals? How are things looking when we piece them all together? What tasks are we prioritizing in the final hour, and which will need to go unfinished and added to the back-end work for the first author? Some years, we are wrapping up the submission cover letter at this stage. Other years, we’re realizing we still have tasks to complete after PiaD. Just depends on the nature of the project.

11.       Celebrate. Ideally with some sort of shared beverage of choice. To each their own, but for us, this has often involved bubbles. And an early bedtime.

Jennifer celebrating with Kathleen, Cassie, Avanté, and bubbles

Back-End Work:

This will be different from year-to-year. Obviously, the goal with PiaD is to be done with the manuscript by the end of the day. EVEN WHEN THIS HAPPENS, we never, EVER do final-proofing the same day. We are just way too exhausted. So we usually give ourselves a couple of weeks to freshen up, then do our final proofing before submission. Other years, for a variety of reasons, various tasks remain. That’s just how it goes with manuscript writing. Even in this case, it is fair to say that the overwhelming majority of the work gets done on the DAY. So either way, it’s still a really productive mechanism (for us).

Some Benefits and Challenges

There are many of both. But overall, we have found this to be a really great experience for many reasons beyond actually getting some of these papers out in the world (which we have! Which is so cool!). Some of these benefits for us are:

1.     Bonding as a team. It’s a really great way to strengthen your community, come together in an informal space on a hard but shared problem, and struggle through it together.

2.     A chance to see one another work. This can be incredibly powerful, for example, for junior scholars to observe scientific writing approaches “in the wild”. It never occurred to me before my grad students pointed this out at our first PiaD, but they rarely get to see faculty actually work in this way. And vice versa!

3.     Accuracy, clarity, and error reduction. So many of our smaller errors could likely be avoided if we’re able to ask our whole team of experts our questions WHILE WE’RE WRITING THE PAPER. Real-time answers, group answers, a chance for one group member to correct another, etc. Good stuff.

4.     Enhancing ethical and rigorous practices. The level of accountability when you are all working in the same space at the same time on the same files is probably as good as you can get. How many of our problematic practices might be extinguished if we were always working with others like this?

5.     One of the goals I had with PiaD was to have the first author status rotate across the team – i.e., grad students would “take turns” being first author. I still think this is a great idea, as it’s a great learning experience for advanced grad students to learn how to manage team papers in this way. But, of course, it’s also HARD. So, be more thoughtful about scope of the project depending on seniority of the first author, and anticipate more front- and back-end work, accordingly.

Bottom Line

PiaD has been a really cool mechanism for my lab to work with and learn from over the years. It has brought us many benefits as a team, far beyond increased productivity. But the way it works best for each team is likely different, and tweaking it over time is the way to make it work best for you. I would love to hear more from others who have been trying something similar in their groups, and also want to acknowledge the working team on the approach outlined here: Kat Herzhoff, Kathleen Reardon, Avanté Smack, Cassie Brandes, and Allison Shields.


¹ For PiaD purposes, I am defining the lab as PI + graduate students.

² Some critics like to counter, well then it’s not really Paper IN A DAY, now is it??? (Nanny-nanny-boo-boo!) Umm.. I guess not? Or maybe we can all remember that time demarcations are arbitrary and just chill out a bit? In all seriousness, if we all lived in the world where our data were perfectly cleaned and organized, all our literature folders were populated and labeled, etc. – maybe the tasks could all be accomplished in a single day. But unfortunately, my lab isn’t that perfect. YET. (Grad students sending me murderous side-eye over the internet.)

³ The question of music or no-music is fraught conversational territory. You may need to set these parameters in advance to avoid PiaD turmoil and potential derailment. You may also need your team members to provide definition of current terminology in advance, in order to even have the conversation at all. Whatever you do, DON’T start having conversations about things like “What is Norm-core?” and everyone googling “norm-core”, and then trying to figure out if there is “norm-core music”, and what that might be. It’s a total PiaD break-down at that point.


Saturday, 12 January 2019

NeuroPointDX's blood test for Autism Spectrum Disorder: a critical evaluation

NeuroPointDX (NPDX), a Madison-based biomedical company, is developing blood tests for early diagnosis of Autism Spectrum Disorder (ASD). According to their Facebook page, the NPDX ASD test is available in 45 US states. It does not appear to require FDA approval. On the Payments tab of the website, we learn that the test is currently self-pay (not covered by insurance), but for those who have difficulty meeting the costs, a Payment Plan is available, whereby the test is conducted after a down payment is received, but the results are not disclosed to the referring physician until two further payments have been made.

So what does the test achieve, and what is the evidence behind it?

Claims made for the test
On their website, NPDX describe their test as a 'tool for earlier ASD diagnosis'. Specifically they say:
'It can be difficult to know when to be concerned because kids develop different skills, like walking and talking, at different times. It can be hard to tell if a child is experiencing delayed development that could signal a condition like ASD or is simply developing at a different pace compared to his or her peers...... This is why a biological test, one that’s less susceptible to interpretation, could help doctors diagnose children with ASD at a younger age. The NPDX ASD test was developed for children as young as 18 months old.'
They go on to say:
'In our research of autism spectrum disorder (ASD) and metabolism, we found differences in the metabolic profiles of certain small molecules in the blood of children with ASD. The NPDX ASD test measures a number of molecules in the blood called metabolites and compares them to these metabolic profiles.

The results of our metabolic test provide the ordering physician with information about the child’s metabolism. In some instances, this information may be used to inform more precise treatment. Preliminary research suggests, for example, that adding or removing certain foods or supplements may be beneficial for some of these children. NeuroPointDX is working on further studies to explore this.

The NPDX ASD test can identify about 30% of children with autism spectrum disorder with an increased risk of an ASD diagnosis. This means that three in 10 kids with autism spectrum disorder could receive an earlier diagnosis, get interventions sooner, and potentially receive more precise treatment suggestions from their doctors, based on information about their own metabolism.'
They further state that this is:  'A new approach to thinking about ASD that has been rigorously validated in a large clinical study' and they note that results from their Children’s Autism Metabolome Project (CAMP) study have been 'published in a peer-reviewed, highly-regarded journal, Biological Psychiatry'.

The test is recommended for a child who:
  • Has failed screening for developmental milestones indicating risk for ASD (e.g. M-CHAT, ASQ-3, PEDS, STAT, etc.). 
  • Has a family history such as a sibling diagnosed with ASD. 
  • Has an ASD diagnosis for whom additional metabolic information may provide insight into the child’s condition and therapy.
In September, Xconomy, which reports on biotech developments, ran an interview with Stemina CEO and co-founder Elizabeth Donley, which gives more background, noting that the test is not intended as a general population screen, but rather as a way of identifying specific subtypes among children with developmental delay.

Where are the non-autistic children with developmental delay? 
I looked at the published paper from the CAMP study in Biological Psychiatry.

Given the recommendations made by NPDX, I had expected that the study would take children with developmental delay and compare metabolomic profiles in those who did and did not subsequently meet diagnostic criteria for ASD.

However, what I found instead was a study that compared metabolomics in 516 children with a certified diagnosis of ASD and 164 typically-developing children. There was a striking difference between the two groups in 'developmental quotient (DQ)', which is an index of overall developmental level. The mean DQ for the ASD group was 62.8 (SD = 17.8), whereas that of the typically developing comparison group was 100.1 (SD = 16.5). This information can be found in Supplementary Materials Table 3.
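
To give a sense of how large this group difference is, the means and SDs quoted above can be converted into a standardised effect size. The Python sketch below uses a pooled-SD Cohen's d, which is my choice of summary rather than anything reported in the paper, and it assumes that DQ scores were available for every child in each group (516 and 164).

```python
import math


def cohens_d(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Standardised mean difference (a minus b) using the pooled SD."""
    pooled_var = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var)


# Typically-developing group vs ASD group, values from Supplementary Table 3.
# Group sizes are assumed to equal the full samples.
d = cohens_d(100.1, 16.5, 164, 62.8, 17.8, 516)
print(f"Cohen's d = {d:.2f}")
```

On these assumptions the groups differ by a little over two pooled standard deviations in overall developmental level.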

It is not possible, using this study design, to use metabolomic results to distinguish children with ASD from other cases of developmental delay. To do that, we'd need a comparison sample of non-autistic children with developmental delay.

The CAMP study is registered, and the registry entry describes it as follows:
'The purpose of this study is to identify a metabolite signature in blood plasma and/or urine using a panel of biomarker metabolites that differentiate children with autism spectrum disorder (ASD) from children with delayed development (DD) and/or typical development (TD), to develop an algorithm that maximizes sensitivity and specificity of the biomarker profile, and to evaluate the algorithm as a diagnostic tool.' (My emphasis)
The study is also included on the NIH Project Reporter portfolio, where the description includes the following information:
'Stemina seeks funding to enroll 1500 patients in a well-defined clinical study to develop a biomarker-based diagnostic test capable of classifying ASD relative to other developmental delays at greater than 80% accuracy. In addition, we propose to identify metabolic subtypes present within the ASD spectrum that can be used for personalized treatment. The study will include ASD, DD and TD children between 18 and 48 months of age. Inclusion of DD patients is a novel and important aspect of this proposed study from the perspective of a commercially available diagnostic test.' (My emphasis)
So, the authors were aware that it was important to include a group with developmental delay, but they then reported no data on this group. Such children are difficult to recruit, especially for a study involving invasive procedures, and it is not unusual for studies to fail to meet recruitment goals. That is understandable. But it is not understandable that the test should then be described as being useful for diagnosing ASD from within a population with developmental delay, when it has not been validated for that purpose.

Is the test more accurate than behavioural diagnostic tests? 
A puzzling aspect of the NPDX claims is a footnote (marked *) on this webpage:
'Our test looks for certain metabolic imbalances that have been identified through our clinical study to be associated with ASD. When we detect one or more imbalance(s), there is an increased risk that the child will receive an ASD diagnosis'
*Compared to the results of the ADOS-2 (Autism Diagnostic Observation Schedule), Second Edition
It's not clear exactly what is meant by this: it sounds as though the claim is that the blood test is more accurate than ADOS-2. That can't be right, though, because in the CAMP study, we are told: 'The Autism Diagnostic Observation Schedule–Second Version (ADOS-2) was performed by research-reliable clinicians to confirm an ASD diagnosis.' So all the ASD children in the study met ADOS-2 criteria. It looks like 'compared to' means 'based on' in this context, but it is then unclear what the 'increased risk' refers to.

How reliable is the test?
A test's validity depends crucially on its reliability: if a blood test gives different results on different occasions, then it cannot be used for diagnosis of a long-term condition. Presumably because of this, the registered account of the study states: 'A subset of the subjects will be asked to return to the clinic 30-60 days later to obtain a replicate metabolic profile.' Yet no data on this replicate sample is reported in the Biological Psychiatry paper.

I have no expertise in metabolomics, but it seems reasonable to suppose that amines measured in the blood may vary from one occasion to another; indeed in 2014 the authors published a preliminary report on a smaller sample from CAMP, where they specifically noted that, presumably to minimise impact of medication or special diets, blood samples were taken when the child was fasting and prior to morning administration of medication. (34% of the ASD group and 10% of the typically-developing group were on regular medication, and 19% of the ASD group were on gluten and/or casein-free diets).

I contacted the authors to ask for information on this point. They did not provide any data on test-retest reliability beyond stating:
Thirty one CAMP subjects were recruited at random for a test-retest analysis during CAMP. These subjects were all amino acid dysregulation metabotype negative at the initial time point (used in the analysis for the manuscript). The subjects were sampled 30-60 days later for retest analysis. At the second time point the 31 subjects were still metabotype negative. There are plans for additional resampling of a select group of CAMP subjects. These will include metabotype positive individuals.
Thus, we do not currently know whether a positive result on the NPDX ASD test is meaningful, in the sense of being a consistent physiological marker in the individual.

Scientific evaluation of the methods used in the Biological Psychiatry paper 
The Biological Psychiatry paper describing development of the test is highly complex, involving a wide range of statistical methods. In their previous paper with a smaller sample, the authors described thousands of blood markers and claimed that using machine learning methods, they could identify a subset that discriminated the ASD and typically-developing groups with above chance accuracy. However, they noted this finding needed confirmation in a larger sample.

In the 2018 Biological Psychiatry paper, no significant differences were found for measures of metabolite abundance, failing to replicate the 2014 findings. However, further consideration of the data led the authors to concentrate instead on ratios between metabolites. As they noted: 'Ratios can uncover biological properties not evident with individual metabolites and increase the signal when two metabolites with a negative correlation are evaluated.'

Furthermore, they focused on individuals with extreme values for ratio scores, on the grounds that ASD is a heterogeneous condition, and the interest is in identifying subgroups who may have altered metabolism. The basic logic is illustrated in Figure 1 – the idea is to find a cutoff on the distribution which selects a higher proportion of ASD than typical cases. Because 76% of the sample are ASD cases, we would expect 76% of the cases in the tail of the distribution to have ASD if the measure were unrelated to diagnosis. However, by exploring different cutoffs, it can be possible to identify a higher proportion. The proportion of cases above a positive cutoff (or below a negative cutoff) that have ASD is known as the positive predictive value (PPV), and for some of the ratios examined by the researchers, it was over 90%.

Figure 1: Illustrative distributions of z-scores for 4 of the 31 metabolites in ASD and typical group: this plot shows raw levels for metabolites; blue boxes show the numbers falling above or below a cutoff that is set to maximise group differences. The final analysis focused on ratios between metabolites, rather than raw levels. From Figure S2, Smith et al (2018).
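The arithmetic behind the base rate and the PPV can be sketched in a few lines. The group sizes are those given above; the tail counts in the example are hypothetical numbers chosen only to show the calculation, not figures from the paper.

```python
N_ASD, N_TD = 516, 164  # group sizes reported for the CAMP sample

def base_rate(n_asd, n_td):
    """Proportion of the whole sample that has ASD."""
    return n_asd / (n_asd + n_td)

def ppv(n_asd_in_tail, n_td_in_tail):
    """Proportion of cases beyond the cutoff that have ASD."""
    return n_asd_in_tail / (n_asd_in_tail + n_td_in_tail)

rate = base_rate(N_ASD, N_TD)
example_ppv = ppv(45, 3)  # hypothetical: 45 ASD and 3 typical cases beyond the cutoff

print(f"Base rate of ASD in the sample: {rate:.2f}")
print(f"PPV for the hypothetical tail:  {example_ppv:.4f}")
```

A cutoff is only informative to the extent that its PPV exceeds the base rate: a tail with a PPV equal to the base rate is what an uninformative measure would give.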

This kind of approach readily lends itself to finding spurious 'positive' results, insofar as one is first inspecting the data and then identifying a cutoff that maximises the difference between two groups. It is noteworthy that the metabolites that were selected for consideration in ratio scores were identified on the basis that they showed negative correlations within a subset of the ASD sample (the 'training set'). Accordingly, PPV values from a 'training set' are likely to be biased and will over-estimate group differences. However, to avoid circularity, one can take cutoffs from the training set, and then see how they perform with a new subset of data that was not used to derive the cutoff – the 'test set'. Provided the test set is predetermined prior to any analysis, and totally separate from the training set, then the results with the test set can be regarded as giving a good indication of how the test would perform in a new sample. This is a standard way of approaching this kind of classification problem.

Usually, the PPV for a test set will be less good than for a training set: this is just a logical consequence of the fact that observed differences between groups will involve random noise as well as true population differences, and these will boost the PPV. In the test set, random effects will be different and so are more likely to hinder rather than help prediction, and so PPV will decline. However, in the Biological Psychiatry paper, the PPVs for the test sets were only marginally different from those from the training sets: for the ratios described in Table 1, the mean PPV was .887 (range .806 - .943) for the training set, and mean .880 (range .757 - .975) for the test set.
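The expected drop from training to test can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reconstruction of the authors' analysis (their scripts are unavailable): the scores are pure noise with no true group difference, the cutoff is chosen to maximise PPV in the training set, the set size of 340 and the minimum tail size of 20 are arbitrary choices, and all function names are invented for this example.

```python
import random

BASE_RATE = 516 / (516 + 164)  # proportion of ASD cases in the CAMP sample

def simulate_sample(n, rng):
    """Scores are pure noise: they carry no information about the label."""
    scores = [rng.gauss(0.0, 1.0) for _ in range(n)]
    labels = [1 if rng.random() < BASE_RATE else 0 for _ in range(n)]
    return scores, labels

def fit_cutoff(scores, labels, min_tail=20):
    """Choose the cutoff whose upper tail has the highest PPV in this sample."""
    ranked = sorted(zip(scores, labels), reverse=True)
    best_cutoff, best_ppv, n_asd = None, -1.0, 0
    for k, (score, label) in enumerate(ranked, start=1):
        n_asd += label
        if k >= min_tail and n_asd / k > best_ppv:
            best_cutoff, best_ppv = score, n_asd / k
    return best_cutoff, best_ppv

def ppv_at(scores, labels, cutoff):
    """PPV among cases at or above a fixed cutoff; None if the tail is empty."""
    tail = [label for score, label in zip(scores, labels) if score >= cutoff]
    return sum(tail) / len(tail) if tail else None

def run(n_reps=500, n_per_set=340, seed=1):
    """Average training and test PPV over repeated simulated studies."""
    rng = random.Random(seed)
    train_ppvs, test_ppvs = [], []
    for _ in range(n_reps):
        train = simulate_sample(n_per_set, rng)
        test = simulate_sample(n_per_set, rng)
        cutoff, train_ppv = fit_cutoff(*train)
        test_ppv = ppv_at(*test, cutoff)  # cutoff is fixed before the test set is seen
        if test_ppv is not None:
            train_ppvs.append(train_ppv)
            test_ppvs.append(test_ppv)
    return sum(train_ppvs) / len(train_ppvs), sum(test_ppvs) / len(test_ppvs)

mean_train, mean_test = run()
print(f"mean training PPV: {mean_train:.3f}")
print(f"mean test PPV:     {mean_test:.3f}")
```

In this null scenario the training PPV sits above the base rate purely because the cutoff was tuned to the training data, while the test PPV falls back towards the base rate. With a real effect the gap would be smaller, but some shrinkage would still be expected.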

I wanted to understand this better, and asked the authors for their analysis scripts, so I could reconstruct what they did. Here is the reply I received from Beth Donley:
We would be happy to have a call to discuss the methodology used to arrive at the findings in our paper. Our scripts and the source code they rely on are proprietary and will not be made public unless and until we publish them in a paper of our own. We think it would be more meaningful to have a call to discuss our approach so that you can ask questions and we can provide answers.
My questions were sufficiently technical and complex that this was not going to work, so I provided written questions, to which I received responses. However, although the replies were prompt, they did not really inspire confidence, and, without the scripts I could not check anything.

For instance:
My question: Is there an explanation for why the PPVs are so similar for training and test datasets? Usually you'd expect a drop in PPV in the test dataset if the function was optimised for the training dataset, just because the training threshold would inevitably be capitalising on chance.
Response: We observed this phenomenon, as well, and were surprised by the similarity of the training and test confusion matrix performance metric values. We have no way to know why the metrics were similar between sets. Our best guess is that the demographics of the training and test set of subjects had closely matched demographic and study related variables.
But the demographic similarity between a test and training set is not the main issue here. One thing that crucially determines how close the results will be is the reliability of the metabolomic measure. The lower the test-retest reliability of the measure, the more likely that results from a training set will fail to replicate. So it would be helpful if the authors would report the quantitative data that they have on this question.

If we ignore all the problems, how good is prediction? 
Unfortunately, it is virtually impossible to tell how accurate the test would be in a real-life context. First, we would have to make the assumption that a non-autistic group with developmental delay would be comparable to the typically-developing group. If non-autistic children with developmental delay show metabolomic imbalances, then the test's potential for diagnosis of ASD is compromised. Second, we would have to come up with an estimate of how many children who are given the test will actually have ASD: that's very hard to judge, but let us suppose it may be as high as 50%. Then, for the ratios reported in the Biological Psychiatry paper, we can compute that around 50% to 83% of those testing positive would have ASD. Note that the majority of children with and without ASD won't have scores in the tail of the distribution and will not therefore test positive (see Figure 1). On the NPDX website it is claimed that around 30% of children with ASD test positive: that is hard to square with the account in Biological Psychiatry, which reported 'an altered metabolic phenotype' in 16.7% of those with ASD.
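One way to see how PPV depends on the proportion of tested children who have ASD is to rescale a PPV through the positive likelihood ratio. The sketch below assumes that the test's sensitivity and specificity carry over unchanged to the new population, which is exactly the assumption questioned above. It does not attempt to reproduce the 50% to 83% range, which depends on which ratios and values were used; the function name and the example inputs are chosen here for illustration.

```python
STUDY_PREVALENCE = 516 / (516 + 164)  # proportion of ASD cases in the CAMP sample

def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

def rescale_ppv(ppv_study, prev_study, prev_new):
    """PPV expected at a new prevalence, holding the likelihood ratio fixed."""
    lr_positive = odds(ppv_study) / odds(prev_study)  # positive likelihood ratio
    new_odds = odds(prev_new) * lr_positive
    return new_odds / (1 + new_odds)

# PPVs quoted in the text for the test sets: the mean and the lowest value
for label, study_ppv in [("mean test-set PPV", 0.880), ("lowest test-set PPV", 0.757)]:
    new_ppv = rescale_ppv(study_ppv, STUDY_PREVALENCE, 0.50)
    print(f"{label}: {study_ppv:.3f} in the study, {new_ppv:.3f} at 50% prevalence")
```

A PPV that is no higher than the study's own base rate of 76% corresponds to a likelihood ratio of about 1, so at 50% prevalence a positive result on such a ratio would be close to a coin toss.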

Conflict of interest and need for transparency
The published paper gives a comprehensive COI statement as follows:
AMS, MAL, and REB are employees of, JJK and PRW were employees of, and ELRD is an equity owner in Stemina Biomarker Discovery Inc. AMS, JJK, PRW, MAL, ELRD, and REB are inventors on provisional patent application 62/623,153 titled “Amino Acid Analysis and Autism Subsets” filed January 29, 2018. DGA receives research funding from the National Institutes of Health, the Simons Foundation, and Stemina Biomarker Discovery Inc. He is on the scientific advisory boards of Stemina Biomarker Discovery Inc. and Axial Therapeutics.
It is generally accepted that just because there is COI, this does not invalidate the work: it simply provides a context in which it can be interpreted. The study reported in Biological Psychiatry represents a huge investment of time and money, with research funds contributed from both public and private sources. In the Xconomy interview, it is stated that the research has cost $8 million to date. This kind of work may only be possible to do with involvement of a biotechnology company which is willing to invest funds in the hope of making discoveries that can be commercialised; this is a similar model to drug development.

Where there is a strong commercial interest in the outcome of research, the best way of counteracting negative impressions is for researchers to be as open and transparent as possible. This was not the case with the NPDX study: as described above, there were substantial changes from the registered protocol that were not discussed in the paper. The analysis scripts are not available – this means we have to take on trust details of the methods in an area where the devil is in the detail. As Philip Stark has argued, a paper that is long on results but short on methods is more like an advertisement than a research communication: "Science should be ‘show me’, not ‘trust me’; it should be ‘help me if you can’, not ‘catch me if you can’."

On 27th December, Biological Psychiatry published correspondence on the Smith et al paper by Kristin Sainani and Steven Goodman from Stanford University. They raised some of the points noted above regarding the lack of predictive utility of the blood test in clinical contexts, the lack of a comparison sample with developmental delay, and the conflict of interest issues. In their response, the authors made the point that they had noted these limitations in their published paper.

Sainani, K. L., & Goodman, S. N. (2018). Lack of diagnostic utility of 'amino acid dysregulation metabotypes'. Biological Psychiatry. doi:10.1016/j.biopsych.2018.11.012

Smith, A. M., Donley, E. L. R., Burrier, R. E., King, J. J., & Amaral, D. G. (2018). Reply to: Lack of Diagnostic Utility of “Amino Acid Dysregulation Metabotypes”. Biological Psychiatry. doi:

Smith, A. M., King, J. J., West, P. R., Ludwig, M. A., Donley, E. L. R., Burrier, R. E., & Amaral, D. G. (2018). Amino acid dysregulation metabotypes: Potential biomarkers for diagnosis and individualized treatment for subtypes of autism spectrum disorder. Biological Psychiatry. doi:

Stark, P. (2018). Before reproducibility must come preproducibility. Nature, 557, 613. doi:10.1038/d41586-018-05256-0

West, P. R., Amaral, D. G., Bais, P., Smith, A. M., Egnash, L. A., Ross, M. E., . . . Burrier, R. E. (2014). Metabolomics as a tool for discovery of biomarkers of Autism Spectrum Disorder in the blood plasma of children. PLOS One, 9(11), e112445. doi: