This is a preprint of a review written for the Journal of Mind and Behavior.
Monday, 15 April 2019
Innate: How the Wiring of Our Brains Shapes Who We Are. Kevin J. Mitchell. Princeton, New Jersey, USA: Princeton University Press, 2018, 293 pages, hardcover. ISBN: 978-0-691-17388-7.
Most of us are perfectly comfortable hearing about biological bases of differences between species, but studies of biological bases of differences between people can make us uneasy. This can create difficulties for the scientist who wants to do research on the way genes influence neurodevelopment: if we identify genetic variants that account for individual differences in brain function, then it may seem a small step to concluding that some people are inherently more valuable than others. And indeed in 2018 we have seen calls for use of polygenic risk scores to select embryos for potential educational attainment (Parens et al, 2019). There has also been widespread condemnation of the first attempt to create a genetically modified baby using CRISPR technology (Normile, 2018), with the World Health Organization responding by setting up an advisory committee to develop global standards for governance of human genome editing (World Health Organization, 2019).
Kevin Mitchell's book Innate is essential reading for anyone concerned about the genetics behind these controversies. The author is a superb communicator, who explains complex ideas clearly without sacrificing accuracy. The text is devoid of hype and wishful thinking, and it confronts the ethical dilemmas raised by this research area head-on. I'll come back to those later, but will start by summarising Mitchell's take on where we are in our understanding of genetic influences on neurodevelopment.
Perhaps one of the biggest mistakes that we've made in the past is to teach elementary genetics with an exclusive focus on Mendelian inheritance. Mendel and his peas provided crucial insights into units of inheritance, allowing us to predict precisely the probabilities of different outcomes in offspring of parents through several generations. The discovery of DNA provided a physical instantiation of the hitherto abstract gene, as well as providing insight into mechanisms of inheritance. During the first half of the 20th century it became clear that there are human traits and diseases that obey Mendelian laws impeccably: blood groups, Huntington's disease, and cystic fibrosis, to name but a few. The problem is that many intelligent laypeople assume that this is how genetics works in general. If a condition is inherited, then the task is to track down the gene responsible. And indeed, 40 years ago, many researchers took this view, and set out to track genes for autism, hearing loss, dyslexia and so on. Ben Goldacre's (2014) comment 'I think you'll find it's a bit more complicated than that' was made in a rather different context, but is a very apt slogan to convey where genetics finds itself in 2019. Here are some of the key messages that the author conveys, with clarity and concision, which provide essential background to any discussion of ethical implications of research.
1. Genes are not a blueprint
The same DNA does not lead to identical outcomes. We know this from the study of inbred animals, from identical human twins, and even from studying development of the two sides of the body in a single person. How can this be? DNA is a chemically inert material, which carries instructions for how to build a body from proteins in a sequence of bases. Shouldn't two organisms with identical DNA turn out the same? The answer is no, because DNA can in effect be switched on and off: that's how it is possible for the same DNA to create a wide variety of different cell types, depending on which genes are expressed and when. As Mitchell puts it: "While DNA just kind of sits there, proteins are properly impressive – they do all sorts of things inside cells, acting like tiny molecular machines or robots, carrying out tens of thousands of different functions." DNA is chemically stable, but messenger RNA, which conveys the information to the part of the cell where proteins are produced, is much less so. Individual cells transcribe messenger RNA in bursts. There is variability in this process, which can lead to differences in development.
2. Chance plays an important role in neurodevelopment
Consideration of how RNA functions leads to an important conclusion: factors affecting neurodevelopment can't just be divided into genetic vs. environmental influences: random fluctuations in the transcription process mean that chance also plays a role.
Moving from the neurobiological level, Mitchell notes that the interpretation of twin studies tends to ignore the role of chance. When identical (monozygotic or MZ) twins grow up differently, this is often attributed to the effects of 'non-shared environment', implying there may have been some systematic differences in their experiences, either pre- or post-natal, that led them to differ. But such effects don't need to be invoked to explain why identical twins can differ: this can arise because of random effects operating at a very early stage of neurodevelopment.
3. Small initial differences can lead to large variation in outcome
If chance is one factor overlooked in many accounts of genetics, development is the other. There are interactions between proteins, such that when messenger RNA from gene A reaches a certain level, this will increase expression of genes B and C. Those genes in turn can affect others in a cascading sequence. This mechanism can amplify small initial differences to create much larger effects.
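Points 2 and 3 can be made concrete with a toy simulation. This is my own illustrative sketch, not a model from Innate, and it is not meant to be biologically realistic: the threshold, the burst sizes and the expression levels are made-up numbers, and the function names are hypothetical. Every simulated individual has the same 'genome' (the same parameters); the only thing that differs is the random seed that drives the transcription bursts.

```python
import random

# Hypothetical level of gene A product needed to switch on genes B and C.
THRESHOLD = 50.0


def transcribe_a(rng, n_bursts=50):
    """Total product of gene A, built up from random transcription bursts."""
    return sum(rng.expovariate(1.0) for _ in range(n_bursts))


def downstream(a_level):
    """Expression of B and C: a low baseline unless A crosses the threshold."""
    boost = 40.0 if a_level >= THRESHOLD else 0.0
    b = 10.0 + boost
    c = 5.0 + 0.5 * b  # C also responds to B, so the difference cascades
    return b, c


def develop(seed):
    """One individual: same parameters for everyone, only the chance events differ."""
    rng = random.Random(seed)
    a = transcribe_a(rng)
    b, c = downstream(a)
    return a, b, c


if __name__ == "__main__":
    for seed in range(6):
        a, b, c = develop(seed)
        print(f"individual {seed}: A={a:5.1f}  B={b:4.1f}  C={c:4.1f}")
```

In this sketch two individuals whose level of A differs by only a few percent can end up on opposite sides of the threshold, so that B and C differ several-fold, even though nothing genetic or environmental distinguishes them.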
4. Genetic is not the same as heritable
Genetic variants that influence neurodevelopment can be transmitted in the DNA passed from parent to child, leading to heritable disorders and traits. But many genetically-based neurodevelopmental disorders do not work like this; rather, they are caused by 'de novo' mutations, i.e. changes to DNA that arise early in embryogenesis, and so are not shared with either parent.
5. We all have many mutations
The notion that there is a clear divide between 'normal people' with a nice pure genome and 'disordered' people with mutations is a fiction. All of us have numerous copy number variants (CNVs), chunks of DNA that are deleted or duplicated (Beckmann, Estivill, & Antonarakis, 2007), as well as point mutations, i.e. changes in a single base pair of DNA. When the scale of mutation in 'normal' people was first discovered, it created quite a shock to the genetics community, jamming a spanner in the works for researchers trying to uncover causes of specific conditions. If we find a rare CNV or point mutation in a person with a disorder, it could just be coincidence and not play any causal role. Converging evidence is needed. Studies of gene function can help establish causality; the impact on brain development will depend on whether a mutation affects key aspects of protein synthesis; but even so, there have been cases where a mutation thought to play a key role in disorder then pops up in someone whose development is entirely unremarkable. A cautionary tale is offered by Toma et al (2018), who studied variants in CNTNAP2, a gene that was thought to be related to autism and schizophrenia. They found that the burden of rare variants that disrupted gene function was just as high in individuals from the general population as in people with autism or schizophrenia.
6. One gene – one disorder is the exception rather than the rule
For many neurodevelopmental conditions, e.g. autism, intellectual disability, and epilepsy, associated mutations have been tracked down. But most of them account for only a small proportion of affected individuals, and furthermore, the same mutation is typically associated with different disorders. Our diagnostic categories don't map well onto the genes.
This message is of particular interest to me, as I have been studying the impact of a major genetic change – presence of an extra X or Y chromosome – on children's development: this includes girls with an additional X chromosome (trisomy X), boys with an extra X (XXY or Klinefelter's syndrome) and boys with an extra Y (XYY constitution). The impact of an extra sex chromosome is far less than you might expect: most of these children attend mainstream school and live independently as adults. There has been much speculation about possible contrasting effects of an extra X versus extra Y chromosome. However, in general, one finds that variation within a particular trisomy group is far greater than variation between them. So, with all three types of trisomy, there is an increased likelihood that the child will have educational difficulties, language and attentional problems, and there's also a risk of social anxiety. In a minority of cases the child meets criteria for autism or intellectual disability (Wilson, King & Bishop, 2019). The range of outcomes is substantial – something that makes it difficult to advise parents when the trisomy is discovered. The story is similar for some other mutations: there are cases where a particular gene is described as an 'autism gene', only for later studies to find that individuals with the same mutation may have attention deficit hyperactivity disorder, epilepsy, language disorder, intellectual disability – or indeed, no diagnosis at all. For instance, Niarchou et al (2019) published a study of a sample of children with deletion or duplication at a site on chromosome 16 (16p11.2), predicting that the deletion would be associated with autism, and duplication with autism or schizophrenia. In fact, they found that the commonest diagnosis with both conditions was attention deficit hyperactivity disorder, though rates of intellectual disability and autism were also increased. 52% of the cases with deletion and 37% of those with a duplication had no psychiatric diagnosis.
There are several ways in which such variation in outcomes might arise. First, the impact of a particular mutation may depend on the genetic background – for instance, if the person has another mutation affecting the same neural circuits, this 'double hit' may have a severe impact, whereas either mutation alone would be innocuous. A second possibility is that there may be environmental factors that affect outcomes. There is a lot of interest in this idea because it opens up potential for interventions. The third option, though, is the one that is often overlooked: the possibility that differences in outcomes are the consequence of random factors early in neurodevelopment, which then have cascading effects that amplify initial minor differences (see points 2 and 3).
7. A mutation may create general developmental instability
Many geneticists think of effects of mutations in terms of the functional impact on particular developmental processes. In the case of neurodevelopment, there is interest in how genes affect processes such as neuronal migration (movement of cells to their final position in the brain), synaptic connectivity (affecting communication between cells) or myelination (formation of white matter sheaths around nerve fibres). Mitchell suggests, however, that mutations may have more general effects, simply making the brain less able to adapt to disruptive processes in development. Many of us learn about genetics in the context of conditions like Huntington's disease, where a specific mutation leads to a recognisable syndrome. However, for many neurodevelopmental conditions, the impact of a mutation is to increase the variation in outcomes. This makes sense of the observations outlined in point 6: a mutation can be associated with a range of developmental disabilities, but with different conditions in different people.
8. Sex differences in risk for neurodevelopmental disorders have genetic origins
There has been so much exaggeration and bad science in research on sex differences in the brain that it has become popular either to deny their existence or to attribute them to sex differences in environmental experiences of males and females. Mitchell has no time for such arguments. There is ample evidence from animal studies that both genes and hormones affect neurodevelopment: why should humans be any different? But he adds two riders: first, although systematic sex differences can be found in human brains, they are small enough to be swamped by individual variation within each sex. So if you want to know about the brain of an individual, their sex would not tell you very much. And second, different does not mean inferior.
Mitchell argues that brain development is more variable in males than females and he cites evidence that, while average ability scores are similar for males and females, males show more variation and are overrepresented at the extremes of distributions of ability. The over-representation at the lower end has been recognised for many years and is at least partly explicable in terms of how the sex chromosomes operate. Many syndromes of intellectual disability are X-linked, which means they are caused by a mutation of large effect on the X chromosome. The mother of an affected boy often carries the same mutation but shows no impairment: this is because she has two X chromosomes, and the effect of a mutation on one of them is compensated for by the unaffected chromosome. The boy has XY chromosome constitution, with the Y being a small chromosome with few genes on it, and so the full impact of an X-linked mutation will be seen. Having said that, many conditions with a male preponderance, such as autism and developmental language disorder, do not appear to involve X-linked genes, and some disorders, such as depression, are more common in females, so there is still much we need to explain. Mitchell's point is that we won't make progress in doing so by denying a role for sex chromosomes or hormones in neurodevelopment.
Mitchell moves into much more controversial territory in describing studies showing over-representation of males at the other end of the ability distribution: e.g. in people with extraordinary skills in mathematics. That is much harder to account for in terms of his own account of genetic mechanisms, which questions the existence of genetic variants associated with high ability. I have not followed that literature closely enough to know how solid the evidence of male over-representation is, but assuming it is reliable, I'd like to see studies that looked more broadly at other aspects of cognition of males who had spectacular ability in domains such as maths or chess. The question is how to reconcile such findings with Mitchell's position – which he summarises rather bluntly by saying there are no genes for intelligence, only genes for stupidity. He does suggest that greater developmental instability in males might lead to some cases of extremely high functioning, but that is at odds with his general view that instability generally leads to deficits, not strengths. I'd be interested in studies of these exceptional high achievers to look at their skills across a wider range of domains. Is it really the case that males at the very top end of the IQ distribution are uniformly good at everything, or are there compensating deficits? It's easy to think of anecdotal examples of geniuses who were lacking in what we might term social intelligence, and whose ability to flourish was limited to a very restricted ecological niche in the groves of academe. Maybe these are people whose specific focus on certain topics would have been detrimental to reproductive fitness in our ancestors, but who can thrive in modern society where people are able to pursue exceptionally narrow interests. If so, we can predict that at the point in the distribution where exceptional ability has a strong male bias, we should expect to find that the skill is highly specific and accompanied by limitations in other domains of cognition or behaviour.
9. It is difficult to distinguish polygenic effects from genetic heterogeneity
Way back in the early 1900s, there was criticism of Mendelian genetics because it maintained that genetic material was transmitted in quanta, and so it seemed not to be able to explain inheritance of continuous traits such as height, where the child's phenotype may be intermediate between those of parents. Reconciliation of these positions was achieved by Ronald Fisher, who showed that if a phenotype was influenced by the combined impact of many genes of small effect, we would expect correlations between related individuals in continuous traits. This polygenic view of inheritance is thought to apply to many common traits and disorders. If so, then the best way to discover genetic bases for disorder is not to hunt through the genome looking for rare mutations, but rather to search for common variants of small effect. The problem with that is that on the one hand it requires enormous samples to identify tiny effects, and on the other it's easy to find false positive associations. The method of Genome Wide Association has been developed to address these issues, and has had some success in identifying genetic variants that have little effect in isolation, but which in aggregate play a role in causing disorder.
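Fisher's insight is easy to reproduce in a small simulation. The sketch below is my own illustration under deliberately idealised assumptions: 100 loci of equal, purely additive effect, allele frequencies of 0.5, random mating and no environmental influence. The function names are hypothetical. Each allele is still transmitted as a discrete Mendelian unit, yet the trait (a simple count of '1' alleles) varies almost continuously and children resemble their parents.

```python
import random


def random_person(rng, n_loci):
    """A genome as a list of loci, each holding two alleles coded 0 or 1."""
    return [(rng.randint(0, 1), rng.randint(0, 1)) for _ in range(n_loci)]


def child_of(rng, mother, father):
    """Mendelian transmission: one randomly chosen allele per locus from each parent."""
    return [(rng.choice(m), rng.choice(f)) for m, f in zip(mother, father)]


def trait(person):
    """Purely additive trait: the number of '1' alleles across all loci."""
    return sum(a + b for a, b in person)


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def simulate(n_families=2000, n_loci=100, seed=1):
    """Return mid-parent trait values and offspring trait values for each family."""
    rng = random.Random(seed)
    midparent, offspring = [], []
    for _ in range(n_families):
        mum = random_person(rng, n_loci)
        dad = random_person(rng, n_loci)
        kid = child_of(rng, mum, dad)
        midparent.append((trait(mum) + trait(dad)) / 2)
        offspring.append(trait(kid))
    return midparent, offspring


if __name__ == "__main__":
    mid, kids = simulate()
    print("distinct trait values in offspring:", len(set(kids)))
    print("mid-parent/offspring correlation:", round(pearson(mid, kids), 2))
```

Under these assumptions the correlation between the mid-parent value and the child's value should come out somewhere near 0.7; adding environmental noise to the trait would lower it.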
Mitchell, however, has a rather different approach. At a time when most geneticists were embracing the idea that conditions such as schizophrenia and autism were the result of the combined effect of the tiny influence of numerous common genetic variants, Mitchell (2012) argued for another possibility - that we may be dealing with rare variants of large effect, which differ from family to family. In Innate, he suggests it is a mistake to reduce this to an either/or question: a person's polygenic background may establish a degree of risk for disorder, with specific mutations then determining how far that risk is manifest.
This is not just an academic debate: it has implications for how we invest in science, and for clinical applications of genetics. Genome-wide association studies need enormous samples, and collection, analysis and storage of data is expensive. There have been repeated criticisms that the yield of positive findings has been low and they have not given good value for money. In particular, it's been noted that the effects of individual genetic variants are minuscule, can only be detected in enormous samples, and throw little light on underlying mechanisms (Turkheimer, 2012, 2016). This has led to a sense of gloom that this line of work is unlikely to provide any explanations of disorder or improvements in treatment.
An approach that is currently in vogue is to derive a Polygenic Risk Score, which is based on all the genetic variants associated with a condition, weighted by the strength of association. This can give some probabilistic information about likelihood of a specific phenotype, but for cognitive and behavioural phenotypes, the level of prediction is not impressive. The more data is obtained on enormous samples, the better the prediction becomes, and some scientists predict that Polygenic Risk Scores will become accurate enough to be used in personalised medicine or psychology. Others, though, have serious doubts. A thoughtful account of the pros and cons of Polygenic Risk Scores is found in an interview that Ed Yong (2018) had with Daniel Benjamin, one of the authors of a recent study reporting on Polygenic Risk Scores for educational attainment (Lee et al, 2018). Benjamin suggested that predicting educational attainment from genes is a non-starter, because prediction for individuals is very weak. But he suggested that the research has value as we can use a Polygenic Risk Score as a covariate to control for genetic variation when studying the impact of environmental interventions. However, this depends on results generalising to other samples. It is noteworthy that when the Polygenic Risk Score for educational attainment was tested for its ability to explain within-family variation (in siblings), its predictive power dropped (Lee et al, 2018).
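For readers who have not met one, the arithmetic behind a Polygenic Risk Score is just a weighted sum. The sketch below is a minimal illustration, not the pipeline used by Lee et al (2018): the variant names and weights are invented, and the function name is hypothetical. Each weight stands for the strength of association reported for a variant, and each dosage is the number of copies (0, 1 or 2) of the associated allele that a person carries.

```python
def polygenic_score(dosages, weights):
    """Weighted sum of allele counts over all variants that have a weight.

    dosages: variant id -> number of copies of the associated allele (0, 1 or 2)
    weights: variant id -> strength of association (hypothetical values here)
    """
    missing = set(weights) - set(dosages)
    if missing:
        raise KeyError(f"no genotype for variants: {sorted(missing)}")
    return sum(weights[v] * dosages[v] for v in weights)


if __name__ == "__main__":
    # Invented weights for three made-up variants; real scores use many thousands.
    weights = {"variant_a": 0.02, "variant_b": -0.01, "variant_c": 0.05}
    person_1 = {"variant_a": 2, "variant_b": 1, "variant_c": 0}
    person_2 = {"variant_a": 0, "variant_b": 0, "variant_c": 2}
    print(polygenic_score(person_1, weights))
    print(polygenic_score(person_2, weights))
```

The score itself says nothing about mechanism: it only ranks people by how many association-weighted alleles they carry, which is why its usefulness depends so heavily on how well the weights generalise to new samples.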
It is often argued that knowledge of genetic variants contributing to a Polygenic Risk Score will help identify the functions controlled by the relevant genes, which may lead to new discoveries in developmental neurobiology and drug design. However, others would question whether Polygenic Risk Scores have the necessary biological specificity to fulfil this promise (Reimers et al, 2018). Furthermore, recent papers have raised concerns that population stratification means that Polygenic Risk Scores may give misleading results: for instance, we might be able to find a group of SNPs predictive of 'chopsticks-eating skills', but this would just be based on genetic variants that happen to differ between ethnic groups that do and don't eat with chopsticks (Barton et al, 2019).
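The chopsticks problem can be demonstrated in a few lines. This is a toy sketch with invented numbers, not an analysis from Barton et al (2019): two hypothetical populations differ both in the frequency of one allele and in how common chopstick use is, but within each population genotype and habit are drawn independently, so the allele has no effect on anything.

```python
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def sample_population(rng, n, allele_freq, p_chopsticks):
    """Genotype (0, 1 or 2 copies) and habit (0 or 1), drawn independently."""
    dosages = [sum(rng.random() < allele_freq for _ in range(2)) for _ in range(n)]
    habit = [1 if rng.random() < p_chopsticks else 0 for _ in range(n)]
    return dosages, habit


def stratification_demo(seed=0, n=2000):
    """Return (within-population correlations, pooled correlation)."""
    rng = random.Random(seed)
    # Made-up frequencies: the allele and the habit are both common in population 1.
    g1, y1 = sample_population(rng, n, allele_freq=0.8, p_chopsticks=0.9)
    g2, y2 = sample_population(rng, n, allele_freq=0.2, p_chopsticks=0.1)
    within = (pearson(g1, y1), pearson(g2, y2))
    pooled = pearson(g1 + g2, y1 + y2)
    return within, pooled


if __name__ == "__main__":
    within, pooled = stratification_demo()
    print("within populations:", [round(r, 2) for r in within])
    print("pooled sample:", round(pooled, 2))
```

Within each population the correlation hovers around zero, but pooling the two produces a sizeable association that reflects nothing more than ancestry. The drop in predictive power when scores are tested within families is one sign that this kind of confounding may be at work.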
I think Mitchell would in any case regard the quest for Polygenic Risk Scores as a distraction from other more promising approaches that focus on finding rare variants of big effect. Rather than investing in analyses that require huge amounts of big data to detect marginal associations between phenotypes and SNPs, his view is that we will make most progress by studying the consequences of mutations. The tussle between these viewpoints is reflected in two articles that appeared at the end of 2017. Boyle, Li, and Pritchard (2017) queried some of the assumptions behind genome-wide association studies, and suggested that most progress will occur if we focus on detecting rare variants that may help understand the biological pathways involved in disorder. Wray et al (2017) countered by arguing that while exploring for de novo mutations is important for understanding severe childhood disorders, this approach is unlikely to be cost-effective when dealing with common diseases, where genome-wide association with enormous samples is the optimal strategy. In fact, the positions of these authors are not diametrically opposed: it is rather a question of which approach should be given most resources. The discussion involves more than just scientific disagreement: reputations and large amounts of research funding are at stake.
And so we come to the ethical issues around modern genetics. I hope I have at least convinced readers that in order to have a rational analysis of moral questions in this field, one needs to move away from simplistic ideas of the genome as some kind of blueprint that determines brain structure and function. Ethical issues which are quite hard enough when things are deterministic are given a whole new layer of complexity when we realise that there's a large contribution of chance in most relationships between genes and neurodevelopment.
But let's start with the simpler and more straightforward case where you can reliably predict how a person will turn out from knowledge of their genetic constitution. There are then two problematic issues to grapple with: 1) if you have knowledge of genetic constitution prenatally, under what situations would you consider using the information to select an embryo or terminate a pregnancy? 2) if a person with a genetically-determined condition exists, should they be treated differently on the basis of that condition?
Some religions bypass the first question altogether, by arguing that it is never acceptable to terminate a pregnancy. But, if we put absolutist positions to one side, I suspect most people would give a range of answers to question 1, depending on what the impact of the genetic condition is: termination may be judged acceptable or even desirable if there are such severe impacts on the developing brain that the infant would be unlikely to survive into childhood, be in a great deal of distress or pain, or be severely mentally impaired. At the other extreme, terminating a pregnancy because a person lacks a Y chromosome seems highly unethical to many people, yet this practice is legal in some countries, and widely adopted even when it is not (Hvistendahl, 2011). These polarised scenarios may seem relatively straightforward, but there are numerous challenges because there will always be cases that fall between these extremes.
It is impossible to ignore the role of social factors in our judgements. Many hearing people are shocked when they discover that some Deaf parents want to use reproductive technologies to select for Deafness in their child (Mand et al., 2009), but those who wish to adopt such a practice argue that Deafness is a cultural difference rather than a disability.
Now let's add chance into the mix. Suppose you have a genetic condition that makes it more likely that a child will have learning difficulties or behaviour problems, but the range of outcomes is substantial; the typical outcome is mild educational difficulties, and many children do perfectly well. This is exactly the dilemma facing parents of children who are found on prenatal screening to have an extra X or Y chromosome. In many countries parents may be offered a termination of pregnancy in such cases, but it is clear that whether or not they decide to continue with the pregnancy depends on what they are told about potential outcomes (Jeon, Chen, & Goodson, 2012).
Like Kevin Mitchell, I don't have easy solutions to such dilemmas, but like him, I think that we need to anticipate that such thorny ethical questions are likely to increase as our knowledge of genetics expands – with many if not most genetic influences being probabilistic rather than deterministic. The science fiction film Gattaca portrays a chilling vision of a world where genetic testing at birth is used to identify elite individuals who will have the opportunity to be astronauts, leaving those with less optimal alleles to do menial work – even though prediction is only probabilistic, and those with 'invalid' genomes may have desirable traits that were not screened for. The Gattaca vision is bleak not just because of the evident unfairness of using genetic screening to allocate resources to people, but because a world inhabited by a set of clones, selected for perfection on a handful of traits, could wipe out the diversity that makes us such a successful species.
There's another whole set of ethical issues that have to do with how we treat people who are known to have genetic differences. Suppose we find that someone standing trial has a genetic mutation that is known to be associated with aggressive outbursts. Should this genetic information be used in mitigation for criminal behaviour? Some might say this would be tantamount to letting a criminal get away with antisocial behaviour, whereas others may regard it as unethical to withhold this information from the court. The problem, again, becomes particularly thorny because association between genetic variation and aggression is always probabilistic. Is someone with a genetic variant that confers a 50% increase in risk of aggression less guilty than someone with a different variant that makes them 50% less likely to be aggressive? Of course, it could be argued that the most reliable genetic predictor of criminality is having a Y chromosome, but we do not therefore treat male criminals more leniently than females. Rather, we recognise that genetic constitution is but one aspect of an individual's make-up, and that factors that lead a person to commit a crime go far beyond their DNA sequence.
As we gain ever more knowledge of genetics, the ethical challenges raised by our ability to detect and manipulate genetic variation need to be confronted. To do that we need an up-to-date and nuanced understanding of the ways in which genes influence neurodevelopment and ultimately affect behaviour. Innate provides exactly that.
I thank David Didau for comments on a draft version of this review, and in particular for introducing me to Gattaca.
Barton, N., Hermisson, J., & Nordborg, M. (2019). Population genetics: Why structure matters. eLife, 8, e45380. doi:10.7554/eLife.45380
Beckmann, J. S., Estivill, X., & Antonarakis, S. E. (2007). Copy number variants and genetic traits: closer to the resolution of phenotypic to genotypic variability. Nature Reviews Genetics, 8(8), 639-646.
Boyle, E. A., Li, Y. I., & Pritchard, J. K. (2017). An expanded view of complex traits: From polygenic to omnigenic. Cell, 169(7), 1177-1186.
Goldacre, B. (2014). I think you'll find it's a bit more complicated than that. London, UK: Harper Collins.
Hvistendahl, M. (2011). Unnatural Selection: Choosing Boys Over Girls, and the Consequences of a World Full of Men. New York: Public Affairs.
Jeon, K. C., Chen, L.-S., & Goodson, P. (2012). Decision to abort after a prenatal diagnosis of sex chromosome abnormality: a systematic review of the literature. Genetics in Medicine, 14, 27-38.
Mand, C., Duncan, R. E., Gillam, L., Collins, V., & Delatycki, M. B. (2009). Genetic selection for deafness: the views of hearing children of deaf adults. Journal of Medical Ethics, 35(12), 722-728. doi:10.1136/jme.2009.030429
Mitchell, K. J. (2012). What is complex about complex disorders? Genome Biology, 13, 237.
Niarchou, M., Chawner, S. J. R. A., Doherty, J. L., Maillard, A. M., Jacquemont, S., Chung, W. K., . . . van den Bree, M. B. M. (2019). Psychiatric disorders in children with 16p11.2 deletion and duplication. Translational Psychiatry, 9(8). doi:10.1038/s41398-018-0339-8
Normile, D. (2018). Shock greets claim of CRISPR-edited babies. Science, 362(6418), 978-979. doi:10.1126/science.362.6418.978
Parens, E., Appelbaum, P., & Chung, W. (2019). Embryo editing for higher IQ is a fantasy. Embryo profiling for it is almost here. Stat+ (Feb 12 2019).
Reimers, M. A., Craver, C., Dozmorov, M., Bacanu, S. A., & Kendler, K. S. (2018). The coherence problem: Finding meaning in GWAS complexity. Behavior Genetics. doi:10.1007/s10519-018-9935-x
Toma, C., Pierce, K. D., Shaw, A. D., Heath, A., Mitchell, P. B., Schofield, P. R., & Fullerton, J. M. (2018). Comprehensive cross-disorder analyses of CNTNAP2 suggest it is unlikely to be a primary risk gene for psychiatric disorders. bioRxiv. doi:10.1101/363846
Turkheimer, E. (2012). Genome Wide Association Studies of behavior are social science. In K. S. Plaisance & T. A. C. Reydon (Eds.), Philosophy of Behavioral Biology (Boston Studies in the Philosophy of Science, Vol. 282, pp. 43-64). Springer Science+Business Media. doi:10.1007/978-94-007-1951-4_3
Turkheimer, E. (2016). Weak genetic explanation 20 years later: Reply to Plomin et al (2016). Perspectives on Psychological Science, 11(1), 24-28. doi:10.1177/1745691615617442
World Health Organization (2019). WHO establishing expert panel to develop global standards for governance and oversight of human genome editing. https://www.who.int/ethics/topics/human-genome-editing/en/.
Wray, N. R., Wijmenga, C., Sullivan, P. F., Yang, J., & Visscher, P. M. (2018). Common disease is more complex than implied by the core gene omnigenic model. Cell, 173, 1573-1590. doi:10.1016/j.cell.2018.05.051
Yong, E. (2018). An enormous study of the genes related to staying in school. The Atlantic. https://www.theatlantic.com/science/archive/2018/07/staying-in-school.../565832/
Wednesday, 27 March 2019
I'm not one of those people who thinks all politicians are corrupt and evil. Undoubtedly such cases exist, and psychopaths often thrive in the kind of highly competitive context of politics, where you need a thick skin and a determination to succeed. But many people go into politics with the aim of making a positive difference, and I am open-minded enough to recognise that many of them are well-intentioned, even if I disagree with their political strategies and prejudices.
Sunday, 3 March 2019
Update: March 6th:
This is version 2 of this blogpost, taking into account new insights into the weird z-scores used in TEF. I had originally suggested there might be an algebraic error in the formula used to derive z-scores: I now realise there is a simpler explanation, which is that the z-scores used in TEF are not calculated in the usual way, with the standard deviation as denominator, but rather with the standard error of measurement as denominator.
In exploring this issue, I've greatly benefited from working openly with an R markdown script on Github, as that has allowed others with statistical expertise to propose alternative analyses and explanations. This process is continuing, and those interested in technical details can follow developments as they happen on Github, see benchmarking_Feb2019.rmd.
Maybe my experience will encourage OfS to adopt reproducible working practices.
Statistical critiques of TEF are not new. This week, the Royal Statistical Society wrote a scathing report on the statistical limitations of TEF, complaining that their previous evidence to TEF evaluations had been ignored, and stating: 'We are extremely worried about the entire benchmarking concept and implementation. It is at the heart of TEF and has an inordinately large influence on the final TEF outcome'. They expressed particular concern about the lack of clarity regarding the benchmarking methodology, which made it impossible to check results.
This reflects concerns I have had, which have led me to do further analyses of the publicly available TEF datasets. The conclusion I have come to is that the way in which z-scores are defined is very different from the usual interpretation, and leads to massive overdiagnosis of under- and over-performing institutions.
Needless to say, this is all quite technical, but even if you don't follow the maths, I suggest you just consider the analyses reported below, in which I compare the benchmarking output from the Draper and Gittoes method with that from an alternative approach.
Draper & Gittoes (2004): a toy example
Benchmarking is intended to provide a way of comparing institutions on some metric, while taking into account differences between institutions in characteristics that might be expected to affect their performance, such as the subjects of study, and the social backgrounds of students. I will refer to these as 'contextual factors'.
The method used to do benchmarking comes from Draper and Gittoes, 2004, and is explained in this document by the Higher Education Statistics Agency: HESA. A further discussion of the method can be found in this pdf of slides from a talk by Draper (2006).
Draper (2006) provides a 'small world' example with 5 universities and 2 binary contextual categories, age and gender, to yield four combinations of contextual factors. The numbers in the top part of the chart are the proportions in each contextual (PCF) category meeting the criterion of student continuation. The numbers in the bottom part are the numbers of students in each contextual category.
|Table 1. Small world example from Draper 2006, showing % passing benchmark (top) and N students (bottom)|
Essentially, the obtained score (weighted mean column) for an institution is an average of indicator values for each combination of contextual factors, weighted by the numbers with each combination of contextual factors in the institution. The benchmarked score is computed by taking the average score for each combination across all institutions (bottom row of top table) and then for each institution creating a mean score, weighted by the number in each category for that institution. Though cumbersome (and hard to explain in words!), it is not difficult to compute. You can find an R markdown script that does the computation here (see benchmarking_Feb2019.rmd, benchmark_function). The difference between obtained values and benchmarked value can then be computed, to see if the institution is scoring above expectation (positive difference) or below expectation (negative difference). Results for the small world example are shown in Table 2.
|Table 2. Benchmarks (Ei) computed for small world example|
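The obtained and benchmarked scores described above can be sketched in a few lines of Python. This is only an illustration, not the script referred to in the post: the function name `benchmark` and the two-institution data are made up, and the sector-wide rate for each contextual category is assumed to be pooled across all students in that category.

```python
def benchmark(props, counts):
    """Return (observed, expected, difference) for each institution.

    props[i][j]  : proportion meeting the criterion at institution i,
                   contextual category j (hypothetical input layout)
    counts[i][j] : number of students at institution i in category j
    Assumes every category has at least one student across the sector.
    """
    n_inst, n_cat = len(counts), len(counts[0])

    # Sector-wide rate per category, pooling students across institutions
    # (an assumption about how the 'average score' is formed).
    sector = []
    for j in range(n_cat):
        students = sum(counts[i][j] for i in range(n_inst))
        passes = sum(props[i][j] * counts[i][j] for i in range(n_inst))
        sector.append(passes / students)

    results = []
    for i in range(n_inst):
        n_i = sum(counts[i])
        # Obtained score: institution's own rates, weighted by its own mix.
        observed = sum(props[i][j] * counts[i][j] for j in range(n_cat)) / n_i
        # Benchmark: sector rates, weighted by the institution's own mix.
        expected = sum(sector[j] * counts[i][j] for j in range(n_cat)) / n_i
        results.append((observed, expected, observed - expected))
    return results


# Made-up data: 2 institutions, 2 contextual categories.
toy_props = [[0.9, 0.7], [0.8, 0.6]]
toy_counts = [[100, 100], [300, 100]]
for obs, exp, diff in benchmark(toy_props, toy_counts):
    print(f"observed={obs:.4f} benchmark={exp:.4f} difference={diff:+.4f}")
```

With pooled sector rates, the differences weighted by institution size sum to zero, which is a handy sanity check on any implementation.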
Computing standard errors of difference scores
The next step is far more complex. A z-score is computed by dividing the difference between observed and expected values on an indicator (Di) by a denominator, which is variously referred to as a standard deviation and a standard error in the documents on benchmarking.
For those who are not trained in statistics, the basic logic here is that the estimate of an institution's performance will be more labile if it is based on a small sample. If the institution takes on only 5 students each year, then estimates of completion rates from year to year will be variable - in a year where one student drops out, then the completion rate is only 80%, but if none drop out it will be 100%. You would not expect it to be constant, because random factors outside the control of the institution will affect student drop-outs. In contrast, for an institution with 1000 students, we will see much less variation from year to year. The standard error provides an estimate of the extent to which we expect the estimate of average drop-out to vary from year to year, taking size of population into account.
To interpret benchmarked scores we need a way of estimating the standard error of the difference between the observed score on a metric (such as completion rate) and the benchmarked score, reflecting how much we would expect this to vary from one occasion to another. Only then can we judge whether the institution's performance is in line with expectation.
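To make the effect of sample size concrete, here is a minimal sketch using the textbook binomial standard error of a proportion, sqrt(p(1-p)/n). This is not the Draper-Gittoes formula, which is more involved; the 90% completion rate is an arbitrary value chosen for illustration.

```python
import math


def se_proportion(p, n):
    """Binomial standard error of a proportion p estimated from n students."""
    return math.sqrt(p * (1 - p) / n)


# Same underlying completion rate, very different precision.
for n in (5, 1000):
    print(f"n={n}: SE = {se_proportion(0.9, n):.4f}")
```

The standard error for the 5-student institution is roughly 14 times larger than for the 1000-student one, because it shrinks with the square root of the sample size.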
Draper (2006) walks the reader through a standard method for computing the standard errors, based on the rather daunting formulae of Figure 1. The values in the SE column of table 2 are computed this way, and the z-scores are obtained by dividing each Di value by its corresponding SE.
|Formulae 5 to 8 are used to compute difference scores and standard errors (Draper, 2006)|
Z-scores in real TEF data
Next, I downloaded some real TEF data, so I could see whether the distribution of z-scores was unusual. Data from Year 2 (2017) in .csv format can be downloaded from this website.
The z-scores here have been computed by HESA. Here is the distribution of core z-scores for one of the metrics (Non-continuation) for the 233 institutions with data on FullTime students.
Yet, they are interpreted in TEF as if a large z-score is an indicator of abnormally good or poor performance:
From p. 42 of this pdf giving Technical Specifications:
"In TEF metrics the number of standard deviations that the indicator is from the benchmark is given as the Z-score. Differences from a benchmark with a Z-score +/-1.96 will be considered statistically significant. This is equivalent to a 95% confidence interval (that is, we can have 95% confidence that the difference is not due to chance)."
What does the z-score represent?
Z-scores feature heavily in my line of work: in psychological assessment they are used to identify people whose problems are outside the normal range. However, they aren't computed like the TEF z-scores, because they involve dividing the difference between a score and the mean by the standard deviation, rather than by the standard error.
It's easiest to explain this by an analogy. I'm 169 cm tall. Suppose you want to find out if that's out of line with the population of women in Oxford. You measure 10,000 women and find their mean height is 170 cm, with a standard deviation of 3. On a conventional z-score, my height is unremarkable. You just take the difference between my height and the population mean and divide by the standard deviation, -1/3, to give a z-score of -0.33. That's well within the normal limits used by TEF of -1.96 to 1.96.
Now let's compute the standard error of the population mean, which is the standard deviation divided by the square root of the sample size: 3/100, or .03. From that information we can get an estimate of the precision of our estimate of the population mean: we multiply the SE by 1.96, then add that value to, and subtract it from, the mean to get 95% confidence limits, which are 169.94 and 170.06. If we were to compute the z-score corresponding to my height using the SE instead of the SD, I would seem to be alarmingly short: the value would be -1/.03 = -33.33.
So what does that mean? Well the second z-score based on the SE does not test whether my height is in line with the population of 10,000 women. It tests whether my height can be regarded as equivalent to that of the average from that population. Because the population is very large, the estimate of the average is very precise, and my height is outside the error of measurement for the mean.
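The arithmetic of the height analogy can be checked with a short Python sketch. The numbers are those given in the text; the variable names are just for illustration.

```python
import math

height = 169.0   # my height in cm
mean = 170.0     # mean height of the sample of women
sd = 3.0         # standard deviation of heights
n = 10_000       # number of women measured

# Conventional z-score: distance from the mean in SD units.
z_sd = (height - mean) / sd

# Standard error of the mean and the 95% confidence limits around it.
se = sd / math.sqrt(n)
ci = (mean - 1.96 * se, mean + 1.96 * se)

# 'z-score' obtained if the SE is used as denominator instead of the SD.
z_se = (height - mean) / se

print(f"SD-based z = {z_sd:.2f}")                 # -0.33
print(f"SE = {se:.2f}")                           # 0.03
print(f"95% limits = {ci[0]:.2f}, {ci[1]:.2f}")   # 169.94, 170.06
print(f"SE-based z = {z_se:.2f}")                 # -33.33
```

The two values differ by a factor of sqrt(n) = 100, so the larger the sample, the more extreme the SE-based value becomes for the same 1 cm difference.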
The problem with the TEF data is that they use the latter, SE-based method to evaluate differences from the benchmark value, but appear to interpret it as if it was a conventional SD-based z-score:
E.g. in the Technical Specifications document (5.63):
As a test of the likelihood that a difference between a provider’s benchmark and its indicator is due to chance alone, a z-score +/- 3.0 means the likelihood of the difference being due to chance alone has reduced substantially and is negligible.
As illustrated with the height analogy, the SE-based method seems designed to over-identify high and low-achieving institutions. The only step taken to counteract this trend is an ad hoc one: because large institutions are particularly prone to obtain extreme scores, a large absolute z-score is only flagged as 'significant' if the absolute difference score is also greater than 2 or 3 percentage points. Nevertheless, the number of flagged institutions for each metric is still far higher than would be the case if conventional z-scores based on the SD were used.
Relationship between SE-based and SD-based z-scores
(N.B. Thanks to Jonathan Mellon who noted an error in my script for computing the true z-scores.
This update and correction made at 20.20 on 6 March 2019).
I computed conventional z-scores by dividing each institution's difference from benchmark by the SD of the difference scores for all institutions, and plotted them against the TEF z-scores. An example for one of the metrics is shown below. The range is in line with expectation (most values between -3 and +3) for the conventional z-scores, but much bigger for the TEF z-scores.
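A minimal Python sketch of this conversion is given below. It is not the R script mentioned in the post: the function name `conventional_z` and the difference scores are invented for illustration, and the use of the sample SD (n - 1 denominator) is an assumption.

```python
import statistics


def conventional_z(diffs):
    """Divide each difference-from-benchmark by the SD of all differences.

    Uses the sample standard deviation (assumption); the differences are
    not re-centred, since they are already deviations from the benchmark.
    """
    sd = statistics.stdev(diffs)
    return [d / sd for d in diffs]


# Invented difference scores (percentage points) for five institutions.
example_diffs = [-2.0, -1.0, 0.0, 1.0, 2.0]
print([round(z, 2) for z in conventional_z(example_diffs)])
```

Because every institution shares the same denominator, the ordering of institutions is unchanged; only the scale of the scores differs from the SE-based version, where each institution has its own denominator.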
Conversion of z-scores into flags
In TEF benchmarking, TEF z-scores are converted into 'flags', ranging from - - or -, to denote performance below expectation, up to + or ++ for performance above expectation, with = used to indicate performance in line with expectation. It is these flags that the TEF panel considers when deciding which award (Gold, Silver or Bronze) to give.
Draper-Gittoes z-scores are flagged for significance as follows:
- '- -' flag: z-score of -3 or less, AND an absolute difference between observed and expected values of 3%.
- '-' flag: z-score of -2 or less, AND an absolute difference between observed and expected values of 2%.
- '+' flag: z-score of 2 or more, AND an absolute difference between observed and expected values of 2%.
- '++' flag: z-score of 3 or more, AND an absolute difference between observed and expected values of 3%.
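The flagging rules listed above can be written as a small Python function. This is a sketch under stated assumptions: the name `tef_flag` is hypothetical, the difference is expressed in percentage points, and the thresholds are treated as inclusive ("at least" 2 or 3 points), which the source does not spell out.

```python
def tef_flag(z, diff_pct):
    """Map a z-score and a difference (percentage points) to a flag.

    Thresholds are treated as inclusive, which is an assumption.
    A score that is extreme on z but fails the 3-point materiality
    test can still earn a single flag if it passes the 2-point test.
    """
    size = abs(diff_pct)
    if z <= -3 and size >= 3:
        return "--"
    if z <= -2 and size >= 2:
        return "-"
    if z >= 3 and size >= 3:
        return "++"
    if z >= 2 and size >= 2:
        return "+"
    return "="


print(tef_flag(-3.5, -3.2))  # both criteria met for a double negative flag
print(tef_flag(10.0, 1.0))   # huge z-score, but difference too small to flag
```

The second call shows the ad hoc correction at work: a very large z-score attracts no flag when the absolute difference is under 2 percentage points.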
Using quantiles rather than TEF z-scores
Given that the z-scores obtained with the Draper-Gittoes method are so extreme, it could be argued that flags should be based on quantiles rather than z-score cutoffs, omitting the additional absolute difference criterion. For instance, for the Year 2 TEF data (Core z-scores) we can find cutoffs corresponding to the most extreme 5% or 1%. If flags were based on these, then we would award extreme flags (- - or ++) only to those with negative z-scores of -13.7 or less, or positive score of 14.6 or more; less extreme flags would be awarded to those with negative z-score of -7 or less (- flag), or positive z-score of 8.6 or more (+).
Update 6th March: An alternative way of achieving the same end would be to use the TEF cutoffs with conventional z-scores; this would achieve a very similar result.
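As a rough sketch of how quantile-based cutoffs might be obtained, the Python below uses `statistics.quantiles` from the standard library. The function name `quantile_cutoffs` is hypothetical, the data are a placeholder rather than TEF z-scores, and the sketch assumes the stated percentage applies to each tail separately, which the text does not specify.

```python
import statistics


def quantile_cutoffs(z_scores, tail_pct):
    """Return (lower, upper) cutoffs leaving tail_pct percent in each tail.

    tail_pct must be an integer between 1 and 49. Treating the percentage
    as per-tail is an assumption made for this illustration.
    """
    # 99 cut points: index 0 is the 1st percentile, index 98 the 99th.
    cuts = statistics.quantiles(z_scores, n=100, method="inclusive")
    return cuts[tail_pct - 1], cuts[100 - tail_pct - 1]


# Placeholder scores 0..100, so each percentile equals its own rank.
placeholder = list(range(101))
print(quantile_cutoffs(placeholder, 5))  # cutoffs for single flags
print(quantile_cutoffs(placeholder, 1))  # cutoffs for double flags
```

Unlike fixed z-score thresholds, cutoffs of this kind guarantee that only a set proportion of institutions is flagged, whatever the scale of the scores.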
Deriving award levels from flags
It is interesting to consider how this change in procedure would affect the allocation of awards. In TEF, the mapping from raw data to awards is complex and involves more than just a consideration of flags: qualitative information is also taken into account. Furthermore, as well as the core metrics, which we have looked at above, the split metrics are also considered - i.e. flags are also awarded for subcategories, such as male/female, disabled/non-disabled: in all there are around 130 flags awarded across the six metrics for each institution. But not all flags are treated equally: the three metrics based on the National Student Survey are given half the weight of other metrics.
Not surprisingly, if we were to recompute flag scores based on quantiles, rather than using the computed z-scores, the proportions of institutions with Bronze or Gold awards drop massively.
When TEF awards were first announced, there was a great deal of publicity around the award of Bronze to certain high-profile institutions, in particular the London School of Economics, Southampton University, University of Liverpool and the School of Oriental and African Studies. On the basis of quantile scores for Core metrics, none of these would meet criteria for Bronze: their flag scores would be -1, 0, -.5 and 0 respectively. But these are not the only institutions to see a change in award when quantiles are used. The majority of smaller institutions awarded Bronze obtain flag scores of zero.
The same is true of Gold Awards. Most institutions that were deemed to significantly outperform their benchmarks no longer do so if quantiles are used.
Should we therefore change the criteria used in benchmarking and adopt quantile scores? Because I think there are other conceptual problems with benchmarking, and indeed with TEF in general, I would not make that recommendation. I would prefer to see TEF abandoned. I hope the current analysis can at least draw people's attention to the questionable use of statistics in deriving z-scores and their corresponding flags. The difference between a Bronze, Silver and Gold can potentially have a large impact on an institution's reputation. The current system for allocating these awards is not, to my mind, defensible.
I will, of course, be receptive to attempts to defend it or to show errors in my analysis, which is fully documented with scripts on github, benchmarking_Feb2019.rmd.
Saturday, 9 February 2019
Guest post by
Jennifer L. Tackett
Northwestern University; Personality Across Development Lab
The PiaD approach was born of a desire to figure out a way, some way, any way, to tackle that ever-growing project list of studies-that-should-get-done-but-never-do. I’m guessing we all have these lists. These are the projects that come up when you’re sitting in a conference talk and lean over to your grad student and whisper (“You know, we actually have the data to test that thing they can’t test, we should do that!”), and your grad student sort of nods a little but also kind of looks like she wants to kill you. Or, you’re sitting in lab meeting talking about ideas, and suddenly shout, “Hey, we totally have data to test that! We should do that! Someone, add it to the list!” and people’s initial look of enthusiasm is quickly replaced by a somewhat sinister side-eye (or perhaps a look of desperation and panic; apparently it depends on who you ask). Essentially, anytime you come up with a project idea and think – Hey, that would be cool, we already have the data, and it wouldn’t be too onerous/lengthy, maybe someone wants to just write that up! – you may have a good PiaD paper.
In other words, the PiaD approach was apparently born out of a desire to finally get these papers written without my grad students killing me. Seems as reasonable a motivation as any.
The initial idea was simple.
- You have a project idea that is circumscribed and straightforward.
- You have data to test the idea.
- The analyses to do so are not overly complex or novel.
- The project topic is in an area that everyone in the lab1 is at least somewhat (to very) familiar with.
What would happen if we all locked ourselves in the same room, with no other distractions, for a full day, and worked our tails off? Surely we could write this paper, right?
The answer was: somewhat, and at least sometimes, yes.
But even better were all the things we learned along the way.
We have been scheduling an annual PiaD since 2013. Our process has evolved a LOT along the way. Rather than giving a historical recounting, I thought I would summarize where we are at now – the current working process we have arrived at, and some of the unanticipated benefits and challenges that have come up for us over the years.
Our Current PiaD Approach
We write our PiaD papers in the late spring/early summer. Sometime in the fall, we decide as a group what the focus of the PiaD paper will be and who will be first author (see also: benefits and challenges). Then, in the months leading up to PiaD, the first author (and senior author, if not one-and-the-same) take care of some front-end tasks.2 Accomplishing the front-end tasks is essential for making sure we can all hit the ground running on the day of. So, here are the things we do in advance:
1. Write the present study paragraph: what exactly do we want to do, and why/how? (Now, we write this as a preregistration! But in the olden days, a thorough present study paragraph would do.)
2. Run a first pass of the analyses (again, remember – data are already available and analyses are straightforward and familiar).
3. Populate a literature review folder. We now use a shared reference manager library (Zotero) to facilitate this step and later populating references.
4. Create a game plan – a list of the target outlet with journal submission guidelines, a list of all the tasks that must be accomplished on the actual DAY, a list of all the people on the team and preliminary assignments. The planning stage of PiaD is key – it can make or break the success of the approach. One aspect of this is being really thoughtful about task assignments. Someone used other data from that sample for a recent study? Put them on the Methods section. Someone used similar analyses in a recent project? Put them on re-running and checking analyses (first pass is always done by the first author in advance; another team member checks syntax and runs a fresh pass on the day. We also have multiple checks built in for examining final output). Someone has expertise in a related literature? Assign them appropriate sections of the Intro/Discussion. You get the idea. Leverage people’s strengths and expertise in the task assignments.
5. Email a link to a Dropbox folder with all of the above, and attach 2-3 key references, to everyone on the team, a couple of weeks before the DAY. All team members are expected to read the key papers and familiarize themselves with the Dropbox folder ahead of time.
Because this process is pretty intense, and every paper is different, our PiaD DAYs always evolve a bit differently. Here are some key components for us:
1. Start with coffee.
2. Then start with the Game Plan. Make sure everyone understands the goal of the paper, the nature of the analyses, and their assigned tasks. Talk through the structure of the Introduction section at a broad level for feedback/discussion.
3. WORK LIKE THE WIND.
4. Take a lunch break. Leave where you are. Turn your computers off. Eat some food. For the most part, we tend to talk about the paper. It’s nice for us to have this opportunity to process more openly mid-day, see where people are at, how the paper is shaping up, what else we should be thinking about, etc. The chance for free and open discussion is really important, after being in such a narrow task-focused state.
5. WORK LIKE THE WIND.
6. Throughout the working chunks, we are constantly renegotiating the task list. Someone finishes their task more quickly, crosses it off the Game Plan (we use this as an active collaborative document to track our work in real time), and claims the next task they plan to move to.
7. Although we have a “no distraction” work space3 for PiaD, we absolutely talk to one another throughout the day. This is one of the biggest benefits of PiaD – the ability to ask questions and get immediate answers, to have all the group minds tackling challenges as they arise. It’s a huge time efficiency to work in this way, and absolutely makes end decisions of much higher quality than the typical fragmented writing approach.
8. Similarly, we have group check-ins about every 1-1.5 hours – where is everyone on their task? What will they move to work on next?
9. Over the years, some PiaD members have found walks helpful, too. Feeling stuck? Peel someone off to go walk through your stuck-ness with you. Come back fresher and clearer.
10. About an hour before end time, we take stock – how close are we to meeting our goals? How are things looking when we piece them all together? What tasks are we prioritizing in the final hour, and which will need to go unfinished and be added to the back-end work for the first author? Some years, we are wrapping up the submission cover letter at this stage. Other years, we’re realizing we still have tasks to complete after PiaD. Just depends on the nature of the project.
11. Celebrate. Ideally with some sort of shared beverage of choice. To each their own, but for us, this has often involved bubbles. And an early bedtime.
|Jennifer celebrating with Kathleen, Cassie, Avanté, and bubbles|
This will be different from year-to-year. Obviously, the goal with PiaD is to be done with the manuscript by the end of the day. EVEN WHEN THIS HAPPENS, we never, EVER do final-proofing the same day. We are just way too exhausted. So we usually give ourselves a couple of weeks to freshen up, then do our final proofing before submission. Other years, for a variety of reasons, various tasks remain. That’s just how it goes with manuscript writing. Even in this case, it is fair to say that the overwhelming majority of the work gets done on the DAY. So either way, it’s still a really productive mechanism (for us).
Some Benefits and Challenges
There are many of both. But overall, we have found this to be a really great experience for many reasons beyond actually getting some of these papers out in the world (which we have! Which is so cool!). Some of these benefits for us are:
1. Bonding as a team. It’s a really great way to strengthen your community, come together in an informal space on a hard but shared problem, and struggle through it together.
2. A chance to see one another work. This can be incredibly powerful, for example, for junior scholars to observe scientific writing approaches “in the wild”. It never occurred to me before my grad students pointed this out at our first PiaD, but they rarely get to see faculty actually work in this way. And vice versa!
3. Accuracy, clarity, and error reduction. So many of our smaller errors could likely be avoided if we’re able to ask our whole team of experts our questions WHILE WE’RE WRITING THE PAPER. Real-time answers, group answers, a chance for one group member to correct another, etc. Good stuff.
4. Enhancing ethical and rigorous practices. The level of accountability when you are all working in the same space at the same time on the same files is probably as good as you can get. How many of our problematic practices might be extinguished if we were always working with others like this?
5. One of the goals I had with PiaD was to have the first author status rotate across the team – i.e., grad students would “take turns” being first author. I still think this is a great idea, as it’s a great learning experience for advanced grad students to learn how to manage team papers in this way. But, of course, it’s also HARD. So, be more thoughtful about scope of the project depending on seniority of the first author, and anticipate more front- and back-end work, accordingly.
PiaD has been a really cool mechanism for my lab to work with and learn from over the years. It has brought us many benefits as a team, far beyond increased productivity. But the way it works best for each team is likely different, and tweaking it over time is the way to make it work best for you. I would love to hear more from others who have been trying something similar in their groups, and also want to acknowledge the working team on the approach outlined here: Kat Herzhoff, Kathleen Reardon, Avanté Smack, Cassie Brandes, and Allison Shields.
1For PiaD purposes, I am defining the lab as PI + graduate students.
2Some critics like to counter, well then it’s not really Paper IN A DAY, now is it??? (Nanny-nanny-boo-boo!) Umm.. I guess not? Or maybe we can all remember that time demarcations are arbitrary and just chill out a bit? In all seriousness, if we all lived in the world where our data were perfectly cleaned and organized, all our literature folders were populated and labeled, etc. – maybe the tasks could all be accomplished in a single day. But unfortunately, my lab isn’t that perfect. YET. (Grad students sending me murderous side-eye over the internet.)
3The question of music or no-music is fraught conversational territory. You may need to set these parameters in advance to avoid PiaD turmoil and potential derailment. You may also need your team members to provide definition of current terminology in advance, in order to even have the conversation at all. Whatever you do, DON’T start having conversations about things like “What is Norm-core?” and everyone googling “norm-core”, and then trying to figure out if there is “norm-core music”, and what that might be. It’s a total PiaD break-down at that point.