BishopBlog: March 2019

I'm not one of those people who thinks all politicians are corrupt and evil. Undoubtedly such cases exist, and psychopaths often thrive in the kind of highly competitive context of politics, where you need a thick skin and a determination to succeed. But many people go into politics with the aim of making a positive difference, and I am open-minded enough to recognise that many of them are well-intentioned, even if I disagree with their political strategies and prejudices.

I suspect Theresa May started out aiming to do good, but something has clearly gone horribly wrong, and I am intrigued as to what is motivating her.

Many people have argued that what drives her is a lust for power. I have a different take on it. To me she looks very like a victim of 'mind control'. I got fascinated with this topic when writing a novel where some characters are recruited into a cult. Among my background reading was a book by Steve Hassan, Combating Cult Mind Control. Steve used his experience of being recruited into the Moonies as a student to cast light on the mental state that adherents get into.

Although I've been tempted at times to think Theresa May has been abducted and had a chip put into her brain by a malign foreign power, I doubt that anyone is acting as her controller. But I do think that she has boxed herself in to a mental set from which she cannot escape.

After she became Prime Minister, something changed radically in her behaviour. She had previously often espoused views I found odious, but she was not particularly weird. Indeed, prior to the Brexit vote in 2016, she gave a speech that was remarkably well-argued: (video version here). No doubt much was written by civil servants, but she sounded coherent and engaged. Over the months following the Brexit vote, she became increasingly wooden, and rapidly earned the name 'Maybot' for her failure to engage with questions and her habit of simply regurgitating the same old hackneyed cliches.

Now anyone in her position would have difficulty: having supported Remain, and made very cogent arguments against leaving the EU, she now had to see the UK through a Brexit that was won by a narrow margin. We know she is a religious woman, who thinks that she is guided by God, and so I assume she prayed a fair bit and came to the conclusion that seeing through the democratically expressed will of the people was the one goal she would cling to. I am prepared to give her credit for thinking that was the right thing to do, rather than this being an opportunistic means of retaining power, as it is sometimes portrayed.

It's worth bearing in mind that, as Steve Hassan noted, one thing the Moonies did with their converts would be to keep them awake with activities and surrounded by fellow believers. It's difficult to think rationally when sleep-deprived, when there is never any escape from the situation or down-time when you can talk things through with people who take a different view. May's schedule has been unremitting, with trips all over the world and relentless political pressure; it must have been utterly exhausting.

So the way it looks to me is that she has brainwashed herself into a state of monomania as a way of coping with the situation. If you are going to do something very challenging, one way of dealing with it is simply to refuse to engage with any attempt to sway you from your course. The only thing that matters to May is achieving Brexit, regardless of how many livelihoods and institutions are destroyed in the process. Unpleasant truths are ignored or characterised as invalid.

As PM of a country that she presumably cares about, May should be disturbed to hear that Brexit is proving disastrous for the car manufacturing industry, the banking sector, higher education, science and the National Health Service, and is likely to destroy the union with Northern Ireland and with Scotland. Her ability to avoid processing this information is reminiscent of Steve Hassan's account of his thought processes while in the Moonies, where he totally rejected his parents when they tried to reason with him, refusing to question any aspect of his beliefs and regarding them as evil.

One piece of information that would make her implode is that the 'will of the people' might have changed. Her face contorted when Caroline Lucas raised this possibility in Parliament in November 2018 - almost as if the Maybot had reached a state of 'does not compute'. She is reduced to arguing that demands for another vote are 'anti-democratic', and characterising those who want this as her enemies: a very odd state of affairs, given that it is the same 'people' who would be expressing their more up-to-date and informed view. Meanwhile neither she nor anyone else has come up with a single advantage of Brexit for this country.

I am on record as being opposed to another referendum – because I think a referendum is a terrible way of making political decisions on complex matters. But now it does seem the only option that might have some prospect of saving us from May's destructive course. Her mantra that we must obey the 'will of the people' could only be countered by a demonstration that the will of the people has indeed changed. Hundreds of thousands marching through London hasn't done it – they are just dismissed as the metropolitan elite. Five million signing a petition to revoke Article 50 hasn't done it, because it falls short of the 17.4 million who voted for Brexit – even though the parallel petition to Leave without a Deal has only one tenth of the support. Polls are deemed inconclusive. So it seems the only way to demonstrate that the majority no longer wants the Brexit that is on offer would be another vote.

Would she abandon Brexit if the people now voted against it? I wonder. It's difficult to change course when you have committed yourself to accepting that the UK must go through with Brexit even if it means breaking up the United Kingdom, and damaging jobs, the economy and our international standing. She may have started out concerned that to reject Brexit would be the act of a despot. Sadly, now in refusing to listen to the possibility that people may have changed their minds, she has turned into exactly that.


Update: March 6th:
This is version 2 of this blogpost, taking into account new insights into the weird z-scores used in TEF. I had originally suggested there might be an algebraic error in the formula used to derive z-scores: I now realise there is a simpler explanation, which is that the z-scores used in TEF are not calculated in the usual way, with the standard deviation as denominator, but rather with the standard error of measurement as denominator.
In exploring this issue, I've greatly benefited from working openly with an R markdown script on Github, as that has allowed others with statistical expertise to propose alternative analyses and explanations. This process is continuing, and those interested in technical details can follow developments as they happen on Github, see benchmarking_Feb2019.rmd.
Maybe my experience will encourage OfS to adopt reproducible working practices.

I'm a long-term critic of the Teaching Excellence and Student Outcomes Framework (TEF). I've put forward a swathe of arguments against the rationale for TEF in this lecture, as well as blogging for the Council for Defence of British Universities (CDBU) about problems with its rationale and statistical methods. But this week, things got even more interesting. In poking about in the data behind the TEF, I stumbled upon some anomalies that suggest to me that the TEF is not just misguided, but also is based on a foundation of statistical error.

Statistical critiques of TEF are not new. This week, the Royal Statistical Society wrote a scathing report on the statistical limitations of TEF, complaining that their previous evidence to TEF evaluations had been ignored, and stating: 'We are extremely worried about the entire benchmarking concept and implementation. It is at the heart of TEF and has an inordinately large influence on the final TEF outcome'. They expressed particular concern about the lack of clarity regarding the benchmarking methodology, which made it impossible to check results.

This reflects concerns I have had, which have led me to do further analyses of the publicly available TEF datasets. The conclusion I have come to is that the way in which z-scores are defined is very different from the usual interpretation, and leads to massive overdiagnosis of under- and over-performing institutions.

Needless to say, this is all quite technical, but even if you don't follow the maths, I suggest you just consider the analyses reported below, in which I compare the benchmarking output from the Draper and Gittoes method with that from an alternative approach.

Draper & Gittoes (2004): a toy example

Benchmarking is intended to provide a way of comparing institutions on some metric, while taking into account differences between institutions in characteristics that might be expected to affect their performance, such as the subjects of study, and the social backgrounds of students. I will refer to these as 'contextual factors'.

The method used to do benchmarking comes from Draper and Gittoes, 2004, and is explained in this document by the Higher Education Statistics Agency: HESA. A further discussion of the method can be found in this pdf of slides from a talk by Draper (2006).

Draper (2006) provides a 'small world' example with 5 universities and 2 binary contextual categories, age and gender, to yield four combinations of contextual factors. The numbers in the top part of the chart are the proportions in each contextual (PCF) category meeting the criterion of student continuation. The numbers in the bottom part are the numbers of students in each contextual category.

Table 1. Small world example from Draper 2006, showing % passing benchmark (top) and N students (bottom)

Essentially, the obtained score (weighted mean column) for an institution is an average of indicator values for each combination of contextual factors, weighted by the numbers with each combination of contextual factors in the institution. The benchmarked score is computed by taking the average score for each combination across all institutions (bottom row of top table) and then, for each institution, creating a mean score weighted by the number in each category for that institution. Though cumbersome (and hard to explain in words!) it is not difficult to compute. You can find an R markdown script that does the computation here (see benchmarking_Feb2019.rmd, benchmark_function). The difference between the obtained and benchmarked values can then be computed, to see if the institution is scoring above expectation (positive difference) or below expectation (negative difference). Results for the small world example are shown in Table 2.

Table 2. Benchmarks (Ei) computed for small world example

The column headed Oi is the observed proportion with a pass mark on the indicator (student continuation), Ei is the benchmark (expected) value for each institution, and Di is the difference between the two.
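For readers who find code easier to follow than words, here is a minimal Python sketch of the benchmarking arithmetic described above. It is not the R function from benchmarking_Feb2019.rmd. The two institutions, two categories and all the numbers are made up for illustration and are not Draper's table. The sketch also assumes that the sector-wide rate for each category is pooled across all students in that category.

```python
# Hypothetical data (NOT Draper's small-world table).
# rates[inst][cat]  = proportion of students continuing
# counts[inst][cat] = number of students
rates = {
    "A": {"young": 0.90, "mature": 0.80},
    "B": {"young": 0.95, "mature": 0.70},
}
counts = {
    "A": {"young": 100, "mature": 100},
    "B": {"young": 300, "mature": 100},
}

def weighted_mean(values, weights):
    """Mean of values[k], weighted by weights[k]."""
    total = sum(weights[k] for k in values)
    return sum(values[k] * weights[k] for k in values) / total

def sector_rates(rates, counts):
    """Rate per contextual category, pooled over all institutions (assumption)."""
    categories = next(iter(rates.values())).keys()
    pooled = {}
    for cat in categories:
        n = sum(counts[inst][cat] for inst in rates)
        pooled[cat] = sum(rates[inst][cat] * counts[inst][cat] for inst in rates) / n
    return pooled

def benchmark(rates, counts):
    """Return {institution: (Oi, Ei, Di)}."""
    sector = sector_rates(rates, counts)
    result = {}
    for inst in rates:
        observed = weighted_mean(rates[inst], counts[inst])  # Oi
        expected = weighted_mean(sector, counts[inst])       # Ei
        result[inst] = (observed, expected, observed - expected)  # Di
    return result

for inst, (o, e, d) in benchmark(rates, counts).items():
    print(inst, round(o, 4), round(e, 4), round(d, 4))
```

With these invented numbers, institution A comes out slightly above its benchmark and institution B slightly below, because B's mix of students is weighted towards the category with the higher sector-wide rate.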

Computing standard errors of difference scores

The next step is far more complex. A z-score is computed by dividing the difference between observed and expected values on an indicator (Di) by a denominator, which is variously referred to as a standard deviation and a standard error in the documents on benchmarking.

For those who are not trained in statistics, the basic logic here is that the estimate of an institution's performance will be more labile if it is based on a small sample. If the institution takes on only 5 students each year, then estimates of completion rates from year to year will be variable - in a year where one student drops out, the completion rate is only 80%, but if none drop out it will be 100%. You would not expect it to be constant, because random factors outside the control of the institution will affect student drop-outs. In contrast, for an institution with 1000 students, we will see much less variation from year to year. The standard error provides an estimate of the extent to which we expect the estimate of average drop-out to vary from year to year, taking size of population into account.
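To put rough numbers on this, here is a small Python sketch using the textbook binomial standard error of a proportion, sqrt(p(1-p)/n). This is a simplification for illustration only, not the Draper and Gittoes formula, and the completion rate of 90% is an assumed value.

```python
import math

def se_proportion(p, n):
    """Binomial standard error of an observed proportion p based on n students."""
    return math.sqrt(p * (1 - p) / n)

p = 0.9  # assumed underlying completion rate (illustrative)
small = se_proportion(p, 5)      # institution with 5 students: about 0.134
large = se_proportion(p, 1000)   # institution with 1000 students: about 0.0095
print(round(small, 3), round(large, 4))
```

The standard error for the tiny institution is about fourteen times larger than for the big one, which is why the same observed difference from benchmark counts for much less in a small institution.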

To interpret benchmarked scores we need a way of estimating the standard error of the difference between the observed score on a metric (such as completion rate) and the benchmarked score, reflecting how much we would expect this to vary from one occasion to another. Only then can we judge whether the institution's performance is in line with expectation.

Draper (2006) walks the reader through a standard method for computing the standard errors, based on the rather daunting formulae of Figure 1. The values in the SE column of table 2 are computed this way, and the z-scores are obtained by dividing each Di value by its corresponding SE.

Formulae 5 to 8 are used to compute difference scores and standard errors (Draper, 2006)

Now anyone familiar with z-scores will notice something pretty odd about the values in Table 2. The absolute z-scores given by this method seem remarkably large: In this toy example, we see z-scores with absolute values of 5, 9 and 20. Usually z-scores range from about -3 to 3. (Draper noted this point).

Z-scores in real TEF data

Next, I downloaded some real TEF data, so I could see whether the distribution of z-scores was unusual. Data from Year 2 (2017) in .csv format can be downloaded from this website.
The z-scores here have been computed by HESA. Here is the distribution of core z-scores for one of the metrics (Non-continuation) for the 233 institutions with data on FullTime students.

The distribution is completely out of line with what we would expect from a z-score distribution. Absolute z-scores greater than 3, which should be vanishingly rare, are common - with the exact number varying across the six available metrics, but ranging from 33% to 58%.

Yet, they are interpreted in TEF as if a large z-score is an indicator of abnormally good or poor performance:

From p. 42 of this pdf giving Technical Specifications:

"In TEF metrics the number of standard deviations that the indicator is from the benchmark is given as the Z-score. Differences from a benchmark with a Z-score +/-1.9623 will be considered statistically significant. This is equivalent to a 95% confidence interval (that is, we can have 95% confidence that the difference is not due to chance)."

What does the z-score represent?

Z-scores feature heavily in my line of work: in psychological assessment they are used to identify people whose problems are outside the normal range. However, they aren't computed like the TEF z-scores, because they involve dividing the difference between a score and the population mean by the standard deviation, rather than by the standard error.

It's easiest to explain this by an analogy. I'm 169 cm tall. Suppose you want to find out if that's out of line with the population of women in Oxford. You measure 10,000 women and find their mean height is 170 cm, with a standard deviation of 3. On a conventional z-score, my height is unremarkable. You just take the difference between my height and the population mean and divide by the standard deviation, -1/3, to give a z-score of -0.33. That's well within the normal limits used by TEF of -1.96 to 1.96.

Now let's compute the standard error of the population mean, which is the standard deviation divided by the square root of the sample size: 3/100, or .03. From that information we can get an estimate of the precision of our estimate of the population mean: we multiply the SE by 1.96, and add and subtract that value to the mean to get 95% confidence limits, which are 169.94 and 170.06. If we were to compute the z-score corresponding to my height using the SE instead of the SD, I would seem to be alarmingly short: the value would be -1/.03 = -33.33.

So what does that mean? Well the second z-score based on the SE does not test whether my height is in line with the population of 10,000 women. It tests whether my height can be regarded as equivalent to that of the average from that population. Because the population is very large, the estimate of the average is very precise, and my height is outside the error of measurement for the mean.
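The arithmetic of the height analogy can be checked in a few lines of Python. The numbers are the ones given in the text; the variable names are just for illustration.

```python
import math

height, pop_mean, sd, n = 169.0, 170.0, 3.0, 10_000

z_sd = (height - pop_mean) / sd        # conventional z-score, SD as denominator
se = sd / math.sqrt(n)                 # standard error of the population mean
ci = (pop_mean - 1.96 * se, pop_mean + 1.96 * se)  # 95% confidence limits
z_se = (height - pop_mean) / se        # SE-based "z-score"

print(round(z_sd, 2), round(se, 2), round(z_se, 2))  # -0.33 0.03 -33.33
print(round(ci[0], 2), round(ci[1], 2))              # 169.94 170.06
```

The same 1 cm difference gives a z-score of -0.33 or -33.33, depending only on which denominator is used.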

The problem with the TEF data is that they use the latter, SE-based method to evaluate differences from the benchmark value, but appear to interpret it as if it was a conventional SD-based z-score:

E.g. in the Technical Specifications document (5.63):

As a test of the likelihood that a difference between a provider’s benchmark and its indicator is due to chance alone, a z-score +/- 3.0 means the likelihood of the difference being due to chance alone has reduced substantially and is negligible.

As illustrated with the height analogy, the SE-based method seems designed to over-identify high and low-achieving institutions. The only step taken to counteract this trend is an ad hoc one: because large institutions are particularly prone to obtain extreme scores, a large absolute z-score is only flagged as 'significant' if the absolute difference score is also greater than 2 or 3 percentage points. Nevertheless, the number of flagged institutions for each metric is still far higher than would be the case if conventional z-scores based on the SD were used.

Relationship between SE-based and SD-based z-scores
(N.B. Thanks to Jonathan Mellon who noted an error in my script for computing the true z-scores.
This update and correction made 20.20 p.m. on 6 March 2019).

I computed conventional z-scores by dividing each institution's difference from benchmark by the SD of the difference scores for all institutions, and plotted them against the TEF z-scores. An example for one of the metrics is shown below. The range is in line with expectation (most values between -3 and +3) for the conventional z-scores, but much bigger for the TEF z-scores.
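As a rough illustration of that calculation, here is a Python sketch. It is not the script from benchmarking_Feb2019.rmd. The difference scores are invented, and the use of the sample SD (n - 1 denominator) is an assumption, since the text does not say which form was used.

```python
import statistics

def conventional_z(differences):
    """Divide each difference from benchmark by the SD of all the differences."""
    sd = statistics.stdev(differences)  # sample SD (assumption)
    return [d / sd for d in differences]

# Hypothetical differences from benchmark, in percentage points
diffs = [-2.0, -1.0, 0.0, 1.0, 2.0]
z = conventional_z(diffs)
print([round(v, 2) for v in z])
```

Scaled this way, the scores are expressed in units of the spread between institutions, so values beyond about 3 in either direction should be rare whatever the size of the institutions.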

Conversion of z-scores into flags

In TEF benchmarking, TEF z-scores are converted into 'flags', ranging from - - or -, to denote performance below expectation, up to + or ++ for performance above expectation, with = used to indicate performance in line with expectation. It is these flags that the TEF panel considers when deciding which award (Gold, Silver or Bronze) to give.

Draper-Gittoes z-scores are flagged for significance as follows:

- - z-score of -3 or less, AND an absolute difference between observed and expected values of 3%.
- z-score of -2 or less, AND an absolute difference between observed and expected values of 2%.
+ z-score of 2 or more, AND an absolute difference between observed and expected values of 2%.
++ z-score of 3 or more, AND an absolute difference between observed and expected values of 3%.

Given the problems with the method outlined above, this method is likely to massively overdiagnose both problems and good performance.
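The flagging rule listed above can be written as a short Python function. This is a sketch of the rule as described in this post, not code from TEF or HESA. It assumes that 'an absolute difference of 3%' means at least 3 percentage points, and it writes the double negative flag as "--" without the space.

```python
def tef_flag(z, diff):
    """Return the flag for one metric.

    z    : Draper-Gittoes (SE-based) z-score
    diff : observed minus expected value, in percentage points
    """
    size = abs(diff)
    if z <= -3 and size >= 3:
        return "--"
    if z <= -2 and size >= 2:
        return "-"
    if z >= 3 and size >= 3:
        return "++"
    if z >= 2 and size >= 2:
        return "+"
    return "="  # in line with expectation

print(tef_flag(-5.0, -2.5))  # -
print(tef_flag(20.0, 1.0))   # =
print(tef_flag(3.5, 3.2))    # ++
```

The second example shows the ad hoc safeguard at work: a z-score of 20 earns no flag when the difference from benchmark is only 1 percentage point.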

Using quantiles rather than TEF z-scores

Given that the z-scores obtained with the Draper-Gittoes method are so extreme, it could be argued that flags should be based on quantiles rather than z-score cutoffs, omitting the additional absolute difference criterion. For instance, for the Year 2 TEF data (Core z-scores) we can find cutoffs corresponding to the most extreme 5% or 1%. If flags were based on these, then we would award extreme flags (- - or ++) only to those with negative z-scores of -13.7 or less, or positive score of 14.6 or more; less extreme flags would be awarded to those with negative z-score of -7 or less (- flag), or positive z-score of 8.6 or more (+).
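A quantile-based version of the flags could be sketched in Python as follows. The z-scores here are invented (evenly spaced from -50 to 50), not the Year 2 TEF data. The sketch assumes that 'most extreme 5% or 1%' means that proportion in each tail, and it uses simple linear interpolation between ranked values; other quantile definitions would give slightly different cutoffs.

```python
def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list, 0 <= q <= 1."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac

def tail_cutoffs(z_scores, tail):
    """Lower and upper cutoffs leaving a proportion `tail` in each tail."""
    s = sorted(z_scores)
    return quantile(s, tail), quantile(s, 1 - tail)

def quantile_flag(z, cut5, cut1):
    """Flag from quantile cutoffs alone, with no absolute-difference criterion."""
    if z <= cut1[0]:
        return "--"
    if z <= cut5[0]:
        return "-"
    if z >= cut1[1]:
        return "++"
    if z >= cut5[1]:
        return "+"
    return "="

# Hypothetical TEF-style z-scores for 101 institutions
z_scores = [float(v) for v in range(-50, 51)]
cut5 = tail_cutoffs(z_scores, 0.05)
cut1 = tail_cutoffs(z_scores, 0.01)
print(cut5, cut1)
print(quantile_flag(-46.0, cut5, cut1), quantile_flag(10.0, cut5, cut1))
```

Because the cutoffs are derived from the observed distribution, the proportion of flagged institutions is fixed in advance, however inflated the z-scores themselves may be.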

Update 6th March: An alternative way of achieving the same end would be to use the TEF cutoffs with conventional z-scores; this would achieve a very similar result.

Deriving award levels from flags

It is interesting to consider how this change in procedure would affect the allocation of awards. In TEF, the mapping from raw data to awards is complex and involves more than just a consideration of flags: qualitative information is also taken into account. Furthermore, as well as the core metrics, which we have looked at above, the split metrics are also considered - i.e. flags are also awarded for subcategories, such as male/female, disabled/non-disabled: in all there are around 130 flags awarded across the six metrics for each institution. But not all flags are treated equally: the three metrics based on the National Student Survey are given half the weight of other metrics.

Not surprisingly, if we were to recompute flag scores based on quantiles, rather than using the computed z-scores, the proportions of institutions with Bronze or Gold awards drop massively.

When TEF awards were first announced, there was a great deal of publicity around the award of Bronze to certain high-profile institutions, in particular the London School of Economics, Southampton University, University of Liverpool and the School of Oriental and African Studies. On the basis of quantile scores for Core metrics, none of these would meet criteria for Bronze: their flag scores would be -1, 0, -.5 and 0 respectively. But these are not the only institutions to see a change in award when quantiles are used. The majority of smaller institutions awarded Bronze obtain flag scores of zero.

The same is true of Gold Awards. Most institutions that were deemed to significantly outperform their benchmarks no longer do so if quantiles are used.

Conclusion

Should we therefore change the criteria used in benchmarking and adopt quantile scores? Because I think there are other conceptual problems with benchmarking, and indeed with TEF in general, I would not make that recommendation. I would prefer to see TEF abandoned. I hope the current analysis can at least draw people's attention to the questionable statistics used in deriving z-scores and their corresponding flags. The difference between a Bronze, Silver and Gold can potentially have a large impact on an institution's reputation. The current system for allocating these awards is not, to my mind, defensible.

I will, of course, be receptive to attempts to defend it or to show errors in my analysis, which is fully documented with scripts on github, benchmarking_Feb2019.rmd.

Wednesday, 27 March 2019

What is driving Theresa May?

Sunday, 3 March 2019

Benchmarking in the TEF: Something doesn't add up (v.2)
