Saturday, 12 December 2015

A lamentable performance by Jo Johnson

Last week I wrote a blogpost for the Council for the Defence of British Universities, in which I discussed the government’s Green Paper “Fulfilling Our Potential”. The Green Paper is a consultation document that introduces, among other things, the Teaching Excellence Framework (TEF). This is an evaluation process for teaching that is intended to parallel the Research Excellence Framework (REF). I argued against it. I’m concerned that the imposition of another complex bureaucratic exercise will do damage to our Higher Education system, and I think that the case for introducing it has not been made. Among other things, I noted that there was little evidence for the claim that there was widespread dissatisfaction among students. Put simply, my argument was: if it ain’t broke, don’t fix it.

A day after my blogpost appeared, there was a meeting of the Business, Innovation and Skills select committee to take oral evidence on topics relating to the Green Paper. The oral evidence is available here as a transcript. This is fascinating, because there appeared to be a difference of opinion between the Minister, Jo Johnson, and the others giving evidence about the state of teaching in our Universities. The most telling part of the session was when Jo Johnson was challenged on his previous use of the word ‘lamentable’ to describe teaching in parts of our higher education system. I am reproducing the transcript here in full, despite its length, as it is important context to what comes next:

Chair: Can I take you back to your speech on 9 September about higher education fulfilling our potential? There is a particular passage in there that is really interesting, talking about a family and varying levels of experience. May I quote you? “This patchiness in the student experience within and between institutions cannot continue. There is extraordinary teaching that deserves greater recognition. And there is lamentable teaching that must be driven out of our system.” Could you tell us where that lamentable teaching is? 
Joseph Johnson: Thank you very much for having me, and I will certainly come to that in just one second. What I want to say is that it is a pleasure to be here to give evidence before you, and I am delighted at the interest the Committee is taking in this very important subject. There is extraordinary excellence across our higher education system; that is the first thing to say. We have a great university system in this country, it is one of our national success stories, and it is a terrific calling card for us on the global stage. It is very important to put that frame in context out there, but of course the sector cannot stand still. University systems around the world are becoming more and more competitive. Developing countries are putting in place stronger and stronger frameworks for their own university systems, and in that environment it is incumbent on us to continue to make a great sector greater still. That is the opening frame of how I see the sector. It is continuing and continuous improvement, and that is all the more important for us, as a sector, at a time when we are seeing ever-increasing numbers of our young people go through university. We are now at a stage of mass higher education in this country, with about 47% of people likely to go through higher education at some point in their lives, and it is vital for us, as a Government, that we ensure that they are getting the best-quality experience for the time and for the money that they are investing in higher education.  You referred back to a speech I gave to Universities UK and I used that word; it made a point. It made a point that there is, essentially, patchiness in provision and I am happy, before you, to give evidence of where I see patchiness, if that is helpful.
Chair: Would you use the word “lamentable” again?
Joseph Johnson: I certainly made the point, and the point was made in order to highlight the fact that there is patchiness and variability in provision. 
Chair: “Patchiness” is not “lamentable” though. 
Joseph Johnson: Patchiness and variability are the features that I want to stress before you today. I am quite happy to give plenty of supporting evidence of that and I think the sector, in its responses to you as a Committee, has also agreed that there is a need to focus on the quality of teaching in our institutions. I am happy to give more evidence on that, if you want.
Chair: I would be very keen for you to give evidence to us, but just to push you on this, “lamentable” is an extraordinarily strong word. Would you use it again?
Joseph Johnson: I think there are patches of poor-quality provision and whether or not we want to use that word—
Chair: Lamentable patches?
Joseph Johnson: Whether we want to use that word, it certainly made a point. It highlighted the point I was trying to make. I do not see the need to repeat it ad nauseam, but I think I made my point.
Johnson clearly wanted to move away from discussions about his choice of words and onto the ‘evidence’. I’m going to focus here on what he said about results from the National Student Survey (NSS). There are many pertinent questions about how far the NSS can be taken as evidence of teaching quality, but I will leave those to one side and just focus on what the Minister said about it, which was:

In the NSS 2015 survey, two thirds of providers are performing well below their peers on at least one aspect of the student experience; and 44% of providers are performing well below their peers on at least one aspect of the teaching, assessment and feedback part of the student experience.

I was surprised by these numbers for two reasons: first, they seemed at odds with other reports about the NSS that had indicated a high level of student satisfaction. Second, they seemed statistically weird. How can you have a high proportion of providers doing very poorly without dragging down the average – which we know to be high? I looked in vain online for a report that might be the source of these figures. Meanwhile, I decided to look myself at the NSS 2015 results, which fortunately are available for download here.

All items in the NSS are rated from 1 (definitely disagree) to 5 (definitely agree). I focused on full-time courses, and combined all data from each institution, rather than breaking it down by course, and I excluded any institutions with fewer than 80 student responses, as estimates from such small numbers would be less reliable. Then, to familiarise myself with the data, and get an overall impression of findings, I plotted the distribution of ratings for the final overview item in the survey, i.e., “Overall, I am satisfied with the quality of the course”. As you can see in Figure 1, the overwhelming majority of students either ‘agree’ or ‘definitely agree’ with this statement. Few institutions get less than 75% approval, and none has high rates of disapproval.

Figure 1: Distribution of responses to item 22: "Overall I am satisfied with the quality of the course"

Johnson’s comments, however, concerned individual items on the survey.

As you can see in the table below, there is variation between items in ratings, with lower mean scores for those concerning feedback and smooth running of the course, but overall the means are at the positive end of the scale for all items.
Table 1: Mean scores for NSS items
Item Mean (SD)
1. Staff are good at explaining things. 4.19 (0.11)
2. Staff have made the subject interesting. 4.12 (0.14)
3. Staff are enthusiastic about what they are teaching. 4.30 (0.14)
4. The course is intellectually stimulating. 4.19 (0.17)
5. The criteria used in marking have been clear in advance. 4.02 (0.19)
6. Assessment arrangements and marking have been fair. 4.01 (0.19)
7. Feedback on my work has been prompt. 3.79 (0.24)
8. I have received detailed comments on my work. 3.95 (0.23)
9. Feedback on my work has helped me clarify things I did not understand. 3.85 (0.21)
10. I have received sufficient advice and support with my studies. 4.09 (0.16)
11. I have been able to contact staff when I needed to. 4.27 (0.16)
12. Good advice was available when I needed to make study choices. 4.11 (0.15)
13. The timetable works efficiently as far as my activities are concerned. 4.09 (0.18)
14. Any changes in the course or teaching have been communicated effectively. 3.95 (0.24)
15. The course is well organised and is running smoothly. 3.87 (0.27)
16. The library resources and services are good enough for my needs. 4.19 (0.26)
17. I have been able to access general IT resources when I needed to. 4.28 (0.23)
18. I have been able to access specialised equipment, facilities or rooms when I needed to. 4.11 (0.23)
19. The course has helped me to present myself with confidence. 4.18 (0.13)
20. My communication skills have improved. 4.31 (0.13)
21. As a result of the course, I feel confident in tackling unfamiliar problems. 4.21 (0.12)
22. Overall, I am satisfied with the quality of the course. 4.16 (0.18)

It could be argued that Johnson was quite right to focus not so much on the average or the best, but rather on the range of scores. However, the way he did this was strange, because he computed percentages of those who did poorly on any one of a raft of measures. This sets quite a high bar for institutions, as a low rating on a single item could create the impression of failure.
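To illustrate why counting institutions that fall short on any one of several items inflates the figure, here is a minimal Python sketch. It assumes, unrealistically, that items are independent and that each has the same chance p of being flagged. The 5% rate and the function name are purely illustrative and are not taken from the NSS data.

```python
def prob_at_least_one(p, k):
    """Chance that at least one of k independent items is flagged,
    when each item is flagged with probability p (illustrative assumption)."""
    return 1 - (1 - p) ** k

# Hypothetical per-item rate of 5%; 9 and 22 match the NSS item ranges 1-9 and 1-22.
for k in (1, 9, 22):
    print(k, "items:", round(100 * prob_at_least_one(0.05, k), 1), "% flagged")
```

Real NSS items are correlated, so actual figures would rise more slowly than this, but the direction is the same: the more items you look at, the more institutions appear to be failing.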

In order to reproduce Johnson’s figures, I had to work out what he meant when he said an institution performed “well below” its peers. I looked at two ways of computing this. First, I just considered how many institutions fell below an absolute cutoff on ratings: I picked out cases where 20% or more of ratings were in categories 1 (definitely disagree) or 2 (disagree); this was entirely arbitrary, and determined by my personal view that an institution where one in five students is dissatisfied might want to do something about it. Using this cutoff, I found that 24% of institutions did poorly on at least one item in the range 1-9 (covering teaching, assessment and feedback), and 35% were rated poorly on at least one item from the full set of 22 items. This was about half the level of problems reported by Johnson.
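The absolute cutoff described above can be sketched in a few lines of Python. This is not the R script used for the analysis: the institution names, the response counts and the function names are all made up for illustration, and only two items per institution are shown.

```python
def share_dissatisfied(counts):
    """counts: number of responses in rating categories 1..5 for one item."""
    return (counts[0] + counts[1]) / sum(counts)

def fails_absolute(items, cutoff=0.20):
    """True if at least one item has `cutoff` or more of its ratings in categories 1 or 2."""
    return any(share_dissatisfied(c) >= cutoff for c in items)

# Made-up response counts (ratings 1..5) for two items at four hypothetical institutions.
institutions = {
    "A": [[2, 8, 10, 50, 30], [5, 20, 10, 40, 25]],   # second item: 25% dissatisfied
    "B": [[1, 4, 10, 50, 35], [2, 8, 10, 45, 35]],
    "C": [[3, 7, 15, 45, 30], [4, 6, 10, 50, 30]],
    "D": [[10, 10, 10, 40, 30], [2, 3, 10, 50, 35]],  # first item: 20% dissatisfied
}

flagged = [name for name, items in institutions.items() if fails_absolute(items)]
percent_flagged = 100 * len(flagged) / len(institutions)
print(flagged, percent_flagged)
```

Changing the cutoff argument shows how sensitive the headline percentage is to what is, after all, an arbitrary choice.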

I wondered whether Johnson had used a relative rather than absolute criterion for judging failure. The fact that he talked of providers performing ‘well below their peers’ suggested he might have done so. One way to make relative judgements is to use z-scores, i.e. for every item, you take the mean and standard deviation across all institutions and then compute a z-score which represents how far each institution scores above or below the average on that item. Using a cutoff of one standard deviation below the mean, I obtained numbers that looked more like those reported by Johnson – 43% doing poorly on at least one of the items in the range 1-9, and 59% doing poorly on at least one item from the entire set of 22. However, there is a fatal flaw in this method: unless the data have a strange distribution, the proportion scoring below a z-score cutoff on any given item is entirely predictable from the normal distribution; for a one SD cutoff, it will be around 16 per cent. You’d get that percentage even if everyone was doing wonderfully, or everyone was doing very poorly, because you are not anchoring your criterion to any external reality. For anyone trained in statistics this is a trivial point, but to explain it for those who are not, just look again at Table 1. Take, for instance, item 21, where the mean rating is 4.21 and standard deviation 0.12. These scores are tightly packed and so a score of 4.09 is statistically unusual (one SD lower than average), but it would be harsh to regard it as evidence of poor performance, given that this is still well in the positive range.
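A small simulation makes the point concrete. The Python sketch below uses simulated, normally distributed institutional means (an assumption; these are not the real NSS scores) with the mean and SD of item 21, then lowers every score by two points. The share falling more than one SD below the mean stays the same, at around 16%.

```python
import random
from statistics import NormalDist, mean, pstdev

def share_below(scores, z_cutoff=-1.0):
    """Proportion of scores whose z-score falls below z_cutoff."""
    m, sd = mean(scores), pstdev(scores)
    return sum((s - m) / sd < z_cutoff for s in scores) / len(scores)

random.seed(1)
# Simulated institutional means for one item (mean 4.21, SD 0.12, as for item 21).
good = [random.gauss(4.21, 0.12) for _ in range(10000)]
# The same simulated institutions with every score two points lower.
poor = [s - 2.0 for s in good]

theoretical = NormalDist().cdf(-1.0)  # area of the normal curve below -1 SD
share_good = share_below(good)
share_poor = share_below(poor)
print(round(theoretical, 3), round(share_good, 3), round(share_poor, 3))
```

Because the z-score is computed relative to the group's own mean and SD, the cutoff moves with the data and says nothing about whether scores are good or bad in absolute terms.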

I have no idea what method Johnson relied upon for the statistics he presented: I am trying to find out and if I do I will add the information to this post. But meanwhile, I have to say I find it disturbing that NSS data appear to have been spun to paint the state of university teaching in as bad a light as possible. We know that politicians spin things all the time, but it is a serious matter if a Government minister presents public data in a misleading way when giving evidence before a select committee. Those working in primary and secondary education, and in our hard-pressed health service, are already familiar with endless reorganisations that are justified by arguing that we ‘cannot stand still’ and must ‘remain competitive’. We are losing good teachers and doctors who have just had enough. We need to draw back from extending this approach to our Higher Education system. Of course, I am not saying it is perfect, and we need to be self-critical, but the imposition of yet another major shake-up, when we have a system that has an international reputation for excellence, would be immensely damaging, and could leave us with a shortage of the talent that universities depend upon.

NB. You can reproduce what I did by looking at this R script, where my analysis is documented. This has flexibility to look at alternative ways of defining the key item in Johnson’s analysis, i.e. the definition of “well below one’s peers”.

PS 14th Dec 2015: Another source of evidence cited in the Green Paper is this report from HEPI. Well worth a read. Confirms widespread student satisfaction with courses. Does show that 'value for money' is rated much higher in Scotland (low fees) than England (£9K per annum).

PS. 16th Dec 2015. I have now had a response from BIS. It is rather hard to follow, but indicates that they do use a relative rather than absolute criterion for expected scores. Expected scores are also benchmarked to take into account student characteristics. I am currently struggling to understand how 66% of institutions can score more than 3 SD below a benchmark on at least one item, given that a z-score as extreme as -3 is expected for only 0.1% of a population. When I get the opportunity, I will look at the HEFCE source they recommend to see if it offers any enlightenment. 
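For reference, the tail areas of the normal distribution at one, two and three SDs below the mean can be checked with Python's standard library. This is just the textbook calculation, assuming normally distributed scores; it is not a reconstruction of what BIS did.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, SD 1

# Proportion of a normally distributed population falling below each z-score.
for z in (-1, -2, -3):
    tail = standard_normal.cdf(z)
    print(f"z below {z}: {100 * tail:.2f}% of a normal population")

tail_3sd = standard_normal.cdf(-3)
```

This gives roughly 15.9%, 2.3% and 0.13% respectively, so a 3 SD cutoff applied to a normally distributed set of institutional scores should flag very few of them on any single item.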

Here is the BIS response:
In the NSS 2015 survey, two thirds of providers are performing well below their peers on at least one aspect of the student experience;

The statistic is based on the National Student Survey 2015, including HEFCE funded institutions with undergraduate students (123 institutions). Answers to all questions (Q1-22) are then compared to their institutional benchmarks. Those institutions that are statistically significantly below their benchmark for at least one question are counted (77 in 2015 NSS data). Therefore, 63% of institutions are performing below their benchmarks on one aspect of the student experience in 2015.

44% of providers are performing well below their peers on at least one aspect of the teaching, assessment and feedback part of the student experience.

This statistic is calculated using the same method as above. The difference is that it is based on Q1-9 of the NSS survey; where Q1-4 relate to teaching and Q5-9 relate to assessment and feedback.


Benchmarks are the expected scores for each question for an institution given the characteristics of its students and its entry qualifications. Benchmarks are based on initial calculations by HEFCE. More information can be found on their website, where benchmarks for Q22 are published.

Statistical significance

Scores are considered statistically different from their benchmarks if they are more than 3 standard deviations and 3 percentage points below their benchmarks. This is the same convention used in the UK HE performance indicators.
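The rule as stated can be written out as a short Python sketch. The response does not say which standard deviation is meant, so the sd argument below is a placeholder, and the scores, benchmarks and function name are invented for illustration.

```python
def significantly_below(score, benchmark, sd, n_sd=3.0, min_gap=3.0):
    """True if score is more than n_sd standard deviations AND
    more than min_gap percentage points below its benchmark."""
    gap = benchmark - score
    return gap > n_sd * sd and gap > min_gap

# Made-up examples: (percent satisfied, benchmark, standard deviation).
examples = {
    "large gap, small SD": (80.0, 85.0, 1.0),  # 5 points and 5 SD below: flagged
    "small gap, small SD": (83.0, 85.0, 0.5),  # 4 SD but only 2 points below: not flagged
    "large gap, large SD": (80.0, 85.0, 2.0),  # 5 points but only 2.5 SD below: not flagged
}
flags = {name: significantly_below(*vals) for name, vals in examples.items()}
print(flags)
```

Under a rule of this shape, how many institutions are flagged depends heavily on which standard deviation is used: if it is small, the 3 SD condition is easily met and the 3 percentage point condition does most of the work.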

I had previously contacted HEFCE, who explained they had not been involved in generating the figures reported by BIS and suggested I contact BIS directly for information. They also said:

As you will be aware HEFCE currently publishes benchmark data for question 22 of the NSS only and the current published data based on this question shows a relatively small proportion of institutions who are significantly below their benchmark. (The data can be accessed from

We together with the other UK funding bodies have highlighted our interest in developing benchmarks for other questions in the recent consultation on information about learning and teaching, and the student experience, however this would need to be considered in a thorough and robust manner including any factors that should be included in a benchmarking that is suitable for publication. (The consultation document is available from

PPS 20th December 2015
I have now created a script in R that creates percentages close to those reported by BIS. The approach is, as I indicate above, still reliant on a statistical definition of 'below expectation' that means that, regardless of how well institutions are performing overall, there will always be some who perform in this range - unless everyone has 100% satisfaction ratings. Those who are interested in the technical details can find the relevant data and scripts on Open Science Framework: 


  1. This is really useful. Just wondering what would happen if you don't combine across all the courses within an institute. It's unclear what Johnson means by "provider" - could be a provider of an individual course?

  2. Thanks. It would not be too hard to modify the script to look at specific subject areas, as these are coded in the raw data. I think it would just be a case of altering a line where the data are selected for analysis. You might also then need to change criterion for excluding HEIs for small N responses - it's currently set at 80, but that might not be realistic when looking at specific courses.
    There were, of course, lots of analyses done when the NSS came out so it may be this is just reinventing the wheel. My main goal was to just try and work out what on earth JJ was referring to - and there's no evidence that he was looking beyond the institutional level.

  4. A thought-provoking piece, thank you Dorothy. Two additional things come to mind:
    1. What students regard as great teaching will always be subjective to some extent, and therefore difficult to nail down with certainty. As Academic Head of Learning and Teaching for my department, I have students coming to me regularly with praise or otherwise for my teaching colleagues. Over the years I've realised that for every member of staff, there will be students who love their teaching and other students who really don't like it. I generally get good feedback on my teaching, but not always. For example, I did an exercise with an introductory psychology class last year where all 40 of us acted out an Action Potential. According to the end of module evaluation forms, most students really enjoyed the activity and felt they had learnt from it, but one complained that I had treated them like nursery school children. Similarly, some students love breaking into smaller groups for discussion, others don't. Some students like it when we use videos in class to supplement our teaching, others find it a waste of time. Whatever we do, some students will like it and some won't!
    2. My university's workload management system, if I've understood it correctly, allows us just one hour of preparation for every hour that we teach. Given that we teach sessions three hours long, that's only three hours to prepare each teaching session. I'm yet to succeed in even updating a 3 hour session in a mere 3 hours, let alone in creating a new one from scratch. Finding time to prepare adequately for teaching when our schedules are becoming increasingly busy and when we are under increasing pressure to generate research income is really not easy.

    1. Just to say that Chloe Marshall emailed me to claim this comment, which blogger somehow allocated to Unknown.
      Thanks Chloe.

  5. Excellent post. I think it is worth emphasising that the very idea of looking at percentages with a single item below threshold is fundamentally flawed as it is confounded with the number of items. Specifically, it has the unattractive property that the more items you have the worse the picture looks. This is the exact opposite of how a good measure should behave - because the number of items is arbitrary. You are, in effect, maximising the role of error variance rather than using items to minimise the role of error.

    1. Thanks Tom. It's rather reminiscent of p-hacking - the more measures you look at the more likely it is you will find an extreme score. The other thing is that you could find absolutely anything you wanted with this approach: I could shift the cutoff so that 90% were doing poorly, or so that only 10% were. We have to ask ourselves why JJ should go for a metric that accentuates the negative. Some have suggested statistical illiteracy, but I suspect a more deliberate agenda.

  6. That BIS draw on statistics of relative performance reminds me of the parallel absurdity in NHS statistics:

  7. Or Michael Gove's claim as Education Secretary that all UK schools should get children to outperform the national average:

  8. Thanks Dorothy, for this thoughtful post. This may be of interest too:
    We need to talk about employability, not employment