Like many others, I have moaned about the costs of the REF: not just in money, but also the time spent by university staff, who could be more cheerfully and productively engaged in academic activities. The walrus needs feeding copious amounts of data: research outputs must be carefully selected and then graded in terms of research quality. Over the summer, those dedicated souls who sit on REF panels were required to read and evaluate several hundred papers. Come December, the walrus digestive system will have condensed the concerted ponderings of some of the best academic minds in the UK into a handful of rankings.
But is there a viable alternative? Last week I attended a fascinating workshop on the use of metrics in research. I had earlier submitted comments to an independent review of the role of metrics in research assessment from the Higher Education Funding Council for England (HEFCE), arguing that we need to consider cost-effectiveness when developing assessment methods. The current systems of evaluation have grown ever more complex and expensive, without anyone considering whether the associated improvements justified the increasing costs. My view is that an evaluation system need not be perfect – it just needs to be ‘good enough’ to provide a basis for disbursement of funds that can be seen to be both transparent and fair, and which does not lend itself readily to gaming.
Is there an alternative?
When I started preparing my presentation, I had intended to talk just about the use of measures of citations to rank departments, using analysis done for an earlier blogpost, as well as results from this paper by Mryglod et al. Both sources indicated that, at least in sciences, the ultimate quality-related research (QR) funding allocation for a department was highly correlated with a department-based measure of citations. So I planned to make the case that if we used a citation-based metric (which can be computed by a single person in a few hours) we could achieve much the same result as the full REF process for evaluating outputs, which takes many months and involves hundreds of people.
However, in pondering the data, I then realised that there was an even better predictor of QR funding per department: simply the number of staff entered into the REF process.
Before presenting the analysis, I need to backtrack to explain the measures I am using, as this can get quite confusing. HEFCE deserves an accolade for its website, where all the relevant data can be found. My analyses were based on the 2008 Research Assessment Exercise (RAE). In what follows I used a file called QR funding and research volume broken down by institution and subject, which is downloadable here. This contains details of funding for each institution and subject for 2009-2010. I am sure the calculations I present here have been done much better by others and I hope they will not be shy to inform me if there are mistakes in my working.
The variables of interest are:
- The percentages of research falling in each star band in the RAE. From this, one can compute an average quality rating, by multiplying 4* by 7, 3* by 3, and 2* by 1 and adding these, and dividing the total by 100. Note that this figure is independent of department size and can be treated as an estimate of the average quality of a researcher in that department and subject.
- The number of full-time equivalent research-active staff entered for the RAE. This is labelled as the ‘model volume number’, but I will call it Nstaff. (In fact, the numbers given in the 2009-2010 spreadsheet are slightly different from those used in the computation, for reasons I am not clear about, but I have used the correct numbers, i.e. those in HEFCE tables from RAE2008).
- The departmental quality rating: this is average quality rating x Nstaff. (Labelled as “model quality-weighted volume” in the file). This is summed across all departments in a discipline to give a total subject quality rating (labelled as “total quality-weighted volume for whole unit of assessment”).
- The overall funds available for the subject are listed as “Model total QR quanta for whole unit of assessment (£)”. I have not been able to establish how this number is derived, but I assume it has to do with the size and cost of the subject, and the amount of funding available from government.
- QR (quality-related) funding is then derived by dividing the departmental quality rating by the total subject quality rating and multiplying by overall funds. This gives the sum of QR money allocated by HEFCE to that department for that year, which in 2009 ranged from just over £2K (Coventry University, Psychology) to over £12 million (UCL, Hospital-based clinical subjects). The total QR allocation in 2009-2010 for all disciplines was just over £1 billion.
- The departmental H-index is taken from my previous blogpost. It is derived by doing a Web of Knowledge search for articles from the departmental address, and then computing the H-index in the usual way. Note that this does not involve identifying individual scientists.
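The chain from star-band percentages to a department's QR allocation can be sketched in a few lines of Python. This is only an illustration of the steps listed above: the function names, the two departments and the size of the subject pot are all made up, and the real HEFCE spreadsheet may involve adjustments not captured here.

```python
# Weights described in the post: 4* x 7, 3* x 3, 2* x 1 (1* gets nothing).
RAE_WEIGHTS = {"4*": 7, "3*": 3, "2*": 1, "1*": 0}


def average_quality_rating(profile, weights=RAE_WEIGHTS):
    """profile maps star band -> percentage of research in that band."""
    return sum(weights.get(band, 0) * pct for band, pct in profile.items()) / 100


def qr_funding(departments, overall_funds):
    """departments maps name -> (profile, nstaff); returns name -> QR funding."""
    # Departmental quality rating = average quality rating x Nstaff.
    volumes = {
        name: average_quality_rating(profile) * nstaff
        for name, (profile, nstaff) in departments.items()
    }
    # Total subject quality rating, summed across departments.
    total = sum(volumes.values())
    return {name: overall_funds * v / total for name, v in volumes.items()}


# Hypothetical departments and subject pot, for illustration only.
example = {
    "Dept A": ({"4*": 20, "3*": 40, "2*": 30, "1*": 10}, 30.0),
    "Dept B": ({"4*": 5, "3*": 25, "2*": 50, "1*": 20}, 10.0),
}
allocations = qr_funding(example, overall_funds=1_000_000)
```

By construction the allocations sum to the overall funds for the subject, so a department's share depends only on its quality rating and Nstaff relative to everyone else in the same subject.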
Size-related funding
To check this out, I computed an alternative metric, size-related funding, which multiplies the overall funds by the proportion of Nstaff in the department relative to total staff in that subject across all departments. So if across all departments in the subject there are 100 staff, a department with 10 staff would get .1 of the overall funds for the subject.
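A minimal sketch of that calculation, together with the "± % diff" measure used in Table 1, might look like this. The department names and the QR figures are invented for the example; only the 100-staff, 10-staff case comes from the text.

```python
def size_related_funding(nstaff_by_dept, overall_funds):
    """Share the subject's overall funds in proportion to Nstaff."""
    total_staff = sum(nstaff_by_dept.values())
    return {
        dept: overall_funds * n / total_staff
        for dept, n in nstaff_by_dept.items()
    }


def pct_difference(qr, size_related):
    """Absolute gain or loss, expressed as a percentage of QR funding."""
    return {
        dept: 100 * abs(size_related[dept] - qr[dept]) / qr[dept]
        for dept in qr
    }


# 100 staff across the subject; the department with 10 staff gets .1 of the pot.
staff = {"Dept A": 10, "Dept B": 30, "Dept C": 60}
size_based = size_related_funding(staff, overall_funds=1_000_000)

# Hypothetical QR allocations for the same departments.
qr = {"Dept A": 80_000, "Dept B": 320_000, "Dept C": 600_000}
diffs = pct_difference(qr, size_based)
```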
Table 1 shows: the correlation between Nstaff and QR funding (r QR/Nstaff) and how much a department would typically gain or lose if size-related funding were adopted, expressing the absolute difference as a percentage of QR funding (± % diff).
Table 1: Mean number of staff and QR funding by subject, with correlation between QR and N staff, and mean difference between QR funding and size-related funding
Correlations between Nstaff and QR funding are very high, above .9. Nevertheless, as Table 1 shows, if we substituted size-related funding for QR funding, the amounts gained or lost by individual departments could be substantial. In some subjects, though, mainly in the Humanities, where overall QR allocations are anyhow quite modest, the difference between size-related and QR funding is not large in absolute terms. In such cases, it might be rational to allocate funds solely by Nstaff and ignore quality ratings. The advantage would be an enormous saving in time: one could bypass the RAE or REF entirely. This might be a reasonable option if the amount of expenditure on the RAE/REF by the department exceeds any potential gain from inclusion of quality ratings.
Is the departmental H-index useful?
If we assume that the goal is to have a system that approximates the outcomes of the RAE (and I’ll come back to that later) then for most subjects you need something more than Nstaff. The issue then is whether an easily computed department-based metric such as the H-index or total citations could add further predictive power.
I looked at the figures for two subjects where I had computed the departmental H-index: Psychology and Physics. As it happens, Physics is an extreme case: the correlation between Nstaff and QR funding was .994. Adding an H-index does not improve prediction because there is virtually no variance left to explain. As can be seen from Table 1, Physics is a case where use of size-related funding might be justified, given that the difference between size-related and QR funding averages out at only 8%. For Psychology, adding the H-index to the regression explains a small but significant 6.2% of additional variance, with the correlation increasing to .95.
But how much difference would it make in practice if we were to use these readily available measures to award funding instead of the RAE formula?
The answer is more than you might think, and this is because the range in award size is so very large that even a small departure from perfect prediction can translate into a lot of money.
Table 2 shows the different levels of funding that departments would accrue depending on how the funding formula is computed. The full table is too large and complex to show here, so I'll just show every 8th institution. As well as comparing alternative size-related and H-index-based (QRH) metrics with the RAE funding formula (QR0137), I have looked at how things change if the funding formula is tweaked: either to give more linear weighting to the different star categories (QR1234), or to give more extreme reward for the highest 4* category (QR0039), something which is rumoured to be a preferred method for REF2014. In addition, I have devised a metric that has some parallels with the RAE metric, based on the residual of the H-index after removing the effect of departmental size. This could be used as an index of quality that is independent of size; it correlates at r = .87 with the RAE average quality rating. To get an alternative QR estimate, it was substituted for the average quality rating in the funding formula to give the Size.Hres measure.
Table 2: Funding results in £K from different metrics for seven Psychology departments representing different levels of QR funding
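To make the three weighting schemes concrete, here is a small sketch of how a department's share of the subject pot shifts as the star weights change. The weights follow the labels used above (QR0137, QR1234, QR0039, read as the weights for 1*, 2*, 3* and 4*); the two departments and their profiles are hypothetical, and this is not the calculation behind Table 2.

```python
# Weights for the (1*, 2*, 3*, 4*) bands under each scheme named in the text.
WEIGHTINGS = {
    "QR0137": (0, 1, 3, 7),  # RAE 2008 funding formula
    "QR1234": (1, 2, 3, 4),  # more linear weighting
    "QR0039": (0, 0, 3, 9),  # more extreme reward for 4*
}


def funding_shares(departments, weights):
    """departments maps name -> ((pct 1*, 2*, 3*, 4*), nstaff).

    Returns each department's share of the subject's overall funds.
    """
    volume = {
        name: nstaff * sum(w * p for w, p in zip(weights, profile)) / 100
        for name, (profile, nstaff) in departments.items()
    }
    total = sum(volume.values())
    return {name: v / total for name, v in volume.items()}


# Two hypothetical departments of equal size but different quality profiles.
departments = {
    "Strong": ((5, 25, 40, 30), 40.0),
    "Weaker": ((20, 50, 25, 5), 40.0),
}
shares = {
    label: funding_shares(departments, weights)
    for label, weights in WEIGHTINGS.items()
}
```

In this toy case the stronger department's share is smallest under QR1234, larger under QR0137 and largest under QR0039, which is the direction of the shift discussed in the text.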
To avoid invidious comparisons, I have not labelled the departments, though anyone who is curious about their identity could discover them quite readily. The two columns that use the H-index tend to give similar results, and are closer to a QR funding formula that treats the four star ratings as equal points on a scale (QR1234). It is also apparent that a move to QR0039 (where most reward is given for 4* research and none for 1* or 2*) will increase the share of funds to those institutions that are already doing well, and decrease it for those that already have poorer income under the current system. One can also see that some of the universities at the lower end of the table, all of them post-1992 universities, seem disadvantaged by the RAE metric, in that the funding they received seems low relative to both their size and the H-index.
The quest for a fair solution
So what is a fair solution? Here, of course, lies the problem. There is no gold standard. There has been a lot of discussion about whether we should use metrics, but much less discussion of what we are hoping to achieve with a funding allocation.
How about the idea that we could allocate funds simply on the basis of the number of research-active staff? In a straw poll I’ve taken, two concerns are paramount.
First, there is a widely held view that we should give maximum rewards to those with highest quality research, because this will help them maintain their high standing, and incentivise others to do well. This is coupled with a view that we should not be rewarding those who don’t perform. But how extreme do we want this concentration of funding to be? I’ve expressed concerns before that too much concentration in a few elite institutions is not good for UK academia, and that we should be thinking about helping middle-ranking institutions become elite, rather than focusing all our attention on those who have already achieved that status.
The calculations from RAE in Table 2 show how a tweaking of the funding formula to give higher weighting to 4* research will take money from the poorer institutions and give it to the richer ones: it would be good to see some discussion of the rationale for this approach.
The second source of worry is the potential for gaming. What is to stop a department from entering all its staff, or boosting numbers by taking on extra staff? The first point could be dealt with by having objective criteria for inclusion, such as some minimal number of first- or last-authored publications in the reporting period. The second strategy would be a risky one, since the institution would have to provide salaries and facilities for the additional staff, and this would only be cost-effective if the QR allocation would cover it. Of course, a really cynical gaming strategy would be to hire people briefly for the REF and then fire them once it is over. However, if funding were simply a function of number of research-active staff, it would be easy to do an assessment annually, to deter such short-term strategies.
How about the departmental H-index? I have shown that it not only is a fairly good predictor of RAE QR funding outcomes on its own, incorporating as it does both aspects of departmental size and research quality, but it also correlates with the RAE measure of quality, once the effect of departmental size is adjusted for. This is all the more impressive when one notes that the departmental H-index is based on any articles listed as coming from the departmental address, whereas the quality rating is based just on those articles submitted to the RAE.
There are well-rehearsed objections to the use of citation metrics such as the H-index: first, any citation-based measure is useless for very recent articles. Second, citations vary from discipline to discipline, and in my own subject, Psychology, within sub-disciplines.
Furthermore, the H-index can be gamed to some extent by self-citation, or scientific cliques, and one way of boosting it is to insist on having your name on any publication you are remotely connected with, though the latter strategy is more likely to work for the H-index of the individual than for the H-index of the department. It is easy to find anecdotal instances of poor articles that are highly cited and good articles that are neglected. Nevertheless, it may be a ‘good enough’ measure when used in aggregate: not to judge individuals but to gauge the scientific influence of work coming from a given department over a period of a few years.
The quest for a perfect measure of quality
I doubt that either of these ‘quick and dirty’ indices will be adopted for future funding allocations, because it’s clear that most academics hate the idea of anything so simple. One message frequently voiced at the Sussex meeting was that quality is far too complex to be reduced to a single number. While I agree with that sentiment, I am concerned that in our attempts to get a perfect assessment method, we are developing systems that are ever more complex and time-consuming. The initial rationale for the RAE was that we needed a fair and transparent means of allocating funding after the 1992 shake-up of the system created many new universities. Over the years, there has been mission creep, and the purpose of the RAE has been taken over by the idea that we can and should measure quality, feeding an obsession with league tables and competition. My quest for something simpler is not because I think quality is simple, but rather because I think we should use the REF just as a means to allocate funds. If that is our goal, we should not reject simple metrics just because we find them oversimplistic: we should base our decisions on evidence and go for whatever achieves an acceptable outcome at reasonable cost.
If a citation-based metric can do that job, then we should consider using it unless we can demonstrate that something else works better.
I'd be very grateful for comments and corrections.
Mryglod, O., Kenna, R., Holovatch, Y., & Berche, B. (2013). Comparison of a citation-based indicator and peer review for absolute and specific measures of research-group excellence. Scientometrics, 97(3), 767-777. DOI: 10.1007/s11192-013-1058-9
I found your new analysis fascinating. It implies that the quality of submitted staff differs little between universities.
Allocation on the basis of department size is the simplest and cheapest of all solutions. The only problem is to prevent submission of the departmental cat to bolster numbers, and hiring people for the REF who never take up residence. It should be possible to use existing data to test various schemes for preventing these forms of dishonesty (I don't like the euphemism "gaming"). As you suggest, there are many ways in which eligible staff can be selected automatically and they should be tested. And absentee stars could surely be detected, and rejected, by inspection of employment records.
To include the H-index would produce great pressure to improve your H-index by the usual well-understood forms of dishonesty. But the H-index contributes such a small amount of predictive ability that it's not worth including it.
It may be objected that such a wonderfully cheap and quick system would put out of business the hordes of commercial producers of metrics. This could only increase the honesty of the scientific enterprise (in any case, the h-index is available free from Google Scholar). The traditional publishers are having to face up to the inexorable fact that they are about as relevant to present day publishing as the handloom weavers were in the early 18th century.
Very interesting. It would be useful to repeat it on newer funding data, say the 2014-15 data, which are (of course) also on the HEFCE website, but where everything except the mysterious 'total QR quanta' is also still based on the last RAE profiles and staff numbers. (The analysis won't be identical, because HEFCE changed the way that the amounts depend on the star rating profile - now only the staff numbers rated at 3* or 4* come into it, whereas in the 09/10 data that you looked at, 2* came into it as well. It's this change that makes it potentially interesting to do it again, otherwise the results would doubtless be pretty well identical to what you found.)
Actually I think we should be much more interested than people usually seem to be in the size of the 'subject pots', that is, the total QR quanta for units of assessment. I did do some digging on where these come from, but haven't got that info here, so the following might not be perfectly right but it's certainly quite close. There are some details on the HEFCE site but they need a bit of digging, and in the case of psychology and geography they involved some adjustments, that looked rather ad hoc to me, to allow for the fact that these subjects cover some research that is similar in cost to social science subjects, and some that is similar to lab subjects. But putting that to one side, basically they divide up the total available funding in proportion to subject measures that depend on the number of researchers submitted and on multipliers that say how much a researcher costs to do research in different subjects, relative to the cheapest subjects. These ratios are strange - they haven't been revised for decades, I think, and there are only 3 different bands if I recall right (or maybe 4) so it's a bit crude. These determine how much a department attracts in QR funding (and remember that depts. don't get that money direct, it all goes to the university who may divide it up differently). HEFCE did commission some consultants to look at the relativities (report on the HEFCE site but I couldn't find it just now) and they reported that the real cost differences are nothing like the figures used, but if they were changed to something closer to what the consultants found, basically nearly all the research budget would go to biomedical research and the rest of us would be fighting over a few crumbs.
My more general reaction to the workshop was that people were very often failing to distinguish between using metrics to measure quality and using metrics to divide up the available funding. I think everyone would agree that research excellence has many dimensions, but a quantity of money doesn't have many dimensions.
Sorry to have another go, but 2 more thoughts:
1. The discrepancies you found between the 'model volume number' in the spreadsheet and the submitted staff numbers from RAE2008. I think this may be because the 'model volume number' doesn't count all the submitted staff, but only the percentage of them that corresponds to the % of the profile that got 2* or better. Certainly that's the case in more recent allocations (except that it's a bigger difference because they don't include the 2* percentage). Note that I'm not talking about the quality adjusted version, just the total before they do that - it's still not all the submitted staff. I know you used the actual submitted numbers in your analysis, but this is a reason why it might not fit nearly so well for later years.
2. Since the money doesn't go to departments, but forms part of a block grant to universities, and since some universities (quite a proportion, I believe) don't allocate QR funding internally in proportion to HEFCE UoA allocations, it might be very instructive to see how university total allocations across all their UoAs would differ if your method of just using staff numbers were used.
(2) is a really good point. When I tried this analysis, I got an overall correlation between total QR income and total staff of .989.
Yes, I did it too, for the 2014-15 allocations, and got .975. (A bit different from you - did you include the UoAs that had staff in the RAE but get no QR because their profile has no 3* and 4*? - but that's not important, the point is that it's certainly very high). Also a correlation of .983 between the actual 14-15 allocations to universities and what they'd get if the allocation was (within UoAs) proportional to staff numbers submitted, as Dorothy suggested.
But you've got to be careful not to assume that these high correlations imply the allocations would be just about the same if they were based on staff numbers alone. High correlation means there's a strong linear relationship, not that one quantity is equal to the other one. The regression equation to get the actual allocation from the proportional-to-staff (within UoAs) allocation is (roughly), in £,
Actual = -1.7m + 1.2 * prop-to-staff allocation
and, perhaps even more interesting, if you regress the actual allocation on the total staff submitted (FTE) regardless of messing about with different 'subject pot' sizes to take account of different research costs etc., you get
Actual = -2.7m + 32k per staff member
So we could use that and get pretty similar allocations to now, one could argue. Universities that submit fewer than 85 staff aren't going to be happy because they get a negative allocation, and nor are the biggest QR recipients, Oxford and Cambridge, because they are high outliers in both models.
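As a rough check on the arithmetic, the break-even point of that second regression line can be worked out directly. The sketch below uses only the rounded coefficients quoted above (-2.7m and 32k per staff member); the function and variable names are made up for the example.

```python
# Rounded coefficients from the regression quoted above, in pounds.
INTERCEPT = -2_700_000   # allocation predicted for zero staff
SLOPE = 32_000           # extra allocation per FTE staff member submitted


def predicted_allocation(staff_fte):
    """Predicted allocation under Actual = -2.7m + 32k per staff member."""
    return INTERCEPT + SLOPE * staff_fte


# Staff number at which the predicted allocation crosses zero.
break_even_staff = -INTERCEPT / SLOPE

# A university just below and just above the break-even point.
below = predicted_allocation(84)
above = predicted_allocation(85)
```

With these rounded figures the line crosses zero between 84 and 85 staff, which matches the remark that submissions of fewer than 85 staff would come out negative.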
Also with allocation proportional to staff numbers within UoAs (and not 'corrected' by regression to look more like the current actuals), Oxford and Cambridge would lose out pretty badly - each would get over £20m less than their current (£77m each) allocations. Imperial and UCL would be big losers too (about £10m each), and Manchester about £9m, while Plymouth and Leicester would each gain a bit less than £5m.
Fun, eh?
I used the 2009/10 figures, so that probably accounts for our different results.
Don't forget that you're saving the vast majority of the cost of the assessment exercise too, which was apparently £47m in 2008. If you assume that this is real money, then it covers the Oxbridge loss you calculated on its own. And if you were still worried about the effect on one type of institution I guess it would be pretty easy to adjust the objective definition of who counts as research active to provide any outcome that the sector perceives as desirable.
Great post. Those correlations between number of staff and QR income are astonishingly high. I wonder what the test-retest reliability of the REF process is. It seems entirely plausible that it wouldn't be higher than those rs.
This sounds like a very sensible suggestion. I'm no fan of metrics but it seems that allocation of a discrete numeric value (funds) would be much better served by a discrete numeric measure (research staff numbers) than anything else. Also much more difficult to game than quality metrics which are all too often a consequence of fads and trends rather than lasting quality. Any such thing should be judged with a 5-10 years gap. The REF (and formerly RAE to which I was closer) seems more like a game of Potemkin villages than anything of any value (given the amount of time and resource it takes both for the assessors and the assessed).
But more interestingly it also speaks to a cultural bias in assessment and education. The Anglo Saxon is obsessed with fairness to the extent of obscuring any possible level of transparency. I went through my university education being assessed by individual lecturers during a 10-15 minute conversation about the subject. And have done both kinds of assessments. Although there will be differences in reliability and validity in the two approaches, they are probably not that different in their ultimate fairness and subject to the same kinds of individual and systemic biases. As long as the system does not incentivise too radically inequitable distribution of resources, it is more likely to be fair if it is simple than complex.
I think there is a difficulty with the analysis in that the number of staff returned (n) is itself substantially a function of quality (because of selectivity). It is likely that departments only returned staff estimated to be above a particular threshold and that (generally) this threshold was fairly stable across departments.
If this was a perfectly efficient process then this would mean that n captures nearly all the variation in quality. The interesting thing is that the h-index (or other quality measures) then becomes a measure of the total inefficiency in the process (either at department level, inherent difficulty in measuring quality and biases etc. in the panel outcomes).
Dorothy I think that there are three points that make your exercise less valuable than it otherwise might be.
I agree with you that it would be good to reduce cost but I think it is essential to measure the quality and impact of HEFCE funded research in order to justify spending public funds on it. So even if staff numbers are a perfect predictor, you need a quality measure. I also think that impact measures are very important in Social Sciences and Humanities.
Second, I am not sure that finding a low cost way of assessing quality will make much difference to the overall cost because I suspect that most of the costs of preparation for the REF are not strictly necessary. For example, it would be much cheaper to submit all staff but Universities would rather engage in expensive selection exercises that reduce their overall income from the REF because it gives them a chance to rise up the league tables.
Third, without the REF, academic departments would have no incentive to manage staff in ways that support their research. A lot of departments try to do this now and some of them do it very well, to the benefit of staff.
"Third, without the REF, academic departments would have no incentive to manage staff in ways that support their research. A lot of departments try to do this now and some of them do it very well, to the benefit of staff."
...??! Some academic departments do try and do this, but 'very well' is not a view I've ever heard from the rank and file. The view from the trenches would be closer to:
1) Granting a sabbatical is the only real managerial tool of any value, and historically that was a right anyway;
2) Most of the rest consisting of HoDs/Deans/managers repeating 'Yes, we do know you're trying' whilst little or nothing else concrete happens.
3) Ever more pressure on those in the department the bosses see as 'non research-viable going forward' (in their parlance) to transfer to a non-research teaching-only or 'teaching and scholarship' contract
You might be interested to learn that a regression model using metrics to predict RAE 2008 outcomes explained 92% of the variance for Business and Management Studies (Kelly et al) and 82% for Sociology (Kelly and Burrows). The latest journal metrics (e.g. SNIP2) improve or substitute for the indicators used in these reports.