Saturday 26 January 2013

An alternative to REF2014?

After blogging last week about use of journal impact factors in REF2014, many people have asked me what alternative I'd recommend. Clearly, we need a transparent, fair and cost-effective method for distributing funding to universities to support research. Those designing the REF have tried hard over the years to devise such a method, and have explored various alternatives, but the current system leaves much to be desired.

Consider the current criteria for rating research outputs, designed by someone with a true flair for ambiguity:
Rating  Definition
4*      Quality that is world-leading in terms of originality, significance and rigour
3*      Quality that is internationally excellent in terms of originality, significance and rigour but which falls short of the highest standards of excellence
2*      Quality that is recognised internationally in terms of originality, significance and rigour
1*      Quality that is recognised nationally in terms of originality, significance and rigour

Since only 4* and 3* outputs will feature in the funding formula, a great deal hinges on whether research is deemed “world-leading”, “internationally excellent” or “internationally recognised”. This is hardly transparent or objective. That’s one reason why many institutions want to translate these star ratings into journal impact factors. But substituting a discredited, objective criterion for a subjective criterion is not a solution.

The use of bibliometrics was considered but rejected in the past. My suggestion is that we should reconsider this idea, but in a new form. A few months ago, I blogged about how university rankings in the previous assessment exercise (RAE) related to grant income and citation rates for outputs. Instead of looking at citations for individual researchers, I used Web of Science to compute an H-index for the period 2000-2007 for each department, searching on the ‘address’ field. As noted in my original post, I did this fairly hastily, and the method becomes problematic where a Unit of Assessment does not correspond neatly to a single department. The H-index reflected all research outputs of everyone at that address – regardless of whether they were still at the institution or were entered for the RAE.

Despite these limitations, the resulting H-index predicted the RAE results remarkably well, as seen in the scatterplot below, which shows H-index in relation to the funding level following from the RAE. This funding level is the number of full-time staff equivalents multiplied by the formula:
    0.1 × 2* + 0.3 × 3* + 0.7 × 4*
(N.B. I ignored subject weighting, so units are arbitrary).
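To make the two calculations concrete, here is a minimal sketch in Python. The function and variable names are my own illustrations, not anything official; the weights are the ones given above, and subject weighting is ignored as in my analysis:

```python
def h_index(citations):
    """Largest h such that at least h papers have >= h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

def funding_weight(fte, pct_2star, pct_3star, pct_4star):
    """RAE funding formula: FTE staff times the weighted star percentages.
    Subject weighting is ignored, so the units are arbitrary."""
    return fte * (0.1 * pct_2star + 0.3 * pct_3star + 0.7 * pct_4star)

# A 'department' whose papers received these citation counts has h = 4:
print(h_index([25, 8, 5, 4, 3, 1, 0]))  # -> 4
```

Applied at the level of a whole department's address-matched outputs, this is all the scoring machinery the proposal requires.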

Psychology (Unit of Assessment 44), RAE2008 outcome by H-index
Yes, you might say, but the prediction is less successful at the top end of the scale, and this could mean that the RAE panels incorporated factors that aren’t readily measured by such a crude score as H-index. Possibly true, but how do we know those factors are fair and objective? In this dataset, one variable that accounted for additional variance in outcome, over and above departmental H-index, was whether the department had a representative on the psychology panel: if they did, then the trend was for the department to have a higher ranking than that predicted from the H-index. With panel membership included in the regression, the correlation (r) increased significantly from .84 to .86, t = 2.82, p = .006. It makes sense that if you are a member of a panel, you will be much more clued up than other people about how the whole process works, and you can use this information to ensure your department’s submission is strategically optimal. I should stress that this was a small effect, and I did not see it in a handful of other disciplines that I looked at, so it could be a fluke. Nevertheless, with the best intentions in the world, the current system can’t ever defend completely against such biases.
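The structure of the panel-membership analysis is easy to reproduce in outline. The sketch below uses randomly generated data purely for illustration (the real departmental figures are on the RAE2008 site); what it shows is the comparison of a regression of outcome on H-index alone against one that adds panel membership as a second predictor:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data only: 76 'departments' with an H-index, a panel-membership
# indicator, and a funding outcome that depends weakly on membership.
n = 76
h = rng.uniform(10, 60, n)
on_panel = rng.integers(0, 2, n).astype(float)
outcome = 2.0 * h + 5.0 * on_panel + rng.normal(0, 8, n)

def multiple_r(y, *predictors):
    """Multiple correlation: correlation between y and its least-squares fit."""
    X = np.column_stack([np.ones_like(y)] + list(predictors))
    fitted = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return np.corrcoef(y, fitted)[0, 1]

r_h = multiple_r(outcome, h)             # H-index only
r_h_panel = multiple_r(outcome, h, on_panel)  # H-index + panel membership
print(round(r_h, 3), round(r_h_panel, 3))
```

In-sample, adding a predictor to a nested least-squares model can only raise the multiple correlation; the substantive question, as in my analysis, is whether the increase is statistically reliable.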

So overall, my conclusion is that we might be better off using a bibliometric measure such as a departmental H-index to rank departments. It is crude and imperfect, and I suspect it would not work for all disciplines – especially those in the humanities. It relies solely on citations, and it's debatable whether that is desirable. But for the sciences, it seems to be measuring pretty much whatever the RAE was measuring, and it would seem to be the lesser of various possible evils, with a number of advantages compared to the current system. It is transparent and objective, it would not require departments to decide who they do and don't enter for the assessment, and, most importantly, it wins hands down on cost-effectiveness. Had we used this method instead of the RAE, a small team of analysts armed with Web of Science could have derived the necessary data in a couple of weeks, with outcomes virtually identical to those of the RAE. The money saved, both by HEFCE and by individual universities, could be ploughed back into research. Of course, people will attempt to manipulate whatever criterion is adopted, but this one might be less easily gamed than some others, especially if self-citations from the same institution are excluded.

It will be interesting to see how well this method predicts RAE outcomes in other subjects, and whether it can also predict results from the REF2014, where the newly-introduced “impact statement” is intended to incorporate a new dimension into assessment.


  1. This is a very interesting proposal, and would certainly save lots of time and money. Notwithstanding the fact that h-index seems to predict an impressive proportion of the variance in the last RAE results, there are two reasons I'd be concerned about this.

    One is that relying on the H-index discriminates against early-career researchers, and could make departments reluctant to bring in new blood who may have great potential but whose citation counts are inevitably lower than those of someone much further on in their career – who may not have produced much of note recently.

    The second issue, of course, is that the h-index takes no account of why work is cited. A UoA that published hundreds of papers a year, all of them wrong, as demonstrated by hundreds of other papers that cite them in order to rip them to pieces, would score just as highly as a UoA publishing hundreds of groundbreaking, field-defining, Nobel-prize-winning papers.

    Not sure if there's any way around these issues, although if h-index ends up predicting the REF results as reliably as last time, perhaps it doesn't matter. No system is ever going to be perfect, and if yours is just as imperfect as the current one, but at a tiny fraction of the cost, then we should certainly move to yours!

  2. "But substituting a discredited, objective criterion for a subjective criterion is not a solution."

    Did you mean to say "... for a discredited subjective criterion"?

  3. This is an interesting proposal that is consistent with an analysis that Mike Eysenck and Andy Smith did about 10 years ago. They calculated the citations that every staff member in a UK psychology department got in a single year (1998). They then calculated an average number of citations for each department for that year and found that this correlated remarkably well with RAE grades given in 1996 and 2001. See for the full run down. They argue that many of the concerns that people have about citations (e.g. a paper cited as an example of poor methods) become less significant when one averages across a whole department.

    Having had the pleasure of leading two RAE/REF submissions for my department, I would definitely favour this kind of objective, simplified approach. Honestly, when the work of a robot comes to the same outcome as the thousands of human hours poured into the REF ... well, it is hard to see how the current approach can continue to be justified.

    1. Thanks Kathy for pointing me to the Eysenck/Smith piece. Makes the same point v. clearly. Only difference is that if you do H index by department it is even more efficient, and avoids all the problems from people with common names etc. I extracted the H index data for 76 psychology departments in an afternoon.

  4. As pointed out in the discussion, the problem is that junior members of a department will naturally make smaller contributions to the departmental H-index or average citation count. An obvious solution is to penalise each department for the average seniority of its staff (say, the number of years since PhD). Then each department would have to balance junior and senior members. But this type of system would make bean-counting people very happy, because they could quite easily set up numerical targets (e.g., you are 5 years past your PhD, so you need to have X citations in a single year).

  5. What about calculating (number of citations)/(number of years since PhD)
    for each staff member, and averaging those over the department?
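    To make that proposal concrete, here is a rough sketch in Python; the staff records are invented for illustration:

```python
# Illustrative only: each staff member has a total citation count and
# a number of years since PhD.
staff = [
    {"citations": 400, "years_since_phd": 20},
    {"citations": 60,  "years_since_phd": 3},
    {"citations": 150, "years_since_phd": 10},
]

def dept_citation_quotient(staff):
    """Average over staff of (citations / years since PhD)."""
    quotients = [s["citations"] / s["years_since_phd"] for s in staff]
    return sum(quotients) / len(quotients)

print(dept_citation_quotient(staff))  # (20 + 20 + 15) / 3 = 18.33...
```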

  6. The point about junior staff that several of you have raised is interesting, but my suspicion is that it may exert a relatively small effect when considering departments as a whole. But that could be tested by obtaining data on age profile for departments and seeing if the outcome was affected by weighting for that.
    Remember, though, that I counted citations only for work published in the period covered by the RAE: usually the H-index of a more senior person is higher than for a junior one just because they have been publishing for longer.
    And suppose a department had to replace someone who had retired, and they had to choose between a young 'rising star' and an established person. The established person might seem a good, though (in salary terms) more expensive, bet, but they would not bring any benefit from their prior publications, because they would be attached to a different address. So I don't think decisions about appointments would necessarily be influenced to any noticeable degree if my scheme were adopted. Indeed, I suspect it might reduce the tendency for a 'transfer market' in jobs in the period prior to the REF.

  7. As has often been observed, once a metric becomes a target, its use as a measure becomes meaningless. Peer-review is much harder to subvert or distort.

    As well as the problem of junior staff above, from the Department's point of view, publications by single PhD students with their supervisor will also be disadvantaged by the H-index: a single major publication combining several theses would be 'better' than several publications, each showing the excellence of a student and needed to get them their next position (notwithstanding 'joint first author' or listed 'contributions').

    A further problem with the H-index is that reviews are included – and reviews are not the original research being measured in the REF. Generally, reviews attract substantially higher citation rates (at least in the short term) than original research. If the H-index (or indeed any non-peer-reviewed bibliometric) is used, then academics will be driven to write more reviews. But you can't exclude all of them: a meta-analysis might or might not count as a review. What about a novel model at the end of a lengthy review?

    Finally, for the avoidance of the ambiguity in your first sentence regarding REF2014, "No sub-panel will make any use of journal impact factors ... in assessing the quality of research outputs." You can't be clearer than that.

    (Declaration: REF sub-panel member)

  8. Pathh - thanks for your comment. As a REF sub-panel member you have my deepest sympathy, but I have to disagree with much of what you say.
    I should preface by saying I am not wild about bibliometrics, and I'm aware that H-index is very subject-sensitive. But as Kathy mentions above, if you get the same result from an automated data processing exercise that takes a couple of weeks as you do from 2 years preparation and pondering by hordes of highly-skilled people, then I think it's worth looking at.
    Re "Peer-review is much harder to subvert or distort": I am a great fan of David Colquhoun, who disparages most bibliometrics and says people should just read the papers and form an opinion. I prefer to do that myself (and do so when rating grants, for instance), but it is open to personal bias – and many people reckon it is unworkable on the scale required by the REF. Also, I've had a lot of informal feedback on my previous post, some from people in other countries, and a common theme is that I shouldn't object to impact factor because at least it is objective. These commenters come from countries where things are decided by a cabal of the great and good, and they feel lack of transparency is a problem. And the effect I noted of panel membership is a bit worrisome too.
    Re the PhD student: I don't really buy this argument: the scenario of being held back from publishing is not one I've encountered, but it might be no bad thing to have more multi-experiment papers than bitty publications. The student isn't going to get entered in the REF anyhow. In terms of future career, surely, there'd be real benefit in the longer term of being an author on a high-impact meaty paper, especially if each contribution is clarified.
    Reviews - again, I don't have a problem with that -surely they only get highly-cited if they are good, and most fields could benefit from some good integrative reviews. If more people are encouraged to write reviews, I'd not see that as damaging science - it might actually make people stop and think about how their work fits into a broader context.
    And finally, re the impact factor statement - you are right, it couldn't be clearer, but look at the storify/comments on my previous post and you'll find plenty of people who just don't accept it. This includes some senior figures involved in REF preparations and panels. What needs clarifying by HEFCE is WHY this statement was included in the REF guidance, and WHAT can be done if people are found to be flouting it.

    1. I hardly think an "automated data processing exercise that takes a couple of weeks" is likely to be an acceptable way to distribute something like £10 billion of taxpayers' money, transparent though it might be. Where will you find a database with sufficient accuracy? What 'outputs' are you including? As you note, a 'cabal of the great and good' is potentially as bad as leaving it to a group of well-lobbied politicians, although I would encourage both groups to have an input into how universities should be funded. Whatever the case, it is very important to consider what is a suitable method for distributing this large component of university funding, and to present the data to help make the decisions about REF20_?_21.

      Before REF2014 went ahead, there was considerable work on bibliometric approaches (and, indeed, consideration of simply funding as in the last decade or so). Not surprisingly, because that is what they are designed to measure, correlations of most bibliometric methods with RAE scores, funding and other measures of quality were good, but far from perfect. And the outliers, I think, are the important data points that allow evolution or, indeed, revolution in funding – of subjects or universities. As an extreme example, I expect at one time alchemy or eugenics papers would have been well-cited (self-perpetuating) and even better funded, but would not qualify as having national significance in terms of originality, significance and rigour (i.e. <1*).

      I'm also a fan of David Colquhoun's blog, and agree with 'just read the paper'. I disagree that it is 'unworkable on the scale required by the REF'. The sizes of sub-panels, distribution of papers, appointment of assessors, number of papers required from each active researcher, and number of scale points (unclassified, 1*-4*) have been (or in some cases still are being) considered carefully to ensure the methods are workable and there are the resources to assess submissions exactly as in the submission guides. Your effect of panel membership on RAE scores is certainly worrisome, though.

      Aside from REF, nobody wants to read bitty papers. But I don't think a typical graduate student whose main PhD work leads to middle authorship on an important but many-authored paper is likely to have a well-rounded PhD training. They are unlikely (it is possible, but unlikely) to have picked themselves the problem, nor chosen the experiments they have done having looked at what else is known about their problem, nor considered the holes in the background literature, nor to have written the paper's discussion.

      All panel members have agreed not to make any use of journal impact factors in assessing the quality of research outputs, so the issue is closed. I would have thought the members of the panels (mostly academics anyway), and the administrators, would work together to ensure that this, like all other points in the guidance documents, is followed exactly. Personally, I can't see why any panel member would want to subvert the statement anyway: journal impact factor is widely known not to be a good proxy for the quality of a paper.

      Re declaration, I should have added that I am (I hope obviously) writing in a personal capacity. I subscribe to the REF sub-panel confidentiality conditions, which allow public comments. Unlike other jobs, some of which I have taken on and others not, it is not a role that needs sympathy – it will be a very large, well-planned and interesting job, with neither an impossible nor an overwhelming amount of work next year. Getting funding of university research right, based on research excellence, is critically important, so I want to help get this worthwhile task completed as accurately and rigorously as possible.

  9. It's got to be better than what the research councils currently seem to be using: gross RC income per university, which also correlates with the RAE but discriminates against smaller universities. I'd love to know who some of your outliers are.

  10. This is a great alternative. As research director for a UoA that isn't psychology (allied health), we are entering the REF with other departments, but the H-index will still work so long as departments entered together are H-indexed as a group. Re early-career researchers, I like the idea of the quotient with experience.

    And although the REF2014 criteria include that sentence about not using IF, they also say that papers will be assessed "with reference to international research quality standards" – it seems naive to think that no panel member will use IF at some level as this type of "reference", especially when reviewing content with which they are less familiar.

    Attaching submissions to the institution credited on the paper is fairer and will stop the mad rush to employ people for REF census dates, sometimes with little regard for whether they stay after this date, and should help departments to encourage and grow young researchers without the danger of them leaving and taking the CV your department nurtured with them. UoAs that are more interdisciplinary than psychology, such as allied health, will also avoid the current problems of reviewers reading out of area (from sociolinguistics to optometry in our case).

    I also don't understand why a good, highly cited review is not counted as a 'research' contribution, and in this sense the H-index solves rather than creates a problem. There is an issue re why things get cited, but that bias is not entirely absent from the reviewing process either, and poor but highly cited papers account for a small proportion of submissions. Personally, I would be delighted if an H-index system were introduced. As departments, we can begin to effect this change if we all include a departmental H-index in our environment statements, so that this metric starts to become the norm.

  11. I think this could be a starting point. However, it needs to be emphasized and kept in mind that once such metrics were to become the rule for dispensing a significant amount of funds, the metrics would be the targets of the departments and not the science giving rise to the metrics. Departments would do whatever it takes to increase their h-index and often the shortest route is not to do good science.

    Thus, once metrics become involved, one needs to establish a dynamic rather than a static system, in which the set of metrics used evolves and improves at every iteration, such that the shortest and most effective way for departments to max out on the metrics is simply to do good science.

  12. Many people have asked about that outlier in the bottom right-hand corner of the plot: massive H-index but less funding than would be predicted. This is the University of Cambridge, and I've poked around in the data a bit more to see if I can account for this. Cambridge had the highest average ranking of all psychology departments, but it's a small department with only 24 staff entered: the funding is weighted for this.
    Now this raises an interesting point: it could be argued that the departmental H-index already takes number of staff into account, because the more people publishing, the greater the potential for high citations. Therefore, you might argue that Cambridge should have had an even better financial outcome, because it was punching above its weight in terms of size.
    But there's yet another wrinkle on this. Further scrutiny of the publications revealed that the high H-index was largely attributable to one illustrious researcher, with a staggering number of very highly cited papers in the RAE period. So I suspect the current system, which allows people to submit only 4 publications, would have worked against Cambridge here, because someone who had 20 very highly cited papers over the reporting period would not be distinguished from someone who had four: the outputs would be rated 4* in either case. Whether this is a good or bad thing is up for debate. The limit of four papers encourages people to write meaty papers with a good chance of having impact, rather than spreading their work thinly. But this system works against those rare individuals who can produce substantially more papers than this without a loss of quality.

  13. NikkiB raised an interesting point about how bibliometrics would affect departments (UoAs) that are interdisciplinary. As NikkiB points out, bibliometrics would avoid the problem of REF reviewers having to judge the quality of papers at the periphery of their expertise. But bibliometrics would create unhealthy tension between different disciplines within departments (UoAs), because baseline citation rates are higher in some fields (e.g., brain imaging) than in others (e.g., traditional experimental psychology without neurophysiological measures). But maybe, since bibliometrics are relatively low cost, UoAs could be broken down into smaller units to avoid the problem. (My understanding is that, in REF2014, psychology, neuroscience and psychiatry were all put into one UoA due to the cost of having too many UoAs.) Or, alternatively, somebody may already have come up with an adjustment to the H-index that takes the baseline citation rate of a particular field into account.

  14. I think Kita makes an important point. If one of the impacts of the REF was to encourage psychologists to become neuroscientists in order to chase higher H-index scores it would be, to my mind, a bad thing. We already have too much maths and physics envy. We do not need more encouragement.

    How about this as an alternative? We drop the REF and put all the money into funding grants. Another terrible waste of time is writing grant proposals that have a 5-10% chance of funding. If this money could be used to dramatically raise success rates, it would get money to better departments/researchers with little cost to the administration of grants.

  15. My statistician's eye found the scatterplot quite enlightening because it's got some classic signs of heteroscedasticity and a hint of nonlinearity, which makes a linear regression (even with .84 R^2) somewhat suspect. Would it be possible to obtain the data? If nothing else it would make a MARVELOUS teaching example.

    1. Your statistician's eye missed that it's an r of .84, r-squared of .7

  16. Sotaro/Jeff: there are variants on the H-index that do take discipline into account, see:
    But it could be argued that neuroscience costs more to do than other fields.
    I'm sympathetic to Jeff's point, but you do have to decide who to give the funds to. Would you be happy if all Universities got the same amount? If not, you will need to specify some way of divvying it up.
    Jay - the file is on its way to you. I'll be interested to see what you make of it. One point: unusually for social sciences, this is total population data - i.e. there aren't any other universities that we want to generalise to.

  17. Very interesting post. I did something similar for physics. I included an h-index ranking and a citations per publication ranking. It seemed somewhat consistent with the RAE2008 results, but there were some anomalies and I didn't do any kind of statistical analysis.

    I'm not really a fan of using metrics to judge things, but I tend to agree that in this case it's probably okay as long as departments are sufficiently diverse so that variations in citation practices average out. If you're interested in the physics ranking that I did, the link (if it works) is below

    1. Thanks for this. You were ahead of me!
      Just entered a comment on your blog, after doing the stats for physics the same way as I did for psychology, using your H index figures.
      Remarkable agreement. Using research income as the dependent variable (ie taking into account N people entered) and H index as predictor, the correlation is .80. Adding N panel members as a predictor, the correlation jumps to .92. So it seems that panel membership has an even bigger effect in physics than in psychology. I did this quickly and it needs double checking, but all the relevant numbers are available on the RAE2008 site.

  18. If you want to see the extent to which departmental h-index (x) is a predictor of RAE income (y), I think that the correlation coefficient is not the right way to do it. Surely you should calculate the regression of y on x, with confidence limits for the predicted y. I suspect that will give a less optimistic view of the predictive ability than the (rather modest) r-squared of 0.7.

    My main worry about your proposal is that, if it were adopted, one can imagine the enormous pressure that would be exerted on every academic to increase their h-index. It is an (arbitrary) measure based on citations. The easiest way to get an enormous number of citations is to do no research at all, but to write reviews of trendy areas. Another way is to write 100s of short papers. Or, of course to be plain wrong (Andrew Wakefield's fraudulent paper had 758 citations by 2012).

    What you should not do is write anything mathematical. For example, a 1992 paper by Jalali & Hawkes has only 28 citations, despite the fact that it laid the foundations for all subsequent work on maximum likelihood estimation in the analysis of single ion channel records. That's the highly original paper that gets few citations, because it is hard. Try it yourself: Generalised eigenproblems arising in aggregated Markov process allowing for time interval omission. Advances in Applied Probability 24, 302–321.
