Friday 28 November 2014

Metricophobia among academics

Most academics loathe metrics. I’ve seldom attracted so much criticism as for my suggestion that a citation-based metric might be used to allocate funding to university departments. This suggestion was recycled this week in the Times Higher Education, after a group of researchers published predictions of REF2014 results based on departmental H-indices for four subjects.

Twitter was appalled. Philip Moriarty, in a much-retweeted plea, said: “Ugh. *Please* stop giving credence to simplistic metrics like the h-index. V. damaging”. David Colquhoun, with whom I agree on many things, responded like an exorcist confronted with the spawn of the devil, arguing that any use of metrics would just encourage universities to pressurise staff to increase their H-indices.

Now, as I’ve explained before, I don’t particularly like metrics. In fact, my latest proposal is to drop both REF and metrics and simply award funding on the basis of the number of research-active people in a department. But I’ve become intrigued by the loathing of metrics that is revealed whenever a metrics-based system is suggested, particularly since some of the arguments put forward do seem rather illogical.

Odd idea #1 is that doing a study relating metrics to funding outcomes is ‘giving credence’ to metrics. It’s not. What would give credence would be if the prediction of REF outcomes from H-index turned out to be very good. We already know that whereas it seems to give reasonable predictions for sciences, it’s much less accurate for humanities. It will be interesting to see how things turn out for the REF, but it’s an empirical question.

Odd idea #2 is that use of metrics will lead to gaming. Of course it will! Gaming will be a problem for any method of allocating money. The answer to gaming, though, is to be aware of how this might be achieved and to block obvious strategies, not to dismiss any system that could potentially be gamed. I suspect the H-index is less easy to game than many other metrics - though I’m aware of one remarkable case where a journal editor has garnered an impressive H-index from papers published in his own journals, with numerous citations to his own work. In general, though, those of us without editorial control are more likely to get a high H-index from publishing smaller amounts of high-quality science than churning out pot-boilers.
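To make the last point concrete, here is a minimal sketch of the standard H-index definition: the largest h such that h papers each have at least h citations. The two publication lists are invented purely for illustration and are not drawn from any real researcher or department.

```python
def h_index(citations):
    """Return the largest h such that at least h papers have >= h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

# Invented citation counts, for illustration only.
few_strong = [60, 55, 48, 40, 35, 30, 28, 25, 22, 20]  # 10 well-cited papers
many_weak = [4] * 40                                   # 40 rarely cited pot-boilers

print(h_index(few_strong))  # 10
print(h_index(many_weak))   # 4
```

Under these made-up numbers, ten well-cited papers beat forty rarely cited ones. The same definition also means h can never exceed the number of papers published, so quantity is not irrelevant.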

Odd idea #3 is the assumption that the REF’s system of peer review is preferable to a metric. At the HEFCE metrics meeting I attended last month, almost everyone was in favour of complex, qualitative methods of assessing research. David Colquhoun argued passionately that to evaluate research you need to read the publications. To disagree with that would be like slamming motherhood and apple pie. But, as Derek Sayer has pointed out, it is inevitable that the ‘peer review’ component of the REF will be flawed, given that panel members are required to evaluate several hundred submissions in a matter of weeks. The workload is immense and cannot involve the careful consideration of the content of books or journal articles, many of which will be outside the reader’s area of expertise.

My argument is a pragmatic one: we are currently engaged in a complex evaluation exercise that is enormously expensive in time and money, that has distorted incentives in academia, and that cannot be regarded as a ‘gold standard’. So, as an empirical scientist, my view is that we should be looking hard at other options, to see whether we might be able to achieve similar results in a more cost-effective way.

Different methods can be compared in terms of the final result, and also in terms of unintended consequences. For instance, in its current manifestation, the REF encourages universities to take on research staff shortly before the deadline – as satirised by Laurie Taylor (see Appointments section of this article). In contrast, if departments were rewarded for a high H-index, there would be no incentive for such behaviour. Also, staff members who were not principal investigators but who made valuable contributions to research would be appreciated, rather than threatened with redundancy.  Use of an H-index would also avoid the invidious process of selecting staff for inclusion in the REF.

I suspect, anyhow, we will find predictions from the H-index are less good for REF than for RAE. One difficulty for Mryglod et al is that it is not clear whether the Units of Assessment they base their predictions on will correspond to those used in REF. Furthermore, in REF, a substantial proportion of the overall score comes from impact, evaluated on the basis of case studies. To quote from the REF2014 website: “Case studies may include any social, economic or cultural impact or benefit beyond academia that has taken place during the assessment period, and was underpinned by excellent research produced by the submitting institution within a given timeframe.” My impression is that impact was included precisely to capture an aspect of academic quality that was orthogonal to traditional citation-based metrics, and so this should weaken any correlation of outcomes with H-index.

Be this as it may, I’m intrigued by people’s reactions to the H-index suggestion, and wondering whether this relates to the subject one works in. For those in arts and humanities, it is particularly self-evident that we cannot capture all the nuances of departmental quality from an H-index – and indeed, it is already clear that correlations between H-index and RAE outcomes are relatively low in these disciplines. These academics work in fields where complex, qualitative analysis is essential. Interestingly, RAE outcomes in arts and humanities (as with other subjects) are pretty well predicted by departmental size, and it could be argued that this would be the most effective way of allocating funds.

Those who work in the hard sciences, on the other hand, take precision of measurement very seriously. Physicists, chemists and biologists are often working with phenomena that can be measured precisely and unambiguously. Their dislike for an H-index might, therefore, stem from awareness of its inherent flaws: it varies with subject area and can be influenced by odd things, such as high citations arising from notoriety.

Psychologists, though, sit between these extremes. The phenomena we work with are complex. Many of us strive to treat them quantitatively, but we are used to dealing with measurements that are imperfect but ‘good enough’. To take an example from my own research: years ago I wanted to measure the severity of children’s language problems, and I was using an elicitation task, where the child was shown pictures and asked to say what was happening. The test had a straightforward scoring system that gave indices of the maturity of the content and grammar of the responses. Various people, however, criticised this as too simple. I should take a spontaneous language sample, I was told, and do a full grammatical analysis. So, being young and impressionable, I did. I ended up spending hours transcribing tape-recordings from largely silent children, and hours more mapping their utterances onto a complex grammatical chart. The outcome: I got virtually the same result from the two processes – one which took ten minutes and the other which took two days.

Psychologists evaluate their measures in terms of how reliable (repeatable) they are and how validly they do what they are supposed to do. My approach to the REF is the same as my approach to the rest of my work: try to work with measures that are detailed and complex enough to be valid for their intended purpose, but no more so. To work out whether a measure fits that bill, we need to do empirical studies comparing different approaches – not just rely on our gut reaction.


  1. One of the biggest problems of the REF is that it increasingly seems like a sledgehammer to crack a nut - an elaborate system with lots of hidden costs is used to allocate a relatively small amount of money compared with the research council side of the dual support system.

    As you imply, some objections to some use of metrics seem to be religious rather than fact-based - I suspect many people would object to metrics on principle even if they were shown to match very well the actual outcomes of the current process. OK, that's fine but as you note the peer review process is plenty flaky in its own right and there is lots of scope for gaming in the current systems. (The RC peer review process is even more flaky, of course, in that what is being reviewed is research that has not yet been done).

    If we can't agree on simplifying the REF then why not abolish dual support altogether? Either hand all the money to the RCs to dish out, or - even better, abolish the RCs and dish out all the money via the existing REF process. If you add up all the hidden costs of the RC process - peer reviewing applications for discrete parcels of funding - they are likely to look even less efficient than the allocation of QR by the expensive REF. And the peer review of published work is less problematic in principle than the peer review of research proposals - even if it is still open to gaming in practice.

    Generally, I'd prefer to see more, rather than less, diversity in funding mechanisms so I would be reluctant to see the end of dual support. But if we can't agree on reforms, maybe it is the only option?

  2. First, my antipathy to metrics comes from analysing some examples rather than religious zeal.

    Second, as I understand it, most of your observed correlation comes simply from the size of the department - i.e. it is essentially an artefact. Adding the H-index had very little effect.

    The right conclusion from your data should surely be (roughly speaking)
    (a) each person who was submitted had much the same worth, so share the cash equally (per person), and
    (b) the h-index adds very little, so forget it.

  3. #1 of 2.

    First, and like David Colquhoun has suggested previously, Kieron's suggestion of doing away with the dual support system has a heck of a lot to recommend it. But (and despite my criticisms of the research councils (RCs) over the years!) I would not abolish the RCs. Rather I would abolish the REF and transfer the funding to the RCs. The RCUK peer review system should be changed, however, so that impact is judged at the *end* of a grant (and in subsequent years) rather than nonsensically being appraised before the grant starts.

    This seems to be the 'lesser of two evils' to me but I realise that there's a broad spectrum of opinions on this issue!

    David's already handled the key points in his pithy comment above but I'd like to address your three "odd ideas".

    -- 'Odd' idea #1. The THE article trumpets that H-indices can potentially be used to predict REF results. In what sense is that article *not* giving credence to the idea that the H-index is a metric that university managers/PVCs should monitor? *You* may appreciate the subtleties. I can guarantee that many managers and PVCs will certainly not see beyond the "headline" figure. (Look at how universities exploit highly suspect world rankings at the moment). I already know of cases where lectureship applications have been sifted by H-index. The article in the THE is certainly not going to discourage this type of behaviour.

    I simply don't see how the suggestion that the THE article is bolstering the concept of "H-index-for-quality-assessment" is an 'odd' idea. You are helping to strengthen the perception of the H-index as a 'reliable' indicator of quality. Would you really want the H-index to be used for staff appraisal? You laudably criticised KCL severely for its use of a simple-minded metric (grant income) as a mechanism for staff assessment. The H-index is an equally simplistic and flawed metric.

    There's a very good comment from David Riley below the line of the article at the THE website. Here's his first point:

    "The evidence of the link between citations and quality as far as I am aware largely comes from comparing RAE/REF outcomes to citations. To what extent did the panel members use citations to help them decide on rankings? If they used them (whether officially or not) then this puts a question mark over the findings. A correlation would be inevitable regardless of the validity. "

    This is a very important point. I would broaden the point still further and say that the key difficulty with the H-index is that it assumes that there is always a direct and positive relationship between citations and research quality. Citations do not necessarily measure research quality and I can point to many examples where this is not the case. Here's just three from my research field (nanoscience):



    - and, most recently, a paper was published in Science claiming that hydrogen-bonds are directly observed in scanning probe microscope images. A paper subsequently published in Physical Review B convincingly and compellingly has shown that these features most likely arise from artefacts due to the probe itself. Guess which paper will pick up more citations...?

    More generally, citations are a measure of popularity of a paper and this is a function of many variables, which need not include the level of scientific rigour. The potential for headline generation can often trump rigour in those "top tier" journals, as Randy Schekman has highlighted.

  4. #2 of 2.

    [Apologies for having to split my response in two. I hit the 4096 character limit for comments!]

    -- 'Odd' idea #2. I don't see how it's 'odd' to suggest that a H-index based system is highly susceptible to gaming. And I don't see at all how a H-index-based system is somehow likely to be less susceptible than the current REF system (which, of course, is far from ideal). See, for example, this:

    and this:

    and this:

    -- 'Odd' idea #3. I'm also an empirical scientist. But I also know that we shouldn't attempt to quantify the unquantifiable, and we should be very careful that a measurement isn't (i) so invasive that it distorts the system we're measuring, and (ii) is actually a good representation of the quantity we're trying to determine.

    You said: "My approach to the REF is the same as my approach to the rest of my work: try to work with measures that are detailed and complex enough to be valid for their intended purpose, but no more so."

    The H-index is a single number which is easily gamed; effectively impossible to normalise across narrow sub-fields (let alone entire disciplines); often a questionable indicator of research quality; and a quantity which disadvantages early career researchers. If you think that this is "detailed and complex" enough to "be valid for its intended purpose" then I guess we'll just have to agree to disagree.

    >>"To work out whether a measure fits that bill, we need to do empirical studies comparing different approaches – not just rely on our gut reaction."

    I think it's rather unfair to argue that those who are critical of using a simplistic metric like the H-index to assess staff are "arguing from the gut". (And make no mistake, if the H-index was adopted as a mechanism for allocating QR funding, every academic in the country will be under pressure to increase their H-index).

    I will stress again that my H-index is higher than that of *the* leader in my research field -- a scientist who has been responsible for some of the most elegant, inspiring, and, errmmm, heavily cited research in nanoscience. [See ].

    This observation alone is enough, in my view, to discredit the entire H-index concept!

  5. I think economists also like metrics - mainly because we think peer reviewing already peer reviewed publications is a waste of resources, but also because we are aware like psychologists that social science measurement is going to be imperfect. I just published a paper in PLOS ONE, advocating using metrics ( and we are working on another one which I think will be more convincing.

    1. The first line in the abstract of your paper assumes that quality and citations go hand in hand. How do you justify that assumption?

      See, for example, final paragraph of this:

      "The conclusion I would rather draw, however, is that peer review vs. metrics is in many ways not the issue. Neither is capable of measuring research quality as such—whatever that may be. Peer review measures conformity to disciplinary expectations and bibliometrics measure how much a given output has registered on other academics’ horizons, either of which might be an indicator of quality but neither of which has to be."

  6. Previous effort seems to have got lost in the ether...

    There is a correlation, at least for departments over a certain size, between RAE/REF ranking and ranking by h-index.

    Metrics are flawed.

    REF evaluation of papers is flawed because papers cannot be read, there are far too many of them.

    Given the above then we could follow Kieron Flanagan's advice, to which I would add that, for high impact science, the UK has had the model: the LMB. Simple rule, no more than 5 people per PI (technicians, PhD students and postdocs). So dish out the cash per person in each PI's group to a ceiling of 5.

    Alternatively, since both processes are flawed, then we have a choice: throw away our time on process #1 or live with process #2 and have considerably more time to devote to teaching and research. As someone involved in REF enjoying a post REF renaissance, I would take ANY process that didn't consume several years of my career, even if it meant losing resource. Time is the most precious of resources, yet we seem happy to chuck it in the bin.

    1. One of my comments also got lost in the ether, Dave. I assume it's in the moderation queue (too many URLs, maybe).

      My first comment kicked off as follows:

      "First, and like David Colquhoun has suggested previously, Kieron's suggestion of doing away with the dual support system has a heck of a lot to recommend it. But (and despite my criticisms of the research councils (RCs) over the years!) I would not abolish the RCs. Rather I would abolish the REF and transfer the funding to the RCs. The RCUK peer review system should be changed, however, so that impact is judged at the *end* of a grant (and in subsequent years) rather than nonsensically being appraised before the grant starts."

  7. Many thanks for your responses. Philip & ferniglab: I’m sorry your initial comments fell victim to Blogger’s primitive automated system for spam detection. I found Philip’s comment in spam and have reinstated it.

    The thing about H-index pressure is that it isn’t the same as pressure to publish. It is pressure to publish work that will be highly cited – or indeed to be involved in fostering such work (given that we are talking about departmental level). It should discourage people publishing loads of papers. Things like self-citation can be easily discounted. I’m not saying there’d be no pressure, but I am not sure it would be worse than what we have already. I would dispute that the departmental H-index is ‘easily gamed’. E.g. you can game your individual H-index by spurious authorship: this would do the departmental H-index no good if all authors were from your institution.

    Subject-specific variations in Hindex are important if you are comparing across disciplines, but that is not the case here. I accept we could have problems in disciplines with a mix of types: in psychology, neuroscience tends to get higher citations than other types, and I’d worry about everyone moving to neuroscience. Of course, that pressure is already there because that is where the big grant money is. But neuroscience is also more expensive to do than other kinds of psychology, so you could argue that it is right that departments with lots of neuroscience get more core funding. Not sure if it works that way in other disciplines – ie more expensive also tends to be more cited. Would be interesting to have evidence on that point.

    Several people have argued that H-index is flawed by drawing attention to instances where bad work gets a high H-index. But I am concerned with the departmental H-index in aggregate. You can put up with a bit of slop in the system if it averages out when you have big numbers. Whether or not that is so is an empirical question which is addressed by studies such as Mryglod et al.

    I’d share concerns about H-index being used to rate individuals, except that would not be sensible if we are talking at the departmental level, where a highly influential paper is likely to be the result of several people collaborating – some of whom will be juniors whose personal H-index is small.

    Also an empirical question is whether H-index adds anything over departmental size (shorthand for N research active). David says it doesn’t on the basis of very slender evidence: I looked at 2 subjects pretty informally and showed that H-index explained a small amount of extra variance over dept size in psychology and none in physics. He may be right, that overall it’s negligible, but we need more data.

    Like Kieron, I am all for simplification, but I think sole reliance on REF would be terrible! It would put far too much power in the hands of a small group of people.
    I’d also worry about just giving money to research councils. Then the pressure would be on everyone to get expensive grants – and I see that as even more pernicious and damaging to science than pressure to publish.

    As I’ve argued before, I’d be happy with funds allocation in relation to N research-active staff – measured over the whole period to avoid people being parachuted in at the last moment. I think that is unlikely to be acceptable because people want to measure that elusive thing ‘quality’. H-index may be more attractive to the bureaucrats for that reason: If you divide department Hindex by dept size you have a measure of mean ‘quality’ that allows people to compile league tables. Since I don’t like league tables, I don’t regard that as a good thing, but it is a consideration. So I agree that there is more to ‘quality’ than an H-index. However, whatever it is, it’s not clear to me that the REF does a better job of detecting it. Both are proxy indicators, but one is far more efficient than the other, so if they give essentially the same outcome, it is the least bad option.

    1. Thanks for the comprehensive response, Dorothy. I'd like to address a few points.

      "The thing about H-index pressure is that it isn’t the same as pressure to publish."

      I disagree entirely -- it's most definitely pressure to publish! I am going to use my example of Don Eigler again (see Eigler and his group at IBM did science exactly the way science should be done -- carefully, rigorously, and addressing key challenges. Eigler's group would produce a ground-breaking paper which would be heavily cited, and then "go off the radar" for a couple of years until they then "re-appeared" with another inspiring advance.

      But according to Web of Science, Eigler's h-index is 24. Yet if I were to select the most important scientist in the sub-field in which I work, Eigler would be right at the top of the list.

      Why is his H-index so low? Well, it's very simple -- he didn't publish "enough" papers (from the perspective of upping his H-index). What he did publish was exceptionally well-cited, but quality alone is not enough for H-index -- quantity is important as well. Thus, there is no question that a focus on H-index will produce a pressure to publish.

      >>"departmental H-index".

      I don't see how focusing on a departmental H-index is going to relieve the pressure on individual academics to increase their individual H-index. If the metric is "funding is based on average H-index of department" then all staff will be pressured to increase their H-index. I am 100% certain that this would be the case in Nottingham and, from my reading of the Times Higher every week, I don't see strong evidence across the UK HE sector that Nottingham is an outlier with regard to chasing simplistic metrics in a simple-minded way! This is what I meant by not giving credence to the H-index.

      >>"Subject-specific variations in Hindex are important if you are comparing across disciplines, but that is not the case here"

      Again, I fundamentally disagree. Comparing across sub-disciplines (and sub-sub-disciplines) is a *major* issue. The citation behaviour in condensed matter physics and particle physics, as just one example, is very, very different indeed.

      >>"But I am concerned with the departmental H-index in >>aggregate".

      But the aggregate departmental H-index depends on the H-indices of individual members of staff. Sorry to bang on about this, but you know as well as I do what would happen if funding is based on aggregate H-index! Just as we now have for Student Evaluation of Teaching scores and World Rankings, university managers would blindly compare average H-index scores across departments and faculties (down to four or five (in)significant figures!) with no attention paid to variations in citation behaviour. (And even if they were to pay attention to those sub-discipline dependencies, just how would they credibly normalise them out? I shudder to think of the type of multi-parameter functionals that bibliometricians would dream up...)

      >>"H-index may be more attractive to the bureaucrats for that reason: If you divide department Hindex by dept size you have a measure of mean ‘quality’ that allows people to compile league tables."

      This is *exactly* the problem. Why are we pandering to "bureaucrats"? H-index gives the illusion of tracking quality but for all of the reasons discussed above, that's all it is -- an illusion.

    2. Philip. You are doing exactly what I advocate. Take a paper (or person) that everyone agrees is good, and see how they perform. That's not what bibliometricians do, perhaps because the results tend to show how useless metrics really are.

    3. Just seen Anonymous' comment below. It must have been posted while I was in the middle of writing the missive above. Their point about H-index scaling with output number is v. important and supports the "Eigler-centric" argument I made.

      "Recent analysis of h-indices in mathematics and physical sciences suggests that they simply scale with the number of outputs (a combinatorial Fermi problem - details here)"

    4. Thanks, David. The paper cited by Anonymous in their comment below is worth reading. It's an intriguing analysis which is closer to the type of approximate mathematical approach physicists often use than the level of rigour usually associated with pure maths. I've only skim-read it as yet but it's a very interesting analysis.

    5. OK I take your point that you looked at only two subject areas. If bibliometricians were doing their job it is the sort of thing that they should be doing. I hope the fact that they aren't doing it is not because your results so far suggest that H-index contributes very little and that threatens to put them out of business. It would be much cheaper if counting the size of departments could be substituted for employing/buying bibliometrics.

      Perhaps this is a job the HEFCE should be doing, if bibliometricians won't (attention: James Wilsdon)

  8. Let's just use IQ tests in REF.

  9. A chilling moment for me was when the ex-head of my ex-department told me that he had reanalysed the 2008 RAE results and found that an identical outcome was found by just taking the *impact factor* of the journal that outputs were published in and assigning the scores only on this basis.
    Since the JIF is the most spurious of research metrics, I really liked your original post suggesting that the departmental H-index could be used instead. I don't really like the idea of metrics used in assessment either, but since they are (sub-consciously or consciously by the panel in the case of JIF) and since the alternative of actually reading and assessing papers is not plausible, then this seemed like a good idea and would save a lot of time and hassle. My own department hired several FTEs to handle our REF submission: madness! There is one problem however...
    Recent analysis of h-indices in mathematics and physical sciences suggests that they simply scale with the number of outputs (a combinatorial Fermi problem - details here). If this is true outside of Maths and Physics, then the largest departments or the ones that return the most academics in a REF assessment will always come out top.
    Dorothy's last post suggested that using staff number as a proxy could be a shortcut to research assessment and save us all time, and DC has responded that the departmental cat would be returned if this were the case. My point is that the departmental H-index idea and the number of returnees idea may well be one and the same.

    1. Exactly. The point of my first comment was, I think, that Dorothy's data means the opposite of what she suggested. It shows that the H-index has little prognostic value, in that it adds little to what you can predict by counting the number of returns.

    2. Thanks for posting the link to the paper on the H-index as a combinatorial Fermi problem, Anon -- much obliged. It's a really intriguing analysis.


  10. Thanks all. Just 3 more responses.
    1. Just to clear up one point: the departmental H-index is NOT the same as the average H-index per department. It is the H-index you get if you search for papers by departmental address and then compute H-index based on the resulting publications. This is rather different because people may come and go: what matters is the research published from your address. If you try to parachute in a research star, it won't do you much good unless you commit to them long enough for them to publish from your institution and accrue citations. Conversely, if you fire an active researcher for nonproductivity you'd need to be pretty confident that they aren't going to go off and publish their work from a different address.
    2. Re David's point: you would need to demonstrate that the departmental cat was research-active and on the payroll.
    3. Cases like Don Eigler: again I reiterate that I am NOT saying H-index is anything like a perfect indicator of quality of individual researchers or even departments: I am just saying we need a measure for comparing departments that is good enough to act as a proxy scale when allocating income. I don't think there *is* a gold standard, so we are bound to find any measure inadequate. We need to ensure that the amount of time and money we spend on evaluation is not defeating the purpose of the exercise and damaging the ability of researchers to research.
    One final thought: I have suggested elsewhere that University league tables should take into account staff satisfaction. Perhaps if we found a way to include that in the funding metric it would address concerns about adverse effects on science from gaming?
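The distinction drawn in point 1 above, between a departmental H-index and the average of individual H-indices, can be sketched in a few lines. This is only an illustration under stated assumptions: the paper records, the names `Dept X` and `Elsewhere`, and the staff labels A, B and C are all hypothetical, and a real calculation would depend on how a citation database matches addresses.

```python
def h_index(citations):
    """Return the largest h such that at least h papers have >= h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

# Hypothetical records: (authors, address at time of publication, citations).
# Researcher A is a "star" whose best-cited papers were published elsewhere.
papers = [
    (("A", "B"), "Dept X", 30),
    (("A", "C"), "Dept X", 12),
    (("B",), "Dept X", 8),
    (("C",), "Dept X", 5),
    (("A",), "Elsewhere", 100),
    (("A",), "Elsewhere", 90),
    (("A",), "Elsewhere", 80),
]

def departmental_h_index(papers, address):
    """H-index over all papers published from one address; each paper counts once."""
    return h_index([cites for _, addr, cites in papers if addr == address])

def individual_h_indices(papers, staff):
    """H-index per person, over everything they authored at any address."""
    return {
        person: h_index([cites for authors, _, cites in papers if person in authors])
        for person in staff
    }

dept_h = departmental_h_index(papers, "Dept X")
per_person = individual_h_indices(papers, ["A", "B", "C"])
mean_individual_h = sum(per_person.values()) / len(per_person)

print(dept_h)             # 4
print(per_person)         # {'A': 5, 'B': 2, 'C': 2}
print(mean_individual_h)  # 3.0
```

In this toy data, A's earlier papers lift A's personal H-index but contribute nothing to the figure for Dept X, and co-authored papers are counted once for the department rather than once per author.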

    1. I don't disagree with any of these points. Although having benefitted myself from moving in the REF transfer window, I think that the fact that REF outputs are portable has been good for academics and our salaries. I know other people might not agree.

      My concern about using number of returnees to allocate funds is not so much about the REF-ability of the Departmental cat (I agree we could probably nix that aspect of gaming the system). I just worry that the Department of Physics (or whatever) will suddenly become huge for the purposes of REF whereas smaller departments would be closed down. OTOH, at my previous University I saw a very small Dept get a good RAE score in an under-populated UoA. This then resulted in a huge new building and permission to double the number of academic staff. All on the basis of a (probably erroneous) RAE performance.

  11. A bigger aspect of the problem is that fundamental issue in all science of generalising from individuals to populations or in reverse, controlling populations (wholes) by controlling individuals (parts).

    And in the sort of sciences that can point to engineering successes for their epistemic legitimacy (even if the engineering often came first), the daily experience is that what we know about individuals fairly straightforwardly scales up to populations (allowing for explainable size and interaction effects). So, if I know how individual atoms release energy, I get nuclear fission when millions of atoms release energy in fractions of a second. If I know what happens when gasoline explodes once in a closed chamber, I get a combustion engine with hundreds of explosions per second. The same with a computer. If I know how to change the state of a piece of silicon from 1 to 0, I know how to do anything that I can translate into 1s and 0s, even if I do that a million times a second.

    Setting aside the fact that even in the hard sciences this is mostly an illusion because things quickly break down at bigger or smaller scales, this relationship of individuals to populations is much more tenuous in the social and psychological realm.

    Properties of individuals have beguiling similarities to properties of populations but are often completely structurally different. For example, Rational Choice Theory proved to be a very useful model for certain populational economic behaviours. Unfortunately, it did not actually describe how any one individual made choices. It was successful as long as we did not care about what every individual actually did. But when the behaviour of individuals became important, it broke down.

    Readability metrics are another example: they could rely on completely mechanistic measures that didn't even require understanding the text to make predictions about populations (proportion of results of a large number of readers reading a large number of texts). But they are much less useful for judging the readability of any one text. And they are almost completely useless at judging how difficult any one text will be for any one individual. Learning styles is a similar story.

    The REF is a great example. The task is to deal with populations. Distribute a discrete sum across a population. Therefore, a population type measure would be most successful. But the way the task is approached is through assessing the individuals as complete beings. Which means that the measures used cannot work because such in-depth peer review assessment will always produce fundamentally incommensurate results (kind of like Ofsted inspections).

    The paradox is that what is fair to all individuals may turn out to be an unfair system which satisfies no one. All that REF running around will simply have to be reduced to a simple numerical formula. So it would be much smarter to deal with the allocation as a numerical task to start with using a primary proxy measure (size of department sounds like a brilliant solution) but not to then translate that measure into quality. The job is to improve the quality of the system as a whole and any one individual's quality is only distantly related to it. So the REF is more similar to voodoo than the causal investigation it is styled as.

    PS: As I was typing it, I got a strange sense of doing this before. I think I may have made a similar comment on another post here. If so, sorry for belabouring the point.

  12. I like your take on this Dominik. I know you have commented on my REF-related posts before, but I think this perspective is well worth airing.

  13. While Philip Moriarty thinks an example such as Don Eigler indicates that a metric such as the H index won't work, it would appear much less simple. An academic who produces outstanding papers every couple of years or so could be a less than stellar contributor to the current system - they might not have the requisite 4 papers in the REF period. So the metric does no worse / no different than the current arrangements.
    Since no-one is claiming they have the perfect assessment system, pointing out anomalies or injustices in one approach is not so helpful, unless it is clear that these in some tangible sense outweigh the problems in the other. Let alone that the current system can justify the resource costs it consumes given what it delivers, as pointed out by others.

    1. I am not for one second suggesting that the current REF system is not flawed. But the problems with focusing on H-index to my mind clearly outweigh the issues with other strategies because, as I said in the tweet to which Dorothy refers in her post, it lends credence to the concept of the H-index itself.

      I did a session today for the "Politics, Perception and Philosophy of Physics" Year 4 MSci module I teach on the subject of p-values (lots of mentions of David Colquhoun's blog posts on this topic!). There are very interesting parallels between the p-value problem and the use of H-indices in that both attempt to reduce a complex multi-faceted dataset to a single number.

      The P-value concept has been exceptionally damaging to science [R. Nuzzo, Nature 506 150 (2014)]. Let's not start similarly lending the H-index credibility it doesn't deserve.

  14. This has been an interesting discussion.

    Hirsch's 2005 PNAS paper is entitled "An index to quantify an individual's scientific research output". Scaling up from the individual is fraught with difficulty as well as with opportunity for manipulation to produce a desired result. H-indices for addresses, as Dorothy suggests, would undoubtedly produce a flurry of renaming of departments and even institutions. There seems to be no end to the lengths to which managers will go in adjusting the appearance, rather than the reality, of research achievement.

    In contrast, a REF rating, whatever it means, is a property of an ensemble of individuals. In some REF dry runs, individuals have nevertheless been allocated personal REF ratings. The values were obtained by undisclosed means, and by persons who remained anonymous – at least to those who were being rated. The question was then asked, by one's local managers "Well, what are you going to do to increase your rating?".

    If it were possible to answer that question, it would devalue the whole exercise, since REF ratings would then, at least in part, be a measure of the effectiveness of REF tactics rather than research quality.

    I honestly recommend that REF is scrapped completely. Even if an alternative system of evaluation could be agreed, it would only report on past achievement.

    Research is discovery, which is unpredictable. The more important the discovery, the more unpredictable it is.

    I've listed what I consider to be six serious flaws in REF in "Research Assessment and REF" on John F. Allen's Blog.

    I am sure there are many others. Why go to such lengths to devise and implement a system that purports to measure the unmeasurable?

  15. Entering into this late, but many excellent points have been made and this is instructive to all (scientists and, hopefully, the mandarins that seem to think that science assessment exercises are infallible). I have only two things to add:

    1. DC notes that actually reading the papers is the most direct means of assessing quality. This is routinely dismissed as a gargantuan task. However, it can be made far more manageable (and interesting for the scientific assessor) if the applicants are asked to limit submissions to what they consider their 5 most significant papers (could be fewer) and why. In Canada, CIHR used to do this as part of their CV. In their infinite wisdom, they just removed it. As a reviewer, I found the most significant publications section far more valuable than the detailing of their full "productivity".

    2. Someone should ask whether the RAE, etc. are negatively impacting the quality of the science they profess to improve. I can think of several reasons why this may be the case. Firstly, the time and effort consumed is significant. Secondly, it is changing behaviour (and encouraging different behaviour). Is this positive or negative? It is certainly infectious. Thirdly, we, as scientists, are at least partly to blame. We have blithely bargained for more resources to expand our enterprise on the promise of wonderful outcomes. We are being called on it. The return on investment in high quality science is inversely proportional to the amount of time to impact. We are discouraging longer term thinking and acting like professional football clubs, wheeling and dealing players to prop up gate receipts, but without the benefit of tallying balls in the net after 90 mins.
