Note:

This is the html version of the document. You may prefer to use the markdown version. Open RStudio and, in the Console, copy and paste the following:

download.file("https://www.dropbox.com/s/h08u29493dduq6r/lab1-in-class-version.Rmd?dl=1", "lab1.Rmd")
file.edit("lab1.Rmd")

Introduction

In my news feed the other day I noticed the work of colleagues at the Centre for Multilevel Modelling here in Bristol.1 The argument they were making is that pupils’ backgrounds should be better taken into consideration when compiling data for secondary school league tables in England, because contextual factors matter to a child’s achievement. The ‘best’ schools are not simply those whose pupils achieve the highest grades, because that attainment is a function not just of what the schools do but also of who they recruit. We know some pupils face considerably more social and economic barriers to their learning than pupils from more advantaged backgrounds. If schools recruit only the most advantaged pupils, it is hardly surprising that they achieve higher grades, on average. What characterises a good school, therefore, is one that does most to raise the attainment of its pupils given their particular backgrounds.

What interested me most about this research was the finding that “a fifth of schools would see their national league table position change by over 500 places if such factors were taken into account.”2 In other words, if you change the metric of ‘success’ then the rank order of schools changes too. In some respects this is not surprising: change the measurement and you will most likely change the result. But what if decisions are being made on less-than-perfect rankings that give a misleading impression of schools’ achievements? If there are questions to be asked about the reliability of school league tables, that ought to concern parents who use them to make choices about their children’s education.

This raises at least two statistical issues that are of relevance to this unit.

Here, however, I want to focus on a third issue: the attempt to create ‘league tables’, whether from schools data, as in the example above, or, as in the examples that follow, by ranking Universities from ‘the best’ downwards.

League tables and Rankings

League tables of school performance often appear in the media4 but not for all parts of the UK – in fact, they are only published for England, having been abolished in Wales, Northern Ireland and Scotland. They are contentious: on the one hand, they can be said to provide information to support parental choice; on the other, they can be regarded as crass or overly simplistic, reducing the multidimensional nature of education, and of what might constitute a good school, to a single and subjective ranked index.

Various types of league table have arisen in the public and other sectors, ranking institutions from ‘top’ (‘best’) to ‘bottom’ (‘worst’) – this despite the concerns that have been raised against them.5 Universities are not immune to this trend, with examples including the World University Rankings6 compiled by the THES, and the QS World University Rankings7. But there is a problem: there is no agreement on what is being measured or how it should be measured. To take an example: what is academic reputation, how can it be measured, and is its definition somewhat circular – does Oxbridge have a good reputation because it has a good reputation? All University rankings emerge as a product of choices that could have been made differently.

This isn’t to say that measures of the various functions and achievements of Universities are always without merit. Rather, the problem is one of collapsing those various measures into a simple (and simplistic) ranking. As my friend and Professor of Geocomputation, Chris Brunsdon,8 sometimes reminds people, you cannot rank multidimensional data (i.e. a data set containing more than one variable). Sadly, that doesn’t stop people from trying! And they can do so in lots of different ways with lots of different data, which helps explain why there has been a proliferation of such rankings and also how University publicity can select and promote only those that rank them highest! A moment’s thought will reveal the futility of the exercise: if there is no single and definitive way to characterise either a ‘good’ or a ‘bad’ University then there is no foolproof way to rank them in a league table, despite the best efforts of those who wish to do so. Providing more-and-more rankings doesn’t solve the problem; it just reveals the arbitrariness of each one.
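
To see the point concretely, here is a minimal sketch using two made-up institutions (the names and scores are entirely hypothetical). Each outperforms the other on one dimension, so any rank order between them is created by the choice of weights, not by the data:

# Two hypothetical institutions: A is stronger on Teaching, B on Research.
# Neither dominates the other, so no ranking exists until weights are chosen.
toy <- data.frame(Name = c("A", "B"), Teaching = c(80, 60), Research = c(60, 80))

with(toy, 0.7 * Teaching + 0.3 * Research)  # favour Teaching: A scores 74, B scores 66
with(toy, 0.3 * Teaching + 0.7 * Research)  # favour Research: A scores 66, B scores 74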

Reducing multivariate data to a ranking

Imagine you have selected a number of variables that you consider to represent the various sub-domains you regard as important to a University. They might include grant income, the staff-to-student ratio, student satisfaction, the price of a cup of tea, whatever. These measurements are themselves partial because what you are trying to quantify (the quality of a University, for example) is not clearly demarcated. Because of this, it would be sensible to provide as wide a range of data as possible and allow people to use them all ‘in the round’. In fairness, that is increasingly what is happening for schools performance data9 and for data about Universities too.10

Even so, the temptation to reduce these data to a single ranking remains. The easiest way to do this is to add the variables together, but that assumes each variable should count equally in your measurement of (in this case study) Universities. The THES does not do this; instead it gives a weight of 30% to its Teaching score, 30% to Research, 30% to Citations, 7.5% to International Outlook and 2.5% to Industry Income. The obvious question to ask is: what happens if you change those weightings? Let’s find out!

First, we will read in the data and tidy it a little, retaining only the top 200 ranked Universities:

if(! "tidyverse" %in% installed.packages()[,1]) install.packages("tidyverse")
require(tidyverse)
download.file("https://www.dropbox.com/s/ebhcq1uoo1l8bnk/rankings.csv?dl=1",
              "rankings.csv")
read_csv("rankings.csv") %>%
  na.omit() -> df_all
names(df_all) <- c("Rank", "Name", "Overall", "Teaching", "Research", "Citations", "Industry", "Outlook")
df_all %>%
  slice(1:200) %>%
  mutate(Rank = as.numeric(Rank)) -> df
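
As a quick sanity check (a sketch only: the THES may rescale or standardise the domain scores before combining them, so an exact match with the published Overall column is not guaranteed), we can apply the stated weights to the domain scores ourselves:

# Recompute the overall score from the published weights, assuming Overall is
# a simple weighted sum of the domain scores (which may not hold exactly)
df %>%
  mutate(recomputed = 0.3 * Teaching + 0.3 * Research + 0.3 * Citations +
           0.075 * Outlook + 0.025 * Industry) %>%
  dplyr::select(Name, Overall, recomputed) %>%
  slice(1:5)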

Having done so, let’s look at the ‘top ten’ Universities as currently ranked – the output will appear on screen.

df %>%
  arrange(Rank) %>%
  dplyr::select(Rank, Name) %>%
  slice(1:10)

Now let’s change the weightings, making them 50% for Teaching, 20% for Research, 15% for Citations, 10% for International Outlook and 5% for Industry Income. Again, the output should appear on screen.

df %>%
  mutate(newscore = 0.5 * Teaching + 0.2 * Research + 0.15 * Citations +
           0.1 * Outlook + 0.05 * Industry) %>%
  mutate(newrank = rank(desc(newscore)), originalrank = Rank) %>%
  arrange(newrank) %>%
  dplyr::select(newrank, originalrank, Name) %>%
  slice(1:10)

How about if we place the weight much more firmly on teaching?

df %>%
  mutate(newscore = 0.9 * Teaching + 0.05 * Research + 0.05 * Citations +
           0 * Outlook + 0 * Industry) %>%
  mutate(newrank = rank(desc(newscore)), originalrank = Rank) %>%
  arrange(newrank) %>%
  dplyr::select(newrank, originalrank, Name) %>%
  slice(1:10)

Unsurprisingly, changing the weights changes the ranking. That ranking reflects a value-judgement about what we think a University should be doing – a value-judgement that is then operationalised by the weights used.
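
We can also quantify, rather than just eyeball, how much the ranking moves. As a simple sketch, here is the mean absolute change in rank position under the teaching-heavy weights, together with Spearman’s rank correlation between the two orderings:

# Summarise how far institutions move under the teaching-heavy weights
df %>%
  mutate(newscore = 0.9 * Teaching + 0.05 * Research + 0.05 * Citations,
         newrank = rank(desc(newscore))) %>%
  summarise(mean_abs_change = mean(abs(newrank - Rank)),
            spearman = cor(newrank, Rank, method = "spearman"))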

Let’s now expand our view beyond the top ten, comparing the original ranking of the 200 retained institutions with the new ranking that gives greater prominence to teaching. Looking at the chart created by the code chunk below, the trend appears to be that the larger an institution’s original rank number (i.e. the lower its original score), the more susceptible its rank is to the change in weightings:

if(! "ggplot2" %in% installed.packages()[,1]) install.packages("ggplot2")
require(ggplot2)

df %>%
  mutate(newscore = 0.9 * Teaching + 0.05 * Research + 0.05 * Citations +
           0 * Outlook + 0 * Industry) %>%
  mutate(newrank = rank(desc(newscore)), originalrank = Rank) %>%
  ggplot(aes(x = originalrank, y = newrank)) +
  geom_line() +
  geom_abline(intercept = 0, slope = 1, linetype = "dotted")

The Uncertain Nature of the Ranks

What begins to emerge from the graph is an issue of uncertainty. Although the league table seems to state, categorically, that the University ranked, say, 50th is better than the one ranked 55th, there is in fact considerable uncertainty in the measurements and in the way the ranking has been constructed, which undermines any definitive interpretation of it.

Another way to reveal the uncertainty is through a thought experiment. Imagine you gathered 1000 people in a room. Call them experts. All broadly agree with the THES weightings used in the original ranking, but not exactly so: there is random variation from person to person. The question is, allowing for this variation, how much would the 2019 THES rankings of Universities change, using, as before, the data published (to their credit) on their website?11 This is what the following code simulates.

nsims <- 1000
set.seed(1751)

# Do the simulation, with target weights as for the THES but allowed to vary

results <- matrix(nrow=nrow(df_all), ncol=nsims)  # one column of ranks per 'expert'
w <- matrix(nrow=nsims, ncol=5)                   # one row of weights per 'expert'

for(i in 1:nsims) {
  
  # Draw each weight as a binomial count out of 100, centred on the THES
  # targets (30/30/30/2.5/7.5), then renormalise so the weights sum to 1
  w[i,1] = rbinom(1, 100, 0.3)    # Teaching
  w[i,2] = rbinom(1, 100, 0.3)    # Research
  w[i,3] = rbinom(1, 100, 0.3)    # Citations
  w[i,4] = rbinom(1, 100, 0.025)  # Industry Income
  w[i,5] = rbinom(1, 100, 0.075)  # International Outlook
  w[i,] = w[i,] / sum(w[i,])
  
  tmp <- df_all %>%
    transmute(new = w[i,1] * Teaching + w[i,2] * Research + w[i,3] * Citations +
                w[i,4] * Industry + w[i,5] * Outlook) %>%
    mutate(new = rank(desc(new), ties.method = "average"))
  
  results[,i] <- tmp$new
  
}
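
Before summarising the ranks, a quick check (exact values will vary a little with the seed) that the normalised weights do centre near the THES targets of 0.30, 0.30, 0.30, 0.025 and 0.075:

# Average simulated weight per domain, in the order Teaching, Research,
# Citations, Industry, Outlook
round(colMeans(w), 3)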

# Summarise each institution's simulated ranks by their central 95% interval
# (cutting 2.5% from each tail) and focus on the 'top 200'

tmp <- apply(results, 1, function(x) {
  quantile(x, probs = c(0.025, 0.975))
})
tmp <- tmp[,1:200]

# Gather the plot data and plot it

mydata <- data.frame(x = 1:nrow(df), lwr = tmp[1,], upr = tmp[2,])

ggplot(mydata, aes(x=x, y=x, ymin=lwr, ymax=upr)) +
  geom_linerange() +
  theme_minimal() +
  geom_point(size=0.1) +
  xlab("Actual Rank") +
  ylab("Range of simulated ranks")

The code above should produce a chart. What that chart shows is that even if everyone broadly agrees on the weightings (and on the data to which they are applied), variations in the ranks still occur by chance.
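
One way to summarise the chart is by the width of each institution’s 95% interval of simulated ranks; a short sketch, reusing the mydata object built above:

# Median and maximum width of the 95% intervals of simulated ranks
mydata %>%
  mutate(width = upr - lwr) %>%
  summarise(median_width = median(width), max_width = max(width))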

Remember, this thought exercise assumes that everyone broadly agrees on the weightings. If they don’t, what do you think will happen to the variation in the ranks? Let’s see…

# Do the simulation again, now without any constraint on the weights

results <- matrix(nrow=nrow(df_all), ncol=nsims)
w <- matrix(nrow=nsims, ncol=5)

for(i in 1:nsims) {
  
  # Each weight is drawn uniformly at random, then renormalised to sum to 1
  w[i,1] = runif(1)  # Teaching
  w[i,2] = runif(1)  # Research
  w[i,3] = runif(1)  # Citations
  w[i,4] = runif(1)  # Industry Income
  w[i,5] = runif(1)  # International Outlook
  w[i,] = w[i,] / sum(w[i,])
  
  tmp <- df_all %>%
    transmute(new = w[i,1] * Teaching + w[i,2] * Research + w[i,3] * Citations +
                w[i,4] * Industry + w[i,5] * Outlook) %>%
    mutate(new = rank(desc(new), ties.method = "average"))
  
  results[,i] <- tmp$new
  
}

# Summarise each institution's simulated ranks by their central 95% interval
# (cutting 2.5% from each tail) and focus on the 'top 200'

tmp <- apply(results, 1, function(x) {
  quantile(x, probs = c(0.025, 0.975))
})
tmp <- tmp[,1:200]

# Gather the plot data and plot it

mydata <- data.frame(x = 1:nrow(df), lwr = tmp[1,], upr = tmp[2,])

ggplot(mydata, aes(x=x, y=x, ymin=lwr, ymax=upr)) +
  geom_linerange() +
  theme_minimal() +
  geom_point(size=0.1) +
  xlab("Actual Rank") +
  ylab("Range of simulated ranks")

Now even greater variations in the ranks occur. And this still ignores any uncertainty in the actual data: what we have been simulating is uncertainty about the exact weights that should be applied to the data. The data themselves will generate further uncertainty, because they will never be perfectly complete and error-free measures of whatever it is we hope to measure.
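
To put a rough number on ‘even greater variations’, we can ask, as a sketch, how many institutions could plausibly reach the top ten under at least some set of weights, using the intervals from the unconstrained simulation above:

# Institutions whose 95% interval of simulated ranks reaches the top 10
sum(mydata$lwr <= 10)

# Median interval width, for comparison with the constrained simulation
median(mydata$upr - mydata$lwr)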

Discussion

As a quantitative geographer, I value data and the information and knowledge that can be drawn from them, even if that knowledge is subjective (as, I suspect, it always will be). I am not necessarily against the use of data to evaluate the strengths and weaknesses of institutions or to make comparisons between them. I am, however, concerned about poor comparisons and the University equivalent of ‘teaching to the test’: trying to game the system to maximise a ranking that has no widespread agreement of its worth.

There is a clear parallel between these league tables and those used elsewhere, including for schools, and the same sorts of criticism apply. School data are, however, provided by the DfE, which is accountable to the taxpayer in a way that the publishers of University rankings are not. Still, there is no need to throw the baby out with the bathwater, and to that end I once suggested some framing principles under which rankings might operate.12 Those framing principles are listed below.

Framing Principles

  • There is no one way of measuring the excellence of Universities. Excellence is an amorphous concept that reflects the value-judgements assigned to the various and diverse civic and public functions Universities serve. Any measure of University performance is inherently subjective.
  • It follows that rankings are not objective facts; they are statistical artefacts and constructs.
  • This does not mean that it is impossible, or of no value, to measure some of the various teaching, research and other outcomes that Universities generate. It is, however, to recognise that no measure is beyond debate, nor can any be said to be definitive (or the best).
  • It follows that users should be discouraged from attaching too steadfast a meaning to any one ranking, such as ‘these are the best Universities, and these are not’.
  • It also follows that data compilers should provide a clear theoretical or conceptual justification for their choice of measurements and the way they are analysed – this would include a clear statement of which variable domains are included, which are omitted, and why; and a rationale for any weightings that are applied to the variables to create a summary or overall score.
  • It is recognised as good scientific practice for the results of data analysis to be verifiable through reproduction. This requires open access to the underlying data as well as a clear ‘code book’ outlining the analytical processes applied to the data, such as data transformation or standardisation.
  • Where secondary data are used, those sources should be stated, together with any known errors or biases associated with the data. Where new data are collected, convenience samples should be avoided, and robust and representative sampling procedures employed.
  • All data are uncertain. As such, confidence intervals, credible intervals or other measures of uncertainty should be provided to supplement the rankings, allowing users to determine substantive differences between institutions and more-than-random changes in the rank positioning of an institution over successive time periods.
  • By being honest about uncertainty, users should be discouraged from drawing what is likely to be erroneous meaning from small differences; it should be understood that a rise of a few places up the rankings is unlikely to represent any statistically significant change. As public educators, Universities and their press offices have a responsibility to use statistics in an informed way and not to exaggerate inconsequential results.
  • If the method of data collection or of analysis changes from one period to another, this should be clearly highlighted by the provider in order to prevent the erroneous comparison of institutions from one time period to another.
  • Notwithstanding the checks and safeguards listed above, upfront rankings, scored by the data provider, are better avoided entirely. This is because there is no way to translate the multidimensional nature of Universities, nor even multivariate measures of their performance, into a summary score or ranking without applying inherently subjective weightings. Instead, data providers should be encouraged to provide a range of data and the tools by which end-users can apply their own sortings and rankings, as well as to measure confidence in the differences between institutions according to those rankings. This is better than specifying rankings and weightings a priori, and gives users more control over the data.
  • Equally, it is better to provide more data than less, reflecting the diverse and multidimensional nature of University activities.
  • Data integrity concerns not only the lineage and scope of the data but also the ethical consequences of their usage. The publication and dissemination of University Rankings have effects, both positive and negative, upon the sector they measure and the people employed in it. Providers should be encouraged to work with stakeholders from a range of institutions and at a range of professional levels to be cognisant and understanding of those effects, and to monitor their more adverse consequences.
  • Data provide opportunity to highlight the strengths of the sector and of the various institutions within it. However, rankings create a zero-sum game of competition (because if one institution goes up, another must go down).
  • Quantitative measures will always be partial and incomplete. The value of in-depth and qualitative approaches should be explored to provide better understanding of what makes each University unique.
  • Finally, it is suggested that further consideration of University Rankings be undertaken by the UK Forum for Responsible Research Metrics, with the potential to audit the data providers by these or other agreed-upon principles (cf. IREG Observatory on Academic Ranking and Excellence).

  1. It was featured on the BBC News website.

  2. BBC News website, op. cit.

  3. Regression analysis is often used for this purpose.

  4. For example, https://www.bbc.co.uk/news/education-46947617

  5. https://www.thebritishacademy.ac.uk/publications/measuring-success-league-tables-public-sector

  6. https://www.timeshighereducation.com/world-university-rankings/2019/world-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/scores

  7. https://www.qs.com/rankings/

  8. https://www.maynoothuniversity.ie/people/chris-brunsdon

  9. https://www.gov.uk/school-performance-tables

  10. For example, https://www.timeshighereducation.com/world-university-rankings/2019/world-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/scores allows the data to be sorted by various measurement domains.

  11. The broad data used by the THES to produce their overall score is available on their website. It would be nice to have access to all the underlying data but I suspect this is a pipe dream - selling data and data consultancy is a revenue stream for this and other companies.

  12. https://paper.dropbox.com/doc/University-Rankings--Anwisu74FkT4ttcfNmWt2QmGAg-idaxkLSKfkaqEHvazeYSA