Introduction

About THE

Times Higher Education World University Rankings is an annual evaluation of universities’ performances conducted by the Times Higher Education (THE) journal. The ranking takes a number of indicators, including teaching, citations, the share of international students, female/male ratio, etc. Among with the World ranking, there are other lists: for example, HSE holds 301-350 place in the World University Rankings 2022, 151-175 place in the World Reputation Rankings 2020, and 57 place in the Young University Rankings 2021.

In this paper, I first get the THE data and then analyze the last-years dynamics at the top of its World Rankings.

Libraries & a small chunk

There are 2 sources of getting THE data:

Wikipedia;
THE website.

Though I focus on the second option (which is challenging as THE site presents data in the dynamic form), I want to briefly discuss the first source as well. During the lesson, we discussed scraping with XML package and readHTMLTable() function. It takes 3 lines of code and results in a data.frame where the first row reflects the column names. That is okay as the result requires some processing anyway, but it seems to me that the htmltab package is more user-friendly. The code needed when using this package is 1/3 shorter and does not penetrate into the first row. Nevertheless, htmltab() ruins the majority of the column names (due to hyperlinks) and changes the row order - still, it seems to need less processing in total.

####################
#### libraries #####
####################
# working with data:
library(tidyverse)
library(stringr)
library(readr)
library(kableExtra)

# scraping:
library(rvest)
library(htmltab)
library(httr)
library(XML)
library(RSelenium)
library(wdman)

#ethics:
library(robotstxt)
library(Rcpp)

# slopegraph:
library(CGPfunctions)


###############################
######## xml approach #########
###############################

di <- GET("https://en.wikipedia.org/wiki/Times_Higher_Education_World_University_Rankings")
doc <- readHTMLTable(doc = content(di, "text"))
parsed <- doc[[3]] # save the third list Democracy Index 2020
parsed[1:3,1:4] %>% 
  kable(escape = FALSE,
        caption = "`XML` approach: original column names become the first row.") %>% 
  kable_styling()

`XML` approach: original column names become the first row.
V1	V2	V3	V4
Institution	2021–22[46]	2020–21[47]	2019–20[48]
University of Oxford	1	1	1
Harvard University	2	3	7

###################################
######## htmltab approach #########
###################################

link <- "https://en.wikipedia.org/wiki/Times_Higher_Education_World_University_Rankings"
parsed <- htmltab(link, 3)
parsed[1:3,1:4] %>% 
  kable(escape = FALSE,
        caption = "`htmltab` approach: the column names are ruined and row order is changed.") %>% 
  kable_styling()

`htmltab` approach: the column names are ruined and row order is changed.
	Institution	2021вЂ“22	2020вЂ“21	2019вЂ“20
2	University of Oxford	1	1	1
3	Harvard University	2	3	7
4	California Institute of Technology	2	4	2

RSelenium package

Before scraping

Though the methods discussed earlier are not specific for the Wikipedia and can be used for the other pages, it is impossible to use either of them for the THE website: the result in each case contains an empty data.frame with the column names. For that reason, I decided to try using the RSelenium package - I had not experienced working with it before the task, and it proved to be a good opportunity.

I did not figure out how to launch it from RStudio, so, I describe my not-R steps in the beginning:

Open the command line, change the directory so that it leads to the browser app and selenium-server file;
After that, the 3 elements should be paste there: (1) java, (2) -Dwebdriver.chrome.driver=“C:projects.exe”, and (3) -jar selenium-server-standalone-3.141.59.jar, with spaces. These examples are mine, there should be specifications for the other users depending on the chosen browser, the selenium version, and the working directory.
The input from the previous step allows to go the browser and paste: localhost:[connection number]/wd/hub.
There, a user needs to click on the “create session” button and then select his/her browser.

Now, it is time to connect the RStudio to the server:

con <- remoteDriver(remoteServerAddr = "localhost",
                    port = 4444,
                    browserName = "chrome")
con$getStatus()

## $ready
## [1] TRUE
## 
## $message
## [1] "Server is running"
## 
## $build
## $build$revision
## [1] "e82be7d358"
## 
## $build$time
## [1] "2018-11-14T08:25:53"
## 
## $build$version
## [1] "3.141.59"
## 
## 
## $os
## $os$arch
## [1] "amd64"
## 
## $os$name
## [1] "Windows 10"
## 
## $os$version
## [1] "10.0"
## 
## 
## $java
## $java$version
## [1] "1.8.0_281"

…and open the connection (open the new browser window managed by automated test software):

con$open()

## [1] "Connecting to remote server"
## $acceptInsecureCerts
## [1] FALSE
## 
## $browserName
## [1] "chrome"
## 
## $browserVersion
## [1] "94.0.4606.54"
## 
## $chrome
## $chrome$chromedriverVersion
## [1] "93.0.4577.63 (ff5c0da2ec0adeaed5550e6c7e98417dac77d98a-refs/branch-heads/4577@{#1135})"
## 
## $chrome$userDataDir
## [1] "C:\\Users\\Zver\\AppData\\Local\\Temp\\scoped_dir14912_575245655"
## 
## 
## $`goog:chromeOptions`
## $`goog:chromeOptions`$debuggerAddress
## [1] "localhost:3895"
## 
## 
## $networkConnectionEnabled
## [1] FALSE
## 
## $pageLoadStrategy
## [1] "normal"
## 
## $platformName
## [1] "windows"
## 
## $proxy
## named list()
## 
## $setWindowRect
## [1] TRUE
## 
## $strictFileInteractability
## [1] FALSE
## 
## $timeouts
## $timeouts$implicit
## [1] 0
## 
## $timeouts$pageLoad
## [1] 300000
## 
## $timeouts$script
## [1] 30000
## 
## 
## $unhandledPromptBehavior
## [1] "dismiss and notify"
## 
## $`webauthn:extension:credBlob`
## [1] TRUE
## 
## $`webauthn:extension:largeBlob`
## [1] TRUE
## 
## $`webauthn:virtualAuthenticators`
## [1] TRUE
## 
## $webdriver.remote.sessionid
## [1] "1883ed24e7c774c8823bcb2510679d6e"
## 
## $id
## [1] "1883ed24e7c774c8823bcb2510679d6e"

Ethics & loading the data

Before further moves, I want to verify that it is okay to download the data from THE website using robotstxt package. As the return is “TRUE”, we may continue.

paths_allowed(paths = c("https://www.timeshighereducation.com/world-university-rankings/2020/world-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats"))

## [1] TRUE

The table I am interested in is in fact presented in 2 parts: ranking and scores. Also, to collect the longitudinal data, the year should be selected in the menu. To avoid writing the same massive code many times, I created 2 functions which allows to scrape each of these parts (I faced some server-related errors when tried to write a single function, so, it is separated):

scrape_THE_table_scores(year, number)
scrape_THE_table_stats(year, number),

where “year” stands for the year we want to scrape and “number” stands for the number of universities presented in the page simultaneously. As the table formatting was slightly different before 2016 (no stats part), the first argument accepts values from 2016 to 2022 (until the new year is revealed). The second argument, number, accepts values “10”, “25”, “50”, “100”, and “-1” (the last shows all the universities). Obviously, the input constitutes the links which the functions use to go to the website.

########################
##### function 1 #######
########################

scrape_THE_table_stats <- function(year, number) {
  part1 = "https://www.timeshighereducation.com/world-university-rankings/"
  part2 = "/world-ranking#!/page/0/length/"
  part3 = "/sort_by/rank/sort_order/asc/cols/"

  link1 <- str_c(part1, year, part2, number, part3, "stats")
  con$navigate(link1)
  
  table <- con$findElement(using = "id", value = "datatable-1")
  table_html <- table$getPageSource()
  table_html <- read_html(table_html[[1]])
  
  rank <- table_html %>% 
    html_nodes('.sorting_2') %>% 
    html_text()
  
  location <- table_html %>% 
    html_nodes(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "location", " " ))]//a') %>% 
    html_text()
  
  name <- table_html %>% 
    html_nodes(".ranking-institution-title") %>% 
    html_text()
  
  fte_students <- table_html %>% 
    html_nodes('td.stats_number_students') %>% 
    html_text()
  
  student_staff_ratio <- table_html %>% 
    html_nodes('td.stats_student_staff_ratio') %>% 
    html_text()
  
  intr_students <- table_html %>% 
    html_nodes('td.stats_pc_intl_students') %>% 
    html_text()
  
  
  female_male_ratio <- table_html %>% 
    html_nodes('td.stats_female_male_ratio') %>% 
    html_text()
  
  first_table <- data.frame(rank, 
                            name, 
                            location, 
                            fte_students, 
                            student_staff_ratio, 
                            intr_students, 
                            female_male_ratio)
  return(first_table)
}


########################
##### function 2 #######
########################
  
scrape_THE_table_scores <- function(year, number) {
  con$refresh()
  part1 = "https://www.timeshighereducation.com/world-university-rankings/"
  part2 = "/world-ranking#!/page/0/length/"
  part3 = "/sort_by/rank/sort_order/asc/cols/"

  link2 <- str_c(part1, year, part2, number, part3, "scores")
  con$navigate(link2)
  
  table <- con$findElement(using = "id", value = "datatable-1")
  table_html <- table$getPageSource()
  
  table_html <- read_html(table_html[[1]])
  
  rank <- table_html %>% 
    html_nodes('.sorting_2') %>% 
    html_text()
  
  name <- table_html %>% 
    html_nodes(".ranking-institution-title") %>% 
    html_text()
  
  
  overall <- table_html %>% 
    html_nodes('td.overall-score') %>% 
    html_text()
  
  teaching <- table_html %>% 
    html_nodes('td.teaching-score') %>% 
    html_text()
  
  research <- table_html %>% 
    html_nodes('td.research-score') %>% 
    html_text()
  
  citations <- table_html %>% 
    html_nodes('td.citations-score') %>% 
    html_text()
  
  industry_income <- table_html %>% 
    html_nodes('td.industry_income-score') %>% 
    html_text()
  
  international_outlook <- table_html %>% 
    html_nodes('td.international_outlook-score') %>% 
    html_text()
  
  second_table <- data.frame(rank, 
                             name,
                             overall, 
                             teaching, 
                             research, 
                             citations, 
                             industry_income, 
                             international_outlook)
  return(second_table)
}

con$refresh()

Applying the new functions to scrape the data for (1) 25 universities in (2) years 2016 and 2022:

second_table2016 <- scrape_THE_table_scores(2016, 25)
second_table2022 <- scrape_THE_table_scores(2022, 25)
first_table2016 <- scrape_THE_table_stats(2016, 25)
first_table2022 <- scrape_THE_table_stats(2022, 25)

data_2016 <- first_table2016 %>% 
    inner_join(second_table2016)

data_2022 <- first_table2022 %>% 
    inner_join(second_table2022)

data_2022[1:4,1:4] %>% 
  kable(escape = FALSE,
        caption = "The part of the data for 2022 before cleaning.") %>% 
  kable_styling()

The part of the data for 2022 before cleaning.
rank	name	location	fte_students
1	University of Oxford	United Kingdom	20,835
=2	California Institute of Technology	United States	2,233
=2	Harvard University	United States	21,574
4	Stanford University	United States	16,319

Preparations & analysis

Data cleaning

In this section, I clean the entire data for 2022 (it is not needed for 2016 as I would use only 3 variables from that data.frame, but a more gentle and precise way is to write another function for these tables). The code is annotated.

# Cleaning each column:
data_2022 = data_2022 %>% 
  mutate(rank = str_remove_all(rank, "="),
         fte_students = str_remove_all(fte_students, ","),
         intr_students = str_remove_all(intr_students, "%"),
         intr_students = as.numeric(intr_students)/100)

# Changing variables' classes:
cleaning <- data_2022 %>% 
  select(-name,
           -location,
           -female_male_ratio)

cleaning <- cleaning %>% 
  mutate_if(sapply(cleaning, 
                   is.character), as.numeric)

data_2022 <- data_2022 %>% 
  select(name, location, female_male_ratio) %>% 
  cbind(cleaning)

# Separating female_male_ratio column into 2 parts, as the format 'number:number' is not perfect. Now, there are 2 columns with each share in a range from 0 to 1. In addition, the only NA value is detected.
data_2022 <- separate(data_2022, 
                      female_male_ratio, 
                      into = c("female_share", "male_share"), 
                      sep = " : ")
data_2022[11,3:4] <- NA

data_2022$female_share = as.numeric(data_2022$female_share) / 100
data_2022$male_share = as.numeric(data_2022$male_share) / 100

data_2022[1:4, 1:4] %>% 
  kable(escape = FALSE,
        caption = "The same part of the data for 2022 after cleaning.") %>% 
  kable_styling()

The same part of the data for 2022 after cleaning.
name	location	female_share	male_share
University of Oxford	United Kingdom	0.47	0.53
California Institute of Technology	United States	0.36	0.64
Harvard University	United States	0.50	0.50
Stanford University	United States	0.46	0.54

Finally, I save the dataset that I got:

write_csv(data_2022, "THE_2022.csv")

Slopegraph

As the rankings positions may intercept, I decided to build the slopegraph using the overall score that each university has - as I understand, it somehow sums up the other scores and properties. First, I prepare the subset of 2016 data and create slope_data, a long format dataset with the columns (1) name, (2) location, (3) overall, and (4) year.

# Selecting variables for 2016 data:
data_2016 <- data_2016 %>% 
  select(name, 
         location,
         overall) %>% 
  mutate(overall = as.numeric(overall))

# Merging tables:
one <- data_2022 %>% 
  select(name,
         location,
         overall)

two <- data_2016 %>% 
  select(name,
         location,
         overall)

one$year = "2022"
two$year = "2016"

slope_data <- rbind(one, one[rep(1:25,1), ])

slope_data[26:50,1] <- data_2016$name
slope_data[26:50,2] <- data_2016$location
slope_data[26:50,3] <- data_2016$overall
slope_data[26:50,4] <- "2016"

After that operation, the data contains 50 observations. For a slopegraph, I use fancy CGPfunctions package:

# specifying colors for the universities:
cols <- c("ETH Zurich" = "#F7DD72",

          "University of Oxford" = "#5AB1BB",
          "University of Cambridge" = "#5AB1BB",
          "Imperial College London" = "#5AB1BB",
          
          "California Institute of Technology" = "#1E152A",
          "Columbia University" = "#1E152A",
          "Harvard University" = "#1E152A",
          "Johns Hopkins University" = "#1E152A",
          "Massachusetts Institute of Technology" = "#1E152A",
          "Princeton University" = "#1E152A",
          "Stanford University" = "#1E152A",
          "The University of Chicago" = "#1E152A",
          "University of California, Berkeley" = "#1E152A",
          "University of Pennsylvania" = "#1E152A",
          "Yale University" = "#1E152A")

# graph:
slope_data$year = as.character(slope_data$year)
newggslopegraph(dataframe = slope_data %>% filter(overall > 88), 
                Times = year, 
                Measurement = overall,
                Grouping = name,
                Title = "The changes in overall THE score for world top universities,",
                SubTitle = "In the period from 2016 to 2022",
                Caption = "data source: 
                www.timeshighereducation.com/world-university-rankings/",
                LineThickness = 0.4,
                LineColor = cols,
                ThemeChoice = "gdocs",
                XTextSize = 10,
                YTextSize = 2.5,
                TitleTextSize = 14,
                SubTitleTextSize = 12,
                TitleJustify = "c",
                SubTitleJustify = "c",
                CaptionTextSize = 8)

Interpretation

To make the graph more readable, I filtered overall to be higher than 88. All the universities which had such scores in 2016 persist in 2022: most of them has improved these values (the largest improvement was done by Princeton University). On the other hand, 6 new universities entered the described cohort: the largest score among these competitors is shown by the University of California (Berkeley).

The striking fact about this plot is that there is only one university outside US & UK - ETH Zurich, which has the lowest score among all the selected universities. In the total list of universities, there were not many other examples of not-US and not-UK campuses (see the table below). To note, 2 different Chinese universities entered the list in 2022: as I heard, this country becomes a real destination for the educational migration.

slope_data %>% 
  filter(location != "United States" &
           location != "United Kingdom") %>% 
  kable(escape = FALSE,
        caption = "Not-US & not-UK campuses in the sub-sample") %>% 
  kable_styling()

Not-US & not-UK campuses in the sub-sample
name	location	overall	year
ETH Zurich	Switzerland	88.2	2022
Peking University	China	87.5	2022
Tsinghua University	China	87.5	2022
University of Toronto	Canada	87.2	2022
National University of Singapore	Singapore	85.2	2022
ETH Zurich	Switzerland	88.3	2016
University of Toronto	Canada	83.9	2016

There are not many UK universities in the graph, despite UK share in the original data holds the second place. Instead, their quality is high: 2/3 of these campuses are Oxford and Cambridge. The other 11 campuses are US-based; what astonished me a lot is the geographical diversity of these American universities. Though, according to this list, California looks like the state with the largest number of the best campuses.

To conclude, the THE top world universities present a static group of primarily American (with several British) universities. The overall score considered in the analysis improved significantly in the interval between 2016 and 2022 - it probably means that even the best universities have a lot to work with. Lastly, it would be interesting to see how the picture is going to change in the upcoming decades: as mentioned, there is at least one country that aims to get in the top.

Web scraping

Arthur Pecherskikh

(last updated on 25.09.21)