Times Higher Education World University Rankings is an annual evaluation of universities’ performances conducted by the Times Higher Education (THE) journal. The ranking takes a number of indicators, including teaching, citations, the share of international students, female/male ratio, etc. Among with the World ranking, there are other lists: for example, HSE holds 301-350 place in the World University Rankings 2022, 151-175 place in the World Reputation Rankings 2020, and 57 place in the Young University Rankings 2021.
In this paper, I first get the THE data and then analyze the last-years dynamics at the top of its World Rankings.
There are 2 sources of getting THE data:
Though I focus on the second option (which is challenging as THE site presents data in the dynamic form), I want to briefly discuss the first source as well. During the lesson, we discussed scraping with XML package and readHTMLTable() function. It takes 3 lines of code and results in a data.frame where the first row reflects the column names. That is okay as the result requires some processing anyway, but it seems to me that the htmltab package is more user-friendly. The code needed when using this package is 1/3 shorter and does not penetrate into the first row. Nevertheless, htmltab() ruins the majority of the column names (due to hyperlinks) and changes the row order - still, it seems to need less processing in total.
####################
#### libraries #####
####################
# working with data:
library(tidyverse)
library(stringr)
library(readr)
library(kableExtra)
# scraping:
library(rvest)
library(htmltab)
library(httr)
library(XML)
library(RSelenium)
library(wdman)
#ethics:
library(robotstxt)
library(Rcpp)
# slopegraph:
library(CGPfunctions)
###############################
######## xml approach #########
###############################
di <- GET("https://en.wikipedia.org/wiki/Times_Higher_Education_World_University_Rankings")
doc <- readHTMLTable(doc = content(di, "text"))
parsed <- doc[[3]] # save the third list Democracy Index 2020
parsed[1:3,1:4] %>%
kable(escape = FALSE,
caption = "`XML` approach: original column names become the first row.") %>%
kable_styling()
| V1 | V2 | V3 | V4 |
|---|---|---|---|
| Institution | 2021–22[46] | 2020–21[47] | 2019–20[48] |
| University of Oxford | 1 | 1 | 1 |
| Harvard University | 2 | 3 | 7 |
###################################
######## htmltab approach #########
###################################
link <- "https://en.wikipedia.org/wiki/Times_Higher_Education_World_University_Rankings"
parsed <- htmltab(link, 3)
parsed[1:3,1:4] %>%
kable(escape = FALSE,
caption = "`htmltab` approach: the column names are ruined and row order is changed.") %>%
kable_styling()
| Institution | 2021–22 | 2020–21 | 2019–20 | |
|---|---|---|---|---|
| 2 | University of Oxford | 1 | 1 | 1 |
| 3 | Harvard University | 2 | 3 | 7 |
| 4 | California Institute of Technology | 2 | 4 | 2 |
Though the methods discussed earlier are not specific for the Wikipedia and can be used for the other pages, it is impossible to use either of them for the THE website: the result in each case contains an empty data.frame with the column names. For that reason, I decided to try using the RSelenium package - I had not experienced working with it before the task, and it proved to be a good opportunity.
I did not figure out how to launch it from RStudio, so, I describe my not-R steps in the beginning:
Now, it is time to connect the RStudio to the server:
con <- remoteDriver(remoteServerAddr = "localhost",
port = 4444,
browserName = "chrome")
con$getStatus()
## $ready
## [1] TRUE
##
## $message
## [1] "Server is running"
##
## $build
## $build$revision
## [1] "e82be7d358"
##
## $build$time
## [1] "2018-11-14T08:25:53"
##
## $build$version
## [1] "3.141.59"
##
##
## $os
## $os$arch
## [1] "amd64"
##
## $os$name
## [1] "Windows 10"
##
## $os$version
## [1] "10.0"
##
##
## $java
## $java$version
## [1] "1.8.0_281"
…and open the connection (open the new browser window managed by automated test software):
con$open()
## [1] "Connecting to remote server"
## $acceptInsecureCerts
## [1] FALSE
##
## $browserName
## [1] "chrome"
##
## $browserVersion
## [1] "94.0.4606.54"
##
## $chrome
## $chrome$chromedriverVersion
## [1] "93.0.4577.63 (ff5c0da2ec0adeaed5550e6c7e98417dac77d98a-refs/branch-heads/4577@{#1135})"
##
## $chrome$userDataDir
## [1] "C:\\Users\\Zver\\AppData\\Local\\Temp\\scoped_dir14912_575245655"
##
##
## $`goog:chromeOptions`
## $`goog:chromeOptions`$debuggerAddress
## [1] "localhost:3895"
##
##
## $networkConnectionEnabled
## [1] FALSE
##
## $pageLoadStrategy
## [1] "normal"
##
## $platformName
## [1] "windows"
##
## $proxy
## named list()
##
## $setWindowRect
## [1] TRUE
##
## $strictFileInteractability
## [1] FALSE
##
## $timeouts
## $timeouts$implicit
## [1] 0
##
## $timeouts$pageLoad
## [1] 300000
##
## $timeouts$script
## [1] 30000
##
##
## $unhandledPromptBehavior
## [1] "dismiss and notify"
##
## $`webauthn:extension:credBlob`
## [1] TRUE
##
## $`webauthn:extension:largeBlob`
## [1] TRUE
##
## $`webauthn:virtualAuthenticators`
## [1] TRUE
##
## $webdriver.remote.sessionid
## [1] "1883ed24e7c774c8823bcb2510679d6e"
##
## $id
## [1] "1883ed24e7c774c8823bcb2510679d6e"
Before further moves, I want to verify that it is okay to download the data from THE website using robotstxt package. As the return is “TRUE”, we may continue.
paths_allowed(paths = c("https://www.timeshighereducation.com/world-university-rankings/2020/world-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats"))
## [1] TRUE
The table I am interested in is in fact presented in 2 parts: ranking and scores. Also, to collect the longitudinal data, the year should be selected in the menu. To avoid writing the same massive code many times, I created 2 functions which allows to scrape each of these parts (I faced some server-related errors when tried to write a single function, so, it is separated):
scrape_THE_table_scores(year, number)scrape_THE_table_stats(year, number),where “year” stands for the year we want to scrape and “number” stands for the number of universities presented in the page simultaneously. As the table formatting was slightly different before 2016 (no stats part), the first argument accepts values from 2016 to 2022 (until the new year is revealed). The second argument, number, accepts values “10”, “25”, “50”, “100”, and “-1” (the last shows all the universities). Obviously, the input constitutes the links which the functions use to go to the website.
########################
##### function 1 #######
########################
scrape_THE_table_stats <- function(year, number) {
part1 = "https://www.timeshighereducation.com/world-university-rankings/"
part2 = "/world-ranking#!/page/0/length/"
part3 = "/sort_by/rank/sort_order/asc/cols/"
link1 <- str_c(part1, year, part2, number, part3, "stats")
con$navigate(link1)
table <- con$findElement(using = "id", value = "datatable-1")
table_html <- table$getPageSource()
table_html <- read_html(table_html[[1]])
rank <- table_html %>%
html_nodes('.sorting_2') %>%
html_text()
location <- table_html %>%
html_nodes(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "location", " " ))]//a') %>%
html_text()
name <- table_html %>%
html_nodes(".ranking-institution-title") %>%
html_text()
fte_students <- table_html %>%
html_nodes('td.stats_number_students') %>%
html_text()
student_staff_ratio <- table_html %>%
html_nodes('td.stats_student_staff_ratio') %>%
html_text()
intr_students <- table_html %>%
html_nodes('td.stats_pc_intl_students') %>%
html_text()
female_male_ratio <- table_html %>%
html_nodes('td.stats_female_male_ratio') %>%
html_text()
first_table <- data.frame(rank,
name,
location,
fte_students,
student_staff_ratio,
intr_students,
female_male_ratio)
return(first_table)
}
########################
##### function 2 #######
########################
scrape_THE_table_scores <- function(year, number) {
con$refresh()
part1 = "https://www.timeshighereducation.com/world-university-rankings/"
part2 = "/world-ranking#!/page/0/length/"
part3 = "/sort_by/rank/sort_order/asc/cols/"
link2 <- str_c(part1, year, part2, number, part3, "scores")
con$navigate(link2)
table <- con$findElement(using = "id", value = "datatable-1")
table_html <- table$getPageSource()
table_html <- read_html(table_html[[1]])
rank <- table_html %>%
html_nodes('.sorting_2') %>%
html_text()
name <- table_html %>%
html_nodes(".ranking-institution-title") %>%
html_text()
overall <- table_html %>%
html_nodes('td.overall-score') %>%
html_text()
teaching <- table_html %>%
html_nodes('td.teaching-score') %>%
html_text()
research <- table_html %>%
html_nodes('td.research-score') %>%
html_text()
citations <- table_html %>%
html_nodes('td.citations-score') %>%
html_text()
industry_income <- table_html %>%
html_nodes('td.industry_income-score') %>%
html_text()
international_outlook <- table_html %>%
html_nodes('td.international_outlook-score') %>%
html_text()
second_table <- data.frame(rank,
name,
overall,
teaching,
research,
citations,
industry_income,
international_outlook)
return(second_table)
}
con$refresh()
Applying the new functions to scrape the data for (1) 25 universities in (2) years 2016 and 2022:
second_table2016 <- scrape_THE_table_scores(2016, 25)
second_table2022 <- scrape_THE_table_scores(2022, 25)
first_table2016 <- scrape_THE_table_stats(2016, 25)
first_table2022 <- scrape_THE_table_stats(2022, 25)
data_2016 <- first_table2016 %>%
inner_join(second_table2016)
data_2022 <- first_table2022 %>%
inner_join(second_table2022)
data_2022[1:4,1:4] %>%
kable(escape = FALSE,
caption = "The part of the data for 2022 before cleaning.") %>%
kable_styling()
| rank | name | location | fte_students |
|---|---|---|---|
| 1 | University of Oxford | United Kingdom | 20,835 |
| =2 | California Institute of Technology | United States | 2,233 |
| =2 | Harvard University | United States | 21,574 |
| 4 | Stanford University | United States | 16,319 |
In this section, I clean the entire data for 2022 (it is not needed for 2016 as I would use only 3 variables from that data.frame, but a more gentle and precise way is to write another function for these tables). The code is annotated.
# Cleaning each column:
data_2022 = data_2022 %>%
mutate(rank = str_remove_all(rank, "="),
fte_students = str_remove_all(fte_students, ","),
intr_students = str_remove_all(intr_students, "%"),
intr_students = as.numeric(intr_students)/100)
# Changing variables' classes:
cleaning <- data_2022 %>%
select(-name,
-location,
-female_male_ratio)
cleaning <- cleaning %>%
mutate_if(sapply(cleaning,
is.character), as.numeric)
data_2022 <- data_2022 %>%
select(name, location, female_male_ratio) %>%
cbind(cleaning)
# Separating female_male_ratio column into 2 parts, as the format 'number:number' is not perfect. Now, there are 2 columns with each share in a range from 0 to 1. In addition, the only NA value is detected.
data_2022 <- separate(data_2022,
female_male_ratio,
into = c("female_share", "male_share"),
sep = " : ")
data_2022[11,3:4] <- NA
data_2022$female_share = as.numeric(data_2022$female_share) / 100
data_2022$male_share = as.numeric(data_2022$male_share) / 100
data_2022[1:4, 1:4] %>%
kable(escape = FALSE,
caption = "The same part of the data for 2022 after cleaning.") %>%
kable_styling()
| name | location | female_share | male_share |
|---|---|---|---|
| University of Oxford | United Kingdom | 0.47 | 0.53 |
| California Institute of Technology | United States | 0.36 | 0.64 |
| Harvard University | United States | 0.50 | 0.50 |
| Stanford University | United States | 0.46 | 0.54 |
Finally, I save the dataset that I got:
write_csv(data_2022, "THE_2022.csv")
As the rankings positions may intercept, I decided to build the slopegraph using the overall score that each university has - as I understand, it somehow sums up the other scores and properties. First, I prepare the subset of 2016 data and create slope_data, a long format dataset with the columns (1) name, (2) location, (3) overall, and (4) year.
# Selecting variables for 2016 data:
data_2016 <- data_2016 %>%
select(name,
location,
overall) %>%
mutate(overall = as.numeric(overall))
# Merging tables:
one <- data_2022 %>%
select(name,
location,
overall)
two <- data_2016 %>%
select(name,
location,
overall)
one$year = "2022"
two$year = "2016"
slope_data <- rbind(one, one[rep(1:25,1), ])
slope_data[26:50,1] <- data_2016$name
slope_data[26:50,2] <- data_2016$location
slope_data[26:50,3] <- data_2016$overall
slope_data[26:50,4] <- "2016"
After that operation, the data contains 50 observations. For a slopegraph, I use fancy CGPfunctions package:
# specifying colors for the universities:
cols <- c("ETH Zurich" = "#F7DD72",
"University of Oxford" = "#5AB1BB",
"University of Cambridge" = "#5AB1BB",
"Imperial College London" = "#5AB1BB",
"California Institute of Technology" = "#1E152A",
"Columbia University" = "#1E152A",
"Harvard University" = "#1E152A",
"Johns Hopkins University" = "#1E152A",
"Massachusetts Institute of Technology" = "#1E152A",
"Princeton University" = "#1E152A",
"Stanford University" = "#1E152A",
"The University of Chicago" = "#1E152A",
"University of California, Berkeley" = "#1E152A",
"University of Pennsylvania" = "#1E152A",
"Yale University" = "#1E152A")
# graph:
slope_data$year = as.character(slope_data$year)
newggslopegraph(dataframe = slope_data %>% filter(overall > 88),
Times = year,
Measurement = overall,
Grouping = name,
Title = "The changes in overall THE score for world top universities,",
SubTitle = "In the period from 2016 to 2022",
Caption = "data source:
www.timeshighereducation.com/world-university-rankings/",
LineThickness = 0.4,
LineColor = cols,
ThemeChoice = "gdocs",
XTextSize = 10,
YTextSize = 2.5,
TitleTextSize = 14,
SubTitleTextSize = 12,
TitleJustify = "c",
SubTitleJustify = "c",
CaptionTextSize = 8)
To make the graph more readable, I filtered overall to be higher than 88. All the universities which had such scores in 2016 persist in 2022: most of them has improved these values (the largest improvement was done by Princeton University). On the other hand, 6 new universities entered the described cohort: the largest score among these competitors is shown by the University of California (Berkeley).
The striking fact about this plot is that there is only one university outside US & UK - ETH Zurich, which has the lowest score among all the selected universities. In the total list of universities, there were not many other examples of not-US and not-UK campuses (see the table below). To note, 2 different Chinese universities entered the list in 2022: as I heard, this country becomes a real destination for the educational migration.
slope_data %>%
filter(location != "United States" &
location != "United Kingdom") %>%
kable(escape = FALSE,
caption = "Not-US & not-UK campuses in the sub-sample") %>%
kable_styling()
| name | location | overall | year |
|---|---|---|---|
| ETH Zurich | Switzerland | 88.2 | 2022 |
| Peking University | China | 87.5 | 2022 |
| Tsinghua University | China | 87.5 | 2022 |
| University of Toronto | Canada | 87.2 | 2022 |
| National University of Singapore | Singapore | 85.2 | 2022 |
| ETH Zurich | Switzerland | 88.3 | 2016 |
| University of Toronto | Canada | 83.9 | 2016 |
There are not many UK universities in the graph, despite UK share in the original data holds the second place. Instead, their quality is high: 2/3 of these campuses are Oxford and Cambridge. The other 11 campuses are US-based; what astonished me a lot is the geographical diversity of these American universities. Though, according to this list, California looks like the state with the largest number of the best campuses.
To conclude, the THE top world universities present a static group of primarily American (with several British) universities. The overall score considered in the analysis improved significantly in the interval between 2016 and 2022 - it probably means that even the best universities have a lot to work with. Lastly, it would be interesting to see how the picture is going to change in the upcoming decades: as mentioned, there is at least one country that aims to get in the top.