Web scraping is a technique used to automatically extract large amounts of data from websites. In this project, we scrape the Times Higher Education World University Rankings from Wikipedia to analyze and visualize changes in university rankings over time.
By scraping this data, we can obtain up-to-date information in a structured form without manual copying. The scraped data can then be used for various analyses, such as identifying trends in university performance and comparing how different institutions have evolved over the years.
The script starts by fetching data from a webpage using the
httr
package. Specifically, it targets the Times Higher
Education World University Rankings page on Wikipedia.
library(httr)
library(ggplot2)
library(kableExtra)
library(ggrepel)
library(XML)
di <- GET("https://en.wikipedia.org/wiki/Times_Higher_Education_World_University_Rankings")
di
## Response [https://en.wikipedia.org/wiki/Times_Higher_Education_World_University_Rankings]
## Date: 2024-10-24 00:37
## Status: 200
## Content-Type: text/html; charset=UTF-8
## Size: 307 kB
## <!DOCTYPE html>
## <html class="client-nojs vector-feature-language-in-header-enabled vector-fea...
## <head>
## <meta charset="UTF-8">
## <title>Times Higher Education World University Rankings - Wikipedia</title>
## <script>(function(){var className="client-js vector-feature-language-in-heade...
## "wgMonthNames":["","January","February","March","April","May","June","July","...
## "2010 introductions"],"wgPageViewLanguage":"en","wgPageContentLanguage":"en",...
## "wgULSisLanguageSelectorEmpty":false,"wgWikibaseItemId":"Q1318318","wgCheckUs...
## "site","mediawiki.page.ready","jquery.tablesorter","jquery.makeCollapsible","...
## ...
The content from the scraped Wikipedia page is read as an HTML table using the XML package.
The structure of the parsed table (doc) is then displayed to understand how the data is organized.
A specific table (the third one in this case) is extracted from the parsed data and stored as UnivRank.
doc <- XML::readHTMLTable(
doc = content(di, "text"))
str(doc)
## List of 7
## $ Times Higher Education World University Rankings :'data.frame': 9 obs. of 2 variables:
## ..$ V1: chr [1:9] "" "Editor" "Categories" "Frequency" ...
## ..$ V2: chr [1:9] NA "Phil Baty" "Higher education" "Annual" ...
## $ NULL :'data.frame': 6 obs. of 3 variables:
## ..$ V1: chr [1:6] "Overall indicator" "Industry Income – innovation" "International diversity (currently: International outlook (staff, students, research))" "Teaching – the learning environment" ...
## ..$ V2: chr [1:6] "Individual indicator" "Research income from industry (per academic staff)" "Ratio of international to domestic staff\nRatio of international to domestic students\nInternational Collaborat"| __truncated__ "Reputational survey (teaching)\nPhDs awards per academic\nUndergrad. admitted per academic\nIncome per academic"| __truncated__ ...
## ..$ V3: chr [1:6] "Percentage weighting" "2.5%" "3% (2.5% as for 2022–23)\n2% (2.5% as for 2022–23)\n2.5% (as for 2022–23)" "15%\n6%\n4.5%\n2.25%\n2.25%" ...
## $ Times Higher Education World University Rankings—Top 10[Note 1] :'data.frame': 11 obs. of 13 variables:
## ..$ V1 : chr [1:11] "Institution" "University of Oxford" "Stanford University" "Massachusetts Institute of Technology" ...
## ..$ V2 : chr [1:11] "2024[48]" "1" "2" "3" ...
## ..$ V3 : chr [1:11] "2023[49]" "1" "4" "5" ...
## ..$ V4 : chr [1:11] "2022[50]" "1" "4" "5" ...
## ..$ V5 : chr [1:11] "2021[51]" "1" "2" "5" ...
## ..$ V6 : chr [1:11] "2020[52]" "1" "4" "5" ...
## ..$ V7 : chr [1:11] "2019[53]" "1" "3" "4" ...
## ..$ V8 : chr [1:11] "2018[54]" "1" "3" "5" ...
## ..$ V9 : chr [1:11] "2017[55]" "1" "3" "5" ...
## ..$ V10: chr [1:11] "2016[56]" "2" "3" "5" ...
## ..$ V11: chr [1:11] "2015[57]" "3" "4" "6" ...
## ..$ V12: chr [1:11] "2014[58]" "2" "4" "5" ...
## ..$ V13: chr [1:11] "2013[59]" "2" "3" "5" ...
## $ Times Higher Education World Reputation Rankings—Top 10[Note 1] :'data.frame': 11 obs. of 14 variables:
## ..$ V1 : chr [1:11] "Institution" "Harvard University" "Massachusetts Institute of Technology" "Stanford University" ...
## ..$ V2 : chr [1:11] "2023[67]" "1" "2" "3" ...
## ..$ V3 : chr [1:11] "2022[68]" "1" "2" "3" ...
## ..$ V4 : chr [1:11] "2021[69]" "1" "2" "4" ...
## ..$ V5 : chr [1:11] "2020[70]" "1" "2" "3" ...
## ..$ V6 : chr [1:11] "2019[71]" "1" "2" "3" ...
## ..$ V7 : chr [1:11] "2018[72]" "1" "2" "3" ...
## ..$ V8 : chr [1:11] "2017[73]" "1" "2" "3" ...
## ..$ V9 : chr [1:11] "2016[74]" "1" "2" "3" ...
## ..$ V10: chr [1:11] "2015[75]" "1" "4" "5" ...
## ..$ V11: chr [1:11] "2014[76]" "1" "2" "3" ...
## ..$ V12: chr [1:11] "2013[77]" "1" "2" "6" ...
## ..$ V13: chr [1:11] "2012[78]" "1" "2" "4" ...
## ..$ V14: chr [1:11] "2011[79]" "1" "2" "5" ...
## $ Times Higher Education Asia University Rankings as shown below – Top 10[Note 1]:'data.frame': 11 obs. of 13 variables:
## ..$ V1 : chr [1:11] "Institution" "Tsinghua University" "Peking University" "National University of Singapore" ...
## ..$ V2 : chr [1:11] "2024" "1" "2" "3" ...
## ..$ V3 : chr [1:11] "2023[81]" "1" "2" "3" ...
## ..$ V4 : chr [1:11] "2022[82]" "1" "2" "3" ...
## ..$ V5 : chr [1:11] "2021[83]" "1" "2" "3" ...
## ..$ V6 : chr [1:11] "2020[84]" "1" "2" "3" ...
## ..$ V7 : chr [1:11] "2019[85]" "1" "5" "2" ...
## ..$ V8 : chr [1:11] "2018[86]" "2" "3" "1" ...
## ..$ V9 : chr [1:11] "2017[87]" "3" "2" "1" ...
## ..$ V10: chr [1:11] "2016[80]" "5" "2" "1" ...
## ..$ V11: chr [1:11] "2015[88]" "5" "4" "2" ...
## ..$ V12: chr [1:11] "2014[89]" "6" "5" "2" ...
## ..$ V13: chr [1:11] "2013[90]" "6" "4" "2" ...
## $ Times Higher Education BRICS & Emerging Economies Rankings – Top 10[Note 1] :'data.frame': 11 obs. of 10 variables:
## ..$ V1 : chr [1:11] "Institution" "Peking University" "Tsinghua University" "Zhejiang University" ...
## ..$ V2 : chr [1:11] "2022[91]" "1" "2" "3" ...
## ..$ V3 : chr [1:11] "2021[92]" "2" "1" "3" ...
## ..$ V4 : chr [1:11] "2020[93]" "2" "1" "3" ...
## ..$ V5 : chr [1:11] "2019[94]" "2" "1" "3" ...
## ..$ V6 : chr [1:11] "2018[95]" "1" "2" "6" ...
## ..$ V7 : chr [1:11] "2017[96]" "1" "2" "9" ...
## ..$ V8 : chr [1:11] "2016[97]" "1" "2" "8" ...
## ..$ V9 : chr [1:11] "2015[98]" "1" "2" "21" ...
## ..$ V10: chr [1:11] "2014[99]" "1" "2" "22" ...
## $ NULL :'data.frame': 5 obs. of 2 variables:
## ..$ V1: chr [1:5] ".mw-parser-output .navbar{display:inline;font-size:88%;font-weight:normal}.mw-parser-output .navbar-collapse{fl"| __truncated__ "Global" "Regional" "National" ...
## ..$ V2: chr [1:5] NA "Academic Ranking of World Universities\nCenter for World University Rankings\nCWTS Leiden Ranking\nGlobal Unive"| __truncated__ "European Union\nLatin America\nASEAN" "Australia\nExcellence in Research for Australia\nBrazil\nCanada\nMaclean's\nChina\nBCUR\nWu Shulian\nCUAA\nNetb"| __truncated__ ...
UnivRank <- doc[[3]]
The first 12 rows of the data are displayed in a nicely formatted table using kable and kableExtra.
The last rows of the data are also shown using tail() to inspect the structure.
UnivRank[1:11, ] %>%
kable() %>%
kable_styling()
V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Institution | 2024[48] | 2023[49] | 2022[50] | 2021[51] | 2020[52] | 2019[53] | 2018[54] | 2017[55] | 2016[56] | 2015[57] | 2014[58] | 2013[59] |
University of Oxford | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 3 | 2 | 2 |
Stanford University | 2 | 4 | 4 | 2 | 4 | 3 | 3 | 3 | 3 | 4 | 4 | 3 |
Massachusetts Institute of Technology | 3 | 5 | 5 | 5 | 5 | 4 | 5 | 5 | 5 | 6 | 5 | 5 |
Harvard University | 4 | 2 | 2 | 3 | 7 | 6 | 6 | 6 | 6 | 2 | 2 | 4 |
University of Cambridge | 5 | 3 | 5 | 6 | 3 | 2 | 2 | 4 | 4 | 5 | 7 | 7 |
Princeton University | 6 | 7 | 7 | 9 | 6 | 7 | 7 | 7 | 7 | 7 | 6 | 6 |
California Institute of Technology | 7 | 6 | 2 | 4 | 2 | 5 | 3 | 2 | 1 | 1 | 1 | 1 |
Imperial College London | 8 | 10 | 12 | 11 | 10 | 9 | 8 | 8 | 8 | 9 | 10 | 8 |
University of California, Berkeley | 9 | 8 | 8 | 7 | 13 | 15 | 18 | 10 | 13 | 8 | 8 | 9 |
Yale University | 10 | 9 | 9 | 8 | 8 | 8 | 12 | 12 | 12 | 9 | 11 | 11 |
The column names of the university rankings data are updated to reflect the years (from 2013 to 2024).
The first row, which might contain headers, is removed, and the row names are reset.
The script then shows the first 10 rows of the cleaned data.
# Renaming the columns of UnivRank
names(UnivRank) <- c("Institution", "2013", "2014", "2015", "2016", "2017", "2018", "2019", "2020", "2021", "2022", "2023", "2024")
# Removing the first row (usually a header row) if needed
UnivRank <- UnivRank[-1, ]
# Resetting row names
rownames(UnivRank) <- NULL
library(DT)
# Displaying the interactive datatable
DT::datatable(UnivRank)
write.csv(UnivRank, file = "Uni-ranks.csv", row.names = FALSE)
data.frame <- read.csv("Uni-ranks.csv", row.names = NULL, header=T,check.names=F)
head(data.frame, 20)
## Institution 2013 2014 2015 2016 2017 2018 2019
## 1 University of Oxford 1 1 1 1 1 1 1
## 2 Stanford University 2 4 4 2 4 3 3
## 3 Massachusetts Institute of Technology 3 5 5 5 5 4 5
## 4 Harvard University 4 2 2 3 7 6 6
## 5 University of Cambridge 5 3 5 6 3 2 2
## 6 Princeton University 6 7 7 9 6 7 7
## 7 California Institute of Technology 7 6 2 4 2 5 3
## 8 Imperial College London 8 10 12 11 10 9 8
## 9 University of California, Berkeley 9 8 8 7 13 15 18
## 10 Yale University 10 9 9 8 8 8 12
## 2020 2021 2022 2023 2024
## 1 1 2 3 2 2
## 2 3 3 4 4 3
## 3 5 5 6 5 5
## 4 6 6 2 2 4
## 5 4 4 5 7 7
## 6 7 7 7 6 6
## 7 2 1 1 1 1
## 8 8 8 9 10 8
## 9 10 13 8 8 9
## 10 12 12 9 11 11
str(data.frame)
## 'data.frame': 10 obs. of 13 variables:
## $ Institution: chr "University of Oxford" "Stanford University" "Massachusetts Institute of Technology" "Harvard University" ...
## $ 2013 : int 1 2 3 4 5 6 7 8 9 10
## $ 2014 : int 1 4 5 2 3 7 6 10 8 9
## $ 2015 : int 1 4 5 2 5 7 2 12 8 9
## $ 2016 : int 1 2 5 3 6 9 4 11 7 8
## $ 2017 : int 1 4 5 7 3 6 2 10 13 8
## $ 2018 : int 1 3 4 6 2 7 5 9 15 8
## $ 2019 : int 1 3 5 6 2 7 3 8 18 12
## $ 2020 : int 1 3 5 6 4 7 2 8 10 12
## $ 2021 : int 2 3 5 6 4 7 1 8 13 12
## $ 2022 : int 3 4 6 2 5 7 1 9 8 9
## $ 2023 : int 2 4 5 2 7 6 1 10 8 11
## $ 2024 : int 2 3 5 4 7 6 1 8 9 11
The data is filtered to include only the ranking data for the years 2013 and 2024.
These rankings are converted to numeric format, and the universities are arranged based on their 2024 ranking.
The top 10 universities in 2024 are selected for further visualization.
data.frame1= data.frame %>% dplyr::select(Institution, `2013`,`2024` )
data.frame1$`2013`=as.numeric(as.character(data.frame1$`2013`))
data.frame1$`2024`=as.numeric(as.character(data.frame1$`2024`))
data.frame1 = data.frame1 %>% dplyr::arrange((`2024`))
data.frame1.1 = head(data.frame1, 10)
Using reshape2, the data is reshaped to a long format for easy plotting. Each university’s rank in 2013 and 2024 is recorded, and the year and rank columns are converted into appropriate formats.
library(reshape2)
d <- melt(data.frame1.1, id.vars = 'Institution')
d$year = d$variable
d$rang = d$value
data.frame2=d %>% dplyr::select(Institution, rang, year)
data.frame2$rang = as.numeric(as.character(data.frame2$rang))
data.frame2$year = as.factor(data.frame2$year)
A custom theme is defined to format the slopegraph. The theme removes grid lines, axis titles, and adds some custom formatting for the plot text.
A slopegraph is plotted using ggplot2. It visualizes the changes in rankings of the top 10 universities between 2013 and 2024.
Text labels are added to the start (2013) and end (2024) of the slope lines to indicate which institution each line represents.
MySlope <- list(
scale_x_discrete(position = "top"),
theme_bw(),
theme(legend.position = "none"),
theme(panel.border = element_blank()),
theme(axis.title.y = element_blank()),
theme(axis.text.y = element_blank()),
theme(panel.grid.major.y = element_blank()),
theme(panel.grid.minor.y = element_blank()),
theme(axis.title.x = element_blank()),
theme(panel.grid.major.x = element_blank()),
theme(axis.text.x.top = element_text(size=12)),
theme(axis.ticks = element_blank()),
theme(plot.title = element_text(size=14, face = "bold", hjust = 0.5)),
theme(plot.subtitle = element_text(hjust = 0.5))
)
ggplot(data = data.frame2, aes(x = year, y = -rang, group = Institution)) +
geom_line(aes(color = Institution, alpha = 1), size = 1) +
geom_text_repel(data = data.frame2 %>% dplyr::filter(year == "2013"),
aes(label = Institution) ,
hjust = "left",
fontface = "bold",
size = 3,
nudge_x = -.45,
direction = "y") +
geom_text_repel(data = data.frame2 %>% dplyr::filter(year == "2024"),
aes(label = Institution) ,
hjust = "right",
fontface = "bold",
size = 3,
nudge_x = .5,
direction = "y") +
geom_label(aes(label = rang),
size = 2.5,
label.padding = unit(0.05, "lines"),
label.size = 0.0) +
MySlope +
labs(
title = "Trends in rankings for universities 2013-2024",
subtitle = " "
)
Welfare Statistics Directorate, BPS, saptahas@bps.go.id