Introduction

Welcome to the unknown! This report’s agenda is to cluster universities present in the Times Higher Education (THE) World University Rankings 2021 (https://www.timeshighereducation.com/world-university-rankings/2021/world-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/scores). The peculiarity of this work is that its target audience is applicants around the world who are considering entering one or several of the listed universities at some point. Accordingly, careful attention should be paid to which variables are selected for clustering, viewed from the applicant’s perspective and the primary factors influencing the decision to choose a university.

Data description

The data were retrieved directly from the THE website (credits to Ms Anna Shirokanova, https://rpubs.com/shirokaner, for the scraping code) and have the following structure and set of variables.

library(tidyverse)
library(stringr)
library(knitr)
library(purrr)
library(V8)
library(jsonlite)
library(psych)
library(kableExtra)
library(dplyr)
library(polycor)
library(ggcorrplot)
library(factoextra)
library(sjPlot)
library(cluster)

college_json2021 <- fromJSON(paste0(
  'https://www.timeshighereducation.com/sites/default/files/the_data_rankings/', 
  'world_university_rankings_2021_0__fa224219a267a5b9c4287386a97c70ea.json'))

college_2021 <- college_json2021$data

# extract the female share as the first number of the "46 : 54" ratio string
college_2021 <- college_2021 %>% 
  mutate(stats_female_share = as.numeric(str_match(stats_female_male_ratio, "^\\d{1,2}")))

# strip the "%" sign so the share can be treated as numeric
college_2021 <- college_2021 %>% 
  mutate(stats_pc_intl_students = as.numeric(str_replace(stats_pc_intl_students, "%", "")))

# drop ALL thousands separators (str_replace would only remove the first comma)
college_2021 <- college_2021 %>% 
  mutate(stats_number_students = str_replace_all(stats_number_students, ",", ""))

# flag whether a university provides an application link
college_2021 <- college_2021 %>% 
  mutate(ref_link = !is.na(apply_link))

df <- college_2021
tofa <- c("rank", 
          "scores_overall",
          "scores_overall_rank",
          "scores_teaching_rank",
          "scores_research_rank",
          "scores_citations_rank",
          "scores_industry_income_rank",
          "scores_international_outlook_rank",
          "record_type",
          "member_level",
          "location",
          "nid")

# coerce rank and administrative variables to factors
for (i in tofa){
  df[, i] <- as.factor(df[, i])
}

tonu <- c("scores_teaching",
          "scores_research",
          "scores_citations",
          "scores_industry_income",
          "scores_international_outlook",
          "stats_number_students",
          "stats_student_staff_ratio",
          "stats_pc_intl_students",
          "stats_female_share")
# coerce score and statistics variables to numeric
for (i in tonu){
  df[, i] <- as.numeric(df[, i])
}
options(scipen = 999)
glimpse(df)
## Rows: 1,526
## Columns: 30
## $ rank_order                        <chr> "10", "20", "30", "40", "50", "60...
## $ rank                              <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11...
## $ name                              <chr> "University of Oxford", "Stanford...
## $ scores_overall                    <fct> 95.6, 94.9, 94.8, 94.5, 94.4, 94....
## $ scores_overall_rank               <fct> 10, 20, 30, 40, 50, 60, 70, 80, 9...
## $ scores_teaching                   <dbl> 91.3, 92.2, 94.4, 92.5, 90.7, 90....
## $ scores_teaching_rank              <fct> 5, 3, 1, 2, 6, 7, 13, 4, 10, 9, 1...
## $ scores_research                   <dbl> 99.6, 96.7, 98.8, 96.9, 94.4, 99....
## $ scores_research_rank              <fct> 1, 6, 3, 5, 8, 2, 4, 9, 10, 17, 2...
## $ scores_citations                  <dbl> 98.0, 99.9, 99.4, 97.0, 99.7, 95....
## $ scores_citations_rank             <fct> 37, 14, 22, 49, 17, 70, 25, 38, 2...
## $ scores_industry_income            <dbl> 68.7, 90.1, 46.8, 92.7, 90.4, 52....
## $ scores_industry_income_rank       <fct> 189, 75, 421, 67, 73, 343, 104, 2...
## $ scores_international_outlook      <dbl> 96.4, 79.5, 77.7, 83.6, 90.0, 95....
## $ scores_international_outlook_rank <fct> 24, 200, 218, 158, 96, 31, 275, 3...
## $ record_type                       <fct> master_account, private, private,...
## $ member_level                      <fct> 0, 11, 0, 0, 0, 0, 0, 0, 0, 11, 0...
## $ url                               <chr> "/world-university-rankings/unive...
## $ nid                               <fct> 468, 467, 466, 128779, 471, 470, ...
## $ location                          <fct> United Kingdom, United States, Un...
## $ stats_number_students             <dbl> 20774, 16223, 21261, 2238, 11276,...
## $ stats_student_staff_ratio         <dbl> 11.1, 7.4, 9.3, 6.3, 8.4, 11.0, 1...
## $ stats_pc_intl_students            <dbl> 41, 23, 25, 33, 34, 38, 17, 20, 2...
## $ stats_female_male_ratio           <chr> "46 : 54", "44 : 56", "49 : 51", ...
## $ aliases                           <chr> "University of Oxford", "Stanford...
## $ subjects_offered                  <chr> "Geography,Medicine & Dentistry,B...
## $ closed                            <lgl> FALSE, FALSE, FALSE, FALSE, FALSE...
## $ apply_link                        <chr> "https://www.timeshighereducation...
## $ stats_female_share                <dbl> 46, 44, 49, 36, 39, 47, 51, 50, 4...
## $ ref_link                          <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRU...

30 variables and 1,526 cases in total. As for the variables present, there are the university’s official name and its alias (name, aliases). Then, there are variables assessing some aspect of university output, each coupled with its ranked counterpart. These include the university rank position itself (rank, rank_order), scores assessing the learning environment (scores_teaching, scores_teaching_rank), research volume, income and reputation (scores_research, scores_research_rank), research influence (scores_citations, scores_citations_rank), innovation (scores_industry_income, scores_industry_income_rank), international diversity (scores_international_outlook, scores_international_outlook_rank), and the overall score (scores_overall, scores_overall_rank).

There are variables indicating where a university is located, though only at the country level (location), and what subjects it offers (subjects_offered), plus a group of variables measuring the number of students, the student-to-staff ratio, the share of international students, and the female-to-male ratio (stats_number_students, stats_student_staff_ratio, stats_pc_intl_students, stats_female_male_ratio).

There are also variables which can be called administrative, such as the type of record in the ranking table (record_type), a university’s membership level (member_level), its URL (url), a link for an application (apply_link), a reference-link flag (ref_link), a numeric identifier (nid), and whether a university is closed (closed). I assume these variables are not of crucial importance when choosing a university, so they can very well be excluded from further analysis.

But first, let’s take a look at the descriptive stats for all of them:

describe(df) 

There is no way to draw conclusions about the categorical variables from these descriptives: for factors, describe() only reports numeric level codes. That is also why scores_overall, stored as a factor here, shows a mean of 11.3 against a maximum of 142 - these are level codes, not scores. As for the numeric variables, our attention can be drawn to the mean, minimum, and maximum values. Scores for citations, international outlook and industry income are, so to say, dispersed in a balanced manner: their mean values are 48, 47, and 46, respectively, which is very close to the center of the 0-100 interval.
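To keep the table focused on what is actually interpretable, the same call can be restricted to the numeric columns only, reusing the tonu vector defined above:

# descriptives for the numeric variables only; the factor rows above are just level codes
describe(df[, tonu])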

The maximum value of the female students’ share is 98%, which becomes kinda interesting when I come to think that there is a university (or universities) where 98% of the students are female. And it is nice to see that the mean value of this variable is almost 50%, which means that, on average, the proportions of female and male students in the listed universities are equal. The situation is worse for the share of international students - on average, only 11% of students in the listed universities are international, with a minimum of 0%. Sad. Finally, let’s look at the number-of-students statistics. The minimum value here is 557 people - that’s a bit more than one Russian school has! My best guess is that it is a private university.
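Out of curiosity, the universities sitting at these extremes can be looked up directly. A quick sketch (slice_max()/slice_min() require dplyr >= 1.0; output omitted here):

# which university (or universities, ties included) reports the 98% female share?
df %>% 
  slice_max(stats_female_share, n = 1) %>% 
  select(name, location, stats_female_share)

# and the smallest university by number of students
df %>% 
  slice_min(stats_number_students, n = 1) %>% 
  select(name, location, stats_number_students)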

Additionally, I had an urge to see if there are groups of highly correlated variables, so it would be easier for me to select variables for the cluster analysis. Unfortunately, some variables are categorical but cannot be considered ordinal, which makes it impossible to construct a correlation matrix even for the mighty hetcor() function with its polyserial and polychoric correlations. Instead, I will introduce a subset of numerically coded variables with a heatmap of the correlations between them. I will include the rank variables coerced to numeric class and the subjects offered by a university transformed with one-hot encoding into binary variables.

The latter will look like the following:

df_dummy <- df %>% 
  mutate(subjects_offered=str_split(subjects_offered, ",")) %>% 
  unnest(cols = c(subjects_offered)) %>% 
  mutate(dummy=1) %>% 
  spread(subjects_offered, dummy, fill=0)

head(df_dummy[, 30:65], 10)
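As a side note, spread() is superseded in newer versions of tidyr; assuming tidyr >= 1.1, an equivalent sketch with pivot_wider() would be:

# same one-hot encoding, pivot_wider() flavor
df %>% 
  mutate(subjects_offered = str_split(subjects_offered, ",")) %>% 
  unnest(cols = c(subjects_offered)) %>% 
  mutate(dummy = 1) %>% 
  pivot_wider(names_from = subjects_offered, values_from = dummy, values_fill = 0)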

And the correlation heatmap (or mosaic) itself:

df_4cor <- df_dummy[,c(2, 4:15, 21:23, 28, 30:65)]
# note: as.numeric() on factors returns level codes, a rough proxy for the rank variables
df_4cor[, 1:17] <- sapply(df_4cor[, 1:17], as.numeric)

het <- hetcor(df_4cor)
cors <- round(het$correlations, 2)
ggcorrplot(cors, 
           hc.order = TRUE, 
           type = "upper", 
           show.diag    = F, 
           tl.cex = 9, 
           tl.col = "black", 
           ggtheme = ggplot2::theme_bw, 
           colors =c("cyan", "white", "red"))

Okay, let’s take it from the bottom part of the diagonal, which leads us through the darkest colored, i.e., most correlated, pairs of variables. There is a triangle of three highly correlated variables: rank, citations score, and overall score. Next, research, citations, teaching, the share of international students, and the overall score represent another highly correlated subgroup on the plot. The triangle above the latter ones represents Engineering subjects offered in a university, which are, as expected, correlated. Another similar correlation group is subjects related to Natural Sciences (like Geology, Biology, Marine sciences, Physics, etc.). Lastly, there is a quite big reddish area of correlated subjects represented by Social Sciences and Humanities, Art included.

It would be really nice to run PCA or EFA on the offered subjects converted to binary variables and eventually shrink the numerous subjects into fewer, more encompassing variables. But that is out of the scope of this report. Now, let’s do some clustering!

Clustering

My first thought about what can influence an applicant’s choice of a university immediately brought me to the offered subjects. When I was looking for a university myself, the availability of the programs I wanted to enroll in was the primary filter for further digging. After that, my attention was drawn to other aspects, such as the syllabus, scholarships, a university’s reputation, etc.

In their study “Influential factors in choosing higher education institution: A case study of a private university in Surabaya” (2015), Proboyo and Soedarsono conclude that the three most crucial attributes of institutions relevant to students when choosing a higher education organization are reputation, values, and the success of the alumni - three because the respondents were asked to select exactly three characteristics from the offered list.

Although the values of a university and the success of its alumni cannot be traced with the data available in our case, reputation is, I would guess, formed via the scores variables and a university’s rank. And, as the separate scores were highly correlated with the overall score, I am going to select only the latter for the cluster analysis, as the one representing the scores attributes. The rank still seems important to me, so I feel like I need to include it. Finally, the offered subjects will also be included, for the reasons I mentioned earlier. I am not going to include the student statistics and the student-staff ratio, as I do not consider them important for the applicant at the stage of choosing a university in general, single cases excepted. The share of international students might be somewhat relevant, perhaps - but it is highly correlated with the overall score, which means I can take just one of the two.

Thus, the set of variables for clustering includes mixed data types: several of them (rank position and location) are categorical, and the rest (the overall score and the binary coded subjects) are numerical, which implies several difficulties with the existing clustering methods.
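A common workaround for mixed-type data, which I rely on below, is Gower’s dissimilarity as implemented in cluster::daisy(): numeric columns are range-normalized, while factor columns contribute simple matching. A minimal toy sketch on made-up data:

# toy mixed-type data: Gower treats the numeric and factor columns differently
toy <- data.frame(
  score   = c(90, 40, 85),
  country = factor(c("UK", "JP", "UK")),
  stem    = c(1, 1, 0)
)
daisy(toy, metric = "gower")  # pairwise dissimilarities in [0, 1]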

Partitioning around medoids (PAM)

The first method I am going to try is Partitioning Around Medoids (PAM) clustering. It is an analogue of k-means clustering which uses so-called medoids as the centers of the tentatively defined clusters. Within a cluster, the medoid is the data point whose total dissimilarity to all other points in the cluster is minimal.
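To make the “minimal total dissimilarity” idea concrete, here is a by-hand sketch on made-up points:

# the medoid is the observation with the smallest total dissimilarity to all others
set.seed(42)
pts <- matrix(rnorm(10), ncol = 2)  # 5 toy points in 2D
d   <- as.matrix(dist(pts))
which.min(rowSums(d))               # row index of the medoid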

Choosing number of clusters

As in any other clustering algorithm, the number of clusters is initially unknown and should be determined either based on some theory or with the help of specific tests. I have no theory, so I am going to rely on the “elbow” method and the “silhouette” method. The first one uses the within-cluster sum of squares to the cluster centroid, or, in our case, medoid, to suggest the optimal number of clusters: the solution can be found where the line drops sharply, forming an elbow. The second method determines how well a data point fits into a particular cluster, with high average silhouette width values indicating a good clusters-number solution.

uni_mix <- df_dummy[, c(2:4, 20, 30:65)]  # rank, name, overall score, location + subject dummies

dist <- daisy(uni_mix[-2], metric = "gower")  # Gower dissimilarity; the name column is excluded
distmat <- as.matrix(dist)

p111 <- fviz_nbclust(distmat, cluster::pam, method = "wss") +
  ggtitle("(A) Elbow method")
p222 <- fviz_nbclust(distmat, cluster::pam, method = "silhouette") +
  ggtitle("(B) Silhouette method")

gridExtra::grid.arrange(p111, p222, nrow = 1)
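As a sanity check, the average silhouette width behind panel (B) can be reproduced by hand, fitting PAM for each candidate k:

# mean silhouette width for k = 2..6; the highest value marks the suggested k
sapply(2:6, function(k) {
  cl <- pam(dist, diss = TRUE, k = k)$clustering
  mean(silhouette(cl, dist)[, "sil_width"])
})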

It looks like both methods agree that there should be two clusters when the data are clustered using PAM. However, to my not really skillful eye, there is a potential for three clusters hinted at by the elbow plot.

2 clusters

Let’s run the two-cluster solution and see how it goes.

pam_fit1 <- pam(distmat, diss = TRUE, k = 2)

uni_mix$pam2_cluster <- as.factor(pam_fit1$clustering)
describeBy(uni_mix, uni_mix$pam2_cluster, mat = T)

The very first thing that raises some concerns is the number of cases in each group. It seems that the majority of observations have gotten into the first cluster. It is either an interesting discovery or some mistake of the algorithm. But I guess one can never know. Going further, we can say for sure that universities in the first cluster have, on average, more than twice the overall score of universities from the second cluster.

Nothing can be said about location and rank, as those are categorical and cannot be summarized with descriptive stats. But some points can be made about the sets of subjects. If we define a threshold of 0.5 for the binary variables representing the presence of a subject offered at a university, then we can say that a subject is there if its mean value in a cluster is higher than or equal to 0.5. In the first cluster almost all subjects are present, except for Veterinary Sciences. In the second cluster, on the other hand, Business & Management, Engineering specialties, Computer Science and Mathematics, Biology and Physics are present. This either does not make any sense or is telling us that the first group is represented by generally specialized universities, while the second group includes narrowly specialized universities devoted specifically to, let’s say, Engineering.
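Both claims can be backed up directly. Below is a minimal sketch: cluster sizes first, then the 0.5-threshold reading of the subject dummies (subj_cols assumes the 36 one-hot subject columns sit at positions 5:40 of uni_mix, as constructed above):

# how many universities landed in each of the two PAM clusters
table(uni_mix$pam2_cluster)

# a subject counts as "offered" in a cluster if its mean dummy value is >= 0.5
subj_cols <- 5:40
subjects_present <- function(block) names(block)[colMeans(block) >= 0.5]
lapply(split(uni_mix[, subj_cols], uni_mix$pam2_cluster), subjects_present)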

Now, let’s look at the top-3 most frequent rank positions occurring in each cluster:

uni_mix %>% 
  mutate(cluster = pam_fit1$clustering) %>%
  group_by(cluster) %>%
  count(rank) %>% 
  top_n(3)

The distribution of rank-position frequencies is about the same in both clusters. Interestingly, the most frequent rank position in both clusters is the furthest one, 1001+.

But what about location?

uni_mix %>% 
  mutate(cluster = pam_fit1$clustering) %>%
  group_by(cluster) %>%
  count(location) %>% 
  top_n(3)

The majority of universities from the first cluster are located in the US, with the United Kingdom and China taking the second and third places by frequency of occurrence, respectively, while the most frequent location in the second cluster is Japan, with India and China following.

Finally, let’s get the medoids - the most representative data points - for each cluster:

uni_mix[pam_fit1$medoids, ]

Here, it becomes quite clear that the most representative points of both clusters confirm the patterns observed in the descriptive stats. As for other attributes, the University of the Ryukyus, the medoid of the first cluster, is located in Asia and puts an emphasis on education and research specific to the local climate and the Asia-Pacific region in general. Lodz University of Technology, on the other hand, is located in Europe and presents itself as a constantly evolving institution “with a strong tradition in engineering”.

Additionally, here is a visualization of the data points colored by cluster. Some overlap can be observed on the border segregating the blue and yellow areas.

fviz_cluster(list(data = distmat, cluster = pam_fit1$clustering), 
             palette = "jco",
             ggtheme = theme_bw())

3 clusters

Okay, perhaps the solution will make more sense if we run the procedure with three clusters.

pam_fit2 <- pam(distmat, diss = TRUE, k = 3)
pam_fit2$data <- uni_mix[ , -2]

uni_mix$pam3_cluster <- as.factor(pam_fit2$clustering)
describeBy(uni_mix, uni_mix$pam3_cluster, mat = T)

Alright, it actually does make a bit more sense. We still have a generally specialized (by subjects) cluster, with the highest mean overall score among the three groups. But with this solution we get two more narrowly specialized (by subjects) clusters: one still related to Engineering, Math and Computer Science, Biology and Physics (the second cluster), and another related to Social Sciences, Humanities & Art, Economics, Natural Sciences and Math & Computer Science (the third cluster). The second cluster is still more than half behind the first in its mean overall score, whereas the third cluster’s mean overall score is very close to the first cluster’s. My guess is that, in the two-cluster solution, the third cluster was a subgroup of the first one.
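The same 0.5-threshold reading supports this interpretation for the three-cluster solution (reusing the subjects_present() helper and subj_cols from the earlier sketch):

# subject profiles per cluster, three-cluster PAM solution
lapply(split(uni_mix[, subj_cols], uni_mix$pam3_cluster), subjects_present)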

Now, let’s turn to the most frequent values of the categorical variables:

uni_mix %>% 
  mutate(cluster = pam_fit2$clustering) %>%
  group_by(cluster) %>%
  count(rank) %>% 
  top_n(3)
uni_mix %>% 
  mutate(cluster = pam_fit2$clustering) %>%
  group_by(cluster) %>%
  count(location) %>% 
  top_n(3)

No change for the rank positions - they are the same for all of the clusters. As for the locations, in the case of the second cluster they are the same as in the two-cluster solution, while the third most frequent location in the first cluster has changed: it is Japan now instead of China. Interestingly, the third cluster repeats the first and second most frequent locations of the first cluster, but its third most frequent location is Germany.

Lastly, the medoids of the clusters:

uni_mix[pam_fit2$medoids, ]

In the case of the first cluster, its most representative university is different from the one in the two-cluster solution, though still located in Japan. The second cluster’s most representative university is Lodz University of Technology again, while for the third cluster we observe Edge Hill University, located in the UK. This university was initially founded as the first non-denominational teacher training college for women and nowadays regularly hosts a suffragette symposium.

fviz_cluster(list(data = distmat, cluster = pam_fit2$clustering), 
             palette = "Set1",
             ggtheme = theme_bw())

The visualization of data points by cluster presented above confirms my guess that the third cluster might very well be a subgroup of the first cluster: on the plot, their colored sectors overlap a lot. So, I think it is okay to stick with two clusters from here on.

Hierarchical clustering

Now I am going to try another clustering method and see if the results would be different and/or better. Let’s do hierarchical clustering.

It is based on iterative grouping of data objects, which results in a tree of clusters. Agglomerative hierarchical clustering can be realized with four different ways of measuring the distance between clusters (linkages): ward, single, average, and complete linkage. Ward linkage builds the tree by merging, at each step, the pair of clusters that yields the minimum increase in total within-cluster variance. Single linkage uses the minimum distance between data objects of the existing clusters. Average linkage computes the average distance between all pairs of data objects of the existing clusters. And complete linkage uses the maximum distance between data objects of the existing clusters.
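A toy comparison on made-up points (base R’s hclust(), which implements the same linkages) shows how the merge rule alone changes the tree:

# same toy points, four merge rules -- the resulting trees differ noticeably
set.seed(1)
toy_pts <- matrix(rnorm(40), ncol = 2)  # 20 made-up points
d_toy   <- dist(toy_pts)
par(mfrow = c(2, 2))
for (meth in c("single", "complete", "average", "ward.D2")) {
  plot(hclust(d_toy, method = meth), main = meth, sub = "", xlab = "")
}
par(mfrow = c(1, 1))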

As I am going to use agglomerative hierarchical clustering, there is a way to tell which of the four mentioned linkages suits our data best before going into detailed interpretation: the agglomerative coefficient. It describes the strength of the clustering structure resulting from each linkage type. The agglomerative coefficient ranges from 0 to 1: the closer it is to 1, the stronger the structure.

m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")

# agglomerative coefficient for a given linkage method
ac <- function(x) {
  agnes(dist, diss = TRUE, method = x)$ac
}

purrr::map_dbl(m, ac)
##   average    single  complete      ward 
## 0.8175652 0.6262123 0.8889337 0.9896187

The output above tells us that the strongest clustering structure is obtained with ward linkage, so let me proceed with it.

Number of clusters

Again, the procedure of picking the optimal number of clusters. I will stick to the previously used methods.

p1111 <- fviz_nbclust(distmat, hcut, method = "wss") +
  ggtitle("(A) Elbow method")
p2222 <- fviz_nbclust(distmat, hcut, method = "silhouette") +
  ggtitle("(B) Silhouette method")

gridExtra::grid.arrange(p1111, p2222, nrow = 1)

Although another clustering method is specified, both the elbow plot and the silhouette plot suggest going with two clusters, with a potential three-cluster solution. Again.

2 clusters

So I guess I will repeat the procedure with different numbers of clusters, one by one.

hfitfull_ward <- agnes(dist, method = "ward")

hfull_cut <- hcut(dist, k = 2, hc_method = "ward.D")

fviz_dend(
  hfitfull_ward,
  k = 2,
  horiz = TRUE,
  rect = TRUE,
  rect_fill = TRUE,
  palette = "jco",
  cex = .01
)

Okaaay, this is how it looks when plotted. It seems that clustering this data is not an easy task in general. On the plot, it can be seen that clusters begin to proliferate closer to the bottom levels, which means there may be many more, perhaps indistinguishable, clusters inside the two main ones.

Anyways, let’s turn to the descriptive statistics.

uni_mix$hc_cluster <- as.factor(hfull_cut$cluster)
describeBy(uni_mix, uni_mix$hc_cluster, mat = T)

As you can see, this time even more observations have gotten into the first cluster than with PAM. The mean value of the overall score is still more than two times higher in the first cluster than in the second.

As for the subjects, the situation is pretty much the same as with the PAM results. However, this time Archaeology is present neither in the first nor in the second cluster.

Let’s see the top-3 most frequent locations for the universities from each cluster:

uni_mix %>% 
  mutate(cluster = hfull_cut$cluster) %>%
  group_by(cluster) %>%
  count(location) %>% 
  top_n(3)

So the output is the very same as it was with PAM. The same conclusion applies to the most frequent rank positions, as can be seen below.

uni_mix %>% 
  mutate(cluster = hfull_cut$cluster) %>%
  group_by(cluster) %>%
  count(rank) %>% 
  top_n(3)

So I can conclude that the clusters resulting from PAM and hierarchical clustering are mostly coherent, which means that, if I ran a three-cluster solution with hierarchical clustering, I would probably get the same results I got from the PAM three-cluster solution. So here is where I am going to stop with clustering.
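A quick cross-tabulation of the two two-cluster partitions backs this up: if the solutions agree, the bulk of the cases should concentrate in one cell per row and column.

# overlap between the PAM (k = 2) and hierarchical (k = 2) cluster labels
table(PAM = uni_mix$pam2_cluster, HC = uni_mix$hc_cluster)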

To prove my point, here are the hierarchical clustering dendrograms with two and with three clusters specified and colored.

h3_cut <- hcut(dist, k = 3, hc_method = "ward.D")

dend2 <- fviz_dend(
  hfitfull_ward,
  k = 2,
  horiz = TRUE,
  rect = TRUE,
  rect_fill = TRUE,
  palette = "jco",
  cex = .01) + ggtitle("Two clusters")

dend3 <- fviz_dend(
  hfitfull_ward,
  k = 3,
  horiz = TRUE,
  rect = TRUE,
  rect_fill = TRUE,
  palette = "Set1",
  cex = .01) + ggtitle("Three clusters")

gridExtra::grid.arrange(dend2, dend3, nrow = 1)

The cluster which is fully colored blue in the left picture splits into two groups when the number of clusters is raised to 3, as seen in the right picture.

Concluding

That is almost the end of this ‘wonderful’ report. In conclusion, I would like to highlight several points:

  1. As it turned out, it is NOT a piece of cake to get meaningful results from clustering the data I used. Perhaps the variable set should be different to obtain some perfect clusters, but I got what I’ve got.

  2. Clustering variables of different types is scary in the sense that it brings a lot of limitations and affects the quality of the results and their visualization. I was especially upset with the latter :(

  3. With data input like mine, the optimal number of clusters of universities is two. If there were a great need to name them, I would call the first, more encompassing cluster “Generally specialized universities” and the second cluster “Narrowly specialized universities” or “Engineering universities”.

And this is the end. Thanks for the attention!