In this project I am going to apply method of cluster analysis for the data coming from the World University Rankings 2021 by Times Higher Education. The aim is to clusterise universities for a company targeting university applicants from all over the world. I need to locate clusters of universities so as to match the best suiting university group to each applicant.
First, we load, as usual, necessary packages for work.
library(factoextra)
library(cluster)
library(dplyr)
library(psych)
library(kableExtra)
library(ggplot2)
library(ggthemes)
library(naniar)
library(Rtsne)
library(gridExtra)
library(ggpubr)
library(dendextend)
Just to save a space I left scraping procedure out of the report, so I load prepared RDS file as data. The file is a little bit different from the original table that was extracted from Times Higher Education site. I deleted several columns because they did not carry any information about characteristics of universities but were rather technical things.
df <- readRDS("THE_rating.RDS")
In this part I am choosing variables from dataset for cluster analysis. Since clustering is demanded by a company targeting university applicants, we need to chose variables that would be important for applicants. I suppose that variables about quality of teaching as well as academic abilities (‘scores_teaching’ and ‘scores_research’ respectively) should be included for sure because they are usually main characteristics of university. Also, international outlook of universities may be rather important for applicants to be sure that a university is open for international students.
Also, since universities are presented for applicants from all over the worlds, it can be important where the university is located. Thus, I created new variable based on variable ‘location’ where I specified regions for universities according to countries of their location.
Overall, I take 4 variables for my analysis. Let’s put them in separate dataset.
df1 <- select(df, c(name, scores_teaching, scores_research, scores_international_outlook, region))
In this section I describe the data obtained for analysis. First, I should check for missing values in the data.
vis_miss(df1)
As seen from the plot above, there are no missing values in our data. And now let’s proceed to variables. Each variable is described in separate subsection below.
Teaching (scores_teaching) represents results of quantitative assessment of learning environment of universities. Its scale ranges from 1 to 100. The descriptive statistics and distribution of the variable are presented below.
In the histogram it is seen that values distributed unequally with prevalence of low scores. The distribution is right-skewed.
| N | Mean | SD | Median | Min | Max | Skew | Kurtosis |
|---|---|---|---|---|---|---|---|
| 1526 | 27.81 | 13.62 | 23.55 | 11.7 | 94.4 | 2.02 | 4.98 |
Research (scores_research) represents results of quantitative assessment of volume, income and reputation of universities in research field. Its scale ranges from 1 to 100. The descriptive statistics and distribution of the variable are presented below.
As seen from the histogram, values distributed unequally with prevalence of low scores. The distribution is right-skewed.
| N | Mean | SD | Median | Min | Max | Skew | Kurtosis |
|---|---|---|---|---|---|---|---|
| 1526 | 23.57 | 17.33 | 17.5 | 7.1 | 99.6 | 1.86 | 3.72 |
International outlook (scores_international_outlook) represents results of quantitative assessment of international staff, students, research of universities. Its scale ranges from 1 to 100. The descriptive statistics and distribution of the variable are presented below.
The histogram shows that observations are distributed more or less equally (at least compared to previous variables), however there is still right skewness.
| N | Mean | SD | Median | Min | Max | Skew | Kurtosis |
|---|---|---|---|---|---|---|---|
| 1526 | 47.03 | 23.17 | 42.3 | 13.5 | 100 | 0.59 | -0.73 |
Region is a nominal variable representing information about location of universities in the world. The distribution of the variable is presented below.
As seen from the barplot, regions are distributed quite unequally, however it completely depends on size of territory as well as population.
In this section I will conduct cluster analysis. There will be several subsections with preparations, applying different methods and interpretation.
Before clusterisation it is necessary to decide which number of clusters should be retained. For this, I conduct several tests.
The first approach is ‘elbow method’ to determine the optimal k. To do this, we run several models with k = [1:10] and compare the ‘within-cluster sum of squares to the cluster centroid’ in each case. The optimal number of clusters (k) is where the ‘elbow’ occurs, i.e. a solution where the within-sum-of-squares decreases dramatically.
The results below suggest 2 or 3 clusters.
fviz_nbclust(df1[ , 2:4], kmeans, method = "wss")
The next approach is the ‘silhouette width statistic’. It measures the quality of a clustering. That is, it determines how well each object lies within its cluster.
As seen from the plot below, recommended number of clusters is 2.
fviz_nbclust(df1[ , 2:4], kmeans, method = "silhouette")
Finally, I use gap statistic. It compares the total intracluster variation for different values of k with their expected values under null reference distribution of the data (i.e. a distribution with no obvious clustering).
The plot below suggests 2 clusters for k-means clustering.
fviz_nbclust(df1[ , 2:4], kmeans, method = "gap")
As for hierarchical clustering, it is also suggested to retain 2 clusters.
fviz_nbclust(df1[ , 2:4], hcut, method = "gap")
In this part I apply k-means method of clustering. As was recommended by tests above, I look for 2 clusters. Since k-means is only applicable for numeric variables, I include only scores on teaching, research and international outlook in the analysis.
set.seed(100)
uniclust <- kmeans(df1[ , 2:4], 2)
I executed function of k-means clustering and now we can visualize the result.
In order to conduct hierarchical clustering, it is necessary to count euclidean distances between observations for numeric variables.
d <- dist(df1[ , 2:4], method = 'euclidean')
Now, I build a plot with distances between observations and a dendrogram.
heatmap(as.matrix(d), symm = T,
distfun = function(x) as.dist(x))
The plot is aimed to show number of clusters, however it is hard to identify red squares indicating number of clusters in this particular one.
Now, we get the solution through three different type of distances: minimal, average and maximal.
hfit_ward <- agnes(d, method = "ward")
hfit_average <- agnes(d, method = "average")
hfit_complete <- agnes(d, method = "complete")
To choose better method, let’s look at agglomerative coefficients of each.
hfit_ward$ac
## [1] 0.998086
hfit_average$ac
## [1] 0.9698641
hfit_complete$ac
## [1] 0.9832386
As seen from the output, ward method has the closest to 1 value, so I take this one. Let’s build a dendrogram.
According to the dendrogram, there are two clusters different from each other (judging by height of the first branch).
Let’s visualize clusters.
It is also possible to colour clusters in dendrogram.
In this part I will try to do clustering using data of both numeric and categorical type. I will include ‘region’ variable in the analysis.
In order to do that, I count distances using ‘gower’ metric.
gower_dist <- daisy(df1[ , -1],
metric = "gower")
summary(gower_dist)
## 1163575 dissimilarities, summarized :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.001132 0.312730 0.374980 0.384530 0.461510 0.967030
## Metric : mixed ; Types = I, I, I, N
## Number of objects : 1526
In the summary we see that we have 3 integer and 1 nominal variable which is true.
Then, we build a distance matrix.
gower_mat <- as.matrix(gower_dist)
It is necessary to look at the most similar and dissimilar pairs in the data to see if they make sense
df1[which(gower_mat == min(gower_mat[gower_mat != min(gower_mat)]), arr.ind = TRUE)[1, ], ]
| name | scores_teaching | scores_research | scores_international_outlook | region | |
|---|---|---|---|---|---|
| 1161 | University of Fukui | 21.4 | 10.2 | 22.3 | East Asia |
| 891 | Kurume University | 21.5 | 10.4 | 22.2 | East Asia |
df1[which(gower_mat == max(gower_mat[gower_mat != max(gower_mat)]), arr.ind = TRUE)[1, ], ]
| name | scores_teaching | scores_research | scores_international_outlook | region | |
|---|---|---|---|---|---|
| 1258 | Maharaja Sayajirao University of Baroda | 13.9 | 8.8 | 15.1 | South Asia |
| 1 | University of Oxford | 91.3 | 99.6 | 96.4 | West Europe |
It seems that they make sense, so we can continue.
Now, it is necessary to understand how many clusters we can retain. I again apply silhouette width statistic.
fviz_nbclust(gower_mat, cluster::pam, method = "silhouette")
The plot shows that we should use 10 clusters.
And now we conduct clustering using partitioning around medoids (PAM).
pam_fit <- pam(gower_dist, diss = TRUE, k = 10)
pam_fit$data <- df1[ , -1]
I did clustering and now we can see the results in visual form.
For now I have three results of cluster analysis: two based on numeric data and one based on both numeric and categorical data. It is necessary to choose the best solution. Thus, let’s look at visualizations one more time.
Results of k-means clustering and hierarchical clustering look similar but do not seem to be very meaningful. In both plots 1st clusters are very scattered. Also, 1st and 2nd clusters in both plots are very close to each other and it is very hard to detect difference. Although PAM clustering with mixed data has quite a large number of clusters (10), it looks more meaningful. Thus, I would take the last result as the best solution.
Since I chose the most appropriate result, I can now define what clusters basically mean. Let’s look at the statistics of each one.
pam_results <- df1 %>%
dplyr::select(-name) %>%
mutate(cluster = pam_fit$clustering)%>%
group_by(cluster) %>%
do(the_summary = summary(.))
pam_results$the_summary
## [[1]]
## scores_teaching scores_research scores_international_outlook
## Min. :15.10 Min. : 9.80 Min. :42.40
## 1st Qu.:23.00 1st Qu.:19.80 1st Qu.:63.10
## Median :31.30 Median :31.10 Median :76.50
## Mean :33.32 Mean :34.09 Mean :75.16
## 3rd Qu.:39.40 3rd Qu.:44.60 3rd Qu.:87.70
## Max. :91.30 Max. :99.60 Max. :99.60
##
## region cluster
## West Europe :243 Min. :1
## North Europe : 18 1st Qu.:1
## Southeast Asia: 4 Median :1
## Africa : 0 Mean :1
## Australia : 0 3rd Qu.:1
## South Europe : 0 Max. :1
## (Other) : 0
##
## [[2]]
## scores_teaching scores_research scores_international_outlook
## Min. :15.90 Min. : 8.20 Min. :21.30
## 1st Qu.:26.00 1st Qu.:19.95 1st Qu.:38.55
## Median :34.45 Median :27.45 Median :50.05
## Mean :38.97 Mean :35.48 Mean :52.34
## 3rd Qu.:46.62 3rd Qu.:42.70 3rd Qu.:64.58
## Max. :94.40 Max. :98.80 Max. :94.70
##
## region cluster
## North America :211 Min. :2
## North Europe : 8 1st Qu.:2
## Southeast Asia : 6 Median :2
## Central America: 1 Mean :2
## Africa : 0 3rd Qu.:2
## Australia : 0 Max. :2
## (Other) : 0
##
## [[3]]
## scores_teaching scores_research scores_international_outlook
## Min. :14.00 Min. : 7.60 Min. : 15.30
## 1st Qu.:19.10 1st Qu.:11.80 1st Qu.: 22.80
## Median :23.15 Median :16.05 Median : 28.50
## Mean :27.01 Mean :22.65 Mean : 32.42
## 3rd Qu.:29.48 3rd Qu.:27.15 3rd Qu.: 37.25
## Max. :89.60 Max. :94.90 Max. :100.00
##
## region cluster
## East Asia :288 Min. :3
## Southeast Asia: 4 1st Qu.:3
## Africa : 0 Median :3
## Australia : 0 Mean :3
## North Europe : 0 3rd Qu.:3
## South Europe : 0 Max. :3
## (Other) : 0
##
## [[4]]
## scores_teaching scores_research scores_international_outlook
## Min. :16.40 Min. :13.20 Min. :59.80
## 1st Qu.:22.38 1st Qu.:26.77 1st Qu.:84.15
## Median :25.60 Median :32.90 Median :89.25
## Mean :31.48 Mean :37.14 Mean :87.27
## 3rd Qu.:38.58 3rd Qu.:43.10 3rd Qu.:93.42
## Max. :75.90 Max. :90.80 Max. :97.00
##
## region cluster
## Australia :45 Min. :4
## North Europe : 7 1st Qu.:4
## Southeast Asia: 3 Median :4
## Middle East : 1 Mean :4
## Africa : 0 3rd Qu.:4
## South Europe : 0 Max. :4
## (Other) : 0
##
## [[5]]
## scores_teaching scores_research scores_international_outlook
## Min. :12.70 Min. : 8.60 Min. :25.90
## 1st Qu.:17.93 1st Qu.:14.57 1st Qu.:39.20
## Median :20.00 Median :18.00 Median :44.50
## Mean :22.69 Mean :19.74 Mean :46.02
## 3rd Qu.:25.18 3rd Qu.:22.75 3rd Qu.:53.15
## Max. :55.60 Max. :41.40 Max. :75.90
##
## region cluster
## South Europe :129 Min. :5
## North Europe : 7 1st Qu.:5
## Central America: 1 Median :5
## Southeast Asia : 1 Mean :5
## Africa : 0 3rd Qu.:5
## Australia : 0 Max. :5
## (Other) : 0
##
## [[6]]
## scores_teaching scores_research scores_international_outlook
## Min. :12.40 Min. : 7.20 Min. :22.60
## 1st Qu.:15.45 1st Qu.: 7.90 1st Qu.:37.10
## Median :18.30 Median : 9.20 Median :43.10
## Mean :19.21 Mean :12.28 Mean :43.78
## 3rd Qu.:22.50 3rd Qu.:11.30 3rd Qu.:46.95
## Max. :31.30 Max. :43.90 Max. :81.60
##
## region cluster
## Africa :63 Min. :6
## Southeast Asia : 6 1st Qu.:6
## Central America: 4 Median :6
## North Europe : 2 Mean :6
## Australia : 0 3rd Qu.:6
## South Europe : 0 Max. :6
## (Other) : 0
##
## [[7]]
## scores_teaching scores_research scores_international_outlook
## Min. :14.0 Min. : 7.70 Min. :16.50
## 1st Qu.:17.4 1st Qu.: 9.60 1st Qu.:26.90
## Median :19.5 Median :11.60 Median :34.50
## Mean :22.0 Mean :13.99 Mean :37.44
## 3rd Qu.:22.6 3rd Qu.:14.50 3rd Qu.:44.50
## Max. :80.0 Max. :67.60 Max. :75.80
##
## region cluster
## East Europe :124 Min. :7
## Southeast Asia : 10 1st Qu.:7
## North Europe : 2 Median :7
## Central America: 1 Mean :7
## Africa : 0 3rd Qu.:7
## Australia : 0 Max. :7
## (Other) : 0
##
## [[8]]
## scores_teaching scores_research scores_international_outlook
## Min. :11.70 Min. : 7.10 Min. :14.90
## 1st Qu.:17.60 1st Qu.: 9.70 1st Qu.:18.90
## Median :21.40 Median :13.10 Median :24.00
## Mean :23.01 Mean :15.77 Mean :37.53
## 3rd Qu.:26.40 3rd Qu.:18.60 3rd Qu.:53.60
## Max. :46.80 Max. :52.00 Max. :99.40
##
## region cluster
## Middle East :134 Min. :8
## Southeast Asia: 2 1st Qu.:8
## North Europe : 1 Median :8
## Africa : 0 Mean :8
## Australia : 0 3rd Qu.:8
## South Europe : 0 Max. :8
## (Other) : 0
##
## [[9]]
## scores_teaching scores_research scores_international_outlook
## Min. :13.20 Min. : 7.3 Min. :14.60
## 1st Qu.:15.80 1st Qu.: 8.9 1st Qu.:22.30
## Median :17.50 Median :10.1 Median :28.20
## Mean :19.54 Mean :12.1 Mean :32.28
## 3rd Qu.:21.20 3rd Qu.:12.9 3rd Qu.:44.00
## Max. :56.60 Max. :58.9 Max. :58.90
##
## region cluster
## South America :90 Min. :9
## Central America:13 1st Qu.:9
## Southeast Asia :10 Median :9
## Africa : 0 Mean :9
## Australia : 0 3rd Qu.:9
## North Europe : 0 Max. :9
## (Other) : 0
##
## [[10]]
## scores_teaching scores_research scores_international_outlook
## Min. :12.10 Min. : 7.20 Min. :13.50
## 1st Qu.:18.10 1st Qu.: 9.25 1st Qu.:16.15
## Median :23.40 Median :11.00 Median :19.30
## Mean :25.34 Mean :13.08 Mean :24.30
## 3rd Qu.:30.40 3rd Qu.:14.90 3rd Qu.:33.10
## Max. :58.10 Max. :53.10 Max. :46.70
##
## region cluster
## South Asia :84 Min. :10
## Southeast Asia: 3 1st Qu.:10
## Africa : 0 Median :10
## Australia : 0 Mean :10
## North Europe : 0 3rd Qu.:10
## South Europe : 0 Max. :10
## (Other) : 0
It is quite hard to give names to clusters. Most of them are formed by regional characteristic. So, I will just name characteristics of clusters without naming. All in all, since our aim is to present universities to applicants, characteristics are more important than names.
1st cluster - good universities with moderately high teaching and research scores and rather high international orientation; located in Western Europe
2nd cluster - the best universities with the best research and learning environment and moderate international outlook; located in North America
3rd cluster - ordinary universities with moderate scores for everything; located in East Asia
4th cluster - the most internationally oriented universities with good learning and research environment; located in Australia
5th cluster - ordinary universities with lower scores for everything; located in South Europe
6th cluster - universities with the worst research and learning environment from Africa
7th cluster - ordinary universities with lower scores for everything; located in East Europe
8th cluster - ordinary universities with lower scores for everything from Middle East
9th cluster - ordinary universities with lower scores for everything from Central and South America
10th cluster - universities with bad research and learning environment from South Asia
To sum up, I made relatively informative, but still not perfect, clusterization of universities, As was presented in the plot, there are some outliers for many clusters which were assigned to one or another cluster somehow. Also, some regions were represented with small shares in different clusters. Thus, the results could be better, but nevertheless are meaningful.