Introduction

In this project I am going to apply method of cluster analysis for the data coming from the World University Rankings 2021 by Times Higher Education. The aim is to clusterise universities for a company targeting university applicants from all over the world. I need to locate clusters of universities so as to match the best suiting university group to each applicant.

Data and package preparation

First, we load, as usual, necessary packages for work.

library(factoextra)
library(cluster)
library(dplyr)
library(psych)
library(kableExtra)
library(ggplot2)
library(ggthemes)
library(naniar)
library(Rtsne)
library(gridExtra)
library(ggpubr)
library(dendextend)

Just to save a space I left scraping procedure out of the report, so I load prepared RDS file as data. The file is a little bit different from the original table that was extracted from Times Higher Education site. I deleted several columns because they did not carry any information about characteristics of universities but were rather technical things.

df <- readRDS("THE_rating.RDS")

Choice of variables

In this part I am choosing variables from dataset for cluster analysis. Since clustering is demanded by a company targeting university applicants, we need to chose variables that would be important for applicants. I suppose that variables about quality of teaching as well as academic abilities (‘scores_teaching’ and ‘scores_research’ respectively) should be included for sure because they are usually main characteristics of university. Also, international outlook of universities may be rather important for applicants to be sure that a university is open for international students.

Also, since universities are presented for applicants from all over the worlds, it can be important where the university is located. Thus, I created new variable based on variable ‘location’ where I specified regions for universities according to countries of their location.

Overall, I take 4 variables for my analysis. Let’s put them in separate dataset.

df1 <- select(df, c(name, scores_teaching, scores_research, scores_international_outlook, region))

Data description

In this section I describe the data obtained for analysis. First, I should check for missing values in the data.

vis_miss(df1)

As seen from the plot above, there are no missing values in our data. And now let’s proceed to variables. Each variable is described in separate subsection below.

scores_teching

Teaching (scores_teaching) represents results of quantitative assessment of learning environment of universities. Its scale ranges from 1 to 100. The descriptive statistics and distribution of the variable are presented below.

In the histogram it is seen that values distributed unequally with prevalence of low scores. The distribution is right-skewed.

N	Mean	SD	Median	Min	Max	Skew	Kurtosis
1526	27.81	13.62	23.55	11.7	94.4	2.02	4.98

scores_research

Research (scores_research) represents results of quantitative assessment of volume, income and reputation of universities in research field. Its scale ranges from 1 to 100. The descriptive statistics and distribution of the variable are presented below.

As seen from the histogram, values distributed unequally with prevalence of low scores. The distribution is right-skewed.

N	Mean	SD	Median	Min	Max	Skew	Kurtosis
1526	23.57	17.33	17.5	7.1	99.6	1.86	3.72

scores_international_outlook

International outlook (scores_international_outlook) represents results of quantitative assessment of international staff, students, research of universities. Its scale ranges from 1 to 100. The descriptive statistics and distribution of the variable are presented below.

The histogram shows that observations are distributed more or less equally (at least compared to previous variables), however there is still right skewness.

N	Mean	SD	Median	Min	Max	Skew	Kurtosis
1526	47.03	23.17	42.3	13.5	100	0.59	-0.73

region

Region is a nominal variable representing information about location of universities in the world. The distribution of the variable is presented below.

As seen from the barplot, regions are distributed quite unequally, however it completely depends on size of territory as well as population.

Cluster analysis

In this section I will conduct cluster analysis. There will be several subsections with preparations, applying different methods and interpretation.

How many clusters to retain?

Before clusterisation it is necessary to decide which number of clusters should be retained. For this, I conduct several tests.

The first approach is ‘elbow method’ to determine the optimal k. To do this, we run several models with k = [1:10] and compare the ‘within-cluster sum of squares to the cluster centroid’ in each case. The optimal number of clusters (k) is where the ‘elbow’ occurs, i.e. a solution where the within-sum-of-squares decreases dramatically.

The results below suggest 2 or 3 clusters.

fviz_nbclust(df1[ , 2:4], kmeans, method = "wss")

The next approach is the ‘silhouette width statistic’. It measures the quality of a clustering. That is, it determines how well each object lies within its cluster.

As seen from the plot below, recommended number of clusters is 2.

fviz_nbclust(df1[ , 2:4], kmeans, method = "silhouette")

Finally, I use gap statistic. It compares the total intracluster variation for different values of k with their expected values under null reference distribution of the data (i.e. a distribution with no obvious clustering).

The plot below suggests 2 clusters for k-means clustering.

fviz_nbclust(df1[ , 2:4], kmeans, method = "gap")

As for hierarchical clustering, it is also suggested to retain 2 clusters.

fviz_nbclust(df1[ , 2:4], hcut, method = "gap")

K-means clustering

In this part I apply k-means method of clustering. As was recommended by tests above, I look for 2 clusters. Since k-means is only applicable for numeric variables, I include only scores on teaching, research and international outlook in the analysis.

set.seed(100)
uniclust <- kmeans(df1[ , 2:4], 2)

I executed function of k-means clustering and now we can visualize the result.

Hierarchical clustering

In order to conduct hierarchical clustering, it is necessary to count euclidean distances between observations for numeric variables.

d <- dist(df1[ , 2:4], method = 'euclidean')

Now, I build a plot with distances between observations and a dendrogram.

heatmap(as.matrix(d), symm = T,
        distfun = function(x) as.dist(x))

The plot is aimed to show number of clusters, however it is hard to identify red squares indicating number of clusters in this particular one.

Now, we get the solution through three different type of distances: minimal, average and maximal.

hfit_ward <- agnes(d, method = "ward") 
hfit_average <- agnes(d, method = "average")
hfit_complete <- agnes(d, method = "complete")

To choose better method, let’s look at agglomerative coefficients of each.

hfit_ward$ac

## [1] 0.998086

hfit_average$ac

## [1] 0.9698641

hfit_complete$ac

## [1] 0.9832386

As seen from the output, ward method has the closest to 1 value, so I take this one. Let’s build a dendrogram.

According to the dendrogram, there are two clusters different from each other (judging by height of the first branch).

Let’s visualize clusters.

It is also possible to colour clusters in dendrogram.

Clustering mixed data

In this part I will try to do clustering using data of both numeric and categorical type. I will include ‘region’ variable in the analysis.

In order to do that, I count distances using ‘gower’ metric.

gower_dist <- daisy(df1[ , -1],
                    metric = "gower")

summary(gower_dist)

## 1163575 dissimilarities, summarized :
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.001132 0.312730 0.374980 0.384530 0.461510 0.967030 
## Metric :  mixed ;  Types = I, I, I, N 
## Number of objects : 1526

In the summary we see that we have 3 integer and 1 nominal variable which is true.

Then, we build a distance matrix.

gower_mat <- as.matrix(gower_dist)

It is necessary to look at the most similar and dissimilar pairs in the data to see if they make sense

df1[which(gower_mat == min(gower_mat[gower_mat != min(gower_mat)]), arr.ind = TRUE)[1, ], ]

	name	scores_teaching	scores_research	scores_international_outlook	region
1161	University of Fukui	21.4	10.2	22.3	East Asia
891	Kurume University	21.5	10.4	22.2	East Asia

df1[which(gower_mat == max(gower_mat[gower_mat != max(gower_mat)]), arr.ind = TRUE)[1, ], ]

	name	scores_teaching	scores_research	scores_international_outlook	region
1258	Maharaja Sayajirao University of Baroda	13.9	8.8	15.1	South Asia
1	University of Oxford	91.3	99.6	96.4	West Europe

It seems that they make sense, so we can continue.

Now, it is necessary to understand how many clusters we can retain. I again apply silhouette width statistic.

fviz_nbclust(gower_mat, cluster::pam, method = "silhouette")

The plot shows that we should use 10 clusters.

And now we conduct clustering using partitioning around medoids (PAM).

pam_fit <- pam(gower_dist, diss = TRUE, k = 10)
pam_fit$data <- df1[ , -1]

I did clustering and now we can see the results in visual form.

Comparing results

For now I have three results of cluster analysis: two based on numeric data and one based on both numeric and categorical data. It is necessary to choose the best solution. Thus, let’s look at visualizations one more time.

Results of k-means clustering and hierarchical clustering look similar but do not seem to be very meaningful. In both plots 1st clusters are very scattered. Also, 1st and 2nd clusters in both plots are very close to each other and it is very hard to detect difference. Although PAM clustering with mixed data has quite a large number of clusters (10), it looks more meaningful. Thus, I would take the last result as the best solution.

Interpretaion

Since I chose the most appropriate result, I can now define what clusters basically mean. Let’s look at the statistics of each one.

pam_results <- df1 %>%
  dplyr::select(-name) %>%
  mutate(cluster = pam_fit$clustering)%>%
  group_by(cluster) %>%
  do(the_summary = summary(.)) 

pam_results$the_summary

## [[1]]
##  scores_teaching scores_research scores_international_outlook
##  Min.   :15.10   Min.   : 9.80   Min.   :42.40               
##  1st Qu.:23.00   1st Qu.:19.80   1st Qu.:63.10               
##  Median :31.30   Median :31.10   Median :76.50               
##  Mean   :33.32   Mean   :34.09   Mean   :75.16               
##  3rd Qu.:39.40   3rd Qu.:44.60   3rd Qu.:87.70               
##  Max.   :91.30   Max.   :99.60   Max.   :99.60               
##                                                              
##             region       cluster 
##  West Europe   :243   Min.   :1  
##  North Europe  : 18   1st Qu.:1  
##  Southeast Asia:  4   Median :1  
##  Africa        :  0   Mean   :1  
##  Australia     :  0   3rd Qu.:1  
##  South Europe  :  0   Max.   :1  
##  (Other)       :  0              
## 
## [[2]]
##  scores_teaching scores_research scores_international_outlook
##  Min.   :15.90   Min.   : 8.20   Min.   :21.30               
##  1st Qu.:26.00   1st Qu.:19.95   1st Qu.:38.55               
##  Median :34.45   Median :27.45   Median :50.05               
##  Mean   :38.97   Mean   :35.48   Mean   :52.34               
##  3rd Qu.:46.62   3rd Qu.:42.70   3rd Qu.:64.58               
##  Max.   :94.40   Max.   :98.80   Max.   :94.70               
##                                                              
##              region       cluster 
##  North America  :211   Min.   :2  
##  North Europe   :  8   1st Qu.:2  
##  Southeast Asia :  6   Median :2  
##  Central America:  1   Mean   :2  
##  Africa         :  0   3rd Qu.:2  
##  Australia      :  0   Max.   :2  
##  (Other)        :  0              
## 
## [[3]]
##  scores_teaching scores_research scores_international_outlook
##  Min.   :14.00   Min.   : 7.60   Min.   : 15.30              
##  1st Qu.:19.10   1st Qu.:11.80   1st Qu.: 22.80              
##  Median :23.15   Median :16.05   Median : 28.50              
##  Mean   :27.01   Mean   :22.65   Mean   : 32.42              
##  3rd Qu.:29.48   3rd Qu.:27.15   3rd Qu.: 37.25              
##  Max.   :89.60   Max.   :94.90   Max.   :100.00              
##                                                              
##             region       cluster 
##  East Asia     :288   Min.   :3  
##  Southeast Asia:  4   1st Qu.:3  
##  Africa        :  0   Median :3  
##  Australia     :  0   Mean   :3  
##  North Europe  :  0   3rd Qu.:3  
##  South Europe  :  0   Max.   :3  
##  (Other)       :  0              
## 
## [[4]]
##  scores_teaching scores_research scores_international_outlook
##  Min.   :16.40   Min.   :13.20   Min.   :59.80               
##  1st Qu.:22.38   1st Qu.:26.77   1st Qu.:84.15               
##  Median :25.60   Median :32.90   Median :89.25               
##  Mean   :31.48   Mean   :37.14   Mean   :87.27               
##  3rd Qu.:38.58   3rd Qu.:43.10   3rd Qu.:93.42               
##  Max.   :75.90   Max.   :90.80   Max.   :97.00               
##                                                              
##             region      cluster 
##  Australia     :45   Min.   :4  
##  North Europe  : 7   1st Qu.:4  
##  Southeast Asia: 3   Median :4  
##  Middle East   : 1   Mean   :4  
##  Africa        : 0   3rd Qu.:4  
##  South Europe  : 0   Max.   :4  
##  (Other)       : 0              
## 
## [[5]]
##  scores_teaching scores_research scores_international_outlook
##  Min.   :12.70   Min.   : 8.60   Min.   :25.90               
##  1st Qu.:17.93   1st Qu.:14.57   1st Qu.:39.20               
##  Median :20.00   Median :18.00   Median :44.50               
##  Mean   :22.69   Mean   :19.74   Mean   :46.02               
##  3rd Qu.:25.18   3rd Qu.:22.75   3rd Qu.:53.15               
##  Max.   :55.60   Max.   :41.40   Max.   :75.90               
##                                                              
##              region       cluster 
##  South Europe   :129   Min.   :5  
##  North Europe   :  7   1st Qu.:5  
##  Central America:  1   Median :5  
##  Southeast Asia :  1   Mean   :5  
##  Africa         :  0   3rd Qu.:5  
##  Australia      :  0   Max.   :5  
##  (Other)        :  0              
## 
## [[6]]
##  scores_teaching scores_research scores_international_outlook
##  Min.   :12.40   Min.   : 7.20   Min.   :22.60               
##  1st Qu.:15.45   1st Qu.: 7.90   1st Qu.:37.10               
##  Median :18.30   Median : 9.20   Median :43.10               
##  Mean   :19.21   Mean   :12.28   Mean   :43.78               
##  3rd Qu.:22.50   3rd Qu.:11.30   3rd Qu.:46.95               
##  Max.   :31.30   Max.   :43.90   Max.   :81.60               
##                                                              
##              region      cluster 
##  Africa         :63   Min.   :6  
##  Southeast Asia : 6   1st Qu.:6  
##  Central America: 4   Median :6  
##  North Europe   : 2   Mean   :6  
##  Australia      : 0   3rd Qu.:6  
##  South Europe   : 0   Max.   :6  
##  (Other)        : 0              
## 
## [[7]]
##  scores_teaching scores_research scores_international_outlook
##  Min.   :14.0    Min.   : 7.70   Min.   :16.50               
##  1st Qu.:17.4    1st Qu.: 9.60   1st Qu.:26.90               
##  Median :19.5    Median :11.60   Median :34.50               
##  Mean   :22.0    Mean   :13.99   Mean   :37.44               
##  3rd Qu.:22.6    3rd Qu.:14.50   3rd Qu.:44.50               
##  Max.   :80.0    Max.   :67.60   Max.   :75.80               
##                                                              
##              region       cluster 
##  East Europe    :124   Min.   :7  
##  Southeast Asia : 10   1st Qu.:7  
##  North Europe   :  2   Median :7  
##  Central America:  1   Mean   :7  
##  Africa         :  0   3rd Qu.:7  
##  Australia      :  0   Max.   :7  
##  (Other)        :  0              
## 
## [[8]]
##  scores_teaching scores_research scores_international_outlook
##  Min.   :11.70   Min.   : 7.10   Min.   :14.90               
##  1st Qu.:17.60   1st Qu.: 9.70   1st Qu.:18.90               
##  Median :21.40   Median :13.10   Median :24.00               
##  Mean   :23.01   Mean   :15.77   Mean   :37.53               
##  3rd Qu.:26.40   3rd Qu.:18.60   3rd Qu.:53.60               
##  Max.   :46.80   Max.   :52.00   Max.   :99.40               
##                                                              
##             region       cluster 
##  Middle East   :134   Min.   :8  
##  Southeast Asia:  2   1st Qu.:8  
##  North Europe  :  1   Median :8  
##  Africa        :  0   Mean   :8  
##  Australia     :  0   3rd Qu.:8  
##  South Europe  :  0   Max.   :8  
##  (Other)       :  0              
## 
## [[9]]
##  scores_teaching scores_research scores_international_outlook
##  Min.   :13.20   Min.   : 7.3    Min.   :14.60               
##  1st Qu.:15.80   1st Qu.: 8.9    1st Qu.:22.30               
##  Median :17.50   Median :10.1    Median :28.20               
##  Mean   :19.54   Mean   :12.1    Mean   :32.28               
##  3rd Qu.:21.20   3rd Qu.:12.9    3rd Qu.:44.00               
##  Max.   :56.60   Max.   :58.9    Max.   :58.90               
##                                                              
##              region      cluster 
##  South America  :90   Min.   :9  
##  Central America:13   1st Qu.:9  
##  Southeast Asia :10   Median :9  
##  Africa         : 0   Mean   :9  
##  Australia      : 0   3rd Qu.:9  
##  North Europe   : 0   Max.   :9  
##  (Other)        : 0              
## 
## [[10]]
##  scores_teaching scores_research scores_international_outlook
##  Min.   :12.10   Min.   : 7.20   Min.   :13.50               
##  1st Qu.:18.10   1st Qu.: 9.25   1st Qu.:16.15               
##  Median :23.40   Median :11.00   Median :19.30               
##  Mean   :25.34   Mean   :13.08   Mean   :24.30               
##  3rd Qu.:30.40   3rd Qu.:14.90   3rd Qu.:33.10               
##  Max.   :58.10   Max.   :53.10   Max.   :46.70               
##                                                              
##             region      cluster  
##  South Asia    :84   Min.   :10  
##  Southeast Asia: 3   1st Qu.:10  
##  Africa        : 0   Median :10  
##  Australia     : 0   Mean   :10  
##  North Europe  : 0   3rd Qu.:10  
##  South Europe  : 0   Max.   :10  
##  (Other)       : 0

It is quite hard to give names to clusters. Most of them are formed by regional characteristic. So, I will just name characteristics of clusters without naming. All in all, since our aim is to present universities to applicants, characteristics are more important than names.

1st cluster - good universities with moderately high teaching and research scores and rather high international orientation; located in Western Europe
2nd cluster - the best universities with the best research and learning environment and moderate international outlook; located in North America
3rd cluster - ordinary universities with moderate scores for everything; located in East Asia
4th cluster - the most internationally oriented universities with good learning and research environment; located in Australia
5th cluster - ordinary universities with lower scores for everything; located in South Europe
6th cluster - universities with the worst research and learning environment from Africa
7th cluster - ordinary universities with lower scores for everything; located in East Europe
8th cluster - ordinary universities with lower scores for everything from Middle East
9th cluster - ordinary universities with lower scores for everything from Central and South America
10th cluster - universities with bad research and learning environment from South Asia

Conclusion

To sum up, I made relatively informative, but still not perfect, clusterization of universities, As was presented in the plot, there are some outliers for many clusters which were assigned to one or another cluster somehow. Also, some regions were represented with small shares in different clusters. Thus, the results could be better, but nevertheless are meaningful.

Project on cluster analysis

by Artyom Kulikov

25/11/2020