Preparations

Load the packages for data wrangling (tidyverse, stringr, purrr) and jsonlite for web scraping a JSON:

library(tidyverse)
library(stringr)
library(purrr)
require("V8")
require("jsonlite")
library(V8)
library(jsonlite)
library(psych)
library(kableExtra)
library(ggplot2)

Extract the page with 2021 ratings as json file:

college_json2021 <- fromJSON(paste0(
  'https://www.timeshighereducation.com/sites/default/files/the_data_rankings/', 
  'world_university_rankings_2021_0__fa224219a267a5b9c4287386a97c70ea.json'))

Turn the ‘data’ list from college_json2021 to your dataset

college_2021 <- college_json2021$data
head(college_2021)

Have a look at the variables and its classes. Varibles stats_female_male_ratio, stats_pc_intl_students, and stats_number_students need some corrections.

#describe(college_2021) %>% 
#  knitr::kable()
(college_2021$stats_female_male_ratio)[1] # we need to split the 1st part into 'female_share'
## [1] "46 : 54"
(college_2021$stats_pc_intl_students)[1] # we need to get rid of % sign
## [1] "41%"
(college_2021$stats_number_students)[1] # we need to get rid of commas
## [1] "20,774"
#Split the first part of the `stats_female_male_ratio` into a new variable, `female_share`. 
college_2021 <- college_2021 %>% 
  mutate(stats_female_share = as.numeric(str_match(college_2021$stats_female_male_ratio, "^\\d{1,2}")))
college_2021[1:6,"stats_female_share"]
## [1] 46 44 49 36 39 47
#Trim the "%" sign after the number and save the result to the old variable, `stats_pc_intl_students`.
college_2021 <- college_2021 %>% 
  mutate(stats_pc_intl_students = as.numeric(str_replace(stats_pc_intl_students, "%", "")))
college_2021[1:6,"stats_pc_intl_students"]
## [1] 41 23 25 33 34 38
#Get rid of the comma separating the digits. Save the result to the old variable, `stats_number_students`.
college_2021 <- college_2021 %>% 
  mutate(stats_number_students = str_replace(stats_number_students, ",", ""))
college_2021[1:6, "stats_number_students"]
## [1] "20774" "16223" "21261" "2238"  "11276" "19370"

Some universities have pre-paid a reference link from the THE website. The reference links are stored in the variable apply_link. Create a logical variable ref_link which is true if the university has an apply_link, and false otherwise.

# turn 'apply_link' to a logical vector:
college_2021 <- college_2021 %>% 
  mutate(ref_link = if_else(is.na(apply_link), FALSE, TRUE))

Now look at the column types and their values. Let’s turn all the rank variables into factors, as well as turn scores and stats into numeric.

df <- college_2021
#str(df)

# Find all the variables ending with "rank", then add `record_type`, `member_level`, and `location` to this set. Use a `for` loop to turn several variables from character to factor.

tofa <- c("rank", 
          "scores_overall",
          "scores_overall_rank",
          "scores_teaching_rank",
          "scores_research_rank",
          "scores_citations_rank",
          "scores_industry_income_rank",
          "scores_international_outlook_rank",
          "record_type",
          "member_level",
          "location",
          "nid")

for (i in tofa){
  df[, i] <- as.factor(df[, i])
}

# Now, take five variables on actual `scores` and four variables on various `stats` and make them numeric with a `for` loop.

tonu <- c("scores_teaching",
          "scores_research",
          "scores_citations",
          "scores_industry_income",
          "scores_international_outlook",
          "stats_number_students",
          "stats_student_staff_ratio",
          "stats_pc_intl_students",
          "stats_female_share")
for (i in tonu){
  df[ , i] <- as.numeric(df[ , i])
}
options(scipen = 999)
#describe(df) %>% 
#  knitr::kable() # all variable must have around 1500 values

The task

__is to clusterise universities for a company targeting university applicants from all over the world. You need to locate clusters of universities so as to match the best suiting university group to each applicant. Using the variables of your choice, explore the possible clustering solutions and describe the differrences between the obtained clusters of universities.

Data description and variable choice

To remind some points of the data, here is its quick summary. The set collects data from The Times Higher Education World University Rankings 2021. 1448 universities from 92 countries are ranked according to several criterias: scores for their teaching, researches, citations, industry income and international outlook.

I want to explore universities clusterization with the following characteristics: research, citation and international outlook rates. This way I will be able to evaluate universities from the interantional view, as pieces of research and its citations are some tools to present the university out beyond the borders. Moreover, if we take a look at correlational analysis, these three variable have a slight positive correlation with each other (~0.5).

I’ve also provided a table with the summary for these three variales - in general all three variables are disctributed between 1 and 100. Minimal citation score is 1.90, minimal research score is 7.10 and for international outlook score it is 13.5. As for the maximum values it is 100 for interantinal outlook and citations and 99.60 for research score.

the2021_num <- df %>% select(scores_research, scores_citations, scores_international_outlook)
library(polycor)
df_cor <- hetcor(the2021_num)
as.data.frame(df_cor$correlations)
data <- df %>% select(scores_research, scores_citations, scores_international_outlook)
summary(data)
##  scores_research scores_citations scores_international_outlook
##  Min.   : 7.10   Min.   :  1.90   Min.   : 13.50              
##  1st Qu.:11.40   1st Qu.: 23.50   1st Qu.: 27.50              
##  Median :17.50   Median : 45.10   Median : 42.30              
##  Mean   :23.57   Mean   : 48.13   Mean   : 47.03              
##  3rd Qu.:29.98   3rd Qu.: 71.30   3rd Qu.: 63.00              
##  Max.   :99.60   Max.   :100.00   Max.   :100.00
#ggplot() +
#  geom_histogram()

So now I have counted distanced between observations. As all variables are numeric I’ve used euclidean method. For some visualization and short outlook I’ve presented scutterplot below, which shows disctibution of universities by research and citation scores.

dist_data <- dist(data, method = 'euclidean')
#visualization of disctibution of universities by research and citation scores 
ggplot(data, aes(x = scores_research, y = scores_citations)) + 
  geom_point() +
  theme_bw()

Hierarchical clustering

Here you may see results of hierarchical clusterization. I’ve done two graphs: with two and three clusters. In addition you may look at cluster dendragrams below. They are reeeally not good, but just in case, to resent three types of linckages (complete, single, average).

hc_data <- hclust(dist_data, method = 'complete')
cluster_assigments <- cutree(hc_data, k=2)
data_clustered_2 <- mutate(data, cluster = as.factor(cluster_assigments))
#head(data_clustered_2)
ggplot(data_clustered_2, aes(x = scores_research, scores_citations, color = cluster)) + 
  geom_point() +
  labs(title = "Hierarchical clustering", subtitle = "2 clusters")+
  theme_bw()

hc_data <- hclust(dist_data, method = 'complete')
cluster_assigments <- cutree(hc_data, k=3)
data_clustered_3 <- mutate(data, cluster = as.factor(cluster_assigments))
#head(data_clustered_3)
ggplot(data_clustered_3, aes(x = scores_research, scores_citations, color = cluster)) + 
  geom_point() +
  labs(title = "Hierarchical clustering", subtitle = "3 clusters")+
  theme_bw()

# Prepare the Distance Matrix
dist_players <- dist(data)

# Generate hclust for complete, single & average linkage methods
hc_complete <- hclust(dist_players, method = "complete")
hc_single <- hclust(dist_players, method = "single")
hc_average <- hclust(dist_players, method = "average")

# Plot & Label the 3 Dendrograms Side-by-Side
# Hint: To see these Side-by-Side run the 4 lines together as one command
par(mfrow = c(1,3))
plot(hc_complete, main = 'Complete Linkage')
plot(hc_single, main = 'Single Linkage')
plot(hc_average, main = 'Average Linkage')

#K-means clustering

model <- kmeans(data, centers = 2)
#print(model$cluster)
data_K_clustered_2 <- mutate(data, cluster = as.factor(model$cluster))
#head(data_K_clustered)

plot_K_clustered_2 <- ggplot(data_K_clustered_2, aes(x = scores_research, scores_citations, color = cluster)) + 
  geom_point() +
  labs(title = "K-means clustering", subtitle = "2 clusters")+
  theme_bw()
plot_K_clustered_2

model <- kmeans(data, centers = 3)
#print(model$cluster)
data_K_clustered_3 <- mutate(data, cluster = as.factor(model$cluster))
#head(data_K_clustered)

plot_K_clustered_3 <- ggplot(data_K_clustered_3, aes(x = scores_research, scores_citations, color = cluster)) + 
  geom_point() +
  labs(title = "K-means clustering", subtitle = "3 clusters")+
  theme_bw()
plot_K_clustered_3

##Elbow plot check

As for the graph below it is better to use 2 centers (k=2).

# Use map_dbl to run many models with varying value of k (centers)
tot_withinss <- map_dbl(1:10,  function(k){
  model <- kmeans(x = data, centers = k)
  model$tot.withinss
})

# Generate a data frame containing both k and tot_withinss
elbow_df <- data.frame(
  k = 1:10,
  tot_withinss = tot_withinss
)

# Plot the elbow plot
ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
  geom_line() +
  scale_x_continuous(breaks = 1:10) +
  theme_bw()

##Silhoueete analysis check

Then, I’ve tries silhoueete analysis. Average silhoueete width for 2 clusters is 0.452. That is not bad (as it is more than 0), but could be better (closer to 1). As for the 3 clusters, average silhoueete width there is less, 0.359, so 3 clusters are worse than 2.

library(cluster)
pam_k2 <- pam(data, k = 2)
#pam_k2$silinfo$widths
sil_plot <- silhouette(pam_k2)
#plot(sil_plot)

pam_k3 <- pam(data, k = 3)
#pam_k3$silinfo$widths
sil_plot <- silhouette(pam_k3)
#plot(sil_plot)

pam_k2$silinfo$avg.width #0.4519057 - not bad, but colud be better
## [1] 0.4519057
pam_k3$silinfo$avg.width #0.3592351 - worse, than with 2 cluster
## [1] 0.3592351

Conclusion

To sum up I have tried two types of clustering (hierarchical and k-means) as well as have checked by elbow plot and silhoueete analysis the number of clusters to leave. Descpite the fact that in general this cluster analysisis not perfect, we still can leave two clusters, not more. If we take a look at the graph of hierarchical clusterization with 2 clusters we may see some visible division between clusters. Unfortunatelly, clusters are still a bit mixed up on the border, but in general the picture is not so bad.

Aslo I would say that hierarchical clusterization works better for this set than K-means clustering. Again, I conclude it by the plotted results (in K-means clusterization the border is very blurred).

ggplot(data_clustered_2, aes(x = scores_research, scores_citations, color = cluster)) + 
  geom_point() +
  labs(title = "Hierarchical clustering", subtitle = "2 clusters")+
  theme_bw()

plot_K_clustered_2

To sum up I may say that universities have divided on two groups: with lower citation and research scores and with high citation scores. Below you may see the full picture with all combinations of my three variables.

ggplot(data_clustered_2, aes(x = scores_research, scores_citations, color = cluster)) + 
  geom_point() +
  labs(title = "Hierarchical clustering", subtitle = "2 clusters")+
  theme_bw()

ggplot(data_clustered_2, aes(x = scores_research, scores_international_outlook, color = cluster)) + 
  geom_point() +
  labs(title = "Hierarchical clustering", subtitle = "2 clusters")+
  theme_bw()

ggplot(data_clustered_2, aes(x = scores_citations, scores_international_outlook, color = cluster)) + 
  geom_point() +
  labs(title = "Hierarchical clustering", subtitle = "2 clusters")+
  theme_bw()