The main purpose of this project is to cluster universities for a company that targets university applicants from all over the world. The goal is to identify groups of similar universities so that each applicant can be matched with the group that suits them best.
Using the variables of your choice, explore the possible clustering solutions and describe the differences between the obtained clusters of universities.
The data are retrieved from www.timeshighereducation.com, which ranks the best universities in the world and publishes information about them. The 2021 edition of the ranking is used.
# Remove scientific notations
options(scipen = 999)
library(tidyverse)
library(stringr)
library(purrr)
library(V8)
library(jsonlite)
library(psych)
library(kableExtra)
####################
# Data Preparation #
####################
# Extract the page with 2021 ratings as json file
college_json2021 <- fromJSON(paste0(
'https://www.timeshighereducation.com/sites/default/files/the_data_rankings/',
'world_university_rankings_2021_0__fa224219a267a5b9c4287386a97c70ea.json'))
# Turn the 'data' list from college_json2021 to your dataset
college_2021 <- college_json2021$data
# A new variable, `stats_female_share`: the share of female students,
# taken as the number at the start of `stats_female_male_ratio`
college_2021 <- college_2021 %>%
mutate(stats_female_share = as.numeric(str_match(stats_female_male_ratio, "^\\d{1,2}")))
# Trim the "%" sign after the number
# and save the result to the old variable, `stats_pc_intl_students`
college_2021 <- college_2021 %>%
mutate(stats_pc_intl_students = as.numeric(str_replace(stats_pc_intl_students, "%", "")))
# Get rid of the thousands-separator commas in the student counts.
# Save the result to the old variable, `stats_number_students`.
college_2021 <- college_2021 %>%
mutate(stats_number_students = str_replace_all(stats_number_students, ",", ""))
# Record whether an application link is present as a logical variable, `ref_link`:
college_2021 <- college_2021 %>%
mutate(ref_link = if_else(is.na(apply_link), FALSE, TRUE))
# Change types of variables
df <- college_2021
tofa <- c("rank",
"scores_overall",
"scores_overall_rank",
"scores_teaching_rank",
"scores_research_rank",
"scores_citations_rank",
"scores_industry_income_rank",
"scores_international_outlook_rank",
"record_type",
"member_level",
"location",
"nid")
for (i in tofa){
df[, i] <- as.factor(df[, i])
}
# take five variables on actual `scores`
# and four variables on various `stats`
# and make them numeric with a `for` loop
tonu <- c("scores_teaching",
"scores_research",
"scores_citations",
"scores_industry_income",
"scores_international_outlook",
"stats_number_students",
"stats_student_staff_ratio",
"stats_pc_intl_students",
"stats_female_share")
for (i in tonu){
df[ , i] <- as.numeric(df[ , i])
}
rm(college_2021, college_json2021, i, tofa, tonu)
Data Exploration
What types of variables do we have?
sapply(df, class) %>% kable()
| variable | class |
|---|---|
| rank_order | character |
| rank | factor |
| name | character |
| scores_overall | factor |
| scores_overall_rank | factor |
| scores_teaching | numeric |
| scores_teaching_rank | factor |
| scores_research | numeric |
| scores_research_rank | factor |
| scores_citations | numeric |
| scores_citations_rank | factor |
| scores_industry_income | numeric |
| scores_industry_income_rank | factor |
| scores_international_outlook | numeric |
| scores_international_outlook_rank | factor |
| record_type | factor |
| member_level | factor |
| url | character |
| nid | factor |
| location | factor |
| stats_number_students | numeric |
| stats_student_staff_ratio | numeric |
| stats_pc_intl_students | numeric |
| stats_female_male_ratio | character |
| aliases | character |
| subjects_offered | character |
| closed | logical |
| apply_link | character |
| stats_female_share | numeric |
| ref_link | logical |
library(DataExplorer)
plot_bar(df)
plot_histogram(df)
How many missings do we have?
plot_missing(df)
We can simply drop the apply_link variable, since more than half of its values are missing, but we cannot do the same for the variable describing the female share at universities. Applying na.omit is not an option either, because it would remove whole universities from the data frame. So let's impute the missing values of the numeric variables with their column means.
df <- df %>% select(-apply_link)
# Impute missing values in numeric columns with the column mean
for (i in which(sapply(df, is.numeric))) {
df[is.na(df[, i]), i] <- mean(df[, i], na.rm = TRUE)
}
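To confirm that the imputation worked, here is a quick check of the remaining missing values (a small sketch):
# NAs per column; the numeric variables used later should now show zero
sapply(df, function(x) sum(is.na(x)))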
We can also drop closed (it contains a single repeated value), url (it is of little analytical use), and aliases (the university names are enough for us, and keeping alternative names in languages such as Chinese or Korean would only risk encoding problems).
As we are mostly interested in numeric variables, we can also count how many subjects each university offers within broader disciplines. We group the subjects into five categories: Humanities, Applied Sciences, Social Sciences, Natural Sciences, and Formal Sciences (the grouping is loosely based on the Wikipedia classification of the branches of science).
data <- df %>% select(-url, -closed, -aliases)
# Let's see how many unique values of subjects we have, as well as how we can group them
subjects <- sub("^.*,\\s*", "", data$subjects_offered) %>% as.data.frame() %>% unique()
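Note that the sub() call above keeps only the last subject listed for each university before taking unique values; a sketch that enumerates every distinct subject across the full comma-separated lists would be:
# All distinct subjects across the full subject lists (sketch)
all_subjects <- data$subjects_offered %>% str_split(",") %>% unlist() %>% str_trim() %>% unique()
length(all_subjects)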
subjects$discipline[subjects$. == "Literature & Linguistics"] <- "Humanities"
subjects$discipline[subjects$. == "Accounting & Finance"] <- "Social Sciences"
subjects$discipline[subjects$. == "Education"] <- "Humanities"
subjects$discipline[subjects$. == "Politics & International Studies (incl Development Studies)"] <- "Social Sciences"
subjects$discipline[subjects$. == "Communication & Media Studies"] <- "Humanities"
subjects$discipline[subjects$. == "Medicine & Dentistry"] <- "Applied Sciences"
subjects$discipline[subjects$. == "Computer Science"] <- "Formal Sciences"
subjects$discipline[subjects$. == "Mathematics & Statistics"] <- "Formal Sciences"
subjects$discipline[subjects$. == "Economics & Econometrics"] <- "Social Sciences"
subjects$discipline[subjects$. == "Chemical Engineering"] <- "Applied Sciences"
subjects$discipline[subjects$. == "Agriculture & Forestry"] <- "Natural Sciences"
subjects$discipline[subjects$. == "Geography"] <- "Natural Sciences"
subjects$discipline[subjects$. == "Business & Management"] <- "Applied Sciences"
subjects$discipline[subjects$. == "Earth & Marine Sciences"] <- "Natural Sciences"
subjects$discipline[subjects$. == "Sociology"] <- "Social Sciences"
subjects$discipline[subjects$. == "Psychology"] <- "Social Sciences"
subjects$discipline[subjects$. == "Mechanical & Aerospace Engineering"] <- "Applied Sciences"
subjects$discipline[subjects$. == "Philosophy & Theology"] <- "Humanities"
subjects$discipline[subjects$. == "Biological Sciences"] <- "Natural Sciences"
subjects$discipline[subjects$. == "Physics & Astronomy"] <- "Natural Sciences"
subjects$discipline[subjects$. == "Archaeology"] <- "Social Sciences"
subjects$discipline[subjects$. == "General Engineering"] <- "Applied Sciences"
subjects$discipline[subjects$. == "Veterinary Science"] <- "Applied Sciences"
subjects$discipline[subjects$. == "Civil Engineering"] <- "Applied Sciences"
subjects$discipline[subjects$. == "Performing Arts & Design"] <- "Humanities"
subjects$discipline[subjects$. == "Chemistry"] <- "Humanities"
subjects$discipline[subjects$. == "Law"] <- "Humanities"
subjects$discipline[subjects$. == "Other Health"] <- "Applied Sciences"
subjects$discipline[subjects$. == "Electrical & Electronic Engineering"] <- "Applied Sciences"
subjects$discipline[subjects$. == "Sport Science"] <- "Applied Sciences"
subjects$discipline[subjects$. == "Architecture"] <- "Humanities"
# Create new columns counting, for each discipline, how many of its subjects a university offers
subj <- data %>% select(nid, subjects_offered)
# Strip spaces so the detection patterns below can be written without them
subj$subjects_offered <- gsub(" ","",subj$subjects_offered)
## APPLIED SCIENCES
subj$Applied_Sciences <- ifelse(str_detect(subj$subjects_offered, "Medicine&Dentistry") == TRUE, 1, 0)
subj$Applied_Sciences <- ifelse(str_detect(subj$subjects_offered, "ChemicalEngineering") == TRUE,
subj$Applied_Sciences + 1, subj$Applied_Sciences)
subj$Applied_Sciences <- ifelse(str_detect(subj$subjects_offered, "Business&Management") == TRUE,
subj$Applied_Sciences + 1, subj$Applied_Sciences)
subj$Applied_Sciences <- ifelse(str_detect(subj$subjects_offered, "Mechanical&AerospaceEngineering") == TRUE,
subj$Applied_Sciences + 1, subj$Applied_Sciences)
subj$Applied_Sciences <- ifelse(str_detect(subj$subjects_offered, "GeneralEngineering") == TRUE,
subj$Applied_Sciences + 1, subj$Applied_Sciences)
subj$Applied_Sciences <- ifelse(str_detect(subj$subjects_offered, "VeterinaryScience") == TRUE,
subj$Applied_Sciences + 1, subj$Applied_Sciences)
subj$Applied_Sciences <- ifelse(str_detect(subj$subjects_offered, "CivilEngineering") == TRUE,
subj$Applied_Sciences + 1, subj$Applied_Sciences)
subj$Applied_Sciences <- ifelse(str_detect(subj$subjects_offered, "OtherHealth") == TRUE,
subj$Applied_Sciences + 1, subj$Applied_Sciences)
subj$Applied_Sciences <- ifelse(str_detect(subj$subjects_offered, "Electrical&ElectronicEngineering"), subj$Applied_Sciences + 1, subj$Applied_Sciences)
subj$Applied_Sciences <- ifelse(str_detect(subj$subjects_offered, "SportScience") == TRUE,
subj$Applied_Sciences + 1, subj$Applied_Sciences)
## FORMAL SCIENCES
subj$Formal_Sciences <- ifelse(str_detect(subj$subjects_offered, "ComputerScience") == TRUE, 1, 0)
subj$Formal_Sciences <- ifelse(str_detect(subj$subjects_offered, "Mathematics&Statistics") == TRUE,
subj$Formal_Sciences + 1, subj$Formal_Sciences)
## HUMANITIES
subj$Humanities <- ifelse(str_detect(subj$subjects_offered, "Literature&Linguistics") == TRUE, 1, 0)
subj$Humanities <- ifelse(str_detect(subj$subjects_offered, "Education") == TRUE,
subj$Humanities + 1, subj$Humanities)
subj$Humanities <- ifelse(str_detect(subj$subjects_offered, "Communication&MediaStudies") == TRUE,
subj$Humanities + 1, subj$Humanities)
subj$Humanities <- ifelse(str_detect(subj$subjects_offered, "Philosophy&Theology") == TRUE,
subj$Humanities + 1, subj$Humanities)
subj$Humanities <- ifelse(str_detect(subj$subjects_offered, "PerformingArts&Design") == TRUE,
subj$Humanities + 1, subj$Humanities)
subj$Humanities <- ifelse(str_detect(subj$subjects_offered, "Chemistry") == TRUE,
subj$Humanities + 1, subj$Humanities)
subj$Humanities <- ifelse(str_detect(subj$subjects_offered, "Law") == TRUE,
subj$Humanities + 1, subj$Humanities)
subj$Humanities <- ifelse(str_detect(subj$subjects_offered, "Architecture") == TRUE,
subj$Humanities + 1, subj$Humanities)
## NATURAL SCIENCES
subj$Natural_Sciences <- ifelse(str_detect(subj$subjects_offered, "Agriculture&Forestry") == TRUE, 1, 0)
subj$Natural_Sciences <- ifelse(str_detect(subj$subjects_offered, "Geography") == TRUE,
subj$Natural_Sciences + 1, subj$Natural_Sciences)
subj$Natural_Sciences <- ifelse(str_detect(subj$subjects_offered, "Earth&MarineSciences") == TRUE,
subj$Natural_Sciences + 1, subj$Natural_Sciences)
subj$Natural_Sciences <- ifelse(str_detect(subj$subjects_offered, "BiologicalSciences") == TRUE,
subj$Natural_Sciences + 1, subj$Natural_Sciences)
subj$Natural_Sciences <- ifelse(str_detect(subj$subjects_offered, "Physics&Astronomy") == TRUE,
subj$Natural_Sciences + 1, subj$Natural_Sciences)
## SOCIAL SCIENCES
# Note: spaces were stripped from subjects_offered above, so the patterns must be
# space-free as well; fixed() treats the parenthesised name literally.
subj$Social_Sciences <- ifelse(str_detect(subj$subjects_offered, "Accounting&Finance") == TRUE, 1, 0)
subj$Social_Sciences <- ifelse(str_detect(subj$subjects_offered, fixed("Politics&InternationalStudies(inclDevelopmentStudies)")) == TRUE, subj$Social_Sciences + 1, subj$Social_Sciences)
subj$Social_Sciences <- ifelse(str_detect(subj$subjects_offered, "Economics&Econometrics") == TRUE, subj$Social_Sciences + 1, subj$Social_Sciences)
subj$Social_Sciences <- ifelse(str_detect(subj$subjects_offered, "Sociology") == TRUE, subj$Social_Sciences + 1, subj$Social_Sciences)
subj$Social_Sciences <- ifelse(str_detect(subj$subjects_offered, "Psychology") == TRUE, subj$Social_Sciences + 1, subj$Social_Sciences)
subj$Social_Sciences <- ifelse(str_detect(subj$subjects_offered, "Archaeology") == TRUE, subj$Social_Sciences + 1, subj$Social_Sciences)
# Join to get a full dataset
subj <- subj %>% select(-subjects_offered)
data_new <- merge(data, subj, by = "nid")
As a result, we can now, for example, see how many disciplines universities in different countries offer:
data_new2 <- data_new %>% select(location,
Applied_Sciences, Social_Sciences, Humanities,
Formal_Sciences, Natural_Sciences)
data_new2$n_disc <- rowSums(data_new2[, 2:6])
dt_df <- data_new2 %>% group_by(location) %>% summarise(num_universities = n(),
median_num_disc = median(n_disc)) %>% arrange(location)
library(DT)
datatable(dt_df)
In addition, sorting the table above shows that the three countries with the most ranked universities are the United States, Japan, and the United Kingdom.
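The same ranking can be checked without the interactive filters (a quick sketch):
# Countries with the most ranked universities
dt_df %>% arrange(desc(num_universities)) %>% head(3)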
K-means-style clustering requires numeric variables, because distances between observations are computed with the Euclidean metric. The same variables will also be used for hierarchical clustering, so that the two solutions can be compared on equal terms. Conveniently, most of the variables in the prepared dataset are numeric.
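To see why the variables also need to be scaled before Euclidean distances are computed, here is a minimal toy sketch with made-up numbers (not taken from the data):
# Without scaling, the large-range student-count variable dominates the distance;
# after scaling, both variables contribute comparably.
toy <- data.frame(scores_teaching = c(40, 60), stats_number_students = c(20000, 25000))
dist(toy)        # ~5000, driven almost entirely by the student numbers
dist(scale(toy)) # exactly 2 here, both variables on comparable scales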
Let's select the variables, set the university index (nid) as the row index, and scale them. In the selection we focus on the scores and count-type characteristics rather than the ranks, which are ordinal.
cl_data <- data_new %>% select(nid, scores_teaching, scores_research, scores_citations,
scores_industry_income, scores_international_outlook,
stats_number_students, stats_student_staff_ratio,
stats_pc_intl_students, stats_female_share,
Applied_Sciences, Formal_Sciences, Humanities,
Natural_Sciences, Social_Sciences)
rownames(cl_data) <- cl_data$nid
cl_data <- cl_data %>% select(-nid)
cl_data_sc <- scale(cl_data)
psych::describe(cl_data_sc)
## vars n mean sd median trimmed mad min max
## scores_teaching 1 1526 0 1 -0.31 -0.18 0.66 -1.18 4.89
## scores_research 2 1526 0 1 -0.35 -0.19 0.63 -0.95 4.39
## scores_citations 3 1526 0 1 -0.11 -0.04 1.26 -1.67 1.88
## scores_industry_income 4 1526 0 1 -0.44 -0.22 0.39 -0.75 3.14
## scores_international_outlook 5 1526 0 1 -0.20 -0.08 1.08 -1.45 2.29
## stats_number_students 6 1526 0 1 -0.21 -0.14 0.47 -0.82 17.73
## stats_student_staff_ratio 7 1526 0 1 -0.19 -0.12 0.48 -1.29 23.18
## stats_pc_intl_students 8 1526 0 1 -0.36 -0.17 0.74 -0.94 6.04
## stats_female_share 9 1526 0 1 0.17 0.07 0.75 -4.14 4.06
## Applied_Sciences 10 1526 0 1 0.20 0.07 1.16 -2.55 1.37
## Formal_Sciences 11 1526 0 1 0.89 0.13 0.00 -2.28 0.89
## Humanities 12 1526 0 1 0.53 0.15 0.62 -2.38 0.94
## Natural_Sciences 13 1526 0 1 0.41 0.12 0.98 -2.22 1.07
## Social_Sciences 14 1526 0 1 0.17 0.07 1.35 -1.65 1.08
## range skew kurtosis se
## scores_teaching 6.07 2.02 4.98 0.03
## scores_research 5.34 1.86 3.72 0.03
## scores_citations 3.55 0.24 -1.16 0.03
## scores_industry_income 3.89 1.76 2.19 0.03
## scores_international_outlook 3.73 0.59 -0.73 0.03
## stats_number_students 18.54 8.31 111.39 0.03
## stats_student_staff_ratio 24.48 10.15 202.34 0.03
## stats_pc_intl_students 6.99 1.81 4.44 0.03
## stats_female_share 8.19 -0.74 0.94 0.03
## Applied_Sciences 3.92 -0.50 -0.67 0.03
## Formal_Sciences 3.16 -0.68 -0.53 0.03
## Humanities 3.32 -1.04 -0.10 0.03
## Natural_Sciences 3.29 -0.77 -0.40 0.03
## Social_Sciences 2.73 -0.48 -1.10 0.03
for k-means:
library(factoextra)
library(cluster)
fviz_nbclust(cl_data_sc, kmeans, method = "silhouette")
Based on the silhouette plot, let's take 4 as the number of clusters for the k-means-style algorithm.
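As a cross-check on this choice, the elbow (total within-cluster sum of squares) criterion can be inspected as well (a sketch):
# Elbow plot as a second opinion on the number of clusters
fviz_nbclust(cl_data_sc, kmeans, method = "wss")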
for hierarchical clustering:
fviz_nbclust(cl_data_sc, hcut, method = "silhouette")
The silhouette criterion suggests two clusters here. However, two groups would give a rather coarse segmentation, and the dendrogram below shows a fairly clear three-way split, so we cut the hierarchical tree into three clusters.
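This trade-off can also be checked numerically by comparing average silhouette widths for a few cuts of the Ward tree built below (a sketch; the values are not shown here):
# Average silhouette width for 2-4 hierarchical clusters
d <- dist(cl_data_sc)
ward_tree <- hclust(d, method = "ward.D2")
sapply(2:4, function(k) mean(silhouette(cutree(ward_tree, k), d)[, "sil_width"]))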
Now let's compute both clusterings and visualise them.
k-means-style partitioning (implemented here with PAM, i.e. k-medoids from the cluster package, a robust relative of k-means):
k_cluster <- pam(cl_data_sc, k = 4, metric = "euclidean")
fviz_cluster(k_cluster, data = cl_data_sc,
palette = c("#2E9FDF", "#66cc66", "#E7B800", "#d69bca"),
geom = "point",
ellipse.type = "convex",
ggtheme = theme_bw()
)
The most striking feature is that the second cluster overlaps both the first and the third. The cluster summary below shows that the first cluster is the largest (the fourth is roughly half its size), while the second and third clusters are of similar size, lying between the first and the fourth.
k_cluster$clusinfo
## size max_diss av_diss diameter separation
## [1,] 520 14.641621 2.450922 16.27788 0.6665128
## [2,] 364 29.707240 2.877499 31.18026 0.6665128
## [3,] 360 6.995456 3.069297 10.27398 1.1234652
## [4,] 282 7.089818 3.390478 10.50479 0.8681102
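The per-observation silhouette profile of this partition can also be visualised (a sketch):
# Silhouette widths of the PAM solution
fviz_silhouette(k_cluster)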
hierarchical clustering:
Here we perform hierarchical clustering with Ward's method, which minimises the increase in within-cluster variance at each merge.
ward_doc <- hclust(d= dist(cl_data_sc), method="ward.D2")
plot(ward_doc, xlab=NA, sub=NA, main = "Ward's Method")
wards_hc_clusters <- cutree(tree = ward_doc, k=3)
rect.hclust(tree=ward_doc, k = 3, border = 2:4)
The black band at the bottom of the dendrogram is just the mass of university ids (there are too many leaves to display legibly), so it is not informative on its own. What matters is the upper part of the tree: reading the clades from top to bottom, we can see a clear division into three groups.
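The sizes of the three groups can be checked directly (a sketch):
# Number of universities per hierarchical cluster
table(wards_hc_clusters)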
We can also retrieve the cluster assignments from both methods and merge them back into the dataset.
cl_data <- data_new %>% select(nid, scores_teaching, scores_research, scores_citations,
scores_industry_income, scores_international_outlook,
stats_number_students, stats_student_staff_ratio,
stats_pc_intl_students, stats_female_share,
Applied_Sciences, Formal_Sciences, Humanities,
Natural_Sciences, Social_Sciences)
k_cluster <- as.data.frame(k_cluster$clustering)
# The row names carry the university ids (nid), so use them as the merge key
k_cluster <- tibble::rownames_to_column(k_cluster, "nid")
k_cluster <- k_cluster %>% rename("k_cluster" = "k_cluster$clustering")
cl_data <- merge(cl_data, k_cluster, by = "nid")
hc_cluster <- wards_hc_clusters %>% as.data.frame()
# Again, the row names are the nid values
hc_cluster <- tibble::rownames_to_column(hc_cluster, "nid")
hc_cluster <- hc_cluster %>% rename("hc_cluster" = ".")
cl_data <- merge(cl_data, hc_cluster, by = "nid")
cl_data$hc_cluster <- as.factor(cl_data$hc_cluster)
cl_data$k_cluster <- as.factor(cl_data$k_cluster)
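Before plotting, a quick cross-tabulation (a sketch) shows how the two solutions correspond:
# How the k-means/PAM clusters map onto the hierarchical clusters
table(k_means = cl_data$k_cluster, hierarchical = cl_data$hc_cluster)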
library(ggpubr)
library(GGally)
data_new_k <- cl_data %>% select(-hc_cluster)
ggparcoord(data_new_k, columns = c("Formal_Sciences", "Applied_Sciences",
"Humanities", "Social_Sciences", "Natural_Sciences"),
groupColumn = "k_cluster", scale = "globalminmax", order = "skewness") +
labs(x = "", title = "Parallel Coordinate Plot") + theme_linedraw() +
theme(plot.title = element_text(size=14, hjust = 0.5, face="bold")) + coord_flip()
data_new_hc <- cl_data %>% select(-k_cluster)
ggparcoord(data_new_hc, columns = c("Formal_Sciences", "Applied_Sciences",
"Humanities", "Social_Sciences", "Natural_Sciences"),
groupColumn = "hc_cluster", scale = "globalminmax", order = "skewness") +
labs(x = "", title = "Parallel Coordinate Plot") + theme_linedraw() +
theme(plot.title = element_text(size = 14, hjust = 0.5, face = "bold")) + coord_flip()
Comparing the two plots in terms of the disciplines offered, the third k-means cluster appears essentially identical to the second hierarchical cluster.
Let's look at some of the stats and scores:
stats_number_students
library(ggpubr)
hc1 <- ggplot(data = cl_data, aes(x = stats_number_students, y = k_cluster)) +
geom_boxplot(aes(color = k_cluster)) +
scale_color_manual(values = c("#2E9FDF", "#66cc66", "#E7B800", "#d69bca")) + theme_bw()
hc2 <- ggplot(data = cl_data, aes(x = stats_number_students, y = hc_cluster)) +
geom_boxplot(aes(color = hc_cluster)) +
scale_color_manual(values = c("#2E9FDF", "#66cc66", "#E7B800")) + theme_bw()
ggarrange(hc1, hc2,
labels = c("K-means", "HC"),
ncol = 2, nrow = 1, legend = "bottom")
As for the number of students, the extra cluster produced by k-means (the fourth one) hardly adds anything: the spreads of the boxes and the medians look visually similar across clusters.
scores_international_outlook
hc1 <- ggplot(data = cl_data, aes(x = scores_international_outlook, y = k_cluster)) +
geom_boxplot(aes(color = k_cluster)) +
scale_color_manual(values = c("#2E9FDF", "#66cc66", "#E7B800", "#d69bca")) + theme_bw()
hc2 <- ggplot(data = cl_data, aes(x = scores_international_outlook, y = hc_cluster)) +
geom_boxplot(aes(color = hc_cluster)) +
scale_color_manual(values = c("#2E9FDF", "#66cc66", "#E7B800")) + theme_bw()
ggarrange(hc1, hc2,
labels = c("K-means", "HC"),
ncol = 2, nrow = 1, legend = "bottom")
Here, however, the fourth cluster is more distinctive: both its median and its spread differ from the other clusters.
scores_citations
hc1 <- ggplot(data = cl_data, aes(x = scores_citations, y = k_cluster)) +
geom_boxplot(aes(color = k_cluster)) +
scale_color_manual(values = c("#2E9FDF", "#66cc66", "#E7B800", "#d69bca")) + theme_bw()
hc2 <- ggplot(data = cl_data, aes(x = scores_citations, y = hc_cluster)) +
geom_boxplot(aes(color = hc_cluster)) +
scale_color_manual(values = c("#2E9FDF", "#66cc66", "#E7B800")) + theme_bw()
ggarrange(hc1, hc2,
labels = c("K-means", "HC"),
ncol = 2, nrow = 1, legend = "bottom")
For the citation scores, the third k-means cluster can be seen as roughly a subset of the second hierarchical cluster, while the fourth k-means cluster resembles the third hierarchical cluster.
stats_female_share
hc1 <- ggplot(data = cl_data, aes(x = stats_female_share, y = k_cluster)) +
geom_boxplot(aes(color = k_cluster)) +
scale_color_manual(values = c("#2E9FDF", "#66cc66", "#E7B800", "#d69bca")) + theme_bw()
hc2 <- ggplot(data = cl_data, aes(x = stats_female_share, y = hc_cluster)) +
geom_boxplot(aes(color = hc_cluster)) +
scale_color_manual(values = c("#2E9FDF", "#66cc66", "#E7B800")) + theme_bw()
ggarrange(hc1, hc2,
labels = c("K-means", "HC"),
ncol = 2, nrow = 1, legend = "bottom")
As for the female share, each cluster contains quite a few outliers. Although the medians are similar, the distributions are not.
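The visual impression can be confirmed numerically (a sketch; values not shown here):
# Median and IQR of the female share per cluster, for both solutions
cl_data %>% group_by(k_cluster) %>%
summarise(median_share = median(stats_female_share), iqr_share = IQR(stats_female_share))
cl_data %>% group_by(hc_cluster) %>%
summarise(median_share = median(stats_female_share), iqr_share = IQR(stats_female_share))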
Overall, both methods produce reasonable results. Since applicants are better served by a diverse but interpretable set of categories, and the data do not separate clearly enough to support four well-distinguished groups, the three-cluster hierarchical solution is used to label the clusters below.
psych::describeBy(data_new_hc, group = "hc_cluster")
##
## Descriptive statistics by group
## group: 1
## vars n mean sd median trimmed
## nid* 1 252 315.69 146.18 337.00 319.34
## scores_teaching 2 252 34.27 11.07 32.10 33.30
## scores_research 3 252 33.47 14.44 31.00 32.36
## scores_citations 4 252 61.73 22.90 65.60 62.52
## scores_industry_income 5 252 53.22 20.46 43.60 50.25
## scores_international_outlook 6 252 58.90 22.04 56.35 58.67
## stats_number_students 7 252 24976.60 23025.37 20110.00 21414.10
## stats_student_staff_ratio 8 252 18.52 10.54 16.40 16.80
## stats_pc_intl_students 9 252 15.98 11.29 13.00 14.77
## stats_female_share 10 252 48.95 11.35 51.00 50.00
## Applied_Sciences 11 252 6.86 2.49 7.00 7.07
## Formal_Sciences 12 252 1.50 0.59 2.00 1.56
## Humanities 13 252 6.19 2.07 7.00 6.57
## Natural_Sciences 14 252 3.78 1.26 4.00 3.98
## Social_Sciences 15 252 2.10 1.02 2.00 2.24
## hc_cluster* 16 252 1.00 0.00 1.00 1.00
## mad min max range skew kurtosis
## nid* 193.48 27.0 540.0 513.0 -0.19 -1.33
## scores_teaching 10.67 13.8 72.4 58.6 0.77 0.09
## scores_research 13.71 10.0 73.8 63.8 0.68 -0.06
## scores_citations 27.87 13.2 99.3 86.1 -0.26 -1.09
## scores_industry_income 12.53 33.3 100.0 66.7 0.98 -0.36
## scores_international_outlook 27.21 15.8 99.3 83.5 0.14 -1.17
## stats_number_students 13548.74 557.0 222102.0 221545.0 4.29 27.73
## stats_student_staff_ratio 6.52 2.6 70.2 67.6 2.00 5.11
## stats_pc_intl_students 10.38 1.0 81.0 80.0 1.37 3.65
## stats_female_share 8.90 1.0 83.0 82.0 -0.98 1.51
## Applied_Sciences 2.97 0.0 10.0 10.0 -0.67 -0.41
## Formal_Sciences 0.00 0.0 2.0 2.0 -0.72 -0.48
## Humanities 1.48 0.0 8.0 8.0 -1.34 1.05
## Natural_Sciences 1.48 0.0 5.0 5.0 -1.15 0.92
## Social_Sciences 1.48 0.0 3.0 3.0 -0.86 -0.45
## hc_cluster* 0.00 1.0 1.0 0.0 NaN NaN
## se
## nid* 9.21
## scores_teaching 0.70
## scores_research 0.91
## scores_citations 1.44
## scores_industry_income 1.29
## scores_international_outlook 1.39
## stats_number_students 1450.46
## stats_student_staff_ratio 0.66
## stats_pc_intl_students 0.71
## stats_female_share 0.71
## Applied_Sciences 0.16
## Formal_Sciences 0.04
## Humanities 0.13
## Natural_Sciences 0.08
## Social_Sciences 0.06
## hc_cluster* 0.00
## ------------------------------------------------------------
## group: 2
## vars n mean sd median trimmed
## nid* 1 179 289.85 137.21 280.0 290.32
## scores_teaching 2 179 34.47 13.15 31.4 32.67
## scores_research 3 179 35.05 14.69 31.8 33.32
## scores_citations 4 179 64.15 21.89 65.1 65.54
## scores_industry_income 5 179 52.07 18.14 46.2 49.15
## scores_international_outlook 6 179 61.70 22.22 60.1 61.98
## stats_number_students 7 179 23514.92 13109.45 22236.0 22370.91
## stats_student_staff_ratio 8 179 19.79 10.69 16.8 18.28
## stats_pc_intl_students 9 179 17.31 11.88 15.0 15.90
## stats_female_share 10 179 51.03 9.44 53.0 52.04
## Applied_Sciences 11 179 6.95 2.45 7.0 7.16
## Formal_Sciences 12 179 1.56 0.56 2.0 1.61
## Humanities 13 179 6.56 1.71 7.0 6.88
## Natural_Sciences 14 179 3.87 1.22 4.0 4.06
## Social_Sciences 15 179 2.23 0.92 2.0 2.37
## hc_cluster* 16 179 2.00 0.00 2.0 2.00
## mad min max range skew kurtosis
## nid* 166.05 3.0 538.0 535.0 -0.03 -0.95
## scores_teaching 11.27 16.9 91.3 74.4 1.53 3.03
## scores_research 12.90 11.4 99.6 88.2 1.30 2.32
## scores_citations 23.13 10.3 99.9 89.6 -0.49 -0.54
## scores_industry_income 12.90 33.4 100.0 66.6 1.26 0.64
## scores_international_outlook 28.02 16.0 99.8 83.8 -0.03 -1.10
## stats_number_students 12731.09 2355.0 77496.0 75141.0 0.90 1.13
## stats_student_staff_ratio 6.97 3.4 71.5 68.1 1.93 4.98
## stats_pc_intl_students 10.38 1.0 65.0 64.0 1.16 1.33
## stats_female_share 7.41 20.0 71.0 51.0 -1.00 0.79
## Applied_Sciences 2.97 1.0 10.0 9.0 -0.62 -0.62
## Formal_Sciences 0.00 0.0 2.0 2.0 -0.80 -0.42
## Humanities 1.48 0.0 8.0 8.0 -1.61 2.31
## Natural_Sciences 1.48 0.0 5.0 5.0 -1.31 1.44
## Social_Sciences 1.48 0.0 3.0 3.0 -0.99 0.00
## hc_cluster* 0.00 2.0 2.0 0.0 NaN NaN
## se
## nid* 10.26
## scores_teaching 0.98
## scores_research 1.10
## scores_citations 1.64
## scores_industry_income 1.36
## scores_international_outlook 1.66
## stats_number_students 979.85
## stats_student_staff_ratio 0.80
## stats_pc_intl_students 0.89
## stats_female_share 0.71
## Applied_Sciences 0.18
## Formal_Sciences 0.04
## Humanities 0.13
## Natural_Sciences 0.09
## Social_Sciences 0.07
## hc_cluster* 0.00
## ------------------------------------------------------------
## group: 3
## vars n mean sd median trimmed
## nid* 1 109 134.24 129.14 65.0 117.60
## scores_teaching 2 109 52.69 21.65 50.1 52.11
## scores_research 3 109 56.75 25.16 56.6 56.80
## scores_citations 4 109 80.31 18.80 87.3 83.25
## scores_industry_income 5 109 58.32 21.61 50.0 56.32
## scores_international_outlook 6 109 68.26 20.55 68.9 68.95
## stats_number_students 7 109 27198.30 20630.95 22981.0 24935.18
## stats_student_staff_ratio 8 109 16.32 9.17 13.7 15.14
## stats_pc_intl_students 9 109 22.31 13.67 20.0 20.92
## stats_female_share 10 109 50.88 8.32 52.0 51.69
## Applied_Sciences 11 109 7.36 2.15 8.0 7.48
## Formal_Sciences 12 109 1.54 0.54 2.0 1.57
## Humanities 13 109 6.61 1.78 7.0 6.96
## Natural_Sciences 14 109 3.97 1.05 4.0 4.11
## Social_Sciences 15 109 2.42 0.87 3.0 2.60
## hc_cluster* 16 109 3.00 0.00 3.0 3.00
## mad min max range skew kurtosis
## nid* 81.54 1.0 537.0 536.0 0.98 0.06
## scores_teaching 25.20 15.4 94.4 79.0 0.24 -1.07
## scores_research 32.32 8.4 99.2 90.8 0.03 -1.24
## scores_citations 13.64 11.1 99.9 88.8 -1.38 1.52
## scores_industry_income 18.98 33.4 100.0 66.6 0.71 -0.88
## scores_international_outlook 27.13 18.1 98.5 80.4 -0.22 -1.03
## stats_number_students 13404.19 3034.0 187170.0 184136.0 4.42 30.75
## stats_student_staff_ratio 5.93 4.3 65.5 61.2 2.02 6.57
## stats_pc_intl_students 11.86 1.0 72.0 71.0 1.05 1.14
## stats_female_share 4.45 24.0 68.0 44.0 -1.00 1.27
## Applied_Sciences 1.48 1.0 10.0 9.0 -0.56 -0.65
## Formal_Sciences 0.00 0.0 2.0 2.0 -0.52 -1.01
## Humanities 1.48 1.0 8.0 7.0 -1.60 1.99
## Natural_Sciences 1.48 1.0 5.0 4.0 -0.85 0.08
## Social_Sciences 0.00 0.0 3.0 3.0 -1.49 1.36
## hc_cluster* 0.00 3.0 3.0 0.0 NaN NaN
## se
## nid* 12.37
## scores_teaching 2.07
## scores_research 2.41
## scores_citations 1.80
## scores_industry_income 2.07
## scores_international_outlook 1.97
## stats_number_students 1976.09
## stats_student_staff_ratio 0.88
## stats_pc_intl_students 1.31
## stats_female_share 0.80
## Applied_Sciences 0.21
## Formal_Sciences 0.05
## Humanities 0.17
## Natural_Sciences 0.10
## Social_Sciences 0.08
## hc_cluster* 0.00
To label the clusters, let's look at their distinctive features:
data_new_hc1 <- data_new_hc %>% filter(hc_cluster == "1")
names1 <- data %>% select(name, nid)
uni1 <- inner_join(names1, data_new_hc1, by = "nid")
uni1$sum_scores <- rowSums(uni1[, 3:7])
datatable(uni1)
data_new_hc2 <- data_new_hc %>% filter(hc_cluster == "2")
names2 <- data %>% select(name, nid)
uni2 <- inner_join(names2, data_new_hc2, by = "nid")
uni2$sum_scores <- rowSums(uni2[, 3:7])
datatable(uni2)
data_new_hc3 <- data_new_hc %>% filter(hc_cluster == "3")
names3 <- data %>% select(name, nid)
uni3 <- inner_join(names3, data_new_hc3, by = "nid")
uni3$sum_scores <- rowSums(uni3[, 3:7])
datatable(uni3)
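As a starting point for labelling, the three groups can be compared on the combined score (a sketch; the values themselves can be read off the interactive tables above):
# Average combined score per group
c(group1 = mean(uni1$sum_scores), group2 = mean(uni2$sum_scores), group3 = mean(uni3$sum_scores))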
Overall, choosing the best of the best is a difficult task. The cluster analysis could be improved by adding variables that describe not only academic scores but also accessibility of the location, facilities, research equipment, labs, and so on.