The main purpose of this project is to cluster universities for a company that targets university applicants from all over the world. The goal is to identify groups of similar universities so that each applicant can be matched with the group that suits them best.
Using the variables of your choice, explore the possible clustering solutions and describe the differences between the obtained clusters of universities.
The data are retrieved from www.timeshighereducation.com, which ranks the best universities in the world and publishes information about them. The 2021 edition of the ranking is used.
# Remove scientific notations
options(scipen = 999)
library(tidyverse)
library(stringr)
library(purrr)
library(V8)
library(jsonlite)
library(psych)
library(kableExtra)
####################
# Data Preparation #
####################
# Extract the page with 2021 ratings as json file
college_json2021 <- fromJSON(paste0(
'https://www.timeshighereducation.com/sites/default/files/the_data_rankings/',
'world_university_rankings_2021_0__fa224219a267a5b9c4287386a97c70ea.json'))
# Turn the 'data' list from college_json2021 to your dataset
college_2021 <- college_json2021$data
# A new variable, `stats_female_share`: the share of female students,
# taken as the number at the start of `stats_female_male_ratio`
college_2021 <- college_2021 %>%
mutate(stats_female_share = as.numeric(str_match(stats_female_male_ratio, "^\\d{1,2}")))
# Trim the "%" sign after the number
# and save the result to the old variable, `stats_pc_intl_students`
college_2021 <- college_2021 %>%
mutate(stats_pc_intl_students = as.numeric(str_replace(stats_pc_intl_students, "%", "")))
# Get rid of the thousands-separator commas in the student counts.
# Save the result to the old variable, `stats_number_students`.
college_2021 <- college_2021 %>%
mutate(stats_number_students = str_replace_all(stats_number_students, ",", ""))
# Record whether an application link is present as a logical variable, `ref_link`:
college_2021 <- college_2021 %>%
mutate(ref_link = if_else(is.na(apply_link), FALSE, TRUE))
# Change types of variables
df <- college_2021
tofa <- c("rank",
"scores_overall",
"scores_overall_rank",
"scores_teaching_rank",
"scores_research_rank",
"scores_citations_rank",
"scores_industry_income_rank",
"scores_international_outlook_rank",
"record_type",
"member_level",
"location",
"nid")
for (i in tofa){
df[, i] <- as.factor(df[, i])
}
# take five variables on actual `scores`
# and four variables on various `stats`
# and make them numeric with a `for` loop
tonu <- c("scores_teaching",
"scores_research",
"scores_citations",
"scores_industry_income",
"scores_international_outlook",
"stats_number_students",
"stats_student_staff_ratio",
"stats_pc_intl_students",
"stats_female_share")
for (i in tonu){
df[ , i] <- as.numeric(df[ , i])
}
rm(college_2021, college_json2021, i, tofa, tonu)
Data Exploration
What types of variables do we have?
sapply(df, class) %>% kable()
| variable | class |
|---|---|
| rank_order | character |
| rank | factor |
| name | character |
| scores_overall | factor |
| scores_overall_rank | factor |
| scores_teaching | numeric |
| scores_teaching_rank | factor |
| scores_research | numeric |
| scores_research_rank | factor |
| scores_citations | numeric |
| scores_citations_rank | factor |
| scores_industry_income | numeric |
| scores_industry_income_rank | factor |
| scores_international_outlook | numeric |
| scores_international_outlook_rank | factor |
| record_type | factor |
| member_level | factor |
| url | character |
| nid | factor |
| location | factor |
| stats_number_students | numeric |
| stats_student_staff_ratio | numeric |
| stats_pc_intl_students | numeric |
| stats_female_male_ratio | character |
| aliases | character |
| subjects_offered | character |
| closed | logical |
| apply_link | character |
| stats_female_share | numeric |
| ref_link | logical |
library(DataExplorer)
plot_bar(df)
plot_histogram(df)
How many missings do we have?
plot_missing(df)
We can simply drop the apply_link variable, since more than half of its values are missing, but we cannot do the same for the variable describing the female share at universities. Applying na.omit is not an option either, because it would remove whole universities from the data frame. So let's impute the missing values of the numeric variables with their column means.
df <- df %>% select(-apply_link)
# Impute missing values in numeric columns with the column mean
for (i in which(sapply(df, is.numeric))) {
df[is.na(df[, i]), i] <- mean(df[, i], na.rm = TRUE)
}
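To confirm that the imputation worked, here is a quick check of the remaining missing values (a small sketch):
# NAs per column; the numeric variables used later should now show zero
sapply(df, function(x) sum(is.na(x)))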
We can also drop closed (it contains a single repeated value), url (it is of little analytical use), and aliases (the university names are enough for us, and keeping alternative names in languages such as Chinese or Korean would only risk encoding problems).
As we are mostly interested in numeric variables, we can also count how many subjects each university offers within broader disciplines. We group the subjects into five categories: Humanities, Applied Sciences, Social Sciences, Natural Sciences, and Formal Sciences (the grouping is loosely based on the Wikipedia classification of the branches of science).
data <- df %>% select(-url, -closed, -aliases)
# Let's see how many unique values of subjects we have, as well as how we can group them
subjects <- sub("^.*,\\s*", "", data$subjects_offered) %>% as.data.frame() %>% unique()
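Note that the sub() call above keeps only the last subject listed for each university before taking unique values; a sketch that enumerates every distinct subject across the full comma-separated lists would be:
# All distinct subjects across the full subject lists (sketch)
all_subjects <- data$subjects_offered %>% str_split(",") %>% unlist() %>% str_trim() %>% unique()
length(all_subjects)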
subjects$discipline[subjects$. == "Literature & Linguistics"] <- "Humanities"
subjects$discipline[subjects$. == "Accounting & Finance"] <- "Social Sciences"
subjects$discipline[subjects$. == "Education"] <- "Humanities"
subjects$discipline[subjects$. == "Politics & International Studies (incl Development Studies)"] <- "Social Sciences"
subjects$discipline[subjects$. == "Communication & Media Studies"] <- "Humanities"
subjects$discipline[subjects$. == "Medicine & Dentistry"] <- "Applied Sciences"
subjects$discipline[subjects$. == "Computer Science"] <- "Formal Sciences"
subjects$discipline[subjects$. == "Mathematics & Statistics"] <- "Formal Sciences"
subjects$discipline[subjects$. == "Economics & Econometrics"] <- "Social Sciences"
subjects$discipline[subjects$. == "Chemical Engineering"] <- "Applied Sciences"
subjects$discipline[subjects$. == "Agriculture & Forestry"] <- "Natural Sciences"
subjects$discipline[subjects$. == "Geography"] <- "Natural Sciences"
subjects$discipline[subjects$. == "Business & Management"] <- "Applied Sciences"
subjects$discipline[subjects$. == "Earth & Marine Sciences"] <- "Natural Sciences"
subjects$discipline[subjects$. == "Sociology"] <- "Social Sciences"
subjects$discipline[subjects$. == "Psychology"] <- "Social Sciences"
subjects$discipline[subjects$. == "Mechanical & Aerospace Engineering"] <- "Applied Sciences"
subjects$discipline[subjects$. == "Philosophy & Theology"] <- "Humanities"
subjects$discipline[subjects$. == "Biological Sciences"] <- "Natural Sciences"
subjects$discipline[subjects$. == "Physics & Astronomy"] <- "Natural Sciences"
subjects$discipline[subjects$. == "Archaeology"] <- "Social Sciences"
subjects$discipline[subjects$. == "General Engineering"] <- "Applied Sciences"
subjects$discipline[subjects$. == "Veterinary Science"] <- "Applied Sciences"
subjects$discipline[subjects$. == "Civil Engineering"] <- "Applied Sciences"
subjects$discipline[subjects$. == "Performing Arts & Design"] <- "Humanities"
subjects$discipline[subjects$. == "Chemistry"] <- "Humanities"
subjects$discipline[subjects$. == "Law"] <- "Humanities"
subjects$discipline[subjects$. == "Other Health"] <- "Applied Sciences"
subjects$discipline[subjects$. == "Electrical & Electronic Engineering"] <- "Applied Sciences"
subjects$discipline[subjects$. == "Sport Science"] <- "Applied Sciences"
subjects$discipline[subjects$. == "Architecture"] <- "Humanities"
# Create new columns counting, for each discipline, how many of its subjects a university offers
subj <- data %>% select(nid, subjects_offered)
# Strip spaces so the detection patterns below can be written without them
subj$subjects_offered <- gsub(" ","",subj$subjects_offered)
## APPLIED SCIENCES
subj$Applied_Sciences <- ifelse(str_detect(subj$subjects_offered, "Medicine&Dentistry") == TRUE, 1, 0)
subj$Applied_Sciences <- ifelse(str_detect(subj$subjects_offered, "ChemicalEngineering") == TRUE,
subj$Applied_Sciences + 1, subj$Applied_Sciences)
subj$Applied_Sciences <- ifelse(str_detect(subj$subjects_offered, "Business&Management") == TRUE,
subj$Applied_Sciences + 1, subj$Applied_Sciences)
subj$Applied_Sciences <- ifelse(str_detect(subj$subjects_offered, "Mechanical&AerospaceEngineering") == TRUE,
subj$Applied_Sciences + 1, subj$Applied_Sciences)
subj$Applied_Sciences <- ifelse(str_detect(subj$subjects_offered, "GeneralEngineering") == TRUE,
subj$Applied_Sciences + 1, subj$Applied_Sciences)
subj$Applied_Sciences <- ifelse(str_detect(subj$subjects_offered, "VeterinaryScience") == TRUE,
subj$Applied_Sciences + 1, subj$Applied_Sciences)
subj$Applied_Sciences <- ifelse(str_detect(subj$subjects_offered, "CivilEngineering") == TRUE,
subj$Applied_Sciences + 1, subj$Applied_Sciences)
subj$Applied_Sciences <- ifelse(str_detect(subj$subjects_offered, "OtherHealth") == TRUE,
subj$Applied_Sciences + 1, subj$Applied_Sciences)
subj$Applied_Sciences <- ifelse(str_detect(subj$subjects_offered, "Electrical&ElectronicEngineering"), subj$Applied_Sciences + 1, subj$Applied_Sciences)
subj$Applied_Sciences <- ifelse(str_detect(subj$subjects_offered, "SportScience") == TRUE,
subj$Applied_Sciences + 1, subj$Applied_Sciences)
## FORMAL SCIENCES
subj$Formal_Sciences <- ifelse(str_detect(subj$subjects_offered, "ComputerScience") == TRUE, 1, 0)
subj$Formal_Sciences <- ifelse(str_detect(subj$subjects_offered, "Mathematics&Statistics") == TRUE,
subj$Formal_Sciences + 1, subj$Formal_Sciences)
## HUMANITIES
subj$Humanities <- ifelse(str_detect(subj$subjects_offered, "Literature&Linguistics") == TRUE, 1, 0)
subj$Humanities <- ifelse(str_detect(subj$subjects_offered, "Education") == TRUE,
subj$Humanities + 1, subj$Humanities)
subj$Humanities <- ifelse(str_detect(subj$subjects_offered, "Communication&MediaStudies") == TRUE,
subj$Humanities + 1, subj$Humanities)
subj$Humanities <- ifelse(str_detect(subj$subjects_offered, "Philosophy&Theology") == TRUE,
subj$Humanities + 1, subj$Humanities)
subj$Humanities <- ifelse(str_detect(subj$subjects_offered, "PerformingArts&Design") == TRUE,
subj$Humanities + 1, subj$Humanities)
subj$Humanities <- ifelse(str_detect(subj$subjects_offered, "Chemistry") == TRUE,
subj$Humanities + 1, subj$Humanities)
subj$Humanities <- ifelse(str_detect(subj$subjects_offered, "Law") == TRUE,
subj$Humanities + 1, subj$Humanities)
subj$Humanities <- ifelse(str_detect(subj$subjects_offered, "Architecture") == TRUE,
subj$Humanities + 1, subj$Humanities)
## NATURAL SCIENCES
subj$Natural_Sciences <- ifelse(str_detect(subj$subjects_offered, "Agriculture&Forestry") == TRUE, 1, 0)
subj$Natural_Sciences <- ifelse(str_detect(subj$subjects_offered, "Geography") == TRUE,
subj$Natural_Sciences + 1, subj$Natural_Sciences)
subj$Natural_Sciences <- ifelse(str_detect(subj$subjects_offered, "Earth&MarineSciences") == TRUE,
subj$Natural_Sciences + 1, subj$Natural_Sciences)
subj$Natural_Sciences <- ifelse(str_detect(subj$subjects_offered, "BiologicalSciences") == TRUE,
subj$Natural_Sciences + 1, subj$Natural_Sciences)
subj$Natural_Sciences <- ifelse(str_detect(subj$subjects_offered, "Physics&Astronomy") == TRUE,
subj$Natural_Sciences + 1, subj$Natural_Sciences)
## SOCIAL SCIENCES
# Note: spaces were stripped from subjects_offered above, so the patterns must be
# space-free as well; fixed() treats the parenthesised name literally.
subj$Social_Sciences <- ifelse(str_detect(subj$subjects_offered, "Accounting&Finance") == TRUE, 1, 0)
subj$Social_Sciences <- ifelse(str_detect(subj$subjects_offered, fixed("Politics&InternationalStudies(inclDevelopmentStudies)")) == TRUE, subj$Social_Sciences + 1, subj$Social_Sciences)
subj$Social_Sciences <- ifelse(str_detect(subj$subjects_offered, "Economics&Econometrics") == TRUE, subj$Social_Sciences + 1, subj$Social_Sciences)
subj$Social_Sciences <- ifelse(str_detect(subj$subjects_offered, "Sociology") == TRUE, subj$Social_Sciences + 1, subj$Social_Sciences)
subj$Social_Sciences <- ifelse(str_detect(subj$subjects_offered, "Psychology") == TRUE, subj$Social_Sciences + 1, subj$Social_Sciences)
subj$Social_Sciences <- ifelse(str_detect(subj$subjects_offered, "Archaeology") == TRUE, subj$Social_Sciences + 1, subj$Social_Sciences)
# Join to get a full dataset
subj <- subj %>% select(-subjects_offered)
data_new <- merge(data, subj, by = "nid")
As a result, we can now, for example, see how many disciplines universities in different countries offer:
data_new2 <- data_new %>% select(location,
Applied_Sciences, Social_Sciences, Humanities,
Formal_Sciences, Natural_Sciences)
data_new2$n_disc <- rowSums(data_new2[, 2:6])
dt_df <- data_new2 %>% group_by(location) %>% summarise(num_universities = n(),
median_num_disc = median(n_disc)) %>% arrange(location)
library(DT)
datatable(dt_df)
In addition, sorting the table above shows that the three countries with the most ranked universities are the United States, Japan, and the United Kingdom.
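The same ranking can be checked without the interactive filters (a quick sketch):
# Countries with the most ranked universities
dt_df %>% arrange(desc(num_universities)) %>% head(3)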
K-means-style clustering requires numeric variables, because distances between observations are computed with the Euclidean metric. The same variables will also be used for hierarchical clustering, so that the two solutions can be compared on equal terms. Conveniently, most of the variables in the prepared dataset are numeric.
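To see why the variables also need to be scaled before Euclidean distances are computed, here is a minimal toy sketch with made-up numbers (not taken from the data):
# Without scaling, the large-range student-count variable dominates the distance;
# after scaling, both variables contribute comparably.
toy <- data.frame(scores_teaching = c(40, 60), stats_number_students = c(20000, 25000))
dist(toy)        # ~5000, driven almost entirely by the student numbers
dist(scale(toy)) # exactly 2 here, both variables on comparable scales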
Let's select the variables, set the university index (nid) as the row index, and scale them. In the selection we focus on the scores and count-type characteristics rather than the ranks, which are ordinal.
cl_data <- data_new %>% select(nid, scores_teaching, scores_research, scores_citations,
scores_industry_income, scores_international_outlook,
stats_number_students, stats_student_staff_ratio,
stats_pc_intl_students, stats_female_share,
Applied_Sciences, Formal_Sciences, Humanities,
Natural_Sciences, Social_Sciences)
rownames(cl_data) <- cl_data$nid
cl_data <- cl_data %>% select(-nid)
cl_data_sc <- scale(cl_data)
psych::describe(cl_data_sc)
## vars n mean sd median trimmed mad min max
## scores_teaching 1 1526 0 1 -0.31 -0.18 0.66 -1.18 4.89
## scores_research 2 1526 0 1 -0.35 -0.19 0.63 -0.95 4.39
## scores_citations 3 1526 0 1 -0.11 -0.04 1.26 -1.67 1.88
## scores_industry_income 4 1526 0 1 -0.44 -0.22 0.39 -0.75 3.14
## scores_international_outlook 5 1526 0 1 -0.20 -0.08 1.08 -1.45 2.29
## stats_number_students 6 1526 0 1 -0.21 -0.14 0.47 -0.82 17.73
## stats_student_staff_ratio 7 1526 0 1 -0.19 -0.12 0.48 -1.29 23.18
## stats_pc_intl_students 8 1526 0 1 -0.36 -0.17 0.74 -0.94 6.04
## stats_female_share 9 1526 0 1 0.17 0.07 0.75 -4.14 4.06
## Applied_Sciences 10 1526 0 1 0.20 0.07 1.16 -2.55 1.37
## Formal_Sciences 11 1526 0 1 0.89 0.13 0.00 -2.28 0.89
## Humanities 12 1526 0 1 0.53 0.15 0.62 -2.38 0.94
## Natural_Sciences 13 1526 0 1 0.41 0.12 0.98 -2.22 1.07
## Social_Sciences 14 1526 0 1 0.17 0.07 1.35 -1.65 1.08
## range skew kurtosis se
## scores_teaching 6.07 2.02 4.98 0.03
## scores_research 5.34 1.86 3.72 0.03
## scores_citations 3.55 0.24 -1.16 0.03
## scores_industry_income 3.89 1.76 2.19 0.03
## scores_international_outlook 3.73 0.59 -0.73 0.03
## stats_number_students 18.54 8.31 111.39 0.03
## stats_student_staff_ratio 24.48 10.15 202.34 0.03
## stats_pc_intl_students 6.99 1.81 4.44 0.03
## stats_female_share 8.19 -0.74 0.94 0.03
## Applied_Sciences 3.92 -0.50 -0.67 0.03
## Formal_Sciences 3.16 -0.68 -0.53 0.03
## Humanities 3.32 -1.04 -0.10 0.03
## Natural_Sciences 3.29 -0.77 -0.40 0.03
## Social_Sciences 2.73 -0.48 -1.10 0.03
for k-means:
library(factoextra)
library(cluster)
fviz_nbclust(cl_data_sc, kmeans, method = "silhouette")
Based on the silhouette plot, let's take 4 as the number of clusters for the k-means-style algorithm.
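As a cross-check on this choice, the elbow (total within-cluster sum of squares) criterion can be inspected as well (a sketch):
# Elbow plot as a second opinion on the number of clusters
fviz_nbclust(cl_data_sc, kmeans, method = "wss")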
for hierarchical clustering:
fviz_nbclust(cl_data_sc, hcut, method = "silhouette")
The silhouette criterion suggests two clusters here. However, two groups would give a rather coarse segmentation, and the dendrogram below shows a fairly clear three-way split, so we cut the hierarchical tree into three clusters.
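This trade-off can also be checked numerically by comparing average silhouette widths for a few cuts of the Ward tree built below (a sketch; the values are not shown here):
# Average silhouette width for 2-4 hierarchical clusters
d <- dist(cl_data_sc)
ward_tree <- hclust(d, method = "ward.D2")
sapply(2:4, function(k) mean(silhouette(cutree(ward_tree, k), d)[, "sil_width"]))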
Now let's compute both clusterings and visualise them.
k-means-style partitioning (implemented here with PAM, i.e. k-medoids from the cluster package, a robust relative of k-means):
k_cluster <- pam(cl_data_sc, k = 4, metric = "euclidean")
fviz_cluster(k_cluster, data = cl_data_sc,
palette = c("#2E9FDF", "#66cc66", "#E7B800", "#d69bca"),
geom = "point",
ellipse.type = "convex",
ggtheme = theme_bw()
)
The most striking feature is that the second cluster overlaps both the first and the third. The cluster summary below shows that the first cluster is the largest (the fourth is roughly half its size), while the second and third clusters are of similar size, lying between the first and the fourth.
k_cluster$clusinfo
## size max_diss av_diss diameter separation
## [1,] 520 14.641621 2.450922 16.27788 0.6665128
## [2,] 364 29.707240 2.877499 31.18026 0.6665128
## [3,] 360 6.995456 3.069297 10.27398 1.1234652
## [4,] 282 7.089818 3.390478 10.50479 0.8681102
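The per-observation silhouette profile of this partition can also be visualised (a sketch):
# Silhouette widths of the PAM solution
fviz_silhouette(k_cluster)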
hierarchical clustering:
Here we perform hierarchical clustering with Ward's method, which minimises the increase in within-cluster variance at each merge.
ward_doc <- hclust(d= dist(cl_data_sc), method="ward.D2")
plot(ward_doc, xlab=NA, sub=NA, main = "Ward's Method")
wards_hc_clusters <- cutree(tree = ward_doc, k=3)
rect.hclust(tree=ward_doc, k = 3, border = 2:4)
The black band at the bottom of the dendrogram is just the mass of university ids (there are too many leaves to display legibly), so it is not informative on its own. What matters is the upper part of the tree: reading the clades from top to bottom, we can see a clear division into three groups.
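The sizes of the three groups can be checked directly (a sketch):
# Number of universities per hierarchical cluster
table(wards_hc_clusters)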
We can also retrieve the cluster assignments from both methods and merge them back into the dataset.
cl_data <- data_new %>% select(nid, scores_teaching, scores_research, scores_citations,
scores_industry_income, scores_international_outlook,
stats_number_students, stats_student_staff_ratio,
stats_pc_intl_students, stats_female_share,
Applied_Sciences, Formal_Sciences, Humanities,
Natural_Sciences, Social_Sciences)
k_cluster <- as.data.frame(k_cluster$clustering)
# The row names carry the university ids (nid), so use them as the merge key
k_cluster <- tibble::rownames_to_column(k_cluster, "nid")
k_cluster <- k_cluster %>% rename("k_cluster" = "k_cluster$clustering")
cl_data <- merge(cl_data, k_cluster, by = "nid")
hc_cluster <- wards_hc_clusters %>% as.data.frame()
# Again, the row names are the nid values
hc_cluster <- tibble::rownames_to_column(hc_cluster, "nid")
hc_cluster <- hc_cluster %>% rename("hc_cluster" = ".")
cl_data <- merge(cl_data, hc_cluster, by = "nid")
cl_data$hc_cluster <- as.factor(cl_data$hc_cluster)
cl_data$k_cluster <- as.factor(cl_data$k_cluster)
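Before plotting, a quick cross-tabulation (a sketch) shows how the two solutions correspond:
# How the k-means/PAM clusters map onto the hierarchical clusters
table(k_means = cl_data$k_cluster, hierarchical = cl_data$hc_cluster)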
library(ggpubr)
library(GGally)
data_new_k <- cl_data %>% select(-hc_cluster)
ggparcoord(data_new_k, columns = c("Formal_Sciences", "Applied_Sciences",
"Humanities", "Social_Sciences", "Natural_Sciences"),
groupColumn = "k_cluster", scale = "globalminmax", order = "skewness") +
labs(x = "", title = "Parallel Coordinate Plot") + theme_linedraw() +
theme(plot.title = element_text(size=14, hjust = 0.5, face="bold")) + coord_flip()
data_new_hc <- cl_data %>% select(-k_cluster)
ggparcoord(data_new_hc, columns = c("Formal_Sciences", "Applied_Sciences",
"Humanities", "Social_Sciences", "Natural_Sciences"),
groupColumn = "hc_cluster", scale = "globalminmax", order = "skewness") +
labs(x = "", title = "Parallel Coordinate Plot") + theme_linedraw() +
theme(plot.title = element_text(size = 14, hjust = 0.5, face = "bold")) + coord_flip()
Comparing the two plots in terms of the disciplines offered, the third k-means cluster appears essentially identical to the second hierarchical cluster.
Let's look at some of the stats and scores:
stats_number_students
library(ggpubr)
hc1 <- ggplot(data = cl_data, aes(x = stats_number_students, y = k_cluster)) +
geom_boxplot(aes(color = k_cluster)) +
scale_color_manual(values = c("#2E9FDF", "#66cc66", "#E7B800", "#d69bca")) + theme_bw()
hc2 <- ggplot(data = cl_data, aes(x = stats_number_students, y = hc_cluster)) +
geom_boxplot(aes(color = hc_cluster)) +
scale_color_manual(values = c("#2E9FDF", "#66cc66", "#E7B800")) + theme_bw()
ggarrange(hc1, hc2,
labels = c("K-means", "HC"),
ncol = 2, nrow = 1, legend = "bottom")
As for the number of students, the extra cluster produced by k-means (the fourth one) hardly adds anything: the spreads of the boxes and the medians look visually similar across clusters.
scores_international_outlook
hc1 <- ggplot(data = cl_data, aes(x = scores_international_outlook, y = k_cluster)) +
geom_boxplot(aes(color = k_cluster)) +
scale_color_manual(values = c("#2E9FDF", "#66cc66", "#E7B800", "#d69bca")) + theme_bw()
hc2 <- ggplot(data = cl_data, aes(x = scores_international_outlook, y = hc_cluster)) +
geom_boxplot(aes(color = hc_cluster)) +
scale_color_manual(values = c("#2E9FDF", "#66cc66", "#E7B800")) + theme_bw()
ggarrange(hc1, hc2,
labels = c("K-means", "HC"),
ncol = 2, nrow = 1, legend = "bottom")
Here, however, the fourth cluster is more distinctive: both its median and its spread differ from the other clusters.
scores_citations
hc1 <- ggplot(data = cl_data, aes(x = scores_citations, y = k_cluster)) +
geom_boxplot(aes(color = k_cluster)) +
scale_color_manual(values = c("#2E9FDF", "#66cc66", "#E7B800", "#d69bca")) + theme_bw()
hc2 <- ggplot(data = cl_data, aes(x = scores_citations, y = hc_cluster)) +
geom_boxplot(aes(color = hc_cluster)) +
scale_color_manual(values = c("#2E9FDF", "#66cc66", "#E7B800")) + theme_bw()
ggarrange(hc1, hc2,
labels = c("K-means", "HC"),
ncol = 2, nrow = 1, legend = "bottom")
For the citation scores, the third k-means cluster can be seen as roughly a subset of the second hierarchical cluster, while the fourth k-means cluster resembles the third hierarchical cluster.
stats_female_share
hc1 <- ggplot(data = cl_data, aes(x = stats_female_share, y = k_cluster)) +
geom_boxplot(aes(color = k_cluster)) +
scale_color_manual(values = c("#2E9FDF", "#66cc66", "#E7B800", "#d69bca")) + theme_bw()
hc2 <- ggplot(data = cl_data, aes(x = stats_female_share, y = hc_cluster)) +
geom_boxplot(aes(color = hc_cluster)) +
scale_color_manual(values = c("#2E9FDF", "#66cc66", "#E7B800")) + theme_bw()
ggarrange(hc1, hc2,
labels = c("K-means", "HC"),
ncol = 2, nrow = 1, legend = "bottom")
As for the female share, each cluster contains quite a few outliers. Although the medians are similar, the distributions are not.
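The visual impression can be confirmed numerically (a sketch; values not shown here):
# Median and IQR of the female share per cluster, for both solutions
cl_data %>% group_by(k_cluster) %>%
summarise(median_share = median(stats_female_share), iqr_share = IQR(stats_female_share))
cl_data %>% group_by(hc_cluster) %>%
summarise(median_share = median(stats_female_share), iqr_share = IQR(stats_female_share))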
Overall, both methods produce reasonable results. Since applicants are better served by a diverse but interpretable set of categories, and the data do not separate clearly enough to support four well-distinguished groups, the three-cluster hierarchical solution is used to label the clusters below.
psych::describeBy(data_new_hc, group = "hc_cluster")
##
## Descriptive statistics by group
## group: 1
## vars n mean sd median trimmed
## nid* 1 252 315.69 146.18 337.00 319.34
## scores_teaching 2 252 34.27 11.07 32.10 33.30
## scores_research 3 252 33.47 14.44 31.00 32.36
## scores_citations 4 252 61.73 22.90 65.60 62.52
## scores_industry_income 5 252 53.22 20.46 43.60 50.25
## scores_international_outlook 6 252 58.90 22.04 56.35 58.67
## stats_number_students 7 252 24976.60 23025.37 20110.00 21414.10
## stats_student_staff_ratio 8 252 18.52 10.54 16.40 16.80
## stats_pc_intl_students 9 252 15.98 11.29 13.00 14.77
## stats_female_share 10 252 48.95 11.35 51.00 50.00
## Applied_Sciences 11 252 6.86 2.49 7.00 7.07
## Formal_Sciences 12 252 1.50 0.59 2.00 1.56
## Humanities 13 252 6.19 2.07 7.00 6.57
## Natural_Sciences 14 252 3.78 1.26 4.00 3.98
## Social_Sciences 15 252 2.10 1.02 2.00 2.24
## hc_cluster* 16 252 1.00 0.00 1.00 1.00
## mad min max range skew kurtosis
## nid* 193.48 27.0 540.0 513.0 -0.19 -1.33
## scores_teaching 10.67 13.8 72.4 58.6 0.77 0.09
## scores_research 13.71 10.0 73.8 63.8 0.68 -0.06
## scores_citations 27.87 13.2 99.3 86.1 -0.26 -1.09
## scores_industry_income 12.53 33.3 100.0 66.7 0.98 -0.36
## scores_international_outlook 27.21 15.8 99.3 83.5 0.14 -1.17
## stats_number_students 13548.74 557.0 222102.0 221545.0 4.29 27.73
## stats_student_staff_ratio 6.52 2.6 70.2 67.6 2.00 5.11
## stats_pc_intl_students 10.38 1.0 81.0 80.0 1.37 3.65
## stats_female_share 8.90 1.0 83.0 82.0 -0.98 1.51
## Applied_Sciences 2.97 0.0 10.0 10.0 -0.67 -0.41
## Formal_Sciences 0.00 0.0 2.0 2.0 -0.72 -0.48
## Humanities 1.48 0.0 8.0 8.0 -1.34 1.05
## Natural_Sciences 1.48 0.0 5.0 5.0 -1.15 0.92
## Social_Sciences 1.48 0.0 3.0 3.0 -0.86 -0.45
## hc_cluster* 0.00 1.0 1.0 0.0 NaN NaN
## se
## nid* 9.21
## scores_teaching 0.70
## scores_research 0.91
## scores_citations 1.44
## scores_industry_income 1.29
## scores_international_outlook 1.39
## stats_number_students 1450.46
## stats_student_staff_ratio 0.66
## stats_pc_intl_students 0.71
## stats_female_share 0.71
## Applied_Sciences 0.16
## Formal_Sciences 0.04
## Humanities 0.13
## Natural_Sciences 0.08
## Social_Sciences 0.06
## hc_cluster* 0.00
## ------------------------------------------------------------
## group: 2
## vars n mean sd median trimmed
## nid* 1 179 289.85 137.21 280.0 290.32
## scores_teaching 2 179 34.47 13.15 31.4 32.67
## scores_research 3 179 35.05 14.69 31.8 33.32
## scores_citations 4 179 64.15 21.89 65.1 65.54
## scores_industry_income 5 179 52.07 18.14 46.2 49.15
## scores_international_outlook 6 179 61.70 22.22 60.1 61.98
## stats_number_students 7 179 23514.92 13109.45 22236.0 22370.91
## stats_student_staff_ratio 8 179 19.79 10.69 16.8 18.28
## stats_pc_intl_students 9 179 17.31 11.88 15.0 15.90
## stats_female_share 10 179 51.03 9.44 53.0 52.04
## Applied_Sciences 11 179 6.95 2.45 7.0 7.16
## Formal_Sciences 12 179 1.56 0.56 2.0 1.61
## Humanities 13 179 6.56 1.71 7.0 6.88
## Natural_Sciences 14 179 3.87 1.22 4.0 4.06
## Social_Sciences 15 179 2.23 0.92 2.0 2.37
## hc_cluster* 16 179 2.00 0.00 2.0 2.00
## mad min max range skew kurtosis
## nid* 166.05 3.0 538.0 535.0 -0.03 -0.95
## scores_teaching 11.27 16.9 91.3 74.4 1.53 3.03
## scores_research 12.90 11.4 99.6 88.2 1.30 2.32
## scores_citations 23.13 10.3 99.9 89.6 -0.49 -0.54
## scores_industry_income 12.90 33.4 100.0 66.6 1.26 0.64
## scores_international_outlook 28.02 16.0 99.8 83.8 -0.03 -1.10
## stats_number_students 12731.09 2355.0 77496.0 75141.0 0.90 1.13
## stats_student_staff_ratio 6.97 3.4 71.5 68.1 1.93 4.98
## stats_pc_intl_students 10.38 1.0 65.0 64.0 1.16 1.33
## stats_female_share 7.41 20.0 71.0 51.0 -1.00 0.79
## Applied_Sciences 2.97 1.0 10.0 9.0 -0.62 -0.62
## Formal_Sciences 0.00 0.0 2.0 2.0 -0.80 -0.42
## Humanities 1.48 0.0 8.0 8.0 -1.61 2.31
## Natural_Sciences 1.48 0.0 5.0 5.0 -1.31 1.44
## Social_Sciences 1.48 0.0 3.0 3.0 -0.99 0.00
## hc_cluster* 0.00 2.0 2.0 0.0 NaN NaN
## se
## nid* 10.26
## scores_teaching 0.98
## scores_research 1.10
## scores_citations 1.64
## scores_industry_income 1.36
## scores_international_outlook 1.66
## stats_number_students 979.85
## stats_student_staff_ratio 0.80
## stats_pc_intl_students 0.89
## stats_female_share 0.71
## Applied_Sciences 0.18
## Formal_Sciences 0.04
## Humanities 0.13
## Natural_Sciences 0.09
## Social_Sciences 0.07
## hc_cluster* 0.00
## ------------------------------------------------------------
## group: 3
## vars n mean sd median trimmed
## nid* 1 109 134.24 129.14 65.0 117.60
## scores_teaching 2 109 52.69 21.65 50.1 52.11
## scores_research 3 109 56.75 25.16 56.6 56.80
## scores_citations 4 109 80.31 18.80 87.3 83.25
## scores_industry_income 5 109 58.32 21.61 50.0 56.32
## scores_international_outlook 6 109 68.26 20.55 68.9 68.95
## stats_number_students 7 109 27198.30 20630.95 22981.0 24935.18
## stats_student_staff_ratio 8 109 16.32 9.17 13.7 15.14
## stats_pc_intl_students 9 109 22.31 13.67 20.0 20.92
## stats_female_share 10 109 50.88 8.32 52.0 51.69
## Applied_Sciences 11 109 7.36 2.15 8.0 7.48
## Formal_Sciences 12 109 1.54 0.54 2.0 1.57
## Humanities 13 109 6.61 1.78 7.0 6.96
## Natural_Sciences 14 109 3.97 1.05 4.0 4.11
## Social_Sciences 15 109 2.42 0.87 3.0 2.60
## hc_cluster* 16 109 3.00 0.00 3.0 3.00
## mad min max range skew kurtosis
## nid* 81.54 1.0 537.0 536.0 0.98 0.06
## scores_teaching 25.20 15.4 94.4 79.0 0.24 -1.07
## scores_research 32.32 8.4 99.2 90.8 0.03 -1.24
## scores_citations 13.64 11.1 99.9 88.8 -1.38 1.52
## scores_industry_income 18.98 33.4 100.0 66.6 0.71 -0.88
## scores_international_outlook 27.13 18.1 98.5 80.4 -0.22 -1.03
## stats_number_students 13404.19 3034.0 187170.0 184136.0 4.42 30.75
## stats_student_staff_ratio 5.93 4.3 65.5 61.2 2.02 6.57
## stats_pc_intl_students 11.86 1.0 72.0 71.0 1.05 1.14
## stats_female_share 4.45 24.0 68.0 44.0 -1.00 1.27
## Applied_Sciences 1.48 1.0 10.0 9.0 -0.56 -0.65
## Formal_Sciences 0.00 0.0 2.0 2.0 -0.52 -1.01
## Humanities 1.48 1.0 8.0 7.0 -1.60 1.99
## Natural_Sciences 1.48 1.0 5.0 4.0 -0.85 0.08
## Social_Sciences 0.00 0.0 3.0 3.0 -1.49 1.36
## hc_cluster* 0.00 3.0 3.0 0.0 NaN NaN
## se
## nid* 12.37
## scores_teaching 2.07
## scores_research 2.41
## scores_citations 1.80
## scores_industry_income 2.07
## scores_international_outlook 1.97
## stats_number_students 1976.09
## stats_student_staff_ratio 0.88
## stats_pc_intl_students 1.31
## stats_female_share 0.80
## Applied_Sciences 0.21
## Formal_Sciences 0.05
## Humanities 0.17
## Natural_Sciences 0.10
## Social_Sciences 0.08
## hc_cluster* 0.00
To label the clusters, let's look at their distinctive features:
data_new_hc1 <- data_new_hc %>% filter(hc_cluster == "1")
names1 <- data %>% select(name, nid)
uni1 <- inner_join(names1, data_new_hc1, by = "nid")
uni1$sum_scores <- rowSums(uni1[, 3:7])
datatable(uni1)
data_new_hc2 <- data_new_hc %>% filter(hc_cluster == "2")
names2 <- data %>% select(name, nid)
uni2 <- inner_join(names2, data_new_hc2, by = "nid")
uni2$sum_scores <- rowSums(uni2[, 3:7])
datatable(uni2)
data_new_hc3 <- data_new_hc %>% filter(hc_cluster == "3")
names3 <- data %>% select(name, nid)
uni3 <- inner_join(names3, data_new_hc3, by = "nid")
uni3$sum_scores <- rowSums(uni3[, 3:7])
datatable(uni3)
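As a starting point for labelling, the three groups can be compared on the combined score (a sketch; the values themselves can be read off the interactive tables above):
# Average combined score per group
c(group1 = mean(uni1$sum_scores), group2 = mean(uni2$sum_scores), group3 = mean(uni3$sum_scores))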
Overall, choosing the best of the best is a difficult task. The cluster analysis could be improved by adding variables that describe not only academic scores but also accessibility of the location, facilities, research equipment, labs, and so on.