The data used in this project is about World University Rankings 2021.
library(tidyverse)
library(stats)
library(pca3d)
library(GGally)
library(knitr)
library(kableExtra)
library(factoextra)
library(DataExplorer)
library(psych)
library(plotly)
df_uni <- read.csv2("D:/4th Course/THE2021 2.csv", encoding = "latin1", sep = ",")
# https://campus.datacamp.com/courses/dimensionality-reduction-in-r/principal-component-analysis-pca?ex=1
# https://campus.datacamp.com/courses/machine-learning-for-marketing-analytics-in-r/reducing-dimensionality-with-principal-component-analysis?ex=1
# Remove a unicode
df_uni$name[df_uni$name == "<U+200B>Shahid Chamran University of Ahvaz"] <- "Shahid Chamran University of Ahvaz"
# Change data types
df_uni$name <- as.factor(df_uni$name)
df_uni$location <- as.factor(df_uni$location)
df_uni <- df_uni %>% mutate_if(is.character, as.numeric)
What data dimension do we have?
dim(df_uni)
## [1] 1448 11
We have 1448 observations and 11 variables.
Next, let’s find out whether there are some missing value.
How many missing values are in the dataset?
plot_missing(df_uni)
The dataset is completed. We successfully proceed.
What are the distributions of numerical variables?
plot_density(df_uni[,3:11])
On density plots above, distribution of numeric variables presented in the dataset is depicted. As can be observed, there is no any normal type, and many distributions are right-skewed.
What are statistics of numerical variables?
kable(describe(df_uni[,3:11]), caption = "Summary Statistics") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F)
| vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| scores_teaching | 1 | 1448 | 27.53218 | 13.39072 | 23.40 | 25.166724 | 8.74734 | 11.7 | 94.4 | 82.7 | 2.0754314 | 5.3539491 | 0.3519003 |
| scores_research | 2 | 1448 | 23.13004 | 17.12087 | 17.15 | 19.875776 | 10.45233 | 7.1 | 99.6 | 92.5 | 1.9521958 | 4.1454328 | 0.4499266 |
| scores_citations | 3 | 1448 | 47.75808 | 27.68638 | 44.80 | 46.693793 | 34.98936 | 1.9 | 100.0 | 98.1 | 0.2576453 | -1.1544308 | 0.7275820 |
| scores_industry_income | 4 | 1448 | 45.52030 | 16.59352 | 38.40 | 41.818448 | 6.52344 | 33.3 | 100.0 | 66.7 | 1.8447516 | 2.5833140 | 0.4360680 |
| scores_international_outlook | 5 | 1448 | 46.96492 | 23.25387 | 42.25 | 45.056638 | 25.13007 | 13.5 | 100.0 | 86.5 | 0.5829290 | -0.7405898 | 0.6110981 |
| stats_number_students | 6 | 1448 | 22240.66160 | 20607.81583 | 17272.50 | 19137.444828 | 12851.91810 | 557.0 | 222102.0 | 221545.0 | 3.8146670 | 24.6923113 | 541.5613767 |
| stats_student_staff_ratio | 7 | 1448 | 18.51271 | 10.29089 | 16.20 | 17.111293 | 6.81996 | 0.9 | 109.1 | 108.2 | 2.3314541 | 10.1838094 | 0.2704385 |
| stats_pc_intl_students | 8 | 1448 | 11.26381 | 12.02288 | 7.00 | 9.219828 | 8.89560 | 0.0 | 84.0 | 84.0 | 1.8112797 | 4.4690133 | 0.3159542 |
| stats_female_share | 9 | 1448 | 49.93923 | 12.13517 | 53.00 | 50.834483 | 8.89560 | 1.0 | 98.0 | 97.0 | -0.7275846 | 0.7514047 | 0.3189052 |
According to the seen statistics, we can see that variables are more likely to be standardized. Let’s look at their variances.
lapply(df_uni[,3:11], var)
## $scores_teaching
## [1] 179.3113
##
## $scores_research
## [1] 293.1243
##
## $scores_citations
## [1] 766.5359
##
## $scores_industry_income
## [1] 275.3449
##
## $scores_international_outlook
## [1] 540.7424
##
## $stats_number_students
## [1] 424682073
##
## $stats_student_staff_ratio
## [1] 105.9024
##
## $stats_pc_intl_students
## [1] 144.5496
##
## $stats_female_share
## [1] 147.2624
Variables with high values of variances will be over-represented in the up-coming PCA. Besides that, differences in the measurable units provoke fake weights for variables. Standardization (mean is zero, variance is one) is the key to be on the right track regarding PCA.
Let’s try to obtain some interesting insights from the data. Just being curious.
In which countries is the biggest number of universities presented in the dataset?
loc_df <- df_uni %>% group_by(location) %>% summarise(quantity = n()) %>% arrange(desc(quantity))
# colSums(loc_df[11:92, 2])
loc_df <- loc_df %>% top_n(10)
loc_df1 <- data.frame(location = "Other",
quantity = 689)
loc_df <- rbind(loc_df, loc_df1)
rm(loc_df1)
ggplot() +
geom_bar(data = loc_df, aes(x = reorder(location, quantity), y = quantity),
width = 0.85, stat='identity',
fill = ifelse(loc_df$location == "Other", "#dbdbdb", "#779dc5"),
color = "#ababaf") +
coord_flip() + theme_light() +
geom_text(data = loc_df,
aes(x = reorder(location, quantity), y = quantity),
label = loc_df$quantity, color = "#5f646a", hjust = -0.4, size = 3) +
ylim(0, 700) +
labs(y = "Quantity", x = " ",
title = "Number of Universities in Countries") +
theme(plot.title = element_text(hjust = 0.5))
On the plot above, the leading countires according to the number of located universities are presented. Russia takes 10th place.
In which countries is the highest mean score (rating of universities) presented in the dataset?
loc_df <- df_uni
loc_df$mean_sc <- rowMeans(loc_df[,3:7], na.rm=TRUE)
loc_df <- loc_df %>% group_by(location) %>% summarise(max_score = sum(mean_sc)) %>%
arrange(desc(max_score)) %>% top_n(10)
ggplot() +
geom_bar(data = loc_df, aes(x = reorder(location, max_score), y = max_score),
width = 0.85, stat='identity',
fill = "#93cac4",
color = "#ababaf") +
coord_flip() + theme_light() +
geom_text(data = loc_df,
aes(x = reorder(location, max_score), y = max_score),
label = loc_df$max_score, color = "#5f646a", hjust = -0.4, size = 3) +
ylim(0, 9000) +
labs(y = "Quantity", x = " ",
title = "Top-10 Countries with the Maximum Mean Scores of Universities") +
theme(plot.title = element_text(hjust = 0.5))
Having concerned universities’ scores, we obtained another rating. As can be noticed, some countries have changed their position in comparison with the previous plot. For example, the US remains the leader, but the UK becomes the 2nd here while it has been put on the 3rd place in the rating based on the number of universities.
Is there multicollinearity issue?
To answer this question, a correlation matrix is computed.
ggcorr(df_uni[,3:11], palette = "RdBu", label = TRUE)
Scores are more likely to be correlated with each other, as well as stats. So, we can observe a multicollinearity.
Having the observed data, we can implement PCA mainly for the purpose of removing redundancy - there are some variables that can be logically grouped with each other as they refer similar ideas (e.g., scores can form index variable potentially stood for rating). In addition, performing PCA would allow elimination of multicollinearity that has been detected a few steps ago. Another reason is that there are some patterns that signalize about diversity of universities. For example, we can see that some universities are similar to each other, and vice versa. More precisely, looking on the plot below, it is noticed that universities in which proportion of female is from 25 to 75 have a tendency to get less than 50 scores for their research industrial value of which are more likely to be lower.
p0 <- ggplot(data = df_uni, aes(x = stats_female_share, y = scores_research, color = scores_industry_income)) +
geom_point() + theme_light() +
labs(title = "University Diversity: Score Comparison") +
theme(plot.title = element_text(hjust = 0.5))
ggplotly(p0)
Let’s conduct PCA using prcomp() function and set TRUE for scale parameter to standartize the variables. The variables in the class prcomp are put out.
pca_uni <- prcomp(df_uni[ , 3:11], scale = TRUE)
ls(pca_uni)
## [1] "center" "rotation" "scale" "sdev" "x"
Let’s look at summary firstly.
pca_summary <- summary(pca_uni)$importance %>% as.data.frame()
kable(pca_summary, caption = "PCA Summary") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F)
| PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | |
|---|---|---|---|---|---|---|---|---|---|
| Standard deviation | 1.86706 | 1.209701 | 1.126506 | 0.9478422 | 0.8161851 | 0.7478691 | 0.6609918 | 0.3863436 | 0.267674 |
| Proportion of Variance | 0.38732 | 0.162600 | 0.141000 | 0.0998200 | 0.0740200 | 0.0621500 | 0.0485500 | 0.0165800 | 0.007960 |
| Cumulative Proportion | 0.38732 | 0.549920 | 0.690920 | 0.7907500 | 0.8647600 | 0.9269100 | 0.9754500 | 0.9920400 | 1.000000 |
In case if we take 70% as a threshold for cumulative proportion, we are advised to use 3 as a number of principle components. Alternatively, we can rely on eigenvalues choosing such number. An eigenvalue smaller than 1 means that the component covers less variance than a single variables contributed. In other words, the more eigenvalue (variance) is, the more meaningful component is. We also can use scree plot to visualize finding the elbow.
pca_uni$sdev ^ 2 %>% round(2)
## [1] 3.49 1.46 1.27 0.90 0.67 0.56 0.44 0.15 0.07
screeplot(pca_uni, type = "lines", main = "Scree Plot - University Data")
box()
abline(h = 1, lty = 2)
It can be seen that we can use 3 principle components eigenvalues of which are 3.49, 1.46, 1.27, respectively (as for the 4th, it is 0.90).
A Pareto chart provides some insights regarding how much variance the PCs explain. The PC1 explains 38.7% of the variance, the PC2 - 16.3%, the PC3 - 14.1%. Overall,percentage of variance explained by PC1, PC2, and PC3 is 69.09.
PCA <- pca_uni$sdev^2
names(PCA) <- c("PC1", "PC2", "PC3", "PC4", "PC5", "PC6", "PC7", "PC8", "PC9")
qcc::pareto.chart(PCA)
##
## Pareto chart analysis for PCA
## Frequency Cum.Freq. Percentage Cum.Percent.
## PC1 3.48591462 3.48591462 38.73238472 38.73238472
## PC2 1.46337701 4.94929163 16.25974455 54.99212926
## PC3 1.26901618 6.21830781 14.10017976 69.09230903
## PC4 0.89840475 7.11671256 9.98227499 79.07458401
## PC5 0.66615817 7.78287073 7.40175744 86.47634145
## PC6 0.55930826 8.34217899 6.21453619 92.69087763
## PC7 0.43691022 8.77908921 4.85455799 97.54543563
## PC8 0.14926140 8.92835061 1.65846002 99.20389565
## PC9 0.07164939 9.00000000 0.79610435 100.00000000
As we defined 3 principle components, let’s create a 3D model to visualize obtained data patterns.
pca3d(pca_uni, components = 1:3, col = "#5f646a", biplot = T)
## [1] 0.14465857 0.09161782 0.15355832
## Creating new device
snapshotPCA3d(file="first_plot.png")
Further, let’s look at loadings. Being based on their values, we can understand whether 3 or 4 principle components can be labeled logically and meaningfully. Here, loadings stands for correlations between original variables and components. Likewise, in this way we can figure out whether it is possible to produce an acceptable PCA solution on these data or not.
pca_loadings <- pca_uni$rotation
kable(pca_loadings, caption = "PCA Loadings") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F)
| PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | |
|---|---|---|---|---|---|---|---|---|---|
| scores_teaching | -0.4381105 | 0.2271566 | -0.0926788 | 0.3502277 | -0.1028941 | 0.0719589 | -0.4682469 | -0.0880640 | -0.6184130 |
| scores_research | -0.4833842 | 0.1495963 | -0.1688547 | 0.1496110 | -0.1151585 | 0.0263800 | -0.3045944 | 0.2453111 | 0.7253638 |
| scores_citations | -0.3950244 | -0.1815466 | 0.0016034 | 0.1354757 | -0.2069581 | 0.6228418 | 0.5649204 | -0.2067331 | -0.0017450 |
| scores_industry_income | -0.2778028 | 0.4606120 | -0.2704135 | -0.1344482 | -0.1739303 | -0.5624707 | 0.5075345 | -0.0968172 | -0.0766320 |
| scores_international_outlook | -0.4215684 | -0.3092577 | 0.1821718 | -0.2802719 | 0.2460363 | -0.1077354 | 0.1298289 | 0.6808486 | -0.2496992 |
| stats_number_students | 0.0040629 | -0.2637754 | -0.6535686 | 0.3328828 | 0.6106956 | -0.0620470 | 0.1115204 | -0.0566147 | 0.0014930 |
| stats_student_staff_ratio | 0.0028557 | -0.2884075 | -0.5853605 | -0.5930806 | -0.3783155 | 0.1153138 | -0.2321805 | -0.0528838 | -0.0964213 |
| stats_pc_intl_students | -0.4026482 | -0.1907845 | 0.2784155 | -0.3440841 | 0.3590780 | -0.1768653 | -0.1669045 | -0.6357881 | 0.1151722 |
| stats_female_share | -0.0361937 | -0.6293671 | 0.0962917 | 0.3987798 | -0.4444382 | -0.4793594 | 0.0135438 | -0.0785610 | 0.0249735 |
PC1: there are 4 cases of moderate correlations - scores_teaching (-0.44), scores_research (-0.48), scores_citations (-0.4), scores_international_outlook (-0.42), stats_pc_intl_students (-0.4). Overall, this PC can be labeled as rating due to a reason that it mainly consists of scores.
PC2: scores_industry_income (0.46), stats_female_share (-0.63). Here we have a rather interesting case - if there are more women, the universities’ research are much less likely to be involved in commercial and industrial spheres. In other words, universities with a larger number of males are more business-oriented. This PC can be labeled as gender_gap .
PC3: stats_number_students (-0.65), stats_student_staff_ratio (-0.58). This PC reflects the size of the university, the number of people involved in its processes. The more students are, the more staff is employed. Let’s label this one as size meaning scope of students and university workforce.
pcs <- data.frame(pca_uni$x)
pcs <- pcs[1:3]
names(pcs) <- c("rating", "gender_gap", "size")
dataset <- cbind(df_uni, pcs)
Having performed PCA, it is also possible to provide answer the question - are the best universities located in the US and Canada?
dataset$country <- ifelse(dataset$loc == "United States" | dataset$loc == "Canada", 1, 0)
df_uni2 <- dataset[,3:11] # Remove the grouping variable
res.pca <- prcomp(df_uni2, scale = TRUE)
p <- fviz_pca(res.pca, col.ind = dataset$country,
palette = "jco", geom = "point",
title = "PCA: lightblue - US or Canada, darkblue - the others") +
theme(legend.position = "none", plot.title = element_text(hjust = 0.5))
ggplotly(p)
Before we defined that the PC1 that is also Dim1 stands for rating, the PC2 (or Dim2) is more about gender gap. Relying on the 2D visualization we can assume that it is so - universities with the highest scores (negative values for rating) are mostly located in US and Canada.
In order to escape any doubts about how score values and rating are related the plot presented below is created.
dataset2 <- dataset %>% filter(location == "United States" | location == "Canada")
dataset2$mean_sc <- rowMeans(dataset2[,c(3:5, 7,10)], na.rm=TRUE)
p2 <- ggplot() +
geom_point(data = dataset2, aes(x = rating, y = mean_sc), color = "steelblue") +
theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) +
theme_light() +
labs(xlab = "Rating", y = "Average Scores", title = "American & Canadian Universities: Mean Scores & Rating")
ggplotly(p2)
Let’s compute t-test to check whether values of rating distributed equally per countries (1 group - US & Canada, 2 group - the other countries).
dataset$country <- as.factor(dataset$country)
t.test(rating ~ country, data = dataset)
##
## Welch Two Sample t-test
##
## data: rating by country
## t = 9.5491, df = 253.12, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.132717 1.721327
## sample estimates:
## mean in group 0 mean in group 1
## 0.198088 -1.228934
The null hypothesis is rejected (p-value < 0.05, means are not equal) that is why we can state that universities from one group is better than universities from another one.
In addition, we also can see how rating and gender gap variables are associated with each other.
p3 <- ggplot() +
geom_point(data = dataset2, aes(x = rating, y = gender_gap), color = "steelblue") +
theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) +
theme_light() +
labs(xlab = "Rating", y = "Gender Gap Index", title = "How the First Two PCs are Related to Each Other") +
theme(plot.title = element_text(hjust = 0.5))
ggplotly(p3)
In this project, conducted Principal Component Analysis allowed to reduce data dimensions (e.g., 5 variables were grouped into one and labeled as rating). Next, it is was revealed that the best universities are located in US and Canada in comparison with the rest countries all of which were combined in one group. The more detailed analysis would disclose more countries with best education systems in the future.