Data Exploration

The data used in this project come from the Times Higher Education (THE) World University Rankings 2021.

library(tidyverse)
library(stats)
library(pca3d)
library(GGally)
library(knitr)
library(kableExtra)
library(factoextra)
library(DataExplorer)
library(psych)
library(plotly)

# Read the ranking data; with sep = ",", read.csv2 keeps dec = ",", so the
# numeric columns arrive as character and are converted to numeric below
df_uni <- read.csv2("D:/4th Course/THE2021 2.csv", encoding = "latin1", sep = ",")

# https://campus.datacamp.com/courses/dimensionality-reduction-in-r/principal-component-analysis-pca?ex=1
# https://campus.datacamp.com/courses/machine-learning-for-marketing-analytics-in-r/reducing-dimensionality-with-principal-component-analysis?ex=1

# Remove a zero-width space (U+200B) left over from the file encoding
df_uni$name[df_uni$name == "<U+200B>Shahid Chamran University of Ahvaz"] <- "Shahid Chamran University of Ahvaz"

# Change data types
df_uni$name <- as.factor(df_uni$name)
df_uni$location <- as.factor(df_uni$location)
df_uni <- df_uni %>% mutate_if(is.character, as.numeric)

What are the dimensions of the data?

dim(df_uni)
## [1] 1448   11

We have 1448 observations and 11 variables.

Next, let’s find out whether there are any missing values.

How many missing values are in the dataset?

plot_missing(df_uni)

The dataset is complete, so we can proceed.
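
For a numeric confirmation of what plot_missing() shows, here is a one-line check (a minimal sketch in base R):

# Total count of missing values across the whole data frame (0 = complete)
sum(is.na(df_uni))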

What are the distributions of numerical variables?

plot_density(df_uni[,3:11])

The density plots above depict the distributions of the numeric variables in the dataset. As can be observed, none of them looks normally distributed, and many distributions are right-skewed.
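
The visual impression of skewness can be quantified with the already loaded psych package; a small sketch (positive values indicate right skew):

# Skewness of each numeric variable
sapply(df_uni[, 3:11], psych::skew)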

What are the summary statistics of the numerical variables?

kable(describe(df_uni[,3:11]), caption = "Summary Statistics") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F)
Summary Statistics

|                              | vars |    n |        mean |          sd |   median |      trimmed |         mad |   min |      max |    range |       skew |   kurtosis |          se |
|------------------------------|-----:|-----:|------------:|------------:|---------:|-------------:|------------:|------:|---------:|---------:|-----------:|-----------:|------------:|
| scores_teaching              |    1 | 1448 |    27.53218 |    13.39072 |    23.40 |    25.166724 |     8.74734 |  11.7 |     94.4 |     82.7 |  2.0754314 |  5.3539491 |   0.3519003 |
| scores_research              |    2 | 1448 |    23.13004 |    17.12087 |    17.15 |    19.875776 |    10.45233 |   7.1 |     99.6 |     92.5 |  1.9521958 |  4.1454328 |   0.4499266 |
| scores_citations             |    3 | 1448 |    47.75808 |    27.68638 |    44.80 |    46.693793 |    34.98936 |   1.9 |    100.0 |     98.1 |  0.2576453 | -1.1544308 |   0.7275820 |
| scores_industry_income       |    4 | 1448 |    45.52030 |    16.59352 |    38.40 |    41.818448 |     6.52344 |  33.3 |    100.0 |     66.7 |  1.8447516 |  2.5833140 |   0.4360680 |
| scores_international_outlook |    5 | 1448 |    46.96492 |    23.25387 |    42.25 |    45.056638 |    25.13007 |  13.5 |    100.0 |     86.5 |  0.5829290 | -0.7405898 |   0.6110981 |
| stats_number_students        |    6 | 1448 | 22240.66160 | 20607.81583 | 17272.50 | 19137.444828 | 12851.91810 | 557.0 | 222102.0 | 221545.0 |  3.8146670 | 24.6923113 | 541.5613767 |
| stats_student_staff_ratio    |    7 | 1448 |    18.51271 |    10.29089 |    16.20 |    17.111293 |     6.81996 |   0.9 |    109.1 |    108.2 |  2.3314541 | 10.1838094 |   0.2704385 |
| stats_pc_intl_students       |    8 | 1448 |    11.26381 |    12.02288 |     7.00 |     9.219828 |     8.89560 |   0.0 |     84.0 |     84.0 |  1.8112797 |  4.4690133 |   0.3159542 |
| stats_female_share           |    9 | 1448 |    49.93923 |    12.13517 |    53.00 |    50.834483 |     8.89560 |   1.0 |     98.0 |     97.0 | -0.7275846 |  0.7514047 |   0.3189052 |

Judging by these statistics, the variables are measured on very different scales and are clearly not standardized. Let’s look at their variances.

lapply(df_uni[,3:11], var)
## $scores_teaching
## [1] 179.3113
## 
## $scores_research
## [1] 293.1243
## 
## $scores_citations
## [1] 766.5359
## 
## $scores_industry_income
## [1] 275.3449
## 
## $scores_international_outlook
## [1] 540.7424
## 
## $stats_number_students
## [1] 424682073
## 
## $stats_student_staff_ratio
## [1] 105.9024
## 
## $stats_pc_intl_students
## [1] 144.5496
## 
## $stats_female_share
## [1] 147.2624

Variables with high variances would be over-represented in the upcoming PCA. Besides that, differences in measurement units would give the variables artificial weights. Standardization (mean zero, variance one) is the key to staying on the right track with PCA.
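
As a minimal sketch of what this standardization does (prcomp() with scale = TRUE applies the same transformation internally):

# Standardize the numeric columns: mean 0 and variance 1 for each
df_std <- scale(df_uni[, 3:11])
round(colMeans(df_std), 10)       # all (numerically) zero
round(apply(df_std, 2, var), 10)  # all exactly one
# prcomp(df_std) is then equivalent to prcomp(df_uni[, 3:11], scale = TRUE)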

Let’s try to obtain some interesting insights from the data, just out of curiosity.

Which countries have the largest number of universities in the dataset?

loc_df <- df_uni %>% group_by(location) %>% summarise(quantity = n()) %>% arrange(desc(quantity)) 
# Pool all countries outside the top 10 into "Other" (689 universities)
loc_df1 <- data.frame(location = "Other",
                      quantity = sum(loc_df$quantity) - sum(loc_df$quantity[1:10]))
loc_df <- loc_df %>% top_n(10, quantity)
loc_df <- rbind(loc_df, loc_df1)
rm(loc_df1)

ggplot() +
  geom_bar(data = loc_df, aes(x = reorder(location, quantity), y  = quantity),
           width = 0.85, stat='identity', 
           fill = ifelse(loc_df$location == "Other", "#dbdbdb", "#779dc5"), 
           color = "#ababaf") +
  coord_flip() + theme_light() +
  geom_text(data = loc_df, 
            aes(x = reorder(location, quantity), y  = quantity), 
            label = loc_df$quantity, color = "#5f646a", hjust = -0.4, size = 3) +
  ylim(0, 700) +
  labs(y = "Quantity", x = " ",
       title = "Number of Universities in Countries") +
  theme(plot.title = element_text(hjust = 0.5))

The plot above presents the leading countries by the number of universities located in them. Russia takes 10th place.

Which countries have the highest total of universities’ mean scores in the dataset?

loc_df <- df_uni 
loc_df$mean_sc <- rowMeans(loc_df[,3:7], na.rm=TRUE)  # mean of the five score columns
# max_score is in fact the total of universities' mean scores per country
loc_df <- loc_df %>% group_by(location) %>% summarise(max_score = sum(mean_sc))  %>% 
  arrange(desc(max_score)) %>% top_n(10, max_score)

ggplot() +
  geom_bar(data = loc_df, aes(x = reorder(location, max_score), y  = max_score),
           width = 0.85, stat='identity', 
           fill = "#93cac4", 
           color = "#ababaf") +
  coord_flip() + theme_light() +
  geom_text(data = loc_df, 
            aes(x = reorder(location, max_score), y  = max_score), 
            label = round(loc_df$max_score), color = "#5f646a", hjust = -0.4, size = 3) +
  ylim(0, 9000) +
  labs(y = "Total Mean Score", x = " ",
       title = "Top-10 Countries with the Highest Total Mean Scores of Universities") +
  theme(plot.title = element_text(hjust = 0.5))

Once the universities’ scores are taken into account, we obtain another ranking. As can be noticed, some countries have changed their positions compared with the previous plot. For example, the US remains the leader, but the UK becomes 2nd here, while it was in 3rd place in the ranking based on the number of universities.

Is there a multicollinearity issue?

To answer this question, a correlation matrix is computed.

ggcorr(df_uni[,3:11], palette = "RdBu", label = TRUE)

The score variables tend to be correlated with one another, and so do the stats variables. So, we do observe multicollinearity.
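
To list the strongly correlated pairs explicitly, a small sketch (the 0.7 cutoff is an arbitrary choice, not a rule from the analysis above):

# Variable pairs whose absolute correlation exceeds 0.7
cmat <- cor(df_uni[, 3:11])
cmat[upper.tri(cmat, diag = TRUE)] <- NA   # keep each pair only once
subset(as.data.frame(as.table(cmat)), abs(Freq) > 0.7)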

Principal Component Analysis (PCA)

Given the data observed above, we can implement PCA mainly for the purpose of removing redundancy - some variables can be logically grouped with each other because they refer to similar ideas (e.g., the score variables could form an index variable standing for a rating). In addition, performing PCA allows us to eliminate the multicollinearity detected a few steps ago. Another reason is that there are patterns signaling the diversity of universities: some universities are similar to each other, and vice versa. More precisely, looking at the plot below, one can notice that universities in which the share of women is between 25% and 75% tend to score below 50 for research, and the industry income of that research also tends to be lower.

p0 <- ggplot(data = df_uni, aes(x = stats_female_share, y = scores_research, color = scores_industry_income)) +
  geom_point() + theme_light() +
  labs(title = "University Diversity: Score Comparison") +
  theme(plot.title = element_text(hjust = 0.5))

ggplotly(p0)

Let’s conduct PCA using the prcomp() function, setting the scale argument to TRUE to standardize the variables. The elements of the resulting prcomp object are printed out below.

pca_uni <- prcomp(df_uni[ , 3:11], scale = TRUE)
ls(pca_uni)
## [1] "center"   "rotation" "scale"    "sdev"     "x"

Let’s look at the summary first.

pca_summary <- summary(pca_uni)$importance %>% as.data.frame()
kable(pca_summary, caption = "PCA Summary") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F)
PCA Summary

|                        |     PC1 |      PC2 |      PC3 |       PC4 |       PC5 |       PC6 |       PC7 |       PC8 |      PC9 |
|------------------------|--------:|---------:|---------:|----------:|----------:|----------:|----------:|----------:|---------:|
| Standard deviation     | 1.86706 | 1.209701 | 1.126506 | 0.9478422 | 0.8161851 | 0.7478691 | 0.6609918 | 0.3863436 | 0.267674 |
| Proportion of Variance | 0.38732 | 0.162600 | 0.141000 | 0.0998200 | 0.0740200 | 0.0621500 | 0.0485500 | 0.0165800 | 0.007960 |
| Cumulative Proportion  | 0.38732 | 0.549920 | 0.690920 | 0.7907500 | 0.8647600 | 0.9269100 | 0.9754500 | 0.9920400 | 1.000000 |

If we take 70% as a threshold for the cumulative proportion of explained variance, we are advised to use 3 principal components (the first three explain 69.1%, essentially reaching that threshold). Alternatively, we can rely on eigenvalues when choosing this number. An eigenvalue smaller than 1 means that the component explains less variance than a single standardized variable contributes. In other words, the larger the eigenvalue (variance), the more meaningful the component. We can also use a scree plot and look for the elbow.
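
As a one-line programmatic version of the eigenvalue rule (commonly known as the Kaiser criterion), a sketch reusing the pca_uni object; its result matches the eigenvalues printed below:

# Count the components whose eigenvalue (sdev^2) exceeds 1
sum(pca_uni$sdev ^ 2 > 1)  # -> 3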

pca_uni$sdev ^ 2 %>% round(2)
## [1] 3.49 1.46 1.27 0.90 0.67 0.56 0.44 0.15 0.07
screeplot(pca_uni, type = "lines", main = "Scree Plot - University Data")
box()
abline(h = 1, lty = 2)

It can be seen that we can use 3 principal components, whose eigenvalues are 3.49, 1.46, and 1.27, respectively (as for the 4th, it is 0.90, already below 1).
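
The already loaded factoextra package offers an alternative scree plot with percentage labels; a minimal sketch:

# Percentage of variance explained by each PC, with labels on the bars
fviz_eig(pca_uni, addlabels = TRUE)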

A Pareto chart provides similar insights regarding how much variance the PCs explain. PC1 explains 38.7% of the variance, PC2 16.3%, and PC3 14.1%. Overall, the percentage of variance explained by PC1, PC2, and PC3 together is 69.09%.

PCA <- pca_uni$sdev^2
names(PCA) <- c("PC1", "PC2", "PC3", "PC4", "PC5", "PC6", "PC7", "PC8", "PC9")
qcc::pareto.chart(PCA)

##      
## Pareto chart analysis for PCA
##          Frequency    Cum.Freq.   Percentage Cum.Percent.
##   PC1   3.48591462   3.48591462  38.73238472  38.73238472
##   PC2   1.46337701   4.94929163  16.25974455  54.99212926
##   PC3   1.26901618   6.21830781  14.10017976  69.09230903
##   PC4   0.89840475   7.11671256   9.98227499  79.07458401
##   PC5   0.66615817   7.78287073   7.40175744  86.47634145
##   PC6   0.55930826   8.34217899   6.21453619  92.69087763
##   PC7   0.43691022   8.77908921   4.85455799  97.54543563
##   PC8   0.14926140   8.92835061   1.65846002  99.20389565
##   PC9   0.07164939   9.00000000   0.79610435 100.00000000

As we have settled on 3 principal components, let’s create a 3D model to visualize the obtained data patterns.

pca3d(pca_uni, components = 1:3, col = "#5f646a", biplot = T)
## [1] 0.14465857 0.09161782 0.15355832
## Creating new device
snapshotPCA3d(file="first_plot.png")

Further, let’s look at the loadings. Based on their values, we can understand whether 3 (or 4) principal components can be labeled logically and meaningfully. Here, the loadings stand for the strength of association between the original variables and the components (a small sanity check of this reading follows the bullet points below). Likewise, in this way we can figure out whether it is possible to produce an acceptable PCA solution on these data.

pca_loadings <- pca_uni$rotation
kable(pca_loadings, caption = "PCA Loadings") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F)
PCA Loadings

|                              |        PC1 |        PC2 |        PC3 |        PC4 |        PC5 |        PC6 |        PC7 |        PC8 |        PC9 |
|------------------------------|-----------:|-----------:|-----------:|-----------:|-----------:|-----------:|-----------:|-----------:|-----------:|
| scores_teaching              | -0.4381105 |  0.2271566 | -0.0926788 |  0.3502277 | -0.1028941 |  0.0719589 | -0.4682469 | -0.0880640 | -0.6184130 |
| scores_research              | -0.4833842 |  0.1495963 | -0.1688547 |  0.1496110 | -0.1151585 |  0.0263800 | -0.3045944 |  0.2453111 |  0.7253638 |
| scores_citations             | -0.3950244 | -0.1815466 |  0.0016034 |  0.1354757 | -0.2069581 |  0.6228418 |  0.5649204 | -0.2067331 | -0.0017450 |
| scores_industry_income       | -0.2778028 |  0.4606120 | -0.2704135 | -0.1344482 | -0.1739303 | -0.5624707 |  0.5075345 | -0.0968172 | -0.0766320 |
| scores_international_outlook | -0.4215684 | -0.3092577 |  0.1821718 | -0.2802719 |  0.2460363 | -0.1077354 |  0.1298289 |  0.6808486 | -0.2496992 |
| stats_number_students        |  0.0040629 | -0.2637754 | -0.6535686 |  0.3328828 |  0.6106956 | -0.0620470 |  0.1115204 | -0.0566147 |  0.0014930 |
| stats_student_staff_ratio    |  0.0028557 | -0.2884075 | -0.5853605 | -0.5930806 | -0.3783155 |  0.1153138 | -0.2321805 | -0.0528838 | -0.0964213 |
| stats_pc_intl_students       | -0.4026482 | -0.1907845 |  0.2784155 | -0.3440841 |  0.3590780 | -0.1768653 | -0.1669045 | -0.6357881 |  0.1151722 |
| stats_female_share           | -0.0361937 | -0.6293671 |  0.0962917 |  0.3987798 | -0.4444382 | -0.4793594 |  0.0135438 | -0.0785610 |  0.0249735 |
  • PC1: there are five cases of moderate correlation - scores_teaching (-0.44), scores_research (-0.48), scores_citations (-0.4), scores_international_outlook (-0.42), stats_pc_intl_students (-0.4). Overall, this PC can be labeled rating, since it mainly consists of scores.

  • PC2: scores_industry_income (0.46), stats_female_share (-0.63). Here we have a rather interesting case: the larger a university’s share of women, the less likely its research is to be involved in commercial and industrial spheres. In other words, universities with a larger share of males are more business-oriented. This PC can be labeled gender_gap.

  • PC3: stats_number_students (-0.65), stats_student_staff_ratio (-0.58). This PC reflects the size of the university - the number of people involved in its processes: the more students there are, the more staff are employed. Let’s label this one size, meaning the scope of students and the university workforce.
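
As a sanity check of the “loadings as correlations” reading above, a sketch: for standardized data, the correlation between an original variable and a component equals the loading multiplied by that component’s standard deviation.

# Correlations between the standardized variables and the first three PCs;
# each column reproduces rotation[, k] * sdev[k] up to rounding
round(cor(scale(df_uni[, 3:11]), pca_uni$x[, 1:3]), 2)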

pcs <- data.frame(pca_uni$x)
pcs <- pcs[1:3]
names(pcs) <- c("rating", "gender_gap", "size")
dataset <- cbind(df_uni, pcs)

Having performed PCA, it is also possible to answer the question: are the best universities located in the US and Canada?

dataset$country <- ifelse(dataset$location %in% c("United States", "Canada"), 1, 0)
df_uni2 <- dataset[,3:11] # Keep only the original numeric variables

res.pca <- prcomp(df_uni2, scale = TRUE)
p <- fviz_pca(res.pca, col.ind = dataset$country,
                palette = "jco", geom = "point",
         title = "PCA: lightblue - US or Canada, darkblue - the others") + 
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5))
ggplotly(p)

Earlier we defined that PC1 (shown as Dim1) stands for rating, while PC2 (Dim2) is more about the gender gap. Relying on the 2D visualization, we can assume this holds: universities with the highest scores (negative rating values, since the score loadings on PC1 are negative) are mostly located in the US and Canada.
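
A quick numeric check of this visual impression (a sketch; the t-test below confirms it formally):

# Mean rating per group (0 = other countries, 1 = US & Canada);
# lower rating values correspond to better universities
dataset %>% group_by(country) %>% 
  summarise(mean_rating = mean(rating), universities = n())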

To dispel any doubts about how the score values and the rating component are related, the plot below is created.

dataset2 <- dataset %>% filter(location == "United States" | location == "Canada")
# Average the five variables that load moderately on PC1 (columns 3:5, 7, 10)
dataset2$mean_sc <- rowMeans(dataset2[,c(3:5, 7,10)], na.rm=TRUE)

p2 <- ggplot() +
  geom_point(data = dataset2, aes(x = rating, y = mean_sc), color = "steelblue") +
  theme_light() +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) +
  labs(x = "Rating", y = "Average Scores", title = "American & Canadian Universities: Mean Scores & Rating")
ggplotly(p2)

Let’s run a t-test to check whether rating values are distributed equally across countries (group 1: the US & Canada; group 2: the other countries).

dataset$country <- as.factor(dataset$country)
t.test(rating ~ country, data = dataset)
## 
##  Welch Two Sample t-test
## 
## data:  rating by country
## t = 9.5491, df = 253.12, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.132717 1.721327
## sample estimates:
## mean in group 0 mean in group 1 
##        0.198088       -1.228934

The null hypothesis is rejected (p-value < 0.05), so the means are not equal; this is why we can state that the universities from one group outperform those from the other. Since lower rating values correspond to higher scores, the US & Canada group (mean -1.23) is ahead of the rest of the countries (mean 0.20).

In addition, we can also see how the rating and gender_gap variables are associated with each other.

p3 <- ggplot() +
  geom_point(data = dataset2, aes(x = rating, y = gender_gap), color = "steelblue") +
  theme_light() +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) +
  labs(x = "Rating", y = "Gender Gap Index", title = "How the First Two PCs are Related to Each Other")

ggplotly(p3)

Conclusion

In this project, the conducted Principal Component Analysis allowed us to reduce the data dimensionality (e.g., five variables were grouped into one component labeled rating). Next, it was revealed that the best universities are located in the US and Canada compared with the remaining countries, all of which were combined into a single group. A more detailed analysis could reveal other countries with the best education systems in the future.