Data Exploration

The data used in this project is about World University Rankings 2021.

library(tidyverse)
library(stats)
library(pca3d)
library(GGally)
library(knitr)
library(kableExtra)
library(factoextra)
library(DataExplorer)
library(psych)
library(plotly)

df_uni <- read.csv2("D:/4th Course/THE2021 2.csv", encoding = "latin1", sep = ",")

# https://campus.datacamp.com/courses/dimensionality-reduction-in-r/principal-component-analysis-pca?ex=1
# https://campus.datacamp.com/courses/machine-learning-for-marketing-analytics-in-r/reducing-dimensionality-with-principal-component-analysis?ex=1

# Remove a unicode
df_uni$name[df_uni$name == "<U+200B>Shahid Chamran University of Ahvaz"] <- "Shahid Chamran University of Ahvaz"

# Change data types
df_uni$name <- as.factor(df_uni$name)
df_uni$location <- as.factor(df_uni$location)
df_uni <- df_uni %>% mutate_if(is.character, as.numeric)

What data dimension do we have?

dim(df_uni)

## [1] 1448   11

We have 1448 observations and 11 variables.

Next, let’s find out whether there are some missing value.

How many missing values are in the dataset?

plot_missing(df_uni)

The dataset is completed. We successfully proceed.

What are the distributions of numerical variables?

plot_density(df_uni[,3:11])

On density plots above, distribution of numeric variables presented in the dataset is depicted. As can be observed, there is no any normal type, and many distributions are right-skewed.

What are statistics of numerical variables?

kable(describe(df_uni[,3:11]), caption = "Summary Statistics") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F)

Summary Statistics
	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
scores_teaching	1	1448	27.53218	13.39072	23.40	25.166724	8.74734	11.7	94.4	82.7	2.0754314	5.3539491	0.3519003
scores_research	2	1448	23.13004	17.12087	17.15	19.875776	10.45233	7.1	99.6	92.5	1.9521958	4.1454328	0.4499266
scores_citations	3	1448	47.75808	27.68638	44.80	46.693793	34.98936	1.9	100.0	98.1	0.2576453	-1.1544308	0.7275820
scores_industry_income	4	1448	45.52030	16.59352	38.40	41.818448	6.52344	33.3	100.0	66.7	1.8447516	2.5833140	0.4360680
scores_international_outlook	5	1448	46.96492	23.25387	42.25	45.056638	25.13007	13.5	100.0	86.5	0.5829290	-0.7405898	0.6110981
stats_number_students	6	1448	22240.66160	20607.81583	17272.50	19137.444828	12851.91810	557.0	222102.0	221545.0	3.8146670	24.6923113	541.5613767
stats_student_staff_ratio	7	1448	18.51271	10.29089	16.20	17.111293	6.81996	0.9	109.1	108.2	2.3314541	10.1838094	0.2704385
stats_pc_intl_students	8	1448	11.26381	12.02288	7.00	9.219828	8.89560	0.0	84.0	84.0	1.8112797	4.4690133	0.3159542
stats_female_share	9	1448	49.93923	12.13517	53.00	50.834483	8.89560	1.0	98.0	97.0	-0.7275846	0.7514047	0.3189052

According to the seen statistics, we can see that variables are more likely to be standardized. Let’s look at their variances.

lapply(df_uni[,3:11], var)

## $scores_teaching
## [1] 179.3113
## 
## $scores_research
## [1] 293.1243
## 
## $scores_citations
## [1] 766.5359
## 
## $scores_industry_income
## [1] 275.3449
## 
## $scores_international_outlook
## [1] 540.7424
## 
## $stats_number_students
## [1] 424682073
## 
## $stats_student_staff_ratio
## [1] 105.9024
## 
## $stats_pc_intl_students
## [1] 144.5496
## 
## $stats_female_share
## [1] 147.2624

Variables with high values of variances will be over-represented in the up-coming PCA. Besides that, differences in the measurable units provoke fake weights for variables. Standardization (mean is zero, variance is one) is the key to be on the right track regarding PCA.

Let’s try to obtain some interesting insights from the data. Just being curious.

In which countries is the biggest number of universities presented in the dataset?

loc_df <- df_uni %>% group_by(location) %>% summarise(quantity = n()) %>% arrange(desc(quantity)) 
# colSums(loc_df[11:92, 2])
loc_df <- loc_df %>% top_n(10)
loc_df1 <- data.frame(location = "Other",
                  quantity = 689)
loc_df <- rbind(loc_df, loc_df1)
rm(loc_df1)

ggplot() +
  geom_bar(data = loc_df, aes(x = reorder(location, quantity), y  = quantity),
           width = 0.85, stat='identity', 
           fill = ifelse(loc_df$location == "Other", "#dbdbdb", "#779dc5"), 
           color = "#ababaf") +
  coord_flip() + theme_light() +
  geom_text(data = loc_df, 
            aes(x = reorder(location, quantity), y  = quantity), 
            label = loc_df$quantity, color = "#5f646a", hjust = -0.4, size = 3) +
  ylim(0, 700) +
  labs(y = "Quantity", x = " ",
       title = "Number of Universities in Countries") +
  theme(plot.title = element_text(hjust = 0.5))

On the plot above, the leading countires according to the number of located universities are presented. Russia takes 10th place.

In which countries is the highest mean score (rating of universities) presented in the dataset?

loc_df <- df_uni 
loc_df$mean_sc <- rowMeans(loc_df[,3:7], na.rm=TRUE)
loc_df <- loc_df %>% group_by(location) %>% summarise(max_score = sum(mean_sc))  %>% 
  arrange(desc(max_score)) %>% top_n(10)

ggplot() +
  geom_bar(data = loc_df, aes(x = reorder(location, max_score), y  = max_score),
           width = 0.85, stat='identity', 
           fill = "#93cac4", 
           color = "#ababaf") +
  coord_flip() + theme_light() +
  geom_text(data = loc_df, 
            aes(x = reorder(location, max_score), y  = max_score), 
            label = loc_df$max_score, color = "#5f646a", hjust = -0.4, size = 3) +
  ylim(0, 9000) +
  labs(y = "Quantity", x = " ",
       title = "Top-10 Countries with the Maximum Mean Scores of Universities") +
  theme(plot.title = element_text(hjust = 0.5))

Having concerned universities’ scores, we obtained another rating. As can be noticed, some countries have changed their position in comparison with the previous plot. For example, the US remains the leader, but the UK becomes the 2nd here while it has been put on the 3rd place in the rating based on the number of universities.

Is there multicollinearity issue?

To answer this question, a correlation matrix is computed.

ggcorr(df_uni[,3:11], palette = "RdBu", label = TRUE)

Scores are more likely to be correlated with each other, as well as stats. So, we can observe a multicollinearity.

Principal Component Analysis, PCA

Having the observed data, we can implement PCA mainly for the purpose of removing redundancy - there are some variables that can be logically grouped with each other as they refer similar ideas (e.g., scores can form index variable potentially stood for rating). In addition, performing PCA would allow elimination of multicollinearity that has been detected a few steps ago. Another reason is that there are some patterns that signalize about diversity of universities. For example, we can see that some universities are similar to each other, and vice versa. More precisely, looking on the plot below, it is noticed that universities in which proportion of female is from 25 to 75 have a tendency to get less than 50 scores for their research industrial value of which are more likely to be lower.

p0 <- ggplot(data = df_uni, aes(x = stats_female_share, y = scores_research, color = scores_industry_income)) +
  geom_point() + theme_light() +
  labs(title = "University Diversity: Score Comparison") +
  theme(plot.title = element_text(hjust = 0.5))

ggplotly(p0)

Let’s conduct PCA using prcomp() function and set TRUE for scale parameter to standartize the variables. The variables in the class prcomp are put out.

pca_uni <- prcomp(df_uni[ , 3:11], scale = TRUE)
ls(pca_uni)

## [1] "center"   "rotation" "scale"    "sdev"     "x"

Let’s look at summary firstly.

pca_summary <- summary(pca_uni)$importance %>% as.data.frame()
kable(pca_summary, caption = "PCA Summary") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F)

PCA Summary
	PC1	PC2	PC3	PC4	PC5	PC6	PC7	PC8	PC9
Standard deviation	1.86706	1.209701	1.126506	0.9478422	0.8161851	0.7478691	0.6609918	0.3863436	0.267674
Proportion of Variance	0.38732	0.162600	0.141000	0.0998200	0.0740200	0.0621500	0.0485500	0.0165800	0.007960
Cumulative Proportion	0.38732	0.549920	0.690920	0.7907500	0.8647600	0.9269100	0.9754500	0.9920400	1.000000

In case if we take 70% as a threshold for cumulative proportion, we are advised to use 3 as a number of principle components. Alternatively, we can rely on eigenvalues choosing such number. An eigenvalue smaller than 1 means that the component covers less variance than a single variables contributed. In other words, the more eigenvalue (variance) is, the more meaningful component is. We also can use scree plot to visualize finding the elbow.

pca_uni$sdev ^ 2 %>% round(2)

## [1] 3.49 1.46 1.27 0.90 0.67 0.56 0.44 0.15 0.07

screeplot(pca_uni, type = "lines", main = "Scree Plot - University Data")
box()
abline(h = 1, lty = 2)

It can be seen that we can use 3 principle components eigenvalues of which are 3.49, 1.46, 1.27, respectively (as for the 4th, it is 0.90).

A Pareto chart provides some insights regarding how much variance the PCs explain. The PC1 explains 38.7% of the variance, the PC2 - 16.3%, the PC3 - 14.1%. Overall,percentage of variance explained by PC1, PC2, and PC3 is 69.09.

PCA <- pca_uni$sdev^2
names(PCA) <- c("PC1", "PC2", "PC3", "PC4", "PC5", "PC6", "PC7", "PC8", "PC9")
qcc::pareto.chart(PCA)

##      
## Pareto chart analysis for PCA
##          Frequency    Cum.Freq.   Percentage Cum.Percent.
##   PC1   3.48591462   3.48591462  38.73238472  38.73238472
##   PC2   1.46337701   4.94929163  16.25974455  54.99212926
##   PC3   1.26901618   6.21830781  14.10017976  69.09230903
##   PC4   0.89840475   7.11671256   9.98227499  79.07458401
##   PC5   0.66615817   7.78287073   7.40175744  86.47634145
##   PC6   0.55930826   8.34217899   6.21453619  92.69087763
##   PC7   0.43691022   8.77908921   4.85455799  97.54543563
##   PC8   0.14926140   8.92835061   1.65846002  99.20389565
##   PC9   0.07164939   9.00000000   0.79610435 100.00000000

As we defined 3 principle components, let’s create a 3D model to visualize obtained data patterns.

pca3d(pca_uni, components = 1:3, col = "#5f646a", biplot = T)

## [1] 0.14465857 0.09161782 0.15355832
## Creating new device

snapshotPCA3d(file="first_plot.png")

Further, let’s look at loadings. Being based on their values, we can understand whether 3 or 4 principle components can be labeled logically and meaningfully. Here, loadings stands for correlations between original variables and components. Likewise, in this way we can figure out whether it is possible to produce an acceptable PCA solution on these data or not.

pca_loadings <- pca_uni$rotation
kable(pca_loadings, caption = "PCA Loadings") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F)

PCA Loadings
	PC1	PC2	PC3	PC4	PC5	PC6	PC7	PC8	PC9
scores_teaching	-0.4381105	0.2271566	-0.0926788	0.3502277	-0.1028941	0.0719589	-0.4682469	-0.0880640	-0.6184130
scores_research	-0.4833842	0.1495963	-0.1688547	0.1496110	-0.1151585	0.0263800	-0.3045944	0.2453111	0.7253638
scores_citations	-0.3950244	-0.1815466	0.0016034	0.1354757	-0.2069581	0.6228418	0.5649204	-0.2067331	-0.0017450
scores_industry_income	-0.2778028	0.4606120	-0.2704135	-0.1344482	-0.1739303	-0.5624707	0.5075345	-0.0968172	-0.0766320
scores_international_outlook	-0.4215684	-0.3092577	0.1821718	-0.2802719	0.2460363	-0.1077354	0.1298289	0.6808486	-0.2496992
stats_number_students	0.0040629	-0.2637754	-0.6535686	0.3328828	0.6106956	-0.0620470	0.1115204	-0.0566147	0.0014930
stats_student_staff_ratio	0.0028557	-0.2884075	-0.5853605	-0.5930806	-0.3783155	0.1153138	-0.2321805	-0.0528838	-0.0964213
stats_pc_intl_students	-0.4026482	-0.1907845	0.2784155	-0.3440841	0.3590780	-0.1768653	-0.1669045	-0.6357881	0.1151722
stats_female_share	-0.0361937	-0.6293671	0.0962917	0.3987798	-0.4444382	-0.4793594	0.0135438	-0.0785610	0.0249735

PC1: there are 4 cases of moderate correlations - scores_teaching (-0.44), scores_research (-0.48), scores_citations (-0.4), scores_international_outlook (-0.42), stats_pc_intl_students (-0.4). Overall, this PC can be labeled as rating due to a reason that it mainly consists of scores.
PC2: scores_industry_income (0.46), stats_female_share (-0.63). Here we have a rather interesting case - if there are more women, the universities’ research are much less likely to be involved in commercial and industrial spheres. In other words, universities with a larger number of males are more business-oriented. This PC can be labeled as gender_gap .
PC3: stats_number_students (-0.65), stats_student_staff_ratio (-0.58). This PC reflects the size of the university, the number of people involved in its processes. The more students are, the more staff is employed. Let’s label this one as size meaning scope of students and university workforce.

pcs <- data.frame(pca_uni$x)
pcs <- pcs[1:3]
names(pcs) <- c("rating", "gender_gap", "size")
dataset <- cbind(df_uni, pcs)

Having performed PCA, it is also possible to provide answer the question - are the best universities located in the US and Canada?

dataset$country <- ifelse(dataset$loc == "United States" | dataset$loc == "Canada", 1, 0)
df_uni2 <- dataset[,3:11] # Remove the grouping variable

res.pca <- prcomp(df_uni2, scale = TRUE)
p <- fviz_pca(res.pca, col.ind = dataset$country,
                palette = "jco", geom = "point",
         title = "PCA: lightblue - US or Canada, darkblue - the others") + 
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5))
ggplotly(p)

Before we defined that the PC1 that is also Dim1 stands for rating, the PC2 (or Dim2) is more about gender gap. Relying on the 2D visualization we can assume that it is so - universities with the highest scores (negative values for rating) are mostly located in US and Canada.

In order to escape any doubts about how score values and rating are related the plot presented below is created.

dataset2 <- dataset %>% filter(location == "United States" | location == "Canada")
dataset2$mean_sc <- rowMeans(dataset2[,c(3:5, 7,10)], na.rm=TRUE)

p2 <- ggplot() +
  geom_point(data = dataset2, aes(x = rating, y = mean_sc), color = "steelblue") +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) +
  theme_light() +
  labs(xlab = "Rating", y = "Average Scores", title = "American & Canadian Universities: Mean Scores & Rating")
ggplotly(p2)

Let’s compute t-test to check whether values of rating distributed equally per countries (1 group - US & Canada, 2 group - the other countries).

dataset$country <- as.factor(dataset$country)
t.test(rating ~ country, data = dataset)

## 
##  Welch Two Sample t-test
## 
## data:  rating by country
## t = 9.5491, df = 253.12, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.132717 1.721327
## sample estimates:
## mean in group 0 mean in group 1 
##        0.198088       -1.228934

The null hypothesis is rejected (p-value < 0.05, means are not equal) that is why we can state that universities from one group is better than universities from another one.

In addition, we also can see how rating and gender gap variables are associated with each other.

p3 <- ggplot() +
  geom_point(data = dataset2, aes(x = rating, y = gender_gap), color = "steelblue") +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5)) +
  theme_light() +
  labs(xlab = "Rating", y = "Gender Gap Index", title = "How the First Two PCs are Related to Each Other") +
  theme(plot.title = element_text(hjust = 0.5))

ggplotly(p3)

Conclusion

In this project, conducted Principal Component Analysis allowed to reduce data dimensions (e.g., 5 variables were grouped into one and labeled as rating). Next, it is was revealed that the best universities are located in US and Canada in comparison with the rest countries all of which were combined in one group. The more detailed analysis would disclose more countries with best education systems in the future.

Principal Component Analysis (PCA)

Zakharova Victoria

Last updated: 11 Ноябрь, 2020

Data Exploration

Principal Component Analysis, PCA

Conclusion