Hello there! In this chapter I am going to explore the capabilities of one of the most common dimensionality reduction techniques, Principal Component Analysis (PCA).
The data I will be using are universities' assessment scores. Their structure is the following:
library(readr)
library(scales)
library(polycor)
library(ggcorrplot)
library(dplyr)
library(factoextra)
library(pca3d)
library(plotly)
THE <- read_csv("THE2021_2.csv")
THE$location <- as.factor(THE$location)
str(THE)
## tibble [1,448 x 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ name : chr [1:1448] "University of Oxford" "Stanford University" "Harvard University" "California Institute of Technology" ...
## $ location : Factor w/ 92 levels "Algeria","Argentina",..: 89 90 90 90 90 89 90 90 90 90 ...
## $ scores_teaching : num [1:1448] 91.3 92.2 94.4 92.5 90.7 90.3 85.8 91.9 88.8 88.9 ...
## $ scores_research : num [1:1448] 99.6 96.7 98.8 96.9 94.4 99.2 97.2 93.8 92.5 90.5 ...
## $ scores_citations : num [1:1448] 98 99.9 99.4 97 99.7 95.6 99.1 97.9 98.9 98.6 ...
## $ scores_industry_income : num [1:1448] 68.7 90.1 46.8 92.7 90.4 52.1 84.3 56.1 58 54.9 ...
## $ scores_international_outlook: num [1:1448] 96.4 79.5 77.7 83.6 90 95.7 72.3 68.4 80.2 74 ...
## $ stats_number_students : num [1:1448] 20774 16223 21261 2238 11276 ...
## $ stats_student_staff_ratio : num [1:1448] 11.1 7.4 9.3 6.3 8.4 11 19.8 6 8 5.9 ...
## $ stats_pc_intl_students : num [1:1448] 41 23 25 33 34 38 17 20 23 31 ...
## $ stats_female_share : num [1:1448] 46 44 49 36 39 47 51 50 46 46 ...
## - attr(*, "spec")=
## .. cols(
## .. name = col_character(),
## .. location = col_character(),
## .. scores_teaching = col_double(),
## .. scores_research = col_double(),
## .. scores_citations = col_double(),
## .. scores_industry_income = col_double(),
## .. scores_international_outlook = col_double(),
## .. stats_number_students = col_double(),
## .. stats_student_staff_ratio = col_double(),
## .. stats_pc_intl_students = col_double(),
## .. stats_female_share = col_double()
## .. )
11 variables and 1,448 observations in total. Two of the variables, a university's name and its location, are nominal, while the rest are numeric: scores on the parameters by which a university is assessed, plus statistics on its students and staff.
Let’s see if there are any missing data.
mean(is.na(THE))
## [1] 0
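For a finer-grained check, missing values can also be counted per column; a quick base-R sketch:
# Count missing values in each column; all zeros expected here
colSums(is.na(THE))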
Either way, there seem to be no NAs anywhere in the dataset. Now, let me check the summary statistics of the numeric variables, as PCA will be carried out using only those.
summary(THE[, 3:11])
## scores_teaching scores_research scores_citations scores_industry_income
## Min. :11.70 Min. : 7.10 Min. : 1.90 Min. : 33.30
## 1st Qu.:18.50 1st Qu.:11.40 1st Qu.: 23.15 1st Qu.: 34.70
## Median :23.40 Median :17.15 Median : 44.80 Median : 38.40
## Mean :27.53 Mean :23.13 Mean : 47.76 Mean : 45.52
## 3rd Qu.:31.93 3rd Qu.:28.82 3rd Qu.: 70.90 3rd Qu.: 48.40
## Max. :94.40 Max. :99.60 Max. :100.00 Max. :100.00
## scores_international_outlook stats_number_students stats_student_staff_ratio
## Min. : 13.50 Min. : 557 Min. : 0.90
## 1st Qu.: 27.50 1st Qu.: 9746 1st Qu.: 12.20
## Median : 42.25 Median : 17273 Median : 16.20
## Mean : 46.96 Mean : 22241 Mean : 18.51
## 3rd Qu.: 63.02 3rd Qu.: 28753 3rd Qu.: 22.02
## Max. :100.00 Max. :222102 Max. :109.10
## stats_pc_intl_students stats_female_share
## Min. : 0.00 Min. : 1.00
## 1st Qu.: 2.00 1st Qu.:43.00
## Median : 7.00 Median :53.00
## Mean :11.26 Mean :49.94
## 3rd Qu.:17.00 3rd Qu.:58.00
## Max. :84.00 Max. :98.00
It appears that all score-related parameters have a maximum at or below 100 points, suggesting that each of them is measured on a 100-point scale, and that, on average, the universities in the data score less than half of the maximum on those parameters.
Aside from that, the number-of-students variable ranges from 557 to more than 220,000. This raises some concern: the other numeric variables have much smaller ranges (which do not even overlap with the range of stats_number_students), so their values cannot be meaningfully compared against it.
To deal with this obstacle, I am going to rescale all numeric variables to a uniform interval ranging from 0 to 1.
THE8 <- as.data.frame(apply(THE[, 3:11], 2, scales::rescale))
summary(THE8)
## scores_teaching scores_research scores_citations scores_industry_income
## Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.08222 1st Qu.:0.04649 1st Qu.:0.2166 1st Qu.:0.02099
## Median :0.14148 Median :0.10865 Median :0.4373 Median :0.07646
## Mean :0.19144 Mean :0.17330 Mean :0.4675 Mean :0.18321
## 3rd Qu.:0.24456 3rd Qu.:0.23486 3rd Qu.:0.7034 3rd Qu.:0.22639
## Max. :1.00000 Max. :1.00000 Max. :1.0000 Max. :1.00000
## scores_international_outlook stats_number_students stats_student_staff_ratio
## Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.1618 1st Qu.:0.04148 1st Qu.:0.1044
## Median :0.3324 Median :0.07545 Median :0.1414
## Mean :0.3869 Mean :0.09787 Mean :0.1628
## 3rd Qu.:0.5725 3rd Qu.:0.12727 3rd Qu.:0.1952
## Max. :1.0000 Max. :1.00000 Max. :1.0000
## stats_pc_intl_students stats_female_share
## Min. :0.00000 Min. :0.0000
## 1st Qu.:0.02381 1st Qu.:0.4330
## Median :0.08333 Median :0.5361
## Mean :0.13409 Mean :0.5045
## 3rd Qu.:0.20238 3rd Qu.:0.5876
## Max. :1.00000 Max. :1.0000
The result of the transformation is above. Looks good.
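For reference, scales::rescale here performs min-max normalization, (x - min) / (max - min); a manual equivalent for one NA-free column, as a quick sketch:
# Min-max normalization by hand, matching scales::rescale defaults
minmax <- function(x) (x - min(x)) / (max(x) - min(x))
all.equal(minmax(THE$scores_teaching), THE8$scores_teaching)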
Now, let’s see if any correlations between variables can be observed.
the_het <- cor(THE8)
the_cors <- round(the_het, 2)
ggcorrplot(the_cors,
           hc.order = TRUE,
           type = "lower",
           ggtheme = ggplot2::theme_bw,
           colors = c("turquoise4", "white", "#E46726"))
The depicted tiles suggest quite a high correlation between the scores_teaching and scores_research variables, as well as between scores_international_outlook and stats_pc_intl_students. Quite interesting, I guess.
More generally, the triangle (or something glider-like) of the darkest orange tiles marks the most strongly correlated variables, and it covers more than two pairs. This suggests that dimensionality reduction applied to these data has a chance of being successful, so let's begin.
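To pull the strongest pairs out programmatically rather than by eye, the correlation matrix can be flattened and sorted; a small sketch:
# Flatten the correlation matrix into pairs and rank them by
# absolute correlation, keeping each unordered pair once
cor_pairs <- as.data.frame(as.table(the_het))
cor_pairs <- subset(cor_pairs, as.integer(Var1) < as.integer(Var2))
head(cor_pairs[order(-abs(cor_pairs$Freq)), ], 5)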
PCA will be done with the help of the singular value decomposition method applied to the scaled data matrix. Here is the output's structure, followed by the proportion of total variance explained by each component:
pca_scores <- prcomp(THE8)
str(pca_scores, give.attr = F)
## List of 5
## $ sdev : num [1:9] 0.409 0.253 0.194 0.144 0.118 ...
## $ rotation: num [1:9, 1:9] -0.286 -0.378 -0.573 -0.296 -0.543 ...
## $ center : Named num [1:9] 0.191 0.173 0.467 0.183 0.387 ...
## $ scale : logi FALSE
## $ x : num [1:1448, 1:9] -1.33 -1.26 -1.08 -1.31 -1.35 ...
(pca_scores$sdev^2 / sum(pca_scores$sdev^2)) %>% round(3)
## [1] 0.511 0.195 0.115 0.063 0.043 0.032 0.020 0.014 0.007
As is typical with PCA, the majority of the "weight" is carried by the first and second components.
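As a quick sanity check of the SVD claim above, prcomp's standard deviations can be reproduced from svd() applied to the centered data matrix; a minimal sketch (prcomp centers but, by default, does not rescale):
# prcomp is built on SVD: the singular values divided by sqrt(n - 1)
# should equal the component standard deviations
X <- scale(as.matrix(THE8), center = TRUE, scale = FALSE)
sv <- svd(X)
all.equal(sv$d / sqrt(nrow(X) - 1), pca_scores$sdev)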
The following are the loadings of the original variables on the first four of the resulting principal components, i.e. the coefficients with which each variable enters a component:
# (loadings of the original variables on the components)
round(pca_scores$rotation[, 1:4], 2)
## PC1 PC2 PC3 PC4
## scores_teaching -0.29 0.20 -0.15 -0.62
## scores_research -0.38 0.22 -0.07 -0.51
## scores_citations -0.57 -0.26 -0.68 0.34
## scores_industry_income -0.30 0.81 0.14 0.43
## scores_international_outlook -0.54 -0.35 0.61 0.12
## stats_number_students 0.00 -0.01 -0.05 -0.02
## stats_student_staff_ratio -0.01 -0.01 0.01 0.15
## stats_pc_intl_students -0.25 -0.13 0.35 -0.09
## stats_female_share -0.03 -0.23 -0.01 -0.02
It seems that the first component is mostly associated with the scores_research, scores_international_outlook, and scores_citations variables; if I were up to it, I would call it a "Research outlook" component. The second component is associated mostly with the scores_industry_income variable, relative to the other variables. The third component is mostly associated with scores_citations, scores_international_outlook, and stats_pc_intl_students, which makes it somewhat similar to the first. And the fourth component loads most heavily on scores_teaching, scores_research, and scores_industry_income, making it an education-quality-related component.
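A convenient way to eyeball these loadings is factoextra's variable map, where each arrow's direction and length show how a variable contributes to the first two components; a sketch:
# Variable loadings plotted on the PC1-PC2 plane
fviz_pca_var(pca_scores,
             col.var = "turquoise4",
             repel = TRUE)  # nudge labels apart to avoid overlaps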
summary(pca_scores)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 0.4095 0.2532 0.1944 0.14402 0.11831 0.10281 0.08035
## Proportion of Variance 0.5109 0.1953 0.1151 0.06321 0.04266 0.03221 0.01968
## Cumulative Proportion 0.5109 0.7063 0.8214 0.88462 0.92728 0.95949 0.97917
## PC8 PC9
## Standard deviation 0.06835 0.04652
## Proportion of Variance 0.01424 0.00659
## Cumulative Proportion 0.99341 1.00000
#get_eigenvalue(pca_scores) # shows eigenvalues and percent of explained variance together
Looking at the output above, we can again speculate whether it is useful to keep all of the components or better to drop some of them. A common rule of thumb is to retain enough components to explain at least 70% of the total variance. In our case, the first component (PC1) explains more than half of the total variance; combined with the second (PC2), we get right about 70% explained, which is already quite good, and adding the third (PC3) brings us above 80%. So, for these data, 2-3 principal components should be kept.
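That eyeballing can also be written down directly: the smallest number of components whose cumulative proportion of explained variance passes a chosen threshold, here 70% (a small sketch):
# Index of the first component at which cumulative variance
# explained reaches the 70% threshold
cum_var <- cumsum(pca_scores$sdev^2) / sum(pca_scores$sdev^2)
which(cum_var >= 0.70)[1]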
One more approach to choosing the number of components is the Kaiser-Guttman criterion, which suggests keeping the components whose eigenvalues are greater than 1.
pca_scores$sdev^2
## [1] 0.167650766 0.064097961 0.037774093 0.020740957 0.013998367 0.010570108
## [7] 0.006456299 0.004671878 0.002163861
Here, however, none of the components has an eigenvalue greater than 1. A caveat is in order: the criterion is designed for standardized variables with unit variance, where an eigenvalue below 1 means a component explains less variance than a single original variable. Our variables were min-max rescaled to [0, 1] rather than standardized, so their variances are well below 1 to begin with, and the threshold of 1 is not directly meaningful. A common adaptation is to compare the eigenvalues against their mean instead, under which the first two or three components would still be retained.
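The mean-eigenvalue comparison mentioned above takes two lines (a sketch):
# Keep components whose eigenvalue exceeds the mean eigenvalue
eig <- pca_scores$sdev^2
which(eig > mean(eig))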
Lastly, there is the scree plot, a graphical method for deciding on the number of components. On this kind of plot one looks for an "elbow": the point where the steep drop in explained variance levels off, beyond which further components add little.
fviz_screeplot(pca_scores,
addlabels = TRUE,
barfill = "turquoise4",
barcolor = "turquoise4",
ylim = c(0, 60)
)
In our case we get an almost perfectly smooth decay, resembling a power law, so it is hard to tell where the elbow is. Roughly speaking, I would place it after the first component, which would mean keeping only one component.
The next step is to try to interpret the components; visualizing how the variance spreads across them might help.
# pca3d opens an interactive rgl window; snapshotPCA3d() saves it as a PNG
#pca3d(pca_scores,
#      col = "#E46726")
#snapshotPCA3d(file = "first_biplot.png")
The plot above displays the first 3 components in a 3-dimensional space. As can be seen, the majority of the variance is spread along the PC1 axis.
The data points on such a plot can be colored according to some grouping. In our case, for instance, they can be grouped by the geographical location of the universities, if we assume that for some regions the dimensionality reduction works differently on the assessment parameters.
Let's try this approach and divide the universities into two groups: those located in the US and Canada, labeled North American, and those located in other regions.
THE$NorthAmericaVSothers <- ifelse(THE$location %in% c("United States", "Canada"),
                                   "North America", "others")
#pca3d(pca_scores,
# group = THE$NorthAmericaVSothers)
#snapshotPCA3d(file="sec_biplot.png")
Okay, the yellow points represent North American universities, while the blue points are the remaining ones. It is a bit hard to tell whether the North American universities are spread along one of the components specifically, though.
Perhaps a 2D view can help.
pca2d(pca_scores, group = THE$NorthAmericaVSothers)
Here, only the first two components are pictured. I would say the yellow points are spread fairly evenly along both axes.
Finally, if we assume that the first component is the most useful one and can be used to represent all of the university assessment parameters, we can test whether the two groups of universities distinguished in the previous step (by geo-position) differ in their scores on the first component.
Let's begin with summary statistics.
THEComps <- cbind(THE, pca_scores$x[,1:2]) %>% as.data.frame
THEComps$NorthAmericaVSothers <- as.factor(THEComps$NorthAmericaVSothers)
psych::describeBy(THEComps$PC1, THEComps$NorthAmericaVSothers, mat = T)
It can be seen that the mean PC1 scores are quite different in the two groups: -0.26 for the North America group and 0.04 for the others group.
Let's check whether the means differ statistically. This can be done with a two-sample t-test.
options(scipen = 999)
car::leveneTest(THEComps$PC1 ~ THEComps$NorthAmericaVSothers) #variances are equal
The output above is the Levene test assessing whether the variances in the two groups are equal, a prerequisite for the equal-variance t-test. The p-value is well above the 0.05 alpha level, so the null hypothesis of homogeneous variances cannot be rejected, and the equal-variance test is appropriate.
t.test(THEComps$PC1 ~ THEComps$NorthAmericaVSothers, var.equal = T)
##
## Two Sample t-test
##
## data: THEComps$PC1 by THEComps$NorthAmericaVSothers
## t = -9.9556, df = 1446, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.3589092 -0.2407544
## sample estimates:
## mean in group North America mean in group others
## -0.2582115 0.0416203
The t-test output displays a p-value far below the alpha level. The null hypothesis that the true population means of the two groups are equal can be rejected. Thus, based on their PC1 scores, universities from the US and Canada differ from universities in other regions.
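Since the p-value only speaks to significance, the magnitude of the difference can be quantified with an effect size such as Cohen's d; a manual base-R sketch using the pooled standard deviation:
# Cohen's d: the standardized difference between the two group means
grp <- split(THEComps$PC1, THEComps$NorthAmericaVSothers)
n1 <- length(grp[[1]]); n2 <- length(grp[[2]])
pooled_sd <- sqrt(((n1 - 1) * var(grp[[1]]) + (n2 - 1) * var(grp[[2]])) /
                    (n1 + n2 - 2))
(mean(grp[[1]]) - mean(grp[[2]])) / pooled_sd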
Finally, an interactive visual: the data points are plotted across the first two components, and hovering over each colored dot displays its component scores (coordinates).
p <- plot_ly(THEComps,
             x = ~PC1,
             y = ~PC2,
             text = rownames(THEComps),
             type = "scatter",
             mode = "markers",
             color = ~NorthAmericaVSothers,
             colors = c("#E46726", "turquoise4"),
             marker = list(size = 6))
layout(p, title = "Principal Component Analysis: Universities Assessment",
xaxis = list(title = "PC 1"),
yaxis = list(title = "PC 2"))
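If the interactive figure needs to live outside this page, it can be saved as a standalone HTML file; a sketch assuming the htmlwidgets package is installed:
# Save the plotly widget as a self-contained HTML page
fig <- layout(p, title = "Principal Component Analysis: Universities Assessment",
              xaxis = list(title = "PC 1"),
              yaxis = list(title = "PC 2"))
htmlwidgets::saveWidget(fig, "pca_universities.html")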
And here the chapter comes to an end. Thank you for tuning in!