Hello there! In this chapter I am going to explore the capabilities of one of the most common dimensionality reduction techniques, Principal Component Analysis (PCA).
The data I will be using are universities' assessment scores. Their structure is the following:
library(readr)
library(scales)
library(polycor)
library(ggcorrplot)
library(dplyr)
library(factoextra)
library(pca3d)
library(plotly)
THE <- read_csv("THE2021_2.csv")
THE$location <- as.factor(THE$location)
str(THE)
## tibble [1,448 x 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ name : chr [1:1448] "University of Oxford" "Stanford University" "Harvard University" "California Institute of Technology" ...
## $ location : Factor w/ 92 levels "Algeria","Argentina",..: 89 90 90 90 90 89 90 90 90 90 ...
## $ scores_teaching : num [1:1448] 91.3 92.2 94.4 92.5 90.7 90.3 85.8 91.9 88.8 88.9 ...
## $ scores_research : num [1:1448] 99.6 96.7 98.8 96.9 94.4 99.2 97.2 93.8 92.5 90.5 ...
## $ scores_citations : num [1:1448] 98 99.9 99.4 97 99.7 95.6 99.1 97.9 98.9 98.6 ...
## $ scores_industry_income : num [1:1448] 68.7 90.1 46.8 92.7 90.4 52.1 84.3 56.1 58 54.9 ...
## $ scores_international_outlook: num [1:1448] 96.4 79.5 77.7 83.6 90 95.7 72.3 68.4 80.2 74 ...
## $ stats_number_students : num [1:1448] 20774 16223 21261 2238 11276 ...
## $ stats_student_staff_ratio : num [1:1448] 11.1 7.4 9.3 6.3 8.4 11 19.8 6 8 5.9 ...
## $ stats_pc_intl_students : num [1:1448] 41 23 25 33 34 38 17 20 23 31 ...
## $ stats_female_share : num [1:1448] 46 44 49 36 39 47 51 50 46 46 ...
## - attr(*, "spec")=
## .. cols(
## .. name = col_character(),
## .. location = col_character(),
## .. scores_teaching = col_double(),
## .. scores_research = col_double(),
## .. scores_citations = col_double(),
## .. scores_industry_income = col_double(),
## .. scores_international_outlook = col_double(),
## .. stats_number_students = col_double(),
## .. stats_student_staff_ratio = col_double(),
## .. stats_pc_intl_students = col_double(),
## .. stats_female_share = col_double()
## .. )
11 variables and 1,448 observations in total. Two of the variables, a university's name and its location, are nominal, while the rest are numeric: scores on the parameters by which a university is assessed, plus statistics on its students and staff.
Let’s see if there are any missing data.
mean(is.na(THE))
## [1] 0
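For a finer-grained check, missing values can also be counted per column; a quick base-R sketch:
# Count missing values in each column; all zeros expected here
colSums(is.na(THE))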
Either way, there seem to be no NAs anywhere in the dataset. Now, let me check the summary statistics of the numeric variables, as PCA will be carried out using only those.
summary(THE[, 3:11])
## scores_teaching scores_research scores_citations scores_industry_income
## Min. :11.70 Min. : 7.10 Min. : 1.90 Min. : 33.30
## 1st Qu.:18.50 1st Qu.:11.40 1st Qu.: 23.15 1st Qu.: 34.70
## Median :23.40 Median :17.15 Median : 44.80 Median : 38.40
## Mean :27.53 Mean :23.13 Mean : 47.76 Mean : 45.52
## 3rd Qu.:31.93 3rd Qu.:28.82 3rd Qu.: 70.90 3rd Qu.: 48.40
## Max. :94.40 Max. :99.60 Max. :100.00 Max. :100.00
## scores_international_outlook stats_number_students stats_student_staff_ratio
## Min. : 13.50 Min. : 557 Min. : 0.90
## 1st Qu.: 27.50 1st Qu.: 9746 1st Qu.: 12.20
## Median : 42.25 Median : 17273 Median : 16.20
## Mean : 46.96 Mean : 22241 Mean : 18.51
## 3rd Qu.: 63.02 3rd Qu.: 28753 3rd Qu.: 22.02
## Max. :100.00 Max. :222102 Max. :109.10
## stats_pc_intl_students stats_female_share
## Min. : 0.00 Min. : 1.00
## 1st Qu.: 2.00 1st Qu.:43.00
## Median : 7.00 Median :53.00
## Mean :11.26 Mean :49.94
## 3rd Qu.:17.00 3rd Qu.:58.00
## Max. :84.00 Max. :98.00
It appears that all score-related parameters have a maximum at or below 100 points, suggesting that each of them is measured on a 100-point scale, and that, on average, the universities in the data score less than half of the maximum on those parameters.
Aside from that, the number-of-students variable ranges from 557 to more than 220,000. This raises some concern: the other numeric variables have much smaller ranges (which do not even overlap with the range of stats_number_students), so their values cannot be meaningfully compared against it.
To deal with this obstacle, I am going to rescale all numeric variables to a uniform interval ranging from 0 to 1.
THE8 <- as.data.frame(apply(THE[, 3:11], 2, scales::rescale))
summary(THE8)
## scores_teaching scores_research scores_citations scores_industry_income
## Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.08222 1st Qu.:0.04649 1st Qu.:0.2166 1st Qu.:0.02099
## Median :0.14148 Median :0.10865 Median :0.4373 Median :0.07646
## Mean :0.19144 Mean :0.17330 Mean :0.4675 Mean :0.18321
## 3rd Qu.:0.24456 3rd Qu.:0.23486 3rd Qu.:0.7034 3rd Qu.:0.22639
## Max. :1.00000 Max. :1.00000 Max. :1.0000 Max. :1.00000
## scores_international_outlook stats_number_students stats_student_staff_ratio
## Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.1618 1st Qu.:0.04148 1st Qu.:0.1044
## Median :0.3324 Median :0.07545 Median :0.1414
## Mean :0.3869 Mean :0.09787 Mean :0.1628
## 3rd Qu.:0.5725 3rd Qu.:0.12727 3rd Qu.:0.1952
## Max. :1.0000 Max. :1.00000 Max. :1.0000
## stats_pc_intl_students stats_female_share
## Min. :0.00000 Min. :0.0000
## 1st Qu.:0.02381 1st Qu.:0.4330
## Median :0.08333 Median :0.5361
## Mean :0.13409 Mean :0.5045
## 3rd Qu.:0.20238 3rd Qu.:0.5876
## Max. :1.00000 Max. :1.0000
The result of the transformation is above. Looks good.
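For reference, scales::rescale here performs min-max normalization, (x - min) / (max - min); a manual equivalent for one NA-free column, as a quick sketch:
# Min-max normalization by hand, matching scales::rescale defaults
minmax <- function(x) (x - min(x)) / (max(x) - min(x))
all.equal(minmax(THE$scores_teaching), THE8$scores_teaching)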
Now, let’s see if any correlations between variables can be observed.
the_het <- cor(THE8)
the_cors <- round(the_het, 2)
ggcorrplot(the_cors,
           hc.order = TRUE,
           type = "lower",
           ggtheme = ggplot2::theme_bw,
           colors = c("turquoise4", "white", "#E46726"))
The depicted tiles suggest quite a high correlation between the scores_teaching and scores_research variables, as well as between scores_international_outlook and stats_pc_intl_students. Quite interesting, I guess.
More generally, the triangle (or something glider-like) of the darkest orange tiles marks the most strongly correlated variables, and it covers more than two pairs. This suggests that dimensionality reduction applied to these data has a chance of being successful, so let's begin.
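To pull the strongest pairs out programmatically rather than by eye, the correlation matrix can be flattened and sorted; a small sketch:
# Flatten the correlation matrix into pairs and rank them by
# absolute correlation, keeping each unordered pair once
cor_pairs <- as.data.frame(as.table(the_het))
cor_pairs <- subset(cor_pairs, as.integer(Var1) < as.integer(Var2))
head(cor_pairs[order(-abs(cor_pairs$Freq)), ], 5)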
PCA will be done with the help of the singular value decomposition method applied to the scaled data matrix. Here is the output's structure, followed by the proportion of total variance explained by each component:
pca_scores <- prcomp(THE8)
str(pca_scores, give.attr = F)
## List of 5
## $ sdev : num [1:9] 0.409 0.253 0.194 0.144 0.118 ...
## $ rotation: num [1:9, 1:9] -0.286 -0.378 -0.573 -0.296 -0.543 ...
## $ center : Named num [1:9] 0.191 0.173 0.467 0.183 0.387 ...
## $ scale : logi FALSE
## $ x : num [1:1448, 1:9] -1.33 -1.26 -1.08 -1.31 -1.35 ...
(pca_scores$sdev^2 / sum(pca_scores$sdev^2)) %>% round(3)
## [1] 0.511 0.195 0.115 0.063 0.043 0.032 0.020 0.014 0.007
As is typical with PCA, the majority of the "weight" is carried by the first and second components.
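As a quick sanity check of the SVD claim above, prcomp's standard deviations can be reproduced from svd() applied to the centered data matrix; a minimal sketch (prcomp centers but, by default, does not rescale):
# prcomp is built on SVD: the singular values divided by sqrt(n - 1)
# should equal the component standard deviations
X <- scale(as.matrix(THE8), center = TRUE, scale = FALSE)
sv <- svd(X)
all.equal(sv$d / sqrt(nrow(X) - 1), pca_scores$sdev)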
The following are the loadings of the original variables on the first four of the resulting principal components, i.e. the coefficients with which each variable enters a component:
# (loadings of the original variables on the components)
round(pca_scores$rotation[, 1:4], 2)
## PC1 PC2 PC3 PC4
## scores_teaching -0.29 0.20 -0.15 -0.62
## scores_research -0.38 0.22 -0.07 -0.51
## scores_citations -0.57 -0.26 -0.68 0.34
## scores_industry_income -0.30 0.81 0.14 0.43
## scores_international_outlook -0.54 -0.35 0.61 0.12
## stats_number_students 0.00 -0.01 -0.05 -0.02
## stats_student_staff_ratio -0.01 -0.01 0.01 0.15
## stats_pc_intl_students -0.25 -0.13 0.35 -0.09
## stats_female_share -0.03 -0.23 -0.01 -0.02
It seems that the first component is mostly associated with the scores_research, scores_international_outlook, and scores_citations variables; if I were up to it, I would call it a "Research outlook" component. The second component is associated mostly with the scores_industry_income variable, relative to the other variables. The third component is mostly associated with scores_citations, scores_international_outlook, and stats_pc_intl_students, which makes it somewhat similar to the first. And the fourth component loads most heavily on scores_teaching, scores_research, and scores_industry_income, making it an education-quality-related component.
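A convenient way to eyeball these loadings is factoextra's variable map, where each arrow's direction and length show how a variable contributes to the first two components; a sketch:
# Variable loadings plotted on the PC1-PC2 plane
fviz_pca_var(pca_scores,
             col.var = "turquoise4",
             repel = TRUE)  # nudge labels apart to avoid overlaps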
summary(pca_scores)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 0.4095 0.2532 0.1944 0.14402 0.11831 0.10281 0.08035
## Proportion of Variance 0.5109 0.1953 0.1151 0.06321 0.04266 0.03221 0.01968
## Cumulative Proportion 0.5109 0.7063 0.8214 0.88462 0.92728 0.95949 0.97917
## PC8 PC9
## Standard deviation 0.06835 0.04652
## Proportion of Variance 0.01424 0.00659
## Cumulative Proportion 0.99341 1.00000
#get_eigenvalue(pca_scores) # shows eigenvalues and percent of explained variance together
Looking at the output above, we can again speculate whether it is useful to keep all of the components or better to drop some of them. A common rule of thumb is to retain enough components to explain at least 70% of the total variance. In our case, the first component (PC1) explains more than half of the total variance; combined with the second (PC2), we get right about 70% explained, which is already quite good, and adding the third (PC3) brings us above 80%. So, for these data, 2-3 principal components should be kept.
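That eyeballing can also be written down directly: the smallest number of components whose cumulative proportion of explained variance passes a chosen threshold, here 70% (a small sketch):
# Index of the first component at which cumulative variance
# explained reaches the 70% threshold
cum_var <- cumsum(pca_scores$sdev^2) / sum(pca_scores$sdev^2)
which(cum_var >= 0.70)[1]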
One more approach to choosing the number of components is the Kaiser-Guttman criterion, which suggests keeping the components whose eigenvalues are greater than 1.
pca_scores$sdev^2
## [1] 0.167650766 0.064097961 0.037774093 0.020740957 0.013998367 0.010570108
## [7] 0.006456299 0.004671878 0.002163861
Here, however, none of the components has an eigenvalue greater than 1. A caveat is in order: the criterion is designed for standardized variables with unit variance, where an eigenvalue below 1 means a component explains less variance than a single original variable. Our variables were min-max rescaled to [0, 1] rather than standardized, so their variances are well below 1 to begin with, and the threshold of 1 is not directly meaningful. A common adaptation is to compare the eigenvalues against their mean instead, under which the first two or three components would still be retained.
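The mean-eigenvalue comparison mentioned above takes two lines (a sketch):
# Keep components whose eigenvalue exceeds the mean eigenvalue
eig <- pca_scores$sdev^2
which(eig > mean(eig))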
Lastly, there is the scree plot, a graphical method for deciding on the number of components. On this kind of plot one looks for an "elbow": the point where the steep drop in explained variance levels off, beyond which further components add little.
fviz_screeplot(pca_scores,
addlabels = TRUE,
barfill = "turquoise4",
barcolor = "turquoise4",
ylim = c(0, 60)
)
In our case we get an almost perfectly smooth decay, resembling a power law, so it is hard to tell where the elbow is. Roughly speaking, I would place it after the first component, which would mean keeping only one component.
The next step is to try to interpret the components; visualizing how the variance spreads across them might help.
# pca3d opens an interactive rgl window; snapshotPCA3d() saves it as a PNG
#pca3d(pca_scores,
#      col = "#E46726")
#snapshotPCA3d(file = "first_biplot.png")
The plot above displays the first 3 components in a 3-dimensional space. As can be seen, the majority of the variance is spread along the PC1 axis.
The data points on such a plot can be colored according to some grouping. In our case, for instance, they can be grouped by the geographical location of the universities, if we assume that for some regions the dimensionality reduction works differently on the assessment parameters.
Let's try this approach and divide the universities into two groups: those located in the US and Canada, labeled North American, and those located in other regions.
THE$NorthAmericaVSothers <- ifelse(THE$location %in% c("United States", "Canada"),
                                   "North America", "others")
#pca3d(pca_scores,
# group = THE$NorthAmericaVSothers)
#snapshotPCA3d(file="sec_biplot.png")
Okay, the yellow points represent North American universities, while the blue points are the remaining ones. It is a bit hard to tell whether the North American universities are spread along one of the components specifically, though.
Perhaps a 2D view can help.
pca2d(pca_scores, group = THE$NorthAmericaVSothers)
Here, only the first two components are pictured. I would say the yellow points are spread fairly evenly along both axes.
Finally, if we assume that the first component is the most useful one and can be used to represent all of the university assessment parameters, we can test whether the two groups of universities distinguished in the previous step (by geo-position) differ in their scores on the first component.
Let's begin with summary statistics.
THEComps <- cbind(THE, pca_scores$x[,1:2]) %>% as.data.frame
THEComps$NorthAmericaVSothers <- as.factor(THEComps$NorthAmericaVSothers)
psych::describeBy(THEComps$PC1, THEComps$NorthAmericaVSothers, mat = T)
It can be seen that the mean PC1 scores are quite different in the two groups: -0.26 for the North America group and 0.04 for the others group.
Let's check whether the means differ statistically. This can be done with a two-sample t-test.
options(scipen = 999)
car::leveneTest(THEComps$PC1 ~ THEComps$NorthAmericaVSothers) #variances are equal
The output above is the Levene test assessing whether the variances in the two groups are equal, a prerequisite for the equal-variance t-test. The p-value is well above the 0.05 alpha level, so the null hypothesis of homogeneous variances cannot be rejected, and the equal-variance test is appropriate.
t.test(THEComps$PC1 ~ THEComps$NorthAmericaVSothers, var.equal = T)
##
## Two Sample t-test
##
## data: THEComps$PC1 by THEComps$NorthAmericaVSothers
## t = -9.9556, df = 1446, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.3589092 -0.2407544
## sample estimates:
## mean in group North America mean in group others
## -0.2582115 0.0416203
The t-test output displays a p-value far below the alpha level. The null hypothesis that the true population means of the two groups are equal can be rejected. Thus, based on their PC1 scores, universities from the US and Canada differ from universities in other regions.
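Since the p-value only speaks to significance, the magnitude of the difference can be quantified with an effect size such as Cohen's d; a manual base-R sketch using the pooled standard deviation:
# Cohen's d: the standardized difference between the two group means
grp <- split(THEComps$PC1, THEComps$NorthAmericaVSothers)
n1 <- length(grp[[1]]); n2 <- length(grp[[2]])
pooled_sd <- sqrt(((n1 - 1) * var(grp[[1]]) + (n2 - 1) * var(grp[[2]])) /
                    (n1 + n2 - 2))
(mean(grp[[1]]) - mean(grp[[2]])) / pooled_sd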
Finally, an interactive visual: the data points are plotted across the first two components, and hovering over each colored dot displays its component scores (coordinates).
p <- plot_ly(THEComps,
             x = ~PC1,
             y = ~PC2,
             text = rownames(THEComps),
             type = "scatter",
             mode = "markers",
             color = ~NorthAmericaVSothers,
             colors = c("#E46726", "turquoise4"),
             marker = list(size = 6))
layout(p, title = "Principal Component Analysis: Universities Assessment",
xaxis = list(title = "PC 1"),
yaxis = list(title = "PC 2"))
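If the interactive figure needs to live outside this page, it can be saved as a standalone HTML file; a sketch assuming the htmlwidgets package is installed:
# Save the plotly widget as a self-contained HTML page
fig <- layout(p, title = "Principal Component Analysis: Universities Assessment",
              xaxis = list(title = "PC 1"),
              yaxis = list(title = "PC 2"))
htmlwidgets::saveWidget(fig, "pca_universities.html")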
And here the chapter comes to an end. Thank you for tuning in!