Introduction
In this paper, I will conduct a Principal Component Analysis (PCA) on a dataset containing various education indicators for 107 countries and regions, measured in 2014. According to Bartholomew (2010), PCA's prime purpose is to “reduce the dimensionality of multivariate data set and, also, illuminating its interpretation by identifying a smaller number of variables which, in a certain sense, summarize the larger set”.
In line with this definition, I will use PCA both to reduce the dimensionality of the dataset and to extract a smaller number of variables that summarize the initial set.
Data
Before I conduct the PCA, let me outline the most important information about the variables contained in the dataset.
dane <- read.csv('Education Indicators 2014.csv')
summary(dane)
## Country.Name PPT GDP
## Albania : 1 Min. : 351706 Min. : 1399000000
## Arab World: 1 1st Qu.: 7232819 1st Qu.: 31686733614
## Azerbaijan: 1 Median : 28174724 Median : 378416000000
## Belarus : 1 Mean : 652216785 Mean : 4903726840450
## Belgium : 1 3rd Qu.: 615735976 3rd Qu.: 2479755000000
## Belize : 1 Max. :7260780278 Max. :78377100000000
## (Other) :101
## PRPE OOCP ESE EPE
## Min. : 0.000 Min. : 397 Min. : 30230 Min. : 24072
## 1st Qu.: 0.760 1st Qu.: 15124 1st Qu.: 592028 1st Qu.: 479808
## Median : 2.510 Median : 257525 Median : 2670847 Median : 3619640
## Mean : 4.276 Mean : 6511374 Mean : 50610551 Mean : 67869899
## 3rd Qu.: 6.745 3rd Qu.: 4509864 3rd Qu.: 44676630 3rd Qu.: 65868878
## Max. :24.250 Max. :60943456 Max. :568015872 Max. :719054784
##
## UNEMP LEB TDP
## Min. : 1.000 Min. :49.70 Min. :4.000
## 1st Qu.: 4.950 1st Qu.:66.31 1st Qu.:5.000
## Median : 6.540 Median :72.27 Median :6.000
## Mean : 8.176 Mean :70.95 Mean :5.617
## 3rd Qu.: 9.130 3rd Qu.:75.87 3rd Qu.:6.000
## Max. :31.000 Max. :83.98 Max. :7.000
##
head(dane)
## Country.Name PPT GDP PRPE OOCP ESE EPE
## 1 Albania 2893654 13219857459 0.73 7097 333291 195720
## 2 Arab World 384222592 2889750000000 6.42 6827541 30972246 45126932
## 3 United Arab Emirates 9086139 401958000000 0.22 14611 411040 409776
## 4 Azerbaijan 9535079 75198010965 0.16 22821 949294 517708
## 5 Burundi 10816860 3093647227 24.25 69246 583308 2046794
## 6 Belgium 11231213 531762000000 2.49 6538 1210112 773568
## UNEMP LEB TDP
## 1 16.10 77.83 5
## 2 11.52 70.57 6
## 3 3.60 77.37 5
## 4 5.20 70.76 4
## 5 6.90 56.69 6
## 6 8.50 80.59 6
str(dane)
## 'data.frame': 107 obs. of 10 variables:
## $ Country.Name: Factor w/ 107 levels "Albania","Arab World",..: 1 2 104 3 12 5 7 11 10 9 ...
## $ PPT : num 2893654 384222592 9086139 9535079 10816860 ...
## $ GDP : num 13219857459 2889750000000 401958000000 75198010965 3093647227 ...
## $ PRPE : num 0.73 6.42 0.22 0.16 24.25 ...
## $ OOCP : int 7097 6827541 14611 22821 69246 6538 70228 956718 9514 1668 ...
## $ ESE : int 333291 30972246 411040 949294 583308 1210112 896763 841886 518914 297460 ...
## $ EPE : int 195720 45126932 409776 517708 2046794 773568 2133330 2594024 258840 161023 ...
## $ UNEMP : num 16.1 11.5 3.6 5.2 6.9 ...
## $ LEB : num 77.8 70.6 77.4 70.8 56.7 ...
## $ TDP : int 5 6 5 4 6 6 6 6 4 5 ...
The dataset (gathered from http://data.worldbank.org/) contains 10 variables; the first is the name of the country or region and the rest are various education indicators. The abbreviations stand for:
PPT - Population
GDP - Gross Domestic Product
PRPE - Percentage of repeaters in Primary Education
OOCP - Out-of-school children of Primary School
ESE - Enrolment in Secondary Education
EPE - Enrolment in Primary Education
UNEMP - Unemployment
LEB - Life expectancy at birth
TDP - Theoretical Duration of Primary Education
PCA
With that said, we are ready to begin the PCA. To conduct it, I use the PCA function from the FactoMineR package, with the parameter scale.unit set to TRUE so that the variables are standardized and comparable. I also plot the PCA result, which will be discussed in more detail shortly.
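The chunk that created the PCA object does not appear above, so here is a minimal sketch of what the call could look like, using the object name res.pca that the rest of the paper relies on:
library(FactoMineR)
# Columns 2:10 hold the numeric indicators; Country.Name is excluded.
# scale.unit = TRUE standardizes the variables; graph = TRUE draws the
# correlation circle and the individuals plot discussed later.
res.pca <- PCA(dane[, 2:10], scale.unit = TRUE, graph = TRUE)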
Number of components
To determine how many principal components are sufficient, let’s explore the eigenvalues and the proportion of total variance (in percent) that each component explains. In addition to the numerical results, I will use the scree plot (elbow plot) to visualize the variance explained by each component.
library("factoextra")
eig.val <- get_eigenvalue(res.pca)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 4.389464620 48.77182911 48.77183
## Dim.2 1.974395613 21.93772904 70.70956
## Dim.3 0.953600245 10.59555828 81.30512
## Dim.4 0.855736665 9.50818517 90.81330
## Dim.5 0.420190575 4.66878416 95.48209
## Dim.6 0.246365343 2.73739270 98.21948
## Dim.7 0.156813269 1.74236965 99.96185
## Dim.8 0.002191821 0.02435356 99.98620
## Dim.9 0.001241849 0.01379833 100.00000
fviz_eig(res.pca, addlabels = TRUE, ylim = c(0, 50))
According to Kaiser (1961), if the eigenvalue exceeds 1, the principal component accounts for more variance than any single original variable in standardized data. Following that principle, we choose only the first two components. Together they account for more than 70% of the total variance, which is quite a large fraction of the overall variability in the dataset. The scree plot supports the same conclusion: the explained variance levels off at the third dimension (the “elbow”), so only the first two components are retained in the further analysis.
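The Kaiser rule can also be checked programmatically against the eigenvalue table computed above; an illustrative one-liner:
# Count the components whose eigenvalue exceeds 1 (Kaiser criterion)
sum(eig.val[, "eigenvalue"] > 1)
## [1] 2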
Let’s now discuss how well each variable is represented on the PCA axes. To measure that, I will use the squared cosine (cos2): the higher its value, the better the variable is represented on the given principal component.
var <- get_pca_var(res.pca)
library("corrplot")
corrplot(var$cos2, is.corr=FALSE)
The correlation plot above indicates that almost all variables (except UNEMP and TDP) are well represented by just the first two principal components. Another way to visualize the correlation between variables and components is the correlation circle, which was already plotted by the PCA function (with the parameter graph set to TRUE).
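A quick numeric counterpart to this reading (an illustrative check, not part of the original output) is to sum each variable’s cos2 over the first two components; values close to 1 indicate good representation:
# Quality of representation on the first two components combined
rowSums(var$cos2[, 1:2])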
This time I will use colours (scaled by squared cosine value) to highlight how important each variable is in building the principal components and to show the dependence between them.
fviz_pca_var(res.pca, col.var = "cos2", repel = TRUE,
gradient.cols = c("#CCE5FF", "#0080FF", "#000099"))
Variables positioned close to the horizontal axis, with arrows reaching near the circumference of the circle, are highly important in building the first principal component. Similarly, variables with high squared cosine values (arrows near the circumference) that lie close to the vertical axis (namely LEB and PRPE) are the most significant contributors to the variability explained by the second component. In contrast, UNEMP and TDP lie close to the plot origin and are less important for the first two components.
The correlation circle provides another significant piece of information: it shows how the variables are correlated with each other. Variables that lie close to each other are positively correlated; the more similar the direction of the arrows, the more linear the relationship between the pair of variables. For example, the arrows for PPT and ESE almost overlap, so the correlation between them is very close to 1. This is understandable, as the number of children who enrol in school depends strongly on how many people live in a particular country. Conversely, variables lying on opposite sides of the origin (such as PRPE and LEB) are negatively correlated.
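This near-perfect correlation is easy to verify directly from the data; for example:
# Pearson correlation between population and enrolment in secondary education
cor(dane$PPT, dane$ESE)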
The importance of variables to the first two components can be further investigated by checking their contributions to the chosen PCs.
fviz_contrib(res.pca, "var", axes=1:2, xtickslab.rt=90)
As stated before, all variables except TDP and UNEMP contribute highly (at least 10% each) to the variability explained by the two components. This means that only TDP and UNEMP are not strongly correlated with the first two PCs.
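The exact percentages behind this bar plot can be inspected in the var object extracted earlier; for example:
# Contributions (in %) of each variable to the first two components
var$contrib[, 1:2]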
Rotated Components
The last part of this paper extracts the Rotated Components and, where possible, provides an interpretation of them. For this purpose, I will use the principal function from the psych package.
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
r.pca <- principal(dane[,2:10], nfactors = 2, rotate='varimax')
r.pca
## Principal Components Analysis
## Call: principal(r = dane[, 2:10], nfactors = 2, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
## RC1 RC2 h2 u2 com
## PPT 0.99 0.04 0.979 0.021 1.0
## GDP 0.81 -0.17 0.689 0.311 1.1
## PRPE -0.10 0.90 0.822 0.178 1.0
## OOCP 0.83 0.32 0.793 0.207 1.3
## ESE 0.98 0.02 0.967 0.033 1.0
## EPE 0.98 0.12 0.973 0.027 1.0
## UNEMP -0.21 -0.18 0.075 0.925 1.9
## LEB 0.01 -0.86 0.735 0.265 1.0
## TDP 0.18 0.55 0.330 0.670 1.2
##
## RC1 RC2
## SS loadings 4.34 2.02
## Proportion Var 0.48 0.22
## Cumulative Var 0.48 0.71
## Proportion Explained 0.68 0.32
## Cumulative Proportion 0.68 1.00
##
## Mean item complexity = 1.2
## Test of the hypothesis that 2 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.07
## with the empirical chi square 38.07 with prob < 0.0058
##
## Fit based upon off diagonal values = 0.98
summary(r.pca)
##
## Factor analysis with Call: principal(r = dane[, 2:10], nfactors = 2, rotate = "varimax")
##
## Test of the hypothesis that 2 factors are sufficient.
## The degrees of freedom for the model is 19 and the objective function was 4.62
## The number of observations was 107 with Chi Square = 465.77 with prob < 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000083
##
## The root mean square of the residuals (RMSA) is 0.07
print(loadings(r.pca), digits = 3, cutoff = 0.5, sort = TRUE)
##
## Loadings:
## RC1 RC2
## PPT 0.989
## GDP 0.813
## OOCP 0.832
## ESE 0.983
## EPE 0.979
## PRPE 0.901
## LEB -0.857
## TDP 0.547
## UNEMP
##
## RC1 RC2
## SS loadings 4.340 2.024
## Proportion Var 0.482 0.225
## Cumulative Var 0.482 0.707
In the example above I used the Varimax rotation, the most popular of the orthogonal rotations; it maximizes the sum of the variances of the squared loadings, where “loadings” are the correlations between variables and components. A cut-off of 0.5 was set, meaning that a correlation of at least 0.5 in absolute terms is required for a variable to be treated as significant for a given Rotated Component. As we can see, the first five variables constitute the first component, whereas the next three constitute the second one. TDP turned out to be fairly informative for the second component, which the earlier results did not suggest. Unemployment remains the only variable that is not significant and would be omitted if further analysis with the Rotated Components were carried out.
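If such an analysis were carried out, the rotated component scores for each country are already available in the fitted object; a minimal sketch:
# Rotated component scores (one row per country), usable in further analysis
scores <- r.pca$scores
head(scores)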
The next step would be to find “umbrella names” for the Rotated Components that accurately and jointly describe the variables within each group. This would reduce the complexity of the data (fewer variables) and, in consequence, make the final results much easier to interpret and present. For this dataset, however, it is very hard to find an “umbrella name” for the selected components. Therefore, the rotation in this paper serves mainly to demonstrate the process of extracting Rotated Components in R; unfortunately, the method is not very practical for this dataset.