Introduction
In this paper, I will conduct a Principal Component Analysis (PCA) on a dataset containing various education indicators for 107 countries and regions, measured in 2014. According to Bartholomew (2010), PCA's prime purpose is to “reduce the dimensionality of multivariate data set and, also, illuminating its interpretation by identifying a smaller number of variables which, in a certain sense, summarize the larger set”.
In line with this definition, I will use PCA both to reduce the dimensionality of the dataset and to extract a smaller number of variables that summarize the initial set.
Data
Before I conduct the PCA, let me outline the most important information about the variables contained in the dataset.
dane <- read.csv('Education Indicators 2014.csv')
summary(dane)
## Country.Name PPT GDP
## Albania : 1 Min. : 351706 Min. : 1399000000
## Arab World: 1 1st Qu.: 7232819 1st Qu.: 31686733614
## Azerbaijan: 1 Median : 28174724 Median : 378416000000
## Belarus : 1 Mean : 652216785 Mean : 4903726840450
## Belgium : 1 3rd Qu.: 615735976 3rd Qu.: 2479755000000
## Belize : 1 Max. :7260780278 Max. :78377100000000
## (Other) :101
## PRPE OOCP ESE EPE
## Min. : 0.000 Min. : 397 Min. : 30230 Min. : 24072
## 1st Qu.: 0.760 1st Qu.: 15124 1st Qu.: 592028 1st Qu.: 479808
## Median : 2.510 Median : 257525 Median : 2670847 Median : 3619640
## Mean : 4.276 Mean : 6511374 Mean : 50610551 Mean : 67869899
## 3rd Qu.: 6.745 3rd Qu.: 4509864 3rd Qu.: 44676630 3rd Qu.: 65868878
## Max. :24.250 Max. :60943456 Max. :568015872 Max. :719054784
##
## UNEMP LEB TDP
## Min. : 1.000 Min. :49.70 Min. :4.000
## 1st Qu.: 4.950 1st Qu.:66.31 1st Qu.:5.000
## Median : 6.540 Median :72.27 Median :6.000
## Mean : 8.176 Mean :70.95 Mean :5.617
## 3rd Qu.: 9.130 3rd Qu.:75.87 3rd Qu.:6.000
## Max. :31.000 Max. :83.98 Max. :7.000
##
head(dane)
## Country.Name PPT GDP PRPE OOCP ESE EPE
## 1 Albania 2893654 13219857459 0.73 7097 333291 195720
## 2 Arab World 384222592 2889750000000 6.42 6827541 30972246 45126932
## 3 United Arab Emirates 9086139 401958000000 0.22 14611 411040 409776
## 4 Azerbaijan 9535079 75198010965 0.16 22821 949294 517708
## 5 Burundi 10816860 3093647227 24.25 69246 583308 2046794
## 6 Belgium 11231213 531762000000 2.49 6538 1210112 773568
## UNEMP LEB TDP
## 1 16.10 77.83 5
## 2 11.52 70.57 6
## 3 3.60 77.37 5
## 4 5.20 70.76 4
## 5 6.90 56.69 6
## 6 8.50 80.59 6
str(dane)
## 'data.frame': 107 obs. of 10 variables:
## $ Country.Name: Factor w/ 107 levels "Albania","Arab World",..: 1 2 104 3 12 5 7 11 10 9 ...
## $ PPT : num 2893654 384222592 9086139 9535079 10816860 ...
## $ GDP : num 13219857459 2889750000000 401958000000 75198010965 3093647227 ...
## $ PRPE : num 0.73 6.42 0.22 0.16 24.25 ...
## $ OOCP : int 7097 6827541 14611 22821 69246 6538 70228 956718 9514 1668 ...
## $ ESE : int 333291 30972246 411040 949294 583308 1210112 896763 841886 518914 297460 ...
## $ EPE : int 195720 45126932 409776 517708 2046794 773568 2133330 2594024 258840 161023 ...
## $ UNEMP : num 16.1 11.5 3.6 5.2 6.9 ...
## $ LEB : num 77.8 70.6 77.4 70.8 56.7 ...
## $ TDP : int 5 6 5 4 6 6 6 6 4 5 ...
The dataset (gathered from http://data.worldbank.org/) contains 10 variables; the first is the name of the country or region and the rest are various education indicators. The abbreviations stand for:
PPT - Population
GDP - Gross Domestic Product
PRPE - Percentage of repeaters in Primary Education
OOCP - Out-of-school children of Primary School
ESE - Enrolment in Secondary Education
EPE - Enrolment in Primary Education
UNEMP - Unemployment
LEB - Life expectancy at birth
TDP - Theoretical Duration of Primary Education
PCA
With that said, we are ready to begin the PCA. To conduct it, I use the PCA function from the FactoMineR package, with the parameter scale.unit set to TRUE so that the variables are standardized and comparable. I also plot the PCA result, which will be discussed in more detail shortly.
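The chunk that created the PCA object does not appear above, so here is a minimal sketch of what the call could look like, using the object name res.pca that the rest of the paper relies on:
library(FactoMineR)
# Columns 2:10 hold the numeric indicators; Country.Name is excluded.
# scale.unit = TRUE standardizes the variables; graph = TRUE draws the
# correlation circle and the individuals plot discussed later.
res.pca <- PCA(dane[, 2:10], scale.unit = TRUE, graph = TRUE)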
Number of components
To determine how many principal components are sufficient, let’s explore the eigenvalues and the proportion of total variance (in percent) that each component explains. In addition to the numerical results, I will use the scree plot (elbow plot) to visualize the variance explained by each component.
library("factoextra")
eig.val <- get_eigenvalue(res.pca)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 4.389464620 48.77182911 48.77183
## Dim.2 1.974395613 21.93772904 70.70956
## Dim.3 0.953600245 10.59555828 81.30512
## Dim.4 0.855736665 9.50818517 90.81330
## Dim.5 0.420190575 4.66878416 95.48209
## Dim.6 0.246365343 2.73739270 98.21948
## Dim.7 0.156813269 1.74236965 99.96185
## Dim.8 0.002191821 0.02435356 99.98620
## Dim.9 0.001241849 0.01379833 100.00000
fviz_eig(res.pca, addlabels = TRUE, ylim = c(0, 50))
According to Kaiser (1961), if the eigenvalue exceeds 1, the principal component accounts for more variance than any single original variable in standardized data. Following that principle, we choose only the first two components. Together they account for more than 70% of the total variance, which is quite a large fraction of the overall variability in the dataset. The scree plot supports the same conclusion: the explained variance levels off at the third dimension (the “elbow”), so only the first two components are retained in the further analysis.
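The Kaiser rule can also be checked programmatically against the eigenvalue table computed above; an illustrative one-liner:
# Count the components whose eigenvalue exceeds 1 (Kaiser criterion)
sum(eig.val[, "eigenvalue"] > 1)
## [1] 2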
Let’s now discuss how well each variable is represented on the PCA axes. To measure that, I will use the squared cosine (cos2): the higher its value, the better the variable is represented on the given principal component.
var <- get_pca_var(res.pca)
library("corrplot")
corrplot(var$cos2, is.corr=FALSE)
The correlation plot above indicates that almost all variables (except UNEMP and TDP) are well represented by just the first two principal components. Another way to visualize the correlation between variables and components is the correlation circle, which was already plotted by the PCA function (with the parameter graph set to TRUE).
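A quick numeric counterpart to this reading (an illustrative check, not part of the original output) is to sum each variable’s cos2 over the first two components; values close to 1 indicate good representation:
# Quality of representation on the first two components combined
rowSums(var$cos2[, 1:2])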
This time I will use colours (scaled by squared cosine value) to highlight how important each variable is in building the principal components and to show the dependence between them.
fviz_pca_var(res.pca, col.var = "cos2", repel = TRUE,
gradient.cols = c("#CCE5FF", "#0080FF", "#000099"))
Variables positioned close to the horizontal axis, with arrows reaching near the circumference of the circle, are highly important in building the first principal component. Similarly, variables with high squared cosine values (arrows near the circumference) that lie close to the vertical axis (namely LEB and PRPE) are the most significant contributors to the variability explained by the second component. In contrast, UNEMP and TDP lie close to the plot origin and are less important for the first two components.
The correlation circle provides another significant piece of information: it shows how the variables are correlated with each other. Variables that lie close to each other are positively correlated; the more similar the direction of the arrows, the more linear the relationship between the pair of variables. For example, the arrows for PPT and ESE almost overlap, so the correlation between them is very close to 1. This is understandable, as the number of children who enrol in school depends strongly on how many people live in a particular country. Conversely, variables lying on opposite sides of the origin (such as PRPE and LEB) are negatively correlated.
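This near-perfect correlation is easy to verify directly from the data; for example:
# Pearson correlation between population and enrolment in secondary education
cor(dane$PPT, dane$ESE)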
The importance of variables to the first two components can be further investigated by checking their contributions to the chosen PCs.
fviz_contrib(res.pca, "var", axes=1:2, xtickslab.rt=90)
As stated before, all variables except TDP and UNEMP contribute highly (at least 10% each) to the variability explained by the two components. This means that only TDP and UNEMP are not strongly correlated with the first two PCs.
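The exact percentages behind this bar plot can be inspected in the var object extracted earlier; for example:
# Contributions (in %) of each variable to the first two components
var$contrib[, 1:2]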
Rotated Components
The last part of this paper extracts the Rotated Components and, where possible, provides an interpretation of them. For this purpose, I will use the principal function from the psych package.
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
r.pca <- principal(dane[,2:10], nfactors = 2, rotate='varimax')
r.pca
## Principal Components Analysis
## Call: principal(r = dane[, 2:10], nfactors = 2, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
## RC1 RC2 h2 u2 com
## PPT 0.99 0.04 0.979 0.021 1.0
## GDP 0.81 -0.17 0.689 0.311 1.1
## PRPE -0.10 0.90 0.822 0.178 1.0
## OOCP 0.83 0.32 0.793 0.207 1.3
## ESE 0.98 0.02 0.967 0.033 1.0
## EPE 0.98 0.12 0.973 0.027 1.0
## UNEMP -0.21 -0.18 0.075 0.925 1.9
## LEB 0.01 -0.86 0.735 0.265 1.0
## TDP 0.18 0.55 0.330 0.670 1.2
##
## RC1 RC2
## SS loadings 4.34 2.02
## Proportion Var 0.48 0.22
## Cumulative Var 0.48 0.71
## Proportion Explained 0.68 0.32
## Cumulative Proportion 0.68 1.00
##
## Mean item complexity = 1.2
## Test of the hypothesis that 2 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.07
## with the empirical chi square 38.07 with prob < 0.0058
##
## Fit based upon off diagonal values = 0.98
summary(r.pca)
##
## Factor analysis with Call: principal(r = dane[, 2:10], nfactors = 2, rotate = "varimax")
##
## Test of the hypothesis that 2 factors are sufficient.
## The degrees of freedom for the model is 19 and the objective function was 4.62
## The number of observations was 107 with Chi Square = 465.77 with prob < 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000083
##
## The root mean square of the residuals (RMSA) is 0.07
print(loadings(r.pca), digits = 3, cutoff = 0.5, sort = TRUE)
##
## Loadings:
## RC1 RC2
## PPT 0.989
## GDP 0.813
## OOCP 0.832
## ESE 0.983
## EPE 0.979
## PRPE 0.901
## LEB -0.857
## TDP 0.547
## UNEMP
##
## RC1 RC2
## SS loadings 4.340 2.024
## Proportion Var 0.482 0.225
## Cumulative Var 0.482 0.707
In the example above I used the Varimax rotation, the most popular of the orthogonal rotations; it maximizes the sum of the variances of the squared loadings, where “loadings” are the correlations between variables and components. A cut-off of 0.5 was set, meaning that a correlation of at least 0.5 in absolute terms is required for a variable to be treated as significant for a given Rotated Component. As we can see, the first five variables constitute the first component, whereas the next three constitute the second one. TDP turned out to be fairly informative for the second component, which the earlier results did not suggest. Unemployment remains the only variable that is not significant and would be omitted if further analysis with the Rotated Components were carried out.
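If such an analysis were carried out, the rotated component scores for each country are already available in the fitted object; a minimal sketch:
# Rotated component scores (one row per country), usable in further analysis
scores <- r.pca$scores
head(scores)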
The next step would be to find “umbrella names” for the Rotated Components that accurately and jointly describe the variables within each group. This would reduce the complexity of the data (fewer variables) and, in consequence, make the final results much easier to interpret and present. For this dataset, however, it is very hard to find an “umbrella name” for the selected components. Therefore, the rotation in this paper serves mainly to demonstrate the process of extracting Rotated Components in R; unfortunately, the method is not very practical for this dataset.