Principal component analysis (PCA) allows us to summarize and to visualize the information in a data set containing inter-correlated quantitative observations.
Each variable could be considered as a different dimension, but the numbver of four or more dimensions could be difficult to visualize.
PCA can be used to extract the important information from a multivariate data and to express this information as a set of few new variables called principal components. Each principal component corresponds to a linear combination of the original variables. However, the number of principal components can only be equal or less than the number of the originals.
The goal of PCA is to reduce the dimensionality of a multivariate data to two or three principal components, as well as visualize it graphically with minimal loss of information. PCA also identifies directions (or principal components) along which the variation in the data is maximal.
In PCA, the amount of variance retained by each principal component is measured eigenvalue. The method is is particularly useful when the variables within the data set are highly correlated. The idea is that correlation indicates redundancy in the data. For this, PCA is used to reduce the original variables into a smaller number of PC to explain most of the variance in the original variables.
setwd("D:/Class Materials & Work/Summer 2020 practice/Principal Component Analysis")
library("FactoMineR")
library("factoextra")
We will use decathlon2
data set from the factoextra
package. The data used here describes athletes’ performance during two sporting events (Desctar and OlympicG). It contains 27 individuals (athletes) described by 13 variables.
data(decathlon2)
head(decathlon2)
## X100m Long.jump Shot.put High.jump X400m X110m.hurdle Discus
## SEBRLE 11.04 7.58 14.83 2.07 49.81 14.69 43.75
## CLAY 10.76 7.40 14.26 1.86 49.37 14.05 50.72
## BERNARD 11.02 7.23 14.25 1.92 48.93 14.99 40.87
## YURKOV 11.34 7.09 15.19 2.10 50.42 15.31 46.26
## ZSIVOCZKY 11.13 7.30 13.48 2.01 48.62 14.17 45.67
## McMULLEN 10.83 7.31 13.76 2.13 49.91 14.38 44.41
## Pole.vault Javeline X1500m Rank Points Competition
## SEBRLE 5.02 63.19 291.7 1 8217 Decastar
## CLAY 4.92 60.15 301.5 2 8122 Decastar
## BERNARD 5.32 62.77 280.1 4 8067 Decastar
## YURKOV 4.72 63.44 276.4 5 8036 Decastar
## ZSIVOCZKY 4.42 55.37 268.0 7 8004 Decastar
## McMULLEN 4.42 56.37 285.1 8 7995 Decastar
In PCA, we will only use some of the variables (column) and observations (row) for PCA. These component are called active variables and active individuals respectively. The remaining variables and observations are called supplementary variables and individuals, which will be predicted using information and parameters from the PCA. Supplementary variables can either be discrete or continuous.
We start by subsetting active individuals and active variables for the principal component analysis:
decathlon2.active <- decathlon2[1:23, 1:10]
head(decathlon2.active[, 1:6], 4)
## X100m Long.jump Shot.put High.jump X400m X110m.hurdle
## SEBRLE 11.04 7.58 14.83 2.07 49.81 14.69
## CLAY 10.76 7.40 14.26 1.86 49.37 14.05
## BERNARD 11.02 7.23 14.25 1.92 48.93 14.99
## YURKOV 11.34 7.09 15.19 2.10 50.42 15.31
Usually, the data needs to be standardized by putting them on the same scale. This can be done using scale()
with R base package. However, the function FactoMineR::PCA()
standardizes the data automatically during the PCA.
res.pca <- PCA(decathlon2.active, graph = FALSE)
#output
print(res.pca)
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 23 individuals, described by 10 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. for the variables"
## 4 "$var$cor" "correlations variables - dimensions"
## 5 "$var$cos2" "cos2 for the variables"
## 6 "$var$contrib" "contributions of the variables"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$call" "summary statistics"
## 12 "$call$centre" "mean of the variables"
## 13 "$call$ecart.type" "standard error of the variables"
## 14 "$call$row.w" "weights for the individuals"
## 15 "$call$col.w" "weights for the variables"
The created object contains several aspects of the result. Each result will be discussed in the next section.
As described, the eigenvalues measure the amount of variation retained by each principal component. Eigenvalues are typicall large for the first PC and small for the subsequent PCs. That is, the first PCs corresponds to the directions with the maximum amount of variation in the data set.
eig.val <- get_eigenvalue(res.pca)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 4.1242133 41.242133 41.24213
## Dim.2 1.8385309 18.385309 59.62744
## Dim.3 1.2391403 12.391403 72.01885
## Dim.4 0.8194402 8.194402 80.21325
## Dim.5 0.7015528 7.015528 87.22878
## Dim.6 0.4228828 4.228828 91.45760
## Dim.7 0.3025817 3.025817 94.48342
## Dim.8 0.2744700 2.744700 97.22812
## Dim.9 0.1552169 1.552169 98.78029
## Dim.10 0.1219710 1.219710 100.00000
The proportion of variation explained by each eigenvalue is given in the second column. Eigenvalues can be used to determine the number of principal components to retain after PCA.
An eigenvalue > 1 indicates that PCs account for more variance than the original variables. This is commonly used as a cutoff point for which PCs are retained, with the assumption that the data is standardized. We can also limit the number of explained variance as well to filter out small PCs.
The scree plot can be produced using the function fviz_eig()
or fviz_screeplot()
.
fviz_eig(res.pca, addlabels = TRUE, ylim = c(0, 50))
From the plot above, we might want to stop at the fifth principal component. 87% of the information (variances) contained in the data are retained by the first five principal components.
We can extract results from variables with factorextra::get_pca_var()
var <- get_pca_var(res.pca)
var
## Principal Component Analysis Results for variables
## ===================================================
## Name Description
## 1 "$coord" "Coordinates for the variables"
## 2 "$cor" "Correlations between variables and dimensions"
## 3 "$cos2" "Cos2 for the variables"
## 4 "$contrib" "contributions of the variables"
The components of the result can be plotted as follows:
var$coord
for scatter plotvar$cos2
represents the quality of representation for variables on the factor mapvar$contrib
contains the contributions of the variables to the PCsThe different components can be accessed as follows:
# Coordinates
head(var$coord)
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## X100m -0.8506257 -0.17939806 0.3015564 0.03357320 -0.1944440
## Long.jump 0.7941806 0.28085695 -0.1905465 -0.11538956 0.2331567
## Shot.put 0.7339127 0.08540412 0.5175978 0.12846837 -0.2488129
## High.jump 0.6100840 -0.46521415 0.3300852 0.14455012 0.4027002
## X400m -0.7016034 0.29017826 0.2835329 0.43082552 0.1039085
## X110m.hurdle -0.7641252 -0.02474081 0.4488873 -0.01689589 0.2242200
# Cos2: quality on the factore map
head(var$cos2)
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## X100m 0.7235641 0.0321836641 0.09093628 0.0011271597 0.03780845
## Long.jump 0.6307229 0.0788806285 0.03630798 0.0133147506 0.05436203
## Shot.put 0.5386279 0.0072938636 0.26790749 0.0165041211 0.06190783
## High.jump 0.3722025 0.2164242070 0.10895622 0.0208947375 0.16216747
## X400m 0.4922473 0.0842034209 0.08039091 0.1856106269 0.01079698
## X110m.hurdle 0.5838873 0.0006121077 0.20149984 0.0002854712 0.05027463
# Contributions to the principal components
head(var$contrib)
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## X100m 17.544293 1.7505098 7.338659 0.13755240 5.389252
## Long.jump 15.293168 4.2904162 2.930094 1.62485936 7.748815
## Shot.put 13.060137 0.3967224 21.620432 2.01407269 8.824401
## High.jump 9.024811 11.7715838 8.792888 2.54987951 23.115504
## X400m 11.935544 4.5799296 6.487636 22.65090599 1.539012
## X110m.hurdle 14.157544 0.0332933 16.261261 0.03483735 7.166193
We will visualize and draw conclusions from the extracted component above. Correlation circle will be plotted, while variables will be highlighted according to their quality of representation and contribution to the PCs.
The correlation between a variable and a principal component (PC) is used as the coordinates of the variable on the PC.
# Coordinates of variables
head(var$coord, 4)
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## X100m -0.8506257 -0.17939806 0.3015564 0.0335732 -0.1944440
## Long.jump 0.7941806 0.28085695 -0.1905465 -0.1153896 0.2331567
## Shot.put 0.7339127 0.08540412 0.5175978 0.1284684 -0.2488129
## High.jump 0.6100840 -0.46521415 0.3300852 0.1445501 0.4027002
#plot
fviz_pca_var(res.pca, col.var = "black")
The variable correlation plot above shows the relationships between all variables. It can be interpreted as follows:
head(var$cos2, 4)
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## X100m 0.7235641 0.032183664 0.09093628 0.00112716 0.03780845
## Long.jump 0.6307229 0.078880629 0.03630798 0.01331475 0.05436203
## Shot.put 0.5386279 0.007293864 0.26790749 0.01650412 0.06190783
## High.jump 0.3722025 0.216424207 0.10895622 0.02089474 0.16216747
corrplot::corrplot(var$cos2, is.corr=FALSE)
# Total cos2 of variables on Dim.1 and Dim.2
fviz_cos2(res.pca, choice = "var", axes = 1:2)
A high cos2 indicates a good representation of the variable on the principal component. The sum of the cos2 on all the principal components is equal to one.
It’s possible to color variables by their cos2 values using the argument col.var = "cos2"
.
# Color by cos2 values: quality on the factor map
fviz_pca_var(res.pca, col.var = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE) # Avoid text overlapping
# Indicate cos2 by transparency
fviz_pca_var(res.pca, alpha.var = "cos2")
The contributions of variables in accounting for the variability in a given principal component are expressed in percentage. Variables with low contribution can be removed for simplification purpose.
Variables that are correlated with PC1 (i.e., Dim.1) and PC2 (i.e., Dim.2) are the most important in explaining the variability in the data set.
head(var$contrib, 4)
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## X100m 17.544293 1.7505098 7.338659 0.1375524 5.389252
## Long.jump 15.293168 4.2904162 2.930094 1.6248594 7.748815
## Shot.put 13.060137 0.3967224 21.620432 2.0140727 8.824401
## High.jump 9.024811 11.7715838 8.792888 2.5498795 23.115504
corrplot::corrplot(var$contrib, is.corr=FALSE)
# Top 10 Contributions of variables to PC1
fviz_contrib(res.pca, choice = "var", axes = 1, top = 10)
# Top 10 Contributions of variables to PC2
fviz_contrib(res.pca, choice = "var", axes = 2, top = 10)
#Total contribution of PC1 and PC2
fviz_contrib(res.pca, choice = "var", axes = 1:2, top = 10)
The red dashed line on the graph above indicates the expected average contribution. It can be seen that the variables - X100m, Long.jump and Pole.vault - contribute the most to the dimensions 1 and 2.
The most important (or, contributing) variables can be highlighted on the correlation plot as follow:
fviz_pca_var(res.pca, col.var = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"))
# Change the transparency by contrib values
fviz_pca_var(res.pca, alpha.var = "contrib")
It is possible to color variables by any custom continuous variable. The coloring variable should have the same length as the number of active variables in the PCA (here n = 10).
# Create a random continuous variable of length 10
set.seed(123)
my.cont.var <- rnorm(10)
# Color variables by the continuous variable
fviz_pca_var(res.pca, col.var = my.cont.var,
gradient.cols = c("blue", "yellow", "red"),
legend.title = "Cont.Var")
It’s also possible to change the color of variables by groups defined by a qualitative/categorical variable.
# Create a grouping variable using kmeans
# Create 3 groups of variables (centers = 3)
set.seed(123)
res.km <- kmeans(var$coord, centers = 3, nstart = 25)
grp <- as.factor(res.km$cluster)
# Color variables by groups
fviz_pca_var(res.pca, col.var = grp,
palette = c("#0073C2FF", "#EFC000FF", "#868686FF"),
legend.title = "Cluster")
dimdesc()
can be used to identify the most significantly associated variables with a given principal component.
res.desc <- dimdesc(res.pca, axes = c(1,2), proba = 0.05)
# Description of dimension 1
res.desc$Dim.1
## $quanti
## correlation p.value
## Long.jump 0.7941806 6.059893e-06
## Discus 0.7432090 4.842563e-05
## Shot.put 0.7339127 6.723102e-05
## High.jump 0.6100840 1.993677e-03
## Javeline 0.4282266 4.149192e-02
## X400m -0.7016034 1.910387e-04
## X110m.hurdle -0.7641252 2.195812e-05
## X100m -0.8506257 2.727129e-07
##
## attr(,"class")
## [1] "condes" "list"
# Description of dimension 2
res.desc$Dim.2
## $quanti
## correlation p.value
## Pole.vault 0.8074511 3.205016e-06
## X1500m 0.7844802 9.384747e-06
## High.jump -0.4652142 2.529390e-02
##
## attr(,"class")
## [1] "condes" "list"
The results, for individuals can be extracted using get_pca_ind()
.
ind <- get_pca_ind(res.pca)
ind
## Principal Component Analysis Results for individuals
## ===================================================
## Name Description
## 1 "$coord" "Coordinates for the individuals"
## 2 "$cos2" "Cos2 for the individuals"
## 3 "$contrib" "contributions of the individuals"
# Coordinates of individuals
head(ind$coord)
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## SEBRLE 0.1955047 1.5890567 0.6424912 0.08389652 1.16829387
## CLAY 0.8078795 2.4748137 -1.3873827 1.29838232 -0.82498206
## BERNARD -1.3591340 1.6480950 0.2005584 -1.96409420 0.08419345
## YURKOV -0.8889532 -0.4426067 2.5295843 0.71290837 0.40782264
## ZSIVOCZKY -0.1081216 -2.0688377 -1.3342591 -0.10152796 -0.20145217
## McMULLEN 0.1212195 -1.0139102 -0.8625170 1.34164291 1.62151286
# Quality of individuals
head(ind$cos2)
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## SEBRLE 0.007530179 0.49747323 0.081325232 0.001386688 0.2689026575
## CLAY 0.048701249 0.45701660 0.143628117 0.125791741 0.0507850580
## BERNARD 0.197199804 0.28996555 0.004294015 0.411819183 0.0007567259
## YURKOV 0.096109800 0.02382571 0.778230322 0.061812637 0.0202279796
## ZSIVOCZKY 0.001574385 0.57641944 0.239754152 0.001388216 0.0054654972
## McMULLEN 0.002175437 0.15219499 0.110137872 0.266486530 0.3892621478
# Contributions of individuals
head(ind$contrib)
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## SEBRLE 0.04029447 5.9714533 1.4483919 0.03734589 8.45894063
## CLAY 0.68805664 14.4839248 6.7537381 8.94458283 4.21794385
## BERNARD 1.94740183 6.4234107 0.1411345 20.46819433 0.04393073
## YURKOV 0.83308415 0.4632733 22.4517396 2.69663605 1.03075263
## ZSIVOCZKY 0.01232413 10.1217143 6.2464325 0.05469230 0.25151025
## McMULLEN 0.01549089 2.4310854 2.6102794 9.55055888 16.29493304
fviz_pca_ind(res.pca)
#Color by cos2 value
fviz_pca_ind(res.pca, col.ind = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE) # Avoid text overlapping (slow if many points)
#Point size by cos2 value
fviz_pca_ind(res.pca, pointsize = "cos2",
pointshape = 21, fill = "#E7B800",
repel = TRUE) # Avoid text overlapping (slow if many points)
#Both size and color by cos2
fviz_pca_ind(res.pca, col.ind = "cos2", pointsize = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE) # Avoid text overlapping (slow if many points)
We can also create a bar plot of the quality of representation (cos2) of individuals.
fviz_cos2(res.pca, choice = "ind")
# Total contribution on PC1 and PC2
fviz_contrib(res.pca, choice = "ind", axes = 1:2)
As for variables, individuals can be colored by any custom continuous variable with the argument col.ind
# Create a random continuous variable of length 23,
# Same length as the number of active individuals in the PCA
set.seed(123)
my.cont.var <- rnorm(23)
# Color individuals by the continuous variable
fviz_pca_ind(res.pca, col.ind = my.cont.var,
gradient.cols = c("blue", "yellow", "red"),
legend.title = "Cont.Var")
Both suplementary continuous and categorical variables are not used for the determination of the principal components, but rather used as the predicted elements from their coordinates through the information provided by the PCA.
To specify supplementary individuals and variables, the function PCA()
can be used as follow:
res.pca_2 <- PCA(decathlon2, ind.sup = 24:27,
quanti.sup = 11:12, quali.sup = 13, graph=FALSE)
Predicted results (coordinates, correlation and cos2) for the supplementary quantitative variables:
res.pca_2$quanti.sup
## $coord
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## Rank -0.7014777 -0.24519443 -0.1834294 0.05575186 -0.07382647
## Points 0.9637075 0.07768262 0.1580225 -0.16623092 -0.03114711
##
## $cor
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## Rank -0.7014777 -0.24519443 -0.1834294 0.05575186 -0.07382647
## Points 0.9637075 0.07768262 0.1580225 -0.16623092 -0.03114711
##
## $cos2
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## Rank 0.4920710 0.060120310 0.03364635 0.00310827 0.0054503477
## Points 0.9287322 0.006034589 0.02497110 0.02763272 0.0009701427
#visualize
fviz_pca_var(res.pca_2)
by default, supplementary quantitative variables are shown in blue color and dashed lines. We can further customize the plot by:
# Change color of variables
fviz_pca_var(res.pca_2,
col.var = "black", # Active variables
col.quanti.sup = "red" # Suppl. quantitative variables
)
# Hide active variables on the plot and show only supplementary variables
fviz_pca_var(res.pca_2, invisible = "var")
# Hide supplementary variables
fviz_pca_var(res.pca_2, invisible = "quanti.sup")
Predicted results for the supplementary individuals (ind.sup
).
res.pca_2$ind.sup
## $coord
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## KARPOV 0.7947206 0.77951227 -1.6330203 1.7242283 -0.75070396
## WARNERS -0.3864645 -0.12159237 -1.7387332 -0.7063341 -0.03230011
## Nool -0.5591306 1.97748871 -0.4830358 -2.2784526 -0.25461493
## Drews -1.1092038 0.01741477 -3.0488182 -1.5343468 -0.32642192
##
## $cos2
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## KARPOV 0.05104677 4.911173e-02 0.21553730 0.24028620 0.0455487744
## WARNERS 0.02422707 2.398250e-03 0.49039677 0.08092862 0.0001692349
## Nool 0.02897149 3.623868e-01 0.02162236 0.48108780 0.0060077529
## Drews 0.09207094 2.269527e-05 0.69560547 0.17617609 0.0079736753
##
## $dist
## KARPOV WARNERS Nool Drews
## 3.517470 2.482899 3.284943 3.655527
Visualize all individuals (active and supplementary ones). We can also add the supplementary qualitative variables with res.pca$quali.supp$coord
.
p_2 <- fviz_pca_ind(res.pca_2, col.ind.sup = "blue", repel = TRUE)
p_2 <- fviz_add(p_2, res.pca_2$quali.sup$coord, color = "red")
p_2
Supplementary individuals are shown in blue. The levels of the supplementary qualitative variable are shown in red color.
The results concerning the supplementary qualitative variable are:
res.pca_2$quali
## $coord
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## Decastar -1.343451 0.1218097 -0.03789524 0.1808357 0.1343364
## OlympicG 1.231497 -0.1116589 0.03473730 -0.1657661 -0.1231417
##
## $cos2
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## Decastar 0.9051233 0.007440939 0.0007201669 0.01639956 0.009050062
## OlympicG 0.9051233 0.007440939 0.0007201669 0.01639956 0.009050062
##
## $v.test
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## Decastar -2.970766 0.4034256 -0.1528767 0.8971036 0.7202457
## OlympicG 2.970766 -0.4034256 0.1528767 -0.8971036 -0.7202457
##
## $dist
## Decastar OlympicG
## 1.412108 1.294433
##
## $eta2
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## Competition 0.4011568 0.00739783 0.001062332 0.03658159 0.02357972
We can use the argument habillage
to color individuals by a supplementary qualitative variable.
fviz_pca_ind(res.pca_2, habillage = 13,
addEllipses =TRUE, ellipse.type = "confidence",
palette = "jco", repel = TRUE)