Principal Component Analysis

Principal component analysis (PCA) allows us to summarize and to visualize the information in a data set containing inter-correlated quantitative observations.

Each variable could be considered as a different dimension, but the numbver of four or more dimensions could be difficult to visualize.

PCA can be used to extract the important information from a multivariate data and to express this information as a set of few new variables called principal components. Each principal component corresponds to a linear combination of the original variables. However, the number of principal components can only be equal or less than the number of the originals.

The goal of PCA is to reduce the dimensionality of a multivariate data to two or three principal components, as well as visualize it graphically with minimal loss of information. PCA also identifies directions (or principal components) along which the variation in the data is maximal.

In PCA, the amount of variance retained by each principal component is measured eigenvalue. The method is is particularly useful when the variables within the data set are highly correlated. The idea is that correlation indicates redundancy in the data. For this, PCA is used to reduce the original variables into a smaller number of PC to explain most of the variance in the original variables.

setwd("D:/Class Materials & Work/Summer 2020 practice/Principal Component Analysis")

library("FactoMineR")
library("factoextra")

We will use decathlon2 data set from the factoextra package. The data used here describes athletes’ performance during two sporting events (Desctar and OlympicG). It contains 27 individuals (athletes) described by 13 variables.

data(decathlon2)
head(decathlon2)

##           X100m Long.jump Shot.put High.jump X400m X110m.hurdle Discus
## SEBRLE    11.04      7.58    14.83      2.07 49.81        14.69  43.75
## CLAY      10.76      7.40    14.26      1.86 49.37        14.05  50.72
## BERNARD   11.02      7.23    14.25      1.92 48.93        14.99  40.87
## YURKOV    11.34      7.09    15.19      2.10 50.42        15.31  46.26
## ZSIVOCZKY 11.13      7.30    13.48      2.01 48.62        14.17  45.67
## McMULLEN  10.83      7.31    13.76      2.13 49.91        14.38  44.41
##           Pole.vault Javeline X1500m Rank Points Competition
## SEBRLE          5.02    63.19  291.7    1   8217    Decastar
## CLAY            4.92    60.15  301.5    2   8122    Decastar
## BERNARD         5.32    62.77  280.1    4   8067    Decastar
## YURKOV          4.72    63.44  276.4    5   8036    Decastar
## ZSIVOCZKY       4.42    55.37  268.0    7   8004    Decastar
## McMULLEN        4.42    56.37  285.1    8   7995    Decastar

In PCA, we will only use some of the variables (column) and observations (row) for PCA. These component are called active variables and active individuals respectively. The remaining variables and observations are called supplementary variables and individuals, which will be predicted using information and parameters from the PCA. Supplementary variables can either be discrete or continuous.

We start by subsetting active individuals and active variables for the principal component analysis:

decathlon2.active <- decathlon2[1:23, 1:10]
head(decathlon2.active[, 1:6], 4)

##         X100m Long.jump Shot.put High.jump X400m X110m.hurdle
## SEBRLE  11.04      7.58    14.83      2.07 49.81        14.69
## CLAY    10.76      7.40    14.26      1.86 49.37        14.05
## BERNARD 11.02      7.23    14.25      1.92 48.93        14.99
## YURKOV  11.34      7.09    15.19      2.10 50.42        15.31

Usually, the data needs to be standardized by putting them on the same scale. This can be done using scale() with R base package. However, the function FactoMineR::PCA() standardizes the data automatically during the PCA.

res.pca <- PCA(decathlon2.active, graph = FALSE)

#output
print(res.pca)

## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 23 individuals, described by 10 variables
## *The results are available in the following objects:
## 
##    name               description                          
## 1  "$eig"             "eigenvalues"                        
## 2  "$var"             "results for the variables"          
## 3  "$var$coord"       "coord. for the variables"           
## 4  "$var$cor"         "correlations variables - dimensions"
## 5  "$var$cos2"        "cos2 for the variables"             
## 6  "$var$contrib"     "contributions of the variables"     
## 7  "$ind"             "results for the individuals"        
## 8  "$ind$coord"       "coord. for the individuals"         
## 9  "$ind$cos2"        "cos2 for the individuals"           
## 10 "$ind$contrib"     "contributions of the individuals"   
## 11 "$call"            "summary statistics"                 
## 12 "$call$centre"     "mean of the variables"              
## 13 "$call$ecart.type" "standard error of the variables"    
## 14 "$call$row.w"      "weights for the individuals"        
## 15 "$call$col.w"      "weights for the variables"

The created object contains several aspects of the result. Each result will be discussed in the next section.

Visualization and Interpretation

As described, the eigenvalues measure the amount of variation retained by each principal component. Eigenvalues are typicall large for the first PC and small for the subsequent PCs. That is, the first PCs corresponds to the directions with the maximum amount of variation in the data set.

eig.val <- get_eigenvalue(res.pca)

eig.val

##        eigenvalue variance.percent cumulative.variance.percent
## Dim.1   4.1242133        41.242133                    41.24213
## Dim.2   1.8385309        18.385309                    59.62744
## Dim.3   1.2391403        12.391403                    72.01885
## Dim.4   0.8194402         8.194402                    80.21325
## Dim.5   0.7015528         7.015528                    87.22878
## Dim.6   0.4228828         4.228828                    91.45760
## Dim.7   0.3025817         3.025817                    94.48342
## Dim.8   0.2744700         2.744700                    97.22812
## Dim.9   0.1552169         1.552169                    98.78029
## Dim.10  0.1219710         1.219710                   100.00000

The proportion of variation explained by each eigenvalue is given in the second column. Eigenvalues can be used to determine the number of principal components to retain after PCA.

An eigenvalue > 1 indicates that PCs account for more variance than the original variables. This is commonly used as a cutoff point for which PCs are retained, with the assumption that the data is standardized. We can also limit the number of explained variance as well to filter out small PCs.

The scree plot can be produced using the function fviz_eig() or fviz_screeplot().

fviz_eig(res.pca, addlabels = TRUE, ylim = c(0, 50))

From the plot above, we might want to stop at the fifth principal component. 87% of the information (variances) contained in the data are retained by the first five principal components.

Graph of variables

We can extract results from variables with factorextra::get_pca_var()

var <- get_pca_var(res.pca)

var

## Principal Component Analysis Results for variables
##  ===================================================
##   Name       Description                                    
## 1 "$coord"   "Coordinates for the variables"                
## 2 "$cor"     "Correlations between variables and dimensions"
## 3 "$cos2"    "Cos2 for the variables"                       
## 4 "$contrib" "contributions of the variables"

The components of the result can be plotted as follows:

var$coord for scatter plot
var$cos2 represents the quality of representation for variables on the factor map
var$contrib contains the contributions of the variables to the PCs

The different components can be accessed as follows:

# Coordinates
head(var$coord)

##                   Dim.1       Dim.2      Dim.3       Dim.4      Dim.5
## X100m        -0.8506257 -0.17939806  0.3015564  0.03357320 -0.1944440
## Long.jump     0.7941806  0.28085695 -0.1905465 -0.11538956  0.2331567
## Shot.put      0.7339127  0.08540412  0.5175978  0.12846837 -0.2488129
## High.jump     0.6100840 -0.46521415  0.3300852  0.14455012  0.4027002
## X400m        -0.7016034  0.29017826  0.2835329  0.43082552  0.1039085
## X110m.hurdle -0.7641252 -0.02474081  0.4488873 -0.01689589  0.2242200

# Cos2: quality on the factore map
head(var$cos2)

##                  Dim.1        Dim.2      Dim.3        Dim.4      Dim.5
## X100m        0.7235641 0.0321836641 0.09093628 0.0011271597 0.03780845
## Long.jump    0.6307229 0.0788806285 0.03630798 0.0133147506 0.05436203
## Shot.put     0.5386279 0.0072938636 0.26790749 0.0165041211 0.06190783
## High.jump    0.3722025 0.2164242070 0.10895622 0.0208947375 0.16216747
## X400m        0.4922473 0.0842034209 0.08039091 0.1856106269 0.01079698
## X110m.hurdle 0.5838873 0.0006121077 0.20149984 0.0002854712 0.05027463

# Contributions to the principal components
head(var$contrib)

##                  Dim.1      Dim.2     Dim.3       Dim.4     Dim.5
## X100m        17.544293  1.7505098  7.338659  0.13755240  5.389252
## Long.jump    15.293168  4.2904162  2.930094  1.62485936  7.748815
## Shot.put     13.060137  0.3967224 21.620432  2.01407269  8.824401
## High.jump     9.024811 11.7715838  8.792888  2.54987951 23.115504
## X400m        11.935544  4.5799296  6.487636 22.65090599  1.539012
## X110m.hurdle 14.157544  0.0332933 16.261261  0.03483735  7.166193

We will visualize and draw conclusions from the extracted component above. Correlation circle will be plotted, while variables will be highlighted according to their quality of representation and contribution to the PCs.

Correlation circle

The correlation between a variable and a principal component (PC) is used as the coordinates of the variable on the PC.

# Coordinates of variables
head(var$coord, 4)

##                Dim.1       Dim.2      Dim.3      Dim.4      Dim.5
## X100m     -0.8506257 -0.17939806  0.3015564  0.0335732 -0.1944440
## Long.jump  0.7941806  0.28085695 -0.1905465 -0.1153896  0.2331567
## Shot.put   0.7339127  0.08540412  0.5175978  0.1284684 -0.2488129
## High.jump  0.6100840 -0.46521415  0.3300852  0.1445501  0.4027002

#plot
fviz_pca_var(res.pca, col.var = "black")

The variable correlation plot above shows the relationships between all variables. It can be interpreted as follows:

Positively and negatively correlated variables are grouped together in an opposite manner.
The further the variables are from the origin, the better the quality of them on the factor map.

Quality of representation

head(var$cos2, 4)

##               Dim.1       Dim.2      Dim.3      Dim.4      Dim.5
## X100m     0.7235641 0.032183664 0.09093628 0.00112716 0.03780845
## Long.jump 0.6307229 0.078880629 0.03630798 0.01331475 0.05436203
## Shot.put  0.5386279 0.007293864 0.26790749 0.01650412 0.06190783
## High.jump 0.3722025 0.216424207 0.10895622 0.02089474 0.16216747

corrplot::corrplot(var$cos2, is.corr=FALSE)

# Total cos2 of variables on Dim.1 and Dim.2
fviz_cos2(res.pca, choice = "var", axes = 1:2)

A high cos2 indicates a good representation of the variable on the principal component. The sum of the cos2 on all the principal components is equal to one.

It’s possible to color variables by their cos2 values using the argument col.var = "cos2".

# Color by cos2 values: quality on the factor map

fviz_pca_var(res.pca, col.var = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
             repel = TRUE) # Avoid text overlapping

# Indicate cos2 by transparency
fviz_pca_var(res.pca, alpha.var = "cos2")

Contributions of variables to PCs

The contributions of variables in accounting for the variability in a given principal component are expressed in percentage. Variables with low contribution can be removed for simplification purpose.

Variables that are correlated with PC1 (i.e., Dim.1) and PC2 (i.e., Dim.2) are the most important in explaining the variability in the data set.

head(var$contrib, 4)

##               Dim.1      Dim.2     Dim.3     Dim.4     Dim.5
## X100m     17.544293  1.7505098  7.338659 0.1375524  5.389252
## Long.jump 15.293168  4.2904162  2.930094 1.6248594  7.748815
## Shot.put  13.060137  0.3967224 21.620432 2.0140727  8.824401
## High.jump  9.024811 11.7715838  8.792888 2.5498795 23.115504

corrplot::corrplot(var$contrib, is.corr=FALSE)

# Top 10 Contributions of variables to PC1
fviz_contrib(res.pca, choice = "var", axes = 1, top = 10)

# Top 10 Contributions of variables to PC2
fviz_contrib(res.pca, choice = "var", axes = 2, top = 10)

#Total contribution of PC1 and PC2
fviz_contrib(res.pca, choice = "var", axes = 1:2, top = 10)

The red dashed line on the graph above indicates the expected average contribution. It can be seen that the variables - X100m, Long.jump and Pole.vault - contribute the most to the dimensions 1 and 2.

The most important (or, contributing) variables can be highlighted on the correlation plot as follow:

fviz_pca_var(res.pca, col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"))

# Change the transparency by contrib values
fviz_pca_var(res.pca, alpha.var = "contrib")

Color by a custom continuous variable

It is possible to color variables by any custom continuous variable. The coloring variable should have the same length as the number of active variables in the PCA (here n = 10).

# Create a random continuous variable of length 10
set.seed(123)
my.cont.var <- rnorm(10)

# Color variables by the continuous variable
fviz_pca_var(res.pca, col.var = my.cont.var,
             gradient.cols = c("blue", "yellow", "red"),
             legend.title = "Cont.Var")

Color by groups

It’s also possible to change the color of variables by groups defined by a qualitative/categorical variable.

# Create a grouping variable using kmeans
# Create 3 groups of variables (centers = 3)
set.seed(123)
res.km <- kmeans(var$coord, centers = 3, nstart = 25)
grp <- as.factor(res.km$cluster)

# Color variables by groups
fviz_pca_var(res.pca, col.var = grp, 
             palette = c("#0073C2FF", "#EFC000FF", "#868686FF"),
             legend.title = "Cluster")

Dimension description

dimdesc() can be used to identify the most significantly associated variables with a given principal component.

res.desc <- dimdesc(res.pca, axes = c(1,2), proba = 0.05)

# Description of dimension 1
res.desc$Dim.1

## $quanti
##              correlation      p.value
## Long.jump      0.7941806 6.059893e-06
## Discus         0.7432090 4.842563e-05
## Shot.put       0.7339127 6.723102e-05
## High.jump      0.6100840 1.993677e-03
## Javeline       0.4282266 4.149192e-02
## X400m         -0.7016034 1.910387e-04
## X110m.hurdle  -0.7641252 2.195812e-05
## X100m         -0.8506257 2.727129e-07
## 
## attr(,"class")
## [1] "condes" "list"

# Description of dimension 2
res.desc$Dim.2

## $quanti
##            correlation      p.value
## Pole.vault   0.8074511 3.205016e-06
## X1500m       0.7844802 9.384747e-06
## High.jump   -0.4652142 2.529390e-02
## 
## attr(,"class")
## [1] "condes" "list"

Graph of individuals

The results, for individuals can be extracted using get_pca_ind().

ind <- get_pca_ind(res.pca)
ind

## Principal Component Analysis Results for individuals
##  ===================================================
##   Name       Description                       
## 1 "$coord"   "Coordinates for the individuals" 
## 2 "$cos2"    "Cos2 for the individuals"        
## 3 "$contrib" "contributions of the individuals"

# Coordinates of individuals
head(ind$coord)

##                Dim.1      Dim.2      Dim.3       Dim.4       Dim.5
## SEBRLE     0.1955047  1.5890567  0.6424912  0.08389652  1.16829387
## CLAY       0.8078795  2.4748137 -1.3873827  1.29838232 -0.82498206
## BERNARD   -1.3591340  1.6480950  0.2005584 -1.96409420  0.08419345
## YURKOV    -0.8889532 -0.4426067  2.5295843  0.71290837  0.40782264
## ZSIVOCZKY -0.1081216 -2.0688377 -1.3342591 -0.10152796 -0.20145217
## McMULLEN   0.1212195 -1.0139102 -0.8625170  1.34164291  1.62151286

# Quality of individuals
head(ind$cos2)

##                 Dim.1      Dim.2       Dim.3       Dim.4        Dim.5
## SEBRLE    0.007530179 0.49747323 0.081325232 0.001386688 0.2689026575
## CLAY      0.048701249 0.45701660 0.143628117 0.125791741 0.0507850580
## BERNARD   0.197199804 0.28996555 0.004294015 0.411819183 0.0007567259
## YURKOV    0.096109800 0.02382571 0.778230322 0.061812637 0.0202279796
## ZSIVOCZKY 0.001574385 0.57641944 0.239754152 0.001388216 0.0054654972
## McMULLEN  0.002175437 0.15219499 0.110137872 0.266486530 0.3892621478

# Contributions of individuals
head(ind$contrib)

##                Dim.1      Dim.2      Dim.3       Dim.4       Dim.5
## SEBRLE    0.04029447  5.9714533  1.4483919  0.03734589  8.45894063
## CLAY      0.68805664 14.4839248  6.7537381  8.94458283  4.21794385
## BERNARD   1.94740183  6.4234107  0.1411345 20.46819433  0.04393073
## YURKOV    0.83308415  0.4632733 22.4517396  2.69663605  1.03075263
## ZSIVOCZKY 0.01232413 10.1217143  6.2464325  0.05469230  0.25151025
## McMULLEN  0.01549089  2.4310854  2.6102794  9.55055888 16.29493304

Plots: quality and contribution

fviz_pca_ind(res.pca)

#Color by cos2 value

fviz_pca_ind(res.pca, col.ind = "cos2", 
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE) # Avoid text overlapping (slow if many points)

#Point size by cos2 value

fviz_pca_ind(res.pca, pointsize = "cos2", 
             pointshape = 21, fill = "#E7B800",
             repel = TRUE) # Avoid text overlapping (slow if many points)

#Both size and color by cos2

fviz_pca_ind(res.pca, col.ind = "cos2", pointsize = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE) # Avoid text overlapping (slow if many points)

We can also create a bar plot of the quality of representation (cos2) of individuals.

fviz_cos2(res.pca, choice = "ind")

# Total contribution on PC1 and PC2
fviz_contrib(res.pca, choice = "ind", axes = 1:2)

Color by a custom continuous variable

As for variables, individuals can be colored by any custom continuous variable with the argument col.ind

# Create a random continuous variable of length 23,
# Same length as the number of active individuals in the PCA

set.seed(123)
my.cont.var <- rnorm(23)

# Color individuals by the continuous variable

fviz_pca_ind(res.pca, col.ind = my.cont.var,
             gradient.cols = c("blue", "yellow", "red"),
             legend.title = "Cont.Var")

Supplementary elements

Both suplementary continuous and categorical variables are not used for the determination of the principal components, but rather used as the predicted elements from their coordinates through the information provided by the PCA.

To specify supplementary individuals and variables, the function PCA() can be used as follow:

res.pca_2 <- PCA(decathlon2, ind.sup = 24:27, 
               quanti.sup = 11:12, quali.sup = 13, graph=FALSE)

Quantitative variables

Predicted results (coordinates, correlation and cos2) for the supplementary quantitative variables:

res.pca_2$quanti.sup

## $coord
##             Dim.1       Dim.2      Dim.3       Dim.4       Dim.5
## Rank   -0.7014777 -0.24519443 -0.1834294  0.05575186 -0.07382647
## Points  0.9637075  0.07768262  0.1580225 -0.16623092 -0.03114711
## 
## $cor
##             Dim.1       Dim.2      Dim.3       Dim.4       Dim.5
## Rank   -0.7014777 -0.24519443 -0.1834294  0.05575186 -0.07382647
## Points  0.9637075  0.07768262  0.1580225 -0.16623092 -0.03114711
## 
## $cos2
##            Dim.1       Dim.2      Dim.3      Dim.4        Dim.5
## Rank   0.4920710 0.060120310 0.03364635 0.00310827 0.0054503477
## Points 0.9287322 0.006034589 0.02497110 0.02763272 0.0009701427

#visualize
fviz_pca_var(res.pca_2)

by default, supplementary quantitative variables are shown in blue color and dashed lines. We can further customize the plot by:

# Change color of variables
fviz_pca_var(res.pca_2,
             col.var = "black",     # Active variables
             col.quanti.sup = "red" # Suppl. quantitative variables
)

# Hide active variables on the plot and show only supplementary variables
fviz_pca_var(res.pca_2, invisible = "var")

# Hide supplementary variables
fviz_pca_var(res.pca_2, invisible = "quanti.sup")

Individuals

Predicted results for the supplementary individuals (ind.sup).

res.pca_2$ind.sup

## $coord
##              Dim.1       Dim.2      Dim.3      Dim.4       Dim.5
## KARPOV   0.7947206  0.77951227 -1.6330203  1.7242283 -0.75070396
## WARNERS -0.3864645 -0.12159237 -1.7387332 -0.7063341 -0.03230011
## Nool    -0.5591306  1.97748871 -0.4830358 -2.2784526 -0.25461493
## Drews   -1.1092038  0.01741477 -3.0488182 -1.5343468 -0.32642192
## 
## $cos2
##              Dim.1        Dim.2      Dim.3      Dim.4        Dim.5
## KARPOV  0.05104677 4.911173e-02 0.21553730 0.24028620 0.0455487744
## WARNERS 0.02422707 2.398250e-03 0.49039677 0.08092862 0.0001692349
## Nool    0.02897149 3.623868e-01 0.02162236 0.48108780 0.0060077529
## Drews   0.09207094 2.269527e-05 0.69560547 0.17617609 0.0079736753
## 
## $dist
##   KARPOV  WARNERS     Nool    Drews 
## 3.517470 2.482899 3.284943 3.655527

Visualize all individuals (active and supplementary ones). We can also add the supplementary qualitative variables with res.pca$quali.supp$coord.

p_2 <- fviz_pca_ind(res.pca_2, col.ind.sup = "blue", repel = TRUE)
p_2 <- fviz_add(p_2, res.pca_2$quali.sup$coord, color = "red")
p_2

Supplementary individuals are shown in blue. The levels of the supplementary qualitative variable are shown in red color.

Qualitative variables

The results concerning the supplementary qualitative variable are:

res.pca_2$quali

## $coord
##              Dim.1      Dim.2       Dim.3      Dim.4      Dim.5
## Decastar -1.343451  0.1218097 -0.03789524  0.1808357  0.1343364
## OlympicG  1.231497 -0.1116589  0.03473730 -0.1657661 -0.1231417
## 
## $cos2
##              Dim.1       Dim.2        Dim.3      Dim.4       Dim.5
## Decastar 0.9051233 0.007440939 0.0007201669 0.01639956 0.009050062
## OlympicG 0.9051233 0.007440939 0.0007201669 0.01639956 0.009050062
## 
## $v.test
##              Dim.1      Dim.2      Dim.3      Dim.4      Dim.5
## Decastar -2.970766  0.4034256 -0.1528767  0.8971036  0.7202457
## OlympicG  2.970766 -0.4034256  0.1528767 -0.8971036 -0.7202457
## 
## $dist
## Decastar OlympicG 
## 1.412108 1.294433 
## 
## $eta2
##                 Dim.1      Dim.2       Dim.3      Dim.4      Dim.5
## Competition 0.4011568 0.00739783 0.001062332 0.03658159 0.02357972

We can use the argument habillage to color individuals by a supplementary qualitative variable.

fviz_pca_ind(res.pca_2, habillage = 13,
             addEllipses =TRUE, ellipse.type = "confidence",
             palette = "jco", repel = TRUE)

Reference: http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/112-pca-principal-component-analysis-essentials/#graph-of-variables

Principal Component Analysis

Tarid Wongvorachan

December 18th, 2020

Visualization and Interpretation

Graph of variables

Correlation circle

Quality of representation

Contributions of variables to PCs

Color by a custom continuous variable

Color by groups

Dimension description

Graph of individuals

Plots: quality and contribution

Color by a custom continuous variable

Supplementary elements

Quantitative variables

Individuals

Qualitative variables