The PCA method is particularly useful when the variables within the data set are highly correlated.

The main purposes of PCA are to:

1. identify hidden patterns in a data set;
2. reduce the dimensionality of the data by removing noise and redundancy;
3. identify correlated variables.

library(FactoMineR)
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

Loading the data. The data used here describe athletes’ performance during two sporting events (Decastar and OlympicG).

It contains 27 individuals (athletes) described by 13 variables.

data("decathlon2")
head(decathlon2)
##           X100m Long.jump Shot.put High.jump X400m X110m.hurdle Discus
## SEBRLE    11.04      7.58    14.83      2.07 49.81        14.69  43.75
## CLAY      10.76      7.40    14.26      1.86 49.37        14.05  50.72
## BERNARD   11.02      7.23    14.25      1.92 48.93        14.99  40.87
## YURKOV    11.34      7.09    15.19      2.10 50.42        15.31  46.26
## ZSIVOCZKY 11.13      7.30    13.48      2.01 48.62        14.17  45.67
## McMULLEN  10.83      7.31    13.76      2.13 49.91        14.38  44.41
##           Pole.vault Javeline X1500m Rank Points Competition
## SEBRLE          5.02    63.19  291.7    1   8217    Decastar
## CLAY            4.92    60.15  301.5    2   8122    Decastar
## BERNARD         5.32    62.77  280.1    4   8067    Decastar
## YURKOV          4.72    63.44  276.4    5   8036    Decastar
## ZSIVOCZKY       4.42    55.37  268.0    7   8004    Decastar
## McMULLEN        4.42    56.37  285.1    8   7995    Decastar

Here we use the data of the first 23 athletes (the active individuals) to predict the positions of the remaining athletes. Likewise, we use the first 10 columns (the active variables) to predict Rank and Points (quantitative variables) as well as Competition (a qualitative variable); a sketch of how these supplementary elements can be declared is given after the argument descriptions below.

We start by subsetting active individuals and active variables for the principal component analysis:

decathlon2.active <- decathlon2[1:23,1:10]
head(decathlon2.active[,1:6],4)
##         X100m Long.jump Shot.put High.jump X400m X110m.hurdle
## SEBRLE  11.04      7.58    14.83      2.07 49.81        14.69
## CLAY    10.76      7.40    14.26      1.86 49.37        14.05
## BERNARD 11.02      7.23    14.25      1.92 48.93        14.99
## YURKOV  11.34      7.09    15.19      2.10 50.42        15.31

The goal is to make the variables comparable. Generally, variables are scaled to have (i) standard deviation one and (ii) mean zero.

The R base function scale() can be used to standardize the data. It takes a numeric matrix as an input and performs the scaling on the columns.

Note that, by default, the function PCA() [in FactoMineR] standardizes the data automatically during the PCA, so you don’t need to do this transformation before the PCA.
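For reference, here is how you could standardize the active data yourself with scale() (shown only for illustration; PCA() handles this internally and its internal variance estimator may differ slightly):

# Standardize the active data manually (illustration only)
decathlon2.scaled <- scale(decathlon2.active)
# After scaling, column means are ~0 and standard deviations are 1
round(colMeans(decathlon2.scaled), 10)
apply(decathlon2.scaled, 2, sd)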

R code

The function PCA() is used for the analysis. Its simplified format is:

PCA(X,scale.unit=TRUE,ncp=5,graph=TRUE)

X: a data frame. Rows are individuals and columns are numeric variables

scale.unit: a logical value. If TRUE, the data are scaled to unit variance before the analysis. This standardization to the same scale prevents some variables from becoming dominant just because of their large measurement units, and makes the variables comparable.

ncp: number of dimensions kept in the final results.

graph: a logical value. If TRUE a graph is displayed.
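To also declare the supplementary elements mentioned earlier (the last four athletes, plus Rank, Points and Competition), PCA() can be called on the full table with the ind.sup, quanti.sup and quali.sup arguments; a sketch (not run in this analysis):

# PCA on the full decathlon2 table, declaring supplementary rows/columns
# (individuals 24:27; quantitative columns 11:12 = Rank, Points;
#  qualitative column 13 = Competition)
res.pca.full <- PCA(decathlon2,
                    ind.sup = 24:27,
                    quanti.sup = 11:12,
                    quali.sup = 13,
                    graph = FALSE)

For the rest of this section we work with the active data only.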

res.pca <- PCA( decathlon2.active, graph = FALSE)
print(res.pca)
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 23 individuals, described by 10 variables
## *The results are available in the following objects:
## 
##    name               description                          
## 1  "$eig"             "eigenvalues"                        
## 2  "$var"             "results for the variables"          
## 3  "$var$coord"       "coord. for the variables"           
## 4  "$var$cor"         "correlations variables - dimensions"
## 5  "$var$cos2"        "cos2 for the variables"             
## 6  "$var$contrib"     "contributions of the variables"     
## 7  "$ind"             "results for the individuals"        
## 8  "$ind$coord"       "coord. for the individuals"         
## 9  "$ind$cos2"        "cos2 for the individuals"           
## 10 "$ind$contrib"     "contributions of the individuals"   
## 11 "$call"            "summary statistics"                 
## 12 "$call$centre"     "mean of the variables"              
## 13 "$call$ecart.type" "standard error of the variables"    
## 14 "$call$row.w"      "weights for the individuals"        
## 15 "$call$col.w"      "weights for the variables"
summary(res.pca)
## 
## Call:
## PCA(X = decathlon2.active, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               4.124   1.839   1.239   0.819   0.702   0.423   0.303
## % of var.             41.242  18.385  12.391   8.194   7.016   4.229   3.026
## Cumulative % of var.  41.242  59.627  72.019  80.213  87.229  91.458  94.483
##                        Dim.8   Dim.9  Dim.10
## Variance               0.274   0.155   0.122
## % of var.              2.745   1.552   1.220
## Cumulative % of var.  97.228  98.780 100.000
## 
## Individuals (the 10 first)
##                  Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## SEBRLE       |  2.253 |  0.196  0.040  0.008 |  1.589  5.971  0.497 |  0.642
## CLAY         |  3.661 |  0.808  0.688  0.049 |  2.475 14.484  0.457 | -1.387
## BERNARD      |  3.061 | -1.359  1.947  0.197 |  1.648  6.423  0.290 |  0.201
## YURKOV       |  2.867 | -0.889  0.833  0.096 | -0.443  0.463  0.024 |  2.530
## ZSIVOCZKY    |  2.725 | -0.108  0.012  0.002 | -2.069 10.122  0.576 | -1.334
## McMULLEN     |  2.599 |  0.121  0.015  0.002 | -1.014  2.431  0.152 | -0.863
## MARTINEAU    |  3.848 | -2.446  6.308  0.404 | -1.314  4.082  0.117 |  0.918
## HERNU        |  3.060 | -1.934  3.941  0.399 |  1.205  3.434  0.155 |  0.160
## BARRAS       |  2.311 | -1.814  3.470  0.616 | -0.422  0.421  0.033 | -0.673
## NOOL         |  4.057 | -2.839  8.499  0.490 | -1.608  6.115  0.157 | -0.621
##                 ctr   cos2  
## SEBRLE        1.448  0.081 |
## CLAY          6.754  0.144 |
## BERNARD       0.141  0.004 |
## YURKOV       22.452  0.778 |
## ZSIVOCZKY     6.246  0.240 |
## McMULLEN      2.610  0.110 |
## MARTINEAU     2.959  0.057 |
## HERNU         0.090  0.003 |
## BARRAS        1.589  0.085 |
## NOOL          1.353  0.023 |
## 
## Variables
##                 Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## X100m        | -0.851 17.544  0.724 | -0.179  1.751  0.032 |  0.302  7.339
## Long.jump    |  0.794 15.293  0.631 |  0.281  4.290  0.079 | -0.191  2.930
## Shot.put     |  0.734 13.060  0.539 |  0.085  0.397  0.007 |  0.518 21.620
## High.jump    |  0.610  9.025  0.372 | -0.465 11.772  0.216 |  0.330  8.793
## X400m        | -0.702 11.936  0.492 |  0.290  4.580  0.084 |  0.284  6.488
## X110m.hurdle | -0.764 14.158  0.584 | -0.025  0.033  0.001 |  0.449 16.261
## Discus       |  0.743 13.393  0.552 |  0.050  0.134  0.002 |  0.177  2.515
## Pole.vault   | -0.217  1.145  0.047 |  0.807 35.462  0.652 |  0.094  0.714
## Javeline     |  0.428  4.446  0.183 |  0.386  8.109  0.149 |  0.604 29.453
## X1500m       |  0.004  0.000  0.000 |  0.784 33.473  0.615 | -0.219  3.887
##                cos2  
## X100m         0.091 |
## Long.jump     0.036 |
## Shot.put      0.268 |
## High.jump     0.109 |
## X400m         0.080 |
## X110m.hurdle  0.201 |
## Discus        0.031 |
## Pole.vault    0.009 |
## Javeline      0.365 |
## X1500m        0.048 |

## Eigenvalues / Variances
The eigenvalues measure the amount of variation retained by each principal component. Eigenvalues are large for the first PCs and small for the subsequent PCs. That is, the first PCs correspond to the directions with the maximum amount of variation in the data set.

eig.val <- get_eigenvalue(res.pca)
eig.val
##        eigenvalue variance.percent cumulative.variance.percent
## Dim.1   4.1242133        41.242133                    41.24213
## Dim.2   1.8385309        18.385309                    59.62744
## Dim.3   1.2391403        12.391403                    72.01885
## Dim.4   0.8194402         8.194402                    80.21325
## Dim.5   0.7015528         7.015528                    87.22878
## Dim.6   0.4228828         4.228828                    91.45760
## Dim.7   0.3025817         3.025817                    94.48342
## Dim.8   0.2744700         2.744700                    97.22812
## Dim.9   0.1552169         1.552169                    98.78029
## Dim.10  0.1219710         1.219710                   100.00000

The proportion of variation explained by each eigenvalue is given in the second column. For example, 4.124 divided by 10 equals 0.4124; that is, about 41.24% of the variation is explained by the first eigenvalue. The cumulative percentage is obtained by adding the successive proportions of variation explained, giving the running total. For instance, 41.242% plus 18.385% equals 59.627%, and so forth. Therefore, about 59.627% of the variation is explained by the first two eigenvalues together.
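These percentages can be reproduced directly from the eigenvalues; with standardized data the eigenvalues sum to the number of variables (10 here):

# Percentage of variance = eigenvalue / total variance (sum of eigenvalues)
eig.val[, "eigenvalue"] / sum(eig.val[, "eigenvalue"]) * 100
# Running total of the percentages
cumsum(eig.val[, "eigenvalue"]) / sum(eig.val[, "eigenvalue"]) * 100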

An eigenvalue > 1 indicates that the PC accounts for more variance than any single one of the original variables does in standardized data. This is commonly used as a cutoff point for deciding which PCs to retain. Note that this rule holds only when the data are standardized.

You can also limit the number of components to the number that accounts for a certain fraction of the total variance. For example, if you are satisfied with 70% of the total variance explained, then use the number of components needed to achieve that.

In our analysis, the first three principal components explain 72% of the variation. This is an acceptably large percentage.
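Both retention rules are easy to apply programmatically; a small sketch using the eig.val object created above:

# Kaiser rule: number of components with eigenvalue > 1 (standardized data)
sum(eig.val[, "eigenvalue"] > 1)
# Number of components needed to reach at least 70% of the total variance
which(eig.val[, "cumulative.variance.percent"] >= 70)[1]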

We can also create a scree plot

fviz_eig( res.pca, addlabels = TRUE, ylim = c( 0, 50))  

From the plot above, we might want to stop at the fifth principal component: 87% of the information (variance) contained in the data is retained by the first five principal components.

A simple method to extract the results, for variables, from a PCA output is to use the function get_pca_var() [factoextra package].

This function provides a list of matrices containing all the results for the active variables (coordinates, correlations between variables and axes, squared cosine, and contributions).

var <- get_pca_var(res.pca)
var
## Principal Component Analysis Results for variables
##  ===================================================
##   Name       Description                                    
## 1 "$coord"   "Coordinates for the variables"                
## 2 "$cor"     "Correlations between variables and dimensions"
## 3 "$cos2"    "Cos2 for the variables"                       
## 4 "$contrib" "contributions of the variables"

The correlation between a variable and a principal component is used as the coordinate of the variable on the PC.

head(var$coord,4)
##                Dim.1       Dim.2      Dim.3      Dim.4      Dim.5
## X100m     -0.8506257 -0.17939806  0.3015564  0.0335732 -0.1944440
## Long.jump  0.7941806  0.28085695 -0.1905465 -0.1153896  0.2331567
## Shot.put   0.7339127  0.08540412  0.5175978  0.1284684 -0.2488129
## High.jump  0.6100840 -0.46521415  0.3300852  0.1445501  0.4027002
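Because the data were standardized, these coordinates are expected to equal the correlations between the variables and the components (var$cor); a quick check:

# With scaled data, variable coordinates equal variable-dimension correlations
all.equal(var$coord, var$cor)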

To plot the variables in the first two dimensions:

fviz_pca_var( res.pca, col.var = "black")

The plot above is also known as a variable correlation plot. It shows the relationships between all variables. It can be interpreted as follows:

Positively correlated variables are grouped together.
Negatively correlated variables are positioned on opposite sides of the plot origin (opposed quadrants).
The distance between a variable and the origin measures the quality of the variable’s representation on the factor map.
Variables that are far away from the origin are well represented on the factor map.

## Quality of Representation

The quality of representation of the variables on the factor map is called cos2 (squared cosine, squared coordinates). You can access the cos2 as follows:

head(var$cos2,4)
##               Dim.1       Dim.2      Dim.3      Dim.4      Dim.5
## X100m     0.7235641 0.032183664 0.09093628 0.00112716 0.03780845
## Long.jump 0.6307229 0.078880629 0.03630798 0.01331475 0.05436203
## Shot.put  0.5386279 0.007293864 0.26790749 0.01650412 0.06190783
## High.jump 0.3722025 0.216424207 0.10895622 0.02089474 0.16216747
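As the name “squared coordinates” suggests, cos2 is just the element-wise square of the coordinates shown earlier; you can verify this with the objects already created:

# cos2 should equal the squared coordinate of each variable on each dimension
all.equal(var$cos2, var$coord^2, check.attributes = FALSE)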

You can visualize the cos2 of variables on all the dimensions using the corrplot package:

library(corrplot)
## corrplot 0.84 loaded
corrplot(var$cos2,is.corr=FALSE)

It’s also possible to create a bar plot of variables’ cos2 using the function fviz_cos2() [in factoextra]:

#Total cos2 of variables on Dim. 1 and Dim. 2
fviz_cos2( res.pca, choice = "var", axes = 1: 2)

A high cos2 indicates a good representation of the variable on the principal component. In this case the variable is positioned close to the circumference of the correlation circle.

A low cos2 indicates that the variable is not perfectly represented by the PCs.

It’s possible to color variables by their cos2 values using the argument col.var = "cos2". This produces a color gradient. In this case, the argument gradient.cols can be used to provide custom colors. For instance, gradient.cols = c("white", "blue", "red") means that variables with low cos2 values will be colored in white, variables with mid cos2 values will be colored in blue, and variables with high cos2 values will be colored in red.

# Color by cos2 values: quality on the factor map 
fviz_pca_var(res.pca, col.var = "cos2", 
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE # Avoid text overlapping 
             )

## Contribution of variables to PC

The contributions of variables in accounting for the variability in a given principal component are expressed as percentages. Variables that are correlated with PC1 (i.e., Dim.1) and PC2 (i.e., Dim.2) are the most important in explaining the variability in the data set. Variables that do not correlate with any PC, or that are correlated only with the last dimensions, are variables with low contributions and might be removed to simplify the overall analysis. The contributions of variables can be extracted as follows:

head(var$contrib, 4)
##               Dim.1      Dim.2     Dim.3     Dim.4     Dim.5
## X100m     17.544293  1.7505098  7.338659 0.1375524  5.389252
## Long.jump 15.293168  4.2904162  2.930094 1.6248594  7.748815
## Shot.put  13.060137  0.3967224 21.620432 2.0140727  8.824401
## High.jump  9.024811 11.7715838  8.792888 2.5498795 23.115504
round(var$contrib,2)
##              Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## X100m        17.54  1.75  7.34  0.14  5.39
## Long.jump    15.29  4.29  2.93  1.62  7.75
## Shot.put     13.06  0.40 21.62  2.01  8.82
## High.jump     9.02 11.77  8.79  2.55 23.12
## X400m        11.94  4.58  6.49 22.65  1.54
## X110m.hurdle 14.16  0.03 16.26  0.03  7.17
## Discus       13.39  0.13  2.51 19.04 23.76
## Pole.vault    1.14 35.46  0.71 14.02  7.01
## Javeline      4.45  8.11 29.45 13.43  5.58
## X1500m        0.00 33.47  3.89 24.49  9.88

The larger the value of the contribution, the more the variable contributes to the component.
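For reference, a variable’s contribution to a component is its cos2 expressed as a percentage of that component’s total cos2 (equivalently, its squared coordinate divided by the eigenvalue). A quick sanity check using the objects above:

# contribution (%) = 100 * cos2 / (total cos2 of that dimension)
contrib.check <- sweep(var$cos2, 2, colSums(var$cos2), FUN = "/") * 100
all.equal(contrib.check, var$contrib, check.attributes = FALSE)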

We can use the function corrplot() to highlight the most contributing variables for each dimension:

corrplot(var$contrib,is.corr=FALSE)

If your data contain many variables, you can choose to show only the top contributing variables.

# Contribution of Variables to PC1
fviz_contrib( res.pca, choice = "var", axes = 1, top = 10)

# Contribution of Variables to PC2
fviz_contrib( res.pca, choice = "var", axes = 2, top = 10)

The red dashed line on the graph above indicates the expected average contribution. If the contributions of the variables were uniform, the expected value would be 1/length(variables) = 1/10 = 10%. For a given component, a variable with a contribution larger than this cutoff could be considered important in contributing to that component.

The total contribution to PC1 and PC2 can be obtained by

# Total contribution of variables to PC1 and PC2
fviz_contrib( res.pca, choice = "var", axes = 1:2, top = 10)

It can be seen that the variables - X100m, Long.jump and Pole.vault - contribute the most to the dimensions 1 and 2.

The most important (contributing) variables can be highlighted on the correlation plot as follows:

fviz_pca_var(res.pca, col.var = "contrib", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07")) 

You can change the transparency according to the contribution value:

fviz_pca_var( res.pca, alpha.var = "contrib")

It’s also possible to change the color of variables by groups defined by a qualitative/categorical variable, also called a factor in R terminology.

As we don’t have any grouping variable in our data set for classifying variables, we’ll create one.

In the following demo example, we start by classifying the variables into 3 groups using the kmeans clustering algorithm.

Next, we use the clusters returned by the kmeans algorithm to color variables.

# Create a grouping variable using kmeans
# Create 3 groups of variables (centers = 3)
set.seed(123)
res.km <- kmeans(var$coord, centers = 3, nstart = 25)
grp <- as.factor(res.km$cluster)
# Color variables by groups
fviz_pca_var(res.pca, col.var = grp,
             palette = c("#0073C2FF", "#EFC000FF", "#868686FF"),
             legend.title = "cluster")

## Dimension Description

We can identify the variables that are most significantly associated with a given principal component using the function dimdesc() [in FactoMineR]:

res.desc <- dimdesc( res.pca, axes = c( 1,2), proba = 0.05)
res.desc$Dim.1
## $quanti
##              correlation      p.value
## Long.jump      0.7941806 6.059893e-06
## Discus         0.7432090 4.842563e-05
## Shot.put       0.7339127 6.723102e-05
## High.jump      0.6100840 1.993677e-03
## Javeline       0.4282266 4.149192e-02
## X400m         -0.7016034 1.910387e-04
## X110m.hurdle  -0.7641252 2.195812e-05
## X100m         -0.8506257 2.727129e-07
## 
## attr(,"class")
## [1] "condes" "list "
res.desc$Dim.2
## $quanti
##            correlation      p.value
## Pole.vault   0.8074511 3.205016e-06
## X1500m       0.7844802 9.384747e-06
## High.jump   -0.4652142 2.529390e-02
## 
## attr(,"class")
## [1] "condes" "list "