Principal Component Analysis (PCA)

Introduction

The main purpose of the project is to present and implement unsupervised learning methods for Principal Component Analysis (PCA) - a statistical method to reduce dimensionality. PCA is used to extract significant information from a multivariate data table and to express this information as a set of few new variables called principal components. The goal of PCA is to identify directions that can be visualized graphically with minimal loss of information.

In my analysis, I will use dataset decathlon2 from factoextra package. The methodology of the project will concentrate on: descriptive statistics and explanatory data analysis of the dataset; computing PCA with prcomp(); extracting the eigenvalues of principal components; extracting the results for individuals and variables; k-means clustering based on PCA results. The analysis will be supported by the necessary visualisations. The outcome of the project is to distinguish athletes with the best, average and worst performance (results) in mentioned sport disciplines.

Data Processing

Firstly, the needed libraries are introduced: factoextra, ggplot2, ggfortify, gridExtra, carData, car, corrplot. Factoextra - provides functions to extract and visualize the output of multivariate data analyses. GridExtra - used to arrange multiple grid-based plots on a page, and draw tables. CarData and car - are companions to applied regression datasets. Corrplot - enables visualization of a correlation matrix.Ggplot2 - is used for data manipulation and plotting. Ggfortify - provides data visualization tools for statistical analysis results.

library("ggplot2")
library("ggfortify")
library("gridExtra")
library("carData")
library("car")
library("factoextra")
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library("corrplot")
## corrplot 0.92 loaded

Now it is possible to load demo dataset decathlon2 from the factoextra package. The data describes athletes’ performance during two sporting events (Desctar and OlympicG). It contains 27 individuals (athletes) described by 13 variables (sport disciplines). For further analysis, I will subset active individuals (rows 1:23) and active variables (columns 1:10) from decathlon2 dataset, therefore I will create new dataset decathlon2.active to conduct the principal component analysis. Decathlon2.active dataset consists of 23 observations and 10 variables (presented below).

data(decathlon2)
decathlon2.active <- decathlon2[1:23, 1:10]
head(decathlon2.active)
##           X100m Long.jump Shot.put High.jump X400m X110m.hurdle Discus
## SEBRLE    11.04      7.58    14.83      2.07 49.81        14.69  43.75
## CLAY      10.76      7.40    14.26      1.86 49.37        14.05  50.72
## BERNARD   11.02      7.23    14.25      1.92 48.93        14.99  40.87
## YURKOV    11.34      7.09    15.19      2.10 50.42        15.31  46.26
## ZSIVOCZKY 11.13      7.30    13.48      2.01 48.62        14.17  45.67
## McMULLEN  10.83      7.31    13.76      2.13 49.91        14.38  44.41
##           Pole.vault Javeline X1500m
## SEBRLE          5.02    63.19  291.7
## CLAY            4.92    60.15  301.5
## BERNARD         5.32    62.77  280.1
## YURKOV          4.72    63.44  276.4
## ZSIVOCZKY       4.42    55.37  268.0
## McMULLEN        4.42    56.37  285.1

Explanatory Data Analysis

In order to provide the model, a summary of data is required. The variables (sport disciplines) descriptions are as follows: X100m - 100 metres results expressed in seconds Long.jump - long jump results expressed in metres Shot.put - shot put results expressed in metres High.jump - high jump results expressed in metres X400m - 400 metres results expressed in seconds X110m.hurdle - 110 metres hurdle race results expressed in seconds Discus - discus throw results expressed in metres Pole.vault - pole vault results expressed in metres Javeline - javeline throw results expressed in metres X1500m - 1500 metres results expressed in seconds

The summary statistics and histograms below show the distribution of observations in all numeric variables. Horizontal axis represents the values of observations, while the vertical axis “count” shows the amount of certain observations for each value.

summary(decathlon2.active)
##      X100m         Long.jump        Shot.put       High.jump    
##  Min.   :10.44   Min.   :6.800   Min.   :12.68   Min.   :1.860  
##  1st Qu.:10.84   1st Qu.:7.165   1st Qu.:14.17   1st Qu.:1.940  
##  Median :10.97   Median :7.310   Median :14.65   Median :2.010  
##  Mean   :11.00   Mean   :7.350   Mean   :14.62   Mean   :2.007  
##  3rd Qu.:11.23   3rd Qu.:7.525   3rd Qu.:15.14   3rd Qu.:2.095  
##  Max.   :11.64   Max.   :7.960   Max.   :16.36   Max.   :2.150  
##      X400m        X110m.hurdle       Discus        Pole.vault   
##  Min.   :46.81   Min.   :13.97   Min.   :37.92   Min.   :4.400  
##  1st Qu.:48.95   1st Qu.:14.17   1st Qu.:43.74   1st Qu.:4.610  
##  Median :49.40   Median :14.37   Median :44.75   Median :4.820  
##  Mean   :49.43   Mean   :14.53   Mean   :45.16   Mean   :4.797  
##  3rd Qu.:50.02   3rd Qu.:14.94   3rd Qu.:46.93   3rd Qu.:5.000  
##  Max.   :51.16   Max.   :15.67   Max.   :51.65   Max.   :5.320  
##     Javeline         X1500m     
##  Min.   :52.33   Min.   :262.1  
##  1st Qu.:55.40   1st Qu.:268.8  
##  Median :57.44   Median :278.1  
##  Mean   :59.11   Mean   :277.9  
##  3rd Qu.:62.98   3rd Qu.:283.6  
##  Max.   :70.52   Max.   :301.5

100 metres and 400 metres histograms represent negative skewness. The left tailor indicates that the majority of observations are concentrated in higher time results. 110 metres hurdle race, javeline throw and 1500 metres histograms show positive skewness. The right tailor implies that observations are centered around lower values (shorter time results in case of races and shorter distance results in javeline throw). Other variables have either simmetrical disctribution (centered around median) or observations distributed along the whole x axis.

Methodology of PCA - Visualization and Interpretation

The first step of the analysis is focused around computation of PCA using prcomp() function. This command allows for: centering data around 0 by shifting the variables; rescaling the variance to 1 unit; data standarization needed due to the fact that variables are measured in different scales. Additionally, the eigenvalues are extracted by get_eigenvalue() function. Eigenvalues measure the amount of variation held by each principal component (PC). They are evaluated to determine the number of principal components to be considered.

res.pca <- prcomp(decathlon2.active, scale = TRUE)
print(res.pca)
## Standard deviations (1, .., p=10):
##  [1] 2.0308159 1.3559244 1.1131668 0.9052294 0.8375875 0.6502944 0.5500742
##  [8] 0.5238988 0.3939758 0.3492435
## 
## Rotation (n x k) = (10 x 10):
##                       PC1         PC2         PC3         PC4        PC5
## X100m        -0.418859080  0.13230683 -0.27089959  0.03708806 -0.2321476
## Long.jump     0.391064807 -0.20713320  0.17117519 -0.12746997  0.2783669
## Shot.put      0.361388111 -0.06298590 -0.46497777  0.14191803 -0.2970589
## High.jump     0.300413236  0.34309742 -0.29652805  0.15968342  0.4807859
## X400m        -0.345478567 -0.21400770 -0.25470839  0.47592968  0.1240569
## X110m.hurdle -0.376265119  0.01824645 -0.40325254 -0.01866477  0.2676975
## Discus        0.365965721 -0.03662510 -0.15857927  0.43636361 -0.4873988
## Pole.vault   -0.106985591 -0.59549862 -0.08449563 -0.37447391 -0.2646712
## Javeline      0.210864329 -0.28475723 -0.54270782 -0.36646463  0.2361698
## X1500m        0.002106782 -0.57855748  0.19715884  0.49491281  0.3142987
##                       PC6         PC7         PC8         PC9        PC10
## X100m         0.054398099 -0.16604375 -0.19988005 -0.76924639  0.12718339
## Long.jump    -0.051865558 -0.28056361 -0.75850657 -0.13094589  0.08509665
## Shot.put     -0.368739186 -0.01797323  0.04649571  0.12129309  0.62263702
## High.jump    -0.437716883  0.05118848  0.16111045 -0.28463225 -0.38244596
## X400m        -0.075796432  0.52012255 -0.44579641  0.20854176 -0.09784197
## X110m.hurdle  0.004048005 -0.67276768 -0.01592804  0.41058421 -0.04475363
## Discus        0.305315353 -0.25946615 -0.07550934  0.03391600 -0.49418361
## Pole.vault   -0.503563524 -0.01889413  0.06282691 -0.06540692 -0.39288155
## Javeline      0.556821016  0.24281145  0.10086127 -0.10268134 -0.01103627
## X1500m        0.064663250 -0.20245828  0.37119711 -0.25950868  0.17991689
summary(res.pca)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0308 1.3559 1.1132 0.90523 0.83759 0.65029 0.55007
## Proportion of Variance 0.4124 0.1839 0.1239 0.08194 0.07016 0.04229 0.03026
## Cumulative Proportion  0.4124 0.5963 0.7202 0.80213 0.87229 0.91458 0.94483
##                            PC8     PC9   PC10
## Standard deviation     0.52390 0.39398 0.3492
## Proportion of Variance 0.02745 0.01552 0.0122
## Cumulative Proportion  0.97228 0.98780 1.0000
eig.val<-get_eigenvalue(res.pca)
eig.val
##        eigenvalue variance.percent cumulative.variance.percent
## Dim.1   4.1242133        41.242133                    41.24213
## Dim.2   1.8385309        18.385309                    59.62744
## Dim.3   1.2391403        12.391403                    72.01885
## Dim.4   0.8194402         8.194402                    80.21325
## Dim.5   0.7015528         7.015528                    87.22878
## Dim.6   0.4228828         4.228828                    91.45760
## Dim.7   0.3025817         3.025817                    94.48342
## Dim.8   0.2744700         2.744700                    97.22812
## Dim.9   0.1552169         1.552169                    98.78029
## Dim.10  0.1219710         1.219710                   100.00000
fviz_eig(res.pca, col.var="blue")

On the basis of importance of components, is it visible that first two PCs have the highest vales for proportion of variance. This statement is also proved by eigenvalues measure. They are large for the first PCs and small for the subsequent PCs, which means that the first PCs corresponds to the directions with the maximum amount of variation in the data set. The sum of all the eigenvalues gives a total variance of 10. As far as scatter plot is concerned, first eigenvalue explain 41.24% of the variation, second - 18.385%. Therefore, 59.627% of the variation is explained by the first two eigenvalues together, which is a proper indicator for further analysis.

PCA results for variables

PCA results can be assesed with regard to variables (sport disciplines) and individuals (athletes). Firstly, I will conduct extraction of results for variables. For that purpose get_pca_var() is used to provide a list of matrices containing all the results for the active variables (coordinates, correlation between variables and axes, squared cosine, and contributions).

var <- get_pca_var(res.pca)
var
## Principal Component Analysis Results for variables
##  ===================================================
##   Name       Description                                    
## 1 "$coord"   "Coordinates for the variables"                
## 2 "$cor"     "Correlations between variables and dimensions"
## 3 "$cos2"    "Cos2 for the variables"                       
## 4 "$contrib" "contributions of the variables"

Cos2 is called square cosine (squared coordinates) and corresponds to the quality of representation of variables. Cos2 of variables on all the dimensions using the corrplot package is displayed below, as well as bar plot of variables cos2 using the function fviz_cos2().

head(var$cos2)
##                  Dim.1        Dim.2      Dim.3        Dim.4      Dim.5
## X100m        0.7235641 0.0321836641 0.09093628 0.0011271597 0.03780845
## Long.jump    0.6307229 0.0788806285 0.03630798 0.0133147506 0.05436203
## Shot.put     0.5386279 0.0072938636 0.26790749 0.0165041211 0.06190783
## High.jump    0.3722025 0.2164242070 0.10895622 0.0208947375 0.16216747
## X400m        0.4922473 0.0842034209 0.08039091 0.1856106269 0.01079698
## X110m.hurdle 0.5838873 0.0006121077 0.20149984 0.0002854712 0.05027463
##                     Dim.6        Dim.7        Dim.8       Dim.9       Dim.10
## X100m        1.251375e-03 0.0083423353 1.096563e-02 0.091848077 0.0019729565
## Long.jump    1.137570e-03 0.0238179990 1.579114e-01 0.002661478 0.0008832459
## Shot.put     5.749878e-02 0.0000977451 5.933633e-04 0.002283554 0.0472853495
## High.jump    8.102269e-02 0.0007928428 7.124302e-03 0.012574981 0.0178400831
## X400m        2.429504e-03 0.0818566479 5.454664e-02 0.006750333 0.0011676349
## X110m.hurdle 6.929502e-06 0.1369534023 6.963371e-05 0.026166378 0.0002442942
library("corrplot")
corrplot(var$cos2, is.corr=FALSE)

fviz_cos2(res.pca, choice = "var", axes = 1:2)

Additionally, the quality of representation of variables can be draw on the factor map, where cos2 values differ by gradient colors. Variables with low cos2 values will be colored “darkorchid4”, medium cos2 values - “gold”, high co2 values - “darkorange”. Positively correlated variables are grouped together, whereas negatively correlated variables are positioned on opposite sides of the plot origin. The distance between variables and the origin measures the quality of the variables on the factor map. Variables that are away from the origin are well represented on the factor map.

fviz_pca_var(res.pca,
             col.var = "cos2", # Color by the quality of representation
             gradient.cols = c("darkorchid4", "gold", "darkorange"),
             repel = TRUE
             )

X100m, Long.jump and Pole.vault have very high cos2, which implies a good representation of the variable on the principal component. In this case variables are positioned close to the circumference of the correlation circle. Javeline has the lowest cos2, which indicates that the variable is not perfectly represented by the PCs. In this case the variable is close to the center of the circle - it is less important for the first components.

Contrib is a contribution of variables. The function fviz_contrib() is used to draw a bar plot of variable contributions for the most significant dimensions, therefore PC1 and PC2.

# Contributions of variables to PC1
a<-fviz_contrib(res.pca, choice = "var", axes = 1)
# Contributions of variables to PC2
b<-fviz_contrib(res.pca, choice = "var", axes = 2)
grid.arrange(a,b, ncol=2, top='Contribution of the variables to the first two PCs')

The red dashed line on the graph above indicates the expected average contribution. For a certain component, a variable with a contribution exceeding this benchmark is considered as important in contributing to the component. It can be seen that the variables X100m, Long.jump and Pole.vault contribute the most to both dimensions.

PCA results for individuals

The results, for individuals (athletes) will be extracted using the function get_pca_ind(). Similarly to variables, it provides a list of matrices containing all the results for the individuals (coordinates, correlation between individuals and axes, squared cosine, and contributions). As for the individuals, the analysis will be focused on cos2 and contributions of individuals to the first two principal components (PC1 and PC2).

ind <- get_pca_ind(res.pca)
ind
## Principal Component Analysis Results for individuals
##  ===================================================
##   Name       Description                       
## 1 "$coord"   "Coordinates for the individuals" 
## 2 "$cos2"    "Cos2 for the individuals"        
## 3 "$contrib" "contributions of the individuals"
fviz_pca_ind(res.pca,
             col.ind = "cos2", # Color by the quality of representation
             gradient.cols = c("darkorchid4", "gold", "darkorange"),
             repel = TRUE
             )

BOURGUIGNON, Macey, Karpov and Clay have very high cos2, which implies a good representation of individuals on the principal component - they are positioned close to the circumference of the correlation circle. YURKOV, Pogorelov, Barras and McMULLEN have the lowest cos2, which indicates that they are not well represented by the PCs - they are close to the center of the circle.

# Total contribution on PC1 and PC2
fviz_contrib(res.pca, choice = "ind", axes = 1:2)

Based on the position of the red dashed line (average contribution), individuals BOURGUIGNON, Karpov and Clay contribute the most to both dimensions.

The summary of above PCA analysis for both variables (sport disciplines) and individuals (athletes) is displayed in a correlation plot (autoplot) from ggfortify package wirg reference to dimensions 1 and 2.

autoplot(res.pca, loadings=TRUE, loadings.colour='darkorchid4', loadings.label=TRUE, loadings.label.size=3)

Final results and analysis

The purpose of this project is to distinguish which athletes obtained the best results among the whole group. So far, the Principal Component Analysis was conducted for both variables (sport disciplines) and individuals (athletes) with the use of prcomp() computation, eigenvalues extraction, square cosine and contributions. Considering the calculated PCs, I will summarize them in clusters via k-means clustering method. For that purpose, I will use eclust() function with 4 clusters as an assumption and autoplot() for 2D observations.

kmeans<-eclust(decathlon2.active, k=4)

autoplot(res.pca, data=kmeans, colour="cluster")

Cluster which is the closest to the origin (blue) presents athletes with the best results in sport disciplines. Violet cluster shows the athletes with average results, wheres remaining clusters (green and red) correspond to the worst athletes.

References

  1. www.sthda.com
  2. www.stackoverflow.com
  3. www.r-project.org