Multivariate Statistics - StatUa

Principal Component Analysis - Exercise solutions

Questions:

Explore the dimensions of the variables

Is scaling necessary?

Run a PCA analysis

Use Kaiser’s Criterion (eigenvalues > 1) and the “Elbow” method to decide how many components to keep. What percentage of total variance is explained by your choice?

Examine the structural loadings for the first two principal components.

Identify which events load most heavily (> 0.4) on PC1. Does PC1 represent “speed,” “strength,” or “technical skill”?

Construct a Biplot to visualize both athletes (points) and events (vectors).

Find an athlete who performed exceptionally well in “speed” events but poorly in “strength” events based on their position relative to the vectors.


library(FactoMineR)

data(decathlon)

decathlon_cont <- decathlon[,1:10]

summary(decathlon_cont)
#>       100m         Long.jump       Shot.put       High.jump          400m      
#>  Min.   :10.44   Min.   :6.61   Min.   :12.68   Min.   :1.850   Min.   :46.81  
#>  1st Qu.:10.85   1st Qu.:7.03   1st Qu.:13.88   1st Qu.:1.920   1st Qu.:48.93  
#>  Median :10.98   Median :7.30   Median :14.57   Median :1.950   Median :49.40  
#>  Mean   :11.00   Mean   :7.26   Mean   :14.48   Mean   :1.977   Mean   :49.62  
#>  3rd Qu.:11.14   3rd Qu.:7.48   3rd Qu.:14.97   3rd Qu.:2.040   3rd Qu.:50.30  
#>  Max.   :11.64   Max.   :7.96   Max.   :16.36   Max.   :2.150   Max.   :53.20  
#>   110m.hurdle        Discus        Pole.vault       Javeline    
#>  Min.   :13.97   Min.   :37.92   Min.   :4.200   Min.   :50.31  
#>  1st Qu.:14.21   1st Qu.:41.90   1st Qu.:4.500   1st Qu.:55.27  
#>  Median :14.48   Median :44.41   Median :4.800   Median :58.36  
#>  Mean   :14.61   Mean   :44.33   Mean   :4.762   Mean   :58.32  
#>  3rd Qu.:14.98   3rd Qu.:46.07   3rd Qu.:4.920   3rd Qu.:60.89  
#>  Max.   :15.67   Max.   :51.65   Max.   :5.400   Max.   :70.52  
#>      1500m      
#>  Min.   :262.1  
#>  1st Qu.:271.0  
#>  Median :278.1  
#>  Mean   :279.0  
#>  3rd Qu.:285.1  
#>  Max.   :317.0

decathlon_range <- data.frame(cbind(sapply(decathlon_cont,min),
                 sapply(decathlon_cont,max),
                 round(sapply(decathlon_cont,var),2))) 
names(decathlon_range) <- c("min value","max value","variance")

decathlon_range
#>             min value max value variance
#> 100m            10.44     11.64     0.07
#> Long.jump        6.61      7.96     0.10
#> Shot.put        12.68     16.36     0.68
#> High.jump        1.85      2.15     0.01
#> 400m            46.81     53.20     1.33
#> 110m.hurdle     13.97     15.67     0.22
#> Discus          37.92     51.65    11.41
#> Pole.vault       4.20      5.40     0.08
#> Javeline        50.31     70.52    23.30
#> 1500m          262.10    317.00   136.26

There’s a big difference in the range of the variables. For example, the ‘1500m’ event variable ranges between [262,317] and the range of the ‘long jump’ variable lies between [6.6, 7.9]. If we don’t standardize the variables, the variance would be dominated by the ‘1500m’ variable, possibly biasing the results. Therefore, scaling of the decathlon dataset is necessary before analysis.


decathlon_scaled <- scale(decathlon_cont)

pca_decathlon <- prcomp(decathlon_scaled)

screeplot(pca_decathlon, type="lines", main='Scree plot',pch=19,cex=1.5)
abline(h=1,lty='dotted',col='red') # add a horizontal line for eigenvalue = 1


summary(pca_decathlon)
#> Importance of components:
#>                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
#> Standard deviation     1.8088 1.3180 1.1853 1.0280 0.82751 0.77412 0.67174
#> Proportion of Variance 0.3272 0.1737 0.1405 0.1057 0.06848 0.05993 0.04512
#> Cumulative Proportion  0.3272 0.5009 0.6414 0.7471 0.81556 0.87548 0.92061
#>                            PC8     PC9    PC10
#> Standard deviation     0.62998 0.46348 0.42688
#> Proportion of Variance 0.03969 0.02148 0.01822
#> Cumulative Proportion  0.96030 0.98178 1.00000


fviz_eig(pca_decathlon)

Based on Kaiser’s Criterion, we would choose 4 components, based on the “Elbow” method we would choose 4 or 5 components. If we choose 4 components, we explain 74.71% of the total variance.

pca_decathlon$rotation
#>                     PC1        PC2         PC3         PC4         PC5
#> 100m        -0.42829627  0.1419891 -0.15557953  0.03678703  0.36518741
#> Long.jump    0.41015201 -0.2620794  0.15372674 -0.09901016  0.04432336
#> Shot.put     0.34414444  0.4539470 -0.01972378 -0.18539458  0.13431954
#> High.jump    0.31619436  0.2657761 -0.21894349  0.13189684  0.67121760
#> 400m        -0.37571570  0.4320460  0.11091758 -0.02850297 -0.10597034
#> 110m.hurdle -0.41255442  0.1735910 -0.07815576 -0.28290068  0.19857266
#> Discus       0.30542571  0.4600244  0.03623770  0.25259074 -0.12667770
#> Pole.vault   0.02783081 -0.1368411  0.58361717 -0.53649480  0.39873734
#> Javeline     0.15319802  0.2405071 -0.32874217 -0.69285498 -0.36873120
#> 1500m       -0.03210733  0.3598049  0.65987362  0.15669648 -0.18557094
#>                     PC6         PC7         PC8         PC9        PC10
#> 100m        -0.29607739 -0.38177608  0.46160211 -0.10475771  0.42428269
#> Long.jump    0.30612478 -0.62769317 -0.02101165 -0.48266910  0.08104448
#> Shot.put    -0.30547299  0.30972542 -0.31393005 -0.42729075  0.39028424
#> High.jump    0.46777116  0.09145002  0.12509166  0.24366054 -0.10642724
#> 400m         0.33252178  0.12442114  0.21339819 -0.55212939 -0.41399532
#> 110m.hurdle  0.09963776 -0.35733030 -0.71111429  0.15013429 -0.09086448
#> Discus      -0.44937288 -0.42988982  0.03838986  0.15480715 -0.44916580
#> Pole.vault  -0.26166458  0.09796019  0.17803824  0.08297769 -0.27645138
#> Javeline     0.16320268 -0.10674519  0.29614206  0.24732691  0.08777340
#> 1500m        0.29826888 -0.08362898  0.01371744  0.30773397  0.42923132

The variables that load most strongly onto PC1 are: - 100m - 110m.hurdle - long.jump

The first principal component represents ‘speed’.

The variables that load most strongly onto PC2 are: - shot.put - 400m - discus

The second principal component represents ‘strength’.

library(ggfortify)
#> Warning: package 'ggfortify' was built under R version 4.5.3

autoplot(pca_decathlon, data = decathlon,
         loadings = TRUE, loadings.label = TRUE, loadings.label.size = 4)
#> Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
#> ℹ Please use tidy evaluation idioms with `aes()`.
#> ℹ See also `vignette("ggplot2-in-packages")` for more information.
#> ℹ The deprecated feature was likely used in the ggfortify package.
#>   Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
#> This warning is displayed once per session.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.


library(ggbiplot)
#> Warning: package 'ggbiplot' was built under R version 4.5.3
#> 
#> Attaching package: 'ggbiplot'
#> The following object is masked from 'package:ggfortify':
#> 
#>     ggbiplot
#> The following object is masked from 'package:psych':
#> 
#>     reflect

ggbiplot(pca_decathlon)

pca_stand_scores <- data.frame(scale(pca_decathlon$x))
which(pca_stand_scores$PC1 < -1 & pca_stand_scores$PC2 < -1)
#> [1] 12 37

Nool and Lorenzo are two athletes that score well on the ‘speed’ events of the decathlon, but not well on the ‘strength’ events.

library(tidyverse)
ggbiplot(pca_decathlon)+
geom_point(aes(x=PC1,y=PC2),data=pca_stand_scores[c(12,37),1:2],
           pch=21,
           size=6,
           colour="red")

Graphical displays - exercise solutions

dr. Annelies Agten

2026-04-27

Multivariate Statistics - StatUa

Principal Component Analysis - Exercise solutions