Questions:
Explore the dimensions of the variables
Is scaling necessary?
Run a PCA analysis
Use Kaiser’s Criterion (eigenvalues > 1) and the “Elbow” method to decide how many components to keep. What percentage of total variance is explained by your choice?
Examine the structural loadings for the first two principal components.
Identify which events load most heavily (> 0.4) on PC1. Does PC1 represent “speed,” “strength,” or “technical skill”?
Construct a Biplot to visualize both athletes (points) and events (vectors).
Find an athlete who performed exceptionally well in “speed” events but poorly in “strength” events based on their position relative to the vectors.
library(FactoMineR)
data(decathlon)
decathlon_cont <- decathlon[,1:10]
summary(decathlon_cont)
#> 100m Long.jump Shot.put High.jump 400m
#> Min. :10.44 Min. :6.61 Min. :12.68 Min. :1.850 Min. :46.81
#> 1st Qu.:10.85 1st Qu.:7.03 1st Qu.:13.88 1st Qu.:1.920 1st Qu.:48.93
#> Median :10.98 Median :7.30 Median :14.57 Median :1.950 Median :49.40
#> Mean :11.00 Mean :7.26 Mean :14.48 Mean :1.977 Mean :49.62
#> 3rd Qu.:11.14 3rd Qu.:7.48 3rd Qu.:14.97 3rd Qu.:2.040 3rd Qu.:50.30
#> Max. :11.64 Max. :7.96 Max. :16.36 Max. :2.150 Max. :53.20
#> 110m.hurdle Discus Pole.vault Javeline
#> Min. :13.97 Min. :37.92 Min. :4.200 Min. :50.31
#> 1st Qu.:14.21 1st Qu.:41.90 1st Qu.:4.500 1st Qu.:55.27
#> Median :14.48 Median :44.41 Median :4.800 Median :58.36
#> Mean :14.61 Mean :44.33 Mean :4.762 Mean :58.32
#> 3rd Qu.:14.98 3rd Qu.:46.07 3rd Qu.:4.920 3rd Qu.:60.89
#> Max. :15.67 Max. :51.65 Max. :5.400 Max. :70.52
#> 1500m
#> Min. :262.1
#> 1st Qu.:271.0
#> Median :278.1
#> Mean :279.0
#> 3rd Qu.:285.1
#> Max. :317.0
decathlon_range <- data.frame(cbind(sapply(decathlon_cont,min),
sapply(decathlon_cont,max),
round(sapply(decathlon_cont,var),2)))
names(decathlon_range) <- c("min value","max value","variance")
decathlon_range
#> min value max value variance
#> 100m 10.44 11.64 0.07
#> Long.jump 6.61 7.96 0.10
#> Shot.put 12.68 16.36 0.68
#> High.jump 1.85 2.15 0.01
#> 400m 46.81 53.20 1.33
#> 110m.hurdle 13.97 15.67 0.22
#> Discus 37.92 51.65 11.41
#> Pole.vault 4.20 5.40 0.08
#> Javeline 50.31 70.52 23.30
#> 1500m 262.10 317.00 136.26There’s a big difference in the range of the variables. For example, the ‘1500m’ event variable ranges between [262,317] and the range of the ‘long jump’ variable lies between [6.6, 7.9]. If we don’t standardize the variables, the variance would be dominated by the ‘1500m’ variable, possibly biasing the results. Therefore, scaling of the decathlon dataset is necessary before analysis.
decathlon_scaled <- scale(decathlon_cont)
pca_decathlon <- prcomp(decathlon_scaled)
screeplot(pca_decathlon, type="lines", main='Scree plot',pch=19,cex=1.5)
abline(h=1,lty='dotted',col='red') # add a horizontal line for eigenvalue = 1
summary(pca_decathlon)
#> Importance of components:
#> PC1 PC2 PC3 PC4 PC5 PC6 PC7
#> Standard deviation 1.8088 1.3180 1.1853 1.0280 0.82751 0.77412 0.67174
#> Proportion of Variance 0.3272 0.1737 0.1405 0.1057 0.06848 0.05993 0.04512
#> Cumulative Proportion 0.3272 0.5009 0.6414 0.7471 0.81556 0.87548 0.92061
#> PC8 PC9 PC10
#> Standard deviation 0.62998 0.46348 0.42688
#> Proportion of Variance 0.03969 0.02148 0.01822
#> Cumulative Proportion 0.96030 0.98178 1.00000Based on Kaiser’s Criterion, we would choose 4 components, based on the “Elbow” method we would choose 4 or 5 components. If we choose 4 components, we explain 74.71% of the total variance.
pca_decathlon$rotation
#> PC1 PC2 PC3 PC4 PC5
#> 100m -0.42829627 0.1419891 -0.15557953 0.03678703 0.36518741
#> Long.jump 0.41015201 -0.2620794 0.15372674 -0.09901016 0.04432336
#> Shot.put 0.34414444 0.4539470 -0.01972378 -0.18539458 0.13431954
#> High.jump 0.31619436 0.2657761 -0.21894349 0.13189684 0.67121760
#> 400m -0.37571570 0.4320460 0.11091758 -0.02850297 -0.10597034
#> 110m.hurdle -0.41255442 0.1735910 -0.07815576 -0.28290068 0.19857266
#> Discus 0.30542571 0.4600244 0.03623770 0.25259074 -0.12667770
#> Pole.vault 0.02783081 -0.1368411 0.58361717 -0.53649480 0.39873734
#> Javeline 0.15319802 0.2405071 -0.32874217 -0.69285498 -0.36873120
#> 1500m -0.03210733 0.3598049 0.65987362 0.15669648 -0.18557094
#> PC6 PC7 PC8 PC9 PC10
#> 100m -0.29607739 -0.38177608 0.46160211 -0.10475771 0.42428269
#> Long.jump 0.30612478 -0.62769317 -0.02101165 -0.48266910 0.08104448
#> Shot.put -0.30547299 0.30972542 -0.31393005 -0.42729075 0.39028424
#> High.jump 0.46777116 0.09145002 0.12509166 0.24366054 -0.10642724
#> 400m 0.33252178 0.12442114 0.21339819 -0.55212939 -0.41399532
#> 110m.hurdle 0.09963776 -0.35733030 -0.71111429 0.15013429 -0.09086448
#> Discus -0.44937288 -0.42988982 0.03838986 0.15480715 -0.44916580
#> Pole.vault -0.26166458 0.09796019 0.17803824 0.08297769 -0.27645138
#> Javeline 0.16320268 -0.10674519 0.29614206 0.24732691 0.08777340
#> 1500m 0.29826888 -0.08362898 0.01371744 0.30773397 0.42923132The variables that load most strongly onto PC1 are: - 100m - 110m.hurdle - long.jump
The first principal component represents ‘speed’.
The variables that load most strongly onto PC2 are: - shot.put - 400m - discus
The second principal component represents ‘strength’.
library(ggfortify)
#> Warning: package 'ggfortify' was built under R version 4.5.3
autoplot(pca_decathlon, data = decathlon,
loadings = TRUE, loadings.label = TRUE, loadings.label.size = 4)
#> Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
#> ℹ Please use tidy evaluation idioms with `aes()`.
#> ℹ See also `vignette("ggplot2-in-packages")` for more information.
#> ℹ The deprecated feature was likely used in the ggfortify package.
#> Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
#> This warning is displayed once per session.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
library(ggbiplot)
#> Warning: package 'ggbiplot' was built under R version 4.5.3
#>
#> Attaching package: 'ggbiplot'
#> The following object is masked from 'package:ggfortify':
#>
#> ggbiplot
#> The following object is masked from 'package:psych':
#>
#> reflect
ggbiplot(pca_decathlon)pca_stand_scores <- data.frame(scale(pca_decathlon$x))
which(pca_stand_scores$PC1 < -1 & pca_stand_scores$PC2 < -1)
#> [1] 12 37Nool and Lorenzo are two athletes that score well on the ‘speed’ events of the decathlon, but not well on the ‘strength’ events.
library(tidyverse)
ggbiplot(pca_decathlon)+
geom_point(aes(x=PC1,y=PC2),data=pca_stand_scores[c(12,37),1:2],
pch=21,
size=6,
colour="red")