Questions:
Examine the correlation matrix of the 10 decathlon events.
Is this data suitable for Factor Analysis?
Run the analysis twice: once with Maximum Likelihood (ML) and once with Principal Axis Factoring (PAF).
Check for normality in the variables. Based on your findings, which extraction method is more “defensible”?
Use a Scree Plot or Parallel Analysis to decide on the number of factors.
Extract the factors using Principal Axis Factoring.
Check the communalities. Which events are poorly explained by the common factors? (i.e., they have high “uniqueness”) .
Compare Varimax (Orthogonal) vs. Oblimin (Oblique).
Does a “strong” athlete naturally tend to be a “fast” athlete? Check the factor correlations in the Oblimin output.
Which rotation provides a “Simpler Structure” (cleaner loadings)?
Based on your best EFA model, name your factors (e.g., “Explosive Power,” “Technical Agility,” “Endurance”).
Compare your EFA factors to your PCA components. Do the latent factors provide a more “common sense” understanding of athlete profiles?.
library(FactoMineR)
data(decathlon)
decathlon_cont <- decathlon[,1:10]
decathlon_cont <- scale(decathlon_cont)
summary(decathlon_cont)
#> 100m Long.jump Shot.put High.jump
#> Min. :-2.12167 Min. :-2.0544 Min. :-2.1798 Min. :-1.4258
#> 1st Qu.:-0.56287 1st Qu.:-0.7269 1st Qu.:-0.7242 1st Qu.:-0.6389
#> Median :-0.06862 Median : 0.1264 Median : 0.1127 Median :-0.3016
#> Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
#> 3rd Qu.: 0.53969 3rd Qu.: 0.6953 3rd Qu.: 0.5979 3rd Qu.: 0.7102
#> Max. : 2.44067 Max. : 2.2124 Max. : 2.2839 Max. : 1.9468
#> 400m 110m.hurdle Discus Pole.vault
#> Min. :-2.4330 Min. :-1.3478 Min. :-1.89636 Min. :-2.0232
#> 1st Qu.:-0.5950 1st Qu.:-0.8390 1st Qu.:-0.71809 1st Qu.:-0.9440
#> Median :-0.1876 Median :-0.2668 Median : 0.02498 Median : 0.1351
#> Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
#> 3rd Qu.: 0.5927 3rd Qu.: 0.7930 3rd Qu.: 0.51642 3rd Qu.: 0.5668
#> Max. : 3.1069 Max. : 2.2556 Max. : 2.16836 Max. : 2.2934
#> Javeline 1500m
#> Min. :-1.658770 Min. :-1.44989
#> 1st Qu.:-0.631179 1st Qu.:-0.68575
#> Median : 0.008994 Median :-0.08351
#> Mean : 0.000000 Mean : 0.00000
#> 3rd Qu.: 0.533149 3rd Qu.: 0.52043
#> Max. : 2.528251 Max. : 3.25318For EFA to be meaningful, variables must show at least moderate correlations (typically |r| > 0.30). In the decathlon, events like the 100m, Long Jump, and 110m Hurdles show strong positive correlations because they all rely on explosive “fast-twitch” muscle fibers. If the matrix were mostly zeros, there would be no “common variance” to extract.
Another method of testing whether the data is suitable for Factor Analysis, is by using the Kaisr-Meyer-Olkin and Bartlett’s test.
# Kaiser-Meyer-Olkin (KMO)
KMO(decathlon_cont)
#> Kaiser-Meyer-Olkin factor adequacy
#> Call: KMO(r = decathlon_cont)
#> Overall MSA = 0.6
#> MSA for each item =
#> 100m Long.jump Shot.put High.jump 400m 110m.hurdle
#> 0.69 0.72 0.57 0.70 0.61 0.83
#> Discus Pole.vault Javeline 1500m
#> 0.54 0.24 0.44 0.29
# Bartlett’s test
cortest.bartlett(cor_mat, n = nrow(decathlon_cont))
#> $chisq
#> [1] 133.2358
#>
#> $p.value
#> [1] 1.15685e-10
#>
#> $df
#> [1] 45Interpretation:
Based on the correlation matrix, the result of the KMO and Bartlett’s test, we can assume that the decathlon data is suitable for factor analysis.
In order to be able to use the Maximum Likelihood extraction method,
we need to test whether our data follows a (multivariate) Normal
distribution. We can test for univariate normality of the variables by
using the Shapiro-Wilk test. Multivariate normality can be checked using
the MVN function in the MVN package.
# Shapiro tests (not perfect but indicative)
apply(decathlon_cont, 2, shapiro.test)
#> $`100m`
#>
#> Shapiro-Wilk normality test
#>
#> data: newX[, i]
#> W = 0.9818, p-value = 0.7435
#>
#>
#> $Long.jump
#>
#> Shapiro-Wilk normality test
#>
#> data: newX[, i]
#> W = 0.98763, p-value = 0.9289
#>
#>
#> $Shot.put
#>
#> Shapiro-Wilk normality test
#>
#> data: newX[, i]
#> W = 0.9884, p-value = 0.9456
#>
#>
#> $High.jump
#>
#> Shapiro-Wilk normality test
#>
#> data: newX[, i]
#> W = 0.93734, p-value = 0.0255
#>
#>
#> $`400m`
#>
#> Shapiro-Wilk normality test
#>
#> data: newX[, i]
#> W = 0.95714, p-value = 0.1248
#>
#>
#> $`110m.hurdle`
#>
#> Shapiro-Wilk normality test
#>
#> data: newX[, i]
#> W = 0.93087, p-value = 0.01544
#>
#>
#> $Discus
#>
#> Shapiro-Wilk normality test
#>
#> data: newX[, i]
#> W = 0.96975, p-value = 0.3385
#>
#>
#> $Pole.vault
#>
#> Shapiro-Wilk normality test
#>
#> data: newX[, i]
#> W = 0.97003, p-value = 0.3456
#>
#>
#> $Javeline
#>
#> Shapiro-Wilk normality test
#>
#> data: newX[, i]
#> W = 0.97106, p-value = 0.3732
#>
#>
#> $`1500m`
#>
#> Shapiro-Wilk normality test
#>
#> data: newX[, i]
#> W = 0.93652, p-value = 0.02391
library(MVN)
#> Warning: package 'MVN' was built under R version 4.5.3
#>
#> Attaching package: 'MVN'
#> The following object is masked from 'package:psych':
#>
#> mardia
mvn(scale(decathlon_cont))$multivariate_normality
#> Test Statistic p.value Method MVN
#> 1 Henze-Zirkler 0.973 0.271 asymptotic ✓ NormalInterpretation:
There is no clear cut answer here. Univariate tests show deviation from normality, but the multivariate normality test shows no indication of the assumption being violated. We can run an analysis with both ‘PA’ and ‘ML’ as the extraction method and compare the results.
# Maximum Likelihood
fa_ml <- fa(decathlon_cont, nfactors = 3, fm = "ml", rotate = "none")
# Principal Axis Factoring
fa_paf <- fa(decathlon_cont, nfactors = 3, fm = "pa", rotate = "none")
#> maximum iteration exceeded
#> Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
#> ultra-Heywood case was detected. Examine the results carefully
# Compare loadings
fa_ml$loadings
#>
#> Loadings:
#> ML2 ML3 ML1
#> 100m -0.573 0.508
#> Long.jump 0.455 -0.629
#> Shot.put 0.881 0.322 0.121
#> High.jump 0.547
#> 400m -0.432 0.617 0.412
#> 110m.hurdle -0.493 0.514
#> Discus 0.621 0.114 0.261
#> Pole.vault -0.174 0.246
#> Javeline 0.356 0.239 -0.178
#> 1500m 0.997
#>
#> ML2 ML3 ML1
#> SS loadings 2.554 1.504 1.347
#> Proportion Var 0.255 0.150 0.135
#> Cumulative Var 0.255 0.406 0.541
fa_paf$loadings
#>
#> Loadings:
#> PA1 PA2 PA3
#> 100m -0.725 0.232
#> Long.jump 0.703 -0.340
#> Shot.put 0.613 0.323 0.552
#> High.jump 0.486 0.302
#> 400m -0.670 0.364 0.313
#> 110m.hurdle -0.682 0.204
#> Discus 0.490 0.345 0.357
#> Pole.vault 0.123 -0.266
#> Javeline 0.222 0.320
#> 1500m -0.103 1.198 -0.352
#>
#> PA1 PA2 PA3
#> SS loadings 2.849 1.821 1.130
#> Proportion Var 0.285 0.182 0.113
#> Cumulative Var 0.285 0.467 0.580The results are quite different across the different extraction methods. Since the assumption of multivariate normality was not violated, we can here chose for the ML extraction method.
To choose the number of latent factors to retain, we can look at the ‘elbow’ in the scree plot, or we can perform a parallel procedure analysis.
#> Parallel analysis suggests that the number of factors = 2 and the number of components = NA
Interpretation:
Based on the parallel analysis method, we choose three latent factors.
fa_final <- fa(decathlon_cont, nfactors = 3, fm = "ml", rotate = "varimax")
print(fa_final$loadings, cutoff = 0.3)
#>
#> Loadings:
#> ML3 ML2 ML1
#> 100m 0.699
#> Long.jump -0.764
#> Shot.put 0.934
#> High.jump 0.497
#> 400m 0.815
#> 110m.hurdle 0.685
#> Discus 0.618
#> Pole.vault
#> Javeline 0.412
#> 1500m 0.976
#>
#> ML3 ML2 ML1
#> SS loadings 2.350 1.791 1.264
#> Proportion Var 0.235 0.179 0.126
#> Cumulative Var 0.235 0.414 0.541
fa_final$communality
#> 100m Long.jump Shot.put High.jump 400m 110m.hurdle
#> 0.58935708 0.60410100 0.89440642 0.30274742 0.73621520 0.50882895
#> Discus Pole.vault Javeline 1500m
#> 0.46633122 0.09336771 0.21517895 0.99500532If we check the communalities, we can see that there are a few variables that have a low value (below 0.3): Pole vault and Javeline. These variables have a high uniqueness, meaning they cannot be explained well by the underlying factors or concepts. This can be explained by the fact that these are quite technical events, hence they have less in common with some of the other events, that have similar ‘types’ of movement (throwing/running for example).
# Varimax (orthogonal)
fa_varimax <- fa(decathlon_cont, nfactors = 3, fm = "ml", rotate = "varimax")
# Oblimin (oblique)
fa_oblimin <- fa(decathlon_cont, nfactors = 3, fm = "ml", rotate = "oblimin")
#> Loading required namespace: GPArotation
# Compare loadings
fa_varimax$loadings
#>
#> Loadings:
#> ML3 ML2 ML1
#> 100m 0.699 -0.264 -0.178
#> Long.jump -0.764 0.107
#> Shot.put -0.128 0.934
#> High.jump -0.232 0.497
#> 400m 0.815 0.263
#> 110m.hurdle 0.685 -0.183
#> Discus -0.152 0.618 0.248
#> Pole.vault -0.124 0.278
#> Javeline 0.412 -0.214
#> 1500m 0.190 0.976
#>
#> ML3 ML2 ML1
#> SS loadings 2.350 1.791 1.264
#> Proportion Var 0.235 0.179 0.126
#> Cumulative Var 0.235 0.414 0.541
fa_oblimin$loadings
#>
#> Loadings:
#> ML3 ML2 ML1
#> 100m 0.713 -0.143 -0.143
#> Long.jump -0.795
#> Shot.put 0.952
#> High.jump -0.148 0.488
#> 400m 0.782 0.290
#> 110m.hurdle 0.693
#> Discus -0.101 0.597 0.190
#> Pole.vault -0.187 0.284
#> Javeline 0.110 0.455 -0.259
#> 1500m 0.995
#>
#> ML3 ML2 ML1
#> SS loadings 2.311 1.742 1.297
#> Proportion Var 0.231 0.174 0.130
#> Cumulative Var 0.231 0.405 0.535If we compare the loadings of the two rotations, they are very similar, and they produce similar interpretations for the factors.
fa_oblimin$Phi
#> ML3 ML2 ML1
#> ML3 1.0000000 -0.3002647 0.1286287
#> ML2 -0.3002647 1.0000000 0.1421010
#> ML1 0.1286287 0.1421010 1.0000000If we look at the correlations between the factors, we see that these are very small to moderate (below 0.3). Therefore, we can assume that an oblique rotation is not necessary in this case and the varimax transformation suffices.
Giving an explanation to the latent factors:
Lastly, comparing the results to PCA analysis:
# PCA
pca_res <- prcomp(scale(decathlon_cont))
summary(pca_res)
#> Importance of components:
#> PC1 PC2 PC3 PC4 PC5 PC6 PC7
#> Standard deviation 1.8088 1.3180 1.1853 1.0280 0.82751 0.77412 0.67174
#> Proportion of Variance 0.3272 0.1737 0.1405 0.1057 0.06848 0.05993 0.04512
#> Cumulative Proportion 0.3272 0.5009 0.6414 0.7471 0.81556 0.87548 0.92061
#> PC8 PC9 PC10
#> Standard deviation 0.62998 0.46348 0.42688
#> Proportion of Variance 0.03969 0.02148 0.01822
#> Cumulative Proportion 0.96030 0.98178 1.00000
pca_res$rotation
#> PC1 PC2 PC3 PC4 PC5
#> 100m -0.42829627 0.1419891 -0.15557953 0.03678703 0.36518741
#> Long.jump 0.41015201 -0.2620794 0.15372674 -0.09901016 0.04432336
#> Shot.put 0.34414444 0.4539470 -0.01972378 -0.18539458 0.13431954
#> High.jump 0.31619436 0.2657761 -0.21894349 0.13189684 0.67121760
#> 400m -0.37571570 0.4320460 0.11091758 -0.02850297 -0.10597034
#> 110m.hurdle -0.41255442 0.1735910 -0.07815576 -0.28290068 0.19857266
#> Discus 0.30542571 0.4600244 0.03623770 0.25259074 -0.12667770
#> Pole.vault 0.02783081 -0.1368411 0.58361717 -0.53649480 0.39873734
#> Javeline 0.15319802 0.2405071 -0.32874217 -0.69285498 -0.36873120
#> 1500m -0.03210733 0.3598049 0.65987362 0.15669648 -0.18557094
#> PC6 PC7 PC8 PC9 PC10
#> 100m -0.29607739 -0.38177608 0.46160211 -0.10475771 0.42428269
#> Long.jump 0.30612478 -0.62769317 -0.02101165 -0.48266910 0.08104448
#> Shot.put -0.30547299 0.30972542 -0.31393005 -0.42729075 0.39028424
#> High.jump 0.46777116 0.09145002 0.12509166 0.24366054 -0.10642724
#> 400m 0.33252178 0.12442114 0.21339819 -0.55212939 -0.41399532
#> 110m.hurdle 0.09963776 -0.35733030 -0.71111429 0.15013429 -0.09086448
#> Discus -0.44937288 -0.42988982 0.03838986 0.15480715 -0.44916580
#> Pole.vault -0.26166458 0.09796019 0.17803824 0.08297769 -0.27645138
#> Javeline 0.16320268 -0.10674519 0.29614206 0.24732691 0.08777340
#> 1500m 0.29826888 -0.08362898 0.01371744 0.30773397 0.42923132Interpretation:
Key insight: EFA factors align better with real-world concepts