Exploratory Factor Analysis - exercise solutions

dr. Annelies Agten

2026-04-27

Multivariate Statistics - StatUa

Exploratory Factor Analysis - Exercise solutions

Questions:

Examine the correlation matrix of the 10 decathlon events.

Is this data suitable for Factor Analysis?

Run the analysis twice: once with Maximum Likelihood (ML) and once with Principal Axis Factoring (PAF).

Check for normality in the variables. Based on your findings, which extraction method is more “defensible”?

Use a Scree Plot or Parallel Analysis to decide on the number of factors.

Extract the factors using Principal Axis Factoring.

Check the communalities. Which events are poorly explained by the common factors? (i.e., they have high “uniqueness”) .

Compare Varimax (Orthogonal) vs. Oblimin (Oblique).

Does a “strong” athlete naturally tend to be a “fast” athlete? Check the factor correlations in the Oblimin output.

Which rotation provides a “Simpler Structure” (cleaner loadings)?

Based on your best EFA model, name your factors (e.g., “Explosive Power,” “Technical Agility,” “Endurance”).

Compare your EFA factors to your PCA components. Do the latent factors provide a more “common sense” understanding of athlete profiles?.


library(FactoMineR)

data(decathlon)

decathlon_cont <- decathlon[,1:10]

decathlon_cont <- scale(decathlon_cont)

summary(decathlon_cont)
#>       100m            Long.jump          Shot.put         High.jump      
#>  Min.   :-2.12167   Min.   :-2.0544   Min.   :-2.1798   Min.   :-1.4258  
#>  1st Qu.:-0.56287   1st Qu.:-0.7269   1st Qu.:-0.7242   1st Qu.:-0.6389  
#>  Median :-0.06862   Median : 0.1264   Median : 0.1127   Median :-0.3016  
#>  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
#>  3rd Qu.: 0.53969   3rd Qu.: 0.6953   3rd Qu.: 0.5979   3rd Qu.: 0.7102  
#>  Max.   : 2.44067   Max.   : 2.2124   Max.   : 2.2839   Max.   : 1.9468  
#>       400m          110m.hurdle          Discus           Pole.vault     
#>  Min.   :-2.4330   Min.   :-1.3478   Min.   :-1.89636   Min.   :-2.0232  
#>  1st Qu.:-0.5950   1st Qu.:-0.8390   1st Qu.:-0.71809   1st Qu.:-0.9440  
#>  Median :-0.1876   Median :-0.2668   Median : 0.02498   Median : 0.1351  
#>  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
#>  3rd Qu.: 0.5927   3rd Qu.: 0.7930   3rd Qu.: 0.51642   3rd Qu.: 0.5668  
#>  Max.   : 3.1069   Max.   : 2.2556   Max.   : 2.16836   Max.   : 2.2934  
#>     Javeline             1500m         
#>  Min.   :-1.658770   Min.   :-1.44989  
#>  1st Qu.:-0.631179   1st Qu.:-0.68575  
#>  Median : 0.008994   Median :-0.08351  
#>  Mean   : 0.000000   Mean   : 0.00000  
#>  3rd Qu.: 0.533149   3rd Qu.: 0.52043  
#>  Max.   : 2.528251   Max.   : 3.25318

For EFA to be meaningful, variables must show at least moderate correlations (typically |r| > 0.30). In the decathlon, events like the 100m, Long Jump, and 110m Hurdles show strong positive correlations because they all rely on explosive “fast-twitch” muscle fibers. If the matrix were mostly zeros, there would be no “common variance” to extract.


cor_mat <- cor(decathlon_cont)

corrplot(cor_mat, method = "color", addCoef.col = "black")

Another method of testing whether the data is suitable for Factor Analysis, is by using the Kaisr-Meyer-Olkin and Bartlett’s test.

# Kaiser-Meyer-Olkin (KMO)
KMO(decathlon_cont)
#> Kaiser-Meyer-Olkin factor adequacy
#> Call: KMO(r = decathlon_cont)
#> Overall MSA =  0.6
#> MSA for each item = 
#>        100m   Long.jump    Shot.put   High.jump        400m 110m.hurdle 
#>        0.69        0.72        0.57        0.70        0.61        0.83 
#>      Discus  Pole.vault    Javeline       1500m 
#>        0.54        0.24        0.44        0.29

# Bartlett’s test
cortest.bartlett(cor_mat, n = nrow(decathlon_cont))
#> $chisq
#> [1] 133.2358
#> 
#> $p.value
#> [1] 1.15685e-10
#> 
#> $df
#> [1] 45

Interpretation:

Based on the correlation matrix, the result of the KMO and Bartlett’s test, we can assume that the decathlon data is suitable for factor analysis.

In order to be able to use the Maximum Likelihood extraction method, we need to test whether our data follows a (multivariate) Normal distribution. We can test for univariate normality of the variables by using the Shapiro-Wilk test. Multivariate normality can be checked using the MVN function in the MVN package.


# Shapiro tests (not perfect but indicative)
apply(decathlon_cont, 2, shapiro.test)
#> $`100m`
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  newX[, i]
#> W = 0.9818, p-value = 0.7435
#> 
#> 
#> $Long.jump
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  newX[, i]
#> W = 0.98763, p-value = 0.9289
#> 
#> 
#> $Shot.put
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  newX[, i]
#> W = 0.9884, p-value = 0.9456
#> 
#> 
#> $High.jump
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  newX[, i]
#> W = 0.93734, p-value = 0.0255
#> 
#> 
#> $`400m`
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  newX[, i]
#> W = 0.95714, p-value = 0.1248
#> 
#> 
#> $`110m.hurdle`
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  newX[, i]
#> W = 0.93087, p-value = 0.01544
#> 
#> 
#> $Discus
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  newX[, i]
#> W = 0.96975, p-value = 0.3385
#> 
#> 
#> $Pole.vault
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  newX[, i]
#> W = 0.97003, p-value = 0.3456
#> 
#> 
#> $Javeline
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  newX[, i]
#> W = 0.97106, p-value = 0.3732
#> 
#> 
#> $`1500m`
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  newX[, i]
#> W = 0.93652, p-value = 0.02391

library(MVN)
#> Warning: package 'MVN' was built under R version 4.5.3
#> 
#> Attaching package: 'MVN'
#> The following object is masked from 'package:psych':
#> 
#>     mardia

mvn(scale(decathlon_cont))$multivariate_normality
#>            Test Statistic p.value     Method      MVN
#> 1 Henze-Zirkler     0.973   0.271 asymptotic ✓ Normal

Interpretation:

There is no clear cut answer here. Univariate tests show deviation from normality, but the multivariate normality test shows no indication of the assumption being violated. We can run an analysis with both ‘PA’ and ‘ML’ as the extraction method and compare the results.

# Maximum Likelihood
fa_ml <- fa(decathlon_cont, nfactors = 3, fm = "ml", rotate = "none")

# Principal Axis Factoring
fa_paf <- fa(decathlon_cont, nfactors = 3, fm = "pa", rotate = "none")
#> maximum iteration exceeded
#> Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
#> ultra-Heywood case was detected.  Examine the results carefully

# Compare loadings
fa_ml$loadings
#> 
#> Loadings:
#>             ML2    ML3    ML1   
#> 100m        -0.573  0.508       
#> Long.jump    0.455 -0.629       
#> Shot.put     0.881  0.322  0.121
#> High.jump    0.547              
#> 400m        -0.432  0.617  0.412
#> 110m.hurdle -0.493  0.514       
#> Discus       0.621  0.114  0.261
#> Pole.vault         -0.174  0.246
#> Javeline     0.356  0.239 -0.178
#> 1500m                      0.997
#> 
#>                  ML2   ML3   ML1
#> SS loadings    2.554 1.504 1.347
#> Proportion Var 0.255 0.150 0.135
#> Cumulative Var 0.255 0.406 0.541
fa_paf$loadings
#> 
#> Loadings:
#>             PA1    PA2    PA3   
#> 100m        -0.725         0.232
#> Long.jump    0.703        -0.340
#> Shot.put     0.613  0.323  0.552
#> High.jump    0.486         0.302
#> 400m        -0.670  0.364  0.313
#> 110m.hurdle -0.682         0.204
#> Discus       0.490  0.345  0.357
#> Pole.vault          0.123 -0.266
#> Javeline     0.222         0.320
#> 1500m       -0.103  1.198 -0.352
#> 
#>                  PA1   PA2   PA3
#> SS loadings    2.849 1.821 1.130
#> Proportion Var 0.285 0.182 0.113
#> Cumulative Var 0.285 0.467 0.580

The results are quite different across the different extraction methods. Since the assumption of multivariate normality was not violated, we can here chose for the ML extraction method.

To choose the number of latent factors to retain, we can look at the ‘elbow’ in the scree plot, or we can perform a parallel procedure analysis.


fa.parallel(decathlon_cont, fa="fa", fm="ml", ylab="Eigenvalues")

#> Parallel analysis suggests that the number of factors =  2  and the number of components =  NA

Interpretation:

Based on the parallel analysis method, we choose three latent factors.


fa_final <- fa(decathlon_cont, nfactors = 3, fm = "ml", rotate = "varimax")

print(fa_final$loadings, cutoff = 0.3)
#> 
#> Loadings:
#>             ML3    ML2    ML1   
#> 100m         0.699              
#> Long.jump   -0.764              
#> Shot.put            0.934       
#> High.jump           0.497       
#> 400m         0.815              
#> 110m.hurdle  0.685              
#> Discus              0.618       
#> Pole.vault                      
#> Javeline            0.412       
#> 1500m                      0.976
#> 
#>                  ML3   ML2   ML1
#> SS loadings    2.350 1.791 1.264
#> Proportion Var 0.235 0.179 0.126
#> Cumulative Var 0.235 0.414 0.541

fa_final$communality
#>        100m   Long.jump    Shot.put   High.jump        400m 110m.hurdle 
#>  0.58935708  0.60410100  0.89440642  0.30274742  0.73621520  0.50882895 
#>      Discus  Pole.vault    Javeline       1500m 
#>  0.46633122  0.09336771  0.21517895  0.99500532

If we check the communalities, we can see that there are a few variables that have a low value (below 0.3): Pole vault and Javeline. These variables have a high uniqueness, meaning they cannot be explained well by the underlying factors or concepts. This can be explained by the fact that these are quite technical events, hence they have less in common with some of the other events, that have similar ‘types’ of movement (throwing/running for example).

# Varimax (orthogonal)
fa_varimax <- fa(decathlon_cont, nfactors = 3, fm = "ml", rotate = "varimax")

# Oblimin (oblique)
fa_oblimin <- fa(decathlon_cont, nfactors = 3, fm = "ml", rotate = "oblimin")
#> Loading required namespace: GPArotation

# Compare loadings
fa_varimax$loadings
#> 
#> Loadings:
#>             ML3    ML2    ML1   
#> 100m         0.699 -0.264 -0.178
#> Long.jump   -0.764         0.107
#> Shot.put    -0.128  0.934       
#> High.jump   -0.232  0.497       
#> 400m         0.815         0.263
#> 110m.hurdle  0.685 -0.183       
#> Discus      -0.152  0.618  0.248
#> Pole.vault  -0.124         0.278
#> Javeline            0.412 -0.214
#> 1500m        0.190         0.976
#> 
#>                  ML3   ML2   ML1
#> SS loadings    2.350 1.791 1.264
#> Proportion Var 0.235 0.179 0.126
#> Cumulative Var 0.235 0.414 0.541
fa_oblimin$loadings
#> 
#> Loadings:
#>             ML3    ML2    ML1   
#> 100m         0.713 -0.143 -0.143
#> Long.jump   -0.795              
#> Shot.put            0.952       
#> High.jump   -0.148  0.488       
#> 400m         0.782         0.290
#> 110m.hurdle  0.693              
#> Discus      -0.101  0.597  0.190
#> Pole.vault  -0.187         0.284
#> Javeline     0.110  0.455 -0.259
#> 1500m                      0.995
#> 
#>                  ML3   ML2   ML1
#> SS loadings    2.311 1.742 1.297
#> Proportion Var 0.231 0.174 0.130
#> Cumulative Var 0.231 0.405 0.535

If we compare the loadings of the two rotations, they are very similar, and they produce similar interpretations for the factors.

fa_oblimin$Phi
#>            ML3        ML2       ML1
#> ML3  1.0000000 -0.3002647 0.1286287
#> ML2 -0.3002647  1.0000000 0.1421010
#> ML1  0.1286287  0.1421010 1.0000000

If we look at the correlations between the factors, we see that these are very small to moderate (below 0.3). Therefore, we can assume that an oblique rotation is not necessary in this case and the varimax transformation suffices.

Giving an explanation to the latent factors:

Lastly, comparing the results to PCA analysis:


# PCA
pca_res <- prcomp(scale(decathlon_cont))

summary(pca_res)
#> Importance of components:
#>                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
#> Standard deviation     1.8088 1.3180 1.1853 1.0280 0.82751 0.77412 0.67174
#> Proportion of Variance 0.3272 0.1737 0.1405 0.1057 0.06848 0.05993 0.04512
#> Cumulative Proportion  0.3272 0.5009 0.6414 0.7471 0.81556 0.87548 0.92061
#>                            PC8     PC9    PC10
#> Standard deviation     0.62998 0.46348 0.42688
#> Proportion of Variance 0.03969 0.02148 0.01822
#> Cumulative Proportion  0.96030 0.98178 1.00000
pca_res$rotation
#>                     PC1        PC2         PC3         PC4         PC5
#> 100m        -0.42829627  0.1419891 -0.15557953  0.03678703  0.36518741
#> Long.jump    0.41015201 -0.2620794  0.15372674 -0.09901016  0.04432336
#> Shot.put     0.34414444  0.4539470 -0.01972378 -0.18539458  0.13431954
#> High.jump    0.31619436  0.2657761 -0.21894349  0.13189684  0.67121760
#> 400m        -0.37571570  0.4320460  0.11091758 -0.02850297 -0.10597034
#> 110m.hurdle -0.41255442  0.1735910 -0.07815576 -0.28290068  0.19857266
#> Discus       0.30542571  0.4600244  0.03623770  0.25259074 -0.12667770
#> Pole.vault   0.02783081 -0.1368411  0.58361717 -0.53649480  0.39873734
#> Javeline     0.15319802  0.2405071 -0.32874217 -0.69285498 -0.36873120
#> 1500m       -0.03210733  0.3598049  0.65987362  0.15669648 -0.18557094
#>                     PC6         PC7         PC8         PC9        PC10
#> 100m        -0.29607739 -0.38177608  0.46160211 -0.10475771  0.42428269
#> Long.jump    0.30612478 -0.62769317 -0.02101165 -0.48266910  0.08104448
#> Shot.put    -0.30547299  0.30972542 -0.31393005 -0.42729075  0.39028424
#> High.jump    0.46777116  0.09145002  0.12509166  0.24366054 -0.10642724
#> 400m         0.33252178  0.12442114  0.21339819 -0.55212939 -0.41399532
#> 110m.hurdle  0.09963776 -0.35733030 -0.71111429  0.15013429 -0.09086448
#> Discus      -0.44937288 -0.42988982  0.03838986  0.15480715 -0.44916580
#> Pole.vault  -0.26166458  0.09796019  0.17803824  0.08297769 -0.27645138
#> Javeline     0.16320268 -0.10674519  0.29614206  0.24732691  0.08777340
#> 1500m        0.29826888 -0.08362898  0.01371744  0.30773397  0.42923132

Interpretation:

Key insight: EFA factors align better with real-world concepts