Exploratory Factor Analysis (EFA) is a multivariate technique used to identify underlying latent variables, or factors, that explain patterns of correlations among observed variables. Unlike PCA, which focuses on summarizing total variance, EFA assumes that the observed variables are influenced by a smaller number of unobserved constructs, along with some measurement error.
The main goal of EFA is to uncover the underlying structure of a dataset by grouping variables that are highly correlated with each other into common factors. This can help reduce data complexity and provide a more interpretable representation of the relationships between variables.
In the previous discussion, we implicitly assumed that all observed variables are continuous and that their relationships can be adequately described using Pearson correlations. However, in many practical applications, especially in the social sciences, psychology, and survey research, variables are often ordinal (e.g., Likert scales) or binary (e.g., yes/no responses).
Pearson correlation relies on several assumptions:
When these assumptions are violated, as is the case with categorical or ordinal data, the resulting correlations may be biased, which in turn can lead to distorted factor structures in EFA.
To address this issue, alternative correlation coefficients are used that better reflect the nature of categorical data:
Ordinal variables
Polychoric correlations: These estimate the relationship between two unobserved continuous variables underlying the observed ordinal responses.
Binary variables
Tetrachoric correlations: A special case of polychoric correlation for dichotomous variables.
These methods assume that the observed categorical responses arise from latent continuous variables that have been discretized.
The overall procedure of EFA remains unchanged:
However, the input correlation matrix is different. Instead of using Pearson correlations, we use a matrix based on polychoric or tetrachoric correlations.
This often results in:
We will start by loading the most fundamental package:
psych. If you never used the package psych, then you need
to first install it:
Then you can load the library:
We are now working with a categorical dataset (sdq: Strengths and Difficulties Questionnaire) and we are planning to run EFA with 30 items from 425 elementary school children. Since the items are rated on a 1–6 Likert scale, they are ordinal categorical variables and this affects how EFA should be conducted.
sdq <- readxl::read_excel("sdq.xls")
describe(sdq)
#> vars n mean sd median trimmed mad min max range skew kurtosis se
#> sdq1 1 425 4.17 1.80 5 4.33 1.48 1 6 5 -0.57 -1.06 0.09
#> sdq2 2 425 4.83 1.37 5 5.07 1.48 1 6 5 -1.20 0.70 0.07
#> sdq3 3 425 4.82 1.42 5 5.03 1.48 1 6 5 -0.96 -0.20 0.07
#> sdq4r 4 425 3.82 1.92 4 3.90 2.97 1 6 5 -0.24 -1.49 0.09
#> sdq5r 5 425 5.32 1.27 6 5.63 0.00 1 6 5 -1.95 2.85 0.06
#> sdq6 6 425 4.00 1.81 4 4.13 2.97 1 6 5 -0.45 -1.18 0.09
#> sdq7 7 425 4.05 1.80 4 4.19 2.97 1 6 5 -0.48 -1.14 0.09
#> sdq8 8 425 4.77 1.38 5 5.01 1.48 1 6 5 -1.21 0.79 0.07
#> sdq9r 9 425 4.86 1.58 6 5.16 0.00 1 6 5 -1.27 0.37 0.08
#> sdq10r 10 425 4.79 1.58 6 5.07 0.00 1 6 5 -1.16 0.12 0.08
#> sdq11r 11 425 5.02 1.29 6 5.26 0.00 1 6 5 -1.39 1.26 0.06
#> sdq12 12 425 4.44 1.48 5 4.62 1.48 1 6 5 -0.83 -0.23 0.07
#> sdq13 13 425 3.85 1.87 4 3.93 2.97 1 6 5 -0.31 -1.35 0.09
#> sdq14 14 425 4.77 1.38 5 5.02 1.48 1 6 5 -1.28 0.99 0.07
#> sdq15r 15 425 5.14 1.46 6 5.48 0.00 1 6 5 -1.65 1.48 0.07
#> sdq16r 16 425 4.30 1.80 5 4.49 1.48 1 6 5 -0.67 -0.97 0.09
#> sdq17r 17 425 4.96 1.59 6 5.30 0.00 1 6 5 -1.44 0.76 0.08
#> sdq18 18 425 4.19 1.79 5 4.36 1.48 1 6 5 -0.57 -1.04 0.09
#> sdq19 19 425 4.24 1.69 5 4.42 1.48 1 6 5 -0.62 -0.88 0.08
#> sdq20 20 425 4.81 1.45 5 5.08 1.48 1 6 5 -1.26 0.72 0.07
#> sdq21r 21 425 5.04 1.53 6 5.38 0.00 1 6 5 -1.52 1.06 0.07
#> sdq22r 22 425 4.87 1.66 6 5.20 0.00 1 6 5 -1.33 0.40 0.08
#> sdq23r 23 425 5.13 1.47 6 5.47 0.00 1 6 5 -1.66 1.58 0.07
#> sdq24 24 425 4.46 1.55 5 4.68 1.48 1 6 5 -0.87 -0.24 0.08
#> sdq25 25 425 3.97 1.78 4 4.08 1.48 1 6 5 -0.47 -1.16 0.09
#> sdq26 26 425 5.52 1.01 6 5.77 0.00 1 6 5 -2.76 8.14 0.05
#> sdq27r 27 425 3.82 1.88 4 3.90 2.97 1 6 5 -0.17 -1.50 0.09
#> sdq28r 28 425 4.68 1.76 6 4.98 0.00 1 6 5 -1.03 -0.41 0.09
#> sdq29r 29 425 5.58 0.96 6 5.84 0.00 1 6 5 -2.59 6.26 0.05
#> sdq30 30 425 4.71 1.45 5 4.96 1.48 1 6 5 -1.11 0.36 0.07Given that the SDQ items are measured on a 6-point Likert scale, we treat them as ordinal categorical variables. Using Pearson correlations in this case would violate assumptions of linearity and normality. Therefore, we compute a polychoric correlation matrix, which estimates the correlation between unobserved continuous latent variables that underlie the ordinal responses (Flora & Curran, 2004; Floral & Flake, 2017; Lloret et al., 2017; McCoach et al., 2013).
# Convert the data to a polychoric correlation matrix for EFA analysis
out <- polychoric(sdq)
# Extract the polychoric matrix
sdqpoly <- out$rho
corPlot(sdqpoly, diag = FALSE, upper = FALSE, numbers = TRUE)Most correlations are positive and moderate to strong. Very few correlations are close to 0.
KMO(sdqpoly)
#> Kaiser-Meyer-Olkin factor adequacy
#> Call: KMO(r = sdqpoly)
#> Overall MSA = 0.86
#> MSA for each item =
#> sdq1 sdq2 sdq3 sdq4r sdq5r sdq6 sdq7 sdq8 sdq9r sdq10r sdq11r
#> 0.89 0.87 0.86 0.87 0.86 0.76 0.86 0.85 0.85 0.88 0.89
#> sdq12 sdq13 sdq14 sdq15r sdq16r sdq17r sdq18 sdq19 sdq20 sdq21r sdq22r
#> 0.87 0.89 0.84 0.86 0.88 0.82 0.82 0.90 0.86 0.81 0.88
#> sdq23r sdq24 sdq25 sdq26 sdq27r sdq28r sdq29r sdq30
#> 0.83 0.90 0.93 0.84 0.78 0.90 0.84 0.87
cortest.bartlett(sdqpoly, n = 425)
#> $chisq
#> [1] 8106.004
#>
#> $p.value
#> [1] 0
#>
#> $df
#> [1] 435KMO test and Bartlett’s test of sphericity results confirm that the data are suitable for EFA based on the polychoric correlations.
The rest of the factor analysis can be carried out analogously
#> Parallel analysis suggests that the number of factors = NA and the number of components = 4
VSS(sdqpoly, n = 10, fm="pc", rotate = "promax", n.obs = 425, plot = F)
#>
#> Very Simple Structure
#> Call: vss(x = x, n = n, rotate = rotate, diagonal = diagonal, fm = fm,
#> n.obs = n.obs, plot = plot, title = title, use = use, cor = cor)
#> VSS complexity 1 achieves a maximimum of 0.85 with 4 factors
#> VSS complexity 2 achieves a maximimum of 0.92 with 3 factors
#>
#> The Velicer MAP achieves a minimum of 0.02 with 4 factors
#> Warning in min(x$vss.stats[, "BIC"], na.rm = TRUE): no non-missing arguments to
#> min; returning Inf
#>
#> BIC achieves a minimum of Inf with factors
#> Warning in min(x$vss.stats[, "SABIC"], na.rm = TRUE): no non-missing arguments
#> to min; returning Inf
#> Sample Size adjusted BIC achieves a minimum of Inf with factors
#>
#> Statistics by number of factors
#> vss1 vss2 map dof chisq prob sqresid fit RMSEA BIC SABIC complex eChisq
#> 1 0.57 0.00 0.069 0 NA NA 48.6 0.57 NA NA NA NA NA
#> 2 0.76 0.85 0.030 0 NA NA 16.9 0.85 NA NA NA NA NA
#> 3 0.85 0.90 0.019 0 NA NA 10.6 0.91 NA NA NA NA NA
#> 4 0.85 0.89 0.018 0 NA NA 7.8 0.93 NA NA NA NA NA
#> 5 0.83 0.90 0.018 0 NA NA 6.3 0.94 NA NA NA NA NA
#> 6 0.81 0.91 0.019 0 NA NA 5.2 0.95 NA NA NA NA NA
#> 7 0.80 0.87 0.020 0 NA NA 4.4 0.96 NA NA NA NA NA
#> 8 0.80 0.89 0.022 0 NA NA 3.8 0.97 NA NA NA NA NA
#> 9 0.71 0.88 0.025 0 NA NA 3.2 0.97 NA NA NA NA NA
#> 10 0.80 0.92 0.028 0 NA NA 2.7 0.98 NA NA NA NA NA
#> SRMR eCRMS eBIC
#> 1 NA NA NA
#> 2 NA NA NA
#> 3 NA NA NA
#> 4 NA NA NA
#> 5 NA NA NA
#> 6 NA NA NA
#> 7 NA NA NA
#> 8 NA NA NA
#> 9 NA NA NA
#> 10 NA NA NA
fa1 <- fa(sdq, cor = "poly", nfactors = 4, rotate = "promax", fm = "uls")
#> Loading required namespace: GPArotation
fa1
#> Factor Analysis using method = uls
#> Call: fa(r = sdq, nfactors = 4, rotate = "promax", fm = "uls", cor = "poly")
#> Standardized loadings (pattern matrix) based upon correlation matrix
#> ULS1 ULS3 ULS2 ULS4 h2 u2 com
#> sdq1 0.88 -0.01 0.00 -0.13 0.75 0.25 1.0
#> sdq2 -0.10 0.75 -0.02 -0.01 0.50 0.50 1.0
#> sdq3 -0.02 0.10 0.46 0.36 0.52 0.48 2.0
#> sdq4r 0.71 -0.09 -0.14 0.17 0.54 0.46 1.2
#> sdq5r 0.02 0.61 -0.16 0.20 0.46 0.54 1.4
#> sdq6 0.07 0.00 0.66 -0.13 0.39 0.61 1.1
#> sdq7 0.73 0.03 0.21 -0.30 0.56 0.44 1.5
#> sdq8 -0.07 0.64 0.19 -0.10 0.45 0.55 1.3
#> sdq9r 0.04 -0.02 0.16 0.61 0.45 0.55 1.2
#> sdq10r 0.59 0.09 -0.19 0.29 0.56 0.44 1.8
#> sdq11r 0.07 0.61 -0.01 0.15 0.51 0.49 1.1
#> sdq12 -0.05 -0.01 0.70 0.08 0.53 0.47 1.0
#> sdq13 0.83 0.07 0.17 -0.22 0.72 0.28 1.2
#> sdq14 -0.01 0.73 0.11 -0.19 0.50 0.50 1.2
#> sdq15r -0.04 0.05 0.02 0.71 0.53 0.47 1.0
#> sdq16r 0.80 -0.09 -0.15 0.21 0.70 0.30 1.2
#> sdq17r -0.10 0.76 -0.23 0.12 0.53 0.47 1.3
#> sdq18 -0.04 -0.13 0.90 0.04 0.78 0.22 1.0
#> sdq19 0.83 -0.10 -0.07 0.10 0.68 0.32 1.1
#> sdq20 0.06 0.47 0.14 -0.05 0.30 0.70 1.2
#> sdq21r 0.06 -0.06 0.33 0.45 0.38 0.62 1.9
#> sdq22r 0.70 0.07 0.12 0.01 0.54 0.46 1.1
#> sdq23r 0.00 0.77 -0.09 -0.04 0.52 0.48 1.0
#> sdq24 -0.04 -0.05 0.67 0.27 0.61 0.39 1.3
#> sdq25 0.75 -0.05 -0.02 0.01 0.55 0.45 1.0
#> sdq26 0.07 0.42 0.06 0.00 0.23 0.77 1.1
#> sdq27r -0.01 0.12 0.22 0.31 0.25 0.75 2.1
#> sdq28r 0.82 0.11 -0.04 -0.05 0.73 0.27 1.0
#> sdq29r 0.04 0.82 -0.11 0.11 0.73 0.27 1.1
#> sdq30 -0.06 0.00 0.71 0.19 0.63 0.37 1.2
#>
#> ULS1 ULS3 ULS2 ULS4
#> SS loadings 5.96 4.56 3.51 2.10
#> Proportion Var 0.20 0.15 0.12 0.07
#> Cumulative Var 0.20 0.35 0.47 0.54
#> Proportion Explained 0.37 0.28 0.22 0.13
#> Cumulative Proportion 0.37 0.65 0.87 1.00
#>
#> With factor correlations of
#> ULS1 ULS3 ULS2 ULS4
#> ULS1 1.00 0.35 -0.06 0.15
#> ULS3 0.35 1.00 0.36 0.44
#> ULS2 -0.06 0.36 1.00 0.31
#> ULS4 0.15 0.44 0.31 1.00
#>
#> Mean item complexity = 1.3
#> Test of the hypothesis that 4 factors are sufficient.
#>
#> df null model = 435 with the objective function = 19.62 with Chi Square = 8106
#> df of the model are 321 and the objective function was 3.88
#>
#> The root mean square of the residuals (RMSR) is 0.04
#> The df corrected root mean square of the residuals is 0.05
#>
#> The harmonic n.obs is 425 with the empirical chi square 361.54 with prob < 0.059
#> The total n.obs was 425 with Likelihood Chi Square = 1593.75 with prob < 1.7e-167
#>
#> Tucker Lewis Index of factoring reliability = 0.774
#> RMSEA index = 0.097 and the 90 % confidence intervals are 0.092 0.101
#> BIC = -348.97
#> Fit based upon off diagonal values = 0.98
#> Measures of factor score adequacy
#> ULS1 ULS3 ULS2 ULS4
#> Correlation of (regression) scores with factors 0.96 0.89 0.92 0.84
#> Multiple R square of scores with factors 0.92 0.80 0.85 0.70
#> Minimum correlation of possible factor scores 0.84 0.60 0.71 0.40
fa2 <- fa(sdq, cor = "poly", nfactors = 3, rotate = "promax", fm = "uls")
fa2
#> Factor Analysis using method = uls
#> Call: fa(r = sdq, nfactors = 3, rotate = "promax", fm = "uls", cor = "poly")
#> Standardized loadings (pattern matrix) based upon correlation matrix
#> ULS1 ULS3 ULS2 h2 u2 com
#> sdq1 0.88 -0.07 -0.06 0.74 0.26 1.0
#> sdq2 -0.10 0.75 -0.04 0.49 0.51 1.0
#> sdq3 -0.02 0.16 0.62 0.50 0.50 1.1
#> sdq4r 0.72 -0.02 -0.07 0.51 0.49 1.0
#> sdq5r 0.03 0.70 -0.10 0.45 0.55 1.0
#> sdq6 0.04 -0.13 0.60 0.30 0.70 1.1
#> sdq7 0.70 -0.10 0.08 0.45 0.55 1.1
#> sdq8 -0.08 0.57 0.15 0.40 0.60 1.2
#> sdq9r 0.06 0.17 0.40 0.27 0.73 1.4
#> sdq10r 0.61 0.21 -0.07 0.49 0.51 1.3
#> sdq11r 0.08 0.67 0.04 0.51 0.49 1.0
#> sdq12 -0.06 -0.09 0.75 0.51 0.49 1.0
#> sdq13 0.81 -0.03 0.07 0.65 0.35 1.0
#> sdq14 -0.02 0.63 0.03 0.41 0.59 1.0
#> sdq15r -0.01 0.27 0.30 0.24 0.76 2.0
#> sdq16r 0.81 -0.01 -0.07 0.65 0.35 1.0
#> sdq17r -0.10 0.84 -0.20 0.53 0.47 1.1
#> sdq18 -0.06 -0.24 0.94 0.73 0.27 1.1
#> sdq19 0.84 -0.07 -0.03 0.67 0.33 1.0
#> sdq20 0.05 0.43 0.12 0.28 0.72 1.2
#> sdq21r 0.06 0.05 0.52 0.30 0.70 1.0
#> sdq22r 0.70 0.05 0.12 0.54 0.46 1.1
#> sdq23r 0.00 0.77 -0.13 0.51 0.49 1.1
#> sdq24 -0.04 -0.06 0.81 0.62 0.38 1.0
#> sdq25 0.76 -0.05 -0.01 0.55 0.45 1.0
#> sdq26 0.07 0.41 0.06 0.22 0.78 1.1
#> sdq27r 0.00 0.19 0.35 0.22 0.78 1.5
#> sdq28r 0.82 0.09 -0.06 0.74 0.26 1.0
#> sdq29r 0.04 0.88 -0.09 0.73 0.27 1.0
#> sdq30 -0.06 -0.05 0.82 0.64 0.36 1.0
#>
#> ULS1 ULS3 ULS2
#> SS loadings 5.96 4.77 4.14
#> Proportion Var 0.20 0.16 0.14
#> Cumulative Var 0.20 0.36 0.50
#> Proportion Explained 0.40 0.32 0.28
#> Cumulative Proportion 0.40 0.72 1.00
#>
#> With factor correlations of
#> ULS1 ULS3 ULS2
#> ULS1 1.00 0.36 0.02
#> ULS3 0.36 1.00 0.49
#> ULS2 0.02 0.49 1.00
#>
#> Mean item complexity = 1.1
#> Test of the hypothesis that 3 factors are sufficient.
#>
#> df null model = 435 with the objective function = 19.62 with Chi Square = 8106
#> df of the model are 348 and the objective function was 4.87
#>
#> The root mean square of the residuals (RMSR) is 0.06
#> The df corrected root mean square of the residuals is 0.06
#>
#> The harmonic n.obs is 425 with the empirical chi square 617.22 with prob < 2.7e-17
#> The total n.obs was 425 with Likelihood Chi Square = 2004.15 with prob < 3e-230
#>
#> Tucker Lewis Index of factoring reliability = 0.729
#> RMSEA index = 0.106 and the 90 % confidence intervals are 0.101 0.11
#> BIC = -101.98
#> Fit based upon off diagonal values = 0.97
#> Measures of factor score adequacy
#> ULS1 ULS3 ULS2
#> Correlation of (regression) scores with factors 0.96 0.92 0.95
#> Multiple R square of scores with factors 0.91 0.85 0.91
#> Minimum correlation of possible factor scores 0.83 0.70 0.82