Exploratory Factor Analysis (EFA) with categorical variables

dr. Annelies Agten

2026-04-27

Exploratory Factor Analysis (EFA) is a multivariate technique used to identify underlying latent variables, or factors, that explain patterns of correlations among observed variables. Unlike PCA, which focuses on summarizing total variance, EFA assumes that the observed variables are influenced by a smaller number of unobserved constructs, along with some measurement error.

The main goal of EFA is to uncover the underlying structure of a dataset by grouping variables that are highly correlated with each other into common factors. This can help reduce data complexity and provide a more interpretable representation of the relationships between variables.

EFA with categorical variables

In the previous discussion, we implicitly assumed that all observed variables are continuous and that their relationships can be adequately described using Pearson correlations. However, in many practical applications, especially in the social sciences, psychology, and survey research, variables are often ordinal (e.g., Likert scales) or binary (e.g., yes/no responses).

Pearson correlation relies on several assumptions:

When these assumptions are violated, as is the case with categorical or ordinal data, the resulting correlations may be biased, which in turn can lead to distorted factor structures in EFA.

Alternative correlation measures

To address this issue, alternative correlation coefficients are used that better reflect the nature of categorical data:

These methods assume that the observed categorical responses arise from latent continuous variables that have been discretized.

Implications for EFA

The overall procedure of EFA remains unchanged:

However, the input correlation matrix is different. Instead of using Pearson correlations, we use a matrix based on polychoric or tetrachoric correlations.

This often results in:

Categorical EFA Analysis in R

We will start by loading the most fundamental package: psych. If you never used the package psych, then you need to first install it:

install.packages('psych')

Then you can load the library:

library('psych')
#> Warning: package 'psych' was built under R version 4.5.3

We are now working with a categorical dataset (sdq: Strengths and Difficulties Questionnaire) and we are planning to run EFA with 30 items from 425 elementary school children. Since the items are rated on a 1–6 Likert scale, they are ordinal categorical variables and this affects how EFA should be conducted.

sdq <- readxl::read_excel("sdq.xls")

describe(sdq)
#>        vars   n mean   sd median trimmed  mad min max range  skew kurtosis   se
#> sdq1      1 425 4.17 1.80      5    4.33 1.48   1   6     5 -0.57    -1.06 0.09
#> sdq2      2 425 4.83 1.37      5    5.07 1.48   1   6     5 -1.20     0.70 0.07
#> sdq3      3 425 4.82 1.42      5    5.03 1.48   1   6     5 -0.96    -0.20 0.07
#> sdq4r     4 425 3.82 1.92      4    3.90 2.97   1   6     5 -0.24    -1.49 0.09
#> sdq5r     5 425 5.32 1.27      6    5.63 0.00   1   6     5 -1.95     2.85 0.06
#> sdq6      6 425 4.00 1.81      4    4.13 2.97   1   6     5 -0.45    -1.18 0.09
#> sdq7      7 425 4.05 1.80      4    4.19 2.97   1   6     5 -0.48    -1.14 0.09
#> sdq8      8 425 4.77 1.38      5    5.01 1.48   1   6     5 -1.21     0.79 0.07
#> sdq9r     9 425 4.86 1.58      6    5.16 0.00   1   6     5 -1.27     0.37 0.08
#> sdq10r   10 425 4.79 1.58      6    5.07 0.00   1   6     5 -1.16     0.12 0.08
#> sdq11r   11 425 5.02 1.29      6    5.26 0.00   1   6     5 -1.39     1.26 0.06
#> sdq12    12 425 4.44 1.48      5    4.62 1.48   1   6     5 -0.83    -0.23 0.07
#> sdq13    13 425 3.85 1.87      4    3.93 2.97   1   6     5 -0.31    -1.35 0.09
#> sdq14    14 425 4.77 1.38      5    5.02 1.48   1   6     5 -1.28     0.99 0.07
#> sdq15r   15 425 5.14 1.46      6    5.48 0.00   1   6     5 -1.65     1.48 0.07
#> sdq16r   16 425 4.30 1.80      5    4.49 1.48   1   6     5 -0.67    -0.97 0.09
#> sdq17r   17 425 4.96 1.59      6    5.30 0.00   1   6     5 -1.44     0.76 0.08
#> sdq18    18 425 4.19 1.79      5    4.36 1.48   1   6     5 -0.57    -1.04 0.09
#> sdq19    19 425 4.24 1.69      5    4.42 1.48   1   6     5 -0.62    -0.88 0.08
#> sdq20    20 425 4.81 1.45      5    5.08 1.48   1   6     5 -1.26     0.72 0.07
#> sdq21r   21 425 5.04 1.53      6    5.38 0.00   1   6     5 -1.52     1.06 0.07
#> sdq22r   22 425 4.87 1.66      6    5.20 0.00   1   6     5 -1.33     0.40 0.08
#> sdq23r   23 425 5.13 1.47      6    5.47 0.00   1   6     5 -1.66     1.58 0.07
#> sdq24    24 425 4.46 1.55      5    4.68 1.48   1   6     5 -0.87    -0.24 0.08
#> sdq25    25 425 3.97 1.78      4    4.08 1.48   1   6     5 -0.47    -1.16 0.09
#> sdq26    26 425 5.52 1.01      6    5.77 0.00   1   6     5 -2.76     8.14 0.05
#> sdq27r   27 425 3.82 1.88      4    3.90 2.97   1   6     5 -0.17    -1.50 0.09
#> sdq28r   28 425 4.68 1.76      6    4.98 0.00   1   6     5 -1.03    -0.41 0.09
#> sdq29r   29 425 5.58 0.96      6    5.84 0.00   1   6     5 -2.59     6.26 0.05
#> sdq30    30 425 4.71 1.45      5    4.96 1.48   1   6     5 -1.11     0.36 0.07

Given that the SDQ items are measured on a 6-point Likert scale, we treat them as ordinal categorical variables. Using Pearson correlations in this case would violate assumptions of linearity and normality. Therefore, we compute a polychoric correlation matrix, which estimates the correlation between unobserved continuous latent variables that underlie the ordinal responses (Flora & Curran, 2004; Floral & Flake, 2017; Lloret et al., 2017; McCoach et al., 2013).


# Convert the data to a polychoric correlation matrix for EFA analysis
out <- polychoric(sdq)

# Extract the polychoric matrix 
sdqpoly <- out$rho

corPlot(sdqpoly, diag = FALSE, upper = FALSE, numbers = TRUE)

Most correlations are positive and moderate to strong. Very few correlations are close to 0.


KMO(sdqpoly)
#> Kaiser-Meyer-Olkin factor adequacy
#> Call: KMO(r = sdqpoly)
#> Overall MSA =  0.86
#> MSA for each item = 
#>   sdq1   sdq2   sdq3  sdq4r  sdq5r   sdq6   sdq7   sdq8  sdq9r sdq10r sdq11r 
#>   0.89   0.87   0.86   0.87   0.86   0.76   0.86   0.85   0.85   0.88   0.89 
#>  sdq12  sdq13  sdq14 sdq15r sdq16r sdq17r  sdq18  sdq19  sdq20 sdq21r sdq22r 
#>   0.87   0.89   0.84   0.86   0.88   0.82   0.82   0.90   0.86   0.81   0.88 
#> sdq23r  sdq24  sdq25  sdq26 sdq27r sdq28r sdq29r  sdq30 
#>   0.83   0.90   0.93   0.84   0.78   0.90   0.84   0.87

cortest.bartlett(sdqpoly, n = 425)
#> $chisq
#> [1] 8106.004
#> 
#> $p.value
#> [1] 0
#> 
#> $df
#> [1] 435

KMO test and Bartlett’s test of sphericity results confirm that the data are suitable for EFA based on the polychoric correlations.

The rest of the factor analysis can be carried out analogously


fa.parallel(sdqpoly, n.obs = 425, fa = "pc", n.iter=500)

#> Parallel analysis suggests that the number of factors =  NA  and the number of components =  4
VSS(sdqpoly, n = 10, fm="pc", rotate = "promax", n.obs = 425, plot = F)
#> 
#> Very Simple Structure
#> Call: vss(x = x, n = n, rotate = rotate, diagonal = diagonal, fm = fm, 
#>     n.obs = n.obs, plot = plot, title = title, use = use, cor = cor)
#> VSS complexity 1 achieves a maximimum of 0.85  with  4  factors
#> VSS complexity 2 achieves a maximimum of 0.92  with  3  factors
#> 
#> The Velicer MAP achieves a minimum of 0.02  with  4  factors
#> Warning in min(x$vss.stats[, "BIC"], na.rm = TRUE): no non-missing arguments to
#> min; returning Inf
#> 
#> BIC achieves a minimum of  Inf  with    factors
#> Warning in min(x$vss.stats[, "SABIC"], na.rm = TRUE): no non-missing arguments
#> to min; returning Inf
#> Sample Size adjusted BIC achieves a minimum of  Inf  with    factors
#> 
#> Statistics by number of factors 
#>    vss1 vss2   map dof chisq prob sqresid  fit RMSEA BIC SABIC complex eChisq
#> 1  0.57 0.00 0.069   0    NA   NA    48.6 0.57    NA  NA    NA      NA     NA
#> 2  0.76 0.85 0.030   0    NA   NA    16.9 0.85    NA  NA    NA      NA     NA
#> 3  0.85 0.90 0.019   0    NA   NA    10.6 0.91    NA  NA    NA      NA     NA
#> 4  0.85 0.89 0.018   0    NA   NA     7.8 0.93    NA  NA    NA      NA     NA
#> 5  0.83 0.90 0.018   0    NA   NA     6.3 0.94    NA  NA    NA      NA     NA
#> 6  0.81 0.91 0.019   0    NA   NA     5.2 0.95    NA  NA    NA      NA     NA
#> 7  0.80 0.87 0.020   0    NA   NA     4.4 0.96    NA  NA    NA      NA     NA
#> 8  0.80 0.89 0.022   0    NA   NA     3.8 0.97    NA  NA    NA      NA     NA
#> 9  0.71 0.88 0.025   0    NA   NA     3.2 0.97    NA  NA    NA      NA     NA
#> 10 0.80 0.92 0.028   0    NA   NA     2.7 0.98    NA  NA    NA      NA     NA
#>    SRMR eCRMS eBIC
#> 1    NA    NA   NA
#> 2    NA    NA   NA
#> 3    NA    NA   NA
#> 4    NA    NA   NA
#> 5    NA    NA   NA
#> 6    NA    NA   NA
#> 7    NA    NA   NA
#> 8    NA    NA   NA
#> 9    NA    NA   NA
#> 10   NA    NA   NA

fa1 <- fa(sdq, cor = "poly", nfactors = 4, rotate = "promax", fm = "uls")
#> Loading required namespace: GPArotation

fa1
#> Factor Analysis using method =  uls
#> Call: fa(r = sdq, nfactors = 4, rotate = "promax", fm = "uls", cor = "poly")
#> Standardized loadings (pattern matrix) based upon correlation matrix
#>         ULS1  ULS3  ULS2  ULS4   h2   u2 com
#> sdq1    0.88 -0.01  0.00 -0.13 0.75 0.25 1.0
#> sdq2   -0.10  0.75 -0.02 -0.01 0.50 0.50 1.0
#> sdq3   -0.02  0.10  0.46  0.36 0.52 0.48 2.0
#> sdq4r   0.71 -0.09 -0.14  0.17 0.54 0.46 1.2
#> sdq5r   0.02  0.61 -0.16  0.20 0.46 0.54 1.4
#> sdq6    0.07  0.00  0.66 -0.13 0.39 0.61 1.1
#> sdq7    0.73  0.03  0.21 -0.30 0.56 0.44 1.5
#> sdq8   -0.07  0.64  0.19 -0.10 0.45 0.55 1.3
#> sdq9r   0.04 -0.02  0.16  0.61 0.45 0.55 1.2
#> sdq10r  0.59  0.09 -0.19  0.29 0.56 0.44 1.8
#> sdq11r  0.07  0.61 -0.01  0.15 0.51 0.49 1.1
#> sdq12  -0.05 -0.01  0.70  0.08 0.53 0.47 1.0
#> sdq13   0.83  0.07  0.17 -0.22 0.72 0.28 1.2
#> sdq14  -0.01  0.73  0.11 -0.19 0.50 0.50 1.2
#> sdq15r -0.04  0.05  0.02  0.71 0.53 0.47 1.0
#> sdq16r  0.80 -0.09 -0.15  0.21 0.70 0.30 1.2
#> sdq17r -0.10  0.76 -0.23  0.12 0.53 0.47 1.3
#> sdq18  -0.04 -0.13  0.90  0.04 0.78 0.22 1.0
#> sdq19   0.83 -0.10 -0.07  0.10 0.68 0.32 1.1
#> sdq20   0.06  0.47  0.14 -0.05 0.30 0.70 1.2
#> sdq21r  0.06 -0.06  0.33  0.45 0.38 0.62 1.9
#> sdq22r  0.70  0.07  0.12  0.01 0.54 0.46 1.1
#> sdq23r  0.00  0.77 -0.09 -0.04 0.52 0.48 1.0
#> sdq24  -0.04 -0.05  0.67  0.27 0.61 0.39 1.3
#> sdq25   0.75 -0.05 -0.02  0.01 0.55 0.45 1.0
#> sdq26   0.07  0.42  0.06  0.00 0.23 0.77 1.1
#> sdq27r -0.01  0.12  0.22  0.31 0.25 0.75 2.1
#> sdq28r  0.82  0.11 -0.04 -0.05 0.73 0.27 1.0
#> sdq29r  0.04  0.82 -0.11  0.11 0.73 0.27 1.1
#> sdq30  -0.06  0.00  0.71  0.19 0.63 0.37 1.2
#> 
#>                       ULS1 ULS3 ULS2 ULS4
#> SS loadings           5.96 4.56 3.51 2.10
#> Proportion Var        0.20 0.15 0.12 0.07
#> Cumulative Var        0.20 0.35 0.47 0.54
#> Proportion Explained  0.37 0.28 0.22 0.13
#> Cumulative Proportion 0.37 0.65 0.87 1.00
#> 
#>  With factor correlations of 
#>       ULS1 ULS3  ULS2 ULS4
#> ULS1  1.00 0.35 -0.06 0.15
#> ULS3  0.35 1.00  0.36 0.44
#> ULS2 -0.06 0.36  1.00 0.31
#> ULS4  0.15 0.44  0.31 1.00
#> 
#> Mean item complexity =  1.3
#> Test of the hypothesis that 4 factors are sufficient.
#> 
#> df null model =  435  with the objective function =  19.62 with Chi Square =  8106
#> df of  the model are 321  and the objective function was  3.88 
#> 
#> The root mean square of the residuals (RMSR) is  0.04 
#> The df corrected root mean square of the residuals is  0.05 
#> 
#> The harmonic n.obs is  425 with the empirical chi square  361.54  with prob <  0.059 
#> The total n.obs was  425  with Likelihood Chi Square =  1593.75  with prob <  1.7e-167 
#> 
#> Tucker Lewis Index of factoring reliability =  0.774
#> RMSEA index =  0.097  and the 90 % confidence intervals are  0.092 0.101
#> BIC =  -348.97
#> Fit based upon off diagonal values = 0.98
#> Measures of factor score adequacy             
#>                                                   ULS1 ULS3 ULS2 ULS4
#> Correlation of (regression) scores with factors   0.96 0.89 0.92 0.84
#> Multiple R square of scores with factors          0.92 0.80 0.85 0.70
#> Minimum correlation of possible factor scores     0.84 0.60 0.71 0.40

fa2 <- fa(sdq, cor = "poly", nfactors = 3, rotate = "promax", fm = "uls")
fa2
#> Factor Analysis using method =  uls
#> Call: fa(r = sdq, nfactors = 3, rotate = "promax", fm = "uls", cor = "poly")
#> Standardized loadings (pattern matrix) based upon correlation matrix
#>         ULS1  ULS3  ULS2   h2   u2 com
#> sdq1    0.88 -0.07 -0.06 0.74 0.26 1.0
#> sdq2   -0.10  0.75 -0.04 0.49 0.51 1.0
#> sdq3   -0.02  0.16  0.62 0.50 0.50 1.1
#> sdq4r   0.72 -0.02 -0.07 0.51 0.49 1.0
#> sdq5r   0.03  0.70 -0.10 0.45 0.55 1.0
#> sdq6    0.04 -0.13  0.60 0.30 0.70 1.1
#> sdq7    0.70 -0.10  0.08 0.45 0.55 1.1
#> sdq8   -0.08  0.57  0.15 0.40 0.60 1.2
#> sdq9r   0.06  0.17  0.40 0.27 0.73 1.4
#> sdq10r  0.61  0.21 -0.07 0.49 0.51 1.3
#> sdq11r  0.08  0.67  0.04 0.51 0.49 1.0
#> sdq12  -0.06 -0.09  0.75 0.51 0.49 1.0
#> sdq13   0.81 -0.03  0.07 0.65 0.35 1.0
#> sdq14  -0.02  0.63  0.03 0.41 0.59 1.0
#> sdq15r -0.01  0.27  0.30 0.24 0.76 2.0
#> sdq16r  0.81 -0.01 -0.07 0.65 0.35 1.0
#> sdq17r -0.10  0.84 -0.20 0.53 0.47 1.1
#> sdq18  -0.06 -0.24  0.94 0.73 0.27 1.1
#> sdq19   0.84 -0.07 -0.03 0.67 0.33 1.0
#> sdq20   0.05  0.43  0.12 0.28 0.72 1.2
#> sdq21r  0.06  0.05  0.52 0.30 0.70 1.0
#> sdq22r  0.70  0.05  0.12 0.54 0.46 1.1
#> sdq23r  0.00  0.77 -0.13 0.51 0.49 1.1
#> sdq24  -0.04 -0.06  0.81 0.62 0.38 1.0
#> sdq25   0.76 -0.05 -0.01 0.55 0.45 1.0
#> sdq26   0.07  0.41  0.06 0.22 0.78 1.1
#> sdq27r  0.00  0.19  0.35 0.22 0.78 1.5
#> sdq28r  0.82  0.09 -0.06 0.74 0.26 1.0
#> sdq29r  0.04  0.88 -0.09 0.73 0.27 1.0
#> sdq30  -0.06 -0.05  0.82 0.64 0.36 1.0
#> 
#>                       ULS1 ULS3 ULS2
#> SS loadings           5.96 4.77 4.14
#> Proportion Var        0.20 0.16 0.14
#> Cumulative Var        0.20 0.36 0.50
#> Proportion Explained  0.40 0.32 0.28
#> Cumulative Proportion 0.40 0.72 1.00
#> 
#>  With factor correlations of 
#>      ULS1 ULS3 ULS2
#> ULS1 1.00 0.36 0.02
#> ULS3 0.36 1.00 0.49
#> ULS2 0.02 0.49 1.00
#> 
#> Mean item complexity =  1.1
#> Test of the hypothesis that 3 factors are sufficient.
#> 
#> df null model =  435  with the objective function =  19.62 with Chi Square =  8106
#> df of  the model are 348  and the objective function was  4.87 
#> 
#> The root mean square of the residuals (RMSR) is  0.06 
#> The df corrected root mean square of the residuals is  0.06 
#> 
#> The harmonic n.obs is  425 with the empirical chi square  617.22  with prob <  2.7e-17 
#> The total n.obs was  425  with Likelihood Chi Square =  2004.15  with prob <  3e-230 
#> 
#> Tucker Lewis Index of factoring reliability =  0.729
#> RMSEA index =  0.106  and the 90 % confidence intervals are  0.101 0.11
#> BIC =  -101.98
#> Fit based upon off diagonal values = 0.97
#> Measures of factor score adequacy             
#>                                                   ULS1 ULS3 ULS2
#> Correlation of (regression) scores with factors   0.96 0.92 0.95
#> Multiple R square of scores with factors          0.91 0.85 0.91
#> Minimum correlation of possible factor scores     0.83 0.70 0.82