Exploratory Factor Analysis (EFA) is a data-driven method for identifying the latent factor structure underlying a set of observed variables without imposing a theoretically specified structure in advance. The aim of my first-year project is to identify the latent constructs that enable someone to undergo volitional personal transformation. Here I apply EFA to 20 items from the project's pilot data (coded_columns.csv): 8 mental health items (monthly perceived stress and isolation, weekly affective symptoms), 2 items on implicit theories of personal change (kp_1, kp_2), and 10 personality items (personality_1 through personality_10). The goal is to identify how many meaningful factors underlie these items and which items cluster together empirically.

# install.packages(c("lavaan", "semPlot", "semTools", "psych", "tidyverse"))
library(lavaan)     # cfa(), summary(), modindices()
# library(semPlot)    # semPaths() for path diagrams
library(semTools)   # reliability()
library(psych)      # fa(), describe()
library(tidyverse)  # wrangling + ggplot2 (also attaches dplyr)

Step 1: Select data for factor analysis

dat <- read.csv("coded_columns.csv") |>
  subset(select = -c(close_1, ls_14))

Step 2: Extract a set of factors sequentially using a set of optimization criteria

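Factors are extracted with principal axis factoring (fm = "pa" in psych::fa()). As a first look at dimensionality, the eigenvalues of the full (unreduced) correlation matrix are used to evaluate how much variance each successive component accounts for; they sum to the number of items (20).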

eigenvalues <- eigen(cor(dat))$values
round(eigenvalues, 3)
##  [1] 7.283 2.332 1.411 1.322 1.235 1.015 0.960 0.773 0.659 0.570 0.464 0.374
## [13] 0.353 0.274 0.249 0.220 0.169 0.140 0.109 0.088
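
Because the eigenvalues of a correlation matrix sum to the number of items, each one can be read as a share of total variance. A minimal sketch using the rounded values printed above (`ev`, `prop`, and `cum_prop` are illustrative names, not objects from the original script):

```r
# Eigenvalues copied from the printed output above (rounded to 3 decimals)
ev <- c(7.283, 2.332, 1.411, 1.322, 1.235, 1.015, 0.960, 0.773, 0.659, 0.570,
        0.464, 0.374, 0.353, 0.274, 0.249, 0.220, 0.169, 0.140, 0.109, 0.088)

prop     <- ev / sum(ev)   # share of total variance per component
cum_prop <- cumsum(prop)   # running total

# First six components, i.e. those retained by the Kaiser rule
round(data.frame(eigenvalue = ev, prop = prop, cum_prop = cum_prop)[1:6, ], 3)
```

By this calculation the first eigenvalue accounts for about 36% of the total variance and the first six together for about 73%. These are principal-component shares from the unreduced correlation matrix, so they run higher than the 64.8% reported for the six-factor principal axis solution.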

Step 3: Select a smaller number of common factors for ease in interpretation

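The scree plot and the Kaiser rule (eigenvalue > 1) are used to select the number of factors to retain. Six eigenvalues exceed 1, although the sixth does so only barely (1.015). The elbow in the scree plot appears after factor 2, suggesting that each remaining factor accounts for relatively little additional variance. Both candidate solutions (6 factors and 2 factors) are therefore examined below.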

# Scree plot
plot(eigenvalues, type = "b", pch = 19,
     xlab = "Factor", ylab = "Eigenvalue",
     main = "Scree Plot")
abline(h = 1, lty = 2, col = "red")

# Kaiser rule
n_factors <- sum(eigenvalues > 1)
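
The Kaiser rule is often reported to over-extract when eigenvalues sit just above 1, as the sixth one does here. Parallel analysis is a common cross-check that was not part of this analysis; the sketch below is a base-R version of Horn's procedure on principal-component eigenvalues, run on simulated toy data rather than coded_columns.csv. The function name `parallel_analysis` and the toy data are hypothetical.

```r
# Horn's parallel analysis: retain components whose observed eigenvalue
# exceeds the chosen quantile of eigenvalues from random normal data.
parallel_analysis <- function(x, n_iter = 200, quant = 0.95, seed = 1) {
  set.seed(seed)
  n <- nrow(x)
  p <- ncol(x)
  obs <- eigen(cor(x), only.values = TRUE)$values
  # Each column of `sim` holds the eigenvalues of one random n x p data set
  sim <- replicate(n_iter,
                   eigen(cor(matrix(rnorm(n * p), n, p)),
                         only.values = TRUE)$values)
  thresh <- apply(sim, 1, quantile, probs = quant)
  # Count components up to the first one that fails the threshold
  keep <- obs > thresh
  n_retain <- if (all(keep)) p else which(!keep)[1] - 1
  list(observed = obs, threshold = thresh, n_retain = n_retain)
}

# Toy data (hypothetical): two independent factors, four items each
set.seed(42)
f <- matrix(rnorm(200 * 2), 200, 2)
toy <- cbind(f[, 1] + matrix(rnorm(200 * 4, sd = 0.7), 200, 4),
             f[, 2] + matrix(rnorm(200 * 4, sd = 0.7), 200, 4))

pa <- parallel_analysis(toy)
pa$n_retain
```

Calling `parallel_analysis(dat)` on the real data would give a retention count to set beside the six factors from the Kaiser rule and the two suggested by the scree plot. `psych::fa.parallel()` provides a packaged version.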

Step 4: Rotate selected factors towards an interpretable solution

Three rotations (varimax, promax, and target/Procrustes) are compared for the 6-factor solution, and two (varimax and promax) for the 2-factor solution, to see which produces the most interpretable structure.

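The target (Procrustes) solution is clearly the poorest: mean item complexity is 2.5, meaning most items cross-load on multiple factors. Two caveats apply to this comparison. The target matrix is only a placeholder (it fills three of the six columns and assigns all ten personality items to a single factor), and the printed solution has a dominant first factor with steadily decreasing SS loadings (6.975, 2.056, 1.235, ...), which is the pattern of an unrotated extraction, so the requested rotation may not have been applied as intended.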
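
To make the idea of a target rotation concrete, here is a minimal base-R sketch of orthogonal Procrustes rotation, which rotates a loading matrix as close as possible to a 0/1 target. It is independent of psych and is not the routine used for the output in this report; `procrustes_rotate` and all matrices below are hypothetical.

```r
# Orthogonal Procrustes: find the rotation R (with R'R = I) that minimises
# the squared distance between L %*% R and the target, via an SVD.
procrustes_rotate <- function(L, target) {
  s   <- svd(crossprod(L, target))   # t(L) %*% target = U D V'
  rot <- s$u %*% t(s$v)              # optimal orthogonal rotation
  list(loadings = L %*% rot, rotmat = rot)
}

# Hypothetical clean two-factor pattern, deliberately spun by 30 degrees
clean <- rbind(c(0.8, 0), c(0.7, 0), c(0.6, 0),
               c(0, 0.8), c(0, 0.7), c(0, 0.6))
theta <- pi / 6
spin  <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), 2, 2)
L_unrot <- clean %*% spin

# Target: 1 where a loading is expected, 0 elsewhere
target_toy <- (clean != 0) * 1

fit <- procrustes_rotate(L_unrot, target_toy)
round(fit$loadings, 3)   # recovers the clean pattern
```

The quality of the result depends entirely on the target: a target that specifies every factor gives the rotation something to aim at in each column.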

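Varimax (orthogonal) produces a reasonably simple structure in which the six factors together explain 64.8% of total variance. PA1 captures mental health symptom burden, with negative cross-loadings from the neuroticism personality items (personality_4, personality_9); PA2 captures the extraversion personality items (personality_1, personality_6; reverse-scored); PA3 captures implicit theories of change (kp_1, kp_2); and PA4 through PA6 capture the remaining personality clusters (PA6: personality_4 and personality_9; PA5: personality_10, personality_5, and personality_7; PA4: personality_3 and personality_8). personality_2 does not load above .30 on any factor.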

Promax (oblique) yields a similar factor structure. The factor correlation matrix shows that PA1 and PA6 are strongly negatively correlated (r = −.62), and several other pairs correlate above .40 in absolute value (for example PA5 and PA4, r = .52). The factors are therefore not independent, and an oblique rotation is more appropriate.
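
Under an oblique rotation the printed loadings are pattern coefficients (regression weights), so a value above 1, such as 1.003 for mental_week_2, is not improper by itself. Item-factor correlations come from the structure matrix, which is the pattern matrix multiplied by the factor correlation matrix. A small sketch with made-up numbers (`P` and `Phi` are hypothetical; with the fitted object one would start from `efa_promax$loadings` and `efa_promax$Phi`):

```r
# Hypothetical pattern matrix: 4 items on 2 correlated factors
P <- rbind(c( 1.02, 0.15),
           c( 0.75, 0.00),
           c( 0.05, 0.80),
           c(-0.10, 0.70))
Phi <- matrix(c(1, -0.6, -0.6, 1), 2, 2)   # factor correlation matrix

S  <- P %*% Phi                  # structure matrix: item-factor correlations
h2 <- rowSums((P %*% Phi) * P)   # communalities: diagonal of P Phi P'

round(S, 3)
round(h2, 3)
```

In this toy case the first item has a pattern loading above 1 but a communality below 1, because the two factors are negatively correlated.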

All four interpretable TIPI subscales separate more cleanly under promax, and item complexity is lower (mean = 1.4 vs. 1.7 for varimax), meaning items load more exclusively on single factors. Promax is therefore the preferred six-factor solution. Overall fit is identical across rotations, because rotation does not change the extracted factor space, and it is marginal: RMSEA = 0.091 (90% CI 0.069 to 0.115; the conventional threshold for acceptable fit is < .08) and TLI = 0.831 (below the .95 threshold). In addition, kp_1 has a communality above 1 (h2 = 1.127, u2 = −0.127), a Heywood case. Together these results suggest that 6 factors may not be the optimal solution for these data.
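
The `com` column in the psych output is described in the package documentation as Hofmann's complexity index. A short sketch of that index on hypothetical loadings (`item_complexity` is an illustrative helper, not a psych function):

```r
# Hofmann's complexity for one item's loadings: (sum a^2)^2 / sum a^4.
# 1 = the item loads on a single factor; k = equal loadings on k factors.
item_complexity <- function(a) sum(a^2)^2 / sum(a^4)

item_complexity(c(0.80, 0.00, 0.00))   # pure single-factor item
item_complexity(c(0.70, 0.35, 0.00))   # moderate cross-loading
item_complexity(c(0.50, 0.50, 0.50))   # evenly split across three factors
```

A mean complexity of 1.4 therefore sits much closer to simple structure than 2.5 does. The index uses all of an item's loadings, including those hidden by the `cut` argument when printing.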

# Orthogonal
efa_varimax <- fa(dat, nfactors = n_factors, fm = "pa", rotate = "varimax")
print(efa_varimax, cut = 0.3, sort = TRUE, digits = 3)
## Factor Analysis using method =  pa
## Call: fa(r = dat, nfactors = n_factors, rotate = "varimax", fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                item    PA1    PA2    PA3    PA6    PA5    PA4    h2      u2
## mental_week_2     4  0.865                                    0.800  0.2000
## mental_month_2    2  0.750                                    0.725  0.2753
## mental_week_6     8  0.735                                    0.639  0.3609
## mental_week_4     6  0.716                                    0.579  0.4207
## mental_week_5     7  0.710 -0.433                             0.732  0.2678
## mental_month_1    1  0.686               -0.487               0.745  0.2549
## mental_week_1     3  0.645               -0.315               0.523  0.4769
## mental_week_3     5  0.476                                    0.392  0.6081
## personality_1    11         0.886                             0.950  0.0502
## personality_6    16         0.738                             0.662  0.3380
## kp_1              9                1.036                      1.127 -0.1270
## kp_2             10                0.755                      0.687  0.3134
## personality_4    14 -0.431                0.752               0.793  0.2072
## personality_9    19 -0.306                0.717               0.636  0.3642
## personality_10   20                              0.925        0.910  0.0902
## personality_5    15                              0.492        0.332  0.6682
## personality_7    17                              0.427        0.238  0.7624
## personality_2    12                                           0.228  0.7722
## personality_3    13                                     0.938 0.947  0.0535
## personality_8    18                                     0.447 0.318  0.6824
##                 com
## mental_week_2  1.14
## mental_month_2 1.63
## mental_week_6  1.39
## mental_week_4  1.27
## mental_week_5  1.85
## mental_month_1 1.99
## mental_week_1  1.49
## mental_week_3  2.50
## personality_1  1.44
## personality_6  1.46
## kp_1           1.10
## kp_2           1.43
## personality_4  1.77
## personality_9  1.48
## personality_10 1.13
## personality_5  1.78
## personality_7  1.62
## personality_2  3.95
## personality_3  1.15
## personality_8  2.24
## 
##                         PA1   PA2   PA3   PA6   PA5   PA4
## SS loadings           4.554 1.896 1.854 1.665 1.647 1.345
## Proportion Var        0.228 0.095 0.093 0.083 0.082 0.067
## Cumulative Var        0.228 0.322 0.415 0.498 0.581 0.648
## Proportion Explained  0.351 0.146 0.143 0.128 0.127 0.104
## Cumulative Proportion 0.351 0.498 0.641 0.769 0.896 1.000
## 
## Mean item complexity =  1.7
## Test of the hypothesis that 6 factors are sufficient.
## 
## df null model =  190  with the objective function =  12.988 with Chi Square =  1188.357
## df of  the model are 85  and the objective function was  1.787 
## 
## The root mean square of the residuals (RMSR) is  0.037 
## The df corrected root mean square of the residuals is  0.055 
## 
## The harmonic n.obs is  100 with the empirical chi square  52.065  with prob <  0.998 
## The total n.obs was  100  with Likelihood Chi Square =  156.377  with prob <  3.89e-06 
## 
## Tucker Lewis Index of factoring reliability =  0.8314
## RMSEA index =  0.0911  and the 90 % confidence intervals are  0.0691 0.1146
## BIC =  -235.062
## Fit based upon off diagonal values = 0.989
# Oblique
efa_promax  <- fa(dat, nfactors = n_factors, fm = "pa", rotate = "promax")
print(efa_promax,  cut = 0.3, sort = TRUE, digits = 3)
## Factor Analysis using method =  pa
## Call: fa(r = dat, nfactors = n_factors, rotate = "promax", fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                item    PA1    PA2    PA3    PA6    PA5    PA4    h2      u2
## mental_week_2     4  1.003                                    0.800  0.2000
## mental_week_4     6  0.789                                    0.579  0.4207
## mental_week_6     8  0.780                                    0.639  0.3609
## mental_month_2    2  0.741                                    0.725  0.2753
## mental_week_5     7  0.700 -0.376                             0.732  0.2678
## mental_week_1     3  0.668                                    0.523  0.4769
## mental_month_1    1  0.567               -0.415               0.745  0.2549
## mental_week_3     5  0.426                                    0.392  0.6081
## personality_1    11         0.930                             0.950  0.0502
## personality_6    16         0.775                             0.662  0.3380
## kp_1              9                1.048                      1.127 -0.1270
## kp_2             10                0.734                      0.687  0.3134
## personality_4    14                       0.823               0.793  0.2072
## personality_9    19                       0.816               0.636  0.3642
## personality_10   20                              1.099        0.910  0.0902
## personality_5    15                              0.490        0.332  0.6682
## personality_7    17                              0.436        0.238  0.7624
## personality_2    12                                           0.228  0.7722
## personality_3    13                                     1.118 0.947  0.0535
## personality_8    18                                     0.466 0.318  0.6824
##                 com
## mental_week_2  1.11
## mental_week_4  1.15
## mental_week_6  1.07
## mental_month_2 1.08
## mental_week_5  1.58
## mental_week_1  1.43
## mental_month_1 1.99
## mental_week_3  1.80
## personality_1  1.04
## personality_6  1.02
## kp_1           1.02
## kp_2           1.10
## personality_4  1.06
## personality_9  1.03
## personality_10 1.30
## personality_5  1.37
## personality_7  1.50
## personality_2  4.13
## personality_3  1.14
## personality_8  1.47
## 
##                         PA1   PA2   PA3   PA6   PA5   PA4
## SS loadings           4.577 1.945 1.795 1.763 1.538 1.342
## Proportion Var        0.229 0.097 0.090 0.088 0.077 0.067
## Cumulative Var        0.229 0.326 0.416 0.504 0.581 0.648
## Proportion Explained  0.353 0.150 0.139 0.136 0.119 0.104
## Cumulative Proportion 0.353 0.503 0.642 0.778 0.896 1.000
## 
##  With factor correlations of 
##        PA1    PA2    PA3    PA6    PA5    PA4
## PA1  1.000 -0.413  0.175 -0.624 -0.323 -0.478
## PA2 -0.413  1.000 -0.257  0.317  0.439  0.381
## PA3  0.175 -0.257  1.000 -0.118 -0.290 -0.238
## PA6 -0.624  0.317 -0.118  1.000  0.367  0.351
## PA5 -0.323  0.439 -0.290  0.367  1.000  0.519
## PA4 -0.478  0.381 -0.238  0.351  0.519  1.000
## 
## Mean item complexity =  1.4
## Test of the hypothesis that 6 factors are sufficient.
## 
## df null model =  190  with the objective function =  12.988 with Chi Square =  1188.357
## df of  the model are 85  and the objective function was  1.787 
## 
## The root mean square of the residuals (RMSR) is  0.037 
## The df corrected root mean square of the residuals is  0.055 
## 
## The harmonic n.obs is  100 with the empirical chi square  52.065  with prob <  0.998 
## The total n.obs was  100  with Likelihood Chi Square =  156.377  with prob <  3.89e-06 
## 
## Tucker Lewis Index of factoring reliability =  0.8314
## RMSEA index =  0.0911  and the 90 % confidence intervals are  0.0691 0.1146
## BIC =  -235.062
## Fit based upon off diagonal values = 0.989
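# Target (Procrustes): requires a target matrix
# Each row = item, each col = factor; 1 = expected loading, 0 = not
# NOTE: psych::fa() documents its target rotations as "TargetQ" and "TargetT";
# check the console for a rotation-not-found message before treating the
# output below as a rotated solution.
target <- matrix(0, nrow = ncol(dat), ncol = n_factors)
# Example: assign first 8 items to F1, next 2 to F2, rest to F3 (F4-F6 left empty)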
target[1:8,  1] <- 1
target[9:10, 2] <- 1
target[11:20, 3] <- 1
efa_procrustes <- fa(dat, nfactors = n_factors, fm = "pa", rotate = "target",
                     Target = target)

print(efa_procrustes, cut = 0.3, sort = TRUE, digits = 3)
## Factor Analysis using method =  pa
## Call: fa(r = dat, nfactors = n_factors, rotate = "target", fm = "pa", 
##     Target = target)
## Standardized loadings (pattern matrix) based upon correlation matrix
##                item    PA1    PA2    PA3    PA4    PA5    PA6    h2      u2
## mental_month_2    2  0.823                                    0.725  0.2753
## mental_week_5     7  0.794                                    0.732  0.2678
## mental_month_1    1  0.777  0.342                             0.745  0.2549
## mental_week_2     4  0.767                              0.322 0.800  0.2000
## mental_week_6     8  0.744                                    0.639  0.3609
## personality_4    14 -0.700                              0.414 0.793  0.2072
## mental_week_4     6  0.679                                    0.579  0.4207
## personality_1    11 -0.634  0.410  0.434 -0.422               0.950  0.0502
## mental_week_1     3  0.597  0.337                             0.523  0.4769
## mental_week_3     5  0.597                                    0.392  0.6081
## personality_9    19 -0.572                              0.438 0.636  0.3642
## personality_6    16 -0.571         0.362 -0.351               0.662  0.3380
## personality_8    18 -0.366                0.355               0.318  0.6824
## personality_2    12 -0.365                                    0.228  0.7722
## kp_2             10  0.461 -0.550  0.386                      0.687  0.3134
## personality_5    15         0.416                             0.332  0.6682
## personality_7    17                                           0.238  0.7624
## kp_1              9  0.468 -0.637  0.666                      1.127 -0.1270
## personality_3    13 -0.464                0.555 -0.532        0.947  0.0535
## personality_10   20 -0.382  0.513         0.336  0.551        0.910  0.0902
##                 com
## mental_month_2 1.14
## mental_week_5  1.34
## mental_month_1 1.47
## mental_week_2  1.75
## mental_week_6  1.32
## personality_4  2.28
## mental_week_4  1.54
## personality_1  3.50
## mental_week_1  1.93
## mental_week_3  1.20
## personality_9  2.72
## personality_6  3.07
## personality_8  2.83
## personality_2  2.44
## kp_2           2.97
## personality_5  2.76
## personality_7  3.84
## kp_1           3.09
## personality_3  3.91
## personality_10 4.14
## 
##                         PA1   PA2   PA3   PA4   PA5   PA6
## SS loadings           6.975 2.056 1.235 0.984 0.943 0.767
## Proportion Var        0.349 0.103 0.062 0.049 0.047 0.038
## Cumulative Var        0.349 0.452 0.513 0.563 0.610 0.648
## Proportion Explained  0.538 0.159 0.095 0.076 0.073 0.059
## Cumulative Proportion 0.538 0.697 0.792 0.868 0.941 1.000
## 
## Mean item complexity =  2.5
## Test of the hypothesis that 6 factors are sufficient.
## 
## df null model =  190  with the objective function =  12.988 with Chi Square =  1188.357
## df of  the model are 85  and the objective function was  1.787 
## 
## The root mean square of the residuals (RMSR) is  0.037 
## The df corrected root mean square of the residuals is  0.055 
## 
## The harmonic n.obs is  100 with the empirical chi square  52.065  with prob <  0.998 
## The total n.obs was  100  with Likelihood Chi Square =  156.377  with prob <  3.89e-06 
## 
## Tucker Lewis Index of factoring reliability =  0.8314
## RMSEA index =  0.0911  and the 90 % confidence intervals are  0.0691 0.1146
## BIC =  -235.062
## Fit based upon off diagonal values = 0.989
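
The three six-factor outputs above share the same h2, u2, RMSR, TLI, RMSEA, and BIC values. That is expected, since rotation re-expresses the same extracted factor space without changing what the model reproduces. A base-R sketch with a hypothetical loading matrix (`L0` is made up; `varimax()` comes from the stats package):

```r
# Hypothetical unrotated loadings: 6 items, 2 factors
L0 <- matrix(c(0.70, 0.60, 0.65, 0.50, 0.55, 0.60,
               0.40, 0.50, 0.30, -0.50, -0.40, -0.45), ncol = 2)

rot   <- varimax(L0)             # orthogonal rotation
L_rot <- unclass(rot$loadings)   # rotated loadings as a plain matrix

h2_before <- rowSums(L0^2)       # communalities before rotation
h2_after  <- rowSums(L_rot^2)    # communalities after rotation

all.equal(h2_before, h2_after)
# The reproduced correlations (off-diagonal part) are unchanged as well
all.equal(L0 %*% t(L0), L_rot %*% t(L_rot), check.attributes = FALSE)
```

Rotation changes how variance is divided among factors (the SS loadings rows differ), not how much total variance is explained (64.8% in all three outputs).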

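When we collapse to two factors based on the scree plot, PA1 captures all the mental health items plus the neuroticism element of the TIPI (personality_4 and personality_9, which load negatively), while PA2 captures the remaining personality items (positive loadings) and the kp items (negative loadings). Under promax the two factors correlate at r = −.53. However, fit is poor (RMSEA = 0.14 and TLI = 0.62), indicating that 2 factors substantially underfit the data. Note that the varimax output is printed with cut = .4 and the promax output with cut = .3, so blank cells are not directly comparable across the two tables.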
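
As a check on the fit numbers, RMSEA can be recomputed from the printed likelihood chi-square, degrees of freedom, and sample size. The expression below is my assumption about how the printed index is obtained; it reproduces both reported values. `rmsea` is an illustrative helper, not a psych function.

```r
# RMSEA from a likelihood chi-square, its df, and the sample size.
# Assumed form: sqrt(max(chisq / (df * n) - 1 / (n - 1), 0))
rmsea <- function(chisq, df, n) {
  sqrt(max(chisq / (df * n) - 1 / (n - 1), 0))
}

rmsea(156.377, 85, 100)    # six-factor model: chi-square and df from the output
rmsea(448.436, 151, 100)   # two-factor model
```

These evaluate to roughly 0.091 and 0.140, matching the printed indices. With n = 100 the 1 / (n - 1) term is large relative to the first term, which is one reason fit indices are unstable in small samples.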

efa_2f_varimax <- fa(dat, nfactors = 2, fm = "pa", rotate = "varimax")
print(efa_2f_varimax, cut = .4, sort = TRUE, digits = 3)
## Factor Analysis using method =  pa
## Call: fa(r = dat, nfactors = 2, rotate = "varimax", fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                item    PA1    PA2    h2    u2  com
## mental_month_1    1  0.851        0.734 0.266 1.03
## mental_week_2     4  0.798        0.657 0.343 1.07
## mental_month_2    2  0.784        0.709 0.291 1.30
## mental_week_5     7  0.742        0.643 0.357 1.33
## mental_week_6     8  0.727        0.585 0.415 1.21
## mental_week_4     6  0.716        0.529 0.471 1.07
## mental_week_1     3  0.699        0.489 0.511 1.00
## personality_4    14 -0.676        0.492 0.508 1.15
## personality_9    19 -0.549        0.326 0.674 1.16
## mental_week_3     5  0.512        0.363 0.637 1.67
## personality_2    12               0.133 0.867 1.60
## kp_2             10        -0.660 0.454 0.546 1.08
## personality_1    11         0.655 0.529 0.471 1.45
## kp_1              9        -0.597 0.376 0.624 1.11
## personality_5    15         0.548 0.301 0.699 1.00
## personality_10   20         0.545 0.306 0.694 1.06
## personality_6    16         0.519 0.385 0.615 1.73
## personality_7    17         0.406 0.166 0.834 1.02
## personality_8    18               0.171 0.829 1.62
## personality_3    13               0.200 0.800 1.95
## 
##                         PA1   PA2
## SS loadings           5.568 2.982
## Proportion Var        0.278 0.149
## Cumulative Var        0.278 0.427
## Proportion Explained  0.651 0.349
## Cumulative Proportion 0.651 1.000
## 
## Mean item complexity =  1.3
## Test of the hypothesis that 2 factors are sufficient.
## 
## df null model =  190  with the objective function =  12.988 with Chi Square =  1188.357
## df of  the model are 151  and the objective function was  4.973 
## 
## The root mean square of the residuals (RMSR) is  0.085 
## The df corrected root mean square of the residuals is  0.096 
## 
## The harmonic n.obs is  100 with the empirical chi square  276.32  with prob <  2.13e-09 
## The total n.obs was  100  with Likelihood Chi Square =  448.436  with prob <  2.92e-31 
## 
## Tucker Lewis Index of factoring reliability =  0.6185
## RMSEA index =  0.14  and the 90 % confidence intervals are  0.1261 0.1562
## BIC =  -246.945
## Fit based upon off diagonal values = 0.942
## Measures of factor score adequacy             
##                                                     PA1   PA2
## Correlation of (regression) scores with factors   0.956 0.890
## Multiple R square of scores with factors          0.915 0.792
## Minimum correlation of possible factor scores     0.829 0.583
efa_2f_promax <- fa(dat, nfactors = 2, fm = "pa", rotate = "promax")
print(efa_2f_promax, cut = .3, sort = TRUE, digits = 3)
## Factor Analysis using method =  pa
## Call: fa(r = dat, nfactors = 2, rotate = "promax", fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                item    PA1    PA2    h2    u2  com
## mental_month_1    1  0.932        0.734 0.266 1.06
## mental_week_2     4  0.857        0.657 0.343 1.02
## mental_week_1     3  0.794        0.489 0.511 1.17
## mental_month_2    2  0.789        0.709 0.291 1.03
## mental_week_4     6  0.769        0.529 0.471 1.02
## mental_week_6     8  0.748        0.585 0.415 1.00
## mental_week_5     7  0.742        0.643 0.357 1.04
## personality_4    14 -0.705        0.492 0.508 1.00
## personality_9    19 -0.571        0.326 0.674 1.00
## mental_week_3     5  0.477        0.363 0.637 1.32
## personality_2    12               0.133 0.867 1.24
## kp_2             10        -0.704 0.454 0.546 1.02
## personality_1    11         0.638 0.529 0.471 1.11
## kp_1              9        -0.629 0.376 0.624 1.00
## personality_5    15         0.614 0.301 0.699 1.12
## personality_10   20         0.587 0.306 0.694 1.03
## personality_6    16         0.477 0.385 0.615 1.40
## personality_7    17         0.448 0.166 0.834 1.08
## personality_8    18         0.336 0.171 0.829 1.26
## personality_3    13               0.200 0.800 1.84
## 
##                         PA1   PA2
## SS loadings           5.803 2.746
## Proportion Var        0.290 0.137
## Cumulative Var        0.290 0.427
## Proportion Explained  0.679 0.321
## Cumulative Proportion 0.679 1.000
## 
##  With factor correlations of 
##        PA1    PA2
## PA1  1.000 -0.529
## PA2 -0.529  1.000
## 
## Mean item complexity =  1.1
## Test of the hypothesis that 2 factors are sufficient.
## 
## df null model =  190  with the objective function =  12.988 with Chi Square =  1188.357
## df of  the model are 151  and the objective function was  4.973 
## 
## The root mean square of the residuals (RMSR) is  0.085 
## The df corrected root mean square of the residuals is  0.096 
## 
## The harmonic n.obs is  100 with the empirical chi square  276.32  with prob <  2.13e-09 
## The total n.obs was  100  with Likelihood Chi Square =  448.436  with prob <  2.92e-31 
## 
## Tucker Lewis Index of factoring reliability =  0.6185
## RMSEA index =  0.14  and the 90 % confidence intervals are  0.1261 0.1562
## BIC =  -246.945
## Fit based upon off diagonal values = 0.942
## Measures of factor score adequacy             
##                                                     PA1   PA2
## Correlation of (regression) scores with factors   0.969 0.914
## Multiple R square of scores with factors          0.939 0.835
## Minimum correlation of possible factor scores     0.879 0.670

Step 5: Estimate factor scores using another set of criteria

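Factor scores are estimated as unit-weighted composites: for each factor in the six-factor promax solution, the z-scored items with |loading| ≥ .40 are averaged. The columns follow the order of the promax loading matrix, so F1 through F6 correspond to PA1, PA2, PA3, PA6, PA5, and PA4. F4 (neuroticism) shows the smallest SD (0.45), while F3 (Implicit Theories of Change) and F2 (Extraversion) show the largest (0.97 and 0.95). Because every item is standardized before averaging, these SDs mainly reflect how strongly the items within each composite intercorrelate rather than how much people differ on the construct. The low SD for F4 probably arises because mental_month_1 (loading −0.415 on PA6) is averaged with personality_4 and personality_9 without being reverse-keyed, so it partly cancels them. Beliefs about personal change and extraversion remain theoretically relevant dimensions for a study on volitional personal transformation, since many participants reported wanting to be more extroverted or more capable of “being alone.” A further analysis would test whether those who score as more introverted report wanting to be more extroverted, and vice versa.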

dat_scaled <- as.data.frame(scale(dat))

loadings_mat <- unclass(efa_promax$loadings)

sum_scores <- data.frame(matrix(NA, nrow = nrow(dat), ncol = n_factors))
colnames(sum_scores) <- paste0("F", 1:n_factors)

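for (f in 1:n_factors) {
  # Items whose absolute promax loading on factor f is at least .40
  items <- which(abs(loadings_mat[, f]) >= 0.4)
  if (length(items) > 0) {
    # drop = FALSE keeps a data frame even if only one item qualifies.
    # Note: items with negative loadings are not reverse-keyed here.
    sum_scores[[paste0("F", f)]] <- rowMeans(dat_scaled[, items, drop = FALSE])
  }
}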

head(sum_scores)
##            F1         F2         F3         F4         F5         F6
## 1 -0.70110935  1.8355714  0.3551533 -0.2739694  0.9157299  1.0536191
## 2 -0.81827727  1.3446088  1.1078309  0.1419515  0.6187849  1.0536191
## 3 -0.71132869 -0.3773407 -0.4340224  0.4616302  0.2601208  0.4039230
## 4  1.68964016 -0.8683034  0.3369043 -0.4736590  0.9157299 -1.5451650
## 5  0.25652853  1.3446088  0.3186552 -0.5835760  0.7363979  0.1069264
## 6 -0.06022594  0.3626834  0.3369043 -0.3084519 -0.4840466 -0.1900702
psych::describe(sum_scores) |> round(2)
##    vars   n mean   sd median trimmed  mad   min  max range  skew kurtosis   se
## F1    1 100    0 0.79  -0.11   -0.04 0.97 -1.29 1.69  2.98  0.33    -1.00 0.08
## F2    2 100    0 0.95  -0.37   -0.07 1.08 -1.11 1.84  2.95  0.49    -1.09 0.10
## F3    3 100    0 0.97  -0.06   -0.05 1.70 -1.20 1.88  3.08  0.19    -1.17 0.10
## F4    4 100    0 0.45   0.08    0.03 0.42 -1.18 0.85  2.03 -0.59    -0.34 0.05
## F5    5 100    0 0.77   0.11    0.07 0.75 -2.30 1.12  3.42 -0.81     0.48 0.08
## F6    6 100    0 0.86   0.08    0.09 0.92 -2.55 1.05  3.60 -0.76     0.11 0.09
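
The scoring loop averages every item whose absolute loading is at least .40, so an item with a negative loading (mental_month_1 loads −0.415 on PA6 in the promax solution) enters its composite unreversed. Below is a sketch of a sign-aware alternative, run on simulated data with a hypothetical loading matrix rather than the fitted object; `unit_weight_scores` is an illustrative name.

```r
# Unit-weighted composites that reverse-key items with negative loadings.
unit_weight_scores <- function(z, loadings, cutoff = 0.40) {
  # Keys: +1 or -1 where |loading| >= cutoff, 0 otherwise
  keys <- sign(loadings) * (abs(loadings) >= cutoff)
  # Sum the keyed z-scores, then divide by the number of keyed items
  scores  <- as.matrix(z) %*% keys
  n_items <- colSums(keys != 0)
  out <- sweep(scores, 2, n_items, "/")
  colnames(out) <- paste0("F", seq_len(ncol(loadings)))
  as.data.frame(out)
}

# Simulated example: three items on one factor, the third negatively keyed
set.seed(1)
f   <- rnorm(500)
raw <- cbind(a =  f + rnorm(500, sd = 0.6),
             b =  f + rnorm(500, sd = 0.6),
             c = -f + rnorm(500, sd = 0.6))
z   <- scale(raw)
lam <- matrix(c(0.85, 0.85, -0.85), ncol = 1)   # hypothetical loadings

sd_naive <- sd(rowMeans(z))                      # unreversed item cancels the others
sd_keyed <- sd(unit_weight_scores(z, lam)$F1)    # sign-aware composite
c(naive = sd_naive, keyed = sd_keyed)
```

With the real data the call would be `unit_weight_scores(dat_scaled, loadings_mat)`. A factor with no items above the cutoff would return NaN in this sketch, so that case needs handling before use.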

Two caveats apply. First, the sample is small (n = 100), so this factor structure may not replicate, and I will need to repeat the analysis with a larger sample. Second, I recognize that a CFA would have been the more obvious choice given that I was using validated scales. I chose an EFA because I had primed participants to think aspirationally, and I was curious whether this would change the structure of the data (which I know a CFA could also have tested).
