Exploratory Factor Analysis (EFA) is a data-driven method for identifying the latent factor structure underlying a set of observed variables without imposing a theoretically specified structure in advance. The aim of my first-year project is to identify the latent constructs that enable someone to undergo volitional personal transformation. Here I apply EFA to 20 items from the project's pilot data (coded_columns.csv), which include measures of mental health (monthly perceived stress and isolation, weekly affective symptoms), implicit theories of personal change (kp_1, kp_2), and personality (personality_1 through personality_10). The goal is to identify how many meaningful factors underlie these items and which items cluster together empirically.
# install.packages(c("lavaan", "semPlot", "semTools", "psych", "tidyverse"))
library(lavaan) # cfa(), summary(), modindices()
# library(semPlot) # semPaths() for path diagrams
library(semTools) # reliability()
library(psych) # fa(), describe()
library(tidyverse) # wrangling + ggplot2 (also attaches dplyr)
dat <- read.csv("coded_columns.csv") |>
subset(select = -c(close_1, ls_14))
Factors are extracted using principal axis factoring (fm = "pa" in psych::fa()). The eigenvalues of the full correlation matrix are used to evaluate how much variance each successive factor accounts for.
eigenvalues <- eigen(cor(dat))$values
round(eigenvalues, 3)
## [1] 7.283 2.332 1.411 1.322 1.235 1.015 0.960 0.773 0.659 0.570 0.464 0.374
## [13] 0.353 0.274 0.249 0.220 0.169 0.140 0.109 0.088
The scree plot and the Kaiser rule (eigenvalue > 1) are used to select the number of factors to retain. Six eigenvalues exceed 1, so the Kaiser rule suggests six factors. The elbow in the scree plot, however, appears after factor 2, suggesting that the remaining factors account for relatively little additional variance. Both a six-factor and a two-factor solution are therefore examined.
# Scree plot
plot(eigenvalues, type = "b", pch = 19,
xlab = "Factor", ylab = "Eigenvalue",
main = "Scree Plot")
abline(h = 1, lty = 2, col = "red")
# Kaiser rule
n_factors <- sum(eigenvalues > 1)
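To make "how much variance each successive factor accounts for" concrete, the sketch below converts the eigenvalues printed above into proportions of total variance. It is only an illustration: the vector `eig` is typed in by hand from the rounded output, and the names `eig`, `prop_var`, and `cum_var` are mine. Because these are eigenvalues of the full correlation matrix, the proportions describe total variance per component and will not match the common-variance figures that fa() reports.

```r
# Eigenvalues copied from the rounded output above (20 items)
eig <- c(7.283, 2.332, 1.411, 1.322, 1.235, 1.015, 0.960, 0.773, 0.659, 0.570,
         0.464, 0.374, 0.353, 0.274, 0.249, 0.220, 0.169, 0.140, 0.109, 0.088)

# For a correlation matrix the eigenvalues sum to the number of items,
# so each eigenvalue divided by that count is a proportion of total variance
prop_var <- eig / length(eig)
cum_var  <- cumsum(prop_var)

# Kaiser rule: count of eigenvalues above 1
sum(eig > 1)

# First six components side by side
data.frame(component  = 1:6,
           eigenvalue = eig[1:6],
           proportion = round(prop_var[1:6], 3),
           cumulative = round(cum_var[1:6], 3))
```

Read this way, the first two components carry roughly 48% of the total variance and the first six roughly 73%, which is one way to see why the scree plot and the Kaiser rule point to different answers.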
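For the six-factor solution, three rotations (varimax, promax, and a Procrustes target rotation) are compared to see which produces the most interpretable structure. For the two-factor solution, varimax and promax are compared.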
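The Procrustes (target) rotation is clearly the poorest solution: mean item complexity is 2.5, meaning most items cross-load on multiple factors. The target matrix used here is only an illustrative placeholder (the eight mental health items on one factor, the two kp items on a second, and all ten personality items on a third), so this result says little about how a theoretically motivated target would perform.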
Varimax (orthogonal) produces a reasonably clean simple structure in which six factors together explain 64.8% of total variance: PA1 captures mental health symptom burden including the neuroticism personality items, PA2 captures personality items (reverse-scored), PA3 captures implicit theories of change (kp_1, kp_2), and PA4–PA6 capture the remaining personality clusters.
Promax (oblique) yields a similar factor structure. The factor correlation matrix shows that PA1 and PA6 are strongly negatively correlated (r = −.62), suggesting that these factors are not truly independent and that an oblique rotation may be more appropriate.
Promax is the preferred solution: all four interpretable TIPI subscales separate more cleanly than under varimax, and mean item complexity is lower (1.4 vs. 1.7), meaning items load more exclusively on single factors. Overall fit is nonetheless marginal, with RMSEA = 0.091 (above the conventional < .08 threshold for acceptable fit) and TLI = 0.831 (below the .95 threshold). Because rotation does not change model fit, these indices are identical for all three six-factor solutions, and they suggest that six factors may not be the optimal solution for these data. The communality for kp_1 also exceeds 1 (h2 = 1.127, u2 = −0.127), a Heywood case that calls for caution when interpreting PA3.
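The item complexity values used in this comparison (the `com` column in the output) are, as far as I understand the psych documentation, Hofmann's complexity index: the squared sum of an item's squared loadings divided by the sum of its fourth-power loadings. The sketch below is a minimal base-R version for illustration. `item_complexity()` is a hypothetical helper, not a psych function, and the loadings are made-up values rather than results from these data.

```r
# Hofmann's complexity for one item's row of loadings:
# (sum of squared loadings)^2 / (sum of fourth-power loadings)
item_complexity <- function(loadings_row) {
  sum(loadings_row^2)^2 / sum(loadings_row^4)
}

# Made-up loadings for three items on three factors
toy_loadings <- rbind(
  pure     = c(0.8, 0.0, 0.0),  # loads on a single factor
  split_2  = c(0.5, 0.5, 0.0),  # split evenly across two factors
  split_3  = c(0.5, 0.5, 0.5)   # split evenly across three factors
)

# One complexity value per item; the mean is the "mean item complexity"
com <- apply(toy_loadings, 1, item_complexity)
com        # 1, 2 and 3 for the three toy items
mean(com)
```

A value near 1 means the item is essentially a marker for one factor, while a value near 2 or 3 means its variance is spread over that many factors, which is why a lower mean favours promax here.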
# Orthogonal
efa_varimax <- fa(dat, nfactors = n_factors, fm = "pa", rotate = "varimax")
print(efa_varimax, cut = 0.3, sort = TRUE, digits = 3)
## Factor Analysis using method = pa
## Call: fa(r = dat, nfactors = n_factors, rotate = "varimax", fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
## item PA1 PA2 PA3 PA6 PA5 PA4 h2 u2
## mental_week_2 4 0.865 0.800 0.2000
## mental_month_2 2 0.750 0.725 0.2753
## mental_week_6 8 0.735 0.639 0.3609
## mental_week_4 6 0.716 0.579 0.4207
## mental_week_5 7 0.710 -0.433 0.732 0.2678
## mental_month_1 1 0.686 -0.487 0.745 0.2549
## mental_week_1 3 0.645 -0.315 0.523 0.4769
## mental_week_3 5 0.476 0.392 0.6081
## personality_1 11 0.886 0.950 0.0502
## personality_6 16 0.738 0.662 0.3380
## kp_1 9 1.036 1.127 -0.1270
## kp_2 10 0.755 0.687 0.3134
## personality_4 14 -0.431 0.752 0.793 0.2072
## personality_9 19 -0.306 0.717 0.636 0.3642
## personality_10 20 0.925 0.910 0.0902
## personality_5 15 0.492 0.332 0.6682
## personality_7 17 0.427 0.238 0.7624
## personality_2 12 0.228 0.7722
## personality_3 13 0.938 0.947 0.0535
## personality_8 18 0.447 0.318 0.6824
## com
## mental_week_2 1.14
## mental_month_2 1.63
## mental_week_6 1.39
## mental_week_4 1.27
## mental_week_5 1.85
## mental_month_1 1.99
## mental_week_1 1.49
## mental_week_3 2.50
## personality_1 1.44
## personality_6 1.46
## kp_1 1.10
## kp_2 1.43
## personality_4 1.77
## personality_9 1.48
## personality_10 1.13
## personality_5 1.78
## personality_7 1.62
## personality_2 3.95
## personality_3 1.15
## personality_8 2.24
##
## PA1 PA2 PA3 PA6 PA5 PA4
## SS loadings 4.554 1.896 1.854 1.665 1.647 1.345
## Proportion Var 0.228 0.095 0.093 0.083 0.082 0.067
## Cumulative Var 0.228 0.322 0.415 0.498 0.581 0.648
## Proportion Explained 0.351 0.146 0.143 0.128 0.127 0.104
## Cumulative Proportion 0.351 0.498 0.641 0.769 0.896 1.000
##
## Mean item complexity = 1.7
## Test of the hypothesis that 6 factors are sufficient.
##
## df null model = 190 with the objective function = 12.988 with Chi Square = 1188.357
## df of the model are 85 and the objective function was 1.787
##
## The root mean square of the residuals (RMSR) is 0.037
## The df corrected root mean square of the residuals is 0.055
##
## The harmonic n.obs is 100 with the empirical chi square 52.065 with prob < 0.998
## The total n.obs was 100 with Likelihood Chi Square = 156.377 with prob < 3.89e-06
##
## Tucker Lewis Index of factoring reliability = 0.8314
## RMSEA index = 0.0911 and the 90 % confidence intervals are 0.0691 0.1146
## BIC = -235.062
## Fit based upon off diagonal values = 0.989
# Oblique
efa_promax <- fa(dat, nfactors = n_factors, fm = "pa", rotate = "promax")
print(efa_promax, cut = 0.3, sort = TRUE, digits = 3)
## Factor Analysis using method = pa
## Call: fa(r = dat, nfactors = n_factors, rotate = "promax", fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
## item PA1 PA2 PA3 PA6 PA5 PA4 h2 u2
## mental_week_2 4 1.003 0.800 0.2000
## mental_week_4 6 0.789 0.579 0.4207
## mental_week_6 8 0.780 0.639 0.3609
## mental_month_2 2 0.741 0.725 0.2753
## mental_week_5 7 0.700 -0.376 0.732 0.2678
## mental_week_1 3 0.668 0.523 0.4769
## mental_month_1 1 0.567 -0.415 0.745 0.2549
## mental_week_3 5 0.426 0.392 0.6081
## personality_1 11 0.930 0.950 0.0502
## personality_6 16 0.775 0.662 0.3380
## kp_1 9 1.048 1.127 -0.1270
## kp_2 10 0.734 0.687 0.3134
## personality_4 14 0.823 0.793 0.2072
## personality_9 19 0.816 0.636 0.3642
## personality_10 20 1.099 0.910 0.0902
## personality_5 15 0.490 0.332 0.6682
## personality_7 17 0.436 0.238 0.7624
## personality_2 12 0.228 0.7722
## personality_3 13 1.118 0.947 0.0535
## personality_8 18 0.466 0.318 0.6824
## com
## mental_week_2 1.11
## mental_week_4 1.15
## mental_week_6 1.07
## mental_month_2 1.08
## mental_week_5 1.58
## mental_week_1 1.43
## mental_month_1 1.99
## mental_week_3 1.80
## personality_1 1.04
## personality_6 1.02
## kp_1 1.02
## kp_2 1.10
## personality_4 1.06
## personality_9 1.03
## personality_10 1.30
## personality_5 1.37
## personality_7 1.50
## personality_2 4.13
## personality_3 1.14
## personality_8 1.47
##
## PA1 PA2 PA3 PA6 PA5 PA4
## SS loadings 4.577 1.945 1.795 1.763 1.538 1.342
## Proportion Var 0.229 0.097 0.090 0.088 0.077 0.067
## Cumulative Var 0.229 0.326 0.416 0.504 0.581 0.648
## Proportion Explained 0.353 0.150 0.139 0.136 0.119 0.104
## Cumulative Proportion 0.353 0.503 0.642 0.778 0.896 1.000
##
## With factor correlations of
## PA1 PA2 PA3 PA6 PA5 PA4
## PA1 1.000 -0.413 0.175 -0.624 -0.323 -0.478
## PA2 -0.413 1.000 -0.257 0.317 0.439 0.381
## PA3 0.175 -0.257 1.000 -0.118 -0.290 -0.238
## PA6 -0.624 0.317 -0.118 1.000 0.367 0.351
## PA5 -0.323 0.439 -0.290 0.367 1.000 0.519
## PA4 -0.478 0.381 -0.238 0.351 0.519 1.000
##
## Mean item complexity = 1.4
## Test of the hypothesis that 6 factors are sufficient.
##
## df null model = 190 with the objective function = 12.988 with Chi Square = 1188.357
## df of the model are 85 and the objective function was 1.787
##
## The root mean square of the residuals (RMSR) is 0.037
## The df corrected root mean square of the residuals is 0.055
##
## The harmonic n.obs is 100 with the empirical chi square 52.065 with prob < 0.998
## The total n.obs was 100 with Likelihood Chi Square = 156.377 with prob < 3.89e-06
##
## Tucker Lewis Index of factoring reliability = 0.8314
## RMSEA index = 0.0911 and the 90 % confidence intervals are 0.0691 0.1146
## BIC = -235.062
## Fit based upon off diagonal values = 0.989
# Target (Procrustes) rotation: requires a target matrix
# Each row = item, each col = factor; 1 = expected loading, 0 = not
target <- matrix(0, nrow = ncol(dat), ncol = n_factors)
# Example: assign first 8 items to F1, next 2 to F2, remaining 10 to F3 (F4-F6 get no target loadings)
target[1:8, 1] <- 1
target[9:10, 2] <- 1
target[11:20, 3] <- 1
efa_procrustes <- fa(dat, nfactors = n_factors, fm = "pa", rotate = "target",
Target = target)
print(efa_procrustes, cut = 0.3, sort = TRUE, digits = 3)
## Factor Analysis using method = pa
## Call: fa(r = dat, nfactors = n_factors, rotate = "target", fm = "pa",
## Target = target)
## Standardized loadings (pattern matrix) based upon correlation matrix
## item PA1 PA2 PA3 PA4 PA5 PA6 h2 u2
## mental_month_2 2 0.823 0.725 0.2753
## mental_week_5 7 0.794 0.732 0.2678
## mental_month_1 1 0.777 0.342 0.745 0.2549
## mental_week_2 4 0.767 0.322 0.800 0.2000
## mental_week_6 8 0.744 0.639 0.3609
## personality_4 14 -0.700 0.414 0.793 0.2072
## mental_week_4 6 0.679 0.579 0.4207
## personality_1 11 -0.634 0.410 0.434 -0.422 0.950 0.0502
## mental_week_1 3 0.597 0.337 0.523 0.4769
## mental_week_3 5 0.597 0.392 0.6081
## personality_9 19 -0.572 0.438 0.636 0.3642
## personality_6 16 -0.571 0.362 -0.351 0.662 0.3380
## personality_8 18 -0.366 0.355 0.318 0.6824
## personality_2 12 -0.365 0.228 0.7722
## kp_2 10 0.461 -0.550 0.386 0.687 0.3134
## personality_5 15 0.416 0.332 0.6682
## personality_7 17 0.238 0.7624
## kp_1 9 0.468 -0.637 0.666 1.127 -0.1270
## personality_3 13 -0.464 0.555 -0.532 0.947 0.0535
## personality_10 20 -0.382 0.513 0.336 0.551 0.910 0.0902
## com
## mental_month_2 1.14
## mental_week_5 1.34
## mental_month_1 1.47
## mental_week_2 1.75
## mental_week_6 1.32
## personality_4 2.28
## mental_week_4 1.54
## personality_1 3.50
## mental_week_1 1.93
## mental_week_3 1.20
## personality_9 2.72
## personality_6 3.07
## personality_8 2.83
## personality_2 2.44
## kp_2 2.97
## personality_5 2.76
## personality_7 3.84
## kp_1 3.09
## personality_3 3.91
## personality_10 4.14
##
## PA1 PA2 PA3 PA4 PA5 PA6
## SS loadings 6.975 2.056 1.235 0.984 0.943 0.767
## Proportion Var 0.349 0.103 0.062 0.049 0.047 0.038
## Cumulative Var 0.349 0.452 0.513 0.563 0.610 0.648
## Proportion Explained 0.538 0.159 0.095 0.076 0.073 0.059
## Cumulative Proportion 0.538 0.697 0.792 0.868 0.941 1.000
##
## Mean item complexity = 2.5
## Test of the hypothesis that 6 factors are sufficient.
##
## df null model = 190 with the objective function = 12.988 with Chi Square = 1188.357
## df of the model are 85 and the objective function was 1.787
##
## The root mean square of the residuals (RMSR) is 0.037
## The df corrected root mean square of the residuals is 0.055
##
## The harmonic n.obs is 100 with the empirical chi square 52.065 with prob < 0.998
## The total n.obs was 100 with Likelihood Chi Square = 156.377 with prob < 3.89e-06
##
## Tucker Lewis Index of factoring reliability = 0.8314
## RMSEA index = 0.0911 and the 90 % confidence intervals are 0.0691 0.1146
## BIC = -235.062
## Fit based upon off diagonal values = 0.989
When the solution is collapsed to two factors, as the scree plot suggests, PA1 captures all the mental health items plus the neuroticism items of the TIPI (personality_4, personality_9), while PA2 captures most of the remaining personality items and the kp items. However, fit is poor (RMSEA = 0.14, TLI = 0.62), indicating that two factors substantially underfit the data.
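As a check on where the RMSEA values come from, the sketch below recomputes the point estimate from the likelihood chi-square, the model degrees of freedom, and the sample size printed in the fa() output. The formula is an assumption on my part in the sense that I have not traced it through the psych source; I use it because it reproduces the printed values for both solutions. `rmsea_point()` is a hypothetical helper name.

```r
# RMSEA point estimate from a model chi-square, its df, and sample size n.
# This form reproduces the values printed by fa() for these two models.
rmsea_point <- function(chisq, df, n) {
  sqrt(max(chisq / (df * n) - 1 / (n - 1), 0))
}

# Inputs copied from the printed output (n = 100)
rmsea_6f <- rmsea_point(chisq = 156.377, df = 85,  n = 100)  # six factors
rmsea_2f <- rmsea_point(chisq = 448.436, df = 151, n = 100)  # two factors

round(c(six_factor = rmsea_6f, two_factor = rmsea_2f), 3)    # about 0.091 and 0.140
```

The comparison shows that dropping from six to two factors roughly triples the chi-square while adding 66 degrees of freedom, which is what pushes RMSEA from marginal to clearly poor.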
efa_2f_varimax <- fa(dat, nfactors = 2, fm = "pa", rotate = "varimax")
print(efa_2f_varimax, cut = .4, sort = TRUE, digits = 3)
## Factor Analysis using method = pa
## Call: fa(r = dat, nfactors = 2, rotate = "varimax", fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
## item PA1 PA2 h2 u2 com
## mental_month_1 1 0.851 0.734 0.266 1.03
## mental_week_2 4 0.798 0.657 0.343 1.07
## mental_month_2 2 0.784 0.709 0.291 1.30
## mental_week_5 7 0.742 0.643 0.357 1.33
## mental_week_6 8 0.727 0.585 0.415 1.21
## mental_week_4 6 0.716 0.529 0.471 1.07
## mental_week_1 3 0.699 0.489 0.511 1.00
## personality_4 14 -0.676 0.492 0.508 1.15
## personality_9 19 -0.549 0.326 0.674 1.16
## mental_week_3 5 0.512 0.363 0.637 1.67
## personality_2 12 0.133 0.867 1.60
## kp_2 10 -0.660 0.454 0.546 1.08
## personality_1 11 0.655 0.529 0.471 1.45
## kp_1 9 -0.597 0.376 0.624 1.11
## personality_5 15 0.548 0.301 0.699 1.00
## personality_10 20 0.545 0.306 0.694 1.06
## personality_6 16 0.519 0.385 0.615 1.73
## personality_7 17 0.406 0.166 0.834 1.02
## personality_8 18 0.171 0.829 1.62
## personality_3 13 0.200 0.800 1.95
##
## PA1 PA2
## SS loadings 5.568 2.982
## Proportion Var 0.278 0.149
## Cumulative Var 0.278 0.427
## Proportion Explained 0.651 0.349
## Cumulative Proportion 0.651 1.000
##
## Mean item complexity = 1.3
## Test of the hypothesis that 2 factors are sufficient.
##
## df null model = 190 with the objective function = 12.988 with Chi Square = 1188.357
## df of the model are 151 and the objective function was 4.973
##
## The root mean square of the residuals (RMSR) is 0.085
## The df corrected root mean square of the residuals is 0.096
##
## The harmonic n.obs is 100 with the empirical chi square 276.32 with prob < 2.13e-09
## The total n.obs was 100 with Likelihood Chi Square = 448.436 with prob < 2.92e-31
##
## Tucker Lewis Index of factoring reliability = 0.6185
## RMSEA index = 0.14 and the 90 % confidence intervals are 0.1261 0.1562
## BIC = -246.945
## Fit based upon off diagonal values = 0.942
## Measures of factor score adequacy
## PA1 PA2
## Correlation of (regression) scores with factors 0.956 0.890
## Multiple R square of scores with factors 0.915 0.792
## Minimum correlation of possible factor scores 0.829 0.583
efa_2f_promax <- fa(dat, nfactors = 2, fm = "pa", rotate = "promax")
print(efa_2f_promax, cut = .3, sort = TRUE, digits = 3)
## Factor Analysis using method = pa
## Call: fa(r = dat, nfactors = 2, rotate = "promax", fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
## item PA1 PA2 h2 u2 com
## mental_month_1 1 0.932 0.734 0.266 1.06
## mental_week_2 4 0.857 0.657 0.343 1.02
## mental_week_1 3 0.794 0.489 0.511 1.17
## mental_month_2 2 0.789 0.709 0.291 1.03
## mental_week_4 6 0.769 0.529 0.471 1.02
## mental_week_6 8 0.748 0.585 0.415 1.00
## mental_week_5 7 0.742 0.643 0.357 1.04
## personality_4 14 -0.705 0.492 0.508 1.00
## personality_9 19 -0.571 0.326 0.674 1.00
## mental_week_3 5 0.477 0.363 0.637 1.32
## personality_2 12 0.133 0.867 1.24
## kp_2 10 -0.704 0.454 0.546 1.02
## personality_1 11 0.638 0.529 0.471 1.11
## kp_1 9 -0.629 0.376 0.624 1.00
## personality_5 15 0.614 0.301 0.699 1.12
## personality_10 20 0.587 0.306 0.694 1.03
## personality_6 16 0.477 0.385 0.615 1.40
## personality_7 17 0.448 0.166 0.834 1.08
## personality_8 18 0.336 0.171 0.829 1.26
## personality_3 13 0.200 0.800 1.84
##
## PA1 PA2
## SS loadings 5.803 2.746
## Proportion Var 0.290 0.137
## Cumulative Var 0.290 0.427
## Proportion Explained 0.679 0.321
## Cumulative Proportion 0.679 1.000
##
## With factor correlations of
## PA1 PA2
## PA1 1.000 -0.529
## PA2 -0.529 1.000
##
## Mean item complexity = 1.1
## Test of the hypothesis that 2 factors are sufficient.
##
## df null model = 190 with the objective function = 12.988 with Chi Square = 1188.357
## df of the model are 151 and the objective function was 4.973
##
## The root mean square of the residuals (RMSR) is 0.085
## The df corrected root mean square of the residuals is 0.096
##
## The harmonic n.obs is 100 with the empirical chi square 276.32 with prob < 2.13e-09
## The total n.obs was 100 with Likelihood Chi Square = 448.436 with prob < 2.92e-31
##
## Tucker Lewis Index of factoring reliability = 0.6185
## RMSEA index = 0.14 and the 90 % confidence intervals are 0.1261 0.1562
## BIC = -246.945
## Fit based upon off diagonal values = 0.942
## Measures of factor score adequacy
## PA1 PA2
## Correlation of (regression) scores with factors 0.969 0.914
## Multiple R square of scores with factors 0.939 0.835
## Minimum correlation of possible factor scores 0.879 0.670
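Factor scores are approximated with unit-weighted composites: for each factor in the six-factor promax solution, the z-scored items with |loading| ≥ .40 are averaged. F3 (implicit theories of change) and F2 (extraversion) show the most variability (SD = 0.97 and 0.95), and F4 (neuroticism) shows the least (SD = 0.45). These SDs need careful reading. Because every item is standardized before averaging, a composite's SD mainly reflects how strongly its items correlate with one another, not how much respondents differ on the raw scale, so the low SD for F4 does not by itself show that people in this sample reported similar levels of neuroticism. It may instead arise because the loop averages negatively loading items without reverse-keying them (for example, mental_month_1 has a loading of −.415 in the promax solution). Beliefs about personal change and extraversion are both theoretically relevant dimensions for a study of volitional personal transformation, since many participants reported wanting to be more extroverted or more capable of “being alone.” A natural follow-up is to test whether those who score as more introverted report wanting to be more extroverted, and vice versa.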
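One caveat when comparing the SDs of these composites: when k z-scored items are averaged, the SD of the average depends only on k and on the mean inter-item correlation, SD = sqrt((1 + (k − 1) · r̄) / k). The sketch below illustrates this with a hypothetical helper, `composite_sd()`, and with simulated data. The correlation values plugged in are illustrative assumptions, not estimates from coded_columns.csv.

```r
# SD of the mean of k standardized items with mean inter-item correlation mean_r
composite_sd <- function(k, mean_r) {
  sqrt((1 + (k - 1) * mean_r) / k)
}

composite_sd(k = 2, mean_r = 0.88)   # two highly correlated items: SD close to 1
composite_sd(k = 3, mean_r = -0.20)  # items that partly cancel out: SD well below 1

# Empirical check on simulated data: the formula matches the observed SD
set.seed(1)
x <- rnorm(200)
y <- 0.8 * x + rnorm(200)
z <- scale(cbind(x, y))              # z-score both simulated items
observed  <- sd(rowMeans(z))
predicted <- composite_sd(k = 2, mean_r = cor(x, y))
all.equal(observed, predicted)
```

Under these assumptions, a composite SD near 0.97 is what two strongly correlated items would produce, and an SD near 0.45 is what three items with a slightly negative average correlation would produce, regardless of how much respondents vary on the original response scale.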
dat_scaled <- as.data.frame(scale(dat))
loadings_mat <- unclass(efa_promax$loadings)
sum_scores <- data.frame(matrix(NA, nrow = nrow(dat), ncol = n_factors))
colnames(sum_scores) <- paste0("F", 1:n_factors)
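for (f in 1:n_factors) {
  # Items whose absolute promax loading on factor f is at least .40
  items <- which(abs(loadings_mat[, f]) >= 0.4)
  if (length(items) > 0) {
    # drop = FALSE keeps a data frame even when only one item qualifies
    # Note: negatively loading items are averaged as-is (not reverse-keyed)
    sum_scores[[paste0("F", f)]] <- rowMeans(dat_scaled[, items, drop = FALSE])
  }
}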
head(sum_scores)
## F1 F2 F3 F4 F5 F6
## 1 -0.70110935 1.8355714 0.3551533 -0.2739694 0.9157299 1.0536191
## 2 -0.81827727 1.3446088 1.1078309 0.1419515 0.6187849 1.0536191
## 3 -0.71132869 -0.3773407 -0.4340224 0.4616302 0.2601208 0.4039230
## 4 1.68964016 -0.8683034 0.3369043 -0.4736590 0.9157299 -1.5451650
## 5 0.25652853 1.3446088 0.3186552 -0.5835760 0.7363979 0.1069264
## 6 -0.06022594 0.3626834 0.3369043 -0.3084519 -0.4840466 -0.1900702
psych::describe(sum_scores) |> round(2)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## F1 1 100 0 0.79 -0.11 -0.04 0.97 -1.29 1.69 2.98 0.33 -1.00 0.08
## F2 2 100 0 0.95 -0.37 -0.07 1.08 -1.11 1.84 2.95 0.49 -1.09 0.10
## F3 3 100 0 0.97 -0.06 -0.05 1.70 -1.20 1.88 3.08 0.19 -1.17 0.10
## F4 4 100 0 0.45 0.08 0.03 0.42 -1.18 0.85 2.03 -0.59 -0.34 0.05
## F5 5 100 0 0.77 0.11 0.07 0.75 -2.30 1.12 3.42 -0.81 0.48 0.08
## F6 6 100 0 0.86 0.08 0.09 0.92 -2.55 1.05 3.60 -0.76 0.11 0.09
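The composites above average every item with |loading| ≥ .40 regardless of the sign of the loading. A possible variant, sketched below, multiplies each item by the sign of its loading before averaging so that negatively loading items are reverse-keyed. This is not the scoring used for the results reported here. `signed_composites()` is a hypothetical helper written for illustration, and the toy data are invented.

```r
# Hypothetical helper: unit-weighted composites that reverse-key
# negatively loading items before averaging
signed_composites <- function(items_z, loadings, cutoff = 0.4) {
  loadings <- as.matrix(loadings)
  scores <- sapply(seq_len(ncol(loadings)), function(f) {
    keep <- which(abs(loadings[, f]) >= cutoff)
    if (length(keep) == 0) return(rep(NA_real_, nrow(items_z)))
    keys <- sign(loadings[keep, f])                       # +1 or -1 per item
    keyed <- sweep(as.matrix(items_z[, keep, drop = FALSE]), 2, keys, "*")
    rowMeans(keyed)
  })
  colnames(scores) <- paste0("F", seq_len(ncol(loadings)))
  as.data.frame(scores)
}

# Invented example: item c mirrors items a and b and loads negatively
toy_z <- data.frame(a = c(-1, 0, 1, 0),
                    b = c(-1, 0, 1, 0),
                    c = c( 1, 0, -1, 0))
toy_loadings <- matrix(c(0.8, 0.7, -0.6), ncol = 1,
                       dimnames = list(c("a", "b", "c"), "PA1"))

naive <- rowMeans(toy_z)                             # c partly cancels a and b
keyed <- signed_composites(toy_z, toy_loadings)$F1   # c is flipped first
rbind(naive = naive, keyed = keyed)
```

With the real data, the equivalent call would presumably be `signed_composites(dat_scaled, loadings_mat)`. Comparing its `describe()` output with the table above would show how much of the low SD on F4 is due to sign cancellation.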