In this practice, we will combine both EFA and CFA on the BigFive Personality dataset (bfi).

setwd("D:/Class Materials & Work/Summer 2020 practice/SEM/EFA_to_CFA")

#Loading required packages
library(parameters)
library(tidyverse)
library(psych)
library(see)
library(lavaan)
library(performance)
library(semPlot)

The first step is to test the dataset for factor analysis suitability with check_factorstructure(). Two existing methods are the Bartlett’s Test of Sphericity and the Kaiser, Meyer, Olkin (KMO) Measure of Sampling Adequacy (MSA).

The former tests whether a matrix is significantly different from an identity matrix. This statistical test for the presence of correlations among variables, providing the statistical probability that the correlation matrix has significant correlations among at least some of variables. As for factor analysis to work, some relationships between variables are needed, thus, a significant Bartlett’s test of sphericity is required to be significant.

The latter method, ranging from 0 to 1, indicates he degree to which each variable in a set is predicted without error by the other variables. A value of 0 indicates that the sum of partial correlations is large relative to the sum correlations, indicating factor analysis is likely to be inappropriate. A KMO value close to 1 indicates that the sum of partial correlations is not large relative to the sum of correlations and so factor analysis should yield distinct and reliable factors.

# Load the data
data <- psych::bfi[, 1:25]  # Select only the 25 first columns corresponding to the items
data <- na.omit(data)  # remove missing values

# Check factor structure
check_factorstructure(data)
## # Is the data suitable for Factor Analysis?
## 
##   - KMO: The Kaiser, Meyer, Olkin (KMO) measure of sampling adequacy suggests that data seems appropriate for factor analysis (KMO = 0.85).
##   - Sphericity: Bartlett's test of sphericity suggests that there is sufficient significant correlation in the data for factor analaysis (Chisq(300) = 18146.07, p < .001).

Next, we would need to split the dataset into two to perform EFA and CFA.

# Establish two sets of indices to split the dataset
N <- nrow(data)
indices <- seq(1, N)
indices_EFA <- sample(indices, floor((.5*N)))
indices_CFA <- indices[!(indices %in% indices_EFA)]

# Use those indices to split the dataset into halves for your EFA and CFA
bfi_EFA <- data[indices_EFA, ]
bfi_CFA <- data[indices_CFA, ]

check_factorstructure(bfi_EFA)
## # Is the data suitable for Factor Analysis?
## 
##   - KMO: The Kaiser, Meyer, Olkin (KMO) measure of sampling adequacy suggests that data seems appropriate for factor analysis (KMO = 0.83).
##   - Sphericity: Bartlett's test of sphericity suggests that there is sufficient significant correlation in the data for factor analaysis (Chisq(300) = 9252.62, p < .001).
check_factorstructure(bfi_CFA)
## # Is the data suitable for Factor Analysis?
## 
##   - KMO: The Kaiser, Meyer, Olkin (KMO) measure of sampling adequacy suggests that data seems appropriate for factor analysis (KMO = 0.85).
##   - Sphericity: Bartlett's test of sphericity suggests that there is sufficient significant correlation in the data for factor analaysis (Chisq(300) = 9189.15, p < .001).

Exploratory Factor Analysis

Now that we are confident that our dataset is appropriate, we will explore a factor structure made of 5 latent variables corresponding to the items’ authors theory of personality.

# Fit an EFA
efa <- psych::fa(data, nfactors = 5) %>% 
  model_parameters(sort = TRUE, threshold = "max")
## Loading required namespace: GPArotation
efa
## # Rotated loadings from Factor Analysis (oblimin-rotation)
## 
## Variable |  MR2 |   MR3 |   MR1 |   MR5 |   MR4 | Complexity | Uniqueness
## -------------------------------------------------------------------------
## N1       | 0.83 |       |       |       |       |       1.07 |       0.32
## N2       | 0.78 |       |       |       |       |       1.03 |       0.39
## N3       | 0.70 |       |       |       |       |       1.08 |       0.46
## N5       | 0.48 |       |       |       |       |       2.00 |       0.65
## N4       | 0.47 |       |       |       |       |       2.33 |       0.49
## C2       |      |  0.67 |       |       |       |       1.18 |       0.55
## C4       |      | -0.64 |       |       |       |       1.13 |       0.52
## C3       |      |  0.57 |       |       |       |       1.10 |       0.68
## C5       |      | -0.56 |       |       |       |       1.41 |       0.56
## C1       |      |  0.55 |       |       |       |       1.20 |       0.65
## E2       |      |       |  0.67 |       |       |       1.08 |       0.45
## E4       |      |       | -0.59 |       |       |       1.52 |       0.46
## E1       |      |       |  0.55 |       |       |       1.22 |       0.65
## E5       |      |       | -0.42 |       |       |       2.68 |       0.59
## E3       |      |       | -0.41 |       |       |       2.65 |       0.56
## A3       |      |       |       |  0.68 |       |       1.06 |       0.46
## A2       |      |       |       |  0.66 |       |       1.03 |       0.54
## A5       |      |       |       |  0.54 |       |       1.48 |       0.53
## A4       |      |       |       |  0.45 |       |       1.74 |       0.70
## A1       |      |       |       | -0.44 |       |       1.88 |       0.80
## O3       |      |       |       |       |  0.62 |       1.16 |       0.53
## O5       |      |       |       |       | -0.54 |       1.21 |       0.70
## O1       |      |       |       |       |  0.52 |       1.10 |       0.68
## O2       |      |       |       |       | -0.47 |       1.68 |       0.73
## O4       |      |       |       |       |  0.36 |       2.65 |       0.75
## 
## The 5 latent factors (oblimin rotation) accounted for 42.36% of the total variance of the original data (MR2 = 10.31%, MR3 = 8.39%, MR1 = 8.83%, MR5 = 8.29%, MR4 = 6.55%).

As we can see, the 25 items nicely spread on the 5 latent factors as the theory suggests. Based on this model, we can now predict back the scores for each individual for these new variables:

predict_result <- predict(efa, names = c("Neuroticism", "Conscientiousness", "Extraversion", "Agreeableness", "Opennness"))

head(predict_result)
##   Neuroticism Conscientiousness Extraversion Agreeableness  Opennness
## 1  -0.2224206       -1.32676913  -0.12782944   -0.85526039 -1.6092005
## 2   0.1617985       -0.57221744  -0.46568248   -0.07242517 -0.1722208
## 3   0.6190719       -0.04294820  -0.14137012   -0.55241463  0.2334112
## 4  -0.1169213       -1.06320191  -0.05806339   -0.09063411 -1.0615165
## 5  -0.1737249       -0.09947351  -0.46000604   -0.71181595 -0.6608569
## 6   0.1934209        1.39949887  -1.09198663    0.38874676  0.6242695

How many factors to retain in Factor Analysis (FA)

When running a factor analysis (FA), one often needs to specify how many components (or latent variables) to retain or to extract. This decision is often supported by some statistical indices and procedures aiming at finding the optimal number of factors, e.g., scree plot from (scree()).

Interestingly, a huge amount of methods exist to statistically address this issue. These methods can sometime contradict with each other in terms of retained factor. As a result, seeking the number that is supported by most methods is a great compromise.

The Method Agreement procedure

The Method Agreement procedure, first implemented in the psycho package, proposes to rely on the consensus of methods, rather than on one method in particular.

This procedure can be used through the n_factors() by providing a dataframe, and the function will run a large number of routines and return the optimal number of factors based on the higher consensus.

n <- parameters::n_factors(data)

n
## # Method Agreement Procedure:
## 
## The choice of 1 dimensions is supported by 3 (13.04%) methods out of 23 (Acceleration factor, TLI, RMSEA).

For more details, a summary table can be obtained

as.data.frame(n)
##    n_Factors              Method              Family
## 1          1 Acceleration factor               Scree
## 2          1                 TLI                 Fit
## 3          1               RMSEA                 Fit
## 4          3                 CNG                 CNG
## 5          4                beta Multiple_regression
## 6          4    VSS complexity 1                 VSS
## 7          5    VSS complexity 2                 VSS
## 8          5       Velicer's MAP        Velicers_MAP
## 9          6 Optimal coordinates               Scree
## 10         6   Parallel analysis               Scree
## 11         6    Kaiser criterion               Scree
## 12         7                   t Multiple_regression
## 13         7                   p Multiple_regression
## 14         7                  R2            Scree_SE
## 15         8            SE Scree            Scree_SE
## 16         8                 BIC                 BIC
## 17         8                 BIC                 Fit
## 18        11      BIC (adjusted)                 BIC
## 19        18                CRMS                 Fit
## 20        22             Bentler             Bentler
## 21        24            Bartlett             Barlett
## 22        24            Anderson             Barlett
## 23        24              Lawley             Barlett
summary(n)
##    n_Factors n_Methods
## 1          1         3
## 2          3         1
## 3          4         2
## 4          5         2
## 5          6         3
## 6          7         3
## 7          8         3
## 8         11         1
## 9         18         1
## 10        22         1
## 11        24         3

Interestingly, most methods also suggest six factors from the model, which is consistent with the HEXACO model of personalities.

A plot can also be obtained from the see package.

plot(n) + theme_modern()

Confirmatory Factor Analysis

We’ve seen above that while an EFA with 5 latent variables works great on our dataset, a structure with 6 latent factors might in fact be more appropriate. This topic can be statistically tested with a CFA to bridge factor analysis with Structural Equation Modelling (SEM).

However, in order to do that cleanly, EFA should be independent from CFA in the sense that the factor structure should be explored on a “training” set, and then tested (or “confirmed”) on a test set.

Partition the data

The data can be easily split into two sets with the data_partition() function, through which we will use 70% of the sample as training and the rest as test. Note that we can also use initial_split() of the tidymodel package.

partitions <- data_partition(data, training_proportion = 0.7)

training <- partitions$training
test <- partitions$test

Create CFA structures out of EFA models

In the next step, we will run two EFA models on the training set, specifying 5 and 6 latent factors respectively, that we will then transform into CFA structures.

#Converting EFA into a lavaan-ready syntax
structure_big5 <- psych::fa(training, nfactors = 5) %>% 
  efa_to_cfa()

structure_big6 <- psych::fa(training, nfactors = 6)  %>% 
  efa_to_cfa()

# Investigate how a model looks
structure_big5
## # Latent variables
## MR2 =~ N1 + N2 + N3 + N4 + N5
## MR1 =~ E1 + E2 + E3 + E4 + E5 + O4
## MR3 =~ C1 + C2 + C3 + C4 + C5
## MR5 =~ A1 + A2 + A3 + A4 + A5
## MR4 =~ O1 + O2 + O3 + O5
structure_big6
## # Latent variables
## MR2 =~ N1 + N2 + N3 + N5
## MR1 =~ E1 + E2 + E4 + E5 + N4 + O4
## MR3 =~ C1 + C2 + C3 + C4 + C5
## MR5 =~ A1 + A2 + A3 + A4 + A5
## MR4 =~ E3 + O1 + O2 + O3 + O5

Fit and Compare models

Next, we will fit both models with lavaan package. After that, a fit measure can be obtained with fitmeasures, but we will efficiently compare both models with compare_performance.

big5 <- lavaan::cfa(structure_big5, data = test)
big6 <- lavaan::cfa(structure_big6, data = test)

fitmeasures(big5, fit.measures = "all", output = "text")
## 
## Model Test User Model:
## 
##   Test statistic                              1460.386
##   Degrees of freedom                               265
##   P-value                                        0.000
## 
## Model Test Baseline Model:
## 
##   Test statistic                              5935.998
##   Degrees of freedom                               300
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    0.788
##   Tucker-Lewis Index (TLI)                       0.760
##   Bentler-Bonett Non-normed Fit Index (NNFI)     0.760
##   Bentler-Bonett Normed Fit Index (NFI)          0.754
##   Parsimony Normed Fit Index (PNFI)              0.666
##   Bollen's Relative Fit Index (RFI)              0.721
##   Bollen's Incremental Fit Index (IFI)           0.789
##   Relative Noncentrality Index (RNI)             0.788
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)             -29918.849
##   Loglikelihood unrestricted model (H1)     -29188.656
##                                                       
##   Akaike (AIC)                               59957.697
##   Bayesian (BIC)                             60233.362
##   Sample-size adjusted Bayesian (BIC)        60042.843
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.079
##   90 Percent confidence interval - lower         0.075
##   90 Percent confidence interval - upper         0.083
##   P-value RMSEA <= 0.05                          0.000
## 
## Standardized Root Mean Square Residual:
## 
##   RMR                                            0.158
##   RMR (No Mean)                                  0.158
##   SRMR                                           0.076
## 
## Other Fit Indices:
## 
##   Hoelter Critical N (CN) alpha = 0.05         153.153
##   Hoelter Critical N (CN) alpha = 0.01         161.917
##                                                       
##   Goodness of Fit Index (GFI)                    0.849
##   Adjusted Goodness of Fit Index (AGFI)          0.815
##   Parsimony Goodness of Fit Index (PGFI)         0.692
##                                                       
##   McDonald Fit Index (MFI)                       0.441
##                                                       
##   Expected Cross-Validation Index (ECVI)         2.162
fitmeasures(big6, fit.measures = "all", output = "text")
## 
## Model Test User Model:
## 
##   Test statistic                              1620.632
##   Degrees of freedom                               265
##   P-value                                        0.000
## 
## Model Test Baseline Model:
## 
##   Test statistic                              5935.998
##   Degrees of freedom                               300
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    0.759
##   Tucker-Lewis Index (TLI)                       0.728
##   Bentler-Bonett Non-normed Fit Index (NNFI)     0.728
##   Bentler-Bonett Normed Fit Index (NFI)          0.727
##   Parsimony Normed Fit Index (PNFI)              0.642
##   Bollen's Relative Fit Index (RFI)              0.691
##   Bollen's Incremental Fit Index (IFI)           0.761
##   Relative Noncentrality Index (RNI)             0.759
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)             -29998.972
##   Loglikelihood unrestricted model (H1)     -29188.656
##                                                       
##   Akaike (AIC)                               60117.944
##   Bayesian (BIC)                             60393.609
##   Sample-size adjusted Bayesian (BIC)        60203.090
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.084
##   90 Percent confidence interval - lower         0.080
##   90 Percent confidence interval - upper         0.088
##   P-value RMSEA <= 0.05                          0.000
## 
## Standardized Root Mean Square Residual:
## 
##   RMR                                            0.183
##   RMR (No Mean)                                  0.183
##   SRMR                                           0.086
## 
## Other Fit Indices:
## 
##   Hoelter Critical N (CN) alpha = 0.05         138.108
##   Hoelter Critical N (CN) alpha = 0.01         146.005
##                                                       
##   Goodness of Fit Index (GFI)                    0.837
##   Adjusted Goodness of Fit Index (AGFI)          0.800
##   Parsimony Goodness of Fit Index (PGFI)         0.683
##                                                       
##   McDonald Fit Index (MFI)                       0.396
##                                                       
##   Expected Cross-Validation Index (ECVI)         2.381
compare<-performance::compare_performance(big5, big6)

rmarkdown::paged_table(compare)

Both Big5 and Big6 models are reliable.