Welcome to this R demo session! Here, I will demonstrate how to use R to conduct exploratory factor analysis (EFA).
#read in data from Excel file
library(readxl)
example<-read_excel("EFA_example_data.xlsx")
#examine a snapshot of the data
head(example)
## # A tibble: 6 × 12
## syn ant waisvoc picvoc raven sparel papfld wmsrec logmem letcom patcom
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 8 10 52 22 0.06 7 5 0.67 27 5.5 10
## 2 5 5 52 20 0.44 15 8 0.67 41 13 20
## 3 3 3 36 9 0.06 2 2 0.67 30 6 10
## 4 10 10 61 23 0.5 8 7 0.61 44 11.5 15
## 5 5 2 48 13 0.5 9 7 0.81 59 11.5 18
## 6 6 5 38 11 0.13 3 4 0.42 20 8 15
## # ℹ 1 more variable: digsym <dbl>
#data summary
summary(example)
## syn ant waisvoc picvoc raven
## Min. : 0.00 Min. : 0.000 Min. :14.00 Min. : 1 Min. :0.0000
## 1st Qu.: 5.00 1st Qu.: 5.000 1st Qu.:45.00 1st Qu.:16 1st Qu.:0.3800
## Median : 8.00 Median : 8.000 Median :54.00 Median :20 Median :0.5000
## Mean : 7.38 Mean : 7.185 Mean :51.27 Mean :19 Mean :0.4855
## 3rd Qu.:10.00 3rd Qu.:10.000 3rd Qu.:59.00 3rd Qu.:23 3rd Qu.:0.6300
## Max. :10.00 Max. :10.000 Max. :68.00 Max. :30 Max. :1.0000
## sparel papfld wmsrec logmem
## Min. : 0.000 Min. : 0.000 Min. :0.2800 Min. :16.00
## 1st Qu.: 5.000 1st Qu.: 5.000 1st Qu.:0.6100 1st Qu.:38.00
## Median : 9.000 Median : 7.000 Median :0.7200 Median :45.00
## Mean : 9.406 Mean : 6.822 Mean :0.7101 Mean :44.75
## 3rd Qu.:13.000 3rd Qu.: 9.000 3rd Qu.:0.8100 3rd Qu.:51.00
## Max. :20.000 Max. :12.000 Max. :0.9700 Max. :72.00
## letcom patcom digsym
## Min. : 5.00 Min. : 0.00 Min. : 8.00
## 1st Qu.: 9.25 1st Qu.:15.00 1st Qu.: 65.00
## Median :11.00 Median :18.00 Median : 76.00
## Mean :10.90 Mean :17.95 Mean : 77.31
## 3rd Qu.:12.50 3rd Qu.:21.00 3rd Qu.: 90.00
## Max. :19.00 Max. :29.00 Max. :129.00
#calculate correlations
cor(example)
## syn ant waisvoc picvoc raven sparel
## syn 1.00000000 0.84776546 0.7476302 0.71965033 0.2671055 0.2984242
## ant 0.84776546 1.00000000 0.7515984 0.68335918 0.3141898 0.3453001
## waisvoc 0.74763024 0.75159844 1.0000000 0.68403034 0.3523038 0.4387302
## picvoc 0.71965033 0.68335918 0.6840303 1.00000000 0.2637889 0.4038272
## raven 0.26710545 0.31418983 0.3523038 0.26378889 1.0000000 0.6725621
## sparel 0.29842419 0.34530012 0.4387302 0.40382718 0.6725621 1.0000000
## papfld 0.21470260 0.25749728 0.3225538 0.26893141 0.6840154 0.7346558
## wmsrec 0.16133075 0.20858126 0.2766254 0.19708991 0.4689090 0.3653197
## logmem 0.37432118 0.32601853 0.4461294 0.38000605 0.4620228 0.4188174
## letcom 0.11256498 0.19492839 0.2119698 0.07619716 0.4599375 0.3289703
## patcom 0.01690756 0.08852634 0.1445336 0.03273032 0.5006182 0.4031702
## digsym 0.08188352 0.15012283 0.2190173 0.07071468 0.5573211 0.3959214
## papfld wmsrec logmem letcom patcom digsym
## syn 0.2147026 0.1613308 0.3743212 0.11256498 0.01690756 0.08188352
## ant 0.2574973 0.2085813 0.3260185 0.19492839 0.08852634 0.15012283
## waisvoc 0.3225538 0.2766254 0.4461294 0.21196981 0.14453364 0.21901734
## picvoc 0.2689314 0.1970899 0.3800060 0.07619716 0.03273032 0.07071468
## raven 0.6840154 0.4689090 0.4620228 0.45993752 0.50061824 0.55732107
## sparel 0.7346558 0.3653197 0.4188174 0.32897035 0.40317021 0.39592136
## papfld 1.0000000 0.4206718 0.3994568 0.28612656 0.40139280 0.44076833
## wmsrec 0.4206718 1.0000000 0.5300479 0.38281658 0.34758926 0.52090513
## logmem 0.3994568 0.5300479 1.0000000 0.27337511 0.30254195 0.33553305
## letcom 0.2861266 0.3828166 0.2733751 1.00000000 0.60690924 0.63523302
## patcom 0.4013928 0.3475893 0.3025420 0.60690924 1.00000000 0.59750526
## digsym 0.4407683 0.5209051 0.3355330 0.63523302 0.59750526 1.00000000
#create scree plot for eigenvalues
library(psych)
fa.parallel(example,fm="pa",fa="fa")
## Parallel analysis suggests that the number of factors = 4 and the number of components = NA
The blue line shows the eigenvalues for the real data.
Note that these are the eigenvalues of the reduced matrix, that is, the correlation matrix with the 1s on the diagonal replaced by estimated communalities.
Based on the Kaiser rule (eigenvalues greater than 1), we would extract 2 factors. The black horizontal line marks that cutoff.
Parallel analysis suggests 4 factors (you can see that in the console output and in the graph).
The red line(s) show the eigenvalues we would expect if no factors underlay these indicators. In parallel analysis, wherever the blue line is above the red line, that is evidence of a potential factor.
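As a rough illustration of the ideas above, the sketch below rebuilds the correlation matrix from the `cor(example)` output (rounded to two decimals), forms a reduced matrix, and compares its eigenvalues with those from random data. It uses base R only and is not a reproduction of what `fa.parallel` does internally. It rests on several assumptions: squared multiple correlations stand in for the communality estimates, the random data are uncorrelated normal draws with n = 411 (the sample size reported in the `fa()` output), and the helper `reduced_eigen` and the choice of 50 replications are made up for this example. Its counts may therefore differ from the `fa.parallel` result.

```r
# Correlations copied from the cor(example) output, rounded to 2 decimals,
# listed row by row below the diagonal
vars <- c("syn", "ant", "waisvoc", "picvoc", "raven", "sparel",
          "papfld", "wmsrec", "logmem", "letcom", "patcom", "digsym")
below_diag <- c(
  .85,
  .75, .75,
  .72, .68, .68,
  .27, .31, .35, .26,
  .30, .35, .44, .40, .67,
  .21, .26, .32, .27, .68, .73,
  .16, .21, .28, .20, .47, .37, .42,
  .37, .33, .45, .38, .46, .42, .40, .53,
  .11, .19, .21, .08, .46, .33, .29, .38, .27,
  .02, .09, .14, .03, .50, .40, .40, .35, .30, .61,
  .08, .15, .22, .07, .56, .40, .44, .52, .34, .64, .60
)
R <- diag(12)
dimnames(R) <- list(vars, vars)
# Column-major filling of the upper triangle matches the row-wise listing above
R[upper.tri(R)] <- below_diag
R[lower.tri(R)] <- t(R)[lower.tri(R)]

# Hypothetical helper: eigenvalues of the reduced correlation matrix,
# using squared multiple correlations (SMCs) as communality estimates
reduced_eigen <- function(cor_mat) {
  smc <- 1 - 1 / diag(solve(cor_mat))
  diag(cor_mat) <- smc
  eigen(cor_mat, symmetric = TRUE, only.values = TRUE)$values
}

real_ev <- reduced_eigen(R)

# Kaiser rule as used in the text: count eigenvalues above 1
kaiser_count <- sum(real_ev > 1)

# Parallel analysis idea: eigenvalues from data with no underlying factors
set.seed(1)
n_obs <- 411
random_ev <- replicate(50, {
  noise <- matrix(rnorm(n_obs * 12), nrow = n_obs)
  reduced_eigen(cor(noise))
})
random_mean <- rowMeans(random_ev)

# Retain factors until the real eigenvalue first drops below the random one
above <- real_ev > random_mean
parallel_count <- if (all(above)) length(above) else which(!above)[1] - 1

print(round(rbind(real = real_ev, random = random_mean), 2))
cat("Kaiser rule:", kaiser_count, " Parallel analysis:", parallel_count, "\n")
```

The printed table plays the role of the blue (real) and red (random) lines in the plot.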
Theory suggests 4 factors:
syn, ant, waisvoc, picvoc
raven, sparel, papfld
wmsrec, logmem
letcom, patcom, digsym
#4-factor EFA with principal axis factoring, no rotation
efa <- fa(example,nfactors=4,fm="pa",rotate="none")
efa
## Factor Analysis using method = pa
## Call: fa(r = example, nfactors = 4, rotate = "none", fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
## PA1 PA2 PA3 PA4 h2 u2 com
## syn 0.63 -0.67 0.13 -0.04 0.86 0.14 2.1
## ant 0.66 -0.58 0.16 -0.10 0.81 0.19 2.1
## waisvoc 0.70 -0.47 0.08 0.00 0.72 0.28 1.8
## picvoc 0.59 -0.54 -0.04 0.02 0.64 0.36 2.0
## raven 0.75 0.30 -0.18 -0.07 0.68 0.32 1.5
## sparel 0.74 0.14 -0.41 -0.15 0.76 0.24 1.7
## papfld 0.69 0.25 -0.44 -0.06 0.74 0.26 2.0
## wmsrec 0.59 0.29 0.09 0.53 0.72 0.28 2.5
## logmem 0.62 0.01 -0.01 0.30 0.47 0.53 1.4
## letcom 0.54 0.43 0.43 -0.19 0.69 0.31 3.2
## patcom 0.51 0.49 0.19 -0.18 0.57 0.43 2.5
## digsym 0.60 0.49 0.25 -0.02 0.66 0.34 2.3
##
## PA1 PA2 PA3 PA4
## SS loadings 4.90 2.22 0.73 0.48
## Proportion Var 0.41 0.19 0.06 0.04
## Cumulative Var 0.41 0.59 0.65 0.69
## Proportion Explained 0.59 0.27 0.09 0.06
## Cumulative Proportion 0.59 0.85 0.94 1.00
##
## Mean item complexity = 2.1
## Test of the hypothesis that 4 factors are sufficient.
##
## df null model = 66 with the objective function = 7.54 with Chi Square = 3055.99
## df of the model are 24 and the objective function was 0.16
##
## The root mean square of the residuals (RMSR) is 0.01
## The df corrected root mean square of the residuals is 0.02
##
## The harmonic n.obs is 411 with the empirical chi square 11.27 with prob < 0.99
## The total n.obs was 411 with Likelihood Chi Square = 62.63 with prob < 2.7e-05
##
## Tucker Lewis Index of factoring reliability = 0.964
## RMSEA index = 0.063 and the 90 % confidence intervals are 0.044 0.082
## BIC = -81.82
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy
## PA1 PA2 PA3 PA4
## Correlation of (regression) scores with factors 0.97 0.95 0.85 0.78
## Multiple R square of scores with factors 0.95 0.90 0.73 0.61
## Minimum correlation of possible factor scores 0.89 0.80 0.45 0.22
The first table of output contains the (unrotated) factor loadings (PA1-PA4; the order of the factors doesn't matter). More about this table after the next analysis, which adds rotation.
The second table contains information about the factors. The most important information is in the last two lines.
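The link between the two tables can be made concrete with a little arithmetic. The sketch below retypes the unrotated loadings from the first table (rounded to two decimals, so results match the printed output only approximately) and recomputes the communalities and the rows of the second table from them. The object names are made up for this example.

```r
# Unrotated loadings copied from the first table (rounded to 2 decimals)
vars <- c("syn", "ant", "waisvoc", "picvoc", "raven", "sparel",
          "papfld", "wmsrec", "logmem", "letcom", "patcom", "digsym")
lambda <- matrix(c(
  0.63, -0.67,  0.13, -0.04,
  0.66, -0.58,  0.16, -0.10,
  0.70, -0.47,  0.08,  0.00,
  0.59, -0.54, -0.04,  0.02,
  0.75,  0.30, -0.18, -0.07,
  0.74,  0.14, -0.41, -0.15,
  0.69,  0.25, -0.44, -0.06,
  0.59,  0.29,  0.09,  0.53,
  0.62,  0.01, -0.01,  0.30,
  0.54,  0.43,  0.43, -0.19,
  0.51,  0.49,  0.19, -0.18,
  0.60,  0.49,  0.25, -0.02
), ncol = 4, byrow = TRUE, dimnames = list(vars, paste0("PA", 1:4)))

# Communality (h2): sum of squared loadings across factors, per indicator
h2 <- rowSums(lambda^2)
u2 <- 1 - h2  # uniqueness

# SS loadings: sum of squared loadings down each factor's column
ss_loadings <- colSums(lambda^2)
prop_var <- ss_loadings / nrow(lambda)            # share of total variance (12 indicators)
prop_explained <- ss_loadings / sum(ss_loadings)  # share of the variance the 4 factors explain

print(round(cbind(h2, u2), 2))
print(round(rbind(ss_loadings,
                  prop_var,
                  cum_var = cumsum(prop_var),
                  prop_explained,
                  cum_prop = cumsum(prop_explained)), 2))
```

Proportion Var divides each factor's SS loadings by the number of indicators, whereas the last two rows rescale the SS loadings so that they sum to 1 across the four extracted factors.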
#same 4-factor model, now with an oblique Promax rotation
efa <- fa(example,nfactors=4,fm="pa",rotate="Promax")
efa
## Factor Analysis using method = pa
## Call: fa(r = example, nfactors = 4, rotate = "Promax", fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
## PA2 PA1 PA3 PA4 h2 u2 com
## syn 0.97 -0.09 0.00 -0.04 0.86 0.14 1.0
## ant 0.94 -0.05 0.12 -0.10 0.81 0.19 1.1
## waisvoc 0.79 0.05 0.04 0.05 0.72 0.28 1.0
## picvoc 0.74 0.13 -0.15 0.03 0.64 0.36 1.2
## raven 0.01 0.63 0.23 0.06 0.68 0.32 1.3
## sparel 0.07 0.92 -0.02 -0.12 0.76 0.24 1.0
## papfld -0.11 0.94 -0.08 0.02 0.74 0.26 1.0
## wmsrec -0.10 -0.09 0.04 0.91 0.72 0.28 1.0
## logmem 0.18 0.10 -0.03 0.53 0.47 0.53 1.3
## letcom 0.12 -0.15 0.92 -0.05 0.69 0.31 1.1
## patcom -0.07 0.16 0.70 -0.06 0.57 0.43 1.1
## digsym -0.05 0.03 0.67 0.19 0.66 0.34 1.2
##
## PA2 PA1 PA3 PA4
## SS loadings 3.06 2.18 1.94 1.16
## Proportion Var 0.25 0.18 0.16 0.10
## Cumulative Var 0.25 0.44 0.60 0.69
## Proportion Explained 0.37 0.26 0.23 0.14
## Cumulative Proportion 0.37 0.63 0.86 1.00
##
## With factor correlations of
## PA2 PA1 PA3 PA4
## PA2 1.00 0.44 0.15 0.42
## PA1 0.44 1.00 0.58 0.66
## PA3 0.15 0.58 1.00 0.58
## PA4 0.42 0.66 0.58 1.00
##
## Mean item complexity = 1.1
## Test of the hypothesis that 4 factors are sufficient.
##
## df null model = 66 with the objective function = 7.54 with Chi Square = 3055.99
## df of the model are 24 and the objective function was 0.16
##
## The root mean square of the residuals (RMSR) is 0.01
## The df corrected root mean square of the residuals is 0.02
##
## The harmonic n.obs is 411 with the empirical chi square 11.27 with prob < 0.99
## The total n.obs was 411 with Likelihood Chi Square = 62.63 with prob < 2.7e-05
##
## Tucker Lewis Index of factoring reliability = 0.964
## RMSEA index = 0.063 and the 90 % confidence intervals are 0.044 0.082
## BIC = -81.82
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy
## PA2 PA1 PA3 PA4
## Correlation of (regression) scores with factors 0.97 0.95 0.93 0.91
## Multiple R square of scores with factors 0.94 0.90 0.86 0.82
## Minimum correlation of possible factor scores 0.88 0.80 0.72 0.65
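One way to read the pattern matrix in the first table of this output is to assign each indicator to the factor on which it loads most strongly and then check for sizeable cross-loadings. The sketch below does this with the Promax loadings retyped from the output. The 0.30 cutoff for a salient loading is a common convention assumed here, not something the analysis dictates, and the object names are made up for this example.

```r
# Promax pattern loadings copied from the output (columns in printed order)
vars <- c("syn", "ant", "waisvoc", "picvoc", "raven", "sparel",
          "papfld", "wmsrec", "logmem", "letcom", "patcom", "digsym")
pattern <- matrix(c(
   0.97, -0.09,  0.00, -0.04,
   0.94, -0.05,  0.12, -0.10,
   0.79,  0.05,  0.04,  0.05,
   0.74,  0.13, -0.15,  0.03,
   0.01,  0.63,  0.23,  0.06,
   0.07,  0.92, -0.02, -0.12,
  -0.11,  0.94, -0.08,  0.02,
  -0.10, -0.09,  0.04,  0.91,
   0.18,  0.10, -0.03,  0.53,
   0.12, -0.15,  0.92, -0.05,
  -0.07,  0.16,  0.70, -0.06,
  -0.05,  0.03,  0.67,  0.19
), ncol = 4, byrow = TRUE,
   dimnames = list(vars, c("PA2", "PA1", "PA3", "PA4")))

cutoff <- 0.30  # assumed threshold for a "salient" loading

# Factor with the largest absolute loading for each indicator
primary <- colnames(pattern)[max.col(abs(pattern), ties.method = "first")]
names(primary) <- rownames(pattern)

# Number of salient loadings per indicator; more than 1 means a cross-loading
n_salient <- rowSums(abs(pattern) >= cutoff)

# Indicators grouped by their primary factor
print(split(names(primary), primary))
# Indicators with cross-loadings, if any
print(n_salient[n_salient > 1])
```

With these loadings every indicator has exactly one loading at or above 0.30, so each indicator belongs cleanly to a single factor.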
Now let's focus on the first table.
PA2: large loadings for syn, ant, waisvoc, and picvoc, and small, negligible loadings for everything else. We could turn to theory to name this: this is Gc, exactly as we expected.
PA1: raven, sparel, and papfld. Gf, as expected.
PA3: letcom, patcom, and digsym. Speed/working memory, as expected.
PA4: wmsrec and logmem. Long-term memory, as expected.
However, remember how I said that the Kaiser rule suggests 2 factors? Does that mean the above analysis overextracts?
Plus, looking at the scree plot, the elbow is at the 3rd eigenvalue, which suggests 2 or possibly 3 factors. Some argue for counting the elbow itself; others argue for stopping one before it (elbow - 1).
#2-factor solution with Promax rotation, as suggested by the Kaiser rule
efa <- fa(example,nfactors=2,fm="pa",rotate="Promax")
efa
## Factor Analysis using method = pa
## Call: fa(r = example, nfactors = 2, rotate = "Promax", fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
## PA1 PA2 h2 u2 com
## syn -0.13 0.96 0.83 0.17 1.0
## ant -0.03 0.88 0.76 0.24 1.0
## waisvoc 0.09 0.81 0.72 0.28 1.0
## picvoc -0.05 0.83 0.65 0.35 1.0
## raven 0.78 0.08 0.66 0.34 1.0
## sparel 0.61 0.22 0.53 0.47 1.3
## papfld 0.66 0.10 0.50 0.50 1.0
## wmsrec 0.60 0.03 0.38 0.62 1.0
## logmem 0.44 0.29 0.38 0.62 1.7
## letcom 0.68 -0.12 0.42 0.58 1.1
## patcom 0.77 -0.22 0.51 0.49 1.2
## digsym 0.83 -0.18 0.60 0.40 1.1
##
## PA1 PA2
## SS loadings 3.71 3.23
## Proportion Var 0.31 0.27
## Cumulative Var 0.31 0.58
## Proportion Explained 0.53 0.47
## Cumulative Proportion 0.53 1.00
##
## With factor correlations of
## PA1 PA2
## PA1 1.0 0.4
## PA2 0.4 1.0
##
## Mean item complexity = 1.1
## Test of the hypothesis that 2 factors are sufficient.
##
## df null model = 66 with the objective function = 7.54 with Chi Square = 3055.99
## df of the model are 43 and the objective function was 1.05
##
## The root mean square of the residuals (RMSR) is 0.06
## The df corrected root mean square of the residuals is 0.08
##
## The harmonic n.obs is 411 with the empirical chi square 217.12 with prob < 4.3e-25
## The total n.obs was 411 with Likelihood Chi Square = 424.57 with prob < 3.2e-64
##
## Tucker Lewis Index of factoring reliability = 0.803
## RMSEA index = 0.147 and the 90 % confidence intervals are 0.135 0.16
## BIC = 165.77
## Fit based upon off diagonal values = 0.98
## Measures of factor score adequacy
## PA1 PA2
## Correlation of (regression) scores with factors 0.95 0.96
## Multiple R square of scores with factors 0.89 0.93
## Minimum correlation of possible factor scores 0.79 0.86
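Looking at the table of factor loadings:
PA2 has large loadings for syn, ant, waisvoc, and picvoc; the remaining eight indicators all load most strongly on PA1.
Main conclusion: there is good justification for both the 2-factor and the 4-factor results.
The analysis has given us 2 good answers and no evidence strong enough to pick between them.
Isn’t that annoying?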