Introduction

Welcome to this R demo session! Here, I will demonstrate how to use R to conduct exploratory factor analysis (EFA).
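If readxl and psych are not already installed, here is a one-time setup step (an assumption about your machine, not part of the original demo):

#install the two packages used below (only needed once)
install.packages(c("readxl", "psych"))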

#read in data from Excel file
library(readxl)
example <- read_excel("EFA_example_data.xlsx")

#examine a snapshot of the data
head(example)
## # A tibble: 6 × 12
##     syn   ant waisvoc picvoc raven sparel papfld wmsrec logmem letcom patcom
##   <dbl> <dbl>   <dbl>  <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1     8    10      52     22  0.06      7      5   0.67     27    5.5     10
## 2     5     5      52     20  0.44     15      8   0.67     41   13       20
## 3     3     3      36      9  0.06      2      2   0.67     30    6       10
## 4    10    10      61     23  0.5       8      7   0.61     44   11.5     15
## 5     5     2      48     13  0.5       9      7   0.81     59   11.5     18
## 6     6     5      38     11  0.13      3      4   0.42     20    8       15
## # ℹ 1 more variable: digsym <dbl>
#data summary
summary(example)
##       syn             ant            waisvoc          picvoc       raven       
##  Min.   : 0.00   Min.   : 0.000   Min.   :14.00   Min.   : 1   Min.   :0.0000  
##  1st Qu.: 5.00   1st Qu.: 5.000   1st Qu.:45.00   1st Qu.:16   1st Qu.:0.3800  
##  Median : 8.00   Median : 8.000   Median :54.00   Median :20   Median :0.5000  
##  Mean   : 7.38   Mean   : 7.185   Mean   :51.27   Mean   :19   Mean   :0.4855  
##  3rd Qu.:10.00   3rd Qu.:10.000   3rd Qu.:59.00   3rd Qu.:23   3rd Qu.:0.6300  
##  Max.   :10.00   Max.   :10.000   Max.   :68.00   Max.   :30   Max.   :1.0000  
##      sparel           papfld           wmsrec           logmem     
##  Min.   : 0.000   Min.   : 0.000   Min.   :0.2800   Min.   :16.00  
##  1st Qu.: 5.000   1st Qu.: 5.000   1st Qu.:0.6100   1st Qu.:38.00  
##  Median : 9.000   Median : 7.000   Median :0.7200   Median :45.00  
##  Mean   : 9.406   Mean   : 6.822   Mean   :0.7101   Mean   :44.75  
##  3rd Qu.:13.000   3rd Qu.: 9.000   3rd Qu.:0.8100   3rd Qu.:51.00  
##  Max.   :20.000   Max.   :12.000   Max.   :0.9700   Max.   :72.00  
##      letcom          patcom          digsym      
##  Min.   : 5.00   Min.   : 0.00   Min.   :  8.00  
##  1st Qu.: 9.25   1st Qu.:15.00   1st Qu.: 65.00  
##  Median :11.00   Median :18.00   Median : 76.00  
##  Mean   :10.90   Mean   :17.95   Mean   : 77.31  
##  3rd Qu.:12.50   3rd Qu.:21.00   3rd Qu.: 90.00  
##  Max.   :19.00   Max.   :29.00   Max.   :129.00

Exploratory factor analysis – Data exploration

#calculate correlations
cor(example)
##                syn        ant   waisvoc     picvoc     raven    sparel
## syn     1.00000000 0.84776546 0.7476302 0.71965033 0.2671055 0.2984242
## ant     0.84776546 1.00000000 0.7515984 0.68335918 0.3141898 0.3453001
## waisvoc 0.74763024 0.75159844 1.0000000 0.68403034 0.3523038 0.4387302
## picvoc  0.71965033 0.68335918 0.6840303 1.00000000 0.2637889 0.4038272
## raven   0.26710545 0.31418983 0.3523038 0.26378889 1.0000000 0.6725621
## sparel  0.29842419 0.34530012 0.4387302 0.40382718 0.6725621 1.0000000
## papfld  0.21470260 0.25749728 0.3225538 0.26893141 0.6840154 0.7346558
## wmsrec  0.16133075 0.20858126 0.2766254 0.19708991 0.4689090 0.3653197
## logmem  0.37432118 0.32601853 0.4461294 0.38000605 0.4620228 0.4188174
## letcom  0.11256498 0.19492839 0.2119698 0.07619716 0.4599375 0.3289703
## patcom  0.01690756 0.08852634 0.1445336 0.03273032 0.5006182 0.4031702
## digsym  0.08188352 0.15012283 0.2190173 0.07071468 0.5573211 0.3959214
##            papfld    wmsrec    logmem     letcom     patcom     digsym
## syn     0.2147026 0.1613308 0.3743212 0.11256498 0.01690756 0.08188352
## ant     0.2574973 0.2085813 0.3260185 0.19492839 0.08852634 0.15012283
## waisvoc 0.3225538 0.2766254 0.4461294 0.21196981 0.14453364 0.21901734
## picvoc  0.2689314 0.1970899 0.3800060 0.07619716 0.03273032 0.07071468
## raven   0.6840154 0.4689090 0.4620228 0.45993752 0.50061824 0.55732107
## sparel  0.7346558 0.3653197 0.4188174 0.32897035 0.40317021 0.39592136
## papfld  1.0000000 0.4206718 0.3994568 0.28612656 0.40139280 0.44076833
## wmsrec  0.4206718 1.0000000 0.5300479 0.38281658 0.34758926 0.52090513
## logmem  0.3994568 0.5300479 1.0000000 0.27337511 0.30254195 0.33553305
## letcom  0.2861266 0.3828166 0.2733751 1.00000000 0.60690924 0.63523302
## patcom  0.4013928 0.3475893 0.3025420 0.60690924 1.00000000 0.59750526
## digsym  0.4407683 0.5209051 0.3355330 0.63523302 0.59750526 1.00000000
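The full-precision matrix above is hard to scan. Here is an optional step (not part of the original output) that rounds it to two decimals; the psych package loaded below also has lowerCor(example), which prints a rounded lower triangle.

#easier-to-read version of the correlation matrix (optional)
round(cor(example), 2)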
#create scree plot for eigenvalues
library(psych)
fa.parallel(example, fm = "pa", fa = "fa")

## Parallel analysis suggests that the number of factors =  4  and the number of components =  NA

The blue line shows the eigenvalues for the real data.

Note that these are the eigenvalues of the reduced matrix, i.e., the correlation matrix with the 1s on the diagonal replaced by estimated communalities.

Based on the Kaiser rule, we would extract 2 factors; that is the black horizontal line in the plot.

Parallel analysis suggests 4 factors (you can see that both in the console output and in the graph).

The red line(s) show the eigenvalues we would expect if there were no factors underlying these indicators. Wherever the blue line rises above the red line, parallel analysis counts that as evidence of a potential factor.
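A quick way to look at the numbers behind the plot (a minimal sketch; it assumes the object returned by fa.parallel() exposes fa.values and nfact, as it does in current versions of psych, so check str(pa) if those fields are missing):

#eigenvalues of the full correlation matrix (what the Kaiser rule counts)
eigen(cor(example))$values
sum(eigen(cor(example))$values > 1)   #how many eigenvalues exceed 1

#rerun parallel analysis, keeping the result so the plotted values can be inspected
pa <- fa.parallel(example, fm = "pa", fa = "fa", plot = FALSE)
pa$fa.values   #eigenvalues of the reduced matrix (the blue line)
pa$nfact       #number of factors suggested by parallel analysis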

Theory suggests 4 factors:

  • Gc = crystallized intelligence (essentially knowledge): syn, ant, waisvoc, picvoc
  • Gf/spatial visualization = fluid intelligence/spatial visualization, two very closely related concepts (raven, sparel, papfld)
  • Long-term memory: wmsrec, logmem
  • Speed/short-term/working memory: letcom, patcom, digsym

Start by extracting 4 factors

efa <- fa(example, nfactors = 4, fm = "pa", rotate = "none")
efa
## Factor Analysis using method =  pa
## Call: fa(r = example, nfactors = 4, rotate = "none", fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
##          PA1   PA2   PA3   PA4   h2   u2 com
## syn     0.63 -0.67  0.13 -0.04 0.86 0.14 2.1
## ant     0.66 -0.58  0.16 -0.10 0.81 0.19 2.1
## waisvoc 0.70 -0.47  0.08  0.00 0.72 0.28 1.8
## picvoc  0.59 -0.54 -0.04  0.02 0.64 0.36 2.0
## raven   0.75  0.30 -0.18 -0.07 0.68 0.32 1.5
## sparel  0.74  0.14 -0.41 -0.15 0.76 0.24 1.7
## papfld  0.69  0.25 -0.44 -0.06 0.74 0.26 2.0
## wmsrec  0.59  0.29  0.09  0.53 0.72 0.28 2.5
## logmem  0.62  0.01 -0.01  0.30 0.47 0.53 1.4
## letcom  0.54  0.43  0.43 -0.19 0.69 0.31 3.2
## patcom  0.51  0.49  0.19 -0.18 0.57 0.43 2.5
## digsym  0.60  0.49  0.25 -0.02 0.66 0.34 2.3
## 
##                        PA1  PA2  PA3  PA4
## SS loadings           4.90 2.22 0.73 0.48
## Proportion Var        0.41 0.19 0.06 0.04
## Cumulative Var        0.41 0.59 0.65 0.69
## Proportion Explained  0.59 0.27 0.09 0.06
## Cumulative Proportion 0.59 0.85 0.94 1.00
## 
## Mean item complexity =  2.1
## Test of the hypothesis that 4 factors are sufficient.
## 
## df null model =  66  with the objective function =  7.54 with Chi Square =  3055.99
## df of  the model are 24  and the objective function was  0.16 
## 
## The root mean square of the residuals (RMSR) is  0.01 
## The df corrected root mean square of the residuals is  0.02 
## 
## The harmonic n.obs is  411 with the empirical chi square  11.27  with prob <  0.99 
## The total n.obs was  411  with Likelihood Chi Square =  62.63  with prob <  2.7e-05 
## 
## Tucker Lewis Index of factoring reliability =  0.964
## RMSEA index =  0.063  and the 90 % confidence intervals are  0.044 0.082
## BIC =  -81.82
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                    PA1  PA2  PA3  PA4
## Correlation of (regression) scores with factors   0.97 0.95 0.85 0.78
## Multiple R square of scores with factors          0.95 0.90 0.73 0.61
## Minimum correlation of possible factor scores     0.89 0.80 0.45 0.22

The first table of output contains the (unrotated) factor loadings (PA1-PA4; the order of the factors doesn’t matter). More about this table after the next analysis, which adds a rotation.

The second table gives information about the factors. The most important information is in the last two lines:

  • Proportion Explained:
    • Of the variance the factors account for in the indicators, the share attributable to each individual factor
    • Here, the first factor accounts for 59% of the shared variance
    • Shared variance means variance in the indicators after removing unique variance
  • Cumulative Proportion is the proportion of shared variance accounted for by all the factors so far
    • The first factor accounts for 59%
    • The first and second factors together account for 85%
    • The first three factors account for 94%, and all 4 factors account for 100% of the shared variance
    • Is the extra 6% from the 4th factor enough to matter? Maybe, maybe not (a short sketch after this list shows how these proportions come from the SS loadings)
  • The rest of the output has interesting stuff but not critical to interpretation
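To make the arithmetic concrete, here is a minimal sketch that reproduces Proportion Explained and Cumulative Proportion from the SS loadings printed above. The values are copied from the output, so this is illustration only; recent versions of psych also store this table on the fitted object (efa$Vaccounted), but check str(efa) if that field is not there.

#SS loadings from the unrotated 4-factor solution above
ss <- c(PA1 = 4.90, PA2 = 2.22, PA3 = 0.73, PA4 = 0.48)
round(ss / sum(ss), 2)           #Proportion Explained: 0.59 0.27 0.09 0.06
round(cumsum(ss) / sum(ss), 2)   #Cumulative Proportion: 0.59 0.85 0.94 1.00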

Now let’s employ an oblique rotation

efa <- fa(example, nfactors = 4, fm = "pa", rotate = "Promax")
efa
## Factor Analysis using method =  pa
## Call: fa(r = example, nfactors = 4, rotate = "Promax", fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
##           PA2   PA1   PA3   PA4   h2   u2 com
## syn      0.97 -0.09  0.00 -0.04 0.86 0.14 1.0
## ant      0.94 -0.05  0.12 -0.10 0.81 0.19 1.1
## waisvoc  0.79  0.05  0.04  0.05 0.72 0.28 1.0
## picvoc   0.74  0.13 -0.15  0.03 0.64 0.36 1.2
## raven    0.01  0.63  0.23  0.06 0.68 0.32 1.3
## sparel   0.07  0.92 -0.02 -0.12 0.76 0.24 1.0
## papfld  -0.11  0.94 -0.08  0.02 0.74 0.26 1.0
## wmsrec  -0.10 -0.09  0.04  0.91 0.72 0.28 1.0
## logmem   0.18  0.10 -0.03  0.53 0.47 0.53 1.3
## letcom   0.12 -0.15  0.92 -0.05 0.69 0.31 1.1
## patcom  -0.07  0.16  0.70 -0.06 0.57 0.43 1.1
## digsym  -0.05  0.03  0.67  0.19 0.66 0.34 1.2
## 
##                        PA2  PA1  PA3  PA4
## SS loadings           3.06 2.18 1.94 1.16
## Proportion Var        0.25 0.18 0.16 0.10
## Cumulative Var        0.25 0.44 0.60 0.69
## Proportion Explained  0.37 0.26 0.23 0.14
## Cumulative Proportion 0.37 0.63 0.86 1.00
## 
##  With factor correlations of 
##      PA2  PA1  PA3  PA4
## PA2 1.00 0.44 0.15 0.42
## PA1 0.44 1.00 0.58 0.66
## PA3 0.15 0.58 1.00 0.58
## PA4 0.42 0.66 0.58 1.00
## 
## Mean item complexity =  1.1
## Test of the hypothesis that 4 factors are sufficient.
## 
## df null model =  66  with the objective function =  7.54 with Chi Square =  3055.99
## df of  the model are 24  and the objective function was  0.16 
## 
## The root mean square of the residuals (RMSR) is  0.01 
## The df corrected root mean square of the residuals is  0.02 
## 
## The harmonic n.obs is  411 with the empirical chi square  11.27  with prob <  0.99 
## The total n.obs was  411  with Likelihood Chi Square =  62.63  with prob <  2.7e-05 
## 
## Tucker Lewis Index of factoring reliability =  0.964
## RMSEA index =  0.063  and the 90 % confidence intervals are  0.044 0.082
## BIC =  -81.82
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                    PA2  PA1  PA3  PA4
## Correlation of (regression) scores with factors   0.97 0.95 0.93 0.91
## Multiple R square of scores with factors          0.94 0.90 0.86 0.82
## Minimum correlation of possible factor scores     0.88 0.80 0.72 0.65

Now let’s focus on the first table:

  • Factor PA2 (order doesn’t matter) has large loadings for syn, ant, waisvoc, picvoc and small, negligible loadings for everything else. We could turn to theory to name this: this is Gc, exactly as we expected
  • Factor PA1 has large loadings for raven, sparel, papfld. Gf, as expected
  • Factor PA3 has large loadings for letcom, patcom, digsym. Speed/working memory as expected
  • Factor PA4 has large loadings for wmsrec and logmem. Long-term memory as expected
  • Overall, this is a really clean, theoretically justified EFA
  • Note: h2 is the estimated communality; u2 is the estimated unique variance.
  • The com column is the item complexity index: roughly, the effective number of factors each indicator loads on (indicators with clean simple structure are close to 1, as here). A short sketch after this list shows convenient ways to view a loading table like this one
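Two conveniences for reading a rotated loading table (a minimal sketch; the cut/sort options of print() and fa.diagram() are standard psych features, but treat the exact argument names as assumptions and see ?print.psych if they differ in your version):

#suppress small loadings and sort indicators by their dominant factor
print(efa, cut = 0.3, sort = TRUE)

#simple path-style diagram of indicators and factors
fa.diagram(efa)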

However, remember that the Kaiser rule suggested 2 factors? Does that mean the 4-factor analysis above overextracts?

Plus, looking at the scree plot, the elbow is at the 3rd eigenvalue, which suggests 2 or possibly 3 factors. Some argue for retaining the factor at the elbow; others argue for the elbow minus 1.
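If you wanted to look at the in-between, 3-factor solution as well, the call is the same with nfactors = 3 (a sketch only; it is not run in this demo, so there is no output to compare here):

#optional: 3-factor solution, for completeness
efa3 <- fa(example, nfactors = 3, fm = "pa", rotate = "Promax")
efa3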

For now, try extracting 2 factors

efa <- fa(example, nfactors = 2, fm = "pa", rotate = "Promax")
efa
## Factor Analysis using method =  pa
## Call: fa(r = example, nfactors = 2, rotate = "Promax", fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
##           PA1   PA2   h2   u2 com
## syn     -0.13  0.96 0.83 0.17 1.0
## ant     -0.03  0.88 0.76 0.24 1.0
## waisvoc  0.09  0.81 0.72 0.28 1.0
## picvoc  -0.05  0.83 0.65 0.35 1.0
## raven    0.78  0.08 0.66 0.34 1.0
## sparel   0.61  0.22 0.53 0.47 1.3
## papfld   0.66  0.10 0.50 0.50 1.0
## wmsrec   0.60  0.03 0.38 0.62 1.0
## logmem   0.44  0.29 0.38 0.62 1.7
## letcom   0.68 -0.12 0.42 0.58 1.1
## patcom   0.77 -0.22 0.51 0.49 1.2
## digsym   0.83 -0.18 0.60 0.40 1.1
## 
##                        PA1  PA2
## SS loadings           3.71 3.23
## Proportion Var        0.31 0.27
## Cumulative Var        0.31 0.58
## Proportion Explained  0.53 0.47
## Cumulative Proportion 0.53 1.00
## 
##  With factor correlations of 
##     PA1 PA2
## PA1 1.0 0.4
## PA2 0.4 1.0
## 
## Mean item complexity =  1.1
## Test of the hypothesis that 2 factors are sufficient.
## 
## df null model =  66  with the objective function =  7.54 with Chi Square =  3055.99
## df of  the model are 43  and the objective function was  1.05 
## 
## The root mean square of the residuals (RMSR) is  0.06 
## The df corrected root mean square of the residuals is  0.08 
## 
## The harmonic n.obs is  411 with the empirical chi square  217.12  with prob <  4.3e-25 
## The total n.obs was  411  with Likelihood Chi Square =  424.57  with prob <  3.2e-64 
## 
## Tucker Lewis Index of factoring reliability =  0.803
## RMSEA index =  0.147  and the 90 % confidence intervals are  0.135 0.16
## BIC =  165.77
## Fit based upon off diagonal values = 0.98
## Measures of factor score adequacy             
##                                                    PA1  PA2
## Correlation of (regression) scores with factors   0.95 0.96
## Multiple R square of scores with factors          0.89 0.93
## Minimum correlation of possible factor scores     0.79 0.86

Looking at the table of factor loadings:

  • Factor PA1 is everything but the Gc indicators
  • Factor PA2 is the Gc indicators
  • This is also consistent with theory. Theoretically, Gc is differentiated from the other more ‘fluid’ cognitive abilities

Main conclusion: there is good justification for both the 2-factor and the 4-factor solutions

The analysis has given us two good answers and no evidence strong enough to pick between them

Isn’t that annoying?
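If you do want to line up the statistical fit evidence side by side (the same numbers already printed above), here is a minimal sketch. It assumes the two solutions are refit under separate names (efa4 and efa2 are my labels, not used above) and that the fa object stores TLI, RMSEA, and BIC, as recent versions of psych do; check names(efa4) if not.

#refit and keep both solutions under separate names
efa4 <- fa(example, nfactors = 4, fm = "pa", rotate = "Promax")
efa2 <- fa(example, nfactors = 2, fm = "pa", rotate = "Promax")

#side-by-side fit summary (values should match the printouts above)
data.frame(factors = c(4, 2),
           TLI     = c(efa4$TLI, efa2$TLI),
           RMSEA   = c(efa4$RMSEA[1], efa2$RMSEA[1]),
           BIC     = c(efa4$BIC, efa2$BIC))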