Exploratory Factor Analysis

Exploratory Factor Analysis is a statistical technique in social science to explain the variance between several measured variables as a smaller set of latent variables. EFA is often used to consolidate survey data by revealing the groupings (factors) that underly individual questions.

An EFA provides information on each item’s relationship to a single factor that is hypothesized to be represented by each of the items. EFA results give you basic information about how well items relate to that hypothesized construct.

The common application of EFA is to investigate relationships between observed variabls and latent variables (factor) such as measurement piloting.

We will ve using psych package as lavaan does not explicitly support EFA.

setwd("D:/Class Materials & Work/Summer 2020 practice/SEM/EFA")

library(psych)
library(tidyverse)
library(semPlot)
library(ggcorrplot)

gcbs <- readRDS("GCBS_data.rds")

In an actual implementation, be sure to use a dataset that only contains item responses - other types of data will cause errors and/or incorrect results.

In the gcbs dataset, these are examinees’ responses to 15 items from the Generic Conspiracist Beliefs Scale, which is designed to measure conspiracist beliefs.

Perform the EFA

# Conduct a single-factor EFA
EFA_model <- fa(gcbs)

# View the results
EFA_model

## Factor Analysis using method =  minres
## Call: fa(r = gcbs)
## Standardized loadings (pattern matrix) based upon correlation matrix
##      MR1   h2   u2 com
## Q1  0.70 0.49 0.51   1
## Q2  0.72 0.52 0.48   1
## Q3  0.64 0.41 0.59   1
## Q4  0.77 0.59 0.41   1
## Q5  0.67 0.45 0.55   1
## Q6  0.75 0.56 0.44   1
## Q7  0.73 0.54 0.46   1
## Q8  0.65 0.43 0.57   1
## Q9  0.70 0.48 0.52   1
## Q10 0.56 0.32 0.68   1
## Q11 0.72 0.52 0.48   1
## Q12 0.79 0.62 0.38   1
## Q13 0.68 0.46 0.54   1
## Q14 0.74 0.55 0.45   1
## Q15 0.57 0.33 0.67   1
## 
##                 MR1
## SS loadings    7.27
## Proportion Var 0.48
## 
## Mean item complexity =  1
## Test of the hypothesis that 1 factor is sufficient.
## 
## The degrees of freedom for the null model are  105  and the objective function was  9.31 with Chi Square of  23173.8
## The degrees of freedom for the model are 90  and the objective function was  1.93 
## 
## The root mean square of the residuals (RMSR) is  0.08 
## The df corrected root mean square of the residuals is  0.09 
## 
## The harmonic number of observations is  2495 with the empirical chi square  3398.99  with prob <  0 
## The total number of observations was  2495  with Likelihood Chi Square =  4809.34  with prob <  0 
## 
## Tucker Lewis Index of factoring reliability =  0.761
## RMSEA index =  0.145  and the 90 % confidence intervals are  0.142 0.149
## BIC =  4105.36
## Fit based upon off diagonal values = 0.97
## Measures of factor score adequacy             
##                                                    MR1
## Correlation of (regression) scores with factors   0.97
## Multiple R square of scores with factors          0.94
## Minimum correlation of possible factor scores     0.87

Viewing and visualizing the factor loadings

Each fa() result is a list that contains specific information about the analysis. Factor loadings represent the strength and directionality of the relationship between each item and the underlying factor, ranging from from -1 to 1.

The fa.diagram() function takes a result object from fa() and creates a path diagram showing the items’ loadings ordered from strongest to weakest. Path diagrams are more common for the full SEM than for factor analysis, but this type of visualization can be a helpful way to represent your results.

# View the factor loadings
EFA_model$loadings

## 
## Loadings:
##     MR1  
## Q1  0.703
## Q2  0.719
## Q3  0.638
## Q4  0.770
## Q5  0.672
## Q6  0.746
## Q7  0.734
## Q8  0.654
## Q9  0.695
## Q10 0.565
## Q11 0.719
## Q12 0.786
## Q13 0.679
## Q14 0.743
## Q15 0.574
## 
##                  MR1
## SS loadings    7.267
## Proportion Var 0.484

# Create a path diagram of the items' factor loadings
fa.diagram(EFA_model)

Interpreting Individuals’ Factor Score

The EFA_model also contains factor scores for each person, which indicate how much or how little of the factor (or the construct) each person possesses. Factor scores are not computed for examinees with missing data.

#Drawing out the total scores for the first six respondents.
rowSums(gcbs[1:6, ])

##  1  2  3  4  5  6 
## 68 65 37 55 59 15

#Drawing out the first few rows of factor scores.
head(EFA_model$scores)

##             MR1
## [1,]  1.5614675
## [2,]  1.3432026
## [3,] -0.3960355
## [4,]  0.7478868
## [5,]  1.0435203
## [6,] -1.7290812

#See the distribution of factor scores
describe(EFA_model$scores)

##    vars    n mean   sd median trimmed  mad   min  max range skew kurtosis   se
## X1    1 2495    0 0.97      0   -0.02 1.12 -1.85 1.95   3.8 0.09    -0.91 0.02

Next, we can visualize the estimated factor scores with plot() and density().

plot(density(EFA_model$scores, na.rm = TRUE), 
     main = "Factor Scores")

Descriptive Statistics of the Dataset

describe(gcbs)

##     vars    n mean   sd median trimmed  mad min max range  skew kurtosis   se
## Q1     1 2495 3.47 1.46      4    3.59 1.48   0   5     5 -0.55    -1.10 0.03
## Q2     2 2495 2.96 1.49      3    2.96 1.48   0   5     5 -0.01    -1.40 0.03
## Q3     3 2495 2.05 1.39      1    1.82 0.00   0   5     5  0.98    -0.44 0.03
## Q4     4 2495 2.64 1.45      2    2.55 1.48   0   5     5  0.26    -1.34 0.03
## Q5     5 2495 3.25 1.47      4    3.32 1.48   0   5     5 -0.35    -1.27 0.03
## Q6     6 2495 3.11 1.51      3    3.14 1.48   0   5     5 -0.17    -1.42 0.03
## Q7     7 2495 2.67 1.51      2    2.59 1.48   0   5     5  0.28    -1.39 0.03
## Q8     8 2495 2.45 1.57      2    2.32 1.48   0   5     5  0.51    -1.30 0.03
## Q9     9 2495 2.23 1.42      2    2.05 1.48   0   5     5  0.76    -0.82 0.03
## Q10   10 2495 3.50 1.39      4    3.63 1.48   1   5     4 -0.59    -0.94 0.03
## Q11   11 2495 3.27 1.40      4    3.34 1.48   0   5     5 -0.35    -1.11 0.03
## Q12   12 2495 2.64 1.50      2    2.56 1.48   0   5     5  0.29    -1.37 0.03
## Q13   13 2495 2.10 1.38      1    1.89 0.00   0   5     5  0.89    -0.56 0.03
## Q14   14 2495 2.96 1.49      3    2.95 1.48   0   5     5 -0.02    -1.43 0.03
## Q15   15 2495 4.23 1.10      5    4.47 0.00   0   5     5 -1.56     1.71 0.02

#mean scores and confidence intervals for each variable in the dataset.
error.dots(gcbs)

#bar chart of the mean scores and confidence intervals for each variable in the dataset.
error.bars(gcbs)

Splitting the dataset

During the measure development process, it’s important to conduct EFA and CFA on separate datasets to avoid model overfit. Instead, you can split your dataset in half, then use one half for the EFA and the other half for the CFA.

N <- nrow(gcbs)
indices <- seq(1, N)
indices_EFA <- sample(indices, floor((.5*N))) #rounded down
indices_CFA <- indices[!(indices %in% indices_EFA)]

gcbs_EFA <- gcbs[indices_EFA, ]
gcbs_CFA <- gcbs[indices_CFA, ]

Comparing the halves of your dataset

Just as you inspected the features of your full dataset, it’s important to examine the halves after you’ve split the data.

In this exercise, you’ll use the indices created when splitting the dataset to create a grouping variable and attach it to the gcbs dataset. Once that grouping variable is set up, you can use describeBy() and statsBy() to view basic descriptive statistics as well as between-group statistics.

Be noted that while the group argument of describeBy() has to be a vector, the group argument of statsBy() has to be the name of a column in your dataframe.

#Use the indices that we created to create a grouping variable
group_var <- vector("numeric", nrow(gcbs))

group_var[indices_EFA] <- 1
group_var[indices_CFA] <- 2

#Bind that grouping variable onto the gcbs dataset
gcbs_grouped <- cbind(gcbs, group_var)

#Compare  stats across groups
describeBy(gcbs_grouped, group = group_var)

## 
##  Descriptive statistics by group 
## group: 1
##           vars    n mean   sd median trimmed  mad min max range  skew kurtosis
## Q1           1 1247 3.47 1.47      4    3.58 1.48   0   5     5 -0.53    -1.15
## Q2           2 1247 2.98 1.49      3    2.98 1.48   0   5     5 -0.04    -1.39
## Q3           3 1247 2.05 1.39      1    1.83 0.00   0   5     5  0.97    -0.47
## Q4           4 1247 2.66 1.45      2    2.58 1.48   1   5     4  0.26    -1.36
## Q5           5 1247 3.30 1.46      4    3.38 1.48   0   5     5 -0.41    -1.19
## Q6           6 1247 3.06 1.53      3    3.08 1.48   0   5     5 -0.11    -1.45
## Q7           7 1247 2.72 1.52      2    2.66 1.48   0   5     5  0.22    -1.43
## Q8           8 1247 2.45 1.58      2    2.32 1.48   0   5     5  0.52    -1.30
## Q9           9 1247 2.27 1.44      2    2.10 1.48   0   5     5  0.72    -0.91
## Q10         10 1247 3.51 1.39      4    3.64 1.48   1   5     4 -0.60    -0.93
## Q11         11 1247 3.28 1.38      4    3.36 1.48   0   5     5 -0.39    -1.06
## Q12         12 1247 2.67 1.51      2    2.59 1.48   0   5     5  0.28    -1.36
## Q13         13 1247 2.11 1.40      1    1.89 0.00   0   5     5  0.90    -0.56
## Q14         14 1247 3.00 1.50      3    3.00 1.48   0   5     5 -0.08    -1.44
## Q15         15 1247 4.19 1.14      5    4.43 0.00   0   5     5 -1.48     1.36
## group_var   16 1247 1.00 0.00      1    1.00 0.00   1   1     0   NaN      NaN
##             se
## Q1        0.04
## Q2        0.04
## Q3        0.04
## Q4        0.04
## Q5        0.04
## Q6        0.04
## Q7        0.04
## Q8        0.04
## Q9        0.04
## Q10       0.04
## Q11       0.04
## Q12       0.04
## Q13       0.04
## Q14       0.04
## Q15       0.03
## group_var 0.00
## ------------------------------------------------------------ 
## group: 2
##           vars    n mean   sd median trimmed  mad min max range  skew kurtosis
## Q1           1 1248 3.48 1.45      4    3.60 1.48   0   5     5 -0.56    -1.06
## Q2           2 1248 2.94 1.50      3    2.94 1.48   0   5     5  0.02    -1.41
## Q3           3 1248 2.04 1.38      1    1.81 0.00   0   5     5  0.99    -0.43
## Q4           4 1248 2.61 1.45      2    2.52 1.48   0   5     5  0.27    -1.32
## Q5           5 1248 3.21 1.48      4    3.26 1.48   0   5     5 -0.29    -1.33
## Q6           6 1248 3.15 1.48      3    3.19 1.48   0   5     5 -0.22    -1.38
## Q7           7 1248 2.61 1.50      2    2.52 1.48   0   5     5  0.35    -1.34
## Q8           8 1248 2.45 1.56      2    2.32 1.48   0   5     5  0.51    -1.30
## Q9           9 1248 2.19 1.40      2    2.00 1.48   0   5     5  0.79    -0.73
## Q10         10 1248 3.49 1.38      4    3.62 1.48   1   5     4 -0.57    -0.96
## Q11         11 1248 3.25 1.42      3    3.32 1.48   0   5     5 -0.31    -1.16
## Q12         12 1248 2.62 1.50      2    2.53 1.48   0   5     5  0.30    -1.38
## Q13         13 1248 2.10 1.36      1    1.90 0.00   0   5     5  0.87    -0.58
## Q14         14 1248 2.91 1.48      3    2.89 1.48   0   5     5  0.04    -1.41
## Q15         15 1248 4.26 1.07      5    4.50 0.00   1   5     4 -1.64     2.09
## group_var   16 1248 2.00 0.00      2    2.00 0.00   2   2     0   NaN      NaN
##             se
## Q1        0.04
## Q2        0.04
## Q3        0.04
## Q4        0.04
## Q5        0.04
## Q6        0.04
## Q7        0.04
## Q8        0.04
## Q9        0.04
## Q10       0.04
## Q11       0.04
## Q12       0.04
## Q13       0.04
## Q14       0.04
## Q15       0.03
## group_var 0.00

statsBy(gcbs_grouped, group = "group_var")

## Statistics within and between groups  
## Call: statsBy(data = gcbs_grouped, group = "group_var")
## Intraclass Correlation 1 (Percentage of variance due to groups) 
##        Q1        Q2        Q3        Q4        Q5        Q6        Q7        Q8 
##         0         0         0         0         0         0         0         0 
##        Q9       Q10       Q11       Q12       Q13       Q14       Q15 group_var 
##         0         0         0         0         0         0         0         1 
## Intraclass Correlation 2 (Reliability of group differences) 
##        Q1        Q2        Q3        Q4        Q5        Q6        Q7        Q8 
##    -24.06     -1.53    -15.53     -0.48      0.60      0.57      0.71    -85.07 
##        Q9       Q10       Q11       Q12       Q13       Q14       Q15 group_var 
##      0.47    -11.66     -2.02     -0.31    -93.55      0.57      0.61      1.00 
## eta^2 between groups  
##  Q1.bg  Q2.bg  Q3.bg  Q4.bg  Q5.bg  Q6.bg  Q7.bg  Q8.bg  Q9.bg Q10.bg Q11.bg 
##      0      0      0      0      0      0      0      0      0      0      0 
## Q12.bg Q13.bg Q14.bg Q15.bg 
##      0      0      0      0 
## 
## To see the correlations between and within groups, use the short=FALSE option in your print statement.
## Many results are not shown directly. To see specific objects select from the following list:
##  mean sd n F ICC1 ICC2 ci1 ci2 raw rbg pbg rwg nw ci.wg pwg etabg etawg nwg nG Call

Viewing and testing correlations

To get a feel for your dataset, we will examine the relationships of your variables with each other. The lowerCor() function is useful to generate a correlation matrix, with corr.test to generate a correlation matrix of p-value.

lowerCor(gcbs, use = "pairwise.complete.obs")

##     Q1   Q2   Q3   Q4   Q5   Q6   Q7   Q8   Q9   Q10  Q11 
## Q1  1.00                                                  
## Q2  0.53 1.00                                             
## Q3  0.36 0.40 1.00                                        
## Q4  0.52 0.53 0.50 1.00                                   
## Q5  0.48 0.46 0.40 0.57 1.00                              
## Q6  0.63 0.55 0.40 0.61 0.50 1.00                         
## Q7  0.47 0.67 0.42 0.57 0.45 0.54 1.00                    
## Q8  0.39 0.38 0.78 0.49 0.41 0.41 0.41 1.00               
## Q9  0.42 0.49 0.49 0.56 0.46 0.48 0.53 0.48 1.00          
## Q10 0.44 0.38 0.32 0.40 0.43 0.41 0.39 0.36 0.37 1.00     
## Q11 0.64 0.52 0.34 0.52 0.49 0.62 0.49 0.37 0.46 0.45 1.00
## Q12 0.52 0.72 0.44 0.60 0.49 0.59 0.75 0.42 0.57 0.40 0.55
## Q13 0.38 0.40 0.71 0.51 0.43 0.42 0.45 0.76 0.54 0.37 0.40
## Q14 0.53 0.50 0.43 0.60 0.54 0.55 0.52 0.45 0.55 0.41 0.56
## Q15 0.51 0.40 0.27 0.39 0.45 0.47 0.39 0.31 0.32 0.45 0.54
##     Q12  Q13  Q14  Q15 
## Q12 1.00               
## Q13 0.49 1.00          
## Q14 0.56 0.50 1.00     
## Q15 0.41 0.30 0.46 1.00

However, the function ggcorrplot() provides a more effective graphical insight with correlation coefficient and insignificance marking.

variables.to.use <- colnames(gcbs[1:15])

gcbs.corr<-cor(gcbs[variables.to.use],
                 method = "pearson",
                 use='pairwise.complete.obs')

ggcorrplot(gcbs.corr,
           p.mat=cor_pmat(gcbs[variables.to.use]),
           hc.order=T, #Group similar coefficient together
           type='lower',
           color=c('red3', 'white', 'green3'),
           outline.color = 'darkgoldenrod1', 
           lab=T,
           lab_size=3, #Size of the correlation coefficient
           legend.title='Correlation',
           pch=4, 
           pch.cex=12)+ 
  labs(title="gcbs correlation")+
  theme(plot.title=element_text(face='bold',size=14,hjust=0.5,colour="darkred"))+
  theme(legend.position=c(0.10,0.80), legend.box.just = "bottom")

Examining Internal Reliability

Internal reliability shows how well our test items relate to each other within the measurement. Cronbach’s Alpha (alpha()) and Split-Half (splitHalf()) are two common ways of assessing reliability. These statistics are a function of the measure length and items’ interrelatedness, which is explained above with the correlation matrix.

The overall reliability may change if items are dropped.

#Estimate coefficient alpha
psych::alpha(gcbs)

## 
## Reliability analysis   
## Call: psych::alpha(x = gcbs)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N   ase mean sd median_r
##       0.93      0.93    0.94      0.48  14 0.002  2.9  1     0.47
## 
##  lower alpha upper     95% confidence boundaries
## 0.93 0.93 0.94 
## 
##  Reliability if an item is dropped:
##     raw_alpha std.alpha G6(smc) average_r S/N alpha se  var.r med.r
## Q1       0.93      0.93    0.94      0.48  13   0.0021 0.0105  0.46
## Q2       0.93      0.93    0.94      0.48  13   0.0021 0.0099  0.47
## Q3       0.93      0.93    0.94      0.49  13   0.0020 0.0084  0.48
## Q4       0.93      0.93    0.94      0.47  13   0.0022 0.0105  0.46
## Q5       0.93      0.93    0.94      0.48  13   0.0021 0.0112  0.48
## Q6       0.93      0.93    0.94      0.48  13   0.0021 0.0104  0.46
## Q7       0.93      0.93    0.94      0.48  13   0.0021 0.0098  0.47
## Q8       0.93      0.93    0.94      0.48  13   0.0020 0.0086  0.49
## Q9       0.93      0.93    0.94      0.48  13   0.0021 0.0108  0.46
## Q10      0.93      0.93    0.94      0.49  14   0.0020 0.0102  0.49
## Q11      0.93      0.93    0.94      0.48  13   0.0021 0.0104  0.46
## Q12      0.93      0.93    0.94      0.47  13   0.0022 0.0093  0.46
## Q13      0.93      0.93    0.94      0.48  13   0.0021 0.0092  0.48
## Q14      0.93      0.93    0.94      0.48  13   0.0021 0.0109  0.46
## Q15      0.93      0.93    0.94      0.49  14   0.0020 0.0095  0.49
## 
##  Item statistics 
##        n raw.r std.r r.cor r.drop mean  sd
## Q1  2495  0.73  0.73  0.70   0.68  3.5 1.5
## Q2  2495  0.74  0.74  0.72   0.69  3.0 1.5
## Q3  2495  0.68  0.67  0.66   0.62  2.0 1.4
## Q4  2495  0.78  0.78  0.76   0.74  2.6 1.5
## Q5  2495  0.70  0.70  0.67   0.65  3.3 1.5
## Q6  2495  0.76  0.76  0.74   0.72  3.1 1.5
## Q7  2495  0.75  0.75  0.73   0.70  2.7 1.5
## Q8  2495  0.69  0.69  0.68   0.63  2.5 1.6
## Q9  2495  0.72  0.72  0.69   0.67  2.2 1.4
## Q10 2495  0.61  0.61  0.57   0.55  3.5 1.4
## Q11 2495  0.74  0.74  0.72   0.69  3.3 1.4
## Q12 2495  0.79  0.79  0.79   0.75  2.6 1.5
## Q13 2495  0.71  0.71  0.70   0.66  2.1 1.4
## Q14 2495  0.76  0.76  0.74   0.71  3.0 1.5
## Q15 2495  0.60  0.62  0.58   0.56  4.2 1.1
## 
## Non missing response frequency for each item
##        0    1    2    3    4    5 miss
## Q1  0.00 0.16 0.12 0.12 0.27 0.32    0
## Q2  0.01 0.23 0.19 0.16 0.20 0.22    0
## Q3  0.00 0.55 0.13 0.12 0.10 0.10    0
## Q4  0.00 0.32 0.18 0.15 0.20 0.14    0
## Q5  0.00 0.19 0.14 0.13 0.28 0.26    0
## Q6  0.00 0.23 0.15 0.15 0.23 0.24    0
## Q7  0.00 0.33 0.19 0.13 0.18 0.17    0
## Q8  0.00 0.44 0.12 0.14 0.12 0.18    0
## Q9  0.00 0.45 0.19 0.12 0.12 0.11    0
## Q10 0.00 0.14 0.12 0.14 0.30 0.30    0
## Q11 0.00 0.16 0.14 0.19 0.27 0.24    0
## Q12 0.00 0.34 0.18 0.15 0.17 0.17    0
## Q13 0.01 0.51 0.15 0.15 0.10 0.09    0
## Q14 0.00 0.25 0.17 0.15 0.22 0.20    0
## Q15 0.00 0.05 0.05 0.08 0.27 0.55    0

#Estimate split half coefficient
splitHalf(gcbs)

## Split half reliabilities  
## Call: splitHalf(r = gcbs)
## 
## Maximum split half reliability (lambda 4) =  0.95
## Guttman lambda 6                          =  0.94
## Average split half reliability            =  0.93
## Guttman lambda 3 (alpha)                  =  0.93
## Guttman lambda 2                          =  0.93
## Minimum split half reliability  (beta)    =  0.86
## Average interitem r =  0.48  with median =  0.47