Exploratory Factor Analysis is a statistical technique in social science to explain the variance between several measured variables as a smaller set of latent variables. EFA is often used to consolidate survey data by revealing the groupings (factors) that underly individual questions.
An EFA provides information on each item’s relationship to a single factor that is hypothesized to be represented by each of the items. EFA results give you basic information about how well items relate to that hypothesized construct.
The common application of EFA is to investigate relationships between observed variabls and latent variables (factor) such as measurement piloting.
We will ve using psych
package as lavaan
does not explicitly support EFA.
setwd("D:/Class Materials & Work/Summer 2020 practice/SEM/EFA")
library(psych)
library(tidyverse)
library(semPlot)
library(ggcorrplot)
gcbs <- readRDS("GCBS_data.rds")
In an actual implementation, be sure to use a dataset that only contains item responses - other types of data will cause errors and/or incorrect results.
In the gcbs dataset, these are examinees’ responses to 15 items from the Generic Conspiracist Beliefs Scale, which is designed to measure conspiracist beliefs.
# Conduct a single-factor EFA
EFA_model <- fa(gcbs)
# View the results
EFA_model
## Factor Analysis using method = minres
## Call: fa(r = gcbs)
## Standardized loadings (pattern matrix) based upon correlation matrix
## MR1 h2 u2 com
## Q1 0.70 0.49 0.51 1
## Q2 0.72 0.52 0.48 1
## Q3 0.64 0.41 0.59 1
## Q4 0.77 0.59 0.41 1
## Q5 0.67 0.45 0.55 1
## Q6 0.75 0.56 0.44 1
## Q7 0.73 0.54 0.46 1
## Q8 0.65 0.43 0.57 1
## Q9 0.70 0.48 0.52 1
## Q10 0.56 0.32 0.68 1
## Q11 0.72 0.52 0.48 1
## Q12 0.79 0.62 0.38 1
## Q13 0.68 0.46 0.54 1
## Q14 0.74 0.55 0.45 1
## Q15 0.57 0.33 0.67 1
##
## MR1
## SS loadings 7.27
## Proportion Var 0.48
##
## Mean item complexity = 1
## Test of the hypothesis that 1 factor is sufficient.
##
## The degrees of freedom for the null model are 105 and the objective function was 9.31 with Chi Square of 23173.8
## The degrees of freedom for the model are 90 and the objective function was 1.93
##
## The root mean square of the residuals (RMSR) is 0.08
## The df corrected root mean square of the residuals is 0.09
##
## The harmonic number of observations is 2495 with the empirical chi square 3398.99 with prob < 0
## The total number of observations was 2495 with Likelihood Chi Square = 4809.34 with prob < 0
##
## Tucker Lewis Index of factoring reliability = 0.761
## RMSEA index = 0.145 and the 90 % confidence intervals are 0.142 0.149
## BIC = 4105.36
## Fit based upon off diagonal values = 0.97
## Measures of factor score adequacy
## MR1
## Correlation of (regression) scores with factors 0.97
## Multiple R square of scores with factors 0.94
## Minimum correlation of possible factor scores 0.87
Each fa()
result is a list that contains specific information about the analysis. Factor loadings represent the strength and directionality of the relationship between each item and the underlying factor, ranging from from -1 to 1.
The fa.diagram()
function takes a result object from fa()
and creates a path diagram showing the items’ loadings ordered from strongest to weakest. Path diagrams are more common for the full SEM than for factor analysis, but this type of visualization can be a helpful way to represent your results.
# View the factor loadings
EFA_model$loadings
##
## Loadings:
## MR1
## Q1 0.703
## Q2 0.719
## Q3 0.638
## Q4 0.770
## Q5 0.672
## Q6 0.746
## Q7 0.734
## Q8 0.654
## Q9 0.695
## Q10 0.565
## Q11 0.719
## Q12 0.786
## Q13 0.679
## Q14 0.743
## Q15 0.574
##
## MR1
## SS loadings 7.267
## Proportion Var 0.484
# Create a path diagram of the items' factor loadings
fa.diagram(EFA_model)
The EFA_model
also contains factor scores for each person, which indicate how much or how little of the factor (or the construct) each person possesses. Factor scores are not computed for examinees with missing data.
#Drawing out the total scores for the first six respondents.
rowSums(gcbs[1:6, ])
## 1 2 3 4 5 6
## 68 65 37 55 59 15
#Drawing out the first few rows of factor scores.
head(EFA_model$scores)
## MR1
## [1,] 1.5614675
## [2,] 1.3432026
## [3,] -0.3960355
## [4,] 0.7478868
## [5,] 1.0435203
## [6,] -1.7290812
#See the distribution of factor scores
describe(EFA_model$scores)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 2495 0 0.97 0 -0.02 1.12 -1.85 1.95 3.8 0.09 -0.91 0.02
Next, we can visualize the estimated factor scores with plot()
and density()
.
plot(density(EFA_model$scores, na.rm = TRUE),
main = "Factor Scores")
describe(gcbs)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## Q1 1 2495 3.47 1.46 4 3.59 1.48 0 5 5 -0.55 -1.10 0.03
## Q2 2 2495 2.96 1.49 3 2.96 1.48 0 5 5 -0.01 -1.40 0.03
## Q3 3 2495 2.05 1.39 1 1.82 0.00 0 5 5 0.98 -0.44 0.03
## Q4 4 2495 2.64 1.45 2 2.55 1.48 0 5 5 0.26 -1.34 0.03
## Q5 5 2495 3.25 1.47 4 3.32 1.48 0 5 5 -0.35 -1.27 0.03
## Q6 6 2495 3.11 1.51 3 3.14 1.48 0 5 5 -0.17 -1.42 0.03
## Q7 7 2495 2.67 1.51 2 2.59 1.48 0 5 5 0.28 -1.39 0.03
## Q8 8 2495 2.45 1.57 2 2.32 1.48 0 5 5 0.51 -1.30 0.03
## Q9 9 2495 2.23 1.42 2 2.05 1.48 0 5 5 0.76 -0.82 0.03
## Q10 10 2495 3.50 1.39 4 3.63 1.48 1 5 4 -0.59 -0.94 0.03
## Q11 11 2495 3.27 1.40 4 3.34 1.48 0 5 5 -0.35 -1.11 0.03
## Q12 12 2495 2.64 1.50 2 2.56 1.48 0 5 5 0.29 -1.37 0.03
## Q13 13 2495 2.10 1.38 1 1.89 0.00 0 5 5 0.89 -0.56 0.03
## Q14 14 2495 2.96 1.49 3 2.95 1.48 0 5 5 -0.02 -1.43 0.03
## Q15 15 2495 4.23 1.10 5 4.47 0.00 0 5 5 -1.56 1.71 0.02
#mean scores and confidence intervals for each variable in the dataset.
error.dots(gcbs)
#bar chart of the mean scores and confidence intervals for each variable in the dataset.
error.bars(gcbs)
During the measure development process, it’s important to conduct EFA and CFA on separate datasets to avoid model overfit. Instead, you can split your dataset in half, then use one half for the EFA and the other half for the CFA.
N <- nrow(gcbs)
indices <- seq(1, N)
indices_EFA <- sample(indices, floor((.5*N))) #rounded down
indices_CFA <- indices[!(indices %in% indices_EFA)]
gcbs_EFA <- gcbs[indices_EFA, ]
gcbs_CFA <- gcbs[indices_CFA, ]
Just as you inspected the features of your full dataset, it’s important to examine the halves after you’ve split the data.
In this exercise, you’ll use the indices created when splitting the dataset to create a grouping variable and attach it to the gcbs dataset. Once that grouping variable is set up, you can use describeBy()
and statsBy()
to view basic descriptive statistics as well as between-group statistics.
Be noted that while the group argument of describeBy()
has to be a vector, the group argument of statsBy()
has to be the name of a column in your dataframe.
#Use the indices that we created to create a grouping variable
group_var <- vector("numeric", nrow(gcbs))
group_var[indices_EFA] <- 1
group_var[indices_CFA] <- 2
#Bind that grouping variable onto the gcbs dataset
gcbs_grouped <- cbind(gcbs, group_var)
#Compare stats across groups
describeBy(gcbs_grouped, group = group_var)
##
## Descriptive statistics by group
## group: 1
## vars n mean sd median trimmed mad min max range skew kurtosis
## Q1 1 1247 3.47 1.47 4 3.58 1.48 0 5 5 -0.53 -1.15
## Q2 2 1247 2.98 1.49 3 2.98 1.48 0 5 5 -0.04 -1.39
## Q3 3 1247 2.05 1.39 1 1.83 0.00 0 5 5 0.97 -0.47
## Q4 4 1247 2.66 1.45 2 2.58 1.48 1 5 4 0.26 -1.36
## Q5 5 1247 3.30 1.46 4 3.38 1.48 0 5 5 -0.41 -1.19
## Q6 6 1247 3.06 1.53 3 3.08 1.48 0 5 5 -0.11 -1.45
## Q7 7 1247 2.72 1.52 2 2.66 1.48 0 5 5 0.22 -1.43
## Q8 8 1247 2.45 1.58 2 2.32 1.48 0 5 5 0.52 -1.30
## Q9 9 1247 2.27 1.44 2 2.10 1.48 0 5 5 0.72 -0.91
## Q10 10 1247 3.51 1.39 4 3.64 1.48 1 5 4 -0.60 -0.93
## Q11 11 1247 3.28 1.38 4 3.36 1.48 0 5 5 -0.39 -1.06
## Q12 12 1247 2.67 1.51 2 2.59 1.48 0 5 5 0.28 -1.36
## Q13 13 1247 2.11 1.40 1 1.89 0.00 0 5 5 0.90 -0.56
## Q14 14 1247 3.00 1.50 3 3.00 1.48 0 5 5 -0.08 -1.44
## Q15 15 1247 4.19 1.14 5 4.43 0.00 0 5 5 -1.48 1.36
## group_var 16 1247 1.00 0.00 1 1.00 0.00 1 1 0 NaN NaN
## se
## Q1 0.04
## Q2 0.04
## Q3 0.04
## Q4 0.04
## Q5 0.04
## Q6 0.04
## Q7 0.04
## Q8 0.04
## Q9 0.04
## Q10 0.04
## Q11 0.04
## Q12 0.04
## Q13 0.04
## Q14 0.04
## Q15 0.03
## group_var 0.00
## ------------------------------------------------------------
## group: 2
## vars n mean sd median trimmed mad min max range skew kurtosis
## Q1 1 1248 3.48 1.45 4 3.60 1.48 0 5 5 -0.56 -1.06
## Q2 2 1248 2.94 1.50 3 2.94 1.48 0 5 5 0.02 -1.41
## Q3 3 1248 2.04 1.38 1 1.81 0.00 0 5 5 0.99 -0.43
## Q4 4 1248 2.61 1.45 2 2.52 1.48 0 5 5 0.27 -1.32
## Q5 5 1248 3.21 1.48 4 3.26 1.48 0 5 5 -0.29 -1.33
## Q6 6 1248 3.15 1.48 3 3.19 1.48 0 5 5 -0.22 -1.38
## Q7 7 1248 2.61 1.50 2 2.52 1.48 0 5 5 0.35 -1.34
## Q8 8 1248 2.45 1.56 2 2.32 1.48 0 5 5 0.51 -1.30
## Q9 9 1248 2.19 1.40 2 2.00 1.48 0 5 5 0.79 -0.73
## Q10 10 1248 3.49 1.38 4 3.62 1.48 1 5 4 -0.57 -0.96
## Q11 11 1248 3.25 1.42 3 3.32 1.48 0 5 5 -0.31 -1.16
## Q12 12 1248 2.62 1.50 2 2.53 1.48 0 5 5 0.30 -1.38
## Q13 13 1248 2.10 1.36 1 1.90 0.00 0 5 5 0.87 -0.58
## Q14 14 1248 2.91 1.48 3 2.89 1.48 0 5 5 0.04 -1.41
## Q15 15 1248 4.26 1.07 5 4.50 0.00 1 5 4 -1.64 2.09
## group_var 16 1248 2.00 0.00 2 2.00 0.00 2 2 0 NaN NaN
## se
## Q1 0.04
## Q2 0.04
## Q3 0.04
## Q4 0.04
## Q5 0.04
## Q6 0.04
## Q7 0.04
## Q8 0.04
## Q9 0.04
## Q10 0.04
## Q11 0.04
## Q12 0.04
## Q13 0.04
## Q14 0.04
## Q15 0.03
## group_var 0.00
statsBy(gcbs_grouped, group = "group_var")
## Statistics within and between groups
## Call: statsBy(data = gcbs_grouped, group = "group_var")
## Intraclass Correlation 1 (Percentage of variance due to groups)
## Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8
## 0 0 0 0 0 0 0 0
## Q9 Q10 Q11 Q12 Q13 Q14 Q15 group_var
## 0 0 0 0 0 0 0 1
## Intraclass Correlation 2 (Reliability of group differences)
## Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8
## -24.06 -1.53 -15.53 -0.48 0.60 0.57 0.71 -85.07
## Q9 Q10 Q11 Q12 Q13 Q14 Q15 group_var
## 0.47 -11.66 -2.02 -0.31 -93.55 0.57 0.61 1.00
## eta^2 between groups
## Q1.bg Q2.bg Q3.bg Q4.bg Q5.bg Q6.bg Q7.bg Q8.bg Q9.bg Q10.bg Q11.bg
## 0 0 0 0 0 0 0 0 0 0 0
## Q12.bg Q13.bg Q14.bg Q15.bg
## 0 0 0 0
##
## To see the correlations between and within groups, use the short=FALSE option in your print statement.
## Many results are not shown directly. To see specific objects select from the following list:
## mean sd n F ICC1 ICC2 ci1 ci2 raw rbg pbg rwg nw ci.wg pwg etabg etawg nwg nG Call
To get a feel for your dataset, we will examine the relationships of your variables with each other. The lowerCor()
function is useful to generate a correlation matrix, with corr.test
to generate a correlation matrix of p-value.
lowerCor(gcbs, use = "pairwise.complete.obs")
## Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11
## Q1 1.00
## Q2 0.53 1.00
## Q3 0.36 0.40 1.00
## Q4 0.52 0.53 0.50 1.00
## Q5 0.48 0.46 0.40 0.57 1.00
## Q6 0.63 0.55 0.40 0.61 0.50 1.00
## Q7 0.47 0.67 0.42 0.57 0.45 0.54 1.00
## Q8 0.39 0.38 0.78 0.49 0.41 0.41 0.41 1.00
## Q9 0.42 0.49 0.49 0.56 0.46 0.48 0.53 0.48 1.00
## Q10 0.44 0.38 0.32 0.40 0.43 0.41 0.39 0.36 0.37 1.00
## Q11 0.64 0.52 0.34 0.52 0.49 0.62 0.49 0.37 0.46 0.45 1.00
## Q12 0.52 0.72 0.44 0.60 0.49 0.59 0.75 0.42 0.57 0.40 0.55
## Q13 0.38 0.40 0.71 0.51 0.43 0.42 0.45 0.76 0.54 0.37 0.40
## Q14 0.53 0.50 0.43 0.60 0.54 0.55 0.52 0.45 0.55 0.41 0.56
## Q15 0.51 0.40 0.27 0.39 0.45 0.47 0.39 0.31 0.32 0.45 0.54
## Q12 Q13 Q14 Q15
## Q12 1.00
## Q13 0.49 1.00
## Q14 0.56 0.50 1.00
## Q15 0.41 0.30 0.46 1.00
However, the function ggcorrplot()
provides a more effective graphical insight with correlation coefficient and insignificance marking.
variables.to.use <- colnames(gcbs[1:15])
gcbs.corr<-cor(gcbs[variables.to.use],
method = "pearson",
use='pairwise.complete.obs')
ggcorrplot(gcbs.corr,
p.mat=cor_pmat(gcbs[variables.to.use]),
hc.order=T, #Group similar coefficient together
type='lower',
color=c('red3', 'white', 'green3'),
outline.color = 'darkgoldenrod1',
lab=T,
lab_size=3, #Size of the correlation coefficient
legend.title='Correlation',
pch=4,
pch.cex=12)+
labs(title="gcbs correlation")+
theme(plot.title=element_text(face='bold',size=14,hjust=0.5,colour="darkred"))+
theme(legend.position=c(0.10,0.80), legend.box.just = "bottom")
Internal reliability shows how well our test items relate to each other within the measurement. Cronbach’s Alpha (alpha()
) and Split-Half (splitHalf()
) are two common ways of assessing reliability. These statistics are a function of the measure length and items’ interrelatedness, which is explained above with the correlation matrix.
The overall reliability may change if items are dropped.
#Estimate coefficient alpha
psych::alpha(gcbs)
##
## Reliability analysis
## Call: psych::alpha(x = gcbs)
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.93 0.93 0.94 0.48 14 0.002 2.9 1 0.47
##
## lower alpha upper 95% confidence boundaries
## 0.93 0.93 0.94
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## Q1 0.93 0.93 0.94 0.48 13 0.0021 0.0105 0.46
## Q2 0.93 0.93 0.94 0.48 13 0.0021 0.0099 0.47
## Q3 0.93 0.93 0.94 0.49 13 0.0020 0.0084 0.48
## Q4 0.93 0.93 0.94 0.47 13 0.0022 0.0105 0.46
## Q5 0.93 0.93 0.94 0.48 13 0.0021 0.0112 0.48
## Q6 0.93 0.93 0.94 0.48 13 0.0021 0.0104 0.46
## Q7 0.93 0.93 0.94 0.48 13 0.0021 0.0098 0.47
## Q8 0.93 0.93 0.94 0.48 13 0.0020 0.0086 0.49
## Q9 0.93 0.93 0.94 0.48 13 0.0021 0.0108 0.46
## Q10 0.93 0.93 0.94 0.49 14 0.0020 0.0102 0.49
## Q11 0.93 0.93 0.94 0.48 13 0.0021 0.0104 0.46
## Q12 0.93 0.93 0.94 0.47 13 0.0022 0.0093 0.46
## Q13 0.93 0.93 0.94 0.48 13 0.0021 0.0092 0.48
## Q14 0.93 0.93 0.94 0.48 13 0.0021 0.0109 0.46
## Q15 0.93 0.93 0.94 0.49 14 0.0020 0.0095 0.49
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## Q1 2495 0.73 0.73 0.70 0.68 3.5 1.5
## Q2 2495 0.74 0.74 0.72 0.69 3.0 1.5
## Q3 2495 0.68 0.67 0.66 0.62 2.0 1.4
## Q4 2495 0.78 0.78 0.76 0.74 2.6 1.5
## Q5 2495 0.70 0.70 0.67 0.65 3.3 1.5
## Q6 2495 0.76 0.76 0.74 0.72 3.1 1.5
## Q7 2495 0.75 0.75 0.73 0.70 2.7 1.5
## Q8 2495 0.69 0.69 0.68 0.63 2.5 1.6
## Q9 2495 0.72 0.72 0.69 0.67 2.2 1.4
## Q10 2495 0.61 0.61 0.57 0.55 3.5 1.4
## Q11 2495 0.74 0.74 0.72 0.69 3.3 1.4
## Q12 2495 0.79 0.79 0.79 0.75 2.6 1.5
## Q13 2495 0.71 0.71 0.70 0.66 2.1 1.4
## Q14 2495 0.76 0.76 0.74 0.71 3.0 1.5
## Q15 2495 0.60 0.62 0.58 0.56 4.2 1.1
##
## Non missing response frequency for each item
## 0 1 2 3 4 5 miss
## Q1 0.00 0.16 0.12 0.12 0.27 0.32 0
## Q2 0.01 0.23 0.19 0.16 0.20 0.22 0
## Q3 0.00 0.55 0.13 0.12 0.10 0.10 0
## Q4 0.00 0.32 0.18 0.15 0.20 0.14 0
## Q5 0.00 0.19 0.14 0.13 0.28 0.26 0
## Q6 0.00 0.23 0.15 0.15 0.23 0.24 0
## Q7 0.00 0.33 0.19 0.13 0.18 0.17 0
## Q8 0.00 0.44 0.12 0.14 0.12 0.18 0
## Q9 0.00 0.45 0.19 0.12 0.12 0.11 0
## Q10 0.00 0.14 0.12 0.14 0.30 0.30 0
## Q11 0.00 0.16 0.14 0.19 0.27 0.24 0
## Q12 0.00 0.34 0.18 0.15 0.17 0.17 0
## Q13 0.01 0.51 0.15 0.15 0.10 0.09 0
## Q14 0.00 0.25 0.17 0.15 0.22 0.20 0
## Q15 0.00 0.05 0.05 0.08 0.27 0.55 0
#Estimate split half coefficient
splitHalf(gcbs)
## Split half reliabilities
## Call: splitHalf(r = gcbs)
##
## Maximum split half reliability (lambda 4) = 0.95
## Guttman lambda 6 = 0.94
## Average split half reliability = 0.93
## Guttman lambda 3 (alpha) = 0.93
## Guttman lambda 2 = 0.93
## Minimum split half reliability (beta) = 0.86
## Average interitem r = 0.48 with median = 0.47