Exploratory Factor Analysis in R

Data Overview

Fetching data from the server

We first need to fetch the sample data from the server. This is a raw data set, so each row represents one person's survey responses.
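
The functions used below come from a handful of packages: describe(), KMO(), cortest.bartlett(), fa(), fa.parallel(), and fa.diagram() from psych; corrplot() from corrplot; vif() from car; and paged_table() from rmarkdown (factanal() and lm() are part of base R). A minimal setup sketch, assuming these packages are already installed:

# Packages used throughout this walkthrough
library(psych)      # describe, KMO, cortest.bartlett, fa, fa.parallel, fa.diagram
library(corrplot)   # corrplot
library(car)        # vif
library(rmarkdown)  # paged_table
# ggplot2 is loaded later, just before the scree plot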

# Dataset
url <- "https://raw.githubusercontent.com/housecricket/data/main/efa/sample1.csv"
data_survey <- read.csv(url, sep = ",")
paged_table(data_survey)

Describing the data

We should also describe the dataset before running any analysis.

paged_table(describe(data_survey))

We use the dim function to retrieve the dimensions of the dataset.

dim(data_survey)
[1] 259  14

Cleaning data

Our data frame has an ID variable in the first column, so we use -1 in the column index to drop it and save the result to a new object.

dat <- data_survey[ , -1] 
paged_table(head(dat))

Correlation matrix

We should also take a look at the correlations among our variables to determine whether factor analysis is appropriate. Note that we exclude the 13th column, the outcome variable QD, from the correlation matrix.

datamatrix <- cor(dat[, -13])
corrplot(datamatrix, method = "number")

The Factorability of the Data

Before testing factorability, we separate the 12 survey items (X) from the outcome variable in the 13th column (Y).

X <- dat[, -13]  # the 12 survey items
Y <- dat[, 13]   # the outcome variable

KMO

The Kaiser-Meyer-Olkin (KMO) statistic measures sampling adequacy and gives a more formal indication of factorability than inspecting the correlation matrix by eye.

KMO(r=cor(X))
Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = cor(X))
Overall MSA =  0.83
MSA for each item = 
 KM1  KM2  KM3  QC1  QC2  QC3  CT1  CT2  CT3  PC1  PC2  PC3 
0.87 0.84 0.82 0.88 0.86 0.86 0.83 0.85 0.85 0.70 0.78 0.80 

According to Kaiser’s (1974) guidelines, a suggested cutoff for determining the factorability of the sample data is KMO ≥ .60. The overall KMO here is 0.83, indicating that, based on this test, we can probably conduct a factor analysis.
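
As a quick programmatic check, we can flag any items whose individual MSA falls below the .60 cutoff (a minimal sketch; it relies on the KMO object storing the overall measure in $MSA and the per-item measures in $MSAi):

kmo_res <- KMO(cor(X))
# Items with an individual MSA below Kaiser's suggested .60 cutoff (none in this data)
names(kmo_res$MSAi[kmo_res$MSAi < 0.60])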

Bartlett’s Test of Sphericity

Bartlett’s test checks whether the correlation matrix differs significantly from an identity matrix, i.e., whether the variables are correlated enough to justify a factor analysis.

cortest.bartlett(X)
R was not square, finding R from data
$chisq
[1] 1595.75

$p.value
[1] 8.846246e-290

$df
[1] 66

The significance level is vanishingly small (8.84e-290 < 0.05), which indicates that a factor analysis may be useful with our data.

det(cor(X))
[1] 0.00183051

The determinant is positive, which means the factor analysis will probably run; it is also well above the commonly cited lower bound of 0.00001, so severe multicollinearity is not a concern.
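
As a quick sanity check (a minimal sketch; the 0.00001 threshold is the usual rule of thumb, not a hard rule):

# TRUE if the determinant exceeds the rule-of-thumb lower bound of 0.00001
det(cor(X)) > 0.00001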

The Number of Factors to Extract

Scree Plot

library(ggplot2)
fafitfree <- fa(X, nfactors = ncol(X), rotate = "none")  # unrotated solution on the 12 survey items
n_factors <- length(fafitfree$e.values)
scree     <- data.frame(
  Factor_n =  as.factor(1:n_factors), 
  Eigenvalue = fafitfree$e.values)
ggplot(scree, aes(x = Factor_n, y = Eigenvalue, group = 1)) + 
  geom_point() + geom_line() +
  xlab("Number of factors") +
  ylab("Initial eigenvalue") +
  labs( title = "Scree Plot", 
        subtitle = "(Based on the unreduced correlation matrix)")

Parallel Analysis

We can use the fa.parallel() function from the psych package to perform a parallel analysis. (The nFactors package (Raiche & Magis, 2020) provides a similar parallel() function.)

parallel <- fa.parallel(X)

Parallel analysis suggests that the number of factors =  4  and the number of components =  3 
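
For reference, a roughly equivalent analysis with the nFactors package (Raiche & Magis, 2020) might look like the sketch below, assuming the standard parallel()/nScree() workflow from that package's documentation:

library(nFactors)
ev <- eigen(cor(X))                               # observed eigenvalues
ap <- parallel(subject = nrow(X), var = ncol(X),  # eigenvalues simulated from random data
               rep = 100, cent = 0.05)
nS <- nScree(x = ev$values, aparallel = ap$eigen$qevpea)
plotnScree(nS)                                    # scree plot with parallel-analysis criteria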

Conducting the Factor Analysis

Factor analysis using the fa method

fa.none <- fa(r = X, 
 nfactors = 4, 
 # covar = FALSE, SMC = TRUE,
 fm = "pa",          # factoring method: "pa" is principal axis factoring
 max.iter = 100,     # maximum number of iterations (the default is 50)
 rotate = "varimax") # varimax rotation
print(fa.none)
Factor Analysis using method =  pa
Call: fa(r = X, nfactors = 4, rotate = "varimax", max.iter = 100, fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
     PA1   PA3  PA2   PA4   h2   u2 com
KM1 0.80  0.13 0.15  0.24 0.73 0.27 1.3
KM2 0.82  0.22 0.08  0.20 0.76 0.24 1.3
KM3 0.88  0.21 0.17  0.18 0.88 0.12 1.3
QC1 0.31  0.06 0.16  0.56 0.43 0.57 1.8
QC2 0.23  0.34 0.01  0.74 0.71 0.29 1.6
QC3 0.13  0.35 0.13  0.66 0.59 0.41 1.7
CT1 0.15  0.72 0.08  0.14 0.56 0.44 1.2
CT2 0.19  0.75 0.10  0.21 0.65 0.35 1.3
CT3 0.15  0.69 0.15  0.25 0.59 0.41 1.5
PC1 0.18 -0.03 0.82  0.10 0.71 0.29 1.1
PC2 0.13  0.11 0.78  0.22 0.68 0.32 1.3
PC3 0.04  0.23 0.67 -0.03 0.51 0.49 1.2

                       PA1  PA3  PA2  PA4
SS loadings           2.38 1.97 1.86 1.60
Proportion Var        0.20 0.16 0.16 0.13
Cumulative Var        0.20 0.36 0.52 0.65
Proportion Explained  0.30 0.25 0.24 0.20
Cumulative Proportion 0.30 0.56 0.80 1.00

Mean item complexity =  1.4
Test of the hypothesis that 4 factors are sufficient.

df null model =  66  with the objective function =  6.3 with Chi Square =  1595.75
df of  the model are 24  and the objective function was  0.15 

The root mean square of the residuals (RMSR) is  0.02 
The df corrected root mean square of the residuals is  0.03 

The harmonic n.obs is  259 with the empirical chi square  10.85  with prob <  0.99 
The total n.obs was  259  with Likelihood Chi Square =  38.29  with prob <  0.032 

Tucker Lewis Index of factoring reliability =  0.974
RMSEA index =  0.048  and the 90 % confidence intervals are  0.014 0.075
BIC =  -95.07
Fit based upon off diagonal values = 1
Measures of factor score adequacy             
                                                   PA1  PA3  PA2  PA4
Correlation of (regression) scores with factors   0.95 0.87 0.91 0.84
Multiple R square of scores with factors          0.89 0.76 0.83 0.71
Minimum correlation of possible factor scores     0.79 0.53 0.65 0.43
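
To make the structure easier to read, we can reprint the loadings with small values suppressed (a minimal sketch using the standard print method for loadings; the 0.3 cutoff is just a common convention):

# Hide loadings below 0.3 and sort items by factor to highlight the simple structure
print(fa.none$loadings, cutoff = 0.3, sort = TRUE)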

Factor analysis using the factanal method

factanal.none <- factanal(X, factors=4, scores = c("regression"), rotation = "varimax")
print(factanal.none)

Call:
factanal(x = X, factors = 4, scores = c("regression"), rotation = "varimax")

Uniquenesses:
  KM1   KM2   KM3   QC1   QC2   QC3   CT1   CT2   CT3   PC1   PC2   PC3 
0.272 0.238 0.117 0.577 0.278 0.403 0.451 0.355 0.405 0.269 0.332 0.497 

Loadings:
    Factor1 Factor2 Factor3 Factor4
KM1  0.794   0.132   0.152   0.238 
KM2  0.820   0.212           0.197 
KM3  0.885   0.209   0.165   0.169 
QC1  0.321           0.162   0.538 
QC2  0.232   0.335           0.745 
QC3  0.125   0.338   0.138   0.669 
CT1  0.146   0.704           0.155 
CT2  0.189   0.747           0.205 
CT3  0.149   0.703   0.140   0.242 
PC1  0.179           0.829   0.104 
PC2  0.132   0.121   0.770   0.206 
PC3          0.234   0.668         

               Factor1 Factor2 Factor3 Factor4
SS loadings      2.388   1.958   1.864   1.595
Proportion Var   0.199   0.163   0.155   0.133
Cumulative Var   0.199   0.362   0.517   0.650

Test of the hypothesis that 4 factors are sufficient.
The chi square statistic is 37.66 on 24 degrees of freedom.
The p-value is 0.0376 

Graphing the factor loading matrix

fa.diagram(fa.none)

Regression analysis

Scores for all the rows

head(fa.none$scores)
             PA1        PA3         PA2         PA4
[1,]  2.01452365 -1.9781194 -0.47865733 -0.99954894
[2,] -0.38088283  0.7697328 -1.27812282  0.71255859
[3,] -0.88620583 -0.3131598  0.69972250 -1.42078948
[4,] -0.02336401  0.5722214 -1.64788160 -0.02484005
[5,]  0.45117677  0.5996335  0.01897338 -0.59929794
[6,] -2.36524692  1.6224651  0.50346364  0.30393048

Labeling the data

regdata <- cbind(dat["QD"], fa.none$scores)
# Labeling the data
names(regdata) <- c("QD", "F1", "F2", "F3", "F4")
paged_table(head(regdata))

Splitting the data into train and test sets

# Splitting the data 70:30
# Set the random seed for reproducibility
set.seed(100)
indices <- sample(1:nrow(regdata), 0.7 * nrow(regdata))
train <- regdata[indices, ]
test  <- regdata[-indices, ]

Regression model using the training data

model.fa.score <- lm(QD ~ ., data = train)
summary(model.fa.score)

Call:
lm(formula = QD ~ ., data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2041 -0.2225 -0.0894  0.2571  1.4851 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.55854    0.02932 121.387  < 2e-16 ***
F1           0.72846    0.03028  24.058  < 2e-16 ***
F2           0.28709    0.03382   8.489 8.43e-15 ***
F3           0.07357    0.03061   2.403   0.0173 *  
F4           0.62655    0.03494  17.932  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3935 on 176 degrees of freedom
Multiple R-squared:  0.8703,    Adjusted R-squared:  0.8674 
F-statistic: 295.4 on 4 and 176 DF,  p-value: < 2.2e-16

Checking VIF

We check multicollinearity among the factor scores with the vif() function from the car package; values close to 1 indicate that the varimax-rotated factor scores are essentially uncorrelated.

vif(model.fa.score)
      F1       F2       F3       F4 
1.009786 1.056645 1.001647 1.063940 

Checking model predictions on the test dataset

# Predict QD for the held-out test set
pred_test <- predict(model.fa.score, newdata = test, type = "response")
pred_test
       3        6        9       17       19       23       28       29 
1.984360 2.528829 4.143840 3.323313 3.476094 3.246743 4.366091 5.200139 
      33       34       36       38       54       56       57       59 
1.462812 3.579119 2.046164 4.822811 3.260452 3.703757 2.226866 3.281367 
      61       63       66       67       83       86       89       90 
3.279521 3.994264 3.638951 3.638951 4.204144 4.071425 2.381443 3.260452 
      96       99      101      102      104      109      113      114 
3.200268 4.123159 3.105286 3.776780 4.300589 2.672036 5.145143 4.049866 
     117      120      122      125      128      129      134      136 
3.161295 3.125512 3.071635 3.718316 3.257887 3.161295 3.260452 1.267189 
     138      139      141      144      145      149      161      162 
3.260452 3.149997 3.161576 4.073711 3.554128 4.252041 4.559893 4.368536 
     166      168      173      176      177      178      181      184 
4.111750 3.768509 3.260452 3.685362 2.249302 3.371394 4.354512 4.732477 
     186      190      195      201      204      208      209      217 
2.260460 3.735488 1.739988 3.198181 5.089403 3.260452 3.836492 4.001907 
     220      221      223      233      240      241      244      245 
4.237114 3.298169 5.089403 4.444631 3.323962 5.187997 3.652735 2.992014 
     248      250      252      255      257      258 
2.719640 3.625425 3.149997 2.984902 4.771091 2.630307 
test$QD_Predicted <- pred_test
paged_table(head(test[c("QD","QD_Predicted")], 10))
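
Finally, a minimal sketch of simple test-set performance metrics, computed with base R (the object names rmse_test and r2_test are just illustrative):

# Test-set RMSE and R-squared for the factor-score regression
rmse_test <- sqrt(mean((test$QD - pred_test)^2))
r2_test   <- 1 - sum((test$QD - pred_test)^2) / sum((test$QD - mean(test$QD))^2)
c(RMSE = rmse_test, R2 = r2_test)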