In this e-book we will show you how to use R to perform an exploratory factor analysis (EFA).
We will discuss:
The data set used as an example, and how to read the data into R.
The basic tests to be used before running the factor analysis: KMO and Bartlett.
Preparatory steps, including producing the correlation matrix.
An overview of eigenvalues, and visualizing these eigenvalues in a screeplot, in order to decide on the number of factors.
Running a basic factor analysis, without rotation.
Using rotation of the retained factors, allowing easier interpretation and labeling of the factors.
Lastly, we will provide some links to supplementary resources in case you want to learn more about factor analysis. In addition, for those who want to use the results of a factor analysis in academic work, there are some links on how to write up the results.
To illustrate how to use factor analysis for revealing underlying patterns in the data, we have generated a data set on economic and social items which are assumed to explain firm competitiveness.
There are six social items (s1-s6), construed in such a way that two out of these six items seem to form a sub-factor.
Of the four economic items (e1-e4), one is poorly correlated with the other three.
The data can be read from our GitHub account using the code below.
efa_data <- read.csv("https://raw.githubusercontent.com/statmind/testdata/refs/heads/main/EFA_TESTDATA_CSV.csv")
library(DT)
datatable(efa_data)
A summary of the data will reveal that all items have a minimum value of 1, and a maximum of 6 or 7, suggesting that the items are scored by the (300) respondents on a 7-point answering scale.
Since the data is computer generated, we don’t have to deal with missing values or values outside the range of the scale.
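As a minimal sketch of the check described above, the helper below reports the minimum, maximum and number of missing values per item. The function name item_ranges and the tiny toy data frame are our own illustration, not part of the data set; with the real data you would call item_ranges(efa_data).

```r
# Hypothetical helper: per-item minimum, maximum and missing-value count
item_ranges <- function(df) {
  data.frame(
    item      = names(df),
    min       = vapply(df, min, numeric(1), na.rm = TRUE, USE.NAMES = FALSE),
    max       = vapply(df, max, numeric(1), na.rm = TRUE, USE.NAMES = FALSE),
    n_missing = vapply(df, function(x) sum(is.na(x)), numeric(1),
                       USE.NAMES = FALSE)
  )
}

# Toy data (made up for illustration): two items, one missing answer
toy_items <- data.frame(s1 = c(1, 4, 7), e1 = c(2, NA, 6))
ranges <- item_ranges(toy_items)
print(ranges)
```

Values outside the 1-7 range, or a non-zero n_missing, would be a signal to clean the data before going any further.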
It is common to first check whether it makes sense to perform an EFA. Frequently used tests are KMO and Bartlett’s test.
KMO (short for Kaiser-Meyer-Olkin test) evaluates if your variables have enough variance in common. Bartlett’s test verifies if the correlations between variables are statistically significant enough to proceed.
KMO theoretically ranges from 0 to 1. However, the KMO has a value of \(.5\) if your variables are uncorrelated, and values \(<.5\) are very rare. If \(KMO \leq .5\), then factor analysis makes no sense, and doing so is considered unacceptable.
That does not imply that a value of \(KMO>.50\) is acceptable! Most researchers agree that the KMO should be well above .60.
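To make the idea of "variance in common" a bit more tangible, here is a sketch of the overall KMO computed by hand in base R, following the textbook definition: the sum of squared correlations divided by the sum of squared correlations plus the sum of squared partial correlations. The function name kmo_overall and the toy matrix are our own; we have not checked the code line by line against the implementation in the psych package.

```r
# Hypothetical helper: overall KMO from a correlation matrix R
kmo_overall <- function(R) {
  R_inv <- solve(R)
  # Partial correlations between each pair, controlling for all other variables
  partial <- -R_inv / sqrt(outer(diag(R_inv), diag(R_inv)))
  off <- row(R) != col(R)            # off-diagonal elements only
  sum(R[off]^2) / (sum(R[off]^2) + sum(partial[off]^2))
}

# Toy example (made up): three variables, all correlated .5 with one another
toy_R <- matrix(0.5, nrow = 3, ncol = 3)
diag(toy_R) <- 1
toy_kmo <- kmo_overall(toy_R)
round(toy_kmo, 2)
```

The smaller the partial correlations are relative to the ordinary correlations, the closer the KMO gets to 1.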
Bartlett’s test checks if the correlation matrix is significantly different from an identity matrix, i.e., a matrix with ones on the diagonal and zero correlations between the variables. Since a close-to-identity matrix would imply a low KMO, we find Bartlett’s test kind of redundant, but, well, for the sake of academic completeness let’s just do it. Your hope is to have a large chi-square value with \(p<.05\).
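For readers who like to see what is under the hood, the sketch below computes Bartlett's test of sphericity by hand, using the standard chi-square approximation based on the determinant of the correlation matrix. The function name bartlett_sphericity and the toy values are our own assumptions for illustration.

```r
# Hypothetical helper: Bartlett's test of sphericity
# R = correlation matrix, n = sample size
bartlett_sphericity <- function(R, n) {
  p <- ncol(R)
  chisq <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
  df <- p * (p - 1) / 2
  list(chisq = chisq,
       df = df,
       p.value = pchisq(chisq, df, lower.tail = FALSE))
}

# Toy example (made up): three variables correlated .5, 100 respondents
toy_R <- matrix(0.5, nrow = 3, ncol = 3)
diag(toy_R) <- 1
toy_test <- bartlett_sphericity(toy_R, n = 100)
toy_test$df

# An identity matrix gives a chi-square of 0 and a p-value of 1
identity_test <- bartlett_sphericity(diag(13), n = 300)
identity_test$df
```

Note that the degrees of freedom depend only on the number of variables: with 13 items there are 13 * 12 / 2 = 78 distinct correlations.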
For more information on these tests, click here.
KMO and Bartlett’s test are not available in base R. The psych package has functions to perform these tests. Of course, you have to install that package if you haven’t already done so.
Both functions use the correlation matrix as input. We use the cor() function in R to create an object containing the correlation matrix.
If you want to share your data with others, for example to seek their help or feedback, then it is often enough to share the correlation matrix, along with other statistics (sample size; means and standard deviations of variables). Sharing the raw data may be sensitive, for all kinds of reasons - including the privacy of your respondents. And it is more efficient than sharing large files with hundreds of records!
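A sketch of why the correlation matrix is often enough: factanal() accepts a correlation (or covariance) matrix through its covmat argument, together with the sample size in n.obs. The simulated toy data below (two groups of three items) is made up purely for illustration.

```r
# Simulated toy data: items a1-a3 share one factor, b1-b3 another
set.seed(1)
n <- 300
f1 <- rnorm(n)
f2 <- rnorm(n)
toy <- data.frame(
  a1 = f1 + rnorm(n, sd = 0.6),
  a2 = f1 + rnorm(n, sd = 0.6),
  a3 = f1 + rnorm(n, sd = 0.6),
  b1 = f2 + rnorm(n, sd = 0.6),
  b2 = f2 + rnorm(n, sd = 0.6),
  b3 = f2 + rnorm(n, sd = 0.6)
)

# What you would share: the correlation matrix and the sample size
toy_cor <- cor(toy)

# Factor analysis on the raw data, and on the correlation matrix only
fit_raw <- factanal(x = toy, factors = 2)
fit_cor <- factanal(covmat = toy_cor, factors = 2, n.obs = n)

# Both routes should give the same uniquenesses (up to numerical precision)
round(cbind(raw = fit_raw$uniquenesses, cor = fit_cor$uniquenesses), 3)
```

The one thing the recipient cannot compute from a correlation matrix alone is factor scores for individual respondents, since those require the raw data.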
Let’s first compute the correlation matrix and store it as an object. We print part of the matrix, with a couple of social and economic items. Try to detect some patterns for yourself!
efa_mat <- cor(efa_data)
(round(efa_mat[3:9,3:9],2))
     s3   s4   s5   s6   e1   e2   e3
s3 1.00 0.76 0.28 0.29 0.07 0.14 0.13
s4 0.76 1.00 0.30 0.30 0.05 0.11 0.09
s5 0.28 0.30 1.00 0.65 0.07 0.11 0.09
s6 0.29 0.30 0.65 1.00 0.09 0.10 0.09
e1 0.07 0.05 0.07 0.09 1.00 0.14 0.16
e2 0.14 0.11 0.11 0.10 0.14 1.00 0.68
e3 0.13 0.09 0.09 0.09 0.16 0.68 1.00
Next, we compute KMO and Bartlett’s test using functions from psych. We compute the sample size that is needed for Bartlett’s test as the number of rows in the data set using the nrow() function.
# install.packages("psych")
library(psych)
KMO(efa_mat)
Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = efa_mat)
Overall MSA = 0.67
MSA for each item =
s1 s2 s3 s4 s5 s6 e1 e2 e3 e4 c1 c2 c3
0.83 0.73 0.78 0.82 0.73 0.70 0.93 0.64 0.63 0.63 0.44 0.68 0.69
size <- nrow(efa_data)
cortest.bartlett(efa_mat, size)
$chisq
[1] 2495.989
$p.value
[1] 0
$df
[1] 78
Our overall KMO is \(.67\), which is good enough but not great. We can inspect the KMO scores for the individual items, and conclude that item c1 has a low value (\(.44\)). Leaving out that item might boost the overall KMO, but we see little reason to do so at this stage of the analysis.
Let’s proceed to the next phase!
There is a rich body of literature on factor analysis, and a lot has been said about deciding on the number of factors, and extracting them.
An excellent overview of all the ins and outs of EFA can be found in Osborne (2014).
The idea behind it is fairly simple. Suppose we have only 2 variables (or items, apologies for using those terms interchangeably) in our data set. We can use a scatterplot to check the relation between these variables.
If the variables are correlated, then the pattern of the dots looks like an oval (rather than a circle). This oval has two dimensions, just like the original scatterplot. Both of these dimensions are linear combinations of the variables. The length of the oval (F1) captures most of the variance in the data, and its height (F2) accounts for the remainder. See the figure below.
We can expand this idea to any number of variables. The shapes become more complex, and impossible to visualize.
But mathematically it is still possible to compute factors in such a way that the first factor accounts for most of the variance; the second factor for most of the remaining variance; and so on.
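The two-variable case described above can be worked out in a few lines. For two standardized variables with correlation r, the two dimensions of the oval have variances 1 + r and 1 - r. The value r = 0.6 below is an arbitrary choice for illustration.

```r
# Two standardized variables with an (arbitrary) correlation of .6
r <- 0.6
R2 <- matrix(c(1, r,
               r, 1), nrow = 2, byrow = TRUE)

oval <- eigen(R2)
oval$values                        # length and height of the oval: 1 + r, 1 - r
share <- oval$values / sum(oval$values)
share                              # F1 captures 80%, F2 the remaining 20%
```

The stronger the correlation, the more elongated the oval, and the larger the share of variance captured by the first dimension.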
The number of factors is by definition the same as the number of variables. And if we standardize the variables (with means of 0, and a variance of 1), then the total variance is equal to the number of items.
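A quick way to convince yourself of this is to simulate some data, standardize it, and add up the variances. The simulated data below is made up; any data set with five numeric columns would do.

```r
# Simulated toy data: 200 rows, 5 correlated variables
set.seed(42)
raw <- matrix(rnorm(200 * 5), ncol = 5) %*% matrix(runif(25), nrow = 5)

# Standardize: every column gets a mean of 0 and a variance of 1
z <- scale(raw)
total_var <- sum(apply(z, 2, var))

# The eigenvalues of the correlation matrix add up to the same total
ev_sum <- sum(eigen(cor(raw))$values)

c(total_variance = total_var, sum_of_eigenvalues = ev_sum)
```

Both numbers equal 5, the number of variables, no matter how strongly the variables are correlated.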
That is helpful. We can now look at the variances explained by the factors, and use a criterion to cut off the number of factors to be retained. For example, we can look for the number of factors that jointly account for 70% or more of the variance. If we have n items, then the first k factors (\(k \ll n\)) account for a relatively large proportion of total variance.
For that reason, factor analysis is a method of data reduction: we group a large number of variables into a small number of factors.
The variance explained by a factor is called its eigenvalue.
Eigenvalue is not an everyday word; its origins lie in linear algebra and with smart German mathematicians.
A frequently used criterion for retaining factors is an \(eigenvalue>1\). This is intuitively appealing: if a factor explains less variance than a single standardized variable (which has a variance of 1), then the contribution of the factor is small. Our hope is that a small number of factors has eigenvalues well above 1. Our choice of the number of factors to be retained is facilitated by a screeplot of the eigenvalues.
Let’s compute the eigenvalues, and show the screeplot.
efa_eigen <- eigen(efa_mat)
efa_eigen$values
 [1] 5.18750496 2.45377054 1.24386417 0.95742076 0.76391591 0.58305339
[7] 0.36504001 0.34238582 0.32275816 0.29185930 0.23661001 0.19645193
[13] 0.05536503
tv <- sum(efa_eigen$values)
cat("The sum of eigenvalues =",tv)
The sum of eigenvalues = 13
cat("Cumulative % of variance explained by the factors:\n")
Cumulative % of variance explained by the factors:
round(100*cumsum(efa_eigen$values)/tv,2)
 [1] 39.90 58.78 68.35 75.71 81.59 86.07 88.88 91.52 94.00 96.24
[11] 98.06 99.57 100.00
The first factor accounts for 5.19 of the 13 units of variance (39.90%). We would need to retain 4 out of 13 factors to account for 75% of all variance. The first three factors have an \(eigenvalue>1\), suggesting retaining 3 factors.
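The two retention criteria can also be applied in code rather than by eye. The sketch below copies the eigenvalues from the output above into a vector; the helper n_for_share is our own name for illustration.

```r
# Eigenvalues as printed above
ev <- c(5.18750496, 2.45377054, 1.24386417, 0.95742076, 0.76391591,
        0.58305339, 0.36504001, 0.34238582, 0.32275816, 0.29185930,
        0.23661001, 0.19645193, 0.05536503)

# Criterion 1: number of factors with an eigenvalue above 1
n_kaiser <- sum(ev > 1)
n_kaiser                  # 3

# Criterion 2: smallest number of factors reaching a share of total variance
cum_share <- cumsum(ev) / sum(ev)
n_for_share <- function(share) which(cum_share >= share)[1]
n_for_share(0.70)         # 4
n_for_share(0.75)         # 4
```

The two criteria do not have to agree: here the eigenvalue rule points to three factors, while a 70% threshold requires four.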
We can visualize the eigenvalues in a screeplot.
plot(efa_eigen$values, type="b",ylab="Eigenvalues", xlab="Factor")
After three factors, the screeplot is leveling off. Additional factors contribute only marginally to the variance explained.
So far, we have made sure that the data lend themselves to a factor analysis. We have computed the correlation matrix, which we can visually inspect (are the correlations within the groups of economic and social items higher than between the groups? Do some items have lower correlations than expected?). Using common criteria, we have decided to retain three factors. These three out of 13 factors (less than one quarter) account for 68% of total variance.
We are now ready to do the factor analysis.
fa1 <- factanal(x=efa_data, factors=3, rotation='none'); fa1
Call:
factanal(x = efa_data, factors = 3, rotation = "none")
Uniquenesses:
s1 s2 s3 s4 s5 s6 e1 e2 e3 e4 c1 c2 c3
0.418 0.400 0.327 0.329 0.827 0.827 0.962 0.268 0.278 0.309 0.005 0.153 0.176
Loadings:
Factor1 Factor2 Factor3
s1 0.484 0.275 0.521
s2 0.505 0.280 0.517
s3 0.526 0.280 0.563
s4 0.497 0.269 0.593
s5 0.302 0.164 0.233
s6 0.288 0.180 0.240
e1 0.122 0.149
e2 0.389 0.634 -0.422
e3 0.334 0.663 -0.412
e4 0.381 0.638 -0.373
c1 0.997
c2 0.814 0.395 -0.168
c3 0.810 0.380 -0.150
Factor1 Factor2 Factor3
SS loadings 2.932 2.929 1.859
Proportion Var 0.226 0.225 0.143
Cumulative Var 0.226 0.451 0.594
Test of the hypothesis that 3 factors are sufficient.
The chi square statistic is 353.23 on 42 degrees of freedom.
The p-value is 7.99e-51
The output shows the loadings of the variables on the three factors, and the uniqueness of the variables. Variables with high uniqueness are poorly explained by the three factors.
It often pays off not to jump to conclusions and get rid of such items. It helps to experiment, and try factor analyses with more or fewer factors, and see what comes out.
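The link between loadings and uniqueness can be verified by hand. For unrotated (orthogonal) factors, the sum of an item's squared loadings is its communality, and the uniqueness is 1 minus the communality. The sketch below uses four rows copied from the unrotated output above; small differences arise because the printed loadings are rounded.

```r
# Loadings of four items, copied from the unrotated three-factor output
unrotated <- rbind(
  s1 = c(0.484, 0.275,  0.521),
  s3 = c(0.526, 0.280,  0.563),
  e2 = c(0.389, 0.634, -0.422),
  c2 = c(0.814, 0.395, -0.168)
)

communality <- rowSums(unrotated^2)   # variance explained by the factors
uniqueness  <- 1 - communality        # variance left unexplained

# Uniquenesses as reported in the output, for comparison
reported <- c(s1 = 0.418, s3 = 0.327, e2 = 0.268, c2 = 0.153)
round(cbind(computed = uniqueness, reported = reported), 3)
```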
For our data, we can repeat the analysis with four rather than three factors (note that the eigenvalue of the 4\(^{th}\) factor is close to 1).
In addition, we can rotate the factors. Rotation is partly cosmetic, especially if we use varimax rotation. Varimax rotation produces a matrix of factor loadings in a simple and easier to interpret structure, while maintaining orthogonality (i.e., uncorrelated, or independent factors).
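To illustrate what "cosmetic" means here, the sketch below applies the varimax() function from base R to a small loading matrix. The four rows are copied from the unrotated output above purely as an example; rotating a subset of items will not reproduce the rotation of the full 13-item solution.

```r
# Four rows of unrotated loadings (illustrative subset of the output above)
unrotated <- rbind(
  s1 = c(0.484, 0.275,  0.521),
  s3 = c(0.526, 0.280,  0.563),
  e2 = c(0.389, 0.634, -0.422),
  c2 = c(0.814, 0.395, -0.168)
)

rot <- varimax(unrotated)
rotated <- unclass(rot$loadings)

# The loadings change ...
round(rotated, 3)

# ... but the communality of each item stays the same
comm_before <- rowSums(unrotated^2)
comm_after  <- rowSums(rotated^2)
round(cbind(before = comm_before, after = comm_after), 3)

# The rotation matrix is orthogonal, so the factors remain uncorrelated
orth_check <- crossprod(rot$rotmat)
round(orth_check, 3)
```

Rotation redistributes the explained variance across the factors; it does not change how well each item is explained.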
Since it’s common in social and economic research to have factors that are somewhat correlated with one another, we can use promax rotation, which allows for correlation. Base R only ships the varimax and promax rotations for use with the factanal() function. Functions in the psych package offer more options.
Below, we repeat the factor analysis with promax rotation, for three and four factor solutions. The cut=.2 option (short for cutoff) sees to it that loadings with an absolute value smaller than .2 are left blank.
The three factor solution:
fa2 <- factanal(x=efa_data, factors=3, rotation='promax')
print(fa2, cut=.2)
Call:
factanal(x = efa_data, factors = 3, rotation = "promax")
Uniquenesses:
s1 s2 s3 s4 s5 s6 e1 e2 e3 e4 c1 c2 c3
0.418 0.400 0.327 0.329 0.827 0.827 0.962 0.268 0.278 0.309 0.005 0.153 0.176
Loadings:
Factor1 Factor2 Factor3
s1 0.777
s2 0.781
s3 0.837
s4 0.854
s5 0.389
s6 0.393
e1
e2 0.842
e3 0.805 0.223
e4 0.795
c1 0.290 0.840
c2 0.833 0.225 -0.224
c3 0.810 0.239 -0.229
Factor1 Factor2 Factor3
SS loadings 3.467 3.154 0.938
Proportion Var 0.267 0.243 0.072
Cumulative Var 0.267 0.509 0.581
Factor Correlations:
Factor1 Factor2 Factor3
Factor1 1.000 0.4091 0.2794
Factor2 0.409 1.0000 -0.0345
Factor3 0.279 -0.0345 1.0000
Test of the hypothesis that 3 factors are sufficient.
The chi square statistic is 353.23 on 42 degrees of freedom.
The p-value is 7.99e-51
And the four factor solution:
fa3 <- factanal(x=efa_data, factors=4, rotation='promax')
print(fa3, cut=.2)
Call:
factanal(x = efa_data, factors = 4, rotation = "promax")
Uniquenesses:
s1 s2 s3 s4 s5 s6 e1 e2 e3 e4 c1 c2 c3
0.434 0.415 0.298 0.308 0.374 0.317 0.960 0.268 0.278 0.308 0.005 0.153 0.176
Loadings:
Factor1 Factor2 Factor3 Factor4
s1 0.743
s2 0.755
s3 0.871
s4 0.868
s5 0.799
s6 0.846
e1
e2 0.840
e3 0.803 0.224
e4 0.793
c1 0.282 0.843
c2 0.841 0.203 -0.225
c3 0.818 0.214 -0.230
Factor1 Factor2 Factor3 Factor4
SS loadings 3.466 2.805 1.366 0.946
Proportion Var 0.267 0.216 0.105 0.073
Cumulative Var 0.267 0.482 0.587 0.660
Factor Correlations:
Factor1 Factor2 Factor3 Factor4
Factor1 1.000 0.3921 0.2804 0.2854
Factor2 0.392 1.0000 0.4901 -0.0469
Factor3 0.280 0.4901 1.0000 -0.0134
Factor4 0.285 -0.0469 -0.0134 1.0000
Test of the hypothesis that 4 factors are sufficient.
The chi square statistic is 222.69 on 32 degrees of freedom.
The p-value is 1.94e-30
We would go for the four factor solution since it clearly reveals that items s5 and s6 are strongly correlated with one another but only moderately with the other social items (s1-s4). In the three factor solution they are more or less forced into the social factor with relatively small loadings.
Both the three and four factor solutions leave out item e1, which is weakly correlated with all other items (and therefore not a reliable or valid measure of the economic factor).
In this analysis, we have lazily lumped together all items measuring the independent variables (the social and economic factors) and the dependent variable (competitiveness, the items c1-c3). If - like here - the economic factor is a strong determinant of competitiveness, then lumping e1-e4 and c1-c3 together will lead to cross-loadings, like c-items loading on a factor with e-items. It would be better to analyze items that are conceptually different according to our research model separately from the rest.
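A minimal sketch of how the items could be split by their name prefix before running separate analyses. The toy data frame below only mimics the column names of the data set; the object names predictor_items and outcome_items are our own.

```r
# Toy data frame with the same column names as the example data
item_names <- c(paste0("s", 1:6), paste0("e", 1:4), paste0("c", 1:3))
toy <- as.data.frame(matrix(1, nrow = 2, ncol = 13,
                            dimnames = list(NULL, item_names)))

# Independent variables: the social (s) and economic (e) items
predictor_items <- toy[, grepl("^[se][0-9]+$", names(toy)), drop = FALSE]

# Dependent variable: the competitiveness (c) items
outcome_items <- toy[, grepl("^c[0-9]+$", names(toy)), drop = FALSE]

names(predictor_items)
names(outcome_items)

# With the real data, the same selection could feed a separate analysis, e.g.
# factanal(x = efa_data[, grepl("^[se][0-9]+$", names(efa_data))],
#          factors = 3, rotation = "promax")
```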
Actually, if the social and economic items stem from existing theories, models, and questionnaires, it is not a very good idea to explore patterns in all items related to these concepts using EFA. If our measurements are based on existing models, then it makes more sense to skip EFA altogether and use Confirmatory Factor Analysis (CFA).
Still, EFA might be a good idea, to get a better feel of the data, and if a lot of the items have been developed from scratch.
To report a factor analysis in APA style, structure your results into three main parts:
Preliminary assumptions
Extraction and rotation decisions
The final factor solution.
Outline the factor names, loadings, variance explained, and internal consistency in a well-organized table and narrative!
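Internal consistency is usually reported as Cronbach's alpha per factor. As a sketch, alpha can be computed in base R from the item variances and the variance of the sum score. The function name cronbach_alpha and the toy scores are made up for illustration; with the real data you would pass the items of one factor, for example efa_data[, c("s1", "s2", "s3", "s4")].

```r
# Hypothetical helper: Cronbach's alpha for a set of items (columns)
cronbach_alpha <- function(items) {
  items <- as.matrix(items)
  k <- ncol(items)                       # number of items
  item_var <- apply(items, 2, var)       # variance of each item
  total_var <- var(rowSums(items))       # variance of the sum score
  (k / (k - 1)) * (1 - sum(item_var) / total_var)
}

# Toy scores (made up): two items answered by four respondents
x1 <- c(1, 2, 3, 4)
x2 <- c(2, 1, 4, 3)
toy_alpha <- cronbach_alpha(cbind(x1, x2))
toy_alpha                                # 0.75
```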
You can find an example of reporting EFA results here