Exploratory Factor Analysis in R

Author

StatMind

Published

May 20, 2026

1 Introduction

In this e-book we will show you how to use R to perform an exploratory factor analysis (EFA).

We will discuss:

  1. The data set used as an example, and how to read the data into R.

  2. The basic tests to be used before running the factor analysis: KMO and Bartlett.

  3. Preparatory steps, including producing the correlation matrix.

  4. An overview of eigenvalues, and visualizing these eigenvalues in a screeplot, in order to decide on the number of factors.

  5. Running a basic factor analysis, without rotation.

  6. Using rotation of the retained factors, allowing easier interpretation and labeling of the factors.

Lastly, we will provide some links to supplementary resources in case you want to learn more about factor analysis. In addition, for those who want to use the results of a factor analysis in academic work, there are some links on how to write up the results.

2 The Data

To illustrate how to use factor analysis for revealing underlying patterns in the data, we have generated a data set on economic and social items which are assumed to explain firm competitiveness.

  • There are six social items (s1-s6), construed in such a way that two out of these six items seem to form a sub-factor.

  • Of the four economic items (e1-e4), one is poorly correlated with the other three.

The data can be read from our GitHub account using the code below.

Code
efa_data <- read.csv("https://raw.githubusercontent.com/statmind/testdata/refs/heads/main/EFA_TESTDATA_CSV.csv")
library(DT)
datatable(efa_data)

A summary of the data will reveal that all items have a minimum value of 1, and a maximum of 6 or 7, suggesting that the items are scored by the (300) respondents on a 7-point answering scale.

Since the data is computer generated, we don’t have to deal with missing values or values outside the range of the scale.
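A quick check of the scale range and of missing values could look like the sketch below. It uses a small made-up data frame, `toy`, as a stand-in for `efa_data`; the item names and the simulated scores are assumptions for illustration only.

```r
# Hypothetical stand-in for efa_data: three items scored on a 7-point scale
set.seed(1)
toy <- data.frame(
  s1 = sample(1:7, 300, replace = TRUE),
  s2 = sample(1:7, 300, replace = TRUE),
  e1 = sample(1:7, 300, replace = TRUE)
)

# Minimum and maximum per item (one column per item)
item_range <- sapply(toy, range)
rownames(item_range) <- c("min", "max")
item_range

# Number of missing values and of scores outside the 1-7 range
n_missing <- sum(is.na(toy))
n_outside <- sum(toy < 1 | toy > 7, na.rm = TRUE)
c(missing = n_missing, outside = n_outside)
```

With real survey data, non-zero counts here would be a signal to clean the data before computing correlations.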

3 Basic Tests

It is common to first check whether it makes sense to perform an EFA. Frequently used tests are KMO and Bartlett’s test.

KMO (short for Kaiser-Meyer-Olkin test) evaluates if your variables have enough variance in common. Bartlett’s test verifies if the correlations between variables are statistically significant enough to proceed.

KMO theoretically ranges from 0 to 1. However, the KMO has a value of \(.5\) if your variables are uncorrelated, and values \(<.5\) are very rare. If \(KMO \leq .5\), then factor analysis makes no sense, and doing so is considered unacceptable.

Warning

That does not imply that a value of \(KMO>.50\) is acceptable! Most researchers agree that the KMO should be well above .60.
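To make the KMO less of a black box, here is a minimal sketch that computes the overall KMO by hand, using its usual definition: the sum of squared correlations divided by that sum plus the sum of squared partial correlations (off-diagonal elements only). The function name `kmo_by_hand` and the small matrix `r_toy` are made up for illustration.

```r
# Sketch: overall KMO computed by hand from a correlation matrix
kmo_by_hand <- function(r) {
  q   <- solve(r)                            # inverse of the correlation matrix
  p   <- -q / sqrt(outer(diag(q), diag(q)))  # partial correlations (off-diagonal cells)
  off <- row(r) != col(r)                    # mask selecting the off-diagonal cells
  sum(r[off]^2) / (sum(r[off]^2) + sum(p[off]^2))
}

# Hypothetical 3 x 3 correlation matrix, for illustration only
r_toy <- matrix(c(1.0, 0.6, 0.5,
                  0.6, 1.0, 0.4,
                  0.5, 0.4, 1.0), nrow = 3, byrow = TRUE)
kmo_toy <- kmo_by_hand(r_toy)
kmo_toy
```

The partial correlations are small when the variables share a lot of common variance, which pushes the ratio towards 1.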

Bartlett’s test checks if the correlation matrix is significantly different from an identity matrix, i.e., a matrix with ones on the diagonal and zero correlations between the variables. Since a close-to-identity matrix would imply a low KMO, we find Bartlett’s test kind of redundant, but, well, for the sake of academic completeness let’s just do it. Your hope is to have a large chi-square value with \(p<.05\).
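The test statistic itself is easy to reproduce. The sketch below applies the standard formula, \(\chi^2 = -\left(n - 1 - \frac{2p+5}{6}\right)\ln|R|\) with \(p(p-1)/2\) degrees of freedom, to a made-up correlation matrix. The names `bartlett_by_hand` and `r_toy` and the sample size of 300 are assumptions for illustration.

```r
# Sketch: Bartlett's test of sphericity from a correlation matrix r and sample size n
bartlett_by_hand <- function(r, n) {
  p     <- ncol(r)
  chisq <- -(n - 1 - (2 * p + 5) / 6) * log(det(r))
  df    <- p * (p - 1) / 2
  list(chisq = chisq, df = df,
       p.value = pchisq(chisq, df, lower.tail = FALSE))
}

# Hypothetical 3 x 3 correlation matrix, for illustration only
r_toy <- matrix(c(1.0, 0.6, 0.5,
                  0.6, 1.0, 0.4,
                  0.5, 0.4, 1.0), nrow = 3, byrow = TRUE)
bart_toy <- bartlett_by_hand(r_toy, n = 300)
bart_toy
```

For an identity matrix the determinant is 1, its logarithm is 0, and the statistic is 0, which is why a large value speaks against the identity matrix.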

For more information on these tests, click here.

KMO and Bartlett’s test are not available in base R. The psych package has functions to perform these tests. Of course, you have to install that package if you haven’t already done so.

Both functions use the correlation matrix as input. We use the cor() function in R to create an object containing the correlation matrix.

Tip: All We Need ..

If you want to share your data with others, for example to seek their help or feedback, then it is often enough to share the correlation matrix, along with other statistics (sample size; means and standard deviations of variables). Sharing the raw data may be sensitive, for all kinds of reasons - including the privacy of your respondents. And it is more efficient than sharing large files with hundreds of records!
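As an illustration, `factanal()` accepts a correlation matrix through its `covmat` argument, together with the sample size in `n.obs`. The sketch below uses simulated data (the data frame `toy` and its two-factor structure are assumptions) and compares the result with an analysis of the raw data.

```r
# Hypothetical data: six items driven by two independent factors
set.seed(42)
n  <- 300
f1 <- rnorm(n)
f2 <- rnorm(n)
toy <- data.frame(
  a1 = f1 + rnorm(n, sd = 0.6), a2 = f1 + rnorm(n, sd = 0.6),
  a3 = f1 + rnorm(n, sd = 0.6),
  b1 = f2 + rnorm(n, sd = 0.6), b2 = f2 + rnorm(n, sd = 0.6),
  b3 = f2 + rnorm(n, sd = 0.6)
)

# Analysis on the raw data ...
fa_raw <- factanal(x = toy, factors = 2)

# ... and on the correlation matrix plus the sample size only
fa_cor <- factanal(covmat = cor(toy), factors = 2, n.obs = n)

# The uniquenesses should agree up to numerical precision
round(cbind(raw = fa_raw$uniquenesses, cor = fa_cor$uniquenesses), 3)
```

Someone who only receives the correlation matrix and the sample size can therefore rerun the factor analysis, although factor scores for individual respondents require the raw data.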

Let’s first compute the correlation matrix and store it as an object. We print part of the matrix, with a couple of social and economic items. Try to detect some patterns for yourself!

Code
efa_mat <- cor(efa_data)
(round(efa_mat[3:9,3:9],2))
     s3   s4   s5   s6   e1   e2   e3
s3 1.00 0.76 0.28 0.29 0.07 0.14 0.13
s4 0.76 1.00 0.30 0.30 0.05 0.11 0.09
s5 0.28 0.30 1.00 0.65 0.07 0.11 0.09
s6 0.29 0.30 0.65 1.00 0.09 0.10 0.09
e1 0.07 0.05 0.07 0.09 1.00 0.14 0.16
e2 0.14 0.11 0.11 0.10 0.14 1.00 0.68
e3 0.13 0.09 0.09 0.09 0.16 0.68 1.00

Next, we compute KMO and Bartlett’s test using functions from psych. We compute the sample size that is needed for Bartlett’s test as the number of rows in the data set using the nrow() function.

Code
# install.packages("psych")
library(psych)
KMO(efa_mat)
Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = efa_mat)
Overall MSA =  0.67
MSA for each item = 
  s1   s2   s3   s4   s5   s6   e1   e2   e3   e4   c1   c2   c3 
0.83 0.73 0.78 0.82 0.73 0.70 0.93 0.64 0.63 0.63 0.44 0.68 0.69 
Code
size <- nrow(efa_data)
cortest.bartlett(efa_mat, size)
$chisq
[1] 2495.989

$p.value
[1] 0

$df
[1] 78

Our overall KMO is \(.67\) which is good enough but not great. We can inspect the KMO scores for the individual items, and conclude that item c1 has a low value (.44). Leaving out that item might boost the overall KMO, but we see little reason to do so at this stage of the analysis.

Let’s proceed to the next phase!

4 Deciding on the Number of Factors

There is a rich body of literature on factor analysis, and a lot has been said about deciding on the number of factors, and extracting them.

An excellent overview of all the ins and outs of EFA can be found in Osborne (2014).

The idea behind it is fairly simple. Suppose we have only 2 variables (or items, apologies for using those terms interchangeably) in our data set. We can use a scatterplot to check the relation between these variables.

If the variables are correlated, then the pattern of the dots looks like an oval (rather than a circle). This oval has two dimensions, just like the original scatterplot. Both of these dimensions are linear combinations of the variables. The length of the oval (F1) captures most of the variance in the data, and its height (F2) accounts for the remainder. See the figure below.
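For two standardized variables, the two dimensions of the oval can be computed directly. A minimal sketch, assuming a correlation of .6 between the variables (a made-up value): the eigenvalues of the 2 x 2 correlation matrix are then \(1 + r\) and \(1 - r\), the variances along F1 and F2.

```r
# Assumed correlation between two standardized variables (hypothetical value)
r <- 0.6
R2 <- matrix(c(1, r,
               r, 1), nrow = 2, byrow = TRUE)

ev2      <- eigen(R2)$values   # variances along F1 and F2: 1 + r and 1 - r
share_f1 <- ev2[1] / sum(ev2)  # proportion of the total variance captured by F1

ev2
share_f1
```

The stronger the correlation, the longer and flatter the oval, and the larger the share of variance captured by F1.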

We can expand this idea to any number of variables. The shapes become more complex, and impossible to visualize.

But mathematically it is still possible to compute factors in such a way that the first factor accounts for most of the variance; the second factor for most of the remaining variance; and so on.

The number of factors is by definition the same as the number of variables. And if we standardize the variables (with means of 0, and a variance of 1), then the total variance is equal to the number of items.
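The effect of standardizing can be checked with a few lines of R. In this sketch the data frame `toy` is simulated, with three variables on deliberately different scales (an assumption for illustration).

```r
# Hypothetical data: three variables on very different scales
set.seed(7)
toy <- data.frame(x1 = rnorm(100, mean = 50, sd = 10),
                  x2 = runif(100),
                  x3 = rexp(100))

z <- scale(toy)                           # mean 0 and variance 1 per column
total_var <- sum(apply(z, 2, var))        # equals the number of variables
eig_sum   <- sum(eigen(cor(toy))$values)  # the eigenvalues add up to the same total

c(total_var = total_var, eig_sum = eig_sum)
```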

That is helpful. We can now look at the variances explained by the factors, and use a criterion to cut off the number of factors to be retained. For example, we can look for the number of factors that jointly account for 70% or more of the variance. If we have n items, then the first k factors (\(k \ll n\)) account for a relatively large proportion of total variance.

Note: Data Reduction

For that reason, factor analysis is a method of data reduction: we group a large number of variables into a small number of factors.

The variance explained by a factor is called its eigenvalue.

Note: Uh .. Eigenvalues?

Eigenvalue is not an everyday word; its origins are in linear algebra and smart German mathematicians.

A frequently used criterion for retaining factors is an \(eigenvalue>1\). This is intuitively appealing. If a factor explains less of the total variance than a single standardized variable contributes (a variance of 1), then the contribution of the factor is small. Our hope is that a small number of factors has eigenvalues well above 1. Our choice of the number of factors to be retained is facilitated by a screeplot of the eigenvalues.

Let’s compute the eigenvalues, and show the screeplot.

Code
efa_eigen <- eigen(efa_mat); efa_eigen$values
 [1] 5.18750496 2.45377054 1.24386417 0.95742076 0.76391591 0.58305339
 [7] 0.36504001 0.34238582 0.32275816 0.29185930 0.23661001 0.19645193
[13] 0.05536503
Code
tv <- sum(efa_eigen$values)
cat("The sum of eigenvalues =",tv)
The sum of eigenvalues = 13
Code
cat("Cumulative % of variance explained by the factors:\n")
Cumulative % of variance explained by the factors:
Code
round(100 * cumsum(efa_eigen$values) / tv, 2)
 [1]  39.90  58.78  68.35  75.71  81.59  86.07  88.88  91.52  94.00  96.24
[11]  98.06  99.57 100.00

The first factor accounts for 5.19 (39.90%) of all variance in the data. We would need to retain 4 out of 13 factors to account for at least 75% of all variance. The first three factors have an \(eigenvalue>1\), suggesting retaining 3 factors.
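Both criteria can also be applied in code rather than by eye. The sketch below copies the eigenvalues from the output above (rounded to four decimals) into a vector; the names `ev`, `n_kaiser` and `n_70` are ours.

```r
# Eigenvalues copied from the output above, rounded to 4 decimals
ev <- c(5.1875, 2.4538, 1.2439, 0.9574, 0.7639, 0.5831, 0.3650,
        0.3424, 0.3228, 0.2919, 0.2366, 0.1965, 0.0554)

n_kaiser <- sum(ev > 1)                 # factors with an eigenvalue above 1
cum_prop <- cumsum(ev) / sum(ev)        # cumulative proportion of variance
n_70     <- which(cum_prop >= 0.70)[1]  # first factor count reaching 70%

c(kaiser = n_kaiser, seventy_percent = n_70)
```

The two rules disagree here (three versus four factors), which is one reason to look at the screeplot as well.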

We can visualize the eigenvalues in a screeplot.

Code
plot(efa_eigen$values, type="b",ylab="Eigenvalues", xlab="Factor")

After three factors, the screeplot is leveling off. Additional factors contribute only marginally to the variance explained.

5 The Factor Analysis

So far, we have made sure that the data lend themselves to a factor analysis. We have computed the correlation matrix, which we can visually inspect (are the correlations within the groups of economic and social items higher than between the groups? Do some items have lower correlations than expected?). Using common criteria, we have decided to retain three factors. These three out of 13 factors (fewer than one quarter) account for 68% of total variance.

We are now ready to do the factor analysis.

Code
fa1 <- factanal(x=efa_data, factors=3, rotation='none'); fa1

Call:
factanal(x = efa_data, factors = 3, rotation = "none")

Uniquenesses:
   s1    s2    s3    s4    s5    s6    e1    e2    e3    e4    c1    c2    c3 
0.418 0.400 0.327 0.329 0.827 0.827 0.962 0.268 0.278 0.309 0.005 0.153 0.176 

Loadings:
   Factor1 Factor2 Factor3
s1  0.484   0.275   0.521 
s2  0.505   0.280   0.517 
s3  0.526   0.280   0.563 
s4  0.497   0.269   0.593 
s5  0.302   0.164   0.233 
s6  0.288   0.180   0.240 
e1  0.122   0.149         
e2  0.389   0.634  -0.422 
e3  0.334   0.663  -0.412 
e4  0.381   0.638  -0.373 
c1          0.997         
c2  0.814   0.395  -0.168 
c3  0.810   0.380  -0.150 

               Factor1 Factor2 Factor3
SS loadings      2.932   2.929   1.859
Proportion Var   0.226   0.225   0.143
Cumulative Var   0.226   0.451   0.594

Test of the hypothesis that 3 factors are sufficient.
The chi square statistic is 353.23 on 42 degrees of freedom.
The p-value is 7.99e-51 

The output shows the loadings of the variables on the three factors, and the uniqueness of the variables. Variables with high uniqueness are poorly explained by the three factors.
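In an unrotated (orthogonal) solution, the uniqueness of an item is 1 minus its communality, and the communality is the sum of its squared loadings. A minimal sketch, using the loadings of s1 and e2 copied from the output above (the object names are ours):

```r
# Loadings of two items on the three factors, copied from the output above
L <- rbind(s1 = c(0.484, 0.275,  0.521),
           e2 = c(0.389, 0.634, -0.422))

communality <- rowSums(L^2)     # variance of each item explained by the factors
uniqueness  <- 1 - communality  # variance left unexplained

round(cbind(communality, uniqueness), 3)
```

Up to rounding of the printed loadings, these values match the uniquenesses reported for s1 and e2.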

6 Rotation and Simple Structure

It often pays off not to jump to conclusions and get rid of items with a high uniqueness. It helps to experiment, try factor analyses with more or fewer factors, and see what comes out.

For our data, we can repeat the analysis with four rather than three factors (note that the eigenvalue of the 4\(^{th}\) factor is close to 1).

In addition, we can rotate the factors. Rotation is partly cosmetic, especially if we use varimax rotation. Varimax rotation produces a matrix of factor loadings in a simple and easier to interpret structure, while maintaining orthogonality (i.e., uncorrelated, or independent factors).

Since it’s common in social and economic research to have factors that are somewhat correlated with one another, we can use promax rotation that allows for correlation. The factanal() function that ships with R (in the stats package) has only the varimax and promax rotations built in. Functions in the psych package offer more options.
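That rotation changes the loadings but not the fit can be verified with a small experiment. The sketch below uses simulated data with two correlated factors (the data frame `toy` and its structure are assumptions for illustration) and compares the unrotated, varimax and promax solutions.

```r
# Hypothetical data: six items driven by two correlated factors
set.seed(123)
n  <- 300
f1 <- rnorm(n)
f2 <- 0.5 * f1 + sqrt(0.75) * rnorm(n)   # correlates about .5 with f1
toy <- data.frame(
  a1 = f1 + rnorm(n, sd = 0.6), a2 = f1 + rnorm(n, sd = 0.6),
  a3 = f1 + rnorm(n, sd = 0.6),
  b1 = f2 + rnorm(n, sd = 0.6), b2 = f2 + rnorm(n, sd = 0.6),
  b3 = f2 + rnorm(n, sd = 0.6)
)

fa_none <- factanal(toy, factors = 2, rotation = "none")
fa_vmax <- factanal(toy, factors = 2, rotation = "varimax")
fa_pmax <- factanal(toy, factors = 2, rotation = "promax")

# Rotation does not change how well each item is explained
all.equal(fa_none$uniquenesses, fa_vmax$uniquenesses)
all.equal(fa_none$uniquenesses, fa_pmax$uniquenesses)

# Under an orthogonal rotation the communalities stay the same as well
h2_none <- rowSums(unclass(fa_none$loadings)^2)
h2_vmax <- rowSums(unclass(fa_vmax$loadings)^2)
all.equal(h2_none, h2_vmax)
```

The same pattern is visible in the outputs in this section: the uniquenesses of the rotated three factor solution are identical to those of the unrotated one.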

Below, we repeat the factor analysis with promax rotation, for three and four factor solutions. The cut=.2 option (which R matches to the cutoff argument of the print method for loadings) sees to it that loadings smaller than .2 in absolute value are left blank.

The three factor solution:

Code
fa2 <- factanal(x=efa_data, factors=3, rotation='promax')
print(fa2, cut=.2)

Call:
factanal(x = efa_data, factors = 3, rotation = "promax")

Uniquenesses:
   s1    s2    s3    s4    s5    s6    e1    e2    e3    e4    c1    c2    c3 
0.418 0.400 0.327 0.329 0.827 0.827 0.962 0.268 0.278 0.309 0.005 0.153 0.176 

Loadings:
   Factor1 Factor2 Factor3
s1          0.777         
s2          0.781         
s3          0.837         
s4          0.854         
s5          0.389         
s6          0.393         
e1                        
e2  0.842                 
e3  0.805           0.223 
e4  0.795                 
c1  0.290           0.840 
c2  0.833   0.225  -0.224 
c3  0.810   0.239  -0.229 

               Factor1 Factor2 Factor3
SS loadings      3.467   3.154   0.938
Proportion Var   0.267   0.243   0.072
Cumulative Var   0.267   0.509   0.581

Factor Correlations:
        Factor1 Factor2 Factor3
Factor1   1.000  0.4091  0.2794
Factor2   0.409  1.0000 -0.0345
Factor3   0.279 -0.0345  1.0000

Test of the hypothesis that 3 factors are sufficient.
The chi square statistic is 353.23 on 42 degrees of freedom.
The p-value is 7.99e-51 

And the four factor solution:

Code
fa3 <- factanal(x=efa_data, factors=4, rotation='promax')
print(fa3, cut=.2)

Call:
factanal(x = efa_data, factors = 4, rotation = "promax")

Uniquenesses:
   s1    s2    s3    s4    s5    s6    e1    e2    e3    e4    c1    c2    c3 
0.434 0.415 0.298 0.308 0.374 0.317 0.960 0.268 0.278 0.308 0.005 0.153 0.176 

Loadings:
   Factor1 Factor2 Factor3 Factor4
s1          0.743                 
s2          0.755                 
s3          0.871                 
s4          0.868                 
s5                  0.799         
s6                  0.846         
e1                                
e2  0.840                         
e3  0.803                   0.224 
e4  0.793                         
c1  0.282                   0.843 
c2  0.841   0.203          -0.225 
c3  0.818   0.214          -0.230 

               Factor1 Factor2 Factor3 Factor4
SS loadings      3.466   2.805   1.366   0.946
Proportion Var   0.267   0.216   0.105   0.073
Cumulative Var   0.267   0.482   0.587   0.660

Factor Correlations:
        Factor1 Factor2 Factor3 Factor4
Factor1   1.000  0.3921  0.2804  0.2854
Factor2   0.392  1.0000  0.4901 -0.0469
Factor3   0.280  0.4901  1.0000 -0.0134
Factor4   0.285 -0.0469 -0.0134  1.0000

Test of the hypothesis that 4 factors are sufficient.
The chi square statistic is 222.69 on 32 degrees of freedom.
The p-value is 1.94e-30 

We would go for the four factor solution since it clearly reveals that items s5 and s6 are strongly correlated with one another but only moderately with the other social items (s1-s4). In the three factor solution they are more or less forced into the social factor with relatively small loadings.

Both the three and four factor solutions leave out item e1, which is weakly correlated with all other items (and therefore not a reliable or valid measure of the economic factor).

In this analysis, we have lazily lumped together all items measuring the independent variables (the social and economic factors) and the dependent variable (competitiveness, the items c1-c3). If - like here - the economic factor is a strong determinant of competitiveness, then lumping e1-e4 and c1-c3 together will lead to cross-loadings, like c-items loading on a factor with e-items. It would be better to analyze items that are conceptually different according to our research model separately from the rest.

Actually, if the social and economic items stem from existing theories, models, and questionnaires, it is not a very good idea to explore patterns in all items related to these concepts using EFA. If our measurements are based on existing models, then it makes more sense to skip EFA altogether and use Confirmatory Factor Analysis (CFA).

Still, EFA might be a good idea, to get a better feel of the data, and if a lot of the items have been developed from scratch.

7 Reporting on EFA

To report a factor analysis in APA style, structure your results into three main parts:

  1. Preliminary assumptions

  2. Extraction and rotation decisions

  3. The final factor solution.

Outline the factor names, loadings, variance explained, and internal consistency in a well-organized table and narrative!

You can find an example of reporting EFA results here.

8 Supplementary Sources

Advanced Statistics Using R

Introduction to EFA