The following theory is mainly based on the book Exploratory Factor Analysis by Fabrigar and Wegener, and Chapter 6 from The Philosophy of Quantitative Methods: Understanding Statistics by Haig.
Factor analysis is used when you believe that a set of variables are associated with each other because the data generating process is driven by common underlying constructs. You can get some insight into that structure simply by looking at the correlations between variables. If all variables are strongly correlated with each other, we may think there is an underlying cause for why they are associated. Similarly, if there is strong correlation within two groups of variables, but no correlation between those groups, then we may think there are two underlying factors. If we see no correlation between variables, then they may be driven by unique factors (or nothing at all).
Factor analysis emerged to make this process of exploring correlations between variables more systematic. It is a set of statistical tools for determining the number of constructs needed to account for the pattern of correlations between variables. These approaches assume that the underlying constructs linearly influence related observed variables, which are, in turn, correlated with each other. Underlying factors that influence multiple variables are called common factors, since they are common to more than one variable. However, there is typically also residual variance in a variable that is not explained by the common factors. This residual variance is caused by unique factors - factors unique to a single variable, which don’t tell us anything about other variables. These unique factors are partly made up of measurement error, and partly the specific factor, which is unique to the observed variable. Measurement error might arise from an ambiguously worded question, which can be answered differently depending on how it’s interpreted. A specific factor might emerge from a bias in the wording of a question, compared to the other questions in a battery.
So we can partition the observed variance as:
\[observed = common + unique \]
(where \(observed\) is the observed variance, \(common\) is the common factor, and \(unique\) is the unique factor) and further partition unique factor as:
\[unique = specific + error \]
(where \(specific\) is the specific factor, and \(error\) is the measurement error). We can see what this looks like graphically, below.
This figure shows the observed variables represented by squares, linear causal effects indicated by single-headed arrows, and the common factor indicated by the sphere. The numbers on the single-headed arrows connecting the common factor to the measured variables are the factor loadings - the strength and direction of the effect of the common factor on the measured variables.
A two-headed arrow connecting a variable to itself describes the residual variance, which is the unique factor in the case of the observed variables. This unique factor is also commonly shown as a separate box or sphere that is connected by a single headed arrow to the measured variable. The coefficient between the unique factor and the observed variable is set to 1, but this isn’t shown. Also, the unique factors are assumed to be independent of each other.
The next two plots show two common factors. The absence of a two-headed arrow between the common factors in the first plot indicates that they are independent of each other, compared to the bottom plot, where they are dependent.
OK, so we’ve learned the intuition behind factor analysis, but how is it presented mathematically? First, it’s worth pointing out that the equation assumes that the measured variables have been standardized (mean 0, SD = 1) and that the common factors can also be interpreted in standardized z-score metrics. Normally factor analysis is presented using matrix algebra, as below:
\[P = \Lambda\Phi\Lambda^\top + \mathrm{D}\psi\] Where \(P\) is the population correlation matrix of observed variables across the set of variables. In the example above, this would be:
\[\begin{array} {rrr} & V.A & V.B & V.C & V.D & V.E & V.F \\ V.A & 1 & \\ V.B & p_{2.1} & 1 \\ V.C & p_{3.1} & p_{3.2} & 1 \\ V.D & p_{4.1} & p_{4.2} & p_{4.3} & 1 \\ V.E & p_{5.1} & p_{5.2} & p_{5.3} & p_{5.4} & 1 \\ V.F & p_{6.1} & p_{6.2} & p_{6.3} & p_{6.4} & p_{6.5} & 1 \end{array} \]
Where \(p_{r.c}\) is a correlation between two variables (e.g. \(V.A\) and \(V.E\)), the first subscript (\(_{r}\)) refers to the row of the matrix, and the second subscript (\(_{c}\)) refers to the column of the matrix.
\(\Lambda\) is a matrix that describes the strength and direction of the effect of each common factor on the observed variables, referred to as the factor loading matrix. The matrix columns are the common factors and the rows are the observed variables. Again, using the earlier example this would be:
\[\begin{array} {rrr} & MR1 & MR2 \\ V.A & \Lambda_{1.1} & \Lambda_{1.2} \\ V.B & \Lambda_{2.1} & \Lambda_{2.2} \\ V.C & \Lambda_{3.1} & \Lambda_{3.2} \\ V.D & \Lambda_{4.1} & \Lambda_{4.2} \\ V.E & \Lambda_{5.1} & \Lambda_{5.2} \\ V.F & \Lambda_{6.1} & \Lambda_{6.2} \end{array} \]
Where \(MR1\) and \(MR2\) are the first and second common factors respectively, and \(\Lambda_{ij}\) refers to the factor loading for the \(i\)th row and \(j\)th column. This is the same value as shown in the directional path between a common factor and corresponding observed variable.
\(\Lambda^\top\) is the transpose of the \(\Lambda\) matrix, i.e.:
\[\begin{array} {rrr} & V.A & V.B & V.C & V.D & V.E & V.F \\ MR1 & \Lambda_{1.1} & \Lambda_{2.1} & \Lambda_{3.1} & \Lambda_{4.1} & \Lambda_{5.1} & \Lambda_{6.1} \\ MR2 & \Lambda_{1.2} & \Lambda_{2.2} & \Lambda_{3.2} & \Lambda_{4.2} & \Lambda_{5.2} & \Lambda_{6.2} \\ \end{array}\]
\(\Phi\) is the correlation matrix between the common factors, i.e.:
\[\begin{array} {rrr} & MR1 & MR2 \\ MR1 & 1 & \\ MR2 & \Phi_{2.1} &1 \end{array} \] If there is no association between the common factors, then \(\Phi_{2.1}\) is 0, making \(\Phi\) an identity matrix, in which case it can be omitted from the equation.
The final term is \(\mathrm{D}\psi\), which is the covariance matrix among the unique factors (i.e. the unexplained residual variance). The diagonal elements of this matrix are the variances of the unique factors. The off-diagonal elements are the covariances among unique factors. Because the model assumes that the unique factors are independent of each other, the off-diagonal elements are assumed to be zero. In our example, this matrix looks like this:
\[\begin{array} {rrr} & U.A & U.B & U.C & U.D & U.E & U.F \\ U.A & \mathrm{D}_{\psi1.1} & \\ U.B & 0 & \mathrm{D}_{\psi2.2} \\ U.C & 0 & 0 & \mathrm{D}_{\psi3.3} \\ U.D &0 & 0 & 0 & \mathrm{D}_{\psi4.4} \\ U.E & 0 & 0 & 0 & 0 & \mathrm{D}_{\psi5.5}\\ U.F & 0 & 0 & 0 & 0 & 0 & \mathrm{D}_{\psi6.6} \end{array} \]
Where \(U_k\) is the unique factor corresponding to the \(k\)th observed variable. The variance of \(U_k\) is what is shown by the double-headed arrow connecting each observed variable to itself in the figures above.
For the time being, we’re going to assume that the common factors are uncorrelated, expressed by:
\[P = \Lambda\Lambda^\top + \mathrm{D}\psi\] We mentioned before that the measured variables are normally standardized, and that the common factors are assumed to be standardized too. In this case, a factor loading (\(\Lambda_{r.c}\), where \(_r\) is row, \(_c\) is column) can be squared to give the proportion of variance in the observed variable explained by the corresponding common factor. For example, \(\Lambda_{1.1}\) is the factor loading of the first common factor on the first observed variable, and \({\Lambda_{1.1}}^2\) is the proportion of variance in the first measured variable explained by the first common factor. Similarly, \({\Lambda_{1.2}}^2\) is the proportion of variance in the first observed variable explained by the second common factor. \(\mathrm{D}_{\psi1.1}\) is the proportion of variance in the first observed variable explained by its unique factor. As a result, the sum of the squared factor loadings on each common factor, plus the unique factor variance, accounts for all the variance in the measured variable. In our example, the diagonal elements of the equation are described by:
\[p_{1.1}=1={\Lambda_{1.1}}^2+{\Lambda_{1.2}}^2+\mathrm{D}_{\psi1.1}\] In other words, “the model implies that each measured variable will be a function of the sum of variance explained by each common factor in the model as well as the unique factor corresponding to that measured variable.”
However, the off-diagonal elements are calculated slightly differently:
\[p_{2.1}=\Lambda_{2.1}\Lambda_{1.1}+\Lambda_{2.2}\Lambda_{1.2}\]
This equation shows that the correlation between two observed variables is equal to the product of the first common factor’s loadings on the two variables, plus the product of the second common factor’s loadings on those two variables. This means that the two observed variables will be highly correlated if either or both of the common factors strongly influence both of them. Conversely, if one common factor loaded exclusively on one measured variable, and another common factor loaded exclusively on the other variable, then those two variables would not be correlated.
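To make this concrete, here is a minimal numpy sketch (using made-up loadings, not estimates from any real dataset) showing that the model-implied correlation matrix is simply \(\Lambda\Lambda^\top + \mathrm{D}\psi\), and that its diagonal and off-diagonal elements follow the two equations above:

```python
import numpy as np

# hypothetical factor loadings for six observed variables on two common factors
loadings = np.array([
    [0.8, 0.1],  # V.A
    [0.7, 0.0],  # V.B
    [0.6, 0.2],  # V.C
    [0.1, 0.7],  # V.D
    [0.0, 0.8],  # V.E
    [0.2, 0.6],  # V.F
])

# unique variances chosen so that each variable's total variance is 1
unique_variances = 1 - np.sum(loadings**2, axis=1)

# model-implied correlation matrix: Lambda Lambda' + D_psi
implied_corr = loadings @ loadings.T + np.diag(unique_variances)

# diagonal elements are 1: squared loadings plus unique variance
print(np.diag(implied_corr))  # [1. 1. 1. 1. 1. 1.]

# off-diagonal element p_2.1 is the sum of products of loadings
print(implied_corr[1, 0])                                                 # 0.56
print(loadings[1, 0] * loadings[0, 0] + loadings[1, 1] * loadings[0, 1])  # 0.56
```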
Now we understand the basic common factor model. However, there are multiple computational algorithms used to fit the model. The book Exploratory Factor Analysis discusses three main groups of algorithms: non-iterated principal axis (NIPA) factor analysis, iterated principal axis (IPA) factor analysis, and maximum likelihood (ML) factor analysis. We’re not going to go into detail about each method, but will talk about their relative strengths and weaknesses.
However, they are all trying to solve the same fundamental problem. Recalling the common factor model with independent factors, we have to estimate the elements of three matrices (\(P\), \(\Lambda\), and \(\mathrm{D}\psi\)): the values of two of the matrices have to be known in order to calculate the third. This is only straightforward if we have census data for an entire population, so that \(P\) is known. When this isn’t the case, we’re drawing inferences from a sample to a population, and refer to the following equation:
\[R \approx \Lambda\Lambda^\top + \mathrm{D}\psi\]
Where \(R\) is the correlation matrix among the observed variables in a sample drawn from the population. \(R\) can be calculated directly from the sample. Given that all three methods are trying to solve the same model, we’d expect their results to be similar. This is true when the data do not seriously violate model assumptions, there are strong common factors, and the model is not badly misspecified. Where this isn’t the case, the results can differ noticeably.
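As a small illustration (using simulated data rather than a real survey), \(R\) is just the ordinary correlation matrix of the sampled observations, and with independent standardised factors it should land close to \(\Lambda\Lambda^\top + \mathrm{D}\psi\):

```python
import numpy as np

rng = np.random.default_rng(1)

# the same hypothetical loadings as in the earlier sketch
loadings = np.array([[0.8, 0.1], [0.7, 0.0], [0.6, 0.2],
                     [0.1, 0.7], [0.0, 0.8], [0.2, 0.6]])
unique_variances = 1 - np.sum(loadings**2, axis=1)

# simulate 500 observations: standardised, uncorrelated common factors plus unique noise
n_obs = 500
factors = rng.normal(size=(n_obs, 2))
noise = rng.normal(size=(n_obs, 6)) * np.sqrt(unique_variances)
data = factors @ loadings.T + noise

# the sample correlation matrix R, calculated directly from the sample
R = np.corrcoef(data, rowvar=False)
print(np.round(R, 2))  # approximately loadings @ loadings.T + np.diag(unique_variances)
```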
NIPA and IPA (which are based on similar algorithms) generally produce similar results, although IPA is slightly more accurate. However, IPA can sometimes fail to converge on a solution, or produce impossible solutions (Heywood cases, such as negative variances), which typically indicate that model assumptions are violated or the model is misspecified. ML provides measures of model fit (discussed next) and allows for the computation of standard errors and confidence intervals. However, ML assumes multivariate normality (where residuals are normally distributed), while NIPA and IPA do not. Although ML is relatively robust, severe violations of multivariate normality can cause problems. ML is generally preferred because of the additional information it provides, but it’s always worth comparing estimates across methods to see whether the results are similar.
Some more details about ML. In exploratory factor analysis, ML makes two assumptions about the data. First, that the data are a random sample drawn from the population. Second, as mentioned, that the observed variables follow a multivariate normal distribution, discussed in the inset.
The advantages of ML come with the assumption of multivariate normality, where residuals are normally distributed once you’ve accounted for the effect of explanatory variables. Although ML methods are reasonably robust to violations of this assumption, results can be wrong if the violations are severe. One suggested rule of thumb (derived from simulated data) is that an absolute value of skew greater than two, and an absolute value of kurtosis greater than seven, might be problematic. If this is the case, there are a number of steps that can be taken. However, since we’re not dealing with continuous data we’re not going to spend time on them.
A crucial component of ML is the likelihood function: a numerical value describing the relative likelihood of the data given a set of parameter estimates. ML aims to find the set of parameter estimates (factor loadings and unique variances) that are most likely to have given rise to the data.
A couple of key points. First, rather than maximising the likelihood function directly, it is computationally easier to work with a related quantity called the maximum likelihood discrepancy function. The discrepancy function is equal to or greater than 0, with 0 indicating that the model fits the data perfectly. ML seeks the combination of parameters that minimises the maximum likelihood discrepancy function.
ML is iterative. It starts with steps similar to those in NIPA and IPA, but then varies the parameter estimates until changes in those estimates no longer yield a higher likelihood that the model produced the data (in practice, a “perfect” solution can never be found, so the iterations stop when the change in likelihood is smaller than a given threshold). When this happens, the model has “converged” on a solution and the final parameter estimates are produced. However, as with the other methods, these estimates still need to be rotated to provide easily interpretable results.
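To give a flavour of what the iterations are doing, here is a bare-bones sketch that minimises one common form of the ML discrepancy function, \(F = \ln|\Sigma| - \ln|R| + \mathrm{tr}(R\Sigma^{-1}) - p\), with a general-purpose numerical optimiser. The function names here are illustrative, and this is only a sketch of the idea; real implementations use analytic gradients, identification constraints, and better starting values.

```python
import numpy as np
from scipy.optimize import minimize

def ml_discrepancy(params, R, n_vars, n_factors):
    """One common form of the ML discrepancy function for a common factor model."""
    loadings = params[:n_vars * n_factors].reshape(n_vars, n_factors)
    uniquenesses = params[n_vars * n_factors:]
    sigma = loadings @ loadings.T + np.diag(uniquenesses)  # model-implied correlations
    _, logdet_sigma = np.linalg.slogdet(sigma)
    _, logdet_r = np.linalg.slogdet(R)
    return logdet_sigma - logdet_r + np.trace(R @ np.linalg.inv(sigma)) - n_vars

def fit_ml_efa(R, n_factors, seed=0):
    """Minimise the discrepancy; the loadings are identified only up to rotation."""
    n_vars = R.shape[0]
    rng = np.random.default_rng(seed)
    start = np.concatenate([rng.uniform(0.1, 0.5, n_vars * n_factors),  # loadings
                            np.full(n_vars, 0.5)])                      # unique variances
    bounds = [(-1, 1)] * (n_vars * n_factors) + [(0.005, 1)] * n_vars
    result = minimize(ml_discrepancy, start, args=(R, n_vars, n_factors),
                      method="L-BFGS-B", bounds=bounds)
    loadings = result.x[:n_vars * n_factors].reshape(n_vars, n_factors)
    uniquenesses = result.x[n_vars * n_factors:]
    return loadings, uniquenesses, result.fun  # result.fun is the minimised discrepancy
```

Applied to the sample correlation matrix \(R\) from the earlier sketch (`fit_ml_efa(R, n_factors=2)`), the minimised discrepancy value is the quantity the fit statistics discussed below are built from.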
One assumption of the common factor model is that scores on each observed variable are the result of one or more common factors, plus a unique factor. The common factors are assumed to have a linear causal effect on the observed variables. The common factor model is part of a larger class of models called effects indicator models, where latent variables are thought to cause variance in observed variables. This assumption is quite plausible in many cases. Alternatively, in some cases the causal direction may be reversed, where a construct could be considered to emerge from the combination of a range of observed variables. These are called causal indicator models. For instance, socio-economic status could be considered a construct that emerges from a range of measurable social and economic characteristics. One key difference between effects indicator models and causal indicator models is that in causal indicator models the observed variables need not be correlated with each other. Exploratory factor analysis is only appropriate in the context of effects indicator models.
Another assumption is that common factors have a linear effect on the observed variables. In many cases, this assumption of linearity might be approximately correct. However, it does not hold for non-interval data. Data from rating scales, such as Likert scales, are often assumed to be interval data, although this assumption may not always hold.
In the case of dichotomous observed variables, although you can model these variables using the basic common factor model, these items can produce difficulty factors that reflect variation in the endorsement rate of the variable, rather than the underlying construct being assessed. Special procedures can be used for this type of variable.
However, even when you have interval or ratio data, common factors can have non-linear effects on the observed variable. This can happen when an observed variable is only able to detect changes in the common factor at certain levels of the common factor. As a result, the relationship between the common factor and the observed variable would depend on the level of the common factor. When we have equal interval data, item-response theory can be used to help understand the relationship between the common factor and the observed variable.
Another source of non-linearity is a non-linear association between common factors and observed variables caused by interactions between common factors; i.e. where the effect of one common factor depends on the value of another common factor. Factor analysis, and more general models like structural equation models, assume that common factors are linearly additive. However, in some cases there might be compelling reasons to think that this isn’t the case.
In cases where non-linear effects appear to exist, you may be able to transform a variable so that its relationship with the common factor is approximately linear. There have also been advances in non-linear factor analytic methods.
One key bit of information provided by ML is the likelihood ratio test statistic. This is produced by multiplying the final ML discrepancy function value by \(N-1\) (where \(N\) is the sample size).
The likelihood ratio test statistic is an index of model fit, and tests the null hypothesis that the “common factor model with \(m\) [common] factors holds perfectly in the population (i.e., the model and the data do not differ from one another). A significant test statistic indicates that the hypothesis has been rejected and thus the model does not hold perfectly in the population.” However, this statistic has limitations. It is often too strict, since even a very good model is unlikely to be perfect, and it is highly sensitive to sample size (the larger the sample, the easier it is to reject a good model).
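As a sketch of the arithmetic (the numbers below are made up for illustration), the test statistic is just \((N-1)\) times the minimised discrepancy, compared against a chi-square distribution whose degrees of freedom for an exploratory model with \(p\) variables and \(m\) factors are \([(p-m)^2-(p+m)]/2\):

```python
from scipy.stats import chi2

def likelihood_ratio_test(f_min, n_obs, n_vars, n_factors):
    """Test of perfect fit from the minimised ML discrepancy function value."""
    test_stat = (n_obs - 1) * f_min
    df = ((n_vars - n_factors)**2 - (n_vars + n_factors)) // 2
    return test_stat, df, chi2.sf(test_stat, df)

# made-up values: minimised discrepancy of 0.012, N = 500, 6 variables, 2 factors
stat, df, p = likelihood_ratio_test(0.012, 500, 6, 2)
print(stat, df, p)  # statistic of about 6.0 on 4 df: no evidence against perfect fit
```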
Alternative descriptive fit indices are often recommended since they quantify the magnitude of the lack of model fit, rather than providing a simple dichotomous “perfect model/inadequate model” judgement. These descriptive fit indices are also used in confirmatory factor analysis and structural equation modelling. One indicator of model fit is the root mean square error of approximation (RMSEA), defined as:
\[\mathrm{RMSEA} =\sqrt{F_0/df}\] where \(F_0\) is the estimate of the population ML discrepancy function and \(df\) is the degrees of freedom of the model. RMSEA can be thought of as an index of the discrepancy between the model and the data per degree of freedom; 0 indicates a perfect fit, and larger values indicate a poorer fit.
RMSEA performs well in detecting misspecified models, and takes into account model parsimony (i.e. penalises more complicated models). Also, RMSEA has a known sampling distribution which allows for the calculation of confidence intervals around the RMSEA estimate. ML also allows us to calculate standard errors, confidence intervals, etc. for the model parameters.
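A small sketch of the RMSEA calculation, using the same made-up values as above. One common way to estimate \(F_0\) from the sample is to subtract from the minimised discrepancy the amount expected from sampling error alone, truncating at zero:

```python
import numpy as np

def rmsea(f_min, df, n_obs):
    """RMSEA from the minimised ML discrepancy function value."""
    f0_hat = max(f_min - df / (n_obs - 1), 0)  # estimated population discrepancy
    return np.sqrt(f0_hat / df)

print(round(rmsea(0.012, df=4, n_obs=500), 3))  # about 0.03, i.e. a close fit
```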
There are a wide range of ways of deciding how many factors should be extracted. When working with samples that contain error, it’s misguided to say we’re looking for the ‘true’ number of factors. Instead, the number of factors extracted is a judgement that balances conceptual utility (the number of factors that can be readily interpreted) and statistical utility (the number of factors needed to approximate the pattern of correlations among observed variables). The methods used to help make this judgement should satisfy a number of criteria: the model accounts for most of the correlation between observed variables; a model with fewer common factors is substantially worse at accounting for the correlations; a model with more factors doesn’t do much better at accounting for the correlations; and the common factors extracted can be interpreted and related to theoretical constructs.
One of the most commonly used methods is the Kaiser Criterion (or eigenvalues-greater-than-one rule). This approach has multiple known limitations, and so won’t be used or discussed further. Another commonly used approach is the scree plot, where the eigenvalues (from the sample correlation matrix) of each common factor are plotted, typically as a line graph. Essentially, you eye-ball the scree plot and see where the slope approximately levels out - where including one more common factor produces only a very small change in the eigenvalue. This approach has been criticised as being overly subjective, since it’s based on a judgement of “when the slope approximately levels out”. However, where there are strong common factors the scree plot approach works reasonably well.
Parallel analysis provides a more objective criterion than the scree plot. It compares eigenvalues from the sample data to eigenvalues from completely random data (with the same number of variables and observations). The appropriate number of common factors is the number of eigenvalues (from the reduced correlation matrix) from the real data that are larger than the corresponding eigenvalues from the random data. The eigenvalues for the random data are usually generated by simulation. However, parallel analysis is quite lenient in the number of factors it suggests and can lead to overfactoring, so it should be considered an upper bound on the number of common factors to extract. A slightly more conservative approach is to use minimum rank factor analysis within parallel analysis.
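Below is a minimal numpy sketch of the idea behind parallel analysis. For simplicity it compares eigenvalues of the full correlation matrix against the 95th percentile of eigenvalues from random normal data of the same shape; the reduced-correlation-matrix and minimum rank variants mentioned above are not implemented here.

```python
import numpy as np

def parallel_analysis(data, n_sims=100, percentile=95, seed=0):
    """Suggest a number of factors by comparing observed eigenvalues with
    eigenvalues obtained from random data of the same shape."""
    rng = np.random.default_rng(seed)
    n_obs, n_vars = data.shape
    observed = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]

    random_eigs = np.empty((n_sims, n_vars))
    for i in range(n_sims):
        random_data = rng.normal(size=(n_obs, n_vars))
        corr = np.corrcoef(random_data, rowvar=False)
        random_eigs[i] = np.sort(np.linalg.eigvalsh(corr))[::-1]
    threshold = np.percentile(random_eigs, percentile, axis=0)

    # count how many observed eigenvalues exceed the random-data threshold
    return int(np.sum(observed > threshold)), observed, threshold
```

Run on the simulated two-factor data from the earlier sketch, this should suggest two factors.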
When ML is used, we can also assess how many factors to extract based on the goodness of fit of the model, using the likelihood ratio test statistic. One approach is to start with a simple model, see whether the null hypothesis for the likelihood ratio test is rejected, and keep adding common factors until the statistic is non-significant. However, this approach is somewhat arbitrary and strict (as discussed above). A preferred method is to compare models of differing complexity using “difference tests”. You start with the simplest one-common-factor model, add another factor, then test whether the addition of the factor (statistically) significantly improves model fit. If the difference is not significant, the simpler model is retained. If it is significant, the process is repeated with an even more complex model, until the difference is non-significant. With large sample sizes, you can end up with more complex models even when the additional common factors don’t really explain much variance.
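The difference test itself is simple arithmetic on the two likelihood ratio statistics; a small sketch with made-up numbers:

```python
from scipy.stats import chi2

def chi_square_difference(stat_simple, df_simple, stat_complex, df_complex):
    """Compare a simpler factor model against a more complex one."""
    diff_stat = stat_simple - stat_complex
    diff_df = df_simple - df_complex
    return diff_stat, diff_df, chi2.sf(diff_stat, diff_df)

# made-up statistics: one-factor model (df = 9) versus two-factor model (df = 4)
print(chi_square_difference(62.3, 9, 6.0, 4))  # highly significant: keep the second factor
```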
Descriptive indices of model fit, including RMSEA, can also be used to assess the appropriate number of factors to extract. This is done by calculating the RMSEA for increasingly complex models. You then examine the RMSEA of each model, selecting the model that: fits the data well; fits the data better than the previous, simpler model; and fits the data about as well as the model with one more factor. Differences of 0.01 to 0.02 are marginally meaningful, while differences of less than 0.01 are not really meaningful. When making this judgement, you can also look at the confidence intervals - the broader the confidence intervals, the more caution is needed when interpreting differences between RMSEA values.
When you have large amounts of data, the stability of a factor solution can be a useful way of determining the appropriate number of factors. Imagine a situation where there are marginal differences between two models with different numbers of factors. If you have a large dataset, you can split the dataset and estimate the factor scores across each dataset using the different number of factors. A model that has stable results across the split datasets might be better supported.
Ultimately, a crucial criterion for how useful the model is, is how interpretable the results are and how well they align with a priori theory. There are a number of considerations when assessing the theoretical utility of the model. First, does it make sense that the observed variables associated with each common factor are associated with each other? Second, it’s OK if an observed variable loads on more than one factor, if it makes theoretical sense that the observed variable might be the result of multiple latent variables. Third, does a common factor have no strong factor loadings, or does it load heavily on only one observed variable? In the first case, the common factor may not really be that strong. In the second case, this indicates overfactoring, since a factor should influence multiple observed variables. Finally, when it’s difficult to find a single common theme among all the observed variables associated with a common factor, there may be underfactoring.
For any exploratory factor analysis model with more than one common factor, there is an infinite number of equally well fitting solutions; there is no unique solution. This means we have to select one solution from an infinite number of equally good solutions - this problem is called rotational indeterminacy.
Imagine we have a two factor model, based on 10 observed variables, and that these factors are independent. We can visualise this as:
Where the x-axis is the first factor, the y-axis is the second factor, and the points are the observed variables, with their factor loadings as coordinates in this space. The distance between observed variables reflects the strength and direction of the correlation among them: close points indicate a strong positive correlation, and distant points a strong negative correlation. The model fit is the extent to which the distances between observed variables in this space can reconstruct the observed correlations among them. If there is poor fit, it may be necessary to include another dimension so that the distances between points accurately reflect the actual correlations between observed variables.
However, the ability of this spatial representation to account for the correlations does not depend on the orientation of the two dimensions. You could spin the axes in any direction; the coordinates would change, but the relative distances between points would stay the same, so the model would have the same fit. Which orientation is most useful?
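Before answering that, here is a small numpy sketch of the invariance just described: rotating the (hypothetical) loading matrix by any orthogonal rotation changes the individual loadings, but leaves the common part of the model-implied correlation matrix, and therefore the fit, untouched.

```python
import numpy as np

# the hypothetical loading matrix used earlier
loadings = np.array([[0.8, 0.1], [0.7, 0.0], [0.6, 0.2],
                     [0.1, 0.7], [0.0, 0.8], [0.2, 0.6]])

# rotate the two factor axes by an arbitrary 30 degrees
theta = np.deg2rad(30)
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
rotated = loadings @ rotation

print(np.round(rotated, 2))  # the individual loadings change...
print(np.allclose(loadings @ loadings.T, rotated @ rotated.T))  # ...but this is True
```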
The idea of simple structure can be used to find factor analysis solutions that are plausible and interpretable. Thurstone, who developed this approach, suggested five general guidelines for a solution with simple structure.
In a simple structure solution, each common factor is represented by a subset of observed variables with high factor loadings, and each measured variable should only be influenced by a subset of common factors. In this approach, the axes should be rotated so as to maximise the simple structure. The figure above does not satisfy the simple structure criteria, since the observed variables load on both factors. However, if we rotate the axes, we do a much better job of satisfying those criteria:
Again, where the x-axis is the first common factor and the y-axis is the second. Now the simple structure of the solution is much better: the top-right cluster loads mostly on the first factor, and the lower cluster loads mostly on the second factor. However, this becomes less clear when we have weaker correlations between observed variables and more dimensions (common factors). As a result, it soon becomes necessary to automate the process, and a number of approaches have been developed to find the optimal rotation.
Orthogonal rotation assumes the common factors are independent. These approaches include varimax rotation, which generally performs well when the assumption of independence between factors is met.
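Varimax is a short, well-known algorithm; the sketch below is a textbook-style implementation rather than the exact routine used by any particular statistical package.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-6):
    """Orthogonal varimax rotation of a factor loading matrix."""
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # SVD step that nudges the rotation towards maximising the varimax criterion
        b = loadings.T @ (rotated**3 - rotated @ np.diag(np.sum(rotated**2, axis=0)) / p)
        u, s, vt = np.linalg.svd(b)
        rotation = u @ vt
        new_criterion = np.sum(s)
        if new_criterion < criterion * (1 + tol):
            break
        criterion = new_criterion
    return loadings @ rotation
```

Applied to an unrotated solution for the hypothetical data used earlier, it spins the axes so that each variable loads mainly on one of the two factors.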
However, in practice it is rare for common factors to be truly independent, which is where oblique rotation is used. It is important to note that if the common factors are truly independent, then oblique rotation will give the same results as orthogonal rotation. When the common factors are not independent, oblique rotation produces simpler structures than orthogonal rotation, as can be seen in the figure below:
In this figure, the orthogonal rotation is shown by the solid lines, which are orthogonal to each other. As a result, they fail to intersect with either cluster of observed variables, and so there will not be a simple structure. The dashed lines are the common factors when they are not assumed to be independent, and so can move freely to intersect with each cluster, providing a better simple structure. There are a number of oblique rotation methods available, including Promax, Orthoblique, and direct quartimin rotation (part of the family of direct oblimin rotation methods). We won’t discuss the differences between these methods. However, there are a couple of points to think about when interpreting any results following rotation.
First, the sign of a common factor is arbitrary - a measured variable might be positively or negatively loaded on the factor, but the absolute value of the loading will stay the same, since a factor can be flipped to point in either direction. Second, after rotation the ordering of the common factors can be arbitrary. Although the model fit, the total variance accounted for, and the communalities stay the same, rotation redistributes the variance accounted for across the factors, so the first factor may no longer explain more variance than the second. As a result, you can re-order the common factors after rotation.
Also, unlike with orthogonal rotation, following oblique rotation the factor loadings are no longer the correlations between common factors and observed variables. Instead, the factor loadings are effectively standardised partial regression coefficients. In other words, they are the standardised units of increase in the observed variable given one standardised unit of increase in the common factor (1 SD greater), controlling for the other common factors. As a result, they are not bound between -1 and 1, and squaring them does not give the proportion of variance in the measured variable accounted for by the common factor.
Three matrices are often reported by statistical software following oblique rotation: the pattern matrix, the structure matrix, and the factor correlation matrix. The pattern matrix is the rotated factor loading matrix, and represents the influence of each common factor on the observed variables, controlling for the other common factors. The structure matrix contains the correlations between the common factors and the measured variables, without controlling for the other common factors. The pattern matrix is often the one of greatest interest to researchers. The third matrix is the factor correlation matrix, corresponding to \(\Phi\), which can be interpreted as a matrix of correlations between the latent variables, after controlling for the unique factors (including measurement error).
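A small sketch of how these matrices relate (using made-up numbers): the structure matrix is just the pattern matrix post-multiplied by the factor correlation matrix, which is why the pattern and structure matrices coincide when the factors are uncorrelated.

```python
import numpy as np

# hypothetical pattern matrix (rotated loadings) and factor correlation matrix
pattern = np.array([[0.75,  0.05], [0.70, -0.02], [0.62, 0.10],
                    [0.04,  0.72], [-0.03, 0.78], [0.12, 0.63]])
phi = np.array([[1.0, 0.4],
                [0.4, 1.0]])

# structure matrix: correlations between each factor and each observed variable
structure = pattern @ phi
print(np.round(structure, 2))
```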
So far we’ve just been looking at equal interval data. However, what happens when you are using data that might not be equal interval, such as ordinal data from Likert scales? This is a very common problem with a relatively easy solution. Pearson’s correlations are typically used to estimate the correlation matrix, but using them requires the assumption of equal intervals. The polychoric correlation is instead the correlation [of the bivariate normal distribution] of latent variables (see this paper for more details). The assumption is that each observed ordinal variable is caused by a continuous latent variable, so you can look at the correlations between those latent variables, referred to as the polychoric correlations. You can then perform the factor analysis on the polychoric correlation matrix, assuming that each ordinal variable really is described by an underlying latent variable. Polychoric correlations are typically recommended for ordinal data, and the results will be the same as those from Pearson’s correlations if the data actually are equal interval (see here).
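In practice you would rely on an existing implementation of polychoric correlations (the psych package in R provides them, for example), but the sketch below shows the underlying idea for two ordinal variables: estimate thresholds on the latent normal variables from the marginal proportions, then choose the latent correlation that best explains the observed contingency table under a bivariate normal distribution. The helper names here are hypothetical.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import minimize_scalar

def polychoric(x, y):
    """Two-step polychoric correlation between two ordinal variables (illustrative sketch)."""
    x, y = np.asarray(x), np.asarray(y)

    def thresholds(v):
        # thresholds on the latent normal variable, from cumulative marginal proportions
        levels, counts = np.unique(v, return_counts=True)
        cum = np.cumsum(counts) / len(v)
        # +/- 8 standard deviations stand in for +/- infinity
        return levels, np.concatenate([[-8.0], norm.ppf(cum[:-1]), [8.0]])

    x_levels, tx = thresholds(x)
    y_levels, ty = thresholds(y)

    # observed contingency table of the two ordinal variables
    table = np.array([[np.sum((x == a) & (y == b)) for b in y_levels] for a in x_levels])

    def neg_log_lik(rho):
        # cell probabilities under a bivariate normal with correlation rho
        bvn = multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]])
        probs = np.empty(table.shape)
        for i in range(len(x_levels)):
            for j in range(len(y_levels)):
                probs[i, j] = (bvn.cdf([tx[i + 1], ty[j + 1]]) - bvn.cdf([tx[i], ty[j + 1]])
                               - bvn.cdf([tx[i + 1], ty[j]]) + bvn.cdf([tx[i], ty[j]]))
        return -np.sum(table * np.log(np.clip(probs, 1e-12, None)))

    return minimize_scalar(neg_log_lik, bounds=(-0.99, 0.99), method="bounded").x
```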