3/25/2020

Dimension Reduction

  • As social and behavioral researchers, we often collect a whole bunch of information from our participants.

  • Every single piece of information we collect, including responses to each survey item for each scale, or every keystroke in a computerized cognitive test, may be a valuable nugget when it comes to predicting our outcomes of interest.

  • How do we know what pieces of information to keep? And are we really interested in every single piece of information, or are we more interested in what those pieces of information represent on a larger level?

Dimension Reduction

  • The concept of dimension reduction is that we can gather up the most important evidence from all these little bits of information, and organize it into something more manageable.

  • Specifically, we organize it by grouping these bits with other like items, thereby reducing the number of small bits to juggle and to interpret.

  • Items that cluster together are considered to be related to a common underlying factor, sometimes called a latent variable.

  • Having fewer items is beneficial because theories and statistical analyses alike become less unwieldy and more interpretable.

Dimension Reduction

  • This week: principal components analysis (PCA) and exploratory factor analysis (EFA)

  • Next week: confirmatory factor analysis (CFA)

  • These are all methods of dimension and item reduction

  • In addition to just data reduction, these methods can be used to understand the general structure of a set of variables and the way they relate to an underlying construct, or to help identify "good" items for a scale development project

Revisit Covariance

  • Data reduction is a process of grouping items/variables based on similarities

  • Specifically, we are very interested in how the individual items correlate or covary

  • Factor analysis and PCA methods use information from the variance/covariance matrix (similar to a correlation matrix) to find common factors
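
  • As a concrete starting point, here is a minimal Python sketch of the correlation matrix these methods work from (the data frame and item names are hypothetical, purely for illustration):

```python
import pandas as pd

# Hypothetical responses: one row per participant, one column per survey item.
items = pd.DataFrame({
    "worth1":   [4, 5, 3, 4, 2, 5],
    "worth2":   [4, 4, 3, 5, 2, 5],
    "ability1": [2, 3, 4, 2, 5, 3],
    "ability2": [2, 2, 5, 1, 5, 3],
})

# The correlation matrix (a standardized variance/covariance matrix) is the
# raw material that EFA and PCA use to find common factors or components.
print(items.corr())
```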

Factor Analysis

  • Factor analysis can be further broken down into Exploratory and Confirmatory Factor Analysis, or EFA and CFA, respectively.

  • Exploratory factor analysis derives from Baconism and inductive reasoning; CFA can be thought of as a derivative of Hypotheticism and the idea of useful testable hypotheses (more on this next week)

  • While these approaches differ philosophically, the models for EFA and CFA are mathematically equivalent

  • Despite the mathematical equivalence, the steps to analysis in EFA and CFA are very different

Theory and Process of EFA

  • The defining aspect of EFA is that it does not require a priori hypotheses about the structure of the data

  • Statistically speaking, this means that a researcher does not need to specify a particular model; he or she uses this exploratory approach to discover factors from the data

  • EFA is therefore not defined by a particular single analysis, but more by the process by which the researcher approaches data analysis for a set of items. Let's walk through this exploratory process…

Step 1: Consider the Population and Constructs of Interest

Before EFA can begin, the researcher must select variables and subjects for analysis

  • include reliable variables

  • each variable should correlate with at least some of the other variables in the selection

  • the sample for analysis should be representative of the population and fully representative of the construct of interest. That is, a researcher should hope to have participants at both extremes of the constructs they might expect to discover with the EFA
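
  • If you want a quick empirical check that your selected variables correlate with at least some of the others, a sketch like the following may help. It assumes the third-party factor_analyzer package and the hypothetical items data frame from earlier:

```python
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

# Bartlett's test of sphericity: does the correlation matrix differ from an
# identity matrix (i.e., do the items correlate with each other at all)?
chi_square, p_value = calculate_bartlett_sphericity(items)

# Kaiser-Meyer-Olkin measure of sampling adequacy; values above roughly .60
# are often treated as acceptable for factor analysis.
kmo_per_item, kmo_overall = calculate_kmo(items)

print(p_value, kmo_overall)
```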

Step 2: Select a Rotation Method

  • Because no model is specified a priori, the exploratory methods attempt to “find” an optimal model with what are called rotation procedures

  • These are particular algorithms that will be used to discover the ‘optimal’ EFA model by searching for relationships in the data in some systematic way.

  • Orthogonal (restrictive) factor rotation methods search for common factors amongst the items, but require that the factors are unrelated or uncorrelated with each other

  • Oblique (non-restrictive) methods of rotation search for common underlying factors, but allow for those factors to be correlated

  • In general, I recommend using non-restrictive methods
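
  • In software, the rotation is typically just an argument you pass when fitting the model. A minimal sketch, again assuming the factor_analyzer package and the hypothetical items data frame:

```python
from factor_analyzer import FactorAnalyzer

# Orthogonal rotation: factors are forced to be uncorrelated.
fa_orthogonal = FactorAnalyzer(n_factors=2, rotation="varimax")
fa_orthogonal.fit(items)

# Oblique (non-restrictive) rotation: factors are allowed to correlate.
fa_oblique = FactorAnalyzer(n_factors=2, rotation="oblimin")
fa_oblique.fit(items)
```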

Step 3: Run the Models

  • In EFA, you would typically run multiple models: for example, a 1-factor model, a 2-factor model, a 3-factor model, and so on (a code sketch appears at the end of this slide)

  • Next, I describe some popular approaches to selecting a final "best" model from these several models
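
  • A sketch of what running multiple candidate models can look like in code (same assumptions as the earlier sketches):

```python
from factor_analyzer import FactorAnalyzer

# Fit 1-, 2-, 3-, and 4-factor candidate models and keep them for comparison.
models = {}
for k in range(1, 5):
    fa = FactorAnalyzer(n_factors=k, rotation="oblimin")
    fa.fit(items)
    models[k] = fa
```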

Step 4: Model Selection

  • For each model you run (2-factor, 3-factor, and so on), the ‘optimal’ version of that model is estimated from the inter-item correlations and the chosen rotation method, and the model output is presented

  • Now, you make observations about the factors and factor loadings in order to define the factor structure (i.e. how many factors are minimally sufficient to account for item variance, and which items belong to which factors)

  • Remember the goal is to select the simplest model that sufficiently explains the data and makes theoretical sense

  • It is best practice to decide ahead of time what criteria you will use to evaluate your model and make decisions about which model is the best model. In the following slides, I go over some options

Model Selection: Fit Statistics

Recall the goal: explain a sufficient amount of item variance using common factors. If a model does this well, the fit statistics should attest to it

  • RMSEA (Root Mean-Square Error of Approximation): represents the overall model misfit per degree of freedom, penalizing overly complex models. Lower values of RMSEA are desired. Some sources recommend RMSEA under .10; others use a cutoff of .05. A sound approach is to use the confidence interval, and if it includes .05, take this as evidence of good fit (one common formula is sketched after this list).

  • AIC and BIC (Akaike's/Bayesian Information Criterion): When comparing successive models (e.g., a 2-factor vs. a 3-factor model), you should prioritize the model with the lower AIC and BIC. A big drop in AIC from the 3-factor to the 2-factor model is usually good evidence that the 2-factor model is preferable, because it is simpler and still sufficiently explanatory.
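
  • For reference, one common formulation of RMSEA, where χ² is the model chi-square, df its degrees of freedom, and N the sample size (some software uses N rather than N − 1):

```latex
\mathrm{RMSEA} = \sqrt{\max\left(\frac{\chi^{2} - df}{df\,(N - 1)},\ 0\right)}
```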

Model Selection: Eigenvalues

  • Each model output includes factor eigenvalues, representing the total variance explained by a given factor

  • For the sake of simplicity and interpretability, you may choose to go with the factor structure that excludes factors with small eigenvalues.

  • The Kaiser rule states that factors with eigenvalues < 1 should be dropped
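
  • The eigenvalues can be computed directly from the item correlation matrix; a minimal sketch using numpy and the hypothetical items data frame:

```python
import numpy as np

# Eigenvalues of the item correlation matrix, sorted largest to smallest.
corr = items.corr().to_numpy()
eigenvalues = np.linalg.eigvalsh(corr)[::-1]

# Kaiser rule: count the factors whose eigenvalues exceed 1.
n_factors_kaiser = int((eigenvalues > 1).sum())
print(eigenvalues, n_factors_kaiser)
```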

Model Selection: Eigenvalues

  • One way to study the eigenvalues in EFA is to look at a scree plot. A scree plot displays the eigenvalues for each factor, ordered from largest to smallest

  • People often use these scree plots to make decisions about how many factors to retain. This can be seen visually by locating the 'elbow' in the plot. Factors after the 'elbow' are not adding much explained variance
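
  • A scree plot is straightforward to produce once you have the eigenvalues; a minimal matplotlib sketch, continuing from the eigenvalues computed in the previous sketch:

```python
import matplotlib.pyplot as plt

factor_numbers = range(1, len(eigenvalues) + 1)

plt.plot(factor_numbers, eigenvalues, marker="o")
plt.axhline(1, linestyle="--")   # reference line for the Kaiser rule
plt.xlabel("Factor number")
plt.ylabel("Eigenvalue")
plt.title("Scree plot")
plt.show()
```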

Model Selection: Factor Loadings

  • Next, you may look at the item factor loadings, which represent the strength of each item's relationship with each factor (for orthogonal solutions, these are the item-factor correlations).

  • A common recommendation is to use .40 as a cut-off for determining which items 'belong' to which factor. (As always, I don't suggest you adhere strictly to this cut-off, but use it as a guiding principle when making decisions)

  • If every item clearly loads onto only a single factor (every item has a high loading on one factor and low loadings on all other factors), you have achieved what is called a simple structure.

  • Simple structures are preferred, unless you have reason to believe that an item should load onto multiple factors. This is called cross-loading.
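
  • A sketch of inspecting loadings against the .40 guideline (assuming factor_analyzer; loadings_ holds one row per item and one column per factor):

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer

fa = FactorAnalyzer(n_factors=2, rotation="oblimin")
fa.fit(items)

loadings = pd.DataFrame(fa.loadings_, index=items.columns,
                        columns=["Factor1", "Factor2"])
print(loadings.round(2))

# Under the .40 guideline, exactly one True per row suggests simple structure;
# more than one True in a row is a cross-loading.
print(loadings.abs() > 0.40)
```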

Model Selection: Final Decision-Making

  • After evaluating the eigenvalues, fit statistics, and factor loadings, you should have a feel for what factor structure(s) best explain the variances and covariances in your data.

  • In summary, you are looking for simple structures with low RMSEA values and relatively low AIC and BIC values, and you are looking for the model with the smallest number of interpretable factors.

  • You are the researcher, so it is important that you also consider your theory! Don't just rely blindly on the numbers.

  • After choosing a model, you can name your factors and interpret the model! Cool!

Some Notes of Caution about EFA

CAUTION:

  • In the EFA framework, the researcher draws theoretical inferences about what the factors represent, based on the results of the model. As such, the researcher is making post-hoc hypotheses about the data. A model is not useful if it doesn’t make sense theoretically, and the researcher is responsible for making sure this is the case.

  • Further, we must take caution when interpreting results in factor analysis because factor scores under this approach are known to be indeterminate, meaning that for any given factor, identical scores could have been produced from many different patterns of raw data. This could be a problem for our theory.

Principal Components Analysis (PCA)

  • PCA is the most rudimentary form of dimension reduction, and was the most common data reduction method through the 1990s and arguably later

  • The main distinction between factor analysis and PCA is in the assumptions we make about item variance

Factor Analysis Item Variance Assumptions

  • FA assumes that every item has a total variance that can be parsed into unique variance and common variance.

  • The model assumes that items have other unique variances that are not accounted for by the factor structure.

  • Example: Rosenberg Self-Esteem Scale item, "I feel I am able to do most things as well as other people." – people may vary in their responses to this item for reasons other than just their self-esteem levels.

  • If these other factors aren't included in the factor structure, the variance due to these things is just unexplained item variance, or unique variance.
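
  • In factor-analytic notation (for standardized items and orthogonal factors), an item's common variance, or communality, is the sum of its squared loadings λ, and whatever is left over is its unique variance:

```latex
h_{i}^{2} = \sum_{k} \lambda_{ik}^{2}, \qquad u_{i}^{2} = 1 - h_{i}^{2}
```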

PCA Item Variance Assumptions

  • In PCA, we assume that the total variance is common variance, suggesting that 100% of the variance in the items can be explained by the factors or latent variables; that is, there are no unique variances.

  • This assumption may sometimes be tenable, but it rarely is in psychology.
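
  • A sketch of this contrast in practice, using scikit-learn's PCA and FactorAnalysis estimators (the items data frame is still hypothetical): PCA spreads all of the item variance across the components, while factor analysis estimates a separate unique ("noise") variance for every item.

```python
from sklearn.decomposition import PCA, FactorAnalysis

# PCA: all item variance is treated as common, so the explained variance
# ratios sum to 1.0 when every component is retained.
pca = PCA().fit(items)
print(pca.explained_variance_ratio_.sum())

# Factor analysis: each item gets its own estimated unique variance,
# which is not attributed to the common factors.
fa = FactorAnalysis(n_components=2).fit(items)
print(fa.noise_variance_)
```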

Principal Components Analysis (PCA)

  • Because PCA and FA may sometimes lead to different inferences, I recommend the use of factor analysis over PCA.

  • You will find that not all researchers agree on this advice. For example, some argue that PCA is preferred because it doesn't have the same problems of factor indeterminacy as factor analysis.