Principal Component Analysis and Exploratory Factor Analysis

November 20, 2017

Outline

Principal Component Analysis (PCA) and Exploratory Factor Analysis (EFA) are both data reduction techniques
- Both linearly combine measured variables to create a smaller set of composite scores
Each method:
1. Takes several measured variables (assumed to be related)
2. Mathematically combines them into a smaller set of composite scores
  - Composite scores (hopefully) capture a substantial share of the variance
3. Produces composite scores that are used in subsequent analyses

In PCA the composite variables are called components
In EFA the composite variables are called factors
The math used to derive components and factors is similar (but not the same)
The researcher determines the number of components/factors to retain and whether they are interpretable

The theories underlying PCA and EFA are drastically different, which changes how you interpret results
- In PCA, scores on the measured variables are assumed to cause the component
- In EFA, factors are latent constructs that cause the scores on the observed variables
Because of these theoretical differences, PCA is more commonly used with "objectively" measured variables (e.g., gene expression) and EFA with self-report/patient-reported outcomes (e.g., quality of life).
- PCA assumes no measurement error
- EFA explicitly models measurement error

Components in PCA are orthogonal by construction, while factors are (generally) allowed to be correlated
These theoretical differences result in mathematical differences

Say you have a study where you are trying to predict breast cancer diagnosis in a sample of 1000 breast cancer survivors using gene expression from a microarray.
The array gives you hundreds of gene expressions, but you can't put all of those into a regression
- Not all of them may be important, and including them all may decrease power
- Multicollinearity may destabilize your model
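
To make the multicollinearity concern concrete, here is a minimal sketch using simulated data. The three "genes", the noise level, and the seed are all hypothetical choices for illustration, not values from a real microarray.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000

# Hypothetical expression levels for 3 genes that rise and fall together.
shared = rng.normal(size=n)
genes = np.column_stack([shared + 0.05 * rng.normal(size=n) for _ in range(3)])

corr = np.corrcoef(genes, rowvar=False)

# The condition number grows as predictors become nearly redundant; large
# values signal unstable regression coefficients.
cond = np.linalg.cond(corr)

print(np.round(corr, 3))
print("condition number:", round(float(cond), 1))
```

When predictors are this strongly correlated, replacing them with a few composite scores is one way to keep their shared information without the instability.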

\(C_m = a_{1}X_1 + a_{2}X_2 + \dots + a_{p}X_p\)

where:

\(C_m\) is the principal component that we are reducing the measured gene expressions into
\(X_j\) are the measured gene expressions (\(j = 1, \dots, p\))
\(a_j\) are weights relating each gene expression to the principal component

PCA uses a linear combination of variables to create components
- A component is (essentially) a weighted sum of the measured variables
Mathematically we want components that capture the most variance
- Done by choosing optimal weights (variables that contribute little receive weights near zero)
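
A minimal sketch of how the weights can be obtained, assuming simulated data and one common approach: eigendecomposition of the sample covariance matrix. The data, seed, and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 observations of 3 variables, two of them related.
n = 200
shared = rng.normal(size=n)
X = np.column_stack([
    shared + 0.3 * rng.normal(size=n),
    shared + 0.3 * rng.normal(size=n),
    rng.normal(size=n),
])

# Center each variable, then eigendecompose the sample covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; reorder so the first
# component captures the most variance.
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
weights = eigvecs[:, order]      # column m holds the weights for component m

# Component scores: each is a weighted sum of the centered variables.
scores = Xc @ weights

print("variance of each component:", scores.var(axis=0, ddof=1))
print("eigenvalues:               ", eigvals)
```

The variance of each component score equals the corresponding eigenvalue, which is why sorting by eigenvalue orders the components by variance captured.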

Have p measured variables and n observations, with n > p
- Maximum number of components that can be retained = p
- Retain only components that explain a substantial amount of variance
The k extracted components span a k-dimensional Euclidean space
Components are orthogonal to each other
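
The retention and orthogonality points can be sketched as follows. The simulated data and the 80% cumulative-variance cutoff are assumptions chosen for illustration; the cutoff is only one of several possible retention rules.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: n = 300 observations, p = 5 variables driven by 2 sources.
n, p = 300, 5
sources = rng.normal(size=(n, 2))
mixing = rng.normal(size=(2, p))
X = sources @ mixing + 0.2 * rng.normal(size=(n, p))

Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # largest first

# At most p components exist; each explains a share of the total variance.
explained = eigvals / eigvals.sum()
cumulative = np.cumsum(explained)

# Illustrative retention rule (an assumption): keep the smallest number of
# components whose cumulative share reaches 80%.
k = int(np.searchsorted(cumulative, 0.80) + 1)

# Covariance of the component scores: off-diagonal entries are ~0,
# meaning the components are uncorrelated with each other.
score_cov = np.cov(Xc @ eigvecs, rowvar=False)

print("share of variance per component:", np.round(explained, 3))
print("components retained:", k)
```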

Sample size
- Contrary to popular opinion, the required sample size depends on how strongly the measured variables relate to the components, not on the number of items you have
- If you don't know how the variables relate to the components, aim for a sample size of about 300
- Need more subjects than measured variables
Variables need to be on the same scale (standardize them if they are not)
Sometimes PCA isn't useful/doesn't work
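
A short sketch of why scale matters, using made-up variables on very different scales. The variable names, means, and standard deviations are hypothetical; `first_component_weights` is a helper defined here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

# Hypothetical variables: two related measurements plus one unrelated
# variable with a much larger variance.
height_cm = rng.normal(170, 10, size=n)
weight_kg = 0.5 * (height_cm - 170) + rng.normal(70, 5, size=n)
income_usd = rng.normal(50_000, 15_000, size=n)
X = np.column_stack([height_cm, weight_kg, income_usd])

def first_component_weights(data):
    """Weights of the first principal component of `data`."""
    centered = data - data.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    return eigvecs[:, -1]        # eigh sorts ascending; last is largest

# On the raw scale the first component is almost entirely income.
raw_w = first_component_weights(X)

# z-score each variable so all have mean 0 and standard deviation 1.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
std_w = first_component_weights(Z)

print("raw weights:         ", np.round(raw_w, 3))
print("standardized weights:", np.round(std_w, 3))
```

After standardizing, the first component reflects the shared variation in the two related variables rather than whichever variable happens to have the largest units.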

In EFA we are less interested in reducing the number of variables than in measuring a latent (unobserved) construct
We want to see whether the observed variables are actually measuring the latent construct(s)
- E.g., do these variables "hang" together the way we hypothesized?
This is the first step in validating a self-report measure
PCA has roots in the classical statistical tradition (general linear model), which operates on observed variables and assumes no measurement error
EFA comes from the psychometric tradition and assumes all observed scores come from unobserved common factors

You want to develop a measure of fatigue in cancer patients
You've created an 8-item measure and want to see whether all of the items measure a single latent construct

\(X_j = a_{j1}F_1 + a_{j2}F_2 + ... + a_{jm}F_m + d_jU_j\)

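where:

\(X_j\) is the observed score on item j
\(F_1, \dots, F_m\) are the common factors
\(a_{j1}, \dots, a_{jm}\) are the weights (loadings) relating item j to each factor
\(U_j\) is the unique factor for item j, with weight \(d_j\)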

In PCA all variance due to scores, uniqueness, and error goes into the weights for the variables
- In EFA we have separate weights for the scores (\(a\)) and for the uniqueness (\(d\)), which is the variation unique to variable j plus the error in j
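
To illustrate the split between common and unique variance, here is a sketch that simulates an 8-item measure from a single-factor version of the model. The loadings, sample size, and seed are hypothetical, and the uniqueness weights are chosen so that each item has variance 1.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000                         # large n so sample values sit near theory

# Hypothetical loadings for 8 items on a single factor.
a = np.array([0.8, 0.7, 0.7, 0.6, 0.6, 0.5, 0.5, 0.4])
d = np.sqrt(1 - a**2)            # uniqueness weights so that Var(X_j) = 1

F = rng.normal(size=n)                 # latent common factor (unobserved)
U = rng.normal(size=(n, a.size))       # unique factors, one per item
X = F[:, None] * a + U * d             # X_j = a_j F + d_j U_j

# Under this model the correlation between items j and k is a_j * a_k.
implied = np.outer(a, a)
np.fill_diagonal(implied, 1.0)

sample = np.corrcoef(X, rowvar=False)
max_gap = np.abs(sample - implied).max()

print("largest gap between sample and implied correlations:", round(max_gap, 3))
```

Only the common factor produces correlation between items; the unique part adds variance to each item without making items more alike.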

Steps for EFA are very similar to those for PCA
- The math is different to account for uniqueness
Using EFA
- Step 1: Run EFA using items from the measure
- Step 2: Determine the number of factors to retain
- Step 3: Ensure interpretability of the factors from Step 2
  - If factors are uninterpretable or items do not load, remove items one at a time and repeat Steps 1 and 2
- Step 4: Rerun EFA extracting only the number of factors settled on in Step 3
- Step 5: Collect data in a different sample (same population) and conduct a Confirmatory Factor Analysis
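
The extraction steps above can be sketched in code under several assumptions. The data are simulated from a single factor with hypothetical loadings, the number of factors is chosen with the eigenvalue-greater-than-1 heuristic, and extraction uses iterated principal axis factoring. None of these choices come from the slides; they are one common option each, and `principal_axis` is a helper written for this example. Interpretability checks (Step 3) and the confirmatory analysis (Step 5) are not shown.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5000

# Simulate 8 items from one factor (hypothetical loadings).
true_loadings = np.array([0.8, 0.8, 0.7, 0.7, 0.7, 0.6, 0.6, 0.6])
F = rng.normal(size=n)
U = rng.normal(size=(n, true_loadings.size))
X = F[:, None] * true_loadings + U * np.sqrt(1 - true_loadings**2)
R = np.corrcoef(X, rowvar=False)

# Step 2 (one heuristic, an assumption here): count eigenvalues of R above 1.
eigvals = np.linalg.eigvalsh(R)[::-1]
n_factors = int((eigvals > 1).sum())

def principal_axis(R, k, n_iter=200, tol=1e-8):
    """Iterated principal axis factoring; returns a (p, k) loading matrix."""
    # Start from squared multiple correlations as communality estimates.
    h2 = 1 - 1 / np.diag(np.linalg.inv(R))
    for _ in range(n_iter):
        reduced = R.copy()
        np.fill_diagonal(reduced, h2)      # replace 1s with communalities
        vals, vecs = np.linalg.eigh(reduced)
        vals, vecs = vals[::-1][:k], vecs[:, ::-1][:, :k]
        loadings = vecs * np.sqrt(np.clip(vals, 0, None))
        new_h2 = (loadings**2).sum(axis=1)
        if np.abs(new_h2 - h2).max() < tol:
            break
        h2 = new_h2
    return loadings

# Steps 1 and 4: extract only the retained number of factors.
loadings = principal_axis(R, n_factors)
# Eigenvector signs are arbitrary; flip so loadings are mostly positive.
loadings *= np.sign(loadings.sum(axis=0))

print("factors retained:", n_factors)
print("estimated loadings:", np.round(loadings[:, 0], 2))
```

In practice a dedicated statistics package would normally be used for these steps; the sketch is only meant to show what the steps compute.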

Similar to PCA
Sample size:
- About 250-300 subjects
- Still dependent on how strongly the measured variables are related to the factors
We are more concerned with how items relate to factors in EFA than in PCA
- Lots of statistics need to be checked for model fit
Need at least 3 items per factor
Unlike components in PCA, factors are not assumed to be orthogonal

Think of factors as houses
- The items are the materials
- Factor analysis is the blueprint
- Factor loadings and items are a synergistic representation of factors
- Cannot use different/fewer items and have the same factor
Factors are only valid in the population in which you validated the measure
- Need to establish "measurement invariance" in other populations
Validating a measure is a long and tedious process
- EFA is the first in a series of steps

While these techniques are fairly easy to implement, the theory and math behind them are complex
Particularly with EFA, make sure you work closely with your analyst to ensure stats and theory are aligned
With categorical variables, different techniques are used (and are more complicated)