November 20, 2017

Outline

  • Overview
    • Similarities
    • Differences

  • Principal Component Analysis
    • Example
    • Theory
    • Math

  • Exploratory Factor Analysis
    • Example
    • Theory
    • Math

  • Sample size

Objective

  • Describe when you should use PCA and EFA
    • Do so by explaining how they work
  • Focus on the big picture

Overview (similarities)

  • Principal Component Analysis (PCA) and Exploratory Factor Analysis (EFA) are both data reduction techniques
    • Both linearly combine measured variables to create a smaller set of composite scores

  • Each method:
    1. Takes several measured variables (assumed to be related)
    2. Mathematically combines them into a smaller set of composite scores
      • Composite scores (hopefully) capture a significant amount of variance
    3. Composite scores are used in subsequent analyses

Overview (similarities)

  • In PCA, the composite variables are called components

  • In EFA, the composite variables are called factors

  • The math used to derive components and factors is similar (but not the same)

  • Researcher determines the number of components/factors to retain and determines if they are interpretable

Overview (differences)

  • The theories underlying PCA and EFA are drastically different, which changes how you interpret results
    • In PCA, scores on the measured variables are assumed to cause the component
    • In EFA, factors are latent constructs that cause the scores on the observed variables

  • Due to theoretical differences, PCA is more commonly used with "objectively" measured variables (e.g., gene expressions) and EFA is used with self-report/patient-reported outcomes (e.g., quality of life).
    • PCA assumes no measurement error
    • EFA explicitly models measurement error

Overview (differences)

  • Components in PCA are assumed to be orthogonal while factors are (generally) assumed to be correlated

  • These theoretical differences result in mathematical differences

Principal Component Analysis (Example)

  • Say you have a study where you are trying to predict breast cancer diagnosis in a sample of 1000 breast cancer survivors using gene expression from a microarray.

  • The array gives you hundreds of gene expressions, but you can't put all of those into a regression
    • Not all of them may be important, and including them all may decrease power
    • Multicollinearity may distort your model

Principal Component Analysis (Theory)

  • \(c\) is the principal component that we are reducing the measured gene expressions into

  • \(x_j\) are the measured gene expressions

  • \(a_j\) are weights relating the gene expression to the principal component

Principal Component Analysis (Mathematics)

\(C_m = a_{1m}X_1 + a_{2m}X_2 + ... + a_{pm}X_p\)

where:
    • \(X_j\) is standardized measured variable j (there are p measured variables)
    • \(C_m\) is component m (there are k principal components)
    • \(a_{jm}\) is the weight of measured variable j on component m
  • PCA uses a linear combination of variables to create components
    • Component is (essentially) a weighted average of measured variables

  • Mathematically we want components that capture the most variance
    • Done by selecting optimal variables and optimal weights
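
To make the weighted-sum idea above concrete, here is a minimal NumPy sketch. The data are simulated and every name in it (`X_raw`, `weights_first`, `C1`) is hypothetical. It assumes the weights are taken from the eigenvectors of the correlation matrix, which is one standard way to compute PCA and not necessarily what any particular software package does internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: n = 200 observations on p = 4 correlated variables.
n, p = 200, 4
shared = rng.normal(size=(n, 1))
X_raw = shared + 0.5 * rng.normal(size=(n, p))

# Standardize each measured variable (mean 0, variance 1).
X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0, ddof=1)

# Eigen-decompose the correlation matrix; the eigenvectors hold the weights.
R = np.corrcoef(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]  # largest variance first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# The first component is a weighted sum of the standardized variables.
weights_first = eigenvectors[:, 0]
C1 = X @ weights_first

# The variance of each component equals its eigenvalue, so the share of
# total variance captured is eigenvalue / p.
explained = eigenvalues / eigenvalues.sum()
print("weights on component 1:", np.round(weights_first, 3))
print("proportion of variance explained:", np.round(explained, 3))
```

Because the simulated variables share one strong source of variation, the first component should capture most of the variance here; with real data the split across components can be far less clean.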

Principal Component Analysis (Mathematics)

  • Can have n > p, where n is the number of observations and p is the number of measured variables
    • Maximum number of potential components retained = p
    • Retain only components that explain a substantial amount of variance

  • The k extracted components span a k-dimensional Euclidean space

  • Components are orthogonal to each other
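
The orthogonality claim can be checked numerically. The sketch below uses simulated data with hypothetical names and assumes the eigen-decomposition route to PCA. It extracts all p components and confirms that the weight vectors are orthonormal and the component scores are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 300, 5

# Hypothetical correlated data: independent noise mixed through a random matrix.
X_raw = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0, ddof=1)

R = np.corrcoef(X, rowvar=False)
eigenvalues, W = np.linalg.eigh(R)  # columns of W are the weight vectors
scores = X @ W                      # scores on all p components

# Weight vectors are orthonormal: W'W is the identity matrix.
weights_orthogonal = np.allclose(W.T @ W, np.eye(p))

# Component scores are uncorrelated: their covariance matrix is diagonal,
# with the eigenvalues (component variances) on the diagonal.
score_cov = np.cov(scores, rowvar=False)
scores_uncorrelated = np.allclose(score_cov, np.diag(eigenvalues), atol=1e-8)

print(weights_orthogonal, scores_uncorrelated)
```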

Principal Component Analysis (Steps)

  • Use PCA
    • Step 1: Run PCA using gene expressions
    • Step 2: Determine number of components to retain
    • Step 3: Ensure interpretability of components from Step 2
      • If components are uninterpretable or don't capture variance, PCA may not be appropriate
    • Step 4: Rerun PCA extracting only the number of components from Step 2
    • Step 5: Use component scores from Step 4 in a regression of breast cancer diagnosis on the component scores

  • There are a lot more complicated statistical issues I'm not covering here
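
The five steps can be sketched end to end in NumPy. Everything here is an assumption made for illustration: the data are simulated rather than real microarray values, the names (`genes`, `diagnosis`, `scores`) are hypothetical, the eigenvalue-greater-than-1 rule is just one common retention heuristic, and the logistic regression is fit with a hand-written Newton loop instead of a statistics package.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for the microarray: 1000 subjects, 20 "genes"
# driven by 2 underlying sources of variation plus noise.
n, p = 1000, 20
sources = rng.normal(size=(n, 2))
genes = sources @ rng.normal(size=(2, p)) + rng.normal(size=(n, p))
# Hypothetical binary outcome related to the first source.
diagnosis = rng.binomial(1, 1 / (1 + np.exp(-1.5 * sources[:, 0])))

# Step 1: run PCA on the standardized variables.
Z = (genes - genes.mean(axis=0)) / genes.std(axis=0, ddof=1)
eigenvalues, W = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
eigenvalues, W = eigenvalues[order], W[:, order]

# Step 2: decide how many components to retain. Eigenvalue > 1 is an
# assumed heuristic; scree plots and parallel analysis are alternatives.
k = int((eigenvalues > 1).sum())

# Step 3: interpretability is a judgment call; inspect the weights W[:, :k].

# Step 4: keep only the first k components and compute their scores.
scores = Z @ W[:, :k]

# Step 5: logistic regression of diagnosis on the scores (Newton's method).
design = np.column_stack([np.ones(n), scores])
beta = np.zeros(k + 1)
for _ in range(25):
    mu = 1 / (1 + np.exp(-(design @ beta)))
    gradient = design.T @ (diagnosis - mu)
    hessian = design.T @ (design * (mu * (1 - mu))[:, None])
    beta += np.linalg.solve(hessian, gradient)

accuracy = ((design @ beta > 0).astype(int) == diagnosis).mean()
print("components retained:", k, "in-sample accuracy:", round(accuracy, 3))
```

In practice Step 3 cannot be automated, and the in-sample accuracy printed here is only a sanity check, not an estimate of how well the model would predict new patients.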

Practical Issues in PCA

  • Sample size
    • Contrary to popular opinion, the required sample size depends on how strongly the measured variables relate to the components, not on the number of items you have
    • If you don't know how variables are related to components, a common rule of thumb is a sample size of about 300
    • Need more subjects than measured variables

  • Variables need to be on the same scale

  • Sometimes PCA isn't useful/doesn't work
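
The same-scale point above can be illustrated with a small simulation. The variables and names are hypothetical. The sketch compares PCA on raw variables (covariance matrix) with PCA on standardized variables, assuming the eigen-decomposition approach.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500

# Hypothetical variables: two correlated measures on a small scale and one
# unrelated measure recorded in much larger units.
a = rng.normal(size=n)
b = 0.8 * a + 0.6 * rng.normal(size=n)
c = 1000 * rng.normal(size=n)
X = np.column_stack([a, b, c])

def first_component_weights(data):
    """Weights of the first principal component of the covariance matrix."""
    eigenvalues, eigenvectors = np.linalg.eigh(np.cov(data, rowvar=False))
    return eigenvectors[:, -1]  # eigh returns eigenvalues in ascending order

# Without standardizing, the large-unit variable dominates the component.
raw_weights = first_component_weights(X)

# After standardizing, the component reflects the shared variance of a and b.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
std_weights = first_component_weights(Z)

print("raw:", np.round(np.abs(raw_weights), 3))
print("standardized:", np.round(np.abs(std_weights), 3))
```

Absolute values are printed because the sign of an eigenvector is arbitrary.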

Exploratory Factor Analysis (Theory)

  • In EFA we are less interested in reducing the number of variables than in measuring a latent (unobserved) construct

  • We want to see if the observed variables are actually measuring the latent construct(s)
    • E.g., do these variables "hang" together the way we hypothesized

  • This is the first step in validating a self-report measure

  • PCA has roots in classical statistical tradition (general linear model) which operates on observed variables and assumes no measurement error

  • EFA comes from psychometric tradition and assumes all observed scores come from unobserved common factors

Exploratory Factor Analysis (Example)

  • You want to develop a measure of fatigue in cancer patients

  • You've created an 8 item measure and want to see if all these items measure a single latent construct

Exploratory Factor Analysis (Theory)

  • \(f\) is the latent factor Fatigue we are interested in measuring

  • \(a_j\) is the relationship between item j and the latent factor Fatigue

  • \(U_j\) is the uniqueness in variable j that is not due to Fatigue

  • \(d_j\) is the relationship between item j and the uniqueness

Exploratory Factor Analysis (Mathematics)

\(X_j = a_{j1}F_1 + a_{j2}F_2 + ... + a_{jk}F_k + d_jU_j\)

where:
    • \(X_j\) is standardized measured variable j (there are p measured variables)
    • \(F_m\) is common factor m (there are k common factors)
    • \(U_j\) is a unique factor with one for each variable (i.e., p in all)
    • \(a_{jm}\) is the factor loading of measured variable j on common factor m
    • \(d_j\) is the loading of measured variable j on its unique factor (p in all)
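
As a rough illustration of the common factor model, the sketch below simulates the 8-item fatigue example with a single common factor. The loadings are made up, and it assumes standardized items with the common and unique factors uncorrelated, so that the unique loading is d_j = sqrt(1 - a_j^2).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Hypothetical loadings of 8 fatigue items on one common factor.
loadings = np.array([0.8, 0.7, 0.75, 0.6, 0.65, 0.7, 0.8, 0.55])
# Assumption: standardized items, so each unique loading is sqrt(1 - a_j^2).
unique_loadings = np.sqrt(1 - loadings**2)

F = rng.normal(size=(n, 1))              # common factor (Fatigue)
U = rng.normal(size=(n, loadings.size))  # one unique factor per item
X = F * loadings + U * unique_loadings   # each item = a_j * F + d_j * U_j

# Each item's variance splits into communality (a_j^2) plus uniqueness (d_j^2).
item_variances = X.var(axis=0)

# Items correlate only through the common factor: corr(X_j, X_l) = a_j * a_l.
R = np.corrcoef(X, rowvar=False)
implied = np.outer(loadings, loadings)
off_diagonal = ~np.eye(loadings.size, dtype=bool)
max_gap = np.abs(R - implied)[off_diagonal].max()

print("item variances:", np.round(item_variances, 2))
print("largest gap between observed and implied correlations:", round(max_gap, 4))
```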

Exploratory Factor Analysis (Mathematics)

  • In PCA, all of the variance in each variable (shared variance, uniqueness, and error) goes into the component weights
    • In EFA, we have separate loadings for the common factors (\(a\)) and the uniqueness (\(d\)); the uniqueness combines the variation specific to variable j and the measurement error in j
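
One way to see this difference is to compare PCA loadings with loadings from iterated principal axis factoring, which is one of several EFA estimation methods and is used here purely as an assumption. The correlation matrix is a hypothetical population matrix built from made-up one-factor loadings, so the "true" answer is known.

```python
import numpy as np

# Hypothetical population correlation matrix: 8 items, one common factor.
true_loadings = np.array([0.8, 0.7, 0.75, 0.6, 0.65, 0.7, 0.8, 0.55])
R = np.outer(true_loadings, true_loadings)
np.fill_diagonal(R, 1.0)

def first_loadings(matrix):
    """Loadings on the first dimension: eigenvector times sqrt(eigenvalue)."""
    eigenvalues, eigenvectors = np.linalg.eigh(matrix)
    vector = eigenvectors[:, -1] * np.sqrt(eigenvalues[-1])
    return vector if vector.sum() > 0 else -vector  # fix the arbitrary sign

# PCA analyzes R with 1s on the diagonal, so unique variance is included.
pca_loadings = first_loadings(R)

# Principal axis factoring replaces the diagonal with communality estimates
# (shared variance only) and iterates until the estimates settle.
communalities = 1 - 1 / np.diag(np.linalg.inv(R))  # squared multiple correlations
for _ in range(200):
    reduced = R.copy()
    np.fill_diagonal(reduced, communalities)
    efa_loadings = first_loadings(reduced)
    communalities = efa_loadings**2

print("true:", true_loadings)
print("PCA: ", np.round(pca_loadings, 3))
print("EFA: ", np.round(efa_loadings, 3))
```

In this constructed case the factoring loadings recover the values used to build the matrix, while the PCA loadings come out larger because the unique variance is absorbed into the component.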

Exploratory Factor Analysis (Steps)

  • Steps for EFA are very similar to PCA
    • Math is different to account for uniqueness

  • Use EFA
    • Step 1: Run EFA using items from measure
    • Step 2: Determine number of factors to retain
    • Step 3: Ensure interpretability of factors from Step 2
      • If factors are uninterpretable or items do not load, remove items one at a time and repeat steps 1 and 2
    • Step 4: Rerun EFA extracting only the number of factors from Step 3
    • Step 5: Collect data in different sample (same population) and conduct Confirmatory Factor Analysis

Practical Issues in EFA

  • Similar to PCA

  • Sample size:
    • 250-300
    • Still dependent on how strongly measured variables are related to factors

  • We are more concerned with how items relate to factors in EFA than in PCA
    • Lots of statistics that need to be checked for model fit

  • Need at least 3 items per factor

  • Factors aren't orthogonal like in PCA

Practical Issues in EFA (psychometrics)

  • Think of factors as houses
    • The items are the materials
    • Factor analysis is the blueprint
    • Factor loadings and items are a synergistic representation of factors
    • Cannot use different or fewer items and have the same factor

  • Factors are only valid in the population in which you validated the measure
    • Need to establish "measurement invariance" in other populations

  • Validating a measure is a long and tedious process
    • EFA is the first in a series of steps

Final thoughts

  • While these techniques are fairly easy to implement, the theory and math behind them are complex

  • Particularly with EFA, make sure you work closely with your analyst to ensure stats and theory are aligned

  • With categorical variables, different techniques are used (and are more complicated)