Introduction
Principal Components Regression (PCR) is a statistical technique used
to address the problem of multicollinearity in multiple
linear regression models. Multicollinearity occurs when predictor
variables are highly correlated, which leads to unstable coefficient
estimates and inflated standard errors in ordinary least squares (OLS)
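regression. Although OLS estimates remain unbiased under
multicollinearity, their variance can be large, so individual
coefficients are hard to interpret and predictions can become
unreliable, particularly for new data that do not follow the same
correlation pattern.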
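The variance inflation described above can be illustrated with a small simulation. This is a minimal sketch, not part of the original analysis: the function name `simulate_ols`, the true slopes of 1.0, the sample size, and the correlation values are all assumptions chosen for illustration.

```python
import numpy as np

def simulate_ols(rho, n=100, reps=500, seed=0):
    """Return OLS estimates of the first slope over repeated samples,
    using two predictors whose population correlation is rho."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    estimates = np.empty(reps)
    for r in range(reps):
        X = rng.multivariate_normal(np.zeros(2), cov, size=n)
        # Hypothetical true model: both slopes equal 1, unit-variance noise.
        y = 1.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=n)
        A = np.column_stack([np.ones(n), X])      # add intercept column
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        estimates[r] = beta[1]                    # slope of the first predictor
    return estimates

low = simulate_ols(rho=0.0)     # uncorrelated predictors
high = simulate_ols(rho=0.99)   # nearly collinear predictors

print("mean estimate (rho=0.00):", low.mean(), " sd:", low.std())
print("mean estimate (rho=0.99):", high.mean(), " sd:", high.std())
```

In both settings the average estimate should sit near the true slope (unbiasedness), while the spread of the estimates should be much wider when the predictors are nearly collinear.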
PCR resolves this by transforming the original correlated predictors
into a smaller set of uncorrelated variables called
principal components, which are linear combinations of
the original variables. These components are then used as predictors in
a linear regression model.
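The transformation into uncorrelated components can be sketched with NumPy's singular value decomposition. The data below are hypothetical (three predictors, two of them nearly collinear), and the variable names are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: x2 is almost a copy of x1, x3 is unrelated.
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Standardize each column (mean 0, unit sample variance).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# PCA via SVD: the rows of Vt are the loading vectors.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)

# Each score column is a linear combination of the standardized predictors.
scores = Z @ Vt.T

explained = s**2 / np.sum(s**2)                  # share of variance per component
score_corr = np.corrcoef(scores, rowvar=False)   # correlations between components

print("explained variance ratio:", explained)
print("correlation matrix of the scores:\n", score_corr.round(6))
```

The off-diagonal entries of `score_corr` should be zero up to floating-point error, even though `x1` and `x2` are almost perfectly correlated.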
By discarding components associated with low variance (often linked
to noise or multicollinearity), PCR introduces some bias but can
substantially reduce variance, often resulting in better
generalization to new data.
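Note: PCR differs from ridge
regression, which addresses multicollinearity by shrinking
coefficients through a penalty term. Ridge regression retains all the
original predictors and shrinks their coefficients continuously, while
PCR replaces the predictors with a subset of their principal components
and discards the rest.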
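One way to make this contrast concrete is to write the fitted values of each method in terms of the principal directions. The notation here is introduced for illustration and does not come from the text above: Z is the standardized predictor matrix with singular value decomposition Z = U D V^T, u_j is the j-th column of U, d_j is the j-th singular value (assumed positive), y is the centered response, lambda is the ridge penalty, p is the number of predictors, and k is the number of retained components.

```latex
\begin{align*}
\hat{y}_{\text{OLS}}   &= \sum_{j=1}^{p} u_j \, u_j^{\top} y, \\
\hat{y}_{\text{ridge}} &= \sum_{j=1}^{p} u_j \, \frac{d_j^{2}}{d_j^{2} + \lambda} \, u_j^{\top} y, \\
\hat{y}_{\text{PCR}}   &= \sum_{j=1}^{k} u_j \, u_j^{\top} y .
\end{align*}
```

Under these assumptions, ridge multiplies every direction by a factor between 0 and 1 that is smallest for low-variance directions, whereas PCR applies a factor of exactly 1 to the first k directions and 0 to the remainder.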
Key Steps in PCR
- Standardize the predictors
- Perform Principal Component Analysis (PCA)
- Select the number of principal components (via
cross-validation)
- Fit a regression model using the selected components
- Predict and evaluate model performance
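The steps above can be sketched end to end with NumPy alone. This is an illustrative implementation under stated assumptions, not a reference one: the helper names `pcr_fit`, `pcr_predict`, and `cv_mse` are hypothetical, the data are simulated, and 5-fold cross-validation with mean squared error is just one reasonable choice for selecting the number of components.

```python
import numpy as np

def pcr_fit(X, y, k):
    """Fit PCR with k components and return everything needed to predict."""
    mu, sd = X.mean(axis=0), X.std(axis=0, ddof=1)
    Z = (X - mu) / sd                              # step 1: standardize
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    V_k = Vt[:k].T                                 # step 2: loadings of first k PCs
    T = Z @ V_k                                    # component scores
    y_mean = y.mean()
    # step 4: regress the centered response on the scores
    gamma, *_ = np.linalg.lstsq(T, y - y_mean, rcond=None)
    return mu, sd, V_k, gamma, y_mean

def pcr_predict(model, X_new):
    """Apply the training-set scaling and loadings to new data."""
    mu, sd, V_k, gamma, y_mean = model
    return y_mean + ((X_new - mu) / sd) @ V_k @ gamma

def cv_mse(X, y, k, n_folds=5, seed=0):
    """Step 3 helper: cross-validated mean squared error for a given k."""
    idx = np.random.default_rng(seed).permutation(len(y))
    errs = []
    for test in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, test)
        model = pcr_fit(X[train], y[train], k)
        errs.append(np.mean((y[test] - pcr_predict(model, X[test])) ** 2))
    return float(np.mean(errs))

# Hypothetical data: four predictors driven by two latent factors.
rng = np.random.default_rng(1)
n = 150
latent = rng.normal(size=(n, 2))
X = np.column_stack([
    latent[:, 0],
    latent[:, 0] + 0.05 * rng.normal(size=n),
    latent[:, 1],
    latent[:, 1] + 0.05 * rng.normal(size=n),
])
y = 2.0 * latent[:, 0] - 1.0 * latent[:, 1] + 0.5 * rng.normal(size=n)

# Step 3: pick the number of components with the lowest CV error.
cv_scores = {k: cv_mse(X, y, k) for k in range(1, X.shape[1] + 1)}
best_k = min(cv_scores, key=cv_scores.get)

# Steps 4 and 5: refit on all data, predict, and evaluate.
model = pcr_fit(X, y, best_k)
y_hat = pcr_predict(model, X)
train_mse = float(np.mean((y - y_hat) ** 2))

print("CV MSE by number of components:", cv_scores)
print("selected k:", best_k, " training MSE:", train_mse)
```

Standardization and the PCA are recomputed inside each training fold so that the held-out fold does not influence the components. In practice the final evaluation would use a separate test set rather than the training error shown here.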
Assumptions
PCR assumes the same conditions as standard linear regression:
- Linearity: A linear relationship between predictors
and response.
- Constant Variance (Homoscedasticity): Equal spread
of residuals.
- Independence of Errors: Residuals are not
autocorrelated.
- Normality of Errors: Not strictly required for estimation or
prediction; it matters when confidence intervals or hypothesis tests
are needed.
Diagnostic Checks
Linearity
- Check: Plots of residuals vs. each original predictor
- Look for: Random scatter around zero. Non-linear patterns
suggest misspecification.
Constant Variance
- Check: Residuals vs. Fitted values (Yhat) plot
- Look for: A horizontal band. A funnel shape
indicates heteroscedasticity.
Independence
- Check: Residuals vs. Time (if applicable)
- Look for: No systematic pattern. Trends or cycles suggest
autocorrelation.
- Alternatively, check residuals vs. each original predictor: a
non-random pattern suggests a lack of independence.
- Or check residuals vs. fitted values (Yhat): the residuals should be
randomly scattered around zero, with no discernible pattern.
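The checks above are visual, but rough numeric companions can help when many models are screened. The sketch below is an assumption-laden illustration rather than a standard procedure: `spread_ratio` and `lag1_autocorr` are hypothetical helper names, the data are simulated so that the error spread grows with the predictor, and a plain least squares fit on one predictor stands in for the PCR fit (the same residual checks apply to PCR residuals).

```python
import numpy as np

def spread_ratio(fitted, resid):
    """Residual SD in the upper half of fitted values divided by the SD in
    the lower half. Values far from 1 hint at a funnel shape."""
    order = np.argsort(fitted)
    half = len(order) // 2
    low, high = resid[order[:half]], resid[order[half:]]
    return float(high.std(ddof=1) / low.std(ddof=1))

def lag1_autocorr(resid):
    """Correlation between consecutive residuals; assumes the observations
    are stored in time order."""
    r = resid - resid.mean()
    return float(np.sum(r[1:] * r[:-1]) / np.sum(r ** 2))

# Hypothetical data: independent errors whose spread grows with x.
rng = np.random.default_rng(2)
n = 400
x = rng.uniform(1.0, 10.0, size=n)
y = 3.0 + 2.0 * x + rng.normal(scale=0.3 * x, size=n)

# Ordinary least squares fit, then fitted values and residuals.
A = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ beta
resid = y - fitted

ratio = spread_ratio(fitted, resid)
rho1 = lag1_autocorr(resid)

print("spread ratio (upper half / lower half):", ratio)
print("lag-1 autocorrelation of residuals:", rho1)
```

For this simulated data the spread ratio should be well above 1, reflecting the funnel shape, while the lag-1 autocorrelation should be close to zero because the errors were generated independently. These summaries complement the plots and do not replace them.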
Advantages of PCR
- Handles multicollinearity, because the components are uncorrelated
- Can reduce model complexity and the risk of overfitting
- Can improve prediction accuracy when predictors are highly correlated
Limitations of PCR
- Reduced interpretability (uses principal components, not original
variables)
- Components are selected based on predictor variance, not predictive
power
- May retain components with little relevance to the response
variable, or discard low-variance components that are predictive
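The limitation that components are chosen by predictor variance rather than predictive power can be demonstrated with a deliberately adversarial example. Everything here is hypothetical: the data are simulated so that the response depends only on the small difference between two highly correlated predictors, and `pcr_r2` is an illustrative helper name.

```python
import numpy as np

def pcr_r2(X, y, k):
    """In-sample R^2 of PCR with k components, plus explained variance ratios."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    T = Z @ Vt[:k].T                       # scores on the first k components
    yc = y - y.mean()
    gamma, *_ = np.linalg.lstsq(T, yc, rcond=None)
    resid = yc - T @ gamma
    r2 = 1.0 - float(resid @ resid) / float(yc @ yc)
    return r2, s**2 / np.sum(s**2)

rng = np.random.default_rng(3)
n = 300
common = rng.normal(size=n)                # strong shared factor
diff = 0.1 * rng.normal(size=n)            # small-variance contrast
X = np.column_stack([common + diff, common - diff])

# Hypothetical response: driven by the low-variance contrast only.
y = 10.0 * diff + 0.1 * rng.normal(size=n)

r2_one, explained = pcr_r2(X, y, k=1)
r2_two, _ = pcr_r2(X, y, k=2)

print("explained variance ratio:", explained)
print("R^2 with 1 component:", r2_one)
print("R^2 with 2 components:", r2_two)
```

The first component should account for nearly all of the predictor variance yet explain almost none of the response, while adding the low-variance second component should recover most of the fit. Choosing the number of components by cross-validation on the response, rather than by explained variance alone, helps guard against this case.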