Principal Component Analysis and Multicollinearity

A Critical Examination of Assumptions, Limitations, and Practical Considerations

Author

Timothy Achala

Published

June 15, 2026

Background

Principal Component Regression (PCR) is widely recommended in statistical textbooks as a method for addressing multicollinearity in regression models. The core idea is appealing: transform a set of correlated predictors into orthogonal principal components, then regress the outcome on these uncorrelated components. In doing so, the variance inflation associated with collinear predictors is resolved — at least in the component space.

However, Hadi and Ling (1998) raised important cautionary notes about this practice, demonstrating both theoretically and empirically that PCR carries serious limitations that are often understated or absent from standard textbook treatments. This analysis explores those limitations through a simulated clinical example, and discusses the conditions under which PCR may or may not be appropriate.


1. Simulated Data: Collinear Adiposity Measures

To illustrate the behaviour of PCR under multicollinearity, we simulate a clinical dataset where three adiposity biomarkers — waist circumference, BMI, and body fat percentage — are used to predict systolic blood pressure (SBP). These three predictors are intentionally highly correlated, as they all capture overlapping aspects of body adiposity.

Code
library(tidyverse)
library(gt)
library(broom)
library(corrr)
library(GGally)
library(car)
library(factoextra)

set.seed(2024)

n <- 300

waist   <- rnorm(n, mean = 90, sd = 10)
bmi     <- 0.3 * waist + rnorm(n, mean = 5, sd = 1.5)
bodyfat <- 0.5 * waist + 0.4 * bmi + rnorm(n, mean = 2, sd = 1.5)

# True SBP: driven primarily by waist circumference
sbp <- 100 + 0.6 * waist + 0.05 * bmi + 0.02 * bodyfat + rnorm(n, sd = 6)

data <- tibble(waist, bmi, bodyfat, sbp)

head(data, 6) |>
  gt() |>
  tab_header(
    title = "Simulated Clinical Data",
    subtitle = "Adiposity measures predicting systolic blood pressure (n = 300)"
  ) |>
  fmt_number(everything(), decimals = 2)
Simulated Clinical Data
Adiposity measures predicting systolic blood pressure (n = 300)
waist bmi bodyfat sbp
99.82 36.01 62.44 165.77
94.69 32.46 64.26 156.17
88.92 31.20 59.85 146.61
87.87 32.96 59.32 156.83
101.58 37.86 68.26 170.15
102.92 36.55 66.83 164.92

2. Assessing Multicollinearity in the Original Predictors

Code
GGally::ggpairs(
  data |> select(waist, bmi, bodyfat),
  title = "Pairwise Relationships Among Adiposity Predictors"
) +
  theme_minimal(base_size = 11)

Figure 1: Pairwise scatterplots and correlation coefficients confirm severe multicollinearity among the three adiposity predictors.
Code
m_raw <- lm(sbp ~ waist + bmi + bodyfat, data = data)

vif(m_raw) |>
  enframe(name = "Predictor", value = "VIF") |>
  gt() |>
  tab_header(
    title = "Variance Inflation Factors: OLS Model",
    subtitle = "VIF values indicate severe multicollinearity among predictors"
  ) |>
  fmt_number(VIF, decimals = 1) |>
  tab_style(
    style = cell_fill(color = "#FADBD8"),
    locations = cells_body(rows = VIF > 10)
  )
Variance Inflation Factors: OLS Model
VIF values indicate severe multicollinearity among predictors
Predictor VIF
waist 16.6
bmi 5.2
bodyfat 17.5

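The VIFs for waist circumference (16.6) and body fat (17.5) exceed the conventional threshold of 10, and the VIF for BMI (5.2) is moderately elevated. A VIF of about 17 means the sampling variance of that coefficient is roughly 17 times what it would be if the predictor were uncorrelated with the others, so the individual OLS coefficient estimates are imprecise and sensitive to small changes in the data. This is the scenario in which PCR is typically recommended.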
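
To make the VIF definition concrete, here is a minimal base-R sketch that computes it by hand. It assumes the same seed and simulation steps as Section 1, so it should regenerate the same predictors; the names `vif_by_hand` and `vif_from_cor` are illustrative and not part of the original analysis.

```r
# Regenerate the simulated predictors (same seed and steps as Section 1)
set.seed(2024)
n <- 300
waist   <- rnorm(n, mean = 90, sd = 10)
bmi     <- 0.3 * waist + rnorm(n, mean = 5, sd = 1.5)
bodyfat <- 0.5 * waist + 0.4 * bmi + rnorm(n, mean = 2, sd = 1.5)
X <- data.frame(waist, bmi, bodyfat)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on all the other predictors
vif_by_hand <- sapply(names(X), function(v) {
  f  <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})

# Equivalent shortcut: the diagonal of the inverse correlation matrix
vif_from_cor <- diag(solve(cor(X)))

round(rbind(vif_by_hand, vif_from_cor), 1)
```

Both rows should agree, which shows that the VIF depends only on the correlation structure of the predictors and not on the outcome.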


3. Applying Principal Component Analysis

Code
pca_fit <- prcomp(data |> select(waist, bmi, bodyfat), scale. = TRUE)

summary(pca_fit)$importance |>
  as.data.frame() |>
  rownames_to_column("Metric") |>
  gt() |>
  tab_header(
    title = "Variance Explained by Principal Components",
    subtitle = "PC1 captures the dominant shared variance among predictors"
  ) |>
  fmt_number(-Metric, decimals = 3)
Variance Explained by Principal Components
PC1 captures the dominant shared variance among predictors
Metric PC1 PC2 PC3
Standard deviation 1.683 0.366 0.178
Proportion of Variance 0.945 0.045 0.011
Cumulative Proportion 0.945 0.989 1.000
Code
pca_fit$rotation |>
  as.data.frame() |>
  rownames_to_column("Original Variable") |>
  gt() |>
  tab_header(
    title = "Principal Component Loadings",
    subtitle = "Each loading describes how original variables contribute to a given component"
  ) |>
  fmt_number(-`Original Variable`, decimals = 3)
Principal Component Loadings
Each loading describes how original variables contribute to a given component
Original Variable PC1 PC2 PC3
waist −0.582 0.424 0.694
bmi −0.566 −0.824 0.028
bodyfat −0.583 0.377 −0.720

PC1 accounts for the majority of variance in the predictor space and loads approximately equally on all three adiposity measures — reflecting the shared latent dimension of general body adiposity. PC2 and PC3 capture the remaining, largely residual variation.
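
For readers who want to see what `prcomp()` is doing with `scale. = TRUE`, the following sketch reproduces the decomposition from the correlation matrix using base R only. It assumes the same simulation steps as Section 1; `eig` and `var_share` are illustrative names.

```r
# Regenerate the simulated predictors (same seed and steps as Section 1)
set.seed(2024)
n <- 300
waist   <- rnorm(n, mean = 90, sd = 10)
bmi     <- 0.3 * waist + rnorm(n, mean = 5, sd = 1.5)
bodyfat <- 0.5 * waist + 0.4 * bmi + rnorm(n, mean = 2, sd = 1.5)
X <- cbind(waist, bmi, bodyfat)

pca <- prcomp(X, scale. = TRUE)
eig <- eigen(cor(X))

# Eigenvalues of the correlation matrix are the component variances
all.equal(eig$values, pca$sdev^2)

# Eigenvectors are the loadings, up to an arbitrary sign per component
max(abs(abs(eig$vectors) - abs(unname(pca$rotation))))

# Proportion of predictor variance carried by each component
var_share <- eig$values / sum(eig$values)
round(var_share, 3)
```

The sign of each component is arbitrary, which is why the comparison uses absolute values. A PC1 with all-negative loadings describes the same direction as one with all-positive loadings.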


4. Regression on Principal Components

Code
pc_scores  <- as_tibble(pca_fit$x)
data_pca   <- bind_cols(data |> select(sbp), pc_scores)

# Full PCR: all components retained
m_pca_full <- lm(sbp ~ PC1 + PC2 + PC3, data = data_pca)

tidy(m_pca_full) |>
  gt() |>
  tab_header(
    title = "Full PCR: Regression on All Three Principal Components",
    subtitle = "All components retained — equivalent to OLS"
  ) |>
  fmt_number(c(estimate, std.error, statistic, p.value), decimals = 4)
Full PCR: Regression on All Three Principal Components
All components retained — equivalent to OLS
term estimate std.error statistic p.value
(Intercept) 156.8239 0.3278 478.4794 0.0000
PC1 −3.8115 0.1950 −19.5450 0.0000
PC2 2.2081 0.8964 2.4633 0.0143
PC3 7.5573 1.8435 4.0994 0.0001
Code
vif(m_pca_full) |>
  enframe(name = "Component", value = "VIF") |>
  gt() |>
  tab_header(
    title = "VIF After PCA Transformation",
    subtitle = "Components are orthogonal by construction — VIF = 1.0 for all"
  ) |>
  fmt_number(VIF, decimals = 3) |>
  tab_style(
    style = cell_fill(color = "#D5F5E3"),
    locations = cells_body()
  )
VIF After PCA Transformation
Components are orthogonal by construction — VIF = 1.0 for all
Component VIF
PC1 1.000
PC2 1.000
PC3 1.000

The VIF values after transformation are exactly 1.0, which reflects the mathematical property of orthogonality among principal components. However, it is important to recognise that when all components are retained, this result is not a statistical fix — it is a geometric property of the transformation. As noted by Hadi and Ling (1998), when all principal components are included in the regression, the PCR estimator is algebraically equivalent to the OLS estimator. The underlying instability has not been addressed; it has simply been re-expressed in a different coordinate system.
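
The equivalence between full PCR and OLS can be checked numerically. The sketch below assumes the same simulation steps as Section 1 and uses base R only; `beta_raw` is an illustrative name for the PCR coefficients mapped back to the original units.

```r
# Regenerate the simulated data (same seed and steps as Section 1)
set.seed(2024)
n <- 300
waist   <- rnorm(n, mean = 90, sd = 10)
bmi     <- 0.3 * waist + rnorm(n, mean = 5, sd = 1.5)
bodyfat <- 0.5 * waist + 0.4 * bmi + rnorm(n, mean = 2, sd = 1.5)
sbp     <- 100 + 0.6 * waist + 0.05 * bmi + 0.02 * bodyfat + rnorm(n, sd = 6)
X <- cbind(waist, bmi, bodyfat)

pca <- prcomp(X, scale. = TRUE)
ols <- lm(sbp ~ waist + bmi + bodyfat)
pcr <- lm(sbp ~ pca$x)            # regression on all three components

# Map component coefficients back to the original predictors:
# rotate into standardised-predictor space, then undo the scaling
gamma    <- coef(pcr)[-1]
beta_raw <- drop(pca$rotation %*% gamma) / pca$scale

round(cbind(ols = coef(ols)[-1], full_pcr = beta_raw), 4)

# Fitted values are identical as well
all.equal(unname(fitted(ols)), unname(fitted(pcr)))
```

If the two columns match, the full PCR fit carries exactly the same information, and the same uncertainty about individual predictors, as the OLS fit.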


5. The Truncated PCR: Where the “Fix” Actually Occurs

The practical application of PCR involves dropping the low-variance components — typically those with small eigenvalues. This step is where the actual variance reduction occurs.

Code
# Truncated PCR: drop PC3 (smallest variance)
m_pca_trunc <- lm(sbp ~ PC1 + PC2, data = data_pca)

tidy(m_pca_trunc) |>
  gt() |>
  tab_header(
    title = "Truncated PCR: PC3 Excluded",
    subtitle = "Dropping the lowest-variance component to stabilise estimates"
  ) |>
  fmt_number(c(estimate, std.error, statistic, p.value), decimals = 4)
Truncated PCR: PC3 Excluded
Dropping the lowest-variance component to stabilise estimates
term estimate std.error statistic p.value
(Intercept) 156.8239 0.3364 466.2349 0.0000
PC1 −3.8115 0.2001 −19.0449 0.0000
PC2 2.2081 0.9199 2.4002 0.0170
Code
# Compare: how strongly is each component correlated with Y?

tibble(
  Component = c("PC1", "PC2", "PC3"),
  `Variance in X (%)` = round(summary(pca_fit)$importance[2,] * 100, 1),
  `Correlation with SBP (Y)` = round(c(
    cor(data_pca$PC1, data_pca$sbp),
    cor(data_pca$PC2, data_pca$sbp),
    cor(data_pca$PC3, data_pca$sbp)
  ), 3)
) |>
  gt() |>
  tab_header(
    title = "Variance in X vs. Correlation with Outcome Y",
    subtitle = "A component's variance in X does not guarantee its relevance to Y"
  ) |>
  tab_style(
    style = cell_fill(color = "#FEF9E7"),
    locations = cells_body(rows = Component == "PC3")
  )
Variance in X vs. Correlation with Outcome Y
A component's variance in X does not guarantee its relevance to Y
Component Variance in X (%) Correlation with SBP (Y)
PC1 94.5 -0.738
PC2 4.5 0.093
PC3 1.1 0.155

This table is central to understanding the key limitation of PCR. The component selection criterion is based entirely on variance in predictor space (X), with no reference to the outcome variable (Y). In this simulation, PC3 accounts for only 1.1% of the variance in X, yet it is more strongly correlated with SBP (0.155) than PC2 is (0.093), and its coefficient in the full model is clearly non-zero (p = 0.0001); the variance-based rule nevertheless discards it. Hadi and Ling (1998) demonstrated, both theoretically and through empirical examples with well-known datasets, that the component with the smallest eigenvalue can be the one most strongly associated with Y, while the dominant components may contribute little or nothing to predicting the outcome. Dropping low-variance components therefore introduces bias whenever the dropped components are related to Y, and the size of that bias cannot be assessed from the predictor structure alone.
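
The extreme case described by Hadi and Ling (1998) is easy to construct. The toy example below is not taken from their paper; it is a hypothetical two-predictor setup (names `x1`, `x2`, `y` and the seed are arbitrary) in which the outcome depends only on the small difference between two nearly identical predictors.

```r
set.seed(1)                               # arbitrary seed for this toy example
n  <- 300
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)             # x2 is almost a copy of x1
y  <- 5 * (x1 - x2) + rnorm(n, sd = 0.1)  # Y depends only on the difference

pca <- prcomp(cbind(x1, x2), scale. = TRUE)

# Share of predictor variance carried by each component
var_share <- pca$sdev^2 / sum(pca$sdev^2)

# How much of Y each component explains on its own
r2_pc1 <- summary(lm(y ~ pca$x[, "PC1"]))$r.squared
r2_pc2 <- summary(lm(y ~ pca$x[, "PC2"]))$r.squared

data.frame(
  component        = c("PC1", "PC2"),
  variance_share   = round(var_share, 3),
  r_squared_with_y = round(c(r2_pc1, r2_pc2), 3)
)
```

In this setup PC1 carries nearly all of the predictor variance but explains almost none of Y, while PC2 would be the first component discarded by a variance-based rule even though it carries the signal.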


6. The Interpretability Consideration

Code
loadings <- pca_fit$rotation[, 1]

cat(glue::glue("
Beyond the estimation considerations above, PCR also raises an important interpretability
question that is particularly relevant in clinical and epidemiological research.

The PC1 coefficient from the full model is **{round(coef(m_pca_full)['PC1'], 3)}**, with
loadings of **{round(loadings['waist'], 3)}** on waist, **{round(loadings['bmi'], 3)}**
on BMI, and **{round(loadings['bodyfat'], 3)}** on body fat.

In practical terms, this means the model estimates that a one-unit increase in PC1, the composite
*({round(loadings['waist'],2)} × Waist) + ({round(loadings['bmi'],2)} × BMI) + ({round(loadings['bodyfat'],2)} × Body Fat)*
of the standardised predictors, is associated with a **{round(coef(m_pca_full)['PC1'], 2)} mmHg** change in SBP.

While mathematically valid, this formulation does not permit a clinician or policymaker to
act on the individual predictor effects. The question — for every 1 cm increase in waist
circumference, holding BMI and body fat constant, how does SBP change? — cannot be answered
directly from the PCR output without back-transformation, and even then, the back-transformed
coefficients carry the constraints imposed by the X correlation structure rather than the
Y-relevant signal.
"))

Beyond the estimation considerations above, PCR also raises an important interpretability question that is particularly relevant in clinical and epidemiological research.

The PC1 coefficient from the full model is -3.812, with loadings of -0.582 on waist, -0.566 on BMI, and -0.583 on body fat.

In practical terms, this means the model estimates that a one-unit increase in PC1, the composite (-0.58 × Waist) + (-0.57 × BMI) + (-0.58 × Body Fat) of the standardised predictors, is associated with a -3.81 mmHg change in SBP.

While mathematically valid, this formulation does not permit a clinician or policymaker to act on the individual predictor effects. The question — for every 1 cm increase in waist circumference, holding BMI and body fat constant, how does SBP change? — cannot be answered directly from the PCR output without back-transformation, and even then, the back-transformed coefficients carry the constraints imposed by the X correlation structure rather than the Y-relevant signal.
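
The back-transformation mentioned above can be sketched as follows. This is a minimal illustration assuming the same simulation steps as Section 1 and a truncated model that keeps PC1 and PC2; `beta_std` and `beta_raw` are illustrative names.

```r
# Regenerate the simulated data (same seed and steps as Section 1)
set.seed(2024)
n <- 300
waist   <- rnorm(n, mean = 90, sd = 10)
bmi     <- 0.3 * waist + rnorm(n, mean = 5, sd = 1.5)
bodyfat <- 0.5 * waist + 0.4 * bmi + rnorm(n, mean = 2, sd = 1.5)
sbp     <- 100 + 0.6 * waist + 0.05 * bmi + 0.02 * bodyfat + rnorm(n, sd = 6)
X <- cbind(waist, bmi, bodyfat)

pca <- prcomp(X, scale. = TRUE)
pcs <- pca$x

# Truncated PCR: keep PC1 and PC2 only
fit_trunc <- lm(sbp ~ pcs[, 1:2])
gamma     <- coef(fit_trunc)[-1]

# Back-transform: per-SD coefficients, then per-original-unit coefficients
beta_std <- drop(pca$rotation[, 1:2] %*% gamma)
beta_raw <- beta_std / pca$scale

ols <- coef(lm(sbp ~ waist + bmi + bodyfat))[-1]
round(cbind(truncated_pcr = beta_raw, ols = ols), 3)

# The constraint imposed by truncation: the standardised coefficient
# vector has no component along the dropped PC3 direction
sum(beta_std * pca$rotation[, 3])
```

The last line should be zero up to rounding error. The back-transformed coefficients look like ordinary per-unit effects, but they are forced to lie in the plane spanned by the first two loading vectors, a restriction determined by the correlation structure of X alone.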


7. Visual Summary: Biplot

Code
fviz_pca_biplot(
  pca_fit,
  geom.ind  = "point",
  col.ind   = "grey70",
  col.var   = "#2E86AB",
  repel     = TRUE,
  title     = "PCA Biplot: Waist, BMI, and Body Fat in Principal Component Space"
)

Figure 2: PCA biplot showing the direction of each original variable in component space. The near-parallel arrows indicate that all three predictors share a common latent direction — the variance that PCR reorganises into PC1.

The near-parallel orientation of the three variable arrows confirms that waist circumference, BMI, and body fat percentage share a dominant common direction in the predictor space. PCA reorganises this shared structure into PC1, but the underlying correlation among the original variables is unchanged.


8. Summary of Considerations

Code
tibble(
  Scenario = c(
    "Full PCR (all components retained)",
    "Truncated PCR (low-variance components dropped)",
    "PCR for prediction (dimension reduction goal)",
    "PCR for causal / clinical inference"
  ),
  `What PCR Achieves` = c(
    "Orthogonal components; VIF = 1.0",
    "Reduced coefficient variance; more stable estimates",
    "Efficient compression of correlated predictors",
    "Estimated associations in component space"
  ),
  `Key Limitation` = c(
    "Algebraically equivalent to OLS — instability is unchanged",
    "Component selection is Y-blind; dropped components may predict Y well",
    "Interpretability of individual predictors is lost",
    "Back-transformed coefficients are constrained by X structure, not Y signal"
  )
) |>
  gt() |>
  tab_header(
    title = "PCR Under Different Analytical Goals",
    subtitle = "The appropriateness of PCR depends heavily on the inferential objective"
  )
PCR Under Different Analytical Goals
The appropriateness of PCR depends heavily on the inferential objective
Scenario | What PCR Achieves | Key Limitation
Full PCR (all components retained) | Orthogonal components; VIF = 1.0 | Algebraically equivalent to OLS — instability is unchanged
Truncated PCR (low-variance components dropped) | Reduced coefficient variance; more stable estimates | Component selection is Y-blind; dropped components may predict Y well
PCR for prediction (dimension reduction goal) | Efficient compression of correlated predictors | Interpretability of individual predictors is lost
PCR for causal / clinical inference | Estimated associations in component space | Back-transformed coefficients are constrained by X structure, not Y signal

When PCR May Be Appropriate

PCR can be a reasonable approach under specific conditions:

  • Prediction-focused analyses where the goal is forecasting rather than coefficient interpretation (e.g., genomics, spectroscopy, high-dimensional biomarker panels)
  • Exploratory analyses where identifying latent constructs (e.g., a “general adiposity” dimension) is itself informative
  • Downstream machine learning pipelines where orthogonal inputs improve computational stability

When Alternative Approaches Deserve Consideration

When the goal is to estimate the effect of individual named predictors — as is typical in clinical epidemiology and health policy research — the following alternatives preserve original variable identity while addressing the instability problem:

  • Ridge regression — continuously shrinks the coefficients toward zero, with greater shrinkage applied along the principal component directions with smaller eigenvalues. Unlike truncated PCR, ridge regression does not make a hard binary decision about which components to include: the shrinkage is smooth, and the penalty strength is usually chosen by cross-validated prediction error, so the outcome informs how much shrinkage is applied
  • LASSO / Elastic Net — performs variable selection while retaining named-variable coefficients, allowing the outcome to guide which predictors remain in the model
  • Partial Least Squares (PLS) — constructs components that maximise covariance with Y rather than variance in X alone, directly addressing the Y-blindness of standard PCR
  • Domain-informed variable selection — selecting a single clinically meaningful proxy based on subject-matter knowledge, avoiding the statistical complexity while preserving interpretability
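
The contrast between ridge regression and truncated PCR can be made concrete through the shrinkage each applies along the principal component directions. The sketch below assumes the same simulation steps as Section 1; the penalty `lambda <- 50` is an arbitrary illustrative value, not a tuned one, and all object names are hypothetical.

```r
# Regenerate the simulated data (same seed and steps as Section 1)
set.seed(2024)
n <- 300
waist   <- rnorm(n, mean = 90, sd = 10)
bmi     <- 0.3 * waist + rnorm(n, mean = 5, sd = 1.5)
bodyfat <- 0.5 * waist + 0.4 * bmi + rnorm(n, mean = 2, sd = 1.5)
sbp     <- 100 + 0.6 * waist + 0.05 * bmi + 0.02 * bodyfat + rnorm(n, sd = 6)

Z  <- scale(cbind(waist, bmi, bodyfat))   # standardised predictors
y  <- sbp - mean(sbp)                     # centred outcome
sv <- svd(Z)

lambda <- 50                              # arbitrary penalty, for illustration only

# Multiplier applied to the OLS fit along each component direction
ridge_factor <- sv$d^2 / (sv$d^2 + lambda)  # smooth, between 0 and 1
pcr_factor   <- c(1, 1, 0)                  # truncated PCR: keep, keep, drop

round(rbind(ridge_factor, pcr_factor), 3)

# Check: ridge from the penalised normal equations equals the
# component-wise shrunken solution built from the SVD
beta_ridge <- solve(crossprod(Z) + lambda * diag(3), crossprod(Z, y))
beta_svd   <- sv$v %*% (ridge_factor * crossprod(sv$u, y) / sv$d)
all.equal(c(beta_ridge), c(beta_svd))
```

Ridge down-weights the low-variance directions gradually rather than deleting them, so a low-variance component that is related to Y still contributes to the fit.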

9. What the Founders Actually Intended

A question worth asking is whether the limitations discussed above are a failure of PCA itself, or a failure of how it has been applied. Examining the original papers by Pearson (1901) and Hotelling (1933) provides an important perspective.

Pearson (1901): A Geometric, Not a Regression, Problem

Karl Pearson’s 1901 paper presents the underlying problem of PCA from a purely geometric standpoint, describing how to find low-dimensional subspaces that best fit — in the least squares sense — a cloud of points in space. Notably, Pearson never used the term “principal components analysis.”

Pearson’s original question was: “Given a cloud of points in p-dimensional space, what line or plane best represents the data?” This is a problem of geometric approximation — finding the subspace of lowest dimensionality that preserves the structure of the data. Pearson’s aim was to escape from the non-symmetry of dependent and independent variables in linear regression — the regression line changes if the roles of dependent and independent variables are reversed — by giving equal status to all variables. He was led to find the best fit of a system of points in the plane by a line.

Critically, there is no outcome variable Y in Pearson’s formulation. PCA as originally conceived was an entirely unsupervised geometric operation — there was no dependent variable to predict, and therefore no question of whether the retained components were relevant to a regression target. Applying Pearson’s method to a regression problem with a specific outcome variable Y is an extension of its original scope, not something Pearson himself designed or validated for this purpose.

Hotelling (1933): Variance Decomposition for Psychometrics

Pearson (1901) and Hotelling (1933) adopted different approaches. The standard algebraic derivation of principal components used today is close to that introduced by Hotelling (1933). Pearson (1901), on the other hand, was concerned with finding lines and planes that best fit a set of points in p-dimensional space.

Hotelling formalised the algebraic framework and introduced the term “principal components,” publishing in the Journal of Educational Psychology — a psychometrics context. His goal was to understand the latent structure of correlated psychological test variables, not to stabilise regression coefficients under multicollinearity. The aim of the method is to reduce the dimensionality of multivariate data whilst preserving as much of the relevant information as possible. It is a form of unsupervised learning in that it relies entirely on the input data itself without reference to the corresponding target data — the criterion to be maximised is the variance.

This is the core design decision that underpins the limitation Hadi and Ling (1998) identified: the criterion being maximised is variance in X, not covariance with Y. Both founders were explicit about this. Neither Pearson nor Hotelling proposed principal components as a solution to multicollinearity in regression. That application emerged later, and the Y-blindness of the method — which is by design in the founders’ original formulation — becomes a liability precisely in the regression context.

The Implication

The limitations of PCR identified in this analysis are therefore not a flaw in PCA as Pearson and Hotelling conceived it. They are a consequence of applying an unsupervised geometric tool to a supervised prediction problem. PCA is a linear transformation that transforms the data to a new coordinate system such that the new set of variables — the principal components — are linear functions of the original variables, are uncorrelated, and the greatest variance by any projection of the data comes to lie on the first coordinate. This design is entirely appropriate for dimension reduction, visualisation, and exploratory analysis. It is the mismatch with the inferential goals of regression — where relevance to Y, not variance in X, should govern component selection — that Hadi and Ling (1998) documented as problematic.

Methods such as Partial Least Squares (PLS), which explicitly maximise covariance with Y rather than variance in X, are better aligned with the supervised learning context that PCR is often asked to serve.
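
The difference in criterion can be illustrated with the first PLS direction alone. The sketch below is a hand-rolled single-component calculation in base R, not a full PLS implementation; it assumes the same simulation steps as Section 1, and the object names are illustrative.

```r
# Regenerate the simulated data (same seed and steps as Section 1)
set.seed(2024)
n <- 300
waist   <- rnorm(n, mean = 90, sd = 10)
bmi     <- 0.3 * waist + rnorm(n, mean = 5, sd = 1.5)
bodyfat <- 0.5 * waist + 0.4 * bmi + rnorm(n, mean = 2, sd = 1.5)
sbp     <- 100 + 0.6 * waist + 0.05 * bmi + 0.02 * bodyfat + rnorm(n, sd = 6)

Z <- scale(cbind(waist, bmi, bodyfat))    # standardised predictors
y <- sbp - mean(sbp)                      # centred outcome

# First PLS weight vector: proportional to Z'y, scaled to unit length
w_pls <- drop(crossprod(Z, y))
w_pls <- w_pls / sqrt(sum(w_pls^2))
t_pls <- drop(Z %*% w_pls)                # first PLS score

# Principal component scores of the same standardised predictors
pca <- prcomp(Z)

# Absolute covariance with Y: PLS direction versus each principal component
cov_pls <- abs(cov(t_pls, y))
cov_pcs <- abs(drop(cov(pca$x, y)))
round(c(PLS1 = cov_pls, cov_pcs), 3)
```

Among all unit-length weight vectors, the one proportional to Z'y gives the largest covariance with Y, so the PLS score is never beaten on this criterion by any single principal component. The principal components are ordered by variance in X instead, with no such guarantee about Y.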


References

Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(11), 559–572.

Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24, 417–441.

Hadi, A.S. and Ling, R.F. (1998). Some cautionary notes on the use of principal components regression. The American Statistician, 52(1), 15–19. https://doi.org/10.2307/2685559

Artigue, H. and Smith, G. (2019). The principal problem with principal components regression. Cogent Mathematics & Statistics, 6(1), 1622190. https://doi.org/10.1080/25742558.2019.1622190

Jolliffe, I.T. (1982). A note on the use of principal components in regression. Journal of the Royal Statistical Society: Series C, 31(3), 300–303.

Jolliffe, I.T. (2002). Principal Component Analysis (2nd ed.). Springer.

James, G., Witten, D., Hastie, T. and Tibshirani, R. (2021). An Introduction to Statistical Learning (2nd ed.), Chapter 6. Springer. https://www.statlearning.com