Principal Component Analysis and Multicollinearity

A Critical Examination of Assumptions, Limitations, and Practical Considerations

Author

Timothy Achala

Published

June 15, 2026

Background

Principal Component Regression (PCR) is widely recommended in statistical textbooks as a method for addressing multicollinearity in regression models. The core idea is appealing: transform a set of correlated predictors into orthogonal principal components, then regress the outcome on these uncorrelated components. In doing so, the variance inflation associated with collinear predictors is resolved — at least in the component space.

However, Hadi and Ling (1998) raised important cautionary notes about this practice, demonstrating both theoretically and empirically that PCR carries serious limitations that are often understated or absent from standard textbook treatments. This analysis explores those limitations through a simulated clinical example, and discusses the conditions under which PCR may or may not be appropriate.

1. Simulated Data: Collinear Adiposity Measures

To illustrate the behaviour of PCR under multicollinearity, we simulate a clinical dataset where three adiposity biomarkers — waist circumference, BMI, and body fat percentage — are used to predict systolic blood pressure (SBP). These three predictors are intentionally highly correlated, as they all capture overlapping aspects of body adiposity.

Code

library(tidyverse)
library(gt)
library(broom)
library(corrr)
library(GGally)
library(car)
library(factoextra)

set.seed(2024)

n <- 300

waist   <- rnorm(n, mean = 90, sd = 10)
bmi     <- 0.3 * waist + rnorm(n, mean = 5, sd = 1.5)
bodyfat <- 0.5 * waist + 0.4 * bmi + rnorm(n, mean = 2, sd = 1.5)

# True SBP: driven primarily by waist circumference
sbp <- 100 + 0.6 * waist + 0.05 * bmi + 0.02 * bodyfat + rnorm(n, sd = 6)

data <- tibble(waist, bmi, bodyfat, sbp)

head(data, 6) |>
  gt() |>
  tab_header(
    title = "Simulated Clinical Data",
    subtitle = "Adiposity measures predicting systolic blood pressure (n = 300)"
  ) |>
  fmt_number(everything(), decimals = 2)

Simulated Clinical Data
Adiposity measures predicting systolic blood pressure (n = 300)
waist	bmi	bodyfat	sbp
99.82	36.01	62.44	165.77
94.69	32.46	64.26	156.17
88.92	31.20	59.85	146.61
87.87	32.96	59.32	156.83
101.58	37.86	68.26	170.15
102.92	36.55	66.83	164.92

2. Assessing Multicollinearity in the Original Predictors

Code

GGally::ggpairs(
  data |> select(waist, bmi, bodyfat),
  title = "Pairwise Relationships Among Adiposity Predictors"
) +
  theme_minimal(base_size = 11)

Figure 1: Pairwise scatterplots and correlation coefficients confirm severe multicollinearity among the three adiposity predictors.

Code

m_raw <- lm(sbp ~ waist + bmi + bodyfat, data = data)

vif(m_raw) |>
  enframe(name = "Predictor", value = "VIF") |>
  gt() |>
  tab_header(
    title = "Variance Inflation Factors: OLS Model",
    subtitle = "VIF values indicate severe multicollinearity among predictors"
  ) |>
  fmt_number(VIF, decimals = 1) |>
  tab_style(
    style = cell_fill(color = "#FADBD8"),
    locations = cells_body(rows = VIF > 10)
  )

Variance Inflation Factors: OLS Model
VIF values indicate severe multicollinearity among predictors
Predictor	VIF
waist	16.6
bmi	5.2
bodyfat	17.5

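The VIFs for waist circumference (16.6) and body fat (17.5) exceed the conventional threshold of 10, and the VIF for BMI (5.2) indicates moderate inflation. The individual OLS coefficient estimates are therefore imprecise: because the predictors are nearly linearly dependent, small changes in the data can shift the estimates substantially. This is the scenario in which PCR is typically recommended.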
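To make the VIF definition concrete, the sketch below recomputes the values by hand instead of calling car::vif(). It assumes the simulation from Section 1, recreated here so the block runs on its own; `dat`, `vif_manual`, and `vif_from_cor` are illustrative names that do not appear in the original analysis.

```r
# Recreate the Section 1 simulation so this block runs on its own
set.seed(2024)
n <- 300
waist   <- rnorm(n, mean = 90, sd = 10)
bmi     <- 0.3 * waist + rnorm(n, mean = 5, sd = 1.5)
bodyfat <- 0.5 * waist + 0.4 * bmi + rnorm(n, mean = 2, sd = 1.5)
dat <- data.frame(waist, bmi, bodyfat)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors
vif_manual <- sapply(names(dat), function(v) {
  aux <- lm(reformulate(setdiff(names(dat), v), response = v), data = dat)
  1 / (1 - summary(aux)$r.squared)
})

# Equivalent shortcut: the diagonal of the inverse correlation matrix
vif_from_cor <- diag(solve(cor(dat)))

print(round(rbind(vif_manual, vif_from_cor), 2))
```

The two routes should agree with each other. The auxiliary-regression form also shows why the VIF does not depend on the outcome: SBP never enters the calculation.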

3. Applying Principal Component Analysis

Code

pca_fit <- prcomp(data |> select(waist, bmi, bodyfat), scale. = TRUE)

summary(pca_fit)$importance |>
  as.data.frame() |>
  rownames_to_column("Metric") |>
  gt() |>
  tab_header(
    title = "Variance Explained by Principal Components",
    subtitle = "PC1 captures the dominant shared variance among predictors"
  ) |>
  fmt_number(-Metric, decimals = 3)

Variance Explained by Principal Components
PC1 captures the dominant shared variance among predictors
Metric	PC1	PC2	PC3
Standard deviation	1.683	0.366	0.178
Proportion of Variance	0.945	0.045	0.011
Cumulative Proportion	0.945	0.989	1.000

Code

pca_fit$rotation |>
  as.data.frame() |>
  rownames_to_column("Original Variable") |>
  gt() |>
  tab_header(
    title = "Principal Component Loadings",
    subtitle = "Each loading describes how original variables contribute to a given component"
  ) |>
  fmt_number(-`Original Variable`, decimals = 3)

Principal Component Loadings
Each loading describes how original variables contribute to a given component
Original Variable	PC1	PC2	PC3
waist	−0.582	0.424	0.694
bmi	−0.566	−0.824	0.028
bodyfat	−0.583	0.377	−0.720

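PC1 accounts for 94.5% of the variance in the predictor space and loads approximately equally on all three adiposity measures, reflecting the shared latent dimension of general body adiposity. The loadings are all negative; because the sign of a principal component is arbitrary, this simply means that higher PC1 scores correspond to lower adiposity here. PC2 and PC3 capture the remaining, largely residual variation.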
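The variance and loading tables can be connected to the underlying linear algebra with a short check. The sketch below assumes the Section 1 simulation (recreated so the block is self-contained) and compares prcomp() with a direct eigendecomposition of the correlation matrix; `X`, `pca`, and `eig` are illustrative names.

```r
# Recreate the Section 1 predictors so this block runs on its own
set.seed(2024)
n <- 300
waist   <- rnorm(n, mean = 90, sd = 10)
bmi     <- 0.3 * waist + rnorm(n, mean = 5, sd = 1.5)
bodyfat <- 0.5 * waist + 0.4 * bmi + rnorm(n, mean = 2, sd = 1.5)
X <- cbind(waist, bmi, bodyfat)

pca <- prcomp(X, scale. = TRUE)
eig <- eigen(cor(X))

# Component variances (sdev^2) equal the eigenvalues of the correlation matrix
print(rbind(prcomp = pca$sdev^2, eigen = eig$values))

# Proportion of variance = eigenvalue / number of standardised predictors
print(round(eig$values / ncol(X), 3))

# Loadings match the eigenvectors up to an arbitrary sign flip per column,
# so the absolute values should differ only by rounding error
print(max(abs(abs(pca$rotation) - abs(eig$vectors))))
```

Because the predictors are standardised, the eigenvalues sum to the number of predictors (three here), which is why each proportion of variance is simply an eigenvalue divided by three.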

4. Regression on Principal Components

Code

pc_scores  <- as_tibble(pca_fit$x)
data_pca   <- bind_cols(data |> select(sbp), pc_scores)

# Full PCR: all components retained
m_pca_full <- lm(sbp ~ PC1 + PC2 + PC3, data = data_pca)

tidy(m_pca_full) |>
  gt() |>
  tab_header(
    title = "Full PCR: Regression on All Three Principal Components",
    subtitle = "All components retained — equivalent to OLS"
  ) |>
  fmt_number(c(estimate, std.error, statistic, p.value), decimals = 4)

Full PCR: Regression on All Three Principal Components
All components retained — equivalent to OLS
term	estimate	std.error	statistic	p.value
(Intercept)	156.8239	0.3278	478.4794	0.0000
PC1	−3.8115	0.1950	−19.5450	0.0000
PC2	2.2081	0.8964	2.4633	0.0143
PC3	7.5573	1.8435	4.0994	0.0001

Code

vif(m_pca_full) |>
  enframe(name = "Component", value = "VIF") |>
  gt() |>
  tab_header(
    title = "VIF After PCA Transformation",
    subtitle = "Components are orthogonal by construction — VIF = 1.0 for all"
  ) |>
  fmt_number(VIF, decimals = 3) |>
  tab_style(
    style = cell_fill(color = "#D5F5E3"),
    locations = cells_body()
  )

VIF After PCA Transformation
Components are orthogonal by construction — VIF = 1.0 for all
Component	VIF
PC1	1.000
PC2	1.000
PC3	1.000

The VIF values after transformation are exactly 1.0, which reflects the mathematical property of orthogonality among principal components. However, it is important to recognise that when all components are retained, this result is not a statistical fix — it is a geometric property of the transformation. As noted by Hadi and Ling (1998), when all principal components are included in the regression, the PCR estimator is algebraically equivalent to the OLS estimator. The underlying instability has not been addressed; it has simply been re-expressed in a different coordinate system.
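The algebraic equivalence can be checked numerically. The sketch below is a minimal illustration, assuming the Section 1 simulation (recreated so the block is self-contained): it maps the full-PCR component coefficients back to the original predictor scale and compares them with the OLS coefficients. The names `ols`, `pcr`, `gamma`, and `beta_pcr` are illustrative.

```r
# Recreate the Section 1 simulation so this block runs on its own
set.seed(2024)
n <- 300
waist   <- rnorm(n, mean = 90, sd = 10)
bmi     <- 0.3 * waist + rnorm(n, mean = 5, sd = 1.5)
bodyfat <- 0.5 * waist + 0.4 * bmi + rnorm(n, mean = 2, sd = 1.5)
sbp     <- 100 + 0.6 * waist + 0.05 * bmi + 0.02 * bodyfat + rnorm(n, sd = 6)
X <- cbind(waist, bmi, bodyfat)

pca <- prcomp(X, scale. = TRUE)
ols <- lm(sbp ~ X)
pcr <- lm(sbp ~ pca$x)   # all three components retained

# Map component coefficients back to the original predictor scale:
# beta = (V %*% gamma) / s, where V holds the loadings and s the predictor SDs
gamma    <- coef(pcr)[-1]
beta_pcr <- drop(pca$rotation %*% gamma) / pca$scale

print(round(rbind(OLS = coef(ols)[-1], `PCR back-transformed` = beta_pcr), 4))
```

The two rows should be identical up to floating-point error, and the fitted values of the two models coincide. Full PCR is a rotation of the same least-squares fit, so it inherits the same coefficient uncertainty once expressed in the original variables.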

5. The Truncated PCR: Where the “Fix” Actually Occurs

The practical application of PCR involves dropping the low-variance components, that is, those with the smallest eigenvalues. This step is where the actual variance reduction occurs, in the coefficients implied for the original predictors.

Code

# Truncated PCR: drop PC3 (smallest variance)
m_pca_trunc <- lm(sbp ~ PC1 + PC2, data = data_pca)

tidy(m_pca_trunc) |>
  gt() |>
  tab_header(
    title = "Truncated PCR: PC3 Excluded",
    subtitle = "Dropping the lowest-variance component to stabilise estimates"
  ) |>
  fmt_number(c(estimate, std.error, statistic, p.value), decimals = 4)

Truncated PCR: PC3 Excluded
Dropping the lowest-variance component to stabilise estimates
term	estimate	std.error	statistic	p.value
(Intercept)	156.8239	0.3364	466.2349	0.0000
PC1	−3.8115	0.2001	−19.0449	0.0000
PC2	2.2081	0.9199	2.4002	0.0170
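In the component space the truncated model looks almost unchanged: the PC1 and PC2 estimates are identical to the full model and their standard errors are slightly larger. The effect of truncation is easier to see on the original predictors. The sketch below, assuming the Section 1 simulation (recreated so the block is self-contained), back-transforms the two-component fit; `k`, `pcr_k`, `beta_k`, and `alpha_k` are illustrative names.

```r
# Recreate the Section 1 simulation so this block runs on its own
set.seed(2024)
n <- 300
waist   <- rnorm(n, mean = 90, sd = 10)
bmi     <- 0.3 * waist + rnorm(n, mean = 5, sd = 1.5)
bodyfat <- 0.5 * waist + 0.4 * bmi + rnorm(n, mean = 2, sd = 1.5)
sbp     <- 100 + 0.6 * waist + 0.05 * bmi + 0.02 * bodyfat + rnorm(n, sd = 6)
X <- cbind(waist, bmi, bodyfat)

pca <- prcomp(X, scale. = TRUE)
k   <- 2   # components retained, matching the truncated model above

pcr_k   <- lm(sbp ~ pca$x[, 1:k])
gamma_k <- coef(pcr_k)[-1]

# Implied coefficients on the original predictors: beta = V_k gamma_k / s
beta_k  <- drop(pca$rotation[, 1:k] %*% gamma_k) / pca$scale
# Intercept adjusted for the centring applied by prcomp()
alpha_k <- unname(coef(pcr_k)[1]) - sum(beta_k * pca$center)

ols <- lm(sbp ~ X)
print(round(rbind(OLS = coef(ols)[-1], `PCR (k = 2)` = beta_k), 4))
```

The back-transformed coefficients reproduce the truncated model's fitted values exactly, but they are confined to the space spanned by the retained loadings: on the standardised scale they are orthogonal to the PC3 loading vector. That restriction is the source of both the variance reduction and the potential bias.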

Code

# Compare each component's share of variance in X with its correlation with Y

tibble(
  Component = c("PC1", "PC2", "PC3"),
  `Variance in X (%)` = round(summary(pca_fit)$importance[2,] * 100, 1),
  `Correlation with SBP (Y)` = round(c(
    cor(data_pca$PC1, data_pca$sbp),
    cor(data_pca$PC2, data_pca$sbp),
    cor(data_pca$PC3, data_pca$sbp)
  ), 3)
) |>
  gt() |>
  tab_header(
    title = "Variance in X vs. Correlation with Outcome Y",
    subtitle = "A component's variance in X does not guarantee its relevance to Y"
  ) |>
  tab_style(
    style = cell_fill(color = "#FEF9E7"),
    locations = cells_body(rows = Component == "PC3")
  )

Variance in X vs. Correlation with Outcome Y
A component's variance in X does not guarantee its relevance to Y
Component	Variance in X (%)	Correlation with SBP (Y)
PC1	94.5	-0.738
PC2	4.5	0.093
PC3	1.1	0.155

This table is central to understanding the key limitation of PCR. The component selection criterion is based entirely on variance in predictor space (X), with no reference to the outcome variable (Y). In this simulation, PC3 carries only 1.1% of the predictor variance yet correlates more strongly with SBP (0.155) than PC2 does (0.093), and its coefficient in the full model is clearly non-zero (p = 0.0001); a variance-based rule nevertheless discards PC3 first. Hadi and Ling (1998) demonstrated, both theoretically and through empirical examples with well-known datasets, that it is entirely possible for the component with the smallest eigenvalue to be the one most strongly associated with Y, while the dominant components may contribute little or nothing to predicting the outcome. Dropping low-variance components is therefore a biased estimation strategy whose consequences cannot be assessed from the predictor structure alone.
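One way to quantify what each component contributes to the outcome is to use the fact that, because the components are mutually uncorrelated, the full-model R² splits exactly into the squared correlations between each component and Y. The sketch below assumes the Section 1 simulation (recreated so the block is self-contained); `r2_by_pc` and `r2_full` are illustrative names.

```r
# Recreate the Section 1 simulation so this block runs on its own
set.seed(2024)
n <- 300
waist   <- rnorm(n, mean = 90, sd = 10)
bmi     <- 0.3 * waist + rnorm(n, mean = 5, sd = 1.5)
bodyfat <- 0.5 * waist + 0.4 * bmi + rnorm(n, mean = 2, sd = 1.5)
sbp     <- 100 + 0.6 * waist + 0.05 * bmi + 0.02 * bodyfat + rnorm(n, sd = 6)

pca <- prcomp(cbind(waist, bmi, bodyfat), scale. = TRUE)

# Squared correlation of each component with the outcome
r2_by_pc <- drop(cor(pca$x, sbp))^2

# With orthogonal components these add up to the full-model R^2
r2_full <- summary(lm(sbp ~ pca$x))$r.squared

print(round(rbind(
  `Share of variance in X` = pca$sdev^2 / sum(pca$sdev^2),
  `Share of R^2 in Y`      = r2_by_pc / r2_full
), 3))
```

Comparing the two rows shows directly whether the ordering of components by variance in X matches their ordering by relevance to Y. When the orderings disagree, an eigenvalue-based truncation rule discards outcome-relevant signal.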

6. The Interpretability Consideration

Code

loadings  <- pca_fit$rotation[, 1]
pc1_coef  <- coef(m_pca_full)["PC1"]
# The sign of a principal component is arbitrary, so report magnitude and direction separately
direction <- if (pc1_coef < 0) "decrease" else "increase"

cat(glue::glue("
Beyond the estimation considerations above, PCR also raises an important interpretability
question that is particularly relevant in clinical and epidemiological research.

The PC1 coefficient from the full model is **{round(pc1_coef, 3)}**, with
loadings of **{round(loadings['waist'], 3)}** on waist, **{round(loadings['bmi'], 3)}**
on BMI, and **{round(loadings['bodyfat'], 3)}** on body fat.

In practical terms, this means the model estimates that a one-unit increase in the composite
of standardised (centred, unit-variance) predictors
*({round(loadings['waist'],2)} × z(Waist)) + ({round(loadings['bmi'],2)} × z(BMI)) + ({round(loadings['bodyfat'],2)} × z(Body Fat))*
is associated with a **{abs(round(pc1_coef, 2))} mmHg** {direction} in SBP.

While mathematically valid, this formulation does not permit a clinician or policymaker to
act on the individual predictor effects. The question of how SBP changes for every 1 cm
increase in waist circumference, holding BMI and body fat constant, cannot be answered
directly from the PCR output without back-transformation, and even then, the back-transformed
coefficients carry the constraints imposed by the X correlation structure rather than the
Y-relevant signal.
"))

Beyond the estimation considerations above, PCR also raises an important interpretability question that is particularly relevant in clinical and epidemiological research.

The PC1 coefficient from the full model is -3.812, with loadings of -0.582 on waist, -0.566 on BMI, and -0.583 on body fat.

In practical terms, this means the model estimates that a one-unit increase in the composite of standardised (centred, unit-variance) predictors (-0.58 × z(Waist)) + (-0.57 × z(BMI)) + (-0.58 × z(Body Fat)) is associated with a 3.81 mmHg decrease in SBP.

While mathematically valid, this formulation does not permit a clinician or policymaker to act on the individual predictor effects. The question of how SBP changes for every 1 cm increase in waist circumference, holding BMI and body fat constant, cannot be answered directly from the PCR output without back-transformation, and even then, the back-transformed coefficients carry the constraints imposed by the X correlation structure rather than the Y-relevant signal.

7. Visual Summary: Biplot

Code

fviz_pca_biplot(
  pca_fit,
  geom.ind  = "point",
  col.ind   = "grey70",
  col.var   = "#2E86AB",
  repel     = TRUE,
  title     = "PCA Biplot: Waist, BMI, and Body Fat in Principal Component Space"
)

Figure 2: PCA biplot showing the direction of each original variable in component space. The near-parallel arrows indicate that all three predictors share a common latent direction — the variance that PCR reorganises into PC1.

The near-parallel orientation of the three variable arrows confirms that waist circumference, BMI, and body fat percentage share a dominant common direction in the predictor space. PCA reorganises this shared structure into PC1, but the underlying correlation among the original variables is unchanged.

8. Summary of Considerations

Code

tibble(
  Scenario = c(
    "Full PCR (all components retained)",
    "Truncated PCR (low-variance components dropped)",
    "PCR for prediction (dimension reduction goal)",
    "PCR for causal / clinical inference"
  ),
  `What PCR Achieves` = c(
    "Orthogonal components; VIF = 1.0",
    "Reduced coefficient variance; more stable estimates",
    "Efficient compression of correlated predictors",
    "Estimated associations in component space"
  ),
  `Key Limitation` = c(
    "Algebraically equivalent to OLS — instability is unchanged",
    "Component selection is Y-blind; dropped components may predict Y well",
    "Interpretability of individual predictors is lost",
    "Back-transformed coefficients are constrained by X structure, not Y signal"
  )
) |>
  gt() |>
  tab_header(
    title = "PCR Under Different Analytical Goals",
    subtitle = "The appropriateness of PCR depends heavily on the inferential objective"
  )

PCR Under Different Analytical Goals
The appropriateness of PCR depends heavily on the inferential objective
Scenario	What PCR Achieves	Key Limitation
Full PCR (all components retained)	Orthogonal components; VIF = 1.0	Algebraically equivalent to OLS — instability is unchanged
Truncated PCR (low-variance components dropped)	Reduced coefficient variance; more stable estimates	Component selection is Y-blind; dropped components may predict Y well
PCR for prediction (dimension reduction goal)	Efficient compression of correlated predictors	Interpretability of individual predictors is lost
PCR for causal / clinical inference	Estimated associations in component space	Back-transformed coefficients are constrained by X structure, not Y signal

When PCR May Be Appropriate

PCR can be a reasonable approach under specific conditions:

Prediction-focused analyses where the goal is forecasting rather than coefficient interpretation (e.g., genomics, spectroscopy, high-dimensional biomarker panels)
Exploratory analyses where identifying latent constructs (e.g., a “general adiposity” dimension) is itself informative
Downstream machine learning pipelines where orthogonal inputs improve computational stability

When Alternative Approaches Deserve Consideration

When the goal is to estimate the effect of individual named predictors — as is typical in clinical epidemiology and health policy research — the following alternatives preserve original variable identity while addressing the instability problem:

Ridge regression: continuously shrinks all coefficients toward zero, with greater shrinkage applied along principal directions with smaller eigenvalues. Unlike truncated PCR, ridge regression makes no hard binary decision about which components to include: every direction is retained with a smooth shrinkage factor, and the outcome guides the amount of shrinkage when the penalty is tuned by cross-validation
LASSO / Elastic Net: performs variable selection while retaining named-variable coefficients, allowing the outcome to guide which predictors remain in the model
Partial Least Squares (PLS): constructs components that maximise covariance with Y rather than variance in X alone, directly addressing the Y-blindness of standard PCR
Domain-informed variable selection: selecting a single clinically meaningful proxy based on subject-matter knowledge, avoiding the statistical complexity while preserving interpretability
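The contrast between ridge regression and truncated PCR drawn in the list above can be illustrated through the singular value decomposition of the standardised predictors. The sketch below assumes the Section 1 simulation (recreated so the block is self-contained) and uses a hypothetical penalty `lambda = 50`, chosen purely for illustration and not tuned; all object names are illustrative.

```r
# Recreate the Section 1 simulation so this block runs on its own
set.seed(2024)
n <- 300
waist   <- rnorm(n, mean = 90, sd = 10)
bmi     <- 0.3 * waist + rnorm(n, mean = 5, sd = 1.5)
bodyfat <- 0.5 * waist + 0.4 * bmi + rnorm(n, mean = 2, sd = 1.5)
sbp     <- 100 + 0.6 * waist + 0.05 * bmi + 0.02 * bodyfat + rnorm(n, sd = 6)

Z <- scale(cbind(waist, bmi, bodyfat))   # standardised predictors
y <- sbp - mean(sbp)                     # centred outcome
lambda <- 50                             # hypothetical penalty, illustration only

sv <- svd(Z)
d2 <- sv$d^2

# Ridge multiplies the OLS coefficient along component j by d_j^2 / (d_j^2 + lambda)
shrink_ridge <- d2 / (d2 + lambda)
# Truncated PCR with two components uses all-or-nothing factors instead
shrink_pcr <- c(1, 1, 0)

gamma_ols  <- drop(crossprod(sv$u, y)) / sv$d            # OLS coefficients on the components
beta_ridge <- drop(sv$v %*% (shrink_ridge * gamma_ols))  # back to standardised predictors

# Check against the closed-form ridge solution (Z'Z + lambda I)^{-1} Z'y
beta_closed <- drop(solve(crossprod(Z) + lambda * diag(ncol(Z)), crossprod(Z, y)))

print(round(rbind(ridge = shrink_ridge, truncated_pcr = shrink_pcr), 3))
```

The ridge factors decrease smoothly from the first component to the last, whereas truncated PCR applies factors of exactly one or zero. In practice the penalty would usually be chosen by cross-validation, which is where the outcome enters.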

9. What the Founders Actually Intended

A question worth asking is whether the limitations discussed above are a failure of PCA itself, or a failure of how it has been applied. Examining the original papers by Pearson (1901) and Hotelling (1933) provides an important perspective.

Pearson (1901): A Geometric, Not a Regression, Problem

Karl Pearson’s 1901 paper presents the underlying problem of PCA from a purely geometric standpoint, describing how to find low-dimensional subspaces that best fit — in the least squares sense — a cloud of points in space. Notably, Pearson never used the term “principal components analysis.”

Pearson’s original question was: “Given a cloud of points in p-dimensional space, what line or plane best represents the data?” This is a problem of geometric approximation — finding the subspace of lowest dimensionality that preserves the structure of the data. Pearson’s aim was to escape from the non-symmetry of dependent and independent variables in linear regression — the regression line changes if the roles of dependent and independent variables are reversed — by giving equal status to all variables. He was led to find the best fit of a system of points in the plane by a line.

Critically, there is no outcome variable Y in Pearson’s formulation. PCA as originally conceived was an entirely unsupervised geometric operation — there was no dependent variable to predict, and therefore no question of whether the retained components were relevant to a regression target. Applying Pearson’s method to a regression problem with a specific outcome variable Y is an extension of its original scope, not something Pearson himself designed or validated for this purpose.

Hotelling (1933): Variance Decomposition for Psychometrics

Pearson (1901) and Hotelling (1933) adopted different approaches. The standard algebraic derivation is close to that introduced by Hotelling (1933). Pearson (1901), on the other hand, was concerned with finding lines and planes that best fit a set of points in p-dimensional space.

Hotelling formalised the algebraic framework and introduced the term “principal components,” publishing in the Journal of Educational Psychology — a psychometrics context. His goal was to understand the latent structure of correlated psychological test variables, not to stabilise regression coefficients under multicollinearity. The aim of the method is to reduce the dimensionality of multivariate data whilst preserving as much of the relevant information as possible. It is a form of unsupervised learning in that it relies entirely on the input data itself without reference to the corresponding target data — the criterion to be maximised is the variance.

This is the core design decision that underpins the limitation Hadi and Ling (1998) identified: the criterion being maximised is variance in X, not covariance with Y. Neither formulation involves an outcome variable, and neither Pearson nor Hotelling proposed principal components as a solution to multicollinearity in regression. That application emerged later, and the Y-blindness of the method, which is by design in the founders’ original formulation, becomes a liability precisely in the regression context.

The Implication

The limitations of PCR identified in this analysis are therefore not a flaw in PCA as Pearson and Hotelling conceived it. They are a consequence of applying an unsupervised geometric tool to a supervised prediction problem. PCA is a linear transformation that transforms the data to a new coordinate system such that the new set of variables — the principal components — are linear functions of the original variables, are uncorrelated, and the greatest variance by any projection of the data comes to lie on the first coordinate. This design is entirely appropriate for dimension reduction, visualisation, and exploratory analysis. It is the mismatch with the inferential goals of regression — where relevance to Y, not variance in X, should govern component selection — that Hadi and Ling (1998) documented as problematic.

Methods such as Partial Least Squares (PLS), which explicitly maximise covariance with Y rather than variance in X, are better aligned with the supervised learning context that PCR is often asked to serve.
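The difference in criterion can be seen in the first component alone. For a single outcome, the first PLS weight vector is the unit-length direction proportional to X'y, whereas the first PCA loading vector ignores y entirely. The sketch below, assuming the Section 1 simulation (recreated so the block is self-contained), computes both by hand in base R without a dedicated PLS package; `w_pls`, `w_pca`, and `cov_with_y` are illustrative names.

```r
# Recreate the Section 1 simulation so this block runs on its own
set.seed(2024)
n <- 300
waist   <- rnorm(n, mean = 90, sd = 10)
bmi     <- 0.3 * waist + rnorm(n, mean = 5, sd = 1.5)
bodyfat <- 0.5 * waist + 0.4 * bmi + rnorm(n, mean = 2, sd = 1.5)
sbp     <- 100 + 0.6 * waist + 0.05 * bmi + 0.02 * bodyfat + rnorm(n, sd = 6)

Z <- scale(cbind(waist, bmi, bodyfat))   # standardised predictors
y <- sbp - mean(sbp)                     # centred outcome

# First PLS weight vector: unit-length direction proportional to Z'y
w_pls <- drop(crossprod(Z, y))
w_pls <- w_pls / sqrt(sum(w_pls^2))

# First PCA loading vector: unit-length direction of maximal variance in Z
w_pca <- prcomp(Z)$rotation[, 1]

# Absolute covariance between each component score and the outcome
cov_with_y <- c(
  PLS1 = abs(cov(drop(Z %*% w_pls), y)),
  PC1  = abs(cov(drop(Z %*% w_pca), y))
)
print(round(cov_with_y, 3))
print(round(rbind(PLS1 = w_pls, PC1 = w_pca), 3))
```

By construction, no unit-length direction can have a larger absolute covariance with the outcome than the PLS weight vector, so the PLS1 value is never smaller than the PC1 value. How different the two directions are in a given dataset depends on how closely the dominant variance direction in X happens to align with Y.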

References

Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(11), 559–572.

Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24, 417–441.

Hadi, A.S. and Ling, R.F. (1998). Some cautionary notes on the use of principal components regression. The American Statistician, 52(1), 15–19. https://doi.org/10.2307/2685559

Artigue, H. and Smith, G. (2019). The principal problem with principal components regression. Cogent Mathematics & Statistics, 6(1), 1622190. https://doi.org/10.1080/25742558.2019.1622190

Jolliffe, I.T. (1982). A note on the use of principal components in regression. Journal of the Royal Statistical Society: Series C, 31(3), 300–303.

Jolliffe, I.T. (2002). Principal Component Analysis (2nd ed.). Springer.

James, G., Witten, D., Hastie, T. and Tibshirani, R. (2021). An Introduction to Statistical Learning (2nd ed.), Chapter 6. Springer. https://www.statlearning.com