Principal Component Analysis and Multicollinearity
A Critical Examination of Assumptions, Limitations, and Practical Considerations
Author: Timothy Achala
Published: June 15, 2026
Background
Principal Component Regression (PCR) is widely recommended in statistical textbooks as a method for addressing multicollinearity in regression models. The core idea is appealing: transform a set of correlated predictors into orthogonal principal components, then regress the outcome on these uncorrelated components. In doing so, the variance inflation associated with collinear predictors is resolved — at least in the component space.
However, Hadi and Ling (1998) raised important cautionary notes about this practice, demonstrating both theoretically and empirically that PCR carries serious limitations that are often understated or absent from standard textbook treatments. This analysis explores those limitations through a simulated clinical example, and discusses the conditions under which PCR may or may not be appropriate.
1. Simulated Data: Collinear Adiposity Measures
To illustrate the behaviour of PCR under multicollinearity, we simulate a clinical dataset where three adiposity biomarkers — waist circumference, BMI, and body fat percentage — are used to predict systolic blood pressure (SBP). These three predictors are intentionally highly correlated, as they all capture overlapping aspects of body adiposity.
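A dependency-free sketch of this simulation is given here in base R. It mirrors the data-generating settings of the article's setup code (seed 2024, n = 300, SBP driven mainly by waist circumference); the object name `dat` and the use of a plain data frame are assumptions made for illustration.

```r
# Simulate three overlapping adiposity measures and an SBP outcome
set.seed(2024)
n <- 300
waist   <- rnorm(n, mean = 90, sd = 10)
bmi     <- 0.3 * waist + rnorm(n, mean = 5, sd = 1.5)
bodyfat <- 0.5 * waist + 0.4 * bmi + rnorm(n, mean = 2, sd = 1.5)

# SBP depends mostly on waist, only weakly on the other two measures
sbp <- 100 + 0.6 * waist + 0.05 * bmi + 0.02 * bodyfat + rnorm(n, sd = 6)

dat <- data.frame(waist, bmi, bodyfat, sbp)

# Pairwise correlations: the three predictors should be strongly correlated
cor_mat <- cor(dat)
print(round(cor_mat, 2))
```

The pairwise correlations among the predictors are what the later VIF and PCA steps respond to.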
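2. Assessing Multicollinearity in the Original Predictors
Figure 1: Pairwise scatterplots and correlation coefficients confirm severe multicollinearity among the three adiposity predictors.
Variance Inflation Factors: OLS Model
VIF values indicate severe multicollinearity among predictors

| Predictor | VIF |
|---|---|
| waist | 16.6 |
| bmi | 5.2 |
| bodyfat | 17.5 |

The VIFs for waist and body fat exceed the conventional threshold of 10, and the VIF for BMI is above 5. Individual OLS coefficient estimates are therefore unstable: the near-linear dependency among the predictors inflates their standard errors. This is the scenario in which PCR is typically recommended.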
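To make the VIF concrete, the following base R sketch computes it in two equivalent ways: from the auxiliary regressions, as VIF_j = 1 / (1 - R_j^2), and as the diagonal of the inverse correlation matrix. The data are re-simulated with the article's settings so the block runs on its own; the names `dat`, `vif_manual` and `vif_inverse` are illustrative.

```r
# Re-create the simulated data (same settings as the article)
set.seed(2024)
n <- 300
waist   <- rnorm(n, mean = 90, sd = 10)
bmi     <- 0.3 * waist + rnorm(n, mean = 5, sd = 1.5)
bodyfat <- 0.5 * waist + 0.4 * bmi + rnorm(n, mean = 2, sd = 1.5)
sbp     <- 100 + 0.6 * waist + 0.05 * bmi + 0.02 * bodyfat + rnorm(n, sd = 6)
dat <- data.frame(waist, bmi, bodyfat, sbp)

X <- dat[, c("waist", "bmi", "bodyfat")]

# Route 1: regress each predictor on the others, then VIF = 1 / (1 - R^2)
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

# Route 2: diagonal of the inverse of the predictor correlation matrix
vif_inverse <- diag(solve(cor(X)))

print(round(cbind(vif_manual, vif_inverse), 1))
```

Both routes depend only on the predictors; the outcome plays no part in a VIF.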
3. Applying Principal Component Analysis
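```r
library(tidyverse)  # select(), rownames_to_column(), tibble()
library(gt)         # table rendering

# PCA on the standardised predictors; `data` is the simulated tibble from Section 1
pca_fit <- prcomp(data |> select(waist, bmi, bodyfat), scale. = TRUE)

summary(pca_fit)$importance |>
  as.data.frame() |>
  rownames_to_column("Metric") |>
  gt() |>
  tab_header(
    title = "Variance Explained by Principal Components",
    subtitle = "PC1 captures the dominant shared variance among predictors"
  ) |>
  fmt_number(-Metric, decimals = 3)
```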
Variance Explained by Principal Components
PC1 captures the dominant shared variance among predictors

| Metric | PC1 | PC2 | PC3 |
|---|---|---|---|
| Standard deviation | 1.683 | 0.366 | 0.178 |
| Proportion of Variance | 0.945 | 0.045 | 0.011 |
| Cumulative Proportion | 0.945 | 0.989 | 1.000 |
```r
# Loadings: the weight of each standardised predictor in each component
pca_fit$rotation |>
  as.data.frame() |>
  rownames_to_column("Original Variable") |>
  gt() |>
  tab_header(
    title = "Principal Component Loadings",
    subtitle = "Each loading describes how original variables contribute to a given component"
  ) |>
  fmt_number(-`Original Variable`, decimals = 3)
```
Principal Component Loadings
Each loading describes how original variables contribute to a given component

| Original Variable | PC1 | PC2 | PC3 |
|---|---|---|---|
| waist | -0.582 | 0.424 | 0.694 |
| bmi | -0.566 | -0.824 | 0.028 |
| bodyfat | -0.583 | 0.377 | -0.720 |
PC1 accounts for the majority of variance in the predictor space and loads approximately equally on all three adiposity measures — reflecting the shared latent dimension of general body adiposity. PC2 and PC3 capture the remaining, largely residual variation.
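As a check on what `prcomp()` is doing here, the sketch below reproduces the component standard deviations and loadings from an eigen-decomposition of the predictor correlation matrix. It uses base R only and re-simulates the data with the article's settings; `dat`, `pca` and `eig` are illustrative names.

```r
# Re-create the simulated predictors (same settings as the article)
set.seed(2024)
n <- 300
waist   <- rnorm(n, mean = 90, sd = 10)
bmi     <- 0.3 * waist + rnorm(n, mean = 5, sd = 1.5)
bodyfat <- 0.5 * waist + 0.4 * bmi + rnorm(n, mean = 2, sd = 1.5)
dat <- data.frame(waist, bmi, bodyfat)

pca <- prcomp(dat, scale. = TRUE)  # PCA on standardised predictors
eig <- eigen(cor(dat))             # eigen-decomposition of the correlation matrix

# Component standard deviations are the square roots of the eigenvalues
print(cbind(prcomp_sdev = pca$sdev, sqrt_eigenvalue = sqrt(eig$values)))

# Proportion of variance = eigenvalue / number of standardised variables
prop_var <- eig$values / sum(eig$values)
print(round(prop_var, 3))

# Loadings agree with the eigenvectors up to an arbitrary sign per component
max_abs_diff <- max(abs(abs(pca$rotation) - abs(eig$vectors)))
print(max_abs_diff)
```

Because the sign of each eigenvector is arbitrary, the all-negative PC1 loadings in the table above could equally be reported as all positive.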
4. Regression on Principal Components
```r
library(broom)  # tidy()

# Attach the component scores to the outcome
pc_scores <- as_tibble(pca_fit$x)
data_pca <- bind_cols(data |> select(sbp), pc_scores)

# Full PCR: all components retained
m_pca_full <- lm(sbp ~ PC1 + PC2 + PC3, data = data_pca)

tidy(m_pca_full) |>
  gt() |>
  tab_header(
    title = "Full PCR: Regression on All Three Principal Components",
    subtitle = "All components retained: equivalent to OLS"
  ) |>
  fmt_number(c(estimate, std.error, statistic, p.value), decimals = 4)
```
Full PCR: Regression on All Three Principal Components
All components retained: equivalent to OLS

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 156.8239 | 0.3278 | 478.4794 | 0.0000 |
| PC1 | -3.8115 | 0.1950 | -19.5450 | 0.0000 |
| PC2 | 2.2081 | 0.8964 | 2.4633 | 0.0143 |
| PC3 | 7.5573 | 1.8435 | 4.0994 | 0.0001 |
```r
library(car)  # vif()

# VIFs of the component regression
vif(m_pca_full) |>
  enframe(name = "Component", value = "VIF") |>
  gt() |>
  tab_header(
    title = "VIF After PCA Transformation",
    subtitle = "Components are orthogonal by construction: VIF = 1.0 for all"
  ) |>
  fmt_number(VIF, decimals = 3) |>
  tab_style(
    style = cell_fill(color = "#D5F5E3"),
    locations = cells_body()
  )
```
VIF After PCA Transformation
Components are orthogonal by construction: VIF = 1.0 for all

| Component | VIF |
|---|---|
| PC1 | 1.000 |
| PC2 | 1.000 |
| PC3 | 1.000 |
The VIF values after transformation are exactly 1.0, which reflects the mathematical property of orthogonality among principal components. However, it is important to recognise that when all components are retained, this result is not a statistical fix — it is a geometric property of the transformation. As noted by Hadi and Ling (1998), when all principal components are included in the regression, the PCR estimator is algebraically equivalent to the OLS estimator. The underlying instability has not been addressed; it has simply been re-expressed in a different coordinate system.
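The equivalence can be checked numerically. The base R sketch below, run on data re-simulated with the article's settings, shows that the full component regression and OLS on the standardised predictors give identical fitted values, that the component coefficients map back to the OLS coefficients through the loading matrix, and that the standard error of each component coefficient equals sigma / (sqrt(n - 1) × component standard deviation). Object names are illustrative.

```r
# Re-create the simulated data (same settings as the article)
set.seed(2024)
n <- 300
waist   <- rnorm(n, mean = 90, sd = 10)
bmi     <- 0.3 * waist + rnorm(n, mean = 5, sd = 1.5)
bodyfat <- 0.5 * waist + 0.4 * bmi + rnorm(n, mean = 2, sd = 1.5)
sbp     <- 100 + 0.6 * waist + 0.05 * bmi + 0.02 * bodyfat + rnorm(n, sd = 6)

Z   <- scale(cbind(waist, bmi, bodyfat))  # standardised predictors
pca <- prcomp(Z)

std <- data.frame(sbp = sbp, Z)           # columns: sbp, waist, bmi, bodyfat
pcs <- data.frame(sbp = sbp, pca$x)       # columns: sbp, PC1, PC2, PC3

fit_ols <- lm(sbp ~ waist + bmi + bodyfat, data = std)
fit_pcr <- lm(sbp ~ PC1 + PC2 + PC3, data = pcs)

# 1. Same fitted values: the two models span the same column space
same_fit <- all.equal(fitted(fit_ols), fitted(fit_pcr))

# 2. Back-transformation: beta = V %*% gamma recovers the OLS coefficients
beta_back <- drop(pca$rotation %*% coef(fit_pcr)[-1])
same_coef <- all.equal(unname(beta_back), unname(coef(fit_ols)[-1]))

# 3. The instability reappears as a large SE on the low-variance component
se_lm      <- coef(summary(fit_pcr))[-1, "Std. Error"]
se_formula <- sigma(fit_pcr) / (sqrt(n - 1) * pca$sdev)
print(cbind(se_lm, se_formula))
```

In the full-PCR table above, this is why the PC3 standard error (1.8435) is roughly nine times the PC1 standard error (0.1950) even though every VIF equals 1.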
5. The Truncated PCR: Where the “Fix” Actually Occurs
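In practice, PCR drops the low-variance components, that is, those with the smallest eigenvalues. This truncation is where the variance reduction actually occurs: the sampling variance of the coefficients on the original predictors is dominated by the small-eigenvalue directions, and discarding those directions removes that contribution at the cost of introducing bias.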
```r
# Truncated PCR: drop PC3 (smallest variance)
m_pca_trunc <- lm(sbp ~ PC1 + PC2, data = data_pca)

tidy(m_pca_trunc) |>
  gt() |>
  tab_header(
    title = "Truncated PCR: PC3 Excluded",
    subtitle = "Dropping the lowest-variance component to stabilise estimates"
  ) |>
  fmt_number(c(estimate, std.error, statistic, p.value), decimals = 4)
```
Truncated PCR: PC3 Excluded
Dropping the lowest-variance component to stabilise estimates

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 156.8239 | 0.3364 | 466.2349 | 0.0000 |
| PC1 | -3.8115 | 0.2001 | -19.0449 | 0.0000 |
| PC2 | 2.2081 | 0.9199 | 2.4002 | 0.0170 |
```r
# Compare each component's share of variance in X with its correlation with Y
tibble(
  Component = c("PC1", "PC2", "PC3"),
  `Variance in X (%)` = round(summary(pca_fit)$importance[2, ] * 100, 1),
  `Correlation with SBP (Y)` = round(c(
    cor(data_pca$PC1, data_pca$sbp),
    cor(data_pca$PC2, data_pca$sbp),
    cor(data_pca$PC3, data_pca$sbp)
  ), 3)
) |>
  gt() |>
  tab_header(
    title = "Variance in X vs. Correlation with Outcome Y",
    subtitle = "A component's variance in X does not guarantee its relevance to Y"
  ) |>
  tab_style(
    style = cell_fill(color = "#FEF9E7"),
    locations = cells_body(rows = Component == "PC3")
  )
```
Variance in X vs. Correlation with Outcome Y
A component's variance in X does not guarantee its relevance to Y

| Component | Variance in X (%) | Correlation with SBP (Y) |
|---|---|---|
| PC1 | 94.5 | -0.738 |
| PC2 | 4.5 | 0.093 |
| PC3 | 1.1 | 0.155 |
This table is central to understanding the key limitation of PCR. The component selection criterion is based entirely on variance in predictor space (X), with no reference to the outcome variable (Y). In this simulation, PC3 carries only 1.1% of the variance in X yet correlates more strongly with SBP (0.155) than PC2 does (0.093), and its coefficient in the full model is clearly non-zero (p = 0.0001); the variance-based rule nonetheless discards PC3 first. Hadi and Ling (1998) demonstrated, both theoretically and through empirical examples with well-known datasets, that it is entirely possible for the component with the smallest eigenvalue to be the one most strongly associated with Y, while the dominant components may contribute little or nothing to predicting the outcome. Dropping low-variance components is therefore a biased estimation strategy whose consequences cannot be assessed from the predictor structure alone.
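One way to quantify what a dropped component costs: because the component scores are mutually uncorrelated, the R² of the full component regression is exactly the sum of the squared correlations between each component and the outcome. The base R sketch below illustrates this on data re-simulated with the article's settings; `r_y` and `r2_full` are illustrative names.

```r
# Re-create the simulated data (same settings as the article)
set.seed(2024)
n <- 300
waist   <- rnorm(n, mean = 90, sd = 10)
bmi     <- 0.3 * waist + rnorm(n, mean = 5, sd = 1.5)
bodyfat <- 0.5 * waist + 0.4 * bmi + rnorm(n, mean = 2, sd = 1.5)
sbp     <- 100 + 0.6 * waist + 0.05 * bmi + 0.02 * bodyfat + rnorm(n, sd = 6)

pca    <- prcomp(cbind(waist, bmi, bodyfat), scale. = TRUE)
scores <- pca$x

# Correlation of each component with the outcome
r_y <- drop(cor(scores, sbp))

# R-squared of the regression on all three components
r2_full <- summary(lm(sbp ~ scores))$r.squared

# Share of X variance vs. share of explained Y variance, per component
print(data.frame(
  share_of_x_variance = round(pca$sdev^2 / sum(pca$sdev^2), 3),
  cor_with_y          = round(r_y, 3),
  r2_contribution     = round(r_y^2, 4)
))

# Orthogonality makes the contributions additive
print(all.equal(sum(r_y^2), r2_full))
```

Dropping a component therefore lowers R² by exactly that component's squared correlation with Y, a quantity the eigenvalues say nothing about.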
6. The Interpretability Consideration
```r
# Generate the interpretability discussion with the fitted values inlined
loadings <- pca_fit$rotation[, 1]

cat(glue::glue("
Beyond the estimation considerations above, PCR also raises an important interpretability
question that is particularly relevant in clinical and epidemiological research.

The PC1 coefficient from the full model is **{round(coef(m_pca_full)['PC1'], 3)}**, with
loadings of **{round(loadings['waist'], 3)}** on waist, **{round(loadings['bmi'], 3)}**
on BMI, and **{round(loadings['bodyfat'], 3)}** on body fat.

In practical terms, this means the model estimates that a one-unit increase in the composite
*({round(loadings['waist'],2)} × Waist) + ({round(loadings['bmi'],2)} × BMI) + ({round(loadings['bodyfat'],2)} × Body Fat)*
of the standardised predictors is associated with a
**{round(coef(m_pca_full)['PC1'], 2)} mmHg** change in SBP.

While mathematically valid, this formulation does not permit a clinician or policymaker to
act on the individual predictor effects. The question (for every 1 cm increase in waist
circumference, holding BMI and body fat constant, how does SBP change?) cannot be answered
directly from the PCR output without back-transformation, and even then, the back-transformed
coefficients carry the constraints imposed by the X correlation structure rather than the
Y-relevant signal.
"))
```
Beyond the estimation considerations above, PCR also raises an important interpretability question that is particularly relevant in clinical and epidemiological research.
The PC1 coefficient from the full model is -3.812, with loadings of -0.582 on waist, -0.566 on BMI, and -0.583 on body fat.
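In practical terms, the model estimates that a one-unit increase in the composite (-0.58 × Waist) + (-0.57 × BMI) + (-0.58 × Body Fat), with each predictor first centred and scaled to unit variance, is associated with a 3.81 mmHg decrease in SBP. Because all three loadings are negative and the sign of a principal component is arbitrary, this is the same as saying that higher overall adiposity goes with higher SBP.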
While mathematically valid, this formulation does not permit a clinician or policymaker to act on the individual predictor effects. The question — for every 1 cm increase in waist circumference, holding BMI and body fat constant, how does SBP change? — cannot be answered directly from the PCR output without back-transformation, and even then, the back-transformed coefficients carry the constraints imposed by the X correlation structure rather than the Y-relevant signal.
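The back-transformation and the constraint it carries can be sketched in base R. Assuming the two-component truncated model and data re-simulated with the article's settings, the block maps the component coefficients back to per-unit coefficients on the original predictors and checks that the resulting coefficient vector is orthogonal to the discarded PC3 direction. Object names are illustrative.

```r
# Re-create the simulated data (same settings as the article)
set.seed(2024)
n <- 300
waist   <- rnorm(n, mean = 90, sd = 10)
bmi     <- 0.3 * waist + rnorm(n, mean = 5, sd = 1.5)
bodyfat <- 0.5 * waist + 0.4 * bmi + rnorm(n, mean = 2, sd = 1.5)
sbp     <- 100 + 0.6 * waist + 0.05 * bmi + 0.02 * bodyfat + rnorm(n, sd = 6)
dat <- data.frame(waist, bmi, bodyfat, sbp)
preds <- c("waist", "bmi", "bodyfat")

pca <- prcomp(dat[, preds], scale. = TRUE)
pcs <- data.frame(sbp = dat$sbp, pca$x)
V   <- pca$rotation

# Truncated PCR: keep PC1 and PC2 only
fit_trunc <- lm(sbp ~ PC1 + PC2, data = pcs)

# Coefficients per standard deviation of each predictor: beta = V_k %*% gamma_k
beta_std <- drop(V[, 1:2] %*% coef(fit_trunc)[c("PC1", "PC2")])

# Coefficients per original unit (e.g. mmHg per cm of waist)
beta_raw <- beta_std / pca$scale

# The truncation forces beta_std to be orthogonal to the dropped direction
constraint <- sum(V[, 3] * beta_std)

# Unconstrained OLS for comparison
fit_ols <- lm(sbp ~ waist + bmi + bodyfat, data = dat)
print(round(rbind(truncated_pcr = beta_raw, ols = coef(fit_ols)[preds]), 3))
print(constraint)
```

The zero in `constraint` is imposed by the eigenvectors of X alone; nothing in the outcome was consulted to justify it.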
7. Visual Summary: Biplot
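```r
library(factoextra)  # fviz_pca_biplot()

# Biplot of observations and variable directions in the PC1-PC2 plane
fviz_pca_biplot(
  pca_fit,
  geom.ind = "point",
  col.ind = "grey70",
  col.var = "#2E86AB",
  repel = TRUE,
  title = "PCA Biplot: Waist, BMI, and Body Fat in Principal Component Space"
)
```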
Figure 2: PCA biplot showing the direction of each original variable in component space. The near-parallel arrows indicate that all three predictors share a common latent direction — the variance that PCR reorganises into PC1.
The near-parallel orientation of the three variable arrows confirms that waist circumference, BMI, and body fat percentage share a dominant common direction in the predictor space. PCA reorganises this shared structure into PC1, but the underlying correlation among the original variables is unchanged.
8. Summary of Considerations
```r
# Summary of what PCR does and does not achieve under each analytical goal
tibble(
  Scenario = c(
    "Full PCR (all components retained)",
    "Truncated PCR (low-variance components dropped)",
    "PCR for prediction (dimension reduction goal)",
    "PCR for causal / clinical inference"
  ),
  `What PCR Achieves` = c(
    "Orthogonal components; VIF = 1.0",
    "Reduced coefficient variance; more stable estimates",
    "Efficient compression of correlated predictors",
    "Estimated associations in component space"
  ),
  `Key Limitation` = c(
    "Algebraically equivalent to OLS; instability is unchanged",
    "Component selection is Y-blind; dropped components may predict Y well",
    "Interpretability of individual predictors is lost",
    "Back-transformed coefficients are constrained by X structure, not Y signal"
  )
) |>
  gt() |>
  tab_header(
    title = "PCR Under Different Analytical Goals",
    subtitle = "The appropriateness of PCR depends heavily on the inferential objective"
  )
```
PCR Under Different Analytical Goals
The appropriateness of PCR depends heavily on the inferential objective

| Scenario | What PCR Achieves | Key Limitation |
|---|---|---|
| Full PCR (all components retained) | Orthogonal components; VIF = 1.0 | Algebraically equivalent to OLS; instability is unchanged |
| Truncated PCR (low-variance components dropped) | Reduced coefficient variance; more stable estimates | Component selection is Y-blind; dropped components may predict Y well |
| PCR for prediction (dimension reduction goal) | Efficient compression of correlated predictors | Interpretability of individual predictors is lost |
| PCR for causal / clinical inference | Estimated associations in component space | Back-transformed coefficients are constrained by X structure, not Y signal |
When PCR May Be Appropriate
PCR can be a reasonable approach under specific conditions:
- Prediction-focused analyses where the goal is forecasting rather than coefficient interpretation (e.g., genomics, spectroscopy, high-dimensional biomarker panels)
- Exploratory analyses where identifying latent constructs (e.g., a “general adiposity” dimension) is itself informative
- Downstream machine learning pipelines where orthogonal inputs improve computational stability
When Alternative Approaches Deserve Consideration
When the goal is to estimate the effect of individual named predictors — as is typical in clinical epidemiology and health policy research — the following alternatives preserve original variable identity while addressing the instability problem:
- Ridge regression: shrinks all coefficients smoothly toward zero, with more shrinkage along principal directions that have smaller eigenvalues. Unlike truncated PCR, ridge regression makes no hard keep-or-drop decision about components; the shrinkage is graded, and the penalty strength is typically tuned against the outcome, for example by cross-validation
- LASSO / Elastic Net: performs variable selection while retaining named-variable coefficients, allowing the outcome to guide which predictors remain in the model
- Partial Least Squares (PLS): constructs components that maximise covariance with Y rather than variance in X alone, directly addressing the Y-blindness of standard PCR
- Domain-informed variable selection: selecting a single clinically meaningful proxy based on subject-matter knowledge, avoiding the statistical complexity while preserving interpretability
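The contrast between ridge regression and truncated PCR can be seen through the singular value decomposition of the standardised predictors: ridge multiplies the contribution of each principal direction by d² / (d² + λ), whereas truncated PCR uses factors of exactly 1 or 0. The base R sketch below uses data re-simulated with the article's settings; the penalty `lambda = 10` is an arbitrary illustrative value, not a tuned one.

```r
# Re-create the simulated data (same settings as the article)
set.seed(2024)
n <- 300
waist   <- rnorm(n, mean = 90, sd = 10)
bmi     <- 0.3 * waist + rnorm(n, mean = 5, sd = 1.5)
bodyfat <- 0.5 * waist + 0.4 * bmi + rnorm(n, mean = 2, sd = 1.5)
sbp     <- 100 + 0.6 * waist + 0.05 * bmi + 0.02 * bodyfat + rnorm(n, sd = 6)

Z <- scale(cbind(waist, bmi, bodyfat))  # standardised predictors
y <- sbp - mean(sbp)                    # centred outcome
lambda <- 10                            # hypothetical penalty, for illustration only

sv <- svd(Z)

# Shrinkage factor applied along each principal direction
shrink_ridge <- sv$d^2 / (sv$d^2 + lambda)
shrink_pcr   <- c(1, 1, 0)              # truncated PCR keeping two components
print(round(rbind(ridge = shrink_ridge, truncated_pcr = shrink_pcr), 3))

# Ridge coefficients: closed form vs. the SVD expression
beta_closed <- drop(solve(crossprod(Z) + lambda * diag(3), crossprod(Z, y)))
beta_svd    <- drop(sv$v %*% ((sv$d / (sv$d^2 + lambda)) * crossprod(sv$u, y)))
print(round(cbind(beta_closed, beta_svd), 4))
```

In practice `lambda` would be chosen by cross-validation or a similar criterion, which is the point at which the outcome informs the amount of shrinkage.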
9. What the Founders Actually Intended
A question worth asking is whether the limitations discussed above are a failure of PCA itself, or a failure of how it has been applied. Examining the original papers by Pearson (1901) and Hotelling (1933) provides an important perspective.
Pearson (1901): A Geometric, Not a Regression, Problem
Karl Pearson’s 1901 paper presents the underlying problem of PCA from a purely geometric standpoint, describing how to find low-dimensional subspaces that best fit — in the least squares sense — a cloud of points in space. Notably, Pearson never used the term “principal components analysis.”
In modern terms, Pearson’s question was: given a cloud of points in p-dimensional space, what line or plane best represents the data? This is a problem of geometric approximation: finding the lowest-dimensional subspace that preserves the structure of the data. Pearson’s aim was to escape the asymmetry between dependent and independent variables in linear regression (the regression line changes if the roles of the two variables are reversed) by giving equal status to all variables, which led him to the line of best fit to a system of points in the plane.
Critically, there is no outcome variable Y in Pearson’s formulation. PCA as originally conceived was an entirely unsupervised geometric operation — there was no dependent variable to predict, and therefore no question of whether the retained components were relevant to a regression target. Applying Pearson’s method to a regression problem with a specific outcome variable Y is an extension of its original scope, not something Pearson himself designed or validated for this purpose.
Hotelling (1933): Variance Decomposition for Psychometrics
Pearson (1901) and Hotelling (1933) took different routes to the same method. Pearson was concerned with lines and planes of closest fit to a set of points in p-dimensional space, whereas the standard algebraic derivation used today is close to the one Hotelling introduced.
Hotelling formalised the algebraic framework and introduced the term “principal components,” publishing in the Journal of Educational Psychology — a psychometrics context. His goal was to understand the latent structure of correlated psychological test variables, not to stabilise regression coefficients under multicollinearity. The aim of the method is to reduce the dimensionality of multivariate data whilst preserving as much of the relevant information as possible. It is a form of unsupervised learning in that it relies entirely on the input data itself without reference to the corresponding target data — the criterion to be maximised is the variance.
This is the core design decision that underpins the limitation Hadi and Ling (1998) identified: the criterion being maximised is variance in X, not covariance with Y. Neither founder's formulation involves an outcome variable at all, and neither Pearson nor Hotelling proposed principal components as a solution to multicollinearity in regression. That application emerged later, and the Y-blindness of the method, which is by design in the original formulations, becomes a liability precisely in the regression context.
The Implication
The limitations of PCR identified in this analysis are therefore not a flaw in PCA as Pearson and Hotelling conceived it. They are a consequence of applying an unsupervised geometric tool to a supervised prediction problem. PCA is a linear transformation that transforms the data to a new coordinate system such that the new set of variables — the principal components — are linear functions of the original variables, are uncorrelated, and the greatest variance by any projection of the data comes to lie on the first coordinate. This design is entirely appropriate for dimension reduction, visualisation, and exploratory analysis. It is the mismatch with the inferential goals of regression — where relevance to Y, not variance in X, should govern component selection — that Hadi and Ling (1998) documented as problematic.
Methods such as Partial Least Squares (PLS), which explicitly maximise covariance with Y rather than variance in X, are better aligned with the supervised learning context that PCR is often asked to serve.
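The difference in criterion can be illustrated with the first PLS direction, which for a single outcome is proportional to XᵀY. The base R sketch below, on data re-simulated with the article's settings, compares the covariance with Y achieved by that direction against the covariance achieved by each principal component direction. It covers only the first PLS component and is not a full PLS fit; object names are illustrative.

```r
# Re-create the simulated data (same settings as the article)
set.seed(2024)
n <- 300
waist   <- rnorm(n, mean = 90, sd = 10)
bmi     <- 0.3 * waist + rnorm(n, mean = 5, sd = 1.5)
bodyfat <- 0.5 * waist + 0.4 * bmi + rnorm(n, mean = 2, sd = 1.5)
sbp     <- 100 + 0.6 * waist + 0.05 * bmi + 0.02 * bodyfat + rnorm(n, sd = 6)

Z <- scale(cbind(waist, bmi, bodyfat))  # standardised predictors
y <- sbp - mean(sbp)                    # centred outcome

# First PLS weight vector: unit-length direction proportional to Z'y
w_pls <- drop(crossprod(Z, y))
w_pls <- w_pls / sqrt(sum(w_pls^2))

# Principal component directions (chosen without looking at y)
V <- prcomp(Z)$rotation

# Covariance with y along each direction
cov_pls <- cov(drop(Z %*% w_pls), y)
cov_pcs <- abs(drop(cov(Z %*% V, y)))

print(round(c(PLS1 = cov_pls, cov_pcs), 3))
```

By the Cauchy-Schwarz inequality no unit-length direction can exceed the covariance attained by `w_pls`, so each principal component direction does at most as well on this criterion.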
References
Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(11), 559–572.
Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24, 417–441.
Hadi, A.S. and Ling, R.F. (1998). Some cautionary notes on the use of principal components regression. The American Statistician, 52(1), 15–19. https://doi.org/10.2307/2685559
Artigue, H. and Smith, G. (2019). The principal problem with principal components regression. Cogent Mathematics & Statistics, 6(1), 1622190. https://doi.org/10.1080/25742558.2019.1622190
Jolliffe, I.T. (1982). A note on the use of principal components in regression. Journal of the Royal Statistical Society: Series C, 31(3), 300–303.
Jolliffe, I.T. (2002). Principal Component Analysis (2nd ed.). Springer.
James, G., Witten, D., Hastie, T. and Tibshirani, R. (2021). An Introduction to Statistical Learning (2nd ed.), Chapter 6. Springer. https://www.statlearning.com
Source Code
---title: "Principal Component Analyis and Multicollinearity"subtitle: "A Critical Examination of Assumptions, Limitations, and Practical Considerations"author: "Timothy Achala"date: todayformat: html: toc: true toc-depth: 2 theme: flatly code-fold: true code-tools: true fig-width: 8 fig-height: 6 embed-resources: trueexecute: warning: false message: false---## BackgroundPrincipal Component Regression (PCR) is widely recommended in statistical textbooks as a method for addressing multicollinearity in regression models. The core idea is appealing: transform a set of correlated predictors into orthogonal principal components, then regress the outcome on these uncorrelated components. In doing so, the variance inflation associated with collinear predictors is resolved — at least in the component space.However, Hadi and Ling (1998) raised important cautionary notes about this practice, demonstrating both theoretically and empirically that PCR carries serious limitations that are often understated or absent from standard textbook treatments. This analysis explores those limitations through a simulated clinical example, and discusses the conditions under which PCR may or may not be appropriate.---## 1. Simulated Data: Collinear Adiposity MeasuresTo illustrate the behaviour of PCR under multicollinearity, we simulate a clinical dataset where three adiposity biomarkers — waist circumference, BMI, and body fat percentage — are used to predict systolic blood pressure (SBP). 
These three predictors are intentionally highly correlated, as they all capture overlapping aspects of body adiposity.```{r}#| label: setuplibrary(tidyverse)library(gt)library(broom)library(corrr)library(GGally)library(car)library(factoextra)set.seed(2024)n <-300waist <-rnorm(n, mean =90, sd =10)bmi <-0.3* waist +rnorm(n, mean =5, sd =1.5)bodyfat <-0.5* waist +0.4* bmi +rnorm(n, mean =2, sd =1.5)# True SBP: driven primarily by waist circumferencesbp <-100+0.6* waist +0.05* bmi +0.02* bodyfat +rnorm(n, sd =6)data <-tibble(waist, bmi, bodyfat, sbp)head(data, 6) |>gt() |>tab_header(title ="Simulated Clinical Data",subtitle ="Adiposity measures predicting systolic blood pressure (n = 300)" ) |>fmt_number(everything(), decimals =2)```---## 2. Assessing Multicollinearity in the Original Predictors```{r}#| label: fig-corr#| fig-cap: "Pairwise scatterplots and correlation coefficients confirm severe multicollinearity among the three adiposity predictors."GGally::ggpairs( data |>select(waist, bmi, bodyfat),title ="Pairwise Relationships Among Adiposity Predictors") +theme_minimal(base_size =11)``````{r}#| label: vif-olsm_raw <-lm(sbp ~ waist + bmi + bodyfat, data = data)vif(m_raw) |>enframe(name ="Predictor", value ="VIF") |>gt() |>tab_header(title ="Variance Inflation Factors: OLS Model",subtitle ="VIF values indicate severe multicollinearity among predictors" ) |>fmt_number(VIF, decimals =1) |>tab_style(style =cell_fill(color ="#FADBD8"),locations =cells_body(rows = VIF >10) )```The high VIF values confirm that the standard OLS estimates are unstable. Individual coefficient estimates are unreliable due to the near-linear dependency among the predictors. This is the scenario in which PCR is typically recommended.---## 3. Applying Principal Component Analysis```{r}#| label: pca-fitpca_fit <-prcomp(data |>select(waist, bmi, bodyfat), scale. 
=TRUE)summary(pca_fit)$importance |>as.data.frame() |>rownames_to_column("Metric") |>gt() |>tab_header(title ="Variance Explained by Principal Components",subtitle ="PC1 captures the dominant shared variance among predictors" ) |>fmt_number(-Metric, decimals =3)``````{r}#| label: pca-loadingspca_fit$rotation |>as.data.frame() |>rownames_to_column("Original Variable") |>gt() |>tab_header(title ="Principal Component Loadings",subtitle ="Each loading describes how original variables contribute to a given component" ) |>fmt_number(-`Original Variable`, decimals =3)```PC1 accounts for the majority of variance in the predictor space and loads approximately equally on all three adiposity measures — reflecting the shared latent dimension of general body adiposity. PC2 and PC3 capture the remaining, largely residual variation.---## 4. Regression on Principal Components```{r}#| label: pc-regressionpc_scores <-as_tibble(pca_fit$x)data_pca <-bind_cols(data |>select(sbp), pc_scores)# Full PCR: all components retainedm_pca_full <-lm(sbp ~ PC1 + PC2 + PC3, data = data_pca)tidy(m_pca_full) |>gt() |>tab_header(title ="Full PCR: Regression on All Three Principal Components",subtitle ="All components retained — equivalent to OLS" ) |>fmt_number(c(estimate, std.error, statistic, p.value), decimals =4)``````{r}#| label: vif-pcavif(m_pca_full) |>enframe(name ="Component", value ="VIF") |>gt() |>tab_header(title ="VIF After PCA Transformation",subtitle ="Components are orthogonal by construction — VIF = 1.0 for all" ) |>fmt_number(VIF, decimals =3) |>tab_style(style =cell_fill(color ="#D5F5E3"),locations =cells_body() )```The VIF values after transformation are exactly 1.0, which reflects the mathematical property of orthogonality among principal components. However, it is important to recognise that when all components are retained, this result is not a statistical fix — it is a geometric property of the transformation. 
As noted by Hadi and Ling (1998), when all principal components are included in the regression, the PCR estimator is algebraically equivalent to the OLS estimator. The underlying instability has not been addressed; it has simply been re-expressed in a different coordinate system.---## 5. The Truncated PCR: Where the "Fix" Actually OccursThe practical application of PCR involves dropping the low-variance components — typically those with small eigenvalues. This step is where the actual variance reduction occurs.```{r}#| label: truncated-pcr# Truncated PCR: drop PC3 (smallest variance)m_pca_trunc <-lm(sbp ~ PC1 + PC2, data = data_pca)tidy(m_pca_trunc) |>gt() |>tab_header(title ="Truncated PCR: PC3 Excluded",subtitle ="Dropping the lowest-variance component to stabilise estimates" ) |>fmt_number(c(estimate, std.error, statistic, p.value), decimals =4)# Compare: how much does PC3 actually explain in Y?pc3_y_cor <-cor(data_pca$PC3, data_pca$sbp)tibble(Component =c("PC1", "PC2", "PC3"),`Variance in X (%)`=round(summary(pca_fit)$importance[2,] *100, 1),`Correlation with SBP (Y)`=round(c(cor(data_pca$PC1, data_pca$sbp),cor(data_pca$PC2, data_pca$sbp),cor(data_pca$PC3, data_pca$sbp) ), 3)) |>gt() |>tab_header(title ="Variance in X vs. Correlation with Outcome Y",subtitle ="A component's variance in X does not guarantee its relevance to Y" ) |>tab_style(style =cell_fill(color ="#FEF9E7"),locations =cells_body(rows = Component =="PC3") )```This table is central to understanding the key limitation of PCR. The component selection criterion is based entirely on variance in predictor space (X), with no reference to the outcome variable (Y). Hadi and Ling (1998) demonstrated — both theoretically and through empirical examples with well-known datasets — that it is entirely possible for the component with the smallest eigenvalue to be the one most strongly associated with Y, while the dominant components may contribute little or nothing to predicting the outcome. 
Dropping low-variance components is therefore a biased estimation strategy whose consequences cannot be assessed from the predictor structure alone.---## 6. The Interpretability Consideration```{r}#| results: asisloadings <- pca_fit$rotation[, 1]cat(glue::glue("Beyond the estimation considerations above, PCR also raises an important interpretabilityquestion that is particularly relevant in clinical and epidemiological research.The PC1 coefficient from the full model is **{round(coef(m_pca_full)['PC1'], 3)}**, withloadings of **{round(loadings['waist'], 3)}** on waist, **{round(loadings['bmi'], 3)}**on BMI, and **{round(loadings['bodyfat'], 3)}** on body fat.In practical terms, this means the model estimates that a one-unit increase in the composite*({round(loadings['waist'],2)} × Waist) + ({round(loadings['bmi'],2)} × BMI) + ({round(loadings['bodyfat'],2)} × Body Fat)*is associated with a **{round(coef(m_pca_full)['PC1'], 2)} mmHg** increase in SBP.While mathematically valid, this formulation does not permit a clinician or policymaker toact on the individual predictor effects. The question — for every 1 cm increase in waistcircumference, holding BMI and body fat constant, how does SBP change? — cannot be answereddirectly from the PCR output without back-transformation, and even then, the back-transformedcoefficients carry the constraints imposed by the X correlation structure rather than theY-relevant signal."))```---## 7. Visual Summary: Biplot```{r}#| label: fig-pca-biplot#| fig-cap: "PCA biplot showing the direction of each original variable in component space. 
The near-parallel arrows indicate that all three predictors share a common latent direction — the variance that PCR reorganises into PC1."fviz_pca_biplot( pca_fit,geom.ind ="point",col.ind ="grey70",col.var ="#2E86AB",repel =TRUE,title ="PCA Biplot: Waist, BMI, and Body Fat in Principal Component Space")```The near-parallel orientation of the three variable arrows confirms that waist circumference, BMI, and body fat percentage share a dominant common direction in the predictor space. PCA reorganises this shared structure into PC1, but the underlying correlation among the original variables is unchanged.---## 8. Summary of Considerations```{r}#| label: summary-tabletibble(Scenario =c("Full PCR (all components retained)","Truncated PCR (low-variance components dropped)","PCR for prediction (dimension reduction goal)","PCR for causal / clinical inference" ),`What PCR Achieves`=c("Orthogonal components; VIF = 1.0","Reduced coefficient variance; more stable estimates","Efficient compression of correlated predictors","Estimated associations in component space" ),`Key Limitation`=c("Algebraically equivalent to OLS — instability is unchanged","Component selection is Y-blind; dropped components may predict Y well","Interpretability of individual predictors is lost","Back-transformed coefficients are constrained by X structure, not Y signal" )) |>gt() |>tab_header(title ="PCR Under Different Analytical Goals",subtitle ="The appropriateness of PCR depends heavily on the inferential objective" )```### When PCR May Be AppropriatePCR can be a reasonable approach under specific conditions:- **Prediction-focused analyses** where the goal is forecasting rather than coefficient interpretation (e.g., genomics, spectroscopy, high-dimensional biomarker panels)- **Exploratory analyses** where identifying latent constructs (e.g., a "general adiposity" dimension) is itself informative- **Downstream machine learning pipelines** where orthogonal inputs improve computational stability### 
When Alternative Approaches Deserve ConsiderationWhen the goal is to estimate the effect of individual named predictors — as is typical in clinical epidemiology and health policy research — the following alternatives preserve original variable identity while addressing the instability problem:- **Ridge regression** — continuously shrinks all coefficients toward zero, with greater shrinkage applied to components with smaller eigenvalues. Unlike truncated PCR, Ridge regression does not make a hard binary decision about which components to include; the shrinkage is proportional and keeps the outcome variable in the penalisation process- **LASSO / Elastic Net** — performs variable selection while retaining named-variable coefficients, allowing the outcome to guide which predictors remain in the model- **Partial Least Squares (PLS)** — constructs components that maximise covariance with Y rather than variance in X alone, directly addressing the Y-blindness of standard PCR- **Domain-informed variable selection** — selecting a single clinically meaningful proxy based on subject-matter knowledge, avoiding the statistical complexity while preserving interpretability---## 9. What the Founders Actually IntendedA question worth asking is whether the limitations discussed above are a failure of PCA itself, or a failure of how it has been applied. Examining the original papers by Pearson (1901) and Hotelling (1933) provides an important perspective.### Pearson (1901): A Geometric, Not a Regression, ProblemKarl Pearson's 1901 paper presents the underlying problem of PCA from a purely geometric standpoint, describing how to find low-dimensional subspaces that best fit — in the least squares sense — a cloud of points in space. 
Notably, Pearson never used the term "principal components analysis."Pearson's original question was: *"Given a cloud of points in p-dimensional space, what line or plane best represents the data?"* This is a problem of geometric approximation — finding the subspace of lowest dimensionality that preserves the structure of the data. Pearson's aim was to escape from the non-symmetry of dependent and independent variables in linear regression — the regression line changes if the roles of dependent and independent variables are reversed — by giving equal status to all variables. He was led to find the best fit of a system of points in the plane by a line.Critically, **there is no outcome variable Y in Pearson's formulation**. PCA as originally conceived was an entirely unsupervised geometric operation — there was no dependent variable to predict, and therefore no question of whether the retained components were relevant to a regression target. Applying Pearson's method to a regression problem with a specific outcome variable Y is an extension of its original scope, not something Pearson himself designed or validated for this purpose.### Hotelling (1933): Variance Decomposition for PsychometricsPearson (1901) and Hotelling (1933) adopted different approaches. The standard algebraic derivation is close to that introduced by Hotelling (1933). Pearson (1901), on the other hand, was concerned with finding lines and planes that best fit a set of points in p-dimensional space.Hotelling formalised the algebraic framework and introduced the term "principal components," publishing in the *Journal of Educational Psychology* — a psychometrics context. His goal was to understand the latent structure of correlated psychological test variables, not to stabilise regression coefficients under multicollinearity. The aim of the method is to reduce the dimensionality of multivariate data whilst preserving as much of the relevant information as possible. 
It is a form of unsupervised learning in that it relies entirely on the input data itself without reference to the corresponding target data: the criterion to be maximised is the variance.

This is the core design decision that underpins the limitation Hadi and Ling (1998) identified: **the criterion being maximised is variance in X, not covariance with Y**. Both founders were explicit about this. Neither Pearson nor Hotelling proposed principal components as a solution to multicollinearity in regression. That application emerged later, and the Y-blindness of the method, which is by design in the founders' original formulation, becomes a liability precisely in the regression context.

### The Implication

The limitations of PCR identified in this analysis are therefore not a flaw in PCA as Pearson and Hotelling conceived it. They are a consequence of applying an unsupervised geometric tool to a supervised prediction problem. PCA is a linear transformation that transforms the data to a new coordinate system such that the new set of variables (the principal components) are linear functions of the original variables, are uncorrelated, and the greatest variance by any projection of the data comes to lie on the first coordinate. This design is entirely appropriate for dimension reduction, visualisation, and exploratory analysis. It is the mismatch with the inferential goals of regression, where relevance to Y, not variance in X, should govern component selection, that Hadi and Ling (1998) documented as problematic.

Methods such as Partial Least Squares (PLS), which explicitly maximise covariance with Y rather than variance in X, are better aligned with the supervised learning context that PCR is often asked to serve.

---

## References

Pearson, K. (1901). *On lines and planes of closest fit to systems of points in space.* **Philosophical Magazine**, 2(11), 559–572.

Hotelling, H. (1933).
*Analysis of a complex of statistical variables into principal components.* **Journal of Educational Psychology**, 24, 417–441.

Hadi, A.S. and Ling, R.F. (1998). *Some cautionary notes on the use of principal components regression.* **The American Statistician**, 52(1), 15–19. [https://doi.org/10.2307/2685559](https://doi.org/10.2307/2685559)

Artigue, H. and Smith, G. (2019). *The principal problem with principal components regression.* **Cogent Mathematics & Statistics**, 6(1), 1622190. [https://doi.org/10.1080/25742558.2019.1622190](https://doi.org/10.1080/25742558.2019.1622190)

Jolliffe, I.T. (1982). *A note on the use of principal components in regression.* **Journal of the Royal Statistical Society: Series C**, 31(3), 300–303.

Jolliffe, I.T. (2002). *Principal Component Analysis* (2nd ed.). Springer.

James, G., Witten, D., Hastie, T. and Tibshirani, R. (2021). *An Introduction to Statistical Learning* (2nd ed.), Chapter 6. Springer. [https://www.statlearning.com](https://www.statlearning.com)