Read this question from the datascience subreddit.

In this assignment, you craft a response to the question. You should help the poster understand:

1. When is PCA helpful?

2. What does PCA achieve in a regression setting? (model complexity vs. data input complexity, comparison to LASSO and Ridge)

3. Example of how it might be accomplished + code

Introduction

The purpose of this article is to provide a simplified explanation of Principal Component Analysis (PCA). Even though our goal is to answer the three questions above, we will dive a little deeper into the functionality and purpose of PCA in order to provide a more general understanding of this concept.

What is PCA?

Principal component analysis (PCA) is a statistical method used to reduce the dimensionality of a large data set by transforming it into a smaller one that still captures most of the information. The transformation is given by linear combinations of the original variables that capture most of the variance in the data.

As we might expect, reducing the number of variables in a data set can reduce the accuracy of our model, so the goal is to minimize the loss of accuracy while significantly simplifying the model.
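As a quick preview of what this looks like in practice, here is a minimal sketch using R's built-in iris data:

# PCA of the four numeric iris measurements; scale. = TRUE standardizes each variable first
pca <- prcomp(iris[, 1:4], scale. = TRUE)
summary(pca)   # 'Proportion of Variance': the first two components capture most of it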

How does PCA work?

Next, I will walk through a general example of how we might calculate the principal components for a given data set using Singular Value Decomposition (SVD).

Steps for finding PC1 in a 2-dimensional plot:

  1. Create a scatter plot of your dataset and center the data on the origin.

  2. To find the line that best fits the data, start with a random line that goes through the origin.

  3. Project the data onto the line and find the distance of each projected point from the origin.

  4. Rotate the line until the sum of squared distances is maximized; this best-fitting line is PC1.

If the slope of PC1 is less than 1, then:

  • The points are spread out more along the x-axis than the y-axis (more change in x than in y), so the x-axis is more important for describing how the data are spread out.

If the slope of PC1 is greater than 1, then:

  • The points are spread out more along the y-axis than the x-axis (more change in y than in x), so the y-axis is more important for describing how the data are spread out.

Notice that:

Eigenvalue for PC1 = sum of squared distances (SS) = \(d_1^2 + d_2^2 + \dots + d_n^2\)

Singular value for PC1 = square root of the eigenvalue for PC1.

Singular vector (eigenvector) for PC1:

  • Find the direction of PC1 by plotting the change in x and the change in y along the best-fit line.
  • Calculate this vector's length using the Pythagorean theorem: \(c^2 = a^2 + b^2\).
  • Divide both components by this length so that the vector has length 1 (a unit vector).
  • Loading scores: the x and y components of this unit eigenvector for PC1.

PC2: The line perpendicular to PC1.

PCA plot:

  • Rotate everything so that PC1 is the x-axis and PC2 is the y-axis
  • Use the projected points to find where the samples go in the PCA plot

Variation of PC1 and PC2:

  • Variation of PC1: \(\frac{SS(PC1)}{n - 1}\)

  • Variation of PC2: \(\frac{SS(PC2)}{n - 1}\)

where \(n\) is the number of data points.

Total variation of PCs: variation of PC1 + variation of PC2.
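To tie these pieces together, here is a small self-contained sketch (with simulated 2-dimensional data) that computes the quantities above via SVD:

# Simulated 2-D data (a made-up example, not from the sections below)
set.seed(42)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100, sd = 0.3)
X <- scale(cbind(x, y), center = TRUE, scale = FALSE)   # center the cloud on the origin

sv  <- svd(X)                  # singular value decomposition of the centered data
pc1 <- sv$v[, 1]               # unit eigenvector for PC1 (its entries are the loading scores)
ss1 <- sv$d[1]^2               # eigenvalue for PC1 = sum of squared projected distances
ss2 <- sv$d[2]^2               # eigenvalue for PC2

n <- nrow(X)
ss1 / (n - 1)                  # variation of PC1
ss2 / (n - 1)                  # variation of PC2
ss1 / (ss1 + ss2)              # share of the total variation explained by PC1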

When is PCA helpful?

Principal Component Analysis (PCA) is just one of many methods we can use to impose constraints on a model to promote stability and stay within our parameter budget. Some constraints are ‘soft’ and merely push the parameters in a particular direction (e.g., Bayesian priors), whereas others, like PCA, impose ‘hard’ constraints by restricting the model in a very specific way.

We will talk about some of the other approaches below, but PCA imposes a ‘hard’ constraint by clustering predictors together, thereby reducing dimensionality as mentioned above. With this in mind, PCA is helpful in three primary situations.

  1. The number of parameters exceeds the number of observations. If this is the case, then there are infinitely many solutions that minimize the sum of squared errors. By clustering predictors together using principal components, we can find a single solution that fits within our parameter budget without excluding any predictors from the model.

  2. The model includes too many predictors and becomes unstable. In this case, the model doesn't have more predictors than observations, but it still exceeds its parameter budget. By using PCA, you can stay within the commonly recommended 10-20 observations per parameter, which helps promote model stability.

  3. PCA is also helpful when you have a highly correlated cluster of variables. Rather than include each of the variables separately, you can combine them into a principal component. In turn, you reduce the number of parameters in your model, staying within your parameter budget and promoting model stability without directly omitting any of the highly correlated variables (see the short sketch after this list).
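As a quick illustration of the third situation, here is a minimal sketch using the built-in mtcars data as a stand-in for your own predictors:

# Check for a highly correlated cluster of predictors
round(cor(mtcars[, c("disp", "hp", "wt", "drat")]), 2)

# disp, hp, and wt are strongly correlated with one another, so rather than spending
# three parameters on them we could summarize the cluster with one or two components
prcomp(mtcars[, c("disp", "hp", "wt")], scale. = TRUE)$sdev^2   # eigenvalues; the first dominates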

What does PCA achieve in a regression setting?

If we have more parameters than observations, we would have an infinite number of models that minimize the sum of squares. In this case, there is one major problem we might encounter: overfitting.

  • Overfitting: The model performs well on the training set of data but not so well on the test set of data.

  • Underfitting: The model neither performs well on the training set of data nor on the test set of data.

PCA is one way to handle this issue. Rather than treating every variable separately, we can cluster them: if two or more variables are closely related, we might exclude all but one of them, or combine them into a single principal component, which reduces the complexity of the model. This is one way to handle having more parameters than observations.

Another way to handle this issue is by using LASSO or Ridge regression. LASSO effectively eliminates some of the parameters by pushing their betas all the way to zero, while Ridge shrinks the betas for all variables toward zero (but never all the way). The LASSO and Ridge penalties can also be combined (the elastic net). All of these are valid ways to reduce model complexity when there are more parameters than observations.

Penalty Term:

\(\lambda\in[0,\infty)\) is the tuning parameter that controls the amount of shrinkage (or constraint) applied in the equations below.

Lasso Regression:

General equation: Minimization objective = \(\text{RSS} + \lambda \sum_{j} |\beta_j|\), i.e., the residual sum of squares plus \(\lambda\) times the sum of the absolute values of the coefficients. (A short glmnet sketch follows the bullets below.)

  • Lasso regression stands for Least Absolute Shrinkage and Selection Operator.

  • If a model uses the L1 regularization technique, then it is called lasso regression.

  • The L1 term adds a penalty equal to the sum of the absolute values of the coefficients (betas).

  • The larger \(\lambda\) is, the greater the penalization.

  • The penalty is added to the cost function, which shrinks the coefficients (betas) and helps reduce model complexity.

  • It forces the model to be smaller by inducing sparsity (i.e., models with fewer non-zero parameters).

  • L1 tends to shrink some coefficients all the way to zero.
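To see L1 shrinkage in action, here is a minimal sketch using the glmnet package (assuming it is installed) and the built-in mtcars data:

library(glmnet)

# Lasso (alpha = 1): predict mpg from a handful of predictors
x <- as.matrix(mtcars[, c("disp", "hp", "wt", "qsec", "drat")])
y <- mtcars$mpg

lasso_fit <- glmnet(x, y, alpha = 1)   # fits the whole path of lambda values at once
coef(lasso_fit, s = 1)                 # larger penalty: some coefficients are exactly zero
coef(lasso_fit, s = 0.01)              # much smaller penalty: most coefficients survive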

Ridge Regression:

General equation: Minimization objective = \(\text{RSS} + \lambda \sum_{j} \beta_j^2\), i.e., the residual sum of squares plus \(\lambda\) times the sum of the squared coefficients.

  • If a model uses the L2 regularization technique, then it is called ridge regression.

  • The L2 term adds a penalty equal to the sum of the squared coefficients (betas).

  • If \(\lambda=0\), then the objective is just the residual sum of squares. If \(\lambda>0\), then it adds a constraint on the coefficients (betas).

  • As we increase the value of \(\lambda\), this constraint pushes the coefficients toward zero.

  • L2 tends to shrink all coefficients evenly, without setting any of them exactly to zero.

Bias and Variance Tradeoff:

For both methods we have that:

  • As \(\lambda\) increases, the bias increases.

  • As \(\lambda\) increases, the variance decreases.
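To make this tradeoff concrete, here is a minimal sketch (again assuming the glmnet package, with the built-in mtcars data) that fits a ridge model and lets cross-validation choose \(\lambda\):

library(glmnet)

x <- as.matrix(mtcars[, c("disp", "hp", "wt", "qsec", "drat")])
y <- mtcars$mpg

# Ridge (alpha = 0): coefficients shrink toward zero but never reach exactly zero
set.seed(1)
cv_ridge <- cv.glmnet(x, y, alpha = 0)   # cross-validation over a grid of lambda values
cv_ridge$lambda.min                      # lambda with the lowest estimated test error
coef(cv_ridge, s = "lambda.min")

# Larger lambda -> more shrinkage (more bias, less variance); smaller lambda -> the reverse
coef(cv_ridge, s = max(cv_ridge$lambda))
coef(cv_ridge, s = min(cv_ridge$lambda))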

PCA example + R Code

For our example, let's fit a model using the nhgh dataset (NHANES data) and predict log blood urea nitrogen (bun) from age, weight (wt), waist circumference (waist), height (ht), and body mass index (bmi). The particular predictors aren't that important; we're mainly focused on applying PCA to a model and examining the results. The summary of our model follows:
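For completeness, here is roughly how the data is loaded and the baseline model fit (a minimal sketch; it assumes the nhgh data is pulled with Hmisc::getHdata() and that the rms, factoextra, and ggplot2 packages used below are available):

library(rms)         # ols()
library(Hmisc)       # getHdata()
library(factoextra)  # get_eig(), fviz_screeplot(), fviz_pca_biplot()
library(ggplot2)     # for the scree plot

getHdata(nhgh)       # NHANES-based nhgh data from hbiostat.org

# Baseline regression of log blood urea nitrogen on the five predictors
f1 <- log(bun) ~ age + wt + waist + ht + bmi
m1 <- ols(f1, data = nhgh, x = TRUE)
yhat_full <- predict(m1)                 # baseline predictions, used in the comparison plots below

# One-sided formula listing the predictors, used by prcomp() below
f1pca <- ~ age + wt + waist + ht + bmi

m1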

## Frequencies of Missing Values Due to Each Variable
## log(bun)      age       wt    waist       ht      bmi 
##       89        0        0      239        0        0 
## 
## Linear Regression Model
## 
## ols(formula = f1, data = nhgh, x = TRUE)
## 
## 
##                  Model Likelihood    Discrimination    
##                        Ratio Test           Indexes    
## Obs    6475    LR chi2    1468.50    R2       0.203    
## sigma0.3446    d.f.             5    R2 adj   0.202    
## d.f.   6469    Pr(> chi2)  0.0000    g        0.200    
## 
## Residuals
## 
##      Min       1Q   Median       3Q      Max 
## -2.50135 -0.20683  0.01381  0.21959  1.75292 
## 
## 
##           Coef    S.E.   t     Pr(>|t|)
## Intercept  2.0222 0.2984  6.78 <0.0001 
## age        0.0088 0.0003 34.25 <0.0001 
## wt         0.0028 0.0019  1.48 0.1377  
## waist     -0.0020 0.0008 -2.34 0.0191  
## ht         0.0009 0.0018  0.51 0.6098  
## bmi       -0.0039 0.0051 -0.77 0.4433

This works as a baseline, with an adjusted R-squared of .202. Next, let’s perform the PCA.

pca1 <- prcomp(f1pca, data = nhgh, scale. = TRUE)   # PCA of the five standardized predictors
summary(pca1)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5
## Standard deviation     1.7238 1.0672 0.9072 0.24557 0.07823
## Proportion of Variance 0.5943 0.2278 0.1646 0.01206 0.00122
## Cumulative Proportion  0.5943 0.8221 0.9867 0.99878 1.00000
get_eig(pca1)
##        eigenvalue variance.percent cumulative.variance.percent
## Dim.1 2.971575227       59.4315045                    59.43150
## Dim.2 1.139005145       22.7801029                    82.21161
## Dim.3 0.822994762       16.4598952                    98.67150
## Dim.4 0.060304749        1.2060950                    99.87760
## Dim.5 0.006120117        0.1224023                   100.00000
`%|%` <- function(a, b) paste0(a, b)                    # small paste helper
nhgh[, "PC" %|% 1:5] <- predict(pca1, newdata = nhgh)   # add the PC1-PC5 scores to the data
fviz_screeplot(pca1)

fviz_pca_biplot(pca1)

# calculate the proportion of total variance explained by each principal component
var_explained <- pca1$sdev^2 / sum(pca1$sdev^2)

# create scree plot (using ggplot() rather than the now-deprecated qplot())
ggplot(data.frame(pc = 1:5, var_explained = var_explained), aes(pc, var_explained)) +
  geom_point() +
  geom_line() +
  xlab("Principal Component") +
  ylab("Variance Explained") +
  ggtitle("Scree Plot") +
  ylim(0, 1)

Here, we can see that the first three components capture most of the variance. Remember, one of the reasons we use PCA is to stay within our parameter budget; judging by this, we could likely omit the last two components from the model without losing too much predictive power, if we needed to fit a tighter budget. Let's first fit the model with all of the principal components and compare its predictions against those of the full model with the original predictors.

f2 <- log(bun) ~ PC1 + PC2 + PC3 + PC4 + PC5
m2 <- ols(f2, data = nhgh, x=TRUE)
print(m2)
## Frequencies of Missing Values Due to Each Variable
## log(bun)      PC1      PC2      PC3      PC4      PC5 
##       89      239      239      239      239      239 
## 
## Linear Regression Model
## 
## ols(formula = f2, data = nhgh, x = TRUE)
## 
## 
##                  Model Likelihood    Discrimination    
##                        Ratio Test           Indexes    
## Obs    6475    LR chi2    1468.50    R2       0.203    
## sigma0.3446    d.f.             5    R2 adj   0.202    
## d.f.   6469    Pr(> chi2)  0.0000    g        0.200    
## 
## Residuals
## 
##      Min       1Q   Median       3Q      Max 
## -2.50135 -0.20683  0.01381  0.21959  1.75292 
## 
## 
##           Coef    S.E.   t      Pr(>|t|)
## Intercept  2.4785 0.0043 578.79 <0.0001 
## PC1        0.0387 0.0025  15.54 <0.0001 
## PC2        0.0784 0.0040  19.51 <0.0001 
## PC3        0.1497 0.0047  31.72 <0.0001 
## PC4        0.0671 0.0174   3.85 0.0001  
## PC5       -0.0582 0.0550  -1.06 0.2900
yhat_pca_full <- predict(m2)
plot(yhat_pca_full, yhat_full, xlab = 'PCA Predictions with All Components', ylab = 'Full Model Predictions')

Other than the coefficients, there is no difference between these models. Remember, the principal components are just linear combinations of our variables, so it makes sense that the predictions are exactly the same.
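To convince yourself of this, here is a small self-contained check (using the built-in mtcars data rather than nhgh): regressing on all of the principal components reproduces the fitted values from a regression on the original predictors, because the components are just a rotation of the same column space.

X   <- scale(mtcars[, c("disp", "hp", "wt", "qsec")])          # centered, scaled predictors
pcs <- prcomp(X)$x                                             # all four principal component scores
fit_orig <- lm(mtcars$mpg ~ X)
fit_pcs  <- lm(mtcars$mpg ~ pcs)
all.equal(unname(fitted(fit_orig)), unname(fitted(fit_pcs)))   # TRUE

Next, let's do the same comparison, but only using four principal components.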

f3 <- log(bun) ~ PC1 + PC2 + PC3 + PC4
m3 <- ols(f3, data = nhgh, x=TRUE)
print(m3)
## Frequencies of Missing Values Due to Each Variable
## log(bun)      PC1      PC2      PC3      PC4 
##       89      239      239      239      239 
## 
## Linear Regression Model
## 
## ols(formula = f3, data = nhgh, x = TRUE)
## 
## 
##                  Model Likelihood    Discrimination    
##                        Ratio Test           Indexes    
## Obs    6475    LR chi2    1467.38    R2       0.203    
## sigma0.3446    d.f.             4    R2 adj   0.202    
## d.f.   6470    Pr(> chi2)  0.0000    g        0.200    
## 
## Residuals
## 
##      Min       1Q   Median       3Q      Max 
## -2.50271 -0.20828  0.01439  0.21962  1.75196 
## 
## 
##           Coef   S.E.   t      Pr(>|t|)
## Intercept 2.4785 0.0043 578.79 <0.0001 
## PC1       0.0387 0.0025  15.53 <0.0001 
## PC2       0.0784 0.0040  19.51 <0.0001 
## PC3       0.1497 0.0047  31.72 <0.0001 
## PC4       0.0671 0.0174   3.85 0.0001
yhat_pca_4 <- predict(m3)
plot(yhat_pca_4, yhat_full, xlab = 'PCA Predictions with 4 Components', ylab = 'Full Model Predictions')

Pretty cool! We don't lose much performance when we omit the last principal component, which we might have guessed from our earlier scree plot. Let's do another comparison, this time using only three principal components.

f4 <- log(bun) ~ PC1 + PC2 + PC3 
m4 <- ols(f4, data = nhgh, x=TRUE)
print(m4)
## Frequencies of Missing Values Due to Each Variable
## log(bun)      PC1      PC2      PC3 
##       89      239      239      239 
## 
## Linear Regression Model
## 
## ols(formula = f4, data = nhgh, x = TRUE)
## 
## 
##                  Model Likelihood    Discrimination    
##                        Ratio Test           Indexes    
## Obs    6475    LR chi2    1452.56    R2       0.201    
## sigma0.3449    d.f.             3    R2 adj   0.201    
## d.f.   6471    Pr(> chi2)  0.0000    g        0.199    
## 
## Residuals
## 
##      Min       1Q   Median       3Q      Max 
## -2.50278 -0.20743  0.01395  0.21979  1.73790 
## 
## 
##           Coef   S.E.   t      Pr(>|t|)
## Intercept 2.4786 0.0043 578.18 <0.0001 
## PC1       0.0387 0.0025  15.52 <0.0001 
## PC2       0.0784 0.0040  19.49 <0.0001 
## PC3       0.1497 0.0047  31.68 <0.0001
yhat_pca_3 <- predict(m4)
plot(yhat_pca_3, yhat_full, xlab = 'PCA Predictions with 3 Components', ylab = 'Full Model Predictions')

As expected, we lose a little performance, but this can still be a good choice if we want to retain information from all of our original predictors. Let's do the comparison with just two components.

f5 <- log(bun) ~ PC1 + PC2 
m5 <- ols(f5, data = nhgh, x=TRUE)
print(m5)
## Frequencies of Missing Values Due to Each Variable
## log(bun)      PC1      PC2 
##       89      239      239 
## 
## Linear Regression Model
## 
## ols(formula = f5, data = nhgh, x = TRUE)
## 
## 
##                 Model Likelihood    Discrimination    
##                       Ratio Test           Indexes    
## Obs    6475    LR chi2    518.76    R2       0.077    
## sigma0.3707    d.f.            2    R2 adj   0.077    
## d.f.   6472    Pr(> chi2) 0.0000    g        0.122    
## 
## Residuals
## 
##       Min        1Q    Median        3Q       Max 
## -2.423992 -0.226916  0.008503  0.238538  1.851077 
## 
## 
##           Coef   S.E.   t      Pr(>|t|)
## Intercept 2.4786 0.0046 538.01 <0.0001 
## PC1       0.0387 0.0027  14.44 <0.0001 
## PC2       0.0786 0.0043  18.18 <0.0001
yhat_pca_2 <- predict(m5)
plot(yhat_pca_2, yhat_full, xlab = 'PCA Predictions with 2 Components', ylab = 'Full Model Predictions')

Now our predictive power is comically bad (the R2 drops from 0.203 to 0.077), which is in line with the variance captured in our scree plot. However, we're still including information from all of our predictors. Hopefully you wouldn't be pushed into constraining a model this much, but it's an option.

As a final aside, here is the setup for a second, more flexible model (restricted cubic splines and interactions) fit to the Framingham data:

library(tgsify)
library(rms)
library(dplyr)

d1 <- read.csv("http://hbiostat.org/data/repo/2.20.Framingham.csv") %>%
  mutate(sex = factor(sex, 1:2, c('male', 'female')))


f1 <- log(scl) ~ sex + rcs(age,4) + rcs(bmi,4) + rcs(age,4)*sex + rcs(bmi,4)*sex +rcs(bmi,4)%ia%rcs(age,4)

m1 <- ols(f1, data = d1, x=TRUE, y=TRUE)
dd <- datadist(d1)
options(datadist = "dd")

m1
## Frequencies of Missing Values Due to Each Variable
## log(scl)      sex      age      bmi 
##       33        0        0        9 
## 
## Linear Regression Model
## 
## ols(formula = f1, data = d1, x = TRUE, y = TRUE)
## 
## 
##                 Model Likelihood    Discrimination    
##                       Ratio Test           Indexes    
## Obs    4658    LR chi2    740.65    R2       0.147    
## sigma0.1769    d.f.           18    R2 adj   0.144    
## d.f.   4639    Pr(> chi2) 0.0000    g        0.083    
## 
## Residuals
## 
##       Min        1Q    Median        3Q       Max 
## -0.673982 -0.113647 -0.003467  0.115242  0.948226 
## 
## 
##                    Coef    S.E.   t     Pr(>|t|)
## Intercept           4.5385 0.5476  8.29 <0.0001 
## sex=female          0.2034 0.1961  1.04 0.2996  
## age                 0.0079 0.0133  0.60 0.5498  
## age'                0.0178 0.0447  0.40 0.6908  
## age''              -0.0452 0.1016 -0.45 0.6562  
## bmi                 0.0254 0.0226  1.12 0.2617  
## bmi'               -0.0431 0.0595 -0.73 0.4682  
## bmi''               0.1162 0.1789  0.65 0.5160  
## bmi * age          -0.0001 0.0005 -0.11 0.9157  
## bmi * age'         -0.0012 0.0017 -0.70 0.4810  
## bmi * age''         0.0025 0.0038  0.66 0.5082  
## bmi' * age          0.0004 0.0012  0.36 0.7209  
## bmi'' * age        -0.0017 0.0037 -0.46 0.6489  
## sex=female * age    0.0023 0.0034  0.67 0.5011  
## sex=female * age'   0.0259 0.0132  1.95 0.0510  
## sex=female * age'' -0.0582 0.0305 -1.91 0.0562  
## sex=female * bmi   -0.0153 0.0066 -2.32 0.0206  
## sex=female * bmi'   0.0250 0.0227  1.10 0.2715  
## sex=female * bmi'' -0.0581 0.0689 -0.84 0.3991