Read this question from the datascience subreddit.
In this assignment, you craft a response to the question. You should help the poster understand:
1. When is PCA helpful?
2. What does PCA achieve in a regression setting? (model complexity vs. data input complexity; comparison to LASSO and Ridge)
3. Example of how it might be accomplished + code
The purpose of this article is to provide a simplified explanation of Principal Component Analysis (PCA). Even though our goal is to answer the three questions above, we will dive a little deeper into the functionality and purpose of PCA in order to provide a more general understanding of this concept.
Principal component analysis (PCA) is a statistical method used to reduce the dimensionality of a large data set by transforming it into a smaller one that still captures most of the information. The transformation is given by linear combinations of the original variables that capture most of the variance in the data.
As we might expect, reducing the number of variables in a data set can reduce the accuracy of our model, so the goal is to minimize the loss of accuracy while significantly simplifying the model.
Next, I will walk through a general example of how we might compute the principal components of a data set using Singular Value Decomposition (SVD).
Steps for finding PC1 in a 2-dimensional plot:
1. Create a scatter plot of your data set.
2. Center the data so that its mean sits at the origin.
3. Draw a random line through the origin and project each data point onto it.
4. Rotate the line until the sum of squared distances from the projected points to the origin is maximized; this best-fitting line is PC1.
The slope of PC1 tells you how the two original variables are mixed: a shallow slope means the x-axis variable accounts for most of the spread along PC1, while a steep slope means the y-axis variable dominates.
Notice that:
Eigenvalue for PC1 = sum of squared distances (SS) = \(d_1^2 + d_2^2 + \dots + d_n^2\), where \(d_i\) is the distance from the \(i\)-th projected point to the origin.
Singular value for PC1 = the square root of the eigenvalue for PC1.
By the Pythagorean theorem, \(c^2 = a^2 + b^2\): each point's distance to the origin (\(c\)) is fixed, so maximizing the distances of the projected points to the origin (\(b\)) is the same as minimizing the distances from the points to the line (\(a\)). That is why the rotated line is also the line of best fit.
Singular vector (eigenvector) for PC1: the unit-length vector that points along PC1; its entries (the loadings) say how much of each original variable goes into PC1.
PC2: The line perpendicular to PC1.
PCA plot: the data re-plotted using PC1 and PC2 as the new axes.
Variation of PC1 and PC2:
Variation of PC1: \(\frac{SS(PC1)}{n - 1}\)
Variation of PC2: \(\frac{SS(PC2)}{n - 1}\)
where \(n\) is the number of data points.
Total variation of the PCs: variation of PC1 + variation of PC2. The share of this total that each PC accounts for (e.g., variation of PC1 divided by the total) is what a scree plot displays.
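To make the recipe above concrete, here is a minimal R sketch on a made-up two-variable data set (the data and names are purely illustrative, not from the example later in this post). It computes the principal components with prcomp() and checks them against a direct SVD of the centered data:

# toy two-variable data set (illustrative only)
set.seed(42)
x <- rnorm(50)
y <- 0.8 * x + rnorm(50, sd = 0.4)
dat <- cbind(x, y)

# PCA via prcomp(); center = TRUE shifts the point cloud so its mean is at the origin
pca <- prcomp(dat, center = TRUE, scale. = FALSE)
pca$rotation   # eigenvectors (loadings) for PC1 and PC2
pca$sdev^2     # variation of each PC, i.e. SS(PC) / (n - 1)
summary(pca)   # proportion of the total variation captured by each PC

# the same quantities by hand, via SVD of the centered data
centered <- scale(dat, center = TRUE, scale = FALSE)
sv <- svd(centered)
sv$d           # singular values; sv$d^2 / (nrow(dat) - 1) reproduces pca$sdev^2
sv$v           # singular vectors = the loadings in pca$rotation (up to sign)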
Principal Component Analysis (PCA) is just one of many methods we can use to impose constraints on a model to promote stability and stay within a parameter budget. Some constraints are ‘soft’ and merely push the parameters in a particular direction (e.g., Bayesian priors), whereas others, like PCA, are ‘hard’ and restrict the model in a very specific way.
We will talk about some of the other approaches below, but PCA imposes its ‘hard’ constraint by combining predictors into components, thereby reducing dimensionality as described above. With this in mind, PCA is helpful in three primary situations:
1. The number of parameters exceeds the number of observations. In this case there are infinitely many solutions that minimize the sum of squared errors. By combining parameters into principal components, we can find a solution that fits within our parameter budget without excluding any predictors from the model.
2. The model has too many predictors and becomes unstable. Here the model doesn't have more predictors than observations, but it still exceeds its parameter budget. By using PCA, you can stay within the commonly recommended 10-20 observations per parameter, which promotes model stability.
3. A cluster of variables is highly correlated. Rather than include each variable separately, you can combine them into a principal component. This reduces the number of parameters in the model, keeping you within your parameter budget and promoting stability without directly omitting any of the correlated variables.
If we have more parameters than observations, we would have an infinite number of models that minimize the sum of squares. In this case, there is one major problem we might encounter: overfitting.
Overfitting: the model performs well on the training data but poorly on the test data.
Underfitting: the model performs poorly on both the training data and the test data.
PCA is one way to handle this issue. To be more thoughtful about our variables, we can group them: if two variables are closely related, we might exclude one of them entirely, or we might combine two or more closely related variables into a single principal component, which reduces the complexity of the model. This is one way to deal with having more parameters than observations.
Another way to handle this issue is to use LASSO or Ridge regression. LASSO effectively eliminates some of the parameters by pushing some of their betas all the way to zero, while Ridge pushes the betas for all variables very close to zero (but not all the way). LASSO and Ridge can also be combined (the elastic net). All of these are valid ways to reduce model complexity when there are more parameters than observations.
\(\lambda\in[0,\infty)\) is the penalty term that denotes the amount of shrinkage (or constraint) that will be implemented in the equation.
General Equation: Minimization objective = Residual Sum of Squares + \(\lambda \sum_{j=1}^{p} \lvert\beta_j\rvert\) (i.e., \(\lambda\) times the sum of the absolute values of the coefficients)
Lasso regression stands for Least Absolute Shrinkage and Selection Operator.
If a model uses the L1 regularization technique, then it is called lasso regression.
The L1 term adds a penalty equivalent to the absolute value of the magnitude of the coefficients (betas).
The larger \(\lambda\) is, the greater the penalization.
The penalty term is added to the cost function, which shrinks the coefficients (betas) and helps to reduce the model complexity.
Forces the model to be smaller by inducing sparsity (i.e., a model with fewer non-zero parameters).
L1 tends to shrink coefficients to zero
General Equation: Minimization objective = Residual Sum of Squares + \(\lambda \sum_{j=1}^{p} \beta_j^2\) (i.e., \(\lambda\) times the sum of the squared coefficients)
If a model uses the L2 regularization technique, then it is called ridge regression.
The L2 term adds a penalty equivalent to the square of the magnitude of the coefficients (betas).
If \(\lambda=0\), then the equation is just the residual sum of squares. If \(\lambda>0\), then it will add a constraint to the coefficients (betas).
As we increase the value of \(\lambda\), this constraint pushes the coefficients toward zero.
L2 tends to shrink coefficients evenly
For both methods, increasing \(\lambda\) increases the bias and decreases the variance, while decreasing \(\lambda\) does the reverse, so choosing \(\lambda\) is a bias-variance trade-off.
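As a quick, hedged illustration of these penalties in practice (glmnet is not used elsewhere in this post, and the mtcars variables below are just placeholders), the glmnet package fits both with the same function, switching between them via alpha:

library(glmnet)

# placeholder design matrix and response
x <- as.matrix(mtcars[, c("cyl", "disp", "hp", "wt", "qsec")])
y <- mtcars$mpg

# LASSO (alpha = 1): L1 penalty; lambda chosen by cross-validation
cv_lasso <- cv.glmnet(x, y, alpha = 1)
coef(cv_lasso, s = "lambda.min")   # some betas are shrunk exactly to zero

# Ridge (alpha = 0): L2 penalty; all betas shrink toward, but not to, zero
cv_ridge <- cv.glmnet(x, y, alpha = 0)
coef(cv_ridge, s = "lambda.min")

# Elastic net (0 < alpha < 1): a mix of the L1 and L2 penalties
cv_enet <- cv.glmnet(x, y, alpha = 0.5)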
For our example of PCA, let's fit a model in R. We use the nhgh data set and predict log blood urea nitrogen (bun) from age, weight, waist circumference, height, and body mass index (bmi). The particular predictors aren't that important; we're mainly focused on applying PCA to a model and examining the results. The summary of our model follows:
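(The fitting code isn't shown in the original output, so here is a hedged sketch of how the baseline could be produced; loading nhgh with Hmisc::getHdata() is an assumption on my part, and printing m1 yields the summary below.)

library(rms)
library(Hmisc)

getHdata(nhgh)   # assumed source of the nhgh data set (hbiostat.org)

f1 <- log(bun) ~ age + wt + waist + ht + bmi
m1 <- ols(f1, data = nhgh, x = TRUE)
m1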
## Frequencies of Missing Values Due to Each Variable
## log(bun) age wt waist ht bmi
## 89 0 0 239 0 0
##
## Linear Regression Model
##
## ols(formula = f1, data = nhgh, x = TRUE)
##
##
## Model Likelihood Discrimination
## Ratio Test Indexes
## Obs 6475 LR chi2 1468.50 R2 0.203
## sigma0.3446 d.f. 5 R2 adj 0.202
## d.f. 6469 Pr(> chi2) 0.0000 g 0.200
##
## Residuals
##
## Min 1Q Median 3Q Max
## -2.50135 -0.20683 0.01381 0.21959 1.75292
##
##
## Coef S.E. t Pr(>|t|)
## Intercept 2.0222 0.2984 6.78 <0.0001
## age 0.0088 0.0003 34.25 <0.0001
## wt 0.0028 0.0019 1.48 0.1377
## waist -0.0020 0.0008 -2.34 0.0191
## ht 0.0009 0.0018 0.51 0.6098
## bmi -0.0039 0.0051 -0.77 0.4433
This serves as a baseline, with an adjusted R-squared of 0.202. Next, let's perform the PCA.
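Again, a sketch of how the PCA step could be run. Standardizing the predictors (scale. = TRUE) is implied by the eigenvalues summing to 5 below, and the Dim.1-Dim.5 eigenvalue table looks like output from the factoextra package, so both are assumptions here:

library(factoextra)

# principal components of the five predictors, centered and standardized;
# the formula interface drops rows with missing values by default
pca1 <- prcomp(~ age + wt + waist + ht + bmi, data = nhgh,
               center = TRUE, scale. = TRUE)

summary(pca1)         # standard deviation and proportion of variance per PC
get_eigenvalue(pca1)  # eigenvalues and cumulative variance (factoextra)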
## Importance of components:
## PC1 PC2 PC3 PC4 PC5
## Standard deviation 1.7238 1.0672 0.9072 0.24557 0.07823
## Proportion of Variance 0.5943 0.2278 0.1646 0.01206 0.00122
## Cumulative Proportion 0.5943 0.8221 0.9867 0.99878 1.00000
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 2.971575227 59.4315045 59.43150
## Dim.2 1.139005145 22.7801029 82.21161
## Dim.3 0.822994762 16.4598952 98.67150
## Dim.4 0.060304749 1.2060950 99.87760
## Dim.5 0.006120117 0.1224023 100.00000
#nhgh[, "PC"] <- predict(pca1, newdata = nhgh)?
# small helper operator for pasting strings together
`%|%` <- function(a, b) paste0(a, b)
# add the principal component scores as columns PC1-PC5 of nhgh
nhgh[, "PC" %|% 1:5] <- predict(pca1, newdata = nhgh)
fviz_screeplot(pca1)

# calculate total variance explained by each principal component
var_explained <- pca1$sdev^2 / sum(pca1$sdev^2)

# create scree plot
qplot(c(1:5), var_explained) +
  geom_line() +
  xlab("Principal Component") +
  ylab("Variance Explained") +
  ggtitle("Scree Plot") +
  ylim(0, 1)
Here we can see that the first three components capture most of the variance. Remember, one of the reasons we use PCA is to stay within our parameter budget, so judging by this we could likely omit the last two components and not lose much predictive power, should we need to fit within a certain parameter budget. Let's first fit the model with all of the principal components and compare its predictions against those of the full model built on the original predictors.
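A sketch of the corresponding fit using the PC1-PC5 columns created above (the names f2 and m2 match the output below); yhat_full, the baseline model's predictions, is computed here for the comparison plots that follow:

# regress log(bun) on all five principal components
f2 <- log(bun) ~ PC1 + PC2 + PC3 + PC4 + PC5
m2 <- ols(f2, data = nhgh, x = TRUE)
m2

# predictions from the original-predictor model, used in the comparison plots
yhat_full <- predict(m1)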
## Frequencies of Missing Values Due to Each Variable
## log(bun) PC1 PC2 PC3 PC4 PC5
## 89 239 239 239 239 239
##
## Linear Regression Model
##
## ols(formula = f2, data = nhgh, x = TRUE)
##
##
## Model Likelihood Discrimination
## Ratio Test Indexes
## Obs 6475 LR chi2 1468.50 R2 0.203
## sigma0.3446 d.f. 5 R2 adj 0.202
## d.f. 6469 Pr(> chi2) 0.0000 g 0.200
##
## Residuals
##
## Min 1Q Median 3Q Max
## -2.50135 -0.20683 0.01381 0.21959 1.75292
##
##
## Coef S.E. t Pr(>|t|)
## Intercept 2.4785 0.0043 578.79 <0.0001
## PC1 0.0387 0.0025 15.54 <0.0001
## PC2 0.0784 0.0040 19.51 <0.0001
## PC3 0.1497 0.0047 31.72 <0.0001
## PC4 0.0671 0.0174 3.85 0.0001
## PC5 -0.0582 0.0550 -1.06 0.2900
yhat_pca_full <- predict(m2)
plot(yhat_pca_full, yhat_full, xlab = 'PCA Predictions with All Components', ylab = 'Full Model Predictions')

Other than the coefficients, there is no difference between these models. Remember, the principal components are just linear combinations of our variables, and all five of them together span the same space as the original predictors, so the predictions are exactly the same. Next, let's do the same comparison, but using only 4 principal components.
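The reduced models are built the same way, dropping the trailing components one at a time; a sketch of the fits whose summaries appear below (m3 with four components, then m4 with three and m5 with two):

# drop PC5: keep four components
f3 <- log(bun) ~ PC1 + PC2 + PC3 + PC4
m3 <- ols(f3, data = nhgh, x = TRUE)

# drop PC4 as well: keep three components
f4 <- log(bun) ~ PC1 + PC2 + PC3
m4 <- ols(f4, data = nhgh, x = TRUE)

# keep only the first two components
f5 <- log(bun) ~ PC1 + PC2
m5 <- ols(f5, data = nhgh, x = TRUE)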
## Frequencies of Missing Values Due to Each Variable
## log(bun) PC1 PC2 PC3 PC4
## 89 239 239 239 239
##
## Linear Regression Model
##
## ols(formula = f3, data = nhgh, x = TRUE)
##
##
## Model Likelihood Discrimination
## Ratio Test Indexes
## Obs 6475 LR chi2 1467.38 R2 0.203
## sigma0.3446 d.f. 4 R2 adj 0.202
## d.f. 6470 Pr(> chi2) 0.0000 g 0.200
##
## Residuals
##
## Min 1Q Median 3Q Max
## -2.50271 -0.20828 0.01439 0.21962 1.75196
##
##
## Coef S.E. t Pr(>|t|)
## Intercept 2.4785 0.0043 578.79 <0.0001
## PC1 0.0387 0.0025 15.53 <0.0001
## PC2 0.0784 0.0040 19.51 <0.0001
## PC3 0.1497 0.0047 31.72 <0.0001
## PC4 0.0671 0.0174 3.85 0.0001
yhat_pca_4 <- predict(m3)
plot(yhat_pca_4, yhat_full, xlab = 'PCA Predictions with 4 Components', ylab = 'Full Model Predictions')
Pretty cool! We don't lose much performance when we omit the last principal component, which we might have guessed from the look of our earlier scree plot. Let's do another comparison, but this time using only 3 principal components.
## Frequencies of Missing Values Due to Each Variable
## log(bun) PC1 PC2 PC3
## 89 239 239 239
##
## Linear Regression Model
##
## ols(formula = f4, data = nhgh, x = TRUE)
##
##
## Model Likelihood Discrimination
## Ratio Test Indexes
## Obs 6475 LR chi2 1452.56 R2 0.201
## sigma0.3449 d.f. 3 R2 adj 0.201
## d.f. 6471 Pr(> chi2) 0.0000 g 0.199
##
## Residuals
##
## Min 1Q Median 3Q Max
## -2.50278 -0.20743 0.01395 0.21979 1.73790
##
##
## Coef S.E. t Pr(>|t|)
## Intercept 2.4786 0.0043 578.18 <0.0001
## PC1 0.0387 0.0025 15.52 <0.0001
## PC2 0.0784 0.0040 19.49 <0.0001
## PC3 0.1497 0.0047 31.68 <0.0001
yhat_pca_3 <- predict(m4)
plot(yhat_pca_3, yhat_full, xlab = 'PCA Predictions with 3 Components', ylab = 'Full Model Predictions')

As expected, we lose some performance, but it can be a good choice if we want to include information from all of our original predictors. Let's do the comparison with just two components.
## Frequencies of Missing Values Due to Each Variable
## log(bun) PC1 PC2
## 89 239 239
##
## Linear Regression Model
##
## ols(formula = f5, data = nhgh, x = TRUE)
##
##
## Model Likelihood Discrimination
## Ratio Test Indexes
## Obs 6475 LR chi2 518.76 R2 0.077
## sigma0.3707 d.f. 2 R2 adj 0.077
## d.f. 6472 Pr(> chi2) 0.0000 g 0.122
##
## Residuals
##
## Min 1Q Median 3Q Max
## -2.423992 -0.226916 0.008503 0.238538 1.851077
##
##
## Coef S.E. t Pr(>|t|)
## Intercept 2.4786 0.0046 538.01 <0.0001
## PC1 0.0387 0.0027 14.44 <0.0001
## PC2 0.0786 0.0043 18.18 <0.0001
yhat_pca_2 <- predict(m5)
plot(yhat_pca_2, yhat_full, xlab = 'PCA Predictions with 2 Components', ylab = 'Full Model Predictions')

Now, our predictive power is comically bad, which is in line with the variance captured in our scree plot. However, we're still including information from all of our predictors. Hopefully you wouldn't be pushed into constraining a model this much, but it's an option.
# A second example: the Framingham data set, fit with a flexible spline model
# (note that this re-uses the names f1 and m1 from the earlier example)
require(tgsify)
require(rms)
library(dplyr)

d1 <- read.csv("http://hbiostat.org/data/repo/2.20.Framingham.csv") %>%
  mutate(sex = factor(sex, 1:2, c('male', 'female')))

# log serum cholesterol modeled with restricted cubic splines and interactions
f1 <- log(scl) ~ sex + rcs(age, 4) + rcs(bmi, 4) + rcs(age, 4)*sex + rcs(bmi, 4)*sex + rcs(bmi, 4) %ia% rcs(age, 4)
m1 <- ols(f1, data = d1, x = TRUE, y = TRUE)

dd <- datadist(d1)
options(datadist = "dd")

m1

## Frequencies of Missing Values Due to Each Variable
## log(scl) sex age bmi
## 33 0 0 9
##
## Linear Regression Model
##
## ols(formula = f1, data = d1, x = TRUE, y = TRUE)
##
##
## Model Likelihood Discrimination
## Ratio Test Indexes
## Obs 4658 LR chi2 740.65 R2 0.147
## sigma0.1769 d.f. 18 R2 adj 0.144
## d.f. 4639 Pr(> chi2) 0.0000 g 0.083
##
## Residuals
##
## Min 1Q Median 3Q Max
## -0.673982 -0.113647 -0.003467 0.115242 0.948226
##
##
## Coef S.E. t Pr(>|t|)
## Intercept 4.5385 0.5476 8.29 <0.0001
## sex=female 0.2034 0.1961 1.04 0.2996
## age 0.0079 0.0133 0.60 0.5498
## age' 0.0178 0.0447 0.40 0.6908
## age'' -0.0452 0.1016 -0.45 0.6562
## bmi 0.0254 0.0226 1.12 0.2617
## bmi' -0.0431 0.0595 -0.73 0.4682
## bmi'' 0.1162 0.1789 0.65 0.5160
## bmi * age -0.0001 0.0005 -0.11 0.9157
## bmi * age' -0.0012 0.0017 -0.70 0.4810
## bmi * age'' 0.0025 0.0038 0.66 0.5082
## bmi' * age 0.0004 0.0012 0.36 0.7209
## bmi'' * age -0.0017 0.0037 -0.46 0.6489
## sex=female * age 0.0023 0.0034 0.67 0.5011
## sex=female * age' 0.0259 0.0132 1.95 0.0510
## sex=female * age'' -0.0582 0.0305 -1.91 0.0562
## sex=female * bmi -0.0153 0.0066 -2.32 0.0206
## sex=female * bmi' 0.0250 0.0227 1.10 0.2715
## sex=female * bmi'' -0.0581 0.0689 -0.84 0.3991