Problem # 2
For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
Part a:
The lasso, relative to least squares, is:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
Answer: iii.
Justification: Lasso regression is less flexible than OLS because its L₁ penalty shrinks coefficients toward zero, reducing variance but increasing bias. Improved prediction accuracy is achieved when the reduction in variance compensates for the increase in bias.
Part b:
Repeat (a) for ridge regression relative to least squares.
Answer: iii.
Justification: Ridge regression, like lasso, penalizes coefficient size (but with an L2 norm). This shrinkage reduces variance (compared to OLS) while increasing bias, so it is less flexible than OLS. Prediction accuracy improves precisely when the reduction in variance outweighs the increase in bias.
Part c:
Repeat (a) for non-linear methods relative to least squares.
Answer: ii.
Justification: Most non-linear methods are more flexible than OLS: they reduce bias by better capturing complex patterns but increase variance due to more parameters or complexity. They improve prediction accuracy precisely if the reduction in bias outweighs the increase in variance.
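As a quick illustration of the flexibility argument in (a) and (b), the following minimal sketch on simulated data (not part of the assignment; the sample size and coefficients are arbitrary) shows how the lasso's penalty pulls coefficient estimates toward zero as λ grows, which is exactly the loss of flexibility that buys the reduction in variance.
# Minimal simulated illustration: as lambda grows, the lasso's L1 penalty
# shrinks coefficient estimates toward zero, so the fit is less flexible
# than least squares. (Output not shown; values depend on the seed.)
library(glmnet)
set.seed(1)
n <- 100; p <- 10
x <- matrix(rnorm(n * p), n, p)
beta <- c(3, -2, rep(0, p - 2)) # only the first two predictors matter
y <- as.vector(x %*% beta + rnorm(n))
ols_coef <- coef(lm(y ~ x)) # least squares: no shrinkage
lasso_path <- glmnet(x, y, alpha = 1) # lasso over a grid of lambdas
cbind(OLS = ols_coef,
      lasso_small_lambda = coef(lasso_path, s = 0.01)[, 1],
      lasso_large_lambda = coef(lasso_path, s = 1)[, 1])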
Problem # 9:
In this exercise, we will predict the number of applications received using the other variables in the College data set.
Part a:
Split the data set into a training set and a test set.
library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.4.2
data("College")
set.seed(1) # for reproducibility
train_idx <- sample(seq_len(nrow(College)), size = 0.7 * nrow(College))
train_data <- College[train_idx, ]
test_data <- College[-train_idx, ]
Part b:
Fit a linear model using least squares on the training set, and report the test error obtained.
# Fit model on the training set
lm_fit <- lm(Apps ~ ., data = train_data)
# Predict on the test set
lm_pred <- predict(lm_fit, newdata = test_data)
# Calculate test MSE
lm_mse <- mean((test_data$Apps - lm_pred)^2)
lm_mse
## [1] 1261630
Part c:
Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.4.3
## Loading required package: Matrix
## Loaded glmnet 4.1-8
# Convert data frames to matrices for glmnet
x_train <- model.matrix(Apps ~ ., data = train_data)[, -1] # drop the intercept column
y_train <- train_data$Apps
x_test <- model.matrix(Apps ~ ., data = test_data)[, -1]
y_test <- test_data$Apps
# Perform cross-validation for ridge (alpha = 0)
set.seed(1)
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0)
best_lambda_ridge <- cv_ridge$lambda.min
# Predict on test set using best lambda
ridge_pred <- predict(cv_ridge, s = best_lambda_ridge, newx = x_test)
ridge_mse <- mean((y_test - ridge_pred)^2)
ridge_mse
## [1] 1121034
Part d:
Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
# Cross-validation for lasso (alpha = 1)
set.seed(1)
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1)
best_lambda_lasso <- cv_lasso$lambda.min
# Predict on test set
lasso_pred <- predict(cv_lasso, s = best_lambda_lasso, newx = x_test)
lasso_mse <- mean((y_test - lasso_pred)^2)
lasso_mse
## [1] 1233246
# Number of nonzero coefficients (this count includes the unpenalized intercept)
lasso_coef <- coef(cv_lasso, s = best_lambda_lasso)
nonzero_count <- sum(lasso_coef != 0)
nonzero_count
## [1] 15
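For reference, one can also list which predictors the lasso keeps at the selected λ (a quick follow-up; the exact set depends on the seed and the CV folds, and the output is not shown above):
# Names of the predictors with non-zero lasso coefficients at lambda.min
rownames(lasso_coef)[as.vector(lasso_coef != 0)]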
Part e
Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
library(pls)
## Warning: package 'pls' was built under R version 4.4.3
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
# Fit PCR on training data with cross-validation
pcr_fit <- pcr(Apps ~ ., data = train_data, scale = TRUE, validation = "CV")
# Check CV errors to choose M (number of components)
validationplot(pcr_fit, val.type = "MSEP") # visual inspection
summary(pcr_fit) # shows CV metrics for different components
## Data: X dimension: 543 17
## Y dimension: 543 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3895 3807 2130 2134 1826 1692 1696
## adjCV 3895 3807 2126 2132 1806 1682 1690
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1695 1638 1604 1593 1601 1603 1606
## adjCV 1694 1627 1598 1587 1595 1598 1600
## 14 comps 15 comps 16 comps 17 comps
## CV 1607 1544 1145 1113
## adjCV 1602 1524 1137 1106
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.051 57.00 64.42 70.27 75.65 80.65 84.26 87.61
## Apps 5.788 71.69 71.70 80.97 82.60 82.60 82.69 84.06
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.58 92.84 94.93 96.74 97.82 98.72 99.39
## Apps 84.55 84.82 84.86 84.86 85.01 85.05 89.81
## 16 comps 17 comps
## X 99.85 100.00
## Apps 93.03 93.32
# Choose M (number of components) as the one with the lowest cross-validated PRESS:
best_M_pcr <- which.min(pcr_fit$validation$PRESS) # or chosen from the plot
# Predict on test set using best number of components
pcr_pred <- predict(pcr_fit, newdata = test_data, ncomp = best_M_pcr)
pcr_mse <- mean((test_data$Apps - pcr_pred)^2)
pcr_mse
## [1] 1261630
Interpretation:
The PCR CV results show that the prediction error (RMSEP) drops sharply as the first few principal components are added, indicating that these components capture much of the relevant signal, and then plateaus for intermediate values of M. However, the error falls again once nearly all components are included, and the cross-validated minimum occurs at M = 17. With all 17 components retained, PCR is equivalent to the full least-squares fit, which is why its test MSE (1,261,630) matches the one from part (b). A genuinely parsimonious PCR model (say, around 10 components) would be simpler but would give a noticeably higher CV error here.
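Since the raw PRESS minimum selects M = 17, a hedged alternative worth noting is the one-standard-error rule implemented in the pls package's selectNcomp(), which typically picks a more parsimonious M at a small cost in CV error (a sketch; output not shown):
# One-standard-error rule for choosing the number of components (pls::selectNcomp)
M_onesigma <- selectNcomp(pcr_fit, method = "onesigma", plot = FALSE)
M_onesigma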
Part f:
Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
# Fit PLS on training data with cross-validation
pls_fit <- plsr(Apps ~ ., data = train_data, scale = TRUE, validation = "CV")
# Check CV errors
validationplot(pls_fit, val.type = "MSEP") # visual inspection
summary(pls_fit)
## Data: X dimension: 543 17
## Y dimension: 543 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3895 1969 1762 1566 1493 1320 1236
## adjCV 3895 1963 1757 1556 1469 1291 1222
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1225 1213 1212 1207 1213 1207 1206
## adjCV 1212 1201 1199 1195 1200 1195 1194
## 14 comps 15 comps 16 comps 17 comps
## CV 1205 1205 1205 1205
## adjCV 1193 1193 1193 1193
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 25.68 47.43 62.46 64.88 67.34 72.68 77.20 80.92
## Apps 76.62 82.39 86.93 90.76 92.82 93.05 93.13 93.20
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 82.69 85.16 87.35 90.73 92.49 95.10 97.09
## Apps 93.26 93.28 93.30 93.31 93.32 93.32 93.32
## 16 comps 17 comps
## X 98.40 100.00
## Apps 93.32 93.32
# Choose M with the lowest cross-validated PRESS, as before:
best_M_pls <- which.min(pls_fit$validation$PRESS)
# Predict on test set
pls_pred <- predict(pls_fit, newdata = test_data, ncomp = best_M_pls)
pls_mse <- mean((test_data$Apps - pls_pred)^2)
pls_mse
## [1] 1261630
Interpretation:
In the PLS results, the first few components sharply reduce the RMSEP, indicating that they capture the main predictive signal, and additional components lower the error only marginally: by roughly 7–10 components the curve has flattened, and by 17 components nearly 100% of the predictor variance and about 93% of the variance in Apps is explained. The raw CV minimum again occurs at M = 17, so the selected model coincides with the least-squares fit and its test MSE (1,261,630) matches part (b); choosing M around 10 would give essentially the same CV error with a far more parsimonious model.
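Because the CV curve is essentially flat after about 10 components, a quick check is to compute the test MSE at that smaller M (illustrative; 10 is simply read off the table above, and the resulting value is not shown):
# Test MSE for a more parsimonious PLS fit with 10 components
pls_pred_small <- predict(pls_fit, newdata = test_data, ncomp = 10)
mean((test_data$Apps - pls_pred_small)^2)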
Part g:
Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
Prediction Accuracy: All five methods give test MSEs roughly between 1.1 million and 1.3 million. This means that while our models capture a fair amount of the signal behind the number of applications, there’s still a notable amount of variability or noise that remains unexplained.
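To put MSEs on the order of a million in context, a quick sketch is to convert the linear model's test MSE into a test R² by comparing it with the variance of Apps on the test set (output not shown; the other methods can be checked the same way):
# Test R^2 for the least squares fit: 1 - MSE / Var(Apps) on the test set
1 - lm_mse / mean((test_data$Apps - mean(test_data$Apps))^2)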
Method Comparisons:
The basic OLS model has an error of about 1.26 million and serves as our baseline.
Ridge Regression shows improvement with an error around 1.12 million—its shrinkage penalty helps reduce overfitting by controlling the magnitude of the coefficients.
Lasso Regression results in a test error near 1.23 million. Although a bit higher than ridge, lasso has the advantage of setting some coefficients exactly to zero (here 15 of the 18 coefficient estimates, including the intercept, remain non-zero), which can make the model easier to interpret.
For PCR and PLS, we see the usual pattern: a sharp drop in error with the addition of the first few components, after which improvements level off. Their final errors are close to those of ridge and lasso.
Summary:
No single method dramatically outperforms the others: the largest and smallest test MSEs differ by only about 12%, so any of these approaches predicts the number of college applications about equally well. The choice among them can therefore rest on other considerations, such as interpretability (lasso's sparser model) versus a small edge in prediction accuracy (ridge regression on this split).
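For convenience, the five test errors can be tabulated side by side, in the same style as the comparison data frame used in Problem 11 below (output not shown; the values are the ones reported above):
# Collect the five test MSEs from the chunks above into one table
data.frame(
  Method = c("OLS", "Ridge", "Lasso", "PCR", "PLS"),
  Test_MSE = c(lm_mse, ridge_mse, lasso_mse, pcr_mse, pls_mse)
)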
Problem # 11
We will now try to predict per capita crime rate in the Boston data set.
Part a:
Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
library(MASS)
## Warning: package 'MASS' was built under R version 4.4.2
##
## Attaching package: 'MASS'
## The following object is masked from 'package:ISLR2':
##
## Boston
library(leaps)
## Warning: package 'leaps' was built under R version 4.4.2
library(glmnet)
library(pls)
data("Boston")
#Train/Test Split
set.seed(1)
train_idx <- sample(seq_len(nrow(Boston)), size = 0.7 * nrow(Boston))
train_data <- Boston[train_idx, ]
test_data <- Boston[-train_idx, ]
# Separate predictors and response for later (glmnet/pcr needs matrices)
x_train <- model.matrix(crim ~ ., data = train_data)[, -1]
y_train <- train_data$crim
x_test <- model.matrix(crim ~ ., data = test_data)[, -1]
y_test <- test_data$crim
# Best subset selection uses 'regsubsets'
best_subset_fit <- regsubsets(crim ~ ., data = train_data, nvmax = 13)
best_subset_summary <- summary(best_subset_fit)
# Inspect metrics (adjr2, Cp, BIC, etc.) for each model size
best_subset_summary$adjr2
## [1] 0.4128277 0.4558228 0.4720392 0.4745639 0.4749360 0.4769643 0.4785831
## [8] 0.4790469 0.4797650 0.4786691 0.4773511 0.4758577 0.4743257
best_subset_summary$cp
## [1] 43.180047 15.354654 5.522343 4.841842 5.595966 5.258249 5.197729
## [8] 5.901439 6.440537 8.165938 10.031650 12.006191 14.000000
best_subset_summary$bic
## [1] -177.7483 -199.8055 -205.6558 -202.4962 -197.8935 -194.4130 -190.6627
## [8] -186.1330 -181.7796 -176.1959 -170.4664 -164.6236 -158.7608
# Choose the model size by BIC (smallest value)
best_size <- which.min(best_subset_summary$bic)
# Extract final coefficients for that model
coef(best_subset_fit, best_size)
## (Intercept) rad black lstat
## 0.95384519 0.43521730 -0.01310151 0.24416589
# Refit that model
form_best_subset <- as.formula(
paste("crim ~", paste(names(coef(best_subset_fit, best_size))[-1], collapse=" + "))
)
best_subset_lm <- lm(form_best_subset, data = train_data)
# Test MSE
best_subset_pred <- predict(best_subset_lm, newdata = test_data)
best_subset_mse <- mean((y_test - best_subset_pred)^2)
best_subset_mse
## [1] 61.86531
#Ridge Regression
set.seed(1)
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0) # 10-fold CV by default
best_lambda_ridge <- cv_ridge$lambda.min
# Predict on test set
ridge_pred <- predict(cv_ridge, s = best_lambda_ridge, newx = x_test)
ridge_mse <- mean((y_test - ridge_pred)^2)
ridge_mse
## [1] 60.15076
#Lasso Regression
set.seed(1)
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1)
best_lambda_lasso <- cv_lasso$lambda.min
# Predict on test set
lasso_pred <- predict(cv_lasso, s = best_lambda_lasso, newx = x_test)
lasso_mse <- mean((y_test - lasso_pred)^2)
lasso_mse
## [1] 59.55616
# Number of nonzero coefficients (again, this count includes the intercept)
lasso_coef <- coef(cv_lasso, s = best_lambda_lasso)
sum(lasso_coef != 0)
## [1] 11
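As in Problem 9, it can be useful to see which predictors the lasso actually retains at the selected λ (a quick follow-up; output not shown, and the set depends on the seed):
# Names of the predictors with non-zero lasso coefficients at lambda.min
rownames(lasso_coef)[as.vector(lasso_coef != 0)]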
# Principal Components Regression
set.seed(1)
pcr_fit <- pcr(crim ~ ., data = train_data, scale = TRUE, validation = "CV")
# Inspect CV errors for each # of components
summary(pcr_fit)
## Data: X dimension: 354 13
## Y dimension: 354 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.085 6.627 6.619 6.149 6.107 6.117 6.184
## adjCV 8.085 6.624 6.616 6.142 6.099 6.111 6.176
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.159 6.005 6.004 6.00 6.012 6.018 5.974
## adjCV 6.151 5.995 5.987 5.99 6.001 6.006 5.962
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 48.34 61.18 70.44 77.16 83.41 88.28 91.51 93.81
## crim 33.34 33.76 43.13 44.14 44.15 44.20 44.63 47.57
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.52 97.14 98.48 99.50 100.00
## crim 48.08 48.12 48.13 48.46 49.37
# Choose # of components that gives minimum CV error
best_comp <- which.min(pcr_fit$validation$PRESS)
# Evaluate test MSE
pcr_pred <- predict(pcr_fit, newdata = test_data, ncomp = best_comp)
pcr_mse <- mean((y_test - pcr_pred)^2)
pcr_mse
## [1] 58.97718
# Comparison of the results
results <- data.frame(
Method = c("Best Subset", "Ridge", "Lasso", "PCR"),
Test_MSE = c(best_subset_mse, ridge_mse, lasso_mse, pcr_mse)
)
results
## Method Test_MSE
## 1 Best Subset 61.86531
## 2 Ridge 60.15076
## 3 Lasso 59.55616
## 4 PCR 58.97718
Interpretation
The numbers tell us that all of the methods do a reasonable job, but there are some small differences. The PCR model comes out on top with the lowest test error (about 59.0), meaning it did the best job of predicting the per capita crime rate. Lasso (≈59.6) and Ridge (≈60.2) follow close behind, while the Best Subset model, at about 61.9, does not do quite as well.
PCR: By turning the predictors into principal components, PCR seems to capture the essential information really well. Even if it’s a bit more abstract (since you lose the original predictor names), it gives the best predictions.
Lasso & Ridge: Both these methods use regularization to prevent overfitting. Lasso has an added bonus of simplifying the model by cutting out some predictors entirely, and it performs almost as well as PCR.
Best Subset: Although it picks out a smaller set of predictors (which can be easier to interpret), it seems to miss some important information compared to the other methods, which shows up as a slightly higher error on unseen data.
The Bottom Line:
All models are fairly close in performance, but if we’re aiming solely for prediction accuracy, PCR has a slight edge. However, if interpretability and simplicity are important for our application, using Lasso or even Best Subset might still be worthwhile.
Part b:
Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
PCR (Principal Components Regression) has the lowest test MSE (≈58.98), so it offers the best out-of-sample predictive performance among these four approaches on this split. One caveat: cross-validation selected M = 13, i.e. all of the components, so the fitted PCR model is essentially a full least-squares fit on standardized predictors rather than a reduced-dimension model, and its edge over ridge (≈60.2) and lasso (≈59.6) is modest.
In short, PCR is a reasonable choice for pure predictive accuracy based on the validation/test error here. If interpretability also matters, Lasso is nearly as accurate while zeroing out some coefficients, and thus yields a simpler model.
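Because this ranking rests on a single 70/30 split, a hedged robustness check is to repeat the split a few times and recompute the errors; the sketch below (illustrative only, five repeats of lasso and PCR, output not shown) indicates whether PCR's small edge persists across splits:
# Repeat the train/test split several times and recompute lasso and PCR test MSEs
set.seed(2)
res <- t(replicate(5, {
  idx <- sample(seq_len(nrow(Boston)), size = floor(0.7 * nrow(Boston)))
  xtr <- model.matrix(crim ~ ., Boston[idx, ])[, -1]
  ytr <- Boston$crim[idx]
  xte <- model.matrix(crim ~ ., Boston[-idx, ])[, -1]
  yte <- Boston$crim[-idx]
  lasso_cv <- cv.glmnet(xtr, ytr, alpha = 1)
  lasso_mse_i <- mean((yte - predict(lasso_cv, newx = xte, s = "lambda.min"))^2)
  pcr_fit_i <- pcr(crim ~ ., data = Boston[idx, ], scale = TRUE, validation = "CV")
  pcr_mse_i <- mean((yte - predict(pcr_fit_i, newdata = Boston[-idx, ],
                                   ncomp = which.min(pcr_fit_i$validation$PRESS)))^2)
  c(lasso = lasso_mse_i, pcr = pcr_mse_i)
}))
res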
Part c:
Does your chosen model involve all of the features in the data set? Why or why not?
Yes, in two senses. First, cross-validation selected M = 13 here, so every principal component is retained and the model uses all of the features. Second, even when only a subset of components is kept, each component is a linear combination of every original predictor, so PCR never drops specific features the way lasso can by setting coefficients exactly to zero; it repackages all of them into a smaller number of derived variables.
We are therefore using every original variable, just not in its original, individual form. This reduces interpretability (you cannot easily say "feature X has coefficient Y"), but it often helps predictive accuracy when the predictors are highly correlated.
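To see how the original predictors enter the retained components, one can inspect the PCR loadings from the fit above (a sketch; output not shown, and since best_comp is all 13 components here this is the full loading matrix):
# Loadings: each column shows how every original predictor contributes to a component
round(pcr_fit$loadings[, 1:best_comp], 2)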