(Homework 5, Chapter 6, questions 2, 9, 11)

Question 2

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

(a) The lasso, relative to least squares, is:

  iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

– Lasso imposes an ℓ₁ penalty that shrinks some coefficients to exactly zero, effectively performing variable selection. This restricts the flexibility of the model compared to least squares. Reduced flexibility tends to increase bias but can significantly reduce variance, particularly when many predictors are irrelevant or correlated. Prediction improves if the gain from lower variance outweighs the increase in bias. (A short simulation after part (b) illustrates the contrast with ridge.)

(b) Repeat (a) for ridge regression relative to least squares.

  iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

– Ridge regression applies an ℓ₂ penalty, which shrinks coefficients toward zero but never exactly to zero. Like the lasso, this introduces bias but reduces variance, especially in the presence of multicollinearity. So again, the key is that the reduction in variance more than compensates for the increase in bias.
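As a quick illustration of (a) and (b), the simulated example below (an illustrative sketch; the data-generating settings are arbitrary assumptions) shows the lasso zeroing out coefficients while ridge only shrinks them.

library(glmnet)

set.seed(1)
n <- 100
p <- 50
x <- matrix(rnorm(n * p), n, p)
beta <- c(rep(2, 5), rep(0, p - 5))          # only the first five predictors matter
y <- drop(x %*% beta) + rnorm(n)

lasso_fit <- cv.glmnet(x, y, alpha = 1)      # l1 penalty (lasso)
ridge_fit <- cv.glmnet(x, y, alpha = 0)      # l2 penalty (ridge)

# The lasso sets many coefficients exactly to zero; ridge shrinks them but keeps all nonzero
sum(coef(lasso_fit, s = "lambda.min")[-1] == 0)
sum(coef(ridge_fit, s = "lambda.min")[-1] == 0)

With most true coefficients equal to zero, the lasso typically reports many exact zeros, while ridge reports none.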

(c) Repeat (a) for non-linear methods relative to least squares.

  ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

– Non-linear methods (like splines, GAMs, or regression trees) are typically more flexible than least squares. This increased flexibility reduces bias but increases variance. These methods improve prediction if the decrease in bias outweighs the increase in variance.
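A small simulated sketch of this trade-off (not part of the assignment; the data-generating setup is an assumption): a smoothing spline captures a non-linear signal that a straight-line fit misses.

set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)
train <- sample(200, 100)

lin_fit <- lm(y ~ x, subset = train)                # straight-line fit (high bias here)
spline_fit <- smooth.spline(x[train], y[train])     # flexible non-linear fit

# Test MSE on the held-out half: the flexible fit should win because the true signal is non-linear
mean((y[-train] - predict(lin_fit, data.frame(x = x[-train])))^2)
mean((y[-train] - predict(spline_fit, x[-train])$y)^2)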

Exercise 6

(a) Consider (6.12) with p = 1. For some choice of y1 and λ > 0, plot (6.12) as a function of β1. Your plot should confirm that (6.12) is solved by (6.14).
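Assuming ISLR's numbering, (6.12) is the one-parameter ridge criterion (y1 − β1)² + λβ1², and its minimizer (6.14) is y1/(1 + λ); a minimal sketch confirming this:

# Ridge-form objective: (y1 - beta1)^2 + lambda * beta1^2, minimized at y1 / (1 + lambda)
y1 <- 1
lambda <- 1
beta1 <- seq(-2, 2, length.out = 200)
ridge_obj <- (y1 - beta1)^2 + lambda * beta1^2

plot(beta1, ridge_obj, type = "l", lwd = 2,
     ylab = "Objective Function",
     xlab = expression(beta[1]),
     main = expression((y[1] - beta[1])^2 + lambda * beta[1]^2))
abline(v = y1 / (1 + lambda), col = "blue", lty = 2)  # closed-form minimizer (6.14)

The two chunks below plot the lasso counterpart, (y1 − β1)² + λ|β1|, and mark its soft-threshold minimizer.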

# Lasso-form objective with a single observation: (y1 - beta1)^2 + lambda * |beta1|
y1 <- 1
lambda <- 1
beta1 <- seq(-2, 2, length.out = 100)
objective <- (y1 - beta1)^2 + lambda * abs(beta1)

plot(beta1, objective, type = "l", lwd = 2,
     ylab = "Objective Function",
     xlab = expression(beta[1]),
     main = expression((y[1] - beta[1])^2 + lambda * group("|", beta[1], "|")))

abline(v = y1 - lambda/2, col = "blue", lty = 2)  # soft-thresholded solution (y1 > lambda/2 here)
abline(v = 0, col = "red", lty = 3)               # zero would be the solution if |y1| <= lambda/2

# Repeat with a different y1 and mark the closed-form soft-threshold minimizer
y1 <- 1.5
lambda <- 1
beta1 <- seq(-2, 3, length.out = 200)
objective <- (y1 - beta1)^2 + lambda * abs(beta1)

plot(beta1, objective, type = "l", lwd = 2,
     ylab = "Objective Function",
     xlab = expression(beta[1]),
     main = expression((y[1] - beta[1])^2 + lambda * group("|", beta[1], "|")))

# Add a vertical line at the soft-thresholded solution
soft_threshold <- if (abs(y1) > lambda / 2) sign(y1) * (abs(y1) - lambda / 2) else 0
abline(v = soft_threshold, col = "blue", lty = 2)
text(soft_threshold, min(objective), labels = "Soft-thresholded min", pos = 3, col = "blue")
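As a numerical check (a small sketch reusing y1, lambda, and soft_threshold from the chunk above), optimize() should locate the same minimizer as the soft-threshold formula:

# Numerical minimum of (y1 - b)^2 + lambda * |b| vs. the closed-form soft threshold
obj <- function(b) (y1 - b)^2 + lambda * abs(b)
optimize(obj, interval = c(-5, 5))$minimum   # should be approximately equal to soft_threshold
soft_threshold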

Exercise 9

library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.4.2
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.4.2
## Loading required package: Matrix
## Loaded glmnet 4.1-8
library(pls)
## Warning: package 'pls' was built under R version 4.4.2
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings

(a) Split the College data into a training set and a test set.

set.seed(1)
data(College)
train_index <- sample(1:nrow(College), nrow(College) / 2)
train_data <- College[train_index, ]
test_data <- College[-train_index, ]

(b) Fit a linear model using least squares on the training set and report the test error obtained.

lm_fit <- lm(Apps ~ ., data = train_data)
lm_pred <- predict(lm_fit, newdata = test_data)
lm_mse <- mean((test_data$Apps - lm_pred)^2)
cat("(b) Linear Model Test MSE:", lm_mse, "\n")
## (b) Linear Model Test MSE: 1135758
# glmnet requires numeric predictor matrices; model.matrix handles factors, [, -1] drops the intercept column
x_train <- model.matrix(Apps ~ ., data = train_data)[, -1]
y_train <- train_data$Apps
x_test <- model.matrix(Apps ~ ., data = test_data)[, -1]
y_test <- test_data$Apps

cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0)
ridge_best_lambda <- cv_ridge$lambda.min
ridge_pred <- predict(cv_ridge, s = ridge_best_lambda, newx = x_test)
ridge_mse <- mean((y_test - ridge_pred)^2)
cat("(c) Ridge Regression Test MSE:", ridge_mse, "\n")
## (c) Ridge Regression Test MSE: 976261.5
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1)
lasso_best_lambda <- cv_lasso$lambda.min
lasso_pred <- predict(cv_lasso, s = lasso_best_lambda, newx = x_test)
lasso_mse <- mean((y_test - lasso_pred)^2)
lasso_coef <- predict(cv_lasso, s = lasso_best_lambda, type = "coefficients")
nonzero_lasso <- sum(lasso_coef != 0) - 1
cat("(d) Lasso Test MSE:", lasso_mse, "\n")
## (d) Lasso Test MSE: 1115901
cat("    Number of non-zero coefficients:", nonzero_lasso, "\n")
##     Number of non-zero coefficients: 17
set.seed(1)
pcr_fit <- pcr(Apps ~ ., data = train_data, scale = TRUE, validation = "CV")
validationplot(pcr_fit, val.type = "MSEP")

pcr_pred <- predict(pcr_fit, newdata = test_data, ncomp = which.min(pcr_fit$validation$PRESS))
pcr_mse <- mean((test_data$Apps - pcr_pred)^2)
cat("(e) PCR Test MSE:", pcr_mse, "\n")
## (e) PCR Test MSE: 1135758
cat("    Optimal number of components (M):", which.min(pcr_fit$validation$PRESS), "\n")
##     Optimal number of components (M): 17
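Because cross-validation selected all 17 components, PCR here reproduces the full least squares fit, which is why its test MSE matches (b). A quick sanity check (a sketch reusing pcr_pred and lm_pred from above; the two prediction vectors should agree up to numerical error):

all.equal(as.numeric(pcr_pred), as.numeric(lm_pred))  # expected TRUE (PCR with all components = OLS)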
set.seed(1)
pls_fit <- plsr(Apps ~ ., data = train_data, scale = TRUE, validation = "CV")
validationplot(pls_fit, val.type = "MSEP")

pls_pred <- predict(pls_fit, newdata = test_data, ncomp = which.min(pls_fit$validation$PRESS))
pls_mse <- mean((test_data$Apps - pls_pred)^2)
cat("(f) PLS Test MSE:", pls_mse, "\n")
## (f) PLS Test MSE: 1135758
cat("    Optimal number of components (M):", which.min(pls_fit$validation$PRESS), "\n")
##     Optimal number of components (M): 17
(g) Comparing the five methods:

– The linear model is simple and easy to interpret, but can overfit when the predictors are highly correlated.
– Ridge shrinks all of the coefficients; it handles multicollinearity well and is an improvement when overfitting is the problem.
– The lasso can set coefficients exactly to zero, which provides automatic variable selection.
– PCR reduces dimensionality before regression, but may ignore relevant predictors that are not aligned with the top principal components.
– PLS also reduces dimensionality, but uses the response when constructing components, which often makes it perform better than PCR.

All of the test MSEs are fairly close, ranging from 976261.5 to 1135758. Interestingly, the linear model, PCR, and PLS give identical test MSEs: cross-validation selected all 17 components, which reproduces the full least squares fit. The main point is that all five methods do reasonably well here.

Overall, there is not a big difference in test performance among the five methods. Ridge gives the lowest test MSE and the lasso is slightly better than least squares, most likely because shrinkage reduces variance in the presence of multicollinearity and less-relevant predictors, while PCR and PLS end up matching least squares because they retain all 17 components.
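For a side-by-side view, a small summary table (a sketch, assuming the MSE objects computed above are still in the workspace):

# Gather the five test MSEs into one table, sorted from best to worst
results <- data.frame(
  Method  = c("Least squares", "Ridge", "Lasso", "PCR", "PLS"),
  TestMSE = c(lm_mse, ridge_mse, lasso_mse, pcr_mse, pls_mse)
)
results[order(results$TestMSE), ]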

Exercise 11

Load the data, set the seed, and create training and test sets.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:ISLR2':
## 
##     Boston
library(leaps)
## Warning: package 'leaps' was built under R version 4.4.2
data(Boston)
set.seed(19)

train_index <- sample(1:nrow(Boston), nrow(Boston) / 2)
train_data <- Boston[train_index, ]
test_data <- Boston[-train_index, ]

x_train <- model.matrix(crim ~ ., data = train_data)[, -1]
y_train <- train_data$crim
x_test <- model.matrix(crim ~ ., data = test_data)[, -1]
y_test <- test_data$crim

Best Subset Selection

regfit_full <- regsubsets(crim ~ ., data = train_data, nvmax = 13)
val_errors <- rep(NA, 13)

# regsubsets() has no predict() method, so form test-set predictions by hand for each model size
for (i in 1:13) {
  coefi <- coef(regfit_full, id = i)
  pred <- as.matrix(cbind(1, x_test[, names(coefi)[-1]])) %*% coefi
  val_errors[i] <- mean((y_test - pred)^2)
}
best_subset_mse <- min(val_errors)
best_subset_model_size <- which.min(val_errors)

cat("Best Subset Selection Test MSE:", best_subset_mse, "\n")
## Best Subset Selection Test MSE: 46.40334
cat("Optimal number of predictors:", best_subset_model_size, "\n")
## Optimal number of predictors: 8

Ridge Regression

cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0)
ridge_pred <- predict(cv_ridge, s = cv_ridge$lambda.min, newx = x_test)
ridge_mse <- mean((y_test - ridge_pred)^2)
cat("Ridge Regression Test MSE:", ridge_mse, "\n")
## Ridge Regression Test MSE: 47.37969

Lasso

cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1)
lasso_pred <- predict(cv_lasso, s = cv_lasso$lambda.min, newx = x_test)
lasso_mse <- mean((y_test - lasso_pred)^2)
lasso_coef <- predict(cv_lasso, s = cv_lasso$lambda.min, type = "coefficients")
nonzero_lasso <- sum(lasso_coef != 0) - 1
cat("Lasso Test MSE:", lasso_mse, "\n")
## Lasso Test MSE: 48.09659
cat("Number of non-zero coefficients (Lasso):", nonzero_lasso, "\n")
## Number of non-zero coefficients (Lasso): 4

PCR

pcr_fit <- pcr(crim ~ ., data = train_data, scale = TRUE, validation = "CV")
validationplot(pcr_fit, val.type = "MSEP")

opt_pcr <- which.min(pcr_fit$validation$PRESS)
pcr_pred <- predict(pcr_fit, test_data, ncomp = opt_pcr)
pcr_mse <- mean((test_data$crim - pcr_pred)^2)
cat("PCR Test MSE:", pcr_mse, "\n")
## PCR Test MSE: 47.88692
cat("Optimal number of components (PCR):", opt_pcr, "\n")
## Optimal number of components (PCR): 8

(b)

The best subset model (eight predictors) has the lowest test MSE (46.40), followed by ridge (47.38), PCR (47.89), and the lasso (48.10). All four methods are close, but the lasso also provides automatic variable selection, keeping only four predictors.

The lasso therefore offers a good balance of prediction accuracy and model simplicity.
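For reference, a small comparison table (a sketch, assuming the Boston MSE objects computed above are still in the workspace):

# Collect the Boston test MSEs for comparison, sorted from best to worst
boston_results <- data.frame(
  Method  = c("Best subset", "Ridge", "Lasso", "PCR"),
  TestMSE = c(best_subset_mse, ridge_mse, lasso_mse, pcr_mse)
)
boston_results[order(boston_results$TestMSE), ]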

(c)

No. The lasso model uses only four of the thirteen predictors: the ℓ₁ penalty shrinks the remaining coefficients exactly to zero. This reduces model complexity, guards against overfitting, and focuses the model on the variables that contribute most to predicting the per-capita crime rate.
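To see exactly which predictors were retained, the nonzero entries of lasso_coef (from the lasso chunk above) can be printed; a sketch is below (the particular variables kept can change with the seed and CV folds):

# Nonzero lasso coefficients at lambda.min, excluding the intercept
lasso_coef_mat <- as.matrix(lasso_coef)
lasso_coef_mat[lasso_coef_mat[, 1] != 0 & rownames(lasso_coef_mat) != "(Intercept)", , drop = FALSE]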