(Homework 5, Chapter 6, questions 2, 9, 11)
Question 2
For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
(a) The lasso, relative to least squares, is:
– Answer: iii (less flexible, with improved prediction accuracy when the increase in bias is smaller than the decrease in variance). The lasso imposes an ℓ₁ penalty that shrinks some coefficients to exactly zero, effectively performing variable selection. This restricts the flexibility of the model compared to least squares. Reduced flexibility tends to increase bias but can substantially reduce variance, particularly when many predictors are irrelevant or correlated. Prediction improves whenever the gain from lower variance outweighs the increase in bias (see the simulation sketch after part (c)).
(b) Repeat (a) for ridge regression relative to least squares.
– Answer: iii again. Ridge regression applies an ℓ₂ penalty that shrinks coefficients toward zero (but not exactly to zero). Like the lasso, this introduces bias but reduces variance, especially in the presence of multicollinearity, so the key is again that the reduction in variance outweighs the increase in bias.
(c) Repeat (a) for non-linear methods relative to least squares.
– Answer: ii (more flexible, with improved prediction accuracy when the increase in variance is smaller than the decrease in bias). Non-linear methods (such as splines, GAMs, or regression trees) are typically more flexible than least squares. This increased flexibility reduces bias but increases variance, so these methods improve prediction when the decrease in bias outweighs the increase in variance.
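The argument in parts (a) and (b) can be illustrated with a minimal simulation sketch (not part of the assignment; the sample sizes and the sparse coefficient vector below are arbitrary choices for illustration): when most predictors are irrelevant, least squares has low bias but high variance, while the lasso trades a little bias for a larger variance reduction.
# Minimal sketch: compare test MSE of least squares and the lasso
# when only 5 of 50 predictors actually matter (illustrative values).
library(glmnet)
set.seed(1)
n <- 100; p <- 50
beta_true <- c(rep(2, 5), rep(0, p - 5))       # sparse true coefficients
x_tr <- matrix(rnorm(n * p), n, p)
x_te <- matrix(rnorm(n * p), n, p)
y_tr <- drop(x_tr %*% beta_true) + rnorm(n)
y_te <- drop(x_te %*% beta_true) + rnorm(n)
ols_fit  <- lm(y_tr ~ x_tr)                    # least squares on all 50 predictors
ols_pred <- drop(cbind(1, x_te) %*% coef(ols_fit))
cv_las   <- cv.glmnet(x_tr, y_tr, alpha = 1)   # lasso, lambda chosen by CV
las_pred <- predict(cv_las, newx = x_te, s = "lambda.min")
c(OLS   = mean((y_te - ols_pred)^2),           # lasso usually has lower test MSE here
  Lasso = mean((y_te - las_pred)^2))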
Exercise 6
(a) Consider (6.12) with p = 1. For some choice of y1 and λ > 0, plot (6.12) as a function of β1. Your plot should confirm that (6.12) is solved by (6.14).
y1 <- 1
lambda <- 1
beta1 <- seq(-2, 2, length.out = 100)
# Ridge objective (6.12) with p = 1: (y1 - beta1)^2 + lambda * beta1^2
objective <- (y1 - beta1)^2 + lambda * beta1^2
plot(beta1, objective, type = "l", lwd = 2,
     ylab = "Objective Function",
     xlab = expression(beta[1]),
     main = expression((y[1] - beta[1])^2 + lambda %.% beta[1]^2))
# (6.14): the ridge objective is minimized at beta1 = y1 / (1 + lambda)
abline(v = y1 / (1 + lambda), col = "blue", lty = 2)
(b) Consider (6.13) with p = 1. For some choice of y1 and λ > 0, plot (6.13) as a function of β1. Your plot should confirm that (6.13) is solved by (6.15).
y1 <- 1.5
lambda <- 1
beta1 <- seq(-2, 3, length.out = 200)
# Lasso objective (6.13) with p = 1: (y1 - beta1)^2 + lambda * |beta1|
objective <- (y1 - beta1)^2 + lambda * abs(beta1)
plot(beta1, objective, type = "l", lwd = 2,
     ylab = "Objective Function",
     xlab = expression(beta[1]),
     main = expression((y[1] - beta[1])^2 + lambda %.% group("|", beta[1], "|")))
# (6.15): the lasso objective is minimized at the soft-thresholded value of y1
soft_threshold <- if (abs(y1) > lambda / 2) sign(y1) * (abs(y1) - lambda / 2) else 0
abline(v = soft_threshold, col = "blue", lty = 2)
text(soft_threshold, min(objective), labels = "Soft-thresholded minimum", pos = 3, col = "blue")
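As a quick check on the vertical lines in the two plots, both minimizers can be derived directly from the one-parameter objectives (a short derivation, using the ISLR equation numbers):
$$
\frac{d}{d\beta_1}\left[(y_1-\beta_1)^2+\lambda\beta_1^2\right]=-2(y_1-\beta_1)+2\lambda\beta_1=0
\;\Rightarrow\;\hat\beta_1^{R}=\frac{y_1}{1+\lambda}\qquad\text{(6.14)}
$$
$$
\hat\beta_1^{L}=\begin{cases}
y_1-\lambda/2 & \text{if } y_1>\lambda/2,\\
y_1+\lambda/2 & \text{if } y_1<-\lambda/2,\\
0 & \text{if } |y_1|\le\lambda/2,
\end{cases}\qquad\text{(6.15)}
$$
which is exactly the soft-threshold rule used in the code above.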
Exercise 9
library(ISLR2)
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
(a) Split the data set into a training set and a test set.
set.seed(1)
data(College)
train_index <- sample(1:nrow(College), nrow(College) / 2)
train_data <- College[train_index, ]
test_data <- College[-train_index, ]
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
lm_fit <- lm(Apps ~ ., data = train_data)
lm_pred <- predict(lm_fit, newdata = test_data)
lm_mse <- mean((test_data$Apps - lm_pred)^2)
cat("(b) Linear Model Test MSE:", lm_mse, "\n")
## (b) Linear Model Test MSE: 1135758
# Build design matrices for glmnet; drop the intercept column from model.matrix()
x_train <- model.matrix(Apps ~ ., data = train_data)[, -1]
y_train <- train_data$Apps
x_test <- model.matrix(Apps ~ ., data = test_data)[, -1]
y_test <- test_data$Apps
# Ridge regression: alpha = 0 in glmnet, with lambda chosen by cross-validation
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0)
ridge_best_lambda <- cv_ridge$lambda.min
ridge_pred <- predict(cv_ridge, s = ridge_best_lambda, newx = x_test)
ridge_mse <- mean((y_test - ridge_pred)^2)
cat("(c) Ridge Regression Test MSE:", ridge_mse, "\n")
## (c) Ridge Regression Test MSE: 976261.5
# The lasso: alpha = 1 in glmnet, with lambda chosen by cross-validation
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1)
lasso_best_lambda <- cv_lasso$lambda.min
lasso_pred <- predict(cv_lasso, s = lasso_best_lambda, newx = x_test)
lasso_mse <- mean((y_test - lasso_pred)^2)
lasso_coef <- predict(cv_lasso, s = lasso_best_lambda, type = "coefficients")
nonzero_lasso <- sum(lasso_coef != 0) - 1  # subtract 1 to exclude the intercept
cat("(d) Lasso Test MSE:", lasso_mse, "\n")
## (d) Lasso Test MSE: 1115901
cat(" Number of non-zero coefficients:", nonzero_lasso, "\n")
## Number of non-zero coefficients: 17
set.seed(1)
pcr_fit <- pcr(Apps ~ ., data = train_data, scale = TRUE, validation = "CV")
validationplot(pcr_fit, val.type = "MSEP")
pcr_pred <- predict(pcr_fit, newdata = test_data, ncomp = which.min(pcr_fit$validation$PRESS))
pcr_mse <- mean((test_data$Apps - pcr_pred)^2)
cat("(e) PCR Test MSE:", pcr_mse, "\n")
## (e) PCR Test MSE: 1135758
cat(" Optimal number of components (M):", which.min(pcr_fit$validation$PRESS), "\n")
## Optimal number of components (M): 17
set.seed(1)
pls_fit <- plsr(Apps ~ ., data = train_data, scale = TRUE, validation = "CV")
validationplot(pls_fit, val.type = "MSEP")
pls_pred <- predict(pls_fit, newdata = test_data, ncomp = which.min(pls_fit$validation$PRESS))
pls_mse <- mean((test_data$Apps - pls_pred)^2)
cat("(f) PLS Test MSE:", pls_mse, "\n")
## (f) PLS Test MSE: 1135758
cat(" Optimal number of components (M):", which.min(pls_fit$validation$PRESS), "\n")
## Optimal number of components (M): 17
– The linear model is simple and easy to interpret, but it can overfit when the predictors are highly correlated.
– Ridge shrinks all of the coefficients, handles multicollinearity, and is an improvement when overfitting is the problem.
– The lasso can set coefficients exactly to zero, which helps with variable selection.
– PCR reduces the dimensionality before regression, and may ignore relevant predictors if they are not aligned with the top principal components.
– PLS also reduces dimensionality, but it uses the response when constructing the components, which often makes it better than PCR.
All of the test MSEs are fairly close, ranging from 976,261.5 to 1,135,758. The least squares, PCR, and PLS results are identical because cross-validation chose M = 17 components for both PCR and PLS; using every component reproduces the full least squares fit. The main point is that all of the methods predict Apps reasonably well.
Overall, there is not a big difference in test performance across the five methods. Ridge (and, to a lesser extent, the lasso) slightly outperforms the others, most likely because shrinkage reduces variance in the presence of multicollinearity and irrelevant predictors, while PCR and PLS with M = 17 offer no improvement over least squares here (see the summary table below).
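For reference, the five test errors computed above can be gathered into a single table; this only reuses the *_mse objects already created in this section.
# Summarize the Exercise 9 test MSEs computed above, sorted from best to worst
mse_summary <- data.frame(
  Method  = c("Least squares", "Ridge", "Lasso", "PCR", "PLS"),
  TestMSE = c(lm_mse, ridge_mse, lasso_mse, pcr_mse, pls_mse)
)
mse_summary[order(mse_summary$TestMSE), ]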
Exercise 11
Load the data, set the seed, and create the training and test sets.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:ISLR2':
##
## Boston
library(leaps)
data(Boston)
set.seed(19)
train_index <- sample(1:nrow(Boston), nrow(Boston) / 2)
train_data <- Boston[train_index, ]
test_data <- Boston[-train_index, ]
x_train <- model.matrix(crim ~ ., data = train_data)[, -1]
y_train <- train_data$crim
x_test <- model.matrix(crim ~ ., data = test_data)[, -1]
y_test <- test_data$crim
Best Subset Selection
regfit_full <- regsubsets(crim ~ ., data = train_data, nvmax = 13)
# regsubsets() has no predict() method, so compute the test MSE for each model
# size by multiplying the test design matrix by the selected coefficients.
val_errors <- rep(NA, 13)
for (i in 1:13) {
  coefi <- coef(regfit_full, id = i)
  pred <- cbind(1, x_test[, names(coefi)[-1], drop = FALSE]) %*% coefi
  val_errors[i] <- mean((y_test - pred)^2)
}
best_subset_mse <- min(val_errors)
best_subset_model_size <- which.min(val_errors)
cat("Best Subset Selection Test MSE:", best_subset_mse, "\n")
## Best Subset Selection Test MSE: 46.40334
cat("Optimal number of predictors:", best_subset_model_size, "\n")
## Optimal number of predictors: 8
Ridge Regression
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0)
ridge_pred <- predict(cv_ridge, s = cv_ridge$lambda.min, newx = x_test)
ridge_mse <- mean((y_test - ridge_pred)^2)
cat("Ridge Regression Test MSE:", ridge_mse, "\n")
## Ridge Regression Test MSE: 47.37969
Lasso
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1)
lasso_pred <- predict(cv_lasso, s = cv_lasso$lambda.min, newx = x_test)
lasso_mse <- mean((y_test - lasso_pred)^2)
lasso_coef <- predict(cv_lasso, s = cv_lasso$lambda.min, type = "coefficients")
nonzero_lasso <- sum(lasso_coef != 0) - 1
cat("Lasso Test MSE:", lasso_mse, "\n")
## Lasso Test MSE: 48.09659
cat("Number of non-zero coefficients (Lasso):", nonzero_lasso, "\n")
## Number of non-zero coefficients (Lasso): 4
PCR
pcr_fit <- pcr(crim ~ ., data = train_data, scale = TRUE, validation = "CV")
validationplot(pcr_fit, val.type = "MSEP")
opt_pcr <- which.min(pcr_fit$validation$PRESS)
pcr_pred <- predict(pcr_fit, test_data, ncomp = opt_pcr)
pcr_mse <- mean((test_data$crim - pcr_pred)^2)
cat("PCR Test MSE:", pcr_mse, "\n")
## PCR Test MSE: 47.88692
cat("Optimal number of components (PCR):", opt_pcr, "\n")
## Optimal number of components (PCR): 8
(b)
The best subset model (8 predictors) has the lowest test MSE (46.40), followed by ridge (47.38) and PCR (47.89), with the lasso (48.10) slightly behind; the differences between the methods are small.
The lasso provides a good balance of prediction accuracy and model simplicity, since it comes close to the best test error while using only four predictors (see the table below).
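The table below gathers the four test errors computed above; it reuses the objects from this section (note that ridge_mse, lasso_mse, and pcr_mse were overwritten with the Boston results).
# Summarize the Exercise 11 (Boston, crim) test MSEs computed above
data.frame(
  Method  = c("Best subset (8 predictors)", "Ridge", "Lasso (4 predictors)", "PCR (M = 8)"),
  TestMSE = c(best_subset_mse, ridge_mse, lasso_mse, pcr_mse)
)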
(c)
The chosen lasso model does not involve all of the features: it keeps only four variables and shrinks the remaining coefficients exactly to zero. This reduces model complexity, guards against overfitting, and focuses the model on the predictors that contribute most to predicting the crime rate; the snippet below lists the retained variables.
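To see exactly which predictors the chosen lasso model keeps, the non-zero entries of lasso_coef (computed above) can be printed; the particular four variables depend on the train/test split and the seed.
# List the intercept and the predictors with non-zero lasso coefficients
lasso_coef_mat <- as.matrix(lasso_coef)
lasso_coef_mat[lasso_coef_mat[, 1] != 0, , drop = FALSE]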