Assignment 5

STA6543

Author

Stephen Garcia (wqr974)

Published

July 7, 2025

Chapter 6

Linear Model Selection and Regularization

Conceptual Exercises:

Problem 2

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
(a)
The lasso, relative to least squares, is:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

Response:
The correct answer is iii. The lasso adds an L1 penalty to the least squares objective, which reduces model flexibility by shrinking the coefficient estimates and setting some of them exactly to zero. This introduces bias but also lowers variance, so when the reduction in variance outweighs the added bias, the lasso gives better prediction accuracy than least squares.
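To make the flexibility comparison concrete, the two objective functions can be written side by side (standard textbook notation, added here for reference rather than quoted from the assignment):

$$\hat{\beta}^{\text{OLS}} = \underset{\beta}{\arg\min}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2$$

$$\hat{\beta}^{\text{lasso}} = \underset{\beta}{\arg\min}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^{p}\lvert\beta_j\rvert$$

As λ increases, the L1 penalty constrains the coefficients more tightly and forces some of them to exactly zero, so the fit is less flexible than the least squares fit: bias rises while variance falls.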

Applied Exercises:

Problem 9

In this exercise, we will predict the number of applications received using the other variables in the College data set.

(a)
Split the data set into a training set and a test set.

options(repos = c(CRAN = "https://cran.rstudio.com"))

if (!requireNamespace("pls", quietly = TRUE)) {
    install.packages("pls")
}

library(ISLR)
library(caret)

# Set seed for reproducibility
set.seed(42)

# Load the data
data(College)

# Create a training/test split (70/30)
train_index <- createDataPartition(College$Apps, p = 0.7, list = FALSE)
train_data <- College[train_index, ]
test_data <- College[-train_index, ]

(b)
Fit a linear model using least squares on the training set, and report the test error obtained.

set.seed(42)

# Set up training control (no resampling since we already split manually)
train_control <- trainControl(method = "none")

# Train linear model using caret
lm_model <- train(Apps ~ .,
    data = train_data,
    method = "lm",
    trControl = train_control
)

# Predict on the test set
lm_preds <- predict(lm_model, newdata = test_data)

# Compute test RMSE
lm_rmse <- RMSE(lm_preds, test_data$Apps)
lm_rmse
[1] 1179.198

(c)
Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

library(glmnet)

# Set up cross-validation
set.seed(42)
# Confirm there are no missing values before cross-validation
sum(is.na(College))
[1] 0
train_control <- trainControl(method = "cv", number = 10)

# Train ridge regression model (alpha = 0 for ridge)
ridge_model <- train(Apps ~ .,
    data = train_data,
    method = "glmnet",
    tuneGrid = expand.grid(
        alpha = 0,
        lambda = 10^seq(10, -5, length = 1000)
    ),
    trControl = train_control
)
Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
: There were missing values in resampled performance measures.
# Best lambda
ridge_model$bestTune
    alpha   lambda
501     0 321.7418
# Predict on the test set
ridge_preds <- predict(ridge_model, newdata = test_data)

# Compute test RMSE
ridge_rmse <- RMSE(ridge_preds, test_data$Apps)
ridge_rmse
[1] 1146.797

(d)
Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

# Set up cross-validation
set.seed(42)
train_control <- trainControl(method = "cv", number = 10)

# Train lasso model (alpha = 1 for lasso)
lasso_model <- train(Apps ~ .,
    data = train_data,
    method = "glmnet",
    tuneGrid = expand.grid(
        alpha = 1,
        lambda = 10^seq(10, -5, length = 1000)
    ),
    trControl = train_control
)
Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
: There were missing values in resampled performance measures.
# Best lambda chosen by cross-validation
lasso_model$bestTune
    alpha  lambda
434     1 31.7323
# Predict on the test set
lasso_preds <- predict(lasso_model, newdata = test_data)

# Compute test RMSE
lasso_rmse <- RMSE(lasso_preds, test_data$Apps)
lasso_rmse
[1] 1201.174
# Extract coefficients at best lambda
best_lambda <- lasso_model$bestTune$lambda
lasso_coefs <- coef(lasso_model$finalModel, s = best_lambda)

# Count number of non-zero coefficients (excluding intercept)
nonzero_coef_count <- sum(lasso_coefs != 0) - 1
nonzero_coef_count
[1] 12

(e)
Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

library(pls)
Warning: package 'pls' was built under R version 4.4.3
# Set up cross-validation
set.seed(42)
train_control <- trainControl(method = "cv", number = 10)

# Train PCR model using caret
pcr_model <- train(Apps ~ .,
    data = train_data,
    method = "pcr",
    trControl = train_control,
    tuneLength = 20,
    preProcess = c("center", "scale")
)

# Best number of components (M)
best_M <- pcr_model$bestTune$ncomp
best_M
[1] 16
# Predict on test set
pcr_preds <- predict(pcr_model, newdata = test_data)

# Compute test RMSE
pcr_rmse <- RMSE(pcr_preds, test_data$Apps)
pcr_rmse
[1] 1191.296
plot(pcr_model)

(f)
Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

# Set up cross-validation
set.seed(42)
train_control <- trainControl(method = "cv", number = 10)

# Train PLS model
pls_model <- train(Apps ~ .,
    data = train_data,
    method = "pls",
    trControl = train_control,
    tuneLength = 20, # Try up to 20 components
    preProcess = c("center", "scale")
)

# Best number of components (M)
best_M_pls <- pls_model$bestTune$ncomp
best_M_pls
[1] 13
# Predict on the test set
pls_preds <- predict(pls_model, newdata = test_data)

# Compute test RMSE
pls_rmse <- RMSE(pls_preds, test_data$Apps)
pls_rmse
[1] 1177.479
plot(pls_model)

(g)
Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

Response:


Model               Test RMSE   Notes
Linear Regression   1179.20
Ridge Regression    1146.80     λ = 321.74
Lasso Regression    1201.17     λ = 31.73, sparse model
PCR                 1191.30     M = 16 components
PLS                 1177.48     M = 13 components
# Actual values
y_actual <- test_data$Apps
ridge_r2 <- 1 - sum((y_actual - ridge_preds)^2) / sum((y_actual - mean(y_actual))^2)
ridge_r2
[1] 0.9062702

All five models produced similar test RMSEs, with ridge regression performing best (RMSE = 1146.80). Ridge offered a modest improvement over least squares, PLS was essentially on par with least squares, and lasso and PCR were slightly worse, with the lasso likely paying a price for excluding some useful predictors. Overall, ridge and PLS strike the best balance between complexity and predictive accuracy here. The test-set R^2 for the ridge model is approximately 0.91, meaning it explains about 91% of the variance in the number of applications for the held-out colleges.
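As a quick follow-up sketch (reusing the prediction vectors and y_actual already computed above; the helper function r2 is an illustrative name, not part of the original output), the same traditional R^2 formula can be applied to every model for a side-by-side comparison:

# Traditional test-set R^2 for all five College models
r2 <- function(pred, actual) {
    1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2)
}
sapply(
    list(lm = lm_preds, ridge = ridge_preds, lasso = lasso_preds,
         pcr = pcr_preds, pls = pls_preds),
    r2, actual = y_actual
)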

Problem 11

We will now try to predict per capita crime rate in the Boston data set.
(a)
Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

library(MASS)
library(leaps)

# Load Boston data
data(Boston)

# Log-transform crime rate for better modeling (optional, depending on skew)
# Boston$crim <- log(Boston$crim + 1)  # uncomment if using log transformation

# Split into training and test sets (70/30 split)
set.seed(123)
train_index <- createDataPartition(Boston$crim, p = 0.7, list = FALSE)
train_data <- Boston[train_index, ]
test_data  <- Boston[-train_index, ]
# Subset selection via a sequential-replacement search (method = "leapSeq" from leaps, via caret)
set.seed(123)
best_subset <- train(crim ~ ., 
                     data = train_data,
                     method = "leapSeq",
                     tuneGrid = data.frame(nvmax = 1:13),
                     trControl = trainControl(method = "cv", number = 10))

# Best model size
best_subset$bestTune
   nvmax
10    10
# Predict on test set
best_preds <- predict(best_subset, newdata = test_data)

# Test RMSE
best_rmse <- RMSE(best_preds, test_data$crim)
best_rmse
[1] 7.534746
set.seed(123)
ridge_model <- train(crim ~ ., 
                     data = train_data,
                     method = "glmnet",
                     tuneGrid = expand.grid(alpha = 0,
                                            lambda = 10^seq(4, -2, length = 100)),
                     trControl = trainControl(method = "cv", number = 10),
                     preProcess = c("center", "scale"))
Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
: There were missing values in resampled performance measures.
# Best lambda
ridge_model$bestTune
   alpha   lambda
36     0 1.321941
# Predict + RMSE
ridge_preds <- predict(ridge_model, newdata = test_data)
ridge_rmse <- RMSE(ridge_preds, test_data$crim)
ridge_rmse
[1] 7.439455
set.seed(123)
lasso_model <- train(crim ~ ., 
                     data = train_data,
                     method = "glmnet",
                     tuneGrid = expand.grid(alpha = 1,
                                            lambda = 10^seq(4, -2, length = 100)),
                     trControl = trainControl(method = "cv", number = 10),
                     preProcess = c("center", "scale"))
Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
: There were missing values in resampled performance measures.
# Best lambda
lasso_model$bestTune
   alpha    lambda
31     1 0.6579332
# Predict + RMSE
lasso_preds <- predict(lasso_model, newdata = test_data)
lasso_rmse <- RMSE(lasso_preds, test_data$crim)
lasso_rmse
[1] 7.561412
# Number of non-zero coefficients
lasso_coef <- coef(lasso_model$finalModel, s = lasso_model$bestTune$lambda)
nonzero_lasso <- sum(lasso_coef != 0) - 1
nonzero_lasso
[1] 5
set.seed(123)
pcr_model <- train(crim ~ ., 
                   data = train_data,
                   method = "pcr",
                   trControl = trainControl(method = "cv", number = 10),
                   tuneLength = 13,
                   preProcess = c("center", "scale"))

# Best number of components
pcr_model$bestTune
  ncomp
8     8
# Predict + RMSE
pcr_preds <- predict(pcr_model, newdata = test_data)
pcr_rmse <- RMSE(pcr_preds, test_data$crim)
pcr_rmse
[1] 7.462283

Response:


Model                        Test RMSE   Notes
Subset Selection (leapSeq)   7.53        10 predictors selected
Ridge Regression             7.44        λ = 1.32
Lasso Regression             7.56        λ = 0.66, 5 non-zero predictors
PCR                          7.46        M = 8 components

All four models produced similar test RMSEs in the range of 7.4 to 7.6 when predicting per capita crime rate. Ridge regression had the lowest test RMSE (7.44), suggesting that moderate regularization improved generalization. PCR performed similarly well (7.46), while the lasso slightly underperformed despite yielding a sparse model with only 5 predictors. Subset selection via leapSeq also performed reasonably well but did not outperform the regularized or dimension-reduced approaches. Overall, ridge and PCR offer the best trade-off between accuracy and model complexity in this case.
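To see which predictors the lasso actually retained, the coefficient object extracted above can be inspected (a short sketch; the specific variable names are not reproduced here):

# Names of the non-zero lasso coefficients, excluding the intercept
lasso_vec <- lasso_coef[, 1]   # named numeric vector of coefficients
setdiff(names(lasso_vec)[lasso_vec != 0], "(Intercept)")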

(b)
Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.

Response:
Based on 10-fold cross-validation and test set evaluation, ridge regression and principal components regression (PCR) performed best for predicting per capita crime rate (crim) in the Boston dataset. Ridge regression achieved the lowest test RMSE (7.44), closely followed by PCR (7.46), both outperforming subset selection via leapSeq (7.53) and lasso regression (7.56).

Ridge regression is a strong choice because it retains all predictors while shrinking their coefficients, which helps manage multicollinearity without discarding useful information. PCR also performed well by reducing dimensionality through principal components, which can filter out noise and improve prediction accuracy.
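If desired, the amount of predictor variance captured by the retained components can be checked directly. The sketch below assumes the pcr_model object from part (a) is still in the workspace and uses explvar() from the pls package (loaded earlier):

# Percent of predictor (X) variance explained by each component in the final PCR fit
explvar(pcr_model$finalModel)
# Cumulative percentage across the retained components
cumsum(explvar(pcr_model$finalModel))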

Importantly, all model comparisons were based on validation set RMSE, not training error, ensuring that the results reflect real generalization performance. Given its balance of accuracy and stability, ridge regression is recommended as the most effective model for this prediction task, with PCR as a strong alternative if dimensionality reduction is preferred.

Each model was tuned with 10-fold cross-validation within the training set (λ for ridge and lasso, the number of components for PCR), and final performance was evaluated on a held-out 30% test set. This ensures that the reported RMSE values reflect out-of-sample generalization performance rather than training fit, providing a fair and unbiased basis for comparison.

(c)
Does your chosen model involve all of the features in the data set? Why or why not?

Response:
Yes, the chosen model — ridge regression — includes all predictors in the dataset. Unlike lasso regression or best subset selection, ridge does not perform variable selection; instead, it shrinks all coefficient estimates toward zero without setting any exactly to zero. This allows the model to retain all available information while controlling for multicollinearity and overfitting through regularization.
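A quick way to verify this from the fitted objects (a sketch reusing the Boston ridge_model and lasso_model from part (a); the coefficient object names below are chosen for illustration):

# Coefficients at the cross-validated lambda for each penalized fit
ridge_coef_boston <- coef(ridge_model$finalModel, s = ridge_model$bestTune$lambda)
lasso_coef_boston <- coef(lasso_model$finalModel, s = lasso_model$bestTune$lambda)

# Ridge is expected to keep all 13 predictors non-zero; the lasso kept only 5 (reported above)
sum(ridge_coef_boston != 0) - 1
sum(lasso_coef_boston != 0) - 1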

In this context, using all features is appropriate because the predictors likely contain complementary information about crime rate, and ridge regression's penalty helps mitigate the risk of overfitting that could arise from high variance or correlated variables. The model benefits from the breadth of predictors while remaining stable, since the shrinkage penalty controls the variance of correlated coefficient estimates.