The lasso, relative to least squares
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance: This is incorrect. The lasso is not more flexible than least squares; it's less flexible due to the penalty. Additionally, the phrasing suggests that increased flexibility drives the improvement, which doesn't align with the lasso's mechanism.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias: This is incorrect. The lasso is less flexible, and it reduces variance while increasing bias, not the other way around. The condition described doesn't match the lasso's behavior.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance: This is correct. The lasso is indeed less flexible due to the L1 penalty. It increases bias by shrinking coefficients but decreases variance by stabilizing estimates. Prediction accuracy improves when the increase in bias is outweighed by the decrease in variance, reducing the overall MSE. This aligns with the lasso's typical behavior, especially in settings with many predictors or sparse signals.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias: This is incorrect. The lasso decreases variance, not increases it, and increases bias, not decreases it. The trade-off described here is the opposite of what the lasso does.
Correct choice: iii.
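For all three parts, the trade-off can be kept straight with the standard decomposition of the expected test MSE at a point $x_0$ (writing $\hat{f}$ for the fitted model and $\varepsilon$ for the irreducible error):

$$
E\Big[\big(y_0 - \hat{f}(x_0)\big)^2\Big] \;=\; \operatorname{Var}\big(\hat{f}(x_0)\big) \;+\; \Big[\operatorname{Bias}\big(\hat{f}(x_0)\big)\Big]^2 \;+\; \operatorname{Var}(\varepsilon).
$$

A less flexible method accepts a larger squared bias in exchange for a smaller variance, so it improves on least squares whenever the variance it removes exceeds the squared bias it adds; a more flexible method wins in the opposite situation.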
Ridge regression, relative to least squares
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance: This is incorrect. Ridge regression is not more flexible than least squares; it is less flexible due to the L2 penalty, so the premise of increased flexibility does not apply.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias: This is incorrect. Ridge regression is less flexible, and it decreases variance while increasing bias, not the reverse.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance: This is correct. Ridge regression is less flexible than least squares because the L2 penalty constrains the coefficients. It increases bias by shrinking coefficients toward zero but decreases variance by stabilizing the estimates. Prediction accuracy improves when the increase in bias is outweighed by the decrease in variance, reducing the overall test MSE. This is how ridge regression improves performance, especially in high-variance settings such as multicollinearity.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias: This is incorrect. Ridge regression decreases variance and increases bias, not the other way around; the condition described does not reflect its mechanism.
Correct choice: iii.
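A small simulation sketch (not part of the exercise) makes the ridge trade-off concrete: repeatedly draw training sets from a fixed linear model with correlated predictors, fit least squares and ridge with a fixed penalty, and compare how the coefficient estimates behave across training sets. The settings below (p = 10, n = 50, lambda = 5) are arbitrary choices for illustration.
# Sketch: sampling variability of OLS vs. ridge coefficient estimates
library(glmnet)
set.seed(1)
p <- 10; n <- 50; reps <- 200
beta_true <- rep(1, p)
ols_est <- ridge_est <- matrix(NA, reps, p)
for (r in 1:reps) {
  z <- rnorm(n)                                  # shared factor to induce correlation
  X <- matrix(rnorm(n * p), n, p) + z
  y <- drop(X %*% beta_true) + rnorm(n, sd = 3)
  ols_est[r, ] <- coef(lm(y ~ X))[-1]
  ridge_est[r, ] <- as.vector(coef(glmnet(X, y, alpha = 0, lambda = 5)))[-1]
}
# Ridge estimates vary less across training sets (lower variance) but are
# shrunk below the true value of 1 (higher bias); OLS is unbiased but noisier.
c(ols_var = mean(apply(ols_est, 2, var)),
  ridge_var = mean(apply(ridge_est, 2, var)))
c(ols_bias = mean(colMeans(ols_est)) - 1,
  ridge_bias = mean(colMeans(ridge_est)) - 1)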
Non-linear methods, relative to least squares
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance: This is incorrect. Non-linear methods are more flexible, but they typically decrease bias (by fitting non-linear patterns more closely) and increase variance (through their sensitivity to the training data); this option has the trade-off backwards.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias: This is correct. Non-linear methods are more flexible than least squares because they can model complex relationships. When the true relationship is non-linear, least squares suffers high bias from model misspecification. Non-linear methods reduce this bias by fitting the true pattern more closely, at the cost of higher variance from the added flexibility. Prediction accuracy improves when the decrease in bias outweighs the increase in variance, giving a lower test MSE.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance: This is incorrect. Non-linear methods are more flexible, not less, than least squares, and they decrease bias while increasing variance, so the trade-off described does not apply.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias: This is incorrect. Non-linear methods are more flexible, and while they decrease bias, they increase variance; both the premise of reduced flexibility and the direction of the trade-off are wrong.
Correct choice: ii.
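The same logic can be checked with a toy non-linear example (again only a sketch, not part of the exercise): when the true regression function is a sine curve, a straight-line fit is badly biased, while a smoothing spline pays a little extra variance for a large reduction in bias and ends up with the lower test error.
# Sketch: linear fit vs. smoothing spline when the truth is non-linear
set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)
x_new <- runif(2000, 0, 10)
y_new <- sin(x_new) + rnorm(2000, sd = 0.3)
lin_fit <- lm(y ~ x)
spline_fit <- smooth.spline(x, y)                # flexibility chosen by GCV
lin_mse <- mean((y_new - predict(lin_fit, data.frame(x = x_new)))^2)
spline_mse <- mean((y_new - predict(spline_fit, x_new)$y)^2)
c(linear = lin_mse, spline = spline_mse)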
# Load the ISLR2 library
library(ISLR2)
# Set a seed for reproducibility
set.seed(123)
# Load the College dataset
data(College)
# Create an index for splitting
n <- nrow(College)
train_size <- floor(0.7 * n) # 70% for training
train_idx <- sample(1:n, train_size, replace = FALSE)
# Split into training and test sets
train_data <- College[train_idx, ]
test_data <- College[-train_idx, ]
# Fit a linear model on the training set
lm_model <- lm(Apps ~ ., data = train_data)
# Predict on the test set
predictions <- predict(lm_model, newdata = test_data)
# Compute the test error (Mean Squared Error)
test_error <- mean((test_data$Apps - predictions)^2)
# Report the test error
cat("Test MSE:", test_error, "\n")
## Test MSE: 1734841
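To put this number in context, it can be compared with a mean-only baseline that predicts the average number of applications from the training set for every test college (a quick sanity check, not required by the exercise):
# Baseline: predict the training-set mean of Apps for every test college
baseline_mse <- mean((test_data$Apps - mean(train_data$Apps))^2)
cat("Baseline (mean-only) Test MSE:", baseline_mse, "\n")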
library(glmnet) # For ridge and lasso
## Loading required package: Matrix
## Loaded glmnet 4.1-8
library(pls) # For PCR and PLS
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
# Prepare data for glmnet (matrix of predictors, response vector)
x_train <- model.matrix(Apps ~ ., data = train_data)[, -1] # Exclude intercept
y_train <- train_data$Apps
x_test <- model.matrix(Apps ~ ., data = test_data)[, -1]
y_test <- test_data$Apps
# Fit ridge regression with cross-validation
ridge_cv <- cv.glmnet(x_train, y_train, alpha = 0, nfolds = 10)
# Get the optimal lambda
lambda_ridge <- ridge_cv$lambda.min
cat("Optimal lambda for ridge:", lambda_ridge, "\n")
## Optimal lambda for ridge: 314.2524
# Fit ridge model with optimal lambda
ridge_model <- glmnet(x_train, y_train, alpha = 0, lambda = lambda_ridge)
# Predict on test set
ridge_pred <- predict(ridge_model, s = lambda_ridge, newx = x_test)
# Compute test MSE
ridge_test_mse <- mean((y_test - ridge_pred)^2)
cat("Ridge Test MSE:", ridge_test_mse, "\n")
## Ridge Test MSE: 2979790
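If desired, the cross-validation curve behind this choice can be inspected; cv.glmnet also reports the more conservative lambda.1se (the largest λ within one standard error of the minimum), shown here only as an optional check:
# Optional: inspect the ridge CV curve and the one-standard-error lambda
plot(ridge_cv)
cat("lambda.1se for ridge:", ridge_cv$lambda.1se, "\n")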
# Fit lasso with cross-validation
lasso_cv <- cv.glmnet(x_train, y_train, alpha = 1, nfolds = 10)
# Get the optimal lambda
lambda_lasso <- lasso_cv$lambda.min
cat("Optimal lambda for lasso:", lambda_lasso, "\n")
## Optimal lambda for lasso: 8.154925
# Fit lasso model with optimal lambda
lasso_model <- glmnet(x_train, y_train, alpha = 1, lambda = lambda_lasso)
# Predict on test set
lasso_pred <- predict(lasso_model, s = lambda_lasso, newx = x_test)
# Compute test MSE
lasso_test_mse <- mean((y_test - lasso_pred)^2)
cat("Lasso Test MSE:", lasso_test_mse, "\n")
## Lasso Test MSE: 1740543
# Count non-zero coefficients
lasso_coefs <- coef(lasso_model, s = lambda_lasso)
non_zero_coefs <- sum(lasso_coefs != 0) - 1 # Exclude intercept
cat("Number of non-zero coefficients:", non_zero_coefs, "\n")
## Number of non-zero coefficients: 16
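With 16 non-zero coefficients, the lasso drops only a small number of predictors; it is worth seeing which coefficient(s) were set exactly to zero (reusing the coefficient matrix extracted above):
# Which predictor(s) did the lasso drop?
rownames(lasso_coefs)[as.vector(lasso_coefs) == 0]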
# Fit PCR with cross-validation
pcr_model <- pcr(Apps ~ ., data = train_data, scale = TRUE, validation = "CV")
# Find optimal M (number of components)
cv_errors <- RMSEP(pcr_model)$val[1, , ] # Extract CV errors
M_optimal <- which.min(cv_errors) - 1 # Index starts at 0 components
cat("Optimal number of components (M):", M_optimal, "\n")
## Optimal number of components (M): 16
# Predict on test set with optimal M
pcr_pred <- predict(pcr_model, newdata = test_data, ncomp = M_optimal)
# Compute test MSE
pcr_test_mse <- mean((test_data$Apps - pcr_pred)^2)
cat("PCR Test MSE:", pcr_test_mse, "\n")
## PCR Test MSE: 1853635
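The full cross-validation curve for PCR can also be plotted to check that the error flattens out near the chosen number of components (optional):
# Optional: visualize PCR cross-validation error versus number of components
validationplot(pcr_model, val.type = "MSEP")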
# Fit PLS with cross-validation
pls_model <- plsr(Apps ~ ., data = train_data, scale = TRUE, validation = "CV")
# Find optimal M
cv_errors_pls <- RMSEP(pls_model)$val[1, , ]
M_optimal_pls <- which.min(cv_errors_pls) - 1
cat("Optimal number of components (M):", M_optimal_pls, "\n")
## Optimal number of components (M): 8
# Predict on test set with optimal M
pls_pred <- predict(pls_model, newdata = test_data, ncomp = M_optimal_pls)
# Compute test MSE
pls_test_mse <- mean((test_data$Apps - pls_pred)^2)
cat("PLS Test MSE:", pls_test_mse, "\n")
## PLS Test MSE: 1774522
Accuracy: The test MSEs range from 1,734,841 (least squares) to 2,979,790 (ridge), corresponding to RMSEs of roughly 1,317–1,726 applications. Given that Apps ranges from hundreds to tens of thousands, predictions are moderately accurate but carry notable errors, suggesting room for improvement (e.g., non-linear models).
Comparison: Least squares (1,734,841), lasso (1,740,543), PLS (1,774,522), and PCR (1,853,635) have similar test MSEs, indicating comparable performance. Ridge performs worst (2,979,790), likely because it shrinks every coefficient without setting any to zero and the CV-chosen λ (≈314) over-shrinks the strongest predictors. Apart from ridge the differences are modest, and no single method stands out.
We can predict college applications with moderate accuracy (RMSE ~1,300–1,700). Test errors are similar across methods, with ridge notably worse, suggesting regularization didn’t consistently improve over least squares here.
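The figures quoted above can be collected into a single summary table, converting each test MSE to the RMSE scale used in the discussion (this reuses the objects computed earlier):
# Summary of test errors for the College models
college_results <- data.frame(
  Model = c("Least squares", "Ridge", "Lasso", "PCR", "PLS"),
  TestMSE = c(test_error, ridge_test_mse, lasso_test_mse, pcr_test_mse, pls_test_mse)
)
college_results$TestRMSE <- sqrt(college_results$TestMSE)
college_results[order(college_results$TestMSE), ]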
# Load required libraries
library(ISLR2) # For Boston dataset
library(leaps) # For best subset selection
library(glmnet) # For ridge and lasso
library(pls) # For PCR
# Set seed for reproducibility
set.seed(123)
# Load Boston dataset
data(Boston)
# Split into training and test sets
n <- nrow(Boston)
train_size <- floor(0.7 * n)
train_idx <- sample(1:n, train_size, replace = FALSE)
train_data <- Boston[train_idx, ]
test_data <- Boston[-train_idx, ]
# Best subset selection
best_subset <- regsubsets(crim ~ ., data = train_data, nvmax = 13)
summary_best <- summary(best_subset)
# Use BIC to select model (or adjust based on CV if preferred)
best_model_idx <- which.min(summary_best$bic)
best_predictors <- names(coef(best_subset, id = best_model_idx))[-1]
# Fit selected model
formula_best <- as.formula(paste("crim ~", paste(best_predictors, collapse = "+")))
best_model <- lm(formula_best, data = train_data)
# Predict on test set
best_pred <- predict(best_model, newdata = test_data)
# Compute test MSE
best_test_mse <- mean((test_data$crim - best_pred)^2)
cat("Best Subset Test MSE:", best_test_mse, "\n")
## Best Subset Test MSE: 19.31744
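It is also useful to record which predictors the BIC-selected model actually uses; these were stored above but never printed:
# Size and composition of the BIC-selected subset
cat("BIC selects a model with", best_model_idx, "predictors:\n")
print(best_predictors)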
Ridge Regression
# Prepare data for glmnet
x_train <- model.matrix(crim ~ ., data = train_data)[, -1]
y_train <- train_data$crim
x_test <- model.matrix(crim ~ ., data = test_data)[, -1]
y_test <- test_data$crim
# Fit ridge with cross-validation
ridge_cv <- cv.glmnet(x_train, y_train, alpha = 0, nfolds = 10)
lambda_ridge <- ridge_cv$lambda.min
cat("Ridge Optimal lambda:", lambda_ridge, "\n")
## Ridge Optimal lambda: 0.5863068
# Fit ridge model
ridge_model <- glmnet(x_train, y_train, alpha = 0, lambda = lambda_ridge)
# Predict and compute test MSE
ridge_pred <- predict(ridge_model, s = lambda_ridge, newx = x_test)
ridge_test_mse <- mean((y_test - ridge_pred)^2)
cat("Ridge Test MSE:", ridge_test_mse, "\n")
## Ridge Test MSE: 17.75282
Lasso
# Fit lasso with cross-validation
lasso_cv <- cv.glmnet(x_train, y_train, alpha = 1, nfolds = 10)
lambda_lasso <- lasso_cv$lambda.min
cat("Lasso Optimal lambda:", lambda_lasso, "\n")
## Lasso Optimal lambda: 0.06741104
# Fit lasso model
lasso_model <- glmnet(x_train, y_train, alpha = 1, lambda = lambda_lasso)
# Predict and compute test MSE
lasso_pred <- predict(lasso_model, s = lambda_lasso, newx = x_test)
lasso_test_mse <- mean((y_test - lasso_pred)^2)
cat("Lasso Test MSE:", lasso_test_mse, "\n")
## Lasso Test MSE: 18.3185
# Count non-zero coefficients
lasso_coefs <- coef(lasso_model, s = lambda_lasso)
non_zero_coefs <- sum(lasso_coefs != 0) - 1 # Exclude intercept
cat("Lasso Non-zero Coefficients:", non_zero_coefs, "\n")
## Lasso Non-zero Coefficients: 10
PCR
# Fit PCR with cross-validation
pcr_model <- pcr(crim ~ ., data = train_data, scale = TRUE, validation = "CV")
cv_errors <- RMSEP(pcr_model)$val[1, , ]
M_optimal <- which.min(cv_errors) - 1
cat("PCR Optimal M:", M_optimal, "\n")
## PCR Optimal M: 12
# Predict and compute test MSE
pcr_pred <- predict(pcr_model, newdata = test_data, ncomp = M_optimal)
pcr_test_mse <- mean((test_data$crim - pcr_pred)^2)
cat("PCR Test MSE:", pcr_test_mse, "\n")
## PCR Test MSE: 18.67968
Discussion:
The test MSEs are: ridge (17.75282), lasso (18.31850), PCR (18.67968), and best subset (19.31744). Predictions are moderately accurate (RMSE ~4.2–4.4); the models capture broad trends in crim, but the response is heavily right-skewed, so a few high-crime tracts dominate the squared error. Ridge performs best, leveraging regularization, followed closely by lasso and PCR, with best subset lagging slightly. The differences are small (roughly 5–10%), suggesting these linear methods perform similarly, with ridge and lasso preferred for handling correlated predictors. Non-linear methods or a transformed response might improve results further.
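Two quick checks support this reading: a side-by-side table of the test errors on the RMSE scale, and a summary of crim itself, whose strong right skew means a handful of high-crime tracts dominates the squared error (both reuse objects already computed above):
# Summary of Boston test errors and the skewness of the response
boston_results <- data.frame(
  Model = c("Best subset", "Ridge", "Lasso", "PCR"),
  TestMSE = c(best_test_mse, ridge_test_mse, lasso_test_mse, pcr_test_mse)
)
boston_results$TestRMSE <- sqrt(boston_results$TestMSE)
boston_results[order(boston_results$TestMSE), ]
summary(Boston$crim)   # heavily right-skewed response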
# Load libraries
library(ISLR2)
library(glmnet)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
library(MASS) # For stepAIC in lasso-log model
##
## Attaching package: 'MASS'
## The following object is masked _by_ '.GlobalEnv':
##
## Boston
## The following object is masked from 'package:ISLR2':
##
## Boston
# Set seed
set.seed(123)
# Load and split data
data(Boston)
n <- nrow(Boston)
train_size <- floor(0.7 * n)
train_idx <- sample(1:n, train_size)
train_data <- Boston[train_idx, ]
test_data <- Boston[-train_idx, ]
# 1. Ridge Regression
x_train <- model.matrix(crim ~ ., train_data)[, -1]
y_train <- train_data$crim
x_test <- model.matrix(crim ~ ., test_data)[, -1]
y_test <- test_data$crim
ridge_cv <- cv.glmnet(x_train, y_train, alpha = 0, nfolds = 10)
ridge_model <- glmnet(x_train, y_train, alpha = 0, lambda = ridge_cv$lambda.min)
ridge_pred <- predict(ridge_model, s = ridge_cv$lambda.min, newx = x_test)
ridge_test_mse <- mean((y_test - ridge_pred)^2)
cat("Ridge Test MSE:", ridge_test_mse, "\n")
## Ridge Test MSE: 17.57638
# 2. Lasso with Log(crim)
# Add a small offset before taking logs (crim is strictly positive in Boston, so this is only a safeguard)
train_data$log_crim <- log(train_data$crim + 0.01)
test_data$log_crim <- log(test_data$crim + 0.01)
x_train_log <- model.matrix(log_crim ~ ., train_data)[, -1]
y_train_log <- train_data$log_crim
x_test_log <- model.matrix(log_crim ~ ., test_data)[, -1]
y_test_log <- test_data$log_crim
lasso_log_cv <- cv.glmnet(x_train_log, y_train_log, alpha = 1, nfolds = 10)
lasso_log_model <- glmnet(x_train_log, y_train_log, alpha = 1, lambda = lasso_log_cv$lambda.min)
lasso_log_pred <- predict(lasso_log_model, s = lasso_log_cv$lambda.min, newx = x_test_log)
# Back-transform predictions
lasso_pred_crim <- exp(lasso_log_pred) - 0.01
lasso_log_test_mse <- mean((test_data$crim - lasso_pred_crim)^2)
cat("Lasso-Log Test MSE:", lasso_log_test_mse, "\n")
## Lasso-Log Test MSE: 4.500821
# 3. Random Forest
rf_model <- randomForest(crim ~ ., data = train_data, ntree = 500, mtry = floor(13/3), importance = TRUE)
rf_pred <- predict(rf_model, newdata = test_data)
rf_test_mse <- mean((test_data$crim - rf_pred)^2)
cat("Random Forest Test MSE:", rf_test_mse, "\n")
## Random Forest Test MSE: 2.125633
The random forest achieved the lowest test MSE, so it is my chosen model.
The chosen random forest model involves all 13 features in the Boston dataset. Each tree in the ensemble can draw on every predictor, but only a random subset (mtry) is considered at each split, which decorrelates the trees and keeps the ensemble's variance in check, so no explicit feature selection was needed. Its low test MSE (2.125633) reflects how well this setup captures the correlated and potentially non-linear structure among the predictors. For comparison, ridge also retains all features, while the lasso on the log scale drops some for sparsity, but the random forest's superior test performance justifies keeping them all.
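Because the forest was grown with importance = TRUE, the claim that all predictors contribute can be checked directly from the fitted model via permutation-based variable importance (an optional check):
# Variable importance from the chosen random forest
importance(rf_model)     # %IncMSE and IncNodePurity for each predictor
varImpPlot(rf_model)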