Problem 2

(a)

Lasso is less flexible than least squares because it imposes an L1 penalty, shrinking some coefficients to zero. This adds bias, but can reduce variance, especially when many predictors are irrelevant.
Lasso improves prediction accuracy when the increase in bias is less than the decrease in variance.
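To make the sparsity point concrete, here is a minimal sketch on simulated data (the setup, with 5 relevant and 45 irrelevant predictors, is purely illustrative and not part of the problem):

library(glmnet)
set.seed(42)
n <- 100; p <- 50
X <- matrix(rnorm(n * p), n, p)                 # mostly irrelevant predictors
beta <- c(rep(2, 5), rep(0, p - 5))             # only the first 5 matter
y <- drop(X %*% beta) + rnorm(n)
fit_lasso <- cv.glmnet(X, y, alpha = 1)         # alpha = 1 gives the lasso
# Many coefficients are shrunk exactly to zero at the CV-chosen lambda
sum(coef(fit_lasso, s = "lambda.min") == 0)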


(b)

Ridge regression is also less flexible than least squares because the L2 penalty shrinks the coefficients toward zero but does not set any of them exactly to zero. It improves prediction accuracy when the increase in bias is smaller than the decrease in variance.
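For contrast, a parallel sketch with the L2 penalty (again on illustrative simulated data) shows that ridge shrinks every coefficient but leaves essentially none exactly at zero:

library(glmnet)
set.seed(42)
n <- 100; p <- 50
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% c(rep(2, 5), rep(0, p - 5))) + rnorm(n)
fit_ridge <- cv.glmnet(X, y, alpha = 0)         # alpha = 0 gives ridge
# Ridge shrinks but does not select: expect no exact zeros among the coefficients
sum(coef(fit_ridge, s = "lambda.min") == 0)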


(c)

Non-linear methods are more flexible and tend to fit training data better. They help when their increase in variance is outweighed by the reduction in bias.
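As a rough illustration (simulated data, not part of the problem), a high-degree polynomial is far more flexible than a straight line: it always fits the training data at least as well, and it wins on test data only when the bias it removes outweighs the variance it adds:

set.seed(1)
x <- runif(200, -2, 2)
y <- sin(2 * x) + rnorm(200, sd = 0.3)          # truly non-linear signal
dat <- data.frame(x = x, y = y)
train_i <- 1:100; test_i <- 101:200
fit_lin  <- lm(y ~ x, data = dat[train_i, ])
fit_poly <- lm(y ~ poly(x, 8), data = dat[train_i, ])   # much more flexible
# Training MSE: the flexible fit is never worse
c(lin = mean(residuals(fit_lin)^2), poly = mean(residuals(fit_poly)^2))
# Test MSE: here the reduction in bias should dominate, since the truth is non-linear
c(lin  = mean((predict(fit_lin,  dat[test_i, ]) - dat$y[test_i])^2),
  poly = mean((predict(fit_poly, dat[test_i, ]) - dat$y[test_i])^2))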


Problem 9 – Predicting College Applications

data("College")
College <- College %>% mutate(Private = as.numeric(Private == "Yes"))
set.seed(1)
train_idx <- sample(1:nrow(College), nrow(College)/2)
train <- College[train_idx, ]
test <- College[-train_idx, ]

x_train <- model.matrix(Apps ~ ., data = train)[, -1]
y_train <- train$Apps
x_test <- model.matrix(Apps ~ ., data = test)[, -1]
y_test <- test$Apps

(b)

# Least squares using all predictors; evaluate with test MSE
lm_fit <- lm(Apps ~ ., data = train)
lm_pred <- predict(lm_fit, test)
mean((lm_pred - y_test)^2)
## [1] 1135758

(c)

# Ridge regression (alpha = 0); lambda chosen by 10-fold cross-validation
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0)
ridge_pred <- predict(cv_ridge, s = cv_ridge$lambda.min, newx = x_test)
mean((ridge_pred - y_test)^2)
## [1] 976261.5

(d)

# Lasso (alpha = 1); lambda chosen by 10-fold cross-validation
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1)
lasso_pred <- predict(cv_lasso, s = cv_lasso$lambda.min, newx = x_test)
mean((lasso_pred - y_test)^2)
## [1] 1115901
# Number of non-zero coefficient estimates at lambda.min (the count includes the intercept)
lasso_coef <- predict(cv_lasso, s = cv_lasso$lambda.min, type = "coefficients")
sum(lasso_coef != 0)
## [1] 18

(e)

# Principal components regression on standardized predictors, 10-fold CV
pcr_fit <- pcr(Apps ~ ., data = train, scale = TRUE, validation = "CV")
validationplot(pcr_fit, val.type = "MSEP")

# PRESS is indexed by the number of components, so which.min gives the CV-optimal M
opt_comp <- which.min(pcr_fit$validation$PRESS)
pcr_pred <- predict(pcr_fit, test, ncomp = opt_comp)
mean((pcr_pred - y_test)^2)
## [1] 1135758

(f)

# Partial least squares on standardized predictors, 10-fold CV
pls_fit <- plsr(Apps ~ ., data = train, scale = TRUE, validation = "CV")
validationplot(pls_fit, val.type = "MSEP")

opt_pls <- which.min(pls_fit$validation$PRESS)
pls_pred <- predict(pls_fit, test, ncomp = opt_pls)
mean((pls_pred - y_test)^2)
## [1] 1135758

(g)

All five models give broadly similar test errors on this split, with ridge attaining the lowest test MSE here.
- Lasso lowers the test error relative to least squares and can simplify the model by zeroing out coefficients, although at the CV-chosen lambda all 18 coefficients (intercept plus 17 predictors) remain non-zero here.
- Ridge helps when many predictors are correlated.
- PCR and PLS reduce dimensionality by regressing on a small number of components.
- Least squares may overfit when predictors are numerous or collinear.
**Lasso** is often preferred for its balance of predictive performance and interpretability; the sketch below puts all five fits on a common scale.
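One way to back up this comparison is to report the fraction of test-set variance each model explains; the sketch below reuses the prediction objects computed above (output omitted):

# Test R^2 = 1 - RSS/TSS for each model on the held-out colleges
tss <- mean((y_test - mean(y_test))^2)
sapply(list(ols   = lm_pred,
            ridge = ridge_pred,
            lasso = lasso_pred,
            pcr   = pcr_pred,
            pls   = pls_pred),
       function(pred) 1 - mean((pred - y_test)^2) / tss)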


Problem 11 – Predicting Crime Rate in Boston

data("Boston")
set.seed(1)
train_idx <- sample(1:nrow(Boston), nrow(Boston)*0.7)
train <- Boston[train_idx, ]
test <- Boston[-train_idx, ]

x_train <- model.matrix(crim ~ ., data = train)[, -1]
y_train <- train$crim
x_test <- model.matrix(crim ~ ., data = test)[, -1]
y_test <- test$crim

(a)

# Ridge
cv_ridge_b <- cv.glmnet(x_train, y_train, alpha = 0)
ridge_b_pred <- predict(cv_ridge_b, s = cv_ridge_b$lambda.min, newx = x_test)
ridge_b_mse <- mean((ridge_b_pred - y_test)^2)

# Lasso
cv_lasso_b <- cv.glmnet(x_train, y_train, alpha = 1)
lasso_b_pred <- predict(cv_lasso_b, s = cv_lasso_b$lambda.min, newx = x_test)
lasso_b_mse <- mean((lasso_b_pred - y_test)^2)
lasso_b_coef <- predict(cv_lasso_b, s = cv_lasso_b$lambda.min, type = "coefficients")

# PCR
pcr_b <- pcr(crim ~ ., data = train, scale = TRUE, validation = "CV")
opt_pcr <- which.min(pcr_b$validation$PRESS)
pcr_b_pred <- predict(pcr_b, test, ncomp = opt_pcr)
pcr_b_mse <- mean((pcr_b_pred - y_test)^2)

# MSE Comparison
c(Ridge = ridge_b_mse, Lasso = lasso_b_mse, PCR = pcr_b_mse)
##    Ridge    Lasso      PCR 
## 58.75168 58.06509 57.61252

(b)

Based on the test MSEs above, the three methods perform very similarly on this split, with PCR lowest (about 57.6), followed by lasso (about 58.1) and ridge (about 58.8).
- Lasso often performs well, offering a sparse solution and guarding against overfitting.
- Ridge tends to do better when all predictors contribute, especially when they are correlated.
- PCR can lose predictive power if important variables load weakly on the leading components.


(c)

The lasso model uses fewer variables because the L1 penalty shrinks some coefficients exactly to zero, which improves interpretability and reduces overfitting.
Ridge and PCR keep all the predictors, relying on shrinkage or dimension reduction instead; the short check below lists the predictors the lasso actually retains for the Boston data.
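As a quick check of this claim, the retained predictors can be read off the lasso_b_coef object fitted above (nz is just a temporary name introduced here; output omitted):

# Predictors with non-zero lasso coefficients at lambda.min
nz <- as.matrix(lasso_b_coef)
rownames(nz)[nz[, 1] != 0 & rownames(nz) != "(Intercept)"]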