Question 2

(a) Lasso vs. Least Squares

Correct answer: iii. Compared to least squares, the lasso is less flexible: the L1 penalty shrinks the coefficients and can set some exactly to zero. This reduces variance at the cost of some added bias, so it improves prediction accuracy when the increase in bias is smaller than the decrease in variance.
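For reference, the lasso's penalized criterion (a standard formulation, included here only for context) is

  RSS + λ Σ_j |β_j|

where a larger λ pushes more coefficients exactly to zero.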

(b) Ridge Regression vs. Least Squares

Correct answer: iii. Ridge regression is likewise less flexible than least squares: the L2 penalty shrinks the coefficients toward zero, lowering variance at the cost of some added bias, so it too improves prediction accuracy when the increase in bias is smaller than the decrease in variance.
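Ridge uses the corresponding L2 penalty (again a standard formulation, not part of the original answer):

  RSS + λ Σ_j β_j²

which shrinks coefficients toward zero but does not set them exactly to zero.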

(c) Non-linear Methods vs. Least Squares

Correct answer: ii. Non-linear methods are generally more flexible than least squares: they reduce bias but increase variance, so they improve prediction accuracy when the increase in variance is less than the decrease in bias.
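A minimal simulated sketch (hypothetical data, not part of the original solution) of why a flexible fit can win when the truth is non-linear:

# When the true relationship is non-linear, a flexible fit can lower test MSE
# because the drop in bias outweighs the added variance.
set.seed(1)
x_sim <- runif(200, -2, 2)
y_sim <- sin(2 * x_sim) + rnorm(200, sd = 0.3)
idx <- sample(200, 140)
lin_fit <- lm(y_sim ~ x_sim, subset = idx)
spl_fit <- smooth.spline(x_sim[idx], y_sim[idx])
mean((y_sim[-idx] - predict(lin_fit, data.frame(x_sim = x_sim[-idx])))^2)  # linear test MSE
mean((y_sim[-idx] - predict(spl_fit, x_sim[-idx])$y)^2)                    # spline test MSE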

Question 9 - College Data

(a) Split the data

library(ISLR)   # College data set
library(dplyr)  # for %>% and mutate()
data(College)
College <- College %>% mutate(Private = as.numeric(Private == "Yes"))
# 70/30 train/test split (no seed is set, so results vary across runs)
train_index <- sample(1:nrow(College), nrow(College) * 0.7)
train_data <- College[train_index, ]
test_data <- College[-train_index, ]

(b) Linear Regression

lm_fit <- lm(Apps ~ ., data = train_data)
lm_pred <- predict(lm_fit, newdata = test_data)
lm_mse <- mean((lm_pred - test_data$Apps)^2)
lm_mse
## [1] 1261630

(c) Ridge Regression

library(glmnet)  # ridge and lasso via cv.glmnet()
x_train <- model.matrix(Apps ~ ., train_data)[, -1]  # model matrices without the intercept column
y_train <- train_data$Apps
x_test <- model.matrix(Apps ~ ., test_data)[, -1]
y_test <- test_data$Apps

ridge_cv <- cv.glmnet(x_train, y_train, alpha = 0)  # alpha = 0: ridge penalty
ridge_best_lambda <- ridge_cv$lambda.min
ridge_pred <- predict(ridge_cv, s = ridge_best_lambda, newx = x_test)
ridge_mse <- mean((ridge_pred - y_test)^2)
ridge_mse
## [1] 1121034

(d) Lasso

lasso_cv <- cv.glmnet(x_train, y_train, alpha = 1)  # alpha = 1: lasso penalty
lasso_best_lambda <- lasso_cv$lambda.min
lasso_pred <- predict(lasso_cv, s = lasso_best_lambda, newx = x_test)
lasso_mse <- mean((lasso_pred - y_test)^2)
lasso_coef <- predict(lasso_cv, s = lasso_best_lambda, type = "coefficients")
lasso_nonzero <- sum(lasso_coef != 0)
lasso_mse
## [1] 1254408
lasso_nonzero
## [1] 18

(e) Principal Components Regression (PCR)

library(pls)  # pcr() and plsr()
pcr_model <- pcr(Apps ~ ., data = train_data, scale = TRUE, validation = "CV")
summary(pcr_model)
## Data:    X dimension: 543 17 
##  Y dimension: 543 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3895     3793     2133     2152     1835     1705     1719
## adjCV         3895     3794     2129     2153     1808     1695     1712
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1719     1670     1646      1639      1642      1651      1646
## adjCV     1713     1658     1639      1632      1635      1643      1639
##        14 comps  15 comps  16 comps  17 comps
## CV         1647      1621      1216      1197
## adjCV      1640      1603      1205      1185
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X      32.051    57.00    64.42    70.27    75.65    80.65    84.26    87.61
## Apps    5.788    71.69    71.70    80.97    82.60    82.60    82.69    84.06
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.58     92.84     94.93     96.74     97.82     98.72     99.39
## Apps    84.55     84.82     84.86     84.86     85.01     85.05     89.81
##       16 comps  17 comps
## X        99.85    100.00
## Apps     93.03     93.32
validationplot(pcr_model, val.type = "MSEP")

# Predict with the number of components minimizing CV PRESS (here all 17, so PCR coincides with least squares)
pcr_pred <- predict(pcr_model, test_data, ncomp = which.min(pcr_model$validation$PRESS))
pcr_mse <- mean((pcr_pred - test_data$Apps)^2)
pcr_mse
## [1] 1261630

(f) Partial Least Squares (PLS)

pls_model <- plsr(Apps ~ ., data = train_data, scale = TRUE, validation = "CV")
summary(pls_model)
## Data:    X dimension: 543 17 
##  Y dimension: 543 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3895     1955     1746     1541     1471     1266     1200
## adjCV         3895     1949     1747     1534     1448     1242     1188
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1177     1164     1158      1161      1157      1157      1157
## adjCV     1167     1155     1149      1151      1147      1148      1148
##        14 comps  15 comps  16 comps  17 comps
## CV         1156      1155      1155      1155
## adjCV      1147      1146      1146      1146
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       25.68    47.43    62.46    64.88    67.34    72.68    77.20    80.92
## Apps    76.62    82.39    86.93    90.76    92.82    93.05    93.13    93.20
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       82.69     85.16     87.35     90.73     92.49     95.10     97.09
## Apps    93.26     93.28     93.30     93.31     93.32     93.32     93.32
##       16 comps  17 comps
## X        98.40    100.00
## Apps     93.32     93.32
validationplot(pls_model, val.type = "MSEP")

# Predict with the ncomp minimizing CV PRESS (effectively the full model here, so PLS matches the least-squares fit)
pls_pred <- predict(pls_model, test_data, ncomp = which.min(pls_model$validation$PRESS))
pls_mse <- mean((pls_pred - test_data$Apps)^2)
pls_mse
## [1] 1261630

(g) Comment

Answer: On this split, ridge regression gives the lowest test MSE (about 1.12 million), followed by the lasso (about 1.25 million); the linear model, PCR, and PLS all land at about 1.26 million. PCR and PLS match least squares here because cross-validation selected all 17 components, in which case both methods reduce to the ordinary least-squares fit. Ridge and the lasso benefit from shrinkage, and the lasso can additionally perform variable selection, although at the chosen lambda it kept all 17 predictors (18 nonzero coefficients including the intercept). Among the five, the model with the lowest test MSE would be preferred.
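As a small sketch using the MSE objects computed in parts (b) through (f), the five test errors can be lined up side by side:

# Gather the test MSEs for a direct comparison, sorted from best to worst
results <- data.frame(
  model    = c("Linear", "Ridge", "Lasso", "PCR", "PLS"),
  test_mse = c(lm_mse, ridge_mse, lasso_mse, pcr_mse, pls_mse)
)
results[order(results$test_mse), ]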

Question 11 - Boston Crime Rate

(a) Compare Models

library(MASS)   # Boston data set
library(leaps)  # regsubsets() for best subset selection
data(Boston)
x <- model.matrix(crim ~ ., Boston)[, -1]
y <- Boston$crim

# Split data
train_index <- sample(1:nrow(Boston), nrow(Boston) * 0.7)
x_train <- x[train_index, ]
y_train <- y[train_index]
x_test <- x[-train_index, ]
y_test <- y[-train_index]

# Best subset selection
best_fit <- regsubsets(crim ~ ., data = Boston[train_index, ], nvmax = 13)
best_summary <- summary(best_fit)
which.min(best_summary$cp)  # size of the model with the smallest Cp
## [1] 6
# Lasso
lasso_cv <- cv.glmnet(x_train, y_train, alpha = 1)
lasso_best_lambda <- lasso_cv$lambda.min
lasso_pred <- predict(lasso_cv, s = lasso_best_lambda, newx = x_test)
lasso_mse <- mean((lasso_pred - y_test)^2)
lasso_mse
## [1] 64.21061
# Ridge
ridge_cv <- cv.glmnet(x_train, y_train, alpha = 0)
ridge_best_lambda <- ridge_cv$lambda.min
ridge_pred <- predict(ridge_cv, s = ridge_best_lambda, newx = x_test)
ridge_mse <- mean((ridge_pred - y_test)^2)
ridge_mse
## [1] 65.15657
# PCR
pcr_model <- pcr(crim ~ ., data = Boston[train_index, ], scale = TRUE, validation = "CV")
# Components again chosen by minimum CV PRESS
pcr_pred <- predict(pcr_model, Boston[-train_index, ], ncomp = which.min(pcr_model$validation$PRESS))
pcr_mse <- mean((pcr_pred - y_test)^2)
pcr_mse
## [1] 64.19247

(b) Best Model

Answer: The model is chosen by validation or test MSE. On this split the differences are small: PCR (about 64.2) and the lasso (about 64.2) are essentially tied, with ridge slightly higher (about 65.2). Any of the regularized approaches is defensible; the one with the lowest cross-validated error would be selected.
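A quick check using the objects computed in part (a):

# Sort the three Boston test MSEs from smallest to largest
sort(c(lasso = lasso_mse, ridge = ridge_mse, pcr = pcr_mse))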

(c) Feature Inclusion

Answer: No. The chosen model (e.g., the lasso) usually does not include all features, because the L1 penalty shrinks some coefficients exactly to zero. This yields a sparser model that is easier to interpret and can predict as well as, or better than, the full model.
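To see this directly, a short sketch using the Boston lasso fit from part (a):

# List the predictors the lasso keeps at the CV-chosen lambda (nonzero coefficients)
boston_lasso_coef <- predict(lasso_cv, s = lasso_best_lambda, type = "coefficients")
rownames(boston_lasso_coef)[as.vector(boston_lasso_coef) != 0]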