Correct answer: iii. Relative to least squares, the lasso is less flexible: its penalty shrinks the coefficient estimates and can set some of them exactly to zero. This reduces variance at the cost of some added bias, so prediction accuracy improves when the increase in bias is smaller than the decrease in variance.
Correct answer: iii. Ridge regression is likewise less flexible than least squares. Its penalty shrinks the coefficients toward zero (without setting any exactly to zero), lowering variance at the cost of increased bias, so it improves accuracy under the same condition.
Correct answer: ii. Non-linear methods are more flexible than least squares; they improve prediction accuracy when the increase in variance is smaller than the decrease in bias.
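A quick numerical sketch of the shrinkage behaviour described above, on simulated data (the variables and penalty values below are made up purely for illustration):

library(glmnet)
set.seed(123)
x_sim <- matrix(rnorm(100 * 10), 100, 10)            # 10 predictors, only 2 matter
y_sim <- 2 * x_sim[, 1] - 3 * x_sim[, 2] + rnorm(100)
fit_sim <- glmnet(x_sim, y_sim, alpha = 1)            # lasso path over a grid of lambdas
# A large penalty zeroes out most coefficients (more bias, less variance);
# a small penalty leaves the estimates close to least squares (less bias, more variance).
round(cbind(as.matrix(coef(fit_sim, s = 1)), as.matrix(coef(fit_sim, s = 0.01))), 2)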
# Packages used in this section: ISLR (College data), dplyr, glmnet, pls
library(ISLR)
library(dplyr)
library(glmnet)
library(pls)
set.seed(1)  # a seed makes the split reproducible; the original seed is not shown
data(College)
College <- College %>% mutate(Private = as.numeric(Private == "Yes"))  # 0/1 dummy for Private
# 70/30 split of the College data into training and test sets
train_index <- sample(1:nrow(College), nrow(College) * 0.7)
train_data <- College[train_index, ]
test_data <- College[-train_index, ]
# Least squares fit using all predictors
lm_fit <- lm(Apps ~ ., data = train_data)
lm_pred <- predict(lm_fit, newdata = test_data)
lm_mse <- mean((lm_pred - test_data$Apps)^2)
lm_mse
## [1] 1261630
# Design matrices for glmnet (drop the intercept column)
x_train <- model.matrix(Apps ~ ., train_data)[, -1]
y_train <- train_data$Apps
x_test <- model.matrix(Apps ~ ., test_data)[, -1]
y_test <- test_data$Apps
# Ridge regression (alpha = 0), lambda chosen by 10-fold cross-validation
ridge_cv <- cv.glmnet(x_train, y_train, alpha = 0)
ridge_best_lambda <- ridge_cv$lambda.min
ridge_pred <- predict(ridge_cv, s = ridge_best_lambda, newx = x_test)
ridge_mse <- mean((ridge_pred - y_test)^2)
ridge_mse
## [1] 1121034
# Lasso (alpha = 1), lambda chosen by 10-fold cross-validation
lasso_cv <- cv.glmnet(x_train, y_train, alpha = 1)
lasso_best_lambda <- lasso_cv$lambda.min
lasso_pred <- predict(lasso_cv, s = lasso_best_lambda, newx = x_test)
lasso_mse <- mean((lasso_pred - y_test)^2)
lasso_coef <- predict(lasso_cv, s = lasso_best_lambda, type = "coefficients")
lasso_nonzero <- sum(lasso_coef != 0)
lasso_mse
## [1] 1254408
lasso_nonzero
## [1] 18
# Principal components regression on standardized predictors, with 10-fold CV
pcr_model <- pcr(Apps ~ ., data = train_data, scale = TRUE, validation = "CV")
summary(pcr_model)
## Data: X dimension: 543 17
## Y dimension: 543 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3895 3793 2133 2152 1835 1705 1719
## adjCV 3895 3794 2129 2153 1808 1695 1712
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1719 1670 1646 1639 1642 1651 1646
## adjCV 1713 1658 1639 1632 1635 1643 1639
## 14 comps 15 comps 16 comps 17 comps
## CV 1647 1621 1216 1197
## adjCV 1640 1603 1205 1185
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.051 57.00 64.42 70.27 75.65 80.65 84.26 87.61
## Apps 5.788 71.69 71.70 80.97 82.60 82.60 82.69 84.06
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.58 92.84 94.93 96.74 97.82 98.72 99.39
## Apps 84.55 84.82 84.86 84.86 85.01 85.05 89.81
## 16 comps 17 comps
## X 99.85 100.00
## Apps 93.03 93.32
validationplot(pcr_model, val.type = "MSEP")
# Predict with the number of components that minimizes cross-validated PRESS
pcr_pred <- predict(pcr_model, test_data, ncomp = which.min(pcr_model$validation$PRESS))
pcr_mse <- mean((pcr_pred - test_data$Apps)^2)
pcr_mse
## [1] 1261630
# Partial least squares on standardized predictors, with 10-fold CV
pls_model <- plsr(Apps ~ ., data = train_data, scale = TRUE, validation = "CV")
summary(pls_model)
## Data: X dimension: 543 17
## Y dimension: 543 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3895 1955 1746 1541 1471 1266 1200
## adjCV 3895 1949 1747 1534 1448 1242 1188
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1177 1164 1158 1161 1157 1157 1157
## adjCV 1167 1155 1149 1151 1147 1148 1148
## 14 comps 15 comps 16 comps 17 comps
## CV 1156 1155 1155 1155
## adjCV 1147 1146 1146 1146
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 25.68 47.43 62.46 64.88 67.34 72.68 77.20 80.92
## Apps 76.62 82.39 86.93 90.76 92.82 93.05 93.13 93.20
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 82.69 85.16 87.35 90.73 92.49 95.10 97.09
## Apps 93.26 93.28 93.30 93.31 93.32 93.32 93.32
## 16 comps 17 comps
## X 98.40 100.00
## Apps 93.32 93.32
validationplot(pls_model, val.type = "MSEP")
pls_pred <- predict(pls_model, test_data, ncomp = which.min(pls_model$validation$PRESS))
pls_mse <- mean((pls_pred - test_data$Apps)^2)
pls_mse
## [1] 1261630
Answer: In this split, ridge regression gives the lowest test MSE (about 1.12 million), followed by the lasso (about 1.25 million); PCR and PLS end up using essentially all of the components, so their test errors match ordinary least squares (about 1.26 million). The shrinkage methods trade a little bias for lower variance, and the lasso additionally performs variable selection. Overall the five approaches predict the number of applications with broadly similar accuracy, and the model with the lowest test MSE would be preferred.
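To make that comparison concrete, the five test errors computed above can be gathered and expressed as test R-squared (a small sketch that reuses the objects already defined; tss is simply the variance of Apps in the test set):

# Collect the test MSEs from the fits above and convert each to a test R^2
test_mse <- c(lm = lm_mse, ridge = ridge_mse, lasso = lasso_mse, pcr = pcr_mse, pls = pls_mse)
tss <- mean((test_data$Apps - mean(test_data$Apps))^2)
round(1 - test_mse / tss, 3)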
# Boston data (MASS package); the response is the per-capita crime rate, crim
library(MASS)
library(leaps)
data(Boston)
x <- model.matrix(crim ~ ., Boston)[, -1]
y <- Boston$crim
# 70/30 split of the Boston data
train_index <- sample(1:nrow(Boston), nrow(Boston) * 0.7)
x_train <- x[train_index, ]
y_train <- y[train_index]
x_test <- x[-train_index, ]
y_test <- y[-train_index]
# Best subset selection
best_fit <- regsubsets(crim ~ ., data = Boston[train_index, ], nvmax = 13)
best_summary <- summary(best_fit)
which.min(best_summary$cp)
## [1] 6
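The best-subset fit above is never scored on the test set; a sketch of how the 6-variable model flagged by Cp could be evaluated (it reuses x_test and y_test from above, and `best_mse` is an object introduced here only for illustration):

# Predict on the test set with the 6-variable best-subset model
best_coef <- coef(best_fit, id = 6)
test_mat <- cbind(`(Intercept)` = 1, x_test)[, names(best_coef)]
best_pred <- drop(test_mat %*% best_coef)
best_mse <- mean((best_pred - y_test)^2)
best_mse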
# Lasso
lasso_cv <- cv.glmnet(x_train, y_train, alpha = 1)
lasso_best_lambda <- lasso_cv$lambda.min
lasso_pred <- predict(lasso_cv, s = lasso_best_lambda, newx = x_test)
lasso_mse <- mean((lasso_pred - y_test)^2)
lasso_mse
## [1] 64.21061
# Ridge
ridge_cv <- cv.glmnet(x_train, y_train, alpha = 0)
ridge_best_lambda <- ridge_cv$lambda.min
ridge_pred <- predict(ridge_cv, s = ridge_best_lambda, newx = x_test)
ridge_mse <- mean((ridge_pred - y_test)^2)
ridge_mse
## [1] 65.15657
# PCR
pcr_model <- pcr(crim ~ ., data = Boston[train_index, ], scale = TRUE, validation = "CV")
pcr_pred <- predict(pcr_model, Boston[-train_index, ], ncomp = which.min(pcr_model$validation$PRESS))
pcr_mse <- mean((pcr_pred - y_test)^2)
pcr_mse
## [1] 64.19247
Answer: The preferred model is the one with the lowest cross-validated (or test) error. In this split, the lasso and PCR give nearly identical test MSEs (about 64.2), slightly below ridge (about 65.2), so any of the regularized approaches is defensible; the lasso has the added advantage of yielding a sparse, interpretable model.
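For reference, the test errors from this split (including the best-subset sketch above) can be placed side by side:

# Test MSEs, smallest first
sort(c(best_subset = best_mse, lasso = lasso_mse, ridge = ridge_mse, pcr = pcr_mse))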
Answer: No. The lasso shrinks some coefficients exactly to zero, so the chosen model typically involves only a subset of the features. Dropping weak predictors makes the model easier to interpret and can improve test performance by reducing variance.
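To confirm that the lasso fit really does drop predictors, its coefficients at the chosen lambda can be inspected (this reuses the Boston lasso_cv fit from above; which variables are zeroed will vary with the random split):

# Coefficients at lambda.min; zero entries correspond to excluded predictors
lasso_coef_boston <- predict(lasso_cv, s = lasso_best_lambda, type = "coefficients")
lasso_coef_boston
sum(lasso_coef_boston != 0) - 1  # number of predictors retained (intercept excluded)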