library(ISLR)
library(ISLR2)
library(tidyverse)
library(glmnet)
library(pls)
library(leaps)
library(caret)
For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
a. The lasso, relative to least squares, is:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
b. Repeat (a) for ridge regression relative to least squares.
c. Repeat (a) for non-linear methods relative to least squares.
a) Correct answer: iii. The lasso is less flexible than least squares because it penalizes large coefficients and can shrink some exactly to zero, effectively performing variable selection. This introduces bias but reduces variance, so prediction accuracy improves when the increase in bias is less than the decrease in variance.
b) Correct answer: iii. Ridge regression is also less flexible than least squares because its L2 penalty shrinks coefficients toward zero without eliminating any of them. This adds bias and reduces variance, and prediction improves when the increase in bias is smaller than the decrease in variance.
c) Correct answer: ii. Non-linear methods are generally more flexible than least squares, which lets them fit complex patterns and reduce bias, but at the cost of higher variance. They improve prediction accuracy when the increase in variance is less than the decrease in bias.
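To make the bias-variance trade-off concrete, here is a minimal simulation sketch (added for illustration, not part of the exercise): with many weak predictors and modest n, the lasso's added bias is usually outweighed by its variance reduction, so its test error tends to fall below that of least squares.
set.seed(1)
n <- 100; p <- 50
X_sim <- matrix(rnorm(n * p), n, p)
beta_true <- c(rep(1, 5), rep(0, p - 5))                  # only 5 of 50 predictors matter
y_sim <- as.vector(X_sim %*% beta_true + rnorm(n, sd = 3))
X_new <- matrix(rnorm(n * p), n, p)                       # independent test set
y_new <- as.vector(X_new %*% beta_true + rnorm(n, sd = 3))
ols_fit  <- lm(y_sim ~ X_sim)                             # least squares: low bias, high variance
ols_pred <- cbind(1, X_new) %*% coef(ols_fit)
lasso_fit  <- cv.glmnet(X_sim, y_sim, alpha = 1)          # lasso: some bias, much less variance
lasso_pred <- predict(lasso_fit, newx = X_new, s = "lambda.min")
c(OLS = mean((y_new - ols_pred)^2), Lasso = mean((y_new - lasso_pred)^2))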
data("College")
College_df <- College
College_df$CollegeName <- rownames(College_df)
rownames(College_df) <- NULL
College_df$CollegeName <- NULL
set.seed(123)
# 70/30 train/test split on the response using caret::createDataPartition
indices <- createDataPartition(College_df$Apps, p = 0.7, list = FALSE)
train_data <- College_df[indices, ]
test_data <- College_df[-indices, ]
# Align factor levels in the test set with those seen in training so predict() does not fail
for (col in names(train_data)) {
  if (is.factor(train_data[[col]])) {
    test_data[[col]] <- factor(test_data[[col]], levels = levels(train_data[[col]]))
  }
}
Fit a full linear regression model and assess the mean squared prediction error on the test data.
lm_model <- lm(Apps ~ ., data = train_data)
lm_pred <- predict(lm_model, newdata = test_data)
mse <- mean((test_data$Apps - lm_pred)^2)
mse
## [1] 1882074
Ridge regression is fit with glmnet, setting alpha = 0 (which corresponds to the ridge penalty); the tuning parameter lambda is chosen by cross-validation.
# Build design matrices for glmnet; drop the intercept column that model.matrix adds
X_train <- model.matrix(Apps ~ ., data = train_data)[, -1]
X_test <- model.matrix(Apps ~ ., data = test_data)[, -1]
Y_train <- train_data$Apps
Y_test <- test_data$Apps
ridge_cv <- cv.glmnet(X_train, Y_train, alpha = 0)
ridge_best_lambda <- ridge_cv$lambda.min
ridge_preds <- predict(ridge_cv, s = ridge_best_lambda, newx = X_test)
mse_ridge <- mean((Y_test - ridge_preds)^2)
mse_ridge
## [1] 3265646
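The ridge error is noticeably higher than the others. As a sanity check (a sketch under the assumption that glmnet's default lambda sequence may not reach values small enough for this data), ridge can be refit over an explicit grid extending toward zero, since ridge regression approaches the least squares fit as lambda goes to 0:
lambda_grid <- 10^seq(5, -3, length.out = 100)            # explicit grid reaching near zero
ridge_cv_grid <- cv.glmnet(X_train, Y_train, alpha = 0, lambda = lambda_grid)
ridge_pred_grid <- predict(ridge_cv_grid, s = ridge_cv_grid$lambda.min, newx = X_test)
mean((Y_test - ridge_pred_grid)^2)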
set.seed(123)
cv_lasso <- cv.glmnet(X_train, Y_train, alpha = 1)
best_lasso <- cv_lasso$lambda.min
lasso.pred <- predict(cv_lasso, s = best_lasso, newx = X_test)
mean((Y_test - lasso.pred)^2)
## [1] 1936048
lasso_coef <- predict(cv_lasso, s = best_lasso, type = "coefficients")
sum(lasso_coef != 0)
## [1] 17
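At lambda.min the lasso keeps 17 of the 18 coefficients (intercept plus 17 predictors), so it drops at most one variable. A hedged alternative, sketched below, is to use the built-in one-standard-error choice lambda.1se, which typically yields a sparser model at a small cost in cross-validated error:
lasso_coef_1se <- predict(cv_lasso, s = cv_lasso$lambda.1se, type = "coefficients")
sum(lasso_coef_1se != 0)                                   # usually fewer nonzero coefficients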
pcr.fit <- pcr(Apps ~ ., data = train_data, scale = TRUE, validation = "CV")
summary(pcr.fit)
## Data: X dimension: 545 17
## Y dimension: 545 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3389 3395 1657 1670 1409 1266 1254
## adjCV 3389 3394 1654 1672 1389 1243 1250
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1252 1198 1170 1169 1168 1175 1177
## adjCV 1254 1191 1168 1168 1166 1172 1174
## 14 comps 15 comps 16 comps 17 comps
## CV 1173 1174 1011 1007
## adjCV 1170 1172 1008 1004
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 31.892 57.49 64.60 70.26 75.69 80.70 84.36 87.81
## Apps 1.358 76.76 76.85 84.55 87.14 87.14 87.28 88.46
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.88 93.17 95.25 97.13 98.20 98.95 99.47
## Apps 88.69 88.77 88.90 88.98 89.04 89.09 89.37
## 16 comps 17 comps
## X 99.86 100.00
## Apps 92.08 92.22
validationplot(pcr.fit, val.type = "MSEP")
pcr.pred <- predict(pcr.fit, newdata = test_data, ncomp = 17)
mean((test_data$Apps - pcr.pred)^2)
## [1] 1882074
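Because ncomp = 17 uses every component, this PCR fit is just the least squares fit in rotated coordinates, which is why its test error matches the linear model exactly. The sketch below makes the choice of components data-driven by reading the cross-validation minimum off the RMSEP object (in this run the CV minimum is still at 17 components, so it reproduces the hard-coded choice):
pcr_cv_rmsep <- RMSEP(pcr.fit)$val["CV", 1, -1]            # CV RMSEP by number of components
pcr_ncomp <- which.min(pcr_cv_rmsep)
pcr_ncomp
pcr.pred.cv <- predict(pcr.fit, newdata = test_data, ncomp = pcr_ncomp)
mean((test_data$Apps - pcr.pred.cv)^2)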
pls.fit <- plsr(Apps ~ ., data = train_data, scale = TRUE, validation = "CV")
summary(pls.fit)
## Data: X dimension: 545 17
## Y dimension: 545 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3389 1490 1198 1134 1100 1045 1016
## adjCV 3389 1488 1199 1132 1095 1042 1012
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1002.1 1000.5 997.8 996.3 996.2 995.4 995.4
## adjCV 998.9 997.5 994.9 993.5 993.2 992.6 992.6
## 14 comps 15 comps 16 comps 17 comps
## CV 995.1 995.0 995.1 995.1
## adjCV 992.4 992.3 992.4 992.4
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 25.63 38.41 62.86 66.01 69.22 73.83 76.90 81.19
## Apps 81.30 87.78 89.41 90.52 91.58 92.03 92.14 92.17
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 83.07 85.38 87.23 90.36 93.14 95.35 97.20
## Apps 92.19 92.20 92.21 92.21 92.22 92.22 92.22
## 16 comps 17 comps
## X 98.56 100.00
## Apps 92.22 92.22
validationplot(pls.fit, val.type = "MSEP")
pls.pred <- predict(pls.fit, newdata = test_data, ncomp = 16)
mean((test_data$Apps - pls.pred)^2)
## [1] 1882170
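The PLS cross-validation curve is essentially flat beyond about 10 components, so a much smaller model should do just as well. A sketch using pls::selectNcomp with its one-standard-error heuristic (an assumption about which stopping rule to use, not part of the original analysis) picks a more parsimonious number of components and checks the test error:
pls_ncomp <- selectNcomp(pls.fit, method = "onesigma")     # one-standard-error heuristic
pls_ncomp
mean((test_data$Apps - predict(pls.fit, newdata = test_data, ncomp = pls_ncomp))^2)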
Five methods were compared for predicting the number of applications received by each college. Linear regression, PCR, and PLS gave essentially identical test errors near 1,882,000, which is expected: PCR was fit with all 17 components and PLS with 16 of 17, so both are (nearly) the full least squares fit rather than genuinely reduced models. The lasso performed almost as well, with a test error of 1,936,048; note, however, that 17 of its 18 coefficients (intercept plus 17 predictors) are nonzero, so it dropped at most one predictor and is only marginally simpler than the full model. Ridge regression was clearly the weakest here, with a test error of 3,265,646. Overall, all of the methods predict Apps reasonably well, and on this split the least-squares-style fits (linear regression, PCR, and PLS) are the most accurate.
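For reference, the sketch below gathers the College test errors into a single table; it reuses the prediction objects already created above (the lasso, PCR, and PLS errors are recomputed here only because they were printed rather than stored):
tibble(
  Method  = c("Linear regression", "Ridge", "Lasso", "PCR (17 comps)", "PLS (16 comps)"),
  TestMSE = c(mse,
              mse_ridge,
              mean((Y_test - lasso.pred)^2),
              mean((test_data$Apps - pcr.pred)^2),
              mean((test_data$Apps - pls.pred)^2))
)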
Apply best subset selection, ridge regression, lasso regression, and PCR to the Boston dataset.
Setup and Data Split
data("Boston")
set.seed(1)
train <- sample(c(TRUE, FALSE), nrow(Boston), rep = TRUE)
test <- !train
1. Best Subset Selection
reg.fit.best <- regsubsets(crim ~ ., data = Boston[train, ], nvmax = 13)
test_matrix <- model.matrix(crim ~ ., data = Boston[test, ])
nv <- nrow(summary(reg.fit.best)$which)   # number of model sizes actually fit
val_errors <- rep(NA, nv)
for (i in 1:nv) {
  coefs <- coef(reg.fit.best, id = i)
  pred <- test_matrix[, names(coefs)] %*% coefs
  val_errors[i] <- mean((Boston$crim[test] - pred)^2)
}
min(val_errors)
## [1] 49.15767
which.min(val_errors)
## [1] 12
plot(val_errors, type = "b", xlab = "Number of Variables", ylab = "Validation MSE")
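To see exactly what the selected model contains, the sketch below prints the coefficients of the model size with the lowest validation MSE and lists any column of the test design matrix it leaves out (an empty result means the full model was selected):
best_coef <- coef(reg.fit.best, id = which.min(val_errors))
best_coef
setdiff(colnames(test_matrix), names(best_coef))           # predictors excluded, if any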
2. Ridge Regression
x_train <- model.matrix(crim ~ ., Boston[train, ])[, -1]
x_test <- model.matrix(crim ~ ., Boston[test, ])[, -1]
y_train <- Boston$crim[train]
y_test <- Boston$crim[test]
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0)
best_ridge <- cv_ridge$lambda.min
ridge.pred <- predict(cv_ridge, s = best_ridge, newx = x_test)
mean((y_test - ridge.pred)^2)
## [1] 49.91218
best_ridge
## [1] 0.5240686
plot(cv_ridge)
title("Ridge Regression: Cross-Validation Curve", line = 2.5)
3. Lasso Regression
set.seed(1)
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1)
best_lasso <- cv_lasso$lambda.min
lasso.pred <- predict(cv_lasso, s = best_lasso, newx = x_test)
mean((y_test - lasso.pred)^2)
## [1] 49.37916
best_lasso
## [1] 0.04558081
plot(cv_lasso)
title("Lasso Regression: Cross-Validation Curve", line = 2.5)
4. Principal Components Regression (PCR)
pcr_model <- pcr(crim ~ ., data = Boston[train, ], scale = TRUE, validation = "CV")
best_ncomp <- which.min(pcr_model$validation$PRESS)
pcr.pred <- predict(pcr_model, newdata = Boston[test, ], ncomp = best_ncomp)
pcr.pred <- as.vector(pcr.pred)
mean((Boston$crim[test] - pcr.pred)^2)
## [1] 49.15767
best_ncomp
## [1] 12
validationplot(pcr_model, val.type = "MSEP")
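Since the CV-chosen number of components equals the total number of predictors, this PCR model is equivalent to the full least squares fit on the training split. The sketch below checks this by fitting an ordinary linear model on the same split; its test MSE should match the PCR error above:
lm_boston <- lm(crim ~ ., data = Boston[train, ])
mean((Boston$crim[test] - predict(lm_boston, newdata = Boston[test, ]))^2)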
Four approaches were compared for predicting per-capita crime rate (crim) in Boston. Best subset selection and PCR tied for the lowest test error, about 49.16, with the lasso close behind at 49.38 and ridge regression at 49.91. The tie is not a coincidence: the validation error for best subset selection is minimized at 12 variables, PCR's cross-validation error is minimized at 12 components, and the ISLR2 version of Boston has exactly 12 predictors for crim, so both "winners" are simply the full least squares fit expressed in different ways. Between the two, the subset-selection model is the easier to report, because its coefficients refer directly to the measured variables rather than to principal components, so it is the model proposed here.
The proposed model does involve all of the features: the best-performing subset contains all 12 available predictors, so nothing was dropped on this split. If a sparser model were wanted, the lasso is the natural alternative, since its penalty can set unhelpful coefficients exactly to zero, but with a cross-validated lambda of about 0.046 the shrinkage here is mild and the test error is only slightly higher.