library(ISLR)
library(ISLR2)
##
## Attaching package: 'ISLR2'
## The following objects are masked from 'package:ISLR':
##
## Auto, Credit
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Loaded glmnet 4.1-8
library(pls)
##
## Attaching package: 'pls'
##
## The following object is masked from 'package:stats':
##
## loadings
library(leaps)
For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
(a) The lasso, relative to least squares, is:
The lasso, relative to least squares, is less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance (choice iii.).
The lasso adds a penalty to the size of the coefficients, which effectively reduces model flexibility by shrinking some coefficients to zero (performing variable selection). This increases bias but reduces variance, and can lead to improved prediction accuracy when the increase in bias is smaller than the decrease in variance.
(b) Repeat (a) for ridge regression relative to least squares.
Ridge regression, relative to least squares, is less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance (choice iii.).
Ridge regression, like the lasso, adds a penalty (L2 norm) that shrinks coefficients but does not set any of them exactly to zero. It is less flexible than least squares, with increased bias and reduced variance. So, it improves prediction when the increase in bias is outweighed by the decrease in variance.
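To make the contrast between the two penalties concrete, here is a minimal sketch on simulated data (my own illustration, not part of the assignment data): with only a few truly relevant predictors, the lasso sets many coefficients exactly to zero at a moderate λ, while ridge only shrinks them.
set.seed(1)
n <- 100; p <- 20
x_sim <- matrix(rnorm(n * p), n, p)
beta_sim <- c(3, -2, 1.5, rep(0, p - 3))       # only the first 3 predictors matter
y_sim <- x_sim %*% beta_sim + rnorm(n)
ridge_demo <- glmnet(x_sim, y_sim, alpha = 0)  # ridge: L2 penalty
lasso_demo <- glmnet(x_sim, y_sim, alpha = 1)  # lasso: L1 penalty
sum(coef(ridge_demo, s = 1)[-1] == 0)          # ridge: no coefficients are exactly zero
sum(coef(lasso_demo, s = 1)[-1] == 0)          # lasso: many exact zeros (variable selection)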
(c) Repeat (a) for non-linear methods relative to least squares.
Non-linear methods, relative to least squares, are more flexible and hence will give improved prediction accuracy when their increase in variance is less than their decrease in bias (choice ii.).
Non-linear methods (e.g., decision trees, splines, or polynomial regression) are generally more flexible than least squares regression. Their flexibility allows them to fit complex relationships, which lowers bias but may increase variance. They improve prediction accuracy when the increase in variance is smaller than the decrease in bias.
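As a hedged illustration of this trade-off (my own simulation, not part of the exercise): when the true relationship is close to linear, a very flexible fit such as a high-degree polynomial can reduce bias but add enough variance to hurt test accuracy.
set.seed(1)
x_toy <- runif(200)
y_toy <- 2 * x_toy + rnorm(200, sd = 0.5)                    # nearly linear truth
toy_train <- sample(200, 100)
fit_lin  <- lm(y_toy ~ x_toy, subset = toy_train)
fit_poly <- lm(y_toy ~ poly(x_toy, 10), subset = toy_train)  # much more flexible
toy_test <- data.frame(x_toy = x_toy[-toy_train])
mean((y_toy[-toy_train] - predict(fit_lin,  toy_test))^2)    # linear test MSE
mean((y_toy[-toy_train] - predict(fit_poly, toy_test))^2)    # typically worse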
In this exercise, we will predict the number of applications received using the other variables in the College data set.
attach(College)
(a) Split the data set into a training set and a test set.
First, I split the data into a training set and a test set using a 70/30 split.
set.seed(1)
train <- sample(1:nrow(College), nrow(College) * 0.7)
college_train <- College[train, ]
college_test <- College[-train, ]
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
The test error obtained from the linear model is 1261630.
lm.fit <- lm(Apps ~ ., data = college_train)
lm.pred <- predict(lm.fit, newdata = college_test)  # predictions on the held-out test set
mean((lm.pred - college_test$Apps)^2)
## [1] 1261630
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
The test error obtained from the ridge regression model is 1121034.
x_train <- model.matrix(Apps ~ ., college_train)[, -1]
x_test <- model.matrix(Apps ~ ., college_test)[, -1]
y_train <- college_train$Apps
y_test <- college_test$Apps
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0)
best_ridge <- cv_ridge$lambda.min
ridge.pred <- predict(cv_ridge, s = best_ridge, newx = x_test)
mean((y_test - ridge.pred)^2)
## [1] 1121034
(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
The test error obtained from the lasso model is 1254408.
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1)
best_lasso <- cv_lasso$lambda.min
lasso.pred <- predict(cv_lasso, s = best_lasso, newx = x_test)
mean((y_test - lasso.pred)^2)
## [1] 1254408
The lasso model has 18 non-zero coefficients; since this count includes the intercept, all 17 predictors are retained at the selected λ.
lasso_coef <- predict(cv_lasso, s = best_lasso, type = "coefficients")
sum(lasso_coef != 0)
## [1] 18
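Listing the names of the non-zero coefficients confirms this (a small check, my addition):
rownames(lasso_coef)[as.vector(lasso_coef) != 0]   # intercept plus all 17 predictors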
(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
Using PCR with 17 components chosen by cross-validation, I have obtained a test error of 1261630.
pcr.fit <- pcr(Apps ~ ., data = college_train, scale = TRUE, validation = "CV")
summary(pcr.fit)
## Data: X dimension: 543 17
## Y dimension: 543 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3895 3793 2133 2152 1835 1705 1719
## adjCV 3895 3794 2129 2153 1808 1695 1712
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1719 1670 1646 1639 1642 1651 1646
## adjCV 1713 1658 1639 1632 1635 1643 1639
## 14 comps 15 comps 16 comps 17 comps
## CV 1647 1621 1216 1197
## adjCV 1640 1603 1205 1185
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.051 57.00 64.42 70.27 75.65 80.65 84.26 87.61
## Apps 5.788 71.69 71.70 80.97 82.60 82.60 82.69 84.06
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.58 92.84 94.93 96.74 97.82 98.72 99.39
## Apps 84.55 84.82 84.86 84.86 85.01 85.05 89.81
## 16 comps 17 comps
## X 99.85 100.00
## Apps 93.03 93.32
validationplot(pcr.fit, val.type = "MSEP")
pcr.pred <- predict(pcr.fit, newdata = college_test, ncomp = 17)
mean((college_test$Apps - pcr.pred)^2)
## [1] 1261630
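Note that M = 17 means every principal component is used, so this PCR fit reproduces the full least-squares model from part (b); a quick sanity check (my addition, assuming lm.fit is still in the workspace):
ols.pred <- predict(lm.fit, newdata = college_test)
max(abs(ols.pred - as.vector(pcr.pred)))   # essentially zero: the predictions coincide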
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
The cross-validated RMSEP flattens out at about 1155 from roughly 15 components onward; using M = 16 components, I obtained a test error of 1261625.
pls.fit <- plsr(Apps ~ ., data = college_train, scale = TRUE, validation = "CV")
summary(pls.fit)
## Data: X dimension: 543 17
## Y dimension: 543 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3895 1955 1746 1541 1471 1266 1200
## adjCV 3895 1949 1747 1534 1448 1242 1188
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1177 1164 1158 1161 1157 1157 1157
## adjCV 1167 1155 1149 1151 1147 1148 1148
## 14 comps 15 comps 16 comps 17 comps
## CV 1156 1155 1155 1155
## adjCV 1147 1146 1146 1146
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 25.68 47.43 62.46 64.88 67.34 72.68 77.20 80.92
## Apps 76.62 82.39 86.93 90.76 92.82 93.05 93.13 93.20
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 82.69 85.16 87.35 90.73 92.49 95.10 97.09
## Apps 93.26 93.28 93.30 93.31 93.32 93.32 93.32
## 16 comps 17 comps
## X 98.40 100.00
## Apps 93.32 93.32
validationplot(pls.fit, val.type = "MSEP")
pls.pred <- predict(pls.fit, newdata = college_test, ncomp = 16)
mean((college_test$Apps - pls.pred)^2)
## [1] 1261625
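Rather than reading M off the summary table by eye, the number of components can also be chosen directly from the cross-validation PRESS values (a small sketch, my addition):
which.min(pls.fit$validation$PRESS)   # component count with the lowest CV PRESS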
(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
To predict the number of college applications received, five models were evaluated and compared on test mean squared error; their errors are gathered below.
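A small sketch (my addition) collecting the five test MSEs computed above into one table, assuming the prediction objects from parts (b) through (f) are still in the workspace:
tibble(
  model = c("Least squares", "Ridge", "Lasso", "PCR (M = 17)", "PLS (M = 16)"),
  test_mse = c(mean((lm.pred - college_test$Apps)^2),
               mean((y_test - ridge.pred)^2),
               mean((y_test - lasso.pred)^2),
               mean((college_test$Apps - pcr.pred)^2),
               mean((college_test$Apps - pls.pred)^2))
)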
The ridge regression model achieved the lowest test error (1,121,034). The least squares, lasso, PCR, and PLS models all gave similar, somewhat higher errors of roughly 1.25 to 1.26 million. Overall, the number of applications can be predicted reasonably well, but no model captures all of the variability in the data; even ridge regression leaves substantial unexplained error.
We will now try to predict per capita crime rate in the Boston data set.
(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
data(Boston)
attach(Boston)
First, I used best subset selection to identify the combination of predictors that best explains the per capita crime rate in the Boston data set. The model with 12 predictors produced the lowest validation MSE, 49.15767. Here the model size is chosen by comparing validation-set errors rather than training error.
set.seed(1)
train <- sample(c(TRUE, FALSE), nrow(Boston), rep = TRUE)
test <- !train
reg.fit.best <- regsubsets(crim ~ ., data = Boston[train, ], nvmax = 12)  # the ISLR2 Boston data has 12 predictors for crim
test_matrix <- model.matrix(crim ~ ., data = Boston[test, ])
head(test_matrix)
## (Intercept) zn indus chas nox rm age dis rad tax ptratio lstat
## 2 1 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 9.14
## 5 1 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 5.33
## 9 1 12.5 7.87 0 0.524 5.631 100.0 6.0821 5 311 15.2 29.93
## 10 1 12.5 7.87 0 0.524 6.004 85.9 6.5921 5 311 15.2 17.10
## 16 1 0.0 8.14 0 0.538 5.834 56.5 4.4986 4 307 21.0 8.47
## 17 1 0.0 8.14 0 0.538 5.935 29.3 4.4986 4 307 21.0 6.58
## medv
## 2 21.6
## 5 36.2
## 9 16.5
## 10 18.9
## 16 19.9
## 17 23.1
nv <- nrow(summary(reg.fit.best)$which)
val_errors <- rep(NA, nv)
for (i in 1:nv) {
  coefi <- coef(reg.fit.best, id = i)                  # coefficients of the i-variable model
  pred <- test_matrix[, names(coefi)] %*% coefi        # predictions on the validation set
  val_errors[i] <- mean((Boston$crim[test] - pred)^2)  # validation MSE
}
min(val_errors)
## [1] 49.15767
which.min(val_errors)
## [1] 12
plot(val_errors, type = "b", xlab = "Number of Variables", ylab = "Validation MSE")
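For reuse, the manual prediction step in the loop above is often wrapped in a small predict() method for regsubsets objects (a sketch of that common convenience helper, my addition):
predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]])    # formula used in the regsubsets() call
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id = id)
  mat[, names(coefi)] %*% coefi
}
pred12 <- predict(reg.fit.best, Boston[test, ], id = 12)
mean((Boston$crim[test] - pred12)^2)      # matches the 12-variable error above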
Next, I used ridge regression, which applies an L2 penalty to shrink the coefficients and guard against overfitting. The test error was 49.91218, with an optimal λ of 0.5240686. Ridge regression does not perform variable selection; instead it keeps all variables in the model with reduced magnitudes.
x_train <- model.matrix(crim ~ ., Boston[train, ])[, -1]
x_test <- model.matrix(crim ~ ., Boston[test, ])[, -1]
y_train <- Boston$crim[train]
y_test <- Boston$crim[test]
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0)
best_ridge <- cv_ridge$lambda.min
ridge.pred <- predict(cv_ridge, s = best_ridge, newx = x_test)
mean((y_test - ridge.pred)^2)
## [1] 49.91218
best_ridge
## [1] 0.5240686
plot(cv_ridge)
title("Ridge Regression: Cross-Validation Curve", line = 2.5)
The next method I used is lasso regression, which applies L1 regularization and can shrink coefficients exactly to zero. The model achieved a test error of 49.33339, with an optimal λ of 0.0345, and retains only the most relevant predictors.
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1)
best_lasso <- cv_lasso$lambda.min
lasso.pred <- predict(cv_lasso, s = best_lasso, newx = x_test)
mean((y_test - lasso.pred)^2)
## [1] 49.33339
best_lasso
## [1] 0.03448021
plot(cv_lasso)
title("Lasso Regression: Cross-Validation Curve", line = 2.5)
Finally, I applied PCR to reduce dimensionality before performing regression. The number of components selected by cross-validation was 12, giving a test error of 49.15767. Because all 12 components are retained, this PCR fit is equivalent to the full least-squares model, which is why its error matches the 12-variable best subset result. PCR transforms the predictors into orthogonal components before fitting, which can help in the presence of multicollinearity.
pcr_model <- pcr(crim ~ ., data = Boston[train, ], scale = TRUE, validation = "CV")
best_ncomp <- which.min(pcr_model$validation$PRESS)
pcr.pred <- predict(pcr_model, newdata = Boston[test, ], ncomp = best_ncomp)
pcr.pred <- as.vector(pcr.pred)
mean((Boston$crim[test] - pcr.pred)^2)
## [1] 49.15767
validationplot(pcr_model, val.type = "MSEP")
which.min(pcr_model$validation$PRESS)
## [1] 12
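A quick sanity check of this equivalence (my addition): fitting ordinary least squares with all predictors on the same training set gives the same test error.
ols_boston <- lm(crim ~ ., data = Boston[train, ])
mean((Boston$crim[test] - predict(ols_boston, Boston[test, ]))^2)   # same value, about 49.16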
(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
Based on the test errors obtained through cross-validation and the validation set, the best subset selection and PCR models both achieved the lowest test error of 49.15767, outperforming the ridge and lasso models. Either would be a reasonable choice; however, best subset selection is more interpretable because it works directly with the original predictors, whereas PCR uses transformed components that are harder to interpret. For this reason, I will move forward with the best subset model.
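To present the proposed model, a short sketch (my addition) refits best subset selection on the full Boston data and displays the coefficients of the selected 12-variable model:
reg.full <- regsubsets(crim ~ ., data = Boston, nvmax = 12)
coef(reg.full, 12)   # coefficients of the selected model, refit on all observations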
(c) Does your chosen model involve all of the features in the data set? Why or why not?
My chosen model does, in fact, involve all of the features in the data set. As seen in the best subset code, I used which.min() on the validation errors to identify the best model size, and the 12-variable model had the lowest error. Since the ISLR2 version of the Boston data set has only 12 predictors once crim is treated as the response, the selected model keeps every available variable; dropping any of them increased the validation MSE.