library(ISLR)
library(ISLR2)
##
## Attaching package: 'ISLR2'
## The following objects are masked from 'package:ISLR':
##
## Auto, Credit
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Loaded glmnet 4.1-8
library(pls)
##
## Attaching package: 'pls'
##
## The following object is masked from 'package:stats':
##
## loadings
library(leaps)
For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
(a) The lasso, relative to least squares, is:
The lasso, relative to least squares, is less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance (choice iii.).
The lasso adds a penalty to the size of the coefficients, which effectively reduces model flexibility by shrinking some coefficients to zero (performing variable selection). This increases bias but reduces variance, and can lead to improved prediction accuracy when the increase in bias is smaller than the decrease in variance.
(b) Repeat (a) for ridge regression relative to least squares.
Ridge regression, relative to least squares, is less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance (choice iii.).
Ridge regression, like the lasso, adds a penalty (L2 norm) that shrinks coefficients but does not set any of them exactly to zero. It is less flexible than least squares, with increased bias and reduced variance. So, it improves prediction when the increase in bias is outweighed by the decrease in variance.
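To make the contrast between the two penalties concrete, here is a minimal sketch on simulated data (my own illustration, not part of the assignment data): with only a few truly relevant predictors, the lasso sets many coefficients exactly to zero at a moderate λ, while ridge only shrinks them.
set.seed(1)
n <- 100; p <- 20
x_sim <- matrix(rnorm(n * p), n, p)
beta_sim <- c(3, -2, 1.5, rep(0, p - 3))       # only the first 3 predictors matter
y_sim <- x_sim %*% beta_sim + rnorm(n)
ridge_demo <- glmnet(x_sim, y_sim, alpha = 0)  # ridge: L2 penalty
lasso_demo <- glmnet(x_sim, y_sim, alpha = 1)  # lasso: L1 penalty
sum(coef(ridge_demo, s = 1)[-1] == 0)          # ridge: no coefficients are exactly zero
sum(coef(lasso_demo, s = 1)[-1] == 0)          # lasso: many exact zeros (variable selection)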
(c) Repeat (a) for non-linear methods relative to least squares.
Non-linear methods, relative to least squares, are more flexible and hence will give improved prediction accuracy when their increase in variance is less than their decrease in bias (choice ii.).
Non-linear methods (e.g., decision trees, splines, or polynomial regression) are generally more flexible than least squares regression. Their flexibility allows them to fit complex relationships, which lowers bias but may increase variance. They improve prediction accuracy when the increase in variance is smaller than the decrease in bias.
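As a hedged illustration of this trade-off (my own simulation, not part of the exercise): when the true relationship is close to linear, a very flexible fit such as a high-degree polynomial can reduce bias but add enough variance to hurt test accuracy.
set.seed(1)
x_toy <- runif(200)
y_toy <- 2 * x_toy + rnorm(200, sd = 0.5)                    # nearly linear truth
toy_train <- sample(200, 100)
fit_lin  <- lm(y_toy ~ x_toy, subset = toy_train)
fit_poly <- lm(y_toy ~ poly(x_toy, 10), subset = toy_train)  # much more flexible
toy_test <- data.frame(x_toy = x_toy[-toy_train])
mean((y_toy[-toy_train] - predict(fit_lin,  toy_test))^2)    # linear test MSE
mean((y_toy[-toy_train] - predict(fit_poly, toy_test))^2)    # typically worse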
In this exercise, we will predict the number of applications received using the other variables in the College data set.
attach(College)
(a) Split the data set into a training set and a test set.
First, I split the data into a training set and a test set using a 70/30 split.
set.seed(1)
train <- sample(1:nrow(College), nrow(College) * 0.7)
college_train <- College[train, ]
college_test <- College[-train, ]
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
The test error obtained from the linear model is 1261630.
lm.fit <- lm(Apps ~ ., data = college_train)
lm.pred <- predict(lm.fit, newdata = college_test)  # predictions on the held-out test set
mean((lm.pred - college_test$Apps)^2)
## [1] 1261630
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
The test error obtained from the ridge regression model is 1121034.
x_train <- model.matrix(Apps ~ ., college_train)[, -1]
x_test <- model.matrix(Apps ~ ., college_test)[, -1]
y_train <- college_train$Apps
y_test <- college_test$Apps
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0)
best_ridge <- cv_ridge$lambda.min
ridge.pred <- predict(cv_ridge, s = best_ridge, newx = x_test)
mean((y_test - ridge.pred)^2)
## [1] 1121034
(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
The test error obtained from the lasso model is 1254408.
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1)
best_lasso <- cv_lasso$lambda.min
lasso.pred <- predict(cv_lasso, s = best_lasso, newx = x_test)
mean((y_test - lasso.pred)^2)
## [1] 1254408
The lasso model has 18 non-zero coefficients; since this count includes the intercept, all 17 predictors are retained at the selected λ.
lasso_coef <- predict(cv_lasso, s = best_lasso, type = "coefficients")
sum(lasso_coef != 0)
## [1] 18
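Listing the names of the non-zero coefficients confirms this (a small check, my addition):
rownames(lasso_coef)[as.vector(lasso_coef) != 0]   # intercept plus all 17 predictors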
(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
Using PCR with 17 components chosen by cross-validation, I have obtained a test error of 1261630.
pcr.fit <- pcr(Apps ~ ., data = college_train, scale = TRUE, validation = "CV")
summary(pcr.fit)
## Data: X dimension: 543 17
## Y dimension: 543 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3895 3793 2133 2152 1835 1705 1719
## adjCV 3895 3794 2129 2153 1808 1695 1712
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1719 1670 1646 1639 1642 1651 1646
## adjCV 1713 1658 1639 1632 1635 1643 1639
## 14 comps 15 comps 16 comps 17 comps
## CV 1647 1621 1216 1197
## adjCV 1640 1603 1205 1185
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.051 57.00 64.42 70.27 75.65 80.65 84.26 87.61
## Apps 5.788 71.69 71.70 80.97 82.60 82.60 82.69 84.06
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.58 92.84 94.93 96.74 97.82 98.72 99.39
## Apps 84.55 84.82 84.86 84.86 85.01 85.05 89.81
## 16 comps 17 comps
## X 99.85 100.00
## Apps 93.03 93.32
validationplot(pcr.fit, val.type = "MSEP")
pcr.pred <- predict(pcr.fit, newdata = college_test, ncomp = 17)
mean((college_test$Apps - pcr.pred)^2)
## [1] 1261630
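Note that M = 17 means every principal component is used, so this PCR fit reproduces the full least-squares model from part (b); a quick sanity check (my addition, assuming lm.fit is still in the workspace):
ols.pred <- predict(lm.fit, newdata = college_test)
max(abs(ols.pred - as.vector(pcr.pred)))   # essentially zero: the predictions coincide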
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
The cross-validated RMSEP flattens out at about 1155 from roughly 15 components onward; using M = 16 components, I obtained a test error of 1261625.
pls.fit <- plsr(Apps ~ ., data = college_train, scale = TRUE, validation = "CV")
summary(pls.fit)
## Data: X dimension: 543 17
## Y dimension: 543 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3895 1955 1746 1541 1471 1266 1200
## adjCV 3895 1949 1747 1534 1448 1242 1188
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1177 1164 1158 1161 1157 1157 1157
## adjCV 1167 1155 1149 1151 1147 1148 1148
## 14 comps 15 comps 16 comps 17 comps
## CV 1156 1155 1155 1155
## adjCV 1147 1146 1146 1146
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 25.68 47.43 62.46 64.88 67.34 72.68 77.20 80.92
## Apps 76.62 82.39 86.93 90.76 92.82 93.05 93.13 93.20
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 82.69 85.16 87.35 90.73 92.49 95.10 97.09
## Apps 93.26 93.28 93.30 93.31 93.32 93.32 93.32
## 16 comps 17 comps
## X 98.40 100.00
## Apps 93.32 93.32
validationplot(pls.fit, val.type = "MSEP")
pls.pred <- predict(pls.fit, newdata = college_test, ncomp = 16)
mean((college_test$Apps - pls.pred)^2)
## [1] 1261625
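Rather than reading M off the summary table by eye, the number of components can also be chosen directly from the cross-validation PRESS values (a small sketch, my addition):
which.min(pls.fit$validation$PRESS)   # component count with the lowest CV PRESS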
(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
To predict the number of college applications received, five models were evaluated and compared on test mean squared error; their errors are gathered below.
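A small sketch (my addition) collecting the five test MSEs computed above into one table, assuming the prediction objects from parts (b) through (f) are still in the workspace:
tibble(
  model = c("Least squares", "Ridge", "Lasso", "PCR (M = 17)", "PLS (M = 16)"),
  test_mse = c(mean((lm.pred - college_test$Apps)^2),
               mean((y_test - ridge.pred)^2),
               mean((y_test - lasso.pred)^2),
               mean((college_test$Apps - pcr.pred)^2),
               mean((college_test$Apps - pls.pred)^2))
)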
The ridge regression model achieved the lowest test error (1,121,034). The least squares, lasso, PCR, and PLS models all gave similar, somewhat higher errors of roughly 1.25 to 1.26 million. Overall, the number of applications can be predicted reasonably well, but no model captures all of the variability in the data; even ridge regression leaves substantial unexplained error.
We will now try to predict per capita crime rate in the Boston data set.
(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
data(Boston)
attach(Boston)
First, I used best subset selection to identify the combination of predictors that best explains the per capita crime rate in the Boston data set. The model with 12 predictors produced the lowest validation MSE, 49.15767. Here the model size is chosen by comparing validation-set errors rather than training error.
set.seed(1)
train <- sample(c(TRUE, FALSE), nrow(Boston), rep = TRUE)
test <- !train
reg.fit.best <- regsubsets(crim ~ ., data = Boston[train, ], nvmax = 12)  # the ISLR2 Boston data has 12 predictors for crim
test_matrix <- model.matrix(crim ~ ., data = Boston[test, ])
head(test_matrix)
## (Intercept) zn indus chas nox rm age dis rad tax ptratio lstat
## 2 1 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 9.14
## 5 1 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 5.33
## 9 1 12.5 7.87 0 0.524 5.631 100.0 6.0821 5 311 15.2 29.93
## 10 1 12.5 7.87 0 0.524 6.004 85.9 6.5921 5 311 15.2 17.10
## 16 1 0.0 8.14 0 0.538 5.834 56.5 4.4986 4 307 21.0 8.47
## 17 1 0.0 8.14 0 0.538 5.935 29.3 4.4986 4 307 21.0 6.58
## medv
## 2 21.6
## 5 36.2
## 9 16.5
## 10 18.9
## 16 19.9
## 17 23.1
nv <- nrow(summary(reg.fit.best)$which)
val_errors <- rep(NA, nv)
for (i in 1:nv) {
  coefi <- coef(reg.fit.best, id = i)                  # coefficients of the i-variable model
  pred <- test_matrix[, names(coefi)] %*% coefi        # predictions on the validation set
  val_errors[i] <- mean((Boston$crim[test] - pred)^2)  # validation MSE
}
min(val_errors)
## [1] 49.15767
which.min(val_errors)
## [1] 12
plot(val_errors, type = "b", xlab = "Number of Variables", ylab = "Validation MSE")
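For reuse, the manual prediction step in the loop above is often wrapped in a small predict() method for regsubsets objects (a sketch of that common convenience helper, my addition):
predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]])    # formula used in the regsubsets() call
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id = id)
  mat[, names(coefi)] %*% coefi
}
pred12 <- predict(reg.fit.best, Boston[test, ], id = 12)
mean((Boston$crim[test] - pred12)^2)      # matches the 12-variable error above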
Next, I used ridge regression, which applies an L2 penalty to shrink the coefficients and guard against overfitting. The test error was 49.91218, with an optimal λ of 0.5240686. Ridge regression does not perform variable selection; instead it keeps all variables in the model with reduced magnitudes.
x_train <- model.matrix(crim ~ ., Boston[train, ])[, -1]
x_test <- model.matrix(crim ~ ., Boston[test, ])[, -1]
y_train <- Boston$crim[train]
y_test <- Boston$crim[test]
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0)
best_ridge <- cv_ridge$lambda.min
ridge.pred <- predict(cv_ridge, s = best_ridge, newx = x_test)
mean((y_test - ridge.pred)^2)
## [1] 49.91218
best_ridge
## [1] 0.5240686
plot(cv_ridge)
title("Ridge Regression: Cross-Validation Curve", line = 2.5)
The next method I used is lasso regression, which applies L1 regularization and can shrink coefficients exactly to zero. The model achieved a test error of 49.33339, with an optimal λ of 0.0345, and retains only the most relevant predictors.
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1)
best_lasso <- cv_lasso$lambda.min
lasso.pred <- predict(cv_lasso, s = best_lasso, newx = x_test)
mean((y_test - lasso.pred)^2)
## [1] 49.33339
best_lasso
## [1] 0.03448021
plot(cv_lasso)
title("Lasso Regression: Cross-Validation Curve", line = 2.5)
Finally, I applied PCR to reduce dimensionality before performing regression. The number of components selected by cross-validation was 12, giving a test error of 49.15767. Because all 12 components are retained, this PCR fit is equivalent to the full least-squares model, which is why its error matches the 12-variable best subset result. PCR transforms the predictors into orthogonal components before fitting, which can help in the presence of multicollinearity.
pcr_model <- pcr(crim ~ ., data = Boston[train, ], scale = TRUE, validation = "CV")
best_ncomp <- which.min(pcr_model$validation$PRESS)
pcr.pred <- predict(pcr_model, newdata = Boston[test, ], ncomp = best_ncomp)
pcr.pred <- as.vector(pcr.pred)
mean((Boston$crim[test] - pcr.pred)^2)
## [1] 49.15767
validationplot(pcr_model, val.type = "MSEP")
which.min(pcr_model$validation$PRESS)
## [1] 12
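A quick sanity check of this equivalence (my addition): fitting ordinary least squares with all predictors on the same training set gives the same test error.
ols_boston <- lm(crim ~ ., data = Boston[train, ])
mean((Boston$crim[test] - predict(ols_boston, Boston[test, ]))^2)   # same value, about 49.16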
(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
Based on the test errors obtained through cross-validation and the validation set, the best subset selection and PCR models both achieved the lowest test error of 49.15767, outperforming the ridge and lasso models. Either would be a reasonable choice; however, best subset selection is more interpretable because it works directly with the original predictors, whereas PCR uses transformed components that are harder to interpret. For this reason, I will move forward with the best subset model.
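To present the proposed model, a short sketch (my addition) refits best subset selection on the full Boston data and displays the coefficients of the selected 12-variable model:
reg.full <- regsubsets(crim ~ ., data = Boston, nvmax = 12)
coef(reg.full, 12)   # coefficients of the selected model, refit on all observations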
(c) Does your chosen model involve all of the features in the data set? Why or why not?
My chosen model does, in fact, involve all of the features in the data set. As seen in the best subset code, I used which.min() on the validation errors to identify the best model size, and the 12-variable model had the lowest error. Since the ISLR2 version of the Boston data set has only 12 predictors once crim is treated as the response, the selected model keeps every available variable; dropping any of them increased the validation MSE.