2

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

(a) The lasso, relative to least squares, is:

i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

  1. is correct. The less flexible LASSO restricts the model, decreasing its variance and increasing its bias. If the bias increase is less than the reduction in variance (if the reduction in varianve outweighs the increase in bias), then the prediction accuracy will increase.

(b) Repeat (a) for ridge regression relative to least squares.

also iii. Ridge regression is also a less flexible model so the same reasoning as above applies.

(c) Repeat (a) for non-linear methods relative to least squares.

Non-linear methods are more flexible and will introduce more variance while reducing bias. The answer would be ii. If the bias reduction outweighs the increased variance, the model will have a higher prediction accuracy.

9

In this exercise, we will predict the number of applications received using the other variables in the College data set.

library(ISLR2)
str(College)
## 'data.frame':    777 obs. of  18 variables:
##  $ Private    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Apps       : num  1660 2186 1428 417 193 ...
##  $ Accept     : num  1232 1924 1097 349 146 ...
##  $ Enroll     : num  721 512 336 137 55 158 103 489 227 172 ...
##  $ Top10perc  : num  23 16 22 60 16 38 17 37 30 21 ...
##  $ Top25perc  : num  52 29 50 89 44 62 45 68 63 44 ...
##  $ F.Undergrad: num  2885 2683 1036 510 249 ...
##  $ P.Undergrad: num  537 1227 99 63 869 ...
##  $ Outstate   : num  7440 12280 11250 12960 7560 ...
##  $ Room.Board : num  3300 6450 3750 5450 4120 ...
##  $ Books      : num  450 750 400 450 800 500 500 450 300 660 ...
##  $ Personal   : num  2200 1500 1165 875 1500 ...
##  $ PhD        : num  70 29 53 92 76 67 90 89 79 40 ...
##  $ Terminal   : num  78 30 66 97 72 73 93 100 84 41 ...
##  $ S.F.Ratio  : num  18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
##  $ perc.alumni: num  12 16 30 37 2 11 26 37 23 15 ...
##  $ Expend     : num  7041 10527 8735 19016 10922 ...
##  $ Grad.Rate  : num  60 56 54 59 15 55 63 73 80 52 ...
sum(is.na(College$Apps))
## [1] 0

(a) Split the data set into a training set and a test set.

set.seed(13)
index = sample(nrow(College), 0.5 * nrow(College))
train_college = College[index,]
test_college = College[-index,]

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

linear_model = lm(Apps ~ ., data = train_college)
probabilities = predict(linear_model, newdata = test_college)
test_mse = mean((test_college$Apps - probabilities)^2)

test_mse
## [1] 1062514
sqrt(test_mse)
## [1] 1030.783

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
set.seed(13)

X_train = model.matrix(Apps ~ ., data = train_college)[,-1]
y_train = train_college$Apps

X_test = model.matrix(Apps ~ ., data = test_college)[,-1]
y_test = test_college$Apps

cv_ridge = cv.glmnet(X_train, y_train, alpha = 0)

best_lambda = cv_ridge$lambda.min

ridge_model = glmnet(X_train, y_train, alpha = 0, lambda = best_lambda)

ridge_probs = predict(ridge_model, s = best_lambda, newx = X_test)

ridge_mse = mean((ridge_probs - y_test)^2)
ridge_mse
## [1] 1174478
sqrt(ridge_mse)
## [1] 1083.733

(d) Fit a lasso model on the training set, with λ chosen by cross- validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

set.seed(13)

X_train = model.matrix(Apps ~ ., data = train_college)[,-1]
y_train = train_college$Apps

X_test = model.matrix(Apps ~ ., data = test_college)[,-1]
y_test = test_college$Apps

cv_lasso = cv.glmnet(X_train, y_train, alpha = 1)

best_lambda = cv_lasso$lambda.min

lasso_model = glmnet(X_train, y_train, alpha = 1, lambda = best_lambda)

lasso_probs = predict(lasso_model, s = best_lambda, newx = X_test)

lasso_mse = mean((lasso_probs - y_test)^2)
lasso_mse
## [1] 1055797
sqrt(lasso_mse)
## [1] 1027.52

(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
set.seed(13)

pcr_model = pcr(Apps ~ ., data = train_college, scale = TRUE, validation = "CV")
#validationplot(pcr_model, val.type = "MSEP")

cv_results = RMSEP(pcr_model)
best_m = which.min(cv_results$val[1, , -1])

probabilities = predict(pcr_model, newdata = test_college, ncomp = best_m)
pcr_mse = mean((probabilities - test_college$Apps)^2)
pcr_mse
## [1] 1062514
sqrt(pcr_mse)
## [1] 1030.783

(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

set.seed(13)

pls_model = plsr(Apps ~ ., data = train_college, scale = TRUE, validation = "CV")

cv_results = RMSEP(pls_model)
best_m = which.min(cv_results$val[1, , -1])

probabilities = predict(pls_model, newdata = test_college, ncomp = best_m)
pls_mse = mean((probabilities - test_college$Apps)^2)
pls_mse
## [1] 1062514
sqrt(pls_mse)
## [1] 1030.783

(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

Lasso 1027.52

OLS 1030.78

PCR 1030.78

PLS 1030.78

Ridge 1083.73

We can predict with moderate accuracy, around 34% error. The test errors across the board are similar between the 5 methods, suggesting the dataset is fairly robust the modeling method.

11

We will now try to predict per capita crime rate in the Boston data set.

str(Boston)
## 'data.frame':    506 obs. of  13 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
sum(is.na(Boston$crim))
## [1] 0

(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

library(leaps)

regfit_full = regsubsets(crim ~ ., data = Boston, nvmax = 12)
summary(regfit_full)
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = Boston, nvmax = 12)
## 12 Variables  (and intercept)
##         Forced in Forced out
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## lstat       FALSE      FALSE
## medv        FALSE      FALSE
## 1 subsets of each size up to 12
## Selection Algorithm: exhaustive
##           zn  indus chas nox rm  age dis rad tax ptratio lstat medv
## 1  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   " " 
## 2  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     "*"   " " 
## 3  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     "*"   "*" 
## 4  ( 1 )  "*" " "   " "  " " " " " " "*" "*" " " " "     " "   "*" 
## 5  ( 1 )  "*" "*"   " "  " " " " " " "*" "*" " " " "     " "   "*" 
## 6  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " "*"     " "   "*" 
## 7  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " "*"     "*"   "*" 
## 8  ( 1 )  "*" "*"   " "  "*" " " " " "*" "*" " " "*"     "*"   "*" 
## 9  ( 1 )  "*" "*"   " "  "*" "*" " " "*" "*" " " "*"     "*"   "*" 
## 10  ( 1 ) "*" " "   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*" 
## 11  ( 1 ) "*" "*"   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*" 
## 12  ( 1 ) "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"
reg_summary = summary(regfit_full)

reg_summary$rsq
##  [1] 0.3912567 0.4207965 0.4244920 0.4334892 0.4378328 0.4421077 0.4451514
##  [8] 0.4470236 0.4482891 0.4487917 0.4493353 0.4493378
plot(regfit_full, scale = "r2")

plot(regfit_full, scale = "adjr2")

plot(regfit_full, scale = "Cp")

plot(regfit_full, scale = "bic")

2 variables will be used to predict crim based on best subset method. rad and lstat. The 2 variable model has comparable rsquare and BIC values to the rest of the models, while also being simpler.

set.seed(13)
index = sample(nrow(Boston), 0.5 * nrow(Boston))
train_boston = Boston[index,]
test_boston = Boston[-index,]

# linear model with chosen best subset
linear_model = lm(crim ~ rad + lstat, data = train_boston)
probabilities = predict(linear_model, newdata = test_boston)
test_mse = mean((test_boston$crim - probabilities)^2)

test_mse
## [1] 33.22543
sqrt(test_mse)
## [1] 5.76415
# Ridge
library(glmnet)
set.seed(13)

X_train = model.matrix(crim ~ ., data = train_boston)[,-1]
y_train = train_boston$crim

X_test = model.matrix(crim ~ ., data = test_boston)[,-1]
y_test = test_boston$crim

cv_ridge = cv.glmnet(X_train, y_train, alpha = 0)

best_lambda = cv_ridge$lambda.min

ridge_model = glmnet(X_train, y_train, alpha = 0, lambda = best_lambda)

ridge_probs = predict(ridge_model, s = best_lambda, newx = X_test)

ridge_mse = mean((ridge_probs - y_test)^2)
ridge_mse
## [1] 33.08957
sqrt(ridge_mse)
## [1] 5.752353
# LASSO
set.seed(13)

X_train = model.matrix(crim ~ ., data = train_boston)[,-1]
y_train = train_boston$crim

X_test = model.matrix(crim ~ ., data = test_boston)[,-1]
y_test = test_boston$crim

cv_lasso = cv.glmnet(X_train, y_train, alpha = 1)

best_lambda = cv_lasso$lambda.min

lasso_model = glmnet(X_train, y_train, alpha = 1, lambda = best_lambda)

lasso_probs = predict(lasso_model, s = best_lambda, newx = X_test)

lasso_mse = mean((lasso_probs - y_test)^2)
lasso_mse
## [1] 32.88823
sqrt(lasso_mse)
## [1] 5.734826
# PCR

library(pls)
set.seed(13)

pcr_model = pcr(crim ~ ., data = train_boston, scale = TRUE, validation = "CV")
#validationplot(pcr_model, val.type = "MSEP")

cv_results = RMSEP(pcr_model)
best_m = which.min(cv_results$val[1, , -1])

probabilities = predict(pcr_model, newdata = test_boston, ncomp = best_m)
pcr_mse = mean((probabilities - test_boston$crim)^2)
pcr_mse
## [1] 34.82423
sqrt(pcr_mse)
## [1] 5.901206

RMSE

OLS (two variable model): 5.76415

Ridge: 5.752353

LASSO: 5.734826

PCR: 5.901206

LASSO had the lowest RMSE so it would be chosen as the best method.

(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross- validation, or some other reasonable alternative, as opposed to using training error.

OLS with appropriate variables, Ridge Regression, and LASSO would be viable for this data set based on their RMSE in comparison with each other and PCR.

(c) Does your chosen model involve all of the features in the data set? Why or why not?

lasso_coef = predict(lasso_model, type = "coefficients", s = best_lambda, newx = X_test)[1:12,]
lasso_coef
## (Intercept)          zn       indus        chas         nox          rm 
##  5.24016233  0.04936863 -0.11608797 -0.17313686  0.00000000  0.00000000 
##         age         dis         rad         tax     ptratio       lstat 
##  0.00000000 -0.80540024  0.52582341  0.00000000  0.00000000  0.13698970

The chosen model did not use all of the available features because LASSO filtered out irrelevant variables for the dataset.