2. For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
(a) The lasso, relative to least squares, is:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
(b) Repeat (a) for ridge regression relative to least squares.
(c) Repeat (a) for non-linear methods relative to least squares.
(A). iii. The lasso, relative to least squares, is less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. The lasso shrinks the coefficient estimates toward zero and sets some of them exactly to zero, which greatly reduces variance at the cost of some added bias (a short illustration follows part (c)).
(B). iii. Ridge regression, relative to least squares, is less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. Ridge regression also shrinks the coefficient estimates, but unlike the lasso it does not set any of them exactly to zero.
(C). ii. Non-linear methods allow for more flexibility in the model. These models tend to have higher variance, so by the bias-variance trade-off they will give improved prediction accuracy when the increase in variance is less than the decrease in bias.
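As a quick illustration of (a), here is a minimal sketch on simulated data (the variable names are ours, not part of the exercise): least squares keeps all ten coefficient estimates non-zero, while the lasso shrinks most of them exactly to zero.
library(glmnet)
set.seed(1)
# Ten predictors, but only the first two actually affect the response
x <- matrix(rnorm(50 * 10), 50, 10)
y <- 3 * x[, 1] - 2 * x[, 2] + rnorm(50)
coef(lm(y ~ x))                                      # all 10 estimates non-zero
coef(cv.glmnet(x, y, alpha = 1), s = "lambda.min")   # most shrunk exactly to 0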
9. In this exercise, we will predict the number of applications received using the other variables in the College data set.
(a) Split the data set into a training set and a test set.
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
(A).
library(ISLR)
set.seed(19)
# Randomly assign half of the observations to the training set
id <- sample(1:nrow(College), nrow(College)/2)
train <- College[id,]
test <- College[-id,]
Here we split the data into train and test sets.
(B).
# Least squares on all predictors; the outer parentheses print the test MSE
fit.lm <- lm(Apps~., data=train)
pred.lm <- predict(fit.lm, test)
(err.lm <- mean((test$Apps-pred.lm)^2))
## [1] 1444045
We can see from this that the test error using a least squares linear model is 1,444,045.
(C).
library(glmnet)
# glmnet requires a numeric predictor matrix; [,-1] drops the intercept column
xmat.train <- model.matrix(Apps~., data=train)[,-1]
xmat.test <- model.matrix(Apps~., data=test)[,-1]
# alpha = 0 selects the ridge penalty; cv.glmnet chooses lambda by 10-fold CV
fit.ridge <- cv.glmnet(xmat.train, train$Apps, alpha = 0)
lambda <- fit.ridge$lambda.min
pred.ridge <- predict(fit.ridge, s=lambda, newx = xmat.test)
ridge.err <- mean((test$Apps-pred.ridge)^2)
ridge.err
## [1] 2404092
From this we can see that the test error obtained using ridge regression is 2,404,092.
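cv.glmnet also records lambda.1se, the largest λ whose CV error is within one standard error of the minimum; it yields a simpler, more heavily shrunk fit. As a sketch reusing fit.ridge from above:
# Test error at the more conservative lambda.1se choice
pred.ridge.1se <- predict(fit.ridge, s = fit.ridge$lambda.1se, newx = xmat.test)
mean((test$Apps - pred.ridge.1se)^2)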
(D).
# xmat.train and xmat.test are reused from part (c); alpha = 1 below selects the lasso penalty
fit.lasso <- cv.glmnet(xmat.train, train$Apps, alpha=1)
lambda <- fit.lasso$lambda.min
pred.lasso <- predict(fit.lasso, s=lambda, newx=xmat.test)
err.lasso <- mean((test$Apps - pred.lasso)^2)
err.lasso
## [1] 1452826
coef.lasso <- predict(fit.lasso, type="coefficients", s=lambda)[1:ncol(College),]
coef.lasso[coef.lasso != 0]
## (Intercept) PrivateYes Accept Enroll Top10perc
## -8.269196e+02 -6.427826e+02 1.078460e+00 7.603700e-02 5.071759e+01
## Top25perc F.Undergrad P.Undergrad Outstate Room.Board
## -1.432521e+01 1.084947e-01 8.384195e-03 -3.115003e-02 2.441802e-01
## Books Personal PhD Terminal S.F.Ratio
## 2.476549e-02 1.244398e-01 -1.263119e+01 -4.842160e+00 1.566539e+01
## perc.alumni Expend Grad.Rate
## -7.466352e+00 8.644278e-02 9.016699e+00
We can see from this result that the test error obtained using the lasso was 1,452,826. Looking at the coefficients, at the λ chosen by cross-validation none of the estimates are shrunk exactly to zero, so all 17 predictors enter the model with non-zero coefficients.
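To report the number of non-zero coefficient estimates directly (reusing coef.lasso from above):
# Count the non-zero estimates, excluding the intercept
sum(coef.lasso[-1] != 0)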
(E).
library(pls)
set.seed(19)
# Standardize the predictors and evaluate each choice of M by cross-validation
fit.pcr <- pcr(Apps~., data=train, scale=TRUE, validation="CV")
summary(fit.pcr)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3811 3480 1746 1750 1506 1340 1279
## adjCV 3811 3480 1743 1747 1503 1286 1272
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1273 1251 1244 1236 1232 1250 1243
## adjCV 1268 1246 1242 1233 1229 1246 1238
## 14 comps 15 comps 16 comps 17 comps
## CV 1246 1232 1202 1179
## adjCV 1242 1228 1197 1174
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 34.32 58.77 65.51 71.01 75.88 80.62 84.61 88.25
## Apps 17.48 79.76 79.98 85.29 89.41 89.59 89.66 89.88
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 91.27 93.64 95.71 97.35 98.31 99.00 99.48
## Apps 90.05 90.39 90.42 90.43 90.58 90.58 90.86
## 16 comps 17 comps
## X 99.82 100.00
## Apps 91.53 91.99
pred.pcr <- predict(fit.pcr, test, ncomp=16)
(err.pcr <- mean((test$Apps - pred.pcr)^2))
## [1] 2041398
We can see from the summary that the cross-validated error keeps decreasing and is smallest at M = 17, which is just the full least squares fit; M = 16 gives nearly the same CV error with one fewer component. With M = 16, the test error obtained using principal components regression is 2,041,398.
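Rather than reading the minimizing M off the summary table, we can locate it programmatically (a sketch reusing fit.pcr; RMSEP is part of the pls package):
# The CV RMSEP array starts with the intercept-only model,
# so subtract 1 to convert the index into a component count
cv.rmsep <- RMSEP(fit.pcr, estimate = "CV")
which.min(cv.rmsep$val) - 1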
(F).
fit.pls <- plsr(Apps~., data=train, scale=TRUE, validation="CV")
summary(fit.pls)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3811 1660 1461 1214 1205 1199 1169
## adjCV 3811 1659 1458 1211 1201 1193 1165
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1155 1156 1154 1155 1157 1158 1157
## adjCV 1152 1152 1150 1151 1153 1153 1153
## 14 comps 15 comps 16 comps 17 comps
## CV 1156 1157 1157 1157
## adjCV 1152 1152 1152 1152
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 27.86 56.31 63.96 67.78 71.50 74.79 78.55 81.17
## Apps 81.60 86.12 90.58 91.02 91.41 91.73 91.85 91.91
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 83.18 85.72 87.61 90.73 93.27 96.08 97.51
## Apps 91.95 91.96 91.98 91.98 91.99 91.99 91.99
## 16 comps 17 comps
## X 99.30 100.00
## Apps 91.99 91.99
pred.pls <- predict(fit.pls, test, ncomp=10)
(err.pls <- mean((test$Apps - pred.pls)^2))
## [1] 1489105
We can see from the summary that the cross-validated error flattens out from about M = 7 onward, reaching its minimum at M = 9, with M = 10 essentially identical. Using M = 10, the test error obtained using partial least squares is 1,489,105.
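The pls package also provides selectNcomp, which applies a one-standard-error rule to the CV curve and will often suggest a smaller M than the raw minimum (a sketch, assuming a reasonably recent version of pls):
# Suggest the smallest M whose CV error is within one standard
# error of the overall minimum
selectNcomp(fit.pls, method = "onesigma")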
(G). The results indicate that the linear model had the lowest test error, with an estimated value of 1,444,045. The lasso and partial least squares were close behind at 1,452,826 and 1,489,105 respectively, so there is little practical difference among these three approaches. Ridge regression and principal components regression fared noticeably worse, with estimated test errors of 2,404,092 and 2,041,398 respectively.
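To gauge how accurately we can predict applications in absolute terms, a useful yardstick is the test-set R², the proportion of the variance in Apps that the predictions explain (a sketch reusing pred.lm and test from parts (a) and (b)):
# 1 minus the ratio of test MSE to the variance of Apps on the test set;
# values close to 1 mean the predictions track the actual counts well
test.avg <- mean(test$Apps)
1 - mean((test$Apps - pred.lm)^2) / mean((test$Apps - test.avg)^2)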
11. We will now try to predict per capita crime rate in the Boston data set.
(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
(c) Does your chosen model involve all of the features in the data set? Why or why not?
(A).
library(MASS)
set.seed(7)
trainid <- sample(1:nrow(Boston), nrow(Boston)/2)
train <- Boston[trainid,]
test <- Boston[-trainid,]
xmat.train <- model.matrix(crim~., data=train)[,-1]
xmat.test <- model.matrix(crim~., data=test)[,-1]
Now that we have these training and test data sets, we can perform ridge regression:
library(glmnet)
fit.ridge <- cv.glmnet(xmat.train, train$crim, alpha=0)
lambda <- fit.ridge$lambda.min
pred.ridge <- predict(fit.ridge, s=lambda, newx=xmat.test)
(err.ridge <- mean((test$crim - pred.ridge)^2))
## [1] 41.13889
The test error obtained using ridge regression is 41.13889.
We can also perform lasso regression:
fit.lasso <- cv.glmnet(xmat.train, train$crim, alpha=1)
lambda <- fit.lasso$lambda.min
pred.lasso <- predict(fit.lasso, s=lambda, newx=xmat.test)
(err.lasso <- mean((test$crim - pred.lasso)^2))
## [1] 40.74732
The test error obtained using the lasso is 40.74732.
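It is also worth checking which predictors the lasso keeps at this λ, which bears on part (c) below (a sketch reusing fit.lasso and lambda from above):
# Coefficients at lambda.min; count how many predictors survive
coef.boston <- predict(fit.lasso, type = "coefficients", s = lambda)
sum(coef.boston[-1] != 0)  # non-zero predictors, excluding the intercept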
We can also perform best subset selection and look at the associated validation-set error. Note that regsubsets searches over subsets exhaustively by default, as the "Selection Algorithm: exhaustive" line in the summary confirms. Since regsubsets has no predict method, we first define one:
library(leaps)
predict.regsubsets <- function(object, newdata, id, ...) {
  # Recover the model formula from the regsubsets call, build the design
  # matrix for the new data, and multiply by the coefficients of the
  # model with `id` predictors
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id = id)
  xvars <- names(coefi)
  mat[, xvars] %*% coefi
}
fit.bss <- regsubsets(crim~., data=train, nvmax=ncol(Boston)-1)
(bss.summary <- summary(fit.bss))
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = train, nvmax = ncol(Boston) -
## 1)
## 13 Variables (and intercept)
## Forced in Forced out
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## black FALSE FALSE
## lstat FALSE FALSE
## medv FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
## zn indus chas nox rm age dis rad tax ptratio black lstat medv
## 1 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " "*" " "
## 3 ( 1 ) " " "*" " " " " " " " " " " "*" " " " " " " "*" " "
## 4 ( 1 ) " " "*" " " " " " " " " " " "*" " " " " "*" "*" " "
## 5 ( 1 ) "*" "*" " " " " " " " " "*" "*" " " " " " " "*" " "
## 6 ( 1 ) "*" "*" " " " " " " " " "*" "*" " " " " " " "*" "*"
## 7 ( 1 ) "*" "*" " " "*" " " " " "*" "*" " " " " " " "*" "*"
## 8 ( 1 ) "*" "*" " " "*" " " " " "*" "*" " " " " "*" "*" "*"
## 9 ( 1 ) "*" "*" " " "*" " " " " "*" "*" " " "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" " " "*" "*" " " "*" "*" " " "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" "*" "*" " " " " "*" "*" "*" "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
# Validation-set error for each subset size, using the predict method above
err.bss <- rep(NA, ncol(Boston)-1)
for (i in 1:(ncol(Boston)-1)) {
  pred.bss <- predict(fit.bss, test, id = i)
  err.bss[i] <- mean((test$crim - pred.bss)^2)
}
min(err.bss)
## [1] 40.84844
We can see from these results that the lowest validation-set error obtained using best subset selection is 40.848.
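To see which subset size achieves this minimum (reusing err.bss from above):
which.min(err.bss)  # the number of predictors in the best-performing model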
Finally, we calculate the test error for PCR.
library(pls)
set.seed(19)
fit.pcr <- pcr(crim~., data=train, scale=TRUE, validation="CV")
summary(fit.pcr)
## Data: X dimension: 253 13
## Y dimension: 253 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.924 7.324 7.316 7.071 7.087 7.185 7.202
## adjCV 8.924 7.319 7.311 7.058 7.078 7.167 7.181
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 7.193 7.170 7.147 7.011 7.044 6.999 6.928
## adjCV 7.170 7.128 7.121 6.985 7.018 6.971 6.899
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 49.81 61.39 70.07 77.12 83.77 88.50 91.71 93.76
## crim 33.90 34.42 40.37 40.37 40.67 41.66 42.19 43.47
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.78 97.39 98.72 99.65 100.00
## crim 43.84 45.67 45.99 47.08 48.37
pred.pcr <- predict(fit.pcr, test, ncomp=12)
(err.pcr <- mean((test$crim - pred.pcr)^2))
## [1] 42.64682
With M = 12 (the CV error is essentially flat from M = 10 onward, and its minimum at M = 13 would amount to no dimension reduction at all), PCR gives a test error of 42.64682.
(B).
min(err.bss)
## [1] 40.84844
err.lasso
## [1] 40.74732
err.ridge
## [1] 41.13889
err.pcr
## [1] 42.64682
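Collecting the estimates in one named vector makes the comparison easier to read (a sketch reusing the error objects computed above):
errs <- c(best.subset = min(err.bss), lasso = err.lasso,
          ridge = err.ridge, pcr = err.pcr)
sort(errs)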
Comparing the validation-set errors, the lasso gives the smallest test error (40.75), with best subset selection close behind (40.85) and ridge regression not far off. We therefore propose the lasso model.
(C).
No. Best subset selection excludes variables that do not help predict the response, and the lasso likewise removes variables by shrinking their coefficients exactly to zero, so the chosen model involves only a subset of the features.