Chapter 6, problems 2,9,11

Problem 2.

2. For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer. (a) The lasso, relative to least squares, is: i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias. iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias. (b) Repeat (a) for ridge regression relative to least squares. (c) Repeat (a) for non-linear methods relative to least squares.

(A). iii. The lasso, relative to least squares, is less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.Using the lasso technique reduces the estimates of the coefficients so that some of the coefficients will be zero. This greatly reduces the variance.

(B). iii. Ridge regression, relative to least squares, is less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.Using ridge regression reduces the estimates of the coefficients, however the coefficients will not be zero using this technique.

(C.) ii. Using non linear methods allow for more flexibility in the model. These models tend to be higher in variance so by the bias-variance trade-off the model will give improved prediction accuracy when the increase in variance is less than the decrease in bias.

Problem 9.

9. In this exercise, we will predict the number of applications received using the other variables in the College data set. (a) Split the data set into a training set and a test set. (b) Fit a linear model using least squares on the training set, and report the test error obtained. (c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained. (d) Fit a lasso model on the training set, with λ chosen by crossvalidation. Report the test error obtained, along with the number of non-zero coefficient estimates. (e) Fit a PCR model on the training set, with M chosen by crossvalidation. Report the test error obtained, along with the value of M selected by cross-validation. (f) Fit a PLS model on the training set, with M chosen by crossvalidation. Report the test error obtained, along with the value of M selected by cross-validation. (g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

(A).

library(ISLR)
set.seed(19)
id <- sample(1:nrow(College), nrow(College)/2)
train = College[id,]
test = College[-id,]

Here we split the data into train and test sets.

(B).

fit.lm <- lm(Apps~.,data=train)
pred.lm <- predict(fit.lm,test)
(err.lm <- mean((test$Apps-pred.lm)^2))
## [1] 1444045

We can see from this that the test error using a least squares linear model is 1,444,045.

(C).

library(glmnet)
## Warning: package 'glmnet' was built under R version 4.1.3
## Loading required package: Matrix
## Loaded glmnet 4.1-3
xmat.train <- model.matrix(Apps~.,data=train)[,-1]
xmat.test <- model.matrix(Apps~.,data=test)[,-1]
fit.ridge <- cv.glmnet(xmat.train, train$Apps, alpha = 0)
lambda <- fit.ridge$lambda.min
pred.ridge <- predict(fit.ridge, s=lambda, newx = xmat.test)
ridge.err <- mean((test$Apps-pred.ridge)^2)
ridge.err
## [1] 2404092

From this we can see that the test error obtained using ridge regression is 2,404,092.

(D).

library(ISLR)
xmat.train <- model.matrix(Apps~., data=train)[,-1]
xmat.test <- model.matrix(Apps~., data=test)[,-1]
fit.lasso <- cv.glmnet(xmat.train, train$Apps, alpha=1)
lambda <- fit.lasso$lambda.min
pred.lasso <- predict(fit.lasso, s=lambda, newx=xmat.test)
err.lasso <- mean((test$Apps - pred.lasso)^2)
err.lasso
## [1] 1452826
coef.lasso <- predict(fit.lasso, type="coefficients", s=lambda)[1:ncol(College),]
coef.lasso[coef.lasso != 0]
##   (Intercept)    PrivateYes        Accept        Enroll     Top10perc 
## -8.269196e+02 -6.427826e+02  1.078460e+00  7.603700e-02  5.071759e+01 
##     Top25perc   F.Undergrad   P.Undergrad      Outstate    Room.Board 
## -1.432521e+01  1.084947e-01  8.384195e-03 -3.115003e-02  2.441802e-01 
##         Books      Personal           PhD      Terminal     S.F.Ratio 
##  2.476549e-02  1.244398e-01 -1.263119e+01 -4.842160e+00  1.566539e+01 
##   perc.alumni        Expend     Grad.Rate 
## -7.466352e+00  8.644278e-02  9.016699e+00

We can see from this result that the test error obtained using the lasso was 1,452,826. Looking at the coefficients, we can see that the coefficient of Enroll is 7.604e-02 and the coefficient for PhD was -1.263e+01.

(E).

library(pls)
## Warning: package 'pls' was built under R version 4.1.3
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
set.seed(19)
fit.pcr <- pcr(Apps~., data=train, scale=TRUE, validation="CV")
summary(fit.pcr)
## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3811     3480     1746     1750     1506     1340     1279
## adjCV         3811     3480     1743     1747     1503     1286     1272
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1273     1251     1244      1236      1232      1250      1243
## adjCV     1268     1246     1242      1233      1229      1246      1238
##        14 comps  15 comps  16 comps  17 comps
## CV         1246      1232      1202      1179
## adjCV      1242      1228      1197      1174
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       34.32    58.77    65.51    71.01    75.88    80.62    84.61    88.25
## Apps    17.48    79.76    79.98    85.29    89.41    89.59    89.66    89.88
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       91.27     93.64     95.71     97.35     98.31     99.00     99.48
## Apps    90.05     90.39     90.42     90.43     90.58     90.58     90.86
##       16 comps  17 comps
## X        99.82    100.00
## Apps     91.53     91.99
pred.pcr <- predict(fit.pcr, test, ncomp=16)  
(err.pcr <- mean((test$Apps - pred.pcr)^2))
## [1] 2041398

We can see from the summary that the CV is lowest at M=16.With M=16, the test error obtained using principle components analysis is 2,041,398.

(F).

fit.pls <- plsr(Apps~., data=train, scale=TRUE, validation="CV")
summary(fit.pls)
## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3811     1660     1461     1214     1205     1199     1169
## adjCV         3811     1659     1458     1211     1201     1193     1165
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1155     1156     1154      1155      1157      1158      1157
## adjCV     1152     1152     1150      1151      1153      1153      1153
##        14 comps  15 comps  16 comps  17 comps
## CV         1156      1157      1157      1157
## adjCV      1152      1152      1152      1152
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       27.86    56.31    63.96    67.78    71.50    74.79    78.55    81.17
## Apps    81.60    86.12    90.58    91.02    91.41    91.73    91.85    91.91
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       83.18     85.72     87.61     90.73     93.27     96.08     97.51
## Apps    91.95     91.96     91.98     91.98     91.99     91.99     91.99
##       16 comps  17 comps
## X        99.30    100.00
## Apps     91.99     91.99
pred.pls <- predict(fit.pls, test, ncomp=10)
(err.pls <- mean((test$Apps - pred.pls)^2))  
## [1] 1489105

We can see from the summary that the CV is lowest when M=10. When M=10, the test error using partial least squares is 1,489,105.

(G). The results indicate that the linear model had the lowest test error with an estimated value of 1,444,045. The lasso method and the partial least squares method were right behind at 1,452,826 and 1,489,105 respectively. ridge regression and principle components regression were the worst with estimated test errors of 2,404,092 and 2,041,398 respectively.

Problem 11.

We will now try to predict per capita crime rate in the Boston data set. (a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider. (b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, crossvalidation, or some other reasonable alternative, as opposed to using training error. (c) Does your chosen model involve all of the features in the data set? Why or why not?

(A).

set.seed(7)
library(MASS)
set.seed(7)
trainid <- sample(1:nrow(Boston), nrow(Boston)/2)
train <- Boston[trainid,]
test <- Boston[-trainid,]
xmat.train <- model.matrix(crim~., data=train)[,-1]
xmat.test <- model.matrix(crim~., data=test)[,-1]

Now that we have these training and test data sets, we can perform ridge regression:

library(glmnet)
fit.ridge <- cv.glmnet(xmat.train, train$crim, alpha=0)
lambda <- fit.ridge$lambda.min
pred.ridge <- predict(fit.ridge, s=lambda, newx=xmat.test)
(err.ridge <- mean((test$crim - pred.ridge)^2))
## [1] 41.13889

The test error obtained using ridge regression is 41.13889.

We can also perform lasso regression:

fit.lasso <- cv.glmnet(xmat.train, train$crim, alpha=1)
lambda <- fit.lasso$lambda.min
pred.lasso <- predict(fit.lasso, s=lambda, newx=xmat.test)
(err.lasso <- mean((test$crim - pred.lasso)^2))
## [1] 40.74732

The test error obtained using lasso regression is 40.86415.

We can also perform forward and backward selection and look at the associated test error:

library(leaps)
## Warning: package 'leaps' was built under R version 4.1.3
predict.regsubsets <- function(object, newdata, id, ...){
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id=id)
  xvars <- names(coefi)
  mat[,xvars]%*%coefi
}

fit.fwd <- regsubsets(crim~., data=train, nvmax=ncol(Boston)-1)
(fwd.summary <- summary(fit.fwd))
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = train, nvmax = ncol(Boston) - 
##     1)
## 13 Variables  (and intercept)
##         Forced in Forced out
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## black       FALSE      FALSE
## lstat       FALSE      FALSE
## medv        FALSE      FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
##           zn  indus chas nox rm  age dis rad tax ptratio black lstat medv
## 1  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   " "   " " 
## 2  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   "*"   " " 
## 3  ( 1 )  " " "*"   " "  " " " " " " " " "*" " " " "     " "   "*"   " " 
## 4  ( 1 )  " " "*"   " "  " " " " " " " " "*" " " " "     "*"   "*"   " " 
## 5  ( 1 )  "*" "*"   " "  " " " " " " "*" "*" " " " "     " "   "*"   " " 
## 6  ( 1 )  "*" "*"   " "  " " " " " " "*" "*" " " " "     " "   "*"   "*" 
## 7  ( 1 )  "*" "*"   " "  "*" " " " " "*" "*" " " " "     " "   "*"   "*" 
## 8  ( 1 )  "*" "*"   " "  "*" " " " " "*" "*" " " " "     "*"   "*"   "*" 
## 9  ( 1 )  "*" "*"   " "  "*" " " " " "*" "*" " " "*"     "*"   "*"   "*" 
## 10  ( 1 ) "*" "*"   " "  "*" "*" " " "*" "*" " " "*"     "*"   "*"   "*" 
## 11  ( 1 ) "*" "*"   "*"  "*" " " " " "*" "*" "*" "*"     "*"   "*"   "*" 
## 12  ( 1 ) "*" "*"   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*"   "*" 
## 13  ( 1 ) "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"   "*"
err.fwd <- rep(NA, ncol(Boston)-1)
for(i in 1:(ncol(Boston)-1)) {
  pred.fwd <- predict(fit.fwd, test, id=i)
  err.fwd[i] <- mean((test$crim - pred.fwd)^2)
}
min(err.fwd)
## [1] 40.84844

We can see from the summary results that the test error using forward selection is 40.848.

Now we calculate the test error for backward selection:

fit.bwd <- regsubsets(crim~., data=train, nvmax=ncol(Boston)-1)
(bwd.summary <- summary(fit.bwd))
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = train, nvmax = ncol(Boston) - 
##     1)
## 13 Variables  (and intercept)
##         Forced in Forced out
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## black       FALSE      FALSE
## lstat       FALSE      FALSE
## medv        FALSE      FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
##           zn  indus chas nox rm  age dis rad tax ptratio black lstat medv
## 1  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   " "   " " 
## 2  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   "*"   " " 
## 3  ( 1 )  " " "*"   " "  " " " " " " " " "*" " " " "     " "   "*"   " " 
## 4  ( 1 )  " " "*"   " "  " " " " " " " " "*" " " " "     "*"   "*"   " " 
## 5  ( 1 )  "*" "*"   " "  " " " " " " "*" "*" " " " "     " "   "*"   " " 
## 6  ( 1 )  "*" "*"   " "  " " " " " " "*" "*" " " " "     " "   "*"   "*" 
## 7  ( 1 )  "*" "*"   " "  "*" " " " " "*" "*" " " " "     " "   "*"   "*" 
## 8  ( 1 )  "*" "*"   " "  "*" " " " " "*" "*" " " " "     "*"   "*"   "*" 
## 9  ( 1 )  "*" "*"   " "  "*" " " " " "*" "*" " " "*"     "*"   "*"   "*" 
## 10  ( 1 ) "*" "*"   " "  "*" "*" " " "*" "*" " " "*"     "*"   "*"   "*" 
## 11  ( 1 ) "*" "*"   "*"  "*" " " " " "*" "*" "*" "*"     "*"   "*"   "*" 
## 12  ( 1 ) "*" "*"   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*"   "*" 
## 13  ( 1 ) "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"   "*"
err.bwd <- rep(NA, ncol(Boston)-1)
for(i in 1:(ncol(Boston)-1)) {
  pred.bwd <- predict(fit.bwd, test, id=i)
  err.bwd[i] <- mean((test$crim - pred.bwd)^2)
}
min(err.bwd)
## [1] 40.84844

This summary table suggests that using backward selection also results in a test error of 40.848.

Finally, we calculate the test error for PCR.

library(pls)
set.seed(19)
fit.pcr <- pcr(crim~., data=train, scale=TRUE, validation="CV")
summary(fit.pcr)
## Data:    X dimension: 253 13 
##  Y dimension: 253 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           8.924    7.324    7.316    7.071    7.087    7.185    7.202
## adjCV        8.924    7.319    7.311    7.058    7.078    7.167    7.181
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       7.193    7.170    7.147     7.011     7.044     6.999     6.928
## adjCV    7.170    7.128    7.121     6.985     7.018     6.971     6.899
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       49.81    61.39    70.07    77.12    83.77    88.50    91.71    93.76
## crim    33.90    34.42    40.37    40.37    40.67    41.66    42.19    43.47
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       95.78     97.39     98.72     99.65    100.00
## crim    43.84     45.67     45.99     47.08     48.37
pred.pcr <- predict(fit.pcr, test, ncomp=12)  
(err.pcr <- mean((test$crim - pred.pcr)^2))
## [1] 42.64682

Here we can see that using PCR has a test error of 42.64682.

(B).

min(err.bwd)
## [1] 40.84844
min(err.fwd)
## [1] 40.84844
err.lasso
## [1] 40.74732
err.ridge
## [1] 41.13889
err.pcr
## [1] 42.64682

We can see that by using backward selection, forward selection, or lasso regression will result in the smallest test error.

(C).

No, because in forward and backward selection variables that are not important for predicting the response variable are removed. Lasso regression also removes variables by decreasing their coefficient to zero.