Question 2

For parts (a) through (c), indicate which of i. through iv. is correct:

  i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
  ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
  iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
  iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

Justify your answer.

  (a) The lasso, relative to least squares, is:

    The lasso, relative to least squares, is less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance (iii). This is due to the shrinkage penalty included when estimating the coefficients: the \(\ell_1\) penalty shrinks the estimates toward zero and can set some of them exactly to zero, producing a sparser and more interpretable model.

  (b) Repeat (a) for ridge regression relative to least squares.

    Ridge regression is less flexible than least squares and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance (iii). Ridge regression is similar to the lasso but uses a different shrinkage penalty, one that shrinks the coefficients toward zero without ever setting them exactly to zero (both objectives are written out after part (c)). By shrinking the coefficient estimates, we can substantially decrease the variance at the cost of an increase in bias.

  (c) Repeat (a) for non-linear methods relative to least squares.

    Non-linear methods are more flexible relative to least squares and hence will give improved prediction accuracy when their increase in variance is less than their decrease in bias (ii).
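
For reference, parts (a) and (b) come down to the form of the shrinkage penalty. The standard objectives (as in ISLR, Chapter 6) are

\[
\text{lasso:}\quad \min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert
\]

\[
\text{ridge:}\quad \min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2
\]

The \(\ell_1\) penalty can set coefficients exactly to zero; the \(\ell_2\) penalty only shrinks them toward zero.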

Question 9

In this exercise, we will predict the number of applications received using the other variables in the College data set.

  (a) Split the data set into a training set and a test set.
library(ISLR)
str(College)
## 'data.frame':    777 obs. of  18 variables:
##  $ Private    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Apps       : num  1660 2186 1428 417 193 ...
##  $ Accept     : num  1232 1924 1097 349 146 ...
##  $ Enroll     : num  721 512 336 137 55 158 103 489 227 172 ...
##  $ Top10perc  : num  23 16 22 60 16 38 17 37 30 21 ...
##  $ Top25perc  : num  52 29 50 89 44 62 45 68 63 44 ...
##  $ F.Undergrad: num  2885 2683 1036 510 249 ...
##  $ P.Undergrad: num  537 1227 99 63 869 ...
##  $ Outstate   : num  7440 12280 11250 12960 7560 ...
##  $ Room.Board : num  3300 6450 3750 5450 4120 ...
##  $ Books      : num  450 750 400 450 800 500 500 450 300 660 ...
##  $ Personal   : num  2200 1500 1165 875 1500 ...
##  $ PhD        : num  70 29 53 92 76 67 90 89 79 40 ...
##  $ Terminal   : num  78 30 66 97 72 73 93 100 84 41 ...
##  $ S.F.Ratio  : num  18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
##  $ perc.alumni: num  12 16 30 37 2 11 26 37 23 15 ...
##  $ Expend     : num  7041 10527 8735 19016 10922 ...
##  $ Grad.Rate  : num  60 56 54 59 15 55 63 73 80 52 ...
sum(is.na(College))
## [1] 0
library(caret)
library(dplyr)   # provides %>% (used throughout) and arrange() later
set.seed(21)
ind <- College$Apps %>% 
  createDataPartition(p = 0.7, list = FALSE)
college.train <- College[ind,]
college.test <- College[-ind,]
ctrl <- trainControl("repeatedcv", number = 10, repeats = 3)
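
As a quick sanity check (not part of the original output), the 70/30 partition sizes can be verified directly; the PCR summary further below confirms 545 training rows.

# Verify the partition: roughly 70% of the 777 rows go to training.
nrow(college.train)
nrow(college.test)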
  (b) Fit a linear model using least squares on the training set, and report the test error obtained.

    The linear model, tuned with repeated 10-fold cross-validation (the trainControl above), gives a test RMSE of about 1,239 on the held-out test set.

lm.fit <- train(Apps~., data = college.train, method = "lm", 
                metric = "RMSE", trControl = ctrl) 
lm.preds <- lm.fit %>% 
  predict(college.test) 
lm.r2 <- caret::R2(college.test$Apps, lm.preds)
lm.rmse <- RMSE(college.test$Apps, lm.preds)
lm.rmse
## [1] 1238.523
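
If "test error" is instead reported as test MSE, the conversion from RMSE is just a square; a minimal check, assuming the objects above:

# Test MSE equals the squared test RMSE.
lm.mse <- mean((college.test$Apps - lm.preds)^2)
all.equal(lm.mse, lm.rmse^2)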
  (c) Fit a ridge regression model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained.
grid <- 10^seq(10, -2, length = 100)
ridge.fit <- train(Apps ~., data = college.train, method = "glmnet",
                   trControl = ctrl, metric = "RMSE",
                   tuneGrid = expand.grid(alpha = 0, lambda = grid))
ridge.fit$bestTune
##    alpha   lambda
## 38     0 305.3856
ridge.preds <- ridge.fit %>% 
  predict(college.test)
ridge.r2 <- caret::R2(college.test$Apps, ridge.preds)
ridge.rmse <- RMSE(college.test$Apps, ridge.preds)
ridge.rmse
## [1] 1665.928

The test error using the best \(\lambda\) (about 305) is higher than the linear model's: roughly 1,666 versus 1,239, so the heavy ridge shrinkage does not pay off here.
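
One way to probe this is to score the underlying glmnet fit at a few \(\lambda\) values around the selected one; a hedged sketch (`predict.glmnet` interpolates along the fitted path, and `model.matrix` rebuilds the dummy-coded design caret used):

# Sensitivity of the test RMSE to lambda.
x.test <- model.matrix(Apps ~ ., college.test)[, -1]
for (l in c(50, 305, 1000)) {
  p <- as.numeric(predict(ridge.fit$finalModel, newx = x.test, s = l))
  cat("lambda =", l, " test RMSE =", RMSE(college.test$Apps, p), "\n")
}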

coef(ridge.fit$finalModel, ridge.fit$bestTune$lambda)
## 18 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -1.127861e+03
## PrivateYes  -6.635202e+02
## Accept       8.144054e-01
## Enroll       5.892259e-01
## Top10perc    2.286175e+01
## Top25perc    1.456817e+00
## F.Undergrad  1.142090e-01
## P.Undergrad  1.253448e-02
## Outstate     2.223016e-03
## Room.Board   2.100673e-01
## Books        2.344128e-01
## Personal    -7.322214e-02
## PhD         -3.983164e+00
## Terminal    -4.764773e+00
## S.F.Ratio    4.094570e+00
## perc.alumni -1.384515e+01
## Expend       6.868602e-02
## Grad.Rate    9.985739e+00
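
To visualize the ridge shrinkage, the coefficient paths can be plotted straight from the glmnet object (a sketch using `plot.glmnet`):

# Each curve is one coefficient as log(lambda) varies; under the
# ridge (L2) penalty none of them reach exactly zero.
plot(ridge.fit$finalModel, xvar = "lambda", label = TRUE)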
  (d) Fit a lasso model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
lasso.fit <- train(Apps ~., data = college.train, method = "glmnet",
                   trControl = ctrl, metric = "RMSE",
                   tuneGrid = expand.grid(alpha = 1, lambda = grid))
lasso.fit$bestTune
##    alpha   lambda
## 27     1 14.17474
lasso.preds <- lasso.fit %>% 
  predict(college.test)
lasso.r2 <- caret::R2(college.test$Apps, lasso.preds)
lasso.rmse <- RMSE(college.test$Apps, lasso.preds)
lasso.rmse
## [1] 1289.071

The lasso gives a test RMSE of about 1,289, with three coefficients set to zero: Enroll, Personal, and S.F.Ratio.

coef(lasso.fit$finalModel, lasso.fit$bestTune$lambda)
## 18 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -333.42729282
## PrivateYes  -623.90503359
## Accept         1.29807112
## Enroll         .         
## Top10perc     34.67284704
## Top25perc     -4.64574511
## F.Undergrad    0.01995245
## P.Undergrad    0.01834171
## Outstate      -0.02828174
## Room.Board     0.14140433
## Books          0.03705017
## Personal       .         
## PhD           -7.65561384
## Terminal      -2.99825994
## S.F.Ratio      .         
## perc.alumni   -6.92373141
## Expend         0.05835990
## Grad.Rate      6.35105420
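
The count of non-zero estimates can also be checked programmatically; a small sketch, assuming the objects above:

# Non-zero coefficients, excluding the intercept: expect 14 of the
# 17 predictors (Enroll, Personal, and S.F.Ratio are zeroed out).
lasso.coefs <- coef(lasso.fit$finalModel, lasso.fit$bestTune$lambda)
sum(lasso.coefs[-1, 1] != 0)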
  (e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
pcr.fit <- train(Apps~., data = college.train, method = "pcr", 
                 scale = TRUE, trControl = ctrl,
                 metric = "RMSE", tuneLength = 17)
plot(pcr.fit)

pcr.fit$bestTune
##    ncomp
## 16    16

The principal components model attains its smallest cross-validation error at M = 16 components. With nearly all 17 components retained, this is not far from ordinary least squares. The PCR model explains over 99% of the variation in the predictors and about 93% of the variation in Apps.

summary(pcr.fit$finalModel)
## Data:    X dimension: 545 17 
##  Y dimension: 545 1
## Fit method: svdpc
## Number of components considered: 16
## TRAINING: % variance explained
##           1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X          31.959    57.64    64.52    70.22    75.58    80.49    84.28
## .outcome    9.532    78.39    78.84    81.26    86.10    87.56    88.25
##           8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X           87.86    90.94     93.28     95.26     96.94     97.98     98.96
## .outcome    88.76    89.33     89.55     89.56     89.57     89.57     89.58
##           15 comps  16 comps
## X            99.48     99.85
## .outcome     89.80     92.57
pcr.preds <- pcr.fit %>% 
  predict(college.test)
pcr.r2 <- caret::R2(college.test$Apps, pcr.preds)
pcr.rmse <- RMSE(college.test$Apps, pcr.preds)
pcr.rmse
## [1] 1295.164
  (f) Fit a partial least squares model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
pls.fit <- train(Apps~., data = college.train, method = "pls", metric = "RMSE",
                 scale = TRUE, trControl = ctrl, tuneLength = 10)
plot(pls.fit)

pls.fit$bestTune
##   ncomp
## 9     9

The Partial Least Squares model gives us the lowest cross-validation error when M = 9 components. This captures 84% of the variation in the predictors and 93% of the variation in Apps.

summary(pls.fit$finalModel)
## Data:    X dimension: 545 17 
##  Y dimension: 545 1
## Fit method: oscorespls
## Number of components considered: 9
## TRAINING: % variance explained
##           1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X           26.67    50.02    62.88    66.84    70.57    73.77    77.66
## .outcome    81.90    86.19    89.97    90.88    91.93    92.69    92.80
##           8 comps  9 comps
## X           81.40    83.94
## .outcome    92.82    92.85
pls.preds <- pls.fit %>% 
  predict(college.test)
pls.r2 <- caret::R2(college.test$Apps, pls.preds)
pls.rmse <- RMSE(college.test$Apps, pls.preds)
pls.rmse
## [1] 1245.394
  (g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

    Of the five approaches, the linear and partial least squares models produce the lowest test errors, and apart from ridge regression the differences are small. I would suggest the PLS model because it uses only 9 components, making it more parsimonious than the full linear model. Based on the test R\(^2\), the PLS model fits the data well, explaining about 93% of the variation in Apps.

library(scales)      # percent()
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
data.frame(Model = c("Linear Regression", "Ridge Regression", "Lasso", 
                     "Principal Components", "Partial Least Squares"),
           RMSE = c(lm.rmse, ridge.rmse, lasso.rmse, pcr.rmse, pls.rmse),
           RSquared = percent(c(lm.r2, ridge.r2, lasso.r2, pcr.r2, pls.r2), accuracy = 0.01)) %>% 
  arrange(RMSE) %>%
  kable(format.args = list(big.mark = ","), align = c('l', 'c', 'c'), digits = 2) %>% 
  kable_styling(bootstrap_options = c("striped", "hover"))
| Model                 |   RMSE   | RSquared |
|:----------------------|:--------:|:--------:|
| Linear Regression     | 1,238.52 |  92.71%  |
| Partial Least Squares | 1,245.39 |  92.66%  |
| Lasso                 | 1,289.07 |  92.27%  |
| Principal Components  | 1,295.16 |  92.18%  |
| Ridge Regression      | 1,665.93 |  88.01%  |

Question 11

We will now try to predict per capita crime rate in the Boston data set.

  (a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
library(MASS)
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
sum(is.na(Boston))
## [1] 0
set.seed(11)
ind <- Boston$crim %>% 
  createDataPartition(p = 0.7, list=FALSE)
boston.train <- Boston[ind,]
boston.test <- Boston[-ind,]
ctrl <- trainControl("repeatedcv", number = 10, repeats = 3)
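
Before modeling, it is worth a quick look at the response (an exploratory sketch, not in the original output): per capita crime rate is heavily right-skewed, which helps explain the modest test R\(^2\) values reported below.

# Distribution of the response on the training set.
summary(boston.train$crim)
hist(boston.train$crim, breaks = 40,
     main = "Training set: per capita crime rate", xlab = "crim")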

Stepwise Selection

lm.step <- train(crim ~ ., data = boston.train, 
                 method = "leapSeq", metric = "RMSE",
                 tuneGrid = data.frame(nvmax = 1:11),
                 trControl = ctrl)
plot(lm.step)

Using stepwise selection with subset sizes capped at nvmax = 11, the model with 11 predictors minimizes the cross-validation error; note that, with 13 predictors available, the 12- and 13-variable models were never considered. The 11 selected variables are zn, indus, nox, rm, dis, rad, tax, ptratio, black, lstat, and medv.

min <- lm.step$bestTune$nvmax
coef(lm.step$finalModel, min)
##   (Intercept)            zn         indus           nox            rm 
##  18.803005844   0.049895785  -0.077847103 -11.566222458   0.498400923 
##           dis           rad           tax       ptratio         black 
##  -1.126021669   0.641737854  -0.004108639  -0.328138847  -0.003503293 
##         lstat          medv 
##   0.111945440  -0.248228511
lm.preds <- lm.step %>% 
  predict(boston.test)
lm.r2 <- caret::R2(boston.test$crim, lm.preds)
lm.rmse <- RMSE(boston.test$crim, lm.preds)
lm.rmse
## [1] 5.644116

Ridge Regression

grid <- 10^seq(10, -2, length = 100)
ridge.fit <- train(crim ~ ., data = boston.train, method = "glmnet",
                   trControl = ctrl, metric = "RMSE",
                   tuneGrid = expand.grid(alpha = 0, lambda = grid))
ridge.fit$bestTune
##    alpha   lambda
## 18     0 1.149757
coef(ridge.fit$finalModel, ridge.fit$bestTune$lambda)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept)  5.450024251
## zn           0.028937374
## indus       -0.081669458
## chas        -1.149807159
## nox         -2.645270852
## rm           0.247099570
## age          0.005135483
## dis         -0.610186466
## rad          0.396576915
## tax          0.005317767
## ptratio     -0.093691841
## black       -0.005372294
## lstat        0.132609596
## medv        -0.135318917
ridge.preds <- ridge.fit %>% 
  predict(boston.test)
ridge.r2 <- caret::R2(boston.test$crim, ridge.preds)
ridge.rmse <- RMSE(boston.test$crim, ridge.preds)
ridge.rmse
## [1] 5.563703

Using the best \(\lambda\), the ridge model reduces the test error to 5.56.

The Lasso

lasso.fit <- train(crim ~ ., data = boston.train, method = "glmnet",
                   trControl = ctrl, metric = "RMSE",
                   tuneGrid = expand.grid(alpha = 1, lambda = grid))
lasso.fit$bestTune
##    alpha   lambda
## 17     1 0.869749
coef(lasso.fit$finalModel, lasso.fit$bestTune$lambda)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept) -1.151548399
## zn           .          
## indus        .          
## chas         .          
## nox          .          
## rm           .          
## age          .          
## dis         -0.003138506
## rad          0.487250940
## tax          .          
## ptratio      .          
## black        .          
## lstat        0.119299789
## medv        -0.050643562
lasso.preds <- lasso.fit %>% 
  predict(boston.test)
lasso.r2 <- caret::R2(boston.test$crim, lasso.preds)
lasso.rmse <- RMSE(boston.test$crim, lasso.preds)
lasso.rmse
## [1] 5.673909

The lasso keeps only four variables (dis, rad, lstat, and medv) and gives a test error of 5.67.

Elastic Net Regression

e.fit <- train(crim ~ ., data = boston.train, method = "glmnet",
               trControl = ctrl, metric = "RMSE")
e.fit$bestTune
##   alpha    lambda
## 8     1 0.1150005
coef(e.fit$finalModel, e.fit$bestTune$lambda)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept)  9.234532624
## zn           0.032256976
## indus       -0.063272348
## chas        -0.840476025
## nox         -2.982001590
## rm           .          
## age          .          
## dis         -0.663485627
## rad          0.540182929
## tax          .          
## ptratio     -0.141013003
## black       -0.003244753
## lstat        0.100814473
## medv        -0.153553125
e.preds <- e.fit %>% 
  predict(boston.test)
e.r2 <- caret::R2(boston.test$crim, e.preds)
e.rmse <- RMSE(boston.test$crim, e.preds)
e.rmse
## [1] 5.627802

With caret's small default tuning grid, the elastic net selects \(\alpha = 1\), i.e., a lasso with a smaller \(\lambda\) than before; it decreases the test error to 5.63 with three coefficients (rm, age, and tax) set to zero.
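
Because `train()` was left on its small default grid, a fuller joint search over \(\alpha\) and \(\lambda\) is a natural follow-up; a sketch (slower to run), reusing the `grid` defined above:

# Tune the mixing parameter alpha and the penalty lambda jointly.
e.grid <- expand.grid(alpha = seq(0, 1, by = 0.1), lambda = grid)
e.fit2 <- train(crim ~ ., data = boston.train, method = "glmnet",
                trControl = ctrl, metric = "RMSE", tuneGrid = e.grid)
e.fit2$bestTune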

Principal Components Regression

pcr.fit <- train(crim ~ ., data = boston.train, method = "pcr", 
                 scale = TRUE, trControl = ctrl, metric = "RMSE",
                 tuneLength = 11)
plot(pcr.fit)

pcr.fit$bestTune
##   ncomp
## 9     9

The optimal number of principal components is 9, which explains about 96% of the variation in the predictors but only 43% of the variation in crim.

summary(pcr.fit$finalModel)
## Data:    X dimension: 356 13 
##  Y dimension: 356 1
## Fit method: svdpc
## Number of components considered: 9
## TRAINING: % variance explained
##           1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X           46.98    60.76    69.91    76.62    83.23    88.25    91.31
## .outcome    31.68    31.92    38.97    39.04    39.05    39.65    39.69
##           8 comps  9 comps
## X           93.61    95.50
## .outcome    43.16    43.37
pcr.preds <- pcr.fit %>% 
  predict(boston.test)
pcr.r2 <- caret::R2(boston.test$crim, pcr.preds)
pcr.rmse <- RMSE(boston.test$crim, pcr.preds)
pcr.rmse
## [1] 5.728774

Partial Least Squares

pls.fit <- train(crim ~ ., data = boston.train, method = "pls", metric = "RMSE",
                 scale = TRUE, trControl = ctrl, tuneLength = 11)
plot(pls.fit)

pls.fit$bestTune
##   ncomp
## 5     5

The Partial Least Squares model has the lowest cross-validation error when M = 5 components. This captures 76% of the variation in the predictors and 46% of the variation in crim.

summary(pls.fit$finalModel)
## Data:    X dimension: 356 13 
##  Y dimension: 356 1
## Fit method: oscorespls
## Number of components considered: 5
## TRAINING: % variance explained
##           1 comps  2 comps  3 comps  4 comps  5 comps
## X           46.62    55.94    62.24    71.95    76.29
## .outcome    34.94    42.05    44.45    45.31    45.76
pls.preds <- pls.fit %>% 
  predict(boston.test)
pls.r2 <- caret::R2(boston.test$crim, pls.preds)
pls.rmse <- RMSE(boston.test$crim, pls.preds)
pls.rmse
## [1] 5.625681
  (b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.

    Ridge regression and partial least squares give the lowest test errors. I would suggest the PLS model because it is simpler, using only 5 components. The overall test R\(^2\) is relatively low (around 43%), which suggests that a non-linear approach may provide a better fit.

data.frame(Model = c("Stepwise Selection", "Ridge Regression", "Lasso", "Elastic Net Regression", 
                     "Principal Components Regression", "Partial Least Squares"),
           RMSE = c(lm.rmse, ridge.rmse, lasso.rmse, e.rmse, pcr.rmse, pls.rmse),
           RSquared = percent(c(lm.r2, ridge.r2, lasso.r2, e.r2, pcr.r2, pls.r2), accuracy = 0.01)) %>%
  arrange(RMSE) %>% 
  kable(align = c('l', 'c', 'c'), digits = 2) %>% 
  kable_styling(bootstrap_options = c("striped", "hover"))
| Model                           | RMSE | RSquared |
|:--------------------------------|:----:|:--------:|
| Ridge Regression                | 5.56 |  43.49%  |
| Partial Least Squares           | 5.63 |  43.40%  |
| Elastic Net Regression          | 5.63 |  43.06%  |
| Stepwise Selection              | 5.64 |  43.32%  |
| Lasso                           | 5.67 |  40.40%  |
| Principal Components Regression | 5.73 |  41.63%  |
  (c) Does your chosen model involve all of the features in the data set? Why or why not?

    The PLS model minimizes the test error with only 5 components. In a sense, though, it still involves every feature: each component is a linear combination of all 13 predictors, so the features enter the model in compressed form rather than being dropped (see the loadings sketch after the summary below). The 5 components explain 76% of the variation in the predictors and 46% of the variation in crim.

summary(pls.fit$finalModel)
## Data:    X dimension: 356 13 
##  Y dimension: 356 1
## Fit method: oscorespls
## Number of components considered: 5
## TRAINING: % variance explained
##           1 comps  2 comps  3 comps  4 comps  5 comps
## X           46.62    55.94    62.24    71.95    76.29
## .outcome    34.94    42.05    44.45    45.31    45.76
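
To make the "compressed form" point concrete, the loading matrix of the fitted model can be inspected; a hedged sketch, assuming `finalModel` is the usual `pls::mvr` object that stores its loadings:

# Rows are the 13 features, columns the 5 components; the (generally)
# non-zero entries show that every feature contributes to each component.
round(unclass(pls.fit$finalModel$loadings)[, 1:5], 3)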