Q2 For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

**(a) The lasso, relative to least squares, is:**
  i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

  ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

  iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

  iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

The correct answer for part (a) is: iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

Explanation: The lasso’s advantage over least squares is rooted in the bias-variance trade-off. The lasso is less flexible than least squares: as λ increases, the coefficient estimates are shrunk toward zero, which reduces variance at the cost of some bias. When the least squares estimates have excessively high variance, the lasso can trade a small increase in bias for a large reduction in variance, and this can yield more accurate predictions. In addition, the lasso performs variable selection, which makes the fitted model easier to interpret than methods like ridge regression.
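
As an illustration (a minimal simulation sketch, not part of the original answer): with many irrelevant predictors and a small sample, the lasso's variance reduction should more than offset its added bias, so its test MSE typically beats least squares.

library(glmnet)
set.seed(1)
n <- 60; p <- 50
x <- matrix(rnorm(n * p), n, p)
y <- 2 * x[, 1] + rnorm(n)                    # only the first predictor matters
x.new <- matrix(rnorm(1000 * p), 1000, p)
y.new <- 2 * x.new[, 1] + rnorm(1000)
fit.ols <- lm(y ~ x)
mean((cbind(1, x.new) %*% coef(fit.ols) - y.new)^2)     # OLS test MSE (high variance)
cv <- cv.glmnet(x, y, alpha = 1)
mean((predict(cv, x.new, s = "lambda.min") - y.new)^2)  # lasso test MSE (typically lower)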

**(b) Ridge regression, relative to least squares, is:**
  i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

  ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

  iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

  iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

The correct answer for part (b) is: iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

Explanation: Like the lasso, ridge regression’s advantage over least squares is rooted in the bias-variance trade-off. As λ increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias. When there are small changes in the training data, the least squares coefficient estimates can change substantially, i.e. they have high variance; ridge regression can still perform well in that setting by trading a small increase in bias for a large decrease in variance. Hence ridge regression works best in situations where the least squares estimates have high variance. The main difference between ridge and the lasso is that the lasso also performs variable selection, which makes its fitted model easier to interpret.
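
To make the variance argument concrete (a rough sketch, not from the text): refitting on bootstrap resamples and comparing the spread of one coefficient under least squares and under ridge (λ = 10 is an arbitrary illustrative value) should show much less variability for ridge.

library(MASS)   # for lm.ridge
set.seed(1)
n <- 40
x1 <- rnorm(n); x2 <- x1 + rnorm(n, sd = 0.1)   # nearly collinear predictors
y <- x1 + x2 + rnorm(n)
coefs <- replicate(200, {
  i <- sample(n, replace = TRUE)                # bootstrap resample
  c(ols   = coef(lm(y[i] ~ x1[i] + x2[i]))[2],
    ridge = coef(lm.ridge(y[i] ~ x1[i] + x2[i], lambda = 10))[2])
})
apply(coefs, 1, sd)   # the ridge row should be markedly smaller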

**(c) Non-linear methods, relative to least squares, are:**
  i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

  ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

  iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

  iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

The correct answer for part (c) is: ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

Explanation: Non-linear methods are more flexible than least squares. When the true relationship between the response and the predictors is substantially non-linear, a linear fit suffers from high bias; a more flexible method can reduce that bias enough to outweigh its higher variance, giving better predictions.
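
A quick sketch of part (c)’s logic (illustrative only): when the truth is non-linear, a flexible smoother trades a little variance for a large drop in bias relative to a straight-line fit.

set.seed(1)
x <- runif(200, 0, 10); y <- sin(x) + rnorm(200, sd = 0.3)
x.new <- runif(1000, 0, 10); y.new <- sin(x.new) + rnorm(1000, sd = 0.3)
fit.lin <- lm(y ~ x)
fit.ss  <- smooth.spline(x, y)
mean((predict(fit.lin, data.frame(x = x.new)) - y.new)^2)  # linear fit: dominated by bias
mean((predict(fit.ss, x.new)$y - y.new)^2)                 # spline: should be much lower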

Q9 In this exercise, we will predict the number of applications received using the other variables in the “College” data set.


library(ISLR)
data(College)
str(College)
## 'data.frame':    777 obs. of  18 variables:
##  $ Private    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Apps       : num  1660 2186 1428 417 193 ...
##  $ Accept     : num  1232 1924 1097 349 146 ...
##  $ Enroll     : num  721 512 336 137 55 158 103 489 227 172 ...
##  $ Top10perc  : num  23 16 22 60 16 38 17 37 30 21 ...
##  $ Top25perc  : num  52 29 50 89 44 62 45 68 63 44 ...
##  $ F.Undergrad: num  2885 2683 1036 510 249 ...
##  $ P.Undergrad: num  537 1227 99 63 869 ...
##  $ Outstate   : num  7440 12280 11250 12960 7560 ...
##  $ Room.Board : num  3300 6450 3750 5450 4120 ...
##  $ Books      : num  450 750 400 450 800 500 500 450 300 660 ...
##  $ Personal   : num  2200 1500 1165 875 1500 ...
##  $ PhD        : num  70 29 53 92 76 67 90 89 79 40 ...
##  $ Terminal   : num  78 30 66 97 72 73 93 100 84 41 ...
##  $ S.F.Ratio  : num  18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
##  $ perc.alumni: num  12 16 30 37 2 11 26 37 23 15 ...
##  $ Expend     : num  7041 10527 8735 19016 10922 ...
##  $ Grad.Rate  : num  60 56 54 59 15 55 63 73 80 52 ...
pairs(College)

(a) Split the data into a training set and a test set.
library(ISLR)
data(College)
set.seed(11)
train = sample(1:dim(College)[1], dim(College)[1] / 2)
test <- -train
College.train <- College[train, ]
College.test <- College[test, ]

(b) Fit a linear model using least squares on the training set, and report the test error obtained.
fit.lm <- lm(Apps ~ ., data = College.train)
pred.lm <- predict(fit.lm, College.test)
mean((pred.lm - College.test$Apps)^2)
## [1] 1026096

The test MSE for the linear model is 1026096 (about 1.03 × 10^6).

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-2
train.mat <- model.matrix(Apps ~ ., data = College.train)
test.mat <- model.matrix(Apps ~ ., data = College.test)
grid <- 10 ^ seq(4, -2, length = 100)
fit.ridge <- glmnet(train.mat, College.train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
cv.ridge <- cv.glmnet(train.mat, College.train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
bestlam.ridge <- cv.ridge$lambda.min
bestlam.ridge
## [1] 0.01
pred.ridge <- predict(fit.ridge, s = bestlam.ridge, newx = test.mat)
mean((pred.ridge - College.test$Apps)^2)
## [1] 1026069

The test MSE for ridge regression (1026069) is slightly lower than for least squares (1026096), though the difference is negligible: the selected λ = 0.01 is the smallest value on the grid, so the ridge fit is essentially unpenalized.
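
As a quick sanity check on the chosen λ (a small optional sketch), the cross-validation curve can be plotted; cv.glmnet plots error against log(λ):

plot(cv.ridge)                            # CV error as a function of log(lambda)
abline(v = log(bestlam.ridge), lty = 2)   # mark the selected lambda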

(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
fit.lasso <- glmnet(train.mat, College.train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
cv.lasso <- cv.glmnet(train.mat, College.train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
bestlam.lasso <- cv.lasso$lambda.min
bestlam.lasso
## [1] 0.01
pred.lasso <- predict(fit.lasso, s = bestlam.lasso, newx = test.mat)
mean((pred.lasso - College.test$Apps)^2)
## [1] 1026036

The test MSE for the lasso (1026036) is also slightly lower than for least squares. Because the selected λ = 0.01 is again the smallest value on the grid, hardly any shrinkage occurs, and as the coefficient listing below shows, all 17 predictors retain non-zero estimates.

predict(fit.lasso, s = bestlam.lasso, type = "coefficients")
## 19 x 1 sparse Matrix of class "dgCMatrix"
##                        s1
## (Intercept)   37.86520037
## (Intercept)    .         
## PrivateYes  -551.14946609
## Accept         1.74980812
## Enroll        -1.36005786
## Top10perc     65.55655577
## Top25perc    -22.52640339
## F.Undergrad    0.10181853
## P.Undergrad    0.01789131
## Outstate      -0.08706371
## Room.Board     0.15384585
## Books         -0.12227313
## Personal       0.16194591
## PhD          -14.29638634
## Terminal      -1.03118224
## S.F.Ratio      4.47956819
## perc.alumni   -0.45456280
## Expend         0.05618050
## Grad.Rate      9.07242834
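
The exercise also asks for the number of non-zero coefficient estimates; that can be read off the sparse matrix directly (a one-line check on the output above):

lasso.coef <- predict(fit.lasso, s = bestlam.lasso, type = "coefficients")
sum(lasso.coef != 0) - 1  # subtract 1 for the intercept; here all 17 predictors are non-zero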

(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
fit.pcr <- pcr(Apps ~ ., data = College.train, scale = TRUE, validation = "CV")
validationplot(fit.pcr, val.type = "MSEP")

pred.pcr <- predict(fit.pcr, College.test, ncomp = 10)
mean((pred.pcr - College.test$Apps)^2)
## [1] 1867486

With M = 10 components (chosen from the validation plot), the test MSE for PCR is 1867486, noticeably higher than for least squares.
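
Rather than reading M off the validation plot by eye, the CV error can be inspected numerically (a sketch; the indexing assumes the pls package's usual RMSEP layout, where position 1 is the intercept-only model):

cv.err <- RMSEP(fit.pcr)$val["CV", 1, ]   # CV RMSE for 0, 1, ..., 17 components
which.min(cv.err) - 1                     # best M (subtract the intercept-only entry)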

(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
fit.pls <- plsr(Apps ~ ., data = College.train, scale = TRUE, validation = "CV")
validationplot(fit.pls, val.type = "MSEP")

pred.pls <- predict(fit.pls, College.test, ncomp = 10)
mean((pred.pls - College.test$Apps)^2)
## [1] 1031287

Here, with M = 10 components, the test MSE for PLS (1031287) is comparable to, though slightly higher than, that of least squares.

(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

To compare the results obtained above, we compute the test R² for all models.

test.avg <- mean(College.test$Apps)
lm.r2 <- 1 - mean((pred.lm - College.test$Apps)^2) / mean((test.avg - College.test$Apps)^2)
ridge.r2 <- 1 - mean((pred.ridge - College.test$Apps)^2) / mean((test.avg - College.test$Apps)^2)
lasso.r2 <- 1 - mean((pred.lasso - College.test$Apps)^2) / mean((test.avg - College.test$Apps)^2)
pcr.r2 <- 1 - mean((pred.pcr - College.test$Apps)^2) / mean((test.avg - College.test$Apps)^2)
pls.r2 <- 1 - mean((pred.pls - College.test$Apps)^2) / mean((test.avg - College.test$Apps)^2)
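
The five nearly identical lines above can be condensed with a small helper (a sketch; test_r2 is a hypothetical name, not from the text):

test_r2 <- function(pred, actual = College.test$Apps) {
  1 - mean((pred - actual)^2) / mean((mean(actual) - actual)^2)
}
sapply(list(lm = pred.lm, ridge = pred.ridge, lasso = pred.lasso,
            pcr = pred.pcr, pls = pred.pls), test_r2)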

The test R² is roughly 0.9 for least squares, ridge, the lasso, and PLS, and noticeably lower for PCR, mirroring the test MSEs above. In other words, all models except PCR predict the number of college applications with high accuracy, and among those four the differences in test error are negligible.

Q11 We will now try to predict the per capita crime rate in the “Boston” data set.

(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

require(MASS); require(tidyverse); require(caret); require(leaps)
## Loading required package: MASS
## Loading required package: tidyverse
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.2     ✓ dplyr   1.0.6
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x tidyr::expand() masks Matrix::expand()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x tidyr::pack()   masks Matrix::pack()
## x dplyr::select() masks MASS::select()
## x tidyr::unpack() masks Matrix::unpack()
## Loading required package: caret
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
## The following object is masked from 'package:pls':
## 
##     R2
## Loading required package: leaps
set.seed(1)
data('Boston')

inTrain <- createDataPartition(Boston$crim, p = 0.6, list = FALSE)

x_train <- Boston[inTrain, -1]
y_train <- Boston[inTrain, 1]
x_test <- Boston[-inTrain, -1]
y_test <- Boston[-inTrain, 1]

best_subs <- regsubsets(x = x_train, y = y_train, nvmax = 13)

fit_summary <- summary(best_subs)

require(ggplot2); require(ggthemes)
## Loading required package: ggthemes
tibble(MSE = fit_summary$rss / nrow(x_train),
       Cp = fit_summary$cp,
       BIC = fit_summary$bic,
       AdjustedR2 = fit_summary$adjr2) %>%
    mutate(id = row_number()) %>%
    gather(Metric, value, -id) %>%
    ggplot(aes(id, value, col = Metric)) +
    geom_line() + geom_point() + ylab('') + 
    xlab('Number of Variables Used') + 
    facet_wrap(~ Metric, scales = 'free') +
    theme_tufte() + scale_x_continuous(breaks = 1:13)

scales = c('r2', 'adjr2', 'bic', 'Cp')
par(mfrow = c(2,2))

for (scale in scales) {
    plot(best_subs, scale = scale)
}

par(mfrow = c(1,1))
test_errors <- rep(NA,13)

test.mat <- model.matrix(crim ~ ., data = Boston[-inTrain,])

for (i in 1:13){
    coefs <- coef(best_subs, id=i)
    pred <- test.mat[,names(coefs)]%*%coefs
    test_errors[i] <- sqrt(mean((y_test - pred)^2))
}
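
As an aside, the manual coefficient multiplication in the loop above can be wrapped in a reusable helper (a sketch; predict_regsubsets is a hypothetical name, and the formula is passed explicitly because best_subs was fit through the x/y interface):

# Hypothetical helper: predictions from a regsubsets fit of a given size.
predict_regsubsets <- function(object, formula, newdata, id) {
    mat <- model.matrix(formula, newdata)   # design matrix for the new data
    coefs <- coef(object, id = id)          # coefficients of the size-id model
    drop(mat[, names(coefs)] %*% coefs)     # linear predictor
}
# e.g.: predict_regsubsets(best_subs, crim ~ ., Boston[-inTrain, ], id = 3)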

tibble(RMSE = test_errors) %>% 
    mutate(id = row_number()) %>% 
    ggplot(aes(id, RMSE)) +
    geom_line() + geom_point() + 
    xlab('Number of Variables Used') + 
    ggtitle('RMSE on the test set') + 
    theme_tufte() + 
    scale_x_continuous(breaks = 1:13)

(regsubset_info <- min(test_errors))
## [1] 5.811256
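
The subset size attaining this minimum can also be reported directly (using the test_errors vector from the loop above):

which.min(test_errors)  # number of predictors in the best-performing subset
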
coef(best_subs, id = 1:5)
## [[1]]
## (Intercept)         rad 
##  -2.4448623   0.6546826 
## 
## [[2]]
## (Intercept)         rad        medv 
##   2.6574618   0.5769903  -0.1963036 
## 
## [[3]]
## (Intercept)         rad     ptratio        medv 
##  10.8303135   0.6042171  -0.4067664  -0.2351904 
## 
## [[4]]
## (Intercept)          zn         dis         rad        medv 
##  5.83377235  0.05509227 -0.72924037  0.52407648 -0.21866690 
## 
## [[5]]
## (Intercept)          zn          rm         dis         rad        medv 
## -1.70826140  0.05299909  1.50466724 -0.73918276  0.51079178 -0.29652993
lasso_fit <- train(x = x_train, y = y_train,
                   method = 'glmnet',
                   trControl = trainControl(method = 'cv', number = 10),
                   tuneGrid = expand.grid(alpha = 1,
                                          lambda = seq(0.001, 1, length.out = 100)),
                   preProcess = c('center', 'scale'))

(lasso_info <- postResample(predict(lasso_fit, x_test), y_test))
##     RMSE Rsquared      MAE 
## 5.793303 0.433477 2.603667
coef(lasso_fit$finalModel, lasso_fit$bestTune$lambda)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                     s1
## (Intercept)  3.7575062
## zn           .        
## indus        .        
## chas         .        
## nox          .        
## rm           .        
## age          .        
## dis         -0.1856880
## rad          4.2982107
## tax          .        
## ptratio      .        
## black        .        
## lstat        0.1178773
## medv        -1.0546456
lasso_fit$bestTune
##    alpha    lambda
## 84     1 0.8385455
plot(lasso_fit)

plot(varImp(lasso_fit))

ridge_fit <- train(x_train, y_train,
                   method = 'glmnet',
                   trControl = trainControl(method = 'cv', number = 10),
                   tuneGrid = expand.grid(alpha = 0,
                                          lambda = seq(0, 1e2, length.out = 50)),
                   preProcess = c('center', 'scale'))
(ridge_info <- postResample(predict(ridge_fit, x_test), y_test))
##      RMSE  Rsquared       MAE 
## 5.7137087 0.4498421 2.7943386
coef(ridge_fit$finalModel, ridge_fit$bestTune$lambda)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                      s1
## (Intercept)  3.75750624
## zn           0.54011923
## indus       -0.34230036
## chas        -0.30633776
## nox          0.14724495
## rm           0.44593736
## age          0.10933511
## dis         -0.98108462
## rad          2.85345034
## tax          1.16015712
## ptratio     -0.09232032
## black       -0.31532518
## lstat        0.64017157
## medv        -1.44312967
ridge_fit$bestTune
##   alpha   lambda
## 2     0 2.040816
plot(ridge_fit)

plot(varImp(ridge_fit))

glmnet_fit <- train(x_train, y_train,
                    method = 'glmnet',
                    trControl = trainControl(method = 'cv', number = 10),
                    tuneGrid = expand.grid(alpha = seq(0, 1, length.out = 6),
                                           lambda = seq(0, 1e2, length.out = 20)),
                    preProcess = c('center', 'scale'))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
(glmnet_info <- postResample(predict(glmnet_fit, x_test), y_test))
##      RMSE  Rsquared       MAE 
## 5.7504399 0.4478022 2.8789035
coef(object = glmnet_fit$finalModel, s = glmnet_fit$bestTune$lambda)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                      s1
## (Intercept)  3.75750624
## zn           0.79927708
## indus       -0.56669351
## chas        -0.32661514
## nox         -0.32145785
## rm           0.70164835
## age          0.03721032
## dis         -1.61848336
## rad          4.01878688
## tax          0.60120176
## ptratio     -0.37415233
## black       -0.09300755
## lstat        0.54462549
## medv        -2.16987001
glmnet_fit$bestTune
##   alpha lambda
## 1     0      0
plot(glmnet_fit)

plot(varImp(glmnet_fit))

pcr_fit <- train(x_train, y_train,
                 method = 'pcr',
                 trControl = trainControl(method = 'cv', number = 10),
                 tuneGrid = expand.grid(ncomp = 1:13), 
                 preProcess = c('center', 'scale'))

(pcr_info <- postResample(predict(pcr_fit, x_test), y_test))
##      RMSE  Rsquared       MAE 
## 5.8816610 0.4296175 3.0845598
pcr_fit
## Principal Component Analysis 
## 
## 306 samples
##  13 predictor
## 
## Pre-processing: centered (13), scaled (13) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 275, 277, 276, 275, 275, 275, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE     
##    1     6.560574  0.4800110  3.843889
##    2     6.559955  0.4824507  3.879334
##    3     6.175577  0.5954423  3.134008
##    4     6.166002  0.6001710  3.114538
##    5     6.165989  0.6016351  3.099134
##    6     6.164470  0.5941572  3.241512
##    7     6.171809  0.5937923  3.231744
##    8     6.030915  0.6217769  3.231019
##    9     6.056110  0.6186845  3.265161
##   10     6.064727  0.6201973  3.238258
##   11     6.044505  0.6234399  3.204063
##   12     6.173310  0.5958234  3.406977
##   13     6.116109  0.6072988  3.293449
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 8.
plot(pcr_fit)

plot(varImp(pcr_fit))

pls_fit <- train(x_train, y_train,
                 method = 'pls',
                 trControl = trainControl(method = 'cv', number = 10),
                 tuneGrid = expand.grid(ncomp = 1:13), 
                 preProcess = c('center', 'scale'))

(pls_info <- postResample(predict(pls_fit, x_test), y_test))
##      RMSE  Rsquared       MAE 
## 5.7361826 0.4505769 2.8789603
pls_fit
## Partial Least Squares 
## 
## 306 samples
##  13 predictor
## 
## Pre-processing: centered (13), scaled (13) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 275, 275, 274, 275, 276, 276, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE     
##    1     6.201189  0.5251744  3.645062
##    2     5.818300  0.6334711  3.019228
##    3     5.818943  0.6391877  3.147983
##    4     5.820687  0.6344493  3.194376
##    5     5.855419  0.6352588  3.196379
##    6     5.864940  0.6351171  3.216697
##    7     5.867013  0.6354134  3.224509
##    8     5.848427  0.6395298  3.186069
##    9     5.844713  0.6395603  3.188456
##   10     5.846431  0.6391123  3.187532
##   11     5.845136  0.6392478  3.186003
##   12     5.845142  0.6392653  3.185977
##   13     5.845154  0.6392627  3.185996
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 2.
plot(pls_fit)

(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
rbind(c(regsubset_info, NA, NA),
      lasso_info, 
      ridge_info,
      glmnet_info,
      pcr_info,
      pls_info)
##                 RMSE  Rsquared      MAE
##             5.811256        NA       NA
## lasso_info  5.793303 0.4334770 2.603667
## ridge_info  5.713709 0.4498421 2.794339
## glmnet_info 5.750440 0.4478022 2.878903
## pcr_info    5.881661 0.4296175 3.084560
## pls_info    5.736183 0.4505769 2.878960

Based on test RMSE, ridge regression performs best (RMSE ≈ 5.71), with PLS and the full elastic-net search close behind; the differences among the penalized methods are small, and all of them beat best subset selection and PCR on this split.

(c) Does your chosen model involve all of the features in the data set? Why or why not?

Yes. Ridge regression shrinks coefficients toward zero but never sets them exactly to zero, so the chosen model keeps all 13 predictors. (Had the lasso model been chosen instead, it would use only dis, rad, lstat, and medv, as its coefficient listing above shows.)