Q2 For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
**(a) The lasso, relative to least squares, is:**
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
The correct answer for part (a) is: iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
Explanation: The lasso’s advantage over least squares is rooted in the bias-variance trade-off. The lasso is a less flexible (shrunken) fit: when the least squares estimates have excessively high variance, the lasso can yield a large reduction in variance at the expense of a small increase in bias, and this trade-off produces more accurate predictions. In addition, the lasso performs variable selection, which makes the resulting model easier to interpret than, for example, a ridge regression fit.
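To illustrate this trade-off, here is a minimal sketch with purely illustrative simulated data (not part of the exercise): many irrelevant predictors make least squares highly variable, and a cross-validated lasso usually wins on test MSE.
# Illustrative sketch with simulated data: p = 50 predictors, only 5 of which
# matter, so least squares has high variance and the lasso's extra bias pays off.
library(glmnet)
set.seed(1)
n <- 100; p <- 50
x <- matrix(rnorm(n * p), n, p)
beta <- c(rep(1, 5), rep(0, p - 5))
y <- drop(x %*% beta) + rnorm(n)
x.new <- matrix(rnorm(n * p), n, p)
y.new <- drop(x.new %*% beta) + rnorm(n)
fit.ols <- lm(y ~ x)
pred.ols <- drop(cbind(1, x.new) %*% coef(fit.ols))
mean((pred.ols - y.new)^2)              # least squares test MSE
cv.las <- cv.glmnet(x, y, alpha = 1)    # lasso, lambda chosen by CV
pred.las <- predict(cv.las, s = "lambda.min", newx = x.new)
mean((pred.las - y.new)^2)              # typically clearly smaller here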
**(b) The ridge regression, relative to least squares, is:**
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
The correct answer for part (b) is: iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
Explanation: Like the lasso, ridge regression’s advantage over least squares is rooted in the bias-variance trade-off. As λ increases, the flexibility of the ridge regression fit decreases, leading to lower variance but higher bias. When a small change in the training data produces large changes in the least squares coefficient estimates (i.e., when least squares has high variance), ridge regression can still perform well by trading a small increase in bias for a large decrease in variance. Hence ridge regression works best in situations where the least squares estimates have high variance. The main difference between ridge and the lasso is that the lasso also performs variable selection, which makes its fitted models easier to interpret.
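The shrinkage effect of λ can be visualized directly. A minimal sketch (using the ISLR College data, which also appears in Q9 below): as λ increases, every ridge coefficient is pulled toward zero, i.e. the fit becomes less flexible.
# Sketch: ridge coefficient paths on the College data. As lambda grows,
# all coefficients shrink toward zero (less flexibility, lower variance).
library(ISLR); library(glmnet)
x <- model.matrix(Apps ~ ., data = College)[, -1]   # drop the intercept column
y <- College$Apps
ridge.path <- glmnet(x, y, alpha = 0)               # alpha = 0 gives ridge
plot(ridge.path, xvar = "lambda", label = TRUE)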
**(c) Non-linear methods, relative to least squares, are:**
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
The correct answer for part (c) is: ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
Explanation: Non-linear methods are more flexible than least squares, so they reduce bias by capturing structure a linear fit misses; they improve prediction accuracy whenever that decrease in bias outweighs the accompanying increase in variance.
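As a hedged illustration with simulated (hypothetical) data: when the true relationship is non-linear, a flexible smoothing spline lowers bias enough to beat a straight-line least squares fit on test data.
# Sketch with simulated data: the truth is non-linear, so the flexible fit wins.
set.seed(2)
x <- runif(200, -3, 3)
y <- sin(2 * x) + rnorm(200, sd = 0.3)
train <- sample(200, 100)
fit.lin <- lm(y ~ x, subset = train)            # rigid: straight line
fit.spl <- smooth.spline(x[train], y[train])    # flexible: smoothness chosen by GCV
pred.lin <- predict(fit.lin, data.frame(x = x[-train]))
pred.spl <- predict(fit.spl, x[-train])$y
mean((y[-train] - pred.lin)^2)   # large test MSE: the linear fit is badly biased
mean((y[-train] - pred.spl)^2)   # smaller test MSE despite the higher variance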
Q9 In this exercise, we will predict the number of applications received using the other variables in the “College” data set.
library(ISLR)
data(College)
str(College)
## 'data.frame': 777 obs. of 18 variables:
## $ Private : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Apps : num 1660 2186 1428 417 193 ...
## $ Accept : num 1232 1924 1097 349 146 ...
## $ Enroll : num 721 512 336 137 55 158 103 489 227 172 ...
## $ Top10perc : num 23 16 22 60 16 38 17 37 30 21 ...
## $ Top25perc : num 52 29 50 89 44 62 45 68 63 44 ...
## $ F.Undergrad: num 2885 2683 1036 510 249 ...
## $ P.Undergrad: num 537 1227 99 63 869 ...
## $ Outstate : num 7440 12280 11250 12960 7560 ...
## $ Room.Board : num 3300 6450 3750 5450 4120 ...
## $ Books : num 450 750 400 450 800 500 500 450 300 660 ...
## $ Personal : num 2200 1500 1165 875 1500 ...
## $ PhD : num 70 29 53 92 76 67 90 89 79 40 ...
## $ Terminal : num 78 30 66 97 72 73 93 100 84 41 ...
## $ S.F.Ratio : num 18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
## $ perc.alumni: num 12 16 30 37 2 11 26 37 23 15 ...
## $ Expend : num 7041 10527 8735 19016 10922 ...
## $ Grad.Rate : num 60 56 54 59 15 55 63 73 80 52 ...
pairs(College)
library(ISLR)
data(College)
set.seed(11)
train = sample(1:dim(College)[1], dim(College)[1] / 2)
test <- -train
College.train <- College[train, ]
College.test <- College[test, ]
fit.lm <- lm(Apps ~ ., data = College.train)
pred.lm <- predict(fit.lm, College.test)
mean((pred.lm - College.test$Apps)^2)
## [1] 1026096
The test MSE for the least squares fit is 1,026,096 (about 1.03 × 10^6).
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-2
train.mat <- model.matrix(Apps ~ ., data = College.train)
test.mat <- model.matrix(Apps ~ ., data = College.test)
grid <- 10 ^ seq(4, -2, length = 100)
fit.ridge <- glmnet(train.mat, College.train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
cv.ridge <- cv.glmnet(train.mat, College.train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
bestlam.ridge <- cv.ridge$lambda.min
bestlam.ridge
## [1] 0.01
pred.ridge <- predict(fit.ridge, s = bestlam.ridge, newx = test.mat)
mean((pred.ridge - College.test$Apps)^2)
## [1] 1026069
With the selected λ = 0.01, which is the smallest value in the grid, the ridge test MSE (1,026,069) is essentially identical to (in fact marginally lower than) the least squares test MSE.
fit.lasso <- glmnet(train.mat, College.train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
cv.lasso <- cv.glmnet(train.mat, College.train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
bestlam.lasso <- cv.lasso$lambda.min
bestlam.lasso
## [1] 0.01
pred.lasso <- predict(fit.lasso, s = bestlam.lasso, newx = test.mat)
mean((pred.lasso - College.test$Apps)^2)
## [1] 1026036
The lasso test MSE (1,026,036) is likewise essentially the same as the least squares test MSE, again because the selected λ is the smallest value in the grid. At this λ the lasso performs almost no shrinkage and keeps all 17 predictors with non-zero coefficients:
predict(fit.lasso, s = bestlam.lasso, type = "coefficients")
## 19 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 37.86520037
## (Intercept) .
## PrivateYes -551.14946609
## Accept 1.74980812
## Enroll -1.36005786
## Top10perc 65.55655577
## Top25perc -22.52640339
## F.Undergrad 0.10181853
## P.Undergrad 0.01789131
## Outstate -0.08706371
## Room.Board 0.15384585
## Books -0.12227313
## Personal 0.16194591
## PhD -14.29638634
## Terminal -1.03118224
## S.F.Ratio 4.47956819
## perc.alumni -0.45456280
## Expend 0.05618050
## Grad.Rate 9.07242834
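Note that for both ridge and the lasso the CV-selected λ (0.01) sits at the lower boundary of the hand-made grid, which is why both fits behave almost exactly like least squares. A minimal check, sketched below by reusing the objects above, is to let cv.glmnet construct its own λ sequence:
# Sketch: re-run CV letting glmnet build its own lambda path, to see whether the
# optimal penalty really is essentially zero rather than an artifact of the grid.
set.seed(11)
cv.lasso.default <- cv.glmnet(train.mat, College.train$Apps, alpha = 1)
cv.lasso.default$lambda.min     # CV-optimal lambda on glmnet's default path
plot(cv.lasso.default)          # CV error as a function of log(lambda)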
(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
fit.pcr <- pcr(Apps ~ ., data = College.train, scale = TRUE, validation = "CV")
validationplot(fit.pcr, val.type = "MSEP")
pred.pcr <- predict(fit.pcr, College.test, ncomp = 10)
mean((pred.pcr - College.test$Apps)^2)
## [1] 1867486
With M = 10 components, the PCR test MSE (1,867,486) is substantially higher than for least squares.
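The prediction above uses a hand-picked M = 10 rather than the CV-selected value. A minimal sketch (reusing fit.pcr from above) for reading off the M that minimizes the cross-validation error:
# Sketch: pick M as the number of components minimizing the CV RMSE.
cv.rmse <- RMSEP(fit.pcr)$val["CV", 1, ]   # CV RMSE for 0, 1, ..., 17 components
best.M <- which.min(cv.rmse) - 1           # first entry is the intercept-only model
best.M
pred.pcr.best <- predict(fit.pcr, College.test, ncomp = max(best.M, 1))
mean((pred.pcr.best - College.test$Apps)^2)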
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
fit.pls <- plsr(Apps ~ ., data = College.train, scale = TRUE, validation = "CV")
validationplot(fit.pls, val.type = "MSEP")
pred.pls <- predict(fit.pls, College.test, ncomp = 10)
mean((pred.pls - College.test$Apps)^2)
## [1] 1031287
With M = 10 components, the PLS test MSE (1,031,287) is close to, though slightly higher than, the least squares test MSE.
(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
To compare the results obtained above, we have to compute the test R2 for all models.
test.avg <- mean(College.test$Apps)
lm.r2 <- 1 - mean((pred.lm - College.test$Apps)^2) / mean((test.avg - College.test$Apps)^2)
ridge.r2 <- 1 - mean((pred.ridge - College.test$Apps)^2) / mean((test.avg - College.test$Apps)^2)
lasso.r2 <- 1 - mean((pred.lasso - College.test$Apps)^2) / mean((test.avg - College.test$Apps)^2)
pcr.r2 <- 1 - mean((pred.pcr - College.test$Apps)^2) / mean((test.avg - College.test$Apps)^2)
pls.r2 <- 1 - mean((pred.pls - College.test$Apps)^2) / mean((test.avg - College.test$Apps)^2)
So the test R2 is 0.9044281 for least squares, 0.9000536 for ridge, 0.8984123 for the lasso, 0.8127319 for PCR, and 0.9062579 for PLS. All models except PCR predict the number of college applications with comparably high accuracy, and the differences among least squares, ridge, the lasso, and PLS are small.
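For a quick visual comparison, a minimal sketch using the test R2 values computed above:
# Sketch: collect the test R^2 values and compare them in a single barplot.
r2.all <- c(OLS = lm.r2, Ridge = ridge.r2, Lasso = lasso.r2, PCR = pcr.r2, PLS = pls.r2)
barplot(r2.all, ylim = c(0, 1), ylab = "Test R-squared",
        main = "Test R-squared by method (College data)")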
Q11 We will now try to predict per capita crime rate in the “Boston” data set.
(a)Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression and PCR. Present and discuss results for the approaches that you consider.
require(MASS); require(tidyverse); require(caret); require(leaps)
## Loading required package: MASS
## Loading required package: tidyverse
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.2 ✓ dplyr 1.0.6
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x tidyr::expand() masks Matrix::expand()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x tidyr::pack() masks Matrix::pack()
## x dplyr::select() masks MASS::select()
## x tidyr::unpack() masks Matrix::unpack()
## Loading required package: caret
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
## The following object is masked from 'package:pls':
##
## R2
## Loading required package: leaps
set.seed(1)
data('Boston')
inTrain <- createDataPartition(Boston$crim, p = 0.6, list = FALSE)
x_train <- Boston[inTrain, -1]
y_train <- Boston[inTrain, 1]
x_test <- Boston[-inTrain, -1]
y_test <- Boston[-inTrain, 1]
best_subs <- regsubsets(x = x_train, y = y_train, nvmax = 13)
fit_summary <- summary(best_subs)
require(ggplot2); require(ggthemes)
## Loading required package: ggthemes
data_frame(MSE = fit_summary$rss/nrow(x_train),
Cp = fit_summary$cp,
BIC = fit_summary$bic,
AdjustedR2 = fit_summary$adjr2) %>%
mutate(id = row_number()) %>%
gather(Metric, value, -id) %>%
ggplot(aes(id, value, col = Metric)) +
geom_line() + geom_point() + ylab('') +
xlab('Number of Variables Used') +
facet_wrap(~ Metric, scales = 'free') +
theme_tufte() + scale_x_continuous(breaks = 1:13)
## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## Please use `tibble()` instead.
scales = c('r2', 'adjr2', 'bic', 'Cp')
par(mfrow = c(2,2))
for (scale in scales) {
plot(best_subs, scale = scale)
}
par(mfrow = c(1,1))
test_errors <- rep(NA,13)
test.mat <- model.matrix(crim ~ ., data = Boston[-inTrain,])
for (i in 1:13){
coefs <- coef(best_subs, id=i)
pred <- test.mat[,names(coefs)]%*%coefs
test_errors[i] <- sqrt(mean((y_test - pred)^2))
}
data_frame(RMSE = test_errors) %>%
mutate(id = row_number()) %>%
ggplot(aes(id, RMSE)) +
geom_line() + geom_point() +
xlab('Number of Variables Used') +
ggtitle('RMSE on the test set') +
theme_tufte() +
scale_x_continuous(breaks = 1:13)
(regsubset_info <- min(test_errors))
## [1] 5.811256
coef(best_subs, id = 1:5)
## [[1]]
## (Intercept) rad
## -2.4448623 0.6546826
##
## [[2]]
## (Intercept) rad medv
## 2.6574618 0.5769903 -0.1963036
##
## [[3]]
## (Intercept) rad ptratio medv
## 10.8303135 0.6042171 -0.4067664 -0.2351904
##
## [[4]]
## (Intercept) zn dis rad medv
## 5.83377235 0.05509227 -0.72924037 0.52407648 -0.21866690
##
## [[5]]
## (Intercept) zn rm dis rad medv
## -1.70826140 0.05299909 1.50466724 -0.73918276 0.51079178 -0.29652993
lasso_fit <- train(x = x_train, y = y_train,
method = 'glmnet',
trControl = trainControl(method = 'cv', number = 10),
tuneGrid = expand.grid(alpha = 1,
lambda = seq(0.001, 1, length.out = 100)),
preProcess = c('center', 'scale'))
(lasso_info <- postResample(predict(lasso_fit, x_test), y_test))
## RMSE Rsquared MAE
## 5.793303 0.433477 2.603667
coef(lasso_fit$finalModel, lasso_fit$bestTune$lambda)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 3.7575062
## zn .
## indus .
## chas .
## nox .
## rm .
## age .
## dis -0.1856880
## rad 4.2982107
## tax .
## ptratio .
## black .
## lstat 0.1178773
## medv -1.0546456
lasso_fit$bestTune
## alpha lambda
## 84 1 0.8385455
plot(lasso_fit)
plot(varImp(lasso_fit))
ridge_fit <- train(x_train, y_train,
method = 'glmnet',
trControl = trainControl(method = 'cv', number = 10),
tuneGrid = expand.grid(alpha = 0,
lambda = seq(0, 1e2, length.out = 50)),
preProcess = c('center', 'scale'))
(ridge_info <- postResample(predict(ridge_fit, x_test), y_test))
## RMSE Rsquared MAE
## 5.7137087 0.4498421 2.7943386
coef(ridge_fit$finalModel, ridge_fit$bestTune$lambda)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 3.75750624
## zn 0.54011923
## indus -0.34230036
## chas -0.30633776
## nox 0.14724495
## rm 0.44593736
## age 0.10933511
## dis -0.98108462
## rad 2.85345034
## tax 1.16015712
## ptratio -0.09232032
## black -0.31532518
## lstat 0.64017157
## medv -1.44312967
ridge_fit$bestTune
## alpha lambda
## 2 0 2.040816
plot(ridge_fit)
plot(varImp(ridge_fit))
glmnet_fit <- train(x_train, y_train,
method = 'glmnet',
trControl = trainControl(method = 'cv', number = 10),
tuneGrid = expand.grid(alpha = seq(0, 1, length.out = 6),
lambda = seq(0, 1e2, length.out = 20)),
preProcess = c('center', 'scale'))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
(glmnet_info <- postResample(predict(glmnet_fit, x_test), y_test))
## RMSE Rsquared MAE
## 5.7504399 0.4478022 2.8789035
coef(object = glmnet_fit$finalModel, s = glmnet_fit$bestTune$lambda)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 3.75750624
## zn 0.79927708
## indus -0.56669351
## chas -0.32661514
## nox -0.32145785
## rm 0.70164835
## age 0.03721032
## dis -1.61848336
## rad 4.01878688
## tax 0.60120176
## ptratio -0.37415233
## black -0.09300755
## lstat 0.54462549
## medv -2.16987001
glmnet_fit$bestTune
## alpha lambda
## 1 0 0
plot(glmnet_fit)
plot(varImp(glmnet_fit))
pcr_fit <- train(x_train, y_train,
method = 'pcr',
trControl = trainControl(method = 'cv', number = 10),
tuneGrid = expand.grid(ncomp = 1:13),
preProcess = c('center', 'scale'))
(pcr_info <- postResample(predict(pcr_fit, x_test), y_test))
## RMSE Rsquared MAE
## 5.8816610 0.4296175 3.0845598
pcr_fit
## Principal Component Analysis
##
## 306 samples
## 13 predictor
##
## Pre-processing: centered (13), scaled (13)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 275, 277, 276, 275, 275, 275, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 6.560574 0.4800110 3.843889
## 2 6.559955 0.4824507 3.879334
## 3 6.175577 0.5954423 3.134008
## 4 6.166002 0.6001710 3.114538
## 5 6.165989 0.6016351 3.099134
## 6 6.164470 0.5941572 3.241512
## 7 6.171809 0.5937923 3.231744
## 8 6.030915 0.6217769 3.231019
## 9 6.056110 0.6186845 3.265161
## 10 6.064727 0.6201973 3.238258
## 11 6.044505 0.6234399 3.204063
## 12 6.173310 0.5958234 3.406977
## 13 6.116109 0.6072988 3.293449
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 8.
plot(pcr_fit)
plot(varImp(pcr_fit))
pls_fit <- train(x_train, y_train,
method = 'pls',
trControl = trainControl(method = 'cv', number = 10),
tuneGrid = expand.grid(ncomp = 1:13),
preProcess = c('center', 'scale'))
(pls_info <- postResample(predict(pls_fit, x_test), y_test))
## RMSE Rsquared MAE
## 5.7361826 0.4505769 2.8789603
pls_fit
## Partial Least Squares
##
## 306 samples
## 13 predictor
##
## Pre-processing: centered (13), scaled (13)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 275, 275, 274, 275, 276, 276, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 6.201189 0.5251744 3.645062
## 2 5.818300 0.6334711 3.019228
## 3 5.818943 0.6391877 3.147983
## 4 5.820687 0.6344493 3.194376
## 5 5.855419 0.6352588 3.196379
## 6 5.864940 0.6351171 3.216697
## 7 5.867013 0.6354134 3.224509
## 8 5.848427 0.6395298 3.186069
## 9 5.844713 0.6395603 3.188456
## 10 5.846431 0.6391123 3.187532
## 11 5.845136 0.6392478 3.186003
## 12 5.845142 0.6392653 3.185977
## 13 5.845154 0.6392627 3.185996
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 2.
plot(pls_fit)
rbind(c(regsubset_info, NA, NA),
lasso_info,
ridge_info,
glmnet_info,
pcr_info,
pls_info)
## RMSE Rsquared MAE
## 5.811256 NA NA
## lasso_info 5.793303 0.4334770 2.603667
## ridge_info 5.713709 0.4498421 2.794339
## glmnet_info 5.750440 0.4478022 2.878903
## pcr_info 5.881661 0.4296175 3.084560
## pls_info 5.736183 0.4505769 2.878960
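Since the first row of the matrix above prints without a label, here is a small sketch that collects the same test-set results into a labelled table, sorted by RMSE:
# Sketch: label and sort the test-set results gathered above.
results <- data.frame(
  Method = c("Best subset", "Lasso", "Ridge", "Elastic net", "PCR", "PLS"),
  RMSE   = c(regsubset_info, lasso_info["RMSE"], ridge_info["RMSE"],
             glmnet_info["RMSE"], pcr_info["RMSE"], pls_info["RMSE"]))
results[order(results$RMSE), ]
All six approaches give a test RMSE in a narrow band of roughly 5.7 to 5.9, with ridge regression lowest, and test R2 values around 0.43 to 0.45; per capita crime rate is therefore predicted only moderately well, and there is little practical difference among the methods.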