For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
(a) The lasso, relative to least squares, is:
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
(b) Repeat (a) for ridge regression relative to least squares.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
(c) Repeat (a) for non-linear methods relative to least squares.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
In this exercise, we will predict the number of applications received using the other variables in the College data set.
(a) Split the data set into a training set and a test set.
library(ISLR)
attach(College)
summary(College)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
set.seed(1)
train <- sample(1:dim(College)[1], (dim(College)[1]/2))
test <- -train
college_train <- College[train,]
college_test <- College[test,]
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
fit_lm <- lm(Apps~., data = college_train)
pred_lm <- predict(fit_lm, college_test)
mean((pred_lm-college_test$Apps)^2)
## [1] 1135758
The test error of the linear model is 1135758.
(c) Fit a ridge regression model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained.
library(glmnet)
train_mat <- model.matrix(Apps~., data = college_train)
test_mat <- model.matrix(Apps~., data = college_test)
grid <- 10^seq(4,-2,length = 100)
cv_out <- cv.glmnet(train_mat, college_train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
lambda_best <- cv_out$lambda.min
lambda_best
## [1] 0.01
ridge_mod <- glmnet(train_mat, college_train$Apps, alpha = 0)
ridge_pred <- predict(ridge_mod, s= lambda_best, newx = test_mat)
mean((ridge_pred- college_test$Apps)^2)
## [1] 976261.5
The test error of the ridge regression model, with \(\lambda\) chosen by cross-validation is 976261.5.
(d) Fit a lasso model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
set.seed(1)
cv_lasso <- cv.glmnet(train_mat, college_train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
lasso_best <- cv_lasso$lambda.min
lasso_best
## [1] 0.01
fit_lasso <- glmnet(train_mat, college_train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
pred_lasso <- predict(fit_lasso, s=lasso_best, newx=test_mat)
mean((pred_lasso-college_test$Apps)^2)
## [1] 1135660
The test error of the lasso model, with \(\lambda\) chosen by cross-validation is 1135660.
(e) Fit a PCR model on the training set, with M chosen by crossvalidation. Report the test error obtained, along with the value of M selected by cross-validation.
library(pls)
fit_pcr <- pcr(Apps~., data = college_train, scale = "True", validation = "CV")
validationplot(fit_pcr, val.type = "MSEP")
summary(fit_pcr)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4288 4207 2216 2137 2013 1443 1399
## adjCV 4288 4203 2212 2133 2029 1427 1388
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1387 1378 1377 1326 1287 1290 1300
## adjCV 1377 1367 1366 1316 1277 1280 1289
## 14 comps 15 comps 16 comps 17 comps
## CV 1298 1230 1230 1236
## adjCV 1288 1220 1220 1225
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 51.812 87.28 95.83 97.47 98.79 99.45 99.94 99.98
## Apps 5.672 75.44 77.29 85.81 91.84 91.84 91.86 92.16
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 100.00 100.00 100.00 100.00 100.00 100.00 100.00
## Apps 92.18 92.68 93.05 93.06 93.07 93.15 93.82
## 16 comps 17 comps
## X 100.00 100.00
## Apps 93.85 93.89
pred_pcr <- predict(fit_pcr, college_test,ncomp = 10)
mean((pred_pcr-college_test$Apps)^2)
## [1] 1192180
The test error of the PCR model, with M = 10 chosen by cross-validation is 1192180.
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
fit_pls <- plsr(Apps~., data = college_train, scale = TRUE, validation = "CV")
validationplot(fit_pls, val.type = "MSEP")
pred_pls <- predict(fit_pls, college_test, ncomp = 10)
mean((pred_pls-college_test$Apps)^2)
## [1] 1131661
summary(fit_pls)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4288 2195 1984 1732 1591 1460 1316
## adjCV 4288 2191 1979 1721 1566 1441 1301
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1270 1253 1245 1238 1235 1227 1224
## adjCV 1259 1242 1234 1228 1224 1217 1214
## 14 comps 15 comps 16 comps 17 comps
## CV 1223 1222 1222 1222
## adjCV 1213 1212 1212 1212
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 27.21 50.73 63.06 65.52 70.20 74.20 78.62 80.81
## Apps 75.39 81.24 86.97 91.14 92.62 93.43 93.56 93.68
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 83.29 87.17 89.15 91.37 92.58 94.42 96.98
## Apps 93.76 93.79 93.83 93.86 93.88 93.89 93.89
## 16 comps 17 comps
## X 98.78 100.00
## Apps 93.89 93.89
(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
Based on the test error results: PLS approach is the lowest error rate for prediction of collge applications received. Ridge regression is significantly different than the other approaches.
Linear: 1135758 Ridge: 976261.5 Lasso: 1135660 PCR: 1192180 PLS: 1131661
detach(College)
We will now try to predict per capita crime rate in the Boston data set.
(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
library(MASS)
attach(Boston)
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
set.seed(1)
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
library(leaps)
predict.regsubsets = function(object, newdata, id, ...) {
form = as.formula(object$call[[2]])
mat = model.matrix(form, newdata)
coefi = coef(object, id = id)
mat[, names(coefi)] %*% coefi
}
k = 10
p = ncol(Boston) - 1
folds = sample(rep(1:k, length = nrow(Boston)))
cv.errors = matrix(NA, k, p)
for (i in 1:k) {
best.fit = regsubsets(crim ~ ., data = Boston[folds != i, ], nvmax = p)
for (j in 1:p) {
pred = predict(best.fit, Boston[folds == i, ], id = j)
cv.errors[i, j] = mean((Boston$crim[folds == i] - pred)^2)
}
}
rmse.cv = sqrt(apply(cv.errors, 2, mean))
plot(rmse.cv, pch = 19, type = "b")
summary(best.fit)
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = Boston[folds != i, ], nvmax = p)
## 13 Variables (and intercept)
## Forced in Forced out
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## black FALSE FALSE
## lstat FALSE FALSE
## medv FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
## zn indus chas nox rm age dis rad tax ptratio black lstat medv
## 1 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " "*" " "
## 3 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" "*" " "
## 4 ( 1 ) "*" " " " " " " " " " " " " "*" " " " " "*" "*" " "
## 5 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " " " "*" "*"
## 6 ( 1 ) "*" "*" " " " " " " " " "*" "*" " " " " " " "*" "*"
## 7 ( 1 ) "*" "*" " " " " " " " " "*" "*" " " " " "*" "*" "*"
## 8 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " "*" "*" "*" "*"
## 9 ( 1 ) "*" " " " " "*" "*" " " "*" "*" " " "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" " " "*" "*" " " "*" "*" " " "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
which.min(rmse.cv)
## [1] 9
rmse.cv[which.min(rmse.cv)]
## [1] 6.543281
#Lasso Method
x <- model.matrix(crim ~ . - 1, data = Boston)
y <- Boston$crim
cv.lasso <- cv.glmnet(x, y, type.measure = "mse")
plot(cv.lasso)
coef(cv.lasso)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 2.176491
## zn .
## indus .
## chas .
## nox .
## rm .
## age .
## dis .
## rad 0.150484
## tax .
## ptratio .
## black .
## lstat .
## medv .
sqrt(cv.lasso$cvm[cv.lasso$lambda == cv.lasso$lambda.1se])
## [1] 7.921353
#Ridge Method
x <- model.matrix(crim ~ . - 1, data = Boston)
y <- Boston$crim
cv.ridge <- cv.glmnet(x, y, type.measure = "mse", alpha = 0)
plot(cv.ridge)
coef(cv.ridge)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.523899548
## zn -0.002949852
## indus 0.029276741
## chas -0.166526006
## nox 1.874769661
## rm -0.142852604
## age 0.006207995
## dis -0.094547258
## rad 0.045932737
## tax 0.002086668
## ptratio 0.071258052
## black -0.002605281
## lstat 0.035745604
## medv -0.023480540
sqrt(cv.ridge$cvm[cv.ridge$lambda == cv.ridge$lambda.1se])
## [1] 7.669133
library(pls)
pcr.fit = pcr(crim ~ ., data = Boston, scale = TRUE, validation = "CV")
summary(pcr.fit)
## Data: X dimension: 506 13
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.61 7.175 7.180 6.724 6.731 6.727 6.727
## adjCV 8.61 7.174 7.179 6.721 6.725 6.724 6.724
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.722 6.614 6.618 6.607 6.598 6.553 6.488
## adjCV 6.718 6.609 6.613 6.602 6.592 6.546 6.481
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 47.70 60.36 69.67 76.45 82.99 88.00 91.14 93.45
## crim 30.69 30.87 39.27 39.61 39.61 39.86 40.14 42.47
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.40 97.04 98.46 99.52 100.0
## crim 42.55 42.78 43.04 44.13 45.4
(b) Propose a model (or set of models) that seem to perform well on this dataset, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
Based on the results in (a), the model that has the lowest cross-validation error is the one chosen by the Best Subset Selection method. This model had an MSE of 6.52654.
(c) Does your chosen model involve all of the features in the data set? Why or why not?
This method only includes 9 variables - zn, nox, rm, dis, rad, ptratio, black, lstat, and medv - that provide the best prediction accuracy.
detach(Boston)