For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
(a) The lasso, relative to least squares, is:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
False
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
False
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
True - because lasso regression is still a linear model, it is less flexible. An increase is bias is traded off for a greater absolute decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
False
(b) Repeat (a) for ridge regression relative to least squares.
Ridge regression is less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. Ridge regression is like lasso regression in the sense that increased bias is willingly traded off for a greater decrease in variance.
(c) Repeat (a) for non-linear methods relative to least squares.
Non-linear methods are more flexible and hence will give improved prediction accuracy when its increase in variance is traded for a decrease in bias.
In this exercise, we will predict the number of applications received using the other variables in the College data set.
library(ISLR)
sum(is.na(College))
## [1] 0
dim(College)
## [1] 777 18
summary(College)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
(a) Split the data set into a training set and a test set.
set.seed(1)
train <- sample(777, 388)
test <- -train
College.train <- College[train, ]
College.test <- College[test,]
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
The MSE is 1,135,758.
lm.fit <- lm(Apps ~ ., data = College.train)
lm.pred <- predict(lm.fit, College.test)
mean((College.test[, "Apps"] - lm.pred)^2)
## [1] 1135758
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
The MSE for the ridge regression was a small improvement at 1,135,714.
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.0-2
set.seed(1)
x_train <- model.matrix(Apps~., data = College.train)[,-1]
y_train <- College.train$Apps
x_test <- model.matrix(Apps~., data = College.test)[,-1]
y_test <- College.test$Apps
grid <- 10^seq(10, -2, length = 100)
ridge.fit <- cv.glmnet(x_train, y_train, alpha = 0, lambda = grid, thresh = 1e-12)
plot(ridge.fit)
bestlam <- ridge.fit$lambda.min
bestlam
## [1] 0.01
ridge.pred <- predict(ridge.fit, newx = x_test, s = bestlam)
mean((y_test - ridge.pred)^2)
## [1] 1135714
(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
The test MSE was slightly lower than the ridge regression at 1,135,659. All predictors had nonzero coefficients. This is consistent with a minimum \(\lambda\) which was close to 0 and would result in a model not much different than the linear regression model.
set.seed(1)
lasso.fit <- cv.glmnet(x_train, y_train, alpha = 1, lambda = grid, thresh = 1e-12)
plot(lasso.fit)
bestlam <- lasso.fit$lambda.min
bestlam
## [1] 0.01
lasso.pred <- predict(lasso.fit, newx = x_test, s = bestlam)
mean((y_test - lasso.pred)^2)
## [1] 1135659
x <- model.matrix(Apps~., data = College)[,-1]
y <- College$Apps
mod.lasso <- glmnet(x, y, alpha = 1, lambda = grid)
lasso.coeff <- predict(mod.lasso, s = bestlam, type = "coefficients")[1:18,]
lasso.coeff[lasso.coeff != 0]
## (Intercept) PrivateYes Accept Enroll Top10perc
## -448.83141930 -494.15009341 1.58411981 -0.87022010 49.83428099
## Top25perc F.Undergrad P.Undergrad Outstate Room.Board
## -14.16510449 0.05618125 0.04447685 -0.08571647 0.15130927
## Books Personal PhD Terminal S.F.Ratio
## 0.02075742 0.03108262 -8.67588751 -3.33162029 15.41268430
## perc.alumni Expend Grad.Rate
## 0.13965876 0.07790164 8.66427451
(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
The test MSE was 1,937,465 using 7 components.
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
set.seed(1)
pcr.fit <- pcr(Apps ~ ., data = College.train, scale = T, validation = "CV")
validationplot(pcr.fit, val.type = "MSEP")
pcr.pred <- predict(pcr.fit, College.test, ncomp = 7)
mean((pcr.pred - y_test)^2)
## [1] 1937465
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
The MSE was 1,066,991 using 6 components.
set.seed(1)
pls.fit <- plsr(Apps ~ ., data = College.train, scale = T, validation = "CV")
validationplot(pls.fit, val.type = "MSEP")
pls.pred = predict(pls.fit, College.test, ncomp = 6)
mean((pls.pred - y_test)^2)
## [1] 1066991
(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
The PLS model had the lowest MSE at 1,066,991. The ridge and lasso regression models showed minimal improvement in the MSE while the PCR model had a higher MSE than the base linear regression model. An estimated average accuracy rate can be used to compare the models by the following formula: 1 - \(\frac{|Mean Error|}{\overline{y}_{test}}\). As expected, the PLS model had the highest accuracy estimate at .776 while having the lowest MSE. The PCR model had the lowest accuracy estimate, which was lower than the base linear regression model. Both the ridge regression and lasso regression models showed minimal improvement over the base linear regression model.
# Base Regression Accuracy Estimate
1 - mean(abs(y_test - lm.pred)) / mean(y_test)
## [1] 0.7661462
# Ridge Regression Accuracy Estimate
1 - mean(abs(y_test - ridge.pred)) / mean(y_test)
## [1] 0.7661508
# Lasso Regression Accuracy Estimate
1 - mean(abs(y_test - lasso.pred)) / mean(y_test)
## [1] 0.7661595
# PCR Accuracy Estimate
1 - mean(abs(pcr.pred - y_test)) / mean(y_test)
## [1] 0.6673805
# PLS Accuracy Estimate
1 - mean(abs(pls.pred - y_test)) / mean(y_test)
## [1] 0.7757627
We will now try to predict per capita crime rate in the Boston data set.
set.seed(1)
library(MASS)
library(leaps)
library(glmnet)
(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
sum(is.na(Boston))
## [1] 0
dim(Boston)
## [1] 506 14
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
predict.regsubsets <- function(object, newdata, id, ...) {
form <- as.formula(object$call[[2]])
mat <- model.matrix(form, newdata)
coefi <- coef(object, id = id)
mat[, names(coefi)] %*% coefi
}
k <- 10
p <- ncol(Boston) - 1
folds <- sample(rep(1:k, length = nrow(Boston)))
cv.errors <- matrix(NA, k, p)
for (i in 1:k) {
best.fit <- regsubsets(crim ~ ., data = Boston[folds != i, ], nvmax = p)
for (j in 1:p) {
pred <- predict(best.fit, Boston[folds == i, ], id = j)
cv.errors[i, j] <- mean((Boston$crim[folds == i] - pred)^2)
}
}
mse.cv <- apply(cv.errors, 2, mean)
plot(mse.cv, pch = 19, type = "b")
which.min(mse.cv)
## [1] 9
mse.cv[which.min(mse.cv)]
## [1] 42.81453
bestfit.full <- regsubsets(crim ~., data = Boston, nvmax = 13)
coef(bestfit.full, 9)
## (Intercept) zn indus nox dis
## 19.124636156 0.042788127 -0.099385948 -10.466490364 -1.002597606
## rad ptratio black lstat medv
## 0.539503547 -0.270835584 -0.008003761 0.117805932 -0.180593877
set.seed(1)
train <- sample(506, 253)
test <- -train
Boston.train <- Boston[train, ]
Boston.test <- Boston[test,]
x_train <- model.matrix(crim~., data = Boston.train)[,-1]
y_train <- Boston.train$crim
x_test <- model.matrix(crim~., data = Boston.test)[,-1]
y_test <- Boston.test$crim
grid <- 10^seq(10, -2, length = 100)
lasso.fit <- cv.glmnet(x_train, y_train, alpha = 1, lambda = grid, thresh = 1e-12)
plot(lasso.fit)
bestlam <- lasso.fit$lambda.min
bestlam
## [1] 0.07054802
lasso.pred <- predict(lasso.fit, newx = x_test, s = bestlam)
mean((y_test - lasso.pred)^2)
## [1] 40.88312
x <- model.matrix(crim ~., data = Boston)[,-1]
y <- Boston$crim
mod.lasso <- glmnet(x, y, alpha = 1, lambda = grid)
lasso.coeff <- predict(mod.lasso, s = bestlam, type = "coefficients")[1:14,]
lasso.coeff[lasso.coeff != 0]
## (Intercept) zn indus chas nox rm
## 11.377041716 0.034373111 -0.062605969 -0.555023861 -5.721122105 0.151927338
## dis rad ptratio black lstat medv
## -0.715085799 0.506621375 -0.156825685 -0.007550409 0.122010397 -0.145585637
ridge.fit <- cv.glmnet(x_train, y_train, alpha = 0, lambda = grid, thresh = 1e-12)
plot(ridge.fit)
bestlam <- ridge.fit$lambda.min
bestlam
## [1] 0.3764936
ridge.pred <- predict(ridge.fit, newx = x_test, s = bestlam)
mean((y_test - ridge.pred)^2)
## [1] 41.03422
ridgefull.fit <- glmnet(x, y, alpha = 0, lambda = grid)
ridge.coeff <- predict(ridgefull.fit, s = bestlam, type = "coefficients")[1:14,]
ridge.coeff
## (Intercept) zn indus chas nox rm
## 10.684582993 0.035254538 -0.083043901 -0.729909760 -6.473100982 0.364745896
## age dis rad tax ptratio black
## 0.001530985 -0.766449784 0.454002431 0.002246733 -0.165813717 -0.008294454
## lstat medv
## 0.141537228 -0.151745918
library(pls)
set.seed(1)
pcr.fit <- pcr(crim ~ ., data = Boston.train, scale = T, validation = "CV")
validationplot(pcr.fit, val.type = "MSEP")
pcr.pred <- predict(pcr.fit, Boston.test, ncomp = 8)
mean((pcr.pred - y_test)^2)
## [1] 43.21097
(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
The model with the lowest MSE was the lasso regression at 40.88312. The model can be expressed as:
\(crim = 11.377+0.034zn-0.063indus-0.555chas-5.721nox+0.152rm-0.715dis+0.507rad\) \(-0.157ptratio-0.008black+0.122lstat-0.146medv\)
(c) Does your chosen model involve all of the features in the data set? Why or why not?
The model did not include all of the features in the dataset because the minimum lambda caused the coefficient estimates of age and tax to equal 0 in the lasso regression model.