library(ISLR)
library(glmnet)
library(pls)
library(leaps)
library(MASS)
2. For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
(a) The lasso, relative to least squares, is:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
Answer: iii. The lasso shrinks the coefficient estimates (and can set some exactly to zero), so it is less flexible than least squares; the penalty adds bias but reduces variance, and prediction accuracy improves when the increase in bias is smaller than the decrease in variance.
(b) Repeat (a) for ridge regression relative to least squares.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. Like the lasso, ridge regression shrinks the coefficient estimates toward zero, trading a small increase in bias for a larger reduction in variance.
(c) Repeat (a) for non-linear methods relative to least squares.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias. Non-linear methods relax the linearity assumption, lowering bias at the cost of higher variance, so they help when the decrease in bias outweighs the increase in variance.
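All three answers come down to the bias-variance trade-off: shrinkage methods sacrifice a little bias for a larger reduction in variance, while non-linear methods do the opposite. The small simulation below is only an illustrative sketch of that trade-off (the dimensions, correlation structure, coefficient values, noise level, and object names are arbitrary choices, not part of the exercise): with many correlated predictors and a modest training set, the less flexible ridge fit typically attains a lower test MSE than least squares.
library(MASS)     # mvrnorm() for correlated predictors
library(glmnet)
set.seed(1)
p <- 30; n <- 50
Sigma <- 0.8 ^ abs(outer(1:p, 1:p, "-"))    # AR(1)-style correlation among predictors
X <- mvrnorm(2 * n, mu = rep(0, p), Sigma = Sigma)
beta <- rep(0.3, p)                         # many small, equal effects
y <- drop(X %*% beta + rnorm(2 * n, sd = 3))
tr <- 1:n; te <- (n + 1):(2 * n)
ols.fit <- lm(y[tr] ~ X[tr, ])
ols.mse <- mean((cbind(1, X[te, ]) %*% coef(ols.fit) - y[te])^2)
cv.r <- cv.glmnet(X[tr, ], y[tr], alpha = 0)
ridge.mse <- mean((predict(cv.r, s = "lambda.min", newx = X[te, ]) - y[te])^2)
c(ols = ols.mse, ridge = ridge.mse)         # ridge is usually the smaller of the two here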
9. In this exercise, we will predict the number of applications received using the other variables in the College data set.
(a) Split the data set into a training set and a test set.
data("College")
attach(College)
set.seed(2)
train <- sample(1:nrow(College), nrow(College)/2)
test <- -train
y.test <- Apps[test]
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
lm.fit <- lm(Apps ~., data=College, subset = train)
lm.pred <- predict(lm.fit, College[test,])
lm.error <- mean((lm.pred - y.test)^2)
lm.error
## [1] 1093608
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
x <- model.matrix(Apps~., data=College)[, -1]
x.train <- x[train,]
y <- College$Apps
y.train <- y[train]
set.seed(7)
ridge.fit <- glmnet(x.train,y.train,alpha=0)
cv.ridge.fit <- cv.glmnet(x.train,y.train,alpha=0)
plot(cv.ridge.fit)
bestlam <- cv.ridge.fit$lambda.min
bestlam
## [1] 424.2704
ridge.pred <- predict(ridge.fit, s=bestlam, newx=x[test,])
ridge.error <- mean((ridge.pred-y.test)^2)
ridge.error
## [1] 1138030
(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
set.seed(2)
lasso.fit <- glmnet(x.train,y.train,alpha=1)
cv.lasso.fit<- cv.glmnet(x.train,y.train,alpha=1)
bestlam.lasso <- cv.lasso.fit$lambda.min
bestlam.lasso
## [1] 15.97351
lasso.pred <- predict(lasso.fit, s=bestlam.lasso, newx=x[test,])
lasso.error <- mean((lasso.pred-y.test)^2)
lasso.error
## [1] 1045922
lasso.c <- predict(lasso.fit, type="coefficients", s=bestlam.lasso)[1:18,]
length(lasso.c[lasso.c != 0])   # number of non-zero coefficients (including the intercept)
(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
set.seed(1)
pcr.fit <- pcr(Apps~., data=College, subset=train, scale=TRUE, validation ="CV")
summary(pcr.fit)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4440 4463 2362 2385 2092 1835 1832
## adjCV 4440 4465 2359 2384 2066 1824 1823
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1829 1821 1753 1764 1771 1772 1784
## adjCV 1818 1813 1746 1757 1764 1764 1777
## 14 comps 15 comps 16 comps 17 comps
## CV 1778 1747 1401 1338
## adjCV 1773 1701 1385 1322
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 31.4816 57.40 64.84 70.54 75.80 80.23 84.04 87.64
## Apps 0.1398 72.45 72.50 80.45 84.74 84.77 85.06 85.16
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.87 93.29 95.33 96.98 98.05 98.75 99.37
## Apps 85.93 86.16 86.21 86.32 86.40 86.61 92.49
## 16 comps 17 comps
## X 99.85 100.00
## Apps 93.65 94.32
validationplot(pcr.fit, val.type = "MSEP")
* The lowest cross-validation error occurs with M = 17 components, i.e. the full model, so PCR performs no dimension reduction here and its test error matches the least-squares fit from (b).
pcr.pred <- predict(pcr.fit, x[test,], ncomp=17)
pcr.error <- mean((pcr.pred-y.test)^2)
pcr.error
## [1] 1093608
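As a quick programmatic check (a sketch using the pls package's RMSEP() accessor; not part of the original output), the CV-selected M can be read straight off the fitted object. Based on the RMSEP table above it should return 17.
cv.rmsep <- RMSEP(pcr.fit, estimate = "CV")$val[1, 1, ]
which.min(cv.rmsep) - 1   # subtract 1: the first entry is the intercept-only model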
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
set.seed(1)
pls.fit <- plsr(Apps~., data=College, subset=train, scale=TRUE, validation="CV")
summary(pls.fit)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4440 2153 1701 1711 1650 1498 1413
## adjCV 4440 2147 1677 1704 1615 1470 1392
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1378 1358 1356 1342 1338 1337 1337
## adjCV 1361 1342 1339 1326 1322 1321 1321
## 14 comps 15 comps 16 comps 17 comps
## CV 1338 1338 1338 1338
## adjCV 1322 1322 1322 1322
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 25.76 31.99 61.88 65.00 68.68 73.82 78.33 81.38
## Apps 77.92 87.80 88.38 92.06 93.59 93.92 94.01 94.09
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 83.24 85.59 87.76 90.49 92.69 94.27 97.03
## Apps 94.20 94.27 94.30 94.31 94.32 94.32 94.32
## 16 comps 17 comps
## X 99.21 100.00
## Apps 94.32 94.32
validationplot(pls.fit, val.type = "MSEP")
* The lowest cross-validation error occurs with M = 12 components; the CV curve is essentially flat beyond that point, so little is gained by adding more.
pls.pred <- predict(pls.fit,x[test,],ncomp=12)
pls.error <- mean((pls.pred-y.test)^2)
pls.error
## [1] 1085346
(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
errors <- c(lm.error, ridge.error, lasso.error, pcr.error, pls.error)
names(errors) <- c("lm", "ridge", "lasso", "pcr", "pls")
barplot(sort(errors))
print(sort(errors))
## lasso pls lm pcr ridge
## 1045922 1085346 1093608 1093608 1138030
The lasso gives the lowest test MSE, followed by PLS; least squares and PCR are identical (PCR uses all M = 17 components, so it reduces to the least-squares fit), and ridge regression is slightly worse. The differences among the five approaches are small, and all of them predict the number of applications with comparable accuracy.
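To put these MSEs on a more interpretable scale, one rough gauge (a small sketch, not part of the original write-up) is the test R-squared for each method: one minus the test MSE divided by the variance of Apps in the test set.
test.var <- mean((y.test - mean(y.test))^2)   # variance of Apps in the test set
round(1 - sort(errors) / test.var, 3)         # test R^2 for each method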
11. We will now try to predict per capita crime rate in the Boston data set.
data(Boston)
attach(Boston)
# Split the data into train and test sets
set.seed(4)
train <- sample(1:nrow(Boston), nrow(Boston)*0.70)
test <- -train
y.test <- crim[test]
(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
Least squares
lm.fit <- lm(crim~., data=Boston, subset=train)
summary(lm.fit)
##
## Call:
## lm(formula = crim ~ ., data = Boston, subset = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.262 -2.330 -0.452 1.065 73.765
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.835e+01 8.809e+00 2.083 0.038043 *
## zn 5.297e-02 2.463e-02 2.150 0.032242 *
## indus -4.926e-02 1.045e-01 -0.471 0.637746
## chas -5.041e-01 1.526e+00 -0.330 0.741398
## nox -1.106e+01 6.734e+00 -1.643 0.101354
## rm 3.983e-01 7.073e-01 0.563 0.573711
## age 4.431e-03 2.129e-02 0.208 0.835250
## dis -1.139e+00 3.429e-01 -3.322 0.000991 ***
## rad 6.346e-01 1.112e-01 5.707 2.51e-08 ***
## tax -4.745e-03 6.555e-03 -0.724 0.469591
## ptratio -3.449e-01 2.375e-01 -1.453 0.147257
## black 2.775e-04 4.356e-03 0.064 0.949251
## lstat 6.032e-02 9.378e-02 0.643 0.520535
## medv -2.537e-01 7.344e-02 -3.455 0.000620 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.66 on 340 degrees of freedom
## Multiple R-squared: 0.451, Adjusted R-squared: 0.43
## F-statistic: 21.48 on 13 and 340 DF, p-value: < 2.2e-16
lm.pred <- predict(lm.fit, Boston[test,])
lm.error <- mean((lm.pred - y.test)^2)
lm.error
## [1] 36.58822
Best subset selection.
# Choose the model using cross-validation and the best subset method
set.seed(1)
# regsubsets() has no predict() method, so define one: rebuild the model matrix
# for the new data and multiply by the coefficients of the id-th best model
predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id = id)
  xvars <- names(coefi)
  mat[, xvars] %*% coefi
}
k = 10
folds <- sample(1:k, nrow(Boston), replace = TRUE)
cv.errors <- matrix(NA, k, 13, dimnames = list(NULL, paste(1:13)))
for (j in 1:k) {
best.fit <- regsubsets(crim ~ ., data = Boston[folds != j, ], nvmax = 13)
for (i in 1:13) {
pred <- predict(best.fit, Boston[folds == j, ], id = i)
cv.errors[j, i] <- mean((Boston$crim[folds == j] - pred)^2)
}
}
mean.cv.errors <- apply(cv.errors, 2, mean)
mean.cv.errors
## 1 2 3 4 5 6 7 8
## 45.44573 43.87260 43.94979 44.02424 43.96415 43.96199 42.96268 42.66948
## 9 10 11 12 13
## 42.53822 42.73416 42.52367 42.46014 42.50125
plot(mean.cv.errors, type="b")
points(which.min(mean.cv.errors), min(mean.cv.errors), pch=20, col='red', cex=2)
min(mean.cv.errors)
## [1] 42.46014
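Following the usual best-subset workflow, one can refit on the full Boston data and inspect the coefficients of the CV-chosen model size (12 variables here); this is a quick follow-up sketch rather than part of the original output.
reg.best <- regsubsets(crim ~ ., data = Boston, nvmax = 13)
coef(reg.best, which.min(mean.cv.errors))   # predictors retained by the 12-variable model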
Lasso.
x <- model.matrix(crim~., Boston)[, -1]
y <- Boston$crim
lasso.fit <- glmnet(x[train, ], y[train], alpha=1)
cv.lasso.fit <- cv.glmnet(x[train, ], y[train], alpha=1)
plot(cv.lasso.fit)
bestlam.lasso <- cv.lasso.fit$lambda.min
bestlam.lasso
## [1] 0.1013094
lasso.pred <- predict(lasso.fit, s=bestlam.lasso, newx=x[test,])
lasso.error <- mean((lasso.pred-y[test])^2)
lasso.error
## [1] 36.80542
Ridge regression
ridge.fit <- glmnet(x[train, ], y[train], alpha=0)
cv.ridge.fit <- cv.glmnet(x[train, ], y[train], alpha=0)
plot(cv.ridge.fit)
bestlam.ridge <- cv.ridge.fit$lambda.min
bestlam.ridge
## [1] 0.5533799
ridge.pred <- predict(ridge.fit, s=bestlam.ridge, newx=x[test,])
ridge.error <- mean((ridge.pred-y[test])^2)
ridge.error
## [1] 36.17571
PCR (Principal Component Regression)
set.seed(4)
pcr.fit <- pcr(crim~., data=Boston, subset=train, scale=TRUE, validation ="CV")
summary(pcr.fit)
## Data: X dimension: 354 13
## Y dimension: 354 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.834 7.362 7.365 7.027 7.028 7.042 7.020
## adjCV 8.834 7.358 7.361 7.022 7.022 7.037 7.013
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 7.033 7.007 6.975 6.979 6.984 6.934 6.842
## adjCV 7.024 6.980 6.962 6.973 6.970 6.917 6.826
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 47.99 60.42 69.63 76.41 82.83 87.86 91.23 93.43
## crim 31.27 31.40 37.67 37.79 37.79 38.97 39.32 40.69
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.53 97.05 98.52 99.57 100.0
## crim 41.47 41.50 41.92 43.60 45.1
validationplot(pcr.fit, val.type = "MSEP")
* The lowest cross-validation error again occurs with all M = 13 components, so this PCR fit coincides with the least-squares model.
pcr.pred <- predict(pcr.fit, x[test,], ncomp=13)
pcr.error <- mean((pcr.pred-y.test)^2)
pcr.error
## [1] 36.58822
# Note: the last entry is a 10-fold CV estimate on the full data set, not a test-set MSE
errors <- c(lm.error, ridge.error, lasso.error, pcr.error, min(mean.cv.errors))
names(errors) <- c("lm", "ridge", "lasso", "pcr", "cross+best")
print(sort(errors))
## ridge pcr lm lasso cross+best
## 36.17571 36.58822 36.58822 36.80542 42.46014
(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
As computed above, ridge regression gives the lowest held-out test error (MSE ≈ 36.18), with PCR, least squares, and the lasso close behind; note that the PCR fit with all 13 components is identical to least squares, and that the best-subset figure is a 10-fold cross-validation estimate on the full data set rather than a test-set error, so it is not directly comparable. I would therefore propose the ridge regression model.
(c) Does your chosen model involve all of the features in the data set? Why or why not?
Yes. Ridge regression does not perform variable selection: the penalty shrinks the coefficients toward zero but never sets them exactly to zero, so the chosen model retains all 13 predictors.
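As a quick confirmation (a sketch, not part of the original answer), printing the ridge coefficients at the selected lambda shows that every one of the 13 predictors keeps a non-zero, shrunken coefficient.
predict(ridge.fit, type = "coefficients", s = bestlam.ridge)   # all 13 predictors retained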