For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
(a) The lasso, relative to least squares, is:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
ANSWER: iii. The lasso is less flexible than least squares and will give improved prediction accuracy as long as the increase in bias is less than the decrease in variance. The lasso shrinks coefficient estimates, setting those of less useful variables exactly to zero; the small increase in bias this causes is generally outweighed by the reduction in variance.
(b) Repeat (a) for ridge regression relative to least squares.
ANSWER: iii. For the same reasons as part (a), except that ridge regression does not shrink the coefficients of less useful variables all the way to zero the way the lasso does. Ridge regression still shrinks all of the coefficient estimates toward zero, reducing variance at the cost of a slight increase in bias.
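As a quick illustration of the difference between the two penalties (a sketch on simulated data, not part of the exercise), plotting the coefficient paths side by side shows ridge coefficients shrinking smoothly while lasso coefficients hit exactly zero:
# Sketch: coefficient paths for ridge vs. lasso on simulated data
library(glmnet)
set.seed(1)
x.sim <- matrix(rnorm(100 * 10), 100, 10)            # 10 predictors
y.sim <- x.sim[, 1] + 0.5 * x.sim[, 2] + rnorm(100)  # only 2 truly matter
par(mfrow = c(1, 2))
plot(glmnet(x.sim, y.sim, alpha = 0), xvar = "lambda", main = "Ridge")  # never exactly zero
plot(glmnet(x.sim, y.sim, alpha = 1), xvar = "lambda", main = "Lasso")  # sparse solutions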
(c) Repeat (a) for non-linear methods relative to least squares.
ANSWER: ii. Non-linear methods are more flexible and will give improved prediction accuracy when their increase in variance is less than their decrease in bias. They are more flexible because they make fewer assumptions about the form of the function; when the true relationship between the predictors and the response is non-linear, the resulting decrease in bias can outweigh the increase in variance, yielding higher prediction accuracy.
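To make the bias reduction concrete, here is a minimal sketch on simulated data with a non-linear truth (all names here are illustrative):
# Sketch: a flexible fit reduces bias when the true relationship is non-linear
set.seed(1)
x.sim <- runif(200, -2, 2)
y.sim <- sin(2 * x.sim) + rnorm(200, sd = 0.3)  # non-linear truth
linear.fit <- lm(y.sim ~ x.sim)            # biased: a line cannot capture sin(2x)
spline.fit <- smooth.spline(x.sim, y.sim)  # flexible fit with lower bias
plot(x.sim, y.sim, col = "gray")
abline(linear.fit, col = "red", lwd = 2)
lines(spline.fit, col = "blue", lwd = 2)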
In this exercise, we will predict the number of applications received using the other variables in the College data set.
(a) Split the data set into a training set and a test set.
library(ISLR)
attach(College)
set.seed(1)
# 70/30 train/test split
trainingindex <- sample(nrow(College), 0.7 * nrow(College))
train <- College[trainingindex, ]
test <- College[-trainingindex, ]
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
l.model <- lm(Apps~., data=train)
lm.pred=predict(l.model, newdata=test)
lm.err=mean((test$Apps-lm.pred)^2)
lm.err
## [1] 1261630
The test error for the linear model is 1,261,630.
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
library(glmnet)
set.seed(1)
# Build the predictor matrices and response vectors that glmnet expects;
# note that train[, -1] drops College's first column (the qualitative
# Private predictor) before building the model matrix
xtrain = model.matrix(Apps ~ ., data = train[, -1])
ytrain = train$Apps
xtest = model.matrix(Apps ~ ., data = test[, -1])
ytest = test$Apps
# Ridge regression
ridge.mod = cv.glmnet(xtrain, ytrain, alpha=0)
ridge.lambda = ridge.mod$lambda.min
ridge.pred=predict(ridge.mod, s=ridge.lambda, newx = xtest)
ridge.err=mean((ridge.pred-ytest)^2)
ridge.err
## [1] 1163628
The test error of the ridge regression is 1,163,628, which is lower than the test error of the OLS model.
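Before trusting lambda.min, it is worth looking at the whole cross-validation curve; glmnet provides a plot method for the object fitted above:
# The CV curve: vertical dotted lines mark lambda.min and lambda.1se
plot(ridge.mod)
log(ridge.lambda)  # where the chosen lambda sits on the log-lambda axis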
(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
set.seed(1)
lasso.mod=cv.glmnet(xtrain, ytrain, alpha=1)
lasso.lambda=lasso.mod$lambda.min
lasso.pred=predict(lasso.mod, s=lasso.lambda, newx = xtest)
lasso.err=mean((lasso.pred-ytest)^2)
lasso.err
## [1] 1265298
lasso.coeff=predict(lasso.mod, type="coefficients", s=lasso.lambda)[1:18,]
lasso.coeff
## (Intercept) (Intercept) Accept Enroll Top10perc
## -1.033941e+03 0.000000e+00 1.703414e+00 -9.625620e-01 5.160952e+01
## Top25perc F.Undergrad P.Undergrad Outstate Room.Board
## -1.450882e+01 3.915947e-02 7.389830e-02 -1.077148e-01 1.466627e-01
## Books Personal PhD Terminal S.F.Ratio
## 2.535477e-01 -8.694845e-03 -6.888117e+00 4.525972e-01 2.448283e+01
## perc.alumni Expend Grad.Rate
## 7.159570e-01 6.301767e-02 6.103923e+00
The test error for the lasso model is 1,265,298, slightly higher than that of least squares. At the cross-validated λ, none of the 16 predictor coefficients is shrunk to zero, so there are 16 non-zero coefficient estimates.
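Since the question asks for the number of non-zero coefficient estimates, it is safer to count them programmatically than to read the printout; using the vector extracted above:
# Count non-zero coefficient estimates among the predictors
sum(lasso.coeff[!grepl("Intercept", names(lasso.coeff))] != 0)
Here this returns 16: at the cross-validated λ, the lasso keeps every predictor.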
(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
library(pls)
pcr.fit=pcr(Apps~., data=train, scale=TRUE, validation="CV")
validationplot(pcr.fit, val.type="MSEP")
summary(pcr.fit)
## Data: X dimension: 543 17
## Y dimension: 543 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3895 3807 2130 2134 1826 1692 1696
## adjCV 3895 3807 2126 2132 1806 1682 1690
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1695 1638 1604 1593 1601 1603 1606
## adjCV 1694 1627 1598 1587 1595 1598 1600
## 14 comps 15 comps 16 comps 17 comps
## CV 1607 1544 1145 1113
## adjCV 1602 1524 1137 1106
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.051 57.00 64.42 70.27 75.65 80.65 84.26 87.61
## Apps 5.788 71.69 71.70 80.97 82.60 82.60 82.69 84.06
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.58 92.84 94.93 96.74 97.82 98.72 99.39
## Apps 84.55 84.82 84.86 84.86 85.01 85.05 89.81
## 16 comps 17 comps
## X 99.85 100.00
## Apps 93.03 93.32
pcr.pred=predict(pcr.fit, test, ncomp=10)
pcr.err=mean((pcr.pred-test$Apps)^2)
pcr.err
## [1] 1837203
The number of components chosen was M = 10: the CV error flattens out around 10 components (CV = 1593) and only drops substantially at 16-17 components, where PCR is essentially the full least squares fit. The test error for the PCR model is 1,837,203, considerably higher than for any of the previous models.
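Rather than eyeballing the validation plot, newer versions of the pls package (2.5-0 and later, an assumption worth checking) offer selectNcomp() to choose M automatically:
# One-sigma rule: smallest M whose CV error is within one standard
# error of the overall minimum
selectNcomp(pcr.fit, method = "onesigma", plot = TRUE)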
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
pls.fit=plsr(Apps~., data=train, scale=TRUE, validation="CV")
validationplot(pls.fit, val.type = "MSEP")
summary(pls.fit)
## Data: X dimension: 543 17
## Y dimension: 543 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3895 1969 1762 1566 1493 1320 1236
## adjCV 3895 1963 1757 1556 1469 1291 1222
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1225 1213 1212 1207 1213 1207 1206
## adjCV 1212 1201 1199 1195 1200 1195 1194
## 14 comps 15 comps 16 comps 17 comps
## CV 1205 1205 1205 1205
## adjCV 1193 1193 1193 1193
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 25.68 47.43 62.46 64.88 67.34 72.68 77.20 80.92
## Apps 76.62 82.39 86.93 90.76 92.82 93.05 93.13 93.20
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 82.69 85.16 87.35 90.73 92.49 95.10 97.09
## Apps 93.26 93.28 93.30 93.31 93.32 93.32 93.32
## 16 comps 17 comps
## X 98.40 100.00
## Apps 93.32 93.32
pls.pred=predict(pls.fit, test, ncomp = 10)
pls.err=mean((pls.pred-test$Apps)^2)
pls.err
## [1] 1279353
The number of components chosen was once again M = 10: the CV error at 10 components (1207) is essentially the minimum (1205, reached at 14-17 components). This model performs similarly to most of the other models, but not as well as ridge regression.
(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
# Test Error
barplot(c(lm.err, ridge.err, lasso.err, pcr.err, pls.err), col="darkred", xlab="Regression Methods", ylab="Test Error", main="Test Errors for All Methods", names.arg=c("OLS", "Ridge", "Lasso", "PCR", "PLS"))
# r2
avg.apps=mean(test$Apps)
lm.r2=1-mean((lm.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
ridge.r2=1-mean((ridge.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
lasso.r2=1-mean((lasso.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
pcr.r2=1-mean((pcr.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
pls.r2=1-mean((pls.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
barplot(c(lm.r2, ridge.r2, lasso.r2, pcr.r2, pls.r2), col="darkslategray2", xlab="Regression Methods", ylab="Test R-Squared", main="Test R-Squared for All Methods", names.arg=c("OLS", "Ridge", "Lasso", "PCR", "PLS"))
There is not much difference in test R² across the models; however, PCR is the least accurate based on both test error and R². All of the models predict applications reasonably well, and ridge regression achieves the lowest test error.
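A small summary table, built from the objects computed above, makes the same comparison easier to read than the barplots:
# Tabulate test MSE and test R-squared for all five methods
results <- data.frame(method = c("OLS", "Ridge", "Lasso", "PCR", "PLS"),
                      test.mse = c(lm.err, ridge.err, lasso.err, pcr.err, pls.err),
                      test.r2 = c(lm.r2, ridge.r2, lasso.r2, pcr.r2, pls.r2))
results[order(results$test.mse), ]  # best model first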
We will now try to predict per capita crime rate in the Boston data set.
(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
set.seed(1)
library(MASS)
library(leaps)
data(Boston)
attach(Boston)
# Best subset selection
# Prediction method for regsubsets objects (the leaps package does not supply one)
predict.regsubsets = function(object, newdata, id, ...) {
  form = as.formula(object$call[[2]])  # recover the formula used in the fit
  mat = model.matrix(form, newdata)    # build the design matrix for newdata
  coefi = coef(object, id = id)        # coefficients of the id-variable model
  mat[, names(coefi)] %*% coefi
}
k = 10
p = ncol(Boston) - 1
folds = sample(rep(1:k, length = nrow(Boston)))  # assign each row to a fold
cv.errors = matrix(NA, k, p)
for (i in 1:k) {
  best.fit = regsubsets(crim ~ ., data = Boston[folds != i, ], nvmax = p)
  for (j in 1:p) {
    pred = predict(best.fit, Boston[folds == i, ], id = j)
    cv.errors[i, j] = mean((Boston$crim[folds == i] - pred)^2)
  }
}
mean.cv.errors <- apply(cv.errors, 2, mean)
plot(mean.cv.errors, type = "b", xlab = "Number of variables", ylab = "CV error")
summary(best.fit)
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = Boston[folds != i, ], nvmax = p)
## 13 Variables (and intercept)
## Forced in Forced out
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## black FALSE FALSE
## lstat FALSE FALSE
## medv FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
## zn indus chas nox rm age dis rad tax ptratio black lstat medv
## 1 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " "*" " "
## 3 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" "*" " "
## 4 ( 1 ) "*" " " " " " " " " " " " " "*" " " " " "*" "*" " "
## 5 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " " " "*" "*"
## 6 ( 1 ) "*" "*" " " " " " " " " "*" "*" " " " " " " "*" "*"
## 7 ( 1 ) "*" "*" " " " " " " " " "*" "*" " " " " "*" "*" "*"
## 8 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " "*" "*" "*" "*"
## 9 ( 1 ) "*" " " " " "*" "*" " " "*" "*" " " "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" " " "*" "*" " " "*" "*" " " "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
which.min(mean.cv.errors)
## [1] 9
mean.cv.errors[which.min(mean.cv.errors)]
## [1] 42.81453
Cross-validation selects a nine-variable model, with a CV MSE of 42.81. In the fit shown above (from the last fold), the nine-variable model includes zn, nox, rm, dis, rad, ptratio, black, lstat, and medv.
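Note that the summary printed above is the fit from the last cross-validation fold; to report coefficients one would refit best subset selection on the full data set (the variables selected there may differ slightly from the fold shown above), e.g.:
# Refit on the full data and extract the nine-variable model's coefficients
reg.full = regsubsets(crim ~ ., data = Boston, nvmax = p)
coef(reg.full, id = which.min(mean.cv.errors))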
# Lasso model
boston.x=model.matrix(crim~., data=Boston)[,-1]
boston.y=Boston$crim
boston.lasso=cv.glmnet(boston.x,boston.y,alpha=1, type.measure = "mse")
plot(boston.lasso)
coef(boston.lasso)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 2.176491
## zn .
## indus .
## chas .
## nox .
## rm .
## age .
## dis .
## rad 0.150484
## tax .
## ptratio .
## black .
## lstat .
## medv .
boston.lasso.err=(boston.lasso$cvm[boston.lasso$lambda==boston.lasso$lambda.1se])
boston.lasso.err
## [1] 62.74783
Because the lasso performs variable selection, the model at the one-standard-error λ (the default shown by coef) removes every variable except rad. The cross-validated MSE at this λ is 62.74783.
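The coefficients printed above are at the default s = "lambda.1se"; at the less conservative lambda.min the lasso generally retains more variables, which can be checked with:
# Coefficients and CV error at lambda.min rather than lambda.1se
coef(boston.lasso, s = "lambda.min")
boston.lasso$cvm[boston.lasso$lambda == boston.lasso$lambda.min]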
# Ridge regression
boston.ridge=cv.glmnet(boston.x, boston.y, type.measure = "mse", alpha=0)
plot(boston.ridge)
coef(boston.ridge)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 1.523899542
## zn -0.002949852
## indus 0.029276741
## chas -0.166526007
## nox 1.874769665
## rm -0.142852604
## age 0.006207995
## dis -0.094547258
## rad 0.045932737
## tax 0.002086668
## ptratio 0.071258052
## black -0.002605281
## lstat 0.035745604
## medv -0.023480540
boston.ridge.err=boston.ridge$cvm[boston.ridge$lambda==boston.ridge$lambda.1se]
boston.ridge.err
## [1] 58.8156
Ridge regression shrinks all of the coefficients toward zero but, unlike the lasso, never sets them exactly to zero. In this fit, nox has by far the largest coefficient, and the cross-validated MSE at the one-standard-error λ is 58.8156.
# PCR model
library(pls)
boston.pcr=pcr(crim~., data=Boston, scale=TRUE, validation="CV")
summary(boston.pcr)
## Data: X dimension: 506 13
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.61 7.175 7.180 6.724 6.731 6.727 6.727
## adjCV 8.61 7.174 7.179 6.721 6.725 6.724 6.724
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.722 6.614 6.618 6.607 6.598 6.553 6.488
## adjCV 6.718 6.609 6.613 6.602 6.592 6.546 6.481
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 47.70 60.36 69.67 76.45 82.99 88.00 91.14 93.45
## crim 30.69 30.87 39.27 39.61 39.61 39.86 40.14 42.47
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.40 97.04 98.46 99.52 100.0
## crim 42.55 42.78 43.04 44.13 45.4
pcr.pred=predict(boston.pcr, ncomp=13)  # no newdata: fitted values on the full data
pcr.err=mean((pcr.pred-Boston$crim)^2)  # hence this is a training MSE
pcr.err
## [1] 40.31607
The cross-validated error is minimized with all 13 components, but in that case PCR performs no dimension reduction and is equivalent to ordinary least squares. The 13-component fit explains 100% of the variance in the predictors and 45.4% of the variance in the response. Note that the MSE of 40.31607 computed above is a training error, since the predictions are made on the same data used to fit the model; the cross-validated RMSEP of 6.488 at 13 components corresponds to a CV MSE of about 42.1.
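Because pcr reports cross-validated RMSEP while the other models report MSE, squaring puts them on the same scale; a sketch using the fitted object:
# Convert cross-validated RMSEP to MSE for comparison with the other models
cv.rmsep = RMSEP(boston.pcr, estimate = "CV")$val["CV", 1, ]
cv.mse = cv.rmsep^2
cv.mse["13 comps"]  # about 6.488^2 = 42.1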
(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
Comparing like-for-like cross-validated errors, the full 13-component PCR fit (CV MSE ≈ 42.1, equivalent to least squares) and the nine-variable best subset model (CV MSE 42.81) perform best; the lasso and ridge fits at their one-standard-error λ values have noticeably higher CV errors (62.75 and 58.82). The training MSE of 40.32 overstates PCR's accuracy, so based on cross-validated performance I would propose the 13-component PCR model, with the nine-variable best subset model as a more parsimonious alternative that performs almost as well.
(c) Does your chosen model involve all of the features in the data set? Why or why not?
Yes, the chosen model involves all of the features in the data set: retaining all 13 components gives the lowest cross-validated error while explaining the most variance, although it also means that PCR provides no dimension reduction over ordinary least squares here. The nine-variable best subset alternative shows that a model with fewer features can achieve nearly the same CV error.