For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
a. The lasso, relative to least squares, is:
Answer choice iii. is correct because of the bias-variance trade-off. The lasso shrinks the least squares coefficient estimates toward zero and can set some of them exactly to zero, so it is less flexible than least squares. The reduced flexibility lowers variance at the cost of some additional bias, and prediction accuracy improves whenever the increase in bias is smaller than the decrease in variance.
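As a quick illustration (not part of the original solution), a lasso fit on simulated data shows coefficients being forced exactly to zero; the data, number of predictors, and the λ value of 0.5 are arbitrary choices for this sketch.
library(glmnet)
set.seed(1)
# simulated data: 10 predictors, only the first three truly related to y
x = matrix(rnorm(100 * 10), 100, 10)
y = x[, 1] + 2 * x[, 2] - x[, 3] + rnorm(100)
fit.demo = glmnet(x, y, alpha = 1)   # alpha = 1 gives the lasso
coef(fit.demo, s = 0.5)              # at a moderate lambda, several coefficients are exactly zero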
b. Repeat (a) for ridge regression relative to least squares.
Answer choice iii. is correct again, for the same bias-variance reason. Ridge regression also shrinks the coefficient estimates and is less flexible than least squares, although unlike the lasso it never sets coefficients exactly to zero; it improves prediction accuracy when the increase in bias is smaller than the decrease in variance.
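For contrast, the same kind of sketch with ridge regression (again simulated data and an arbitrary λ) shows every coefficient shrunk toward zero but none set exactly to zero.
library(glmnet)
set.seed(1)
x = matrix(rnorm(100 * 10), 100, 10)
y = x[, 1] + 2 * x[, 2] - x[, 3] + rnorm(100)
fit.demo.r = glmnet(x, y, alpha = 0)   # alpha = 0 gives ridge regression
coef(fit.demo.r, s = 0.5)              # all coefficients are shrunk but remain non-zero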
c. Repeat (a) for non-linear methods relative to least squares.
Answer choice ii. is correct. Non-linear methods are more flexible than least squares because they are not restricted to a linear fit, so they have lower bias but higher variance. They give improved prediction accuracy only when the increase in variance is smaller than the decrease in bias.
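A minimal sketch of the flexibility difference (simulated data, arbitrary degree choice): a flexible polynomial fit tracks a curved signal much more closely than a straight line, i.e. it has lower bias, at the price of higher variance.
set.seed(1)
# simulated non-linear relationship
x = runif(200, -2, 2)
y = sin(2 * x) + rnorm(200, sd = 0.3)
fit.lin = lm(y ~ x)             # rigid linear fit: high bias for this curved signal
fit.poly = lm(y ~ poly(x, 10))  # flexible fit: lower bias, higher variance
c(linear = mean(resid(fit.lin)^2), poly10 = mean(resid(fit.poly)^2))  # training MSEs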
In this exercise, we will predict the number of applications received using the other variables in the College data set.
a. Split the data set into a training set and a test set.
library(ISLR)
data(College)
set.seed(1)
train = sample(1:dim(College)[1], dim(College)[1] / 2)
College.train = College[train, ]
College.test = College[-train, ]
b. Fit a linear model using least squares on the training set, and report the test error obtained.
fit.lm = lm(Apps ~ ., data = College.train)
pred.lm = predict(fit.lm, College.test)
mean((pred.lm - College.test$Apps)^2)
## [1] 1135758
c. Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
# design matrices for glmnet, and a grid of lambda values from 10^4 down to 10^-2
train.mat = model.matrix(Apps ~ ., data = College.train)
test.mat = model.matrix(Apps ~ ., data = College.test)
grid = 10 ^ seq(4, -2, length = 100)
library(glmnet)
# alpha = 0 fits ridge regression; cv.glmnet chooses lambda by 10-fold cross-validation
fit.ridge = glmnet(train.mat, College.train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
cv.ridge = cv.glmnet(train.mat, College.train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
bestlam.ridge = cv.ridge$lambda.min
bestlam.ridge
## [1] 0.01
pred.ridge = predict(fit.ridge, s = bestlam.ridge, newx = test.mat)
mean((pred.ridge - College.test$Apps)^2)
## [1] 1135714
d. Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
# alpha = 1 fits the lasso
fit.lasso = glmnet(train.mat, College.train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
cv.lasso = cv.glmnet(train.mat, College.train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
bestlam.lasso = cv.lasso$lambda.min
bestlam.lasso
## [1] 0.01
pred.lasso = predict(fit.lasso, s = bestlam.lasso, newx = test.mat)
mean((pred.lasso - College.test$Apps)^2)
## [1] 1135660
predict(fit.lasso, s = bestlam.lasso, type = "coefficients")
## 19 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -7.900363e+02
## (Intercept) .
## PrivateYes -3.070103e+02
## Accept 1.779328e+00
## Enroll -1.469508e+00
## Top10perc 6.672214e+01
## Top25perc -2.230442e+01
## F.Undergrad 9.258974e-02
## P.Undergrad 9.408838e-03
## Outstate -1.083495e-01
## Room.Board 2.115147e-01
## Books 2.912105e-01
## Personal 6.120406e-03
## PhD -1.547200e+01
## Terminal 6.409503e+00
## S.F.Ratio 2.282638e+01
## perc.alumni 1.130498e+00
## Expend 4.856697e-02
## Grad.Rate 7.488081e+00
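The question also asks for the number of non-zero coefficient estimates. From the printout above, none of the predictors are eliminated at the selected λ = 0.01; all 17 of them (plus the intercept) have non-zero estimates. A short sketch, reusing the objects already defined, that counts them directly:
lasso.coef = predict(fit.lasso, s = bestlam.lasso, type = "coefficients")
sum(as.vector(lasso.coef) != 0)   # number of non-zero estimates, intercept included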
e. Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
library(pls)
fit.pcr = pcr(Apps ~ ., data = College.train, scale = TRUE, validation = "CV")
validationplot(fit.pcr, val.type = "MSEP")
pred.pcr = predict(fit.pcr, College.test, ncomp = 10)   # test predictions with M = 10 components
mean((pred.pcr - College.test$Apps)^2)
## [1] 1723100
f. Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
fit.pls = plsr(Apps ~ ., data = College.train, scale = TRUE, validation = "CV")
validationplot(fit.pls, val.type = "MSEP")
pred.pls = predict(fit.pls, College.test, ncomp = 10)   # test predictions with M = 10 components
mean((pred.pls - College.test$Apps)^2)
## [1] 1131661
g. Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
test.avg = mean(College.test$Apps)
lm.r2 = 1 - mean((pred.lm - College.test$Apps)^2) / mean((test.avg - College.test$Apps)^2)
lm.r2
## [1] 0.9015413
ridge.r2 = 1 - mean((pred.ridge - College.test$Apps)^2) / mean((test.avg - College.test$Apps)^2)
ridge.r2
## [1] 0.9015452
lasso.r2 = 1 - mean((pred.lasso - College.test$Apps)^2) / mean((test.avg - College.test$Apps)^2)
lasso.r2
## [1] 0.9015499
pcr.r2 = 1 - mean((pred.pcr - College.test$Apps)^2) / mean((test.avg - College.test$Apps)^2)
pcr.r2
## [1] 0.8506248
pls.r2 = 1 - mean((pred.pls - College.test$Apps)^2) / mean((test.avg - College.test$Apps)^2)
pls.r2
## [1] 0.9018965
All of the models predict the number of applications fairly well. The lowest test R² is about 0.851, for the PCR model; the other four are nearly indistinguishable at roughly 0.90, meaning that about 90% of the variance in the number of applications received in the test set is explained by the linear, ridge regression, lasso, and PLS models. Apart from PCR, there is very little difference among the test errors of these approaches.
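To put the comparison in one place, the test MSEs and test R² values above can be collected into a small table (a sketch reusing the prediction objects and R² values computed earlier):
results = data.frame(
  model = c("OLS", "Ridge", "Lasso", "PCR", "PLS"),
  test.MSE = c(mean((pred.lm - College.test$Apps)^2),
               mean((pred.ridge - College.test$Apps)^2),
               mean((pred.lasso - College.test$Apps)^2),
               mean((pred.pcr - College.test$Apps)^2),
               mean((pred.pls - College.test$Apps)^2)),
  test.R2 = c(lm.r2, ridge.r2, lasso.r2, pcr.r2, pls.r2)
)
results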
We will now try to predict per capita crime rate in the Boston data set.
a. Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
library(MASS)
data(Boston)
set.seed(1)
library(leaps)
# validation-set approach: split Boston into training and test halves
train = sample(c(TRUE, FALSE), nrow(Boston), replace = TRUE)
test = (!train)
# best subset selection on the training data, up to all 13 predictors
regfit.best = regsubsets(crim ~ ., data = Boston[train, ], nvmax = 13)
test.mat = model.matrix(crim ~ ., data = Boston[test, ])
# validation-set MSE for the best model of each size
val.errors = rep(NA, 13)
for (i in 1:13) {
  coefi = coef(regfit.best, id = i)
  pred = test.mat[, names(coefi)] %*% coefi
  val.errors[i] = mean((Boston$crim[test] - pred)^2)
}
summary(regfit.best)
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = Boston[train, ], nvmax = 13)
## 13 Variables (and intercept)
## Forced in Forced out
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## black FALSE FALSE
## lstat FALSE FALSE
## medv FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
## zn indus chas nox rm age dis rad tax ptratio black lstat medv
## 1 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" " " " "
## 3 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" "*" " "
## 4 ( 1 ) " " "*" " " " " " " " " " " "*" " " " " "*" "*" " "
## 5 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " "*" " " "*"
## 6 ( 1 ) "*" "*" " " " " " " " " "*" "*" " " " " "*" " " "*"
## 7 ( 1 ) "*" "*" " " " " " " " " "*" "*" " " " " "*" "*" "*"
## 8 ( 1 ) "*" "*" " " "*" " " " " "*" "*" " " " " "*" "*" "*"
## 9 ( 1 ) "*" "*" " " "*" " " " " "*" "*" " " "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" " " "*" " " "*" "*" "*" " " "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" " " "*" "*" "*" "*" "*" " " "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" " " "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
# predict() method for regsubsets objects, needed for the cross-validation loop below
predict.regsubsets = function(object, newdata, id, ...) {
  form = as.formula(object$call[[2]])
  mat = model.matrix(form, newdata)
  coefi = coef(object, id = id)
  xvars = names(coefi)
  mat[, xvars] %*% coefi
}
# refit best subset selection on the full data set
regfit.best = regsubsets(crim ~ ., data = Boston, nvmax = 13)
coef(regfit.best, 10)
## (Intercept) zn indus nox rm dis
## 16.38579874 0.04186311 -0.09330371 -10.62174498 0.44828268 -0.99077268
## rad ptratio black lstat medv
## 0.53566370 -0.26956103 -0.00759461 0.13084608 -0.19800062
# 10-fold cross-validation of best subset selection, over subset sizes 1 to 13
k = 10
set.seed(1)
folds = sample(1:k, nrow(Boston), replace = TRUE)
cv.errors = matrix(NA, k, 13, dimnames = list(NULL, paste(1:13)))
for (j in 1:k) {
best.fit = regsubsets(crim ~ ., data = Boston[folds != j, ], nvmax = 13)
for (i in 1:13) {
pred = predict(best.fit, Boston[folds == j, ], id = i)
cv.errors[j, i] = mean((Boston$crim[folds == j] - pred)^2)
}
}
mean.cv.errors = apply(cv.errors, 2, mean)
mean.cv.errors
## 1 2 3 4 5 6 7 8
## 45.44573 43.87260 43.94979 44.02424 43.96415 43.96199 42.96268 42.66948
## 9 10 11 12 13
## 42.53822 42.73416 42.52367 42.46014 42.50125
plot(mean.cv.errors, type = "b", xlab = "Number of variables", ylab = "CV error")
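The plot shows where the cross-validation error bottoms out; the same choice can be made programmatically (a small sketch using the vector just computed and the full-data fit from above):
which.min(mean.cv.errors)                       # subset size with the lowest CV error
coef(regfit.best, which.min(mean.cv.errors))    # that model, refit on the full data set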
x = model.matrix(crim ~ ., Boston)[, -1]
y = Boston$crim
cv.out.lasso = cv.glmnet(x, y, alpha = 1, type.measure = "mse")
plot(cv.out.lasso)
cv.out.rr = cv.glmnet(x, y, alpha = 0, type.measure = "mse")
plot(cv.out.rr)
pcr.fit = pcr(crim ~ ., data = Boston, scale = TRUE, validation = "CV")
summary(pcr.fit)
## Data: X dimension: 506 13
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.61 7.198 7.198 6.786 6.762 6.790 6.821
## adjCV 8.61 7.195 7.195 6.780 6.753 6.784 6.813
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.822 6.689 6.712 6.720 6.712 6.664 6.593
## adjCV 6.812 6.679 6.701 6.708 6.700 6.651 6.580
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 47.70 60.36 69.67 76.45 82.99 88.00 91.14 93.45
## crim 30.69 30.87 39.27 39.61 39.61 39.86 40.14 42.47
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.40 97.04 98.46 99.52 100.0
## crim 42.55 42.78 43.04 44.13 45.4
validationplot(pcr.fit, val.type = "MSEP")
b. Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
mean.cv.errors[12]
## 12
## 42.46014
cv.out.rr
##
## Call: cv.glmnet(x = x, y = y, type.measure = "mse", alpha = 0)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 0.54 100 43.48 14.33 13
## 1se 74.44 47 57.69 16.76 13
cv.out.lasso
##
## Call: cv.glmnet(x = x, y = y, type.measure = "mse", alpha = 1)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 0.051 51 43.11 14.16 11
## 1se 3.376 6 56.89 17.29 1
# MSE of the 5-component PCR model, evaluated on the full data it was fit to
boston.predict.pcr = predict(pcr.fit, Boston, ncomp = 5)
pcr.MSE = mean((Boston$crim - boston.predict.pcr)^2)
pcr.MSE
## [1] 44.59322
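Note that pcr.fit was fit to the full Boston data, so the 44.59 above is essentially a training error. A cross-validated analogue can be read off the fitted object by squaring the CV RMSEP, or directly with the MSEP() helper from pls; a sketch (the 5-component choice mirrors the prediction above):
cv.msep = MSEP(pcr.fit)$val["CV", 1, ]   # CV mean squared error for 0 to 13 components
cv.msep[6]                               # entry 6 corresponds to the 5-component model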
Of the models considered, the one with the lowest cross-validation estimate of the test MSE, 42.46, is the best subset selection model with 12 predictors. The lasso and ridge regression are not far behind, with cross-validated mean squared errors of about 43.11 and 43.48, respectively. PCR performs worst: the 5-component model has an MSE of 44.59 when evaluated on the full data set, and its cross-validated RMSEP of about 6.79 (from the summary above) corresponds to a CV mean squared error of roughly 46.
c. Does your chosen model involve all of the features in the data set? Why or why not?
No, my chosen model does not include all of the features in the data set. Cross-validation selects the best subset model with 12 of the 13 predictors; its CV estimate of the test MSE (42.46) is lower than that of the full 13-variable model (42.50), so dropping the remaining predictor gives a slightly more accurate model.
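To see which predictor is left out, the 12-variable model refit on the full data can be inspected (a short sketch; in the training-set selection table shown earlier, chas is the variable excluded at size 12):
coef.12 = coef(regfit.best, 12)                         # 12-variable best subset model on the full data
setdiff(colnames(Boston), c("crim", names(coef.12)))    # the predictor not included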