9. In this exercise, we will predict the number of applications received using the other variables in the College data set.
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.5
data(College)
(a) Split the data set into a training set and a test set.
set.seed(1)
id = sample(1:nrow(College), nrow(College)/2)
train = College[id,]
test = College[-id,]
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
lm.fit = lm(Apps~., data=train)
lm.pred = predict(lm.fit, test)
lm.error = mean((test$Apps - lm.pred)^2)
lm.error
## [1] 1135758
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
#install.packages("glmnet", repos = "https://cran.us.r-project.org")
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.0.5
## Loading required package: Matrix
## Loaded glmnet 4.1-2
train.X = model.matrix(Apps ~ ., data = train)
train.Y = train[, "Apps"]
test.X = model.matrix(Apps ~ ., data = test)
test.Y = test[, "Apps"]
grid = 10 ^ seq(4, -2, length=100)
ridge.mod = glmnet(train.X, train.Y, alpha =0, lambda=grid, thresh = 1e-12)
lambda.best = ridge.mod$lambda.min
ridge.pred = predict(ridge.mod, newx= test.X, s=lambda.best)
(ridge.err = mean((test.Y - ridge.pred)^2))
## [1] 1164319
(d) Fit a lasso model on the training set, with λ chosen by crossvalidation. Report the test error obtained, along with the number of non-zero coefficient estimates.
lasso.mod = glmnet(train.X, train.Y, alpha =1, lambda=grid, thresh = 1e-12)
lasso.cv = cv.glmnet(train.X, train.Y, alpha =1, lambda=grid, thresh = 1e-12)
lambda.best = lasso.cv$lambda.min
lasso.pred = predict(lasso.mod, newx= test.X, s=lambda.best)
(lasso.err = mean((test.Y - lasso.pred)^2))
## [1] 1135660
lasso_coef = predict(lasso.mod, s = lambda.best, type="coefficients")
round(lasso_coef, 3)
## 19 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -790.036
## (Intercept) .
## PrivateYes -307.010
## Accept 1.779
## Enroll -1.470
## Top10perc 66.722
## Top25perc -22.304
## F.Undergrad 0.093
## P.Undergrad 0.009
## Outstate -0.108
## Room.Board 0.212
## Books 0.291
## Personal 0.006
## PhD -15.472
## Terminal 6.410
## S.F.Ratio 22.826
## perc.alumni 1.130
## Expend 0.049
## Grad.Rate 7.488
- Test error: 1135660
- Number of non-zero coefficients: 18 out of 19
(e) Fit a PCR model on the training set, with M chosen by crossvalidation. Report the test error obtained, along with the value of M selected by cross-validation.
#install.packages("pls", repos = "https://cran.us.r-project.org")
library(pls)
## Warning: package 'pls' was built under R version 4.0.5
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
pcr.fit = pcr(Apps~., data=train, scale=TRUE, validation="CV")
validationplot(pcr.fit, val.type="MSEP")

summary(pcr.fit)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4288 4027 2351 2355 2046 1965 1906
## adjCV 4288 4031 2347 2353 2014 1955 1899
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1910 1913 1871 1799 1799 1802 1800
## adjCV 1903 1908 1866 1790 1792 1795 1793
## 14 comps 15 comps 16 comps 17 comps
## CV 1832 1728 1310 1222
## adjCV 1829 1702 1296 1212
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.20 57.78 65.31 70.99 76.37 81.27 84.8 87.85
## Apps 13.44 70.93 71.07 79.87 81.15 82.25 82.3 82.33
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.62 92.91 94.98 96.74 97.79 98.72 99.42
## Apps 83.38 84.76 84.80 84.84 85.11 85.14 90.55
## 16 comps 17 comps
## X 99.88 100.00
## Apps 93.42 93.89
pcr.pred = predict(pcr.fit, test, ncomp=10)
(pcr.error = mean((test$Apps - pcr.pred)^2))
## [1] 1723100
- M is equal to 17
- Test error: 1723100
(f) Fit a PLS model on the training set, with M chosen by crossvalidation. Report the test error obtained, along with the value of M selected by cross-validation.
pcr.fit2 = pcr(Apps~., data=train, scale=TRUE, validation="CV")
validationplot(pcr.fit2, val.type="MSEP")

summary(pcr.fit2)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4288 4054 2422 2432 2117 2022 1959
## adjCV 4288 4051 2415 2426 2061 1999 1948
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1957 1955 1911 1852 1853 1856 1861
## adjCV 1947 1947 1900 1840 1843 1846 1851
## 14 comps 15 comps 16 comps 17 comps
## CV 1877 1679 1338 1269
## adjCV 1876 1649 1323 1256
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.20 57.78 65.31 70.99 76.37 81.27 84.8 87.85
## Apps 13.44 70.93 71.07 79.87 81.15 82.25 82.3 82.33
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.62 92.91 94.98 96.74 97.79 98.72 99.42
## Apps 83.38 84.76 84.80 84.84 85.11 85.14 90.55
## 16 comps 17 comps
## X 99.88 100.00
## Apps 93.42 93.89
pls.pred = predict(pcr.fit2, test, ncomp=12)
(pls.err = mean((test$Apps - pls.pred)^2))
## [1] 1706651
- Test error: 1723100
- M is equal to 12
11. We will now try to predict per capita crime rate in the Boston data set.
(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
#install.packages("leaps", repos = "https://cran.us.r-project.org")
- Best Subset selection
library(MASS)
library(leaps)
## Warning: package 'leaps' was built under R version 4.0.5
library(glmnet)
data(Boston)
set.seed(1)
predict.regsubsets = function(object, newdata, id, ...) {
form = as.formula(object$call[[2]])
mat = model.matrix(form, newdata)
coefi = coef(object, id = id)
xvars = names(coefi)
mat[, xvars] %*% coefi
}
k = 10
folds = sample(1:k, nrow(Boston), replace = TRUE)
cv.errors = matrix(NA, k, 13, dimnames = list(NULL, paste(1:13)))
for (j in 1:k) {
best.fit = regsubsets(crim ~ ., data = Boston[folds != j, ], nvmax = 13)
for (i in 1:13) {
pred = predict(best.fit, Boston[folds == j, ], id = i)
cv.errors[j, i] = mean((Boston$crim[folds == j] - pred)^2)
}
}
mean.cv.errors = apply(cv.errors, 2, mean)
plot(mean.cv.errors, type = "b", xlab = "Number of variables", ylab = "CV error")

min(mean.cv.errors)
## [1] 42.46014
regfit.best = regsubsets(crim~., data=Boston, nvmax=13)
coef(regfit.best, 12)
## (Intercept) zn indus chas nox
## 16.985713928 0.044673247 -0.063848469 -0.744367726 -10.202169211
## rm dis rad tax ptratio
## 0.439588002 -0.993556631 0.587660185 -0.003767546 -0.269948860
## black lstat medv
## -0.007518904 0.128120290 -0.198877768
- Lasso
x = model.matrix(crim ~ ., Boston)[, -1]
y = Boston$crim
cv.out = cv.glmnet(x, y, alpha = 1, type.measure = "mse")
plot(cv.out)

cv.out
##
## Call: cv.glmnet(x = x, y = y, type.measure = "mse", alpha = 1)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 0.051 51 43.11 14.16 11
## 1se 3.376 6 56.89 17.29 1
- Ridge
cv.out = cv.glmnet(x, y, alpha = 0, type.measure = "mse")
cv.out
##
## Call: cv.glmnet(x = x, y = y, type.measure = "mse", alpha = 0)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 0.54 100 43.48 14.33 13
## 1se 74.44 47 57.69 16.76 13
- pcr
pcr.fit = pcr(crim ~ ., data = Boston, scale = TRUE, validation = "CV")
validationplot(pcr.fit, val.type = "MSEP")

(c) Does your chosen model involve all of the features in the data set? Why or why not?
- The 9 parameter best subset model because it had the best cross-validated RMSE
- It features the best 12 models, 1 feature is missing (age)