library(tidyverse)
library(ISLR2)
library(leaps)
library(glmnet)
library(pls)
set.seed(1989)
For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
The lasso, relative to least squares, is
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
False: The lasso is less flexible than least squares.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
False: The lasso is less flexible than least squares.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
True: As lambda increases, the flexibility of the lasso decreases, so its bias increases while its variance decreases. Prediction accuracy therefore improves over least squares when the increase in bias is smaller than the decrease in variance (a short simulation after item iv illustrates this shrinkage).
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
False: The lasso increases bias in order to decrease variance.
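The shrinkage can be seen directly with a tiny simulation; everything below (n, p, X, beta, y.sim, fit.lasso) is made-up illustration data and not part of the exercise.
#sketch: as lambda grows, the lasso shrinks coefficients toward zero and sets some exactly to zero,
#i.e. the fit becomes less flexible (higher bias, lower variance)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
beta <- c(3, -2, 1.5, rep(0, p - 3)) #only three predictors truly matter
y.sim <- drop(X %*% beta + rnorm(n))
fit.lasso <- glmnet(X, y.sim, alpha=1)
coef(fit.lasso, s=c(0.01, 0.5, 2)) #columns for increasing lambda: smaller, sparser coefficients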
Ridge regression, relative to least squares, is
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
False: Ridge regression is less flexible than least squares.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
False: Ridge regression is less flexible than least squares.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
True: As with the lasso, increasing lambda makes ridge regression less flexible, so its bias increases while its variance decreases. Prediction accuracy therefore improves over least squares when the increase in bias is smaller than the decrease in variance (see the ridge sketch after item iv).
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
False: Ridge regression increases bias in order to decrease variance.
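Ridge shows the same shrinkage with one visible difference: coefficients shrink toward zero but none are set exactly to zero. A sketch reusing the simulated X and y.sim from the lasso example above:
#sketch: ridge (alpha = 0) shrinks every coefficient but keeps all of them non-zero
fit.ridge <- glmnet(X, y.sim, alpha=0)
coef(fit.ridge, s=c(0.01, 0.5, 2))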
Non-linear methods, relative to least squares, are
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
True: Non-linear methods make fewer assumptions about the form of f, meaning their bias is low, at a cost of higher variance. Therefore, non-linear methods will offer improved prediction accuracy when their increase in variance is less than their decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
In this exercise, we will predict the number of applications
received using the other variables in the College data
set.
data("College")
Split the data into a training set and a test set.
train <- sample(777, 622) #train on ~80% of data
Fit a linear model using least squares on the training set, and report the test error obtained.
lm.coll <- lm(Apps~., data=College[train, ])
summary(lm.coll)
##
## Call:
## lm(formula = Apps ~ ., data = College[train, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -5383.3 -448.8 -28.6 333.9 7269.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -434.43914 461.09746 -0.942 0.346475
## PrivateYes -539.75532 155.03367 -3.482 0.000535 ***
## Accept 1.60828 0.04470 35.980 < 2e-16 ***
## Enroll -0.54037 0.21605 -2.501 0.012645 *
## Top10perc 48.13104 6.39385 7.528 1.89e-13 ***
## Top25perc -14.59787 5.16736 -2.825 0.004884 **
## F.Undergrad -0.02005 0.03788 -0.529 0.596776
## P.Undergrad 0.05506 0.03428 1.606 0.108752
## Outstate -0.10173 0.02135 -4.766 2.36e-06 ***
## Room.Board 0.16612 0.05552 2.992 0.002884 **
## Books -0.01232 0.26380 -0.047 0.962757
## Personal 0.01156 0.07243 0.160 0.873261
## PhD -6.35582 5.12489 -1.240 0.215388
## Terminal -4.82663 5.61717 -0.859 0.390536
## S.F.Ratio 16.55312 14.81796 1.117 0.264396
## perc.alumni 1.26217 4.72216 0.267 0.789340
## Expend 0.08874 0.01488 5.963 4.21e-09 ***
## Grad.Rate 9.16692 3.24776 2.823 0.004921 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1059 on 604 degrees of freedom
## Multiple R-squared: 0.93, Adjusted R-squared: 0.9281
## F-statistic: 472.3 on 17 and 604 DF, p-value: < 2.2e-16
pred <- predict(lm.coll, College)
MSE.lm <- mean((College$Apps - pred)[-train]^2)
The test MSE of this linear model is 1.008899 × 10^6.
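To put this number on the scale of the response, here is a small sketch (using the objects already defined) that reports the test RMSE alongside the average number of applications in the test set:
#interpret the test error on the original scale of Apps
sqrt(MSE.lm) #root mean squared test error, in applications
mean(College$Apps[-train]) #average Apps in the test set, for reference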
Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
College$Private <- as.numeric(College$Private)
x <- model.matrix(College$Apps~., College[ ,c(1,3:18)])
y <- College$Apps
grid <- 10^seq(10, -2, length=100)
ridge.mod <- glmnet(x, y, alpha=0, lambda=grid)
dim(coef(ridge.mod))
## [1] 19 100
train <- sample(1:nrow(x), nrow(x)*0.8) #80-20 train-test
test <- (-train)
y.test <- y[test]
ridge.mod <- glmnet(x[train, ], y[train], alpha=0, lambda=grid, thresh=1e-12)
plot(ridge.mod)
cv.out <- cv.glmnet(x[train, ], y[train], alpha=0)
plot(cv.out)
bestlam.rr <- cv.out$lambda.min
The best lambda for this ridge regression model is 365.2.
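As a side note, cv.glmnet also stores lambda.1se, the largest lambda whose CV error is within one standard error of the minimum; it is a more conservative (more heavily shrunk) alternative worth glancing at:
#one-standard-error alternative to lambda.min
cv.out$lambda.1se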
#view best model coefficients
bestmod.rr <- glmnet(x, y, alpha=0, lambda=bestlam.rr)
coef(bestmod.rr)
## 19 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -9.423576e+02
## (Intercept) .
## Private -5.278753e+02
## Accept 1.002931e+00
## Enroll 4.372110e-01
## Top10perc 2.575663e+01
## Top25perc 5.809636e-01
## F.Undergrad 7.232628e-02
## P.Undergrad 2.411640e-02
## Outstate -2.397655e-02
## Room.Board 1.991324e-01
## Books 1.287015e-01
## Personal -8.381134e-03
## PhD -4.020244e+00
## Terminal -4.815878e+00
## S.F.Ratio 1.297260e+01
## perc.alumni -8.552850e+00
## Expend 7.586912e-02
## Grad.Rate 1.126507e+01
#make predictions
ridge.pred <- predict(bestmod.rr, s=bestlam.rr, newx=x[test, ])
MSE.rr <- mean((ridge.pred-y.test)^2)
The test MSE of the ridge regression model is 8.263948 × 10^5.
Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
#model setup
lasso.mod <- glmnet(x[train, ], y[train], alpha=1, lambda=grid, thresh=1e-12)
plot(lasso.mod)
## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values
#find best lambda
cv.out.lasso <- cv.glmnet(x[train, ], y[train], alpha=1)
plot(cv.out.lasso)
bestlam.lasso <- cv.out.lasso$lambda.min
#update model and make predictions
bestmod.lasso <- glmnet(x[train, ], y[train], alpha=1, lambda=bestlam.lasso, thresh=1e-12)
coef(bestmod.lasso)
## 19 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -432.00481119
## (Intercept) .
## Private -444.55723237
## Accept 1.57003828
## Enroll -0.56586206
## Top10perc 49.50710420
## Top25perc -10.54447750
## F.Undergrad .
## P.Undergrad 0.05514873
## Outstate -0.08258519
## Room.Board 0.18917233
## Books 0.02072870
## Personal 0.04247966
## PhD -8.15668704
## Terminal -2.73771145
## S.F.Ratio 25.35721355
## perc.alumni .
## Expend 0.07137458
## Grad.Rate 5.88642832
lasso.pred <- predict(bestmod.lasso, s=bestlam.lasso, newx=x[test, ])
MSE.lasso <- mean((lasso.pred-y.test)^2)
The test MSE of the lasso model is 8.482023 × 10^5. The lasso shrank two coefficients (F.Undergrad and perc.alumni) exactly to zero, leaving a model with 15 non-zero predictors.
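The count of non-zero coefficients can be verified programmatically; a short check using the fitted lasso object above:
#count non-zero lasso coefficients (subtract 1 for the intercept)
lasso.coef <- coef(bestmod.lasso)
sum(lasso.coef != 0) - 1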
Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
#train model
pcr.fit <- pcr(Apps~., data=College[train, ], scale=TRUE, validation="CV")
#choose M by cross-validation
summary(pcr.fit)
## Data: X dimension: 621 17
## Y dimension: 621 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3899 3851 2092 2107 1860 1659 1657
## adjCV 3899 3851 2089 2108 1823 1649 1652
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1652 1605 1593 1599 1603 1603 1612
## adjCV 1648 1595 1589 1594 1599 1599 1608
## 14 comps 15 comps 16 comps 17 comps
## CV 1616 1550 1236 1196
## adjCV 1612 1526 1226 1188
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 31.943 57.08 64.22 69.99 75.54 80.50 84.17 87.50
## Apps 3.233 71.85 71.85 80.00 83.04 83.05 83.22 84.28
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.52 92.90 94.99 96.76 97.84 98.75 99.38
## Apps 84.47 84.65 84.69 84.75 84.79 84.82 90.16
## 16 comps 17 comps
## X 99.86 100.00
## Apps 92.34 92.64
validationplot(pcr.fit, val.type="MSEP")
#test predictions
pcr.pred <- predict(pcr.fit, College[test, ], ncomp=17)
MSE.pcr <- mean((pcr.pred - College$Apps[test])^2)
#set model
pcr.mod <- pcr(Apps~., data=College, scale=TRUE, ncomp=17)
summary(pcr.mod)
## Data: X dimension: 777 17
## Y dimension: 777 1
## Fit method: svdpc
## Number of components considered: 17
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 31.670 57.30 64.30 69.90 75.39 80.38 83.99 87.40
## Apps 2.316 73.06 73.07 82.08 84.08 84.11 84.32 85.18
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.50 92.91 95.01 96.81 97.9 98.75 99.36
## Apps 85.88 86.06 86.06 86.10 86.1 86.13 90.32
## 16 comps 17 comps
## X 99.84 100.00
## Apps 92.52 92.92
The smallest cross-validated RMSEP is 1196, at 17 components; squaring it gives a cross-validated MSE estimate of about 1,430,416. However, with M = 17 the model uses all of the components, so PCR offers no dimension reduction over the ordinary least-squares fit.
The test MSE of the PCR model is 8.884463 × 10^5.
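Rather than squaring the printed RMSEP by hand, the pls package can return cross-validated MSEP directly; a sketch, assuming the standard MSEP() accessor behaves as documented:
#extract the cross-validated MSEP curve and locate its minimum
msep.cv <- MSEP(pcr.fit, estimate="CV")
which.min(msep.cv$val[1, 1, ]) - 1 #subtract 1 because the first entry is the intercept-only model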
Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
#train model
pls.fit <- plsr(Apps~., data=College, subset=train, scale=TRUE, validation="CV")
#choose M by cross-validation
summary(pls.fit)
## Data: X dimension: 621 17
## Y dimension: 621 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3899 1931 1687 1537 1480 1296 1254
## adjCV 3899 1928 1684 1532 1458 1272 1241
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1237 1226 1223 1221 1216 1215 1213
## adjCV 1226 1216 1214 1211 1206 1206 1204
## 14 comps 15 comps 16 comps 17 comps
## CV 1214 1214 1214 1214
## adjCV 1204 1205 1205 1205
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 25.46 43.67 62.52 64.67 66.96 71.61 74.57 78.82
## Apps 76.89 83.25 86.69 90.52 92.31 92.44 92.50 92.54
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 82.49 85.27 87.81 90.74 92.93 94.98 96.99
## Apps 92.58 92.60 92.62 92.63 92.63 92.64 92.64
## 16 comps 17 comps
## X 98.85 100.00
## Apps 92.64 92.64
validationplot(pls.fit, val.type="MSEP")
#test predictions
pls.pred <- predict(pls.fit, College[test, ], ncomp=13)
MSE.pls <- mean((pls.pred - College$Apps[test])^2)
#set model
pls.mod <- plsr(Apps~., data=College, scale=TRUE, ncomp=13)
summary(pls.mod)
## Data: X dimension: 777 17
## Y dimension: 777 1
## Fit method: kernelpls
## Number of components considered: 13
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 25.76 40.33 62.59 64.97 66.87 71.33 75.39 79.37
## Apps 78.01 85.14 87.67 90.73 92.63 92.72 92.77 92.82
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 82.36 85.04 87.92 90.65 92.69
## Apps 92.87 92.89 92.90 92.91 92.92
The smallest cross-validated RMSEP is 1213, at 13 components; squaring it gives a cross-validated MSE estimate of about 1,471,369.
The test MSE of the PLS model is 8.872254 × 10^5.
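The pls package also provides an automatic heuristic for choosing M; a sketch assuming pls::selectNcomp (available in recent versions of the package):
#one-sigma heuristic for the number of PLS components
selectNcomp(pls.fit, method="onesigma", plot=FALSE)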
Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
Comparing test MSEs:
Linear regression: 1.008899 × 10^6
Ridge regression: 8.263948 × 10^5
Lasso: 8.482023 × 10^5
Principal component regression: 8.884463 × 10^5
Partial least squares: 8.872254 × 10^5
Each of the four alternative models explored in this exercise outperforms the least-squares linear model. Ridge regression performs best here, and the four alternatives produce broadly comparable test MSEs, all noticeably lower than that of the linear model.
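For a side-by-side view, the five test MSEs can be collected into one table, mirroring the comparison built for the Boston exercise below (a sketch using the MSE objects defined above):
#gather the College test MSEs into one table
data.frame(Model=c("Least squares", "Ridge", "Lasso", "PCR", "PLS"),
           Test.MSE=c(MSE.lm, MSE.rr, MSE.lasso, MSE.pcr, MSE.pls))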
We will now try to predict per capita crime rate in the
Boston data set.
data("Boston")
train <- sample(506, 506*0.8) #train on ~80% of data
test <- -train
#leaps::regsubsets has no predict() method, so define one for use in the
#validation and cross-validation loops below
predict.regsubsets <- function(object, newdata, id, ...){
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id=id)
  xvars <- names(coefi)
  mat[, xvars] %*% coefi
}
Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
lm.Boston <- lm(crim~., data=Boston[train, ])
summary(lm.Boston)
##
## Call:
## lm(formula = crim ~ ., data = Boston[train, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.159 -2.242 -0.420 1.023 73.671
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.838932 8.064107 1.592 0.11217
## zn 0.041925 0.022938 1.828 0.06834 .
## indus -0.033145 0.099178 -0.334 0.73841
## chas -1.163216 1.364030 -0.853 0.39430
## nox -12.742674 6.186635 -2.060 0.04009 *
## rm 0.808465 0.729457 1.108 0.26841
## age -0.002730 0.021103 -0.129 0.89715
## dis -1.025012 0.331655 -3.091 0.00214 **
## rad 0.621809 0.100103 6.212 1.34e-09 ***
## tax -0.003086 0.006063 -0.509 0.61100
## ptratio -0.302222 0.216031 -1.399 0.16261
## lstat 0.170649 0.086393 1.975 0.04894 *
## medv -0.193683 0.069700 -2.779 0.00572 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.589 on 391 degrees of freedom
## Multiple R-squared: 0.449, Adjusted R-squared: 0.4321
## F-statistic: 26.55 on 12 and 391 DF, p-value: < 2.2e-16
pred <- predict(lm.Boston, Boston)
MSE.lm.11 <- mean((Boston$crim - pred)[-train]^2)
regfit.full <- regsubsets(crim~., Boston[train, ], nvmax=12)
regsum <- summary(regfit.full)
#plot subset performance
par(mfrow=c(2,2))
plot(regsum$rss, xlab="Number of Variables", ylab="RSS")
plot(regsum$adjr2, xlab="Number of Variables", ylab="Adjusted Rsq")
which.max(regsum$adjr2)
## [1] 8
points(which.max(regsum$adjr2), regsum$adjr2[which.max(regsum$adjr2)], col='red', cex=2, pch=20)
plot(regsum$cp, xlab="Number of Variables", ylab="Cp")
which.min(regsum$cp)
## [1] 7
points(which.min(regsum$cp), regsum$cp[which.min(regsum$cp)], col='red', cex=2, pch=20)
plot(regsum$bic, xlab="Number of Variables", ylab="BIC")
which.min(regsum$bic)
## [1] 2
points(2, regsum$bic[2], col='red', cex=2, pch=20)
par(mfrow=c(2,2))
plot(regfit.full, scale = "r2")
plot(regfit.full, scale = "adjr2")
plot(regfit.full, scale = "Cp")
plot(regfit.full, scale = "bic")
#choose best model using validation set approach
regfit.best <- regsubsets(crim~., data=Boston[train, ], nvmax=12)
test.mat <- model.matrix(crim~., data=Boston[test, ])
val.errors <- rep(NA, 12)
for(i in 1:12){
  coefi <- coef(regfit.best, id=i)
  pred <- test.mat[, names(coefi)] %*% coefi
  val.errors[i] <- mean((Boston$crim[test] - pred)^2)
}
val.errors
## [1] 40.10763 37.72795 37.95855 35.79385 35.98987 36.24553 35.88195 35.92582
## [9] 36.15041 35.88625 35.77180 35.78232
sub <- which.min(val.errors)
MSE.BS <- val.errors[sub]
coef(regfit.best, sub)
## (Intercept) zn indus chas nox rm
## 12.93904744 0.04218097 -0.03320827 -1.17449167 -12.95383082 0.78777999
## dis rad tax ptratio lstat medv
## -1.01122205 0.62299693 -0.00309680 -0.30540181 0.16727128 -0.19362701
#use full data set for most accurate coefficient estimates and final variable selection
regfit.best <- regsubsets(crim~., Boston, nvmax=12)
coef(regfit.best, sub)
## (Intercept) zn indus chas nox
## 13.801555544 0.045818028 -0.058345869 -0.828283847 -10.022404078
## rm dis rad tax ptratio
## 0.623650191 -1.008539667 0.612822152 -0.003782956 -0.304784434
## lstat medv
## 0.137698956 -0.220092318
#choose among models of different sizes using cross-validation
k <- 10
folds <- sample(1:k, nrow(Boston), replace=TRUE)
cv.errors <- matrix(NA, k, 12, dimnames=list(NULL, paste(1:12)))
for(j in 1:k){
  best.fit <- regsubsets(crim~., data=Boston[folds!=j, ], nvmax=12)
  for(i in 1:12){
    pred <- predict(best.fit, Boston[folds==j, ], id=i)
    cv.errors[j, i] <- mean((Boston$crim[folds==j] - pred)^2)
  }
}
mean.cv.errors <- apply(cv.errors, 2, mean)
mean.cv.errors
## 1 2 3 4 5 6 7 8
## 44.41761 43.00535 43.08049 42.93178 42.46893 42.14550 41.96118 41.88762
## 9 10 11 12
## 41.66061 41.69240 41.72731 41.75622
par(mfrow=c(1, 1))
plot(mean.cv.errors, type='b')
reg.best <- regsubsets(crim~., data=Boston, nvmax=12)
coef(reg.best, 9)
## (Intercept) zn indus nox rm dis
## 13.18247199 0.04305936 -0.08820604 -10.46823468 0.63510592 -1.00637216
## rad ptratio lstat medv
## 0.56096588 -0.30423849 0.14040855 -0.22012525
Validation-set error selects the 11-variable model, which uses every predictor except age to model the crim response. However, 10-fold cross-validation selects the 9-variable model containing zn, indus, nox, rm, dis, rad, ptratio, lstat, and medv.
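Which predictors get dropped can be read off programmatically; a quick sketch using the full-data reg.best fit above (the expected results simply restate the coefficient listings printed earlier):
#identify the predictors excluded from each candidate model
setdiff(colnames(Boston)[-1], names(coef(reg.best, 11))[-1]) #drops age
setdiff(colnames(Boston)[-1], names(coef(reg.best, 9))[-1])  #drops chas, age, tax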
regfit.fwd <- regsubsets(crim~., data=Boston[train, ], nvmax=12, method="forward")
summary(regfit.fwd)
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = Boston[train, ], nvmax = 12,
## method = "forward")
## 12 Variables (and intercept)
## Forced in Forced out
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## lstat FALSE FALSE
## medv FALSE FALSE
## 1 subsets of each size up to 12
## Selection Algorithm: forward
## zn indus chas nox rm age dis rad tax ptratio lstat medv
## 1 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" " "
## 3 ( 1 ) " " " " "*" " " " " " " " " "*" " " " " "*" " "
## 4 ( 1 ) "*" " " "*" " " " " " " " " "*" " " " " "*" " "
## 5 ( 1 ) "*" " " "*" " " " " " " "*" "*" " " " " "*" " "
## 6 ( 1 ) "*" " " "*" "*" " " " " "*" "*" " " " " "*" " "
## 7 ( 1 ) "*" " " "*" "*" " " " " "*" "*" " " " " "*" "*"
## 8 ( 1 ) "*" " " "*" "*" " " " " "*" "*" " " "*" "*" "*"
## 9 ( 1 ) "*" " " "*" "*" "*" " " "*" "*" " " "*" "*" "*"
## 10 ( 1 ) "*" " " "*" "*" "*" " " "*" "*" "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
#test set validation
fwd.best <- regsubsets(crim~., data=Boston[train, ], nvmax=12, method="forward")
test.mat <- model.matrix(crim~., data=Boston[test, ])
val.errors <- rep(NA, 12)
for(i in 1:12){
  coefi <- coef(regfit.best, id=i)
  pred <- test.mat[, names(coefi)] %*% coefi
  val.errors[i] <- mean((Boston$crim[test] - pred)^2)
}
val.errors
## [1] 39.98485 37.56243 36.86235 35.48677 34.90891 35.17538 35.27626 34.88713
## [9] 34.98015 35.09248 34.91346 34.91628
subF <- which.min(val.errors)
MSE.FW <- val.errors[sub]
coef(fwd.best, sub)
## (Intercept) zn indus chas nox rm
## 12.93904744 0.04218097 -0.03320827 -1.17449167 -12.95383082 0.78777999
## dis rad tax ptratio lstat medv
## -1.01122205 0.62299693 -0.00309680 -0.30540181 0.16727128 -0.19362701
#k-fold CV
k <- 10
folds <- sample(1:k, nrow(Boston), replace=TRUE)
cv.errors <- matrix(NA, k, 12, dimnames=list(NULL, paste(1:12)))
for(j in 1:k){
  best.fit <- regsubsets(crim~., data=Boston[folds!=j, ], nvmax=12, method="forward")
  for(i in 1:12){
    pred <- predict(best.fit, Boston[folds==j, ], id=i)
    cv.errors[j, i] <- mean((Boston$crim[folds==j] - pred)^2)
  }
}
mean.cv.errors <- apply(cv.errors, 2, mean)
mean.cv.errors
## 1 2 3 4 5 6 7 8
## 45.25339 43.20515 43.30331 43.24910 42.97657 42.92470 43.21499 42.93087
## 9 10 11 12
## 42.97345 42.74839 42.62709 42.68803
par(mfrow=c(1, 1))
plot(mean.cv.errors, type='b')
reg.fwd.best <- regsubsets(crim~., data=Boston, nvmax=12, method="forward")
coef(reg.best, 11)
## (Intercept) zn indus chas nox
## 13.801555544 0.045818028 -0.058345869 -0.828283847 -10.022404078
## rm dis rad tax ptratio
## 0.623650191 -1.008539667 0.612822152 -0.003782956 -0.304784434
## lstat medv
## 0.137698956 -0.220092318
regfit.bwd <- regsubsets(crim~., data=Boston[train, ], nvmax=12, method="backward")
summary(regfit.bwd)
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = Boston[train, ], nvmax = 12,
## method = "backward")
## 12 Variables (and intercept)
## Forced in Forced out
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## lstat FALSE FALSE
## medv FALSE FALSE
## 1 subsets of each size up to 12
## Selection Algorithm: backward
## zn indus chas nox rm age dis rad tax ptratio lstat medv
## 1 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " "*"
## 3 ( 1 ) " " " " " " " " " " " " "*" "*" " " " " " " "*"
## 4 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " " " "*"
## 5 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " " " " " "*"
## 6 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " " " "*" "*"
## 7 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " "*" "*" "*"
## 8 ( 1 ) "*" " " " " "*" "*" " " "*" "*" " " "*" "*" "*"
## 9 ( 1 ) "*" " " "*" "*" "*" " " "*" "*" " " "*" "*" "*"
## 10 ( 1 ) "*" " " "*" "*" "*" " " "*" "*" "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
#test set validation
bwd.best <- regsubsets(crim~., data=Boston[train, ], nvmax=12, method="backward")
test.mat <- model.matrix(crim~., data=Boston[test, ])
val.errors <- rep(NA, 12)
for(i in 1:12){
  coefi <- coef(regfit.best, id=i)
  pred <- test.mat[, names(coefi)] %*% coefi
  val.errors[i] <- mean((Boston$crim[test] - pred)^2)
}
val.errors
## [1] 39.98485 37.56243 36.86235 35.48677 34.90891 35.17538 35.27626 34.88713
## [9] 34.98015 35.09248 34.91346 34.91628
sub <- which.min(val.errors)
MSE.BW <- val.errors[sub]
coef(bwd.best, sub)
## (Intercept) zn nox rm dis rad
## 12.76695727 0.03979059 -14.69727032 0.86921700 -0.96066398 0.57357936
## ptratio lstat medv
## -0.33746075 0.16431457 -0.20028468
#k-fold CV
k <- 10
folds <- sample(1:k, nrow(Boston), replace=TRUE)
cv.errors <- matrix(NA, k, 12, dimnames=list(NULL, paste(1:12)))
for(j in 1:k){
  best.fit <- regsubsets(crim~., data=Boston[folds!=j, ], nvmax=12, method="backward")
  for(i in 1:12){
    pred <- predict(best.fit, Boston[folds==j, ], id=i)
    cv.errors[j, i] <- mean((Boston$crim[folds==j] - pred)^2)
  }
}
mean.cv.errors <- apply(cv.errors, 2, mean)
mean.cv.errors
## 1 2 3 4 5 6 7 8
## 47.55747 46.17442 45.91184 45.14469 44.89911 44.56049 44.41876 44.35529
## 9 10 11 12
## 44.25839 44.24622 44.16068 44.20958
par(mfrow=c(1, 1))
plot(mean.cv.errors, type='b')
reg.best <- regsubsets(crim~., data=Boston, nvmax=12, method = "backward")
coef(reg.best, 11)
## (Intercept) zn indus chas nox
## 13.801555544 0.045818028 -0.058345869 -0.828283847 -10.022404078
## rm dis rad tax ptratio
## 0.623650191 -1.008539667 0.612822152 -0.003782956 -0.304784434
## lstat medv
## 0.137698956 -0.220092318
Under 10-fold cross-validation, forward and backward selection both choose 11-variable models that use every predictor in the Boston data set except age.
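As a quick consistency check, the same comparison can be run against the full-data forward and backward fits (reg.fwd.best and the backward reg.best defined above); both are expected to drop age, matching the coefficient listings.
#predictor dropped by each 11-variable stepwise model
setdiff(colnames(Boston)[-1], names(coef(reg.fwd.best, 11))[-1])
setdiff(colnames(Boston)[-1], names(coef(reg.best, 11))[-1])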
x <- model.matrix(Boston$crim~., Boston[ , 2:13])
y <- Boston$crim
grid <- 10^seq(10, -2, length=100)
ridge.mod.11 <- glmnet(x, y, alpha=0, lambda=grid)
dim(coef(ridge.mod.11))
## [1] 14 100
y.test <- y[test]
ridge.mod.11 <- glmnet(x[train, ], y[train], alpha=0, lambda=grid, thresh=1e-12)
plot(ridge.mod.11)
cv.out <- cv.glmnet(x[train, ], y[train], alpha=0)
plot(cv.out)
bestlam.rr.11 <- cv.out$lambda.min
#view best model coefficients
bestmod.rr.11 <- glmnet(x, y, alpha=0, lambda=bestlam.rr.11)
coef(bestmod.rr.11)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 5.147230737
## (Intercept) .
## zn 0.033688987
## indus -0.077331392
## chas -0.828177689
## nox -4.838450264
## rm 0.510365944
## age 0.000231211
## dis -0.717408008
## rad 0.442334344
## tax 0.003767613
## ptratio -0.162536686
## lstat 0.155955370
## medv -0.157575194
#make predictions
ridge.pred.11 <- predict(bestmod.rr.11, s=bestlam.rr.11, newx=x[test, ])
MSE.rr.11 <- mean((ridge.pred.11-y.test)^2)
MSE.rr.11
## [1] 35.21923
#model setup
lasso.mod.11 <- glmnet(x[train, ], y[train], alpha=1, lambda=grid, thresh=1e-12)
plot(lasso.mod.11)
## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values
#find best lambda
cv.out.lasso.11 <- cv.glmnet(x[train, ], y[train], alpha=1)
plot(cv.out.lasso.11)
bestlam.lasso.11 <- cv.out.lasso.11$lambda.min
#update model and make predictions
bestmod.lasso.11 <- glmnet(x[train, ], y[train], alpha=1, lambda=bestlam.lasso.11, thresh=1e-12)
coef(bestmod.lasso.11)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 9.08825330
## (Intercept) .
## zn 0.03367118
## indus -0.03803368
## chas -1.01766884
## nox -9.42176044
## rm 0.53782637
## age .
## dis -0.80603283
## rad 0.55971577
## tax .
## ptratio -0.22959521
## lstat 0.15885295
## medv -0.15522474
lasso.pred.11 <- predict(bestmod.lasso.11, s=bestlam.lasso.11, newx=x[test, ])
MSE.lasso.11 <- mean((lasso.pred.11-y.test)^2)
#train model
pcr.fit.11 <- pcr(crim~., data=Boston[train, ], scale=TRUE, validation="CV")
#choose M by cross-validation
summary(pcr.fit.11)
## Data: X dimension: 404 12
## Y dimension: 404 1
## Fit method: svdpc
## Number of components considered: 12
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.754 7.376 7.375 6.994 6.996 6.955 6.943
## adjCV 8.754 7.373 7.372 6.986 6.993 6.950 6.938
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
## CV 6.782 6.788 6.767 6.789 6.736 6.673
## adjCV 6.773 6.780 6.759 6.781 6.726 6.663
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 49.32 64.05 73.15 80.60 87.39 90.75 93.2 95.19
## crim 29.73 29.77 37.53 37.63 38.51 38.75 41.8 41.99
## 9 comps 10 comps 11 comps 12 comps
## X 96.87 98.31 99.50 100.0
## crim 42.36 42.46 43.73 44.9
validationplot(pcr.fit.11, val.type="MSEP")
#test predictions
pcr.pred.11 <- predict(pcr.fit.11, Boston[test, ], ncomp=7)
MSE.pcr.11 <- mean((pcr.pred.11 - Boston$crim[test])^2)
#set model
pcr.mod.11 <- pcr(crim~., data=Boston, scale=TRUE, ncomp=7)
summary(pcr.mod.11)
## Data: X dimension: 506 12
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 7
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 49.93 63.64 72.94 80.21 86.83 90.26 92.79
## crim 29.39 29.55 37.39 37.85 38.85 39.23 41.73
For PCR on the training data, the cross-validation error drops quickly and then flattens out: although the minimum CV RMSEP occurs with all 12 components, there is little improvement beyond about seven, so we use M = 7, reducing the 12 predictors to 7 components.
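The trade-off is that seven components retain most of the variance in the predictors but a much smaller share of the variance in crim; a sketch using the pls helper explvar() on the fitted training model:
#cumulative % of predictor variance captured by the first 7 components
cumsum(explvar(pcr.fit.11))[7]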
#train model
pls.fit.11 <- plsr(crim~., data=Boston, subset=train, scale=TRUE, validation="CV")
#choose M by cross-validation
summary(pls.fit.11)
## Data: X dimension: 404 12
## Y dimension: 404 1
## Fit method: kernelpls
## Number of components considered: 12
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.754 7.179 6.802 6.698 6.687 6.679 6.671
## adjCV 8.754 7.177 6.796 6.690 6.681 6.671 6.661
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
## CV 6.657 6.658 6.654 6.655 6.655 6.655
## adjCV 6.648 6.649 6.645 6.646 6.646 6.646
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 48.92 58.01 62.88 74.45 80.99 83.63 86.98 92.10
## crim 33.14 41.11 43.59 44.11 44.42 44.76 44.84 44.88
## 9 comps 10 comps 11 comps 12 comps
## X 94.94 96.65 98.45 100.0
## crim 44.90 44.90 44.90 44.9
validationplot(pls.fit.11, val.type="MSEP")
#test predictions
pls.pred.11 <- predict(pls.fit.11, Boston[test, ], ncomp=11)
MSE.pls.11 <- mean((pls.pred.11 - Boston$crim[test])^2)
#set model
pls.mod.11 <- plsr(crim~., data=Boston, scale=TRUE, ncomp=11)
summary(pls.mod.11)
## Data: X dimension: 506 12
## Y dimension: 506 1
## Fit method: kernelpls
## Number of components considered: 11
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 49.49 58.85 64.82 74.54 80.88 83.57 87.80 92.13
## crim 33.00 41.20 43.32 44.04 44.40 44.78 44.87 44.91
## 9 comps 10 comps 11 comps
## X 94.56 96.22 98.11
## crim 44.93 44.93 44.93
Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
MSE_value <- round(c(MSE.lm.11, MSE.BS, MSE.FW, MSE.BW, MSE.rr.11, MSE.lasso.11, MSE.pcr.11, MSE.pls.11), 2)
Model_Name <- c("Linear model", "Best subset", "Forward selection", "Backward selection", "Ridge regression", "Lasso", "Principal component regression", "Partial Least Squares")
MSEranks <- data.frame(Model_Name, MSE_value)
MSEranks[order(MSE_value), ]
## Model_Name MSE_value
## 4 Backward selection 34.89
## 3 Forward selection 34.91
## 5 Ridge regression 35.22
## 6 Lasso 35.70
## 2 Best subset 35.77
## 1 Linear model 35.78
## 8 Partial Least Squares 35.78
## 7 Principal component regression 37.46
According to validation-set error, backward selection is the best-performing approach (test MSE ≈ 34.89) and principal component regression the worst (≈ 37.46), although the spread across all eight approaches is fairly small.
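Because these are squared errors, taking square roots re-expresses them in units of the per-capita crime rate, which makes the comparison easier to read (a sketch using the MSEranks table above):
#test RMSE for each approach, in units of per-capita crime rate
transform(MSEranks[order(MSE_value), ], Test.RMSE=round(sqrt(MSE_value), 2))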
Does your chosen model involve all of the features in the data set? Why or why not?
No. The backward-selection model chosen by validation-set error uses only eight of the twelve predictors (zn, nox, rm, dis, rad, ptratio, lstat, and medv), omitting indus, chas, age, and tax; in particular, age was dropped by every subset-selection method considered above.