library(tidyverse)
library(ISLR2)
library(leaps)
library(glmnet)
library(pls)
set.seed(1989)

Question 2

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

2(a)

The lasso, relative to least squares, is

i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

False: The lasso is less flexible than least squares.

ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

False: The lasso is less flexible than least squares.

iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

True: As lambda increases, the flexibility of the lasso decreases, so its bias will increase as its variance decreases. Therefore, the prediction accuracy is improved over least squares when the increase in bias is less than the decrease in variance.

iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

False: The lasso increases bias in order to decrease variance.

2(b)

Ridge regression, relative to least squares, is

i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

False: Ridge regression is less flexible than least squares.

ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

False: Ridge regression is less flexible than least squares.

iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

True: As with the lasso, as lambda increases, the flexibility of ridge regression decreases, so its bias will increase as its variance decreases. Therefore, the prediction accuracy is improved over least squares when the increase in bias is less than the decrease in variance.

iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

False: Ridge regression increases bias in order to decrease variance.
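
These two answers rest on the same shrinkage behaviour: as lambda grows, ridge and the lasso pull the coefficient estimates toward zero, lowering variance at the cost of added bias. A small optional illustration of the two coefficient paths (an aside, not required by the question; x.demo and y.demo are throwaway names):

#optional aside: coefficient paths shrink toward zero as lambda grows
x.demo <- model.matrix(Apps ~ ., ISLR2::College)[ , -1]
y.demo <- ISLR2::College$Apps
par(mfrow = c(1, 2))
plot(glmnet(x.demo, y.demo, alpha = 0), xvar = "lambda")   #ridge: coefficients shrink smoothly
plot(glmnet(x.demo, y.demo, alpha = 1), xvar = "lambda")   #lasso: coefficients hit exactly zero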

2(c)

Non-linear methods, relative to least squares, are

i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

False: Non-linear methods are more flexible, but added flexibility decreases bias and increases variance, not the reverse.

ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

True: Non-linear methods make fewer assumptions about the form of f, meaning their bias is low, at a cost of higher variance. Therefore, non-linear methods will offer improved prediction accuracy when their increase in variance is less than their decrease in bias.

iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

False: Non-linear methods are more flexible than least squares, not less.

iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

False: Non-linear methods are more flexible than least squares, not less.

Problem 9

In this exercise, we will predict the number of applications received using the other variables in the College data set.

data("College")

9(a)

Split the data into a training set and a test set.

train <- sample(777, 622) #train on ~80% of data
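
An equivalent way to write this split derives the sizes from the data rather than hard-coding 777 and 622 (purely illustrative; n.college and train.alt are hypothetical names, and the hard-coded version above is what the rest of this answer uses):

n.college <- nrow(College)                              #777 observations
train.alt <- sample(n.college, floor(0.8 * n.college))  #~80% training indices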

9(b)

Fit a linear model using least squares on the training set, and report the test error obtained.

lm.coll <- lm(Apps~., data=College[train, ])
summary(lm.coll)
## 
## Call:
## lm(formula = Apps ~ ., data = College[train, ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5383.3  -448.8   -28.6   333.9  7269.0 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -434.43914  461.09746  -0.942 0.346475    
## PrivateYes  -539.75532  155.03367  -3.482 0.000535 ***
## Accept         1.60828    0.04470  35.980  < 2e-16 ***
## Enroll        -0.54037    0.21605  -2.501 0.012645 *  
## Top10perc     48.13104    6.39385   7.528 1.89e-13 ***
## Top25perc    -14.59787    5.16736  -2.825 0.004884 ** 
## F.Undergrad   -0.02005    0.03788  -0.529 0.596776    
## P.Undergrad    0.05506    0.03428   1.606 0.108752    
## Outstate      -0.10173    0.02135  -4.766 2.36e-06 ***
## Room.Board     0.16612    0.05552   2.992 0.002884 ** 
## Books         -0.01232    0.26380  -0.047 0.962757    
## Personal       0.01156    0.07243   0.160 0.873261    
## PhD           -6.35582    5.12489  -1.240 0.215388    
## Terminal      -4.82663    5.61717  -0.859 0.390536    
## S.F.Ratio     16.55312   14.81796   1.117 0.264396    
## perc.alumni    1.26217    4.72216   0.267 0.789340    
## Expend         0.08874    0.01488   5.963 4.21e-09 ***
## Grad.Rate      9.16692    3.24776   2.823 0.004921 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1059 on 604 degrees of freedom
## Multiple R-squared:   0.93,  Adjusted R-squared:  0.9281 
## F-statistic: 472.3 on 17 and 604 DF,  p-value: < 2.2e-16
pred <- predict(lm.coll, College)
MSE.lm <- mean((College$Apps - pred)[-train]^2)

The test MSE of this linear model is about 1.008899 × 10^6 (an RMSE of roughly 1,004 applications).

9(c)

Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

College$Private <- as.numeric(College$Private)
x <- model.matrix(College$Apps~., College[ ,c(1,3:18)])
y <- College$Apps

grid <- 10^seq(10, -2, length=100)
ridge.mod <- glmnet(x, y, alpha=0, lambda=grid)
dim(coef(ridge.mod))
## [1]  19 100
train <- sample(1:nrow(x), nrow(x)*0.8) #80-20 train-test
test <- (-train)
y.test <- y[test]
ridge.mod <- glmnet(x[train, ], y[train], alpha=0, lambda=grid, thresh=1e-12)
plot(ridge.mod)

cv.out <- cv.glmnet(x[train, ], y[train], alpha=0)
plot(cv.out)

bestlam.rr <- cv.out$lambda.min

The best lambda for this ridge regression model is 365.2.
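
cv.glmnet also stores a more conservative choice of lambda, the largest value within one standard error of the minimum; a quick sketch of how to compare the two using the cv.out object fitted above:

cv.out$lambda.min   #lambda with the smallest cross-validated error (365.2 here)
cv.out$lambda.1se   #largest lambda within 1 SE of that minimum (more shrinkage)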

#view best model coefficients
bestmod.rr <- glmnet(x, y, alpha=0, lambda=bestlam.rr)
coef(bestmod.rr)
## 19 x 1 sparse Matrix of class "dgCMatrix"
##                        s0
## (Intercept) -9.423576e+02
## (Intercept)  .           
## Private     -5.278753e+02
## Accept       1.002931e+00
## Enroll       4.372110e-01
## Top10perc    2.575663e+01
## Top25perc    5.809636e-01
## F.Undergrad  7.232628e-02
## P.Undergrad  2.411640e-02
## Outstate    -2.397655e-02
## Room.Board   1.991324e-01
## Books        1.287015e-01
## Personal    -8.381134e-03
## PhD         -4.020244e+00
## Terminal    -4.815878e+00
## S.F.Ratio    1.297260e+01
## perc.alumni -8.552850e+00
## Expend       7.586912e-02
## Grad.Rate    1.126507e+01
#make predictions
ridge.pred <- predict(bestmod.rr, s=bestlam.rr, newx=x[test, ])
MSE.rr <- mean((ridge.pred-y.test)^2)

The test MSE of the ridge regression model is about 8.263948 × 10^5.

9(d)

Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

#model setup
lasso.mod <- glmnet(x[train, ], y[train], alpha=1, lambda=grid, thresh=1e-12)
plot(lasso.mod)
## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values

#find best lambda
cv.out.lasso <- cv.glmnet(x[train, ], y[train], alpha=1)
plot(cv.out.lasso)

bestlam.lasso <- cv.out.lasso$lambda.min

#update model and make predictions
bestmod.lasso <- glmnet(x[train, ], y[train], alpha=1, lambda=bestlam.lasso, thresh=1e-12)
coef(bestmod.lasso)
## 19 x 1 sparse Matrix of class "dgCMatrix"
##                        s0
## (Intercept) -432.00481119
## (Intercept)    .         
## Private     -444.55723237
## Accept         1.57003828
## Enroll        -0.56586206
## Top10perc     49.50710420
## Top25perc    -10.54447750
## F.Undergrad    .         
## P.Undergrad    0.05514873
## Outstate      -0.08258519
## Room.Board     0.18917233
## Books          0.02072870
## Personal       0.04247966
## PhD           -8.15668704
## Terminal      -2.73771145
## S.F.Ratio     25.35721355
## perc.alumni    .         
## Expend         0.07137458
## Grad.Rate      5.88642832
lasso.pred <- predict(bestmod.lasso, s=bestlam.lasso, newx=x[test, ])
MSE.lasso <- mean((lasso.pred-y.test)^2)

The test MSE of the lasso model is about 8.482023 × 10^5. The lasso set two coefficients to zero (F.Undergrad and perc.alumni), leaving a model with 15 non-zero predictors.
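
The number of surviving predictors can also be counted programmatically rather than by scanning the printout (a small sketch using the bestmod.lasso fit above):

lasso.coef <- coef(bestmod.lasso)
sum(lasso.coef != 0) - 1   #non-zero coefficients, excluding the intercept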

9(e)

Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

#train model
pcr.fit <- pcr(Apps~., data=College[train, ], scale=TRUE, validation="CV")

#choose M by cross-validation
summary(pcr.fit)
## Data:    X dimension: 621 17 
##  Y dimension: 621 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3899     3851     2092     2107     1860     1659     1657
## adjCV         3899     3851     2089     2108     1823     1649     1652
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1652     1605     1593      1599      1603      1603      1612
## adjCV     1648     1595     1589      1594      1599      1599      1608
##        14 comps  15 comps  16 comps  17 comps
## CV         1616      1550      1236      1196
## adjCV      1612      1526      1226      1188
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X      31.943    57.08    64.22    69.99    75.54    80.50    84.17    87.50
## Apps    3.233    71.85    71.85    80.00    83.04    83.05    83.22    84.28
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.52     92.90     94.99     96.76     97.84     98.75     99.38
## Apps    84.47     84.65     84.69     84.75     84.79     84.82     90.16
##       16 comps  17 comps
## X        99.86    100.00
## Apps     92.34     92.64
validationplot(pcr.fit, val.type="MSEP")

#test predictions
pcr.pred <- predict(pcr.fit, College[test, ], ncomp=17)
MSE.pcr <- mean((pcr.pred - College$Apps[test])^2)

#set model
pcr.mod <- pcr(Apps~., data=College, scale=TRUE, ncomp=17)
summary(pcr.mod)
## Data:    X dimension: 777 17 
##  Y dimension: 777 1
## Fit method: svdpc
## Number of components considered: 17
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X      31.670    57.30    64.30    69.90    75.39    80.38    83.99    87.40
## Apps    2.316    73.06    73.07    82.08    84.08    84.11    84.32    85.18
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.50     92.91     95.01     96.81      97.9     98.75     99.36
## Apps    85.88     86.06     86.06     86.10      86.1     86.13     90.32
##       16 comps  17 comps
## X        99.84    100.00
## Apps     92.52     92.92

The smallest CV RMSEP is 1196, at 17 components. Squaring this value gives a cross-validated MSE estimate of about 1,430,416. However, a 17-component model uses all of the principal components and therefore offers no dimension reduction over the ordinary least-squares model.
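
Rather than reading the RMSEP table by eye, the CV-minimizing number of components can be extracted directly from the fit (a sketch using the pls package's MSEP accessor on pcr.fit):

cv.msep <- MSEP(pcr.fit, estimate = "CV")$val[1, 1, ]   #CV MSEP for 0 through 17 components
which.min(cv.msep) - 1                                  #subtract 1: the first entry is the 0-component model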

The test MSE of the PCR model is about 8.884463 × 10^5.

9(f)

Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

#train model
pls.fit <- plsr(Apps~., data=College, subset=train, scale=TRUE, validation="CV")

#choose M by cross-validation
summary(pls.fit)
## Data:    X dimension: 621 17 
##  Y dimension: 621 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3899     1931     1687     1537     1480     1296     1254
## adjCV         3899     1928     1684     1532     1458     1272     1241
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1237     1226     1223      1221      1216      1215      1213
## adjCV     1226     1216     1214      1211      1206      1206      1204
##        14 comps  15 comps  16 comps  17 comps
## CV         1214      1214      1214      1214
## adjCV      1204      1205      1205      1205
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       25.46    43.67    62.52    64.67    66.96    71.61    74.57    78.82
## Apps    76.89    83.25    86.69    90.52    92.31    92.44    92.50    92.54
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       82.49     85.27     87.81     90.74     92.93     94.98     96.99
## Apps    92.58     92.60     92.62     92.63     92.63     92.64     92.64
##       16 comps  17 comps
## X        98.85    100.00
## Apps     92.64     92.64
validationplot(pls.fit, val.type="MSEP")

#test predictions
pls.pred <- predict(pls.fit, College[test, ], ncomp=13)
MSE.pls <- mean((pls.pred - College$Apps[test])^2)

#set model
pls.mod <- plsr(Apps~., data=College, scale=TRUE, ncomp=13)
summary(pls.mod)
## Data:    X dimension: 777 17 
##  Y dimension: 777 1
## Fit method: kernelpls
## Number of components considered: 13
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       25.76    40.33    62.59    64.97    66.87    71.33    75.39    79.37
## Apps    78.01    85.14    87.67    90.73    92.63    92.72    92.77    92.82
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       82.36     85.04     87.92     90.65     92.69
## Apps    92.87     92.89     92.90     92.91     92.92

The smallest CV RMSEP is 1213, at 13 components. Squaring this value gives a cross-validated MSE estimate of about 1,471,369.
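
The pls package also provides selectNcomp(), which applies a one-standard-error (or randomization) rule to suggest a more parsimonious M; a sketch on the pls.fit object above (the suggested value depends on the random CV folds, so it may differ from 13):

selectNcomp(pls.fit, method = "onesigma", plot = FALSE)   #suggested number of components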

The test MSE of the PLS model is about 8.872254 × 10^5.

9(g)

Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

Comparing test MSEs:

Linear regression: 1.008899 × 10^6

Ridge regression: 8.263948 × 10^5

Lasso: 8.482023 × 10^5

Principal component regression: 8.884463 × 10^5

Partial least squares: 8.872254 × 10^5

Each of the four new models explored in this exercise outperforms the linear regression model. Ridge regression performed best in this case, but the four regularization and dimension-reduction methods produce broadly similar test MSEs; the clearest gap is between them and ordinary least squares.
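
To answer "how accurately can we predict the number of applications", the best test MSE can be converted into a test-set R² and a typical error in units of applications (a sketch using the ridge results from 9(c)):

1 - MSE.rr / mean((y.test - mean(y.test))^2)   #proportion of test-set variance explained
sqrt(MSE.rr)                                   #typical prediction error, in applications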

Problem 11

We will now try to predict per capita crime rate in the Boston data set.

data("Boston")
train <- sample(506, floor(506*0.8)) #train on ~80% of data
test <- -train

#helper: predict() method for regsubsets objects (not supplied by leaps)
predict.regsubsets <- function(object, newdata, id, ...){
  form <- as.formula(object$call[[2]])   #formula used in the original fit
  mat <- model.matrix(form, newdata)     #design matrix for the new data
  coefi <- coef(object, id=id)           #coefficients of the id-variable model
  xvars <- names(coefi)
  mat[ , xvars] %*% coefi                #predictions from the selected model
}
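
A usage sketch for this helper (demo.fit and demo.pred are hypothetical names; the real fits appear in 11(a) below): it lets predict() be called on a regsubsets object, with id selecting the model size.

demo.fit <- regsubsets(crim ~ ., data = Boston[train, ], nvmax = 12)
demo.pred <- predict(demo.fit, Boston[test, ], id = 5)   #predictions from the 5-variable model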

11(a)

Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

lm.Boston <- lm(crim~., data=Boston[train, ])
summary(lm.Boston)
## 
## Call:
## lm(formula = crim ~ ., data = Boston[train, ])
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.159 -2.242 -0.420  1.023 73.671 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  12.838932   8.064107   1.592  0.11217    
## zn            0.041925   0.022938   1.828  0.06834 .  
## indus        -0.033145   0.099178  -0.334  0.73841    
## chas         -1.163216   1.364030  -0.853  0.39430    
## nox         -12.742674   6.186635  -2.060  0.04009 *  
## rm            0.808465   0.729457   1.108  0.26841    
## age          -0.002730   0.021103  -0.129  0.89715    
## dis          -1.025012   0.331655  -3.091  0.00214 ** 
## rad           0.621809   0.100103   6.212 1.34e-09 ***
## tax          -0.003086   0.006063  -0.509  0.61100    
## ptratio      -0.302222   0.216031  -1.399  0.16261    
## lstat         0.170649   0.086393   1.975  0.04894 *  
## medv         -0.193683   0.069700  -2.779  0.00572 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.589 on 391 degrees of freedom
## Multiple R-squared:  0.449,  Adjusted R-squared:  0.4321 
## F-statistic: 26.55 on 12 and 391 DF,  p-value: < 2.2e-16
pred <- predict(lm.Boston, Boston)
MSE.lm.11 <- mean((Boston$crim - pred)[-train]^2)
regfit.full <- regsubsets(crim~., Boston[train, ], nvmax=12)
regsum <- summary(regfit.full)

#plot subset performance
par(mfrow=c(2,2))

plot(regsum$rss, xlab="Number of Variables", ylab="RSS")

plot(regsum$adjr2, xlab="Number of Variables", ylab="Adjusted Rsq")
which.max(regsum$adjr2)
## [1] 8
points(8, regsum$adjr2[8], col='red', cex=2, pch=20)

plot(regsum$cp, xlab="Number of Variables", ylab="Cp")
which.min(regsum$cp)
## [1] 7
points(7, regsum$cp[7], col='red', cex=2, pch=20)

plot(regsum$bic, xlab="Number of Variables", ylab="BIC")
which.min(regsum$bic)
## [1] 2
points(2, regsum$bic[2], col='red', cex=2, pch=20)

par(mfrow=c(2,2))
plot(regfit.full, scale = "r2")
plot(regfit.full, scale = "adjr2")
plot(regfit.full, scale = "Cp")
plot(regfit.full, scale = "bic")

#choose best model using validation set approach
regfit.best <- regsubsets(crim~., data=Boston[train, ], nvmax=12)
test.mat <- model.matrix(crim~., data=Boston[test, ])
val.errors <- rep(NA, 12)
for(i in 1:12){
   coefi = coef(regfit.best, id=i)
   pred = test.mat[ , names(coefi)]%*%coefi
   val.errors[i] = mean((Boston$crim[test] - pred)^2)
}
val.errors
##  [1] 40.10763 37.72795 37.95855 35.79385 35.98987 36.24553 35.88195 35.92582
##  [9] 36.15041 35.88625 35.77180 35.78232
sub <- which.min(val.errors)
MSE.BS <- val.errors[sub]
coef(regfit.best, sub)
##  (Intercept)           zn        indus         chas          nox           rm 
##  12.93904744   0.04218097  -0.03320827  -1.17449167 -12.95383082   0.78777999 
##          dis          rad          tax      ptratio        lstat         medv 
##  -1.01122205   0.62299693  -0.00309680  -0.30540181   0.16727128  -0.19362701
#use full data set for most accurate coefficient estimates and final variable selection
regfit.best <- regsubsets(crim~., Boston, nvmax=12)
coef(regfit.best, sub)
##   (Intercept)            zn         indus          chas           nox 
##  13.801555544   0.045818028  -0.058345869  -0.828283847 -10.022404078 
##            rm           dis           rad           tax       ptratio 
##   0.623650191  -1.008539667   0.612822152  -0.003782956  -0.304784434 
##         lstat          medv 
##   0.137698956  -0.220092318
#choose among models of different sizes using cross-validation
k <- 10
folds <- sample(1:k, nrow(Boston), replace=TRUE)
cv.errors <- matrix(NA, k, 12, dimnames=list(NULL, paste(1:12)))
for(j in 1:k){
  best.fit <- regsubsets(crim~.,data=Boston[folds!=j, ], nvmax=12)
  for(i in 1:12){
    pred <- predict(best.fit, Boston[folds==j, ], id=i)
    cv.errors[j,i] <- mean((Boston$crim[folds==j] - pred)^2)
    }
  }
mean.cv.errors <- apply(cv.errors, 2, mean)
mean.cv.errors
##        1        2        3        4        5        6        7        8 
## 44.41761 43.00535 43.08049 42.93178 42.46893 42.14550 41.96118 41.88762 
##        9       10       11       12 
## 41.66061 41.69240 41.72731 41.75622
par(mfrow=c(1, 1))
plot(mean.cv.errors, type='b')

reg.best <- regsubsets(crim~., data=Boston, nvmax=12)
coef(reg.best, 9)
##  (Intercept)           zn        indus          nox           rm          dis 
##  13.18247199   0.04305936  -0.08820604 -10.46823468   0.63510592  -1.00637216 
##          rad      ptratio        lstat         medv 
##   0.56096588  -0.30423849   0.14040855  -0.22012525

Test set validation has chosen the 11-variable model, which uses all variables except age to predict the crim response. However, k-fold cross-validation selected the 9-variable model using zn, indus, nox, rm, dis, rad, ptratio, lstat, and medv.

regfit.fwd <- regsubsets(crim~., data=Boston[train, ], nvmax=12, method="forward")
summary(regfit.fwd)
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = Boston[train, ], nvmax = 12, 
##     method = "forward")
## 12 Variables  (and intercept)
##         Forced in Forced out
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## lstat       FALSE      FALSE
## medv        FALSE      FALSE
## 1 subsets of each size up to 12
## Selection Algorithm: forward
##           zn  indus chas nox rm  age dis rad tax ptratio lstat medv
## 1  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   " " 
## 2  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     "*"   " " 
## 3  ( 1 )  " " " "   "*"  " " " " " " " " "*" " " " "     "*"   " " 
## 4  ( 1 )  "*" " "   "*"  " " " " " " " " "*" " " " "     "*"   " " 
## 5  ( 1 )  "*" " "   "*"  " " " " " " "*" "*" " " " "     "*"   " " 
## 6  ( 1 )  "*" " "   "*"  "*" " " " " "*" "*" " " " "     "*"   " " 
## 7  ( 1 )  "*" " "   "*"  "*" " " " " "*" "*" " " " "     "*"   "*" 
## 8  ( 1 )  "*" " "   "*"  "*" " " " " "*" "*" " " "*"     "*"   "*" 
## 9  ( 1 )  "*" " "   "*"  "*" "*" " " "*" "*" " " "*"     "*"   "*" 
## 10  ( 1 ) "*" " "   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*" 
## 11  ( 1 ) "*" "*"   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*" 
## 12  ( 1 ) "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"
#test set validation
fwd.best <- regsubsets(crim~., data=Boston[train, ], nvmax=12, method="forward")
test.mat <- model.matrix(crim~., data=Boston[test, ])
val.errors <- rep(NA, 12)
for(i in 1:12){
   coefi = coef(fwd.best, id=i)
   pred = test.mat[ , names(coefi)]%*%coefi
   val.errors[i] = mean((Boston$crim[test] - pred)^2)
}
val.errors
##  [1] 39.98485 37.56243 36.86235 35.48677 34.90891 35.17538 35.27626 34.88713
##  [9] 34.98015 35.09248 34.91346 34.91628
subF <- which.min(val.errors)
MSE.FW <- val.errors[subF]
coef(fwd.best, subF)
##  (Intercept)           zn        indus         chas          nox           rm 
##  12.93904744   0.04218097  -0.03320827  -1.17449167 -12.95383082   0.78777999 
##          dis          rad          tax      ptratio        lstat         medv 
##  -1.01122205   0.62299693  -0.00309680  -0.30540181   0.16727128  -0.19362701
#k-fold CV
k <- 10
folds <- sample(1:k, nrow(Boston), replace=TRUE)
cv.errors <- matrix(NA, k, 12, dimnames=list(NULL, paste(1:12)))
for(j in 1:k){
  best.fit <- regsubsets(crim~.,data=Boston[folds!=j, ], nvmax=12, method="forward")
  for(i in 1:12){
    pred <- predict(best.fit, Boston[folds==j, ], id=i)
    cv.errors[j,i] <- mean((Boston$crim[folds==j] - pred)^2)
    }
  }
mean.cv.errors <- apply(cv.errors, 2, mean)
mean.cv.errors
##        1        2        3        4        5        6        7        8 
## 45.25339 43.20515 43.30331 43.24910 42.97657 42.92470 43.21499 42.93087 
##        9       10       11       12 
## 42.97345 42.74839 42.62709 42.68803
par(mfrow=c(1, 1))
plot(mean.cv.errors, type='b')

reg.fwd.best <- regsubsets(crim~., data=Boston, nvmax=12, method="forward")
coef(reg.fwd.best, 11)
##   (Intercept)            zn         indus          chas           nox 
##  13.801555544   0.045818028  -0.058345869  -0.828283847 -10.022404078 
##            rm           dis           rad           tax       ptratio 
##   0.623650191  -1.008539667   0.612822152  -0.003782956  -0.304784434 
##         lstat          medv 
##   0.137698956  -0.220092318
regfit.bwd <- regsubsets(crim~., data=Boston[train, ], nvmax=12, method="backward")
summary(regfit.bwd)
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = Boston[train, ], nvmax = 12, 
##     method = "backward")
## 12 Variables  (and intercept)
##         Forced in Forced out
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## lstat       FALSE      FALSE
## medv        FALSE      FALSE
## 1 subsets of each size up to 12
## Selection Algorithm: backward
##           zn  indus chas nox rm  age dis rad tax ptratio lstat medv
## 1  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   " " 
## 2  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   "*" 
## 3  ( 1 )  " " " "   " "  " " " " " " "*" "*" " " " "     " "   "*" 
## 4  ( 1 )  "*" " "   " "  " " " " " " "*" "*" " " " "     " "   "*" 
## 5  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " " "     " "   "*" 
## 6  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " " "     "*"   "*" 
## 7  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " "*"     "*"   "*" 
## 8  ( 1 )  "*" " "   " "  "*" "*" " " "*" "*" " " "*"     "*"   "*" 
## 9  ( 1 )  "*" " "   "*"  "*" "*" " " "*" "*" " " "*"     "*"   "*" 
## 10  ( 1 ) "*" " "   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*" 
## 11  ( 1 ) "*" "*"   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*" 
## 12  ( 1 ) "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"
#test set validation
bwd.best <- regsubsets(crim~., data=Boston[train, ], nvmax=12, method="backward")
test.mat <- model.matrix(crim~., data=Boston[test, ])
val.errors <- rep(NA, 12)
for(i in 1:12){
   coefi = coef(bwd.best, id=i)
   pred = test.mat[ , names(coefi)]%*%coefi
   val.errors[i] = mean((Boston$crim[test] - pred)^2)
}
val.errors
##  [1] 39.98485 37.56243 36.86235 35.48677 34.90891 35.17538 35.27626 34.88713
##  [9] 34.98015 35.09248 34.91346 34.91628
sub <- which.min(val.errors)
MSE.BW <- val.errors[sub]
coef(bwd.best, sub)
##  (Intercept)           zn          nox           rm          dis          rad 
##  12.76695727   0.03979059 -14.69727032   0.86921700  -0.96066398   0.57357936 
##      ptratio        lstat         medv 
##  -0.33746075   0.16431457  -0.20028468
#k-fold CV
k <- 10
folds <- sample(1:k, nrow(Boston), replace=TRUE)
cv.errors <- matrix(NA, k, 12, dimnames=list(NULL, paste(1:12)))
for(j in 1:k){
  best.fit <- regsubsets(crim~.,data=Boston[folds!=j, ], nvmax=12, method="backward")
  for(i in 1:12){
    pred <- predict(best.fit, Boston[folds==j, ], id=i)
    cv.errors[j,i] <- mean((Boston$crim[folds==j] - pred)^2)
    }
  }
mean.cv.errors <- apply(cv.errors, 2, mean)
mean.cv.errors
##        1        2        3        4        5        6        7        8 
## 47.55747 46.17442 45.91184 45.14469 44.89911 44.56049 44.41876 44.35529 
##        9       10       11       12 
## 44.25839 44.24622 44.16068 44.20958
par(mfrow=c(1, 1))
plot(mean.cv.errors, type='b')

reg.best <- regsubsets(crim~., data=Boston, nvmax=12, method = "backward")
coef(reg.best, 11)
##   (Intercept)            zn         indus          chas           nox 
##  13.801555544   0.045818028  -0.058345869  -0.828283847 -10.022404078 
##            rm           dis           rad           tax       ptratio 
##   0.623650191  -1.008539667   0.612822152  -0.003782956  -0.304784434 
##         lstat          medv 
##   0.137698956  -0.220092318

By k-fold cross-validation, forward and backward selection have both selected 11-variable models that use every variable in the Boston data set except age.
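
A quick optional check that the two 11-variable models contain the same predictors, using the full-data fits from the chunks above:

setdiff(names(coef(reg.fwd.best, 11)), names(coef(reg.best, 11)))   #character(0) if the variable sets match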

x <- model.matrix(Boston$crim~., Boston[ , 2:13])
y <- Boston$crim

grid <- 10^seq(10, -2, length=100)
ridge.mod.11 <- glmnet(x, y, alpha=0, lambda=grid)
dim(coef(ridge.mod.11))
## [1]  14 100
y.test <- y[test]
ridge.mod.11 <- glmnet(x[train, ], y[train], alpha=0, lambda=grid, thresh=1e-12)
plot(ridge.mod.11)

cv.out <- cv.glmnet(x[train, ], y[train], alpha=0)
plot(cv.out)

bestlam.rr.11 <- cv.out$lambda.min

#view best model coefficients
bestmod.rr.11 <- glmnet(x, y, alpha=0, lambda=bestlam.rr.11)
coef(bestmod.rr.11)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                       s0
## (Intercept)  5.147230737
## (Intercept)  .          
## zn           0.033688987
## indus       -0.077331392
## chas        -0.828177689
## nox         -4.838450264
## rm           0.510365944
## age          0.000231211
## dis         -0.717408008
## rad          0.442334344
## tax          0.003767613
## ptratio     -0.162536686
## lstat        0.155955370
## medv        -0.157575194
#make predictions
ridge.pred.11 <- predict(bestmod.rr.11, s=bestlam.rr.11, newx=x[test, ])
MSE.rr.11 <- mean((ridge.pred.11-y.test)^2)
MSE.rr.11
## [1] 35.21923
#model setup
lasso.mod.11 <- glmnet(x[train, ], y[train], alpha=1, lambda=grid, thresh=1e-12)
plot(lasso.mod.11)
## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values

#find best lambda
cv.out.lasso.11 <- cv.glmnet(x[train, ], y[train], alpha=1)
plot(cv.out.lasso.11)

bestlam.lasso.11 <- cv.out.lasso.11$lambda.min

#update model and make predictions
bestmod.lasso.11 <- glmnet(x[train, ], y[train], alpha=1, lambda=bestlam.lasso.11, thresh=1e-12)
coef(bestmod.lasso.11)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                      s0
## (Intercept)  9.08825330
## (Intercept)  .         
## zn           0.03367118
## indus       -0.03803368
## chas        -1.01766884
## nox         -9.42176044
## rm           0.53782637
## age          .         
## dis         -0.80603283
## rad          0.55971577
## tax          .         
## ptratio     -0.22959521
## lstat        0.15885295
## medv        -0.15522474
lasso.pred.11 <- predict(bestmod.lasso.11, s=bestlam.lasso.11, newx=x[test, ])
MSE.lasso.11 <- mean((lasso.pred.11-y.test)^2)
#train model
pcr.fit.11 <- pcr(crim~., data=Boston[train, ], scale=TRUE, validation="CV")

#choose M by cross-validation
summary(pcr.fit.11)
## Data:    X dimension: 404 12 
##  Y dimension: 404 1
## Fit method: svdpc
## Number of components considered: 12
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           8.754    7.376    7.375    6.994    6.996    6.955    6.943
## adjCV        8.754    7.373    7.372    6.986    6.993    6.950    6.938
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps
## CV       6.782    6.788    6.767     6.789     6.736     6.673
## adjCV    6.773    6.780    6.759     6.781     6.726     6.663
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       49.32    64.05    73.15    80.60    87.39    90.75     93.2    95.19
## crim    29.73    29.77    37.53    37.63    38.51    38.75     41.8    41.99
##       9 comps  10 comps  11 comps  12 comps
## X       96.87     98.31     99.50     100.0
## crim    42.36     42.46     43.73      44.9
validationplot(pcr.fit.11, val.type="MSEP")

#test predictions
pcr.pred.11 <- predict(pcr.fit.11, Boston[test, ], ncomp=7)
MSE.pcr.11 <- mean((pcr.pred.11 - Boston$crim[test])^2)

#set model
pcr.mod.11 <- pcr(crim~., data=Boston, scale=TRUE, ncomp=7)
summary(pcr.mod.11)
## Data:    X dimension: 506 12 
##  Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 7
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X       49.93    63.64    72.94    80.21    86.83    90.26    92.79
## crim    29.39    29.55    37.39    37.85    38.85    39.23    41.73

The cross-validation error levels off after about 7 components, so the PCR model here reduces the 12 predictors to 7 components; the strict CV minimum is at 12 components, which would amount to keeping all of the original predictors.
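
As an optional check, the strict CV minimum can be extracted the same way as in Problem 9; for this fit it lands at 12 components, which is why the more parsimonious 7-component model is taken from the elbow of the validation plot instead:

which.min(MSEP(pcr.fit.11, estimate = "CV")$val[1, 1, ]) - 1   #number of components at the CV minimum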

#train model
pls.fit.11 <- plsr(crim~., data=Boston, subset=train, scale=TRUE, validation="CV")

#choose M by cross-validation
summary(pls.fit.11)
## Data:    X dimension: 404 12 
##  Y dimension: 404 1
## Fit method: kernelpls
## Number of components considered: 12
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           8.754    7.179    6.802    6.698    6.687    6.679    6.671
## adjCV        8.754    7.177    6.796    6.690    6.681    6.671    6.661
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps
## CV       6.657    6.658    6.654     6.655     6.655     6.655
## adjCV    6.648    6.649    6.645     6.646     6.646     6.646
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       48.92    58.01    62.88    74.45    80.99    83.63    86.98    92.10
## crim    33.14    41.11    43.59    44.11    44.42    44.76    44.84    44.88
##       9 comps  10 comps  11 comps  12 comps
## X       94.94     96.65     98.45     100.0
## crim    44.90     44.90     44.90      44.9
validationplot(pls.fit.11, val.type="MSEP")

#test predictions
pls.pred.11 <- predict(pls.fit.11, Boston[test, ], ncomp=11)
MSE.pls.11 <- mean((pls.pred.11 - Boston$crim[test])^2)

#set model
pls.mod.11 <- plsr(crim~., data=Boston, scale=TRUE, ncomp=11)
summary(pls.mod.11)
## Data:    X dimension: 506 12 
##  Y dimension: 506 1
## Fit method: kernelpls
## Number of components considered: 11
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       49.49    58.85    64.82    74.54    80.88    83.57    87.80    92.13
## crim    33.00    41.20    43.32    44.04    44.40    44.78    44.87    44.91
##       9 comps  10 comps  11 comps
## X       94.56     96.22     98.11
## crim    44.93     44.93     44.93

11(b)

Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.

MSE_value <- round(c(MSE.lm.11, MSE.BS, MSE.FW, MSE.BW, MSE.rr.11, MSE.lasso.11, MSE.pcr.11, MSE.pls.11), 2)
Model_Name <- c("Linear model", "Best subset", "Forward selection", "Backward selection", "Ridge regression", "Lasso", "Principal component regression", "Partial Least Squares")
MSEranks <- data.frame(Model_Name, MSE_value)
MSEranks[order(MSE_value), ]
##                       Model_Name MSE_value
## 4             Backward selection     34.89
## 3              Forward selection     34.91
## 5               Ridge regression     35.22
## 6                          Lasso     35.70
## 2                    Best subset     35.77
## 1                   Linear model     35.78
## 8          Partial Least Squares     35.78
## 7 Principal component regression     37.46

According to validation set error, backward selection is the best-performing model and principal component regression is the worst, although the differences among the top few methods are small (roughly 34.9 to 35.8).
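
For scale (an optional check), the best test MSE of about 34.9 can be compared with the variance of crim in the test set to get a rough test-set R²:

1 - MSE.BW / mean((Boston$crim[test] - mean(Boston$crim[test]))^2)   #proportion of test-set variance explained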

11(c)

Does your chosen model involve all of the features in the data set? Why or why not?

No. Backward selection omitted age from the model; dropping age gave a slightly lower cross-validation error than the full 12-variable model.
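
A minimal sketch of the proposed final specification, refit on the full Boston data with age dropped (final.mod is a hypothetical name; this just restates the chosen model, it is not a new result):

final.mod <- lm(crim ~ . - age, data = Boston)   #11-predictor model selected above
coef(final.mod)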