library(ISLR2)
library(MASS)
library(class)

2

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

(a) The lasso, relative to least squares, is:

i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

The solution is iii. The lasso is less flexible than least squares. In terms of the bias-variance trade-off, the lasso can substantially reduce variance at the cost of a small increase in bias, whereas least squares may overfit.
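
The trade-off is easiest to see in the expected test-error decomposition from Chapter 2:

\[
E\left(y_0 - \hat{f}(x_0)\right)^2 = \mathrm{Var}\left(\hat{f}(x_0)\right) + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2 + \mathrm{Var}(\epsilon)
\]

Shrinking coefficients lowers the variance term; the lasso wins whenever that reduction exceeds the growth in squared bias.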

(b) Repeat (a) for ridge regression relative to least squares.
The solution is iii. The reasoning is the same as for the lasso; the primary difference is that ridge regression does not shrink coefficients all the way to zero, but the same trade-off applies.

(c) Repeat (a) for non-linear methods relative to least squares.
The solution is ii. Non-linear methods are more flexible than least squares. When the true relationship is non-linear, they achieve a large decrease in bias at the cost of a smaller increase in variance, improving prediction accuracy.

9

attach(College)

In this exercise, we will predict the number of applications received using the other variables in the College data set.

Before we begin, let's check for missing values.

dim(College)
## [1] 777  18
sum(is.na(Apps))
## [1] 0
sum(is.na(College))
## [1] 0
library(leaps)

(a) Split the data set into a training set and a test set.

Let's split it 50/50.

set.seed(1)
train=sample(1:dim(College)[1], round(dim(College)[1] / 2))
test=-train
College.train = College[train, ]
College.test = College[-train, ]
nrow(College.train) / nrow(College)
## [1] 0.4993565
nrow(College.test) / nrow(College)
## [1] 0.5006435

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

lm.fit = lm(Apps ~ ., data = College.train)
summary(lm.fit)
## 
## Call:
## lm(formula = Apps ~ ., data = College.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5741.2  -479.5    15.3   359.6  7258.0 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.902e+02  6.381e+02  -1.238 0.216410    
## PrivateYes  -3.070e+02  2.006e+02  -1.531 0.126736    
## Accept       1.779e+00  5.420e-02  32.830  < 2e-16 ***
## Enroll      -1.470e+00  3.115e-01  -4.720 3.35e-06 ***
## Top10perc    6.673e+01  8.310e+00   8.030 1.31e-14 ***
## Top25perc   -2.231e+01  6.533e+00  -3.415 0.000708 ***
## F.Undergrad  9.269e-02  5.529e-02   1.676 0.094538 .  
## P.Undergrad  9.397e-03  5.493e-02   0.171 0.864275    
## Outstate    -1.084e-01  2.700e-02  -4.014 7.22e-05 ***
## Room.Board   2.115e-01  7.224e-02   2.928 0.003622 ** 
## Books        2.912e-01  3.985e-01   0.731 0.465399    
## Personal     6.133e-03  8.803e-02   0.070 0.944497    
## PhD         -1.548e+01  6.681e+00  -2.316 0.021082 *  
## Terminal     6.415e+00  7.290e+00   0.880 0.379470    
## S.F.Ratio    2.283e+01  2.047e+01   1.115 0.265526    
## perc.alumni  1.134e+00  6.083e+00   0.186 0.852274    
## Expend       4.857e-02  1.619e-02   2.999 0.002890 ** 
## Grad.Rate    7.490e+00  4.397e+00   1.703 0.089324 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1083 on 370 degrees of freedom
## Multiple R-squared:  0.9389, Adjusted R-squared:  0.9361 
## F-statistic: 334.3 on 17 and 370 DF,  p-value: < 2.2e-16

Now compute the test MSE. Note that the predictions must be compared against the test-set response, College.test$Apps; comparing against the attached full-length Apps vector would silently recycle and inflate the error.

ols_pred = predict(lm.fit, College.test)
ols_mse = mean((ols_pred - College.test$Apps)^2)
ols_mse
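
For reference, the test error reported throughout is the mean squared prediction error on the held-out observations:

\[
\mathrm{MSE}_{\mathrm{test}} = \frac{1}{n_{\mathrm{test}}}\sum_{i \in \mathrm{test}}\left(y_i - \hat{y}_i\right)^2
\]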

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

library(glmnet)
x=model.matrix(Apps~.,College) # the intercept column is kept, so glmnet's output shows a duplicated, zeroed (Intercept) row
y=Apps
y.test=y[test]
grid=10^seq(10,-2,length=100)
ridge.mod=glmnet(x[train,],y[train],lambda = grid, alpha = 0,thresh=1e-12)
cv.out=cv.glmnet(x[train,],y[train],alpha=0, lambda=grid, thresh=1e-12)
bestlam=cv.out$lambda.min
bestlam
## [1] 0.01
ridge.pred=predict(ridge.mod, s=bestlam, newx=x[test,])
mean((ridge.pred-y.test)^2)
## [1] 1135714

Admittedly, the code above got a little messy. After reviewing other analysts’ solutions, I reran it in a cleaner form and obtained the same result.

train.mat=model.matrix(Apps~.,data=College.train)
test.mat=model.matrix(Apps~.,data=College.test)
grid=10^seq(4,-2,length=100)
ridge.mod = glmnet(train.mat,College.train$Apps,alpha=0,lambda=grid,thresh = 1e-12)
cv.out = cv.glmnet(train.mat,College.train$Apps,alpha=0,lambda=grid,thresh=1e-12)
bestlam = cv.out$lambda.min
bestlam
## [1] 0.01
pred.newridge = predict(ridge.mod,s=bestlam,newx=test.mat)
mean((College.test$Apps-pred.newridge)^2)
## [1] 1135714
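
For context, ridge regression chooses coefficients to minimize

\[
\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 = \mathrm{RSS} + \lambda\sum_{j=1}^{p}\beta_j^2.
\]

Since the selected λ = 0.01 is the smallest value in the grid, the penalty is nearly inactive here, which is why the test error is so close to the least squares result.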

(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

lasso.mod=glmnet(x[train,], y[train], alpha=1, lambda=grid)
plot(lasso.mod)

set.seed(1)
cv.out=cv.glmnet(x[train,],y[train],alpha=1)
plot(cv.out)

bestlam=cv.out$lambda.min
lasso.pred=predict(lasso.mod, s=bestlam, newx=x[test,])
mean((lasso.pred-y.test)^2)
## [1] 1116174

The lasso performed slightly better than ridge regression.
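
The lasso replaces the ridge penalty with an ℓ1 penalty,

\[
\mathrm{RSS} + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert,
\]

which is what allows it to set coefficients exactly to zero and perform variable selection.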

out=glmnet(x,y,alpha = 1, lambda = grid)
lasso.coef=predict(out, type='coefficient', s=bestlam)[1:19,]
round(lasso.coef, 3)
## (Intercept) (Intercept)  PrivateYes      Accept      Enroll   Top10perc 
##    -469.506       0.000    -491.366       1.571      -0.768      48.256 
##   Top25perc F.Undergrad P.Undergrad    Outstate  Room.Board       Books 
##     -12.943       0.043       0.044      -0.083       0.150       0.015 
##    Personal         PhD    Terminal   S.F.Ratio perc.alumni      Expend 
##       0.029      -8.428      -3.265      14.640      -0.029       0.077 
##   Grad.Rate 
##       8.315

At this λ, none of the 17 predictor coefficients is set to zero, so the lasso keeps all the variables. (The 0.000 on the duplicated (Intercept) row is an artifact of leaving the intercept column in the model matrix.)
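
To report the number of non-zero coefficient estimates directly, we can count them; a quick check using the lasso.coef vector above (dropping the two intercept rows):

sum(lasso.coef[-c(1, 2)] != 0) # number of non-zero predictor coefficients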

(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

library(pls)
set.seed(2)
pcr.fit=pcr(Apps~.,data=College,scale=TRUE,validation="CV")
summary(pcr.fit)
## Data:    X dimension: 777 17 
##  Y dimension: 777 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3873     3831     2030     2037     1720     1595     1594
## adjCV         3873     3831     2028     2037     1688     1586     1590
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1583     1548     1507      1508      1516      1516      1520
## adjCV     1584     1543     1503      1505      1512      1512      1516
##        14 comps  15 comps  16 comps  17 comps
## CV         1523      1416      1169      1146
## adjCV      1519      1403      1163      1139
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X      31.670    57.30    64.30    69.90    75.39    80.38    83.99    87.40
## Apps    2.316    73.06    73.07    82.08    84.08    84.11    84.32    85.18
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.50     92.91     95.01     96.81      97.9     98.75     99.36
## Apps    85.88     86.06     86.06     86.10      86.1     86.13     90.32
##       16 comps  17 comps
## X        99.84    100.00
## Apps     92.52     92.92
1507*1507 # CV MSE at 9 components (RMSEP squared)
## [1] 2271049
1146*1146 # CV MSE at 17 components
## [1] 1313316
validationplot(pcr.fit,val.type = "MSEP")

At 9 components the CV RMSEP reaches a local minimum of 1507, and the model explains roughly 86% of the variance in Apps; the curve is relatively flat from there. The error only drops substantially again at 16–17 components (RMSEP 1146, MSE 1313316, close to our previous test errors), but at that point almost no dimension reduction remains, so we choose M = 9.
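
For a less subjective choice of M, the pls package provides selectNcomp(), which applies a one-standard-error or randomization rule to the CV curve; a sketch (assuming a pls version that includes selectNcomp):

selectNcomp(pcr.fit, method = "onesigma", plot = TRUE)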

We can now refit PCR on the training data only and evaluate it on the test set, using ncomp = 9.

set.seed(1)
pcr.fit=pcr(Apps~., data=College[train,], scale=TRUE,validation='CV')
validationplot(pcr.fit,val.type="MSEP")

pcr.pred=predict(pcr.fit,College[test,],ncomp=9)
mean((pcr.pred-College$Apps[test])^2)
## [1] 1583520

(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

set.seed(1)
pls.fit=plsr(Apps~.,data=College[train,],scale=TRUE,validation='CV')
summary(pls.fit)
## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4288     2217     2019     1761     1630     1533     1347
## adjCV         4288     2211     2012     1749     1605     1510     1331
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1309     1303     1286      1283      1283      1277      1271
## adjCV     1296     1289     1273      1270      1270      1264      1258
##        14 comps  15 comps  16 comps  17 comps
## CV         1270      1270      1270      1270
## adjCV      1258      1257      1257      1257
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       27.21    50.73    63.06    65.52    70.20    74.20    78.62    80.81
## Apps    75.39    81.24    86.97    91.14    92.62    93.43    93.56    93.68
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       83.29     87.17     89.15     91.37     92.58     94.42     96.98
## Apps    93.76     93.79     93.83     93.86     93.88     93.89     93.89
##       16 comps  17 comps
## X        98.78    100.00
## Apps     93.89     93.89

The output is close to PCR's. The lowest cross-validation error again occurs at the largest models, with the curve flattening around 13 components, which explains ~94% of the variance in Apps.

1270*1270 # CV MSE (RMSEP squared) where the curve flattens
## [1] 1612900

This cross-validated MSE is higher than the corresponding PCR value of 1313316; note it is a CV error on the training data, not a test error.
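
Part (f) asks for the test error proper; a sketch computing it on the held-out set at M = 13, reusing the train/test split from part (a):

pls.pred = predict(pls.fit, College[test,], ncomp = 13)
mean((pls.pred - College$Apps[test])^2)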

(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

While there are differences in the test errors, most are close to one another; each model explains roughly the high 80s to low 90s percent of the variance in Apps. Choosing by lowest test MSE, the lasso performed best. Among the dimension-reduction methods, PLS and PCR only approached that accuracy when nearly all components were retained, so little dimensionality was actually reduced.

detach(College)

11

We will now try to predict per capita crime rate in the Boston data set. (Note: the models fitted below actually use medv, the median home value, as the response.)

attach(Boston)
dim(Boston)
## [1] 506  14
set.seed(1)
train = sample(nrow(Boston), nrow(Boston)*.7)
Boston.train = Boston[train,]
Boston.test=Boston[-train,]
nrow(Boston.train)/nrow(Boston)
## [1] 0.6996047
nrow(Boston.test)/nrow(Boston)
## [1] 0.3003953

(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

First, \(\textbf{Full Model}\)

lr.full=lm(medv~.,data=Boston.train)
lr.predict=predict(lr.full,Boston.test)
MSE.lr=mean((lr.predict-Boston.test$medv)^2)
MSE.lr
## [1] 27.31196

\(\textbf{Best Subset}\)

best.subset=regsubsets(medv~., data=Boston.train,nbest=1,nvmax=13)
summary(best.subset)
## Subset selection object
## Call: regsubsets.formula(medv ~ ., data = Boston.train, nbest = 1, 
##     nvmax = 13)
## 13 Variables  (and intercept)
##         Forced in Forced out
## crim        FALSE      FALSE
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## black       FALSE      FALSE
## lstat       FALSE      FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
##           crim zn  indus chas nox rm  age dis rad tax ptratio black lstat
## 1  ( 1 )  " "  " " " "   " "  " " "*" " " " " " " " " " "     " "   " "  
## 2  ( 1 )  " "  " " " "   " "  " " "*" " " " " " " " " " "     " "   "*"  
## 3  ( 1 )  " "  " " " "   " "  " " "*" " " " " " " " " "*"     " "   "*"  
## 4  ( 1 )  " "  " " " "   "*"  " " "*" " " " " " " " " "*"     " "   "*"  
## 5  ( 1 )  " "  " " " "   " "  "*" "*" " " "*" " " " " "*"     " "   "*"  
## 6  ( 1 )  " "  " " " "   "*"  "*" "*" " " "*" " " " " "*"     " "   "*"  
## 7  ( 1 )  " "  " " " "   "*"  "*" "*" " " "*" " " " " "*"     "*"   "*"  
## 8  ( 1 )  " "  " " " "   "*"  "*" "*" " " "*" "*" " " "*"     "*"   "*"  
## 9  ( 1 )  " "  " " " "   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*"  
## 10  ( 1 ) " "  "*" " "   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*"  
## 11  ( 1 ) "*"  "*" " "   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*"  
## 12  ( 1 ) "*"  "*" " "   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"  
## 13  ( 1 ) "*"  "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"

Using best subset with nvmax = 13 (all variables other than the response), rm is selected in every model and lstat in all but the one-variable model. Once a variable enters, it is generally kept in every larger model; the one exception is chas, which appears in the 4-variable model but not the 5-variable one.

Boston.test.mat = model.matrix(medv ~ ., data = Boston.test)
val.errors = rep(NA, 13)
for (i in 1:13) {
    coefi = coef(best.subset, id = i)
    pred = Boston.test.mat[, names(coefi)] %*% coefi
    val.errors[i] = mean((pred - Boston.test$medv)^2)
}
plot(val.errors, xlab = "Number of predictors", ylab = "Test MSE", pch = 19, type = "b",col="orange")

Since regsubsets has no built-in predict method, we loop over the model sizes and compute each validation-set MSE by hand (an approach adapted from Swapnil Sharma's 2017 solution). The plot suggests the best number of predictors is 11, which we can verify easily enough:

which.min(val.errors)
## [1] 11
coef(best.subset, 11)
##  (Intercept)         crim           zn         chas          nox           rm 
##  28.87260476  -0.07192626   0.03661785   3.38187968 -15.21333970   4.72714761 
##          dis          rad          tax      ptratio        black        lstat 
##  -1.36283626   0.30319961  -0.01317327  -0.95549564   0.01047972  -0.49806806
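
Equivalently, this validation logic can be wrapped in a predict method for regsubsets objects, as in the ISLR lab; a sketch:

predict.regsubsets = function(object, newdata, id, ...) {
    form = as.formula(object$call[[2]])  # recover the formula used in the fit
    mat = model.matrix(form, newdata)    # build the design matrix for new data
    coefi = coef(object, id = id)        # coefficients of the size-id model
    mat[, names(coefi)] %*% coefi        # predictions
}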

To make comparing all the values easier, we’ll put each MSE in its own variable

MSE.bsm = val.errors[11]
MSE.bsm
## [1] 26.97848

\(\textbf{Ridge Regression}\)

bostontrain.mat<-model.matrix(medv~.,data=Boston.train)
bostontest.mat<-model.matrix(medv~.,data=Boston.test)
grid=10^seq(4,-2,length=500)
ridge.model=glmnet(bostontrain.mat,Boston.train$medv,alpha=0,lambda=grid)
boston.cv.ridge<-cv.glmnet(bostontrain.mat,Boston.train$medv,alpha=0,lambda=grid)
plot(boston.cv.ridge)

Now we can reuse the code from problem 9: find the best lambda, refit glmnet with it, and compute the test MSE.

boston.bestlam.ridge=boston.cv.ridge$lambda.min
boston.bestlam.ridge
## [1] 0.1731717
ridge.model.train=glmnet(bostontrain.mat,Boston.train$medv,alpha=0,lambda=boston.bestlam.ridge)
coef(ridge.model.train)
## 15 x 1 sparse Matrix of class "dgCMatrix"
##                        s0
## (Intercept)  25.856548970
## (Intercept)   .          
## crim         -0.065719240
## zn            0.032170536
## indus         0.004017301
## chas          3.527697622
## nox         -13.073981370
## rm            4.898809469
## age          -0.014867741
## dis          -1.316054846
## rad           0.246930945
## tax          -0.010916150
## ptratio      -0.915227320
## black         0.010545924
## lstat        -0.466744445
boston.pred.newridge=predict(ridge.model,s=boston.bestlam.ridge,newx=bostontest.mat)
MSE.ridge=mean((Boston.test$medv-boston.pred.newridge)^2)
MSE.ridge
## [1] 27.2719

\(\textbf{Lasso}\)

For the lasso, we just switch alpha from 0 (ridge) to 1.

lasso.model=glmnet(bostontrain.mat,Boston.train$medv,alpha=1,lambda=grid)
boston.cv.lasso=cv.glmnet(bostontrain.mat,Boston.train$medv,alpha=1,lambda=grid)
plot(boston.cv.lasso)

boston.bestlam.lasso=boston.cv.lasso$lambda.min
boston.bestlam.lasso
## [1] 0.2414157

Continuing to reuse code at this point, we find the best lambda the same way.

lasso.model.train=glmnet(bostontrain.mat,Boston.train$medv,alpha=1,lambda=boston.bestlam.lasso)
coef(lasso.model.train)
## 15 x 1 sparse Matrix of class "dgCMatrix"
##                        s0
## (Intercept)  25.118872062
## (Intercept)   .          
## crim         -0.064301884
## zn            0.030954936
## indus        -0.003100209
## chas          3.555879937
## nox         -12.567302633
## rm            4.906872607
## age          -0.014904396
## dis          -1.281172956
## rad           0.230383197
## tax          -0.010128972
## ptratio      -0.906531148
## black         0.010488861
## lstat        -0.462156393
boston.pred.newlasso=predict(lasso.model,s=boston.bestlam.lasso,newx=bostontest.mat)
MSE.lasso=mean((Boston.test$medv-boston.pred.newlasso)^2)
MSE.lasso
## [1] 27.26207

And finally, \(\textbf{PCR}\):

pcr.model=pcr(medv~.,data=Boston.train,scale=TRUE,validation="CV")
summary(pcr.model)
## Data:    X dimension: 354 13 
##  Y dimension: 354 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           9.575    7.456    6.962    5.538    5.281    4.953    4.942
## adjCV        9.575    7.454    6.958    5.538    5.284    4.942    4.935
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       4.945    4.928    4.943     5.013     4.946     4.862     4.783
## adjCV    4.937    4.919    4.933     5.004     4.933     4.848     4.768
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       47.82    59.12    68.54    75.10    81.25    86.12    90.03    93.14
## medv    39.44    48.26    67.33    71.41    74.40    74.65    74.87    75.28
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       95.17     96.79     98.21     99.48    100.00
## medv    75.30     75.30     76.07     76.92     77.78
validationplot(pcr.model,val.type="MSEP")

The minimum CV error occurs at 13 components, but the curve is nearly flat after 5; 5 components captures most of the reduction at far lower dimension, so we'll use ncomp = 5.

pcr.model.5comp=pcr(medv~.,data=Boston.train,scale=TRUE,ncomp=5)
summary(pcr.model.5comp)
## Data:    X dimension: 354 13 
##  Y dimension: 354 1
## Fit method: svdpc
## Number of components considered: 5
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps
## X       47.82    59.12    68.54    75.10    81.25
## medv    39.44    48.26    67.33    71.41    74.40
boston.predict.pcr=predict(pcr.model,Boston.test,ncomp=5)
MSE.pcr=mean((Boston.test$medv-boston.predict.pcr)^2)
MSE.pcr
## [1] 31.26027

(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.

Let's look at each of the models. Full:

MSE.lr
## [1] 27.31196

Best Subset:

MSE.bsm
## [1] 26.97848

Ridge:

MSE.ridge
## [1] 27.2719

Lasso:

MSE.lasso
## [1] 27.26207

PCR:

MSE.pcr
## [1] 31.26027
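
For convenience, we can collect and sort the stored MSEs:

mse.all = c(full = MSE.lr, best.subset = MSE.bsm, ridge = MSE.ridge,
            lasso = MSE.lasso, pcr = MSE.pcr)
sort(mse.all)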

I propose we use best subset. All of the models had very similar test MSEs, but best subset's was the lowest. Best subset is also feasible here because the data set is not large (506 observations, 13 predictors).

(c) Does your chosen model involve all of the features in the data set? Why or why not?

No. With best subset, the lowest test MSE occurred at 11 variables. Looking back at the selection table, the 11-variable model excludes indus and age.
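
We can confirm this programmatically; a quick check using the objects defined above:

setdiff(colnames(Boston.test.mat), names(coef(best.subset, 11))) # returns the dropped variables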