Problem 2

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

(a) The lasso, relative to least squares, is:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase variance is less than its decrease in bias.

The Lasso, relative to least squares is: iii. “Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.” The Lasso, similarly to Ridge regression, includes some non-zero lambda term, whereas Least Squares assumes lambda=0. As lambda increases, flexibility of the fit decreases and ridge regression coefficients will decrease leading to a significant reduction in variance with a trade-off of a slight increase in bias.

(b) Repeat (a) for ridge regression relative to least squares.

Ridge regression, relative to least squares is: iii. “Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.” Ridge regression includes some non-zero lambda term, whereas Least Squares assumes lambda=0. As lambda increases, flexibility of the fit decreases and ridge regression coefficients will decrease leading to a significant reduction in variance with a trade-off of a slight increase in bias. The main difference between Lasso and Ridge Regression is that the coefficients of predictors in The Lasso can be exactly zero instead of approaching zero and therefore the Lasso does not have to include all the predictors in the model.

(c) Repeat (a) for non-linear methods relative to least squares.

Non-linear methods, relative to Least Squares, are: ii. “More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.” Non-linear methods are the most flexible method we have, and therefore can sometimes have problems with overfitting. This overfitting issue is generally defined by too much variance and too little bias.

Problem 9

In this exercise, we will predict the number of applications received using the other variables in the College data set.

(a) Split the data set into a training set and a test set.

library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.3
data=College
set.seed(1)
train=sample(1:nrow(data), nrow(data)*0.8)
test=(-train)
x=model.matrix(Apps~.,data)
y=data$Apps

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

model = lm(Apps~., data, subset=train)
preds = predict(model, data[test,])
MSE1 = mean((y[test]-preds)^2)
MSE1
## [1] 1567324

The MSE or test error is 1,567,324.

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

library(glmnet)
## Warning: package 'glmnet' was built under R version 4.0.3
## Loading required package: Matrix
## Loaded glmnet 4.1
grid=10^seq(10,-2,length=100)
set.seed(1)
model=glmnet(x[train,], y[train], alpha=0, lambda=grid)
cv.out=cv.glmnet(x[train,],y[train], alpha=0)
bestlam=cv.out$lambda.min
preds=predict(model,s=bestlam,x[test,])
MSE2=mean((preds-y[test])^2)
MSE2
## [1] 1440058

The MSE or test error is 1,440,058, slightly better than the Least Squares Method.

(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

library(glmnet)
grid=10^seq(10,-2,length=100)
set.seed(1)
model=glmnet(x[train,], y[train], alpha=1, lambda=grid)
cv.out=cv.glmnet(x[train,],y[train], alpha=1)
bestlam=cv.out$lambda.min
preds=predict(model,s=bestlam,x[test,])
MSE3=mean((preds-y[test])^2)
MSE3
## [1] 1526474

The MSE or test error is 1,526,474.

(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

library(pls)
## Warning: package 'pls' was built under R version 4.0.3
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
set.seed(1)
model=pcr(Apps~.,data=data,scale=TRUE,validation="CV")
summary(model)
validationplot(model,val.type="MSEP")

## Data:    X dimension: 777 17 
##  Y dimension: 777 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3873     3842     2033     2033     1725     1617     1597
## adjCV         3873     3844     2031     2033     1684     1604     1593
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1592     1549     1510      1508      1514      1515      1525
## adjCV     1592     1543     1507      1505      1511      1511      1522
##        14 comps  15 comps  16 comps  17 comps
## CV         1526      1453      1163      1140
## adjCV      1522      1435      1157      1134
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X      31.670    57.30    64.30    69.90    75.39    80.38    83.99    87.40
## Apps    2.316    73.06    73.07    82.08    84.08    84.11    84.32    85.18
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.50     92.91     95.01     96.81      97.9     98.75     99.36
## Apps    85.88     86.06     86.06     86.10      86.1     86.13     90.32
##       16 comps  17 comps
## X        99.84    100.00
## Apps     92.52     92.92

We will use 15 components so that there will be at least some amount of dimensionality reduction in our PCR model compared to the Least Squares Model.

set.seed(1)
model=pcr(Apps~., data=data[train,],scale=TRUE,validation='CV')
preds=predict(model,data[test,],ncomp=15)
MSE4=mean((preds-y[test])^2)
MSE4
## [1] 1676837

Note that the MSE for PCR is worse than our Least Squares Method. Also, the M value should be 17 according to our CV score but that will give us the same value as Least Squares so I used M=15.

(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

set.seed(1)
model=plsr(Apps~.,data=data[train,],scale=TRUE,validation='CV')
summary(model)
## Data:    X dimension: 621 17 
##  Y dimension: 621 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3837     1865     1623     1441     1380     1244     1166
## adjCV         3837     1862     1623     1437     1363     1222     1154
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1147     1133     1123      1122      1122      1121      1118
## adjCV     1136     1125     1115      1114      1114      1112      1110
##        14 comps  15 comps  16 comps  17 comps
## CV         1118      1118      1118      1118
## adjCV      1109      1110      1109      1109
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       25.55    45.38    62.59    65.08    67.55    72.02    75.93    80.46
## Apps    77.30    83.57    87.51    90.88    92.88    93.15    93.24    93.31
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       82.51     85.43     87.83     91.09     92.73     95.12     96.95
## Apps    93.39     93.42     93.45     93.46     93.47     93.47     93.47
##       16 comps  17 comps
## X        97.97    100.00
## Apps     93.47     93.47
validationplot(model,val.type="MSEP")

preds=predict(model,data[test,],ncomp=5)
MSE5=mean((preds-y[test])^2)
MSE5
## [1] 1588895

One thing to note here with PLS is that our MSE is very close to the Least Squares Method with only 5 predictors. Also, the M value should be 17 according to our CV score but that will give us the same value as Least Squares so I used M=5.

(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

testerror=data.frame(c("Least Squares","Ridge Regression","Lasso","PCR","PLS"),c(MSE1,MSE2,MSE3,MSE4,MSE5))
testerror <- setNames(testerror, c("Method","Test Error"))
testerror
##             Method Test Error
## 1    Least Squares    1567324
## 2 Ridge Regression    1440058
## 3            Lasso    1526474
## 4              PCR    1676837
## 5              PLS    1588895

As we can see from the table, Ridge Regression offers the lowest test error and seems to work the best compared to the other methods for this dataset. However, as mentioned previously, the Ridge regression contains all the predictors so if we are interested in a smaller model PLS is also very attractive because it’s score is fairly good and it only consists of 5 predictors.

Problem 11

We will now try to predict per capita crime rate in the Boston data set.

(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

library(MASS)
library(leaps)
## Warning: package 'leaps' was built under R version 4.0.3
data=Boston
set.seed(1)
train=sample(1:nrow(data), nrow(data)*0.8)
test=(-train)
x=model.matrix(crim~.,data)
y=data$crim
test.mat=model.matrix(crim~., data=data[test,])

Best Subset Selection

model.fit=regsubsets(crim~.,data,subset=train,nvmax=13)
p=ncol(Boston)-1
val.errors=rep(NA,13)
for(i in 1:p)
{
  coefi=coef(model.fit,id=i)
  preds=test.mat[,names(coefi)]%*%coefi
  val.errors[i]=mean((y[test]-preds)^2)
}
MSE1=val.errors[which.min(val.errors)]
MSE1
## [1] 68.93374

The lowest MSE for the best subset selection was found with 11 predictors. The MSE value is 68.93374.

Lasso

grid=10^seq(10,-2,length=100)
set.seed(1)
model=glmnet(x[train,], y[train], alpha=1, lambda=grid)
cv.out=cv.glmnet(x[train,],y[train], alpha=1)
bestlam=cv.out$lambda.min
preds=predict(model,s=bestlam,x[test,])
MSE2=mean((preds-y[test])^2)
MSE2
## [1] 69.64741

Note that this MSE value is slightly worse than the Best Subset Selection MSE.

Ridge Regression

grid=10^seq(10,-2,length=100)
set.seed(1)
model=glmnet(x[train,], y[train], alpha=0, lambda=grid)
cv.out=cv.glmnet(x[train,],y[train], alpha=0)
bestlam=cv.out$lambda.min
preds=predict(model,s=bestlam,x[test,])
MSE3=mean((preds-y[test])^2)
MSE3
## [1] 70.70392

Note that this MSE value is slightly worse than both the Best Subset Selection and the Lasso MSE scores.

PCR

set.seed(1)
p=ncol(Boston)-1
val.errors=rep(NA,13)
model=pcr(crim~., data=data[train,],scale=TRUE,validation='CV')
for(i in 1:p)
{preds=predict(model,data[test,],ncomp=i)
  val.errors[i]=mean((y[test]-preds)^2)
}
MSE4=val.errors[which.min(val.errors)]
MSE4
## [1] 69.18774

Note that this MSE is slightly worse than Best Subset Selection.

(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.

testerror=data.frame(c("Best Subset Selection","Lasso","Ridge Regression","PCR"),c(MSE1,MSE2,MSE3,MSE4))
testerror <- setNames(testerror, c("Method","Test Error"))
testerror
##                  Method Test Error
## 1 Best Subset Selection   68.93374
## 2                 Lasso   69.64741
## 3      Ridge Regression   70.70392
## 4                   PCR   69.18774

Based on the above table, we can see that all the test errors are extremely close to each other but the Best Subset Selection is the best by a slight margin.

(c) Does your chosen model involve all of the features in the data set? Why or why not?

No, the best subset selection uses only 11 out of the 13 predictors. We calculated MSE’s for all predictors but the 11 predictor model chosen by Best Subset Selection had the lowest MSE.