For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
(a) The lasso, relative to least squares, is:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase variance is less than its decrease in bias.
The Lasso, relative to least squares is: iii. “Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.” The Lasso, similarly to Ridge regression, includes some non-zero lambda term, whereas Least Squares assumes lambda=0. As lambda increases, flexibility of the fit decreases and ridge regression coefficients will decrease leading to a significant reduction in variance with a trade-off of a slight increase in bias.
(b) Repeat (a) for ridge regression relative to least squares.
Ridge regression, relative to least squares is: iii. “Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.” Ridge regression includes some non-zero lambda term, whereas Least Squares assumes lambda=0. As lambda increases, flexibility of the fit decreases and ridge regression coefficients will decrease leading to a significant reduction in variance with a trade-off of a slight increase in bias. The main difference between Lasso and Ridge Regression is that the coefficients of predictors in The Lasso can be exactly zero instead of approaching zero and therefore the Lasso does not have to include all the predictors in the model.
(c) Repeat (a) for non-linear methods relative to least squares.
Non-linear methods, relative to Least Squares, are: ii. “More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.” Non-linear methods are the most flexible method we have, and therefore can sometimes have problems with overfitting. This overfitting issue is generally defined by too much variance and too little bias.
In this exercise, we will predict the number of applications received using the other variables in the College data set.
(a) Split the data set into a training set and a test set.
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.3
data=College
set.seed(1)
train=sample(1:nrow(data), nrow(data)*0.8)
test=(-train)
x=model.matrix(Apps~.,data)
y=data$Apps
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
model = lm(Apps~., data, subset=train)
preds = predict(model, data[test,])
MSE1 = mean((y[test]-preds)^2)
MSE1
## [1] 1567324
The MSE or test error is 1,567,324.
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.0.3
## Loading required package: Matrix
## Loaded glmnet 4.1
grid=10^seq(10,-2,length=100)
set.seed(1)
model=glmnet(x[train,], y[train], alpha=0, lambda=grid)
cv.out=cv.glmnet(x[train,],y[train], alpha=0)
bestlam=cv.out$lambda.min
preds=predict(model,s=bestlam,x[test,])
MSE2=mean((preds-y[test])^2)
MSE2
## [1] 1440058
The MSE or test error is 1,440,058, slightly better than the Least Squares Method.
(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
library(glmnet)
grid=10^seq(10,-2,length=100)
set.seed(1)
model=glmnet(x[train,], y[train], alpha=1, lambda=grid)
cv.out=cv.glmnet(x[train,],y[train], alpha=1)
bestlam=cv.out$lambda.min
preds=predict(model,s=bestlam,x[test,])
MSE3=mean((preds-y[test])^2)
MSE3
## [1] 1526474
The MSE or test error is 1,526,474.
(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
library(pls)
## Warning: package 'pls' was built under R version 4.0.3
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
set.seed(1)
model=pcr(Apps~.,data=data,scale=TRUE,validation="CV")
summary(model)
validationplot(model,val.type="MSEP")
## Data: X dimension: 777 17
## Y dimension: 777 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3873 3842 2033 2033 1725 1617 1597
## adjCV 3873 3844 2031 2033 1684 1604 1593
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1592 1549 1510 1508 1514 1515 1525
## adjCV 1592 1543 1507 1505 1511 1511 1522
## 14 comps 15 comps 16 comps 17 comps
## CV 1526 1453 1163 1140
## adjCV 1522 1435 1157 1134
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 31.670 57.30 64.30 69.90 75.39 80.38 83.99 87.40
## Apps 2.316 73.06 73.07 82.08 84.08 84.11 84.32 85.18
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.50 92.91 95.01 96.81 97.9 98.75 99.36
## Apps 85.88 86.06 86.06 86.10 86.1 86.13 90.32
## 16 comps 17 comps
## X 99.84 100.00
## Apps 92.52 92.92
We will use 15 components so that there will be at least some amount of dimensionality reduction in our PCR model compared to the Least Squares Model.
set.seed(1)
model=pcr(Apps~., data=data[train,],scale=TRUE,validation='CV')
preds=predict(model,data[test,],ncomp=15)
MSE4=mean((preds-y[test])^2)
MSE4
## [1] 1676837
Note that the MSE for PCR is worse than our Least Squares Method. Also, the M value should be 17 according to our CV score but that will give us the same value as Least Squares so I used M=15.
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
set.seed(1)
model=plsr(Apps~.,data=data[train,],scale=TRUE,validation='CV')
summary(model)
## Data: X dimension: 621 17
## Y dimension: 621 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3837 1865 1623 1441 1380 1244 1166
## adjCV 3837 1862 1623 1437 1363 1222 1154
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1147 1133 1123 1122 1122 1121 1118
## adjCV 1136 1125 1115 1114 1114 1112 1110
## 14 comps 15 comps 16 comps 17 comps
## CV 1118 1118 1118 1118
## adjCV 1109 1110 1109 1109
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 25.55 45.38 62.59 65.08 67.55 72.02 75.93 80.46
## Apps 77.30 83.57 87.51 90.88 92.88 93.15 93.24 93.31
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 82.51 85.43 87.83 91.09 92.73 95.12 96.95
## Apps 93.39 93.42 93.45 93.46 93.47 93.47 93.47
## 16 comps 17 comps
## X 97.97 100.00
## Apps 93.47 93.47
validationplot(model,val.type="MSEP")
preds=predict(model,data[test,],ncomp=5)
MSE5=mean((preds-y[test])^2)
MSE5
## [1] 1588895
One thing to note here with PLS is that our MSE is very close to the Least Squares Method with only 5 predictors. Also, the M value should be 17 according to our CV score but that will give us the same value as Least Squares so I used M=5.
(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
testerror=data.frame(c("Least Squares","Ridge Regression","Lasso","PCR","PLS"),c(MSE1,MSE2,MSE3,MSE4,MSE5))
testerror <- setNames(testerror, c("Method","Test Error"))
testerror
## Method Test Error
## 1 Least Squares 1567324
## 2 Ridge Regression 1440058
## 3 Lasso 1526474
## 4 PCR 1676837
## 5 PLS 1588895
As we can see from the table, Ridge Regression offers the lowest test error and seems to work the best compared to the other methods for this dataset. However, as mentioned previously, the Ridge regression contains all the predictors so if we are interested in a smaller model PLS is also very attractive because it’s score is fairly good and it only consists of 5 predictors.
We will now try to predict per capita crime rate in the Boston data set.
(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
library(MASS)
library(leaps)
## Warning: package 'leaps' was built under R version 4.0.3
data=Boston
set.seed(1)
train=sample(1:nrow(data), nrow(data)*0.8)
test=(-train)
x=model.matrix(crim~.,data)
y=data$crim
test.mat=model.matrix(crim~., data=data[test,])
Best Subset Selection
model.fit=regsubsets(crim~.,data,subset=train,nvmax=13)
p=ncol(Boston)-1
val.errors=rep(NA,13)
for(i in 1:p)
{
coefi=coef(model.fit,id=i)
preds=test.mat[,names(coefi)]%*%coefi
val.errors[i]=mean((y[test]-preds)^2)
}
MSE1=val.errors[which.min(val.errors)]
MSE1
## [1] 68.93374
The lowest MSE for the best subset selection was found with 11 predictors. The MSE value is 68.93374.
Lasso
grid=10^seq(10,-2,length=100)
set.seed(1)
model=glmnet(x[train,], y[train], alpha=1, lambda=grid)
cv.out=cv.glmnet(x[train,],y[train], alpha=1)
bestlam=cv.out$lambda.min
preds=predict(model,s=bestlam,x[test,])
MSE2=mean((preds-y[test])^2)
MSE2
## [1] 69.64741
Note that this MSE value is slightly worse than the Best Subset Selection MSE.
Ridge Regression
grid=10^seq(10,-2,length=100)
set.seed(1)
model=glmnet(x[train,], y[train], alpha=0, lambda=grid)
cv.out=cv.glmnet(x[train,],y[train], alpha=0)
bestlam=cv.out$lambda.min
preds=predict(model,s=bestlam,x[test,])
MSE3=mean((preds-y[test])^2)
MSE3
## [1] 70.70392
Note that this MSE value is slightly worse than both the Best Subset Selection and the Lasso MSE scores.
PCR
set.seed(1)
p=ncol(Boston)-1
val.errors=rep(NA,13)
model=pcr(crim~., data=data[train,],scale=TRUE,validation='CV')
for(i in 1:p)
{preds=predict(model,data[test,],ncomp=i)
val.errors[i]=mean((y[test]-preds)^2)
}
MSE4=val.errors[which.min(val.errors)]
MSE4
## [1] 69.18774
Note that this MSE is slightly worse than Best Subset Selection.
(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
testerror=data.frame(c("Best Subset Selection","Lasso","Ridge Regression","PCR"),c(MSE1,MSE2,MSE3,MSE4))
testerror <- setNames(testerror, c("Method","Test Error"))
testerror
## Method Test Error
## 1 Best Subset Selection 68.93374
## 2 Lasso 69.64741
## 3 Ridge Regression 70.70392
## 4 PCR 69.18774
Based on the above table, we can see that all the test errors are extremely close to each other but the Best Subset Selection is the best by a slight margin.
(c) Does your chosen model involve all of the features in the data set? Why or why not?
No, the best subset selection uses only 11 out of the 13 predictors. We calculated MSE’s for all predictors but the 11 predictor model chosen by Best Subset Selection had the lowest MSE.