For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
(a) The lasso, relative to least squares, is:
The correct statement is iii. Justification: The lasso exhibits similar behavior to ridge regression in that as \(lambda\) increases, the variance decreases and bias increases. Per the ISLR textbook: “when the least squares estimates have excessively high variance, the lasso solution can yield a reduction in variance at the expense of a small increase in bias, and consequently can generate more accurate predictions.” For further detail see part (b).
(b) Repeat (a) for ridge regression relative to least squares.
The correct statement is iii. Justification: The advantage of ridge regression to ordinary least squares (OLS) regression is rooted in the bias variance trade-off. OLS can be understood as a special case of ridge regression where the lambda shrinkage parameter is equal to zero. OLS resimates have generally low bias but can be highly variable. As lambda increases, the flexibility of the ridge regression decreases resulting in substantial reduction in variance with some increase in bias. Ridge regression will work best in situations where the OLS estimates have high variance.
(c) Repeat (a) for non-linear methods relative to least squares.
The correct statement is ii. Justification: Non-linear methods are more flexible than least squares by definition. As flexibility increases, variance generally tends to increase and bias generally decreases.
In this exercise, we will predict the number of applications received using the other variables in the College data set.
library(ISLR)
attach(College)
(a) Split the data set into a training set and a test set.
set.seed(1)
train=sample(c(TRUE,FALSE),nrow(College),rep=TRUE)
test=(!train)
College.train = College[train,]
College.test = College[test,]
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
lm.fit = lm(Apps~., data=College.train)
lm.pred = predict(lm.fit, College.test, type="response")
mean((lm.pred-College.test$Apps)^2)
## [1] 984743.1
The test error of the linear model fit is 984743.1.
(c) Fit a ridge regression model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained.
library(glmnet)
set.seed(1)
#Set up matrices needed for the glmnet functions
train.mat = model.matrix(Apps~., data = College.train)
test.mat = model.matrix(Apps~., data = College.test)
#Choose lambda using cross-validation
cv.out = cv.glmnet(train.mat,College.train$Apps,alpha=0)
bestlam = cv.out$lambda.min
bestlam
## [1] 394.2365
#Fit a ridge regression
ridge.mod = glmnet(train.mat,College.train$Apps,alpha = 0)
#Make predictions
ridge.pred = predict(ridge.mod,s=bestlam,newx = test.mat)
#Calculate test error
mean((ridge.pred - College.test$Apps)^2)
## [1] 940970.9
The test error of the ridge regression fit with a lambda chosen by cross-validation is 940970.9, which is lower than the linear model test error.
(d) Fit a lasso model on the training set, with \(\lambda\) chosen by crossvalidation. Report the test error obtained, along with the number of non-zero coefficient estimates.
#Choose lambda using cross-validation
set.seed(1)
cv.out2 = cv.glmnet(train.mat,College.train$Apps,alpha=1)
bestlam2 = cv.out2$lambda.min
bestlam2
## [1] 59.92044
#Fit lasso model
lasso.mod = glmnet(train.mat,College.train$Apps,alpha=1)
#Make predictions
lasso.pred = predict(lasso.mod,s=bestlam2,newx=test.mat)
mean((lasso.pred - College.test$Apps)^2)
## [1] 993741.7
The test error of the lasso model fit with a lambda chosen by cross-validation is 993741.7, which is higher than both the linear model test error and the ridge regression test error.
(e) Fit a PCR model on the training set, with \(M\) chosen by crossvalidation. Report the test error obtained, along with the value of \(M\) selected by cross-validation.
library(pls)
set.seed(1)
#Fit and determine M based on CV results
pcr.fit=pcr(Apps~., data=College.train, scale=TRUE, validation="CV")
validationplot(pcr.fit,val.type = "MSEP")
summary(pcr.fit)
## Data: X dimension: 393 17
## Y dimension: 393 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4189 4094 2321 2336 2168 2099 1961
## adjCV 4189 4094 2315 2332 2156 2088 1949
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1949 1937 1861 1792 1808 1830 1839
## adjCV 1940 1922 1845 1783 1800 1821 1831
## 14 comps 15 comps 16 comps 17 comps
## CV 1848 1612 1348 1333
## adjCV 1845 1584 1334 1320
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 31.858 57.44 64.20 69.91 75.10 80.17 83.82
## Apps 4.353 70.99 71.18 76.84 78.34 81.03 81.59
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 87.30 90.26 92.74 94.79 96.70 97.76 98.67
## Apps 82.21 83.31 83.97 83.97 84.34 84.58 84.70
## 15 comps 16 comps 17 comps
## X 99.37 99.82 100.00
## Apps 91.28 92.83 93.02
The lowest MSEP with PCR dimension reduction appears to occur around \(M=10\). Looking at the summary I also confirm \(M=10\) has the lowest CV error (other than 15+ components, which I am not considering because that does not accomplish much dimension reduction).
#Make predictions using chosen M
pcr.pred=predict(pcr.fit,College.test,ncomp=10)
#Compute test error
mean((pcr.pred-College.test$Apps)^2)
## [1] 1682909
The test error of the lasso model fit with a lambda chosen by cross-validation is 1682909, which underperforms all previous models.
(f) Fit a PLS model on the training set, with \(M\) chosen by crossvalidation. Report the test error obtained, along with the value of \(M\) selected by cross-validation.
set.seed(1)
#Fit and determine M based on CV results
pls.fit = plsr(Apps~., data=College.train, scale=TRUE,validation="CV")
validationplot(pls.fit,val.type = "MSEP")
summary(pls.fit)
## Data: X dimension: 393 17
## Y dimension: 393 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4189 2172 1932 1720 1669 1489 1382
## adjCV 4189 2163 1930 1709 1640 1463 1365
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1333 1328 1323 1329 1332 1334 1334
## adjCV 1321 1316 1310 1316 1319 1320 1321
## 14 comps 15 comps 16 comps 17 comps
## CV 1335 1333 1333 1333
## adjCV 1321 1320 1320 1320
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 26.01 44.96 62.49 65.22 68.52 72.89 77.13
## Apps 75.74 82.40 86.74 90.58 92.34 92.79 92.88
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 80.46 82.45 84.76 88.08 90.76 92.80 94.45
## Apps 92.93 92.98 93.00 93.01 93.01 93.02 93.02
## 15 comps 16 comps 17 comps
## X 97.02 98.03 100.00
## Apps 93.02 93.02 93.02
The lowest MSEP with PCR dimension reduction appears to occur around \(M=8\). Looking at the summary I see \(M=9\) has the lowest CV error, so I will try both and evaluate the better performing one.
#Make predictions using M=8
pls.pred = predict(pls.fit, College.test, ncomp = 8)
#Compute test error
mean((pls.pred - College.test$Apps)^2)
## [1] 978534.3
#Make predictions using M=9
pls.pred = predict(pls.fit, College.test, ncomp = 9)
#Compute test error
mean((pls.pred - College.test$Apps)^2)
## [1] 1007163
So the best performing PLS model utilizes \(M=8\), and its test error is 978534.3. This is the second best performing model, second only to ridge regression.
(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
The model performance from best to worst is as follows (based upon test error): Ridge Regression (940970.9), PLS model (978534.3), linear model (984743.1), Lasso model (993741.7), PCR model (1682909). The test errors of PLS, linear, and lasso are fairly similary to one another, while PCR performs significantly worse and Ridge performs a little better. To say how accurately we can predict the number of applications received, let’s compute the ridge regression R-square value.
TOTALSUMOFSQUARES = sum((mean(College.test$Apps) - College.test$Apps)^2)
TOTALSUMOFRESIDUALS = sum((ridge.pred - College.test$Apps)^2)
1 - (TOTALSUMOFRESIDUALS)/(TOTALSUMOFSQUARES)
## [1] 0.9240954
The R-squared for my best model (ridge regression) tells us I can explain 92.4% of the variance in Apps using this model. This is strong accurate predictive power.
We will now try to predict per capita crime rate in the Boston data set.
detach(College)
library(MASS)
attach(Boston)
(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
I will try out three of these four regression methods (best subset selection, lasso, ridge regression). First, I need to split the data into train and test subsets.
set.seed(1)
train=sample(c(TRUE,FALSE),nrow(Boston),rep=TRUE)
test=(!train)
Boston.train = Boston[train,]
Boston.test = Boston[test,]
library(leaps)
#Perform BSS for up to p=13
regfit.full = regsubsets(crim~.,data=Boston.train,nvmax=13)
reg.summary = summary(regfit.full)
#Make plots of performance statistics to determine best subset
par(mfrow=c(2,2))
plot(reg.summary$rss,xlab="Number of Variables",ylab="RSS",type="l")
plot(reg.summary$adjr2,xlab="Number of Variables",ylab="Adjusted RSq",type="l")
p = which.max(reg.summary$adjr2)
points(p,reg.summary$adjr2[p], col="red",cex=2,pch=20)
plot(reg.summary$cp,xlab="Number of Variables",ylab="Cp",type='l')
p = which.min(reg.summary$cp)
points(p,reg.summary$cp[p],col="red",cex=2,pch=20)
plot(reg.summary$bic,xlab="Number of Variables",ylab="BIC",type='l')
p = which.min(reg.summary$bic)
points(p,reg.summary$bic[p],col="red",cex=2,pch=20)
It appears both p=3 and p=7 could be good options, so I will fit models using both suggested subsets on test dataset.
summary(regfit.full)
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = Boston.train, nvmax = 13)
## 13 Variables (and intercept)
## Forced in Forced out
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## black FALSE FALSE
## lstat FALSE FALSE
## medv FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
## zn indus chas nox rm age dis rad tax ptratio black lstat medv
## 1 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" " " " "
## 3 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" "*" " "
## 4 ( 1 ) " " "*" " " " " " " " " " " "*" " " " " "*" "*" " "
## 5 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " "*" " " "*"
## 6 ( 1 ) "*" "*" " " " " " " " " "*" "*" " " " " "*" " " "*"
## 7 ( 1 ) "*" "*" " " " " " " " " "*" "*" " " " " "*" "*" "*"
## 8 ( 1 ) "*" "*" " " "*" " " " " "*" "*" " " " " "*" "*" "*"
## 9 ( 1 ) "*" "*" " " "*" " " " " "*" "*" " " "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" " " "*" " " "*" "*" "*" " " "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" " " "*" "*" "*" "*" "*" " " "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" " " "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
#Compute test error of null model for comparison's sake
mean((mean(Boston.test$crim)-Boston.test$crim)^2)
## [1] 83.74891
lm.fit = lm(crim~rad+black+lstat, data=Boston.train)
lm.pred = predict(lm.fit, Boston.test, type="response")
mean((lm.pred-Boston.test$crim)^2)
## [1] 52.66064
The test error of the linear model fit with the 3 best predictors is 52.66.
lm.fit = lm(crim~zn+indus+dis+rad+black+lstat+medv, data=Boston.train)
lm.pred = predict(lm.fit, Boston.test, type="response")
mean((lm.pred-Boston.test$crim)^2)
## [1] 51.04084
The test error of the linear model fit with the 7 best predictors is 51.04. This is only slightly lower than the model with the 3 best predictors, so I would argue the additional performance is insufficient for the added complexity of 4 more variables into the model.
set.seed(1)
#Set up matrices needed for the glmnet functions
train.mat = model.matrix(crim~., data = Boston.train)
test.mat = model.matrix(crim~., data = Boston.test)
#Choose lambda using cross-validation
cv.out = cv.glmnet(train.mat,Boston.train$crim,alpha=1)
bestlam = cv.out$lambda.min
bestlam
## [1] 0.01973085
#Fit lasso model
lasso.mod = glmnet(train.mat,Boston.train$crim,alpha=1)
#Make predictions
lasso.pred = predict(lasso.mod,s=bestlam,newx=test.mat)
mean((lasso.pred - Boston.test$crim)^2)
## [1] 50.7379
The test error of the lasso model fit with a lambda chosen by cross-validation is 50.73, which is lower than the linear model using best subsets. We can look at which variables were not shrunk to zero (AKA which would be included in the model):
lasso.coef=predict(lasso.mod,type="coefficients",s=bestlam)[1:15,]
lasso.coef[lasso.coef!=0]
## (Intercept) zn indus chas nox rm
## 15.96344350 0.04355602 -0.10751503 -0.14897948 -6.30730423 -0.24512633
## age dis rad ptratio black lstat
## 0.01130326 -0.74018280 0.47687429 -0.15621440 -0.01470752 0.11414439
## medv
## -0.11377842
#Choose lambda using cross-validation
cv.out = cv.glmnet(train.mat,Boston.train$crim,alpha=0)
bestlam = cv.out$lambda.min
bestlam
## [1] 0.5240686
#Fit a ridge regression
ridge.mod = glmnet(train.mat,Boston.train$crim,alpha = 0)
#Make predictions
ridge.pred = predict(ridge.mod,s=bestlam,newx = test.mat)
#Calculate test error
mean((ridge.pred - Boston.test$crim)^2)
## [1] 51.46284
The test error of the ridge regression fit with a lambda chosen by cross-validation is 51.46, higher than the LASSO test error.
(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, crossvalidation, or some other reasonable alternative, as opposed to using training error.
Based on the above information in part (a), I would propose that the best model is the linear model with the best subset of 3 predictors. As an analyst I know this is part art, and for me I highly value simplicity and interpretability in models and always think critically about if the additional predictive power of more complex models is worth it. While the Lasso and Ridge models I show above indeed have less test error than the 3 variable linear model, the Lasso contains 12 predictors and the Ridge contains 13. The reduction in test error is so minimal compared to introducing 9 to 10 additional predictors, I cannot justify it.
(c) Does your chosen model involve all of the features in the data set? Why or why not?
No it does not involve all features. It is a Best Subset Selection approach. As previously mentioned, it includes 3 of the 13 possible predictors.