Homework Assignment #5

Problem 2

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

(a) The lasso, relative to least squares, is:

The correct statement is iii. Justification: The lasso exhibits similar behavior to ridge regression in that as \(lambda\) increases, the variance decreases and bias increases. Per the ISLR textbook: “when the least squares estimates have excessively high variance, the lasso solution can yield a reduction in variance at the expense of a small increase in bias, and consequently can generate more accurate predictions.” For further detail see part (b).

(b) Repeat (a) for ridge regression relative to least squares.

The correct statement is iii. Justification: The advantage of ridge regression to ordinary least squares (OLS) regression is rooted in the bias variance trade-off. OLS can be understood as a special case of ridge regression where the lambda shrinkage parameter is equal to zero. OLS resimates have generally low bias but can be highly variable. As lambda increases, the flexibility of the ridge regression decreases resulting in substantial reduction in variance with some increase in bias. Ridge regression will work best in situations where the OLS estimates have high variance.

(c) Repeat (a) for non-linear methods relative to least squares.

The correct statement is ii. Justification: Non-linear methods are more flexible than least squares by definition. As flexibility increases, variance generally tends to increase and bias generally decreases.

Problem 9

In this exercise, we will predict the number of applications received using the other variables in the College data set.

library(ISLR)
attach(College)

(a) Split the data set into a training set and a test set.

set.seed(1)
train=sample(c(TRUE,FALSE),nrow(College),rep=TRUE)
test=(!train)

College.train = College[train,]
College.test = College[test,]

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

lm.fit = lm(Apps~., data=College.train)
lm.pred = predict(lm.fit, College.test, type="response")
mean((lm.pred-College.test$Apps)^2)

## [1] 984743.1

The test error of the linear model fit is 984743.1.

(c) Fit a ridge regression model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained.

library(glmnet) 
set.seed(1)
#Set up matrices needed for the glmnet functions
train.mat = model.matrix(Apps~., data = College.train)
test.mat = model.matrix(Apps~., data = College.test)
#Choose lambda using cross-validation
cv.out = cv.glmnet(train.mat,College.train$Apps,alpha=0)
bestlam = cv.out$lambda.min
bestlam

## [1] 394.2365

#Fit a ridge regression
ridge.mod = glmnet(train.mat,College.train$Apps,alpha = 0)
#Make predictions
ridge.pred = predict(ridge.mod,s=bestlam,newx = test.mat)
#Calculate test error
mean((ridge.pred - College.test$Apps)^2)

## [1] 940970.9

The test error of the ridge regression fit with a lambda chosen by cross-validation is 940970.9, which is lower than the linear model test error.

(d) Fit a lasso model on the training set, with \(\lambda\) chosen by crossvalidation. Report the test error obtained, along with the number of non-zero coefficient estimates.

#Choose lambda using cross-validation
set.seed(1)
cv.out2 = cv.glmnet(train.mat,College.train$Apps,alpha=1)
bestlam2 = cv.out2$lambda.min
bestlam2

## [1] 59.92044

#Fit lasso model
lasso.mod = glmnet(train.mat,College.train$Apps,alpha=1)
#Make predictions
lasso.pred = predict(lasso.mod,s=bestlam2,newx=test.mat)
mean((lasso.pred - College.test$Apps)^2)

## [1] 993741.7

The test error of the lasso model fit with a lambda chosen by cross-validation is 993741.7, which is higher than both the linear model test error and the ridge regression test error.

(e) Fit a PCR model on the training set, with \(M\) chosen by crossvalidation. Report the test error obtained, along with the value of \(M\) selected by cross-validation.

library(pls)
set.seed(1)
#Fit and determine M based on CV results
pcr.fit=pcr(Apps~., data=College.train, scale=TRUE, validation="CV")
validationplot(pcr.fit,val.type = "MSEP")

summary(pcr.fit)

## Data:    X dimension: 393 17 
##  Y dimension: 393 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4189     4094     2321     2336     2168     2099     1961
## adjCV         4189     4094     2315     2332     2156     2088     1949
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1949     1937     1861      1792      1808      1830      1839
## adjCV     1940     1922     1845      1783      1800      1821      1831
##        14 comps  15 comps  16 comps  17 comps
## CV         1848      1612      1348      1333
## adjCV      1845      1584      1334      1320
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X      31.858    57.44    64.20    69.91    75.10    80.17    83.82
## Apps    4.353    70.99    71.18    76.84    78.34    81.03    81.59
##       8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X       87.30    90.26     92.74     94.79     96.70     97.76     98.67
## Apps    82.21    83.31     83.97     83.97     84.34     84.58     84.70
##       15 comps  16 comps  17 comps
## X        99.37     99.82    100.00
## Apps     91.28     92.83     93.02

The lowest MSEP with PCR dimension reduction appears to occur around \(M=10\). Looking at the summary I also confirm \(M=10\) has the lowest CV error (other than 15+ components, which I am not considering because that does not accomplish much dimension reduction).

#Make predictions using chosen M
pcr.pred=predict(pcr.fit,College.test,ncomp=10)
#Compute test error
mean((pcr.pred-College.test$Apps)^2)

## [1] 1682909

The test error of the lasso model fit with a lambda chosen by cross-validation is 1682909, which underperforms all previous models.

(f) Fit a PLS model on the training set, with \(M\) chosen by crossvalidation. Report the test error obtained, along with the value of \(M\) selected by cross-validation.

set.seed(1)
#Fit and determine M based on CV results
pls.fit = plsr(Apps~., data=College.train, scale=TRUE,validation="CV")
validationplot(pls.fit,val.type = "MSEP")

summary(pls.fit)

## Data:    X dimension: 393 17 
##  Y dimension: 393 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4189     2172     1932     1720     1669     1489     1382
## adjCV         4189     2163     1930     1709     1640     1463     1365
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1333     1328     1323      1329      1332      1334      1334
## adjCV     1321     1316     1310      1316      1319      1320      1321
##        14 comps  15 comps  16 comps  17 comps
## CV         1335      1333      1333      1333
## adjCV      1321      1320      1320      1320
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X       26.01    44.96    62.49    65.22    68.52    72.89    77.13
## Apps    75.74    82.40    86.74    90.58    92.34    92.79    92.88
##       8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X       80.46    82.45     84.76     88.08     90.76     92.80     94.45
## Apps    92.93    92.98     93.00     93.01     93.01     93.02     93.02
##       15 comps  16 comps  17 comps
## X        97.02     98.03    100.00
## Apps     93.02     93.02     93.02

The lowest MSEP with PCR dimension reduction appears to occur around \(M=8\). Looking at the summary I see \(M=9\) has the lowest CV error, so I will try both and evaluate the better performing one.

#Make predictions using M=8
pls.pred = predict(pls.fit, College.test, ncomp = 8)
#Compute test error
mean((pls.pred - College.test$Apps)^2)

## [1] 978534.3

#Make predictions using M=9
pls.pred = predict(pls.fit, College.test, ncomp = 9)
#Compute test error
mean((pls.pred - College.test$Apps)^2)

## [1] 1007163

So the best performing PLS model utilizes \(M=8\), and its test error is 978534.3. This is the second best performing model, second only to ridge regression.

(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

The model performance from best to worst is as follows (based upon test error): Ridge Regression (940970.9), PLS model (978534.3), linear model (984743.1), Lasso model (993741.7), PCR model (1682909). The test errors of PLS, linear, and lasso are fairly similary to one another, while PCR performs significantly worse and Ridge performs a little better. To say how accurately we can predict the number of applications received, let’s compute the ridge regression R-square value.

TOTALSUMOFSQUARES = sum((mean(College.test$Apps) - College.test$Apps)^2)
TOTALSUMOFRESIDUALS = sum((ridge.pred - College.test$Apps)^2)
1 - (TOTALSUMOFRESIDUALS)/(TOTALSUMOFSQUARES)

## [1] 0.9240954

The R-squared for my best model (ridge regression) tells us I can explain 92.4% of the variance in Apps using this model. This is strong accurate predictive power.

Problem 11

We will now try to predict per capita crime rate in the Boston data set.

detach(College)
library(MASS)
attach(Boston)

(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

I will try out three of these four regression methods (best subset selection, lasso, ridge regression). First, I need to split the data into train and test subsets.

set.seed(1)
train=sample(c(TRUE,FALSE),nrow(Boston),rep=TRUE)
test=(!train)

Boston.train = Boston[train,]
Boston.test = Boston[test,]

Best Subset Selection

library(leaps)
#Perform BSS for up to p=13
regfit.full = regsubsets(crim~.,data=Boston.train,nvmax=13)
reg.summary = summary(regfit.full)

#Make plots of performance statistics to determine best subset
par(mfrow=c(2,2))
plot(reg.summary$rss,xlab="Number of Variables",ylab="RSS",type="l")

plot(reg.summary$adjr2,xlab="Number of Variables",ylab="Adjusted RSq",type="l")
p = which.max(reg.summary$adjr2)
points(p,reg.summary$adjr2[p], col="red",cex=2,pch=20)

plot(reg.summary$cp,xlab="Number of Variables",ylab="Cp",type='l')
p = which.min(reg.summary$cp)
points(p,reg.summary$cp[p],col="red",cex=2,pch=20)

plot(reg.summary$bic,xlab="Number of Variables",ylab="BIC",type='l')
p = which.min(reg.summary$bic)
points(p,reg.summary$bic[p],col="red",cex=2,pch=20)

It appears both p=3 and p=7 could be good options, so I will fit models using both suggested subsets on test dataset.

summary(regfit.full)

## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = Boston.train, nvmax = 13)
## 13 Variables  (and intercept)
##         Forced in Forced out
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## black       FALSE      FALSE
## lstat       FALSE      FALSE
## medv        FALSE      FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
##           zn  indus chas nox rm  age dis rad tax ptratio black lstat medv
## 1  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   " "   " " 
## 2  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     "*"   " "   " " 
## 3  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     "*"   "*"   " " 
## 4  ( 1 )  " " "*"   " "  " " " " " " " " "*" " " " "     "*"   "*"   " " 
## 5  ( 1 )  "*" " "   " "  " " " " " " "*" "*" " " " "     "*"   " "   "*" 
## 6  ( 1 )  "*" "*"   " "  " " " " " " "*" "*" " " " "     "*"   " "   "*" 
## 7  ( 1 )  "*" "*"   " "  " " " " " " "*" "*" " " " "     "*"   "*"   "*" 
## 8  ( 1 )  "*" "*"   " "  "*" " " " " "*" "*" " " " "     "*"   "*"   "*" 
## 9  ( 1 )  "*" "*"   " "  "*" " " " " "*" "*" " " "*"     "*"   "*"   "*" 
## 10  ( 1 ) "*" "*"   " "  "*" " " "*" "*" "*" " " "*"     "*"   "*"   "*" 
## 11  ( 1 ) "*" "*"   " "  "*" "*" "*" "*" "*" " " "*"     "*"   "*"   "*" 
## 12  ( 1 ) "*" "*"   " "  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"   "*" 
## 13  ( 1 ) "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"   "*"

#Compute test error of null model for comparison's sake
mean((mean(Boston.test$crim)-Boston.test$crim)^2)

## [1] 83.74891

lm.fit = lm(crim~rad+black+lstat, data=Boston.train)
lm.pred = predict(lm.fit, Boston.test, type="response")
mean((lm.pred-Boston.test$crim)^2)

## [1] 52.66064

The test error of the linear model fit with the 3 best predictors is 52.66.

lm.fit = lm(crim~zn+indus+dis+rad+black+lstat+medv, data=Boston.train)
lm.pred = predict(lm.fit, Boston.test, type="response")
mean((lm.pred-Boston.test$crim)^2)

## [1] 51.04084

The test error of the linear model fit with the 7 best predictors is 51.04. This is only slightly lower than the model with the 3 best predictors, so I would argue the additional performance is insufficient for the added complexity of 4 more variables into the model.

Lasso

set.seed(1)
#Set up matrices needed for the glmnet functions
train.mat = model.matrix(crim~., data = Boston.train)
test.mat = model.matrix(crim~., data = Boston.test)
#Choose lambda using cross-validation
cv.out = cv.glmnet(train.mat,Boston.train$crim,alpha=1)
bestlam = cv.out$lambda.min
bestlam

## [1] 0.01973085

#Fit lasso model
lasso.mod = glmnet(train.mat,Boston.train$crim,alpha=1)
#Make predictions
lasso.pred = predict(lasso.mod,s=bestlam,newx=test.mat)
mean((lasso.pred - Boston.test$crim)^2)

## [1] 50.7379

The test error of the lasso model fit with a lambda chosen by cross-validation is 50.73, which is lower than the linear model using best subsets. We can look at which variables were not shrunk to zero (AKA which would be included in the model):

lasso.coef=predict(lasso.mod,type="coefficients",s=bestlam)[1:15,]
lasso.coef[lasso.coef!=0]

## (Intercept)          zn       indus        chas         nox          rm 
## 15.96344350  0.04355602 -0.10751503 -0.14897948 -6.30730423 -0.24512633 
##         age         dis         rad     ptratio       black       lstat 
##  0.01130326 -0.74018280  0.47687429 -0.15621440 -0.01470752  0.11414439 
##        medv 
## -0.11377842

Ridge Regression

#Choose lambda using cross-validation
cv.out = cv.glmnet(train.mat,Boston.train$crim,alpha=0)
bestlam = cv.out$lambda.min
bestlam

## [1] 0.5240686

#Fit a ridge regression
ridge.mod = glmnet(train.mat,Boston.train$crim,alpha = 0)
#Make predictions
ridge.pred = predict(ridge.mod,s=bestlam,newx = test.mat)
#Calculate test error
mean((ridge.pred - Boston.test$crim)^2)

## [1] 51.46284

The test error of the ridge regression fit with a lambda chosen by cross-validation is 51.46, higher than the LASSO test error.

(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, crossvalidation, or some other reasonable alternative, as opposed to using training error.

Based on the above information in part (a), I would propose that the best model is the linear model with the best subset of 3 predictors. As an analyst I know this is part art, and for me I highly value simplicity and interpretability in models and always think critically about if the additional predictive power of more complex models is worth it. While the Lasso and Ridge models I show above indeed have less test error than the 3 variable linear model, the Lasso contains 12 predictors and the Ridge contains 13. The reduction in test error is so minimal compared to introducing 9 to 10 additional predictors, I cannot justify it.

(c) Does your chosen model involve all of the features in the data set? Why or why not?

No it does not involve all features. It is a Best Subset Selection approach. As previously mentioned, it includes 3 of the 13 possible predictors.