Chapter 06

Problem 2

For parts (a) through (c), indicate which of i. through iv. is correct.

(a) The lasso, relative to least squares, is:

  iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

(b) Repeat (a) for ridge regression relative to least squares.

  iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

(c) Repeat (a) for non-linear methods relative to least squares.

  ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
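
The trade-off behind these answers can be demonstrated with a quick simulation; this is a minimal sketch (the sparse-coefficient setup and all object names are my own, not part of the exercise) in which the lasso's added bias is outweighed by its variance reduction:

library(glmnet)
set.seed(1)
n <- 100; p <- 50
beta <- c(rep(2, 5), rep(0, p - 5))  # only 5 of 50 predictors truly matter
x <- matrix(rnorm(n * p), n, p)
y <- drop(x %*% beta) + rnorm(n, sd = 3)
x_new <- matrix(rnorm(n * p), n, p)  # independent test set from the same model
y_new <- drop(x_new %*% beta) + rnorm(n, sd = 3)
fit_ols <- lm(y ~ x)
mean((y_new - cbind(1, x_new) %*% coef(fit_ols))^2)  # OLS test MSE
fit_las <- cv.glmnet(x, y, alpha = 1)
mean((y_new - predict(fit_las, x_new, s = "lambda.min"))^2)  # lasso test MSE, typically lower here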

Problem 9

In this exercise, we will predict the number of applications received using the other variables in the College data set.

library(ISLR)
attach(College)
set.seed(42)
sum(is.na(College))
## [1] 0

(a) Split the data set into a training set and a test set.

train.size <- floor(dim(College)[1] / 2)  # 388 of 777 observations for training
set.seed(42)
train <- sample(1:dim(College)[1], train.size)
test <- -train
C_train <- College[train, ]
C_test <- College[test, ]

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

fit_lm <- lm(Apps~., data=C_train)
pred_lm <- predict(fit_lm, C_test)
test_error <- mean((C_test[, "Apps"] - pred_lm)^2)
test_error
## [1] 1341776

The test MSE is \(1341776\).

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

library(glmnet)
matrix_train <- model.matrix(Apps~., data=C_train)
matrix_test <- model.matrix(Apps~., data=C_test)
grid <- 10 ^ seq(4, -2, length=100)
ridge_re <- cv.glmnet(matrix_train, C_train[, "Apps"], alpha=0, lambda=grid, thresh=1e-12)
best_lmba <- ridge_re$lambda.min
best_lmba
## [1] 43.28761
pred_ridge_re <- predict(ridge_re, newx=matrix_test, s=best_lmba)
mean((C_test[, "Apps"] - pred_ridge_re)^2)
## [1] 1462449

The best λ is ~\(43.3\).

The ridge test MSE of \(1462449\) is higher than the OLS test MSE of \(1341776\).
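
To see the shrinkage at work, the full coefficient path stored inside the cv.glmnet object can be plotted; a minimal sketch:

# Ridge coefficient paths: all coefficients shrink smoothly toward zero as
# lambda grows, but none is set exactly to zero.
plot(ridge_re$glmnet.fit, xvar = "lambda", label = TRUE)
abline(v = log(best_lmba), lty = 2)  # the cross-validated choice of lambda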

(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

lasso_1 <- cv.glmnet(matrix_train, C_train[, "Apps"], alpha=1, lambda=grid, thresh=1e-12)
best_lmba <- lasso_1$lambda.min
best_lmba
## [1] 21.54435
pred_lasso <- predict(lasso_1, newx=matrix_test, s=best_lmba)
mean((C_test[, "Apps"] - pred_lasso)^2)
## [1] 1410193

The best λ is ~\(21.5\).

The lasso test MSE of \(1410193\) is also higher than the OLS test MSE of \(1341776\), though lower than ridge.

The lasso zeroes out F.Undergrad and Books, leaving 15 non-zero coefficient estimates (plus the intercept):

lasso_2 <- glmnet(model.matrix(Apps~., data=College), College[, "Apps"], alpha=1)
predict(lasso_2, s=best_lmba, type="coefficients")
## 19 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -6.038452e+02
## (Intercept)  .           
## PrivateYes  -4.235413e+02
## Accept       1.455236e+00
## Enroll      -2.003696e-01
## Top10perc    3.367640e+01
## Top25perc   -2.403036e+00
## F.Undergrad  .           
## P.Undergrad  2.086035e-02
## Outstate    -5.781855e-02
## Room.Board   1.246462e-01
## Books        .           
## Personal     1.832910e-05
## PhD         -5.601313e+00
## Terminal    -3.313824e+00
## S.F.Ratio    4.478684e+00
## perc.alumni -9.796600e-01
## Expend       6.967693e-02
## Grad.Rate    5.159652e+00
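
Since the exercise also asks for the number of non-zero estimates, they can be counted directly from the sparse coefficient matrix; a small sketch:

lasso_coefs <- predict(lasso_2, s = best_lmba, type = "coefficients")
sum(lasso_coefs != 0) - 1  # subtract the (non-zero) intercept: 15 non-zero predictors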

(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

library(pls)
prin_comp_re <- pcr(Apps~., data=C_train, scale=TRUE, validation="CV")
validationplot(prin_comp_re, val.type="MSEP")

pred_prin_comp_re <- predict(prin_comp_re, C_test, ncomp=10)
mean((C_test[, "Apps"] - (pred_prin_comp_re))^2)
## [1] 2802274

I compared the cross-validation results for M = 8, 9, 10, and 11 and settled on M = 10. The test MSE is \(2802274\), noticeably worse than the approaches above.
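
Rather than eyeballing the validation plot, the CV-minimizing M can be read off pls's MSEP() accessor; a sketch, assuming the fit kept all 17 components:

cv_msep <- MSEP(prin_comp_re)$val["CV", 1, ]  # CV MSEP for 0 to 17 components
which.min(cv_msep) - 1  # subtract 1 because the first entry is the intercept-only model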

(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

fit_plsr <- plsr(Apps~., data=C_train, scale=TRUE, validation="CV")
validationplot(fit_plsr, val.type="MSEP")

pred_plsr<- predict(fit_plsr, C_test, ncomp=9)
mean((C_test[, "Apps"] - (pred_plsr))^2)
## [1] 1345885

Again I compared the cross-validation results for M = 7, 8, 9, 10, and 11 and settled on M = 9. The test MSE is \(1345885\); so far, this is the closest to the OLS value of \(1341776\).

(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

avg_C_test <- mean(C_test[, "Apps"])
test_lm <- 1 - mean((C_test[, "Apps"] - pred_lm)^2) / mean((C_test[, "Apps"] - avg_C_test)^2)
test_ridge_re <- 1 - mean((C_test[, "Apps"] - pred_ridge_re)^2) / mean((C_test[, "Apps"] - avg_C_test)^2)
test_lasso_2 <- 1 - mean((C_test[, "Apps"] - pred_lasso)^2) / mean((C_test[, "Apps"] - avg_C_test)^2)
test_prin_comp_re <- 1 - mean((C_test[, "Apps"] - pred_prin_comp_re)^2) / mean((C_test[, "Apps"] - avg_C_test)^2)
test_pred_plsr <- 1 - mean((C_test[, "Apps"] - pred_plsr)^2) / mean((C_test[, "Apps"] - avg_C_test)^2)
test_lm
## [1] 0.9193214
test_ridge_re
## [1] 0.9120655
test_lasso_2
## [1] 0.9152076
test_prin_comp_re
## [1] 0.8315042
test_pred_plsr
## [1] 0.9190743

All models except PCR, which is lowest at \(0.8315042\), have similar test \(R^2\). Ordinary least squares is highest at \(0.9193214\), with partial least squares a very close second at \(0.9190743\).
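
The repeated R² computation above can be collapsed into a small helper; a minimal sketch (the function name is my own):

# Test R^2: share of variance in held-out Apps explained by a set of predictions
test_r2 <- function(pred) {
  1 - mean((C_test$Apps - pred)^2) / mean((C_test$Apps - avg_C_test)^2)
}
sapply(list(lm = pred_lm, ridge = pred_ridge_re, lasso = pred_lasso,
            pcr = pred_prin_comp_re, pls = pred_plsr), test_r2)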

Problem 11

We will now try to predict per capita crime rate in the Boston data set.

library(MASS)
library(leaps)
attach(Boston)
set.seed(42)

(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

Setting up a 70/30 train/test split to model crim:

set.seed(42)
bo_subset <- sample(nrow(Boston), nrow(Boston) * 0.70)
bo_train <- Boston[bo_subset, ]
bo_test <- Boston[-bo_subset, ]

Linear model:

# Least squares on all predictors, fit on the training set and scored on the held-out test set
full_bo <- lm(crim ~ ., data = bo_train)
pred_bo <- predict(full_bo, bo_test)
bo_MSE <- mean((pred_bo - bo_test$crim)^2)
bo_MSE

Best subset selection:

library(leaps)

# regsubsets() has no predict() method, so define one (the standard ISLR lab helper)
predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id = id)
  xvars <- names(coefi)
  mat[, xvars] %*% coefi
}
# Best subsets of size 1..13 on the held-out portion
regfit.best <- regsubsets(crim ~ ., data = bo_test, nvmax = 13)
coef(regfit.best, 10)
## (Intercept)          zn       indus         nox          rm         age 
## 16.91413432  0.04197017 -0.05710767 -5.28381266 -1.04175054  0.02958012 
##         dis         rad       black       lstat        medv 
## -0.62170744  0.41902465 -0.01374969 -0.06640933 -0.08742920
k <- 10
set.seed(1)
folds <- sample(1:k, nrow(bo_test), replace = TRUE)
cv.errors <- matrix(NA, k, 13, dimnames = list(NULL, paste(1:13)))
# 10-fold cross-validation over subset sizes 1 to 13
for (j in 1:k) {
  best.fit <- regsubsets(crim ~ ., data = bo_test[folds != j, ], nvmax = 13)
  for (i in 1:13) {
    pred <- predict(best.fit, bo_test[folds == j, ], id = i)
    cv.errors[j, i] <- mean((bo_test$crim[folds == j] - pred)^2)
  }
}
mean.cv.errors <- apply(cv.errors, 2, mean)
par(mfrow = c(1, 1))
plot(mean.cv.errors, xlab = "Number of predictors", ylab = "CV MSE", pch = 25, type = "b", col = "purple")

mean.cv.errors[which.min(mean.cv.errors)]
##        5 
## 26.16191
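
Cross-validation favours the five-predictor model; its coefficients can then be inspected on the same fit, e.g.:

coef(regfit.best, 5)  # coefficients of the CV-selected five-predictor subset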

Lasso:

set.seed(42)
x <- model.matrix(crim ~ . - 1, data = Boston)
y <- Boston$crim
bo_lasso_2 <- cv.glmnet(x,y,alpha=1)
plot(bo_lasso_2)

coef(bo_lasso_2)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                     1
## (Intercept) 1.0894283
## zn          .        
## indus       .        
## chas        .        
## nox         .        
## rm          .        
## age         .        
## dis         .        
## rad         0.2643196
## tax         .        
## ptratio     .        
## black       .        
## lstat       .        
## medv        .
bo_lasso_2$cvm[bo_lasso_2$lambda == bo_lasso_2$lambda.1se]
## [1] 55.14532
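
coef() on a cv.glmnet object defaults to the conservative lambda.1se, which is why only rad survives above; the CV-minimizing lambda.min usually retains more predictors at a lower CV error. A quick check:

coef(bo_lasso_2, s = "lambda.min")  # typically keeps more predictors than lambda.1se
bo_lasso_2$cvm[bo_lasso_2$lambda == bo_lasso_2$lambda.min]  # CV MSE at lambda.min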

Ridge Regression:

x <- model.matrix(crim ~ . - 1, data = Boston)
y <- Boston$crim
bo_ridge_re <- cv.glmnet(x, y, type.measure = "mse", alpha = 0)
plot(bo_ridge_re)

coef(bo_ridge_re)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept)  1.523899548
## zn          -0.002949852
## indus        0.029276741
## chas        -0.166526006
## nox          1.874769661
## rm          -0.142852604
## age          0.006207995
## dis         -0.094547258
## rad          0.045932737
## tax          0.002086668
## ptratio      0.071258052
## black       -0.002605281
## lstat        0.035745604
## medv        -0.023480540
bo_ridge_re$cvm[bo_ridge_re$lambda == bo_ridge_re$lambda.1se]
## [1] 58.69602

PCR:

prin_comp_re_2 <- pcr(crim ~ ., data = Boston, scale = TRUE, validation = "CV")
summary(prin_comp_re_2)
## Data:    X dimension: 506 13 
##  Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            8.61    7.219    7.210    6.814    6.802    6.803    6.811
## adjCV         8.61    7.215    7.207    6.807    6.795    6.796    6.803
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       6.802    6.679    6.689     6.683     6.683     6.645     6.569
## adjCV    6.794    6.670    6.681     6.672     6.674     6.633     6.557
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       47.70    60.36    69.67    76.45    82.99    88.00    91.14    93.45
## crim    30.69    30.87    39.27    39.61    39.61    39.86    40.14    42.47
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       95.40     97.04     98.46     99.52     100.0
## crim    42.55     42.78     43.04     44.13      45.4
validationplot(prin_comp_re_2,val.type="MSEP")

bo_pred_pcr <- predict(prin_comp_re_2, bo_test, ncomp = 5)
bo_prc_MSE <- mean((bo_test$crim - bo_pred_pcr)^2)
bo_prc_MSE
## [1] 22.79465
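
Note that prin_comp_re_2 was fit on the full Boston data, so the rows of bo_test also influenced the fit; the same caveat applies to the PLS fit below. A stricter sketch would refit on bo_train only:

prc_train <- pcr(crim ~ ., data = bo_train, scale = TRUE, validation = "CV")
mean((bo_test$crim - predict(prc_train, bo_test, ncomp = 5))^2)  # test MSE untouched by the fit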

Partial Least Squares:

bo_plsr <- plsr(crim~., data=Boston, scale=TRUE, validation="CV")
validationplot(bo_plsr, val.type="MSEP")

bo_pred_plsr <- predict(bo_plsr, bo_test, ncomp = 5)
bo_plsr_MSE <- mean((bo_test$crim - bo_pred_plsr)^2)
bo_plsr_MSE
## [1] 22.29383
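
pls also provides selectNcomp(), which automates the choice of M via a one-sigma or randomization heuristic; a sketch, assuming the package defaults:

# Smallest M whose CV error is within one standard error of the minimum
selectNcomp(bo_plsr, method = "onesigma", plot = TRUE)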

(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, crossvalidation, or some other reasonable alternative, as opposed to using training error.

Full Model (test MSE): \(22.24\)

Best Subset (10-fold CV MSE, 5 predictors): \(26.16\)

Lasso (CV MSE at lambda.1se): \(55.15\)

Ridge Regression (CV MSE at lambda.1se): \(58.70\)

Principal Component Regression (test MSE, M = 5): \(22.79\)

Partial Least Squares (test MSE, M = 5): \(22.29\)

With this seed, PLS comes closest to the full linear model. Note that the lasso and ridge figures are cross-validated errors at the conservative lambda.1se, so they are not directly comparable to the test MSEs of the other methods.

(c) Does your chosen model involve all of the features in the data set? Why or why not?

I would choose the five-component partial least squares model: its test MSE is essentially tied with the full linear model and better than PCR and best subset, while compressing the predictors into just five directions. Though the predictors have been transformed into components, PLS still draws on all of the original variables: “The PLS approach attempts to find directions that help explain both the response and the predictors.”

detach(Boston)