Problem 2

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

(a) The lasso, relative to least squares, is:

i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

False

ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

False

iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

True - because lasso regression is still a linear model, it is less flexible. An increase is bias is traded off for a greater absolute decrease in variance.

iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

False

(b) Repeat (a) for ridge regression relative to least squares.

Ridge regression is less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. Ridge regression is like lasso regression in the sense that increased bias is willingly traded off for a greater decrease in variance.

(c) Repeat (a) for non-linear methods relative to least squares.

Non-linear methods are more flexible and hence will give improved prediction accuracy when its increase in variance is traded for a decrease in bias.

Problem 9

In this exercise, we will predict the number of applications received using the other variables in the College data set.

library(ISLR)
sum(is.na(College))
## [1] 0
dim(College)
## [1] 777  18
summary(College)
##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate     
##  Min.   : 10.00  
##  1st Qu.: 53.00  
##  Median : 65.00  
##  Mean   : 65.46  
##  3rd Qu.: 78.00  
##  Max.   :118.00

(a) Split the data set into a training set and a test set.

set.seed(1)
train <- sample(777, 388)
test <- -train
College.train <- College[train, ]
College.test <- College[test,]

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

The MSE is 1,135,758.

lm.fit <- lm(Apps ~ ., data = College.train)
lm.pred <- predict(lm.fit, College.test)
mean((College.test[, "Apps"] - lm.pred)^2)
## [1] 1135758

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

The MSE for the ridge regression was a small improvement at 1,135,714.

library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.0-2
set.seed(1)
x_train <- model.matrix(Apps~., data = College.train)[,-1]
y_train <- College.train$Apps
x_test <- model.matrix(Apps~., data = College.test)[,-1]
y_test <- College.test$Apps
grid <- 10^seq(10, -2, length = 100)
ridge.fit <- cv.glmnet(x_train, y_train, alpha = 0, lambda = grid, thresh = 1e-12)
plot(ridge.fit)

bestlam <- ridge.fit$lambda.min
bestlam
## [1] 0.01
ridge.pred <- predict(ridge.fit, newx = x_test, s = bestlam)
mean((y_test - ridge.pred)^2)
## [1] 1135714

(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

The test MSE was slightly lower than the ridge regression at 1,135,659. All predictors had nonzero coefficients. This is consistent with a minimum \(\lambda\) which was close to 0 and would result in a model not much different than the linear regression model.

set.seed(1)
lasso.fit <- cv.glmnet(x_train, y_train, alpha = 1, lambda = grid, thresh = 1e-12)
plot(lasso.fit)

bestlam <- lasso.fit$lambda.min
bestlam
## [1] 0.01
lasso.pred <- predict(lasso.fit, newx = x_test, s = bestlam)
mean((y_test - lasso.pred)^2)
## [1] 1135659
x <- model.matrix(Apps~., data = College)[,-1]
y <- College$Apps
mod.lasso <- glmnet(x, y, alpha = 1, lambda = grid)
lasso.coeff <- predict(mod.lasso, s = bestlam, type = "coefficients")[1:18,]
lasso.coeff[lasso.coeff != 0]
##   (Intercept)    PrivateYes        Accept        Enroll     Top10perc 
## -448.83141930 -494.15009341    1.58411981   -0.87022010   49.83428099 
##     Top25perc   F.Undergrad   P.Undergrad      Outstate    Room.Board 
##  -14.16510449    0.05618125    0.04447685   -0.08571647    0.15130927 
##         Books      Personal           PhD      Terminal     S.F.Ratio 
##    0.02075742    0.03108262   -8.67588751   -3.33162029   15.41268430 
##   perc.alumni        Expend     Grad.Rate 
##    0.13965876    0.07790164    8.66427451

(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

The test MSE was 1,937,465 using 7 components.

library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
set.seed(1)
pcr.fit <- pcr(Apps ~ ., data = College.train, scale = T, validation = "CV")
validationplot(pcr.fit, val.type = "MSEP")

pcr.pred <- predict(pcr.fit, College.test, ncomp = 7)
mean((pcr.pred - y_test)^2)
## [1] 1937465

(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

The MSE was 1,066,991 using 6 components.

set.seed(1)
pls.fit <- plsr(Apps ~ ., data = College.train, scale = T, validation = "CV")
validationplot(pls.fit, val.type = "MSEP")

pls.pred = predict(pls.fit, College.test, ncomp = 6)
mean((pls.pred - y_test)^2)
## [1] 1066991

(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

The PLS model had the lowest MSE at 1,066,991. The ridge and lasso regression models showed minimal improvement in the MSE while the PCR model had a higher MSE than the base linear regression model. An estimated average accuracy rate can be used to compare the models by the following formula: 1 - \(\frac{|Mean Error|}{\overline{y}_{test}}\). As expected, the PLS model had the highest accuracy estimate at .776 while having the lowest MSE. The PCR model had the lowest accuracy estimate, which was lower than the base linear regression model. Both the ridge regression and lasso regression models showed minimal improvement over the base linear regression model.

# Base Regression Accuracy Estimate
1 - mean(abs(y_test - lm.pred)) / mean(y_test)
## [1] 0.7661462
# Ridge Regression Accuracy Estimate
1 - mean(abs(y_test - ridge.pred)) / mean(y_test)
## [1] 0.7661508
# Lasso Regression Accuracy Estimate
1 - mean(abs(y_test - lasso.pred)) / mean(y_test)
## [1] 0.7661595
# PCR Accuracy Estimate
1 - mean(abs(pcr.pred - y_test)) / mean(y_test)
## [1] 0.6673805
# PLS Accuracy Estimate
1 - mean(abs(pls.pred - y_test)) / mean(y_test)
## [1] 0.7757627

Problem 11

We will now try to predict per capita crime rate in the Boston data set.

set.seed(1)
library(MASS)
library(leaps)
library(glmnet)

(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

Best Subset Selection

sum(is.na(Boston))
## [1] 0
dim(Boston)
## [1] 506  14
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
predict.regsubsets <- function(object, newdata, id, ...) {
    form <- as.formula(object$call[[2]])
    mat <- model.matrix(form, newdata)
    coefi <- coef(object, id = id)
    mat[, names(coefi)] %*% coefi
}

k <- 10
p <- ncol(Boston) - 1
folds <- sample(rep(1:k, length = nrow(Boston)))
cv.errors <- matrix(NA, k, p)
for (i in 1:k) {
    best.fit <- regsubsets(crim ~ ., data = Boston[folds != i, ], nvmax = p)
    for (j in 1:p) {
        pred <- predict(best.fit, Boston[folds == i, ], id = j)
        cv.errors[i, j] <- mean((Boston$crim[folds == i] - pred)^2)
    }
}

mse.cv <- apply(cv.errors, 2, mean)
plot(mse.cv, pch = 19, type =  "b")

which.min(mse.cv)
## [1] 9
mse.cv[which.min(mse.cv)]
## [1] 42.81453
bestfit.full <- regsubsets(crim ~., data = Boston, nvmax = 13)
coef(bestfit.full, 9)
##   (Intercept)            zn         indus           nox           dis 
##  19.124636156   0.042788127  -0.099385948 -10.466490364  -1.002597606 
##           rad       ptratio         black         lstat          medv 
##   0.539503547  -0.270835584  -0.008003761   0.117805932  -0.180593877

Lasso Regression

set.seed(1)
train <- sample(506, 253)
test <- -train
Boston.train <- Boston[train, ]
Boston.test <- Boston[test,]
x_train <- model.matrix(crim~., data = Boston.train)[,-1]
y_train <- Boston.train$crim
x_test <- model.matrix(crim~., data = Boston.test)[,-1]
y_test <- Boston.test$crim
grid <- 10^seq(10, -2, length = 100)
lasso.fit <- cv.glmnet(x_train, y_train, alpha = 1, lambda = grid, thresh = 1e-12)
plot(lasso.fit)

bestlam <- lasso.fit$lambda.min
bestlam
## [1] 0.07054802
lasso.pred <- predict(lasso.fit, newx = x_test, s = bestlam)
mean((y_test - lasso.pred)^2)
## [1] 40.88312
x <- model.matrix(crim ~., data = Boston)[,-1]
y <- Boston$crim
mod.lasso <- glmnet(x, y, alpha = 1, lambda = grid)
lasso.coeff <- predict(mod.lasso, s = bestlam, type = "coefficients")[1:14,]
lasso.coeff[lasso.coeff != 0]
##  (Intercept)           zn        indus         chas          nox           rm 
## 11.377041716  0.034373111 -0.062605969 -0.555023861 -5.721122105  0.151927338 
##          dis          rad      ptratio        black        lstat         medv 
## -0.715085799  0.506621375 -0.156825685 -0.007550409  0.122010397 -0.145585637

Ridge Regression

ridge.fit <- cv.glmnet(x_train, y_train, alpha = 0, lambda = grid, thresh = 1e-12)
plot(ridge.fit)

bestlam <- ridge.fit$lambda.min
bestlam
## [1] 0.3764936
ridge.pred <- predict(ridge.fit, newx = x_test, s = bestlam)
mean((y_test - ridge.pred)^2)
## [1] 41.03422
ridgefull.fit <- glmnet(x, y, alpha = 0, lambda = grid)
ridge.coeff <- predict(ridgefull.fit, s = bestlam, type = "coefficients")[1:14,]
ridge.coeff
##  (Intercept)           zn        indus         chas          nox           rm 
## 10.684582993  0.035254538 -0.083043901 -0.729909760 -6.473100982  0.364745896 
##          age          dis          rad          tax      ptratio        black 
##  0.001530985 -0.766449784  0.454002431  0.002246733 -0.165813717 -0.008294454 
##        lstat         medv 
##  0.141537228 -0.151745918

PCR

library(pls)
set.seed(1)
pcr.fit <- pcr(crim ~ ., data = Boston.train, scale = T, validation = "CV")
validationplot(pcr.fit, val.type = "MSEP")

pcr.pred <- predict(pcr.fit, Boston.test, ncomp = 8)
mean((pcr.pred - y_test)^2)
## [1] 43.21097

(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.

The model with the lowest MSE was the lasso regression at 40.88312. The model can be expressed as:

\(crim = 11.377+0.034zn-0.063indus-0.555chas-5.721nox+0.152rm-0.715dis+0.507rad\) \(-0.157ptratio-0.008black+0.122lstat-0.146medv\)

(c) Does your chosen model involve all of the features in the data set? Why or why not?

The model did not include all of the features in the dataset because the minimum lambda caused the coefficient estimates of age and tax to equal 0 in the lasso regression model.