Question 2

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

a. The lasso, relative to least squares, is:

     i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
     ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
     iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
     iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

Answer choice iii. is correct because of the bias-variance trade-off. The lasso is less flexible than least squares: it shrinks the coefficient estimates toward zero and can set some of them exactly to zero, removing those variables from the model. The shrinkage reduces variance at the cost of added bias, so prediction accuracy improves when the increase in bias is less than the decrease in variance.
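
To make the trade-off concrete, here is a minimal illustrative simulation (the data-generating process is invented for illustration, not taken from the exercise). With many irrelevant predictors, the lasso's reduction in variance typically outweighs its added bias, so it beats least squares on fresh test data:

library(glmnet)
set.seed(1)
n = 100; p = 50
x = matrix(rnorm(n * p), n, p)
beta = c(rep(1, 5), rep(0, p - 5))   # only the first 5 predictors matter
y = x %*% beta + rnorm(n)
x.new = matrix(rnorm(n * p), n, p)   # independent test data
y.new = x.new %*% beta + rnorm(n)
fit.ols = lm(y ~ x)
mse.ols = mean((cbind(1, x.new) %*% coef(fit.ols) - y.new)^2)
cv.fit = cv.glmnet(x, y, alpha = 1)  # lasso with lambda chosen by CV
mse.lasso = mean((predict(cv.fit, x.new, s = "lambda.min") - y.new)^2)
c(OLS = mse.ols, lasso = mse.lasso)  # the lasso's test MSE is typically lower here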

b. Repeat (a) for ridge regression relative to least squares.

Answer choice iii. is correct again, by the same bias-variance reasoning. Ridge regression also shrinks the coefficient estimates and is therefore less flexible than least squares; unlike the lasso, it never sets a coefficient exactly to zero, but the trade-off works the same way.
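
Continuing the sketch above, the same comparison with alpha = 0 fits ridge regression instead; with many weak predictors it, too, usually improves on least squares:

cv.ridge.sim = cv.glmnet(x, y, alpha = 0)  # ridge on the simulated data from part (a)
mean((predict(cv.ridge.sim, x.new, s = "lambda.min") - y.new)^2)  # compare to mse.ols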

c. Repeat (a) for non-linear methods relative to least squares.

Answer choice ii. is correct because non-linear methods are more flexible than least squares: they are not constrained to fit a straight line. That flexibility lowers bias when the true relationship is non-linear, but it comes with an increase in variance, so prediction accuracy improves only when the increase in variance is less than the decrease in bias.
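
A hedged sketch of the same point for (c), again with an invented data-generating process: when the truth is curved, a cubic polynomial fit (standing in for a generic non-linear method) accepts a little extra variance in exchange for a large drop in bias relative to a straight line:

set.seed(1)
x = runif(200, -2, 2)
y = sin(2 * x) + rnorm(200, sd = 0.3)  # curved truth plus noise
train.id = sample(200, 100)
fit.line = lm(y ~ x, subset = train.id)
fit.cubic = lm(y ~ poly(x, 3), subset = train.id)
newdat = data.frame(x = x[-train.id])
mean((y[-train.id] - predict(fit.line, newdat))^2)   # straight line: high bias
mean((y[-train.id] - predict(fit.cubic, newdat))^2)  # flexible fit: lower test MSE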

Question 9

In this exercise, we will predict the number of applications received using the other variables in the College data set.

a. Split the data set into a training set and a test set.

library(ISLR)
data(College)
set.seed(1)
# Randomly assign half of the rows to the training set; the rest form the test set
train = sample(1:dim(College)[1], dim(College)[1] / 2)
College.train = College[train, ]
College.test = College[-train, ]

b. Fit a linear model using least squares on the training set, and report the test error obtained.

fit.lm = lm(Apps ~ ., data = College.train)
pred.lm = predict(fit.lm, College.test)
mean((pred.lm - College.test$Apps)^2)  # test MSE
## [1] 1135758

c. Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

# Build design matrices for glmnet and a grid of lambda values from 10^4 down to 10^-2
train.mat = model.matrix(Apps ~ ., data = College.train)
test.mat = model.matrix(Apps ~ ., data = College.test)
grid = 10 ^ seq(4, -2, length = 100)
library(glmnet)
# alpha = 0 gives the ridge penalty; cv.glmnet selects lambda from the grid
fit.ridge = glmnet(train.mat, College.train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
cv.ridge = cv.glmnet(train.mat, College.train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
bestlam.ridge = cv.ridge$lambda.min
bestlam.ridge
## [1] 0.01
pred.ridge = predict(fit.ridge, s = bestlam.ridge, newx = test.mat)
mean((pred.ridge - College.test$Apps)^2)
## [1] 1135714

Note that the selected λ = 0.01 is the smallest value in the grid, so cross-validation is choosing almost no shrinkage; the ridge fit, and therefore its test MSE, is nearly identical to least squares.

d. Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

# alpha = 1 gives the lasso penalty
fit.lasso = glmnet(train.mat, College.train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
cv.lasso = cv.glmnet(train.mat, College.train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
bestlam.lasso = cv.lasso$lambda.min
bestlam.lasso
## [1] 0.01
pred.lasso = predict(fit.lasso, s = bestlam.lasso, newx = test.mat)
mean((pred.lasso - College.test$Apps)^2)
## [1] 1135660
predict(fit.lasso, s = bestlam.lasso, type = "coefficients")
## 19 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -7.900363e+02
## (Intercept)  .           
## PrivateYes  -3.070103e+02
## Accept       1.779328e+00
## Enroll      -1.469508e+00
## Top10perc    6.672214e+01
## Top25perc   -2.230442e+01
## F.Undergrad  9.258974e-02
## P.Undergrad  9.408838e-03
## Outstate    -1.083495e-01
## Room.Board   2.115147e-01
## Books        2.912105e-01
## Personal     6.120406e-03
## PhD         -1.547200e+01
## Terminal     6.409503e+00
## S.F.Ratio    2.282638e+01
## perc.alumni  1.130498e+00
## Expend       4.856697e-02
## Grad.Rate    7.488081e+00

At the selected λ = 0.01, the lasso sets none of the coefficients exactly to zero: all 17 predictors have non-zero estimates. (The duplicated intercept row appears because the model.matrix passed to glmnet already contains an intercept column.)

e. Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

library(pls)
fit.pcr = pcr(Apps ~ ., data = College.train, scale = TRUE, validation = "CV")
validationplot(fit.pcr, val.type = "MSEP")

The test error is computed with M = 10 components, the value suggested by the cross-validation plot:

pred.pcr = predict(fit.pcr, College.test, ncomp = 10)
mean((pred.pcr - College.test$Apps)^2)
## [1] 1723100

f. Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

fit.pls = plsr(Apps ~ ., data = College.train, scale = TRUE, validation = "CV")
validationplot(fit.pls, val.type = "MSEP")

Again M = 10 components are used, per the cross-validation plot:

pred.pls = predict(fit.pls, College.test, ncomp = 10)
mean((pred.pls - College.test$Apps)^2)
## [1] 1131661

g. Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

test.avg = mean(College.test$Apps)
# Test R^2: 1 - (model's test MSE) / (MSE of always predicting the test-set mean)
lm.r2 = 1 - mean((pred.lm - College.test$Apps)^2) / mean((test.avg - College.test$Apps)^2)
lm.r2
## [1] 0.9015413
ridge.r2 = 1 - mean((pred.ridge - College.test$Apps)^2) / mean((test.avg - College.test$Apps)^2)
ridge.r2
## [1] 0.9015452
lasso.r2 = 1 - mean((pred.lasso - College.test$Apps)^2) / mean((test.avg - College.test$Apps)^2)
lasso.r2
## [1] 0.9015499
pcr.r2 = 1 - mean((pred.pcr - College.test$Apps)^2) / mean((test.avg - College.test$Apps)^2)
pcr.r2
## [1] 0.8506248
pls.r2 = 1 - mean((pred.pls - College.test$Apps)^2) / mean((test.avg - College.test$Apps)^2)
pls.r2
## [1] 0.9018965

All of the models predict the number of applications quite accurately. The lowest test R² is 0.8506, for the PCR model; the other four are very close together, with test R² around 0.902, meaning the least squares, ridge, lasso, and PLS models each explain about 90.2% of the variance in the number of applications received in the test set. Apart from PCR, there is little difference among the test errors of these approaches.

Question 11

We will now try to predict per capita crime rate in the Boston data set.

a. Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

library(MASS)
data(Boston)
set.seed(1)
library(leaps)
# Logical train/test split, then best subset selection on the training half
train = sample(c(TRUE, FALSE), nrow(Boston), replace = TRUE)
test = (!train)
regfit.best = regsubsets(crim ~ ., data = Boston[train, ], nvmax = 13)
test.mat = model.matrix(crim ~ ., data = Boston[test, ])
val.errors = rep(NA, 13)
for (i in 1:13) {
    coefi = coef(regfit.best, id = i)
    pred = test.mat[, names(coefi)] %*% coefi
    val.errors[i] = mean((Boston$crim[test] - pred)^2)
}
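
The validation-set errors can then be compared across model sizes (output not shown):

which.min(val.errors)
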
summary(regfit.best)
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = Boston[train, ], nvmax = 13)
## 13 Variables  (and intercept)
##         Forced in Forced out
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## black       FALSE      FALSE
## lstat       FALSE      FALSE
## medv        FALSE      FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
##           zn  indus chas nox rm  age dis rad tax ptratio black lstat medv
## 1  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   " "   " " 
## 2  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     "*"   " "   " " 
## 3  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     "*"   "*"   " " 
## 4  ( 1 )  " " "*"   " "  " " " " " " " " "*" " " " "     "*"   "*"   " " 
## 5  ( 1 )  "*" " "   " "  " " " " " " "*" "*" " " " "     "*"   " "   "*" 
## 6  ( 1 )  "*" "*"   " "  " " " " " " "*" "*" " " " "     "*"   " "   "*" 
## 7  ( 1 )  "*" "*"   " "  " " " " " " "*" "*" " " " "     "*"   "*"   "*" 
## 8  ( 1 )  "*" "*"   " "  "*" " " " " "*" "*" " " " "     "*"   "*"   "*" 
## 9  ( 1 )  "*" "*"   " "  "*" " " " " "*" "*" " " "*"     "*"   "*"   "*" 
## 10  ( 1 ) "*" "*"   " "  "*" " " "*" "*" "*" " " "*"     "*"   "*"   "*" 
## 11  ( 1 ) "*" "*"   " "  "*" "*" "*" "*" "*" " " "*"     "*"   "*"   "*" 
## 12  ( 1 ) "*" "*"   " "  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"   "*" 
## 13  ( 1 ) "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"   "*"
# regsubsets has no predict() method, so define one: rebuild the model matrix
# from the formula stored in the call and multiply by the coefficients of model `id`
predict.regsubsets = function(object, newdata, id, ...) {
    form = as.formula(object$call[[2]])
    mat = model.matrix(form, newdata)
    coefi = coef(object, id = id)
    xvars = names(coefi)
    mat[, xvars] %*% coefi
}
regfit.best=regsubsets(crim~.,data=Boston,nvmax=13)
coef(regfit.best,10)
##  (Intercept)           zn        indus          nox           rm          dis 
##  16.38579874   0.04186311  -0.09330371 -10.62174498   0.44828268  -0.99077268 
##          rad      ptratio        black        lstat         medv 
##   0.53566370  -0.26956103  -0.00759461   0.13084608  -0.19800062
k = 10
set.seed(1)
# Assign each observation to one of k folds, then cross-validate every subset size
folds = sample(1:k, nrow(Boston), replace = TRUE)
cv.errors = matrix(NA, k, 13, dimnames = list(NULL, paste(1:13)))
for (j in 1:k) {
    best.fit = regsubsets(crim ~ ., data = Boston[folds != j, ], nvmax = 13)
    for (i in 1:13) {
        pred = predict(best.fit, Boston[folds == j, ], id = i)
        cv.errors[j, i] = mean((Boston$crim[folds == j] - pred)^2)
    }
}

mean.cv.errors = apply(cv.errors, 2, mean)
mean.cv.errors
##        1        2        3        4        5        6        7        8 
## 45.44573 43.87260 43.94979 44.02424 43.96415 43.96199 42.96268 42.66948 
##        9       10       11       12       13 
## 42.53822 42.73416 42.52367 42.46014 42.50125
plot(mean.cv.errors, type = "b", xlab = "Number of variables", ylab = "CV error")
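
To read the minimizer off the CV results directly rather than from the plot:

which.min(mean.cv.errors)  # the 12-variable model attains the smallest CV error, 42.46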

# Full design matrix (dropping the intercept column) and response for glmnet
x = model.matrix(crim ~ ., Boston)[, -1]
y = Boston$crim
cv.out.lasso = cv.glmnet(x, y, alpha = 1, type.measure = "mse")
plot(cv.out.lasso)

cv.out.rr = cv.glmnet(x, y, alpha = 0, type.measure = "mse")
plot(cv.out.rr)

pcr.fit = pcr(crim ~ ., data = Boston, scale = TRUE, validation = "CV")
summary(pcr.fit)
## Data:    X dimension: 506 13 
##  Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            8.61    7.198    7.198    6.786    6.762    6.790    6.821
## adjCV         8.61    7.195    7.195    6.780    6.753    6.784    6.813
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       6.822    6.689    6.712     6.720     6.712     6.664     6.593
## adjCV    6.812    6.679    6.701     6.708     6.700     6.651     6.580
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       47.70    60.36    69.67    76.45    82.99    88.00    91.14    93.45
## crim    30.69    30.87    39.27    39.61    39.61    39.86    40.14    42.47
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       95.40     97.04     98.46     99.52     100.0
## crim    42.55     42.78     43.04     44.13      45.4
validationplot(pcr.fit, val.type = "MSEP")

b. Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.

mean.cv.errors[12]
##       12 
## 42.46014
cv.out.rr
## 
## Call:  cv.glmnet(x = x, y = y, type.measure = "mse", alpha = 0) 
## 
## Measure: Mean-Squared Error 
## 
##     Lambda Index Measure    SE Nonzero
## min   0.54   100   43.48 14.33      13
## 1se  74.44    47   57.69 16.76      13
cv.out.lasso
## 
## Call:  cv.glmnet(x = x, y = y, type.measure = "mse", alpha = 1) 
## 
## Measure: Mean-Squared Error 
## 
##     Lambda Index Measure    SE Nonzero
## min  0.051    51   43.11 14.16      11
## 1se  3.376     6   56.89 17.29       1
# Note: this MSE is computed on the full data set, so it is closer to a
# training error for the M = 5 PCR fit than to a cross-validated estimate
boston.predict.pcr = predict(pcr.fit, Boston, ncomp = 5)
pcr.MSE = mean((Boston$crim - boston.predict.pcr)^2)
pcr.MSE
## [1] 44.59322
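
The 44.59 above is a full-data figure; for a cross-validated number comparable to the other models, we can square the CV RMSEP at 5 components (a small addition using pls::RMSEP; index 6 selects 5 components because index 1 is the zero-component model):

RMSEP(pcr.fit, estimate = "CV")$val[1, 1, 6]^2  # about 6.79^2 ≈ 46.1, per the summary above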

Of the models considered, best subset selection achieves the lowest cross-validation estimate of test MSE, 42.46, using 12 variables. The lasso and ridge regression are close behind, with CV MSEs of about 43.11 and 43.48, respectively. The worst performer is PCR: its full-data MSE with M = 5 components is 44.59, and its cross-validated MSE (about 46.1, from the RMSEP calculation above) is higher than that of all the other approaches.

c. Does your chosen model involve all of the features in the data set? Why or why not?

No, my chosen model does not use all of the features in the data set. The best subset model with the lowest cross-validation estimate of test MSE uses 12 of the 13 predictors; in the selection table above, the 12-variable model drops only chas.
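
To list the chosen model explicitly (a quick check; note that regfit.best here is the fit on the full data from part (a), while the selection table was computed on the training half):

coef(regfit.best, 12)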