Problem 2

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

(a) The lasso, relative to least squares, is:

i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

ANSWER: iii. The lasso is less flexible and will give improved prediction accuracy as long as the increase in bias is less than the decrease in variance. Lasso shrinks coefficient estimates of less-useful variables to zero, at the cost of a small increase in bias, which is generally worth it to minimize variance.

(b) Repeat (a) for ridge regression relative to least squares.

ANSWER: iii. For the same reasons as part (a), except ridge regression does not shring the less-useful variable coefficients to zero like lasso. Ridge regression still minimizes the coefficients of variables and reduces variance with a slight increase in bias.

(c) Repeat (a) for non-linear methods relative to least squares.

ANSWER: ii. More flexible and will give improved prediction accuracy when its increase in variance is less than its decrease in bias. Non-linear methods are more flexible because they make less assumptions about the form of the function. Relationships between predictors and the response are approximated which leads to a decrease in bias that outweighs and increase in variance, resulting in higher prediction accuracy.

Problem 9

In this exercise, we will predict the number of applications received using the other variables in the College data set.

(a) Split the data set into a training set and a test set.

library(ISLR)
attach(College)

set.seed(1) 
trainingindex<-sample(nrow(College), 0.7*nrow(College))

train<-College[trainingindex, ]
test<-College[-trainingindex, ]

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

l.model <- lm(Apps~., data=train)

lm.pred=predict(l.model, newdata=test)
lm.err=mean((test$Apps-lm.pred)^2)
lm.err
## [1] 1261630

The test error for the linear model is 1,261,630.

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

library(glmnet)

set.seed(1)

xtrain=model.matrix(Apps~., data=train[,-1])
ytrain=train$Apps
xtest=model.matrix(Apps~., data=test[,-1])
ytest=test$Apps
# Ridge regression
ridge.mod = cv.glmnet(xtrain, ytrain, alpha=0)

ridge.lambda = ridge.mod$lambda.min

ridge.pred=predict(ridge.mod, s=ridge.lambda, newx = xtest)
ridge.err=mean((ridge.pred-ytest)^2)
ridge.err
## [1] 1163628

The test error rate of the ridge regression is 1,163,628. This is a better test error rate than the OLS regression.

(d) Fit a lasso model on the training set, with λ chosen by crossvalidation. Report the test error obtained, along with the number of non-zero coefficient estimates.

set.seed(1)
lasso.mod=cv.glmnet(xtrain, ytrain, alpha=1)

lasso.lambda=lasso.mod$lambda.min

lasso.pred=predict(lasso.mod, s=lasso.lambda, newx = xtest)
lasso.err=mean((lasso.pred-ytest)^2)
lasso.err
## [1] 1265298
lasso.coeff=predict(lasso.mod, type="coefficients", s=lasso.lambda)[1:18,]
lasso.coeff
##   (Intercept)   (Intercept)        Accept        Enroll     Top10perc 
## -1.033941e+03  0.000000e+00  1.703414e+00 -9.625620e-01  5.160952e+01 
##     Top25perc   F.Undergrad   P.Undergrad      Outstate    Room.Board 
## -1.450882e+01  3.915947e-02  7.389830e-02 -1.077148e-01  1.466627e-01 
##         Books      Personal           PhD      Terminal     S.F.Ratio 
##  2.535477e-01 -8.694845e-03 -6.888117e+00  4.525972e-01  2.448283e+01 
##   perc.alumni        Expend     Grad.Rate 
##  7.159570e-01  6.301767e-02  6.103923e+00

The test error rate for the lasso model is 1,265,298.

(e) Fit a PCR model on the training set, with M chosen by crossvalidation. Report the test error obtained, along with the value of M selected by cross-validation.

library(pls)

pcr.fit=pcr(Apps~., data=train, scale=TRUE, validation="CV")
validationplot(pcr.fit, val.type="MSEP")

summary(pcr.fit)
## Data:    X dimension: 543 17 
##  Y dimension: 543 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3895     3807     2130     2134     1826     1692     1696
## adjCV         3895     3807     2126     2132     1806     1682     1690
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1695     1638     1604      1593      1601      1603      1606
## adjCV     1694     1627     1598      1587      1595      1598      1600
##        14 comps  15 comps  16 comps  17 comps
## CV         1607      1544      1145      1113
## adjCV      1602      1524      1137      1106
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X      32.051    57.00    64.42    70.27    75.65    80.65    84.26    87.61
## Apps    5.788    71.69    71.70    80.97    82.60    82.60    82.69    84.06
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.58     92.84     94.93     96.74     97.82     98.72     99.39
## Apps    84.55     84.82     84.86     84.86     85.01     85.05     89.81
##       16 comps  17 comps
## X        99.85    100.00
## Apps     93.03     93.32
pcr.pred=predict(pcr.fit, test, ncomp=10)
pcr.err=mean((pcr.pred-test$Apps)^2)
pcr.err
## [1] 1837203

The number of components chosen was 10 due to a low CV and high variance explained. The test error rate for the PCR model is 1,837,203. This is a considerably higher error rate than any of the previous models.

(f) Fit a PLS model on the training set, with M chosen by crossvalidation. Report the test error obtained, along with the value of M selected by cross-validation.

pls.fit=plsr(Apps~., data=train, scale=TRUE, validation="CV")
validationplot(pls.fit, val.type = "MSEP")

summary(pls.fit)
## Data:    X dimension: 543 17 
##  Y dimension: 543 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3895     1969     1762     1566     1493     1320     1236
## adjCV         3895     1963     1757     1556     1469     1291     1222
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1225     1213     1212      1207      1213      1207      1206
## adjCV     1212     1201     1199      1195      1200      1195      1194
##        14 comps  15 comps  16 comps  17 comps
## CV         1205      1205      1205      1205
## adjCV      1193      1193      1193      1193
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       25.68    47.43    62.46    64.88    67.34    72.68    77.20    80.92
## Apps    76.62    82.39    86.93    90.76    92.82    93.05    93.13    93.20
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       82.69     85.16     87.35     90.73     92.49     95.10     97.09
## Apps    93.26     93.28     93.30     93.31     93.32     93.32     93.32
##       16 comps  17 comps
## X        98.40    100.00
## Apps     93.32     93.32
pls.pred=predict(pls.fit, test, ncomp = 10)
pls.err=mean((pls.pred-test$Apps)^2)
pls.err
## [1] 1279353

The number of components chosen was once again 10 due to its low CV and high variance explained. This model performs similar to most other models, but not as well as the ridge regression.

(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

# Test Error 
barplot(c(lm.err, ridge.err, lasso.err, pcr.err, pls.err), col="darkred", xlab="Regression Methods", ylab="Test Error", main="Test Errors for All Methods", names.arg=c("OLS", "Ridge", "Lasso", "PCR", "PLS"))

# r2
avg.apps=mean(test$Apps)
lm.r2=1-mean((lm.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
ridge.r2=1-mean((ridge.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
lasso.r2=1-mean((lasso.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
pcr.r2=1-mean((pcr.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
pls.r2=1-mean((pls.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)

barplot(c(lm.r2, ridge.r2, lasso.r2, pcr.r2, pls.r2), col="darkslategray2", xlab="Regression Methods", ylab="Test R-Squared", main="Test R-Squared for All Methods", names.arg=c("OLS", "Ridge", "Lasso", "PCR", "PLS"))

There is not much of a difference in the r2 for each model, however, the PCR is the least accurate model based on test error rate and r2. All of the models are pretty accurate and the ridge regression model has the lowest test error rate.

Problem 11

We will now try to predict per capita crime rate in the Boston data set.

(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

set.seed(1)
library(MASS)
library(leaps)
data(Boston)
attach(Boston)

# Best subset selection
predict.regsubsets = function(object, newdata, id, ...) {
    form = as.formula(object$call[[2]])
    mat = model.matrix(form, newdata)
    coefi = coef(object, id = id)
    mat[, names(coefi)] %*% coefi
}

k = 10
p = ncol(Boston) - 1
folds = sample(rep(1:k, length = nrow(Boston)))
cv.errors = matrix(NA, k, p)
for (i in 1:k) {
    best.fit = regsubsets(crim ~ ., data = Boston[folds != i, ], nvmax = p)
    for (j in 1:p) {
        pred = predict(best.fit, Boston[folds == i, ], id = j)
        cv.errors[i, j] = mean((Boston$crim[folds == i] - pred)^2)
    }
}
mean.cv.errors <- apply(cv.errors, 2, mean)
plot(mean.cv.errors, type = "b", xlab = "Number of variables", ylab = "CV error")

summary(best.fit)
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = Boston[folds != i, ], nvmax = p)
## 13 Variables  (and intercept)
##         Forced in Forced out
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## black       FALSE      FALSE
## lstat       FALSE      FALSE
## medv        FALSE      FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
##           zn  indus chas nox rm  age dis rad tax ptratio black lstat medv
## 1  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   " "   " " 
## 2  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   "*"   " " 
## 3  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     "*"   "*"   " " 
## 4  ( 1 )  "*" " "   " "  " " " " " " " " "*" " " " "     "*"   "*"   " " 
## 5  ( 1 )  "*" " "   " "  " " " " " " "*" "*" " " " "     " "   "*"   "*" 
## 6  ( 1 )  "*" "*"   " "  " " " " " " "*" "*" " " " "     " "   "*"   "*" 
## 7  ( 1 )  "*" "*"   " "  " " " " " " "*" "*" " " " "     "*"   "*"   "*" 
## 8  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " "*"     "*"   "*"   "*" 
## 9  ( 1 )  "*" " "   " "  "*" "*" " " "*" "*" " " "*"     "*"   "*"   "*" 
## 10  ( 1 ) "*" "*"   " "  "*" "*" " " "*" "*" " " "*"     "*"   "*"   "*" 
## 11  ( 1 ) "*" "*"   "*"  "*" "*" " " "*" "*" " " "*"     "*"   "*"   "*" 
## 12  ( 1 ) "*" "*"   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*"   "*" 
## 13  ( 1 ) "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"   "*"
which.min(mean.cv.errors)
## [1] 9
mean.cv.errors[which.min(mean.cv.errors)]
## [1] 42.81453

Cross-validation selects a 9 variable model which includes the variables zn, indus, nox, dis, rad, ptratio, black, lstat, and medv.

# Lasso model
boston.x=model.matrix(crim~., data=Boston)[,-1]
boston.y=Boston$crim
boston.lasso=cv.glmnet(boston.x,boston.y,alpha=1, type.measure = "mse")
plot(boston.lasso)

coef(boston.lasso)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                   s1
## (Intercept) 2.176491
## zn          .       
## indus       .       
## chas        .       
## nox         .       
## rm          .       
## age         .       
## dis         .       
## rad         0.150484
## tax         .       
## ptratio     .       
## black       .       
## lstat       .       
## medv        .
boston.lasso.err=(boston.lasso$cvm[boston.lasso$lambda==boston.lasso$lambda.1se])
boston.lasso.err
## [1] 62.74783

Because Lasso is a variable reduction method, the model has removed all variables except rad. This process results in a test error of 62.74783.

# Ridge regression
boston.ridge=cv.glmnet(boston.x, boston.y, type.measure = "mse", alpha=0)
plot(boston.ridge)

coef(boston.ridge)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept)  1.523899542
## zn          -0.002949852
## indus        0.029276741
## chas        -0.166526007
## nox          1.874769665
## rm          -0.142852604
## age          0.006207995
## dis         -0.094547258
## rad          0.045932737
## tax          0.002086668
## ptratio      0.071258052
## black       -0.002605281
## lstat        0.035745604
## medv        -0.023480540
boston.ridge.err=boston.ridge$cvm[boston.ridge$lambda==boston.ridge$lambda.1se]
boston.ridge.err
## [1] 58.8156

The ridge regression model attempts to push all coefficients for non-significant variables in the model to zero. This model shows nox as the most significant variable and the test error for the model is 58.8156.

# PCR model
library(pls)
boston.pcr=pcr(crim~., data=Boston, scale=TRUE, validation="CV")
summary(boston.pcr)
## Data:    X dimension: 506 13 
##  Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            8.61    7.175    7.180    6.724    6.731    6.727    6.727
## adjCV         8.61    7.174    7.179    6.721    6.725    6.724    6.724
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       6.722    6.614    6.618     6.607     6.598     6.553     6.488
## adjCV    6.718    6.609    6.613     6.602     6.592     6.546     6.481
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       47.70    60.36    69.67    76.45    82.99    88.00    91.14    93.45
## crim    30.69    30.87    39.27    39.61    39.61    39.86    40.14    42.47
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       95.40     97.04     98.46     99.52     100.0
## crim    42.55     42.78     43.04     44.13      45.4
pcr.pred=predict(boston.pcr, ncomp=13)
pcr.err=mean((pcr.pred-Boston$crim)^2)
pcr.err
## [1] 40.31607

Based on the CV error and variance explained, I would use all 13 components in the PCR model. This would explain 100% of the variance in the predictors by the model and 45.4% of the variance in the response variable. The MSE for this model is 40.31607.

(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, crossvalidation, or some other reasonable alternative, as opposed to using training error.

Based on the models computed above, the PCR model had the lowest MSE at 40.31607. The model is more complicated than the other models because it includes all 13 components, but it performs the best on this data set.

(c) Does your chosen model involve all of the features in the data set? Why or why not?

Yes the chosen model involves all of the features in the data set. All features are included because, when used in the PCR model, this provides the lowest CV and the lowest MSE while explaining the highest amount of variance.