For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
The correct answer for part (a) is: iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
Explanation: The lasso’s advantage over least squares is rooted in the bias-variance trade-off. When the least squares estimates have excessively high variance, the lasso can reduce that variance at the cost of a small increase in bias, which can yield more accurate predictions. In addition, the lasso performs variable selection, which makes the resulting model easier to interpret than one produced by, say, ridge regression.
The correct answer for part (b) is: iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
Explanation: Like the lasso, ridge regression’s advantage over least squares is rooted in the bias-variance trade-off. As \(\lambda\) increases, the flexibility of the ridge fit decreases, lowering variance but raising bias. When a small change in the training data produces large swings in the least squares coefficients (i.e., the least squares fit has high variance), ridge regression can still perform well by trading a small increase in bias for a large decrease in variance. Hence ridge regression works best in situations where the least squares estimates have high variance. The main difference from ridge is that the lasso also performs variable selection, which makes its models easier to interpret.
The correct answer for part (c) is: ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
Explanation: Non-linear methods are more flexible than least squares, so they reduce bias at the cost of added variance; they predict better when the decrease in bias outweighs the increase in variance.
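All three answers follow from the decomposition of the expected test error into variance, squared bias, and irreducible error, as presented in ISLR:
\[
E\!\left[\big(y_0 - \hat{f}(x_0)\big)^2\right] \;=\; \mathrm{Var}\!\big(\hat{f}(x_0)\big) \;+\; \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2 \;+\; \mathrm{Var}(\varepsilon).
\]
A less flexible method (parts (a) and (b)) improves on least squares when the drop in the variance term exceeds the rise in the squared-bias term; a more flexible method (part (c)) improves when the reverse holds.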
In this exercise, we will predict the number of applications received using the other variables in the College data set.
library(ISLR)
data(College)
str(College)
## 'data.frame': 777 obs. of 18 variables:
## $ Private : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Apps : num 1660 2186 1428 417 193 ...
## $ Accept : num 1232 1924 1097 349 146 ...
## $ Enroll : num 721 512 336 137 55 158 103 489 227 172 ...
## $ Top10perc : num 23 16 22 60 16 38 17 37 30 21 ...
## $ Top25perc : num 52 29 50 89 44 62 45 68 63 44 ...
## $ F.Undergrad: num 2885 2683 1036 510 249 ...
## $ P.Undergrad: num 537 1227 99 63 869 ...
## $ Outstate : num 7440 12280 11250 12960 7560 ...
## $ Room.Board : num 3300 6450 3750 5450 4120 ...
## $ Books : num 450 750 400 450 800 500 500 450 300 660 ...
## $ Personal : num 2200 1500 1165 875 1500 ...
## $ PhD : num 70 29 53 92 76 67 90 89 79 40 ...
## $ Terminal : num 78 30 66 97 72 73 93 100 84 41 ...
## $ S.F.Ratio : num 18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
## $ perc.alumni: num 12 16 30 37 2 11 26 37 23 15 ...
## $ Expend : num 7041 10527 8735 19016 10922 ...
## $ Grad.Rate : num 60 56 54 59 15 55 63 73 80 52 ...
pairs(College)
set.seed(1)
trainingindex<-sample(nrow(College), 0.75*nrow(College))
head(trainingindex)
## [1] 207 289 444 703 156 694
train<-College[trainingindex, ]
test<-College[-trainingindex, ]
dim(College)
## [1] 777 18
dim(train)
## [1] 582 18
dim(test)
## [1] 195 18
lm.fit=lm(Apps~., data=train)
summary(lm.fit)
##
## Call:
## lm(formula = Apps ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5098.3 -430.6 -10.4 297.1 7309.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -251.47922 469.10371 -0.536 0.592112
## PrivateYes -516.87524 158.90204 -3.253 0.001211 **
## Accept 1.62758 0.04355 37.373 < 2e-16 ***
## Enroll -0.87037 0.20440 -4.258 2.41e-05 ***
## Top10perc 49.52847 6.61170 7.491 2.65e-13 ***
## Top25perc -14.31333 5.22516 -2.739 0.006352 **
## F.Undergrad 0.04218 0.03759 1.122 0.262229
## P.Undergrad -0.03403 0.04725 -0.720 0.471675
## Outstate -0.08314 0.02218 -3.748 0.000197 ***
## Room.Board 0.16414 0.05508 2.980 0.003007 **
## Books 0.02537 0.30798 0.082 0.934377
## Personal 0.05613 0.07492 0.749 0.454068
## PhD -5.98926 5.71897 -1.047 0.295428
## Terminal -6.08900 6.35359 -0.958 0.338293
## S.F.Ratio 12.95405 15.08968 0.858 0.390997
## perc.alumni -2.51887 4.79885 -0.525 0.599866
## Expend 0.07217 0.01468 4.915 1.16e-06 ***
## Grad.Rate 7.45308 3.45763 2.156 0.031541 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1040 on 564 degrees of freedom
## Multiple R-squared: 0.9363, Adjusted R-squared: 0.9343
## F-statistic: 487.4 on 17 and 564 DF, p-value: < 2.2e-16
lm.pred=predict(lm.fit, newdata=test)
lm.err=mean((test$Apps-lm.pred)^2)
lm.err
## [1] 1147256
The test error obtained from the linear model fit on all variables is 1147256. The training \(r^2\) for this model is 93.63%, meaning the model explains 93.63% of the variance in Apps on the training data. To see whether the test error can be reduced, I will next fit shrinkage methods, ridge regression and the lasso, on the same predictors.
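For reference, every test error reported in this write-up is the mean squared prediction error on the held-out set:
\[
\text{Test MSE} \;=\; \frac{1}{n_{\text{test}}} \sum_{i \in \text{test}} \big(y_i - \hat{y}_i\big)^2 .
\]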
# build predictor matrices and response vectors for glmnet
# note: train[,-1] drops the first column of the data frame (Private), so the
# shrinkage fits below exclude that predictor; model.matrix() also keeps an
# intercept column of ones, which shows up with a zero coefficient in the lasso output below
xtrain=model.matrix(Apps~., data=train[,-1])
ytrain=train$Apps
xtest=model.matrix(Apps~., data=test[,-1])
ytest=test$Apps
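For reference, a more conventional construction, not used in this analysis (switching to it would change the numbers reported below), keeps Private as a dummy variable and drops only the intercept column of the design matrix:
# not run: keep Private and drop only the intercept column
# xtrain <- model.matrix(Apps ~ ., data = train)[, -1]
# xtest  <- model.matrix(Apps ~ ., data = test)[, -1]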
set.seed(1)
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-16
ridge.fit=cv.glmnet(xtrain, ytrain, alpha=0)  # alpha=0 gives the ridge penalty
plot(ridge.fit)
The plot shows the cross-validated MSE as a function of \(\log(\lambda)\) for ridge regression; next I find the \(\lambda\) that minimizes this error.
ridge.lambda=ridge.fit$lambda.min
ridge.lambda
## [1] 423.3014
Having obtained the best \(\lambda\) from cross-validation, I will now predict on the test set using ridge regression at that value.
ridge.pred=predict(ridge.fit, s=ridge.lambda, newx = xtest)
ridge.err=mean((ridge.pred-ytest)^2)
ridge.err
## [1] 1146729
On this split, ridge regression achieves a slightly lower test error than OLS. As explained in the previous question, ridge regression typically predicts better than OLS when the OLS fit has high variance. One drawback is interpretability: ridge regression keeps all of the variables in the model and merely shrinks their coefficients toward zero. That does not hurt prediction accuracy, but it makes it harder to interpret the relationship between the predictors and the response.
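To see how much shrinkage the chosen \(\lambda\) applies, the ridge coefficients could be extracted the same way the lasso coefficients are extracted further below (a sketch, not run here):
# not run: ridge coefficients at the cross-validated lambda
# ridge.coeff = predict(ridge.fit, type = "coefficients", s = ridge.lambda)[1:18, ]
# ridge.coeff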
set.seed(1)
lasso.fit=cv.glmnet(xtrain, ytrain, alpha=1)  # alpha=1 gives the lasso penalty
plot(lasso.fit)
The lasso validation plot again shows the cross-validated MSE as a function of \(\log(\lambda)\); I will find the \(\lambda\) that minimizes this error and then evaluate the model on the hold-out data.
lasso.lambda=lasso.fit$lambda.min
lasso.lambda
## [1] 3.277469
I will use this \(\lambda\) to predict on the test data with the fitted lasso model.
lasso.pred=predict(lasso.fit, s=lasso.lambda, newx = xtest)
lasso.err=mean((lasso.pred-ytest)^2)
lasso.err
## [1] 1152340
On this data the lasso predicts slightly worse than ridge regression.
lasso.coeff=predict(lasso.fit, type="coefficients", s=lasso.lambda)[1:18,]
lasso.coeff
## (Intercept) (Intercept) Accept Enroll Top10perc
## -7.529479e+02 0.000000e+00 1.614586e+00 -7.036607e-01 4.646142e+01
## Top25perc F.Undergrad P.Undergrad Outstate Room.Board
## -1.230245e+01 3.107013e-02 -1.251723e-02 -1.039411e-01 1.431761e-01
## Books Personal PhD Terminal S.F.Ratio
## -4.900052e-03 5.167434e-02 -3.444074e+00 -4.154387e+00 1.911491e+01
## perc.alumni Expend Grad.Rate
## -3.723708e+00 7.466592e-02 6.279007e+00
As seen above, at this \(\lambda\) the lasso shrinks the coefficients but does not force any of them exactly to zero, so it performs little variable selection here. Although the lasso generally yields simpler, more interpretable models than ridge regression, it did not predict quite as well on this data. Ridge regression’s slight edge is consistent with the response depending on many of the predictors, which is the setting in which ridge tends to outperform the lasso.
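To make the variable-selection point concrete, one could count how many predictors the lasso actually keeps at this \(\lambda\) (a sketch, not run here):
# not run: number of predictors with nonzero lasso coefficients
# sum(lasso.coeff[-(1:2)] != 0)   # drop the two intercept entries shown above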
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
pcr.fit=pcr(Apps~., data=train, scale=TRUE, validation="CV")
validationplot(pcr.fit, val.type = "MSEP")
summary(pcr.fit)
## Data: X dimension: 582 17
## Y dimension: 582 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4064 4049 2196 2185 2171 1708 1693
## adjCV 4064 4052 2193 2184 2205 1694 1688
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1694 1708 1597 1594 1600 1602 1611
## adjCV 1689 1704 1590 1590 1595 1598 1606
## 14 comps 15 comps 16 comps 17 comps
## CV 1602 1563 1213 1189
## adjCV 1597 1545 1203 1180
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 31.144 57.84 64.56 70.34 75.87 80.88 84.37
## Apps 1.528 71.98 72.43 72.46 83.85 84.06 84.13
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 87.72 90.57 92.97 95.03 96.84 97.96 98.74
## Apps 84.13 85.89 85.98 86.04 86.04 86.06 86.19
## 15 comps 16 comps 17 comps
## X 99.37 99.84 100.00
## Apps 91.01 93.35 93.63
Judging from the validation plot, the CV error levels off and reaches a local minimum around 10 components. The summary bears this out: among fits with up to 14 components, the lowest CV error (1594) occurs at 10 components, which explain 92.97% of the variance in the predictors and 85.98% of the variance in the response, Apps. The error only drops further as the model approaches the full 17-component fit, which is equivalent to least squares. Based on this, I will predict with the 10-component model.
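Rather than reading the component count off the plot, the cross-validated errors can be pulled from the fitted object with the pls accessor `RMSEP()` (a sketch, not run here). Note that the global minimum is at 17 components, which is just least squares, so I still settle on the local minimum at 10.
# not run: cross-validated RMSEP for each number of components
# cv.rmsep = RMSEP(pcr.fit)$val["CV", 1, ]
# which.min(cv.rmsep[-1])   # smallest CV error, ignoring the intercept-only fit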
pcr.pred=predict(pcr.fit, test, ncomp=10)
pcr.err=mean((pcr.pred-test$Apps)^2)
pcr.err
## [1] 1525762
The PCR model does not outperform any of the previous models. One possible reason is the relatively large number of components used: in PCR, adding components lowers bias but raises variance. Even so, using fewer components would not have given a better outcome, since the CV error rises for models below 10 components. Looking at the CV errors themselves, they trace a rough U-shape, and 10 components sits near its bottom (ignoring the 16- and 17-component fits, which essentially reproduce least squares), so 10 was the most sensible choice.
pls.fit=plsr(Apps~., data=train, scale=TRUE, validation="CV")
validationplot(pls.fit, val.type = "MSEP")
summary(pls.fit)
## Data: X dimension: 582 17
## Y dimension: 582 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4064 1989 1601 1545 1443 1325 1210
## adjCV 4064 1986 1589 1551 1427 1310 1200
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1206 1201 1200 1199 1199 1198 1197
## adjCV 1195 1191 1190 1189 1189 1188 1187
## 14 comps 15 comps 16 comps 17 comps
## CV 1196 1196 1196 1196
## adjCV 1186 1186 1186 1186
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 26.66 35.76 63.00 65.64 69.53 73.50 76.77
## Apps 77.22 86.50 87.83 91.25 92.84 93.44 93.50
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 79.45 81.88 85.1 88.72 90.87 92.38 93.94
## Apps 93.55 93.59 93.6 93.61 93.62 93.63 93.63
## 15 comps 16 comps 17 comps
## X 95.16 97.12 100.00
## Apps 93.63 93.63 93.63
From the validation plot, the CV error drops steeply up to about 6 or 7 components and is essentially flat after that. At 7 components, 76.77% of the variance in the predictors and 93.50% of the variance in the response are explained, and the CV error is barely different from that of the larger fits. Since the point of PLS is to use as few components as possible, I will use 7 components for prediction.
pls.pred=predict(pls.fit, test, ncomp = 7)
pls.err=mean((pls.pred-test$Apps)^2)
pls.err
## [1] 1160474
The PLS model comes very close to matching ridge regression but does not beat it. As discussed in the book, dimension-reduction methods like PLS do not generally outperform ridge regression: by compressing the predictors into a few directions, PLS reduces bias but can increase the variance of the fit.
Overall, the best model for this problem was ridge regression, although, as the barplot below shows, the test errors of the five methods are not far apart.
barplot(c(lm.err, ridge.err, lasso.err, pcr.err, pls.err), col="darkslategray1", xlab="Regression Methods", ylab="Test Error", main="Test Errors for All Methods", names.arg=c("OLS", "Ridge", "Lasso", "PCR", "PLS"))
To judge how accurately we can predict the number of college applications received, I will compute a test \(r^2\) for each method, measuring the fraction of the variability of the response on the test data that each model explains.
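Concretely, for each model
\[
r^2_{\text{test}} \;=\; 1 - \frac{\sum_{i \in \text{test}} \big(y_i - \hat{y}_i\big)^2}{\sum_{i \in \text{test}} \big(y_i - \bar{y}_{\text{test}}\big)^2},
\]
where \(\bar{y}_{\text{test}}\) is the mean number of applications in the test set (avg.apps in the code below).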
avg.apps=mean(test$Apps)
lm.r2=1-mean((lm.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
lm.r2
## [1] 0.8901095
ridge.r2=1-mean((ridge.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
ridge.r2
## [1] 0.89016
lasso.r2=1-mean((lasso.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
lasso.r2
## [1] 0.8896226
pcr.r2=1-mean((pcr.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
pcr.r2
## [1] 0.8538542
pls.r2=1-mean((pls.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
pls.r2
## [1] 0.8888434
barplot(c(lm.r2, ridge.r2, lasso.r2, pcr.r2, pls.r2), col="darkslategray2", xlab="Regression Methods", ylab="Test R-Squared", main="Test R-Squared for All Methods", names.arg=c("OLS", "Ridge", "Lasso", "PCR", "PLS"))
Based on the test \(r^2\) values, these models are reasonably accurate: all of them, with the exception of PCR, explain roughly 89% of the variance in the number of applications on the test set.
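As a compact summary, the test errors and test \(r^2\) values computed above could be collected into a single table (a sketch, not run here):
# not run: side-by-side comparison of the five methods
# results = data.frame(
#   method   = c("OLS", "Ridge", "Lasso", "PCR", "PLS"),
#   test.err = c(lm.err, ridge.err, lasso.err, pcr.err, pls.err),
#   test.r2  = c(lm.r2, ridge.r2, lasso.r2, pcr.r2, pls.r2)
# )
# results[order(results$test.err), ]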
We will now try to predict per capita crime rate in the Boston data set.
set.seed(1)
library(MASS)
data(Boston)
attach(Boston)
library(leaps)
# prediction method for regsubsets objects (not provided by the leaps package)
predict.regsubsets = function(object, newdata, id, ...) {
  form = as.formula(object$call[[2]])   # recover the formula used in the fit
  mat = model.matrix(form, newdata)     # build the design matrix for newdata
  coefi = coef(object, id = id)         # coefficients of the id-variable model
  mat[, names(coefi)] %*% coefi         # predictions
}
k = 10
p = ncol(Boston) - 1                    # number of predictors (13)
folds = sample(rep(1:k, length = nrow(Boston)))
cv.errors = matrix(NA, k, p)
for (i in 1:k) {
  # best subset selection on all folds except fold i
  best.fit = regsubsets(crim ~ ., data = Boston[folds != i, ], nvmax = p)
  for (j in 1:p) {
    # test MSE of the j-variable model on the held-out fold
    pred = predict(best.fit, Boston[folds == i, ], id = j)
    cv.errors[i, j] = mean((Boston$crim[folds == i] - pred)^2)
  }
}
rmse.cv = sqrt(apply(cv.errors, 2, mean))  # CV root-MSE for each model size
plot(rmse.cv, pch = 19, type = "b")
Best subset selection fits the best model of each size from 1 to 13 predictors (one fewer than the number of columns in Boston); next I find the model size that minimizes the cross-validated error.
summary(best.fit)
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = Boston[folds != i, ], nvmax = p)
## 13 Variables (and intercept)
## Forced in Forced out
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## black FALSE FALSE
## lstat FALSE FALSE
## medv FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
## zn indus chas nox rm age dis rad tax ptratio black lstat medv
## 1 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " "*" " "
## 3 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" "*" " "
## 4 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " " " " " "*"
## 5 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " "*" " " "*"
## 6 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " " " "*" " " "*"
## 7 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " "*" "*" " " "*"
## 8 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " "*" "*" "*" "*"
## 9 ( 1 ) "*" "*" " " "*" " " " " "*" "*" " " "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" " " "*" "*" " " "*" "*" " " "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
which.min(rmse.cv)
## [1] 9
boston.bsm.err=(rmse.cv[which.min(rmse.cv)])^2
boston.bsm.err
## [1] 43.47287
Cross-validation selects the 9-variable model: its CV estimate of the test MSE, 43.47287, is the lowest across the 13 model sizes.
The variables that are included in this model are zn, indus, nox, dis, rad, ptratio, black, lstat, and medv.
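To see the estimated coefficients of the selected model, best subset selection could be refit on the full Boston data and the 9-variable fit extracted (a sketch, not run here):
# not run: coefficients of the 9-variable best subset model refit on all of Boston
# reg.full = regsubsets(crim ~ ., data = Boston, nvmax = p)
# coef(reg.full, id = which.min(rmse.cv))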
Next I will fit a Lasso model.
boston.x=model.matrix(crim~., data=Boston)[,-1]
boston.y=Boston$crim
boston.lasso=cv.glmnet(boston.x,boston.y,alpha=1, type.measure = "mse")
plot(boston.lasso)
The plot above shows the cross-validated MSE as a function of \(\log(\lambda)\). Since the lasso here is fit by cross-validation on the full Boston data set, I will report the CV error at the selected \(\lambda\) rather than a separate hold-out error.
coef(boston.lasso)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.0894283
## zn .
## indus .
## chas .
## nox .
## rm .
## age .
## dis .
## rad 0.2643196
## tax .
## ptratio .
## black .
## lstat .
## medv .
boston.lasso.err=(boston.lasso$cvm[boston.lasso$lambda==boston.lasso$lambda.1se])
boston.lasso.err
## [1] 54.83663
It is important to keep in mind that the lasso is a variable selection method. At the \(\lambda\) used here the lasso keeps only one variable, rad, and its cross-validated MSE is 54.83663.
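Note that `coef()` on a cv.glmnet object defaults to `lambda.1se`, the largest \(\lambda\) whose CV error is within one standard error of the minimum; the less heavily penalized `lambda.min` fit keeps more variables and could be examined the same way (a sketch, not run here):
# not run: CV error and coefficients at lambda.min for comparison
# boston.lasso$cvm[boston.lasso$lambda == boston.lasso$lambda.min]
# coef(boston.lasso, s = "lambda.min")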
Ridge regression is fit with almost identical code; the only change is the argument alpha = 0.
boston.ridge=cv.glmnet(boston.x, boston.y, type.measure = "mse", alpha=0)
plot(boston.ridge)
Unlike the lasso, ridge regression keeps all of the variables but shrinks the coefficients of the less useful ones toward (though not exactly to) zero.
coef(boston.ridge)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.017516870
## zn -0.002805664
## indus 0.034405928
## chas -0.225250601
## nox 2.249887495
## rm -0.162546004
## age 0.007343331
## dis -0.114928730
## rad 0.059813843
## tax 0.002659110
## ptratio 0.086423005
## black -0.003342067
## lstat 0.044495212
## medv -0.029124577
boston.ridge.err=boston.ridge$cvm[boston.ridge$lambda==boston.ridge$lambda.1se]
boston.ridge.err
## [1] 55.60604
Compared with the other two methods, ridge regression performs the worst here, with a CV MSE of 55.60604.
Now I will explore the PCR method.
library(pls)
boston.pcr=pcr(crim~., data=Boston, scale=TRUE, validation="CV")
summary(boston.pcr)
## Data: X dimension: 506 13
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.61 7.170 7.163 6.733 6.728 6.740 6.765
## adjCV 8.61 7.169 7.162 6.730 6.723 6.737 6.760
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.760 6.634 6.652 6.642 6.652 6.607 6.546
## adjCV 6.754 6.628 6.644 6.635 6.643 6.598 6.536
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 47.70 60.36 69.67 76.45 82.99 88.00 91.14
## crim 30.69 30.87 39.27 39.61 39.61 39.86 40.14
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## X 93.45 95.40 97.04 98.46 99.52 100.0
## crim 42.47 42.55 42.78 43.04 44.13 45.4
Based on the CV errors as well as the variances explained, a reasonable PCR model uses only 8 components. With 8 components, the model captures 93.45% of the variance in the predictors and explains 42.47% of the variance in the response. Additionally, the cross-validated MSE at 8 components is 44.38224. Aside from the best subset model, this model predicts fairly well.
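The cross-validated MSE at each number of components can also be read directly from the fitted object using the pls accessor `MSEP()`, instead of squaring the rounded RMSEP values in the table above (a sketch, not run here):
# not run: cross-validated MSE for each number of components
# MSEP(boston.pcr)$val["CV", 1, ]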
As computed above, the model that has the lowest cross-validation error is the one chosen by the best subset selection method. This model had an MSE of 43.47287.
The model chosen by best subset selection includes only 9 variables: zn, indus, nox, dis, rad, ptratio, black, lstat, and medv. Adding back the discarded features would increase the variance of the fit without lowering the cross-validated error, and for this problem the goal is a model with good prediction accuracy, that is, low variance and low test MSE.