Problem 2

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

(a) The lasso, relative to least squares, is:

  i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

  ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

  iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

  iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

The correct answer for part (a) is: iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

Explanation: The lasso’s advantage over least squares is rooted in the bias-variance trade-off. When the least squares estimates have excessively high variance, the lasso can yield a large reduction in variance at the expense of a small increase in bias, which in turn can produce more accurate predictions. In addition, the lasso performs variable selection, which makes its models easier to interpret than those of methods like ridge regression.

(b) Repeat (a) for ridge regression relative to least squares.

  i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

  ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

  iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

  iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

The correct answer for part (b) is: iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

Explanation: Ridge regression’s advantage over least squares is likewise rooted in the bias-variance trade-off. As \(\lambda\) increases, the flexibility of the ridge fit decreases, reducing variance but increasing bias. When the relationship between the response and the predictors is close to linear, the least squares estimates have low bias but may have high variance: a small change in the training data can cause a large change in the coefficient estimates. This is especially true when the number of predictors \(p\) is almost as large as the number of observations \(n\). Ridge regression can still perform well in this setting by trading a small increase in bias for a large decrease in variance. Hence, ridge regression works best in situations where the least squares estimates have high variance.
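To make the trade-off concrete, here is a minimal simulation sketch (the design is an illustrative assumption, not part of the exercise: n = 60 observations, p = 50 predictors of which only five carry signal, noise sd 3). With \(p\) close to \(n\), the high-variance least squares fit should usually be beaten on held-out data by both shrinkage methods.

library(glmnet)
set.seed(1)
n <- 60; p <- 50
x <- matrix(rnorm(n * p), n, p)
beta <- c(rep(1, 5), rep(0, p - 5))        # only five predictors matter
y <- drop(x %*% beta + rnorm(n, sd = 3))
x.new <- matrix(rnorm(1000 * p), 1000, p)  # a large test set
y.new <- drop(x.new %*% beta + rnorm(1000, sd = 3))
# Least squares: unbiased but high variance, since p is close to n
ols <- lm(y ~ x)
ols.mse <- mean((y.new - drop(cbind(1, x.new) %*% coef(ols)))^2)
# Ridge and lasso: a little bias traded for a large drop in variance
ridge <- cv.glmnet(x, y, alpha = 0)
ridge.mse <- mean((y.new - drop(predict(ridge, newx = x.new, s = "lambda.min")))^2)
lasso <- cv.glmnet(x, y, alpha = 1)
lasso.mse <- mean((y.new - drop(predict(lasso, newx = x.new, s = "lambda.min")))^2)
c(OLS = ols.mse, Ridge = ridge.mse, Lasso = lasso.mse)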

(c) Repeat (a) for non-linear methods relative to least squares.

  i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

  ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

  iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

  iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

The correct answer for part (c) is: ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

Explanation: Non-linear methods are more flexible than least squares, so they carry lower bias at the cost of higher variance. They will therefore give more accurate predictions when the true relationship between the response and the predictors is substantially non-linear, that is, when their decrease in bias outweighs the accompanying increase in variance.

Problem 9

In this exercise, we will predict the number of applications received using the other variables in the College data set.

library(ISLR)
summary(College)
##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate     
##  Min.   : 10.00  
##  1st Qu.: 53.00  
##  Median : 65.00  
##  Mean   : 65.46  
##  3rd Qu.: 78.00  
##  Max.   :118.00
str(College)
## 'data.frame':    777 obs. of  18 variables:
##  $ Private    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Apps       : num  1660 2186 1428 417 193 ...
##  $ Accept     : num  1232 1924 1097 349 146 ...
##  $ Enroll     : num  721 512 336 137 55 158 103 489 227 172 ...
##  $ Top10perc  : num  23 16 22 60 16 38 17 37 30 21 ...
##  $ Top25perc  : num  52 29 50 89 44 62 45 68 63 44 ...
##  $ F.Undergrad: num  2885 2683 1036 510 249 ...
##  $ P.Undergrad: num  537 1227 99 63 869 ...
##  $ Outstate   : num  7440 12280 11250 12960 7560 ...
##  $ Room.Board : num  3300 6450 3750 5450 4120 ...
##  $ Books      : num  450 750 400 450 800 500 500 450 300 660 ...
##  $ Personal   : num  2200 1500 1165 875 1500 ...
##  $ PhD        : num  70 29 53 92 76 67 90 89 79 40 ...
##  $ Terminal   : num  78 30 66 97 72 73 93 100 84 41 ...
##  $ S.F.Ratio  : num  18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
##  $ perc.alumni: num  12 16 30 37 2 11 26 37 23 15 ...
##  $ Expend     : num  7041 10527 8735 19016 10922 ...
##  $ Grad.Rate  : num  60 56 54 59 15 55 63 73 80 52 ...
pairs(College)

(a) Split the data set into a training set and a test set.
set.seed(1)
trainingIndex <- sample(nrow(College), 0.75*nrow(College))
College.train <- College[trainingIndex,]
College.test <- College[-trainingIndex,]
dim(College)
## [1] 777  18
dim(College.train)
## [1] 582  18
dim(College.test)
## [1] 195  18
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
College.lm <- lm(Apps~., data = College.train)
summary(College.lm)
## 
## Call:
## lm(formula = Apps ~ ., data = College.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5773.1  -425.2     4.5   327.9  7496.3 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.784e+02  4.707e+02  -1.229  0.21962    
## PrivateYes  -4.673e+02  1.571e+02  -2.975  0.00305 ** 
## Accept       1.712e+00  4.567e-02  37.497  < 2e-16 ***
## Enroll      -1.197e+00  2.151e-01  -5.564 4.08e-08 ***
## Top10perc    5.298e+01  6.158e+00   8.603  < 2e-16 ***
## Top25perc   -1.528e+01  4.866e+00  -3.141  0.00177 ** 
## F.Undergrad  7.085e-02  3.760e-02   1.884  0.06002 .  
## P.Undergrad  5.771e-02  3.530e-02   1.635  0.10266    
## Outstate    -8.143e-02  2.077e-02  -3.920 9.95e-05 ***
## Room.Board   1.609e-01  5.361e-02   3.002  0.00280 ** 
## Books        2.338e-01  2.634e-01   0.887  0.37521    
## Personal     6.611e-03  6.850e-02   0.097  0.92315    
## PhD         -1.114e+01  5.149e+00  -2.163  0.03093 *  
## Terminal     9.186e-01  5.709e+00   0.161  0.87223    
## S.F.Ratio    1.689e+01  1.542e+01   1.096  0.27368    
## perc.alumni  2.256e+00  4.635e+00   0.487  0.62667    
## Expend       5.567e-02  1.300e-02   4.284 2.16e-05 ***
## Grad.Rate    6.427e+00  3.307e+00   1.944  0.05243 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1009 on 564 degrees of freedom
## Multiple R-squared:  0.9336, Adjusted R-squared:  0.9316 
## F-statistic: 466.7 on 17 and 564 DF,  p-value: < 2.2e-16
College.lm.pred <- predict(College.lm, newdata = College.test)
College.lm.test.err <- mean((College.test$Apps - College.lm.pred)^2)
College.lm.test.err
## [1] 1384604

After fitting a least squares linear regression model to predict the number of applications, we obtain a test MSE of 1,384,604.
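For scale, the square root of this test MSE gives the typical size of a prediction error in units of applications (a quick sketch, using the object fit above; compare with the mean of roughly 3,002 applications from the summary):

sqrt(College.lm.test.err)  # about 1,177 applications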

(c) Fit a ridge regression model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained.
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.0-2
# Note: cv.glmnet generates its own lambda sequence by default, so the grid
# below is defined but never passed in; it has no effect on the fits that follow.
grid = 10^seq(4, -2 ,length = 100)
# Training data for the X and Y variables. Note that College.train[,-1] drops
# the first column (Private), so Private never enters the shrinkage models; the
# usual idiom is model.matrix(Apps ~ ., College.train)[, -1], which drops
# model.matrix's intercept column instead. The retained intercept column is why
# a second "(Intercept)" row appears in the coefficient output further below.
College.x.train <- model.matrix(Apps~., data = College.train[,-1])
College.y.train <- College.train$Apps
# Testing data for the X and Y variables
College.x.test <- model.matrix(Apps~., data = College.test[,-1])
College.y.test <- College.test$Apps
# College ridge regression model; alpha = 0 selects the ridge penalty, and
# cv.glmnet performs 10-fold cross-validation over lambda by default
College.ridge <- cv.glmnet(College.x.train, College.y.train, alpha = 0)
plot(College.ridge)

College.ridge.best.lambda <- College.ridge$lambda.min
College.ridge.best.lambda
## [1] 364.6228

Using cross-validation, we obtain a tuning value of \(\lambda\) = 364.6228, the value that yields the smallest cross-validation error.

College.ridge.pred <- predict(College.ridge, s = College.ridge.best.lambda, newx = College.x.test)
College.ridge.err <- mean((College.ridge.pred - College.y.test)^2)
College.ridge.err
## [1] 1260111

The test MSE produced by the cross-validated ridge regression is 1,260,111, noticeably smaller than that of least squares. While the predictions may be more accurate, the shrinking of the coefficients toward zero makes it harder to interpret the relationship between the predictors and the response, Apps.
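To see the shrinkage itself, the coefficients at the selected \(\lambda\) can be extracted with the same coefficient interface the lasso section uses below (a sketch; with ridge, no coefficient is exactly zero, they are only pulled toward zero):

predict(College.ridge, type = "coefficients", s = College.ridge.best.lambda)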

(d) Fit a lasso model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
College.lasso <- cv.glmnet(College.x.train, College.y.train, alpha = 1)
plot(College.lasso)

College.lasso.best.lambda <- College.lasso$lambda.min
College.lasso.best.lambda
## [1] 1.945882

Using cross-validation, we obtain \(\lambda\) = 1.945882 for the lasso, the value that yields the smallest cross-validation error.

College.lasso.pred <- predict(College.lasso, s = College.lasso.best.lambda, newx = College.x.test)
College.lasso.err <- mean((College.lasso.pred - College.y.test)^2)
College.lasso.err
## [1] 1394834

The lasso does not outperform ridge regression in this case: its test error of 1,394,834 is in fact slightly above even the least squares test error.

College.lasso.coeff <- predict(College.lasso, type = "coefficients", s = College.lasso.best.lambda)[1:18,]
College.lasso.coeff
##   (Intercept)   (Intercept)        Accept        Enroll     Top10perc 
## -1.030320e+03  0.000000e+00  1.693638e+00 -1.100616e+00  5.104012e+01 
##     Top25perc   F.Undergrad   P.Undergrad      Outstate    Room.Board 
## -1.369969e+01  7.851307e-02  6.036558e-02 -9.861002e-02  1.475536e-01 
##         Books      Personal           PhD      Terminal     S.F.Ratio 
##  2.185146e-01  0.000000e+00 -8.390364e+00  1.349648e+00  2.457249e+01 
##   perc.alumni        Expend     Grad.Rate 
##  7.684156e-01  5.858007e-02  5.324356e+00

The lasso coefficients at the selected \(\lambda\) are shown above. While the lasso may yield a simpler, more interpretable model than ridge regression, it did not perform as well in prediction accuracy here.
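Part (d) also asks for the number of non-zero coefficient estimates. Reading the printed coefficients, only Personal (and the redundant second intercept) is exactly zero, so 15 of the 16 predictors are retained; a one-line sketch to count them:

sum(College.lasso.coeff[-c(1, 2)] != 0)  # drop the two intercept rows; gives 15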

(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
set.seed(1)
College.pcr <- pcr(Apps~., data = College.train, scale = TRUE, validation = "CV")
summary(College.pcr)
## Data:    X dimension: 582 17 
##  Y dimension: 582 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3862     3767     2088     2097     1815     1682     1675
## adjCV         3862     3766     2085     2096     1807     1669     1670
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1671     1620     1574      1568      1575      1579      1580
## adjCV     1666     1610     1568      1562      1570      1574      1574
##        14 comps  15 comps  16 comps  17 comps
## CV         1575      1502      1204      1134
## adjCV      1571      1486      1193      1126
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X      32.159    57.17    64.41    70.20    75.53    80.48    84.24    87.56
## Apps    5.226    71.83    71.84    80.02    83.01    83.07    83.21    84.46
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.54     92.81     94.92     96.73     97.81     98.69     99.35
## Apps    85.00     85.22     85.22     85.23     85.36     85.45     89.93
##       16 comps  17 comps
## X        99.82    100.00
## Apps     92.84     93.36
validationplot(College.pcr, val.type = "MSEP")

Looking at the validation plot and the summary output, the CV error has a local minimum of 1568 at M = 10 components; it falls further toward M = 17, but using all 17 components is equivalent to least squares and offers no dimension reduction. With M = 10 components, the model explains a reasonable share of the variance in both the predictors (92.81%) and the response (85.22%), so M = 10 is selected.
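To read the CV curve off programmatically rather than from the plot, a small sketch using pls's MSEP() accessor (the first entry of the curve is the intercept-only model, hence the offset of one):

cv.msep <- drop(MSEP(College.pcr, estimate = "CV")$val)
which.min(cv.msep) - 1  # global CV minimum; M = 17 here, per the summary above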

College.pcr.pred <- predict(College.pcr, College.test, ncomp = 10)
College.pcr.err <- mean((College.pcr.pred - College.test$Apps)^2)
College.pcr.err
## [1] 1952693

The PCR model with M = 10 principal components does not outperform any of the previous models, with a test error of 1,952,693. This likely occurred because M = 10 discards component directions that still carry information about Apps; the CV error above only approaches its minimum as M nears 17.
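As a check on that reasoning, a small sketch using the fitted College.pcr object: with M = 17 = p, PCR retains every principal component and reproduces the least squares fit, so its test error should essentially match College.lm.test.err.

College.pcr.full.pred <- predict(College.pcr, College.test, ncomp = 17)
mean((College.pcr.full.pred - College.test$Apps)^2)  # ~ the OLS test MSE above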

(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
set.seed(1) 
College.pls <- plsr(Apps~., data = College.train, scale = TRUE, validation = "CV")
summary(College.pls)
## Data:    X dimension: 582 17 
##  Y dimension: 582 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3862     1921     1712     1511     1429     1255     1175
## adjCV         3862     1916     1710     1503     1410     1229     1164
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1153     1147     1139      1138      1138      1136      1134
## adjCV     1143     1138     1131      1130      1129      1128      1126
##        14 comps  15 comps  16 comps  17 comps
## CV         1134      1134      1134      1134
## adjCV      1126      1126      1126      1126
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       25.67    47.09    62.54     65.0    67.54    72.28    76.80    80.63
## Apps    76.80    82.71    87.20     90.8    92.79    93.05    93.14    93.22
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       82.71     85.53     88.01     90.95     93.07     95.18     96.86
## Apps    93.30     93.32     93.34     93.35     93.36     93.36     93.36
##       16 comps  17 comps
## X        98.00    100.00
## Apps     93.36     93.36
validationplot(College.pls, val.type = "MSEP")

Reviewing the validation plot, the MSE drops steeply through about 6 components and is essentially flat beyond 10. The summary shows the CV error is minimized from 13 components onward, but that is a lot of components for a method that is supposed to reduce dimension. Weighing the percentage of variance explained as well, I will proceed with M = 9 components: at 9 components, 82.71% of the variance in the predictors and 93.30% of the variance in the response, Apps, is explained.
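An alternative to this judgment call is a formal selection rule; a sketch, assuming the selectNcomp() helper shipped with recent versions of the pls package, which picks the smallest M whose CV error is within one standard error of the minimum:

selectNcomp(College.pls, method = "onesigma", plot = TRUE)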

College.pls.pred <- predict(College.pls, College.test, ncomp = 9)
College.pls.err <- mean((College.pls.pred - College.test$Apps)^2)
College.pls.err
## [1] 1381335

The PLS model with M = 9 components compares competitively with the lasso and ridge regression models, with a test error of 1,381,335 that falls between the two.

(g) Comment on the results obtained. How accurately can we predict the number of applications received? Is there much difference among the test errors resulting from these five approaches?

Below are the test errors for each of the models fit in this exercise. As the results show, the model that most accurately predicts the number of college applications received is the ridge regression. The remaining models perform similarly to one another; the one that stands apart is the PCR model, whose test error is almost 600,000 higher than the rest.

College.lm.test.err
## [1] 1384604
College.ridge.err
## [1] 1260111
College.lasso.err
## [1] 1394834
College.pcr.err
## [1] 1952693
College.pls.err
## [1] 1381335
barplot(c(College.lm.test.err, College.ridge.err, College.lasso.err, College.pcr.err, College.pls.err), col="darkslategray1", xlab="Regression Methods", ylab="Test Error", main="Test Errors for All Methods", names.arg=c("OLS", "Ridge", "Lasso", "PCR", "PLS"))
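To translate these errors onto an interpretable scale, one can compute a test R², the fraction of held-out variance in Apps that a model explains (a sketch, shown for the best performer, the ridge model):

test.var <- mean((College.test$Apps - mean(College.test$Apps))^2)
1 - College.ridge.err / test.var  # test R^2 for the ridge fit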

Problem 11

We will now try to predict per capita crime rate in the Boston data set.

library(MASS)
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
# Note: no seed is set here, so this split is not reproducible; the analyses
# below in fact estimate test error by cross-validation on the full data set,
# and this 75/25 split goes unused.
training <- sample(nrow(Boston), 0.75*nrow(Boston))
Boston.train <- Boston[training,]
Boston.test <- Boston[-training,]
(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

Best Subset Selection

library(leaps)
set.seed(12)

# leaps::regsubsets has no predict() method, so define one: rebuild the model
# matrix from the formula stored in the call, then multiply by the coefficients
# of the id-th best model.
predict.regsubsets = function(object, newdata, id, ...) {
    form = as.formula(object$call[[2]])
    mat = model.matrix(form, newdata)
    coefi = coef(object, id = id)
    mat[, names(coefi)] %*% coefi
}

k = 10
p = ncol(Boston) - 1
# Ten-fold CV: for each fold, run best subset selection on the other nine folds
# and record the held-out MSE of the best model of each size.
folds = sample(rep(1:k, length = nrow(Boston)))
cv.errors = matrix(NA, k, p)
for (i in 1:k) {
    best.fit = regsubsets(crim ~ ., data = Boston[folds != i, ], nvmax = p)
    for (j in 1:p) {
        pred = predict(best.fit, Boston[folds == i, ], id = j)
        cv.errors[i, j] = mean((Boston$crim[folds == i] - pred)^2)
    }
}
rmse.cv = sqrt(apply(cv.errors, 2, mean))
plot(rmse.cv, pch = 19, type = "b")

Based on the RMSE Cross-Validation Plot, it appears that the Best Subset Model is the model with 8 variables.

summary(best.fit)
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = Boston[folds != i, ], nvmax = p)
## 13 Variables  (and intercept)
##         Forced in Forced out
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## black       FALSE      FALSE
## lstat       FALSE      FALSE
## medv        FALSE      FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
##           zn  indus chas nox rm  age dis rad tax ptratio black lstat medv
## 1  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   " "   " " 
## 2  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   "*"   " " 
## 3  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     "*"   "*"   " " 
## 4  ( 1 )  "*" " "   " "  " " " " " " " " "*" " " " "     "*"   "*"   " " 
## 5  ( 1 )  "*" " "   " "  " " " " " " "*" "*" " " " "     "*"   "*"   " " 
## 6  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " " "     "*"   "*"   " " 
## 7  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " " "     "*"   "*"   "*" 
## 8  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " "*"     "*"   "*"   "*" 
## 9  ( 1 )  "*" "*"   " "  "*" " " " " "*" "*" " " "*"     "*"   "*"   "*" 
## 10  ( 1 ) "*" "*"   " "  "*" " " " " "*" "*" "*" "*"     "*"   "*"   "*" 
## 11  ( 1 ) "*" "*"   "*"  "*" " " " " "*" "*" "*" "*"     "*"   "*"   "*" 
## 12  ( 1 ) "*" "*"   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*"   "*" 
## 13  ( 1 ) "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"   "*"
which.min(rmse.cv)
## [1] 8
Boston.bestsubset.err <- (rmse.cv[which.min(rmse.cv)]^2)
Boston.bestsubset.err
## [1] 44.89955

With Cross-Validation, the Best Subset Model with 8 variables is selected. At 8 variables, the estimated Test Error is 44.89955.
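To report the selected model itself, a sketch that refits best subset selection on the full data set and reads off the 8-variable coefficients (an assumption on my part: the CV above was used only to choose the model size, so the final fit can use all the data):

best.full <- regsubsets(crim ~ ., data = Boston, nvmax = p)
coef(best.full, id = 8)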

Lasso

library(glmnet)
# As before, cv.glmnet supplies its own lambda sequence, so this grid is unused.
grid = 10^seq(4, -2 ,length = 100)
set.seed(1)
# Full data for the X and Y variables; errors below are estimated by
# cross-validation rather than on the held-out split created earlier. (The
# first model.matrix column is an intercept, hence the second "(Intercept)"
# row in the coef() output below.)
Boston.x <- model.matrix(crim~., data = Boston)
Boston.y <- Boston$crim
Boston.lasso <- cv.glmnet(Boston.x, Boston.y, alpha = 1, type.measure = "mse")
plot(Boston.lasso)

coef(Boston.lasso)
## 15 x 1 sparse Matrix of class "dgCMatrix"
##                     1
## (Intercept) 1.0894283
## (Intercept) .        
## zn          .        
## indus       .        
## chas        .        
## nox         .        
## rm          .        
## age         .        
## dis         .        
## rad         0.2643196
## tax         .        
## ptratio     .        
## black       .        
## lstat       .        
## medv        .
Boston.lasso.err <- Boston.lasso$cvm[Boston.lasso$lambda==Boston.lasso$lambda.1se]
Boston.lasso.err
## [1] 55.3338

The lasso does not appear to outperform best subset selection. It is important to keep in mind that the lasso is a variable selection method. The CV error produced here is 55.3338, and the only variable retained in this model is rad.
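Note that coef() on a cv.glmnet object defaults to the one-standard-error rule (lambda.1se), which is why only rad survives above. For comparison, a sketch of the less aggressive lambda.min fit, which attains a lower CV error by definition and typically keeps more variables:

Boston.lasso$cvm[Boston.lasso$lambda == Boston.lasso$lambda.min]
coef(Boston.lasso, s = "lambda.min")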

Ridge Regression

Boston.ridge = cv.glmnet(Boston.x, Boston.y, alpha = 0, type.measure = "mse")
plot(Boston.ridge)

Unlike the lasso, ridge regression keeps all the variables in the model, but pushes the coefficients of those with little relationship to the response close to (though never exactly to) zero.

coef(Boston.ridge)
## 15 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept)  2.190805375
## (Intercept)  .          
## zn          -0.002514410
## indus        0.021067968
## chas        -0.102372822
## nox          1.323474252
## rm          -0.106109537
## age          0.004452629
## dis         -0.066139399
## rad          0.029778399
## tax          0.001383946
## ptratio      0.049720162
## black       -0.001713827
## lstat        0.024367321
## medv        -0.016059212
Boston.ridge.err <- Boston.ridge$cvm[Boston.ridge$lambda==Boston.ridge$lambda.1se]
Boston.ridge.err
## [1] 63.17158

With a CV error of 63.17158, ridge regression trails the lasso here, and best subset selection still outperforms both.

Principal Components Regression

library(pls)
Boston.pcr=pcr(crim~., data=Boston, scale=TRUE, validation="CV")
summary(Boston.pcr)
## Data:    X dimension: 506 13 
##  Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            8.61    7.200    7.197    6.776    6.770    6.770    6.783
## adjCV         8.61    7.198    7.194    6.771    6.761    6.765    6.777
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       6.772    6.644    6.646     6.641     6.629     6.600     6.539
## adjCV    6.766    6.637    6.639     6.633     6.621     6.591     6.529
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       47.70    60.36    69.67    76.45    82.99    88.00    91.14    93.45
## crim    30.69    30.87    39.27    39.61    39.61    39.86    40.14    42.47
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       95.40     97.04     98.46     99.52     100.0
## crim    42.55     42.78     43.04     44.13      45.4

The PCR output indicates that the cross-validated RMSEP is lowest at M = 13 components, i.e., with all components retained, so PCR provides no dimension reduction on this data set.

(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.

As computed above, the model that has the lowest cross-validation error is the one chosen by the Best Subset Selection method. This model had an MSE of 44.89955.

(c) Does your chosen model involve all of the features in the data set? Why or why not?

The chosen best subset model does not include all of the variables in the data set; it uses only 8 of the 13. The variables included are zn, nox, dis, rad, ptratio, black, lstat, and medv. Including the discarded features would account for slightly more of the training-set variation in the response, but only by adding variance to the fit; since the goal for this problem is the best prediction accuracy, the smaller model is preferred.