Problem 2

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

(a) The lasso, relative to least squares, is:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
The lasso works best when the OLS estimates have high variance. Its coefficient estimates are more biased than the OLS estimates but less variable, so the bias/variance trade-off favors the lasso when the OLS variance is high. The reduced flexibility is exactly what buys the reduction in variance.

(b) Repeat (a) for ridge regression relative to least squares.
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
Ridge regression behaves much like the lasso here: it trades a small increase in bias for a reduction in variance, so it is most beneficial when the variance of the OLS fit is high.

(c) Repeat (a) for non-linear methods relative to least squares.
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
Non-linear methods are more flexible than OLS, and therefore have higher variance but lower bias. They improve prediction accuracy when the decrease in bias outweighs the increase in variance.
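To see the trade-off concretely, below is a minimal simulation sketch (assuming the glmnet package and an arbitrary illustrative penalty of lambda = 0.1): across repeated samples, the lasso's coefficient estimates are shrunk (biased) but vary noticeably less than the OLS estimates.

library(glmnet)
set.seed(1)

p <- 20; n <- 50
beta.true <- c(rep(1, 5), rep(0, 15))   # sparse true signal

ols.est <- lasso.est <- matrix(NA, 100, p)
for (i in 1:100) {
  x <- matrix(rnorm(n * p), n, p)
  y <- x %*% beta.true + rnorm(n)
  ols.est[i, ]   <- coef(lm(y ~ x))[-1]
  lasso.est[i, ] <- as.numeric(coef(glmnet(x, y, alpha = 1, lambda = 0.1)))[-1]
}

# Average per-coefficient sampling variance: lower for the lasso
mean(apply(ols.est, 2, var))
mean(apply(lasso.est, 2, var))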

Problem 9

In this exercise, we will predict the number of applications received using the other variables in the College data set.

(a) Split the data set into a training set and a test set.
library(ISLR)    # College data set
library(glmnet)  # ridge regression and the lasso (parts c-d)
library(pls)     # pcr() and plsr() (parts e-f)

set.seed(1)
set.training <- sample(nrow(College), nrow(College)/2)  # half of the rows for training
training.College <- College[set.training, ]
test.College <- College[-set.training, ]
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
lm.College <- lm(Apps ~ ., data = training.College)
summary(lm.College)
## 
## Call:
## lm(formula = Apps ~ ., data = training.College)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5741.2  -479.5    15.3   359.6  7258.0 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.902e+02  6.381e+02  -1.238 0.216410    
## PrivateYes  -3.070e+02  2.006e+02  -1.531 0.126736    
## Accept       1.779e+00  5.420e-02  32.830  < 2e-16 ***
## Enroll      -1.470e+00  3.115e-01  -4.720 3.35e-06 ***
## Top10perc    6.673e+01  8.310e+00   8.030 1.31e-14 ***
## Top25perc   -2.231e+01  6.533e+00  -3.415 0.000708 ***
## F.Undergrad  9.269e-02  5.529e-02   1.676 0.094538 .  
## P.Undergrad  9.397e-03  5.493e-02   0.171 0.864275    
## Outstate    -1.084e-01  2.700e-02  -4.014 7.22e-05 ***
## Room.Board   2.115e-01  7.224e-02   2.928 0.003622 ** 
## Books        2.912e-01  3.985e-01   0.731 0.465399    
## Personal     6.133e-03  8.803e-02   0.070 0.944497    
## PhD         -1.548e+01  6.681e+00  -2.316 0.021082 *  
## Terminal     6.415e+00  7.290e+00   0.880 0.379470    
## S.F.Ratio    2.283e+01  2.047e+01   1.115 0.265526    
## perc.alumni  1.134e+00  6.083e+00   0.186 0.852274    
## Expend       4.857e-02  1.619e-02   2.999 0.002890 ** 
## Grad.Rate    7.490e+00  4.397e+00   1.703 0.089324 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1083 on 370 degrees of freedom
## Multiple R-squared:  0.9389, Adjusted R-squared:  0.9361 
## F-statistic: 334.3 on 17 and 370 DF,  p-value: < 2.2e-16
pred.College.lm <- predict(lm.College, newdata = test.College)

error.College.lm <- mean((test.College$Apps - pred.College.lm)^2)
error.College.lm
## [1] 1135758

The test MSE for this model is 1135758. The R-squared of 0.9389, computed on the training set, suggests that roughly 94% of the variation in Apps is explained by the model.
Since the least squares model includes all of the predictors, it may not be the ideal model for the data.
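Because the training R-squared can flatter the model, a test-set analogue can be computed from the test MSE (a quick sketch reusing the objects above; output not shown):

# Test-set R^2: fraction of the variance of Apps in the held-out data
# accounted for by the predictions
1 - error.College.lm / mean((test.College$Apps - mean(test.College$Apps))^2)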

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
set.seed(1)
x.College <- model.matrix(Apps ~ ., data = training.College)[ ,-1]
y.College <- training.College$Apps
x.test.College <- model.matrix(Apps ~ ., data = test.College)[ ,-1]
y.test.College <- test.College$Apps

cv.College <- cv.glmnet(x.College, y.College, alpha = 0)
bestlam.College <- cv.College$lambda.min
bestlam.College
## [1] 405.8404
plot(cv.College)

ridge.College <- glmnet(x.College, y.College, alpha = 0, lambda = bestlam.College)
dim(coef(ridge.College))
## [1] 18  1
pred.ridge.College <- predict(ridge.College, s = bestlam.College, newx = x.test.College)
mean((pred.ridge.College - y.test.College)^2)
## [1] 976897.6
err.ridge.College <- mean((pred.ridge.College - y.test.College)^2)
err.ridge.College
## [1] 976897.6
coef.ridge.College <- predict(ridge.College, type = 'coefficient', s = bestlam.College)
coef.ridge.College
## 18 x 1 sparse Matrix of class "dgCMatrix"
##                        s1
## (Intercept) -2.028526e+03
## PrivateYes  -2.881497e+02
## Accept       1.129842e+00
## Enroll       3.843311e-01
## Top10perc    3.047525e+01
## Top25perc   -3.273388e-01
## F.Undergrad  5.542918e-02
## P.Undergrad  2.051701e-02
## Outstate    -3.895379e-02
## Room.Board   2.631481e-01
## Books        4.149720e-01
## Personal    -3.236328e-02
## PhD         -7.849164e+00
## Terminal    -1.025938e+00
## S.F.Ratio    2.767028e+01
## perc.alumni -5.364354e+00
## Expend       5.879543e-02
## Grad.Rate    8.625249e+00

The test MSE for the ridge regression model is 976897.6.
This is quite a bit lower than the least squares MSE, but the model still retains all of the predictors.
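As an aside, cv.glmnet also records lambda.1se, the largest λ whose CV error is within one standard error of the minimum; a sketch of inspecting that more conservative fit (output not shown):

# One-standard-error rule: a more heavily regularized alternative to lambda.min
cv.College$lambda.1se
coef(cv.College, s = "lambda.1se")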

(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
cv.lasso.College <- cv.glmnet(x.College, y.College, alpha = 1)
bestlam.lasso.College <- cv.lasso.College$lambda.min
bestlam.lasso.College
## [1] 1.97344
plot(cv.lasso.College)

lasso.College <- glmnet(x.College, y.College, alpha = 1, lambda = bestlam.lasso.College)
dim(coef(lasso.College))
## [1] 18  1
pred.lasso.College <- predict(lasso.College, s = bestlam.lasso.College, newx = x.test.College)
err.lasso.College <- mean((pred.lasso.College - y.test.College)^2)
err.lasso.College
## [1] 1116402
coef.lasso.College <- predict(lasso.College, type = 'coefficient', s = bestlam.lasso.College)
coef.lasso.College
## 18 x 1 sparse Matrix of class "dgCMatrix"
##                        s1
## (Intercept) -763.46736152
## PrivateYes  -313.76789127
## Accept         1.76295802
## Enroll        -1.32011619
## Top10perc     64.98040141
## Top25perc    -20.93406828
## F.Undergrad    0.07157355
## P.Undergrad    0.01197052
## Outstate      -0.10493165
## Room.Board     0.20915417
## Books          0.29265420
## Personal       0.00351509
## PhD          -14.48803559
## Terminal       5.33334392
## S.F.Ratio     21.67584610
## perc.alumni    0.51564913
## Expend         0.04812700
## Grad.Rate      7.01799293

The test MSE for the lasso model is 1116402.
The lasso does not outperform ridge regression here, but its error is still smaller than that of the least squares model.
At this small λ (about 1.97) the lasso does not shrink any coefficient all the way to zero, so all 17 predictors remain in the model.
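This can be confirmed from the cv.glmnet object itself, which records the number of non-zero coefficients at every λ on the path (a sketch, output not shown):

# Non-zero coefficient counts along the lasso path; at lambda.min all 17
# predictors survive, while larger lambdas give sparser fits
cbind(lambda = cv.lasso.College$lambda, nonzero = cv.lasso.College$nzero)
sum(coef.lasso.College != 0) - 1   # 17 non-zero slopes at lambda.min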

(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
set.seed(1)

pcr.College <- pcr(Apps ~ ., data = training.College, scale = TRUE, validation = "CV")
summary(pcr.College)
## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4288     4006     2373     2372     2069     1961     1919
## adjCV         4288     4007     2368     2369     1999     1948     1911
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1919     1921     1876      1832      1832      1836      1837
## adjCV     1912     1915     1868      1821      1823      1827      1827
##        14 comps  15 comps  16 comps  17 comps
## CV         1853      1759      1341      1270
## adjCV      1850      1733      1326      1257
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       32.20    57.78    65.31    70.99    76.37    81.27     84.8    87.85
## Apps    13.44    70.93    71.07    79.87    81.15    82.25     82.3    82.33
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.62     92.91     94.98     96.74     97.79     98.72     99.42
## Apps    83.38     84.76     84.80     84.84     85.11     85.14     90.55
##       16 comps  17 comps
## X        99.88    100.00
## Apps     93.42     93.89
validationplot(pcr.College, validation = "MSEP")

pred.pcr.College <- predict(pcr.College, x.test.College, ncomp = 4)
mean((pred.pcr.College - y.test.College)^2)
## [1] 2168577
coef.pcr.College <- pcr(Apps ~ ., data = training.College, scale = TRUE, ncomp = 4)
summary(coef.pcr.College)
## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 4
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps
## X       32.20    57.78    65.31    70.99
## Apps    13.44    70.93    71.07    79.87

The PCR fit with the lowest cross-validated MSE (the squared CV RMSEP) uses all 17 components. Since that is no different from running least squares on all of the predictors, we turn to the validation plot for a more parsimonious choice.
The plot shows that the biggest drop in MSE occurs around 3-4 components, followed by another modest drop around 15 components.
To avoid a model that is too complex to interpret, M = 4 appears to be the best balance: it explains 79.87% of the training variation in Apps, and there is no marked improvement until M = 15.
The test MSE for M = 4 is 2168577.
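Rather than reading M off the plot, the CV-minimizing M can be extracted programmatically from the pls package's MSEP() accessor (a sketch; the first entry of the array is the intercept-only model, hence the subtraction):

# Cross-validated MSEP for each M; index 1 is the 0-component model
cv.mse.pcr <- MSEP(pcr.College)$val["CV", 1, ]
which.min(cv.mse.pcr) - 1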

(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
set.seed(1)

pls.College <- plsr(Apps ~ ., data = training.College, scale = TRUE, validation = "CV")
summary(pls.College)
## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4288     2217     2019     1761     1630     1533     1347
## adjCV         4288     2211     2012     1749     1605     1510     1331
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1309     1303     1286      1283      1283      1277      1271
## adjCV     1296     1289     1273      1270      1270      1264      1258
##        14 comps  15 comps  16 comps  17 comps
## CV         1270      1270      1270      1270
## adjCV      1258      1257      1257      1257
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       27.21    50.73    63.06    65.52    70.20    74.20    78.62    80.81
## Apps    75.39    81.24    86.97    91.14    92.62    93.43    93.56    93.68
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       83.29     87.17     89.15     91.37     92.58     94.42     96.98
## Apps    93.76     93.79     93.83     93.86     93.88     93.89     93.89
##       16 comps  17 comps
## X        98.78    100.00
## Apps     93.89     93.89
validationplot(pls.College, validation = "MSEP")

pred.pls.College <- predict(pls.College, x.test.College, ncomp = 6)
mean((pred.pls.College - y.test.College)^2)
## [1] 1066991
coef.pls.College <- plsr(Apps ~ ., data = training.College, scale = TRUE, ncomp = 6)
summary(coef.pls.College)
## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 6
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## X       27.21    50.73    63.06    65.52    70.20    74.20
## Apps    75.39    81.24    86.97    91.14    92.62    93.43

The PLS fit with the lowest cross-validated MSE (the squared CV RMSEP) uses about 14 components. Since that may be more complex than we would like, we consult the validation plot, which shows that the biggest drop in MSE happens by 6 components and that the error is essentially flat as further components are added. We can therefore choose M = 6 with confidence, consistent with the cross-validation results.
A PLS model with 6 components explains approximately 93.43% of the training variation in Apps, and increasing the number of components improves that figure very little.
The test MSE for M = 6 is 1066991.
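The pls package also offers selectNcomp(), which formalizes this "flat after M = 6" reasoning; with the one-sigma heuristic it returns the smallest M whose CV error is within one standard error of the minimum (a sketch; the exact value depends on the random CV folds):

# Smallest M whose CV error is within one SE of the global minimum
selectNcomp(pls.College, method = "onesigma", plot = FALSE)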

(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

Test MSEs
Least Squares: 1135758
Ridge: 976897.6
LASSO: 1116402
PCR (M = 4): 2168577
PLS (M = 6): 1066991

Judging by test MSE alone, ridge regression fits best, since it has the lowest error.
Taking model complexity into account, PLS may be the better choice: its MSE is not much higher than the ridge MSE, and a 6-component model is much simpler to interpret than one using all 17 predictors. PLS is further justified by the fact that it explains roughly 93% of the variation in Apps.
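To answer "how accurately" on an interpretable scale, the test MSEs can be converted into test-set R-squared values (a sketch reusing the prediction objects above; output not shown):

# Test R^2 = 1 - MSE / variance of Apps in the test set, for all five fits
test.var <- mean((y.test.College - mean(y.test.College))^2)
errs <- c(OLS   = error.College.lm,
          Ridge = err.ridge.College,
          Lasso = err.lasso.College,
          PCR   = mean((pred.pcr.College - y.test.College)^2),
          PLS   = mean((pred.pls.College - y.test.College)^2))
round(1 - errs / test.var, 3)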


Problem 11

We will now try to predict per capita crime rate in the Boston data set.

(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

Split Data into Training and Test Data

library(MASS)  # Boston data set

set.seed(1)
set.training.Boston <- sample(nrow(Boston), nrow(Boston)/2)
training.Boston <- Boston[set.training.Boston, ]
test.Boston <- Boston[-set.training.Boston, ]

Set Up X & Y

set.seed(1)
x.Boston <- model.matrix(crim ~ ., data = training.Boston)[ ,-1]
y.Boston <- training.Boston$crim
x.test.Boston <- model.matrix(crim ~ ., data = test.Boston)[ ,-1]
y.test.Boston <- test.Boston$crim

Ridge Regression

cv.ridge.Boston <- cv.glmnet(x.Boston, y.Boston, alpha = 0)  # alpha = 0 selects the ridge penalty
bestlam.ridge.Boston <- cv.ridge.Boston$lambda.min
bestlam.ridge.Boston
## [1] 0.05148183
plot(cv.ridge.Boston)

ridge.Boston <- glmnet(x.Boston, y.Boston, alpha = 0, lambda = bestlam.ridge.Boston)
dim(coef(ridge.Boston))
## [1] 14  1
pred.ridge.Boston <- predict(ridge.Boston, s = bestlam.ridge.Boston, newx = x.test.Boston)
err.ridge.Boston <- mean((pred.ridge.Boston - y.test.Boston)^2)
err.ridge.Boston
## [1] 40.99319
coef.ridge.Boston <- predict(ridge.Boston, type = 'coefficient', s = bestlam.ridge.Boston)
coef.ridge.Boston
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                        s1
## (Intercept) 18.9622903072
## zn           0.0368274919
## indus       -0.1257403173
## chas        -0.4800347940
## nox         -8.2790505636
## rm           0.1165681662
## age         -0.0006106162
## dis         -0.8428452301
## rad          0.5341757612
## tax         -0.0001555180
## ptratio     -0.3827058704
## black       -0.0129713964
## lstat        0.2578264241
## medv        -0.1607986419

Test MSE: 40.99319 (the value printed above).
Several coefficients are shrunk close to zero, but because ridge regression never sets coefficients exactly to zero, all 13 predictors remain in the model, so it may not be the best fit if a simpler model is wanted.

LASSO

cv.lasso.Boston <- cv.glmnet(x.Boston, y.Boston, alpha = 1)
bestlam.lasso.Boston <- cv.lasso.Boston$lambda.min
bestlam.lasso.Boston
## [1] 0.005029826
plot(cv.lasso.Boston)

lasso.Boston <- glmnet(x.Boston, y.Boston, alpha = 1, lambda = bestlam.lasso.Boston)
dim(coef(lasso.Boston))
## [1] 14  1
pred.lasso.Boston <- predict(lasso.Boston, s = bestlam.lasso.Boston, newx = x.test.Boston)
err.lasso.Boston <- mean((pred.lasso.Boston - y.test.Boston)^2)
err.lasso.Boston
## [1] 41.47312
coef.lasso.Boston <- predict(lasso.Boston, type = 'coefficient', s = bestlam.lasso.Boston)
coef.lasso.Boston
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                        s1
## (Intercept)  23.236959105
## zn            0.042603207
## indus        -0.125417411
## chas         -0.681840713
## nox         -10.567296772
## rm            0.387379613
## age          -0.006668022
## dis          -1.069374865
## rad           0.612088100
## tax          -0.004311220
## ptratio      -0.472344094
## black        -0.012666419
## lstat         0.264631900
## medv         -0.208913686

Test MSE: 41.47312 (the value printed above).
At this very small λ (about 0.005) the penalty is light, and the coefficient table shows that no coefficient is set exactly to zero, so all 13 predictors remain in the model.
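To confirm, we can count the non-zero coefficients directly; the more conservative lambda.1se choice would typically drop several predictors (a sketch, output not shown):

# Non-zero slopes at lambda.min (all 13) versus the sparser lambda.1se fit
sum(coef.lasso.Boston != 0) - 1
sum(coef(cv.lasso.Boston, s = "lambda.1se") != 0) - 1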

PCR

set.seed(1)

pcr.Boston <- pcr(crim ~ ., data = training.Boston, scale = TRUE, validation = "CV")
summary(pcr.Boston)
## Data:    X dimension: 253 13 
##  Y dimension: 253 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           9.275    7.555    7.549    7.093    6.926    6.932    6.977
## adjCV        9.275    7.550    7.544    7.088    6.920    6.926    6.968
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       6.976    6.871    6.845     6.848     6.797     6.764     6.700
## adjCV    6.967    6.859    6.843     6.842     6.786     6.751     6.686
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       48.51     60.4    69.86    77.08    82.80    87.68    91.24    93.56
## crim    34.94     35.2    42.83    45.47    45.57    45.58    45.75    47.59
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       95.47     97.08     98.48     99.54    100.00
## crim    47.68     48.75     49.31     50.14     51.37
validationplot(pcr.Boston, validation = "MSEP")

pred.pcr.Boston <- predict(pcr.Boston, x.test.Boston, ncomp = 1)
mean((pred.pcr.Boston - y.test.Boston)^2)
## [1] 48.24988
coef.pcr.Boston <- pcr(crim ~ ., data = training.Boston, scale = TRUE, ncomp = 1)
summary(coef.pcr.Boston)
## Data:    X dimension: 253 13 
##  Y dimension: 253 1
## Fit method: svdpc
## Number of components considered: 1
## TRAINING: % variance explained
##       1 comps
## X       48.51
## crim    34.94
pred.pcr.Boston4 <- predict(pcr.Boston, x.test.Boston, ncomp = 4)
mean((pred.pcr.Boston4 - y.test.Boston)^2)
## [1] 43.91107
coef.pcr.Boston4 <- pcr(crim ~ ., data = training.Boston, scale = TRUE, ncomp = 4)
summary(coef.pcr.Boston4)
## Data:    X dimension: 253 13 
##  Y dimension: 253 1
## Fit method: svdpc
## Number of components considered: 4
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps
## X       48.51     60.4    69.86    77.08
## crim    34.94     35.2    42.83    45.47

Test MSE (M = 1): 48.24988
Test MSE (M = 4): 43.91107

The validation plot shows a large drop in CV error with the first component, and another, smaller drop once 4 components have been added.
Comparing the test MSEs for the two choices of M, the M = 4 model appears to be the better fit.
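The comparison can be extended to every M at once by looping the prediction over the number of components (a sketch using the held-out half as a validation set; output not shown):

# Validation-set MSE for each possible number of components
pcr.test.mse <- sapply(1:13, function(m)
  mean((predict(pcr.Boston, x.test.Boston, ncomp = m) - y.test.Boston)^2))
round(pcr.test.mse, 2)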

PLS

set.seed(1)

pls.Boston <- plsr(crim ~ ., data = training.Boston, scale = TRUE, validation = "CV")
summary(pls.Boston)
## Data:    X dimension: 253 13 
##  Y dimension: 253 1
## Fit method: kernelpls
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           9.275    7.328    6.842    6.784    6.741    6.718    6.695
## adjCV        9.275    7.322    6.834    6.768    6.728    6.707    6.683
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       6.722    6.707    6.696     6.701     6.700     6.700     6.700
## adjCV    6.706    6.693    6.683     6.688     6.686     6.686     6.686
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       48.10    57.46    62.68    69.65    77.95    80.82    84.40    87.65
## crim    38.94    47.55    49.73    50.61    50.85    51.17    51.29    51.35
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       90.47     93.07     96.66     98.26    100.00
## crim    51.37     51.37     51.37     51.37     51.37
validationplot(pls.Boston, validation = "MSEP")

pred.pls.Boston <- predict(pls.Boston, x.test.Boston, ncomp = 2)
mean((pred.pls.Boston - y.test.Boston)^2)
## [1] 42.74086
coef.pls.Boston <- plsr(crim ~ ., data = training.Boston, scale = TRUE, ncomp = 2)
summary(coef.pls.Boston)
## Data:    X dimension: 253 13 
##  Y dimension: 253 1
## Fit method: kernelpls
## Number of components considered: 2
## TRAINING: % variance explained
##       1 comps  2 comps
## X       48.10    57.46
## crim    38.94    47.55

Test MSE: 42.74086
There is a large drop in CV error over the first two components, after which the error levels off as further components are added.

(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, crossvalidation, or some other reasonable alternative, as opposed to using training error.

Test MSEs
Ridge: 40.99319
LASSO: 41.47312
PCR (M = 4): 43.91107
PLS (M = 2): 42.74086

Ridge regression has the lowest test MSE, so it may be the best choice if accuracy is the most important factor; the lasso is close behind, but at the λ chosen here it keeps all 13 predictors and so offers no real simplification.
On the other hand, PLS is similar in accuracy and adds only two components to the model, so it might be the best alternative if simplicity is the most important factor.
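For scale, a test MSE near 41 corresponds to a root-MSE of roughly 6.4 in the units of the per capita crime rate. Expressed as test-set R-squared (a sketch reusing the objects above; output not shown):

# Test R^2 for the two candidate models on the held-out half of Boston
test.var.Boston <- mean((y.test.Boston - mean(y.test.Boston))^2)
1 - err.ridge.Boston / test.var.Boston                           # ridge
1 - mean((pred.pls.Boston - y.test.Boston)^2) / test.var.Boston  # PLS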

(c) Does your chosen model involve all of the features in the data set? Why or why not?

I proposed two models based on different criteria, and would need more information on the purpose of the model to decide between them.
Strictly speaking, both involve all 13 features: ridge regression shrinks coefficients but never removes predictors, and each PLS component is a linear combination of all of the original features. The PLS model is simpler only in the sense that it summarizes the predictors with two components, not because any features are dropped.
