Problem 2. For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

(a) The lasso, relative to least squares, is:

  iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. Because the LASSO can set coefficients exactly to 0, it effectively reduces the number of variables in the model, which reduces variance. If the accompanying increase in bias is small enough, the gain from the reduced variance wins out and prediction accuracy improves.

(b) Repeat (a) for ridge regression relative to least squares.

  iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. Although ridge regression keeps all variables in the model, shrinking their coefficients toward 0 still reduces variance, which is the desired effect.

(c) Repeat (a) for non-linear methods relative to least squares.

  ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias. More flexible methods reduce bias (at the cost of added variance), so non-linear models, which tend to have a high degree of flexibility, are ideal when a non-linear relationship exists between the predictor and outcome variables.
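To make the tradeoff in (a) and (b) concrete, here is a minimal simulation sketch (settings and variable names are illustrative, not part of the original solution) comparing the sampling variance of least squares and ridge coefficient estimates across repeated data sets:

#Sketch: shrinkage trades a small increase in bias for a drop in variance
library(glmnet)
set.seed(1)
n=50; p=20
beta=c(rep(1,5), rep(0,15))   #sparse true signal
ols.est=matrix(NA, 100, p)
ridge.est=matrix(NA, 100, p)
for (r in 1:100) {
  X=matrix(rnorm(n*p), n, p)
  y=as.numeric(X %*% beta + rnorm(n))
  ols.est[r,]=coef(lm(y~X))[-1]
  ridge.est[r,]=as.numeric(coef(glmnet(X, y, alpha=0, lambda=1)))[-1]
}
mean(apply(ols.est, 2, var))    #least squares: higher variance
mean(apply(ridge.est, 2, var))  #ridge: lower variance, estimates biased toward 0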

Problem 9. In this exercise, we will predict the number of applications received using the other variables in the [College](https://rdrr.io/cran/ISLR/man/College.html) data set.

(a) Split the data set into a training set and a test set.

library(ISLR)
set.seed(1)
#Create a train and test set from the College data
testid=sample(1:nrow(College), nrow(College)/2)   #half the rows are held out as the test set
test=College[testid,]
train=College[-testid,]
#dim(College)
names(College)
##  [1] "Private"     "Apps"        "Accept"      "Enroll"      "Top10perc"  
##  [6] "Top25perc"   "F.Undergrad" "P.Undergrad" "Outstate"    "Room.Board" 
## [11] "Books"       "Personal"    "PhD"         "Terminal"    "S.F.Ratio"  
## [16] "perc.alumni" "Expend"      "Grad.Rate"

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

#Fit the model with everything
lm.fit=lm(Apps~., data=train)
summary(lm.fit)
## 
## Call:
## lm(formula = Apps ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2397.9  -400.0   -37.9   263.8  6330.3 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -274.36837  496.00827  -0.553 0.580492    
## PrivateYes  -812.55253  181.80228  -4.469 1.04e-05 ***
## Accept         1.20602    0.06300  19.143  < 2e-16 ***
## Enroll        -0.29820    0.22026  -1.354 0.176612    
## Top10perc     34.64854    7.04843   4.916 1.33e-06 ***
## Top25perc     -8.28312    5.76933  -1.436 0.151926    
## F.Undergrad    0.09624    0.03831   2.512 0.012418 *  
## P.Undergrad    0.01715    0.03718   0.461 0.644851    
## Outstate      -0.04682    0.02587  -1.810 0.071124 .  
## Room.Board     0.09182    0.06131   1.497 0.135117    
## Books         -0.11015    0.27284  -0.404 0.686649    
## Personal       0.03498    0.08627   0.406 0.685312    
## PhD           -1.98865    6.04406  -0.329 0.742322    
## Terminal     -12.36690    6.70691  -1.844 0.065994 .  
## S.F.Ratio      8.73527   15.70896   0.556 0.578499    
## perc.alumni   -5.87603    5.29707  -1.109 0.268020    
## Expend         0.12014    0.01867   6.436 3.79e-10 ***
## Grad.Rate     12.70646    3.71635   3.419 0.000698 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 923.7 on 371 degrees of freedom
## Multiple R-squared:  0.9295, Adjusted R-squared:  0.9262 
## F-statistic: 287.6 on 17 and 371 DF,  p-value: < 2.2e-16
#Retrieve the prediction for the test set
lm.pred=predict(lm.fit, test)
#Compute the mean squared error on the test set
lm.error=mean((test$Apps-lm.pred)^2)
lm.error
## [1] 1653835

The test error is 1653835.

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

library(glmnet)
x=model.matrix(Apps~., College[,-17])
#Caution: College[,-17] drops column 17 (Expend), so Expend is excluded from x
#and from the ridge and LASSO fits below. model.matrix(Apps~., College)[,-1]
#would keep all predictors and drop the redundant intercept column instead.
y=College$Apps
set.seed(1)
train=sample(1:nrow(x), nrow(x)/2)   #train/test now hold row indices (a fresh split), replacing the data frames from (a)
#head(train)
test=(-train)
y.test=y[test]

grid=10^seq(10,-2,length=100)
cv.out=cv.glmnet(x[train,],y[train],lambda = grid, alpha = 0, thresh=1e-12)
plot(cv.out)

bestlam=cv.out$lambda.min
ridge.pred=predict(cv.out, s=bestlam, newx = x[test,])
mean((ridge.pred-y.test)^2)
## [1] 1209311

The test error is 1209311.
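For reference, the coefficients at the CV-selected λ can be pulled straight from the cv.glmnet object (a short sketch reusing the cv.out and bestlam objects from above):

bestlam   #λ chosen by cross-validation
#Ridge keeps every predictor; the coefficients are shrunk toward (but not to) 0
predict(cv.out, type="coefficients", s=bestlam)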

(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

lasso.mod=glmnet(x[train,], y[train], alpha=1, lambda = grid)
plot(lasso.mod)

set.seed(1)
cv.out=cv.glmnet(x[train,], y[train], alpha=1)
plot(cv.out)

bestlam=cv.out$lambda.min
lasso.pred=predict(lasso.mod, s=bestlam, newx=x[test,])
mean((lasso.pred-y.test)^2)
## [1] 1190120
out=glmnet(x,y,alpha = 1, lambda = grid)
lasso.coef=predict(out, type='coefficient', s=bestlam)
lasso.coef
## 18 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept)   62.58919214
## (Intercept)    .         
## PrivateYes  -566.41435509
## Accept         1.57833046
## Enroll        -0.74995089
## Top10perc     61.56356704
## Top25perc    -18.74197086
## F.Undergrad    0.04042129
## P.Undergrad    0.04940370
## Outstate      -0.04934853
## Room.Board     0.18023292
## Books          0.07241693
## Personal       0.04577494
## PhD           -7.87188542
## Terminal      -1.82551823
## S.F.Ratio    -13.76410979
## perc.alumni    0.07433295
## Grad.Rate      6.80183612

The test error is 1190120. Surprisingly, at the chosen λ the LASSO sets none of the available coefficients to 0: all 16 predictors in x have non-zero estimates (recall that Expend was excluded when x was built, and the zeroed row in the printout is the redundant intercept column from model.matrix).
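As a sanity check, the number of non-zero estimates can be counted programmatically from the sparse coefficient matrix (a small sketch using lasso.coef from above):

#Count the non-zero coefficient estimates, excluding the intercept row
sum(lasso.coef != 0) - 1   #16, matching the printout above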

(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

library(pls)
set.seed(2)
pcr.fit=pcr(Apps~., data=College, scale=TRUE, validation="CV", subset=train)
summary(pcr.fit)
## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4288     4032     2359     2362     2055     1924     1881
## adjCV         4288     4035     2355     2358     1971     1911     1875
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1882     1890     1834      1799      1798      1799      1794
## adjCV     1877     1887     1829      1790      1791      1792      1787
##        14 comps  15 comps  16 comps  17 comps
## CV         1795      1764      1348      1278
## adjCV      1790      1737      1332      1265
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       32.20    57.78    65.31    70.99    76.37    81.27     84.8    87.85
## Apps    13.44    70.93    71.07    79.87    81.15    82.25     82.3    82.33
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.62     92.91     94.98     96.74     97.79     98.72     99.42
## Apps    83.38     84.76     84.80     84.84     85.11     85.14     90.55
##       16 comps  17 comps
## X        99.88    100.00
## Apps     93.42     93.89
validationplot(pcr.fit, val.type = 'MSEP')

#Note: pcr.fit$ncomp is the total number of components fitted, not the CV choice
pcr.fit$ncomp
## [1] 17
pcr.pred=predict(pcr.fit, College[test,], ncomp = 15)
mean((pcr.pred-College$Apps[test])^2)
## [1] 1268629

Cross-validation error is lowest at M = 17, which is simply the full least squares fit, so we choose M = 15 to retain some dimension reduction. The test error obtained with M = 15 is 1268629.
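To check the choice of M directly on the held-out data, the test MSE can be computed for every number of components (a sketch reusing pcr.fit and the test indices from above):

#Test MSE as a function of the number of components
pcr.mse=sapply(1:17, function(m)
  mean((predict(pcr.fit, College[test,], ncomp=m) - College$Apps[test])^2))
which.min(pcr.mse)   #M with the lowest held-out error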

(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

set.seed(1)
pls.fit=plsr(Apps~., data=College, scale=TRUE, validation="CV", subset=train)
summary(pls.fit)
## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4288     2217     2019     1761     1630     1533     1347
## adjCV         4288     2211     2012     1749     1605     1510     1331
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1309     1303     1286      1283      1283      1277      1271
## adjCV     1296     1289     1273      1270      1270      1264      1258
##        14 comps  15 comps  16 comps  17 comps
## CV         1270      1270      1270      1270
## adjCV      1258      1257      1257      1257
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       27.21    50.73    63.06    65.52    70.20    74.20    78.62    80.81
## Apps    75.39    81.24    86.97    91.14    92.62    93.43    93.56    93.68
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       83.29     87.17     89.15     91.37     92.58     94.42     96.98
## Apps    93.76     93.79     93.83     93.86     93.88     93.89     93.89
##       16 comps  17 comps
## X        98.78    100.00
## Apps     93.89     93.89
validationplot(pls.fit, val.type = "MSEP")

pls.pred=predict(pls.fit, College[test,], ncomp = 4)
mean((pls.pred-College$Apps[test])^2)
## [1] 1476030
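Before refitting on the full data, the CV-selected M can also be read off programmatically (a sketch; RMSEP comes from the pls package, and the first entry of the returned vector corresponds to the 0-component model):

#Cross-validated RMSEP for M = 0..17; subtract 1 to convert the index to M
cv.rmsep=RMSEP(pls.fit, estimate="CV")$val[1,1,]
which.min(cv.rmsep) - 1
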
#Perform PLS on full data set
pls.fit=plsr(Apps~., data=College, scale=TRUE, ncomp=4)
summary(pls.fit)
## Data:    X dimension: 777 17 
##  Y dimension: 777 1
## Fit method: kernelpls
## Number of components considered: 4
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps
## X       25.76    40.33    62.59    64.97
## Apps    78.01    85.14    87.67    90.73

We use M = 4. Note that the cross-validated error keeps falling until roughly M = 13, so this is a parsimonious choice rather than the CV minimum; the test error obtained with M = 4 is 1476030.

(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

Judged by test error, the LASSO is the best model to use here. The linear model attains an adjusted R^2 of 0.9262, and PLS and PCR also explain a large share of the variance, so all five approaches predict applications reasonably well; going strictly by test error, however, the LASSO is preferred.

Test Errors:

Linear: 1653835
Ridge: 1209311
Lasso: 1190120
PLS: 1476030 (at 15 components: 96.98% of X variance and 93.89% of Apps variance explained)
PCR: 1268629 (at 15 components: 99.42% of X variance and 90.55% of Apps variance explained)
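To put the errors on a common scale, a test-set R^2 (one minus MSE over the variance of Apps in the held-out set) can be computed from the stored predictions; a sketch using the ridge and LASSO predictions and y.test from parts (c) and (d):

#Test R^2 on the held-out set; values near 1 indicate accurate prediction
test.var=mean((mean(y.test) - y.test)^2)
c(ridge=1 - mean((ridge.pred - y.test)^2)/test.var,
  lasso=1 - mean((lasso.pred - y.test)^2)/test.var)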

Problem 11. We will now try to predict per capita crime rate in the [Boston](https://rdrr.io/cran/mlbench/man/BostonHousing.html) data set.

(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

library(MASS)   #provides the Boston data set

Best Subset Selection:

library(leaps)

regfit.full<-regsubsets(crim~., Boston, nvmax = 13)   #13 predictors are available for crim
reg.summary=summary(regfit.full)
reg.summary$rsq
##  [1] 0.3912567 0.4207965 0.4286123 0.4334892 0.4392738 0.4440173 0.4476594
##  [8] 0.4504606 0.4524408 0.4530572 0.4535605 0.4540031 0.4540104
#Find the model size that optimizes each criterion: min RSS, max adjusted R^2, min Cp, min BIC
which.min(reg.summary$rss)
## [1] 13
which.max(reg.summary$adjr2)
## [1] 9
which.min(reg.summary$cp)
## [1] 8
which.min(reg.summary$bic)
## [1] 3
par(mfrow=c(2,2))
plot(reg.summary$rss, xlab="Number of Variables",  ylab = "RSS", type ="l")
points(13, reg.summary$rss[13], col='red', cex=2, pch=20)
plot(reg.summary$cp, xlab="Number of Variables",  ylab = "CP", type ="l")
points(8, reg.summary$cp[8], col='red', cex=2, pch=20)

plot(reg.summary$adjr2, xlab="Number of Variables",  ylab = "adjusted r square", type ="l")
points(9, reg.summary$adjr2[9], col='red', cex=2, pch=20)
plot(reg.summary$bic, xlab="Number of Variables",  ylab = "BIC", type ="l")
points(3, reg.summary$bic[3], col='red', cex=2, pch=20)

#preview the coefficients for the best subset model
coef(regfit.full, 8)
##   (Intercept)            zn           nox           dis           rad 
##  19.683127801   0.043293393 -12.753707757  -0.918318253   0.532616533 
##       ptratio         black         lstat          medv 
##  -0.310540942  -0.007922426   0.110173124  -0.174207166
#Show the adjusted R^2 for the chosen 8-variable model
reg.summary$adjr2[8]
## [1] 0.4416149

Using Cp, best subset selection suggests an 8-variable model. The selected variables are:
zn
nox
dis
rad
ptratio
black
lstat
medv
and the resulting model has an adjusted R^2 of 0.4416 (44.16%).
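One caveat, picked up again in part (b): regsubsets has no built-in predict method, so Cp, BIC, and adjusted R^2 above are all computed on the training data. A small helper in the style of the ISLR lab (a sketch; the S3 name lets predict() dispatch on regsubsets objects) would let validation-set error drive the choice instead:

#Sketch of a predict method for regsubsets objects (cf. ISLR Lab 6.5.3)
predict.regsubsets=function(object, newdata, id, ...) {
  form=as.formula(object$call[[2]])   #formula used in the original fit
  mat=model.matrix(form, newdata)
  coefi=coef(object, id=id)           #coefficients of the id-variable model
  mat[, names(coefi)] %*% coefi
}

With this helper and the train/test split created below for ridge, something like mean((Boston$crim[test] - predict(regfit.full, Boston[test,], id=8))^2) would give a held-out error for the 8-variable model, provided the regsubsets fit is redone on the training rows first.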

Ridge Regression:

names(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
x=model.matrix(crim~., Boston[,-14])
#Caution: Boston[,-14] drops column 14 (medv), so medv is excluded from x and
#from the ridge and LASSO fits below. model.matrix(crim~., Boston)[,-1] would
#keep all predictors and drop the redundant intercept column instead.
y=Boston$crim
set.seed(1)
train=sample(1:nrow(x), nrow(x)/2)
#head(train)
test=(-train)
y.test=y[test]

grid=10^seq(10,-2,length=100)
cv.out=cv.glmnet(x[train,],y[train],lambda = grid, alpha = 0, thresh=1e-12)
plot(cv.out)

bestlam=cv.out$lambda.min
ridge.pred=predict(cv.out, s=bestlam, newx = x[test,])
mean((ridge.pred-y.test)^2)
## [1] 41.89459

The ridge regression model has a test error of 41.89 and retains all variables by virtue of the ridge penalty itself.

LASSO

lasso.mod=glmnet(x[train,], y[train], alpha=1, lambda = grid)
plot(lasso.mod)

set.seed(1)
cv.out=cv.glmnet(x[train,], y[train], alpha=1)
plot(cv.out)

bestlam=cv.out$lambda.min
lasso.pred=predict(lasso.mod, s=bestlam, newx=x[test,])
mean((lasso.pred-y.test)^2)
## [1] 41.94497
out=glmnet(x,y,alpha = 1, lambda = grid)
lasso.coef=predict(out, type='coefficient', s=bestlam)
lasso.coef
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept)  6.478907123
## (Intercept)  .          
## zn           0.031635956
## indus       -0.063492351
## chas        -1.162574668
## nox         -4.920418670
## rm          -0.183476785
## age          .          
## dis         -0.594671142
## rad          0.504869664
## tax          .          
## ptratio     -0.040790700
## black       -0.009212781
## lstat        0.230728259

The LASSO has a test error of 41.94 and effectively removes age and tax from the model by setting their coefficients to 0, leaving 10 predictors with non-zero coefficients (recall that medv was excluded when x was built).

PCR

library(pls)
set.seed(2)
pcr.fit=pcr(crim~., data=Boston, scale=TRUE, validation="CV", subset=train)
summary(pcr.fit)
## Data:    X dimension: 253 13 
##  Y dimension: 253 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           9.275    7.617    7.605    7.206    7.014    7.050    7.219
## adjCV        9.275    7.608    7.597    7.190    7.001    7.037    7.199
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       7.212    7.130    7.135     7.036     7.006     6.976     6.924
## adjCV    7.191    7.105    7.113     7.012     6.983     6.952     6.898
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       48.51     60.4    69.86    77.08    82.80    87.68    91.24    93.56
## crim    34.94     35.2    42.83    45.47    45.57    45.58    45.75    47.59
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       95.47     97.08     98.48     99.54    100.00
## crim    47.68     48.75     49.31     50.14     51.37
validationplot(pcr.fit, val.type = 'MSEP')

#Again, pcr.fit$ncomp is the total number of components fitted, not the CV choice
pcr.fit$ncomp
## [1] 13
pcr.pred=predict(pcr.fit, Boston[test,], ncomp = 4)
mean((pcr.pred-Boston$crim[test])^2)
## [1] 43.91107

The cross-validated error is lowest at M = 13, but it is nearly flat from M = 4 onward, so we use the more parsimonious M = 4, which gives a test error of 43.91107.

(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, crossvalidation, or some other reasonable alternative, as opposed to using training error.

LASSO, ridge, and PCR were all tuned with cross-validation. The best subset model, by contrast, was selected using training-based criteria (Cp, with its adjusted R^2 reported), not held-out error.

Best Subset adjusted R^2: 44.16% (a training-set statistic, not directly comparable to the test errors below)
Ridge Test Error: 41.89459
Lasso Test Error: 41.94497
PCR Test Error: 43.91107

Since ridge and LASSO are so close in test error, the LASSO is preferred for its interpretability: it drops age and tax, leaving 10 predictors. Best subset selection also produced a compact 8-variable model, but it was chosen on training-based criteria rather than held-out error, so its 44.16% adjusted R^2 should be read with some caution. PCR performed slightly worse and can be omitted.

(c) Does your chosen model involve all of the features in the data set? Why or why not?

No. Neither of the preferred models (the LASSO and the best subset model) uses all of the features. Each method searches for a model that predicts well while limiting the number of features or dimensions retained, and the dropped predictors were found to contribute too little to predicting the outcome to justify keeping them.