2

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

2-A) The lasso, relative to least squares, is:

  iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

The lasso’s decrease in variance comes at the cost of only a slight increase in bias (p.223). Because the lasso sets some coefficients exactly to zero, it is less flexible than least squares.

2-B) Repeat (a) for ridge regression relative to least squares.

  iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

Similar to above: ridge regression’s variance drops sharply before its bias rises to meet it (p.218). Unlike the lasso, ridge does not zero out any coefficients, but the shrinkage penalty still makes it less flexible than least squares.
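
Both answers lean on how each penalty constrains the fit, so here’s a quick simulated sanity check (not part of the exercise; assumes the glmnet package): the lasso produces exact zeros, while ridge only shrinks.

library(glmnet)
set.seed(1)
xsim <- matrix(rnorm(100 * 10), 100, 10)
ysim <- 3 * xsim[, 1] + rnorm(100)              # only the first predictor matters
lasso.demo <- glmnet(xsim, ysim, alpha = 1)
ridge.demo <- glmnet(xsim, ysim, alpha = 0)
sum(as.vector(coef(lasso.demo, s = 0.5)) == 0)  # several exact zeros
sum(as.vector(coef(ridge.demo, s = 0.5)) == 0)  # expect 0: ridge never zeroes a coefficient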

2-C) Repeat (a) for non-linear methods relative to least squares.

  ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

Non-linear methods are more flexible than least squares, which is linear. They win on prediction accuracy when the drop in bias from the added flexibility outweighs the accompanying rise in variance.

9

In this exercise, we will predict the number of applications received using the other variables in the College data set.

9-A)

Split the data set into a training set and a test set.

Going with a 50/50 train-test split.

library(ISLR)   # College data set
set.seed(2021)
college.data <- College
# sample half of the row indices for training; the remainder is the test set
train <- sample(1:nrow(college.data), 0.5 * nrow(college.data))
test <- -train
college.train <- college.data[train, ]
college.test <- college.data[test, ]
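
Quick sanity check on the split:

dim(college.train)
# expect 388 rows: half of the 777 colleges (the PCR summary in 9-E also reports 388)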

9-B)

Fit a linear model using least squares on the training set, and report the test error obtained.

lm9B <- lm(Apps ~ . , college.train)
lm9B.preds <- predict(lm9B, college.test)
mean((college.test$Apps - lm9B.preds)^2)
## [1] 1408299

Test error of LM returns 1408299.
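
A test MSE in the millions is easier to read as an RMSE, which is in units of applications:

sqrt(mean((college.test$Apps - lm9B.preds)^2))
# roughly 1187: predictions are off by about 1187 applications on average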

9-C)

Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

set.seed(2021)
library(glmnet)  # ridge and lasso
grid <- 10 ^ seq(10, -2, length=100)
train.matrix <- model.matrix(Apps ~ ., college.train)
test.matrix <- model.matrix(Apps ~ ., college.test)
# train the ridge model (alpha = 0) using CV over the lambda grid
ridge.model <- cv.glmnet(train.matrix, college.train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
bestlam <- ridge.model$lambda.min
bestlam
## [1] 4.641589
ridge.model.pred <- predict(ridge.model, test.matrix, s=bestlam)
mean((ridge.model.pred - college.test$Apps)^2)
## [1] 1412127

Test error for ridge regression returns 1412127. This is higher than the LM’s 1408299.
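
As a check on the answer to 2-B, ridge should shrink every coefficient without zeroing any of them; the fit can be inspected with (a sketch, output not shown above):

coef(ridge.model, s = bestlam)
# each predictor should keep a shrunken but non-zero coefficient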

9-D)

Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

set.seed(2021)
grid <- 10 ^ seq(10, -2, length=100)
train.matrix <- model.matrix(Apps ~ ., college.train)
test.matrix <- model.matrix(Apps ~ ., college.test)
# train the model using CV; note cv.glmnet is left to pick its own lambda
# sequence here (grid is only reused for the full-data refit below)
lasso.model <- cv.glmnet(train.matrix, college.train$Apps, alpha = 1)
bestlam <- lasso.model$lambda.min
bestlam
## [1] 9.082179
lasso.pred <- predict(lasso.model, test.matrix, s=bestlam)
mean((lasso.pred - college.test$Apps)^2)
## [1] 1406953

Test error for lasso returns 1406953. This is a little better than the LM’s test MSE of 1408299, and also better than ridge regression’s 1412127.

Incorporating the instructor’s demonstration to check the coefficients.

# refit the lasso on the full data set to inspect coefficients at the chosen lambda
x <- model.matrix(Apps ~ ., college.data)
y <- college.data$Apps
out <- glmnet(x, y, alpha = 1, lambda = grid)
lasso.coef <- predict(out, type = 'coefficients', s = bestlam)
lasso.coef
## 19 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -5.246442e+02
## (Intercept)  .           
## PrivateYes  -4.805574e+02
## Accept       1.529824e+00
## Enroll      -4.473636e-01
## Top10perc    4.294125e+01
## Top25perc   -8.861363e+00
## F.Undergrad  1.669031e-03
## P.Undergrad  4.195095e-02
## Outstate    -7.527460e-02
## Room.Board   1.420197e-01
## Books        .           
## Personal     2.152610e-02
## PhD         -7.503658e+00
## Terminal    -3.038887e+00
## S.F.Ratio    1.142353e+01
## perc.alumni -5.145633e-01
## Expend       7.437734e-02
## Grad.Rate    7.074171e+00
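
The question also asks for the number of non-zero coefficient estimates. In the output above only Books (plus the redundant second intercept row that model.matrix introduces) is zeroed out, leaving 16 non-zero predictor coefficients. To count them programmatically:

sum(as.vector(lasso.coef)[-(1:2)] != 0)
# 16, excluding the two intercept rows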

9-E)

Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

set.seed(2021)
library(pls)  # pcr() and plsr()
pcr.fit <- pcr(Apps ~ ., data = college.train, scale=TRUE, validation = 'CV')
validationplot(pcr.fit, val.type = 'MSEP')

summary(pcr.fit)
## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3402     3414     1637     1661     1618     1335     1320
## adjCV         3402     3414     1634     1664     1621     1325     1318
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1307     1300     1237      1241      1204      1211      1215
## adjCV     1305     1307     1233      1239      1200      1208      1212
##        14 comps  15 comps  16 comps  17 comps
## CV         1215      1221      1009     962.0
## adjCV      1212      1219      1005     957.5
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X      31.927    57.28    64.66    70.99    76.57    81.55    84.95    88.08
## Apps    1.057    77.34    77.34    78.53    85.67    85.79    86.04    86.12
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       91.04     93.39     95.35     97.01     98.09     98.85     99.39
## Apps    87.71     87.81     88.43     88.43     88.48     88.54     88.87
##       16 comps  17 comps
## X        99.81     100.0
## Apps     92.35      93.3

The CV RMSEP keeps drifting down all the way to 17 components, but it flattens out around 10, so going with 10 components to keep some dimension reduction.
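
If you’d rather not eyeball the validation plot, the CV curve can be pulled out of the fit (a sketch):

cv.rmsep <- RMSEP(pcr.fit)$val['CV', 1, ]  # CV RMSEP; the intercept-only model comes first
which.min(cv.rmsep) - 1                    # 17, the strict CV minimum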

set.seed(2021)
pcr.pred <- predict(pcr.fit, college.test, ncomp = 10)
mean((pcr.pred - college.test$Apps)^2)
## [1] 2868755

The test error obtained from PCR is 2868755. This is much higher than the LM error of 1408299.

Just for giggles, let’s demonstrate that a PCR using all 17 components is equivalent to least squares regression, since keeping every component imposes no dimension reduction at all.

set.seed(2021)
pcr.predX <- predict(pcr.fit, college.test, ncomp = 17)
mean((pcr.predX - college.test$Apps)^2)
## [1] 1408299

How ’bout that!

9-F)

Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

set.seed(2021) #because I like being thorough
pls.fit <- plsr(Apps ~ ., data=college.train, scale=TRUE, validation='CV')
validationplot(pls.fit)

Looks like 10 components should work again.

pls.pred <- predict(pls.fit, college.test, ncomp = 10)
mean((pls.pred - college.test$Apps)^2)
## [1] 1402090

Partial least squares looking good with an error of 1402090, better than least squares regression’s 1408299.

9-G)

Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

Let’s take a closer look at the best-performing model, PLS, against the OLS baseline. (The lasso also edged out OLS, but PLS had the lowest test error overall.)

summary(pls.fit)
## Data:    X dimension: 388 17 
##  Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3402     1494     1230     1192     1114     1044    999.4
## adjCV         3402     1491     1228     1194     1107     1033    991.3
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV         985    973.2    967.2     963.9     959.2     961.7     961.8
## adjCV      977    968.2    962.8     959.5     955.2     957.2     957.3
##        14 comps  15 comps  16 comps  17 comps
## CV        961.6     962.0     962.0     962.0
## adjCV     957.1     957.5     957.6     957.5
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       25.38    38.33    62.54    65.60    68.94    72.50    75.06    80.50
## Apps    81.35    87.28    88.95    90.72    92.11    92.91    93.10    93.14
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       84.08     87.17     90.16     91.61     93.55     94.83     97.41
## Apps    93.20     93.26     93.28     93.30     93.30     93.30     93.30
##       16 comps  17 comps
## X        98.72     100.0
## Apps     93.30      93.3

With the 10 components we used, the PLS model explains 93.26% of the training variance in Apps, which is very good.

summary(lm9B)
## 
## Call:
## lm(formula = Apps ~ ., data = college.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2456.0  -432.9   -57.5   347.0  5533.6 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  155.93524  505.76407   0.308 0.758015    
## PrivateYes  -787.54439  171.83306  -4.583 6.27e-06 ***
## Accept         1.52004    0.06492  23.414  < 2e-16 ***
## Enroll        -0.96329    0.21877  -4.403 1.40e-05 ***
## Top10perc     56.68321    7.11108   7.971 1.97e-14 ***
## Top25perc    -19.31393    5.74985  -3.359 0.000864 ***
## F.Undergrad    0.10991    0.03642   3.018 0.002723 ** 
## P.Undergrad   -0.03683    0.05155  -0.714 0.475455    
## Outstate      -0.05567    0.02306  -2.414 0.016259 *  
## Room.Board     0.19172    0.06117   3.134 0.001860 ** 
## Books          0.04727    0.27250   0.173 0.862373    
## Personal      -0.06411    0.07218  -0.888 0.374987    
## PhD           -9.59186    5.99321  -1.600 0.110351    
## Terminal      -7.18775    6.37788  -1.127 0.260480    
## S.F.Ratio     17.97700   15.17731   1.184 0.236989    
## perc.alumni    0.77472    5.28031   0.147 0.883434    
## Expend         0.06507    0.01645   3.955 9.18e-05 ***
## Grad.Rate      6.43350    3.72354   1.728 0.084860 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 899.5 on 370 degrees of freedom
## Multiple R-squared:  0.933,  Adjusted R-squared:  0.9299 
## F-statistic: 303.1 on 17 and 370 DF,  p-value: < 2.2e-16

Least squares regression explains the variation in Apps with an adjusted R^2 of 92.99%.

The lasso’s test MSE of 1406953 and ridge regression’s 1412127 are both so close to the LM’s that their test R^2 values should also land in the low 90s.

PCR came in last with a test error of 2868755; at 10 components it explains about 87.8% of the training variance in Apps. In short, aside from PCR there is very little difference among the five approaches, and all of them predict Apps quite accurately.
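
To put the “low 90s” claim on firmer footing, here is a sketch that turns the test MSEs reported above into test-set R^2 values:

# test-set R^2 = 1 - MSE / variance of Apps in the test set
test.avg <- mean(college.test$Apps)
tss <- mean((college.test$Apps - test.avg)^2)
errs <- c(lm = 1408299, ridge = 1412127, lasso = 1406953, pcr = 2868755, pls = 1402090)
round(1 - errs / tss, 4)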

11

We will now try to predict per capita crime rate in the Boston data set.

11-A)

Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

Best Subset Selection:

library(MASS)   # Boston data set
library(leaps)  # regsubsets()
boston.data <- Boston
set.seed(2021)
bos.train <- sample(1:nrow(boston.data), 0.5*nrow(boston.data))
boston.train <- boston.data[bos.train, ]
boston.test <- boston.data[-bos.train, ]
# caveat: regsubsets is fit on the full data set here, so the validation-set
# errors computed below will be slightly optimistic
boston.bss.full <- regsubsets(crim ~ ., boston.data, nvmax = 13)
bss.summary <- summary(boston.bss.full)
bss.summary
## Subset selection object
## Call: regsubsets.formula(crim ~ ., boston.data, nvmax = 13)
## 13 Variables  (and intercept)
##         Forced in Forced out
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## black       FALSE      FALSE
## lstat       FALSE      FALSE
## medv        FALSE      FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
##           zn  indus chas nox rm  age dis rad tax ptratio black lstat medv
## 1  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   " "   " " 
## 2  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   "*"   " " 
## 3  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     "*"   "*"   " " 
## 4  ( 1 )  "*" " "   " "  " " " " " " "*" "*" " " " "     " "   " "   "*" 
## 5  ( 1 )  "*" " "   " "  " " " " " " "*" "*" " " " "     "*"   " "   "*" 
## 6  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " " "     "*"   " "   "*" 
## 7  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " "*"     "*"   " "   "*" 
## 8  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " "*"     "*"   "*"   "*" 
## 9  ( 1 )  "*" "*"   " "  "*" " " " " "*" "*" " " "*"     "*"   "*"   "*" 
## 10  ( 1 ) "*" "*"   " "  "*" "*" " " "*" "*" " " "*"     "*"   "*"   "*" 
## 11  ( 1 ) "*" "*"   " "  "*" "*" " " "*" "*" "*" "*"     "*"   "*"   "*" 
## 12  ( 1 ) "*" "*"   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*"   "*" 
## 13  ( 1 ) "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"   "*"
names(bss.summary)
## [1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"
par(mfrow = c(1,1))
plot(bss.summary$rss, xlab = "Number of Variables", ylab = "RSS", type = "l")

Looks like we could work with anywhere from 9 to 12 variables. What variables?

#trying out Dr. Campbell's cool code from the labs
coef(boston.bss.full, 12)
##   (Intercept)            zn         indus          chas           nox 
##  16.985713928   0.044673247  -0.063848469  -0.744367726 -10.202169211 
##            rm           dis           rad           tax       ptratio 
##   0.439588002  -0.993556631   0.587660185  -0.003767546  -0.269948860 
##         black         lstat          medv 
##  -0.007518904   0.128120290  -0.198877768
test.mat <- model.matrix(crim ~ ., data = boston.test)

# validation-set MSE for the best model of each size
val.errors <- rep(NA, 13)
for(i in 1:13){
  coefi <- coef(boston.bss.full, id = i)
  pred <- test.mat[ , names(coefi)] %*% coefi
  val.errors[i] <- mean((boston.test$crim - pred)^2)
}

val.errors
##  [1] 55.50955 52.12001 53.05658 51.51829 52.26508 51.80218 51.31701 50.85214
##  [9] 50.62286 50.29730 50.25833 50.23052 50.23926
which.min(val.errors)
## [1] 12

Fit the 12-variable model on the training set.

boston.lm <- lm(crim ~ zn + indus + chas + nox + rm + dis + rad + tax + ptratio + black + lstat + medv, boston.train)
summary(boston.lm)
## 
## Call:
## lm(formula = crim ~ zn + indus + chas + nox + rm + dis + rad + 
##     tax + ptratio + black + lstat + medv, data = boston.train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.817  -1.806  -0.197   0.891  55.110 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 27.602975   8.842401   3.122  0.00202 ** 
## zn           0.044322   0.023994   1.847  0.06594 .  
## indus       -0.057781   0.093441  -0.618  0.53692    
## chas        -0.494034   1.441064  -0.343  0.73203    
## nox         -8.625412   5.785930  -1.491  0.13734    
## rm          -1.095898   0.759810  -1.442  0.15051    
## dis         -0.932314   0.337034  -2.766  0.00611 ** 
## rad          0.534305   0.104734   5.102 6.85e-07 ***
## tax         -0.003495   0.005919  -0.590  0.55547    
## ptratio     -0.179945   0.233382  -0.771  0.44145    
## black       -0.020328   0.004339  -4.685 4.69e-06 ***
## lstat        0.009001   0.086933   0.104  0.91762    
## medv        -0.088597   0.070880  -1.250  0.21253    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.498 on 240 degrees of freedom
## Multiple R-squared:  0.5132, Adjusted R-squared:  0.4889 
## F-statistic: 21.09 on 12 and 240 DF,  p-value: < 2.2e-16

Let’s get the test error of this LM by predicting on the test set.

boston.lm.pred <- predict (boston.lm, boston.test)
mean((boston.lm.pred - boston.test$crim)^2)
## [1] 55.4715

Test MSE of the LM is 55.47.

Lasso method is next:

set.seed(2021)
grid <- 10 ^ seq(10, -2, length = 100)
train.matrix <- model.matrix(crim ~ . , boston.train)
test.matrix <- model.matrix(crim ~ ., boston.test)
# as in 9-D, cv.glmnet picks its own lambda sequence (grid goes unused here)
boston.lasso <- cv.glmnet(train.matrix, boston.train$crim, alpha = 1)
bestlam.boston <- boston.lasso$lambda.min
bestlam.boston
## [1] 0.02055351
plot(boston.lasso)

boston.lasso.pred <- predict(boston.lasso, test.matrix, s=bestlam.boston)
mean((boston.lasso.pred - boston.test$crim)^2)
## [1] 55.62816

Test MSE of the lasso is 55.63.
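
We didn’t print them above, but at such a small lambda the lasso likely keeps most predictors; the surviving coefficients can be checked with (a sketch):

coef(boston.lasso, s = bestlam.boston)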

Ridge regression is next:

set.seed(2021)
grid<-10 ^ seq(10, -2, length = 100)
train.matrix <- model.matrix(crim ~ . , boston.train)
test.matrix <- model.matrix(crim ~ . , boston.test)
boston.ridge <- cv.glmnet(train.matrix, boston.train$crim, alpha = 0, lambda = grid, thresh = 1e-12)
bestlam.boston <- boston.ridge$lambda.min
bestlam.boston
## [1] 0.07054802
plot(boston.ridge)

boston.ridge.pred <- predict(boston.ridge, test.matrix, s=bestlam.boston)
mean((boston.ridge.pred - boston.test$crim)^2)
## [1] 55.57771

Test MSE of the ridge model is 55.58.

Doing PCR now:

set.seed(2021)
boston.pcr <- pcr(crim ~ ., data=boston.train, scale = TRUE, validation  = 'CV')
validationplot(boston.pcr, val.type = 'MSEP')

summary(boston.pcr)
## Data:    X dimension: 253 13 
##  Y dimension: 253 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           7.706    6.400    6.404    5.814    5.781    5.755    5.809
## adjCV        7.706    6.397    6.403    5.805    5.776    5.748    5.799
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       5.824    5.763    5.753     5.803     5.793     5.732     5.638
## adjCV    5.815    5.755    5.729     5.788     5.780     5.717     5.623
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       45.37    58.18    68.58    76.36    83.14    88.24    91.31    93.56
## crim    31.91    31.91    44.41    45.35    45.96    46.14    46.16    47.73
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       95.38     97.00     98.38     99.47    100.00
## crim    48.69     48.69     48.83     49.78     51.39

Going to go with 13 components, because that is where the CV/adjCV RMSEP is lowest. Note that M = 13 is no dimension reduction at all: a PCR that keeps every component is equivalent to least squares.

set.seed(2021)
boston.pcr.pred <- predict(boston.pcr, boston.test, ncomp = 13)
mean((boston.pcr.pred - boston.test$crim)^2)
## [1] 55.56365

Test MSE of PCR is 55.56.
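
As in 9-E, we can sanity-check that full-component PCR matches plain least squares (a sketch; boston.full.lm is a new object, not fit above):

boston.full.lm <- lm(crim ~ ., data = boston.train)
mean((predict(boston.full.lm, boston.test) - boston.test$crim)^2)
# should reproduce the 55.56 above, up to rounding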

Here’s the summary (in ascending order of test MSE) from the various methods:

* Test MSE of the LM (12-variable best subset) is 55.47.
* Test MSE of PCR is 55.56.
* Test MSE of the ridge model is 55.58.
* Test MSE of the lasso is 55.63.

11-B)

Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.

The linear model on the 12 variables chosen by best subset selection has the lowest test error, evaluated on a held-out test set rather than on the training data.
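
For an assessment that doesn’t hinge on a single split, a 10-fold CV estimate for the chosen model could be computed along these lines (a sketch; cv.glm is from the boot package):

library(boot)
glm.12 <- glm(crim ~ . - age, data = boston.data)  # the 12-variable model drops age
cv.glm(boston.data, glm.12, K = 10)$delta[1]       # 10-fold CV estimate of test MSE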

11-C)

Does your chosen model involve all of the features in the data set? Why or why not?

The best subset selection model does not include all of the features, because the exhaustive search keeps only the subset with the greatest predictive power; the one variable left out (age) did not improve the validation error. As a bonus, the 12-variable model is a little simpler than the PCR model (which used all 13 features).