For parts (2.a) through (2.c), indicate which of i. through iv. is correct. Justify your answer.
(2.a) The lasso, relative to least squares, is:
(iii) Less flexible, and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. As the lasso penalty λ increases, variance decreases and bias increases. When the relationship between the response and the predictors is close to linear, the least squares estimates will have low bias but may have high variance, meaning that a small change in the training data can cause a large change in the coefficient estimates; in that situation the lasso can trade a small increase in bias for a larger decrease in variance and so predict more accurately.
(2.b) Repeat (a) for ridge regression relative to least squares.
(iii) Less flexible, and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. Ridge regression exploits the same bias-variance trade-off: as λ increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias. As in (2.a), when the relationship between the response and the predictors is close to linear, the least squares estimates have low bias but may have high variance, so a small change in the training data can cause a large change in the coefficient estimates; shrinking the estimates toward zero stabilises them at the cost of a little bias.
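As a rough illustration of this shrinkage for (2.a) and (2.b), here is a small simulated sketch (hypothetical data, not part of the exercise; assumes the glmnet package) showing how the ridge and lasso coefficient paths shrink toward zero as λ increases:
library(glmnet)
set.seed(1)
# Hypothetical data: 20 predictors, only the first 5 carry signal
x = matrix(rnorm(100 * 20), 100, 20)
y = x %*% c(rep(2, 5), rep(0, 15)) + rnorm(100)
par(mfrow = c(1, 2))
plot(glmnet(x, y, alpha = 0), xvar = "lambda", main = "Ridge") # coefficients shrink smoothly toward zero
plot(glmnet(x, y, alpha = 1), xvar = "lambda", main = "Lasso") # coefficients shrink and hit exactly zero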
(2.c) Repeat (a) for non-linear methods relative to least squares
(ii) More flexible, and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias. Least squares can only capture linear relationships, so when the true relationship between the response and the predictors is substantially non-linear, the least squares estimates will be biased no matter how many observations we have (although with many more observations than variables their variance will be low). A more flexible non-linear method can often substantially reduce that bias and improve prediction accuracy, at the cost of some increase in variance; it wins whenever the decrease in bias outweighs the increase in variance.
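A small simulated sketch of this trade-off (hypothetical data, not part of the exercise): when the true relationship is non-linear, a more flexible fit beats least squares on test observations.
set.seed(1)
# Hypothetical non-linear relationship
x = runif(200, -2, 2)
y = sin(2 * x) + rnorm(200, sd = 0.3)
tr = sample(200, 100)
lin.fit = lm(y ~ x, subset = tr) # least squares: high bias here
flex.fit = lm(y ~ poly(x, 4), subset = tr) # a more flexible (degree-4 polynomial) fit
mean((y[-tr] - predict(lin.fit, data.frame(x = x[-tr])))^2) # linear test MSE
mean((y[-tr] - predict(flex.fit, data.frame(x = x[-tr])))^2) # flexible test MSE: typically much lower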
In this exercise, we will predict the number of applications received using the other variables in the College data set.
data("College")
str(College)
## 'data.frame': 777 obs. of 18 variables:
## $ Private : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Apps : num 1660 2186 1428 417 193 ...
## $ Accept : num 1232 1924 1097 349 146 ...
## $ Enroll : num 721 512 336 137 55 158 103 489 227 172 ...
## $ Top10perc : num 23 16 22 60 16 38 17 37 30 21 ...
## $ Top25perc : num 52 29 50 89 44 62 45 68 63 44 ...
## $ F.Undergrad: num 2885 2683 1036 510 249 ...
## $ P.Undergrad: num 537 1227 99 63 869 ...
## $ Outstate : num 7440 12280 11250 12960 7560 ...
## $ Room.Board : num 3300 6450 3750 5450 4120 ...
## $ Books : num 450 750 400 450 800 500 500 450 300 660 ...
## $ Personal : num 2200 1500 1165 875 1500 ...
## $ PhD : num 70 29 53 92 76 67 90 89 79 40 ...
## $ Terminal : num 78 30 66 97 72 73 93 100 84 41 ...
## $ S.F.Ratio : num 18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
## $ perc.alumni: num 12 16 30 37 2 11 26 37 23 15 ...
## $ Expend : num 7041 10527 8735 19016 10922 ...
## $ Grad.Rate : num 60 56 54 59 15 55 63 73 80 52 ...
(9.a) Split the data set into a training set and a test set.
train_index = sample(1:nrow(College), nrow(College) * 0.8)
train = College[train_index, ]
test = College[-train_index, ]
(9.b) Fit a linear model using least squares on the training set, and report the test error obtained.
lm.fit = lm(Apps ~ ., data = train)
lm.fit
##
## Call:
## lm(formula = Apps ~ ., data = train)
##
## Coefficients:
## (Intercept) PrivateYes Accept Enroll Top10perc Top25perc
## -630.58238 -388.97393 1.69123 -1.21543 50.45622 -13.62655
## F.Undergrad P.Undergrad Outstate Room.Board Books Personal
## 0.08271 0.06555 -0.07562 0.14161 0.21161 0.01873
## PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
## -9.72551 -0.48690 18.26146 1.39008 0.05764 5.89480
lm.pred = predict(lm.fit, newdata = test)
mean((test$Apps - lm.pred)^2)
## [1] 1567324
(9.c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
library(glmnet)
train.mat = model.matrix(Apps ~ ., data = train)
test.mat = model.matrix(Apps ~ ., data = test)
grid = 10^seq(4, -2, length = 100)
mod.ridge = cv.glmnet(train.mat, train[, "Apps"], alpha = 0, lambda = grid, thresh = 1e-12)
lambda.best = mod.ridge$lambda.min
ridge.pred = predict(mod.ridge, newx=test.mat, s=lambda.best)
mean((test[, "Apps"] - ridge.pred)^2)
## [1] 1567278
(9.d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
mod.lasso = cv.glmnet(train.mat, train[, "Apps"], alpha = 1, lambda = grid, thresh = 1e-12)
lambda.best = mod.lasso$lambda.min
lasso.pred = predict(mod.lasso, newx = test.mat, s = lambda.best)
# Refit the lasso on the full data set to display the coefficient estimates at the chosen lambda
mod.lasso = glmnet(model.matrix(Apps ~ ., data = College), College[, "Apps"], alpha = 1)
predict(mod.lasso, s = lambda.best, type = "coefficients")
## 19 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -471.39372069
## (Intercept) .
## PrivateYes -491.04485135
## Accept 1.57033288
## Enroll -0.75961467
## Top10perc 48.14698891
## Top25perc -12.84690694
## F.Undergrad 0.04149116
## P.Undergrad 0.04438973
## Outstate -0.08328388
## Room.Board 0.14943472
## Books 0.01532293
## Personal 0.02909954
## PhD -8.39597537
## Terminal -3.26800340
## S.F.Ratio 14.59298267
## perc.alumni -0.04404771
## Expend 0.07712632
## Grad.Rate 8.28950241
mean((test[, "Apps"] - lasso.pred)^2)
## [1] 1567261
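The exercise also asks for the number of non-zero coefficient estimates. From the printout above, all 17 predictors are non-zero at this λ (only the redundant second intercept row is zero), so the lasso performs essentially no variable selection here. A quick way to count them (a sketch):
lasso.coef = predict(mod.lasso, s = lambda.best, type = "coefficients")
sum(lasso.coef != 0) - 1 # number of non-zero coefficients, excluding the intercept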
(9.e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
library(pls)
pcr.fit = pcr(Apps ~ ., data = train, scale = TRUE, validation = "CV")
validationplot(pcr.fit, val.type = "MSEP")
pcr.pred = predict(pcr.fit, test, ncomp = 10)
mean((test[, "Apps"] - (pcr.pred))^2)
## [1] 2322618
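The exercise asks for M chosen by cross-validation rather than the hard-coded ncomp = 10 above. A sketch of how the CV-selected M could be read off the fitted object (assumes the pls package's MSEP() and selectNcomp() helpers):
cv.msep = MSEP(pcr.fit, estimate = "CV")
which.min(cv.msep$val["CV", 1, ]) - 1 # M with the lowest CV MSEP (subtract 1 for the intercept-only model)
selectNcomp(pcr.fit, method = "onesigma", plot = FALSE) # alternative: one-standard-error rule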
(9.f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
pls.fit = plsr(Apps ~ ., data = train, scale = TRUE, validation = "CV")
validationplot(pls.fit, val.type = "MSEP")
pls.pred = predict(pls.fit, test, ncomp = 10)
mean((test[, "Apps"] - (pls.pred))^2)
## [1] 1551898
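As with PCR, M should come from cross-validation rather than being hard-coded; the same sketch applies (assumes pls::selectNcomp):
selectNcomp(pls.fit, method = "onesigma", plot = FALSE) # M for the PLS fit by the one-standard-error rule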
(9.g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
test.avg = mean(test[, "Apps"])
lm.test.r2 = 1 - mean((test[, "Apps"] - lm.pred)^2) /mean((test[, "Apps"] - test.avg)^2)
ridge.test.r2 = 1 - mean((test[, "Apps"] - ridge.pred)^2) /mean((test[, "Apps"] - test.avg)^2)
lasso.test.r2 = 1 - mean((test[, "Apps"] - lasso.pred)^2) /mean((test[, "Apps"] - test.avg)^2)
pcr.test.r2 = 1 - mean((test[, "Apps"] - (pcr.pred))^2) /mean((test[, "Apps"] - test.avg)^2)
pls.test.r2 = 1 - mean((test[, "Apps"] - (pls.pred))^2) /mean((test[, "Apps"] - test.avg)^2)
barplot(c(lm.test.r2, ridge.test.r2, lasso.test.r2, pcr.test.r2, pls.test.r2),
        col = "red", names.arg = c("OLS", "Ridge", "Lasso", "PCR", "PLS"),
        main = "Test R-squared")
(9.g) Answer: The test errors for least squares, ridge, and the lasso are essentially identical (roughly 1.567 million each), and PLS is comparable (about 1.55 million); only PCR with 10 components is noticeably worse (about 2.32 million). The barplot of test R-squared tells the same story, with PCR clearly behind the other four, so apart from PCR there is little difference among these approaches in how accurately they predict the number of applications.
We will now try to predict per capita crime rate in the Boston data set.
data("Boston")
(11.a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
# Best subset selection
# regsubsets() has no predict() method, so define one for use in cross-validation
library(leaps)
predict.regsubsets = function(object, newdata, id, ...) {
  form = as.formula(object$call[[2]])
  mat = model.matrix(form, newdata)
  coefi = coef(object, id = id)
  mat[, names(coefi)] %*% coefi
}
k = 10
p = ncol(Boston) - 1
folds = sample(rep(1:k, length = nrow(Boston)))
cv.errors = matrix(NA, k, p)
for (i in 1:k) {
  best.fit = regsubsets(crim ~ ., data = Boston[folds != i, ], nvmax = p)
  for (j in 1:p) {
    pred = predict(best.fit, Boston[folds == i, ], id = j)
    cv.errors[i, j] = mean((Boston$crim[folds == i] - pred)^2)
  }
}
rmse.cv = sqrt(apply(cv.errors, 2, mean))
plot(rmse.cv, pch = 19, type = "b")
which.min(rmse.cv)
## [1] 10
rmse.cv[which.min(rmse.cv)]
## [1] 6.524057
# Lasso
x = model.matrix(crim ~ . - 1, data = Boston)
y = Boston$crim
cv.lasso = cv.glmnet(x, y, type.measure = "mse", alpha = 1) # alpha = 1 is the lasso (the default)
plot(cv.lasso)
coef(cv.lasso) # coefficients at lambda.1se by default
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.7799525
## zn .
## indus .
## chas .
## nox .
## rm .
## age .
## dis .
## rad 0.1920089
## tax .
## ptratio .
## black .
## lstat .
## medv .
sqrt(cv.lasso$cvm[cv.lasso$lambda == cv.lasso$lambda.1se]) # CV RMSE at lambda.1se
## [1] 7.743457
# Ridge regression
x = model.matrix(crim ~ . - 1, data = Boston)
y = Boston$crim
cv.ridge = cv.glmnet(x, y, type.measure = "mse", alpha = 0) # alpha = 0 is ridge regression
plot(cv.ridge)
coef(cv.ridge) # coefficients at lambda.1se by default
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.875083351
## zn -0.002796413
## indus 0.025127580
## chas -0.131124711
## nox 1.591455573
## rm -0.124822395
## age 0.005316260
## dis -0.079813283
## rad 0.037184215
## tax 0.001710749
## ptratio 0.060102432
## black -0.002126570
## lstat 0.029746839
## medv -0.019577079
sqrt(cv.ridge$cvm[cv.ridge$lambda == cv.ridge$lambda.1se]) # CV RMSE at lambda.1se
## [1] 7.795301
# PCR
pcr.fit = pcr(crim ~ ., data = Boston, scale = TRUE, validation = "CV")
summary(pcr.fit)
## Data: X dimension: 506 13
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.61 7.210 7.213 6.774 6.757 6.774 6.785
## adjCV 8.61 7.207 7.210 6.768 6.749 6.769 6.779
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.778 6.658 6.682 6.669 6.67 6.621 6.553
## adjCV 6.771 6.650 6.673 6.660 6.66 6.611 6.542
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 47.70 60.36 69.67 76.45 82.99 88.00 91.14 93.45
## crim 30.69 30.87 39.27 39.61 39.61 39.86 40.14 42.47
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.40 97.04 98.46 99.52 100.0
## crim 42.55 42.78 43.04 44.13 45.4
(11.a) Answer: The 13-component PCR fit has the lowest CV and adjCV RMSEP (about 6.55 and 6.54, respectively).
(11.b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
(11.b) Answer: Comparing the cross-validated errors above, best subset selection does best (CV RMSE of about 6.52 for the model size chosen by cross-validation), with the 13-component PCR fit close behind (CV RMSEP of about 6.55); the lasso (about 7.74) and ridge regression (about 7.80), both evaluated at lambda.1se, are noticeably worse. I would therefore propose the best subset model selected by cross-validation.
(11.c) Does your chosen model involve all of the features in the data set? Why or why not?
(11.c) Answer: No. The chosen best subset model uses only the predictors selected by cross-validation (10 of the 13, per the which.min output above), and it achieved the lowest cross-validated RMSE while being a simpler model than the 13-component PCR fit, which uses linear combinations of all of the features.
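To see exactly which features the chosen model involves, one can refit best subsets on the full Boston data and inspect the coefficients of the CV-selected model size (a sketch; the selected variables may differ slightly from the fold-specific fits above):
best.full = regsubsets(crim ~ ., data = Boston, nvmax = p)
coef(best.full, id = which.min(rmse.cv)) # coefficients of the CV-selected model size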