For parts (2.a) through (2.c), indicate which of i. through iv. is correct. Justify your answer.
(2.a) The lasso, relative to least squares, is:
(iii) Less flexible, and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. As the lasso penalty λ increases, variance decreases and bias increases. When the relationship between the response and the predictors is close to linear, the least squares estimates will have low bias but may have high variance, meaning that a small change in the training data can cause a large change in the coefficient estimates; in that situation the lasso can trade a small increase in bias for a larger decrease in variance and so predict more accurately.
(2.b) Repeat (a) for ridge regression relative to least squares.
(iii) Less flexible, and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. Ridge regression exploits the same bias-variance trade-off: as λ increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias. As in (2.a), when the relationship between the response and the predictors is close to linear, the least squares estimates have low bias but may have high variance, so a small change in the training data can cause a large change in the coefficient estimates; shrinking the estimates toward zero stabilises them at the cost of a little bias.
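As a rough illustration of this shrinkage for (2.a) and (2.b), here is a small simulated sketch (hypothetical data, not part of the exercise; assumes the glmnet package) showing how the ridge and lasso coefficient paths shrink toward zero as λ increases:
library(glmnet)
set.seed(1)
# Hypothetical data: 20 predictors, only the first 5 carry signal
x = matrix(rnorm(100 * 20), 100, 20)
y = x %*% c(rep(2, 5), rep(0, 15)) + rnorm(100)
par(mfrow = c(1, 2))
plot(glmnet(x, y, alpha = 0), xvar = "lambda", main = "Ridge") # coefficients shrink smoothly toward zero
plot(glmnet(x, y, alpha = 1), xvar = "lambda", main = "Lasso") # coefficients shrink and hit exactly zero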
(2.c) Repeat (a) for non-linear methods relative to least squares
(ii) More flexible, and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias. Least squares can only capture linear relationships, so when the true relationship between the response and the predictors is substantially non-linear, the least squares estimates will be biased no matter how many observations we have (although with many more observations than variables their variance will be low). A more flexible non-linear method can often substantially reduce that bias and improve prediction accuracy, at the cost of some increase in variance; it wins whenever the decrease in bias outweighs the increase in variance.
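A small simulated sketch of this trade-off (hypothetical data, not part of the exercise): when the true relationship is non-linear, a more flexible fit beats least squares on test observations.
set.seed(1)
# Hypothetical non-linear relationship
x = runif(200, -2, 2)
y = sin(2 * x) + rnorm(200, sd = 0.3)
tr = sample(200, 100)
lin.fit = lm(y ~ x, subset = tr) # least squares: high bias here
flex.fit = lm(y ~ poly(x, 4), subset = tr) # a more flexible (degree-4 polynomial) fit
mean((y[-tr] - predict(lin.fit, data.frame(x = x[-tr])))^2) # linear test MSE
mean((y[-tr] - predict(flex.fit, data.frame(x = x[-tr])))^2) # flexible test MSE: typically much lower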
In this exercise, we will predict the number of applications received using the other variables in the College data set.
data("College")
str(College)
## 'data.frame': 777 obs. of 18 variables:
## $ Private : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Apps : num 1660 2186 1428 417 193 ...
## $ Accept : num 1232 1924 1097 349 146 ...
## $ Enroll : num 721 512 336 137 55 158 103 489 227 172 ...
## $ Top10perc : num 23 16 22 60 16 38 17 37 30 21 ...
## $ Top25perc : num 52 29 50 89 44 62 45 68 63 44 ...
## $ F.Undergrad: num 2885 2683 1036 510 249 ...
## $ P.Undergrad: num 537 1227 99 63 869 ...
## $ Outstate : num 7440 12280 11250 12960 7560 ...
## $ Room.Board : num 3300 6450 3750 5450 4120 ...
## $ Books : num 450 750 400 450 800 500 500 450 300 660 ...
## $ Personal : num 2200 1500 1165 875 1500 ...
## $ PhD : num 70 29 53 92 76 67 90 89 79 40 ...
## $ Terminal : num 78 30 66 97 72 73 93 100 84 41 ...
## $ S.F.Ratio : num 18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
## $ perc.alumni: num 12 16 30 37 2 11 26 37 23 15 ...
## $ Expend : num 7041 10527 8735 19016 10922 ...
## $ Grad.Rate : num 60 56 54 59 15 55 63 73 80 52 ...
(9.a) Split the data set into a training set and a test set.
train_index = sample(1:nrow(College), nrow(College) * 0.8)
train = College[train_index, ]
test = College[-train_index, ]
(9.b) Fit a linear model using least squares on the training set, and report the test error obtained.
lm.fit = lm(Apps ~ ., data = train)
lm.fit
##
## Call:
## lm(formula = Apps ~ ., data = train)
##
## Coefficients:
## (Intercept) PrivateYes Accept Enroll Top10perc Top25perc
## -630.58238 -388.97393 1.69123 -1.21543 50.45622 -13.62655
## F.Undergrad P.Undergrad Outstate Room.Board Books Personal
## 0.08271 0.06555 -0.07562 0.14161 0.21161 0.01873
## PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
## -9.72551 -0.48690 18.26146 1.39008 0.05764 5.89480
lm.pred = predict(lm.fit, newdata = test)
mean((test$Apps - lm.pred)^2)
## [1] 1567324
(9.c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
library(glmnet)
train.mat = model.matrix(Apps ~ ., data = train)
test.mat = model.matrix(Apps ~ ., data = test)
grid = 10^seq(4, -2, length = 100)
mod.ridge = cv.glmnet(train.mat, train[, "Apps"], alpha = 0, lambda = grid, thresh = 1e-12)
lambda.best = mod.ridge$lambda.min
ridge.pred = predict(mod.ridge, newx=test.mat, s=lambda.best)
mean((test[, "Apps"] - ridge.pred)^2)
## [1] 1567278
(9.d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
mod.lasso = cv.glmnet(train.mat, train[, "Apps"], alpha = 1, lambda = grid, thresh = 1e-12)
lambda.best = mod.lasso$lambda.min
lasso.pred = predict(mod.lasso, newx = test.mat, s = lambda.best)
# Refit the lasso on the full data set to display the coefficient estimates at the chosen lambda
mod.lasso = glmnet(model.matrix(Apps ~ ., data = College), College[, "Apps"], alpha = 1)
predict(mod.lasso, s = lambda.best, type = "coefficients")
## 19 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -471.39372069
## (Intercept) .
## PrivateYes -491.04485135
## Accept 1.57033288
## Enroll -0.75961467
## Top10perc 48.14698891
## Top25perc -12.84690694
## F.Undergrad 0.04149116
## P.Undergrad 0.04438973
## Outstate -0.08328388
## Room.Board 0.14943472
## Books 0.01532293
## Personal 0.02909954
## PhD -8.39597537
## Terminal -3.26800340
## S.F.Ratio 14.59298267
## perc.alumni -0.04404771
## Expend 0.07712632
## Grad.Rate 8.28950241
mean((test[, "Apps"] - lasso.pred)^2)
## [1] 1567261
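The exercise also asks for the number of non-zero coefficient estimates. From the printout above, all 17 predictors are non-zero at this λ (only the redundant second intercept row is zero), so the lasso performs essentially no variable selection here. A quick way to count them (a sketch):
lasso.coef = predict(mod.lasso, s = lambda.best, type = "coefficients")
sum(lasso.coef != 0) - 1 # number of non-zero coefficients, excluding the intercept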
(9.e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
library(pls)
pcr.fit = pcr(Apps ~ ., data = train, scale = TRUE, validation = "CV")
validationplot(pcr.fit, val.type = "MSEP")
pcr.pred = predict(pcr.fit, test, ncomp = 10)
mean((test[, "Apps"] - (pcr.pred))^2)
## [1] 2322618
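The exercise asks for M chosen by cross-validation rather than the hard-coded ncomp = 10 above. A sketch of how the CV-selected M could be read off the fitted object (assumes the pls package's MSEP() and selectNcomp() helpers):
cv.msep = MSEP(pcr.fit, estimate = "CV")
which.min(cv.msep$val["CV", 1, ]) - 1 # M with the lowest CV MSEP (subtract 1 for the intercept-only model)
selectNcomp(pcr.fit, method = "onesigma", plot = FALSE) # alternative: one-standard-error rule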
(9.f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
pls.fit = plsr(Apps ~ ., data = train, scale = TRUE, validation = "CV")
validationplot(pls.fit, val.type = "MSEP")
pls.pred = predict(pls.fit, test, ncomp = 10)
mean((test[, "Apps"] - (pls.pred))^2)
## [1] 1551898
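As with PCR, M should come from cross-validation rather than being hard-coded; the same sketch applies (assumes pls::selectNcomp):
selectNcomp(pls.fit, method = "onesigma", plot = FALSE) # M for the PLS fit by the one-standard-error rule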
(9.g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
test.avg = mean(test[, "Apps"])
lm.test.r2 = 1 - mean((test[, "Apps"] - lm.pred)^2) /mean((test[, "Apps"] - test.avg)^2)
ridge.test.r2 = 1 - mean((test[, "Apps"] - ridge.pred)^2) /mean((test[, "Apps"] - test.avg)^2)
lasso.test.r2 = 1 - mean((test[, "Apps"] - lasso.pred)^2) /mean((test[, "Apps"] - test.avg)^2)
pcr.test.r2 = 1 - mean((test[, "Apps"] - (pcr.pred))^2) /mean((test[, "Apps"] - test.avg)^2)
pls.test.r2 = 1 - mean((test[, "Apps"] - (pls.pred))^2) /mean((test[, "Apps"] - test.avg)^2)
barplot(c(lm.test.r2, ridge.test.r2, lasso.test.r2, pcr.test.r2, pls.test.r2),
        col = "red", names.arg = c("OLS", "Ridge", "Lasso", "PCR", "PLS"),
        main = "Test R-squared")
(9.g) Answer: The test errors for least squares, ridge, and the lasso are essentially identical (roughly 1.567 million each), and PLS is comparable (about 1.55 million); only PCR with 10 components is noticeably worse (about 2.32 million). The barplot of test R-squared tells the same story, with PCR clearly behind the other four, so apart from PCR there is little difference among these approaches in how accurately they predict the number of applications.
We will now try to predict per capita crime rate in the Boston data set.
data("Boston")
(11.a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
# Best subset selection
# regsubsets() has no predict() method, so define one for use in cross-validation
library(leaps)
predict.regsubsets = function(object, newdata, id, ...) {
  form = as.formula(object$call[[2]])
  mat = model.matrix(form, newdata)
  coefi = coef(object, id = id)
  mat[, names(coefi)] %*% coefi
}
k = 10
p = ncol(Boston) - 1
folds = sample(rep(1:k, length = nrow(Boston)))
cv.errors = matrix(NA, k, p)
for (i in 1:k) {
  best.fit = regsubsets(crim ~ ., data = Boston[folds != i, ], nvmax = p)
  for (j in 1:p) {
    pred = predict(best.fit, Boston[folds == i, ], id = j)
    cv.errors[i, j] = mean((Boston$crim[folds == i] - pred)^2)
  }
}
rmse.cv = sqrt(apply(cv.errors, 2, mean))
plot(rmse.cv, pch = 19, type = "b")
which.min(rmse.cv)
## [1] 10
rmse.cv[which.min(rmse.cv)]
## [1] 6.524057
# Lasso
x = model.matrix(crim ~ . - 1, data = Boston)
y = Boston$crim
cv.lasso = cv.glmnet(x, y, type.measure = "mse", alpha = 1) # alpha = 1 is the lasso (the default)
plot(cv.lasso)
coef(cv.lasso) # coefficients at lambda.1se by default
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.7799525
## zn .
## indus .
## chas .
## nox .
## rm .
## age .
## dis .
## rad 0.1920089
## tax .
## ptratio .
## black .
## lstat .
## medv .
sqrt(cv.lasso$cvm[cv.lasso$lambda == cv.lasso$lambda.1se]) # CV RMSE at lambda.1se
## [1] 7.743457
# Ridge regression
x = model.matrix(crim ~ . - 1, data = Boston)
y = Boston$crim
cv.ridge = cv.glmnet(x, y, type.measure = "mse", alpha = 0) # alpha = 0 is ridge regression
plot(cv.ridge)
coef(cv.ridge) # coefficients at lambda.1se by default
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.875083351
## zn -0.002796413
## indus 0.025127580
## chas -0.131124711
## nox 1.591455573
## rm -0.124822395
## age 0.005316260
## dis -0.079813283
## rad 0.037184215
## tax 0.001710749
## ptratio 0.060102432
## black -0.002126570
## lstat 0.029746839
## medv -0.019577079
sqrt(cv.ridge$cvm[cv.ridge$lambda == cv.ridge$lambda.1se]) # CV RMSE at lambda.1se
## [1] 7.795301
# PCR
pcr.fit = pcr(crim ~ ., data = Boston, scale = TRUE, validation = "CV")
summary(pcr.fit)
## Data: X dimension: 506 13
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.61 7.210 7.213 6.774 6.757 6.774 6.785
## adjCV 8.61 7.207 7.210 6.768 6.749 6.769 6.779
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.778 6.658 6.682 6.669 6.67 6.621 6.553
## adjCV 6.771 6.650 6.673 6.660 6.66 6.611 6.542
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 47.70 60.36 69.67 76.45 82.99 88.00 91.14 93.45
## crim 30.69 30.87 39.27 39.61 39.61 39.86 40.14 42.47
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.40 97.04 98.46 99.52 100.0
## crim 42.55 42.78 43.04 44.13 45.4
(11.a) Answer: The 13-component PCR fit has the lowest CV and adjCV RMSEP (about 6.55 and 6.54, respectively).
(11.b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
(11.b) Answer: Comparing the cross-validated errors above, best subset selection does best (CV RMSE of about 6.52 for the model size chosen by cross-validation), with the 13-component PCR fit close behind (CV RMSEP of about 6.55); the lasso (about 7.74) and ridge regression (about 7.80), both evaluated at lambda.1se, are noticeably worse. I would therefore propose the best subset model selected by cross-validation.
(11.c) Does your chosen model involve all of the features in the data set? Why or why not?
(11.c) Answer: No. The chosen best subset model uses only the predictors selected by cross-validation (10 of the 13, per the which.min output above), and it achieved the lowest cross-validated RMSE while being a simpler model than the 13-component PCR fit, which uses linear combinations of all of the features.
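To see exactly which features the chosen model involves, one can refit best subsets on the full Boston data and inspect the coefficients of the CV-selected model size (a sketch; the selected variables may differ slightly from the fold-specific fits above):
best.full = regsubsets(crim ~ ., data = Boston, nvmax = p)
coef(best.full, id = which.min(rmse.cv)) # coefficients of the CV-selected model size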