For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
(A) The lasso, relative to least squares, is: iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
(B) Ridge regression, relative to least squares, is: iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
(C) Non-linear methods, relative to least squares, are: ii. More flexible and hence will give improved prediction accuracy when their increase in variance is less than their decrease in bias.
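As a quick numerical illustration of this tradeoff, the simulation below (a sketch; the data-generating process is invented for illustration and assumes the glmnet package) fits least squares and cross-validated ridge to the same training data. With p close to n, least squares has high variance, so the shrinkage of ridge usually wins:
library(glmnet)
set.seed(10)
n <- 50; p <- 45                                  # p close to n: least squares is high-variance
beta <- rnorm(p, sd = 0.5)
x.tr <- matrix(rnorm(n * p), n, p)
x.te <- matrix(rnorm(1000 * p), 1000, p)
y.tr <- drop(x.tr %*% beta) + rnorm(n, sd = 3)
y.te <- drop(x.te %*% beta) + rnorm(1000, sd = 3)
ols <- lm(y ~ ., data = data.frame(y = y.tr, x.tr))
ridge <- cv.glmnet(x.tr, y.tr, alpha = 0)         # ridge path with 10-fold CV
mean((predict(ols, data.frame(x.te)) - y.te)^2)          # least squares test MSE
mean((predict(ridge, x.te, s = "lambda.min") - y.te)^2)  # ridge test MSE: usually lower here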
In this exercise, we will predict the number of applications received using the other variables in the College data set.
(A) Split the data set into a training set and a test set.
data("College")
names(College)
## [1] "Private" "Apps" "Accept" "Enroll" "Top10perc"
## [6] "Top25perc" "F.Undergrad" "P.Undergrad" "Outstate" "Room.Board"
## [11] "Books" "Personal" "PhD" "Terminal" "S.F.Ratio"
## [16] "perc.alumni" "Expend" "Grad.Rate"
set.seed(1)
College.split = initial_split(College, strata = "Apps", prop = .7)
College.train = training(College.split)
College.test = testing(College.split)
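initial_split() (from the rsample package) stratifies on Apps so the training and test sets have similar distributions of the response; a rough base-R equivalent, without stratification, would be:
idx <- sample(nrow(College), round(0.7 * nrow(College)))
College.train.base <- College[idx, ]
College.test.base <- College[-idx, ]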
(B) Fit a linear model using least squares on the training set, and report the test error obtained.
lm.fit = lm(Apps ~ ., data = College.train)
summary(lm.fit)
##
## Call:
## lm(formula = Apps ~ ., data = College.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5107.8 -391.0 -7.3 290.7 7594.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -629.04441 506.66716 -1.242 0.21497
## PrivateYes -528.76499 161.85774 -3.267 0.00116 **
## Accept 1.67554 0.04556 36.779 < 2e-16 ***
## Enroll -0.98055 0.21469 -4.567 6.17e-06 ***
## Top10perc 38.91487 6.55417 5.937 5.28e-09 ***
## Top25perc -7.27523 5.16743 -1.408 0.15975
## F.Undergrad 0.01406 0.03889 0.362 0.71783
## P.Undergrad 0.08264 0.03669 2.252 0.02472 *
## Outstate -0.09102 0.02239 -4.065 5.53e-05 ***
## Room.Board 0.12088 0.05781 2.091 0.03701 *
## Books 0.04141 0.31067 0.133 0.89400
## Personal 0.01102 0.07274 0.151 0.87970
## PhD -6.36443 5.73029 -1.111 0.26722
## Terminal -5.11899 6.17210 -0.829 0.40727
## S.F.Ratio 26.61536 16.14142 1.649 0.09977 .
## perc.alumni 3.13772 4.89273 0.641 0.52161
## Expend 0.09128 0.01416 6.447 2.59e-10 ***
## Grad.Rate 7.77373 3.59667 2.161 0.03112 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1039 on 523 degrees of freedom
## Multiple R-squared: 0.9359, Adjusted R-squared: 0.9338
## F-statistic: 449.2 on 17 and 523 DF, p-value: < 2.2e-16
lm.pred = predict(lm.fit, College.test)
(lm.info = postResample(lm.pred, College.test$Apps))
## RMSE Rsquared MAE
## 1083.7809074 0.9026437 631.4713036
The root mean squared error (RMSE) obtained on the test set is 1083.7809074.
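postResample() comes from caret; as a sanity check, its RMSE and MAE can be reproduced by hand:
sqrt(mean((lm.pred - College.test$Apps)^2))   # RMSE by hand
mean(abs(lm.pred - College.test$Apps))        # MAE by hand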
(C) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
train = model.matrix(Apps ~ ., data = College.train)  # note: retains the intercept column,
test = model.matrix(Apps ~ ., data = College.test)    # hence the duplicate "(Intercept)" row in the coefficient output below
grid <- 10 ^ seq(4, -2, length = 100)                 # candidate λ values from 10^4 down to 10^-2
ridge.fit <- glmnet(train, College.train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
ridge.cv <- cv.glmnet(train, College.train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
(ridge.bestlam <- ridge.cv$lambda.min)
## [1] 0.01
The best λ chosen by cross-validation is 0.01. Since this is the smallest value in the grid, the ridge penalty is essentially negligible here, which is why the ridge results below are nearly identical to least squares.
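A quick check (a sketch) confirms the selected λ sits on the boundary of the grid, suggesting a wider grid, or glmnet's default λ path, might be worth trying:
range(grid)                  # 0.01 to 10000
ridge.bestlam == min(grid)   # TRUE: CV selected the grid's left endpoint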
ridge.pred = predict(ridge.fit, s = ridge.bestlam, newx = test)
(ridge.info = postResample(ridge.pred,College.test$Apps))
## RMSE Rsquared MAE
## 1083.7730184 0.9026454 631.4656057
The root mean squared error for the ridge regression model is 1083.7730184, essentially the same as least squares.
(D) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
set.seed(2)
lasso.fit <- glmnet(train, College.train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
lasso.cv <- cv.glmnet(train, College.train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
(lasso.bestlam <- lasso.cv$lambda.min)
## [1] 0.01
lasso.pred = predict(lasso.fit, s = lasso.bestlam, newx = test)
(lasso.info = postResample(lasso.pred, College.test$Apps))
## RMSE Rsquared MAE
## 1083.7730184 0.9026454 631.4656057
The test RMSE for the lasso is 1083.7730184, essentially identical to ridge and least squares, because λ = 0.01 applies almost no penalty.
predict(lasso.fit, s=lasso.bestlam, type="coefficients")
## 19 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -629.12897655
## (Intercept) .
## PrivateYes -528.72947567
## Accept 1.67546947
## Enroll -0.98002139
## Top10perc 38.90769848
## Top25perc -7.26977780
## F.Undergrad 0.01398635
## P.Undergrad 0.08263787
## Outstate -0.09101140
## Room.Board 0.12086539
## Books 0.04139372
## Personal 0.01099352
## PhD -6.36221484
## Terminal -5.11870024
## S.F.Ratio 26.61043507
## perc.alumni 3.13343444
## Expend 0.09127878
## Grad.Rate 7.77280764
All 17 predictors have non-zero coefficient estimates: with λ = 0.01 the penalty is too small for the lasso to drop any variables.
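This can be verified programmatically (a sketch; the extra all-zero "(Intercept)" row appears because model.matrix() already contains an intercept column):
lasso.coef <- predict(lasso.fit, s = lasso.bestlam, type = "coefficients")
sum(lasso.coef != 0) - 1   # subtract the intercept: 17 non-zero predictors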
(E) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
set.seed(3)
pcr.fit = pcr(Apps ~ ., data=College.train, scale=TRUE, validation="CV")
summary(pcr.fit)
## Data: X dimension: 541 17
## Y dimension: 541 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4041 3962 2213 2216 1755 1741 1734
## adjCV 4041 3964 2210 2216 1734 1734 1730
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1705 1703 1637 1640 1649 1649 1656
## adjCV 1698 1700 1633 1635 1644 1644 1651
## 14 comps 15 comps 16 comps 17 comps
## CV 1660 1602 1241 1198
## adjCV 1655 1567 1229 1187
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 30.70 56.67 63.58 69.60 75.1 80.06 83.76 87.24
## Apps 4.56 70.96 71.47 82.69 82.7 82.78 83.55 83.63
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.40 92.77 94.92 96.70 97.90 98.72 99.33
## Apps 84.85 84.98 85.02 85.08 85.08 85.16 91.34
## 16 comps 17 comps
## X 99.84 100.00
## Apps 93.26 93.59
validationplot(pcr.fit, val.type="MSEP")
set.seed(3)
pcr.pred = predict(pcr.fit, College.test, ncomp = 8)
(pcr.info = postResample(pcr.pred, College.test$Apps))
## RMSE Rsquared MAE
## 1211.2579996 0.8796107 820.2794103
With PCR the test RMSE is 1211.2579996 when M = 8. M = 8 was chosen from the validation plot: the CV error levels off around 8 to 9 components, and the further drop at 16 to 17 components essentially reproduces the full least squares fit.
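The choice of M can also be automated: recent versions of the pls package provide selectNcomp(), which (for example) implements the one-sigma rule of picking the smallest M whose CV error is within one standard error of the minimum (a sketch):
selectNcomp(pcr.fit, method = "onesigma", plot = TRUE)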
(F) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
set.seed(4)
pls.fit <- plsr(Apps ~ ., data=College.train, scale=TRUE, validation="CV")
summary(pls.fit)
## Data: X dimension: 541 17
## Y dimension: 541 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4041 2015 1682 1579 1498 1301 1214
## adjCV 4041 2010 1678 1572 1477 1280 1202
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1186 1182 1176 1173 1171 1166 1165
## adjCV 1176 1172 1167 1164 1162 1157 1156
## 14 comps 15 comps 16 comps 17 comps
## CV 1165 1166 1166 1166
## adjCV 1157 1157 1157 1157
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 26.19 39.93 62.35 65.02 68.04 73.36 76.45 79.44
## Apps 76.62 84.55 87.05 90.85 93.02 93.39 93.45 93.49
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 81.96 84.40 86.78 90.56 92.46 94.80 96.78
## Apps 93.54 93.57 93.58 93.58 93.59 93.59 93.59
## 16 comps 17 comps
## X 98.65 100.00
## Apps 93.59 93.59
validationplot(pls.fit, val.type = "MSEP")
pls.pred = predict(pls.fit, College.test, ncomp = 2)
(pls.info = postResample(pls.pred, College.test$Apps))
## RMSE Rsquared MAE
## 1284.1314290 0.8647247 763.4251588
The test RMSE is 1284.1314290 when M = 2. Note that the cross-validated RMSEP continues to fall until roughly M = 12, so the CV-optimal M is larger; M = 2 trades some accuracy for a much more parsimonious model.
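The same automated selection can be applied to the PLS fit (a sketch, again assuming pls provides selectNcomp()):
selectNcomp(pls.fit, method = "onesigma")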
(G) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
as_tibble(rbind(lm.info, ridge.info,lasso.info, pcr.info, pls.info)) %>%
mutate(model = c('Linear', 'Ridge', 'Lasso', 'PCR', 'PLS')) %>%
select(model, RMSE, Rsquared)
## # A tibble: 5 × 3
## model RMSE Rsquared
## <chr> <dbl> <dbl>
## 1 Linear 1084. 0.903
## 2 Ridge 1084. 0.903
## 3 Lasso 1084. 0.903
## 4 PCR 1211. 0.880
## 5 PLS 1284. 0.865
The linear, ridge, and lasso models have reasonably large R^2 values that are extremely close together, indicating the differences among them are minuscule; this is expected, since the cross-validated λ values apply almost no shrinkage. The PCR and PLS models are less desirable by comparison because their R^2 values are noticeably lower. Relative to each other, however, their R^2 values are close, so there may not be much difference between them.
We will now try to predict the per capita crime rate in the Boston data set.
(A) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
data("Boston")
set.seed(1)
Boston.split = initial_split(Boston, strata = "crim", prop = .8)
Boston.train = training(Boston.split)
Boston.test = testing(Boston.split)
lm.fit2 = lm(crim ~ ., data = Boston.train)
crim.pred = predict(lm.fit2, Boston.test)
(lm.info2 = postResample(crim.pred, Boston.test$crim))
## RMSE Rsquared MAE
## 6.7551201 0.4123084 2.9804667
The full linear model's test RMSE is 6.7551201.
set.seed(1)
matrix.train = model.matrix(crim ~ ., data = Boston.train)
matrix.test = model.matrix(crim ~ ., data = Boston.test)
grid = 10 ^ seq(4, -2, length = 100)
ridge.fit = glmnet(matrix.train, Boston.train$crim, alpha = 0, lambda = grid, thresh = 1e-12)
ridge.cv = cv.glmnet(matrix.train, Boston.train$crim, alpha = 0, lambda = grid, thresh = 1e-12)
(ridge.bestlam = ridge.cv$lambda.min)
## [1] 0.2477076
ridge.pred = predict(ridge.fit, s = ridge.bestlam, newx = matrix.test)
(ridge.info2 = postResample(ridge.pred, Boston.test$crim))
## RMSE Rsquared MAE
## 6.7350604 0.4155481 2.8834667
The best λ for ridge regression is 0.2477076 and the test RMSE is 6.7350604. Here the selected λ lies in the interior of the grid, so the ridge fit genuinely shrinks the coefficients.
set.seed(1)
lasso.fit = glmnet(matrix.train, Boston.train$crim, alpha = 1, lambda = grid, thresh = 1e-12)
lasso.cv = cv.glmnet(matrix.train, Boston.train$crim, alpha = 1, lambda = grid, thresh = 1e-12)
(lasso.bestlam = lasso.cv$lambda.min)
## [1] 0.06135907
lasso.pred = predict(lasso.fit, s = lasso.bestlam, newx = matrix.test)
(lasso.info2 = postResample(lasso.pred, Boston.test$crim))
## RMSE Rsquared MAE
## 6.7506997 0.4127902 2.8520426
predict(lasso.fit, s=lasso.bestlam, type="coefficients")
## 15 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 11.801194827
## (Intercept) .
## zn 0.038681799
## indus -0.079153523
## chas -0.707239482
## nox -5.594725457
## rm 0.235049318
## age .
## dis -0.841139977
## rad 0.526935923
## tax .
## ptratio -0.208748543
## black -0.003396034
## lstat 0.098860201
## medv -0.182569148
The best λ for the lasso is 0.06135907 and the test RMSE is 6.7506997. The lasso sets the coefficients of age and tax to zero, keeping 11 of the 13 predictors.
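As before, the number of non-zero coefficients can be counted programmatically (a sketch):
lasso.coef2 <- predict(lasso.fit, s = lasso.bestlam, type = "coefficients")
sum(lasso.coef2 != 0) - 1   # 11 non-zero predictors: age and tax are dropped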
set.seed(1)
pcr.fit = pcr(crim ~ ., data = Boston.train, scale = TRUE, validation = "CV")
summary(pcr.fit)
## Data: X dimension: 402 13
## Y dimension: 402 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.557 7.170 7.162 6.787 6.780 6.791 6.770
## adjCV 8.557 7.166 7.157 6.779 6.771 6.783 6.761
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.756 6.635 6.672 6.674 6.636 6.603 6.530
## adjCV 6.747 6.626 6.662 6.665 6.624 6.589 6.516
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 47.06 59.74 69.06 75.98 82.73 87.90 91.09 93.52
## crim 30.86 31.09 39.09 39.29 39.34 40.12 40.36 42.63
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.40 97.01 98.43 99.49 100.00
## crim 42.74 42.85 43.55 44.81 46.31
validationplot(pcr.fit, val.type = "MSEP")
pcr.pred = predict(pcr.fit, Boston.test, ncomp = 3)
(pcr.info2 = postResample(pcr.pred, Boston.test$crim))
## RMSE Rsquared MAE
## 6.8425179 0.3967627 2.8322963
The PCR model used M = 3 components and has a test RMSE of 6.8425179.
pls.fit = plsr(crim ~ ., data = Boston.train, scale = TRUE, validation = "CV")
summary(pls.fit)
## Data: X dimension: 402 13
## Y dimension: 402 1
## Fit method: kernelpls
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.557 6.974 6.630 6.561 6.543 6.516 6.504
## adjCV 8.557 6.971 6.624 6.550 6.531 6.503 6.490
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.492 6.481 6.483 6.481 6.482 6.482 6.482
## adjCV 6.479 6.469 6.471 6.469 6.470 6.470 6.470
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 46.63 56.09 61.08 70.50 75.66 78.89 83.93 85.92
## crim 34.54 42.22 44.62 45.32 45.81 46.14 46.23 46.30
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 88.72 94.23 96.64 98.37 100.00
## crim 46.31 46.31 46.31 46.31 46.31
validationplot(pls.fit, val.type = "MSEP")
pls.pred = predict(pls.fit, Boston.test, ncomp = 5)
(pls.info2 = postResample(pls.pred, Boston.test$crim))
## RMSE Rsquared MAE
## 6.7362959 0.4154761 2.9841105
The PLS model used M = 5 components and has a test RMSE of 6.7362959.
(B) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
as_tibble(rbind(lm.info2, ridge.info2, lasso.info2, pcr.info2, pls.info2)) %>%
mutate(model = c('Linear', 'Ridge', 'Lasso', 'PCR', 'PLS')) %>%
select(model, RMSE, Rsquared)
## # A tibble: 5 × 3
## model RMSE Rsquared
## <chr> <dbl> <dbl>
## 1 Linear 6.76 0.412
## 2 Ridge 6.74 0.416
## 3 Lasso 6.75 0.413
## 4 PCR 6.84 0.397
## 5 PLS 6.74 0.415
With the exception of PCR, all models have R^2 values that are extremely close together, indicating the differences among them are minuscule. Based on the test RMSE values, ridge regression and PLS have the lowest error and seem to perform well on this data set; of the two, the ridge model would be the proposed choice.
(C) Does your chosen model involve all of the features in the data set? Why or why not?
The ridge regression model uses all 13 features. Unlike the lasso, the ridge (L2) penalty shrinks coefficients toward zero but never sets them exactly to zero, so it performs no variable selection.
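This can be confirmed by counting the non-zero ridge coefficients (a sketch; the subtraction removes the intercept):
ridge.coef <- predict(ridge.fit, s = ridge.bestlam, type = "coefficients")
sum(ridge.coef != 0) - 1   # all 13 predictors retained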