For parts (a) through (c), indicate which of i. through iv. is correct:
Justify your answer.
The lasso, relative to least squares, is:
The lasso, relative to least squares, is less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance (iii). This is due to the shrinkage penalty included when estimating the coefficients: the lasso shrinks the coefficient estimates toward zero and can set some of them exactly to zero, producing a simpler and more interpretable model.
Repeat (a) for ridge regression relative to least squares.
Ridge regression is less flexible than least squares and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance (iii). Ridge regression is very similar to the lasso but uses a different shrinkage penalty, one that shrinks the coefficients close to, but never exactly to, zero. By shrinking the coefficient estimates, we can substantially decrease the variance at the cost of a small increase in bias.
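For reference, the two penalized criteria (in the standard ISLR notation) are

\[
\text{Ridge: } \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2
\qquad
\text{Lasso: } \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j|
\]

The absolute-value penalty is what allows lasso estimates to reach exactly zero, while the squared ridge penalty only shrinks them toward it.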
Repeat (a) for non-linear methods relative to least squares.
Non-linear methods are more flexible than least squares and hence will give improved prediction accuracy when their increase in variance is less than their decrease in bias (ii).
In this exercise, we will predict the number of applications received using the other variables in the College data set.
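A quick look at the structure of the data (the output below presumably comes from a call like this):

str(College)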
## 'data.frame': 777 obs. of 18 variables:
## $ Private : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Apps : num 1660 2186 1428 417 193 ...
## $ Accept : num 1232 1924 1097 349 146 ...
## $ Enroll : num 721 512 336 137 55 158 103 489 227 172 ...
## $ Top10perc : num 23 16 22 60 16 38 17 37 30 21 ...
## $ Top25perc : num 52 29 50 89 44 62 45 68 63 44 ...
## $ F.Undergrad: num 2885 2683 1036 510 249 ...
## $ P.Undergrad: num 537 1227 99 63 869 ...
## $ Outstate : num 7440 12280 11250 12960 7560 ...
## $ Room.Board : num 3300 6450 3750 5450 4120 ...
## $ Books : num 450 750 400 450 800 500 500 450 300 660 ...
## $ Personal : num 2200 1500 1165 875 1500 ...
## $ PhD : num 70 29 53 92 76 67 90 89 79 40 ...
## $ Terminal : num 78 30 66 97 72 73 93 100 84 41 ...
## $ S.F.Ratio : num 18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
## $ perc.alumni: num 12 16 30 37 2 11 26 37 23 15 ...
## $ Expend : num 7041 10527 8735 19016 10922 ...
## $ Grad.Rate : num 60 56 54 59 15 55 63 73 80 52 ...
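There are no missing values to handle; the zero below is consistent with a check such as:

sum(is.na(College))  # total count of missing values (assumed check)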
## [1] 0
library(ISLR)       # College data set
library(tidyverse)  # %>% pipe and data wrangling
library(caret)

set.seed(21)
ind <- College$Apps %>%
  createDataPartition(p = 0.7, list = FALSE)
college.train <- College[ind, ]
college.test <- College[-ind, ]

# repeated 10-fold cross-validation (3 repeats), reused for every model below
ctrl <- trainControl("repeatedcv", number = 10, repeats = 3)
Fit a linear model using least squares on the training set, and report the test error obtained.
The linear model, trained with repeated 10-fold cross-validation, gives a test error (RMSE) of about 1,239.
# The least squares fit itself was not shown above; it presumably followed the
# same pattern as the other models:
lm.fit <- train(Apps ~ ., data = college.train, method = "lm",
                trControl = ctrl, metric = "RMSE")

lm.preds <- lm.fit %>%
  predict(college.test)
lm.r2 <- caret::R2(college.test$Apps, lm.preds)
lm.rmse <- RMSE(college.test$Apps, lm.preds)
lm.rmse
## [1] 1238.523
# candidate values of lambda spanning a wide range
grid <- 10^seq(10, -2, length = 100)

# alpha = 0 selects the ridge penalty in glmnet
ridge.fit <- train(Apps ~ ., data = college.train, method = "glmnet",
                   trControl = ctrl, metric = "RMSE",
                   tuneGrid = expand.grid(alpha = 0, lambda = grid))
ridge.fit$bestTune
## alpha lambda
## 38 0 305.3856
ridge.preds <- ridge.fit %>%
predict(college.test)
ridge.r2 <- caret::R2(college.test$Apps, ridge.preds)
ridge.rmse <- RMSE(college.test$Apps, ridge.preds)
ridge.rmse
## [1] 1665.928
Using the best \(\lambda\), ridge regression gives a test error of about 1,666, noticeably worse than the linear model.
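The coefficient table below was presumably produced by extracting the glmnet coefficients at the chosen \(\lambda\); the later coefficient tables can be read the same way:

coef(ridge.fit$finalModel, ridge.fit$bestTune$lambda)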
## 18 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -1.127861e+03
## PrivateYes -6.635202e+02
## Accept 8.144054e-01
## Enroll 5.892259e-01
## Top10perc 2.286175e+01
## Top25perc 1.456817e+00
## F.Undergrad 1.142090e-01
## P.Undergrad 1.253448e-02
## Outstate 2.223016e-03
## Room.Board 2.100673e-01
## Books 2.344128e-01
## Personal -7.322214e-02
## PhD -3.983164e+00
## Terminal -4.764773e+00
## S.F.Ratio 4.094570e+00
## perc.alumni -1.384515e+01
## Expend 6.868602e-02
## Grad.Rate 9.985739e+00
# alpha = 1 selects the lasso penalty in glmnet
lasso.fit <- train(Apps ~ ., data = college.train, method = "glmnet",
                   trControl = ctrl, metric = "RMSE",
                   tuneGrid = expand.grid(alpha = 1, lambda = grid))
lasso.fit$bestTune
## alpha lambda
## 27 1 14.17474
lasso.preds <- lasso.fit %>%
predict(college.test)
lasso.r2 <- caret::R2(college.test$Apps, lasso.preds)
lasso.rmse <- RMSE(college.test$Apps, lasso.preds)
lasso.rmse
## [1] 1289.071
The lasso gives us a test error of about 1,289, with three coefficients set exactly to zero: Enroll, Personal, and S.F.Ratio.
## 18 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -333.42729282
## PrivateYes -623.90503359
## Accept 1.29807112
## Enroll .
## Top10perc 34.67284704
## Top25perc -4.64574511
## F.Undergrad 0.01995245
## P.Undergrad 0.01834171
## Outstate -0.02828174
## Room.Board 0.14140433
## Books 0.03705017
## Personal .
## PhD -7.65561384
## Terminal -2.99825994
## S.F.Ratio .
## perc.alumni -6.92373141
## Expend 0.05835990
## Grad.Rate 6.35105420
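As a quick sanity check on that count (a sketch; glmnet is loaded by caret when the model is fit):

lasso.coefs <- coef(lasso.fit$finalModel, lasso.fit$bestTune$lambda)
sum(lasso.coefs[-1, 1] == 0)  # zeroed predictors, excluding the intercept; expect 3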
pcr.fit <- train(Apps ~ ., data = college.train, method = "pcr",
                 scale = TRUE, trControl = ctrl,
                 metric = "RMSE", tuneLength = 17)  # tries M = 1, ..., 17 components
plot(pcr.fit)
pcr.fit$bestTune
##    ncomp
## 16    16
The Principal Components model with the smallest cross-validation error occurs at M = 16 components, which is not too different from performing least squares regression, since nearly all components are included in the model. The PCR model explains over 99% of the variation in the predictors and about 93% of the variation in Apps.
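The variance-explained table below presumably comes from summarizing the final fit; the same call applies to the later PCR and PLS summaries:

summary(pcr.fit$finalModel)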
## Data: X dimension: 545 17
## Y dimension: 545 1
## Fit method: svdpc
## Number of components considered: 16
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 31.959 57.64 64.52 70.22 75.58 80.49 84.28
## .outcome 9.532 78.39 78.84 81.26 86.10 87.56 88.25
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 87.86 90.94 93.28 95.26 96.94 97.98 98.96
## .outcome 88.76 89.33 89.55 89.56 89.57 89.57 89.58
## 15 comps 16 comps
## X 99.48 99.85
## .outcome 89.80 92.57
pcr.preds <- pcr.fit %>%
predict(college.test)
pcr.r2 <- caret::R2(college.test$Apps, pcr.preds)
pcr.rmse <- RMSE(college.test$Apps, pcr.preds)
pcr.rmse
## [1] 1295.164
pls.fit <- train(Apps~., data = college.train, method = "pls", metric = "RMSE",
scale = TRUE, trControl = ctrl, tuneLength = 10)
plot(pls.fit)
pls.fit$bestTune
##   ncomp
## 9     9
The Partial Least Squares model gives us the lowest cross-validation error when M = 9 components. This captures about 84% of the variation in the predictors and 93% of the variation in Apps.
## Data: X dimension: 545 17
## Y dimension: 545 1
## Fit method: oscorespls
## Number of components considered: 9
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 26.67 50.02 62.88 66.84 70.57 73.77 77.66
## .outcome 81.90 86.19 89.97 90.88 91.93 92.69 92.80
## 8 comps 9 comps
## X 81.40 83.94
## .outcome 92.82 92.85
pls.preds <- pls.fit %>%
predict(college.test)
pls.r2 <- caret::R2(college.test$Apps, pls.preds)
pls.rmse <- RMSE(college.test$Apps, pls.preds)
pls.rmse
## [1] 1245.394
Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
Out of the five approaches, the Linear and Partial Least Squares models produce the lowest test errors. I would suggest using the PLS model because it contains only 9 components and is simpler than the full linear model. Based on the test R\(^2\), the PLS model provides a good fit to the data and can explain about 93% of the variation in Apps.
data.frame(Model = c("Linear Regression", "Ridge Regression", "Lasso",
"Principal Components", "Partial Least Squares"),
RMSE = c(lm.rmse, ridge.rmse, lasso.rmse, pcr.rmse, pls.rmse),
RSquared = percent(c(lm.r2, ridge.r2, lasso.r2, pcr.r2, pls.r2), accuracy = 0.01)) %>%
arrange(RMSE) %>%
kable(format.args = list(big.mark = ","), align = c('l', 'c', 'c'), digits = 2) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
Model | RMSE | RSquared |
---|---|---|
Linear Regression | 1,238.52 | 92.71% |
Partial Least Squares | 1,245.39 | 92.66% |
Lasso | 1,289.07 | 92.27% |
Principal Components | 1,295.16 | 92.18% |
Ridge Regression | 1,665.93 | 88.01% |
We will now try to predict the per capita crime rate in the Boston data set.
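Again, a quick look at the structure (presumably via str(); the Boston data come with the MASS package):

str(Boston)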
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
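As with College, there are no missing values; the zero below matches a check such as:

sum(is.na(Boston))  # total count of missing values (assumed check)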
## [1] 0
library(MASS)  # Boston data set

set.seed(11)
ind <- Boston$crim %>%
  createDataPartition(p = 0.7, list = FALSE)
boston.train <- Boston[ind, ]
boston.test <- Boston[-ind, ]
ctrl <- trainControl("repeatedcv", number = 10, repeats = 3)
# stepwise subset selection via leaps, considering models with 1 to 11 predictors
lm.step <- train(crim ~ ., data = boston.train,
                 method = "leapSeq", metric = "RMSE",
                 tuneGrid = data.frame(nvmax = 1:11),
                 trControl = ctrl)
plot(lm.step)
Using stepwise selection, the model with 11 predictors minimizes the cross-validation error. The 11 selected variables are zn, indus, nox, rm, dis, rad, tax, ptratio, black, lstat, and medv.
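The coefficients below are presumably extracted from the underlying regsubsets object, e.g.:

coef(lm.step$finalModel, id = 11)  # id = size of the chosen model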
## (Intercept) zn indus nox rm
## 18.803005844 0.049895785 -0.077847103 -11.566222458 0.498400923
## dis rad tax ptratio black
## -1.126021669 0.641737854 -0.004108639 -0.328138847 -0.003503293
## lstat medv
## 0.111945440 -0.248228511
lm.preds <- lm.step %>%
predict(boston.test)
lm.r2 <- caret::R2(boston.test$crim, lm.preds)
lm.rmse <- RMSE(boston.test$crim, lm.preds)
lm.rmse
## [1] 5.644116
grid <- 10^seq(10, -2, length = 100)
ridge.fit <- train(crim ~ ., data = boston.train, method = "glmnet",
trControl = ctrl, metric = "RMSE",
tuneGrid = expand.grid(alpha = 0, lambda = grid))
ridge.fit$bestTune
## alpha lambda
## 18 0 1.149757
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 5.450024251
## zn 0.028937374
## indus -0.081669458
## chas -1.149807159
## nox -2.645270852
## rm 0.247099570
## age 0.005135483
## dis -0.610186466
## rad 0.396576915
## tax 0.005317767
## ptratio -0.093691841
## black -0.005372294
## lstat 0.132609596
## medv -0.135318917
ridge.preds <- ridge.fit %>%
predict(boston.test)
ridge.r2 <- caret::R2(boston.test$crim, ridge.preds)
ridge.rmse <- RMSE(boston.test$crim, ridge.preds)
ridge.rmse
## [1] 5.563703
Using the best \(\lambda\), the ridge model reduces the test error to 5.56.
lasso.fit <- train(crim ~ ., data = boston.train, method = "glmnet",
trControl = ctrl, metric = "RMSE",
tuneGrid = expand.grid(alpha = 1, lambda = grid))
lasso.fit$bestTune
## alpha lambda
## 17 1 0.869749
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -1.151548399
## zn .
## indus .
## chas .
## nox .
## rm .
## age .
## dis -0.003138506
## rad 0.487250940
## tax .
## ptratio .
## black .
## lstat 0.119299789
## medv -0.050643562
lasso.preds <- lasso.fit %>%
predict(boston.test)
lasso.r2 <- caret::R2(boston.test$crim, lasso.preds)
lasso.rmse <- RMSE(boston.test$crim, lasso.preds)
lasso.rmse
## [1] 5.673909
The lasso includes only four variables in the model and gives us a test error of 5.67.
# with no tuneGrid supplied, caret tunes both alpha and lambda over a small default grid
e.fit <- train(crim ~ ., data = boston.train, method = "glmnet",
               trControl = ctrl, metric = "RMSE")
e.fit$bestTune
## alpha lambda
## 8 1 0.1150005
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 9.234532624
## zn 0.032256976
## indus -0.063272348
## chas -0.840476025
## nox -2.982001590
## rm .
## age .
## dis -0.663485627
## rad 0.540182929
## tax .
## ptratio -0.141013003
## black -0.003244753
## lstat 0.100814473
## medv -0.153553125
e.preds <- e.fit %>%
predict(boston.test)
e.r2 <- caret::R2(boston.test$crim, e.preds)
e.rmse <- RMSE(boston.test$crim, e.preds)
e.rmse
## [1] 5.627802
Elastic net regression decreases the test error to 5.63, with three coefficients set to zero.
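For context, glmnet's elastic-net penalty blends the ridge and lasso penalties from before,

\[
\lambda\left[(1-\alpha)\tfrac{1}{2}\sum_{j=1}^{p}\beta_j^2 + \alpha\sum_{j=1}^{p}|\beta_j|\right],
\]

so \(\alpha = 0\) recovers ridge and \(\alpha = 1\) recovers the lasso. Here the tuning selected \(\alpha = 1\): effectively a lasso, but with a smaller \(\lambda\) than the dedicated lasso fit above, which is why more coefficients survive.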
pcr.fit <- train(crim ~ ., data = boston.train, method = "pcr",
scale = TRUE, trControl = ctrl, metric = "RMSE",
tuneLength = 11)
plot(pcr.fit)
pcr.fit$bestTune
##   ncomp
## 9     9
The optimal number of principal components is 9, which explains about 96% of the variation in the predictors and 43% of the variation in crim.
## Data: X dimension: 356 13
## Y dimension: 356 1
## Fit method: svdpc
## Number of components considered: 9
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 46.98 60.76 69.91 76.62 83.23 88.25 91.31
## .outcome 31.68 31.92 38.97 39.04 39.05 39.65 39.69
## 8 comps 9 comps
## X 93.61 95.50
## .outcome 43.16 43.37
pcr.preds <- pcr.fit %>%
predict(boston.test)
pcr.r2 <- caret::R2(boston.test$crim, pcr.preds)
pcr.rmse <- RMSE(boston.test$crim, pcr.preds)
pcr.rmse
## [1] 5.728774
pls.fit <- train(crim ~ ., data = boston.train, method = "pls", metric = "RMSE",
scale = TRUE, trControl = ctrl, tuneLength = 11)
plot(pls.fit)
pls.fit$bestTune
##   ncomp
## 5     5
The Partial Least Squares model has the lowest cross-validation error when M = 5 components. This captures about 76% of the variation in the predictors and 46% of the variation in crim.
## Data: X dimension: 356 13
## Y dimension: 356 1
## Fit method: oscorespls
## Number of components considered: 5
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps
## X 46.62 55.94 62.24 71.95 76.29
## .outcome 34.94 42.05 44.45 45.31 45.76
pls.preds <- pls.fit %>%
predict(boston.test)
pls.r2 <- caret::R2(boston.test$crim, pls.preds)
pls.rmse <- RMSE(boston.test$crim, pls.preds)
pls.rmse
## [1] 5.625681
Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
Ridge Regression and Partial Least Squares give us the lowest test errors. I would suggest using the PLS model because it is simpler, with only 5 components. The overall test R\(^2\) is relatively low, however, indicating that a non-linear approach may provide a better fit.
data.frame(Model = c("Stepwise Selection", "Ridge Regression", "Lasso", "Elastic Net Regression",
"Principal Components Regression", "Partial Least Squares"),
RMSE = c(lm.rmse, ridge.rmse, lasso.rmse, e.rmse, pcr.rmse, pls.rmse),
RSquared = percent(c(lm.r2, ridge.r2, lasso.r2, e.r2, pcr.r2, pls.r2), accuracy = 0.01)) %>%
arrange(RMSE) %>%
kable(align = c('l', 'c', 'c'), digits = 2) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
Model | RMSE | RSquared |
---|---|---|
Ridge Regression | 5.56 | 43.49% |
Partial Least Squares | 5.63 | 43.40% |
Elastic Net Regression | 5.63 | 43.06% |
Stepwise Selection | 5.64 | 43.32% |
Lasso | 5.67 | 40.40% |
Principal Components Regression | 5.73 | 41.63% |
Does your chosen model involve all of the features in the data set? Why or why not?
Out of the 13 features in the data set, the Partial Least Squares model was able to minimize the test error with only 5 components. These explain 76% of the variation in the predictors and 46% of the variation in crim.
## Data: X dimension: 356 13
## Y dimension: 356 1
## Fit method: oscorespls
## Number of components considered: 5
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps
## X 46.62 55.94 62.24 71.95 76.29
## .outcome 34.94 42.05 44.45 45.31 45.76