Logistic regression is different from OLS, so what methods are available to us for selecting predictors?
This example is from the Stanford video from Module #8. It’s also in the book An Introduction to Statistical Learning, section 4.6.1.
library(ISLR)   # Smarket data
library(MASS)   # stepAIC()
library(caret)  # confusionMatrix()
library(pROC)   # roc(), auc(), roc.test()

# Data from the Stanford video
sm <- ISLR::Smarket
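It helps to glance at the data first: Smarket contains daily percentage returns for the S&P 500 from 2001 through 2005.
# Five lagged returns, trading volume, today's return, and the
# Up/Down direction we want to predict
str(sm)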
The following models were the first and last that Trevor Hastie tried in his video. The ANOVA tables with test = 'Chisq' show the change in deviance from adding each of the variables in order, which is analogous to the partial \(F\)-test for the lm ANOVA. The summary function shows the \(z\)-value instead of the \(t\)-value; the \(z\)-value is the Wald test statistic for each \(\beta\) defined in MARR Chapter 8.
# First model
m1 <- glm(
  Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
  data = sm, family = binomial)
anova(m1, test = 'Chisq')
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: Direction
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 1249 1731.2
## Lag1 1 1.97822 1248 1729.2 0.1596
## Lag2 1 0.79313 1247 1728.4 0.3732
## Lag3 1 0.03167 1246 1728.4 0.8587
## Lag4 1 0.01942 1245 1728.3 0.8892
## Lag5 1 0.03540 1244 1728.3 0.8508
## Volume 1 0.73283 1243 1727.6 0.3920
summary(m1)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
## Volume, family = binomial, data = sm)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.446 -1.203 1.065 1.145 1.326
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000 0.240736 -0.523 0.601
## Lag1 -0.073074 0.050167 -1.457 0.145
## Lag2 -0.042301 0.050086 -0.845 0.398
## Lag3 0.011085 0.049939 0.222 0.824
## Lag4 0.009359 0.049974 0.187 0.851
## Lag5 0.010313 0.049511 0.208 0.835
## Volume 0.135441 0.158360 0.855 0.392
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1731.2 on 1249 degrees of freedom
## Residual deviance: 1727.6 on 1243 degrees of freedom
## AIC: 1741.6
##
## Number of Fisher Scoring iterations: 3
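As a quick check on those \(z\)-values, the Wald statistic is just each estimate divided by its standard error. A minimal sketch using the coefficient table above:
# Wald z = estimate / SE; two-sided p-value from the standard normal
cf <- coef(summary(m1))
z <- cf[, 'Estimate'] / cf[, 'Std. Error']
cbind(z = z, p = 2 * pnorm(-abs(z)))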
# Final model
m2 <- glm(
  Direction ~ Lag1 + Lag2,
  data = sm, family = binomial)
anova(m2, test = 'Chisq')
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: Direction
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 1249 1731.2
## Lag1 1 1.97822 1248 1729.2 0.1596
## Lag2 1 0.79313 1247 1728.4 0.3732
summary(m2)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2, family = binomial, data = sm)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.387 -1.203 1.073 1.147 1.346
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.07425 0.05667 1.310 0.190
## Lag1 -0.07151 0.05010 -1.427 0.153
## Lag2 -0.04450 0.05000 -0.890 0.374
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1731.2 on 1249 degrees of freedom
## Residual deviance: 1728.4 on 1247 degrees of freedom
## AIC: 1734.4
##
## Number of Fisher Scoring iterations: 3
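You can also test the two models directly against each other. Because m2 is nested in m1 (it just drops Lag3, Lag4, Lag5, and Volume), a likelihood-ratio test applies; this is the analogue of the partial \(F\)-test mentioned above:
# LRT for the four dropped terms; a large p-value says m2 loses little
anova(m2, m1, test = 'Chisq')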
The \(\chi^2\) test is not the only one available. You can also use test = 'Cp' for Mallows' \(C_p\) statistic, which you would want to minimize.
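For example, to get a \(C_p\) column instead of a p-value:
# Same sequential analysis-of-deviance table, scored by Mallows' Cp
anova(m1, test = 'Cp')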
You can do stepwise selection, just as with linear models, using MASS::stepAIC.
# StepAIC works for glms, too!
stepAIC(m1)
## Start: AIC=1741.58
## Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume
##
## Df Deviance AIC
## - Lag4 1 1727.6 1739.6
## - Lag5 1 1727.6 1739.6
## - Lag3 1 1727.6 1739.6
## - Lag2 1 1728.3 1740.3
## - Volume 1 1728.3 1740.3
## <none> 1727.6 1741.6
## - Lag1 1 1729.7 1741.7
##
## Step: AIC=1739.62
## Direction ~ Lag1 + Lag2 + Lag3 + Lag5 + Volume
##
## Df Deviance AIC
## - Lag5 1 1727.7 1737.7
## - Lag3 1 1727.7 1737.7
## - Volume 1 1728.3 1738.3
## - Lag2 1 1728.3 1738.3
## <none> 1727.6 1739.6
## - Lag1 1 1729.8 1739.8
##
## Step: AIC=1737.66
## Direction ~ Lag1 + Lag2 + Lag3 + Volume
##
## Df Deviance AIC
## - Lag3 1 1727.7 1735.7
## - Volume 1 1728.4 1736.4
## - Lag2 1 1728.4 1736.4
## <none> 1727.7 1737.7
## - Lag1 1 1729.8 1737.8
##
## Step: AIC=1735.71
## Direction ~ Lag1 + Lag2 + Volume
##
## Df Deviance AIC
## - Volume 1 1728.4 1734.4
## - Lag2 1 1728.4 1734.4
## <none> 1727.7 1735.7
## - Lag1 1 1729.8 1735.8
##
## Step: AIC=1734.4
## Direction ~ Lag1 + Lag2
##
## Df Deviance AIC
## - Lag2 1 1729.2 1733.2
## <none> 1728.4 1734.4
## - Lag1 1 1730.5 1734.5
##
## Step: AIC=1733.2
## Direction ~ Lag1
##
## Df Deviance AIC
## - Lag1 1 1731.2 1733.2
## <none> 1729.2 1733.2
##
## Step: AIC=1733.17
## Direction ~ 1
##
## Call: glm(formula = Direction ~ 1, family = binomial, data = sm)
##
## Coefficients:
## (Intercept)
## 0.07363
##
## Degrees of Freedom: 1249 Total (i.e. Null); 1249 Residual
## Null Deviance: 1731
## Residual Deviance: 1731 AIC: 1733
That can help with automatic variable selection. Notice that the search drops every term and ends at the intercept-only model, further evidence that none of these predictors carries much signal.
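If you want a stiffer penalty than AIC's 2 per parameter, stepAIC accepts a k argument; setting k = log(n) gives a BIC-flavored search (a variant not shown in the video):
# BIC-style stepwise search: penalty of log(n) per parameter
stepAIC(m1, k = log(nrow(sm)), trace = FALSE)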
Because these models are predicting class membership, you can use confusion matrices to compare models with different sets of potential predictors.
# Add predicted probabilities of 'Up' and 0.5-threshold class labels
sm$PredRatio_m1 <- predict(m1, type = 'response')
sm$PredDirection_m1 <- factor(
  ifelse(sm$PredRatio_m1 > 0.5, 'Up', 'Down'))
sm$PredRatio_m2 <- predict(m2, type = 'response')
sm$PredDirection_m2 <- factor(
  ifelse(sm$PredRatio_m2 > 0.5, 'Up', 'Down'))
Model m2 has a slightly higher accuracy than model m1, but neither is very good.
# Compare accuracy and other metrics from conf matrix
confusionMatrix(table(
  sm[, c('PredDirection_m1', 'Direction')]))
## Confusion Matrix and Statistics
##
## Direction
## PredDirection_m1 Down Up
## Down 145 141
## Up 457 507
##
## Accuracy : 0.5216
## 95% CI : (0.4935, 0.5496)
## No Information Rate : 0.5184
## P-Value [Acc > NIR] : 0.4216
##
## Kappa : 0.0237
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.2409
## Specificity : 0.7824
## Pos Pred Value : 0.5070
## Neg Pred Value : 0.5259
## Prevalence : 0.4816
## Detection Rate : 0.1160
## Detection Prevalence : 0.2288
## Balanced Accuracy : 0.5116
##
## 'Positive' Class : Down
##
confusionMatrix(table(
  sm[, c('PredDirection_m2', 'Direction')]))
## Confusion Matrix and Statistics
##
## Direction
## PredDirection_m2 Down Up
## Down 114 102
## Up 488 546
##
## Accuracy : 0.528
## 95% CI : (0.4999, 0.556)
## No Information Rate : 0.5184
## P-Value [Acc > NIR] : 0.2576
##
## Kappa : 0.0327
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.1894
## Specificity : 0.8426
## Pos Pred Value : 0.5278
## Neg Pred Value : 0.5280
## Prevalence : 0.4816
## Detection Rate : 0.0912
## Detection Prevalence : 0.1728
## Balanced Accuracy : 0.5160
##
## 'Positive' Class : Down
##
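Accuracy is just the fraction of rows where the predicted class matches the truth, so you can verify the tables above directly:
# Should reproduce the Accuracy lines above (about 0.522 and 0.528)
mean(sm$PredDirection_m1 == sm$Direction)
mean(sm$PredDirection_m2 == sm$Direction)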
ROC plots are also helpful for selecting between models, and thereby predictors. Again, neither of these looks very good; in fact, m2 seems closer to an even guess than m1.
# Compare ROC curves
roc_m1 <- roc(
  response = sm$Direction, predictor = sm$PredRatio_m1)
## Setting levels: control = Down, case = Up
## Setting direction: controls < cases
roc_m2 <- roc(
  response = sm$Direction, predictor = sm$PredRatio_m2)
## Setting levels: control = Down, case = Up
## Setting direction: controls < cases
par(mfrow = c(1, 2))
plot(roc_m1)
plot(roc_m2)
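Side-by-side panels can hide small differences; one option is to overlay the curves on one set of axes using pROC's add argument:
# Overlay both ROC curves for a direct comparison
plot(roc_m1)
plot(roc_m2, add = TRUE, lty = 2)
legend('bottomright', legend = c('m1', 'm2'), lty = 1:2)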
AUC helps in most cases, and certainly in this one, where it is hard to see any difference between the ROC curves.
# Compare AUC
auc(roc_m1)
## Area under the curve: 0.5387
auc(roc_m2)
## Area under the curve: 0.5314
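If you want a formal comparison of those AUCs rather than eyeballing them, pROC provides roc.test, which by default runs DeLong's test for paired curves built on the same observations:
# DeLong test: is the difference between the two AUCs significant?
roc.test(roc_m1, roc_m2)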