Logistic regression is different from OLS, so what methods are available to help in selecting which predictors to use?

Stock Market Example

This example is from the Stanford video in Module #8. It also appears in An Introduction to Statistical Learning, Section 4.6.1.

library(ISLR)
library(MASS)
library(caret)
library(pROC)

# Data from the Stanford video
sm <- ISLR::Smarket
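
Up days slightly outnumber Down days (648 vs 602), which is worth keeping in mind when judging accuracy later.

# Class balance of the response
table(sm$Direction)  # Down: 602, Up: 648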

ANOVA and Coefficient Testing

The following models were the first and last that Trevor Hastie tried in his video. The ANOVAs with test = 'Chisq' show the statistics for adding each of the variables in order; that is analogous to the partial \(F\)-test in the lm ANOVA. The summary function shows the \(z\)-value instead of the \(t\)-value. The \(z\)-value is the Wald test statistic for each \(\beta\) defined in MARR Chapter 8.

# First model
m1 <- glm(
  Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, 
  data = sm, family = binomial)
anova(m1, test = 'Chisq')
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: Direction
## 
## Terms added sequentially (first to last)
## 
## 
##        Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL                    1249     1731.2         
## Lag1    1  1.97822      1248     1729.2   0.1596
## Lag2    1  0.79313      1247     1728.4   0.3732
## Lag3    1  0.03167      1246     1728.4   0.8587
## Lag4    1  0.01942      1245     1728.3   0.8892
## Lag5    1  0.03540      1244     1728.3   0.8508
## Volume  1  0.73283      1243     1727.6   0.3920
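
Each p-value in that table is the upper tail of a \(\chi^2_1\) distribution evaluated at the deviance drop. A quick check for the Lag1 row:

# Deviance drop for Lag1: 1731.2 - 1729.2 = 1.97822, on 1 df
pchisq(1.97822, df = 1, lower.tail = FALSE)  # 0.1596, matching Pr(>Chi)
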
summary(m1)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = binomial, data = sm)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.446  -1.203   1.065   1.145   1.326  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.126000   0.240736  -0.523    0.601
## Lag1        -0.073074   0.050167  -1.457    0.145
## Lag2        -0.042301   0.050086  -0.845    0.398
## Lag3         0.011085   0.049939   0.222    0.824
## Lag4         0.009359   0.049974   0.187    0.851
## Lag5         0.010313   0.049511   0.208    0.835
## Volume       0.135441   0.158360   0.855    0.392
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1731.2  on 1249  degrees of freedom
## Residual deviance: 1727.6  on 1243  degrees of freedom
## AIC: 1741.6
## 
## Number of Fisher Scoring iterations: 3
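
The \(z\) values above are easy to verify: each one is just the estimate divided by its standard error.

# Wald z statistics: estimate / standard error
cs <- coef(summary(m1))
cs[, 'Estimate'] / cs[, 'Std. Error']  # matches the z value column
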
# Final model
m2 <- glm(
  Direction ~ Lag1 + Lag2, 
  data = sm, family = binomial)
anova(m2, test = 'Chisq')
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: Direction
## 
## Terms added sequentially (first to last)
## 
## 
##      Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL                  1249     1731.2         
## Lag1  1  1.97822      1248     1729.2   0.1596
## Lag2  1  0.79313      1247     1728.4   0.3732
summary(m2)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2, family = binomial, data = sm)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.387  -1.203   1.073   1.147   1.346  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.07425    0.05667   1.310    0.190
## Lag1        -0.07151    0.05010  -1.427    0.153
## Lag2        -0.04450    0.05000  -0.890    0.374
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1731.2  on 1249  degrees of freedom
## Residual deviance: 1728.4  on 1247  degrees of freedom
## AIC: 1734.4
## 
## Number of Fisher Scoring iterations: 3
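
Both summaries report AIC, and you can compare the two models directly; smaller is better.

# m2 wins on AIC (1734.4 vs 1741.6)
AIC(m1, m2)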

The \(\chi^2\) test is not the only one available. You can also use test = 'Cp' for Mallows' \(C_p\) statistic, which you would want to minimize.
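
A quick sketch of the \(C_p\) version (output omitted):

# Mallows' Cp version of the sequential tests
anova(m1, test = 'Cp')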

Stepwise Regression

As with linear models, you can use stepwise regression via MASS::stepAIC.

# StepAIC works for glms, too!
stepAIC(m1)
## Start:  AIC=1741.58
## Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume
## 
##          Df Deviance    AIC
## - Lag4    1   1727.6 1739.6
## - Lag5    1   1727.6 1739.6
## - Lag3    1   1727.6 1739.6
## - Lag2    1   1728.3 1740.3
## - Volume  1   1728.3 1740.3
## <none>        1727.6 1741.6
## - Lag1    1   1729.7 1741.7
## 
## Step:  AIC=1739.62
## Direction ~ Lag1 + Lag2 + Lag3 + Lag5 + Volume
## 
##          Df Deviance    AIC
## - Lag5    1   1727.7 1737.7
## - Lag3    1   1727.7 1737.7
## - Volume  1   1728.3 1738.3
## - Lag2    1   1728.3 1738.3
## <none>        1727.6 1739.6
## - Lag1    1   1729.8 1739.8
## 
## Step:  AIC=1737.66
## Direction ~ Lag1 + Lag2 + Lag3 + Volume
## 
##          Df Deviance    AIC
## - Lag3    1   1727.7 1735.7
## - Volume  1   1728.4 1736.4
## - Lag2    1   1728.4 1736.4
## <none>        1727.7 1737.7
## - Lag1    1   1729.8 1737.8
## 
## Step:  AIC=1735.71
## Direction ~ Lag1 + Lag2 + Volume
## 
##          Df Deviance    AIC
## - Volume  1   1728.4 1734.4
## - Lag2    1   1728.4 1734.4
## <none>        1727.7 1735.7
## - Lag1    1   1729.8 1735.8
## 
## Step:  AIC=1734.4
## Direction ~ Lag1 + Lag2
## 
##        Df Deviance    AIC
## - Lag2  1   1729.2 1733.2
## <none>      1728.4 1734.4
## - Lag1  1   1730.5 1734.5
## 
## Step:  AIC=1733.2
## Direction ~ Lag1
## 
##        Df Deviance    AIC
## - Lag1  1   1731.2 1733.2
## <none>      1729.2 1733.2
## 
## Step:  AIC=1733.17
## Direction ~ 1
## 
## Call:  glm(formula = Direction ~ 1, family = binomial, data = sm)
## 
## Coefficients:
## (Intercept)  
##     0.07363  
## 
## Degrees of Freedom: 1249 Total (i.e. Null);  1249 Residual
## Null Deviance:       1731 
## Residual Deviance: 1731  AIC: 1733

That can help with automatic variable selection. Notice that here the search drops every predictor and ends at the intercept-only model, Direction ~ 1, which is consistent with the weak coefficient tests above.
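
In practice you would usually suppress the trace and keep the result; a short sketch:

# Run the search silently and keep the winning model
m_step <- stepAIC(m1, trace = FALSE)
formula(m_step)  # Direction ~ 1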

Confusion Matrices

Because these models predict class membership, you can use confusion matrices to compare models with different sets of potential predictors.

# Add predictions
sm$PredRatio_m1 <- predict(m1, type = 'response')
sm$PredDirection_m1 <- factor(
  ifelse(sm$PredRatio_m1 > 0.5, 'Up', 'Down'))

sm$PredRatio_m2 <- predict(m2, type = 'response')
sm$PredDirection_m2 <- factor(
  ifelse(sm$PredRatio_m2 > 0.5, 'Up', 'Down'))
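
Accuracy is just the proportion of correct predictions, so a quick check before the full confusion matrices:

# Proportion of days classified correctly
mean(sm$PredDirection_m1 == sm$Direction)  # 0.5216
mean(sm$PredDirection_m2 == sm$Direction)  # 0.5280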

Model m2 actually has slightly higher accuracy than model m1 (0.528 vs 0.5216), but neither is very good.

# Compare accuracy and other metrics from conf matrix
confusionMatrix(table(
  sm[, c('PredDirection_m1', 'Direction')]))
## Confusion Matrix and Statistics
## 
##                 Direction
## PredDirection_m1 Down  Up
##             Down  145 141
##             Up    457 507
##                                           
##                Accuracy : 0.5216          
##                  95% CI : (0.4935, 0.5496)
##     No Information Rate : 0.5184          
##     P-Value [Acc > NIR] : 0.4216          
##                                           
##                   Kappa : 0.0237          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.2409          
##             Specificity : 0.7824          
##          Pos Pred Value : 0.5070          
##          Neg Pred Value : 0.5259          
##              Prevalence : 0.4816          
##          Detection Rate : 0.1160          
##    Detection Prevalence : 0.2288          
##       Balanced Accuracy : 0.5116          
##                                           
##        'Positive' Class : Down            
## 
confusionMatrix(table(
  sm[, c('PredDirection_m2', 'Direction')]))
## Confusion Matrix and Statistics
## 
##                 Direction
## PredDirection_m2 Down  Up
##             Down  114 102
##             Up    488 546
##                                          
##                Accuracy : 0.528          
##                  95% CI : (0.4999, 0.556)
##     No Information Rate : 0.5184         
##     P-Value [Acc > NIR] : 0.2576         
##                                          
##                   Kappa : 0.0327         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.1894         
##             Specificity : 0.8426         
##          Pos Pred Value : 0.5278         
##          Neg Pred Value : 0.5280         
##              Prevalence : 0.4816         
##          Detection Rate : 0.0912         
##    Detection Prevalence : 0.1728         
##       Balanced Accuracy : 0.5160         
##                                          
##        'Positive' Class : Down           
## 
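
The sensitivity and specificity reported above come straight from the table. A quick check for m1, where Down is the positive class:

# Sensitivity and specificity for m1 by hand
tab <- table(sm[, c('PredDirection_m1', 'Direction')])
tab['Down', 'Down'] / sum(tab[, 'Down'])  # sensitivity: 145/602 = 0.2409
tab['Up', 'Up'] / sum(tab[, 'Up'])        # specificity: 507/648 = 0.7824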

Receiver Operating Characteristic

ROC plots are helpful for selecting between models, and thereby between sets of predictors. Again, neither of these looks very good; in fact, m2 is even closer to a coin flip than m1.

# Compare ROC curves
roc_m1 <- roc(
  response = sm$Direction, predictor = sm$PredRatio_m1)
## Setting levels: control = Down, case = Up
## Setting direction: controls < cases
roc_m2 <- roc(
  response = sm$Direction, predictor = sm$PredRatio_m2)
## Setting levels: control = Down, case = Up
## Setting direction: controls < cases
par(mfrow = c(1, 2))
plot(roc_m1)
plot(roc_m2)
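
Side by side the curves are hard to tell apart; overlaying them can help (a sketch using plot.roc's add argument):

# Draw both curves on one set of axes
par(mfrow = c(1, 1))
plot(roc_m1)
plot(roc_m2, add = TRUE, col = 'red')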

AUC helps in most cases, and certainly in this one, where the differences between the ROC curves are hard to see.

# Compare AUC
auc(roc_m1)
## Area under the curve: 0.5387
auc(roc_m2)
## Area under the curve: 0.5314
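
pROC can also test whether two correlated AUCs differ significantly; a sketch using roc.test, which defaults to DeLong's test (output omitted):

# Formal comparison of the two ROC curves
roc.test(roc_m1, roc_m2)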