Variable Selection - Stepwise Regression

Variable selection chooses a subset of predictors by trading off model fit against model complexity. A common criterion is the Akaike Information Criterion (AIC): \[\text{AIC} = n \log(\text{RSS}/n) + 2k,\] where RSS is the residual sum of squares, \(n\) is the number of data points, and \(k\) is the number of estimated parameters (this is the form used for linear models, up to an additive constant). Smaller AIC values are preferred: they indicate a better balance between fit and complexity, not simply a better fit. With the AIC as the criterion, we can use forward selection, backward elimination, or stepwise selection to choose the subset of variables.
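To make the formula concrete, here is a minimal sketch (using the built-in mtcars data purely for illustration) that computes the AIC by hand and matches it against extractAIC, the function stepAIC uses internally:

# Minimal sketch: compute the AIC formula above by hand for a toy lm fit
fit <- lm(mpg ~ wt + hp, data = mtcars)
n   <- nrow(mtcars)
rss <- sum(residuals(fit)^2)
k   <- length(coef(fit))       # estimated parameters, including the intercept
n * log(rss / n) + 2 * k       # matches extractAIC(fit)[2]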

Stepwise selection is a greedy procedure: at each step it makes whatever single change looks best at the time, without considering future options.

To demonstrate stepwise regression, I use US crime data from 1960.

# Read the tab-delimited 1960 US crime dataset
crime <- suppressMessages(
  readr::read_delim("http://www.statsci.org/data/general/uscrime.txt", delim="\t"))
head(crime)
## # A tibble: 6 × 16
##       M    So    Ed   Po1   Po2    LF   M.F   Pop    NW    U1    U2 Wealth  Ineq
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1  15.1     1   9.1   5.8   5.6 0.51   95      33  30.1 0.108   4.1   3940  26.1
## 2  14.3     0  11.3  10.3   9.5 0.583 101.     13  10.2 0.096   3.6   5570  19.4
## 3  14.2     1   8.9   4.5   4.4 0.533  96.9    18  21.9 0.094   3.3   3180  25  
## 4  13.6     0  12.1  14.9  14.1 0.577  99.4   157   8   0.102   3.9   6730  16.7
## 5  14.1     0  12.1  10.9  10.1 0.591  98.5    18   3   0.091   2     5780  17.4
## 6  12.1     0  11    11.8  11.5 0.547  96.4    25   4.4 0.084   2.9   6890  12.6
## # ℹ 3 more variables: Prob <dbl>, Time <dbl>, Crime <dbl>

We treat all variables as numeric and scale them using R’s scale function, with the exception of So (a dichotomous variable) and Crime (the response).

# Scale all predictors except So (column 2) and Crime (column 16)
crime[, c(1, 3:15)] <- lapply(crime[, c(1,3:15)], function(x) round(scale(x), 2))
head(crime)
## # A tibble: 6 × 16
##   M[,1]    So Ed[,1] Po1[,1] Po2[,1] LF[,1] M.F[,1] Pop[,1] NW[,1] U1[,1] U2[,1]
##   <dbl> <dbl>  <dbl>   <dbl>   <dbl>  <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
## 1  0.99     1  -1.31   -0.91   -0.87  -1.27   -1.12   -0.1    1.94   0.7    0.83
## 2  0.35     0   0.66    0.61    0.53   0.54    0.98   -0.62   0.01   0.03   0.24
## 3  0.27     1  -1.49   -1.35   -1.3   -0.7    -0.48   -0.49   1.15  -0.08  -0.12
## 4 -0.2      0   1.37    2.15    2.17   0.39    0.37    3.16  -0.21   0.36   0.59
## 5  0.19     0   1.37    0.81    0.74   0.74    0.07   -0.49  -0.69  -0.25  -1.66
## 6 -1.4      0   0.39    1.11    1.24  -0.35   -0.65   -0.31  -0.56  -0.64  -0.59
## # ℹ 5 more variables: Wealth <dbl[,1]>, Ineq <dbl[,1]>, Prob <dbl[,1]>,
## #   Time <dbl[,1]>, Crime <dbl>
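The [,1] column headers above appear because scale returns a one-column matrix. If you prefer plain numeric columns, an optional variation of the same scaling call drops that structure:

# Optional variation: as.numeric() drops the one-column-matrix structure
# that scale() returns, so columns print as plain <dbl>
crime[, c(1, 3:15)] <- lapply(crime[, c(1, 3:15)],
  function(x) round(as.numeric(scale(x)), 2))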

Let’s start with the intercept-only model and the full model with all variables. The respective AIC and \(R^2\) values are shown.

# Model with all variables
mod_all <- lm(Crime ~ ., data=crime)
AIC(mod_all)
## [1] 650.1969
summary(mod_all)$r.squared
## [1] 0.8023824
# Intercept only model
mod0 <- lm(Crime ~ 1, data=crime)
AIC(mod0)
## [1] 696.4037
summary(mod0)$r.squared
## [1] 0

We use the stepAIC function from the MASS library, which performs variable selection based on the AIC. Let’s start with forward selection, by setting the direction = "forward" argument. I pass the intercept-only model (mod0) to the object argument and the full model’s formula (mod_all) to the scope argument.

library(MASS)
fmod <- stepAIC(object = mod0, direction = "forward", 
  scope = formula(mod_all), trace=FALSE)
coef(fmod)
## (Intercept)         Po1        Ineq          Ed           M        Prob 
##   905.06369   341.66723   269.46799   219.39474   132.39880   -86.33974 
##          U2 
##    75.78490
AIC(fmod)
## [1] 640.2218
summary(fmod)$r.squared
## [1] 0.7655889

Using forward selection, six variables are selected with an \(R^2=0.77\). Compared with the full model, the AIC is indeed lower, but the full model has a higher \(R^2\) (as expected, since \(R^2\) never decreases when variables are added). We can see the change in the AIC for each variable added.

fmod$anova
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## Crime ~ 1
## 
## Final Model:
## Crime ~ Po1 + Ineq + Ed + M + Prob + U2
## 
## 
##     Step Df  Deviance Resid. Df Resid. Dev      AIC
## 1                            46    6880928 561.0235
## 2  + Po1  1 3251554.7        45    3629373 532.9579
## 3 + Ineq  1  738010.4        44    2891363 524.2733
## 4   + Ed  1  586033.7        43    2305329 515.6276
## 5    + M  1  240654.7        42    2064674 512.4458
## 6 + Prob  1  257982.3        41    1806692 508.1724
## 7   + U2  1  193726.5        40    1612966 504.8416

The output shows that the AIC gets smaller with each variable added, until the procedure stops after adding U2 (step 7). Note that the AIC column here comes from extractAIC, which omits an additive constant, so its values differ from the AIC() values reported earlier; the same holds for the tables below.
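A quick sketch verifying that the two versions differ only by a constant, \(n(1 + \log 2\pi) + 2\), which is the same for every model and therefore does not affect the ranking:

# AIC() and extractAIC() differ by a model-independent constant
AIC(fmod) - extractAIC(fmod)[2]      # 640.2218 - 504.8416 = 135.38
nrow(crime) * (1 + log(2 * pi)) + 2  # the same constant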

Let’s try a backward elimination approach this time. We reverse the arguments to object and scope, so that the search starts with all the variables and can eliminate them one at a time, down to the intercept-only model if needed.

bmod <- stepAIC(mod_all, direction = "backward", 
  scope = formula(mod0), trace=FALSE)
coef(bmod)
## (Intercept)           M          Ed         Po1         M.F          U1 
##   905.14017   117.41855   201.06383   305.09942    65.90717  -109.64063 
##          U2        Ineq        Prob 
##   158.30781   244.67357   -86.20085
AIC(bmod)
## [1] 639.378
summary(bmod)$r.squared
## [1] 0.7885439

Backward elimination selects eight variables with an \(R^2=0.79\). The change in AIC for each variable removed is shown below:

bmod$anova
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## Crime ~ M + So + Ed + Po1 + Po2 + LF + M.F + Pop + NW + U1 + 
##     U2 + Wealth + Ineq + Prob + Time
## 
## Final Model:
## Crime ~ M + Ed + Po1 + M.F + U1 + U2 + Ineq + Prob
## 
## 
##       Step Df     Deviance Resid. Df Resid. Dev      AIC
## 1                                 31    1359793 514.8167
## 2     - So  1     1.300167        32    1359794 512.8167
## 3   - Time  1  9665.503811        33    1369459 511.1496
## 4     - LF  1  9782.997381        34    1379242 509.4842
## 5     - NW  1 11688.074179        35    1390930 507.8808
## 6    - Po2  1 16178.099053        36    1407108 506.4243
## 7    - Pop  1 21603.437077        37    1428712 505.1404
## 8 - Wealth  1 26302.520114        38    1455014 503.9978

Forward and backward steps can be combined by setting the direction = "both" argument (stepwise selection proper). Below I run it from three different starting models:

# Stepwise search starting from the full model
smod <- stepAIC(mod_all, direction = "both", trace=FALSE)
coef(smod)
## (Intercept)           M          Ed         Po1         M.F          U1 
##   905.14017   117.41855   201.06383   305.09942    65.90717  -109.64063 
##          U2        Ineq        Prob 
##   158.30781   244.67357   -86.20085
AIC(smod)
## [1] 639.378
summary(smod)$r.squared
## [1] 0.7885439
# Stepwise search starting from the backward solution
smod3 <- stepAIC(bmod, direction = "both", trace=FALSE)
coef(smod3)
## (Intercept)           M          Ed         Po1         M.F          U1 
##   905.14017   117.41855   201.06383   305.09942    65.90717  -109.64063 
##          U2        Ineq        Prob 
##   158.30781   244.67357   -86.20085
AIC(smod3)
## [1] 639.378
summary(smod3)$r.squared
## [1] 0.7885439
# Stepwise search starting from the forward solution
smod4 <- stepAIC(fmod, direction = "both", trace=FALSE)
coef(smod4)
## (Intercept)         Po1        Ineq          Ed           M        Prob 
##   905.06369   341.66723   269.46799   219.39474   132.39880   -86.33974 
##          U2 
##    75.78490
AIC(smod4)
## [1] 640.2218
summary(smod4)$r.squared
## [1] 0.7655889

Notice that stepwise selection is path dependent: starting from the full model (smod) or from the backward solution (smod3) reproduces the backward result, while starting from the forward solution (smod4) keeps the forward result. One more caveat: if we start from the intercept-only model without supplying a scope, stepAIC has nothing it is allowed to add, so the search never leaves the intercept.

# No scope supplied: stepAIC has nothing to add, so the model
# stays at the intercept
smod2 <- stepAIC(mod0, direction = "both", trace=FALSE)
coef(smod2)
## (Intercept) 
##    905.0851
AIC(smod2)
## [1] 696.4037
summary(smod2)$r.squared
## [1] 0
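If we do want stepwise selection to grow from the intercept-only model, the upper scope must be supplied explicitly. A minimal sketch (output omitted; with this scope the search can add as well as drop variables):

# With an explicit upper scope, "both" can add variables even when
# starting from the intercept-only model
smod5 <- stepAIC(mod0, direction = "both",
  scope = formula(mod_all), trace = FALSE)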

The analysis thus far has used the complete dataset. To avoid fitting the model to noise in the training data, we turn to cross-validation, using the caret library and its functions. Following the previous analysis, we select the backward elimination method and choose \(k=10\) folds for the repeated k-fold cross-validation approach.

library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
set.seed(100599)
set_train <- trainControl(method="repeatedcv", number=10, repeats=3)
cvmod <- train(Crime ~ ., data=crime,  scope = formula(mod0),
  method="lmStepAIC", direction="backward", trace=FALSE, trControl=set_train)

coef(cvmod$finalModel)
## (Intercept)           M          Ed         Po1         M.F          U1 
##   905.14017   117.41855   201.06383   305.09942    65.90717  -109.64063 
##          U2        Ineq        Prob 
##   158.30781   244.67357   -86.20085
AIC(cvmod$finalModel)
## [1] 639.378
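Beyond the refitted final model, the train object also stores the resampled performance estimates, which give a more honest picture of out-of-sample fit than in-sample \(R^2\). A quick sketch of how to inspect them (output omitted):

# Resampled RMSE, R-squared, and MAE, averaged over the 10 folds x 3 repeats
cvmod$results
getTrainPerf(cvmod)  # one-row summary of the same resampled metrics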

Results show that the final model from the cross-validation approach includes the same variables as the first backward elimination model. Overall, stepwise regression is good for preliminary analyses and for getting a better sense of which variables may or may not matter in your model. However, because it searches greedily over many candidate models, stepwise regression can settle on a set of variables that fits noise rather than real effects, and the selected model may not perform as well on new data.