Variable Selection - Stepwise Regression

Variable selection chooses a subset of predictors by trading off model fit against model complexity. A common criterion is the Akaike Information Criterion (AIC): \[\text{AIC} = n \log(\text{RSS}/n) + 2k,\] where RSS is the residual sum of squares, \(n\) is the number of data points, and \(k\) is the number of estimated parameters (this is the form used for linear models, up to an additive constant). Smaller AIC values are preferred: they indicate a better balance between fit and complexity, not simply a better fit. With the AIC as the criterion, we can use forward selection, backward elimination, or stepwise selection to choose the subset of variables.
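To make the formula concrete, here is a minimal sketch (using the built-in mtcars data purely for illustration) that computes the AIC by hand and matches it against extractAIC, the function stepAIC uses internally:

# Minimal sketch: compute the AIC formula above by hand for a toy lm fit
fit <- lm(mpg ~ wt + hp, data = mtcars)
n   <- nrow(mtcars)
rss <- sum(residuals(fit)^2)
k   <- length(coef(fit))       # estimated parameters, including the intercept
n * log(rss / n) + 2 * k       # matches extractAIC(fit)[2]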

Stepwise selection is a greedy procedure: at each step it makes whatever single change looks best at the time, without considering future options.

To demonstrate stepwise regression, I use US crime data from 1960.

# Read the tab-delimited 1960 US crime dataset
crime <- suppressMessages(
  readr::read_delim("http://www.statsci.org/data/general/uscrime.txt", delim="\t"))
head(crime)
## # A tibble: 6 × 16
##       M    So    Ed   Po1   Po2    LF   M.F   Pop    NW    U1    U2 Wealth  Ineq
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1  15.1     1   9.1   5.8   5.6 0.51   95      33  30.1 0.108   4.1   3940  26.1
## 2  14.3     0  11.3  10.3   9.5 0.583 101.     13  10.2 0.096   3.6   5570  19.4
## 3  14.2     1   8.9   4.5   4.4 0.533  96.9    18  21.9 0.094   3.3   3180  25  
## 4  13.6     0  12.1  14.9  14.1 0.577  99.4   157   8   0.102   3.9   6730  16.7
## 5  14.1     0  12.1  10.9  10.1 0.591  98.5    18   3   0.091   2     5780  17.4
## 6  12.1     0  11    11.8  11.5 0.547  96.4    25   4.4 0.084   2.9   6890  12.6
## # ℹ 3 more variables: Prob <dbl>, Time <dbl>, Crime <dbl>

We treat all variables as numeric and scale them using R’s scale function, with the exception of So (a dichotomous variable) and Crime (the response).

# Scale all predictors except So (column 2) and Crime (column 16)
crime[, c(1, 3:15)] <- lapply(crime[, c(1,3:15)], function(x) round(scale(x), 2))
head(crime)
## # A tibble: 6 × 16
##   M[,1]    So Ed[,1] Po1[,1] Po2[,1] LF[,1] M.F[,1] Pop[,1] NW[,1] U1[,1] U2[,1]
##   <dbl> <dbl>  <dbl>   <dbl>   <dbl>  <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
## 1  0.99     1  -1.31   -0.91   -0.87  -1.27   -1.12   -0.1    1.94   0.7    0.83
## 2  0.35     0   0.66    0.61    0.53   0.54    0.98   -0.62   0.01   0.03   0.24
## 3  0.27     1  -1.49   -1.35   -1.3   -0.7    -0.48   -0.49   1.15  -0.08  -0.12
## 4 -0.2      0   1.37    2.15    2.17   0.39    0.37    3.16  -0.21   0.36   0.59
## 5  0.19     0   1.37    0.81    0.74   0.74    0.07   -0.49  -0.69  -0.25  -1.66
## 6 -1.4      0   0.39    1.11    1.24  -0.35   -0.65   -0.31  -0.56  -0.64  -0.59
## # ℹ 5 more variables: Wealth <dbl[,1]>, Ineq <dbl[,1]>, Prob <dbl[,1]>,
## #   Time <dbl[,1]>, Crime <dbl>
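The [,1] column headers above appear because scale returns a one-column matrix. If you prefer plain numeric columns, an optional variation of the same scaling call drops that structure:

# Optional variation: as.numeric() drops the one-column-matrix structure
# that scale() returns, so columns print as plain <dbl>
crime[, c(1, 3:15)] <- lapply(crime[, c(1, 3:15)],
  function(x) round(as.numeric(scale(x)), 2))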

Let’s start with the intercept-only model and the full model with all variables. The respective AIC and \(R^2\) values are shown.

# Model with all variables
mod_all <- lm(Crime ~ ., data=crime)
AIC(mod_all)
## [1] 650.1969
summary(mod_all)$r.squared
## [1] 0.8023824
# Intercept only model
mod0 <- lm(Crime ~ 1, data=crime)
AIC(mod0)
## [1] 696.4037
summary(mod0)$r.squared
## [1] 0

We use the stepAIC function from the MASS library, which performs variable selection based on the AIC. Let’s start with forward selection, by setting the direction = "forward" argument. I pass the intercept-only model (mod0) to the object argument and the full model’s formula (mod_all) to the scope argument.

library(MASS)
fmod <- stepAIC(object = mod0, direction = "forward", 
  scope = formula(mod_all), trace=FALSE)
coef(fmod)
## (Intercept)         Po1        Ineq          Ed           M        Prob 
##   905.06369   341.66723   269.46799   219.39474   132.39880   -86.33974 
##          U2 
##    75.78490
AIC(fmod)
## [1] 640.2218
summary(fmod)$r.squared
## [1] 0.7655889

Using forward selection, six variables are selected with an \(R^2=0.77\). Compared with the full model, the AIC is indeed lower, but the full model has a higher \(R^2\) (as expected, since \(R^2\) never decreases when variables are added). We can see the change in the AIC for each variable added.

fmod$anova
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## Crime ~ 1
## 
## Final Model:
## Crime ~ Po1 + Ineq + Ed + M + Prob + U2
## 
## 
##     Step Df  Deviance Resid. Df Resid. Dev      AIC
## 1                            46    6880928 561.0235
## 2  + Po1  1 3251554.7        45    3629373 532.9579
## 3 + Ineq  1  738010.4        44    2891363 524.2733
## 4   + Ed  1  586033.7        43    2305329 515.6276
## 5    + M  1  240654.7        42    2064674 512.4458
## 6 + Prob  1  257982.3        41    1806692 508.1724
## 7   + U2  1  193726.5        40    1612966 504.8416

The output shows that the AIC gets smaller with each variable added, until the procedure stops after adding U2 (step 7). Note that the AIC column here comes from extractAIC, which omits an additive constant, so its values differ from the AIC() values reported earlier; the same holds for the tables below.
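A quick sketch verifying that the two versions differ only by a constant, \(n(1 + \log 2\pi) + 2\), which is the same for every model and therefore does not affect the ranking:

# AIC() and extractAIC() differ by a model-independent constant
AIC(fmod) - extractAIC(fmod)[2]      # 640.2218 - 504.8416 = 135.38
nrow(crime) * (1 + log(2 * pi)) + 2  # the same constant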

Let’s try a backward elimination approach this time. We reverse the arguments to object and scope, so that the search starts with all the variables and can eliminate them one at a time, down to the intercept-only model if needed.

bmod <- stepAIC(mod_all, direction = "backward", 
  scope = formula(mod0), trace=FALSE)
coef(bmod)
## (Intercept)           M          Ed         Po1         M.F          U1 
##   905.14017   117.41855   201.06383   305.09942    65.90717  -109.64063 
##          U2        Ineq        Prob 
##   158.30781   244.67357   -86.20085
AIC(bmod)
## [1] 639.378
summary(bmod)$r.squared
## [1] 0.7885439

Backward elimination selects eight variables with an \(R^2=0.79\). The change in AIC for each variable removed is shown below:

bmod$anova
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## Crime ~ M + So + Ed + Po1 + Po2 + LF + M.F + Pop + NW + U1 + 
##     U2 + Wealth + Ineq + Prob + Time
## 
## Final Model:
## Crime ~ M + Ed + Po1 + M.F + U1 + U2 + Ineq + Prob
## 
## 
##       Step Df     Deviance Resid. Df Resid. Dev      AIC
## 1                                 31    1359793 514.8167
## 2     - So  1     1.300167        32    1359794 512.8167
## 3   - Time  1  9665.503811        33    1369459 511.1496
## 4     - LF  1  9782.997381        34    1379242 509.4842
## 5     - NW  1 11688.074179        35    1390930 507.8808
## 6    - Po2  1 16178.099053        36    1407108 506.4243
## 7    - Pop  1 21603.437077        37    1428712 505.1404
## 8 - Wealth  1 26302.520114        38    1455014 503.9978

Forward and backward steps can be combined by setting the direction = "both" argument (stepwise selection proper). Below I run it from three different starting models:

# Stepwise search starting from the full model
smod <- stepAIC(mod_all, direction = "both", trace=FALSE)
coef(smod)
## (Intercept)           M          Ed         Po1         M.F          U1 
##   905.14017   117.41855   201.06383   305.09942    65.90717  -109.64063 
##          U2        Ineq        Prob 
##   158.30781   244.67357   -86.20085
AIC(smod)
## [1] 639.378
summary(smod)$r.squared
## [1] 0.7885439
# Stepwise search starting from the backward solution
smod3 <- stepAIC(bmod, direction = "both", trace=FALSE)
coef(smod3)
## (Intercept)           M          Ed         Po1         M.F          U1 
##   905.14017   117.41855   201.06383   305.09942    65.90717  -109.64063 
##          U2        Ineq        Prob 
##   158.30781   244.67357   -86.20085
AIC(smod3)
## [1] 639.378
summary(smod3)$r.squared
## [1] 0.7885439
# Stepwise search starting from the forward solution
smod4 <- stepAIC(fmod, direction = "both", trace=FALSE)
coef(smod4)
## (Intercept)         Po1        Ineq          Ed           M        Prob 
##   905.06369   341.66723   269.46799   219.39474   132.39880   -86.33974 
##          U2 
##    75.78490
AIC(smod4)
## [1] 640.2218
summary(smod4)$r.squared
## [1] 0.7655889

Notice that stepwise selection is path dependent: starting from the full model (smod) or from the backward solution (smod3) reproduces the backward result, while starting from the forward solution (smod4) keeps the forward result. One more caveat: if we start from the intercept-only model without supplying a scope, stepAIC has nothing it is allowed to add, so the search never leaves the intercept.

# No scope supplied: stepAIC has nothing to add, so the model
# stays at the intercept
smod2 <- stepAIC(mod0, direction = "both", trace=FALSE)
coef(smod2)
## (Intercept) 
##    905.0851
AIC(smod2)
## [1] 696.4037
summary(smod2)$r.squared
## [1] 0
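If we do want stepwise selection to grow from the intercept-only model, the upper scope must be supplied explicitly. A minimal sketch (output omitted; with this scope the search can add as well as drop variables):

# With an explicit upper scope, "both" can add variables even when
# starting from the intercept-only model
smod5 <- stepAIC(mod0, direction = "both",
  scope = formula(mod_all), trace = FALSE)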

The analysis thus far has used the complete dataset. To avoid fitting the model to noise in the training data, we turn to cross-validation, using the caret library and its functions. Following the previous analysis, we select the backward elimination method and choose \(k=10\) folds for the repeated k-fold cross-validation approach.

library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
set.seed(100599)
set_train <- trainControl(method="repeatedcv", number=10, repeats=3)
cvmod <- train(Crime ~ ., data=crime,  scope = formula(mod0),
  method="lmStepAIC", direction="backward", trace=FALSE, trControl=set_train)

coef(cvmod$finalModel)
## (Intercept)           M          Ed         Po1         M.F          U1 
##   905.14017   117.41855   201.06383   305.09942    65.90717  -109.64063 
##          U2        Ineq        Prob 
##   158.30781   244.67357   -86.20085
AIC(cvmod$finalModel)
## [1] 639.378
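Beyond the refitted final model, the train object also stores the resampled performance estimates, which give a more honest picture of out-of-sample fit than in-sample \(R^2\). A quick sketch of how to inspect them (output omitted):

# Resampled RMSE, R-squared, and MAE, averaged over the 10 folds x 3 repeats
cvmod$results
getTrainPerf(cvmod)  # one-row summary of the same resampled metrics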

Results show that the final model from the cross-validation approach includes the same variables as the first backward elimination model. Overall, stepwise regression is good for preliminary analyses and for getting a better sense of which variables may or may not matter in your model. However, because it searches greedily over many candidate models, stepwise regression can settle on a set of variables that fits noise rather than real effects, and the selected model may not perform as well on new data.