Variable selection chooses a subset of variables by balancing model fit against model complexity. A common criterion is the Akaike Information Criterion (AIC): \[\text{AIC} = n \log(\text{RSS}/n) + 2k,\] where RSS is the residual sum of squares, \(n\) is the number of data points, and \(k\) is the number of estimated parameters. Smaller AIC values are preferred and indicate a better trade-off between fit and complexity. With the AIC as the criterion, we can use forward selection, backward elimination, or stepwise selection to choose the subset of variables.
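As a quick illustrative check, the formula can be computed by hand for any fitted linear model and compared against R's extractAIC, which the stepAIC function used below relies on; here I use the built-in cars data.
# Hand-computing AIC = n*log(RSS/n) + 2k for a fitted lm. This matches
# extractAIC(fit)[2], the criterion stepAIC optimizes (constants dropped).
fit <- lm(dist ~ speed, data = cars)
n   <- nobs(fit)
rss <- sum(residuals(fit)^2)
k   <- length(coef(fit))   # estimated coefficients, including the intercept
n * log(rss / n) + 2 * k
extractAIC(fit)            # returns c(edf, AIC); the AIC value agrees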
Forward selection: Start with a model consisting of only the intercept and add the single variable most highly correlated with the response, i.e., the one giving the smallest RSS. At each subsequent step, add the variable with the largest partial correlation with the response given the variables already in the model. Stop when adding another variable would increase the AIC.
Backward elimination: Start with all variables in the model. At each step, remove the variable whose deletion gives the best model, where best is defined by the smallest RSS. Stop when the next deletion would increase the AIC.
Stepwise selection: This is a combination of forward selection and backward elimination. We start either with all the variables or with only the intercept. As we add variables, we immediately eliminate any variable already in the model that no longer reduces the AIC.
Stepwise selection is greedy: at each step it does the one thing that looks best at the time, without considering future options.
To demonstrate stepwise regression, I use US crime data from 1960.
crime <- suppressMessages(
readr::read_delim("http://www.statsci.org/data/general/uscrime.txt", delim="\t"))
head(crime)
## # A tibble: 6 × 16
## M So Ed Po1 Po2 LF M.F Pop NW U1 U2 Wealth Ineq
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 15.1 1 9.1 5.8 5.6 0.51 95 33 30.1 0.108 4.1 3940 26.1
## 2 14.3 0 11.3 10.3 9.5 0.583 101. 13 10.2 0.096 3.6 5570 19.4
## 3 14.2 1 8.9 4.5 4.4 0.533 96.9 18 21.9 0.094 3.3 3180 25
## 4 13.6 0 12.1 14.9 14.1 0.577 99.4 157 8 0.102 3.9 6730 16.7
## 5 14.1 0 12.1 10.9 10.1 0.591 98.5 18 3 0.091 2 5780 17.4
## 6 12.1 0 11 11.8 11.5 0.547 96.4 25 4.4 0.084 2.9 6890 12.6
## # ℹ 3 more variables: Prob <dbl>, Time <dbl>, Crime <dbl>
We treat all variables as numeric and scale the data using R’s scale function, with the exception of So (a dichotomous variable) and Crime (the response).
crime[, c(1, 3:15)] <- lapply(crime[, c(1,3:15)], function(x) round(scale(x), 2))
head(crime)
## # A tibble: 6 × 16
## M[,1] So Ed[,1] Po1[,1] Po2[,1] LF[,1] M.F[,1] Pop[,1] NW[,1] U1[,1] U2[,1]
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.99 1 -1.31 -0.91 -0.87 -1.27 -1.12 -0.1 1.94 0.7 0.83
## 2 0.35 0 0.66 0.61 0.53 0.54 0.98 -0.62 0.01 0.03 0.24
## 3 0.27 1 -1.49 -1.35 -1.3 -0.7 -0.48 -0.49 1.15 -0.08 -0.12
## 4 -0.2 0 1.37 2.15 2.17 0.39 0.37 3.16 -0.21 0.36 0.59
## 5 0.19 0 1.37 0.81 0.74 0.74 0.07 -0.49 -0.69 -0.25 -1.66
## 6 -1.4 0 0.39 1.11 1.24 -0.35 -0.65 -0.31 -0.56 -0.64 -0.59
## # ℹ 5 more variables: Wealth <dbl[,1]>, Ineq <dbl[,1]>, Prob <dbl[,1]>,
## # Time <dbl[,1]>, Crime <dbl>
Let’s start with the intercept-only model and the full model with all variables. Their respective AIC and \(R^2\) values are shown.
# Model with all variables
mod_all <- lm(Crime ~ ., data=crime)
AIC(mod_all)
## [1] 650.1969
summary(mod_all)$r.squared
## [1] 0.8023824
# Intercept only model
mod0 <- lm(Crime ~ 1, data=crime)
AIC(mod0)
## [1] 696.4037
summary(mod0)$r.squared
## [1] 0
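Before running the automated search, we can score the first forward step by hand with add1, which evaluates every single-variable addition to the intercept-only model on the same AIC scale that stepAIC uses. Consistent with the stepwise path shown further below, Po1 gives the lowest AIC.
# Evaluate each single-variable addition to the intercept-only model;
# the variable with the smallest AIC (Po1) is the first one selected.
add1(mod0, scope = formula(mod_all))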
We use the stepAIC function from the MASS library, which performs variable selection based on the AIC. Let’s start with forward selection by setting the direction = "forward" argument. I pass the intercept-only model (mod0) to the object argument and the formula of the full model (mod_all) to the scope argument.
library(MASS)
fmod <- stepAIC(object = mod0, direction = "forward",
scope = formula(mod_all), trace=FALSE)
coef(fmod)
## (Intercept) Po1 Ineq Ed M Prob
## 905.06369 341.66723 269.46799 219.39474 132.39880 -86.33974
## U2
## 75.78490
AIC(fmod)
## [1] 640.2218
summary(fmod)$r.squared
## [1] 0.7655889
Using forward selection, 6 variables are selected with an \(R^2=0.77\). Compared with the full model, the AIC is indeed lower, but the full model has a higher \(R^2\). We can see the change in the AIC as each variable is added.
fmod$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## Crime ~ 1
##
## Final Model:
## Crime ~ Po1 + Ineq + Ed + M + Prob + U2
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 46 6880928 561.0235
## 2 + Po1 1 3251554.7 45 3629373 532.9579
## 3 + Ineq 1 738010.4 44 2891363 524.2733
## 4 + Ed 1 586033.7 43 2305329 515.6276
## 5 + M 1 240654.7 42 2064674 512.4458
## 6 + Prob 1 257982.3 41 1806692 508.1724
## 7 + U2 1 193726.5 40 1612966 504.8416
The output shows that the AIC gets smaller with each added variable, and the procedure stops after adding U2 (step 7), when no remaining variable would lower the AIC any further.
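Note that the AIC values in this table are on a different scale from those returned by AIC() above (504.84 for the final model here versus 640.22 earlier). This is because stepAIC ranks models with extractAIC, which drops additive constants from the log-likelihood; the offset is the same for every model fit to the same data, so the selection itself is unaffected. A quick check:
# stepAIC ranks models with extractAIC, which omits additive constants;
# the gap to AIC() is constant across models fit to the same data.
extractAIC(fmod)  # edf and AIC on the scale of the table above (504.84)
AIC(fmod)         # the full-constant AIC reported earlier (640.22)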
Let’s try the backward elimination approach. This time we reverse the object and scope arguments so that the search starts with all the variables and removes them one at a time, down to the intercept-only model if needed.
bmod <- stepAIC(mod_all, direction = "backward",
scope = formula(mod0), trace=FALSE)
coef(bmod)
## (Intercept) M Ed Po1 M.F U1
## 905.14017 117.41855 201.06383 305.09942 65.90717 -109.64063
## U2 Ineq Prob
## 158.30781 244.67357 -86.20085
AIC(bmod)
## [1] 639.378
summary(bmod)$r.squared
## [1] 0.7885439
Backward elimination selects 8 variables with an \(R^2=0.79\). The change in AIC from each removal is shown below:
bmod$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## Crime ~ M + So + Ed + Po1 + Po2 + LF + M.F + Pop + NW + U1 +
## U2 + Wealth + Ineq + Prob + Time
##
## Final Model:
## Crime ~ M + Ed + Po1 + M.F + U1 + U2 + Ineq + Prob
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 31 1359793 514.8167
## 2 - So 1 1.300167 32 1359794 512.8167
## 3 - Time 1 9665.503811 33 1369459 511.1496
## 4 - LF 1 9782.997381 34 1379242 509.4842
## 5 - NW 1 11688.074179 35 1390930 507.8808
## 6 - Po2 1 16178.099053 36 1407108 506.4243
## 7 - Pop 1 21603.437077 37 1428712 505.1404
## 8 - Wealth 1 26302.520114 38 1455014 503.9978
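As a check on the stopping rule, drop1 applied to the final backward model confirms that deleting any remaining term would raise the AIC, which is exactly why the search stopped here.
# From the selected model, every further single-term deletion increases
# the AIC (on the extractAIC scale used by stepAIC).
drop1(bmod)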
Stepwise selection, combining forward and backward moves, is performed by setting the direction = "both" argument. The result can depend on the model we start from.
# Stepwise selection starting from the full model
smod <- stepAIC(mod_all, direction = "both", trace=FALSE)
coef(smod)
## (Intercept) M Ed Po1 M.F U1
## 905.14017 117.41855 201.06383 305.09942 65.90717 -109.64063
## U2 Ineq Prob
## 158.30781 244.67357 -86.20085
AIC(smod)
## [1] 639.378
summary(smod)$r.squared
## [1] 0.7885439
# Stepwise selection starting from the backward elimination model
smod3 <- stepAIC(bmod, direction = "both", trace=FALSE)
coef(smod3)
## (Intercept) M Ed Po1 M.F U1
## 905.14017 117.41855 201.06383 305.09942 65.90717 -109.64063
## U2 Ineq Prob
## 158.30781 244.67357 -86.20085
AIC(smod3)
## [1] 639.378
summary(smod3)$r.squared
## [1] 0.7885439
# Stepwise selection starting from the forward selection model
smod4 <- stepAIC(fmod, direction = "both", trace=FALSE)
coef(smod4)
## (Intercept) Po1 Ineq Ed M Prob
## 905.06369 341.66723 269.46799 219.39474 132.39880 -86.33974
## U2
## 75.78490
AIC(smod4)
## [1] 640.2218
summary(smod4)$r.squared
## [1] 0.7655889
Starting from the full model or from the backward elimination model, stepwise selection settles on the same 8-variable model; starting from the forward selection model, it keeps the 6-variable model. Note that if we start from the intercept-only model without supplying a scope, stepAIC has no candidate variables to add, so it returns the intercept-only model unchanged:
# Intercept-only start with no scope: nothing can be added
smod2 <- stepAIC(mod0, direction = "both", trace=FALSE)
coef(smod2)
## (Intercept)
##    905.0851
AIC(smod2)
## [1] 696.4037
summary(smod2)$r.squared
## [1] 0
The analysis thus far has used the complete dataset. To reduce the risk of selecting variables that fit noise in the training data rather than real effects, we now use a cross-validation approach, via the caret library and its functions.
Following the previous analysis, we select the backward elimination method, and use repeated k-fold cross-validation with \(k=10\) folds and 3 repeats.
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
set.seed(100599)
set_train <- trainControl(method="repeatedcv", number=10, repeats=3)
cvmod <- train(Crime ~ ., data=crime, scope = formula(mod0),
method="lmStepAIC", direction="backward", trace=FALSE, trControl=set_train)
coef(cvmod$finalModel)
## (Intercept) M Ed Po1 M.F U1
## 905.14017 117.41855 201.06383 305.09942 65.90717 -109.64063
## U2 Ineq Prob
## 158.30781 244.67357 -86.20085
AIC(cvmod$finalModel)
## [1] 639.378
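The cross-validated performance metrics (RMSE, \(R^2\), MAE), averaged over the \(10 \times 3 = 30\) resamples, are stored in the train object; exact values depend on the seed and fold assignment.
# Resampled performance summary for the final model
cvmod$results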
Results show that the final model from the cross-validation approach includes the same variables as the first backward elimination model. Overall, stepwise regression is useful for preliminary analyses and for getting a better sense of which variables may or may not be important in your model. However, stepwise regression can select a final set of variables that reflects noise rather than real effects, and such a model may not perform as well when tested on other data.