Variable Selection in Multiple Regression

Variable selection in regression is arguably the hardest part of model building. Between the data stored by companies and that available from external providers, modelers now have access to hundreds or even thousands of candidate factors, so many that it is often impractical to consider all of them in a formal predictive modeling exercise. The purpose of variable selection in regression is to identify the "best" subset of predictors, among many candidates, to include in the model. But why bother?

  1. We want to explain the data in the simplest way - redundant predictors should be removed. The principle of Occam’s razor states that among several plausible explanations for a phenomenon, the simplest is best. Applied to regression analysis, this implies that the smallest model that fits the data is best.
  2. Unnecessary predictors will add noise to the estimation of other quantities that we are interested in. Degrees of freedom will be wasted.
  3. Collinearity is caused by having too many variables trying to do the same job.
  4. Cost: if the model is for prediction, we can save time and/or money by not measuring redundant predictors.

There are many different strategies for selecting variables for a regression model. If there are no more than fifteen candidate variables, the All Possible Regressions procedure should be used, since it will always give models at least as good as those from the stepping procedures. When there are more than fifteen candidate variables, the other three procedures discussed here can be of use.

Variable Selection Procedures

There are various procedures for variable selection; in this topic we will cover the four listed below and also give an overview of the statistics/criteria used for variable selection.

  1. Forward (Step-Up) Selection
  2. Backward (Step-down) Selection / Backward Elimination
  3. Stepwise selection (a combination of Forward & backward selection)
  4. All Possible Regression

Forward (Step-Up) Selection

The forward selection method is simple to define. You begin with no candidate variables in the model. At each step, every variable that is not already in the model is tested for inclusion, and the most significant of these is added, so long as its p-value is below some pre-set level (e.g., 0.05 or 0.01) or it sufficiently improves a criterion such as R2 or adjusted R2 (other selection criteria can also be used). Stop adding variables when none of the remaining variables adds any significant value according to the selection criterion.
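
As a minimal sketch, forward selection can be run in R with the built-in step() function, starting from an intercept-only model. The data frame mydata and response y below are hypothetical, and note that step() adds terms based on AIC rather than individual p-values.

```r
# Forward selection sketch (hypothetical data frame `mydata` with response `y`).
# step() adds, at each step, the term that improves AIC the most.
null_model <- lm(y ~ 1, data = mydata)   # start with no predictors
full_model <- lm(y ~ ., data = mydata)   # the largest model we are willing to consider
forward_fit <- step(null_model,
                    scope = formula(full_model),
                    direction = "forward")
summary(forward_fit)
```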

Backward (Step-down) Selection / Backward Elimination

The backward elimination method starts with all candidate variables in the model. At each step, the least significant variable is removed; this continues until only significant variables remain. The user sets the significance level at which variables can be removed from the model. Because it works its way down instead of up, this procedure tends to retain a large value of R-squared; the drawback is that the selected models may include variables that are not actually necessary.
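
A minimal backward-elimination sketch, again using R's step() (AIC-based) on the hypothetical mydata; drop1() with test = "F" can be used instead if you prefer to judge terms by F-test p-values.

```r
# Backward elimination sketch: start from the full model and drop terms.
full_model <- lm(y ~ ., data = mydata)              # hypothetical data frame `mydata`
backward_fit <- step(full_model, direction = "backward")
summary(backward_fit)

# Alternative: inspect the F-test p-value for dropping each term one at a time.
drop1(full_model, test = "F")
```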

Stepwise selection (a combination of Forward & backward selection)

Stepwise regression is a semi-automated process of building a model by successively adding or removing variables based solely on the t-statistics of their estimated coefficients. It is a combination of the forward and backward selection techniques: it modifies forward selection so that after each step in which a variable is added, all variables currently in the model are checked to see whether their significance has dropped below the specified tolerance level. If a non-significant variable is found, it is removed from the model. Stepwise regression therefore requires two significance levels: one for adding variables and one for removing them. The cutoff probability for adding variables should be less than the cutoff probability for removing variables so that the procedure does not get into an infinite loop.
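
A sketch of stepwise (both-direction) selection with step(), assuming the same hypothetical mydata; terms can be added or removed at every step, between the intercept-only lower bound and the full-model upper bound.

```r
# Stepwise selection sketch: step() with direction = "both".
null_model <- lm(y ~ 1, data = mydata)
full_model <- lm(y ~ ., data = mydata)
stepwise_fit <- step(null_model,
                     scope = list(lower = formula(null_model),
                                  upper = formula(full_model)),
                     direction = "both")
summary(stepwise_fit)
```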

All Possible Regression

All possible regressions goes beyond stepwise regression and literally tests all possible subsets of the set of potential independent variables. If there are K potential independent variables (besides the constant), then there are 2^K distinct subsets of them to be tested (including the empty set, which corresponds to the mean model). For example, with 10 candidate independent variables the number of subsets to be tested is 2^10 = 1024, and with 20 candidate variables it is 2^20, which is more than one million.
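
A sketch of all-possible-subsets selection with regsubsets() from the leaps package, again on the hypothetical mydata; nvmax caps the largest subset size examined.

```r
# All-possible-subsets sketch using the leaps package.
library(leaps)
all_subsets <- regsubsets(y ~ ., data = mydata, nvmax = 10, method = "exhaustive")
subset_summary <- summary(all_subsets)

# Best model of each size, compared on adjusted R-squared, Mallows' Cp, and BIC.
data.frame(size  = seq_along(subset_summary$adjr2),
           adjr2 = subset_summary$adjr2,
           cp    = subset_summary$cp,
           bic   = subset_summary$bic)
```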

Statistics/criteria for variable selection

Many statistics can be used for variable selection, and different statistics/criteria may lead to very different choices of variables. Here we will discuss a few that are widely used.

T-test for a single predictor at a time

The t-test is often used as a way to screen predictors. The general rule is that if a predictor's coefficient is significant, the predictor can be included in a regression model.
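
For illustration, the per-coefficient t-tests are part of R's standard summary() output; x1, x2, x3 below are hypothetical predictors in mydata.

```r
# t-statistics and p-values for each coefficient.
fit <- lm(y ~ x1 + x2 + x3, data = mydata)
summary(fit)$coefficients   # columns: Estimate, Std. Error, t value, Pr(>|t|)
```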

F-test for the whole model or for comparing two nested models

The F-test can be used to test the joint significance of one or more predictors. For example, if the overall F-test for a subset of predictors in a model is not significant, one might simply remove them from the regression model.
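
A sketch of the partial F-test in R using anova() to compare two nested models; the reduced and full models below are hypothetical.

```r
# Partial F-test: does adding x2 and x3 significantly improve on the model with x1 alone?
reduced_fit <- lm(y ~ x1, data = mydata)
full_fit    <- lm(y ~ x1 + x2 + x3, data = mydata)
anova(reduced_fit, full_fit)   # F-test for the x2, x3 block
```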

R2 and Adjusted R2

R2 can be used to measure the practical importance of a predictor. If a predictor contributes significantly to the overall R2 or adjusted R2, it should be considered for inclusion in the model.
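
As a small illustration on the hypothetical mydata, the adjusted R2 of two candidate models can be read off summary() and compared directly.

```r
# Compare adjusted R-squared before and after adding a predictor.
fit_small <- lm(y ~ x1, data = mydata)
fit_large <- lm(y ~ x1 + x2, data = mydata)
summary(fit_small)$adj.r.squared
summary(fit_large)$adj.r.squared   # keep x2 only if this is meaningfully higher
```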

Mallows’ Cp

Mallows' Cp is widely used in variable selection. It compares a model with p predictors against the model with all k predictors (k > p) using the statistic

Cp = SSE_p / MSE_k - n + 2(p + 1)

where SSE_p is the error sum of squares of the p-predictor model, MSE_k is the mean squared error of the full k-predictor model, and n is the number of observations.
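
A sketch of computing Cp directly from this formula in R, with a hypothetical full model (k = 4 predictors) and a hypothetical candidate subset (p = 2):

```r
# Mallows' Cp computed from the formula above.
full_fit   <- lm(y ~ x1 + x2 + x3 + x4, data = mydata)   # all k predictors
subset_fit <- lm(y ~ x1 + x2, data = mydata)             # candidate with p predictors
n     <- nrow(mydata)
p     <- length(coef(subset_fit)) - 1                    # predictors, excluding the intercept
sse_p <- sum(resid(subset_fit)^2)
mse_k <- sum(resid(full_fit)^2) / full_fit$df.residual   # MSE of the full model
cp    <- sse_p / mse_k - n + 2 * (p + 1)
cp   # values close to p + 1 suggest the subset model has little bias
```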

Information criteria

Information criteria such as AIC (Akaike information criterion) and BIC (Bayesian information criterion) are often used in variable selection. For a linear model with p predictors, they are defined (up to an additive constant) as

AIC = n ln(SSE/n) + 2p
BIC = n ln(SSE/n) + p ln(n)

Both criteria balance model fit against model complexity: among candidate models, the one with the smallest AIC or BIC is preferred.
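
In R, AIC() and BIC() are available for fitted lm objects; they are computed from the full Gaussian log-likelihood, so their values differ from the SSE-based formulas above by a constant, but they rank models the same way. The two candidate models below are hypothetical.

```r
# Compare candidate models on AIC and BIC (lower is better for both).
fit_a <- lm(y ~ x1 + x2, data = mydata)
fit_b <- lm(y ~ x1 + x2 + x3, data = mydata)
AIC(fit_a, fit_b)
BIC(fit_a, fit_b)
```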

There are other procedures that can be used for variable selection, and similarly other statistics/criteria can be used to choose among models. Manually, we can fit each candidate model one by one using lm() and compare the fits. To run the procedures automatically, we can use the leaps package in R (as in the all-possible-subsets sketch above) or the built-in step() function (as in the earlier sketches).

Which one to use

Forward or backward

If you have a very large set of candidate predictors from which you wish to extract a few (i.e., you are on a fishing expedition), you should generally go forward. If, on the other hand, you have a modest-sized set of potential variables from which you wish to eliminate a few (i.e., you are fine-tuning a prior selection of variables), you should generally go backward. Even on a fishing expedition, be careful not to cast too wide a net, or you may select variables that are only accidentally related to your dependent variable.

Stepwise regression

Stepwise regression can yield R-squared values that are badly biased upward. It can also yield confidence intervals for effects and predicted values that are falsely narrow, and regression coefficients that are biased and in need of shrinkage (i.e., the coefficients of the remaining variables are too large). It also has severe problems in the presence of collinearity, and increasing the sample size does not help very much.

Stepwise or all-possible-subsets

Stepwise regression often works reasonably well as an automatic variable selection method, but this is not guaranteed. If the number of candidate predictors is large relative to the number of observations in your data set (say, more than one variable for every ten observations), or if there is excessive multicollinearity (predictors are highly correlated), then the stepwise algorithms can become unstable and end up throwing nearly all the variables into the model, especially if you use a low threshold on a criterion such as the F statistic.
All-possible-subsets regression goes further and literally tests every subset of the potential independent variables, but it carries all the same caveats as stepwise regression.

Good reads & references:
https://advstats.psychstat.org/book/mregression/selection.php
https://stats.stackexchange.com/questions/21265/choosing-variables-to-include-in-a-multiple-linear-regression-model