regsubsets() function in the R package leaps.

For more details on using R function regsubsets() see https://cran.r-project.org/web/packages/leaps/index.html.

The basic idea of the all possible subsets approach is to run every possible combination of the predictors to find the best subset to meet some pre-defined objective criteria such as \({C}_{p}\) and adjusted R^2. It is hoped that that one ends up with a reasonable and useful regression model. Manually, we can fit each possible model one by one using lm() and compare the model fits. To automatically run the procedure, we can use the regsubsets() function in the R package leaps.

For Example:

Using the birth weight data, we can run the analysis as shown below. In the function regsubsets(),

library(MASS); data(birthwt)
head(birthwt)
##    low age lwt race smoke ptl ht ui ftv  bwt
## 85   0  19 182    2     0   0  0  1   0 2523
## 86   0  33 155    3     0   0  0  0   3 2551
## 87   0  20 105    1     1   0  0  0   1 2557
## 88   0  21 108    1     1   0  0  1   2 2594
## 89   0  18 107    1     1   0  0  1   0 2600
## 91   0  21 124    3     0   0  0  0   0 2622
  • The regular formula can be used to specify the model with all the predictors to be studied. In this example, it is bwt~lwt+race+smoke+ptl+ht+ui+ftv. One can also provides the outcome variable as a vector and the predictors in a matrix.

  • data tells the data set to be used.

  • nbest is the number of the best subsets of each size to save. If nbest=1, only the best model will be saved for each number of predictors. If nbest=2, the best two models with be saved given the number of predictors.

  • nvmax is the maximum size of subsets of predictors to examine. It specifies the maximum number of predictors you want to include in the final regression model. For example, if you have 7 predictors but set nvmax=5, then the most complex model to be evaluated will have only 5 predictors. Using this option will largely reduce computing time if a large number of predictors are evaluated.

if (!require('leaps')) install.packages('leaps')
## Loading required package: leaps
all<-regsubsets(bwt~lwt+race+smoke+ptl+ht+ui+ftv, data=birthwt,nbest=1, nvmax=7)

all

Subset selection object Call: regsubsets.formula(bwt ~ lwt + race + smoke + ptl + ht + ui + ftv, data = birthwt, nbest = 1, nvmax = 7) 7 Variables (and intercept) Forced in Forced out lwt FALSE FALSE race FALSE FALSE smoke FALSE FALSE ptl FALSE FALSE ht FALSE FALSE ui FALSE FALSE ftv FALSE FALSE 1 subsets of each size up to 7 Selection Algorithm: exhaustive

The immediate output of the function regsubsets() does not provide much information. To extract more useful information, the function summary() can be applied. This will include the following objects that can be printed.

  • which: A logical matrix indicating which predictors are in each model. 1 indicates a variable is included and 0 not.

  • rsq: The r-squared for each model (higher, better)

  • adjr2: Adjusted r-squared (higher, better)

  • cp: Mallows’ Cp (smaller, better)

  • bic: Schwartz’s Bayesian information criterion, BIC (lower, better)

  • rss: Residual sum of squares for each model (lower, better)

info <- summary(all)
cbind(info$which, round(cbind(rsq=info$rsq, adjr2=info$adjr2, cp=info$cp, bic=info$bic, rss=info$rss), 3))
##   (Intercept) lwt race smoke ptl ht ui ftv   rsq adjr2     cp     bic
## 1           1   0    0     0   0  0  1   0 0.081 0.076 29.155  -5.402
## 2           1   0    1     0   0  0  1   0 0.113 0.103 23.629  -6.922
## 3           1   0    1     1   0  0  1   0 0.175 0.162 11.058 -15.501
## 4           1   0    1     1   0  1  1   0 0.203 0.186  6.674 -16.648
## 5           1   1    1     1   0  1  1   0 0.221 0.200  4.371 -15.838
## 6           1   1    1     1   1  1  1   0 0.222 0.197  6.117 -10.861
## 7           1   1    1     1   1  1  1   1 0.223 0.193  8.000  -5.742
##        rss
## 1 91910625
## 2 88680498
## 3 82427257
## 4 79687294
## 5 77840556
## 6 77731620
## 7 77681278

We can also plot the different statistics to visually inspect the best models. Mallow’s Cp plot is one popular plot to use. In such a plot, Mallows’ Cp is plotted along the number of predictors. As mentioned early, for a good model, \({C}_{p}\) ≈ p . Therefore, the models are on or below the line of x=y can be considered as acceptable models. In this example, both the model with 5 predictors and the one with 6 predictors are good models.

plot(2:8, info$cp, xlab='P (# of predictors + 1)', ylab='Cp')
abline(a=0,b=1)

Summary

Using the all possible subsets method, one would select a model with a larger adjusted R-square, smaller Cp, smaller rsq, and smaller BIC.