# All subset regression with leaps, bestglm, glmulti, and meifly

## Settings for RMarkdown http://yihui.name/knitr/options#chunk_options
opts_chunk$set(comment = "", warning = FALSE, message = FALSE, tidy = FALSE, echo = T, fig.width = 5, fig.height = 5) options(width = 100, scipen = 5, digits = 5) setwd("~/statistics/Rmedstats/")  ## Summary • For linear regression, use leaps, which allows use of adjusted $$R^2$$ and Mallow Cp. • For logistic regression, use glmulti. • For Cox regression, use glmulti. • This article about glmulti is a good summary of this topic (http://www.jstatsoft.org/v34/i12/paper). ## Load and prepare dataset http://www.umass.edu/statdata/statdata/data/lowbwt.txt SOURCE: Hosmer and Lemeshow (2000) Applied Logistic Regression: Second Edition. These data are copyrighted by John Wiley & Sons Inc. and must be acknowledged and used accordingly. Data were collected at Baystate Medical Center, Springfield, Massachusetts during 1986. library(gdata) lbw <- read.xls("http://www.umass.edu/statdata/statdata/data/lowbwt.xls") names(lbw) <- tolower(names(lbw)) ## Recoding lbw <- within(lbw, { ## race relabeling race.cat <- factor(race, levels = 1:3, labels = c("White","Black","Other")) ## ftv (frequency of visit) relabeling ftv.cat <- cut(ftv, breaks = c(-Inf, 0, 2, Inf), labels = c("None","Normal","Many")) ftv.cat <- relevel(ftv.cat, ref = "Normal") ## ptl preterm <- factor(ptl >= 1, levels = c(F,T), labels = c("0","1+")) })  ## leaps (regression subset selection) Regression subset selection including exhaustive search. This is only for linear regression. Perform all subset regression, and choose “nbest” model(s) for each number of predictors up to nvmax. The result shows how it was performed. library(leaps) regsubsets.out <- regsubsets(bwt ~ age + lwt + race.cat + smoke + preterm + ht + ui + ftv.cat, data = lbw, nbest = 1, # 1 best model for each number of predictors nvmax = NULL, # NULL for no limit on number of variables force.in = NULL, force.out = NULL, method = "exhaustive") regsubsets.out  Subset selection object Call: regsubsets.formula(bwt ~ age + lwt + race.cat + smoke + preterm + ht + ui + ftv.cat, data = lbw, nbest = 1, nvmax = NULL, force.in = NULL, force.out = NULL, method = "exhaustive") 10 Variables (and intercept) Forced in Forced out age FALSE FALSE lwt FALSE FALSE race.catBlack FALSE FALSE race.catOther FALSE FALSE smoke FALSE FALSE preterm1+ FALSE FALSE ht FALSE FALSE ui FALSE FALSE ftv.catNone FALSE FALSE ftv.catMany FALSE FALSE 1 subsets of each size up to 10 Selection Algorithm: exhaustive  Best model at each variable number The best model in the 10-variable case includes all variables, as that is the only way to have 10 variables. summary.out <- summary(regsubsets.out) as.data.frame(summary.out$outmat)

          age lwt race.catBlack race.catOther smoke preterm1+ ht ui ftv.catNone ftv.catMany
1  ( 1 )                                                          *
2  ( 1 )                                                       *  *
3  ( 1 )        *                                              *  *
4  ( 1 )                      *             *     *               *
5  ( 1 )                      *             *     *            *  *
6  ( 1 )        *             *             *     *            *  *
7  ( 1 )        *             *             *     *         *  *  *
8  ( 1 )        *             *             *     *         *  *  *                       *
9  ( 1 )        *             *             *     *         *  *  *           *           *
10  ( 1 )   *   *             *             *     *         *  *  *           *           *


Graphical table of best subsets (plot.regsubsets)

By adjusted $$R^2$$, the best model includes lwt, race.cat, preterm, ht, and ui (variables that have black boxes at the higest Y-axis value).

## Adjusted R2 Plot Output from regsubsets Function in leaps package

This is just another way of presenting the same information for adjusted $$R^2$$. The model with 7 variables (counting dummy variables seprately) has the highest adjusted $$R^2$$.

Mallow Cp is used to decide on the number of predictors to include. The stopping rule is to start with the smallest model and gradually increase number of variables, and stop when Mallow Cp is approximately (number of regressors + 1, broken line) for the first time. In this case, the model with 6 regressors is the first one to achieve such a condition.

library(car)
layout(matrix(1:2, ncol = 2))
res.legend <-
subsets(regsubsets.out, statistic="adjr2", legend = FALSE, min.size = 5, main = "Adjusted R^2")
## Mallow Cp
res.legend <-
subsets(regsubsets.out, statistic="cp", legend = FALSE, min.size = 5, main = "Mallow Cp")
abline(a = 1, b = 1, lty = 2) res.legend

              Abbreviation
age                      a
lwt                      l
race.catBlack          r.B
race.catOther          r.O
smoke                    s
preterm1+                p
ht                       h
ui                       u
ftv.catNone            f.N
ftv.catMany            f.M


See which model has the highest adjusted R2

The model with 7 variables (counting dummy variables separately) has the highest adjusted $$R^2$$. Variables marked with TRUE are the ones chosen.

which.max(summary.out$adjr2)   7  summary.out$which[7,]

  (Intercept)           age           lwt race.catBlack race.catOther         smoke     preterm1+
TRUE         FALSE          TRUE          TRUE          TRUE          TRUE          TRUE
ht            ui   ftv.catNone   ftv.catMany
TRUE          TRUE         FALSE         FALSE


Do regression with the best model

Somehow I had to recreate the best model from the output above.

best.model <- lm(bwt ~ lwt + race.cat + smoke + preterm + ht + ui, data = lbw)
summary(best.model)


Call:
lm(formula = bwt ~ lwt + race.cat + smoke + preterm + ht + ui,
data = lbw)

Residuals:
Min      1Q  Median      3Q     Max
-1886.4  -441.0    53.5   494.2  1620.9

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)    2871.99     243.67   11.79   <2e-16 ***
lwt               4.04       1.67    2.42   0.0167 *
race.catBlack  -466.32     145.13   -3.21   0.0016 **
race.catOther  -335.68     112.28   -2.99   0.0032 **
smoke          -323.57     104.96   -3.08   0.0024 **
preterm1+      -208.44     133.50   -1.56   0.1202
ht             -573.69     198.96   -2.88   0.0044 **
ui             -489.96     135.93   -3.60   0.0004 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 643 on 181 degrees of freedom
Multiple R-squared: 0.25,   Adjusted R-squared: 0.221
F-statistic: 8.64 on 7 and 181 DF,  p-value: 0.00000000396



## bestglm (Best subset GLM)

Best subset glm using AIC, BIC, EBIC, BICq or Cross-Validation. For the normal case, the 'leaps' is used. Otherwise, a slower exhaustive search. The 'xtable' package is needed for vignette 'SimExperimentBICq.Rnw' accompanying this package.

References:

library(bestglm)


Reformat data

The outcome variable must be named y, no extraneous variables should be present in the dataset.

lbw.for.bestglm <- within(lbw, {
id   <- NULL        # Delete
low  <- NULL
race <- NULL
ptl  <- NULL
ftv  <- NULL

y    <- bwt         # bwt into y
bwt  <- NULL        # Delete bwt
})

## Reorder variables
lbw.for.bestglm <-
lbw.for.bestglm[, c("age","lwt","race.cat","smoke","preterm","ht","ui","ftv.cat","y")]


Perform all-subset linear (gaussian) regression based on Akaike Information Criteria (AIC)

res.bestglm <-
bestglm(Xy = lbw.for.bestglm,
family = gaussian,
IC = "AIC",                 # Information criteria for
method = "exhaustive")

Morgan-Tatar search since factors present with more than 2 levels.


## Show top 5 models
res.bestglm$BestModels   age lwt race.cat smoke preterm ht ui ftv.cat Criterion 1 FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE 2450.2 2 FALSE TRUE TRUE TRUE FALSE TRUE TRUE FALSE 2450.7 3 TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE 2452.1 4 TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE 2452.5 5 FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE 2453.3  Show result for the best model The model identical to the one chosen by the adjusted $$R^2$$ was selected. summary(res.bestglm$BestModel)


Call:
lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE),
drop = FALSE], y = y))

Residuals:
Min      1Q  Median      3Q     Max
-1886.4  -441.0    53.5   494.2  1620.9

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)    2871.99     243.67   11.79   <2e-16 ***
lwt               4.04       1.67    2.42   0.0167 *
race.catBlack  -466.32     145.13   -3.21   0.0016 **
race.catOther  -335.68     112.28   -2.99   0.0032 **
smoke          -323.57     104.96   -3.08   0.0024 **
preterm1+      -208.44     133.50   -1.56   0.1202
ht             -573.69     198.96   -2.88   0.0044 **
ui             -489.96     135.93   -3.60   0.0004 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 643 on 181 degrees of freedom
Multiple R-squared: 0.25,   Adjusted R-squared: 0.221
F-statistic: 8.64 on 7 and 181 DF,  p-value: 0.00000000396



Logistic regression

Do the same, but as a logistic regression model. The resulting model was identical to the best linear model in this case.

## Prepare data
lbw.for.best.logistic <- within(lbw, {
id   <- NULL        # Delete
bwt  <- NULL
race <- NULL
ptl  <- NULL
ftv  <- NULL

y    <- low         # bwt into y
low  <- NULL        # Delete bwt
})

## Reorder variables
lbw.for.best.logistic <-
lbw.for.best.logistic[, c("age","lwt","race.cat","smoke","preterm","ht","ui","ftv.cat","y")]

## Perform
res.best.logistic <-
bestglm(Xy = lbw.for.best.logistic,
family = binomial,          # binomial family for logistic
IC = "AIC",                 # Information criteria for
method = "exhaustive")

Morgan-Tatar search since family is non-gaussian.
Note: factors present with more than 2 levels.


## Show top 5 models
res.best.logistic$BestModels   age lwt race.cat smoke preterm ht ui ftv.cat Criterion 1 FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE 211.85 2 FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE 212.48 3 TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE 212.83 4 TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE 213.15 5 FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE 214.37  ## Show result for the best model: Same model was chosen summary(res.best.logistic$BestModel)


Call:
glm(formula = y ~ ., family = family, data = Xi, weights = weights)

Deviance Residuals:
Min      1Q  Median      3Q     Max
-1.731  -0.784  -0.514   0.954   2.198

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.12533    0.96756   -0.13   0.8969
lwt           -0.01592    0.00695   -2.29   0.0221 *
race.catBlack  1.30086    0.52848    2.46   0.0138 *
race.catOther  0.85441    0.44091    1.94   0.0526 .
smoke          0.86658    0.40447    2.14   0.0322 *
preterm1+      1.12886    0.45039    2.51   0.0122 *
ht             1.86690    0.70737    2.64   0.0083 **
ui             0.75065    0.45882    1.64   0.1018
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 197.85  on 181  degrees of freedom
AIC: 213.9

Number of Fisher Scoring iterations: 4



## glmulti (Model selection and multimodel inference made easy)

Automated model selection and model-averaging. Provides a wrapper for glm and other functions, automatically generating all possible models (under constraints set by the user) with the specified response and explanatory variables, and finding the best models in terms of some Information Criterion (AIC, AICc or BIC). Can handle very large numbers of candidate models. Features a Genetic Algorithm to find the best models when an exhaustive screening of the candidates is not feasible.

References

library(glmulti)


All-subset linear regression using lm() based on AIC

glmulti.lm.out <-
glmulti(bwt ~ age + lwt + race.cat + smoke + preterm + ht + ui + ftv.cat, data = lbw,
level = 1,               # No interaction considered
method = "h",            # Exhaustive approach
crit = "aic",            # AIC as criteria
confsetsize = 5,         # Keep 5 best models
plotty = F, report = F,  # No plot or interim reports
fitfunction = "lm")      # lm function

## Show 5 best models (Use @ instead of $for an S4 object) glmulti.lm.out@formulas  [] bwt ~ 1 + race.cat + preterm + lwt + smoke + ht + ui <environment: 0x11a320bb8> [] bwt ~ 1 + race.cat + lwt + smoke + ht + ui <environment: 0x11a320bb8> [] bwt ~ 1 + race.cat + preterm + age + lwt + smoke + ht + ui <environment: 0x11a320bb8> [] bwt ~ 1 + race.cat + age + lwt + smoke + ht + ui <environment: 0x11a320bb8> [] bwt ~ 1 + race.cat + preterm + ftv.cat + lwt + smoke + ht + ui <environment: 0x11a320bb8>  ## Show result for the best model summary(glmulti.lm.out@objects[])   Call: fitfunc(formula = as.formula(x), data = data) Residuals: Min 1Q Median 3Q Max -1886.4 -441.0 53.5 494.2 1620.9 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 2871.99 243.67 11.79 <2e-16 *** race.catBlack -466.32 145.13 -3.21 0.0016 ** race.catOther -335.68 112.28 -2.99 0.0032 ** preterm1+ -208.44 133.50 -1.56 0.1202 lwt 4.04 1.67 2.42 0.0167 * smoke -323.57 104.96 -3.08 0.0024 ** ht -573.69 198.96 -2.88 0.0044 ** ui -489.96 135.93 -3.60 0.0004 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 643 on 181 degrees of freedom Multiple R-squared: 0.25, Adjusted R-squared: 0.221 F-statistic: 8.64 on 7 and 181 DF, p-value: 0.00000000396  All-subset logistic regression using glm() based on AIC glmulti.logistic.out <- glmulti(low ~ age + lwt + race.cat + smoke + preterm + ht + ui + ftv.cat, data = lbw, level = 1, # No interaction considered method = "h", # Exhaustive approach crit = "aic", # AIC as criteria confsetsize = 5, # Keep 5 best models plotty = F, report = F, # No plot or interim reports fitfunction = "glm", # glm function family = binomial) # binomial family for logistic regression ## Show 5 best models (Use @ instead of$ for an S4 object)
glmulti.logistic.out@formulas

[]
low ~ 1 + race.cat + preterm + lwt + smoke + ht + ui
<environment: 0x11a2588e8>

[]
low ~ 1 + race.cat + preterm + lwt + smoke + ht
<environment: 0x11a2588e8>

[]
low ~ 1 + race.cat + preterm + age + lwt + smoke + ht + ui
<environment: 0x11a2588e8>

[]
low ~ 1 + race.cat + preterm + age + lwt + smoke + ht
<environment: 0x11a2588e8>

[]
low ~ 1 + race.cat + preterm + ftv.cat + lwt + smoke + ht + ui
<environment: 0x11a2588e8>


## Show result for the best model
summary(glmulti.logistic.out@objects[])


Call:
fitfunc(formula = as.formula(x), family = ..1, data = data)

Deviance Residuals:
Min      1Q  Median      3Q     Max
-1.731  -0.784  -0.514   0.954   2.198

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.12533    0.96756   -0.13   0.8969
race.catBlack  1.30086    0.52848    2.46   0.0138 *
race.catOther  0.85441    0.44091    1.94   0.0526 .
preterm1+      1.12886    0.45039    2.51   0.0122 *
lwt           -0.01592    0.00695   -2.29   0.0221 *
smoke          0.86658    0.40447    2.14   0.0322 *
ht             1.86690    0.70737    2.64   0.0083 **
ui             0.75065    0.45882    1.64   0.1018
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 197.85  on 181  degrees of freedom
AIC: 213.9

Number of Fisher Scoring iterations: 4



Load pbc survival data in survival package

library(survival)
pbc <- within(pbc, {
status.dichotomous <- status > 1
survival.vector    <- Surv(time, status.dichotomous)
})


All-subset Cox regression using coxph() based on AIC

glmulti.coxph.out <-
glmulti(survival.vector ~ trt + age + sex + ascites + hepato + spiders, data = pbc,
level = 1,               # No interaction considered
method = "h",            # Exhaustive approach
crit = "aic",            # AIC as criteria
confsetsize = 5,         # Keep 5 best models
plotty = F, report = F,  # No plot or interim reports
fitfunction = "coxph")   # coxph function

## Show 5 best models (Use @ instead of $for an S4 object) glmulti.coxph.out@formulas  [] survival.vector ~ 1 + age + ascites + hepato + spiders <environment: 0x104baef88> [] survival.vector ~ 1 + trt + age + ascites + hepato + spiders <environment: 0x104baef88> [] survival.vector ~ 1 + sex + age + ascites + hepato + spiders <environment: 0x104baef88> [] survival.vector ~ 1 + sex + trt + age + ascites + hepato + spiders <environment: 0x104baef88> [] survival.vector ~ 1 + sex + ascites + hepato + spiders <environment: 0x104baef88>  ## Show result for the best model summary(glmulti.coxph.out@objects[])  Call: fitfunc(formula = as.formula(x), data = data) n= 312, number of events= 125 (106 observations deleted due to missingness) coef exp(coef) se(coef) z Pr(>|z|) age 0.02620 1.02655 0.00864 3.03 0.00242 ** ascites 1.49687 4.46770 0.26109 5.73 0.0000000099 *** hepato 0.84318 2.32375 0.21198 3.98 0.0000695829 *** spiders 0.66601 1.94646 0.19530 3.41 0.00065 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 exp(coef) exp(-coef) lower .95 upper .95 age 1.03 0.974 1.01 1.04 ascites 4.47 0.224 2.68 7.45 hepato 2.32 0.430 1.53 3.52 spiders 1.95 0.514 1.33 2.85 Concordance= 0.771 (se = 0.029 ) Rsquare= 0.27 (max possible= 0.983 ) Likelihood ratio test= 98.3 on 4 df, p=0 Wald test = 121 on 4 df, p=0 Score (logrank) test = 156 on 4 df, p=0  ## meifly (Interactive model exploration using GGobi) Exploratory model analysis. Fit and graphical explore ensembles of linear models. This function just conduct all-subset regression, thus it can handle coxph without problems, but users will have to do model comparison using the result object. Interaction terms cannot be handled, thus inclusion of interaction terms needs creation of product term beforehand. References: Load meifly package library(meifly)  Fit all subsets (main effects only) For x, give a data.frame without the outcome variable. fitall.out <- fitall(y = pbc$survival.vector,
x = pbc[,c("trt", "age", "sex", "ascites","hepato","spiders")],
method="coxph")


Show the result

As expected, it is the same as the model chosen here ( http://rpubs.com/kaz_yos/exhaustive ).

## Extract AIC from each model
fitall.out.aic <- t(sapply(fitall.out, extractAIC))

## Create an order list of increasing AIC
final.out.order <- order(fitall.out.aic[,2])

## Show the result for the best model
fitall.out[final.out.order]

\$58
Call:
coxph(formula = y ~ age + ascites + hepato + spiders, data = data,
model = FALSE)

coef exp(coef) se(coef)    z            p
age     0.0262      1.03  0.00864 3.03 0.0024000000
ascites 1.4969      4.47  0.26109 5.73 0.0000000099
hepato  0.8432      2.32  0.21198 3.98 0.0000700000
spiders 0.6660      1.95  0.19530 3.41 0.0006500000

Likelihood ratio test=98.3  on 4 df, p=0  n= 312, number of events= 125
(106 observations deleted due to missingness)

attr(,"class")
 "ensemble"