all_subset
All subset regression with leaps, bestglm, glmulti, and meifly
Summary
For linear regression, use leaps, which allows use of adjusted \(R^2\) and Mallow Cp. For logistic regression, use glmulti. For Cox regression, use glmulti. This article about glmulti is a good summary of this topic (http://www.jstatsoft.org/v34/i12/paper).
Load and Prepare Dataset
library(readxl)
library(gdata)
lbw <- read_xls("C:/Users/Asus/OneDrive - Texas State University/Documents/Hackathon_Tusti/R Scripts/ALL Subsets/lowbwt.xls")
# Make the data interactive and accessible
library(DT)
datatable(
lbw, extensions = c('Select', 'Buttons'), options = list(
select = list(style = 'os', items = 'row'),
dom = 'Blfrtip',
rowId = 0,
buttons = c('csv', 'excel')
),
selection = 'none'
)names(lbw) <- tolower(names(lbw))
## Recoding
lbw <- within(lbw, {
## race relabeling
race.cat <- factor(race, levels = 1:3, labels = c("White","Black","Other"))
## ftv (frequency of visit) relabeling
ftv.cat <- cut(ftv, breaks = c(-Inf, 0, 2, Inf), labels = c("None","Normal","Many"))
ftv.cat <- relevel(ftv.cat, ref = "Normal")
## ptl
preterm <- factor(ptl >= 1, levels = c(F,T), labels = c("0","1+"))
})leaps (regression subset selection)
Regression subset selection including exhaustive search. This is only for linear regression.
Reference: http://www.statmethods.net/stats/regression.html
Perform all subset regression, and choose “nbest” model(s) for each number of predictors up to nvmax.
The result shows how it was performed.
library(leaps)
regsubsets.out <-
regsubsets(bwt ~ age + lwt + race.cat + smoke + preterm + ht + ui + ftv.cat,
data = lbw,
nbest = 1, # 1 best model for each number of predictors
nvmax = NULL, # NULL for no limit on number of variables
force.in = NULL, force.out = NULL,
method = "exhaustive")
regsubsets.outSubset selection object
Call: regsubsets.formula(bwt ~ age + lwt + race.cat + smoke + preterm +
ht + ui + ftv.cat, data = lbw, nbest = 1, nvmax = NULL, force.in = NULL,
force.out = NULL, method = "exhaustive")
10 Variables (and intercept)
Forced in Forced out
age FALSE FALSE
lwt FALSE FALSE
race.catBlack FALSE FALSE
race.catOther FALSE FALSE
smoke FALSE FALSE
preterm1+ FALSE FALSE
ht FALSE FALSE
ui FALSE FALSE
ftv.catNone FALSE FALSE
ftv.catMany FALSE FALSE
1 subsets of each size up to 10
Selection Algorithm: exhaustive
Best model at each variable number
The best model in the 10-variable case includes all variables, as that is the only way to have 10 variables.
age lwt race.catBlack race.catOther smoke preterm1+ ht ui ftv.catNone ftv.catMany
1 ( 1 ) *
2 ( 1 ) * *
3 ( 1 ) * * *
4 ( 1 ) * * * *
5 ( 1 ) * * * * *
6 ( 1 ) * * * * * *
7 ( 1 ) * * * * * * *
8 ( 1 ) * * * * * * * *
9 ( 1 ) * * * * * * * * *
10 ( 1 ) * * * * * * * * * *
Graphical table of best subsets (plot.regsubsets)
By adjusted \(R^2\), the best model includes lwt, race.cat, preterm, ht, and ui (variables that have black boxes at the higest Y-axis value).
Plot Output from regsubsets Function in leaps package
This is just another way of presenting the same information for adjusted \(R^2\). The model with 7 variables (counting dummy variables seprately) has the highest adjusted \(R^2\).
Mallow Cp is used to decide on the number of predictors to include. The stopping rule is to start with the smallest model and gradually increase number of variables, and stop when Mallow Cp is approximately (number of regressors + 1, broken line) for the first time. In this case, the model with 6 regressors is the first one to achieve such a condition.
library(car)
layout(matrix(1:2, ncol = 2))
# Define a custom xlim range based on your data to add more space
x_range <- c(4.5, 10) # Adjust the range as needed for your data
## Adjusted R2
res.legend <-
subsets(regsubsets.out, statistic="adjr2", legend = FALSE, min.size = 5, main = "Adjusted R^2", xlim = x_range, cex = 0.4)
## Mallow Cp
res.legend <-
subsets(regsubsets.out, statistic="cp", legend = FALSE, min.size = 5, main = "Mallow Cp", xlim = x_range, cex = 0.4)
abline(a = 1, b = 1, lty = 2) Abbreviation
age a
lwt l
race.catBlack r.B
race.catOther r.O
smoke s
preterm1+ p
ht h
ui u
ftv.catNone f.N
ftv.catMany f.M
See which model has the highest adjusted R2
The model with 7 variables (counting dummy variables separately) has the highest adjusted \(R^2\). Variables marked with TRUE are the ones chosen.
[1] 7
(Intercept) age lwt race.catBlack race.catOther smoke preterm1+
TRUE FALSE TRUE TRUE TRUE TRUE TRUE
ht ui ftv.catNone ftv.catMany
TRUE TRUE FALSE FALSE
Do regression with the best model
Call:
lm(formula = bwt ~ lwt + race.cat + smoke + preterm + ht + ui,
data = lbw)
Residuals:
Min 1Q Median 3Q Max
-1886.4 -441.0 53.5 494.2 1620.9
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2871.99 243.67 11.79 <2e-16 ***
lwt 4.04 1.67 2.42 0.0167 *
race.catBlack -466.32 145.13 -3.21 0.0016 **
race.catOther -335.68 112.28 -2.99 0.0032 **
smoke -323.57 104.96 -3.08 0.0024 **
preterm1+ -208.44 133.50 -1.56 0.1202
ht -573.69 198.96 -2.88 0.0044 **
ui -489.96 135.93 -3.60 0.0004 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 643 on 181 degrees of freedom
Multiple R-squared: 0.25, Adjusted R-squared: 0.221
F-statistic: 8.64 on 7 and 181 DF, p-value: 0.00000000396
bestglm (Best subset GLM)
Best subset glm using AIC, BIC, EBIC, BICq or Cross-Validation. For the normal case, the ‘leaps’ is used. Otherwise, a slower exhaustive search. The ‘xtable’ package is needed for vignette ‘SimExperimentBICq.Rnw’ accompanying this package.
References: 1) http://cran.r-project.org/web/packages/bestglm/ ; 2)http://cran.r-project.org/web/packages/bestglm/vignettes/bestglm.pdf
Reformat Data
The outcome variable must be named y, no extraneous variables should be present in the dataset.
Perform all-subset linear (gaussian) regression based on Akaike Information Criteria (AIC)
# Explicitly create a new data frame without using within()
lbw.for.bestglm <- data.frame(
age = lbw$age,
lwt = lbw$lwt,
race.cat = lbw$race.cat,
smoke = lbw$smoke,
preterm = lbw$preterm,
ht = lbw$ht,
ui = lbw$ui,
ftv.cat = lbw$ftv.cat,
y = as.numeric(lbw$bwt) # Convert bwt to numeric and assign to y
)
# Confirm that all variables are in the correct type, especially y
str(lbw.for.bestglm)'data.frame': 189 obs. of 9 variables:
$ age : num 19 33 20 21 18 21 22 17 29 26 ...
$ lwt : num 182 155 105 108 107 124 118 103 123 113 ...
$ race.cat: Factor w/ 3 levels "White","Black",..: 2 3 1 1 1 3 1 3 1 1 ...
$ smoke : num 0 0 1 1 1 0 0 0 1 1 ...
$ preterm : Factor w/ 2 levels "0","1+": 1 1 1 1 1 1 1 1 1 1 ...
$ ht : num 0 0 0 0 0 0 0 0 0 0 ...
$ ui : num 1 0 0 1 1 0 0 0 0 0 ...
$ ftv.cat : Factor w/ 3 levels "Normal","None",..: 2 3 1 1 2 2 1 1 1 2 ...
$ y : num 2523 2551 2557 2594 2600 ...
Show top 5 models
age lwt race.cat smoke preterm ht ui ftv.cat Criterion
1 FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE 2450.2
2 FALSE TRUE TRUE TRUE FALSE TRUE TRUE FALSE 2450.7
3 TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE 2452.1
4 TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE 2452.5
5 FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE 2453.3
Show result for the best model
The model identical to the one chosen by the adjusted \(R^2\) was selected.
Call:
lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE),
drop = FALSE], y = y))
Residuals:
Min 1Q Median 3Q Max
-1886.4 -441.0 53.5 494.2 1620.9
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2871.99 243.67 11.79 <2e-16 ***
lwt 4.04 1.67 2.42 0.0167 *
race.catBlack -466.32 145.13 -3.21 0.0016 **
race.catOther -335.68 112.28 -2.99 0.0032 **
smoke -323.57 104.96 -3.08 0.0024 **
preterm1+ -208.44 133.50 -1.56 0.1202
ht -573.69 198.96 -2.88 0.0044 **
ui -489.96 135.93 -3.60 0.0004 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 643 on 181 degrees of freedom
Multiple R-squared: 0.25, Adjusted R-squared: 0.221
F-statistic: 8.64 on 7 and 181 DF, p-value: 0.00000000396
Logistic Regression
Do the same, but as a logistic regression model. The resulting model was identical to the best linear model in this case.
# Explicitly create a new data frame for logistic regression
lbw.for.best.logistic <- data.frame(
age = lbw$age,
lwt = lbw$lwt,
race.cat = lbw$race.cat,
smoke = lbw$smoke,
preterm = lbw$preterm,
ht = lbw$ht,
ui = lbw$ui,
ftv.cat = lbw$ftv.cat,
y = as.numeric(lbw$low) # Convert 'low' to numeric for the response variable
)
# Confirm the structure, ensuring 'y' is numeric for logistic regression
str(lbw.for.best.logistic)'data.frame': 189 obs. of 9 variables:
$ age : num 19 33 20 21 18 21 22 17 29 26 ...
$ lwt : num 182 155 105 108 107 124 118 103 123 113 ...
$ race.cat: Factor w/ 3 levels "White","Black",..: 2 3 1 1 1 3 1 3 1 1 ...
$ smoke : num 0 0 1 1 1 0 0 0 1 1 ...
$ preterm : Factor w/ 2 levels "0","1+": 1 1 1 1 1 1 1 1 1 1 ...
$ ht : num 0 0 0 0 0 0 0 0 0 0 ...
$ ui : num 1 0 0 1 1 0 0 0 0 0 ...
$ ftv.cat : Factor w/ 3 levels "Normal","None",..: 2 3 1 1 2 2 1 1 1 2 ...
$ y : num 0 0 0 0 0 0 0 0 0 0 ...
# Run bestglm with the cleaned data frame for logistic regression
library(bestglm)
res.best.logistic <- bestglm(Xy = lbw.for.best.logistic,
family = binomial, # Specify binomial family for logistic regression
IC = "AIC",
method = "exhaustive") age lwt race.cat smoke preterm ht ui ftv.cat Criterion
1 FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE 211.85
2 FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE 212.48
3 TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE 212.83
4 TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE 213.15
5 FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE 214.37
Call:
glm(formula = y ~ ., family = family, data = Xi, weights = weights)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.12533 0.96756 -0.13 0.8969
lwt -0.01592 0.00695 -2.29 0.0221 *
race.catBlack 1.30086 0.52848 2.46 0.0138 *
race.catOther 0.85441 0.44091 1.94 0.0526 .
smoke 0.86658 0.40447 2.14 0.0322 *
preterm1+ 1.12886 0.45039 2.51 0.0122 *
ht 1.86690 0.70737 2.64 0.0083 **
ui 0.75065 0.45882 1.64 0.1018
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 234.67 on 188 degrees of freedom
Residual deviance: 197.85 on 181 degrees of freedom
AIC: 213.9
Number of Fisher Scoring iterations: 4
glmulti (Model selection and multimodel inference made easy)
Automated model selection and model-averaging. Provides a wrapper for glm and other functions, automatically generating all possible models (under constraints set by the user) with the specified response and explanatory variables, and finding the best models in terms of some Information Criterion (AIC, AICc or BIC). Can handle very large numbers of candidate models. Features a Genetic Algorithm to find the best models when an exhaustive screening of the candidates is not feasible.
References:
- http://www.jstatsoft.org/v34/i12/paper ; 2)http://cran.r-project.org/web/packages/glmulti/index.html
All-subset linear regression using lm() based on AIC
glmulti.lm.out <-
glmulti(bwt ~ age + lwt + race.cat + smoke + preterm + ht + ui + ftv.cat, data = lbw,
level = 1, # No interaction considered
method = "h", # Exhaustive approach
crit = "aic", # AIC as criteria
confsetsize = 5, # Keep 5 best models
plotty = F, report = F, # No plot or interim reports
fitfunction = "lm") # lm function
## Show 5 best models (Use @ instead of $ for an S4 object)
glmulti.lm.out@formulas[[1]]
bwt ~ 1 + race.cat + preterm + lwt + smoke + ht + ui
<environment: 0x000001e917a03e60>
[[2]]
bwt ~ 1 + race.cat + lwt + smoke + ht + ui
<environment: 0x000001e917a03e60>
[[3]]
bwt ~ 1 + race.cat + preterm + age + lwt + smoke + ht + ui
<environment: 0x000001e917a03e60>
[[4]]
bwt ~ 1 + race.cat + age + lwt + smoke + ht + ui
<environment: 0x000001e917a03e60>
[[5]]
bwt ~ 1 + race.cat + preterm + ftv.cat + lwt + smoke + ht + ui
<environment: 0x000001e917a03e60>
Call:
fitfunc(formula = as.formula(x), data = data)
Residuals:
Min 1Q Median 3Q Max
-1886.4 -441.0 53.5 494.2 1620.9
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2871.99 243.67 11.79 <2e-16 ***
race.catBlack -466.32 145.13 -3.21 0.0016 **
race.catOther -335.68 112.28 -2.99 0.0032 **
preterm1+ -208.44 133.50 -1.56 0.1202
lwt 4.04 1.67 2.42 0.0167 *
smoke -323.57 104.96 -3.08 0.0024 **
ht -573.69 198.96 -2.88 0.0044 **
ui -489.96 135.93 -3.60 0.0004 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 643 on 181 degrees of freedom
Multiple R-squared: 0.25, Adjusted R-squared: 0.221
F-statistic: 8.64 on 7 and 181 DF, p-value: 0.00000000396
All-subset logistic regression using glm() based on AIC
glmulti.logistic.out <-
glmulti(low ~ age + lwt + race.cat + smoke + preterm + ht + ui + ftv.cat, data = lbw,
level = 1, # No interaction considered
method = "h", # Exhaustive approach
crit = "aic", # AIC as criteria
confsetsize = 5, # Keep 5 best models
plotty = F, report = F, # No plot or interim reports
fitfunction = "glm", # glm function
family = binomial) # binomial family for logistic regression
## Show 5 best models (Use @ instead of $ for an S4 object)
glmulti.logistic.out@formulas[[1]]
low ~ 1 + race.cat + preterm + lwt + smoke + ht + ui
<environment: 0x000001e97ec0c340>
[[2]]
low ~ 1 + race.cat + preterm + lwt + smoke + ht
<environment: 0x000001e97ec0c340>
[[3]]
low ~ 1 + race.cat + preterm + age + lwt + smoke + ht + ui
<environment: 0x000001e97ec0c340>
[[4]]
low ~ 1 + race.cat + preterm + age + lwt + smoke + ht
<environment: 0x000001e97ec0c340>
[[5]]
low ~ 1 + race.cat + preterm + ftv.cat + lwt + smoke + ht + ui
<environment: 0x000001e97ec0c340>
Call:
fitfunc(formula = as.formula(x), family = ..1, data = data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.12533 0.96756 -0.13 0.8969
race.catBlack 1.30086 0.52848 2.46 0.0138 *
race.catOther 0.85441 0.44091 1.94 0.0526 .
preterm1+ 1.12886 0.45039 2.51 0.0122 *
lwt -0.01592 0.00695 -2.29 0.0221 *
smoke 0.86658 0.40447 2.14 0.0322 *
ht 1.86690 0.70737 2.64 0.0083 **
ui 0.75065 0.45882 1.64 0.1018
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 234.67 on 188 degrees of freedom
Residual deviance: 197.85 on 181 degrees of freedom
AIC: 213.9
Number of Fisher Scoring iterations: 4
Load pbc survival data in survival package
All-subset Cox regression using coxph() based on AIC
glmulti.coxph.out <-
glmulti(survival.vector ~ trt + age + sex + ascites + hepato + spiders, data = pbc,
level = 1, # No interaction considered
method = "h", # Exhaustive approach
crit = "aic", # AIC as criteria
confsetsize = 5, # Keep 5 best models
plotty = F, report = F, # No plot or interim reports
fitfunction = "coxph") # coxph function
## Show 5 best models (Use @ instead of $ for an S4 object)
glmulti.coxph.out@formulas[[1]]
survival.vector ~ 1 + age + ascites + hepato + spiders
<environment: 0x000001e980521958>
[[2]]
survival.vector ~ 1 + trt + age + ascites + hepato + spiders
<environment: 0x000001e980521958>
[[3]]
survival.vector ~ 1 + sex + age + ascites + hepato + spiders
<environment: 0x000001e980521958>
[[4]]
survival.vector ~ 1 + sex + trt + age + ascites + hepato + spiders
<environment: 0x000001e980521958>
[[5]]
survival.vector ~ 1 + sex + ascites + hepato + spiders
<environment: 0x000001e980521958>
Call:
fitfunc(formula = as.formula(x), data = data)
n= 312, number of events= 125
(106 observations deleted due to missingness)
coef exp(coef) se(coef) z Pr(>|z|)
age 0.02620 1.02655 0.00864 3.03 0.00242 **
ascites 1.49687 4.46770 0.26109 5.73 0.0000000099 ***
hepato 0.84318 2.32375 0.21198 3.98 0.0000695829 ***
spiders 0.66601 1.94646 0.19530 3.41 0.00065 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
age 1.03 0.974 1.01 1.04
ascites 4.47 0.224 2.68 7.45
hepato 2.32 0.430 1.53 3.52
spiders 1.95 0.514 1.33 2.85
Concordance= 0.771 (se = 0.022 )
Likelihood ratio test= 98.3 on 4 df, p=<2e-16
Wald test = 121 on 4 df, p=<2e-16
Score (logrank) test = 156 on 4 df, p=<2e-16
meifly (Interactive model exploration using GGobi)
Exploratory model analysis. Fit and graphical explore ensembles of linear models.
This function just conduct all-subset regression, thus it can handle coxph without problems, but users will have to do model comparison using the result object. Interaction terms cannot be handled, thus inclusion of interaction terms needs creation of product term beforehand.
References:
http://cran.r-project.org/web/packages/meifly/index.html
Fit all subsets (main effects only)
For x, give a data.frame without the outcome variable.
fitall.out <- fitall(y = pbc$survival.vector,
x = pbc[,c("trt", "age", "sex", "ascites","hepato","spiders")],
method="coxph") | | | 0% | |= | 2% | |=== | 3% | |==== | 5% | |====== | 6% | |======= | 8% | |========= | 10% | |========== | 11% | |=========== | 13% | |============= | 14% | |============== | 16% | |================ | 17% | |================= | 19% | |=================== | 21% | |==================== | 22% | |===================== | 24% | |======================= | 25% | |======================== | 27% | |========================== | 29% | |=========================== | 30% | |============================= | 32% | |============================== | 33% | |=============================== | 35% | |================================= | 37% | |================================== | 38% | |==================================== | 40% | |===================================== | 41% | |======================================= | 43% | |======================================== | 44% | |========================================= | 46% | |=========================================== | 48% | |============================================ | 49% | |============================================== | 51% | |=============================================== | 52% | |================================================= | 54% | |================================================== | 56% | |=================================================== | 57% | |===================================================== | 59% | |====================================================== | 60% | |======================================================== | 62% | |========================================================= | 63% | |=========================================================== | 65% | |============================================================ | 67% | |============================================================= | 68% | |=============================================================== | 70% | |================================================================ | 71% | |================================================================== | 73% | |=================================================================== | 75% | |===================================================================== | 76% | |====================================================================== | 78% | |======================================================================= | 79% | |========================================================================= | 81% | |========================================================================== | 83% | |============================================================================ | 84% | |============================================================================= | 86% | |=============================================================================== | 87% | |================================================================================ | 89% | |================================================================================= | 90% | |=================================================================================== | 92% | |==================================================================================== | 94% | |====================================================================================== | 95% | |======================================================================================= | 97% | |========================================================================================= | 98% | |==========================================================================================| 100%
Show the result
## Extract AIC from each model
fitall.out.aic <- t(sapply(fitall.out, extractAIC))
## Create an order list of increasing AIC
final.out.order <- order(fitall.out.aic[,2])
## Show the result for the best model
fitall.out[final.out.order][1]$`58`
Call:
coxph(formula = y ~ age + ascites + hepato + spiders, data = data,
model = FALSE)
coef exp(coef) se(coef) z p
age 0.0262 1.0265 0.0086 3.0 0.002
ascites 1.4969 4.4677 0.2611 5.7 0.00000001
hepato 0.8432 2.3238 0.2120 4.0 0.00006958
spiders 0.6660 1.9465 0.1953 3.4 0.00064920
Likelihood ratio test=98 on 4 df, p=<2e-16
n= 312, number of events= 125
(106 observations deleted due to missingness)
attr(,"class")
[1] "ensemble"