emmeans for marginaleffects?
A couple of labs ago I provided you the memorable, while still deeply pedagogical, aphorism–you estimated an lm, but you probably want to tidy and plot an emmeans. I remain certain that, in most cases, the contrasts and p-values from an object which reports combinations of coefficients are of more use than the linear model itself.
But I’m not certain emmeans is the right toolkit to teach for this task. The maintainer, Russell Lenth, emeritus professor of statistics at Iowa, recently wrote:
[I] continue… to maintain and occasionally add new features [to emmeans]. However, none of us is immortal; and neither is software. I have thought of trying to find a co-maintainer who could carry the ball once I am gone or lose interest, but the flip side of that is that the codebase is not getting less messy as time goes on – why impose that on someone else? So my thought now is that if at some point, enough active R developers want the capabilities of emmeans but I am no longer in the picture, they should feel free to supersede it with some other package that does it better. All of the code is publicly available on GitHub, so just take what is useful and replace what is not.
This is not a trivial concern. When people are working for free, it’s super advantageous to use tools which are themselves being widely used–these are the tools which get bug fixes, new features, updates in light of new methods, etc. Plenty of competing packages have similarly entered a state of benign neglect, or were less ambitious in the first place:
margins, by former LSE political scientist Thomas Leeper, has not been updated in 3 years, and is about to be removed from CRAN for neglect.
effects seems to be maintained, but can’t do contrasts, and its inbuilt plots are all pre-Wickhamist.
brmsmargins only works with a tiny subset of Bayesian models.
modelbased is a competitor with meaningfully less model coverage.
So, while I’m still using emmeans in my own projects, I think I’ll start using marginaleffects in future code, at least to acclimate myself to the package.
So this is important for how lab will go–I’m still learning my way around this package too! As we putz around the documentation, we’re going to teach one another.
Let’s install a current version.
# install the development version from GitHub
remotes::install_github("vincentarelbundock/marginaleffects")
packageVersion("marginaleffects")
## [1] '0.20.1.5'
Vincent Arel-Bundock is a French Canadian political scientist, a Michigan PhD, and now at Montreal. He might be the best political science computational toolmaker working today.¹ His package modelsummary is the primus inter pares for making regression tables.
marginaleffects has a cool website providing documentation. The package provides (one core verb per capability–sketched below):
Fitted values
Contrasts (risk differences, differences in odds ratios)
Slopes (partial derivatives)
Linear and non-linear hypothesis tests
Equivalence tests
For a dizzying array of models which social scientists (and political scientists in particular) typically use:
lm(), glm(), ivreg(), blmer(), brm(), lm_lin(), lm_robust(), gam(), lmer(), glmer(), polr(), MCMCglmm(), mlogit(), multinom(), svyglm(), survreg()
97 models in all!
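A minimal sketch of how those bullets map onto the package’s verbs, using a throwaway mtcars model (the function names come from the package documentation; the hypotheses() string is one of its documented coefficient-equation syntaxes):
library(marginaleffects)

mod <- lm(mpg ~ hp * am, data = mtcars)

predictions(mod)                # fitted values, with standard errors
comparisons(mod)                # contrasts: differences in adjusted predictions
slopes(mod)                     # partial derivatives
hypotheses(mod, "hp + am = 0")  # linear hypothesis test on coefficients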
Let’s try it out!
library(tidyverse)
library(magrittr)
library(marginaleffects)
library(broom)
t1 <- "https://github.com/thomasjwood/code_lab/raw/main/data/medical_insurance.csv" %>%
read_csv
t1 %>%
glimpse
## Rows: 2,772
## Columns: 7
## $ age <dbl> 19, 18, 28, 33, 32, 31, 46, 37, 37, 60, 25, 62, 23, 56, 27, 1…
## $ sex <chr> "female", "male", "male", "male", "male", "female", "female",…
## $ bmi <dbl> 27.900, 33.770, 33.000, 22.705, 28.880, 25.740, 33.440, 27.74…
## $ children <dbl> 0, 1, 3, 0, 0, 0, 1, 3, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0…
## $ smoker <chr> "yes", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
## $ region <chr> "southwest", "southeast", "southeast", "northwest", "northwes…
## $ charges <dbl> 16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622,…
t1 is data from an insurance company reporting relationships between a policy holder’s characteristics and their charges (summed over a three-year period).
We might be interested in the following regression model:
lm1 <- lm(
charges ~ age * smoker,
data = t1
)
lm1 %>%
summary
##
## Call:
## lm(formula = charges ~ age * smoker, data = t1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16259.4 -2043.4 -1344.8 -245.8 28662.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1974.864 402.618 -4.905 9.88e-07 ***
## age 264.612 9.647 27.429 < 2e-16 ***
## smokeryes 22267.058 887.550 25.088 < 2e-16 ***
## age:smokeryes 45.597 21.609 2.110 0.0349 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6397 on 2768 degrees of freedom
## Multiple R-squared: 0.7232, Adjusted R-squared: 0.7229
## F-statistic: 2410 on 3 and 2768 DF, p-value: < 2.2e-16
A newborn smoker costs $45.60 in insurance payments? Seems like a bargain! But perhaps this is not the most intuitive quantity.
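To unpack that: the model’s smoker/non-smoker gap combines two coefficients, smokeryes plus age:smokeryes times age. A quick hand computation from the estimates above shows how the gap grows:
# smoker gap implied by lm1 at ages 0, 20 and 50
22267.058 + 45.597 * c(0, 20, 50)
## [1] 22267.06 23179.00 24546.91
The $45.60 only means something once it is attached to an age.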
Now, to our new friend, marginaleffects::predictions():
lm1 %>%
predictions(
newdata = datagrid(
age = seq(20, 80, 15)
)
)
##
## age Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
## 20 3317 230 14.4 <0.001 153.6 2866 3769
## 35 7287 142 51.2 <0.001 Inf 7008 7565
## 50 11256 171 65.8 <0.001 Inf 10921 11591
## 65 15225 283 53.8 <0.001 Inf 14670 15780
## 80 19194 416 46.2 <0.001 Inf 18379 20009
##
## Columns: rowid, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, smoker, age, charges
## Type: response
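Each estimate is just a linear combination of lm1’s coefficients; datagrid() filled in the unspecified smoker column with its modal value ("no"), so the age = 20 row can be reproduced by hand:
# reproduce the age = 20 row: intercept + 20 * age coefficient
unname(coef(lm1)["(Intercept)"] + 20 * coef(lm1)["age"])
# roughly 3317, matching the first row above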
lm1 %>%
predictions(
newdata = datagrid(
smoker = unique
)
)
##
## smoker Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
## yes 32424 270 120.2 <0.001 Inf 31896 32953
## no 8374 136 61.5 <0.001 Inf 8107 8641
##
## Columns: rowid, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, age, smoker, charges
## Type: response
lm1 %>%
predictions(
newdata = datagrid(
age = seq(20, 80, 15),
smoker = unique
)
)
##
## age smoker Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
## 20 yes 26496 447 59.2 <0.001 Inf 25620 27373
## 20 no 3317 230 14.4 <0.001 153.6 2866 3769
## 35 yes 31150 278 112.2 <0.001 Inf 30606 31694
## 35 no 7287 142 51.2 <0.001 Inf 7008 7565
## 50 yes 35803 350 102.4 <0.001 Inf 35117 36488
## 50 no 11256 171 65.8 <0.001 Inf 10921 11591
## 65 yes 40456 580 69.8 <0.001 Inf 39320 41592
## 65 no 15225 283 53.8 <0.001 Inf 14670 15780
## 80 yes 45109 847 53.2 <0.001 Inf 43448 46769
## 80 no 19194 416 46.2 <0.001 Inf 18379 20009
##
## Columns: rowid, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, age, smoker, charges
## Type: response
And no surprise, we can tidy everything!
lm1 %>%
predictions(
newdata = datagrid(
age = seq(20, 80, 15),
smoker = unique
)
) %>%
tidy
## # A tibble: 10 × 12
## rowid estimate std.error statistic p.value s.value conf.low conf.high age
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 26496. 447. 59.2 0 Inf 25620. 27373. 20
## 2 2 3317. 230. 14.4 5.62e-47 154. 2866. 3769. 20
## 3 3 31150. 278. 112. 0 Inf 30606. 31694. 35
## 4 4 7287. 142. 51.2 0 Inf 7008. 7565. 35
## 5 5 35803. 350. 102. 0 Inf 35117. 36488. 50
## 6 6 11256. 171. 65.8 0 Inf 10921. 11591. 50
## 7 7 40456. 580. 69.8 0 Inf 39320. 41592. 65
## 8 8 15225. 283. 53.8 0 Inf 14670. 15780. 65
## 9 9 45109. 847. 53.2 0 Inf 43448. 46769. 80
## 10 10 19194. 416. 46.2 0 Inf 18379. 20009. 80
## # ℹ 3 more variables: smoker <chr>, charges <dbl>, term <int>
Works great with glms and probabilities:
t1$charge_grp <- t1$charges %>%
cut_number(
n = 2,
labels = c(
"low charge",
"high charge"
)
) %>%
factor
glm1 <- glm(
charge_grp ~ age * smoker,
data = t1,
family = binomial()
)
glm1 %>%
predictions(
newdata = datagrid(
age = seq(20, 80, 15),
smoker = unique
)
)
##
## age smoker Estimate Pr(>|z|) S 2.5 % 97.5 %
## 20 yes 1.000 0.979 0.0 2.22e-16 1.000
## 20 no 0.015 <0.001 413.4 1.06e-02 0.021
## 35 yes 1.000 0.967 0.0 2.22e-16 1.000
## 35 no 0.147 <0.001 278.1 1.26e-01 0.170
## 50 yes 1.000 0.973 0.0 2.22e-16 1.000
## 50 no 0.661 <0.001 68.7 6.29e-01 0.691
## 65 yes 1.000 0.984 0.0 2.22e-16 1.000
## 65 no 0.957 <0.001 330.0 9.43e-01 0.967
## 80 yes 1.000 0.989 0.0 2.22e-16 1.000
## 80 no 0.996 <0.001 390.1 9.94e-01 0.998
##
## Columns: rowid, estimate, p.value, s.value, conf.low, conf.high, age, smoker, charge_grp
## Type: invlink(link)
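Note the Type: invlink(link) footer: predictions are built on the log-odds scale and back-transformed, which keeps the intervals inside [0, 1]. If you want estimates computed directly on the probability scale, predictions() accepts a type argument (same grid as above):
glm1 %>%
  predictions(
    type = "response",
    newdata = datagrid(
      age = seq(20, 80, 15),
      smoker = unique
    )
  )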
Mmmmm maybe we want to see precisely how the smoking effect changes over the life course of policy holders?
lm1 %>%
avg_comparisons(
variables = list(smoker = "sequential"),
by = "age"
) %>%
tidy %>%
ggplot(
aes(
x = age, ymin = conf.low, y = estimate, ymax = conf.high
)
) +
geom_pointrange()
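The pointranges suggest the smoking contrast grows with age. To test that formally, the hypothesis argument can compare rows of a comparisons() call. A sketch, assuming the b1/b2 row-indexing syntax from the package’s hypothesis vignette:
# does the smoking contrast at age 80 differ from the contrast at age 20?
lm1 %>%
  comparisons(
    variables = "smoker",
    newdata = datagrid(age = c(20, 80)),
    hypothesis = "b2 = b1"
  )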
We can also check for interactions:
# note: cut()'s intervals are open on the left, so exact 18s fall
# outside (18, 30] and become NA -- hence the dropped rows in lm2
t1$age_cat <- t1$age %>%
cut(
breaks = c(18, 30, 40, 50, 60, Inf),
labels = str_c(
c(18, 30, 40, 50, 60),
"-",
c(29, 39, 49, 59, 64)
)
)
lm2 <- lm(
charges ~ bmi * smoker * age_cat,
data = t1
)
lm2 %>%
comparisons(
variables = "smoker",
newdata = datagrid(
bmi = seq(16, 45, by = .2),
age_cat = unique
)
) %>%
tidy %>%
ggplot() +
geom_ribbon(
aes(
bmi, ymin = conf.low, ymax = conf.high
),
fill = "white",
color = "grey1",
linewidth = .1,
outline.type = "full"
) +
geom_hline(
yintercept = 0,
linetype = "dashed",
color = "red"
) +
geom_line(
aes(
bmi, y = estimate
)
) +
facet_wrap(~age_cat) +
labs(
y = "Effect of smoking on insurance payments"
)
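Incidentally, the package can draw contrast plots like this itself: plot_comparisons() takes a condition argument naming the x-axis variable, with further variables mapped to colour or facets. A rough one-liner version of the plot above, trading away theming control:
lm2 %>%
  plot_comparisons(
    variables = "smoker",
    condition = c("bmi", "age_cat")
  )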
Just by way of advertisement, which is easier to interpret–the plotted contrasts, or the regression table?
lm2 %>%
summary
##
## Call:
## lm(formula = charges ~ bmi * smoker * age_cat, data = t1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12651.2 -2307.5 -1266.1 108.7 29350.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4727.996 1013.085 4.667 3.21e-06 ***
## bmi 1.549 33.762 0.046 0.963410
## smokeryes -17656.127 2109.685 -8.369 < 2e-16 ***
## age_cat30-39 1213.243 1605.738 0.756 0.449977
## age_cat40-49 6252.409 1657.412 3.772 0.000165 ***
## age_cat50-59 7509.606 1614.716 4.651 3.47e-06 ***
## age_cat60-64 12209.428 2613.640 4.671 3.14e-06 ***
## bmi:smokeryes 1331.290 67.921 19.600 < 2e-16 ***
## bmi:age_cat30-39 21.090 52.452 0.402 0.687659
## bmi:age_cat40-49 -43.185 53.647 -0.805 0.420901
## bmi:age_cat50-59 19.220 51.893 0.370 0.711139
## bmi:age_cat60-64 -49.962 79.389 -0.629 0.529182
## smokeryes:age_cat30-39 -2639.729 3442.434 -0.767 0.443258
## smokeryes:age_cat40-49 -2658.776 3390.656 -0.784 0.433025
## smokeryes:age_cat50-59 -3882.574 3580.812 -1.084 0.278345
## smokeryes:age_cat60-64 -21074.476 6210.852 -3.393 0.000701 ***
## bmi:smokeryes:age_cat30-39 131.018 111.153 1.179 0.238615
## bmi:smokeryes:age_cat40-49 112.747 108.775 1.037 0.300055
## bmi:smokeryes:age_cat50-59 139.432 111.223 1.254 0.210093
## bmi:smokeryes:age_cat60-64 677.208 195.636 3.462 0.000546 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5031 on 2604 degrees of freedom
## (148 observations deleted due to missingness)
## Multiple R-squared: 0.8301, Adjusted R-squared: 0.8289
## F-statistic: 669.7 on 19 and 2604 DF, p-value: < 2.2e-16
Download the GSS welfare framing effect data we used in a previous lab.
t2 <- "https://github.com/thomasjwood/code_lab/raw/main/data/gss_welfare.rds" %>%
url %>%
readRDS
Use marginaleffects to estimate the overall framing effect, and then the framing effect by ideology and race.
¹ Whoa, how King/Gelman and others have receded.