Let’s set up our libraries and import our data. Nothing unusual here.
knitr::opts_chunk$set(fig.width = 9, fig.height = 6)
library(rethinking)
library(tidyverse)
library(tidybayes)
library(tidybayes.rethinking)
data(UCBadmit)
But first, let’s look at the data we have. You may have heard me talk about this dataset before as an example of Simpson’s paradox; it’s a famous demonstration of it.
Simpson’s paradox describes a phenomenon where certain statistical descriptions reverse depending on the level of aggregation we look at. Here, for example, women have a lower admission rate at the university level but an equal or higher admission rate within most individual departments.
We call this a paradox because many non-statisticians have the expectation that these kinds of reversals shouldn’t be possible. But, with our statistical skills, we can analyze the matter to understand the underlying dynamic more clearly.
| dept | applicant.gender | admit | reject | applications |
|---|---|---|---|---|
| A | male | 512 | 313 | 825 |
| A | female | 89 | 19 | 108 |
| B | male | 353 | 207 | 560 |
| B | female | 17 | 8 | 25 |
| C | male | 120 | 205 | 325 |
| C | female | 202 | 391 | 593 |
| D | male | 138 | 279 | 417 |
| D | female | 131 | 244 | 375 |
| E | male | 53 | 138 | 191 |
| E | female | 94 | 299 | 393 |
| F | male | 22 | 351 | 373 |
| F | female | 24 | 317 | 341 |
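You can verify the aggregate side of the paradox straight from the table: overall, about 44% of men are admitted (1198 of 2691 applications) versus about 30% of women (557 of 1835), even though women do as well or better in most individual departments. A quick check in R (ours, not part of the original analysis):

# university-level view: aggregate admission rate by gender
UCBadmit |>
  group_by(applicant.gender) |>
  summarize(rate = sum(admit) / sum(applications))

# department-level view: admission rate within each department
UCBadmit |>
  mutate(rate = admit / applications) |>
  select(dept, applicant.gender, rate)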
The causal structure at issue can be visualized with a simple DAG.
library(dagitty) # version 0.3-4
library(ggdag) # version 0.2.13.9000 (from source)

dagify(
  admission ~ gender,
  admission ~ department,
  department ~ gender,
  exposure = c("gender", "department"),
  outcome = "admission"
) |>
  tidy_dagitty(layout = "circle") |>
  node_status() |>
  ggplot(aes_dag(color = status)) +
  geom_dag(size = 2.1) +
  theme_dag_gray(legend.position = "none", aspect.ratio = .66) +
  geom_dag_node(size = 40) +
  geom_dag_label(aes(label = name), fill = "white", color = "black", size = 4) +
  coord_cartesian(xlim = c(-0.7, 1.1), ylim = c(-1.1, 1.1)) +
  scale_color_manual(values = c("exposure" = "#F8766D", "outcome" = "#619CFF", "latent" = "LightGrey"))
It is possible that gender directly predicts admission, perhaps because admissions committees are discriminating.
It is also possible that gender predicts admission only indirectly, via department choice. It might be, for example, that women apply to different departments than men do, and that this difference in preference, rather than differing admission standards, causes the disparity.
So how can we figure that out?
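One answer comes from dagitty itself, which can compute what we need to condition on for each causal question. Here is a quick sketch under the DAG above (the dag name and this check are ours, not part of the original analysis):

dag <- dagify(
  admission ~ gender + department,
  department ~ gender
)

# Total effect of gender on admission: no backdoor paths, so the empty set suffices.
adjustmentSets(dag, exposure = "gender", outcome = "admission", effect = "total")

# Direct effect of gender: we must condition on department.
adjustmentSets(dag, exposure = "gender", outcome = "admission", effect = "direct")

In other words, the raw gender gap estimates the total effect, and estimating the direct effect means stratifying by department, which is exactly what we’ll do.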
Before we move forward, let’s simulate some data to see how this dynamic might manifest under very controlled circumstances. This is an important step in building any model: we want to make sure our model behaves the way we intend it to, which sometimes isn’t the case.
Here, let’s create two simulations of men and women applying to two departments with different levels of difficulty. In the first simulation, admission rates are the same for each gender (no committee bias); in the second, women are admitted at a lower rate (committee bias).
Critically, there is also a different preference between men and women for the department of application. (We’re using a Bernoulli distribution here, via rbern: a simple two-outcome “coin flip” with a fixed probability. A Bernoulli distribution is functionally the same as a binomial distribution with size = 1.)
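For example, these two calls draw from the same distribution (a throwaway illustration, not part of the simulation):

rbern(5, prob = 0.5)             # five 0/1 draws via rethinking's Bernoulli helper
rbinom(5, size = 1, prob = 0.5)  # the equivalent base-R binomial call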
N <- 100000  # applicants per simulation

# Admission rates: rows are departments (B is harder), columns are genders.
# In the biased version, women are admitted at lower rates in both departments.
rate_unbiased <- matrix(
  data = c(0.30, 0.10,
           0.30, 0.10),
  nrow = 2,
  dimnames = list( c("A", "B"), c("male", "female") )
)

rate_biased <- matrix(
  data = c(0.30, 0.10,
           0.20, 0.05),
  nrow = 2,
  dimnames = list( c("A", "B"), c("male", "female") )
)
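One R idiom worth flagging: indexing a matrix with a two-column matrix of (row, column) pairs returns one cell per pair, and factor indices are converted to their integer level codes. That is what rate_unbiased[cbind(dept, applicant.gender)] does below. A standalone illustration (ours):

rate_biased[cbind(c("A", "B"), c("female", "male"))]  # 0.20 0.10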
sim1 <- tibble(
  # fair coin flip for gender: 0 = male, 1 = female
  applicant.gender = rbern( N ) |>
    as_factor() |> fct_recode("male" = "0", "female" = "1"),
  # women apply to the harder department (B) more often than men do
  dept = rbern( N , ifelse(applicant.gender == "male", 0.4, 0.6) ) |>
    as_factor() |> fct_recode("A" = "0", "B" = "1"),
  # look up each applicant's admission rate by (department, gender)
  rate = rate_unbiased[cbind(dept, applicant.gender)],
  admitted = rbern( N , prob=rate )
)
# Same setup, but with the biased admission rates.
sim2 <- tibble(
applicant.gender = rbern( N ) |>
as_factor() |> fct_recode("male" = "0", "female" = "1"),
dept = rbern( N , ifelse(applicant.gender == "male", 0.4, 0.6) ) |>
as_factor() |> fct_recode("A" = "0", "B" = "1"),
rate = rate_biased[cbind(dept, applicant.gender)],
admitted = rbern( N , prob=rate )
)
And the results…
sim1 |> group_by(applicant.gender) |> summarize(mean(admitted))
## # A tibble: 2 × 2
## applicant.gender `mean(admitted)`
## <fct> <dbl>
## 1 male 0.220
## 2 female 0.181
sim2 |> group_by(applicant.gender) |> summarize(mean(admitted))
## # A tibble: 2 × 2
## applicant.gender `mean(admitted)`
## <fct> <dbl>
## 1 male 0.220
## 2 female 0.108
sim1 |> group_by(applicant.gender, dept) |> summarize(mean(admitted))
## `summarise()` has grouped output by 'applicant.gender'. You can override using
## the `.groups` argument.
## # A tibble: 4 × 3
## # Groups: applicant.gender [2]
## applicant.gender dept `mean(admitted)`
## <fct> <fct> <dbl>
## 1 male A 0.300
## 2 male B 0.100
## 3 female A 0.300
## 4 female B 0.101
sim2 |> group_by(applicant.gender, dept) |> summarize(mean(admitted))
## `summarise()` has grouped output by 'applicant.gender'. You can override using
## the `.groups` argument.
## # A tibble: 4 × 3
## # Groups: applicant.gender [2]
## applicant.gender dept `mean(admitted)`
## <fct> <fct> <dbl>
## 1 male A 0.300
## 2 male B 0.0978
## 3 female A 0.199
## 4 female B 0.0479
Sure enough, women have a lower total admission rate in both simulations, even though we built committee bias into only the second one.
So let’s make a model from this. We’ll need a few helper columns to turn our data into something we can use. Then we’ll define two models: one that stratifies by gender alone (lr_simple) and one that stratifies by both gender and department (lr_multi).
consolidate_data <- function(df) {
df |>
group_by(dept, applicant.gender) |>
summarize(
admit = sum(admitted),
reject = sum(!admitted),
applications = admit + reject,
.groups="drop"
)
}
rename_cols <- function(df) {
df |>
select(
A = admit,
N = applications,
G = applicant.gender,
D = dept
)
}
df_sim1 <- sim1 |> consolidate_data() |> rename_cols()
df_sim2 <- sim2 |> consolidate_data() |> rename_cols()
Then, our models. These are both “logistic regressions,” which means the underlying outcome isn’t a continuous number but a binary event (admitted or not admitted). Because each row aggregates many applicants, we model the count of admissions out of the number of applications.
These look pretty similar to our previous models, but two things are different. First, instead of a likelihood function based on a normal distribution, this one is based on a binomial distribution (A ~ binomial(N, p)). Second, our linear model needs a “link function” to convert its possible values into something that matches the requirements of our model structure (logit(p) <- a[G]). In the second model, matrix[G,D]:a simply declares a matrix of intercepts, one for each gender-department combination. Other than that, it all works exactly the way the rest of our models do.
If you don’t understand this yet, that’s okay, but come ready to discuss it together.
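To get a feel for what the link function does, here is a quick check using rethinking’s logit and inv_logit helpers (ours, not part of the models):

inv_logit(c(-2, 0, 2))  # ~0.12, 0.50, ~0.88: any real value maps into (0, 1)
logit(0.5)              # 0: from probability back to the linear scale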
## simple
lr_simple <- alist(
A ~ binomial(N, p),
logit(p) <- a[G],
a[G] ~ normal(0, 1)
)
simple_model1 <- ulam(
lr_simple,
data=df_sim1,
chains=4,
cores=4,
messages=FALSE,
refresh=0
)
simple_model2 <- ulam(
lr_simple,
data=df_sim2,
chains=4,
cores=4,
messages=FALSE,
refresh=0
)
## multi
lr_multi <- alist(
A ~ binomial(N, p),
logit(p) <- a[G,D],
matrix[G,D]:a ~ normal(0, 1)
)
multi_model1 <- ulam(
lr_multi,
data=df_sim1,
chains=4,
cores=4,
messages=FALSE,
refresh=0
)
multi_model2 <- ulam(
lr_multi,
data=df_sim2,
chains=4,
cores=4,
messages=FALSE,
refresh=0
)
precis(simple_model1, depth=2)
## mean sd 5.5% 94.5% rhat ess_bulk
## a[1] -1.263913 0.01092542 -1.281091 -1.246642 1.002056 1404.406
## a[2] -1.511251 0.01174238 -1.530113 -1.492515 1.003152 1406.586
precis(simple_model2, depth=2)
## mean sd 5.5% 94.5% rhat ess_bulk
## a[1] -1.267974 0.01087992 -1.285266 -1.250272 1.004783 1515.818
## a[2] -2.108358 0.01399520 -2.130149 -2.086042 1.000579 1240.267
precis(multi_model1, depth=3)
## mean sd 5.5% 94.5% rhat ess_bulk
## a[1,1] -0.8456614 0.01271071 -0.8664640 -0.8249427 1.0032956 2077.787
## a[2,1] -0.8457551 0.01580715 -0.8713376 -0.8209453 1.0049709 2340.804
## a[1,2] -2.1955477 0.02352516 -2.2327639 -2.1587013 1.0023756 2151.318
## a[2,2] -2.1844820 0.01933002 -2.2140400 -2.1539559 0.9997393 2148.263
precis(multi_model2, depth=3)
## mean sd 5.5% 94.5% rhat ess_bulk
## a[1,1] -0.8483619 0.01268254 -0.8681213 -0.8275126 1.002790 2275.167
## a[2,1] -1.3956724 0.01784664 -1.4242054 -1.3674374 1.000689 2295.435
## a[1,2] -2.2207892 0.02535860 -2.2614709 -2.1813527 1.002315 1979.838
## a[2,2] -2.9883028 0.02741480 -3.0322119 -2.9440843 1.003158 2265.350
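Notice the pattern: the simple model reports a gender gap in both simulations, while the stratified model recovers the per-department rates and shows a gap only in the biased simulation. As a quick sanity check (ours, not part of the original output), we can push a coefficient back through the inverse link:

inv_logit(-0.846)  # ~0.30: the admission rate we simulated for department A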
Now let’s apply our models to the real data.
df <- UCBadmit |> rename_cols()
ucb_model_simple <- ulam(
lr_simple,
data=df,
chains=4,
cores=4,
messages=FALSE,
refresh=0
)
ucb_model_multi <- ulam(
lr_multi,
data=df,
chains=4,
cores=4,
messages=FALSE,
refresh=0
)
precis(ucb_model_simple, depth=2)
## mean sd 5.5% 94.5% rhat ess_bulk
## a[1] -0.8295011 0.05199436 -0.9133484 -0.7484886 1.004321 1244.102
## a[2] -0.2217212 0.04026675 -0.2846887 -0.1579184 1.000804 1187.509
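The same back-transformation is useful here. The recovered rates match the raw aggregate rates we computed from the table, which tells us index 1 corresponds to female applicants in this data (again, a check of ours):

inv_logit(c(-0.83, -0.22))  # ~0.30 and ~0.44: the female and male aggregate rates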
precis(ucb_model_multi, depth=3)
## mean sd 5.5% 94.5% rhat ess_bulk
## a[1,1] 1.4638967 0.24561291 1.08081621 1.8705604 1.0033459 3034.647
## a[2,1] 0.4903475 0.07269761 0.37822650 0.6065609 1.0067840 3096.356
## a[1,2] 0.6654304 0.39366023 0.04160832 1.2993269 0.9996281 3336.659
## a[2,2] 0.5328740 0.08879126 0.38930981 0.6739907 1.0027931 3382.761
## a[1,3] -0.6565170 0.08588772 -0.78983841 -0.5223414 0.9993391 2515.933
## a[2,3] -0.5247377 0.11249573 -0.70867413 -0.3454547 1.0009317 2232.840
## a[1,4] -0.6199629 0.10534940 -0.78450524 -0.4509935 1.0063708 2851.781
## a[2,4] -0.6974509 0.10621764 -0.86493819 -0.5278910 1.0081584 2979.186
## a[1,5] -1.1427886 0.11403406 -1.32830890 -0.9654053 1.0030967 2707.792
## a[2,5] -0.9412246 0.15606820 -1.18643311 -0.6976691 1.0004997 3244.497
## a[1,6] -2.4914726 0.19428600 -2.81331243 -2.1894480 0.9999052 2659.320
## a[2,6] -2.6673009 0.20341311 -3.00018081 -2.3467955 1.0053495 2147.416
Let’s look at the gender difference in admission probability, both overall and for specific departments.
ucb_model_simple |>
recover_types(df) |>
spread_draws(a[G]) |>
pivot_wider(names_from="G", values_from="a") |>
mutate(
diff_prob = inv_logit(female) - inv_logit(male)
) |>
ggplot( aes(x=diff_prob)) +
geom_density(bw=0.003) +
labs(x = "Gender-based admissions probability difference", color="Dept")
ucb_model_multi |>
recover_types(df) |>
spread_draws(a[G,D]) |>
pivot_wider(names_from="G", values_from="a") |>
mutate(
diff_prob = plogis(female) - plogis(male)
) |>
ggplot(aes(x=diff_prob) ) +
geom_density(aes(color=D), bw=0.005) +
geom_vline(xintercept = 0, linetype = "dashed", color = "black", size = 0.4) +
labs(x = "Gender-based admissions probability difference", color="Dept") +
scale_x_continuous(limits=c(-.32, .32))
Finally, we use post-stratification to simulate an intervention.
# pool the per-department posterior draws into one overall contrast
ucb_model_multi |>
recover_types(df) |>
spread_draws(a[G,D]) |>
pivot_wider(names_from="G", values_from="a") |>
mutate(
diff_prob = plogis(female) - plogis(male)
) |>
ggplot(aes(x=diff_prob)) +
geom_density(bw=0.002, color="red") +
geom_vline(xintercept = 0, linetype = "dashed", color = "black", size = 0.4) +
labs(x = "Gender-based admissions probability difference") +
scale_x_continuous(limits=c(-.32, .32))
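A refinement worth knowing about: a textbook post-stratification would weight each department’s contrast by that department’s share of applications rather than pooling the draws equally. Here is a sketch of that, assuming the df and model objects above (dept_weights is our own helper name):

# each department's share of all applications
dept_weights <- df |>
  group_by(D) |>
  summarize(w = sum(N)) |>
  mutate(w = w / sum(w))

ucb_model_multi |>
  recover_types(df) |>
  spread_draws(a[G, D]) |>
  left_join(dept_weights, by = "D") |>
  group_by(.draw, G) |>
  # weighted average admission probability per gender, per posterior draw
  summarize(p = sum(w * plogis(a)), .groups = "drop") |>
  pivot_wider(names_from = "G", values_from = "p") |>
  mutate(diff_prob = female - male) |>
  ggplot(aes(x = diff_prob)) +
  geom_density(bw = 0.002) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "black", size = 0.4) +
  labs(x = "Gender-based admissions probability difference")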
Did we find causation? Well… maybe. Consider all the unknowns that could be lurking in the background:
set.seed(34)

dagify(
  unknown3 ~ gender,
  admission ~ unknown1,
  admission ~ department,
  admission ~ unknown2,
  department ~ gender,
  department ~ unknown3,
  department ~ unknown4,
  admission ~ unknown4,
  unknown1 ~ gender,
  exposure = c("gender", "department"),
  outcome = "admission",
  latent = c("unknown1", "unknown2", "unknown3", "unknown4")
) |>
  tidy_dagitty(layout = "gem") |>
  node_status() |>
  ggplot(aes_dag(color = status)) +
  geom_dag(size = 2.1) +
  theme_dag_gray(legend.position = "none", aspect.ratio = .66) +
  geom_dag_node(size = 40) +
  geom_dag_label(aes(label = name), fill = "white", color = "black", size = 4) +
  coord_cartesian(xlim = c(-225, 275), ylim = c(-195, 395)) +
  scale_color_manual(values = c("exposure" = "#F8766D", "outcome" = "#619CFF", "latent" = "LightGrey"))
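The department-stratified comparison we ran above corresponds to the direct effect of gender. If this bigger DAG is anywhere near right, dagitty should report that no adjustment set identifies it: conditioning on department opens a collider path through unknown4, and the path through unknown1 cannot be blocked, because the unknowns are latent. A sketch, restating the DAG (dag_unknowns is our own name):

dag_unknowns <- dagify(
  admission ~ department + unknown1 + unknown2 + unknown4,
  department ~ gender + unknown3 + unknown4,
  unknown1 ~ gender,
  unknown3 ~ gender,
  latent = c("unknown1", "unknown2", "unknown3", "unknown4")
)

# should print nothing: no valid adjustment set exists for the direct effect
adjustmentSets(dag_unknowns, exposure = "gender", outcome = "admission",
               effect = "direct")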