The purpose of this demo is to show how IPTW can be used to correct for confounding. The dataset is simulated from a structural causal model consistent with a DAG.
Packages needed
simcausal - Create a DAG by writing its structural causal model and simulate a dataset.
tidyverse - Always useful. To become familiar with this package, I suggest perusing this free online book by Garrett Grolemund and Hadley Wickman R 4 Data Science.
#load the packages.
library(tidyverse)
library(simcausal)
rcat.b1 calls the categorical distribution
rnorm calls the normal distribution
rbern calls the Bernoulli distribution, i.e. 1, 0
plogis is the inverse-logit function or expit: exp(x)/(1+exp(x))
In the code below, we first define the variables which don’t have parents (exogenous variables). Next, we define their children using a function dependent on the values of their parents, i.e. their direct causes. By doing this, we have specified a structural causal model, which could be used to draw a DAG.
D <- DAG.empty()
D <- D +
node("G", distr = "rcat.b1", probs = c(.25, .5, .25)) + #a confounder
node("A", distr = "rnorm", mean = 5, sd=2) + #another confounder
#define the exposure, a child of A and G
node("E", distr="rbern", prob = plogis(-2 + 2*(G==2)+ 1.5*(G==3) + .1*A)) +
#another var on confounding path
node("B", distr = "rbern", prob = plogis(-2+.4*A)) +
#define the outcome, a child of B, G, and E
node("O", distr="rbern", prob = plogis(-2 + 0.05*B + 1*(G==2)+ 0.5*(G==3) + .8*E))
#Set the DAG
D <- set.DAG(D)
plotDAG(D, xjitter = 0.3, yjitter = 0.2, edge_attrs =
list(width = 0.5, arrow.width = 0.4, arrow.size = 0.5),
vertex_attrs = list(size = 12, label.cex = 1))
Setting the seed makes it so that the same numbers appear each time you run the code.
dat <- sim(D,n=10000, rndseed = 4) #simulate a dataset based on the DAG
Take a peak at the simulated dataset.
head(dat)
## ID G A E B O
## 1 1 2 7.344377 1 0 1
## 2 2 1 5.916587 0 1 0
## 3 3 2 3.456017 1 0 0
## 4 4 2 5.155069 1 1 0
## 5 5 3 4.980727 1 1 0
## 6 6 2 3.142707 0 0 0
Look at the exposure and outcome. For those new to R, the syntax within each table() function call is dataframe$variable.
table(dat$E) #about 48% exposure prevalence
##
## 0 1
## 5220 4780
table(dat$O) #about 30% outcome prevalence
##
## 0 1
## 7016 2984
Stratify the data by E and calculate the risk in each group
#Crude risk difference
crude = dat %>%
group_by(E) %>%
summarise(O_PERC = sum(O) / n()) #this works if the outcome is coded as 1, 0.
crude #type the name of the object to print it below
## # A tibble: 2 x 2
## E O_PERC
## <int> <dbl>
## 1 0 0.201
## 2 1 0.404
Calculate the risk difference by subtracting the two E strata.
(RD_crude = filter(crude, E==1) - filter(crude, E==0)) #0.2097961
## E O_PERC
## 1 1 0.2030523
Stratify the risk-difference estimates by levels of G and B.
According to the DAG, we should control for G and B.
strat = dat %>%
group_by(E, G, B) %>%
summarise(O_PERC = sum(O) / n())
(RD_stratified = filter(strat, E==1) - filter(strat, E==0))
## E G B O_PERC
## 1 1 0 0 0.05214764
## 2 1 0 0 0.18321765
## 3 1 0 0 0.16486579
## 4 1 0 0 0.16598404
## 5 1 0 0 0.15718571
## 6 1 0 0 0.11205942
Is confounding present? Yes. The values in the O_PERC represent risk differences.
Each is lower than the crude risk difference. The DAG also tells us there is confounding (if all effects of E were blocked, E and D would still be associated).
Recall the steps for standardization: RDw = sum(weight_i * RD_i)/(sum(weight_i)
Numerator in words: In each strata of the confounder(s), multiply the stratum-specific risk-difference by the size of the stratum. Sum up these products.
Denominator in words: Sum up the weights.
Calculate the weights to be used in standardization. Group the dataset by the strata of the confounders and count the number of observations in those strata.
stratum_size = dat %>%
group_by(G, B) %>%
summarise(weight = n()) %>%
ungroup() %>% #ungroup so I can assign a row number
mutate(rownum = row_number()) #for linking to the stratified estimates.
stratum_size #type it again so it prints
## # A tibble: 6 x 4
## G B weight rownum
## <int> <int> <int> <int>
## 1 1 0 1274 1
## 2 1 1 1230 2
## 3 2 0 2485 3
## 4 2 1 2525 4
## 5 3 0 1264 5
## 6 3 1 1222 6
Join the stratified risk-difference estimates to the stratum-specific weights.
#merging by row number isn't very elegant, but it works.
RD_stratified_wrangle = RD_stratified %>%
mutate(rownum = row_number()) %>%
#drop E, G, and B because they're not accurate in the subtracted dataset anyway.
mutate(E = NULL, G = NULL, B = NULL) %>%
#merge in the group size by row number
left_join(stratum_size, by = "rownum") %>%
#multiply the stratum-specific risks by the stratum weight
mutate(weighted_rd_by_stratum = O_PERC*weight)
RD_stratified_wrangle
## O_PERC rownum G B weight weighted_rd_by_stratum
## 1 0.05214764 1 1 0 1274 66.4361
## 2 0.18321765 2 1 1 1230 225.3577
## 3 0.16486579 3 2 0 2485 409.6915
## 4 0.16598404 4 2 1 2525 419.1097
## 5 0.15718571 5 3 0 1264 198.6827
## 6 0.11205942 6 3 1 1222 136.9366
Sum up the weighted risk differences and divide by the sum of the weights.
sum(RD_stratified_wrangle$weighted_rd_by_stratum)/sum(RD_stratified_wrangle$weight)
## [1] 0.1456214
Standardized Risk Difference = 0.1456214
Recall crude Risk Difference = 0.2030523
Try correcting confounding using IPTW via propensity scores.
Propensity score = P(E|confounders)
Find probability of exposure within levels of G and B.
ps = dat %>%
group_by(G, B) %>%
summarise(ps = sum(E) / n())
ps
## # A tibble: 6 x 3
## # Groups: G [?]
## G B ps
## <int> <int> <dbl>
## 1 1 0 0.175
## 2 1 1 0.199
## 3 2 0 0.605
## 4 2 1 0.632
## 5 3 0 0.485
## 6 3 1 0.490
As expected, probability of exposure differs across levels of G and B.
dat_ps = dat %>%
#this merges the propensity score into the bigger dataset by levels of the confounder
left_join(ps, by = c("G", "B")) %>%
#if exposed, assign the propensity score. if unexposed assign 1-the propensity score.
mutate(ps_received = case_when(
E==1 ~ ps,
E==0 ~ 1-ps)) %>%
#and the weights (for IPTW) are 1 over the ps_received
mutate(IPTW = 1/ps_received)
summary(dat_ps$ps) #check the distribution of the propensity score
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1750 0.1992 0.6048 0.4780 0.6325 0.6325
summary(dat_ps$IPTW) #check the distribution of the weights
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.212 1.581 1.653 2.000 2.062 5.713
head(dat_ps) #look at the first few rows
## ID G A E B O ps ps_received IPTW
## 1 1 2 7.344377 1 0 1 0.6048290 0.6048290 1.653360
## 2 2 1 5.916587 0 1 0 0.1991870 0.8008130 1.248731
## 3 3 2 3.456017 1 0 0 0.6048290 0.6048290 1.653360
## 4 4 2 5.155069 1 1 0 0.6324752 0.6324752 1.581090
## 5 5 3 4.980727 1 1 0 0.4901800 0.4901800 2.040067
## 6 6 2 3.142707 0 0 0 0.6048290 0.3951710 2.530550
ggplot(dat_ps, aes(ps)) + geom_histogram() + facet_wrap(~E)
What is positivity?
Positivity holds if Pr[A=a|L=l]>0 for all l with Pr[L=l] is not equal to 0. The standardized risk can only be computed, if for each level of the covariate, there is at least one subject who was exposed and at least one subject who was unexposed.
Reference: Hernan 2006 Estimating causal effects from epidemiologic data
What does the balance look like?
The groups appear somewhat inbalanced, but positivity holds. The exposed are more likely to have higher propensity scores, and vice versa, but there is at least one person in each stratum for both exposed and unexposed.
IPTW_num= dat_ps %>% #numerator of the risk calculation
filter(O==1) %>% #for numerator, limit to outcome=1.
group_by(E) %>% #stratify by exposure status
summarise(num = sum(IPTW*O)) #multiply the weights by the outcome and take sum
IPTW_num
## # A tibble: 2 x 2
## E num
## <int> <dbl>
## 1 0 2251
## 2 1 3707
IPTW_denom = dat_ps %>% #denominator of the risk calculation
group_by(E) %>%
summarise(denom = sum(IPTW)) #sum the weights
IPTW_denom
## # A tibble: 2 x 2
## E denom
## <int> <dbl>
## 1 0 10000
## 2 1 10000
#link the numerator and denominator together and divide to get the weighted risks.
IPTW = IPTW_denom %>%
left_join(IPTW_num, by = "E") %>%
mutate(R_IPTW = num/denom)
IPTW
## # A tibble: 2 x 4
## E denom num R_IPTW
## <int> <dbl> <dbl> <dbl>
## 1 0 10000 2251 0.225
## 2 1 10000 3707 0.371
Drumroll… is this equal to the standardized risk difference?
(RD_IPTW = filter(IPTW, E==1) - filter(IPTW, E==0))
## E denom num R_IPTW
## 1 1 0 1456.214 0.1456214
#repeating code for standardized risk difference
sum(RD_stratified_wrangle$weighted_rd_by_stratum)/sum(RD_stratified_wrangle$weight)
## [1] 0.1456214
The risk difference corrected via IPTW is equivalent to that obtained from standardization. This is expected when the total population is used as the standard in standardization, as noted in the Hernan Robins Book Technical Point 2.3, Equivalence of IP weighting and standardization.