EPI 738 TA Session 2

The purpose of this demo is to show how IPTW can be used to correct for confounding. The dataset is simulated from a structural causal model consistent with a DAG.

Outline

  1. Simulate a dataset based on causal associations depicted in a DAG.
  2. Check for confounding.
  3. Control confounding using standardization (as a check).
  4. Control confounding using IPTW.

Packages needed
simcausal - Create a DAG by writing its structural causal model and simulate a dataset.
tidyverse - Always useful. To become familiar with this package, I suggest perusing the free online book by Garrett Grolemund and Hadley Wickham, R for Data Science.
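
If either package is not yet installed, something like the following should work (assuming both are available from your package repository):

#install the packages if they are not already on your machine
install.packages(c("tidyverse", "simcausal"))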

#load the packages.
library(tidyverse)
library(simcausal)

1. Use simcausal to simulate a dataset with confounding

1.1. Define the distributions of the variables in terms of their parents

rcat.b1 calls the categorical distribution
rnorm calls the normal distribution
rbern calls the Bernoulli distribution, i.e. a binary 0/1 variable

plogis is the inverse-logit function or expit: exp(x)/(1+exp(x))
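
A quick sanity check that plogis() matches this formula (the input value 1.3 is just an arbitrary example):

plogis(1.3)               #0.7858
exp(1.3) / (1 + exp(1.3)) #same value, computed directly from the expit formula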

In the code below, we first define the variables that don’t have parents (exogenous variables). Next, we define each remaining variable as a function of its parents, i.e. its direct causes. By doing this, we specify a structural causal model, which can be used to draw a DAG.

D <- DAG.empty()
D <- D +
  node("G", distr = "rcat.b1", probs = c(.25, .5, .25)) + #a confounder
  node("A", distr = "rnorm", mean = 5, sd=2)  + #another confounder

  #define the exposure, a child of A and G
  node("E", distr="rbern", prob = plogis(-2 + 2*(G==2)+ 1.5*(G==3) + .1*A)) + 

  #another var on confounding path
  node("B", distr = "rbern", prob = plogis(-2+.4*A)) + 

  #define the outcome, a child of B, G, and E
  node("O", distr="rbern", prob = plogis(-2 +  0.05*B + 1*(G==2)+ 0.5*(G==3) + .8*E)) 

#Set the DAG
D <- set.DAG(D)

1.2. Plot the DAG

plotDAG(D, xjitter = 0.3, yjitter = 0.2, edge_attrs = 
          list(width = 0.5, arrow.width = 0.4, arrow.size = 0.5),
        vertex_attrs = list(size = 12, label.cex = 1))

1.3. Simulate a dataset based on the DAG

Setting the seed ensures that the same random numbers are drawn each time you run the code.

dat <- sim(D,n=10000, rndseed = 4) #simulate a dataset based on the DAG

Take a peek at the simulated dataset.

head(dat)
##   ID G        A E B O
## 1  1 2 7.344377 1 0 1
## 2  2 1 5.916587 0 1 0
## 3  3 2 3.456017 1 0 0
## 4  4 2 5.155069 1 1 0
## 5  5 3 4.980727 1 1 0
## 6  6 2 3.142707 0 0 0

Look at the exposure and outcome. For those new to R, the syntax within each table() function call is dataframe$variable.

table(dat$E) #about 48% exposure prevalence
## 
##    0    1 
## 5220 4780
table(dat$O) #about 30% outcome prevalence
## 
##    0    1 
## 7016 2984

2. Assess confounding

Stratify the data by E and calculate the risk in each group

#Crude risk difference
crude  = dat  %>% 
  group_by(E) %>% 
  summarise(O_PERC = sum(O) / n()) #this works if the outcome is coded as 1, 0.

crude #type the name of the object to print it below
## # A tibble: 2 x 2
##       E O_PERC
##   <int>  <dbl>
## 1     0  0.201
## 2     1  0.404

Calculate the risk difference by subtracting the two E strata.

(RD_crude = filter(crude, E==1) - filter(crude, E==0)) #0.2030523
##   E    O_PERC
## 1 1 0.2030523

According to the DAG, we should control for G and B, so stratify the risk-difference estimates by levels of G and B.

strat  = dat  %>% 
  group_by(E, G, B) %>% 
  summarise(O_PERC = sum(O) / n()) 

(RD_stratified = filter(strat, E==1) - filter(strat, E==0))
##   E G B     O_PERC
## 1 1 0 0 0.05214764
## 2 1 0 0 0.18321765
## 3 1 0 0 0.16486579
## 4 1 0 0 0.16598404
## 5 1 0 0 0.15718571
## 6 1 0 0 0.11205942

Is confounding present? Yes. The values in the O_PERC column are the stratum-specific risk differences, and each is lower than the crude risk difference. The DAG also tells us there is confounding: even if the arrow from E to O were removed, E and O would still be associated through the open backdoor paths through G and through A and B.

3. Control confounding using standardization (as a check before IPTW)

Recall the formula for standardization: RD_w = sum(weight_i * RD_i) / sum(weight_i)
Numerator in words: in each stratum of the confounder(s), multiply the stratum-specific risk difference by the size of the stratum, then sum up these products.
Denominator in words: sum up the weights (here, the stratum sizes).

Calculate the weights to be used in standardization. Group the dataset by the strata of the confounders and count the number of observations in those strata.

stratum_size = dat %>%
  group_by(G, B) %>%
  summarise(weight = n()) %>%
  ungroup() %>% #ungroup so I can assign a row number
  mutate(rownum = row_number()) #for linking to the stratified estimates.

stratum_size #type it again so it prints
## # A tibble: 6 x 4
##       G     B weight rownum
##   <int> <int>  <int>  <int>
## 1     1     0   1274      1
## 2     1     1   1230      2
## 3     2     0   2485      3
## 4     2     1   2525      4
## 5     3     0   1264      5
## 6     3     1   1222      6

Join the stratified risk-difference estimates to the stratum-specific weights.

#merging by row number isn't very elegant, but it works.
RD_stratified_wrangle = RD_stratified %>%
  mutate(rownum = row_number()) %>%
  #drop E, G, and B because they're not accurate in the subtracted dataset anyway.
  mutate(E = NULL, G = NULL, B = NULL) %>%
  #merge in the group size by row number
  left_join(stratum_size, by = "rownum") %>%
  
  #multiply the stratum-specific risk differences by the stratum weight
  mutate(weighted_rd_by_stratum = O_PERC*weight) 

RD_stratified_wrangle 
##       O_PERC rownum G B weight weighted_rd_by_stratum
## 1 0.05214764      1 1 0   1274                66.4361
## 2 0.18321765      2 1 1   1230               225.3577
## 3 0.16486579      3 2 0   2485               409.6915
## 4 0.16598404      4 2 1   2525               419.1097
## 5 0.15718571      5 3 0   1264               198.6827
## 6 0.11205942      6 3 1   1222               136.9366

Sum up the weighted risk differences and divide by the sum of the weights.

sum(RD_stratified_wrangle$weighted_rd_by_stratum)/sum(RD_stratified_wrangle$weight)
## [1] 0.1456214
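
The same quantity can be computed in one step with base R's weighted.mean(), which divides the weighted sum by the sum of the weights:

#weighted mean of the stratum-specific risk differences, weighted by stratum size
weighted.mean(RD_stratified_wrangle$O_PERC, RD_stratified_wrangle$weight) #0.1456214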

Standardized Risk Difference = 0.1456214

Recall crude Risk Difference = 0.2030523

4. Control confounding using IPTW

Try correcting confounding using IPTW via propensity scores.

Propensity score = P(E = 1 | confounders)

4.1 Calculate propensity scores

Find probability of exposure within levels of G and B.

ps  = dat  %>% 
  group_by(G, B) %>% 
  summarise(ps = sum(E) / n())

ps 
## # A tibble: 6 x 3
## # Groups:   G [?]
##       G     B    ps
##   <int> <int> <dbl>
## 1     1     0 0.175
## 2     1     1 0.199
## 3     2     0 0.605
## 4     2     1 0.632
## 5     3     0 0.485
## 6     3     1 0.490

As expected, probability of exposure differs across levels of G and B.
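
4.2 Join the propensity scores to the data and compute the weights

The weighted analyses below use a data frame dat_ps that holds each subject's propensity score and IPTW. A minimal sketch of that step, assuming unstabilized weights (1/ps for the exposed and 1/(1 - ps) for the unexposed):

#join the stratum-specific propensity scores back onto the individual-level data
dat_ps = dat %>%
  left_join(ps, by = c("G", "B")) %>%
  #unstabilized IPTW: 1/P(E=1|G,B) for the exposed, 1/P(E=0|G,B) for the unexposed
  mutate(IPTW = ifelse(E == 1, 1/ps, 1/(1 - ps)))

Summing these weights within each exposure group returns the total sample size (10,000), which matches the denominators printed in section 4.4. With continuous confounders, the propensity scores would typically come from a logistic regression rather than grouped proportions, e.g. glm(E ~ factor(G) + B, family = binomial, data = dat).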

4.3 Use propensity scores to assess positivity

ggplot(dat_ps, aes(ps)) +  geom_histogram() + facet_wrap(~E)

What is positivity?
Positivity holds if Pr[E = e | L = l] > 0 for every exposure level e and every level l of the confounders L with Pr[L = l] > 0. In other words, the standardized risk can only be computed if, within each level of the covariates, there is at least one exposed and at least one unexposed subject.
Reference: Hernán 2006, Estimating causal effects from epidemiologic data

What does the balance look like?
The groups appear somewhat imbalanced, but positivity holds: the exposed tend to have higher propensity scores and the unexposed lower ones, yet there is at least one exposed and at least one unexposed person in every stratum.
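
A quick numerical check of the same point is to cross-tabulate exposure within each confounder stratum; positivity requires a nonzero count in every cell:

#count exposed and unexposed subjects within each G x B stratum
with(dat, table(G, B, E))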

4.4 Calculate the risk difference weighted by IPTW

IPTW_num= dat_ps %>%  #numerator of the risk calculation
  filter(O==1) %>% #for numerator, limit to outcome=1.
  group_by(E) %>% #stratify by exposure status
  summarise(num = sum(IPTW*O)) #multiply the weights by the outcome and take sum
IPTW_num
## # A tibble: 2 x 2
##       E   num
##   <int> <dbl>
## 1     0  2251
## 2     1  3707
IPTW_denom = dat_ps %>%  #denominator of the risk calculation
  group_by(E) %>% 
  summarise(denom = sum(IPTW)) #sum the weights
IPTW_denom
## # A tibble: 2 x 2
##       E denom
##   <int> <dbl>
## 1     0 10000
## 2     1 10000
#link the numerator and denominator together and divide to get the weighted risks.
IPTW = IPTW_denom %>%
  left_join(IPTW_num, by = "E") %>%
  mutate(R_IPTW = num/denom)
IPTW
## # A tibble: 2 x 4
##       E denom   num R_IPTW
##   <int> <dbl> <dbl>  <dbl>
## 1     0 10000  2251  0.225
## 2     1 10000  3707  0.371

Drumroll… is this equal to the standardized risk difference?

(RD_IPTW = filter(IPTW, E==1) -  filter(IPTW, E==0))
##   E denom      num    R_IPTW
## 1 1     0 1456.214 0.1456214
#repeating code for standardized risk difference
sum(RD_stratified_wrangle$weighted_rd_by_stratum)/sum(RD_stratified_wrangle$weight)
## [1] 0.1456214

The risk difference corrected via IPTW is equivalent to the one obtained from standardization. This is expected when the total population is used as the standard, as noted in Technical Point 2.3 (Equivalence of IP weighting and standardization) of the Hernán and Robins book.