Confounding bias is one of the most ubiquitous challenges in estimating effects from observational (real-world data) studies. Confounding happens when the relationship between a treatment (or other exposure) and an outcome is distorted by the presence of other variables. These confounding variables can create misleading associations, or biases, making it difficult to infer causal relationships accurately. These confounding variables can create misleading associations, making it difficult to infer causal relationships accurately. Fortunately for us, study design and statistical methods have been evolving to help us mitigate this bias and recover the true effect estimate (when conditions are favorable: we’ll touch on these, but point to better sources for a deeper dive).

Confounding bias¹ is one of the most ubiquitous challenges in estimating effects from observational (real-world data) studies. Fortunately, study design and statistical methods have been evolving to help us mitigate biases and recover the true effect estimate. In this article, we will introduce the concepts of confounding and bias, external control arm study design, target trial emulation. At the end of the example, you should have a general understanding of the benefits and shortcomings of traditional methods for estimating treatment effect, including propensity score matching and weighting, and g-computation, as well as modern methods, including targeted maximum likelihood estimating (TMLE). For the worked example, we will consider a scenario where the protective effect of a treatment against heart attacks is obscured by confounding. This is a common and reproducible type of problem in public health.

Directed Acyclic Graphs

Causal graphs, specifically Directed Acyclic Graphs (DAGs), are powerful tools used in epidemiology (and other fields) to visualize and understand the relationships between variables in a study. A DAG is a graphical representation where nodes represent variables, and directed edges (unidirectional arrows) indicate causal relationships between these variables. The “acyclic” aspect means that there are no loops or cycles, ensuring that the graph represents a unidirectional flow of causality.

DAGs are particularly useful for identifying confounding because they clearly depict the pathways through which variables are connected. Confounders are variables that influence both the treatment and the outcome, potentially creating a spurious association. By mapping out all relevant variables and their relationships, a DAG helps researchers see which variables need to be controlled for in order to obtain an unbiased estimate of the treatment effect.

Figure 1: Causal diagram - directed acyclic graph (DAG) where nodes are variables and arrows indicated directional causal relationships. Example of simple DAG where treatment affects (causes or influences) cardiac arrest.

For example, let us consider a pseudo-real-life example: treatment A (taking the medication Axyza) is believed to reduce cholesterol. In our simpolified DAG, we represent this with two nodes: “treatment with Axyza” and “cholesterol,” with an arrow pointing from “treatment with Axyza” to “cholesterol,” indicating the causal relationship.

Now, let’s think a little more about the relationship between this treatment and outcome - let’s say we know patients with previously high cholesterol are much more likely to be treated and that this prior level strongly influences subsequent levels. Additionally, age and gender are suspected to be be related to both treatment and outcome in our example. We say that these additional variables (confounders) influence both likelihood of receiving treatment and risk of having a heart attack.

We add those nodes and corresponding directed edges into our diagram. Our DAG can now guide us in adjusting for confounders through propensity scores, TMLE, or other statistical methods, so we can better isolate the true effect of the treatment.

Our plausible DAG:

Figure 2: DAG of working example where treatment affects cardiac events, but with 4 confounding variables

Real World Data Study Design

External Control Arms

text tk

Observational Data and the `target trial`

Unlike randomized controlled trials (RCTs), where random assignment of treatment ensures that treatment groups are comparable, observational studies must rely on study design and statistical techniques to account for differences between treated and untreated groups.

One approach to estimating causal effects in observational studies is to emulate the conditions of a target randomized trial as closely as possible. This process, often referred to as “target trial emulation,” involves designing the observational study to mimic the hypothetical RCT that would answer the causal question of interest.

Effects Estimates

Crude (Naive) Estimates

This is simple the apparent difference in means between treatment groups, ignoring all else. You can also think of this as the conditional mean difference: \[E[cholesterol|statin=1]-E[cholesterol|statin=0]\]

Propensity Scores

Propensity scores represent the probability that someone (e.g., a patient) receives a treatment given a set of observed covariates. The concept was introduced by Rosenbaum and Rubin in 1983, and provides an elegant and intuitive way to balance the distribution of covariates between treated and control groups. Historically, propensity scores have been a powerful tool in causal inference, used to control for confounding variables in observational studies.

Calculating a Propensity Score

A propensity score is the conditional probability of receiving a treatment given a set of observed covariates. Mathematically, it is defined as:

\[ e(X) = P(T = 1 | X) \]

where \(T\) is the treatment indicator (1 for treated, 0 for control), and \(X\) represents the observed covariates.

Propensity Scores for Matching

Propensity score matching involves pairing treated and control units with similar propensity scores. This technique aims to create a pseudo-randomized experiment by ensuring that the distribution of covariates is similar between the matched groups. Matching can significantly reduce selection bias, allowing for more accurate estimation of treatment effects. One related concept is that of cohort imbalance, which we’re hoping to remedy by this method.

Propensity Scores for Weighting

Propensity score weighting involves assigning weights to each person based on their propensity score to create a synthetic (standardized) sample in which the distribution of covariates is balanced across treatment groups. The two common approaches are:

Inverse Probability of Treatment Weighting (IPTW): Each treated unit is weighted by the inverse of their propensity score, and each control unit is weighted by the inverse of one minus their propensity score. This method can be expressed as:

\[ w_i = \frac{T_i}{e(X_i)} + \frac{1 - T_i}{1 - e(X_i)} \]
Stabilized Weights: To prevent extreme weights that can lead to high variance, stabilized weights are used:

\[ w_i = \frac{T_i}{e(X_i)} \times \frac{P(T = 1)}{P(T = 1 | X)} + \frac{1 - T_i}{1 - e(X_i)} \times \frac{P(T = 0)}{P(T = 0 | X)} \]

Limitations

Note: There is considerable debate as to whether to use propensity scores for matching (See King et. al. 2019) and we also tend to lean away from recommending this method, although it is very intuitive and tractable. Also, note that through this method we are estimating a variant of the average treatment effect: the average treatment effect among the treated (ATT).

G-Computation

G-computation, also known as the G-formula or G-estimation, is a very promising method also used to estimate the (unbiased) causal effect of a treatment or intervention.

What is G-Computation?

G-computation involves using a statistical model to predict outcomes under different treatment scenarios. The key idea is to estimate what each individual’s outcome would be under each possible treatment condition (even though we only observe them under one condition in reality). So we’re using the observed data plus a statistical model to estimate counterfactual outcomes, and then using these counterfactual outcomes (providing a way to understand the causal effect of treatments)themselves) to estimate treatmnet effects.

How G-Computation Works

Modeling the Outcome: First, a model is fitted to predict the outcome \(Y\) based on the treatment \(T\) and a set of covariates \(X\). This model can be a regression model or it can be any machine learning model that can generate predicted outcomes!

\[ \widehat{Y} = f(T, X) \]
Predicting Counterfactuals: Using the fitted model, predict the outcomes for each individual under different treatment scenarios. Specifically, predict the outcome for each individual if they had received the treatment and if they had not received the treatment.

\[ \widehat{Y}(T=1, X=x) \] and \[ \widehat{Y}(T=0, X=x) \]
Averaging the Predictions: Calculate the average of the predicted outcomes for the treated and control scenarios across all individuals in the sample.

\[ \text{ATE} = \frac{1}{N} \sum_{i=1}^{N} \left[ \widehat{Y_i}(T=1) - \widehat{Y_i}(T=0) \right] \]

Estimating the Unseen Version of Each Person

The essence of g-computation lies in its ability to estimate the “unseen” or counterfactual version of each person. For each individual in the study, g-computation estimates what their outcome would have been if they had received a different treatment than the one they actually received. This is done using the statistical model trained on the observed data:

Counterfactual for Treated Individuals: For individuals who received the treatment (\(T=1\)), g-computation predicts what their outcome would have been if they had not received the treatment (\(T=0\)).
Counterfactual for Control Individuals: Conversely, for individuals who did not receive the treatment (\(T=0\)), it predicts what their outcome would have been if they had received the treatment (\(T=1\)).

By doing this for all individuals, g-computation fills in the missing potential outcomes under both treatment conditions, allowing for the estimation of causal effects. This approach also accounts for confounding factors and provides a very good estimate of the average treatment effect (ATE).

Modern/Emerging Methods: TMLE

What is TMLE?

Targeted Maximum Likelihood Estimation (tmle) and the broader statistical approach of Targeted Learning is perhaps the most promising method to be applied to real world data in recent times. It tends to be the best statistical method for a given study design for washing out bias by using a substitution estimator approach (starting with a g-computation estimator and deriving a fluctuation parameter from the propensity score and something called the clever covariate)

Figure 3: TMLE takes the same form as g-computation, but the red stars indicate an updated Q(1) and Q(0).

“A second chance to get it right” using the components we’ve introduced!

See van der Laan and Rose (2011) for an excellent full course on Targeted Learning.

How TMLE Works

Initial Estimation: Start with an initial estimate of the outcome model (similar to g-computation) and the propensity score model (IPTW).
Clever Covariate: Construct a “clever covariate,” which is derived from the propensity score. The clever covariate helps target the estimator towards the parameter of interest, making the estimate more efficient.

\[ H(T, X) = \frac{T}{e(X)} - \frac{1 - T}{1 - e(X)} \]
Fluctuation Parameter: Incorporate a fluctuation parameter to adjust the initial estimate. This is done through a process called targeting, where the initial model is updated to reduce bias and improve efficiency.
Targeted Estimate: Combine the updated outcome model and the clever covariate to obtain the targeted estimate of the treatment effect. This final estimate is designed to be both unbiased and efficient.

Advantages of TMLE

Double Robustness: TMLE provides valid estimates even if either the outcome model or the propensity score model is misspecified, but not both.
Efficiency: By incorporating the clever covariate and the fluctuation parameter, TMLE achieves higher efficiency compared to traditional methods.
Robustness to Model Misspecification: TMLE’s targeted update step helps mitigate bias due to model misspecification, leading to more reliable causal inference.

Worked Example

In this demonstration, we’ll:

Create (plausible) synthetic data so that we can know the ground truth effect we’re trying to estimate,
Estimate the effect without any attempt to remedy confounding (to start with our baseline biased estimate),
Apply traditional (propensity scores and g-computation) and modern (TMLE and DML) statistical methods to recover the unbiased effect and compare these.

Let’s dive in!

Libraries for today’s R session

All the following libraries are available via CRAN and can be installed either through the tools menu or using install.packages(" ") command for each.

library(simcausal)
library(ranger)
library(tidyverse)
library(ggdag)
library(MatchIt)
library(WeightIt)
library(survey)
library(tableone)
library(cobalt)   
library(PSweight)  
library(tmle)
library(DoubleML)
library(mlr3)
library(mlr3learners)
set.seed(12345)

Synthetic Data and Ground Truth

When we generate synthetic data, we have complete control over the data-generating process. This means we know the true causal effect of the treatment, allowing us to directly assess the accuracy and bias of our estimates.

By applying causal inference methods to synthetic data, we can evaluate their performance in recovering the known causal effect (as well as identifying and quantifying bias (we’ll define \(bias = true-estimated\))).

DAG-based data generation process

We’re going to use a package called simcausal to create a hypothetical DAG and simulate data (directly) from that. A more traditional approach is to build up the nodes sequentially (see box 1 in https://onlinelibrary.wiley.com/doi/epdf/10.1002/sim.7628 for example)

D_AW <- DAG.empty() + 
  node("age", distr = "rnorm", mean = 50, sd = 10) +
        node("gender", distr = "rbern", prob = 0.5) +
       node("priorchol", distr = "rnorm", mean = 200 + 0.5 * age, sd = 30) +
        node("statin", distr = "rbern", prob = plogis(-2 + 0.02 * priorchol + 0.01 * age - 0.5 * gender)) +
        node("cholesterol", distr = "rnorm", mean = 180 - 20 * statin + 0.8 * priorchol + 0.5 * age + 5 * gender, sd = 15)

D_AW_set <- set.DAG(D_AW)

plotDAG(D_AW_set, xjitter = 0.3, yjitter = 0.3)

Generate synthetic data

Now let’s generate some data (let’s start with n = 5000)

ObsData <- sim(DAG = D_AW_set, n = 5000, rndseed = 12345)
head(ObsData)

##   ID      age gender priorchol statin cholesterol
## 1  1 55.85529      0  164.8389      1    315.7124
## 2  2 57.09466      1  253.0668      1    399.0876
## 3  3 48.90697      1  254.1566      1    376.5909
## 4  4 45.46503      1  207.4650      1    361.7573
## 5  5 56.05887      1  235.9930      1    368.5381
## 6  6 31.82044      0  188.3088      1    316.1860

A1 <- node("statin", distr = "rbern", prob = 1)
D_AW_set <- D_AW_set + action("A1", nodes = A1)
A0 <- node("statin", distr = "rbern", prob = 0)
D_AW_set <- D_AW_set + action("A0", nodes = A0)

Xdat1 <- sim(DAG = D_AW_set, actions = c("A1", "A0"), n = 5000, rndseed = 12345)

Y1 <- Xdat1[["A1"]][,"cholesterol" ]
Y0 <- Xdat1[["A0"]][,"cholesterol" ]

(True_ATE <- mean(Y1 - Y0))

## [1] -20

Notice we have created the full dataset! We have the usually unavalable Y(1) and Y(0) as well as observed Y.

Bias

Now let’s take a look at the apparent (naive) estimate of average treatment effect: \(\widehat{ATE}\) = \(\widehat{\psi}\) = E[Y|A=1] - E[Y|A=0]).

We’ll compare this to the true ATE: ψ = E[Y(1)] - E[Y(0)]. (note we can only do this latter step because we have generated Y(1) and Y(0))

[Think “conditional A” vs. “do A.”]

treat <- ObsData %>% filter(statin == 1) %>% select(cholesterol) 
controls <- ObsData %>% filter(statin == 0) %>% select(cholesterol) 

(naive_est <- mean(treat$cholesterol) - mean(controls$cholesterol))

## [1] -5.117758

(True_ATE <- mean(Y1 - Y0))

## [1] -20

# Absolute bias
(bias <- naive_est - True_ATE)

## [1] 14.88224

# Percent bias
(naive_est-True_ATE)/True_ATE*100

## [1] -74.41121

We define percent bias as:

\[\% bias =\frac{\widehat{\psi} - \psi }{\psi} * 100\%\]

So our bias in the above example is -74.4112%.

An estimator is said to be unbiased if its bias is zero.

In our simulation experiments we’ll define the bias of an estimator as expected (or mean) difference between estimator and true value.

Traditional methods

All right. Let’s get down to business removing this bias and getting the estimate we would have gotten from a (target) randomized trial. We’ll apply 4 ( + naive) methods to remove bias:

naive estimate with no adjustment
propensity score (matching and weighting)
g-computation
targeted maximum likelihood (tmle)
double machine learning (dml)

No methods at all (aka crude (naive) estimate)

It’s good to plant our flag and see how bad it is if we don’t do anything about bias (you’ll see we just recover the initial biased estimate, but it should help level-set).

naive <- glm(cholesterol ~ statin, data = ObsData)
summary(naive)

## 
## Call:
## glm(formula = cholesterol ~ statin, data = ObsData)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  374.013      1.573 237.839  < 2e-16 ***
## statin        -5.118      1.632  -3.137  0.00172 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 877.8762)
## 
##     Null deviance: 4396263  on 4999  degrees of freedom
## Residual deviance: 4387625  on 4998  degrees of freedom
## AIC: 48081
## 
## Number of Fisher Scoring iterations: 2

As we can see, we get the biased estimate by regression (statin coefficient = -5.1178)

So our initial percent bias from simple (unadjusted) regression is -74.41%.

Using Propensity Scores for Matching

We’ll use the tableone package:

Remember, we identified age, gender and prior cholesterol as potential confounders.

xvars<-c("age","gender","priorchol")

table1<- CreateTableOne(vars=xvars, strata="statin", data=ObsData )
## include standardized mean difference (SMD)
print(table1, smd=TRUE)

##                        Stratified by statin
##                         0              1              p      test SMD   
##   n                        355           4645                           
##   age (mean (SD))        47.73 (9.84)   50.18 (9.87)  <0.001       0.248
##   gender (mean (SD))      0.61 (0.49)    0.49 (0.50)  <0.001       0.243
##   priorchol (mean (SD)) 209.52 (29.85) 226.80 (29.95) <0.001       0.578

Wow – dramatic difference in gender and prior-cholesterol. A typical metric is the standardized mean difference (SMD) and a heuristic for balance is a SMD <0.1.

propensity_model <- matchit(statin ~ age + gender + priorchol, 
                    data = ObsData, 
                    method = "nearest", # computationally efficient
                    ratio = 1, 
                    caliper = 0.05)
bal.tab(propensity_model, m.threshold=0.1)

## Balance Measures
##               Type Diff.Adj    M.Threshold
## distance  Distance   0.0491 Balanced, <0.1
## age        Contin.  -0.0181 Balanced, <0.1
## gender      Binary  -0.0028 Balanced, <0.1
## priorchol  Contin.   0.0635 Balanced, <0.1
## 
## Balance tally for mean differences
##                    count
## Balanced, <0.1         4
## Not Balanced, >0.1     0
## 
## Variable with the greatest mean difference
##   Variable Diff.Adj    M.Threshold
##  priorchol   0.0635 Balanced, <0.1
## 
## Sample sizes
##           Control Treated
## All           355    4645
## Matched       352     352
## Unmatched       3    4293

Examine matching triage

print(plot(propensity_model, type = 'jitter'))

## To identify the units, use first mouse button; to stop, use second.
## A matchit object
##  - method: 1:1 nearest neighbor matching without replacement
##  - distance: Propensity score [caliper]
##              - estimated with logistic regression
##  - caliper: <distance> (0.002)
##  - number of obs.: 5000 (original), 704 (matched)
##  - target estimand: ATT
##  - covariates: age, gender, priorchol

We notice we’re now pretty balanced (but lost some friends along the way - for this analysis, we’d be dropping all the unmatched people.)

Extract matches and analyze:

md <- match.data(propensity_model)

adjusted.PSmatch <- glm(cholesterol ~ statin,
                        data = md, 
                        weights = weights)

summary(adjusted.PSmatch)

## 
## Call:
## glm(formula = cholesterol ~ statin, data = md, weights = weights)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  374.496      1.580 237.063  < 2e-16 ***
## statin       -18.607      2.234  -8.329 4.27e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 878.4353)
## 
##     Null deviance: 677599  on 703  degrees of freedom
## Residual deviance: 616662  on 702  degrees of freedom
## AIC: 6773.7
## 
## Number of Fisher Scoring iterations: 2

You’ll notice now we’re closer to the true value (-18.6074 vs true value = -20).

Using Propensity Scores for Weighting

Let’s use the WeightIt package to recalculate weights (both non-stabilized and stabilized).

# Calculating non-stabilized weights
w.out.ns <- weightit(cholesterol ~ age + gender + priorchol, data = ObsData, method = "ps", estimand = "ATE")
summary(w.out.ns)

##                   Summary of weights
## 
## - Weight ranges:
## 
##        Min                                    Max
## all 0.0022 |---------------------------| 120.0368
## 
## - Units with the 5 most extreme weights:
##                                               
##          869    3706     117     2270     3852
##  all 79.1375 89.2385 94.5719 102.3759 120.0368
## 
## - Weight statistics:
## 
##     Coef of Var   MAD Entropy # Zeros
## all       2.871 0.794   0.862       0
## 
## - Effective Sample Sizes:
## 
##            Total
## Unweighted  5000
## Weighted     541

# Adding weights to data
ObsData$weights_ns <- w.out.ns$weights

And again for stabilized weights:

# Calculating stabilized weights
w.out.st <- weightit(cholesterol ~ age + gender + priorchol, data = ObsData, method = "ps", estimand = "ATE", stabilize = TRUE)
summary(w.out.st)

##                   Summary of weights
## 
## - Weight ranges:
## 
##        Min                                    Max
## all 0.0022 |---------------------------| 120.0368
## 
## - Units with the 5 most extreme weights:
##                                               
##          869    3706     117     2270     3852
##  all 79.1375 89.2385 94.5719 102.3759 120.0368
## 
## - Weight statistics:
## 
##     Coef of Var   MAD Entropy # Zeros
## all       2.871 0.794   0.862       0
## 
## - Effective Sample Sizes:
## 
##            Total
## Unweighted  5000
## Weighted     541

# Adding weights to data
ObsData$weights_st <- w.out.st$weights

# Creating survey design objects
design_ns <- svydesign(ids = ~1, weights = ~weights_ns, data = ObsData)
design_st <- svydesign(ids = ~1, weights = ~weights_st, data = ObsData)

# Estimating ATE using non-stabilized weights
ate_ns <- svyglm(cholesterol ~ statin, design = design_ns)
summary(ate_ns)

## 
## Call:
## svyglm(formula = cholesterol ~ statin, design = design_ns)
## 
## Survey design:
## svydesign(ids = ~1, weights = ~weights_ns, data = ObsData)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  388.720      3.132 124.093  < 2e-16 ***
## statin       -20.979      3.344  -6.273 3.83e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 624.499)
## 
## Number of Fisher Scoring iterations: 2

# Estimating ATE using stabilized weights
ate_st <- svyglm(cholesterol ~ statin, design = design_st)
summary(ate_st)

## 
## Call:
## svyglm(formula = cholesterol ~ statin, design = design_st)
## 
## Survey design:
## svydesign(ids = ~1, weights = ~weights_st, data = ObsData)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  388.720      3.132 124.093  < 2e-16 ***
## statin       -20.979      3.344  -6.273 3.83e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 624.499)
## 
## Number of Fisher Scoring iterations: 2

nice - both of these coefficients are close to the true value.

G-Computation

We can use the SuperLearner package to calculate expected outcomes and then set treatment to 0 and 1 for each individual to generate \(\widehat{Y_i}(T=0)\) and \(\widehat{Y_i}(T=1)\) (which we’ll call Q_0 and Q_1).

# Define Super Learner libraries
sl_libs <- c("SL.glmnet", "SL.ranger", "SL.glm")
# Select relevant variables from the dataset
Obs <- ObsData %>% dplyr::select(statin, cholesterol,age, gender, priorchol)
# Extract outcome and covariates
Y <- Obs$cholesterol
W_A <- Obs %>% dplyr::select(-cholesterol) 
# Step 1: Outcome regression using Super Learner
Q <- SuperLearner(Y = Y,  
                  X = W_A,  
                  SL.library = sl_libs)


Q_A <- as.vector(predict(Q)$pred)

# Predict Q for treated (A = 1) and untreated (A = 0)
W_A1 <- W_A %>% mutate(statin = 1)  # Set A = 1
Q_1 <- as.vector(predict(Q, newdata = W_A1)$pred)

W_A0 <- W_A %>% mutate(statin = 0) # Set A = 0
Q_0 <- as.vector(predict(Q, newdata = W_A0)$pred)

# Create dataset for g-computation
dat_g.comp <- as.data.frame(cbind(Y = Obs$cholesterol, A = Obs$statin, Q_A, Q_0, Q_1))

Now the g-computation estimate of the ATE is simply the average difference of Q_1 - Q_0:

(ate_gcomp <- mean(dat_g.comp$Q_1 - dat_g.comp$Q_0))

## [1] -19.80342

Very nice!

Modern/emerging methods

TMLE

We’ll recalculate the propensity score again (this time also using an ensemble machine learning approach (SuperLearning)) and then use the g-computation we applied previously:

Step 1: Estimate the outcome:

Done. This is the g-computation step.

Step 2a: Estimate the propensity score:

# Propensity score estimation using Super Learner
A <- Obs$statin
W <- Obs %>% dplyr::select(-cholesterol, -statin)
g <- SuperLearner(Y = A,  
                  X = W,  
                  family=binomial(),  
                  SL.library=sl_libs)
g_w <- as.vector(predict(g)$pred)

Step 2b: Estimate the clever covariate

H_1 <- 1/g_w

H_0 <- -1/(1-g_w) 

# add clever covariate data to previous dataset we made
dat_tmle <- 
  dat_g.comp %>%
  bind_cols(
    H_1 = H_1,
    H_0 = H_0) %>%
  mutate(H_A = case_when(A == 1 ~ H_1,   # if A is 1 (treated), assign H_1
                         A == 0 ~ H_0))  # if A is 0 (not treated), assign H_0

Step 3: Estimate the Fluctuation Parameter

glm_fit <- glm(Y ~ -1 + offset(Q_A) + H_A, 
               data=dat_tmle)

(eps <- coef(glm_fit))

##          H_A 
## 0.0003468298

Step 4: Update the Initial Estimates of the Expected Outcome

dat_tmle<- dat_tmle %>% 
 mutate(Q_A_update = Q_A + eps*H_A,
        Q_1_update = Q_1 + eps*H_1,
        Q_0_update = Q_0 + eps*H_0)

Step 5: Compute the Statistical Estimand of Interest

(tmle_ate <- mean(dat_tmle$Q_1_update - dat_tmle$Q_0_update))

## [1] -19.79606

Chef’s Kiss

Of course, there is a one-stop-shop package to calculate TMLE

tmle_fit <- tmle(Y = Y, A = A, W = W,
                 Q.SL.library = sl_libs,
                 g.SL.library = sl_libs) 
  
tmle_fit

##  Marginal mean under treatment (EY1)
##    Parameter Estimate:  367.85
##    Estimated Variance:  0.18218
##               p-value:  <2e-16
##     95% Conf Interval:  (367.01, 368.68)
## 
##  Marginal mean under comparator (EY0)
##    Parameter Estimate:  387.65
##    Estimated Variance:  0.98535
##               p-value:  <2e-16
##     95% Conf Interval:  (385.7, 389.59)
## 
##  Additive Effect
##    Parameter Estimate:  -19.8
##    Estimated Variance:  0.90418
##               p-value:  <2e-16
##     95% Conf Interval:  (-21.664, -17.936)
## 
##  Additive Effect among the Treated
##    Parameter Estimate:  -19.794
##    Estimated Variance:  0.94503
##               p-value:  <2e-16
##     95% Conf Interval:  (-21.7, -17.889)
## 
##  Additive Effect among the Controls
##    Parameter Estimate:  -19.856
##    Estimated Variance:  0.66048
##               p-value:  <2e-16
##     95% Conf Interval:  (-21.449, -18.264)

tmle_fit$estimates$ATE$psi

## [1] -19.7999

Double machine learning

Another promising technique for controlling for confounding variables using machine learning methods is Double Machine Learning (DML).

The core idea of DML is to leverage machine learning algorithms to flexibly model and control for confounders while isolating the effect of the treatment variable on the outcome. The core steps of the DML algorithm are:

Prediction: Predict the outcome variable \(Y\) and the treatment variable \(D\) using covariates \(Z\) by estimating \(E[Y|Z]\) and \(E[D|Z]\). This can be done using machine learning methods such as Random Forest or other high-performing ML tools (we use random forests in our worked example)
Residualization (Orthogonalization): Compute the residuals \(W = Y - E[Y|Z]\) and \(V = D - E[D|Z]\). This step is similar to the Frisch-Waugh-Lovell procedure briefly outlined below.
Regression: Regress \(W\) on \(V\) to estimate the target parameter \(\psi\).

The Key Ingredients of DML are:

Neyman orthogonality
High-quality ML estimation
Sample splitting

See Chernozhukov, et. al., 2018 for the excellent, but highly-technical paper.

obj_dml_data = DoubleMLData$new(Obs, y_col = "cholesterol", d_cols = "statin")


lgr::get_logger("mlr3")$set_threshold("warn")
# Initialize a random forests learner with specified parameters
ml_l = lrn("regr.ranger", num.trees = 100, mtry = 2, min.node.size = 2,
           max.depth = 5)
ml_m = lrn("regr.ranger", num.trees = 100, mtry = 2, min.node.size = 2,
           max.depth = 5)
ml_g = lrn("regr.ranger", num.trees = 100, mtry = 2, min.node.size = 2,
           max.depth = 5)


doubleml_plr = DoubleMLPLR$new(obj_dml_data,
                               ml_l, ml_m, ml_g,
                               n_folds = 2,
                               score = "IV-type")


doubleml_plr$fit()
doubleml_plr$summary()

## Estimates and significance testing of the effect of target variables
##        Estimate. Std. Error t value Pr(>|t|)    
## statin  -19.9014     0.8471  -23.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Everything together:

plot(c(1,2,3,4,5,6), c(naive$coefficients["statin"], 
                 adjusted.PSmatch$coefficients["statin"],
                 ate_ns$coefficients["statin"],
                 ate_gcomp, 
                 tmle_fit$estimates$ATE$psi, 
                 doubleml_plr$coef),
     ylim = c(-5,-25), 
     ylab = "parameter estimate", 
     xlab = "method",
     axes = F, lwd = 2)
axis(1, at = c(1,2,3,4, 5,6), labels = c("crude", "PS-match.", "PS-weight."," G-comp", "TMLE", "DML"))
axis(2)
abline(h = True_ATE, lty = 2, lwd = 2, col = "grey")
text(2, .26, "grey line = 'true value'")

Bonus: prettier DAGs and adjusting with `ggdag` package

theme_set(theme_dag())

D_AW_DAG <- dagify(Y ~  A + W1 + W2 + W3 ,
                   A ~ W1 + W2 + W3,
                   W3 ~ W1,
                   labels = c("Y" = "Cholesterol", 
                              "A" = "Statins",
                              "W1" = "Age",
                              "W2" = "Gender",
                              "W3" = "Prior Cholesterol"),
                   exposure = "A",
                   outcome = "Y")
ggdag(D_AW_DAG, text = TRUE, use_labels = "label")

Look at options here:

https://cran.r-project.org/web/packages/ggdag/ggdag.pdf

Example: adjust for W1

# control_for(D_AW_DAG, var = "W1")
ggdag_adjust(D_AW_DAG, var = "W3")

Bonus Frisch-Waugh-Lovell (FWL)

1. Regress A (T) on W (X)

a.hat <- glm(statin ~  age + gender + priorchol, data = ObsData)

2. Regress Y on W (X)

y.hat <- glm(cholesterol ~  age + gender + priorchol, data = ObsData)

3. Compute residuals from regressions 1 & 2.

delta.a <- a.hat$residuals
delta.y <- y.hat$residuals

4. Regress residuals delta.y on residuals delta.a

Delta <- glm(delta.y ~ delta.a)

Delta$coefficients["delta.a"]

##   delta.a 
## -19.80528

Voila

References

Rosenbaum, P. R., & Rubin, D. B. (1983). “The central role of the propensity score in observational studies for causal effects.” Biometrika, 70(1), 41-55.
Austin, P. C. (2011). “An introduction to propensity score methods for reducing the effects of confounding in observational studies.” Multivariate Behavioral Research, 46(3), 399-424.
van der Laan, M. J., & Rose, S. (2011). “Targeted Learning: Causal Inference for Observational and Experimental Data.” Springer.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). “Double/debiased machine learning for treatment and structural parameters.” The Econometrics Journal, 21(1), C1-C68., https://doi.org/10.1111/ectj.12097

Crash course on confounding, bias, and (modern) deconfounding remedies using R

Andy Wilson and Aimee Harrison

Jun 20\(^{th}\), 2024

Directed Acyclic Graphs

Real World Data Study Design

External Control Arms

Observational Data and the target trial

Effects Estimates

Crude (Naive) Estimates

Propensity Scores

Calculating a Propensity Score

Propensity Scores for Matching

Propensity Scores for Weighting

Limitations

G-Computation

What is G-Computation?

How G-Computation Works

Estimating the Unseen Version of Each Person

Modern/Emerging Methods: TMLE

What is TMLE?

How TMLE Works

Advantages of TMLE

Worked Example

Libraries for today’s R session

Synthetic Data and Ground Truth

DAG-based data generation process

Generate synthetic data

Bias

Traditional methods

No methods at all (aka crude (naive) estimate)

Using Propensity Scores for Matching

Examine matching triage

Extract matches and analyze:

Using Propensity Scores for Weighting

G-Computation

Modern/emerging methods

TMLE

Double machine learning

Everything together:

Bonus: prettier DAGs and adjusting with ggdag package

Example: adjust for W1

Bonus Frisch-Waugh-Lovell (FWL)

1. Regress A (T) on W (X)

2. Regress Y on W (X)

3. Compute residuals from regressions 1 & 2.

4. Regress residuals delta.y on residuals delta.a

References

Suggested Reading

Observational Data and the `target trial`

Bonus: prettier DAGs and adjusting with `ggdag` package