Photo of the author’s glassboard.

Introduction

We would overstate our telemedicine app’s effectiveness by claiming it reduces the risk of new coronavirus infections by 16.9%—when in fact it will only reduce this risk by 3.1%.

Consider a timely hypothetical twist on a classic example of spurious correlation: Recently, ice cream sales have been dropping—along with the number of homicides. But this isn’t because eating ice cream drives people to murder. It’s because a community-wide shelter-in-place mandate was enacted to prevent the spread of a novel infectious agent. This confounding factor drove down both ice cream sales and the number of homicides; hence the spurious correlation.

In this two-part tutorial, we’ll investigate a synthetic dataset inspired by a recent healthcare example involving racial disparities in COVID-19 cases (Garg et al, 2020; Aubrey, 2020). (Synthetic data are simulated—not real—data commonly created to teach or learn about an analytical tool.) We’ll follow a typical data science analysis workflow.

Our analysis goal will be to help public health authorities in our simulated world reduce SARS-CoV-2 (“coronavirus”) infections. We believe our digital health or telemedicine app can help prevent new infections; e.g., by promoting healthy lifestyle choices—particularly while social distancing and sheltering-in-place—that reduce the risk of coronavirus infection. But to do this, we’ll need an unbiased or statistically consistent (i.e., more unbiased with larger samples) estimate of the true effect of our would-be intervention.

Tutorial Goal: To understand how causal assumptions change our interpretation of predictions, moving them towards explanations. To understand why this is needed before we can quantify the impact of our tech solution.
What We’ll Learn: How to recognize the difference between a modeling association and a causal effect. How to conduct causal inference via the g-formula (Robins, 1986; Hernán and Robins, 2006) and propensity score weighting (Rosenbaum and Rubin, 1983; Hirano and Imbens, 2001; Lunceford and Davidian, 2004).

We will learn how a confounder biases our estimate of the would-be effect of a hypothetical telemedicine intervention. The biases that emerge from our synthetic data are shown below.

Part	Study Design	Confounder Adjustment	Empirical Bias	Estimated Risk Reduction
1	RWE (observational study)	not done	-0.138	overstated
2	RCT (experimental study)	not done	-0.003	consistently estimated
2	RWE (observational study)	g-formula	0.008	consistently estimated
2	RWE (observational study)	propensity score weighting	0.010	consistently estimated

In Part 1, we’ll analyze synthetic real-world evidence (RWE) data. We’ll begin to see why it isn’t enough, in general, to state the causal mechanisms (e.g., directed acyclic graph) and control for all confounders—assuming we observe all of them in our dataset. We must go one critical step further in order to estimate an intervention’s true effect size and direction. This step uses the Law of Total Expectation, a straightforward and intuitive concept we actually use all the time, that we’ll review in Part 2.

We’ll conclude in Part 2 by analyzing data from a synthetic randomized controlled trial (RCT) in which we’d randomly assigned the app to 50% of all trial participants. We’ll learn how to use the Law of Total Expectation to conduct causal inference using observational RWE data. (In both the RWE and RCT, the app had the same average effect on infection risk regardless of other causes; i.e., there was no interaction of app usage with other causes.) Specifically, we’ll learn how to apply the g-formula (also called “standardization”) and propensity score weighting to estimate the true overall or average treatment effect (ATE).

Here in Part 1, we’ll analyze a synthetic RWE dataset with three person-level variables: infection status, app usage, and race (Black or White only for simplicity).

Part 1 Objective: To try—and fail—to estimate the ATE of app usage on infections. We’ll specify the ATE as the difference in infection risk, or risk difference (RD); specifically, as the risk among users minus that among non-users.
What We’ll Learn: How to recognize the difference between a statistical contrast (e.g., RD) and a causal effect. Estimating the ATE incorrectly by confusing the former for the latter has far-reaching public health implications.

We will conclude Part 1 with the key insight that if we mistakenly infer causation from correlation:

We would overstate our telemedicine app’s effectiveness by claiming it reduces the risk of new coronavirus infections by 16.9%—when in fact it will only reduce this risk by 3.1%.

Dataset Characteristics

The observed (i.e., non-randomized) RWE data are in the tibble observational_rwe. (Generate this with the R code in the Appendix.)

glimpse(observational_rwe)

## Observations: 80,000
## Variables: 3
## $ race      <chr> "White", "White", "Black", "White", "White", "White", "Whit…
## $ app       <chr> "used app", "used app", "didn't use app", "used app", "used…
## $ infection <chr> "1. infected", "0. uninfected", "0. uninfected", "0. uninfe…

Each observation (i.e., row) represents a unique individual who was originally susceptible and uninfected. The variables and their unique values are:

knitr::kable(apply(observational_rwe, 2, unique))

race	app	infection
White	used app	1. infected
Black	didn’t use app	0. uninfected

Let’s randomly split the data into an 80% training set for exploratory data analysis (EDA), and a 20% holdout set for fitting an explanatory model (Shmueli, 2010) in our final validation/confirmatory analysis.

holdout_proportion <- 0.2
observational_rwe <- observational_rwe %>% dplyr::mutate(rownum = row_number())
set.seed(2004120322)
observational_rwe_training <- observational_rwe %>% dplyr::sample_frac(1 - holdout_proportion)
observational_rwe_holdout <- observational_rwe %>% dplyr::filter(!(rownum %in%
  observational_rwe_training$rownum))

# clean up: remove rownum
observational_rwe$rownum <- NULL
observational_rwe_training$rownum <- NULL
observational_rwe_holdout$rownum <- NULL

There are 64000 training observations and 16000 holdout observations.

Exploratory Data Analysis

Univariate Associations

Correlation Matrix

dummy_rwe_training <- observational_rwe_training %>%
  dplyr::mutate(
    race = (race == "White"),
    app = (app == "used app"),
    infection = (infection == "1. infected")
  )

knitr::kable(round(cor(dummy_rwe_training), 4))

	race	app	infection
race	1.0000	0.5294	-0.3553
app	0.5294	1.0000	-0.2239
infection	-0.3553	-0.2239	1.0000

corrplot::corrplot.mixed(cor(dummy_rwe_training))

Even in this simple correlation matrix, there are already signs of the challenges ahead. Can you spot them?

Infections by App Usage (Marginal Model)

Let’s examine our main relationship of interest.

observational_rwe_training %>%
  ggplot2::ggplot(ggplot2::aes(x = app, fill = infection)) +
  ggplot2::theme_classic() +
  ggplot2::geom_bar(position = "dodge") +
  ggplot2::ggtitle("Infections by App Usage")

df_rwe_training <- with(
  observational_rwe_training,
  cbind(
    table(app, infection),
    prop.table(table(app, infection), margin = 1) # row proportions
  )
)

knitr::kable(df_rwe_training) # row proportions

	0. uninfected	1. infected	0. uninfected	1. infected
didn’t use app	9519	3653	0.7226693	0.2773307
used app	46187	4641	0.9086921	0.0913079

There was a lower infection rate (i.e., empirical or observed infection risk) among app users: Only 9.1% of users had infections, compared to 27.7% of non-users. The empirical RD is 0.091 \(-\) 0.277 \(=\) -0.186—a marginal quantity in statistical terms because it does not account for (i.e., “is marginalized over”) any other variables. A model that does account for other variables (here, race) in addition to the would-be intervention (here, app usage) is called a conditional model.

out_fisher_rwe_training <- with(
  observational_rwe_training,
  fisher.test(app, infection)
)

out_fisher_rwe_training

## 
##  Fisher's Exact Test for Count Data
## 
## data:  app and infection
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.2493633 0.2749802
## sample estimates:
## odds ratio 
##  0.2618449

In addition, there is strong statistical evidence (i.e., statistical significance) that infections varied by app usage (\(p_{\text{EDA}} <<\) 0.001). Here, the estimated odds of infection for app users were only 0.262 (i.e., roughly a quarter) that of non-users, with an EDA 95% confidence interval (CI) of (0.249, 0.275).

Statistical significance does not imply importance: This strong statistical evidence is not surprising. Bigger samples typical of RWE data provide better estimates of progressively smaller differences in outcome means (i.e., for any given set of outcome distributions with finite means and variances). This says nothing about the domain (e.g., scientific, clinical, research, business) usefulness of those differences.
Correlation does not imply causation: In no way does this finding imply that using the app reduces infection risk. Statistical findings like this cannot imply causation. Statistical findings are implications of causal relationships, not vice versa. But it should prompt us to investigate further, and should be assessed alongside subsequent findings and hypotheses to help explain how this association arose.

Infections by Race

Race seems to have been associated with infection.

observational_rwe_training %>%
  ggplot2::ggplot(ggplot2::aes(x = race, fill = infection)) +
  ggplot2::theme_classic() +
  ggplot2::geom_bar(position = "dodge") +
  ggplot2::ggtitle("Infections by Race")

df_rwe_training_race_infection <- with(
  observational_rwe_training,
  cbind(
    table(race, infection),
    prop.table(table(race, infection), margin = 1) # row proportions
  )
)

knitr::kable(df_rwe_training_race_infection) # row proportions

	0. uninfected	1. infected	0. uninfected	1. infected
Black	5327	3871	0.5791476	0.4208524
White	50379	4423	0.9192913	0.0807087

out_fisher_rwe_training_race_infection <- with(
  observational_rwe_training,
  fisher.test(race, infection)
)

out_fisher_rwe_training_race_infection

## 
##  Fisher's Exact Test for Count Data
## 
## data:  race and infection
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.1147125 0.1272759
## sample estimates:
## odds ratio 
##  0.1208232

African Americans were more likely to have been infected than Whites (\(p_{\text{EDA}} <<\) 0.001): 42.1% of African Americans were infected, compared to 8.1% of Whites. Put differently, the estimated odds of infection for Whites were 0.121 (95% \(\text{CI}_{\text{EDA}}\): 0.115, 0.127) times that of African Americans. This finding is plausible due to structural inequality: African Americans tend to have worse access to care, lower socioeconomic status (SES), and more underlying health conditions.

App Usage by Race

Importantly, it looks like race was strongly associated with app usage.

observational_rwe_training %>%
  ggplot2::ggplot(ggplot2::aes(x = race, fill = app)) +
  ggplot2::theme_classic() +
  ggplot2::geom_bar(position = "dodge") +
  ggplot2::ggtitle("App Usage by Race")

df_rwe_training_race_app <- with(
  observational_rwe_training,
  cbind(
    table(race, app),
    prop.table(table(race, app), margin = 1) # row proportions
  )
)

knitr::kable(df_rwe_training_race_app) # row proportions

	didn’t use app	used app	didn’t use app	used app
Black	6698	2500	0.7282018	0.2717982
White	6474	48328	0.1181344	0.8818656

out_fisher_rwe_training_race_app <- with(
  observational_rwe_training,
  fisher.test(race, app)
)

out_fisher_rwe_training_race_app

## 
##  Fisher's Exact Test for Count Data
## 
## data:  race and app
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  18.95585 21.10802
## sample estimates:
## odds ratio 
##   20.00128

African Americans were less likely to have used the app than Whites (\(p_{\text{EDA}} <<\) 0.001): Only 27.2% of African Americans used the app, compared to 88.2% of Whites. Put differently, the estimated odds of using the app for Whites were 20.001 (95% \(\text{CI}_{\text{EDA}}\): 18.956, 21.108) times that of African Americans.

This finding is plausible also due to structural inequality. For example, suppose our app was only available on iPhone. African Americans are less likely than Whites to own iPhones (e.g., page 7 of Smith, 2013), which would have limited their access to the app. African Americans may also have worse access to telemedicine options in general.

Implications for Explanatory Modeling

What do these findings imply about how we should approach estimating the ATE? Could race affect both app usage and infection risk? If so, race would be a confounder—meaning our marginal RD estimate earlier will not suffice as an estimate of the ATE. (We’ll see how later on, and why this is so in Part 2.)

Variable Selection

Prediction Modeling

Suppose we decide to fit a prediction model. We think this may help us decide if we should include race in our final explanatory model (in addition to app usage). We’ll fit a 10-fold cross-validated ridge-regularized generalized linear model (GLM) with logit link (i.e., corresponding to a logistic regression).

set.seed(2004110254)
glmnet_rwe_training <- glmnet::cv.glmnet(
  x = model.matrix(
    object = infection ~ app + race,
    data = observational_rwe_training
  ),
  y = observational_rwe_training$infection,
  family = "binomial",
  type.measure = "auc",
  alpha = 0 # ridge: https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html
)
glmnet_rwe_training$glmnet_preds <- as.numeric(
  predict(
    object = glmnet_rwe_training,
    newx = model.matrix(
      object = infection ~ app + race,
      data = observational_rwe_training
    ),
    s = glmnet_rwe_training$lambda.min,
    type = "response"
  )
)
glmnet_ROC_rwe_training <- pROC::roc(
  response = observational_rwe_training$infection,
  predictor = glmnet_rwe_training$glmnet_preds
)
coef_glmnet_rwe_training <- coef(glmnet_rwe_training)

plot(glmnet_ROC_rwe_training)

glmnet_ROC_rwe_training$auc # training set cross-validated AUC

## Area under the curve: 0.6954

coef_glmnet_rwe_training

## 4 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept) -1.900132655
## (Intercept)  .          
## appused app -0.001874018
## raceWhite   -0.003428336

We’ll select app regardless of its coefficient size because it is our potenial intervention. Still, we note that an automated search for important predictors ranked by coefficient size would have ranked race higher than app. We wonder how such a procedure could potentially have excluded true causes with smaller coefficients from other analyses we’ve done—causes that were important for recommending courses of action.

Importantly, note that the area under the receiver operating characteristic (ROC) curve (AUC) is not particularly large even though there exists a true ATE of app usage on infection risk by design. We use the simulation parameters and models (see Appendix) to calculate this ATE as -0.031. (We’ll derive this calculation in Part 2.) This ATE is only modestly sized; hence the modest AUC.

“To explain or to predict?” These findings underscore the important points in (Shmueli, 2010) about the different objectives, procedures, and properties of prediction and explanation.

Implications for Explanatory Modeling

We decide to keep race in the final explanatory model due to its relatively large prediction coefficient, even after controlling for app.

Propose Explanatory Model

An explanatory model is a domain hypothesis comprised of a causal model and a statistical model.

We will validate or confirm the statistical model using our holdout data.
We cannot validate or confirm the causal model using only our RWE data (training or holdout). We’d need to conduct an experiment or RCT to do this, or provide evidence in favor of our proposed causal model from the scientific literature.
We can—and should!—clearly state the causal model we’re assuming to be true. This is a key piece of scientific communication: It helps the reader interpret and assess our findings correctly before taking further action.

To fit the explanatory statistical model with a desired level of statistical power and evidence, we also need to specify a sample size.

Causal Model

From our findings, we propose the following causal model, formalized as a directed acyclic graph (DAG) (Pearl, 2009):

App Usage \(\rightarrow\) Infection
Race \(\rightarrow\) Infection
Race \(\rightarrow\) App Usage

DiagrammeR::grViz("
digraph causal {

  # Nodes
  node [shape = plaintext]
  Z [label = 'Race']
  X [label = 'App \n Usage']
  Y [label = 'Infection']

  # Edges
  edge [color = black,
        arrowhead = vee]
  rankdir = LR
  X -> Y
  Z -> X
  Z -> Y

  # Graph
  graph [overlap = true]
}")

Here, race confounds the effect of app usage on infection.

Statistical Model and Sample Size

Based on our DAG, it makes sense to model the infection risk as a function of race and app usage. We propose the following logistic model.

glm_rwe_training <- glm(
  data = observational_rwe_training,
  formula = as.factor(infection) ~ app + race,
  family = "binomial"
)

summary(glm_rwe_training)$coefficients

##               Estimate Std. Error   z value     Pr(>|z|)
## (Intercept) -0.2280908 0.02278461 -10.01074 1.367266e-23
## appused app -0.3432987 0.03249127 -10.56588 4.289439e-26
## raceWhite   -1.9073638 0.03260492 -58.49926 0.000000e+00

The estimated odds of infection for app users were \(\exp(\)-0.343\() =\) 0.709 (95% \(\text{CI}_{\text{EDA}}\): 0.666, 0.756) times that of non-users (regardless of race). The corresponding estimated infection risks for app usage by race are:

risk_didnt_use_app_black_training <- plogis(coef(glm_rwe_training) %*% c(1, 0, 0))
risk_used_app_black_training <- plogis(coef(glm_rwe_training) %*% c(1, 1, 0))
risk_didnt_use_app_white_training <- plogis(coef(glm_rwe_training) %*% c(1, 0, 1))
risk_used_app_white_training <- plogis(coef(glm_rwe_training) %*% c(1, 1, 1))

0.443 for African Americans not using the app
0.361 for African Americans using the app
0.106 for Whites not using the app
0.077 for Whites using the app

In order to estimate the RD for African Americans with 80% statistical power at an \(\alpha = 0.05\) overall level of statistical evidence, the required sample sizes are:

# https://cran.r-project.org/web/packages/pwr/vignettes/pwr-vignette.html
sample_size_black <- ceiling(
  pwr::pwr.2p.test(
    h = pwr::ES.h(p1 = risk_used_app_black_training, p2 = risk_didnt_use_app_black_training),
    sig.level = 0.025,
    power = .80,
    alternative = "less"
  )$n
)
sample_size_white <- ceiling(
  pwr::pwr.2p.test(
    h = pwr::ES.h(p1 = risk_used_app_white_training, p2 = risk_didnt_use_app_white_training),
    sig.level = 0.025,
    power = .80,
    alternative = "less"
  )$n
)

556 African American individuals in each app usage group (i.e., users or non-users)
1617 White individuals in each app usage group

Note that these figures are for one-sided hypothesis tests that the infection risk is lower for app users versus non-users within each racial group. They adjust for multiple testing (i.e., multiple comparisons, multiplicity) by requiring each of the two individual statistical hypothesis tests to meet an \(\alpha = 0.025\) level of statistical evidence.

Explanatory Modeling (Validation/Confirmation)

Recall that our holdout dataset has 16000 observations.

Infections by App Usage (Marginal Model)

Let’s first examine our main relationship of interest, as we did with the training data.

observational_rwe_holdout %>%
  ggplot2::ggplot(ggplot2::aes(x = app, fill = infection)) +
  ggplot2::theme_classic() +
  ggplot2::geom_bar(position = "dodge") +
  ggplot2::ggtitle("Infections by App Usage")

df_rwe_holdout <- with(
  observational_rwe_holdout,
  cbind(
    table(app, infection),
    prop.table(table(app, infection), margin = 1) # row proportions
  )
)

knitr::kable(df_rwe_holdout) # row proportions

	0. uninfected	1. infected	0. uninfected	1. infected
didn’t use app	2433	854	0.7401886	0.2598114
used app	11563	1150	0.9095414	0.0904586

As before, there was a lower infection rate among app users: Only 9% of users had infections, compared to 26% of non-users. The holdout empirical RD is -0.169, similar to the training empirical RD of -0.186.

Explanatory Model

Check Sample Size

Do we have a large enough sample for each app usage and racial group combination to meet our statistical power and evidence requirements?

observational_rwe_holdout %>%
  ggplot2::ggplot(ggplot2::aes(x = race, fill = app)) +
  ggplot2::theme_classic() +
  ggplot2::geom_bar(position = "dodge") +
  ggplot2::ggtitle("App Usage by Race")

df_rwe_holdout_race_app <- with(
  observational_rwe_holdout,
  cbind(
    table(race, app),
    prop.table(table(race, app), margin = 1) # row proportions
  )
)

knitr::kable(df_rwe_holdout_race_app) # row proportions

	didn’t use app	used app	didn’t use app	used app
Black	1667	607	0.7330695	0.2669305
White	1620	12106	0.1180242	0.8819758

Yes: We have at least 556 African American individuals in each app usage group, and at least 1617 White individuals likewise.

Fit Statistical Model

Based on our proposed causal and statistical models, we fit the following logistic model.

glm_rwe_holdout <- glm(
  data = observational_rwe_holdout,
  formula = as.factor(infection) ~ app + race,
  family = "binomial"
)

summary(glm_rwe_holdout)$coefficients

##               Estimate Std. Error    z value      Pr(>|z|)
## (Intercept) -0.3220348 0.04608849  -6.987316  2.801938e-12
## appused app -0.2316567 0.06680031  -3.467899  5.245447e-04
## raceWhite   -1.9344662 0.06675370 -28.979159 1.204740e-184

After controlling for race, there is good statistical evidence (p < 0.001) for the estimated association between app usage and infection. This confirms our training-data findings. There is also very strong statistical evidence (p <<< 0.001) for race’s association with infection risk. Specifically, the estimated odds of infection for Whites were \(\exp(\)-1.934\() =\) 0.145 (95% CI: 0.127, 0.165) times that of African Americans (regardless of app usage).

The estimated odds of infection for app users were \(\exp(\)-0.232\() =\) 0.793 (95% CI: 0.696, 0.905) times that of non-users (regardless of race). The corresponding estimated infection risks for app usage by race are:

risk_didnt_use_app_black_holdout <- plogis(coef(glm_rwe_holdout) %*% c(1, 0, 0))
risk_used_app_black_holdout <- plogis(coef(glm_rwe_holdout) %*% c(1, 1, 0))
risk_didnt_use_app_white_holdout <- plogis(coef(glm_rwe_holdout) %*% c(1, 0, 1))
risk_used_app_white_holdout <- plogis(coef(glm_rwe_holdout) %*% c(1, 1, 1))

0.42 for African Americans not using the app
0.365 for African Americans using the app
0.095 for Whites not using the app
0.077 for Whites using the app

The estimated RDs by race are:

rwe_holdout_rd_black <- risk_used_app_black_holdout - risk_didnt_use_app_black_holdout
rwe_holdout_rd_white <- risk_used_app_white_holdout - risk_didnt_use_app_white_holdout
confint_glm_rwe_holdout <- confint(glm_rwe_holdout) # 95% CIs: odds ratios of infection
rwe_holdout_rd_ci_black <- c(
  plogis(confint_glm_rwe_holdout[, 1] %*% c(1, 1, 0)) - plogis(confint_glm_rwe_holdout[, 1] %*% c(1, 0, 0)),
  plogis(confint_glm_rwe_holdout[, 2] %*% c(1, 1, 0)) - plogis(confint_glm_rwe_holdout[, 2] %*% c(1, 0, 0))
)
rwe_holdout_rd_ci_white <- c(
  plogis(confint_glm_rwe_holdout[, 1] %*% c(1, 1, 1)) - plogis(confint_glm_rwe_holdout[, 1] %*% c(1, 0, 1)),
  plogis(confint_glm_rwe_holdout[, 2] %*% c(1, 1, 1)) - plogis(confint_glm_rwe_holdout[, 2] %*% c(1, 0, 1))
)

-0.055 (95% CI: -0.083, -0.025) for African Americans
-0.018 (95% CI: -0.022, -0.01) for Whites

Conclusion and Public Health Implications

How do we estimate the true ATE of -0.031?

Attempt 1: Holdout Empirical Risk Difference

Can we simply use the holdout empirical RD of -0.169 (i.e., a marginal quantity) as our ATE estimate?

In our synthetic data, race confounds the presumed effect of app usage on infection risk by design. The true bias of mistaking the holdout empirical RD to be an estimate of the ATE is exactly -0.169 \(-\) -0.031 \(=\) -0.138.

This highlights a crucial point:

Suppose we mistake this empirical RD as an estimate of the ATE. Unbeknownst to us, in our simulated world we would overstate our telemedicine app’s effectiveness by claiming it reduces the risk of new coronavirus infections by 16.9%—when in fact it will only reduce this risk by 3.1%.

Some public health implications are:

Health authorities will squander valuable weeks deploying our app and assessing its effectiveness. During this time, more promising solutions could have been implemented. More people will become newly infected, who could otherwise have been spared.
Hospitals, government agencies, and businesses will make plans incorrectly anticipating a 16.9% drop in infections.

Attempt 2: (To be continued…)

Can we somehow use our explanatory model to estimate the ATE? After all, we’d accounted for race as a confounder. Are either of our race-specific estimated RDs an estimate of the true ATE of -0.031? Unfortunately, no: The ATE is a risk difference averaged over all other variables. (It is a marginal—not conditional—quantity.)

So then, can we estimate the ATE as a weighted average of these estimated RDs? Fortunately, yes: The g-formula does exactly this, which we’ll learn about in Part 2. We’ll apply the g-formula to estimate the ATE—along with another popular method called propensity score weighting.

But first, in Part 2 we’ll see how an RCT allows us to estimate the ATE in a statistically consistent or unbiased way.

References

Aubrey A. CDC Hospital Data Point To Racial Disparity In COVID-19 Cases. NPR. April 8, 2020 2:43 PM ET. npr.org/sections/coronavirus-live-updates/2020/04/08/830030932/cdc-hospital-data-point-to-racial-disparity-in-covid-19-cases

Garg S. Hospitalization Rates and Characteristics of Patients Hospitalized with Laboratory-Confirmed Coronavirus Disease 2019—COVID-NET, 14 States, March 1–30, 2020. MMWR. Morbidity and Mortality Weekly Report. 2020;69. cdc.gov/mmwr/volumes/69/wr/mm6915e3.htm
Hernán MA, Robins JM. Estimating causal effects from epidemiological data. Journal of Epidemiology & Community Health. 2006 Jul 1;60(7):578-86. jech.bmj.com/content/60/7/578.short
Hirano K, Imbens GW. Estimation of causal effects using propensity score weighting: An application to data on right heart catheterization. Health Services and Outcomes research methodology. 2001 Dec 1;2(3-4):259-78. https://link.springer.com/article/10.1023/A:1020371312283
Lunceford JK, Davidian M. Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in medicine. 2004 Oct 15;23(19):2937-60. https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.1903
Pearl J. Causality. Cambridge university press; 2009 Sep 14. bayes.cs.ucla.edu/BOOK-2K/
Robins J. A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect. Mathematical modelling. 1986 Jan 1;7(9-12):1393-512. sciencedirect.com/science/article/pii/0270025586900886
Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983 Apr 1;70(1):41-55. academic.oup.com/biomet/article/70/1/41/240879
Smith A. Smartphone ownership-2013 update. Washington, DC: Pew Research Center; 2013 Jun 5. boletines.prisadigital.com/PIP_Smartphone_adoption_2013.pdf
Shmueli G. To explain or to predict?. Statistical science. 2010;25(3):289-310. projecteuclid.org/euclid.ss/1294167961

Appendix

RWE Simulation Parameters and Data-generation Models

Feature distribution

Race: Based on the U.S. Census, the probability of being White was set as \(\Pr(Z=1) = \pi_Z =\) 0.8560091.

Outcome Model

Partial Causal Model

Race and app usage both impact the infection risk. The partial DAG is:

App Usage (\(X\)) \(\rightarrow\) Infection (\(Y\))
Race (\(Z\)) \(\rightarrow\) Infection (\(Y\))

DiagrammeR::grViz("
digraph causal {

  # Nodes
  node [shape = plaintext]
  X [label = 'App \n Usage \n (X)']
  Z [label = 'Race \n (Z)']
  Y [label = 'Infection \n (Y)']

  # Edges
  edge [color = black,
        arrowhead = vee]
  rankdir = LR
  X -> Y
  Z -> Y

  # Graph
  graph [overlap = true]
}")

It’s partial because we haven’t yet specified how (if at all) app usage and race are associated.

Statistical Model

The infection risk is lower for app users (\(X=1\)) versus non-users (\(X=0\)).
The infection risk is lower for Whites (\(Z=1\)) versus African Americans (\(Z=0\)) due to structural inequality (e.g., better access to care, higher SES, fewer underlying health conditions).

These statistical relationships are described by the logistic model \[ \text{logit}( \Pr(Y=1|X,Z) ) = \beta_0 + \beta_X X + \beta_Z Z , \] where \(\beta_0 =\) -0.2381882, \(\beta_X =\) -0.3, and \(\beta_Z =\) -1.9469256. Hence, the probability of infection given race and app usage is \[ \Pr(Y=1|X,Z) = \frac {e^{\beta_0 + \beta_X X + \beta_Z Z}} {1 + e^{\beta_0 + \beta_X X + \beta_Z Z}} . \]

In particular, the infection risk is:

\(\Pr(Y=1|X=0,Z=0) =\) 0.441 for African Americans not using the app
\(\Pr(Y=1|X=1,Z=0) =\) 0.369 for African Americans using the app
\(\Pr(Y=1|X=0,Z=1) =\) 0.101 for Whites not using the app
\(\Pr(Y=1|X=1,Z=1) =\) 0.077 for Whites using the app

I chose these parameter values to reflect what was known as of 10 April 2020, as reported in a recent healthcare example involving racial disparities in COVID-19 cases (Garg et al, 2020; Aubrey, 2020) and this Washington Post article. The table emulated in the simulations is:

a <- round((1 - prInfectionGivenBlack) * prBlack, 3)
b <- round(prBlackGivenInfection * prInfection, 3)
c <- round((1 - prInfectionGivenWhite) * prWhite, 3)
d <- round(prWhiteGivenInfection * prInfection, 3)
knitr::kable(
  dplyr::tribble(
    ~Race, ~Uninfected, ~Infected, ~`Row Total`,
    "Black", a, b, a + b,
    "White", c, d, c + d,
    "Column Total", a + c, b + d, a + b + c + d
  )
)

Race	Uninfected	Infected	Row Total
Black	0.081	0.063	0.144
White	0.769	0.087	0.856
Column Total	0.850	0.150	1.000

Propensity Model: App Usage

Partial Causal Model

Race affects app usage. For example, suppose our app is only available on iPhone, and that African Americans are less likely than Whites to own iPhones (e.g., page 7 of Smith, 2013). In addition, suppose African Americans have worse access to telemedicine services, and so tend to be less aware of such apps.

That is, the partial DAG is:

Race (\(Z\)) \(\rightarrow\) App Usage (\(X\))

DiagrammeR::grViz("
digraph causal {

  # Nodes
  node [shape = plaintext]
  Z [label = 'Race \n (Z)']
  X [label = 'App \n Usage \n (X)']

  # Edges
  edge [color = black,
        arrowhead = vee]
  rankdir = LR
  Z -> X

  # Graph
  graph [overlap = true]
}")

Statistical Model

Whites are more likely to use the app than African Americans. This statistical relationship is described by the logistic model \[ \text{logit}( \Pr(X=1) ) = \alpha_0 + \alpha_Z Z , \] where \(\alpha_0 =\) -1 and \(\alpha_Z =\) 3. Hence, the probability of using the app given race is \[ \Pr(X=1|Z) = \frac {e^{\alpha_0 + \alpha_Z Z}} {1 + e^{\alpha_0 + \alpha_Z Z}} . \]

In particular:

\(\Pr(X=1|Z=0) \times 100 =\) 26.9% of African Americans use the app
\(\Pr(X=1|Z=1) \times 100 =\) 88.1% of Whites use the app

Causal Model

The full DAG is:

App Usage (\(X\)) \(\rightarrow\) Infection (\(Y\))
Race (\(Z\)) \(\rightarrow\) Infection (\(Y\))
Race (\(Z\)) \(\rightarrow\) App Usage (\(X\))

DiagrammeR::grViz("
digraph causal {

  # Nodes
  node [shape = plaintext]
  X [label = 'App \n Usage \n (X)']
  Z [label = 'Race \n (Z)']
  Y [label = 'Infection \n (Y)']

  # Edges
  edge [color = black,
        arrowhead = vee]
  rankdir = LR
  X -> Y
  Z -> Y
  Z -> X

  # Graph
  graph [overlap = true]
}")

RWE Simulation R Code

##### Set simulation parameters



### Preliminaries
random_seed <- 2004101447
sample_size_observational <- 80000



### Feature distribution
piZ <- 0.755 / (0.755 + 0.127) # race (based on U.S. Census)



### Outcome model

# beta0 and betaZ are derived from:
#   https://www.cdc.gov/mmwr/volumes/69/wr/mm6915e3.htm
#   https://www.npr.org/sections/coronavirus-live-updates/2020/04/08/830030932/cdc-hospital-data-point-to-racial-disparity-in-covid-19-cases
#   https://www.washingtonpost.com/news/powerpost/paloma/the-health-202/2020/04/09/the-health-202-los-angeles-is-racing-to-discover-the-true-coronavirus-infection-rate/5e8de70588e0fa101a75e13d/

prInfection <- 0.15
prBlack <- 1 - piZ
prWhite <- piZ
prBlackGivenInfection <- 33 / (33 + 45)
prWhiteGivenInfection <- 1 - prBlackGivenInfection
prInfectionGivenBlack <- prBlackGivenInfection * prInfection / prBlack
prInfectionGivenWhite <- prWhiteGivenInfection * prInfection / prWhite

beta0 <- log(prInfectionGivenBlack / (1 - prInfectionGivenBlack)) # baseline: infection risk for African Americans who don't use app
betaX <- -0.3
betaZ <- log(prInfectionGivenWhite / (1 - prInfectionGivenWhite)) - beta0 # average influence of being White on infection risk



### Propensity model: app usage
alpha0_observational <- -1 # observational real-world evidence: baseline probability of using app for African Americans
alphaZ_observational <- 3 # observational real-world evidence: average influence of being White on probability of using app





##### Generate data.
set.seed(random_seed)
observational_rwe <- dplyr::tibble(
  race = rbinom(n = sample_size_observational, size = 1, prob = piZ),
  app = rbinom(n = sample_size_observational, size = 1, prob = plogis(alpha0_observational + alphaZ_observational * race)),
  infection = rbinom(n = sample_size_observational, size = 1, prob = plogis(beta0 + betaX * app + betaZ * race))
) %>%
  dplyr::mutate(
    race = ifelse(race == 1, "White", ifelse(race == 0, "Black", NA)),
    app = ifelse(app == 1, "used app", ifelse(app == 0, "didn't use app", NA)),
    infection = ifelse(infection == 1, "1. infected", ifelse(infection == 0, "0. uninfected", NA))
  )

About the Author

Dr. Daza is a biostatistician and health data scientist—not an epidemiologist—who develops causal inference methods for personalized (n-of-1) digital health. | 🇵🇭🇺🇸 ericjdaza.com @ericjdaza linkedin.com/in/ericjdaza | statsof1.org @statsof1 @fsbiostats

(Also published at https://towardsdatascience.com/coronavirus-telemedicine-and-race-part-1-simulated-real-world-evidence-9971f553194d and statsof1.netlify.com/2020-04-10-coronavirus-telemedicine-and-race-part-1/.)

Coronavirus, Telemedicine, and Race: Simulated Real-World Evidence

Causal inference tutorial in R using synthetic data (Part 1)

Eric J. Daza, DrPH, MPS

2020-04-15

Introduction

Dataset Characteristics

Exploratory Data Analysis

Univariate Associations

Correlation Matrix

Infections by App Usage (Marginal Model)

Infections by Race

App Usage by Race

Implications for Explanatory Modeling

Variable Selection

Prediction Modeling

Implications for Explanatory Modeling

Propose Explanatory Model

Causal Model

Statistical Model and Sample Size

Explanatory Modeling (Validation/Confirmation)

Infections by App Usage (Marginal Model)

Explanatory Model

Check Sample Size

Fit Statistical Model

Conclusion and Public Health Implications

Attempt 1: Holdout Empirical Risk Difference

Attempt 2: (To be continued…)

References

Appendix

RWE Simulation Parameters and Data-generation Models

Feature distribution

Outcome Model

Partial Causal Model

Statistical Model

Propensity Model: App Usage

Partial Causal Model

Statistical Model

Causal Model

RWE Simulation R Code

About the Author