Here we use the distributed package that I wrote for a previous simulation project (Comparison of Privacy-Protecting Analytic and Data-sharing Methods: a Simulation Study).
## random.org
set.seed(272438315)
library(tidyverse)
## Simulation suite from
## https://github.com/kaz-yos/distributed
## devtools::install_github(repo = "kaz-yos/distributed")
library(distributed)
library(rstan)
library(bayesplot)
The following function can be used to construct a single dataset containing treated and untreated individuals subject to confounding.
GenerateOneCenter(n, AssignCovariates, alphas, betas, survParams)
Arguments:
n: study site-specific sample size.
AssignCovariates: covariate generation function that takes n and p as its
    only arguments.
alphas: parameter vector for the treatment model, c(alpha0, alphaX).
betas: parameter vector for the outcome model shared between the binary and
    survival outcome models, c(beta0, betaX, betaA, betaXA).
survParams: vector of two. The first element is the baseline hazard of
    events in the exponential event time outcome model (lambda). The second
    element is the baseline hazard of censoring in the exponential censoring
    time model (lambda_c).
This function call produces both binary and survival outcomes. We focus on the binary outcome.
data1 <-
GenerateOneCenter(n = 1000,
AssignCovariates = AssignCovariatesNormBinDefault,
alphas = c(alpha0 = -0.5, alphaX = c(0.5, 0.5)),
betas = c(beta0 = -0.5, betaX = c(0.5, 0.5),
## Protective effect
betaA = -0.3,
## No effect (measure) modification
betaXA = c(0, 0)),
survParams = c(lambda = -log(0.95), lambda_c = -log(0.99), Tmax = 1))
The true data including the counterfactual quantities can be displayed. The overall difference between pY1 (mean probability of response under treatment) and pY0 (mean probability of response under no treatment) is the average treatment effect (ATE). The same difference within the treated (A = 1) is the average treatment effect on the treated (ATT). The treatment-group difference in pY0 indicates a difference in the mean counterfactuals under no treatment, that is, confounding. Similar confounding is visible in pY1. These quantities are computed directly in the sketch after the summary output below.
summary(data1, truth = TRUE)
##
## ### Parameter values
## ### n : 1000
## ### alpha0 : -0.5
## ### alphaX : 0.5 0.5
## ### beta0 : -0.5
## ### betaX : 0.5 0.5
## ### betaA : -0.3
## ### betaXA : 0 0
## ### lambda : 0.05129329
## ### lambda_c : 0.01005034
## ### Tmax : 1
##
## ### Baseline table
## Overall 0 1 SMD
## n 1000 606 394
## X1 (mean (SD)) -0.00 (1.02) -0.18 (1.02) 0.28 (0.94) 0.466
## X2 (mean (SD)) 0.06 (0.23) 0.04 (0.20) 0.08 (0.27) 0.150
## pA (mean (SD)) 0.39 (0.12) 0.37 (0.11) 0.42 (0.11) 0.484
## A (mean (SD)) 0.39 (0.49) 0.00 (0.00) 1.00 (0.00) Inf
## pY (mean (SD)) 0.36 (0.11) 0.37 (0.11) 0.36 (0.11) 0.112
## pY0 (mean (SD)) 0.39 (0.12) 0.37 (0.11) 0.42 (0.11) 0.484
## pY1 (mean (SD)) 0.33 (0.11) 0.31 (0.10) 0.36 (0.11) 0.479
## Y (mean (SD)) 0.38 (0.48) 0.38 (0.49) 0.36 (0.48) 0.044
## rate (mean (SD)) 0.05 (0.03) 0.05 (0.03) 0.05 (0.03) 0.122
## rate0 (mean (SD)) 0.06 (0.03) 0.05 (0.03) 0.07 (0.04) 0.447
## rate1 (mean (SD)) 0.04 (0.02) 0.04 (0.02) 0.05 (0.03) 0.447
## T (mean (SD)) 23.08 (27.98) 23.81 (30.45) 21.96 (23.67) 0.068
## C (mean (SD)) 107.33 (132.05) 113.97 (137.21) 97.11 (123.18) 0.129
## time (mean (SD)) 352.93 (55.82) 352.07 (57.00) 354.24 (54.01) 0.039
## event (mean (SD)) 0.06 (0.23) 0.06 (0.24) 0.05 (0.21) 0.068
## S1yr (mean (SD)) 0.94 (0.23) 0.94 (0.24) 0.95 (0.21) 0.068
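Because the simulated data retain the counterfactual columns, the ATE, ATT, and confounding check described above can be computed directly. This is a sketch using the pY0, pY1, and A columns shown in the summary:
## ATE, ATT, and a confounding check computed from the counterfactuals.
as_tibble(data1) %>%
    summarize(ATE = mean(pY1) - mean(pY0),
              ATT = mean(pY1[A == 1]) - mean(pY0[A == 1]),
              ## A nonzero group difference in pY0 indicates confounding.
              `Confounding in pY0` = mean(pY0[A == 1]) - mean(pY0[A == 0]))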
The data structure is a regular data frame. To ease modeling with Stan, create a model matrix.
## Each row is c(1, A, X1, X2)
AX <- model.matrix( ~ A + X1 + X2, data = data1)
head(AX)
## (Intercept) A X1 X2
## 1 1 1 0.03849597 0
## 2 1 0 -2.14297526 0
## 3 1 0 0.51696808 0
## 4 1 0 1.28046055 0
## 5 1 0 1.43093630 0
## 6 1 0 1.73915419 0
## Each row is c(1, X1, X2)
X <- model.matrix( ~ X1 + X2, data = data1)
head(X)
## (Intercept) X1 X2
## 1 1 0.03849597 0
## 2 1 -2.14297526 0
## 3 1 0.51696808 0
## 4 1 1.28046055 0
## 5 1 1.43093630 0
## 6 1 1.73915419 0
## Outcome vector
y <- data1$Y
## Treatment vector
A <- data1$A
The true conditional effect is the coefficient betaA (log OR). The true marginal causal effects are the following.
## ATE
as_tibble(data1) %>%
summarize(`True RD` = mean(pY1) - mean(pY0),
`True RR` = mean(pY1) / mean(pY0))
## # A tibble: 1 x 2
## `True RD` `True RR`
## <dbl> <dbl>
## 1 -0.0650 0.834
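Note that the conditional effect (betaA, a log OR) and the marginal effect differ in general because the OR is noncollapsible. A quick sketch of the contrast using the same counterfactual columns:
## Conditional OR implied by betaA versus the marginal OR. The OR is
## noncollapsible, so these need not agree even in the absence of confounding.
as_tibble(data1) %>%
    summarize(`Conditional OR` = exp(-0.3),
              `Marginal OR` = (mean(pY1) / (1 - mean(pY1))) /
                  (mean(pY0) / (1 - mean(pY0))))
Next we compile the logistic outcome model.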
logit_stan <- rstan::stan_model("./bayesian_causal1_logistic.stan")
logit_stan
## S4 class stanmodel 'bayesian_causal1_logistic' coded as follows:
## data {
## // Number of parameters including intercept
## int<lower=1> p;
## // Hyperparameters
## real beta_mean[p];
## real<lower=0> beta_sd[p];
## // Number of observations
## int<lower=1> N;
## // Binary outcome
## int<lower=0,upper=1> y[N];
## // Model matrix
## matrix[N,p] X;
## // Whether to use likelihood
## int<lower=0,upper=1> use_lik;
## }
##
## transformed data {
##
## }
##
## parameters {
## // Real vector of p dimension.
## vector[p] beta;
## }
##
## transformed parameters {
## // Linear predictor
## vector[N] eta = X * beta;
## }
##
## model {
## // Prior
## for (j in 1:p) {
## target += normal_lpdf(beta[j] | beta_mean[j], beta_sd[j]);
## }
##
## // Likelihood for logistic model
## // http://modernstatisticalworkflow.blogspot.com/2017/04/an-easy-way-to-simulate-fake-data-from.html?m=1
## if (use_lik == 1) {
## for (i in 1:N) {
## // https://mc-stan.org/docs/2_19/functions-reference/bernoulli-logit-distribution.html
## target += bernoulli_logit_lpmf(y[i] | eta[i]);
## }
## }
## }
##
## generated quantities {
## vector[N] log_lik;
## int<lower=0,upper=1> y_rep[N];
##
## for (i in 1:N) {
## // Observation level log likelihood
## log_lik[i] = bernoulli_logit_lpmf(y[i] | eta[i]);
## // Predicted (note these are prediction wrt observed assignment)
## y_rep[i] = bernoulli_logit_rng(eta[i]);
## }
## }
logit_stan_fit_prior <-
rstan::sampling(logit_stan,
data = list(p = ncol(AX),
beta_mean = c(0, 0, 0, 0),
beta_sd = c(10, 5, 5, 5),
N = nrow(AX),
y = y,
X = AX,
use_lik = 0))
## Check HMC diagnostics after sampling
rstan::check_hmc_diagnostics(logit_stan_fit_prior)
##
## Divergences:
##
## Tree depth:
##
## Energy:
## Specify relevant parameters
pars <- c("beta","lp__")
## Print a summary for a fitted model represented by a 'stanfit' object
print(logit_stan_fit_prior, pars = pars)
## Inference for Stan model: bayesian_causal1_logistic.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## beta[1] -0.16 0.16 9.79 -19.96 -6.68 -0.18 6.35 18.96 3641 1
## beta[2] 0.03 0.08 4.96 -9.97 -3.34 0.04 3.34 9.94 3887 1
## beta[3] 0.04 0.08 4.92 -9.31 -3.29 -0.02 3.39 9.91 4161 1
## beta[4] 0.09 0.08 5.09 -9.79 -3.21 -0.01 3.38 10.26 4487 1
## lp__ -12.78 0.03 1.38 -16.18 -13.44 -12.48 -11.77 -11.04 1776 1
##
## Samples were drawn using NUTS(diag_e) at Thu Jul 4 05:58:33 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
## Create a matrix of output plots from a 'stanfit' object
pairs(logit_stan_fit_prior, pars = pars)
## Markov chain traceplots
rstan::traceplot(logit_stan_fit_prior, pars = pars, inc_warmup = FALSE)
## Trace plots of MCMC draws
regex_pars <- c("beta")
bayesplot::mcmc_rank_hist(logit_stan_fit_prior, regex_pars = regex_pars, ref_line = TRUE)
bayesplot::mcmc_rank_overlay(logit_stan_fit_prior, regex_pars = regex_pars, ref_line = TRUE)
Visualize the conditional treatment effect.
bayesplot::mcmc_areas(as.matrix(logit_stan_fit_prior),
pars = c("beta[2]"),
prob = 0.9)
exp(10) ≈ 22,026, so this prior may be too diffuse and does not regularize much.
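As a quick check of the implied scale, we can draw from the N(0, 5) prior on the treatment coefficient and look at the implied odds ratios (a base-R sketch, separate from the fitted model):
## Implied prior on the treatment OR under beta[2] ~ N(0, 5).
quantile(exp(rnorm(4000, mean = 0, sd = 5)), probs = c(0.025, 0.5, 0.975))
We can further check the prior predictive distribution.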
y_rep <- as.matrix(logit_stan_fit_prior, pars = c("y_rep"))
ppc_dens_overlay(y, y_rep[sample(seq_len(nrow(y_rep)), 200),])
ppc_stat(y = y, yrep = y_rep, stat = mean)
In this binary outcome setting, the density overlay is not very informative. We can instead compare the mean (the proportion of 1s) as a summary statistic. At this prior-checking stage, its distribution should simply be broad enough to cover all plausible values. ppc_stat also displays the observed data summary, but that should be ignored here.
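For a 0/1 outcome, a bar plot of outcome frequencies may be a more natural display. Here is a minimal sketch using bayesplot::ppc_bars (not part of the original workflow):
## Compare observed and prior predictive outcome frequencies.
bayesplot::ppc_bars(y, y_rep[sample(seq_len(nrow(y_rep)), 200), ])
We now fit the same model with the likelihood enabled (use_lik = 1).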
logit_stan_fit <-
rstan::sampling(logit_stan,
data = list(p = ncol(AX),
beta_mean = c(0, 0, 0, 0),
beta_sd = c(10, 5, 5, 5),
N = nrow(AX),
y = y,
X = AX,
use_lik = 1))
## Check HMC diagnostics after sampling
rstan::check_hmc_diagnostics(logit_stan_fit)
##
## Divergences:
##
## Tree depth:
##
## Energy:
## Specify relevant parameters
pars <- c("beta","lp__")
## Print a summary for a fitted model represented by a 'stanfit' object
print(logit_stan_fit, pars = pars)
## Inference for Stan model: bayesian_causal1_logistic.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## beta[1] -0.46 0.00 0.09 -0.63 -0.52 -0.46 -0.40 -0.28 3470 1
## beta[2] -0.35 0.00 0.14 -0.62 -0.44 -0.34 -0.25 -0.08 3610 1
## beta[3] 0.47 0.00 0.07 0.33 0.42 0.47 0.52 0.61 3592 1
## beta[4] 1.03 0.00 0.28 0.47 0.84 1.03 1.21 1.59 4038 1
## lp__ -646.13 0.03 1.42 -649.71 -646.78 -645.83 -645.10 -644.37 1690 1
##
## Samples were drawn using NUTS(diag_e) at Thu Jul 4 05:59:17 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
## Create a matrix of output plots from a 'stanfit' object
pairs(logit_stan_fit, pars = pars)
## Markov chain traceplots
rstan::traceplot(logit_stan_fit, pars = pars, inc_warmup = FALSE)
## Trace plots of MCMC draws
regex_pars <- c("beta")
bayesplot::mcmc_rank_hist(logit_stan_fit, regex_pars = regex_pars, ref_line = TRUE)
bayesplot::mcmc_rank_overlay(logit_stan_fit, regex_pars = regex_pars, ref_line = TRUE)
Visualize the conditional treatment effect.
bayesplot::mcmc_areas(as.matrix(logit_stan_fit),
pars = c("beta[2]"),
prob = 0.9)
Now we can examine the posterior predictive distribution.
y_rep <- as.matrix(logit_stan_fit, pars = c("y_rep"))
ppc_dens_overlay(y, y_rep[sample(seq_len(nrow(y_rep)), 200),])
ppc_stat(y = y, yrep = y_rep, stat = mean)
ppc_stat shows the summary of the data, which is nicely covered by the posterior predictive counterparts.
Here we simply marginalize over the observed distribution of covariates because our treatment variable is time-fixed. When the treatment variable is time-varying, the treatment strategies (regimes) can involve specifying treatment values at multiple time points. This usually gives rise to time-varying confounders that are affected by previous treatment, in which case post-baseline covariates also need to be modeled. Here we are in the simpler setting.
logit_gf_stan <- rstan::stan_model("./bayesian_causal1_logistic_gf.stan")
logit_gf_stan
## S4 class stanmodel 'bayesian_causal1_logistic_gf' coded as follows:
## data {
## // Number of parameters including intercept
## int<lower=1> p;
## // Hyperparameters
## real beta_mean[p];
## real<lower=0> beta_sd[p];
## // Number of observations
## int<lower=1> N;
## // Binary outcome
## int<lower=0,upper=1> y[N];
## // Model matrix
## matrix[N,p] X;
## // Counterfactual model matrix
## int<lower=1> N_new;
## matrix[N_new,p] X0;
## matrix[N_new,p] X1;
## // Whether to evaluate likelihood
## int<lower=0,upper=1> use_lik;
## }
##
## transformed data {
##
## }
##
## parameters {
## // Real vector of p dimension.
## vector[p] beta;
## }
##
## transformed parameters {
## // Linear predictor
## vector[N] eta = X * beta;
## }
##
## model {
## // Prior
## for (j in 1:p) {
## target += normal_lpdf(beta[j] | beta_mean[j], beta_sd[j]);
## }
##
## // Likelihood for logistic model
## if (use_lik == 1) {
## for (i in 1:N) {
## // https://mc-stan.org/docs/2_19/functions-reference/bernoulli-logit-distribution.html
## target += bernoulli_logit_lpmf(y[i] | eta[i]);
## }
## }
## }
##
## generated quantities {
## // Counterfactual probability of outcomes
## vector[N_new] pY0 = inv_logit(X0 * beta);
## vector[N_new] pY1 = inv_logit(X1 * beta);
## // Counterfactual risk difference
## real rd = mean(pY1) - mean(pY0);
## // Counterfactual risk ratio
## real<lower=0> rr = mean(pY1) / mean(pY0);
##
## // Other elements
## vector[N] log_lik;
## int<lower=0,upper=1> y_rep[N];
##
## for (i in 1:N) {
## // Observation level log likelihood
## log_lik[i] = bernoulli_logit_lpmf(y[i] | eta[i]);
## // Predicted (note these are prediction wrt observed assignment)
## y_rep[i] = bernoulli_logit_rng(eta[i]);
## }
## }
To ease the posterior prediction process, here we create the counterfactual design matrices. The treatment variable, and any terms associated with it if present, are manipulated.
## Counterfactual treatment assignment
data1$A0 <- 0
data1$A1 <- 1
## For ATE estimation, use all data but with counterfactual assignments.
AX0_ate <- model.matrix( ~ A0 + X1 + X2, data = data1)
AX1_ate <- model.matrix( ~ A1 + X1 + X2, data = data1)
## For ATT estimation
AX0_att <- model.matrix( ~ A0 + X1 + X2, data = subset(data1, A == 1))
AX1_att <- model.matrix( ~ A1 + X1 + X2, data = subset(data1, A == 1))
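As an aside, the same marginalization can be carried out in R from the posterior draws of the earlier fit; the generated quantities block of logit_gf_stan automates exactly this. A sketch (plogis is the inverse logit):
## Manual g-formula: average the predicted risks over the observed covariate
## rows under each counterfactual treatment, for each posterior draw of beta.
beta_draws <- as.matrix(logit_stan_fit, pars = "beta")
rd_draws <- apply(beta_draws, 1, function(b) {
    mean(plogis(AX1_ate %*% b)) - mean(plogis(AX0_ate %*% b))
})
quantile(rd_draws, probs = c(0.025, 0.5, 0.975))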
logit_gf_stan_ate_fit_prior <-
rstan::sampling(logit_gf_stan,
data = list(p = ncol(AX),
beta_mean = c(0, 0, 0, 0),
beta_sd = c(10, 5, 5, 5),
N = nrow(AX),
y = y,
X = AX,
N_new = nrow(AX0_ate),
X0 = AX0_ate,
X1 = AX1_ate,
use_lik = 0))
## Check HMC diagnostics after sampling
rstan::check_hmc_diagnostics(logit_gf_stan_ate_fit_prior)
##
## Divergences:
##
## Tree depth:
##
## Energy:
## Specify relevant parameters
pars <- c("beta","lp__")
## Print a summary for a fitted model represented by a 'stanfit' object
print(logit_gf_stan_ate_fit_prior, pars = pars)
## Inference for Stan model: bayesian_causal1_logistic_gf.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## beta[1] -0.02 0.14 10.03 -19.33 -7.07 0.23 6.96 19.05 4792 1
## beta[2] -0.15 0.07 5.03 -10.04 -3.65 -0.11 3.35 9.54 4795 1
## beta[3] 0.05 0.07 4.94 -9.73 -3.31 0.12 3.51 9.43 4847 1
## beta[4] -0.05 0.07 4.94 -9.77 -3.35 -0.01 3.32 9.37 4843 1
## lp__ -12.79 0.03 1.40 -16.26 -13.46 -12.49 -11.76 -11.08 2098 1
##
## Samples were drawn using NUTS(diag_e) at Thu Jul 4 06:00:38 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
## Create a matrix of output plots from a 'stanfit' object
pairs(logit_gf_stan_ate_fit_prior, pars = pars)
## Markov chain traceplots
rstan::traceplot(logit_gf_stan_ate_fit_prior, pars = pars, inc_warmup = FALSE)
## Trace plots of MCMC draws
regex_pars <- c("beta")
bayesplot::mcmc_rank_hist(logit_gf_stan_ate_fit_prior, regex_pars = regex_pars, ref_line = TRUE)
bayesplot::mcmc_rank_overlay(logit_gf_stan_ate_fit_prior, regex_pars = regex_pars, ref_line = TRUE)
Visualize the marginal treatment effect.
bayesplot::mcmc_areas(as.matrix(logit_gf_stan_ate_fit_prior),
pars = c("rd"),
prob = 0.9)
bayesplot::mcmc_areas(as.matrix(logit_gf_stan_ate_fit_prior),
pars = c("rr"),
prob = 0.9)
bayesplot::mcmc_areas(as.matrix(logit_gf_stan_ate_fit_prior),
pars = c("rr"),
prob = 0.9) +
geom_vline(xintercept = 1, alpha = 0.1) +
scale_x_continuous(limits = c(0, 25))
## Warning: Removed 1 rows containing missing values (geom_segment).
We do not directly specify priors on the marginal treatment effects; these are implied by the priors on the coefficients, so there may be value in checking the implied priors. Interestingly, the implied prior on the risk difference is quite concentrated around zero difference, although the range essentially covers all possible values. The implied prior on the risk ratio also concentrates at 1 (zero difference), although the range is very wide. The second peak near 0 is likely an artifact of the asymmetric scale (versus the symmetric log RR scale).
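The implied prior quantiles can also be extracted for a numeric check (a sketch):
## Quantiles of the implied priors on the marginal effects.
prior_me <- as.data.frame(logit_gf_stan_ate_fit_prior, pars = c("rd", "rr"))
sapply(prior_me, quantile, probs = c(0.025, 0.5, 0.975))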
We estimate the posterior for the ATE here.
logit_gf_stan_ate_fit <-
rstan::sampling(logit_gf_stan,
data = list(p = ncol(AX),
beta_mean = c(0, 0, 0, 0),
beta_sd = c(10, 5, 5, 5),
N = nrow(AX),
y = y,
X = AX,
N_new = nrow(AX0_ate),
X0 = AX0_ate,
X1 = AX1_ate,
use_lik = 1))
We first check the soundness of the HMC diagnostics.
## Check HMC diagnostics after sampling
rstan::check_hmc_diagnostics(logit_gf_stan_ate_fit)
##
## Divergences:
##
## Tree depth:
##
## Energy:
## Specify relevant parameters
pars <- c("beta","lp__")
## Print a summary for a fitted model represented by a 'stanfit' object
print(logit_gf_stan_ate_fit, pars = pars)
## Inference for Stan model: bayesian_causal1_logistic_gf.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## beta[1] -0.46 0.00 0.09 -0.63 -0.52 -0.46 -0.40 -0.29 3511 1
## beta[2] -0.35 0.00 0.15 -0.63 -0.45 -0.35 -0.25 -0.05 3345 1
## beta[3] 0.47 0.00 0.07 0.33 0.42 0.47 0.52 0.61 3393 1
## beta[4] 1.02 0.00 0.29 0.45 0.83 1.03 1.22 1.57 3759 1
## lp__ -646.18 0.03 1.43 -649.74 -646.88 -645.86 -645.11 -644.39 1806 1
##
## Samples were drawn using NUTS(diag_e) at Thu Jul 4 06:02:41 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
## Create a matrix of output plots from a 'stanfit' object
pairs(logit_gf_stan_ate_fit, pars = pars)
## Markov chain traceplots
rstan::traceplot(logit_gf_stan_ate_fit, pars = pars, inc_warmup = FALSE)
## Trace plots of MCMC draws
regex_pars <- c("beta")
bayesplot::mcmc_rank_hist(logit_gf_stan_ate_fit, regex_pars = regex_pars, ref_line = TRUE)
bayesplot::mcmc_rank_overlay(logit_gf_stan_ate_fit, regex_pars = regex_pars, ref_line = TRUE)
The results we care about are not the posteriors for the coefficients, but the ATE.
print(logit_gf_stan_ate_fit, pars = c("rd","rr"))
## Inference for Stan model: bayesian_causal1_logistic_gf.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## rd -0.08 0 0.03 -0.13 -0.10 -0.08 -0.05 -0.01 3326 1
## rr 0.82 0 0.07 0.69 0.77 0.81 0.86 0.97 3425 1
##
## Samples were drawn using NUTS(diag_e) at Thu Jul 4 06:02:41 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
bayesplot::mcmc_areas(as.matrix(logit_gf_stan_ate_fit),
pars = c("rd"),
prob = 0.9)
bayesplot::mcmc_areas(as.matrix(logit_gf_stan_ate_fit),
pars = c("rr"),
prob = 0.9)
For more precise results, we can extract the posterior draws.
post_ate <- as.data.frame(logit_gf_stan_ate_fit, pars = c("rd","rr")) %>%
as_tibble
post_ate %>%
summarize(`P[RD < 0]` = mean(rd < 0),
`P[RD < -0.05]` = mean(rd < -0.05),
`P[RR < 1.0]` = mean(rr < 1.0),
`P[RR < 0.9]` = mean(rr < 0.9)) %>%
gather(key = Statement, value = `Posterior Prob.`)
## # A tibble: 4 x 2
## Statement `Posterior Prob.`
## <chr> <dbl>
## 1 P[RD < 0] 0.990
## 2 P[RD < -0.05] 0.790
## 3 P[RR < 1.0] 0.990
## 4 P[RR < 0.9] 0.872
The last row is the posterior probability that the counterfactual risk ratio is less than 0.9 (a greater-than-10% risk reduction under hypothetical intervention on everyone).
We estimate the posterior for the ATT here. We need to refit the model if we want Stan to calculate the predictions for the manipulated design matrices restricted to the treated individuals.
logit_gf_stan_att_fit <-
rstan::sampling(logit_gf_stan,
data = list(p = ncol(AX),
beta_mean = c(0, 0, 0, 0),
beta_sd = c(10, 5, 5, 5),
N = nrow(AX),
y = y,
X = AX,
N_new = nrow(AX0_att),
X0 = AX0_att,
X1 = AX1_att,
use_lik = 1))
We first check the soundness of the HMC diagnostics.
## Check HMC diagnostics after sampling
rstan::check_hmc_diagnostics(logit_gf_stan_att_fit)
##
## Divergences:
##
## Tree depth:
##
## Energy:
## Specify relevant parameters
pars <- c("beta","lp__")
## Print a summary for a fitted model represented by a 'stanfit' object
print(logit_gf_stan_att_fit, pars = pars)
## Inference for Stan model: bayesian_causal1_logistic_gf.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## beta[1] -0.46 0.00 0.09 -0.63 -0.51 -0.46 -0.40 -0.29 2974 1
## beta[2] -0.35 0.00 0.14 -0.62 -0.44 -0.35 -0.25 -0.07 3373 1
## beta[3] 0.47 0.00 0.07 0.33 0.42 0.47 0.52 0.61 3812 1
## beta[4] 1.03 0.00 0.28 0.47 0.83 1.02 1.22 1.58 4074 1
## lp__ -646.13 0.03 1.45 -649.83 -646.83 -645.76 -645.08 -644.36 1717 1
##
## Samples were drawn using NUTS(diag_e) at Thu Jul 4 06:04:13 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
## Create a matrix of output plots from a 'stanfit' object
pairs(logit_gf_stan_att_fit, pars = pars)
## Markov chain traceplots
rstan::traceplot(logit_gf_stan_att_fit, pars = pars, inc_warmup = FALSE)
## Trace plots of MCMC draws
regex_pars <- c("beta")
bayesplot::mcmc_rank_hist(logit_gf_stan_att_fit, regex_pars = regex_pars, ref_line = TRUE)
bayesplot::mcmc_rank_overlay(logit_gf_stan_att_fit, regex_pars = regex_pars, ref_line = TRUE)
The results we care about are not the posteriors for the coefficients, but the ATT.
print(logit_gf_stan_att_fit, pars = c("rd","rr"))
## Inference for Stan model: bayesian_causal1_logistic_gf.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## rd -0.08 0 0.03 -0.14 -0.10 -0.08 -0.06 -0.02 3345 1
## rr 0.82 0 0.07 0.71 0.78 0.82 0.87 0.96 3413 1
##
## Samples were drawn using NUTS(diag_e) at Thu Jul 4 06:04:13 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
bayesplot::mcmc_areas(as.matrix(logit_gf_stan_att_fit),
pars = c("rd"),
prob = 0.9)
bayesplot::mcmc_areas(as.matrix(logit_gf_stan_att_fit),
pars = c("rr"),
prob = 0.9)
For more precise results, we can extract the posterior draws.
post_att <- as.data.frame(logit_gf_stan_att_fit, pars = c("rd","rr")) %>%
as_tibble
post_att %>%
summarize(`P[RD < 0]` = mean(rd < 0),
`P[RD < -0.05]` = mean(rd < -0.05),
`P[RR < 1.0]` = mean(rr < 1.0),
`P[RR < 0.9]` = mean(rr < 0.9)) %>%
gather(key = Statement, value = `Posterior Prob.`)
## # A tibble: 4 x 2
## Statement `Posterior Prob.`
## <chr> <dbl>
## 1 P[RD < 0] 0.992
## 2 P[RD < -0.05] 0.822
## 3 P[RR < 1.0] 0.992
## 4 P[RR < 0.9] 0.880
The last row is the posterior probability that the counterfactual risk ratio is less than 0.9 (a greater-than-10% risk reduction under hypothetical intervention on the treated versus "untreating" them). We did not introduce effect (measure) modification into the data generating mechanism or the design matrix, so the results should be similar to the ATE.
Bayesian approaches to the propensity score (PS) are an evolving area of methodological research. Originally, joint simultaneous modeling of the PS model and the outcome model was proposed. It was later realized that this allows feedback from the outcome model into the PS model, which is considered inconsistent with the separation of the design stage from the analysis stage. We follow the posterior predictive treatment assignment (PPTA) method.
First, we obtain posterior predictive quantities from the treatment (PS) model: the posterior predictive PS (the distribution of the estimated PS) and, for each PS, the PPTA, a Bernoulli realization (a binary variable governed by the PS). The inclusion indicator \(S_{i}\) is defined as the indicator of disagreement between the PPTA and the actual treatment assignment. This amounts to a Bernoulli realization of the overlap weights.
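To see the connection with the overlap weights, consider the selection probability implied by the disagreement rule (simple arithmetic):
## For an individual with propensity score ps,
##   P(S = 1 | A = 1) = P(A_tilde = 0) = 1 - ps  (overlap weight for the treated)
##   P(S = 1 | A = 0) = P(A_tilde = 1) = ps      (overlap weight for the untreated)
## For example, with ps = 0.3:
c(`P[S=1|A=1]` = 1 - 0.3, `P[S=1|A=0]` = 0.3)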
ppta1_stan <- rstan::stan_model("./bayesian_causal1_ppta1.stan")
ppta1_stan
## S4 class stanmodel 'bayesian_causal1_ppta1' coded as follows:
## data {
## // Number of parameters
## int<lower=1> p;
## // Hyperparameters
## real alpha_mean[p];
## real<lower=0> alpha_sd[p];
## // Number of observations
## int<lower=1> N;
## // Whether to evaluate likelihood
## int<lower=0,upper=1> use_lik;
## // Design matrix
## matrix[N,p] X;
## // Treatment assignment
## int<lower=0,upper=1> A[N];
## }
##
## transformed data {
##
## }
##
## parameters {
## // PS model parameters
## vector[p] alpha;
## }
##
## transformed parameters {
## // PS linear predictor
## vector[N] ps_lp = X * alpha;
## // PS
## vector[N] ps = inv_logit(ps_lp);
## }
##
## model {
## // Priors
## for (j in 1:p) {
## target += normal_lpdf(alpha[j] | alpha_mean[j], alpha_sd[j]);
## }
##
## // Likelihood: This is the likelihood of the treatment mechanism.
## if (use_lik == 1) {
## for (i in 1:N) {
## target += bernoulli_logit_lpmf(A[i] | ps_lp[i]);
## }
## }
## }
##
## generated quantities {
## // For loo
## vector[N] log_lik;
## // For posterior predictive treatment assignment (PPTA)
## int<lower=0,upper=1> A_tilde[N];
## // Selection indicator
## int<lower=0,upper=1> S[N];
##
## for (i in 1:N) {
## // Observation level log likelihood
## log_lik[i] = bernoulli_logit_lpmf(A[i] | ps_lp[i]);
## // PPTA = Bernoulli realization of PS
## A_tilde[i] = bernoulli_logit_rng(ps_lp[i]);
## // Stochastic inclusion if PPTA differs from treatment assignment
## // S = Bernoulli realization of estimated overlap weights
## S[i] = (A_tilde[i] != A[i]);
## }
## }
ppta1_stan_fit <-
rstan::sampling(ppta1_stan,
data = list(p = ncol(X),
alpha_mean = c(0,0,0),
alpha_sd = c(10, 5, 5),
N = nrow(X),
X = X,
A = A,
use_lik = 1))
The MCMC diagnostics look fine.
## Check HMC diagnostics after sampling
rstan::check_hmc_diagnostics(ppta1_stan_fit)
##
## Divergences:
##
## Tree depth:
##
## Energy:
## Specify relevant parameters
pars <- c("alpha","lp__")
## Print a summary for a fitted model represented by a 'stanfit' object
print(ppta1_stan_fit, pars = pars)
## Inference for Stan model: bayesian_causal1_ppta1.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## alpha[1] -0.50 0.00 0.07 -0.63 -0.54 -0.50 -0.45 -0.36 3758 1
## alpha[2] 0.48 0.00 0.07 0.34 0.43 0.47 0.52 0.61 3177 1
## alpha[3] 0.73 0.00 0.28 0.19 0.54 0.74 0.92 1.29 3619 1
## lp__ -651.99 0.03 1.18 -655.09 -652.55 -651.69 -651.11 -650.62 1885 1
##
## Samples were drawn using NUTS(diag_e) at Thu Jul 4 06:05:54 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
## Create a matrix of output plots from a 'stanfit' object
pairs(ppta1_stan_fit, pars = pars)
## Markov chain traceplots
rstan::traceplot(ppta1_stan_fit, pars = pars, inc_warmup = FALSE)
## Trace plots of MCMC draws
regex_pars <- c("alpha")
bayesplot::mcmc_rank_hist(ppta1_stan_fit, regex_pars = regex_pars, ref_line = TRUE)
bayesplot::mcmc_rank_overlay(ppta1_stan_fit, regex_pars = regex_pars, ref_line = TRUE)
Now we construct data for the analysis stage using the posterior predictive data.
data2 <- as.data.frame(ppta1_stan_fit, pars = "S") %>%
as_tibble() %>%
mutate(iter = seq_len(n())) %>%
gather(key = id, value = S, -iter) %>%
mutate(id = gsub("S|\\[|\\]","",id) %>%
as.integer()) %>%
## Only keep selected rows.
filter(S == 1) %>%
left_join(as_tibble(data1) %>%
## We still need covariates.
select(id, X1, X2, A, Y),
by = "id") %>%
arrange(iter, id) %>%
select(-S)
data2
## # A tibble: 1,805,618 x 6
## iter id X1 X2 A Y
## <int> <int> <dbl> <int> <int> <int>
## 1 1 1 0.0385 0 1 0
## 2 1 3 0.517 0 0 0
## 3 1 4 1.28 0 0 1
## 4 1 5 1.43 0 0 0
## 5 1 6 1.74 0 0 0
## 6 1 7 -0.308 1 0 1
## 7 1 11 0.502 0 0 1
## 8 1 12 0.604 0 1 1
## 9 1 15 1.60 0 0 1
## 10 1 17 0.785 0 0 0
## # … with 1,805,608 more rows
data2_nest <- data2 %>%
select(-id) %>%
group_by(iter) %>%
nest()
data2_nest
## # A tibble: 4,000 x 2
## iter data
## <int> <list>
## 1 1 <tibble [425 × 4]>
## 2 2 <tibble [435 × 4]>
## 3 3 <tibble [432 × 4]>
## 4 4 <tibble [459 × 4]>
## 5 5 <tibble [427 × 4]>
## 6 6 <tibble [464 × 4]>
## 7 7 <tibble [458 × 4]>
## 8 8 <tibble [446 × 4]>
## 9 9 <tibble [452 × 4]>
## 10 10 <tibble [430 × 4]>
## # … with 3,990 more rows
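Each posterior draw selects roughly 450 of the 1,000 individuals (1,805,618 / 4,000 ≈ 451), which we can verify directly:
## Average number of selected individuals per posterior draw.
data2 %>%
    count(iter) %>%
    summarize(`Mean subset size` = mean(n))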
In the analysis stage, an outcome model is fit within each posterior predictive study design (i.e., the individuals selected under each realization of the PS). Interestingly, we still need to condition on the covariates to maintain a valid factorization of the joint distribution. Here we additionally try marginalizing the conditional effect over the observed distribution of covariates.
data2_nest_fit <-
data2_nest %>%
## Random sample some iterations
filter(iter %in% sample(iter, size = 10)) %>%
mutate(fit = map(data, function(d) {
## Selected versions
AX_S <- model.matrix( ~ A + X1 + X2, data = d)
## Each row is c(1, X1, X2)
X_S <- model.matrix( ~ X1 + X2, data = d)
## Outcome vector
y_S <- d$Y
## Fit
rstan::sampling(logit_gf_stan,
data = list(p = ncol(AX_S),
beta_mean = c(0, 0, 0, 0),
beta_sd = c(10, 5, 5, 5),
N = nrow(AX_S),
y = y_S,
X = AX_S,
## Try marginalizing over the original X
## This is not in PPTA paper.
N_new = nrow(AX),
X0 = AX0_ate,
X1 = AX1_ate,
use_lik = 1))
}))
We still obtain the conditional effect \(\beta_{2}\), but estimated within a subset of treated and untreated individuals who are balanced in covariates. This may confer some kind of double robustness, although the PPTA paper does not mention this. The marginalization over the original covariate distribution is also not something the PPTA paper did; it extrapolates inference based on the overlap subset to the entire population. Perhaps the marginalization should instead be with respect to each selected subset, as sketched below.
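A sketch of that alternative (hypothetical; not what the PPTA paper does): rebuild the counterfactual matrices from each selected subset d and pass those to Stan. The function name fit_subset_marginal is invented here for illustration.
## Hypothetical variant of the map() body above that marginalizes over each
## selected subset instead of the original covariate distribution.
fit_subset_marginal <- function(d) {
    AX_S <- model.matrix( ~ A + X1 + X2, data = d)
    ## Counterfactual assignments within the subset.
    d$A0 <- 0
    d$A1 <- 1
    X0_S <- model.matrix( ~ A0 + X1 + X2, data = d)
    X1_S <- model.matrix( ~ A1 + X1 + X2, data = d)
    rstan::sampling(logit_gf_stan,
                    data = list(p = ncol(AX_S),
                                beta_mean = c(0, 0, 0, 0),
                                beta_sd = c(10, 5, 5, 5),
                                N = nrow(AX_S),
                                y = d$Y,
                                X = AX_S,
                                N_new = nrow(X0_S),
                                X0 = X0_S,
                                X1 = X1_S,
                                use_lik = 1))
}
Below we proceed with the full-data marginalization actually used above.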
post_ppta <- data2_nest_fit %>%
mutate(fit2 = map(fit, as.data.frame, pars = c("beta[2]","rd","rr"))) %>%
select(-fit, -data) %>%
unnest()
## Conditional effect is what the PPTA paper does.
ggplot(data = post_ppta, mapping = aes(x = `beta[2]`)) +
geom_density() +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5),
legend.key = element_blank(),
plot.title = element_text(hjust = 0.5),
strip.background = element_blank())
## These two are marginalized over the original X.
ggplot(data = post_ppta, mapping = aes(x = rd)) +
geom_density() +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5),
legend.key = element_blank(),
plot.title = element_text(hjust = 0.5),
strip.background = element_blank())
ggplot(data = post_ppta, mapping = aes(x = rr)) +
geom_density() +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5),
legend.key = element_blank(),
plot.title = element_text(hjust = 0.5),
strip.background = element_blank())
post_ppta %>%
summarize(`P[RD < 0]` = mean(rd < 0),
`P[RD < -0.05]` = mean(rd < -0.05),
`P[RR < 1.0]` = mean(rr < 1.0),
`P[RR < 0.9]` = mean(rr < 0.9)) %>%
gather(key = Statement, value = `Posterior Prob.`)
## # A tibble: 4 x 2
## Statement `Posterior Prob.`
## <chr> <dbl>
## 1 P[RD < 0] 0.893
## 2 P[RD < -0.05] 0.627
## 3 P[RR < 1.0] 0.893
## 4 P[RR < 0.9] 0.710
print(sessionInfo())
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] bayesplot_1.7.0 rstan_2.18.2 StanHeaders_2.18.1-10 distributed_0.3.0 forcats_0.4.0
## [6] stringr_1.4.0 dplyr_0.8.1 purrr_0.3.2 readr_1.3.1 tidyr_0.8.3
## [11] tibble_2.1.3 ggplot2_3.2.0 tidyverse_1.2.1 doRNG_1.7.1 rngtools_1.3.1.1
## [16] pkgmaker_0.27 registry_0.5-1 doParallel_1.0.14 iterators_1.0.10 foreach_1.4.4
## [21] knitr_1.23
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-140 matrixStats_0.54.0 lubridate_1.7.4 httr_1.4.0 tools_3.6.0
## [6] backports_1.1.4 utf8_1.1.4 R6_2.4.0 KernSmooth_2.23-15 DBI_1.0.0
## [11] lazyeval_0.2.2 colorspace_1.4-1 withr_2.1.2 tidyselect_0.2.5 gridExtra_2.3
## [16] prettyunits_1.0.2 processx_3.3.1 compiler_3.6.0 cli_1.1.0 rvest_0.3.4
## [21] xml2_1.2.0 labeling_0.3 scales_1.0.0 ggridges_0.5.1 callr_3.2.0
## [26] digest_0.6.19 rmarkdown_1.13 pkgconfig_2.0.2 htmltools_0.3.6 bibtex_0.4.2
## [31] labelled_2.2.1 rlang_0.4.0 readxl_1.3.1 rstudioapi_0.10 generics_0.0.2
## [36] jsonlite_1.6 inline_0.3.15 magrittr_1.5 loo_2.1.0 Matrix_1.2-17
## [41] Rcpp_1.0.1 munsell_0.5.0 fansi_0.4.0 stringi_1.4.3 yaml_2.2.0
## [46] pkgbuild_1.0.3 plyr_1.8.4 grid_3.6.0 crayon_1.3.4 lattice_0.20-38
## [51] haven_2.1.0 splines_3.6.0 hms_0.4.2 zeallot_0.1.0 ps_1.3.0
## [56] pillar_1.4.1 reshape2_1.4.3 codetools_0.2-16 stats4_3.6.0 glue_1.3.1
## [61] tableone_0.10.0 evaluate_0.14 mitools_2.4 modelr_0.1.4 vctrs_0.1.0
## [66] cellranger_1.1.0 gtable_0.3.0 assertthat_0.2.1 xfun_0.8 xtable_1.8-4
## [71] broom_0.5.2 survey_3.36 e1071_1.7-2 class_7.3-15 survival_2.44-1.1
## Record execution time and multicore use
end_time <- Sys.time()
diff_time <- difftime(end_time, start_time, units = "auto")
cat("Started ", as.character(start_time), "\n",
"Finished ", as.character(end_time), "\n",
"Time difference of ", diff_time, " ", attr(diff_time, "units"), "\n",
"Used ", foreach::getDoParWorkers(), " cores\n",
"Used ", foreach::getDoParName(), " as backend\n",
sep = "")
## Started 2019-07-04 05:57:52
## Finished 2019-07-04 06:06:47
## Time difference of 8.920666 mins
## Used 12 cores
## Used doParallelMC as backend