shutup = suppressPackageStartupMessages
shutup(library(bnlearn))
shutup(library(rstan))
shutup(library(tidyverse))
shutup(library(ggplot2))
shutup(library(gridExtra))
There is a principle in causal modeling called the “principle of independent mechanisms.” One of the consequences of this idea is that when you train a parametric causal model, the training result should be different in detectable ways from training an “anti-causal” model, meaning training the same parametric specification but in the non-causal direction.
It would be useful to have a Bayesian characterization of those differences. For example, it would be nice to have a “causal Bayesian cross-validation statistic” that would give a higher score to a causal model than to a non-causal model, even though statistical association flows both ways.
This experiment’s goal is to evaluate differences in the posterior samples resulting from fitting a causal and anti-causal Bayesian hierarchical model. Chapter 4 of Elements of Causal Inference by Peters, Janzing, and Scholkopf inspired my thinking on this experiment. I’m convinced that the ideas behind the “principle of independent mechanisms” must somehow translate to a hierarchical Bayesian modeling workflow, most likely in Bayesian methods for model evaluation and criticism.
The ground truth data-generating process is as follows.
\[\begin{align*} U &\sim N(0, \sigma_u) \\ X &\sim \text{Uniform}(\text{xmin},\text{xmax}) \\ Y &= f(X, U; \theta) \end{align*}\]
The causal Bayesian hierarchical model I will use to fit this process is
\[\begin{align*} \sigma_u &\sim \pi_{\sigma} \\ \theta &\sim \pi_{\theta}\\ U &\sim N(0, \sigma_u) \\ Y &= f(X, U; \theta) \end{align*}\]
Here X is taken as a constant and its likelihood is not calculated. I explain this below.
The following is the anti-causal model.
\[\begin{align*} \sigma_u &\sim \pi_{\sigma} \\ \theta &\sim \pi_{\theta}\\ U &\sim N(0, \sigma_u) \\ X &= f(Y, U; \theta) \end{align*}\]
Note that it is the wrong model; \(X = f(Y, U; \theta)\) is false. My hope is that this falsehood will emerge as features in the posterior on \(\{\sigma_u, \theta\}\) in the anti-causal case that are not present in the causal case. Further, I hope these features are detectable in a way that doesn’t require knowing ground truth.
\(X\) comes from a uniform distribution. However, I do not model X in the causal model, meaning there is no \(X \sim \text{Uniform}(xmin, xmax)\) in the model. I also do not model Y in the anti-causal model. The reason is that if the anti-causal model assumes Y comes from a uniform when the nonlinearity of \(f\) means it definitely does not, I think adding a \(Y \sim \text{Uniform}(ymin, ymax)\) statement would unfairly penalize (or favor) the anti-causal model.
I generally expect elements of the joint posterior distribution on \(\{\sigma_u, \theta\}\) to have more uncertainty in the anti-causal fit than in the causal fit. At risk of being hand-wavy, I mean variance and information entropy when I say “uncertainty”. I also expect some posterior dependence between \(\sigma_u\) and \(\theta\) in the anti-causal case. Specifically:

1. The posterior expectation of \(\sigma_u\) should be lower in the causal fit than in the anti-causal fit.
2. The entropy of the posterior predictive distribution of new exogenous values \(U_{\text{new}}\) should be lower in the causal fit than in the anti-causal fit.
3. The posterior should show dependence between \(\sigma_u\) and \(\theta\) in the anti-causal fit that is absent in the causal fit.

The first and second items are based on a paper I read once and remember vaguely. I’m not confident they hold up.
I have found it useful to think of this in terms of a causal DAG with additional variables added. The basic causal DAG is \(U \rightarrow Y \leftarrow X\). But we could expand the graph to have typed nodes, where one node type corresponds to parameters and another to out-of-sample variables. It would look something like this:
(Figure: the expanded DAG, with parameter nodes \(\theta\) and \(\sigma_u\), in-sample nodes \(X\), \(U\), and \(Y\), and out-of-sample nodes such as \(U_{\text{new}}\).)
Here \(U_{\text{new}}\) is an out-of-sample \(U\) value. So when predicting \(U_{\text{new}}\) from the posterior (i.e., from the posterior predictive distribution), you are conditioning on \(Y\) (and \(X\)). \(\theta\) and \(U_{\text{new}}\) are d-connected through the collider \(Y\) in this graph, so there should be some dependence between them. I suspect that dependence should have nice properties in the causal case that wouldn’t be apparent in the anti-causal case.
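That d-connection claim can be checked mechanically. Here is a minimal sketch using bnlearn (already loaded above); the node names are my own labels for the expanded graph, not identifiers used elsewhere in this post.

expanded_dag <- model2network(
  "[theta][sigma_u][X][U|sigma_u][Unew|sigma_u][Y|X:U:theta]"
)
# Marginally, theta and Unew are d-separated: the only path runs
# through the collider at Y, which blocks it.
dsep(expanded_dag, "theta", "Unew")       # TRUE
# Conditioning on Y opens the collider and d-connects theta and Unew.
dsep(expanded_dag, "theta", "Unew", "Y")  # FALSE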
In this analysis, I assume \(f(X, U) = g(X) + U\), i.e., a nonlinear additive noise model. Further, I assume \(g\) is a second-degree polynomial, so \(Y = aX^2 + bX + c + U\) and \(\theta = \{a, b, c\}\), where \(a\), \(b\), \(c\) are the coefficients of the polynomial.
I use Stan to implement both models.
The causal model is:
functions {
  real g(real v, real a, real b, real c) {
    return a * v^2 + b * v + c;
  }
}
data {
  int D;
  vector[D] X;
  vector[D] Y;
}
parameters {
  // u_sigma, a, b, and c get Stan's default (flat) priors.
  real<lower=0.1> u_sigma;
  real a;
  real b;
  real c;
  vector[D] U;
}
transformed parameters {
  vector[D] y_loc;
  for (i in 1:D) {
    y_loc[i] = g(X[i], a, b, c) + U[i];
  }
}
model {
  // Hierarchical normal prior on the exogenous noise U.
  target += normal_lpdf(U | 0, u_sigma);
  // Fake-Dirac trick: a tight normal stands in for Y == y_loc.
  target += normal_lpdf(Y | y_loc, .1);
}
generated quantities {
  vector[D] y_pred;
  vector[D] u_sim;
  vector[D] u_entropy;
  for (i in 1:D) {
    u_sim[i] = normal_rng(0, u_sigma);
    u_entropy[i] = -normal_lpdf(u_sim[i] | 0, u_sigma);
    y_pred[i] = g(X[i], a, b, c) + u_sim[i];
  }
}
The anti-causal model is:
functions {
  real g(real v, real a, real b, real c) {
    return a * v^2 + b * v + c;
  }
}
data {
  int D;
  vector[D] X;
  vector[D] Y;
}
parameters {
  // u_sigma, a, b, and c get Stan's default (flat) priors.
  real<lower=0.1> u_sigma;
  real a;
  real b;
  real c;
  vector[D] U;
}
transformed parameters {
  vector[D] x_loc;
  for (i in 1:D) {
    x_loc[i] = g(Y[i], a, b, c) + U[i];
  }
}
model {
  // Hierarchical normal prior on the exogenous noise U.
  target += normal_lpdf(U | 0, u_sigma);
  // Low-variance normal likelihood on X, since we can't work with a Dirac.
  target += normal_lpdf(X | x_loc, .1);
}
generated quantities {
  vector[D] x_pred;
  vector[D] u_sim;
  vector[D] u_entropy;
  for (i in 1:D) {
    u_sim[i] = normal_rng(0, u_sigma);
    u_entropy[i] = -normal_lpdf(u_sim[i] | 0, u_sigma);
    x_pred[i] = g(Y[i], a, b, c) + u_sim[i];
  }
}
The statement target += normal_lpdf(Y | y_loc, .1); and its analog in the anti-causal model are a bit of a hack. Y is set deterministically, i.e., its conditional distribution given X, U, and \(\theta\) is a Dirac delta. However, conditioning on a Dirac delta is problematic in gradient-based inference methods like NUTS. I also don’t want to rely on calculating a Jacobian, in case I want to extend this analysis to more complex \(f\) functions, like a neural net.
So I use a normal distribution with a small scale parameter, building on the intuition that the limit as that scale parameter goes to 0 is a Dirac delta. I think of this as an inference hack, kind of like approximate Bayesian computation with a Euclidean distance between the observed Y and a sampled Y. I think this has no impact on the results, but I could be wrong. Further, I should note that the smaller one makes the scale parameter, the harder it is to get MCMC traces that converge and mix well.
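In symbols, the intuition is just that a shrinking Gaussian kernel recovers the deterministic assignment:
\[ N\left(y \mid f(x, u; \theta), \epsilon\right) \xrightarrow{\epsilon \to 0} \delta\left(y - f(x, u; \theta)\right), \]
with \(\epsilon = 0.1\) held fixed in the Stan code above.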
The following simulates the ground truth model.
set.seed(456)
# Constants
x_min <- -10
x_max <- 10
# Ground truth global parameters
u_sigma <- 10
a <- rnorm(1, 0, 1)
b <- rnorm(1, 0, 5)
c_ <- rnorm(1, 0, 30)  # named c_ to avoid masking base::c
# Simulate data
D <- 100  # Number of samples.
Uy <- rnorm(D, 0, u_sigma)
g <- function(variable) {
  return(a * variable^2 + b * variable + c_)
}
X <- Ux <- sort(runif(D, x_min, x_max))
Y <- g(X) + Uy
.data <- tibble(X = X, gX = g(X), Y = Y)
# Derived constants
ymin <- min(Y)
ymax <- max(Y)
# Plot
ggplot(data = .data, aes(x = X, y = Y)) +
  geom_ribbon(aes(ymin = g(X) - 1.96 * u_sigma, ymax = g(X) + 1.96 * u_sigma), alpha = .3) +  # 95% range
  geom_point() +
  geom_line(aes(y = gX)) +  # Ground truth g(X)
  ggtitle("Ground Truth")
causal_model <- stan_model("causal_model.stan", model_name = "causal_model")
Sanity check with optimizing.
optim_fit <- optimizing(causal_model, data=list(D=D, X=X, Y=Y))
message("u_sigma expected: ", u_sigma)
## u_sigma expected: 10
message("u_sigma actual: ", as.numeric(round(optim_fit$par['u_sigma'], 4)))
## u_sigma actual: 9.8012
message('a expected: ', a)
## a expected: -1.34352140793323
message('a actual: ', as.numeric(round(optim_fit$par['a'], 4)))
## a actual: -1.3018
message('b expected: ', b)
## b expected: 3.10887776826657
message('b actual: ', as.numeric(round(optim_fit$par['b'], 4)))
## b actual: 3.1249
message('c expected: ', c_)
## c expected: 24.026239946479
message('c actual: ', as.numeric(round(optim_fit$par['c'], 4)))
## c actual: 23.7623
Run HMC.
hmc_fit1 <- sampling(causal_model, data=list(D=D, X=X, Y=Y), iter=10000)
(Sampler progress output omitted: 4 chains, 10,000 iterations each with 5,000 warmup; elapsed time roughly 7–8 seconds per chain.)
## Warning: The largest R-hat is 1.09, indicating chains have not mixed.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#r-hat
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#bulk-ess
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#tail-ess
Any really large R-hats? If so, stop.
if(max(summary(hmc_fit1)$summary[, 'Rhat']) > 3) print("Model/inference needs tweaking")
Evaluate the HMC traces. Did they mix well?
traceplot(hmc_fit1, c("u_sigma", "a", "b", "c"))
Here is an example of good mixing. Here is an example of bad mixing.
Generate a posterior predictive plot of Y and plot it against g(X) to make sure the model is actually fitting the data.
smry <- summary(hmc_fit1, "y_pred")
plt_data1 <- .data %>%
  mutate(
    Y_pred = smry$summary[, "mean"],
    Y_low = smry$summary[, "2.5%"],
    Y_high = smry$summary[, "97.5%"],
    Y_sd = smry$summary[, "sd"]
  )
p1 <- ggplot(data = plt_data1, aes(x = X, y = Y)) +
  geom_ribbon(aes(ymin = Y_low, ymax = Y_high), alpha = .3) +  # 95% predictive interval
  geom_point() +
  geom_line(aes(y = gX)) +  # Ground truth g(X)
  geom_line(aes(y = Y_pred), color = 'Red') +  # Predicted
  ggtitle("Causal Fit (Y|X)")
p1 +
  xlim(floor(min(X)), ceiling(max(X))) +
  ylim(floor(min(plt_data1$Y_low)), ceiling(max(plt_data1$Y_high)))
Earlier I said to expect posterior dependence between \(\sigma_u\) and \(\theta\); I check this using pair plots.
pairs(hmc_fit1, pars=c("u_sigma", "a", "b", "c"))
\(a\), \(b\), and \(c\) look correlated (which is good), while their correlation with \(\sigma_u\) looks low (which violates my assumption that there should be some dependence).
The posterior predictive distribution of \(U_{\text{new}}|X, Y\) and the posteriors of \(a\), \(b\), and \(c\) should be dependent in the causal model, via the collider argument above. Does that appear to be the case here?
To check, I’ll bin the posterior samples of a, b, and c, then plot the distribution of the U samples within each bin. A trend across bins would indicate dependence.
discretize <- function(tib, method = "hartemink") {
  df <- as.data.frame(tib)
  df_disc <- bnlearn::discretize(df, method = method, breaks = 5)
  return(as_tibble(df_disc))
}
add_u <- function(tib, samples) {
  u_mat <- samples$U
  names_ <- paste0("U", 1:D)
  colnames(u_mat) <- names_
  return(as_tibble(cbind(tib, u_mat)))
}
samples1 <- rstan::extract(hmc_fit1, permuted = TRUE, inc_warmup = FALSE, include = TRUE)
trend_data1 <- tibble(
  `a samples` = samples1$a,
  `b samples` = samples1$b,
  `c samples` = samples1$c
) %>%
  discretize() %>%
  add_u(samples1)
tp11 <- trend_data1 %>%
  select(-`b samples`, -`c samples`) %>%
  gather("u_index", "u samples", -`a samples`) %>%
  ggplot(aes(x = `a samples`, y = `u samples`)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
tp12 <- trend_data1 %>%
  select(-`a samples`, -`c samples`) %>%
  gather("u_index", "u samples", -`b samples`) %>%
  ggplot(aes(x = `b samples`, y = `u samples`)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
tp13 <- trend_data1 %>%
  select(-`a samples`, -`b samples`) %>%
  gather("u_index", "u samples", -`c samples`) %>%
  ggplot(aes(x = `c samples`, y = `u samples`)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
grid.arrange(tp11, tp12, tp13, ncol = 3, top = 'Posterior relationship btw function params (a, b, c) and exogenous vars (Us)')
These plots don’t show much evidence of dependence. Maybe it’s the functional form?
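Boxplots over bins are a coarse instrument. As a rough numeric complement (a sketch of my own, not a standard diagnostic), one can rank-correlate each parameter’s draws with each \(U_i\)’s draws:

# Spearman correlation between the draws of a and the draws of each U_i.
# Values near zero suggest little monotone dependence.
a_u_cors <- apply(samples1$U, 2, function(u) cor(samples1$a, u, method = "spearman"))
summary(abs(a_u_cors))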
What about the estimated U’s and the actual values X?
tibble(X = X) %>%
  discretize(method = "quantile") %>%
  cbind(t(samples1$U)) %>%
  gather("u_index", "u samples", -X) %>%
  ggplot(aes(x = X, y = `u samples`)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Looks OK. There is stability across groups of X.
I haven’t fit the anti-causal model yet. For now, let’s generate posterior predictive samples of \(U_y\) and plot their density, overlaid on the ground truth density of \(U_y\).
# Using ggplot
u_predictive <- map_dbl(samples1$u_sigma, ~ rnorm(1, 0, .x))
title <- 'Posterior predictive distribution of Uy \noverlayed against ground truth Uy density'
ggplot(data = tibble(u = u_predictive), aes(u)) +
  geom_density() +
  stat_function(fun = dnorm, args = list(mean = 0, sd = u_sigma), colour = "red") +
  ggtitle(title)
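As an aside, rnorm recycles its sd argument elementwise, so the map_dbl loop above could be written as a single vectorized call (u_predictive_alt is just an illustrative name):

# One draw of U per posterior sample of u_sigma, vectorized.
u_predictive_alt <- rnorm(length(samples1$u_sigma), 0, samples1$u_sigma)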
I can also do this using the regular hist() to overlay a histogram on a density plot.
u_range <- seq(-3 * u_sigma, 3 * u_sigma, by=.01)
density_line <- dnorm(u_range, 0, u_sigma)
ymax = max(jitter(density_line))
hist(u_predictive, freq = F, ylim = c(0, ymax), main = title, xlab = "u new")
lines(u_range, density_line, col = 'darkred')
anticausal_model <- stan_model(
"anticausal_model.stan",
model_name = "anti-causal model"
)
Sanity check with optimizing. Here I do not expect the estimates to be the same as the “true” parameters. I just want to see if sensible values can be reached when I sample.
optim_fit2 <- optimizing(anticausal_model, data=list(D=D, X=X, Y=Y))
message("u_sigma expected: ", u_sigma)
## u_sigma expected: 10
message("u_sigma actual: ", as.numeric(round(optim_fit2$par['u_scale'], 4)))
## u_sigma actual: NA
message('a expected: ', a)
## a expected: -1.34352140793323
message('a actual: ', as.numeric(round(optim_fit2$par['a'], 4)))
## a actual: -6e-04
message('b expected: ', b)
## b expected: 3.10887776826657
message('b actual: ', as.numeric(round(optim_fit2$par['b'], 4)))
## b actual: 0.0145
message('c expected: ', c_)
## c expected: 24.026239946479
message('c actual: ', as.numeric(round(optim_fit2$par['c'], 4)))
## c actual: 1.0066
# Experience shows the wrong direction requires more iterations.
hmc_fit2 <- sampling(anticausal_model, data=list(D=D, X=X, Y=Y), iter = 10000)
(Sampler progress output omitted: 4 chains, 10,000 iterations each with 5,000 warmup; elapsed time roughly 50–60 seconds per chain, noticeably slower than the causal model.)
## Warning: There were 277 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
## Warning: Examine the pairs() plot to diagnose sampling problems
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#bulk-ess
Any really large R-hats? If so, stop.
if(max(summary(hmc_fit2)$summary[, 'Rhat']) > 3) print("Model/inference needs tweaking")
Evaluate the traces. Did they converge and mix well?
traceplot(hmc_fit2, c("u_sigma", "a", "b", "c"))
How does the predictive fit look? I create the plot with the axes inverted at first, because the anti-causal model treats Y as the cause of X, and I want to plot predictive intervals for X. I then use coord_flip() to flip it back so the two plots are comparable.
smry <- summary(hmc_fit2, "x_pred")
plt_data2 <- tibble(X = X, gX = g(X), Y = Y) %>%
  mutate(
    X_pred = smry$summary[, "mean"],
    X_low = smry$summary[, "2.5%"],
    X_high = smry$summary[, "97.5%"],
    X_sd = smry$summary[, "sd"]
  )
# Match axes
buffer <- 1
min_x <- floor(min(plt_data2$X_low) - buffer)
max_x <- ceiling(max(plt_data2$X_high) + buffer)
min_y <- floor(min(plt_data1$Y_low) - buffer)
max_y <- ceiling(max(plt_data1$Y_high) + buffer)
p1_mod <- p1 +
  xlim(min_x, max_x) +
  ylim(min_y, max_y)
p2 <- ggplot(data = plt_data2, aes(x = Y, y = X)) +
  geom_ribbon(aes(ymin = X_low, ymax = X_high), alpha = .3) +  # 95% predictive interval
  geom_point() +
  geom_line(aes(y = X_pred), color = 'Red') +  # Predicted
  ylim(min_x, max_x) +
  xlim(min_y, max_y) +
  coord_flip() +
  ggtitle("anti-causal Fit (X|Y)")
grid.arrange(p1_mod, p2, ncol = 2)
The anti-causal model seems to fit the data about as well as the causal model. This is good; it shows what I already knew: good prediction is not evidence of causal or anti-causal direction.
Again, I check for parameter dependence using pair plots.
pairs(hmc_fit2, pars=c("u_sigma", "a", "b", "c"))
There is much more dependence among the elements of \(\theta|X, Y\) than in the causal fit, but that by itself is not a problem. \(\sigma_u|X, Y\) and \(\theta|X, Y\) still lack dependence.
I was hoping to see some sort of strange dependence pattern here. Darn.
Again, the posterior predictive distribution of \(U_{\text{new}}|X, Y\) and the posteriors of \(a|X, Y\), \(b|X, Y\), and \(c|X, Y\) should be dependent in the causal model, and I couldn’t find much evidence of that there. My expectation is that there should be dependence here too, and that it should look, in some sense, “weird”.
samples2 <- rstan::extract(hmc_fit2, permuted = TRUE, inc_warmup = FALSE, include = TRUE)
trend_data2 <- tibble(
  `a samples` = samples2$a,
  `b samples` = samples2$b,
  `c samples` = samples2$c
) %>%
  discretize() %>%
  add_u(samples2)
tp21 <- trend_data2 %>%
  select(-`b samples`, -`c samples`) %>%
  gather("u_index", "u samples", -`a samples`) %>%
  ggplot(aes(x = `a samples`, y = `u samples`)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
tp22 <- trend_data2 %>%
  select(-`a samples`, -`c samples`) %>%
  gather("u_index", "u samples", -`b samples`) %>%
  ggplot(aes(x = `b samples`, y = `u samples`)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
tp23 <- trend_data2 %>%
  select(-`a samples`, -`b samples`) %>%
  gather("u_index", "u samples", -`c samples`) %>%
  ggplot(aes(x = `c samples`, y = `u samples`)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
grid.arrange(tp21, tp22, tp23, ncol = 3, top = 'Posterior relationship btw function params (a, b, c) and exogenous vars (Us)')
There seems to be some mean and variance shift in the U samples across the parameter bins.
Variance shift (heteroskedasticity) is “weird”, so I’m going to call this a win.
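Calling it a win by eye is weak evidence, though. One could put a number on it, for example with Bartlett’s test for equal variances of the U draws across the c bins (a sketch; any heteroskedasticity test would do):

# Gather the U draws into long form, keeping the parameter bins.
long2 <- trend_data2 %>%
  gather("u_index", "u", starts_with("U"))
# A small p-value indicates the variance of the U draws shifts across c bins.
bartlett.test(u ~ `c samples`, data = long2)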
What about the estimated U|X, Y’s and the actual values Y? I expect the anti-causal model to have trouble with this one.
tibble(Y = Y) %>%
  discretize(method = "quantile") %>%
  cbind(t(samples2$U)) %>%
  gather("u_index", "u samples", -Y) %>%
  ggplot(aes(x = Y, y = `u samples`)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
The variance of the posterior U samples seems to vary across quantiles of Y.
Weird. I’m going to call this a win.
The assertion that the posterior expectation of \(\sigma_u\) in the causal model is less than the posterior expectation of \(\sigma_u\) in the anti-causal model is…
mean(samples2$u_sigma) > mean(samples1$u_sigma)
## [1] FALSE
Not what I expected. (ㆆ _ ㆆ) darn.
The assertion that \(H(U_{\text{new}, \text{causal}}) < H(U_{\text{new}, \text{anti-causal}})\) is
mean(samples2$u_entropy) > mean(samples1$u_entropy)
## [1] FALSE
Not what I expected. (ㆆ _ ㆆ)
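In hindsight, this second check is nearly redundant with the first: given \(\sigma_u\), \(U_{\text{new}}\) is Gaussian, and the differential entropy of a Gaussian is monotone in its scale,
\[ H(U_{\text{new}} \mid \sigma_u) = \tfrac{1}{2}\log\left(2\pi e\, \sigma_u^2\right), \]
so comparing average u_entropy values is essentially a restatement of the \(\sigma_u\) comparison above.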
Generate posterior predictive samples of \(U_y\) and overlay them on the previous histogram.
u_range <- seq(-3 * u_sigma, 3 * u_sigma, by = .01)
u_density <- dnorm(u_range, 0, u_sigma)
u_predictive2 <- map_dbl(samples2$u_sigma, ~ rnorm(1, 0, .x))
ymax <- max(
  c(
    jitter(u_density),
    max(jitter(density(u_predictive)$y)),
    max(jitter(density(u_predictive2)$y))
  )
)
hist(u_predictive, freq = F, ylim = c(0, ymax),
     main = title, col = rgb(0, 0, 1, alpha = 0.3),
     sub = "blue = causal (correct), red = anti-causal, line = truth")
hist(u_predictive2, freq = F, add = T, col = rgb(1, 0, 0, alpha = 0.3))
lines(u_range, u_density)
Again, darn. (ㆆ _ ㆆ)
Looks like my hypothesis that there would be more uncertainty in the posterior predictive distribution of the exogenous variable \(U|X, Y\) didn’t bear out. It was based on some paper that I’m certainly misremembering. I believe that paper used Rényi entropy, not Shannon entropy, though I’m not sure why that would matter. A little bit of Googling surfaced this paper, which seems to suggest the approach wouldn’t work for models like the one I proposed above, but would for models where the exogenous variable is discrete.
I think I’m also guilty of conflating the fatness of a posterior distribution with “goodness-of-fit”. The two are not the same.
I find myself longing for some causal information criterion that tells me that, in the following figure, the plot on the left makes way more sense than the one on the right, and I want that reason to be more than some kind of penalized likelihood.
The dependence between the parameters in the anti-causal case is interesting, though I’m not sure how to characterize it. It showed up mostly in the form of heteroskedasticity, which aligns with some of the examples in Peters, Janzing, and Scholkopf’s book. And there are statistical methods for detecting such things; one shouldn’t need to rely on visuals.