The true model. The full Bayes net is of class ‘bn.fit’ and ‘bn.fit.dnet’ (‘dnet’ indicates a discrete network).
load('sachs.rda')
bn_model <- bn; rm(bn)
print(class(bn_model))## [1] "bn.fit" "bn.fit.dnet"
graphviz.plot(bn_model)## Loading required namespace: Rgraphviz
Example of a conditional probability distribution from the network.
bn_model$PKA##
## Parameters of node PKA (multinomial distribution)
##
## Conditional probability table:
##
## PKC
## PKA LOW AVG HIGH
## LOW 0.38642552 0.06039638 0.01577014
## AVG 0.37942435 0.92264651 0.95873839
## HIGH 0.23415014 0.01695712 0.02549147
The factorization represented by the DAG:
modelstring(bn_model)## [1] "[PKC][Plcg][PIP3|Plcg][PKA|PKC][Jnk|PKA:PKC][P38|PKA:PKC][PIP2|PIP3:Plcg][Raf|PKA:PKC][Mek|PKA:PKC:Raf][Erk|Mek:PKA][Akt|Erk:PKA]"
Working with just the DAG structure itself:
true_dag <- model2network(modelstring(bn_model))
true_dag##
## Random/Generated Bayesian network
##
## model:
## [PKC][Plcg][PIP3|Plcg][PKA|PKC][Jnk|PKA:PKC][P38|PKA:PKC]
## [PIP2|PIP3:Plcg][Raf|PKA:PKC][Mek|PKA:PKC:Raf][Erk|Mek:PKA][Akt|Erk:PKA]
## nodes: 11
## arcs: 17
## undirected arcs: 0
## directed arcs: 17
## average markov blanket size: 3.09
## average neighbourhood size: 3.09
## average branching factor: 1.55
##
## generation algorithm: Empty
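Working with the structure object directly, the node and arc sets can be pulled out with bnlearn's accessors; a quick sketch:
# Node labels and the 17 directed arcs of the true DAG
nodes(true_dag)
arcs(true_dag)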
Simulate some data from the ground-truth model in lieu of real data.
N <- 200
.data <- rbn(bn_model, N)
head(.data); rm(bn_model)## Akt Erk Jnk Mek P38 PIP2 PIP3 PKA PKC Plcg Raf
## 1 LOW AVG AVG AVG LOW LOW AVG AVG LOW LOW AVG
## 2 LOW HIGH AVG LOW LOW LOW AVG AVG AVG LOW LOW
## 3 AVG HIGH LOW LOW LOW LOW LOW AVG LOW LOW LOW
## 4 LOW AVG AVG AVG LOW AVG HIGH AVG LOW LOW AVG
## 5 LOW AVG AVG AVG LOW LOW LOW AVG AVG LOW AVG
## 6 LOW AVG AVG LOW LOW LOW AVG AVG HIGH LOW LOW
Scoring with the Bayesian Dirichlet equivalent (BDe): for discrete CPDs, think of this as log(P(D|G)P(G)), i.e., the log posterior probability of the graph up to a normalizing constant.
score(true_dag, .data, type = "bde")## [1] -1665.421
Other scores, all likelihood-related.
score(true_dag, .data, type = "bic")## [1] -1840.424
score(true_dag, .data, type = "aic")## [1] -1546.874
score(true_dag, .data, type = "k2")## [1] -1601.504
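For the BDe score, the difference between two graphs' scores is a log posterior odds ratio (the data marginal cancels, and a uniform graph prior contributes equally to both). A minimal sketch, comparing the true DAG against an empty graph as a hypothetical baseline:
# Hypothetical baseline: a graph with no arcs over the same nodes
empty_dag <- empty.graph(nodes = names(.data))
# Log posterior odds of the true DAG versus the empty graph
score(true_dag, .data, type = "bde") - score(empty_dag, .data, type = "bde")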
Note that generally there is nothing Bayesian about Bayesian network structure learning.
I take a Bayesian approach to structure learning here. I write a structure learning function that takes in a “dataset object” and returns samples from a posterior. The prior is baked in.
“hc” stands for “Hill-Climbing”, a greedy search algorithm that optimizes the score function. It is deterministic given the data and the starting graph. We can get random DAG outputs by using a bootstrap approach – random starting graphs and data sampled with replacement. Admittedly, calling this “Bayesian” inference of Bayes net structure is a stretch. MCMC methods exist, but are not implemented in this package. For our purposes, consider the structure learning approach a black box.
I generate a list of M = 100 DAGs representing the posterior.
nds <- names(.data)
M <- 100
structure_learner <- function(){
  # Random starting graph for the greedy search
  starting_net <- random.graph(nds)
  # Bootstrap resample of the data
  .sampled <- as.data.frame(sample_n(.data, size = N, replace = T))
  hc(.sampled, start = starting_net, score = "bde", prior = "uniform")
}
posterior_samples <- map(1:M, function(i) structure_learner())
This is a list of DAGs representing samples from the posterior. The first one in the list is pictured below. The figure compares it against the ground-truth graph: red edges are false positives (edges that don’t exist in the ground truth), blue dashed edges are false negatives (ground-truth edges not in the learned DAG).
graphviz.compare(posterior_samples[[1]], true_dag)
Scores of all the networks in the list can be calculated against the data. Here are the first ten scores:
map_dbl(posterior_samples[1:10], score, data = .data, type = "bde")## [1] -1698.732 -1635.944 -1647.012 -1753.223 -1705.184 -1659.645 -1674.024
## [8] -1652.143 -1648.739 -1690.697
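Another way to summarize the posterior sample is to count how often each arc appears across the sampled DAGs. A sketch using bnlearn's custom.strength, which computes arc strengths (here, relative frequencies) from a list of networks:
# Fraction of posterior samples in which each possible arc appears
arc_freqs <- custom.strength(posterior_samples, nodes = nds)
head(arc_freqs[order(-arc_freqs$strength), ])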
Given a DAG, I can fit a full Bayesian network on that DAG using the bn.fit function. This is a bit redundant, because structure inference uses the score function, which relies on the MLE estimates of the CPD parameters; now, given the DAG, I’m re-inferring the CPD parameters from the data. At least, instead of the MLE, I can use the MAP estimate with the method = "bayes" argument – though the parameter estimates are still point values.
class(net <- bn.fit(true_dag, .data, method = "bayes"))## [1] "bn.fit" "bn.fit.dnet"
net$PKA; rm(net);##
## Parameters of node PKA (multinomial distribution)
##
## Conditional probability table:
##
## PKC
## PKA LOW AVG HIGH
## LOW 0.402298851 0.038302277 0.006802721
## AVG 0.402298851 0.942028986 0.986394558
## HIGH 0.195402299 0.019668737 0.006802721
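How strongly these MAP estimates are pulled toward the prior is controlled by the imaginary sample size. A quick sketch, overriding the default iss to make the Dirichlet prior more informative (iss = 50 is an arbitrary illustrative choice):
# A larger imaginary sample size (iss) smooths the CPT estimates toward uniform
net_strong_prior <- bn.fit(true_dag, .data, method = "bayes", iss = 50)
net_strong_prior$PKA; rm(net_strong_prior)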
We simulate intervention data by doing a probabilistic query on the mutilated graph.
Mutilating the graph means making the target of the intervention independent of its parents. In terms of the DAG, you cut the target’s incoming edges. In terms of the CPD of the target node, you remove the CPD and replace it with a distribution that places all probability on the value fixed by the intervention. This is what the do-operator in Pyro does.
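To see the CPD replacement concretely, here is a small sketch that refits a network (the earlier fit was removed above) and inspects PKA’s CPT after mutilation; all probability mass should now sit on “LOW” (tmp_fit is a throwaway name):
# Refit, mutilate, and check that PKA's CPT is a point mass on "LOW"
tmp_fit <- bn.fit(true_dag, .data, method = "bayes")
mutilated(tmp_fit, evidence = list(PKA = "LOW"))$PKA; rm(tmp_fit)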
graphviz.compare illustrates the edge removed when PKA is set to “LOW”.
pka_mutilated <- mutilated(true_dag, evidence = list(PKA = "LOW"))
graphviz.compare(pka_mutilated, true_dag); rm(pka_mutilated);
Next, you apply an inference algorithm to the mutilated graph to simulate data given the fixed value of the intervention variable. Note that the following code uses logic (forward) sampling; likelihood weighting is an alternative implementation. It would also be possible to use belief propagation by introducing some other packages.
The simulator below mutilates a fitted network and draws random observations from it with rbn; cpdist(..., method = "lw") on the mutilated network would do the same job with likelihood weighting. Note how “PKA” is always “LOW” in the simulated data. The syntax for the evidence is a little wonky: because the intervention target is stored in a variable, I build the named evidence list with structure() rather than writing list(PKA = "LOW") directly.
data_simulator_builder <- function(intervention){
  simulator <- function(DAG){
    # Fit the Bayes net from the DAG
    bn_mod <- bn.fit(DAG, .data, method = "bayes")
    # Mutilate the model: fix the intervention target to "LOW"
    mut_evidence <- structure(list("LOW"), names = intervention)
    mod_mutilated <- mutilated(bn_mod, evidence = mut_evidence)
    # Forward-sample N observations from the mutilated network
    particles <- rbn(mod_mutilated, n = N)
    return(particles)
  }
  return(simulator)
}
pka_intervention_simulator <- data_simulator_builder("PKA")
.sim_data <- pka_intervention_simulator(posterior_samples[[1]])
head(.sim_data); rm(pka_intervention_simulator);## Akt Erk Jnk Mek P38 PIP2 PIP3 PKA PKC Plcg Raf
## 1 LOW AVG HIGH AVG LOW AVG HIGH LOW LOW LOW LOW
## 2 AVG HIGH AVG LOW AVG HIGH LOW LOW AVG HIGH LOW
## 3 LOW HIGH LOW LOW LOW LOW LOW LOW AVG LOW AVG
## 4 HIGH HIGH LOW HIGH LOW LOW HIGH LOW AVG HIGH HIGH
## 5 LOW AVG AVG AVG AVG AVG LOW LOW LOW LOW LOW
## 6 AVG HIGH AVG HIGH HIGH LOW HIGH LOW LOW LOW HIGH
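As an aside, the likelihood-weighting alternative mentioned above would draw from the same mutilated fit with cpdist(..., method = "lw") instead of rbn. A rough sketch (tmp_fit, tmp_mut, and sim_lw are throwaway names):
# Likelihood-weighting variant of the same intervention simulation
tmp_fit <- bn.fit(true_dag, .data, method = "bayes")
tmp_mut <- mutilated(tmp_fit, evidence = list(PKA = "LOW"))
sim_lw <- cpdist(tmp_mut, nodes = setdiff(nodes(tmp_mut), "PKA"),
                 evidence = list(PKA = "LOW"), method = "lw")
head(sim_lw); rm(tmp_fit, tmp_mut, sim_lw)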
At this point I can simulate an intervention dataset for each graph in my posterior.
There is a modified BDe score (mbde) that takes an argument specifying which observations were generated under an intervention on which node, and calculates the correct score given the intervention. Think of it as calculating the score on the mutilated DAG (though it’s slightly different).
score(true_dag, data = .sim_data, exp = list(PKA = 1:N), type = "mbde")## Warning in check.data(data): variable PKA has levels that are not observed
## in the data.
## [1] -1712.071
rm(.sim_data)
Earlier, the output of structure learning was a sample of DAGs representing a posterior distribution. Given a DAG posterior learned on .data, we’d like to use that as the prior for structure inference on new_data. This turns out to be the same as doing inference on the combined dataset rbind(.data, new_data).
In this function, instead of returning the DAG, I’ll return the score of the DAG on the combined simulated/original data.
learner_builder <- function(intervention){
  learner <- function(.new_data){
    start_net <- random.graph(nds)
    # Combine a bootstrap resample of the original data with one of the new data
    .sampled <- rbind(
      sample_n(.data, size = N, replace = T),
      sample_n(.new_data, size = N, replace = T)
    ) %>% as.data.frame
    # Rows (N+1):(2*N) are the interventional (simulated) observations
    int_list <- list(int = (N+1):(2*N))
    names(int_list) <- intervention
    net <- hc(.sampled, start_net, score = "mbde", exp = int_list)
    scr <- score(net, .sampled, type = "mbde", exp = int_list)
    return(scr)
  }
  return(learner)
}
So now I have the ability to forward-simulate the entire process of generating new data and updating the network.
So here’s what I do:
1. For each DAG in the posterior samples:
   a. Simulate intervention data.
   b. Simulate inference on the intervention data by collecting a new set of posterior samples given that simulated data.
   c. Convert those simulated DAGs into score values – scoring them against the combined simulated and original data.
   d. Take the mean of those scores.
2. Return the mean, over DAGs, of the difference between each DAG’s simulated score mean and its score on the original data.
This return value is, by construction, the posterior expected KL divergence (approximated with this Monte Carlo approach).
inference_simulator_builder <- function(intervention){
  inf_simulator <- function(DAG){
    # Simulate an intervention dataset from this DAG
    data_simulator <- data_simulator_builder(intervention)
    .sim_data <- data_simulator(DAG)
    # Re-run structure learning M times on the original + simulated data, keeping scores
    learner <- learner_builder(intervention)
    score_samples <- map_dbl(1:M, function(i) learner(.sim_data))
    # Normalize by the combined sample size
    score_mean <- mean(score_samples)/(2*N)
    return(score_mean)
  }
  return(inf_simulator)
}
for(int in nds){
  score_learner <- inference_simulator_builder(int)
  sim_score_MAPs <- map(posterior_samples, score_learner)
  expected_KLs <- map2_dbl(sim_score_MAPs, posterior_samples,
                           ~ .x - score(.y, .data, type = "bde")/(2*N))
  message("Posterior expected intervention KL for an intervention on ",
          int, " is: ",
          round(mean(expected_KLs), 6))
}## Posterior expected intervention KL for an intervention on Akt is: -3.115973
## Posterior expected intervention KL for an intervention on Erk is: -3.174514
## Posterior expected intervention KL for an intervention on Jnk is: -3.162067
## Posterior expected intervention KL for an intervention on Mek is: -3.156777
## Posterior expected intervention KL for an intervention on P38 is: -3.244603
## Posterior expected intervention KL for an intervention on PIP2 is: -3.325931
## Posterior expected intervention KL for an intervention on PIP3 is: -3.030456
## Posterior expected intervention KL for an intervention on PKA is: -3.365859
## Posterior expected intervention KL for an intervention on PKC is: -3.208324
## Posterior expected intervention KL for an intervention on Plcg is: -3.30015
## Posterior expected intervention KL for an intervention on Raf is: -3.12178
Here is a possible alternative idea, inspired by Gelman’s notion of an expected log pointwise predictive density for a new dataset. Suppose we have graph G…
G <- posterior_samples[[1]]
Suppose we were to ask ourselves: “How much would the model, mutilated by the proposed intervention, differ from the data we have seen so far?” We could then choose the intervention that causes the largest difference.
A possible measure of difference here is comparing the likelihood of the mutilated graph with that of the original graph on the original data. Gelman might call this a log posterior predictive likelihood ratio.
G_mutated <- mutilated(G, evidence = list(PKA = "LOW"))
logLik(G_mutated, .data) - logLik(G, .data)## [1] -115.4802
We could then average this across the DAGs in the posterior sample.
lr_generator <- function(intervention){
  function(G){
    # Cut the intervention target's incoming edges
    evidence <- structure(list("LOW"), names = intervention)
    G_mutated <- mutilated(G, evidence)
    # Log likelihood ratio of the mutilated vs. original DAG on the original data
    llr <- logLik(G_mutated, .data) - logLik(G, .data)
    return(llr)
  }
}
for(int in nds){
  feature_calculator <- lr_generator(int)
  feature_posterior <- suppressWarnings(
    map_dbl(posterior_samples, feature_calculator)
  )
  posterior_mean <- mean(feature_posterior)
  message("Posterior log predictive likelihood ratio for an intervention on ",
          int, " is: ", round(posterior_mean, 6))
}## Posterior log predictive likelihood ratio for an intervention on Akt is: -17.536543
## Posterior log predictive likelihood ratio for an intervention on Erk is: -33.545668
## Posterior log predictive likelihood ratio for an intervention on Jnk is: -19.153374
## Posterior log predictive likelihood ratio for an intervention on Mek is: -44.815446
## Posterior log predictive likelihood ratio for an intervention on P38 is: -34.424163
## Posterior log predictive likelihood ratio for an intervention on PIP2 is: -17.584331
## Posterior log predictive likelihood ratio for an intervention on PIP3 is: -10.168269
## Posterior log predictive likelihood ratio for an intervention on PKA is: -56.993847
## Posterior log predictive likelihood ratio for an intervention on PKC is: -46.206161
## Posterior log predictive likelihood ratio for an intervention on Plcg is: -30.712279
## Posterior log predictive likelihood ratio for an intervention on Raf is: -49.531399
In its current form, this is just equivalent to removing the factor (CPD) that has the most explanatory power (the least entropy). The likelihood ratio is the test statistic for the hypothesis that there is no causal relationship between that node and its parents.
However, if .data were split into training and test sets, with the network inferred on the training set and the likelihood ratio calculated on the test set, this would start to look more like an out-of-sample prediction technique. Depending on computational complexity, a leave-one-out approach would also be feasible. Other prediction measures, such as classification error, could work too.
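A minimal sketch of that train/test variant, assuming an 80/20 split and the same illustrative PKA intervention (the variable names here are hypothetical):
# Hold out 20% of the data; learn the DAG on the training split only
set.seed(42)
train_idx <- sample(seq_len(nrow(.data)), size = floor(0.8 * nrow(.data)))
train_data <- .data[train_idx, ]
test_data <- .data[-train_idx, ]
G_train <- hc(train_data, score = "bde")
# Out-of-sample log likelihood ratio for an intervention on PKA
G_train_mut <- mutilated(G_train, evidence = list(PKA = "LOW"))
logLik(G_train_mut, test_data) - logLik(G_train, test_data)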