The true model. The full Bayes net is of class ‘bn.fit’ and ‘bn.fit.dnet’ (‘dnet’ indicates a discrete network).
load('sachs.rda')
bn_model <- bn; rm(bn)
print(class(bn_model))## [1] "bn.fit" "bn.fit.dnet"
graphviz.plot(bn_model)## Loading required namespace: Rgraphviz
Example of a conditional probability distribution from the network.
bn_model$PKA##
## Parameters of node PKA (multinomial distribution)
##
## Conditional probability table:
##
## PKC
## PKA LOW AVG HIGH
## LOW 0.38642552 0.06039638 0.01577014
## AVG 0.37942435 0.92264651 0.95873839
## HIGH 0.23415014 0.01695712 0.02549147
The factorization represented by the DAG:
modelstring(bn_model)## [1] "[PKC][Plcg][PIP3|Plcg][PKA|PKC][Jnk|PKA:PKC][P38|PKA:PKC][PIP2|PIP3:Plcg][Raf|PKA:PKC][Mek|PKA:PKC:Raf][Erk|Mek:PKA][Akt|Erk:PKA]"
Working with just the DAG structure itself:
true_dag <- model2network(modelstring(bn_model))
true_dag##
## Random/Generated Bayesian network
##
## model:
## [PKC][Plcg][PIP3|Plcg][PKA|PKC][Jnk|PKA:PKC][P38|PKA:PKC]
## [PIP2|PIP3:Plcg][Raf|PKA:PKC][Mek|PKA:PKC:Raf][Erk|Mek:PKA][Akt|Erk:PKA]
## nodes: 11
## arcs: 17
## undirected arcs: 0
## directed arcs: 17
## average markov blanket size: 3.09
## average neighbourhood size: 3.09
## average branching factor: 1.55
##
## generation algorithm: Empty
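Working with the structure object directly, the node and arc sets can be pulled out with bnlearn's accessors; a quick sketch:
# Node labels and the 17 directed arcs of the true DAG
nodes(true_dag)
arcs(true_dag)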
Simulate some data from the ground-truth model in lieu of real data.
N <- 200
.data <- rbn(bn_model, N)
head(.data); rm(bn_model)## Akt Erk Jnk Mek P38 PIP2 PIP3 PKA PKC Plcg Raf
## 1 LOW AVG AVG AVG LOW LOW AVG AVG LOW LOW AVG
## 2 LOW HIGH AVG LOW LOW LOW AVG AVG AVG LOW LOW
## 3 AVG HIGH LOW LOW LOW LOW LOW AVG LOW LOW LOW
## 4 LOW AVG AVG AVG LOW AVG HIGH AVG LOW LOW AVG
## 5 LOW AVG AVG AVG LOW LOW LOW AVG AVG LOW AVG
## 6 LOW AVG AVG LOW LOW LOW AVG AVG HIGH LOW LOW
Scoring with the Bayesian Dirichlet equivalent (BDe): for discrete CPDs, think of this as log(P(D|G)P(G)), i.e., the log posterior probability of the graph up to a normalizing constant.
score(true_dag, .data, type = "bde")## [1] -1665.421
Other scores, all likelihood-related.
score(true_dag, .data, type = "bic")## [1] -1840.424
score(true_dag, .data, type = "aic")## [1] -1546.874
score(true_dag, .data, type = "k2")## [1] -1601.504
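For the BDe score, the difference between two graphs' scores is a log posterior odds ratio (the data marginal cancels, and a uniform graph prior contributes equally to both). A minimal sketch, comparing the true DAG against an empty graph as a hypothetical baseline:
# Hypothetical baseline: a graph with no arcs over the same nodes
empty_dag <- empty.graph(nodes = names(.data))
# Log posterior odds of the true DAG versus the empty graph
score(true_dag, .data, type = "bde") - score(empty_dag, .data, type = "bde")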
Note that generally there is nothing Bayesian about Bayesian network structure learning.
I take a Bayesian approach to structure learning here. I write a structure learning function that takes in a “dataset object” and returns samples from a posterior. The prior is baked in.
“hc” stands for “Hill-Climbing”, a greedy search algorithm that optimizes the score function. It is deterministic given the data and the starting graph. We can get random DAG outputs by using a bootstrap approach – random starting graphs and data sampled with replacement. Admittedly, calling this “Bayesian” inference of Bayes net structure is a stretch. MCMC methods exist, but are not implemented in this package. For our purposes, consider the structure learning approach a black box.
I generate a list of M = 100 DAGs representing the posterior.
nds <- names(.data)
M <- 100
structure_learner <- function(){
  # Random starting graph for the greedy search
  starting_net <- random.graph(nds)
  # Bootstrap resample of the data
  .sampled <- as.data.frame(sample_n(.data, size = N, replace = T))
  hc(.sampled, start = starting_net, score = "bde", prior = "uniform")
}
posterior_samples <- map(1:M, function(i) structure_learner())
This is a list of DAGs representing samples from the posterior. The first one in the list is pictured below. The figure compares it against the ground-truth graph: red edges are false positives (edges that don’t exist in the ground truth), blue dashed edges are false negatives (ground-truth edges not in the learned DAG).
graphviz.compare(posterior_samples[[1]], true_dag)
Scores of all the networks in the list can be calculated against the data. Here are the first ten scores:
map_dbl(posterior_samples[1:10], score, data = .data, type = "bde")## [1] -1698.732 -1635.944 -1647.012 -1753.223 -1705.184 -1659.645 -1674.024
## [8] -1652.143 -1648.739 -1690.697
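Another way to summarize the posterior sample is to count how often each arc appears across the sampled DAGs. A sketch using bnlearn's custom.strength, which computes arc strengths (here, relative frequencies) from a list of networks:
# Fraction of posterior samples in which each possible arc appears
arc_freqs <- custom.strength(posterior_samples, nodes = nds)
head(arc_freqs[order(-arc_freqs$strength), ])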
Given a DAG, I can fit a full Bayesian network on that DAG using the bn.fit function. This is a bit redundant, because structure inference uses the score function, which relies on the MLE estimates of the CPD parameters; now, given the DAG, I’m re-inferring the CPD parameters from the data. At least, instead of the MLE, I can use the MAP estimate with the method = "bayes" argument – though the parameter estimates are still point values.
class(net <- bn.fit(true_dag, .data, method = "bayes"))## [1] "bn.fit" "bn.fit.dnet"
net$PKA; rm(net);##
## Parameters of node PKA (multinomial distribution)
##
## Conditional probability table:
##
## PKC
## PKA LOW AVG HIGH
## LOW 0.402298851 0.038302277 0.006802721
## AVG 0.402298851 0.942028986 0.986394558
## HIGH 0.195402299 0.019668737 0.006802721
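How strongly these MAP estimates are pulled toward the prior is controlled by the imaginary sample size. A quick sketch, overriding the default iss to make the Dirichlet prior more informative (iss = 50 is an arbitrary illustrative choice):
# A larger imaginary sample size (iss) smooths the CPT estimates toward uniform
net_strong_prior <- bn.fit(true_dag, .data, method = "bayes", iss = 50)
net_strong_prior$PKA; rm(net_strong_prior)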
We simulate intervention data by doing a probabilistic query on the mutilated graph.
Mutilating the graph means making the target of the intervention independent of its parents. In terms of the DAG, you cut the target’s incoming edges. In terms of the CPD of the target node, you remove the CPD and replace it with a distribution that places all probability on the value fixed by the intervention. This is what the do-operator in Pyro does.
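To see the CPD replacement concretely, here is a small sketch that refits a network (the earlier fit was removed above) and inspects PKA’s CPT after mutilation; all probability mass should now sit on “LOW” (tmp_fit is a throwaway name):
# Refit, mutilate, and check that PKA's CPT is a point mass on "LOW"
tmp_fit <- bn.fit(true_dag, .data, method = "bayes")
mutilated(tmp_fit, evidence = list(PKA = "LOW"))$PKA; rm(tmp_fit)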
graphviz.compare illustrates the edge removed when PKA is set to “LOW”.
pka_mutilated <- mutilated(true_dag, evidence = list(PKA = "LOW"))
graphviz.compare(pka_mutilated, true_dag); rm(pka_mutilated);
Next, you apply an inference algorithm to the mutilated graph to simulate data given the fixed value of the intervention variable. Note that the following code uses logic (forward) sampling; likelihood weighting is an alternative implementation. It would also be possible to use belief propagation by introducing some other packages.
The simulator below mutilates a fitted network and draws random observations from it with rbn; cpdist(..., method = "lw") on the mutilated network would do the same job with likelihood weighting. Note how “PKA” is always “LOW” in the simulated data. The syntax for the evidence is a little wonky: because the intervention target is stored in a variable, I build the named evidence list with structure() rather than writing list(PKA = "LOW") directly.
data_simulator_builder <- function(intervention){
  simulator <- function(DAG){
    # Fit the Bayes net from the DAG
    bn_mod <- bn.fit(DAG, .data, method = "bayes")
    # Mutilate the model: fix the intervention target to "LOW"
    mut_evidence <- structure(list("LOW"), names = intervention)
    mod_mutilated <- mutilated(bn_mod, evidence = mut_evidence)
    # Forward-sample N observations from the mutilated network
    particles <- rbn(mod_mutilated, n = N)
    return(particles)
  }
  return(simulator)
}
pka_intervention_simulator <- data_simulator_builder("PKA")
.sim_data <- pka_intervention_simulator(posterior_samples[[1]])
head(.sim_data); rm(pka_intervention_simulator);## Akt Erk Jnk Mek P38 PIP2 PIP3 PKA PKC Plcg Raf
## 1 LOW AVG HIGH AVG LOW AVG HIGH LOW LOW LOW LOW
## 2 AVG HIGH AVG LOW AVG HIGH LOW LOW AVG HIGH LOW
## 3 LOW HIGH LOW LOW LOW LOW LOW LOW AVG LOW AVG
## 4 HIGH HIGH LOW HIGH LOW LOW HIGH LOW AVG HIGH HIGH
## 5 LOW AVG AVG AVG AVG AVG LOW LOW LOW LOW LOW
## 6 AVG HIGH AVG HIGH HIGH LOW HIGH LOW LOW LOW HIGH
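As an aside, the likelihood-weighting alternative mentioned above would draw from the same mutilated fit with cpdist(..., method = "lw") instead of rbn. A rough sketch (tmp_fit, tmp_mut, and sim_lw are throwaway names):
# Likelihood-weighting variant of the same intervention simulation
tmp_fit <- bn.fit(true_dag, .data, method = "bayes")
tmp_mut <- mutilated(tmp_fit, evidence = list(PKA = "LOW"))
sim_lw <- cpdist(tmp_mut, nodes = setdiff(nodes(tmp_mut), "PKA"),
                 evidence = list(PKA = "LOW"), method = "lw")
head(sim_lw); rm(tmp_fit, tmp_mut, sim_lw)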
At this point I can simulate an intervention dataset for each graph in my posterior.
There is a modified BDe score (mbde) that takes an argument specifying which observations were generated under an intervention on which node, and calculates the correct score given the intervention. Think of it as calculating the score on the mutilated DAG (though it’s slightly different).
score(true_dag, data = .sim_data, exp = list(PKA = 1:N), type = "mbde")## Warning in check.data(data): variable PKA has levels that are not observed
## in the data.
## [1] -1712.071
rm(.sim_data)
Earlier, the output of structure learning was a sample of DAGs representing a posterior distribution. Given a DAG posterior learned on .data, we’d like to use that as the prior for structure inference on new_data. This turns out to be the same as doing inference on the combined dataset rbind(.data, new_data).
In this function, instead of returning the DAG, I’ll return the score of the DAG on the combined simulated/original data.
learner_builder <- function(intervention){
  learner <- function(.new_data){
    start_net <- random.graph(nds)
    # Combine a bootstrap resample of the original data with one of the new data
    .sampled <- rbind(
      sample_n(.data, size = N, replace = T),
      sample_n(.new_data, size = N, replace = T)
    ) %>% as.data.frame
    # Rows (N+1):(2*N) are the interventional (simulated) observations
    int_list <- list(int = (N+1):(2*N))
    names(int_list) <- intervention
    net <- hc(.sampled, start_net, score = "mbde", exp = int_list)
    scr <- score(net, .sampled, type = "mbde", exp = int_list)
    return(scr)
  }
  return(learner)
}
So now I have the ability to forward-simulate the entire process of generating new data and updating the network.
So here’s what I do:
1. For each DAG in the posterior samples:
   a. Simulate intervention data.
   b. Simulate inference on the intervention data by collecting a new set of posterior samples given that simulated data.
   c. Convert those simulated DAGs into score values – scoring them against the combined simulated and original data.
   d. Take the mean of those scores.
2. Return the mean, over DAGs, of the difference between each DAG’s simulated score mean and its score on the original data.
This return value is, by construction, the posterior expected KL divergence (approximated with this Monte Carlo approach).
inference_simulator_builder <- function(intervention){
  inf_simulator <- function(DAG){
    # Simulate an intervention dataset from this DAG
    data_simulator <- data_simulator_builder(intervention)
    .sim_data <- data_simulator(DAG)
    # Re-run structure learning M times on the original + simulated data, keeping scores
    learner <- learner_builder(intervention)
    score_samples <- map_dbl(1:M, function(i) learner(.sim_data))
    # Normalize by the combined sample size
    score_mean <- mean(score_samples)/(2*N)
    return(score_mean)
  }
  return(inf_simulator)
}
for(int in nds){
  score_learner <- inference_simulator_builder(int)
  sim_score_MAPs <- map(posterior_samples, score_learner)
  expected_KLs <- map2_dbl(sim_score_MAPs, posterior_samples,
                           ~ .x - score(.y, .data, type = "bde")/(2*N))
  message("Posterior expected intervention KL for an intervention on ",
          int, " is: ",
          round(mean(expected_KLs), 6))
}## Posterior expected intervention KL for an intervention on Akt is: -3.115973
## Posterior expected intervention KL for an intervention on Erk is: -3.174514
## Posterior expected intervention KL for an intervention on Jnk is: -3.162067
## Posterior expected intervention KL for an intervention on Mek is: -3.156777
## Posterior expected intervention KL for an intervention on P38 is: -3.244603
## Posterior expected intervention KL for an intervention on PIP2 is: -3.325931
## Posterior expected intervention KL for an intervention on PIP3 is: -3.030456
## Posterior expected intervention KL for an intervention on PKA is: -3.365859
## Posterior expected intervention KL for an intervention on PKC is: -3.208324
## Posterior expected intervention KL for an intervention on Plcg is: -3.30015
## Posterior expected intervention KL for an intervention on Raf is: -3.12178
Here is a possible alternative idea, inspired by Gelman’s notion of an expected log pointwise predictive density for a new dataset. Suppose we have graph G…
G <- posterior_samples[[1]]
Suppose we were to ask ourselves: “How much would the model, mutilated by the proposed intervention, differ from the data we have seen so far?” We could then choose the intervention that causes the largest difference.
A possible measure of difference here is comparing the likelihood of the mutilated graph with that of the original graph on the original data. Gelman might call this a log posterior predictive likelihood ratio.
G_mutated <- mutilated(G, evidence = list(PKA = "LOW"))
logLik(G_mutated, .data) - logLik(G, .data)## [1] -115.4802
We could then average this across the DAGs in the posterior sample.
lr_generator <- function(intervention){
  function(G){
    # Cut the intervention target's incoming edges
    evidence <- structure(list("LOW"), names = intervention)
    G_mutated <- mutilated(G, evidence)
    # Log likelihood ratio of the mutilated vs. original DAG on the original data
    llr <- logLik(G_mutated, .data) - logLik(G, .data)
    return(llr)
  }
}
for(int in nds){
  feature_calculator <- lr_generator(int)
  feature_posterior <- suppressWarnings(
    map_dbl(posterior_samples, feature_calculator)
  )
  posterior_mean <- mean(feature_posterior)
  message("Posterior log predictive likelihood ratio for an intervention on ",
          int, " is: ", round(posterior_mean, 6))
}## Posterior log predictive likelihood ratio for an intervention on Akt is: -17.536543
## Posterior log predictive likelihood ratio for an intervention on Erk is: -33.545668
## Posterior log predictive likelihood ratio for an intervention on Jnk is: -19.153374
## Posterior log predictive likelihood ratio for an intervention on Mek is: -44.815446
## Posterior log predictive likelihood ratio for an intervention on P38 is: -34.424163
## Posterior log predictive likelihood ratio for an intervention on PIP2 is: -17.584331
## Posterior log predictive likelihood ratio for an intervention on PIP3 is: -10.168269
## Posterior log predictive likelihood ratio for an intervention on PKA is: -56.993847
## Posterior log predictive likelihood ratio for an intervention on PKC is: -46.206161
## Posterior log predictive likelihood ratio for an intervention on Plcg is: -30.712279
## Posterior log predictive likelihood ratio for an intervention on Raf is: -49.531399
In its current form, this is just equivalent to removing the factor (CPD) that has the most explanatory power (the least entropy). The likelihood ratio is the test statistic for the hypothesis that there is no causal relationship between that node and its parents.
However, if .data were split into training and test sets, with the network inferred on the training set and the likelihood ratio calculated on the test set, this would start to look more like an out-of-sample prediction technique. Depending on computational complexity, a leave-one-out approach would also be feasible. Other prediction measures, such as classification error, could work too.
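A minimal sketch of that train/test variant, assuming an 80/20 split and the same illustrative PKA intervention (the variable names here are hypothetical):
# Hold out 20% of the data; learn the DAG on the training split only
set.seed(42)
train_idx <- sample(seq_len(nrow(.data)), size = floor(0.8 * nrow(.data)))
train_data <- .data[train_idx, ]
test_data <- .data[-train_idx, ]
G_train <- hc(train_data, score = "bde")
# Out-of-sample log likelihood ratio for an intervention on PKA
G_train_mut <- mutilated(G_train, evidence = list(PKA = "LOW"))
logLik(G_train_mut, test_data) - logLik(G_train, test_data)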