Open RStudio and type the following in the console:
install.packages("bnlearn")
install.packages("Rgraphviz")
If you experience problems installing Rgraphviz (it is distributed through Bioconductor rather than CRAN), try the following script instead:
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("Rgraphviz")
Another way to install a package in R is to go to the “Packages” tab and click the “Install” button.
Then type “bnlearn” in the window that appears and click “Install”. Do the same for the other package.
In this section, we introduce a causal model for transportation data, which comes with an accompanying data set of survey answers. Here, we show how to visualize it with the bnlearn package.
The survey data set focuses on how the use of public transport varies across social groups. It includes the following factors (discrete variables):
Age (A): It is recorded as young (young) for individuals below 30 years, adult (adult) for individuals between 30 and 60 years old, and old (old) for people older than 60.
Sex (S): The biological sex of the individual, recorded as male (M) or female (F).
Education (E): The highest level of education or training completed by the individual, recorded as either high school (high) or university degree (uni).
Occupation (O): It is recorded as an employee (emp) or a self-employed (self) worker.
Residence (R): The size of the city the individual lives in, recorded as small (small) or big (big).
Travel (T): The means of transport favoured by the individual, recorded as car (car), train (train), or other (other).
Travel is the target of the survey, the quantity of interest whose behaviour is under investigation.
We can represent the causal relationships between the variables with a directed graph in which each node corresponds to a variable and each directed edge represents a direct dependency between a pair of variables.
In bnlearn, we can graphically represent the relationships between the variables in the survey data like this:
library(bnlearn)

# create an empty graph over the six survey variables
dag <- empty.graph(nodes = c("A","S","E","O","R","T"))
arc.set <- matrix(c("A", "E",
                    "S", "E",
                    "E", "O",
                    "E", "R",
                    "O", "T",
                    "R", "T"),
                  byrow = TRUE, ncol = 2,
                  dimnames = list(NULL, c("from", "to")))
arcs(dag) <- arc.set
nodes(dag)
## [1] "A" "S" "E" "O" "R" "T"
arcs(dag)
## from to
## [1,] "A" "E"
## [2,] "S" "E"
## [3,] "E" "O"
## [4,] "E" "R"
## [5,] "O" "T"
## [6,] "R" "T"
You can either use the simple plot function or the graphviz.plot function, which requires the Rgraphviz package.
# plot dag with the base plot function
plot(dag)

# plot dag with graphviz.plot (requires Rgraphviz)
graphviz.plot(dag)
Plotting with graphviz also allows you to adjust the layout of the graph, the shape of subsets of nodes, and the color and thickness of edges, as in the sketch below.
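For instance, here is a minimal sketch using the highlight argument of graphviz.plot (highlight and layout are standard graphviz.plot arguments; the specific styling choices are just illustrative):
# color the T node and thicken the arcs leading into it
graphviz.plot(dag,
              highlight = list(nodes = "T",
                               arcs = matrix(c("O", "T", "R", "T"),
                                             byrow = TRUE, ncol = 2),
                               fill = "lightblue", col = "darkblue", lwd = 2),
              layout = "dot")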
We call this a causal DAG because we have assumed that the edges we encoded represent our causal assumptions about the system.
The strength of the bnlearn library is its set of algorithms for learning DAGs, including causal DAGs, from data.
However, this is a subject that warrants its own tutorial. We won’t address the topic here. To learn more, I suggest visiting bnlearn’s website.
Any DAG (causal or otherwise) that we might specify for this data represents a factorization of the joint probability distribution of the variables.
Given our causal DAG, the joint probability distribution of the survey variables factorizes as follows:
\(Pr(A, S, E, O, R, T) = Pr(A) Pr(S) Pr(E | A, S) Pr(O | E) Pr(R | E) Pr(T | O, R).\)
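As a quick check, bnlearn's modelstring function prints the structure of dag in exactly this factorized form:
# the model string mirrors the factorization of the joint distribution
modelstring(dag)
## [1] "[A][S][E|A:S][O|E][R|E][T|O:R]"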
Each factor is a conditional probability distribution (CPD). In the discrete case the CPD is sometimes called a conditional probability table (CPT).
We can factorize along any DAG and get a set of CPDs; but when we factorize along a DAG that we consider a representation of causality, we call each CPD a causal Markov kernel (CMK).
The factorization that provides a set of CMKs is the most useful factorization because CMKs correspond to independent causal mechanisms we assume to be invariant across data sets. I use the term CPD more often than CMK, since CPD is more commonly seen elsewhere. Just remember that when I say CPD, I really mean a special type of causal CPD called a CMK.
Here is how we specify the CPDs by hand in bnlearn.
A.lv <- c("young", "adult", "old")
S.lv <- c("M", "F")
E.lv <- c("high", "uni")
O.lv <- c("emp", "self")
R.lv <- c("small", "big")
T.lv <- c("car", "train", "other")

A.prob <- array(c(0.30, 0.50, 0.20), dim = 3, dimnames = list(A = A.lv))
S.prob <- array(c(0.60, 0.40), dim = 2, dimnames = list(S = S.lv))
E.prob <- array(c(0.75, 0.25, 0.72, 0.28, 0.88, 0.12,
                  0.64, 0.36, 0.70, 0.30, 0.90, 0.10),
                dim = c(2, 3, 2),
                dimnames = list(E = E.lv, A = A.lv, S = S.lv))
O.prob <- array(c(0.96, 0.04, 0.92, 0.08), dim = c(2, 2),
                dimnames = list(O = O.lv, E = E.lv))
R.prob <- array(c(0.25, 0.75, 0.20, 0.80), dim = c(2, 2),
                dimnames = list(R = R.lv, E = E.lv))
T.prob <- array(c(0.48, 0.42, 0.10, 0.56, 0.36, 0.08,
                  0.58, 0.24, 0.18, 0.70, 0.21, 0.09),
                dim = c(3, 2, 2),
                dimnames = list(T = T.lv, O = O.lv, R = R.lv))
cpt <- list(A = A.prob, S = S.prob, E = E.prob, O = O.prob, R = R.prob, T = T.prob)
# custom cpt table
cpt
## $A
## A
## young adult old
## 0.3 0.5 0.2
##
## $S
## S
## M F
## 0.6 0.4
##
## $E
## , , S = M
##
## A
## E young adult old
## high 0.75 0.72 0.88
## uni 0.25 0.28 0.12
##
## , , S = F
##
## A
## E young adult old
## high 0.64 0.7 0.9
## uni 0.36 0.3 0.1
##
##
## $O
## E
## O high uni
## emp 0.96 0.92
## self 0.04 0.08
##
## $R
## E
## R high uni
## small 0.25 0.2
## big 0.75 0.8
##
## $T
## , , R = small
##
## O
## T emp self
## car 0.48 0.56
## train 0.42 0.36
## other 0.10 0.08
##
## , , R = big
##
## O
## T emp self
## car 0.58 0.70
## train 0.24 0.21
## other 0.18 0.09
Now that we have defined both the causal DAG and the local CPD for each variable, we can combine them to form a fully specified causal Bayesian network. We combine the DAG we stored in dag with the list of local distributions we stored in cpt into an object of class bn.fit, which we call bn.
# combine the DAG and the CPT list into a bn.fit object
bn <- custom.fit(dag, cpt)
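We can inspect the result to confirm it matches what we specified. nparams counts the network's free parameters, which for this structure and these cardinalities should be 21 (2 for A, 1 for S, 6 for E, 2 each for O and R, and 8 for T), and coef retrieves the CPT attached to a node:
# count the free parameters implied by the DAG and variable cardinalities
nparams(bn)
## [1] 21

# retrieve the CPT we attached to node T
coef(bn$T)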
So far, we have assumed we know both the causal DAG and the parameters of the CPDs obtained by factorizing the joint distribution along that DAG.
In general, it is easier to turn domain knowledge into a causal DAG than into values for the parameters of that DAG's CPDs, though specifying both by hand is not uncommon.
However, in the context of machine learning, most of the time we are going to learn these parameter values from data.
Let’s read the survey data:
survey <- read.table("data/survey.txt", header = TRUE)
head(survey)
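One practical caveat: bn.fit requires every column of the data frame to be a factor, and on R 4.0 or later read.table no longer converts strings to factors by default. A minimal, version-safe fix:
# bn.fit expects factors; coerce all columns explicitly
survey[] <- lapply(survey, factor)
str(survey)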
In the case of this survey, and of discrete causal BNs in general, the parameters to estimate are the conditional probabilities in the local distributions. They can be estimated, for example, by the corresponding empirical frequencies in the data set, e.g.,
\(\hat{Pr}(O = emp | E = high) = \frac{\hat{Pr}(O = emp, E = high)}{\hat{Pr}(E = high)}= \frac{\text{number of observations for which O = emp and E = high}}{\text{number of observations for which E = high}}\)
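As a quick cross-check, the same empirical frequencies can be computed directly with base R; prop.table with margin = 2 normalizes each column of the O-by-E contingency table, giving the conditional distribution of O given E:
# empirical conditional frequencies Pr(O | E), computed by hand;
# these should match the CPT for O estimated by bn.fit below
prop.table(table(survey[, c("O", "E")]), margin = 2)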
Estimating the parameters with empirical frequencies yields the classic frequentist and maximum likelihood estimates. In bnlearn, we can compute them with the bn.fit function. bn.fit complements the custom.fit function we used earlier: custom.fit constructs a BN from a set of parameters specified by the user, while bn.fit estimates those parameters from data.
bn.mle <- bn.fit(dag, data = survey, method = "mle")
bn.mle
##
## Bayesian network parameters
##
## Parameters of node A (multinomial distribution)
##
## Conditional probability table:
## adult old young
## 0.472 0.208 0.320
##
## Parameters of node S (multinomial distribution)
##
## Conditional probability table:
## F M
## 0.402 0.598
##
## Parameters of node E (multinomial distribution)
##
## Conditional probability table:
##
## , , S = F
##
## A
## E adult old young
## high 0.6391753 0.8461538 0.5384615
## uni 0.3608247 0.1538462 0.4615385
##
## , , S = M
##
## A
## E adult old young
## high 0.7194245 0.8923077 0.8105263
## uni 0.2805755 0.1076923 0.1894737
##
##
## Parameters of node O (multinomial distribution)
##
## Conditional probability table:
##
## E
## O high uni
## emp 0.98082192 0.92592593
## self 0.01917808 0.07407407
##
## Parameters of node R (multinomial distribution)
##
## Conditional probability table:
##
## E
## R high uni
## big 0.7178082 0.8666667
## small 0.2821918 0.1333333
##
## Parameters of node T (multinomial distribution)
##
## Conditional probability table:
##
## , , R = big
##
## O
## T emp self
## car 0.58469945 0.69230769
## other 0.19945355 0.15384615
## train 0.21584699 0.15384615
##
## , , R = small
##
## O
## T emp self
## car 0.54700855 0.75000000
## other 0.07692308 0.25000000
## train 0.37606838 0.00000000
Note that we assume the structure of the DAG is known, so dag is an input to the bn.fit function.
As an alternative, we can estimate the parameters in a Bayesian fashion, which yields maximum a posteriori point estimates. The form of the Bayesian model depends on the data type; in the discrete case, it uses the conjugate Dirichlet prior.
To use Bayesian estimation, the method argument of bn.fit must be set to "bayes".
bn.bayes <- bn.fit(dag, data = survey, method = "bayes", iss = 10)
The estimated posterior probabilities are computed from a uniform prior over each conditional probability table. iss is an optional argument whose name stands for imaginary sample size (also known as equivalent sample size). It determines how much weight the prior distribution carries relative to the data when computing the posterior: the weight is specified as the size of an imaginary sample supporting the prior distribution.
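To see the effect of iss, a minimal sketch refits the network with a larger imaginary sample size and compares one CPT; larger values pull the estimates further toward the uniform prior (the exact numbers depend on the data):
# larger imaginary sample sizes shrink the posterior estimates
# toward the uniform prior
bn.bayes.100 <- bn.fit(dag, data = survey, method = "bayes", iss = 100)
coef(bn.bayes$O)      # iss = 10: close to the MLE values
coef(bn.bayes.100$O)  # iss = 100: pulled toward 0.5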
We now have a causal generative model. Specifically, we have a causal Bayesian network. We can do things like query the network to generate the value of some variables conditional on the value of others.
The fact that we are working with a directed acyclic graph means that we can work with a host of inference algorithms from the probabilistic graphical modeling research literature. The nature of these algorithms is beyond the scope of this course. I will say that the advantage of working with a causal Bayesian network is that we don't have to think about inference too much; working with a DAG makes them pretty "black box".
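For example, bnlearn's cpquery function estimates the probability of an event given evidence by simulating from the network; a minimal sketch (the estimate is stochastic, so the exact number varies between runs):
# approximate Pr(T = "car" | E = "uni") by sampling from the fitted network
set.seed(42)
cpquery(bn.mle, event = (T == "car"), evidence = (E == "uni"))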
We will see that probabilistic programming gives us a much more expressive way of modeling causality. However, inference is much more challenging to deal with.
A typical use case is applying the model to data in which some of the variables are unobserved.
The code below fits a model on the first 2000 rows of a sample data set, then predicts the value of the variable "A" on the remaining rows.
# fit a model on the first 2000 rows, then predict the variable A
# in the held-out test set
model <- bn.fit(
  model2network("[A][B][E][G][C|A:B][D|B][F|A:D:E:G]"),
  gaussian.test[1:2000, ]
)
test <- gaussian.test[2001:nrow(gaussian.test), ]
predicted <- predict(
  model,
  node = "A",
  data = test,
  method = "bayes-lw"
)
plot(test$A, predicted, xlab = "actual", ylab = "predicted")
bayes-lw is the inference method used to infer the unobserved values. Specifically, bnlearn uses a fairly lightweight sampling technique called likelihood weighting.
Note that bnlearn can be combined with R libraries that provide better inference algorithms for graphical models, such as gRain; see the sketch below. However, for more powerful inference, you might consider reimplementing your model in a probabilistic programming language such as Stan.
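A minimal sketch of the gRain route, assuming the gRain package (and its Bioconductor dependencies) is installed; bnlearn's as.grain converts a fitted network into a gRain object, which is compiled into a junction tree for exact inference:
# exact inference with gRain: build a junction tree from the fitted network,
# set evidence on E, and read off the exact posterior distribution of T
library(gRain)
junction <- compile(as.grain(bn.mle))
jedu <- setEvidence(junction, nodes = "E", states = "uni")
querygrain(jedu, nodes = "T")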