Bayesian Networks

Bayesian Networks are a modelling tool to specify joint probability distributions
- Can be used to decrease the computational complexity of a joint probability distribution through independence modelling.

Graphs

Bayesian Networks employ directed acyclic graphs (DAG)
- Nodes represent variables
- Edges can be used to indicate causal influence
DAG example:

Notation

For a variable \(V\) in a DAG \(G\):
- Parents\((V)\) is the set of variables \(N\) with an edge from \(N\) to \(V\)
- Descendants\((V)\) is the set of variables \(N\) with a direct path from \(V\) to \(N\)
- Non_Descendants\((V)\) are all variables in \(G\) other than \(V\), Parents\((V)\), Descendants\((V)\)
- Spouse\((V)\) are all variables in \(G\) that have a common child with \(V\)

Parameterisation

For a Bayesian network, we need to specify conditional probabilities for every variable \(X\) in DAG and its parents \(V\).

\[P(x|\textbf{u})\quad\forall\quad x\in X\text{ and } \textbf{u}\in\textbf{U}\]

Example:
- This example only requires specification of 11 probabilities rather than \(2^5-1 = 31\) with a full joint probability distribution.

Definitions

Bayesian network for variables \(\textbf{Z}\) is the pair \((G,\Theta)\):
- \(G\) is the DAG over \(\textbf{Z}\), aka the network structure
- \(\Theta\) is the set of CPTs, one for each variable \(Z\), called network parameterization
\(\Theta_{X|\textbf{U}}\) denotes CPT for \(X\) and parents \(\textbf{U}\)
- The size of a CPT \(\Theta_{X|\textbf{U}}\) is exponential with number of parents \(\textbf{U}\)
- Given a maximal parents \(k\), the size of any CPT is \(O(d^{k+1})\), where \(d\) is the number of possible variable values
  - This is compared to \(O(d^n)\) for a pure probability distribution table -\(k\) can be decreased through independence assumptions
\(X\textbf{U}\) denotes the set of variables known as a network family
\(\theta_{x|\textbf{u}}\) is the value of \(P(x|\textbf{u})\), known as a network parameter
Network instantiation is the assignment of all network variables \(\theta_{x|\textbf{u}}\sim\textbf{z}\)

Chain Rule for Bayesian Networks

The joint distribution is given by:

\[ P(\mathbf{z}) = \prod_{\Theta_{x|\mathbf{u}}\sim\mathbf{z}}\theta_{x|\mathbf{u}}\]

Example:

\[ P(a,b,\barr{c},d,\bar{e}) = \theta_{a}\theta_{b|a}\theta_{\bar{c}|a}\theta_{d|b,\bar{c}}\theta_{\bar{e}|\bar{c}}\]

Markovian Assumptions

A DAG \(G\) is a compact representation of the independence statements:

\[ V\perp Non\_Descendants(V)|Parents(V)\]

Therefore, a DAG is created to capture the belief about the independence between varaibles.
- This can be causal in form, however it is possible to have non-causal structured DAGs with indentical independencies
These independence statements make up the Markovian assumptions of \(G\), denoted Markov\((G)\).
The following are the Markovian assumptions, Markov\((G)\), for the example graph:
- \(C\perp B,E,R|A\)
- \(R\perp A,B,C|E\)
- \(A\perp R|B,E\)
- \(B\perp E,R\)
- \(E\perp B\)

Properties of Independence

Can extend our conditional independence statements in Markov\((G)\) through graphoid axioms

Symmetry

\[ X\perp Y|Z\text{ iff }Y\perp X|Z\] * Example:

\[ A\perp R|B,E\rightarrow R\perp A|B,E\]

Decomposition

\[ X\perp Y\cup W|Z\text{ iff }(X\perp Y|Z\text{ and }X\perp W|Z)\]

This property allows us to state:
- \(X\) is conditionally independent of any subset of non_descendants given Parents(V)

\[ X\perp W|Parents(X)\quad\forall\quad W\subseteq\text{ }Non\_ Descendants(X)\]

Example:

\[ R\perp A,C,B|E\rightarrow R\perp A|E,\quad R\perp C|E,\quad R\perp B|E\]

Weak Union

\[ X\perp Y\cup W|Z\text{ iff } X\perp W|Z\cup Y\]

This property allows us to state

\[ X\perp Non\_Descendants(X)\backslash W|Parents(X)\cup W\quad\forall\quad W\subseteq Non\_Descendant(X)\]

Example:

\[ C\perp B,E,R|A\rightarrow C\perp R|A,B,E\]

Contraction

\[ (X\perp Y|Z\text{ and }X\perp W|Z\cup Y)\text{ iff } X\perp Y\cup W|Z\]

Example:
- Consider a directed sequence of nodes \(X_1,...,X_4\)

\[ (X_1\perp X_4|X_2,X_3\text{ and }X_1\perp X_3|X_2) \rightarrow X_1\perp X_3,X_4|X_2\]

Intersection

Note that intersection only holds for strictly positive distributions
- No \(0/1\) probabilities (deterministic CPTs)

\[ (X\perp Y|Z\cup W\text{ and }X\perp W|Z\cup Y)\text{ iff } X\perp Y\cup W|Z\]

Graphical Tests of Independence

Graphical tests can be used to test conditional independence between sets of variables
- Graphical tests can be used to simplify independence calculations as derivations for variable independence through graphiod axioms is not trivial

Blocking

Consider a path between 2 variables. This path is considered ‘blocked’ if any intermediate nodes, referred to as valves \(W_i\), are blocked.

We can determine whether a valve is blocked by considering 3 cases. Consider:

\[X\perp Y|Z\]

Sequential Valves
- \(W\) is an intermediary between \(N_1,N_2\)
- Closed iff \(W\in\textbf{Z}\)

Divergent Valves:
- \(W\) is a common cause for \(N_1,N_2\)
- Closed iff \(W\in\textbf{Z}\)

Convergent Valves:
- \(W\) is a common effect of two causes \(N_1,N_2\)
- Blocked iff neither \(W\) nor any descendents\((W)\) appear in \(\textbf{Z}\)

D-Seperation

Let \(\textbf{X,Y,Z}\) be disjoint sets of nodes in a DAG \(G\). \(\textbf{X}\) and \(\textbf{Y}\) are d-seperated by \(\textbf{Z}\), \(dsep_G(\textbf{X,Y,Z})\), iff every path between a node in \(\textbf{X}\) and a node in \(\textbf{Y}\) is blocked by \(\textbf{Z}\)
- A path is blocked by \(\textbf{Z}\) iff at least one valve on the path is closed given \(\textbf{Z}\)
D-separation is sound
- \(dsep_G(\textbf{X,Y,Z})\) iff \(\textbf{X}\perp\textbf{Y}|\textbf{Z}\)
- If \(\mathbf{X}\),\(\mathbf{Y}\) are d-seperated by \(\mathbf{Z}\), \(\textbf{X}\perp\textbf{Y}|\textbf{Z}\) for any parametertisation \(\Theta\)
D-separation is not complete
- \(\mathbf{X}\perp \mathbf{Y}|\mathbf{Z} \nrightarrow desp_G(\mathbf{X,Y,Z})\)
- If \(\mathbf{X}\),\(\mathbf{Y}\) are not d-seperated by \(\mathbf{Z}\), whether \(\textbf{X}\perp\textbf{Y}|\textbf{Z}\) depends on parameterisation \(\Theta\)
- d-seperation can only detect independence in the graph structure, it is unable to detect independence hidden in graph parameters
- However, d-seperation can be made complete with a specific choice of parametrisation, therefore it has a weak notion of completeness.
  - This notion implies that no other graphical test would detect more independences than d-seperation

Efficient D-Seperation Algorithm

Testing for d-seperation is equivalent to testing whether \(\mathbf{X,Y}\) are disconnected in DAG \(G'\). obtained as follows:
- Delete any leaf node \(W\) from \(G\) if \(W\) does not belong to \(\mathbf{X\cup Y\cup Z}\). Repeat until no more nodes can be deleted.
- Delete all edges outgoing from notes in \(\mathbf{Z}\)
This procedure time and space are linear in size of DAG \(G\) (amount of edges in the DAG).

Independence Maps

I-MAP

\(G\) is an independence map of \(P\) iff every independence declared by d-seperation holds in \(P\)
- If \(G\) is induced by a bayesian network, it must be an I-MAP of \(P\). However it may not be minimal.

\[ dsep_g(\mathbf{X,Z,Y})\text{ iff } \mathbf{X}\perp \mathbf{Y}|\mathbf{Z}\]

I-MAP is minimal if \(G\) ceases to be an I-MAP if we delete any edges from \(G\)
- Minimal I-MAPs tend to exhibit more independencies, therefore have fewer parameters.
Procedure to construct a minimal I-MAP:
- Order variables \(X_1,...,X_n\)
- Start with an empty DAG \(G\) and consider the variable \(X_i\) for \(i=1,...,n\)
- For each \(X_i\), identify a minimal subset \(\mathbf{P}\) of variables \(X_1,...,X_i-1\) such that \(X_i\perp X_1,...,X_{i-1}\backslash P|P\)
- Make \(\mathbf{P}\) the parents of \(X_i\) in \(G\)

D-MAP

\(G\) is a dependency map of \(P\) iff a lack of d-seperation in \(G\) implies a dependence in \(P\)

\[ \mathbf{X}\perp \mathbf{Y}|\mathbf{Z}\text{ iff } dsep_g(\mathbf{X,Z,Y})\]

If \(P\) is induced by a bayesian network, it may not be a D-MAP
- However it can be made a D-MAP with careful choice of \(\Theta\) where no independence is hidden in the parameterisation \(\theta\)

P-MAP

If \(G\) is both a I-MAP and a D-MAP. it is a perfect map
- All independencies of \(G\) are accessible through D-seperation
- However, there are probability distributions with no P-MAP

Markov Blanket

A markov blanket for \(X\) will render every other variable irrelevant to \(X\).
- Minimal markov blanket is known as a markov boundary (no subset of \(\mathbf{B}\) is also a markov blanket)
Definition:
- Let P be a distribtion over variables \(\mathbf{X}\). A markov blanket for a varible \(X\in\mathbf{X}\) is the set of variables \(\mathbf{B\subseteq X}\) such that \(X\notin\mathbf{B}\) and \(X\perp\mathbf{X}\backslash(\mathbf{B}\cup\{X\})|\mathbf{B}\)
If \(P\) is induced by a DAG \(G\), then a markov blanket for \(X\) can be constructed with its parents, children and spouses in \(G\).