Bayesian networks build on the same intuitions as the naive Bayes model by exploiting conditional independence properties of the distribution in order to allow a compact and natural representation. However, they are not restricted to representing distributions satisfying the strong independence assumptions implicit in the naive Bayes model. They allow us the flexibility to tailor our representation of the distribution to the independence properties that appear reasonable in the current setting. The core of the Bayesian network representation is a directed acyclic graph (DAG) \(\mathcal{G}\), whose nodes are the random variables in our domain and whose edges correspond, intuitively, to direct influence of one node on another. This graph \(\mathcal{G}\) can be viewed in two very different ways:

  • as a data structure that provides the skeleton for representing a joint distribution compactly in a factorized way;
  • as a compact representation for a set of conditional independence assumptions about a distribution.

Showcase 1

\(P(G,D,I,S,L)\)
- \(\mathbf{G}\)rade: A vs. B vs. C
- Course \(\mathbf{D}\)ifficulty: 0 vs. 1
- Student \(\mathbf{I}\)ntelligence: 0 vs. 1
- Student \(\mathbf{S}\)AT: 0 vs. 1
- Reference \(\mathbf{L}\)etter: 0 vs. 1

grViz("
digraph causal{

# Node
node[shape=circle]
G
D
I
S
L

# Edges
edge[color=black, arrowhead=vee]
D -> G
I -> G
I -> S
G -> L
}")
Semantics and Factorization Example

So each node gets a CPD: \(P(D)\), \(P(I)\), \(P(G|I,D)\), \(P(S|I)\), and \(P(L|G)\).

Given this Bayesian network, what do we think is an appropriate factorization of the joint distribution \(P(D,I,G,S,L)\)?
\(P(D,I,G,S,L) = P(D)P(I|D)P(G|D,I)P(S|D,I,G)P(L|D,I,G,S) = P(D)P(I)P(G|I,D)P(S|I)P(L|G)\), based on the standard chain rule of probability and the conditional independencies encoded in the graph (e.g. \(P(I|D) = P(I)\) since \(I\) is independent of \(D\), and \(P(L|D,I,G,S) = P(L|G)\)).

Example: \(P(d^0, i^1, g^3, s^1, l^1) = 0.6 \times 0.3 \times 0.02 \times 0.8 \times 0.01 = 2.88 \times 10^{-5}\)
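A minimal R sketch of this computation (the five CPD entries are the ones quoted above; the variable names are ours):

# Chain rule for Bayesian networks: the probability of a full assignment
# is the product of one entry from each CPD.
p_d0      <- 0.6    # P(d^0)
p_i1      <- 0.3    # P(i^1)
p_g3_i1d0 <- 0.02   # P(g^3 | i^1, d^0)
p_s1_i1   <- 0.8    # P(s^1 | i^1)
p_l1_g3   <- 0.01   # P(l^1 | g^3)

p_d0 * p_i1 * p_g3_i1d0 * p_s1_i1 * p_l1_g3   # 2.88e-05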

Bayesian Network

  • A Bayesian network is:
    • A directed acyclic graph (DAG) G whose nodes represent the random variables \(X_1, ..., X_n\)
    • For each node \(X_i\), a CPD \(P(X_i|Par_G(X_i))\)
  • The BN represents a joint distribution via the chain rule for Bayesian networks: we say that \(P\) factorizes over \(G\) if \[P(X_1, ..., X_n) = \prod_i P(X_i | Par_G(X_i))\]
  • Properties:
    • A BN defines a legal distribution:
      • \(P \ge 0\): \(P\) is a product of CPDs, and CPDs are non-negative.
      • \(\sum P = 1\): \(\sum_{D,I,G,S,L} P(D,I,G,S,L) = \sum_{D,I,G,S,L} P(D)P(I)P(G|I,D)P(S|I)P(L|G)\)
        \(= \sum_{D,I,G,S} P(D)P(I)P(G|I,D)P(S|I) \sum_L P(L|G)\)
        \(= \sum_{D,I,G,S} P(D)P(I)P(G|I,D)P(S|I)\)
        \(= \sum_{D,I,G} P(D)P(I)P(G|I,D) \sum_S P(S|I)\)
        \(= \sum_{D,I} P(D)P(I)\sum_G P(G|I,D)\)
        \(= \sum_{D} P(D) \sum_I P(I)\)
        \(= 1\)
        Example: What is the value of \(\sum_L P(L|G)\)? \(\rightarrow\) no matter what the value of \(G\) is, \(P(L|G)\) is a well-defined distribution over \(L\), i.e. it sums to \(1\)
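We can check this numerically in R. A minimal sketch, assuming arbitrary random CPDs (each conditional distribution normalized to sum to 1; the point is that the factorized product sums to 1 for any legal CPDs, not just these):

set.seed(1)
norm <- function(x) x / sum(x)   # normalize a vector into a distribution

P_D <- norm(runif(2))                                 # P(D)
P_I <- norm(runif(2))                                 # P(I)
P_G <- array(apply(matrix(runif(12), nrow = 3), 2, norm),
             dim = c(3, 2, 2))                        # P(G|I,D), indexed [g, i, d]
P_S <- apply(matrix(runif(4), nrow = 2), 2, norm)     # P(S|I), indexed [s, i]
P_L <- apply(matrix(runif(6), nrow = 2), 2, norm)     # P(L|G), indexed [l, g]

# Sum the factorized joint over every full assignment.
total <- 0
for (d in 1:2) for (i in 1:2) for (g in 1:3) for (s in 1:2) for (l in 1:2)
  total <- total + P_D[d] * P_I[i] * P_G[g, i, d] * P_S[s, i] * P_L[l, g]
total   # 1 (up to floating point)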

Showcase 2

Genetic Inheritance

BNs for Genetic Inheritance

Each person has a genotype node (\(G\)) and a blood-type node (\(B\)): a child's genotype depends on the genotypes of both parents, and each person's blood type depends only on their own genotype.

grViz("
digraph causal{

  # a 'graph' statement
  graph [overlap = true, fontsize = 10]

  # several 'node' statements
  node  [shape = circle,
         style = filled,
         fontname = Helvetica,
         fillcolor = Linen]
G_Clancy; B_Clancy
G_Jackie; B_Jackie
G_Marge; B_Marge
G_Selma; B_Selma
G_Homer; B_Homer
G_Bart; B_Bart
G_Lisa; B_Lisa
G_Maggie; B_Maggie

# Edges
edge[color=black, arrowhead=vee]
G_Clancy -> B_Clancy; G_Clancy -> G_Marge; G_Clancy -> G_Selma
G_Jackie -> B_Jackie; G_Jackie -> G_Marge; G_Jackie -> G_Selma
G_Marge -> B_Marge; G_Marge -> G_Bart; G_Marge -> G_Lisa; G_Marge -> G_Maggie
G_Selma -> B_Selma
G_Homer -> B_Homer; G_Homer -> G_Bart; G_Homer -> G_Lisa; G_Homer -> G_Maggie;
G_Bart -> B_Bart
G_Lisa -> B_Lisa
G_Maggie -> B_Maggie
}")

Reasoning Patterns

Showcase 1 (cont.)

Causal reasoning

This is an example of causal reasoning: intuitively, the reasoning flows in the causal direction, from top (causes) to bottom (effects).

We ask: what is the probability that the student gets a strong letter?

\(P(l^1)\) \(\approx\) 0.503
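This marginal is obtained by summing the factorized joint over all the other variables (the value 0.503 is the one quoted above):

\[P(l^1) = \sum_{D,I,G,S} P(D)P(I)P(G|I,D)P(S|I)P(l^1|G) = \sum_{D,I,G} P(D)P(I)P(G|I,D)P(l^1|G) \approx 0.503\]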

If we condition on low intelligence (\(i^0\)), the probability of a strong letter goes down:

\(P(l^1 \mid i^0)\)

Now ask what happens if, in addition, the course is easy (\(d^0\)):

\(P(l^1 \mid i^0, d^0)\)

Evidential reasoning

Evidential reasoning goes from the bottom up, from effects to causes.

In this case we condition on the grade and ask what happens to the probabilities of variables that are parents, or more generally ancestors, of the grade. Does it matter that this student takes the class and gets a C? A priori, the probability that the class is difficult is 0.4 and the probability that the student is intelligent is 0.3.

\(P(d^1)\) = 0.4; \(P(i^1)\) = 0.3
\(P(d^1 \mid g^3)\) \(\approx\) ?; \(P(i^1 \mid g^3)\) \(\approx\) ?
Conditioning on the C raises the probability that the course was difficult and lowers the probability that the student is intelligent.
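These posteriors follow from Bayes' rule applied to the factorized joint; for example:

\[P(i^1 \mid g^3) = \frac{P(i^1, g^3)}{P(g^3)} = \frac{P(i^1)\sum_{D} P(D)P(g^3|i^1,D)}{\sum_{D,I} P(D)P(I)P(g^3|I,D)}\]

(the factors \(\sum_S P(S|I)\) and \(\sum_L P(L|g^3)\) both sum to 1 and drop out).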

Intercausal reasoning

This reasoning is called intercausal because it is, effectively, a flow of information between two causes of a single effect.

\(P(i^1 \mid g^3, d^1)\)

\(P(i^1 \mid g^2)\)

\(P(i^1 \mid g^2, d^1)\)
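Because \(D\) and \(I\) are independent a priori, the posterior on intelligence given both the grade and the difficulty reduces to (a worked equation):

\[P(i^1 \mid g^3, d^1) = \frac{P(i^1)P(g^3|i^1,d^1)}{\sum_{I} P(I)P(g^3|I,d^1)}\]

With typical CPDs this is higher than \(P(i^1 \mid g^3)\): the difficulty partially "explains away" the poor grade.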

The explanation of Intercausal Reasoning

Observing \(d^1\) gives an alternative explanation for the poor grade, so the probability of high intelligence goes back up; this effect is known as explaining away.

Student aces the SAT: observing \(s^1\) raises the probability of high intelligence, which in turn makes the course's difficulty the more likely explanation for the C.

Flow of Probabilistic Influence

When can \(X\) influence \(Y\)?

  • X \(\longrightarrow\) Y: X is a parent of Y, so X can influence Y
  • X \(\longrightarrow\) W \(\longrightarrow\) Y: X influences Y indirectly, through the intermediate variable W
grViz("
digraph causal{

# Node
node[shape=circle]
Difficulty
Grade
Letter

# Edges
edge[color=black, arrowhead=vee]
Difficulty -> Grade
Grade -> Letter
}")
  • X \(\longleftarrow\) W \(\longrightarrow\) Y: we have a common cause, W, with two effects, X and Y
grViz("
digraph causal{

# Node
node[shape=circle]
Intelligence
Grade
SAT

# Edges
edge[color=black, arrowhead=vee]
Intelligence -> Grade
Intelligence -> SAT
}")
  • X \(\longrightarrow\) W \(\longleftarrow\) Y: a v-structure; X and Y are two causes of the common effect W (influence does not flow through an unobserved v-structure, as discussed below)
grViz("
digraph causal{

# Node
node[shape=circle]
Difficulty
Grade
Intelligence

# Edges
edge[color=black, arrowhead=vee]
Difficulty -> Grade
Intelligence -> Grade
}")

Active trails

  • A trail \(X_1\) — … — \(X_k\) is active if it has no v-structures \(X_{i-1} \longrightarrow X_i \longleftarrow X_{i+1}\)

In general, if we have an edge \(X \rightarrow Y\) in some Bayesian network, is there any set of other variables that we can condition on to make \(X\) and \(Y\) independent? — No; the direct edge from \(X\) to \(Y\) means that, in general, influence can flow from \(X\) to \(Y\) regardless of whether other variables are observed.

  • A trail \(X_1\) — … — \(X_k\) is active given \(Z\) if:
    • for any v-structure \(X_{i-1} \longrightarrow X_i \longleftarrow X_{i+1}\) we have that \(X_i\) or one of its descendants \(\in Z\) (i.e. is observed)
    • no other \(X_i\) is in \(Z\)

When can \(X\) influence \(Y\) given evidence about \(\mathbf{Z}\)?

Say we observe Intelligence. Are Grade and SAT conditionally independent? If we don't observe Intelligence, then Grade and SAT are dependent, because observing Grade gives us some information about Intelligence and therefore about SAT, and vice versa. However, if we have already observed Intelligence, then observing Grade can't affect SAT (and vice versa), so they are conditionally independent.
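This also falls directly out of the factorization: summing the irrelevant variables out of \(P(S \mid G, i^1)\) (a worked check):

\[P(S \mid G, i^1) = \frac{\sum_{D,L} P(D)P(i^1)P(G|i^1,D)P(S|i^1)P(L|G)}{\sum_{D,S,L} P(D)P(i^1)P(G|i^1,D)P(S|i^1)P(L|G)} = P(S \mid i^1)\]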

The trail S — I — G — D allows influence to flow when:
- I is observed: influence does not flow
- I not observed, nothing else observed: influence does not flow (the v-structure at G blocks it)
- I not observed, G observed: influence flows
\(\Longrightarrow\) this is exactly the definition of an active trail given above

Which of the following are active trails if we observe \(G\)?

grViz("
digraph causal{

# Node
node[shape=circle]
Coherence
Difficulty
Intelligence
Grade
SAT
Letter
Job
Happy


# Edges
edge[color=black, arrowhead=vee]
Coherence -> Difficulty
Difficulty -> Grade
Intelligence -> Grade
Intelligence -> SAT
Grade -> Letter
Grade -> Happy
Letter -> Job
SAT -> Job
Job -> Happy
}")
  • C \(\rightarrow\) D \(\rightarrow\) G \(\leftarrow\) I \(\rightarrow\) S is active, because observing G activates the v-structure there.

  • I \(\rightarrow\) G \(\rightarrow\) L \(\rightarrow\) J \(\rightarrow\) H is not active, because we have observed G; this blocks the flow of influence from I to L.

  • I \(\rightarrow\) S \(\rightarrow\) J \(\rightarrow\) H is active, because we are following arrows in the same direction without encountering any observed nodes.

  • C \(\rightarrow\) D \(\rightarrow\) G \(\leftarrow\) I \(\rightarrow\) S \(\rightarrow\) J \(\leftarrow\) L is not active. The v-structure at J blocks the flow of influence because neither J nor any of its descendants is observed.
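To close, a small R sketch that mechanizes the active-trail check above (the edge list matches the graph; is_active_trail and descendants are our own helper names, not from any package):

edges <- matrix(c(
  "Coherence",    "Difficulty",
  "Difficulty",   "Grade",
  "Intelligence", "Grade",
  "Intelligence", "SAT",
  "Grade",        "Letter",
  "Grade",        "Happy",
  "Letter",       "Job",
  "SAT",          "Job",
  "Job",          "Happy"), ncol = 2, byrow = TRUE)

has_edge <- function(a, b) any(edges[, 1] == a & edges[, 2] == b)

# All descendants of a node (children, children's children, ...).
descendants <- function(x) {
  kids <- edges[edges[, 1] == x, 2]
  unique(c(kids, unlist(lapply(kids, descendants))))
}

# Check whether a trail (length >= 3) is active given observed set Z.
is_active_trail <- function(trail, Z) {
  for (i in seq(2, length(trail) - 1)) {
    x <- trail[i - 1]; m <- trail[i]; y <- trail[i + 1]
    if (has_edge(x, m) && has_edge(y, m)) {
      # v-structure x -> m <- y: m or one of its descendants must be observed
      if (!any(c(m, descendants(m)) %in% Z)) return(FALSE)
    } else if (m %in% Z) {
      return(FALSE)   # any other middle node must be unobserved
    }
  }
  TRUE
}

Z <- "Grade"
is_active_trail(c("Coherence", "Difficulty", "Grade", "Intelligence", "SAT"), Z)   # TRUE
is_active_trail(c("Intelligence", "Grade", "Letter", "Job", "Happy"), Z)           # FALSE
is_active_trail(c("Intelligence", "SAT", "Job", "Happy"), Z)                       # TRUE
is_active_trail(c("Coherence", "Difficulty", "Grade", "Intelligence",
                  "SAT", "Job", "Letter"), Z)                                      # FALSE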