Bayesian networks build on the same intuitions as the naive Bayes model by exploiting conditional independence properties of the distribution in order to allow a compact and natural representation. However, they are not restricted to representing distributions satisfying the strong independence assumptions implicit in the naive Bayes model. They allow us the flexibility to tailor our representation of the distribution to the independence properties that appear reasonable in the current setting. The core of the Bayesian network representation is a directed acyclic graph (DAG) \(\mathcal{G}\), whose nodes are the random variables in our domain and whose edges correspond, intuitively, to direct influence of one node on another. This graph \(\mathcal{G}\) can be viewed in two very different ways: as a data structure that provides the skeleton for representing a joint distribution compactly in a factorized way, and as a compact representation of a set of conditional independence assumptions about the distribution.
As a running example, consider a joint distribution \(P(G,D,I,S,L)\) over five variables:
- \(\mathbf{G}\)rade: A vs. B vs. C (written \(g^1\), \(g^2\), \(g^3\))
- Course \(\mathbf{D}\)ifficulty: 0 vs. 1
- Student \(\mathbf{I}\)ntelligence: 0 vs. 1
- Student \(\mathbf{S}\)AT: 0 vs. 1
- Reference \(\mathbf{L}\)etter: 0 vs. 1
grViz("
digraph causal{
# Node
node[shape=circle]
G
D
I
S
L
# Edges
edge[color=black, arrowhead=vee]
D -> G
I -> G
I -> S
G -> L
}")
Semantics and Factorization Example
Reading off the graph, the local factors are \(P(D)\), \(P(I)\), \(P(G \mid I,D)\), \(P(S \mid I)\), and \(P(L \mid G)\): one conditional probability distribution per node, given its parents.
Given this Bayesian network, what do we think is an appropriate factorization of the joint distribution \(P(D,I,G,S,L)\)?
By the chain rule of probability, \(P(D,I,G,S,L) = P(D)\,P(I \mid D)\,P(G \mid D,I)\,P(S \mid D,I,G)\,P(L \mid D,I,G,S)\). Applying the conditional independencies encoded in the graph (for example, \(I\) is independent of \(D\), and \(L\) depends only on \(G\)), this simplifies to \(P(D,I,G,S,L) = P(D)\,P(I)\,P(G \mid I,D)\,P(S \mid I)\,P(L \mid G)\).
Example: \(P(d^0, i^1, g^3, s^1, l^1) = 0.6 \times 0.3 \times 0.02 \times 0.8 \times 0.01 = 2.88 \times 10^{-5}\)
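This computation is easy to reproduce in code. Below is a minimal sketch in base R: the factor values used above are entries in these tables, the remaining entries are the standard ones from the course's student example, and the table names (P_D, P_G_ID, and so on) are my own.
# The five CPDs of the student network
P_D <- c(d0 = 0.6, d1 = 0.4)   # P(D)
P_I <- c(i0 = 0.7, i1 = 0.3)   # P(I)
# P(G | I, D), with G = g1 (A), g2 (B), g3 (C)
P_G_ID <- array(
  c(0.30, 0.40, 0.30,    # d0, i0
    0.05, 0.25, 0.70,    # d1, i0
    0.90, 0.08, 0.02,    # d0, i1
    0.50, 0.30, 0.20),   # d1, i1
  dim = c(3, 2, 2),
  dimnames = list(G = c("g1", "g2", "g3"),
                  D = c("d0", "d1"),
                  I = c("i0", "i1"))
)
# P(S | I) and P(L | G), columns indexed by the parent value
P_S_I <- matrix(c(0.95, 0.05,
                  0.20, 0.80),
                nrow = 2,
                dimnames = list(S = c("s0", "s1"), I = c("i0", "i1")))
P_L_G <- matrix(c(0.10, 0.90,
                  0.40, 0.60,
                  0.99, 0.01),
                nrow = 2,
                dimnames = list(L = c("l0", "l1"), G = c("g1", "g2", "g3")))
# The factorization gives the joint probability directly
P_D["d0"] * P_I["i1"] * P_G_ID["g3", "d0", "i1"] *
  P_S_I["s1", "i1"] * P_L_G["l1", "g3"]   # 2.88e-05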
BNs for Genetic Inheritance
In this network, each person has a genotype node (G_) and an observed trait node (B_, blood type in the course's example): a person's blood type depends on their genotype, and a child's genotype depends on the genotypes of the two parents.
grViz("
digraph causal{
# a 'graph' statement
graph [overlap = true, fontsize = 10]
# several 'node' statements
node [shape = circle,
style = filled,
fontname = Helvetica,
fillcolor = Linen]
G_Clancy; B_Clancy
G_Jackie; B_Jackie
G_Marge; B_Marge
G_Selma; B_Selma
G_Homer; B_Homer
G_Bart; B_Bart
G_Lisa; B_Lisa
G_Maggie; B_Maggie
# Edges
edge[color=black, arrowhead=vee]
G_Clancy -> B_Clancy; G_Clancy -> G_Marge; G_Clancy -> G_Selma
G_Jackie -> B_Jackie; G_Jackie -> G_Marge; G_Jackie -> G_Selma
G_Marge -> B_Marge; G_Marge -> G_Bart; G_Marge -> G_Lisa; G_Marge -> G_Maggie
G_Selma -> B_Selma
G_Homer -> B_Homer; G_Homer -> G_Bart; G_Homer -> G_Lisa; G_Homer -> G_Maggie;
G_Bart -> B_Bart
G_Lisa -> B_Lisa
G_Maggie -> B_Maggie
}")
Reasoning Patterns
Returning to the student network, the first pattern is causal reasoning: the reasoning flows in the causal direction, from top to bottom, conditioning on causes and asking about their effects. For example, what is the probability that the student gets a strong letter?
\(P(l^1) \approx 0.502\)
If we condition on low intelligence, the probability of a strong letter drops:
\(P(l^1 \mid i^0) \approx 0.389\)
If we additionally learn that the course is easy, the probability partially recovers:
\(P(l^1 \mid i^0, d^0) \approx 0.513\)
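All three causal queries can be reproduced by brute-force enumeration. The sketch below reuses the CPD tables from the factorization sketch above; the query helper is my own, not code from the course.
# Build the full joint P(D, I, G, S, L) by enumerating all assignments
# and multiplying the network's factors for each row
joint <- expand.grid(D = c("d0", "d1"), I = c("i0", "i1"),
                     G = c("g1", "g2", "g3"), S = c("s0", "s1"),
                     L = c("l0", "l1"), stringsAsFactors = FALSE)
joint$p <- P_D[joint$D] * P_I[joint$I] *
  P_G_ID[cbind(joint$G, joint$D, joint$I)] *
  P_S_I[cbind(joint$S, joint$I)] *
  P_L_G[cbind(joint$L, joint$G)]
stopifnot(isTRUE(all.equal(sum(joint$p), 1)))  # sanity check: sums to 1
# P(query | evidence), both given as named lists of values
query <- function(q, e = list()) {
  keep <- rep(TRUE, nrow(joint))
  for (v in names(e)) keep <- keep & joint[[v]] == e[[v]]
  sel <- keep
  for (v in names(q)) sel <- sel & joint[[v]] == q[[v]]
  sum(joint$p[sel]) / sum(joint$p[keep])
}
query(list(L = "l1"))                            # ~0.502
query(list(L = "l1"), list(I = "i0"))            # ~0.389
query(list(L = "l1"), list(I = "i0", D = "d0"))  # ~0.513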
Evidential reasoning goes from the bottom up: we condition on an effect and ask about its causes.
Here we condition on the grade and ask what happens to the probabilities of variables that are parents, or more generally ancestors, of the grade. Suppose a student takes the class and gets a C. A priori, the probability that the class is difficult is 0.4 and the probability that the student is intelligent is 0.3; conditioning on the C grade shifts both.
\(P(d^1) = 0.4\); \(P(i^1) = 0.3\)
\(P(d^1 \mid g^3) \approx 0.63\); \(P(i^1 \mid g^3) \approx 0.08\)
The C grade makes it more likely that the course is difficult and much less likely that the student is intelligent.
The third pattern is called intercausal reasoning, because it is a flow of information between two causes of a single effect.
\(P(i^1 \mid g^3, d^1) \approx 0.11\): learning that the course is hard partially explains away the C grade, so the probability of high intelligence rises from 0.08.
\(P(i^1 \mid g^2) \approx 0.175\)
\(P(i^1 \mid g^2, d^1) \approx 0.34\): again, knowing the course is hard makes the B much weaker evidence against high intelligence.
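The same query helper reproduces the evidential and intercausal numbers above:
query(list(D = "d1"), list(G = "g3"))             # ~0.63
query(list(I = "i1"), list(G = "g3"))             # ~0.08
query(list(I = "i1"), list(G = "g3", D = "d1"))   # ~0.11
query(list(I = "i1"), list(G = "g2"))             # ~0.175
query(list(I = "i1"), list(G = "g2", D = "d1"))   # ~0.34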
Flow of Probabilistic Influence
The three basic two-edge structures are a causal chain (Difficulty \(\rightarrow\) Grade \(\rightarrow\) Letter), a common cause (Grade \(\leftarrow\) Intelligence \(\rightarrow\) SAT), and a common effect, or v-structure (Difficulty \(\rightarrow\) Grade \(\leftarrow\) Intelligence):
grViz("
digraph causal{
# Node
node[shape=circle]
Difficulty
Grade
Letter
# Edges
edge[color=black, arrowhead=vee]
Difficulty -> Grade
Grade -> Letter
}")
grViz("
digraph causal{
# Node
node[shape=circle]
Intelligence
Grade
SAT
# Edges
edge[color=black, arrowhead=vee]
Intelligence -> Grade
Intelligence -> SAT
}")
grViz("
digraph causal{
# Node
node[shape=circle]
Difficulty
Grade
Intelligence
# Edges
edge[color=black, arrowhead=vee]
Difficulty -> Grade
Intelligence -> Grade
}")
In general, if we have an edge \(X \rightarrow Y\) in some Bayesian network, is there any set of other variables that we can condition on to make \(X\) and \(Y\) independent? No: the direct edge from \(X\) to \(Y\) means that, in general, influence can flow from \(X\) to \(Y\) regardless of which other variables are observed.
Say we observe Intelligence. Are Grade and SAT conditionally independent? If we don't observe Intelligence, then Grade and SAT are dependent, because observing Grade gives us some information about Intelligence and therefore about SAT, and vice versa. However, if we have already observed Intelligence, then observing Grade can't affect SAT and vice versa, so they are conditionally independent.
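The query helper from the reasoning-patterns sketch confirms this numerically:
query(list(S = "s1"), list(I = "i1"))             # 0.8
query(list(S = "s1"), list(I = "i1", G = "g1"))   # 0.8: G adds nothing once I is known
query(list(S = "s1"))                             # 0.275
query(list(S = "s1"), list(G = "g1"))             # ~0.386: dependent when I is unobserved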
The trail S \(\leftarrow\) I \(\rightarrow\) G \(\leftarrow\) D allows influence to flow between S and D when:
- I is observed: influence does not flow (the trail is blocked at I)
- I is not observed and nothing else is observed: influence does not flow (the v-structure at G is blocked)
- I is not observed and G is observed: influence flows
In general, a trail is active given a set of observed variables \(Z\) if, for every v-structure \(W \rightarrow X \leftarrow U\) on the trail, \(X\) or one of its descendants is in \(Z\), and no other node on the trail is in \(Z\). (A sketch implementing this test follows the quiz below.)
Which of the following are active trails if we observe \(G\)?
grViz("
digraph causal{
# Node
node[shape=circle]
Coherence
Difficulty
Intelligence
Grade
SAT
Letter
Job
Happy
# Edges
edge[color=black, arrowhead=vee]
Coherence -> Difficulty
Difficulty -> Grade
Intelligence -> Grade
Intelligence -> SAT
Grade -> Letter
Grade -> Happy
Letter -> Job
SAT -> Job
Job -> Happy
}")
C \(\rightarrow\) D \(\rightarrow\) G \(\leftarrow\) I \(\rightarrow\) S is active, because observing G activates the V-structure there.
I \(\rightarrow\) G \(\rightarrow\) L \(\rightarrow\) J \(\rightarrow\) H is not active, because we have observed G; this blocks the flow of influence from I to L.
I \(\rightarrow\) S \(\rightarrow\) J \(\rightarrow\) H is active, because we are following arrows in the same direction without encountering any observed nodes.
C \(\rightarrow\) D \(\rightarrow\) G \(\leftarrow\) I \(\rightarrow\) S \(\rightarrow\) J \(\leftarrow\) L is not active. The V-structure at J blocks the flow of influence because neither J nor any of its descendants is observed.
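Finally, here is the promised sketch of the active-trail test, applied to the four trails above. It works directly on the network's edge list; the helper names are my own.
# Edge list of the extended student network above
edges <- data.frame(
  from = c("C", "D", "I", "I", "G", "G", "L", "S", "J"),
  to   = c("D", "G", "G", "S", "L", "H", "J", "J", "H"),
  stringsAsFactors = FALSE
)
parents  <- function(x) edges$from[edges$to == x]
children <- function(x) edges$to[edges$from == x]
# All descendants of x, by breadth-first traversal
descendants <- function(x) {
  out <- character(0)
  frontier <- children(x)
  while (length(frontier) > 0) {
    out <- union(out, frontier)
    frontier <- unlist(lapply(frontier, children))
  }
  out
}
# A trail is active given observed set Z iff every v-structure on it is
# activated (its middle node or a descendant is in Z) and every other
# intermediate node is unobserved
is_active <- function(trail, Z) {
  for (k in seq(2, length(trail) - 1)) {
    mid <- trail[k]
    v <- trail[k - 1] %in% parents(mid) && trail[k + 1] %in% parents(mid)
    if (v && !(mid %in% Z || any(descendants(mid) %in% Z))) return(FALSE)
    if (!v && mid %in% Z) return(FALSE)
  }
  TRUE
}
is_active(c("C", "D", "G", "I", "S"), Z = "G")            # TRUE
is_active(c("I", "G", "L", "J", "H"), Z = "G")            # FALSE
is_active(c("I", "S", "J", "H"), Z = "G")                 # TRUE
is_active(c("C", "D", "G", "I", "S", "J", "L"), Z = "G")  # FALSE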