- Variable elimination successively removes variables from a Bayesian Network while maintaining its ability to answer queries of interest
- MPE,MAP, probability of evidence, prior and posterior marginals
- Main insight is that we can use variable elimination to sum out variables without constructing joint probabilities
- Joint probability construction has exponential time complexity
- VE may be able to decrease the time complexity of queries.
Factors
- Factors are a function over a set of variables
- CPTs are an example of factors
- A factor over an empty set of variables is called trivial
- Assigns a single number to the trivial instantiation \(\top\)
- A common use is when you sum out all variables of a factor
- Assigns a single number to the trivial instantiation \(\top\)
- 2 main operations
- Summing out a variable
- Multiplying 2 factors
Summing out Variables (Marginalisation)
- The result of summing \(X\in\mathbf{X}\) from a factor \(f\) is another factor over variables \(\mathbf{Y}=\mathbf{X}\backslash\{X\}\), denoted \(\sum_Xf\)
\[ \left(\sum_x f\right)(\mathbf{y}) = \sum_xf(x,\mathbf{y})\]
- Summing operation is commutative
\[ \sum_x\sum_yf=\sum_y\sum_xf\]
- Summation Algorithm:
- \(O(\exp(w))\) in time and space. \(w\) is the number of factor variables in the original factor.
Example
Multiplication of Factors
- Multiplying 2 factors results in a new factor over the union of their variables \(Z=X\cup Y\)
\[ (f_1f_2)(\mathbf{z}) = f_1(\mathbf{x})f_2(\mathbf{y})\]
- Each instantiation in the new factor is compatible with exactly one instantiation on each original factor
- Factor multiplication is commutative and associative
- Don’t need to specify order of multiplication
- Multiplication algorithm:
- \(O(m\exp(w))\) complexity in time and space
- \(w\) is the number of variables in resulting factor
- \(m\) is the number of factors
- \(O(m\exp(w))\) complexity in time and space
Variable Elimination Algorithm
- Naively, one could compute the joint probability distribution through chain rule, then marginalise unwanted variables
\[ P(C) = \sum_{A,B}\Theta_{A}\Theta_{B|A}\Theta_{C|B}\]
- However we can improve on this by being smart with our summations.
Smarter Summations
- In general, we only need to sum out variables in factors they appear.
- Suppose \(X\) appears in \(f_{n-1},f_n\)
\[ \sum_xf_1f_2*...*f_{n-1}*f_{n} = f_1*f_2*...*\sum_xf_{n-1}f_{n}\]
- Therefore a smarter way to compute the marginal for this network:
- Considering elimination order \(\pi = \{A,B\}\)
\[ \sum_B\Theta_{C|B}\underbrace{\sum_A\Theta_{A}\Theta_{B|A}}_{\text{Compute First}}\]
Bucket Elimination
- Variable Elimination needs to identify each factor that mentions a variable
- Best complexity if factors are found in linear time in number of factors
- Can arrange factors into buckets based on elimination order
- One bucket for each network variable
- Factor is placed in the first bucket whose label appear in the factor
- Variables are eliminated by processing the buckets from top to bottom.
- When processing variable \(\pi(i)\):
- Multiply all factors in the bucket \(\pi(i)\)
- Sum out \(\pi(i)\)
- Place the resulting factor \(f_i\) in the first next bucket whose label appears in \(f_i\)
- One way to handle evidence is to create evidence indicators and put them in the relevant buckets to be multiplied.
- For example, consider \(\mathbf{e}: B=true, E=false\)
Computing Prior Marginals (VE_PR1)
- The first iteration of the variable elimination algorithm
- Used to calculate prior marginals (no evidence)
- Algorithm:
- \(O(n\exp(w)+n\exp(|\mathbf{Q}|)\) time and space complexity
- \(w\) is the number of variables in the biggest factor
- \(|\mathbf{Q}|\) is the number of query variables - I think
- Therefore the complexity depends on the biggest factor, which is dependent on the ordering \(\pi(i)\)
- We can compute an ordering \(\pi(i)\) that minimises size of intermediate factors
- \(O(n\exp(w)+n\exp(|\mathbf{Q}|)\) time and space complexity
Computing Posterior Marginals (VE_PR2)
- Can use variable elimination to calculate a query in the form:
\[ P(\mathbf{Q|e})\]
- To do this, it is easier to calculate the joint marginal \(P(\mathbf{Q,e})\) and then normalize:
\[ P(\mathbf{Q|e}) = \frac{P(\mathbf{Q,e})}{P(\mathbf{e})}\]
- Variable elimination algorithm can be extended to compute joint marginals by zeroing out rows inconsistent with \(\mathbf{e}\)
\[ f^\mathbf{e}(x)=\begin{cases}f(x),\quad\text{if } \mathbf{x\sim e}\\0,\quad\text{otherwise}\end{cases}\]
- For example, for \(\mathbf{e}:E=true\):
- To calculate the joint marginal:
- Note \((f_1f_2)^\mathbf{e}=f_1^\mathbf{e}f_2^\mathbf{e}\)
- Also note we do sum out evidence variables still
\[ P(\mathbf{Q,e})=\sum_{A,B}(\Theta_A\Theta_{B|A}\Theta_{C|B})^\mathbf{e} = \sum_{A,B}\Theta_A^\mathbf{e}\Theta_{B|A}^\mathbf{e}\Theta_{C|B}^\mathbf{e} \] * Algorithm: + Note we sum out evidence variables + People often run \(VE_PR2\) with empty \(\mathbf{Q}\) - Return trivial factor with probability of evidence \(P(\mathbf{e})\)
Network pruning extension (VE_PR)
- Before computing a query, we can simplify the graph based on the query
- Makes inference more efficienct
- Simplification leaves the minimum structure possible to still compute query \(P(\mathbf{Q,e})\)
Pruning Nodes
- Given a query \((\mathbf{Q,e})\), we can remove any leaf node if it does not belong to \((\mathbf{Q\cup E})\)
- Apply iteratively
- This works as for any CPT \(P(X|\mathbf{U})\), summing \(X\) will result in \(1\) being assigned to each instantiation.
- Therefore \(\sum_xf_1(X|\mathbf{U})*f_2 = f_2\)
Pruning Edges
- For a query \(P(\mathbf{Q,e})\), we can eliminate each edge \(U\rightarrow X\) that originates from a node \(U\in\mathbf{E}\)
- Replace CPT \(\Theta_{X|\mathbf{U}}\) by smaller CPT, which is obtained from \(\Theta_{X|\mathbf{U}}\) by assuming the value of \(\mathbf{u}\) given \(\mathbf{e}\)
- Example:
- Get rid of rows that do not match evidence. As only one value of \(U\) is left, we can remove that column as we assume \(\mathbf{u}\) given $.
Network Pruning Algorithm
- Network Pruning algorithm brings both node and edge pruning together to create algorithm VE_PR
- Pruning can typically be done in linear time, therefore is often worth it to simplify network.
- Network Pruning Example:
- Algorithm:
- Effective treewidth for a network \(N\) with respect to query \((\mathbf{Q,e})\) is the treewidth of pruneNetwork\((N,\mathbf{Q,e})\)
Elimination Order
- We want to choose \(\pi\) with the smallest width (max intermediate factor variables) to decrease complexity of variable elimination algorithms
- We can compute the width of an order by operating on an undirected graph
- Nodes of \(G\) are the variables that appear in the factors \(f_1,...,f_n\)
- There is an edge between two variables in \(G\) iff those variables appear in the same factor
- When eliminating a variable \(\pi(i)\) we need to create a new interaction graph
- Add an edge to \(G\) between every pair of neighbours of variable \(\pi(i)\) that are not already connected
- Delete variable \(\pi(i)\) from \(G\)
- The key insight from the interaction graph \(G\) for factors \(S\) is:
- The elimination of a variable \(\pi(i)\) from \(S\) leads to constructing a factor over the neighbours of \(\pi(i)\) in \(G\)
- Therefore, the width of an elimination order is the max amount of neighbours when eliminating the variables.
- Therefore, to calculate the width of an order we can use the algorithm:
- Note that this is very efficient to check many orders, as there is \(n!\) possible orderings.
- Therefore we often rely on heuristics that have good empirical results to create an order
Min-Degree Heuristic (MinDegreeOrder)
- Eliminate the variable that leads to constructing the smallest factor
- Equivalent to the variable with the smallest number of neighbours
- Greedy approach
- Only guaranteed to be optimal when a network has a complete elimination order width \(\leq 2\)
- Algorithm:
Fill-in Heuristic (MinFillOrder)
- Eliminates the variable that leads to adding the smallest number of edges in \(G\), the fill in edges
- Often perrforms better than MinDegreeOrder.
- Algorithm:
- Ties are often broken through MinDegreeoder
Network Structure and Complexity
- Minimum elimination order width is linked to the network structure
- No complete elimination order can have width smaller than network treewidth
- There is an order that has width=treewidth but it is NP-Hard to find.
- No complete elimination order can have width smaller than network treewidth
Tree-Width
- Treewidth is a number that quantifies the extent to which a network structure resembles a tree structure
- Impacts on treewidth
- Number of nodes have no direct impact
- Tree width cannot be lower than max amount of parents, \(k\), that a node can have
- Loops tend to increase tree width but have no predefined effect
Known Tree-Width Networks
- Polytree networrks (singly-connected):
- At most 1 undirected path between two nodes
- Treewidth is \(k\), where \(k\) is maximum parents per node
- Tree networks are polytree with a max of \(k=1\) (\(1\) parent):
- Tree width is 1
- Multiply connected networks are all networks that are not polytree networks