Variable elimination successively removes variables from a Bayesian Network while maintaining its ability to answer queries of interest
- MPE,MAP, probability of evidence, prior and posterior marginals
Main insight is that we can use variable elimination to sum out variables without constructing joint probabilities
- Joint probability construction has exponential time complexity
- VE may be able to decrease the time complexity of queries.

Factors

Factors are a function over a set of variables
- CPTs are an example of factors
A factor over an empty set of variables is called trivial
- Assigns a single number to the trivial instantiation $\top$
  - A common use is when you sum out all variables of a factor
2 main operations
- Summing out a variable
- Multiplying 2 factors

Summing out Variables (Marginalisation)

The result of summing $X\in\mathbf{X}$ from a factor $f$ is another factor over variables $\mathbf{Y}=\mathbf{X}\backslash\{X\}$, denoted $\sum_Xf$

\[ \left(\sum_x f\right)(\mathbf{y}) = \sum_xf(x,\mathbf{y})\]

Summing operation is commutative

\[ \sum_x\sum_yf=\sum_y\sum_xf\]

Summation Algorithm:
- $O(\exp(w))$ in time and space. $w$ is the number of factor variables in the original factor.

Example

Multiplication of Factors

Multiplying 2 factors results in a new factor over the union of their variables $Z=X\cup Y$

\[ (f_1f_2)(\mathbf{z}) = f_1(\mathbf{x})f_2(\mathbf{y})\]

Each instantiation in the new factor is compatible with exactly one instantiation on each original factor
Factor multiplication is commutative and associative
- Don’t need to specify order of multiplication
Multiplication algorithm:
- $O(m\exp(w))$ complexity in time and space
  - $w$ is the number of variables in resulting factor
  - $m$ is the number of factors

Variable Elimination Algorithm

Naively, one could compute the joint probability distribution through chain rule, then marginalise unwanted variables

\[ P(C) = \sum_{A,B}\Theta_{A}\Theta_{B|A}\Theta_{C|B}\]

However we can improve on this by being smart with our summations.

Smarter Summations

In general, we only need to sum out variables in factors they appear.
- Suppose $X$ appears in $f_{n-1},f_n$

\[ \sum_xf_1f_2*...*f_{n-1}*f_{n} = f_1*f_2*...*\sum_xf_{n-1}f_{n}\]

Therefore a smarter way to compute the marginal for this network:
- Considering elimination order $\pi = \{A,B\}$

\[ \sum_B\Theta_{C|B}\underbrace{\sum_A\Theta_{A}\Theta_{B|A}}_{\text{Compute First}}\]

Bucket Elimination

Variable Elimination needs to identify each factor that mentions a variable
- Best complexity if factors are found in linear time in number of factors
Can arrange factors into buckets based on elimination order
- One bucket for each network variable
- Factor is placed in the first bucket whose label appear in the factor

Variables are eliminated by processing the buckets from top to bottom.
When processing variable $\pi(i)$:
- Multiply all factors in the bucket $\pi(i)$
- Sum out $\pi(i)$
- Place the resulting factor $f_i$ in the first next bucket whose label appears in $f_i$

One way to handle evidence is to create evidence indicators and put them in the relevant buckets to be multiplied.
- For example, consider $\mathbf{e}: B=true, E=false$

Computing Prior Marginals (VE_PR1)

The first iteration of the variable elimination algorithm
- Used to calculate prior marginals (no evidence)
Algorithm:
- $O(n\exp(w)+n\exp(|\mathbf{Q}|)$ time and space complexity
  - $w$ is the number of variables in the biggest factor
  - $|\mathbf{Q}|$ is the number of query variables - I think
- Therefore the complexity depends on the biggest factor, which is dependent on the ordering $\pi(i)$
  - We can compute an ordering $\pi(i)$ that minimises size of intermediate factors

Computing Posterior Marginals (VE_PR2)

Can use variable elimination to calculate a query in the form:

\[ P(\mathbf{Q|e})\]

To do this, it is easier to calculate the joint marginal $P(\mathbf{Q,e})$ and then normalize:

\[ P(\mathbf{Q|e}) = \frac{P(\mathbf{Q,e})}{P(\mathbf{e})}\]

Variable elimination algorithm can be extended to compute joint marginals by zeroing out rows inconsistent with $\mathbf{e}$

\[ f^\mathbf{e}(x)=\begin{cases}f(x),\quad\text{if } \mathbf{x\sim e}\\0,\quad\text{otherwise}\end{cases}\]

For example, for $\mathbf{e}:E=true$:

To calculate the joint marginal:
- Note $(f_1f_2)^\mathbf{e}=f_1^\mathbf{e}f_2^\mathbf{e}$
- Also note we do sum out evidence variables still

\[ P(\mathbf{Q,e})=\sum_{A,B}(\Theta_A\Theta_{B|A}\Theta_{C|B})^\mathbf{e} = \sum_{A,B}\Theta_A^\mathbf{e}\Theta_{B|A}^\mathbf{e}\Theta_{C|B}^\mathbf{e} \] * Algorithm: + Note we sum out evidence variables + People often run $VE_PR2$ with empty $\mathbf{Q}$ - Return trivial factor with probability of evidence $P(\mathbf{e})$

Network pruning extension (VE_PR)

Before computing a query, we can simplify the graph based on the query
- Makes inference more efficienct
- Simplification leaves the minimum structure possible to still compute query $P(\mathbf{Q,e})$

Pruning Nodes

Given a query $(\mathbf{Q,e})$, we can remove any leaf node if it does not belong to $(\mathbf{Q\cup E})$
- Apply iteratively

This works as for any CPT $P(X|\mathbf{U})$, summing $X$ will result in $1$ being assigned to each instantiation.
- Therefore $\sum_xf_1(X|\mathbf{U})*f_2 = f_2$

Pruning Edges

For a query $P(\mathbf{Q,e})$, we can eliminate each edge $U\rightarrow X$ that originates from a node $U\in\mathbf{E}$
- Replace CPT $\Theta_{X|\mathbf{U}}$ by smaller CPT, which is obtained from $\Theta_{X|\mathbf{U}}$ by assuming the value of $\mathbf{u}$ given $\mathbf{e}$
Example:
- Get rid of rows that do not match evidence. As only one value of $U$ is left, we can remove that column as we assume $\mathbf{u}$ given $.

Network Pruning Algorithm

Network Pruning algorithm brings both node and edge pruning together to create algorithm VE_PR
- Pruning can typically be done in linear time, therefore is often worth it to simplify network.
Network Pruning Example:

Algorithm:
- Effective treewidth for a network $N$ with respect to query $(\mathbf{Q,e})$ is the treewidth of pruneNetwork$(N,\mathbf{Q,e})$

Elimination Order

We want to choose $\pi$ with the smallest width (max intermediate factor variables) to decrease complexity of variable elimination algorithms
We can compute the width of an order by operating on an undirected graph
- Nodes of $G$ are the variables that appear in the factors $f_1,...,f_n$
- There is an edge between two variables in $G$ iff those variables appear in the same factor

When eliminating a variable $\pi(i)$ we need to create a new interaction graph
- Add an edge to $G$ between every pair of neighbours of variable $\pi(i)$ that are not already connected
- Delete variable $\pi(i)$ from $G$

The key insight from the interaction graph $G$ for factors $S$ is:
- The elimination of a variable $\pi(i)$ from $S$ leads to constructing a factor over the neighbours of $\pi(i)$ in $G$
- Therefore, the width of an elimination order is the max amount of neighbours when eliminating the variables.
Therefore, to calculate the width of an order we can use the algorithm:
- Note that this is very efficient to check many orders, as there is $n!$ possible orderings.
- Therefore we often rely on heuristics that have good empirical results to create an order

Min-Degree Heuristic (MinDegreeOrder)

Eliminate the variable that leads to constructing the smallest factor
- Equivalent to the variable with the smallest number of neighbours
- Greedy approach
  - Only guaranteed to be optimal when a network has a complete elimination order width $\leq 2$
Algorithm:

Fill-in Heuristic (MinFillOrder)

Eliminates the variable that leads to adding the smallest number of edges in $G$, the fill in edges
- Often perrforms better than MinDegreeOrder.
Algorithm:
- Ties are often broken through MinDegreeoder

Network Structure and Complexity

Minimum elimination order width is linked to the network structure
- No complete elimination order can have width smaller than network treewidth
  - There is an order that has width=treewidth but it is NP-Hard to find.

Tree-Width

Treewidth is a number that quantifies the extent to which a network structure resembles a tree structure
Impacts on treewidth
- Number of nodes have no direct impact
- Tree width cannot be lower than max amount of parents, $k$, that a node can have
- Loops tend to increase tree width but have no predefined effect

Known Tree-Width Networks

Polytree networrks (singly-connected):
- At most 1 undirected path between two nodes
- Treewidth is $k$, where $k$ is maximum parents per node

Tree networks are polytree with a max of $k=1$ ($1$ parent):
- Tree width is 1

Multiply connected networks are all networks that are not polytree networks

Variable Elimination

Jake

06/10/2022