Variable Elimination

Jake

06/10/2022

  • Variable elimination successively removes variables from a Bayesian Network while maintaining its ability to answer queries of interest
    • MPE,MAP, probability of evidence, prior and posterior marginals
  • Main insight is that we can use variable elimination to sum out variables without constructing joint probabilities
    • Joint probability construction has exponential time complexity
    • VE may be able to decrease the time complexity of queries.

Factors

  • Factors are a function over a set of variables
    • CPTs are an example of factors
  • A factor over an empty set of variables is called trivial
    • Assigns a single number to the trivial instantiation \(\top\)
      • A common use is when you sum out all variables of a factor
  • 2 main operations
    • Summing out a variable
    • Multiplying 2 factors

Summing out Variables (Marginalisation)

  • The result of summing \(X\in\mathbf{X}\) from a factor \(f\) is another factor over variables \(\mathbf{Y}=\mathbf{X}\backslash\{X\}\), denoted \(\sum_Xf\)

\[ \left(\sum_x f\right)(\mathbf{y}) = \sum_xf(x,\mathbf{y})\]

  • Summing operation is commutative

\[ \sum_x\sum_yf=\sum_y\sum_xf\]

  • Summation Algorithm:
    • \(O(\exp(w))\) in time and space. \(w\) is the number of factor variables in the original factor.

Example

Multiplication of Factors

  • Multiplying 2 factors results in a new factor over the union of their variables \(Z=X\cup Y\)

\[ (f_1f_2)(\mathbf{z}) = f_1(\mathbf{x})f_2(\mathbf{y})\]

  • Each instantiation in the new factor is compatible with exactly one instantiation on each original factor
  • Factor multiplication is commutative and associative
    • Don’t need to specify order of multiplication
  • Multiplication algorithm:
    • \(O(m\exp(w))\) complexity in time and space
      • \(w\) is the number of variables in resulting factor
      • \(m\) is the number of factors

Variable Elimination Algorithm

  • Naively, one could compute the joint probability distribution through chain rule, then marginalise unwanted variables

\[ P(C) = \sum_{A,B}\Theta_{A}\Theta_{B|A}\Theta_{C|B}\]

  • However we can improve on this by being smart with our summations.

Smarter Summations

  • In general, we only need to sum out variables in factors they appear.
    • Suppose \(X\) appears in \(f_{n-1},f_n\)

\[ \sum_xf_1f_2*...*f_{n-1}*f_{n} = f_1*f_2*...*\sum_xf_{n-1}f_{n}\]

  • Therefore a smarter way to compute the marginal for this network:
    • Considering elimination order \(\pi = \{A,B\}\)

\[ \sum_B\Theta_{C|B}\underbrace{\sum_A\Theta_{A}\Theta_{B|A}}_{\text{Compute First}}\]

Bucket Elimination

  • Variable Elimination needs to identify each factor that mentions a variable
    • Best complexity if factors are found in linear time in number of factors
  • Can arrange factors into buckets based on elimination order
    • One bucket for each network variable
    • Factor is placed in the first bucket whose label appear in the factor

  • Variables are eliminated by processing the buckets from top to bottom.
  • When processing variable \(\pi(i)\):
    • Multiply all factors in the bucket \(\pi(i)\)
    • Sum out \(\pi(i)\)
    • Place the resulting factor \(f_i\) in the first next bucket whose label appears in \(f_i\)

  • One way to handle evidence is to create evidence indicators and put them in the relevant buckets to be multiplied.
    • For example, consider \(\mathbf{e}: B=true, E=false\)

Computing Prior Marginals (VE_PR1)

  • The first iteration of the variable elimination algorithm
    • Used to calculate prior marginals (no evidence)
  • Algorithm:
    • \(O(n\exp(w)+n\exp(|\mathbf{Q}|)\) time and space complexity
      • \(w\) is the number of variables in the biggest factor
      • \(|\mathbf{Q}|\) is the number of query variables - I think
    • Therefore the complexity depends on the biggest factor, which is dependent on the ordering \(\pi(i)\)
      • We can compute an ordering \(\pi(i)\) that minimises size of intermediate factors

Computing Posterior Marginals (VE_PR2)

  • Can use variable elimination to calculate a query in the form:

\[ P(\mathbf{Q|e})\]

  • To do this, it is easier to calculate the joint marginal \(P(\mathbf{Q,e})\) and then normalize:

\[ P(\mathbf{Q|e}) = \frac{P(\mathbf{Q,e})}{P(\mathbf{e})}\]

  • Variable elimination algorithm can be extended to compute joint marginals by zeroing out rows inconsistent with \(\mathbf{e}\)

\[ f^\mathbf{e}(x)=\begin{cases}f(x),\quad\text{if } \mathbf{x\sim e}\\0,\quad\text{otherwise}\end{cases}\]

  • For example, for \(\mathbf{e}:E=true\):

  • To calculate the joint marginal:
    • Note \((f_1f_2)^\mathbf{e}=f_1^\mathbf{e}f_2^\mathbf{e}\)
    • Also note we do sum out evidence variables still

\[ P(\mathbf{Q,e})=\sum_{A,B}(\Theta_A\Theta_{B|A}\Theta_{C|B})^\mathbf{e} = \sum_{A,B}\Theta_A^\mathbf{e}\Theta_{B|A}^\mathbf{e}\Theta_{C|B}^\mathbf{e} \] * Algorithm: + Note we sum out evidence variables + People often run \(VE_PR2\) with empty \(\mathbf{Q}\) - Return trivial factor with probability of evidence \(P(\mathbf{e})\)

Network pruning extension (VE_PR)

  • Before computing a query, we can simplify the graph based on the query
    • Makes inference more efficienct
    • Simplification leaves the minimum structure possible to still compute query \(P(\mathbf{Q,e})\)

Pruning Nodes

  • Given a query \((\mathbf{Q,e})\), we can remove any leaf node if it does not belong to \((\mathbf{Q\cup E})\)
    • Apply iteratively

  • This works as for any CPT \(P(X|\mathbf{U})\), summing \(X\) will result in \(1\) being assigned to each instantiation.
    • Therefore \(\sum_xf_1(X|\mathbf{U})*f_2 = f_2\)

Pruning Edges

  • For a query \(P(\mathbf{Q,e})\), we can eliminate each edge \(U\rightarrow X\) that originates from a node \(U\in\mathbf{E}\)
    • Replace CPT \(\Theta_{X|\mathbf{U}}\) by smaller CPT, which is obtained from \(\Theta_{X|\mathbf{U}}\) by assuming the value of \(\mathbf{u}\) given \(\mathbf{e}\)
  • Example:
    • Get rid of rows that do not match evidence. As only one value of \(U\) is left, we can remove that column as we assume \(\mathbf{u}\) given $.

Network Pruning Algorithm

  • Network Pruning algorithm brings both node and edge pruning together to create algorithm VE_PR
    • Pruning can typically be done in linear time, therefore is often worth it to simplify network.
  • Network Pruning Example:

  • Algorithm:
    • Effective treewidth for a network \(N\) with respect to query \((\mathbf{Q,e})\) is the treewidth of pruneNetwork\((N,\mathbf{Q,e})\)

Elimination Order

  • We want to choose \(\pi\) with the smallest width (max intermediate factor variables) to decrease complexity of variable elimination algorithms
  • We can compute the width of an order by operating on an undirected graph
    • Nodes of \(G\) are the variables that appear in the factors \(f_1,...,f_n\)
    • There is an edge between two variables in \(G\) iff those variables appear in the same factor

  • When eliminating a variable \(\pi(i)\) we need to create a new interaction graph
    • Add an edge to \(G\) between every pair of neighbours of variable \(\pi(i)\) that are not already connected
    • Delete variable \(\pi(i)\) from \(G\)

  • The key insight from the interaction graph \(G\) for factors \(S\) is:
    • The elimination of a variable \(\pi(i)\) from \(S\) leads to constructing a factor over the neighbours of \(\pi(i)\) in \(G\)
    • Therefore, the width of an elimination order is the max amount of neighbours when eliminating the variables.
  • Therefore, to calculate the width of an order we can use the algorithm:
    • Note that this is very efficient to check many orders, as there is \(n!\) possible orderings.
    • Therefore we often rely on heuristics that have good empirical results to create an order

Min-Degree Heuristic (MinDegreeOrder)

  • Eliminate the variable that leads to constructing the smallest factor
    • Equivalent to the variable with the smallest number of neighbours
    • Greedy approach
      • Only guaranteed to be optimal when a network has a complete elimination order width \(\leq 2\)
  • Algorithm:

Fill-in Heuristic (MinFillOrder)

  • Eliminates the variable that leads to adding the smallest number of edges in \(G\), the fill in edges
    • Often perrforms better than MinDegreeOrder.
  • Algorithm:
    • Ties are often broken through MinDegreeoder

Network Structure and Complexity

  • Minimum elimination order width is linked to the network structure
    • No complete elimination order can have width smaller than network treewidth
      • There is an order that has width=treewidth but it is NP-Hard to find.

Tree-Width

  • Treewidth is a number that quantifies the extent to which a network structure resembles a tree structure
  • Impacts on treewidth
    • Number of nodes have no direct impact
    • Tree width cannot be lower than max amount of parents, \(k\), that a node can have
    • Loops tend to increase tree width but have no predefined effect

Known Tree-Width Networks

  • Polytree networrks (singly-connected):
    • At most 1 undirected path between two nodes
    • Treewidth is \(k\), where \(k\) is maximum parents per node

  • Tree networks are polytree with a max of \(k=1\) (\(1\) parent):
    • Tree width is 1

  • Multiply connected networks are all networks that are not polytree networks