Markov Networks
- Markov networks are undirected graphical models
- Often used to model symmetric dependencies where each state can be influenced by the state of its neighbours
- A variable is independent of all other variables given its neighbours
\[ X_1\perp X_3,X_4|X_2\]
- Markov networks encode independence assumptions similar to Bayesian networks.
- Factors are generalizations of CPTs; their values need not lie between 0 and 1.
- Can convert to probabilities by computing the (unnormalized) joint distribution and dividing by a normalization constant.
- This requires less knowledge to specify, since we only consider local dependencies
- However, this makes learning from data harder, as the factor values do not correspond directly to normalized probabilities.
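- A minimal sketch of this conversion (hypothetical factor values over binary variables):

```python
import itertools

# Hypothetical pairwise factors; values need not lie in [0, 1],
# they only express relative compatibility of assignments.
phi_AB = {(0, 0): 30.0, (0, 1): 5.0, (1, 0): 1.0, (1, 1): 10.0}
phi_BC = {(0, 0): 100.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 100.0}

# Unnormalized joint, then divide by the normalization constant Z.
unnorm = {x: phi_AB[x[:2]] * phi_BC[x[1:]]
          for x in itertools.product([0, 1], repeat=3)}
Z = sum(unnorm.values())
joint = {x: v / Z for x, v in unnorm.items()}   # proper probabilities, sum to 1
```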
Random Field
- A random field \(\mathbf{X}\) is a set of random variables
- It is common that each variable \(X_i\) is associated with a site.
- Set \(\mathbf{S}\) indexes a set of \(n\) sites
- Sites in \(\mathbf{S}\) are related to one another via a neighborhood system
- Neighboring relationship is mutual: \(i\in\mathbf{N}_{i'}\) iff \(i'\in\mathbf{N}_{i}\)
- This symmetry is natural for an undirected graph.
- A random field \(\mathbf{X}\) is a Markov Random Field/Network on \(\mathbf{S}\) iff:
- Positivity: \(P(X_1=x_1,...,X_n=x_n) > 0,\forall\mathbf{x\in X}\)
- All instantiations of the variables in \(\mathbf{X}\) have a strictly positive probability
- Markovianity: \(P(X_i|\mathbf{X}_{\mathbf{S}\backslash\{i\}})=P(X_i|\mathbf{X_{N_i}})\)
- A variable is independent of all variables outside its neighborhood, given its neighborhood.
- Graphically, Markov Networks are undirected graphical models
- \(G = (\mathbf{V,E})\), where \(\mathbf{V}\) consists of a set of random variables and \(\mathbf{E}\) is a set of undirected edges
Gibbs Distribution
- When the positivity condition is satisfied, the joint probability distribution is uniquely determined by the Gibbs Distribution.
- Allows factorisation of full joint distribution into smaller factors created by ‘cliques’
- Cliques are fully connected subgraphs
\[ P(\mathbf{X}) = \frac{1}{Z}\underbrace{\prod_{c\in cliques(G)}\phi_c(\mathbf{X}_c)}_{\tilde P(\mathbf{X})}\]
- \(Z\) is the normalization constant (partition function):
\[ Z = \sum_\mathbf{x}\prod_{c\in cliques(G)}\phi_c(\mathbf{X}_c) \]
- For example, using maximal cliques for a graph over \(A,B,C,D\) whose maximal cliques are \(\{A,B,D\}\) and \(\{B,C,D\}\):
\[ P(A,B,C,D) = \frac{1}{Z}\phi_1(A,B,D)\phi_2(B,C,D)\]
- However, in practice it is common to use smaller cliques such as pairwise factors
- Smaller cliques result in faster computations, however loses representation power
- For example, in a fully connected graph over \(n\) variables with domain size \(d\), a maximal-clique factor has \(O(d^n)\) parameters, while a pairwise parameterisation has only \(O(n^2d^2)\)
\[ P(A,B,C,D) = \frac{1}{Z}\phi_1(A,B)\phi_2(B,C)\phi_3(C,D)\phi_4(D,A)\phi_5(D,B)\]
- Since multiple Gibbs distributions can induce the same undirected graph, the factorization cannot be read from the graph alone.
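- As a quick numerical sketch of the pairwise Gibbs distribution above (hypothetical factor values over binary variables; this is the brute-force computation of \(Z\)):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical non-negative pairwise factors for the example above.
phi1 = rng.random((2, 2))   # phi1(A, B)
phi2 = rng.random((2, 2))   # phi2(B, C)
phi3 = rng.random((2, 2))   # phi3(C, D)
phi4 = rng.random((2, 2))   # phi4(D, A)
phi5 = rng.random((2, 2))   # phi5(D, B)

# Unnormalized joint and the partition function Z.
p_tilde = np.einsum('ab,bc,cd,da,db->abcd', phi1, phi2, phi3, phi4, phi5)
Z = p_tilde.sum()
P = p_tilde / Z             # P(A, B, C, D), sums to 1
```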
Factor Graphs
- A factor graph is a graph containing both random variables and factor nodes, with each factor connected to the variables it is defined over
- Removes the ambiguity of reading the factorization of a Gibbs Distribution from the graph.
- For example, a factor graph over \(X_1,\ldots,X_5\) with factor nodes over \(\{X_1,X_2,X_3\}\), \(\{X_2,X_3,X_4\}\) and \(\{X_3,X_5\}\) specifies the factorisation:
\[ P(X_1,X_2,X_3,X_4,X_5) = \frac{1}{Z}\phi_1(X_1,X_2,X_3)\phi_2(X_2,X_3,X_4)\phi_3(X_3,X_5)\]
Energy Functions
- Joint probability in a Markov Network is frequently expressed in terms of energy functions
- Maximising probability \(P(\mathbf{X})\) is equivalent to minimising energy \(E(\mathbf{X})\)
- Factors written with respect to energy are called potentials
- Functions:
\[ P(\mathbf{X}) = \frac{1}{Z}\exp(-E(\mathbf{X}))\]
\[ E(\mathbf{X}) = \sum_{c\in Cliques(G)}\psi_c(\mathbf{X}_c)\]
\[ \psi_c(\mathbf{X}_c) = -\log\phi_c(\mathbf{X}_c)\]
- Can therefore rewrite the probability function as:
\[ P(\mathbf{X}) = \frac{1}{Z}\exp\left(-\sum_{c\in Cliques(G)}\psi_c(\mathbf{X}_c)\right)\]
\[ P(\mathbf{X}) = \frac{1}{Z}\prod_{c\in Cliques(G)}\exp(-\psi_c(\mathbf{X}_c))\]
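- As a quick numerical check of this identity (hypothetical factor values, not from the notes):

```python
import numpy as np

phi = np.array([2.0, 0.5, 1.5])      # hypothetical clique factor values
psi = -np.log(phi)                   # corresponding potentials
# product of factors equals exp(-total energy)
print(np.isclose(np.prod(phi), np.exp(-psi.sum())))  # True
```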
Pairwise Markov Networks
- A common subclass of Markov networks
- All factors are over single variables or pairs of variables
- Node potentials: \(\{\psi(X_i):i=1,...,n\}\)
- Edge potentials: \(\{\psi(X_i,X_j):(X_i,X_j)\in \mathbf{E}\}\)
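- The total energy then decomposes into node and edge terms (stated here for completeness, using the edge set \(\mathbf{E}\)):
\[ E(\mathbf{X}) = \sum_{i=1}^{n}\psi(X_i) + \sum_{(X_i,X_j)\in\mathbf{E}}\psi(X_i,X_j)\]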
- Common application is noise removal from binary images:
- Noisy image of pixel values, \(Y_{i,j}\)
- Noise-free image of pixel values, \(X_{i,j}\)
- Markov net with:
- \((X_{i,j},X_{i',j'})\) potentials representing correlations between neighbouring pixels
- \((X_{i,j},Y_{i,j})\) potentials describing correlations between the same pixel in the noise-free and noisy images
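- A common concrete choice (assuming pixel values in \(\{-1,+1\}\); the weights \(\beta,\eta>0\) are illustrative, not from the notes) is an Ising-style energy, where \(\sim\) denotes neighbouring pixels; it rewards smoothness between neighbouring pixels and agreement between each pixel and its noisy observation:
\[ E(\mathbf{X},\mathbf{Y}) = -\beta\sum_{(i,j)\sim(i',j')}X_{i,j}X_{i',j'} - \eta\sum_{i,j}X_{i,j}Y_{i,j}\]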
Stochastic Search
- Want to find the MAP or MPE assignment of the Markov network.
- In the case of binary image smoothing, there is a very large number of possible assignments (\(2^{\#\text{pixels}}\))
- Therefore, finding the assignment of minimal energy is usually posed as a stochastic search
- Finds a local-minimum-energy instantiation; it is very rare to find the global minimum
- Algorithm: start from an initial assignment and repeatedly re-assign individual variables based on the resulting change in energy (a sketch is given after the list of variations below)
- This algorithm has 3 main variations
- Iterative Conditional Modes (ICM): Always select the assignment of minimum energy
- Metropolis: With a fixed probability, \(p\), it selects an assignment with higher energy
- Simulated Annealing (SA): With variable probability \(P(T)\), it selects an assignment with higher energy. \(T\) is a parameter known as temperature. The probability of selecting a value with higher energy is determined by the expression \(P(T)=e^{-\delta E/T}\), where \(\delta E\) is the energy difference. The value of \(T\) is reduced with each iteration.
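- A minimal sketch of this search for the binary-image example (assuming the Ising-style energy above; `beta`, `eta` and the cooling schedule are illustrative choices). With `T0 = 0` it behaves like ICM; with `T0 > 0` and a decreasing temperature it behaves like simulated annealing:

```python
import numpy as np

rng = np.random.default_rng(0)

def local_energy(x, y, i, j, value, beta=2.0, eta=1.0):
    """Energy terms involving pixel (i, j) if it were set to `value`.
    Assumes x, y hold values in {-1, +1}; beta/eta are illustrative weights."""
    h, w = x.shape
    e = -eta * value * y[i, j]                    # agreement with the noisy pixel
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < h and 0 <= nj < w:
            e += -beta * value * x[ni, nj]        # smoothness with neighbours
    return e

def denoise(y, iters=10, T0=0.0):
    x = y.copy()
    for t in range(iters):
        T = T0 / (t + 1) if T0 > 0 else 0.0       # cooling schedule (SA) or T = 0 (ICM)
        for i in range(x.shape[0]):
            for j in range(x.shape[1]):
                # Energy change if pixel (i, j) is flipped.
                d_e = local_energy(x, y, i, j, -x[i, j]) - local_energy(x, y, i, j, x[i, j])
                if d_e < 0 or (T > 0 and rng.random() < np.exp(-d_e / T)):
                    x[i, j] = -x[i, j]            # accept the flip
    return x
```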
Independence
Local
- Absence of edges implies independence
- Another local property of independence is the Markov Blanket
- In the undirected graph case, the Markov blanket is equal to a node’s neighborhood.
Global
- Global interpretation of independence uses the idea of separation
- Let \(\textbf{X,Y,Z}\) be disjoint sets of nodes in an undirected graph \(G\). \(\textbf{X}\) and \(\textbf{Y}\) are separated by \(\textbf{Z}\), \(sep_G(\textbf{X,Y,Z})\), iff every path between a node in \(\textbf{X}\) and a node in \(\textbf{Y}\) is blocked by \(\textbf{Z}\)
- A path is blocked by \(\textbf{Z}\) iff at least one node on the path is in \(\textbf{Z}\)
- Same idea as d-separation, except there is no special case for convergent structures
- Can efficiently determine separation through the cut-set test
- Two sets \(\mathbf{X}\) and \(\mathbf{Y}\) of variables are separated by a set \(\mathbf{Z}\) iff
- There is no path from any node \(X\in\mathbf{X}\) to any node \(Y\in\mathbf{Y}\) after removing all nodes in \(\mathbf{Z}\)
- \(\mathbf{Z}\) is a cut-set between two parts of the graph
- Like d-separation, the separation test is sound but not complete
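- A minimal sketch of the cut-set test (hypothetical adjacency-list representation): remove \(\mathbf{Z}\) and check reachability with a breadth-first search.

```python
from collections import deque

def separated(adj, X, Y, Z):
    """X and Y are separated by Z iff no path connects them once Z is removed.
    `adj` maps each node to a list of its neighbours (undirected graph)."""
    Z = set(Z)
    frontier = deque(n for n in X if n not in Z)
    visited = set(frontier)
    while frontier:
        node = frontier.popleft()
        if node in Y:
            return False                      # found an unblocked path into Y
        for nb in adj[node]:
            if nb not in Z and nb not in visited:
                visited.add(nb)
                frontier.append(nb)
    return True

# Cycle A - B - C - D - A: {B, D} is a cut-set between A and C, {B} alone is not.
adj = {'A': ['B', 'D'], 'B': ['A', 'C'], 'C': ['B', 'D'], 'D': ['C', 'A']}
print(separated(adj, {'A'}, {'C'}, {'B', 'D'}))  # True
print(separated(adj, {'A'}, {'C'}, {'B'}))       # False
```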
Markov vs Bayesian Networks
- Bayesian and Markov networks are different languages to represent independencies
- They represent different sets of independencies
- Some sets of independencies that one can model cannot be modelled by the other
- In several circumstances, we need to find a Markov network that is an I-map for a Bayesian network
- This is achieved through moralisation
- Connecting (‘marrying’) the parents of each child node and dropping the edge directions (see the sketch below)
- Doing this loses the marginal independence of the parents
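- A minimal sketch of moralisation (hypothetical dictionary representation of the DAG):

```python
import itertools

def moralise(parents):
    """Moral (undirected) graph of a DAG given as {node: [parents]}.
    Returns the undirected edges as a set of frozensets."""
    edges = set()
    for child, pa in parents.items():
        # keep every original edge, dropping its direction
        edges.update(frozenset((p, child)) for p in pa)
        # "marry" the parents: connect every pair of parents of the same child
        edges.update(frozenset(pair) for pair in itertools.combinations(pa, 2))
    return edges

# v-structure A -> C <- B: moralisation adds the extra edge A - B.
print(moralise({'A': [], 'B': [], 'C': ['A', 'B']}))
```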
Variable Elimination
- Same as before, using the factored distribution according to the Gibbs Distribution.
- However, the results have to be normalised at the end, as the factors do not necessarily store probabilities
Probability VE
\[ P(A,B) = \sum_C\sum_DP(A,B,C,D) \]
\[ = \frac{1}{Z}\sum_C\sum_D\phi_1(A,B)\phi_2(B,C)\phi_3(C,D)\phi_4(D,A) \] \[ \propto\sum_C\sum_D\phi_1(A,B)\phi_2(B,C)\phi_3(C,D)\phi_4(D,A) \]
\[ = \phi_1(A,B)\sum_C\phi_2(B,C)\sum_D\phi_3(C,D)\phi_4(D,A) \]
- Normalize results \(\tau(A,B)\) to get \(P(A,B)\)
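- A minimal numerical sketch of this elimination (hypothetical binary factors; array axes follow the variable order in each factor's argument list):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical non-negative pairwise factors over binary variables.
phi1 = rng.random((2, 2))   # phi1(A, B)
phi2 = rng.random((2, 2))   # phi2(B, C)
phi3 = rng.random((2, 2))   # phi3(C, D)
phi4 = rng.random((2, 2))   # phi4(D, A)

# Eliminate D first, then C, mirroring the derivation above.
tau_CA = np.einsum('cd,da->ca', phi3, phi4)       # sum over D
tau_BA = np.einsum('bc,ca->ba', phi2, tau_CA)     # sum over C
tau_AB = phi1 * tau_BA.T                          # multiply in phi1(A, B)

P_AB = tau_AB / tau_AB.sum()                      # normalize to get P(A, B)

# Sanity check against brute-force summation of the full joint.
joint = np.einsum('ab,bc,cd,da->abcd', phi1, phi2, phi3, phi4)
print(np.allclose(P_AB, joint.sum(axis=(2, 3)) / joint.sum()))  # True
```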
Energy VE
\[ P(A,B) = \sum_C\sum_DP(A,B,C,D) \] \[ \propto \sum_C\sum_D\exp(-(\psi_1(A,B)+\psi_2(B,C)+\psi_3(C,D)+\psi_4(D,A)))\]
- Need to separate out the exponentials before applying the elimination order:
\[ = \sum_C\sum_D\exp(-\psi_1(A,B))\exp(-\psi_2(B,C))\exp(-\psi_3(C,D))\exp(-\psi_4(D,A))\] \[ = \exp(-\psi_1(A,B))\sum_C\exp(-\psi_2(B,C))\sum_D\exp(-\psi_3(C,D))\exp(-\psi_4(D,A)) \]
- Normalize results \(\tau(A,B)\) to get \(P(A,B)\)
Conditional Random Fields
- Connects a sequence of classifiers together through dependencies
- Discriminative model, i.e. it directly models \(P(\mathbf{Y|x})\)
- Used for sequences, such as words, where there are dependencies on the preceding/following outcomes (context)
- For example, in the word ‘Quest’, a ‘u’ in isolation may look like a ‘v’ (similar probability), but it is more likely a ‘u’ given that it follows a ‘q’
- The outputs of the classifiers that form part of the field are treated as factors
- \(\phi_i(Y_i,x_i)\) is the score of the classifier
- Pairwise factors are added between neighbouring outputs:
- \(\phi(Y_i,Y_{i+1})\) is a measure dependence between two classifier scores
- This type of model is known as a linear chain CRF
- Undirected version of HMM
- Can calculate the probability through the Gibbs distribution:
\[ P(\mathbf{Y|x}) = \frac{1}{Z(\mathbf{x})}\phi_1(Y_1,x_1)\prod_{i=2}^{n}\phi_i(Y_i,x_i)\phi(Y_{i-1},Y_i)\]
- MPE or MAP query is often used to find the most probable instantiation.
- This can be found efficiently through the Viterbi algorithm (see the sketch below)
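- A minimal sketch of the Viterbi recursion for the linear chain CRF above (assuming the unary scores \(\phi_i(Y_i,x_i)\) and pairwise scores \(\phi(Y_{i-1},Y_i)\) are supplied as log-space numpy arrays; the names are illustrative):

```python
import numpy as np

def viterbi(unary, pairwise):
    """Most probable label sequence for a linear chain CRF.
    unary: (n, k) array of log phi_i(Y_i, x_i) for each position and label.
    pairwise: (k, k) array of log phi(Y_{i-1}, Y_i)."""
    n, k = unary.shape
    score = unary[0].copy()                 # best log-score ending in each label
    backptr = np.zeros((n, k), dtype=int)
    for i in range(1, n):
        cand = score[:, None] + pairwise + unary[i][None, :]   # (prev, cur)
        backptr[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]            # trace back the best path
    for i in range(n - 1, 0, -1):
        path.append(int(backptr[i, path[-1]]))
    return path[::-1]
```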
Generative Models and Synthetic Data
- Can generate data from the nodes in topological order based on the probabilities within the CPTs.
- Topological order ensures that each parent is sampled before the children that depend on it.
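- A minimal sketch of this ancestral sampling for a two-node network \(A \rightarrow B\) (hypothetical CPT values):

```python
import numpy as np

rng = np.random.default_rng(0)

P_A_true = 0.3                              # P(A = true)
P_B_true_given_A = {True: 0.9, False: 0.2}  # P(B = true | A)

def sample():
    # Sample in topological order: A first, then B given the sampled value of A.
    a = bool(rng.random() < P_A_true)
    b = bool(rng.random() < P_B_true_given_A[a])
    return a, b

data = [sample() for _ in range(1000)]      # synthetic data set
```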