Markov Networks
- Markov networks are undirected graphical models
- Often used to model symmetric dependencies where each state can be influenced by the state of its neighbours
- A variable is independent of all other variables given its neighbours
\[ X_1\perp X_3,X_4|X_2\]
- Markov networks encode independence assumptions similar to Bayesian networks.
- Factors are generalizations of CPTs; their values need not lie between 0 and 1.
- Can convert to probabilities by computing the (unnormalized) joint distribution and dividing by a normalization constant.
- This requires less knowledge to specify, since we only consider local dependencies
- However, this makes learning from data harder, as the factor values do not correspond directly to normalized probabilities.
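- A minimal sketch of this conversion (hypothetical factor values over binary variables):

```python
import itertools

# Hypothetical pairwise factors; values need not lie in [0, 1],
# they only express relative compatibility of assignments.
phi_AB = {(0, 0): 30.0, (0, 1): 5.0, (1, 0): 1.0, (1, 1): 10.0}
phi_BC = {(0, 0): 100.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 100.0}

# Unnormalized joint, then divide by the normalization constant Z.
unnorm = {x: phi_AB[x[:2]] * phi_BC[x[1:]]
          for x in itertools.product([0, 1], repeat=3)}
Z = sum(unnorm.values())
joint = {x: v / Z for x, v in unnorm.items()}   # proper probabilities, sum to 1
```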
Random Field
- A random field \(\mathbf{X}\) is a set of random variables
- It is common that each variable \(X_i\) is associated with a site.
- Set \(\mathbf{S}\) indexes a set of \(n\) sites
- Sites in \(\mathbf{S}\) are related to one another via a neighborhood system
- Neighboring relationship is mutual: \(i\in\mathbf{N}_{i'}\) iff \(i'\in\mathbf{N}_{i}\)
- This symmetry is natural for an undirected graph.
- A random field \(\mathbf{X}\) is a Markov Random Field/Network on \(\mathbf{S}\) iff:
- Positivity: \(P(X_1=x_1,...,X_n=x_n) > 0,\forall\mathbf{x\in X}\)
- All instantiations of the variables in \(\mathbf{X}\) have a strictly positive probability
- Markovianity: \(P(X_i|\mathbf{X}_{\mathbf{S}\backslash\{i\}})=P(X_i|\mathbf{X_{N_i}})\)
- A variable is independent of all variables outside its neighborhood, given its neighborhood.
- Graphically, Markov Networks are undirected graphical models
- \(G = (\mathbf{V,E})\), where \(\mathbf{V}\) consists of a set of random variables and \(\mathbf{E}\) is a set of undirected edges
Gibbs Distribution
- When the positivity condition is satisfied, the joint probability distribution is uniquely determined by the Gibbs Distribution.
- Allows factorisation of full joint distribution into smaller factors created by ‘cliques’
- Cliques are fully connected subgraphs
\[ P(\mathbf{X}) = \frac{1}{Z}\underbrace{\prod_{c\in cliques(G)}\phi_c(\mathbf{X}_c)}_{\tilde P(\mathbf{X})}\]
- \(Z\) is the normalization constant (partition function):
\[ Z = \sum_\mathbf{x}\prod_{c\in cliques(G)}\phi_c(\mathbf{X}_c) \]
- For example, using maximal cliques for a graph over \(A,B,C,D\) whose maximal cliques are \(\{A,B,D\}\) and \(\{B,C,D\}\):
\[ P(A,B,C,D) = \frac{1}{Z}\phi_1(A,B,D)\phi_2(B,C,D)\]
- However, in practice it is common to use smaller cliques such as pairwise factors
- Smaller cliques result in faster computations, however loses representation power
- For example, in a fully connected graph over \(n\) variables with domain size \(d\), a maximal-clique factor has \(O(d^n)\) parameters, while a pairwise parameterisation has only \(O(n^2d^2)\)
\[ P(A,B,C,D) = \frac{1}{Z}\phi_1(A,B)\phi_2(B,C)\phi_3(C,D)\phi_4(D,A)\phi_5(D,B)\]
- Since multiple Gibbs distributions can induce the same undirected graph, the factorization cannot be read from the graph alone.
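- As a quick numerical sketch of the pairwise Gibbs distribution above (hypothetical factor values over binary variables; this is the brute-force computation of \(Z\)):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical non-negative pairwise factors for the example above.
phi1 = rng.random((2, 2))   # phi1(A, B)
phi2 = rng.random((2, 2))   # phi2(B, C)
phi3 = rng.random((2, 2))   # phi3(C, D)
phi4 = rng.random((2, 2))   # phi4(D, A)
phi5 = rng.random((2, 2))   # phi5(D, B)

# Unnormalized joint and the partition function Z.
p_tilde = np.einsum('ab,bc,cd,da,db->abcd', phi1, phi2, phi3, phi4, phi5)
Z = p_tilde.sum()
P = p_tilde / Z             # P(A, B, C, D), sums to 1
```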
Factor Graphs
- A factor graph is a graph containing both random variables and factor nodes, with each factor connected to the variables it is defined over
- Removes the ambiguity of reading the factorization of a Gibbs Distribution from the graph.
- For example, a factor graph over \(X_1,\ldots,X_5\) with factor nodes over \(\{X_1,X_2,X_3\}\), \(\{X_2,X_3,X_4\}\) and \(\{X_3,X_5\}\) specifies the factorisation:
\[ P(X_1,X_2,X_3,X_4,X_5) = \frac{1}{Z}\phi_1(X_1,X_2,X_3)\phi_2(X_2,X_3,X_4)\phi_3(X_3,X_5)\]
Energy Functions
- Joint probability in a Markov Network is frequently expressed in terms of energy functions
- Maximising probability \(P(\mathbf{X})\) is equivalent to minimising energy \(E(\mathbf{X})\)
- Factors written with respect to energy are called potentials
- Functions:
\[ P(\mathbf{X}) = \frac{1}{Z}\exp(-E(\mathbf{X}))\]
\[ E(\mathbf{X}) = \sum_{c\in Cliques(G)}\psi_c(\mathbf{X}_c)\]
\[ \psi_c(\mathbf{X}_c) = -\log\phi_c(\mathbf{X}_c)\]
- Can therefore rewrite the probability function as:
\[ P(\mathbf{X}) = \frac{1}{Z}\exp\left(-\sum_{c\in Cliques(G)}\psi_c(\mathbf{X}_c)\right)\]
\[ P(\mathbf{X}) = \frac{1}{Z}\prod_{c\in Cliques(G)}\exp(-\psi_c(\mathbf{X}_c))\]
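- As a quick numerical check of this identity (hypothetical factor values, not from the notes):

```python
import numpy as np

phi = np.array([2.0, 0.5, 1.5])      # hypothetical clique factor values
psi = -np.log(phi)                   # corresponding potentials
# product of factors equals exp(-total energy)
print(np.isclose(np.prod(phi), np.exp(-psi.sum())))  # True
```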
Pairwise Markov Networks
- A common subclass of Markov networks
- All factors are over single variables or pairs of variables
- Node potentials: \(\{\psi(X_i):i=1,...,n\}\)
- Edge potentials: \(\{\psi(X_i,X_j):(X_i,X_j)\in \mathbf{E}\}\)
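- The total energy then decomposes into node and edge terms (stated here for completeness, using the edge set \(\mathbf{E}\)):
\[ E(\mathbf{X}) = \sum_{i=1}^{n}\psi(X_i) + \sum_{(X_i,X_j)\in\mathbf{E}}\psi(X_i,X_j)\]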
- Common application is noise removal from binary images:
- Noisy image of pixel values, \(Y_{i,j}\)
- Noise-free image of pixel values, \(X_{i,j}\)
- Markov net with:
- \((X_{i,j},X_{i',j'})\) potentials representing correlations between neighbouring pixels
- \((X_{i,j},Y_{i,j})\) potentials describing correlations between the same pixel in the noise-free and noisy images
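- A common concrete choice (assuming pixel values in \(\{-1,+1\}\); the weights \(\beta,\eta>0\) are illustrative, not from the notes) is an Ising-style energy, where \(\sim\) denotes neighbouring pixels; it rewards smoothness between neighbouring pixels and agreement between each pixel and its noisy observation:
\[ E(\mathbf{X},\mathbf{Y}) = -\beta\sum_{(i,j)\sim(i',j')}X_{i,j}X_{i',j'} - \eta\sum_{i,j}X_{i,j}Y_{i,j}\]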
Stochastic Search
- Want to find the MAP or MPE assignment of the Markov network.
- In the case of binary image smoothing, there is a very large number of possible assignments (\(2^{\#\text{pixels}}\))
- Therefore, finding the assignment of minimal energy is usually posed as a stochastic search
- Finds a local-minimum-energy instantiation; it is very rare to find the global minimum
- Algorithm: start from an initial assignment and repeatedly re-assign individual variables based on the resulting change in energy (a sketch is given after the list of variations below)
- This algorithm has 3 main variations
- Iterative Conditional Modes (ICM): Always select the assignment of minimum energy
- Metropolis: With a fixed probability, \(p\), it selects an assignment with higher energy
- Simulated Annealing (SA): With variable probability \(P(T)\), it selects an assignment with higher energy. \(T\) is a parameter known as temperature. The probability of selecting a value with higher energy is determined by the expression \(P(T)=e^{-\delta E/T}\), where \(\delta E\) is the energy difference. The value of \(T\) is reduced with each iteration.
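- A minimal sketch of this search for the binary-image example (assuming the Ising-style energy above; `beta`, `eta` and the cooling schedule are illustrative choices). With `T0 = 0` it behaves like ICM; with `T0 > 0` and a decreasing temperature it behaves like simulated annealing:

```python
import numpy as np

rng = np.random.default_rng(0)

def local_energy(x, y, i, j, value, beta=2.0, eta=1.0):
    """Energy terms involving pixel (i, j) if it were set to `value`.
    Assumes x, y hold values in {-1, +1}; beta/eta are illustrative weights."""
    h, w = x.shape
    e = -eta * value * y[i, j]                    # agreement with the noisy pixel
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < h and 0 <= nj < w:
            e += -beta * value * x[ni, nj]        # smoothness with neighbours
    return e

def denoise(y, iters=10, T0=0.0):
    x = y.copy()
    for t in range(iters):
        T = T0 / (t + 1) if T0 > 0 else 0.0       # cooling schedule (SA) or T = 0 (ICM)
        for i in range(x.shape[0]):
            for j in range(x.shape[1]):
                # Energy change if pixel (i, j) is flipped.
                d_e = local_energy(x, y, i, j, -x[i, j]) - local_energy(x, y, i, j, x[i, j])
                if d_e < 0 or (T > 0 and rng.random() < np.exp(-d_e / T)):
                    x[i, j] = -x[i, j]            # accept the flip
    return x
```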
Independence
Local
- Absence of edges implies independence
- Another local property of independence is the Markov Blanket
- In the undirected graph case, the Markov blanket is equal to a node’s neighborhood.
Global
- Global interpretation of independence uses the idea of separation
- Let \(\textbf{X,Y,Z}\) be disjoint sets of nodes in an undirected graph \(G\). \(\textbf{X}\) and \(\textbf{Y}\) are separated by \(\textbf{Z}\), \(sep_G(\textbf{X,Y,Z})\), iff every path between a node in \(\textbf{X}\) and a node in \(\textbf{Y}\) is blocked by \(\textbf{Z}\)
- A path is blocked by \(\textbf{Z}\) iff at least one node on the path is in \(\textbf{Z}\)
- Same idea as d-separation, except there is no special case for convergent structures
- Can efficiently determine separation through the cut-set test
- Two sets \(\mathbf{X}\) and \(\mathbf{Y}\) of variables are separated by a set \(\mathbf{Z}\) iff
- There is no path from any node \(X\in\mathbf{X}\) to any node \(Y\in\mathbf{Y}\) after removing all nodes in \(\mathbf{Z}\)
- \(\mathbf{Z}\) is a cut-set between two parts of the graph
- Like d-separation, the separation test is sound but not complete
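- A minimal sketch of the cut-set test (hypothetical adjacency-list representation): remove \(\mathbf{Z}\) and check reachability with a breadth-first search.

```python
from collections import deque

def separated(adj, X, Y, Z):
    """X and Y are separated by Z iff no path connects them once Z is removed.
    `adj` maps each node to a list of its neighbours (undirected graph)."""
    Z = set(Z)
    frontier = deque(n for n in X if n not in Z)
    visited = set(frontier)
    while frontier:
        node = frontier.popleft()
        if node in Y:
            return False                      # found an unblocked path into Y
        for nb in adj[node]:
            if nb not in Z and nb not in visited:
                visited.add(nb)
                frontier.append(nb)
    return True

# Cycle A - B - C - D - A: {B, D} is a cut-set between A and C, {B} alone is not.
adj = {'A': ['B', 'D'], 'B': ['A', 'C'], 'C': ['B', 'D'], 'D': ['C', 'A']}
print(separated(adj, {'A'}, {'C'}, {'B', 'D'}))  # True
print(separated(adj, {'A'}, {'C'}, {'B'}))       # False
```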
Markov vs Bayesian Networks
- Bayesian and Markov networks are different languages to represent independencies
- They represent different sets of independencies
- Some sets of independencies that one can model cannot be modelled by the other
- In several circumstances, we need to find a Markov network that is an I-map for a Bayesian network
- This is achieved through moralisation
- Connecting (‘marrying’) the parents of each child node and dropping the edge directions (see the sketch below)
- Doing this loses the marginal independence of the parents
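- A minimal sketch of moralisation (hypothetical dictionary representation of the DAG):

```python
import itertools

def moralise(parents):
    """Moral (undirected) graph of a DAG given as {node: [parents]}.
    Returns the undirected edges as a set of frozensets."""
    edges = set()
    for child, pa in parents.items():
        # keep every original edge, dropping its direction
        edges.update(frozenset((p, child)) for p in pa)
        # "marry" the parents: connect every pair of parents of the same child
        edges.update(frozenset(pair) for pair in itertools.combinations(pa, 2))
    return edges

# v-structure A -> C <- B: moralisation adds the extra edge A - B.
print(moralise({'A': [], 'B': [], 'C': ['A', 'B']}))
```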
Variable Elimination
- Same as before, using the factored distribution according to the Gibbs Distribution.
- However, the results have to be normalised at the end, as the factors do not necessarily store probabilities
Probability VE
\[ P(A,B) = \sum_C\sum_DP(A,B,C,D) \]
\[ = \frac{1}{Z}\sum_C\sum_D\phi_1(A,B)\phi_2(B,C)\phi_3(C,D)\phi_4(D,A) \] \[ \propto\sum_C\sum_D\phi_1(A,B)\phi_2(B,C)\phi_3(C,D)\phi_4(D,A) \]
\[ = \phi_1(A,B)\sum_C\phi_2(B,C)\sum_D\phi_3(C,D)\phi_4(D,A) \]
- Normalize results \(\tau(A,B)\) to get \(P(A,B)\)
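- A minimal numerical sketch of this elimination (hypothetical binary factors; array axes follow the variable order in each factor's argument list):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical non-negative pairwise factors over binary variables.
phi1 = rng.random((2, 2))   # phi1(A, B)
phi2 = rng.random((2, 2))   # phi2(B, C)
phi3 = rng.random((2, 2))   # phi3(C, D)
phi4 = rng.random((2, 2))   # phi4(D, A)

# Eliminate D first, then C, mirroring the derivation above.
tau_CA = np.einsum('cd,da->ca', phi3, phi4)       # sum over D
tau_BA = np.einsum('bc,ca->ba', phi2, tau_CA)     # sum over C
tau_AB = phi1 * tau_BA.T                          # multiply in phi1(A, B)

P_AB = tau_AB / tau_AB.sum()                      # normalize to get P(A, B)

# Sanity check against brute-force summation of the full joint.
joint = np.einsum('ab,bc,cd,da->abcd', phi1, phi2, phi3, phi4)
print(np.allclose(P_AB, joint.sum(axis=(2, 3)) / joint.sum()))  # True
```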
Energy VE
\[ P(A,B) = \sum_C\sum_DP(A,B,C,D) \] \[ \propto \sum_C\sum_D\exp(-(\psi_1(A,B)+\psi_2(B,C)+\psi_3(C,D)+\psi_4(D,A)))\]
- Need to separate out the exponentials before applying the elimination order:
\[ = \sum_C\sum_D\exp(-\psi_1(A,B))\exp(-\psi_2(B,C))\exp(-\psi_3(C,D))\exp(-\psi_4(D,A))\] \[ = \exp(-\psi_1(A,B))\sum_C\exp(-\psi_2(B,C))\sum_D\exp(-\psi_3(C,D))\exp(-\psi_4(D,A)) \]
- Normalize results \(\tau(A,B)\) to get \(P(A,B)\)
Conditional Random Fields
- Connects a sequence of classifiers together through dependencies
- Discriminative model, i.e. it directly models \(P(\mathbf{Y|x})\)
- Used for sequences, such as words, where there are dependencies on the preceding/following outcomes (context)
- For example, in the word ‘Quest’, a ‘u’ in isolation may look like a ‘v’ (similar probability), but it is more likely a ‘u’ given that it follows a ‘q’
- The outputs of the classifiers that form part of the field are treated as factors
- \(\phi_i(Y_i,x_i)\) is the score of the classifier
- Pairwise factors are added between neighbouring outputs:
- \(\phi(Y_i,Y_{i+1})\) is a measure dependence between two classifier scores
- This type of model is known as a linear chain CRF
- Undirected version of HMM
- Can calculate the probability through the Gibbs distribution:
\[ P(\mathbf{Y|x}) = \frac{1}{Z(\mathbf{x})}\phi_1(Y_1,x_1)\prod_{i=2}^{n}\phi_i(Y_i,x_i)\phi(Y_{i-1},Y_i)\]
- MPE or MAP query is often used to find the most probable instantiation.
- This can be found efficiently through the Viterbi algorithm (see the sketch below)
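- A minimal sketch of the Viterbi recursion for the linear chain CRF above (assuming the unary scores \(\phi_i(Y_i,x_i)\) and pairwise scores \(\phi(Y_{i-1},Y_i)\) are supplied as log-space numpy arrays; the names are illustrative):

```python
import numpy as np

def viterbi(unary, pairwise):
    """Most probable label sequence for a linear chain CRF.
    unary: (n, k) array of log phi_i(Y_i, x_i) for each position and label.
    pairwise: (k, k) array of log phi(Y_{i-1}, Y_i)."""
    n, k = unary.shape
    score = unary[0].copy()                 # best log-score ending in each label
    backptr = np.zeros((n, k), dtype=int)
    for i in range(1, n):
        cand = score[:, None] + pairwise + unary[i][None, :]   # (prev, cur)
        backptr[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]            # trace back the best path
    for i in range(n - 1, 0, -1):
        path.append(int(backptr[i, path[-1]]))
    return path[::-1]
```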
Generative Models and Synthetic Data
- Can generate data from the nodes in topological order based on the probabilities within the CPTs.
- Topological order ensures that each parent is sampled before the children that depend on it.
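- A minimal sketch of this ancestral sampling for a two-node network \(A \rightarrow B\) (hypothetical CPT values):

```python
import numpy as np

rng = np.random.default_rng(0)

P_A_true = 0.3                              # P(A = true)
P_B_true_given_A = {True: 0.9, False: 0.2}  # P(B = true | A)

def sample():
    # Sample in topological order: A first, then B given the sampled value of A.
    a = bool(rng.random() < P_A_true)
    b = bool(rng.random() < P_B_true_given_A[a])
    return a, b

data = [sample() for _ in range(1000)]      # synthetic data set
```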