Markov Models

Jake

31/10/2022

Dynamic Bayesian Networks

  • Graphical models with a repeating structure that grows with time
    • Time/space dynamic
  • Utilise the Markov property
    • Future states are independent of past given the current state

Markov Chains

  • Consider a discrete variable \(X\) with states
    • Transitions between states are nondeterministic

  • Prior probabilities \(P(X_1)\) denote the starting distribution
    • Transition probability/dynamics are denoted \(P(X_t|X_{t-1})\)

  • Assuming stationarity, there are \(|X| + |X|^2\) parameters
    • \(|X|\) for the prior \(P(X_1)\) (one entry per state)
    • \(|X|^2\) for the transition probability table (all combos of \(X_{t-1},X_t\))
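  • For example, a hypothetical two-state weather chain \(X \in \{sun, rain\}\) has \(2 + 2^2 = 6\) table entries: two for \(P(X_1)\) and four for \(P(X_t|X_{t-1})\) (only \(1 + 2\) of these are free, since each distribution must sum to 1).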

Properties

  • Stationary property
    • Chains are time-homogeneous
    • Transition probabilities are the same for all values of \(t\)
  • Markov property
    • Future states are independent of the past given the current state

\[ X_{t+1}\perp X_{t-1}|X_t\]

Chain Rule

  • The chain rule for Bayesian networks works here for a sequence of states:

\[ P(X_1,...,X_n) = P(X_1)\prod_{t=2}^nP(X_t|X_{t-1}) \]

Staying in State

  • Probability of staying in a certain state for \(d\) steps:
    • Make \(d-1\) self-transitions, then transition out on the \(d\)th step.

\[ P(S_s^d)=P(s|s)^{d-1}(1-P(s|s))\]

  • Expected time spent in a state:

\[ E[S_s]=\frac{1}{1-P(s|s)}\]
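  • As a quick worked example, assuming a self-transition probability \(P(s|s) = 0.9\):

\[ P(S_s^3) = 0.9^{2}\times 0.1 = 0.081, \qquad E[S_s] = \frac{1}{1-0.9} = 10 \text{ steps}\]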

Mini-Forward Algorithm

  • We can find \(P(X_t)\) for any time \(t\) by simulating the chain with the mini-forward algorithm.
    • \(O(n|X|^2)\)

\[ P(x_t) = \sum_{x_{t-1}}P(x_t,x_{t-1}) = \sum_{x_{t-1}}P(x_t|x_{t-1})P(x_{t-1}) \]

  • Example

\[ P(X_t=sun) = P(X_t=sun|X_{t-1}=sun)P(X_{t-1}=sun) + P(X_t=sun|X_{t-1}=rain)P(X_{t-1}=rain) \]
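  • A minimal sketch of the mini-forward update in Python (the transition matrix and prior below are made-up illustrative values, not taken from the notes):

```python
import numpy as np

# States: 0 = sun, 1 = rain (illustrative values).
# T[i, j] = P(X_t = j | X_{t-1} = i), rows sum to 1.
T = np.array([[0.9, 0.1],
              [0.3, 0.7]])

p = np.array([1.0, 0.0])  # prior P(X_1): certainly sun

# Mini-forward algorithm: push the distribution through the dynamics each step.
for t in range(2, 6):
    p = p @ T             # P(x_t) = sum_{x_{t-1}} P(x_t | x_{t-1}) P(x_{t-1})
    print(f"P(X_{t}) =", p)
```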

Irreducible Chains

  • An MC is irreducible if every state \(x'\) is reachable from every other state \(x\)
    • For every pair of states \((x,x')\), there is a time \(t\) such that \(P(X_t=x'|X_1=x) > 0\)
    • Also known as a regular/ergodic chain
      • The states of an irreducible MC are recurrent, meaning they are guaranteed to be visited an infinite number of times when simulating the chain.

Aperiodic Chains

  • An MC is aperiodic if it is possible to return to any state at any time
    • There exists a \(t\) such that for all states \(x\) and all \(t'\geq t\), \(P(X_{t'}=x|X_1=x)>0\)

Stationary Distribution

  • For most chains, \(P(X_t)\) will converge as \(t\) grows, independently of the initial distribution.
  • Stationary distribution \(\pi\) is the distribution we obtain if the chain converges

\[\pi(X) = \sum_xP(X|x)\pi(x)\]

\[ \pi = \pi T\]

  • Example:

\[ \pi(sun) = P(sun|sun)\pi(sun) + P(sun|rain)\pi(rain)\]

\[ \pi(rain) = P(rain|sun)\pi(sun) + P(rain|rain)\pi(rain)\]

\[ \pi(sun) + \pi(rain) = 1\]

  • The stationary distribution can also be interpreted as the long-run \(\%\) of time spent in each state

  • An irreducible and aperiodic MC converges to a unique stationary distribution

    • Irreducible: We can go from any state to any state
    • Aperiodic: Avoids chains that alternate forever and never settle into a stationary distribution
  • Note that convergence may be possible but very slow
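  • A small numerical sketch of finding \(\pi\) for a two-state chain, either by iterating the mini-forward update or by solving \(\pi = \pi T\) directly (the transition values are assumed for illustration):

```python
import numpy as np

T = np.array([[0.9, 0.1],   # P(X_t | X_{t-1} = sun), illustrative values
              [0.3, 0.7]])  # P(X_t | X_{t-1} = rain)

# Option 1: power iteration -- run the mini-forward update until it settles.
pi = np.array([0.5, 0.5])
for _ in range(1000):
    pi = pi @ T
print(pi)                   # ~[0.75, 0.25]

# Option 2: solve pi = pi T together with the constraint sum(pi) = 1.
A = np.vstack([T.T - np.eye(2), np.ones(2)])
b = np.array([0.0, 0.0, 1.0])
pi_exact, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi_exact)             # [0.75, 0.25]
```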

Hidden Markov Models

  • Extension of Markov chains where states are not directly observable
    • We are able to observe ‘emissions’ that are probabilistically linked to the hidden states
  • HMMs have two main components
    • Underlying Markov chain over states \(X\)
    • Observable outputs (effects of states) at each time step.
      • Often called emissions

  • HMM parameters include:
    • Initial distribution \(P(X_1)\)
    • Transition probabilities \(P(X_t|X_{t-1})\)
    • Emission probabilities \(P(E_t|X_t)\)

Chain Rule

  • Same as usual

\[ P(X_1,E_1,...,X_n,E_n) = P(X_1)P(E_1|X_1)\prod^n_{t=2}P(X_t|X_{t-1})P(E_t|X_t)\]
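  • A small sketch evaluating this joint probability for a concrete state/emission sequence (all parameter values are assumed for illustration):

```python
import numpy as np

# Illustrative HMM: hidden states 0 = sun, 1 = rain; emissions 0 = no umbrella, 1 = umbrella.
prior = np.array([0.5, 0.5])              # P(X_1)
T = np.array([[0.9, 0.1], [0.3, 0.7]])    # P(X_t | X_{t-1})
O = np.array([[0.8, 0.2], [0.1, 0.9]])    # P(E_t | X_t)

states = [0, 0, 1]       # x_1, x_2, x_3
emissions = [0, 1, 1]    # e_1, e_2, e_3

# P(x_1, e_1, ..., x_n, e_n) = P(x_1) P(e_1|x_1) * prod_t P(x_t|x_{t-1}) P(e_t|x_t)
p = prior[states[0]] * O[states[0], emissions[0]]
for t in range(1, len(states)):
    p *= T[states[t - 1], states[t]] * O[states[t], emissions[t]]
print(p)
```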

Independencies

  • State is independent of all past states and evidence given previous state

\[ X_t\perp X_1,...,X_{t-2},E_1,...,E_{t-1}|X_{t-1}\]

  • Evidence is independent of all past evidence and states given the current state

\[ E_t\perp X_1,...,X_{t-1},E_1,...,E_{t-1}|X_t\]

Inference

  • Need to track \(P(X_t)\) over time using evidence.
    • \(B(X_t) = P(X_t|e_1,...,e_t)\) denotes belief of state at \(t\)
    • \(B(X_1)\) is typically a uniform distribution
  • Inference has two main steps
    • Passage of time
    • Observation of evidence

Passage of Time

  • Consider a current state of belief

\[ B(X_t) = P(X_t|e_{1:t}) \]

  • Need to update the belief by one time step
    • This is the belief before we observe the next emission

\[ P(X_{t+1}|e_{1:t}) \]

  • To compute this:

\[ P(X_{t+1}|e_{1:t}) = \sum_{x_t}P(X_{t+1}|x_t)B(x_t)\]
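  • A minimal numpy sketch of this time-elapse update (the transition values and current belief are assumed for illustration):

```python
import numpy as np

T = np.array([[0.9, 0.1],   # P(X_{t+1} | X_t = sun), illustrative values
              [0.3, 0.7]])  # P(X_{t+1} | X_t = rain)

B = np.array([0.6, 0.4])    # current belief B(X_t) = P(X_t | e_{1:t})

# Passage of time: P(X_{t+1} | e_{1:t}) = sum_{x_t} P(X_{t+1} | x_t) B(x_t)
B_prime = B @ T
print(B_prime)              # [0.66, 0.34]
```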

Observing Evidence

  • After the time step we get an emission that can be used to update belief.

\[ B(X_{t+1})=P(X_{t+1}|e_{1:t+1}) \propto P(e_{t+1}|X_{t+1})P(X_{t+1}|e_{1:t})\]

  • Renormalise by dividing by the sum of the unnormalised values:

\[\sum_{x_{t+1}} P(e_{t+1}|x_{t+1})P(x_{t+1}|e_{1:t}) \]
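  • A matching sketch of the observation update (the emission table, observation, and predicted belief are assumed for illustration):

```python
import numpy as np

O = np.array([[0.8, 0.2],           # P(E | X = sun), illustrative values
              [0.1, 0.9]])          # P(E | X = rain)

B_prime = np.array([0.66, 0.34])    # P(X_{t+1} | e_{1:t}) from the time update
e = 1                               # observed emission

# Observation: B(X_{t+1}) is proportional to P(e_{t+1} | X_{t+1}) P(X_{t+1} | e_{1:t})
B = O[:, e] * B_prime
B = B / B.sum()                     # renormalise by the sum of the unnormalised values
print(B)                            # ~[0.30, 0.70]
```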

Forward Algorithm

  • Considering a sequence of emissions, we want to know the state of belief

\[ B(X_t) = P(X_t|e_{1:t}) \]

  • We consider:

\[ P(X_t|e_{1:t})\propto P(X_t,e_{1:t}) = \underbrace{P(e_t|X_t)}_{\text{Observation}}\overbrace{\sum_{x_{t-1}}P(X_t|x_{t-1})P(x_{t-1},e_{1:t-1})}^{\text{Passage of Time}} \]
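  • Putting the two steps together, a sketch of the full forward algorithm (parameters and emissions are assumed for illustration):

```python
import numpy as np

def forward(prior, T, O, emissions):
    """Return B(X_t) = P(X_t | e_{1:t}) after each observed emission."""
    B = prior * O[:, emissions[0]]   # incorporate the first emission into the prior
    B = B / B.sum()
    beliefs = [B]
    for e in emissions[1:]:
        B = (B @ T) * O[:, e]        # passage of time, then observation
        B = B / B.sum()              # renormalise
        beliefs.append(B)
    return beliefs

# Illustrative parameters: states sun/rain, emissions no-umbrella/umbrella.
prior = np.array([0.5, 0.5])
T = np.array([[0.9, 0.1], [0.3, 0.7]])
O = np.array([[0.8, 0.2], [0.1, 0.9]])

for t, B in enumerate(forward(prior, T, O, [1, 1, 0]), start=1):
    print(f"B(X_{t}) =", B)
```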

Most Probable Explanation

  • We want to compute the most probable sequence of states conditioned on a sequence of evidence.

\[ \arg\max_{x_{1:t}}P(x_{1:t}|e_{1:t}) \]

State Trellis

  • A state trellis is a graph that illustrates state transition over time
    • Each arc represents one time step together with an evidence observation, with weight

\[ P(x_t|x_{t-1})P(e_t|x_t)\]

  • Consider a path as a sequence of states
    • The product of the weights along a path gives the probability of that sequence together with the evidence
    • This links to the forward algorithm, which sums over all paths ending in the same state
    • MPE corresponds to finding the path with the highest probability

Viterbi Algorithm

  • Viterbi Algorithm computes the maximum of path probabilities that lead to the same final state

\[ m[x_t] = \max_{x_{1:t-1}}P(x_{1:t-1},x_t|e_{1:t}) = P(e_t|x_t)\max_{x_{t-1}}P(x_t|x_{t-1})m[x_{t-1}]\]

  • Algorithm:
    • \(O(n|X|^2)\) complexity

  • Example:
    • Consider the max probability path to each possible state. We backtrack from the maximal probability.

  • This algorithm provides the probability of the most likely sequence but not the sequence itself.
    • Can keep an additional structure pointing to parent of each node
    • Backtrack computation
      • Divide \(m[x_t]\) by the evidence probability \(P(e_t|x_t)\).
      • For each state \(x_{t-1}\), divide by \(P(x_t|x_{t-1})\).
      • See which value of \(x_{t-1}\) has \(m[x_{t-1}]\) matching the result.
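  • A compact sketch of Viterbi using the back-pointer approach mentioned above, so the most likely sequence can be recovered directly by backtracking (parameter values are assumed for illustration):

```python
import numpy as np

def viterbi(prior, T, O, emissions):
    """Return the most probable state sequence and its (unnormalised) probability."""
    n, S = len(emissions), len(prior)
    m = prior * O[:, emissions[0]]       # m[x_1] = P(x_1) P(e_1 | x_1)
    back = np.zeros((n, S), dtype=int)
    for t in range(1, n):
        scores = T * m[:, None]          # scores[i, j] = P(x_t=j | x_{t-1}=i) * m[i]
        back[t] = scores.argmax(axis=0)  # best predecessor for each state
        m = O[:, emissions[t]] * scores.max(axis=0)
    path = [int(m.argmax())]             # most probable final state
    for t in range(n - 1, 0, -1):        # follow back-pointers to recover the path
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(m.max())

prior = np.array([0.5, 0.5])
T = np.array([[0.9, 0.1], [0.3, 0.7]])
O = np.array([[0.8, 0.2], [0.1, 0.9]])
print(viterbi(prior, T, O, [1, 1, 0]))   # -> ([1, 1, 0], ...)
```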

Log Probability Modification

  • Log probabilities are often used with Viterbi to avoid vanishing probabilities
    • Path probabilities shrink exponentially with sequence length, leading to numerical underflow

\[ m[x_t] = \log P(e_t|x_t)+\max_{x_{t-1}}\left[\log P(x_t|x_{t-1})+m[x_{t-1}]\right]\]


Particle Filtering

  • Sometimes \(|X|\) is too big for use with exact inference.
    • \(|X|\) may be too big to even store \(B(X)\)
    • E.g. \(X\) is continuous
  • An approximate solution to inference
    • Track samples of \(X\), not all values
    • Samples are called particles
    • Time per step is linear in number of samples
    • In memory, we can store a list of particles rather than a list of states.

  • Representation of \(P(X)\) is now a list of \(N\) particles (samples)
    • Generally \(N \ll |X|\)
    • \(P(x)\) is approximated by the fraction of particles with value \(x\)
      • More particles, higher accuracy
      • Many \(x\) values will therefore have \(P(x) = 0\)

Elapsing Time

  • Each particle is moved by sampling its next position from the transition model

\[ x' = sample(P(X'|x))\]
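  • A minimal sketch of this step, assuming the particles are stored as an array of state indices and an illustrative transition table:

```python
import numpy as np

rng = np.random.default_rng(0)

T = np.array([[0.9, 0.1],               # P(X' | X = sun), illustrative values
              [0.3, 0.7]])              # P(X' | X = rain)

particles = np.array([0, 0, 1, 0, 1])   # N = 5 particles, each a state index

# Elapse time: move each particle by sampling its successor from the dynamics.
particles = np.array([rng.choice(2, p=T[x]) for x in particles])
print(particles)
```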

Observing Evidence

  • Downweight samples based on the evidence
    • The weights don’t sum to 1; their total is \(N\) times an approximation of \(P(e)\)

\[ w(x) = P(e|x) \]

\[B(X)\propto P(e|X)B'(X) \]

  • Rather than tracking weighted samples, we resample
    • Draw \(N\) new particles from the weighted distribution, with replacement
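  • A matching sketch of the weighting and resampling step (the emission table and observation are again assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

O = np.array([[0.8, 0.2],               # P(E | X = sun), illustrative values
              [0.1, 0.9]])              # P(E | X = rain)

particles = np.array([0, 0, 1, 0, 1])   # current particles (state indices)
e = 1                                   # observed emission

# Downweight each particle by the likelihood of the evidence given its state.
w = O[particles, e]                     # w(x) = P(e | x)

# Resample: draw N new particles with replacement, in proportion to the weights.
particles = rng.choice(particles, size=len(particles), replace=True, p=w / w.sum())
print(particles)
```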