Markov Chain Monte Carlo

Jake

11/11/2022

Markov Chain Monte Carlo (MCMC)

  • Given a distribution \(\pi(X), X\in E\) known up to a normalisation constant.
  • Construct Markov Chain with state space \(E\) and stationary distribution \(\pi(X)\)
  • Simulate Markov Chain sample paths until convergence
  • Once converged, chain iterations can be treated as a sample from \(\pi(X)\)

Markov Chain Theory

  • Considering a discrete-time, discrete-state-space Markov chain \(X_t\):
    • Markov property: given the present state \(X_t\), the future state \(y\) is independent of the past

\[ P(X_{t+1}=y|X_t=x_t,...,X_0=x_0) = P(X_{t+1}=y|X_t=x_t)\]

  • N-step transition probability

\[ P^n(x,y) = P(X_n=y|X_0=x) = \sum_{x_1}...\sum_{x_{n-1}}P(x,x_1)P(x_1,x_2)...P(x_{n-1},y)\]
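
As a quick sanity check of the n-step formula, the sketch below (Python; the 3-state transition matrix is an assumption made purely for illustration) computes \(P^n\) as a matrix power and compares it against simulated chains started from \(X_0=0\).

```python
import numpy as np

# Illustrative 3-state transition matrix (rows sum to 1) -- chosen only for this example
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
n = 4

# n-step transition probabilities: P^n(x, y) is the (x, y) entry of the n-th matrix power
P_n = np.linalg.matrix_power(P, n)

# Monte Carlo check: run many independent chains from X_0 = 0 for n steps
rng = np.random.default_rng(0)
states = np.zeros(20_000, dtype=int)
for _ in range(n):
    states = np.array([rng.choice(3, p=P[s]) for s in states])

empirical = np.bincount(states, minlength=3) / len(states)
print("P^n(0, .) :", P_n[0])
print("simulated :", empirical)
```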

  • Irreducible chains
    • Chain can move from anywhere to anywhere else in finite number of time steps

\[ \pi(y)>0\Rightarrow P(X_n=y|X_0=x)>0\text{ for some }n\]

  • Aperiodicity
    • Considering the period \(d_x\) of state \(x\), the greatest common divisor of the set \(\{n\geq 1:P^n(x,x)>0\}\)
      • \(x\) is aperiodic if \(d_x=1\)
      • \(P(x,x)>0\) implies \(d_x=1\)
      • e.g. a two-state chain that deterministically alternates between its states has period \(2\)

Relevant Limit Theorems

  • For an irreducible Markov chain:
  • Ergodic averages converge to their expected values under \(\pi(X)\) (a small numerical check is given at the end of this section)

\[ \frac{1}{N}\sum^N_{t=1}g(X_t)\rightarrow\int g(x)\pi(x)\,dx<\infty,\quad N\rightarrow\infty\]

  • If chain is also aperiodic, then there is convergence in total variation

\[ ||P^n(X_0,.)-\pi(.)||\rightarrow 0,\quad n\rightarrow\infty\]

  • This implies that the Markov chain sample path mimics a random sample from \(\pi\)
  • \(\pi(X)\) is also the limiting distribution of successive chain iterates
    • True regardless of starting value of chain
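
As a small numerical check of the ergodic theorem (the two-state chain and the function \(g(x)=I(x=1)\) are assumptions made purely for illustration), the long-run average below settles near the stationary probability of state \(1\), regardless of the starting state:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative two-state chain; solving pi P = pi gives stationary distribution pi = (0.6, 0.4)
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])

# Single long sample path, deliberately started in state 1
N = 100_000
x = 1
count_state1 = 0
for _ in range(N):
    x = rng.choice(2, p=P[x])
    count_state1 += (x == 1)

# Ergodic average of g(X) = 1{X = 1} should approach its stationary expectation 0.4
print("ergodic average:", count_state1 / N, " stationary probability:", 0.4)
```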

Gibbs Sampler

  • Gibbs sampling allows sampling from a p-dimensional distribution \(\pi(X)\), where \(X=(X^{(1)},...,X^{(p)})\)
    • Often used for posterior distributions.
    • Need the full conditional distributions, including their normalisation constants, so they can be sampled from directly
  • Generate sample from \(\pi(X,Y)\) by sampling in turn from conditional distributions \(\pi(x|y)\) and \(\pi(y|x)\)

Algorithm

    1. Arbitrarily choose \(Y_0\)
    2. For \(t=1,...,T\):
       • Draw \(X_t\sim\pi(x|Y_{t-1})\)
       • Draw \(Y_t\sim\pi(y|X_t)\)
    3. Produces \((X_1,Y_1),(X_2,Y_2),...,(X_T,Y_T)\)
       • Can extend this algorithm to \(p\) dimensions (a minimal code sketch is given after this list)
  • The distributions of \(X_t\) and \(Y_t\) converge to \(\pi(X)\) and \(\pi(Y)\), the true margins of \(\pi(X,Y)\), as \(t\rightarrow\infty\).
  • For large \(t\), \(X_t\) and \(Y_t\) are effectively samples from \(\pi(X)\) and \(\pi(Y)\), and the pair is effectively a sample from \(\pi(X,Y)\)
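
A minimal sketch of the two-stage scheme above, assuming (purely for illustration) a bivariate standard normal target with correlation \(\rho\); both full conditionals are then univariate normals, \(X|Y=y\sim N(\rho y,1-\rho^2)\) and \(Y|X=x\sim N(\rho x,1-\rho^2)\):

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative target: bivariate standard normal with correlation rho
rho = 0.8
cond_sd = np.sqrt(1 - rho**2)    # standard deviation of each full conditional

T = 10_000
x = np.empty(T)
y = np.empty(T)

y_prev = 5.0                     # arbitrary (deliberately poor) starting value Y_0
for t in range(T):
    x[t] = rng.normal(rho * y_prev, cond_sd)   # draw X_t ~ pi(x | Y_{t-1})
    y[t] = rng.normal(rho * x[t], cond_sd)     # draw Y_t ~ pi(y | X_t)
    y_prev = y[t]

burn = 1_000                     # discard a burn-in period
print("sample correlation:", np.corrcoef(x[burn:], y[burn:])[0, 1])   # close to rho
print("sample means      :", x[burn:].mean(), y[burn:].mean())        # close to 0
```

Despite the poor starting value, the post-burn-in draws recover the means and correlation of the target.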

Gibbs Convergence

  • To converge, the full conditional distributions need to determine the joint distribution \(f(\theta)\) uniquely
  • The positivity condition is sufficient for the univariate conditional distributions to uniquely determine the joint distribution
    • \(f(\theta)\) is strictly positive for every \(\theta\) for which each of the marginal densities \(f(\theta_1),...,f(\theta_p)\) is positive
      • i.e. strictly positive \(f(\theta)\) for each \(\theta\) combination

Rates of Convergence

Bounds

  • Theoretical bounds for the total variation distance are available, but they are typically hard to use in practice
  • Geometrically ergodic Markov chain bound:
    • \(0<r<1\)
    • \(M(x)\) is a function that depends on the initialisation

\[ \sup_{A\subset E}|P^n(x,A)-\pi(A)|\leq M(x)r^n\]

  • Uniformly geometrically ergodic chain bound:
    • Does not depend on the initialisation, and is therefore a stronger result

\[ \sup_{A\subset E}|P^n(x,A)-\pi(A)|\leq Mr^n\]

Dependence

  • Rate of convergence depends on degree of dependence between model parameters
    • Dependence slows convergence and induces autocorrelation in the Markov chain
  • If parameters were independent:
    • Immediate convergence
    • Independence between successive chain iterations

Assessing Convergence

Burn in

  • We only want to sample values once the chain has converged
    • Therefore, if convergence is slow, early chain values will not represent \(\pi(X)\)
  • To account for this, the first portion of each chain is often discarded, known as the burn-in period.

Autocorrelation

  • Markov chain iterates \(X^0,X^1,...\) are autocorrelated
    • If autocorrelation is high (i.e. mixing is poor), chain storage may become a problem
  • Can resolve autocorrelation issues by thinning, i.e. recording only every \(k\)th value (see the sketch below)
    • Thinned or non-thinned values can be used for inference.
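
A small sketch of the idea, using a strongly autocorrelated AR(1) sequence as a stand-in for poorly mixing MCMC output (the sequence, and the choice \(k=20\), are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in for poorly mixing MCMC output: a highly autocorrelated AR(1) sequence
phi = 0.95
n = 50_000
chain = np.empty(n)
chain[0] = 0.0
for t in range(1, n):
    chain[t] = phi * chain[t - 1] + rng.normal()

def autocorr(x, lag):
    """Sample autocorrelation of x at the given lag."""
    x = x - x.mean()
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

k = 20                      # record only every k-th value
thinned = chain[::k]

print("lag-1 autocorrelation, full chain   :", autocorr(chain, 1))
print("lag-1 autocorrelation, thinned chain:", autocorr(thinned, 1))
```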

Multiple Chains

  • Start multiple chains over dispersed starting points
    • After convergence the chains should be indistinguishable
      • As convergence is the same no matter the starting point

Gelman-Rubin

  • Diagnostic that works on the principle: if the chain has converged, then the diagnostic test will be passed
    • However, the reverse is not true
    • Passing is therefore not a sufficient indicator of convergence
  • Consider \(m\) chains of \(n\) samples after discarding first half of sample path (burn-in)
  • Gelman-Rubin Statistic:
    • \(\psi_{ij}\) is the \(i\)th sample from the \(j\)th chain, \(j=1,...,m\)

\[ \hat{R} = \sqrt{\hat{Var}(\psi|y)/W},\quad\hat{Var}(\psi|y) = \frac{n-1}{n}W+\frac{1}{n}B \]

  • \(B\) is considered the between chain variance, and \(W\) is considered the within chain variance.
  • Ratio \(\hat{R}\) should decline to \(1\) as \(n\rightarrow\infty\)
    • If it remains high, the simulation should be run further.
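
A sketch of the \(\hat{R}\) computation following the formula above, assuming the post-burn-in draws of a scalar quantity are supplied as an \(m\times n\) array (the simulated inputs are illustrative only):

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin statistic for a scalar quantity.

    `chains` is an (m, n) array: m chains with n post-burn-in draws each.
    """
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()        # within-chain variance
    B = n * chain_means.var(ddof=1)              # between-chain variance
    var_hat = (n - 1) / n * W + B / n            # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(4)
converged = rng.normal(size=(4, 2_000))          # chains exploring the same distribution
print(gelman_rubin(converged))                   # close to 1

stuck = converged + np.array([[0.0], [3.0], [6.0], [9.0]])   # chains stuck in different regions
print(gelman_rubin(stuck))                       # well above 1
```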

Reparameterisation

  • Can reparameterise model to decrease dependence between variables and therefore improve mixing and rate of convergence.

Metropolis-Hastings Algorithm

  • The MH algorithm is an MCMC technique to sample from an arbitrary distribution with unknown normalising constant.
  • Sample from a proposal distribution \(q(\theta_i|\theta_{-i})\) rather than \(\pi(\theta_i|\theta_{-i})\), where \(q\) is easy to sample from.
    • Decide whether to accept the proposed value or stay at the current value, based on an acceptance probability.

Algorithm

    1. Initialise \(\theta_1\) (vector). Set \(t=1\)
    2. Generate proposed point \(\theta^*\sim q(\theta|\theta_t)\) (conditioned on last iterate)
    3. Generate \(u\sim U(0,1)\). If \(u\leq\alpha(\theta_t,\theta^*)\) then accept \(\theta_{t+1}=\theta^*\), else \(\theta_{t+1}=\theta_t\)

\[ \alpha(\theta_t,\theta^*)=\begin{cases}\min\left(1,\frac{\pi(\theta^*|x)q(\theta_t|\theta^*)}{\pi(\theta_t|x)q(\theta^*|\theta_t)}\right),\quad\text{if }\pi(\theta_t|x)q(\theta^*|\theta_t)>0\\1,\quad\text{otherwise}\end{cases}\]

    4. Increment \(t=t+1\) and go to step \(2\)

Proposal Distribution

  • \(q(\theta|\theta_t)\) can be any distribution such that the resulting Markov chain is still irreducible and aperiodic.
  • Common choice is the random walk proposal:

\[ q(\theta|\theta_t) = N(\theta_t,\Sigma),\quad\theta^*=\theta_t+\epsilon,\quad\epsilon\sim N(0,\Sigma)\]

  • In the case of random walk, the proposal distribution is symmetric
    • Therefore, as \(q(\theta|\theta_t)=q(\theta_t|\theta)\), the acceptance probability simplifies to:

\[ \alpha(\theta_t,\theta^*) = \min\left(1,\frac{\pi(\theta^*)}{\pi(\theta_t)}\right)\]
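
A minimal random-walk Metropolis sketch using this simplified acceptance ratio; the target (a standard normal known only through its unnormalised log-density) and the proposal standard deviation are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)

def log_target(theta):
    """Unnormalised log-density of the (illustrative) target: a standard normal."""
    return -0.5 * theta**2

T = 20_000
sigma = 2.5                       # proposal standard deviation (tuning parameter)
theta = np.empty(T)
theta[0] = 10.0                   # deliberately poor starting point
accepted = 0

for t in range(1, T):
    proposal = theta[t - 1] + rng.normal(0.0, sigma)     # theta* = theta_t + eps
    # Symmetric proposal, so alpha = min(1, pi(theta*) / pi(theta_t));
    # compare on the log scale for numerical stability
    if np.log(rng.uniform()) <= log_target(proposal) - log_target(theta[t - 1]):
        theta[t] = proposal
        accepted += 1
    else:
        theta[t] = theta[t - 1]

burn = 2_000
print("acceptance rate:", accepted / (T - 1))
print("mean / sd      :", theta[burn:].mean(), theta[burn:].std())
```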

Reversibility Condition

  • Consider the transition density \(P(\theta,\theta^*)\) for a given \(\theta\)
    • \(P(\theta,\theta^*)\) satisfies the reversibility condition, with \(\pi(\theta)\) as the invariant distribution:
      • The probability of being in state \(\theta\) and transitioning to state \(\theta^*\) must equal the probability of being in state \(\theta^*\) and transitioning to state \(\theta\)

\[ \pi(\theta)P(\theta,\theta^*) = \pi(\theta^*)P(\theta^*,\theta)\]

  • The detailed balance/reversibility condition is sufficient but not necessary to give \(\pi(\theta)\) as an invariant distribution
  • The full transition kernel can be defined as:

\[ P(\theta,\theta^*) =\underbrace{q(\theta,\theta^*)\alpha(\theta,\theta^*)}_{\text{Prob of proposing and accepting the move}}+\underbrace{I(\theta=\theta^*)\left(1-\int q(\theta,\theta')\alpha(\theta,\theta')d\theta'\right)}_{\text{Prob of rejecting any proposed move}} \]
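
To connect this kernel to the acceptance probability, detailed balance can be checked directly for \(\theta\neq\theta^*\) (the rejection term only contributes when \(\theta=\theta^*\), where the condition holds trivially). Assuming, without loss of generality, that \(\pi(\theta^*)q(\theta^*,\theta)\leq\pi(\theta)q(\theta,\theta^*)\), so that \(\alpha(\theta,\theta^*)\) equals the ratio and \(\alpha(\theta^*,\theta)=1\):

\[ \pi(\theta)P(\theta,\theta^*) = \pi(\theta)q(\theta,\theta^*)\frac{\pi(\theta^*)q(\theta^*,\theta)}{\pi(\theta)q(\theta,\theta^*)} = \pi(\theta^*)q(\theta^*,\theta) = \pi(\theta^*)q(\theta^*,\theta)\alpha(\theta^*,\theta) = \pi(\theta^*)P(\theta^*,\theta)\]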

Acceptance Rate

  • Don’t want the acceptance rate too high or too low, or sampling will be inefficient
    • Too high: the chain is essentially just sampling from the proposal, taking small, highly correlated steps
    • Too low: the chain stays at the same value for extended periods of time
  • Can alter the acceptance rate by changing the variance of the proposal
    • A high-variance proposal moves into areas of lower probability that are unlikely to be accepted
  • Rule of thumb:
    • \(\alpha(\theta,\theta^*) \approx 0.234\) for multivariate targets
    • \(\alpha(\theta,\theta^*) \approx 0.44\) for univariate targets