Markov Chain Monte Carlo

Jake

11/11/2022

Markov Chain Monte Carlo (MCMC)

  • Given a distribution \(\pi(X), X\in E\) known up to a normalisation constant.
  • Construct Markov Chain with state space \(E\) and stationary distribution \(\pi(X)\)
  • Simulate Markov Chain sample paths until convergence
  • Once converged, chain iterations can be treated as a sample from \(\pi(X)\)

Markov Chain Theory

  • Considering a discrete-time, discrete-state-space Markov chain \(X_t\):
    • Markov property: given the present state \(X_t\), the future state \(y\) is independent of the past

\[ P(X_{t+1}=y|X_t=x_t,...,X_0=x_0) = P(X_{t+1}=y|X_t=x_t)\]

  • N-step transition probability

\[ P^n(x,y) = P(X_n=y|X_0=x) = \sum_{x_1}...\sum_{x_{n-1}}P(x,x_1)P(x_1,x_2)...P(x_{n-1},y)\]
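
As a quick sanity check of the n-step formula, the sketch below (Python; the 3-state transition matrix is an assumption made purely for illustration) computes \(P^n\) as a matrix power and compares it against simulated chains started from \(X_0=0\).

```python
import numpy as np

# Illustrative 3-state transition matrix (rows sum to 1) -- chosen only for this example
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
n = 4

# n-step transition probabilities: P^n(x, y) is the (x, y) entry of the n-th matrix power
P_n = np.linalg.matrix_power(P, n)

# Monte Carlo check: run many independent chains from X_0 = 0 for n steps
rng = np.random.default_rng(0)
states = np.zeros(20_000, dtype=int)
for _ in range(n):
    states = np.array([rng.choice(3, p=P[s]) for s in states])

empirical = np.bincount(states, minlength=3) / len(states)
print("P^n(0, .) :", P_n[0])
print("simulated :", empirical)
```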

  • Irreducible chains
    • Chain can move from anywhere to anywhere else in finite number of time steps

\[ \pi(y)>0\Rightarrow P(X_n=y|X_0=x)>0\text{ for some }n\]

  • Aperiodicity
    • Considering the period \(d_x\) of state \(x\), the greatest common divisor of the set \(\{n\geq 1:P^n(x,x)>0\}\)
      • \(x\) is aperiodic if \(d_x=1\)
      • \(P(x,x)>0\) implies \(d_x=1\)
      • e.g. a two-state chain that deterministically alternates between its states has period \(2\)

Relevant Limit Theorems

  • For an irreducible Markov chain:
  • Ergodic averages converge to their expected values under \(\pi(X)\) (a small numerical check is given at the end of this section)

\[ \frac{1}{N}\sum^N_{t=1}g(X_t)\rightarrow\int g(x)\pi(x)\,dx<\infty,\quad N\rightarrow\infty\]

  • If chain is also aperiodic, then there is convergence in total variation

\[ ||P^n(X_0,.)-\pi(.)||\rightarrow 0,\quad n\rightarrow\infty\]

  • This implies that the Markov chain sample path mimics a random sample from \(\pi\)
  • \(\pi(X)\) is also the limiting distribution of successive chain iterates
    • True regardless of starting value of chain
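
As a small numerical check of the ergodic theorem (the two-state chain and the function \(g(x)=I(x=1)\) are assumptions made purely for illustration), the long-run average below settles near the stationary probability of state \(1\), regardless of the starting state:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative two-state chain; solving pi P = pi gives stationary distribution pi = (0.6, 0.4)
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])

# Single long sample path, deliberately started in state 1
N = 100_000
x = 1
count_state1 = 0
for _ in range(N):
    x = rng.choice(2, p=P[x])
    count_state1 += (x == 1)

# Ergodic average of g(X) = 1{X = 1} should approach its stationary expectation 0.4
print("ergodic average:", count_state1 / N, " stationary probability:", 0.4)
```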

Gibbs Sampler

  • Gibbs sampling allows sampling from a p-dimensional distribution \(\pi(X)\), where \(X=(X^{(1)},...,X^{(p)})\)
    • Often used for posterior distributions.
    • Need the full conditional distributions, including their normalisation constants, so they can be sampled from directly
  • Generate sample from \(\pi(X,Y)\) by sampling in turn from conditional distributions \(\pi(x|y)\) and \(\pi(y|x)\)

Algorithm

    1. Arbitrarily choose \(Y_0\)
    2. For \(t=1,...,T\):
       • Draw \(X_t\sim\pi(x|Y_{t-1})\)
       • Draw \(Y_t\sim\pi(y|X_t)\)
    3. Produces \((X_1,Y_1),(X_2,Y_2),...,(X_T,Y_T)\)
       • Can extend this algorithm to \(p\) dimensions (a minimal code sketch is given after this list)
  • The distributions of \(X_t\) and \(Y_t\) converge to \(\pi(X)\) and \(\pi(Y)\), the true margins of \(\pi(X,Y)\), as \(t\rightarrow\infty\).
  • For large \(t\), \(X_t\) and \(Y_t\) are effectively samples from \(\pi(X)\) and \(\pi(Y)\), and the pair is effectively a sample from \(\pi(X,Y)\)
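
A minimal sketch of the two-stage scheme above, assuming (purely for illustration) a bivariate standard normal target with correlation \(\rho\); both full conditionals are then univariate normals, \(X|Y=y\sim N(\rho y,1-\rho^2)\) and \(Y|X=x\sim N(\rho x,1-\rho^2)\):

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative target: bivariate standard normal with correlation rho
rho = 0.8
cond_sd = np.sqrt(1 - rho**2)    # standard deviation of each full conditional

T = 10_000
x = np.empty(T)
y = np.empty(T)

y_prev = 5.0                     # arbitrary (deliberately poor) starting value Y_0
for t in range(T):
    x[t] = rng.normal(rho * y_prev, cond_sd)   # draw X_t ~ pi(x | Y_{t-1})
    y[t] = rng.normal(rho * x[t], cond_sd)     # draw Y_t ~ pi(y | X_t)
    y_prev = y[t]

burn = 1_000                     # discard a burn-in period
print("sample correlation:", np.corrcoef(x[burn:], y[burn:])[0, 1])   # close to rho
print("sample means      :", x[burn:].mean(), y[burn:].mean())        # close to 0
```

Despite the poor starting value, the post-burn-in draws recover the means and correlation of the target.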

Gibbs Convergence

  • To converge, the full conditional distributions need to determine the joint distribution \(f(\theta)\) uniquely
  • The positivity condition is sufficient for the univariate conditional distributions to uniquely determine the joint distribution
    • \(f(\theta)\) is strictly positive for every \(\theta\) for which each of the marginal densities \(f(\theta_1),...,f(\theta_p)\) is positive
      • i.e. strictly positive \(f(\theta)\) for each \(\theta\) combination

Rates of Convergence

Bounds

  • Theoretical bounds for the total variation distance are available, but they are typically hard to use in practice
  • Geometrically ergodic Markov chain bound:
    • \(0<r<1\)
    • \(M(x)\) is a function that depends on the initialisation

\[ \sup_{A\subset E}|P^n(x,A)-\pi(A)|\leq M(x)r^n\]

  • Uniformly geometrically ergodic chain bound:
    • Does not depend on the initialisation, and is therefore a stronger result

\[ \sup_{A\subset E}|P^n(x,A)-\pi(A)|\leq Mr^n\]

Dependence

  • Rate of convergence depends on degree of dependence between model parameters
    • Dependence slows convergence and induces autocorrelation in the Markov chain
  • If parameters were independent:
    • Immediate convergence
    • Independence between successive chain iterations

Assessing Convergence

Burn in

  • We only want to sample values once the chain has converged
    • Therefore, if convergence is slow, early chain values will not represent \(\pi(X)\)
  • To account for this, the first portion of each chain is often discarded, known as the burn-in period.

Autocorrelation

  • Markov chain iterates \(X^0,X^1,...\) are autocorrelated
    • If autocorrelation is high (i.e. mixing is poor), chain storage may become a problem
  • Can resolve autocorrelation issues by thinning, i.e. recording only every \(k\)th value (see the sketch below)
    • Thinned or non-thinned values can be used for inference.
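
A small sketch of the idea, using a strongly autocorrelated AR(1) sequence as a stand-in for poorly mixing MCMC output (the sequence, and the choice \(k=20\), are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in for poorly mixing MCMC output: a highly autocorrelated AR(1) sequence
phi = 0.95
n = 50_000
chain = np.empty(n)
chain[0] = 0.0
for t in range(1, n):
    chain[t] = phi * chain[t - 1] + rng.normal()

def autocorr(x, lag):
    """Sample autocorrelation of x at the given lag."""
    x = x - x.mean()
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

k = 20                      # record only every k-th value
thinned = chain[::k]

print("lag-1 autocorrelation, full chain   :", autocorr(chain, 1))
print("lag-1 autocorrelation, thinned chain:", autocorr(thinned, 1))
```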

Multiple Chains

  • Start multiple chains over dispersed starting points
    • After convergence the chains should be indistinguishable
      • As convergence is the same no matter the starting point

Gelman-Rubin

  • Diagnostic that works on the principle: if the chain has converged, then the diagnostic test will be passed
    • However, the reverse is not true
    • Passing is therefore not a sufficient indicator of convergence
  • Consider \(m\) chains of \(n\) samples after discarding first half of sample path (burn-in)
  • Gelman-Rubin Statistic:
    • \(\psi_{ij}\) is the \(i\)th sample from the \(j\)th chain, \(j=1,...,m\)

\[ \hat{R} = \sqrt{\hat{Var}(\psi|y)/W},\quad\hat{Var}(\psi|y) = \frac{n-1}{n}W+\frac{1}{n}B \]

  • \(B\) is considered the between chain variance, and \(W\) is considered the within chain variance.
  • Ratio \(\hat{R}\) should decline to \(1\) as \(n\rightarrow\infty\)
    • If it remains high, the simulation should be run further.
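
A sketch of the \(\hat{R}\) computation following the formula above, assuming the post-burn-in draws of a scalar quantity are supplied as an \(m\times n\) array (the simulated inputs are illustrative only):

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin statistic for a scalar quantity.

    `chains` is an (m, n) array: m chains with n post-burn-in draws each.
    """
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()        # within-chain variance
    B = n * chain_means.var(ddof=1)              # between-chain variance
    var_hat = (n - 1) / n * W + B / n            # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(4)
converged = rng.normal(size=(4, 2_000))          # chains exploring the same distribution
print(gelman_rubin(converged))                   # close to 1

stuck = converged + np.array([[0.0], [3.0], [6.0], [9.0]])   # chains stuck in different regions
print(gelman_rubin(stuck))                       # well above 1
```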

Reparameterisation

  • Can reparameterise model to decrease dependence between variables and therefore improve mixing and rate of convergence.

Metropolis-Hastings Algorithm

  • The MH algorithm is an MCMC technique to sample from an arbitrary distribution with unknown normalising constant.
  • Sample from a proposal distribution \(q(\theta_i|\theta_{-i})\) rather than \(\pi(\theta_i|\theta_{-i})\), where \(q\) is easy to sample from.
    • Decide whether to accept the proposed value or stay at the current value, based on an acceptance probability.

Algorithm

    1. Initialise \(\theta_1\) (vector). Set \(t=1\)
    2. Generate proposed point \(\theta^*\sim q(\theta|\theta_t)\) (conditioned on last iterate)
    3. Generate \(u\sim U(0,1)\). If \(u\leq\alpha(\theta_t,\theta^*)\) then accept \(\theta_{t+1}=\theta^*\), else \(\theta_{t+1}=\theta_t\)

\[ \alpha(\theta_t,\theta^*)=\begin{cases}\min\left(1,\frac{\pi(\theta^*|x)q(\theta_t|\theta^*)}{\pi(\theta_t|x)q(\theta^*|\theta_t)}\right),\quad\text{if }\pi(\theta_t|x)q(\theta^*|\theta_t)>0\\1,\quad\text{otherwise}\end{cases}\]

    4. Increment \(t=t+1\) and go to step \(2\)

Proposal Distribution

  • \(q(\theta|\theta_t)\) can be any distribution such that the resulting Markov chain is still irreducible and aperiodic.
  • Common choice is the random walk proposal:

\[ q(\theta|\theta_t) = N(\theta_t,\Sigma),\quad\theta^*=\theta_t+\epsilon,\quad\epsilon\sim N(0,\Sigma)\]

  • In the case of random walk, the proposal distribution is symmetric
    • Therefore, as \(q(\theta|\theta_t)=q(\theta_t|\theta)\), the acceptance probability simplifies to:

\[ \alpha(\theta_t,\theta^*) = \min\left(1,\frac{\pi(\theta^*)}{\pi(\theta_t)}\right)\]
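
A minimal random-walk Metropolis sketch using this simplified acceptance ratio; the target (a standard normal known only through its unnormalised log-density) and the proposal standard deviation are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)

def log_target(theta):
    """Unnormalised log-density of the (illustrative) target: a standard normal."""
    return -0.5 * theta**2

T = 20_000
sigma = 2.5                       # proposal standard deviation (tuning parameter)
theta = np.empty(T)
theta[0] = 10.0                   # deliberately poor starting point
accepted = 0

for t in range(1, T):
    proposal = theta[t - 1] + rng.normal(0.0, sigma)     # theta* = theta_t + eps
    # Symmetric proposal, so alpha = min(1, pi(theta*) / pi(theta_t));
    # compare on the log scale for numerical stability
    if np.log(rng.uniform()) <= log_target(proposal) - log_target(theta[t - 1]):
        theta[t] = proposal
        accepted += 1
    else:
        theta[t] = theta[t - 1]

burn = 2_000
print("acceptance rate:", accepted / (T - 1))
print("mean / sd      :", theta[burn:].mean(), theta[burn:].std())
```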

Reversibility Condition

  • Consider the transition density \(P(\theta,\theta^*)\) for a given \(\theta\)
    • \(P(\theta,\theta^*)\) satisfies the reversibility condition, with \(\pi(\theta)\) as the invariant distribution:
      • The probability of being in state \(\theta\) and transitioning to state \(\theta^*\) must equal the probability of being in state \(\theta^*\) and transitioning to state \(\theta\)

\[ \pi(\theta)P(\theta,\theta^*) = \pi(\theta^*)P(\theta^*,\theta)\]

  • The detailed balance/reversibility condition is sufficient but not necessary to give \(\pi(\theta)\) as an invariant distribution
  • The full transition kernel can be defined as:

\[ P(\theta,\theta^*) =\underbrace{q(\theta,\theta^*)\alpha(\theta,\theta^*)}_{\text{Prob of proposing and accepting the move}}+\underbrace{I(\theta=\theta^*)\left(1-\int q(\theta,\theta')\alpha(\theta,\theta')d\theta'\right)}_{\text{Prob of rejecting any proposed move}} \]
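
To connect this kernel to the acceptance probability, detailed balance can be checked directly for \(\theta\neq\theta^*\) (the rejection term only contributes when \(\theta=\theta^*\), where the condition holds trivially). Assuming, without loss of generality, that \(\pi(\theta^*)q(\theta^*,\theta)\leq\pi(\theta)q(\theta,\theta^*)\), so that \(\alpha(\theta,\theta^*)\) equals the ratio and \(\alpha(\theta^*,\theta)=1\):

\[ \pi(\theta)P(\theta,\theta^*) = \pi(\theta)q(\theta,\theta^*)\frac{\pi(\theta^*)q(\theta^*,\theta)}{\pi(\theta)q(\theta,\theta^*)} = \pi(\theta^*)q(\theta^*,\theta) = \pi(\theta^*)q(\theta^*,\theta)\alpha(\theta^*,\theta) = \pi(\theta^*)P(\theta^*,\theta)\]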

Acceptance Rate

  • Don’t want the acceptance rate too high or too low, or sampling will be inefficient
    • Too high: the chain is essentially just sampling from the proposal, taking small, highly correlated steps
    • Too low: the chain stays at the same value for extended periods of time
  • Can alter the acceptance rate by changing the variance of the proposal
    • A high-variance proposal moves into areas of lower probability that are unlikely to be accepted
  • Rule of thumb:
    • \(\alpha(\theta,\theta^*) \approx 0.234\) for multivariate targets
    • \(\alpha(\theta,\theta^*) \approx 0.44\) for univariate targets