Markov Chain Monte Carlo (MCMC)
- Given a distribution \(\pi(X), X\in E\) known up to a normalisation constant.
- Construct Markov Chain with state space \(E\) and stationary distribution \(\pi(X)\)
- Simulate Markov Chain sample paths until convergence
- Once converged, chain iterations can be treated as a sample from \(\pi(X)\)
Markov Chain Theory
- Consider a discrete-time, discrete-state-space Markov Chain \(X_t\):
- Markov property: given the present state \(X_t\), the future state \(X_{t+1}\) is independent of the past
\[ P(X_{t+1}=y|X_t=x_t,...,X_0=x_0) = P(X_{t+1}=y|X_t=x_t)\]
- n-step transition probability (see the numerical sketch after this list)
\[ P^n(x,y) = P(X_n=y|X_0=x) = \sum_{x_1}\cdots\sum_{x_{n-1}}P(x,x_1)P(x_1,x_2)\cdots P(x_{n-1},y)\]
- Irreducible chains
- Chain can move from anywhere to anywhere else in finite number of time steps
\[ \pi(y)>0\Rightarrow P(X_n=y|X_0=x)>0\ \text{for some } n\]
- Aperiodicity
- Considering the period \(d_x\) of state \(x\), the greatest common divisor of the set \(\{n\geq 1:P^n(x,x)>0\}\)
- \(x\) is aperiodic if \(d_x=1\)
- \(P(x,x)>0\) implies \(d_x=1\)
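A minimal numerical sketch (my own illustration, assuming Python with NumPy and a hypothetical 3-state chain) of the n-step transition probabilities: they are just matrix powers, and for an irreducible, aperiodic chain every row of \(P^n\) approaches the same stationary vector.

```python
import numpy as np

# Hypothetical 3-state transition matrix: every state can reach every other
# state and P(x, x) > 0, so the chain is irreducible and aperiodic.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# n-step transition probabilities are matrix powers: P^n(x, y) = (P^n)[x, y].
for n in (1, 5, 50):
    print(f"P^{n} =\n{np.linalg.matrix_power(P, n)}\n")

# As n grows, every row approaches the same vector (the stationary
# distribution pi), i.e. the chain forgets its starting state.
```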
Relevant Limit Theorems
- For an irreducible Markov chain:
- Ergodic averages converge to their expected values under \(\pi(X)\) (illustrated in the sketch after this list)
\[ \frac{1}{N}\sum^N_{t=1}g(X_t)\rightarrow\int g(x)\pi(x)\,dx<\infty,\quad N\rightarrow\infty\]
- If chain is also aperiodic, then there is convergence in total variation
\[ ||P^n(X_0,.)-\pi(.)||\rightarrow 0,\quad n\rightarrow\infty\]
- This implies that the Markov chain sample path mimics a random sample from \(\pi\)
- \(\pi(X)\) is also the limiting distribution of successive chain iterates
- True regardless of starting value of chain
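Continuing the hypothetical 3-state example above, a short sketch of the ergodic average converging to the stationary expectation (the transition matrix and the choice \(g(x)=x\) are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Same hypothetical 3-state chain as in the previous sketch.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# Stationary distribution: the left eigenvector of P with eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()

# Simulate one sample path and form the ergodic average of g(x) = x.
N = 100_000
x, g_sum = 0, 0.0
for t in range(N):
    x = rng.choice(3, p=P[x])
    g_sum += x

print("ergodic average:      ", g_sum / N)
print("stationary E_pi[g(X)]:", np.dot(pi, np.arange(3)))
```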
Gibbs Sampler
- Gibbs sampling allows sampling from a p-dimensional distribution \(\pi(X)\), where \(X=(X^{(1)},...,X^{(p)})\)
- Often used for posterior distributions.
- Needs the full conditional distributions, including their normalisation constants, so that they can be sampled from directly
- Generate sample from \(\pi(X,Y)\) by sampling in turn from conditional distributions \(\pi(x|y)\) and \(\pi(y|x)\)
Algorithm
- 1. Arbitrarily choose \(Y_0\)
- 2. For \(t=1,...,T\):
- Draw \(X_t\sim\pi(x|Y_{t-1})\)
- Draw \(Y_t\sim\pi(y|X_t)\)
- Produces \((X_1,Y_1),(X_2,Y_2),...,(X_T,Y_T)\)
- Can extend this algorithm to p dimensions
- The distributions of \(X_t\) and \(Y_t\) converge to \(\pi(X)\) and \(\pi(Y)\), the true margins of \(\pi(X,Y)\), as \(t\rightarrow\infty\)
- For large \(t\), \(X_t\) and \(Y_t\) are effectively samples from \(\pi(X)\) and \(\pi(Y)\), and the pair \((X_t,Y_t)\) is effectively a sample from \(\pi(x,y)\) (see the sketch below)
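A minimal sketch of the two-stage Gibbs sampler, assuming a standard bivariate normal target with correlation \(\rho\) (a hypothetical example where both full conditionals are univariate normals):

```python
import numpy as np

rng = np.random.default_rng(1)

def gibbs_bivariate_normal(rho, T, y0=0.0):
    """Two-stage Gibbs sampler for a standard bivariate normal with correlation rho.

    Both full conditionals are known in closed form:
        X | Y = y ~ N(rho * y, 1 - rho**2)
        Y | X = x ~ N(rho * x, 1 - rho**2)
    """
    xs, ys = np.empty(T), np.empty(T)
    y = y0                        # arbitrary starting value Y_0
    sd = np.sqrt(1.0 - rho**2)    # conditional standard deviation
    for t in range(T):
        x = rng.normal(rho * y, sd)   # draw X_t ~ pi(x | Y_{t-1})
        y = rng.normal(rho * x, sd)   # draw Y_t ~ pi(y | X_t)
        xs[t], ys[t] = x, y
    return xs, ys

xs, ys = gibbs_bivariate_normal(rho=0.9, T=50_000)
print(np.corrcoef(xs, ys)[0, 1])  # close to 0.9 once the chain has converged
```

With \(\rho\) close to 1 the chain mixes slowly, which connects to the dependence discussion later in these notes.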
Gibbs Convergence
- To converge, the conditional distributions need to determine the joint distribution \(f(\theta)\) uniquely
- The positivity condition is sufficient for the univariate conditional distributions to uniquely determine the joint distribution
- \(f(\theta)\) is strictly positive for every \(\theta\) for which each of the marginal densities \(f(\theta_1),...,f(\theta_p)\) is positive
- That is, \(f(\theta)\) must be strictly positive for every such combination of \(\theta\) values
Rates of Convergence
Bounds
- Theoretical bounds for the total variation distance are available, but they are typically hard to use in practice
- Geometrically ergodic Markov chain bound:
- \(0<r<1\)
- \(M(x)\) is a function that depends on the initialisation
\[ \sup_{A\subset E}|P^n(x,A)-\pi(A)|\leq M(x)r^n\]
- Uniformly geometrically ergodic chain bound:
- Does not depend on the initialisation, therefore a stronger result
\[ \sup_{A\subset E}|P^n(x,A)-\pi(A)|\leq Mr^n\]
Dependence
- Rate of convergence depends on degree of dependence between model parameters
- Dependence slows convergence and induces autocorrelation in the Markov chain
- If parameters were independent:
- Immediate convergence
- Independence between successive chain iterations
Assessing Convergence
Burn-in
- We only want to sample values once the chain has converged
- Early chain values will not be representative of \(\pi(X)\), especially if convergence is slow
- To account for this, the first portion of each chain is often discarded, known as the burn-in period.
Autocorrelation
- Markov chain iterates \(X^0,X^1,...\) are autocorrelated
- If autocorrelation is high (i.e. mixing is poor), chain storage may be a problem
- Can resolve autocorrelation and storage issues by thinning, i.e. recording only every \(k\)th value (sketched after this list)
- Thinned or non-thinned values can be used for inference.
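A small sketch (illustrative helper functions of my own, not a specific library API) of estimating the autocorrelation of a chain and thinning it:

```python
import numpy as np

def acf(chain, max_lag=50):
    """Empirical autocorrelation of a 1-D chain output, up to max_lag."""
    chain = np.asarray(chain, dtype=float)
    chain = chain - chain.mean()
    var = np.dot(chain, chain) / len(chain)
    return np.array([np.dot(chain[:len(chain) - k], chain[k:]) / (len(chain) * var)
                     for k in range(max_lag + 1)])

def thin(chain, k):
    """Keep only every k-th iterate to reduce autocorrelation and storage."""
    return np.asarray(chain)[::k]

# Example usage on any stored sample path, e.g. the Gibbs output above:
# print(acf(xs, max_lag=10))
# xs_thinned = thin(xs, 10)
```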
Multiple Chains
- Start multiple chains from dispersed starting points
- After convergence the chains should be indistinguishable
- As convergence is the same no matter the starting point
Gelman-Rubin
- Diagnostic that works on the principle: if the chain has converged, then the diagnostic test will be passed
- However, the reverse is not true
- Such diagnostics are therefore not sufficient indicators of convergence
- Consider \(m\) chains of \(n\) samples each, after discarding the first half of each sample path (burn-in)
- Gelman-Rubin Statistic (a computational sketch follows this list):
- \(\psi_{ij}\) is the \(i\)th sample from the \(j\)th chain
\[ \hat{R} = \sqrt{\hat{Var}(\psi|y)/W},\quad\hat{Var}(\psi|y) = \frac{n-1}{n}W+\frac{1}{n}B \]
- \(B\) is considered the between chain variance, and \(W\) is considered the within chain variance.
- Ratio \(\hat{R}\) should decline to \(1\) as \(n\rightarrow\infty\)
- If it remains high, the simulation should be run further.
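A sketch of the statistic above, assuming the \(m\) chains (post burn-in) are stored as an \((m, n)\) array of scalar summaries \(\psi\):

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin statistic for m chains of length n (burn-in already discarded).

    `chains` is an (m, n) array of scalar summaries psi, one row per chain.
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_hat = (n - 1) / n * W + B / n        # pooled variance estimate
    return np.sqrt(var_hat / W)              # R-hat, should approach 1
```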
Reparameterisation
- Can reparameterise model to decrease dependence between variables and therefore improve mixing and rate of convergence.
Metropolis-Hastings Algorithm
- The MH algorithm is an MCMC technique to sample from an arbitrary distribution with an unknown normalising constant.
- Sample from a proposal distribution \(q(\theta_i|\theta_{-i})\) rather than \(\pi(\theta_i|\theta_{-i})\), where \(q\) is easy to sample from.
- Decide whether to accept or stay at the same value based on a probability.
Algorithm
- 1. Initialise \(\theta_1\) (vector). Set \(t=1\)
- 2. Generate a proposed point \(\theta^*\sim q(\theta|\theta_t)\) (conditioned on the last iterate)
- 3. Generate \(u\sim U(0,1)\). If \(u\leq\alpha(\theta_t,\theta^*)\) then accept \(\theta_{t+1}=\theta^*\), else set \(\theta_{t+1}=\theta_t\), where
\[ \alpha(\theta_t,\theta^*)=\begin{cases}\min\left(1,\frac{\pi(\theta^*|x)q(\theta_t|\theta^*)}{\pi(\theta_t|x)q(\theta^*|\theta_t)}\right),\quad\text{if }\pi(\theta_t|x)q(\theta^*|\theta_t)>0\\1,\quad\text{otherwise}\end{cases}\]
- 4. Increment \(t\) to \(t+1\) and go to step 2
Proposal Distribution
- \(q(\theta|\theta_t)\) can be any distribution such that Markov Chain is still irreducible and aperiodic.
- A common choice is the random walk proposal (sketched in code below):
\[ q(\theta|\theta_t) = N(\theta_t,\Sigma),\quad\theta^*=\theta_t+\epsilon,\quad\epsilon\sim N(0,\Sigma)\]
- In the case of random walk, the proposal distribution is symmetric
- Therefore, as \(q(\theta|\theta_t)=q(\theta_t|\theta)\), the acceptance probability simplifies to:
\[ \alpha(\theta_t,\theta^*) = \min\left(1,\frac{\pi(\theta^*)}{\pi(\theta_t)}\right)\]
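A minimal random-walk Metropolis sketch (the target, seed, and tuning values are illustrative assumptions); it works on the log scale and uses the simplified acceptance ratio above:

```python
import numpy as np

rng = np.random.default_rng(2)

def random_walk_mh(log_target, theta0, n_iters, proposal_sd=1.0):
    """Random-walk Metropolis sampler for an unnormalised univariate target.

    `log_target` returns log pi(theta) up to an additive constant; because the
    Gaussian proposal is symmetric, the acceptance probability reduces to
    min(1, pi(theta*) / pi(theta_t)).
    """
    theta = theta0
    samples = np.empty(n_iters)
    n_accept = 0
    for t in range(n_iters):
        proposal = theta + rng.normal(0.0, proposal_sd)  # theta* = theta_t + eps
        log_alpha = log_target(proposal) - log_target(theta)
        if np.log(rng.uniform()) <= log_alpha:           # accept with prob alpha
            theta = proposal
            n_accept += 1
        samples[t] = theta                               # rejected moves repeat theta_t
    return samples, n_accept / n_iters

# Example: sample from an (unnormalised) standard normal target.
samples, accept_rate = random_walk_mh(lambda th: -0.5 * th**2, theta0=0.0,
                                      n_iters=50_000, proposal_sd=2.4)
print(accept_rate, samples[1000:].mean(), samples[1000:].std())
```

The reported acceptance rate can be compared against the rules of thumb at the end of this section.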
Reversibility Condition
- Consider the transition density \(P(\theta,\theta^*)\) for a given \(\theta\)
- \(P(\theta,\theta^*)\) satisfies the reversibility condition, with \(\pi(\theta)\) invariant distribution:
- Probability of being in state \(\theta\) and transitioning to state \(\theta^*\) must equal the probability of being in state \(\theta^*\) and transitioning to state \(\theta\)
\[ \pi(\theta)P(\theta,\theta^*) = \pi(\theta^*)P(\theta^*,\theta)\]
- The detailed balance/reversibility condition is sufficient but not necessary to give \(\pi(\theta)\) as an invariant distribution
- The full transition kernel can be defined as:
\[ P(\theta,\theta^*) =\underbrace{q(\theta,\theta^*)\alpha(\theta,\theta^*)}_{\text{Prob. of proposing and accepting }\theta^*}+\underbrace{I(\theta=\theta^*)\left(1-\int q(\theta,\theta ')\alpha(\theta,\theta ')d\theta'\right)}_{\text{Prob. of rejecting the proposed move}} \]
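- Substituting the acceptance probability into the first term (for \(\theta\neq\theta^*\)) verifies detailed balance directly:
\[ \pi(\theta)q(\theta,\theta^*)\alpha(\theta,\theta^*) = \pi(\theta)q(\theta,\theta^*)\min\left(1,\frac{\pi(\theta^*)q(\theta^*,\theta)}{\pi(\theta)q(\theta,\theta^*)}\right) = \min\big(\pi(\theta)q(\theta,\theta^*),\,\pi(\theta^*)q(\theta^*,\theta)\big)\]
- This expression is symmetric in \(\theta\) and \(\theta^*\), so \(\pi(\theta)P(\theta,\theta^*)=\pi(\theta^*)P(\theta^*,\theta)\); the rejection term only contributes when \(\theta=\theta^*\), where the condition holds trivially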
Acceptance Rate
- Don’t want the acceptance rate too high or too low, or sampling will be inefficient
- Too high is basically sampling from proposal
- Too low means the chain stays at the same value for extended periods of time
- Can alter acceptance rate by changing variance of proposal
- Too high a variance proposes moves into areas of lower probability that likely won’t be accepted
- Rule of thumb for the average acceptance rate (a crude tuning sketch follows):
- about \(0.234\) in the multivariate case
- about \(0.44\) in the univariate case
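A crude, illustrative sketch (my own assumption, not a prescribed recipe from these notes) of tuning the random-walk proposal scale during burn-in towards a target acceptance rate:

```python
import numpy as np

rng = np.random.default_rng(3)

def tune_proposal_sd(log_target, theta0, target_rate=0.44,
                     n_batches=50, batch_size=100, sd0=1.0):
    """Crude burn-in tuning of the random-walk proposal scale.

    After each batch, the proposal standard deviation is nudged up if the
    observed acceptance rate exceeds the target, and down otherwise.
    """
    theta, sd = theta0, sd0
    for _ in range(n_batches):
        accepted = 0
        for _ in range(batch_size):
            proposal = theta + rng.normal(0.0, sd)
            if np.log(rng.uniform()) <= log_target(proposal) - log_target(theta):
                theta, accepted = proposal, accepted + 1
            # rejected moves keep theta unchanged
        rate = accepted / batch_size
        sd *= np.exp(0.1 if rate > target_rate else -0.1)  # nudge the scale
    return sd

# Example: tune against an (unnormalised) standard normal target.
print(tune_proposal_sd(lambda th: -0.5 * th**2, theta0=0.0))
```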