Note: These lecture notes are still rough, and have only have been mildly proofread.

Note 2: This vignette is coded in Python. Unfortunately, .Rmd files do not yet support variable conservation between Python code snippets, so all code is presented in a single snippet at the end of the vignette.

Summary

This vignette introduces the concept of the integrated autocorrelation time (IAC) and demonstrates how it can be practically calculated and applied to compare convergence rates of different MCMC schemes. As an example, Gibbs samplings (GS) and the Metropolis-Hastings (MH) algorithm are implemented for an Ising model and their convergence rates are compared directly.

Prerequisites

Familiarity with GS and the MH algorithm.

Integrated Auto-Correlation

Very often, MCMC schemes are implemented in order to numerically approximate integrals involving probability density functions (pdfs). Let’s take the most straightforward case in which we want to calculate the mean of some function \(f(\underline{x})\) over some distribution \(\pi(\underline{x})\):

\begin{equation}\label{analytic} \bar{f} = \int f(\underline{x}) \pi(\underline{x}) \text{d}\underline{x}, \end{equation}

As we have learned in this class, by generating \(\underline X^{(0)}, \dots ,\underline X^{(M)}\) from an appropriate MCMC scheme, an estimate of \(\bar{f}\) can be constructed as

\begin{equation} \bar{f} \approx \bar{f}_M = \frac{1}{M} \sum_{i=1}^{M} f(\underline X^{(i)}), \end{equation}

where \(M\) is the length of the Markov chain generated. This approach works because the Markov process has been constructed such that its stationary distribution is itself \(\pi(\underline x)\).

Clearly \(\bar{f}\) is an important measure for us, otherwise we never would have wasted our time calculating it to begin with. But equally important is quantifying how accurate our estimate of \(\bar{f}\) is. For example, if we generated \(N\) estimates of \(\bar{f}\), how tightly distributed would \(\bar{f}\) be? If the variance \(\sigma^2(\bar{f})\) is small the MCMC can be considered quickly converging, and the opposite is true if \(\sigma^2(\bar{f})\) is large. To be more mathematical, by the central limit theorem (CLT) we know that as \(N \rightarrow \infty\), \(\sqrt{N}(\bar{f} - <f>) \sim \mathcal{N}(0,a^2)\). In other words,

\begin{equation} \lim_{N \rightarrow \infty} \sigma^2(\bar{f}) = \frac{a^2}{N} \end{equation}

Right out of the gates we can be confident that \(a^2\) will depend on the MCMC scheme used. For example, if the length \(M\) of the Markov chains used is small, convergence will intuitively take longer which manifests as \(a^2\) being large. In this vignette, an expression for \(a\) is given but not proven (for a more complete treatment, see http://www.hep.fsu.edu/~berg/teach/mcmc08/material/lecture07mcmc3.pdf):

\begin{equation} a^2 = \tau_f \sigma^2(f), \end{equation}

where

\begin{equation} \tau_f = 1 + 2 \sum_{i=1}^\infty \textbf{corr}(f(X^{(0)}),f(X^{(i)})). \end{equation}

Let’s first focus our attention to \(\sigma^2(f)\). This is the variance of \(f(\underline x)\) under the distribution \(\pi(\underline x)\) and is therefore independent of the MCMC scheme used. Therefore the MCMC scheme chosen modulates only \(\tau_f\), which we call the integrated autocorrelation (IAC). We can further assert that an MCMC with a fast rate of convergence will have a small \(\tau_f\), since small \(\tau_f\) is proportional to the variance of our estimate. To understand the IAC better, consider the extreme case when \(X^{(0)}, \dots, X^{(M)}\) are drawn completely independently from one another (as opposed to the dependent samples generated in a MCMC scheme). Since all samples are independent, Eq. 1.5 equates to \(1\) and so in the limit of large \(N\) Eq. 1.3 becomes

\begin{equation} \sigma^2(\bar{f}) = \frac{\sigma^2(f)}{N} \end{equation}

This yields the familiar equation for the variance of the sample mean (assuming independent samples). But by definition, MCMC schemes generate dependent samples corresponding to \(\tau_f > 1\). We can then conclude that the rate of convergence will be always be slower in MCMC implementations than for independent sampling. \(\tau_f\) is a very useful summary statistic for quantifying the decrease in rate of convergence introduced by dependent sampling. The rate of convergence for different MCMC schemes can be compared directly, which is demonstrated in the next section.

Example: the Ising model

The Ising model was originally developed as a model for ferromagnetism, although nowadays it has widespread use as a model system that people use to test their latest and greatest Monte-Carlo numerical methods. It is also used within the field of computational biology with varying degrees of success (just search https://www.google.com/#q=ising+model+biology&*). In this section we simulate an Ising model using Gibbs sampling and the Metropolis-Hastings algorithm.

Definition

Consider a 2D lattice \(\{1,\dots,L\} \times \{1,\dots,L\}\), where each cell can take on the values \(\pm 1\). Let \(\sigma\) be the \(L \times L\) matrix which characterizes the state of the lattice. For example, \(\sigma_{2,10}\) is the state of the cell with coordinates \((2,10)\). In an Ising model, \(\sigma\) evolves discretely according to a time index \(k\), and the sequence of \(\{\sigma^{(k=0)},\dots,\sigma^{(k=K)}\}\) forms a Markov chain. Each cell \(\sigma_{i,j}\) evolves in accordance with a probability distribution that depends on the state of the other cells. It makes sense to imagine this in terms of magnetism, where each cell can be thought of as a magnet that either points up (\(\sigma_{i,j}=+1\)) or down (\(\sigma_{i,j}=-1\)). Just the way that two magnets tend to align themselves when close to each other, the state of one cell will influence the state of nearby cells more strongly than cells further away. In this vein, a result from statistical mechanics is that the probability distribution of \(\sigma\) is given by

\begin{equation} \pi(\sigma_{i,j}) = \frac{1}{\mathcal Z} \exp\big({-\beta J \sigma_{i,j} \sum_{(i',j')=\text{ NN}}\sigma_{i',j'}}). \end{equation}

A characteristic feature of the Ising model is that it assumes \(\sigma_{i,j}\) is influenced only by adjacent cells. In other words, it is conditionally independent of all non-adjacent cells. This is denoted by a nearest neighbors (NN) summation in Eq. 1.7. Writing the summation out in full looks like

\begin{equation} \sum_{i',j'=\text{ NN}}\sigma_{i',j'} = \sigma_{i+1,j} + \sigma_{i-1,j} + \sigma_{i,j+1} + \sigma_{i,j-1}. \end{equation}

To make boundary conditions simple, the lattice is considered periodic. For example, the nearest neighbors of \(\sigma_{L,L}\) are \(\sigma_{1,L}\), \(\sigma_{L,1}\), \(\sigma_{L-1,L}\), and \(\sigma_{L,L-1}\). \(J\) is a coupling constant that determines the nature and strength of interaction between the cell states. If \(J>0\), \(\sigma_{i,j}\) prefers the state of its neighbors, and if \(J<0\), \(\sigma_{i,j}\) prefers to be the opposite of its neighbors. These are called ferromagnetism and antiferromagnetism, respectively. On the other hand, \(\beta\) is an external parameter that can be thought of as an inverse temperature (\(\beta\) is inversely proportional to temperature) that introduces thermal noise into the system. When \(\beta\) is large (temperature is low) the cell states are determined dominantly by interaction terms of the nearest neighbors in either a ferromagnetic or antiferromagnetic fashion—this leads to highly ordered states. But as \(\beta\) decreases (temperature increases), thermal fluctuations begin to dominate the system and the cell states have increased disorder. \(\mathcal{Z}\) is a normalization constant.

Gibbs Sampling vs. Metropolis-Hastings Algorithm

Suppose we are interested in the mean magnetization of an Ising model with specified \(L\), \(\beta\), and \(J\), where mean magnetization is defined as

\begin{equation} M(\sigma) = \sum_{i,j} \sigma_{i,j} \end{equation}

To approach this problem we decide to use MCMC to construct a numerical distribution of \(M\), from which we calculate the mean \(\bar{M}\). The following code implements a Gibbs sampler and Metropolis-Hastings algorithm, constructs an estimate \(\bar{M}\) for each scheme, and then informs us which measurement is more trustable by calculating the IAC.

'''
OVERVIEW:

This program simulates states from a periodic 2D lattice sigma, in which each 
sigma(i,j) = +- 1, using 2 MCMC schemes: Gibbs sampling and the 
Metropolis-Hastings algorithm. A nearest neighbor Ising model is assumed. Once
both Markov chains are generated, the integrated autocorrelation times for the
magnetization function ( sum_{i,j}(sigma(i,j)) ) are compared between methods.

PARAMETERS:
    
The grid is LxL
K is the total length of the Markov chain
beta is the inverse temperature term
B is the external magnetic field
if J = 1, the interactions are ferromagnetic
if J = -1, the interactions are antiferromagnetic
kappa is truncation point of the autocorrelation curve and should be
adjusted if autocorrelation curve G does not tend to 0 by G[kappa]
burn-in is pre-chain truncated before Markov process reaches stationarity

REFERENCES:

https://en.wikipedia.org/wiki/Gibbs_sampling
https://en.wikipedia.org/wiki/Metropolis-Hastings_algorithm
https://en.wikipedia.org/wiki/Ising_model
'''

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams.update({'font.size': 20})
matplotlib.rc('font',family='Times New Roman')

def main():
    
    global L,K,B,beta,J,kappa,failure,burnin
    L = 15
    K = 100000
    beta = 0.4
    B = 0.0
    J = -1.
    kappa = 1000
    failure = 0
    burnin = 1000
    

#   simulates sigma using Gibbs sampler and MH algorithm
    print "starting Gibbs sampler"
    sigma_Gibbs = Gibbs()
    print "starting Metropolis-Hastings algorithm"
    sigma_MH = MetHast()
    
#   plots the initial and final states for both sample methods
    plot_state(sigma_Gibbs,sigma_MH)
    
#   trims off burn-in
    sigma_Gibbs = sigma_Gibbs[:,:,burnin:]
    sigma_MH = sigma_MH[:,:,burnin:]

#   calculates magnetization
    M_Gibbs = magnetization(sigma_Gibbs)
    M_MH = magnetization(sigma_MH)

#   plots the magnetization
    plt.hist(M_Gibbs,bins=15,histtype='step',label='Gibbs: mean(M) = {:.4f}'.\
             format(np.mean(M_Gibbs)))
    plt.hist(M_MH,bins=15,histtype='step',label='MetHast: mean(M) = {:.4f}'.\
             format(np.mean(M_MH)))
    plt.xlabel(r"$M/L^{2}$"); plt.ylabel('occurences')
    plt.legend(fontsize='13'); plt.tight_layout()
    #plt.show()
    plt.savefig('project_plot2.png')
    plt.close()
    
#   compares the integrated autocorrelation times for magnetizations
    compare_IAC([M_Gibbs,M_MH],labels=['Gibbs','MetHast'])
    
    
'''
Each iteration, the Gibbs sampler selects one of the L^2 lattice elements
randomly, i.e. sigma(i,j). A new value of sigma(i,j) is then drawn from
the posterior distribution P[sigma(i,j) | all sigma(I!=i, J!=j)]. 
The posterior distribution includes only 4 sigma terms because the Ising 
model assumes nearest neighbor interactions: sigma(I=i+-1, J=j+-1). Note
that sigma being updated one (i,j) pair at a time is the characteristic
partial resampling feature of Gibbs sampling.
'''      
def Gibbs():
#   sigma_series stores sigma for each step of the Markov chain
    sigma_series = initialize_sigma()
#   selects (i,j) randomly and calls draw_sigma_ij(...) to update sigma(i,j)
    for k in range(1,K):
        i,j = np.random.randint(0,L,size=2)
        sigma_series[:,:,k] = Gibbs_update(sigma_series[:,:,k-1],i,j)
    return sigma_series    
    
'''
Returns the new state of sigma after updating sigma(i,j) according to the
posterior distribution used in Gibbs sampling.
'''
def Gibbs_update(sigma,i,j):
#   The nearest neighbors of sigma(i,j) = sigma(i+1,j), sigma(i-1,j), 
#   sigma(i,j+1), and sigma(i,j-1).
    neighbors = get_neighbors(sigma,i,j)
#   p and q are the probabilities of being spin +1 or spin -1, respectively.
#   p + q = 1.
    p,q = posterior(sigma[i,j],neighbors,sigma)
#   sets sigma(i,j) according to wp and wm
    if np.random.rand() < p:
        sigma[i,j] = 1
    else:
        sigma[i,j] = -1
#   returns new state for sigma
    return sigma
    
'''
Each iteration, the Metropolis-Hastings sampler selects one of the L^2
elements randomly, i.e. sigma(i,j). The proposed state for sigma(i,j,k=i+1) 
is set to omega(i,j) = -sigma(i,j,k=i). Afterwards, this proposed state is
either accepted or rejected with a probability A.
''' 
def MetHast():
#   sigma_series stores sigma for each step of the Markov chain
    sigma_series = initialize_sigma()
#   selects (i,j) randomly and calls draw_sigma_ij(...) to update sigma(i,j)
    for k in range(1,K):
        i,j = np.random.randint(0,L,size=2)
        sigma_series[:,:,k] = MetHast_update(sigma_series[:,:,k-1],i,j)
    return sigma_series
    
'''
Returns the new state of sigma after updating sigma(i,j) according to the
accept/reject scheme used in the Metropolis Hastings algorithm.
'''
def MetHast_update(sigma,i,j):
#   omega is the proposed state for sigma(i,j)
    omega = -sigma[i,j]
#   The nearest neighbors of sigma(i,j) = sigma(i+1,j), sigma(i-1,j), 
#   sigma(i,j+1), and sigma(i,j-1).
    neighbors = get_neighbors(sigma,i,j)
#   sigma(i,j) is either set to sigma or -sigma (the original state)
    sigma[i,j] = acceptreject(omega,neighbors,sigma)
    return sigma

'''
Returns the state for sigma(i,j,k=i+1), after either accepting or rejecting 
the proposal, omega. 
'''
def acceptreject(omega,neighbors,sigma):
    global failure
#   H_after and H_before are the Hamiltonians for sigma_ij after and before
#   flipping sigma_ij.
    H_after = -omega * J * np.sum(neighbors) - B * np.sum(sigma)
    H_before = omega * J * np.sum(neighbors) - B * np.sum(sigma)
    delta_H = H_after - H_before
#   A is the ratio of Pr(omega)/Pr(sigma) = Pr(H_after)/Pr(H_before)
    A = np.exp(-beta*delta_H)
#   if delta_H < 0, accept transition with 100% probability,
#   otherwise accept with probability A = exp(-beta*delta_H)
    if np.random.rand() < A:
        return omega
    else:
        failure += 1
        return -omega
     
''' 
Returns p = Pr{sigma(i,j) = +1} and q = Pr{sigma(i,j) = -1}. These 
are otherwise known as the Boltzmann factors.
'''
def posterior(sigma_ij,neighbors,sigma):
#   Hp and Hm are the Hamiltonians for sigma_ij = +/- 1
    Hp = -1 * J * np.sum(neighbors) - B * np.sum(sigma)
    Hm = +1 * J * np.sum(neighbors) - B * np.sum(sigma)
#   p and q are the probability that sigma_ij is +/- 1
    p = np.exp(-beta*Hp)/(np.exp(-beta*Hp) + np.exp(-beta*Hm))
    q = np.exp(-beta*Hm)/(np.exp(-beta*Hp) + np.exp(-beta*Hm))
    return p,q    
    
'''
Returns the nearest neighbors of sigma(i,j). For nearest neighbor interactions,
these are sigma(i+1,j), sigma(i-1,j), sigma(i,j+1), and sigma(i,j-1).
'''
def get_neighbors(sigma,i,j):
    return np.array([sigma[(i+1)%L,j],
                     sigma[(i-1)%L,j],
                     sigma[i,(j+1)%L],
                     sigma[i,(j-1)%L]])
    
'''
Returns M, the magnetization. M(sigma) is the function for which the integrated
autocorrelation time is calculated for (equivalent to f(x) in the vignette).
'''
def magnetization(sigma_series):
#   M is the magnetization
    M = np.sum(sigma_series,axis=(0,1))/L**2
    return M
    
'''
Returns a zero-filled array sigma of dimension (i,j,K) which the Markov chain 
is stored in. That's not completely true. Before returning the array, the 
initial Markov state is defined in sigma[:,:,0], where each sigma(i,j,k=0) =
+/- 1 with equal probability.
'''
def initialize_sigma():
#   creates sigma(k=0), with elements +/-1 with equal probability
#   sigma_series stores sigma for every time-step
    sigma0 = np.random.choice([-1,1],size=L**2,p=[0.5,0.5]).reshape(L,L)
    sigma_series = np.zeros((L,L,K))
    sigma_series[:,:,0] = sigma0
    return sigma_series

'''
Calculates the autocorrelation curve of M, then returns the integrated 
autocorrelation time (IAC).
'''
def tau(M):
#   autocorr is the UNintegrated autocorrelation curve
    autocorr = auto_corr_fast(M)    
#   tau = 1 + 2*sum(G)
    return 1 + 2*np.sum(autocorr), autocorr

'''
This is the intuitive and naive way to calculate autocorrelations. See
auto_corr_fast(...) instead.
'''
def auto_corr(M):
#   The autocorrelation has to be truncated at some point so there are enough
#   data points constructing each lag. Let kappa be the cutoff
    auto_corr = np.zeros(kappa-1)
    mu = np.mean(M)
    for s in range(1,kappa-1):
        auto_corr[s] = np.mean( (M[:-s]-mu) * (M[s:]-mu) ) / np.var(M)
    return auto_corr

'''
The bruteforce way to calculate autocorrelation curves is defined in 
auto_corr(M). The correlation is computed for an array against itself, and 
then the indices of one copy of the array are shifted by one and the 
procedure is repeated. This is a typical "convolution-style" approach. 
An incredibly faster method is to Fourier transform the array first, since 
convolutions in Fourier space is simple multiplications. This is the approach
in auto_corr_fast(...)
'''
def auto_corr_fast(M):   
#   The autocorrelation has to be truncated at some point so there are enough
#   data points constructing each lag. Let kappa be the cutoff
    M = M - np.mean(M)
    N = len(M)
    fvi = np.fft.fft(M, n=2*N)
#   G is the autocorrelation curve
    G = np.real( np.fft.ifft( fvi * np.conjugate(fvi) )[:N] )
    G /= N - np.arange(N); G /= G[0]
    G = G[:kappa]
    return G
    
'''
Plots the autocorrelation curves and calculates the IAC for each M in Ms
'''
def compare_IAC(Ms,labels):
#   loop through each magnetization chain
    for ind,M in enumerate(Ms):
#       get IAC and autocorrelation curve
        IAC,G = tau(M)
        plt.plot(np.arange(len(G)),G,label="{}: IAC = {:.2f}".\
                                            format(labels[ind],IAC))
    plt.legend(loc='best',fontsize=14)
    plt.tight_layout()
    #plt.show()
    plt.savefig('project_plot3.png')
    plt.close()
    
'''
Plots the initial and final states for both sample methods
'''
def plot_state(sigma1,sigma2):
#   plots sigma(k=0) and sigma(k=K) for sweep and random
    fig = plt.figure(figsize=(12,10))
    ax = fig.add_subplot(221)
    ax.pcolormesh(sigma1[:,:,0]); ax.set_title('Gibbs: $k=0$')
    ax = fig.add_subplot(222)
    ax.pcolormesh(sigma1[:,:,-1]); ax.set_title('Gibbs: $k=K$')
    ax = fig.add_subplot(223)
    ax.pcolormesh(sigma2[:,:,0]); ax.set_title('MH: $k=0$')
    ax = fig.add_subplot(224)
    ax.pcolormesh(sigma2[:,:,-1]); ax.set_title('MH: $k=K$')
    plt.tight_layout()
    #plt.show()
    plt.savefig('project_plot1.png')
    plt.close()

#main()

Figure 1. Initial and final states for \(\sigma\).

The plot illustrates the initial and final states of \(\sigma\) for both implementations. The initial state (\(k=0\)) was defined a priori by \(\sigma_{i,j} = \pm 1\) with equal probability, which explains the observed disorder. By the end of the Markov chain (\(k=K\)), an equilibrium state of order is seen. From the checkerboard pattern its clear that cells prefer to be the opposite of their neighbors, i.e. their interaction is antiferromagnetic. Indeed, in this simulation \(J<0\). However, the checkerboard pattern is not perfect due to random fluctuations introduced by the value of \(\beta\) we picked. If \(\beta\) was picked to be smaller, would you expect the final state to be more or less disordered?

Figure 2. Distribution of the magnetization.

This figure shows the calculated distributions of \(M\) using both schemes. Each produce a bell-shaped curve centered near 0, with the Gibbs sampler generating an estimate of \(\bar{M}_\text{Gibbs} = 0.0008\) and the M-H algorithm generating an estimate of \(\bar{M}_\text{MH} = 0.0012\). The question is, which one do we trust more? The naive approach would be to repeat the simulation 100 times for each method, construct a distribution for \(\bar{M}_\text{Gibbs}\) and \(\bar{M}_\text{MH}\), and calculate the standard deviation for the distributions. The one with the smaller standard deviation is then the better estimate. The smart thing to do is calculate the IAC.

Figure 3. Comparison of the IACs.

Below is the autocorrelation curve for both methods, and the IAC is just 1 plus 2 times the sum of the autocorrelation curve (Eq. 1.5). The Metropolis-Hastings algorithm has an IAC nearly half that of the Gibbs sampler, which by Eqs. 1.3 and 1.4, indicates \(\sigma^2(\bar{M}_\text{MH}) \approx \frac{1}{2} \sigma^2(\bar{M}_\text{Gibbs})\). Therefore, we are more confident in our estimate \(\bar{M}_\text{MH}\).

Keep in mind this does not mean the Metropolis-Hastings algorithm is better than the Gibbs sampler. It just means that the Metropolis-Hastings algorithm is better than the Gibbs sampler for the specified \(L\), \(J\), \(\beta\), and \(\mathbb{E}[M(\sigma)]\). In otherwords, the Metropolis-Hastings algorithm performed better for this \(f(\underline x)\) and \(\pi(\underline x)\).