Notes for Explaining the Gibbs Sampler

Gibbs sampler has been widely used in Bayesian models. In 1992, Georoge Casella and Edward I. George published the paper “Explaining the Gibbs sample”.\(^1\) In the paper they gave several examples of Gibbs sampler and a proof that Gibbs sample sequence will converge to a marginal distribution. However,in the paper they didn’t provide computer code that readers can reproduce their results and figures. Here I use R to reproduce their results and make some notes for the paper.

For the equation \((2.1)\), there was nothing spacial, it was just the definition of the marginal density for a continuous random variable,i.e. if there was a joint density function \(f(x,y_1,..y_p)\) then we could integrate out all other variables and obtain the marginal density

\[f(x)=\int...\int f(x,y_1,...,y_p)dy_1...dy_p \tag{2.1}\] If we can calculate \(f(x)\) directly, then we can easily obtain the mean or variance of the marginal density, However, usually it is extremely difficult to evaluate the integral for the equation \((2.1)\). Fortunately, we can use Gibbs sampler to obtain the mean and variances by generating a sample from the distribution of \(f(x)\) i.e \(X_1,...,X_m \sim f(x)\). Then we can obtain the mean by the equation \((2.2)\)

\[\lim_{m\to\infty}\frac{1}{m}\sum_{i=1}^mX_i=\int_{-\infty}^\infty xf(x)dx=E(X)\tag{2.2}\]

Two-variable case

For the two-variable case , i.e we have a pair random variables \((X,Y)\) ,the joint density will be \(f(x,y)\) and the marginal density for \(X\) will be

\[f(x)=\int f(x,y)dy\] and suppose this integral is very hard to calculate, then we can generate a ‘Gibbs sequence(chain)’ to estimate the mean and variance of the marginal density \(f(x)\). \[Y_0'\to X_0'\to Y_1'\to X_1'\to Y_2'\to X_2'\to...\to Y_k'\to X_k' \tag{2.3}\]

The sequence \((2.3)\) can be obtain by the following steps:

Give an initial value of \(Y_0'=y_0'\)
Alternately generate other values by the following equations

\[X_j'\sim f(x|Y_j'=y_j')\\Y_{j+1}'\sim f(y|X_j'=x_j') \tag{2.4}\] From equation \((2.4)\) we can see if we know the density of \(f(x|y)\) and \(f(y|x)\) it is quite easy to generate the sequence \((2.3)\), in fact, usually it is easier to obtain \(f(x|y)\) and \(f(y|x)\) than the marginal distribution \(f(x)\).

When \(k\) is large enough, the final observation in (2.3), namely \(X_k' = x_k'\) is a sample point from \(f(x)\), we can produce \(m\) such sequences of \((2.3)\), then we will have a sample with \(m\) number of \(x_k'\)s and we can use this i.i.d sample to calculate the mean or variance of the \(f(x)\).

Example 1

We know the joint distribution of \(X\) and \(Y\) is

\[f(x,y)\propto \begin{pmatrix} n\\ x \end{pmatrix}y^{x+\alpha-1}(1-y)^{n-x+\beta-1} \tag{2.5}\] where \(x=0,1,...,n \text{ and } 0\le y\le 1\)

The marginal density is:

\[f(x)=\int_0^1 \begin{pmatrix} n\\ x \end{pmatrix}y^{x+\alpha-1}(1-y)^{n-x+\beta-1}dy\] suppose we think this integral is very difficult to evaluate (in fact this integral is not difficult to evaluate,it has a Beta-binomial distribution),we will use Gibbs sampler to generate a sample from this marginal distribution. From \((2.5)\) we can see

\[f(x|y) \propto Binomial(n,y) \tag{2.6a}\] \[f(y|x) \propto Beta(x+\alpha,n-x+\beta) \tag{2.6b}\] To obtain \((2.6a)\) we ignore any factors that are not functions of \(x\) in \((2.5)\), then we get

\[f(x|y)\propto \begin{pmatrix} n\\ x \end{pmatrix}y^{x}(1-y)^{n-x}\]

This is \(Binomial(n,y)\).

To obtain \((2.6b)\) we ignore any factors that not functions of \(y\) in \((2.5)\) then we get

\[f(y|x)\propto Beta(x+\alpha,n-x+\beta)\] Now, we have everything we need to perform a Gibbs sampling using above information. We will produce 500 chains of \((2.3)\) and each chain will have 10 values, and we will take the last value as the sample. Therefore, the sample size is 500.

Method 1

As describe in the paper, the following Gibbs sampling will produce 500 chains and each chain has a length of 10 and we take the 10th value as a sample point.

library(VGAM) #for the beta-binomial distribution

set.seed(1010)

X<-matrix(,nrow=500,ncol=10) #matrix with 500 rows and 10 columns
Y<-matrix(,nrow=500,ncol=10)

for(i in 1:500){          #initial value for the Y0 of the 500 row
  Y[i,1]<- rbeta(1,2,4)
}

for(i in 1:500){
  X[i,1]<- rbinom(1,16,Y[i,1]) # first value for X 
}

for(j in 1:500) {     
  for (i in 2:10){
  
 X[j,i]  <- rbinom(n = 1, size=16, prob=Y[j,i-1])    #equation 2.6a
 Y[j,i] <- rbeta(n = 1,X[j,i]+2,16-X[j,i]+4)         #equation 2.6b
  }
}

Histogram using 500 data points from the Gibbs sampler

hx<-hist(X[,10]) #histogram from the Gibbs sampler

Histogram from the beta-binomial distribution with sample size of 500.

Z<-rzoibetabinom.ab(500, 16, 2, 4, pstr0 = 0, pstrsize = 0) #500 values produced by beta-binomial distribution
hz<-hist(Z)

Put the two histograms together

c1 <- rgb(173,216,230,max = 255, alpha = 80, names = "lt.blue")
c2 <- rgb(255,192,203, max = 255, alpha = 80, names = "lt.pink")
plot(hx, xlab="" ,main="",col = c1) # Plot 1st histogram using a transparent color
plot(hz, col = c2, add = TRUE) # Add 2nd histogram using different color

Produce the figure 1, this figure will not be exactly the same as the figure 1 in the paper, since the randoms numbers produced by the R program will not be exactly the same as the random numbers produced in the paper.

require(plotrix)
l <- list(X[,10],Z)
multhist(l, ylim=c(0,70))

Method 2.

In fact, this method is commonly used by statistical software, i.e we can produce a very long chain, then get rid of some burn in values. For the following R program we produce a 11000 length chain and get rid of the first 1000 burn in values.

library(VGAM)

set.seed(1010)

X<- rep(NA, 11000)
Y<- rep(NA, 11000)
T<- 1000    # burn in
Y[1]<-rbeta(1,2,4)# initialize Y0
for(i in 2:11000) {   
  X[i]  <- rbinom(n = 1, size=16, prob=Y[i-1])    
 Y[i] <- rbeta(n = 1,X[i]+2,16-X[i]+4)
}
X  <- X[-(1:T)]   # remove burn in
Y <- Y[-(1:T)] # remove burn in

Histogram of the Gibbs sampler

hx<-hist(X)

Histogram of the beta-binomial distribution

Z<-rzoibetabinom.ab(10000, 16, 2, 4, pstr0 = 0, pstrsize = 0)
hz<-hist(Z)

Put two histograms together

c1 <- rgb(173,216,230,max = 255, alpha = 80, names = "lt.blue")
c2 <- rgb(255,192,203, max = 255, alpha = 80, names = "lt.pink")

plot(hx, col = c1) # Plot 1st histogram using a transparent color
plot(hz, col = c2, add = TRUE) # Add 2nd histogram using different color

We can see the Gibbs sampler can produce almost the same data as the real beta-binomial distribtion.

Produce a plot as Figure 1.

require(plotrix)
l <- list(X,Z)
multhist(l)

Example 2

For example 2 we know two conditional distributions restricted to the interval \((0,B)\)

\[f(x|y)\propto ye^{-yx},\;0<x<B<\infty \tag{2.8a}\] \[f(y|x)\propto xe^{-xy},\;0<x<B<\infty \tag{2.8b}\] The most interesting part for the example 2 is to use equation \((2.9)\) to estimate the marginal density function.

\[\hat{f}(x)=\frac{1}{m}\sum_{i=1}^m f(x|y_i) \tag{2.9}\] Here the \(\hat{f}(x)\) is the Rao-Blackwell(ised) estimator\(^2\). This equation means when the Markov chain reaches to the stationary state we can average these \(y_i\)s, and the unbiasness of the estimator holds for any \(x\)’s.(Due to Professor Christian P. Robert’s answer to my question) This is important to understand this equation.

The following is the R code to reproduce the figure 2. I also added a density estimation use the generic kernel density estimator(KDE) method (red line). The solid line is produced by the equation \((2.9)\). The dashed line is produced by the equation \(f(x)\propto(1-e^{-Bx}/x)\) (page 171), here \(B=5\) and the proper normalizer is 3.79 (I will show how I get this number)

set.seed(1010)
X<-matrix(,nrow=500,ncol=15) #matrix with 500 rows and 10 columns
Y<-matrix(,nrow=500,ncol=15)
for(i in 1:500){      #initial value for the Y0 of the 500 row
  temp_y=6
  while(temp_y>5){    #this is the trickiest part of this program, as long as temp_x >5, the loop will run until a value<=5 appear
    Y[i,1]<- rexp(1, rate = 1)
    temp_y=Y[i,1]  
  }
}
for(i in 1:500){
  temp_x=6
  while(temp_x>5){
    X[i,1]<- rexp(1,rate=Y[i,1]) # first value for X 
    temp_x=X[i,1] 
  }
}
k = 15
for (i in 1:500) {
  for (j in 2:k) {
    temp_x = 6
    while(temp_x>5) {
      X[i,j] = rexp(1,Y[,j-1])
      temp_x = X[i,j]
    }
    temp_y = 6
    while(temp_y>5) {
      Y[i,j] = rexp(1,X[i,j])
      temp_y = Y[i,j]
    }     
  }
}
hist(X[,15], breaks=40, freq=F,main='',xlab='')
##for solid line
rb=seq(0,5,le=123)
for(i in 1:123)
  rb[i]=mean(Y[,15]*exp(-rb[i]*Y[,15]))
lines(seq(0,5,le=123),rb,col="black",lwd=2)
#for dash line
y<-(1-exp(-5*X[,15]))/X[,15]
lines(X[,15][order(X[,15])], y[order(X[,15])]/3.79, lty = 5)# 3.79 is the normalization factor
#using kernel density to estimate the marginal distribution
lines(density(X[,15]),col='red')

(To be continue…)

References

1.Casella, G., & George, E. I. (1992). Explaining the Gibbs sampler. The American Statistician, 46(3), 167-174.

2.Robert CP, Casella G. Introducing monte carlo methods with R. New York: Springer; 2010..