

0.1 Introduction

Suppose we have \(N\) independent binary random variables \(y_{i} \in \{0, 1\}\), where each \(y_{i}\) comes from a Bernoulli distribution with probability of success \(p_{i}\). We also assume that the success probability \(p_{i}\) is related to a vector of covariates \(\boldsymbol{x}_{i} = (x_{i,1}, ..., x_{i,D})\). Then, we define the binary regression model as \(p_{i} = g^{-1}(\boldsymbol{x}_{i}\boldsymbol{\theta})\), where \(\boldsymbol{\theta}\) is a (\(D \times 1\)) column vector of regression coefficients, and \(g^{-1}(\cdot)\) is the inverse of the link function \(g(\cdot)\). The probit model is obtained if we define \(g^{-1}(\cdot) = \Phi(\cdot)\), where \(\Phi(\cdot)\) denotes the cumulative distribution function (cdf) of the standard normal distribution.

0.1.1 Generate synthetic data

As a first step, we create some synthetic data that will be used to fit our model.

# Set seed for reproducible results
set.seed(250)

# Create N = 400 covariate values x in the interval (-1, 1)
N <- 400
x <- seq(-1, 1, length.out = N)

# Create the N x D design matrix
D <- 2
# We learn a linear function
X <- matrix(c(rep(1, N), x), ncol = D)

# True values of regression coefficients theta
true_theta <- c(-.5, 3.3)

# Obtain the vector with probabilities of success p using the probit link
p <- pnorm(X %*% true_theta)

# Generate binary observation data y
y <- rbinom(N, 1, p)

# Variables that we will need later
N1  <- sum(y)  # Number of successes
N0  <- N - N1  # Number of failures

0.2 MLE for Binary Probit Regression

We can directly compute the maximum likelihood estimate (MLE) of the model parameters using the glm() function with a probit link function. We observe that the MLE of the model parameters is quite close to the true model parameters.

# Fit the model to the data
fit <- glm(y ~ x, family = binomial(link = probit))

# MLE estimates of the regression coefficients
# (Intercept)           x 
#  -0.5490056   3.1296282
mle_theta <- fit$coefficients

# Plot covariates x versus observations y
plot(x, y, main = "Synthetic data")
# Show the fitted function using the MLE estimates
lines(x = x, y = pnorm(X %*% mle_theta), col = "red3", lwd = 2)
legend("bottomright", legend = "MLE", col = "red3", bty = 'n', 
       lty = 1, lwd = 2, inset = c(0.02, 0.08), cex = 0.9)


0.3 Bayesian Binary Probit Regression

Instead of computing just the MLE of the model parameters, we can go further and compute their posterior distribution by taking a Bayesian approach. We complete the Bayesian model by placing a prior distribution \(\pi(\boldsymbol{\theta})\) over the model parameters. Then the Bayesian binary probit regression model becomes,

\[ \begin{aligned}\label{probit-regression-eq} y_{i} & \sim \mathcal{B}ernoulli\Big(\Phi(\eta_{i})\Big) \\ \eta_{i} & = \boldsymbol{x}_{i} \boldsymbol{\theta} \\ \boldsymbol{\theta} & \sim \pi(\boldsymbol{\theta}) \end{aligned} \]

and the graphical model can be represented in the following way,

The posterior distribution for the Bayesian binary probit model is the following, \[ \begin{aligned} p(\boldsymbol{\theta} | \boldsymbol{y}, \boldsymbol{X}) & \propto p(\boldsymbol{\theta}) p(\boldsymbol{y} | \boldsymbol{\theta}, \boldsymbol{X}) = \pi(\boldsymbol{\theta}) \prod\limits_{i=1}^{N} p(y_{i} | \boldsymbol{\theta}, \boldsymbol{x}_{i}) \\ & = \pi(\boldsymbol{\theta}) \prod\limits_{i=1}^{N} \Phi\Big(\boldsymbol{x}_{i} \boldsymbol{\theta}\Big)^{y_{i}} \Big(1 - \Phi(\boldsymbol{x}_{i} \boldsymbol{\theta})\Big)^{(1 - y_{i})} \end{aligned} \]
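To make this expression concrete, the unnormalised log-posterior can be written as a short R function. The sketch below assumes, purely for illustration, a flat prior (so the log-prior term is zero); a proper log-prior would simply be added to the returned value.

# Unnormalised log-posterior of the probit regression model
# (minimal sketch; assumes a flat prior, so the log-prior term is omitted)
log_posterior <- function(theta, y, X) {
  eta <- X %*% theta   # linear predictors x_i %*% theta
  # log Phi(eta) and log(1 - Phi(eta)) computed on the log scale for stability
  sum(y * pnorm(eta, log.p = TRUE) +
      (1 - y) * pnorm(eta, lower.tail = FALSE, log.p = TRUE))
}

# e.g. log_posterior(true_theta, y, X)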

Performing inference for this model in the Bayesian framework is complicated by the fact that no conjugate prior \(\pi(\boldsymbol{\theta})\) exists for the parameters of the probit regression model. To overcome this problem, Albert and Chib (1993) augmented the original model with an additional auxiliary variable that renders the conditional distributions of the model parameters equivalent to those under a Bayesian normal linear regression model with Gaussian noise; and derived an efficient Gibbs sampling scheme for computing the posterior statistics.

0.3.1 Augmented Model

Let us introduce \(N\) independent latent variables \(z_{i}\), where each \(z_{i}\) follows a normal distribution. Then the augmented probit model has the following hierarchical structure,

where,

\[ \begin{aligned} y_{i} & = \begin{cases} 1 & \; \text{if } z_{i} > 0\\ 0 & \; \text{if } z_{i} \leq 0\\ \end{cases} \nonumber \\ z_{i} & = \boldsymbol{x}_{i} \boldsymbol{\theta} + \epsilon_{i} \\ \epsilon_{i} & \sim \mathcal{N}(0,1) \nonumber \\ \boldsymbol{\theta} & \sim \pi(\boldsymbol{\theta}) \nonumber \end{aligned} \]

where \(y_{i}\) is now deterministic conditional on the sign of the stochastic auxiliary variable \(z_{i}\). Therefore we formulate the original problem as a missing data problem where we have a normal regression model on the latent data \(z_{i}\) and the observed responses \(y_{i}\) are incomplete in that we only observe whether \(z_{i} > 0\) or \(z_{i} \leq 0\).

We are interested in computing the joint posterior distribution of the latent variables \(\boldsymbol{z}\) and the model parameters \(\boldsymbol{\theta}\) given the data \(\boldsymbol{y}\) and \(\boldsymbol{X}\),

\[ \begin{aligned} p(\boldsymbol{z}, \boldsymbol{\theta} | \boldsymbol{y}, \boldsymbol{X}) \propto p(\boldsymbol{\theta}) p(\boldsymbol{z}|\boldsymbol{\theta}, \boldsymbol{X}) p(\boldsymbol{y}| \boldsymbol{z}) = \pi(\boldsymbol{\theta}) \prod\limits_{i=1}^{N} p(z_{i}| \boldsymbol{\theta}, \boldsymbol{x}_{i}) p(y_{i}| z_{i}) \end{aligned} \]

where we have that,

\[ \begin{aligned} p(z_{i}| \boldsymbol{\theta}, \boldsymbol{x}_{i}) & = \mathcal{N}(z_{i} | \boldsymbol{x}_{i} \boldsymbol{\theta}, 1) \nonumber\\ p(y_{i}| z_{i}) & = \boldsymbol{1}(y_{i} = 1)\boldsymbol{1}(z_{i} > 0) + \boldsymbol{1}(y_{i} = 0)\boldsymbol{1}(z_{i} \leq 0) \end{aligned} \] where \(\boldsymbol{1}(\cdot)\) is the indicator function, equal to \(1\) if the condition inside the parentheses holds, and \(0\) otherwise.

0.3.2 Gibbs sampling scheme

This joint posterior is difficult to normalize and sample from directly. However, computing the marginal posteriors of \(\boldsymbol{\theta}\) and \(\boldsymbol{z}\) via Gibbs sampling requires only the full conditional distributions \(p(\boldsymbol{\theta} | \boldsymbol{z}, \boldsymbol{y}, \boldsymbol{X})\) and \(p(\boldsymbol{z} | \boldsymbol{\theta}, \boldsymbol{y}, \boldsymbol{X})\), and these distributions have standard forms. It should be noted that \(p(\boldsymbol{\theta} | \boldsymbol{z}, \boldsymbol{y}, \boldsymbol{X}) = p(\boldsymbol{\theta} | \boldsymbol{z}, \boldsymbol{X})\), since \(\boldsymbol{\theta}\) is conditionally independent of \(\boldsymbol{y}\) given \(\boldsymbol{z}\).

First, the full conditional of \(\boldsymbol{\theta}\) is given by,

\[ p(\boldsymbol{\theta} | \boldsymbol{z}, \boldsymbol{X}) \propto \pi(\boldsymbol{\theta}) \prod\limits_{i=1}^{N} \mathcal{N}(z_{i} | \boldsymbol{x}_{i} \boldsymbol{\theta}, 1) \]

This quantity is the posterior density for the normal linear regression model. Using standard linear model results, if we use a constant prior for \(\boldsymbol{\theta}\), i.e. \(\pi(\boldsymbol{\theta}) \propto 1\), then,

\[ \boldsymbol{\theta} | \boldsymbol{z}, \boldsymbol{X} \sim \mathcal{N}\Big((\boldsymbol{X}^{T}\boldsymbol{X})^{-1}\boldsymbol{X}^{T}\boldsymbol{z}, (\boldsymbol{X}^{T}\boldsymbol{X})^{-1}\Big) \]
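As a sketch of this special case (the sampler implemented below uses the conjugate normal prior instead), a single draw of \(\boldsymbol{\theta}\) given \(\boldsymbol{z}\) under the flat prior could look as follows, assuming the mvtnorm package for the multivariate normal draw.

# Minimal sketch: one draw of theta | z, X under the flat prior pi(theta) propto 1
# (assumes the mvtnorm package; the Gibbs sampler below uses the conjugate prior)
draw_theta_flat <- function(z, X) {
  XtX_inv   <- solve(crossprod(X))              # (X'X)^{-1}
  theta_hat <- XtX_inv %*% crossprod(X, z)      # (X'X)^{-1} X'z
  c(mvtnorm::rmvnorm(1, mean = as.vector(theta_hat), sigma = XtX_inv))
}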

However, if we assign the proper conjugate prior \(\pi(\boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{\theta}_{0}, \boldsymbol{Q}_{0})\), then,

\[ \begin{aligned} \boldsymbol{\theta} | \boldsymbol{z}, \boldsymbol{X} & \sim \mathcal{N}(\boldsymbol{M}, \boldsymbol{V}) \\ \boldsymbol{M} & = \boldsymbol{V} (\boldsymbol{Q}_{0}^{-1}\boldsymbol{\theta}_{0} + \boldsymbol{X}^{T}\boldsymbol{z}) \\ \boldsymbol{V} & = (\boldsymbol{Q}_{0}^{-1} + \boldsymbol{X}^{T}\boldsymbol{X})^{-1} \end{aligned} \]

Now suppose we knew \(\boldsymbol{\theta}\). Then we could easily draw the latent variables \(z_{i}\) from their distribution conditional on \(\boldsymbol{\theta}\) and \(\boldsymbol{x}_{i}\), which is \(\mathcal{N}(\boldsymbol{x}_{i}\boldsymbol{\theta},1)\). However, if we also condition on the \(y_{i}\), we need to take into consideration this additional source of information. Thus, the full conditional of \(z_{i} | \boldsymbol{\theta}, y_{i}, \boldsymbol{x}_{i}\) is a truncated normal distribution,

\[ z_{i} | \boldsymbol{\theta}, y_{i}, \boldsymbol{x}_{i} \sim \begin{cases} \mathcal{TN}(\boldsymbol{x}_{i}\boldsymbol{\theta}, 1, 0, \infty) & \; \text{if } y_{i} = 1\\ \mathcal{TN}(\boldsymbol{x}_{i}\boldsymbol{\theta}, 1, -\infty, 0) & \; \text{if } y_{i} = 0\\ \end{cases} \]
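As an aside, if the truncnorm package were not available, one simple alternative (with reduced accuracy far in the tails) is the inverse-CDF construction sketched below; the function name and arguments are only illustrative.

# Minimal sketch: inverse-CDF draw from the truncated normal z_i | theta, y_i, x_i
# (illustrative alternative to rtruncnorm; less accurate far in the tails)
draw_z_i <- function(mu, y_i) {
  p0 <- pnorm(0, mean = mu, sd = 1)   # P(z_i <= 0) under N(mu, 1)
  u  <- runif(1)
  if (y_i == 1) {
    qnorm(p0 + u * (1 - p0), mean = mu, sd = 1)   # restricted to (0, Inf)
  } else {
    qnorm(u * p0, mean = mu, sd = 1)              # restricted to (-Inf, 0]
  }
}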

0.3.3 Gibbs sampling implementation

Gibbs sampling is an MCMC algorithm that simulates draws (i.e. Monte Carlo) that are dependent (i.e. Markov Chain), and these draws are approximately from the probability distribution of interest, i.e. the target distribution \(p(\boldsymbol{z}, \boldsymbol{\theta} | \boldsymbol{y}, \boldsymbol{X})\).

In each simulation step of the Gibbs algorithm we have to sample \(\boldsymbol{\theta} | \boldsymbol{z}, \boldsymbol{X}\) and \(z_{i} | \boldsymbol{\theta}, y_{i}, \boldsymbol{x}_{i}\) in turn. The following code implements the Gibbs sampling algorithm for the Bayesian binary probit model.

# Library for sampling from Multivariate Normal distribution
require(mvtnorm)
# Library for sampling from Truncated Normal distribution
require(truncnorm)


# Conjugate prior on the coefficients \theta ~ N(theta_0, Q_0)
theta_0 <- rep(0, D)
Q_0 <- diag(10, D)

# Initialize parameters
theta <- rep(0, D)
z <- rep(0, N)

# Number of simulations for Gibbs sampler
N_sim <- 10000 
# Burn in period
burn_in <- 5000
# Matrix storing samples of the \theta parameter
theta_chain <- matrix(0, nrow = N_sim, ncol = D)


# ---------------------------------
# Gibbs sampling algorithm
# ---------------------------------

# Compute posterior variance of theta
prec_0 <- solve(Q_0)
V <- solve(prec_0 + crossprod(X, X))

for (t in 2:N_sim) {
  # Update Mean of z
  mu_z <- X %*% theta
  # Draw latent variable z from its full conditional: z | \theta, y, X
  z[y == 0] <- rtruncnorm(N0, mean = mu_z[y == 0], sd = 1, a = -Inf, b = 0)
  z[y == 1] <- rtruncnorm(N1, mean = mu_z[y == 1], sd = 1, a = 0, b = Inf)
  
  # Compute posterior mean of theta
  M <- V %*% (prec_0 %*% theta_0 + crossprod(X, z))
  # Draw variable \theta from its full conditional: \theta | z, X
  theta <- c(rmvnorm(1, M, V))
  
  # Store the \theta draws
  theta_chain[t, ] <- theta
}

# ---------------------------
# Get posterior mean of \theta
# ---------------------------
# (Intercept)           x 
#  -0.5026835   3.2076024
post_theta <- colMeans(theta_chain[-(1:burn_in), ])

# Plot covariates x versus observations y
plot(x, y, main = "Synthetic data")
# Show the fitted function using the posterior mean estimates
lines(x = x, y = pnorm(X %*% mle_theta), col = "red3", lwd = 2)
lines(x = x, y = pnorm(X %*% post_theta), col = "blue3", lwd = 2)
legend("bottomright", legend=c("MLE","Post. Mean"), col=c("red3","blue3"), 
       bty = 'n', lwd = 2, inset = c(0.02, 0.08), lty = 1, cex = 0.9)

Having stored the information of each MCMC iteration in the theta_chain matrix, we can create some useful plots to observe how the chain evolved over the iterations and what the marginal density of each parameter looks like. In the figure below, the top panels depict the marginal histograms of the sampled values over all the iterations after the burn-in period, and the bottom panels show the trace plot of each parameter, i.e. the values the parameters took during the run of the MCMC simulation. As we observe, after only a few hundred iterations the MCMC chain converged to the region around the true parameter values. The \(red\) line corresponds to the true parameter values, the \(gold\) line to the posterior mean of the MCMC chain, and the \(green\) line to the MLE estimate.

par(mfrow = c(2,2))
hist(theta_chain[-(1:burn_in),1],breaks=30, main="Posterior of theta_1", 
     xlab="True value = red line", col="cornflowerblue")
abline(v = post_theta[1], col="goldenrod2", lwd=3)
abline(v = true_theta[1], col="red2", lwd=3)
abline(v = mle_theta[1], col="darkolivegreen", lwd=3)
hist(theta_chain[-(1:burn_in),2], breaks=30, main="Posterior of theta_2", 
     xlab="True value = red line", col="cornflowerblue")
abline(v = post_theta[2], col="goldenrod2", lwd=3)
abline(v = true_theta[2], col="red2", lwd=3)
abline(v = mle_theta[2], col="darkolivegreen", lwd=3)
legend("topright", c("True", "MLE", "Post. Mean"), lty=1, lwd=2,
       col=c("red2", "darkolivegreen","goldenrod2"), bty='n', cex=.95)

plot(theta_chain[, 1], type = "l", xlab="True value = red line" , 
     main = "Chain values of theta_1")
abline(h = true_theta[1], col="red2", lwd=2)
lines(cumsum(theta_chain[, 1])/(1:N_sim), col="goldenrod2", lwd=2)
plot(theta_chain[, 2], type = "l", xlab="True value = red line" , 
     main = "Chain values of theta_2")
abline(h = true_theta[2], col="red2", lwd=2)
lines(cumsum(theta_chain[, 2])/(1:N_sim), col="goldenrod2", lwd=2)

0.4 Holmes and Held Extension

In the previous section we described how we can use MCMC simulation to iteratively sample from the full conditional distributions and obtain the target distribution in which we are interested. However, a potential problem with this model is that there is strong correlation between the model parameters \(\boldsymbol{\theta}\) and the latent variables \(\boldsymbol{z}\). In the Albert and Chib (1993) approach, this correlation may result in slow mixing of the chain and hence require a larger number of simulations to explore the target distribution adequately.

Holmes and Held (2006) proposed a simple approach to reduce the autocorrelation of the parameters and hence improve the mixing of the chain. Their extension is to update \(\boldsymbol{\theta}\) and \(\boldsymbol{z}\) jointly, using the factorisation,

\[ p(\boldsymbol{z}, \boldsymbol{\theta} | \boldsymbol{y}, \boldsymbol{X}) = p(\boldsymbol{\theta} | \boldsymbol{z}, \boldsymbol{X}) p(\boldsymbol{z} | \boldsymbol{y}, \boldsymbol{X}) \]

where \(p(\boldsymbol{\theta} | \boldsymbol{z}, \boldsymbol{X})\) is the same as in the previous sections, but \(\boldsymbol{z}\) is now updated from its marginal distribution having integrated over \(\boldsymbol{\theta}\). Assuming that the prior for \(\boldsymbol{\theta}\) is a zero-mean normal distribution, \(\mathcal{N}(\boldsymbol{0}, \boldsymbol{Q}_{0})\), using standard matrix algebra we obtain,

\[ p(\boldsymbol{z} | \boldsymbol{y}, \boldsymbol{X}) \propto \mathcal{N}(\boldsymbol{z} | \boldsymbol{0}, \boldsymbol{I}_{N} + \boldsymbol{X}\boldsymbol{Q}_{0}\boldsymbol{X}^{T}) \, \boldsymbol{1}(\boldsymbol{y}, \boldsymbol{z}) \] where \(\boldsymbol{1}(\boldsymbol{y}, \boldsymbol{z})\) is the indicator function which truncates the multivariate normal distribution of \(\boldsymbol{z}\) to the appropriate region.

Sampling directly from a truncated multivariate normal distribution is known to be difficult; however, we can again use the Gibbs algorithm to sample from the full conditional distributions in turn,

\[ z_{i} | \boldsymbol{z}_{-i}, y_{i}, \boldsymbol{x}_{i} \sim \begin{cases} \mathcal{TN}(m_{i}, \upsilon_{i}, 0, \infty) & \; \text{if } y_{i} = 1\\ \mathcal{TN}(m_{i}, \upsilon_{i}, -\infty, 0) & \; \text{if } y_{i} = 0\\ \end{cases} \]

The means \(m_{i}\) and variances \(\upsilon_{i}\) are obtained from the leave-one-out marginal predictive densities, and using Henderson and Searle (1981) we can calculate these parameters as follows,

\[ \begin{aligned} m_{i} & = \boldsymbol{x}_{i}\boldsymbol{M} - w_{i}(z_{i} - \boldsymbol{x}_{i}\boldsymbol{M})\\ \upsilon_{i} & = 1 + w_{i} \\ w_{i} & = h_{i} / (1 - h_{i}) \end{aligned} \]

where \(h_{i}\) is the \(i^{th}\) diagonal element of the Bayesian hat matrix, \(h_{i} = (\boldsymbol{H})_{ii}\), \(\boldsymbol{H} = \boldsymbol{XVX}^{T}\).

However, following an update of each \(z_{i}\) we need to recalculate the posterior mean \(\boldsymbol{M}\) using the following relationship,

\[ \boldsymbol{M} = \boldsymbol{M}^{old} + \boldsymbol{S}_{i}(z_{i} - z_{i}^{old}) \] where \(\boldsymbol{M}^{old}\) and \(z_{i}^{old}\) denote the values of \(\boldsymbol{M}\) and \(z_{i}\) prior to the update of \(z_{i}\), and \(\boldsymbol{S}_{i}\) denotes the \(i^{th}\) column of \(\boldsymbol{S} = \boldsymbol{VX}^{T}\).

Note that during the Gibbs simulation, the quantities \(\boldsymbol{S}, w_{i}, \upsilon_{i}\) need only be computed once, prior to the MCMC iterations.
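A small implementation note: the diagonal elements \(h_{i}\) can also be obtained without forming the full \(N \times N\) matrix \(\boldsymbol{H}\); the vectorised one-liner below is a possible alternative to the explicit loop used in the implementation that follows.

# Possible vectorised computation of h_i = (XVX')_{ii} without forming H explicitly
h_vec <- rowSums((X %*% V) * X)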

0.4.1 Algorithm implementation

# Compute the matrix S = VX'
S <- tcrossprod(V, X)

h <- vector(mode = "numeric", length = N)
w <- vector(mode = "numeric", length = N)
u <- vector(mode = "numeric", length = N)
for (j in 1:N){
  # h stores the diagonal elements of the hat matrix (XS = XVX')
  h[j] <- X[j, ] %*% S[, j]
  w[j] <- h[j] / (1 - h[j])
  u[j] <- w[j] + 1   # u stores the conditional variance v_i = 1 + w_i
}

# Initialize latent variable Z, from truncated normal
z[y == 0] <- rtruncnorm(N0, mean = 0, sd = 1, a = -Inf, b = 0)
z[y == 1] <- rtruncnorm(N1, mean = 0, sd = 1, a = 0, b = Inf)

# Matrix storing samples of the \theta parameter
theta_chain_holmes <- matrix(0, nrow = N_sim, ncol = D)

# ---------------------------------
# Gibbs sampling algorithm
# ---------------------------------

# Compute the conditional mean of \theta
M <- as.vector(S %*% z)

for (t in 2:N_sim) {
  for (j in 1:N){
    # Store the old value of z
    z_old <- z[j]
    
    # Update mean of latent variable z_i
    m <- X[j, ] %*% M
    m <- m - w[j] * (z[j] - m)
    
    # Draw latent variable z from full conditional: z_j | z_-j, y, X
    # (u[j] is the conditional variance, so its square root is passed as sd)
    if (y[j] == 0)
      z[j] <- rtruncnorm(1, mean = m, sd = sqrt(u[j]), a = -Inf, b = 0)
    else
      z[j] <- rtruncnorm(1, mean = m, sd = sqrt(u[j]), a = 0, b = Inf)
    
    # Update the posterior mean M
    M <- M + (z[j] - z_old) * S[, j]
  }
  
  # Draw \theta from its full conditional: \theta | z, X
  theta_chain_holmes[t, ] <- c(rmvnorm(1, M, V))
}

# ---------------------------
# Get posterior mean of \theta
# ---------------------------
# (Intercept)           x 
#  -0.5026835   3.2076024
post_theta_holmes <- colMeans(theta_chain_holmes[-(1:burn_in), ])

Having stored the information of each MCMC iteration in the theta_chain_holmes matrix, we can create the same plots as in the previous section to observe how the chain evolved over the iterations and what the marginal density of each parameter looks like. Again, the \(red\) line corresponds to the true parameter values, the \(gold\) line to the posterior mean of the MCMC chain, and the \(green\) line to the MLE estimate.
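The plotting code is essentially the earlier block with theta_chain_holmes and post_theta_holmes substituted in; a compact version that loops over the two coefficients is sketched below.

# Same diagnostic plots as before, now for the Holmes and Held chain
par(mfrow = c(2, 2))
for (d in 1:D) {
  # Marginal histogram of the sampled values after burn-in
  hist(theta_chain_holmes[-(1:burn_in), d], breaks = 30,
       main = paste0("Posterior of theta_", d),
       xlab = "True value = red line", col = "cornflowerblue")
  abline(v = post_theta_holmes[d], col = "goldenrod2", lwd = 3)
  abline(v = true_theta[d], col = "red2", lwd = 3)
  abline(v = mle_theta[d], col = "darkolivegreen", lwd = 3)
}
for (d in 1:D) {
  # Trace plot with the running mean overlaid
  plot(theta_chain_holmes[, d], type = "l",
       main = paste0("Chain values of theta_", d))
  abline(h = true_theta[d], col = "red2", lwd = 2)
  lines(cumsum(theta_chain_holmes[, d]) / (1:N_sim), col = "goldenrod2", lwd = 2)
}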


0.5 Convergence Diagnostics

In theory we know that when running an MCMC simulation, we expect our Markov chain to converge to its stationary distribution, which is also our target distribution. However, there is no guarantee that after \(M\) draws our chain will have converged to the stationary distribution and will have adequately explored the posterior distribution. Convergence diagnostics help to address these issues; however, we should note that many diagnostic tools are designed to verify a necessary but not sufficient condition for convergence. There are no conclusive tests that can tell us when the Markov chain has converged to its stationary distribution.

0.5.1 Visual analysis via trace plots

As we showed in the previous section, we can visually assess whether the chain has converged to its stationary distribution via trace plots. A trace plot is a plot of the iteration number against the value of the parameter draw at that iteration. It can reveal when a chain has not yet converged to its stationary distribution; the aspects of stationarity that are most recognizable from a trace plot are a relatively constant mean and variance. A chain that mixes well traverses its posterior space rapidly, and can jump from one remote region of the posterior to another in relatively few steps. If our chain takes a long time to move around the parameter space, it will take longer to converge.
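For instance, the coda package (also used below for the effective sample size) can produce trace and density plots directly from the stored chains, along the following lines.

# Trace and density plots via the coda package (using the chains stored above)
require(coda)
plot(mcmc(theta_chain[-(1:burn_in), ]))          # Albert and Chib chain
plot(mcmc(theta_chain_holmes[-(1:burn_in), ]))   # Holmes and Held chain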

0.5.2 MCMC Variation

The posterior mean estimated from an MCMC simulation is the same as the one obtained from ordinary Monte Carlo simulation. However, the variance of this posterior mean estimate is larger under MCMC than the Monte Carlo variance, since the samples are not i.i.d. but are correlated draws from a Markov chain. Hence, in MCMC simulation we need a larger number of samples \(M\) to achieve the same variability we would obtain using Monte Carlo simulation.

In the time series setting (which includes Markov chains), the Effective Sample Size (ESS) gives an estimate of the equivalent number of independent draws that the Markov chain represents. In other words, it divides the number of dependent draws by a correction factor, so that the variance of the empirical average can be assessed as if we had that many independent samples.

The formula for the ESS is given by, \[ ESS = \frac{M}{\tau} = \frac{M}{1 + 2 \sum_{k=1}^{\infty}\rho_{k}(x)} \]

where \(\rho_{k}(x)\) is the autocorrelation at lag \(k\) for variable \(x\), and \(\tau\) is referred to as the autocorrelation time. \(ESS\) and \(\tau\) are inversely proportional to each other, and low \(ESS\) or high \(\tau\) indicates bad mixing of the Markov chain.
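To make the formula concrete, a rough estimate of \(\tau\) and the \(ESS\) can be obtained from the sample autocorrelations by truncating the sum at some maximum lag, as sketched below; the coda function used next implements a more careful spectral estimate.

# Rough ESS estimate from sample autocorrelations, truncating the sum at lag.max
# (coda::effectiveSize below uses a more careful spectral estimator)
ess_approx <- function(x, lag.max = 100) {
  rho <- acf(x, lag.max = lag.max, plot = FALSE)$acf[-1]   # drop lag 0
  tau <- 1 + 2 * sum(rho)                                  # autocorrelation time
  length(x) / tau
}

# e.g. ess_approx(theta_chain[-(1:burn_in), 2])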

Let us compute the \(ESS\) of the chains obtained with the Albert and Chib (1993) method and with the Holmes and Held (2006) extension. To do this we will use the coda package.

# Coda library
require(coda)

# ---------------------------
# Get ESS for Albert&Chib method
# ---------------------------
#  theta_1       theta_2 
#  596.8568      229.6335
ess_albert_chib <- effectiveSize(mcmc(theta_chain[-(1:burn_in), ]))

# ---------------------------
# Get ESS for Holmes extension
# ---------------------------
#  theta_1       theta_2 
#  1006.6389     487.1383
ess_holmes <- effectiveSize(mcmc(theta_chain_holmes[-(1:burn_in), ]))

As we can observe, the Holmes and Held (2006) extension leads to a higher \(ESS\) (almost twice as large); hence, in order to estimate the posterior mean with the same accuracy as the Holmes and Held (2006) method, we would need to increase the number of draws \(M\) for the Albert and Chib (1993) method.


0.6 Conclusions

This tutorial-like document showed how we can perform Bayesian binary probit regression using the data augmentation approach.

If you found this document useful, check my homepage at the University of Edinburgh for links to other tutorials.