We will frame much of the following in terms of a missing data problem, with \( \theta \) representing the model parameters, \( y \) representing the observed data, and \( z \) representing the missing data, with \( x=(y,z) \) indicating the complete data. While we will discuss the missing data context, the framework can be used with \( z \) representing the entire set of parameters and latent variables.
We use Variational Bayes (VBayes) as a method to approximate the true posterior distribution in a Bayesian inference model:
\[
p(\theta \mid y) = p(y \mid \theta)\frac{p(\theta)}{p(y)}
\]
and the joint posterior is:
\[
p(\theta,z \mid y) = p(y,z \mid \theta)\frac{p(\theta)}{p(y)}
\]
where:
\[
p(y \mid \theta) = \int p(y,z \mid \theta)dz
\]
and
\[
p(y) = \int p(y,z,\theta)dzd\theta = \int p(y,z \mid \theta)p(\theta)dzd\theta
\]
for a prior distribution \( p(\theta) \).
The marginal likelihood, \( p(y) \), is used for Bayesian model selection and averaging.
Ref: Beal & Ghahramani (2003), hereafter B&G.
Note: the VBayes objective \( F \) (defined below) is an integral, so evaluating it to obtain the estimate \( F(\hat q_z,\hat q_\theta,y) \) of \( log\:p(y) \) may be difficult.
We will carry out the ABO blood group example from the EM algorithm section and apply the VBayes method to estimating the posterior.
We can show that:
\[
log\:p(z,y,\theta) = log\:p(y,z \mid \theta) + log\:p(\theta)\\
= (m_A-1)log(p_A) + (m_B-1)log(p_B) + (m_O-1)log(p_O) + \text{const}
\]
where:
\[
m_A = 2n_{A/A} + n_{A/O} + n_{AB} + \alpha_A\\
m_B = 2n_{B/B} + n_{B/O} + n_{AB} + \alpha_B\\
m_O = 2n_{O/O} + n_{A/O} + n_{B/O} + \alpha_O
\]
and \( (\alpha_A,\alpha_B,\alpha_O) \) are the parameters of the Dirichlet prior for \( \theta \). The unobservable data are \( z=(n_{A/A},n_{B/B}) \), with \( n_{A/O} = n_A-n_{A/A} \) and \( n_{B/O} = n_B-n_{B/B} \). Initialize \( q_\theta^0(\theta) \) to be the prior for \( \theta \); i.e., take \( \alpha_A^0 = \alpha_A \), etc. As mentioned previously, we do not need to initialize \( q_z^0(z) \), since we obtain \( q_z^1(z) \) directly from \( q_\theta^0(\theta) \).
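For concreteness, here is a small R helper (a sketch for this write-up only; the name m.counts is an assumption and is not used by the functions later in the section) that computes \( (m_A,m_B,m_O) \) from the missing counts, the observed counts, and the prior parameters:
# Compute (m_A, m_B, m_O) given the missing counts (nAA, nBB), the observed
# counts y = (nA, nB, nAB, nO), and the Dirichlet prior parameters alpha
m.counts <- function(nAA, nBB, y, alpha) {
    nA <- y[1]
    nB <- y[2]
    nAB <- y[3]
    nO <- y[4]
    c(mA = 2 * nAA + (nA - nAA) + nAB + alpha[1],
      mB = 2 * nBB + (nB - nBB) + nAB + alpha[2],
      mO = 2 * nO + (nA - nAA) + (nB - nBB) + alpha[3])
}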
We can make the following claims (justification in the Appendix):
1. \( q_z^{k+1}(z) \) is a product of \( q_{n_{AA}}^{k+1} \), a Binomial(\( n_A,p_{AA}^k \)) distribution, and \( q_{n_{BB}}^{k+1} \), a Binomial(\( n_B,p_{BB}^k \)) distribution, where:
\[
p_{AA}^k = \frac{e^{\psi(\alpha_A^k) - \psi(\alpha_O^k)}}{1+e^{\psi(\alpha_A^k) - \psi(\alpha_O^k)}}\\
p_{BB}^k = \frac{e^{\psi(\alpha_B^k) - \psi(\alpha_O^k)}}{1+e^{\psi(\alpha_B^k) - \psi(\alpha_O^k)}}
\]
and \( \psi(x) \) is the digamma function, \( \frac{\Gamma^\prime(x)}{\Gamma(x)} \).
2. \( q_\theta^{k+1}(\theta) \) is Dirichlet(\( \alpha_A^{k+1},\alpha_B^{k+1},\alpha_O^{k+1} \)), where
\[
\alpha^{k+1}_A = n_A(1+p_{AA}^k) + n_{AB} + \alpha_A^0\\
\alpha^{k+1}_B = n_B(1+p_{BB}^k) + n_{AB} + \alpha_B^0\\
\alpha^{k+1}_O = 2n_O + n_A(1-p_{AA}^k) + n_B(1-p_{BB}^k) + \alpha_O^0\\
\]
Below are functions in R for carrying out the variational Bayes methodology with the distributions above:
# Variational Bayesian methods on example in section 13.6 of Lange.
# Observed data are counts y = (n_A, n_B, n_AB, n_O); complete data are counts
# x = (n_AA, n_AO, n_BB, n_BO, n_AB, n_O). Take the missing data to be
# z = (n_AA, n_BB), with n_AO = n_A - n_AA and n_BO = n_B - n_BB.
# Define the relevant functions for the example above:
qz.update <- function(alpha) {
    # q_z is a product of binomials with success probabilities pAA and pBB;
    # return these p's
    expit <- function(x) {
        return(exp(x)/(1 + exp(x)))
    }
    pAA <- expit(digamma(alpha[1]) - digamma(alpha[3]))
    pBB <- expit(digamma(alpha[2]) - digamma(alpha[3]))
    return(c(pAA, pBB))
}

qtheta.update <- function(y, pvec, alpha0) {
    # q_theta is a Dirichlet with parameter vector alpha; y is (nA, nB, nAB, nO)
    alphaA <- y[1] * (1 + pvec[1]) + y[3] + alpha0[1]
    alphaB <- y[2] * (1 + pvec[2]) + y[3] + alpha0[2]
    alphaO <- 2 * y[4] + y[1] * (1 - pvec[1]) + y[2] * (1 - pvec[2]) + alpha0[3]
    return(c(alphaA, alphaB, alphaO))
}

varBayes <- function(y, alpha0, maxit = 100, tol = 0.001) {
    # initialize
    iter <- 0
    alpha <- alpha0
    alphadiff <- rep(1, 3)
    while (any(abs(alphadiff) > tol) && iter < maxit) {
        alphaold <- alpha
        pp <- qz.update(alpha)
        alpha <- qtheta.update(y, pp, alpha0)
        alphadiff <- alpha - alphaold
        print(alpha)
        iter <- iter + 1
    }
    return(alpha)
}
We can test this methodology with the observed data \( n_A=186 \), \( n_B=38 \), \( n_{AB}=13 \), \( n_O=284 \). First, though, we should discuss the algorithmic connections between Bayesian EM and VBayes.
To make the connections, we need an alternative statement of the EM algorithm due to Neal & Hinton (1998), hereafter N&H. In the following, \( \tilde p \) is a distribution over the missing data \( z \). Let \( F(\tilde p,\theta) = E_{\tilde p}(log\:p(z,y\mid\theta)) - E_{\tilde p}(log\:\tilde p(z)) \). Then, given the state after iteration \( k \), the EM updates are:
1. E-Step: \( \tilde p^{k+1} = argmax_{\tilde p} F(\tilde p,\theta^k) \)
2. M-step: \( \theta^{k+1} = argmax_{\theta} F(\tilde p^{k+1},\theta) \)
The proof of this is in N&H Theorem 1.
Comments: The motivation for describing EM as a coordinate ascent algorithm over two blocks of arguments, \( (\tilde p, \theta) \), was to formalize EM-like algorithms that iterate over partial updates of the missing data and parameter vectors. By viewing EM as a maximization, we can try alternative maximization algorithms, such as coordinate ascent over more than two blocks of arguments.
N&H Lemma 1 shows that \( argmax_{\tilde p} F(\tilde p,\theta^k) = p(z\mid y,\theta^k) \) (Equation 2), so that the M-step maximizes \( E(log\:p(z,y\mid \theta)\mid y,\theta^k) - E(log\:p(z\mid y,\theta^k)\mid y,\theta^k) \). The first term is \( Q(\theta\mid\theta^k) \) in the standard EM notation, and the second term is constant with respect to \( \theta \). Equation 2 also follows from showing \( F(\tilde p,\theta) = -KL(\tilde p \mid\mid p_\theta) + l_0(\theta) \), where \( p_\theta = p(z\mid y,\theta) \) and \( l_0(\theta) = log\:p(y\mid\theta) \) is the observed-data log likelihood. The \( -KL \) term is maximized when \( KL=0 \), which happens if and only if \( \tilde p = p_\theta \).
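For completeness, this identity can be verified in one line from the factorization \( p(z,y\mid\theta) = p(z\mid y,\theta)\,p(y\mid\theta) \):
\[
F(\tilde p,\theta) = E_{\tilde p}[log\:p(z\mid y,\theta)] + log\:p(y\mid\theta) - E_{\tilde p}[log\:\tilde p(z)] = log\:p(y\mid\theta) - KL(\tilde p \mid\mid p_\theta)
\]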
For Bayesian EM, replace \( F \) with \( F_B(\tilde p,\theta) = F(\tilde p,\theta) + log\:p(\theta) \).
We can algebraically manipulate \( F_B \) to obtain:
\[
F_B(\tilde p,\theta) = \int \tilde p(z) log\:\frac{p(z,y,\theta)}{\tilde p(z)}dz
\]
Compare this to the objective function from VBayes:
\[
F(q_z,q_\theta,y) = \int q_z(z)q_\theta(\theta) log\:\frac{p(z,y,\theta)}{q_z(z)q_\theta(\theta)}dzd\theta
\]
We can see that the Bayesian EM algorithm can be expressed as the VBayes algorithm with \( q_\theta(\theta) \) restricted to the class of point mass functions \( \delta_\theta \) at \( \theta \), so that:
\[
F(\tilde p,\delta_\theta,y) = \int \tilde p(z)\delta_\theta(\theta) log\:\frac{p(z,y,\theta)}{\tilde p(z)\delta_\theta(\theta)}dzd\theta\\
= \int \tilde p(z)\left[\int \delta_\theta(\theta) log\:\frac{p(z,y,\theta)}{\tilde p(z)\delta_\theta(\theta)}d\theta\right]dz\\
= F_B(\tilde p,\theta)
\]
This makes some intuitive sense, as the goal of Bayesian EM is to find the posterior mode (MAP) estimate, while the goal of VBayes is to find an approximation to the posterior. If the approximations are restricted to the class of point masses, it is sensible that the VBayes estimate and the Bayesian EM estimate would be equivalent.
For the example above we can look at both the mode of the VBayes approximation to the posterior and the Bayesian EM MAP estimate. We would not expect them to be exactly the same here, as the class of approximating distributions for VBayes in this example is richer than the class of point mass distributions. We perform the estimation with a Dirichlet prior on \( \theta \) and with the observed data \( n_A=186 \), \( n_B=38 \), \( n_{AB}=13 \), \( n_O=284 \). We will have the function print the Dirichlet parameters of \( q_\theta \) at each step of the VBayes algorithm, and then compare the mode of the final approximation to the MAP estimate from the Bayesian EM algorithm (as described in the lectures on Bayesian EM).
# Initialize: Set the initial q_theta to be the prior. Use same prior as in
# Bayesian EM example:
alpha0 <- c(3, 2, 5) # parameters of Dirichlet prior
# No need to set an initial q_z, because we start with a q_z update that
# depends only on initial q_theta.
# Set the observed data
y <- c(186, 38, 13, 284)
vb.alpha <- varBayes(y, alpha0)
## [1] 268.62 62.61 720.77
## [1] 252.46 56.02 743.53
## [1] 249.10 55.64 747.26
## [1] 248.46 55.61 747.93
## [1] 248.33 55.61 748.06
## [1] 248.31 55.61 748.08
## [1] 248.31 55.61 748.08
## [1] 248.31 55.61 748.09
# Mode of the Dirichlet posterior approximation is at
# ((alphaA - 1)/(sum(alpha) - 3), (alphaB - 1)/(sum(alpha) - 3), (alphaO - 1)/(sum(alpha) - 3))
vb.mode <- (vb.alpha - 1)/sum(vb.alpha - 1)
As we can see above, the algorithm converged to its best approximation in around 8 iterations. We can compare the mode of the VBayes approximate posterior to the MAP estimate from the Bayesian EM algorithm:
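The Bayesian EM code from the earlier lectures is not reproduced here, so below is a minimal sketch of a Bayesian EM (MAP) iteration for this example, written to mirror the VBayes updates above via the point-mass restriction discussed earlier: the digamma differences in qz.update are replaced by differences of logs of the current point estimate, and the M-step takes the mode of the Dirichlet returned by qtheta.update. The function name bayesEM.abo and the starting value are assumptions for this sketch, not the lecture's original code.
# Bayesian EM (MAP) sketch for the ABO example, mirroring the VBayes updates
bayesEM.abo <- function(y, alpha0, maxit = 100, tol = 1e-06) {
    theta <- rep(1/3, 3)  # starting value for (pA, pB, pO); an arbitrary choice
    for (iter in 1:maxit) {
        # E-step: success probabilities for the missing counts; as in qz.update,
        # but with log(theta) in place of digamma(alpha)
        pAA <- theta[1]/(theta[1] + theta[3])
        pBB <- theta[2]/(theta[2] + theta[3])
        # M-step: mode of the Dirichlet with parameters from qtheta.update
        alpha <- qtheta.update(y, c(pAA, pBB), alpha0)
        theta.new <- (alpha - 1)/sum(alpha - 1)
        if (max(abs(theta.new - theta)) < tol) {
            return(theta.new)
        }
        theta <- theta.new
    }
    return(theta)
}
# Compare the two point estimates side by side
round(rbind(VBayes.mode = vb.mode, BayesEM.MAP = bayesEM.abo(y, alpha0)), 4)
As discussed above, the two estimates need not agree exactly, since the VBayes approximation is a full Dirichlet rather than a point mass.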
As was mentioned at the beginning, the VBayes algorithm can be used on more general forms of problems than the missing data problem above.
Compare the updates above to the B&G updates, where \( \theta \) above was \( (\theta,z) \).
For more examples, see the Wikipedia page on variational Bayesian methods.
Here we provide the justification for the two claims regarding the updates of \( q_z \) and \( q_\theta \):
Write the updates as:
\[
log\:q_z^{k+1}(z) = E_{q_\theta^k}[log\:p(z,y,\theta)] + const\\
log\:q_\theta^{k+1}(\theta) = E_{q_z^{k+1}}[log\:p(z,y,\theta)] + const
\]
where the constants absorb the normalizing terms.
For (1):
\( E_{q_\theta^k}log\:p(z,y,\theta) = E_{q_\theta^k}[(m_A-1)log\:p_A+(m_B-1)log\:p_B+(m_O-1)log\:p_O+const] \)
Ignoring \( const \), and noting that \( \theta = (p_A,p_B,p_O) \) is what is being integrated over, we have:
\[
E_{q_\theta^k}log\:p(z,y,\theta) = (m_A-1)E_{q_\theta^k}(log\:p_A)+(m_B-1)E_{q_\theta^k}(log\:p_B)+(m_O-1)E_{q_\theta^k}(log\:p_O)
\]
To compute \( E_{q_\theta^k}(log\:p_A) \), note that \( \theta=(p_A,p_B,p_O) \sim \) Dirichlet\( (\alpha_A^k,\alpha_B^k,\alpha_O^k) \) implies \( p_A \sim \) Beta\( (\alpha_A^k,\alpha_B^k+\alpha_O^k) \), and similarly for \( p_B \) and \( p_O \).
For a random variable \( X\sim \) Beta\( (a,b) \), we have \( E(log\:X) = \psi(a) - \psi(a+b) \), where \( \psi(x) \) is the digamma function. Therefore:
\[
E(log\:p_A) = \psi(\alpha_A^k) - \psi(\alpha_A^k+\alpha_B^k+\alpha_O^k)\\
E(log\:p_B) = \psi(\alpha_B^k) - \psi(\alpha_A^k+\alpha_B^k+\alpha_O^k)\\
E(log\:p_O) = \psi(\alpha_O^k) - \psi(\alpha_A^k+\alpha_B^k+\alpha_O^k)
\]
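As a quick illustrative check of the Beta/digamma identity used here (the values of a and b below are arbitrary choices, not taken from the example):
# Compare the analytic value of E(log X) for X ~ Beta(a, b) with a Monte Carlo
# estimate based on simulated draws
set.seed(1)
a <- 3
b <- 5
c(analytic = digamma(a) - digamma(a + b), monte.carlo = mean(log(rbeta(1e+05, a, b))))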
Now expand \( m_A,m_B,m_O \) using \( n_{A/O}=n_A-n_{A/A} \) and \( n_{B/O}=n_B-n_{B/B} \) (so that, e.g., \( m_A = n_A + n_{A/A} + n_{AB} + \alpha_A \)), collect all terms involving \( n_{A/A} \) and \( n_{B/B} \) (our missing data), and simplify to:
\[
log\:q_z^{k+1}(z) = n_{A/A}(\psi(\alpha_A^k)-\psi(\alpha_O^k)) + n_{B/B}(\psi(\alpha_B^k)-\psi(\alpha_O^k)) + const
\]
where \( const \) here collects all the terms that are constant with respect to the missing data, and where \( n_{A/A} \in \{0,1,\dots,n_A\} \) and \( n_{B/B} \in \{0,1,\dots,n_B\} \).
Compare this to the binomial log-pmf: \( Y \sim \) Binomial\( (n,p) \) means
\[
log\:P(Y=j) = j\:log\left(\frac{p}{1-p}\right) + n\:log(1-p) + log\binom{n}{j}, \quad j\in\{0,1,\dots,n\}
\]
We recognize \( log\:q_z^{k+1}(z) \) as the log-pmf of independent Binomial\( (n_A,p_{AA}^k) \) (for \( n_{A/A} \)) and Binomial\( (n_B,p_{BB}^k) \) (for \( n_{B/B} \)) random variables, where
\[
logit(p_{AA}^k) = \psi(\alpha_A^k) - \psi(\alpha_O^k) \implies p_{AA}^k = \frac{exp\{\psi(\alpha_A^k)-\psi(\alpha_O^k)\}}{1+exp\{\psi(\alpha_A^k)-\psi(\alpha_O^k)\}}
\]
and
\[
logit(p_{BB}^k) = \psi(\alpha_B^k) - \psi(\alpha_O^k) \implies p_{BB}^k = \frac{exp\{\psi(\alpha_B^k)-\psi(\alpha_O^k)\}}{1+exp\{\psi(\alpha_B^k)-\psi(\alpha_O^k)\}}
\]
as claimed.
For (2):
\[
E_{q_z^{k+1}}(log\:p(z,y,\theta)) = E_{q_z^{k+1}}[(m_A-1)log\:p_A+(m_B-1)log\:p_B+(m_O-1)log\:p_O + const]\\
= E_{q_z^{k+1}}[m_A-1]log\:p_A+E_{q_z^{k+1}}[m_B-1]log\:p_B+E_{q_z^{k+1}}[m_O-1]log\:p_O + const
\]
This can be recognized as the log-density (up to a constant) of a Dirichlet distribution, with parameters
\[
\alpha_A^{k+1} = E_{q_z^{k+1}}[m_A];\;\alpha_B^{k+1} = E_{q_z^{k+1}}[m_B];\;\alpha_O^{k+1} = E_{q_z^{k+1}}[m_O]
\]
We can use the binomial distributions of \( n_{A/A} \) and \( n_{B/B} \) to show that:
\[
E_{q_z^{k+1}}[m_A] = n_A(1+p_{AA}^k) + n_{AB} + \alpha_A^0\\
E_{q_z^{k+1}}[m_B] = n_B(1+p_{BB}^k) + n_{AB} + \alpha_B^0\\
E_{q_z^{k+1}}[m_O] = 2n_O + n_A(1-p_{AA}^k) + n_B(1-p_{BB}^k) + \alpha_O^0
\]
since, for example, \( m_A = n_A + n_{A/A} + n_{AB} + \alpha_A \) and \( E_{q_z^{k+1}}(n_{A/A}) = n_A p_{AA}^k \). This gives the update for \( q_\theta^{k+1} \) stated in claim (2).
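As an illustrative sanity check connecting these expressions back to the code above (using the objects y, alpha0, and vb.alpha computed in the example), at convergence the fitted \( \alpha \) should approximately reproduce itself under one further pair of updates:
# Differences should be near zero (within the tolerance used in varBayes)
round(vb.alpha - qtheta.update(y, qz.update(vb.alpha), alpha0), 3)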