We will frame much of the following in terms of a missing data problem, with \( \theta \) representing the model parameters, \( y \) representing the observed data, and \( z \) representing the missing data, with \( x=(y,z) \) indicating the complete data. While we will discuss the missing data context, the framework can be used with \( z \) representing the entire set of parameters and latent variables.
We use Variational Bayes (VBayes) as a method to approximate the true posterior distribution in a Bayesian inference model:
\[
p(\theta \mid y) = p(y \mid \theta)\frac{p(\theta)}{p(y)}
\]
and the joint posterior is:
\[
p(\theta,z \mid y) = p(y,z \mid \theta)\frac{p(\theta)}{p(y)}
\]
where:
\[
p(y \mid \theta) = \int p(y,z \mid \theta)dz
\]
and
\[
p(y) = \int p(y,z,\theta)dzd\theta = \int p(y,z \mid \theta)p(\theta)dzd\theta
\]
for a prior distribution \( p(\theta) \).
The marginal likelihood, \( p(y) \), is used for Bayesian model selection and averaging.
Ref: Beal & Ghahramani (2003), hereafter B&G.
Note: the VBayes objective \( F \) (defined below) is an integral, so evaluating it to obtain the estimate \( F(\hat q_z,\hat q_\theta,y) \) of \( log\:p(y) \) may be difficult.
We will carry out the ABO blood group example from the EM algorithm section and apply the VBayes method to estimating the posterior.
We can show that:
\[
log\:p(z,y,\theta) = log\:p(y,z \mid \theta) + log\:p(\theta)\\
= (m_A-1)log(p_A) + (m_B-1)log(p_B) + (m_O-1)log(p_O) + \text{const}
\]
where:
\[
m_A = 2n_{A/A} + n_{A/O} + n_{AB} + \alpha_A\\
m_B = 2n_{B/B} + n_{B/O} + n_{AB} + \alpha_B\\
m_O = 2n_{O/O} + n_{A/O} + n_{B/O} + \alpha_O
\]
and \( (\alpha_A,\alpha_B,\alpha_O) \) are the parameters of the Dirichlet prior for \( \theta \). The unobservable data are \( z=(n_{A/A},n_{B/B}) \), with \( n_{A/O} = n_A-n_{A/A} \) and \( n_{B/O} = n_B-n_{B/B} \). Initialize \( q_\theta^0(\theta) \) to be the prior for \( \theta \); i.e., take \( \alpha_A^0 = \alpha_A \), etc. As mentioned previously, we do not need to initialize \( q_z^0(z) \), since we obtain \( q_z^1(z) \) directly from \( q_\theta^0(\theta) \).
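For concreteness, here is a small R helper (a sketch for this write-up only; the name m.counts is an assumption and is not used by the functions later in the section) that computes \( (m_A,m_B,m_O) \) from the missing counts, the observed counts, and the prior parameters:
# Compute (m_A, m_B, m_O) given the missing counts (nAA, nBB), the observed
# counts y = (nA, nB, nAB, nO), and the Dirichlet prior parameters alpha
m.counts <- function(nAA, nBB, y, alpha) {
    nA <- y[1]
    nB <- y[2]
    nAB <- y[3]
    nO <- y[4]
    c(mA = 2 * nAA + (nA - nAA) + nAB + alpha[1],
      mB = 2 * nBB + (nB - nBB) + nAB + alpha[2],
      mO = 2 * nO + (nA - nAA) + (nB - nBB) + alpha[3])
}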
We can make the following claims (justification in the Appendix):
1. \( q_z^{k+1}(z) \) is a product of \( q_{n_{AA}}^{k+1} \), a Binomial(\( n_A,p_{AA}^k \)) distribution, and \( q_{n_{BB}}^{k+1} \), a Binomial(\( n_B,p_{BB}^k \)) distribution, where:
\[
p_{AA}^k = \frac{e^{\psi(\alpha_A^k) - \psi(\alpha_O^k)}}{1+e^{\psi(\alpha_A^k) - \psi(\alpha_O^k)}}\\
p_{BB}^k = \frac{e^{\psi(\alpha_B^k) - \psi(\alpha_O^k)}}{1+e^{\psi(\alpha_B^k) - \psi(\alpha_O^k)}}
\]
and \( \psi(x) \) is the digamma function, \( \frac{\Gamma^\prime(x)}{\Gamma(x)} \).
2. \( q_\theta^{k+1}(\theta) \) is Dirichlet(\( \alpha_A^{k+1},\alpha_B^{k+1},\alpha_O^{k+1} \)), where
\[
\alpha^{k+1}_A = n_A(1+p_{AA}^k) + n_{AB} + \alpha_A^0\\
\alpha^{k+1}_B = n_B(1+p_{BB}^k) + n_{AB} + \alpha_B^0\\
\alpha^{k+1}_O = 2n_O + n_A(1-p_{AA}^k) + n_B(1-p_{BB}^k) + \alpha_O^0\\
\]
Below are functions in R for carrying out the variational Bayes methodology with the distributions above:
# Variational Bayesian methods on example in section 13.6 of Lange.
# Observed data are counts y = (n_A, n_B, n_AB, n_O); complete data are counts
# x = (n_AA, n_AO, n_BB, n_BO, n_AB, n_O). Take the missing data to be
# z = (n_AA, n_BB), with n_AO = n_A - n_AA and n_BO = n_B - n_BB.
# Define the relevant functions for the example above:
qz.update <- function(alpha) {
    # q_z is a product of binomials with success probabilities pAA and pBB;
    # return these p's
    expit <- function(x) {
        return(exp(x)/(1 + exp(x)))
    }
    pAA <- expit(digamma(alpha[1]) - digamma(alpha[3]))
    pBB <- expit(digamma(alpha[2]) - digamma(alpha[3]))
    return(c(pAA, pBB))
}

qtheta.update <- function(y, pvec, alpha0) {
    # q_theta is a Dirichlet with parameter vector alpha; y is (nA, nB, nAB, nO)
    alphaA <- y[1] * (1 + pvec[1]) + y[3] + alpha0[1]
    alphaB <- y[2] * (1 + pvec[2]) + y[3] + alpha0[2]
    alphaO <- 2 * y[4] + y[1] * (1 - pvec[1]) + y[2] * (1 - pvec[2]) + alpha0[3]
    return(c(alphaA, alphaB, alphaO))
}

varBayes <- function(y, alpha0, maxit = 100, tol = 0.001) {
    # initialize
    iter <- 0
    alpha <- alpha0
    alphadiff <- rep(1, 3)
    while (any(abs(alphadiff) > tol) && iter < maxit) {
        alphaold <- alpha
        pp <- qz.update(alpha)
        alpha <- qtheta.update(y, pp, alpha0)
        alphadiff <- alpha - alphaold
        print(alpha)
        iter <- iter + 1
    }
    return(alpha)
}
We can test this methodology with the observed data \( n_A=186 \), \( n_B=38 \), \( n_{AB}=13 \), \( n_O=284 \). First, though, we should discuss the algorithmic connections between Bayesian EM and VBayes.
To make the connections, we need an alternative statement of the EM algorithm due to Neal & Hinton (1998), hereafter N&H. In the following, \( \tilde p \) is a distribution over the missing data \( z \). Let \( F(\tilde p,\theta) = E_{\tilde p}(log\:p(z,y\mid\theta)) - E_{\tilde p}(log\:\tilde p(z)) \). Then, given the state after iteration \( k \), the EM updates are:
1. E-Step: \( \tilde p^{k+1} = argmax_{\tilde p} F(\tilde p,\theta^k) \)
2. M-step: \( \theta^{k+1} = argmax_{\theta} F(\tilde p^{k+1},\theta) \)
The proof of this is in N&H Theorem 1.
Comments: The motivation for describing EM as a coordinate ascent algorithm over two blocks of arguments, \( (\tilde p, \theta) \), was to formalize EM-like algorithms that iterate over partial updates of the missing data and parameter vectors. By viewing EM as a maximization, we can try alternative maximization algorithms, such as coordinate ascent over more than two blocks of arguments.
N&H Lemma 1 shows that \( argmax_{\tilde p} F(\tilde p,\theta^k) = p(z\mid y,\theta^k) \) (Equation 2), so that the M-step maximizes \( E(log\:p(z,y\mid \theta)\mid y,\theta^k) - E(log\:p(z\mid y,\theta^k)\mid y,\theta^k) \). The first term is \( Q(\theta\mid\theta^k) \) in the standard EM notation, and the second term is constant with respect to \( \theta \). Equation 2 also follows from showing \( F(\tilde p,\theta) = -KL(\tilde p \mid\mid p_\theta) + l_0(\theta) \), where \( p_\theta = p(z\mid y,\theta) \) and \( l_0(\theta) = log\:p(y\mid\theta) \) is the observed-data log likelihood. The \( -KL \) term is maximized when \( KL=0 \), which happens if and only if \( \tilde p = p_\theta \).
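For completeness, this identity can be verified in one line from the factorization \( p(z,y\mid\theta) = p(z\mid y,\theta)\,p(y\mid\theta) \):
\[
F(\tilde p,\theta) = E_{\tilde p}[log\:p(z\mid y,\theta)] + log\:p(y\mid\theta) - E_{\tilde p}[log\:\tilde p(z)] = log\:p(y\mid\theta) - KL(\tilde p \mid\mid p_\theta)
\]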
For Bayesian EM, replace \( F \) with \( F_B(\tilde p,\theta) = F(\tilde p,\theta) + log\:p(\theta) \).
We can algebraically manipulate \( F_B \) to obtain:
\[
F_B(\tilde p,\theta) = \int \tilde p(z) log\:\frac{p(z,y,\theta)}{\tilde p(z)}dz
\]
Compare this to the objective function from VBayes:
\[
F(q_z,q_\theta,y) = \int q_z(z)q_\theta(\theta) log\:\frac{p(z,y,\theta)}{q_z(z)q_\theta(\theta)}dzd\theta
\]
We can see that the Bayesian EM algorithm can be expressed as the VBayes algorithm with \( q_\theta(\theta) \) restricted to the class of point mass functions \( \delta_\theta \) at \( \theta \), so that:
\[
F(\tilde p,\delta_\theta,y) = \int \tilde p(z)\delta_\theta(\theta) log\:\frac{p(z,y,\theta)}{\tilde p(z)\delta_\theta(\theta)}dzd\theta\\
= \int \tilde p(z)\left[\int \delta_\theta(\theta) log\:\frac{p(z,y,\theta)}{\tilde p(z)\delta_\theta(\theta)}d\theta\right]dz\\
= F_B(\tilde p,\theta)
\]
This makes some intuitive sense, as the goal of Bayesian EM is to find the posterior mode (MAP) estimate, while the goal of VBayes is to find an approximation to the posterior. If the approximations are restricted to the class of point masses, it is sensible that the VBayes estimate and the Bayesian EM estimate would be equivalent.
For the example above we can look at both the mode of the VBayes approximation to the posterior and the Bayesian EM MAP estimate. We would not expect them to be exactly the same here, as the class of approximating distributions for VBayes in this example is richer than the class of point mass distributions. We perform the estimation with a Dirichlet prior on \( \theta \) and with the observed data \( n_A=186 \), \( n_B=38 \), \( n_{AB}=13 \), \( n_O=284 \). We will have the function print the Dirichlet parameters of \( q_\theta \) at each step of the VBayes algorithm, and then compare the mode of the final approximation to the MAP estimate from the Bayesian EM algorithm (as described in the lectures on Bayesian EM).
# Initialize: Set the initial q_theta to be the prior. Use same prior as in
# Bayesian EM example:
alpha0 <- c(3, 2, 5) # parameters of Dirichlet prior
# No need to set an initial q_z, because we start with a q_z update that
# depends only on initial q_theta.
# Set the observed data
y <- c(186, 38, 13, 284)
vb.alpha <- varBayes(y, alpha0)
## [1] 268.62 62.61 720.77
## [1] 252.46 56.02 743.53
## [1] 249.10 55.64 747.26
## [1] 248.46 55.61 747.93
## [1] 248.33 55.61 748.06
## [1] 248.31 55.61 748.08
## [1] 248.31 55.61 748.08
## [1] 248.31 55.61 748.09
# Mode of the Dirichlet posterior approximation is at
# ((alphaA - 1)/(sum(alpha) - 3), (alphaB - 1)/(sum(alpha) - 3), (alphaO - 1)/(sum(alpha) - 3))
vb.mode <- (vb.alpha - 1)/sum(vb.alpha - 1)
As we can see above, the algorithm converged to its best approximation in around 8 iterations. We can compare the mode of the VBayes approximate posterior to the MAP estimate from the Bayesian EM algorithm:
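The Bayesian EM code from the earlier lectures is not reproduced here, so below is a minimal sketch of a Bayesian EM (MAP) iteration for this example, written to mirror the VBayes updates above via the point-mass restriction discussed earlier: the digamma differences in qz.update are replaced by differences of logs of the current point estimate, and the M-step takes the mode of the Dirichlet returned by qtheta.update. The function name bayesEM.abo and the starting value are assumptions for this sketch, not the lecture's original code.
# Bayesian EM (MAP) sketch for the ABO example, mirroring the VBayes updates
bayesEM.abo <- function(y, alpha0, maxit = 100, tol = 1e-06) {
    theta <- rep(1/3, 3)  # starting value for (pA, pB, pO); an arbitrary choice
    for (iter in 1:maxit) {
        # E-step: success probabilities for the missing counts; as in qz.update,
        # but with log(theta) in place of digamma(alpha)
        pAA <- theta[1]/(theta[1] + theta[3])
        pBB <- theta[2]/(theta[2] + theta[3])
        # M-step: mode of the Dirichlet with parameters from qtheta.update
        alpha <- qtheta.update(y, c(pAA, pBB), alpha0)
        theta.new <- (alpha - 1)/sum(alpha - 1)
        if (max(abs(theta.new - theta)) < tol) {
            return(theta.new)
        }
        theta <- theta.new
    }
    return(theta)
}
# Compare the two point estimates side by side
round(rbind(VBayes.mode = vb.mode, BayesEM.MAP = bayesEM.abo(y, alpha0)), 4)
As discussed above, the two estimates need not agree exactly, since the VBayes approximation is a full Dirichlet rather than a point mass.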
As was mentioned at the beginning, the VBayes algorithm can be used on more general forms of problems than the missing data problem above.
Compare the updates above to the B&G updates, where \( \theta \) above was \( (\theta,z) \).
For more examples, see the Wikipedia page on variational Bayesian methods.
Here we provide the justification for the two claims regarding the updates of \( q_z \) and \( q_\theta \):
Write the updates as:
\[
log\:q_z^{k+1}(z) = E_{q_\theta^k}[log\:p(z,y,\theta)] + const\\
log\:q_\theta^{k+1}(\theta) = E_{q_z^{k+1}}[log\:p(z,y,\theta)] + const
\]
where the constants absorb the normalizing terms.
For (1):
\( E_{q_\theta^k}log\:p(z,y,\theta) = E_{q_\theta^k}[(m_A-1)log\:p_A+(m_B-1)log\:p_B+(m_O-1)log\:p_O+const] \)
Ignoring \( const \), and noting that \( \theta = (p_A,p_B,p_O) \) is what is being integrated over, we have:
\[
E_{q_\theta^k}log\:p(z,y,\theta) = (m_A-1)E_{q_\theta^k}(log\:p_A)+(m_B-1)E_{q_\theta^k}(log\:p_B)+(m_O-1)E_{q_\theta^k}(log\:p_O)
\]
To compute \( E_{q_\theta^k}(log\:p_A) \), note that \( \theta=(p_A,p_B,p_O) \sim \) Dirichlet\( (\alpha_A^k,\alpha_B^k,\alpha_O^k) \) implies \( p_A \sim \) Beta\( (\alpha_A^k,\alpha_B^k+\alpha_O^k) \), and similarly for \( p_B \) and \( p_O \).
For a random variable \( X\sim \) Beta\( (a,b) \), we have \( E(log\:X) = \psi(a) - \psi(a+b) \), where \( \psi(x) \) is the digamma function. Therefore:
\[
E(log\:p_A) = \psi(\alpha_A^k) - \psi(\alpha_A^k+\alpha_B^k+\alpha_O^k)\\
E(log\:p_B) = \psi(\alpha_B^k) - \psi(\alpha_A^k+\alpha_B^k+\alpha_O^k)\\
E(log\:p_O) = \psi(\alpha_O^k) - \psi(\alpha_A^k+\alpha_B^k+\alpha_O^k)
\]
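As a quick illustrative check of the Beta/digamma identity used here (the values of a and b below are arbitrary choices, not taken from the example):
# Compare the analytic value of E(log X) for X ~ Beta(a, b) with a Monte Carlo
# estimate based on simulated draws
set.seed(1)
a <- 3
b <- 5
c(analytic = digamma(a) - digamma(a + b), monte.carlo = mean(log(rbeta(1e+05, a, b))))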
Now expand \( m_A,m_B,m_O \) using \( n_{A/O}=n_A-n_{A/A} \) and \( n_{B/O}=n_B-n_{B/B} \) (so that, e.g., \( m_A = n_A + n_{A/A} + n_{AB} + \alpha_A \)), collect all terms involving \( n_{A/A} \) and \( n_{B/B} \) (our missing data), and simplify to:
\[
log\:q_z^{k+1}(z) = n_{A/A}(\psi(\alpha_A^k)-\psi(\alpha_O^k)) + n_{B/B}(\psi(\alpha_B^k)-\psi(\alpha_O^k)) + const
\]
where \( const \) here collects all the terms that are constant with respect to the missing data, and where \( n_{A/A} \in \{0,1,\dots,n_A\} \) and \( n_{B/B} \in \{0,1,\dots,n_B\} \).
Compare this to the binomial log-pmf: \( Y \sim \) Binomial\( (n,p) \) means
\[
log\:P(Y=j) = j\:log\left(\frac{p}{1-p}\right) + n\:log(1-p) + log\binom{n}{j}, \quad j\in\{0,1,\dots,n\}
\]
We recognize \( log\:q_z^{k+1}(z) \) as the log-pmf of independent Binomial\( (n_A,p_{AA}^k) \) (for \( n_{A/A} \)) and Binomial\( (n_B,p_{BB}^k) \) (for \( n_{B/B} \)) random variables, where
\[
logit(p_{AA}^k) = \psi(\alpha_A^k) - \psi(\alpha_O^k) \implies p_{AA}^k = \frac{exp\{\psi(\alpha_A^k)-\psi(\alpha_O^k)\}}{1+exp\{\psi(\alpha_A^k)-\psi(\alpha_O^k)\}}
\]
and
\[
logit(p_{BB}^k) = \psi(\alpha_B^k) - \psi(\alpha_O^k) \implies p_{BB}^k = \frac{exp\{\psi(\alpha_B^k)-\psi(\alpha_O^k)\}}{1+exp\{\psi(\alpha_B^k)-\psi(\alpha_O^k)\}}
\]
as claimed.
For (2):
\[
E_{q_z^{k+1}}(log\:p(z,y,\theta)) = E_{q_z^{k+1}}[(m_A-1)log\:p_A+(m_B-1)log\:p_B+(m_O-1)log\:p_O + const]\\
= E_{q_z^{k+1}}[m_A-1]log\:p_A+E_{q_z^{k+1}}[m_B-1]log\:p_B+E_{q_z^{k+1}}[m_O-1]log\:p_O + const
\]
This can be recognized as the log-density (up to a constant) of a Dirichlet distribution, with parameters
\[
\alpha_A^{k+1} = E_{q_z^{k+1}}[m_A];\;\alpha_B^{k+1} = E_{q_z^{k+1}}[m_B];\;\alpha_O^{k+1} = E_{q_z^{k+1}}[m_O]
\]
We can use the binomial distributions of \( n_{A/A} \) and \( n_{B/B} \) to show that:
\[
E_{q_z^{k+1}}[m_A] = n_A(1+p_{AA}^k) + n_{AB} + \alpha_A^0\\
E_{q_z^{k+1}}[m_B] = n_B(1+p_{BB}^k) + n_{AB} + \alpha_B^0\\
E_{q_z^{k+1}}[m_O] = 2n_O + n_A(1-p_{AA}^k) + n_B(1-p_{BB}^k) + \alpha_O^0
\]
since, for example, \( m_A = n_A + n_{A/A} + n_{AB} + \alpha_A \) and \( E_{q_z^{k+1}}(n_{A/A}) = n_A p_{AA}^k \). This gives the update for \( q_\theta^{k+1} \) stated in claim (2).
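As an illustrative sanity check connecting these expressions back to the code above (using the objects y, alpha0, and vb.alpha computed in the example), at convergence the fitted \( \alpha \) should approximately reproduce itself under one further pair of updates:
# Differences should be near zero (within the tolerance used in varBayes)
round(vb.alpha - qtheta.update(y, qz.update(vb.alpha), alpha0), 3)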