  • 1 Variational Inference
    • 1.1 The evidence lower bound (ELBO)
    • 1.2 The mean-field variational family
    • 1.3 Coordinate ascent mean-field variational inference
  • 2 Probit Regression Model
  • 3 Bayesian Probit Regression
    • 3.1 Augmented Model
  • 4 Variational Bayesian Probit Regression Model
    • 4.1 Update factor q(z)
    • 4.2 Update factor q(τ)
    • 4.3 Update factor q(w)
    • 4.4 Computing expectations
    • 4.5 Variational lower bound
    • 4.6 Predictive distribution
    • 4.7 Update equations summary
  • 5 Code implementation in R
    • 5.1 Learn simple probit regression model
    • 5.2 Comparison with Gibbs
  • 6 Binomial Probit Regression Model
    • 6.1 Predictive distribution
    • 6.2 Learn Binomial probit regression model
  • 7 Conclusions

1 Variational Inference

Consider a fully Bayesian probabilistic model in which we denote all observed variables by X and all latent variables by Z. Note that the latent random variables Z include both parameters that might govern all the data, as found in Bayesian models, and latent variables that are “local” to individual data points. Our probabilistic model specifies the joint density p(X, Z) and our goal is to compute the posterior distribution p(Z | X). This conditional can be used to produce point or interval estimates of the latent variables and to form predictive densities of new data, and using Bayes' rule we can write it as p(Z | X) = \frac{p(X, Z)}{p(X)} = \frac{p(X | Z)\, p(Z)}{p(X)} The denominator contains the marginal likelihood (or model evidence), since we marginalize out the latent variables from the joint density p(X) = \int p(X, Z) \; dZ For many models, this evidence integral is unavailable in closed form or requires exponential time to compute. The evidence is what we need to compute the conditional density from the joint; this is why inference in such models is hard.

1.1 The evidence lower bound (ELBO)

In variational inference, we specify a family \mathcal{Q} of densities over the latent variables. Each q(Z) \in \mathcal{Q} is a candidate approximation to the exact posterior density. Our goal is to find the best candidate, i.e. the one that minimizes the Kullback-Leibler (KL) divergence to the exact posterior. Hence, inference now amounts to solving the following optimization problem q^{*}(Z) = \underset{q(Z) \in \mathcal{Q}}{\arg\min} \; \text{KL}\big(q(Z) \,||\, p(Z | X)\big) Note that the complexity of the family \mathcal{Q} determines the complexity of this optimization. Let us examine the KL divergence between these two distributions \begin{aligned} \text{KL}\big(q(Z) \,||\, p(Z | X)\big) & = -\int q(Z) \ln \frac{p(Z | X)}{q(Z)} \; dZ \\ & = \int q(Z) \ln q(Z) \; dZ - \int q(Z) \ln p(Z | X) \; dZ \\ & = \mathbb{E}_{q(Z)}\big[\ln q(Z)\big] - \int q(Z) \ln \frac{p(X, Z)}{p(X)} \; dZ \\ & = \mathbb{E}_{q(Z)}\big[\ln q(Z)\big] + \int q(Z) \ln p(X) \; dZ - \int q(Z) \ln p(X, Z) \; dZ \\ & = \mathbb{E}_{q(Z)}\big[\ln q(Z)\big] + \mathbb{E}_{q(Z)}\big[\ln p(X)\big] - \mathbb{E}_{q(Z)}\big[\ln p(X, Z)\big] \\ & = \mathbb{E}_{q(Z)}\big[\ln q(Z)\big] - \mathbb{E}_{q(Z)}\big[\ln p(X, Z)\big] + \ln p(X) \end{aligned} This quantity is intractable to compute since it depends on the model evidence \ln p(X), which however does not depend on the choice of the variational distribution q(Z). Instead, we optimize an alternative objective function which is equivalent to the KL up to an added constant (i.e. the model evidence) \text{ELBO}(q) = \mathbb{E}_{q(Z)}\big[\ln p(X, Z)\big] - \mathbb{E}_{q(Z)}\big[\ln q(Z)\big] This function is called the evidence lower bound (ELBO) and maximizing the ELBO results in minimizing the KL divergence defined above, since the ELBO is the negative KL plus the model evidence \ln p(X).

Examining the ELBO gives intuitions about the optimal variational density. We rewrite the ELBO as a sum of the expected log likelihood of the data and the KL divergence between the prior p(Z) and q(Z) \begin{aligned} \text{ELBO}(q) & = \mathbb{E}_{q(Z)}\big[\ln p(X, Z)\big] - \mathbb{E}_{q(Z)}\big[\ln q(Z)\big] \\ & = \mathbb{E}_{q(Z)}\big[\ln p(X | Z)\big] + \mathbb{E}_{q(Z)}\big[\ln p(Z)\big] - \mathbb{E}_{q(Z)}\big[\ln q(Z)\big] \\ & = \mathbb{E}_{q(Z)}\big[\ln p(X | Z)\big] - \text{KL}\big(q(Z) \,||\, p(Z)\big) \end{aligned} The first term is an expected likelihood; it encourages densities that place their mass on configurations of the latent variables that explain the observed data. The second term is the negative divergence between the variational density and the prior; it encourages densities close to the prior. Thus, the variational objective mirrors the usual balance between likelihood and prior.

Another property of the ELBO is that it lower-bounds the (log) evidence, \ln p(X) \geq \text{ELBO}(q) for any q(Z). To see this, notice that we can write the log evidence as \ln p(X) = \text{KL}\big(q(Z) \,||\, p(Z | X)\big) + \text{ELBO}(q) The bound then follows from the fact that \text{KL}(q \,||\, p) \geq 0, and \text{ELBO}(q) achieves the bound only when the KL divergence vanishes, that is \text{KL}\big(q(Z) \,||\, p(Z | X)\big) = 0. But the KL(q || p) is zero if and only if q = p, which would lead to using the posterior distribution itself as our proposal. We can derive the same result through Jensen's inequality, Jordan et al. (1999).
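For completeness, the Jensen's inequality route gives the same bound directly, since \ln is concave: \begin{aligned} \ln p(X) & = \ln \int p(X, Z) \; dZ = \ln \int q(Z) \frac{p(X, Z)}{q(Z)} \; dZ \\ & \geq \int q(Z) \ln \frac{p(X, Z)}{q(Z)} \; dZ = \mathbb{E}_{q(Z)}\big[\ln p(X, Z)\big] - \mathbb{E}_{q(Z)}\big[\ln q(Z)\big] = \text{ELBO}(q) \end{aligned}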

1.2 The mean-field variational family

We now describe a variational family \mathcal{Q}, to complete the specification of the optimization problem. The complexity of the family determines the complexity of the optimization; it is more difficult to optimize over a complex family than a simple family. Here we focus on the mean-field variational family, where the latent variables are mutually independent and each is governed by a distinct factor in the variational density. A generic member of the mean-field variational family is q(Z) = \prod_{m=1}^{M} q_{m}(Z_{m}) Each latent variable Z_{m} is governed by its own variational factor, the density q_{m}(Z_{m}). We emphasize that the variational family is not a model of the observed data; indeed, the data X does not appear in the above equation. Instead, it is the ELBO, and the corresponding KL minimization problem, that connects the fitted variational density to the data and model. Also, notice we have not specified the parametric form of the individual variational factors. In principle, each can take on any parametric form appropriate to the corresponding random variable.

1.3 Coordinate ascent mean-field variational inference

Using the ELBO and the mean-field family, we have cast approximate conditional inference as an optimization problem. One of the most commonly used algorithms for solving this optimization problem is coordinate ascent variational inference (CAVI) (Bishop, 2006). CAVI iteratively optimizes each factor of the mean-field variational density, while holding the others fixed. It climbs the ELBO to a local optimum.

Consider the mth latent variable Z_{m}. The complete conditional of Z_{m} is its conditional density given all of the other latent variables in the model and the observations, i.e. p(Z_{m} | Z_{-m}, X). Fix the other variational factors q_{j}(Z_{j}), j \neq m; then the log of the optimal solution for factor q_{m}(Z_{m}) is \ln q_{m}^{*}(Z_{m}) = \mathbb{E}_{j \neq m}\big[\ln p(X, Z)\big] + \text{const} That is, the log of the optimal solution for factor q_{m} is obtained simply by considering the log of the joint distribution over all hidden and observed variables and then taking the expectation w.r.t. all of the other factors \{q_{j}\} for j \neq m.

To obtain the form of q_{m}^{*}(Z_{m}) we exponentiate both sides and normalize to obtain q_{m}^{*}(Z_{m}) = \frac{\exp\big\{\mathbb{E}_{j \neq m}\big[\ln p(X, Z)\big]\big\}}{\int \exp\big\{\mathbb{E}_{j \neq m}\big[\ln p(X, Z)\big]\big\} \; dZ_{m}}

2 Probit Regression Model

Suppose we have I independent binary random variables y_{i} \in \{0, 1\}, where y_{i} \sim \mathcal{B}ernoulli(y_{i} | \gamma_{i}) We assume that the success probability \gamma_{i} is related to a vector of covariates \mathbf{x}_{i} \in \mathbb{R}^{D}. Then, we define the probit regression model as \begin{aligned} \eta_{i} & = \mathbf{w}^{T}\mathbf{x}_{i} \\ \mathbb{E}[y_{i}] & = \gamma_{i} = g^{-1}(\eta_{i}) \end{aligned} where \mathbf{w} represents a D \times 1 column vector of regression coefficients, and g(\cdot) is the link function which allows us to move from natural parameters \eta_{i} to mean parameters \gamma_{i}. The probit model is obtained if we define g^{-1}(\cdot) = \Phi(\cdot), where \Phi(\cdot) denotes the cumulative distribution function (cdf) of the standard normal distribution. The probabilistic graphical model is shown below.
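As a quick illustration of the probit link in R, pnorm() plays the role of \Phi(\cdot); the design matrix and coefficient values below are arbitrary choices for the sketch.

set.seed(1)
I <- 5                                    # Number of observations
X <- cbind(1, rnorm(I))                   # Toy design matrix with intercept
w <- c(-0.5, 2)                           # Arbitrary regression coefficients
eta <- as.vector(X %*% w)                 # Natural parameters eta_i = w'x_i
gamma_i <- pnorm(eta)                     # Mean parameters gamma_i = Phi(eta_i)
y <- rbinom(I, size = 1, prob = gamma_i)  # Bernoulli responses
cbind(eta = eta, gamma = gamma_i, y = y)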

3 Bayesian Probit Regression

In the Bayesian approach we define a prior distribution over the parameters \mathbf{w}. Then the Bayesian probit regression model becomes \begin{aligned} y_{i} & \sim \mathcal{B}ernoulli(y_{i} | \gamma_{i}) \\ \gamma_{i} & = \Phi(\eta_{i}) \\ \eta_{i} & = \mathbf{w}^{T}\mathbf{x}_{i} \\ \mathbf{w} & \sim p(\mathbf{w} | \mathcal{H}) \end{aligned} where \mathcal{H} are (hyper)-parameters of the prior. Collecting the I response variables in \mathbf{y} \in \mathbb{R}^{I} and the covariates in \mathbf{X} \in \mathbb{R}^{I \times D}, the posterior distribution over the parameters is given by \begin{aligned} p(\mathbf{w} | \mathbf{y}, \mathbf{X}) & \propto p(\mathbf{w} | \mathcal{H}) \, p(\mathbf{y} | \mathbf{X}, \mathbf{w}) = p(\mathbf{w} | \mathcal{H}) \prod_{i=1}^{I} p(y_{i} | \mathbf{x}_{i}, \mathbf{w}) \\ & = p(\mathbf{w} | \mathcal{H}) \prod_{i=1}^{I} \Phi(\mathbf{w}^{T}\mathbf{x}_{i})^{y_{i}} \big(1 - \Phi(\mathbf{w}^{T}\mathbf{x}_{i})\big)^{(1 - y_{i})} \end{aligned}
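Even though this posterior is not available in closed form, its unnormalized log-density is straightforward to write down. A minimal R sketch, assuming for illustration a standard normal prior p(\mathbf{w} | \mathcal{H}) = \mathcal{N}(\mathbf{w} | \mathbf{0}, \mathbf{I}):

# Unnormalized log-posterior of the probit model, assuming a standard normal
# prior on w; known only up to the (intractable) model evidence.
probit_log_post <- function(w, X, y) {
  eta <- as.vector(X %*% w)                                # Linear predictor
  loglik <- sum(dbinom(y, size = 1, prob = pnorm(eta), log = TRUE))
  logprior <- sum(dnorm(w, mean = 0, sd = 1, log = TRUE))
  return(loglik + logprior)
}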

3.1 Augmented Model

Performing inference for this model in the Bayesian framework is complicated by the fact that no conjugate prior p(w|H) exists for the parameters of the probit regression model. To overcome this problem, Albert and Chib (1993) augmented the original model with an additional auxiliary latent variable that renders the conditional distributions of the model parameters equivalent to those under a Bayesian normal linear regression model with Gaussian noise.

Let us introduce I independent latent variables z_{i}, where each z_{i} follows a Gaussian distribution centered at \mathbf{w}^{T}\mathbf{x}_{i}, i.e. z_{i} \sim \mathcal{N}(\mathbf{w}^{T}\mathbf{x}_{i}, 1). Then the augmented probit model has the following hierarchical structure

\begin{aligned} \tau & \sim \mathcal{G}\text{amma}(\tau | \alpha_{0}, \beta_{0}) \\ \mathbf{w} | \tau & \sim \mathcal{N}(\mathbf{w} | \mathbf{m}_{0}, \tau^{-1}\mathbf{I}) \\ z_{i} | \mathbf{w} & \sim \mathcal{N}(z_{i} | \mathbf{w}^{T}\mathbf{x}_{i}, 1) \\ y_{i} | z_{i} & = \begin{cases} 1 & \; \text{if } z_{i} > 0 \\ 0 & \; \text{if } z_{i} \leq 0 \end{cases} \end{aligned} where y_{i} is now deterministic conditional on the sign of the latent variable z_{i}. Hence, we formulate the original problem as a missing data problem, where we have a normal regression model on the latent variables z_{i} and the observed responses y_{i} are incomplete, in that we only observe whether z_{i} > 0 or z_{i} \leq 0. Note also that we can now put a conjugate Gaussian prior over the parameters \mathbf{w}. A typical choice for the conjugate prior mean would be \mathbf{m}_{0} = \mathbf{0}. Regarding the precision parameter \tau, we can either choose a small value to make the prior uninformative, or introduce a Gamma prior as shown above.
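For intuition, here is a short R sketch of the generative process implied by the augmented model (the hyperparameter values and design matrix are arbitrary illustrations):

set.seed(1)
I <- 100; D <- 2
alpha_0 <- 0.1; beta_0 <- 0.1                     # Illustrative Gamma hyperparameters
X <- cbind(1, rnorm(I))                           # Toy design matrix with intercept
tau <- rgamma(1, shape = alpha_0, rate = beta_0)  # tau ~ Gamma(alpha_0, beta_0)
w <- rnorm(D, mean = 0, sd = sqrt(1 / tau))       # w | tau ~ N(0, tau^{-1} I)
z <- rnorm(I, mean = X %*% w, sd = 1)             # Latent z_i | w ~ N(w'x_i, 1)
y <- as.integer(z > 0)                            # Observed responses y_i = 1(z_i > 0)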

Thus, the joint distribution over all the variables is given by p(\mathbf{y}, \mathbf{z}, \mathbf{w}, \tau | \mathbf{X}) = p(\mathbf{y} | \mathbf{z}) \, p(\mathbf{z} | \mathbf{w}, \mathbf{X}) \, p(\mathbf{w} | \tau) \, p(\tau) = \left\{\prod_{i=1}^{I} p(y_{i} | z_{i}) \, p(z_{i} | \mathbf{x}_{i}, \mathbf{w})\right\} p(\mathbf{w} | \tau) \, p(\tau) where p(y_{i} | z_{i}) = \mathbf{1}(z_{i} > 0)^{y_{i}} \, \mathbf{1}(z_{i} \leq 0)^{(1 - y_{i})} and \boldsymbol{1}(\cdot) is the indicator function, equal to 1 if the condition inside holds, and 0 otherwise. One way to approximate the posterior is the Gibbs sampling algorithm, see my previous post http://rpubs.com/cakapourani/bayesian-binary-probit-model. Here we will approximate the posterior using variational inference.

4 Variational Bayesian Probit Regression Model

To apply the variational inference machinery, we will divide our parameters into three blocks: the latent variables \mathbf{z}, the coefficient parameters \mathbf{w} and the precision parameter \tau. Hence, the variational posterior distribution is given by the factorised expression q(\mathbf{z}, \mathbf{w}, \tau) = q(\mathbf{z}) q(\mathbf{w}) q(\tau) \approx p(\mathbf{z}, \mathbf{w}, \tau | \mathbf{y}, \mathbf{X}) In addition to this factorization, we have an induced factorization: the latent variables z_{i} are independent given the observations y_{i}. Hence our variational posterior becomes \begin{aligned} q(\mathbf{z}, \mathbf{w}, \tau) = q(\mathbf{z}) q(\mathbf{w}) q(\tau)= q(\mathbf{w})q(\tau)\prod_{i=1}^{I}q(z_{i}) \end{aligned} It should be noted that this is the only assumption we need to make in order to obtain a tractable practical solution to our Bayesian probit regression model. In particular, the functional form of the factors q(\mathbf{z}), q(\mathbf{w}) and q(\tau) will be determined automatically by optimization of the variational distribution.

4.1 Update factor q(\mathbf{z})

Let us consider the derivation of the update equation for the factor q(z_{i}) and we assume that the corresponding y_{i} = 1. The log of the optimized factor is given by \begin{aligned} \text{ln}\;q^{*}(z_{i}) & = \mathbb{E}_{q(\mathbf{w},\tau)}\Big[\text{ln}\;p(y_{i}, z_{i}, \mathbf{w}, \tau | \mathbf{x}_{i})\Big] + \text{const} \\ & = \ln p(y_{i} | z_{i}) + \mathbb{E}_{q(\mathbf{w})} \Big[\ln p(z_{i} | \mathbf{x}_{i}, \mathbf{w})\Big] + \underbrace{\mathbb{E}_{q(\mathbf{w})}\Big[\ln p(\mathbf{w} | \tau)\Big]}_{\text{const}} + \underbrace{\ln p(\tau)}_{\text{const}} + \text{const} \\ & = \ln p(y_{i} | z_{i}) + \mathbb{E}_{q(\mathbf{w})} \Big[\ln \mathcal{N}(z_{i} | \mathbf{w}^{T}\mathbf{x}_{i}, 1)\Big] + \text{const} \\ & = y_{i}\ln\mathbf{1}(z_{i} > 0) + \underbrace{(1 - y_{i})\ln\mathbf{1}(z_{i} \leq 0)}_{0,\; \text{since}\;y_{i}=1} -\frac{1}{2} \mathbb{E}_{q(\mathbf{w})} \Big[ (z_{i} - \mathbf{w}^{T}\mathbf{x}_{i})^{2}\Big] + \text{const} \\ & = \ln\mathbf{1}(z_{i} > 0) -\frac{1}{2} z_{i}^{2} + z_{i}\mathbb{E}_{q(\mathbf{w})} \Big[\mathbf{w}^{T}\Big]\mathbf{x}_{i} - \underbrace{\frac{1}{2}\mathbb{E}_{q(\mathbf{w})} \Big[\mathbf{w}^{T}\mathbf{x}_{i} \mathbf{x}_{i}^{T}\mathbf{w}\Big]}_{\text{const}} + \text{const} \\ & = \ln\mathbf{1}(z_{i} > 0) -\frac{1}{2} z_{i}^{2} + z_{i}\mathbb{E}_{q(\mathbf{w})} \Big[\mathbf{w}^{T}\Big]\mathbf{x}_{i} + \text{const} \end{aligned} Exponentiating this quantity and setting \mu_{i} = \mathbb{E}_{q(\mathbf{w})}\big[\mathbf{w}^{T}\big]\mathbf{x}_{i} we obtain q^{*}(z_{i}) \propto \mathbf{1}(z_{i} > 0) \exp\left(-\frac{1}{2} z_{i}^{2} + z_{i}\mu_{i} \right)

We observe that the optimized factor q^{*}(z_{i}) is an un-normalized Truncated Normal distribution, so we can complete the square to identify the mean and covariance q^{*}(z_{i}) = \begin{cases} \mathcal{TN}_{+}\left(z_{i} | \mu_{i}, 1 \right) & \; \text{if } y_{i} = 1\\ \mathcal{TN}_{-}\left(z_{i} | \mu_{i}, 1 \right) & \; \text{if } y_{i} = 0\\ \end{cases} where \mathcal{TN}_{+}(\cdot) denotes the normal distribution truncated on the left tail to zero to contain only positive values, and \mathcal{TN}_{-}(\cdot) denotes the normal distribution truncated on the right tail to zero to contain only negative values.

To complete the square we make use of the fact that the exponent in a general Gaussian distribution \mathcal{N}(\mathbf{x} | \pmb{\mu}, \pmb{\Sigma}) can be written -\frac{1}{2}(\mathbf{x}-\pmb{\mu})^{T}\pmb{\Sigma}^{-1}(\mathbf{x}-\pmb{\mu}) = -\frac{1}{2}\mathbf{x}^{T}\pmb{\Sigma}^{-1}\mathbf{x} + \mathbf{x}^{T}\pmb{\Sigma}^{-1}\pmb{\mu} + \text{const} where const denotes terms that are independent of \mathbf{x}, and we have made use of the symmetry of \pmb{\Sigma}.
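Applying this identity to the exponent of q^{*}(z_{i}) derived above gives -\frac{1}{2} z_{i}^{2} + z_{i}\mu_{i} = -\frac{1}{2}(z_{i} - \mu_{i})^{2} + \text{const}, so the underlying Gaussian has mean \mu_{i} and unit variance, while the indicator term \mathbf{1}(z_{i} > 0) (or \mathbf{1}(z_{i} \leq 0) when y_{i} = 0) supplies the truncation.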

4.2 Update factor q(\tau)

The log of the optimized factor is given by \begin{align} \ln q^{*}(\tau) & = \mathbb{E}_{q(\mathbf{z},\mathbf{w})}\Big[\ln p(\mathbf{y}, \mathbf{z}, \mathbf{w}, \tau | \mathbf{X}) \Big] + \text{const} \\ & = \underbrace{\mathbb{E}_{q(\mathbf{z})}\Big[\ln p(\mathbf{y} | \mathbf{z})\Big]}_{\text{const}} + \underbrace{\mathbb{E}_{q(\mathbf{z},\mathbf{w})} \Big[\ln p(\mathbf{z} | \mathbf{X}, \mathbf{w})\Big]}_{\text{const}} + \mathbb{E}_{q(\mathbf{w})}\Big[\ln p(\mathbf{w} | \tau)\Big] + \ln p(\tau) + \text{const} \\ & = \mathbb{E}_{q(\mathbf{w})}\Big[\ln p(\mathbf{w} | \tau)\Big] + \ln p(\tau) + \text{const} \\ & = \underbrace{\frac{D}{2}\ln\tau - \frac{\tau}{2} \mathbb{E}_{q(\mathbf{w})}[\mathbf{w}^{T} \mathbf{w}]}_{\text{Gaussian PDF}} + \underbrace{(\alpha_{0} - 1)\ln\tau - \beta_{0}\tau}_{\text{Gamma PDF}}\\ & = \underbrace{(\alpha_{0} + \frac{D}{2} - 1)}_{\alpha\;\text{parameter}}\ln\tau - \Big(\underbrace{\beta_{0} + \frac{1}{2}\mathbb{E}_{q(\mathbf{w})}[\mathbf{w}^{T}\mathbf{w}]}_{\beta\;\text{parameter}}\Big)\tau \end{align} which is the log of the (unnormalized) Gamma distribution, leading to \begin{aligned} q^{*}(\tau) & = \mathcal{G}\text{amma}(\tau | \alpha, \beta) \\ \alpha & = \alpha_{0} + \frac{D}{2} \\ \beta & = \beta_{0} + \frac{1}{2}\mathbb{E}_{q(\mathbf{w})}[\mathbf{w}^{T}\mathbf{w}] \end{aligned}

4.3 Update factor q(\mathbf{w})

Finally, let us consider the factor q(\mathbf{w}) in the variational posterior distribution. Using again the general expression for the optimized factor we have \begin{aligned} \text{ln}\;q^{*}(\mathbf{w}) & = \mathbb{E}_{q(\mathbf{z},\tau)}\Big[\ln p(\mathbf{y}, \mathbf{z}, \mathbf{w}, \tau | \mathbf{X}) \Big] + \text{const} \\ & = \underbrace{\mathbb{E}_{q(\mathbf{z})}\Big[\ln p(\mathbf{y} | \mathbf{z})\Big]}_{\text{const}} + \mathbb{E}_{q(\mathbf{z})} \Big[\ln p(\mathbf{z} | \mathbf{X}, \mathbf{w})\Big] + \mathbb{E}_{q(\tau)}\Big[\ln p(\mathbf{w} | \tau)\Big] + \underbrace{\ln p(\tau)}_{\text{const}} + \text{const} \\ & = \mathbb{E}_{q(\mathbf{z})} \Big[\ln \mathcal{N}(\mathbf{z} | \mathbf{X} \mathbf{w}, \mathbf{I})\Big] - \frac{1}{2} \mathbb{E}_{q(\tau)}\big[\tau\big]\mathbf{w}^{T}{\mathbf{w}} + \text{const} \\ & = \mathbb{E}_{q(\mathbf{z})} \Big[-\frac{1}{2}(\mathbf{z} - \mathbf{X} \mathbf{w})^{T}(\mathbf{z} - \mathbf{X} \mathbf{w})\Big] - \frac{1}{2} \mathbb{E}_{q(\tau)}\big[\tau\big]\mathbf{w}^{T}{\mathbf{w}} + \text{const} \\ & = \mathbb{E}_{q(\mathbf{z})} \left[-\frac{1}{2}\Big( \underbrace{\mathbf{z}^{T} \mathbf{z}}_{\text{const}} - 2 \mathbf{w}^{T}\mathbf{X}^{T}\mathbf{z} + \mathbf{w}^{T}\mathbf{X}^{T}\mathbf{X}\mathbf{w}\Big)\right] - \frac{1}{2} \mathbb{E}_{q(\tau)}\big[\tau\big]\mathbf{w}^{T}{\mathbf{w}} + \text{const} \\ & = \mathbf{w}^{T}\mathbf{X}^{T}\mathbb{E}_{q(\mathbf{z})} \big[\mathbf{z}\big] -\frac{1}{2} \mathbf{w}^{T}\mathbf{X}^{T}\mathbf{X}\mathbf{w} - \frac{1}{2} \mathbb{E}_{q(\tau)}\big[\tau\big]\mathbf{w}^{T}{\mathbf{w}} + \text{const} \\ & = \mathbf{w}^{T}\mathbf{X}^{T}\mathbb{E}_{q(\mathbf{z})} \big[\mathbf{z}\big] - \frac{1}{2} \mathbf{w}^{T}\Big\{\mathbb{E}_{q(\tau)}\big[\tau\big]\mathbf{I} + \mathbf{X}^{T}\mathbf{X}\Big\}\mathbf{w} + \text{const} \end{aligned}

By exponentiating this quantity, we observe that the optimized factor q^{*}(\mathbf{w}) is an un-normalized multivariate Normal distribution, so we can complete the square to identify the mean and covariance \begin{aligned} q^{*}(\mathbf{w}) & = \mathcal{N}(\mathbf{w} | \mathbf{m}, \mathbf{S}) \\ \mathbf{m} & = \mathbf{S} \mathbf{X}^{T}\mathbb{E}_{q(\mathbf{z})}\big[\mathbf{z}\big]\\ \mathbf{S} & = \Big(\mathbb{E}_{q(\tau)}\big[\tau\big]\mathbf{I} + \boldsymbol{X}^{T}\boldsymbol{X}\Big)^{-1} \end{aligned}

4.4 Computing expectations

The derivations of the optimized variational factors above involved expectations with respect to the remaining variational distributions. These expectations are computed as follows

Term \mathbb{E}_{q(\mathbf{w})}\big[\mathbf{w}^{T}\big]: The factor q(\mathbf{w}) is a Gaussian distribution \mathcal{N}(\mathbf{w} | \mathbf{m}, \mathbf{S}), hence its expected value is \begin{align} \mathbb{E}_{q(\mathbf{w})}\big[\mathbf{w}^{T}\big] = \mathbf{m}^{T} \end{align}

Term \mathbb{E}_{q(\tau)}\big[\tau\big]: The factor q(\tau) is a Gamma distribution \mathcal{G}amma(\tau | \alpha, \beta), hence its expected value is \mathbb{E}_{q(\tau)}\big[\tau\big] = \frac{\alpha}{\beta}

Term \mathbb{E}_{q(\mathbf{w})}\big[\mathbf{w}^{T}\mathbf{w}\big]: The factor q(\mathbf{w}) is a Gaussian distribution \mathcal{N}(\mathbf{w} | \mathbf{m}, \mathbf{S}), hence we have \begin{align} \mathbb{E}_{q(\mathbf{w})}\big[\mathbf{w}^{T}\mathbf{w}\big] = \text{tr}\Big(\mathbb{E}_{q(\mathbf{w})} \big[\mathbf{w}\mathbf{w}^{T} \big]\Big) = \text{tr}\Big(\mathbf{m}\mathbf{m}^{T} + \mathbf{S}\Big) = \text{tr}\left(\mathbf{m}\mathbf{m}^{T}\right) + \text{tr}(\mathbf{S}) = \mathbf{m}^{T}\mathbf{m} + \text{tr}(\mathbf{S}) \\ \end{align} Term \mathbb{E}_{q(z_{i})}\big[z_{i}\big]: The factor q(z_{i}) is a Truncated Gaussian distribution as defined above, hence from standard results its expected value is given by \mathbb{E}_{q(z_{i})}\big[z_{i}\big] = \begin{cases} \mu_{i} + \phi_{i}/(1 - \Phi_{i}) & \; \text{if } y_{i} = 1\\ \mu_{i} - \phi_{i}/\Phi_{i} & \; \text{if } y_{i} = 0\\ \end{cases} where \phi_{i} = \phi(-\mu_{i}) is the standard normal density and \Phi_{i} = \Phi(-\mu_{i}) is the cumulative distribution function (cdf) of the standard normal distribution, as defined above.
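As a quick numerical sanity check, these closed-form expectations can be compared against truncnorm::etruncnorm() (the truncnorm package is loaded in the code section below); the value \mu_{i} = 0.3 is arbitrary.

mu_i <- 0.3   # Arbitrary illustrative value of mu_i = E[w]'x_i
phi_i <- dnorm(-mu_i); Phi_i <- pnorm(-mu_i)
# Closed-form expectations of the truncated normals
c(y1 = mu_i + phi_i / (1 - Phi_i),   # E[z_i] when y_i = 1 (truncation to z > 0)
  y0 = mu_i - phi_i / Phi_i)         # E[z_i] when y_i = 0 (truncation to z <= 0)
# Same quantities via the truncnorm package
c(y1 = truncnorm::etruncnorm(a = 0, b = Inf, mean = mu_i, sd = 1),
  y0 = truncnorm::etruncnorm(a = -Inf, b = 0, mean = mu_i, sd = 1))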

4.5 Variational lower bound

The variational lower bound \mathcal{L}(q) (i.e. evidence lower bound (ELBO) ) is given by \begin{aligned} \mathcal{L}(q) & = \int\int\int q(\mathbf{z}, \mathbf{w}, \tau) \ln\left\{\frac{p(\mathbf{y}, \mathbf{z}, \mathbf{w}, \tau | \mathbf{X})}{q(\mathbf{z}, \mathbf{w}, \tau)}\right\}d \mathbf{z}\;d\mathbf{w}\;d\tau \\ & = \;\mathbb{E}_{q(\mathbf{z}, \mathbf{w}, \tau)}\Big[\ln p(\mathbf{y}, \mathbf{z}, \mathbf{w}, \tau | \mathbf{X})\Big] - \mathbb{E}_{q(\mathbf{z}, \mathbf{w}, \tau)}\Big[\ln q(\mathbf{z}, \mathbf{w}, \tau)\Big] \\ & = \;\mathbb{E}_{q(\mathbf{z})}\Big[\ln p(\mathbf{y} | \mathbf{z})\Big] + \mathbb{E}_{q(\mathbf{z}, \mathbf{w})}\Big[\ln p(\mathbf{z} | \mathbf{X}, \mathbf{w})\Big] + \mathbb{E}_{q(\mathbf{w}, \tau)} \Big[\ln p(\mathbf{w} | \tau)\Big] + \mathbb{E}_{q(\tau) }\Big[\ln p(\tau)\Big] \\ & \quad \qquad- \mathbb{E}_{q(\mathbf{z})}\Big[\ln q(\mathbf{z})\Big] - \mathbb{E}_{q(\mathbf{w})}\Big[\ln q(\mathbf{w})\Big] - \mathbb{E}_{q(\tau)} \Big[\ln q(\tau)\Big] \end{aligned}

Note that the terms involving expectations of \ln q(\cdot) distributions simply represent the negative information entropies \mathbb{H}(\cdot) of those distributions. The various terms in the ELBO are evaluated to give

\begin{aligned} \mathbb{E}_{q(\mathbf{z})}\Big[\ln p(\mathbf{y} | \mathbf{z})\Big] & = \sum_{i=1}^{I} \mathbb{E}_{q(z_{i})}\Big[y_{i}\ln \mathbf{1}(z_{i} > 0) + (1 - y_{i}) \ln \mathbf{1}(z_{i} \leq 0) \Big] \\ & = \sum_{i=1}^{I} \int_{-\infty}^{\infty} q(z_{i})\Big\{y_{i}\ln \mathbf{1}(z_{i} > 0) + (1 - y_{i}) \ln \mathbf{1}(z_{i} \leq 0) \Big\}\;d z_{i} \\ & = \sum_{i=1}^{I} \begin{cases} \int_{-\infty}^{\infty} q(z_{i}) \ln \mathbf{1}(z_{i} > 0)\; d z_{i} & \; \text{if } y_{i} = 1 \\ \int_{-\infty}^{\infty} q(z_{i}) \ln \mathbf{1}(z_{i} \leq 0)\; d z_{i} & \; \text{if } y_{i} = 0 \\ \end{cases} \\ & = \sum_{i=1}^{I} \begin{cases} \int_{-\infty}^{\infty} 0\; d z_{i} & \; \text{if } y_{i} = 1 \\ \int_{-\infty}^{\infty} 0\; d z_{i} & \; \text{if } y_{i} = 0 \\ \end{cases} \\ & = 0 \\ \mathbb{E}_{q(\mathbf{z}, \mathbf{w})}\Big[\ln p(\mathbf{z} | \mathbf{X}, \mathbf{w})\Big] & = \mathbb{E}_{q(\mathbf{z}, \mathbf{w})} \Big[ \ln \mathcal{N}(\mathbf{z} | \mathbf{X}\mathbf{w}, \mathbf{I})\Big] \\ & = -\frac{I}{2}\ln2\pi -\frac{1}{2} \mathbb{E}_{q(\mathbf{z}, \mathbf{w})}\Big[(\mathbf{z} - \mathbf{Xw})^{T}(\mathbf{z} - \mathbf{Xw}) \Big]\\ & = -\frac{I}{2}\ln2\pi -\frac{1}{2} \mathbb{E}_{q(\mathbf{z})}\Big[\mathbf{z}^{T}\mathbf{z}\Big] + \mathbb{E}_{q(\mathbf{z})}\Bigg[\mathbb{E}_{q(\mathbf{w})}\Big[\mathbf{w}^{T}\Big]\mathbf{X}^{T}\mathbf{z}\Bigg] -\frac{1}{2} \mathbb{E}_{q(\mathbf{w})}\Big[ \mathbf{w}^{T}\mathbf{X}^{T}\mathbf{X}\mathbf{w} \Big]\\ & = -\frac{I}{2}\ln2\pi -\frac{1}{2} \mathbb{E}_{q(\mathbf{z})}\Big[\mathbf{z}^{T}\mathbf{z}\Big] + \mathbb{E}_{q(\mathbf{z})}\Big[\pmb{\mu}^{T}\mathbf{z}\Big] -\frac{1}{2} \text{tr}\left(\mathbf{X}^{T} \mathbf{X}\mathbb{E}_{q(\mathbf{w})} \Big[\mathbf{w}\mathbf{w}^{T} \Big]\right) \\ & = -\frac{I}{2}\ln2\pi -\frac{1}{2} \mathbb{E}_{q(\mathbf{z})}\Big[\mathbf{z}^{T}\mathbf{z}\Big] + \mathbb{E}_{q(\mathbf{z})} \Big[\pmb{\mu}^{T}\mathbf{z}\Big] -\frac{1}{2} \text{tr}\left(\mathbf{X}^{T}\mathbf{X} \Big(\mathbf{mm}^{T} + \mathbf{S}\Big)\right) \\ \mathbb{E}_{q(\mathbf{w}, \tau)}\Big[\ln p(\mathbf{w} | \tau)\Big] & = \mathbb{E}_{q(\mathbf{w}, \tau)} \left[ -\frac{D}{2}\ln 2\pi - \frac{D}{2}\ln \tau^{-1} - \frac{1}{2\tau^{-1}} \mathbf{w}^{T}\mathbf{w}\right] \\ & = -\frac{D}{2}\ln 2\pi + \frac{D}{2}\mathbb{E}_{q(\tau)} \big[\ln \tau\big] - \frac{1}{2}\mathbb{E}_{q(\tau)}\big[\tau\big] \mathbb{E}_{q(\mathbf{w})} \big[\mathbf{w}^{T}\mathbf{w}\big] \\ & = -\frac{D}{2}\ln 2\pi + \frac{D}{2}\Big(\psi(\alpha) - \ln \beta\Big) - \frac{\alpha}{2\beta} \Big(\mathbf{m}^{T}\mathbf{m} + \text{tr}(\mathbf{S})\Big) \\ \mathbb{E}_{q(\tau)}\Big[\ln p(\tau)\Big] & = \alpha_{0}\ln \beta_{0} + (\alpha_{0} - 1)\Big(\psi(\alpha) - \ln \beta\Big) - \beta_{0}\frac{\alpha}{\beta} - \ln\Gamma(\alpha_{0})\\ \mathbb{E}_{q(\mathbf{z})}\Big[\ln q(\mathbf{z})\Big] & = \sum_{i=1}^{I} \mathbb{E}_{q(z_{i})}\left[ \ln \left(\mathcal{TN}_{+}(z_{i} | \mu_{i}, 1)^{y_{i}} \mathcal{TN}_{-}(z_{i} | \mu_{i}, 1)^{(1-y_{i})}\right)\right] \\ & = \sum_{i=1}^{I} \mathbb{E}_{q(z_{i})}\left[ \ln \left\{\mathcal{N}(z_{i} | \mu_{i}, 1) \left(\frac{1}{1 - \Phi_{i}}\right)^{y_{i}} \left(\frac{1}{\Phi_{i}}\right)^{(1-y_{i})}\right\}\right] \\ & = \mathbb{E}_{q(\mathbf{z})}\Big[ \ln\mathcal{N}(\mathbf{z} | \pmb{\mu}, \mathbf{I})\Big] - \sum_{i=1}^{I}\Bigg\{ y_{i}\ln \left(1-\Phi_{i}\right) + (1-y_{i})\ln\left(\Phi_{i}\right)\Bigg\} \\ & = -\frac{I}{2}\ln2\pi -\frac{1}{2} \mathbb{E}_{q(\mathbf{z})}\Big[(\mathbf{z} - \pmb{\mu})^{T}(\mathbf{z} - \pmb{\mu}) \Big] - \sum_{i=1}^{I}\Bigg\{ y_{i}\ln \left(1-\Phi_{i}\right) + (1-y_{i})\ln\left(\Phi_{i}\right)\Bigg\} \\ & = -\frac{I}{2}\ln2\pi -\frac{1}{2} \mathbb{E}_{q(\mathbf{z})}\Big[\mathbf{z}^{T}\mathbf{z}\Big] + \mathbb{E}_{q(\mathbf{z})} \Big[\pmb{\mu}^{T}\mathbf{z}\Big] - \frac{1}{2} \pmb{\mu}^{T}\pmb{\mu} - \sum_{i=1}^{I}\Bigg\{ y_{i}\ln \left(1-\Phi_{i}\right) + (1-y_{i})\ln\left(\Phi_{i}\right)\Bigg\} \\ \mathbb{E}_{q(\mathbf{w})}\Big[\ln q(\mathbf{w})\Big] & = -\frac{1}{2}\ln |\mathbf{S}| - \frac{D}{2}(1 + \ln 2\pi) \\ \mathbb{E}_{q(\tau)}\Big[\ln q(\tau)\Big] & = -\ln\Gamma(\alpha) + (\alpha - 1)\psi(\alpha) + \ln \beta - \alpha \end{aligned} where \Phi_{i} \equiv \Phi(-\mu_{i}), and \Phi(\cdot) is the cumulative distribution function (cdf) of the standard normal distribution. The quantity \mathbb{E}_{q(\mathbf{z})}\Big[\ln p(\mathbf{y} | \mathbf{z})\Big] is always zero under the variational factors, hence it does not contribute to the ELBO. Also, by inspecting the expectations \mathbb{E}_{q(\mathbf{z}, \mathbf{w})}\Big[\ln p(\mathbf{z} | \mathbf{X}, \mathbf{w})\Big] and \mathbb{E}_{q(\mathbf{z})}\Big[\ln q(\mathbf{z})\Big] some of the elements cancel out, thus we have \begin{align} \mathbb{E}_{q(\mathbf{z}, \mathbf{w})}\Big[\ln p(\mathbf{z} | \mathbf{X}, \mathbf{w})\Big] - \mathbb{E}_{q(\mathbf{z})}\Big[\ln q(\mathbf{z})\Big] & = -\frac{1}{2} \text{tr}\left(\mathbf{X}^{T}\mathbf{X} \Big(\mathbf{mm}^{T} + \mathbf{S}\Big)\right) + \frac{1}{2} \pmb{\mu}^{T}\pmb{\mu} + \\ & \qquad\quad\sum_{i=1}^{I}\Bigg\{ y_{i}\ln \left(1-\Phi_{i}\right) + (1-y_{i})\ln\left(\Phi_{i}\right)\Bigg\} \\ \end{align}

4.6 Predictive distribution

The predictive distribution over y_{*}, given a new input \mathbf{x}_{*}, is evaluated using the variational posterior for the parameters \begin{align} p(y_{*} | \mathbf{x}_{*}, \mathbf{X}, \mathbf{y}) & = \int_{-\infty}^{\infty}\int\int p(y_{*}, z, \mathbf{w}, \tau | \mathbf{x}_{*}, \mathbf{X}, \mathbf{y})\;d\tau\;d\mathbf{w}\;dz \\ & = \int_{-\infty}^{\infty}\int\int p(y_{*} | z) \;p(z | \mathbf{w}, \mathbf{x}_{*}) p(\mathbf{w}, \tau | \mathbf{X}, \mathbf{y})\;d\tau\;d\mathbf{w}\;dz \\ & \approx \int_{-\infty}^{\infty}\int\int p(y_{*}|z) \;p(z|\mathbf{w},\mathbf{x}_{*}) q(\mathbf{w}) q(\tau)\;d\tau \;d\mathbf{w}\;dz \\ & = \int_{-\infty}^{\infty}\int \mathbf{1}(z > 0)^{y_{*}} \mathbf{1}(z \leq 0)^{(1 - y_{*})} \;\mathcal{N}(z | \mathbf{w}^{T}\mathbf{x}_{*}, 1) \;\mathcal{N}(\mathbf{w} | \mathbf{m}, \mathbf{S})\;d\mathbf{w}\;dz \\ & = \int_{-\infty}^{\infty} \mathbf{1}(z > 0)^{y_{*}} \mathbf{1}(z \leq 0)^{(1 - y_{*})} \;\mathcal{N}(z | \mathbf{m}^{T}\mathbf{x}_{*}, 1 + \mathbf{x}_{*}^{T}\mathbf{S}\mathbf{x}_{*})\;dz \\ & = \begin{cases} \int_{0}^{\infty} \mathcal{N}(z | \mathbf{m}^{T}\mathbf{x}_{*}, 1 + \mathbf{x}_{*}^{T}\mathbf{S}\mathbf{x}_{*})\;dz & \; \text{if } y_{*} = 1\\ \int_{-\infty}^{0} \mathcal{N}(z | \mathbf{m}^{T}\mathbf{x}_{*}, 1 + \mathbf{x}_{*}^{T}\mathbf{S}\mathbf{x}_{*})\;dz & \; \text{if } y_{*} = 0\\ \end{cases} \\ & = \begin{cases} 1 - \Phi \left(\frac{-\mathbf{m}^{T}\mathbf{x}_{*}}{(1 + \mathbf{x}_{*}^{T}\mathbf{S}\mathbf{x}_{*})^{1/2}}\right) & \; \text{if } y_{*} = 1\\ \Phi\left(\frac{-\mathbf{m}^{T}\mathbf{x}_{*}}{(1 + \mathbf{x}_{*}^{T}\mathbf{S}\mathbf{x}_{*})^{1/2}}\right) & \; \text{if } y_{*} = 0 \end{cases} \\ & = \Phi \left(\frac{\mathbf{m}^{T}\mathbf{x}_{*}}{(1 + \mathbf{x}_{*}^{T}\mathbf{S}\mathbf{x}_{*})^{1/2}}\right)^{y_{*}}\left(1 - \Phi \left(\frac{\mathbf{m}^{T}\mathbf{x}_{*}}{(1 + \mathbf{x}_{*}^{T}\mathbf{S}\mathbf{x}_{*})^{1/2}}\right)\right)^{(1-y_{*})} \\ & = \mathcal{B}ernoulli\left(y_{*} \Big{|} \Phi \left(\frac{\mathbf{m}^{T}\mathbf{x}_{*}}{(1 + \mathbf{x}_{*}^{T}\mathbf{S}\mathbf{x}_{*})^{1/2}}\right)\right) \end{align}
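The key step above is the Gaussian marginalization \int \mathcal{N}(z | \mathbf{w}^{T}\mathbf{x}_{*}, 1)\, \mathcal{N}(\mathbf{w} | \mathbf{m}, \mathbf{S})\, d\mathbf{w} = \mathcal{N}(z | \mathbf{m}^{T}\mathbf{x}_{*}, 1 + \mathbf{x}_{*}^{T}\mathbf{S}\mathbf{x}_{*}). A quick Monte Carlo sketch (with arbitrary values for \mathbf{m}, \mathbf{S} and \mathbf{x}_{*}) to convince ourselves of the final expression:

set.seed(2)
m <- c(0.5, -1)                                      # Arbitrary mean of q(w)
S <- matrix(c(0.3, 0.1, 0.1, 0.2), 2, 2)             # Arbitrary covariance of q(w)
x_star <- c(1, 0.7)                                  # Arbitrary test input
w_draws <- mvtnorm::rmvnorm(1e5, mean = m, sigma = S)   # w ~ q(w)
z_draws <- rnorm(1e5, mean = w_draws %*% x_star, sd = 1) # z | w ~ N(w'x*, 1)
s2_star <- 1 + as.numeric(t(x_star) %*% S %*% x_star)   # Predictive variance of z
# Monte Carlo estimate of P(y* = 1) versus the closed-form expression
c(mc = mean(z_draws > 0), exact = pnorm(sum(m * x_star) / sqrt(s2_star)))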

4.7 Update equations summary

Below we summarize the update equations we obtain for the Variational Bayes algorithm \begin{align} \pmb{\mu} & = \mathbf{Xm} \\ \mathbb{E}_{q(z_{i})}\big[z_{i}\big] & = \begin{cases} \mu_{i} + \phi_{i}/(1 - \Phi_{i}) & \; \text{if } y_{i} = 1\\ \mu_{i} - \phi_{i}/\Phi_{i} & \; \text{if } y_{i} = 0\\ \end{cases} \\ \mathbf{m} & = \mathbf{S}\mathbf{X}^{T}\mathbb{E}_{q(\mathbf{z})}\big[\mathbf{z}\big] \\ \mathbf{S} & = \left(\frac{\alpha}{\beta}\mathbf{I} +\mathbf{X}^{T}\mathbf{X}\right)^{-1} \\ \alpha & = \alpha_{0} + \frac{D}{2} \\ \beta & = \beta_{0} + \frac{1}{2}\left(\mathbf{m}^{T}\mathbf{m} + \text{tr}(\mathbf{S})\right)\\ \end{align}
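Before the full implementation in the next section (which adds basis functions, ELBO monitoring, convergence checks and plotting), here is a stripped-down sketch of exactly these update equations on toy data, assuming \alpha_{0} = \beta_{0} = 0.1 and a fixed number of iterations:

set.seed(3)
I <- 200; D <- 2
X <- cbind(1, rnorm(I)); w_true <- c(-0.5, 2)    # Toy design matrix and coefficients
y <- rbinom(I, 1, pnorm(X %*% w_true))           # Toy binary responses
alpha_0 <- beta_0 <- 0.1
XX <- crossprod(X)
alpha <- alpha_0 + D / 2                         # Fixed throughout the iterations
beta <- beta_0; m <- rep(0, D)
for (iter in 1:50) {
  mu <- as.vector(X %*% m)                       # mu = X m
  E_z <- ifelse(y == 1, mu + dnorm(-mu) / (1 - pnorm(-mu)),
                        mu - dnorm(-mu) / pnorm(-mu))
  S <- solve(diag(alpha / beta, D) + XX)         # Covariance of q(w)
  m <- as.vector(S %*% crossprod(X, E_z))        # Mean of q(w)
  beta <- beta_0 + 0.5 * (sum(m^2) + sum(diag(S)))
}
rbind(true = w_true, vb_mean = m)                # Compare VB posterior mean with truth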

5 Code implementation in R

Helper plotting functions

# devtools::install_github("andreaskapou/BPRMeth-devel")
suppressPackageStartupMessages(require(BPRMeth))
suppressPackageStartupMessages(require(matrixcalc))
suppressPackageStartupMessages(require(ggplot2))
suppressPackageStartupMessages(require(data.table))
suppressPackageStartupMessages(require(purrr))
suppressPackageStartupMessages(require(mvtnorm))
suppressPackageStartupMessages(require(truncnorm))
suppressPackageStartupMessages(require(gganimate))

# Define ggplot2 theme
gg_theme <- function(){
  p <- theme(
      plot.title = element_text(size = 20,face = 'bold',
                                margin = margin(0,0,3,0), hjust = 0.5),
      axis.text = element_text(size = rel(1.05), color = 'black'),
      axis.title = element_text(size = rel(1.45), color = 'black'),
      axis.title.y = element_text(margin = margin(0,10,0,0)),
      axis.title.x = element_text(margin = margin(10,0,0,0)),
      axis.ticks.x = element_line(colour = "black", size = rel(0.8)),
      axis.ticks.y = element_blank(),
      legend.position = "right",
      legend.key.size = unit(1.4, 'lines'),
      legend.title = element_text(size = 12, face = 'bold'),
      legend.text = element_text(size = 12),
      panel.border = element_blank(),
      panel.grid.major = element_line(colour = "gainsboro"),
      panel.background = element_blank()
    )
  return(p)
}

# Plot the predictive distribution
draw_predictive <- function(X, obs, title="", ...){
  p <- ggplot(X, aes(x = xs, y = ys, color = Model)) +
    geom_line(size = 1.5) +
    geom_ribbon(aes(ymin = X$ys_low, ymax = X$ys_high, fill = Model), 
                alpha = 0.2, size = 0.1) +
    geom_point(data = obs, mapping = aes(x = x, y = y), shape = 1, 
               color = "black", size = 3) +
    scale_x_continuous(limits = c(-1, 1), 
                       labels = c("-5kb", "", "TSS", "", "+5kb")) + 
    scale_color_brewer(palette = "Set1") +
    scale_fill_brewer(palette = "Set1") +
    labs(title = title, x = "x", y = "y") + gg_theme()
  return(p)
}

# Plot the predictive distribution
draw_predictive_anim <- function(X, obs, title="", ...){
  p <- ggplot() + 
    geom_point(data = obs, mapping = aes(x = x, y = y), 
               shape = 1, color = "red", size = 3) +
    geom_line(data = X, aes(x = xs, y = ys, frame = iter), 
              size = 1.5, col = "darkblue") +
    geom_ribbon(data = X, aes(x = xs, ymin = ys_low, ymax = ys_high, 
       frame = iter), alpha = 0.2, size = 0.1, fill = "cornflowerblue") +
    scale_x_continuous(limits = c(-1, 1), 
                       labels = c("-5kb", "", "TSS", "", "+5kb")) + 
    labs(title = title, x = "x", y = "y") + gg_theme()
  return(p)
}

Function for fitting the variational Bayesian probit regression model

# Compute predictive distribution of VBPR model
vbpr_predictive <- function(model, X_test){
  # Predictive mean
  mu_pred <- c(X_test %*% model$m)
  # Predictive variance
  s_pred <- sqrt(1 + diag(X_test %*% model$S %*% t(X_test)))
  Phi <- pnorm(mu_pred / s_pred)
  return(list(Phi = Phi, mu_pred = mu_pred, s_pred = s_pred))
}

# Fit VBPR model
vbpr_fit <- function(X, y, basis=NULL, alpha_0=1e-1, beta_0=1e-1, 
                     max_iter=500, epsilon_conv=1e-5, 
                     is_animation=FALSE, is_verbose=FALSE){
  X <- as.matrix(X)
  D <- NCOL(X)               # Number of covariates
  N <- NROW(X)               # Number of observations
  L <- rep(-1e40, max_iter)  # Store the lower bounds
  E_z <- vector(mode = "numeric", length = N)
  # If basis is null, create RBF as default
  if (is.null(basis)) { basis <- create_rbf_object(M = D - 1) }
  
  dt_frame <- NULL
  if (is_animation) {
    xs <- seq(-1,1,length.out = 100)
    hs <- design_matrix(basis, xs)$H
    dt_frame <- data.table(xs = numeric(), ys = numeric(), 
        ys_low = numeric(), ys_high = numeric(), iter = numeric())
  }
  
  XX    <- crossprod(X, X)   # Compute X'X
  alpha <- alpha_0 + 0.5*D   # 'shape' of Gamma(\tau | \alpha, \beta)
  beta  <- beta_0            # 'rate' of Gamma(\tau | \alpha, \beta)
  S <- solve(diag(D) + XX)   # covariance of q(w | m, S)
  m <- rep(0, D)             # mean of q(w | m, S)
  
  if (is_animation) {
    ys <- vbpr_predictive(model = list(m = m, S = S), X_test = hs)$Phi
    dt_frame <- rbind(dt_frame, data.table(xs = xs, ys = ys, 
      ys_low = ys - ys*(1 - ys), ys_high = ys + ys*(1 - ys), iter = 0))
  }
  
  # Iterate to find optimal parameters
  for (i in 2:max_iter) {
    # Update mean of q(z)
    mu <- X %*% m
    mu_1 <- mu[y == 1]  # Keep data where y == 1
    mu_0 <- mu[y == 0]  # Keep data where y == 0
    # Compute expectation E[z]
    E_z[y == 1] <- mu_1 + dnorm(-mu_1) / (1 - pnorm(-mu_1))
    E_z[y == 0] <- mu_0 - dnorm(-mu_0) / pnorm(-mu_0)
    S <- solve(diag(alpha/beta, D) + XX) # posterior covariance of q(w)
    m <- S %*% (crossprod(X, E_z))             # posterior mean of q(w)
    beta <- beta_0 + 0.5 * c(crossprod(m) + matrix.trace(S))
    
    # Compute lower bound
    lb_p_zw_qw <- -0.5*matrix.trace(XX %*% (tcrossprod(m,m) + S)) + 
        0.5*crossprod(mu) + sum(y*log(1 - pnorm(-mu)) + 
        (1 - y)*log(pnorm(-mu)))
    lb_pw <- -0.5*D*log(2*pi) + 0.5*D*(digamma(alpha) - log(beta)) -
        alpha/(2*beta)*(t(m) %*% (m) + matrix.trace(S))
    lb_qw <- -0.5*log(det(S)) - 0.5*D*(1 + log(2*pi))
    lb_pa <- alpha_0*log(beta_0) + (alpha_0 - 1)*(digamma(alpha) - 
        log(beta)) - beta_0*(alpha/beta) - lgamma(alpha_0)
    lb_qa <- -lgamma(alpha) + (alpha - 1)*digamma(alpha) + 
        log(beta) - alpha
    
    L[i] <- lb_p_zw_qw + lb_pw + lb_pa - lb_qw - lb_qa
    
    if (is_animation) {
      if ((i - 1) %% 5 == 0 | i < 10) {
        ys <- vbpr_predictive(model = list(m = m,S = S),X_test = hs)$Phi
        dt_frame <- rbind(dt_frame, data.table(xs = xs, ys = ys, 
         ys_low = ys - ys*(1 - ys),ys_high = ys + ys*(1 - ys),iter = i - 1))
      }
    }
    
    # Show VB difference
    if (is_verbose) { cat("It:\t",i,"\tLB:\t",
                          L[i],"\tLB_diff:\t",L[i] - L[i - 1],"\n")}
    # Check if lower bound decreases
    if (L[i] < L[i - 1]) { message("Lower bound decreases!\n")}
    # Check for convergence
    if (L[i] - L[i - 1] < epsilon_conv) { break }
    # Check if VB converged in the given maximum iterations
    if (i == max_iter) {warning("VB did not converge!\n")}
  }
  if (is_animation) {
    dt_frame <- dt_frame %>% .[, iter := as.factor(iter)]
  }
  obj <- structure(list(m = m, S = S, alpha = alpha, beta = beta, 
        N = N, D = D, L = L[2:i], dt_frame = dt_frame), class = "vbpr")
  return(obj)
}

5.1 Learn simple probit regression model

Here we generate synthetic data and fit the VBPR model

set.seed(255)                  # For reproducibility
N <- 80                        # Number of observations
x <- sort(runif(N, -1, 1))     # Generate x data
basis <- create_polynomial_object(M = 1) # Polynomial basis
X <- design_matrix(basis, x)$H # Learn a linear function plus intercept
w <- c(-.5, 3.3)               # True values of regression coefficients
y <- rbinom(N, 1, pnorm(X %*% w)) # Generate binary observation data y
# Run VBPR model
vbpr_model <- vbpr_fit(X = X, y = y, basis = basis, max_iter = 101, 
                       is_animation = TRUE, is_verbose = FALSE)

Let’s examine the posterior predictive distribution

# Store training data
obs <- as.data.table(cbind(x, y)) %>% setnames(c("x", "y"))
p <- draw_predictive_anim(X = vbpr_model$dt_frame, obs = obs, 
                          title = "Simple probit regression, Iter:")
gganimate(p, interval = .45, "figures/vb_bpr/vbpr.gif", 
          ani.width = 750, ani.height = 470)
(Animation: posterior predictive distribution of the simple probit regression model across VB iterations, saved to figures/vb_bpr/vbpr.gif.)

5.2 Comparison with Gibbs

Below we compare the fit of the BPR model using the Variational Bayes and Gibbs implementations.
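The Gibbs sampler for the augmented model is described in the earlier post linked above. For reference, a minimal sketch of such a sampler, assuming a fixed prior precision \tau = 1 (i.e. \mathbf{w} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) rather than the full Gamma prior), could look as follows; this is roughly the role played by the gibbs_probit_model() call used later on.

# Minimal Gibbs sampler for the augmented probit model (Albert and Chib, 1993),
# assuming a fixed prior precision tau = 1; a sketch, not the exact
# implementation used in the comparison below.
gibbs_probit_sketch <- function(X, y, N_sim = 10000, tau = 1) {
  D <- NCOL(X); N <- NROW(X)
  S <- solve(diag(tau, D) + crossprod(X))   # Covariance of w | z (does not change)
  w <- rep(0, D)                            # Initial value of coefficients
  w_chain <- matrix(0, nrow = N_sim, ncol = D)
  for (t in 1:N_sim) {
    mu <- as.vector(X %*% w)
    z <- rep(0, N)
    # Sample latent variables from truncated normals given current w
    z[y == 1] <- truncnorm::rtruncnorm(sum(y == 1), a = 0, b = Inf,
                                       mean = mu[y == 1], sd = 1)
    z[y == 0] <- truncnorm::rtruncnorm(sum(y == 0), a = -Inf, b = 0,
                                       mean = mu[y == 0], sd = 1)
    # Sample coefficients from their Gaussian full conditional
    w <- as.vector(mvtnorm::rmvnorm(1, mean = as.vector(S %*% crossprod(X, z)),
                                    sigma = S))
    w_chain[t, ] <- w
  }
  return(w_chain)
}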

6 Binomial Probit Regression Model

In the case that we have a Binomial observation model, we can recast it as a Bernoulli observation model with additional observations. Assume that we have I independent observations y_{i} = (m_{i}, T_{i}) that follow a Binomial distribution m_{i} \sim \mathcal{B}inomial(m_{i} | T_{i}, \gamma_{i}) where T_{i} is the total number of trials for the i^{th} observation, m_{i} denotes the number of successes and again the success probability \gamma_{i} is related to a vector of covariates \mathbf{x}_{i} \in \mathbb{R}^{D}.

We can think of each m_{i} as the total number of successes of T_{i} independent Bernoulli experiments with outcomes y_{i1}^{*}, ..., y_{iT_{i}}^{*}, where now each y_{it}^{*} follows a Bernoulli distribution. It is simple to reconstruct the binary outcomes y_{it}^{*} from the Binomial observations (m_{i}, T_{i}) using the following formula, y_{it}^{*} = \begin{cases} 1 & \; \text{if } 1 \leq t \leq m_{i}\\ 0 & \; \text{if } m_{i} < t \leq T_{i}\\ \end{cases}

Using this approach of extending our observation matrix to binary outcomes, we can now use the variational implementation of the probit regression model derived above to perform inference for Binomial data. The only thing left to derive is the posterior predictive distribution, which must account for the lower variance we obtain from the multiple Bernoulli outcomes in each observation.

6.1 Predictive distribution

TODO

The predictive distribution over y_{*}, given a new input \mathbf{x}_{*}, is evaluated using the variational posterior for the parameters \begin{align} p(y_{*} | \mathbf{x}_{*}, \mathbf{X}, \mathbf{y}) & = \int_{-\infty}^{\infty}\int\int p(y_{*}, z, \mathbf{w}, \tau | \mathbf{x}_{*}, \mathbf{X}, \mathbf{y})\;d\tau\;d\mathbf{w}\;dz \\ & = \int_{-\infty}^{\infty}\int\int p(y_{*} | z) \;p(z | \mathbf{w}, \mathbf{x}_{*}) p(\mathbf{w}, \tau | \mathbf{X}, \mathbf{y})\;d\tau\;d\mathbf{w}\;dz \\ \end{align}

6.2 Learn Binomial probit regression model

set.seed(255)                          # For reproducibility
N <- 80                                # Number of observations
x <- sort(runif(N, -1, 1))             # Generate X data
D <- 4                                 # Number of basis functions
basis <- create_rbf_object(M = D - 1)  # Create RBF basis object
X <- design_matrix(basis, x)$H         # Create design matrix
w <- c(-0.4, 0.75, 2, -1.4)            # True values of coefficients
T_i <- rbinom(N, 40, 0.8)              # Total number of trials T_i
m_i <- rbinom(N, T_i, pnorm(X %*% w))  # Number of successes m_i

y <- vector(mode = "integer") # Extended vector y of length sum(T_i)
for (i in 1:N) { y <- c(y, rep(1, m_i[i]), rep(0, T_i[i] - m_i[i])) }
# Extended design matrix from binary observations
X <- X[rep(1:nrow(X), T_i), ] 
  
# Run VBPR model
vbpr_binom_model <- vbpr_fit(X = X, y = y, basis = basis, max_iter = 101, 
                             is_animation = TRUE, is_verbose = FALSE)

Let’s examine the posterior predictive distribution ( Is this the right predictive? )

(Figures: posterior predictive distribution of the Binomial probit regression model across VB iterations.)

set.seed(255)   
s2 <- 30        # Variance parameter 
N_sim <- 10000  # Number of simulations for Gibbs sampler
burn_in <- 5000 # Burn in period
w_chain <- gibbs_probit_model(X = X, y = y, N_sim = N_sim)

xs <- seq(-1, 1, len = 100) # create test X values
ys_draws <- matrix(0, nrow = (N_sim - burn_in), ncol = length(xs))
dt_gibbs <- data.table(xs = xs, ys = 0, ys_low = 0,ys_high = 0,
                       Model = "Gibbs")
w_draws <- w_chain[-(1:burn_in), ]
for (i in 1:(N_sim - burn_in)) { 
    ys_draws[i, ] <- pnorm(design_matrix(basis, xs)$H %*% w_draws[i, ]) 
}
# Compute quantiles of ys
ys_q <- apply(ys_draws, 2, quantile, probs = c(0.05, 0.95),  na.rm = TRUE)
dt_gibbs <- dt_gibbs %>% .[, c("ys","ys_low","ys_high", "Model") := 
    list(colMeans(ys_draws), ys_q[1,], ys_q[2, ], "Gibbs")]

# Estimate predictive distribution
pred <- vbpr_predictive(model = vbpr_binom_model, 
                        X_test = design_matrix(basis, xs)$H)
# Store test data in dt
dt <- data.table(xs = xs, ys = pred$Phi, ys_low = 0, ys_high = 0, 
                 Model = "VB") %>% 
  .[,c("ys_low","ys_high") := list(ys - ys*(1 - ys), ys + ys*(1 - ys))]
# Keep all data in one data.table
dt_joint <- rbind(dt, dt_gibbs)
p <- draw_predictive(X = dt_joint, obs = obs, title = "VB and Gibbs fit")
print(p)

7 Conclusions

This tutorial-like document showed how we can perform variational Bayesian probit regression using the data augmentation approach.

If you found this document useful, check my homepage at the University of Edinburgh for links to other tutorials.