Gaussian Variational Approximations


Introduction

\[ \renewcommand\vec{\boldsymbol} \def\bigO#1{\mathcal{O}(#1)} \def\Cond#1#2{\left(#1\,\middle|\, #2\right)} \def\mat#1{\boldsymbol{#1}} \def\der{{\mathop{}\!\mathrm{d}}} \def\argmax{\text{arg}\,\text{max}} \def\Prob{\text{P}} \def\Expec{\text{E}} \def\logit{\text{logit}} \def\diag{\text{diag}} \]

Introduction Example

Suppose we have \(i = 1,\dots,n\) twin pairs.

For each twin we observe a vector of variables \(\vec x_{ij}\) for \(j = 1,2\).

We have some outcome \(Y_{ij} \in \{0, 1\}\) for \(j = 1,2\).

Introduction Example (cont.)

We suppose that there is an association between \(\vec x_{ij}\) and \(Y_{ij}\). A simple model is:

\[\logit(\Prob(Y_{ij} = 1)) = \vec\beta^\top\vec x_{ij}\]

Exponential Family

This model is in the exponential family. That is, the density is:

\[ \begin{align*} f_{ij}(y) &= \exp(y \eta_{ij}(\vec\beta) - b(\eta_{ij}(\vec\beta)) + c(y)) \\ \eta_{ij}(\vec\beta) &= \vec\beta^\top\vec x_{ij} \end{align*} \]

where \(b\) is the log-partition function. Only the canonical form is shown here.

The rest of the talk applies equally to the general exponential family.
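As a concrete example, a Bernoulli outcome with the logit link is the canonical case:

\[ f_{ij}(y) = p_{ij}^y(1 - p_{ij})^{1 - y} = \exp\big(y \logit(p_{ij}) - \log(1 + \exp(\logit(p_{ij})))\big) \]

so \(\eta_{ij}(\vec\beta) = \logit(p_{ij}) = \vec\beta^\top\vec x_{ij}\), \(b(\eta) = \log(1 + \exp(\eta))\), and \(c(y) = 0\).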

Adding Random Effects

There may be genetic effects or environmental effects that change the association between \(\vec x_{ij}\) and \(Y_{ij}\).

We can model this with some unknown random effect \(\vec U_i\) and suppose that:

\[\logit(\Prob(Y_{ij} = 1 \mid \vec U_i = \vec u)) = (\vec u + \vec\beta)^\top\vec x_{ij}\]

Marginal Likelihood

\[ \begin{align*} f_i(\vec y_i) &= \int g_i(\vec u;\vec\theta)\prod_{j = 1}^2 f_{ij}(y_{ij}\mid \vec u) \der\vec u \\ f_{ij}(y_{ij}\mid \vec u) &= \exp\bigg(y_{ij} \eta_{ij}(\vec\beta,\vec u) - b(\eta_{ij}(\vec\beta,\vec u)) + c(y_{ij})\bigg) \\ \eta_{ij}(\vec\beta,\vec u) &= (\vec u + \vec\beta)^\top\vec x_{ij} \end{align*} \]

where \(g_i\) is the density of \(\vec U_i\) given the model parameters \(\vec\theta\). The integral is generally intractable.

This is a generalized linear mixed model (GLMM).

Such data are referred to as clustered data.
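As a minimal sketch (not the talk's code), the marginal likelihood of one cluster with a scalar random intercept can be evaluated by one-dimensional numerical integration; all data and parameter values below are made up:

```r
# Minimal sketch: the marginal likelihood f_i(y_i) of one cluster with a
# scalar random intercept, evaluated by one-dimensional numerical integration.
# The data and parameter values are made up for illustration.
set.seed(1)
x     <- cbind(1, rnorm(2))   # covariates for the two twins
y     <- c(1L, 0L)            # binary outcomes
beta  <- c(-.5, 1)            # fixed effects
sigma <- .75                  # random effect standard deviation

f_i <- integrate(function(u) sapply(u, function(ui) {
  eta <- drop(x %*% beta) + ui                       # linear predictors given U_i = ui
  lik <- prod(plogis(eta)^y * (1 - plogis(eta))^(1 - y))
  lik * dnorm(ui, sd = sigma)                        # times the random effect density g_i
}), -Inf, Inf)$value

log(f_i)   # the log marginal likelihood term for this cluster
```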

Remarks

Typically, one works with:

\[\logit(\Prob(Y_{ij} = 1 \mid \vec U_i = \vec u)) = \vec\beta^\top\vec x_{ij} + \vec u^\top\vec z_{ij}\]

where \(\vec z_{ij}\) contains a subset of \(\vec x_{ij}\).

Typically, we assume that

\[\vec U_i \sim N(\vec 0, \mat\Sigma_i(\vec\theta))\]

Talk Overview

Variational approximations.

Estimation methods.

Examples.

Conclusions.

Variational Approximations

Lower Bound

Define the integrand

\[ \begin{align*} h_i(\vec y_i, \vec u) &= g_i(\vec u;\vec\theta) \prod_{j = 1}^{l_i}f_{ij}(y_{ij}\mid \vec u) \end{align*} \]

where \(l_i\) is the number of observations in cluster \(i\) (\(l_i = 2\) in the twin example), such that

\[ \begin{align*} f_i(\vec y_i) &= \int h_i(\vec y_i, \vec u) \der\vec u \\ f_{ij}(y_{ij}\mid \vec u) &= \exp\bigg(y_{ij} \eta_{ij}(\vec\beta,\vec u) - b(\eta_{ij}(\vec\beta,\vec u)) + c(y_{ij})\bigg) \\ \eta_{ij}(\vec\beta,\vec u) &= \vec\beta^\top\vec x_{ij} + \vec u^\top\vec z_{ij} \end{align*} \]

Lower Bound (cont.)

Select some variational distribution with density \(\nu_i(\cdot;\vec\omega)\) parameterized by \(\vec\omega\) in some set \(\Omega_i\). Then

\[ \begin{align*} \log f_i(\vec y_i) &\geq \int \nu_i(\vec u;\vec\omega) \log\left( \frac{h_i(\vec y_i,\vec u)} {\nu_i(\vec u;\vec\omega)}\right)\der\vec u \\ &= \log \tilde f_i(\vec y_i;\vec\omega) \end{align*} \]

for any \(\vec\omega \in \Omega_i\).
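One way to see why this is a lower bound: the gap equals a Kullback-Leibler divergence, which is non-negative,

\[ \log f_i(\vec y_i) - \log \tilde f_i(\vec y_i;\vec\omega) = \int \nu_i(\vec u;\vec\omega) \log\left( \frac{\nu_i(\vec u;\vec\omega)} {h_i(\vec y_i,\vec u) / f_i(\vec y_i)}\right)\der\vec u \geq 0, \]

so the bound is tight exactly when \(\nu_i(\cdot;\vec\omega)\) equals the conditional density of \(\vec U_i\) given \(\vec y_i\), which is \(h_i(\vec y_i,\cdot)/f_i(\vec y_i)\).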

Approximate Maximum Likelihood

Use

\[ \argmax_{\vec\beta,\vec\theta,\vec\omega_1,\dots,\vec\omega_n} \sum_{i = 1}^n \log \tilde f_i(\vec y_i;\vec\omega_i) \]

instead of

\[ \argmax_{\vec\beta,\vec\theta} \sum_{i = 1}^n \log f_i(\vec y_i) \]

Remarks

The remaining integrals are often very simple (at worst one-dimensional):

\[\int \nu_i(\vec u;\vec\omega) \log\left( h_i(\vec y_i,\vec u)\right)\der\vec u\]

Easily computed with, e.g., adaptive Gauss-Hermite quadrature.
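As an illustrative sketch (not psqn's implementation), a \(k\)-point Gauss-Hermite rule can be built with the Golub-Welsch algorithm and used to approximate the expectation under a Gaussian variational density:

```r
# Gauss-Hermite rule via the Golub-Welsch algorithm:
#   integral f(x) exp(-x^2) dx  ~=  sum_j w_j f(x_j)
gh_rule <- function(k) {
  off_diag <- sqrt(seq_len(k - 1) / 2)
  J <- matrix(0, k, k)
  J[cbind(seq_len(k - 1), seq_len(k - 1) + 1)] <- off_diag
  J[cbind(seq_len(k - 1) + 1, seq_len(k - 1))] <- off_diag
  eig <- eigen(J, symmetric = TRUE)
  list(node = eig$values, weight = sqrt(pi) * eig$vectors[1, ]^2)
}

# E_{N(mu, psi^2)}[log h(U)] with the change of variables u = mu + sqrt(2) * psi * x
E_log_h <- function(log_h, mu, psi, k = 20L) {
  rule <- gh_rule(k)
  sum(rule$weight * log_h(mu + sqrt(2) * psi * rule$node)) / sqrt(pi)
}

# example: E[log phi(U)] under U ~ N(.5, .5^2); the exact value is
# -log(2 * pi) / 2 - (.5^2 + .5^2) / 2
E_log_h(function(u) dnorm(u, log = TRUE), mu = .5, psi = .5)
```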

Computational Issues

The number of parameters is \(\bigO{n}\).

A Gaussian VA (GVA) is:

\[ \nu_i(\vec u;\vec\mu_i, \mat\Psi_i) = \phi(\vec u;\vec\mu_i, \mat\Psi_i) \]

Let \(d\) be the dimension of \(\vec U_i\). Then \(\vec\omega_i\in\mathbb R^{d(d + 3) / 2}\).

\(d = 4\) and \(n = 1000\) yield 14,000 variational parameters.

Ormerod and Wand (2012) derive the formulas for GVAs for GLMMs.

Computational Issues (cont.)

The objective function is:

\[ l(\vec\beta,\vec\theta,\vec\omega_1,\dots,\vec\omega_n) = \sum_{i = 1}^n \log \tilde f_i(\vec y_i;\vec\omega_i) \]

The Hessian

\[\nabla^2 l(\vec x) = \begin{pmatrix} \partial^2 l(\vec x) / \partial x_1\partial x_1 & \dots & \partial^2 l(\vec x) / \partial x_1\partial x_k \\ \vdots &\ddots &\vdots \\ \partial^2 l(\vec x) / \partial x_k\partial x_1 & \dots & \partial^2 l(\vec x) / \partial x_k\partial x_k \end{pmatrix}\]

is extremely sparse and highly structured.

Hessian Example

Hypothetical example of \(\nabla^2 f(\vec x)\) with \(n = 10\) clusters. Black entries are non-zero.
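A base R sketch that produces a similar picture; the block sizes are made up, and this is not the code behind the slide's figure. The global parameters \((\vec\beta, \vec\theta)\) interact with every cluster's \(\vec\omega_i\), while different clusters never interact:

```r
# Sketch of the Hessian's sparsity pattern: p global parameters (beta, theta)
# and q variational parameters per cluster for n clusters. The sizes are made up.
n <- 10L; p <- 4L; q <- 5L
o <- p + n * q
H <- matrix(FALSE, o, o)
H[1:p, ] <- H[, 1:p] <- TRUE                 # global parameters hit everything
for (i in seq_len(n)) {                      # each cluster's block only hits itself
  idx <- p + (i - 1L) * q + 1:q
  H[idx, idx] <- TRUE
}
mean(H)                                      # fraction of structurally non-zero entries
image(H + 0, col = c("white", "black"))      # black entries are non-zero
```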

Optimization Methods

Newton’s Method

We want to minimize some function \(f\) of \(o\) inputs.

Start at some \(\vec x^{(0)}\in\mathbb R^o\). Set \(k = 0\) and

\[\vec x^{(k+1)} = \vec x^{(k)} - \gamma(\nabla^2 f(\vec x^{(k)}))^{-1}\nabla f(\vec x^{(k)})\]

for a step size \(\gamma \in (0, 1]\). Set \(k \leftarrow k + 1\) and repeat until convergence.

Good convergence properties if started close to an optimum. Inversion is \(\bigO {o^3}\) in general.
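A bare-bones sketch of the damped iteration; gr and hess stand for gradient and Hessian functions, and the fixed step size stands in for a proper line search:

```r
# Minimal damped Newton iteration. gr and hess are the gradient and Hessian of
# the objective; gamma is a fixed step size for illustration.
newton <- function(x, gr, hess, gamma = 1, max_it = 100L, tol = 1e-8) {
  for (k in seq_len(max_it)) {
    step <- solve(hess(x), gr(x))       # O(o^3) for a dense Hessian
    x <- x - gamma * step
    if (sqrt(sum(gr(x)^2)) < tol)       # stop when the gradient is about zero
      break
  }
  x
}

# example: minimize f(x) = sum((x - 1:3)^2) / 2; converges in one step
newton(c(0, 0, 0), gr = function(x) x - 1:3, hess = function(x) diag(3))
```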

BFGS

Recursively update an approximation of \((\nabla^2 f(\vec x^{(k)}))^{-1}\) using rank-two updates.

Can also provide an approximation of \(\nabla^2 f(\vec x^{(k)})\).

Still has \(\bigO{o^2}\) iteration and memory costs.
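A sketch of the rank-two update of the inverse Hessian approximation \(\mat H_k \approx (\nabla^2 f(\vec x^{(k)}))^{-1}\), where \(\vec s\) is the change in the iterate and \(\vec y\) the change in the gradient (illustrative, not the psqn code):

```r
# One BFGS update of the inverse Hessian approximation H given the change in
# the parameters s = x_new - x_old and in the gradients y = g_new - g_old.
bfgs_inv_update <- function(H, s, y) {
  rho <- 1 / sum(y * s)
  V <- diag(length(s)) - rho * tcrossprod(s, y)   # I - rho * s * y^T
  V %*% H %*% t(V) + rho * tcrossprod(s)          # the rank-two update
}

# sanity check: for f(x) = sum(x^2) / 2 the gradient change equals the step,
# and updating from the identity returns the identity (the true inverse Hessian)
bfgs_inv_update(diag(2), s = c(.1, -.2), y = c(.1, -.2))
```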

Limited-memory BFGS

Make an approximation of \(\nabla^2 f(\vec x^{(k)})\) using only the last few gradients \(\nabla f(\vec x^{(k)})\).

\(\bigO o\) iteration cost and memory cost.

Poor convergence properties.

Partial Separability

Suppose that

\[f(\vec x) = \sum_{i = 1}^n f_i(\vec x_{\mathcal I_i})\]

where

\[\begin{align*} \vec x_{\mathcal I_i} &= (\vec e_{j_{i1}}^\top, \dots ,\vec e_{j_{im_i}}^\top)\vec x\\ \mathcal I_i &= \{j_{i1}, \dots, j_{im_i}\} \subseteq \{1, \dots, o\}, \end{align*}\]

\(\vec e_k\) is the \(k\)’th column of the \(o\) dimensional identity matrix, and \(m_i \ll o\) for all \(i = 1,\dots,n\).
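A tiny illustration of the notation with made-up element functions and index sets; in the GVA objective, element function \(i\) depends on the global \((\vec\beta, \vec\theta)\) and on its own \(\vec\omega_i\) only:

```r
# Made-up partially separable function: each element function only depends on
# the few entries of x recorded in its index set.
idx <- list(c(1, 2, 4), c(1, 3, 5))         # the index sets I_1 and I_2
fs  <- list(function(z) sum(z^2),           # f_1, takes x[I_1]
            function(z) prod(z))            # f_2, takes x[I_2]
f   <- function(x)
  sum(mapply(function(fi, Ii) fi(x[Ii]), fs, idx))

f(c(1, 2, 3, 4, 5))   # f_1(1, 2, 4) + f_2(1, 3, 5) = 21 + 15 = 36
```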

Partial Separability (cont.)

Make \(n\) BFGS approximations of \(\nabla^2 f_1(\vec x^{(k)}_{\mathcal I_1}),\dots,\nabla^2 f_n(\vec x^{(k)}_{\mathcal I_n})\).

Preserves the sparsity of \(\nabla^2 f(\vec x^{(k)})\) and provides potentially quicker convergence towards a good approximation.

Approximately compute \((\nabla^2 f(\vec x^{(k)}))^{-1}\nabla f(\vec x^{(k)})\) using a conjugate gradient method.

This is fast since we only need sparse matrix-vector products.
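An illustrative conjugate gradient solver (not psqn's): it only needs Hessian-vector products, which are cheap when the Hessian is sparse:

```r
# Conjugate gradient solver for H x = g using only matrix-vector products
# H_times(v); illustrative, not the psqn implementation.
cg_solve <- function(H_times, g, tol = 1e-8) {
  x  <- numeric(length(g))
  r  <- g                            # residual g - H x with x = 0
  p  <- r
  rs <- sum(r^2)
  for (k in seq_along(g)) {
    Hp    <- H_times(p)
    alpha <- rs / sum(p * Hp)
    x     <- x + alpha * p
    r     <- r - alpha * Hp
    rs_new <- sum(r^2)
    if (sqrt(rs_new) < tol) break
    p  <- r + (rs_new / rs) * p
    rs <- rs_new
  }
  x
}

# example with a small positive definite matrix; compare with solve(H, 1:5)
set.seed(1)
H <- crossprod(matrix(rnorm(25), 5)) + diag(5)
cg_solve(function(v) drop(H %*% v), g = 1:5)
```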

Implementation

Implemented in the psqn package (Christoffersen 2020).

Provides a header-only C++ library and an R interface.

Supports computation in parallel.

Remarks

The implementation closely follows Nocedal and Wright (2006).

Examples

1D Random Effects

Three-dimensional fixed effects, \(\vec x_{ij}, \vec\beta\in\mathbb R^3\).

One-dimensional random effect, \(\vec U_i \in \mathbb R\) and \(\vec z_{ij} = (1)\).

Three observations per cluster. Each outcome is conditionally binomially distributed with a binomial size between 1 and 5.

Each configuration includes 100 replications.

We compare with the Laplace approximation and adaptive Gauss-Hermite quadrature (AGHQ) from the lme4 package (Bates et al. 2015).
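A sketch of the lme4 calls behind the comparison on a small simulated data set; the simulation parameters and variable names below are made up, not those of the study:

```r
# Sketch of the comparison fits with lme4 on made-up simulated data.
library(lme4)
set.seed(1)
n   <- 100L                                 # clusters
dat <- data.frame(
  cluster = rep(seq_len(n), each = 3L),     # three observations per cluster
  cont    = rnorm(3L * n),                  # continuous covariate
  bin     = rbinom(3L * n, 1L, .5),         # binary covariate
  size    = sample(1:5, 3L * n, replace = TRUE))
u   <- rnorm(n, sd = .75)                   # random intercepts
eta <- -.5 + dat$cont - dat$bin + u[dat$cluster]
dat$succ <- rbinom(nrow(dat), dat$size, plogis(eta))

# nAGQ = 1 is the Laplace approximation; larger values give AGHQ
fit_laplace <- glmer(cbind(succ, size - succ) ~ cont + bin + (1 | cluster),
                     data = dat, family = binomial(), nAGQ = 1L)
fit_aghq    <- glmer(cbind(succ, size - succ) ~ cont + bin + (1 | cluster),
                     data = dat, family = binomial(), nAGQ = 25L)
```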

1D Random Effects (small)

\(n = 1000\) clusters. Bias estimates are:

          Inter.      Cont.     Bin.      Std.
Laplace   0.00573     0.0045   -0.0054   -0.0391
          (0.00646)   (0.0037)  (0.0068)  (0.0053)
AGHQ      0.00022     0.0036   -0.0046   -0.0078
          (0.00651)   (0.0037)  (0.0068)  (0.0055)
GVA       0.00579     0.0045   -0.0054   -0.0221
          (0.00648)   (0.0037)  (0.0068)  (0.0053)

Monte Carlo standard errors are in parentheses. The first three columns are the fixed effects, \(\vec\beta\). Inter.: intercept, cont.: continuous covariate, and bin.: binary covariate. The last column is the random effect standard deviation. The GVA row uses the psqn package.

1D Random Effects (small, comp. time)

                  mean    median
Laplace           0.713   0.697
AGHQ              1.824   1.818
GVA               0.244   0.247
GVA (4 threads)   0.070   0.070
GVA LBFGS         2.652   2.663

Computation times are in seconds. “GVA LBFGS” uses the limited-memory BFGS implementation in the lbfgs package (Coppola, Stewart, and Okazaki 2020).

1D Random Effects (large)

\(n = 5000\) clusters. Bias estimates are:

          Inter.     Cont.     Bin.      Rng Std.
Laplace   0.0023     0.0019    0.0018   -0.0333
          (0.0033)   (0.0015)  (0.0032)  (0.0023)
AGHQ     -0.0032     0.0010    0.0026   -0.0021
          (0.0034)   (0.0015)  (0.0033)  (0.0024)
GVA       0.0023     0.0019    0.0018   -0.0164
          (0.0033)   (0.0015)  (0.0033)  (0.0023)

Monte Carlo standard errors are in parentheses. The first three columns are the fixed effects, \(\vec\beta\). Inter.: intercept, cont.: continuous covariate, and bin.: binary covariate. The last column is the random effect standard deviation. The GVA row uses the psqn package.

1D Random Effects (large, comp. time)

                  mean     median
Laplace            3.706    3.691
AGHQ              10.214   10.091
GVA                1.208    1.240
GVA (4 threads)    0.343    0.351
GVA LBFGS         22.125   21.706

Computation times are in seconds. “GVA LBFGS” uses the limited-memory BFGS implementation in the lbfgs package.

3D Random Effects

Three-dimensional random effect, \(\vec U_i \in \mathbb R^3\) and \(\vec z_{ij} = \vec x_{ij}\).

\(n = 1000\) clusters. Ten observations per cluster.

3D Random Effects

Bias estimates are:

          Inter.     Cont.     Bin.      Inter. std.  Cont. std.  Bin. std.
Laplace  -0.0047     0.0029    0.0051   -0.0123      -0.0162     -0.0357
         (0.0035)   (0.0039)  (0.0044)  (0.0039)     (0.0033)    (0.0056)
GVA      -0.0018     0.0047    0.0050   -0.0042      -0.0139     -0.0190
         (0.0035)   (0.0039)  (0.0044)  (0.0039)     (0.0032)    (0.0054)

Monte Carlo standard errors are in parentheses. The first three columns are the fixed effects, \(\vec\beta\). Inter.: intercept, cont.: continuous covariate, and bin.: binary covariate. The last three columns are the random effect standard deviations. The GVA row uses the psqn package.

3D Random Effects

          cor cont. inter.   cor bin. inter.   cor cont. bin.
Laplace  -0.0015             0.02211           0.0035
         (0.0056)           (0.00891)         (0.0075)
GVA      -0.0063             0.00096           0.0087
         (0.0055)           (0.00862)         (0.0072)

Bias estimates of the correlation coefficients. Monte Carlo standard errors are in parentheses.

3D Random Effects (comp. time)

                  mean     median
Laplace           32.236   27.580
GVA                2.013    1.987
GVA (4 threads)    0.575    0.567
GVA LBFGS         26.814   26.735

Computation times are in seconds. “GVA LBFGS” uses the limited-memory BFGS implementation in the lbfgs package.

Conclusions

Conclusions

Introduced variational approximations for generalized linear mixed models.

The number of variational parameters increases quickly with the number of clusters and the dimension of the random effects.

The psqn package exploits the sparsity of the Hessian and yields very fast estimation times.

Extension

Easy to extend to other members of the exponential family and use other link functions.

Can use other variational distributions.

Extension to the psqn Package

Add Newton conjugate gradient method.

Add constraints.

Add a trust region method.

Add other sparsity patterns of the Hessian.

Thank You!

The presentation is at rpubs.com/boennecd/GVAs-n-psqn-KTH.

The markdown is at github.com/boennecd/Talks.

The code for the examples is at github.com/boennecd/psqn-va-ex.

The psqn package is on CRAN and at github.com/boennecd/psqn.

References are on the next slide.

References

Bates, Douglas, Martin Mächler, Ben Bolker, and Steve Walker. 2015. “Fitting Linear Mixed-Effects Models Using lme4.” Journal of Statistical Software 67 (1): 1–48. https://doi.org/10.18637/jss.v067.i01.

Christoffersen, Benjamin. 2020. psqn: Partially Separable Quasi-Newton. https://github.com/boennecd/psqn.

Coppola, Antonio, Brandon Stewart, and Naoaki Okazaki. 2020. lbfgs: Limited-Memory BFGS Optimization.

Nocedal, Jorge, and Stephen Wright. 2006. Numerical Optimization. 2nd ed. Springer Science & Business Media. https://doi.org/10.1007/978-0-387-40065-5.

Ormerod, J. T., and M. P. Wand. 2012. “Gaussian Variational Approximate Inference for Generalized Linear Mixed Models.” Journal of Computational and Graphical Statistics 21 (1). Taylor & Francis: 2–17. https://doi.org/10.1198/jcgs.2011.09118.