\[ \renewcommand\vec{\boldsymbol} \def\bigO#1{\mathcal{O}(#1)} \def\Cond#1#2{\left(#1\,\middle|\, #2\right)} \def\mat#1{\boldsymbol{#1}} \def\der{{\mathop{}\!\mathrm{d}}} \def\argmax{\text{arg}\,\text{max}} \def\Prob{\text{P}} \def\Expec{\text{E}} \def\logit{\text{logit}} \def\diag{\text{diag}} \]
Suppose we have \(i = 1,\dots,n\) twin pairs.
For each twin we observe a vector of variables \(\vec x_{ij}\) for \(j = 1,2\).
We have some outcome \(Y_{ij} \in \{0, 1\}\) for \(j = 1,2\).
We suppose that there is an association between \(\vec x_{ij}\) and \(Y_{ij}\). A simple model is:
\[\logit(\Prob(Y_{ij} = 1)) = \vec\beta^\top\vec x_{ij}\]
This model is in the exponential family. That is, the density is:
\[ \begin{align*} f_{ij}(y) &= \exp(y \eta_{ij}(\vec\beta) - b(\eta_{ij}(\vec\beta)) + c(y)) \\ \eta_{ij}(\vec\beta) &= \vec\beta^\top\vec x_{ij} \end{align*} \]
where \(b\) is the log-partition function. This is the canonical form, but the rest of the talk applies equally to the general exponential family.
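For example, a Bernoulli outcome with the logit link is of this form with \(b(\eta) = \log(1 + \exp\eta)\) and \(c(y) = 0\):
\[ \begin{align*} f_{ij}(y) &= p_{ij}^y(1 - p_{ij})^{1 - y} = \exp\big(y\,\eta_{ij}(\vec\beta) - \log(1 + \exp\eta_{ij}(\vec\beta))\big) \\ \eta_{ij}(\vec\beta) &= \logit(p_{ij}) = \vec\beta^\top\vec x_{ij} \end{align*} \]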
There may be genetic effects or environmental effects that change the association between \(\vec x_{ij}\) and \(Y_{ij}\).
We can model this with some unknown random effect \(\vec U_i\) and suppose that:
\[\logit(\Prob(Y_{ij} = 1 \mid \vec U_i = \vec u)) = (\vec u + \vec\beta)^\top\vec x_{ij}\]
\[ \begin{align*} f_i(\vec y_i) &= \int g_i(\vec u;\vec\theta)\prod_{j = 1}^2 f_{ij}(y_{ij}\mid \vec u) \der\vec u \\ f_{ij}(y_{ij}\mid \vec u) &= \exp\bigg(y_{ij} \eta_{ij}(\vec\beta,\vec u) - b(\eta_{ij}(\vec\beta,\vec u)) + c(y_{ij})\bigg) \\ \eta_{ij}(\vec\beta,\vec u) &= (\vec u + \vec\beta)^\top\vec x_{ij} \end{align*} \]
where \(g_i\) is the density of \(\vec U_i\) given the model parameters \(\vec\theta\). The integral is generally intractable.
Such data are referred to as clustered data.
Typically, one works with:
\[\logit(\Prob(Y_{ij} = 1 \mid \vec U_i = \vec u)) = \vec\beta^\top\vec x_{ij} + \vec u^\top\vec z_{ij}\]
where \(\vec z_{ij}\) contains a subset of \(\vec x_{ij}\).
Typically, we assume that
\[\vec U_i \sim N(\vec 0, \mat\Sigma_i(\vec\theta))\]
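For instance, a random intercept model for the twin pairs takes \(\vec z_{ij} = (1)\) and a scalar random effect such that
\[\logit(\Prob(Y_{ij} = 1 \mid U_i = u)) = \vec\beta^\top\vec x_{ij} + u, \qquad U_i \sim N(0, \sigma^2)\]
with \(\vec\theta = (\sigma^2)\).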
The rest of the talk covers: variational approximations, estimation methods, examples, and conclusions.
Define the integrand for cluster \(i\) with \(l_i\) observations (\(l_i = 2\) for the twin pairs)
\[ \begin{align*} h_i(\vec y_i, \vec u) &= g_i(\vec u;\vec\theta) \prod_{j = 1}^{l_i}f_{ij}(y_{ij}\mid \vec u) \end{align*} \]
such that
\[ \begin{align*} f_i(\vec y_i) &= \int h_i(\vec y_i, \vec u) \der\vec u \\ f_{ij}(y_{ij}\mid \vec u) &= \exp\bigg(y_{ij} \eta_{ij}(\vec\beta,\vec u) - b(\eta_{ij}(\vec\beta,\vec u)) + c(y_{ij})\bigg) \\ \eta_{ij}(\vec\beta,\vec u) &= \vec\beta^\top\vec x_{ij} + \vec u^\top\vec z_{ij} \end{align*} \]
Select some variational distribution with density \(\nu_i\) parameterized by \(\vec\omega\) in some set \(\Omega_i\). Then, by Jensen's inequality,
\[ \begin{align*} \log f_i(\vec y_i) &\geq \int \nu_i(\vec u;\vec\omega) \log\left( \frac{h_i(\vec y_i,\vec u)} {\nu_i(\vec u;\vec\omega)}\right)\der\vec u \\ &= \log \tilde f_i(\vec y_i;\vec\omega) \end{align*} \]
for any \(\vec\omega \in \Omega_i\).
Use
\[ \argmax_{\vec\beta,\vec\theta,\vec\omega_1,\dots,\vec\omega_n} \sum_{i = 1}^n \log \tilde f_i(\vec y_i;\vec\omega_i) \]
instead of
\[ \argmax_{\vec\beta,\vec\theta} \sum_{i = 1}^n \log f_i(\vec y_i) \]
The expected log integrand is often simple to compute (at worst a sum of one-dimensional integrals):
\[\int \nu_i(\vec u;\vec\omega) \log\left( h_i(\vec y_i,\vec u)\right)\der\vec u\]
Easily computed with e.g. adaptive Gauss-Hermite quadrature.
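For the logit link, the main expectation without a closed form under a Gaussian variational distribution is \(\Expec[\log(1 + \exp\eta)]\) with \(\eta\) Gaussian. Below is a minimal R sketch of this one-dimensional integral using (non-adaptive) Gauss-Hermite quadrature from the statmod package; the mean, standard deviation, and number of nodes are made up for illustration:

```r
library(statmod)  # for gauss.quad

# E[log(1 + exp(eta))] with eta ~ N(mean, sd^2) via Gauss-Hermite quadrature:
# E[g(eta)] is approximately pi^(-1/2) * sum_k w_k g(mean + sqrt(2) * sd * x_k)
E_log1p_exp <- function(mean, sd, n_nodes = 20L) {
  gh <- gauss.quad(n_nodes, kind = "hermite")
  eta <- mean + sqrt(2) * sd * gh$nodes
  sum(gh$weights * log1p(exp(eta))) / sqrt(pi)
}

# e.g. with a variational mean of one and a standard deviation of one half
E_log1p_exp(mean = 1, sd = .5)
```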
The number of variational parameters is \(\bigO{n}\).
A Gaussian VA (GVA) is:
\[ \nu_i(\vec u;\vec\mu_i, \mat\Psi_i) = \phi(\vec u;\vec\mu_i, \mat\Psi_i) \]
With \(d = 4\)-dimensional random effects and \(n = 1000\) clusters this yields \(n(d + d(d + 1)/2) = 14000\) variational parameters: a mean vector and a covariance matrix per cluster.
Ormerod and Wand (2012) show the formulas for GVAs for GLMMs.
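As a sketch of the form such formulas take (derived from the bound above rather than quoted from the paper), with \(\nu_i = \phi(\cdot;\vec\mu_i,\mat\Psi_i)\), \(\vec U_i\sim N(\vec 0, \mat\Sigma_i(\vec\theta))\), and \(d = \dim(\vec U_i)\):
\[ \begin{align*} \log \tilde f_i(\vec y_i;\vec\mu_i,\mat\Psi_i) &= \sum_{j = 1}^{l_i}\Big( y_{ij}(\vec\beta^\top\vec x_{ij} + \vec\mu_i^\top\vec z_{ij}) - \Expec[b(\eta_{ij})] + c(y_{ij})\Big) \\ &\hspace{20pt} - \frac 12\Big( \text{tr}\big(\mat\Sigma_i(\vec\theta)^{-1}\mat\Psi_i\big) + \vec\mu_i^\top\mat\Sigma_i(\vec\theta)^{-1}\vec\mu_i - d + \log\lvert\mat\Sigma_i(\vec\theta)\rvert - \log\lvert\mat\Psi_i\rvert\Big) \\ \eta_{ij} &\sim N\big(\vec\beta^\top\vec x_{ij} + \vec\mu_i^\top\vec z_{ij},\, \vec z_{ij}^\top\mat\Psi_i\vec z_{ij}\big) \end{align*} \]
The second term is the Kullback-Leibler divergence between \(\nu_i\) and the random effect distribution, and the \(\Expec[b(\eta_{ij})]\) terms are the one-dimensional integrals from before.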
The objective function is:
\[ l(\vec\beta,\vec\theta,\vec\omega_1,\dots,\vec\omega_n) = \sum_{i = 1}^n \log \tilde f_i(\vec y_i;\vec\omega_i) \]
The Hessian
\[\nabla^2 l(\vec x) = \begin{pmatrix} \partial^2 l(\vec x) / \partial x_1\partial x_1 & \dots & \partial^2 l(\vec x) / \partial x_1\partial x_k \\ \vdots &\ddots &\vdots \\ \partial^2 l(\vec x) / \partial x_k\partial x_1 & \dots & \partial^2 l(\vec x) / \partial x_k\partial x_k \end{pmatrix}\]
is extremely sparse and highly structured.
Hypothetical example of \(\nabla^2 f(\vec x)\) with \(n = 10\) clusters. Black entries are non-zero.
Want to minimize some function \(f\) with \(o\) inputs.
Start at some \(\vec x^{(0)}\in\mathbb R^o\). Set \(k = 0\) and
\[\vec x^{(k+1)} = \vec x^{(k)} - \gamma(\nabla^2 f(\vec x^{(k)}))^{-1}\nabla f(\vec x^{(k)})\]
for \(\gamma \in (0, 1]\). Set \(k \leftarrow k + 1\) if not converged.
Good convergence properties if started close to an optimum. Inversion is \(\bigO {o^3}\) in general.
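A minimal R sketch of the damped Newton iteration above; the toy convex objective, starting values, and stopping criterion are made up for illustration:

```r
# toy convex objective with gradient and Hessian
fn <- function(x) sum(exp(x) - 2 * x) + sum(x^2) / 2
gr <- function(x) exp(x) - 2 + x
he <- function(x) diag(exp(x) + 1)

newton <- function(x, gamma = 1, tol = 1e-10, max_it = 100L) {
  for (k in seq_len(max_it)) {
    step <- solve(he(x), gr(x))  # the O(o^3) solve/inversion in general
    x <- x - gamma * step
    if (sqrt(sum(step^2)) < tol)
      break                      # stop when the Newton step is tiny
  }
  x
}

newton(rep(2, 4))  # converges to the point where exp(x) + x = 2 in each coordinate
```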
BFGS avoids the explicit inversion by maintaining an approximation of \((\nabla^2 f(\vec x^{(k)}))^{-1}\) which is updated with rank-two updates. It can also provide an approximation of \(\nabla^2 f(\vec x^{(k)})\).
Still a \(\bigO{o^2}\) iteration cost and memory cost.
Limited-memory BFGS (L-BFGS) makes an approximation of \(\nabla^2 f(\vec x^{(k)})\) using only the last few gradients, \(\nabla f(\vec x^{(k)})\).
\(\bigO o\) iteration cost and memory cost.
Poorer convergence properties, typically requiring more iterations.
Suppose that
\[f(\vec x) = \sum_{i = 1}^n f_i(\vec x_{\mathcal I_i})\]
where
\[\begin{align*} \vec x_{\mathcal I_i} &= (\vec e_{j_{i1}}, \dots ,\vec e_{j_{im_i}})^\top\vec x\\ \mathcal I_i &= \{j_{i1}, \dots, j_{im_i}\} \subseteq \{1, \dots, o\} \end{align*}\]
\(\vec e_k\) is the \(k\)’th column of the \(o\) dimensional identity matrix, and \(m_i \ll o\) for all \(i = 1,\dots,n\).
Make \(n\) BFGS approximations of \(\nabla^2 f_1(\vec x^{(k)}_{\mathcal I_1}),\dots,\nabla^2 f_n(\vec x^{(k)}_{\mathcal I_n})\).
Preserves the sparsity of \(\nabla^2 f(\vec x^{(k)})\) and provides potentially quicker convergence towards a good approximation.
Fast, as we can use sparse matrix-vector products.
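A minimal R sketch of the partially separable structure with toy quadratic element functions; the dimensions, index sets, and element functions are made up for illustration and this is not the interface of the psqn package:

```r
library(Matrix)  # for sparse matrices

set.seed(1)
o <- 8L  # total number of parameters
# index sets I_i: two parameters shared by all element functions (like the
# model parameters) plus a private block per element function (like the
# variational parameters of one cluster)
idx <- list(c(1:2, 3:4), c(1:2, 5:6), c(1:2, 7:8))

# toy quadratic element functions f_i(x_I) = x_I^T A_i x_I / 2
As <- lapply(idx, function(ii)
  crossprod(matrix(rnorm(length(ii)^2), length(ii))))

f <- function(x)
  sum(mapply(function(ii, Ai) drop(x[ii] %*% Ai %*% x[ii]) / 2, idx, As))

# the Hessian only has non-zero entries in the (I_i x I_i) blocks; entries from
# overlapping blocks are summed when the sparse matrix is formed
H <- sparseMatrix(
  i = unlist(lapply(idx, function(ii) rep(ii, times = length(ii)))),
  j = unlist(lapply(idx, function(ii) rep(ii, each  = length(ii)))),
  x = unlist(As), dims = c(o, o))

x <- rnorm(o)
f(x)     # evaluate the sum of element functions
H %*% x  # sparse matrix-vector product; cost is O(#non-zeros) rather than O(o^2)
```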
The method is implemented in the psqn package (Christoffersen 2020).
Provides a header-only C++ library and an R interface.
Supports computation in parallel.
The implementation closely follows Nocedal and Wright (2006).
Three-dimensional fixed effects, \(\vec x_{ij}, \vec\beta\in\mathbb R^3\).
One-dimensional random effect, \(\vec U_i \in \mathbb R\) and \(\vec z_{ij} = (1)\).
All configurations use 100 replications.
Compare with the Laplace approximation and the adaptive Gauss-Hermite quadrature (AGHQ) from the lme4 package (Bates et al. 2015).
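For reference, a minimal sketch of how the two lme4 fits might look; the data frame dat and the variable names (y, cont, bin, cluster) are hypothetical:

```r
library(lme4)

# dat: hypothetical data.frame with the outcome y, a continuous covariate cont,
# a binary covariate bin, and a cluster identifier cluster

# the Laplace approximation (nAGQ = 1 is the default)
fit_laplace <- glmer(y ~ cont + bin + (1 | cluster), data = dat,
                     family = binomial(), nAGQ = 1L)

# adaptive Gauss-Hermite quadrature with 25 quadrature points (lme4 supports
# nAGQ > 1 only with a single scalar random effect per cluster)
fit_aghq <- glmer(y ~ cont + bin + (1 | cluster), data = dat,
                  family = binomial(), nAGQ = 25L)

fixef(fit_aghq)  # the estimated fixed effects
```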
\(n = 1000\) clusters. Bias estimates are:
| | Inter. | Cont. | Bin. | Std. |
|---|---|---|---|---|
| Laplace | 0.00573 | 0.0045 | -0.0054 | -0.0391 |
| | (0.00646) | (0.0037) | (0.0068) | (0.0053) |
| AGHQ | 0.00022 | 0.0036 | -0.0046 | -0.0078 |
| | (0.00651) | (0.0037) | (0.0068) | (0.0055) |
| GVA | 0.00579 | 0.0045 | -0.0054 | -0.0221 |
| | (0.00648) | (0.0037) | (0.0068) | (0.0053) |
Monte Carlo standard errors are in parentheses. The first three columns are the fixed effects, \(\vec\beta\). Inter.: intercept, cont.: continuous covariate, and bin.: binary covariate. The last column is the random effect standard deviation. The GVA row uses the psqn package.
| | mean | median |
|---|---|---|
| Laplace | 0.713 | 0.697 |
| AGHQ | 1.824 | 1.818 |
| GVA | 0.244 | 0.247 |
| GVA (4 threads) | 0.070 | 0.070 |
| GVA LBFGS | 2.652 | 2.663 |
Computation times are in seconds. “GVA LBFGS” uses the limited-memory BFGS implementation in the lbfgs package (Coppola, Stewart, and Okazaki 2020).
\(n = 5000\) clusters. Bias estimates are:
| | Inter. | Cont. | Bin. | Rng. std. |
|---|---|---|---|---|
| Laplace | 0.0023 | 0.0019 | 0.0018 | -0.0333 |
| | (0.0033) | (0.0015) | (0.0032) | (0.0023) |
| AGHQ | -0.0032 | 0.0010 | 0.0026 | -0.0021 |
| | (0.0034) | (0.0015) | (0.0033) | (0.0024) |
| GVA | 0.0023 | 0.0019 | 0.0018 | -0.0164 |
| | (0.0033) | (0.0015) | (0.0033) | (0.0023) |
Monte Carlo standard errors are in parentheses. The first three columns are the fixed effects, \(\vec\beta\). Inter.: intercept, cont.: continuous covariate, and bin.: binary covariate. The last column is the random effect standard deviation. The GVA row uses the psqn package.
| | mean | median |
|---|---|---|
| Laplace | 3.706 | 3.691 |
| AGHQ | 10.214 | 10.091 |
| GVA | 1.208 | 1.240 |
| GVA (4 threads) | 0.343 | 0.351 |
| GVA LBFGS | 22.125 | 21.706 |
Computation times are in seconds. “GVA LBFGS” uses the limited-memory BFGS implementation in the lbfgs package.
Three-dimensional random effect, \(\vec U_i \in \mathbb R^3\), and \(\vec z_{ij} = \vec x_{ij}\).
\(n = 1000\) clusters. Ten observations per cluster.
Bias estimates are:
| | Inter. | Cont. | Bin. | Inter. std. | Cont. std. | Bin. std. |
|---|---|---|---|---|---|---|
| Laplace | -0.0047 | 0.0029 | 0.0051 | -0.0123 | -0.0162 | -0.0357 |
| | (0.0035) | (0.0039) | (0.0044) | (0.0039) | (0.0033) | (0.0056) |
| GVA | -0.0018 | 0.0047 | 0.0050 | -0.0042 | -0.0139 | -0.0190 |
| | (0.0035) | (0.0039) | (0.0044) | (0.0039) | (0.0032) | (0.0054) |
Monte Carlo standard errors are in parentheses. The first three columns are the fixed effects, \(\vec\beta\). Inter.: intercept, cont.: continuous covariate, and bin.: binary covariate. The last three columns are the random effect standard deviations. The GVA row uses the psqn package.
| | Cor. cont. inter. | Cor. bin. inter. | Cor. cont. bin. |
|---|---|---|---|
| Laplace | -0.0015 | 0.02211 | 0.0035 |
| | (0.0056) | (0.00891) | (0.0075) |
| GVA | -0.0063 | 0.00096 | 0.0087 |
| | (0.0055) | (0.00862) | (0.0072) |
Bias estimates of the correlation coefficients. Monte Carlo standard errors are in parentheses.
| | mean | median |
|---|---|---|
| Laplace | 32.236 | 27.580 |
| GVA | 2.013 | 1.987 |
| GVA (4 threads) | 0.575 | 0.567 |
| GVA LBFGS | 26.814 | 26.735 |
Computation times are in seconds. “GVA LBFGS” uses the limited-memory BFGS implementation in the lbfgs package.
Introduced variational approximations for generalized linear mixed models which scale well in the number of clusters and the dimension of the random effects.
The psqn package exploits the sparsity of the Hessian and yields very fast estimation times.
Easy to extend to other members of the exponential family and to other link functions.
Can use other variational distributions.
Add Newton conjugate gradient method.
Add constraints.
Add a trust region method.
Add other sparsity patterns of the Hessian.
The presentation is at rpubs.com/boennecd/GVAs-n-psqn-KTH.
The markdown is at github.com/boennecd/Talks.
The code for the example is at github.com/boennecd/psqn-va-ex.
The psqn package is on CRAN and at github.com/boennecd/psqn.
References are on the next slide.
Bates, Douglas, Martin Mächler, Ben Bolker, and Steve Walker. 2015. “Fitting Linear Mixed-Effects Models Using lme4.” Journal of Statistical Software 67 (1): 1–48. https://doi.org/10.18637/jss.v067.i01.
Christoffersen, Benjamin. 2020. psqn: Partially Separable Quasi-Newton. https://github.com/boennecd/psqn.
Coppola, Antonio, Brandon Stewart, and Naoaki Okazaki. 2020. lbfgs: Limited-memory BFGS Optimization.
Nocedal, Jorge, and Stephen Wright. 2006. Numerical Optimization. 2nd ed. Springer Science & Business Media. https://doi.org/10.1007/978-0-387-40065-5.
Ormerod, J. T., and M. P. Wand. 2012. “Gaussian Variational Approximate Inference for Generalized Linear Mixed Models.” Journal of Computational and Graphical Statistics 21 (1). Taylor & Francis: 2–17. https://doi.org/10.1198/jcgs.2011.09118.