Approximation Methods

Introduction

\[ \renewcommand\vec{\boldsymbol} \def\bigO#1{\mathcal{O}(#1)} \def\Cond#1#2{\left(#1\,\middle|\, #2\right)} \def\mat#1{\boldsymbol{#1}} \def\der{{\mathop{}\!\mathrm{d}}} \def\argmax{\text{arg}\,\text{max}} \def\Prob{\text{P}} \def\diag{\text{diag}} \]

Presentation Outline

Introduction to mixed models with a probit link.

Discussion of available approximation methods.

Simulation example.

Real data example.

Next steps and alternatives.

Three Models

Why

Very similar marginal log-likelihood.

Many options for approximating the marginal log-likelihood.

Notation

\[ \begin{align*} \phi^{(K)}(\vec x;\vec\mu,\mat\Sigma) &= \frac 1{(2\pi)^{K/2}\lvert\mat\Sigma\rvert^{1/2}}\exp\left( - \frac 12 (\vec x - \vec\mu)^\top\mat\Sigma^{-1}(\vec x - \vec\mu) \right) \\ \Phi^{(K)}(\vec x;\vec\mu,\mat\Sigma) &= \int_{-\infty}^{x_1} \cdots\int_{-\infty}^{x_K} \phi^{(K)}(\vec z;\vec\mu,\mat\Sigma)\der z_1 \cdots \der z_K \\ \end{align*} \] The standard case \[ \begin{align*} \phi^{(K)}(\vec x) &= \phi^{(K)}(\vec x;\vec 0,\mat I) \\ \Phi^{(K)}(\vec x) &= \Phi^{(K)}(\vec x;\vec 0,\mat I) \\ \end{align*} \] The univariate case \[ \begin{align*} \phi(x; \mu, \sigma^2) &= \phi^{(1)}(x; \mu, \sigma^2) \\ \Phi(x; \mu, \sigma^2) &= \Phi^{(1)}(x; \mu, \sigma^2) \end{align*} \]

Generalization of Skew-normal Dist.

\[ \begin{pmatrix} \vec V_1 \\ \vec V_2 \end{pmatrix} \sim N\left( \begin{pmatrix} \vec \xi_1 \\ \vec\xi_2 \end{pmatrix}, \begin{pmatrix} \mat\Xi_{11} & \mat\Xi_{12} \\ \mat\Xi_{21} & \mat\Xi_{22} \end{pmatrix} \right) \]

then the joint density of \(\vec V_1 = \vec v_1\) and the event \(\vec V_2 \leq \vec v_2\) is

\[ \begin{align*} \phi^{(k_1)}(\vec v_1; \vec \xi_1, \mat\Xi_{11}) \Prob\left(\vec V_2 < \vec v_2 \,\middle\vert\, \vec V_1 = \vec v_1 \right)\nonumber \hspace{-140pt}& \\ &= \phi^{(k_1)}(\vec v_1; \vec \xi_1, \mat\Xi_{11}) \\ &\hspace{20pt}\cdot \Phi^{(k_2)}\left( \vec v_2 - \mat\Xi_{21}\mat\Xi_{11}^{-1}(\vec v_1 - \vec\xi_1); \vec \xi_2, \mat\Xi_{22} - \mat\Xi_{21}\mat\Xi_{11}^{-1}\mat\Xi_{12} \right) \end{align*} \]

Generalization of Skew-normal Dist.

… and the marginal is

\[ \begin{align*} \Prob(\vec V_2 \leq \vec v_2) &= \Phi^{(k_2)}(\vec v_2; \vec\xi_2, \mat\Xi_{22}) \\ &= \int \phi^{(k_1)}(\vec v_1; \vec \xi_1, \mat\Xi_{11}) \\ &\hspace{30pt}\cdot \Prob\left(\vec V_2 \leq \vec v_2 \,\middle\vert\, \vec V_1 = \vec v_1 \right)\der v_{11} \cdots \der v_{1k_1} \end{align*} \]

I.e. we can work with either a \(k_1\)- or a \(k_2\)-dimensional intractable integral.
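The identity can be checked numerically. Below is a minimal sketch in R (all values are assumed for illustration) that compares the \(k_2\)-dimensional CDF with a Monte Carlo estimate of the \(k_1\)-dimensional Gaussian-weighted integral using the mvtnorm package.

```r
# Minimal numeric check of the marginal identity (assumed values)
library(mvtnorm)
set.seed(1)
k1 <- 2L; k2 <- 2L
Xi <- crossprod(matrix(rnorm((k1 + k2)^2), k1 + k2)) # random covariance
xi <- rnorm(k1 + k2)
v2 <- rnorm(k2)
i1 <- 1:k1; i2 <- k1 + 1:k2

# the k2-dimensional CDF
lhs <- pmvnorm(upper = v2, mean = xi[i2], sigma = Xi[i2, i2])

# Monte Carlo over V1 of the conditional CDF of V2 given V1 = v1
M <- Xi[i2, i1] %*% solve(Xi[i1, i1])
S <- Xi[i2, i2] - M %*% Xi[i1, i2]
S <- (S + t(S)) / 2                                  # symmetrize numerically
V1 <- rmvnorm(10000L, mean = xi[i1], sigma = Xi[i1, i1])
rhs <- mean(apply(V1, 1L, function(v1)
  pmvnorm(upper = drop(v2 - M %*% (v1 - xi[i1])), mean = xi[i2], sigma = S)))

c(lhs = lhs, rhs = rhs)                              # agree up to MC error
```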

Three Models

We have \(\vec Y = (Y_1,\dots,Y_n)^\top\) observed outcomes.

Mixed binomial

Outcomes \(Y_i \in \{0,\dots,m\}\) are conditionally independent and binomially distributed.

Mixed multinomial

Outcomes \(Y_i \in \{1,\dots,c\}\) are conditionally independent and multinomially distributed.

Mixed generalized survival model (GSM)

Outcomes \(Y_i\in (0,\infty)\) are conditionally independently drawn from a GSM and potentially right censored.

Mixed Binomial

Given the random effect \(\vec U = \vec u \in \mathbb{R}^K\), the outcomes \(Y_1, \dots, Y_n\) are conditionally independent and

\[ Y_i \sim \text{Bin}(\Phi(\vec x_i^\top\vec\beta + \vec z_i^\top\vec u), m) \]

with

\[\vec U \sim N(\vec 0, \mat\Sigma)\]

\(\vec x_i\) and \(\vec z_i\) are known covariates. \(\vec x_i^\top\vec\beta\) is the fixed effect. \(\vec\beta\) and \(\mat\Sigma\) are unknown parameters. \(\vec U\) is an unobservable random effect.
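For concreteness, a minimal sketch of simulating from this model in R; all parameter values and dimensions below are assumed for illustration.

```r
# Simulate from the mixed binomial model (assumed parameter values)
library(mvtnorm)
set.seed(1)
n <- 10L; m <- 5L; K <- 2L; p <- 3L
beta  <- rnorm(p)
Sigma <- diag(K)                        # random effect covariance matrix
X <- matrix(rnorm(n * p), n)            # rows are the x_i
Z <- matrix(rnorm(n * K), n)            # rows are the z_i
u <- drop(rmvnorm(1L, sigma = Sigma))   # U ~ N(0, Sigma)
y <- rbinom(n, size = m, prob = pnorm(drop(X %*% beta + Z %*% u)))
```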

Mixed Binomial Likelihood

The complete data likelihood is

\[ \begin{align*} p(\vec u, \vec y) &= c(\vec y) \phi^{(K)}(\vec u;\vec 0, \mat\Sigma) \\ &\hspace{20pt}\cdot\prod_{i = 1}^n \Phi(\vec x_i^\top\vec\beta + \vec z_i^\top\vec u)^{y_i} \Phi(-\vec x_i^\top\vec\beta - \vec z_i^\top\vec u)^{m - y_i}\\ c(\vec y) &= \prod_{i = 1}^n \begin{pmatrix}m \\ y_i\end{pmatrix}\nonumber \end{align*} \]

The marginal log-likelihood is

\[ l(\vec\beta,\mat\Sigma) = \log\int p(\vec u, \vec y) \der u_1\cdots \der u_K \]

Mixed Binomial Likelihood

Let \(\mat X = (\vec x_1, \dots, \vec x_n)^\top\) and define

\[ \begin{align*} \vec j_i &= (\underbrace{1, \dots, 1}_{y_i\text{ times}}, \underbrace{-1, \dots, -1}_{m - y_i\text{ times}})^\top \\ \widetilde{\mat X} &= \begin{pmatrix} \vec j_1 & \vec 0 & \cdots & \vec 0 \\ \vec 0 & \vec j_2 & \ddots & \vdots \\ \vdots & \ddots & \ddots & \vec 0 \\ \vec 0 & \cdots & \vec 0 & \vec j_n \end{pmatrix}\mat X \end{align*} \]

and similarly \(\mat Z\) and \(\widetilde{\mat Z}\).

Mixed Binomial Likelihood

Then

\[ \begin{align*} p(\vec u, \vec y) &= c(\vec y) \phi^{(K)}(\vec u;\vec 0, \mat\Sigma) \\ &\hspace{20pt}\cdot\prod_{i = 1}^n \Phi(\vec x_i^\top\vec\beta + \vec z_i^\top\vec u)^{y_i} \Phi(-\vec x_i^\top\vec\beta - \vec z_i^\top\vec u)^{m - y_i} \\ &= c(\vec y) \phi^{(K)}(\vec u;\vec 0, \mat\Sigma) \Phi^{(nm)}(\widetilde{\mat X}\vec\beta + \widetilde{\mat Z}\vec u) \end{align*} \]

I.e. we have a \(K\)- or an \(nm\)-dimensional integral,

as shown by Y. Pawitan et al. (2004) in the binary case and by Ochi and Prentice (1984) in a more restricted case.
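Combining this with the marginal identity from the earlier slides, the marginal likelihood is, up to the constant \(c(\vec y)\), the \(nm\)-dimensional Gaussian CDF \(\Phi^{(nm)}(\widetilde{\mat X}\vec\beta; \vec 0, \mat I + \widetilde{\mat Z}\mat\Sigma\widetilde{\mat Z}^\top)\). A minimal numeric check in the binary case, \(m = 1\), with assumed values:

```r
# Check that the K-dimensional integral equals the nm-dimensional CDF
# in the binary case m = 1 (assumed values)
library(mvtnorm)
set.seed(1)
n <- 4L; K <- 2L; p <- 2L
beta <- rnorm(p); Sigma <- diag(K)
X <- matrix(rnorm(n * p), n); Z <- matrix(rnorm(n * K), n)
y <- rbinom(n, 1L, .5)
sgn <- 2 * y - 1                        # the signs in the j_i vectors
Xt <- sgn * X; Zt <- sgn * Z            # tilde(X) and tilde(Z)

# nm-dimensional representation
f1 <- pmvnorm(upper = drop(Xt %*% beta),
              sigma = diag(n) + tcrossprod(Zt %*% Sigma, Zt))

# K-dimensional representation by Monte Carlo over U ~ N(0, Sigma)
U  <- rmvnorm(10000L, sigma = Sigma)
lp <- sweep(tcrossprod(U, Zt), 2L, drop(Xt %*% beta), "+")
f2 <- mean(apply(pnorm(lp), 1L, prod))

c(f1 = f1, f2 = f2)                     # agree up to MC error
```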

Mixed Multinomial

We have \(c\) categories.

\(\mat Z_i = (\vec z_{i1}, \dots, \vec z_{ic})^\top \in \mathbb{R}^{c\times K}\): known random effect covariates.

\(\mat B = (\vec \beta_1, \dots, \vec\beta_c)^\top\): fixed effect coefficients.

We observe \(Y_1,\dots, Y_n \in \{1,\dots,c\}\) with

\[Y_i = k \Leftrightarrow \forall k' \neq k:\, A_{ik} > A_{ik'}, \qquad k,k'\in\{1,\dots,c\}\] where \(\vec A_i\) is a latent variable.

Mixed Multinomial Cont.

Assume \[\vec A_i \mid \vec U = \vec u \sim N(\mat B\vec x_i + \mat Z_i \vec u, \mat I)\]
then \[ \begin{align*} \mathcal{C}_{ik} &= \left\{ \vec A_i:\,\forall k' \neq k: A_{ik} > A_{ik'} \right\} \\ p(Y_i = k \mid \vec U = \vec u) &= \int_{\mathcal{C}_{ik}} \phi^{(c)} (\vec a; \mat B\vec x_i + \mat Z_i \vec u, \mat I) \der a_1\cdots\der a_c \\ &\hspace{-60pt}= \Phi^{(c - 1)}( \underbrace{(\vec 1\vec\beta_k^\top - \mat B_{(-k)})}_{ \widetilde{\mat B}_k}\vec x_i + \underbrace{(\vec 1\vec z_{ik}^\top - \mat Z_{i(-k)})}_{ \widetilde{\mat Z}_{ik}}\vec u; \vec 0, \mat I + \vec 1\vec 1^\top) \end{align*} \]
with \(\mat B_{(-k)} = (\vec\beta_1, \dots, \vec\beta_{k-1}, \vec\beta_{k + 1}, \dots, \vec\beta_c)\)

and similarly for \(\mat Z_i\). See McFadden (1984).
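A minimal numeric check of the closed-form conditional probability in R (all values are assumed): compare it with the frequency with which \(A_{ik}\) is the largest latent variable.

```r
# Check the closed-form conditional probability (assumed values)
library(mvtnorm)
set.seed(1)
nc <- 3L; K <- 2L; p <- 2L; k <- 2L
B  <- matrix(rnorm(nc * p), nc)          # fixed effect coefficients
Zi <- matrix(rnorm(nc * K), nc)          # random effect covariates
xi <- rnorm(p); u <- rnorm(K)

# simulate the latent A_i and count how often category k wins
eta <- drop(B %*% xi + Zi %*% u)
A   <- matrix(rnorm(100000L * nc), ncol = nc) + rep(eta, each = 100000L)
sim <- mean(max.col(A) == k)

# closed form with tilde(B)_k and tilde(Z)_ik as defined above
Bt <- rep(1, nc - 1) %o% B[k, ] - B[-k, , drop = FALSE]
Zt <- rep(1, nc - 1) %o% Zi[k, ] - Zi[-k, , drop = FALSE]
cf <- pmvnorm(upper = drop(Bt %*% xi + Zt %*% u),
              sigma = diag(nc - 1) + 1)  # I + 11^T

c(simulated = sim, closed_form = cf)
```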

Mixed Multinomial Likelihood

The complete data likelihood is

\[ \begin{align*} p(\vec u, \vec y) &= \phi^{(K)}(\vec u; \vec 0, \mat\Sigma) \prod_{i = 1}^n \Phi^{(c - 1)}( \widetilde{\mat B}_{y_i}\vec x_i + \widetilde{\mat Z}_{iy_i}\vec u; \vec 0, \mat I + \vec 1\vec 1^\top) \\ &= \phi^{(K)}(\vec u; \vec 0, \mat\Sigma) \\ &\hspace{20pt}\cdot \Phi^{(n(c - 1))}( \widetilde{\mat B}\vec x + \widetilde{\mat Z}\vec u; \vec 0, \diag(\underbrace{ \mat I + \vec 1\vec 1^\top, \dots, \mat I + \vec 1\vec 1^\top}_{ n\text{ times}})) \end{align*} \]

with \(\widetilde{\mat B} = \diag(\widetilde{\mat B}_{y_1}, \dots, \widetilde{\mat B}_{y_n})\), \(\vec x = (\vec x_1^\top, \dots, \vec x_n^\top)^\top\), and \(\widetilde{\mat Z} = (\widetilde{\mat Z}_{1y_1}^\top, \dots, \widetilde{\mat Z}_{ny_n}^\top)^\top\).

I.e. we have a \(K\)- or an \(n(c-1)\)-dimensional integral.

Mixed GSM

The survival time is \(Y_i^* \in (0, \infty)\).

Only observe \(Y_i = \min (Y_i^*, C_i)\) where \(C_i\) is an independent censoring time and let \(D_i = 1_{\{Y_i^* < C_i\}}\) be an event indicator.

Censoring occurs e.g. due to drop-out.

Mixed GSM Cont.

Let \(S(y\mid \vec x, \vec z, \vec u) = \Prob(Y^* > y \mid \vec x, \vec z, \vec u)\) be the conditional survival function.

Assume that

\[S(y\mid\vec x, \vec z, \vec u) = \Phi(-\vec x^\top(y)\vec\beta - \vec z^\top\vec u)\]

See Royston and Parmar (2002), X.-R. Liu, Pawitan, and Clements (2016), and X.-R. Liu, Pawitan, and Clements (2017).

Mixed GSM Cont.

Define the sets of indices of censored and observed events \[ \begin{align*} \mathcal{C} &= \{i\in \{1,\dots,n\}:\, d_i = 0\} \\ \mathcal{O} &= \{i\in \{1,\dots,n\}:\, d_i = 1\} = \{1,\dots,n\}\setminus\mathcal{C} \end{align*} \]

and \[\mat X^o(\vec Y^o) = (\vec x_j(Y_j))_{j\in\mathcal{O}}^\top\]

Similarly define \(\mat X^{\prime o}(\vec Y^o)\), \(\mat Z^o\), \(\vec Y^o\), \(\mat X^c(\vec Y^c)\), \(\mat Z^c\), and \(\vec Y^c\). Derivatives are applied element-wise.

Mixed GSM Likelihood

The complete data likelihood is

\[ \begin{align*} p(\vec u, \vec y, \vec d) &= \phi^{(K)}(\vec u; \vec 0, \mat\Sigma) c(\vec y^o, \mat X^o, \vec\beta)\\ &\hspace{20pt}\cdot \phi^{(\lvert \mathcal{O}\rvert)}(-\mat X^o(\vec y^o)\vec\beta - \mat Z^o\vec u) \\ &\hspace{20pt}\cdot \Phi^{(\lvert \mathcal{C}\rvert)}(-\mat X^c(\vec y^c)\vec\beta - \mat Z^c\vec u) \\ c(\vec y^o, \mat X^o, \vec\beta) &= \prod_{j\in\mathcal{O}}\left(-\vec x_j^{\prime\top}(y_j)\vec\beta\right) \end{align*} \]

Mixed GSM Marginal Log-likelihood

We can show that

\[ \begin{align*} l(\vec\beta, \mat\Sigma) &= \log\int p(\vec u, \vec y, \vec d) \der u_1\cdots\der u_K \\ &= \log k(\vec y^o, \mat X^o, \mat Z^o, \mat\Sigma) \\ &\hspace{20pt}+ \log\int \phi^{(K)}(\vec u; \vec h, \mat H^{-1}) \\ &\hspace{60pt}\cdot \Phi^{(\lvert \mathcal{C}\rvert)}(-\mat X^c(\vec y^c)\vec\beta - \mat Z^c\vec u) \der u_1\cdots\der u_K \end{align*} \]

I.e. we have a \(K\)- or a \(\lvert \mathcal{C}\rvert\)-dimensional integral.

See one of the next slides for the definitions of \(k\), \(\vec h\), and \(\mat H\).
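In fact, applying the marginal identity from the earlier slides to the remaining integral gives the \(\lvert\mathcal{C}\rvert\)-dimensional representation

\[ l(\vec\beta, \mat\Sigma) = \log k(\vec y^o, \mat X^o, \mat Z^o, \mat\Sigma) + \log\Phi^{(\lvert\mathcal{C}\rvert)}\left(-\mat X^c(\vec y^c)\vec\beta - \mat Z^c\vec h; \vec 0, \mat I + \mat Z^c\mat H^{-1}\mat Z^{c\top}\right) \]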

Mixed GSM Remarks

The mixed Tobit model is a special case.

Similar to a mixed version of the linear transformation model discussed by Hothorn, Möst, and Bühlmann (2018).

A discrete-time survival submodel is suggested by Barrett et al. (2015).

Similar to the mixed binomial model.

Mixed GSM Marginal Log-likelihood Cont.

\[ \begin{align*} \mat H(\mat Z^o, \mat\Sigma) &= \mat H = \mat Z^{o\top}\mat Z^o + \mat\Sigma^{-1} \\ \vec h(\vec y^o, \mat X^o, \mat Z^o, \mat\Sigma) &= \vec h = \mat H^{-1} \mat Z^{o\top}\left(- \mat X^o(\vec y^o)\vec\beta \right)\\ k(\vec y^o, \mat X^o, \mat Z^o, \mat\Sigma) &= \\ &\hspace{-40pt} c(\vec y^o, \mat X^o, \vec\beta) (2\pi)^{-\lvert\mathcal O\rvert/2}\lvert\mat\Sigma\mat H\rvert^{-1/2} \\ &\hspace{-20pt} \cdot\exp\left( -\frac 12 (- \mat X^o(\vec y^o)\vec\beta)^\top (- \mat X^o(\vec y^o)\vec\beta) +\frac 12\vec h^\top\mat H\vec h\right) \end{align*} \]
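A direct transcription of these definitions as an R function may be useful as a reference; the argument names, and that the observed-event linear predictors are passed directly, are assumptions.

```r
# Reference sketch of the terms above. eta_o is X^o(y^o) beta and detap_o
# is X'^o(y^o) beta for the observed events (argument names are assumed;
# the elements of detap_o are assumed negative so c is positive)
gsm_terms <- function(eta_o, detap_o, Zo, Sigma) {
  H <- crossprod(Zo) + solve(Sigma)             # H = Z^oT Z^o + Sigma^-1
  h <- drop(solve(H, crossprod(Zo, -eta_o)))    # h = H^-1 Z^oT (-X^o beta)
  log_k <- sum(log(-detap_o)) -                 # log c(y^o, X^o, beta)
    length(eta_o) / 2 * log(2 * pi) -
    c(determinant(Sigma %*% H)$modulus) / 2 -   # -log|Sigma H| / 2
    sum(eta_o^2) / 2 + sum(h * drop(H %*% h)) / 2
  list(H = H, h = h, log_k = log_k)
}
```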

Approximation Methods

Recall

\[ \begin{align*} \Prob(\vec V_2 \leq \vec v_2) &= \Phi^{(k_2)}(\vec v_2; \vec\xi_2, \mat\Xi_{22}) \\ &= \int \phi^{(k_1)}(\vec v_1; \vec \xi_1, \mat\Xi_{11}) \\ &\hspace{30pt}\cdot \Prob\left(\vec V_2 \leq \vec v_2 \,\middle\vert\, \vec V_1 = \vec v_1 \right)\der v_{11} \cdots \der v_{1k_1} \end{align*} \]

Either approximate the \(k_2\)-dimensional CDF or the \(k_1\)-dimensional Gaussian weighted integral.

We focus on the mixed binary model with \(k_2 = n\) and \(k_1 = K\).

I.e. the mixed binomial model with \(m = 1\).

Will Consider

The CDF approximation suggested by Genz (1992).

(Adaptive) Gauss–Hermite Quadrature ([A]GHQ).

Monte Carlo method implemented by Genz and Monahan (1999).

Approximating the CDF

Use the approximation shown by Genz (1992)

or the similar and seemingly independently developed GHK method used by Hajivassiliou, McFadden, and Ruud (1996).

Use the mvtnorm package implementation

(Genz and Bretz 2009; Genz et al. 2020) in Fortran.
Uses randomized Korobov rules (Niederreiter 1972; Keast 1973; Cranley and Patterson 1976). We have seen a more than ten-fold reduction in computation time compared with a C++ implementation of the MC estimator suggested by Genz (1992).

\(\bigO{n^3 + n^2K + nK^2}\),

although the \(\bigO{n^3}\) term does not seem to dominate in practical examples.
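In the binary case, a minimal sketch of the CDF approach is a single pmvnorm call, with Xt and Zt as in the earlier sketch; the implementation referred to above is in C++/Fortran, not this R code.

```r
# Sketch of the CDF approach in the binary case: the marginal likelihood
# is an n-dimensional Gaussian CDF (Xt, Zt as in the earlier sketch)
library(mvtnorm)
ll_cdf <- function(beta, Sigma, Xt, Zt)
  log(pmvnorm(
    upper = drop(Xt %*% beta),
    sigma = diag(nrow(Xt)) + tcrossprod(Zt %*% Sigma, Zt),
    algorithm = GenzBretz(maxpts = 100000L, abseps = 1e-5)))
```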

Gauss–Hermite Quadrature

Approximate the \(K\)-dimensional integral.

Recursive application of a quadrature rule with \(b\) nodes in each of the \(K\) dimensions.

\(\bigO{nb^K}\).

The adaptive method may require a much smaller \(b\)

(Pinheiro and Bates 1995; Q. Liu and Pierce 1994). It requires estimation of the \(K\)-dimensional mode and inversion of a \(K\times K\) matrix.
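A minimal sketch of the non-adaptive version using statmod's gauss.quad; the integrand \(g\) and this set-up are assumptions, and the adaptive version would additionally center and scale the grid at the mode.

```r
# Non-adaptive GHQ sketch: approximate int phi^(K)(u; 0, Sigma) g(u) du
# with a b^K tensor product grid (g and this set-up are assumptions)
library(statmod)
ghq_int <- function(g, Sigma, b) {
  K  <- NROW(Sigma)
  gq <- gauss.quad(b, kind = "hermite")   # nodes/weights for exp(-x^2)
  ns <- as.matrix(expand.grid(rep(list(gq$nodes), K)))
  ws <- apply(as.matrix(expand.grid(rep(list(gq$weights), K))), 1L, prod)
  C  <- t(chol(Sigma))                    # Sigma = C C^T
  us <- sqrt(2) * tcrossprod(ns, C)       # change of variables u = sqrt(2) C x
  sum(ws * apply(us, 1L, g)) / pi^(K / 2)
}

# e.g. the binary likelihood term with Xt, Zt from the earlier sketch
# ghq_int(function(u) prod(pnorm(drop(Xt %*% beta + Zt %*% u))), Sigma, 10L)
```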

Monte Carlo Estimate

Approximate the \(K\)-dimensional integral using the method described by Genz and Monahan (1999).

Genz and Monahan (1999) provide a Fortran implementation for a generic integrand.

\(\bigO{K^3 + s(nK + K^2)}\) where \(s\) is the number of samples

and the \(\bigO{snK}\) term typically dominates.

The adaptive method may require fewer samples, \(s\),

but requires estimation of the \(K\)-dimensional mode and inversion of a \(K\times K\) matrix.
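For contrast, a plain Monte Carlo baseline; this is not the spherical-radial algorithm of Genz and Monahan (1999), only the naive estimator it improves upon.

```r
# Naive Monte Carlo baseline for int phi^(K)(u; 0, Sigma) g(u) du. This is
# *not* the spherical-radial algorithm of Genz and Monahan (1999)
mc_int <- function(g, Sigma, s = 10000L) {
  U  <- mvtnorm::rmvnorm(s, sigma = Sigma)  # draw U ~ N(0, Sigma)
  ev <- apply(U, 1L, g)
  c(estimate = mean(ev), std_error = sd(ev) / sqrt(s))
}
```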

Simulation Example

Details

Use a mixed binomial model in the binary case, \(m = 1\).

The unconditional variance of \(\vec x_i^\top\vec\beta + \vec z_i^\top\vec U\) is 2.

Let \(\eta_i = \vec x_i^\top\vec\beta \sim N(0, 1)\) and draw \(\mat\Sigma\) from a Wishart distribution.

Focus on the evaluation of the marginal log-likelihood

where we fix the relative error of each method.

Average Computation Times (n = 2)

| Method/K | 2 | 3 | 4 | 5 | 6 |
|:---|---:|---:|---:|---:|---:|
| GHQ | 0.02 | 0.12 | 0.79 | 6.93 | 40.46 |
| AGHQ | 0.03 | 0.09 | 0.43 | 3.00 | 26.33 |
| CDF | 0.04 | 0.03 | 0.00 | 0.02 | 0.02 |
| Genz & Monahan | 13.88 | 49.94 | 43.96 | 74.98 | 74.73 |
| Adaptive Genz & Monahan | 13.68 | 9.77 | 6.18 | 23.51 | 22.44 |

Average computation times in milliseconds.

Average Computation Times (n = 4)

| Method/K | 2 | 3 | 4 | 5 | 6 |
|:---|---:|---:|---:|---:|---:|
| GHQ | 0.07 | 0.41 | 3.95 | 24.96 | 153.06 |
| AGHQ | 0.03 | 0.14 | 0.60 | 3.59 | 30.64 |
| CDF | 0.43 | 0.45 | 0.42 | 0.42 | 0.47 |
| Genz & Monahan | 77.28 | 85.60 | 108.23 | 91.98 | 123.41 |
| Adaptive Genz & Monahan | 5.30 | 12.54 | 26.44 | 4.89 | 14.21 |

Average computation times in milliseconds.

Average Computation Times (n = 8)

| Method/K | 2 | 3 | 4 | 5 | 6 |
|:---|---:|---:|---:|---:|---:|
| GHQ | 0.10 | 0.74 | 7.41 | 89.98 | 331.24 |
| AGHQ | 0.05 | 0.17 | 1.36 | 7.80 | 34.50 |
| CDF | 4.74 | 4.61 | 4.55 | 4.74 | 4.59 |
| Genz & Monahan | 103.63 | 238.29 | 244.12 | 215.58 | 193.56 |
| Adaptive Genz & Monahan | 1.91 | 0.82 | 1.20 | 4.21 | 3.34 |

Average computation times in milliseconds.

Average Computation Times (n = 16)

| Method/K | 2 | 3 | 4 | 5 | 6 |
|:---|---:|---:|---:|---:|---:|
| GHQ | 0.28 | 4.42 | 110.59 | 538.85 | 2793.15 |
| AGHQ | 0.06 | 0.26 | 1.74 | 7.81 | 60.53 |
| CDF | 31.71 | 32.35 | 32.07 | 33.58 | 33.39 |
| Genz & Monahan | 141.72 | 383.82 | 439.39 | 575.03 | 561.41 |
| Adaptive Genz & Monahan | 0.24 | 0.69 | 1.26 | 0.71 | 2.99 |

Average computation times in milliseconds.

Average Computation Times (n = 32)

| Method/K | 2 | 3 | 4 | 5 | 6 |
|:---|---:|---:|---:|---:|---:|
| GHQ | 0.98 | 19.06 | 172.64 | 5023.05 | 26964.95 |
| AGHQ | 0.14 | 0.31 | 0.84 | 5.27 | 47.69 |
| CDF | 75.26 | 75.18 | 74.61 | 75.18 | 74.35 |
| Genz & Monahan | 465.74 | 685.87 | 702.84 | 846.62 | 1570.17 |
| Adaptive Genz & Monahan | 0.42 | 0.45 | 0.38 | 0.54 | 0.81 |

Average computation times in milliseconds.

Average Scaled RMSE (n = 2)

| Method/K | 2 | 3 | 4 | 5 | 6 |
|:---|---:|---:|---:|---:|---:|
| GHQ | 8.06 | 6.75 | 8.26 | 6.47 | 5.67 |
| AGHQ | 6.22 | 5.67 | 8.84 | 5.55 | 5.34 |
| CDF | 6.15 | 5.72 | 8.88 | 5.67 | 5.36 |
| Genz & Monahan | 37.81 | 38.73 | 67.95 | 58.60 | 57.10 |
| Adaptive Genz & Monahan | 29.09 | 47.24 | 46.43 | 64.27 | 46.52 |

The average scaled RMSEs are multiplied by \(10^{5}\).

RMSE of (A)GHQ is computed with 4 consecutive values of \(b\).

Average Scaled RMSE (n = 4)

| Method/K | 2 | 3 | 4 | 5 | 6 |
|:---|---:|---:|---:|---:|---:|
| GHQ | 16.32 | 12.51 | 17.72 | 9.58 | 12.83 |
| AGHQ | 9.43 | 9.58 | 12.43 | 8.67 | 15.18 |
| CDF | 19.69 | 18.70 | 22.91 | 24.32 | 26.48 |
| Genz & Monahan | 162.67 | 131.94 | 139.43 | 127.33 | 172.40 |
| Adaptive Genz & Monahan | 56.07 | 162.08 | 89.73 | 132.42 | 138.80 |

The average scaled RMSEs are multiplied by \(10^{5}\).

RMSE of (A)GHQ is computed with 4 consecutive values of \(b\).

Average Scaled RMSE (n = 8)

| Method/K | 2 | 3 | 4 | 5 | 6 |
|:---|---:|---:|---:|---:|---:|
| GHQ | 56.11 | 36.70 | 42.10 | 31.64 | 21.65 |
| AGHQ | 14.96 | 19.47 | 20.94 | 17.23 | 18.97 |
| CDF | 17.83 | 23.58 | 25.84 | 25.91 | 22.78 |
| Genz & Monahan | 236.28 | 185.91 | 200.49 | 272.13 | 252.81 |
| Adaptive Genz & Monahan | 119.25 | 179.82 | 175.57 | 202.02 | 212.55 |

The average scaled RMSEs are multiplied by \(10^{5}\).

RMSE of (A)GHQ is computed with 4 consecutive values of \(b\).

Average Scaled RMSE (n = 16)

| Method/K | 2 | 3 | 4 | 5 | 6 |
|:---|---:|---:|---:|---:|---:|
| GHQ | 134.64 | 160.54 | 126.85 | 93.83 | 85.55 |
| AGHQ | 19.60 | 16.63 | 25.47 | 34.65 | 37.27 |
| CDF | 25.99 | 27.36 | 29.86 | 43.33 | 44.78 |
| Genz & Monahan | 457.59 | 486.99 | 408.75 | 569.23 | 445.92 |
| Adaptive Genz & Monahan | 127.07 | 286.81 | 298.32 | 370.65 | 360.66 |

The average scaled RMSEs are multiplied by \(10^{5}\).

RMSE of (A)GHQ is computed with 4 consecutive values of \(b\).

Average Scaled RMSE (n = 32)

| Method/K | 2 | 3 | 4 | 5 | 6 |
|:---|---:|---:|---:|---:|---:|
| GHQ | 388.58 | 407.13 | 272.85 | 331.03 | 237.07 |
| AGHQ | 31.78 | 47.17 | 69.40 | 52.50 | 54.38 |
| CDF | 76.19 | 74.42 | 105.98 | 91.77 | 101.08 |
| Genz & Monahan | 859.31 | 903.12 | 1052.99 | 941.96 | 840.60 |
| Adaptive Genz & Monahan | 243.57 | 238.78 | 362.14 | 381.14 | 398.06 |

The average scaled RMSEs are multiplied by \(10^{5}\).

RMSE of (A)GHQ is computed with 4 consecutive values of \(b\).

Real Data Example

Salamander Data Set

Two types of salamanders: whiteside and roughbutt.

The question is how mating success depends on the types of the male and the female.

\(Y_i = 1\) if the \(i\)th mating pair is successful.

Model is \[ \begin{align*} \vec U &= (\vec U_f, \vec U_m) \sim N(\vec 0, \diag(\sigma_f^2\mat I, \sigma_m^2\mat I))\\ Y_i \mid \vec U = \vec u &\sim \text{Bin}(\Phi(z_i), 1) \\ z_i &= \beta_{ww}I_{ww}(i) + \beta_{wr}I_{wr}(i) + \beta_{rw}I_{rw}(i) \\ &\hspace{20pt} + \beta_{rr}I_{rr}(i) + u_{ff_i} + u_{mm_i} \end{align*} \]

Crossed Random Effects

\(n = 60\) and \(K = 20\) in each cluster.

(Figure omitted: a grid of the females and males with a square marking each pair that mates.)

MCMC Estimate

## Inference for Stan model: salamander.
## 6 chains, each with iter=100000; warmup=20000; thin=1; 
## post-warmup draws per chain=80000, total post-warmup draws=480000.
## 
##          mean se_mean   sd  2.5%   25%   50%   75% 97.5%  n_eff Rhat
## beta[1]  0.63       0 0.26  0.13  0.45  0.62  0.80  1.16 176522    1
## beta[2] -0.43       0 0.29 -1.02 -0.63 -0.43 -0.23  0.14 225548    1
## beta[3] -1.81       0 0.35 -2.54 -2.04 -1.80 -1.57 -1.16 130912    1
## beta[4]  2.23       0 0.38  1.52  1.97  2.22  2.48  2.99 138428    1
## sigma_f  0.77       0 0.17  0.46  0.65  0.76  0.87  1.12  66649    1
## sigma_m  0.72       0 0.16  0.42  0.61  0.71  0.82  1.06  61948    1
## 
## Samples were drawn using NUTS(diag_e) at Thu Feb 20 10:48:52 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

Laplace Approximation

Requires estimation of the \(K\)-dimensional mode and inversion of a \(K\times K\) matrix.

Very fast and scales well, but heavily biased in some cases.
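For reference, the standard Laplace approximation replaces the integrand with a Gaussian at the mode \(\hat{\vec u} = \argmax_{\vec u} \log p(\vec u, \vec y)\):

\[ \int p(\vec u, \vec y)\der u_1\cdots\der u_K \approx (2\pi)^{K/2}\left\lvert -\nabla_{\vec u}^2\log p(\hat{\vec u}, \vec y)\right\rvert^{-1/2} p(\hat{\vec u}, \vec y) \]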

Laplace Approximation Cont.

## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( probit )
## Formula: y ~ wsm * wsf + (1 | female) + (1 | male)
##    Data: sala
##      AIC      BIC   logLik deviance df.resid 
##    432.2    455.5   -210.1    420.2      354 
## Random effects:
##  Groups Name        Std.Dev.
##  female (Intercept) 0.626   
##  male   (Intercept) 0.576   
## Number of obs: 360, groups:  female, 60; male, 60
## Fixed Effects:
## (Intercept)          wsm          wsf      wsm:wsf  
##       0.601       -0.419       -1.731        2.140
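For reference, a call consistent with the output above; the data frame sala and its columns are taken from the Formula and Data fields.

```r
# Reconstructed from the output above; assumes the data frame `sala`
# with columns y, wsm, wsf, female, and male
library(lme4)
fit_laplace <- glmer(y ~ wsm * wsf + (1 | female) + (1 | male),
                     family = binomial("probit"), data = sala)
```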

CDF Approximation

A C++ implementation which supports parallel computation.

Run on an Intel® Core™ i7-8750H CPU; compiled with gcc 8.3.0.

Uses the mvkbrv Fortran subroutine from the mvtnorm package to approximate the gradient.

Uses the BFGS implementation in R's (R Core Team 2019) optim function.

CDF Approximation Result (Large Relative Error)

## 
## fit_CDF_cpp_fast
## ----------------
## 
## Fixed effects
## (Intercept)         wsm         wsf     wsm:wsf 
##      0.6208     -0.4290     -1.7242      2.1243 
## 
## Random effect standard deviations              
## 0.7073 0.6835 
## 
## Log-likelihood estimate -206.82
## Computation time 23.77/0.48/0.53 (seconds total/per func./per grad.)

CDF Approximation Result (Small Relative Error)

## 
## fit_CDF_cpp
## -----------
## 
## Fixed effects
## (Intercept)         wsm         wsf     wsm:wsf 
##      0.6206     -0.4305     -1.7228      2.1249 
## 
## Random effect standard deviations              
## 0.7075 0.6817 
## 
## Log-likelihood estimate -206.83
## Computation time 6.81/0.76/2.27 (seconds total/per func./per grad.)

Next Steps

Hybrid Method

Lai and Shih (2003) suggest a hybrid Laplace and Monte Carlo method.

May be very fast when combined with methods from Genz (1992) and Genz and Monahan (1999).

Variational Approximation (VA)

Girolami and Rogers (2006) and Consonni and Marin (2007) suggest a VA (Ormerod and Wand 2010).

Consonni and Marin (2007) show that it works poorly in some cases.

Conditional Density

\[ \begin{align*} p(\vec v_1\mid \vec V_2 \leq \vec v_2) &= \phi^{(k_1)}(\vec v_1; \vec \xi_1, \mat\Xi_{11}) \frac {\Prob(\vec V_2 \leq\vec v_2 \mid \vec V_1 = \vec v_1)} {\Prob(\vec V_2 \leq\vec v_2)}\\ &\hspace{-60pt}= \phi^{(k_1)}(\vec v_1; \vec \xi_1, \mat\Xi_{11}) \\ &\hspace{-20pt}\cdot \frac {\Phi^{(k_2)}\left( \vec v_2 - \mat\Xi_{21}\mat\Xi_{11}^{-1}(\vec v_1 - \vec\xi_1); \vec \xi_2, \mat\Xi_{22} - \mat\Xi_{21}\mat\Xi_{11}^{-1}\mat\Xi_{12} \right)} {\Phi^{(k_2)}\left( \vec v_2; \vec \xi_2, \mat\Xi_{22} \right)} \end{align*} \]
This suggests that a skew-normal VA will work well (Ormerod 2011).

Requires estimation of \(\bigO{K^2}\) variational parameters.

Conclusion

Summary

Showed three classes of mixed models with very similar marginal log-likelihoods.

A variety of approximations are available.

Best choice depends on the setting.

Hybrid method may be useful.

One variational approximation seems unexplored.

Thank You!

The presentation is at rpubs.com/boennecd/mix-probit-KTH.

The markdown is at github.com/boennecd/Talks.

Code and more examples at github.com/boennecd/mixprobit.

References are on the next slide.

References

Barrett, Jessica, Peter Diggle, Robin Henderson, and David Taylor-Robinson. 2015. “Joint Modelling of Repeated Measurements and Time-to-Event Outcomes: Flexible Model Specification and Exact Likelihood Inference.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 77 (1): 131–48. doi:10.1111/rssb.12060.

Consonni, Guido, and Jean-Michel Marin. 2007. “Mean-Field Variational Approximate Bayesian Inference for Latent Variable Models.” Computational Statistics & Data Analysis 52 (2): 790–98. doi:10.1016/j.csda.2006.10.028.

Cranley, R., and T. N. L. Patterson. 1976. “Randomization of Number Theoretic Methods for Multiple Integration.” SIAM Journal on Numerical Analysis 13 (6). Society for Industrial and Applied Mathematics: 904–14. http://www.jstor.org/stable/2156452.

Genz, Alan. 1992. “Numerical Computation of Multivariate Normal Probabilities.” Journal of Computational and Graphical Statistics 1 (2). Taylor & Francis: 141–49. doi:10.1080/10618600.1992.10477010.

Genz, Alan, and Frank Bretz. 2009. Computation of Multivariate Normal and T Probabilities. Lecture Notes in Statistics. Heidelberg: Springer-Verlag.

Genz, Alan, and John Monahan. 1999. “A Stochastic Algorithm for High-Dimensional Integrals over Unbounded Regions with Gaussian Weight.” Journal of Computational and Applied Mathematics 112 (1): 71–81. doi:10.1016/S0377-0427(99)00214-9.

Genz, Alan, Frank Bretz, Tetsuhisa Miwa, Xuefei Mi, Friedrich Leisch, Fabian Scheipl, and Torsten Hothorn. 2020. mvtnorm: Multivariate Normal and T Distributions. https://CRAN.R-project.org/package=mvtnorm.

Girolami, M., and S. Rogers. 2006. “Variational Bayesian Multinomial Probit Regression with Gaussian Process Priors.” Neural Computation 18 (8): 1790–1817. doi:10.1162/neco.2006.18.8.1790.

Hajivassiliou, Vassilis, Daniel McFadden, and Paul Ruud. 1996. “Simulation of Multivariate Normal Rectangle Probabilities and Their Derivatives: Theoretical and Computational Results.” Journal of Econometrics 72 (1): 85–134. doi:10.1016/0304-4076(94)01716-6.

Hothorn, Torsten, Lisa Möst, and Peter Bühlmann. 2018. “Most Likely Transformations.” Scandinavian Journal of Statistics 45 (1): 110–34. doi:10.1111/sjos.12291.

Keast, P. 1973. “Optimal Parameters for Multidimensional Integration.” SIAM Journal on Numerical Analysis 10 (5): 831–38. doi:10.1137/0710068.

Lai, Tze Leung, and Mei-Chiung Shih. 2003. “A Hybrid Estimator in Nonlinear and Generalised Linear Mixed Effects Models.” Biometrika 90 (4). [Oxford University Press, Biometrika Trust]: 859–79. http://www.jstor.org/stable/30042093.

Liu, Qing, and Donald A. Pierce. 1994. “A Note on Gauss-Hermite Quadrature.” Biometrika 81 (3). [Oxford University Press, Biometrika Trust]: 624–29. http://www.jstor.org/stable/2337136.

Liu, Xing-Rong, Yudi Pawitan, and Mark Clements. 2016. “Parametric and Penalized Generalized Survival Models.” Statistical Methods in Medical Research 27 (5): 1531–46. doi:10.1177/0962280216664760.

Liu, Xing-Rong, Yudi Pawitan, and Mark S. Clements. 2017. “Generalized Survival Models for Correlated Time-to-Event Data.” Statistics in Medicine 36 (29): 4743–62. doi:10.1002/sim.7451.

McFadden, Daniel. 1984. “Chapter 24 Econometric Analysis of Qualitative Response Models.” In Handbook of Econometrics, 2:1395–1457. Elsevier. doi:10.1016/S1573-4412(84)02016-X.

Niederreiter, H. 1972. “On a Number-Theoretical Integration Method.” Aequationes Mathematicae 8 (3): 304–11. doi:10.1007/BF01844507.

Ochi, Y., and Ross L. Prentice. 1984. “Likelihood Inference in a Correlated Probit Regression Model.” Biometrika 71 (3). [Oxford University Press, Biometrika Trust]: 531–43. http://www.jstor.org/stable/2336562.

Ormerod, J. T. 2011. “Skew-Normal Variational Approximations for Bayesian Inference.” Unpublished Article.

Ormerod, J. T., and M. P. Wand. 2010. “Explaining Variational Approximations.” The American Statistician 64 (2). Taylor & Francis: 140–53. doi:10.1198/tast.2010.09058.

Pawitan, Y., M. Reilly, E. Nilsson, S. Cnattingius, and P. Lichtenstein. 2004. “Estimation of Genetic and Environmental Factors for Binary Traits Using Family Data.” Statistics in Medicine 23 (3): 449–65. doi:10.1002/sim.1603.

Pinheiro, José C., and Douglas M. Bates. 1995. “Approximations to the Log-Likelihood Function in the Nonlinear Mixed-Effects Model.” Journal of Computational and Graphical Statistics 4 (1). American Statistical Association, Taylor & Francis, Ltd., Institute of Mathematical Statistics, Interface Foundation of America: 12–35. http://www.jstor.org/stable/1390625.

R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Royston, Patrick, and Mahesh K. B. Parmar. 2002. “Flexible Parametric Proportional-Hazards and Proportional-Odds Models for Censored Survival Data, with Application to Prognostic Modelling and Estimation of Treatment Effects.” Statistics in Medicine 21 (15): 2175–97. doi:10.1002/sim.1203.