Mixed Effect Models for Pedigree Data

dummy slide

Motivation

Research Questions

Want to estimate genetic effect, environmental effect, paternal effect, etc.

We observe \(j = 1,\dots, n_i\) binary outcomes in \(i = 1,\dots, m\) clusters.

Running Example: Interested in presence of a genetic effect and maternal effect.

Model

\[ \begin{align*} Y_{ij} &= 1_{\{\vec\beta^\top\vec x_{ij} + \epsilon_{i1j} + \epsilon_{i2j} + \epsilon_{i3j} > 0\}} \\ \vec\epsilon_{ik} &= (\epsilon_{ik1}, \dots, \epsilon_{ikn_i})^\top \\ \vec\epsilon_{i1} & \sim N^{(n_i)}(\vec0, 2\theta_1\mat K_i) \\ \vec\epsilon_{i2} & \sim N^{(n_i)}(\vec0, \theta_2\mat R_i) \\ \vec\epsilon_{i3} & \sim N^{(n_i)}(\vec0, \mat I) \end{align*} \]

where \(\mat K_i\) is the kinship matrix and \(\mat R_i\) is a matrix such that \(\vec\epsilon_{i2}\) represents a maternal effect.

Quantify heritability through \(\theta_1 \big/(\theta_1 + \theta_2 + 1)\).

Model with Example

Example: 12 and 13 have 1/4 shared genetic effect and 1/2 shared maternal effect. 13 and 15 only have 1/4 shared genetic effect.

Scale Matrix: Genetic Effect

Scale Matrix: Maternal Effect

General Model

\[ \begin{align*} Y_{ij} &= 1_{\{\vec\beta^\top\vec x_{ij} + \epsilon_{ij} > 0\}} \\ \vec\epsilon_i = (\epsilon_{i1}, \dots, \epsilon_{in_i}) &\sim N^{(n_i)}(\vec 0, \mat I + \mat\Sigma_i(\vec\theta)) \\ \mat\Sigma_i(\vec\theta) &= \sum_{k = 1}^K \theta_k \mat C_{ik} \end{align*} \]

for known matrices \(\mat C_{i1}, \dots, \mat C_{iK}\).

Previous Work

Pawitan et al. (2004) use the Fortran code by Genz and Bretz (2002) to estimate the model using a derivative-free optimization method.

Talk Overview

Extended code.

Simulation study.

Extensions.

Conclusion.

Marginal Likelihood

The Marginal Likelihood

\[ \begin{align*} L_i(\vec\beta,\vec\theta) &= \int \phi^{(n_i)} (\vec z;\vec 0, \mat\Sigma_i(\vec\theta) ) \prod_{j = 1}^{n_i}\Phi( u_{ij}(\vec\beta^\top \vec x_{ij} + z_j)) \der\vec z \\ u_{ij} &= 2y_{ij} - 1 \end{align*}. \]

Find the MLE:

\[ \argmax_{\vec\beta,\vec\theta} \sum_{i = 1}^m\log L_i(\vec\beta,\vec\theta). \]

Identity

\[ \begin{align*} \begin{pmatrix} \vec Y_1 \\ \vec Y_2\end{pmatrix} &\sim N^{(k_1 + k_2)}\left( \begin{pmatrix} \vec\xi_1 \\ \vec \xi_2\end{pmatrix}, \begin{pmatrix} \vec\Xi_{11} & \vec\Xi_{12} \\ \vec\Xi_{21} & \vec\Xi_{22} \end{pmatrix} \right) \\ \Prob(\vec Y_2 \leq \vec x) &= \Phi^{(k_2)}(\vec x; \vec\xi_2,\mat\Xi_{22}) \\ &= \int \phi^{(k_1)}(\vec z;\vec\xi_1,\mat\Xi_{11}) \Phi^{(k_2)}\Big( \\ &\hspace{25pt}\vec x; \vec\xi_2 + \mat\Xi_{21}\mat\Xi_{11}^{-1}(\vec z - \vec \xi_1), \mat\Xi_{22} - \mat\Xi_{21}\mat\Xi_{11}^{-1}\mat\Xi_{12} \Big)\der\vec z \end{align*} \]

The Marginal Likelihood (cont.)

\[ \begin{align*} L_i(\vec\beta,\vec\theta) &= \Phi^{(n_i)}(\mat U_i\mat X_i\vec\beta;\vec 0, \mat I + \mat U_i\mat\Sigma_i(\vec\theta) \mat U_i) \\ &= \int_{\vec z\leq \mat U_i\mat X_i\vec\beta} \phi^{(n_i)}(\vec z; \vec 0, \mat I + \mat U_i\mat\Sigma_i(\vec\theta) \mat U_i)\der\vec z \\ &= \int_{\mat S^\top\vec z\leq \mat U_i\mat X_i\vec\beta} \prod_{j = 1}^{n_i} \phi(z_j)\der\vec z \end{align*} \]

where \(\mat U_i = \text{diag}(\vec u_i)\) and

\[\mat I + \mat U_i\mat\Sigma_i(\vec\theta) \mat U_i = \mat S^\top\mat S.\]

Importance Sampling

\[ \begin{align*} L_i(\vec\beta,\vec\theta) &= \int \frac{\prod_{j = 1}^{n_i} \phi(z_j)}{ h(\vec z)}h(\vec z)\der\vec z \\ &= \int h(\vec z)\prod_{j = 1}^{n_i} \Phi(\vec s_j^\top\vec z - u_{ij}\vec x_{ij}^\top\vec\beta)\der z \end{align*} \]

where

\[ h(\vec z) = \begin{cases} \prod_{j = 1}^{n_i} \frac{\phi(z_j)} {\Phi(\vec s_j^\top\vec z - u_{ij}\vec x_{ij}^\top\vec\beta)} & \mat S^\top\vec z\leq \mat U_i\mat X_i\vec\beta \\ 0 & \text{otherwise} \end{cases}. \]

Importance Sampling (cont.)

Closely follow Genz and Bretz (2009).

Like Genz (1992) and Genz and Bretz (2002):

Add a heuristic variable re-ordering.
Use randomized Korobov rules (Niederreiter 1972; Keast 1973; Cranley and Patterson 1976).

New Code

Rewritten most of the Fortran code by Genz and Bretz (2002) in C++.

Used in the mvtnorm package (Genz et al. 2020).

Added fast \(\Phi\) and \(\Phi^{-1}\) approximations.

Added support for computation in parallel.

Seems compute-bound.

Added gradient approximations.

Like Hajivassiliou, McFadden, and Ruud (1996) but with all of the above, re-ordering, and randomized Korobov rules.

Gradient Approximations

\[ \begin{align*} L_i'(\vec\beta,\vec\theta) &= \int_{\vec z\leq \mat U_i\mat X_i\vec\beta} \vec g(\vec z; \vec\beta, \vec\theta) \phi^{(n_i)}(\vec z; \vec 0, \mat I + \mat U_i\mat\Sigma_i(\vec\theta) \mat U_i)\der\vec z \\ &= \int_{\mat S^\top\vec z\leq \mat U_i\mat X_i\vec\beta} \vec g(\mat S^{-\top}\vec z; \vec\beta, \vec\theta) \phi(z_j)\der\vec z\\ &= \int \vec g(\mat S^{-\top}\vec z; \vec\beta, \vec\theta) h(\vec z)\prod_{j = 1}^{n_i} \Phi(\vec s_j^\top\vec z - u_{ij}\vec x_{ij}^\top\vec\beta)\der z \end{align*}. \]

Used for

\[ \sum_{i = 1}^m (\log L_i(\vec\beta,\vec\theta))' = \sum_{i = 1}^m \frac{L_i'(\vec\beta,\vec\theta)}{L_i(\vec\beta,\vec\theta)}. \]

Std. Normal CDF Approximation

Use monotone cubic interpolation:

truth <- exp(seq(log(1e-32), log(1 - 1e-16), length.out = 10000))
args <- qnorm(truth)
range(pnorm(args) - pnorm_aprx(args))

## [1] -1.78e-07  1.80e-07

Std. Normal CDF Approximation

bench::mark(pnorm = pnorm(args), aprx = pnorm_aprx(args), min_time = 1, 
            check = FALSE)

## # A tibble: 2 x 6
##   expression      min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 pnorm       429.6µs  462.9µs     2140.    78.2KB     4.06
## 2 aprx         35.9µs   57.3µs    17562.    83.4KB    33.4

Std. Normal Quantile Approximation

Use lower precision method shown by Wichura (1988).

Same paper used by stats::qnorm.

args <- exp(seq(log(1e-32), log(1 - 1e-16), length.out = 10000))
range(qnorm(args) - qnorm_aprx(args))

## [1] -6.97e-07  7.78e-07

Std. Normal Quantile Approximation

bench::mark(qnorm = qnorm(args), aprx = qnorm_aprx(args), 
            min_time = 1, check = FALSE)

## # A tibble: 2 x 6
##   expression      min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 qnorm         400µs    422µs     2350.    78.2KB     4.06
## 2 aprx          213µs    219µs     4520.    82.3KB     9.29

Approximate Multivariate Normal CDF

Compare mvtnorm::pmvnorm with pedmod::mvndst:

Example from ?mvtnorm::pmvnorm.

set.seed(95501057)
print(c(pmvnorm(lw, up, mu, sig), 
        mvndst (lw, up, mu, sig)), digits = 5)

## [1] 0.57997 0.58004

Approximate Multivariate Normal CDF

The MC error is comparable:

set.seed(95501057)
apply(
  replicate(100, c(pmvnorm(lw, up, mu, sig), 
                   mvndst (lw, up, mu, sig))), 
  1, sd)

## [1] 9.95e-05 7.43e-05

Approximate Multivariate Normal CDF

set.seed(1)
bench::mark(mvtnorm = pmvnorm(lw, up, mu, sig), 
            pedmod  = mvndst(lw, up, mu, sig), min_time = 1, check = FALSE)

## # A tibble: 2 x 6
##   expression      min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 mvtnorm      1.16ms    1.2ms      827.     3.7KB     2.01
## 2 pedmod     624.63µs  629.3µs     1555.    2.49KB     0

Simulation Study

Setup

\(m = 10000\) families with a fixed design matrix

and the pedigree data shown earlier resulting in 160000 observations. Repeat 100 times.

Start with a maximum of 5000 samples per term. Then use 25000.

Use four threads.

Setup (cont.)

\[ \begin{align*} \mat X_i &= (\vec 1, \vec x_{\cdot 2}, \vec x_{\cdot 3}) \\ \vec x_{\cdot 2} &= \frac 4{15}(-7.5, -6.5, \dots, 6.5, 7.5)^\top \\ \vec x_{\cdot 3} &= (0, 1, 0, \dots, 0, 1)^\top \\ \vec\beta &= (-2, 0.5, 1)^\top \\ \vec\theta &= (0.5, 0.33)^\top \end{align*} \]

Fixed Families: Bias Estimates

	(Intercept)	X	Dummy	Genetic	Maternal
Bias w/o extra	-0.00082	0.00017	0.00006	-0.00028	0.00176
SE w/o extra	0.00191	0.00060	0.00130	0.00252	0.00202
Bias w/ extra	-0.00087	0.00017	0.00014	-0.00028	0.00183
SE w/ extra	0.00191	0.00060	0.00130	0.00253	0.00202

“W/o extra” and “w/ extra” are with, respectively, 5000 and 25000 samples per log marginal likelihood term. SE is the MC standard error. The last two columns are the genetic effect parameter, \(\theta_1\), and the maternal effect parameter, \(\theta_2\).

Fixed Families: Bias Estimates (cont.)

Estimated values less true values when using more samples per log marginal likelihood term. The y-axis is the same as the box plots shown later.

Fixed Families: Computation Time

	Mean	Median	Sd
W/o extra	347	302	120
W/ extra	447	389	139

“W/o extra” and “w/ extra” are with, respectively, 5000 and 25000 samples per log marginal likelihood term.

Setup

\(m = 1000\) families with:

\[ \begin{align*} \vec x_{ij} &= \left(1, Z_{ij}, \left(j - (n_i + 1) / 2\right) / v(n_i) \right)^\top \\ v(x) &= \sqrt{x (x + 1) \big/ 12} \\ Z_{ij}&\sim N(0, 1) \\ \vec\beta &= (-1, 0.3, 0.2)^\top \\ \vec\theta &= (0.5, 0.3)^\top \end{align*} \]

Pedigree data is randomly generated.

Example

Only keep the youngest generation as Mahjani et al. (2020).

Scale Matrix: Genetic Effect

Scale Matrix: Maternal Effect

Smaller Families

\(n_i \sim 13\) yielding \(13000\) observations.

Smaller Families: Bias Estimates

	(Intercept)	X1	X2	Genetic	Maternal
Bias w/o extra	0.00834	-0.00285	0.00168	-0.0399	0.01267
SE w/o extra	0.00533	0.00217	0.00197	0.0232	0.00713
Bias w/ extra	0.00821	-0.00283	0.00170	-0.0396	0.01268
SE w/ extra	0.00535	0.00217	0.00198	0.0233	0.00713

Smaller Families: Bias (cont.)

Estimated values less true values when using more samples per log marginal likelihood term.

Smaller Families: Computation Time

	Mean	Median	Sd
W/o extra	32.1	31.8	9.61
W/ extra	40.8	40.0	10.98

“W/o extra” and “w/ extra” are with, respectively, 5000 and 25000 samples per log marginal likelihood term.

Larger Families

\(n_i \sim 50\) yielding \(50000\) observations.

Larger Families: Bias

	(Intercept)	X1	X2	Genetic	Maternal
Bias w/o extra	0.00742	-0.00158	-0.00254	-0.0387	0.01087
SE w/o extra	0.00288	0.00114	0.00101	0.0122	0.00392
Bias w/ extra	0.00665	-0.00150	-0.00257	-0.0363	0.01057
SE w/ extra	0.00292	0.00113	0.00102	0.0123	0.00396

Larger Families: Bias (cont.)

Estimated values less true values when using more samples per log marginal likelihood term.

Larger Families: Computation Time

	Mean	Median	Sd
W/o extra	265	253	92.7
W/ extra	615	581	208.5

“W/o extra” and “w/ extra” are with, respectively, 5000 and 25000 samples per log marginal likelihood term.

Extensions

Improvements

Get better starting values.

Make an efficient Gaussian variational approximation or Laplace approximation to get starting values.

Use stochastic gradient techniques.

Survival Data

\(Y_{ij}^* \in (0,\infty)\) with independent right censoring time \(S_{ij} \in (0,\infty)\).

Use a mixed generalized survival model:

\[ \begin{align*} P(Y_{ij}^* > y \mid \epsilon_{ij} = e) &= \Phi( -\vec g(y)^\top\vec\omega - \vec\beta^\top\vec x_{ij} - e) \\ f_{Y_{ij}^*\mid \epsilon_{ij}}(y, e) &= \phi( -\vec g(y)^\top\vec\omega - \vec\beta^\top\vec x_{ij} - e) \vec g'(y)^\top\vec\omega \\ \vec\epsilon_i &\sim N^{(n_i)}(\vec 0, \mat\Sigma_i(\vec\theta)) \end{align*} \]

Survival Data

A simple case is \(\vec g(y) = \log y\) in which case

\[ \begin{align*} \omega\log Y_{ij}^* &= -\vec\beta^\top\vec x_{ij} + \epsilon_{ij} \\ \vec\epsilon_i &\sim N^{(n_i)}(\vec 0, \mat I + \mat\Sigma_i(\vec\theta)) \end{align*} \]

The CDF dimension is equal to the number of censored individuals.

Generic GLMMs

The identity can be used for multinomial and ordinal data with the probit link.

Approximating the CDF when the number of random effects per cluster is not much smaller than the number of observations in a cluster.

Imputation Method

Use Gaussian copula for the joint distribution of mixed data types.

Improve upon the approximate EM-algorithm by Zhao and Udell (2019).

See github.com/boennecd/mdgc.

One can extend the approach to multinomial data.

Conclusions

Extended the estimation method suggested by Pawitan et al. (2004).

Showed that the new C++ implementation is fast

for large data sets with moderate to high dimensional integrals per cluster.

Discussed extensions to other types of outcomes.

Thank You!

The presentation is at rpubs.com/boennecd/MEB-RQMC-pedigree.

The markdown is at github.com/boennecd/Talks.

The R package is at github.com/boennecd/pedmod.

References are on the next slide.

References

Cranley, R., and T. N. L. Patterson. 1976. “Randomization of Number Theoretic Methods for Multiple Integration.” SIAM Journal on Numerical Analysis 13 (6). Society for Industrial; Applied Mathematics: 904–14. http://www.jstor.org/stable/2156452.

Genz, Alan. 1992. “Numerical Computation of Multivariate Normal Probabilities.” Journal of Computational and Graphical Statistics 1 (2). [American Statistical Association, Taylor & Francis, Ltd., Institute of Mathematical Statistics, Interface Foundation of America]: 141–49. http://www.jstor.org/stable/1390838.

Genz, Alan, and Frank Bretz. 2002. “Comparison of Methods for the Computation of Multivariate T Probabilities.” Journal of Computational and Graphical Statistics 11 (4). Taylor & Francis: 950–71. https://doi.org/10.1198/106186002394.

———. 2009. Computation of Multivariate Normal and T Probabilities. Vol. 195. Springer Science & Business Media.

Genz, Alan, Frank Bretz, Tetsuhisa Miwa, Xuefei Mi, Friedrich Leisch, Fabian Scheipl, and Torsten Hothorn. 2020. mvtnorm: Multivariate Normal and T Distributions. https://CRAN.R-project.org/package=mvtnorm.

Hajivassiliou, Vassilis, Daniel McFadden, and Paul Ruud. 1996. “Simulation of Multivariate Normal Rectangle Probabilities and Their Derivatives Theoretical and Computational Results.” Journal of Econometrics 72 (1): 85–134. https://doi.org/https://doi.org/10.1016/0304-4076(94)01716-6.

Keast, P. 1973. “Optimal Parameters for Multidimensional Integration.” SIAM Journal on Numerical Analysis 10 (5): 831–38. https://doi.org/10.1137/0710068.

Mahjani, Behrang, Lambertus Klei, Christina M. Hultman, Henrik Larsson, Bernie Devlin, Joseph D. Buxbaum, Sven Sandin, and Dorothy E. Grice. 2020. “Maternal Effects as Causes of Risk for Obsessive-Compulsive Disorder.” Biological Psychiatry 87 (12): 1045–51. https://doi.org/https://doi.org/10.1016/j.biopsych.2020.01.006.

Niederreiter, H. 1972. “On a Number-Theoretical Integration Method.” Aequationes Mathematicae 8 (3): 304–11. https://doi.org/10.1007/BF01844507.

Pawitan, Y., M. Reilly, E. Nilsson, S. Cnattingius, and P. Lichtenstein. 2004. “Estimation of Genetic and Environmental Factors for Binary Traits Using Family Data.” Statistics in Medicine 23 (3): 449–65. https://doi.org/https://doi.org/10.1002/sim.1603.

Wichura, Michael J. 1988. “Algorithm as 241: The Percentage Points of the Normal Distribution.” Journal of the Royal Statistical Society. Series C (Applied Statistics) 37 (3). [Wiley, Royal Statistical Society]: 477–84. http://www.jstor.org/stable/2347330.

Zhao, Yuxuan, and Madeleine Udell. 2019. “Missing Value Imputation for Mixed Data via Gaussian Copula.”