Mixed Effect Models for Pedigree Data

dummy slide

Motivation

\[ \renewcommand\vec{\boldsymbol} \def\bigO#1{\mathcal{O}(#1)} \def\Cond#1#2{\left(#1\,\middle|\, #2\right)} \def\mat#1{\boldsymbol{#1}} \def\der{{\mathop{}\!\mathrm{d}}} \def\argmax{\text{arg}\,\text{max}} \def\Prob{\text{P}} \def\Expec{\text{E}} \def\logit{\text{logit}} \def\diag{\text{diag}} \]

Example

Suppose that the above is family \(i\) for which we observe \(Y_{i1},\dots Y_{i10} \in \{0, 1\}\).

Circles are females and squares are male.

Research Questions

Want to estimate unobserved genetic effects, environmental effects, paternal effects, etc. for binary outcomes.

Can be done by extending GLM’s with a probit link.

Logistic Regression

Have \(i = 1,\dots,m\) families (clusters) with \(j = 1,\dots,n_i\) members.

Logistic regression can be seen as

\[ Y_{ij} = \begin{cases} 1 & \vec x_{ij}^\top\vec\beta + \epsilon_{ij} > 0 \\ 0 & \text{otherwise} \end{cases} \]

where \(\epsilon_{it}\) follows a logistic distribution. Implies

\[ \text{logit}(P(Y_{ij} = 1)) = \vec x_{ij}^\top\vec\beta. \]

Probit Model

If we instead assume that \(\epsilon_{ij} \sim N(0, 1)\) then

\[ \Phi(P(Y_{ij} = 1)) = \vec x_{ij}^\top\vec\beta. \]

Generalizing this is computationally attractive.

Extending the Probit Model

Want to add genetic effects, \(g_{ij}\). Assume that the genetic effects are additive on the latent scale:

\[ \begin{align*} Y_{ij} &= \begin{cases} 1 & \vec x_{ij}^\top\vec\beta + \epsilon_{ij} + g_{ij} > 0 \\ 0 & \text{otherwise} \end{cases} \\ \epsilon_{ij} &\sim N(0, 1) \\ (g_{i1}, \dots, g_{in_i})^\top & \sim N^{(n_i)}(\vec 0; \sigma_g^2 \mat C_{ig}). \end{align*} \]

Need to select the matrix \(\mat C_{ig}\).

Selecting the Scale Matrix

We select \(\mat C_{ig}\) such that \(c_{igkj}\) is two times the kinship matrix entry for individual \(k\) and \(j\) in family \(i\).

Example

Subject 7 and 8 are siblings and have \(2^{-1} = 50\)% while 7 and 9 are cousins and only have \(2^{-3} = 12.5\)%.

Adding Other Effects

Add environmental effects, \(e_{ij}\), to get a classical ACE model:

\[ \begin{align*} Y_{ij} &= \begin{cases} 1 & \vec x_{ij}^\top\vec\beta + \epsilon_{ij} + g_{ij} + e_{ij} > 0 \\ 0 & \text{otherwise} \end{cases} \\ \epsilon_{ij} &\sim N(0, 1) \\ (g_{i1}, \dots, g_{in_i})^\top & \sim N^{(n_i)}(\vec 0; \sigma_g^2 \mat C_{ig}) \\ (e_{i1}, \dots, e_{in_i})^\top & \sim N^{(n_i)}(\vec 0; \sigma_e^2 \mat C_{ie}). \\ \end{align*} \]

Example (continued)

(1,2,3,6) \(\mat C_{ie}\) entries are equal to one and so are (3, 4, 7, 8) and (5, 6, 9, 10) entries.

Inference

Interested in \(\sigma_g^2/(1 + \sigma_g^2+ \sigma_e^2)\)

the proportion of the variance explained by additive genetic effects,

and \(\sigma_e^2/(1 + \sigma_g^2+ \sigma_e^2)\)

the proportion of the variance explained by environmental effects.

Re-write

The model can be written as:

\[ \begin{align*} Y_{ij} &= \begin{cases} 1 & \vec x_{ij}^\top\vec\beta + \epsilon_{ij} > 0 \\ 0 & \text{otherwise} \end{cases} \\ (\epsilon_{i1}, \dots, \epsilon_{in_i})^\top & \sim N^{(n_i)}(\vec 0; \mat I_{n_i} + \sigma_g^2 \mat C_{ig} + \sigma_e^2 \mat C_{ie}). \end{align*} \]

Re-write (continued)

Or for \(K\) different effects:

\[ \begin{align*} Y_{ij} &= \begin{cases} 1 & \vec x_{ij}^\top\vec\beta + \epsilon_{ij} > 0 \\ 0 & \text{otherwise} \end{cases} \\ (\epsilon_{i1}, \dots, \epsilon_{in_i})^\top & \sim N^{(n_i)}(\vec 0; \mat I_{n_i} + \mat\Sigma(\vec\sigma)) \\ \mat\Sigma(\vec\sigma) &= \sum_{k = 1}^K\sigma_k^2 \mat C_{ik} \end{align*} \]

Estimation

The log marginal likelihood is intractable

… but can expressed as a multivariate normal distribution CDFs.

Many approximations exists! Used by Pawitan et al. (2004), Lindström et al. (2006), Svensson et al. (2009), Yip et al. (2018), and Bai et al. (2019).

Previous Work

Previous work only use a log marginal likelihood approximation.

Have to use derivative-free optimization.

To simplify the computation, they

  • discretize covariates,
  • omit observations,
  • ignore a lot of information from the pedigree, and
  • use a composite likelihood like procedure

where individuals with cousin both on their mother and father side is repeated twice.

New Method and Implementation

Method with gradient approximations.

Our Implementation supports computation in parallel.

A faster approximation.

Works for an arbitrary number of effects, \(K\).

MCMC

MCMC is also an option.

Seems popular with animal data. See e.g. MCMCglmm (Hadfield 2010) and BLUPF90 (Vitezica 2018).

Very slow!

Marginal Likelihood

Maximum Likelihood

Find the MLE:

\[ \argmax_{\vec\beta,\vec\sigma} \sum_{i = 1}^m\log L_i(\vec\beta,\vec\sigma) \]

where \(L_i\) is the marginal likelihood of family (cluster) \(i\).

The Marginal Likelihood

\[ \begin{align*} L_i(\vec\beta,\vec\sigma) &= \Phi^{(n_i)}(\mat U_i\mat X_i\vec\beta;\vec 0, \mat I + \mat U_i\mat\Sigma_i(\vec\sigma) \mat U_i) \\ u_{ij} &= 2y_{ij} - 1 \\ \mat U_i &= \text{diag}(\vec u_i) \\ \end{align*} \]

where \(\Phi^{(K)}(\vec x; \vec\omega, \mat \Omega)\) is the \(K\) dimensional multivariate normal distribution’s CDF with mean \(\vec\omega\) and covariance matrix \(\mat\Omega\) evaluated at \(\vec x\).

Point: need CDF approximation.

Importance Sampling

Genz (1992) and Genz and Bretz (2002) develop an efficient importance sampling procedure for the CDF with:

  1. A heuristic variable re-ordering.
  2. Randomized Korobov rules (randomized quasi-Monte Carlo; see Niederreiter 1972; Keast 1973; Cranley and Patterson 1976).

New Code

Rewritten most of the Fortran code by Genz and Bretz (2002) in C++.

Used in the mvtnorm package (Genz et al. 2020).

Added fast \(\Phi\) and \(\Phi^{-1}\) approximations.

Added support for computation in parallel.

Seems compute-bound.

Added gradient approximations.

Like Hajivassiliou, McFadden, and Ruud (1996) but with all of the above, re-ordering, and randomized Korobov rules.

Simulation and Applied Examples

Simulation Study

250 families, with 10 members in each family, and the same pedigree. We add an additive genetic effect.

Same pedigree as earlier.

\[\begin{align*} Y_{ij} &= \begin{cases} 1 & \beta_0 + \beta_1X_{ij} + \beta_2 B_{ij} + \epsilon_{ij} > 0 \\ 0 & \text{otherwise} \end{cases} \\ (\epsilon_{i1}, \dots, \epsilon_{i10})^\top &\sim N^{(10)}(\vec 0, \mat I_{10} + \sigma_g^2\mat C_{ig}) \end{align*}\]

\(X_{ij}\sim N(0, 1)\), \(B_{ij}\sim \text{Bin}(0.5, 1)\) are observed.

Compare with MCMCglmm.

Caveat: we use four threads with our method.

Simulation Study Results

\(\beta_0\) \(\beta_1\) \(\beta_2\) \(\sigma_g\) Time
Bias -0.481 0.108 -0.130 -0.346 13.890
SE 0.550 0.319 0.592 0.674 0.549

Bias estimates using 100 data sets. The bias estimates are multiplied by 100. The last column shows the average estimation time in seconds. The SE row shows the standard errors.

Results: Comparison with MCMC

\(\beta_0\) \(\beta_1\) \(\beta_2\) \(\sigma_g\) Time
MCMC (mean) -1.128 1.32 -0.108 1.811 1588.36
SE 0.530 1.52 2.093 1.560 3.63
MCMC (mode) -2.704 2.22 1.293 -1.019
SE 2.181 1.76 2.192 1.462
New method -1.218 1.36 -0.024 0.515 13.43
SE 0.547 1.51 2.087 1.713 1.23

Bias estimates using 10 data sets. The bias estimates are multiplied by 100. The last column shows average the estimation time in seconds. The SE rows show the standard errors. Both posterior means and modes are shown.

Simplifying the Real Application

Want to avoid the composite likelihood like procedure and use as much information from the pedigree as possible.

Use the Swedish Multi-Generation Registry (Ekbom 2010) like Mahjani et al. (2020) to study OCD.

Most families are small but one is very large with 618.162 members with 167.279 members with observed outcomes.

Want to avoid 167.279 dimensional integration!

Dependencies as a Graph

Dependencies as a Graph (continued)

Dependencies as a Graph (continued)

Reducing the Computational Cost

Removing an edge introduces independence between two individuals.

Removing edges can creates two non-connected families (subgraphs) with \(\tilde n_k\) and \(\tilde n_l\) individuals (vertices)

with \(\tilde n_k + \tilde n_l = n_i\).

Yields two integrals to be approximated of size \(\tilde n_k\) and \(\tilde n_l\)

rather than one of size \(n\).

Reducing the Computational Cost (continued)

Idea: create \(p\) families each of roughly size \(n_i / p\) such the number of removed dependencies is minimized.

Essentially the \(p\)-way partition problem.

Many approximations exists if the partitions need not to be connected (Simon and Teng 1997; Karypis and Kumar 1998).

Currently use a very crude heuristic.

Reduces the number of cousin pairs by 7.1 pct.

Profile Likelihood

The above profile likelihood was computed in 3 hours and 22 minutes on Vector.

The MCMC analysis in Mahjani et al. (2020) took weeks!

Extensions

Improvements

Get better starting values.

Make an efficient Gaussian variational approximation or Laplace approximation to get starting values.

Use stochastic gradient techniques.

Survival Data

\(Y_{ij}^* \in (0,\infty)\) with independent right censoring time \(S_{ij} \in (0,\infty)\).

Only observe \(Y_{ij} = \min (Y_{ij}^*, S_{ij})\).

Use a mixed generalized survival model:

\[ \begin{align*} P(Y_{ij}^* > y \mid \epsilon_{ij} = e) &= \Phi( -\vec g(y)^\top\vec\omega - \vec\beta^\top\vec x_{ij} - e) \\ (\epsilon_{i1}, \dots, \epsilon_{in_i}) &\sim N^{(n_i)}(\vec 0, \mat\Sigma_i(\vec\theta)) \end{align*} \]

Survival Data

A simple case is \(\vec g(y) = \log y\) in which case

\[ \begin{align*} \omega\log Y_{ij}^* &= -\vec\beta^\top\vec x_{ij} + \epsilon_{ij} \\ \vec\epsilon_i &\sim N^{(n_i)}(\vec 0, \mat I_{n_i} + \mat\Sigma_i(\vec\sigma)) \end{align*} \]

Or equivalently

\[ \begin{align*} \log Y_{ij}^* &= -\omega^{-1}\vec\beta^\top\vec x_{ij} + \epsilon_{ij} \\ \vec\epsilon_i &\sim N^{(n_i)}(\vec 0, \omega^{-2}(\mat I_{n_i} + \mat\Sigma_i(\vec\sigma))) \end{align*} \]

Used in the bivariate case by Szulkin et al. (2017).

Survival Data (continued)

Generally,

\[ \begin{align*} \vec g(Y_{ij}^*)^\top\vec\omega &= -\vec\beta^\top\vec x_{ij} + \epsilon_{ij} \\ \vec\epsilon_i &\sim N^{(n_i)}(\vec 0, \mat I_{n_i} + \mat\Sigma_i(\vec\sigma)) \end{align*} \]

The CDF dimension is equal to the number of censored individuals.

Generic GLMMs

Similar log marginal likelihoods appears for some mixed models with multinomial and ordinal data.

The CDF is attractive when the number of random effects per cluster is not much smaller than the number of observations in a cluster.

Imputation Method

Use Gaussian copula for the joint distribution of mixed data types.

Improve upon the approximate EM-algorithm by Zhao and Udell (2019).

See github.com/boennecd/mdgc.

One can extend the approach to multinomial data.

See Christoffersen et al. (2021).

Conclusions

Conclusions

Extended the estimation method suggested by Pawitan et al. (2004).

Showed that the new C++ implementation is fast

for large data sets with moderate to high dimensional integrals per cluster.

Discussed extensions to other types of outcomes.

Thank You!

The presentation is at rpubs.com/boennecd/KTH-Pre-Meb-Meeting.

The markdown is at github.com/boennecd/Talks.

The R package is at github.com/boennecd/pedmod.

References are on the next slide.

References

Bai, Dan, Benjamin Hon Kei Yip, Gayle C. Windham, Andre Sourander, Richard Francis, Rinat Yoffe, Emma Glasson, et al. 2019. “Association of Genetic and Environmental Factors With Autism in a 5-Country Cohort.” JAMA Psychiatry 76 (10): 1035–43. https://doi.org/10.1001/jamapsychiatry.2019.1411.

Christoffersen, Benjamin, Mark Clements, Keith Humphreys, and Hedvig Kjellström. 2021. “Asymptotically Exact and Fast Gaussian Copula Models for Imputation of Mixed Data Types.” http://arxiv.org/abs/2102.02642.

Cranley, R., and T. N. L. Patterson. 1976. “Randomization of Number Theoretic Methods for Multiple Integration.” SIAM Journal on Numerical Analysis 13 (6): 904–14. http://www.jstor.org/stable/2156452.

Ekbom, Anders. 2010. “The Swedish Multi-Generation Register.” In Methods in Molecular Biology, 215–20. Humana Press. https://doi.org/10.1007/978-1-59745-423-0_10.

Genz, Alan. 1992. “Numerical Computation of Multivariate Normal Probabilities.” Journal of Computational and Graphical Statistics 1 (2): 141–49. http://www.jstor.org/stable/1390838.

Genz, Alan, and Frank Bretz. 2002. “Comparison of Methods for the Computation of Multivariate T Probabilities.” Journal of Computational and Graphical Statistics 11 (4): 950–71. https://doi.org/10.1198/106186002394.

Genz, Alan, Frank Bretz, Tetsuhisa Miwa, Xuefei Mi, Friedrich Leisch, Fabian Scheipl, and Torsten Hothorn. 2020. mvtnorm: Multivariate Normal and T Distributions. https://CRAN.R-project.org/package=mvtnorm.

Hadfield, Jarrod D. 2010. “MCMC Methods for Multi-Response Generalized Linear Mixed Models: The MCMCglmm R Package.” Journal of Statistical Software 33 (2): 1–22. https://www.jstatsoft.org/v33/i02/.

Hajivassiliou, Vassilis, Daniel McFadden, and Paul Ruud. 1996. “Simulation of Multivariate Normal Rectangle Probabilities and Their Derivatives Theoretical and Computational Results.” Journal of Econometrics 72 (1): 85–134. https://doi.org/https://doi.org/10.1016/0304-4076(94)01716-6.

Karypis, George, and Vipin Kumar. 1998. “A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs.” SIAM J. Sci. Comput. 20 (1): 359–92.

Keast, P. 1973. “Optimal Parameters for Multidimensional Integration.” SIAM Journal on Numerical Analysis 10 (5): 831–38. https://doi.org/10.1137/0710068.

Lindström, L., Y. Pawitan, M. Reilly, K. Hemminki, P. Lichtenstein, and K. Czene. 2006. “Estimation of Genetic and Environmental Factors for Melanoma Onset Using Population-Based Family Data.” Statistics in Medicine 25 (18): 3110–23. https://doi.org/https://doi.org/10.1002/sim.2266.

Mahjani, Behrang, Lambertus Klei, Christina M. Hultman, Henrik Larsson, Bernie Devlin, Joseph D. Buxbaum, Sven Sandin, and Dorothy E. Grice. 2020. “Maternal Effects as Causes of Risk for Obsessive-Compulsive Disorder.” Biological Psychiatry 87 (12): 1045–51. https://doi.org/https://doi.org/10.1016/j.biopsych.2020.01.006.

Niederreiter, H. 1972. “On a Number-Theoretical Integration Method.” Aequationes Mathematicae 8 (3): 304–11. https://doi.org/10.1007/BF01844507.

Pawitan, Y., M. Reilly, E. Nilsson, S. Cnattingius, and P. Lichtenstein. 2004. “Estimation of Genetic and Environmental Factors for Binary Traits Using Family Data.” Statistics in Medicine 23 (3): 449–65. https://doi.org/https://doi.org/10.1002/sim.1603.

Simon, Horst D., and Shang-Hua Teng. 1997. “How Good Is Recursive Bisection?” SIAM Journal on Scientific Computing 18 (5): 1436–45. https://doi.org/10.1137/S1064827593255135.

Svensson, Anna C., Sven Sandin, Sven Cnattingius, Marie Reilly, Yudi Pawitan, Christina M. Hultman, and Paul Lichtenstein. 2009. “Maternal Effects for Preterm Birth: A Genetic Epidemiologic Study of 630,000 Families.” American Journal of Epidemiology 170 (11): 1365–72. https://doi.org/10.1093/aje/kwp328.

Szulkin, Robert, Mark S. Clements, Patrik K. E. Magnusson, Fredrik E. Wiklund, and Ralf Kuja-Halkola. 2017. “Estimating Heritability of Prostate Cancer-Specific Survival Using Population-Based Registers.” The Prostate 77 (8): 900–907. https://doi.org/https://doi.org/10.1002/pros.23344.

Vitezica, Ignacy Misztal AND Shogo Tsuruta AND Daniela Lourenco AND Yutaka Masuda AND Ignacio Aguilar AND Andres Legarra AND Zulma. 2018. Manual for Blupf90 Family Programs. University of Georgia. http://nce.ads.uga.edu/wiki/doku.php?id=documentation.

Yip, Benjamin Hon Kei, Dan Bai, Behrang Mahjani, Lambertus Klei, Yudi Pawitan, Christina M Hultman, Dorothy E Grice, et al. 2018. “Heritable Variation, with Little or No Maternal Effect, Accounts for Recurrence Risk to Autism Spectrum Disorder in Sweden.” Biological Psychiatry 83 (7): 589–97.

Zhao, Yuxuan, and Madeleine Udell. 2019. “Missing Value Imputation for Mixed Data via Gaussian Copula.” http://arxiv.org/abs/1910.12845.