\[ \renewcommand\vec{\boldsymbol} \def\bigO#1{\mathcal{O}(#1)} \def\Cond#1#2{\left(#1\,\middle|\, #2\right)} \def\mat#1{\boldsymbol{#1}} \def\der{{\mathop{}\!\mathrm{d}}} \def\argmax{\text{arg}\,\text{max}} \def\Prob{\text{P}} \def\Expec{\text{E}} \def\logit{\text{logit}} \def\diag{\text{diag}} \]
Circles are females and squares are male.
Want to estimate unobserved genetic effects, environmental effects, paternal effects, etc. for binary outcomes.
Can be done by extending GLM’s with a probit link.
Have \(i = 1,\dots,m\) families (clusters) with \(j = 1,\dots,n_i\) members.
Logistic regression can be seen as
\[ Y_{ij} = \begin{cases} 1 & \vec x_{ij}^\top\vec\beta + \epsilon_{ij} > 0 \\ 0 & \text{otherwise} \end{cases} \]
where \(\epsilon_{it}\) follows a logistic distribution. Implies
\[ \text{logit}(P(Y_{ij} = 1)) = \vec x_{ij}^\top\vec\beta. \]
If we instead assume that \(\epsilon_{ij} \sim N(0, 1)\) then
\[ \Phi(P(Y_{ij} = 1)) = \vec x_{ij}^\top\vec\beta. \]
Generalizing this is computationally attractive.
Want to add genetic effects, \(g_{ij}\). Assume that the genetic effects are additive on the latent scale:
\[ \begin{align*} Y_{ij} &= \begin{cases} 1 & \vec x_{ij}^\top\vec\beta + \epsilon_{ij} + g_{ij} > 0 \\ 0 & \text{otherwise} \end{cases} \\ \epsilon_{ij} &\sim N(0, 1) \\ (g_{i1}, \dots, g_{in_i})^\top & \sim N^{(n_i)}(\vec 0; \sigma_g^2 \mat C_{ig}). \end{align*} \]
Need to select the matrix \(\mat C_{ig}\).
Subject 7 and 8 are siblings and have \(2^{-1} = 50\)% while 7 and 9 are cousins and only have \(2^{-3} = 12.5\)%.
Add environmental effects, \(e_{ij}\), to get a classical ACE model:
\[ \begin{align*} Y_{ij} &= \begin{cases} 1 & \vec x_{ij}^\top\vec\beta + \epsilon_{ij} + g_{ij} + e_{ij} > 0 \\ 0 & \text{otherwise} \end{cases} \\ \epsilon_{ij} &\sim N(0, 1) \\ (g_{i1}, \dots, g_{in_i})^\top & \sim N^{(n_i)}(\vec 0; \sigma_g^2 \mat C_{ig}) \\ (e_{i1}, \dots, e_{in_i})^\top & \sim N^{(n_i)}(\vec 0; \sigma_e^2 \mat C_{ie}). \\ \end{align*} \]
(1,2,3,6) \(\mat C_{ie}\) entries are equal to one and so are (3, 4, 7, 8) and (5, 6, 9, 10) entries.
the proportion of the variance explained by additive genetic effects,
the proportion of the variance explained by environmental effects.
The model can be written as:
\[ \begin{align*} Y_{ij} &= \begin{cases} 1 & \vec x_{ij}^\top\vec\beta + \epsilon_{ij} > 0 \\ 0 & \text{otherwise} \end{cases} \\ (\epsilon_{i1}, \dots, \epsilon_{in_i})^\top & \sim N^{(n_i)}(\vec 0; \mat I_{n_i} + \sigma_g^2 \mat C_{ig} + \sigma_e^2 \mat C_{ie}). \end{align*} \]
Or for \(K\) different effects:
\[ \begin{align*} Y_{ij} &= \begin{cases} 1 & \vec x_{ij}^\top\vec\beta + \epsilon_{ij} > 0 \\ 0 & \text{otherwise} \end{cases} \\ (\epsilon_{i1}, \dots, \epsilon_{in_i})^\top & \sim N^{(n_i)}(\vec 0; \mat I_{n_i} + \mat\Sigma(\vec\sigma)) \\ \mat\Sigma(\vec\sigma) &= \sum_{k = 1}^K\sigma_k^2 \mat C_{ik} \end{align*} \]
The log marginal likelihood is intractable
Many approximations exists! Used by Pawitan et al. (2004), Lindström et al. (2006), Svensson et al. (2009), Yip et al. (2018), and Bai et al. (2019).
Have to use derivative-free optimization.
To simplify the computation, they
where individuals with cousin both on their mother and father side is repeated twice.
Method with gradient approximations.
Our Implementation supports computation in parallel.
A faster approximation.
Works for an arbitrary number of effects, \(K\).
Seems popular with animal data. See e.g. MCMCglmm (Hadfield 2010) and BLUPF90 (Vitezica 2018).
Very slow!
Find the MLE:
\[ \argmax_{\vec\beta,\vec\sigma} \sum_{i = 1}^m\log L_i(\vec\beta,\vec\sigma) \]
where \(L_i\) is the marginal likelihood of family (cluster) \(i\).
\[ \begin{align*} L_i(\vec\beta,\vec\sigma) &= \Phi^{(n_i)}(\mat U_i\mat X_i\vec\beta;\vec 0, \mat I + \mat U_i\mat\Sigma_i(\vec\sigma) \mat U_i) \\ u_{ij} &= 2y_{ij} - 1 \\ \mat U_i &= \text{diag}(\vec u_i) \\ \end{align*} \]
where \(\Phi^{(K)}(\vec x; \vec\omega, \mat \Omega)\) is the \(K\) dimensional multivariate normal distribution’s CDF with mean \(\vec\omega\) and covariance matrix \(\mat\Omega\) evaluated at \(\vec x\).
Point: need CDF approximation.
Genz (1992) and Genz and Bretz (2002) develop an efficient importance sampling procedure for the CDF with:
Used in the mvtnorm package (Genz et al. 2020).
Added fast \(\Phi\) and \(\Phi^{-1}\) approximations.
Seems compute-bound.
Like Hajivassiliou, McFadden, and Ruud (1996) but with all of the above, re-ordering, and randomized Korobov rules.
Same pedigree as earlier.
\[\begin{align*} Y_{ij} &= \begin{cases} 1 & \beta_0 + \beta_1X_{ij} + \beta_2 B_{ij} + \epsilon_{ij} > 0 \\ 0 & \text{otherwise} \end{cases} \\ (\epsilon_{i1}, \dots, \epsilon_{i10})^\top &\sim N^{(10)}(\vec 0, \mat I_{10} + \sigma_g^2\mat C_{ig}) \end{align*}\]
\(X_{ij}\sim N(0, 1)\), \(B_{ij}\sim \text{Bin}(0.5, 1)\) are observed.
Caveat: we use four threads with our method.
| \(\beta_0\) | \(\beta_1\) | \(\beta_2\) | \(\sigma_g\) | Time | |
|---|---|---|---|---|---|
| Bias | -0.481 | 0.108 | -0.130 | -0.346 | 13.890 |
| SE | 0.550 | 0.319 | 0.592 | 0.674 | 0.549 |
Bias estimates using 100 data sets. The bias estimates are multiplied by 100. The last column shows the average estimation time in seconds. The SE row shows the standard errors.
| \(\beta_0\) | \(\beta_1\) | \(\beta_2\) | \(\sigma_g\) | Time | |
|---|---|---|---|---|---|
| MCMC (mean) | -1.128 | 1.32 | -0.108 | 1.811 | 1588.36 |
| SE | 0.530 | 1.52 | 2.093 | 1.560 | 3.63 |
| MCMC (mode) | -2.704 | 2.22 | 1.293 | -1.019 | |
| SE | 2.181 | 1.76 | 2.192 | 1.462 | |
| New method | -1.218 | 1.36 | -0.024 | 0.515 | 13.43 |
| SE | 0.547 | 1.51 | 2.087 | 1.713 | 1.23 |
Bias estimates using 10 data sets. The bias estimates are multiplied by 100. The last column shows average the estimation time in seconds. The SE rows show the standard errors. Both posterior means and modes are shown.
Want to avoid the composite likelihood like procedure and use as much information from the pedigree as possible.
Use the Swedish Multi-Generation Registry (Ekbom 2010) like Mahjani et al. (2020) to study OCD.
Want to avoid 167.279 dimensional integration!
Removing an edge introduces independence between two individuals.
with \(\tilde n_k + \tilde n_l = n_i\).
rather than one of size \(n\).
Idea: create \(p\) families each of roughly size \(n_i / p\) such the number of removed dependencies is minimized.
Many approximations exists if the partitions need not to be connected (Simon and Teng 1997; Karypis and Kumar 1998).
Reduces the number of cousin pairs by 7.1 pct.
The MCMC analysis in Mahjani et al. (2020) took weeks!
Make an efficient Gaussian variational approximation or Laplace approximation to get starting values.
Use stochastic gradient techniques.
\(Y_{ij}^* \in (0,\infty)\) with independent right censoring time \(S_{ij} \in (0,\infty)\).
Only observe \(Y_{ij} = \min (Y_{ij}^*, S_{ij})\).
Use a mixed generalized survival model:
\[ \begin{align*} P(Y_{ij}^* > y \mid \epsilon_{ij} = e) &= \Phi( -\vec g(y)^\top\vec\omega - \vec\beta^\top\vec x_{ij} - e) \\ (\epsilon_{i1}, \dots, \epsilon_{in_i}) &\sim N^{(n_i)}(\vec 0, \mat\Sigma_i(\vec\theta)) \end{align*} \]
A simple case is \(\vec g(y) = \log y\) in which case
\[ \begin{align*} \omega\log Y_{ij}^* &= -\vec\beta^\top\vec x_{ij} + \epsilon_{ij} \\ \vec\epsilon_i &\sim N^{(n_i)}(\vec 0, \mat I_{n_i} + \mat\Sigma_i(\vec\sigma)) \end{align*} \]
Or equivalently
\[ \begin{align*} \log Y_{ij}^* &= -\omega^{-1}\vec\beta^\top\vec x_{ij} + \epsilon_{ij} \\ \vec\epsilon_i &\sim N^{(n_i)}(\vec 0, \omega^{-2}(\mat I_{n_i} + \mat\Sigma_i(\vec\sigma))) \end{align*} \]
Used in the bivariate case by Szulkin et al. (2017).
Generally,
\[ \begin{align*} \vec g(Y_{ij}^*)^\top\vec\omega &= -\vec\beta^\top\vec x_{ij} + \epsilon_{ij} \\ \vec\epsilon_i &\sim N^{(n_i)}(\vec 0, \mat I_{n_i} + \mat\Sigma_i(\vec\sigma)) \end{align*} \]
The CDF dimension is equal to the number of censored individuals.
Similar log marginal likelihoods appears for some mixed models with multinomial and ordinal data.
The CDF is attractive when the number of random effects per cluster is not much smaller than the number of observations in a cluster.
Use Gaussian copula for the joint distribution of mixed data types.
See Christoffersen et al. (2021).
Extended the estimation method suggested by Pawitan et al. (2004).
for large data sets with moderate to high dimensional integrals per cluster.
Discussed extensions to other types of outcomes.
The presentation is at rpubs.com/boennecd/KTH-Pre-Meb-Meeting.
The markdown is at github.com/boennecd/Talks.
The R package is at github.com/boennecd/pedmod.
References are on the next slide.
Bai, Dan, Benjamin Hon Kei Yip, Gayle C. Windham, Andre Sourander, Richard Francis, Rinat Yoffe, Emma Glasson, et al. 2019. “Association of Genetic and Environmental Factors With Autism in a 5-Country Cohort.” JAMA Psychiatry 76 (10): 1035–43. https://doi.org/10.1001/jamapsychiatry.2019.1411.
Christoffersen, Benjamin, Mark Clements, Keith Humphreys, and Hedvig Kjellström. 2021. “Asymptotically Exact and Fast Gaussian Copula Models for Imputation of Mixed Data Types.” http://arxiv.org/abs/2102.02642.
Cranley, R., and T. N. L. Patterson. 1976. “Randomization of Number Theoretic Methods for Multiple Integration.” SIAM Journal on Numerical Analysis 13 (6): 904–14. http://www.jstor.org/stable/2156452.
Ekbom, Anders. 2010. “The Swedish Multi-Generation Register.” In Methods in Molecular Biology, 215–20. Humana Press. https://doi.org/10.1007/978-1-59745-423-0_10.
Genz, Alan. 1992. “Numerical Computation of Multivariate Normal Probabilities.” Journal of Computational and Graphical Statistics 1 (2): 141–49. http://www.jstor.org/stable/1390838.
Genz, Alan, and Frank Bretz. 2002. “Comparison of Methods for the Computation of Multivariate T Probabilities.” Journal of Computational and Graphical Statistics 11 (4): 950–71. https://doi.org/10.1198/106186002394.
Genz, Alan, Frank Bretz, Tetsuhisa Miwa, Xuefei Mi, Friedrich Leisch, Fabian Scheipl, and Torsten Hothorn. 2020. mvtnorm: Multivariate Normal and T Distributions. https://CRAN.R-project.org/package=mvtnorm.
Hadfield, Jarrod D. 2010. “MCMC Methods for Multi-Response Generalized Linear Mixed Models: The MCMCglmm R Package.” Journal of Statistical Software 33 (2): 1–22. https://www.jstatsoft.org/v33/i02/.
Hajivassiliou, Vassilis, Daniel McFadden, and Paul Ruud. 1996. “Simulation of Multivariate Normal Rectangle Probabilities and Their Derivatives Theoretical and Computational Results.” Journal of Econometrics 72 (1): 85–134. https://doi.org/https://doi.org/10.1016/0304-4076(94)01716-6.
Karypis, George, and Vipin Kumar. 1998. “A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs.” SIAM J. Sci. Comput. 20 (1): 359–92.
Keast, P. 1973. “Optimal Parameters for Multidimensional Integration.” SIAM Journal on Numerical Analysis 10 (5): 831–38. https://doi.org/10.1137/0710068.
Lindström, L., Y. Pawitan, M. Reilly, K. Hemminki, P. Lichtenstein, and K. Czene. 2006. “Estimation of Genetic and Environmental Factors for Melanoma Onset Using Population-Based Family Data.” Statistics in Medicine 25 (18): 3110–23. https://doi.org/https://doi.org/10.1002/sim.2266.
Mahjani, Behrang, Lambertus Klei, Christina M. Hultman, Henrik Larsson, Bernie Devlin, Joseph D. Buxbaum, Sven Sandin, and Dorothy E. Grice. 2020. “Maternal Effects as Causes of Risk for Obsessive-Compulsive Disorder.” Biological Psychiatry 87 (12): 1045–51. https://doi.org/https://doi.org/10.1016/j.biopsych.2020.01.006.
Niederreiter, H. 1972. “On a Number-Theoretical Integration Method.” Aequationes Mathematicae 8 (3): 304–11. https://doi.org/10.1007/BF01844507.
Pawitan, Y., M. Reilly, E. Nilsson, S. Cnattingius, and P. Lichtenstein. 2004. “Estimation of Genetic and Environmental Factors for Binary Traits Using Family Data.” Statistics in Medicine 23 (3): 449–65. https://doi.org/https://doi.org/10.1002/sim.1603.
Simon, Horst D., and Shang-Hua Teng. 1997. “How Good Is Recursive Bisection?” SIAM Journal on Scientific Computing 18 (5): 1436–45. https://doi.org/10.1137/S1064827593255135.
Svensson, Anna C., Sven Sandin, Sven Cnattingius, Marie Reilly, Yudi Pawitan, Christina M. Hultman, and Paul Lichtenstein. 2009. “Maternal Effects for Preterm Birth: A Genetic Epidemiologic Study of 630,000 Families.” American Journal of Epidemiology 170 (11): 1365–72. https://doi.org/10.1093/aje/kwp328.
Szulkin, Robert, Mark S. Clements, Patrik K. E. Magnusson, Fredrik E. Wiklund, and Ralf Kuja-Halkola. 2017. “Estimating Heritability of Prostate Cancer-Specific Survival Using Population-Based Registers.” The Prostate 77 (8): 900–907. https://doi.org/https://doi.org/10.1002/pros.23344.
Vitezica, Ignacy Misztal AND Shogo Tsuruta AND Daniela Lourenco AND Yutaka Masuda AND Ignacio Aguilar AND Andres Legarra AND Zulma. 2018. Manual for Blupf90 Family Programs. University of Georgia. http://nce.ads.uga.edu/wiki/doku.php?id=documentation.
Yip, Benjamin Hon Kei, Dan Bai, Behrang Mahjani, Lambertus Klei, Yudi Pawitan, Christina M Hultman, Dorothy E Grice, et al. 2018. “Heritable Variation, with Little or No Maternal Effect, Accounts for Recurrence Risk to Autism Spectrum Disorder in Sweden.” Biological Psychiatry 83 (7): 589–97.
Zhao, Yuxuan, and Madeleine Udell. 2019. “Missing Value Imputation for Mixed Data via Gaussian Copula.” http://arxiv.org/abs/1910.12845.