Genetic and Environmental Effects

dummy slide

Motivation

Overview

Mention previous work and the type of data.

Introduce the models.

Highlight limitations of previous estimation methods.

Example

Suppose that the above is family \(i\) for which we observe \(Y_{i1},\dots Y_{i10} \in \{0, 1\}\).

Circles are females and squares are males.

OCD Study

Genes are known to have an impact on the risk of obsessive-compulsive disorder (OCD).

Genetic and environmental influences on a maternal phenotype can affect the phenotype of the child, in turn.

Mahjani et al. (2020) look at direct genetic effects and maternal effects on OCD

controlling for sex and age of the mother.

Research Questions

Want to estimate unobserved genetic effects, environmental effects, paternal effects, etc. for binary outcomes.

Many examples.

E.g. pre-eclamptic events, melanoma onset, schizophrenia and bipolar disorder, and preterm birth (Pawitan et al. 2004; Lindström et al. 2006; Svensson et al. 2009; Lichtenstein et al. 2009; Yip et al. 2018; Bai et al. 2019).

Liability Threshold Model

\[ \begin{align*} Y_{ij} &= \begin{cases} 1 & \vec x_{ij}^\top\vec\beta + \epsilon_{ij} > 0 \\ 0 & \text{otherwise} \end{cases} \\ (\epsilon_{i1}, \dots, \epsilon_{in_i})^\top & \sim N^{(n_i)}\left(\vec 0; \mat I_{n_i} + \mat\Sigma_i(\vec\sigma) \right) \\ \mat\Sigma_i(\vec\sigma) &= \sum_{k = 1}^K\sigma_k^2 \mat C_{ik} \end{align*} \]

given correlation matrices \(\mat C_{ik}\) and fixed effects covariates \(\vec x_{ij}\)s.

A GLMM for binary outcomes with the probit link function. The probit link gives a computationally attractive model (Christoffersen, Clements, Kjellström, et al. 2021)!

ACE Model

Decompose variation into:

Additive genetic effect (A),
Shared environmental factors (C), and
Individual specific effects and measurement errors (E).

The environmental effects are often based on strong assumptions.

Example

Left: the family. Right: the kinship matrix.

Darker colors are further from zero. The upper triangle is omitted.

Childhood Environment

Left: the family. Right: the childhood environment scale matrix.

Inference

Interested in \(h_g = \sigma_g^2/(1 + \sigma_g^2+ \sigma_e^2)\)

the proportion of the variance explained by additive genetic effects,

and \(h_e = \sigma_e^2/(1 + \sigma_g^2+ \sigma_e^2)\)

the proportion of the variance explained by environmental effects.

Arbitrary Number of Effects

May want paternal effects or maternal effects.

May want other definitions of environments.

E.g. shared adult environment.

Estimation

The likelihood has no closed-form expression

… but can expressed as integrals that have been extensively studied.

Used by Pawitan et al. (2004), Lindström et al. (2006), Svensson et al. (2009), Lichtenstein et al. (2009), Yip et al. (2018), and Bai et al. (2019).

Previous Work

Previous work only use a likelihood approximation without gradients.

Have to use derivative-free optimization. This is slow.

To simplify the computation, they do not use all data by

using discretized covariates,
omitting observations and information from the pedigree, and
using a composite likelihood like procedure

where individuals with cousin both on their mother and father side is repeated twice.

Goal

Goal: faster estimation method such that we can use all (or almost all) data.

The implementation is in the pedmod package.

MCMC

MCMC is another option.

Seems popular with animal data.

See e.g. MCMCglmm (Hadfield 2010) and BLUPF90 (Vitezica 2018).

Does take all information into account but very slow!

Fitting a single model took weeks for Mahjani et al. (2020).

The Likelihood

Overview

Show the likelihood.

Mention what has been done before and how we extend it

at a high level.

Maximum Likelihood

Find the MLE:

\[ \max_{\vec\beta,\vec\sigma} \sum_{i = 1}^m\log L_i(\vec\beta,\vec\sigma) \]

where \(L_i\) is the likelihood of family (cluster) \(i\).

The Likelihood

The likelihood of family \(i\):

\[ \begin{align*} L_i(\vec\beta,\vec\sigma) &= \Phi^{(n_i)}(\mat U_i\mat X_i\vec\beta;\vec 0, \mat I + \mat U_i\mat\Sigma_i(\vec\sigma) \mat U_i) \\ u_{ij} &= 2y_{ij} - 1 \\ \mat U_i &= \text{diag}(\vec u_i) \\ \end{align*} \]

where \(\Phi^{(K)}(\vec x; \vec\omega, \mat \Omega)\) is the \(K\) dimensional multivariate normal distribution’s CDF with mean \(\vec\omega\) and covariance matrix \(\mat\Omega\) evaluated at \(\vec x\).

Point: can use a CDF approximation.

Importance Sampling

Genz (1992) and Genz and Bretz (2002) develop an efficient importance sampling procedure to approximates the \(\Phi^{(K)}\).

Available in Fortran and in R through the mvtnorm package (Genz et al. 2020).

New Code

Rewritten most of the code by Genz and Bretz (2002) in C++.

Our version better exploits auto vectorization from compilers.

Added fast approximations of the standard normal distribution CDF and its inverse.

The main bottleneck in the algorithm.

Added support for computation in parallel.

Seems compute-bound.

New Code (Continued)

Added gradient and Hessian approximations.

Like Hajivassiliou, McFadden, and Ruud (1996) for the gradient but with all additional extensions of Genz (1992) and Genz and Bretz (2002).

Implemented the minimax tilting method suggested by Botev (2017) for the rare probability case.

This and more is in the pedmod package.

Simulation and Applied Examples

Overview

Show a simulation study looking at bias, coverage of Wald-based confidence intervals and the computation time.

Show a procedure to greatly simplify the computational problem while only ignoring very little of the information.

Simulation Study

250 families, with 10 members in each family, and the same pedigree shown earlier.

Use an additive genetic and childhood environment effect.

Compare with MCMCglmm.

Caveat: we use four threads with our method.

25000 samples were used

for each of 250 ten dimensional integrals.

The Model

\[\begin{align*} Y_{ij} &= \begin{cases} 1 & \beta_0 + \beta_1x_{ij1} + \beta_2 x_{ij2} + \epsilon_{ij} > 0 \\ 0 & \text{otherwise} \end{cases} \\ (\epsilon_{i1}, \dots, \epsilon_{i10})^\top &\sim N^{(10)}(\vec 0, \mat I_{10} + \sigma_g^2\mat C_{ig}) \\ (\beta_0, \beta_1, \beta_2,\sigma_g^2) &= (-3, 1, 2, 3). \end{align*}\]

\(x_{ij1}\sim N(0, 1)\), \(x_{ij2}\sim \text{Bin}(0.5, 1)\) and \(\mat C_{ig}\) is for an additive genetic effect.

Simulation Study Results

	\(\bar{\beta}_0\)	\(\bar{\beta}_1\)	\(\bar{\beta}_2\)	\(h_g\)	Time
Bias	-0.00220	0.00191	0.0014	-0.00300	14.609
SE	0.00198	0.00103	0.0021	0.00228	0.334
Coverage	0.95300	0.95200	0.9480	0.95700

Bias estimates of standardized fixed effect coefficients and the heritability along with standard errors from 1000 data sets. The former coefficients are standardized by dividing by the square root of the total variance. Coverage estimates are for 95% Wald-based confidence intervals and the last column shows the average computation time in seconds for MCMCglmm.

Comparison with MCMC

	\(\bar{\beta}_0\)	\(\bar{\beta}_1\)	\(\bar{\beta}_2\)	\(h_g\)	Time
MCMC (mean)	-0.0118	0.0136	-0.0008	0.0205	1526.53
SE	0.0054	0.0151	0.0209	0.0162	4.51
MCMC (mode)	-0.0397	0.0309	0.0186	-0.0150	1526.53
SE	0.0183	0.0171	0.0207	0.0155	4.51
New method	-0.0121	0.0136	-0.0002	0.0054	13.77
SE	0.0055	0.0151	0.0209	0.0173	1.16

Similar table for the first 10 data sets. The average effective sample size for \(\sigma_g^2\) was 335.58.

The Model with Environment Effects

\[\begin{align*} Y_{ij} &= \begin{cases} 1 & \beta_0 + \beta_1x_{ij1} + \beta_2 x_{ij2} + \epsilon_{ij} > 0 \\ 0 & \text{otherwise} \end{cases} \\ (\epsilon_{i1}, \dots, \epsilon_{i10})^\top &\sim N^{(10)}(\vec 0, \mat I_{10} + \sigma_g^2\mat C_{ig} + \sigma_e^2\mat C_{ie}) \\ (\beta_0, \beta_1, \beta_2,\sigma_g^2,\sigma_e^2) &= (-3, 1, 2, 2, 1). \end{align*}\]

Added \(\mat C_{ie}\) for a childhood environment effect.

The Model with Environment Effects

	\(\bar\beta_0\)	\(\bar\beta_1\)	\(\bar\beta_2\)	\(h_g\)	\(h_e\)	Time
Bias	-0.00310	0.00236	0.00311	-0.00587	0.00419	19.486
SE	0.00189	0.00102	0.00203	0.00277	0.00241	0.469
Coverage	0.95500	0.96600	0.95500	0.95200	0.95900

Similar table with the childhood environmental effect based on 1000 sampled data sets.

Simplifying the Real Application

Want to avoid the composite likelihood like procedure and use as much information from the pedigree as possible.

Use the Swedish Multi-Generation Registry (Ekbom 2010) like Mahjani et al. (2020) to study OCD.

Most families are small …

… but one is very large with 618162 members with 167279 members with observed outcomes.

Want to avoid 167279 dimensional integration!

Dependencies as a Graph

Dependencies as a Graph (continued)

Cutting the Graph

Cutting the Graph (continued)

Reducing the Computational Cost

Idea: create \(p\) families each of roughly size \(n_i / p\) such the number of removed dependencies is minimized.

Very similar to the \(p\)-way partition problem.

Many approximations exists without the connection constraint (Simon and Teng 1997; Karypis and Kumar 1998). The partitions need not to be connected.

The data can be simplified to at most 200 dimensional integration by removing 3164 parent-child links

out of 3944737 (0.08 percent reduction)!

Kinship Matrix

The orange colored entries are equal to zero in the new kinship matrix.

The data is sorted.

OCD Example

	\(\beta_0\)	\(\beta_{\text{sex}}\)	\(\beta_{\text{AOM}}\)	\(h_g\)
Estimates	-3.692	0.227	0.147	\(0.469\,(0.387, 0.545)\)

Estimates of a model for OCD with a data set like the one used by Mahjani et al. (2020). The parenthesis shows a 95% profile likelihood-based confidence interval.

The estimation time was four hours and 58 minutes.

It took more than a month with MCMC! It took 12 hours and 15 minutes to compute the confidence interval.

Extensions

Improvements

Get better starting values.

Make an efficient Gaussian variational approximation or Laplace approximation to get starting values.

Survival Data

The IPCW based estimator used by Holst, Scheike, and Hjelmborg (2016)

combined with a pairwise composite likelihood approach.

In this case, only the bivariate normal CDF is required.

Fast approximations exists and derivatives are easily computed (Genz 2004; Vaida and Liu 2009).

Survival Data (Continued)

\(Y_{ij}^* \in (0,\infty)\) with independent right censoring time \(C_{ij} \in (0,\infty)\).

Only observe \(Y_{ij} = \min (Y_{ij}^*, C_{ij})\).

Use a mixed flexible parametric survival model:

\[ \begin{align*} P(Y_{ij}^* > y \mid \epsilon_{ij} = e) &= \Phi( -\vec g(y)^\top\vec\omega - \vec\beta^\top\vec x_{ij} - e) \\ (\epsilon_{i1}, \dots, \epsilon_{in_i}) &\sim N^{(n_i)}(\vec 0, \mat\Sigma_i(\vec\sigma)) \end{align*} \]

Survival Data (Continued)

A simple case is \(\vec g(y) = \log y\) in which case

\[ \begin{align*} \omega\log Y_{ij}^* &= -\vec\beta^\top\vec x_{ij} + \epsilon_{ij} \\ \vec\epsilon_i &\sim N^{(n_i)}(\vec 0, \mat I_{n_i} + \mat\Sigma_i(\vec\sigma)) \end{align*} \]

Used in the bivariate case by Szulkin et al. (2017).

Bonus: The dimension of the integrals are equal to the number of censored individuals.

E.g. interested in \(\sigma_g^2/(1 + \sigma_g^2+ \sigma_e^2)\)

the proportion of the variance on the log scale explained by additive genetic effects.

Survival Data (continued)

Generally,

\[ \begin{align*} \vec g(Y_{ij}^*)^\top\vec\omega &= -\vec\beta^\top\vec x_{ij} + \epsilon_{ij} \\ \vec\epsilon_i &\sim N^{(n_i)}(\vec 0, \mat I_{n_i} + \mat\Sigma_i(\vec\sigma)). \end{align*} \]

More flexible.

A prototype implementation is available at github.com/boennecd/mixprobit/blob/master/examples/mixed-gsm.md.

Individual Specific Loadings

\[ \begin{align*} Y_{ij} &= \begin{cases} 1 & \vec x_{ij}^\top\vec\beta + \epsilon_{ij0} + \sum_{k = 1}^K\sigma_k(\vec z_{ij})\epsilon_{ijk} > 0 \\ 0 & \text{otherwise} \end{cases} \\ \epsilon_{ij0} &\sim N(0,1) \\ (\epsilon_{i1k}, \dots, \epsilon_{in_ik})^\top & \sim N^{(n_i)}\left(\vec 0; \mat C_{ik} \right) \\ \sigma_k(\vec z_{ij}) &= \exp(\vec\theta_k^\top\vec z_{ij}) \end{align*} \]

Proportion of variance is \[ \frac{\sigma_k(\vec z_{ij})^2}{1 + \sum_{l = 1}^K\sigma_l(\vec z_{ij})^2} \]

Individual Specific Loadings

E.g. boys may be more dependent than girls.

Already implemented.

Imputation Method

Improve upon the approximate EM-algorithm by Zhao and Udell (2019) used for imputation of missing values

using Gaussian copulas. See github.com/boennecd/mdgc.

Can provide joint imputation for continuous, binary, ordinal, and multinomial data.

See Christoffersen, Clements, Humphreys, et al. (2021).

Thank You!

The presentation is at rpubs.com/boennecd/KU-pedmod.

The markdown is at github.com/boennecd/Talks.

The R package is on CRAN and at github.com/boennecd/pedmod.

References are on the next slide.

References

Bai, Dan, Benjamin Hon Kei Yip, Gayle C. Windham, Andre Sourander, Richard Francis, Rinat Yoffe, Emma Glasson, et al. 2019. “Association of Genetic and Environmental Factors With Autism in a 5-Country Cohort.” JAMA Psychiatry 76 (10): 1035–43. https://doi.org/10.1001/jamapsychiatry.2019.1411.

Botev, Z. I. 2017. “The Normal Law Under Linear Restrictions: Simulation and Estimation via Minimax Tilting.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79 (1): 125–48. https://doi.org/https://doi.org/10.1111/rssb.12162.

Christoffersen, Benjamin, Mark Clements, Keith Humphreys, and Hedvig Kjellström. 2021. “Asymptotically Exact and Fast Gaussian Copula Models for Imputation of Mixed Data Types.” In Asian Conference on Machine Learning, 870–85. PMLR.

Christoffersen, Benjamin, Mark Clements, Hedvig Kjellström, and Keith Humphreys. 2021. “Approximation Methods for Mixed Models with Probit Link Functions.” arXiv. https://doi.org/10.48550/ARXIV.2110.14412.

Ekbom, Anders. 2010. “The Swedish Multi-Generation Register.” In Methods in Molecular Biology, 215–20. Humana Press. https://doi.org/10.1007/978-1-59745-423-0_10.

Genz, Alan. 1992. “Numerical Computation of Multivariate Normal Probabilities.” Journal of Computational and Graphical Statistics 1 (2): 141–49. http://www.jstor.org/stable/1390838.

———. 2004. “Numerical Computation of Rectangular Bivariate and Trivariate Normal and t Probabilities.” Statistics and Computing 14 (3): 251–60. https://doi.org/10.1023/b:stco.0000035304.20635.31.

Genz, Alan, and Frank Bretz. 2002. “Comparison of Methods for the Computation of Multivariate t Probabilities.” Journal of Computational and Graphical Statistics 11 (4): 950–71. https://doi.org/10.1198/106186002394.

Genz, Alan, Frank Bretz, Tetsuhisa Miwa, Xuefei Mi, Friedrich Leisch, Fabian Scheipl, and Torsten Hothorn. 2020. mvtnorm: Multivariate Normal and t Distributions. https://CRAN.R-project.org/package=mvtnorm.

Hadfield, Jarrod D. 2010. “MCMC Methods for Multi-Response Generalized Linear Mixed Models: The MCMCglmm R Package.” Journal of Statistical Software 33 (2): 1–22. https://www.jstatsoft.org/v33/i02/.

Hajivassiliou, Vassilis, Daniel McFadden, and Paul Ruud. 1996. “Simulation of Multivariate Normal Rectangle Probabilities and Their Derivatives Theoretical and Computational Results.” Journal of Econometrics 72 (1): 85–134. https://doi.org/https://doi.org/10.1016/0304-4076(94)01716-6.

Holst, Klaus K., Thomas H. Scheike, and Jacob B. Hjelmborg. 2016. “The Liability Threshold Model for Censored Twin Data.” Computational Statistics & Data Analysis 93: 324–35. https://doi.org/https://doi.org/10.1016/j.csda.2015.01.014.

Karypis, George, and Vipin Kumar. 1998. “A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs.” SIAM J. Sci. Comput. 20 (1): 359–92.

Lichtenstein, Paul, Benjamin H Yip, Camilla Björk, Yudi Pawitan, Tyrone D Cannon, Patrick F Sullivan, and Christina M Hultman. 2009. “Common Genetic Determinants of Schizophrenia and Bipolar Disorder in Swedish Families: A Population-Based Study.” The Lancet 373 (9659): 234–39. https://doi.org/https://doi.org/10.1016/S0140-6736(09)60072-6.

Lindström, L., Y. Pawitan, M. Reilly, K. Hemminki, P. Lichtenstein, and K. Czene. 2006. “Estimation of Genetic and Environmental Factors for Melanoma Onset Using Population-Based Family Data.” Statistics in Medicine 25 (18): 3110–23. https://doi.org/https://doi.org/10.1002/sim.2266.

Mahjani, Behrang, Lambertus Klei, Christina M. Hultman, Henrik Larsson, Bernie Devlin, Joseph D. Buxbaum, Sven Sandin, and Dorothy E. Grice. 2020. “Maternal Effects as Causes of Risk for Obsessive-Compulsive Disorder.” Biological Psychiatry 87 (12): 1045–51. https://doi.org/https://doi.org/10.1016/j.biopsych.2020.01.006.

Pawitan, Y., M. Reilly, E. Nilsson, S. Cnattingius, and P. Lichtenstein. 2004. “Estimation of Genetic and Environmental Factors for Binary Traits Using Family Data.” Statistics in Medicine 23 (3): 449–65. https://doi.org/https://doi.org/10.1002/sim.1603.

Simon, Horst D., and Shang-Hua Teng. 1997. “How Good Is Recursive Bisection?” SIAM Journal on Scientific Computing 18 (5): 1436–45. https://doi.org/10.1137/S1064827593255135.

Svensson, Anna C., Sven Sandin, Sven Cnattingius, Marie Reilly, Yudi Pawitan, Christina M. Hultman, and Paul Lichtenstein. 2009. “Maternal Effects for Preterm Birth: A Genetic Epidemiologic Study of 630,000 Families.” American Journal of Epidemiology 170 (11): 1365–72. https://doi.org/10.1093/aje/kwp328.

Szulkin, Robert, Mark S. Clements, Patrik K. E. Magnusson, Fredrik E. Wiklund, and Ralf Kuja-Halkola. 2017. “Estimating Heritability of Prostate Cancer-Specific Survival Using Population-Based Registers.” The Prostate 77 (8): 900–907. https://doi.org/https://doi.org/10.1002/pros.23344.

Vaida, Florin, and Lin Liu. 2009. “Fast Implementation for Normal Mixed Effects Models with Censored Response.” Journal of Computational and Graphical Statistics 18 (4): 797–817. https://doi.org/10.1198/jcgs.2009.07130.

Vitezica, Ignacy Misztal AND Shogo Tsuruta AND Daniela Lourenco AND Yutaka Masuda AND Ignacio Aguilar AND Andres Legarra AND Zulma. 2018. Manual for Blupf90 Family Programs. University of Georgia. http://nce.ads.uga.edu/wiki/doku.php?id=documentation.

Yip, Benjamin Hon Kei, Dan Bai, Behrang Mahjani, Lambertus Klei, Yudi Pawitan, Christina M Hultman, Dorothy E Grice, et al. 2018. “Heritable Variation, with Little or No Maternal Effect, Accounts for Recurrence Risk to Autism Spectrum Disorder in Sweden.” Biological Psychiatry 83 (7): 589–97.

Zhao, Yuxuan, and Madeleine Udell. 2019. “Missing Value Imputation for Mixed Data via Gaussian Copula.” http://arxiv.org/abs/1910.12845.