Genetic and Environmental Effects

dummy slide

Background

\[ \renewcommand\vec{\boldsymbol} \def\bigO#1{\mathcal{O}(#1)} \def\Cond#1#2{\left(#1\,\middle|\, #2\right)} \def\mat#1{\boldsymbol{#1}} \def\der{{\mathop{}\!\mathrm{d}}} \def\argmax{\text{arg}\,\text{max}} \def\Prob{\text{P}} \def\Expec{\text{E}} \def\logit{\text{logit}} \def\diag{\text{diag}} \]

Before MEB and KTH

Wanted to study economics, finance, or computer science.

Realized that I enjoy statistics.

Particularly, computational statistics.

Got a Ph.d. from Copenhagen Business School.

Worked with default models which mostly are survival models.

Position

Postdoc supervised by Keith, Mark, and Hedvig Kjellström from KTH.

Work on variational approximations in biostatistics.

A way of approximating integrals when estimating models. Often very fast.

Has almost nothing to do with today’s topic!

Except also having to do with integral approximations.

Motivation

Overview

Mention previous work and the type of data.

Introduce the models.

Highlight limitations of previous estimation methods.

Example

Suppose that the above is family \(i\) for which we observe \(Y_{i1},\dots Y_{i10} \in \{0, 1\}\).

Circles are females and squares are males.

OCD Study

Genes are known to have an impact on the risk of obsessive-compulsive disorder (OCD).

Genetic and environmental influences on a maternal phenotype can affect the phenotype of the child, in turn.

Mahjani et al. (2020) look at direct genetic effects and maternal effects on OCD

controlling for sex and age of the mother.

ASD Study

Both rheumatoid arthritis (RA) and autism spectrum disorder (ASD) seem to be effected by genes and share risk factors.

Interesting to study the potential genetic link with RA in mothers and ASD in children.

Joint work with Evora Hailin Zhu, Benjamin Yip, and Sven.

Research Questions

Want to estimate unobserved genetic effects, environmental effects, paternal effects, etc. for binary outcomes.

Other outcomes have been studied in the department.

E.g. pre-eclamptic events, melanoma onset, schizophrenia and bipolar disorder, and preterm birth (Pawitan et al. 2004; Lindström et al. 2006; Svensson et al. 2009; Lichtenstein et al. 2009; Yip et al. 2018; Bai et al. 2019).

Liability-threshold Models

\[ Y_{ij} = \begin{cases} 1 & \vec x_{ij}^\top\vec\beta + \epsilon_{ij} > 0 \\ 0 & \text{otherwise} \end{cases} \]

where \(\epsilon_{ij}\) is standard normally distributed. We observe the \(Y_{ij}\)s which are zero or one.

It is a GLM with \(\Phi(P(Y_{ij} = 1)) = \vec x_{ij}^\top\vec\beta\) where \(\Phi\) is the standard normal CDF.

There is a latent score \(x_{ij}^\top\vec\beta + \epsilon_{ij}\) and we observe \(Y_{ij} = 1\) if this exceeds the threshold zero.

This choice allows us to use fast methods later.

Liability-threshold Models (continued)

The Kinship Matrix

Want to add additive genetic effects.

Use random effects with a correlation matrix given by the kinship matrix.

Entries are the probability that a randomly selected allele from a locus will be identical by descent between two individuals.

Example

Left the family. Right the kinship matrix.

Darker colors are further from zero.

Adding Additive Genetic Effects

\(g_{ij}\) is the additive genetic effect for individual \(j\) in family \(i\) and \(\mat K_i\) be the kinship matrix of family \(i\). Suppose that

\[(g_{i1},\dots,g_{i10})^\top \sim N^{(10)}(\vec 0, \sigma_g^2 \underbrace{2\mat K_i}_{\mat C_{ig}}).\]

Write the model as

\[ \begin{align*} Y_{ij} &= \begin{cases} 1 & \vec x_{ij}^\top\vec\beta + \epsilon_{ij} + g_{ij} > 0 \\ 0 & \text{otherwise} \end{cases}. \end{align*} \]

ACE Model

Wants to account for environmental effects also: the ACE model.

  • Additive genetic effect (A),
  • shared environmental factors (C), and
  • individual specific effects and measurement errors (E).

The environmental effects (C) are often based on strong assumptions.

Example (continued)

Scale matrix for the environmental effect.

1, 2, 3, and 6 share an environment.

Example (continued)

Scale matrix for the environmental effect.

1 and 2 have other environments.

Example (continued)

Scale matrix for the environmental effect.

3, 4, 7, and 8 share an environment.

Example (continued)

Scale matrix for the environmental effect.

4 has another environment.

Example (continued)

Scale matrix for the environmental effect.

5, 6, 9, and 10 share an environment.

Example (continued)

Scale matrix for the environmental effect.

5 has another environment.

Adding Environmental Effects

\(e_{ij}\) is the environmental effect for individual \(j\) in family \(i\). Define \(\mat C_{ie}\) as in the previous slide and let:

\[(e_{i1},\dots,e_{i10})^\top \sim N^{(10)}(\vec 0, \sigma_e^2 \mat C_{ie}).\]

The ACE model is:

\[ Y_{ij} = \begin{cases} 1 & \vec x_{ij}^\top\vec\beta + \epsilon_{ij} + g_{ij} + e_{ij} > 0 \\ 0 & \text{otherwise} \end{cases} \]

Inference

Interested in \(\sigma_g^2/(1 + \sigma_g^2+ \sigma_e^2)\)

the proportion of the variance explained by additive genetic effects,

and \(\sigma_e^2/(1 + \sigma_g^2+ \sigma_e^2)\)

the proportion of the variance explained by environmental effects.

Re-write

The model can be written as:

\[ \begin{align*} Y_{ij} &= \begin{cases} 1 & \vec x_{ij}^\top\vec\beta + \epsilon_{ij} > 0 \\ 0 & \text{otherwise} \end{cases} \\ (\epsilon_{i1}, \dots, \epsilon_{in_i})^\top & \sim N^{(n_i)}(\vec 0; \mat I_{n_i} + \sigma_g^2 \mat C_{ig} + \sigma_e^2 \mat C_{ie}). \end{align*} \]

Arbitrary Number of Effects

May want paternal effects or maternal effects.

May want other definitions of environments.

E.g. shared adult environment or shared childhood environment.

Can easily extend.

Re-write (continued)

For \(K\) different effects:

\[ \begin{align*} Y_{ij} &= \begin{cases} 1 & \vec x_{ij}^\top\vec\beta + \epsilon_{ij} > 0 \\ 0 & \text{otherwise} \end{cases} \\ (\epsilon_{i1}, \dots, \epsilon_{in_i})^\top & \sim N^{(n_i)}(\vec 0; \mat I_{n_i} + \mat\Sigma_i(\vec\sigma)) \\ \mat\Sigma_i(\vec\sigma) &= \sum_{k = 1}^K\sigma_k^2 \mat L_{ik}. \end{align*} \]

Given researcher defined \(\mat L_{i1}\dots\mat L_{iK}\).

Estimation

The likelihood has no simple (closed form) expression

… but can expressed as integrals that have been extensively studied.

Used by Pawitan et al. (2004), Lindström et al. (2006), Svensson et al. (2009), Lichtenstein et al. (2009), Yip et al. (2018), and Bai et al. (2019).

Previous Work

Previous work only use a likelihood approximation without gradients.

Have to use derivative-free optimization. This is slow.

To simplify the computation, they do not use all data by

  • using discretized covariates,
  • omitting observations and information from the pedigree, and
  • using a composite likelihood like procedure

where individuals with cousin both on their mother and father side is repeated twice.

Previous Work (continued)

Does not efficiently use all data.

May lead to less efficient estimator.

Higher variance thus requiring more observations.

Goal: faster estimation method such that we can use all (or almost all) data.

New Method and Implementation

Method with gradient approximations.

A faster approximation and an implementation that supports computation in parallel.

Works for an arbitrary number of user defined effects, \(K\) and \(\mat L_{i1},\dots,\mat L_{iK}\).

The implementation is in the pedmod package.

MCMC

Markov chain Monte Carlo (MCMC) is another option.

A way of sequentially sampling from the posterior.

Seems popular with animal data.

See e.g. MCMCglmm (Hadfield 2010) and BLUPF90 (Vitezica 2018).

Does take all information into account but very slow!

Fitting a single model took weeks for Mahjani et al. (2020).

The Likelihood

Overview

Show the likelihood.

At a high level, mention what has been done before and how we extend it.

Maximum Likelihood

Find the MLE:

\[ \max_{\vec\beta,\vec\sigma} \sum_{i = 1}^m\log L_i(\vec\beta,\vec\sigma) \]

where \(L_i\) is the likelihood of family (cluster) \(i\).

The Likelihood

The likelihood of family \(i\):

\[ \begin{align*} L_i(\vec\beta,\vec\sigma) &= \Phi^{(n_i)}(\mat U_i\mat X_i\vec\beta;\vec 0, \mat I + \mat U_i\mat\Sigma_i(\vec\sigma) \mat U_i) \\ u_{ij} &= 2y_{ij} - 1 \\ \mat U_i &= \text{diag}(\vec u_i) \\ \end{align*} \]

where \(\Phi^{(K)}(\vec x; \vec\omega, \mat \Omega)\) is the \(K\) dimensional multivariate normal distribution’s CDF with mean \(\vec\omega\) and covariance matrix \(\mat\Omega\) evaluated at \(\vec x\).

Point: need CDF approximation.

Importance Sampling

Genz (1992) and Genz and Bretz (2002) develop an efficient importance sampling procedure to approximates the \(\Phi^{(K)}\).

Available in Fortran and in R through the mvtnorm package (Genz et al. 2020).

New Code

Rewritten most of the code by Genz and Bretz (2002) in C++.

Added fast approximations of the standard normal distribution CDF and its inverse.

The main bottleneck in the algorithm.

Added support for computation in parallel.

Seems compute-bound.

Added gradient approximations.

Like Hajivassiliou, McFadden, and Ruud (1996) but with all additional extension of Genz (1992) and Genz and Bretz (2002).

Simulation and Applied Examples

Overview

Show a simulation study looking at bias and computation time.

Show a procedure to greatly simplify the computational problem while only ignoring very little of the information.

Show some preliminary results.

Simulation Study

250 families, with 10 members in each family, and the same pedigree. We add an additive genetic effect.

Same pedigree as earlier.

Compare with MCMCglmm.

Caveat: we use four threads with our method.

The Model

\[\begin{align*} Y_{ij} &= \begin{cases} 1 & \beta_0 + \beta_1x_{ij1} + \beta_2 x_{ij2} + \epsilon_{ij} > 0 \\ 0 & \text{otherwise} \end{cases} \\ (\epsilon_{i1}, \dots, \epsilon_{i10})^\top &\sim N^{(10)}(\vec 0, \mat I_{10} + \sigma_g^2\mat C_{ig}) \\ (\beta_0, \beta_1, \beta_2,\sigma_2^2) &= (-3, 1, 2, 3). \end{align*}\]

\(x_{ij1}\sim N(0, 1)\), \(x_{ij2}\sim \text{Bin}(0.5, 1)\) are observed.

Discretization will yield the wrong result.

Simulation Study Results

\(\beta_0\) \(\beta_1\) \(\beta_2\) \(\sigma_g\) Time
Bias -0.481 0.108 -0.13 -0.346 13.7
Bias / SE -0.875 0.337 -0.22 -0.514

Bias estimates estimated using 100 data sets. The bias estimates are multiplied by 100. The last column shows the average estimation time in seconds. The standard error for the latter is 0.534. The last row shows the bias estimates divided by the standard errors.

Results: Comparison with MCMC

\(\beta_0\) \(\beta_1\) \(\beta_2\) \(\sigma_g\) Time
MCMC (mean) -1.13 1.318 -0.108 1.811 1573.9
Bias / SE -2.13 0.866 -0.051 1.161
MCMC (mode) -2.70 2.220 1.293 -1.019
Bias / SE -1.24 1.263 0.590 -0.696
New method -1.22 1.363 -0.024 0.515 13.7
Bias / SE -2.23 0.902 -0.012 0.301

Bias estimates using 10 data sets. The bias estimates are multiplied by 100. The last column shows average the estimation time in seconds. The “Bias / SE” rows show the bias estimates divided by the standard errors.

Simplifying the Real Application

Want to avoid the composite likelihood like procedure and use as much information from the pedigree as possible.

Use the Swedish Multi-Generation Registry (Ekbom 2010) like Mahjani et al. (2020) to study OCD.

Most families are small …

… but one is very large with 618162 members with 167279 members with observed outcomes.

Want to avoid 167279 dimensional integration!

Dependencies as a Graph

Dependencies as a Graph (continued)

Dependencies as a Graph (continued)

Cutting the Graph

Cutting the Graph (continued)

Reducing the Computational Cost

Removing an edge introduces independence between two individuals

and between their ancestor and descendants.

Removing edges can creates two non-connected families (subgraphs) with \(\tilde n_k\) and \(\tilde n_l\) individuals (vertices)

with \(\tilde n_k + \tilde n_l = n_i\).

Yields two integrals to be approximated of size \(\tilde n_k\) and \(\tilde n_l\)

rather than one of size \(n_i\).

Reducing the Computational Cost

Idea: create \(p\) families each of roughly size \(n_i / p\) such the number of removed dependencies is minimized.

Essentially the \(p\)-way partition problem.

Many approximations exists without the connection constraint (Simon and Teng 1997; Karypis and Kumar 1998). The partitions need not to be connected.

The data can be simplified to at most 200 dimensional integration by removing 3164 parent-child links

out of 3944737 (0.08 percent reduction)!

Kinship Matrix

The orange colored entries are equal to zero in the new kinship matrix.

The two matrices are otherwise identical.

Profile Likelihood

The above profile likelihood was computed in 3 hours and 22 minutes on Vector.

The MCMC analysis in Mahjani et al. (2020) took weeks!

Extensions

Overview

Cover ways of reducing the computation time.

Show extensions to survival data.

Improvements

Get better starting values.

Make an efficient Gaussian variational approximation or Laplace approximation to get starting values.

Use stochastic gradient techniques.

Currently, working on a method like suggested by Byrd et al. (2016).

Survival Data

\(Y_{ij}^* \in (0,\infty)\) with independent right censoring time \(C_{ij} \in (0,\infty)\).

Only observe \(Y_{ij} = \min (Y_{ij}^*, C_{ij})\).

Use a mixed flexible parametric survival model:

\[ \begin{align*} P(Y_{ij}^* > y \mid \epsilon_{ij} = e) &= \Phi( -\vec g(y)^\top\vec\omega - \vec\beta^\top\vec x_{ij} - e) \\ (\epsilon_{i1}, \dots, \epsilon_{in_i}) &\sim N^{(n_i)}(\vec 0, \mat\Sigma_i(\vec\sigma)) \end{align*} \]

Survival Data

A simple case is \(\vec g(y) = \log y\) in which case

\[ \begin{align*} \omega\log Y_{ij}^* &= -\vec\beta^\top\vec x_{ij} + \epsilon_{ij} \\ \vec\epsilon_i &\sim N^{(n_i)}(\vec 0, \mat I_{n_i} + \mat\Sigma_i(\vec\sigma)) \end{align*} \]

Or equivalently

\[ \begin{align*} \log Y_{ij}^* &= -\omega^{-1}\vec\beta^\top\vec x_{ij} + \epsilon_{ij} \\ \vec\epsilon_i &\sim N^{(n_i)}(\vec 0, \omega^{-2}(\mat I_{n_i} + \mat\Sigma_i(\vec\sigma))). \end{align*} \]

Used in the bivariate case by Szulkin et al. (2017).

Bonus: The dimension of the integrals are equal to the number of censored individuals.

Inference in ACE Model

Interested in \(\sigma_g^2/(1 + \sigma_g^2+ \sigma_e^2)\)

the proportion of the variance on the log scale explained by additive genetic effects,

and \(\sigma_e^2/(1 + \sigma_g^2+ \sigma_e^2)\)

the proportion of the variance on the log scale explained by environmental effects.

Survival Data (continued)

Generally,

\[ \begin{align*} \vec g(Y_{ij}^*)^\top\vec\omega &= -\vec\beta^\top\vec x_{ij} + \epsilon_{ij} \\ \vec\epsilon_i &\sim N^{(n_i)}(\vec 0, \mat I_{n_i} + \mat\Sigma_i(\vec\sigma)). \end{align*} \]

More flexible.

Inference in ACE Model

Interested in \(\sigma_g^2/(1 + \sigma_g^2+ \sigma_e^2)\)

the proportion of the variance on the \(h(y) = \vec g(y)^\top\vec\omega\) scale explained by additive genetic effects,

and \(\sigma_e^2/(1 + \sigma_g^2+ \sigma_e^2)\)

the proportion of the variance on the \(h(y) = \vec g(y)^\top\vec\omega\) scale explained by environmental effects.

Other Projects

Variational Approximations

Recall: I work on variational approximations in biostatistics.

A way of approximating integrals when estimating models. Often very fast.

Used variational approximations for mixed flexible parametric models.

E.g. to account for the hospital being used, repeated measures, environmental effects, dependence between twins, etc.

Joint Survival and Marker Models

Preliminary work on applying variational approximations for joint survival and marker models.

Has applications e.g. with kidney data to characterise trajectories of biomarkers of kidney function decline while accounting for diabetic complications and medications.

Estimating Variational Approximations

Variational approximations easily involves optimization with many parameters.

In the thousands or even millions.

Implemented generic optimization methods that use the structure of the function being optimized.

Seen ten fold and 100 fold reductions in estimation time. See github.com/boennecd/psqn and github.com/boennecd/psqn-va-ex.

Have seen 20 times faster estimation time compared with a Laplace approximation from lme4 in some examples and with lower bias.

Imputation Method

Improve upon the approximate EM-algorithm by Zhao and Udell (2019) used for imputation of missing values.

See github.com/boennecd/mdgc.

Can provide joint imputation for continuous, binary, ordinal, and multinomial data.

See Christoffersen et al. (2021).

Based on a similar approximation as used for the family data.

Conclusions

Summary

Want to study

  • maternal and genetic influences on OCD and
  • potential genetic relations between mothers getting RA and their children getting ASD.

Want to use as much information as possible.

Provide a faster means of estimating common models and a procedure to simplify the families while ignoring little information.

Thanks To

Advisers: Keith Humphreys, Mark Clements, and ‪Hedvig Kjellström.

Behrang Mahjani, Evora Hailin Zhu, Benjamin Yip, and Sven Sandin.

Ph.d. Position

Mark is hiring a Ph.d. student to work on:

  • Variational approximations used to estimate penalized mixed flexible parametric survival models.
  • Variational approximations used in joint survival and marker models.
  • Smooth accelerated failure time models.

Thank You!

The presentation is at rpubs.com/boennecd/MEB-pedmod.

The markdown is at github.com/boennecd/Talks.

The R package is at github.com/boennecd/pedmod.

References are on the next slide.

References

Bai, Dan, Benjamin Hon Kei Yip, Gayle C. Windham, Andre Sourander, Richard Francis, Rinat Yoffe, Emma Glasson, et al. 2019. “Association of Genetic and Environmental Factors With Autism in a 5-Country Cohort.” JAMA Psychiatry 76 (10): 1035–43. https://doi.org/10.1001/jamapsychiatry.2019.1411.

Byrd, R. H., S. L. Hansen, Jorge Nocedal, and Y. Singer. 2016. “A Stochastic Quasi-Newton Method for Large-Scale Optimization.” SIAM Journal on Optimization 26 (2): 1008–31. https://doi.org/10.1137/140954362.

Christoffersen, Benjamin, Mark Clements, Keith Humphreys, and Hedvig Kjellström. 2021. “Asymptotically Exact and Fast Gaussian Copula Models for Imputation of Mixed Data Types.” http://arxiv.org/abs/2102.02642.

Ekbom, Anders. 2010. “The Swedish Multi-Generation Register.” In Methods in Molecular Biology, 215–20. Humana Press. https://doi.org/10.1007/978-1-59745-423-0_10.

Genz, Alan. 1992. “Numerical Computation of Multivariate Normal Probabilities.” Journal of Computational and Graphical Statistics 1 (2): 141–49. http://www.jstor.org/stable/1390838.

Genz, Alan, and Frank Bretz. 2002. “Comparison of Methods for the Computation of Multivariate T Probabilities.” Journal of Computational and Graphical Statistics 11 (4): 950–71. https://doi.org/10.1198/106186002394.

Genz, Alan, Frank Bretz, Tetsuhisa Miwa, Xuefei Mi, Friedrich Leisch, Fabian Scheipl, and Torsten Hothorn. 2020. mvtnorm: Multivariate Normal and T Distributions. https://CRAN.R-project.org/package=mvtnorm.

Hadfield, Jarrod D. 2010. “MCMC Methods for Multi-Response Generalized Linear Mixed Models: The MCMCglmm R Package.” Journal of Statistical Software 33 (2): 1–22. https://www.jstatsoft.org/v33/i02/.

Hajivassiliou, Vassilis, Daniel McFadden, and Paul Ruud. 1996. “Simulation of Multivariate Normal Rectangle Probabilities and Their Derivatives Theoretical and Computational Results.” Journal of Econometrics 72 (1): 85–134. https://doi.org/https://doi.org/10.1016/0304-4076(94)01716-6.

Karypis, George, and Vipin Kumar. 1998. “A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs.” SIAM J. Sci. Comput. 20 (1): 359–92.

Lichtenstein, Paul, Benjamin H Yip, Camilla Björk, Yudi Pawitan, Tyrone D Cannon, Patrick F Sullivan, and Christina M Hultman. 2009. “Common Genetic Determinants of Schizophrenia and Bipolar Disorder in Swedish Families: A Population-Based Study.” The Lancet 373 (9659): 234–39. https://doi.org/https://doi.org/10.1016/S0140-6736(09)60072-6.

Lindström, L., Y. Pawitan, M. Reilly, K. Hemminki, P. Lichtenstein, and K. Czene. 2006. “Estimation of Genetic and Environmental Factors for Melanoma Onset Using Population-Based Family Data.” Statistics in Medicine 25 (18): 3110–23. https://doi.org/https://doi.org/10.1002/sim.2266.

Mahjani, Behrang, Lambertus Klei, Christina M. Hultman, Henrik Larsson, Bernie Devlin, Joseph D. Buxbaum, Sven Sandin, and Dorothy E. Grice. 2020. “Maternal Effects as Causes of Risk for Obsessive-Compulsive Disorder.” Biological Psychiatry 87 (12): 1045–51. https://doi.org/https://doi.org/10.1016/j.biopsych.2020.01.006.

Pawitan, Y., M. Reilly, E. Nilsson, S. Cnattingius, and P. Lichtenstein. 2004. “Estimation of Genetic and Environmental Factors for Binary Traits Using Family Data.” Statistics in Medicine 23 (3): 449–65. https://doi.org/https://doi.org/10.1002/sim.1603.

Simon, Horst D., and Shang-Hua Teng. 1997. “How Good Is Recursive Bisection?” SIAM Journal on Scientific Computing 18 (5): 1436–45. https://doi.org/10.1137/S1064827593255135.

Svensson, Anna C., Sven Sandin, Sven Cnattingius, Marie Reilly, Yudi Pawitan, Christina M. Hultman, and Paul Lichtenstein. 2009. “Maternal Effects for Preterm Birth: A Genetic Epidemiologic Study of 630,000 Families.” American Journal of Epidemiology 170 (11): 1365–72. https://doi.org/10.1093/aje/kwp328.

Szulkin, Robert, Mark S. Clements, Patrik K. E. Magnusson, Fredrik E. Wiklund, and Ralf Kuja-Halkola. 2017. “Estimating Heritability of Prostate Cancer-Specific Survival Using Population-Based Registers.” The Prostate 77 (8): 900–907. https://doi.org/https://doi.org/10.1002/pros.23344.

Vitezica, Ignacy Misztal AND Shogo Tsuruta AND Daniela Lourenco AND Yutaka Masuda AND Ignacio Aguilar AND Andres Legarra AND Zulma. 2018. Manual for Blupf90 Family Programs. University of Georgia. http://nce.ads.uga.edu/wiki/doku.php?id=documentation.

Yip, Benjamin Hon Kei, Dan Bai, Behrang Mahjani, Lambertus Klei, Yudi Pawitan, Christina M Hultman, Dorothy E Grice, et al. 2018. “Heritable Variation, with Little or No Maternal Effect, Accounts for Recurrence Risk to Autism Spectrum Disorder in Sweden.” Biological Psychiatry 83 (7): 589–97.

Zhao, Yuxuan, and Madeleine Udell. 2019. “Missing Value Imputation for Mixed Data via Gaussian Copula.” http://arxiv.org/abs/1910.12845.