\[ \renewcommand\vec{\boldsymbol} \def\bigO#1{\mathcal{O}(#1)} \def\Cond#1#2{\left(#1\,\middle|\, #2\right)} \def\mat#1{\boldsymbol{#1}} \def\der{{\mathop{}\!\mathrm{d}}} \def\argmax{\text{arg}\,\text{max}} \def\Prob{\text{P}} \def\Expec{\text{E}} \def\logit{\text{logit}} \def\diag{\text{diag}} \]
Wanted to study economics, finance, or computer science.
Particularly, computational statistics.
Worked with default models, which are mostly survival models.
Postdoc supervised by Keith, Mark, and Hedvig Kjellström from KTH.
A way of approximating integrals when estimating models. Often very fast.
Except also having to do with integral approximations.
Mention previous work and the type of data.
Introduce the models.
Highlight limitations of previous estimation methods.
Circles are females and squares are males.
Genes are known to have an impact on the risk of obsessive-compulsive disorder (OCD).
Genetic and environmental influences on a maternal phenotype can affect the phenotype of the child, in turn.
controlling for sex and age of the mother.
Both rheumatoid arthritis (RA) and autism spectrum disorder (ASD) seem to be affected by genes and share risk factors.
Interesting to study the potential genetic link between RA in mothers and ASD in children.
Joint work with Evora Hailin Zhu, Benjamin Yip, and Sven.
Want to estimate unobserved genetic effects, environmental effects, paternal effects, etc. for binary outcomes.
E.g. pre-eclamptic events, melanoma onset, schizophrenia and bipolar disorder, and preterm birth (Pawitan et al. 2004; Lindström et al. 2006; Svensson et al. 2009; Lichtenstein et al. 2009; Yip et al. 2018; Bai et al. 2019).
\[ Y_{ij} = \begin{cases} 1 & \vec x_{ij}^\top\vec\beta + \epsilon_{ij} > 0 \\ 0 & \text{otherwise} \end{cases} \]
where \(\epsilon_{ij}\) is standard normally distributed. We observe the \(Y_{ij}\)s, which are zero or one. It is a GLM with \(\Phi^{-1}(P(Y_{ij} = 1)) = \vec x_{ij}^\top\vec\beta\) where \(\Phi\) is the standard normal CDF (i.e. a probit model).
There is a latent score \(\vec x_{ij}^\top\vec\beta + \epsilon_{ij}\) and we observe \(Y_{ij} = 1\) if this exceeds the threshold zero.
This choice allows us to use fast methods later.
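As a minimal sketch, the latent-variable formulation can be simulated in R as below; the coefficients and the covariate are made up.

```r
# Simulate binary outcomes from the probit model above (made-up values).
set.seed(1)
n <- 1000
X <- cbind(1, rnorm(n))                 # intercept and one covariate
beta <- c(-1, .5)                       # hypothetical coefficients
latent <- drop(X %*% beta) + rnorm(n)   # x_ij^T beta + epsilon_ij
y <- as.integer(latent > 0)             # we only observe the indicator
# the implied success probability is pnorm(x_ij^T beta)
c(mean(y), mean(pnorm(drop(X %*% beta))))
```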
Want to add additive genetic effects.
Entries are the probability that a randomly selected allele from a locus will be identical by descent between two individuals.
Darker colors are further from zero.
Let \(g_{ij}\) be the additive genetic effect for individual \(j\) in family \(i\) and let \(\mat K_i\) be the kinship matrix of family \(i\). Suppose that
\[(g_{i1},\dots,g_{i10})^\top \sim N^{(10)}(\vec 0, \sigma_g^2 \underbrace{2\mat K_i}_{\mat C_{ig}}).\]
Write the model as
\[ \begin{align*} Y_{ij} &= \begin{cases} 1 & \vec x_{ij}^\top\vec\beta + \epsilon_{ij} + g_{ij} > 0 \\ 0 & \text{otherwise} \end{cases}. \end{align*} \]
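As a sketch, the additive genetic effects for a nuclear family (two unrelated parents and two full siblings) could be drawn as below; the kinship matrix is hand-coded and the variance component is made up.

```r
# Draw (g_i1, ..., g_i4) ~ N(0, sigma_g^2 * 2 K_i) for a nuclear family.
library(mvtnorm)
K <- matrix(c(                 # kinship: self 1/2, parent-child and full
  .50, .00, .25, .25,          # siblings 1/4, unrelated parents 0
  .00, .50, .25, .25,
  .25, .25, .50, .25,
  .25, .25, .25, .50), nrow = 4, byrow = TRUE)
C_g <- 2 * K                   # the scale matrix C_ig in the model above
sigma_g <- 1                   # hypothetical variance component
g <- drop(rmvnorm(1, sigma = sigma_g^2 * C_g))
```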
Want to also account for environmental effects: the ACE model.
The environmental effects (C) are often based on strong assumptions.
1, 2, 3, and 6 share an environment.
1 and 2 have other environments.
3, 4, 7, and 8 share an environment.
4 has another environment.
5, 6, 9, and 10 share an environment.
5 has another environment.
\(e_{ij}\) is the environmental effect for individual \(j\) in family \(i\). Define \(\mat C_{ie}\) as in the previous slide and let:
\[(e_{i1},\dots,e_{i10})^\top \sim N^{(10)}(\vec 0, \sigma_e^2 \mat C_{ie}).\]
The ACE model is:
\[ Y_{ij} = \begin{cases} 1 & \vec x_{ij}^\top\vec\beta + \epsilon_{ij} + g_{ij} + e_{ij} > 0 \\ 0 & \text{otherwise} \end{cases} \]
the proportion of the variance explained by additive genetic effects,
the proportion of the variance explained by environmental effects.
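Concretely, assuming unit diagonal entries of \(\mat C_{ig}\) and \(\mat C_{ie}\) (no inbreeding) and the unit residual variance from the probit link, these proportions, often written \(h^2\) and \(c^2\), are

\[ h^2 = \frac{\sigma_g^2}{1 + \sigma_g^2 + \sigma_e^2}, \qquad c^2 = \frac{\sigma_e^2}{1 + \sigma_g^2 + \sigma_e^2}. \]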
The model can be written as:
\[ \begin{align*} Y_{ij} &= \begin{cases} 1 & \vec x_{ij}^\top\vec\beta + \epsilon_{ij} > 0 \\ 0 & \text{otherwise} \end{cases} \\ (\epsilon_{i1}, \dots, \epsilon_{in_i})^\top & \sim N^{(n_i)}(\vec 0, \mat I_{n_i} + \sigma_g^2 \mat C_{ig} + \sigma_e^2 \mat C_{ie}). \end{align*} \]
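A sketch of simulating one family's outcomes directly from this marginalised form; the scale matrices, covariates, and parameters are all made up.

```r
# Simulate y_i1, ..., y_i4 from the marginalised ACE probit model.
library(mvtnorm)
n_i <- 4
K <- matrix(.25, n_i, n_i); diag(K) <- .5; K[1, 2] <- K[2, 1] <- 0
C_g <- 2 * K                          # nuclear family, parents unrelated
C_e <- matrix(1, n_i, n_i)            # everyone shares one environment
sigma_g <- 1; sigma_e <- sqrt(.5)     # hypothetical variance components
X <- cbind(1, rnorm(n_i)); beta <- c(-1, .5)
Sigma <- diag(n_i) + sigma_g^2 * C_g + sigma_e^2 * C_e
eps <- drop(rmvnorm(1, sigma = Sigma))
y <- as.integer(drop(X %*% beta) + eps > 0)
```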
May want paternal effects or maternal effects.
E.g. shared adult environment or shared childhood environment.
Can easily extend.
For \(K\) different effects:
\[ \begin{align*} Y_{ij} &= \begin{cases} 1 & \vec x_{ij}^\top\vec\beta + \epsilon_{ij} > 0 \\ 0 & \text{otherwise} \end{cases} \\ (\epsilon_{i1}, \dots, \epsilon_{in_i})^\top & \sim N^{(n_i)}(\vec 0, \mat I_{n_i} + \mat\Sigma_i(\vec\sigma)) \\ \mat\Sigma_i(\vec\sigma) &= \sum_{k = 1}^K\sigma_k^2 \mat L_{ik}. \end{align*} \]
Given researcher-defined \(\mat L_{i1},\dots,\mat L_{iK}\).
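For illustration, a tiny helper that builds \(\mat\Sigma_i(\vec\sigma)\) from a vector of standard deviations and a list of scale matrices; the function name is made up.

```r
# Sigma_i(sigma) = sum_k sigma_k^2 * L_ik for researcher-defined L_i1, ..., L_iK.
make_Sigma <- function(sigma, L_list)
  Reduce(`+`, Map(function(s, L) s^2 * L, sigma, L_list))

# e.g. make_Sigma(c(sigma_g, sigma_e), list(C_g, C_e))
```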
The likelihood has no simple (closed-form) expression.
Used by Pawitan et al. (2004), Lindström et al. (2006), Svensson et al. (2009), Lichtenstein et al. (2009), Yip et al. (2018), and Bai et al. (2019).
Have to use derivative-free optimization. This is slow.
To simplify the computation, they do not use all data by
where individuals with cousins on both their mother's and father's side are repeated twice.
Does not efficiently use all data.
Higher variance, thus requiring more observations.
Goal: faster estimation method such that we can use all (or almost all) data.
Method with gradient approximations.
A faster approximation and an implementation that supports computation in parallel.
Works for an arbitrary number of user-defined effects, i.e. any \(K\) and \(\mat L_{i1},\dots,\mat L_{iK}\).
The implementation is in the pedmod package.
A way of sequentially sampling from the posterior.
See e.g. MCMCglmm (Hadfield 2010) and BLUPF90 (Vitezica 2018).
Fitting a single model took weeks for Mahjani et al. (2020).
Show the likelihood.
At a high level, mention what has been done before and how we extend it.
Find the MLE:
\[ \max_{\vec\beta,\vec\sigma} \sum_{i = 1}^m\log L_i(\vec\beta,\vec\sigma) \]
where \(L_i\) is the likelihood of family (cluster) \(i\).
The likelihood of family \(i\):
\[ \begin{align*} L_i(\vec\beta,\vec\sigma) &= \Phi^{(n_i)}(\mat U_i\mat X_i\vec\beta;\vec 0, \mat I + \mat U_i\mat\Sigma_i(\vec\sigma) \mat U_i) \\ u_{ij} &= 2y_{ij} - 1 \\ \mat U_i &= \text{diag}(\vec u_i) \\ \end{align*} \]
where \(\Phi^{(K)}(\vec x; \vec\omega, \mat \Omega)\) is the \(K\) dimensional multivariate normal distribution’s CDF with mean \(\vec\omega\) and covariance matrix \(\mat\Omega\) evaluated at \(\vec x\).
Point: need CDF approximation.
Available in Fortran and in R through the mvtnorm package (Genz et al. 2020).
Rewritten most of the code by Genz and Bretz (2002) in C++.
The main bottleneck in the algorithm.
Seems compute-bound.
Like Hajivassiliou, McFadden, and Ruud (1996) but with all the additional extensions of Genz (1992) and Genz and Bretz (2002).
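For illustration, a single likelihood term could be evaluated with mvtnorm's pmvnorm as sketched below; this is only a restatement of the expression for \(L_i\) shown earlier, not the pedmod implementation, which uses its own C++ code.

```r
# log L_i = log Phi^{(n_i)}(U_i X_i beta; 0, I + U_i Sigma_i U_i) via pmvnorm.
library(mvtnorm)
family_log_lik <- function(y, X, beta, Sigma) {
  u <- 2 * y - 1                      # u_ij = 2 y_ij - 1
  U <- diag(u, length(y))             # U_i = diag(u_i)
  log(pmvnorm(
    upper = drop(U %*% X %*% beta),
    sigma = diag(length(y)) + U %*% Sigma %*% U,
    algorithm = GenzBretz(abseps = 1e-5)))
}
```

Summing such terms over families gives the log-likelihood, which could then be passed to a generic optimiser such as optim().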
Show a simulation study looking at bias and computation time.
Show a procedure to greatly simplify the computational problem while only ignoring very little of the information.
Show some preliminary results.
Same pedigree as earlier.
Caveat: we use four threads with our method.
\[\begin{align*} Y_{ij} &= \begin{cases} 1 & \beta_0 + \beta_1x_{ij1} + \beta_2 x_{ij2} + \epsilon_{ij} > 0 \\ 0 & \text{otherwise} \end{cases} \\ (\epsilon_{i1}, \dots, \epsilon_{i10})^\top &\sim N^{(10)}(\vec 0, \mat I_{10} + \sigma_g^2\mat C_{ig}) \\ (\beta_0, \beta_1, \beta_2,\sigma_g^2) &= (-3, 1, 2, 3). \end{align*}\]
\(x_{ij1}\sim N(0, 1)\) and \(x_{ij2}\sim \text{Bin}(1, 0.5)\) are observed.
Discretization will yield the wrong result.
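A sketch of generating one simulated family under the model above; the ten-member pedigree's \(\mat C_{ig}\) is not reproduced here, so `C_g` is a placeholder argument.

```r
# Simulate one family under the simulation-study model given its C_g matrix.
library(mvtnorm)
sim_family <- function(C_g, beta = c(-3, 1, 2), sigma_g2 = 3) {
  n <- nrow(C_g)
  X <- cbind(1, rnorm(n), rbinom(n, 1, .5))   # intercept, x_ij1, x_ij2
  eps <- drop(rmvnorm(1, sigma = diag(n) + sigma_g2 * C_g))
  data.frame(y = as.integer(drop(X %*% beta) + eps > 0),
             x1 = X[, 2], x2 = X[, 3])
}
```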
| | \(\beta_0\) | \(\beta_1\) | \(\beta_2\) | \(\sigma_g\) | Time |
|---|---|---|---|---|---|
| Bias | -0.481 | 0.108 | -0.13 | -0.346 | 13.7 |
| Bias / SE | -0.875 | 0.337 | -0.22 | -0.514 | |
Bias estimates computed using 100 data sets. The bias estimates are multiplied by 100. The last column shows the average estimation time in seconds; the standard error for the latter is 0.534. The last row shows the bias estimates divided by their standard errors.
| | \(\beta_0\) | \(\beta_1\) | \(\beta_2\) | \(\sigma_g\) | Time |
|---|---|---|---|---|---|
| MCMC (mean) | -1.13 | 1.318 | -0.108 | 1.811 | 1573.9 |
| Bias / SE | -2.13 | 0.866 | -0.051 | 1.161 | |
| MCMC (mode) | -2.70 | 2.220 | 1.293 | -1.019 | |
| Bias / SE | -1.24 | 1.263 | 0.590 | -0.696 | |
| New method | -1.22 | 1.363 | -0.024 | 0.515 | 13.7 |
| Bias / SE | -2.23 | 0.902 | -0.012 | 0.301 | |
Bias estimates computed using 10 data sets. The bias estimates are multiplied by 100. The last column shows the average estimation time in seconds. The “Bias / SE” rows show the bias estimates divided by their standard errors.
Want to avoid the composite-likelihood-like procedure and use as much information from the pedigree as possible.
Use the Swedish Multi-Generation Registry (Ekbom 2010) like Mahjani et al. (2020) to study OCD.
Most families are small …
Want to avoid 167279-dimensional integration!
and between their ancestors and descendants.
with \(\tilde n_k + \tilde n_l = n_i\).
rather than one of size \(n_i\).
Idea: create \(p\) families each of roughly size \(n_i / p\) such that the number of removed dependencies is minimized.
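Writing \(L_{ik}\) and \(L_{il}\) for the likelihood terms of the two sub-families of sizes \(\tilde n_k\) and \(\tilde n_l\) obtained by zeroing the between-block entries of \(\mat\Sigma_i(\vec\sigma)\), the split amounts to the approximation

\[ L_i(\vec\beta,\vec\sigma) \approx L_{ik}(\vec\beta,\vec\sigma)\, L_{il}(\vec\beta,\vec\sigma). \]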
Many approximations exist without the connection constraint (Simon and Teng 1997; Karypis and Kumar 1998); that is, the partitions need not be connected.
out of 3944737 (0.08 percent reduction)!
The two matrices are otherwise identical.
The MCMC analysis in Mahjani et al. (2020) took weeks!
Cover ways of reducing the computation time.
Show extensions to survival data.
Make an efficient Gaussian variational approximation or Laplace approximation to get starting values.
Currently working on a method like the one suggested by Byrd et al. (2016).
\(Y_{ij}^* \in (0,\infty)\) with independent right censoring time \(C_{ij} \in (0,\infty)\).
Only observe \(Y_{ij} = \min (Y_{ij}^*, C_{ij})\).
Use a mixed flexible parametric survival model:
\[ \begin{align*} P(Y_{ij}^* > y \mid \epsilon_{ij} = e) &= \Phi( -\vec g(y)^\top\vec\omega - \vec\beta^\top\vec x_{ij} - e) \\ (\epsilon_{i1}, \dots, \epsilon_{in_i}) &\sim N^{(n_i)}(\vec 0, \mat\Sigma_i(\vec\sigma)) \end{align*} \]
A simple case is \(\vec g(y) = \log y\) in which case
\[ \begin{align*} \omega\log Y_{ij}^* &= -\vec\beta^\top\vec x_{ij} + \epsilon_{ij} \\ \vec\epsilon_i &\sim N^{(n_i)}(\vec 0, \mat I_{n_i} + \mat\Sigma_i(\vec\sigma)) \end{align*} \]
Or equivalently
\[ \begin{align*} \log Y_{ij}^* &= -\omega^{-1}\vec\beta^\top\vec x_{ij} + \epsilon_{ij} \\ \vec\epsilon_i &\sim N^{(n_i)}(\vec 0, \omega^{-2}(\mat I_{n_i} + \mat\Sigma_i(\vec\sigma))). \end{align*} \]
Used in the bivariate case by Szulkin et al. (2017).
Bonus: the dimension of the integrals is equal to the number of censored individuals.
the proportion of the variance on the log scale explained by additive genetic effects,
the proportion of the variance on the log scale explained by environmental effects.
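A sketch of this \(\vec g(y) = \log y\) special case as a log-normal accelerated failure time model with correlated random effects; the covariance matrix, censoring distribution, and parameters are all made up.

```r
# Simulate right-censored log-normal event times for one family.
library(mvtnorm)
n_i <- 4
Sigma_i <- matrix(.25, n_i, n_i); diag(Sigma_i) <- .5    # made-up Sigma_i(sigma)
omega <- 1.5
X <- cbind(1, rnorm(n_i)); beta <- c(-1, .5)
eps <- drop(rmvnorm(1, sigma = (diag(n_i) + Sigma_i) / omega^2))
y_star <- exp(drop(-X %*% beta) / omega + eps)           # latent event times
cens <- rexp(n_i, rate = .1)                             # independent censoring
y <- pmin(y_star, cens)                                  # observed times
event <- y_star <= cens                                  # event indicator
```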
Generally,
\[ \begin{align*} \vec g(Y_{ij}^*)^\top\vec\omega &= -\vec\beta^\top\vec x_{ij} + \epsilon_{ij} \\ \vec\epsilon_i &\sim N^{(n_i)}(\vec 0, \mat I_{n_i} + \mat\Sigma_i(\vec\sigma)). \end{align*} \]
More flexible.
the proportion of the variance on the \(h(y) = \vec g(y)^\top\vec\omega\) scale explained by additive genetic effects,
the proportion of the variance on the \(h(y) = \vec g(y)^\top\vec\omega\) scale explained by environmental effects.
A way of approximating integrals when estimating models. Often very fast.
E.g. to account for the hospital being used, repeated measures, environmental effects, dependence between twins, etc.
Preliminary work on applying variational approximations for joint survival and marker models.
Has applications e.g. with kidney data to characterise trajectories of biomarkers of kidney function decline while accounting for diabetic complications and medications.
In the thousands or even millions.
Seen tenfold and 100-fold reductions in estimation time. See github.com/boennecd/psqn and github.com/boennecd/psqn-va-ex.
Have seen estimation times 20 times faster than a Laplace approximation from lme4 in some examples, and with lower bias.
See Christoffersen et al. (2021).
Based on a similar approximation as used for the family data.
Want to study
Want to use as much information as possible.
Provide a faster means of estimating common models and a procedure to simplify the families while ignoring little information.
Advisers: Keith Humphreys, Mark Clements, and Hedvig Kjellström.
Behrang Mahjani, Evora Hailin Zhu, Benjamin Yip, and Sven Sandin.
Mark is hiring a Ph.D. student to work on:
The presentation is at rpubs.com/boennecd/MEB-pedmod.
The markdown is at github.com/boennecd/Talks.
The R package is at github.com/boennecd/pedmod.
References are on the next slide.
Bai, Dan, Benjamin Hon Kei Yip, Gayle C. Windham, Andre Sourander, Richard Francis, Rinat Yoffe, Emma Glasson, et al. 2019. “Association of Genetic and Environmental Factors With Autism in a 5-Country Cohort.” JAMA Psychiatry 76 (10): 1035–43. https://doi.org/10.1001/jamapsychiatry.2019.1411.
Byrd, R. H., S. L. Hansen, Jorge Nocedal, and Y. Singer. 2016. “A Stochastic Quasi-Newton Method for Large-Scale Optimization.” SIAM Journal on Optimization 26 (2): 1008–31. https://doi.org/10.1137/140954362.
Christoffersen, Benjamin, Mark Clements, Keith Humphreys, and Hedvig Kjellström. 2021. “Asymptotically Exact and Fast Gaussian Copula Models for Imputation of Mixed Data Types.” http://arxiv.org/abs/2102.02642.
Ekbom, Anders. 2010. “The Swedish Multi-Generation Register.” In Methods in Molecular Biology, 215–20. Humana Press. https://doi.org/10.1007/978-1-59745-423-0_10.
Genz, Alan. 1992. “Numerical Computation of Multivariate Normal Probabilities.” Journal of Computational and Graphical Statistics 1 (2): 141–49. http://www.jstor.org/stable/1390838.
Genz, Alan, and Frank Bretz. 2002. “Comparison of Methods for the Computation of Multivariate T Probabilities.” Journal of Computational and Graphical Statistics 11 (4): 950–71. https://doi.org/10.1198/106186002394.
Genz, Alan, Frank Bretz, Tetsuhisa Miwa, Xuefei Mi, Friedrich Leisch, Fabian Scheipl, and Torsten Hothorn. 2020. mvtnorm: Multivariate Normal and T Distributions. https://CRAN.R-project.org/package=mvtnorm.
Hadfield, Jarrod D. 2010. “MCMC Methods for Multi-Response Generalized Linear Mixed Models: The MCMCglmm R Package.” Journal of Statistical Software 33 (2): 1–22. https://www.jstatsoft.org/v33/i02/.
Hajivassiliou, Vassilis, Daniel McFadden, and Paul Ruud. 1996. “Simulation of Multivariate Normal Rectangle Probabilities and Their Derivatives: Theoretical and Computational Results.” Journal of Econometrics 72 (1): 85–134. https://doi.org/10.1016/0304-4076(94)01716-6.
Karypis, George, and Vipin Kumar. 1998. “A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs.” SIAM J. Sci. Comput. 20 (1): 359–92.
Lichtenstein, Paul, Benjamin H Yip, Camilla Björk, Yudi Pawitan, Tyrone D Cannon, Patrick F Sullivan, and Christina M Hultman. 2009. “Common Genetic Determinants of Schizophrenia and Bipolar Disorder in Swedish Families: A Population-Based Study.” The Lancet 373 (9659): 234–39. https://doi.org/10.1016/S0140-6736(09)60072-6.
Lindström, L., Y. Pawitan, M. Reilly, K. Hemminki, P. Lichtenstein, and K. Czene. 2006. “Estimation of Genetic and Environmental Factors for Melanoma Onset Using Population-Based Family Data.” Statistics in Medicine 25 (18): 3110–23. https://doi.org/10.1002/sim.2266.
Mahjani, Behrang, Lambertus Klei, Christina M. Hultman, Henrik Larsson, Bernie Devlin, Joseph D. Buxbaum, Sven Sandin, and Dorothy E. Grice. 2020. “Maternal Effects as Causes of Risk for Obsessive-Compulsive Disorder.” Biological Psychiatry 87 (12): 1045–51. https://doi.org/10.1016/j.biopsych.2020.01.006.
Pawitan, Y., M. Reilly, E. Nilsson, S. Cnattingius, and P. Lichtenstein. 2004. “Estimation of Genetic and Environmental Factors for Binary Traits Using Family Data.” Statistics in Medicine 23 (3): 449–65. https://doi.org/10.1002/sim.1603.
Simon, Horst D., and Shang-Hua Teng. 1997. “How Good Is Recursive Bisection?” SIAM Journal on Scientific Computing 18 (5): 1436–45. https://doi.org/10.1137/S1064827593255135.
Svensson, Anna C., Sven Sandin, Sven Cnattingius, Marie Reilly, Yudi Pawitan, Christina M. Hultman, and Paul Lichtenstein. 2009. “Maternal Effects for Preterm Birth: A Genetic Epidemiologic Study of 630,000 Families.” American Journal of Epidemiology 170 (11): 1365–72. https://doi.org/10.1093/aje/kwp328.
Szulkin, Robert, Mark S. Clements, Patrik K. E. Magnusson, Fredrik E. Wiklund, and Ralf Kuja-Halkola. 2017. “Estimating Heritability of Prostate Cancer-Specific Survival Using Population-Based Registers.” The Prostate 77 (8): 900–907. https://doi.org/10.1002/pros.23344.
Vitezica, Zulma, Ignacy Misztal, Shogo Tsuruta, Daniela Lourenco, Yutaka Masuda, Ignacio Aguilar, and Andres Legarra. 2018. Manual for BLUPF90 Family of Programs. University of Georgia. http://nce.ads.uga.edu/wiki/doku.php?id=documentation.
Yip, Benjamin Hon Kei, Dan Bai, Behrang Mahjani, Lambertus Klei, Yudi Pawitan, Christina M Hultman, Dorothy E Grice, et al. 2018. “Heritable Variation, with Little or No Maternal Effect, Accounts for Recurrence Risk to Autism Spectrum Disorder in Sweden.” Biological Psychiatry 83 (7): 589–97.
Zhao, Yuxuan, and Madeleine Udell. 2019. “Missing Value Imputation for Mixed Data via Gaussian Copula.” http://arxiv.org/abs/1910.12845.