\[ \renewcommand\vec{\boldsymbol} \def\bigO#1{\mathcal{O}(#1)} \def\Cond#1#2{\left(#1\,\middle|\, #2\right)} \def\mat#1{\boldsymbol{#1}} \def\der{{\mathop{}\!\mathrm{d}}} \def\argmax{\text{arg}\,\text{max}} \def\Prob{\text{P}} \def\Expec{\text{E}} \def\logit{\text{logit}} \def\diag{\text{diag}} \]
Mention previous work and the type of data.
Introduce the models.
Highlight limitations of previous estimation methods.
Circles are females and squares are males.
Genes are known to have an impact on the risk of obsessive-compulsive disorder (OCD).
Genetic and environmental influences on a maternal phenotype can affect the phenotype of the child, in turn.
controlling for sex and age of the mother.
Want to estimate unobserved genetic effects, environmental effects, paternal effects, etc. for binary outcomes.
E.g. pre-eclamptic events, melanoma onset, schizophrenia and bipolar disorder, and preterm birth (Pawitan et al. 2004; Lindström et al. 2006; Svensson et al. 2009; Lichtenstein et al. 2009; Yip et al. 2018; Bai et al. 2019).
\[ \begin{align*} Y_{ij} &= \begin{cases} 1 & \vec x_{ij}^\top\vec\beta + \epsilon_{ij} > 0 \\ 0 & \text{otherwise} \end{cases} \\ (\epsilon_{i1}, \dots, \epsilon_{in_i})^\top & \sim N^{(n_i)}\left(\vec 0; \mat I_{n_i} + \mat\Sigma_i(\vec\sigma) \right) \\ \mat\Sigma_i(\vec\sigma) &= \sum_{k = 1}^K\sigma_k^2 \mat C_{ik} \end{align*} \]
A GLMM for binary outcomes with the probit link function. The probit link gives a computationally attractive model (Christoffersen, Clements, Kjellström, et al. 2021)!
Decompose variation into:
The environmental effects are often based on strong assumptions.
Darker colors are further from zero. The upper triangle is omitted.
Left: the family. Right: the childhood environment scale matrix.
the proportion of the variance explained by additive genetic effects,
the proportion of the variance explained by environmental effects.
May want paternal effects or maternal effects.
E.g. shared adult environment.
The likelihood has no closed-form expression
Used by Pawitan et al. (2004), Lindström et al. (2006), Svensson et al. (2009), Lichtenstein et al. (2009), Yip et al. (2018), and Bai et al. (2019).
Have to use derivative-free optimization. This is slow.
To simplify the computation, they do not use all data by
where individuals with cousin both on their mother and father side is repeated twice.
Goal: faster estimation method such that we can use all (or almost all) data.
The implementation is in the pedmod package.
MCMC is another option.
See e.g. MCMCglmm (Hadfield 2010) and BLUPF90 (Vitezica 2018).
Fitting a single model took weeks for Mahjani et al. (2020).
Show the likelihood.
at a high level.
Find the MLE:
\[ \max_{\vec\beta,\vec\sigma} \sum_{i = 1}^m\log L_i(\vec\beta,\vec\sigma) \]
where \(L_i\) is the likelihood of family (cluster) \(i\).
The likelihood of family \(i\):
\[ \begin{align*} L_i(\vec\beta,\vec\sigma) &= \Phi^{(n_i)}(\mat U_i\mat X_i\vec\beta;\vec 0, \mat I + \mat U_i\mat\Sigma_i(\vec\sigma) \mat U_i) \\ u_{ij} &= 2y_{ij} - 1 \\ \mat U_i &= \text{diag}(\vec u_i) \\ \end{align*} \]
where \(\Phi^{(K)}(\vec x; \vec\omega, \mat \Omega)\) is the \(K\) dimensional multivariate normal distribution’s CDF with mean \(\vec\omega\) and covariance matrix \(\mat\Omega\) evaluated at \(\vec x\).
Point: can use a CDF approximation.
Available in Fortran and in R through the mvtnorm package (Genz et al. 2020).
Our version better exploits auto vectorization from compilers.
The main bottleneck in the algorithm.
Seems compute-bound.
Like Hajivassiliou, McFadden, and Ruud (1996) for the gradient but with all additional extensions of Genz (1992) and Genz and Bretz (2002).
Implemented the minimax tilting method suggested by Botev (2017) for the rare probability case.
This and more is in the pedmod package.
Show a simulation study looking at bias, coverage of Wald-based confidence intervals and the computation time.
Show a procedure to greatly simplify the computational problem while only ignoring very little of the information.
250 families, with 10 members in each family, and the same pedigree shown earlier.
Use an additive genetic and childhood environment effect.
Caveat: we use four threads with our method.
for each of 250 ten dimensional integrals.
\[\begin{align*} Y_{ij} &= \begin{cases} 1 & \beta_0 + \beta_1x_{ij1} + \beta_2 x_{ij2} + \epsilon_{ij} > 0 \\ 0 & \text{otherwise} \end{cases} \\ (\epsilon_{i1}, \dots, \epsilon_{i10})^\top &\sim N^{(10)}(\vec 0, \mat I_{10} + \sigma_g^2\mat C_{ig}) \\ (\beta_0, \beta_1, \beta_2,\sigma_g^2) &= (-3, 1, 2, 3). \end{align*}\]
\(x_{ij1}\sim N(0, 1)\), \(x_{ij2}\sim \text{Bin}(0.5, 1)\) and \(\mat C_{ig}\) is for an additive genetic effect.
| \(\bar{\beta}_0\) | \(\bar{\beta}_1\) | \(\bar{\beta}_2\) | \(h_g\) | Time | |
|---|---|---|---|---|---|
| Bias | -0.00220 | 0.00191 | 0.0014 | -0.00300 | 14.609 |
| SE | 0.00198 | 0.00103 | 0.0021 | 0.00228 | 0.334 |
| Coverage | 0.95300 | 0.95200 | 0.9480 | 0.95700 |
Bias estimates of standardized fixed effect coefficients and the heritability along with standard errors from 1000 data sets. The former coefficients are standardized by dividing by the square root of the total variance. Coverage estimates are for 95% Wald-based confidence intervals and the last column shows the average computation time in seconds for MCMCglmm.
| \(\bar{\beta}_0\) | \(\bar{\beta}_1\) | \(\bar{\beta}_2\) | \(h_g\) | Time | |
|---|---|---|---|---|---|
| MCMC (mean) | -0.0118 | 0.0136 | -0.0008 | 0.0205 | 1526.53 |
| SE | 0.0054 | 0.0151 | 0.0209 | 0.0162 | 4.51 |
| MCMC (mode) | -0.0397 | 0.0309 | 0.0186 | -0.0150 | 1526.53 |
| SE | 0.0183 | 0.0171 | 0.0207 | 0.0155 | 4.51 |
| New method | -0.0121 | 0.0136 | -0.0002 | 0.0054 | 13.77 |
| SE | 0.0055 | 0.0151 | 0.0209 | 0.0173 | 1.16 |
Similar table for the first 10 data sets. The average effective sample size for \(\sigma_g^2\) was 335.58.
\[\begin{align*} Y_{ij} &= \begin{cases} 1 & \beta_0 + \beta_1x_{ij1} + \beta_2 x_{ij2} + \epsilon_{ij} > 0 \\ 0 & \text{otherwise} \end{cases} \\ (\epsilon_{i1}, \dots, \epsilon_{i10})^\top &\sim N^{(10)}(\vec 0, \mat I_{10} + \sigma_g^2\mat C_{ig} + \sigma_e^2\mat C_{ie}) \\ (\beta_0, \beta_1, \beta_2,\sigma_g^2,\sigma_e^2) &= (-3, 1, 2, 2, 1). \end{align*}\]
Added \(\mat C_{ie}\) for a childhood environment effect.
| \(\bar\beta_0\) | \(\bar\beta_1\) | \(\bar\beta_2\) | \(h_g\) | \(h_e\) | Time | |
|---|---|---|---|---|---|---|
| Bias | -0.00310 | 0.00236 | 0.00311 | -0.00587 | 0.00419 | 19.486 |
| SE | 0.00189 | 0.00102 | 0.00203 | 0.00277 | 0.00241 | 0.469 |
| Coverage | 0.95500 | 0.96600 | 0.95500 | 0.95200 | 0.95900 |
Similar table with the childhood environmental effect based on 1000 sampled data sets.
Want to avoid the composite likelihood like procedure and use as much information from the pedigree as possible.
Use the Swedish Multi-Generation Registry (Ekbom 2010) like Mahjani et al. (2020) to study OCD.
Most families are small …
Want to avoid 167279 dimensional integration!
Idea: create \(p\) families each of roughly size \(n_i / p\) such the number of removed dependencies is minimized.
Many approximations exists without the connection constraint (Simon and Teng 1997; Karypis and Kumar 1998). The partitions need not to be connected.
out of 3944737 (0.08 percent reduction)!
The data is sorted.
| \(\beta_0\) | \(\beta_{\text{sex}}\) | \(\beta_{\text{AOM}}\) | \(h_g\) | |
|---|---|---|---|---|
| Estimates | -3.692 | 0.227 | 0.147 | \(0.469\,(0.387, 0.545)\) |
Estimates of a model for OCD with a data set like the one used by Mahjani et al. (2020). The parenthesis shows a 95% profile likelihood-based confidence interval.
It took more than a month with MCMC! It took 12 hours and 15 minutes to compute the confidence interval.
Make an efficient Gaussian variational approximation or Laplace approximation to get starting values.
combined with a pairwise composite likelihood approach.
Fast approximations exists and derivatives are easily computed (Genz 2004; Vaida and Liu 2009).
\(Y_{ij}^* \in (0,\infty)\) with independent right censoring time \(C_{ij} \in (0,\infty)\).
Only observe \(Y_{ij} = \min (Y_{ij}^*, C_{ij})\).
Use a mixed flexible parametric survival model:
\[ \begin{align*} P(Y_{ij}^* > y \mid \epsilon_{ij} = e) &= \Phi( -\vec g(y)^\top\vec\omega - \vec\beta^\top\vec x_{ij} - e) \\ (\epsilon_{i1}, \dots, \epsilon_{in_i}) &\sim N^{(n_i)}(\vec 0, \mat\Sigma_i(\vec\sigma)) \end{align*} \]
A simple case is \(\vec g(y) = \log y\) in which case
\[ \begin{align*} \omega\log Y_{ij}^* &= -\vec\beta^\top\vec x_{ij} + \epsilon_{ij} \\ \vec\epsilon_i &\sim N^{(n_i)}(\vec 0, \mat I_{n_i} + \mat\Sigma_i(\vec\sigma)) \end{align*} \]
Used in the bivariate case by Szulkin et al. (2017).
Bonus: The dimension of the integrals are equal to the number of censored individuals.
the proportion of the variance on the log scale explained by additive genetic effects.
Generally,
\[ \begin{align*} \vec g(Y_{ij}^*)^\top\vec\omega &= -\vec\beta^\top\vec x_{ij} + \epsilon_{ij} \\ \vec\epsilon_i &\sim N^{(n_i)}(\vec 0, \mat I_{n_i} + \mat\Sigma_i(\vec\sigma)). \end{align*} \]A prototype implementation is available at github.com/boennecd/mixprobit/blob/master/examples/mixed-gsm.md.
\[ \begin{align*} Y_{ij} &= \begin{cases} 1 & \vec x_{ij}^\top\vec\beta + \epsilon_{ij0} + \sum_{k = 1}^K\sigma_k(\vec z_{ij})\epsilon_{ijk} > 0 \\ 0 & \text{otherwise} \end{cases} \\ \epsilon_{ij0} &\sim N(0,1) \\ (\epsilon_{i1k}, \dots, \epsilon_{in_ik})^\top & \sim N^{(n_i)}\left(\vec 0; \mat C_{ik} \right) \\ \sigma_k(\vec z_{ij}) &= \exp(\vec\theta_k^\top\vec z_{ij}) \end{align*} \]
Proportion of variance is \[ \frac{\sigma_k(\vec z_{ij})^2}{1 + \sum_{l = 1}^K\sigma_l(\vec z_{ij})^2} \]
E.g. boys may be more dependent than girls.
Already implemented.
using Gaussian copulas. See github.com/boennecd/mdgc.
See Christoffersen, Clements, Humphreys, et al. (2021).
The presentation is at rpubs.com/boennecd/KU-pedmod.
The markdown is at github.com/boennecd/Talks.
The R package is on CRAN and at github.com/boennecd/pedmod.
References are on the next slide.