\[ \renewcommand\vec{\boldsymbol} \def\bigO#1{\mathcal{O}(#1)} \def\Cond#1#2{\left(#1\,\middle|\, #2\right)} \def\mat#1{\boldsymbol{#1}} \def\der{{\mathop{}\!\mathrm{d}}} \def\argmax{\text{arg}\,\text{max}} \def\Prob{\text{P}} \def\Expec{\text{E}} \def\logit{\text{logit}} \def\diag{\text{diag}} \]
Want to estimate genetic effect, environmental effect, paternal effect, etc.
We observe \(j = 1,\dots, n_i\) binary outcomes in \(i = 1,\dots, m\) clusters.
Running Example: Interested in presence of a genetic effect and maternal effect.
where \(\mat K_i\) is the kinship matrix and \(\mat R_i\) is a matrix such that \(\vec\epsilon_{i2}\) represents a maternal effect.
Quantify heritability through \(\theta_1 \big/(\theta_1 + \theta_2 + 1)\).
Example: 12 and 13 have 1/4 shared genetic effect and 1/2 shared maternal effect. 13 and 15 only have 1/4 shared genetic effect.
\[ \begin{align*} Y_{ij} &= 1_{\{\vec\beta^\top\vec x_{ij} + \epsilon_{ij} > 0\}} \\ \vec\epsilon_i = (\epsilon_{i1}, \dots, \epsilon_{in_i}) &\sim N^{(n_i)}(\vec 0, \mat I + \mat\Sigma_i(\vec\theta)) \\ \mat\Sigma_i(\vec\theta) &= \sum_{k = 1}^K \theta_k \mat C_{ik} \end{align*} \]
for known matrices \(\mat C_{i1}, \dots, \mat C_{iK}\).
Pawitan et al. (2004) use the Fortran code by Genz and Bretz (2002) to estimate the model using a derivative-free optimization method.
Extended code.
Simulation study.
Extensions.
Conclusion.
\[ \begin{align*} L_i(\vec\beta,\vec\theta) &= \int \phi^{(n_i)} (\vec z;\vec 0, \mat\Sigma_i(\vec\theta) ) \prod_{j = 1}^{n_i}\Phi( u_{ij}(\vec\beta^\top \vec x_{ij} + z_j)) \der\vec z \\ u_{ij} &= 2y_{ij} - 1 \end{align*}. \]
Find the MLE:
\[ \argmax_{\vec\beta,\vec\theta} \sum_{i = 1}^m\log L_i(\vec\beta,\vec\theta). \]\[ \begin{align*} \begin{pmatrix} \vec Y_1 \\ \vec Y_2\end{pmatrix} &\sim N^{(k_1 + k_2)}\left( \begin{pmatrix} \vec\xi_1 \\ \vec \xi_2\end{pmatrix}, \begin{pmatrix} \vec\Xi_{11} & \vec\Xi_{12} \\ \vec\Xi_{21} & \vec\Xi_{22} \end{pmatrix} \right) \\ \Prob(\vec Y_2 \leq \vec x) &= \Phi^{(k_2)}(\vec x; \vec\xi_2,\mat\Xi_{22}) \\ &= \int \phi^{(k_1)}(\vec z;\vec\xi_1,\mat\Xi_{11}) \Phi^{(k_2)}\Big( \\ &\hspace{25pt}\vec x; \vec\xi_2 + \mat\Xi_{21}\mat\Xi_{11}^{-1}(\vec z - \vec \xi_1), \mat\Xi_{22} - \mat\Xi_{21}\mat\Xi_{11}^{-1}\mat\Xi_{12} \Big)\der\vec z \end{align*} \]
\[ \begin{align*} L_i(\vec\beta,\vec\theta) &= \Phi^{(n_i)}(\mat U_i\mat X_i\vec\beta;\vec 0, \mat I + \mat U_i\mat\Sigma_i(\vec\theta) \mat U_i) \\ &= \int_{\vec z\leq \mat U_i\mat X_i\vec\beta} \phi^{(n_i)}(\vec z; \vec 0, \mat I + \mat U_i\mat\Sigma_i(\vec\theta) \mat U_i)\der\vec z \\ &= \int_{\mat S^\top\vec z\leq \mat U_i\mat X_i\vec\beta} \prod_{j = 1}^{n_i} \phi(z_j)\der\vec z \end{align*} \]
where \(\mat U_i = \text{diag}(\vec u_i)\) and
\[\mat I + \mat U_i\mat\Sigma_i(\vec\theta) \mat U_i = \mat S^\top\mat S.\]
\[ \begin{align*} L_i(\vec\beta,\vec\theta) &= \int \frac{\prod_{j = 1}^{n_i} \phi(z_j)}{ h(\vec z)}h(\vec z)\der\vec z \\ &= \int h(\vec z)\prod_{j = 1}^{n_i} \Phi(\vec s_j^\top\vec z - u_{ij}\vec x_{ij}^\top\vec\beta)\der z \end{align*} \]
where
\[ h(\vec z) = \begin{cases} \prod_{j = 1}^{n_i} \frac{\phi(z_j)} {\Phi(\vec s_j^\top\vec z - u_{ij}\vec x_{ij}^\top\vec\beta)} & \mat S^\top\vec z\leq \mat U_i\mat X_i\vec\beta \\ 0 & \text{otherwise} \end{cases}. \]
Closely follow Genz and Bretz (2009).
Like Genz (1992) and Genz and Bretz (2002):
Used in the mvtnorm package (Genz et al. 2020).
Added fast \(\Phi\) and \(\Phi^{-1}\) approximations.
Seems compute-bound.
Like Hajivassiliou, McFadden, and Ruud (1996) but with all of the above, re-ordering, and randomized Korobov rules.
\[ \begin{align*} L_i'(\vec\beta,\vec\theta) &= \int_{\vec z\leq \mat U_i\mat X_i\vec\beta} \vec g(\vec z; \vec\beta, \vec\theta) \phi^{(n_i)}(\vec z; \vec 0, \mat I + \mat U_i\mat\Sigma_i(\vec\theta) \mat U_i)\der\vec z \\ &= \int_{\mat S^\top\vec z\leq \mat U_i\mat X_i\vec\beta} \vec g(\mat S^{-\top}\vec z; \vec\beta, \vec\theta) \phi(z_j)\der\vec z\\ &= \int \vec g(\mat S^{-\top}\vec z; \vec\beta, \vec\theta) h(\vec z)\prod_{j = 1}^{n_i} \Phi(\vec s_j^\top\vec z - u_{ij}\vec x_{ij}^\top\vec\beta)\der z \end{align*}. \]
Used for
\[ \sum_{i = 1}^m (\log L_i(\vec\beta,\vec\theta))' = \sum_{i = 1}^m \frac{L_i'(\vec\beta,\vec\theta)}{L_i(\vec\beta,\vec\theta)}. \]
Use monotone cubic interpolation:
truth <- exp(seq(log(1e-32), log(1 - 1e-16), length.out = 10000))
args <- qnorm(truth)
range(pnorm(args) - pnorm_aprx(args))## [1] -1.78e-07 1.80e-07
## # A tibble: 2 x 6
## expression min median `itr/sec` mem_alloc `gc/sec`
## <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
## 1 pnorm 429.6µs 462.9µs 2140. 78.2KB 4.06
## 2 aprx 35.9µs 57.3µs 17562. 83.4KB 33.4
Same paper used by stats::qnorm.
args <- exp(seq(log(1e-32), log(1 - 1e-16), length.out = 10000))
range(qnorm(args) - qnorm_aprx(args))## [1] -6.97e-07 7.78e-07
## # A tibble: 2 x 6
## expression min median `itr/sec` mem_alloc `gc/sec`
## <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
## 1 qnorm 400µs 422µs 2350. 78.2KB 4.06
## 2 aprx 213µs 219µs 4520. 82.3KB 9.29
mvtnorm::pmvnorm with pedmod::mvndst:
Example from ?mvtnorm::pmvnorm.
## [1] 0.57997 0.58004
The MC error is comparable:
set.seed(95501057)
apply(
replicate(100, c(pmvnorm(lw, up, mu, sig),
mvndst (lw, up, mu, sig))),
1, sd)## [1] 9.95e-05 7.43e-05
set.seed(1)
bench::mark(mvtnorm = pmvnorm(lw, up, mu, sig),
pedmod = mvndst(lw, up, mu, sig), min_time = 1, check = FALSE)## # A tibble: 2 x 6
## expression min median `itr/sec` mem_alloc `gc/sec`
## <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
## 1 mvtnorm 1.16ms 1.2ms 827. 3.7KB 2.01
## 2 pedmod 624.63µs 629.3µs 1555. 2.49KB 0
and the pedigree data shown earlier resulting in 160000 observations. Repeat 100 times.
Start with a maximum of 5000 samples per term. Then use 25000.
Use four threads.
\[ \begin{align*} \mat X_i &= (\vec 1, \vec x_{\cdot 2}, \vec x_{\cdot 3}) \\ \vec x_{\cdot 2} &= \frac 4{15}(-7.5, -6.5, \dots, 6.5, 7.5)^\top \\ \vec x_{\cdot 3} &= (0, 1, 0, \dots, 0, 1)^\top \\ \vec\beta &= (-2, 0.5, 1)^\top \\ \vec\theta &= (0.5, 0.33)^\top \end{align*} \]
| (Intercept) | X | Dummy | Genetic | Maternal | |
|---|---|---|---|---|---|
| Bias w/o extra | -0.00082 | 0.00017 | 0.00006 | -0.00028 | 0.00176 |
| SE w/o extra | 0.00191 | 0.00060 | 0.00130 | 0.00252 | 0.00202 |
| Bias w/ extra | -0.00087 | 0.00017 | 0.00014 | -0.00028 | 0.00183 |
| SE w/ extra | 0.00191 | 0.00060 | 0.00130 | 0.00253 | 0.00202 |
“W/o extra” and “w/ extra” are with, respectively, 5000 and 25000 samples per log marginal likelihood term. SE is the MC standard error. The last two columns are the genetic effect parameter, \(\theta_1\), and the maternal effect parameter, \(\theta_2\).
Estimated values less true values when using more samples per log marginal likelihood term. The y-axis is the same as the box plots shown later.
| Mean | Median | Sd | |
|---|---|---|---|
| W/o extra | 347 | 302 | 120 |
| W/ extra | 447 | 389 | 139 |
“W/o extra” and “w/ extra” are with, respectively, 5000 and 25000 samples per log marginal likelihood term.
\(m = 1000\) families with:
\[ \begin{align*} \vec x_{ij} &= \left(1, Z_{ij}, \left(j - (n_i + 1) / 2\right) / v(n_i) \right)^\top \\ v(x) &= \sqrt{x (x + 1) \big/ 12} \\ Z_{ij}&\sim N(0, 1) \\ \vec\beta &= (-1, 0.3, 0.2)^\top \\ \vec\theta &= (0.5, 0.3)^\top \end{align*} \]
Pedigree data is randomly generated.
Only keep the youngest generation as Mahjani et al. (2020).
\(n_i \sim 13\) yielding \(13000\) observations.
| (Intercept) | X1 | X2 | Genetic | Maternal | |
|---|---|---|---|---|---|
| Bias w/o extra | 0.00834 | -0.00285 | 0.00168 | -0.0399 | 0.01267 |
| SE w/o extra | 0.00533 | 0.00217 | 0.00197 | 0.0232 | 0.00713 |
| Bias w/ extra | 0.00821 | -0.00283 | 0.00170 | -0.0396 | 0.01268 |
| SE w/ extra | 0.00535 | 0.00217 | 0.00198 | 0.0233 | 0.00713 |
“W/o extra” and “w/ extra” are with, respectively, 5000 and 25000 samples per log marginal likelihood term. SE is the MC standard error. The last two columns are the genetic effect parameter, \(\theta_1\), and the maternal effect parameter, \(\theta_2\).
Estimated values less true values when using more samples per log marginal likelihood term.
| Mean | Median | Sd | |
|---|---|---|---|
| W/o extra | 32.1 | 31.8 | 9.61 |
| W/ extra | 40.8 | 40.0 | 10.98 |
“W/o extra” and “w/ extra” are with, respectively, 5000 and 25000 samples per log marginal likelihood term.
\(n_i \sim 50\) yielding \(50000\) observations.
| (Intercept) | X1 | X2 | Genetic | Maternal | |
|---|---|---|---|---|---|
| Bias w/o extra | 0.00742 | -0.00158 | -0.00254 | -0.0387 | 0.01087 |
| SE w/o extra | 0.00288 | 0.00114 | 0.00101 | 0.0122 | 0.00392 |
| Bias w/ extra | 0.00665 | -0.00150 | -0.00257 | -0.0363 | 0.01057 |
| SE w/ extra | 0.00292 | 0.00113 | 0.00102 | 0.0123 | 0.00396 |
“W/o extra” and “w/ extra” are with, respectively, 5000 and 25000 samples per log marginal likelihood term. SE is the MC standard error. The last two columns are the genetic effect parameter, \(\theta_1\), and the maternal effect parameter, \(\theta_2\).
Estimated values less true values when using more samples per log marginal likelihood term.
| Mean | Median | Sd | |
|---|---|---|---|
| W/o extra | 265 | 253 | 92.7 |
| W/ extra | 615 | 581 | 208.5 |
“W/o extra” and “w/ extra” are with, respectively, 5000 and 25000 samples per log marginal likelihood term.
Make an efficient Gaussian variational approximation or Laplace approximation to get starting values.
Use stochastic gradient techniques.
\(Y_{ij}^* \in (0,\infty)\) with independent right censoring time \(S_{ij} \in (0,\infty)\).
Use a mixed generalized survival model:
\[ \begin{align*} P(Y_{ij}^* > y \mid \epsilon_{ij} = e) &= \Phi( -\vec g(y)^\top\vec\omega - \vec\beta^\top\vec x_{ij} - e) \\ f_{Y_{ij}^*\mid \epsilon_{ij}}(y, e) &= \phi( -\vec g(y)^\top\vec\omega - \vec\beta^\top\vec x_{ij} - e) \vec g'(y)^\top\vec\omega \\ \vec\epsilon_i &\sim N^{(n_i)}(\vec 0, \mat\Sigma_i(\vec\theta)) \end{align*} \]
A simple case is \(\vec g(y) = \log y\) in which case
\[ \begin{align*} \omega\log Y_{ij}^* &= -\vec\beta^\top\vec x_{ij} + \epsilon_{ij} \\ \vec\epsilon_i &\sim N^{(n_i)}(\vec 0, \mat I + \mat\Sigma_i(\vec\theta)) \end{align*} \]
The CDF dimension is equal to the number of censored individuals.
The identity can be used for multinomial and ordinal data with the probit link.
Approximating the CDF when the number of random effects per cluster is not much smaller than the number of observations in a cluster.
Use Gaussian copula for the joint distribution of mixed data types.
One can extend the approach to multinomial data.
Extended the estimation method suggested by Pawitan et al. (2004).
for large data sets with moderate to high dimensional integrals per cluster.
Discussed extensions to other types of outcomes.
The presentation is at rpubs.com/boennecd/MEB-RQMC-pedigree.
The markdown is at github.com/boennecd/Talks.
The R package is at github.com/boennecd/pedmod.
References are on the next slide.
Cranley, R., and T. N. L. Patterson. 1976. “Randomization of Number Theoretic Methods for Multiple Integration.” SIAM Journal on Numerical Analysis 13 (6). Society for Industrial; Applied Mathematics: 904–14. http://www.jstor.org/stable/2156452.
Genz, Alan. 1992. “Numerical Computation of Multivariate Normal Probabilities.” Journal of Computational and Graphical Statistics 1 (2). [American Statistical Association, Taylor & Francis, Ltd., Institute of Mathematical Statistics, Interface Foundation of America]: 141–49. http://www.jstor.org/stable/1390838.
Genz, Alan, and Frank Bretz. 2002. “Comparison of Methods for the Computation of Multivariate T Probabilities.” Journal of Computational and Graphical Statistics 11 (4). Taylor & Francis: 950–71. https://doi.org/10.1198/106186002394.
———. 2009. Computation of Multivariate Normal and T Probabilities. Vol. 195. Springer Science & Business Media.
Genz, Alan, Frank Bretz, Tetsuhisa Miwa, Xuefei Mi, Friedrich Leisch, Fabian Scheipl, and Torsten Hothorn. 2020. mvtnorm: Multivariate Normal and T Distributions. https://CRAN.R-project.org/package=mvtnorm.
Hajivassiliou, Vassilis, Daniel McFadden, and Paul Ruud. 1996. “Simulation of Multivariate Normal Rectangle Probabilities and Their Derivatives Theoretical and Computational Results.” Journal of Econometrics 72 (1): 85–134. https://doi.org/https://doi.org/10.1016/0304-4076(94)01716-6.
Keast, P. 1973. “Optimal Parameters for Multidimensional Integration.” SIAM Journal on Numerical Analysis 10 (5): 831–38. https://doi.org/10.1137/0710068.
Mahjani, Behrang, Lambertus Klei, Christina M. Hultman, Henrik Larsson, Bernie Devlin, Joseph D. Buxbaum, Sven Sandin, and Dorothy E. Grice. 2020. “Maternal Effects as Causes of Risk for Obsessive-Compulsive Disorder.” Biological Psychiatry 87 (12): 1045–51. https://doi.org/https://doi.org/10.1016/j.biopsych.2020.01.006.
Niederreiter, H. 1972. “On a Number-Theoretical Integration Method.” Aequationes Mathematicae 8 (3): 304–11. https://doi.org/10.1007/BF01844507.
Pawitan, Y., M. Reilly, E. Nilsson, S. Cnattingius, and P. Lichtenstein. 2004. “Estimation of Genetic and Environmental Factors for Binary Traits Using Family Data.” Statistics in Medicine 23 (3): 449–65. https://doi.org/https://doi.org/10.1002/sim.1603.
Wichura, Michael J. 1988. “Algorithm as 241: The Percentage Points of the Normal Distribution.” Journal of the Royal Statistical Society. Series C (Applied Statistics) 37 (3). [Wiley, Royal Statistical Society]: 477–84. http://www.jstor.org/stable/2347330.
Zhao, Yuxuan, and Madeleine Udell. 2019. “Missing Value Imputation for Mixed Data via Gaussian Copula.”