\[ \renewcommand\vec{\boldsymbol} \def\bigO#1{\mathcal{O}(#1)} \def\Cond#1#2{\left(#1\,\middle|\, #2\right)} \def\mat#1{\boldsymbol{#1}} \def\der{{\mathop{}\!\mathrm{d}}} \def\argmax{\text{arg}\,\text{max}} \def\Prob{\text{P}} \def\Expec{\text{E}} \def\logit{\text{logit}} \def\diag{\text{diag}} \]
Postdoc working for Keith Humphreys and Mark Clements from KI and Hedvig Kjellström from KTH.
Also done work with liability threshold models (the pedmod package) which may be of interest in the department.
Present work which I and Birzhan Akynkozhayev, a new PhD student of Mark Clements, are working on.
Quantify association between breast cancer and breast density and other variables.
Is the risk related to
Can be estimated with a joint model for the density, the marker, and the risk of breast cancer.
People with a worse health status may tend to have worse marker values and have more measurements.
Leads to at least two time-to-event outcomes of interest.
The population registers and research questions seldom leaves us with only one marker of interest.
Have a method and implementation that
We are not aware of any software that that satisfy all four criteria.
Introduce the marker and survival sub-models.
Highlight possible relations and assumptions.
The population mean curve with 95% pointwise probability intervals for the individuals’ mean curves.
The population mean curve with pointwise quantiles for the individuals’ mean curves and the curve for one individual with the observed values.
The population mean curve with pointwise quantiles for the individuals’ mean curves and with curves for three individuals and their observed values.
\[ \begin{align*} Y_{ij} &= \mu_i(s_{ij}, \vec U_i) + \epsilon_{ij} \\ \epsilon_{ij} &\sim N(0, \sigma^2) \\ \mu_{i}(s, \vec U_i) &= \tilde{\vec x}_i^\top\tilde{\vec\beta}^{(1)} + \vec g(s)^\top\tilde{\vec\beta}^{(2)} + \vec m(s)^\top\vec U_i \\ \vec U_i &\sim N^{(R)}(\vec0, \Psi) \end{align*} \]
\(\tilde{\vec x}_i^\top\tilde{\vec\beta}^{(1)}\): fixed effects.
\(\vec g(s)^\top\tilde{\vec\beta}^{(2)}\): the population mean curve.
\(\vec m(s)^\top\vec U_i\): difference for individual \(i\) to the population curve.
Let \(\vec x_i(s) = (\tilde{\vec x}_i^\top, \vec g(s)^\top)^\top\) such that
\[ \mu_{i}(s, \vec U_i) = \vec x_i(s)^\top\vec\beta + \vec m_i(s)^\top\vec U_i \]
and \(\vec m_i\) may depended on \(i\).
Allows for time-varying fixed and random effects.
Markers’ deviation from the population may be linked to higher or lower risk of the event for each individual.
Different ways the two can be linked.
A higher level at any given point may lead to higher/lower risk.
A stepper change at any given point may lead to higher/lower risk.
A long period of higher values may lead to higher/lower risk at any given point.
To simplify the notation, only the current value is shown.
\[ \begin{align*} h_i(t\mid \vec U_i, \xi_i) &= \exp\left( \tilde{\vec z}_i^\top\tilde{\vec\gamma}^{(1)} + \vec b(t)^\top\tilde{\vec\gamma}^{(2)} + \alpha \vec m_i(t)^\top\vec U_i + \xi_i \right) \\ \xi_i &\sim N(0, \Xi) \end{align*} \]
\(\tilde{\vec z}_i^\top\tilde{\vec\gamma}^{(1)}\): fixed effects.
\(\vec b(t)^\top\tilde{\vec\gamma}^{(2)}\): the log baseline hazard function.
Shared with the marker.
Can be excluded.
Let \(\vec z_i(t) = (\tilde{\vec z}_i^\top, \vec b(t)^\top)^\top\) such that
\[ \begin{align*} h_i(t\mid \vec U_i, \xi_i) &= \exp\left( \vec z_i(t)^\top\vec\gamma + \alpha \vec m_i(t)^\top\vec U_i + \xi_i \right) \\ \xi_i &\sim N(0, \Xi) \end{align*} \]
Allows for non-proportional effects.
We may have \(L\) different markers rather than one
\[ \begin{align*} \vec Y_{ij}&= \vec\mu_i(s_{ij}, \vec U_i) + \vec\epsilon_{ij} \\ \vec\epsilon_{ij} &\sim N^{(L)}(\vec 0, \Sigma) \\ \mu_{i1}(s, \vec U_i) &= \vec x_{i1}(s)^\top\vec\beta_1 + \vec m_{i1}(s)^\top\vec U_{i1} \\ \vdots &\hphantom{=}\vdots\\\ \mu_{iL}(s, \vec U_i) &= \vec x_{iL}(s)^\top\vec\beta_L + \vec m_{iL}(s)^\top\vec U_{iL} \\ \vec U_i &= \begin{pmatrix} \vec U_{i1} \\ \vdots \\ \vec U_{iL} \end{pmatrix}\sim N^{(R)}(\vec0, \Psi). \end{align*} \]
Measurement error.
\(\Psi\) makes markers of the same and different types dependent across time.
\[\begin{align*} X_i(s) &= \begin{pmatrix} \vec x_{i1}(s)^\top & 0^\top & \cdots & \vec 0^\top \\ \vec 0^\top & \vec x_{i2}(s)^\top & \ddots & \vdots \\ \vdots & \ddots & \ddots & \vec 0^\top \\ \vec 0^\top & \cdots & \vec 0^\top & \vec x_{iL}(s)^\top \end{pmatrix} \\ \vec\beta &=(\vec \beta_1^\top,\dots,\vec\beta_L^\top)^\top, \end{align*}\]
and define \(M_i(s)\) similarly. Then
\[\vec\mu_i(s_{ij}, \vec U_i) = X_i(s_{ij})\vec\beta + M_i(s_{ij})\vec U_i.\]
We may have \(E\) different survival types rather than one
\[ \begin{align*} h_{i1}(t\mid \vec U_i, \vec\xi_i) &= \exp\left( \vec z_{i1}(t)^\top\vec\gamma_1 + \vec\alpha_1^\top M_i(t)\vec U_i + \xi_{i1} \right) \\ \vdots &\hphantom{=}\vdots \\ h_{iE}(t\mid \vec U_i,\vec \xi_i) &= \exp\left( \vec z_{iE}(t)^\top\vec\gamma_E + \vec\alpha_E^\top M_i(t)\vec U_i + \xi_{iE} \right) \\ \vec\xi_i = \begin{pmatrix}\xi_{i1} \\ \vdots \\ \xi_{iE}\end{pmatrix} &\sim N^{(E)}(\vec 0, \Xi) \end{align*} \]
Motivate the variational approximation (VA) and show how to use it for fast estimation.
Brief highlight of the psqn package we have developed to quickly optimize the approximate likelihood.
Briefly discuss the package for the joint models.
Need an approximation.
Let \(\vec\zeta\) be the model parameters and \(l_i(\vec\zeta)\) be the log likelihood term of individual \(i\).
Let \(\vec O_i = (\vec U_i^\top, \vec\xi_i^\top)^\top\) be the concatenated random effects.
The concatenated vector of observed marker values, observations times, observed times, and events indicators.
Given a distribution with density \(q^{(i)}_{\vec\theta}\) for \(\vec\theta\in \Theta_i\), the variational approximation we use is the lower bound
\[ l_i(\vec\zeta) \geq \int q^{(i)}_{\vec\theta}(\vec o) \log \left(\frac{p_{i,\vec\zeta}(\vec y_i, \vec o)} {q^{(i)}_{\vec\theta}(\vec o)}\right) d \vec o = \tilde l_i(\vec\zeta, \vec\theta) \]
Typically, much faster to evaluate.
\[ \begin{align*} l_i(\vec\zeta) &= \log p_{i,\vec\zeta}(\vec y_i) = \int q^{(i)}_{\vec\theta}(\vec o) \log \left(\frac{p_{i,\vec\zeta}(\vec y_i, \vec o)\big/ q^{(i)}_{\vec\theta}(\vec o)} {p_{i,\vec\zeta}(\vec o\mid\vec y_i)\big/ q^{(i)}_{\vec\theta}(\vec o)}\right) d \vec o\\ &= \int q^{(i)}_{\vec\theta}(\vec o) \Bigg( \log \left(\frac{p_{i,\vec\zeta}(\vec y_i, \vec o)} {q^{(i)}_{\vec\theta}(\vec o)}\right) + \underbrace{\log \left(\frac{q^{(i)}_{\vec\theta}(\vec o)} {p_{i,\vec\zeta}(\vec o\mid\vec y_i)}\right)}_{\text{KL divergence}} \Bigg)d \vec o \\ &\geq \int q^{(i)}_{\vec\theta}(\vec o) \log \left(\frac{p_{i,\vec\zeta}(\vec y_i, \vec o)} {q^{(i)}_{\vec\theta}(\vec o)}\right) d \vec o = \tilde l_i(\vec\zeta, \vec\theta) \end{align*} \]
Equality if \(p_{i,\vec\zeta}(\vec o\mid\vec y_i) = q^{(i)}_{\vec\theta}(\vec o)\).
\[ \text{argmax}_{\vec\zeta} \sum_{i = 1}^n l_i(\vec\zeta) \approx \text{argmax}_{\vec\zeta} \max_{\vec\theta_1,\dots,\vec\theta_n} \sum_{i = 1}^n\tilde l_i(\vec\zeta, \vec\theta_i) \]
35070 parameters in one simulation study later.
as it is partially separable.
Can be combined to an approximation of the entire Hessian which is sparse.
with a block diagonal preconditioner in a Quasi-Newton algorithm.
The method is embarrassingly parallel in all steps and this is exploited. The user just needs a thread-safe implementation of the lower bound terms and their gradients.
Allows us to iterate very quickly with different models.
The lower bound requires evaluation of
All are tractable; Only have to do 1D quadrature for the approximate expected cumulative hazard. Very fast with Gauss–Legendre quadrature.
The conditional distribution may be well approximated by a normal distribution.
\[\hat l_i(\vec\zeta) = \tilde l_i(\vec\zeta, \hat{\vec\theta}_i(\vec\zeta))\qquad \hat{\vec\theta}_i(\vec\zeta) = \argmax_{\vec\theta} \tilde l_i(\vec\zeta, \vec\theta)\]
Then the Hessian is
\[ \nabla^2_{\vec\zeta\vec\zeta}\tilde l_i(\vec\zeta, \vec\theta) - \left. \nabla^2_{\vec\zeta\vec\theta}\tilde l_i(\vec\zeta, \vec\theta) \left(\nabla^2_{\vec\theta\vec\theta}\tilde l_i(\vec\zeta, \vec\theta)\right)^{-1} \nabla^2_{\vec\theta\vec\zeta}\tilde l_i(\vec\zeta, \vec\theta) \right|_{\vec\theta = \hat{\vec\theta}_i(\vec\zeta)}. \]
Can also construct “profile lower bound” curves.
Given \(\hat{\vec\zeta}\) and \(\hat{\vec\theta}_i(\hat{\vec\zeta})\) and some function \(v(\vec o;\hat{\vec\zeta})\), the right hand-side
\[ \int v(\vec o;\hat{\vec\zeta})p_{i,\hat{\vec\zeta}}(\vec o\mid\vec y_i)d\vec o \approx \int v(\vec o;\hat{\vec\zeta}) q^{(i)}_{\hat{\vec\theta}_i(\hat{\vec\zeta})}(\vec o) d\vec o \]
is almost always much easier to compute.
\(q^{(i)}_{\hat{\vec\theta}_i(\hat{\vec\zeta})}(\vec o) \approx p_{i,\hat{\vec\zeta}}(\vec o\mid\vec y_i)\) if the lower bound is tight.
Key for the use of VAs in a Bayesian framework.
using the psqn package.
Similar to the Adept library (Hogan 2014). A gradient implementation will reduce the computation time.
cumulative, current value, and derivatives in the hazards, mixtures thereof and more.
Comparison with the Monte Carlo based joineRML package (Hickey, Philipson, and Jorgensen 2018) and quadrature based JM package (Rizopoulos 2010).
Simulation study to check the bias of the VA and to compare the observed information matrix and the computation time.
Simulation study with multivariate markers with both a terminal event and an informative observation process.
joineRML has a non-parametric baseline hazard.
We used the same spline as in our package using ten knots based on the observed event times.
Four and seven quadrature nodes were used with JM.
Natural cubic spline for the marker intercept with four interior knots.
Therefore, three random effects per individual.
Uniform distribution for a covariate in the hazard, normally distributed baseline covariate for the marker, and we included the covariate from the marker in the hazard. No frailty.
Full code available online.
The population mean curve with 95% pointwise probability intervals for the individuals’ mean curves.
The lines are the 10%, 25%, 50%, 75% and 90% pointwise quantiles of the conditional hazards.
An average of 3.83 observed markers per individual (SD 2.98).
An average of 42.4% are censored.
500 individuals in each sample and 100 data sets.
| Method | \(\beta_{11}\) | \(\beta_{12}\) | \(\sigma^2_1\) | \(\gamma_{11}\) | \(\gamma_{12}\) | \(\alpha_{11}\) | |
|---|---|---|---|---|---|---|---|
| True values | 1.00000 | -0.50000 | 0.25000 | 0.40000 | 0.00000 | -1.00000 | |
| Bias | GVA | 0.00783 | -0.00438 | -0.00047 | -0.01112 | 0.00655 | -0.00013 |
| joineRML | 0.01828 | -0.00434 | -0.00061 | -0.01241 | 0.00614 | 0.00395 | |
| JM 4 | -0.00224 | -0.00410 | -0.00209 | -0.01187 | 0.00633 | 0.00229 | |
| JM 7 | 0.00436 | -0.00424 | -0.00043 | -0.01013 | 0.00637 | -0.00575 | |
| SE | GVA | 0.00840 | 0.00610 | 0.00105 | 0.01194 | 0.00814 | 0.00613 |
| joineRML | 0.00832 | 0.00606 | 0.00105 | 0.01195 | 0.00802 | 0.00614 | |
| JM 4 | 0.00845 | 0.00591 | 0.00105 | 0.01191 | 0.00790 | 0.00613 | |
| JM 7 | 0.00838 | 0.00607 | 0.00105 | 0.01198 | 0.00812 | 0.00627 |
SE are standard errors of the bias estimates. GVA is the results from our method. JM 4 and 7 are the results form the JM package with four and seven quadrature nodes.
| Method | \(\lVert\widehat{\mat\Psi} - \mat\Psi\rVert\) | Time | |
|---|---|---|---|
| Average | GVA | 2.16449 | 2.024 |
| joineRML | 2.19245 | 719.268 | |
| JM 4 | 2.18567 | 42.767 | |
| JM 7 | 2.17030 | 224.622 |
The computation time is in seconds. Four threads were used with our method but an average of 0.938 seconds was used on the observed information matrix. The latter can be greatly reduced.
| Method | Log likelihood | Relative covariance error | |
|---|---|---|---|
| Average | GVA | -3070.070 | 0.014307 |
| JM 4 | -3070.311 | 0.122795 | |
| JM 7 | -3069.541 | 0.000000 |
Take away:
Added another marker with flipped signs for the time-varying fixed effects.
The correlation of intercept part of the two random splines was set to 0.5.
Thus, seven random effects per individual.
An average of 4.8 observed markers per individual (SD 8.08).
An average of 45.5% are censored.
1000 individuals per sample and 1000 data sets.
35070 parameters in total.
| \(\alpha_{11}\) | \(\alpha_{12}\) | \(\alpha_{21}\) | \(\alpha_{22}\) | \(\xi_2^2\) | |
|---|---|---|---|---|---|
| True values | 0.5000 | 0.5000 | -0.5000 | -0.5000 | 0.2500 |
| Bias | -0.0003 | 0.0006 | 0.0010 | 0.0001 | -0.0126 |
| SE | 0.0012 | 0.0012 | 0.0005 | 0.0005 | 0.0009 |
| RMSE | 0.0385 | 0.0375 | 0.0162 | 0.0156 | 0.0315 |
| Coverage | 0.9400 | 0.9550 | 0.9410 | 0.9590 | 0.9170 |
The table shows bias estimates, standard errors of the bias estimates, root mean square errors and coverage of Wald type confidence intervals.
| \(\lVert\widehat{\mat\Psi} - \mat\Psi\rVert\) | Time | Coverage profile | |
|---|---|---|---|
| Average | 3.024 | 90.42 | 0.9390 |
| SE | 0.020 | 0.34 |
The computation time includes an average of 33.9 seconds spent to compute the observed information matrix. This can be reduced. The coverage column shows the coverage of a profile likelihood based confidence interval for \(\alpha_{11}\). The average computation time for the confidence interval was 151 seconds.
Introduced variational approximations.
Showed that they can yield fast estimates of joint survival and marker models.
Simulation studies suggest that the bias is low if present at all.
Breast cancer and markers including breast density.
Modelling longitudinal markers of kidney function for diabetes patients.
Delayed entry yields another integral that needs to be approximated.
Yields a maximin problem rather than a pure maximization problem.
The additional terms are easier to approximate. For instance, we do not need to marginalize out the random effects for the recurrent events.
The marker sub-models can be changed to a generalized linear mixed models (GLMMs).
At worst, we have to approximate many one dimensional integrals.
with less bias and six dimensional random effects. The example is at https://github.com/boennecd/psqn-va-ex#6d-random-effects
A reasonably tractable example is the multivariate skew-normal distribution (Ormerod 2011). Extension will give tighter lower bounds but may take longer to evaluate or be harder to optimize.
Some of the parameters can profiled out quite quickly. This may reduce the computation time further.
The presentation is at rpubs.com/boennecd/KU-VAJointSurv.
The markdown is at github.com/boennecd/Talks.
The implementation is at github.com/boennecd/VAJointSurv.
The psqn package is on CRAN and at github.com/boennecd/psqn.
References are on the next slide.