Joint models with multiple markers and multiple time-to-event outcomes

dummy slide

Introduction

\[ \renewcommand\vec{\boldsymbol} \def\bigO#1{\mathcal{O}(#1)} \def\Cond#1#2{\left(#1\,\middle|\, #2\right)} \def\mat#1{\boldsymbol{#1}} \def\der{{\mathop{}\!\mathrm{d}}} \def\argmax{\text{arg}\,\text{max}} \def\Prob{\text{P}} \def\Expec{\text{E}} \def\logit{\text{logit}} \def\diag{\text{diag}} \]

Research

Postdoc working for Keith Humphreys and Mark Clements from KI and Hedvig Kjellström from KTH.

Work on applying variational approximations in biostatistics.

Also done work with liability threshold models (the pedmod package) which may be of interest in the department.

Present work which I and Birzhan Akynkozhayev, a new PhD student of Mark Clements, are working on.

Example

Quantify association between breast cancer and breast density and other variables.

Is the risk related to

  • long periods of higher than normal density,
  • the current density only or
  • large changes over a short period of time?

Can be estimated with a joint model for the density, the marker, and the risk of breast cancer.

Informative Observation Process

Markers may be related to the likelihood of an observation and the risk of the terminal event of interest.

People with a worse health status may tend to have worse marker values and have more measurements.

Leads to at least two time-to-event outcomes of interest.

Multiple Markers

The population registers and research questions seldom leaves us with only one marker of interest.

Goal

Have a method and implementation that

  1. Supports multiple markers.
  2. Supports multiple survival outcomes (recurrent and competing events).
  3. Supports delayed entry.
  4. Is scalable and fast.

We are not aware of any software that that satisfy all four criteria.

Joint Models

Overview

Introduce the marker and survival sub-models.

Highlight possible relations and assumptions.

Example: Population Curve

The population mean curve with 95% pointwise probability intervals for the individuals’ mean curves.

Example: Population Curve

The population mean curve with pointwise quantiles for the individuals’ mean curves and the curve for one individual with the observed values.

Example: Population Curve

The population mean curve with pointwise quantiles for the individuals’ mean curves and with curves for three individuals and their observed values.

Marker Model

\[ \begin{align*} Y_{ij} &= \mu_i(s_{ij}, \vec U_i) + \epsilon_{ij} \\ \epsilon_{ij} &\sim N(0, \sigma^2) \\ \mu_{i}(s, \vec U_i) &= \tilde{\vec x}_i^\top\tilde{\vec\beta}^{(1)} + \vec g(s)^\top\tilde{\vec\beta}^{(2)} + \vec m(s)^\top\vec U_i \\ \vec U_i &\sim N^{(R)}(\vec0, \Psi) \end{align*} \]

\(\tilde{\vec x}_i^\top\tilde{\vec\beta}^{(1)}\): fixed effects.

\(\vec g(s)^\top\tilde{\vec\beta}^{(2)}\): the population mean curve.

\(\vec m(s)^\top\vec U_i\): difference for individual \(i\) to the population curve.

Marker Model

Let \(\vec x_i(s) = (\tilde{\vec x}_i^\top, \vec g(s)^\top)^\top\) such that

\[ \mu_{i}(s, \vec U_i) = \vec x_i(s)^\top\vec\beta + \vec m_i(s)^\top\vec U_i \]

and \(\vec m_i\) may depended on \(i\).

Allows for time-varying fixed and random effects.

Time-to-event Outcomes

Markers’ deviation from the population may be linked to higher or lower risk of the event for each individual.

Different ways the two can be linked.

Relation to Markers: Current Value

A higher level at any given point may lead to higher/lower risk.

Relation to Markers: Slope

A stepper change at any given point may lead to higher/lower risk.

Relation to Markers: Cumulative

A long period of higher values may lead to higher/lower risk at any given point.

Relation to Markers

All types are supported in our software including mixtures of them.

To simplify the notation, only the current value is shown.

Survival Model

\[ \begin{align*} h_i(t\mid \vec U_i, \xi_i) &= \exp\left( \tilde{\vec z}_i^\top\tilde{\vec\gamma}^{(1)} + \vec b(t)^\top\tilde{\vec\gamma}^{(2)} + \alpha \vec m_i(t)^\top\vec U_i + \xi_i \right) \\ \xi_i &\sim N(0, \Xi) \end{align*} \]

\(\tilde{\vec z}_i^\top\tilde{\vec\gamma}^{(1)}\): fixed effects.

\(\vec b(t)^\top\tilde{\vec\gamma}^{(2)}\): the log baseline hazard function.

\(\alpha\vec m_i(t)^\top\vec U_i\): individual specific deviation.

Shared with the marker.

\(\xi_i\): individual specific deviation.

Can be excluded.

Survival Model

Let \(\vec z_i(t) = (\tilde{\vec z}_i^\top, \vec b(t)^\top)^\top\) such that

\[ \begin{align*} h_i(t\mid \vec U_i, \xi_i) &= \exp\left( \vec z_i(t)^\top\vec\gamma + \alpha \vec m_i(t)^\top\vec U_i + \xi_i \right) \\ \xi_i &\sim N(0, \Xi) \end{align*} \]

Allows for non-proportional effects.

More Markers

We may have \(L\) different markers rather than one

\[ \begin{align*} \vec Y_{ij}&= \vec\mu_i(s_{ij}, \vec U_i) + \vec\epsilon_{ij} \\ \vec\epsilon_{ij} &\sim N^{(L)}(\vec 0, \Sigma) \\ \mu_{i1}(s, \vec U_i) &= \vec x_{i1}(s)^\top\vec\beta_1 + \vec m_{i1}(s)^\top\vec U_{i1} \\ \vdots &\hphantom{=}\vdots\\\ \mu_{iL}(s, \vec U_i) &= \vec x_{iL}(s)^\top\vec\beta_L + \vec m_{iL}(s)^\top\vec U_{iL} \\ \vec U_i &= \begin{pmatrix} \vec U_{i1} \\ \vdots \\ \vec U_{iL} \end{pmatrix}\sim N^{(R)}(\vec0, \Psi). \end{align*} \]

Assumptions

\(\Sigma\) makes markers of different types observed at the same time point dependent.

Measurement error.

\(\Psi\) makes markers of the same and different types dependent across time.

More Survival Types

\[\begin{align*} X_i(s) &= \begin{pmatrix} \vec x_{i1}(s)^\top & 0^\top & \cdots & \vec 0^\top \\ \vec 0^\top & \vec x_{i2}(s)^\top & \ddots & \vdots \\ \vdots & \ddots & \ddots & \vec 0^\top \\ \vec 0^\top & \cdots & \vec 0^\top & \vec x_{iL}(s)^\top \end{pmatrix} \\ \vec\beta &=(\vec \beta_1^\top,\dots,\vec\beta_L^\top)^\top, \end{align*}\]

and define \(M_i(s)\) similarly. Then

\[\vec\mu_i(s_{ij}, \vec U_i) = X_i(s_{ij})\vec\beta + M_i(s_{ij})\vec U_i.\]

More Survival Types

We may have \(E\) different survival types rather than one

\[ \begin{align*} h_{i1}(t\mid \vec U_i, \vec\xi_i) &= \exp\left( \vec z_{i1}(t)^\top\vec\gamma_1 + \vec\alpha_1^\top M_i(t)\vec U_i + \xi_{i1} \right) \\ \vdots &\hphantom{=}\vdots \\ h_{iE}(t\mid \vec U_i,\vec \xi_i) &= \exp\left( \vec z_{iE}(t)^\top\vec\gamma_E + \vec\alpha_E^\top M_i(t)\vec U_i + \xi_{iE} \right) \\ \vec\xi_i = \begin{pmatrix}\xi_{i1} \\ \vdots \\ \xi_{iE}\end{pmatrix} &\sim N^{(E)}(\vec 0, \Xi) \end{align*} \]

Likelihood and Approximations

Overview

Motivate the variational approximation (VA) and show how to use it for fast estimation.

Brief highlight of the psqn package we have developed to quickly optimize the approximate likelihood.

Briefly discuss the package for the joint models.

The Likelihood

The likelihood is intractable.

Need an approximation.

Let \(\vec\zeta\) be the model parameters and \(l_i(\vec\zeta)\) be the log likelihood term of individual \(i\).

Let \(\vec O_i = (\vec U_i^\top, \vec\xi_i^\top)^\top\) be the concatenated random effects.

Let \(\vec Y_i\) be the outcomes.

The concatenated vector of observed marker values, observations times, observed times, and events indicators.

Variational Approximations

Given a distribution with density \(q^{(i)}_{\vec\theta}\) for \(\vec\theta\in \Theta_i\), the variational approximation we use is the lower bound

\[ l_i(\vec\zeta) \geq \int q^{(i)}_{\vec\theta}(\vec o) \log \left(\frac{p_{i,\vec\zeta}(\vec y_i, \vec o)} {q^{(i)}_{\vec\theta}(\vec o)}\right) d \vec o = \tilde l_i(\vec\zeta, \vec\theta) \]

Typically, much faster to evaluate.

Variational Approximations (Cont.)

\[ \begin{align*} l_i(\vec\zeta) &= \log p_{i,\vec\zeta}(\vec y_i) = \int q^{(i)}_{\vec\theta}(\vec o) \log \left(\frac{p_{i,\vec\zeta}(\vec y_i, \vec o)\big/ q^{(i)}_{\vec\theta}(\vec o)} {p_{i,\vec\zeta}(\vec o\mid\vec y_i)\big/ q^{(i)}_{\vec\theta}(\vec o)}\right) d \vec o\\ &= \int q^{(i)}_{\vec\theta}(\vec o) \Bigg( \log \left(\frac{p_{i,\vec\zeta}(\vec y_i, \vec o)} {q^{(i)}_{\vec\theta}(\vec o)}\right) + \underbrace{\log \left(\frac{q^{(i)}_{\vec\theta}(\vec o)} {p_{i,\vec\zeta}(\vec o\mid\vec y_i)}\right)}_{\text{KL divergence}} \Bigg)d \vec o \\ &\geq \int q^{(i)}_{\vec\theta}(\vec o) \log \left(\frac{p_{i,\vec\zeta}(\vec y_i, \vec o)} {q^{(i)}_{\vec\theta}(\vec o)}\right) d \vec o = \tilde l_i(\vec\zeta, \vec\theta) \end{align*} \]

Equality if \(p_{i,\vec\zeta}(\vec o\mid\vec y_i) = q^{(i)}_{\vec\theta}(\vec o)\).

Approximate Maximum Likelihood

\[ \text{argmax}_{\vec\zeta} \sum_{i = 1}^n l_i(\vec\zeta) \approx \text{argmax}_{\vec\zeta} \max_{\vec\theta_1,\dots,\vec\theta_n} \sum_{i = 1}^n\tilde l_i(\vec\zeta, \vec\theta_i) \]

The latter problem has many parameters but the problem is partially separable.

35070 parameters in one simulation study later.

We can use our psqn package (Christoffersen 2021) to optimize the lower bound

as it is partially separable.

The psqn Package

A BFGS approximation is made of the Hessian of each lower bound terms, \(\tilde l_i(\vec\zeta, \vec\theta_i)\).

Can be combined to an approximation of the entire Hessian which is sparse.

The approximate Hessian is solved using conjugate gradient

with a block diagonal preconditioner in a Quasi-Newton algorithm.

The package is a C++ library (with an R interface).

The method is embarrassingly parallel in all steps and this is exploited. The user just needs a thread-safe implementation of the lower bound terms and their gradients.

Applicable to many VAs for clustered data.

Allows us to iterate very quickly with different models.

The Lower Bound

The lower bound requires evaluation of

  • the entropy: \(-E_{q^{(i)}_{\vec\theta}}(\log q^{(i)}_{\vec\theta}(\vec O))\).
  • the first two moments: \(E_{q^{(i)}_{\vec\theta}}(\vec O)\) and \(E_{q^{(i)}_{\vec\theta}}(\vec O\vec O^\top)\).
  • the moment generating function: \(E_{q^{(i)}_{\vec\theta}}(\exp(\vec t^\top\vec O))\).
We use a multivariate normal distribution.

All are tractable; Only have to do 1D quadrature for the approximate expected cumulative hazard. Very fast with Gauss–Legendre quadrature.

Motivating the Variational Distribution

The conditional density of the random effects is multivariate normal distributed without the survival outcomes.

The conditional distribution may be well approximated by a normal distribution.

The Observed Information Matrix

\[\hat l_i(\vec\zeta) = \tilde l_i(\vec\zeta, \hat{\vec\theta}_i(\vec\zeta))\qquad \hat{\vec\theta}_i(\vec\zeta) = \argmax_{\vec\theta} \tilde l_i(\vec\zeta, \vec\theta)\]

Then the Hessian is

\[ \nabla^2_{\vec\zeta\vec\zeta}\tilde l_i(\vec\zeta, \vec\theta) - \left. \nabla^2_{\vec\zeta\vec\theta}\tilde l_i(\vec\zeta, \vec\theta) \left(\nabla^2_{\vec\theta\vec\theta}\tilde l_i(\vec\zeta, \vec\theta)\right)^{-1} \nabla^2_{\vec\theta\vec\zeta}\tilde l_i(\vec\zeta, \vec\theta) \right|_{\vec\theta = \hat{\vec\theta}_i(\vec\zeta)}. \]

Computationally easy to compute.

Can also construct “profile lower bound” curves.

Marginal Measures

Given \(\hat{\vec\zeta}\) and \(\hat{\vec\theta}_i(\hat{\vec\zeta})\) and some function \(v(\vec o;\hat{\vec\zeta})\), the right hand-side

\[ \int v(\vec o;\hat{\vec\zeta})p_{i,\hat{\vec\zeta}}(\vec o\mid\vec y_i)d\vec o \approx \int v(\vec o;\hat{\vec\zeta}) q^{(i)}_{\hat{\vec\theta}_i(\hat{\vec\zeta})}(\vec o) d\vec o \]

is almost always much easier to compute.

\(q^{(i)}_{\hat{\vec\theta}_i(\hat{\vec\zeta})}(\vec o) \approx p_{i,\hat{\vec\zeta}}(\vec o\mid\vec y_i)\) if the lower bound is tight.

Key for the use of VAs in a Bayesian framework.

Implementation

Written in C++ with support for computation in parallel

using the psqn package.

Uses an extension of the reverse mode automatic differentiation implementation from Savine (2018).

Similar to the Adept library (Hogan 2014). A gradient implementation will reduce the computation time.

Supports left-truncation, partly observed markers, recurrent events, competing risk,

cumulative, current value, and derivatives in the hazards, mixtures thereof and more.

Simulation Studies

Overview

Simulation study with univariate marker and survival data.

Comparison with the Monte Carlo based joineRML package (Hickey, Philipson, and Jorgensen 2018) and quadrature based JM package (Rizopoulos 2010).

Simulation study to check the bias of the VA and to compare the observed information matrix and the computation time.

Simulation study with multivariate markers with both a terminal event and an informative observation process.

Comparison with joineRML & JM

joineRML has a non-parametric baseline hazard.

JM uses a parametric baseline hazard.

We used the same spline as in our package using ten knots based on the observed event times.

Four and seven quadrature nodes were used with JM.

Comparison with joineRML & JM

Natural cubic spline for the marker intercept with four interior knots.

Random intercept using a natural cubic spline with one interior knot.

Therefore, three random effects per individual.

Uniform distribution for a covariate in the hazard, normally distributed baseline covariate for the marker, and we included the covariate from the marker in the hazard. No frailty.

Full code available online.

Mean Curve

The population mean curve with 95% pointwise probability intervals for the individuals’ mean curves.

Pointwise Quantiles of the Hazard

The lines are the 10%, 25%, 50%, 75% and 90% pointwise quantiles of the conditional hazards.

Summary Statistics

An average of 3.83 observed markers per individual (SD 2.98).

An average of 42.4% are censored.

500 individuals in each sample and 100 data sets.

Results: Bias

Method \(\beta_{11}\) \(\beta_{12}\) \(\sigma^2_1\) \(\gamma_{11}\) \(\gamma_{12}\) \(\alpha_{11}\)
True values 1.00000 -0.50000 0.25000 0.40000 0.00000 -1.00000
Bias GVA 0.00783 -0.00438 -0.00047 -0.01112 0.00655 -0.00013
joineRML 0.01828 -0.00434 -0.00061 -0.01241 0.00614 0.00395
JM 4 -0.00224 -0.00410 -0.00209 -0.01187 0.00633 0.00229
JM 7 0.00436 -0.00424 -0.00043 -0.01013 0.00637 -0.00575
SE GVA 0.00840 0.00610 0.00105 0.01194 0.00814 0.00613
joineRML 0.00832 0.00606 0.00105 0.01195 0.00802 0.00614
JM 4 0.00845 0.00591 0.00105 0.01191 0.00790 0.00613
JM 7 0.00838 0.00607 0.00105 0.01198 0.00812 0.00627
Take away: No evidence of bias with any of the methods.

SE are standard errors of the bias estimates. GVA is the results from our method. JM 4 and 7 are the results form the JM package with four and seven quadrature nodes.

Results: Time & Covariance

Method \(\lVert\widehat{\mat\Psi} - \mat\Psi\rVert\) Time
Average GVA 2.16449 2.024
joineRML 2.19245 719.268
JM 4 2.18567 42.767
JM 7 2.17030 224.622
Take away: comparable error for the covariance matrix. One to two orders of magnitude faster estimation.

The computation time is in seconds. Four threads were used with our method but an average of 0.938 seconds was used on the observed information matrix. The latter can be greatly reduced.

Results: Likelihood & Hessian

Method Log likelihood Relative covariance error
Average GVA -3070.070 0.014307
JM 4 -3070.311 0.122795
JM 7 -3069.541 0.000000

Take away:

  1. Maximum lower bound was only slightly smaller than the maximum likelihood (and always smaller as expected).
  2. The observed information matrix was very similar.
  3. It was insufficient to use four quadrature nodes with the JM package.

Multivariate Simulation Study

Added another marker with flipped signs for the time-varying fixed effects.

The correlation of intercept part of the two random splines was set to 0.5.

The observation times depend on the random effects and an additional frailty.

Thus, seven random effects per individual.

Summary Statistics

An average of 4.8 observed markers per individual (SD 8.08).

An average of 45.5% are censored.

1000 individuals per sample and 1000 data sets.

35070 parameters in total.

Results: Bias & Coverage

\(\alpha_{11}\) \(\alpha_{12}\) \(\alpha_{21}\) \(\alpha_{22}\) \(\xi_2^2\)
True values 0.5000 0.5000 -0.5000 -0.5000 0.2500
Bias -0.0003 0.0006 0.0010 0.0001 -0.0126
SE 0.0012 0.0012 0.0005 0.0005 0.0009
RMSE 0.0385 0.0375 0.0162 0.0156 0.0315
Coverage 0.9400 0.9550 0.9410 0.9590 0.9170
Take away: No evidence of bias expect for the scale parameter of the log normal frailty.

The table shows bias estimates, standard errors of the bias estimates, root mean square errors and coverage of Wald type confidence intervals.

Results: Time and Covariance

\(\lVert\widehat{\mat\Psi} - \mat\Psi\rVert\) Time Coverage profile
Average 3.024 90.42 0.9390
SE 0.020 0.34
Take away: Computation time is very feasible.

The computation time includes an average of 33.9 seconds spent to compute the observed information matrix. This can be reduced. The coverage column shows the coverage of a profile likelihood based confidence interval for \(\alpha_{11}\). The average computation time for the confidence interval was 151 seconds.

Extensions and Conclusions

Conclusions

Introduced variational approximations.

Showed that they can yield fast estimates of joint survival and marker models.

Simulation studies suggest that the bias is low if present at all.

Applications

Breast cancer and markers including breast density.

Modelling longitudinal markers of kidney function for diabetes patients.

Delayed Entry

Delayed entry yields another integral that needs to be approximated.

A VA can be applied but we can only get an upper bound on the additional likelihood terms.

Yields a maximin problem rather than a pure maximization problem.

For now, the additional terms are approximated with adaptive Gaussian-Hermite quadrature.

The additional terms are easier to approximate. For instance, we do not need to marginalize out the random effects for the recurrent events.

Marker Sub-model

The marker sub-models can be changed to a generalized linear mixed models (GLMMs).

The new lower bounds terms are easy to approximate (Ormerod and Wand 2012).

At worst, we have to approximate many one dimensional integrals.

We have an example of a mixed logistic regression with more than 20 times faster estimation times compared to the Laplace approximation in lme4

with less bias and six dimensional random effects. The example is at https://github.com/boennecd/psqn-va-ex#6d-random-effects

Variational Approximations

Other distributions also have reasonably easy expressions of the quantities in the lower bound.

A reasonably tractable example is the multivariate skew-normal distribution (Ormerod 2011). Extension will give tighter lower bounds but may take longer to evaluate or be harder to optimize.

Some of the parameters can profiled out quite quickly. This may reduce the computation time further.

Thank You!

The presentation is at rpubs.com/boennecd/KU-VAJointSurv.

The markdown is at github.com/boennecd/Talks.

The implementation is at github.com/boennecd/VAJointSurv.

The psqn package is on CRAN and at github.com/boennecd/psqn.

References are on the next slide.

References

Christoffersen, Benjamin. 2021. Psqn: Partially Separable Quasi-Newton. https://CRAN.R-project.org/package=psqn.
Hickey, Graeme L., Pete Philipson, and Andrea Jorgensen. 2018. “joineRML: A Joint Model and Software Package for Time-to-Event and Multivariate Longitudinal Outcomes.” BMC Medical Research Methodology 18 (1). https://doi.org/10.1186/s12874-018-0502-1.
Hogan, Robin J. 2014. “Fast Reverse-Mode Automatic Differentiation Using Expression Templates in c++.” ACM Trans. Math. Softw. 40 (4). https://doi.org/10.1145/2560359.
Ormerod, J. T. 2011. “Skew-Normal Variational Approximations for Bayesian Inference.” Unpublished Article.
Ormerod, J. T., and M. P. Wand. 2012. “Gaussian Variational Approximate Inference for Generalized Linear Mixed Models.” Journal of Computational and Graphical Statistics 21 (1): 2–17. http://www.jstor.org/stable/23248820.
Rizopoulos, Dimitris. 2010. JM: An R Package for the Joint Modelling of Longitudinal and Time-to-Event Data.” Journal of Statistical Software 35 (9): 1–33. http://www.jstatsoft.org/v35/i09/.
Savine, Antoine. 2018. Modern Computational Finance: AAD and Parallel Simulations. John Wiley & Sons.