\[ \renewcommand\vec{\boldsymbol} \def\bigO#1{\mathcal{O}(#1)} \def\Cond#1#2{\left(#1\,\middle|\, #2\right)} \def\mat#1{\boldsymbol{#1}} \def\der{{\mathop{}\!\mathrm{d}}} \def\argmax{\text{arg}\,\text{max}} \def\Prob{\text{P}} \def\Expec{\text{E}} \def\logit{\text{logit}} \def\diag{\text{diag}} \]
Variational approximations will be introduced later.
This is joint work with Birzhan Akynkozhayev, Mark Clements, Keith Humphreys, and Hedvig Kjellström.
Will call these markers.
Want to account for this.
E.g. default, merger, employment, etc.
The method can be seen as a form of imputation.
Quantify the association of breast cancer with breast density and other variables.
People with a worse health status may tend to have worse marker values and have more measurements.
The terminal event (say death) and the observation process.
The population registers and research questions seldom leave us with only one marker of interest.
E.g. merger, liquidation and default.
Have a method and implementation that
We are not aware of any software that satisfies all four criteria.
Will have sub-models for each.
Introduce the marker and time-to-event sub-models.
Highlight possible relations and assumptions.
The population mean curve with 95% pointwise probability intervals for the individuals’ mean curves.
The population mean curve with pointwise quantiles for the individuals’ mean curves and the curve for one individual with the observed values.
The population mean curve with pointwise quantiles for the individuals’ mean curves and with curves for three individuals and their observed values.
\[ \begin{align*} Y_{ij} &= \mu_i(s_{ij}, \vec U_i) + \epsilon_{ij} \\ \epsilon_{ij} &\sim N(0, \sigma^2) \\ \mu_{i}(s, \vec U_i) &= \tilde{\vec x}_i^\top\tilde{\vec\beta}^{(1)} + \vec g(s)^\top\tilde{\vec\beta}^{(2)} + \vec m(s)^\top\vec U_i \\ \vec U_i &\sim N^{(R)}(\vec0, \Psi) \end{align*} \]
\(\tilde{\vec x}_i^\top\tilde{\vec\beta}^{(1)}\): fixed effects.
\(\vec g(s)^\top\tilde{\vec\beta}^{(2)}\): the population mean curve.
\(\vec m(s)^\top\vec U_i\): difference for individual \(i\) to the population curve.
Let \(\vec x_i(s) = (\tilde{\vec x}_i^\top, \vec g(s)^\top)^\top\) such that
\[ \mu_{i}(s, \vec U_i) = \vec x_i(s)^\top\vec\beta + \vec m_i(s)^\top\vec U_i \]
and \(\vec m_i\) may depend on \(i\).
Time and covariate interactions.
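A minimal sketch of the marker sub-model (not the package API; all bases and parameter values below are hypothetical): simulate one individual's trajectory with a natural cubic spline for \(\vec g(s)\) and a random intercept and slope for \(\vec m(s)\).

```r
# Minimal sketch (hypothetical bases and parameter values): simulate one
# individual's marker trajectory under the sub-model above.
library(splines)
set.seed(1)

s    <- sort(runif(8, 0, 2))                 # observation times s_ij
g    <- ns(s, knots = c(.5, 1, 1.5),         # basis g(s) for the population mean curve
           Boundary.knots = c(0, 2))
beta <- c(1, -1, .5, .25, .1)                # intercept (fixed effect) and spline coefficients
Psi  <- matrix(c(.36, .1, .1, .16), 2)       # covariance matrix of U_i
U    <- drop(t(chol(Psi)) %*% rnorm(2))      # random effects U_i ~ N(0, Psi)
m    <- cbind(1, s)                          # random effect basis m(s): intercept and slope
mu   <- drop(cbind(1, g) %*% beta + m %*% U) # mu_i(s, U_i)
y    <- mu + rnorm(length(s), sd = .5)       # add the measurement error with sigma = .5
cbind(time = s, mean = mu, observed = y)
```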
and the cause specific hazard for the competing events.
where \(T_i\) is the event time.
Model the conditional rate for the recurrent events
\[ h_i(t\mid\vec U_i) = \lim_{d \rightarrow 0} \frac{P(N_i(t + d) - N_i(t) > 0 \mid \vec U_i)}{d} \] where \(N_i(t)\) is the event count at time \(t\).
Markers’ deviation from the population may be linked to higher or lower risk of the event for each individual.
Different ways the two can be linked.
A higher level at any given point may lead to higher/lower risk.
A steeper change at any given point may lead to higher/lower risk.
A long period of higher values may lead to higher/lower risk at any given point.
To simplify the notation, only the current value is shown.
\[ \begin{align*} h_i(t\mid \vec U_i, \xi_i) &= \exp\left( \tilde{\vec z}_i^\top\tilde{\vec\gamma}^{(1)} + \vec b(t)^\top\tilde{\vec\gamma}^{(2)} + \alpha \vec m_i(t)^\top\vec U_i + \xi_i \right) \\ \xi_i &\sim N(0, \Xi) \end{align*} \]
\(\tilde{\vec z}_i^\top\tilde{\vec\gamma}^{(1)}\): fixed effects.
\(\vec b(t)^\top\tilde{\vec\gamma}^{(2)}\): the log baseline hazard function.
Shared with the marker.
Can be excluded.
Let \(\vec z_i(t) = (\tilde{\vec z}_i^\top, \vec b(t)^\top)^\top\) such that
\[ \begin{align*} h_i(t\mid \vec U_i, \xi_i) &= \exp\left( \vec z_i(t)^\top\vec\gamma + \alpha \vec m_i(t)^\top\vec U_i + \xi_i \right) \\ \xi_i &\sim N(0, \Xi) \end{align*} \]
Time and covariate interactions.
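A minimal sketch of the survival sub-model with hypothetical bases and parameter values (the fixed effects \(\tilde{\vec z}_i^\top\tilde{\vec\gamma}^{(1)}\) are omitted): evaluate the conditional hazard and the implied conditional survival function by numerically integrating the cumulative hazard.

```r
# Minimal sketch (hypothetical bases and parameter values): the conditional
# hazard and survival function given one draw of the random effects.
b     <- function(t) cbind(1, log(t))        # log baseline hazard basis b(t)
m     <- function(t) cbind(1, t)             # same random effect basis as the marker
gamma <- c(-2, .5)                           # log baseline hazard coefficients
alpha <- .8                                  # association parameter
U     <- c(.3, -.2)                          # one draw of the random effects U_i
xi    <- .1                                  # frailty term

haz  <- function(t) exp(drop(b(t) %*% gamma + alpha * (m(t) %*% U) + xi))
surv <- function(t) exp(-integrate(haz, 0, t)$value)
sapply(c(.5, 1, 2), surv)                    # conditional survival probabilities
```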
We may have \(L\) different markers rather than one
\[ \begin{align*} \vec Y_{ij}&= \vec\mu_i(s_{ij}, \vec U_i) + \vec\epsilon_{ij} \\ \vec\epsilon_{ij} &\sim N^{(L)}(\vec 0, \Sigma) \\ \mu_{i1}(s, \vec U_i) &= \vec x_{i1}(s)^\top\vec\beta_1 + \vec m_{i1}(s)^\top\vec U_{i1} \\ \vdots &\hphantom{=}\vdots\\\ \mu_{iL}(s, \vec U_i) &= \vec x_{iL}(s)^\top\vec\beta_L + \vec m_{iL}(s)^\top\vec U_{iL} \\ \vec U_i &= \begin{pmatrix} \vec U_{i1} \\ \vdots \\ \vec U_{iL} \end{pmatrix}\sim N^{(R)}(\vec0, \Psi). \end{align*} \]
Measurement error.
\(\Psi\) makes markers of the same and different types dependent across time.
\[\begin{align*} X_i(s) &= \begin{pmatrix} \vec x_{i1}(s)^\top & \vec 0^\top & \cdots & \vec 0^\top \\ \vec 0^\top & \vec x_{i2}(s)^\top & \ddots & \vdots \\ \vdots & \ddots & \ddots & \vec 0^\top \\ \vec 0^\top & \cdots & \vec 0^\top & \vec x_{iL}(s)^\top \end{pmatrix} \\ \vec\beta &=(\vec \beta_1^\top,\dots,\vec\beta_L^\top)^\top, \end{align*}\]
and define \(M_i(s)\) similarly. Then
\[\vec\mu_i(s_{ij}, \vec U_i) = X_i(s_{ij})\vec\beta + M_i(s_{ij})\vec U_i.\]
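A minimal sketch of the stacking with two hypothetical markers; \(M_i(s)\) is built in the same way.

```r
# Minimal sketch (hypothetical bases): build the block diagonal X_i(s) at a
# single time point s for L = 2 markers.
library(Matrix)
s    <- 1.5
x_i1 <- c(1, s, s^2)                     # basis for marker 1: intercept, linear, quadratic
x_i2 <- c(1, sqrt(s))                    # basis for marker 2
X_i  <- bdiag(rbind(x_i1), rbind(x_i2))  # the 2 x 5 block diagonal X_i(s)
beta <- c(1, -.5, .1, .8, .2)            # (beta_1^T, beta_2^T)^T
as.numeric(X_i %*% beta)                 # the fixed effect part of the two markers' means
```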
We may have \(E\) different time-to-event types rather than one
\[ \begin{align*} h_{i1}(t\mid \vec U_i, \vec\xi_i) &= \exp\left( \vec z_{i1}(t)^\top\vec\gamma_1 + \vec\alpha_1^\top M_i(t)\vec U_i + \xi_{i1} \right) \\ \vdots &\hphantom{=}\vdots \\ h_{iE}(t\mid \vec U_i,\vec \xi_i) &= \exp\left( \vec z_{iE}(t)^\top\vec\gamma_E + \vec\alpha_E^\top M_i(t)\vec U_i + \xi_{iE} \right) \\ \vec\xi_i = \begin{pmatrix}\xi_{i1} \\ \vdots \\ \xi_{iE}\end{pmatrix} &\sim N^{(E)}(\vec 0, \Xi) \end{align*} \]
Motivate the variational approximation (VA) and show how to use it for fast estimation.
Brief highlight of the psqn package we have developed to quickly optimize the approximate likelihood.
Briefly discuss the package for the joint models.
Let \(\vec\zeta\) be the model parameters and \(l_i(\vec\zeta)\) be the log likelihood term of individual \(i\).
Let \(\vec O_i = (\vec U_i^\top, \vec\xi_i^\top)^\top\) be the concatenated random effects.
Let \(\vec y_i\) be the concatenated vector of observed marker values, the observation times (when the observation process is modelled), the observed event times, and the event indicators. Then
\[ l_i(\vec\zeta) = \log \int p_{i,\vec\zeta}(\vec y_i, \vec o) d \vec o \]
where \(p_{i,\vec\zeta}(\vec y_i, \vec o)\) is the joint density of the outcomes and the random effects evaluated at \(\vec y_i\) and \(\vec o\).
Need an approximation.
Given a distribution with density \(q^{(i)}_{\vec\theta}\) for \(\vec\theta\in \Theta_i\), the variational approximation we use is the lower bound
\[ l_i(\vec\zeta) \geq \int q^{(i)}_{\vec\theta}(\vec o) \log \left(\frac{p_{i,\vec\zeta}(\vec y_i, \vec o)} {q^{(i)}_{\vec\theta}(\vec o)}\right) d \vec o = \tilde l_i(\vec\zeta, \vec\theta) \]
Typically, much faster to evaluate.
\[ \begin{align*} l_i(\vec\zeta) &= \int q^{(i)}_{\vec\theta}(\vec o) \log \left(\frac{p_{i,\vec\zeta}(\vec y_i, \vec o)\big/ q^{(i)}_{\vec\theta}(\vec o)} {p_{i,\vec\zeta}(\vec o\mid\vec y_i)\big/ q^{(i)}_{\vec\theta}(\vec o)}\right) d \vec o\\ &= \int q^{(i)}_{\vec\theta}(\vec o) \Bigg( \log \left(\frac{p_{i,\vec\zeta}(\vec y_i, \vec o)} {q^{(i)}_{\vec\theta}(\vec o)}\right) + \underbrace{\log \left(\frac{q^{(i)}_{\vec\theta}(\vec o)} {p_{i,\vec\zeta}(\vec o\mid\vec y_i)}\right)}_{\text{KL divergence}} \Bigg)d \vec o \\ &\geq \int q^{(i)}_{\vec\theta}(\vec o) \log \left(\frac{p_{i,\vec\zeta}(\vec y_i, \vec o)} {q^{(i)}_{\vec\theta}(\vec o)}\right) d \vec o = \tilde l_i(\vec\zeta, \vec\theta) \end{align*} \]
Equality if \(p_{i,\vec\zeta}(\vec o\mid\vec y_i) = q^{(i)}_{\vec\theta}(\vec o)\).
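A minimal sketch of the lower bound in a toy case where it is exact: a Gaussian random intercept model \(y_j = \mu + u + \epsilon_j\) with \(u \sim N(0, \tau^2)\), \(\epsilon_j\sim N(0,\sigma^2)\), and a Gaussian \(q\) with mean \(m\) and variance \(v\). The conditional distribution of \(u\) is Gaussian, so the maximized lower bound matches the exact marginal log likelihood (all values below are hypothetical).

```r
# Minimal sketch: the lower bound for a Gaussian random intercept model
# with a Gaussian q equals the exact marginal log likelihood at the optimum.
set.seed(2)
y <- rnorm(5, mean = 1 + .5, sd = .4)    # five observations from one individual
mu <- 1; sigma2 <- .16; tau2 <- .25; k <- length(y)

elbo <- function(m, v)
  sum(dnorm(y, mu + m, sqrt(sigma2), log = TRUE)) - k * v / (2 * sigma2) +
    dnorm(m, 0, sqrt(tau2), log = TRUE) - v / (2 * tau2) +
    (log(2 * pi * v) + 1) / 2            # the entropy of q

v_opt <- 1 / (k / sigma2 + 1 / tau2)     # the conditional (posterior) variance
m_opt <- v_opt * sum(y - mu) / sigma2    # the conditional (posterior) mean

library(mvtnorm)                         # for the exact marginal log likelihood
exact <- dmvnorm(y, rep(mu, k), diag(sigma2, k) + tau2, log = TRUE)
c(lower_bound = elbo(m_opt, v_opt), exact = exact)
```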
\[ \text{argmax}_{\vec\zeta} \sum_{i = 1}^n l_i(\vec\zeta) \approx \text{argmax}_{\vec\zeta} \max_{\vec\theta_1,\dots,\vec\theta_n} \sum_{i = 1}^n\tilde l_i(\vec\zeta, \vec\theta_i) \]
There are 35070 parameters in one of the simulation studies shown later.
The approximate log likelihood can be optimized quickly as it is partially separable.
The Hessians of the individual terms can be combined into an approximation of the entire Hessian, which is sparse.
This is used with a block diagonal preconditioner in a quasi-Newton algorithm.
The method is embarrassingly parallel in all steps and this is exploited.
Allows us to iterate very quickly with different models.
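As an illustration of partial separability (a toy example, not the psqn API): each term couples the global parameters \(\vec\zeta\) only with its own \(\vec\theta_i\), never \(\vec\theta_i\) with \(\vec\theta_j\), so the Hessian has a bordered block diagonal ("arrow") sparsity pattern.

```r
# Minimal sketch: the sparsity pattern of the Hessian of a partially
# separable function sum_i f_i(zeta, theta_i) with two global parameters
# and three private parameters for each of four individuals.
library(numDeriv)
n <- 4; p_global <- 2; p_private <- 3
obj <- function(x) {
  zeta <- x[1:p_global]                            # global parameters
  sum(sapply(1:n, function(i) {
    theta_i <- x[p_global + (i - 1) * p_private + 1:p_private]
    sum((theta_i - zeta[1] - zeta[2] * i)^2) + prod(theta_i)  # i'th term
  }))
}
H <- hessian(obj, rep(.5, p_global + n * p_private))
print(ifelse(abs(H) > 1e-5, "x", "."), quote = FALSE)  # the "arrow" pattern
```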
The lower bound requires evaluation of a number of expectations under \(q^{(i)}_{\vec\theta}\).
All are tractable; we only have to do 1D numerical integration, and this is fast.
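For instance, with a Gaussian \(q\) the expectation of the conditional hazard is available in closed form at every time point, so the expected cumulative hazard only needs a 1D integral over time. A minimal sketch with hypothetical bases and parameter values:

```r
# Minimal sketch (hypothetical bases and parameter values): the expected
# cumulative hazard under a Gaussian q with mean mq and covariance Sq for
# (U_i, xi_i) only needs 1D numerical integration.
mq <- c(.2, -.1, 0); Sq <- diag(c(.3, .2, .25))    # variational parameters
gamma <- c(-2, .5); alpha <- .8
a <- function(t) cbind(alpha, alpha * t, 1)        # coefficients of (U_i, xi_i) in the log hazard

Eq_haz <- function(t) {                            # E_q[h_i(t | U_i, xi_i)], closed form
  A <- a(t)
  exp(cbind(1, log(t)) %*% gamma + A %*% mq + rowSums((A %*% Sq) * A) / 2)
}
integrate(function(t) drop(Eq_haz(t)), 0, 2)$value # expected cumulative hazard on (0, 2]
```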
The conditional distribution may be well approximated by a normal distribution.
\[\hat l_i(\vec\zeta) = \tilde l_i(\vec\zeta, \hat{\vec\theta}_i(\vec\zeta))\qquad \hat{\vec\theta}_i(\vec\zeta) = \argmax_{\vec\theta} \tilde l_i(\vec\zeta, \vec\theta)\]
Then the Hessian of \(\hat l_i(\vec\zeta)\) is
\[ \nabla^2_{\vec\zeta\vec\zeta}\tilde l_i(\vec\zeta, \vec\theta) - \left. \nabla^2_{\vec\zeta\vec\theta}\tilde l_i(\vec\zeta, \vec\theta) \left(\nabla^2_{\vec\theta\vec\theta}\tilde l_i(\vec\zeta, \vec\theta)\right)^{-1} \nabla^2_{\vec\theta\vec\zeta}\tilde l_i(\vec\zeta, \vec\theta) \right|_{\vec\theta = \hat{\vec\theta}_i(\vec\zeta)}. \]
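The formula follows from the first order condition \(\nabla_{\vec\theta}\tilde l_i(\vec\zeta, \hat{\vec\theta}_i(\vec\zeta)) = \vec 0\). Differentiating this identity with respect to \(\vec\zeta\) gives

\[ \frac{\der\hat{\vec\theta}_i(\vec\zeta)}{\der\vec\zeta^\top} = -\left(\nabla^2_{\vec\theta\vec\theta}\tilde l_i(\vec\zeta, \vec\theta)\right)^{-1} \nabla^2_{\vec\theta\vec\zeta}\tilde l_i(\vec\zeta, \vec\theta) \bigg|_{\vec\theta = \hat{\vec\theta}_i(\vec\zeta)} \]

and applying the chain rule to \(\nabla_{\vec\zeta}\hat l_i(\vec\zeta) = \nabla_{\vec\zeta}\tilde l_i(\vec\zeta, \hat{\vec\theta}_i(\vec\zeta))\) yields the expression above.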
Can also construct “profile lower bound” curves.
Given \(\hat{\vec\zeta}\), \(\hat{\vec\theta}_i(\hat{\vec\zeta})\), and some function \(v(\vec o;\hat{\vec\zeta})\), the right-hand side of
\[ E_{p_{i,\hat{\vec\zeta}}}\left(v(\vec O_i;\hat{\vec\zeta})\mid \vec Y_i = \vec y_i\right) \approx E_{q^{(i)}_{\hat{\vec\theta}_i(\hat{\vec\zeta})}}\left(v(\vec O_i;\hat{\vec\zeta})\right) \]
is almost always much easier to compute.
\(p_{i,\hat{\vec\zeta}}(\vec o\mid\vec y_i) \approx q^{(i)}_{\hat{\vec\theta}_i(\hat{\vec\zeta})}(\vec o)\) if the lower bound is tight.
Key for the use of VAs in a Bayesian framework.
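As an example, with \(v(\vec o;\hat{\vec\zeta}) = \mu_i(s, \vec u)\) the right-hand side gives a predicted mean marker curve for individual \(i\); since \(\mu_i\) is linear in \(\vec U_i\), only the mean of \(q^{(i)}_{\hat{\vec\theta}_i(\hat{\vec\zeta})}\) is needed. A minimal sketch with hypothetical bases and values:

```r
# Minimal sketch (hypothetical bases and values): predict individual i's
# mean marker curve by plugging in the mean of q for U_i.
library(splines)
mq   <- c(.4, -.2)                                  # mean of q for U_i
s    <- seq(.1, 2, length.out = 50)                 # prediction grid
g    <- ns(s, knots = c(.5, 1, 1.5), Boundary.knots = c(0, 2))
beta <- c(1, -1, .5, .25, .1)
pred <- drop(cbind(1, g) %*% beta + cbind(1, s) %*% mq)
head(cbind(time = s, predicted_mean = pred))
```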
using the psqn package.
Similar to the Adept library (Hogan 2014). A gradient implementation will reduce the computation time.
cumulative values, current values, and derivatives in the hazards, mixtures thereof, and more.
Comparison with the Monte Carlo based joineRML package (Hickey, Philipson, and Jorgensen 2018) and quadrature based JM package (Rizopoulos 2010).
Check the bias of the VA and compare the observed information matrix and the computation time.
Simulation study with multivariate markers with both a terminal event and an informative observation process.
joineRML has a non-parametric baseline hazard.
We used the same spline as in our package, with ten knots based on the observed event times.
Four and seven quadrature nodes were used with JM.
Natural cubic spline for the marker intercept with four interior knots.
Therefore, three random effects per individual.
Uniform distribution for a covariate in the hazard, normally distributed baseline covariate for the marker, and we included the covariate from the marker in the hazard. No frailty.
Full code available online.
The population mean curve with 95% pointwise probability intervals for the individuals’ mean curves.
The lines are the 10%, 25%, 50%, 75% and 90% pointwise quantiles of the conditional hazards.
An average of 3.83 observed markers per individual (SD 2.98).
An average of 42.4% are censored.
500 individuals in each sample and 100 data sets.
| | Method | \(\beta_{11}\) | \(\beta_{12}\) | \(\sigma^2_1\) | \(\gamma_{11}\) | \(\gamma_{12}\) | \(\alpha_{11}\) |
|---|---|---|---|---|---|---|---|
| True values | | 1.00000 | -0.50000 | 0.25000 | 0.40000 | 0.00000 | -1.00000 |
| Bias | GVA | 0.00783 | -0.00438 | -0.00047 | -0.01112 | 0.00655 | -0.00013 |
| | joineRML | 0.01828 | -0.00434 | -0.00061 | -0.01241 | 0.00614 | 0.00395 |
| | JM 4 | -0.00224 | -0.00410 | -0.00209 | -0.01187 | 0.00633 | 0.00229 |
| | JM 7 | 0.00436 | -0.00424 | -0.00043 | -0.01013 | 0.00637 | -0.00575 |
| SE | GVA | 0.00840 | 0.00610 | 0.00105 | 0.01194 | 0.00814 | 0.00613 |
| | joineRML | 0.00832 | 0.00606 | 0.00105 | 0.01195 | 0.00802 | 0.00614 |
| | JM 4 | 0.00845 | 0.00591 | 0.00105 | 0.01191 | 0.00790 | 0.00613 |
| | JM 7 | 0.00838 | 0.00607 | 0.00105 | 0.01198 | 0.00812 | 0.00627 |
SE are the standard errors of the bias estimates. GVA denotes the results from our method. JM 4 and JM 7 are the results from the JM package with four and seven quadrature nodes, respectively.
| | Method | \(\lVert\widehat{\mat\Psi} - \mat\Psi\rVert\) | Time |
|---|---|---|---|
| Average | GVA | 2.16449 | 2.024 |
| | joineRML | 2.19245 | 719.268 |
| | JM 4 | 2.18567 | 42.767 |
| | JM 7 | 2.17030 | 224.622 |
The computation time is in seconds. Four threads were used with our method, and an average of 0.938 seconds of the time was spent on the observed information matrix. The latter can be greatly reduced.
| | Method | Log likelihood | Relative covariance error |
|---|---|---|---|
| Average | GVA | -3070.070 | 0.014307 |
| | JM 4 | -3070.311 | 0.122795 |
| | JM 7 | -3069.541 | 0.000000 |
Take away:
Added another marker with flipped signs for the time-varying fixed effects.
The correlation between the intercept parts of the two random splines was set to 0.5.
Thus, seven random effects per individual.
An average of 4.8 observed markers per individual (SD 8.08).
An average of 45.5% are censored.
1000 individuals per sample and 1000 data sets.
35070 parameters in total.
| | \(\alpha_{11}\) | \(\alpha_{12}\) | \(\alpha_{21}\) | \(\alpha_{22}\) | \(\xi_2^2\) |
|---|---|---|---|---|---|
| True values | 0.5000 | 0.5000 | -0.5000 | -0.5000 | 0.2500 |
| Bias | -0.0003 | 0.0006 | 0.0010 | 0.0001 | -0.0126 |
| SE | 0.0012 | 0.0012 | 0.0005 | 0.0005 | 0.0009 |
| RMSE | 0.0385 | 0.0375 | 0.0162 | 0.0156 | 0.0315 |
| Coverage | 0.9400 | 0.9550 | 0.9410 | 0.9590 | 0.9170 |
The table shows bias estimates, standard errors of the bias estimates, root mean square errors and coverage of Wald type confidence intervals.
| | \(\lVert\widehat{\mat\Psi} - \mat\Psi\rVert\) | Time | Coverage profile |
|---|---|---|---|
| Average | 3.024 | 90.42 | 0.9390 |
| SE | 0.020 | 0.34 | |
The computation time is in seconds and includes an average of 33.9 seconds spent computing the observed information matrix. This can be reduced. The "Coverage profile" column shows the coverage of a profile likelihood based confidence interval for \(\alpha_{11}\). The average computation time for the confidence interval was 151 seconds.
Interested in the relation between longitudinally observed dense and fatty tissue areas in the breast and breast cancer onset.
60081 women with 188037 longitudinal measurements.
BMI is included in both the marker and survival sub-models. There is delayed entry due to the cancer free condition.
The models without and with random slopes yield two and four random effects per woman, respectively.
Thus, there are 300446 and 841182 parameters.
| | Model | \(\alpha_{11}\) | \(\alpha_{12}\) | \(\psi_1^2\) | \(\psi_2^2\) | \(\psi_3^2\) | \(\psi_4^2\) | LB | Time |
|---|---|---|---|---|---|---|---|---|---|
| Estimate | W/o | 0.132 | 0.0305 | 3.78 | 3.80 | | | -587142.0 | 1.17 |
| | W/ | 0.144 | 0.0266 | 3.68 | 0.00233 | 3.42 | 0.00228 | -586180.0 | 38.49 |
| SE | W/o | 0.013 | 0.0160 | 0.02 | 0.02 | | | | |
| | W/ | 0.013 | 0.0171 | 0.05 | 0.00016 | 0.05 | 0.00017 | | |
The "W/o" and "W/" rows are the results without and with the random slopes, respectively. \(\alpha_{11}\) and \(\alpha_{12}\) are the current value associations with the dense and the fatty tissue, respectively. The \(\psi^2_i\)s are the estimated variances, LB is the maximum lower bound, and the computation time is in hours.
Introduced variational approximations.
Showed that they can yield fast estimates of joint time-to-event and marker models.
Simulation studies suggest that the bias is low if present at all.
Breast cancer and markers including breast density.
Modelling longitudinal markers of kidney function for diabetes patients.
Delayed entry yields another integral that needs to be approximated.
Yields a maximin problem rather than a pure maximization problem.
The additional terms are easier to approximate. For instance, we do not need to marginalize out the random effects for the recurrent events.
E.g. if they are binary, count data, strictly positive or non-normal.
At worst, we have to approximate many one dimensional integrals.
compared to the Laplace implementation in the lme4 package. The example is at https://github.com/boennecd/psqn-va-ex#6d-random-effects
A reasonably tractable example is the multivariate skew-normal distribution (Ormerod 2011). Extensions will give tighter lower bounds but may take longer to evaluate or be harder to optimize.
This may reduce the computation time further.
The presentation is at rpubs.com/boennecd/CBS-VAJointSurv.
The markdown is at github.com/boennecd/Talks.
The implementation is at github.com/boennecd/VAJointSurv.
The psqn package is on CRAN and at github.com/boennecd/psqn.
References are on the next slide.