\[ \renewcommand\vec{\boldsymbol} \def\bigO#1{\mathcal{O}(#1)} \def\Cond#1#2{\left(#1\,\middle|\, #2\right)} \def\mat#1{\boldsymbol{#1}} \def\der{{\mathop{}\!\mathrm{d}}} \def\argmax{\text{arg}\,\text{max}} \def\Prob{\text{P}} \def\Expec{\text{E}} \def\logit{\text{logit}} \def\diag{\text{diag}} \]
Variational approximations will be introduced later.
This is joint work with Birzhan Akynkozhayev, Mark Clements, Keith Humphreys, and Hedvig Kjellström.
Will call these markers.
Want to account for this.
E.g. default, merger, employment, etc.
The method can be seen as a form of imputation.
Quantify the association of breast cancer with breast density and other variables.
People with a worse health status may tend to have worse marker values and have more measurements.
The terminal event (say death) and the observation process.
The population registers and research questions seldom leave us with only one marker of interest.
E.g. merger, liquidation and default.
Have a method and implementation that
We are not aware of any software that satisfies all four criteria.
Will have sub-models for each.
Introduce the marker and time-to-event sub-models.
Highlight possible relations and assumptions.
The population mean curve with 95% pointwise probability intervals for the individuals’ mean curves.
The population mean curve with pointwise quantiles for the individuals’ mean curves and the curve for one individual with the observed values.
The population mean curve with pointwise quantiles for the individuals’ mean curves and with curves for three individuals and their observed values.
\[ \begin{align*} Y_{ij} &= \mu_i(s_{ij}, \vec U_i) + \epsilon_{ij} \\ \epsilon_{ij} &\sim N(0, \sigma^2) \\ \mu_{i}(s, \vec U_i) &= \tilde{\vec x}_i^\top\tilde{\vec\beta}^{(1)} + \vec g(s)^\top\tilde{\vec\beta}^{(2)} + \vec m(s)^\top\vec U_i \\ \vec U_i &\sim N^{(R)}(\vec0, \Psi) \end{align*} \]
\(\tilde{\vec x}_i^\top\tilde{\vec\beta}^{(1)}\): fixed effects.
\(\vec g(s)^\top\tilde{\vec\beta}^{(2)}\): the population mean curve.
\(\vec m(s)^\top\vec U_i\): difference for individual \(i\) to the population curve.
Let \(\vec x_i(s) = (\tilde{\vec x}_i^\top, \vec g(s)^\top)^\top\) such that
\[ \mu_{i}(s, \vec U_i) = \vec x_i(s)^\top\vec\beta + \vec m_i(s)^\top\vec U_i \]
and \(\vec m_i\) may depend on \(i\).
Time and covariate interactions.
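A minimal sketch of the marker sub-model (not the package API; all bases and parameter values below are hypothetical): simulate one individual's trajectory with a natural cubic spline for \(\vec g(s)\) and a random intercept and slope for \(\vec m(s)\).

```r
# Minimal sketch (hypothetical bases and parameter values): simulate one
# individual's marker trajectory under the sub-model above.
library(splines)
set.seed(1)

s    <- sort(runif(8, 0, 2))                 # observation times s_ij
g    <- ns(s, knots = c(.5, 1, 1.5),         # basis g(s) for the population mean curve
           Boundary.knots = c(0, 2))
beta <- c(1, -1, .5, .25, .1)                # intercept (fixed effect) and spline coefficients
Psi  <- matrix(c(.36, .1, .1, .16), 2)       # covariance matrix of U_i
U    <- drop(t(chol(Psi)) %*% rnorm(2))      # random effects U_i ~ N(0, Psi)
m    <- cbind(1, s)                          # random effect basis m(s): intercept and slope
mu   <- drop(cbind(1, g) %*% beta + m %*% U) # mu_i(s, U_i)
y    <- mu + rnorm(length(s), sd = .5)       # add the measurement error with sigma = .5
cbind(time = s, mean = mu, observed = y)
```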
and the cause specific hazard for the competing events.
where \(T_i\) is the event time.
Model the conditional rate for the recurrent events
\[ h_i(t\mid\vec U_i) = \lim_{d \rightarrow 0} \frac{P(N_i(t + d) - N_i(t) > 0 \mid \vec U_i)}{d} \] where \(N_i(t)\) is the event count at time \(t\).
Markers’ deviation from the population may be linked to higher or lower risk of the event for each individual.
Different ways the two can be linked.
A higher level at any given point may lead to higher/lower risk.
A steeper change at any given point may lead to higher/lower risk.
A long period of higher values may lead to higher/lower risk at any given point.
To simplify the notation, only the current value is shown.
\[ \begin{align*} h_i(t\mid \vec U_i, \xi_i) &= \exp\left( \tilde{\vec z}_i^\top\tilde{\vec\gamma}^{(1)} + \vec b(t)^\top\tilde{\vec\gamma}^{(2)} + \alpha \vec m_i(t)^\top\vec U_i + \xi_i \right) \\ \xi_i &\sim N(0, \Xi) \end{align*} \]
\(\tilde{\vec z}_i^\top\tilde{\vec\gamma}^{(1)}\): fixed effects.
\(\vec b(t)^\top\tilde{\vec\gamma}^{(2)}\): the log baseline hazard function.
Shared with the marker.
Can be excluded.
Let \(\vec z_i(t) = (\tilde{\vec z}_i^\top, \vec b(t)^\top)^\top\) such that
\[ \begin{align*} h_i(t\mid \vec U_i, \xi_i) &= \exp\left( \vec z_i(t)^\top\vec\gamma + \alpha \vec m_i(t)^\top\vec U_i + \xi_i \right) \\ \xi_i &\sim N(0, \Xi) \end{align*} \]
Time and covariate interactions.
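A minimal sketch of the survival sub-model with hypothetical bases and parameter values (the fixed effects \(\tilde{\vec z}_i^\top\tilde{\vec\gamma}^{(1)}\) are omitted): evaluate the conditional hazard and the implied conditional survival function by numerically integrating the cumulative hazard.

```r
# Minimal sketch (hypothetical bases and parameter values): the conditional
# hazard and survival function given one draw of the random effects.
b     <- function(t) cbind(1, log(t))        # log baseline hazard basis b(t)
m     <- function(t) cbind(1, t)             # same random effect basis as the marker
gamma <- c(-2, .5)                           # log baseline hazard coefficients
alpha <- .8                                  # association parameter
U     <- c(.3, -.2)                          # one draw of the random effects U_i
xi    <- .1                                  # frailty term

haz  <- function(t) exp(drop(b(t) %*% gamma + alpha * (m(t) %*% U) + xi))
surv <- function(t) exp(-integrate(haz, 0, t)$value)
sapply(c(.5, 1, 2), surv)                    # conditional survival probabilities
```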
We may have \(L\) different markers rather than one
\[ \begin{align*} \vec Y_{ij}&= \vec\mu_i(s_{ij}, \vec U_i) + \vec\epsilon_{ij} \\ \vec\epsilon_{ij} &\sim N^{(L)}(\vec 0, \Sigma) \\ \mu_{i1}(s, \vec U_i) &= \vec x_{i1}(s)^\top\vec\beta_1 + \vec m_{i1}(s)^\top\vec U_{i1} \\ \vdots &\hphantom{=}\vdots\\\ \mu_{iL}(s, \vec U_i) &= \vec x_{iL}(s)^\top\vec\beta_L + \vec m_{iL}(s)^\top\vec U_{iL} \\ \vec U_i &= \begin{pmatrix} \vec U_{i1} \\ \vdots \\ \vec U_{iL} \end{pmatrix}\sim N^{(R)}(\vec0, \Psi). \end{align*} \]
Measurement error.
\(\Psi\) makes markers of the same and different types dependent across time.
\[\begin{align*} X_i(s) &= \begin{pmatrix} \vec x_{i1}(s)^\top & \vec 0^\top & \cdots & \vec 0^\top \\ \vec 0^\top & \vec x_{i2}(s)^\top & \ddots & \vdots \\ \vdots & \ddots & \ddots & \vec 0^\top \\ \vec 0^\top & \cdots & \vec 0^\top & \vec x_{iL}(s)^\top \end{pmatrix} \\ \vec\beta &=(\vec \beta_1^\top,\dots,\vec\beta_L^\top)^\top, \end{align*}\]
and define \(M_i(s)\) similarly. Then
\[\vec\mu_i(s_{ij}, \vec U_i) = X_i(s_{ij})\vec\beta + M_i(s_{ij})\vec U_i.\]
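A minimal sketch of the stacking with two hypothetical markers; \(M_i(s)\) is built in the same way.

```r
# Minimal sketch (hypothetical bases): build the block diagonal X_i(s) at a
# single time point s for L = 2 markers.
library(Matrix)
s    <- 1.5
x_i1 <- c(1, s, s^2)                     # basis for marker 1: intercept, linear, quadratic
x_i2 <- c(1, sqrt(s))                    # basis for marker 2
X_i  <- bdiag(rbind(x_i1), rbind(x_i2))  # the 2 x 5 block diagonal X_i(s)
beta <- c(1, -.5, .1, .8, .2)            # (beta_1^T, beta_2^T)^T
as.numeric(X_i %*% beta)                 # the fixed effect part of the two markers' means
```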
We may have \(E\) different time-to-event types rather than one
\[ \begin{align*} h_{i1}(t\mid \vec U_i, \vec\xi_i) &= \exp\left( \vec z_{i1}(t)^\top\vec\gamma_1 + \vec\alpha_1^\top M_i(t)\vec U_i + \xi_{i1} \right) \\ \vdots &\hphantom{=}\vdots \\ h_{iE}(t\mid \vec U_i,\vec \xi_i) &= \exp\left( \vec z_{iE}(t)^\top\vec\gamma_E + \vec\alpha_E^\top M_i(t)\vec U_i + \xi_{iE} \right) \\ \vec\xi_i = \begin{pmatrix}\xi_{i1} \\ \vdots \\ \xi_{iE}\end{pmatrix} &\sim N^{(E)}(\vec 0, \Xi) \end{align*} \]
Motivate the variational approximation (VA) and show how to use it for fast estimation.
Brief highlight of the psqn package we have developed to quickly optimize the approximate likelihood.
Briefly discuss the package for the joint models.
Let \(\vec\zeta\) be the model parameters and \(l_i(\vec\zeta)\) be the log likelihood term of individual \(i\).
Let \(\vec O_i = (\vec U_i^\top, \vec\xi_i^\top)^\top\) be the concatenated random effects.
Let \(\vec y_i\) be the concatenated vector of observed marker values, the observation times (when the observation process is modelled), the observed event times, and the event indicators. Then
\[ l_i(\vec\zeta) = \log \int p_{i,\vec\zeta}(\vec y_i, \vec o) d \vec o \]
where \(p_{i,\vec\zeta}(\vec y_i, \vec o)\) is the joint density of the outcomes and the random effects evaluated at \(\vec y_i\) and \(\vec o\).
Need an approximation.
Given a distribution with density \(q^{(i)}_{\vec\theta}\) for \(\vec\theta\in \Theta_i\), the variational approximation we use is the lower bound
\[ l_i(\vec\zeta) \geq \int q^{(i)}_{\vec\theta}(\vec o) \log \left(\frac{p_{i,\vec\zeta}(\vec y_i, \vec o)} {q^{(i)}_{\vec\theta}(\vec o)}\right) d \vec o = \tilde l_i(\vec\zeta, \vec\theta) \]
Typically, much faster to evaluate.
\[ \begin{align*} l_i(\vec\zeta) &= \int q^{(i)}_{\vec\theta}(\vec o) \log \left(\frac{p_{i,\vec\zeta}(\vec y_i, \vec o)\big/ q^{(i)}_{\vec\theta}(\vec o)} {p_{i,\vec\zeta}(\vec o\mid\vec y_i)\big/ q^{(i)}_{\vec\theta}(\vec o)}\right) d \vec o\\ &= \int q^{(i)}_{\vec\theta}(\vec o) \Bigg( \log \left(\frac{p_{i,\vec\zeta}(\vec y_i, \vec o)} {q^{(i)}_{\vec\theta}(\vec o)}\right) + \underbrace{\log \left(\frac{q^{(i)}_{\vec\theta}(\vec o)} {p_{i,\vec\zeta}(\vec o\mid\vec y_i)}\right)}_{\text{KL divergence}} \Bigg)d \vec o \\ &\geq \int q^{(i)}_{\vec\theta}(\vec o) \log \left(\frac{p_{i,\vec\zeta}(\vec y_i, \vec o)} {q^{(i)}_{\vec\theta}(\vec o)}\right) d \vec o = \tilde l_i(\vec\zeta, \vec\theta) \end{align*} \]
Equality if \(p_{i,\vec\zeta}(\vec o\mid\vec y_i) = q^{(i)}_{\vec\theta}(\vec o)\).
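A minimal sketch of the lower bound in a toy case where it is exact: a Gaussian random intercept model \(y_j = \mu + u + \epsilon_j\) with \(u \sim N(0, \tau^2)\), \(\epsilon_j\sim N(0,\sigma^2)\), and a Gaussian \(q\) with mean \(m\) and variance \(v\). The conditional distribution of \(u\) is Gaussian, so the maximized lower bound matches the exact marginal log likelihood (all values below are hypothetical).

```r
# Minimal sketch: the lower bound for a Gaussian random intercept model
# with a Gaussian q equals the exact marginal log likelihood at the optimum.
set.seed(2)
y <- rnorm(5, mean = 1 + .5, sd = .4)    # five observations from one individual
mu <- 1; sigma2 <- .16; tau2 <- .25; k <- length(y)

elbo <- function(m, v)
  sum(dnorm(y, mu + m, sqrt(sigma2), log = TRUE)) - k * v / (2 * sigma2) +
    dnorm(m, 0, sqrt(tau2), log = TRUE) - v / (2 * tau2) +
    (log(2 * pi * v) + 1) / 2            # the entropy of q

v_opt <- 1 / (k / sigma2 + 1 / tau2)     # the conditional (posterior) variance
m_opt <- v_opt * sum(y - mu) / sigma2    # the conditional (posterior) mean

library(mvtnorm)                         # for the exact marginal log likelihood
exact <- dmvnorm(y, rep(mu, k), diag(sigma2, k) + tau2, log = TRUE)
c(lower_bound = elbo(m_opt, v_opt), exact = exact)
```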
\[ \text{argmax}_{\vec\zeta} \sum_{i = 1}^n l_i(\vec\zeta) \approx \text{argmax}_{\vec\zeta} \max_{\vec\theta_1,\dots,\vec\theta_n} \sum_{i = 1}^n\tilde l_i(\vec\zeta, \vec\theta_i) \]
There are 35070 parameters in one of the simulation studies shown later.
The approximate log likelihood can be optimized quickly as it is partially separable.
The Hessians of the individual terms can be combined into an approximation of the entire Hessian, which is sparse.
This is used with a block diagonal preconditioner in a quasi-Newton algorithm.
The method is embarrassingly parallel in all steps and this is exploited.
Allows us to iterate very quickly with different models.
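As an illustration of partial separability (a toy example, not the psqn API): each term couples the global parameters \(\vec\zeta\) only with its own \(\vec\theta_i\), never \(\vec\theta_i\) with \(\vec\theta_j\), so the Hessian has a bordered block diagonal ("arrow") sparsity pattern.

```r
# Minimal sketch: the sparsity pattern of the Hessian of a partially
# separable function sum_i f_i(zeta, theta_i) with two global parameters
# and three private parameters for each of four individuals.
library(numDeriv)
n <- 4; p_global <- 2; p_private <- 3
obj <- function(x) {
  zeta <- x[1:p_global]                            # global parameters
  sum(sapply(1:n, function(i) {
    theta_i <- x[p_global + (i - 1) * p_private + 1:p_private]
    sum((theta_i - zeta[1] - zeta[2] * i)^2) + prod(theta_i)  # i'th term
  }))
}
H <- hessian(obj, rep(.5, p_global + n * p_private))
print(ifelse(abs(H) > 1e-5, "x", "."), quote = FALSE)  # the "arrow" pattern
```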
The lower bound requires evaluation of a number of expectations under \(q^{(i)}_{\vec\theta}\).
All are tractable; we only have to do 1D numerical integration, and this is fast.
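For instance, with a Gaussian \(q\) the expectation of the conditional hazard is available in closed form at every time point, so the expected cumulative hazard only needs a 1D integral over time. A minimal sketch with hypothetical bases and parameter values:

```r
# Minimal sketch (hypothetical bases and parameter values): the expected
# cumulative hazard under a Gaussian q with mean mq and covariance Sq for
# (U_i, xi_i) only needs 1D numerical integration.
mq <- c(.2, -.1, 0); Sq <- diag(c(.3, .2, .25))    # variational parameters
gamma <- c(-2, .5); alpha <- .8
a <- function(t) cbind(alpha, alpha * t, 1)        # coefficients of (U_i, xi_i) in the log hazard

Eq_haz <- function(t) {                            # E_q[h_i(t | U_i, xi_i)], closed form
  A <- a(t)
  exp(cbind(1, log(t)) %*% gamma + A %*% mq + rowSums((A %*% Sq) * A) / 2)
}
integrate(function(t) drop(Eq_haz(t)), 0, 2)$value # expected cumulative hazard on (0, 2]
```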
The conditional distribution may be well approximated by a normal distribution.
\[\hat l_i(\vec\zeta) = \tilde l_i(\vec\zeta, \hat{\vec\theta}_i(\vec\zeta))\qquad \hat{\vec\theta}_i(\vec\zeta) = \argmax_{\vec\theta} \tilde l_i(\vec\zeta, \vec\theta)\]
Then the Hessian of \(\hat l_i(\vec\zeta)\) is
\[ \nabla^2_{\vec\zeta\vec\zeta}\tilde l_i(\vec\zeta, \vec\theta) - \left. \nabla^2_{\vec\zeta\vec\theta}\tilde l_i(\vec\zeta, \vec\theta) \left(\nabla^2_{\vec\theta\vec\theta}\tilde l_i(\vec\zeta, \vec\theta)\right)^{-1} \nabla^2_{\vec\theta\vec\zeta}\tilde l_i(\vec\zeta, \vec\theta) \right|_{\vec\theta = \hat{\vec\theta}_i(\vec\zeta)}. \]
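The formula follows from the first order condition \(\nabla_{\vec\theta}\tilde l_i(\vec\zeta, \hat{\vec\theta}_i(\vec\zeta)) = \vec 0\). Differentiating this identity with respect to \(\vec\zeta\) gives

\[ \frac{\der\hat{\vec\theta}_i(\vec\zeta)}{\der\vec\zeta^\top} = -\left(\nabla^2_{\vec\theta\vec\theta}\tilde l_i(\vec\zeta, \vec\theta)\right)^{-1} \nabla^2_{\vec\theta\vec\zeta}\tilde l_i(\vec\zeta, \vec\theta) \bigg|_{\vec\theta = \hat{\vec\theta}_i(\vec\zeta)} \]

and applying the chain rule to \(\nabla_{\vec\zeta}\hat l_i(\vec\zeta) = \nabla_{\vec\zeta}\tilde l_i(\vec\zeta, \hat{\vec\theta}_i(\vec\zeta))\) yields the expression above.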
Can also construct “profile lower bound” curves.
Given \(\hat{\vec\zeta}\), \(\hat{\vec\theta}_i(\hat{\vec\zeta})\), and some function \(v(\vec o;\hat{\vec\zeta})\), the right-hand side of
\[ E_{p_{i,\hat{\vec\zeta}}}\left(v(\vec O_i;\hat{\vec\zeta})\mid \vec Y_i = \vec y_i\right) \approx E_{q^{(i)}_{\hat{\vec\theta}_i(\hat{\vec\zeta})}}\left(v(\vec O_i;\hat{\vec\zeta})\right) \]
is almost always much easier to compute.
\(p_{i,\hat{\vec\zeta}}(\vec o\mid\vec y_i) \approx q^{(i)}_{\hat{\vec\theta}_i(\hat{\vec\zeta})}(\vec o)\) if the lower bound is tight.
Key for the use of VAs in a Bayesian framework.
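As an example, with \(v(\vec o;\hat{\vec\zeta}) = \mu_i(s, \vec u)\) the right-hand side gives a predicted mean marker curve for individual \(i\); since \(\mu_i\) is linear in \(\vec U_i\), only the mean of \(q^{(i)}_{\hat{\vec\theta}_i(\hat{\vec\zeta})}\) is needed. A minimal sketch with hypothetical bases and values:

```r
# Minimal sketch (hypothetical bases and values): predict individual i's
# mean marker curve by plugging in the mean of q for U_i.
library(splines)
mq   <- c(.4, -.2)                                  # mean of q for U_i
s    <- seq(.1, 2, length.out = 50)                 # prediction grid
g    <- ns(s, knots = c(.5, 1, 1.5), Boundary.knots = c(0, 2))
beta <- c(1, -1, .5, .25, .1)
pred <- drop(cbind(1, g) %*% beta + cbind(1, s) %*% mq)
head(cbind(time = s, predicted_mean = pred))
```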
using the psqn package.
Similar to the Adept library (Hogan 2014). A gradient implementation will reduce the computation time.
cumulative values, current values, and derivatives in the hazards, mixtures thereof, and more.
Comparison with the Monte Carlo based joineRML package (Hickey, Philipson, and Jorgensen 2018) and quadrature based JM package (Rizopoulos 2010).
Check the bias of the VA and compare the observed information matrix and the computation time.
Simulation study with multivariate markers with both a terminal event and an informative observation process.
joineRML has a non-parametric baseline hazard.
We used the same spline as in our package, with ten knots based on the observed event times.
Four and seven quadrature nodes were used with JM.
Natural cubic spline for the marker intercept with four interior knots.
Therefore, three random effects per individual.
Uniform distribution for a covariate in the hazard, normally distributed baseline covariate for the marker, and we included the covariate from the marker in the hazard. No frailty.
Full code available online.
The population mean curve with 95% pointwise probability intervals for the individuals’ mean curves.
The lines are the 10%, 25%, 50%, 75% and 90% pointwise quantiles of the conditional hazards.
An average of 3.83 observed markers per individual (SD 2.98).
An average of 42.4% are censored.
500 individuals in each sample and 100 data sets.
| | Method | \(\beta_{11}\) | \(\beta_{12}\) | \(\sigma^2_1\) | \(\gamma_{11}\) | \(\gamma_{12}\) | \(\alpha_{11}\) |
|---|---|---|---|---|---|---|---|
| True values | | 1.00000 | -0.50000 | 0.25000 | 0.40000 | 0.00000 | -1.00000 |
| Bias | GVA | 0.00783 | -0.00438 | -0.00047 | -0.01112 | 0.00655 | -0.00013 |
| | joineRML | 0.01828 | -0.00434 | -0.00061 | -0.01241 | 0.00614 | 0.00395 |
| | JM 4 | -0.00224 | -0.00410 | -0.00209 | -0.01187 | 0.00633 | 0.00229 |
| | JM 7 | 0.00436 | -0.00424 | -0.00043 | -0.01013 | 0.00637 | -0.00575 |
| SE | GVA | 0.00840 | 0.00610 | 0.00105 | 0.01194 | 0.00814 | 0.00613 |
| | joineRML | 0.00832 | 0.00606 | 0.00105 | 0.01195 | 0.00802 | 0.00614 |
| | JM 4 | 0.00845 | 0.00591 | 0.00105 | 0.01191 | 0.00790 | 0.00613 |
| | JM 7 | 0.00838 | 0.00607 | 0.00105 | 0.01198 | 0.00812 | 0.00627 |
SE are the standard errors of the bias estimates. GVA denotes the results from our method. JM 4 and JM 7 are the results from the JM package with four and seven quadrature nodes, respectively.
| | Method | \(\lVert\widehat{\mat\Psi} - \mat\Psi\rVert\) | Time |
|---|---|---|---|
| Average | GVA | 2.16449 | 2.024 |
| | joineRML | 2.19245 | 719.268 |
| | JM 4 | 2.18567 | 42.767 |
| | JM 7 | 2.17030 | 224.622 |
The computation time is in seconds. Four threads were used with our method, and an average of 0.938 seconds of the time was spent on the observed information matrix. The latter can be greatly reduced.
| | Method | Log likelihood | Relative covariance error |
|---|---|---|---|
| Average | GVA | -3070.070 | 0.014307 |
| | JM 4 | -3070.311 | 0.122795 |
| | JM 7 | -3069.541 | 0.000000 |
Take away:
Added another marker with flipped signs for the time-varying fixed effects.
The correlation between the intercept parts of the two random splines was set to 0.5.
Thus, seven random effects per individual.
An average of 4.8 observed markers per individual (SD 8.08).
An average of 45.5% are censored.
1000 individuals per sample and 1000 data sets.
35070 parameters in total.
| | \(\alpha_{11}\) | \(\alpha_{12}\) | \(\alpha_{21}\) | \(\alpha_{22}\) | \(\xi_2^2\) |
|---|---|---|---|---|---|
| True values | 0.5000 | 0.5000 | -0.5000 | -0.5000 | 0.2500 |
| Bias | -0.0003 | 0.0006 | 0.0010 | 0.0001 | -0.0126 |
| SE | 0.0012 | 0.0012 | 0.0005 | 0.0005 | 0.0009 |
| RMSE | 0.0385 | 0.0375 | 0.0162 | 0.0156 | 0.0315 |
| Coverage | 0.9400 | 0.9550 | 0.9410 | 0.9590 | 0.9170 |
The table shows bias estimates, standard errors of the bias estimates, root mean square errors and coverage of Wald type confidence intervals.
| | \(\lVert\widehat{\mat\Psi} - \mat\Psi\rVert\) | Time | Coverage profile |
|---|---|---|---|
| Average | 3.024 | 90.42 | 0.9390 |
| SE | 0.020 | 0.34 | |
The computation time is in seconds and includes an average of 33.9 seconds spent computing the observed information matrix. This can be reduced. The "Coverage profile" column shows the coverage of a profile likelihood based confidence interval for \(\alpha_{11}\). The average computation time for the confidence interval was 151 seconds.
Interested in the relation between longitudinally observed dense and fatty tissue areas in the breast and breast cancer onset.
60081 women with 188037 longitudinal measurements.
BMI is included in both the marker and survival sub-models. There is delayed entry due to the cancer free condition.
The models without and with random slopes yield two and four random effects per woman, respectively.
Thus, there are 300446 and 841182 parameters.
| | Model | \(\alpha_{11}\) | \(\alpha_{12}\) | \(\psi_1^2\) | \(\psi_2^2\) | \(\psi_3^2\) | \(\psi_4^2\) | LB | Time |
|---|---|---|---|---|---|---|---|---|---|
| Estimate | W/o | 0.132 | 0.0305 | 3.78 | 3.80 | | | -587142.0 | 1.17 |
| | W/ | 0.144 | 0.0266 | 3.68 | 0.00233 | 3.42 | 0.00228 | -586180.0 | 38.49 |
| SE | W/o | 0.013 | 0.0160 | 0.02 | 0.02 | | | | |
| | W/ | 0.013 | 0.0171 | 0.05 | 0.00016 | 0.05 | 0.00017 | | |
The "W/o" and "W/" rows are the results without and with the random slopes, respectively. \(\alpha_{11}\) and \(\alpha_{12}\) are the current value associations with the dense and the fatty tissue, respectively. The \(\psi^2_i\)s are the estimated variances, LB is the maximum lower bound, and the computation time is in hours.
Introduced variational approximations.
Showed that they can yield fast estimates of joint time-to-event and marker models.
Simulation studies suggest that the bias is low if present at all.
Breast cancer and markers including breast density.
Modelling longitudinal markers of kidney function for diabetes patients.
Delayed entry yields another integral that needs to be approximated.
Yields a maximin problem rather than a pure maximization problem.
The additional terms are easier to approximate. For instance, we do not need to marginalize out the random effects for the recurrent events.
E.g. if they are binary, count data, strictly positive or non-normal.
At worst, we have to approximate many one dimensional integrals.
compared to the Laplace implementation in the lme4 package. The example is at https://github.com/boennecd/psqn-va-ex#6d-random-effects
A reasonably tractable example is the multivariate skew-normal distribution (Ormerod 2011). Extensions will give tighter lower bounds but may take longer to evaluate or be harder to optimize.
This may reduce the computation time further.
The presentation is at rpubs.com/boennecd/CBS-VAJointSurv.
The markdown is at github.com/boennecd/Talks.
The implementation is at github.com/boennecd/VAJointSurv.
The psqn package is on CRAN and at github.com/boennecd/psqn.
References are on the next slide.