PTCMGH R package: Promotion
Time Cure Models with a General Hazard structurePromotion-Time Cure Models (PTCMs) are a class of survival regression models in which a subset of the population is assumed to be cured (Yakovlev and Tsodikov 1996; Tsodikov 1998; M. H. Chen, Ibrahim, and Sinha 1999; Amico and Van Keilegom 2018), that is, never to experience the event of interest. Equivalently, this sub-population is characterised by infinite survival times. The survival function of a PTCM is given by:
\[ S_c(t) = \exp\left\{ -\theta \tilde{F}(t) \right\}, \quad t \geq 0, \] where \(\theta>0\) is a positive parameter, and \(\tilde{F}\) is a cumulative distribution function with support on \({\mathbb R}_+\). Note that
\[ \lim_{t\to\infty} S_c(t) = \exp\{-\theta\}, \] which represents the cured fraction. Since the survival function associated with the PTCM does not decrease to zero, it is often referred to as an “improper” survival function (Peng and Yu 2021).
If covariates \({\bf x} \in {\mathbb R}^p\) are available, they can be incorporated into the two components of the PTCM:
Let \({\bf w} \subseteq {\bf x}\), and model the log-link \(\theta({\bf w}) = \exp({\alpha_0+\bf w}^{\top}{\boldsymbol{\alpha}})\), where \(\alpha_0\) is an intercept.
Let \({\bf z} \subseteq {\bf x}\), and model the hazard structure as \(\tilde{H}(t \mid {\bf z})\). Then, we can define
\[ \tilde{F}(t \mid {\bf z}) = 1 - \exp\{-\tilde{H}(t \mid {\bf z})\}. \] The hazard structure \(\tilde{H}(t \mid {\bf z})\) can include the proportional hazards, the accelerated failure time, or more general hazard structures (Rubio et al. 2019).
Suppose that the hazard and cumulative hazard functions follow a proportional hazards (PH) structure:
\[\begin{align} \tilde{h}(t \mid {\bf z}) &= \tilde{h}_0(t \mid \boldsymbol{\gamma}) \exp\{{\bf z}^{\top} \boldsymbol{\beta}\}, \\ \tilde{H}(t \mid {\bf z}) &= \tilde{H}_0(t \mid \boldsymbol{\gamma}) \exp\{{\bf z}^{\top} \boldsymbol{\beta}\}, \end{align}\]
where \(\tilde{h}_0\) is a baseline hazard associated with a distribution \(\tilde{F}_0\), with parameters \(\boldsymbol{\gamma}\). Let also \(\theta({\bf w}) = \exp({\bf w}^{\top}{\boldsymbol{\alpha}})\), and \(u\) be a realisation from a \(U(0,1)\) distribution. A simulation from the PTCM with a PH structure (PTCM-PH) can then be obtained as follows:
\[ t^* = \tilde{F}_0^{-1}\left( 1 - \exp\left\{ \exp\left\{ - {\bf z}^{\top}\boldsymbol{\beta} \right\}\log\left[ 1 + \log(u)\exp\left\{-\alpha_0 - {\bf w}^{\top}\boldsymbol{\alpha} \right\} \right] \right\} \right). \] if \(1 + \log(u)\exp\left\{-\alpha_0 - {\bf w}^{\top}\boldsymbol{\alpha} \right\}>0\), and \(t^*=\infty\) otherwise.
Suppose that the hazard and cumulative hazard functions follow an accelerated failure time (AFT) hazard structure:
\[\begin{align} \tilde{h}(t \mid {\bf z}) &= \tilde{h}_0(t \exp\{{\bf z}^{\top} \boldsymbol{\beta}\} \mid \boldsymbol{\gamma}) \exp\{{\bf z}^{\top} \boldsymbol{\beta}\}, \\ \tilde{H}(t \mid {\bf z}) &= \tilde{H}_0(t \exp\{{\bf z}^{\top} \boldsymbol{\beta}\} \mid \boldsymbol{\gamma}) , \end{align}\]
where \(\tilde{h}_0\) is a baseline hazard associated with a distribution \(\tilde{F}_0\), with parameters \(\boldsymbol{\gamma}\). Let also \(\theta({\bf w}) = \exp({\bf w}^{\top}{\boldsymbol{\alpha}})\), and \(u\) be a realisation from a \(U(0,1)\) distribution. A simulation from the PTCM with a AFT structure (PTCM-AFT) can then be obtained as follows:
\[ t^* = \tilde{F}_0^{-1}\left( -\log(u) \exp\left\{-\alpha_0 - {\bf w}^{\top}\boldsymbol{\alpha} \right\} \right) \exp\left\{ - {\bf z}^{\top}\boldsymbol{\beta} \right\}. \] if \(\log(u)\exp\left\{-\alpha_0 - {\bf w}^{\top}\boldsymbol{\alpha} \right\} < 1\), and \(t^*=\infty\) otherwise.
Suppose that the hazard and cumulative hazard functions follow a General Hazard (GH) structure (Y. Q. Chen and Jewell 2001) (Rubio et al. 2019):
\[\begin{align} \tilde{h}(t \mid {\bf z}) &= \tilde{h}_0(t \exp\{{\bf z}^{\top} \boldsymbol{\eta}\} \mid \boldsymbol{\gamma}) \exp\{{\bf z}^{\top} \boldsymbol{\beta}\}, \\ \tilde{H}(t \mid {\bf z}) &= \tilde{H}_0(t \exp\{{\bf z}^{\top} \boldsymbol{\eta}\} \mid \boldsymbol{\gamma}) \exp\{{\bf z}^{\top} \boldsymbol{\beta} - {\bf z}^{\top} \boldsymbol{\eta}\}\}, \end{align}\]
where \(\tilde{h}_0\) is a baseline hazard associated with a distribution \(\tilde{F}_0\), with parameters \(\boldsymbol{\gamma}\). Let also \(\theta({\bf w}) = \exp({\bf w}^{\top}{\boldsymbol{\alpha}})\), and \(u\) be a realisation from a \(U(0,1)\) distribution. A simulation from the PTCM with a GH structure (PTCM-GH) can then be obtained as follows:
\[ t^* = \exp\left\{ - {\bf z}^{\top}\boldsymbol{\eta} \right\} \tilde{F}_0^{-1}\left( 1 - \exp\left\{ \exp\left\{ {\bf z}^{\top}\boldsymbol{\eta} - {\bf z}^{\top}\boldsymbol{\beta} \right\}\log\left[ 1 + \log(u)\exp\left\{-\alpha_0 - {\bf w}^{\top}\boldsymbol{\alpha} \right\} \right] \right\} \right). \] if \(1 - \exp\left\{ \exp\left\{ {\bf z}^{\top}\boldsymbol{\eta} - {\bf z}^{\top}\boldsymbol{\beta} \right\}\log\left[ 1 + \log(u)\exp\left\{-\alpha_0 - {\bf w}^{\top}\boldsymbol{\alpha} \right\} \right] \right\} < 1\), and \(t^*=\infty\) otherwise.
Provided that one can simulate from the distribution \(\tilde{F}(t \mid {\bf z})\) using numerical methods, simulating from a PTCM can be accomplished simply by generating \(u \sim U(0,1)\) and solving:
\[ t^* = \tilde{F}\left( -\log(u) \exp\left\{-\alpha_0 - {\bf w}^{\top}\boldsymbol{\alpha} \right\} \mid {\bf z} \right) \] if \(-\log(u) \exp\left\{-\alpha_0 - {\bf w}^{\top}\boldsymbol{\alpha} \right\} < 1\), and \(t^*=\infty\) otherwise.
The log-likelihood function of the parameters \(\Psi = (\theta, \boldsymbol{\gamma}^{\top})^{\top}\) is:
\[\begin{align} \ell\left( \Psi \right) &= \sum_{i=1}^n \delta_i \log h(t_i \mid \Psi) - \sum_{i=1}^n H(t_i \mid \Psi)\\ &= \sum_{i=1}^n \delta_i \log(\theta) + \sum_{i=1}^n \delta_i \log \tilde{h}(t_i \mid \boldsymbol{\gamma}) - \sum_{i=1}^n \delta_i \tilde{H}(t_i \mid \boldsymbol{\gamma})\\ &+ \sum_{i=1}^n \theta \left[ 1 - \exp\left\{ -\tilde{H}(t_i \mid \boldsymbol{\gamma}) \right\} \right]\\ &= n_O \log(\theta) + \sum_{i=1}^n \delta_i \log \tilde{h}(t_i \mid \boldsymbol{\gamma}) - \sum_{i=1}^n \delta_i \tilde{H}(t_i \mid \boldsymbol{\gamma})\\ &- n\theta - \theta \sum_{i=1}^n \exp\left\{ -\tilde{H}(t_i \mid \boldsymbol{\gamma}) \right\}, \end{align}\]
where \(n_O = \sum_{i=1}^n \delta_i\).
The log-likelihood function of the model parameters \(\Psi = (\alpha_0,\boldsymbol{\alpha}^{\top}, \boldsymbol{\beta}^{\top}, \boldsymbol{\eta}^{\top}, \boldsymbol{\gamma}^{\top})^{\top}\) is:
\[\begin{align} \ell\left( \Psi \right) &= \sum_{i=1}^n \delta_i \log h(t_i \mid \Psi, \boldsymbol{z}_i) - \sum_{i=1}^n H(t_i \mid \Psi, \boldsymbol{z}_i)\\ &= \sum_{i=1}^n \delta_i \boldsymbol{w}_i^{\top}\boldsymbol{\alpha} + \sum_{i=1}^n \delta_i \log \tilde{h}(t_i \mid \boldsymbol{\gamma}, \boldsymbol{z}_i) - \sum_{i=1}^n \delta_i \tilde{H}(t_i \mid \boldsymbol{\gamma}, \boldsymbol{z}_i)\\ &- \sum_{i=1}^n \exp\{\boldsymbol{w}_i^{\top}\boldsymbol{\alpha}\} + \sum_{i=1}^n \exp\left\{\boldsymbol{w}_i^{\top}\boldsymbol{\alpha} -\tilde{H}(t_i \mid \boldsymbol{\gamma}, \boldsymbol{z}_i) \right\}. \end{align}\]
PTCMGH R packageThe PTCMGH R package provides functions for simulating
survival times from PTCMs with proportional hazards, accelerated failure
time, and general hazard structures under a log-link for \(\theta\), as well as for performing maximum
likelihood estimation of these models. The hazard structure used for
modelling \(\tilde{F}\) can include the
proportional hazards, the accelerated failure time, or more general hazard
structures.
General Hazard (GH) model.
Accelerated Failure Time (AFT) model.
Proportional Hazards (PH) model.
Accelerated Hazards (AH) model.
These models are fitted using the R commands nlminb and
optim. Thus, the user needs to specify the initial points
and to check the convergence of the optimisation step, as usual.
The current version of the PTCMGH R package implements
the following parametric baseline hazards for the models discussed in
the previous section, using the command PTCMMLE.
Power Generalised Weibull (PGW) distribution.
Exponentiated Weibull (EW) distribution.
Generalised Gamma (GenGamma) distribuiton.
Gamma (Gamma) distribution.
Lognormal (LogNormal) distribution.
Log-logistic (LogLogistic) distribution.
Weibull (Weibull) distribution. (only for AFT, PH, and AH models)
All positive parameters are transformed into the real line using a
log link (reparameterisation).
The PTCMGH R package depends on functions from the HazReg R
package, which must also be installed and loaded.
Disclaimer: Although this R package implements several hazard structures, some models may exhibit practical non-identifiability if the follow-up period is not sufficiently long (Rubio, Espindola, and Montoya 2023). Users should verify this (e.g., by plotting parameter profile likelihoods). A potential remedy is to fit simpler models, though pathological cases may still arise where the cure model is difficult to estimate due to limited follow-up or high censoring rates. Such issues are well documented in the literature (Lambert and Bremhorst 2019).
A list of R packages implementing different types of cure survival models can be found at: CRAN Task View: Survival Analysis
The following R code illustrates how to simulate survival times from
a PTCM-GH model and perform maximum likelihood estimation for PTCMs with
baseline, proportional hazards, accelerated failure time, accelerated
hazards, and general hazard structures using a log-link for \(\theta\), via the PTCMGH R
package.
rm(list=ls())
#library(devtools)
#install_github("FJRubio67/HazReg")
library(HazReg)
#library(devtools)
#install_github("FJRubio67/PTCMGH")
library(PTCMGH)
library(survival)
library(knitr)
# ?simPTCMGH
# ?PTCMMLE
# Data simulation
n = 10000
seed = 123
set.seed(seed)
# Design matrix (includes the intercept in the first entry for theta)
des0 <- cbind(1, rnorm(n), rnorm(n))
# Simulation: PTCM with GH structure and LogNormal baseline hazard, and log-link
sim = simPTCMGH(n = n,
seed = seed,
hstr = "GH",
dist = "LogNormal",
des_theta = des0,
des_t = des0[, -1],
des_h = des0[, -1],
des = NULL,
par_base = c(-0.25, 0.1),
alpha = c(0.5, 0.5, 0.5),
beta_t = c(0.5, 0.5),
beta_h = c(1.0, 1.0),
beta = NULL)
# Inducing censoring
cens = 6
status <- ifelse(sim < cens, 1, 0)
mean(status)
## [1] 0.7631
times <- ifelse(sim< cens, sim, cens)
# Kaplan-Meier estimator for the survival times
km <- survfit(Surv(times, status) ~ 1)
plot(km$time, km$surv, type = "l", col = "black", lwd = 2, lty = 1,
ylim = c(0,1), xlab = "Time", ylab = "Survival")
# Model fitting
OPT <- PTCMMLE(init = c(0,0,0),
times = times,
status = status,
hstr = "baseline",
dist = "LogNormal",
des_theta = NULL,
des_t = NULL,
des_h = NULL,
des = NULL,
method = "nlminb",
maxit = 10000)
# MLE
MLE <- c(OPT$OPT$par[1],exp(OPT$OPT$par[2]),exp(OPT$OPT$par[3]))
# Fitted survival function
spt <- Vectorize(function(t) exp(- MLE[3]*(1-exp(-chlnorm(t,MLE[1],MLE[2])))) )
plot(km$time, km$surv, type = "l", col = "black", lwd = 2, lty = 1,
ylim = c(0,1), xlab = "Time", ylab = "Survival")
curve(spt,0,6, col = "red", add= T)
legend("topright", legend = c("KM","Baseline"), col = c("black","red"),
lwd = c(2,2))
# Model fitting
OPT_PH <- PTCMMLE(init = c(0,0,0,0,0,0,0),
times = times,
status = status,
hstr = "PH",
dist = "LogNormal",
des_theta = des0,
des_t = NULL,
des_h = NULL,
des = des0[,-1],
method = "nlminb",
maxit = 10000)
# MLE
MLE_PH <- c(OPT_PH$OPT$par[1],exp(OPT_PH$OPT$par[2]),OPT_PH$OPT$par[-c(1:2)])
# Population survival function
spt_ph <- Vectorize(function(t){
theta_i <- as.vector(exp(des0 %*% MLE_PH[3:5]))
F_i <- 1 - exp(-chlnorm(t,MLE_PH[1],MLE_PH[2])*exp(des0[,-1]%*%MLE_PH[6:7]))
survs <- exp(-theta_i*F_i)
return(mean(survs))
})
plot(km$time, km$surv, type = "l", col = "black", lwd = 2, lty = 1,
ylim = c(0,1), xlab = "Time", ylab = "Survival")
curve(spt_ph,0,6, col = "red", add= T)
legend("topright", legend = c("KM","PH"), col = c("black","red"),
lwd = c(2,2))
# Model fitting
OPT_AFT <- PTCMMLE(init = c(0,0,0,0,0,0,0),
times = times,
status = status,
hstr = "AFT",
dist = "LogNormal",
des_theta = des0,
des_t = NULL,
des_h = NULL,
des = des0[,-1],
method = "nlminb",
maxit = 10000)
# MLE
MLE_AFT <- c(OPT_AFT$OPT$par[1],exp(OPT_AFT$OPT$par[2]),OPT_AFT$OPT$par[-c(1:2)])
# Population survival function
spt_aft <- Vectorize(function(t){
theta_i <- as.vector(exp(des0 %*% MLE_AFT[3:5]))
F_i <- 1 - exp(-chlnorm(as.vector(t*exp(des0[,-1]%*%MLE_AFT[6:7])),MLE_AFT[1],MLE_AFT[2]))
survs <- exp(-theta_i*F_i)
return(mean(survs))
})
plot(km$time, km$surv, type = "l", col = "black", lwd = 2, lty = 1,
ylim = c(0,1), xlab = "Time", ylab = "Survival")
curve(spt_aft,0,6, col = "red", add= T)
legend("topright", legend = c("KM","AFT"), col = c("black","red"),
lwd = c(2,2))
# Model fitting
OPT_AH <- PTCMMLE(init = c(0,0,0,0,0,0,0),
times = times,
status = status,
hstr = "AH",
dist = "LogNormal",
des_theta = des0,
des_t = des0[,-1],
des_h = NULL,
des = NULL,
method = "nlminb",
maxit = 10000)
MLE
## [1] -0.1641417 0.7747794 1.4431071
MLE_AH <- c(OPT_AH$OPT$par[1],exp(OPT_AH$OPT$par[2]),OPT_AH$OPT$par[-c(1:2)])
# Population survival function
spt_ah <- Vectorize(function(t){
theta_i <- as.vector(exp(des0 %*% MLE_AH[3:5]))
F_i <- 1 - exp(-chlnorm(as.vector(t*exp(des0[,-1]%*%MLE_AH[6:7])),MLE_AH[1],MLE_AH[2])*as.vector(exp(-des0[,-1]%*%MLE_AH[6:7])))
survs <- exp(-theta_i*F_i)
return(mean(survs))
})
plot(km$time, km$surv, type = "l", col = "black", lwd = 2, lty = 1,
ylim = c(0,1), xlab = "Time", ylab = "Survival")
curve(spt_ah,0,6, col = "red", add= T)
legend("topright", legend = c("KM","AH"), col = c("black","red"),
lwd = c(2,2))
# Model fitting
OPT_GH <- PTCMMLE(init = c(0,0,0,0,0,0,0,0,0),
times = times,
status = status,
hstr = "GH",
dist = "LogNormal",
des_theta = des0,
des_t = des0[,-1],
des_h = des0[,-1],
des = NULL,
method = "nlminb",
maxit = 10000)
# MLE
MLE_GH <- c(OPT_GH$OPT$par[1],exp(OPT_GH$OPT$par[2]),OPT_GH$OPT$par[-c(1:2)])
# Comparison: MLE vs true parameter values
cbind(MLE_GH, c(-0.25, 0.1,0.5, 0.5, 0.5,0.5, 0.5, 1.0, 1.0))
## MLE_GH
## [1,] -0.24967272 -0.25
## [2,] 0.09894309 0.10
## [3,] 0.49141439 0.50
## [4,] 0.50155037 0.50
## [5,] 0.49230559 0.50
## [6,] 0.50103911 0.50
## [7,] 0.49994272 0.50
## [8,] 1.00742913 1.00
## [9,] 1.02867217 1.00
# Population survival function
spt_gh <- Vectorize(function(t){
theta_i <- as.vector(exp(des0 %*% MLE_GH[3:5]))
F_i <- 1 - exp(-chlnorm(as.vector(t*as.vector(exp(des0[,-1]%*%MLE_GH[6:7]))),MLE_GH[1],MLE_GH[2])*as.vector(exp(des0[,-1]%*%MLE_GH[8:9] - des0[,-1]%*%MLE_GH[6:7])))
survs <- exp(-theta_i*F_i)
return(mean(survs))
})
plot(km$time, km$surv, type = "l", col = "black", lwd = 2, lty = 1,
ylim = c(0,1), xlab = "Time", ylab = "Survival")
curve(spt_gh,0,6, col = "red", add= T)
legend("topright", legend = c("KM","GH"), col = c("black","red"),
lwd = c(2,2))
# Confidence intervals for the parameters (positive parameters transformed to log-scale)
CI_GH <- Conf_Int(OPT_GH$log_lik, OPT_GH$OPT$par, level = 0.95)
kable(CI_GH, digits = 3)
| Lower | Upper | Transf MLE | Std. Error | |
|---|---|---|---|---|
| par1 | -0.253 | -0.247 | -0.250 | 0.002 |
| par2 | -2.330 | -2.296 | -2.313 | 0.009 |
| par3 | 0.464 | 0.519 | 0.491 | 0.014 |
| par4 | 0.471 | 0.532 | 0.502 | 0.015 |
| par5 | 0.462 | 0.523 | 0.492 | 0.016 |
| par6 | 0.497 | 0.505 | 0.501 | 0.002 |
| par7 | 0.496 | 0.504 | 0.500 | 0.002 |
| par8 | 0.952 | 1.063 | 1.007 | 0.028 |
| par9 | 0.972 | 1.085 | 1.029 | 0.029 |
# Akaike information criterion
AIC <- 2*OPT$OPT$objective + 2*length(OPT$OPT$par)
AIC_PH <- 2*OPT_PH$OPT$objective + 2*length(OPT_PH$OPT$par)
AIC_AFT <- 2*OPT_AFT$OPT$objective + 2*length(OPT_AFT$OPT$par)
AIC_AH <- 2*OPT_AH$OPT$objective + 2*length(OPT_AH$OPT$par)
AIC_GH <- 2*OPT_GH$OPT$objective + 2*length(OPT_GH$OPT$par)
# AICs: baseline, PH, AFT, AH, GH
AICs <- c(AIC, AIC_PH, AIC_AFT, AIC_AH, AIC_GH)
AICs
## [1] 20633.206 3262.401 -13473.418 -11954.774 -13987.182
which.min(AICs)
## [1] 5