Survival analysis after the diagnosis of cancer is an active research area in cancer epidemiology and a primary interest of many countries. Three typical frameworks adopted to model cancer survival data are: (i) the overall survival setting (where all-cause mortality is studied), (ii) the cause-specific setting (where the cause of death is known), and (iii) the relative survival setting (where the cause of death is unknown or unreliable). The overall survival setting is not the optimal choice for cancer epidemiology when the main interest is on comparing two (or more) populations (e.g. two countries, two periods in the same country, two groups with different deprivation levels within the same country and at the same period, and etcetera), since it is affected by other causes of mortality (which may differ for the populations of interest). In practice when using population-based cancer registry data, the cause of death is typically unavailable or the death certificates are not reliable. Thus, the cause-specific setting is not a reasonable choice either. The relative survival setting represents a useful alternative, where the mortality hazard associated to other causes is approximated by the population hazard \(h_P(\cdot;{\bf z})\), which is typically obtained from life tables based on the available sociodemographic characteristics \({\bf z}\) (e.g. sex, region and deprivation level) in addition to age and year of diagnosis.
In practice, a limitation of quantities derived in the relative survival setting (Perme, Stare, and Estève 2012), such as the excess hazard, is that patients need to be matched to groups of the general population sharing the available characteristics \({\bf z}\) in order to obtain the background mortality rates \(h_P(\cdot;{\bf z})\) from the life tables. And in many countries, the life tables are not stratified according to some important characteristics (e.g. deprivation). We propose an approach to deal with that situation (Rubio et al. 2019).
We refer the reader to references (Rubio et al. 2018) and (Rubio et al. 2019) for more details on these models.
In the paper associated to this tutorial (Rubio et al. 2019), we compare the classical excess hazard model (M1): \[ h_o(t;{\bf x}) = h_P(A+t;y+t;{\bf z}) + h_E(t;{\bf x}), \] to the single correction model (M2): \[ h_o^{C}(t;{\bf x}) = \gamma h_P(A+t;y+t;{\bf z}) + h_E(t;{\bf x}), \] and the frailty correction model (M3) \[ \tilde{h}_o(t;{\bf x}) = \dfrac{\mu \, h_P(A+t;y+t;{\bf z})}{1+b \left[H_P(A+t;y+t;{\bf z})-H_P(A;y;{\bf z})\right]} + h_E(t;{\bf x}). \] In these models, we adopt the general excess hazard model (GH) proposed in (Rubio et al. 2018): \[ h_{\text{E}}^{\text{GH}}\left(t;{\bf x}_i\right) = h_0\left(t \exp({\bf x}_i^T\beta_1)\right)\exp({\bf x}_i^T\beta_2), \] where the baseline hazard is modelled using the Exponentiated Weibull distribution (see The Exponentiated Weibull distribution for more details). The data are simulated as described in Simulation Scenario I in (Rubio et al. 2019), and the mismatch is induced by simulating a random correction given by \(\gamma \sim Gamma(6.5,10)\). This corresponds to the Wide Mismatch scenario described in (Rubio et al. 2019).
The following R code shows how to fit these models using a simulated data set. We can observe that not accounting for mismatches in the life tables leads to an overestimation of the excess hazard, while our approach based on model M3 allows to retrieve the simulated excess hazard (Rubio et al. 2019).
rm(list=ls())
# Required packages
library(numDeriv)
library(knitr)
library(compiler)
# Exponentiated Weibull Distribution
# Hazard
hexpw <- function(t,lambda,kappa,alpha){
pdf0 <- alpha*dweibull(t,scale=lambda,shape=kappa)*pweibull(t,scale=lambda,shape=kappa)^(alpha-1)
cdf0 <- pweibull(t,scale=lambda,shape=kappa)^alpha
cdf0 <- ifelse(cdf0==1,0.9999999,cdf0)
return(pdf0/(1-cdf0))
}
# Cumulative hazard
Hexpw <- function(t,lambda,kappa,alpha,log.p=FALSE){
cdf <- pweibull(t,scale=lambda,shape=kappa)^alpha
return(-log(1-cdf))
}
###########################################################################################
# Function to calculate the normal confidence intervals
# The parameters indicated with "index" are transformed to the real line using log()
###########################################################################################
# FUN : minus log-likelihood function to be used to calculate the confidence intervals
# MLE : maximum likelihood estimator of the parameters of interest
# level : confidence level
# index : position of the positive parameters under the original parameterisation
Conf.Int <- function(FUN,MLE,level=0.95){
sd.int <- abs(qnorm(0.5*(1-level)))
HESS <- hessian(FUN,x=MLE)
Fisher.Info <- solve(HESS, tol = 1e-18)
Sigma <- sqrt(diag(Fisher.Info))
U<- MLE + sd.int*Sigma
L<- MLE - sd.int*Sigma
C.I <- cbind(L,U,MLE, Sigma)
rownames(C.I)<- paste0("par", seq_along(1:length(MLE)))
colnames(C.I)<- c("Lower","Upper","Transf MLE", "Std. Error")
return(C.I)
}
# Reading the simulated data (available at: https://sites.google.com/site/fjavierrubio67/dataFGH.txt)
dat <- read.table("dataFGH.txt", header = FALSE)
# Difference of cumulative hazards
Hp.diff.new <- as.vector(dat[,5])
# Variables used in the modelling
hp.as = as.vector(dat[,4]) # expected rates
ind.cens = as.vector(dat[,6]) # censoring status: 0-alive, 1-dead
surv.time = as.vector(dat[,7]) # survival times
x2 = as.matrix(dat[,1:3]) # design matrix
colnames(x2) <- c("agec","sex","TTT") # names of the covariates
dim.x22 <- dim(x2)[2]
log.likewG <- cmpfun(function(par){
if(anyNA(par)) return(Inf)
ae0 <- exp(par[1]); be0 <- exp(par[2]); ce0=exp(par[3]); beta0 <- par[4:(4+dim(x2)[2]-1)]; beta1 <- tail(par,n = dim(x2)[2]);
exp.x.beta0 <- exp(x2%*%beta0)
exp.x.beta1 <- exp(x2%*%beta1)
exp.x.dif <- exp(x2%*%(beta1-beta0))
haz0 <- hp.as + hexpw(surv.time*exp.x.beta0,ae0,be0,ce0)*exp.x.beta1
CH0 <- Hexpw(surv.time*exp.x.beta0,ae0,be0,ce0)*exp.x.dif + Hp.diff.new
if(anyNA(CH0)) return(Inf)
if(anyNA(haz0)) return(Inf)
else val <- - ind.cens*log(haz0) + CH0
return(sum(val))
})
#----------------------------------------------------------------------------------------------------
# nlminb within Coordinate Descent algorithm
# Used to calculate a better initial point
#----------------------------------------------------------------------------------------------------
NCD <- 1 # Number of coordinate descent iterations
init.cd <- c(0,0,0.5,rep(0,3),rep(0,3)) # initial point for coordinate descent
for(j in 1:NCD){
#print(j)
for(i in 1:length(init.cd)){
tempf <- Vectorize(function(par){
val <- replace(init.cd, i, par)
return(log.likewG(val))
})
new.val <- nlminb(init.cd[i],tempf)$par
init.cd <- replace(init.cd, i, new.val )
}}
#----------------------------------------------------------------------------------------------------
# Optimisation
if(any(is.na(init.cd))) {
initG <- c(0,0,0.5,rep(0,3),rep(0,3))} else initG <- init.cd
OPTEWG1 <- nlminb(initG, log.likewG,control=list(iter.max=10000))
OPTEWG2 <- optim(initG, log.likewG,control=list(maxit=10000))
if(OPTEWG1$objective <= OPTEWG2$value){
MLEEWG <- c(exp(OPTEWG1$par[1:3]),OPTEWG1$par[-c(1:3)])
AICEWG <- 2*OPTEWG1$objective + 2*length(MLEEWG)
}
if(OPTEWG1$objective > OPTEWG2$value){
MLEEWG <- c(exp(OPTEWG2$par[1:3]),OPTEWG2$par[-c(1:3)])
AICEWG <- 2*OPTEWG2$value + 2*length(MLEEWG)
}
# MLE
kable(MLEEWG)
x |
---|
1.2702134 |
0.6050082 |
2.3015349 |
0.1177166 |
0.4059126 |
-0.0005506 |
0.0631782 |
0.2857123 |
0.1850393 |
# Confidence interval (the parameters of the Exponentiated Weibull baseline hazard are in the log-scale)
CI <- Conf.Int(log.likewG,c(log(MLEEWG[1:3]),MLEEWG[-c(1,2,3)]),level=0.95)
kable(CI)
Lower | Upper | Transf MLE | Std. Error | |
---|---|---|---|---|
par1 | -0.0025804 | 0.4809503 | 0.2391850 | 0.1233519 |
par2 | -0.6069121 | -0.3981146 | -0.5025133 | 0.0532656 |
par3 | 0.6591310 | 1.0080215 | 0.8335762 | 0.0890043 |
par4 | 0.1011707 | 0.1342625 | 0.1177166 | 0.0084420 |
par5 | 0.1082526 | 0.7035727 | 0.4059126 | 0.1518702 |
par6 | -0.3002885 | 0.2991873 | -0.0005506 | 0.1529303 |
par7 | 0.0599485 | 0.0664080 | 0.0631782 | 0.0016479 |
par8 | 0.2256502 | 0.3457745 | 0.2857123 | 0.0306445 |
par9 | 0.1249923 | 0.2450863 | 0.1850393 | 0.0306368 |
log.likEWC = function(par){
if(any(is.na(par))) return(Inf)
alpha <- exp(par[1]); ae0 <- exp(par[2]); be0 <- exp(par[3]); ce0 <- exp(par[4]); beta0 <- par[5:(5+dim(x2)[2]-1)]; beta1 <- tail(par,n = dim(x2)[2]);
exp.x.beta0 <- exp(x2%*%beta0)
exp.x.beta1 <- exp(x2%*%beta1)
exp.x.dif <- exp(x2%*%(beta1-beta0))
lhaz0= log(alpha*hp.as + hexpw(surv.time*exp.x.beta0,ae0,be0,ce0)*exp.x.beta1)
lSurv0 = -alpha*(Hp.diff.new) -Hexpw(surv.time*exp.x.beta0,ae0,be0,ce0)*exp.x.dif
val = sum(- ind.cens*lhaz0 -lSurv0)
ifelse(is.na(val),return(Inf),return(val))
}
# Optimisation
initC <- c(log(3),log(MLEEWG[c(1,2,3)]),MLEEWG[-c(1,2,3)])
OPTC1 = nlminb(initC,log.likEWC,control=list(iter.max=10000))
OPTC2 = optim(initC,log.likEWC,control=list(maxit=10000))
if(OPTC1$objective <= OPTC2$value){
MLEC <- c(exp(OPTC1$par[1:4]),OPTC1$par[-c(1:4)])
AICC <- 2*OPTC1$objective + 2*length(MLEC)
}
if(OPTC1$objective > OPTC2$value){
MLEC <- c(exp(OPTC2$par[1:4]),OPTC2$par[-c(1:4)])
AICC <- 2*OPTC2$value + 2*length(MLEC)
}
# MLE
kable(MLEC)
x |
---|
0.0000015 |
1.4203225 |
0.6687226 |
1.9736752 |
0.1187587 |
0.3935385 |
0.0210009 |
0.0654359 |
0.3077478 |
0.1681506 |
# Confidence interval
CIC <- Conf.Int(log.likEWC,c(log(MLEC[1:4]),MLEC[-c(1:4)]),level=0.95)
kable(CIC)
Lower | Upper | Transf MLE | Std. Error | |
---|---|---|---|---|
par1 | -1939.7829968 | 1913.0086780 | -13.3871594 | 982.8730796 |
par2 | 0.1726343 | 0.5291336 | 0.3508839 | 0.0909454 |
par3 | -0.4881770 | -0.3165949 | -0.4023860 | 0.0437717 |
par4 | 0.5411638 | 0.8186310 | 0.6798974 | 0.0707837 |
par5 | 0.1017150 | 0.1358024 | 0.1187587 | 0.0086959 |
par6 | 0.0867579 | 0.7003192 | 0.3935385 | 0.1565236 |
par7 | -0.2873398 | 0.3293416 | 0.0210009 | 0.1573196 |
par8 | 0.0624665 | 0.0684053 | 0.0654359 | 0.0015150 |
par9 | 0.2535104 | 0.3619851 | 0.3077478 | 0.0276726 |
par10 | 0.1138745 | 0.2224267 | 0.1681506 | 0.0276924 |
log.lik.E = function(par){
if(any(is.na(par))) return(Inf)
theta = exp(par[1]); kappa = exp(par[2])/exp(par[1]); ae0 = exp(par[3]); be0 = exp(par[4]); ce0 <- exp(par[5]);
beta0 <- par[6:(6+dim(x2)[2]-1)]; beta1 <- tail(par,n = dim(x2)[2]);
exp.x.beta0 <- exp(x2%*%beta0)
exp.x.beta1 <- exp(x2%*%beta1)
exp.x.dif <- exp(x2%*%(beta1-beta0))
haz0= kappa*theta*hp.as/(1+theta*Hp.diff.new) + hexpw(surv.time*exp.x.beta0,ae0,be0,ce0)*exp.x.beta1
log.Surv0 = -Hexpw(surv.time*exp.x.beta0,ae0,be0,ce0)*exp.x.dif - kappa*log(1+theta*Hp.diff.new)
val = sum(- ind.cens*log(haz0) - log.Surv0)
ifelse(is.na(val),return(Inf),return(val))
}
# Optimisation step
initE <- c(log(10),log(6.5),log(MLEEWG[c(1,2,3)]),MLEEWG[-c(1,2,3)])
OPTE1 = nlminb(initE,log.lik.E,control=list(iter.max=10000))
OPTE2 = optim(initE,log.lik.E,control=list(maxit=10000))
if(OPTE1$objective <= OPTE2$value){
MLEE <- c(exp(OPTE1$par[1:5]),OPTE1$par[-c(1:5)])
AICE <- 2*OPTE1$objective + 2*length(MLEE)
}
if(OPTE1$objective > OPTE2$value){
MLEE <- c(exp(OPTE2$par[1:5]),OPTE2$par[-c(1:5)])
AICE <- 2*OPTE2$value + 2*length(MLEE)
}
# MLE
kable(MLEE)
x |
---|
9.2479647 |
5.8119256 |
1.8308332 |
0.6255403 |
2.3633986 |
0.1042421 |
0.0805945 |
0.0920908 |
0.0490128 |
0.2013338 |
0.2620419 |
# Confidence interval
CIE <- Conf.Int(log.lik.E,c(log(MLEE[1:5]),MLEE[-c(1:5)]),level=0.95)
kable(CIE)
Lower | Upper | Transf MLE | Std. Error | |
---|---|---|---|---|
par1 | 1.0815227 | 3.3672843 | 2.2244035 | 0.5831132 |
par2 | 1.3147741 | 2.2050497 | 1.7599119 | 0.2271153 |
par3 | 0.1693553 | 1.0401870 | 0.6047711 | 0.2221550 |
par4 | -0.7166535 | -0.2216255 | -0.4691395 | 0.1262850 |
par5 | 0.5064020 | 1.2137993 | 0.8601007 | 0.1804618 |
par6 | 0.0740148 | 0.1344694 | 0.1042421 | 0.0154224 |
par7 | -0.3952150 | 0.5564040 | 0.0805945 | 0.2427644 |
par8 | -0.3298207 | 0.5140022 | 0.0920908 | 0.2152649 |
par9 | 0.0373661 | 0.0606595 | 0.0490128 | 0.0059423 |
par10 | 0.0942409 | 0.3084267 | 0.2013338 | 0.0546402 |
par11 | 0.1620441 | 0.3620396 | 0.2620419 | 0.0510202 |
# True values of the parameters
beta1 <- c(0.1,0.1,0.1)
beta2 <- c(0.05,0.2,0.25)
ae <- 1.75; be <- 0.6; ce <- 2.5;
# Comparison with the MLEs
MLES <- cbind(c(NA,NA,MLEEWG), c(NA,MLEC), MLEE, c(10,6.5,ae,be,ce,beta1,beta2))
colnames(MLES) <- c("M1", "M2", "M3", "True")
kable(MLES, digits = 2)
M1 | M2 | M3 | True |
---|---|---|---|
NA | NA | 9.25 | 10.00 |
NA | 0.00 | 5.81 | 6.50 |
1.27 | 1.42 | 1.83 | 1.75 |
0.61 | 0.67 | 0.63 | 0.60 |
2.30 | 1.97 | 2.36 | 2.50 |
0.12 | 0.12 | 0.10 | 0.10 |
0.41 | 0.39 | 0.08 | 0.10 |
0.00 | 0.02 | 0.09 | 0.10 |
0.06 | 0.07 | 0.05 | 0.05 |
0.29 | 0.31 | 0.20 | 0.20 |
0.19 | 0.17 | 0.26 | 0.25 |
# Comparison using AIC
AICS <- cbind( c("M1", "M2", "M3"), round(c(AICEWG, AICC, AICE),2))
kable(AICS, col.names = c("Model", "AIC"))
Model | AIC |
---|---|
M1 | 24879.34 |
M2 | 24878.27 |
M3 | 24866.1 |
# Comparison of the fitted excess baseline hazards
true.haz <- Vectorize(function(t) hexpw(t,ae,be,ce))
fit1 <- Vectorize(function(t) hexpw(t, MLEEWG[1], MLEEWG[2], MLEEWG[3]))
fit2 <- Vectorize(function(t) hexpw(t, MLEC[2], MLEC[3], MLEC[4]))
fit3 <- Vectorize(function(t) hexpw(t, MLEE[3], MLEE[4], MLEE[5]))
curve(true.haz, 0, 5, lwd = 2, xlab = "Time", ylab ="Baseline Excess Hazard", main = "Fitted Excess Hazards",
cex.axis = 1.5, cex.lab = 1.5, ylim= c(0.05,0.475), n = 1000)
curve(fit1, 0, 5, lwd = 2, lty = 2, col = "red",add = T, n = 1000)
curve(fit2, 0, 5, lwd = 2, lty = 2, col = "orange",add = T, n = 1000)
curve(fit3, 0, 5, lwd = 2, lty = 2, col = "blue",add = T, n = 1000)
legend(4,0.475, c("True","M1","M2","M3"),
text.col = c("black","red","orange","blue"), col = c("black", "red", "orange", "blue"), lty = c(1, 2, 2, 2), lwd=2,
merge = TRUE, bg = "white")
Perme, M.P., J. Stare, and J. Estève. 2012. “On Estimation in Relative Survival.” Biometrics 68: 113–20.
Rubio, F.J., B. Rachet, R. Giorgi, C. Maringe, and A. Belot. 2019. “On Models for the Estimation of the Excess Mortality Hazard in Case of Insufficiently Stratified Life Tables.” Biostatistics in press: na–na.
Rubio, F.J., L. Remontet, N.P. Jewell, and A. Belot. 2018. “On a General Structure for Hazard-Based Regression Models: An Application to Population-Based Cancer Research.” Statistical Methods in Medical Research in press: na–na.