Motivation

In many scenarios, we may be interested in simulating survival times from a population based on the corresponding life tables. For instance, in relative survival analysis (Rubio et al. 2019) (Eletti et al. 2022) simulation studies require the simulation of times to event based on life tables. These life tables are stratified on certain characteristics such as age, year, and deprivation level. The life table defines a piecewise-constant rate hazard function, for each combination of characteristics, which can be used to simulate the times to event.

The SimLT R package allows for simulating times to event, based on the information in a life table.

The following R code presents a step-by-step illustrative example om the simulation of times to event based on life tables from England for the years 2010-2015. The SimLT R package requires functions from the R packages msm and lubridate, which I decided not to include as dependencies. Thus, the user needs to install them and load them manually.

See also: GHSurv, LBANS

Illustrative example: R code

In this example we will simulate \(n=1000\) times to event using a life table from England based on year, sex, deprivation level, age, English region (gor). The life table can be downloaded from SimLT.

Data preparation

rm(list=ls())

# Required packages
#library(devtools)
#install_github("FJRubio67/SimLT")
library(SimLT)
library(msm)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
#------------------------------------------------------------------------
# Life tables
#------------------------------------------------------------------------
LT <- as.data.frame(read.table("England_LT_2010_2015_dep_gor.txt",header = T))
head(LT)
##   X_year sex dep age      rate gor
## 1   2010   1   9   0 0.0047365  10
## 2   2011   1   9   0 0.0046188  10
## 3   2012   1   9   0 0.0046188  10
## 4   2013   1   9   0 0.0046188  10
## 5   2014   1   9   0 0.0046188  10
## 6   2015   1   9   0 0.0046188  10

Simulated design matrix

We now simulate the individuals’ characteristics using the built-in function simDesMatrix. These characteristics will be used to simulate the times to event. This function is based on real data on colon and lung (male and female) patients in England. You can use your own design matrix instead.

#------------------------------------------------------------------------
# Design matrix simulation 
#------------------------------------------------------------------------
# Sample size
n <- 1000
# Design matrix
Xcf <- simDesMatrix(seed = 123, n = n, admin.cens = "31-12-2015", scale.age = FALSE,
                    site = "colon", sex = "female")
head(Xcf)
##   ID  date.diag dep gor      age fup.time
## 1  1 2010-05-11   2   2 80.09770 5.639973
## 2  2 2010-12-21   2   3 90.00683 5.026694
## 3  3 2010-03-31   1   2 76.93449 5.752225
## 4  4 2010-03-11   3   6 92.35372 5.806982
## 5  5 2010-12-17   3   6 67.86141 5.037645
## 6  6 2010-05-13   4   7 28.95104 5.634497

Follow-up period

Define the follow-up period. This requires commands from the R package lubridate.

#------------------------------------------------------------------------
# Follow-up period
#------------------------------------------------------------------------
# fixed points (new year and last day of follow-up)
year.init <- 2010
last.day <- as.Date("2015-12-31")
fixed.times <- c(seq(as.Date("2010-01-01"), as.Date("2015-01-01"), by = "year"),last.day)

Time points and hazard rates

Now, we need to define the time points that define the step function associated to the piecewise-constant rate hazard function.

#------------------------------------------------------------------------
# Time points and hazard rates
#------------------------------------------------------------------------

# function to calculate ages at time points
ages <- Vectorize(function(dates){
  out <- time_length(difftime(as.Date(dates), as.Date(dob)), "years")
  return(out)
})

# Variables to be used from the life table
MLT <- LT[,c("X_year","sex","dep","age","gor")]

# Initialising times, dates, and hrates
times <- dates <- hrates <- list()

# sex
sex0 <- 2

# Extracting the hazard rates from the life tables for each patient at "age + t"

#pb = txtProgressBar(min = 0, max = n, initial = 1) # uncomment to track progress
for(i in 1:n){
  # Date of birth
  dob <- as.POSIXlt(as.Date(Xcf[i,2]) - Xcf[i,5]*365.25)
  if(day(dob)==29 & month(dob)==2) day(dob) <- 1; month(dob) <- 3
  # Birthday on 2010
  bday1 <- dob; year(bday1) <- year.init; bday1 <-as.Date(bday1);
  # Date of diagnosis
  dod <- as.Date(Xcf[i,2])
  # Variable times (Birthdays + date of diagnosis)
  var.times <- c(seq(as.Date(bday1),last.day, by = "year"),dod)
  # Unique sorted time points
  time.points <- sort(unique(c(fixed.times,var.times,as.Date(last.day))))
  # Age at each time point
  age.tp <- ages(time.points)
  #*********************************************************
  # Removing dates before the date of diagnosis
  #*********************************************************
  ind <- length(which(time.points<=dod))
  if(ind == 1 ){
    # Length of times in days between time points
    times[[i]] <- c(0,as.numeric(diff(time.points)))
    # Length of times in days between time points
    dates[[i]] <- time.points
    # Age at each time point (integer part)
    age.tp <- floor(age.tp)
  }
  if(ind > 1 ){
    # Length of times in days between time points
    times[[i]] <- c(0,as.numeric(diff(time.points))[-c(1:(ind-1))])
    # Length of times in days between time points
    dates[[i]] <- time.points[-c(1:(ind-1))]
    # Age at each time point (integer part)
    age.tp <- floor(age.tp[-c(1:(ind-1))])
  }
  MAT.ID <- cbind(1:length(age.tp),year(dates[[i]]),rep(sex0,length(age.tp)), rep(Xcf[i,"dep"],length(age.tp)),
                  age.tp,rep(Xcf[i,"gor"],length(age.tp)))
  MAT.ID <- as.data.frame(MAT.ID)
  colnames(MAT.ID) <- c("index","X_year", "sex", "dep", "age", "gor")
  hrates[[i]] <- merge(x = MAT.ID, y = LT, by = colnames(MAT.ID)[-1], all.x = TRUE, sort = FALSE)
  hrates[[i]] <- hrates[[i]][order(hrates[[i]]$index),]
  hrates[[i]] <- hrates[[i]]$rate
#  setTxtProgressBar(pb,i) # uncomment to track progress
}

# List containing the simulated survival times, hazard rates, and last date of follow-up
Xcf_rates <- list(times = times, hrates = hrates, dates = date)

Simulating the times to event

Now, we simulate the times to event using the command sim_pophaz. This requires a seed for the simulation, and a list with time points at which the piecewise hazard jumps (times) and the values of the piecewise hazard (hrates). This list was constructed in the previous subsection.

##################################################################################################
# Population survival times
##################################################################################################
simC <- sim_pophaz(seed = 123,lst = Xcf_rates)
sim.pop <- simC$sim.pop
status.pop <- simC$status

# Censoring
table(status.pop)
## status.pop
##   0   1 
## 797 203
library(survival)

# Kaplan-Meier estimator for the simulated times
kmsim <- survfit(Surv(sim.pop, status.pop) ~ 1)

# Comparison
plot(kmsim$time, kmsim$surv, type = "l", col = "black", lwd = 2, lty = 1, ylim = c(0,1),
     xlab = "Time", ylab = "Survival", cex.axis = 1.5, cex.lab = 1.5, main = "")

Eletti, A., G. Marra, M. Quaresma, R. Radice, and F. J. Rubio. 2022. “A Unifying Framework for Flexible Excess Hazard Modelling with Applications in Cancer Epidemiology.” Journal of the Royal Statistical Society: Series C NA: in press.
Rubio, F. J., L. Remontet, N. P. Jewell, and A. Belot. 2019. “On a General Structure for Hazard-Based Regression Models: An Application to Population-Based Cancer Research.” Statistical Methods in Medical Research 28: 2404–17.