Simulating relative survival data using the R package relsurvsim

Aline Campos Reis de Souza

2024-10-21

Datasets

colrec registry

To illustrate the usage of the implemented functions, we simulate relative survival data based on the colrec dataset, available on relsurv package. It contains survival data of 5971 patients from the Slovene Cancer Registry, diagnosed with colon and rectal cancer in 1994-2000.

library(relsurv)   
data(colrec)  
head(colrec)  
##   sex   age  diag time stat stage   site
## 1   1 23004 12656   16    0     1 rectum
## 2   2 12082 13388  504    0     3 rectum
## 3   1 24277 12711   22    0     3  colon
## 4   2 29256 13971 3998    0     1  colon
## 6   2 30416 12997    9    0    99  colon
## 7   1 15156 13814   88    0     2  colon

The variables included across the colrec dataset can be defined as follows:

slopop life table

slopop, a census dataset for the Slovene population, is also employed.

data(slopop)  

It is available on R in a ratetable format.

summary(slopop)
##  Rate table with 3 dimensions:
##  age ranges from 0 to 37619.82; with 104 categories
##  year ranges from -10957 to 21915; with 45 categories
##  sex has levels of: male female

Simulating relative survival data

Data manipulation

Demographic variables

We start by checking the demographic variables in the colrec dataset:

# Summary Age
summary(colrec$age)  
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4559   21973   24864   24565   27493   35325
# Summary Sex
table(colrec$sex)   
## 
##    1    2 
## 3289 2682
# Summary date of diagnosis
summary(colrec$diag)
##      First       Last   
##    "1Jan94" "30Dec2000"

We then check the attributes of the life table to ensure compatibility between the dataset and the reference table:

attributes(survexp.us)$dimnames      
## $age
##   [1] "0"   "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"   "10"  "11" 
##  [13] "12"  "13"  "14"  "15"  "16"  "17"  "18"  "19"  "20"  "21"  "22"  "23" 
##  [25] "24"  "25"  "26"  "27"  "28"  "29"  "30"  "31"  "32"  "33"  "34"  "35" 
##  [37] "36"  "37"  "38"  "39"  "40"  "41"  "42"  "43"  "44"  "45"  "46"  "47" 
##  [49] "48"  "49"  "50"  "51"  "52"  "53"  "54"  "55"  "56"  "57"  "58"  "59" 
##  [61] "60"  "61"  "62"  "63"  "64"  "65"  "66"  "67"  "68"  "69"  "70"  "71" 
##  [73] "72"  "73"  "74"  "75"  "76"  "77"  "78"  "79"  "80"  "81"  "82"  "83" 
##  [85] "84"  "85"  "86"  "87"  "88"  "89"  "90"  "91"  "92"  "93"  "94"  "95" 
##  [97] "96"  "97"  "98"  "99"  "100" "101" "102" "103" "104" "105" "106" "107"
## [109] "108" "109"
## 
## $sex
## [1] "male"   "female"
## 
## $year
##  [1] "1940" "1941" "1942" "1943" "1944" "1945" "1946" "1947" "1948" "1949"
## [11] "1950" "1951" "1952" "1953" "1954" "1955" "1956" "1957" "1958" "1959"
## [21] "1960" "1961" "1962" "1963" "1964" "1965" "1966" "1967" "1968" "1969"
## [31] "1970" "1971" "1972" "1973" "1974" "1975" "1976" "1977" "1978" "1979"
## [41] "1980" "1981" "1982" "1983" "1984" "1985" "1986" "1987" "1988" "1989"
## [51] "1990" "1991" "1992" "1993" "1994" "1995" "1996" "1997" "1998" "1999"
## [61] "2000" "2001" "2002" "2003" "2004" "2005" "2006" "2007" "2008" "2009"
## [71] "2010" "2011" "2012" "2013" "2014"

To use the implemented function in relsurvsim unit and classification of the demographic variables must be the same on dataset and ratetable. Therefore:

# Age must be given in years:
colrec$age <- colrec$age/364.24  
summary(colrec$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.52   60.33   68.26   67.44   75.48   96.98
# Sex must be classified as male/female
colrec$sex <- ifelse(colrec$sex == 1, "male", "female")
table(colrec$sex)
## 
## female   male 
##   2682   3289
# Year of diagnosis must be given in years
colrec$year <- format(as.Date(colrec$diag),"%Y")
head(colrec$year)
## [1] "1994" "1996" "1994" "1998" "1995" "1997"

Covariates

colrec$stage[colrec$stage == 99] <- NA
colrec$stage <- factor(colrec$stage)
head(colrec, n=5)   
##      sex      age  diag time stat stage   site year
## 1   male 63.15616 12656   16    0     1 rectum 1994
## 2 female 33.17044 13388  504    0     3 rectum 1996
## 3   male 66.65111 12711   22    0     3  colon 1994
## 4 female 80.32067 13971 3998    0     1  colon 1998
## 6 female 83.50538 12997    9    0  <NA>  colon 1995

Time response

We converted the time responses from days to years, just for convenience:

# Converting time in years (just for convenience)
colrec$time <- colrec$time/365.24 
colrec$stat <- ifelse(colrec$time >= 10, 0, colrec$stat)
colrec$time <- ifelse(colrec$time >= 10, 10, colrec$time)

An administrative censoring was also applied at 10 years:

colrec$stat <- ifelse(colrec$time >= 10, 0, colrec$stat)
colrec$time <- ifelse(colrec$time >= 10, 10, colrec$time)
# Mean of status variable
mean(colrec$stat)
## [1] 0.7340479

Finally, a column with the expected hazard is calculated (for later excess hazard estimation).

for(i in 1: nrow(colrec)){
  colrec$expected[i]<-slopop[colrec$age[i] + colrec$time[i],colrec$year[i], colrec$sex[i]]
}

We can visualize the distribution of the time variable:

library(ggplot2)
ggplot(colrec, aes(x = time)) +
  geom_histogram(binwidth = 1, fill = "lightcyan3", color = "gray70") +  
  geom_vline(aes(xintercept = 10, color = "Administrative censoring"), linetype = "dashed") +   
  labs(x = "Time from diagnosis", y = "Frequency", title = "Distribution of Observed Times") +  
  scale_color_manual(name = " ", values = "red") +   
  theme_minimal()+
  guides(color = guide_legend(position = "bottom", title = NULL, direction = "horizontal"))

Simulating relative survival data

Fitting a flexible parametric survival model

We fit a parametric excess hazard model to the colrec dataset with genWeibull() function.

fitGW <- genWeibull(formula = Surv(time, stat) ~ age + stage + site, 
                    data = colrec, 
                    expected = "expected")
## 
## Call:
## Surv(time, stat) ~ age + stage + site
## 
## Coefficients:
##                  Coefficients Standard.Error  z.value  Pr...z..    
## betas.age           0.0354157      0.0015271  23.1913 < 2.2e-16 ***
## betas.stage2        0.5113001      0.0514374   9.9402 < 2.2e-16 ***
## betas.stage3        2.1146068      0.0570588  37.0601 < 2.2e-16 ***
## betas.siterectum    0.1121397      0.0323541   3.4660 0.0005282 ***
## log.r              -5.1056265      0.1974432 -25.8587 < 2.2e-16 ***
## log.a              -2.6673326      0.1722444 -15.4857 < 2.2e-16 ***
## log.k              -0.1933600      0.0223846  -8.6381 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Log-likelihood = 28639.74

Defining the generating excess hazard function

The estimated coefficients can be used as the true parameters for data generation in order to mimic this registry. We can visualize the specified baseline hazard function using the function plot.baseline():

plot.baseline(dist="genWeibull", pars = list(r = exp(-5.11), a = exp(-2.66), k = exp(-0.19)))

Regression coefficients are defined in a named vector,

betas_1 <- c(age = 0.03, stage2 = 0.51, stage3 = 2.11, siterectum = 0.11)

The excess hazard function is defined using the same syntax of simsurv package.

excess.hazard_1 <- function(t, x, betas) {
# Baseline hazard (Generalized weibull distribution)
  r = exp(-5.1)
  a = exp(-2.7)
  k = exp(-0.19)
  log.h <- log(k) + k * log(r) + (k - 1) * log(t) - log(1 + ((r * t)^k)/a)
# Covariates: Linear proportional hazard at age, stage ang site
  Xbeta <- x[["age"]] * betas[["age"]] +
           x[["stage2"]] * betas[["stage2"]] +
           x[["stage3"]] * betas[["stage3"]] +
           x[["siterectum"]] * betas[["siterectum"]]

  return(exp(log.h + Xbeta))
}

Data simulation

Relative survival data is simulated with relsurvsim() function:

simdata_1 <- 
  relsurvsim(n = 1000,
             formula = Surv(time, stat) ~ age + stage + site,
             excess.hazard = excess.hazard_1,
             data = colrec,
             map.names = list(age = "age", sex = "sex", year = "year"),
             ratetable = slopop,
             age_var = "age",
             time.scale = "years",
             betas = betas_1,
             censor.control = list(dist = "exponential", pars = 20, admCensor = 10))
## Simulating cancer-specific times...
## Simulating expected event times...
## Simulating censoring times...

It generates the following outputs:

names(simdata_1)  
## [1] "cancer.specific.setting"   "cause.specific.setting"   
## [3] "relative.survival.setting" "expected.times"           
## [5] "specific.times"            "age_var"                  
## [7] "admCens"                   "data"

We can visualize the simulated data using the plot() function. If type = histogram the distribution of the simulated data and the real data are compared to check the generation process.

plot(simdata_1, type = "histogram")   

A more general overview can be accessed setting type = panel:

plot(simdata_1, type = "panel")   

Controling the censoring proportion in one replicate

It is possible to generate right-censored observations, applying administrative censoring or random censoring from prespecified distributions.

simdata_2 <- 
  relsurvsim(n = 1000,
             formula = Surv(time, stat) ~ age + stage + site,
             excess.hazard = excess.hazard_1,
             data = colrec,
             map.names = list(age = "age", sex = "sex", year = "year"),
             ratetable = slopop,
             age_var = "age",
             time.scale = "years",
             betas = betas_1,
             censor.control = list(dist = "exponential", pars = 200, admCensor = 40), 
             progress=FALSE)
plot(simdata_2, type = "panel")   

plot.censor(dist = "uniform", pars = c(1.5, 3.5), admCensor = 10)

plot.censor(dist="exponential", pars=1, admCensor=10)

plot.censor(dist = "weibull", pars = c(4, 6), admCensor = 10)

For example, let’s generate relative survival data, for which censored observations are more recurrent in the beginning of the study:

simdata_3 <- 
  relsurvsim(n = 1000,
             formula = Surv(time, stat) ~ age + stage + site,
             excess.hazard = excess.hazard_1,
             data = colrec,
             map.names = list(age = "age", sex = "sex", year = "year"),
             ratetable = slopop,
             age_var = "age",
             time.scale = "years",
             betas = betas_1,
             censor.control = list(dist = "exponential", pars = 1, admCensor = 40), 
             progress=FALSE)
plot(simdata_3, type = "panel")   

Controling the proportion of competing risks events in one replicate

The proportion of competing risks event can be controled by demographic covariates. For example, age effect in the excess hazard is positive. It implies that that lower the age, the lower the proportion of cancer deaths.

simdata_4 <- 
  relsurvsim(n = 100,
             formula = Surv(time, stat) ~ age + stage + site,
             excess.hazard = excess.hazard_1,
             data = colrec[colrec$age<40,],
             map.names = list(age = "age", sex = "sex", year = "year"),
             ratetable = slopop,
             age_var = "age",
             time.scale = "years",
             betas = betas_1,
             censor.control = list(dist = "exponential", pars = 200, admCensor = 80), 
             progress=FALSE)
plot(simdata_4, type = "panel")   

Generating replicates

We can generate replicates of relative survival data and apply it to pre-specified functions, using run_replicates(). For example, if we want to evaluate the proportion of censoring observation in R replicates, we can:

  1. specify the parameters of simulated data:
params<- list(formula=Surv(time, stat)~ age+stage+site,
              excess.hazard = excess.hazard_1,
              data = colrec,
              map.names=list(age = "age", sex = "sex", year = "year"),
              ratetable = slopop,
              age_var = "age",
              time.scale = "years",
              betas = betas_1,
              censor.control = list(dist = "exponential", pars = 200, admCensor = 80)
)
  1. Define a function, that will be applied on this simulated data and from which we extract the proportion of censoring:
censor_prop <-function(data){
    out<- mean(data$status)
     return(list(out=out))
}
  1. Generate the replicates. The output is a matrix where the columns are the outputs of the functions and the lines are the replicates.
censor_propReplicate <- run_replicates(R=3, n=1000, censor_prop, params, seed = 1245, setting = "relative_survival")
as.numeric(censor_propReplicate)
## [1] 0.954 0.966 0.961
mean(as.numeric(censor_propReplicate))
## [1] 0.9603333