relsurvsimcolrec registryTo illustrate the usage of the implemented functions, we simulate
relative survival data based on the colrec dataset,
available on relsurv package. It contains survival data of
5971 patients from the Slovene Cancer Registry, diagnosed with colon and
rectal cancer in 1994-2000.
## sex age diag time stat stage site
## 1 1 23004 12656 16 0 1 rectum
## 2 2 12082 13388 504 0 3 rectum
## 3 1 24277 12711 22 0 3 colon
## 4 2 29256 13971 3998 0 1 colon
## 6 2 30416 12997 9 0 99 colon
## 7 1 15156 13814 88 0 2 colon
The variables included across the colrec dataset can be
defined as follows:
sex: Patient’s sex (1 = male, 2 = female).
age: Age of the patient in days.
diag: Date of diagnosis (in date format).
time: Survival time in days.
stat: Censoring indicator (0 = censored, 1 =
death).
stage: Cancer stage. Values 1-3; code 99 stands for
unknown.
site: Cancer site.
slopop life tableslopop, a census dataset for the Slovene population, is
also employed.
It is available on R in a ratetable format.
## Rate table with 3 dimensions:
## age ranges from 0 to 37619.82; with 104 categories
## year ranges from -10957 to 21915; with 45 categories
## sex has levels of: male female
We start by checking the demographic variables in the
colrec dataset:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4559 21973 24864 24565 27493 35325
##
## 1 2
## 3289 2682
## First Last
## "1Jan94" "30Dec2000"
We then check the attributes of the life table to ensure compatibility between the dataset and the reference table:
## $age
## [1] "0" "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11"
## [13] "12" "13" "14" "15" "16" "17" "18" "19" "20" "21" "22" "23"
## [25] "24" "25" "26" "27" "28" "29" "30" "31" "32" "33" "34" "35"
## [37] "36" "37" "38" "39" "40" "41" "42" "43" "44" "45" "46" "47"
## [49] "48" "49" "50" "51" "52" "53" "54" "55" "56" "57" "58" "59"
## [61] "60" "61" "62" "63" "64" "65" "66" "67" "68" "69" "70" "71"
## [73] "72" "73" "74" "75" "76" "77" "78" "79" "80" "81" "82" "83"
## [85] "84" "85" "86" "87" "88" "89" "90" "91" "92" "93" "94" "95"
## [97] "96" "97" "98" "99" "100" "101" "102" "103" "104" "105" "106" "107"
## [109] "108" "109"
##
## $sex
## [1] "male" "female"
##
## $year
## [1] "1940" "1941" "1942" "1943" "1944" "1945" "1946" "1947" "1948" "1949"
## [11] "1950" "1951" "1952" "1953" "1954" "1955" "1956" "1957" "1958" "1959"
## [21] "1960" "1961" "1962" "1963" "1964" "1965" "1966" "1967" "1968" "1969"
## [31] "1970" "1971" "1972" "1973" "1974" "1975" "1976" "1977" "1978" "1979"
## [41] "1980" "1981" "1982" "1983" "1984" "1985" "1986" "1987" "1988" "1989"
## [51] "1990" "1991" "1992" "1993" "1994" "1995" "1996" "1997" "1998" "1999"
## [61] "2000" "2001" "2002" "2003" "2004" "2005" "2006" "2007" "2008" "2009"
## [71] "2010" "2011" "2012" "2013" "2014"
To use the implemented function in relsurvsim unit and
classification of the demographic variables must be the
same on dataset and ratetable. Therefore:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.52 60.33 68.26 67.44 75.48 96.98
# Sex must be classified as male/female
colrec$sex <- ifelse(colrec$sex == 1, "male", "female")
table(colrec$sex)##
## female male
## 2682 3289
# Year of diagnosis must be given in years
colrec$year <- format(as.Date(colrec$diag),"%Y")
head(colrec$year)## [1] "1994" "1996" "1994" "1998" "1995" "1997"
## sex age diag time stat stage site year
## 1 male 63.15616 12656 16 0 1 rectum 1994
## 2 female 33.17044 13388 504 0 3 rectum 1996
## 3 male 66.65111 12711 22 0 3 colon 1994
## 4 female 80.32067 13971 3998 0 1 colon 1998
## 6 female 83.50538 12997 9 0 <NA> colon 1995
We converted the time responses from days to years, just for convenience:
# Converting time in years (just for convenience)
colrec$time <- colrec$time/365.24
colrec$stat <- ifelse(colrec$time >= 10, 0, colrec$stat)
colrec$time <- ifelse(colrec$time >= 10, 10, colrec$time)An administrative censoring was also applied at 10 years:
colrec$stat <- ifelse(colrec$time >= 10, 0, colrec$stat)
colrec$time <- ifelse(colrec$time >= 10, 10, colrec$time)
# Mean of status variable
mean(colrec$stat)## [1] 0.7340479
Finally, a column with the expected hazard is calculated (for later excess hazard estimation).
for(i in 1: nrow(colrec)){
colrec$expected[i]<-slopop[colrec$age[i] + colrec$time[i],colrec$year[i], colrec$sex[i]]
}We can visualize the distribution of the time variable:
library(ggplot2)
ggplot(colrec, aes(x = time)) +
geom_histogram(binwidth = 1, fill = "lightcyan3", color = "gray70") +
geom_vline(aes(xintercept = 10, color = "Administrative censoring"), linetype = "dashed") +
labs(x = "Time from diagnosis", y = "Frequency", title = "Distribution of Observed Times") +
scale_color_manual(name = " ", values = "red") +
theme_minimal()+
guides(color = guide_legend(position = "bottom", title = NULL, direction = "horizontal"))We fit a parametric excess hazard model to the colrec
dataset with genWeibull() function.
fitGW <- genWeibull(formula = Surv(time, stat) ~ age + stage + site,
data = colrec,
expected = "expected")##
## Call:
## Surv(time, stat) ~ age + stage + site
##
## Coefficients:
## Coefficients Standard.Error z.value Pr...z..
## betas.age 0.0354157 0.0015271 23.1913 < 2.2e-16 ***
## betas.stage2 0.5113001 0.0514374 9.9402 < 2.2e-16 ***
## betas.stage3 2.1146068 0.0570588 37.0601 < 2.2e-16 ***
## betas.siterectum 0.1121397 0.0323541 3.4660 0.0005282 ***
## log.r -5.1056265 0.1974432 -25.8587 < 2.2e-16 ***
## log.a -2.6673326 0.1722444 -15.4857 < 2.2e-16 ***
## log.k -0.1933600 0.0223846 -8.6381 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-likelihood = 28639.74
The estimated coefficients can be used as the true
parameters for data generation in order to mimic this registry.
We can visualize the specified baseline hazard function using the
function plot.baseline():
Regression coefficients are defined in a named vector,
The excess hazard function is defined using the same syntax of
simsurv package.
excess.hazard_1 <- function(t, x, betas) {
# Baseline hazard (Generalized weibull distribution)
r = exp(-5.1)
a = exp(-2.7)
k = exp(-0.19)
log.h <- log(k) + k * log(r) + (k - 1) * log(t) - log(1 + ((r * t)^k)/a)
# Covariates: Linear proportional hazard at age, stage ang site
Xbeta <- x[["age"]] * betas[["age"]] +
x[["stage2"]] * betas[["stage2"]] +
x[["stage3"]] * betas[["stage3"]] +
x[["siterectum"]] * betas[["siterectum"]]
return(exp(log.h + Xbeta))
}Relative survival data is simulated with relsurvsim()
function:
simdata_1 <-
relsurvsim(n = 1000,
formula = Surv(time, stat) ~ age + stage + site,
excess.hazard = excess.hazard_1,
data = colrec,
map.names = list(age = "age", sex = "sex", year = "year"),
ratetable = slopop,
age_var = "age",
time.scale = "years",
betas = betas_1,
censor.control = list(dist = "exponential", pars = 20, admCensor = 10))## Simulating cancer-specific times...
## Simulating expected event times...
## Simulating censoring times...
It generates the following outputs:
## [1] "cancer.specific.setting" "cause.specific.setting"
## [3] "relative.survival.setting" "expected.times"
## [5] "specific.times" "age_var"
## [7] "admCens" "data"
We can visualize the simulated data using the plot()
function. If type = histogram the distribution of the
simulated data and the real data are compared to check the generation
process.
A more general overview can be accessed setting
type = panel:
It is possible to generate right-censored observations, applying administrative censoring or random censoring from prespecified distributions.
admCensor: let’s decrease the proportion of censored times
by increasing the administrative censoring .simdata_2 <-
relsurvsim(n = 1000,
formula = Surv(time, stat) ~ age + stage + site,
excess.hazard = excess.hazard_1,
data = colrec,
map.names = list(age = "age", sex = "sex", year = "year"),
ratetable = slopop,
age_var = "age",
time.scale = "years",
betas = betas_1,
censor.control = list(dist = "exponential", pars = 200, admCensor = 40),
progress=FALSE)plot.censor() function,
setting uniform, exponential or
weibull distributions.For example, let’s generate relative survival data, for which censored observations are more recurrent in the beginning of the study:
simdata_3 <-
relsurvsim(n = 1000,
formula = Surv(time, stat) ~ age + stage + site,
excess.hazard = excess.hazard_1,
data = colrec,
map.names = list(age = "age", sex = "sex", year = "year"),
ratetable = slopop,
age_var = "age",
time.scale = "years",
betas = betas_1,
censor.control = list(dist = "exponential", pars = 1, admCensor = 40),
progress=FALSE)The proportion of competing risks event can be controled by
demographic covariates. For example, age effect in the
excess hazard is positive. It implies that that lower
the age, the lower the proportion of cancer
deaths.
simdata_4 <-
relsurvsim(n = 100,
formula = Surv(time, stat) ~ age + stage + site,
excess.hazard = excess.hazard_1,
data = colrec[colrec$age<40,],
map.names = list(age = "age", sex = "sex", year = "year"),
ratetable = slopop,
age_var = "age",
time.scale = "years",
betas = betas_1,
censor.control = list(dist = "exponential", pars = 200, admCensor = 80),
progress=FALSE)We can generate replicates of relative survival data and apply it to
pre-specified functions, using run_replicates(). For
example, if we want to evaluate the proportion of censoring observation
in R replicates, we can:
params<- list(formula=Surv(time, stat)~ age+stage+site,
excess.hazard = excess.hazard_1,
data = colrec,
map.names=list(age = "age", sex = "sex", year = "year"),
ratetable = slopop,
age_var = "age",
time.scale = "years",
betas = betas_1,
censor.control = list(dist = "exponential", pars = 200, admCensor = 80)
)censor_propReplicate <- run_replicates(R=3, n=1000, censor_prop, params, seed = 1245, setting = "relative_survival")
as.numeric(censor_propReplicate)## [1] 0.954 0.966 0.961
## [1] 0.9603333