This section contains the setup and the various utility functions used throughout.
Libraries used:
library(rstan)
library(survival)
library(flexsurv)
library(data.table)
# See https://github.com/maj-biostat/pwexp
library(pwexp)
library(splines2)
library(muhaz)
rstan::rstan_options(auto_write = TRUE)
Sys.setenv(LOCAL_CPPFLAGS = '-march=native')
options(mc.cores = 1)
Compiled code (any models used are shown later):
mod1 <- rstan::stan_model("../stan/surv_mspline4.stan", verbose = FALSE)
Commonly, survival data look like this
set.seed(1)
N <- 200
bins = c(0, 50, 150, 250)
rate = c(1/200, 1/100, 1/75, 1/50)
d <- data.table(
id = 1:(2*N),
y = c(pwexp::rpw(N, bins, rate),
pwexp::rpw(N, bins, 2*rate)),
trt = rep(0:1, N)
)
# Observations censored after 200 days
d[, evt := as.numeric(y < 200)]
d[y >= 200, y := 200]
head(d)
## id y trt evt
## 1: 1 155.70809 0 1
## 2: 2 123.85287 1 1
## 3: 3 80.71255 0 1
## 4: 4 19.25642 1 1
## 5: 5 176.32975 0 1
## 6: 6 21.43027 1 1
where the data contain an id
which may represent an individual person, but could also be a manufactured product like a lightbulb or a car engine. Here we will assume that id
refers to a participant in a clinical trial. The y
represents the time (e.g. days) until an event of interest occurred, such as death, trt
represents some experimental intervention and evt
represents whether we observed the event of interest or whether the experiment finished prior to us being able to observe the event. In the latter case, we have right censoring; we know that the event did not happen up to the censoring time (200 days), but we do not know when it will happen.
However, in some situations, e.g. when using some piecewise survival models, the data needs to be transformed into a counting process format. In this form, the data for each participant time is divided into \(J\) non-overlapping segments \((0, s_1], (s_1, s_2], \dots, (s_{J-1}, s_J]\). The survSplit
function in the survival
package enables us to do the transformation.
# Just create some arbitrary segment cut points for demo.
dbin <- data.table(
id_timept = 1:5,
bins = c(10, 20, 50, 100, 150)
)
d2 <- data.table(
survSplit(Surv(y, evt) ~ trt,
start="tstart", end="tstop", id = "id",
data=d, cut=dbin$bins)
)
d2 <- merge(d2, dbin, by.x = "tstop", by.y = "bins" )
setkey(d2, id)
setcolorder(d2, c("id", "id_timept", "tstart", "tstop", "evt"))
d2[id %in% c(1, 2, 3, 10), ]
## id id_timept tstart tstop evt trt
## 1: 1 1 0 10 0 0
## 2: 1 2 10 20 0 0
## 3: 1 3 20 50 0 0
## 4: 1 4 50 100 0 0
## 5: 1 5 100 150 0 0
## 6: 2 1 0 10 0 1
## 7: 2 2 10 20 0 1
## 8: 2 3 20 50 0 1
## 9: 2 4 50 100 0 1
## 10: 3 1 0 10 0 0
## 11: 3 2 10 20 0 0
## 12: 3 3 20 50 0 0
## 13: 10 1 0 10 0 1
## 14: 10 2 10 20 0 1
## 15: 10 3 20 50 0 1
## 16: 10 4 50 100 0 1
## 17: 10 5 100 150 0 1
Notice how each individual now has multiple records for each of the segments that were defined. Also note that the last segment will be either 1 if the event was observed or 0 if the event was not observed up to the maximum possible time (censoring time 200 days).
A related function is tmerge
, which is useful if you are introducing time dependent covariates.