Introduction

You will need these libraries.

library(data.table)
library(parallel)

In survival analysis you observe \((U_i, \delta_i)\) where \(U_i = \min(T_i, C_i)\) is the minimum of the survival time and censoring time and \(\delta_i = 1(T_i \le C_i)\) is an indicator of observing the event. If \(T_i\) and \(C_i\) are independent, the censoring is noninformative, and the likelihood for the right-censored data \((u_i, \delta_i)\), conditional on the parameter \(\lambda\), is

\[ \begin{aligned} L(\lambda) = \prod_{i=1}^n f(u_i|\lambda)^{\delta_i} \prod_{i=1}^n S(u_i|\lambda)^{(1-\delta_i)} \end{aligned} \]

To avoid numerical underflow, it is better to work with the log-likelihood

\[ \begin{aligned} l(\lambda) = \sum_{i=1}^n \delta_i \log f(u_i|\lambda) + \sum_{i=1}^n (1-\delta_i) \log S(u_i|\lambda) \end{aligned} \]

Note that \(f\) is the density function and the survival function \(S(t)\) gives the probability of surviving longer than time \(t\). In other words, observed event times contribute to the likelihood in proportion to their density, and censored times contribute in proportion to the probability of the event happening some time after the censoring time.
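
For example, if the event times are exponential with rate \(\lambda\), so that \(f(t|\lambda) = \lambda e^{-\lambda t}\) and \(S(t|\lambda) = e^{-\lambda t}\), the log-likelihood reduces to

\[ \begin{aligned} l(\lambda) = \left( \sum_{i=1}^n \delta_i \right) \log \lambda - \lambda \sum_{i=1}^n u_i \end{aligned} \]

which is maximised at \(\hat\lambda = \sum_i \delta_i / \sum_i u_i\), the number of observed events divided by the total time at risk. The exponential distribution is the one used in the examples below.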

Example 1

To be concrete, simulate independent exponential event and censoring times and construct \((U_i, \delta_i)\).

set.seed(1)
# event times (exponential, rate 1/5) and censoring times (rate 1/7, rounded)
t <- rexp(10, 1/5)
c <- round(rexp(10, 1/7))

# the observed time is the earlier of the event and censoring times
u <- t
d <- t <= c
u[!d] <- c[!d]
data.table(t, c, u, d)
##              t  c         u     d
##  1:  3.7759092 10 3.7759092  TRUE
##  2:  5.9082139  5 5.0000000 FALSE
##  3:  0.7285336  9 0.7285336  TRUE
##  4:  0.6989763 31 0.6989763  TRUE
##  5:  2.1803431  7 2.1803431  TRUE
##  6: 14.4748427  7 7.0000000 FALSE
##  7:  6.1478103 13 6.1478103  TRUE
##  8:  2.6984142  5 2.6984142  TRUE
##  9:  4.7828375  2 2.0000000 FALSE
## 10:  0.7352300  4 0.7352300  TRUE
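
Equivalently, the observed times can be constructed with pmin; this is just a check on the construction above, reusing t, c and u.

# same construction via the elementwise minimum
u2 <- pmin(t, c)
all.equal(u, u2)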

Now compute the log-likelihood assuming that all event times were observed. I shift and scale the result to lie between 0 and 1 so that the curves can be shown on the same plot.

lambda <- seq(1/20, 1/2, len = 5000)
ll1 <- sapply(seq_along(lambda), function(i){
  ld <- dexp(t, lambda[i], log = TRUE)
  sum(ld)
})
ll1 <- ll1 - min(ll1)
ll1 <- ll1/max(ll1)

Next, compute the log-likelihood for \((U, \delta)\). Note that you need to be careful when computing the log survival probability. Use lower.tail = FALSE (with log.p = TRUE) as below rather than 1 - pexp(...): the complement only makes sense on the non-logged scale, and log(1 - pexp(...)) can underflow when the survival probability is very small.

ll2 <- sapply(seq_along(lambda), function(i){
  ld1 <- dexp(u[d], lambda[i], log = TRUE)
  ld2 <- pexp(u[!d], lambda[i], lower.tail = FALSE, log.p = TRUE)
  sum(ld1, ld2)
})
ll2 <- ll2 - min(ll2)
ll2 <- ll2/max(ll2)
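
To see why this matters, compare the two approaches at a time far out in the tail (the value 4000 here is just an arbitrary illustration).

# log survival probability computed directly: exactly -0.2 * 4000 = -800
pexp(4000, 0.2, lower.tail = FALSE, log.p = TRUE)
# pexp(4000, 0.2) rounds to 1, so the complement is 0 and its log is -Inf
log(1 - pexp(4000, 0.2))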

And now plot them both. The values of lambda that maximise the two log-likelihoods based on the sample data are 0.24 and 0.23 respectively. The true population parameter is \(\lambda = 0.2\).

plot(lambda, ll1, type = "l")
lines(lambda, ll2, col = 2)
abline(v = lambda[which.max(ll1)])
abline(v = lambda[which.max(ll2)], col = 2)
legend("bottomright", legend = c("Event times", "Observed times"), lty = 1, col = 1:2)

Because there are only a few observations, the product of densities and survival probabilities does not underflow, so I can show the actual (unlogged) likelihood as well. This shows that there is more uncertainty when some of the events are censored, which makes sense. The values of lambda that maximise the two likelihoods are the same as before, namely 0.24 and 0.23 respectively.

l1 <- sapply(seq_along(lambda), function(i){
  ld <- dexp(t, lambda[i], log = FALSE)
  prod(ld)
})
l2 <- sapply(seq_along(lambda), function(i){
  ld1 <- dexp(u[d], lambda[i], log = FALSE)
  ld2 <- pexp(u[!d], lambda[i], lower.tail = FALSE, log.p = FALSE)
  prod(ld1, ld2)
})
l1 <- l1 / max(l1)
l2 <- l2 / max(l2)
plot(lambda, l1, type = "l")
lines(lambda, l2, col = 2)
abline(v = lambda[which.max(l1)])
abline(v = lambda[which.max(l2)], col = 2)
legend("bottomright", legend = c("Event times", "Observed times"), lty = 1, col = 1:2)

Example 2

Now consider the situation where the participants are observed at fixed times from their origin. This might be relevant for a clinical trial where we review participants' medical records at monthly intervals to see if they have been admitted to hospital. For the following example, assume that the timescale is in months, that the trial stops once we have 8 months of data on everyone, and that we are currently part way through the trial, this interim look having been triggered by reaching 24 enrolments. For each participant we will need to determine the time of their last review (with 8 reviews being the maximum).
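
The data generation below uses findInterval to count how many monthly reviews (at months 1 to 8) a participant has had by a given follow-up time. A quick illustration with some arbitrary follow-up times:

# 0.5 months: no reviews yet; 3.2 months: 3 reviews; 12 months: capped at 8 reviews
findInterval(c(0.5, 3.2, 12), 1:8)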

getdata <- function(n = 24, seed = NULL){
  if(!is.null(seed))  set.seed(seed)
  # event times
  t <- rexp(n, 1/3)
  # enrollment times
  e <- cumsum(c(0, rexp(n-1, 3/2)))
  # current time: this interim look is triggered by the last (24th) enrolment
  now <- max(e)
  # some will have had all their reviews, but others that have just 
  # enrolled will have had no reviews at this time
  c <- findInterval(now - e, 1:8) # month of last review, i.e. the censoring time
  # now whether we have observed an event is conditional on whether the
  # event had occurred by the time of the review or not
  d <- t < c
  u <- t
  u[!d] <- c[!d]
  data.table(enrolled_time = e,
             time_to_now = now - e,
             evttime = t, last_review = c, 
             obstime = u, evt_status = d)
}
dd <- getdata(seed = 1)
t <- dd$evttime
u <- dd$obstime
d <- dd$evt_status
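
A couple of quick sanity checks on this draw (output not shown): the proportion of participants with an observed event, and how many were enrolled so recently that they have had no reviews yet.

# proportion with an observed event
mean(d)
# participants with no completed reviews yet (censored at time zero)
sum(dd$last_review == 0)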

Compute the log-likelihoods based on the true event times and on the observed times.

lambda <- seq(1/20, 3/4, len = 5000)
ll1 <- sapply(seq_along(lambda), function(i){
  ld <- dexp(t, lambda[i], log = TRUE)
  sum(ld)
})
ll1 <- ll1 - min(ll1)
ll1 <- ll1/max(ll1)

ll2 <- sapply(seq_along(lambda), function(i){
  ld1 <- dexp(u[d], lambda[i], log = TRUE)
  ld2 <- pexp(u[!d], lambda[i], lower.tail = FALSE, log.p = TRUE)
  sum(ld1, ld2)
})
ll2 <- ll2 - min(ll2)
ll2 <- ll2/max(ll2)

The values of lambda that maximise the two log-likelihoods based on the sample data are 0.31 and 0.3 respectively. The true population value is \(\lambda = 1/3 \approx 0.33\).

plot(lambda, ll1, type = "l")
lines(lambda, ll2, col = 2)
abline(v = lambda[which.max(ll1)])
abline(v = lambda[which.max(ll2)], col = 2)
legend("bottomright", legend = c("Event times", "Observed times"), lty = 1, col = 1:2)

If we repeat the data generation many times we should, on average, be able to recover (or get closer to) the population parameter. Admittedly, this will be a bit rough because the optimisation is just based on a grid of values for \(\lambda\).
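
Rather than a grid, each fit could also use a one-dimensional optimiser. A minimal sketch with optimize, reusing u and d from the single dataset above:

# censored-data log-likelihood as a function of the rate
loglik <- function(l) {
  sum(dexp(u[d], l, log = TRUE)) +
    sum(pexp(u[!d], l, lower.tail = FALSE, log.p = TRUE))
}
# maximise over a plausible range of rates
optimize(loglik, interval = c(1e-8, 3), maximum = TRUE)$maximum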

Simulate the data generation process and obtain the value of \(\lambda\) that maximises the log-likelihood function.

# Make a bigger grid for lambda
lambda <- seq(1e-08, 3, len = 10000)
if(.Platform$OS.type == "unix") {
  # leave a few cores free, but always use at least one
  ncores <- max(1, parallel::detectCores() - 3)
} else {
  # mclapply cannot fork on Windows
  ncores <- 1
}
lres <- mclapply(X=1:1000, mc.cores=ncores, FUN=function(i) {
  dd <- getdata()
  t <- dd$evttime
  u <- dd$obstime
  d <- dd$evt_status
  
  ll1 <- sapply(seq_along(lambda), function(i) {
    ld <- dexp(t, lambda[i], log = TRUE)
    sum(ld)
  })
  ll2 <- sapply(seq_along(lambda), function(i) {
    ld1 <- dexp(u[d], lambda[i], log = TRUE)
    ld2 <- pexp(u[!d], lambda[i], lower.tail = FALSE, log.p = TRUE)
    sum(ld1, ld2)
  })
  c(lambda[which.max(ll1)], lambda[which.max(ll2)])
})
m1 <- do.call(rbind, lres)
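
The summary statistics quoted below can be computed directly from the matrix of estimates.

# means and medians of the estimates: event times (column 1), observed times (column 2)
colMeans(m1)
apply(m1, 2, median)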

Plot the densities of the simulated estimates based on the true event times and based on the observed times, for which censoring was determined by the last review time. The mean values of \(\lambda\) are 0.35 and 0.35 respectively, and the median values are 0.34 and 0.34.

plot(density(m1[, 1]), main = "Density for parameter estimates")
lines(density(m1[, 2]), col = 2)
legend("topright", legend = c("Event times", "Observed times"), lty = 1, col = 1:2)