BIDA in Stan: Piecewise constant hazard Cox

References

Background

The proportional hazards (PH) model (Cox 1972) is a very popular regression method for survival data. The popularity likely originated in not having to specify the (baseline) hazard function. These days, it is popular probably because it is the only survival regression model that many applied researchers are aware of. In Frequentist PH model, parameter estimation is conducted via partial likelihood from which the baseline hazard function has dropped out (BSA p16).

In Bayesian paradigm, obtaining the posterior distribution of parameters requires the full likelihood function involving all parameters including nuisance ones and the prior for all these parameters. In the case of survival analysis, one of the parameters that requires modeling is the entire baseline hazard function. One way to proceed is to parametrize the baseline hazard function parsimoniously, i.e., parametric survival analysis (BIDA p325, BSA chap 2). The other approach is to more flexibly model the baseline hazard function.

Here we will examine the simplest form of the latter, the piecewise constant hazard model (piecewise exponential model).

Piecewise constant hazard model

Likelihood

This model formulation was taken from BSA (p47-).

Firstly, partition the time axis into $J$ intervals using $0 < s_{1} < s_{2} ... < s_{J} < \infty$.

\[(0,s_{1}], (s_{1},s_{2}], ..., (s_{J-1},s_{J}]\]

$s_{J}$ is a finite value that has to be larger than the largest observed time in the study. Name each interval $I_{j} = (s_{j-1},s_{j}]$. We assume a constant hazard $\lambda$ within each interval. Let $D = (n,\mathbf{y}, X, \nu)$ describe the observed data.

$\mathbf{y} = (y_{1}, ..., y_{n})^{T}$ is the observed times.

$X$ is a $n \times p$ matrix of covariates associated with a length $p$ vector $\boldsymbol{\beta}$.

$\boldsymbol{\nu} = (\nu_{1},...,\nu_{n})^{T}$ is a vector of failure (censoring) indicators.

Let $\boldsymbol{\lambda} = (\lambda_{1},...,\lambda_{J})^{T}$. The full likelihood for $(\boldsymbol{\beta}, \boldsymbol{\lambda})$ is the following.

\[L(\boldsymbol{\beta},\boldsymbol{\lambda} | D) = \prod^{n}_{i=1} \prod^{J}_{j=1} (\lambda_{j} \exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta}))^{\delta_{ij}\nu_{i}} \exp \left\{ - \delta_{ij} \left[ \lambda_{j}(y_{i} - s_{j-1}) + \sum^{j-1}_{g=1} \lambda_{g}(s_{g} - s_{g-1}) \right] \exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta}) \right\}\]

where $\delta_{ij}$ is an interval specific indicator of end of follow up indicator $I[y_{i} \in I_{j}]$.

This is really hard, so let us dissect this into more manageable pieces. Firstly, we will focus on one individual rather than the entire dataset.

\[L(\boldsymbol{\beta},\boldsymbol{\lambda} | D_{i}) = \prod^{J}_{j=1} (\lambda_{j} \exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta}))^{\delta_{ij}\nu_{i}} \exp \left\{ - \delta_{ij} \left[ \lambda_{j}(y_{i} - s_{j-1}) + \sum^{j-1}_{g=1} \lambda_{g}(s_{g} - s_{g-1}) \right] \exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta}) \right\}\]

Thus, this is a product of interval-specific contributions. However, for interval $I_{j}$ in which the individual just survived without death or censoring $\delta_{ij} = 0$, there is no contribution (no parameters left).

\[(\lambda_{j} \exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta}))^{(0) \nu_{i}} \exp \left\{ - (0) \left[ \lambda_{j}(y_{i} - s_{j-1}) + \sum^{j-1}_{g=1} \lambda_{g}(s_{g} - s_{g-1}) \right] \exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta}) \right\} = 1\]

Therefore, the only contribution happens at the interval when the individual either dies or becomes censored. Note in general, an event individual contributes a density, which is a product of hazard and survival. On the other hand, a censored individual can only contribute survival information.

Event individual contribution

Firstly, we will consider death in interval $I_{j}$.

\[(\lambda_{j} \exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta}))^{(1) (1)} \exp \left\{ - (1) \left[ \lambda_{j}(y_{i} - s_{j-1}) + \sum^{j-1}_{g=1} \lambda_{g}(s_{g} - s_{g-1}) \right] \exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta}) \right\}\]

$(\lambda_{j} \exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta}))^{(1) (1)}$ is the hazard contribution.

$\exp \left\{ - (1) \left[ \lambda_{j}(y_{i} - s_{j-1}) + \sum^{j-1}_{g=1} \lambda_{g}(s_{g} - s_{g-1}) \right] \exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta}) \right\}$ is the survival contribution. Taken together this individual contributes density information.

Note $\left[ \lambda_{j}(y_{i} - s_{j-1}) + \sum^{j-1}_{g=1} \lambda_{g}(s_{g} - s_{g-1}) \right]$ is the cumulative baseline hazard that this individual faced. By multiplying this with time-constant (by the PH assumption) multiplication factor, $\exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta})$, we obtain the individual-specific cumulative hazard. An exponential of the negative cumulative hazard is the survival.

Censored individual contribution

Now we will consider censoring in interval $I_{j}$.

\[(\lambda_{j} \exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta}))^{(1) (0)} \exp \left\{ - (1) \left[ \lambda_{j}(y_{i} - s_{j-1}) + \sum^{j-1}_{g=1} \lambda_{g}(s_{g} - s_{g-1}) \right] \exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta}) \right\}\]

$(\lambda_{j} \exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta}))^{(1) (0)} = 1$. There is no hazard contribution.

$\exp \left\{ - (1) \left[ \lambda_{j}(y_{i} - s_{j-1}) + \sum^{j-1}_{g=1} \lambda_{g}(s_{g} - s_{g-1}) \right] \exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta}) \right\}$ is the survival contribution.

Poisson trick

This part follows BIDA (p347-) and Piece-Wise Exponential Model. Let us examine the likelihood for individual $i$ further. As stated above only contribution happens at the interval in which follow up ends by death or censoring. Without loss of generality, consider this interval as $I_{j}$.

\[ \begin{align*} L(\boldsymbol{\beta},\boldsymbol{\lambda} | D_{i}) &= (\lambda_{j} \exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta}))^{\delta_{ij}\nu_{i}} \exp \left\{ - \delta_{ij} \left[ \lambda_{j}(y_{i} - s_{j-1}) + \sum^{j-1}_{g=1} \lambda_{g}(s_{g} - s_{g-1}) \right] \exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta}) \right\}\\ &~~~\text{By } \delta_{ij} = 1\\ &= (\lambda_{j} \exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta}))^{\nu_{i}} \exp \left\{ - \left[ \lambda_{j}(y_{i} - s_{j-1}) + \sum^{j-1}_{g=1} \lambda_{g}(s_{g} - s_{g-1}) \right] \exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta}) \right\}\\ &~~~\text{Let $H_{i,g}$ represent at-risk time during $I_{g}$ for $i$}.\\ &= (\lambda_{j} \exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta}))^{\nu_{i}} \exp \left\{ - \left[ \sum^{j}_{g=1} \lambda_{g} H_{i,g} \right] \exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta}) \right\}\\ &= (\lambda_{j} \exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta}))^{\nu_{i}} \exp \left\{ - \sum^{j}_{g=1} \lambda_{g} H_{i,g} \exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta}) \right\}\\ &= (\lambda_{j} \exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta}))^{\nu_{i}} \prod^{j}_{g=1} \exp \left\{ - \lambda_{g} H_{i,g} \exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta}) \right\}\\ &~~~\text{Note the new middle piece is 1.}\\ &= (\lambda_{j} \exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta}))^{\nu_{i}} \left[ \prod^{j-1}_{g=1} (\lambda_{g} H_{i,g} \exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta}))^{0} \right] \prod^{j}_{g=1} \exp \left\{ - \lambda_{g} H_{i,g} \exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta}) \right\}\\ &\propto (\lambda_{j} H_{i,j} \exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta}))^{\nu_{i}} \left[ \prod^{j-1}_{g=1} (\lambda_{g} H_{i,g} \exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta}))^{0} \right] \prod^{j}_{g=1} \exp \left\{ - \lambda_{g} H_{i,g} \exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta}) \right\}\\ &~~~\text{By reintroducing } \delta_{i,g} \text{, which is 0 for }g < j\\ &= \left[ \prod^{j}_{g=1} (\lambda_{g} H_{i,g} \exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta}))^{\delta_{i,g} \nu_{i}} \right] \prod^{j}_{g=1} \exp \left\{ - \lambda_{g} H_{i,g} \exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta}) \right\}\\ &= \prod^{j}_{g=1} \exp \left\{ - \lambda_{g} H_{i,g} \exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta}) \right\} (\lambda_{g} H_{i,g} \exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta}))^{\delta_{i,g} \nu_{i}}\\ &~~~\text{By } \delta_{i,g} \nu_{i} \in \{0,1\}\\ &= \prod^{j}_{g=1} \exp \left\{ - \lambda_{g} H_{i,g} \exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta}) \right\} (\lambda_{g} H_{i,g} \exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta}))^{\delta_{i,g} \nu_{i}} \big/ (\delta_{i,g} \nu_{i})!\\ \end{align*} \]

We can multiply the likelihood with a term that does not contain parameters and retain the same inference. Note the last expression is a product of individual- and interval-specific Poisson likelihood. This transformation implies that we can split each individual’s observation into interval-specific observations up until the interval in which follow up ended. In addition to copying covariates, each interval-specific observation has to have the duration of the at-risk time and an indicator that is 1 in the last interval if the individual died otherwise 0. The latter indicator serves as the outcome of the Poisson model.

The corresponding Poisson model for individual $i$ interval $g$ is the following.

\[ \begin{align*} \mu_{i,g} &= \lambda_{g} H_{i,g} \exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta})\\ \log(\mu_{i,g}) &= \log(\lambda_{g} H_{i,g} \exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta}))\\ &= \log(H_{i,g}) + \log(\lambda_{g}) + \mathbf{x}_{i}^{T}\boldsymbol{\beta}\\ &= \log(H_{i,g}) + \beta_{0,g} + \mathbf{x}_{i}^{T}\boldsymbol{\beta}\\ \end{align*} \]

Therefore, $\log(H_{i,g})$ becomes the offset. The intercept $\beta_{0,g} = \log(\lambda_{g})$ is interval-specific. The outcome, the indicator variable $(\delta_{i,g} \nu_{i})$ is not independent within an individual because it is all zero except the last one, which can be 1 if the individual died. However, the likelihood has the form of a product of interval-specific Poisson contributions. Thus, the Poisson modeling can proceed as if these interval-specific observations from one individual were independent. In R, survival::survSplit function can create this long-format dataset from a single-row-per-person dataset.

Prior specification

We have clarified the likelihood part, so now we need to specify the priors for all parameters. We have the covariate coefficient vector $\boldsymbol{\beta}$. After the Poisson transformation, we also have the interval-specific intercepts. Each one of the parameters can take on any value on the real line.

The Stan developer website has a web page dedicated to prior choice philosophy (Prior Choice Recommendations). Here we adopt the principle of weakly informative priors that are informative enough to regularize. The direct quote is the following.

Weakly informative prior should contain enough information to regularize: the idea is that the prior rules out unreasonable parameter values but is not so strong as to rule out values that might make sense.

Priors for covariate coefficients

These coefficients for covariates are on the log hazard ratio scale. Based on the substantive ground, we would like to rule out hazard ratios that are greater than 50 or smaller than 1/50. $\log(50) = 3.912023$. Thus, $N(0,2^2)$ may be a good choice. This prior puts approximately 5% of the prior probability located outside the above stated reasonable range.

Priors for baseline hazards

BSA (p48) suggests independent gamma priors for the piecewise baseline hazard parameters and multivariate normal prior for the vector of log baseline hazard parameters. BSA does not mention numerical values. It is hard to think of the reasonable range for the piecewise baseline hazards.

BUGS (p290) uses independent $Gamma(0.001, 0.001)$ (dgamma) as a prior for each $\lambda$. This has mean 1 and variance 1000.

BIDA (p351-) acknowledges this difficulty and suggest centering the priors on an exponential regression model.

\[\lambda_{k} \sim Gamma(\lambda_{*}w_{k}, w_{k})\]

where $w_{k}$ is the interval length times some hyperparameter $w$.

Data analysis example

Set up multicore environment

## Configure parallelization
## Parallel backend for foreach (also loads foreach and parallel; includes doMC)
library(doParallel)
## Reproducible parallelization
library(doRNG)
## Detect core count (Do not use on clusters)
n_cores <- parallel::detectCores()
## Used by parallel::mclapply() as default
options(mc.cores = n_cores)
## Used by doParallel as default
options(cores = n_cores)
## Register doParallel as the parallel backend with foreach
## http://stackoverflow.com/questions/28989855/the-difference-between-domc-and-doparallel-in-r
doParallel::registerDoParallel(cores = n_cores)
## Report multicore use
## cat("### Using", foreach::getDoParWorkers(), "cores\n")
## cat("### Using", foreach::getDoParName(), "as backend\n")

Load packages

library(tidyverse)
library(survival)
library(rstanarm)
library(broom)
library(directlabels)

Load and prepare dataset

aml                  package:survival                  R Documentation
Acute Myelogenous Leukemia survival data
Description:
     Survival in patients with Acute Myelogenous Leukemia.  The
     question at the time was whether the standard course of
     chemotherapy should be extended ('maintainance') for additional
     cycles.
Usage:
     aml
     leukemia
Format:
       time:    survival or censoring time
       status:  censoring status
       x:       maintenance chemotherapy given? (factor)
Source:
     Rupert G. Miller (1997), _Survival Analysis_.  John Wiley & Sons.
     ISBN: 0-471-25218-2.

data(leukemia, package = "survival")
leukemia <- as_data_frame(leukemia) %>%
    mutate(id = seq_len(n())) %>%
    select(id, everything())
leukemia

## # A tibble: 23 x 4
##       id  time status x         
##    <int> <dbl>  <dbl> <fct>     
##  1     1     9      1 Maintained
##  2     2    13      1 Maintained
##  3     3    13      0 Maintained
##  4     4    18      1 Maintained
##  5     5    23      1 Maintained
##  6     6    28      0 Maintained
##  7     7    31      1 Maintained
##  8     8    34      1 Maintained
##  9     9    45      0 Maintained
## 10    10    48      1 Maintained
## # ... with 13 more rows

Check distribution of observed times and decide cut points for piecewise constant hazard function.

## One piece
cut_one <- max(leukemia$time) + 1
cut_one

## [1] 162

## Two pieces
cut_two <- c(median(leukemia$time), cut_one) %>% round()
cut_two

## [1]  23 162

## Three pieces
cut_three <- c(quantile(leukemia$time, probs = c(1/3, 2/3)), cut_one) %>% round()
cut_three

## 33.33333% 66.66667%           
##        14        31       162

## At all event times
cut_events <- c(leukemia %>%
                filter(status == 1) %>%
                select(time) %>%
                arrange(time) %>%
                magrittr::extract2("time") %>%
                unique,
                cut_one) %>% round()
cut_events

##  [1]   5   8   9  12  13  18  23  27  30  31  33  34  43  45  48 162

## At all event or censoring
cut_times <- c(sort(unique(leukemia$time)),
               cut_one) %>% round()
cut_times

##  [1]   5   8   9  12  13  16  18  23  27  28  30  31  33  34  43  45  48 161 162

Now transform dataset into long-format ones. We need an interval indicator and an interval length variable.

## No cut for all same constant hazard
leukemia_one <- survival::survSplit(formula = Surv(time, status) ~ ., data = leukemia, cut = cut_one) %>%
    mutate(interval = factor(tstart),
           interval_length = time - tstart) %>%
    as_data_frame
leukemia_one

## # A tibble: 23 x 7
##       id x          tstart  time status interval interval_length
##    <int> <fct>       <dbl> <dbl>  <dbl> <fct>              <dbl>
##  1     1 Maintained      0     9      1 0                      9
##  2     2 Maintained      0    13      1 0                     13
##  3     3 Maintained      0    13      0 0                     13
##  4     4 Maintained      0    18      1 0                     18
##  5     5 Maintained      0    23      1 0                     23
##  6     6 Maintained      0    28      0 0                     28
##  7     7 Maintained      0    31      1 0                     31
##  8     8 Maintained      0    34      1 0                     34
##  9     9 Maintained      0    45      0 0                     45
## 10    10 Maintained      0    48      1 0                     48
## # ... with 13 more rows

## Split into two observations
leukemia_two <- survival::survSplit(formula = Surv(time, status) ~ ., data = leukemia, cut = cut_two) %>%
    mutate(interval = factor(tstart),
           interval_length = time - tstart) %>%
    as_data_frame
leukemia_two

## # A tibble: 34 x 7
##       id x          tstart  time status interval interval_length
##    <int> <fct>       <dbl> <dbl>  <dbl> <fct>              <dbl>
##  1     1 Maintained      0     9      1 0                      9
##  2     2 Maintained      0    13      1 0                     13
##  3     3 Maintained      0    13      0 0                     13
##  4     4 Maintained      0    18      1 0                     18
##  5     5 Maintained      0    23      1 0                     23
##  6     6 Maintained      0    23      0 0                     23
##  7     6 Maintained     23    28      0 23                     5
##  8     7 Maintained      0    23      0 0                     23
##  9     7 Maintained     23    31      1 23                     8
## 10     8 Maintained      0    23      0 0                     23
## # ... with 24 more rows

## Split into three observations
leukemia_three <- survival::survSplit(formula = Surv(time, status) ~ ., data = leukemia, cut = cut_three) %>%
    mutate(interval = factor(tstart),
           interval_length = time - tstart) %>%
    as_data_frame
leukemia_three

## # A tibble: 45 x 7
##       id x          tstart  time status interval interval_length
##    <int> <fct>       <dbl> <dbl>  <dbl> <fct>              <dbl>
##  1     1 Maintained      0     9      1 0                      9
##  2     2 Maintained      0    13      1 0                     13
##  3     3 Maintained      0    13      0 0                     13
##  4     4 Maintained      0    14      0 0                     14
##  5     4 Maintained     14    18      1 14                     4
##  6     5 Maintained      0    14      0 0                     14
##  7     5 Maintained     14    23      1 14                     9
##  8     6 Maintained      0    14      0 0                     14
##  9     6 Maintained     14    28      0 14                    14
## 10     7 Maintained      0    14      0 0                     14
## # ... with 35 more rows

## Split at event times
leukemia_events <- survival::survSplit(formula = Surv(time, status) ~ ., data = leukemia, cut = cut_events) %>%
    mutate(interval = factor(tstart),
           interval_length = time - tstart) %>%
    as_data_frame
leukemia_events

## # A tibble: 180 x 7
##       id x          tstart  time status interval interval_length
##    <int> <fct>       <dbl> <dbl>  <dbl> <fct>              <dbl>
##  1     1 Maintained      0     5      0 0                      5
##  2     1 Maintained      5     8      0 5                      3
##  3     1 Maintained      8     9      1 8                      1
##  4     2 Maintained      0     5      0 0                      5
##  5     2 Maintained      5     8      0 5                      3
##  6     2 Maintained      8     9      0 8                      1
##  7     2 Maintained      9    12      0 9                      3
##  8     2 Maintained     12    13      1 12                     1
##  9     3 Maintained      0     5      0 0                      5
## 10     3 Maintained      5     8      0 5                      3
## # ... with 170 more rows

## Split at event and censoring times
leukemia_times <- survival::survSplit(formula = Surv(time, status) ~ ., data = leukemia, cut = cut_times) %>%
    mutate(interval = factor(tstart),
           interval_length = time - tstart) %>%
    as_data_frame
leukemia_times

## # A tibble: 203 x 7
##       id x          tstart  time status interval interval_length
##    <int> <fct>       <dbl> <dbl>  <dbl> <fct>              <dbl>
##  1     1 Maintained      0     5      0 0                      5
##  2     1 Maintained      5     8      0 5                      3
##  3     1 Maintained      8     9      1 8                      1
##  4     2 Maintained      0     5      0 0                      5
##  5     2 Maintained      5     8      0 5                      3
##  6     2 Maintained      8     9      0 8                      1
##  7     2 Maintained      9    12      0 9                      3
##  8     2 Maintained     12    13      1 12                     1
##  9     3 Maintained      0     5      0 0                      5
## 10     3 Maintained      5     8      0 5                      3
## # ... with 193 more rows

Sanity check with frequentist methods

coxph(formula = Surv(time, status) ~ x,
         data    = leukemia,
         ties    = c("efron","breslow","exact")[1]) %>% summary

## Call:
## coxph(formula = Surv(time, status) ~ x, data = leukemia, ties = c("efron", 
##     "breslow", "exact")[1])
## 
##   n= 23, number of events= 18 
## 
##                  coef exp(coef) se(coef)     z Pr(>|z|)  
## xNonmaintained 0.9155    2.4981   0.5119 1.788   0.0737 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                exp(coef) exp(-coef) lower .95 upper .95
## xNonmaintained     2.498     0.4003    0.9159     6.813
## 
## Concordance= 0.619  (se = 0.063 )
## Rsquare= 0.137   (max possible= 0.976 )
## Likelihood ratio test= 3.38  on 1 df,   p=0.07
## Wald test            = 3.2  on 1 df,   p=0.07
## Score (logrank) test = 3.42  on 1 df,   p=0.06

glm(formula = status ~ x + offset(log(interval_length)),
    family  = poisson(link = "log"),
    data    = leukemia_one) %>% summary

## 
## Call:
## glm(formula = status ~ x + offset(log(interval_length)), family = poisson(link = "log"), 
##     data = leukemia_one)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3084  -0.6723   0.2218   0.9045   1.4513  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -4.1015     0.3780 -10.851   <2e-16 ***
## xNonmaintained   0.9581     0.4835   1.982   0.0475 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 25.878  on 22  degrees of freedom
## Residual deviance: 21.816  on 21  degrees of freedom
## AIC: 61.816
## 
## Number of Fisher Scoring iterations: 6

## Drop intercept to show all interval-specific log rates
glm(formula = status ~ -1 + interval + x + offset(log(interval_length)),
    data = leukemia_two,
    family = poisson(link = "log")) %>% summary

## 
## Call:
## glm(formula = status ~ -1 + interval + x + offset(log(interval_length)), 
##     family = poisson(link = "log"), data = leukemia_two)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3680  -0.7694  -0.2321   1.0099   1.5947  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## interval0       -4.3530     0.4813  -9.044   <2e-16 ***
## interval23      -3.8963     0.4245  -9.178   <2e-16 ***
## xNonmaintained   1.0760     0.5008   2.149   0.0317 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1233.592  on 34  degrees of freedom
## Residual deviance:   39.322  on 31  degrees of freedom
## AIC: 81.322
## 
## Number of Fisher Scoring iterations: 6

glm(formula = status ~ -1 + interval + x + offset(log(interval_length)),
    data = leukemia_three,
    family = poisson(link = "log")) %>% summary

## 
## Call:
## glm(formula = status ~ -1 + interval + x + offset(log(interval_length)), 
##     family = poisson(link = "log"), data = leukemia_three)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2992  -0.7545  -0.5998   0.9033   1.9277  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## interval0       -4.3546     0.5342  -8.151 3.60e-16 ***
## interval14      -4.1437     0.5469  -7.577 3.55e-14 ***
## interval31      -3.8955     0.4792  -8.129 4.32e-16 ***
## xNonmaintained   1.0734     0.5170   2.076   0.0379 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1243.349  on 45  degrees of freedom
## Residual deviance:   49.378  on 41  degrees of freedom
## AIC: 93.378
## 
## Number of Fisher Scoring iterations: 6

glm(formula = status ~ -1 + interval + x + offset(log(interval_length)),
    data = leukemia_events,
    family = poisson(link = "log")) %>% summary

## 
## Call:
## glm(formula = status ~ -1 + interval + x + offset(log(interval_length)), 
##     family = poisson(link = "log"), data = leukemia_events)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0000  -0.4782  -0.3817  -0.2709   2.2170  
## 
## Coefficients:
##                 Estimate Std. Error z value     Pr(>|z|)    
## interval0        -4.6197     0.7997  -5.777 0.0000000076 ***
## interval5        -3.9794     0.7909  -5.032 0.0000004863 ***
## interval8        -3.4251     1.0527  -3.254     0.001139 ** 
## interval9        -4.4906     1.0562  -4.252 0.0000212196 ***
## interval12       -3.3054     1.0513  -3.144     0.001666 ** 
## interval13       -4.7988     1.0573  -4.539 0.0000056629 ***
## interval18       -3.9979     0.7880  -5.074 0.0000003904 ***
## interval23       -4.2947     1.0577  -4.060 0.0000489812 ***
## interval27       -3.8196     1.0538  -3.625     0.000289 ***
## interval30       -2.5174     1.0457  -2.407     0.016063 *  
## interval31       -3.1265     1.0538  -2.967     0.003009 ** 
## interval33       -2.1895     1.0392  -2.107     0.035127 *  
## interval34       -4.2680     1.0495  -4.067 0.0000476765 ***
## interval43       -2.3916     1.0264  -2.330     0.019795 *  
## interval45       -1.7918     1.0000  -1.792     0.073172 .  
## interval48      -21.0300  2103.3625  -0.010     0.992023    
## xNonmaintained    0.9024     0.5123   1.762     0.078133 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1282.980  on 180  degrees of freedom
## Residual deviance:   74.353  on 163  degrees of freedom
## AIC: 144.35
## 
## Number of Fisher Scoring iterations: 14

glm(formula = status ~ -1 + interval + x + offset(log(interval_length)),
    data = leukemia_times,
    family = poisson(link = "log")) %>% summary

## 
## Call:
## glm(formula = status ~ -1 + interval + x + offset(log(interval_length)), 
##     family = poisson(link = "log"), data = leukemia_times)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0000  -0.4653  -0.3347  -0.2550   2.2175  
## 
## Coefficients:
##                 Estimate Std. Error z value      Pr(>|z|)    
## interval0        -4.6210     0.7997  -5.778 0.00000000755 ***
## interval5        -3.9806     0.7910  -5.033 0.00000048368 ***
## interval8        -3.4262     1.0528  -3.255      0.001136 ** 
## interval9        -4.4918     1.0563  -4.253 0.00002113636 ***
## interval12       -3.3065     1.0513  -3.145      0.001661 ** 
## interval13      -20.8476  2343.2634  -0.009      0.992901    
## interval16       -3.8208     1.0539  -3.625      0.000288 ***
## interval18       -3.9991     0.7881  -5.075 0.00000038829 ***
## interval23       -4.2959     1.0578  -4.061 0.00004879153 ***
## interval27      -19.6892  2869.9764  -0.007      0.994526    
## interval28       -3.3932     1.0563  -3.212      0.001316 ** 
## interval30       -2.5185     1.0457  -2.408      0.016024 *  
## interval31       -3.1276     1.0539  -2.968      0.003000 ** 
## interval33       -2.1905     1.0393  -2.108      0.035055 *  
## interval34       -4.2691     1.0496  -4.068 0.00004750821 ***
## interval43       -2.3924     1.0264  -2.331      0.019759 *  
## interval45       -1.7918     1.0000  -1.792      0.073172 .  
## interval48      -24.0300  9426.6168  -0.003      0.997966    
## xNonmaintained    0.9042     0.5122   1.765      0.077530 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1285.623  on 203  degrees of freedom
## Residual deviance:   74.182  on 184  degrees of freedom
## AIC: 148.18
## 
## Number of Fisher Scoring iterations: 17

Some log baseline hazard estimates are -22.8454 with a standard error of 8693.5021. This non-convergence of the MLE algorithm occurs if the interval (a,b] does not contain any event, such as (13,16] in the model that defined intervals by all observed times. There is only one censoring at time 16 in this interval.

Model with rstanarm

Here we will use normal priors because they are readily available in rstanarm. Centering the prior for a log hazard ratio parameter at 0 (hazard ratio 1) sounds reasonable.

Where to center the prior for a log baseline hazard is unclear, so we will center it around 0, but make it vague. Here we will use $N(0,4^2)$, which means 95% of the prior probability is in [-7.839856, +7.839856] on the natural log scale.

We will remove the intercept term to make all interval terms parametrized as log baseline hazard (rather than change from the initial log baseline hazard). rstanarm does not seem to allow a different prior for each regression coefficient. So we will just use $N(0,4^2)$ for all.

fit_leukemia_one <- rstanarm::stan_glm(formula = status ~ x + offset(log(interval_length)),
                                       data = leukemia_one,
                                       family = poisson(link = "log"),
                                       prior_intercept = normal(location = 0, scale = 4),
                                       prior = normal(location = 0, scale = 4))
prior_summary(fit_leukemia_one)

## Priors for model 'fit_leukemia_one' 
## ------
## Intercept (after predictors centered)
##  ~ normal(location = 0, scale = 4)
## 
## Coefficients
##  ~ normal(location = 0, scale = 4)
## ------
## See help('prior_summary.stanreg') for more details

summary(fit_leukemia_one)

## 
## Model Info:
## 
##  function:     stan_glm
##  family:       poisson [log]
##  formula:      status ~ x + offset(log(interval_length))
##  algorithm:    sampling
##  priors:       see help('prior_summary')
##  sample:       4000 (posterior sample size)
##  observations: 23
##  predictors:   2
## 
## Estimates:
##                  mean   sd    2.5%   25%   50%   75%   97.5%
## (Intercept)     -4.1    0.4  -5.0   -4.4  -4.1  -3.9  -3.5  
## xNonmaintained   1.0    0.5   0.1    0.6   1.0   1.3   1.9  
## mean_PPD         0.8    0.3   0.3    0.6   0.8   1.0   1.3  
## log-posterior  -33.6    1.0 -36.2  -33.9 -33.3 -32.9 -32.6  
## 
## Diagnostics:
##                mcse Rhat n_eff
## (Intercept)    0.0  1.0  2165 
## xNonmaintained 0.0  1.0  2253 
## mean_PPD       0.0  1.0  3144 
## log-posterior  0.0  1.0  1620 
## 
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

fit_leukemia_two <- rstanarm::stan_glm(formula = status ~ -1 + interval + x + offset(log(interval_length)),
                                       data = leukemia_two,
                                       family = poisson(link = "log"),
                                       prior = normal(location = 0, scale = 4))
prior_summary(fit_leukemia_two)

## Priors for model 'fit_leukemia_two' 
## ------
## 
## Coefficients
##  ~ normal(location = [0,0,0], scale = [4,4,4])
## ------
## See help('prior_summary.stanreg') for more details

summary(fit_leukemia_two)

## 
## Model Info:
## 
##  function:     stan_glm
##  family:       poisson [log]
##  formula:      status ~ -1 + interval + x + offset(log(interval_length))
##  algorithm:    sampling
##  priors:       see help('prior_summary')
##  sample:       4000 (posterior sample size)
##  observations: 34
##  predictors:   3
## 
## Estimates:
##                  mean   sd    2.5%   25%   50%   75%   97.5%
## interval0       -4.4    0.5  -5.3   -4.7  -4.3  -4.0  -3.5  
## interval23      -3.9    0.4  -4.8   -4.2  -3.9  -3.6  -3.2  
## xNonmaintained   1.0    0.5   0.0    0.7   1.0   1.4   2.0  
## mean_PPD         0.5    0.2   0.2    0.4   0.5   0.6   0.9  
## log-posterior  -43.1    1.3 -46.4  -43.7 -42.8 -42.1 -41.6  
## 
## Diagnostics:
##                mcse Rhat n_eff
## interval0      0.0  1.0  1366 
## interval23     0.0  1.0  1562 
## xNonmaintained 0.0  1.0  1318 
## mean_PPD       0.0  1.0  3627 
## log-posterior  0.0  1.0  1445 
## 
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

fit_leukemia_three <- rstanarm::stan_glm(formula = status ~ -1 + interval + x + offset(log(interval_length)),
                                         data = leukemia_three,
                                         family = poisson(link = "log"),
                                         prior = normal(location = 0, scale = 4))
prior_summary(fit_leukemia_three)

## Priors for model 'fit_leukemia_three' 
## ------
## 
## Coefficients
##  ~ normal(location = [0,0,0,...], scale = [4,4,4,...])
## ------
## See help('prior_summary.stanreg') for more details

summary(fit_leukemia_three)

## 
## Model Info:
## 
##  function:     stan_glm
##  family:       poisson [log]
##  formula:      status ~ -1 + interval + x + offset(log(interval_length))
##  algorithm:    sampling
##  priors:       see help('prior_summary')
##  sample:       4000 (posterior sample size)
##  observations: 45
##  predictors:   4
## 
## Estimates:
##                  mean   sd    2.5%   25%   50%   75%   97.5%
## interval0       -4.3    0.5  -5.4   -4.7  -4.3  -4.0  -3.4  
## interval14      -4.1    0.5  -5.3   -4.5  -4.1  -3.8  -3.2  
## interval31      -3.9    0.5  -4.9   -4.2  -3.9  -3.6  -3.1  
## xNonmaintained   1.0    0.5   0.0    0.6   1.0   1.3   2.0  
## mean_PPD         0.4    0.1   0.2    0.3   0.4   0.5   0.7  
## log-posterior  -49.9    1.4 -53.6  -50.6 -49.6 -48.9 -48.2  
## 
## Diagnostics:
##                mcse Rhat n_eff
## interval0      0.0  1.0  1622 
## interval14     0.0  1.0  1738 
## interval31     0.0  1.0  1793 
## xNonmaintained 0.0  1.0  1417 
## mean_PPD       0.0  1.0  4123 
## log-posterior  0.0  1.0  1694 
## 
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

fit_leukemia_events <- rstanarm::stan_glm(formula = status ~ -1 + interval + x + offset(log(interval_length)),
                                          data = leukemia_events,
                                          family = poisson(link = "log"),
                                          prior = normal(location = 0, scale = 4))
prior_summary(fit_leukemia_events)

## Priors for model 'fit_leukemia_events' 
## ------
## 
## Coefficients
##  ~ normal(location = [0,0,0,...], scale = [4,4,4,...])
## ------
## See help('prior_summary.stanreg') for more details

summary(fit_leukemia_events)

## 
## Model Info:
## 
##  function:     stan_glm
##  family:       poisson [log]
##  formula:      status ~ -1 + interval + x + offset(log(interval_length))
##  algorithm:    sampling
##  priors:       see help('prior_summary')
##  sample:       4000 (posterior sample size)
##  observations: 180
##  predictors:   17
## 
## Estimates:
##                  mean   sd    2.5%   25%   50%   75%   97.5%
## interval0       -4.5    0.8  -6.2   -4.9  -4.4  -3.9  -3.2  
## interval5       -3.9    0.8  -5.6   -4.3  -3.8  -3.3  -2.6  
## interval8       -3.4    1.1  -6.0   -4.0  -3.3  -2.7  -1.7  
## interval9       -4.4    1.1  -6.8   -5.0  -4.3  -3.7  -2.7  
## interval12      -3.3    1.1  -5.8   -3.9  -3.2  -2.6  -1.6  
## interval13      -4.7    1.0  -7.1   -5.3  -4.6  -4.0  -3.1  
## interval18      -3.9    0.8  -5.6   -4.3  -3.8  -3.3  -2.6  
## interval23      -4.2    1.1  -6.7   -4.8  -4.1  -3.5  -2.6  
## interval27      -3.8    1.1  -6.2   -4.4  -3.7  -3.0  -2.1  
## interval30      -2.6    1.1  -5.2   -3.2  -2.5  -1.8  -0.9  
## interval31      -3.1    1.0  -5.5   -3.7  -3.0  -2.4  -1.4  
## interval33      -2.3    1.1  -4.9   -2.9  -2.2  -1.5  -0.6  
## interval34      -4.2    1.0  -6.5   -4.8  -4.1  -3.5  -2.5  
## interval43      -2.6    1.1  -5.2   -3.2  -2.4  -1.8  -0.8  
## interval45      -2.1    1.1  -4.6   -2.8  -2.0  -1.4  -0.4  
## interval48      -6.7    1.9 -11.0   -7.8  -6.4  -5.3  -4.0  
## xNonmaintained   0.5    0.4  -0.4    0.2   0.5   0.8   1.3  
## mean_PPD         0.1    0.0   0.1    0.1   0.1   0.1   0.2  
## log-posterior  -87.0    3.1 -93.9  -88.9 -86.7 -84.8 -81.9  
## 
## Diagnostics:
##                mcse Rhat n_eff
## interval0      0.0  1.0  3576 
## interval5      0.0  1.0  4107 
## interval8      0.0  1.0  3723 
## interval9      0.0  1.0  3297 
## interval12     0.0  1.0  3544 
## interval13     0.0  1.0  3279 
## interval18     0.0  1.0  4133 
## interval23     0.0  1.0  3880 
## interval27     0.0  1.0  3935 
## interval30     0.0  1.0  3871 
## interval31     0.0  1.0  3589 
## interval33     0.0  1.0  3291 
## interval34     0.0  1.0  3765 
## interval43     0.0  1.0  3575 
## interval45     0.0  1.0  4413 
## interval48     0.0  1.0  3276 
## xNonmaintained 0.0  1.0  3104 
## mean_PPD       0.0  1.0  4077 
## log-posterior  0.1  1.0  1385 
## 
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

fit_leukemia_times <- rstanarm::stan_glm(formula = status ~ -1 + interval + x + offset(log(interval_length)),
                                         data = leukemia_times,
                                         family = poisson(link = "log"),
                                         prior = normal(location = 0, scale = 4))
prior_summary(fit_leukemia_times)

## Priors for model 'fit_leukemia_times' 
## ------
## 
## Coefficients
##  ~ normal(location = [0,0,0,...], scale = [4,4,4,...])
## ------
## See help('prior_summary.stanreg') for more details

summary(fit_leukemia_times)

## 
## Model Info:
## 
##  function:     stan_glm
##  family:       poisson [log]
##  formula:      status ~ -1 + interval + x + offset(log(interval_length))
##  algorithm:    sampling
##  priors:       see help('prior_summary')
##  sample:       4000 (posterior sample size)
##  observations: 203
##  predictors:   19
## 
## Estimates:
##                  mean   sd    2.5%   25%   50%   75%   97.5%
## interval0       -4.4    0.8  -6.1   -4.9  -4.3  -3.9  -3.1  
## interval5       -3.8    0.8  -5.5   -4.2  -3.7  -3.3  -2.5  
## interval8       -3.4    1.1  -5.9   -4.0  -3.3  -2.6  -1.6  
## interval9       -4.4    1.0  -6.7   -4.9  -4.2  -3.7  -2.7  
## interval12      -3.3    1.0  -5.7   -3.9  -3.2  -2.6  -1.6  
## interval13      -6.2    2.0 -11.0   -7.4  -5.9  -4.7  -3.2  
## interval16      -3.8    1.0  -6.2   -4.4  -3.6  -3.0  -2.1  
## interval18      -3.8    0.8  -5.5   -4.3  -3.8  -3.3  -2.5  
## interval23      -4.2    1.0  -6.5   -4.7  -4.1  -3.4  -2.5  
## interval27      -5.1    2.2 -10.3   -6.3  -4.7  -3.5  -1.8  
## interval28      -3.4    1.1  -5.9   -4.0  -3.2  -2.6  -1.7  
## interval30      -2.5    1.1  -5.0   -3.2  -2.4  -1.8  -0.9  
## interval31      -3.1    1.1  -5.6   -3.7  -3.0  -2.4  -1.4  
## interval33      -2.3    1.1  -4.9   -2.9  -2.2  -1.5  -0.6  
## interval34      -4.2    1.0  -6.6   -4.8  -4.1  -3.5  -2.6  
## interval43      -2.5    1.1  -5.0   -3.1  -2.4  -1.8  -0.8  
## interval45      -2.1    1.1  -4.5   -2.8  -2.0  -1.3  -0.4  
## interval48      -6.7    1.9 -11.2   -7.7  -6.4  -5.4  -4.0  
## xNonmaintained   0.4    0.4  -0.5    0.1   0.4   0.7   1.3  
## mean_PPD         0.1    0.0   0.1    0.1   0.1   0.1   0.2  
## log-posterior  -91.3    3.3 -98.9  -93.3 -90.9 -88.9 -86.0  
## 
## Diagnostics:
##                mcse Rhat n_eff
## interval0      0.0  1.0  3432 
## interval5      0.0  1.0  4021 
## interval8      0.0  1.0  3538 
## interval9      0.0  1.0  3551 
## interval12     0.0  1.0  3500 
## interval13     0.0  1.0  3275 
## interval16     0.0  1.0  3421 
## interval18     0.0  1.0  3125 
## interval23     0.0  1.0  3474 
## interval27     0.0  1.0  3305 
## interval28     0.0  1.0  3081 
## interval30     0.0  1.0  2785 
## interval31     0.0  1.0  2986 
## interval33     0.0  1.0  3649 
## interval34     0.0  1.0  3070 
## interval43     0.0  1.0  3527 
## interval45     0.0  1.0  3956 
## interval48     0.0  1.0  2958 
## xNonmaintained 0.0  1.0  2751 
## mean_PPD       0.0  1.0  4476 
## log-posterior  0.1  1.0  1829 
## 
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

By the virtue of regularization by the prior, even the log baseline hazards for intervals without events does not break the model, but has lower posterior means.

Examine model fit with loo. Note @avehtari (tweet, BDA3 co-author, developer of Stan, GPy, etc!) pointed out that in the case of sub-patient level dataset (multiple rows representing intervals for each patient), leave-one-observation-out method is tricky. In this instance, what each observation represents in terms of the duration of the interval is different across models. This likely requires manually coded cross-validation with patient-level nested data frame (each row represents one patient with multiple time points nested within it). The evaluation was suppressed as the results are misleading.

list_fits_leukemia <- list(fit_leukemia_one = fit_leukemia_one,
                           fit_leukemia_two = fit_leukemia_two,
                           fit_leukemia_three = fit_leukemia_three,
                           fit_leukemia_events = fit_leukemia_events,
                           fit_leukemia_times = fit_leukemia_times)
lapply(list_fits_leukemia, loo::loo)

Survival curve prediction

Now we have posterior samples of interval-specific log baseline hazards as well as log hazard ratio for treatment. We can construct treatment group-specific posterior survival curves. Since we have piecewise constant hazards, it is easier to first construct a cumulative hazard function. Here is the constructor.

cunstruct_cumulative_hazard_function <- function(cutpoints, log_baseline_hazards, group_effect) {

    ## t is a vector of time points. group is {0,1} scalar
    cumulative_hazard_function <- function(t, group) {
        ## Boolean for any exposed time in each interval
        ## length(cutpoints) x length(t)
        interval_exposed <- outer(cutpoints, t, `<`)

        ## t - cutpoint. Multiply by interval exposed to avoid negative times.
        time_exposed <-  -outer(cutpoints, t, `-`) * interval_exposed

        ## Last interval is of width Inf
        interval_widths <- c(diff(cutpoints), Inf)

        ## For each interval, time exposed cannot exceed interval width.
        time_exposed_correct  <- sweep(x = time_exposed,
                                       MARGIN = 1,
                                       STATS = interval_widths,
                                       FUN = pmin)

        ## Multiply by corresponding baseline hazards to get interval specific cumulative baseline hazards.
        interval_baseline_cumulative_hazards <- sweep(x = time_exposed_correct,
                                                      MARGIN = 1,
                                                      STATS = exp(log_baseline_hazards),
                                                      FUN = `*`)

        ## Cumulative baseline hazard vector length(t)
        baseline_cumulative_hazards <- colSums(interval_baseline_cumulative_hazards)

        ## return after applying group effect
        return(baseline_cumulative_hazards * exp(group_effect * group))
    }

    return(cumulative_hazard_function)
}

We can use the tidybayes package to obtain each posterior sample as a row of a data frame. The columns named interval are interval-specific constant log hazards.

tidybayes::tidy_draws(fit_leukemia_three)

## # A tibble: 4,000 x 13
##    .chain .iteration .draw interval0 interval14 interval31 xNonmaintained accept_stat__ stepsize__ treedepth__
##     <int>      <int> <int>     <dbl>      <dbl>      <dbl>          <dbl>         <dbl>      <dbl>       <dbl>
##  1      1          1     1     -4.50      -4.33      -4.89          1.07          0.951      0.346           2
##  2      1          2     2     -4.83      -4.37      -3.28          0.913         1          0.346           3
##  3      1          3     3     -4.63      -3.40      -3.87          0.940         0.980      0.346           3
##  4      1          4     4     -5.32      -4.86      -3.93          1.50          0.965      0.346           3
##  5      1          5     5     -4.33      -4.02      -4.29          1.42          0.991      0.346           3
##  6      1          6     6     -4.14      -4.41      -4.44          0.702         0.996      0.346           3
##  7      1          7     7     -3.81      -4.09      -3.94          0.561         0.998      0.346           3
##  8      1          8     8     -4.55      -3.99      -3.97          1.14          0.991      0.346           3
##  9      1          9     9     -4.34      -4.47      -3.75          0.806         0.939      0.346           3
## 10      1         10    10     -3.33      -3.70      -3.58          0.175         0.964      0.346           4
## # ... with 3,990 more rows, and 3 more variables: n_leapfrog__ <dbl>, divergent__ <dbl>, energy__ <dbl>

Then we construct posterior samples of cumulative hazard functions based on the interval-specific constant log hazards. No-so-tidy tidyverse code follows.

cum_haz_leukemia_one <- tidybayes::tidy_draws(fit_leukemia_one) %>%
    mutate(`H(t|x)` = pmap(list(`(Intercept)`, xNonmaintained),
                           function(`(Intercept)`, xNonmaintained){
                               cunstruct_cumulative_hazard_function(
                                   cutpoints = c(0),
                                   log_baseline_hazards = c(`(Intercept)`),
                                   group_effect = xNonmaintained)
                           })) %>%
    select(.chain, .iteration, .draw, `H(t|x)`)

cum_haz_leukemia_two <- tidybayes::tidy_draws(fit_leukemia_two) %>%
    mutate(`H(t|x)` = pmap(list(interval0, interval23, xNonmaintained),
                           function(interval0, interval23, xNonmaintained){
                               cunstruct_cumulative_hazard_function(
                                   cutpoints = c(0,23),
                                   log_baseline_hazards = c(interval0, interval23),
                                   group_effect = xNonmaintained)
                           })) %>%
    select(.chain, .iteration, .draw, `H(t|x)`)

cum_haz_leukemia_three <- tidybayes::tidy_draws(fit_leukemia_three) %>%
    mutate(`H(t|x)` = pmap(list(interval0, interval14, interval31, xNonmaintained),
                           function(interval0, interval14, interval31, xNonmaintained){
                               cunstruct_cumulative_hazard_function(
                                   cutpoints = c(0,14,31),
                                   log_baseline_hazards = c(interval0, interval14, interval31),
                                   group_effect = xNonmaintained)
                           })) %>%
    select(.chain, .iteration, .draw, `H(t|x)`)

cum_haz_leukemia_events <- tidybayes::tidy_draws(fit_leukemia_events) %>%
    mutate(`H(t|x)` = pmap(list(interval0, interval5, interval8, interval9, interval12,
                                interval13, interval18, interval23, interval27, interval30,
                                interval31, interval33, interval34, interval43, interval45,
                                interval48, xNonmaintained),
                           function(interval0, interval5, interval8, interval9, interval12,
                                    interval13, interval18, interval23, interval27, interval30,
                                    interval31, interval33, interval34, interval43, interval45,
                                    interval48, xNonmaintained){
                               cunstruct_cumulative_hazard_function(
                                   cutpoints = c(0, 5, 8, 9, 12, 13, 18, 23, 27, 30, 31, 33, 34, 43, 45, 48),
                                   log_baseline_hazards = c(interval0, interval5, interval8, interval9,
                                                            interval12, interval13, interval18, interval23,
                                                            interval27, interval30, interval31, interval33,
                                                            interval34, interval43, interval45, interval48),
                                   group_effect = xNonmaintained)
                           })) %>%
    select(.chain, .iteration, .draw, `H(t|x)`)



cum_haz_leukemia_times <- tidybayes::tidy_draws(fit_leukemia_times) %>%
    mutate(`H(t|x)` = pmap(list(interval0, interval5, interval8, interval9, interval12, interval13,
                                interval16, interval18, interval23, interval27, interval28, interval30,
                                interval31, interval33, interval34, interval43, interval45, interval48,
                                xNonmaintained),
                           function(interval0, interval5, interval8, interval9, interval12, interval13,
                                    interval16, interval18, interval23, interval27, interval28, interval30,
                                    interval31, interval33, interval34, interval43, interval45, interval48,
                                    xNonmaintained){
                               cunstruct_cumulative_hazard_function(
                                   cutpoints = c(0, 5, 8, 9, 12, 13, 16, 18, 23, 27, 28, 30, 31, 33,
                                                 34, 43, 45, 48),
                                   log_baseline_hazards = c(interval0, interval5, interval8, interval9,
                                                            interval12, interval13, interval16, interval18,
                                                            interval23, interval27, interval28, interval30,
                                                            interval31, interval33, interval34, interval43,
                                                            interval45, interval48),
                                   group_effect = xNonmaintained)
                           })) %>%
    select(.chain, .iteration, .draw, `H(t|x)`)

As you can see, each row corresponds to a posterior sample of a cumulative hazard function that can be evaluated at an arbitrary time point $t \ge 0$.

cum_haz_leukemia_times

## # A tibble: 4,000 x 4
##    .chain .iteration .draw `H(t|x)`
##     <int>      <int> <int> <list>  
##  1      1          1     1 <fn>    
##  2      1          2     2 <fn>    
##  3      1          3     3 <fn>    
##  4      1          4     4 <fn>    
##  5      1          5     5 <fn>    
##  6      1          6     6 <fn>    
##  7      1          7     7 <fn>    
##  8      1          8     8 <fn>    
##  9      1          9     9 <fn>    
## 10      1         10    10 <fn>    
## # ... with 3,990 more rows

Here we define functions to create plotting data and to plot.

create_plot_df <- function(cum_haz_leukemia_df) {
    ## Evaluation time points
    times_df <- data_frame(t = seq(from = 0, to = cut_one, by = 1))

    cum_haz_leukemia_df %>%
        mutate(times_df = list(times_df)) %>%
        mutate(times_df = pmap(list(times_df, `H(t|x)`),
                               function(times_df, `H(t|x)`) {
                                   times_df %>%
                                       mutate(`H(t|1)` = `H(t|x)`(t, 1),
                                              `H(t|0)` = `H(t|x)`(t, 0)) %>%
                                       mutate(`S(t|1)` = exp(-`H(t|1)`),
                                              `S(t|0)` = exp(-`H(t|0)`)) %>%
                                       select(-`H(t|1)`, -`H(t|0)`)
                               }
                               )
               ) %>%
        select(-`H(t|x)`) %>%
        unnest() %>%
        gather(key = treatment,
               value = survival,
               `S(t|1)`, `S(t|0)`) %>%
        mutate(treatment = factor(treatment,
                                  levels = c("S(t|0)", "S(t|1)"),
                                  labels = c("Maintained","Nonmaintained")))
}

summarize_df <- function(df) {
    df %>%
        group_by(treatment, t) %>%
        summarize(survival_mean = mean(survival),
                  survival_95upper = quantile(survival, probs = 0.975),
                  survival_95lower = quantile(survival, probs = 0.025))
}

plot_df <- function(df) {
    df_summary <- summarize_df(df)

    df %>%
        ggplot(mapping = aes(x = t, y = survival,
                             group = interaction(.chain, .iteration, .draw, treatment))) +
        geom_line(size = 0.1, alpha = 0.025) +
        geom_line(data = df_summary,
                  mapping = aes(y = survival_mean, group = treatment)) +
        geom_line(data = df_summary,
                  mapping = aes(y = survival_95upper, group = treatment),
                  linetype = "dotted") +
        geom_line(data = df_summary,
                  mapping = aes(y = survival_95lower, group = treatment),
                  linetype = "dotted") +
        facet_grid(. ~ treatment) +
        theme_bw() +
        theme(axis.text.x = element_text(angle = 90, vjust = 0.5),
              legend.key = element_blank(),
              plot.title = element_text(hjust = 0.5),
              strip.background = element_blank())
}

Plot different models.

survival_leukemia_one <-
    cum_haz_leukemia_one %>%
    create_plot_df()
survival_leukemia_one %>%
    plot_df() +
    labs(title = "One constant hazard")

survival_leukemia_two <-
    cum_haz_leukemia_two %>%
    create_plot_df()
survival_leukemia_two %>%
    plot_df() +
    labs(title = "Two constant hazards")

survival_leukemia_three <-
    cum_haz_leukemia_three %>%
    create_plot_df()
survival_leukemia_three %>%
    plot_df() +
    labs(title = "Three constant hazards")

survival_leukemia_events <-
    cum_haz_leukemia_events %>%
    create_plot_df()
survival_leukemia_events %>%
    plot_df() +
    labs(title = "All event times")

survival_leukemia_times <-
    cum_haz_leukemia_times %>%
    create_plot_df()
survival_leukemia_times %>%
    plot_df() +
    labs(title = "All times")

## Frequentist Kaplan-Meier estimates
km_leukemia_fit <- survival::survfit(formula = Surv(time, status) ~ x,
                                     data = leukemia,
                                     type      = "kaplan-meier",
                                     error     = "greenwood",
                                     conf.type = "log-log")
survival_leukemia_km <-
    km_leukemia_fit %>%
    broom::tidy() %>%
    mutate(treatment = gsub("x=", "", strata),
           model = "Kaplan-Meier") %>%
    rename(t = time,
           ## These are misnomers, but for plotting purpuse.
           survival_mean = estimate,
           survival_95upper = conf.high,
           survival_95lower = conf.low) %>%
    select(treatment, t, survival_mean, survival_95upper, survival_95lower, model) %>%
    right_join(y = crossing(treatment = c("Maintained","Nonmaintained"),
                            t = seq(from = 0, to = cut_one, by = 1),
                            model = "Kaplan-Meier")) %>%
    mutate(treatment = factor(treatment,
                              levels = c("Maintained","Nonmaintained"))) %>%
    arrange(treatment, t) %>%
    mutate(survival_mean = if_else(t == 0, 1, survival_mean),
           survival_95upper = if_else(t == 0, 1, survival_95upper),
           survival_95lower = if_else(t == 0, 1, survival_95lower)) %>%
    ## Need LOCF to recover step function
    tidyr::fill(survival_mean, survival_95upper, survival_95lower,
                .direction = "down")

Now plot together for comparison. Note the Kaplan-Meier estimates are Frequentist estimates. The Kaplan-Meier interval is Frequentist 95% confidence interval. For other methods, the intervals are 95% credible intervals (posterior probability intervals).

## Summarize
survival_leukemia_together <-
    bind_rows(
        survival_leukemia_one %>%
        summarize_df() %>%
        mutate(model = "One constant hazard"),
        survival_leukemia_two %>%
        summarize_df() %>%
        mutate(model = "Two constant hazards"),
        survival_leukemia_three %>%
        summarize_df() %>%
        mutate(model = "Three constant hazards"),
        survival_leukemia_events %>%
        summarize_df() %>%
        mutate(model = "All event times"),
        survival_leukemia_times %>%
        summarize_df() %>%
        mutate(model = "All times"),
        ## KM
        survival_leukemia_km)

gg_together <- survival_leukemia_together %>%
    ggplot(mapping = aes(x = t, y = survival_mean,
                         group = interaction(treatment, model), color = model)) +
    geom_line() +
    geom_line(mapping = aes(y = survival_95upper),
              linetype = "dotted") +
    geom_line(mapping = aes(y = survival_95lower),
              linetype = "dotted") +
    facet_grid(. ~ treatment) +
    labs(x = "Time in Days", y = "Survival") +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5),
          legend.key = element_blank(),
          plot.title = element_text(hjust = 0.5),
          strip.background = element_blank())
## Regular plot
gg_together

## Direct labeling
directlabels::direct.label(gg_together + scale_x_continuous(limit = c(0, 200)))

All methods are similar for the Nonmaintained group, whereas there are larger differences for the Maintained group. This is likely due to departure from the exponential model (One constant hazard).

Record session information

print(sessionInfo())

## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS  10.14.1
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] bindrcpp_0.2.2          directlabels_2018.05.22 broom_0.5.0             rstanarm_2.18.1        
##  [5] Rcpp_0.12.19            survival_2.43-1         forcats_0.3.0           stringr_1.3.1          
##  [9] dplyr_0.7.7             purrr_0.2.5             readr_1.1.1             tidyr_0.8.2            
## [13] tibble_1.4.2            ggplot2_3.1.0           tidyverse_1.2.1         doRNG_1.7.1            
## [17] rngtools_1.3.1          pkgmaker_0.27           registry_0.5            doParallel_1.0.14      
## [21] iterators_1.0.10        foreach_1.4.4           knitr_1.20             
## 
## loaded via a namespace (and not attached):
##  [1] minqa_1.2.4               colorspace_1.3-2          ggridges_0.5.1            rsconnect_0.8.8          
##  [5] rprojroot_1.3-2           ggstance_0.3.1            markdown_0.8              base64enc_0.1-3          
##  [9] rstudioapi_0.8            rstan_2.18.1              svUnit_0.7-12             DT_0.4                   
## [13] fansi_0.4.0               lubridate_1.7.4           xml2_1.2.0                codetools_0.2-15         
## [17] splines_3.5.1             shinythemes_1.1.1         bayesplot_1.6.0           jsonlite_1.5             
## [21] nloptr_1.2.1              shiny_1.1.0               compiler_3.5.1            httr_1.3.1               
## [25] backports_1.1.2           assertthat_0.2.0          Matrix_1.2-14             lazyeval_0.2.1           
## [29] cli_1.0.1                 later_0.7.5               htmltools_0.3.6           prettyunits_1.0.2        
## [33] tools_3.5.1               igraph_1.2.2              coda_0.19-2               gtable_0.2.0             
## [37] glue_1.3.0                reshape2_1.4.3            cellranger_1.1.0          nlme_3.1-137             
## [41] crosstalk_1.0.0           ps_1.2.0                  lme4_1.1-18-1             rvest_0.3.2              
## [45] mime_0.6                  miniUI_0.1.1.1            tidybayes_1.0.3           gtools_3.8.1             
## [49] MASS_7.3-51               zoo_1.8-4                 scales_1.0.0              colourpicker_1.0         
## [53] hms_0.4.2                 promises_1.0.1            inline_0.3.15             shinystan_2.5.0          
## [57] yaml_2.2.0                gridExtra_2.3             loo_2.0.0                 StanHeaders_2.18.0       
## [61] stringi_1.2.4             dygraphs_1.1.1.6          pkgbuild_1.0.2            bibtex_0.4.2             
## [65] rlang_0.3.0.1             pkgconfig_2.0.2           matrixStats_0.54.0        evaluate_0.12            
## [69] lattice_0.20-35           bindr_0.1.1               splines2_0.2.8            labeling_0.3             
## [73] rstantools_1.5.1          htmlwidgets_1.3           tidyselect_0.2.5          processx_3.2.0           
## [77] plyr_1.8.4                magrittr_1.5              R6_2.3.0                  pillar_1.3.0             
## [81] haven_1.1.2               withr_2.1.2               xts_0.11-1                modelr_0.1.2             
## [85] crayon_1.3.4              arrayhelpers_1.0-20160527 utf8_1.1.4                rmarkdown_1.10           
## [89] grid_3.5.1                readxl_1.1.0              callr_3.0.0               threejs_0.3.1            
## [93] digest_0.6.18             xtable_1.8-3              httpuv_1.4.5              stats4_3.5.1             
## [97] munsell_0.5.0             quadprog_1.5-5            shinyjs_1.0

## Record execution time
end_time <- Sys.time()
cat("Started  ", as.character(start_time), "\n")

## Started   2018-11-15 21:02:40

cat("Finished ", as.character(end_time), "\n")

## Finished  2018-11-15 21:21:37

print(end_time - start_time)

## Time difference of 18.94245 mins

Top Page: http://rpubs.com/kaz_yos/
Github: https://github.com/kaz-yos