(The purpose of this post is to summarize the literature and to provide analytical examples in R.)
We often wish to compare the survival experience of two or more groups of individuals. Two common but distinct approaches to comparing survival outcomes are: i) comparing the entire survival curves and ii) comparing the survival probabilities at a fixed point in time. These approaches test distinct hypotheses concerning survival experience. In the next sections, I will provide a brief overview of statistical models for time-to-event analysis, followed by an overview of statistical methods to compare survival experiences between groups. Here, the focus is on the usage and the pros and cons of each method rather than on its mathematical formulation.
Finally, treatment switching is becoming an important issue in clinical trials, so I will also go over this issue briefly.
Survival data can be described by four quantities: the survival function, the hazard function, the probability density function, and the cumulative distribution function. If we know one of them, it is easy to derive the other three. The transformations among them are presented in the figure below.
Mathematical entities and the transformations between them
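As a quick numerical sketch of these transformations (my own example, assuming a Weibull distribution with arbitrarily chosen shape 1.5 and scale 2):
shape <- 1.5; scale <- 2                         # arbitrary Weibull parameters
pdf    <- function(t) dweibull(t, shape, scale)  # probability density f(t)
cdf    <- function(t) pweibull(t, shape, scale)  # cumulative distribution F(t)
surv   <- function(t) 1 - cdf(t)                 # survival S(t) = 1 - F(t)
haz    <- function(t) pdf(t) / surv(t)           # hazard h(t) = f(t) / S(t)
cumhaz <- function(t) -log(surv(t))              # cumulative hazard H(t) = -log S(t)
t0 <- 2
all.equal(surv(t0), exp(-cumhaz(t0)))            # S(t) = exp(-H(t)): TRUE
all.equal(pdf(t0), haz(t0) * surv(t0))           # f(t) = h(t) * S(t): TRUE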
The assumptions shared by all time-to-event analyses are:
Deviations from these assumptions matter most if they are satisfied differently in the groups being compared (e.g. censoring is more likely in one group).
The Kaplan-Meier plot is a non-parametric method to estimate the survival probability directly from the observed survival times. It does not require any distributional assumption.
The Cox proportional hazards model does not require a specific probability distribution for the survival time, which makes it the most widely used model for analyzing survival data.
However, its validity depends heavily on the proportional hazards assumption; for this reason, the Cox model is often called the proportional hazards model. When the proportional hazards assumption does not hold, estimates derived from the Cox model lead to an improper fit and incorrect inferences.
PH assumption: the hazard can vary over time, but the hazard ratio of two individuals (at the same time) is constant.
Graphical methods: if the log-log survival curves for the two groups appear roughly parallel, the hazard ratio is constant over the period of observation.
Schoenfeld residuals and the Grambsch-Therneau test: if the PH assumption holds, the scaled Schoenfeld residuals should show no correlation with survival time. In other words, at every failure time, the covariate value of the subject who just failed should be similar to the risk-weighted average of the covariate among the remaining subjects.
Goodness-of-fit tests based on cumulative sums of residuals: Kolmogorov-Smirnov test, Cramer-von Mises test, Anderson-Darling test
Interactions with time as time-dependent covariates
Stratified Cox regression: stratifying the data set into \(m\) smaller datasets based on strata of the covariate(s). Each stratum has its own baseline hazard, but all strata share the regression parameters.
Adding a time-dependent covariate (i.e. a time-dependent coefficient Cox model): an interaction of the covariate of concern with time or with a function of time.
Accelerated failure-time model
The hazard function takes the form: \(h(t|X) = h_0(t)e^{\beta X}\)
As the baseline hazard is not specified, the Cox model is a semi-parametric model.
The time-varying covariate Cox model takes the form: \(h(t|X) = h_0(t)e^{\beta X(t)}\)
When explanatory variables are collected at more than one time point and change over time, it may be more appropriate to use time-varying covariates. The model is robust because it utilizes all available data.
Data for the time-varying covariate Cox model need to be in long format, as in repeated-measures regression. The follow-up period of a subject is partitioned into sub-intervals (tstart, tstop] based on the measurement times of the covariate(s), and each sub-interval appears as one row in the data frame, as in the toy sketch below.
Although a given subject has multiple observations, we generally do not need to worry about correlated data, as this data representation is simply a programming trick: the likelihood equations at any time point use only one copy of any subject, and the program picks out the correct row of data at each event time. However, when subjects have multiple events, the rows for the events are correlated within subject and a cluster variance is needed.
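To make the format concrete, here is a toy sketch (made-up values) of one subject's follow-up split at the days a lab value was re-measured:
one_subject <- data.frame(
  id     = 1,
  tstart = c(0, 180, 365),    # interval start (day of covariate measurement)
  tstop  = c(180, 365, 540),  # interval end (next measurement or end of follow-up)
  status = c(0, 0, 1),        # event indicator; only the last interval can be 1
  bili   = c(1.1, 1.8, 2.9)   # covariate value in force over each interval
)
# coxph(Surv(tstart, tstop, status) ~ bili, data = one_subject) then uses the
# bili value current at each event time.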
In contrast to the Cox PH model, where the baseline hazard \(h_0\) is not specified, the parametric PH models specify the baseline hazard function.
Weibull regression is the most popular parametric survival regression. The Weibull and exponential distributions are the only ones for which both PH and AFT parameterizations exist.
Weibull model with baseline hazard: \(h(t|X) = \gamma \lambda t^{\gamma-1} e^{\beta X}\)
The AFT model provides an alternative to the proportional hazards (PH) model for analyzing time-to-event data. In the AFT regression model, the effect of covariates on the logarithm of the survival time is assessed.
However, in contrast to the Cox PH model, the parametric AFT models require specifying the event time distribution, which is difficult in many real-life applications. The AFT models include the generalized gamma, log-logistic, log-normal, Weibull and exponential distributions.
Classic AFT models impose conventional assumptions: i) for all covariates, the event time ratios (ETRs), i.e. acceleration factors, are constant over time, and ii) continuous covariates have a linear relationship with log survival time.
If the "true" data-generating model is more consistent with the PH model, the constant time ratio assumption is violated, which may lead to biased estimates unless the baseline hazard is exponential or Weibull. Pang et al. (2021) proposed a flexible AFT model allowing both time-dependent and non-linear effects of continuous variables. Available code for the flexible AFT model.
The AFT model takes the form: \[\log T = Y = \mu + \alpha X + \sigma W \]
where \(X\) is a set of covariates and \(W\) has the extreme value distribution.
Weibull AFT form: \[h(t|X) = \gamma \lambda(t e^{-\alpha X})^{\gamma-1} e^{-\alpha X} \]
where \(\gamma = 1/\sigma\), \(\lambda = e^{-\mu/\sigma}\), and \(\beta = -\alpha/\sigma\) are the transformations to the Weibull PH model.
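As a sketch of these transformations in practice, we can fit a Weibull AFT model with survival::survreg and convert its estimates to the PH scale by hand (the single covariate trt on the pbc data below is just for illustration; the SurvRegCensCov package used later automates this):
library(survival)
data(pbc, package = "survival")
aft <- survreg(Surv(time, status == 2) ~ trt, data = pbc, dist = "weibull")
sigma  <- aft$scale                                 # sigma in the AFT form above
gamma  <- 1 / sigma                                 # Weibull shape: gamma = 1/sigma
lambda <- exp(-coef(aft)[["(Intercept)"]] / sigma)  # lambda = exp(-mu/sigma)
beta   <- -coef(aft)[-1] / sigma                    # PH coefficient: beta = -alpha/sigma
exp(beta)                                           # hazard ratio for trt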
Plot comparing the K-M curves of the two treatment groups in the primary biliary cirrhosis (PBC) dataset:
library(tidyverse)
library(survival)
library(survminer)
library(flexsurv)
data(pbc, package="survival")
kmfit <- survfit(Surv(time, status == 2) ~ trt, data = pbc)
ggsurvplot(kmfit,
           pval = TRUE,              # show p-value of the log-rank test
           conf.int = FALSE,         # show confidence intervals for the survival curves
           xlab = "Time (days)",     # customize X axis label
           break.time.by = 500,      # break X axis into intervals of 500
           ggtheme = theme_light() +
             theme(legend.title = element_text(size = rel(1.25)),
                   legend.text = element_text(size = rel(1.25))),  # customize plot and risk table with a theme
           surv.median.line = "hv",  # add the median survival pointer
           legend.labs = c("Standard", "Experimental"),  # change legend labels
           legend.title = "Treatment",
           palette = "Dark2"         # custom color palette
)
Cox PH model:
coxfit <- coxph(Surv(time,status==2) ~ trt + age + edema + log(bili) + log(albumin) + log(protime), ties="efron", data = pbc)
summary(coxfit)
## Call:
## coxph(formula = Surv(time, status == 2) ~ trt + age + edema +
## log(bili) + log(albumin) + log(protime), data = pbc, ties = "efron")
##
## n= 312, number of events= 125
## (106 observations deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## trt 0.135502 1.145111 0.185424 0.731 0.464920
## age 0.034625 1.035232 0.008905 3.888 0.000101 ***
## edema 0.787101 2.197017 0.296321 2.656 0.007902 **
## log(bili) 0.884853 2.422627 0.098718 8.963 < 2e-16 ***
## log(albumin) -3.083307 0.045808 0.718950 -4.289 1.8e-05 ***
## log(protime) 2.969493 19.482041 1.016033 2.923 0.003471 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## trt 1.14511 0.87328 0.79619 1.6470
## age 1.03523 0.96597 1.01732 1.0535
## edema 2.19702 0.45516 1.22915 3.9270
## log(bili) 2.42263 0.41277 1.99645 2.9398
## log(albumin) 0.04581 21.83048 0.01119 0.1875
## log(protime) 19.48204 0.05133 2.65941 142.7196
##
## Concordance= 0.844 (se = 0.02 )
## Likelihood ratio test= 199.9 on 6 df, p=<2e-16
## Wald test = 197.6 on 6 df, p=<2e-16
## Score (logrank) test = 269.7 on 6 df, p=<2e-16
Log-log plot: although the curves cross, they roughly overlap; therefore, the data do not deviate severely from the PH assumption.
ggsurvplot(kmfit, fun = 'cloglog',
           legend.labs = c("Standard", "Experimental"),  # change legend labels
           legend.title = "Treatment",
           palette = "Dark2")
Plot of Schoenfeld residuals and the Grambsch-Therneau test: edema and log(protime) violate the PH assumption.
# Schoenfeld residuals
ggcoxdiagnostics(coxfit, type = "schoenfeld")
# Grambsch-Therneau test
cox.zph(coxfit)
## chisq df p
## trt 0.603 1 0.438
## age 0.502 1 0.479
## edema 4.100 1 0.043
## log(bili) 1.616 1 0.204
## log(albumin) 0.190 1 0.663
## log(protime) 5.445 1 0.020
## GLOBAL 12.047 6 0.061
Goodness-of-fit tests based on cumulative sums of residuals also indicate non-proportional hazards for edema and log(protime):
library(goftte)
coxprop <- prop(coxfit, type.test='Lin',plots = 200,
variable=c("treatment","age", "edema","bilirubin","log(albumin)","log(protime)"))
coxprop
##
## Rejection p-values associated to Lin's approximation for proportional hazards assumption
## ---
## Kolmogorov-Smirnov-test: p-value=0.207
## Cramer-von-Mises-test: p-value=0.3
## Anderson-Darling-test: p-value=0.405
## Based on 1000 realizations. Cumulated residuals ordered by treatment-variable.
## ---
## Kolmogorov-Smirnov-test: p-value=0.263
## Cramer-von-Mises-test: p-value=0.288
## Anderson-Darling-test: p-value=0.355
## Based on 1000 realizations. Cumulated residuals ordered by age-variable.
## ---
## Kolmogorov-Smirnov-test: p-value=0.015
## Cramer-von-Mises-test: p-value=0.027
## Anderson-Darling-test: p-value=0.029
## Based on 1000 realizations. Cumulated residuals ordered by edema-variable.
## ---
## Kolmogorov-Smirnov-test: p-value=0.084
## Cramer-von-Mises-test: p-value=0.193
## Anderson-Darling-test: p-value=0.198
## Based on 1000 realizations. Cumulated residuals ordered by bilirubin-variable.
## ---
## Kolmogorov-Smirnov-test: p-value=0.742
## Cramer-von-Mises-test: p-value=0.674
## Anderson-Darling-test: p-value=0.685
## Based on 1000 realizations. Cumulated residuals ordered by log(albumin)-variable.
## ---
## Kolmogorov-Smirnov-test: p-value=0.002
## Cramer-von-Mises-test: p-value=0.002
## Anderson-Darling-test: p-value=0.002
## Based on 1000 realizations. Cumulated residuals ordered by log(protime)-variable.
## ---
plot(coxprop)
Assessing the functional form of a covariate: replicating Example 73.12 of the SAS documentation for the ASSESS statement in PROC PHREG, we can use the fcov function in the goftte package to assess the functional form of bilirubin in the model. A Kolmogorov-type supremum test based on a sample of 1,000 simulated residual patterns gives p < 0.001. The plot shows that none of the 1,000 simulated realizations has an absolute maximum exceeding that of the observed cumulative martingale residual process. The result indicates that the linear form of bili is not appropriate for the model.
coxfit2 <- coxph(Surv(time,status==2) ~ trt + age + edema + bili + log(albumin) + log(protime), ties="breslow", data = pbc)
cox.fcov <- fcov(coxfit2, type.test='Lin',plots = 200,
variable=c("treatment","age", "edema","bilirubin","log(albumin)","log(protime)"))
plot(cox.fcov, idx=4)
Add time-dependent covariate
The interaction term of edema with time is not statistically significant when we include interaction terms of time with both edema and log(protime) in the model.
coxfit_time1 <- coxph(Surv(time,status==2) ~ trt + age + edema + bili + log(albumin) + log(protime) + tt(edema) + tt(log(protime)), ties="breslow", data = pbc,
tt=function(x, t, ...) {
xtime=x*t/365.25
})
coxfit_time1
## Call:
## coxph(formula = Surv(time, status == 2) ~ trt + age + edema +
## bili + log(albumin) + log(protime) + tt(edema) + tt(log(protime)),
## data = pbc, ties = "breslow", tt = function(x, t, ...) {
## xtime = x * t/365.25
## })
##
## coef exp(coef) se(coef) z p
## trt -0.011966 0.988106 0.185010 -0.065 0.948433
## age 0.033383 1.033947 0.009475 3.523 0.000426
## edema 1.328368 3.774877 0.460209 2.886 0.003896
## bili 0.110266 1.116575 0.015624 7.057 1.70e-12
## log(albumin) -3.766113 0.023142 0.719533 -5.234 1.66e-07
## log(protime) 6.381214 590.644454 1.608171 3.968 7.25e-05
## tt(edema) -0.231251 0.793540 0.150309 -1.539 0.123924
## tt(log(protime)) -0.721962 0.485798 0.359230 -2.010 0.044457
##
## Likelihood ratio test=174.3 on 8 df, p=< 2.2e-16
## n= 312, number of events= 125
## (106 observations deleted due to missingness)
Therefore, we can drop the interaction term between edema and time.
coxfit_time2 <- coxph(Surv(time,status==2) ~ trt + age + edema + bili + log(albumin) + log(protime) + tt(log(protime)),
ties="breslow", data = pbc,
tt=function(x, t, ...) {
xtime=x*t/365.25
})
coxfit_time2
## Call:
## coxph(formula = Surv(time, status == 2) ~ trt + age + edema +
## bili + log(albumin) + log(protime) + tt(log(protime)), data = pbc,
## ties = "breslow", tt = function(x, t, ...) {
## xtime = x * t/365.25
## })
##
## coef exp(coef) se(coef) z p
## trt 0.014231 1.014333 0.184143 0.077 0.938397
## age 0.033872 1.034452 0.009398 3.604 0.000313
## edema 0.779225 2.179782 0.304559 2.559 0.010512
## bili 0.112017 1.118532 0.015416 7.266 3.70e-13
## log(albumin) -3.848515 0.021311 0.714031 -5.390 7.05e-08
## log(protime) 6.618422 748.762903 1.549773 4.271 1.95e-05
## tt(log(protime)) -0.831122 0.435560 0.350268 -2.373 0.017653
##
## Likelihood ratio test=171.6 on 7 df, p=< 2.2e-16
## n= 312, number of events= 125
## (106 observations deleted due to missingness)
Stratified Cox model
With edema being a categorical variable, we can fit a Cox model stratified by edema. When included in the model, the interaction term between log(protime) and time is not statistically significant (not shown here), so we drop it from the model.
stratifiedCox <- coxph(Surv(time,status==2) ~ trt + age + bili + log(albumin) + log(protime) + strata(edema),
data = pbc)
stratifiedCox
## Call:
## coxph(formula = Surv(time, status == 2) ~ trt + age + bili +
## log(albumin) + log(protime) + strata(edema), data = pbc)
##
## coef exp(coef) se(coef) z p
## trt -0.082320 0.920977 0.191992 -0.429 0.668092
## age 0.033945 1.034528 0.009614 3.531 0.000414
## bili 0.114832 1.121685 0.016387 7.008 2.42e-12
## log(albumin) -3.608973 0.027080 0.716617 -5.036 4.75e-07
## log(protime) 3.456192 31.696036 0.928979 3.720 0.000199
##
## Likelihood ratio test=98.63 on 5 df, p=< 2.2e-16
## n= 312, number of events= 125
## (106 observations deleted due to missingness)
Weibull AFT model
An alternative to the Cox model under non-proportional hazards is the AFT model. Let's fit a Weibull AFT model.
In the output, the coef and HR tables are estimates for the Weibull PH parameterization, while the ETR table and the coefficient estimates in summary are for the Weibull AFT parameterization.
The ETR (treated vs. control) equals 1.01: the time corresponding to any given survival probability is only 1% longer for treated than for control subjects, and the effect is not statistically significant (95% CI: 0.80-1.28, p = 0.93).
library(SurvRegCensCov)
weibull.aftfit <- WeibullReg(Surv(time,status==2) ~ trt+ age + edema + bili + log(albumin) + log(protime), data = pbc)
weibull.aftfit
## $formula
## Surv(time, status == 2) ~ trt + age + edema + bili + log(albumin) +
## log(protime)
##
## $coef
## Estimate SE
## lambda 8.636575e-09 2.039620e-08
## gamma 1.536983e+00 1.112450e-01
## trt -1.633552e-02 1.827419e-01
## age 3.222800e-02 9.270163e-03
## edema 8.659405e-01 3.059779e-01
## bili 1.139634e-01 1.454384e-02
## log(albumin) -3.751319e+00 6.887614e-01
## log(protime) 3.549429e+00 8.894523e-01
##
## $HR
## HR LB UB
## trt 0.98379718 0.687630920 1.40752382
## age 1.03275294 1.014158102 1.05168873
## edema 2.37724094 1.305041628 4.33034040
## bili 1.12071105 1.089215765 1.15311704
## log(albumin) 0.02348674 0.006089027 0.09059357
## log(protime) 34.79345265 6.086874489 198.88439455
##
## $ETR
## ETR LB UB
## trt 1.01068499 0.80056890 1.2759478
## age 0.97924995 0.96763270 0.9910067
## edema 0.56926855 0.38638241 0.8387201
## bili 0.92853476 0.91201513 0.9453536
## log(albumin) 11.48112092 4.82912309 27.2960815
## log(protime) 0.09932588 0.03014253 0.3272993
##
## $summary
##
## Call:
## survival::survreg(formula = formula, data = data, dist = "weibull")
## Value Std. Error z p
## (Intercept) 12.08033 1.54377 7.83 5.1e-15
## trt 0.01063 0.11891 0.09 0.92878
## age -0.02097 0.00609 -3.44 0.00057
## edema -0.56340 0.19772 -2.85 0.00438
## bili -0.07415 0.00916 -8.10 5.7e-16
## log(albumin) 2.44070 0.44186 5.52 3.3e-08
## log(protime) -2.30935 0.60841 -3.80 0.00015
## Log(scale) -0.42982 0.07238 -5.94 2.9e-09
##
## Scale= 0.651
##
## Weibull distribution
## Loglik(model)= -1104.2 Loglik(intercept only)= -1188.8
## Chisq= 169.17 on 6 degrees of freedom, p= 6.8e-34
## Number of Newton-Raphson Iterations: 6
## n=312 (106 observations deleted due to missingness)
The adequacy of the Weibull distribution for survival data with respect to one categorical covariate can be examined via a plot of log time vs. log estimated cumulative hazard. If the Weibull distribution fits the data well, the lines produced should be linear and parallel.
WeibullDiag(Surv(time,status==2) ~ trt, data = pbc)
Extended Cox for time-varying covariates
Below is an example from the vignette of the tmerge function.
pbcseq contains the follow-up laboratory data for each study patient, while pbc contains only the baseline measurements. However, some baseline values in pbcseq differ from those in pbc (for instance, data errors in prothrombin time and age). Therefore, the baseline data are taken from pbc before merging with pbcseq to create the time-dependent covariates.
The coefficient estimates from the original Cox model (coxfit) and this time-varying covariate extended Cox model (coxfit2) are quite different, especially for log(bili), log(albumin) and log(protime).
temp <- subset(pbc, id <= 312, select=c(id:sex, stage)) # baseline data
pbc2 <- tmerge(temp, temp, id=id, endpt = event(time, status))
pbc2 <- tmerge(pbc2, pbcseq, id=id, ascites = tdc(day, ascites),
edema =tdc(day, edema),
bili = tdc(day, bili), albumin = tdc(day, albumin),
protime = tdc(day, protime), alk.phos = tdc(day, alk.phos))
coxfit2 <- coxph(Surv(tstart, tstop, endpt==2) ~ trt + age + edema + log(bili) + log(albumin) + log(protime), data=pbc2)
rbind('baseline fit' = coef(coxfit),
'time dependent' = coef(coxfit2))
## trt age edema log(bili) log(albumin)
## baseline fit 0.13550199 0.03462547 0.7871006 0.8848527 -3.083307
## time dependent 0.02175964 0.04398652 0.7871554 1.1764688 -4.016083
## log(protime)
## baseline fit 2.969493
## time dependent 2.654881
Extended Cox for recurrent events
A look at the Chronic granulomatous disease (CGD) data in wide format:
data(cgd0, package = 'survival')
head(cgd0)
## id center random treat sex age height weight inherit steroids propylac
## 1 1 204 82888 1 2 12 147.0 62.0 2 2 2
## 2 2 204 82888 0 1 15 159.0 47.5 2 2 1
## 3 3 204 82988 1 1 19 171.0 72.7 1 2 1
## 4 4 204 91388 1 1 12 142.0 34.0 1 2 1
## 5 5 238 92888 0 1 17 162.5 52.7 1 2 1
## 6 6 245 93088 1 2 44 153.3 45.0 2 2 2
## hos.cat futime etime1 etime2 etime3 etime4 etime5 etime6 etime7
## 1 2 414 219 373 NA NA NA NA NA
## 2 2 439 8 26 152 241 249 322 350
## 3 2 382 NA NA NA NA NA NA NA
## 4 2 388 NA NA NA NA NA NA NA
## 5 1 383 246 253 NA NA NA NA NA
## 6 2 364 NA NA NA NA NA NA NA
The transformation from wide to long format can be done with the tmerge function. A look at the data in long format, which is needed for modeling:
data(cgd, package = 'survival')
head(cgd)
## id center random treat sex age height weight inherit
## 1 1 Scripps Institute 1989-06-07 rIFN-g female 12 147 62.0 autosomal
## 2 1 Scripps Institute 1989-06-07 rIFN-g female 12 147 62.0 autosomal
## 3 1 Scripps Institute 1989-06-07 rIFN-g female 12 147 62.0 autosomal
## 4 2 Scripps Institute 1989-06-07 placebo male 15 159 47.5 autosomal
## 5 2 Scripps Institute 1989-06-07 placebo male 15 159 47.5 autosomal
## 6 2 Scripps Institute 1989-06-07 placebo male 15 159 47.5 autosomal
## steroids propylac hos.cat tstart enum tstop status
## 1 0 0 US:other 0 1 219 1
## 2 0 0 US:other 219 2 373 1
## 3 0 0 US:other 373 3 414 0
## 4 0 1 US:other 0 1 8 1
## 5 0 1 US:other 8 2 26 1
## 6 0 1 US:other 26 3 152 1
As subjects have multiple events, the rows for the events are correlated within subject and a cluster variance is needed.
(extendedCox.fit <- coxph(Surv(tstart, tstop, status) ~ treat + inherit + steroids, data =cgd, cluster = id))
## Call:
## coxph(formula = Surv(tstart, tstop, status) ~ treat + inherit +
## steroids, data = cgd, cluster = id)
##
## coef exp(coef) se(coef) robust se z p
## treatrIFN-g -1.0722 0.3422 0.2619 0.3118 -3.438 0.000585
## inheritautosomal 0.1777 1.1944 0.2356 0.3180 0.559 0.576395
## steroids 0.7726 2.1653 0.5169 0.4687 1.648 0.099310
##
## Likelihood ratio test=22.49 on 3 df, p=5.149e-05
## n= 203, number of events= 76
My focus here is a high-level overview of the pros and cons of each method. For a review of the mathematical details, please see: Statistical tests comparing two survival curves.
The null hypothesis in this approach is that the risk of mortality after treatment A is the same as the risk of mortality after treatment B at all time points.
The difference in treatment benefits in survival analysis is commonly quantified and reported as the (cumulative) survival rate, the hazard ratio (HR), the median survival time (MT), or the restricted mean survival time (RMST).
The log-rank test corresponding to the HR is a common and classical choice when the HR between treatment arms remains constant over time (the proportional hazards assumption). However, when the proportional hazards assumption is violated, especially when two survival curves cross each other, the difference based on hazards may be offset before and after the crossing point, resulting in a loss of power of the log-rank test.
The weighted log-rank test: weighted log-rank statistics take the form of a weighted sum of the differences between the estimated hazard functions at each observed failure time. As a result, these statistics test whether the hazard difference between the treatment group and the control group is zero. In the non-PH setting, the relative difference of the two hazard functions is not constant over time; therefore, differential weighting at different time points (compared to the equal weighting in the log-rank statistic) has the potential to improve the efficiency of the test statistic.
The Fleming-Harrington family of weighted log-rank test statistics uses the weight \(FH(\rho, \gamma) = S(t-)^\rho (1-S(t-))^\gamma\):
\(FH(0,0)\) is the log-rank statistic, which is most powerful under the proportional hazards assumption;
when there is a diminishing effect (i.e. early separation), \(FH(\rho, 0)\) with \(\rho > 0\), which over-weights the early events, provides higher power to detect a treatment difference compared with equal weighting;
when a delayed effect exists, \(FH(0, \gamma)\) with \(\gamma > 0\), which over-weights the late events, is more powerful to detect the late separation;
\(FH(\rho, \gamma)\) with \(\rho = \gamma > 0\) is more powerful if the biggest separation of the two hazard functions occurs in the middle. A quick sketch with survdiff follows this list.
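The \(FH(\rho, 0)\) subfamily (the G-rho family) is available directly through survdiff in the survival package, whose weights are the K-M estimate \(S(t-)^\rho\); the general \(FH(\rho, \gamma)\) weights require other packages (an example with IAfrac appears below):
library(survival)
data(pbc, package = "survival")
# rho = 0 gives the log-rank test, FH(0,0)
survdiff(Surv(time, status == 2) ~ trt, data = pbc, rho = 0)
# rho = 1 weights by S(t-), i.e. FH(1,0), emphasizing early events
survdiff(Surv(time, status == 2) ~ trt, data = pbc, rho = 1)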
MaxCombo test: the test statistic takes the form: \[Z_{max} = \max_{\rho, \gamma} \{Z_{FH(\rho,\gamma)}\} \]
where \(Z_{FH(\rho,\gamma)}\) is the standardized Fleming-Harrington weighted log-rank statistic. In particular, the original MaxCombo test uses the combination of \(FH(0,0)\), \(FH(0,1)\), \(FH(1,1)\) and \(FH(1,0)\), which should be sensitive to PH, late-separation, middle-separation and early-separation scenarios.
Modified MaxCombo test:
Option 1: \(FH(0,0)\), \(FH(0,0.5)\), \(FH(0.5,0)\), \(FH(0.5,0.5)\): conservative and less sensitive to tail events.
Option 2: \(FH(0,0)\), \(FH(0,0.5)\), \(FH(0,1)\): if a delayed effect is the only possibility.
When the MaxCombo test is used, the treatment effect estimate is taken as the estimated HR obtained from the weighted Cox model corresponding to the weighted log-rank test with the smallest p-value (Lin et al., 2019).
The MaxCombo test provides a pragmatic alternative to the log-rank test under non-proportional hazards (Mukhopadhyay et al., 2022).
Weighted Kaplan-Meier tests take the form of a weighted sum of the differences between the Kaplan-Meier estimates of the survival functions. Therefore, they test whether the two underlying survival functions are the same. A particularly interesting weighted Kaplan-Meier test sets the weight equal to 1, resulting in the difference of the two RMSTs (Lin et al., 2019).
The method based on the difference in MT depends on information at the median survival point (where the survival rate equals 0.5). Therefore, when the censoring rate is large or the follow-up time is insufficient (so that the survival curve does not reach 0.5), this method cannot be applied without an accurate estimate of the MT. Also, a test based on a difference in MT is not applicable when the crossing point of the survival curves is located near a survival rate of 0.5 (Huang et al., 2020).
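As an aside, the median survival time and its confidence interval per group can be read directly off a fitted survfit object; a minimal sketch using the kmfit object from the PBC example above:
# Median survival time with 95% CI for each treatment arm
quantile(kmfit, probs = 0.5)
# print(kmfit) reports the same medians and CIs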
The method based on the difference in RMST does not require the assumption of proportional hazards. However, when the two survival curves cross each other, the difference in the RMST may be offset before and after the crossing point, which causes an invalid assessment of the difference in treatment benefits and reduced power (Huang et al., 2020).
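A minimal sketch of an RMST comparison, assuming the survRM2 package (not used elsewhere in this post) and an arbitrarily chosen truncation time of 2000 days, on the PBC data:
library(survRM2)
pbc_cc <- subset(pbc, !is.na(trt))             # rmst2 requires a 0/1 arm without NAs
rmst2(time   = pbc_cc$time,
      status = as.numeric(pbc_cc$status == 2),
      arm    = pbc_cc$trt - 1,                 # recode 1/2 to 0/1
      tau    = 2000)                           # truncation time (days), arbitrary here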
The area between two survival curves (ABS) measure effectively reflects the absolute benefit of treatment effects between two groups in randomized clinical trials. The ABS is a robust measure for comparing the treatment benefits of two groups and can be applied regardless of whether the proportional hazards assumption is violated or the survival curves cross.
Meanwhile, the corresponding ABS permutation (ABSP) test is also robust and reliable. When researchers are interested in treatment effects over any time interval (i.e., evaluation of the short-term or long-term treatment benefits of two cancer drugs), the corresponding interval ABSP (ABSPi) test can be applied in combination with Kaplan-Meier curves to draw a comprehensive conclusion (Huang et al., 2020).
Two-stage procedure test: details can be found in Qiu and Sheng (2008).
Other non-parametric tests with higher power against crossing hazard alternatives include Kolmogorov–Smirnov and Cramer–von Mises types of tests and the median test of Brookmeyer and Crowley.
The null hypothesis in this approach is that the survival probability at time \(t_0\) after treatment A is the same as the survival probability at time \(t_0\) after treatment B.
Comparisons at fixed points in time are often of interest when the survival curves of the treatments are known to cross.
Survival probabilities at time \(t_0\) can be compared using the method of Zhang (Zhang et al., 2007) or the pseudo-value approach (Klein et al., 2007).
For illustration, I use the Crossdata dataset in the ComparisonSurv package.
First, I examine the K-M plot, which shows crossing survival curves between the two groups.
library(ComparisonSurv)
data("Crossdata")
crossdata.km <- survfit(Surv(time, status) ~ group, data = Crossdata)
ggsurvplot(crossdata.km,
           xlab = "Time (years)",   # customize X axis label
           ggtheme = theme_light() +
             theme(legend.title = element_text(size = rel(1.25)),
                   legend.text = element_text(size = rel(1.25))),  # customize plot and risk table with a theme
           surv.median.line = "hv", # add the median survival pointer
           legend.title = "Treatment",
           palette = "Dark2"        # custom color palette
)
The ComparisonSurv package provides a function to conveniently compare two survival curves with multiple tests. The PH assumption is tested by the Grambsch-Therneau test (as in cox.zph of the survival package).
Overall.test(Crossdata$time, Crossdata$status, Crossdata$group)
##
## PH assumption yield a result of: Pvalue =0.00035, chisq =12.76765.
## method statistic pvalue
## 1 Log-rank 0.99103 0.31949
## 2 Gehan-Wilcoxon -0.89872 0.36880
## 3 Tarone-Ware -0.11533 0.90818
## 4 Weighted KM -0.02205 0.98241
## 5 ABS 1.94767 0.05145
## 6 ABS permutation 1.94767 0.02000
## 7 Two-stage NA 0.02532
## 8 Squared differences 0.74491 0.45633
## 9 RMST (tau= 4.66533347498626 ) 0.19763 0.33130
L_interval <- Crossdata %>% filter(time <=1.7)
R_interval <- Crossdata %>% filter(time > 1.7)
#taking row 6 for ABSPi_L
overall_test <- Overall.test(L_interval$time, L_interval$status, L_interval$group)
##
## PH assumption yield a result of: Pvalue =0.04027, chisq =4.20628.
## method statistic pvalue
## 1 Log-rank 1.87754 0.17061
## 2 Gehan-Wilcoxon -2.31947 0.02037
## 3 Tarone-Ware -1.94566 0.05170
## 4 Weighted KM -2.30598 0.02111
## 5 ABS 1.93758 0.05267
## 6 ABS permutation 1.93758 0.03600
## 7 Two-stage NA 0.04092
## 8 Squared differences -0.71155 0.47674
## 9 RMST (tau= 1.69013354003741 ) -0.204 0.02574
#taking row 6 for ABSPi_R
Overall.test(R_interval$time, R_interval$status, R_interval$group)
##
## PH assumption yield a result of: Pvalue =0.5712, chisq =0.32069.
## method statistic pvalue
## 1 Log-rank 8.48658 0.00358
## 2 Gehan-Wilcoxon 2.87986 0.00398
## 3 Tarone-Ware 2.92056 0.00349
## 4 Weighted KM 2.87753 0.00401
## 5 ABS 3.0367 0.00239
## 6 ABS permutation 3.0367 0.00800
## 7 Two-stage NA 0.00358
## 8 Squared differences 1.16412 0.24438
## 9 RMST (tau= 4.66533347498626 ) 0.76546 0.00330
Fleming-Harrington weighted log-rank test:
# devtools::install_github('lilywang1988/IAfrac') #install IAfrac package from Github if not already installed
library(IAfrac)
FH.stats <- FH.test(Crossdata$time, Crossdata$status, Crossdata$group, rho = 1, gamma = 0)
(FH.p_value <- 1 - pchisq(FH.stats$Z, 1)) # p-value
## [1] 0.4864349
Max-Combo test:
library(maxcombo)
Crossdata2 <- Crossdata %>%
mutate(id = 1:nrow(.),
treated = group,
Atime = 0,
Btime = time,
Bobserved = ifelse(status == 1, T, F),
Ctime = time,
Cobserved = ifelse(status == 0, T, F)) %>%
select(id, treated, Atime, Btime, Bobserved, Ctime, Cobserved)
oolistweightingfunctions_G00_G01_G10_G11 <- base::list(
logrank=function(stminus){ base::return(1.0) },
flemingharrington01=function(stminus){ base::return(1.0 - stminus) },
flemingharrington10=function(stminus){ base::return(stminus) },
flemingharrington11=function(stminus){ base::return(stminus * (1 -stminus)) }
)
oogetdoublemaxcomboteststatistic(
oodataframe = Crossdata2,
oolistfunctionweightasafunctionofstminus = oolistweightingfunctions_G00_G01_G10_G11
)
## [1] 2.75419
oogetdoublemaxcombotestpvalue(
oodataframe = Crossdata2,
oolistfunctionweightasafunctionofstminus = oolistweightingfunctions_G00_G01_G10_G11
)
## [1] 0.007160818
Compare at fixed points
Fixpoint.test(Crossdata$time, Crossdata$status, Crossdata$group, t0 = 0.5)
## $est.g0
## method t0 est lower.95.CI upper.95.CI
## 1 Naive 0.5 0.9487089 0.9048900 0.9925277
## 2 Log 0.5 0.9487089 0.9476661 0.9497314
## 3 Cloglog 0.5 0.9487089 0.9487060 0.9487117
## 4 Arcsin-square 0.5 0.9487089 0.9476713 0.9497365
## 5 Logit 0.5 0.9487089 0.9106487 0.9497317
##
## $est.g1
## method t0 est lower.95.CI upper.95.CI
## 1 Naive 0.5 0.8123129 0.7340737 0.8905522
## 2 Log 0.5 0.8123129 0.8084334 0.8161232
## 3 Cloglog 0.5 0.8123129 0.8121468 0.8124790
## 4 Arcsin-square 0.5 0.8123129 0.8084531 0.8161425
## 5 Logit 0.5 0.8123129 0.7763788 0.8161275
##
## $test
## method statistic pvalue
## 1 Naive 8.11135 0.00440
## 2 Log 8.88749 0.00287
## 3 Cloglog 7.36022 0.00667
## 4 Arcsin-square 9.30954 0.00228
## 5 Logit 7.54433 0.00602
In parallel RCTs, such as oncology trials, treatment decisions may change over time to better fit the patient's health status. Treatment crossover causes the comparability between arms established by randomization to be lost.
Simple methods to handle treatment switching:
More rigorous methods are available, with the rank preserving structural failure time model (RPSFTM), inverse probability of censoring weighting (IPCW) and the two-stage procedure among the most promising.
The RPSFTM uses a causal model to produce counterfactual event times (the time that would be observed if no treatment were received) in order to estimate a causal treatment effect.
Let \(T_i = T_i^{off} + T_i^{on}\) be the observed event time for participant \(i\), where \(T_i^{off}\) and \(T_i^{on}\) represent the time spent off and on treatment, respectively. The \(T_i\) are related to the counterfactual event times, \(U_i\), via the following causal model: \[U_i = T_i^{off} + T_i^{on}\exp(\psi)\]
where \(\exp(-\psi)\) is the acceleration factor.
Assumptions: the RPSFTM relies on two assumptions:
Estimation: assumption 1 should be plausible in an RCT; therefore, a grid search (g-estimation) is used to find \(\psi_0\) such that \(U_i\) is independent of the randomization arm \(R_i\). For a range of \(\psi\)'s, the hypothesis \(\psi_0 = \psi\) is tested by computing \(U_i(\psi)\), subject to censoring, and calculating the test statistic \(Z(\psi)\). This is usually the same test statistic as for the ITT analysis, for example the log-rank test statistic to compare survival distributions between groups. In the rpsftm() function, the possible test options are the log-rank test and the Wald test from a Cox or Weibull regression model. For the parametric Weibull test, the point estimate \(\hat{\psi}\) is the value of \(\psi\) for which \(Z(\psi) = 0\). For the non-parametric tests (log-rank, Cox), \(\hat{\psi}\) is the value of \(\psi\) at which \(Z(\psi)\) crosses 0, since \(Z(\psi)\) is a step function (Allison et al., 2017).
Re-censoring: consider a patient whose observed event time \(T_i\) is extended and ends up censored because of switching to a superior treatment, whereas the event would have been observed without the switch. Therefore, when moving from the \(T_i\) scale to the \(U_i\) scale, re-censoring is required for some patients.
Let \(C_i\) be the potential censoring time for participant \(i\). A participant is then re-censored at the minimum possible censoring time:
\[D^*_i(\psi) = \min(C_i, C_i \exp(\psi))\]
If \(D^*_i(\psi) < U_i(\psi)\), then \(U_i\) is replaced by \(D^*_i\) and the censoring indicator is replaced by 0. For treatment arms where switching does not occur, there can be no informative censoring, so re-censoring is not applied (Allison et al., 2017).
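The re-censoring rule translates directly into a few lines of R. Below is a sketch with made-up inputs (not the rpsftm implementation):
# U: counterfactual times U_i(psi); C: potential censoring times C_i;
# event: original event indicators; psi: the causal parameter
recensor <- function(U, C, event, psi) {
  Dstar <- pmin(C, C * exp(psi))       # minimum possible censoring time D*_i(psi)
  hit   <- Dstar < U                   # subjects that must be re-censored
  list(time  = ifelse(hit, Dstar, U),  # replace U_i by D*_i where needed
       event = ifelse(hit, 0, event))  # and set their censoring indicator to 0
}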
Variations: the RPSFTM can be implemented in different ways involving different assumptions, such as models allowing for lagged effects of treatment, or effects that differ in magnitude across randomized groups (Sullivan et al., 2020).
The IPCW method adjusts estimates of a treatment effect in the presence of informative censoring. When patients from treatment B switch to treatment A, they are artificially censored at the time of the switch. To compensate for this selection bias, we find patients in treatment B who did not switch but have baseline characteristics and time-dependent prognostic covariates similar to the switchers, and give them higher weights.
The weight applied to individual \(i\) for time interval \(t\) is given by: \[w_{i,t} = \frac{\prod_{k=0}^t P(C(k)_i = 0 \mid C(k-1)_i = 0, X_i)}{\prod_{k=0}^t P(C(k)_i = 0 \mid C(k-1)_i = 0, X_i, Z(k)_i)} \] The numerator is the probability of patient \(i\) remaining uncensored at the end of time \(k\), given that the patient was also uncensored at the end of time \(k-1\) and given some baseline covariates \(X\).
The denominator is the probability of patient \(i\) remaining uncensored at the end of time \(k\), given that the patient was also uncensored at the end of time \(k-1\) and given some baseline covariates \(X\) and time-dependent prognostic factors \(Z\).
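As a sketch of how these probabilities might be estimated in practice (my own illustration with hypothetical variable names, assuming a long-format dataset long_dat with one row per subject-interval and switched = 1 in the interval a subject crosses over):
# Numerator model: switching as a function of the interval and baseline covariates X
num_fit <- glm(switched ~ interval + baseline_X, family = binomial, data = long_dat)
# Denominator model: additionally conditions on time-dependent prognostic factors Z
den_fit <- glm(switched ~ interval + baseline_X + timevar_Z,
               family = binomial, data = long_dat)
# Per-interval probabilities of remaining uncensored (i.e. not switching)
long_dat$p_num <- 1 - predict(num_fit, type = "response")
long_dat$p_den <- 1 - predict(den_fit, type = "response")
# Stabilized weight: cumulative product of the ratio over each subject's intervals
long_dat <- long_dat[order(long_dat$id, long_dat$interval), ]
long_dat$w <- ave(long_dat$p_num / long_dat$p_den, long_dat$id, FUN = cumprod)
# A weighted Cox model would then use weights = w and cluster(id) for robust SEs.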
Assumption: “no unmeasured confounders” assumption - only if there are data on all time-dependent prognostic factors for mortality that independently predict informative censoring (switching) can the method produce unbiased results.
This assumption is not testable from the observed data. The models for switching and survival must be correctly specified, and the method fails if any covariate guarantees (that is, with probability equal to 1) that treatment switching will occur (Latimer et al., 2014).
Assumption: no unmeasured confounders (a stronger assumption than for IPCW, as fewer covariates are included in the model).
Always present the ITT analysis.
Don't use the naive analysis as the sole or primary method.
Don't use IPCW or the two-stage AFT method if most patients switch or if there is a near-perfect predictor of switching.
Don't use the RPSFTM if exposure or survival is nearly equal between arms.
We will use the immdef data available in the rpsftm package.
First, let’s take a look at data:
library(rpsftm)
data(immdef, package = 'rpsftm')
head(immdef)
## id def imm censyrs xo xoyrs prog progyrs entry
## 1 1 0 1 3 0 0.000000 0 3.000000 0
## 2 2 1 0 3 1 2.652797 0 3.000000 0
## 3 3 0 1 3 0 0.000000 1 1.737838 0
## 4 4 0 1 3 0 0.000000 1 2.166291 0
## 5 5 1 0 3 1 2.122100 1 2.884646 0
## 6 6 1 0 3 1 0.557392 0 3.000000 0
Intention-to-treat analysis: fitting a Weibull AFT model to the full analysis set shows that immediate treatment extends survival time by a factor of 1.158, but the effect is not statistically significant (ETR = 1.158, 95% CI: 0.996-1.347).
# K-M plot using base graphics (black & white), for easier comparison with the rpsftm plot later
immdef.kmfit <- survfit(Surv(progyrs, prog) ~ imm, data = immdef)
plot(immdef.kmfit, lty = c("solid", "dashed"), xlab = "Time", ylab = "Survival")
legend('topright',c("Reference", "Comparator"),lty = c("solid", "dashed"))
weibull.fit <- WeibullReg(Surv(progyrs, prog) ~ imm, data = immdef)
weibull.fit
## $formula
## Surv(progyrs, prog) ~ imm
##
## $coef
## Estimate SE
## lambda 0.1251584 0.01212204
## gamma 1.4803054 0.07754845
## imm -0.2177152 0.11363574
##
## $HR
## HR LB UB
## imm 0.8043545 0.6437549 1.005019
##
## $ETR
## ETR LB UB
## imm 1.15844 0.9960953 1.347244
##
## $summary
##
## Call:
## survival::survreg(formula = formula, data = data, dist = "weibull")
## Value Std. Error z p
## (Intercept) 1.4039 0.0620 22.65 <2e-16
## imm 0.1471 0.0770 1.91 0.056
## Log(scale) -0.3922 0.0524 -7.49 7e-14
##
## Scale= 0.676
##
## Weibull distribution
## Loglik(model)= -855.1 Loglik(intercept only)= -857
## Chisq= 3.69 on 1 degrees of freedom, p= 0.055
## Number of Newton-Raphson Iterations: 5
## n= 1000
RPSFT:
To apply the RPSFTM, we need to create a variable rx indicating the proportion of follow-up time spent on treatment:
rx = 0 indicates a patient on deferred treatment for the entire follow-up,
rx = 1 a patient entirely on the immediate treatment arm,
0 < rx < 1 the proportion of time on immediate treatment for a switcher.
Using the log-rank test, the RPSFTM estimates \(\hat{\psi} = -0.181\), so the acceleration factor is \(\exp(-\hat{\psi}) = 1.20\). This means immediate treatment extends survival time by a factor of 1.20 (95% CI: 0.998-1.419).
# xoyrs: the time, in years, from entry to switching, or 0 for participants in the Immediate arm
# progyrs: time, in years, from entry to disease progression or censoring
immdef <- immdef %>% mutate(rx = 1 - xoyrs/progyrs)
# log-rank test
rpsftm.fit <- rpsftm(Surv(progyrs, prog) ~ rand(imm, rx),
data=immdef,
censor_time=censyrs)
summary(rpsftm.fit)
## arm rx.Min. rx.1st Qu. rx.Median rx.Mean rx.3rd Qu. rx.Max.
## 1 0 0.0000000 0.0000000 0.0000000 0.1574062 0.2547779 0.9770941
## 2 1 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
## Length Class Mode
## psi 1 -none- numeric
## fit 17 survfit list
## CI 2 -none- numeric
## Sstar 1000 Surv numeric
## rand 2000 rand numeric
## ans 5 -none- list
## eval_z 2 data.frame list
## n 2 table numeric
## obs 2 -none- numeric
## exp 2 -none- numeric
## var 4 -none- numeric
## chisq 1 -none- numeric
## call 4 -none- call
## formula 3 terms call
## terms 3 terms call
##
## psi: -0.1813226
## exp(psi): 0.8341662
## Confidence Interval, psi -0.34984 0.002287789
## Confidence Interval, exp(psi) 0.7048008 1.00229
plot(rpsftm.fit)
To adjust for covariates, we can use the options test = coxph and test = survreg to base the estimation on the Wald test from a Cox model or from a Weibull model, respectively.
# Cox model
rpsftm.fit.coxph <- rpsftm(Surv(progyrs, prog) ~ rand(imm, rx) + entry,
data=immdef,
censor_time=censyrs,
test = coxph)
# Weibull model
rpsftm.fit.weireg <- rpsftm(Surv(progyrs, prog) ~ rand(imm, rx) + entry,
data=immdef,
censor_time=censyrs,
test = survreg)
When we assume effects that differ in magnitude across the randomized groups, we can use the option treat_modifier=. Here, we assume that the treatment effect of immediate treatment in switchers from the deferred arm is half of that in patients assigned to the immediate group from the start.
immdef <- immdef %>% mutate(weight = ifelse(imm==1, 1, .5))
rpsftm(Surv(progyrs, prog) ~ rand(imm, rx),
data=immdef,
censor_time=censyrs,
treat_modifier = weight)
## arm rx.Min. rx.1st Qu. rx.Median rx.Mean rx.3rd Qu. rx.Max.
## 1 0 0.0000000 0.0000000 0.0000000 0.1574062 0.2547779 0.9770941
## 2 1 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
## Call:
## rpsftm(formula = Surv(progyrs, prog) ~ rand(imm, rx), data = immdef,
## censor_time = censyrs, treat_modifier = weight)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## .arm=0 500 157 157 8.79e-06 1.86e-05
## .arm=1 500 143 143 9.66e-06 1.86e-05
##
## Chisq= 0 on 1 degrees of freedom, p= 1
##
## psi: -0.1706579
## exp(psi): 0.84311
Allison et al. (2017). rpsftm: An R Package for Rank Preserving Structural Failure Time Models.
Bland JM and Altman DG (2004). Statistics Notes: The logrank test.
Klein et al. (2007). Analyzing survival curves at a fixed point in time.
Mukhopadhyay et al. (2022). Log-Rank Test vs MaxCombo and Difference in Restricted Mean Survival Time Tests for Comparing Survival Under Nonproportional Hazards in Immuno-oncology Trials: A Systematic Review and Meta-analysis.
Latimer et al. (2014). Adjusting for treatment switching in randomised controlled trials: a simulation study and a simplified two-stage method.
Lin et al. (2019). Alternative Analysis Methods for Time to Event Endpoints under Non-proportional Hazards: A Comparative Analysis.
Huang et al. (2020). A Nonparametric Statistical Method for Two Crossing Survival Curves.
Pang et al. (2021). Flexible extension of the accelerated failure time model to account for nonlinear and time-dependent effects of covariates on the hazard.
Qiu P and Sheng J (2008). A two-stage procedure for comparing hazard rate functions.
Sfumato et al. (2019). Goftte: A R package for assessing goodness-of-fit in proportional (sub) distributions hazards regression models
Sullivan et al. (2020). Adjusting for Treatment Switching in Oncology Trials: A Systematic Review and Recommendations for Reporting.
Zhang et al. (2007). A SAS macro for estimation of direct adjusted survival curves based on a stratified Cox regression model
SAS macro %ADJ_SURV by Zhang (2017).
SAS macro %combo_wlr for the MaxCombo test.