(The purpose of this post is to summarize the literature and to provide analytical examples in R.)
We often wish to compare the survival experience of two or more groups of individuals. Two common but distinct approaches to comparing survival outcomes are: i) comparing the entire survival curves and ii) comparing the survival probabilities at a fixed point in time. These approaches test distinct hypotheses concerning survival experience. In the next sections, I will provide a brief overview of statistical models for time-to-event analysis, followed by an overview of statistical methods to compare survival experiences between groups. Here, the focus is on the usage and the pros and cons of each method rather than on its mathematical formulation.
Finally, treatment switching is becoming an important issue in clinical trials, so I will also go over this issue briefly.
Survival data can be described by four quantities: the survival function, the hazard function, the probability density function, and the cumulative distribution function. If we know one of them, it is easy to derive the other three. The transformations among them are presented in the figure below.
Mathematical entities and the transformations between them
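As a quick numerical sketch of these transformations (my own example, assuming a Weibull distribution with arbitrarily chosen shape 1.5 and scale 2):
shape <- 1.5; scale <- 2                         # arbitrary Weibull parameters
pdf    <- function(t) dweibull(t, shape, scale)  # probability density f(t)
cdf    <- function(t) pweibull(t, shape, scale)  # cumulative distribution F(t)
surv   <- function(t) 1 - cdf(t)                 # survival S(t) = 1 - F(t)
haz    <- function(t) pdf(t) / surv(t)           # hazard h(t) = f(t) / S(t)
cumhaz <- function(t) -log(surv(t))              # cumulative hazard H(t) = -log S(t)
t0 <- 2
all.equal(surv(t0), exp(-cumhaz(t0)))            # S(t) = exp(-H(t)): TRUE
all.equal(pdf(t0), haz(t0) * surv(t0))           # f(t) = h(t) * S(t): TRUE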
The assumptions shared by all time-to-event analyses are:
Deviations from these assumptions matter most if they are satisfied differently in the groups being compared (e.g. censoring is more likely in one group).
The Kaplan-Meier plot is a non-parametric method to estimate the survival probability directly from the observed survival times. It does not require any distributional assumption.
The Cox proportional hazards model does not require a specific probability distribution for the survival time, which makes it the most widely used model for analyzing survival data.
However, its validity depends heavily on the proportional hazards assumption; for this reason, the Cox model is often called the proportional hazards model. When the proportional hazards assumption does not hold, estimates derived from the Cox model lead to an improper fit and incorrect inferences.
PH assumption: the hazard can vary over time, but the hazard ratio of two individuals (at the same time) is constant.
Graphical methods: if the log-log survival curves for the two groups appear roughly parallel, the hazard ratio is constant over the period of observation.
Schoenfeld residuals and the Grambsch-Therneau test: if the PH assumption holds, the scaled Schoenfeld residuals should show no correlation with survival time. In other words, at every failure time, the covariate value of the subject who just failed should be similar to the risk-weighted average of the covariate among the remaining subjects.
Goodness-of-fit tests based on cumulative sums of residuals: Kolmogorov-Smirnov test, Cramer-von Mises test, Anderson-Darling test
Interactions with time as time-dependent covariates
Stratified Cox regression: stratifying the data set into \(m\) smaller datasets based on strata of the covariate(s). Each stratum has its own baseline hazard, but all strata share the regression parameters.
Adding a time-dependent covariate (i.e. a time-dependent coefficient Cox model): an interaction of the covariate of concern with time or with a function of time.
Accelerated failure-time model
The hazard function takes the form: \(h(t|X) = h_0(t)e^{\beta X}\)
As the baseline hazard is not specified, the Cox model is a semi-parametric model.
The time-varying covariate Cox model takes the form: \(h(t|X) = h_0(t)e^{\beta X(t)}\)
When explanatory variables are collected at more than one time point and change over time, it may be more appropriate to use time-varying covariates. The model is robust because it utilizes all available data.
Data for the time-varying covariate Cox model need to be in long format, as in repeated-measures regression. The follow-up period of a subject is partitioned into sub-intervals (tstart, tstop] based on the measurement times of the covariate(s), and each sub-interval appears as one row in the data frame, as in the toy sketch below.
Although a given subject has multiple observations, we generally do not need to worry about correlated data, as this data representation is simply a programming trick: the likelihood equations at any time point use only one copy of any subject, and the program picks out the correct row of data at each event time. However, when subjects have multiple events, the rows for the events are correlated within subject and a cluster variance is needed.
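To make the format concrete, here is a toy sketch (made-up values) of one subject's follow-up split at the days a lab value was re-measured:
one_subject <- data.frame(
  id     = 1,
  tstart = c(0, 180, 365),    # interval start (day of covariate measurement)
  tstop  = c(180, 365, 540),  # interval end (next measurement or end of follow-up)
  status = c(0, 0, 1),        # event indicator; only the last interval can be 1
  bili   = c(1.1, 1.8, 2.9)   # covariate value in force over each interval
)
# coxph(Surv(tstart, tstop, status) ~ bili, data = one_subject) then uses the
# bili value current at each event time.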
In contrast to the Cox PH model, where the baseline hazard \(h_0\) is not specified, the parametric PH models specify the baseline hazard function.
Weibull regression is the most popular parametric survival regression. The Weibull and exponential distributions are the only ones for which both PH and AFT parameterizations exist.
Weibull model with baseline hazard: \(h(t|X) = \gamma \lambda t^{\gamma-1} e^{\beta X}\)
The AFT model provides an alternative to the proportional hazards (PH) model for analyzing time-to-event data. In the AFT regression model, the effect of covariates on the logarithm of the survival time is assessed.
However, in contrast to the Cox PH model, the parametric AFT models require specifying the event time distribution, which is difficult in many real-life applications. The AFT models include the generalized gamma, log-logistic, log-normal, Weibull and exponential distributions.
Classic AFT models impose conventional assumptions: i) for all covariates, the event time ratios (ETRs), i.e. acceleration factors, are constant over time, and ii) continuous covariates have a linear relationship with log survival time.
If the "true" data-generating model is more consistent with the PH model, the constant time ratio assumption is violated, which may lead to biased estimates unless the baseline hazard is exponential or Weibull. Pang et al. (2021) proposed a flexible AFT model allowing both time-dependent and non-linear effects of continuous variables. Available code for the flexible AFT model.
The AFT model takes the form: \[\log T = Y = \mu + \alpha X + \sigma W \]
where \(X\) is a set of covariates and \(W\) has the extreme value distribution.
Weibull AFT form: \[h(t|X) = \gamma \lambda(t e^{-\alpha X})^{\gamma-1} e^{-\alpha X} \]
where \(\gamma = 1/\sigma\), \(\lambda = e^{-\mu/\sigma}\), and \(\beta = -\alpha/\sigma\) are the transformations to the Weibull PH model.
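As a sketch of these transformations in practice, we can fit a Weibull AFT model with survival::survreg and convert its estimates to the PH scale by hand (the single covariate trt on the pbc data below is just for illustration; the SurvRegCensCov package used later automates this):
library(survival)
data(pbc, package = "survival")
aft <- survreg(Surv(time, status == 2) ~ trt, data = pbc, dist = "weibull")
sigma  <- aft$scale                                 # sigma in the AFT form above
gamma  <- 1 / sigma                                 # Weibull shape: gamma = 1/sigma
lambda <- exp(-coef(aft)[["(Intercept)"]] / sigma)  # lambda = exp(-mu/sigma)
beta   <- -coef(aft)[-1] / sigma                    # PH coefficient: beta = -alpha/sigma
exp(beta)                                           # hazard ratio for trt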
Plot comparing the K-M curves of the two treatment groups in the primary biliary cirrhosis (PBC) dataset:
library(tidyverse)
library(survival)
library(survminer)
library(flexsurv)
data(pbc, package="survival")
kmfit <- survfit(Surv(time, status == 2) ~ trt, data = pbc)
ggsurvplot(kmfit,
           pval = TRUE,              # show p-value of the log-rank test
           conf.int = FALSE,         # show confidence intervals for the survival curves
           xlab = "Time (days)",     # customize X axis label
           break.time.by = 500,      # break X axis into intervals of 500
           ggtheme = theme_light() +
             theme(legend.title = element_text(size = rel(1.25)),
                   legend.text = element_text(size = rel(1.25))),  # customize plot and risk table with a theme
           surv.median.line = "hv",  # add the median survival pointer
           legend.labs = c("Standard", "Experimental"),  # change legend labels
           legend.title = "Treatment",
           palette = "Dark2"         # custom color palette
)
Cox PH model:
coxfit <- coxph(Surv(time,status==2) ~ trt + age + edema + log(bili) + log(albumin) + log(protime), ties="efron", data = pbc)
summary(coxfit)
## Call:
## coxph(formula = Surv(time, status == 2) ~ trt + age + edema +
## log(bili) + log(albumin) + log(protime), data = pbc, ties = "efron")
##
## n= 312, number of events= 125
## (106 observations deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## trt 0.135502 1.145111 0.185424 0.731 0.464920
## age 0.034625 1.035232 0.008905 3.888 0.000101 ***
## edema 0.787101 2.197017 0.296321 2.656 0.007902 **
## log(bili) 0.884853 2.422627 0.098718 8.963 < 2e-16 ***
## log(albumin) -3.083307 0.045808 0.718950 -4.289 1.8e-05 ***
## log(protime) 2.969493 19.482041 1.016033 2.923 0.003471 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## trt 1.14511 0.87328 0.79619 1.6470
## age 1.03523 0.96597 1.01732 1.0535
## edema 2.19702 0.45516 1.22915 3.9270
## log(bili) 2.42263 0.41277 1.99645 2.9398
## log(albumin) 0.04581 21.83048 0.01119 0.1875
## log(protime) 19.48204 0.05133 2.65941 142.7196
##
## Concordance= 0.844 (se = 0.02 )
## Likelihood ratio test= 199.9 on 6 df, p=<2e-16
## Wald test = 197.6 on 6 df, p=<2e-16
## Score (logrank) test = 269.7 on 6 df, p=<2e-16
Log-log plot: although the curves cross, they roughly overlap; therefore, the data do not deviate severely from the PH assumption.
ggsurvplot(kmfit, fun = 'cloglog',
           legend.labs = c("Standard", "Experimental"),  # change legend labels
           legend.title = "Treatment",
           palette = "Dark2")
Plot of Schoenfeld residuals and the Grambsch-Therneau test: edema and log(protime) violate the PH assumption.
# Schoenfeld residuals
ggcoxdiagnostics(coxfit, type = "schoenfeld")
# Grambsch-Therneau test
cox.zph(coxfit)
## chisq df p
## trt 0.603 1 0.438
## age 0.502 1 0.479
## edema 4.100 1 0.043
## log(bili) 1.616 1 0.204
## log(albumin) 0.190 1 0.663
## log(protime) 5.445 1 0.020
## GLOBAL 12.047 6 0.061
Goodness-of-fit tests based on cumulative sums of residuals also indicate non-proportional hazards for edema and log(protime):
library(goftte)
coxprop <- prop(coxfit, type.test='Lin',plots = 200,
variable=c("treatment","age", "edema","bilirubin","log(albumin)","log(protime)"))
coxprop
##
## Rejection p-values associated to Lin's approximation for proportional hazards assumption
## ---
## Kolmogorov-Smirnov-test: p-value=0.207
## Cramer-von-Mises-test: p-value=0.3
## Anderson-Darling-test: p-value=0.405
## Based on 1000 realizations. Cumulated residuals ordered by treatment-variable.
## ---
## Kolmogorov-Smirnov-test: p-value=0.263
## Cramer-von-Mises-test: p-value=0.288
## Anderson-Darling-test: p-value=0.355
## Based on 1000 realizations. Cumulated residuals ordered by age-variable.
## ---
## Kolmogorov-Smirnov-test: p-value=0.015
## Cramer-von-Mises-test: p-value=0.027
## Anderson-Darling-test: p-value=0.029
## Based on 1000 realizations. Cumulated residuals ordered by edema-variable.
## ---
## Kolmogorov-Smirnov-test: p-value=0.084
## Cramer-von-Mises-test: p-value=0.193
## Anderson-Darling-test: p-value=0.198
## Based on 1000 realizations. Cumulated residuals ordered by bilirubin-variable.
## ---
## Kolmogorov-Smirnov-test: p-value=0.742
## Cramer-von-Mises-test: p-value=0.674
## Anderson-Darling-test: p-value=0.685
## Based on 1000 realizations. Cumulated residuals ordered by log(albumin)-variable.
## ---
## Kolmogorov-Smirnov-test: p-value=0.002
## Cramer-von-Mises-test: p-value=0.002
## Anderson-Darling-test: p-value=0.002
## Based on 1000 realizations. Cumulated residuals ordered by log(protime)-variable.
## ---
plot(coxprop)
Assessing the functional form of a covariate: replicating Example 73.12 of the SAS documentation for the ASSESS statement in PROC PHREG, we can use the fcov function in the goftte package to assess the functional form of bilirubin in the model. A Kolmogorov-type supremum test based on a sample of 1,000 simulated residual patterns gives p < 0.001. The plot shows that none of the 1,000 simulated realizations has an absolute maximum exceeding that of the observed cumulative martingale residual process. The result indicates that the linear form of bili is not appropriate for the model.
coxfit2 <- coxph(Surv(time,status==2) ~ trt + age + edema + bili + log(albumin) + log(protime), ties="breslow", data = pbc)
cox.fcov <- fcov(coxfit2, type.test='Lin',plots = 200,
variable=c("treatment","age", "edema","bilirubin","log(albumin)","log(protime)"))
plot(cox.fcov, idx=4)
Add time-dependent covariate
The interaction term of edema with time is not statistically significant when we include interaction terms of time with both edema and log(protime) in the model.
coxfit_time1 <- coxph(Surv(time,status==2) ~ trt + age + edema + bili + log(albumin) + log(protime) + tt(edema) + tt(log(protime)), ties="breslow", data = pbc,
tt=function(x, t, ...) {
xtime=x*t/365.25
})
coxfit_time1
## Call:
## coxph(formula = Surv(time, status == 2) ~ trt + age + edema +
## bili + log(albumin) + log(protime) + tt(edema) + tt(log(protime)),
## data = pbc, ties = "breslow", tt = function(x, t, ...) {
## xtime = x * t/365.25
## })
##
## coef exp(coef) se(coef) z p
## trt -0.011966 0.988106 0.185010 -0.065 0.948433
## age 0.033383 1.033947 0.009475 3.523 0.000426
## edema 1.328368 3.774877 0.460209 2.886 0.003896
## bili 0.110266 1.116575 0.015624 7.057 1.70e-12
## log(albumin) -3.766113 0.023142 0.719533 -5.234 1.66e-07
## log(protime) 6.381214 590.644454 1.608171 3.968 7.25e-05
## tt(edema) -0.231251 0.793540 0.150309 -1.539 0.123924
## tt(log(protime)) -0.721962 0.485798 0.359230 -2.010 0.044457
##
## Likelihood ratio test=174.3 on 8 df, p=< 2.2e-16
## n= 312, number of events= 125
## (106 observations deleted due to missingness)
Therefore, we can drop the interaction term between edema and time.
coxfit_time2 <- coxph(Surv(time,status==2) ~ trt + age + edema + bili + log(albumin) + log(protime) + tt(log(protime)),
ties="breslow", data = pbc,
tt=function(x, t, ...) {
xtime=x*t/365.25
})
coxfit_time2
## Call:
## coxph(formula = Surv(time, status == 2) ~ trt + age + edema +
## bili + log(albumin) + log(protime) + tt(log(protime)), data = pbc,
## ties = "breslow", tt = function(x, t, ...) {
## xtime = x * t/365.25
## })
##
## coef exp(coef) se(coef) z p
## trt 0.014231 1.014333 0.184143 0.077 0.938397
## age 0.033872 1.034452 0.009398 3.604 0.000313
## edema 0.779225 2.179782 0.304559 2.559 0.010512
## bili 0.112017 1.118532 0.015416 7.266 3.70e-13
## log(albumin) -3.848515 0.021311 0.714031 -5.390 7.05e-08
## log(protime) 6.618422 748.762903 1.549773 4.271 1.95e-05
## tt(log(protime)) -0.831122 0.435560 0.350268 -2.373 0.017653
##
## Likelihood ratio test=171.6 on 7 df, p=< 2.2e-16
## n= 312, number of events= 125
## (106 observations deleted due to missingness)
Stratified Cox model
With edema being a categorical variable, we can fit a Cox model stratified by edema. When included in the model, the interaction term between log(protime) and time is not statistically significant (not shown here), so we drop it from the model.
stratifiedCox <- coxph(Surv(time,status==2) ~ trt + age + bili + log(albumin) + log(protime) + strata(edema),
data = pbc)
stratifiedCox
## Call:
## coxph(formula = Surv(time, status == 2) ~ trt + age + bili +
## log(albumin) + log(protime) + strata(edema), data = pbc)
##
## coef exp(coef) se(coef) z p
## trt -0.082320 0.920977 0.191992 -0.429 0.668092
## age 0.033945 1.034528 0.009614 3.531 0.000414
## bili 0.114832 1.121685 0.016387 7.008 2.42e-12
## log(albumin) -3.608973 0.027080 0.716617 -5.036 4.75e-07
## log(protime) 3.456192 31.696036 0.928979 3.720 0.000199
##
## Likelihood ratio test=98.63 on 5 df, p=< 2.2e-16
## n= 312, number of events= 125
## (106 observations deleted due to missingness)
Weibull AFT model
An alternative to the Cox model under non-proportional hazards is the AFT model. Let's fit a Weibull AFT model.
In the output, the coef and HR tables are estimates for the Weibull PH parameterization, while the ETR table and the coefficient estimates in summary are for the Weibull AFT parameterization.
The ETR (treated vs. control) equals 1.01: the time corresponding to any given survival probability is only 1% longer for treated than for control subjects, and the effect is not statistically significant (95% CI: 0.80-1.28, p = 0.93).
library(SurvRegCensCov)
weibull.aftfit <- WeibullReg(Surv(time,status==2) ~ trt+ age + edema + bili + log(albumin) + log(protime), data = pbc)
weibull.aftfit
## $formula
## Surv(time, status == 2) ~ trt + age + edema + bili + log(albumin) +
## log(protime)
##
## $coef
## Estimate SE
## lambda 8.636575e-09 2.039620e-08
## gamma 1.536983e+00 1.112450e-01
## trt -1.633552e-02 1.827419e-01
## age 3.222800e-02 9.270163e-03
## edema 8.659405e-01 3.059779e-01
## bili 1.139634e-01 1.454384e-02
## log(albumin) -3.751319e+00 6.887614e-01
## log(protime) 3.549429e+00 8.894523e-01
##
## $HR
## HR LB UB
## trt 0.98379718 0.687630920 1.40752382
## age 1.03275294 1.014158102 1.05168873
## edema 2.37724094 1.305041628 4.33034040
## bili 1.12071105 1.089215765 1.15311704
## log(albumin) 0.02348674 0.006089027 0.09059357
## log(protime) 34.79345265 6.086874489 198.88439455
##
## $ETR
## ETR LB UB
## trt 1.01068499 0.80056890 1.2759478
## age 0.97924995 0.96763270 0.9910067
## edema 0.56926855 0.38638241 0.8387201
## bili 0.92853476 0.91201513 0.9453536
## log(albumin) 11.48112092 4.82912309 27.2960815
## log(protime) 0.09932588 0.03014253 0.3272993
##
## $summary
##
## Call:
## survival::survreg(formula = formula, data = data, dist = "weibull")
## Value Std. Error z p
## (Intercept) 12.08033 1.54377 7.83 5.1e-15
## trt 0.01063 0.11891 0.09 0.92878
## age -0.02097 0.00609 -3.44 0.00057
## edema -0.56340 0.19772 -2.85 0.00438
## bili -0.07415 0.00916 -8.10 5.7e-16
## log(albumin) 2.44070 0.44186 5.52 3.3e-08
## log(protime) -2.30935 0.60841 -3.80 0.00015
## Log(scale) -0.42982 0.07238 -5.94 2.9e-09
##
## Scale= 0.651
##
## Weibull distribution
## Loglik(model)= -1104.2 Loglik(intercept only)= -1188.8
## Chisq= 169.17 on 6 degrees of freedom, p= 6.8e-34
## Number of Newton-Raphson Iterations: 6
## n=312 (106 observations deleted due to missingness)
The adequacy of the Weibull distribution for survival data with respect to one categorical covariate can be examined via a plot of log time vs. log estimated cumulative hazard. If the Weibull distribution fits the data well, the lines produced should be linear and parallel.
WeibullDiag(Surv(time,status==2) ~ trt, data = pbc)
Extended Cox for time-varying covariates
Below is an example from the vignette of the tmerge function.
pbcseq contains the follow-up laboratory data for each study patient, while pbc contains only the baseline measurements. However, some baseline values in pbcseq differ from those in pbc (for instance, data errors in prothrombin time and age). Therefore, the baseline data are taken from pbc before merging with pbcseq to create the time-dependent covariates.
The coefficient estimates from the original Cox model (coxfit) and this time-varying covariate extended Cox model (coxfit2) are quite different, especially for log(bili), log(albumin) and log(protime).
temp <- subset(pbc, id <= 312, select=c(id:sex, stage)) # baseline data
pbc2 <- tmerge(temp, temp, id=id, endpt = event(time, status))
pbc2 <- tmerge(pbc2, pbcseq, id=id, ascites = tdc(day, ascites),
edema =tdc(day, edema),
bili = tdc(day, bili), albumin = tdc(day, albumin),
protime = tdc(day, protime), alk.phos = tdc(day, alk.phos))
coxfit2 <- coxph(Surv(tstart, tstop, endpt==2) ~ trt + age + edema + log(bili) + log(albumin) + log(protime), data=pbc2)
rbind('baseline fit' = coef(coxfit),
'time dependent' = coef(coxfit2))
## trt age edema log(bili) log(albumin)
## baseline fit 0.13550199 0.03462547 0.7871006 0.8848527 -3.083307
## time dependent 0.02175964 0.04398652 0.7871554 1.1764688 -4.016083
## log(protime)
## baseline fit 2.969493
## time dependent 2.654881
Extended Cox for recurrent events
A look at the Chronic granulomatous disease (CGD) data in wide format:
data(cgd0, package = 'survival')
head(cgd0)
## id center random treat sex age height weight inherit steroids propylac
## 1 1 204 82888 1 2 12 147.0 62.0 2 2 2
## 2 2 204 82888 0 1 15 159.0 47.5 2 2 1
## 3 3 204 82988 1 1 19 171.0 72.7 1 2 1
## 4 4 204 91388 1 1 12 142.0 34.0 1 2 1
## 5 5 238 92888 0 1 17 162.5 52.7 1 2 1
## 6 6 245 93088 1 2 44 153.3 45.0 2 2 2
## hos.cat futime etime1 etime2 etime3 etime4 etime5 etime6 etime7
## 1 2 414 219 373 NA NA NA NA NA
## 2 2 439 8 26 152 241 249 322 350
## 3 2 382 NA NA NA NA NA NA NA
## 4 2 388 NA NA NA NA NA NA NA
## 5 1 383 246 253 NA NA NA NA NA
## 6 2 364 NA NA NA NA NA NA NA
The transformation from wide to long format can be done with the tmerge function. A look at the data in long format, which is needed for modeling:
data(cgd, package = 'survival')
head(cgd)
## id center random treat sex age height weight inherit
## 1 1 Scripps Institute 1989-06-07 rIFN-g female 12 147 62.0 autosomal
## 2 1 Scripps Institute 1989-06-07 rIFN-g female 12 147 62.0 autosomal
## 3 1 Scripps Institute 1989-06-07 rIFN-g female 12 147 62.0 autosomal
## 4 2 Scripps Institute 1989-06-07 placebo male 15 159 47.5 autosomal
## 5 2 Scripps Institute 1989-06-07 placebo male 15 159 47.5 autosomal
## 6 2 Scripps Institute 1989-06-07 placebo male 15 159 47.5 autosomal
## steroids propylac hos.cat tstart enum tstop status
## 1 0 0 US:other 0 1 219 1
## 2 0 0 US:other 219 2 373 1
## 3 0 0 US:other 373 3 414 0
## 4 0 1 US:other 0 1 8 1
## 5 0 1 US:other 8 2 26 1
## 6 0 1 US:other 26 3 152 1
As subjects have multiple events, the rows for the events are correlated within subject and a cluster variance is needed.
(extendedCox.fit <- coxph(Surv(tstart, tstop, status) ~ treat + inherit + steroids, data =cgd, cluster = id))
## Call:
## coxph(formula = Surv(tstart, tstop, status) ~ treat + inherit +
## steroids, data = cgd, cluster = id)
##
## coef exp(coef) se(coef) robust se z p
## treatrIFN-g -1.0722 0.3422 0.2619 0.3118 -3.438 0.000585
## inheritautosomal 0.1777 1.1944 0.2356 0.3180 0.559 0.576395
## steroids 0.7726 2.1653 0.5169 0.4687 1.648 0.099310
##
## Likelihood ratio test=22.49 on 3 df, p=5.149e-05
## n= 203, number of events= 76
My focus here is a high-level overview of the pros and cons of each method. For a review of the mathematical details, please see: Statistical tests comparing two survival curves.
The null hypothesis in this approach is that the risk of mortality after treatment A is the same as the risk of mortality after treatment B at all time points.
The difference in treatment benefits in survival analysis is commonly quantified and reported as the (cumulative) survival rate, the hazard ratio (HR), the median survival time (MT), or the restricted mean survival time (RMST).
The log-rank test corresponding to the HR is a common and classical choice when the HR between treatment arms remains constant over time (the proportional hazards assumption). However, when the proportional hazards assumption is violated, especially when two survival curves cross each other, the difference based on hazards may be offset before and after the crossing point, resulting in a loss of power of the log-rank test.
The weighted log-rank test: weighted log-rank statistics take the form of a weighted sum of the differences between the estimated hazard functions at each observed failure time. As a result, these statistics test whether the hazard difference between the treatment group and the control group is zero. In the non-PH setting, the relative difference of the two hazard functions is not constant over time; therefore, differential weighting at different time points (compared to the equal weighting in the log-rank statistic) has the potential to improve the efficiency of the test statistic.
The Fleming-Harrington family of weighted log-rank test statistics uses the weight \(FH(\rho, \gamma) = S(t-)^\rho (1-S(t-))^\gamma\):
\(FH(0,0)\) is the log-rank statistic, which is most powerful under the proportional hazards assumption;
when there is a diminishing effect (i.e. early separation), \(FH(\rho, 0)\) with \(\rho > 0\), which over-weights the early events, provides higher power to detect a treatment difference compared with equal weighting;
when a delayed effect exists, \(FH(0, \gamma)\) with \(\gamma > 0\), which over-weights the late events, is more powerful to detect the late separation;
\(FH(\rho, \gamma)\) with \(\rho = \gamma > 0\) is more powerful if the biggest separation of the two hazard functions occurs in the middle. A quick sketch with survdiff follows this list.
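The \(FH(\rho, 0)\) subfamily (the G-rho family) is available directly through survdiff in the survival package, whose weights are the K-M estimate \(S(t-)^\rho\); the general \(FH(\rho, \gamma)\) weights require other packages (an example with IAfrac appears below):
library(survival)
data(pbc, package = "survival")
# rho = 0 gives the log-rank test, FH(0,0)
survdiff(Surv(time, status == 2) ~ trt, data = pbc, rho = 0)
# rho = 1 weights by S(t-), i.e. FH(1,0), emphasizing early events
survdiff(Surv(time, status == 2) ~ trt, data = pbc, rho = 1)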
MaxCombo test: the test statistic takes the form: \[Z_{max} = \max_{\rho, \gamma} \{Z_{FH(\rho,\gamma)}\} \]
where \(Z_{FH(\rho,\gamma)}\) is the standardized Fleming-Harrington weighted log-rank statistic. In particular, the original MaxCombo test uses the combination of \(FH(0,0)\), \(FH(0,1)\), \(FH(1,1)\) and \(FH(1,0)\), which should be sensitive to PH, late-separation, middle-separation and early-separation scenarios.
Modified MaxCombo test:
Option 1: \(FH(0,0)\), \(FH(0,0.5)\), \(FH(0.5,0)\), \(FH(0.5,0.5)\): conservative and less sensitive to tail events.
Option 2: \(FH(0,0)\), \(FH(0,0.5)\), \(FH(0,1)\): if a delayed effect is the only possibility.
When the MaxCombo test is used, the treatment effect estimate is taken as the estimated HR obtained from the weighted Cox model corresponding to the weighted log-rank test with the smallest p-value (Lin et al., 2019).
The MaxCombo test provides a pragmatic alternative to the log-rank test under non-proportional hazards (Mukhopadhyay et al., 2022).
Weighted Kaplan-Meier tests take the form of a weighted sum of the differences between the Kaplan-Meier estimates of the survival functions. Therefore, they test whether the two underlying survival functions are the same. A particularly interesting weighted Kaplan-Meier test sets the weight equal to 1, resulting in the difference of the two RMSTs (Lin et al., 2019).
The method based on the difference in MT depends on information at the median survival point (where the survival rate equals 0.5). Therefore, when the censoring rate is large or the follow-up time is insufficient (so that the survival curve does not reach 0.5), this method cannot be applied without an accurate estimate of the MT. Also, a test based on a difference in MT is not applicable when the crossing point of the survival curves is located near a survival rate of 0.5 (Huang et al., 2020).
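As an aside, the median survival time and its confidence interval per group can be read directly off a fitted survfit object; a minimal sketch using the kmfit object from the PBC example above:
# Median survival time with 95% CI for each treatment arm
quantile(kmfit, probs = 0.5)
# print(kmfit) reports the same medians and CIs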
The method based on the difference in RMST does not require the assumption of proportional hazards. However, when the two survival curves cross each other, the difference in the RMST may be offset before and after the crossing point, which causes an invalid assessment of the difference in treatment benefits and reduced power (Huang et al., 2020).
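A minimal sketch of an RMST comparison, assuming the survRM2 package (not used elsewhere in this post) and an arbitrarily chosen truncation time of 2000 days, on the PBC data:
library(survRM2)
pbc_cc <- subset(pbc, !is.na(trt))             # rmst2 requires a 0/1 arm without NAs
rmst2(time   = pbc_cc$time,
      status = as.numeric(pbc_cc$status == 2),
      arm    = pbc_cc$trt - 1,                 # recode 1/2 to 0/1
      tau    = 2000)                           # truncation time (days), arbitrary here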
The area between two survival curves (ABS) measure effectively reflects the absolute benefit of treatment effects between two groups in randomized clinical trials. The ABS is a robust measure for comparing the treatment benefits of two groups and can be applied regardless of whether the proportional hazards assumption is violated or the survival curves cross.
Meanwhile, the corresponding ABS permutation (ABSP) test is also robust and reliable. When researchers are interested in treatment effects over any time interval (i.e., evaluation of the short-term or long-term treatment benefits of two cancer drugs), the corresponding interval ABSP (ABSPi) test can be applied in combination with Kaplan-Meier curves to draw a comprehensive conclusion (Huang et al., 2020).
Two-stage procedure test: details can be found in Qiu and Sheng (2008).
Other non-parametric tests with higher power against crossing hazard alternatives include Kolmogorov–Smirnov and Cramer–von Mises types of tests and the median test of Brookmeyer and Crowley.
The null hypothesis in this approach is that the survival probability at time \(t_0\) after treatment A is the same as the survival probability at time \(t_0\) after treatment B.
Comparisons at fixed points in time are often of interest when the survival curves of the treatments are known to cross.
Survival probabilities at time \(t_0\) can be compared using the method of Zhang (Zhang et al., 2007) or the pseudo-value approach (Klein et al., 2007).
For illustration, I use the Crossdata dataset in the ComparisonSurv package.
First, I examine the K-M plot, which shows crossing survival curves between the two groups.
library(ComparisonSurv)
data("Crossdata")
crossdata.km <- survfit(Surv(time, status) ~ group, data = Crossdata)
ggsurvplot(crossdata.km,
           xlab = "Time (years)",   # customize X axis label
           ggtheme = theme_light() +
             theme(legend.title = element_text(size = rel(1.25)),
                   legend.text = element_text(size = rel(1.25))),  # customize plot and risk table with a theme
           surv.median.line = "hv", # add the median survival pointer
           legend.title = "Treatment",
           palette = "Dark2"        # custom color palette
)
The ComparisonSurv package provides a function to conveniently compare two survival curves with multiple tests. The PH assumption is tested by the Grambsch-Therneau test (as in cox.zph of the survival package).
Overall.test(Crossdata$time, Crossdata$status, Crossdata$group)
##
## PH assumption yield a result of: Pvalue =0.00035, chisq =12.76765.
## method statistic pvalue
## 1 Log-rank 0.99103 0.31949
## 2 Gehan-Wilcoxon -0.89872 0.36880
## 3 Tarone-Ware -0.11533 0.90818
## 4 Weighted KM -0.02205 0.98241
## 5 ABS 1.94767 0.05145
## 6 ABS permutation 1.94767 0.02000
## 7 Two-stage NA 0.02532
## 8 Squared differences 0.74491 0.45633
## 9 RMST (tau= 4.66533347498626 ) 0.19763 0.33130
L_interval <- Crossdata %>% filter(time <=1.7)
R_interval <- Crossdata %>% filter(time > 1.7)
#taking row 6 for ABSPi_L
overall_test <- Overall.test(L_interval$time, L_interval$status, L_interval$group)
##
## PH assumption yield a result of: Pvalue =0.04027, chisq =4.20628.
## method statistic pvalue
## 1 Log-rank 1.87754 0.17061
## 2 Gehan-Wilcoxon -2.31947 0.02037
## 3 Tarone-Ware -1.94566 0.05170
## 4 Weighted KM -2.30598 0.02111
## 5 ABS 1.93758 0.05267
## 6 ABS permutation 1.93758 0.03600
## 7 Two-stage NA 0.04092
## 8 Squared differences -0.71155 0.47674
## 9 RMST (tau= 1.69013354003741 ) -0.204 0.02574
#taking row 6 for ABSPi_R
Overall.test(R_interval$time, R_interval$status, R_interval$group)
##
## PH assumption yield a result of: Pvalue =0.5712, chisq =0.32069.
## method statistic pvalue
## 1 Log-rank 8.48658 0.00358
## 2 Gehan-Wilcoxon 2.87986 0.00398
## 3 Tarone-Ware 2.92056 0.00349
## 4 Weighted KM 2.87753 0.00401
## 5 ABS 3.0367 0.00239
## 6 ABS permutation 3.0367 0.00800
## 7 Two-stage NA 0.00358
## 8 Squared differences 1.16412 0.24438
## 9 RMST (tau= 4.66533347498626 ) 0.76546 0.00330
Fleming-Harrington weighted log-rank test:
# devtools::install_github('lilywang1988/IAfrac') #install IAfrac package from Github if not already installed
library(IAfrac)
FH.stats <- FH.test(Crossdata$time, Crossdata$status, Crossdata$group, rho = 1, gamma = 0)
(FH.p_value <- 1 - pchisq(FH.stats$Z, 1)) # p-value
## [1] 0.4864349
Max-Combo test:
library(maxcombo)
Crossdata2 <- Crossdata %>%
mutate(id = 1:nrow(.),
treated = group,
Atime = 0,
Btime = time,
Bobserved = ifelse(status == 1, T, F),
Ctime = time,
Cobserved = ifelse(status == 0, T, F)) %>%
select(id, treated, Atime, Btime, Bobserved, Ctime, Cobserved)
oolistweightingfunctions_G00_G01_G10_G11 <- base::list(
logrank=function(stminus){ base::return(1.0) },
flemingharrington01=function(stminus){ base::return(1.0 - stminus) },
flemingharrington10=function(stminus){ base::return(stminus) },
flemingharrington11=function(stminus){ base::return(stminus * (1 -stminus)) }
)
oogetdoublemaxcomboteststatistic(
oodataframe = Crossdata2,
oolistfunctionweightasafunctionofstminus = oolistweightingfunctions_G00_G01_G10_G11
)
## [1] 2.75419
oogetdoublemaxcombotestpvalue(
oodataframe = Crossdata2,
oolistfunctionweightasafunctionofstminus = oolistweightingfunctions_G00_G01_G10_G11
)
## [1] 0.007160818
Compare at fixed points
Fixpoint.test(Crossdata$time, Crossdata$status, Crossdata$group, t0 = 0.5)
## $est.g0
## method t0 est lower.95.CI upper.95.CI
## 1 Naive 0.5 0.9487089 0.9048900 0.9925277
## 2 Log 0.5 0.9487089 0.9476661 0.9497314
## 3 Cloglog 0.5 0.9487089 0.9487060 0.9487117
## 4 Arcsin-square 0.5 0.9487089 0.9476713 0.9497365
## 5 Logit 0.5 0.9487089 0.9106487 0.9497317
##
## $est.g1
## method t0 est lower.95.CI upper.95.CI
## 1 Naive 0.5 0.8123129 0.7340737 0.8905522
## 2 Log 0.5 0.8123129 0.8084334 0.8161232
## 3 Cloglog 0.5 0.8123129 0.8121468 0.8124790
## 4 Arcsin-square 0.5 0.8123129 0.8084531 0.8161425
## 5 Logit 0.5 0.8123129 0.7763788 0.8161275
##
## $test
## method statistic pvalue
## 1 Naive 8.11135 0.00440
## 2 Log 8.88749 0.00287
## 3 Cloglog 7.36022 0.00667
## 4 Arcsin-square 9.30954 0.00228
## 5 Logit 7.54433 0.00602
In parallel RCTs, such as oncology trials, treatment decisions may change over time to better fit the patient's health status. Treatment crossover causes the comparability between arms established by randomization to be lost.
Simple methods to handle treatment switching:
More rigorous methods are available, with the rank preserving structural failure time model (RPSFTM), inverse probability of censoring weighting (IPCW) and the two-stage procedure among the most promising.
The RPSFTM uses a causal model to produce counterfactual event times (the time that would be observed if no treatment were received) in order to estimate a causal treatment effect.
Let \(T_i = T_i^{off} + T_i^{on}\) be the observed event time for participant \(i\), where \(T_i^{off}\) and \(T_i^{on}\) represent the time spent off and on treatment, respectively. The \(T_i\) are related to the counterfactual event times, \(U_i\), via the following causal model: \[U_i = T_i^{off} + T_i^{on}\exp(\psi)\]
where \(\exp(-\psi)\) is the acceleration factor.
Assumptions: the RPSFTM relies on two assumptions:
Estimation: assumption 1 should be plausible in an RCT; therefore, a grid search (g-estimation) is used to find \(\psi_0\) such that \(U_i\) is independent of the randomization arm \(R_i\). For a range of \(\psi\)'s, the hypothesis \(\psi_0 = \psi\) is tested by computing \(U_i(\psi)\), subject to censoring, and calculating the test statistic \(Z(\psi)\). This is usually the same test statistic as for the ITT analysis, for example the log-rank test statistic to compare survival distributions between groups. In the rpsftm() function, the possible test options are the log-rank test and the Wald test from a Cox or Weibull regression model. For the parametric Weibull test, the point estimate \(\hat{\psi}\) is the value of \(\psi\) for which \(Z(\psi) = 0\). For the non-parametric tests (log-rank, Cox), \(\hat{\psi}\) is the value of \(\psi\) at which \(Z(\psi)\) crosses 0, since \(Z(\psi)\) is a step function (Allison et al., 2017).
Re-censoring: consider a patient whose observed event time \(T_i\) is extended and ends up censored because of switching to a superior treatment, whereas the event would have been observed without the switch. Therefore, when moving from the \(T_i\) scale to the \(U_i\) scale, re-censoring is required for some patients.
Let \(C_i\) be the potential censoring time for participant \(i\). A participant is then re-censored at the minimum possible censoring time:
\[D^*_i(\psi) = \min(C_i, C_i \exp(\psi))\]
If \(D^*_i(\psi) < U_i(\psi)\), then \(U_i\) is replaced by \(D^*_i\) and the censoring indicator is replaced by 0. For treatment arms where switching does not occur, there can be no informative censoring, so re-censoring is not applied (Allison et al., 2017).
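The re-censoring rule translates directly into a few lines of R. Below is a sketch with made-up inputs (not the rpsftm implementation):
# U: counterfactual times U_i(psi); C: potential censoring times C_i;
# event: original event indicators; psi: the causal parameter
recensor <- function(U, C, event, psi) {
  Dstar <- pmin(C, C * exp(psi))       # minimum possible censoring time D*_i(psi)
  hit   <- Dstar < U                   # subjects that must be re-censored
  list(time  = ifelse(hit, Dstar, U),  # replace U_i by D*_i where needed
       event = ifelse(hit, 0, event))  # and set their censoring indicator to 0
}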
Variations: the RPSFTM can be implemented in different ways involving different assumptions, such as models allowing for lagged effects of treatment, or effects that differ in magnitude across randomized groups (Sullivan et al., 2020).
The IPCW method adjusts estimates of a treatment effect in the presence of informative censoring. When patients from treatment B switch to treatment A, they are artificially censored at the time of the switch. To compensate for this selection bias, we find patients in treatment B who did not switch but have baseline characteristics and time-dependent prognostic covariates similar to the switchers, and give them higher weights.
The weight applied to individual \(i\) for time interval \(t\) is given by: \[w_{i,t} = \frac{\prod_{k=0}^t P(C(k)_i = 0 \mid C(k-1)_i = 0, X_i)}{\prod_{k=0}^t P(C(k)_i = 0 \mid C(k-1)_i = 0, X_i, Z(k)_i)} \] The numerator is the probability of patient \(i\) remaining uncensored at the end of time \(k\), given that the patient was also uncensored at the end of time \(k-1\) and given some baseline covariates \(X\).
The denominator is the probability of patient \(i\) remaining uncensored at the end of time \(k\), given that the patient was also uncensored at the end of time \(k-1\) and given some baseline covariates \(X\) and time-dependent prognostic factors \(Z\).
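As a sketch of how these probabilities might be estimated in practice (my own illustration with hypothetical variable names, assuming a long-format dataset long_dat with one row per subject-interval and switched = 1 in the interval a subject crosses over):
# Numerator model: switching as a function of the interval and baseline covariates X
num_fit <- glm(switched ~ interval + baseline_X, family = binomial, data = long_dat)
# Denominator model: additionally conditions on time-dependent prognostic factors Z
den_fit <- glm(switched ~ interval + baseline_X + timevar_Z,
               family = binomial, data = long_dat)
# Per-interval probabilities of remaining uncensored (i.e. not switching)
long_dat$p_num <- 1 - predict(num_fit, type = "response")
long_dat$p_den <- 1 - predict(den_fit, type = "response")
# Stabilized weight: cumulative product of the ratio over each subject's intervals
long_dat <- long_dat[order(long_dat$id, long_dat$interval), ]
long_dat$w <- ave(long_dat$p_num / long_dat$p_den, long_dat$id, FUN = cumprod)
# A weighted Cox model would then use weights = w and cluster(id) for robust SEs.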
Assumption: “no unmeasured confounders” assumption - only if there are data on all time-dependent prognostic factors for mortality that independently predict informative censoring (switching) can the method produce unbiased results.
This assumption is not testable from the observed data. The models for switching and survival must be correctly specified, and the method fails if any covariate guarantees (that is, with probability equal to 1) that treatment switching will occur (Latimer et al., 2014).
Assumption: no unmeasured confounders (a stronger assumption than for IPCW, as fewer covariates are included in the model).
Always present the ITT analysis.
Don't use the naive analysis as the sole or primary method.
Don't use IPCW or the two-stage AFT method if most patients switch or if there is a near-perfect predictor of switching.
Don't use the RPSFTM if exposure or survival is nearly equal between arms.
We will use the immdef data available in the rpsftm package.
First, let’s take a look at data:
library(rpsftm)
data(immdef, package = 'rpsftm')
head(immdef)
## id def imm censyrs xo xoyrs prog progyrs entry
## 1 1 0 1 3 0 0.000000 0 3.000000 0
## 2 2 1 0 3 1 2.652797 0 3.000000 0
## 3 3 0 1 3 0 0.000000 1 1.737838 0
## 4 4 0 1 3 0 0.000000 1 2.166291 0
## 5 5 1 0 3 1 2.122100 1 2.884646 0
## 6 6 1 0 3 1 0.557392 0 3.000000 0
Intention-to-treat analysis: fitting a Weibull AFT model to the full analysis set shows that immediate treatment extends survival time by a factor of 1.158, but the effect is not statistically significant (ETR = 1.158, 95% CI: 0.996-1.347).
# K-M plot using base graphics (black & white), for easier comparison with the rpsftm plot later
immdef.kmfit <- survfit(Surv(progyrs, prog) ~ imm, data = immdef)
plot(immdef.kmfit, lty = c("solid", "dashed"), xlab = "Time", ylab = "Survival")
legend('topright',c("Reference", "Comparator"),lty = c("solid", "dashed"))
weibull.fit <- WeibullReg(Surv(progyrs, prog) ~ imm, data = immdef)
weibull.fit
## $formula
## Surv(progyrs, prog) ~ imm
##
## $coef
## Estimate SE
## lambda 0.1251584 0.01212204
## gamma 1.4803054 0.07754845
## imm -0.2177152 0.11363574
##
## $HR
## HR LB UB
## imm 0.8043545 0.6437549 1.005019
##
## $ETR
## ETR LB UB
## imm 1.15844 0.9960953 1.347244
##
## $summary
##
## Call:
## survival::survreg(formula = formula, data = data, dist = "weibull")
## Value Std. Error z p
## (Intercept) 1.4039 0.0620 22.65 <2e-16
## imm 0.1471 0.0770 1.91 0.056
## Log(scale) -0.3922 0.0524 -7.49 7e-14
##
## Scale= 0.676
##
## Weibull distribution
## Loglik(model)= -855.1 Loglik(intercept only)= -857
## Chisq= 3.69 on 1 degrees of freedom, p= 0.055
## Number of Newton-Raphson Iterations: 5
## n= 1000
RPSFT:
To apply the RPSFTM, we need to create a variable rx indicating the proportion of follow-up time spent on treatment:
rx = 0 indicates a patient on deferred treatment for the entire follow-up,
rx = 1 a patient entirely on the immediate treatment arm,
0 < rx < 1 the proportion of time on immediate treatment for a switcher.
Using the log-rank test, the RPSFTM estimates \(\hat{\psi} = -0.181\), so the acceleration factor is \(\exp(-\hat{\psi}) = 1.20\). This means immediate treatment extends survival time by a factor of 1.20 (95% CI: 0.998-1.419).
# xoyrs: the time, in years, from entry to switching, or 0 for participants in the Immediate arm
# progyrs: time, in years, from entry to disease progression or censoring
immdef <- immdef %>% mutate(rx = 1 - xoyrs/progyrs)
# log-rank test
rpsftm.fit <- rpsftm(Surv(progyrs, prog) ~ rand(imm, rx),
data=immdef,
censor_time=censyrs)
summary(rpsftm.fit)
## arm rx.Min. rx.1st Qu. rx.Median rx.Mean rx.3rd Qu. rx.Max.
## 1 0 0.0000000 0.0000000 0.0000000 0.1574062 0.2547779 0.9770941
## 2 1 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
## Length Class Mode
## psi 1 -none- numeric
## fit 17 survfit list
## CI 2 -none- numeric
## Sstar 1000 Surv numeric
## rand 2000 rand numeric
## ans 5 -none- list
## eval_z 2 data.frame list
## n 2 table numeric
## obs 2 -none- numeric
## exp 2 -none- numeric
## var 4 -none- numeric
## chisq 1 -none- numeric
## call 4 -none- call
## formula 3 terms call
## terms 3 terms call
##
## psi: -0.1813226
## exp(psi): 0.8341662
## Confidence Interval, psi -0.34984 0.002287789
## Confidence Interval, exp(psi) 0.7048008 1.00229
plot(rpsftm.fit)
To adjust for covariates, we can use the options test = coxph and test = survreg to base the estimation on the Wald test from a Cox model or from a Weibull model, respectively.
# Cox model
rpsftm.fit.coxph <- rpsftm(Surv(progyrs, prog) ~ rand(imm, rx) + entry,
data=immdef,
censor_time=censyrs,
test = coxph)
# Weibull model
rpsftm.fit.weireg <- rpsftm(Surv(progyrs, prog) ~ rand(imm, rx) + entry,
data=immdef,
censor_time=censyrs,
test = survreg)
When we assume effects that differ in magnitude across the randomized groups, we can use the option treat_modifier=. Here, we assume that the treatment effect of immediate treatment in switchers from the deferred arm is half of that in patients assigned to the immediate group from the start.
immdef <- immdef %>% mutate(weight = ifelse(imm==1, 1, .5))
rpsftm(Surv(progyrs, prog) ~ rand(imm, rx),
data=immdef,
censor_time=censyrs,
treat_modifier = weight)
## arm rx.Min. rx.1st Qu. rx.Median rx.Mean rx.3rd Qu. rx.Max.
## 1 0 0.0000000 0.0000000 0.0000000 0.1574062 0.2547779 0.9770941
## 2 1 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
## Call:
## rpsftm(formula = Surv(progyrs, prog) ~ rand(imm, rx), data = immdef,
## censor_time = censyrs, treat_modifier = weight)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## .arm=0 500 157 157 8.79e-06 1.86e-05
## .arm=1 500 143 143 9.66e-06 1.86e-05
##
## Chisq= 0 on 1 degrees of freedom, p= 1
##
## psi: -0.1706579
## exp(psi): 0.84311
Allison et al. (2017). rpsftm: An R Package for Rank Preserving Structural Failure Time Models.
Bland JM and Altman DG (2004). Statistics Notes: The logrank test.
Klein et al. (2007). Analyzing survival curves at a fixed point in time.
Mukhopadhyay et al. (2022). Log-Rank Test vs MaxCombo and Difference in Restricted Mean Survival Time Tests for Comparing Survival Under Nonproportional Hazards in Immuno-oncology Trials: A Systematic Review and Meta-analysis.
Latimer et al. (2014). Adjusting for treatment switching in randomised controlled trials: a simulation study and a simplified two-stage method.
Lin et al. (2019). Alternative Analysis Methods for Time to Event Endpoints under Non-proportional Hazards: A Comparative Analysis.
Huang et al. (2020). A Nonparametric Statistical Method for Two Crossing Survival Curves.
Pang et al. (2021). Flexible extension of the accelerated failure time model to account for nonlinear and time-dependent effects of covariates on the hazard.
Qiu P and Sheng J (2008). A two-stage procedure for comparing hazard rate functions.
Sfumato et al. (2019). Goftte: A R package for assessing goodness-of-fit in proportional (sub) distributions hazards regression models
Sullivan et al. (2020). Adjusting for Treatment Switching in Oncology Trials: A Systematic Review and Recommendations for Reporting.
Zhang et al. (2007). A SAS macro for estimation of direct adjusted survival curves based on a stratified Cox regression model
SAS macro %ADJ_SURV by Zhang (2017).
SAS macro %combo_wlr for the MaxCombo test.