Background

Survival analysis focuses on the expected time until occurrence of an event. Some important definition in survival analysis are:

In medical research, \(S(t)\) could be the probability that an individual survives to a future time \(t\) (e.g. after diagnosis of cancer, the probability of being alive at time \(t\)).

Censoring is when the event of interest has not occurred due to some random cause (e.g. patients without an outcome). In medical studies, censoring may arise because a patient:

This is different to truncation where it is incomplete due to a systematic selection. For more details, see link.

Install/Load required packages

#Uncomment to install
#install.packages(c("survival", "survminer"))

library(survival)
library(survminer)
library(ggcorrplot)

Import the datasets

DATA <- read.delim("~/Desktop/Interview/IDNT_OUT.CSV", header = TRUE, sep = "," )
head(DATA)
    AGE ALBUMIN HEMOGLOB  ACR_0 gender DSCR_ESRD DSCR_ESRD_RD ESRD    GFR_0 TRT
  1  56     3.3     15.0 1790.5      1         0         1462    0 73.59759   0
  2  47     4.0     15.3  782.1      1         0         1450    0 76.26230   1
  3  58     4.1     10.6  597.2      1         0         1572    0 39.80008   1
  4  70     3.9     12.9  393.4      1         0           98    0 49.17549   0
  5  67     3.5     13.0 1328.8      1         0          817    0 49.61471   1
  6  59     3.0     12.9 5454.5      1         1         1415    1 60.05364   0

Labels and subgroup of interest

The subgroups of interest are:

Albuminuria is a condition where albumin is abnormally present in the urine. Glomerular filtration rate is the filtering rate through the kidney.

# Treatment: Placebo = 0, Drug X = 1
DATA$TRT <- factor(DATA$TRT, 
                     levels = c("0", "1"), 
                     labels = c("Placebo", "DrugX"))

# Gender: Male = 1, Female = 2
DATA$gender <- factor(DATA$gender, 
                       levels = c("1", "2"), 
                       labels = c("MALE", "FEMALE"))

# High Albumin: Albumin >= 1000
DATA <- DATA %>% mutate(ACR_0_group = ifelse(ACR_0 >= 1000, "low", "high"))
DATA$ACR_0_group <- factor(DATA$ACR_0_group)

# High GFR: GFR >= 45
DATA <- DATA %>% mutate(GFR_0_group = ifelse(GFR_0 >= 45, "low", "high"))
DATA$GFR_0_group <- factor(DATA$GFR_0_group)

Kaplan-Meier analysis

The Kaplan-Meier statistic is a non-parametric estimator of the survival function see. In this case, the event of interest is Doubling of Serum Creatinine or End-stage renal disease (DSCR_ESRD). The subgroups to compare are the patients under Treatment with Drug X or Placebo (TRT).

#surv_object
surv_object <- Surv(time = DATA$DSCR_ESRD_RD, event = DATA$ESRD)

# group by treatment 
fitTRT <- survfit(surv_object ~ TRT, data = DATA)

#summary(fitTRT)
#summary(fitTRT)$table
fitTRT
  Call: survfit(formula = surv_object ~ TRT, data = DATA)
  
     1 observation deleted due to missingness 
                n events median 0.95LCL 0.95UCL
  TRT=Placebo 568    100     NA      NA      NA
  TRT=DrugX   579     82     NA      NA      NA

The median survival time (i.e. when the survival function is halved) is not achieved (NA).

Kaplan-Meier plot

# plot(fitTRT)
# abline(h = 0.5)

# Change color, linetype by strata, risk.table color by strata
ggsurvplot(fitTRT,
           pval = TRUE, conf.int = TRUE,
           risk.table = TRUE, # Add risk table
           risk.table.col = "strata", # Change risk table color by groups
           linetype = "strata", # Change line type by groups
           #surv.median.line = "hv", # Specify median survival
           ggtheme = theme_bw(), # Change ggplot2 theme
           palette = c("#E7B800", "#2E9FDF"))
  Warning: Vectorized input to `element_text()` is not officially supported.
  Results may be unexpected or may change in future versions of ggplot2.

The horizontal axis represents time to DSCR_ESRD event in days, and the vertical axis the probability of ‘surviving’ (the proportion that do not suffer an DSCR_ESRD event). Tick marks mean that a patient was censored. The ‘Number at risk’ are the number of patients who are still ‘alive’.

Test the hypothesis of significantly different curves (groups)

The log-rank test compares the binary outcome (event) between two groups. It can be used to test whether the difference between the two survival functions is statistically significant.

logrank <- survdiff(surv_object ~ TRT, data = DATA)
logrank
  Call:
  survdiff(formula = surv_object ~ TRT, data = DATA)
  
  n=1147, 1 observation deleted due to missingness.
  
                N Observed Expected (O-E)^2/E (O-E)^2/V
  TRT=Placebo 568      100     88.4      1.51      2.94
  TRT=DrugX   579       82     93.6      1.43      2.94
  
   Chisq= 2.9  on 1 degrees of freedom, p= 0.09

Cox proportional hazards model

The Cox model expresses the hazard function \(h(t)\) as,

\(h(t) = h_0(t) \times exp(b_1x_1 + b_2x_2 + ... + b_px_p)\) see

where,

The quantities \(exp(b_i)\) are called hazard ratios (HRs). The HRs compares the probability of events between two groups. A HR of 1 means that both groups are experiencing an equal number of events at any point in time (no difference in survival). It shows how some factors influence the hazard’s rate (the rate to the ocurrence of the event).

In this case, \(x_1\) = TRT is studied within the ACR_0 and GFR_0 groups, i.e. in subsets of the data given by high/low ACR_0 and GFR_0.

High Albumin

## High Albumin
surv_object <- Surv(time = DATA[ DATA$ACR_0_group=="high", ]$DSCR_ESRD_RD, 
                    event = DATA[ DATA$ACR_0_group=="high" , ]$ESRD)

fitTRT <- survfit(surv_object ~ TRT, 
                  data =  DATA[ DATA$ACR_0_group=="high", ])

# ggsurvplot(fitTRT,
#            pval = TRUE, conf.int = TRUE,
#            risk.table = TRUE, # Add risk table
#            risk.table.col = "strata", # Change risk table color by groups
#            linetype = "strata", # Change line type by groups
#            #surv.median.line = "hv", # Specify median survival
#            ggtheme = theme_bw(), # Change ggplot2 theme
#            palette = c("#E7B800", "#2E9FDF"))

#fitTRT
fit.coxph <- coxph(surv_object ~ TRT, 
                   data = DATA[ DATA$ACR_0_group=="high", ])
ggforest(model = fit.coxph, data = DATA[ DATA$ACR_0_group=="high", ], fontsize = 1)

summary(fit.coxph)
  Call:
  coxph(formula = surv_object ~ TRT, data = DATA[DATA$ACR_0_group == 
      "high", ])
  
    n= 364, number of events= 12 
     (60 observations deleted due to missingness)
  
             coef exp(coef) se(coef)     z Pr(>|z|)
  TRTDrugX 0.3786    1.4603   0.5857 0.646    0.518
  
           exp(coef) exp(-coef) lower .95 upper .95
  TRTDrugX      1.46     0.6848    0.4633     4.603
  
  Concordance= 0.535  (se = 0.075 )
  Likelihood ratio test= 0.42  on 1 df,   p=0.5
  Wald test            = 0.42  on 1 df,   p=0.5
  Score (logrank) test = 0.42  on 1 df,   p=0.5

Low Albumin

## Low Albumin 
surv_object <- Surv(time = DATA[ DATA$ACR_0_group=="low", ]$DSCR_ESRD_RD, 
                    event = DATA[ DATA$ACR_0_group=="low" , ]$ESRD)
fitTRT <- survfit(surv_object ~ TRT, 
                  data =  DATA[ DATA$ACR_0_group=="low", ])

# ggsurvplot(fitTRT,
#            pval = TRUE, conf.int = TRUE,
#            risk.table = TRUE, # Add risk table
#            risk.table.col = "strata", # Change risk table color by groups
#            linetype = "strata", # Change line type by groups
#            #surv.median.line = "hv", # Specify median survival
#            ggtheme = theme_bw(), # Change ggplot2 theme
#            palette = c("#E7B800", "#2E9FDF"))

#fitTRT
fit.coxph <- coxph(surv_object ~ TRT, 
                   data = DATA[ DATA$ACR_0_group=="low", ])
ggforest(model = fit.coxph, data = DATA[ DATA$ACR_0_group=="low", ], fontsize = 1)

summary(fit.coxph)
  Call:
  coxph(formula = surv_object ~ TRT, data = DATA[DATA$ACR_0_group == 
      "low", ])
  
    n= 723, number of events= 152 
     (61 observations deleted due to missingness)
  
              coef exp(coef) se(coef)      z Pr(>|z|)  
  TRTDrugX -0.3393    0.7122   0.1635 -2.076   0.0379 *
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
           exp(coef) exp(-coef) lower .95 upper .95
  TRTDrugX    0.7122      1.404     0.517    0.9813
  
  Concordance= 0.54  (se = 0.021 )
  Likelihood ratio test= 4.35  on 1 df,   p=0.04
  Wald test            = 4.31  on 1 df,   p=0.04
  Score (logrank) test = 4.35  on 1 df,   p=0.04

High Glomerular Filtration Rate

# High Glomerular Filtration Rate
surv_object <- Surv(time = DATA[ DATA$GFR_0_group =="high", ]$DSCR_ESRD_RD, 
                    event = DATA[ DATA$GFR_0_group =="high" , ]$ESRD)
fitTRT <- survfit(surv_object ~ TRT, 
                  data =  DATA[ DATA$GFR_0_group=="high", ])


# ggsurvplot(fitTRT,
#            pval = TRUE, conf.int = TRUE,
#            risk.table = TRUE, # Add risk table
#            risk.table.col = "strata", # Change risk table color by groups
#            linetype = "strata", # Change line type by groups
#            #surv.median.line = "hv", # Specify median survival
#            ggtheme = theme_bw(), # Change ggplot2 theme
#            palette = c("#E7B800", "#2E9FDF"))
#fitTRT
fit.coxph <- coxph(surv_object ~ TRT, 
                   data = DATA[ DATA$GFR_0_group=="high", ])
ggforest(model = fit.coxph, data = DATA[ DATA$GFR_0_group=="high", ], fontsize = 1)

summary(fit.coxph)
  Call:
  coxph(formula = surv_object ~ TRT, data = DATA[DATA$GFR_0_group == 
      "high", ])
  
    n= 562, number of events= 159 
     (8 observations deleted due to missingness)
  
              coef exp(coef) se(coef)      z Pr(>|z|)  
  TRTDrugX -0.3790    0.6845   0.1597 -2.373   0.0176 *
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
           exp(coef) exp(-coef) lower .95 upper .95
  TRTDrugX    0.6845      1.461    0.5006    0.9361
  
  Concordance= 0.548  (se = 0.021 )
  Likelihood ratio test= 5.67  on 1 df,   p=0.02
  Wald test            = 5.63  on 1 df,   p=0.02
  Score (logrank) test = 5.7  on 1 df,   p=0.02

Low Glomerular Filtration Rate

# Low Glomerular Filtration Rate
surv_object <- Surv(time = DATA[ DATA$GFR_0_group =="low", ]$DSCR_ESRD_RD, 
                    event = DATA[ DATA$GFR_0_group =="low" , ]$ESRD)
fitTRT <- survfit(surv_object ~ TRT, 
                  data =  DATA[ DATA$GFR_0_group=="low", ])

# ggsurvplot(fitTRT,
#            pval = TRUE, conf.int = TRUE,
#            risk.table = TRUE, # Add risk table
#            risk.table.col = "strata", # Change risk table color by groups
#            linetype = "strata", # Change line type by groups
#            #surv.median.line = "hv", # Specify median survival
#            ggtheme = theme_bw(), # Change ggplot2 theme
#            palette = c("#E7B800", "#2E9FDF"))

#fitTRT
fit.coxph <- coxph(surv_object ~ TRT, 
                   data = DATA[ DATA$GFR_0_group=="low", ])

ggforest(model = fit.coxph, data = DATA[ DATA$GFR_0_group=="low", ], fontsize = 1)

summary(fit.coxph)
  Call:
  coxph(formula = surv_object ~ TRT, data = DATA[DATA$GFR_0_group == 
      "low", ])
  
    n= 578, number of events= 22 
     (7 observations deleted due to missingness)
  
              coef exp(coef) se(coef)      z Pr(>|z|)
  TRTDrugX -0.1642    0.8486   0.4283 -0.383    0.701
  
           exp(coef) exp(-coef) lower .95 upper .95
  TRTDrugX    0.8486      1.178    0.3665     1.964
  
  Concordance= 0.531  (se = 0.055 )
  Likelihood ratio test= 0.15  on 1 df,   p=0.7
  Wald test            = 0.15  on 1 df,   p=0.7
  Score (logrank) test = 0.15  on 1 df,   p=0.7

The column coef in the output denotes the log hazard ratios, and exp(coef) the corresponding hazard ratios (HR). Increasing by one the coeff translates into increasing the log hazard ratio by coef. The Cox model is a proportional-hazards model, that means that the hazard of the event in any group is a constant multiple of the hazard in any other. This implies that curves cannot cross each other.

R version

sessionInfo()
  R version 3.6.3 (2020-02-29)
  Platform: x86_64-w64-mingw32/x64 (64-bit)
  Running under: Windows 7 x64 (build 7600)
  
  Matrix products: default
  
  locale:
  [1] LC_COLLATE=English_United States.1252 
  [2] LC_CTYPE=English_United States.1252   
  [3] LC_MONETARY=English_United States.1252
  [4] LC_NUMERIC=C                          
  [5] LC_TIME=English_United States.1252    
  
  attached base packages:
  [1] stats     graphics  grDevices utils     datasets  methods   base     
  
  other attached packages:
  [1] ggcorrplot_0.1.3 survminer_0.4.6  ggpubr_0.2.5     magrittr_1.5    
  [5] ggplot2_3.3.0    survival_3.1-8  
  
  loaded via a namespace (and not attached):
   [1] zoo_1.8-7         tidyselect_1.0.0  xfun_0.13         purrr_0.3.4      
   [5] splines_3.6.3     lattice_0.20-38   colorspace_1.4-1  vctrs_0.2.4      
   [9] generics_0.0.2    htmltools_0.4.0   yaml_2.2.1        survMisc_0.5.5   
  [13] rlang_0.4.5       pillar_1.4.3      glue_1.4.0        withr_2.1.2      
  [17] lifecycle_0.2.0   stringr_1.4.0     munsell_0.5.0     ggsignif_0.6.0   
  [21] gtable_0.3.0      evaluate_0.14     labeling_0.3      knitr_1.28       
  [25] fansi_0.4.1       broom_0.5.6       Rcpp_1.0.4.6      xtable_1.8-4     
  [29] scales_1.1.0      backports_1.1.6   farver_2.0.3      km.ci_0.5-2      
  [33] gridExtra_2.3     digest_0.6.25     stringi_1.4.6     dplyr_0.8.5      
  [37] KMsurv_0.1-5      cowplot_1.0.0     grid_3.6.3        cli_2.0.2        
  [41] tools_3.6.3       tibble_3.0.0      crayon_1.3.4      tidyr_1.0.2      
  [45] pkgconfig_2.0.3   ellipsis_0.3.0    Matrix_1.2-18     data.table_1.12.8
  [49] assertthat_0.2.1  rmarkdown_2.1     rstudioapi_0.11   R6_2.4.1         
  [53] nlme_3.1-144      compiler_3.6.3