Summary

The purpose of this document is to support the creation of a Shiny application that analyses, visualises and interprets survival data with various methods.

The app can be found here: https://xvalda.shinyapps.io/Survival/.

Survival analysis gathers a set of methods to answer to some questions such as:
- how much time it takes for an event to occur
- what is the probability for a patient to survive a certain amount of time, given a condition
- are there statistically significant differences in survival time between different patients groups

We’ll clarify more concepts in the intuition section.

The dataset we’re using is the “veteran” data from the survival package in R.

Credits:
- Data source: “D Kalbfleisch and RL Prentice (1980), The Statistical Analysis of Failure Time Data. Wiley, New York.”
- We used several resources listed at the end of the document.

Structure of the document:
1. Load data and packages
2. Exploratory Data Analysis
3. Intuition about the survival theories
4. Kaplan-Meier Analytics
5. Cox proportional hazard models
6. Random Forests
7. Building the Shiny App

1. Load data and packages

library(survival); library(survminer); library(ranger); library(ggplot2); library(dplyr); library(ggfortify)
data(veteran)
str(veteran)
## 'data.frame':    137 obs. of  8 variables:
##  $ trt     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ celltype: Factor w/ 4 levels "squamous","smallcell",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ time    : num  72 411 228 126 118 10 82 110 314 100 ...
##  $ status  : num  1 1 1 1 1 1 1 1 1 0 ...
##  $ karno   : num  60 70 60 60 70 20 40 80 50 70 ...
##  $ diagtime: num  7 5 3 9 11 5 10 29 18 6 ...
##  $ age     : num  69 64 38 63 65 49 69 68 43 70 ...
##  $ prior   : num  0 10 0 10 10 0 10 0 0 0 ...

2. Exploratory data analysis

The data provides from a randomized experiment between two treatment groups for lung cancer, it contains 137 observations of 8 variables:
- trt: 1 = standard, 2 = test
- celltype: factor describing the type of cell: 1=squamous, 2=smallcell, 3=adeno, 4=large
- time: survival time (from start of study to death), in days
- status: censoring status, 0 = patient death was not observed (survival time was censored), 1 = patient death was observed
- karno: Karnofsky performance score (quantifies cancer patients’ general well-being and activities of daily life, 0 = Dead to 100 = Normal)
- diagtime: time from diagnosis to randomisation, in months
- age: age of the patient in years
- prior: 0 = no prior therapy, 10 = prior therapy

We can transform two variables into factors:

veteran$trt <- factor(veteran$trt, labels = c("standard", "test"))
veteran$prior <- factor(veteran$prior, labels = c("no", "yes"))

Summary statistics below show us among other things:
- even assignment in two treatment groups
- obviously right skewed distribution of time-to-event
- 6.6% of censored observations, 93.4% of patients died before the end of the study
- age ranges from 34 to 81 years old
- 30% of the subjects had a prior treatment

vet_temp <- veteran #assign veteran to temporary object (we want status variable to remain numerical and not factor for the rest of the project, but transform it here in a factor for a cleaner summary)
vet_temp$status <- as.factor(vet_temp$status) #transform status as factor 
summary(vet_temp)
##        trt          celltype       time       status      karno      
##  standard:69   squamous :35   Min.   :  1.0   0:  9   Min.   :10.00  
##  test    :68   smallcell:48   1st Qu.: 25.0   1:128   1st Qu.:40.00  
##                adeno    :27   Median : 80.0           Median :60.00  
##                large    :27   Mean   :121.6           Mean   :58.57  
##                               3rd Qu.:144.0           3rd Qu.:75.00  
##                               Max.   :999.0           Max.   :99.00  
##     diagtime           age        prior   
##  Min.   : 1.000   Min.   :34.00   no :97  
##  1st Qu.: 3.000   1st Qu.:51.00   yes:40  
##  Median : 5.000   Median :62.00           
##  Mean   : 8.774   Mean   :58.31           
##  3rd Qu.:11.000   3rd Qu.:66.00           
##  Max.   :87.000   Max.   :81.00
rm(vet_temp) #remove the object that we don't need now

We confirm the right skew we noticed in the summary statistics, and there isn’t clear differentiating patterns between test or sandard treatment groups;

ggplot(data = veteran, aes(x = time, fill = trt)) + 
  geom_histogram() + 
  facet_grid(trt ~.) + 
  ggtitle("Figure 1. Distribution of time-to-event by type of treatment")

We can explore the data further but we do not see obvious patterns, at least visually:

ggplot(data = veteran, aes(x = karno, fill = celltype)) + 
  geom_histogram() + 
  ggtitle("Figure 2. Type of karno per cell type")

A pairs plot is always useful, even when, like in this case, it doesn’t show “interesting” correlations.

library(GGally)
ggpairs(veteran)

Even though we do not find exploitable relations between variables at this stage, the above plots and summary statistics give us a better sense of the data, and we’ll propose more interactive plotting options in the shiny app.

In order to make use of the time element of such a clinical trial study, we will use the different methods of survival analysis.

3. Intuition about survival analysis

Survival analysis helps when the time element (an event occuring) is taken into account.

We want to estimate the time to reach a specific event.

Although very much used in clinical studies where the event is death or recurrence of a disease across two or more treatment groups, there are business/economical applications such as time from manufacturing a new machine component to the failure of this component, or lifetime of two different models of light bulbs, or estimating the time people remain unemployed, …

3.1. Event

The event is the endpoint in the study, the specific outcome we want to measure: death, relapse, mechanical component failure, …

3.2. Censored Data

This refers to incomplete data, the event did not occur for a subject during the time of the study, there are several cases:
- patient did not experience any event during the study, and we do not know if the event occured (for instance, we do not know if the patient ultimately survived or not after the study)
- Right censored subjects: the patient withdrew from the trial or data is lost for some reason (follow-up didn’t occur on the patient, or the patient experienced a different event)

All patients not experiencing the event during the time of the study will be censored at the latest recorded time point.

3.3 Kaplan-Meier

Main purpose: visualing survival curves, works well with categorical data, for numerical data we’ll use the Cox proportional hazard models

After releasing similar findings, Edward Kaplan and Paul Meier published their joint work in 1958 about the non-parametric statistic (i.e. does not assume any underlying probability distribution).

Kaplan-Meier statistic measures the probability that a patient will survive past a specific point in time.
At t = 0, the statistic is 1 (or 100%).
When t increases infinitely, the statistic becomes 0.

The plot of the KM estimator is a series of decreasing horizontal steps, approaching the true survival function.

The survival probability of surviving after time = t is noted S(t). It is the product of all the prior survival probabilities up to time = t.

S(t=n) = S(t=1) x S(t=2) … x S(t=n-1)

The basic formula is: \[\LARGE \hat{S}(t) = \prod_{i: t_i <= t} (1-\frac{d_i}{n_i})\]

t(i) is a time when at least an event happened, d(i) is the number of events (death or recurring disease for instance), that happend at time t(i), n(i) is the number of of individuals that survive (did not have an event or where not censored).
Source: https://en.wikipedia.org/wiki/Kaplan%E2%80%93Meier_estimator

This is based on conditional probabilities, each new proportion conditional on the previous proportions.

3.4. Log-Rank Test

This test is performed on the curves of the Kaplan-Meier method.
For two (or more) different survival curves, it tests the null hypothesis that both curves are equal. If the p-value is below the significance level (generally alpha = 0.05), then we have convincing statistical evidence that at least two curves differ.

3.5. Cox proportional hazard models

Main purpose: describing the simultaneous effect of several variables on the rate of a particular event happening at a specific point in time.

The Cox model is defined with the hazard function, h(t): it describes the probability of a hazard of a subject to survive to time = t. The function returns a proportion, from 0 to 1.

It measures the instaneous risk of death. It has a memory-less property, the likelihood of something happening at time = t has no relation to what happened in the past. The function at year y applies to all subjects alive that year, without taking into account who died in previous years.

The exponential survival distribution models the time until the event (component failure, …).

Covariates are the predictors we use in the model that looks pretty much like a multiple linear regression.

\[hazard function: h(t) = h_0(t) * exp(b_1x_1 * b_2x_2 * ... * b_n x_n)\]

  • t = survival time
  • h(t) = hazard function taking as arguments n covariates noted x1 to xn
  • b1 … bn = coefficients or weights of each covariate
  • h0 = is the baseline hazard, i.e. value of the hazard when all x’s are equal to zero.

Hazard ratios (log(b)) are noted HR:
- if HR = 1: no effect
- if HR < 1: reduction in hazard, we call the associated covariate a good prognostic factor
- if HR > 1: increase in hazard, we call the associated covariate a bas prognostic factor

3.6. Difference between survival and hazard functions

The survivor function describes the probability of “not having an event”, whereas the hazard function describes the probability of the event occurring.

The survival function is the probability that a subject survives from the time origin of the study to a specified future time t.

The hazard function is the probability that a subject under observation at a time t has an event at that time.

4. Kaplan-Meier Analytics

We want to fit a survival curve for time until the event on the x-axis and status on the y-axis, with an explanatory variable like gender, type of treatment, …

4.1. Fitting a Kaplan-Meier Model

We use the survfit() function from the survival package, in combination with the Surv() function, which provides a survival object containing failure time and censoring information.

In this first example, we use the treatment class (standard or test):

fit1 <- survfit(Surv(time, status) ~ trt, data=veteran)
summary(fit1)$table
##              records n.max n.start events   *rmean *se(rmean) median
## trt=standard      69    69      69     64 123.9282   14.84352  103.0
## trt=test          68    68      68     64 134.0478   22.67211   52.5
##              0.95LCL 0.95UCL
## trt=standard      59     132
## trt=test          44      95

The short summary above shows that each treatment groups has the same events of 64 each, with a median survival time of 103 days for the standard treatment group and only 52.5 days for the test group.

4.2. Visualizing the model

Although the autoplot(model_name) function provides the quickest way to visualize the survival curve, the ggsurvplot() from the Survminer package offers more options in just one plot:

ggsurvplot(
  fit1,                     #survival model we want to plot 
  pval = TRUE,              #displays p-value of log-rank test, if p-value < 0.05, then the difference between the two curves are statistically significant
  conf.int = TRUE,          #plots a confidence interval for each curve
  xlab = "Time in days",
  break.time.by = 150,      # break X axis in time intervals by 100.
  ggtheme = theme_light(),  # customize theme with a grid for better readability 
  risk.table = "abs_pct",   # absolute number and percentage at risk
  risk.table.y.text.col = T,# colour risk table text annotations
  risk.table.y.text = FALSE,# show bars instead of names in text annotations
                            # in legend of risk table.
  ncensor.plot = TRUE,      # plot the number of censored subjects at time t
  surv.median.line = "hv"   # add the median survival pointer
)

Interpretation:
- The x-axis represents the survival time in days.
- The y-axis shows the probability of survival time, related to the number of days on the x-axis.
- Each event (death in this case) is shown by a vertical drop of the curve.
- Vertical ticks (although hardly noticeable on the plot) show a censored patient.
- The curve always start at 1 (no events occured or in this case all patients are alive), then decreases and if the study would last infinitely, the curve would tend towards 0 (no subjects left due to event or censoring).
- The p-value is 0.93, which is extremely high, we fail to reject the null hypothesis that both curves are different, in plain language, there is no statistical evidence that both curves are different OR, the treatment type doesn’t impact the survival time.
- Like stated earlier and visible here, the median survival time (the point where survival probability is 0.5) for standard treatment is 103 days, against only 52.5 days for the test treatment. However, the log-rank test shows a very high p-value, so even though the differences in medians seems large, there is no strong statistical evidence (actually very weak evidence in this case) that both curves are different.

4.3. Transforming the survival curve

There is another major argument of the ggsurvplot() function, “fun”, which proposes alternative transformations:

  • “log”: log transformation of the survivor function,
ggsurvplot(fit1,
          conf.int = TRUE,
          ggtheme = theme_bw(), 
          fun = "log")

  • “event”: plots cumulative events (f(y) = 1-y). It’s also known as the cumulative incidence,
ggsurvplot(fit1,
          conf.int = TRUE,
          ggtheme = theme_bw(), 
          fun = "event")

  • “cumhaz” plots the cumulative hazard function (f(y) = -log(y))
    The cummulative hazard is commonly used to estimate the hazard probability.
    It’s defined as H(t)=???log(survivalfunction)=???log(S(t)). The cumulative hazard (H(t)) can be interpreted as the cumulative force of mortality. In other words, it corresponds to the number of events that would be expected for each individual by time t if the event were a repeatable process.
ggsurvplot(fit1,
          conf.int = TRUE,
          ggtheme = theme_bw(), 
          fun = "cumhaz")

4.4. Fitting several surival curves

5. Cox proportional hazard models

5.1. Fitting a Cox Model

We use the coxph() function from the survival package.

Fit univariate model

fit_cox_uni <- coxph(Surv(time, status) ~ trt, data = veteran)
summary(fit_cox_uni)
## Call:
## coxph(formula = Surv(time, status) ~ trt, data = veteran)
## 
##   n= 137, number of events= 128 
## 
##            coef exp(coef) se(coef)     z Pr(>|z|)
## trttest 0.01774   1.01790  0.18066 0.098    0.922
## 
##         exp(coef) exp(-coef) lower .95 upper .95
## trttest     1.018     0.9824    0.7144      1.45
## 
## Concordance= 0.525  (se = 0.026 )
## Rsquare= 0   (max possible= 0.999 )
## Likelihood ratio test= 0.01  on 1 df,   p=0.9
## Wald test            = 0.01  on 1 df,   p=0.9
## Score (logrank) test = 0.01  on 1 df,   p=0.9

Interpreting the summary output:

  • The “z” column gives the Wald statistic value (ratio of each regression coefficient to its standard error (z = coef/se(coef)). The wald statistic evaluates, whether the beta coefficient of a given variable is statistically significantly different from 0. With z = 0.098 and p = 0.922, type of treatment doesn’t have any statistically significant coefficients.

It the regression coefficients (coef) has a positive sign, the hazard (risk of death) is higher, i.e. the prognosis is worse, for subjects with higher values of that variable. The output gives the hazard ratio (HR) for the second group relative to the first group, that is, test treatment versus standard. The beta coefficient for trt = 0.01774 indicates that test patients have higher risk of death (higher survival rates) than standard patients. But let’s keep in mind that the p-value was high and the variable itself is not statistically significant.

Hazard ratios , that is the exponentiated coefficients noted (exp(coef) = exp(0.01774) = 1.02), give the effect size of the covariates. For example, having test treatment increases the hazard by a factor of 1.02.

Upper and lower 95% confidence intervals for the hazard ratio (exp(coef)).

The global statistical significance of the model gives p-values for three tests: likelihood-ratio test, Wald test, and score logrank. These three methods test the null hypothesis that all beta vaues are zero and are asymptotically equivalent. For large n, they return similar results. For small n, we use the likelihood ratio test.

Fit multivariate model

fit_cox_multi <- coxph(Surv(time, status) ~ ., data = veteran)
summary(fit_cox_multi)
## Call:
## coxph(formula = Surv(time, status) ~ ., data = veteran)
## 
##   n= 137, number of events= 128 
## 
##                         coef  exp(coef)   se(coef)      z Pr(>|z|)    
## trttest            2.946e-01  1.343e+00  2.075e-01  1.419  0.15577    
## celltypesmallcell  8.616e-01  2.367e+00  2.753e-01  3.130  0.00175 ** 
## celltypeadeno      1.196e+00  3.307e+00  3.009e-01  3.975 7.05e-05 ***
## celltypelarge      4.013e-01  1.494e+00  2.827e-01  1.420  0.15574    
## karno             -3.282e-02  9.677e-01  5.508e-03 -5.958 2.55e-09 ***
## diagtime           8.132e-05  1.000e+00  9.136e-03  0.009  0.99290    
## age               -8.706e-03  9.913e-01  9.300e-03 -0.936  0.34920    
## prioryes           7.159e-02  1.074e+00  2.323e-01  0.308  0.75794    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                   exp(coef) exp(-coef) lower .95 upper .95
## trttest              1.3426     0.7448    0.8939    2.0166
## celltypesmallcell    2.3669     0.4225    1.3799    4.0597
## celltypeadeno        3.3071     0.3024    1.8336    5.9647
## celltypelarge        1.4938     0.6695    0.8583    2.5996
## karno                0.9677     1.0334    0.9573    0.9782
## diagtime             1.0001     0.9999    0.9823    1.0182
## age                  0.9913     1.0087    0.9734    1.0096
## prioryes             1.0742     0.9309    0.6813    1.6937
## 
## Concordance= 0.736  (se = 0.03 )
## Rsquare= 0.364   (max possible= 0.999 )
## Likelihood ratio test= 62.1  on 8 df,   p=2e-10
## Wald test            = 62.37  on 8 df,   p=2e-10
## Score (logrank) test = 66.74  on 8 df,   p=2e-11

Interpretation:
- the 3 p-values of the global statistical significance are similar and show that the model is statistically significant (p-value < 0.05).
- cell type and karno seem to be the most significant variables

ggsurvplot(survfit(fit_cox_multi, data = veteran), ggtheme = theme_minimal())

6. Shiny Application

6.1. Requirements for current shiny app

The shiny app is targeted at researchers, laboratories and physicians alike.

App Features:
- Tab 1: Introduction and dataset, Introduction and presentation of the dataset, details about the dataset and codebook.
- Tab 2: Basic Data Exploration , Basic data exploration, the user can have basic summary statistics of the variables and a few plots (histogram and scatterplot) in order to get a better feel of the data.
- tab 3: Survival Curve, Kaplan Meier survival curve, users can filter and modify the plot.
- tab 4: Full Model Summary, full output of the model.
- tab 5: Reference: intuition about survival analysis, references/credits, contact details.

6.2. Plans for a second iteration of the app

Potential further developments are:
- archiving more datasets
- allowing user to import own dataset, user will need to enter the name of the outcome variable
- more sophisticated plotting options
- more algorithms (Cox Proportional Hazard Model)
- option to export a pdf report

References

In order to compile this introduction to survival analysis, We used resources, some of them extensively:
- http://www.sthda.com/english/wiki/cox-proportional-hazards-model
- http://www.sthda.com/english/wiki/cox-model-assumptions
- http://www.statisticshowto.com/survival-analysis/
- https://www.openintro.org/download.php?file=survival_analysis_in_R
- http://www.ms.uky.edu/~mai/Rsurv.pdf

Contact details

For questions, requests for further developments, maintenance issues or any comment, please feel free to reach out, indicating it relates to the “Survival Shiny App”:
- xavier@measuringsocial.com
- www.linkedin.com/in/xavier-valdayron-9707231