Introduction

Survival analysis is generally defined as a set of methods for analyzing data where the outcome variable is the time until the occurrence of an event of interest. The event can be death, occurrence of a disease, marriage, divorce, etc. The reason for using survival analysis is because it is equipped to deal with censored data (condition in which the value of a measurement or observation is only partially known), where other analytic techniques such as linear regression are not. There are three different techniques within survival analysis; nonparametric equations, semiparametric equations, and parametric equations.

T = non-negative random variable representing waiting time until occurrence of an event f(t) = pdf pf T F(t) = cdf where \[ F(t) = Pr (T < t) \]

Censored Data

Censored data exists due to real-world complications in tests and research which limit the observation of the event of interest. There are three types of censoring we care about:

  • Right-Censored Observations : Lower bound and no upper bound

  • Left-censored Observations : Upper bound and no lower bound

  • Interval-censored Observations : Lower and upper bound

Survival Analysis Comparisons

The similarities, differences, and visualization methods are shown below.

Technique Visualization Hazard_Function_Shape Covariate_Effects_Shape
Nonparametric Kaplan-Meier Plots No Assumption No Assumption
Semiparametric Cox Proportional Hazard Plots No Assumption Strong Assumption
Parametric Kaplan-Meier Plots w/ Distribution Distributional Assumption Strong Assumption

Although all of the models are similar, they all have their advantages and disadvantages;

Technique Advantages Disadvantages Model_Precision
Nonparametric Good for smaller data Cannot use multiple covariates Least
Semiparametric Parametric w/o restrictive assumptions Bad w/ time-dependence Middle
Parametric Allows for predictive modelling Potential unbiased parameter estimates Most

Now that we have established what each model can do, we will step into the specificities of the models.

Data and Packages

The following packages are necessary to conduct the analysis

# Packages
library(survival)
library(survminer)
library(knitr)
library(flexsurv)
library(survminer)

# Data
attach(lung)
kable(head(lung,5))
inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
3 306 2 74 1 1 90 100 1175 NA
3 455 2 68 1 0 90 90 1225 15
3 1010 1 56 1 0 90 90 NA 15
5 210 2 57 1 1 90 60 1150 11
1 883 2 60 1 0 100 90 NA 0

There is no data manipulation necessary in order to perform our survival analysis. When starting survival analysis, we will start with the most basic modelling approach (nonparametric) and work up to the most complicated modelling approach (parametric).

Nonparametric

Nonparametric regression is a category of regression analysis in which the predictor does not take a predetermined form but is constructed according to information derived from the data. In this case, we want to do survival analysis without assuming any distributions. This is done with Kaplan-Meier plots. There is no assummption about the shape of the hazard/survival function or about how the covariates affect that shape.

Fit the Data for KM

KM plots are useful for smaller datasets, like this one, but can only compare limited number of groups and can only include a limited number of covariates. The first step in doing nonparametric survival analysis is adding a Survival Object, which indicates the time and whether or not an event occurred.

lung$SurvObj <- with(lung, Surv(time, status == 1))
kable(head(lung,5))
inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss SurvObj
3 306 2 74 1 1 90 100 1175 NA 306+
3 455 2 68 1 0 90 90 1225 15 455+
3 1010 1 56 1 0 90 90 NA 15 1010
5 210 2 57 1 1 90 60 1150 11 210+
1 883 2 60 1 0 100 90 NA 0 883+

The Survival Object is the time, but has a “+” if the event occurred, or a “-” if it did not. Now that we have a general understanding of the dataset with the Survival object, we can start to fit the data with a KM model.

km.by.sex <- survfit(Surv(time, status) ~ sex, data = lung)
km.as.one <- survfit(Surv(time, status) ~ 1, data = lung)
km.by.sex
## Call: survfit(formula = Surv(time, status) ~ sex, data = lung)
## 
##         n events median 0.95LCL 0.95UCL
## sex=1 138    112    270     212     310
## sex=2  90     53    426     348     550

Looking at km.by.sex we see how the dataset breaks down when comparing the data split by sex in terms of time and status.

When doing our survival analysis, we want to do our analysis when status = 2 (the patient has died) because that is the event we are measuring. This means we want to use the Surv function when an event has not occurred, leading to the equation of km.as.one.

Functions used:

  • Surv creates a survival object which adds a new column that correlates with time
  • with command puts that into the lung dataset for each datapoint
  • survfit creates survival curves

Visualize KM

Visualize the Kaplan-Meier graph with the generic X-Y plot.

plot(km.as.one, main = "Kaplan-Meier Survival Curve", 
     xlab = "Time (in Days)", ylab = "Survival Probability")
legend("topright", c("Lower 95% CI", "KM", "Upper 95% CI"))

The above KM plot shows that as time approaches 1000 days, patients have about a 10% chance of surviving past that, or in our case, 10% of patients lived longer than 1000 days.

Visualize with survminer

survminer is the ggplot of survival curves; creates pretty graphs and allows the user to output information that a regular plot cannot do. If you look at all of the things ggsurvplot can output, it is a much bigger help than the generic X-Y plot.

ggsurvplot(km.as.one, data = lung, risk.table = F)

This is the same KM plot as above, but the below plot will show how the survminer package is beneficial in using for these kind of analyses.

ggsurvplot(km.by.sex, data = lung,
           title = "Survival Curves",
           pval = TRUE, pval.method = TRUE,    # Add p-value &  method name
           surv.median.line = "hv",            # Add median survival lines
           legend.title = "Sex",               # Change legend titles
           legend.labs = c("Male", "female"),  # Change legend labels
           palette = "jco",                    # Use JCO journal color palette
           risk.table = TRUE,                  # Add No at risk table
           cumevents = TRUE,                   # Add cumulative No of events table
           tables.height = 0.15,               # Specify tables height
           tables.theme = theme_cleantable(),  # Clean theme for tables
           tables.y.text = FALSE               # Hide tables y axis text
)

As you can see there is a drastic difference between the base X-Y plot and ggsurvplot. The Kaplan-Meier plot shows that at around time 250, the survival probability is 55%, 25% at 500, and continues to go down from there. The plot does show that while the trial went on, approximatly 10% of patients lived the entire time and no event occured. Looking at the p-value, we see that it is below 0.05, meaning we reject the null hypothesis of the sex difference not mattering. This means that there is a difference in survival for males and females, and it shows that females have a better survival rate for the entire trial.

Semiparametric

Semiparametric models also do not make any predictions about the shape of the hazard function or survival function, but only make a strong assumption about how the covariates affect the shape. When dealing with the Cox Model for survival analysis, there are two key assumptions that need to be valid in order for the model results to be safely applied to survival analysis data. The first assumption is the issue of non-informative censoring; the design of the underlying study must ensure that the mechanisms giving rise to censoring of individual subjects are not related to the probability of an event occurring. For example, continuation of follow-ups must continue even if a patient no longer shows a positive status. In our case, we do not need to worry about this because the data was conducted in this manner.

Fit the Data for Cox

res.cox <- coxph(SurvObj ~ age + sex + 
                 ph.karno + wt.loss, data = lung)
res.cox
## Call:
## coxph(formula = SurvObj ~ age + sex + ph.karno + wt.loss, data = lung)
## 
##               coef exp(coef)  se(coef)     z     p
## age      -0.020049  0.980151  0.015314 -1.31 0.190
## sex       0.663250  1.941090  0.269254  2.46 0.014
## ph.karno  0.016598  1.016737  0.012357  1.34 0.179
## wt.loss  -0.000325  0.999675  0.010266 -0.03 0.975
## 
## Likelihood ratio test=12  on 4 df, p=0.0172
## n= 214, number of events= 62 
##    (14 observations deleted due to missingness)

res.cox is the formula for the cox regression, which will be plotted after the second assumption is checked.

Check Second Assumption

## Check for violation of proportional hazard (constant HR over time)
(res.zph <- cox.zph(res.cox))
##              rho chisq      p
## age      -0.0587 0.246 0.6202
## sex       0.1375 1.179 0.2775
## ph.karno  0.1645 1.104 0.2935
## wt.loss   0.2196 2.724 0.0988
## GLOBAL        NA 5.295 0.2584
par(mfrow = c(2,2))
plot(res.zph[1], main = "Age")
plot(res.zph[2], main = "Sex")
plot(res.zph[3], main = "Karnofsky Performance Score")
plot(res.zph[4], main = "Weight Loss")

The above plots are checking for proportionality of the predictors in the model by creating interactions with time. The y-axis is the hazard ratio and the x-axis is time, so if the line has a positive slope, this implies non-proportionality in the form of a rising hazard over time.The flatter the line is , then the proportionality is supported. Although the graphs vary between variables, the second assumption of the Cox model is valid.

Plot using ggcoxadjustedcurves

Now that we know what a normal X-Y plot looks like for survival analysis, we are going to just jump into the more visually pleasing and easy way. ggcoxadjustedcurves plots adjusted survival curves for the coxph model.

par(mfrow = c(1,1))
ggcoxadjustedcurves(res.cox, data = lung,
                   variable  = lung[, "sex"],   # Variable of interest
                   legend.title = "Sex",        # Change legend title
                   palette = "npg",             # nature publishing group color palettes
                   curv.size = 2,                # Change line size
                   ggtheme = theme_survminer()
                   )

The cox model looks very similar to the Kaplan-Meier curve, which is common with such a small dataset. Since the advantage of the Cox Model over the Kaplan-Meier model is that it can include more covariates and groups, the advantage will not show visually with a smaller dataset.

Parametric

Parametric models are similar to semiparametric, but the user needs to define what distribution to fit with the data (ex. Weibull, Exponential). Parametric curves can be the most useful because of the distributional fit, but can also be hard to extract. Since the purpose of a parametric model is to find if there is a distribution that is good at predicting, we are going to use the AIC approach. Akaike Information Criterion (AIC) gives the relative estimate of the information lost when a given model is used to represent the provess that generates the data. The lower the AIC is, the better.

attach(lung)
S <- Surv(time, status) 
Dist <- c("exp", "weibull", "lnorm", "gamma", "gompertz", 
          "gengamma", "gengamma.orig", "genf", "genf.orig", "llogis")
AIC <- matrix(ncol = 2, nrow = 10)

for(i in 1:10){
  AIC[i,1] <- Dist[i]
  model <- flexsurvreg(S ~ 1, dist=Dist[i])
  model <- flexsurvreg(S ~ 1, dist=Dist[i]) # fit the exponential model
  AIC[i,2] <- AIC(model)
}

colnames(AIC) <- c("Distribution", "AIC")
AIC <- transform(AIC, Distribution = as.character(Distribution), 
                 AIC = as.factor(AIC))
#### clean up AIC
AIC$AIC <- as.numeric(levels(AIC$AIC)[AIC$AIC])
AIC[order(AIC$AIC),]
##     Distribution      AIC
## 2        weibull 2311.702
## 7  gengamma.orig 2313.380
## 6       gengamma 2313.380
## 4          gamma 2313.469
## 5       gompertz 2314.711
## 8           genf 2315.153
## 9      genf.orig 2315.153
## 10        llogis 2325.862
## 1            exp 2326.676
## 3          lnorm 2342.538

When comparing the AIC of the different distributions that can be used within R, the weibull distribution has the lowest AIC of 2312, meaning it is the best distribution to fit the data. Looking at the Weibull distribution graphically we get;

model5<-flexsurvreg(S ~ 1, dist="weibull")
plot(model5, ylab="Survival probability", 
     xlab="Time", main = "Weibull Survival Plot")
legend("topright",legend=c("KM Plot","Fitted"), 
       lty=c(1,1),col=c("black","red"), cex=0.75)

Based on the graph, we can see the difference in precision between the Kaplan-Meier plot and the gompertz fitted plot, especially with the difference around the 375 day mark. The KM plot and the fitted weibull plot look very similar because of the small dataset, but the parametric model allows for more precide parameter estimates and allows for more predictive modelling. As you can see, with a large dataset it could be much easier to plot a distribution and follow that than a Kaplan-Meier line.

Exercises

  1. Using the cgd dataset in the survival package, create a new variable SurvObj that is when the status of interest is obtained (status == 1).
  2. Use the survfit function to plot a Kaplan-Meier curve.
  3. Look through the help for ggsurvplot and use ggsurvplot to make a plot with the data split by sex
  1. Using the variables treat, enum, tstart, tstop, and sex, use the coxph and survfit commands to plot a Cox curve.
  2. Have fun