Introduction

Survival analysis is generally defined as a set of methods for analyzing data where the outcome variable is the time until the occurrence of an event of interest. The event can be death, occurrence of a disease, marriage, divorce, etc. The reason for using survival analysis is because it is equipped to deal with censored data (condition in which the value of a measurement or observation is only partially known), where other analytic techniques such as linear regression are not. There are three different techniques within survival analysis; nonparametric equations, semiparametric equations, and parametric equations.

T = non-negative random variable representing waiting time until occurrence of an event f(t) = pdf pf T F(t) = cdf where \[ F(t) = Pr (T < t) \]

Survival Function: function that gives the probability that a patient, device, or other object of interest will survive beyond any given specified time.

\[ S(t) = Pr(T \ge t) = 1 - F(t) = \int_t^{\infty} f(x) ~dx \]

This formula gives the probability that the event of interest has not occurred by duration t.
Hazard Function: likelihood that something will survive to a certain point in time based on its survival to an earlier time. Instantaneous survival

\[\lambda (t) = \lim_{dt\to0} \left( \frac{Pr(t \le T < t + dt|T \ge t)}{dt} \right) \]

Probability that the event occurred during the time interval (dt).

Censored Data

Censored data exists due to real-world complications in tests and research which limit the observation of the event of interest. There are three types of censoring we care about:

Right-Censored Observations : Lower bound and no upper bound
Left-censored Observations : Upper bound and no lower bound
Interval-censored Observations : Lower and upper bound

Survival Analysis Comparisons

The similarities, differences, and visualization methods are shown below.

Technique	Visualization	Hazard_Function_Shape	Covariate_Effects_Shape
Nonparametric	Kaplan-Meier Plots	No Assumption	No Assumption
Semiparametric	Cox Proportional Hazard Plots	No Assumption	Strong Assumption
Parametric	Kaplan-Meier Plots w/ Distribution	Distributional Assumption	Strong Assumption

Although all of the models are similar, they all have their advantages and disadvantages;

Technique	Advantages	Disadvantages	Model_Precision
Nonparametric	Good for smaller data	Cannot use multiple covariates	Least
Semiparametric	Parametric w/o restrictive assumptions	Bad w/ time-dependence	Middle
Parametric	Allows for predictive modelling	Potential unbiased parameter estimates	Most

Now that we have established what each model can do, we will step into the specificities of the models.

Data and Packages

The following packages are necessary to conduct the analysis

survival : contains the necessary functions and arguments to perform survival analysis in R
- lung dataset : measures survival in patients with advanced lung cancer from the North Central Cancer Treatment Group. Performance scores rate how well the patient can perform usual daily activities.
survminer : contains the package ggsurvplot() for easily drawing beautiful survival curves
knitr : allows for the user to produce a kable (a very simple table).
flexsurv : allows for parametric survival modeling.

# Packages
library(survival)
library(survminer)
library(knitr)
library(flexsurv)
library(survminer)

# Data
attach(lung)
kable(head(lung,5))

inst	time	status	age	sex	ph.ecog	ph.karno	pat.karno	meal.cal	wt.loss
3	306	2	74	1	1	90	100	1175	NA
3	455	2	68	1	0	90	90	1225	15
3	1010	1	56	1	0	90	90	NA	15
5	210	2	57	1	1	90	60	1150	11
1	883	2	60	1	0	100	90	NA	0

There is no data manipulation necessary in order to perform our survival analysis. When starting survival analysis, we will start with the most basic modelling approach (nonparametric) and work up to the most complicated modelling approach (parametric).

Nonparametric

Nonparametric regression is a category of regression analysis in which the predictor does not take a predetermined form but is constructed according to information derived from the data. In this case, we want to do survival analysis without assuming any distributions. This is done with Kaplan-Meier plots. There is no assummption about the shape of the hazard/survival function or about how the covariates affect that shape.

Fit the Data for KM

KM plots are useful for smaller datasets, like this one, but can only compare limited number of groups and can only include a limited number of covariates. The first step in doing nonparametric survival analysis is adding a Survival Object, which indicates the time and whether or not an event occurred.

lung$SurvObj <- with(lung, Surv(time, status == 1))
kable(head(lung,5))

inst	time	status	age	sex	ph.ecog	ph.karno	pat.karno	meal.cal	wt.loss	SurvObj
3	306	2	74	1	1	90	100	1175	NA	306+
3	455	2	68	1	0	90	90	1225	15	455+
3	1010	1	56	1	0	90	90	NA	15	1010
5	210	2	57	1	1	90	60	1150	11	210+
1	883	2	60	1	0	100	90	NA	0	883+

The Survival Object is the time, but has a “+” if the event occurred, or a “-” if it did not. Now that we have a general understanding of the dataset with the Survival object, we can start to fit the data with a KM model.

km.by.sex <- survfit(Surv(time, status) ~ sex, data = lung)
km.as.one <- survfit(Surv(time, status) ~ 1, data = lung)
km.by.sex

## Call: survfit(formula = Surv(time, status) ~ sex, data = lung)
## 
##         n events median 0.95LCL 0.95UCL
## sex=1 138    112    270     212     310
## sex=2  90     53    426     348     550

Looking at km.by.sex we see how the dataset breaks down when comparing the data split by sex in terms of time and status.

When doing our survival analysis, we want to do our analysis when status = 2 (the patient has died) because that is the event we are measuring. This means we want to use the Surv function when an event has not occurred, leading to the equation of km.as.one.

Functions used:

Surv creates a survival object which adds a new column that correlates with time
with command puts that into the lung dataset for each datapoint
survfit creates survival curves

Visualize KM

Visualize the Kaplan-Meier graph with the generic X-Y plot.

plot(km.as.one, main = "Kaplan-Meier Survival Curve", 
     xlab = "Time (in Days)", ylab = "Survival Probability")
legend("topright", c("Lower 95% CI", "KM", "Upper 95% CI"))

The above KM plot shows that as time approaches 1000 days, patients have about a 10% chance of surviving past that, or in our case, 10% of patients lived longer than 1000 days.

Visualize with survminer

survminer is the ggplot of survival curves; creates pretty graphs and allows the user to output information that a regular plot cannot do. If you look at all of the things ggsurvplot can output, it is a much bigger help than the generic X-Y plot.

ggsurvplot(km.as.one, data = lung, risk.table = F)

This is the same KM plot as above, but the below plot will show how the survminer package is beneficial in using for these kind of analyses.

ggsurvplot(km.by.sex, data = lung,
           title = "Survival Curves",
           pval = TRUE, pval.method = TRUE,    # Add p-value &  method name
           surv.median.line = "hv",            # Add median survival lines
           legend.title = "Sex",               # Change legend titles
           legend.labs = c("Male", "female"),  # Change legend labels
           palette = "jco",                    # Use JCO journal color palette
           risk.table = TRUE,                  # Add No at risk table
           cumevents = TRUE,                   # Add cumulative No of events table
           tables.height = 0.15,               # Specify tables height
           tables.theme = theme_cleantable(),  # Clean theme for tables
           tables.y.text = FALSE               # Hide tables y axis text
)

As you can see there is a drastic difference between the base X-Y plot and ggsurvplot. The Kaplan-Meier plot shows that at around time 250, the survival probability is 55%, 25% at 500, and continues to go down from there. The plot does show that while the trial went on, approximatly 10% of patients lived the entire time and no event occured. Looking at the p-value, we see that it is below 0.05, meaning we reject the null hypothesis of the sex difference not mattering. This means that there is a difference in survival for males and females, and it shows that females have a better survival rate for the entire trial.

Semiparametric

Semiparametric models also do not make any predictions about the shape of the hazard function or survival function, but only make a strong assumption about how the covariates affect the shape. When dealing with the Cox Model for survival analysis, there are two key assumptions that need to be valid in order for the model results to be safely applied to survival analysis data. The first assumption is the issue of non-informative censoring; the design of the underlying study must ensure that the mechanisms giving rise to censoring of individual subjects are not related to the probability of an event occurring. For example, continuation of follow-ups must continue even if a patient no longer shows a positive status. In our case, we do not need to worry about this because the data was conducted in this manner.

Fit the Data for Cox

res.cox <- coxph(SurvObj ~ age + sex + 
                 ph.karno + wt.loss, data = lung)
res.cox

## Call:
## coxph(formula = SurvObj ~ age + sex + ph.karno + wt.loss, data = lung)
## 
##               coef exp(coef)  se(coef)     z     p
## age      -0.020049  0.980151  0.015314 -1.31 0.190
## sex       0.663250  1.941090  0.269254  2.46 0.014
## ph.karno  0.016598  1.016737  0.012357  1.34 0.179
## wt.loss  -0.000325  0.999675  0.010266 -0.03 0.975
## 
## Likelihood ratio test=12  on 4 df, p=0.0172
## n= 214, number of events= 62 
##    (14 observations deleted due to missingness)

res.cox is the formula for the cox regression, which will be plotted after the second assumption is checked.

Check Second Assumption

## Check for violation of proportional hazard (constant HR over time)
(res.zph <- cox.zph(res.cox))

##              rho chisq      p
## age      -0.0587 0.246 0.6202
## sex       0.1375 1.179 0.2775
## ph.karno  0.1645 1.104 0.2935
## wt.loss   0.2196 2.724 0.0988
## GLOBAL        NA 5.295 0.2584

par(mfrow = c(2,2))
plot(res.zph[1], main = "Age")
plot(res.zph[2], main = "Sex")
plot(res.zph[3], main = "Karnofsky Performance Score")
plot(res.zph[4], main = "Weight Loss")

The above plots are checking for proportionality of the predictors in the model by creating interactions with time. The y-axis is the hazard ratio and the x-axis is time, so if the line has a positive slope, this implies non-proportionality in the form of a rising hazard over time.The flatter the line is , then the proportionality is supported. Although the graphs vary between variables, the second assumption of the Cox model is valid.

Plot using `ggcoxadjustedcurves`

Now that we know what a normal X-Y plot looks like for survival analysis, we are going to just jump into the more visually pleasing and easy way. ggcoxadjustedcurves plots adjusted survival curves for the coxph model.

par(mfrow = c(1,1))
ggcoxadjustedcurves(res.cox, data = lung,
                   variable  = lung[, "sex"],   # Variable of interest
                   legend.title = "Sex",        # Change legend title
                   palette = "npg",             # nature publishing group color palettes
                   curv.size = 2,                # Change line size
                   ggtheme = theme_survminer()
                   )

The cox model looks very similar to the Kaplan-Meier curve, which is common with such a small dataset. Since the advantage of the Cox Model over the Kaplan-Meier model is that it can include more covariates and groups, the advantage will not show visually with a smaller dataset.

Parametric

Parametric models are similar to semiparametric, but the user needs to define what distribution to fit with the data (ex. Weibull, Exponential). Parametric curves can be the most useful because of the distributional fit, but can also be hard to extract. Since the purpose of a parametric model is to find if there is a distribution that is good at predicting, we are going to use the AIC approach. Akaike Information Criterion (AIC) gives the relative estimate of the information lost when a given model is used to represent the provess that generates the data. The lower the AIC is, the better.

attach(lung)
S <- Surv(time, status) 
Dist <- c("exp", "weibull", "lnorm", "gamma", "gompertz", 
          "gengamma", "gengamma.orig", "genf", "genf.orig", "llogis")
AIC <- matrix(ncol = 2, nrow = 10)

for(i in 1:10){
  AIC[i,1] <- Dist[i]
  model <- flexsurvreg(S ~ 1, dist=Dist[i])
  model <- flexsurvreg(S ~ 1, dist=Dist[i]) # fit the exponential model
  AIC[i,2] <- AIC(model)
}

colnames(AIC) <- c("Distribution", "AIC")
AIC <- transform(AIC, Distribution = as.character(Distribution), 
                 AIC = as.factor(AIC))
#### clean up AIC
AIC$AIC <- as.numeric(levels(AIC$AIC)[AIC$AIC])
AIC[order(AIC$AIC),]

##     Distribution      AIC
## 2        weibull 2311.702
## 7  gengamma.orig 2313.380
## 6       gengamma 2313.380
## 4          gamma 2313.469
## 5       gompertz 2314.711
## 8           genf 2315.153
## 9      genf.orig 2315.153
## 10        llogis 2325.862
## 1            exp 2326.676
## 3          lnorm 2342.538

When comparing the AIC of the different distributions that can be used within R, the weibull distribution has the lowest AIC of 2312, meaning it is the best distribution to fit the data. Looking at the Weibull distribution graphically we get;

model5<-flexsurvreg(S ~ 1, dist="weibull")
plot(model5, ylab="Survival probability", 
     xlab="Time", main = "Weibull Survival Plot")
legend("topright",legend=c("KM Plot","Fitted"), 
       lty=c(1,1),col=c("black","red"), cex=0.75)

Based on the graph, we can see the difference in precision between the Kaplan-Meier plot and the gompertz fitted plot, especially with the difference around the 375 day mark. The KM plot and the fitted weibull plot look very similar because of the small dataset, but the parametric model allows for more precide parameter estimates and allows for more predictive modelling. As you can see, with a large dataset it could be much easier to plot a distribution and follow that than a Kaplan-Meier line.

Exercises

Using the cgd dataset in the survival package, create a new variable SurvObj that is when the status of interest is obtained (status == 1).
Use the survfit function to plot a Kaplan-Meier curve.
Look through the help for ggsurvplot and use ggsurvplot to make a plot with the data split by sex

Add a p-value to see if there is a statistical difference for sex.

Using the variables treat, enum, tstart, tstop, and sex, use the coxph and survfit commands to plot a Cox curve.
Have fun

Survival Analysis Tutorial

Jacob Lindell and Joe Berry

August 23, 2017

Introduction

Censored Data

Survival Analysis Comparisons

Data and Packages

Nonparametric

Fit the Data for KM

Visualize KM

Visualize with survminer

Semiparametric

Fit the Data for Cox

Check Second Assumption

Plot using `ggcoxadjustedcurves`

Parametric

Exercises

Survival Analysis Tutorial

Jacob Lindell and Joe Berry

August 23, 2017

Introduction

Censored Data

Survival Analysis Comparisons

Data and Packages

Nonparametric

Fit the Data for KM

Visualize KM

Visualize with survminer

Semiparametric

Fit the Data for Cox

Check Second Assumption

Plot using ggcoxadjustedcurves

Parametric

Exercises

Plot using `ggcoxadjustedcurves`