Survival analysis is generally defined as a set of methods for analyzing data where the outcome variable is the time until the occurrence of an event of interest. The event can be death, occurrence of a disease, marriage, divorce, etc. The reason for using survival analysis is because it is equipped to deal with censored data (condition in which the value of a measurement or observation is only partially known), where other analytic techniques such as linear regression are not. There are three different techniques within survival analysis; nonparametric equations, semiparametric equations, and parametric equations.
T = non-negative random variable representing waiting time until occurrence of an event f(t) = pdf pf T F(t) = cdf where \[ F(t) = Pr (T < t) \]
Survival Function: function that gives the probability that a patient, device, or other object of interest will survive beyond any given specified time.
\[ S(t) = Pr(T \ge t) = 1 - F(t) = \int_t^{\infty} f(x) ~dx \]
This formula gives the probability that the event of interest has not occurred by duration t.
Hazard Function: likelihood that something will survive to a certain point in time based on its survival to an earlier time. Instantaneous survival
\[\lambda (t) = \lim_{dt\to0} \left( \frac{Pr(t \le T < t + dt|T \ge t)}{dt} \right) \]
Probability that the event occurred during the time interval (dt).
Censored data exists due to real-world complications in tests and research which limit the observation of the event of interest. There are three types of censoring we care about:
Right-Censored Observations : Lower bound and no upper bound
Left-censored Observations : Upper bound and no lower bound
Interval-censored Observations : Lower and upper bound
The similarities, differences, and visualization methods are shown below.
Technique | Visualization | Hazard_Function_Shape | Covariate_Effects_Shape |
---|---|---|---|
Nonparametric | Kaplan-Meier Plots | No Assumption | No Assumption |
Semiparametric | Cox Proportional Hazard Plots | No Assumption | Strong Assumption |
Parametric | Kaplan-Meier Plots w/ Distribution | Distributional Assumption | Strong Assumption |
Although all of the models are similar, they all have their advantages and disadvantages;
Technique | Advantages | Disadvantages | Model_Precision |
---|---|---|---|
Nonparametric | Good for smaller data | Cannot use multiple covariates | Least |
Semiparametric | Parametric w/o restrictive assumptions | Bad w/ time-dependence | Middle |
Parametric | Allows for predictive modelling | Potential unbiased parameter estimates | Most |
Now that we have established what each model can do, we will step into the specificities of the models.
The following packages are necessary to conduct the analysis
survival
: contains the necessary functions and arguments to perform survival analysis in R
lung
dataset : measures survival in patients with advanced lung cancer from the North Central Cancer Treatment Group. Performance scores rate how well the patient can perform usual daily activities.survminer
: contains the package ggsurvplot()
for easily drawing beautiful survival curvesknitr
: allows for the user to produce a kable (a very simple table).flexsurv
: allows for parametric survival modeling.# Packages
library(survival)
library(survminer)
library(knitr)
library(flexsurv)
library(survminer)
# Data
attach(lung)
kable(head(lung,5))
inst | time | status | age | sex | ph.ecog | ph.karno | pat.karno | meal.cal | wt.loss |
---|---|---|---|---|---|---|---|---|---|
3 | 306 | 2 | 74 | 1 | 1 | 90 | 100 | 1175 | NA |
3 | 455 | 2 | 68 | 1 | 0 | 90 | 90 | 1225 | 15 |
3 | 1010 | 1 | 56 | 1 | 0 | 90 | 90 | NA | 15 |
5 | 210 | 2 | 57 | 1 | 1 | 90 | 60 | 1150 | 11 |
1 | 883 | 2 | 60 | 1 | 0 | 100 | 90 | NA | 0 |
There is no data manipulation necessary in order to perform our survival analysis. When starting survival analysis, we will start with the most basic modelling approach (nonparametric) and work up to the most complicated modelling approach (parametric).
Nonparametric regression is a category of regression analysis in which the predictor does not take a predetermined form but is constructed according to information derived from the data. In this case, we want to do survival analysis without assuming any distributions. This is done with Kaplan-Meier plots. There is no assummption about the shape of the hazard/survival function or about how the covariates affect that shape.
KM plots are useful for smaller datasets, like this one, but can only compare limited number of groups and can only include a limited number of covariates. The first step in doing nonparametric survival analysis is adding a Survival Object, which indicates the time and whether or not an event occurred.
lung$SurvObj <- with(lung, Surv(time, status == 1))
kable(head(lung,5))
inst | time | status | age | sex | ph.ecog | ph.karno | pat.karno | meal.cal | wt.loss | SurvObj |
---|---|---|---|---|---|---|---|---|---|---|
3 | 306 | 2 | 74 | 1 | 1 | 90 | 100 | 1175 | NA | 306+ |
3 | 455 | 2 | 68 | 1 | 0 | 90 | 90 | 1225 | 15 | 455+ |
3 | 1010 | 1 | 56 | 1 | 0 | 90 | 90 | NA | 15 | 1010 |
5 | 210 | 2 | 57 | 1 | 1 | 90 | 60 | 1150 | 11 | 210+ |
1 | 883 | 2 | 60 | 1 | 0 | 100 | 90 | NA | 0 | 883+ |
The Survival Object is the time, but has a “+” if the event occurred, or a “-” if it did not. Now that we have a general understanding of the dataset with the Survival object, we can start to fit the data with a KM model.
km.by.sex <- survfit(Surv(time, status) ~ sex, data = lung)
km.as.one <- survfit(Surv(time, status) ~ 1, data = lung)
km.by.sex
## Call: survfit(formula = Surv(time, status) ~ sex, data = lung)
##
## n events median 0.95LCL 0.95UCL
## sex=1 138 112 270 212 310
## sex=2 90 53 426 348 550
Looking at km.by.sex
we see how the dataset breaks down when comparing the data split by sex in terms of time and status.
When doing our survival analysis, we want to do our analysis when status = 2 (the patient has died) because that is the event we are measuring. This means we want to use the Surv function when an event has not occurred, leading to the equation of km.as.one
.
Functions used:
Surv
creates a survival object which adds a new column that correlates with timewith
command puts that into the lung dataset for each datapointsurvfit
creates survival curvesVisualize the Kaplan-Meier graph with the generic X-Y plot.
plot(km.as.one, main = "Kaplan-Meier Survival Curve",
xlab = "Time (in Days)", ylab = "Survival Probability")
legend("topright", c("Lower 95% CI", "KM", "Upper 95% CI"))
The above KM plot shows that as time approaches 1000 days, patients have about a 10% chance of surviving past that, or in our case, 10% of patients lived longer than 1000 days.
survminer
is the ggplot
of survival curves; creates pretty graphs and allows the user to output information that a regular plot cannot do. If you look at all of the things ggsurvplot
can output, it is a much bigger help than the generic X-Y plot.
ggsurvplot(km.as.one, data = lung, risk.table = F)
This is the same KM plot as above, but the below plot will show how the survminer
package is beneficial in using for these kind of analyses.
ggsurvplot(km.by.sex, data = lung,
title = "Survival Curves",
pval = TRUE, pval.method = TRUE, # Add p-value & method name
surv.median.line = "hv", # Add median survival lines
legend.title = "Sex", # Change legend titles
legend.labs = c("Male", "female"), # Change legend labels
palette = "jco", # Use JCO journal color palette
risk.table = TRUE, # Add No at risk table
cumevents = TRUE, # Add cumulative No of events table
tables.height = 0.15, # Specify tables height
tables.theme = theme_cleantable(), # Clean theme for tables
tables.y.text = FALSE # Hide tables y axis text
)
As you can see there is a drastic difference between the base X-Y plot and ggsurvplot
. The Kaplan-Meier plot shows that at around time 250, the survival probability is 55%, 25% at 500, and continues to go down from there. The plot does show that while the trial went on, approximatly 10% of patients lived the entire time and no event occured. Looking at the p-value, we see that it is below 0.05, meaning we reject the null hypothesis of the sex difference not mattering. This means that there is a difference in survival for males and females, and it shows that females have a better survival rate for the entire trial.
Semiparametric models also do not make any predictions about the shape of the hazard function or survival function, but only make a strong assumption about how the covariates affect the shape. When dealing with the Cox Model for survival analysis, there are two key assumptions that need to be valid in order for the model results to be safely applied to survival analysis data. The first assumption is the issue of non-informative censoring; the design of the underlying study must ensure that the mechanisms giving rise to censoring of individual subjects are not related to the probability of an event occurring. For example, continuation of follow-ups must continue even if a patient no longer shows a positive status. In our case, we do not need to worry about this because the data was conducted in this manner.
res.cox <- coxph(SurvObj ~ age + sex +
ph.karno + wt.loss, data = lung)
res.cox
## Call:
## coxph(formula = SurvObj ~ age + sex + ph.karno + wt.loss, data = lung)
##
## coef exp(coef) se(coef) z p
## age -0.020049 0.980151 0.015314 -1.31 0.190
## sex 0.663250 1.941090 0.269254 2.46 0.014
## ph.karno 0.016598 1.016737 0.012357 1.34 0.179
## wt.loss -0.000325 0.999675 0.010266 -0.03 0.975
##
## Likelihood ratio test=12 on 4 df, p=0.0172
## n= 214, number of events= 62
## (14 observations deleted due to missingness)
res.cox
is the formula for the cox regression, which will be plotted after the second assumption is checked.
## Check for violation of proportional hazard (constant HR over time)
(res.zph <- cox.zph(res.cox))
## rho chisq p
## age -0.0587 0.246 0.6202
## sex 0.1375 1.179 0.2775
## ph.karno 0.1645 1.104 0.2935
## wt.loss 0.2196 2.724 0.0988
## GLOBAL NA 5.295 0.2584
par(mfrow = c(2,2))
plot(res.zph[1], main = "Age")
plot(res.zph[2], main = "Sex")
plot(res.zph[3], main = "Karnofsky Performance Score")
plot(res.zph[4], main = "Weight Loss")
The above plots are checking for proportionality of the predictors in the model by creating interactions with time. The y-axis is the hazard ratio and the x-axis is time, so if the line has a positive slope, this implies non-proportionality in the form of a rising hazard over time.The flatter the line is , then the proportionality is supported. Although the graphs vary between variables, the second assumption of the Cox model is valid.
ggcoxadjustedcurves
Now that we know what a normal X-Y plot looks like for survival analysis, we are going to just jump into the more visually pleasing and easy way. ggcoxadjustedcurves
plots adjusted survival curves for the coxph model.
par(mfrow = c(1,1))
ggcoxadjustedcurves(res.cox, data = lung,
variable = lung[, "sex"], # Variable of interest
legend.title = "Sex", # Change legend title
palette = "npg", # nature publishing group color palettes
curv.size = 2, # Change line size
ggtheme = theme_survminer()
)
The cox model looks very similar to the Kaplan-Meier curve, which is common with such a small dataset. Since the advantage of the Cox Model over the Kaplan-Meier model is that it can include more covariates and groups, the advantage will not show visually with a smaller dataset.
Parametric models are similar to semiparametric, but the user needs to define what distribution to fit with the data (ex. Weibull, Exponential). Parametric curves can be the most useful because of the distributional fit, but can also be hard to extract. Since the purpose of a parametric model is to find if there is a distribution that is good at predicting, we are going to use the AIC approach. Akaike Information Criterion (AIC) gives the relative estimate of the information lost when a given model is used to represent the provess that generates the data. The lower the AIC is, the better.
attach(lung)
S <- Surv(time, status)
Dist <- c("exp", "weibull", "lnorm", "gamma", "gompertz",
"gengamma", "gengamma.orig", "genf", "genf.orig", "llogis")
AIC <- matrix(ncol = 2, nrow = 10)
for(i in 1:10){
AIC[i,1] <- Dist[i]
model <- flexsurvreg(S ~ 1, dist=Dist[i])
model <- flexsurvreg(S ~ 1, dist=Dist[i]) # fit the exponential model
AIC[i,2] <- AIC(model)
}
colnames(AIC) <- c("Distribution", "AIC")
AIC <- transform(AIC, Distribution = as.character(Distribution),
AIC = as.factor(AIC))
#### clean up AIC
AIC$AIC <- as.numeric(levels(AIC$AIC)[AIC$AIC])
AIC[order(AIC$AIC),]
## Distribution AIC
## 2 weibull 2311.702
## 7 gengamma.orig 2313.380
## 6 gengamma 2313.380
## 4 gamma 2313.469
## 5 gompertz 2314.711
## 8 genf 2315.153
## 9 genf.orig 2315.153
## 10 llogis 2325.862
## 1 exp 2326.676
## 3 lnorm 2342.538
When comparing the AIC of the different distributions that can be used within R, the weibull distribution has the lowest AIC of 2312, meaning it is the best distribution to fit the data. Looking at the Weibull distribution graphically we get;
model5<-flexsurvreg(S ~ 1, dist="weibull")
plot(model5, ylab="Survival probability",
xlab="Time", main = "Weibull Survival Plot")
legend("topright",legend=c("KM Plot","Fitted"),
lty=c(1,1),col=c("black","red"), cex=0.75)
Based on the graph, we can see the difference in precision between the Kaplan-Meier plot and the gompertz fitted plot, especially with the difference around the 375 day mark. The KM plot and the fitted weibull plot look very similar because of the small dataset, but the parametric model allows for more precide parameter estimates and allows for more predictive modelling. As you can see, with a large dataset it could be much easier to plot a distribution and follow that than a Kaplan-Meier line.
cgd
dataset in the survival
package, create a new variable SurvObj
that is when the status
of interest is obtained (status
== 1).survfit
function to plot a Kaplan-Meier curve.ggsurvplot
and use ggsurvplot
to make a plot with the data split by sex
sex
.treat
, enum
, tstart
, tstop
, and sex
, use the coxph
and survfit
commands to plot a Cox curve.