knitr::opts_chunk$set(echo = TRUE, warning=FALSE,comment = NA, message=FALSE,
fig.height=4, fig.width=6)
In many cancer studies, the main outcome under assessment is the time to an event of interest. The generic name for the time is survival time, although it may be applied to the time ‘survived’ from complete remission to relapse or progression as equally as to the time from diagnosis to death. If the event occurred in all individuals, many methods of analysis would be applicable. However, it is usual that at the end of follow-up some of the individuals have not had the event of interest, and thus their true time to event is unknown. Further, survival data are rarely Normally distributed, but are skewed and comprise typically of many early events and relatively few late ones. It is these features of the data that make the special methods called survival analysis necessary.
This paper is the first of a series of four articles that aim to introduce and explain the basic concepts of survival analysis. Most survival analyses in cancer journals use some or all of Kaplan–Meier (KM) plots, logrank tests, and Cox (proportional hazards) regression. We will discuss the background to, and interpretation of, each of these methods but also other approaches to analysis that deserve to be used more often. In this first article, we will present the basic concepts of survival analysis, including how to produce and interpret survival curves, and how to quantify and test survival differences between two or more groups of patients. Future papers in the series cover multivariate analysis and the last paper introduces some more advanced concepts in a brief question and answer format. More detailed accounts of these methods can be found in books written specifically about survival analysis, for example, Collett (1994), Parmar and Machin (1995) and Kleinbaum (1996). In addition, individual references for the methods are presented throughout the series. Several introductory texts also describe the basis of survival analysis, for example, Altman (2003) and Piantadosi (1997).
knitr::opts_chunk$set(comment = NA)
Use the command below to ensure that the output values are not written in scientific notation.
options(scipen=999)
library(survival)
library(ggplot2)
library(dplyr)
library(tidyverse)
library(survminer)
library(magrittr)
library(ggpubr)
library(stargazer)
library(ggfortify)
library(simsurv)
my_data<-ovarian
data(cancer, package="survival")
head(my_data,10)
tail(my_data,10)
A patient is scored as censored if he or she did not suffer the outcome of interest. In survival analysis, patients who do not have an “event” during a specified period are said to have censored observation.
names(my_data)
[1] "futime" "fustat" "age" "resid.ds" "rx" "ecog.ps"
attach(my_data)
Kaplan-Meier estimate is one of the best options to be used to measure the fraction of subjects living for a certain amount of time after treatment. In clinical trials or community trials, the effect of an intervention is assessed by measuring the number of subjects survived or saved after that intervention over a period of time. The time starting from a defined point to the occurrence of a given event, for example death is called as survival time and the analysis of group data as survival analysis. This can be affected by subjects under study that are uncooperative and refused to be remained in the study or when some of the subjects may not experience the event or death before the end of the study, although they would have experienced or died if observation continued, or we lose touch with them midway in the study. We label these situations as censored observations. The Kaplan-Meier estimate is the simplest way of computing the survival over time in spite of all these difficulties associated with subjects or situations. The survival curve can be created assuming various situations. It involves computing of probabilities of occurrence of event at a certain point of time and multiplying these successive probabilities by any earlier computed probabilities to get the final estimate. This can be calculated for two groups of subjects and also their statistical difference in the survivals. This can be used in Ayurveda research when they are comparing two drugs and looking for survival of subjects.
km.model<-survfit(Surv(futime,fustat)~1,
type="kaplan-meier")
Since we have no x-variable, we must put a 1 there, its just survival, no variable to relate it to. In the command above, we specified the model, however, if we don’t the model, the system will fit the Kaplan-meier model by default.
km.model
Call: survfit(formula = Surv(futime, fustat) ~ 1, type = "kaplan-meier")
n events median 0.95LCL 0.95UCL
[1,] 26 12 638 464 NA
This gives the MEDIAN survival and confidence interval (CI) for it. From the results above, we have 26 individuals and 12 events. The results also gives the median survival time. From the results, more than half of the people survived beyond 638 days.
summary(km.model)
Call: survfit(formula = Surv(futime, fustat) ~ 1, type = "kaplan-meier")
time n.risk n.event survival std.err lower 95% CI upper 95% CI
59 26 1 0.962 0.0377 0.890 1.000
115 25 1 0.923 0.0523 0.826 1.000
156 24 1 0.885 0.0627 0.770 1.000
268 23 1 0.846 0.0708 0.718 0.997
329 22 1 0.808 0.0773 0.670 0.974
353 21 1 0.769 0.0826 0.623 0.949
365 20 1 0.731 0.0870 0.579 0.923
431 17 1 0.688 0.0919 0.529 0.894
464 15 1 0.642 0.0965 0.478 0.862
475 14 1 0.596 0.0999 0.429 0.828
563 12 1 0.546 0.1032 0.377 0.791
638 11 1 0.497 0.1051 0.328 0.752
From the summary table above, at time (59 days) 26 individuals were at risk and an event occurred. In other words, and individual died. Additionally, beyond 59 days, the probability of an individual surviving is 96.2%. Further, from the summary table above, we are 95% confident that there are 89% to 100% chances of survival beyond 59 days. At time 115 days, 25 individuals were at risk and one died. The probability of survival beyond 115 days is 92.3% with the standard error of 0.0523. Besides, we are 95% confident that beyond 115 days, there are 82.6% to 100% chances of survival.
plot(km.model,conf.int = F,xlab = "Time(days)",
ylab="%Alive=S(t)",main="Kaplan-Meier Model")
From the command above, we provide conf.int = F, if we do not want the coefficient level in the plot. Consider the plot below with confidence level.
plot(km.model,conf.int = T,xlab = "Time(days)",
ylab="%Alive=S(t)",main="Kaplan-Meier Model", las=1)
From the graph above, you get the reason why why the median survival days is 638.
plot(km.model,conf.int = T,xlab = "Time(days)",
ylab="%Alive=S(t)",main="Kaplan-Meier Model", las=1)
abline(h=0.5,col="red")
Remember we can also include a ‘tick’ where there is a censored observation by using the “mark-time” argument.
plot(km.model,conf.int = T,xlab = "Time(days)",
ylab="%Alive=S(t)",main="Kaplan-Meier Model", las=1, mark.time = T)
The graph above shows every point where there were censored observation.
cancer<-read.csv("C:\\Users\\user\\Downloads\\cancer.csv")
head(cancer,10)
tail(cancer,10)
km.model2<-survfit(Surv(futime,fustat)~over50,
type="kaplan-meier", data = cancer)
plot(km.model2,conf.int = F,xlab = "Time(days)",
ylab="%Alive=S(t)",main="K-M Model: Survival Analysis", las=1)
ggsurvplot(km.model2,data = cancer, pval = TRUE)
plot(km.model2, conf.int = F,xlab = "Time(days)",ylab = "Survival Probability",main="K-M model: Survival Analysis", col = c("red","blue"),las=1,lwd=2,mark.time = T)
## legend
legend(100, 0.2, legend = c("below50","above50"),lty = 1,lwd = 2,col=c("red","blue"),bty = "",cex = 0.6)
km.model2
Call: survfit(formula = Surv(futime, fustat) ~ over50, data = cancer,
type = "kaplan-meier")
n events median 0.95LCL 0.95UCL
over50=0 6 1 NA NA NA
over50=1 20 11 563 431 NA
From the summary statistics above, there were six individuals who were under 50 years of age, from which only one event occurred. On the other hand, we have 20 individual who were at risk and over 50 years from which 11 events occurred. This is a clear indication that there is a relationship between age and survival.
summary(km.model2)
Call: survfit(formula = Surv(futime, fustat) ~ over50, data = cancer,
type = "kaplan-meier")
over50=0
time n.risk n.event survival std.err lower 95% CI
329.000 6.000 1.000 0.833 0.152 0.583
upper 95% CI
1.000
over50=1
time n.risk n.event survival std.err lower 95% CI upper 95% CI
59 20 1 0.950 0.0487 0.859 1.000
115 19 1 0.900 0.0671 0.778 1.000
156 18 1 0.850 0.0798 0.707 1.000
268 17 1 0.800 0.0894 0.643 0.996
353 16 1 0.750 0.0968 0.582 0.966
365 15 1 0.700 0.1025 0.525 0.933
431 12 1 0.642 0.1093 0.460 0.896
464 10 1 0.577 0.1157 0.390 0.855
475 9 1 0.513 0.1193 0.326 0.809
563 7 1 0.440 0.1227 0.255 0.760
638 6 1 0.367 0.1222 0.191 0.705
From the summary table above, at time (329 days) 6 individuals below 50 years were at risk and an event occurred. In other words, and individual died. Additionally, beyond 329 days, the probability of an individual below 50 years surviving is 83.3%. Further, from the summary table above, we are 95% confident that there are 58.3% to 100% chances of survival beyond 329 days for the individuals below 50 years. On the other hand, at time (59 days) 20 individuals above 50 years were at risk and an event occurred. In other words, and individual died. Besides, beyond 59 days, the probability of an individual above 50 years surviving is 95%. Further, from the summary table above, we are 95% confident that there are 85.9% to 100% chances of survival beyond 59 days for the individuals above 50 years. It is important to note that the tick marks in the plot above shows censored observations.
From the graph, we can argue that survival differs significantly across the two age category. Individuals below 50 years seems to have a higher survival probability as compared to individuals above 50 years.
survdiff(Surv(futime,fustat)~over50, data = cancer)
Call:
survdiff(formula = Surv(futime, fustat) ~ over50, data = cancer)
N Observed Expected (O-E)^2/E (O-E)^2/V
over50=0 6 1 3.6 1.876 2.75
over50=1 20 11 8.4 0.804 2.75
Chisq= 2.7 on 1 degrees of freedom, p= 0.1
From the inferential results above, there is no evidence to reject the null hypothesis. In other words, there is no statistically sufficient evidence to prove that survival in the two groups is statistically different as shown by the p-value of 0.1 which is more than 0.05.
The Cox proportional-hazards model (Cox, 1972) is essentially a regression model commonly used statistical in medical research for investigating the association between the survival time of patients and one or more predictor variables.
In the previous chapter (survival analysis basics), we described the basic concepts of survival analyses and methods for analyzing and summarizing survival data, including:
The above mentioned methods - Kaplan-Meier curves and logrank tests - are examples of univariate analysis. They describe the survival according to one factor under investigation, but ignore the impact of any others.Additionally, Kaplan-Meier curves and logrank tests are useful only when the predictor variable is categorical (e.g.: treatment A vs treatment B; males vs females). They don’t work easily for quantitative predictors such as gene expression, weight, or age. An alternative method is the Cox proportional hazards regression analysis, which works for both quantitative predictor variables and for categorical variables. Furthermore, the Cox regression model extends survival analysis methods to assess simultaneously the effect of several risk factors on survival time.
The Cox proportional hazards model is called a semi-parametric model, because there are no assumptions about the shape of the baseline hazard function. There are however, other assumptions as noted above (i.e., independence, changes in predictors produce proportional changes in the hazard regardless of time, and a linear association between the natural logarithm of the relative hazard and the predictors). There are other regression models used in survival analysis that assume specific distributions for the survival times such as the exponential, Weibull, Gompertz and log-normal distributions1,8. The exponential regression survival model, for example, assumes that the hazard function is constant. Other distributions assume that the hazard is increasing over time, decreasing over time, or increasing initially and then decreasing. Example 5 will illustrate estimation of a Cox proportional hazards regression model and discuss the interpretation of the regression coefficients.
data("lung")
head(lung)
tail(lung,5)
names(lung)
[1] "inst" "time" "status" "age" "sex" "ph.ecog"
[7] "ph.karno" "pat.karno" "meal.cal" "wt.loss"
The data set above is an Rstudio inbuilt data set that can be used to run Cox Proportional hazard model, however, in this case, we are going to use the data below.
cancer<-read.csv("C:\\Users\\user\\Downloads\\cancer.csv")
head(cancer,10)
cancer$over50<-as.factor(cancer$over50)
cancer$ecog.ps<-as.factor(cancer$ecog.ps)
attach(cancer)
cox.model<-coxph(Surv(futime,fustat)~over50+ecog.ps, data = cancer)
summary(cox.model)
Call:
coxph(formula = Surv(futime, fustat) ~ over50 + ecog.ps, data = cancer)
n= 26, number of events= 12
coef exp(coef) se(coef) z Pr(>|z|)
over501 1.5356 4.6442 1.0541 1.457 0.145
ecog.ps2 0.2628 1.3006 0.5893 0.446 0.656
exp(coef) exp(-coef) lower .95 upper .95
over501 4.644 0.2153 0.5884 36.655
ecog.ps2 1.301 0.7689 0.4098 4.128
Concordance= 0.603 (se = 0.081 )
Likelihood ratio test= 3.64 on 2 df, p=0.2
Wald test = 2.46 on 2 df, p=0.3
Score (logrank) test = 2.95 on 2 df, p=0.2
Remember, cox proportional hazard model do not have the coefficient for the intercept.
At any given time, someone is above 50 years is 4.6442 times as likely to to die as someone who is under 50 years adjusting for ECOG performance status. ECOG performance status describes a patient’s level of functioning in terms of their ability to care for themselves, daily activity, and physical ability (walking, working, etc.). Researchers worldwide consider the ECOG Performance Status Scale when planning cancer clinical trials to study new treatments. From the results (4.6442- 1) indicates that someone who is over 50 years is 364.42% times likely to die compared to someone who is under 50 years when adjusted for ECOG performance status. What this means is that if we pick two individualz who have same the ECOG performance status, the one above 50 years is 364.42% times likely to dies than the one below 50 years. Additionally, someone who is below 50 years is 0.2153 times as likey to die as compared to someone who is above 50 years adjusting to ECOG performance status.
This tests the overal model significance. The concordance statistic is the goodness of fit statistics for survival analysis. In logistic regression, this is equivalent to the area under the curve.
cox.model<-coxph(Surv(futime,fustat)~over50+ecog.ps)
cox.model2<-coxph(Surv(futime,fustat)~over50)
The test aims to establish whether dropping ECOG performance would better or worsen our model significantly. Consider the command below.
anova(cox.model2,cox.model,test = "LRT")
From the results above(p-value>0.05), we cam drop the ECOG performance status without the loss of predictive power from the model. In other words, ECOG performance status is not very important in this model.
data("lung")
head(lung)
write.csv(lung, "C:\\Users\\user\\Downloads\\lung.csv", row.names=FALSE)
lung_data<-read.csv("C:\\Users\\user\\Downloads\\lung_data.csv")
attach(lung_data)
head(lung_data,10)
cox.model3<-coxph(Surv(time,status)~over50+sex,data = lung_data)
summary(cox.model3)
Call:
coxph(formula = Surv(time, status) ~ over50 + sex, data = lung_data)
n= 228, number of events= 165
coef exp(coef) se(coef) z Pr(>|z|)
over50 0.2803 1.3236 0.3003 0.933 0.35057
sex -0.5270 0.5904 0.1672 -3.151 0.00162 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
over50 1.3236 0.7555 0.7347 2.3844
sex 0.5904 1.6938 0.4254 0.8193
Concordance= 0.585 (se = 0.022 )
Likelihood ratio test= 11.58 on 2 df, p=0.003
Wald test = 10.94 on 2 df, p=0.004
Score (logrank) test = 11.19 on 2 df, p=0.004
At any given time, someone is above 50 years is 1.3236 times as likely to to die as someone who is under 50 years adjusting for gender. From the results (1.3236- 1) indicates that someone who is over 50 years is 32.36% times likely to die compared to someone who is under 50 years when adjusted for gender. What this means is that if we pick two individuals who have same the gender, the one above 50 years is 32.36% times likely to die than the one below 50 years. Additionally, someone who is below 50 years is 0.7555 times likely to die as compared to someone who is above 50 years adjusting for gender.
This tests the overal model significance. The concordance statistic is the goodness of fit statistics for survival analysis. In logistic regression, this is equivalent to the area under the curve.
cox.model3<-coxph(Surv(time,status)~over50+sex, data = lung_data)
cox.model4<-coxph(Surv(time,status)~over50, data = lung_data)
The test aims to establish whether dropping gender would better or worsen our model significantly. Consider the command below.
anova(cox.model4,cox.model3,test = "LRT")
From the results above(p-value<0.05), we cannot drop the the X-variable sex from the model. Doing so leads to loss of predictive power from the model. In other words, the X-variable sex is very important in this model.
Now, we have worked with adding categorical X-variables, let us now add numeric X-variable in the model.
cox.model5<-coxph(Surv(time,status)~age+wt.loss, data = lung_data)
summary(cox.model5)
Call:
coxph(formula = Surv(time, status) ~ age + wt.loss, data = lung_data)
n= 214, number of events= 152
(14 observations deleted due to missingness)
coef exp(coef) se(coef) z Pr(>|z|)
age 0.021697 1.021934 0.009627 2.254 0.0242 *
wt.loss 0.001339 1.001340 0.006239 0.215 0.8301
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
age 1.022 0.9785 1.0028 1.041
wt.loss 1.001 0.9987 0.9892 1.014
Concordance= 0.563 (se = 0.026 )
Likelihood ratio test= 5.27 on 2 df, p=0.07
Wald test = 5.12 on 2 df, p=0.08
Score (logrank) test = 5.14 on 2 df, p=0.08
At any given instant in time, the probability of dying for someone who is one year older is 2% higher than someone who is one year younger adjusting for weight loss in the past six months. What this means is that if we pick two individuals who have similar weight loss in the past six months, the probability of dying for someone who is one year older is 2% higher than someone who is one year younger.
The Cox proportional hazards model makes two assumptions: * survival curves for different strata must have hazard functions that are proportional over the time t. * the relationship between the log hazard and each covariate is linear, which can be verified with residual plots. Examples of covariates can be categorical such as race or treatment group, or continuous such as biomarker concentrations.
cox.model5<-coxph(Surv(time,status)~age+wt.loss, data = lung_data)
summary(cox.model5)
Call:
coxph(formula = Surv(time, status) ~ age + wt.loss, data = lung_data)
n= 214, number of events= 152
(14 observations deleted due to missingness)
coef exp(coef) se(coef) z Pr(>|z|)
age 0.021697 1.021934 0.009627 2.254 0.0242 *
wt.loss 0.001339 1.001340 0.006239 0.215 0.8301
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
age 1.022 0.9785 1.0028 1.041
wt.loss 1.001 0.9987 0.9892 1.014
Concordance= 0.563 (se = 0.026 )
Likelihood ratio test= 5.27 on 2 df, p=0.07
Wald test = 5.12 on 2 df, p=0.08
Score (logrank) test = 5.14 on 2 df, p=0.08
If you recall, we assume that the relationship between the covariates or rather X-variable and the log hazard is linear. We can check this assumption in the same way we check linearity in the linear regression model, logistic regression model, or Poisson regression model. In order to do, we can look at the residual plot. On the x-axis, we plot the predicted values and y-axis we plot the residuals.
plot(predict(cox.model5),residuals(cox.model5,type="martingale"),
xlab="fitted values",ylab="Martingale residuals",
main="Residual Plot",las=1)
plot(predict(cox.model5),residuals(cox.model5,type="martingale"),
xlab="fitted values",ylab="Martingale residuals",
main="Residual Plot",las=1)
abline(h=0)
plot(predict(cox.model5),residuals(cox.model5,type="martingale"),
xlab="fitted values",ylab="Martingale residuals",
main="Residual Plot",las=1)
abline(h=0)
lines(smooth.spline(predict(cox.model5),residuals(cox.model5,type = "martingale")),col="red")
plot(predict(cox.model5),residuals(cox.model5,type = "deviance"))
plot(predict(cox.model5),residuals(cox.model5,type = "deviance"))
abline(h=0)
lines(smooth.spline(predict(cox.model5),
residuals(cox.model5,type = "deviance")),col="red")
From the results above, there is linearity between the covariates/X-variables and the log hazard.
This assumption is tested using Schoenfeld test which has the following hypothesis;
Performing this test will return test for each X-variable and for overall model
cox.zph(cox.model5)
chisq df p
age 0.6910 1 0.41
wt.loss 0.0121 1 0.91
GLOBAL 0.7209 2 0.70
The p-values for the covariates and overall model are greater than 0.05. This is an indicator that there is statistically significant evidence to reject the null hypothesis. We therefore conculude that HAZARDS are proportional over time.
We can see a plot of these as well …(one plot for each parameter). These are plots of “changes in b over time”, if we let “b” vary over time recall…if “b” vary over time, this means that there is NO PH!. The effect is not constant over time… it varies! However pay less attention to the extremes, as line is sensitive…
plot(cox.zph(cox.model5))
The smooth line in the plot tells us how HAZARds will change if we allow them to change. The dotted line on either side of the solid line is the confidence limit. Let us add the abline through zero.
plot(cox.zph(cox.model5)[1], main="Plot of Hazards")
abline(h=0,col="red")
From the graph above, the reference line [abline] lies with the confidence bound at least 1/3 of the time. However, that is not enough to show that coefficient of HAZARDS for age is not changing over time. However, the proportional hazards assumption seems to be partially met.
plot(cox.zph(cox.model5)[2], main="Plot of Hazards")
abline(h=0,col="red")
From the graph above, the reference line [abline] lies with the confidence bound at least throught the time. This is a clear indication that is enough evidence to enough to show that coefficient of HAZARDS for weight loss is not changing over time. From the graph, we can conlude that the coefficient for weight loss is not really changing. In other words proportional hazards assumption is fully met.
attach(ovarian)
head(ovarian,10)
tail(ovarian,10)
gender<-rep(0:0,110)
length(gender)
[1] 110
gender1<-data.frame(gender)
head(gender1)
tail(gender1)
gender<-rep(1:1,90)
length(gender)
[1] 90
gender2<-data.frame(gender)
head(gender2)
gender<-rbind(gender1,gender2)
Gender<-data.frame(gender)
head(Gender,3)
tail(Gender,3)
status<-rep(1:1,21)
s1<-data.frame(status)
head(s1,3)
status<-rep(0:0,89)
s2<-data.frame(status)
head(s2,4)
status<-rep(1:1,34)
s3<-data.frame(status)
head(s1,3)
status<-rep(0:0,56)
s4<-data.frame(status)
status<-rbind(s1,s2,s3,s4)
head(status,5)
tail(status,5)
n<-110
set.seed(7)
tr<-rbinom(110,1,0.5)
covs<-data.frame(id=1:n,trt=tr)
b<-c(trt=-0.5)
ST<-simsurv(dist="weibull",lambdas=0.1,gammas=1.5,x=covs,betas=b)
head(ST,10)
t1<-data.frame(ST$eventtime)
n1<-90
set.seed(9)#makes the simulated values replicable
tr<-rbinom(90,1,0.5)
covs<-data.frame(id=1:n1,trt=tr)
b<-c(trt=-0.2)
ST<-simsurv(dist="weibull",lambdas=0.1,gammas=1.5,x=covs,betas=b)
head(ST)
t2<-data.frame(ST$eventtime)
head(t2)
t3<-data.frame(t1)
t4<-data.frame(t2)
stime<-rbind(t1, t2)
head(stime)
colnames(stime)<-c("time")
tail(stime)
dataset<-data.frame(stime,Gender,status)
head(dataset)
summary(dataset)
time gender status
Min. : 0.04149 Min. :0.00 Min. :0.000
1st Qu.: 2.63161 1st Qu.:0.00 1st Qu.:0.000
Median : 4.57048 Median :0.00 Median :0.000
Mean : 4.93313 Mean :0.45 Mean :0.275
3rd Qu.: 6.63952 3rd Qu.:1.00 3rd Qu.:1.000
Max. :16.64622 Max. :1.00 Max. :1.000
dataset$time<-round(dataset$time,digits=0)
head(dataset)
fit<-survfit(Surv(time,status)~1,dataset)
fit
Call: survfit(formula = Surv(time, status) ~ 1, data = dataset)
n events median 0.95LCL 0.95UCL
[1,] 200 55 10 8 11
summary(fit)
Call: survfit(formula = Surv(time, status) ~ 1, data = dataset)
time n.risk n.event survival std.err lower 95% CI upper 95% CI
1 197 9 0.954 0.0149 0.926 0.984
2 172 4 0.932 0.0182 0.897 0.968
3 154 6 0.896 0.0227 0.852 0.942
4 129 1 0.889 0.0236 0.844 0.936
5 102 7 0.828 0.0313 0.769 0.891
6 69 6 0.756 0.0401 0.681 0.839
7 54 8 0.644 0.0500 0.553 0.750
8 38 4 0.576 0.0550 0.478 0.695
10 20 4 0.461 0.0678 0.345 0.615
11 14 5 0.296 0.0734 0.182 0.481
12 6 1 0.247 0.0760 0.135 0.451
plot(fit,xlab="time",ylab="Surv Probability")
plot(fit,lty=c(1,3,2),xlab="Time",ylab="Survival Probability",
col = c(4, 6),main="KM Curve")###lty---line type
legend(1000, 1, c("Treat A", " Treat B"), col = c(4, 6),
text.col = "green4", lty = c(1, 3), #pch = c(NA, 3),
merge = TRUE, bg = "gray90")
s1<-Surv(dataset$time,dataset$status)
s1
[1] 11 5 12 2 3 6 1 3 7 10 3 7 1 1 11 10 7 6
[19] 5 2 5 1+ 1+ 9+ 3+ 5+ 9+ 3+ 17+ 8+ 1+ 5+ 5+ 5+ 1+ 5+
[37] 4+ 4+ 6+ 9+ 5+ 2+ 5+ 4+ 6+ 5+ 4+ 6+ 5+ 1+ 4+ 13+ 4+ 7+
[55] 3+ 5+ 2+ 3+ 8+ 3+ 4+ 7+ 5+ 12+ 4+ 9+ 4+ 6+ 4+ 4+ 5+ 4+
[73] 4+ 4+ 3+ 3+ 5+ 5+ 2+ 7+ 7+ 2+ 2+ 3+ 1+ 3+ 11+ 7+ 4+ 3+
[91] 2+ 5+ 2+ 1+ 4+ 11+ 11+ 7+ 13+ 5+ 8+ 6+ 4+ 8+ 5+ 1+ 1+ 1+
[109] 4+ 0+ 5 6 10 10 1 3 7 6 8 11 1 2 6 1 7 1
[127] 6 8 3 7 5 4 5 7 3 11 1 5 7 8 1 8 2 11
[145] 4+ 2+ 1+ 8+ 3+ 3+ 2+ 2+ 5+ 5+ 4+ 2+ 3+ 10+ 2+ 4+ 0+ 3+
[163] 9+ 5+ 3+ 7+ 6+ 1+ 4+ 1+ 4+ 8+ 2+ 4+ 4+ 3+ 9+ 7+ 1+ 1+
[181] 5+ 6+ 3+ 5+ 5+ 8+ 5+ 8+ 2+ 0+ 4+ 6+ 5+ 1+ 3+ 6+ 3+ 13+
[199] 5+ 10+
head(dataset)
tail(dataset)
dataset$gender<-factor(dataset$gender,levels=c("0","1"),labels=c("male","female"))
summary(dataset)
time gender status
Min. : 0.000 male :110 Min. :0.000
1st Qu.: 3.000 female: 90 1st Qu.:0.000
Median : 5.000 Median :0.000
Mean : 4.945 Mean :0.275
3rd Qu.: 7.000 3rd Qu.:1.000
Max. :17.000 Max. :1.000
fit1<-survfit(Surv(dataset$time,dataset$status)~dataset$gender)
summary(fit1)
Call: survfit(formula = Surv(dataset$time, dataset$status) ~ dataset$gender)
dataset$gender=male
time n.risk n.event survival std.err lower 95% CI upper 95% CI
1 109 3 0.972 0.0157 0.942 1.000
2 96 2 0.952 0.0209 0.912 0.994
3 87 3 0.919 0.0275 0.867 0.975
5 56 3 0.870 0.0380 0.799 0.948
6 36 2 0.822 0.0489 0.731 0.923
7 29 3 0.737 0.0639 0.622 0.873
10 12 2 0.614 0.0955 0.453 0.833
11 10 2 0.491 0.1089 0.318 0.759
12 5 1 0.393 0.1238 0.212 0.728
dataset$gender=female
time n.risk n.event survival std.err lower 95% CI upper 95% CI
1 88 6 0.9318 0.0269 0.8806 0.986
2 76 2 0.9073 0.0313 0.8481 0.971
3 67 3 0.8667 0.0376 0.7959 0.944
4 55 1 0.8509 0.0401 0.7758 0.933
5 46 4 0.7769 0.0509 0.6833 0.883
6 33 4 0.6827 0.0628 0.5700 0.818
7 25 5 0.5462 0.0742 0.4185 0.713
8 18 4 0.4248 0.0787 0.2954 0.611
10 8 2 0.3186 0.0878 0.1856 0.547
11 4 3 0.0797 0.0724 0.0134 0.473
fit2<-survfit(s1~gender, data=dataset)
summary(fit2)
Call: survfit(formula = s1 ~ gender, data = dataset)
gender=male
time n.risk n.event survival std.err lower 95% CI upper 95% CI
1 109 3 0.972 0.0157 0.942 1.000
2 96 2 0.952 0.0209 0.912 0.994
3 87 3 0.919 0.0275 0.867 0.975
5 56 3 0.870 0.0380 0.799 0.948
6 36 2 0.822 0.0489 0.731 0.923
7 29 3 0.737 0.0639 0.622 0.873
10 12 2 0.614 0.0955 0.453 0.833
11 10 2 0.491 0.1089 0.318 0.759
12 5 1 0.393 0.1238 0.212 0.728
gender=female
time n.risk n.event survival std.err lower 95% CI upper 95% CI
1 88 6 0.9318 0.0269 0.8806 0.986
2 76 2 0.9073 0.0313 0.8481 0.971
3 67 3 0.8667 0.0376 0.7959 0.944
4 55 1 0.8509 0.0401 0.7758 0.933
5 46 4 0.7769 0.0509 0.6833 0.883
6 33 4 0.6827 0.0628 0.5700 0.818
7 25 5 0.5462 0.0742 0.4185 0.713
8 18 4 0.4248 0.0787 0.2954 0.611
10 8 2 0.3186 0.0878 0.1856 0.547
11 4 3 0.0797 0.0724 0.0134 0.473
plot(fit2,xlab="time",ylab="Surv Probability")
plot(fit1,lty=c(1,3),xlab="Time",ylab="Survival Probability",
col = c(4, 6))
legend(16, 0.9, c("female", " male"), col = c(4, 6),
text.col = "green4", lty = c(1, 3), pch = c(NA, 3),
merge = TRUE, bg = "gray90")
ggsurvplot(fit1, data = dataset, pval = TRUE)
ggsurvplot_list(fit2, data = dataset, pval = TRUE)
ggsurvplot_list(fit2, data = dataset, pval = TRUE,legend.title = "Strata",legend.labs = c("Male", "Female"))
Cox proportional hazards model allow you to include covariates. You can build Cox proportional hazards models using the coxph function and visualize them using the ggforest. These type of plot is called a forest plot. It shows so-called hazard.
fit.coxph <-coxph(Surv(time,status) ~ gender, data = dataset)
summary(fit.coxph)
Call:
coxph(formula = Surv(time, status) ~ gender, data = dataset)
n= 200, number of events= 55
coef exp(coef) se(coef) z Pr(>|z|)
genderfemale 0.8331 2.3004 0.2819 2.955 0.00313 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
genderfemale 2.3 0.4347 1.324 3.997
Concordance= 0.59 (se = 0.041 )
Likelihood ratio test= 9.09 on 1 df, p=0.003
Wald test = 8.73 on 1 df, p=0.003
Score (logrank) test = 9.21 on 1 df, p=0.002
The p-value associated with the coefficient of females is 0.00313 , with a hazard ratio HR = exp(coef) of 2.3004 indicating that females have increased risk of death as compared to males. In other words, females are 2.3004 likely to die as compared to male.
ggforest(fit.coxph, data =dataset)
As indicated by the plot females have a hazard ratio (HR) of 2.3, indicating that females have increased likely of death as compared to males as shown by the forest plot. The plot shows that females are 2.3 likely to die as compared to male. On the other hand, the respective 95% confidence interval is 1.3 and 4 indicating significant difference in the two survival groups.
fit.coxph <- coxph(s1 ~ gender, data = dataset)
ggforest(fit.coxph, data = dataset)
summary(fit.coxph)
Call:
coxph(formula = s1 ~ gender, data = dataset)
n= 200, number of events= 55
coef exp(coef) se(coef) z Pr(>|z|)
genderfemale 0.8331 2.3004 0.2819 2.955 0.00313 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
genderfemale 2.3 0.4347 1.324 3.997
Concordance= 0.59 (se = 0.041 )
Likelihood ratio test= 9.09 on 1 df, p=0.003
Wald test = 8.73 on 1 df, p=0.003
Score (logrank) test = 9.21 on 1 df, p=0.002
NB Concordance shows the predictive ability of the model.Also use the likelihood ratio test. As indicated above, the likelihood ratio test is significant indicating that the model is signficant/ is agood fit. Kaplan-Meier estimator and the log-rank test estimates the survival probability,Cox proportional hazards model calculates the risk of death and respective hazard ratios.