Set up Rstudio

knitr::opts_chunk$set(echo = TRUE, warning=FALSE,comment = NA, message=FALSE,
                      fig.height=4, fig.width=6)

SURVIVAL ANALYSIS (Part One)

In many cancer studies, the main outcome under assessment is the time to an event of interest. The generic name for the time is survival time, although it may be applied to the time ‘survived’ from complete remission to relapse or progression as equally as to the time from diagnosis to death. If the event occurred in all individuals, many methods of analysis would be applicable. However, it is usual that at the end of follow-up some of the individuals have not had the event of interest, and thus their true time to event is unknown. Further, survival data are rarely Normally distributed, but are skewed and comprise typically of many early events and relatively few late ones. It is these features of the data that make the special methods called survival analysis necessary.

This paper is the first of a series of four articles that aim to introduce and explain the basic concepts of survival analysis. Most survival analyses in cancer journals use some or all of Kaplan–Meier (KM) plots, logrank tests, and Cox (proportional hazards) regression. We will discuss the background to, and interpretation of, each of these methods but also other approaches to analysis that deserve to be used more often. In this first article, we will present the basic concepts of survival analysis, including how to produce and interpret survival curves, and how to quantify and test survival differences between two or more groups of patients. Future papers in the series cover multivariate analysis and the last paper introduces some more advanced concepts in a brief question and answer format. More detailed accounts of these methods can be found in books written specifically about survival analysis, for example, Collett (1994), Parmar and Machin (1995) and Kleinbaum (1996). In addition, individual references for the methods are presented throughout the series. Several introductory texts also describe the basis of survival analysis, for example, Altman (2003) and Piantadosi (1997).

Run the command below to eliminate # symbol in the Rmarkdown file

knitr::opts_chunk$set(comment = NA)

Set up the script to remove scientific notation

Use the command below to ensure that the output values are not written in scientific notation.

options(scipen=999)

Activate the necessary packages

library(survival)
library(ggplot2)
library(dplyr)
library(tidyverse)
library(survminer)
library(magrittr)
library(ggpubr)
library(stargazer)
library(ggfortify)
library(simsurv)

Load the dataset

my_data<-ovarian
data(cancer, package="survival")

View the first few observations

head(my_data,10)

View the last few observations

tail(my_data,10)

Variable definition

Note:

Note:

A patient is scored as censored if he or she did not suffer the outcome of interest. In survival analysis, patients who do not have an “event” during a specified period are said to have censored observation.

Read variable name

names(my_data)
[1] "futime"   "fustat"   "age"      "resid.ds" "rx"       "ecog.ps" 

Attach the dataset

attach(my_data)

Kaplan Meier Survival Model

Kaplan-Meier estimate is one of the best options to be used to measure the fraction of subjects living for a certain amount of time after treatment. In clinical trials or community trials, the effect of an intervention is assessed by measuring the number of subjects survived or saved after that intervention over a period of time. The time starting from a defined point to the occurrence of a given event, for example death is called as survival time and the analysis of group data as survival analysis. This can be affected by subjects under study that are uncooperative and refused to be remained in the study or when some of the subjects may not experience the event or death before the end of the study, although they would have experienced or died if observation continued, or we lose touch with them midway in the study. We label these situations as censored observations. The Kaplan-Meier estimate is the simplest way of computing the survival over time in spite of all these difficulties associated with subjects or situations. The survival curve can be created assuming various situations. It involves computing of probabilities of occurrence of event at a certain point of time and multiplying these successive probabilities by any earlier computed probabilities to get the final estimate. This can be calculated for two groups of subjects and also their statistical difference in the survivals. This can be used in Ayurveda research when they are comparing two drugs and looking for survival of subjects.

Fit the model

km.model<-survfit(Surv(futime,fustat)~1,
                  type="kaplan-meier")

What is ~1

Since we have no x-variable, we must put a 1 there, its just survival, no variable to relate it to. In the command above, we specified the model, however, if we don’t the model, the system will fit the Kaplan-meier model by default.

Summaries of the model

km.model
Call: survfit(formula = Surv(futime, fustat) ~ 1, type = "kaplan-meier")

      n events median 0.95LCL 0.95UCL
[1,] 26     12    638     464      NA

This gives the MEDIAN survival and confidence interval (CI) for it. From the results above, we have 26 individuals and 12 events. The results also gives the median survival time. From the results, more than half of the people survived beyond 638 days.

Model summary

summary(km.model)
Call: survfit(formula = Surv(futime, fustat) ~ 1, type = "kaplan-meier")

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
   59     26       1    0.962  0.0377        0.890        1.000
  115     25       1    0.923  0.0523        0.826        1.000
  156     24       1    0.885  0.0627        0.770        1.000
  268     23       1    0.846  0.0708        0.718        0.997
  329     22       1    0.808  0.0773        0.670        0.974
  353     21       1    0.769  0.0826        0.623        0.949
  365     20       1    0.731  0.0870        0.579        0.923
  431     17       1    0.688  0.0919        0.529        0.894
  464     15       1    0.642  0.0965        0.478        0.862
  475     14       1    0.596  0.0999        0.429        0.828
  563     12       1    0.546  0.1032        0.377        0.791
  638     11       1    0.497  0.1051        0.328        0.752

From the summary table above, at time (59 days) 26 individuals were at risk and an event occurred. In other words, and individual died. Additionally, beyond 59 days, the probability of an individual surviving is 96.2%. Further, from the summary table above, we are 95% confident that there are 89% to 100% chances of survival beyond 59 days. At time 115 days, 25 individuals were at risk and one died. The probability of survival beyond 115 days is 92.3% with the standard error of 0.0523. Besides, we are 95% confident that beyond 115 days, there are 82.6% to 100% chances of survival.

Plot the model

plot(km.model,conf.int = F,xlab = "Time(days)",
ylab="%Alive=S(t)",main="Kaplan-Meier Model")

From the command above, we provide conf.int = F, if we do not want the coefficient level in the plot. Consider the plot below with confidence level.

plot(km.model,conf.int = T,xlab = "Time(days)",
ylab="%Alive=S(t)",main="Kaplan-Meier Model", las=1)

From the graph above, you get the reason why why the median survival days is 638.

plot(km.model,conf.int = T,xlab = "Time(days)",
ylab="%Alive=S(t)",main="Kaplan-Meier Model", las=1)
abline(h=0.5,col="red")

Remember we can also include a ‘tick’ where there is a censored observation by using the “mark-time” argument.

plot(km.model,conf.int = T,xlab = "Time(days)",
ylab="%Alive=S(t)",main="Kaplan-Meier Model", las=1, mark.time = T)

The graph above shows every point where there were censored observation.

Add Another variable (+50 years)

Import the data set

cancer<-read.csv("C:\\Users\\user\\Downloads\\cancer.csv")

View the dataset

head(cancer,10)
tail(cancer,10)
Let us now fit a KM model relating over/under50 to survival and see if there is a relationship between survival and age.
km.model2<-survfit(Surv(futime,fustat)~over50,
                  type="kaplan-meier", data = cancer)

Plot the model

plot(km.model2,conf.int = F,xlab = "Time(days)",
ylab="%Alive=S(t)",main="K-M Model: Survival Analysis", las=1)

ggsurvplot(km.model2,data = cancer, pval = TRUE)

Alternatively,

plot(km.model2, conf.int = F,xlab = "Time(days)",ylab = "Survival Probability",main="K-M model: Survival Analysis", col = c("red","blue"),las=1,lwd=2,mark.time = T)

## legend 
legend(100, 0.2, legend = c("below50","above50"),lty = 1,lwd = 2,col=c("red","blue"),bty = "",cex = 0.6)

Display the summary

km.model2
Call: survfit(formula = Surv(futime, fustat) ~ over50, data = cancer, 
    type = "kaplan-meier")

          n events median 0.95LCL 0.95UCL
over50=0  6      1     NA      NA      NA
over50=1 20     11    563     431      NA

From the summary statistics above, there were six individuals who were under 50 years of age, from which only one event occurred. On the other hand, we have 20 individual who were at risk and over 50 years from which 11 events occurred. This is a clear indication that there is a relationship between age and survival.

Display the model summary

summary(km.model2)
Call: survfit(formula = Surv(futime, fustat) ~ over50, data = cancer, 
    type = "kaplan-meier")

                over50=0 
        time       n.risk      n.event     survival      std.err lower 95% CI 
     329.000        6.000        1.000        0.833        0.152        0.583 
upper 95% CI 
       1.000 

                over50=1 
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
   59     20       1    0.950  0.0487        0.859        1.000
  115     19       1    0.900  0.0671        0.778        1.000
  156     18       1    0.850  0.0798        0.707        1.000
  268     17       1    0.800  0.0894        0.643        0.996
  353     16       1    0.750  0.0968        0.582        0.966
  365     15       1    0.700  0.1025        0.525        0.933
  431     12       1    0.642  0.1093        0.460        0.896
  464     10       1    0.577  0.1157        0.390        0.855
  475      9       1    0.513  0.1193        0.326        0.809
  563      7       1    0.440  0.1227        0.255        0.760
  638      6       1    0.367  0.1222        0.191        0.705

From the summary table above, at time (329 days) 6 individuals below 50 years were at risk and an event occurred. In other words, and individual died. Additionally, beyond 329 days, the probability of an individual below 50 years surviving is 83.3%. Further, from the summary table above, we are 95% confident that there are 58.3% to 100% chances of survival beyond 329 days for the individuals below 50 years. On the other hand, at time (59 days) 20 individuals above 50 years were at risk and an event occurred. In other words, and individual died. Besides, beyond 59 days, the probability of an individual above 50 years surviving is 95%. Further, from the summary table above, we are 95% confident that there are 85.9% to 100% chances of survival beyond 59 days for the individuals above 50 years. It is important to note that the tick marks in the plot above shows censored observations.

Discussion
  • Are the two survival functions statistically different *
  • What do you think? Just visually comparing the two *

From the graph, we can argue that survival differs significantly across the two age category. Individuals below 50 years seems to have a higher survival probability as compared to individuals above 50 years.

Mathematical Proof

Ho: Survival in the two age groups is the same

Ha: Survival in the two age groups is not the same

survdiff(Surv(futime,fustat)~over50, data = cancer)
Call:
survdiff(formula = Surv(futime, fustat) ~ over50, data = cancer)

          N Observed Expected (O-E)^2/E (O-E)^2/V
over50=0  6        1      3.6     1.876      2.75
over50=1 20       11      8.4     0.804      2.75

 Chisq= 2.7  on 1 degrees of freedom, p= 0.1 

From the inferential results above, there is no evidence to reject the null hypothesis. In other words, there is no statistically sufficient evidence to prove that survival in the two groups is statistically different as shown by the p-value of 0.1 which is more than 0.05.

SURVIVAL ANALYSIS (Part Two)

Cox-Proportional Harzard Model (a)

The Cox proportional-hazards model (Cox, 1972) is essentially a regression model commonly used statistical in medical research for investigating the association between the survival time of patients and one or more predictor variables.

In the previous chapter (survival analysis basics), we described the basic concepts of survival analyses and methods for analyzing and summarizing survival data, including:

  • the definition of hazard and survival functions,
  • the construction of Kaplan-Meier survival curves for different patient groups
  • the logrank test for comparing two or more survival curves

The above mentioned methods - Kaplan-Meier curves and logrank tests - are examples of univariate analysis. They describe the survival according to one factor under investigation, but ignore the impact of any others.Additionally, Kaplan-Meier curves and logrank tests are useful only when the predictor variable is categorical (e.g.: treatment A vs treatment B; males vs females). They don’t work easily for quantitative predictors such as gene expression, weight, or age. An alternative method is the Cox proportional hazards regression analysis, which works for both quantitative predictor variables and for categorical variables. Furthermore, the Cox regression model extends survival analysis methods to assess simultaneously the effect of several risk factors on survival time.

The Cox proportional hazards model is called a semi-parametric model, because there are no assumptions about the shape of the baseline hazard function. There are however, other assumptions as noted above (i.e., independence, changes in predictors produce proportional changes in the hazard regardless of time, and a linear association between the natural logarithm of the relative hazard and the predictors). There are other regression models used in survival analysis that assume specific distributions for the survival times such as the exponential, Weibull, Gompertz and log-normal distributions1,8. The exponential regression survival model, for example, assumes that the hazard function is constant. Other distributions assume that the hazard is increasing over time, decreasing over time, or increasing initially and then decreasing. Example 5 will illustrate estimation of a Cox proportional hazards regression model and discuss the interpretation of the regression coefficients.

Load the data set

data("lung")
head(lung)
tail(lung,5)
names(lung)
 [1] "inst"      "time"      "status"    "age"       "sex"       "ph.ecog"  
 [7] "ph.karno"  "pat.karno" "meal.cal"  "wt.loss"  

The data set above is an Rstudio inbuilt data set that can be used to run Cox Proportional hazard model, however, in this case, we are going to use the data below.

Load the data set

cancer<-read.csv("C:\\Users\\user\\Downloads\\cancer.csv")

View the data set

head(cancer,10)

Change the variable of interest into factor

cancer$over50<-as.factor(cancer$over50)
cancer$ecog.ps<-as.factor(cancer$ecog.ps)

Attach the data set

attach(cancer)

Fit the model

cox.model<-coxph(Surv(futime,fustat)~over50+ecog.ps, data = cancer)

Display the model summary

summary(cox.model)
Call:
coxph(formula = Surv(futime, fustat) ~ over50 + ecog.ps, data = cancer)

  n= 26, number of events= 12 

           coef exp(coef) se(coef)     z Pr(>|z|)
over501  1.5356    4.6442   1.0541 1.457    0.145
ecog.ps2 0.2628    1.3006   0.5893 0.446    0.656

         exp(coef) exp(-coef) lower .95 upper .95
over501      4.644     0.2153    0.5884    36.655
ecog.ps2     1.301     0.7689    0.4098     4.128

Concordance= 0.603  (se = 0.081 )
Likelihood ratio test= 3.64  on 2 df,   p=0.2
Wald test            = 2.46  on 2 df,   p=0.3
Score (logrank) test = 2.95  on 2 df,   p=0.2

Remember, cox proportional hazard model do not have the coefficient for the intercept.

Interpretation of exponential coefficient.

At any given time, someone is above 50 years is 4.6442 times as likely to to die as someone who is under 50 years adjusting for ECOG performance status. ECOG performance status describes a patient’s level of functioning in terms of their ability to care for themselves, daily activity, and physical ability (walking, working, etc.). Researchers worldwide consider the ECOG Performance Status Scale when planning cancer clinical trials to study new treatments. From the results (4.6442- 1) indicates that someone who is over 50 years is 364.42% times likely to die compared to someone who is under 50 years when adjusted for ECOG performance status. What this means is that if we pick two individualz who have same the ECOG performance status, the one above 50 years is 364.42% times likely to dies than the one below 50 years. Additionally, someone who is below 50 years is 0.2153 times as likey to die as compared to someone who is above 50 years adjusting to ECOG performance status.

Interpretation of the Likelyhood Ratio Test

This tests the overal model significance. The concordance statistic is the goodness of fit statistics for survival analysis. In logistic regression, this is equivalent to the area under the curve.

Comparing the nested models using likelihood ratio test (LRT); [can we drop ECOG performance status]

cox.model<-coxph(Surv(futime,fustat)~over50+ecog.ps)
cox.model2<-coxph(Surv(futime,fustat)~over50)

Perform the Likelihood Ratio Test

The test aims to establish whether dropping ECOG performance would better or worsen our model significantly. Consider the command below.

anova(cox.model2,cox.model,test = "LRT")

From the results above(p-value>0.05), we cam drop the ECOG performance status without the loss of predictive power from the model. In other words, ECOG performance status is not very important in this model.

Additional Test

Load the data set

data("lung")
head(lung)
write.csv(lung, "C:\\Users\\user\\Downloads\\lung.csv", row.names=FALSE)

Now Import the data set

lung_data<-read.csv("C:\\Users\\user\\Downloads\\lung_data.csv")

Attach the data set

attach(lung_data)

View the data set

head(lung_data,10)

Cox proportional hazard model (b)

Fit the model with numeric X-variable

cox.model3<-coxph(Surv(time,status)~over50+sex,data = lung_data)

Display summary statistics

summary(cox.model3)
Call:
coxph(formula = Surv(time, status) ~ over50 + sex, data = lung_data)

  n= 228, number of events= 165 

          coef exp(coef) se(coef)      z Pr(>|z|)   
over50  0.2803    1.3236   0.3003  0.933  0.35057   
sex    -0.5270    0.5904   0.1672 -3.151  0.00162 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

       exp(coef) exp(-coef) lower .95 upper .95
over50    1.3236     0.7555    0.7347    2.3844
sex       0.5904     1.6938    0.4254    0.8193

Concordance= 0.585  (se = 0.022 )
Likelihood ratio test= 11.58  on 2 df,   p=0.003
Wald test            = 10.94  on 2 df,   p=0.004
Score (logrank) test = 11.19  on 2 df,   p=0.004

Interpretation of exponential coefficient.

At any given time, someone is above 50 years is 1.3236 times as likely to to die as someone who is under 50 years adjusting for gender. From the results (1.3236- 1) indicates that someone who is over 50 years is 32.36% times likely to die compared to someone who is under 50 years when adjusted for gender. What this means is that if we pick two individuals who have same the gender, the one above 50 years is 32.36% times likely to die than the one below 50 years. Additionally, someone who is below 50 years is 0.7555 times likely to die as compared to someone who is above 50 years adjusting for gender.

Interpretation of the Likelyhood Ratio Test

This tests the overal model significance. The concordance statistic is the goodness of fit statistics for survival analysis. In logistic regression, this is equivalent to the area under the curve.

Comparing the nested models using likelihood ratio test (LRT); [can we drop gender]

cox.model3<-coxph(Surv(time,status)~over50+sex, data = lung_data)
cox.model4<-coxph(Surv(time,status)~over50, data = lung_data)

Perform the Likelihood Ratio Test

The test aims to establish whether dropping gender would better or worsen our model significantly. Consider the command below.

anova(cox.model4,cox.model3,test = "LRT")

From the results above(p-value<0.05), we cannot drop the the X-variable sex from the model. Doing so leads to loss of predictive power from the model. In other words, the X-variable sex is very important in this model.

Now, we have worked with adding categorical X-variables, let us now add numeric X-variable in the model.

Run the model

cox.model5<-coxph(Surv(time,status)~age+wt.loss, data = lung_data)

Display Summary statistics

summary(cox.model5)
Call:
coxph(formula = Surv(time, status) ~ age + wt.loss, data = lung_data)

  n= 214, number of events= 152 
   (14 observations deleted due to missingness)

            coef exp(coef) se(coef)     z Pr(>|z|)  
age     0.021697  1.021934 0.009627 2.254   0.0242 *
wt.loss 0.001339  1.001340 0.006239 0.215   0.8301  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        exp(coef) exp(-coef) lower .95 upper .95
age         1.022     0.9785    1.0028     1.041
wt.loss     1.001     0.9987    0.9892     1.014

Concordance= 0.563  (se = 0.026 )
Likelihood ratio test= 5.27  on 2 df,   p=0.07
Wald test            = 5.12  on 2 df,   p=0.08
Score (logrank) test = 5.14  on 2 df,   p=0.08

Interpretation exponentiated age variable

At any given instant in time, the probability of dying for someone who is one year older is 2% higher than someone who is one year younger adjusting for weight loss in the past six months. What this means is that if we pick two individuals who have similar weight loss in the past six months, the probability of dying for someone who is one year older is 2% higher than someone who is one year younger.

Assumptions of Cox-Proportional Hazard Model

The Cox proportional hazards model makes two assumptions: * survival curves for different strata must have hazard functions that are proportional over the time t. * the relationship between the log hazard and each covariate is linear, which can be verified with residual plots. Examples of covariates can be categorical such as race or treatment group, or continuous such as biomarker concentrations.

Linearity

Fit the model

cox.model5<-coxph(Surv(time,status)~age+wt.loss, data = lung_data)

Display the summary of the model

summary(cox.model5)
Call:
coxph(formula = Surv(time, status) ~ age + wt.loss, data = lung_data)

  n= 214, number of events= 152 
   (14 observations deleted due to missingness)

            coef exp(coef) se(coef)     z Pr(>|z|)  
age     0.021697  1.021934 0.009627 2.254   0.0242 *
wt.loss 0.001339  1.001340 0.006239 0.215   0.8301  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        exp(coef) exp(-coef) lower .95 upper .95
age         1.022     0.9785    1.0028     1.041
wt.loss     1.001     0.9987    0.9892     1.014

Concordance= 0.563  (se = 0.026 )
Likelihood ratio test= 5.27  on 2 df,   p=0.07
Wald test            = 5.12  on 2 df,   p=0.08
Score (logrank) test = 5.14  on 2 df,   p=0.08

If you recall, we assume that the relationship between the covariates or rather X-variable and the log hazard is linear. We can check this assumption in the same way we check linearity in the linear regression model, logistic regression model, or Poisson regression model. In order to do, we can look at the residual plot. On the x-axis, we plot the predicted values and y-axis we plot the residuals.

plot(predict(cox.model5),residuals(cox.model5,type="martingale"),
     xlab="fitted values",ylab="Martingale residuals",
     main="Residual Plot",las=1)

Add the horizontal line at zero

plot(predict(cox.model5),residuals(cox.model5,type="martingale"),
     xlab="fitted values",ylab="Martingale residuals",
     main="Residual Plot",las=1)
abline(h=0)

Fit a smoother through the points

plot(predict(cox.model5),residuals(cox.model5,type="martingale"),
     xlab="fitted values",ylab="Martingale residuals",
     main="Residual Plot",las=1)
abline(h=0)
lines(smooth.spline(predict(cox.model5),residuals(cox.model5,type = "martingale")),col="red")

Check linearity using the deviance residuals

plot(predict(cox.model5),residuals(cox.model5,type = "deviance"))

Add the abline and smoother

plot(predict(cox.model5),residuals(cox.model5,type = "deviance"))
abline(h=0)
lines(smooth.spline(predict(cox.model5),
                    residuals(cox.model5,type = "deviance")),col="red")

From the results above, there is linearity between the covariates/X-variables and the log hazard.

Proporrtional Hazard Assumption

This assumption is tested using Schoenfeld test which has the following hypothesis;

Ho: HAZARDS are proportional, HAZARDs are constant over time
Ha: HAZARDS are not proportional, HAZARDS are not proportional over time.

Performing this test will return test for each X-variable and for overall model

cox.zph(cox.model5)
         chisq df    p
age     0.6910  1 0.41
wt.loss 0.0121  1 0.91
GLOBAL  0.7209  2 0.70

The p-values for the covariates and overall model are greater than 0.05. This is an indicator that there is statistically significant evidence to reject the null hypothesis. We therefore conculude that HAZARDS are proportional over time.

Plot of Schoemfeld test

We can see a plot of these as well …(one plot for each parameter). These are plots of “changes in b over time”, if we let “b” vary over time recall…if “b” vary over time, this means that there is NO PH!. The effect is not constant over time… it varies! However pay less attention to the extremes, as line is sensitive…

plot(cox.zph(cox.model5))

The smooth line in the plot tells us how HAZARds will change if we allow them to change. The dotted line on either side of the solid line is the confidence limit. Let us add the abline through zero.

plot(cox.zph(cox.model5)[1], main="Plot of Hazards")
abline(h=0,col="red")

From the graph above, the reference line [abline] lies with the confidence bound at least 1/3 of the time. However, that is not enough to show that coefficient of HAZARDS for age is not changing over time. However, the proportional hazards assumption seems to be partially met.

plot(cox.zph(cox.model5)[2], main="Plot of Hazards")
abline(h=0,col="red")

From the graph above, the reference line [abline] lies with the confidence bound at least throught the time. This is a clear indication that is enough evidence to enough to show that coefficient of HAZARDS for weight loss is not changing over time. From the graph, we can conlude that the coefficient for weight loss is not really changing. In other words proportional hazards assumption is fully met.

Survival Analysis (Part three)

Load the data set

attach(ovarian)

View the data set

head(ovarian,10)
tail(ovarian,10)

Data generation and simulation

Simulating gender (Female=0)

gender<-rep(0:0,110)
length(gender)
[1] 110

Data Framing

gender1<-data.frame(gender)
head(gender1)
tail(gender1)

Simulating gender (male = 1)

gender<-rep(1:1,90)
length(gender)
[1] 90

Data Framing

gender2<-data.frame(gender)
head(gender2)

Data Binding

gender<-rbind(gender1,gender2)
Gender<-data.frame(gender)

View the data binded

head(Gender,3)
tail(Gender,3)

Simulate status for male participants in the study

status<-rep(1:1,21)
s1<-data.frame(status)
head(s1,3)
status<-rep(0:0,89)
s2<-data.frame(status)
head(s2,4)

Simulating status for female participants in the study

status<-rep(1:1,34)
s3<-data.frame(status)
head(s1,3)
status<-rep(0:0,56)
s4<-data.frame(status)
status<-rbind(s1,s2,s3,s4)
head(status,5)
tail(status,5)

Simulating Time to event using weibull distribution

male

n<-110
set.seed(7)
tr<-rbinom(110,1,0.5)
covs<-data.frame(id=1:n,trt=tr)
b<-c(trt=-0.5)

View the data set

ST<-simsurv(dist="weibull",lambdas=0.1,gammas=1.5,x=covs,betas=b)
head(ST,10)
t1<-data.frame(ST$eventtime)
n1<-90
set.seed(9)#makes the simulated values replicable
tr<-rbinom(90,1,0.5)
covs<-data.frame(id=1:n1,trt=tr)
b<-c(trt=-0.2)
ST<-simsurv(dist="weibull",lambdas=0.1,gammas=1.5,x=covs,betas=b)
head(ST)
t2<-data.frame(ST$eventtime)
head(t2)
t3<-data.frame(t1)
t4<-data.frame(t2)
stime<-rbind(t1, t2)
head(stime)
colnames(stime)<-c("time")
tail(stime)

The final data set

dataset<-data.frame(stime,Gender,status)
head(dataset)

View the summary of the data set

summary(dataset)
      time              gender         status     
 Min.   : 0.04149   Min.   :0.00   Min.   :0.000  
 1st Qu.: 2.63161   1st Qu.:0.00   1st Qu.:0.000  
 Median : 4.57048   Median :0.00   Median :0.000  
 Mean   : 4.93313   Mean   :0.45   Mean   :0.275  
 3rd Qu.: 6.63952   3rd Qu.:1.00   3rd Qu.:1.000  
 Max.   :16.64622   Max.   :1.00   Max.   :1.000  

Remove decimal point from the time variable

dataset$time<-round(dataset$time,digits=0)
head(dataset)

Fit the survival model

fit<-survfit(Surv(time,status)~1,dataset)
fit
Call: survfit(formula = Surv(time, status) ~ 1, data = dataset)

       n events median 0.95LCL 0.95UCL
[1,] 200     55     10       8      11

Display the model summary

summary(fit)
Call: survfit(formula = Surv(time, status) ~ 1, data = dataset)

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    1    197       9    0.954  0.0149        0.926        0.984
    2    172       4    0.932  0.0182        0.897        0.968
    3    154       6    0.896  0.0227        0.852        0.942
    4    129       1    0.889  0.0236        0.844        0.936
    5    102       7    0.828  0.0313        0.769        0.891
    6     69       6    0.756  0.0401        0.681        0.839
    7     54       8    0.644  0.0500        0.553        0.750
    8     38       4    0.576  0.0550        0.478        0.695
   10     20       4    0.461  0.0678        0.345        0.615
   11     14       5    0.296  0.0734        0.182        0.481
   12      6       1    0.247  0.0760        0.135        0.451

Plot the survival model

plot(fit,xlab="time",ylab="Surv Probability")

plot(fit,lty=c(1,3,2),xlab="Time",ylab="Survival Probability",
col = c(4, 6),main="KM Curve")###lty---line type
legend(1000, 1, c("Treat A", " Treat B"), col = c(4, 6),
text.col = "green4", lty = c(1, 3), #pch = c(NA, 3),
merge = TRUE, bg = "gray90")

Fitting Survival Object

s1<-Surv(dataset$time,dataset$status)
s1
  [1] 11   5  12   2   3   6   1   3   7  10   3   7   1   1  11  10   7   6 
 [19]  5   2   5   1+  1+  9+  3+  5+  9+  3+ 17+  8+  1+  5+  5+  5+  1+  5+
 [37]  4+  4+  6+  9+  5+  2+  5+  4+  6+  5+  4+  6+  5+  1+  4+ 13+  4+  7+
 [55]  3+  5+  2+  3+  8+  3+  4+  7+  5+ 12+  4+  9+  4+  6+  4+  4+  5+  4+
 [73]  4+  4+  3+  3+  5+  5+  2+  7+  7+  2+  2+  3+  1+  3+ 11+  7+  4+  3+
 [91]  2+  5+  2+  1+  4+ 11+ 11+  7+ 13+  5+  8+  6+  4+  8+  5+  1+  1+  1+
[109]  4+  0+  5   6  10  10   1   3   7   6   8  11   1   2   6   1   7   1 
[127]  6   8   3   7   5   4   5   7   3  11   1   5   7   8   1   8   2  11 
[145]  4+  2+  1+  8+  3+  3+  2+  2+  5+  5+  4+  2+  3+ 10+  2+  4+  0+  3+
[163]  9+  5+  3+  7+  6+  1+  4+  1+  4+  8+  2+  4+  4+  3+  9+  7+  1+  1+
[181]  5+  6+  3+  5+  5+  8+  5+  8+  2+  0+  4+  6+  5+  1+  3+  6+  3+ 13+
[199]  5+ 10+

COMPARE with the first

head(dataset)

with + is censored

tail(dataset)

Stratify the curve by gender

dataset$gender<-factor(dataset$gender,levels=c("0","1"),labels=c("male","female"))

Display the summary of the results

summary(dataset)
      time           gender        status     
 Min.   : 0.000   male  :110   Min.   :0.000  
 1st Qu.: 3.000   female: 90   1st Qu.:0.000  
 Median : 5.000                Median :0.000  
 Mean   : 4.945                Mean   :0.275  
 3rd Qu.: 7.000                3rd Qu.:1.000  
 Max.   :17.000                Max.   :1.000  

Fit the stratified model

fit1<-survfit(Surv(dataset$time,dataset$status)~dataset$gender)

Display the summary

summary(fit1)
Call: survfit(formula = Surv(dataset$time, dataset$status) ~ dataset$gender)

                dataset$gender=male 
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    1    109       3    0.972  0.0157        0.942        1.000
    2     96       2    0.952  0.0209        0.912        0.994
    3     87       3    0.919  0.0275        0.867        0.975
    5     56       3    0.870  0.0380        0.799        0.948
    6     36       2    0.822  0.0489        0.731        0.923
    7     29       3    0.737  0.0639        0.622        0.873
   10     12       2    0.614  0.0955        0.453        0.833
   11     10       2    0.491  0.1089        0.318        0.759
   12      5       1    0.393  0.1238        0.212        0.728

                dataset$gender=female 
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    1     88       6   0.9318  0.0269       0.8806        0.986
    2     76       2   0.9073  0.0313       0.8481        0.971
    3     67       3   0.8667  0.0376       0.7959        0.944
    4     55       1   0.8509  0.0401       0.7758        0.933
    5     46       4   0.7769  0.0509       0.6833        0.883
    6     33       4   0.6827  0.0628       0.5700        0.818
    7     25       5   0.5462  0.0742       0.4185        0.713
    8     18       4   0.4248  0.0787       0.2954        0.611
   10      8       2   0.3186  0.0878       0.1856        0.547
   11      4       3   0.0797  0.0724       0.0134        0.473

Kaplan-Meier curves.

stratify the curve by gender

fit2<-survfit(s1~gender, data=dataset)

Display the summary of the results

summary(fit2)
Call: survfit(formula = s1 ~ gender, data = dataset)

                gender=male 
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    1    109       3    0.972  0.0157        0.942        1.000
    2     96       2    0.952  0.0209        0.912        0.994
    3     87       3    0.919  0.0275        0.867        0.975
    5     56       3    0.870  0.0380        0.799        0.948
    6     36       2    0.822  0.0489        0.731        0.923
    7     29       3    0.737  0.0639        0.622        0.873
   10     12       2    0.614  0.0955        0.453        0.833
   11     10       2    0.491  0.1089        0.318        0.759
   12      5       1    0.393  0.1238        0.212        0.728

                gender=female 
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    1     88       6   0.9318  0.0269       0.8806        0.986
    2     76       2   0.9073  0.0313       0.8481        0.971
    3     67       3   0.8667  0.0376       0.7959        0.944
    4     55       1   0.8509  0.0401       0.7758        0.933
    5     46       4   0.7769  0.0509       0.6833        0.883
    6     33       4   0.6827  0.0628       0.5700        0.818
    7     25       5   0.5462  0.0742       0.4185        0.713
    8     18       4   0.4248  0.0787       0.2954        0.611
   10      8       2   0.3186  0.0878       0.1856        0.547
   11      4       3   0.0797  0.0724       0.0134        0.473

Plot the Keplan Meier Model

plot(fit2,xlab="time",ylab="Surv Probability")

plot(fit1,lty=c(1,3),xlab="Time",ylab="Survival Probability",
col = c(4, 6))
legend(16, 0.9, c("female", " male"), col = c(4, 6),
text.col = "green4", lty = c(1, 3), pch = c(NA, 3),
merge = TRUE, bg = "gray90")

Alternatively

ggsurvplot(fit1, data = dataset, pval = TRUE)

ggsurvplot_list(fit2, data = dataset, pval = TRUE)

Alternative plot

ggsurvplot_list(fit2, data = dataset, pval = TRUE,legend.title = "Strata",legend.labs = c("Male", "Female"))

Cox Proportional hazards

Cox proportional hazards model allow you to include covariates. You can build Cox proportional hazards models using the coxph function and visualize them using the ggforest. These type of plot is called a forest plot. It shows so-called hazard.

Fit a Cox proportional hazards model

fit.coxph <-coxph(Surv(time,status) ~ gender, data = dataset)

Display model summry

summary(fit.coxph)
Call:
coxph(formula = Surv(time, status) ~ gender, data = dataset)

  n= 200, number of events= 55 

               coef exp(coef) se(coef)     z Pr(>|z|)   
genderfemale 0.8331    2.3004   0.2819 2.955  0.00313 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

             exp(coef) exp(-coef) lower .95 upper .95
genderfemale       2.3     0.4347     1.324     3.997

Concordance= 0.59  (se = 0.041 )
Likelihood ratio test= 9.09  on 1 df,   p=0.003
Wald test            = 8.73  on 1 df,   p=0.003
Score (logrank) test = 9.21  on 1 df,   p=0.002

The p-value associated with the coefficient of females is 0.00313 , with a hazard ratio HR = exp(coef) of 2.3004 indicating that females have increased risk of death as compared to males. In other words, females are 2.3004 likely to die as compared to male.

Create the ggforest plot

ggforest(fit.coxph, data =dataset)

As indicated by the plot females have a hazard ratio (HR) of 2.3, indicating that females have increased likely of death as compared to males as shown by the forest plot. The plot shows that females are 2.3 likely to die as compared to male. On the other hand, the respective 95% confidence interval is 1.3 and 4 indicating significant difference in the two survival groups.

Alternative

fit.coxph <- coxph(s1 ~ gender, data = dataset)
ggforest(fit.coxph, data = dataset)

Model fit

summary(fit.coxph)
Call:
coxph(formula = s1 ~ gender, data = dataset)

  n= 200, number of events= 55 

               coef exp(coef) se(coef)     z Pr(>|z|)   
genderfemale 0.8331    2.3004   0.2819 2.955  0.00313 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

             exp(coef) exp(-coef) lower .95 upper .95
genderfemale       2.3     0.4347     1.324     3.997

Concordance= 0.59  (se = 0.041 )
Likelihood ratio test= 9.09  on 1 df,   p=0.003
Wald test            = 8.73  on 1 df,   p=0.003
Score (logrank) test = 9.21  on 1 df,   p=0.002

NB Concordance shows the predictive ability of the model.Also use the likelihood ratio test. As indicated above, the likelihood ratio test is significant indicating that the model is signficant/ is agood fit. Kaplan-Meier estimator and the log-rank test estimates the survival probability,Cox proportional hazards model calculates the risk of death and respective hazard ratios.