Loading packages

I will load the necessary packages that help me to perform the analysis

library(survival)
library(ggplot2)
library(GGally)
library(survminer)
library(gtsummary)

Discovering the data

head(lung)
##   inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
## 1    3  306      2  74   1       1       90       100     1175      NA
## 2    3  455      2  68   1       0       90        90     1225      15
## 3    3 1010      1  56   1       0       90        90       NA      15
## 4    5  210      2  57   1       1       90        60     1150      11
## 5    1  883      2  60   1       0      100        90       NA       0
## 6   12 1022      1  74   1       1       50        80      513       0
str(lung)
## 'data.frame':    228 obs. of  10 variables:
##  $ inst     : num  3 3 3 5 1 12 7 11 1 7 ...
##  $ time     : num  306 455 1010 210 883 ...
##  $ status   : num  2 2 1 2 2 1 2 2 2 2 ...
##  $ age      : num  74 68 56 57 60 74 68 71 53 61 ...
##  $ sex      : num  1 1 1 1 1 1 2 2 1 1 ...
##  $ ph.ecog  : num  1 0 0 1 0 1 2 2 1 2 ...
##  $ ph.karno : num  90 90 90 90 100 50 70 60 70 70 ...
##  $ pat.karno: num  100 90 90 60 90 80 60 80 80 70 ...
##  $ meal.cal : num  1175 1225 NA 1150 NA ...
##  $ wt.loss  : num  NA 15 15 11 0 0 10 1 16 34 ...

Kaplan-Meier graph

A nonparametric estimator of the survival function, the Kaplan Meier method is widely used to estimate and graph survival probabilities as a function of time. It can be used to obtain univariate descriptive statistics for survival data, including the median survival time, and compare the survival experience for two or more groups of subjects. To test for overall differences between estimated survival curves of two or more groups of subjects, such as males versus females, or treated versus untreated (control) groups, several tests are available, including the log-rank test. This can be motivated as a type of chi-square test,a widely used test in practice, and in reality is a method for comparing the Kaplan-Meier curves estimated for each group of subjects.

Kaplan-Meier for all the survivals

sf.lung<-survfit(Surv(time,status)~1,data=lung)
sf.lung
## Call: survfit(formula = Surv(time, status) ~ 1, data = lung)
## 
##       n  events  median 0.95LCL 0.95UCL 
##     228     165     310     285     363
ggsurvplot(
    fit = survfit(Surv(time, status) ~ 1, data = lung), 
    xlab = "Days", 
    ylab = "Overall survival probability")

Kaplan-Meier for the survivals per sex

sf.sex<-survfit(Surv(time,status)~sex,data=lung)
sf.sex
## Call: survfit(formula = Surv(time, status) ~ sex, data = lung)
## 
##         n events median 0.95LCL 0.95UCL
## sex=1 138    112    270     212     310
## sex=2  90     53    426     348     550
ggsurvplot(
    fit = survfit(Surv(time, status) ~ sex, data = lung), 
    xlab = "Days", 
    ylab = "Overall survival probability",
    pval = TRUE)

Cox regression

A popular regression model for the analysis of survival data is the Cox proportional hazards regression model. It allows testing for differences in survival times of two or more groups of interest, while allowing to adjust for covariates of interest. The Cox regression model is a semiparametric model, making fewer assumptions than typical parametric methods but more assumptions than those nonparametric methods described above. In particular, and in contrast with parametric models, it makes no assumptions about the shape of the so-called baseline hazard function.

The Cox regression model provides useful and easy to interpret information regarding the relationship of the hazard function to predictors. While a nonlinear relationship between the hazard function and the predictors is assumed, the hazard ratio comparing any two observations is in fact constant over time in the setting where the predictor variables do not vary over time. This assumption is called the proportional hazards assumption and checking if this assumption is met is an important part of a Cox regression analysis.

cox regression for all the survivals

cox.regression = coxph(Surv(time,status) ~ age+sex+ph.ecog+ph.karno+pat.karno+meal.cal,data=lung)
summary(cox.regression)
## Call:
## coxph(formula = Surv(time, status) ~ age + sex + ph.ecog + ph.karno + 
##     pat.karno + meal.cal, data = lung)
## 
##   n= 178, number of events= 131 
##    (50 observations deleted due to missingness)
## 
##                 coef  exp(coef)   se(coef)      z Pr(>|z|)   
## age        1.079e-02  1.011e+00  1.109e-02  0.973  0.33063   
## sex       -5.095e-01  6.008e-01  1.934e-01 -2.635  0.00842 **
## ph.ecog    5.876e-01  1.800e+00  2.074e-01  2.833  0.00461 **
## ph.karno   1.790e-02  1.018e+00  1.085e-02  1.650  0.09898 . 
## pat.karno -7.058e-03  9.930e-01  7.577e-03 -0.931  0.35160   
## meal.cal  -1.751e-05  1.000e+00  2.322e-04 -0.075  0.93987   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##           exp(coef) exp(-coef) lower .95 upper .95
## age          1.0108     0.9893    0.9891    1.0331
## sex          0.6008     1.6644    0.4113    0.8777
## ph.ecog      1.7996     0.5557    1.1985    2.7021
## ph.karno     1.0181     0.9823    0.9966    1.0399
## pat.karno    0.9930     1.0071    0.9783    1.0078
## meal.cal     1.0000     1.0000    0.9995    1.0004
## 
## Concordance= 0.636  (se = 0.029 )
## Likelihood ratio test= 24.23  on 6 df,   p=5e-04
## Wald test            = 24.01  on 6 df,   p=5e-04
## Score (logrank) test = 24.58  on 6 df,   p=4e-04
coxph(Surv(time, status) ~ age+sex+ph.ecog+ph.karno+pat.karno+meal.cal, data = lung) %>% 
  gtsummary::tbl_regression(exp = TRUE) 
N = 178 HR 95% CI p-value
age 1.01 0.99, 1.03 0.3
sex 0.60 0.41, 0.88 0.008
ph.ecog 1.80 1.20, 2.70 0.005
ph.karno 1.02 1.00, 1.04 0.10
pat.karno 0.99 0.98, 1.01 0.4
meal.cal 1.00 1.00, 1.00 >0.9
Cox_curve <- survfit(cox.regression)
plot(Cox_curve)

cox regression by sex

cox.regression.sex = coxph(Surv(time,status) ~ sex,data=lung)
summary(cox.regression.sex)
## Call:
## coxph(formula = Surv(time, status) ~ sex, data = lung)
## 
##   n= 228, number of events= 165 
## 
##        coef exp(coef) se(coef)      z Pr(>|z|)   
## sex -0.5310    0.5880   0.1672 -3.176  0.00149 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##     exp(coef) exp(-coef) lower .95 upper .95
## sex     0.588      1.701    0.4237     0.816
## 
## Concordance= 0.579  (se = 0.021 )
## Likelihood ratio test= 10.63  on 1 df,   p=0.001
## Wald test            = 10.09  on 1 df,   p=0.001
## Score (logrank) test = 10.33  on 1 df,   p=0.001
coxph(Surv(time, status) ~ sex, data = lung) %>% 
  gtsummary::tbl_regression(exp = TRUE) 
N = 228 HR 95% CI p-value
sex 0.59 0.42, 0.82 0.001
Cox_curve_sex <- survfit(cox.regression.sex)
plot(Cox_curve_sex)

# Comparing the results

#Kaplan-Meier curve dataframe
#Add a row of model name
km <- rep("Kaplan Meier", length(sf.sex$time))
#Create a dataframe
km_df <- data.frame(sf.sex$time,sf.sex$surv,km)
#Rename the columns so they are same for all dataframes
names(km_df) <- c("Time","Surv","Model")

#Cox model curve dataframe
#Add a row of model name
cox <- rep("Cox",length(Cox_curve_sex$time))
#Create a dataframe
cox_df <- data.frame(Cox_curve_sex$time,Cox_curve_sex$surv,cox)
#Rename the columns so they are same for all dataframes
names(cox_df) <- c("Time","Surv","Model")

#Combine the results
plot_combo <- rbind(km_df,cox_df)

#Make a ggplot
plot_gg <- ggplot(plot_combo, aes(x = Time, y = Surv, color = Model))
plot_gg + geom_line() + ggtitle("Comparison of Survival Curves")