I will load the necessary packages that help me to perform the analysis
library(survival)
library(ggplot2)
library(GGally)
library(survminer)
library(gtsummary)
head(lung)
## inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
## 1 3 306 2 74 1 1 90 100 1175 NA
## 2 3 455 2 68 1 0 90 90 1225 15
## 3 3 1010 1 56 1 0 90 90 NA 15
## 4 5 210 2 57 1 1 90 60 1150 11
## 5 1 883 2 60 1 0 100 90 NA 0
## 6 12 1022 1 74 1 1 50 80 513 0
str(lung)
## 'data.frame': 228 obs. of 10 variables:
## $ inst : num 3 3 3 5 1 12 7 11 1 7 ...
## $ time : num 306 455 1010 210 883 ...
## $ status : num 2 2 1 2 2 1 2 2 2 2 ...
## $ age : num 74 68 56 57 60 74 68 71 53 61 ...
## $ sex : num 1 1 1 1 1 1 2 2 1 1 ...
## $ ph.ecog : num 1 0 0 1 0 1 2 2 1 2 ...
## $ ph.karno : num 90 90 90 90 100 50 70 60 70 70 ...
## $ pat.karno: num 100 90 90 60 90 80 60 80 80 70 ...
## $ meal.cal : num 1175 1225 NA 1150 NA ...
## $ wt.loss : num NA 15 15 11 0 0 10 1 16 34 ...
A nonparametric estimator of the survival function, the Kaplan Meier method is widely used to estimate and graph survival probabilities as a function of time. It can be used to obtain univariate descriptive statistics for survival data, including the median survival time, and compare the survival experience for two or more groups of subjects. To test for overall differences between estimated survival curves of two or more groups of subjects, such as males versus females, or treated versus untreated (control) groups, several tests are available, including the log-rank test. This can be motivated as a type of chi-square test,a widely used test in practice, and in reality is a method for comparing the Kaplan-Meier curves estimated for each group of subjects.
sf.lung<-survfit(Surv(time,status)~1,data=lung)
sf.lung
## Call: survfit(formula = Surv(time, status) ~ 1, data = lung)
##
## n events median 0.95LCL 0.95UCL
## 228 165 310 285 363
ggsurvplot(
fit = survfit(Surv(time, status) ~ 1, data = lung),
xlab = "Days",
ylab = "Overall survival probability")
sf.sex<-survfit(Surv(time,status)~sex,data=lung)
sf.sex
## Call: survfit(formula = Surv(time, status) ~ sex, data = lung)
##
## n events median 0.95LCL 0.95UCL
## sex=1 138 112 270 212 310
## sex=2 90 53 426 348 550
ggsurvplot(
fit = survfit(Surv(time, status) ~ sex, data = lung),
xlab = "Days",
ylab = "Overall survival probability",
pval = TRUE)
A popular regression model for the analysis of survival data is the Cox proportional hazards regression model. It allows testing for differences in survival times of two or more groups of interest, while allowing to adjust for covariates of interest. The Cox regression model is a semiparametric model, making fewer assumptions than typical parametric methods but more assumptions than those nonparametric methods described above. In particular, and in contrast with parametric models, it makes no assumptions about the shape of the so-called baseline hazard function.
The Cox regression model provides useful and easy to interpret information regarding the relationship of the hazard function to predictors. While a nonlinear relationship between the hazard function and the predictors is assumed, the hazard ratio comparing any two observations is in fact constant over time in the setting where the predictor variables do not vary over time. This assumption is called the proportional hazards assumption and checking if this assumption is met is an important part of a Cox regression analysis.
cox.regression = coxph(Surv(time,status) ~ age+sex+ph.ecog+ph.karno+pat.karno+meal.cal,data=lung)
summary(cox.regression)
## Call:
## coxph(formula = Surv(time, status) ~ age + sex + ph.ecog + ph.karno +
## pat.karno + meal.cal, data = lung)
##
## n= 178, number of events= 131
## (50 observations deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age 1.079e-02 1.011e+00 1.109e-02 0.973 0.33063
## sex -5.095e-01 6.008e-01 1.934e-01 -2.635 0.00842 **
## ph.ecog 5.876e-01 1.800e+00 2.074e-01 2.833 0.00461 **
## ph.karno 1.790e-02 1.018e+00 1.085e-02 1.650 0.09898 .
## pat.karno -7.058e-03 9.930e-01 7.577e-03 -0.931 0.35160
## meal.cal -1.751e-05 1.000e+00 2.322e-04 -0.075 0.93987
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age 1.0108 0.9893 0.9891 1.0331
## sex 0.6008 1.6644 0.4113 0.8777
## ph.ecog 1.7996 0.5557 1.1985 2.7021
## ph.karno 1.0181 0.9823 0.9966 1.0399
## pat.karno 0.9930 1.0071 0.9783 1.0078
## meal.cal 1.0000 1.0000 0.9995 1.0004
##
## Concordance= 0.636 (se = 0.029 )
## Likelihood ratio test= 24.23 on 6 df, p=5e-04
## Wald test = 24.01 on 6 df, p=5e-04
## Score (logrank) test = 24.58 on 6 df, p=4e-04
coxph(Surv(time, status) ~ age+sex+ph.ecog+ph.karno+pat.karno+meal.cal, data = lung) %>%
gtsummary::tbl_regression(exp = TRUE)
| N = 178 | HR | 95% CI | p-value |
|---|---|---|---|
| age | 1.01 | 0.99, 1.03 | 0.3 |
| sex | 0.60 | 0.41, 0.88 | 0.008 |
| ph.ecog | 1.80 | 1.20, 2.70 | 0.005 |
| ph.karno | 1.02 | 1.00, 1.04 | 0.10 |
| pat.karno | 0.99 | 0.98, 1.01 | 0.4 |
| meal.cal | 1.00 | 1.00, 1.00 | >0.9 |
Cox_curve <- survfit(cox.regression)
plot(Cox_curve)
cox.regression.sex = coxph(Surv(time,status) ~ sex,data=lung)
summary(cox.regression.sex)
## Call:
## coxph(formula = Surv(time, status) ~ sex, data = lung)
##
## n= 228, number of events= 165
##
## coef exp(coef) se(coef) z Pr(>|z|)
## sex -0.5310 0.5880 0.1672 -3.176 0.00149 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## sex 0.588 1.701 0.4237 0.816
##
## Concordance= 0.579 (se = 0.021 )
## Likelihood ratio test= 10.63 on 1 df, p=0.001
## Wald test = 10.09 on 1 df, p=0.001
## Score (logrank) test = 10.33 on 1 df, p=0.001
coxph(Surv(time, status) ~ sex, data = lung) %>%
gtsummary::tbl_regression(exp = TRUE)
| N = 228 | HR | 95% CI | p-value |
|---|---|---|---|
| sex | 0.59 | 0.42, 0.82 | 0.001 |
Cox_curve_sex <- survfit(cox.regression.sex)
plot(Cox_curve_sex)
# Comparing the results
#Kaplan-Meier curve dataframe
#Add a row of model name
km <- rep("Kaplan Meier", length(sf.sex$time))
#Create a dataframe
km_df <- data.frame(sf.sex$time,sf.sex$surv,km)
#Rename the columns so they are same for all dataframes
names(km_df) <- c("Time","Surv","Model")
#Cox model curve dataframe
#Add a row of model name
cox <- rep("Cox",length(Cox_curve_sex$time))
#Create a dataframe
cox_df <- data.frame(Cox_curve_sex$time,Cox_curve_sex$surv,cox)
#Rename the columns so they are same for all dataframes
names(cox_df) <- c("Time","Surv","Model")
#Combine the results
plot_combo <- rbind(km_df,cox_df)
#Make a ggplot
plot_gg <- ggplot(plot_combo, aes(x = Time, y = Surv, color = Model))
plot_gg + geom_line() + ggtitle("Comparison of Survival Curves")