Survival Analysis in Colorectal Cancer Patient Data
Survival analysis is the result of measuring the time from one event to another. Time in the form of years, months, weeks, or days counted from the first study until the occurrence of an event is used for the dependent variable.
1. Overview Data Survival
TBP data is data applied to a survival analysis involving 30 samples of patients treated with colorectal cancer. The data consists of two dependent variables and four independent variables. This data is used to predict the factors that affect a person’s survival from an event or death.
tbp_data<-read.csv2("data_tbp.csv")
head(tbp_data) Umur Kelamin Stadium Threatment waktu status
1 Tua 1 Parah Standar 34 FALSE
2 Muda 1 Parah Standar 4 TRUE
3 Tua 0 Parah Standar 10 TRUE
4 Tua 1 Parah Standar 5 TRUE
5 Muda 0 Parah Standar 20 TRUE
6 Muda 1 Parah Standar 24 FALSE
2. Data Wrangling
Data wrangling is done so that the data format is in accordance with the objectives of the analysis. The data type of data_tbp.csv is not correct, so the data type is changed.
library(dplyr)
tbp_data<-tbp_data %>%
mutate(Kelamin = as.factor(Kelamin),Umur = as.factor(Umur),Stadium=as.factor(Stadium),Threatment=as.factor(Threatment))
glimpse(tbp_data)Rows: 30
Columns: 6
$ Umur <fct> Tua, Muda, Tua, Tua, Muda, Muda, Muda, Muda, Muda, Muda, Tu~
$ Kelamin <fct> 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0,~
$ Stadium <fct> Parah, Parah, Parah, Parah, Parah, Parah, Parah, Parah, Par~
$ Threatment <fct> Standar, Standar, Standar, Standar, Standar, Standar, Stand~
$ waktu <int> 34, 4, 10, 5, 20, 24, 8, 17, 17, 34, 17, 8, 27, 21, 18, 16,~
$ status <lgl> FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, F~
Type of data is correct so that it can proceed to the next process
3. Data Explanation
summary(tbp_data) Umur Kelamin Stadium Threatment waktu
Muda:22 0:15 Parah :25 Baru :19 Min. : 4.00
Tua : 8 1:15 Sangat Parah: 5 Standar:11 1st Qu.: 8.00
Median :14.00
Mean :15.33
3rd Qu.:20.75
Max. :34.00
status
Mode :logical
FALSE:20
TRUE :10
Insight
Presented in the following table.
Table1 shows the male gender of colon cancer patients who have a balanced percentage of patients who are female. Nearly 73.3% of the total young colon cancer patients with age \(\leq 50\) years. There were 13.3% of patients who were already at a very severe type of stage and there were 2/3 patients who received a new type of treatment. This shows that the male gender of bowel cancer patients has a balanced proportion with patients who are female. Nearly 73.3% of the total young colon cancer patients with age \(\leq 50\) years. There were 13.3% of patients who were already at a very severe type of stage and there were 2/3 of the patients who received a new type of treatment.
4. Data Analysis
The steps to analyze colorectal cancer patient data are:
a. Proportional hazard test
The proportional hazard assumption states that the hazard function of an individual differences are the proportions or ratios of the hazard functions of the two individuals discrepancy is constant. Proportional hazard testing can be done in two ways, namely the graphical approach and statistical test.
Graph Approach
#fitting dengan faktor Kelamin
fit_umur<-survfit(Surv(waktu,status)~Kelamin , data=tbp_data)
#Visualisasi fitting dengan faktor Kelamin
ggsurvplot(fit_umur, data = tbp_data,
fun = "cumhaz",
legend = "top",
legend.title ="Kelamin",
legend.labs = c("famale",
"Male"))
The Figure shows the survival curve for colon cancer patients based on gender. The plot of women and men shows that there are points that intersect so that the possibility of the assumption that the proportional hazard assumption is violated.
library(survival)
library(survminer)
#fitting dengan faktor umur
fit_umur<-survfit(Surv(waktu,status)~ Umur , data=tbp_data)
#Visualisasi fitting dengan faktor umur
ggsurvplot(fit_umur, data = tbp_data,
fun = "cumhaz",
legend = "top",
legend.title = "Umur",
legend.labs = c("Tua",
"Muda"))
The Figure shows the -In survival curve of patients with colon cancer based on age factors. The old and young plots appear parallel, thus indicating that the rate of death risk for colon cancer patients tends to be constant, in other words the proportional hazard assumption is not violated.
#fitting dengan faktor Stadium
fit_umur<-survfit(Surv(waktu,status)~Stadium , data=tbp_data)
#Visualisasi fitting dengan faktor Stadium
ggsurvplot(fit_umur, data = tbp_data,
fun = "cumhaz",
legend = "top",
legend.title = "Stadium",
legend.labs = c("Parah",
"Sangat Parah"))
The Figure shows the -In survival curve of patients with colon cancer based on gender. Severe and very severe plots show that there are points that intersect so that it may indicate that the proportional hazard assumption is violated.
#fitting dengan faktor Kelamin
fit_umur<-survfit(Surv(waktu,status)~ Threatment , data=tbp_data)
#Visualisasi fitting dengan faktor Kelamin
ggsurvplot(fit_umur, data = tbp_data,
fun = "cumhaz",
legend = "top",
legend.title = "Threatment",
legend.labs = c("Standar",
"Baru"))
The figure shows the -In survival curve of patients with colon cancer based on the threat factor. The standard and new plots appear parallel, thus indicating that the rate of death risk for colon cancer patients tends to be constant, in other words the proportional hazard assumption is not violated.
Statistical test
Testing the proportional hazard assumption with a graphical approach usually results in different decisions between one observer and another so it is necessary to use another approach that can strengthen the decision, namely by using this statistical test method obtaining p-value for each predictor variable, so that it can be convincing when compared with a graphical approach.
library(survival)
fit_all<-coxph(Surv(waktu,status)~Threatment+Umur+Stadium+Kelamin , data=tbp_data)
cek_fitt_all<-cox.zph(fit_all)
cek_fitt_all chisq df p
Threatment 0.20798 1 0.648
Umur 0.00399 1 0.950
Stadium 0.19411 1 0.660
Kelamin 5.65015 1 0.017
GLOBAL 8.23987 4 0.083
The results of the proportional hazards assumption test in Table 4.2.2 show that of the 4 variables tested (umur, Kelamin, stadium, and threatment) only Kelamin did not meet the PH assumption with a significance level of 5%. So that the Cox proportional hazard regression method cannot be used. Then it was decided to make 2 models, namely the Stratified cox regression model and the extended cox regression model.
b.Model Stratified Cox
Before forming the model, testing was carried out whether there was an interaction between the stratafication variable, namely gender, and the variables being modeled including age, stage, and threat.
#Without interaction
model_strata<-coxph(formula = Surv(waktu, status)~ Umur+Stadium+Threatment+strata(Kelamin), data=tbp_data)
summary(model_strata)Call:
coxph(formula = Surv(waktu, status) ~ Umur + Stadium + Threatment +
strata(Kelamin), data = tbp_data)
n= 30, number of events= 10
coef exp(coef) se(coef) z Pr(>|z|)
UmurTua 0.4248 1.5293 0.7774 0.546 0.585
StadiumSangat Parah 1.3123 3.7146 1.0626 1.235 0.217
ThreatmentStandar 1.3298 3.7804 0.8734 1.523 0.128
exp(coef) exp(-coef) lower .95 upper .95
UmurTua 1.529 0.6539 0.3332 7.018
StadiumSangat Parah 3.715 0.2692 0.4628 29.814
ThreatmentStandar 3.780 0.2645 0.6825 20.940
Concordance= 0.65 (se = 0.093 )
Likelihood ratio test= 3.81 on 3 df, p=0.3
Wald test = 3.22 on 3 df, p=0.4
Score (logrank) test = 3.7 on 3 df, p=0.3
#Interaction
res.interaction.sex <- coxph(formula = Surv(waktu, status) ~ (Umur + Stadium+Threatment)*Kelamin - Kelamin+ strata(Kelamin),
data = tbp_data,
ties = c("efron","breslow","exact")[1])
summary(res.interaction.sex)Call:
coxph(formula = Surv(waktu, status) ~ (Umur + Stadium + Threatment) *
Kelamin - Kelamin + strata(Kelamin), data = tbp_data, ties = c("efron",
"breslow", "exact")[1])
n= 30, number of events= 10
coef exp(coef) se(coef) z Pr(>|z|)
UmurTua 1.237e+00 3.444e+00 1.022e+00 1.211 0.226
StadiumSangat Parah 1.181e+00 3.257e+00 1.206e+00 0.979 0.327
ThreatmentStandar 3.519e-01 1.422e+00 1.128e+00 0.312 0.755
UmurMuda:Kelamin1 1.704e+00 5.493e+00 1.598e+00 1.066 0.286
UmurTua:Kelamin1 NA NA 0.000e+00 NA NA
StadiumSangat Parah:Kelamin1 -1.117e+00 3.273e-01 3.248e+04 0.000 1.000
ThreatmentStandar:Kelamin1 2.018e+01 5.787e+08 1.664e+04 0.001 0.999
exp(coef) exp(-coef) lower .95 upper .95
UmurTua 3.444e+00 2.904e-01 0.4651 25.50
StadiumSangat Parah 3.257e+00 3.071e-01 0.3067 34.59
ThreatmentStandar 1.422e+00 7.034e-01 0.1558 12.97
UmurMuda:Kelamin1 5.493e+00 1.821e-01 0.2397 125.89
UmurTua:Kelamin1 NA NA NA NA
StadiumSangat Parah:Kelamin1 3.273e-01 3.055e+00 0.0000 Inf
ThreatmentStandar:Kelamin1 5.787e+08 1.728e-09 0.0000 Inf
Concordance= 0.73 (se = 0.098 )
Likelihood ratio test= 8.61 on 6 df, p=0.2
Wald test = 4 on 6 df, p=0.7
Score (logrank) test = 8.87 on 6 df, p=0.2
#The test results whether there is interaction or not
anova(res.interaction.sex,model_strata)Analysis of Deviance Table
Cox model: response is Surv(waktu, status)
Model 1: ~ (Umur + Stadium + Threatment) * Kelamin - Kelamin + strata(Kelamin)
Model 2: ~ Umur + Stadium + Threatment + strata(Kelamin)
loglik Chisq Df P(>|Chi|)
1 -18.624
2 -21.025 4.802 3 0.1869
Based on the test results, the p-value is greater than the significance level of 0.05, so the decision is a failure to reject \(H_0\), which means that there is no interpretation between the variables of residence with the variables of age, stage and thereatment. Once it is known that there is no interaction in the model
#Female Strata Model
model_female<-coxph(formula = Surv(waktu, status)~ Umur+Stadium+Threatment, data=tbp_data,subset = (Kelamin == "1"))
summary(model_female)Call:
coxph(formula = Surv(waktu, status) ~ Umur + Stadium + Threatment,
data = tbp_data, subset = (Kelamin == "1"))
n= 15, number of events= 3
coef exp(coef) se(coef) z Pr(>|z|)
UmurTua -4.668e-01 6.270e-01 1.229e+00 -0.380 0.704
StadiumSangat Parah 6.379e-02 1.066e+00 5.354e+04 0.000 1.000
ThreatmentStandar 2.153e+01 2.236e+09 2.744e+04 0.001 0.999
exp(coef) exp(-coef) lower .95 upper .95
UmurTua 6.270e-01 1.595e+00 0.05641 6.97
StadiumSangat Parah 1.066e+00 9.382e-01 0.00000 Inf
ThreatmentStandar 2.236e+09 4.472e-10 0.00000 Inf
Concordance= 0.829 (se = 0.083 )
Likelihood ratio test= 5.12 on 3 df, p=0.2
Wald test = 0.14 on 3 df, p=1
Score (logrank) test = 4.02 on 3 df, p=0.3
Insight :
Based on the results above, it can be interpreted that the hazard rate factor for the old age category is 0.627, meaning that female patients with old age have a risk of 0.627 times smaller than female patients with young age to experience death. Judging from the p-value of the old age variable is more than 0.05, it can be said that with a significance level of 0.05, the old age variable has no significant effect on the risk of death in colorectal cancer patients in women. Furthermore, the Hazard Ratio (HR) value for the very severe stage variable is 1.066. This could also imply that, holding other variables constant, female patients with very severe stages had a 1.066 times greater risk than female patients with severe stages of experiencing death.
#Male strata Model
model_male<-coxph(formula = Surv(waktu, status)~ Umur+Stadium+Threatment, data=tbp_data,subset = (Kelamin == "0"))
summary(model_male)Call:
coxph(formula = Surv(waktu, status) ~ Umur + Stadium + Threatment,
data = tbp_data, subset = (Kelamin == "0"))
n= 15, number of events= 7
coef exp(coef) se(coef) z Pr(>|z|)
UmurTua 1.2367 3.4441 1.0215 1.211 0.226
StadiumSangat Parah 1.1807 3.2567 1.2055 0.979 0.327
ThreatmentStandar 0.3519 1.4217 1.1282 0.312 0.755
exp(coef) exp(-coef) lower .95 upper .95
UmurTua 3.444 0.2904 0.4651 25.50
StadiumSangat Parah 3.257 0.3071 0.3067 34.59
ThreatmentStandar 1.422 0.7034 0.1558 12.97
Concordance= 0.669 (se = 0.139 )
Likelihood ratio test= 3.49 on 3 df, p=0.3
Wald test = 3.86 on 3 df, p=0.3
Score (logrank) test = 4.85 on 3 df, p=0.2
Insight
Based on the results above, it can be interpreted that the hazard rate factor for the old age category is 3.4441, meaning that male patients with old age have a risk of 3.4441 times greater than male patients with young age to experience death. Judging from the p-value of the old age variable is more than 0.05, it can be said that with a significance level of 0.05, the old age variable has no significant effect on the risk of death in colon cancer patients in women. Furthermore, the Hazard Ratio (HR) value for the very severe stage variable is 3.2567. This can also be interpreted that, by holding other variables constant, male patients with very severe stages have a 3.2567 times greater risk than male patients with very severe stages of experiencing death.
c. Model Extended Cox
The extended cox model is characterized by using the time function in the model, namely by interacting with variables that do not meet the proportional hazard assumption, namely the gender variable with the time function in this study is Log (T).
library(dplyr)
tbp.data<-read.csv2("tbp.data.csv")
tbp.data<-tbp.data %>%
mutate(Kelamin=as.numeric(Kelamin),waktu=as.numeric(waktu))
attach(tbp.data)#Time Dependent
log.kelamin1<-log(waktu)*Kelamin
fit_all<-coxph(Surv(waktu,status)~Kelamin+log.kelamin1+Threatment+Stadium+Umur)
summary(fit_all)Call:
coxph(formula = Surv(waktu, status) ~ Kelamin + log.kelamin1 +
Threatment + Stadium + Umur)
n= 30, number of events= 10
coef exp(coef) se(coef) z Pr(>|z|)
Kelamin 3.257e+01 1.403e+14 2.259e+01 1.442 0.149
log.kelamin1 -1.542e+01 2.009e-07 1.089e+01 -1.416 0.157
ThreatmentStandar 1.251e+00 3.496e+00 9.711e-01 1.289 0.197
StadiumSangat Parah 1.874e+00 6.511e+00 1.171e+00 1.600 0.110
UmurTua 6.669e-01 1.948e+00 8.814e-01 0.757 0.449
exp(coef) exp(-coef) lower .95 upper .95
Kelamin 1.403e+14 7.129e-15 8.304e-06 2.370e+33
log.kelamin1 2.009e-07 4.978e+06 1.077e-16 3.747e+02
ThreatmentStandar 3.496e+00 2.861e-01 5.211e-01 2.345e+01
StadiumSangat Parah 6.511e+00 1.536e-01 6.557e-01 6.465e+01
UmurTua 1.948e+00 5.133e-01 3.463e-01 1.096e+01
Concordance= 0.86 (se = 0.073 )
Likelihood ratio test= 26.06 on 5 df, p=9e-05
Wald test = 4.86 on 5 df, p=0.4
Score (logrank) test = 18.52 on 5 df, p=0.002
Perform backward elimination to determine the best extended cox.
model.step <- step(fit_all, direction = "backward", trace = T)Start: AIC=43.69
Surv(waktu, status) ~ Kelamin + log.kelamin1 + Threatment + Stadium +
Umur
Df AIC
- Umur 1 42.221
- Threatment 1 43.379
<none> 43.690
- Stadium 1 44.175
- Kelamin 1 59.793
- log.kelamin1 1 63.759
Step: AIC=42.22
Surv(waktu, status) ~ Kelamin + log.kelamin1 + Threatment + Stadium
Df AIC
<none> 42.221
- Threatment 1 42.225
- Stadium 1 43.128
- Kelamin 1 58.349
- log.kelamin1 1 61.976
model.stepCall:
coxph(formula = Surv(waktu, status) ~ Kelamin + log.kelamin1 +
Threatment + Stadium)
coef exp(coef) se(coef) z p
Kelamin 3.022e+01 1.337e+13 2.077e+01 1.455 0.1457
log.kelamin1 -1.427e+01 6.353e-07 1.002e+01 -1.423 0.1546
ThreatmentStandar 1.309e+00 3.702e+00 9.397e-01 1.393 0.1636
StadiumSangat Parah 1.961e+00 7.106e+00 1.128e+00 1.738 0.0822
Likelihood ratio test=25.53 on 4 df, p=3.934e-05
n= 30, number of events= 10
Based on the above results, it can be concluded that the best cox extended model to describe the risk of death in patients with intestinal cancers is a model without age variables. The following is the result of the model revision.
final_extended_cox<-coxph(Surv(waktu,status)~Kelamin+log.kelamin1+Threatment+Stadium)
summary(final_extended_cox)Call:
coxph(formula = Surv(waktu, status) ~ Kelamin + log.kelamin1 +
Threatment + Stadium)
n= 30, number of events= 10
coef exp(coef) se(coef) z Pr(>|z|)
Kelamin 3.022e+01 1.337e+13 2.077e+01 1.455 0.1457
log.kelamin1 -1.427e+01 6.353e-07 1.002e+01 -1.423 0.1546
ThreatmentStandar 1.309e+00 3.702e+00 9.397e-01 1.393 0.1636
StadiumSangat Parah 1.961e+00 7.106e+00 1.128e+00 1.738 0.0822 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
Kelamin 1.337e+13 7.482e-14 2.776e-05 6.435e+30
log.kelamin1 6.353e-07 1.574e+06 1.863e-15 2.167e+02
ThreatmentStandar 3.702e+00 2.701e-01 5.870e-01 2.335e+01
StadiumSangat Parah 7.106e+00 1.407e-01 7.784e-01 6.487e+01
Concordance= 0.855 (se = 0.075 )
Likelihood ratio test= 25.53 on 4 df, p=4e-05
Wald test = 4.77 on 4 df, p=0.3
Score (logrank) test = 17.48 on 4 df, p=0.002
Interpretation of the results can be seen from the following plot
library(survminer)
ggforest(model.step, data=tbp_data)b. Best Model Selection
Selection of the best model using the AIC value. The best model is the smallest AIC model. The following is the AIC value based on processing using the female gender stratification model, the male gender stratafication model and the extended cox regression model.
AIC(model_strata)[1] 48.0498
AIC(final_extended_cox)[1] 42.2212
Based on the table above, extended cox is the best model in analyzing the risk of death of colon cancer patients with an AIC value of 42.24.
5. Conclusion
- Because the proportional hazard assumption is violated, it was decided to make two models, namely the stratified cox model and the extended cox model.
- In the stratified cox model, stratafication was carried out on the independent variable, namely Kelamin, so that two models were formed, namely the female strata model and the male strata model.
- The formation of the extended cox model by adding the time function to the Kelamin variable is \(log (t)\).
- Based on AIC, the best model is the extended cox model.