Survival Analysis in Colorectal Cancer Patient Data


Survival analysis is the result of measuring the time from one event to another. Time in the form of years, months, weeks, or days counted from the first study until the occurrence of an event is used for the dependent variable.

1. Overview Data Survival

TBP data is data applied to a survival analysis involving 30 samples of patients treated with colorectal cancer. The data consists of two dependent variables and four independent variables. This data is used to predict the factors that affect a person’s survival from an event or death.

tbp_data<-read.csv2("data_tbp.csv")
head(tbp_data)
  Umur Kelamin Stadium Threatment waktu status
1  Tua       1   Parah    Standar    34  FALSE
2 Muda       1   Parah    Standar     4   TRUE
3  Tua       0   Parah    Standar    10   TRUE
4  Tua       1   Parah    Standar     5   TRUE
5 Muda       0   Parah    Standar    20   TRUE
6 Muda       1   Parah    Standar    24  FALSE

2. Data Wrangling

Data wrangling is done so that the data format is in accordance with the objectives of the analysis. The data type of data_tbp.csv is not correct, so the data type is changed.

library(dplyr)
tbp_data<-tbp_data %>%
  mutate(Kelamin = as.factor(Kelamin),Umur = as.factor(Umur),Stadium=as.factor(Stadium),Threatment=as.factor(Threatment))
  glimpse(tbp_data)
Rows: 30
Columns: 6
$ Umur       <fct> Tua, Muda, Tua, Tua, Muda, Muda, Muda, Muda, Muda, Muda, Tu~
$ Kelamin    <fct> 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0,~
$ Stadium    <fct> Parah, Parah, Parah, Parah, Parah, Parah, Parah, Parah, Par~
$ Threatment <fct> Standar, Standar, Standar, Standar, Standar, Standar, Stand~
$ waktu      <int> 34, 4, 10, 5, 20, 24, 8, 17, 17, 34, 17, 8, 27, 21, 18, 16,~
$ status     <lgl> FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, F~

Type of data is correct so that it can proceed to the next process

3. Data Explanation

summary(tbp_data)
   Umur    Kelamin         Stadium     Threatment     waktu      
 Muda:22   0:15    Parah       :25   Baru   :19   Min.   : 4.00  
 Tua : 8   1:15    Sangat Parah: 5   Standar:11   1st Qu.: 8.00  
                                                  Median :14.00  
                                                  Mean   :15.33  
                                                  3rd Qu.:20.75  
                                                  Max.   :34.00  
   status       
 Mode :logical  
 FALSE:20       
 TRUE :10       
                
                
                

Insight
Presented in the following table.
Table 1. Descriptive Results

Table1 shows the male gender of colon cancer patients who have a balanced percentage of patients who are female. Nearly 73.3% of the total young colon cancer patients with age \(\leq 50\) years. There were 13.3% of patients who were already at a very severe type of stage and there were 2/3 patients who received a new type of treatment. This shows that the male gender of bowel cancer patients has a balanced proportion with patients who are female. Nearly 73.3% of the total young colon cancer patients with age \(\leq 50\) years. There were 13.3% of patients who were already at a very severe type of stage and there were 2/3 of the patients who received a new type of treatment.

4. Data Analysis

The steps to analyze colorectal cancer patient data are:

a. Proportional hazard test

The proportional hazard assumption states that the hazard function of an individual differences are the proportions or ratios of the hazard functions of the two individuals discrepancy is constant. Proportional hazard testing can be done in two ways, namely the graphical approach and statistical test.

Graph Approach

#fitting dengan faktor Kelamin
fit_umur<-survfit(Surv(waktu,status)~Kelamin , data=tbp_data)
#Visualisasi fitting dengan faktor Kelamin
ggsurvplot(fit_umur, data = tbp_data, 
 fun = "cumhaz", 
 legend = "top",
 legend.title ="Kelamin",
 legend.labs = c("famale", 
 "Male"))

The Figure shows the survival curve for colon cancer patients based on gender. The plot of women and men shows that there are points that intersect so that the possibility of the assumption that the proportional hazard assumption is violated.

library(survival)
library(survminer)
#fitting dengan faktor umur
fit_umur<-survfit(Surv(waktu,status)~ Umur , data=tbp_data)
#Visualisasi fitting dengan faktor umur
ggsurvplot(fit_umur, data = tbp_data, 
 fun = "cumhaz", 
 legend = "top",
 legend.title = "Umur",
 legend.labs = c("Tua", 
 "Muda"))

The Figure shows the -In survival curve of patients with colon cancer based on age factors. The old and young plots appear parallel, thus indicating that the rate of death risk for colon cancer patients tends to be constant, in other words the proportional hazard assumption is not violated.

#fitting dengan faktor Stadium
fit_umur<-survfit(Surv(waktu,status)~Stadium , data=tbp_data)
#Visualisasi fitting dengan faktor Stadium
ggsurvplot(fit_umur, data = tbp_data, 
 fun = "cumhaz", 
 legend = "top",
 legend.title = "Stadium",
 legend.labs = c("Parah", 
 "Sangat Parah"))

The Figure shows the -In survival curve of patients with colon cancer based on gender. Severe and very severe plots show that there are points that intersect so that it may indicate that the proportional hazard assumption is violated.

#fitting dengan faktor Kelamin
fit_umur<-survfit(Surv(waktu,status)~  Threatment , data=tbp_data)
#Visualisasi fitting dengan faktor Kelamin
ggsurvplot(fit_umur, data = tbp_data, 
 fun = "cumhaz",
 legend = "top",
 legend.title = "Threatment",
 legend.labs = c("Standar", 
 "Baru"))

The figure shows the -In survival curve of patients with colon cancer based on the threat factor. The standard and new plots appear parallel, thus indicating that the rate of death risk for colon cancer patients tends to be constant, in other words the proportional hazard assumption is not violated.

Statistical test

Testing the proportional hazard assumption with a graphical approach usually results in different decisions between one observer and another so it is necessary to use another approach that can strengthen the decision, namely by using this statistical test method obtaining p-value for each predictor variable, so that it can be convincing when compared with a graphical approach.

library(survival)
fit_all<-coxph(Surv(waktu,status)~Threatment+Umur+Stadium+Kelamin , data=tbp_data)
cek_fitt_all<-cox.zph(fit_all)
cek_fitt_all
             chisq df     p
Threatment 0.20798  1 0.648
Umur       0.00399  1 0.950
Stadium    0.19411  1 0.660
Kelamin    5.65015  1 0.017
GLOBAL     8.23987  4 0.083

The results of the proportional hazards assumption test in Table 4.2.2 show that of the 4 variables tested (umur, Kelamin, stadium, and threatment) only Kelamin did not meet the PH assumption with a significance level of 5%. So that the Cox proportional hazard regression method cannot be used. Then it was decided to make 2 models, namely the Stratified cox regression model and the extended cox regression model.

b.Model Stratified Cox

Before forming the model, testing was carried out whether there was an interaction between the stratafication variable, namely gender, and the variables being modeled including age, stage, and threat.

#Without interaction
model_strata<-coxph(formula = Surv(waktu, status)~ Umur+Stadium+Threatment+strata(Kelamin), data=tbp_data)
summary(model_strata)
Call:
coxph(formula = Surv(waktu, status) ~ Umur + Stadium + Threatment + 
    strata(Kelamin), data = tbp_data)

  n= 30, number of events= 10 

                      coef exp(coef) se(coef)     z Pr(>|z|)
UmurTua             0.4248    1.5293   0.7774 0.546    0.585
StadiumSangat Parah 1.3123    3.7146   1.0626 1.235    0.217
ThreatmentStandar   1.3298    3.7804   0.8734 1.523    0.128

                    exp(coef) exp(-coef) lower .95 upper .95
UmurTua                 1.529     0.6539    0.3332     7.018
StadiumSangat Parah     3.715     0.2692    0.4628    29.814
ThreatmentStandar       3.780     0.2645    0.6825    20.940

Concordance= 0.65  (se = 0.093 )
Likelihood ratio test= 3.81  on 3 df,   p=0.3
Wald test            = 3.22  on 3 df,   p=0.4
Score (logrank) test = 3.7  on 3 df,   p=0.3
#Interaction
res.interaction.sex <- coxph(formula = Surv(waktu, status) ~ (Umur + Stadium+Threatment)*Kelamin - Kelamin+ strata(Kelamin),
                             data    = tbp_data,
                             ties    = c("efron","breslow","exact")[1])
summary(res.interaction.sex)
Call:
coxph(formula = Surv(waktu, status) ~ (Umur + Stadium + Threatment) * 
    Kelamin - Kelamin + strata(Kelamin), data = tbp_data, ties = c("efron", 
    "breslow", "exact")[1])

  n= 30, number of events= 10 

                                   coef  exp(coef)   se(coef)     z Pr(>|z|)
UmurTua                       1.237e+00  3.444e+00  1.022e+00 1.211    0.226
StadiumSangat Parah           1.181e+00  3.257e+00  1.206e+00 0.979    0.327
ThreatmentStandar             3.519e-01  1.422e+00  1.128e+00 0.312    0.755
UmurMuda:Kelamin1             1.704e+00  5.493e+00  1.598e+00 1.066    0.286
UmurTua:Kelamin1                     NA         NA  0.000e+00    NA       NA
StadiumSangat Parah:Kelamin1 -1.117e+00  3.273e-01  3.248e+04 0.000    1.000
ThreatmentStandar:Kelamin1    2.018e+01  5.787e+08  1.664e+04 0.001    0.999

                             exp(coef) exp(-coef) lower .95 upper .95
UmurTua                      3.444e+00  2.904e-01    0.4651     25.50
StadiumSangat Parah          3.257e+00  3.071e-01    0.3067     34.59
ThreatmentStandar            1.422e+00  7.034e-01    0.1558     12.97
UmurMuda:Kelamin1            5.493e+00  1.821e-01    0.2397    125.89
UmurTua:Kelamin1                    NA         NA        NA        NA
StadiumSangat Parah:Kelamin1 3.273e-01  3.055e+00    0.0000       Inf
ThreatmentStandar:Kelamin1   5.787e+08  1.728e-09    0.0000       Inf

Concordance= 0.73  (se = 0.098 )
Likelihood ratio test= 8.61  on 6 df,   p=0.2
Wald test            = 4  on 6 df,   p=0.7
Score (logrank) test = 8.87  on 6 df,   p=0.2
#The test results whether there is interaction or not
anova(res.interaction.sex,model_strata)
Analysis of Deviance Table
 Cox model: response is  Surv(waktu, status)
 Model 1: ~ (Umur + Stadium + Threatment) * Kelamin - Kelamin + strata(Kelamin)
 Model 2: ~ Umur + Stadium + Threatment + strata(Kelamin)
   loglik Chisq Df P(>|Chi|)
1 -18.624                   
2 -21.025 4.802  3    0.1869

Based on the test results, the p-value is greater than the significance level of 0.05, so the decision is a failure to reject \(H_0\), which means that there is no interpretation between the variables of residence with the variables of age, stage and thereatment. Once it is known that there is no interaction in the model

#Female Strata Model 
model_female<-coxph(formula = Surv(waktu, status)~ Umur+Stadium+Threatment, data=tbp_data,subset = (Kelamin == "1"))
summary(model_female)
Call:
coxph(formula = Surv(waktu, status) ~ Umur + Stadium + Threatment, 
    data = tbp_data, subset = (Kelamin == "1"))

  n= 15, number of events= 3 

                          coef  exp(coef)   se(coef)      z Pr(>|z|)
UmurTua             -4.668e-01  6.270e-01  1.229e+00 -0.380    0.704
StadiumSangat Parah  6.379e-02  1.066e+00  5.354e+04  0.000    1.000
ThreatmentStandar    2.153e+01  2.236e+09  2.744e+04  0.001    0.999

                    exp(coef) exp(-coef) lower .95 upper .95
UmurTua             6.270e-01  1.595e+00   0.05641      6.97
StadiumSangat Parah 1.066e+00  9.382e-01   0.00000       Inf
ThreatmentStandar   2.236e+09  4.472e-10   0.00000       Inf

Concordance= 0.829  (se = 0.083 )
Likelihood ratio test= 5.12  on 3 df,   p=0.2
Wald test            = 0.14  on 3 df,   p=1
Score (logrank) test = 4.02  on 3 df,   p=0.3
Insight :

Based on the results above, it can be interpreted that the hazard rate factor for the old age category is 0.627, meaning that female patients with old age have a risk of 0.627 times smaller than female patients with young age to experience death. Judging from the p-value of the old age variable is more than 0.05, it can be said that with a significance level of 0.05, the old age variable has no significant effect on the risk of death in colorectal cancer patients in women. Furthermore, the Hazard Ratio (HR) value for the very severe stage variable is 1.066. This could also imply that, holding other variables constant, female patients with very severe stages had a 1.066 times greater risk than female patients with severe stages of experiencing death.

#Male strata Model
model_male<-coxph(formula = Surv(waktu, status)~ Umur+Stadium+Threatment, data=tbp_data,subset = (Kelamin == "0"))
summary(model_male)
Call:
coxph(formula = Surv(waktu, status) ~ Umur + Stadium + Threatment, 
    data = tbp_data, subset = (Kelamin == "0"))

  n= 15, number of events= 7 

                      coef exp(coef) se(coef)     z Pr(>|z|)
UmurTua             1.2367    3.4441   1.0215 1.211    0.226
StadiumSangat Parah 1.1807    3.2567   1.2055 0.979    0.327
ThreatmentStandar   0.3519    1.4217   1.1282 0.312    0.755

                    exp(coef) exp(-coef) lower .95 upper .95
UmurTua                 3.444     0.2904    0.4651     25.50
StadiumSangat Parah     3.257     0.3071    0.3067     34.59
ThreatmentStandar       1.422     0.7034    0.1558     12.97

Concordance= 0.669  (se = 0.139 )
Likelihood ratio test= 3.49  on 3 df,   p=0.3
Wald test            = 3.86  on 3 df,   p=0.3
Score (logrank) test = 4.85  on 3 df,   p=0.2
Insight

Based on the results above, it can be interpreted that the hazard rate factor for the old age category is 3.4441, meaning that male patients with old age have a risk of 3.4441 times greater than male patients with young age to experience death. Judging from the p-value of the old age variable is more than 0.05, it can be said that with a significance level of 0.05, the old age variable has no significant effect on the risk of death in colon cancer patients in women. Furthermore, the Hazard Ratio (HR) value for the very severe stage variable is 3.2567. This can also be interpreted that, by holding other variables constant, male patients with very severe stages have a 3.2567 times greater risk than male patients with very severe stages of experiencing death.

c. Model Extended Cox

The extended cox model is characterized by using the time function in the model, namely by interacting with variables that do not meet the proportional hazard assumption, namely the gender variable with the time function in this study is Log (T).

library(dplyr)
tbp.data<-read.csv2("tbp.data.csv")
tbp.data<-tbp.data %>%
  mutate(Kelamin=as.numeric(Kelamin),waktu=as.numeric(waktu))
attach(tbp.data)
#Time Dependent 
log.kelamin1<-log(waktu)*Kelamin
fit_all<-coxph(Surv(waktu,status)~Kelamin+log.kelamin1+Threatment+Stadium+Umur)
summary(fit_all)
Call:
coxph(formula = Surv(waktu, status) ~ Kelamin + log.kelamin1 + 
    Threatment + Stadium + Umur)

  n= 30, number of events= 10 

                          coef  exp(coef)   se(coef)      z Pr(>|z|)
Kelamin              3.257e+01  1.403e+14  2.259e+01  1.442    0.149
log.kelamin1        -1.542e+01  2.009e-07  1.089e+01 -1.416    0.157
ThreatmentStandar    1.251e+00  3.496e+00  9.711e-01  1.289    0.197
StadiumSangat Parah  1.874e+00  6.511e+00  1.171e+00  1.600    0.110
UmurTua              6.669e-01  1.948e+00  8.814e-01  0.757    0.449

                    exp(coef) exp(-coef) lower .95 upper .95
Kelamin             1.403e+14  7.129e-15 8.304e-06 2.370e+33
log.kelamin1        2.009e-07  4.978e+06 1.077e-16 3.747e+02
ThreatmentStandar   3.496e+00  2.861e-01 5.211e-01 2.345e+01
StadiumSangat Parah 6.511e+00  1.536e-01 6.557e-01 6.465e+01
UmurTua             1.948e+00  5.133e-01 3.463e-01 1.096e+01

Concordance= 0.86  (se = 0.073 )
Likelihood ratio test= 26.06  on 5 df,   p=9e-05
Wald test            = 4.86  on 5 df,   p=0.4
Score (logrank) test = 18.52  on 5 df,   p=0.002

Perform backward elimination to determine the best extended cox.

model.step <- step(fit_all, direction = "backward", trace = T)
Start:  AIC=43.69
Surv(waktu, status) ~ Kelamin + log.kelamin1 + Threatment + Stadium + 
    Umur

               Df    AIC
- Umur          1 42.221
- Threatment    1 43.379
<none>            43.690
- Stadium       1 44.175
- Kelamin       1 59.793
- log.kelamin1  1 63.759

Step:  AIC=42.22
Surv(waktu, status) ~ Kelamin + log.kelamin1 + Threatment + Stadium

               Df    AIC
<none>            42.221
- Threatment    1 42.225
- Stadium       1 43.128
- Kelamin       1 58.349
- log.kelamin1  1 61.976
model.step
Call:
coxph(formula = Surv(waktu, status) ~ Kelamin + log.kelamin1 + 
    Threatment + Stadium)

                          coef  exp(coef)   se(coef)      z      p
Kelamin              3.022e+01  1.337e+13  2.077e+01  1.455 0.1457
log.kelamin1        -1.427e+01  6.353e-07  1.002e+01 -1.423 0.1546
ThreatmentStandar    1.309e+00  3.702e+00  9.397e-01  1.393 0.1636
StadiumSangat Parah  1.961e+00  7.106e+00  1.128e+00  1.738 0.0822

Likelihood ratio test=25.53  on 4 df, p=3.934e-05
n= 30, number of events= 10 

Based on the above results, it can be concluded that the best cox extended model to describe the risk of death in patients with intestinal cancers is a model without age variables. The following is the result of the model revision.

final_extended_cox<-coxph(Surv(waktu,status)~Kelamin+log.kelamin1+Threatment+Stadium)
summary(final_extended_cox)
Call:
coxph(formula = Surv(waktu, status) ~ Kelamin + log.kelamin1 + 
    Threatment + Stadium)

  n= 30, number of events= 10 

                          coef  exp(coef)   se(coef)      z Pr(>|z|)  
Kelamin              3.022e+01  1.337e+13  2.077e+01  1.455   0.1457  
log.kelamin1        -1.427e+01  6.353e-07  1.002e+01 -1.423   0.1546  
ThreatmentStandar    1.309e+00  3.702e+00  9.397e-01  1.393   0.1636  
StadiumSangat Parah  1.961e+00  7.106e+00  1.128e+00  1.738   0.0822 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                    exp(coef) exp(-coef) lower .95 upper .95
Kelamin             1.337e+13  7.482e-14 2.776e-05 6.435e+30
log.kelamin1        6.353e-07  1.574e+06 1.863e-15 2.167e+02
ThreatmentStandar   3.702e+00  2.701e-01 5.870e-01 2.335e+01
StadiumSangat Parah 7.106e+00  1.407e-01 7.784e-01 6.487e+01

Concordance= 0.855  (se = 0.075 )
Likelihood ratio test= 25.53  on 4 df,   p=4e-05
Wald test            = 4.77  on 4 df,   p=0.3
Score (logrank) test = 17.48  on 4 df,   p=0.002

Interpretation of the results can be seen from the following plot

library(survminer)
ggforest(model.step, data=tbp_data)

b. Best Model Selection

Selection of the best model using the AIC value. The best model is the smallest AIC model. The following is the AIC value based on processing using the female gender stratification model, the male gender stratafication model and the extended cox regression model.

AIC(model_strata)
[1] 48.0498
AIC(final_extended_cox)
[1] 42.2212

Based on the table above, extended cox is the best model in analyzing the risk of death of colon cancer patients with an AIC value of 42.24.

5. Conclusion

  1. Because the proportional hazard assumption is violated, it was decided to make two models, namely the stratified cox model and the extended cox model.
  2. In the stratified cox model, stratafication was carried out on the independent variable, namely Kelamin, so that two models were formed, namely the female strata model and the male strata model.
  3. The formation of the extended cox model by adding the time function to the Kelamin variable is \(log (t)\).
  4. Based on AIC, the best model is the extended cox model.