Introduction :-

In this report, I perform survival analysis (also called time-to-event analysis) on a breast cancer data set.


Exploratory Data Analysis :-

Exploratory data analysis (EDA) is an approach to analyzing data sets that summarizes their main characteristics, often with visual methods.

The given data set has 686 observations, and each observation has 10 attributes. The first six rows of the data set are as follows.

##   horTh age menostat tsize tgrade pnodes progrec estrec time cens
## 1    no  70     Post    21     II      3      48     66 1814    1
## 2   yes  56     Post    12     II      7      61     77 2018    1
## 3   yes  58     Post    35     II      9      52    271  712    1
## 4   yes  59     Post    17     II      4      60     29 1807    1
## 5    no  73     Post    35     II      1      26     65  772    1
## 6    no  32      Pre    57    III     24       0     13  448    1

The detailed structure is as follows.

## 'data.frame':    686 obs. of  10 variables:
##  $ horTh   : Factor w/ 2 levels "no","yes": 1 2 2 2 1 1 2 1 1 1 ...
##  $ age     : int  70 56 58 59 73 32 59 65 80 66 ...
##  $ menostat: Factor w/ 2 levels "Post","Pre": 1 1 1 1 1 2 1 1 1 1 ...
##  $ tsize   : int  21 12 35 17 35 57 8 16 39 18 ...
##  $ tgrade  : Factor w/ 3 levels "I","II","III": 2 2 2 2 2 3 2 2 2 2 ...
##  $ pnodes  : int  3 7 9 4 1 24 2 1 30 7 ...
##  $ progrec : int  48 61 52 60 26 0 181 192 0 0 ...
##  $ estrec  : int  66 77 271 29 65 13 0 25 59 3 ...
##  $ time    : int  1814 2018 712 1807 772 448 2172 2161 471 2014 ...
##  $ cens    : int  1 1 1 1 1 1 0 0 1 0 ...

In the input, the type of each attribute is as follows.

##     horTh       age  menostat     tsize    tgrade    pnodes   progrec    estrec 
##  "factor" "integer"  "factor" "integer"  "factor" "integer" "integer" "integer" 
##      time      cens 
## "integer" "integer"

For EDA, I convert cens from integer to factor. After the conversion, the type of each attribute is as follows.

##     horTh       age  menostat     tsize    tgrade    pnodes   progrec    estrec 
##  "factor" "integer"  "factor" "integer"  "factor" "integer" "integer" "integer" 
##      time      cens 
## "integer"  "factor"
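
The conversion described above can be sketched in R as follows. This is only an illustration: the ten rows are the values printed by str() above, used as a toy sample, and `df_eda` is a hypothetical name for a separate EDA copy (an assumption, not necessarily how the original report was coded).

```r
# Toy sample: the first ten values of four columns, as printed by str() above.
df <- data.frame(
  horTh = factor(c("no", "yes", "yes", "yes", "no", "no", "yes", "no", "no", "no")),
  age   = c(70L, 56L, 58L, 59L, 73L, 32L, 59L, 65L, 80L, 66L),
  time  = c(1814L, 2018L, 712L, 1807L, 772L, 448L, 2172L, 2161L, 471L, 2014L),
  cens  = c(1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L)
)

# Convert cens to a factor on a copy (hypothetical name), so that the
# integer 0/1 column in df stays available for the survival models.
df_eda <- df
df_eda$cens <- factor(df_eda$cens)

# Class of every column after the conversion.
col_types <- sapply(df_eda, class)
print(col_types)
```

Working on a copy is a design choice in this sketch: the summaries and plots can treat cens as a category while the models still receive a numeric event indicator.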

The number of missing values in each column is as follows.

##    horTh      age menostat    tsize   tgrade   pnodes  progrec   estrec 
##        0        0        0        0        0        0        0        0 
##     time     cens 
##        0        0

As there are no missing values, we can proceed further.
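
A per-column count like the one above can be produced with base R. The sketch below uses a three-column toy sample taken from the rows printed earlier; the guard at the end is an added suggestion, not part of the original analysis.

```r
# Toy sample: three columns from the first rows printed earlier.
df <- data.frame(
  age   = c(70L, 56L, 58L),
  tsize = c(21L, 12L, 35L),
  cens  = c(1L, 1L, 1L)
)

# is.na() gives a logical matrix; colSums() counts the TRUE values per column.
na_counts <- colSums(is.na(df))
print(na_counts)

# Optional guard (an addition for illustration): stop early if anything is missing.
if (any(na_counts > 0)) {
  stop("Missing values found; handle them before modelling.")
}
```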


The overall summary of all the attributes is as follows.

##  horTh          age        menostat       tsize        tgrade   
##  no :440   Min.   :21.00   Post:396   Min.   :  3.00   I  : 81  
##  yes:246   1st Qu.:46.00   Pre :290   1st Qu.: 20.00   II :444  
##            Median :53.00              Median : 25.00   III:161  
##            Mean   :53.05              Mean   : 29.33            
##            3rd Qu.:61.00              3rd Qu.: 35.00            
##            Max.   :80.00              Max.   :120.00            
##      pnodes         progrec           estrec             time        cens   
##  Min.   : 1.00   Min.   :   0.0   Min.   :   0.00   Min.   :   8.0   0:387  
##  1st Qu.: 1.00   1st Qu.:   7.0   1st Qu.:   8.00   1st Qu.: 567.8   1:299  
##  Median : 3.00   Median :  32.5   Median :  36.00   Median :1084.0          
##  Mean   : 5.01   Mean   : 110.0   Mean   :  96.25   Mean   :1124.5          
##  3rd Qu.: 7.00   3rd Qu.: 131.8   3rd Qu.: 114.00   3rd Qu.:1684.8          
##  Max.   :51.00   Max.   :2380.0   Max.   :1144.00   Max.   :2659.0

The distribution of all continuous variables is as follows.


The distribution of all continuous variables in each category is as follows.

  1. Age:-

  2. Tsize:-

  3. Pnodes:-

  4. Progrec:-

  5. Estrec:-

  6. Time:-


The correlation between the continuous variables is as follows.

##                 age       tsize      pnodes     progrec      estrec        time
## age      1.00000000 -0.04541210  0.03270905  0.08435497  0.32313238  0.05395755
## tsize   -0.04541210  1.00000000  0.32766498 -0.02741477 -0.08176636 -0.13837569
## pnodes   0.03270905  0.32766498  1.00000000 -0.07253389 -0.04318344 -0.25675074
## progrec  0.08435497 -0.02741477 -0.07253389  1.00000000  0.39260134  0.10272922
## estrec   0.32313238 -0.08176636 -0.04318344  0.39260134  1.00000000  0.06547710
## time     0.05395755 -0.13837569 -0.25675074  0.10272922  0.06547710  1.00000000
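
A matrix of this shape can be computed with cor(). The sketch below assumes Pearson correlation (the default of cor(); the report does not state which method was used) and runs on the ten rows printed by str() above, so its values will differ from the full-data matrix shown.

```r
# Toy sample: first ten values of the continuous columns, as printed by str().
df <- data.frame(
  age     = c(70L, 56L, 58L, 59L, 73L, 32L, 59L, 65L, 80L, 66L),
  tsize   = c(21L, 12L, 35L, 17L, 35L, 57L, 8L, 16L, 39L, 18L),
  pnodes  = c(3L, 7L, 9L, 4L, 1L, 24L, 2L, 1L, 30L, 7L),
  progrec = c(48L, 61L, 52L, 60L, 26L, 0L, 181L, 192L, 0L, 0L),
  estrec  = c(66L, 77L, 271L, 29L, 65L, 13L, 0L, 25L, 59L, 3L),
  time    = c(1814L, 2018L, 712L, 1807L, 772L, 448L, 2172L, 2161L, 471L, 2014L)
)

# Pairwise Pearson correlations between all columns (assumed method).
cor_mat <- cor(df, method = "pearson")
print(round(cor_mat, 3))
```

In the full-data matrix above, no pair of variables has an absolute correlation above 0.4; the largest are progrec with estrec (0.39), tsize with pnodes (0.33) and age with estrec (0.32).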

___

Description of EDA :-

In our data set,


NON-PARAMETRIC SURVIVAL MODELS

Fitting Kaplan-Meier Model (without considering categorical variables) :-

In this model, all the records are treated as a single group; the categorical variables tgrade (I / II / III) and horTh (yes / no) are not considered.
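
A single-group fit of this kind can be sketched with the survival package. The names `attrib` and `df` follow the call printed in this section; the formula `Surv(time, cens) ~ 1` is an assumption consistent with the response shown later in the report, and the ten rows are the toy sample printed by str() earlier, so the numbers differ from the full-data results.

```r
library(survival)

# Toy sample: first ten values of time and cens, as printed by str().
df <- data.frame(
  time = c(1814L, 2018L, 712L, 1807L, 772L, 448L, 2172L, 2161L, 471L, 2014L),
  cens = c(1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L)
)

# "~ 1" means no grouping variable: one curve for all records (assumed formula).
attrib <- Surv(time, cens) ~ 1

# Kaplan-Meier is the default estimator of survfit().
km_model <- survfit(formula = attrib, data = df)
print(km_model)

# Survival table: one row per event time, with the estimated S(t).
km_table <- summary(km_model)
print(data.frame(time = km_table$time,
                 n.risk = km_table$n.risk,
                 n.event = km_table$n.event,
                 survival = km_table$surv))
```

On this toy sample the estimate drops by 1/10 at each of the first six event times, because all ten patients are still at risk until the first censored time (2014 days); after that, each event removes a larger share of the remaining risk set.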

  • Overview of the fitted Kaplan-Meier model is as follows.
## Call: survfit(formula = attrib, data = df, type = "kaplan-meier")
## 
##       n  events  median 0.95LCL 0.95UCL 
##     686     299    1807    1587    2030
  • Survival Table :-

summary_km_model

  • Kaplan-Meier curve :-


Fitting Kaplan-Meier Model (considering the horTh categorical variable) :-

In this model, the records are split into two groups based on each record's horTh value.

The overview of the fitted Kaplan-Meier model is as follows.

## Call: survfit(formula = attrib, data = df, type = "kaplan-meier")
## 
##             n events median 0.95LCL 0.95UCL
## horTh=no  440    205   1528    1296    1814
## horTh=yes 246     94   2018    1918      NA
  • Survival Table :-

km_model_horth_1

km_model_horth_2

  • Kaplan-Meier curve :-


Comparing two KM-Curves :-

The log-rank test is a nonparametric hypothesis test that compares the survival distributions of two samples. It is well suited to the Kaplan-Meier estimator, which is also nonparametric.

  1. Null Hypothesis : Survival in the two groups is the same.
  2. Alternative Hypothesis : Survival in the two groups is not the same.
## Call:
## survdiff(formula = attrib, data = df)
## 
##             N Observed Expected (O-E)^2/E (O-E)^2/V
## horTh=no  440      205      180      3.37      8.56
## horTh=yes 246       94      119      5.12      8.56
## 
##  Chisq= 8.6  on 1 degrees of freedom, p= 0.003

The p-value (0.003) is well below the conventional 0.05 significance level.
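
The test above can be sketched with survdiff() from the survival package. The formula `Surv(time, cens) ~ horTh` is an assumption consistent with the groups in the printed table, and the ten rows are the toy sample printed by str() earlier, so the statistic will not match the full-data value of 8.6.

```r
library(survival)

# Toy sample: first ten values of horTh, time and cens, as printed by str().
df <- data.frame(
  horTh = factor(c("no", "yes", "yes", "yes", "no", "no", "yes", "no", "no", "no")),
  time  = c(1814L, 2018L, 712L, 1807L, 772L, 448L, 2172L, 2161L, 471L, 2014L),
  cens  = c(1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L)
)

# Log-rank test of equal survival across the horTh groups (assumed formula).
attrib <- Surv(time, cens) ~ horTh
lr_test <- survdiff(formula = attrib, data = df)
print(lr_test)

# Observed and expected event counts per group; under the null hypothesis
# the two should be close. The totals agree by construction.
print(cbind(observed = lr_test$obs, expected = lr_test$exp))
```

With only ten records the test has very little power; the sketch is meant to show the call and the structure of the result, not to reproduce the conclusion.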


Fitting Kaplan-Meier Model (considering the tgrade categorical variable) :-

In this model, the records are split into three groups based on each record's tgrade value.

The overview of the fitted Kaplan-Meier model is as follows.

## Call: survfit(formula = attrib, data = df, type = "kaplan-meier")
## 
##              n events median 0.95LCL 0.95UCL
## tgrade=I    81     18     NA    1990      NA
## tgrade=II  444    202   1730    1493    2030
## tgrade=III 161     79   1337     960      NA
  • Survival Table :-

km_model_tgrade_1

km_model_tgrade_2

km_model_tgrade_3

  • Kaplan-Meier curve :-


Comparing three KM-Curves :-

The log-rank test also extends to more than two samples, so it is used here to compare the survival distributions of the three tgrade groups. With three groups, the test statistic has 2 degrees of freedom.

  1. Null Hypothesis : Survival is the same in all three groups.
  2. Alternative Hypothesis : Survival differs in at least one group.
## Call:
## survdiff(formula = attrib, data = df)
## 
##              N Observed Expected (O-E)^2/E (O-E)^2/V
## tgrade=I    81       18     42.2   13.8469    16.159
## tgrade=II  444      202    198.2    0.0725     0.215
## tgrade=III 161       79     58.6    7.0788     8.848
## 
##  Chisq= 21.1  on 2 degrees of freedom, p= 3e-05

The p-value (3e-05) is far below the conventional 0.05 significance level.


Conclusion of Non-parametric models :-

As both p-values are very small, we can reject the null hypothesis in each test. The survival distributions differ significantly between the horTh groups and between the tgrade groups.
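
The reported p-values can be checked from the printed chi-square statistics alone. The sketch below uses the rounded values from the two survdiff tables (8.56 on 1 degree of freedom for horTh, 21.1 on 2 degrees of freedom for tgrade); `logrank_p` and `alpha` are hypothetical names introduced for illustration, and 0.05 is an assumed significance level.

```r
# Upper-tail chi-square probability; degrees of freedom = number of groups - 1.
logrank_p <- function(chisq, n_groups) {
  pchisq(chisq, df = n_groups - 1, lower.tail = FALSE)
}

# Statistics as printed in the two log-rank tables above (rounded values).
p_horth  <- logrank_p(8.56, n_groups = 2)
p_tgrade <- logrank_p(21.1, n_groups = 3)

alpha <- 0.05  # assumed significance level
print(signif(c(horTh = p_horth, tgrade = p_tgrade), 2))
print(c(horTh = p_horth < alpha, tgrade = p_tgrade < alpha))
```

Both values round to the p-values printed by survdiff (0.003 and 3e-05), and both fall below the assumed 0.05 threshold.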


SEMI-PARAMETRIC SURVIVAL MODELS

Fitting Cox PH Model :-

In this model, all the records and all the variables are considered.

  • Overview of the fitted Cox PH model is as follows.
## Call:
## coxph(formula = attrib, data = df)
## 
##                   coef  exp(coef)   se(coef)      z        p
## horThyes    -0.3462784  0.7073155  0.1290747 -2.683 0.007301
## age         -0.0094592  0.9905854  0.0093006 -1.017 0.309126
## menostatPre -0.2584448  0.7722516  0.1834765 -1.409 0.158954
## tsize        0.0077961  1.0078266  0.0039390  1.979 0.047794
## tgradeII     0.6361117  1.8891211  0.2492025  2.553 0.010693
## tgradeIII    0.7796542  2.1807181  0.2684801  2.904 0.003685
## pnodes       0.0487886  1.0499984  0.0074471  6.551  5.7e-11
## progrec     -0.0022172  0.9977852  0.0005735 -3.866 0.000111
## estrec       0.0001973  1.0001973  0.0004504  0.438 0.661307
## 
## Likelihood ratio test=104.8  on 9 df, p=< 2.2e-16
## n= 686, number of events= 299

The hazard ratios, exp(coef), and their 95% confidence intervals are as follows.

##                           2.5 %    97.5 %
## horThyes    0.7073155 0.5492178 0.9109233
## age         0.9905854 0.9726917 1.0088082
## menostatPre 0.7722516 0.5389933 1.1064563
## tsize       1.0078266 1.0000758 1.0156374
## tgradeII    1.8891211 1.1591463 3.0787991
## tgradeIII   2.1807181 1.2884537 3.6908828
## pnodes      1.0499984 1.0347839 1.0654366
## progrec     0.9977852 0.9966642 0.9989075
## estrec      1.0001973 0.9993148 1.0010806
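
The hazard ratios and confidence limits in the table above follow directly from the printed coefficients and standard errors. The sketch below recomputes three rows as Wald intervals, exp(coef ± z · se), using the values copied from the model output; `hr_table` is a hypothetical name.

```r
# Coefficients and standard errors copied from the Cox model output above.
coefs <- c(horThyes = -0.3462784, pnodes = 0.0487886, progrec = -0.0022172)
ses   <- c(horThyes =  0.1290747, pnodes = 0.0074471, progrec =  0.0005735)

# Normal quantile for a two-sided 95% interval (about 1.96).
z <- qnorm(0.975)

# Hazard ratio and Wald confidence limits on the hazard-ratio scale.
hr_table <- cbind(
  hr    = exp(coefs),
  lower = exp(coefs - z * ses),
  upper = exp(coefs + z * ses)
)
print(round(hr_table, 4))
```

Read this way, the horThyes row suggests that hormonal therapy is associated with roughly a 29% lower hazard (hazard ratio about 0.71) with the other covariates held fixed, and each additional positive node with roughly a 5% higher hazard. Intervals that exclude 1 correspond to the coefficients with p below 0.05 in the model output.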

SURVIVAL TREES

Fitting Conditional inference tree :-

In this model, all the records and all the variables are considered.

  • Overview of the fitted conditional inference tree model is as follows.
## 
##   Conditional inference tree with 4 terminal nodes
## 
## Response:  Surv(time, cens) 
## Inputs:  horTh, age, menostat, tsize, tgrade, pnodes, progrec, estrec 
## Number of observations:  686 
## 
## 1) pnodes <= 3; criterion = 1, statistic = 56.156
##   2) horTh == {yes}; criterion = 0.965, statistic = 9.497
##     3)*  weights = 128 
##   2) horTh == {no}
##     4)*  weights = 248 
## 1) pnodes > 3
##   5) progrec <= 20; criterion = 0.999, statistic = 14.941
##     6)*  weights = 144 
##   5) progrec > 20
##     7)*  weights = 166
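
The printed tree can be read as a set of nested rules: split first on pnodes, then on horTh for patients with at most 3 positive nodes, or on progrec otherwise. The sketch below hard-codes those splits in a hypothetical helper, `assign_node`, and applies it to the six rows shown at the start of the report; it does not refit the tree.

```r
# Route patients to the terminal node numbers printed in the tree above.
# Hypothetical helper written from the printed splits, not a library function.
assign_node <- function(pnodes, horTh, progrec) {
  ifelse(pnodes <= 3,
         ifelse(horTh == "yes", 3L, 4L),   # nodes 3 and 4: split on horTh
         ifelse(progrec <= 20, 6L, 7L))    # nodes 6 and 7: split on progrec
}

# The six rows shown at the start of the report.
pnodes  <- c(3L, 7L, 9L, 4L, 1L, 24L)
horTh   <- c("no", "yes", "yes", "yes", "no", "no")
progrec <- c(48L, 61L, 52L, 60L, 26L, 0L)

nodes <- assign_node(pnodes, horTh, progrec)
print(nodes)

# The terminal node weights printed above add up to the full sample size.
print(sum(c(128, 248, 144, 166)))
```

The tree selects pnodes, horTh and progrec for its splits; all three also have p-values below 0.05 in the Cox model output.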