Introduction :-
In this report, I perform a survival (time-to-event) analysis on a breast cancer data set.
Exploratory Data Analysis :-
Exploratory data analysis is an approach to analyzing data sets that summarizes their main characteristics, often with visual methods.
- Structure of given dataset :-
The given dataset has 686 observations, and each observation has 10 attributes. The first six rows of the dataset are as follows.
## horTh age menostat tsize tgrade pnodes progrec estrec time cens
## 1 no 70 Post 21 II 3 48 66 1814 1
## 2 yes 56 Post 12 II 7 61 77 2018 1
## 3 yes 58 Post 35 II 9 52 271 712 1
## 4 yes 59 Post 17 II 4 60 29 1807 1
## 5 no 73 Post 35 II 1 26 65 772 1
## 6 no 32 Pre 57 III 24 0 13 448 1
Explanation of all the variables :-
horTh: hormonal therapy, a factor at two levels no and yes.
age: age of the patients in years.
menostat: menopausal status, a factor at two levels Pre (premenopausal) and Post (postmenopausal).
tsize: tumor size (in mm).
tgrade: tumor grade, an ordered factor at levels I < II < III.
pnodes: number of positive nodes.
progrec: progesterone receptor (in fmol).
estrec: estrogen receptor (in fmol).
time: recurrence free survival time (in days).
cens: censoring indicator (0: censored, 1: event).
The detailed structure is as follows.
## 'data.frame': 686 obs. of 10 variables:
## $ horTh : Factor w/ 2 levels "no","yes": 1 2 2 2 1 1 2 1 1 1 ...
## $ age : int 70 56 58 59 73 32 59 65 80 66 ...
## $ menostat: Factor w/ 2 levels "Post","Pre": 1 1 1 1 1 2 1 1 1 1 ...
## $ tsize : int 21 12 35 17 35 57 8 16 39 18 ...
## $ tgrade : Factor w/ 3 levels "I","II","III": 2 2 2 2 2 3 2 2 2 2 ...
## $ pnodes : int 3 7 9 4 1 24 2 1 30 7 ...
## $ progrec : int 48 61 52 60 26 0 181 192 0 0 ...
## $ estrec : int 66 77 271 29 65 13 0 25 59 3 ...
## $ time : int 1814 2018 712 1807 772 448 2172 2161 471 2014 ...
## $ cens : int 1 1 1 1 1 1 0 0 1 0 ...
In the input data, the type of each attribute is as follows.
## horTh age menostat tsize tgrade pnodes progrec estrec
## "factor" "integer" "factor" "integer" "factor" "integer" "integer" "integer"
## time cens
## "integer" "integer"
For EDA, I convert cens to a factor. After the conversion, the type of each attribute is as follows.
## horTh age menostat tsize tgrade pnodes progrec estrec
## "factor" "integer" "factor" "integer" "factor" "integer" "integer" "integer"
## time cens
## "integer" "factor"
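The conversion can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the data frame is rebuilt by hand from the six rows printed earlier, and `eda`, `types_before` and `types_after` are names I made up for this sketch. The conversion is done on a copy because, as far as I know, `Surv()` reads a factor event indicator as a multi-state endpoint, so the integer `cens` is kept for the models fitted later.

```r
# Six rows copied from the header shown earlier; the full data set has 686 rows.
df <- data.frame(
  horTh    = factor(c("no", "yes", "yes", "yes", "no", "no")),
  age      = c(70L, 56L, 58L, 59L, 73L, 32L),
  menostat = factor(c("Post", "Post", "Post", "Post", "Post", "Pre")),
  tsize    = c(21L, 12L, 35L, 17L, 35L, 57L),
  tgrade   = factor(c("II", "II", "II", "II", "II", "III"),
                    levels = c("I", "II", "III")),
  pnodes   = c(3L, 7L, 9L, 4L, 1L, 24L),
  progrec  = c(48L, 61L, 52L, 60L, 26L, 0L),
  estrec   = c(66L, 77L, 271L, 29L, 65L, 13L),
  time     = c(1814L, 2018L, 712L, 1807L, 772L, 448L),
  cens     = c(1L, 1L, 1L, 1L, 1L, 1L)
)

# Column types as read in.
types_before <- sapply(df, class)

# 'eda' is a hypothetical name for the copy used only for exploration.
eda <- df
eda$cens <- factor(eda$cens, levels = c(0, 1))

# Column types after the conversion; only cens changes.
types_after <- sapply(eda, class)
print(types_before)
print(types_after)
```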
- Dealing with missing (NA) values :-
The number of missing values in each column is as follows.
## horTh age menostat tsize tgrade pnodes progrec estrec
## 0 0 0 0 0 0 0 0
## time cens
## 0 0
As there are no missing values, we can proceed further.
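A per-column count like this can be obtained with `colSums(is.na(...))`. The sketch below is an assumption about how the check could be done, not necessarily how it was done here. It uses a few columns of the six rows printed earlier, and `with_gap` is a hypothetical copy with one value blanked out to show what a non-zero count looks like.

```r
# A few columns of the six rows shown earlier (illustrative subset only).
df <- data.frame(
  horTh = factor(c("no", "yes", "yes", "yes", "no", "no")),
  age   = c(70L, 56L, 58L, 59L, 73L, 32L),
  time  = c(1814L, 2018L, 712L, 1807L, 772L, 448L),
  cens  = c(1L, 1L, 1L, 1L, 1L, 1L)
)

# Count missing values per column; all zeros means nothing is missing.
na_counts <- colSums(is.na(df))
print(na_counts)

# Hypothetical copy with one age deliberately removed, for comparison.
with_gap <- df
with_gap$age[2] <- NA
gap_counts <- colSums(is.na(with_gap))
print(gap_counts)
```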
- Summary :-
The overall summary of all the attributes is as follows.
## horTh age menostat tsize tgrade
## no :440 Min. :21.00 Post:396 Min. : 3.00 I : 81
## yes:246 1st Qu.:46.00 Pre :290 1st Qu.: 20.00 II :444
## Median :53.00 Median : 25.00 III:161
## Mean :53.05 Mean : 29.33
## 3rd Qu.:61.00 3rd Qu.: 35.00
## Max. :80.00 Max. :120.00
## pnodes progrec estrec time cens
## Min. : 1.00 Min. : 0.0 Min. : 0.00 Min. : 8.0 0:387
## 1st Qu.: 1.00 1st Qu.: 7.0 1st Qu.: 8.00 1st Qu.: 567.8 1:299
## Median : 3.00 Median : 32.5 Median : 36.00 Median :1084.0
## Mean : 5.01 Mean : 110.0 Mean : 96.25 Mean :1124.5
## 3rd Qu.: 7.00 3rd Qu.: 131.8 3rd Qu.: 114.00 3rd Qu.:1684.8
## Max. :51.00 Max. :2380.0 Max. :1144.00 Max. :2659.0
The distribution of all continuous variables is as follows.
The distribution of all continuous variables in each category is as follows.
- Age:-
- Tsize:-
- Pnodes:-
- Progrec:-
- Estrec:-
- Time:-
The correlation between the continuous variables is as follows.
## age tsize pnodes progrec estrec time
## age 1.00000000 -0.04541210 0.03270905 0.08435497 0.32313238 0.05395755
## tsize -0.04541210 1.00000000 0.32766498 -0.02741477 -0.08176636 -0.13837569
## pnodes 0.03270905 0.32766498 1.00000000 -0.07253389 -0.04318344 -0.25675074
## progrec 0.08435497 -0.02741477 -0.07253389 1.00000000 0.39260134 0.10272922
## estrec 0.32313238 -0.08176636 -0.04318344 0.39260134 1.00000000 0.06547710
## time 0.05395755 -0.13837569 -0.25675074 0.10272922 0.06547710 1.00000000
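A matrix of this shape can be produced by keeping only the numeric columns and calling `cor()`, which computes Pearson correlations by default. The sketch below assumes cens has already been converted to a factor, so it drops out of the numeric selection. The data are the six rows printed earlier, so the values will not match the full-data matrix. `eda`, `num_cols` and `cor_mat` are names chosen for this sketch.

```r
# Six rows from the header shown earlier, with cens already stored as a factor.
eda <- data.frame(
  age     = c(70, 56, 58, 59, 73, 32),
  tsize   = c(21, 12, 35, 17, 35, 57),
  pnodes  = c(3, 7, 9, 4, 1, 24),
  progrec = c(48, 61, 52, 60, 26, 0),
  estrec  = c(66, 77, 271, 29, 65, 13),
  time    = c(1814, 2018, 712, 1807, 772, 448),
  cens    = factor(c(1, 1, 1, 1, 1, 1), levels = c(0, 1))
)

# Keep numeric columns only; the factor cens is excluded automatically.
num_cols <- sapply(eda, is.numeric)

# Pearson correlation matrix (the default method of cor()).
cor_mat <- cor(eda[, num_cols])
print(round(cor_mat, 2))
```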
___
Description of EDA :-
In our data set,
NON-PARAMETRIC SURVIVAL MODELS
Fitting Kaplan-Meier Model (without considering categorical variables) :-
In this model, all records are treated as a single group; the categorical variables tgrade (I / II / III) and horTh (yes / no) are not considered.
- Overview of the fitted Kaplan-Meier model is as follows.
## Call: survfit(formula = attrib, data = df, type = "kaplan-meier")
##
## n events median 0.95LCL 0.95UCL
## 686 299 1807 1587 2030
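To make the estimator concrete, the sketch below fits a Kaplan-Meier curve with `survfit()` from the survival package and recomputes it by hand with the product-limit formula, S(t) = product over event times t_i <= t of (1 - d_i / n_i). It assumes that `attrib` in the call above was the formula `Surv(time, cens) ~ 1`, which the output does not show. It uses only the six rows printed earlier (all of them events), so its numbers are not those of the full fit. `toy` and `manual` are names made up for this sketch.

```r
library(survival)

# Times and event indicators of the six rows shown earlier (toy subset).
toy <- data.frame(
  time = c(1814, 2018, 712, 1807, 772, 448),
  cens = c(1, 1, 1, 1, 1, 1)
)

# Assumed form of the formula object used in the report.
attrib <- Surv(time, cens) ~ 1
km_model <- survfit(attrib, data = toy)  # Kaplan-Meier is the default estimator

# Product-limit estimate by hand.
event_times <- sort(unique(toy$time[toy$cens == 1]))
at_risk <- sapply(event_times, function(t) sum(toy$time >= t))                 # n_i
events  <- sapply(event_times, function(t) sum(toy$time == t & toy$cens == 1)) # d_i
manual  <- cumprod(1 - events / at_risk)

print(data.frame(time = event_times, at_risk, events, survival = manual))
```

On the full data, `print(km_model)` would give the n, events and median summary, and `summary(km_model)` the survival table.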
- Survival Table :-
[Table: output of summary(km_model)]
- Kaplan-Meier curve :-
Fitting Kaplan-Meier Model (considering the horTh categorical variable) :-
In this model, the records are split into two groups based on each record's horTh value.
The overview of the fitted Kaplan-Meier model is as follows.
## Call: survfit(formula = attrib, data = df, type = "kaplan-meier")
##
## n events median 0.95LCL 0.95UCL
## horTh=no 440 205 1528 1296 1814
## horTh=yes 246 94 2018 1918 NA
- Survival Table :-
[Tables: km_model_horth_1 and km_model_horth_2, one per horTh group]
- Kaplan-Meier curve :-
Comparing two KM-Curves :-
The log-rank test is a hypothesis test for comparing the survival distributions of two samples. It is a non-parametric test, so it is well suited to the Kaplan-Meier estimator (a non-parametric model).
- Null Hypothesis : Survival in the two groups is the same.
- Alternative Hypothesis : Survival in the two groups is not the same.
## Call:
## survdiff(formula = attrib, data = df)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## horTh=no 440 205 180 3.37 8.56
## horTh=yes 246 94 119 5.12 8.56
##
## Chisq= 8.6 on 1 degrees of freedom, p= 0.003
The p-value (0.003) is far below the conventional 0.05 level.
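The statistic in the last column can be reproduced by hand. At each event time, the expected number of events in one group is d * n1 / n, with hypergeometric variance d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1); the chi-square statistic is (O - E)^2 / V summed over event times. The sketch below checks this against `survdiff()` on the six rows printed earlier, so its numbers are not those of the full-data test. `toy`, `pieces` and `manual_chisq` are names chosen for this sketch.

```r
library(survival)

# horTh, time and cens of the six rows shown earlier (toy subset).
toy <- data.frame(
  horTh = factor(c("no", "yes", "yes", "yes", "no", "no")),
  time  = c(1814, 2018, 712, 1807, 772, 448),
  cens  = c(1, 1, 1, 1, 1, 1)
)

# Log-rank test as computed by the survival package.
lr <- survdiff(Surv(time, cens) ~ horTh, data = toy)

# Same test by hand, tracking the horTh = "no" group.
event_times <- sort(unique(toy$time[toy$cens == 1]))
pieces <- sapply(event_times, function(t) {
  n  <- sum(toy$time >= t)                                        # at risk overall
  n1 <- sum(toy$time >= t & toy$horTh == "no")                    # at risk in group
  d  <- sum(toy$time == t & toy$cens == 1)                        # events overall
  d1 <- sum(toy$time == t & toy$cens == 1 & toy$horTh == "no")    # events in group
  # The variance is zero when only one subject is left at risk.
  v  <- if (n > 1) d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1) else 0
  c(observed = d1, expected = d * n1 / n, variance = v)
})
totals <- rowSums(pieces)
manual_chisq <- unname((totals["observed"] - totals["expected"])^2 / totals["variance"])

# One degree of freedom for two groups.
p_value <- pchisq(manual_chisq, df = 1, lower.tail = FALSE)
print(c(chisq = manual_chisq, p = p_value))
```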
Fitting Kaplan-Meier Model (considering the tgrade categorical variable) :-
In this model, the records are split into three groups based on each record's tgrade value.
The overview of the fitted Kaplan-Meier model is as follows.
## Call: survfit(formula = attrib, data = df, type = "kaplan-meier")
##
## n events median 0.95LCL 0.95UCL
## tgrade=I 81 18 NA 1990 NA
## tgrade=II 444 202 1730 1493 2030
## tgrade=III 161 79 1337 960 NA
- Survival Table :-
[Tables: km_model_tgrade_1, km_model_tgrade_2 and km_model_tgrade_3, one per tgrade group]
- Kaplan-Meier curve :-
Comparing three KM-Curves :-
The log-rank test also applies to more than two samples. Here it compares the survival distributions of the three tgrade groups.
- Null Hypothesis : Survival is the same in all three groups.
- Alternative Hypothesis : Survival differs in at least one group.
## Call:
## survdiff(formula = attrib, data = df)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## tgrade=I 81 18 42.2 13.8469 16.159
## tgrade=II 444 202 198.2 0.0725 0.215
## tgrade=III 161 79 58.6 7.0788 8.848
##
## Chisq= 21.1 on 2 degrees of freedom, p= 3e-05
The p-value (3e-05) is far below the conventional 0.05 level.
Conclusion of Non-parametric models :-
As both p-values are very small, we can reject the null hypothesis in each test. The horTh groups and the tgrade groups are statistically significantly different, so survival differs between the groups of each variable.
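As a quick check, both reported p-values can be recovered from the chi-square statistics in the two log-rank outputs, using degrees of freedom equal to the number of groups minus one. The sketch uses only numbers printed in this report; the variable names are my own.

```r
# Chi-square statistics taken from the (O-E)^2/V column and the Chisq line.
chisq_horTh  <- 8.56   # two groups   -> 1 degree of freedom
chisq_tgrade <- 21.1   # three groups -> 2 degrees of freedom

# Upper-tail probabilities of the chi-square distribution.
p_horTh  <- pchisq(chisq_horTh,  df = 2 - 1, lower.tail = FALSE)
p_tgrade <- pchisq(chisq_tgrade, df = 3 - 1, lower.tail = FALSE)

print(round(p_horTh, 3))     # compare with the reported p = 0.003
print(signif(p_tgrade, 1))   # compare with the reported p = 3e-05
```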
SEMI-PARAMETRIC SURVIVAL MODELS
Fitting Cox PH Model :-
In this model, all the records and all the variables are considered.
- Overview of the fitted Cox PH model is as follows.
## Call:
## coxph(formula = attrib, data = df)
##
## coef exp(coef) se(coef) z p
## horThyes -0.3462784 0.7073155 0.1290747 -2.683 0.007301
## age -0.0094592 0.9905854 0.0093006 -1.017 0.309126
## menostatPre -0.2584448 0.7722516 0.1834765 -1.409 0.158954
## tsize 0.0077961 1.0078266 0.0039390 1.979 0.047794
## tgradeII 0.6361117 1.8891211 0.2492025 2.553 0.010693
## tgradeIII 0.7796542 2.1807181 0.2684801 2.904 0.003685
## pnodes 0.0487886 1.0499984 0.0074471 6.551 5.7e-11
## progrec -0.0022172 0.9977852 0.0005735 -3.866 0.000111
## estrec 0.0001973 1.0001973 0.0004504 0.438 0.661307
##
## Likelihood ratio test=104.8 on 9 df, p=< 2.2e-16
## n= 686, number of events= 299
## exp(coef) 2.5 % 97.5 %
## horThyes 0.7073155 0.5492178 0.9109233
## age 0.9905854 0.9726917 1.0088082
## menostatPre 0.7722516 0.5389933 1.1064563
## tsize 1.0078266 1.0000758 1.0156374
## tgradeII 1.8891211 1.1591463 3.0787991
## tgradeIII 2.1807181 1.2884537 3.6908828
## pnodes 1.0499984 1.0347839 1.0654366
## progrec 0.9977852 0.9966642 0.9989075
## estrec 1.0001973 0.9993148 1.0010806
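The two tables are linked by simple arithmetic: the hazard ratio is exp(coef), its 95% Wald interval is exp(coef ± z * se) with z = qnorm(0.975), and the z and p columns come from coef / se. The sketch below redoes this for the horThyes row using only the printed coefficient and standard error; the variable names are my own. The result reads as roughly a 29% lower hazard of recurrence with hormonal therapy, with the other covariates held fixed.

```r
# Coefficient and standard error of horThyes, copied from the Cox output.
coef_horTh <- -0.3462784
se_horTh   <-  0.1290747

# Hazard ratio and 95% Wald confidence interval.
z_crit <- qnorm(0.975)
hr     <- exp(coef_horTh)
ci     <- exp(coef_horTh + c(-1, 1) * z_crit * se_horTh)

# Wald z statistic and two-sided p-value.
wald_z  <- coef_horTh / se_horTh
p_value <- 2 * pnorm(-abs(wald_z))

print(round(c(hr = hr, lower = ci[1], upper = ci[2]), 4))
print(round(c(z = wald_z, p = p_value), 4))
```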
SURVIVAL TREES
Fitting Conditional inference tree :-
In this model, all the records and all the variables are considered.
- Overview of the fitted conditional inference tree model is as follows.
##
## Conditional inference tree with 4 terminal nodes
##
## Response: Surv(time, cens)
## Inputs: horTh, age, menostat, tsize, tgrade, pnodes, progrec, estrec
## Number of observations: 686
##
## 1) pnodes <= 3; criterion = 1, statistic = 56.156
## 2) horTh == {yes}; criterion = 0.965, statistic = 9.497
## 3)* weights = 128
## 2) horTh == {no}
## 4)* weights = 248
## 1) pnodes > 3
## 5) progrec <= 20; criterion = 0.999, statistic = 14.941
## 6)* weights = 144
## 5) progrec > 20
## 7)* weights = 166
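The printed tree can be read as a set of nested rules: split on pnodes first, then on horTh (for pnodes <= 3) or on progrec (for pnodes > 3). The sketch below encodes those rules in a small base R function and applies it to the six rows shown at the start of the report. `assign_node` is a hypothetical helper written for this illustration, not part of the tree-fitting package, and it does not refit the tree.

```r
# Hypothetical helper: return the terminal node number for each patient,
# following the splits printed in the tree output.
assign_node <- function(pnodes, horTh, progrec) {
  ifelse(pnodes <= 3,
         ifelse(horTh == "yes", 3L, 4L),    # nodes 3 and 4
         ifelse(progrec <= 20, 6L, 7L))     # nodes 6 and 7
}

# The three splitting variables for the six rows shown earlier.
head_rows <- data.frame(
  pnodes  = c(3, 7, 9, 4, 1, 24),
  horTh   = c("no", "yes", "yes", "yes", "no", "no"),
  progrec = c(48, 61, 52, 60, 26, 0)
)
head_rows$node <- with(head_rows, assign_node(pnodes, horTh, progrec))
print(head_rows)

# Terminal node sizes from the output; they should add up to all observations.
node_sizes <- c("3" = 128, "4" = 248, "6" = 144, "7" = 166)
print(sum(node_sizes))
```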