Introduction :-
In this report, I perform a survival (time-to-event) analysis on the rats data set.
Exploratory Data Analysis :-
Exploratory data analysis is an approach to analyzing data sets that summarizes their main characteristics, often with visual methods.
- Structure of given dataset :-
The given dataset has 300 rats, and each rat has 5 attributes. The first six rows of the dataset are as follows.
## litter rx time status sex
## 1 1 1 101 0 f
## 2 1 0 49 1 f
## 3 1 0 104 0 f
## 4 2 1 91 0 m
## 5 2 0 104 0 m
## 6 2 0 102 0 m
Explanation of all the variables :-
• litter : litter number from 1 to 100, numeric
• rx : treatment, (1=drug, 0=control), factor
• time : time to tumor or last follow-up, numeric
• status : event status, 1=tumor and 0=censored, numeric
• sex : male or female, factor
The detailed structure is as follows.
## 'data.frame': 300 obs. of 5 variables:
## $ litter: int 1 1 1 2 2 2 3 3 3 4 ...
## $ rx : int 1 0 0 1 0 0 1 0 0 1 ...
## $ time : int 101 49 104 91 104 102 104 102 104 91 ...
## $ status: int 0 1 0 0 0 0 0 0 0 0 ...
## $ sex : Factor w/ 2 levels "f","m": 1 1 1 2 2 2 1 1 1 2 ...
In the input, the type of each attribute is as follows.
## litter rx time status sex
## "integer" "integer" "integer" "integer" "factor"
The type of rx is not correct, so I am converting it to a factor. For EDA, I am also converting status to a factor. After these changes, the type of each attribute is as follows.
## litter rx time status sex
## "integer" "factor" "integer" "factor" "factor"
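The type conversion described above can be sketched as follows. This is a minimal sketch, assuming the data come from `survival::rats` (the report does not say how `df` was loaded); `df_eda` is a hypothetical name introduced here so that the numeric 0/1 `status` stays available for the survival fits later on.

```r
library(survival)  # assumption: the rats data set shipped with this package is the one used

df <- survival::rats

# rx is a treatment indicator (1 = drug, 0 = control), so store it as a factor
df$rx <- factor(df$rx)

# For EDA only: a factor copy of status makes summary() report counts per level.
# df itself keeps the numeric 0/1 coding that Surv() expects.
df_eda <- df                      # hypothetical helper copy
df_eda$status <- factor(df_eda$status)

# Type of each attribute after the conversion
print(sapply(df_eda, class))
```

Keeping a separate copy for EDA avoids having to convert `status` back to numeric before model fitting.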
- Dealing with missing (NA) values :-
The number of missing values in each column is as follows.
## litter rx time status sex
## 0 0 0 0 0
As there are no missing values, we can proceed further.
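A per-column count like the one above can be produced in one line. The sketch below assumes the data are loaded from `survival::rats`; `na_per_column` is a name chosen for illustration.

```r
library(survival)

df <- survival::rats  # assumption: same data as in the report

# is.na() marks missing cells; colSums() counts them per column
na_per_column <- colSums(is.na(df))
print(na_per_column)

# Continue only when every column is complete
if (any(na_per_column > 0)) {
  stop("Missing values found; handle them before fitting survival models.")
}
```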
- Summary :-
The overall summary of all the attributes is as follows.
## litter rx time status sex
## Min. : 1.00 0:200 Min. : 23.00 0:258 f:150
## 1st Qu.: 25.75 1:100 1st Qu.: 80.75 1: 42 m:150
## Median : 50.50 Median : 98.00
## Mean : 50.50 Mean : 90.44
## 3rd Qu.: 75.25 3rd Qu.:104.00
## Max. :100.00 Max. :104.00
The distribution of all continuous variables is as follows.
The distribution of all continuous variables in each category is as follows.
- Litter:-
- Time:-
The correlation between the continuous variables is as follows.
## litter time
## litter 1.00000000 -0.04241067
## time -0.04241067 1.00000000
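A correlation matrix of this shape can be obtained with `cor()`. This is a sketch under the assumption that the data come from `survival::rats` and that the default Pearson correlation was used; the report does not state the method.

```r
library(survival)

df <- survival::rats  # assumption: same data as in the report

# Restrict to the two numeric attributes treated as continuous in the report
num_cols <- c("litter", "time")

# cor() uses the Pearson correlation unless another method is requested
cor_mat <- cor(df[, num_cols])
print(cor_mat)
```

A value close to 0 between litter and time suggests that the litter number carries almost no linear information about the follow-up time.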
___
Description of EDA :-
In our data set,
There are 300 rats & 5 attributes for each rat.
rx has little effect on time of event.
NON-PARAMETRIC SURVIVAL MODELS
Fitting Kaplan-Meier Model (without considering categorical variables) :-
In this model, all records are treated as a single group; the categorical variables rx (0 / 1) and sex (f / m) are not considered.
- Overview of the fitted kaplan-meier model is as follows.
## Call: survfit(formula = attrib, data = df, type = "kaplan-meier")
##
## n events median 0.95LCL 0.95UCL
## 300 42 NA NA NA
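The call shown in the output uses a stored formula named `attrib`, whose definition is not shown in the report. A minimal sketch of an equivalent fit, assuming `attrib` is `Surv(time, status) ~ 1` and that `df` is `survival::rats`:

```r
library(survival)

df <- survival::rats  # assumption: same data as in the report

# Assumed reconstruction of the report's formula object: no grouping variable
attrib <- Surv(time, status) ~ 1

km_all <- survfit(attrib, data = df, type = "kaplan-meier")

print(km_all)           # n, events, median survival and its 95% CI
print(summary(km_all))  # survival table at each event time
```

Replacing `~ 1` with `~ sex` or `~ rx` gives one curve per group, which is how the grouped fits later in the report would be obtained under the same assumption.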
- Survival Table :-
## Call: survfit(formula = attrib, data = df, type = "kaplan-meier")
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 34 298 1 0.997 0.00335 0.990 1.000
## 39 297 1 0.993 0.00473 0.984 1.000
## 40 295 1 0.990 0.00579 0.979 1.000
## 45 294 1 0.987 0.00668 0.974 1.000
## 49 292 1 0.983 0.00746 0.969 0.998
## 50 290 1 0.980 0.00817 0.964 0.996
## 54 285 1 0.976 0.00883 0.959 0.994
## 55 282 1 0.973 0.00946 0.955 0.992
## 64 274 1 0.969 0.01007 0.950 0.989
## 66 271 1 0.966 0.01065 0.945 0.987
## 67 270 1 0.962 0.01119 0.940 0.984
## 68 267 1 0.959 0.01172 0.936 0.982
## 70 263 1 0.955 0.01222 0.931 0.979
## 71 261 1 0.951 0.01271 0.927 0.977
## 72 259 1 0.948 0.01318 0.922 0.974
## 73 257 2 0.940 0.01408 0.913 0.968
## 75 251 1 0.936 0.01451 0.908 0.965
## 77 245 1 0.933 0.01494 0.904 0.962
## 78 238 1 0.929 0.01539 0.899 0.959
## 79 235 1 0.925 0.01582 0.894 0.956
## 80 230 2 0.917 0.01667 0.885 0.950
## 81 225 2 0.909 0.01749 0.875 0.944
## 84 215 2 0.900 0.01832 0.865 0.937
## 86 209 1 0.896 0.01873 0.860 0.933
## 88 202 1 0.891 0.01916 0.855 0.930
## 89 198 2 0.882 0.02000 0.844 0.922
## 92 176 1 0.877 0.02050 0.838 0.919
## 94 169 1 0.872 0.02103 0.832 0.914
## 96 158 2 0.861 0.02216 0.819 0.906
## 101 142 1 0.855 0.02282 0.812 0.901
## 102 139 2 0.843 0.02409 0.797 0.891
## 103 113 3 0.820 0.02669 0.770 0.874
## 104 108 1 0.813 0.02751 0.761 0.869
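The first rows of the table can be checked by hand with the Kaplan-Meier product-limit formula. In the sketch below, $d_i$ is the n.event column and $n_i$ is the n.risk column at event time $t_i$; the symbols are notation introduced here, not taken from the report.

```latex
\hat{S}(t) = \prod_{t_i \le t} \left( 1 - \frac{d_i}{n_i} \right)

\hat{S}(34) = 1 - \frac{1}{298} \approx 0.997

\hat{S}(39) = \hat{S}(34) \left( 1 - \frac{1}{297} \right) \approx 0.997 \times 0.9966 \approx 0.993
```

Both values agree with the survival column at times 34 and 39.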
- kaplan-meier curve :-
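The estimated survival probability declines gradually and is still about 0.81 at the end of follow-up (time 104). Because it never drops below 0.5, the median survival time is not reached, which is why the overview reports NA.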
Fitting Kaplan-Meier Model (considering sex category variable) :-
In this model, the records are split into two categories based on the sex value of each record.
The overview of the fitted kaplan-meier model is as follows.
## Call: survfit(formula = attrib, data = df, type = "kaplan-meier")
##
## n events median 0.95LCL 0.95UCL
## sex=f 150 40 NA NA NA
## sex=m 150 2 NA NA NA
- Survival Table :-
## Call: survfit(formula = attrib, data = df, type = "kaplan-meier")
##
## sex=f
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 34 150 1 0.993 0.00664 0.980 1.000
## 39 149 1 0.987 0.00937 0.968 1.000
## 40 148 1 0.980 0.01143 0.958 1.000
## 45 147 1 0.973 0.01315 0.948 0.999
## 49 145 1 0.967 0.01468 0.938 0.996
## 50 143 1 0.960 0.01606 0.929 0.992
## 54 142 1 0.953 0.01731 0.920 0.988
## 55 141 1 0.946 0.01846 0.911 0.983
## 64 138 1 0.939 0.01956 0.902 0.979
## 66 137 1 0.933 0.02058 0.893 0.974
## 67 136 1 0.926 0.02154 0.884 0.969
## 68 135 1 0.919 0.02245 0.876 0.964
## 70 132 1 0.912 0.02333 0.867 0.959
## 72 130 1 0.905 0.02418 0.859 0.954
## 73 128 2 0.891 0.02579 0.842 0.943
## 77 119 1 0.883 0.02664 0.833 0.937
## 78 114 1 0.876 0.02751 0.823 0.931
## 79 112 1 0.868 0.02835 0.814 0.925
## 80 108 2 0.852 0.03002 0.795 0.913
## 81 105 2 0.835 0.03156 0.776 0.900
## 84 99 2 0.819 0.03310 0.756 0.886
## 86 96 1 0.810 0.03384 0.746 0.879
## 88 93 1 0.801 0.03458 0.736 0.872
## 89 91 2 0.784 0.03599 0.716 0.858
## 92 82 1 0.774 0.03680 0.705 0.850
## 94 79 1 0.764 0.03761 0.694 0.842
## 96 74 2 0.744 0.03933 0.670 0.825
## 101 69 1 0.733 0.04021 0.658 0.816
## 102 67 2 0.711 0.04188 0.634 0.798
## 103 64 3 0.678 0.04412 0.597 0.770
## 104 60 1 0.666 0.04481 0.584 0.760
##
## sex=m
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 71 131 1 0.992 0.0076 0.978 1
## 75 129 1 0.985 0.0108 0.964 1
- kaplan-meier curve :-
Comparing two KM-Curves :-
The log-rank test is a hypothesis test that compares the survival distributions of two samples. It is a nonparametric test, so it is well suited to the Kaplan-Meier estimator (a non-parametric model).
- Null Hypothesis : Survival in the two groups is the same.
- Alternative Hypothesis : Survival in the two groups is not the same.
## Call:
## survdiff(formula = attrib, data = df)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=f 150 40 20.6 18.1 35.9
## sex=m 150 2 21.4 17.5 35.9
##
## Chisq= 35.9 on 1 degrees of freedom, p= 2e-09
The p-value (2e-09) is far below 0.05, so the null hypothesis of equal survival for females and males is rejected.
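The log-rank comparison above can be sketched with `survdiff()`. The formula is an assumed reconstruction of the report's `attrib` object, and `lr_sex`, `n_groups`, and `p_sex` are names introduced for illustration.

```r
library(survival)

df <- survival::rats  # assumption: same data as in the report

# Log-rank test: compares observed and expected event counts per group
lr_sex <- survdiff(Surv(time, status) ~ sex, data = df)
print(lr_sex)

# The test statistic is chi-square distributed with (groups - 1) degrees of freedom
n_groups <- length(lr_sex$n)
p_sex <- pchisq(lr_sex$chisq, df = n_groups - 1, lower.tail = FALSE)
print(p_sex)
```

Swapping `sex` for `rx` in the formula gives the corresponding test for the treatment groups.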
Fitting Kaplan-Meier Model (considering rx category variable) :-
In this model, the records are split into two categories based on the rx value of each record.
The overview of the fitted kaplan-meier model is as follows.
## Call: survfit(formula = attrib, data = df, type = "kaplan-meier")
##
## n events median 0.95LCL 0.95UCL
## rx=0 200 21 NA NA NA
## rx=1 100 21 NA NA NA
- Survival Table :-
## Call: survfit(formula = attrib, data = df, type = "kaplan-meier")
##
## rx=0
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 40 198 1 0.995 0.00504 0.985 1.000
## 49 196 1 0.990 0.00712 0.976 1.000
## 50 195 1 0.985 0.00871 0.968 1.000
## 54 191 1 0.980 0.01008 0.960 1.000
## 55 188 1 0.974 0.01129 0.953 0.997
## 64 184 1 0.969 0.01241 0.945 0.994
## 66 182 1 0.964 0.01343 0.938 0.991
## 68 181 1 0.958 0.01438 0.931 0.987
## 71 176 1 0.953 0.01529 0.924 0.983
## 73 173 1 0.948 0.01617 0.916 0.980
## 75 168 1 0.942 0.01702 0.909 0.976
## 77 164 1 0.936 0.01786 0.902 0.972
## 78 158 1 0.930 0.01871 0.894 0.968
## 79 156 1 0.924 0.01951 0.887 0.963
## 81 149 2 0.912 0.02113 0.871 0.954
## 84 142 2 0.899 0.02270 0.856 0.945
## 96 111 1 0.891 0.02390 0.845 0.939
## 101 98 1 0.882 0.02533 0.834 0.933
## 102 96 1 0.873 0.02668 0.822 0.927
##
## rx=1
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 34 99 1 0.990 0.0100 0.970 1.000
## 39 98 1 0.980 0.0141 0.952 1.000
## 45 97 1 0.970 0.0172 0.937 1.000
## 67 89 1 0.959 0.0202 0.920 0.999
## 70 86 1 0.948 0.0228 0.904 0.993
## 72 85 1 0.937 0.0251 0.889 0.987
## 73 84 1 0.925 0.0272 0.874 0.980
## 80 78 2 0.902 0.0312 0.842 0.965
## 86 72 1 0.889 0.0332 0.826 0.957
## 88 67 1 0.876 0.0353 0.809 0.948
## 89 64 2 0.848 0.0391 0.775 0.929
## 92 54 1 0.833 0.0414 0.755 0.918
## 94 50 1 0.816 0.0438 0.735 0.907
## 96 47 1 0.799 0.0462 0.713 0.895
## 102 43 1 0.780 0.0487 0.690 0.882
## 103 41 3 0.723 0.0552 0.623 0.840
## 104 38 1 0.704 0.0569 0.601 0.825
- kaplan-meier curve :-
No striking pattern stands out in this graph.
Comparing two KM-Curves :-
The log-rank test is a hypothesis test that compares the survival distributions of two samples. It is a nonparametric test, so it is well suited to the Kaplan-Meier estimator (a non-parametric model).
- Null Hypothesis : Survival in the two groups is the same.
- Alternative Hypothesis : Survival in the two groups is not the same.
## Call:
## survdiff(formula = attrib, data = df)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## rx=0 200 21 28.2 1.82 5.55
## rx=1 100 21 13.8 3.71 5.55
##
## Chisq= 5.5 on 1 degrees of freedom, p= 0.02
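The p-value (0.02) is below 0.05, so the null hypothesis of equal survival in the two rx groups is rejected at the 5% level.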
Conclusion of Non-parametric models :-
As both p-values are below 0.05, we can reject the null hypothesis in each test. The groups of rats defined by sex and by rx are statistically different, so survival differs between the groups.
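**The survival tables point the same way: by the end of follow-up, estimated survival is about 0.67 for females versus 0.985 for males, and about 0.87 for rx=0 versus 0.70 for rx=1.**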
SEMI-PARAMETRIC SURVIVAL MODELS
Fitting cox PH Model :-
In this model, all the records and all the variables are considered.
- Overview of the fitted cox PH model is as follows.
## Call:
## coxph(formula = attrib, data = df)
##
## coef exp(coef) se(coef) z p
## litter 0.008465 1.008501 0.005344 1.584 0.11315
## rx1 0.805296 2.237359 0.309431 2.603 0.00925
## sexm -3.085125 0.045724 0.724932 -4.256 2.08e-05
##
## Likelihood ratio test=52.58 on 3 df, p=2.252e-11
## n= 300, number of events= 42
## 2.5 % 97.5 %
## litter 1.00850125 0.99799405 1.0191191
## rx1 2.23735897 1.21996509 4.1032118
## sexm 0.04572433 0.01104292 0.1893262
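A Cox proportional hazards fit with hazard ratios and confidence intervals can be sketched as follows. The formula is an assumed reconstruction of the report's `attrib` object (litter, rx, and sex as covariates, matching the coefficient names in the output), and `cox_fit` and `hr_table` are illustrative names.

```r
library(survival)

df <- survival::rats   # assumption: same data as in the report
df$rx <- factor(df$rx) # factor coding yields the coefficient name "rx1"

# Semi-parametric model: baseline hazard left unspecified, covariates act multiplicatively
cox_fit <- coxph(Surv(time, status) ~ litter + rx + sex, data = df)
print(cox_fit)

# Hazard ratios are exp(coef); exponentiating the coefficient CI gives the HR interval
hr_table <- cbind(HR = exp(coef(cox_fit)), exp(confint(cox_fit)))
print(hr_table)
```

A hazard ratio above 1 means a higher tumor hazard than the reference level, and a ratio below 1 means a lower hazard.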
Conclusion of Semi-parametric models :-
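The hazard ratio for rx1 is 2.24 (95% CI 1.22 to 4.10, p = 0.009), so the treated group (rx = 1) has roughly 2.2 times the tumor hazard of the control group (rx = 0).
The hazard ratio for sexm is 0.046 (95% CI 0.011 to 0.189), so males have roughly 95% lower tumor hazard than females (sex = f). The hazard ratio for litter is close to 1 and is not significant (p = 0.11).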
SURVIVAL TREES
Fitting Conditional inference tree :-
In this model, all the records and all the variables are considered.
- Overview of the fitted Conditional inference tree model is as follows.
##
## Conditional inference tree with 3 terminal nodes
##
## Response: Surv(time, status)
## Inputs: litter, rx, sex
## Number of observations: 300
##
## 1) sex == {m}; criterion = 1, statistic = 35.839
## 2)* weights = 150
## 1) sex == {f}
## 3) rx == {0}; criterion = 0.991, statistic = 8.707
## 4)* weights = 100
## 3) rx == {1}
## 5)* weights = 50
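The report does not show the call that produced this tree. The printed layout resembles the output of `ctree()` from the `party` package, so the sketch below assumes that package, the `survival::rats` data, and a formula with all three inputs; `tree_fit` is an illustrative name.

```r
library(survival)
library(party)  # assumption: inferred from the printed format, not stated in the report

df <- survival::rats      # assumption: same data as in the report
df$rx  <- factor(df$rx)   # categorical inputs so splits read as rx == {0} / {1}
df$sex <- factor(df$sex)

# Conditional inference tree on a censored response
tree_fit <- ctree(Surv(time, status) ~ litter + rx + sex, data = df)
print(tree_fit)
```

In the output above, the tree splits first on sex and then, among females only, on rx, giving terminal nodes of 150 males, 100 untreated females, and 50 treated females. litter is not used for any split.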