Introduction :-
In this report, I perform a survival (time-to-event) analysis on the leukemia data set.
Exploratory Data Analysis :-
Exploratory data analysis (EDA) is an approach to analyzing data sets that summarizes their main characteristics, often with visual methods.
- Structure of given dataset :-
The given dataset has 23 observations, each with 3 attributes. The first six rows of the dataset are as follows.
##   time status          x
## 1    9      1 Maintained
## 2   13      1 Maintained
## 3   13      0 Maintained
## 4   18      1 Maintained
## 5   23      1 Maintained
## 6   28      0 Maintained
The detailed structure is as follows.
## 'data.frame': 23 obs. of 3 variables:
## $ time : num 9 13 13 18 23 28 31 34 45 48 ...
## $ status: num 1 1 0 1 1 0 1 1 0 1 ...
## $ x : Factor w/ 2 levels "Maintained","Nonmaintained": 1 1 1 1 1 1 1 1 1 1 ...
In the input data, the type of each attribute is as follows.
## time status x
## "numeric" "numeric" "factor"
For the survival analysis, the variable time should be numeric, the variable status should be numeric, and the variable x should be a factor.
For EDA, I convert status to a factor. After this conversion, the type of each attribute is as follows.
## time status x
## "numeric" "factor" "factor"
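The inspection and type conversion described above can be sketched in R as follows. This is a minimal sketch, not the report's original code: it assumes `df` corresponds to the `aml` leukemia data shipped with the `survival` package (the report does not name its source), and `df_eda` is a hypothetical name for an EDA-only copy, so that `status` stays numeric for the survival models.

```r
library(survival)

# Assumption: the report's `df` is the `aml` data from the survival package.
df <- aml

head(df)            # first six rows
str(df)             # dimensions and column types
sapply(df, class)   # class of each column

# Convert status to a factor on a copy used only for EDA;
# Surv() later needs the numeric 0/1 coding in `df`.
df_eda <- df
df_eda$status <- factor(df_eda$status, levels = c(0, 1))
sapply(df_eda, class)
```

Keeping a separate copy avoids having to convert `status` back to numeric before fitting the models.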
- Dealing with missing values :-
The number of missing (NA) values in each column is as follows.
## time status x
## 0 0 0
As there are no missing values, we can proceed further.
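A per-column count like the one above can be obtained with `is.na()`. The sketch below again assumes that `df` is the `aml` data from the `survival` package.

```r
library(survival)

# Assumption: the report's `df` is the `aml` data from the survival package.
df <- aml

# Count missing values per column
na_counts <- colSums(is.na(df))
na_counts

# TRUE when no row contains a missing value
all(complete.cases(df))
```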
- Summary :-
The overall summary of all the attributes is as follows.
##       time           status          x
##  Min.   :  5.00     0: 5     Maintained   :11
##  1st Qu.: 12.50     1:18     Nonmaintained:12
##  Median : 23.00
##  Mean   : 29.48
##  3rd Qu.: 33.50
##  Max.   :161.00
The distribution of all continuous variables is as follows.
The distribution of all continuous variables in each category is as follows.
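Plots of this kind could be produced with base R graphics. This is a sketch under the assumption that `df` is the `aml` data from the `survival` package; the plot types, titles, and bin count are illustrative choices, not necessarily those used for the original figures.

```r
library(survival)

# Assumption: the report's `df` is the `aml` data from the survival package.
df <- aml

# Overall distribution of the only continuous variable, time
hist(df$time, breaks = 10,
     main = "Distribution of time", xlab = "time")

# Distribution of time within each level of x
boxplot(time ~ x, data = df,
        main = "time by maintenance group", ylab = "time")

# Numeric summary of time per group
time_by_group <- tapply(df$time, df$x, summary)
time_by_group
```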
- Removing Outliers :-
I treat the record with time = 161 (a censored observation in the Maintained group) as an outlier and remove it from the dataset, which leaves 22 observations.
The updated distribution of all continuous variables is as follows.
The updated distribution of all continuous variables in each category is as follows.
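The outlier removal can be sketched as below, assuming `df` is the `aml` data from the `survival` package. The `boxplot.stats()` call is one common way to flag points lying more than 1.5 times the interquartile range beyond the hinges; it is shown only as a supporting check and is not necessarily the rule used in this report.

```r
library(survival)

# Assumption: the report's `df` is the `aml` data from the survival package.
df <- aml

# Values flagged by the 1.5 * IQR boxplot rule (illustrative check only)
flagged <- boxplot.stats(df$time)$out
flagged

# Drop the record with time = 161
df <- df[df$time != 161, ]
nrow(df)
range(df$time)

# Updated distributions
hist(df$time, breaks = 10,
     main = "Distribution of time (outlier removed)", xlab = "time")
boxplot(time ~ x, data = df,
        main = "time by maintenance group (outlier removed)", ylab = "time")
```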
Description :-
After removing the outlier, our data set contains the following variables:
- time is a continuous variable ranging from 5 to 48.
- status is a categorical variable with two levels:
  - 0 : censored records
  - 1 : records where the event occurred
- x is a categorical variable with two levels:
  - Maintained
  - Nonmaintained
NON-PARAMETRIC SURVIVAL MODELS
Fitting Kaplan-Meier Model (without considering category of records) :-
In this model, all records are treated alike; the categorical variable x (Maintained / Nonmaintained) is not used.
- Overview of the fitted Kaplan-Meier model is as follows.
## Call: survfit(formula = Surv(time, status) ~ 1, data = df, type = "kaplan-meier")
##
## n events median 0.95LCL 0.95UCL
## 22 18 27 18 43
- Survival Table :-
## Call: survfit(formula = Surv(time, status) ~ 1, data = df, type = "kaplan-meier")
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 5 22 2 0.909 0.0613 0.7966 1.000
## 8 20 2 0.818 0.0822 0.6719 0.996
## 9 18 1 0.773 0.0893 0.6160 0.969
## 12 17 1 0.727 0.0950 0.5631 0.939
## 13 16 1 0.682 0.0993 0.5125 0.907
## 18 13 1 0.629 0.1046 0.4544 0.872
## 23 12 2 0.524 0.1104 0.3472 0.792
## 27 10 1 0.472 0.1111 0.2976 0.749
## 30 8 1 0.413 0.1118 0.2430 0.702
## 31 7 1 0.354 0.1103 0.1922 0.652
## 33 6 1 0.295 0.1065 0.1454 0.599
## 34 5 1 0.236 0.1002 0.1027 0.543
## 43 4 1 0.177 0.0909 0.0647 0.484
## 45 3 1 0.118 0.0774 0.0326 0.427
## 48 1 1 0.000 NaN NA NA
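The survival column in this table is the Kaplan-Meier product-limit estimate: at each event time, the previous survival value is multiplied by (1 - n.event / n.risk). The sketch below reproduces that calculation by hand, assuming `df` is the `aml` data from the `survival` package with the time = 161 record removed; `km_manual` is a hypothetical name.

```r
library(survival)

# Assumption: the report's `df` is `aml` without the time = 161 record.
df <- subset(aml, time != 161)

# Distinct times at which at least one event occurred
event_times <- sort(unique(df$time[df$status == 1]))

# Number still at risk just before each event time
n_risk <- sapply(event_times, function(t) sum(df$time >= t))

# Number of events at each event time (censored records excluded)
n_event <- sapply(event_times, function(t) sum(df$time == t & df$status == 1))

# Product-limit estimate: running product of (1 - d_i / n_i)
surv <- cumprod(1 - n_event / n_risk)

km_manual <- data.frame(time = event_times, n.risk = n_risk,
                        n.event = n_event, survival = round(surv, 3))
km_manual
```

For example, the first row is 1 - 2/22 = 0.909, and the second is 0.909 * (1 - 2/20) = 0.818.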
- Kaplan-Meier curve :-
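A curve of this kind can be drawn from the fitted object with `plot()`. This sketch assumes `df` is the `aml` data from the `survival` package with the time = 161 record removed; `fit_all` is a hypothetical name, and the axis labels are illustrative. The `type = "kaplan-meier"` argument shown in the call above is omitted because Kaplan-Meier is the default estimator in `survfit()`.

```r
library(survival)

# Assumption: the report's `df` is `aml` without the time = 161 record.
df <- subset(aml, time != 161)

# Single Kaplan-Meier curve for all records (no grouping variable)
fit_all <- survfit(Surv(time, status) ~ 1, data = df)
print(fit_all)      # n, events, median and its 95% confidence limits
summary(fit_all)    # full survival table

# Step curve with 95% confidence band; tick marks show censored times
plot(fit_all, conf.int = TRUE, mark.time = TRUE,
     xlab = "Time", ylab = "Survival probability",
     main = "Kaplan-Meier estimate (all records)")
```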
Fitting Kaplan-Meier Model (considering category of records) :-
In this model, the records are split into two groups based on the value of x, and a separate survival curve is estimated for each group.
- Overview of the fitted Kaplan-Meier model is as follows.
## Call: survfit(formula = Surv(time, status) ~ x, data = df, type = "kaplan-meier")
##
## n events median 0.95LCL 0.95UCL
## x=Maintained 10 7 31 18 NA
## x=Nonmaintained 12 11 23 8 NA
- Survival Table :-
## Call: survfit(formula = Surv(time, status) ~ x, data = df, type = "kaplan-meier")
##
## x=Maintained
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 9 10 1 0.900 0.0949 0.7320 1.000
## 13 9 1 0.800 0.1265 0.5868 1.000
## 18 7 1 0.686 0.1515 0.4447 1.000
## 23 6 1 0.571 0.1638 0.3258 1.000
## 31 4 1 0.429 0.1743 0.1931 0.951
## 34 3 1 0.286 0.1647 0.0923 0.884
## 48 1 1 0.000 NaN NA NA
##
## x=Nonmaintained
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 5 12 2 0.8333 0.1076 0.6470 1.000
## 8 10 2 0.6667 0.1361 0.4468 0.995
## 12 8 1 0.5833 0.1423 0.3616 0.941
## 23 6 1 0.4861 0.1481 0.2675 0.883
## 27 5 1 0.3889 0.1470 0.1854 0.816
## 30 4 1 0.2917 0.1387 0.1148 0.741
## 33 3 1 0.1944 0.1219 0.0569 0.664
## 43 2 1 0.0972 0.0919 0.0153 0.620
## 45 1 1 0.0000 NaN NA NA
- Kaplan-Meier curve :-
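The grouped fit and its curves can be sketched as follows, assuming `df` is the `aml` data from the `survival` package with the time = 161 record removed. `fit_x` is a hypothetical name, and the colours and line types are illustrative choices.

```r
library(survival)

# Assumption: the report's `df` is `aml` without the time = 161 record.
df <- subset(aml, time != 161)

# One Kaplan-Meier curve per level of x
fit_x <- survfit(Surv(time, status) ~ x, data = df)
print(fit_x)

# Plot both curves on one set of axes
plot(fit_x, col = c("blue", "red"), lty = 1:2, mark.time = TRUE,
     xlab = "Time", ylab = "Survival probability",
     main = "Kaplan-Meier estimates by group")
legend("topright", legend = names(fit_x$strata),
       col = c("blue", "red"), lty = 1:2, bty = "n")
```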
Comparing two KM Curves :-
The log-rank test is a nonparametric hypothesis test that compares the survival distributions of two samples. Because it makes no distributional assumptions, it is well suited to comparing Kaplan-Meier curves, which are also nonparametric.
- Null Hypothesis : Survival in the two groups is the same.
- Alternative Hypothesis : Survival in the two groups is not the same.
## Call:
## survdiff(formula = Surv(time, status) ~ x, data = df)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## x=Maintained 10 7 9.93 0.866 2.1
## x=Nonmaintained 12 11 8.07 1.066 2.1
##
## Chisq= 2.1 on 1 degrees of freedom, p= 0.1
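The test above comes from `survdiff()`, and the printed p-value is rounded to one significant digit. The sketch below reruns the test and recovers a less rounded p-value from the chi-square statistic. It assumes `df` is the `aml` data from the `survival` package with the time = 161 record removed; `lr` and `p_value` are hypothetical names.

```r
library(survival)

# Assumption: the report's `df` is `aml` without the time = 161 record.
df <- subset(aml, time != 161)

# Log-rank test comparing the two levels of x
lr <- survdiff(Surv(time, status) ~ x, data = df)
print(lr)

# Upper-tail chi-square probability with (number of groups - 1) degrees of freedom
p_value <- pchisq(lr$chisq, df = length(lr$n) - 1, lower.tail = FALSE)
p_value
```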
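Conclusion of Non-parametric models :-
As the p-value (0.1) is larger than the usual 0.05 significance level, we fail to reject the null hypothesis. There is no statistically significant evidence that survival differs between the Maintained and Nonmaintained groups. This does not show that the two survival rates are identical, only that this data set does not provide enough evidence of a difference.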