Survival Analysis on the Leukemia Dataset

Introduction :-

In this report, I perform a survival (time-to-event) analysis on the leukemia data set.

Exploratory Data Analysis :-

Exploratory data analysis is an approach to analyzing data sets that summarizes their main characteristics, often with visual methods.

Structure of given dataset :-

The given dataset has 23 observations, each with 3 attributes. The first six rows of the dataset are as follows.

##   time status          x
## 1    9      1 Maintained
## 2   13      1 Maintained
## 3   13      0 Maintained
## 4   18      1 Maintained
## 5   23      1 Maintained
## 6   28      0 Maintained

The detailed structure is as follows.

## 'data.frame':    23 obs. of  3 variables:
##  $ time  : num  9 13 13 18 23 28 31 34 45 48 ...
##  $ status: num  1 1 0 1 1 0 1 1 0 1 ...
##  $ x     : Factor w/ 2 levels "Maintained","Nonmaintained": 1 1 1 1 1 1 1 1 1 1 ...

In the input data, the type of each attribute is as follows.

##      time    status         x 
## "numeric" "numeric"  "factor"

For the survival analysis, the variable time should be numeric, the variable status should be numeric, and the variable x should be a factor.

For the EDA only, I convert status to a factor. After this conversion, the type of each attribute is as follows.

##      time    status         x 
## "numeric"  "factor"  "factor"

Dealing with missing values :-

The number of missing values in each column is as follows.

##   time status      x 
##      0      0      0

As there are no missing values, we can proceed.

Summary :-

The overall summary of all the attributes is as follows.

##       time        status             x     
##  Min.   :  5.00   0: 5   Maintained   :11  
##  1st Qu.: 12.50   1:18   Nonmaintained:12  
##  Median : 23.00                            
##  Mean   : 29.48                            
##  3rd Qu.: 33.50                            
##  Max.   :161.00
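The figures for time in the summary above can be checked by hand. The following is a minimal Python sketch, separate from the R code that produced this report. It assumes the 23 times listed in the code: all but one can be read from the outputs in this report, and the remaining censored Nonmaintained time (16) is assumed from the standard aml data in R's survival package, which is consistent with the reported mean of 29.48. The "inclusive" method of statistics.quantiles should correspond to R's default quantile definition (type 7).

```python
import statistics

# Times for all 23 records, before any outlier removal (see assumptions above).
maintained = [9, 13, 13, 18, 23, 28, 31, 34, 45, 48, 161]
nonmaintained = [5, 5, 8, 8, 12, 16, 23, 27, 30, 33, 43, 45]
times = sorted(maintained + nonmaintained)

# Quartiles by linear interpolation between order statistics.
q1, median, q3 = statistics.quantiles(times, n=4, method="inclusive")

summary = {
    "min": min(times),
    "q1": q1,
    "median": median,
    "mean": round(statistics.mean(times), 2),
    "q3": q3,
    "max": max(times),
}
print(summary)
```

The printed values should agree with the time column of the summary table.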

The distribution of all continuous variables is as follows.

The distribution of all continuous variables within each category of x is as follows.

Removing Outliers :-

I treat the record with time = 161 as an outlier and remove it from the dataset. It is a censored record in the Maintained group, so 22 observations and all 18 events remain.
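The report does not say how the outlier was identified. One common rule of thumb, shown here only as an illustrative check and not as the method used in this report, flags values more than 1.5 times the interquartile range beyond the quartiles. The sketch below is in Python and assumes the same 23 times as the rest of this report (the censored Nonmaintained time of 16 is assumed from the standard aml data).

```python
import statistics

# All 23 times before removal (assumed values, see the note above).
times = sorted([9, 13, 13, 18, 23, 28, 31, 34, 45, 48, 161,
                5, 5, 8, 8, 12, 16, 23, 27, 30, 33, 43, 45])

q1, _, q3 = statistics.quantiles(times, n=4, method="inclusive")
iqr = q3 - q1

# Fences of the 1.5 * IQR rule.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [t for t in times if t < lower_fence or t > upper_fence]
kept = [t for t in times if lower_fence <= t <= upper_fence]

print("fences:", lower_fence, upper_fence)
print("flagged:", outliers)
print("remaining:", len(kept), "records, max time", max(kept))
```

Under this rule only the record with time = 161 is flagged, which matches the record removed here.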

The updated distribution of all continuous variables is as follows.

The updated distribution of all continuous variables within each category of x is as follows.

Description :-

In our data set,

time is a continuous variable ranging from 5 to 48.
status is a categorical variable with two levels:
1. 0 : censored record
2. 1 : record where the event occurred
x is a categorical variable with two levels:
1. Maintained
2. Nonmaintained

NON-PARAMETRIC SURVIVAL MODELS

Fitting Kaplan-Meier Model (without considering the category of records) :-

In this model, all records are pooled into a single group; the categorical variable x (Maintained / Nonmaintained) is not used.

An overview of the fitted Kaplan-Meier model is as follows.

## Call: survfit(formula = Surv(time, status) ~ 1, data = df, type = "kaplan-meier")
## 
##       n  events  median 0.95LCL 0.95UCL 
##      22      18      27      18      43

Survival Table :-

## Call: survfit(formula = Surv(time, status) ~ 1, data = df, type = "kaplan-meier")
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     5     22       2    0.909  0.0613       0.7966        1.000
##     8     20       2    0.818  0.0822       0.6719        0.996
##     9     18       1    0.773  0.0893       0.6160        0.969
##    12     17       1    0.727  0.0950       0.5631        0.939
##    13     16       1    0.682  0.0993       0.5125        0.907
##    18     13       1    0.629  0.1046       0.4544        0.872
##    23     12       2    0.524  0.1104       0.3472        0.792
##    27     10       1    0.472  0.1111       0.2976        0.749
##    30      8       1    0.413  0.1118       0.2430        0.702
##    31      7       1    0.354  0.1103       0.1922        0.652
##    33      6       1    0.295  0.1065       0.1454        0.599
##    34      5       1    0.236  0.1002       0.1027        0.543
##    43      4       1    0.177  0.0909       0.0647        0.484
##    45      3       1    0.118  0.0774       0.0326        0.427
##    48      1       1    0.000     NaN           NA           NA
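To make the columns of the survival table concrete, the following Python sketch recomputes them from scratch. It is not the R code that produced the table. It assumes the 22 (time, status) pairs listed in the code (the censored time of 16 is assumed from the standard aml data), Greenwood's formula for the standard error, and a log-transformed 95% interval capped at 1, which is my reading of how the printed bounds were obtained. The function name kaplan_meier is hypothetical.

```python
import math

# (time, status) for the 22 records kept after dropping time = 161.
# status: 1 = event occurred, 0 = censored.
records = [
    (9, 1), (13, 1), (13, 0), (18, 1), (23, 1), (28, 0), (31, 1), (34, 1),
    (45, 0), (48, 1),
    (5, 1), (5, 1), (8, 1), (8, 1), (12, 1), (16, 0), (23, 1), (27, 1),
    (30, 1), (33, 1), (43, 1), (45, 1),
]

def kaplan_meier(records, z=1.959964):
    """One row per distinct event time:
    (time, n_risk, n_event, survival, std_err, lower, upper)."""
    rows = []
    survival = 1.0
    greenwood = 0.0  # running sum of d / (n * (n - d))
    for t in sorted({time for time, status in records if status == 1}):
        n_risk = sum(1 for time, _ in records if time >= t)
        n_event = sum(1 for time, s in records if time == t and s == 1)
        survival *= 1 - n_event / n_risk
        if n_risk > n_event:
            greenwood += n_event / (n_risk * (n_risk - n_event))
            std_err = survival * math.sqrt(greenwood)
            half = z * std_err / survival  # interval half-width on the log scale
            lower = survival * math.exp(-half)
            upper = min(1.0, survival * math.exp(half))
        else:
            # Everyone still at risk has the event: survival is 0, no interval.
            std_err = lower = upper = float("nan")
        rows.append((t, n_risk, n_event, survival, std_err, lower, upper))
    return rows

rows = kaplan_meier(records)
for t, n, d, s, se, lo, hi in rows:
    print(f"{t:5d} {n:6d} {d:7d} {s:8.3f} {se:8.4f} {lo:8.4f} {hi:8.3f}")

# Median survival: first event time at which the estimate is 0.5 or lower.
median = next((t for t, _, _, s, *_ in rows if s <= 0.5), None)
print("median:", median)
```

Each survival value is the previous one multiplied by (1 - n.event / n.risk), so censored records lower n.risk without producing a step in the curve.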

Kaplan-Meier curve :-

Fitting Kaplan-Meier Model (considering the category of records) :-

In this model, the records are split into two groups according to their x value, and a separate survival curve is estimated for each group.

An overview of the fitted Kaplan-Meier model is as follows.

## Call: survfit(formula = Surv(time, status) ~ x, data = df, type = "kaplan-meier")
## 
##                  n events median 0.95LCL 0.95UCL
## x=Maintained    10      7     31      18      NA
## x=Nonmaintained 12     11     23       8      NA

Survival Table :-

## Call: survfit(formula = Surv(time, status) ~ x, data = df, type = "kaplan-meier")
## 
##                 x=Maintained 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     9     10       1    0.900  0.0949       0.7320        1.000
##    13      9       1    0.800  0.1265       0.5868        1.000
##    18      7       1    0.686  0.1515       0.4447        1.000
##    23      6       1    0.571  0.1638       0.3258        1.000
##    31      4       1    0.429  0.1743       0.1931        0.951
##    34      3       1    0.286  0.1647       0.0923        0.884
##    48      1       1    0.000     NaN           NA           NA
## 
##                 x=Nonmaintained 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     5     12       2   0.8333  0.1076       0.6470        1.000
##     8     10       2   0.6667  0.1361       0.4468        0.995
##    12      8       1   0.5833  0.1423       0.3616        0.941
##    23      6       1   0.4861  0.1481       0.2675        0.883
##    27      5       1   0.3889  0.1470       0.1854        0.816
##    30      4       1   0.2917  0.1387       0.1148        0.741
##    33      3       1   0.1944  0.1219       0.0569        0.664
##    43      2       1   0.0972  0.0919       0.0153        0.620
##    45      1       1   0.0000     NaN           NA           NA
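The per-group tables above follow from applying the same product-limit calculation separately to each level of x. The Python sketch below illustrates this under the same assumptions as the rest of this report (22 records, with the censored Nonmaintained time of 16 assumed from the standard aml data). It is not the R code behind the output, and km_survival is a hypothetical helper name.

```python
from collections import defaultdict

# (time, status, x) for the 22 records kept after dropping time = 161.
records = (
    [(t, s, "Maintained") for t, s in
     [(9, 1), (13, 1), (13, 0), (18, 1), (23, 1), (28, 0), (31, 1),
      (34, 1), (45, 0), (48, 1)]]
    + [(t, s, "Nonmaintained") for t, s in
       [(5, 1), (5, 1), (8, 1), (8, 1), (12, 1), (16, 0), (23, 1),
        (27, 1), (30, 1), (33, 1), (43, 1), (45, 1)]]
)

def km_survival(pairs):
    """Kaplan-Meier estimate as a list of (time, n_risk, n_event, survival)."""
    curve = []
    survival = 1.0
    for t in sorted({time for time, status in pairs if status == 1}):
        n_risk = sum(1 for time, _ in pairs if time >= t)
        n_event = sum(1 for time, s in pairs if time == t and s == 1)
        survival *= 1 - n_event / n_risk
        curve.append((t, n_risk, n_event, survival))
    return curve

# Split the records by group, then estimate each curve on its own.
by_group = defaultdict(list)
for time, status, group in records:
    by_group[group].append((time, status))

results = {}
for group, pairs in by_group.items():
    curve = km_survival(pairs)
    results[group] = {
        "n": len(pairs),
        "events": sum(s for _, s in pairs),
        # First event time at which the estimate is 0.5 or lower.
        "median": next((t for t, _, _, s in curve if s <= 0.5), None),
        "curve": curve,
    }
    print(group, results[group]["n"], results[group]["events"],
          results[group]["median"])
```

The n, events and median values printed for each group should match the overview table for the stratified model.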

Kaplan-Meier curve :-

Comparing the two KM curves :-

The log-rank test is a nonparametric hypothesis test that compares the survival distributions of two samples, which makes it well suited to comparing Kaplan-Meier curves (themselves nonparametric estimates).

Null Hypothesis : Survival in the two groups is the same.
Alternative Hypothesis : Survival in the two groups is not the same.

## Call:
## survdiff(formula = Surv(time, status) ~ x, data = df)
## 
##                  N Observed Expected (O-E)^2/E (O-E)^2/V
## x=Maintained    10        7     9.93     0.866       2.1
## x=Nonmaintained 12       11     8.07     1.066       2.1
## 
##  Chisq= 2.1  on 1 degrees of freedom, p= 0.1
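The Observed, Expected and Chisq figures above can be traced step by step. The Python sketch below is a from-scratch illustration of the two-group log-rank statistic, not the R code that produced the output. At each event time it adds the number of events expected in the Maintained group if both groups shared the same hazard, together with the hypergeometric variance, and then converts the statistic to a p-value using the chi-square distribution with 1 degree of freedom. It assumes the same 22 records as the rest of this report (the censored Nonmaintained time of 16 is assumed from the standard aml data); logrank is a hypothetical function name.

```python
import math

# (time, status, x) for the 22 records kept after dropping time = 161.
records = (
    [(t, s, "Maintained") for t, s in
     [(9, 1), (13, 1), (13, 0), (18, 1), (23, 1), (28, 0), (31, 1),
      (34, 1), (45, 0), (48, 1)]]
    + [(t, s, "Nonmaintained") for t, s in
       [(5, 1), (5, 1), (8, 1), (8, 1), (12, 1), (16, 0), (23, 1),
        (27, 1), (30, 1), (33, 1), (43, 1), (45, 1)]]
)

def logrank(records, group_a):
    """Two-group log-rank test; returns (observed, expected, chisq, p_value)
    where observed and expected refer to group_a."""
    observed = expected = variance = 0.0
    for t in sorted({time for time, s, _ in records if s == 1}):
        n = sum(1 for time, _, _ in records if time >= t)      # at risk, both groups
        n_a = sum(1 for time, _, g in records if time >= t and g == group_a)
        d = sum(1 for time, s, _ in records if time == t and s == 1)
        d_a = sum(1 for time, s, g in records
                  if time == t and s == 1 and g == group_a)
        observed += d_a
        expected += d * n_a / n
        if n > 1:  # the variance term is zero when one record is at risk
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    chisq = (observed - expected) ** 2 / variance
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chisq / 2))
    return observed, expected, chisq, p_value

observed, expected, chisq, p_value = logrank(records, "Maintained")
print(f"observed={observed:.0f} expected={expected:.2f} "
      f"chisq={chisq:.2f} p={p_value:.3f}")
```

The p-value computed this way is about 0.15, which the output above shows rounded to one significant digit as 0.1.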

Conclusion of Non-parametric models :-

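As the p-value (0.1) is above the conventional 0.05 significance level, we fail to reject the null hypothesis. The data do not show a statistically significant difference in survival between the Maintained and Nonmaintained groups. With only 22 records, this is not evidence that the two survival rates are equal, only that a difference could not be detected.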