Exploratory Data Analysis

This objective of this project is to practice the Survival Analysis method by using R, after Data Science Indonesia (DSI) Workshop. Datasets containing the data from a population study of non-alcoholic fatty liver disease (NAFLD). Subjects with the condition and a set of matched control subjects were followed forward for metabolic conditions, cardiac endpoints, and death.
On this project, I will use Weibull and Cox Model to predict the survival probablity among the patient.

Data Checking

data("nafld")
head(nafld1)

tail(nafld1)

Checking Data Structure

str(nafld1)

## 'data.frame':    17549 obs. of  9 variables:
##  $ id     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ age    : num  57 67 53 56 68 39 49 30 47 79 ...
##  $ male   : num  0 0 1 1 1 0 0 1 1 0 ...
##  $ weight : num  60 70.4 105.8 109.3 NA ...
##  $ height : num  163 168 186 170 NA 155 161 180 188 155 ...
##  $ bmi    : num  22.7 24.9 30.5 37.8 NA ...
##  $ case.id: int  10630 14817 3 6628 1871 7144 7507 13417 9 13518 ...
##  $ futime : num  6261 624 1783 3143 1836 ...
##  $ status : num  0 0 0 0 1 0 0 0 0 1 ...

Checking Data Summary

summary(nafld1)

##        id             age             male            weight      
##  Min.   :    1   Min.   :18.00   Min.   :0.0000   Min.   : 33.40  
##  1st Qu.: 4393   1st Qu.:42.00   1st Qu.:0.0000   1st Qu.: 70.00  
##  Median : 8786   Median :53.00   Median :0.0000   Median : 83.90  
##  Mean   : 8784   Mean   :52.66   Mean   :0.4673   Mean   : 86.35  
##  3rd Qu.:13175   3rd Qu.:63.00   3rd Qu.:1.0000   3rd Qu.: 99.20  
##  Max.   :17566   Max.   :98.00   Max.   :1.0000   Max.   :181.70  
##                                                   NA's   :4786    
##      height           bmi            case.id          futime    
##  Min.   :123.0   Min.   : 9.207   Min.   :    3   Min.   :   7  
##  1st Qu.:162.0   1st Qu.:25.136   1st Qu.: 4598   1st Qu.:1132  
##  Median :169.0   Median :28.876   Median : 8781   Median :2148  
##  Mean   :169.4   Mean   :30.074   Mean   : 8841   Mean   :2411  
##  3rd Qu.:177.0   3rd Qu.:33.710   3rd Qu.:13249   3rd Qu.:3353  
##  Max.   :215.0   Max.   :84.396   Max.   :17563   Max.   :7268  
##  NA's   :3168    NA's   :4961     NA's   :31                    
##      status       
##  Min.   :0.00000  
##  1st Qu.:0.00000  
##  Median :0.00000  
##  Mean   :0.07773  
##  3rd Qu.:0.00000  
##  Max.   :1.00000  
##

nafld1 is a data frame with 17,549 observations on the following 10 variables: - id - subject identifier
- age - age at entry to the study
- male - gender orientation, with 0 = female, 1 = male
- weight - weight in kg
- height - height in cm
- bmi - body mass index
- case.id - the id of the NAFLD case to whom this subject is matched
- futime - time to death or last follow-up
- status - status of patients, with 0 = alive at last follow-up, 1 = dead

Cleansing Missing Values

nafld1[!complete.cases(nafld1),]

In total, there are 4,987 rows that need to be cleaned since they are containing missing values.

nafld1_clean <- nafld1 %>% drop_na()

Kaplan-Meier Estimate

Preparing Data

km <- survfit(Surv(futime, status)~1, data = nafld1_clean)
km.table <- data.frame(time = km$time,
                         n_risk = km$n.risk,
                         n_event = km$n.event,
                         n_censor = km$n.censor,
                         sur_prob = km$surv)

km.table

Kaplan-Meier Curve

ggsurvplot(
  fit = km, 
  palette = "blue", 
  ggtheme = theme_bw(),
  linetype = 1, 
  risk.table = TRUE,
  cumcensor = TRUE,
  cumevents = TRUE,
  tables.height = 0.15
)

Kaplan-Meier Curve by Gender

km2 <- survfit(Surv(futime, status)~male, data = nafld1)
ggsurvplot(km2,
           pval = TRUE,
           ggtheme = theme_bw(),
           risk.table = TRUE,
           cumcensor = TRUE,
           cumevents = TRUE,
           tables.height = 0.15,
           palette = c("dodgerblue","hotpink"),
           legend="bottom",
           legend.labs = c("Male", "Female"),
           xlab = "Time in Days")

Preparing Data for Modelling

Based on available variables, we can see that varaible weight and height automatically represented by bmi, therefore, I don’t see the necessity to include weight and `height in our modelling.

Checking Distribution of BMI

hist(nafld1_clean$bmi)

Checking Boxplot BMI vs Gender

boxplot(bmi ~ male, data = nafld1_clean, names=c("female","male"))

Checking Boxplot BMI vs Status

boxplot(bmi ~ status, data = nafld1_clean, names=c("alive","deceased"))

Feature Engineering

# Function for BMI Category
category <- function(x){
  if(x < 18.5){
    x <- "Underweight"
  }else if(x >= 18.5 & x < 24.9){
    x <- "Healthy Weight"
  }else if(x >= 25 & x < 29.9){
    x <- "Overweight"
  }else{
    x <- "Obsese"
  }
  
}

# Change the male and bmi into factor
nafld1_feat <- nafld1_clean %>% 
                               mutate(gender = factor(ifelse(male == 0, "female", "male")),
                                      bmi_cat = factor(sapply(bmi, category))
                                      ) 
nafld1_feat

Weibull Model

Compute Weibull Model

wb <- survreg(Surv(futime, status)~gender+bmi_cat, data = nafld1_feat)

Decide Covariate Combinations

new_wb <- expand.grid(
                      gender = levels(nafld1_feat$gender),
                      bmi_cat = levels(nafld1_feat$bmi_cat)
                      )

new_wb

Compute Survival Curves

surv <- seq(.99, .01, by = -.01)
t <- predict(wb, type = "quantile", p = 1 - surv, newdata = new_wb)
t[,1:10]

##          [,1]     [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
## [1,] 407.1663 765.7573 1110.1975 1446.9417 1778.9467 2107.9207 2434.9827
## [2,] 311.4372 585.7196  849.1783 1106.7503 1360.6973 1612.3261 1862.4923
## [3,] 422.0643 793.7760 1150.8192 1499.8846 1844.0376 2185.0486 2524.0777
## [4,] 322.8325 607.1508  880.2493 1147.2458 1410.4847 1671.3204 1930.6401
## [5,] 456.0770 857.7436 1243.5597 1620.7551 1992.6421 2361.1340 2727.4842
## [6,] 348.8484 656.0789  951.1855 1239.6983 1524.1507 1806.0062 2086.2236
## [7,] 139.5016 262.3605  380.3712  495.7451  609.4952  722.2069  834.2635
## [8,] 106.7033 200.6768  290.9419  379.1901  466.1964  552.4084  638.1193
##           [,8]     [,9]     [,10]
## [1,] 2760.9278 3086.355 3411.7363
## [2,] 2111.8043 2360.720 2609.6007
## [3,] 2861.9490 3199.283 3536.5702
## [4,] 2189.0742 2447.098 2705.0848
## [5,] 3092.5834 3457.102 3821.5700
## [6,] 2365.4840 2644.301 2923.0781
## [7,]  945.9375 1057.434 1168.9147
## [8,]  723.5375  808.820  894.0904

Create New `df` with Survival Curve Information

surv_wb_wide <- cbind(new_wb, t)

surv_wb_mod <-  melt(surv_wb_wide, id.vars = c("gender", "bmi_cat"), 
                variable.name = "surv_id",  value.name = "time")
surv_wb_mod$surv <- surv[as.numeric(surv_wb_mod$surv_id)]

str(surv_wb_mod)

## 'data.frame':    792 obs. of  5 variables:
##  $ gender : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
##  $ bmi_cat: Factor w/ 4 levels "Healthy Weight",..: 1 1 2 2 3 3 4 4 1 1 ...
##  $ surv_id: Factor w/ 99 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ time   : num  407 311 422 323 456 ...
##  $ surv   : num  0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.98 0.98 ...

Plotting Weibull Model

ggsurvplot_df(surv_wb_mod, surv.geom = geom_line,
              linetype = "gender", color = "bmi_cat", legend.title = NULL)

On Weibull Model, female has bigger survival probability than male. Some interesting facts, that BMI with Overweight category has the biggest chance to survive, and Underweight is the lowest one.

Cox Model

Compute Cox Model

cx <- coxph(Surv(futime, status)~gender+bmi_cat, data = nafld1_feat)

Compute Survival Curves

cxsf <- survfit(cx, data = nafld1_feat, newdata = new_wb, conf.type = "none")
str(cxsf)

## List of 10
##  $ n       : int 12562
##  $ time    : num [1:4928] 7 9 10 11 12 13 15 17 18 19 ...
##  $ n.risk  : num [1:4928] 12562 12559 12557 12552 12550 ...
##  $ n.event : num [1:4928] 0 0 1 1 0 0 1 1 0 0 ...
##  $ n.censor: num [1:4928] 3 2 4 1 1 1 0 0 1 3 ...
##  $ surv    : num [1:4928, 1:8] 1 1 1 1 1 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:8] "1" "2" "3" "4" ...
##  $ cumhaz  : num [1:4928, 1:8] 0 0 0.0000719 0.0001438 0.0001438 ...
##  $ std.err : num [1:4928, 1:8] 0 0 0.000072 0.000102 0.000102 ...
##  $ logse   : logi TRUE
##  $ call    : language survfit(formula = cx, newdata = new_wb, conf.type = "none", data = nafld1_feat)
##  - attr(*, "class")= chr [1:2] "survfitcox" "survfit"

Create New `df` with Survival Curve Information

surv_cxmod0 <- surv_summary(cxsf)
head(surv_cxmod0)

surv_cxmod <- cbind(surv_cxmod0,
                    new_wb[as.character(surv_cxmod0$strata), ])

Plotting Cox Model

ggsurvplot_df(surv_cxmod, linetype = "gender", color = "bmi_cat",
              legend.title = NULL, censor = FALSE)

The plot of Cox Model giving the same result with Weibull Model.

Comparing Concordance Index

concordance(wb)

## Call:
## concordance.survreg(object = wb)
## 
## n= 12562 
## Concordance= 0.5577 se= 0.01036
## concordant discordant     tied.x     tied.y    tied.xy 
##    3625171    2737581    1331897        116         28

concordance(cx)

## Call:
## concordance.coxph(object = cx)
## 
## n= 12562 
## Concordance= 0.5577 se= 0.01036
## concordant discordant     tied.x     tied.y    tied.xy 
##    3625171    2737581    1331897        116         28

Based on Concordance Index comparison, we can see there is no differnece between the two model.

Survival Analysis Workshop DSI - NAFLD Case

Survival Analysis Workshop DSI - NAFLD Case

Exploratory Data Analysis

Data Checking

Checking Data Structure

Checking Data Summary

Cleansing Missing Values

Kaplan-Meier Estimate

Preparing Data

Kaplan-Meier Curve

Kaplan-Meier Curve by Gender

Preparing Data for Modelling

Checking Distribution of BMI

Checking Boxplot BMI vs Gender

Checking Boxplot BMI vs Status

Feature Engineering

Weibull Model

Compute Weibull Model

Decide Covariate Combinations

Compute Survival Curves

Create New `df` with Survival Curve Information

Plotting Weibull Model

Cox Model

Compute Cox Model

Compute Survival Curves

Create New `df` with Survival Curve Information

Plotting Cox Model

Comparing Concordance Index

Survival Analysis Workshop DSI - NAFLD Case

Survival Analysis Workshop DSI - NAFLD Case

Exploratory Data Analysis

Data Checking

Checking Data Structure

Checking Data Summary

Cleansing Missing Values

Kaplan-Meier Estimate

Preparing Data

Kaplan-Meier Curve

Kaplan-Meier Curve by Gender

Preparing Data for Modelling

Checking Distribution of BMI

Checking Boxplot BMI vs Gender

Checking Boxplot BMI vs Status

Feature Engineering

Weibull Model

Compute Weibull Model

Decide Covariate Combinations

Compute Survival Curves

Create New df with Survival Curve Information

Plotting Weibull Model

Cox Model

Compute Cox Model

Compute Survival Curves

Create New df with Survival Curve Information

Plotting Cox Model

Comparing Concordance Index

Create New `df` with Survival Curve Information

Create New `df` with Survival Curve Information