Survival Analysis Workshop DSI - NAFLD Case
Exploratory Data Analysis
This objective of this project is to practice the Survival Analysis method by using R, after Data Science Indonesia (DSI) Workshop. Datasets containing the data from a population study of non-alcoholic fatty liver disease (NAFLD). Subjects with the condition and a set of matched control subjects were followed forward for metabolic conditions, cardiac endpoints, and death.
On this project, I will use Weibull and Cox Model to predict the survival probablity among the patient.
Checking Data Structure
## 'data.frame': 17549 obs. of 9 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ age : num 57 67 53 56 68 39 49 30 47 79 ...
## $ male : num 0 0 1 1 1 0 0 1 1 0 ...
## $ weight : num 60 70.4 105.8 109.3 NA ...
## $ height : num 163 168 186 170 NA 155 161 180 188 155 ...
## $ bmi : num 22.7 24.9 30.5 37.8 NA ...
## $ case.id: int 10630 14817 3 6628 1871 7144 7507 13417 9 13518 ...
## $ futime : num 6261 624 1783 3143 1836 ...
## $ status : num 0 0 0 0 1 0 0 0 0 1 ...
Checking Data Summary
## id age male weight
## Min. : 1 Min. :18.00 Min. :0.0000 Min. : 33.40
## 1st Qu.: 4393 1st Qu.:42.00 1st Qu.:0.0000 1st Qu.: 70.00
## Median : 8786 Median :53.00 Median :0.0000 Median : 83.90
## Mean : 8784 Mean :52.66 Mean :0.4673 Mean : 86.35
## 3rd Qu.:13175 3rd Qu.:63.00 3rd Qu.:1.0000 3rd Qu.: 99.20
## Max. :17566 Max. :98.00 Max. :1.0000 Max. :181.70
## NA's :4786
## height bmi case.id futime
## Min. :123.0 Min. : 9.207 Min. : 3 Min. : 7
## 1st Qu.:162.0 1st Qu.:25.136 1st Qu.: 4598 1st Qu.:1132
## Median :169.0 Median :28.876 Median : 8781 Median :2148
## Mean :169.4 Mean :30.074 Mean : 8841 Mean :2411
## 3rd Qu.:177.0 3rd Qu.:33.710 3rd Qu.:13249 3rd Qu.:3353
## Max. :215.0 Max. :84.396 Max. :17563 Max. :7268
## NA's :3168 NA's :4961 NA's :31
## status
## Min. :0.00000
## 1st Qu.:0.00000
## Median :0.00000
## Mean :0.07773
## 3rd Qu.:0.00000
## Max. :1.00000
##
nafld1 is a data frame with 17,549 observations on the following 10 variables: - id - subject identifier
- age - age at entry to the study
- male - gender orientation, with 0 = female, 1 = male
- weight - weight in kg
- height - height in cm
- bmi - body mass index
- case.id - the id of the NAFLD case to whom this subject is matched
- futime - time to death or last follow-up
- status - status of patients, with 0 = alive at last follow-up, 1 = dead
Cleansing Missing Values
In total, there are 4,987 rows that need to be cleaned since they are containing missing values.
Kaplan-Meier Estimate
Preparing Data
km <- survfit(Surv(futime, status)~1, data = nafld1_clean)
km.table <- data.frame(time = km$time,
n_risk = km$n.risk,
n_event = km$n.event,
n_censor = km$n.censor,
sur_prob = km$surv)
km.tableKaplan-Meier Curve
ggsurvplot(
fit = km,
palette = "blue",
ggtheme = theme_bw(),
linetype = 1,
risk.table = TRUE,
cumcensor = TRUE,
cumevents = TRUE,
tables.height = 0.15
)Kaplan-Meier Curve by Gender
km2 <- survfit(Surv(futime, status)~male, data = nafld1)
ggsurvplot(km2,
pval = TRUE,
ggtheme = theme_bw(),
risk.table = TRUE,
cumcensor = TRUE,
cumevents = TRUE,
tables.height = 0.15,
palette = c("dodgerblue","hotpink"),
legend="bottom",
legend.labs = c("Male", "Female"),
xlab = "Time in Days")Preparing Data for Modelling
Based on available variables, we can see that varaible weight and height automatically represented by bmi, therefore, I don’t see the necessity to include weight and `height in our modelling.
Checking Boxplot BMI vs Status
Feature Engineering
# Function for BMI Category
category <- function(x){
if(x < 18.5){
x <- "Underweight"
}else if(x >= 18.5 & x < 24.9){
x <- "Healthy Weight"
}else if(x >= 25 & x < 29.9){
x <- "Overweight"
}else{
x <- "Obsese"
}
}
# Change the male and bmi into factor
nafld1_feat <- nafld1_clean %>%
mutate(gender = factor(ifelse(male == 0, "female", "male")),
bmi_cat = factor(sapply(bmi, category))
)
nafld1_featWeibull Model
Compute Weibull Model
Decide Covariate Combinations
new_wb <- expand.grid(
gender = levels(nafld1_feat$gender),
bmi_cat = levels(nafld1_feat$bmi_cat)
)
new_wbCompute Survival Curves
surv <- seq(.99, .01, by = -.01)
t <- predict(wb, type = "quantile", p = 1 - surv, newdata = new_wb)
t[,1:10]## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 407.1663 765.7573 1110.1975 1446.9417 1778.9467 2107.9207 2434.9827
## [2,] 311.4372 585.7196 849.1783 1106.7503 1360.6973 1612.3261 1862.4923
## [3,] 422.0643 793.7760 1150.8192 1499.8846 1844.0376 2185.0486 2524.0777
## [4,] 322.8325 607.1508 880.2493 1147.2458 1410.4847 1671.3204 1930.6401
## [5,] 456.0770 857.7436 1243.5597 1620.7551 1992.6421 2361.1340 2727.4842
## [6,] 348.8484 656.0789 951.1855 1239.6983 1524.1507 1806.0062 2086.2236
## [7,] 139.5016 262.3605 380.3712 495.7451 609.4952 722.2069 834.2635
## [8,] 106.7033 200.6768 290.9419 379.1901 466.1964 552.4084 638.1193
## [,8] [,9] [,10]
## [1,] 2760.9278 3086.355 3411.7363
## [2,] 2111.8043 2360.720 2609.6007
## [3,] 2861.9490 3199.283 3536.5702
## [4,] 2189.0742 2447.098 2705.0848
## [5,] 3092.5834 3457.102 3821.5700
## [6,] 2365.4840 2644.301 2923.0781
## [7,] 945.9375 1057.434 1168.9147
## [8,] 723.5375 808.820 894.0904
Create New df with Survival Curve Information
surv_wb_wide <- cbind(new_wb, t)
surv_wb_mod <- melt(surv_wb_wide, id.vars = c("gender", "bmi_cat"),
variable.name = "surv_id", value.name = "time")
surv_wb_mod$surv <- surv[as.numeric(surv_wb_mod$surv_id)]
str(surv_wb_mod)## 'data.frame': 792 obs. of 5 variables:
## $ gender : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
## $ bmi_cat: Factor w/ 4 levels "Healthy Weight",..: 1 1 2 2 3 3 4 4 1 1 ...
## $ surv_id: Factor w/ 99 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ time : num 407 311 422 323 456 ...
## $ surv : num 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.98 0.98 ...
Plotting Weibull Model
ggsurvplot_df(surv_wb_mod, surv.geom = geom_line,
linetype = "gender", color = "bmi_cat", legend.title = NULL)On Weibull Model, female has bigger survival probability than male. Some interesting facts, that BMI with Overweight category has the biggest chance to survive, and Underweight is the lowest one.
Cox Model
Compute Cox Model
Compute Survival Curves
## List of 10
## $ n : int 12562
## $ time : num [1:4928] 7 9 10 11 12 13 15 17 18 19 ...
## $ n.risk : num [1:4928] 12562 12559 12557 12552 12550 ...
## $ n.event : num [1:4928] 0 0 1 1 0 0 1 1 0 0 ...
## $ n.censor: num [1:4928] 3 2 4 1 1 1 0 0 1 3 ...
## $ surv : num [1:4928, 1:8] 1 1 1 1 1 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:8] "1" "2" "3" "4" ...
## $ cumhaz : num [1:4928, 1:8] 0 0 0.0000719 0.0001438 0.0001438 ...
## $ std.err : num [1:4928, 1:8] 0 0 0.000072 0.000102 0.000102 ...
## $ logse : logi TRUE
## $ call : language survfit(formula = cx, newdata = new_wb, conf.type = "none", data = nafld1_feat)
## - attr(*, "class")= chr [1:2] "survfitcox" "survfit"
Create New df with Survival Curve Information
Plotting Cox Model
ggsurvplot_df(surv_cxmod, linetype = "gender", color = "bmi_cat",
legend.title = NULL, censor = FALSE)The plot of Cox Model giving the same result with Weibull Model.
Comparing Concordance Index
## Call:
## concordance.survreg(object = wb)
##
## n= 12562
## Concordance= 0.5577 se= 0.01036
## concordant discordant tied.x tied.y tied.xy
## 3625171 2737581 1331897 116 28
## Call:
## concordance.coxph(object = cx)
##
## n= 12562
## Concordance= 0.5577 se= 0.01036
## concordant discordant tied.x tied.y tied.xy
## 3625171 2737581 1331897 116 28
Based on Concordance Index comparison, we can see there is no differnece between the two model.