LBB RM: Naive Bayes, Decission Tree, and Random Forest For Stroke Classification Problem

Musthofa Syarifudin

2021-05-02

Bussines Problem

Context

According to the World Health Organization (WHO) stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths. This dataset is used to predict whether a patient is likely to get stroke based on the input parameters like gender, age, various diseases, and smoking status. Each row in the data provides relavant information about the patient.

Attribute Information

  1. id: unique identifier
  2. gender: “Male”, “Female” or “Other”
  3. age: age of the patient
  4. hypertension: 0 if the patient doesn’t have hypertension, 1 if the patient has hypertension
  5. heart_disease: 0 if the patient doesn’t have any heart diseases, 1 if the patient has a heart disease
  6. ever_married: “No” or “Yes”
  7. work_type: “children”, “Govt_jov”, “Never_worked”, “Private” or “Self-employed”
  8. Residence_type: “Rural” or “Urban”
  9. avg_glucose_level: average glucose level in blood
  10. bmi: body mass index
  11. smoking_status: “formerly smoked”, “never smoked”, “smokes” or “Unknown”*
  12. stroke: 1 if the patient had a stroke or 0 if not *Note: “Unknown” in smoking_status means that the information is unavailable for this patient

Preparation

Load the library

library(tidyverse)
library(caret)
library(plotly)
library(data.table)
library(GGally)
library(car)
library(scales)
library(lmtest)
library(MLmetrics)
library(inspectdf)
options(scipen = 100, max.print = 1e+06)

Read The Data

We load the data and change 1 and 0 into more understandable value

stroke_dataset <- read.csv("data/healthcare-dataset-stroke-data.csv", stringsAsFactors = TRUE)

stroke_dataset <- stroke_dataset  %>% mutate(bmi = as.numeric(bmi))
stroke_dataset$stroke <- ifelse(stroke_dataset$stroke == "1", "Stroke", "Healthy") %>% factor()
stroke_dataset$heart_disease <- ifelse(stroke_dataset$heart_disease == "1", "Heart disease", "No Heart disease") %>% factor()
stroke_dataset$hypertension <- ifelse(stroke_dataset$hypertension == "1", "Hypertension", "No Hypertension") %>% factor()
rmarkdown::paged_table((stroke_dataset))

Check NA Values

Check NA Values in dataset

colSums(is.na(stroke_dataset))
##                id            gender               age      hypertension 
##                 0                 0                 0                 0 
##     heart_disease      ever_married         work_type    Residence_type 
##                 0                 0                 0                 0 
## avg_glucose_level               bmi    smoking_status            stroke 
##                 0                 0                 0                 0

Summary

summary(stroke_dataset)
##        id           gender          age                 hypertension 
##  Min.   :   67   Female:2994   Min.   : 0.08   Hypertension   : 498  
##  1st Qu.:17741   Male  :2115   1st Qu.:25.00   No Hypertension:4612  
##  Median :36932   Other :   1   Median :45.00                         
##  Mean   :36518                 Mean   :43.23                         
##  3rd Qu.:54682                 3rd Qu.:61.00                         
##  Max.   :72940                 Max.   :82.00                         
##           heart_disease  ever_married         work_type    Residence_type
##  Heart disease   : 276   No :1757     children     : 687   Rural:2514    
##  No Heart disease:4834   Yes:3353     Govt_job     : 657   Urban:2596    
##                                       Never_worked :  22                 
##                                       Private      :2925                 
##                                       Self-employed: 819                 
##                                                                          
##  avg_glucose_level      bmi                smoking_status     stroke    
##  Min.   : 55.12    Min.   :  1.0   formerly smoked: 885   Healthy:4861  
##  1st Qu.: 77.25    1st Qu.:112.0   never smoked   :1892   Stroke : 249  
##  Median : 91.89    Median :158.0   smokes         : 789                 
##  Mean   :106.15    Mean   :172.2   Unknown        :1544                 
##  3rd Qu.:114.09    3rd Qu.:214.0                                        
##  Max.   :271.74    Max.   :419.0
stroke_data_clean <- stroke_dataset %>% select(-c(id))

Exploratory Data Analysis

Inspect proportion of data

stroke_data_clean %>% 
  inspect_cat() %>% 
  show_plot()

Data target looked imbalance and could cause bias toward healthy patient. We will upsample the data so the target will have balance proportions

Inspect Numerical Distribution

stroke_data_clean %>% 
  inspect_num() %>% 
  show_plot()

It seems only avg_glukose_level that doesn have normal data distribution. Next we gonna log the avg_glukose_level

Log The Data Distribution

stroke_data_clean$avg_glucose_level <- log(stroke_data_clean$avg_glucose_level)
stroke_data_clean %>% 
  inspect_num() %>% 
  show_plot()

Now the avg_glukose_level look more to normal distribution

Split data

we are gonna split data into train and test, with the proportion of 80:20

RNGkind(sample.kind = "Rounding") 
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(417) 
stroke_split <- sample(nrow(stroke_data_clean), nrow(stroke_data_clean)*0.80) 
stroke_train <- stroke_data_clean[stroke_split, ]
stroke_test <- stroke_data_clean[-stroke_split, ] 

Upsampling

We gonna balance only training data to make the model dont bias toward healthy class.

stroke_train <- upSample(stroke_train %>% select(-stroke),
                         stroke_train$stroke,
                         list = FALSE,
                         yname = "Stroke")

Plot train and test data distribution

Plot new train and test distribution

stroke_train %>% 
  inspect_cat() %>% 
  show_plot()

stroke_test %>% 
  inspect_cat() %>% 
  show_plot()

K-Fold Cross Validation

We Create K-fold Validation to make our model learn better, and we set it with 10 k with 3 repeats

train_k <- trainControl(method="repeatedcv", 
                        number=10, 
                        repeats=3)

Naive Bayes

The first is Naive Bayes Model, this will be baseline because of the model complexity that still easy to understand ## Create Model Using caret with method=“naivebayes” we create the model

naive_bayes_model <- train(Stroke ~ .,
      data = stroke_train,
      method="naive_bayes",
      trControl = train_k
      )
naive_bayes_model
## Naive Bayes 
## 
## 7772 samples
##   10 predictor
##    2 classes: 'Healthy', 'Stroke' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 6994, 6995, 6994, 6994, 6995, 6996, ... 
## Resampling results across tuning parameters:
## 
##   usekernel  Accuracy   Kappa     
##   FALSE      0.6338997  0.26780658
##    TRUE      0.5067303  0.01345271
## 
## Tuning parameter 'laplace' was held constant at a value of 0
## Tuning
##  parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were laplace = 0, usekernel = FALSE
##  and adjust = 1.

The model have accuracy of 0.63 with train sample on FALSE(“Healthy”) and 0.50 On True("Stroke)

Plot The Accuracy Of Each Class

plot(naive_bayes_model)

Prediction

predict_nai <- predict(naive_bayes_model, stroke_test)

Confusion Matrix

cm_nb <- confusionMatrix(predict_nai, stroke_test$stroke, positive = "Stroke")
cm_nb
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Healthy Stroke
##    Healthy     262      0
##    Stroke      713     47
##                                              
##                Accuracy : 0.3023             
##                  95% CI : (0.2743, 0.3315)   
##     No Information Rate : 0.954              
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : 0.0327             
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 1.00000            
##             Specificity : 0.26872            
##          Pos Pred Value : 0.06184            
##          Neg Pred Value : 1.00000            
##              Prevalence : 0.04599            
##          Detection Rate : 0.04599            
##    Detection Prevalence : 0.74364            
##       Balanced Accuracy : 0.63436            
##                                              
##        'Positive' Class : Stroke             
## 

The model seems to predict stroke more often that healthy, but because the test data having majority is healthy so it has very low accuracy. The model has perfect accuracy on Healthy class but suffer in stroke class. The model predict 262 people healthy and all of them are true, however when it come to stroke class the model predict 760 stroke class but only 0.061 % are correct. It seems the baseline model Naive Bayes failed to generalize in the training data.

Decision Tree

We gonna try Decission Tree from caret package ## Decission Tree Modeling

set.seed(417)

model_dt <- train(Stroke ~ ., 
                   data=stroke_train, 
                   method="ctree", 
                   trControl = train_k)
model_dt
## Conditional Inference Tree 
## 
## 7772 samples
##   10 predictor
##    2 classes: 'Healthy', 'Stroke' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 6995, 6994, 6994, 6995, 6994, 6995, ... 
## Resampling results across tuning parameters:
## 
##   mincriterion  Accuracy   Kappa    
##   0.01          0.9212589  0.8425170
##   0.50          0.9018285  0.8036553
##   0.99          0.8564956  0.7129885
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mincriterion = 0.01.

The model seems doing good work on training sample with optimal acuracy of 0.92 better than Naive Bayes counterpart.

Predict Decission Tree

predict_dt <- predict(model_dt, stroke_test)

Decission Tree Evaluation

dt_cm <- confusionMatrix(predict_dt, stroke_test$stroke, positive = "Stroke")
dt_cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Healthy Stroke
##    Healthy     809     23
##    Stroke      166     24
##                                              
##                Accuracy : 0.8151             
##                  95% CI : (0.7899, 0.8384)   
##     No Information Rate : 0.954              
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : 0.139              
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.51064            
##             Specificity : 0.82974            
##          Pos Pred Value : 0.12632            
##          Neg Pred Value : 0.97236            
##              Prevalence : 0.04599            
##          Detection Rate : 0.02348            
##    Detection Prevalence : 0.18591            
##       Balanced Accuracy : 0.67019            
##                                              
##        'Positive' Class : Stroke             
## 

The model has better accuracy than baseline model, with total accuracy 0.81. Model Decission Tree has better overal acuracy than baseline even thought train and test has different data distribution the model still perform with acceptable accuracy lower than training accuracy with difference 0.11 %. Not very much droped accuracy but it still can impove up to 0.11 % closing with proxy of bayes error with variance improvement.

Random Forest

To Reduce the variance error we try random forest model ## Random Forest Modelling We use caret package random forest with hyperparameter grid search. Here we wont use k-fold validation because of training time that needed too long to complete, we settle with simple cross validation.

model_rf <- train(Stroke ~ .,
                  data = stroke_train,
                  method = "rf")
model_rf
## Random Forest 
## 
## 7772 samples
##   10 predictor
##    2 classes: 'Healthy', 'Stroke' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 7772, 7772, 7772, 7772, 7772, 7772, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8670285  0.7340251
##    9    0.9873756  0.9747466
##   16    0.9809394  0.9618713
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 9.

Predict and Confusion Matrix

predict_rf <- predict(model_rf, stroke_test)
rf_cm <- confusionMatrix(predict_rf, stroke_test$stroke, positive="Stroke")
rf_cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Healthy Stroke
##    Healthy     962     46
##    Stroke       13      1
##                                           
##                Accuracy : 0.9423          
##                  95% CI : (0.9262, 0.9558)
##     No Information Rate : 0.954           
##     P-Value [Acc > NIR] : 0.9654          
##                                           
##                   Kappa : 0.0119          
##                                           
##  Mcnemar's Test P-Value : 0.00003099      
##                                           
##             Sensitivity : 0.0212766       
##             Specificity : 0.9866667       
##          Pos Pred Value : 0.0714286       
##          Neg Pred Value : 0.9543651       
##              Prevalence : 0.0459883       
##          Detection Rate : 0.0009785       
##    Detection Prevalence : 0.0136986       
##       Balanced Accuracy : 0.5039716       
##                                           
##        'Positive' Class : Stroke          
## 

The Model perform better than decission tree or baseline-naive bayes model, with accuracy up to 0.9432 % on test data. However the model did not have good understanding on stroke class with only 1 out of 13 are correct classes, but the model has solid/almost perfect score in class Healthy with only 0.02 % error rate, better than baseline and decision tree model. It also giving warning that model may overfit toward healthy class and underfit on stroke class.

Conclusion

The 3 model has each pros and cons but on overal the best model are the random forest with accuracy up to 0.94 on test data and 0.9809394 on train data.But with decission tree the model has better on predicting stroke class than any other model and rendom forest has better predicting on healthy class.

naive_bayes_train_acc <- naive_bayes_model$results$Accuracy
naive_bayes_train_acc1 <- sum(naive_bayes_train_acc)/2
naive_bayes_test_acc_sens <- cm_nb$byClass[ "Sensitivity"]
naive_bayes_test_acc_speci <- cm_nb$byClass[ "Specificity"]
coul <- brewer.pal(5, "Set2") 
model_name <- as.factor(c("Naive Bayes", "Decission Tree", "Random Forest"))
accuracy_list <- c(naive_bayes_train_acc1, model_dt$results[1, "Accuracy"], model_rf$results[2, "Accuracy"])
sensivity <- c(naive_bayes_test_acc_sens, dt_cm$byClass["Sensitivity"], rf_cm$byClass["Sensitivity"])
specifity <- c(naive_bayes_test_acc_speci, dt_cm$byClass["Specificity"], rf_cm$byClass["Specificity"])
result <- data.frame(model_name,accuracy_list, sensivity, specifity)
barplot(height=result$accuracy_list, names=result$model_name, main = "Accuracy Plot", col=coul)

barplot(height=result$specifity, names=result$model_name, main = "Specifity Plot", col=coul)

barplot(height=result$sensivity, names=result$model_name, main = "Sensivity Plot",col=coul)