Business Problem
Context
According to the World Health Organization (WHO), stroke is the second leading cause of death globally, responsible for approximately 11% of total deaths. This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about the patient.
Attribute Information
- id: unique identifier
- gender: “Male”, “Female” or “Other”
- age: age of the patient
- hypertension: 0 if the patient doesn’t have hypertension, 1 if the patient has hypertension
- heart_disease: 0 if the patient doesn’t have any heart diseases, 1 if the patient has a heart disease
- ever_married: “No” or “Yes”
- work_type: “children”, “Govt_job”, “Never_worked”, “Private” or “Self-employed”
- Residence_type: “Rural” or “Urban”
- avg_glucose_level: average glucose level in blood
- bmi: body mass index
- smoking_status: “formerly smoked”, “never smoked”, “smokes” or “Unknown”*
- stroke: 1 if the patient had a stroke or 0 if not
*Note: “Unknown” in smoking_status means that the information is unavailable for this patient.
Preparation
Load the libraries
library(tidyverse)
library(caret)
library(plotly)
library(data.table)
library(GGally)
library(car)
library(scales)
library(lmtest)
library(MLmetrics)
library(inspectdf)
options(scipen = 100, max.print = 1e+06)
Read The Data
We load the data and recode the 0/1 indicator columns into more readable values.
stroke_dataset <- read.csv("data/healthcare-dataset-stroke-data.csv", stringsAsFactors = TRUE)
# Convert via character: as.numeric() on a factor returns the level codes, not
# the actual BMI values ("N/A" strings become NA under the correct conversion)
stroke_dataset <- stroke_dataset %>% mutate(bmi = as.numeric(as.character(bmi)))
stroke_dataset$stroke <- ifelse(stroke_dataset$stroke == "1", "Stroke", "Healthy") %>% factor()
stroke_dataset$heart_disease <- ifelse(stroke_dataset$heart_disease == "1", "Heart disease", "No Heart disease") %>% factor()
stroke_dataset$hypertension <- ifelse(stroke_dataset$hypertension == "1", "Hypertension", "No Hypertension") %>% factor()
rmarkdown::paged_table(stroke_dataset)
Check NA Values
Check for NA values in the dataset:
colSums(is.na(stroke_dataset))
## id gender age hypertension
## 0 0 0 0
## heart_disease ever_married work_type Residence_type
## 0 0 0 0
## avg_glucose_level bmi smoking_status stroke
## 0 0 0 0
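If bmi is converted through character as above, the “N/A” strings in the raw file become real NA values, so this zero count may not hold on a fresh run. A minimal sketch, assuming median imputation is an acceptable way to handle them:
# Hypothetical handling: replace any NA bmi values with the column median
stroke_dataset$bmi[is.na(stroke_dataset$bmi)] <- median(stroke_dataset$bmi, na.rm = TRUE)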
Summary
summary(stroke_dataset)
## id gender age hypertension
## Min. : 67 Female:2994 Min. : 0.08 Hypertension : 498
## 1st Qu.:17741 Male :2115 1st Qu.:25.00 No Hypertension:4612
## Median :36932 Other : 1 Median :45.00
## Mean :36518 Mean :43.23
## 3rd Qu.:54682 3rd Qu.:61.00
## Max. :72940 Max. :82.00
## heart_disease ever_married work_type Residence_type
## Heart disease : 276 No :1757 children : 687 Rural:2514
## No Heart disease:4834 Yes:3353 Govt_job : 657 Urban:2596
## Never_worked : 22
## Private :2925
## Self-employed: 819
##
## avg_glucose_level bmi smoking_status stroke
## Min. : 55.12 Min. : 1.0 formerly smoked: 885 Healthy:4861
## 1st Qu.: 77.25 1st Qu.:112.0 never smoked :1892 Stroke : 249
## Median : 91.89 Median :158.0 smokes : 789
## Mean :106.15 Mean :172.2 Unknown :1544
## 3rd Qu.:114.09 3rd Qu.:214.0
## Max. :271.74 Max. :419.0
stroke_data_clean <- stroke_dataset %>% select(-c(id))
Exploratory Data Analysis
Inspect the proportions of the categorical data
stroke_data_clean %>%
inspect_cat() %>%
show_plot()
The target classes look imbalanced, which could bias the model toward healthy patients. We will upsample the training data so the target has balanced proportions.
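To quantify the imbalance before resampling, a quick check of the target proportions (a minimal base-R sketch):
# Share of each target class; the Stroke class is expected to be a small minority
prop.table(table(stroke_data_clean$stroke))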
Inspect Numerical Distribution
stroke_data_clean %>%
inspect_num() %>%
show_plot()
Only avg_glucose_level appears not to have a normal distribution; it is right-skewed. Next we will log-transform avg_glucose_level.
Log-Transform avg_glucose_level
stroke_data_clean$avg_glucose_level <- log(stroke_data_clean$avg_glucose_level)
stroke_data_clean %>%
inspect_num() %>%
show_plot()
After the log transform, avg_glucose_level looks much closer to a normal distribution.
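To confirm this numerically rather than by eye, one can compare the sample skewness; a minimal sketch with a hand-rolled moment-based estimator (no extra package assumed):
# Skewness near 0 indicates a roughly symmetric distribution
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
skewness(stroke_data_clean$avg_glucose_level) # computed after the log transform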
Split data
We split the data into train and test sets with an 80:20 proportion.
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(417)
stroke_split <- sample(nrow(stroke_data_clean), nrow(stroke_data_clean)*0.80)
stroke_train <- stroke_data_clean[stroke_split, ]
stroke_test <- stroke_data_clean[-stroke_split, ]
Upsampling
We balance only the training data so the model is not biased toward the healthy class; the test set keeps its original distribution.
stroke_train <- upSample(x = stroke_train %>% select(-stroke),
                         y = stroke_train$stroke,
                         list = FALSE,
                         yname = "Stroke")
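A quick sanity check that upSample balanced the target (note it renames the target column to the yname we supplied):
# Both classes should now have equal counts in the training set
table(stroke_train$Stroke)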
Plot train and test data distribution
Plot the new train and test class distributions to confirm the training set is balanced while the test set keeps its original proportions:
stroke_train %>%
inspect_cat() %>%
show_plot()
stroke_test %>%
inspect_cat() %>%
show_plot()
K-Fold Cross Validation
We create a k-fold cross-validation scheme so the model is evaluated on multiple resamples, using k = 10 with 3 repeats.
train_k <- trainControl(method = "repeatedcv",
                        number = 10,
                        repeats = 3)
Naive Bayes
The first model is Naive Bayes; it will serve as the baseline because its low complexity keeps it easy to interpret.
Create Model
Using caret with method = "naive_bayes", we create the model:
naive_bayes_model <- train(Stroke ~ .,
data = stroke_train,
method="naive_bayes",
trControl = train_k
)
naive_bayes_model
## Naive Bayes
##
## 7772 samples
## 10 predictor
## 2 classes: 'Healthy', 'Stroke'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 6994, 6995, 6994, 6994, 6995, 6996, ...
## Resampling results across tuning parameters:
##
## usekernel Accuracy Kappa
## FALSE 0.6338997 0.26780658
## TRUE 0.5067303 0.01345271
##
## Tuning parameter 'laplace' was held constant at a value of 0
## Tuning
## parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were laplace = 0, usekernel = FALSE
## and adjust = 1.
On the training resamples the model reaches an accuracy of 0.63 with usekernel = FALSE and only 0.51 with usekernel = TRUE, so the non-kernel variant is selected.
Plot the Accuracy Across Tuning Parameters
plot(naive_bayes_model)
Prediction
predict_nai <- predict(naive_bayes_model, stroke_test)
Confusion Matrix
cm_nb <- confusionMatrix(predict_nai, stroke_test$stroke, positive = "Stroke")
cm_nb
## Confusion Matrix and Statistics
##
## Reference
## Prediction Healthy Stroke
## Healthy 262 0
## Stroke 713 47
##
## Accuracy : 0.3023
## 95% CI : (0.2743, 0.3315)
## No Information Rate : 0.954
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0327
##
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 1.00000
## Specificity : 0.26872
## Pos Pred Value : 0.06184
## Neg Pred Value : 1.00000
## Prevalence : 0.04599
## Detection Rate : 0.04599
## Detection Prevalence : 0.74364
## Balanced Accuracy : 0.63436
##
## 'Positive' Class : Stroke
##
The model predicts Stroke far more often than Healthy, and because the test data is overwhelmingly healthy this drives overall accuracy down to 0.30. Everything it labels Healthy is correct (262 patients, negative predictive value 1.00) and it catches every actual stroke (sensitivity 1.00), but of the 760 patients it labels Stroke only 47, about 6.2%, are correct. The baseline Naive Bayes model fails to generalize from the upsampled training data.
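For reference, the per-class numbers quoted above can be pulled straight from the caret confusion-matrix object:
# Sensitivity, specificity, and predictive values for the Stroke-positive setup
cm_nb$byClass[c("Sensitivity", "Specificity", "Pos Pred Value", "Neg Pred Value")]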
Decision Tree
We will try a decision tree from the caret package.
Decision Tree Modeling
set.seed(417)
model_dt <- train(Stroke ~ .,
data=stroke_train,
method="ctree",
trControl = train_k)
model_dt
## Conditional Inference Tree
##
## 7772 samples
## 10 predictor
## 2 classes: 'Healthy', 'Stroke'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 6995, 6994, 6994, 6995, 6994, 6995, ...
## Resampling results across tuning parameters:
##
## mincriterion Accuracy Kappa
## 0.01 0.9212589 0.8425170
## 0.50 0.9018285 0.8036553
## 0.99 0.8564956 0.7129885
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mincriterion = 0.01.
The model does well on the training resamples, with an optimal accuracy of 0.92, better than its Naive Bayes counterpart.
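Since method = "ctree" fits a conditional inference tree, the selected tree can also be drawn for inspection (a sketch; a deep tree may render poorly):
# Visualize the final tree caret selected at mincriterion = 0.01
plot(model_dt$finalModel)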
Predict Decision Tree
predict_dt <- predict(model_dt, stroke_test)
Decision Tree Evaluation
dt_cm <- confusionMatrix(predict_dt, stroke_test$stroke, positive = "Stroke")
dt_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction Healthy Stroke
## Healthy 809 23
## Stroke 166 24
##
## Accuracy : 0.8151
## 95% CI : (0.7899, 0.8384)
## No Information Rate : 0.954
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.139
##
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.51064
## Specificity : 0.82974
## Pos Pred Value : 0.12632
## Neg Pred Value : 0.97236
## Prevalence : 0.04599
## Detection Rate : 0.02348
## Detection Prevalence : 0.18591
## Balanced Accuracy : 0.67019
##
## 'Positive' Class : Stroke
##
The decision tree beats the baseline with a total test accuracy of 0.81. Even though the upsampled training data and the test data have different class distributions, the model still performs acceptably: test accuracy sits about 0.11 below the 0.92 training accuracy. That gap suggests roughly 0.11 of headroom that variance reduction could recover, moving us closer to a proxy for the Bayes error.
Random Forest
To reduce the variance error we try a random forest model.
Random Forest Modeling
We use caret's random forest, which tunes mtry over a small built-in grid (an explicit-grid alternative is sketched below). Because repeated k-fold validation would take too long to train, we settle for caret's default bootstrap resampling.
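For completeness, here is a hedged sketch of how an explicit mtry grid could be combined with a cheaper resampling scheme; the mtry values and the 5-fold setting are illustrative, not part of the original analysis, and the model actually evaluated below uses caret's defaults:
# Hypothetical alternative: explicit mtry grid with a single 5-fold CV
rf_grid <- expand.grid(mtry = c(2, 4, 8, 16)) # illustrative values
rf_ctrl <- trainControl(method = "cv", number = 5)
model_rf_tuned <- train(Stroke ~ .,
                        data = stroke_train,
                        method = "rf",
                        tuneGrid = rf_grid,
                        trControl = rf_ctrl)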
model_rf <- train(Stroke ~ .,
data = stroke_train,
method = "rf")
model_rf
## Random Forest
##
## 7772 samples
## 10 predictor
## 2 classes: 'Healthy', 'Stroke'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 7772, 7772, 7772, 7772, 7772, 7772, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8670285 0.7340251
## 9 0.9873756 0.9747466
## 16 0.9809394 0.9618713
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 9.
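Random forests also expose variable importance, which helps explain which predictors drive the fit (a quick sketch using caret's varImp):
# Rank predictors by their importance in the fitted forest
plot(varImp(model_rf))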
Predict and Confusion Matrix
predict_rf <- predict(model_rf, stroke_test)
rf_cm <- confusionMatrix(predict_rf, stroke_test$stroke, positive="Stroke")
rf_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction Healthy Stroke
## Healthy 962 46
## Stroke 13 1
##
## Accuracy : 0.9423
## 95% CI : (0.9262, 0.9558)
## No Information Rate : 0.954
## P-Value [Acc > NIR] : 0.9654
##
## Kappa : 0.0119
##
## Mcnemar's Test P-Value : 0.00003099
##
## Sensitivity : 0.0212766
## Specificity : 0.9866667
## Pos Pred Value : 0.0714286
## Neg Pred Value : 0.9543651
## Prevalence : 0.0459883
## Detection Rate : 0.0009785
## Detection Prevalence : 0.0136986
## Balanced Accuracy : 0.5039716
##
## 'Positive' Class : Stroke
##
The model outperforms the decision tree and the Naive Bayes baseline, with a test accuracy of 0.9423. However, it barely recognizes the stroke class: of the 14 patients it labels Stroke only 1 is correct, and it detects just 1 of the 47 actual strokes (sensitivity 0.02). On the Healthy class it is nearly perfect, with a specificity of 0.987 (an error rate of about 1.3%), better than the baseline and the decision tree. This is a warning sign that the model overfits toward the healthy class while effectively ignoring the stroke class.
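One hedged way to trade some specificity for sensitivity is to classify from predicted probabilities with a lower cutoff for the Stroke class instead of the default 0.5 label; the 0.3 threshold below is purely illustrative:
# Predict class probabilities, then apply a custom threshold for "Stroke"
prob_rf <- predict(model_rf, stroke_test, type = "prob")
predict_rf_adj <- factor(ifelse(prob_rf$Stroke > 0.3, "Stroke", "Healthy"),
                         levels = levels(stroke_test$stroke))
confusionMatrix(predict_rf_adj, stroke_test$stroke, positive = "Stroke")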
Conclusion
Each of the three models has its pros and cons, but overall the best model is the random forest, with accuracy up to 0.94 on test data and about 0.987 on the training resamples. The decision tree, however, is better at predicting the stroke class than any other model, while the random forest is better at predicting the healthy class.
library(RColorBrewer) # provides brewer.pal(), used for the bar colours below
# Average the Naive Bayes resample accuracies over its two usekernel settings
naive_bayes_train_acc <- naive_bayes_model$results$Accuracy
naive_bayes_train_acc1 <- mean(naive_bayes_train_acc)
naive_bayes_test_acc_sens <- cm_nb$byClass["Sensitivity"]
naive_bayes_test_acc_speci <- cm_nb$byClass["Specificity"]
coul <- brewer.pal(5, "Set2")
model_name <- as.factor(c("Naive Bayes", "Decision Tree", "Random Forest"))
accuracy_list <- c(naive_bayes_train_acc1, model_dt$results[1, "Accuracy"], model_rf$results[2, "Accuracy"])
sensitivity <- c(naive_bayes_test_acc_sens, dt_cm$byClass["Sensitivity"], rf_cm$byClass["Sensitivity"])
specificity <- c(naive_bayes_test_acc_speci, dt_cm$byClass["Specificity"], rf_cm$byClass["Specificity"])
result <- data.frame(model_name, accuracy_list, sensitivity, specificity)
barplot(height = result$accuracy_list, names.arg = result$model_name, main = "Accuracy Plot", col = coul)
barplot(height = result$specificity, names.arg = result$model_name, main = "Specificity Plot", col = coul)
barplot(height = result$sensitivity, names.arg = result$model_name, main = "Sensitivity Plot", col = coul)
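For a tabular view alongside the bar plots, the same result frame can be paged as before (a small sketch reusing rmarkdown::paged_table):
rmarkdown::paged_table(result)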