Bussines Problem
Context
According to the World Health Organization (WHO) stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths. This dataset is used to predict whether a patient is likely to get stroke based on the input parameters like gender, age, various diseases, and smoking status. Each row in the data provides relavant information about the patient.
Attribute Information
- id: unique identifier
- gender: “Male”, “Female” or “Other”
- age: age of the patient
- hypertension: 0 if the patient doesn’t have hypertension, 1 if the patient has hypertension
- heart_disease: 0 if the patient doesn’t have any heart diseases, 1 if the patient has a heart disease
- ever_married: “No” or “Yes”
- work_type: “children”, “Govt_jov”, “Never_worked”, “Private” or “Self-employed”
- Residence_type: “Rural” or “Urban”
- avg_glucose_level: average glucose level in blood
- bmi: body mass index
- smoking_status: “formerly smoked”, “never smoked”, “smokes” or “Unknown”*
- stroke: 1 if the patient had a stroke or 0 if not *Note: “Unknown” in smoking_status means that the information is unavailable for this patient
Preparation
Load the library
library(tidyverse)
library(caret)
library(plotly)
library(data.table)
library(GGally)
library(car)
library(scales)
library(lmtest)
library(MLmetrics)
library(inspectdf)
options(scipen = 100, max.print = 1e+06)Read the dataset
stroke_dataset <- read.csv("data/healthcare-dataset-stroke-data.csv", stringsAsFactors = TRUE)
stroke_dataset <- stroke_dataset %>% mutate_at(c("hypertension", "heart_disease", "stroke"), as.factor) %>% mutate(bmi = as.numeric(bmi))
stroke_dataset$stroke <- ifelse(stroke_dataset$stroke == "1", "Stroke", "Healthy") %>% factor()
stroke_dataset$heart_disease <- ifelse(stroke_dataset$heart_disease == "1", "Heart disease", "No Heart disease")
stroke_dataset$hypertension <- ifelse(stroke_dataset$hypertension == "1", "Hypertension", "No Hypertension")
rmarkdown::paged_table((stroke_dataset))Checking dataset column types
str(stroke_dataset)## 'data.frame': 5110 obs. of 12 variables:
## $ id : int 9046 51676 31112 60182 1665 56669 53882 10434 27419 60491 ...
## $ gender : Factor w/ 3 levels "Female","Male",..: 2 1 2 1 1 2 2 1 1 1 ...
## $ age : num 67 61 80 49 79 81 74 69 59 78 ...
## $ hypertension : chr "No Hypertension" "No Hypertension" "No Hypertension" "No Hypertension" ...
## $ heart_disease : chr "Heart disease" "No Heart disease" "Heart disease" "No Heart disease" ...
## $ ever_married : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 1 2 2 ...
## $ work_type : Factor w/ 5 levels "children","Govt_job",..: 4 5 4 4 5 4 4 4 4 4 ...
## $ Residence_type : Factor w/ 2 levels "Rural","Urban": 2 1 1 2 1 2 1 2 1 2 ...
## $ avg_glucose_level: num 229 202 106 171 174 ...
## $ bmi : num 240 419 199 218 114 164 148 102 419 116 ...
## $ smoking_status : Factor w/ 4 levels "formerly smoked",..: 1 2 2 3 2 1 2 2 4 4 ...
## $ stroke : Factor w/ 2 levels "Healthy","Stroke": 2 2 2 2 2 2 2 2 2 2 ...
we are gonna remove some column:
- id = because it’s id of data, does not contain special information
Check na values
stroke_data_clean <- stroke_dataset %>% select(-c(id))
colSums(is.na(stroke_dataset))## id gender age hypertension
## 0 0 0 0
## heart_disease ever_married work_type Residence_type
## 0 0 0 0
## avg_glucose_level bmi smoking_status stroke
## 0 0 0 0
the data has zero na values. we can move on to next step
Summary of data
summary(stroke_dataset)## id gender age hypertension
## Min. : 67 Female:2994 Min. : 0.08 Length:5110
## 1st Qu.:17741 Male :2115 1st Qu.:25.00 Class :character
## Median :36932 Other : 1 Median :45.00 Mode :character
## Mean :36518 Mean :43.23
## 3rd Qu.:54682 3rd Qu.:61.00
## Max. :72940 Max. :82.00
## heart_disease ever_married work_type Residence_type
## Length:5110 No :1757 children : 687 Rural:2514
## Class :character Yes:3353 Govt_job : 657 Urban:2596
## Mode :character Never_worked : 22
## Private :2925
## Self-employed: 819
##
## avg_glucose_level bmi smoking_status stroke
## Min. : 55.12 Min. : 1.0 formerly smoked: 885 Healthy:4861
## 1st Qu.: 77.25 1st Qu.:112.0 never smoked :1892 Stroke : 249
## Median : 91.89 Median :158.0 smokes : 789
## Mean :106.15 Mean :172.2 Unknown :1544
## 3rd Qu.:114.09 3rd Qu.:214.0
## Max. :271.74 Max. :419.0
Exploratory Data Analysis
Inspect proportion of data
stroke_data_clean %>%
inspect_cat() %>%
show_plot()Inspect Numerical Distribution
stroke_data_clean %>%
inspect_num() %>%
show_plot() Some distribution look like normal distribution
From Only Stroke Patient What we Know
stroke_data_clean %>%
filter(stroke == "Stroke") %>%
select_if(~class(.) == 'factor') %>%
inspect_cat() %>%
show_plot()stroke_data_clean_cat <- stroke_data_clean %>% select_if(~class(.) == 'factor')We know something from stroke point of view:
- Majority of stroke patient are married, that mean when we use k means clustering the model will suspected with imbalance proportions
- Other variable seems to have pretty or reasonable balance except ever_married and work_type
Split data
we are gonna split data into train and test. with the proportion of 80:20
RNGkind(sample.kind = "Rounding") ## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(417)
stroke_split <- sample(nrow(stroke_data_clean_cat), nrow(stroke_data_clean_cat)*0.80)
stroke_train <- stroke_data_clean_cat[stroke_split, ]
stroke_test <- stroke_data_clean_cat[-stroke_split, ] Plot train and test data distribution
stroke_train %>%
inspect_cat() %>%
show_plot()stroke_test %>%
inspect_cat() %>%
show_plot()Logistic Regression
Make logistic regression model
model_glm <- glm(formula = stroke ~ ., data = stroke_train, family = "binomial")
summary(model_glm)##
## Call:
## glm(formula = stroke ~ ., family = "binomial", data = stroke_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.5091 -0.3686 -0.3231 -0.2093 3.3839
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.344361 0.747230 -7.152 0.000000000000854
## genderMale -0.001673 0.149082 -0.011 0.991044
## genderOther -12.134090 1455.397559 -0.008 0.993348
## ever_marriedYes 0.900152 0.232252 3.876 0.000106
## work_typeGovt_job 1.754571 0.777759 2.256 0.024075
## work_typeNever_worked -9.962160 352.788326 -0.028 0.977472
## work_typePrivate 1.912383 0.751371 2.545 0.010922
## work_typeSelf-employed 2.269847 0.764397 2.969 0.002983
## Residence_typeUrban 0.196425 0.146684 1.339 0.180538
## smoking_statusnever smoked -0.394729 0.187386 -2.107 0.035160
## smoking_statussmokes -0.318228 0.223262 -1.425 0.154054
## smoking_statusUnknown -0.377827 0.222508 -1.698 0.089501
##
## (Intercept) ***
## genderMale
## genderOther
## ever_marriedYes ***
## work_typeGovt_job *
## work_typeNever_worked
## work_typePrivate *
## work_typeSelf-employed **
## Residence_typeUrban
## smoking_statusnever smoked *
## smoking_statussmokes
## smoking_statusUnknown .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1608.9 on 4087 degrees of freedom
## Residual deviance: 1526.1 on 4076 degrees of freedom
## AIC: 1550.1
##
## Number of Fisher Scoring iterations: 14
Evaluation
Making prediction and create confusion matrix
predicted_stroke <- predict(object = model_glm, newdata = stroke_test)
predicted_stroke <- ifelse(predicted_stroke > 0.5,"Stroke","Healthy")
predicted_stroke <- as.factor(predicted_stroke)
confusionMatrix(predicted_stroke, stroke_test$stroke)## Warning in confusionMatrix.default(predicted_stroke, stroke_test$stroke): Levels
## are not in the same order for reference and data. Refactoring data to match.
## Confusion Matrix and Statistics
##
## Reference
## Prediction Healthy Stroke
## Healthy 975 47
## Stroke 0 0
##
## Accuracy : 0.954
## 95% CI : (0.9393, 0.966)
## No Information Rate : 0.954
## P-Value [Acc > NIR] : 0.5387
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 0.00000000001949
##
## Sensitivity : 1.000
## Specificity : 0.000
## Pos Pred Value : 0.954
## Neg Pred Value : NaN
## Prevalence : 0.954
## Detection Rate : 0.954
## Detection Prevalence : 1.000
## Balanced Accuracy : 0.500
##
## 'Positive' Class : Healthy
##
model has accuracy of 0.954 but specificity 0 that because the model did not guess any of the patient stroke instead predicting all patient healthy because majority of the data are healthy patient. Thus we need to balance train data for logistic regression increase the accuracy of minority class.
Balancing Class Data With Upsampling Method
set.seed(456)
train_up <- upSample(x = stroke_train %>% select(-stroke), # prediktor
y = stroke_train$stroke, # target
list = F, # supaya target dan prediktor jadi 1 table
yname = "stroke" # nama target
)
table(train_up$stroke)##
## Healthy Stroke
## 3886 3886
after upsampling data has balanced class proportions
Plot Distribution of upsampled data
train_up %>%
inspect_cat() %>%
show_plot() even though the target variable are balance other variable retained imbalance properties
Making New Linear Logistic Model With Upsampling Data
model_glm_1 <- glm(formula = stroke ~ ., data = train_up, family = "binomial")
summary(model_glm_1)##
## Call:
## glm(formula = stroke ~ ., family = "binomial", data = train_up)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6331 -1.1872 0.3904 1.0550 2.5059
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.74281 0.20993 -13.065 < 0.0000000000000002
## genderMale -0.03263 0.05011 -0.651 0.514937
## genderOther -12.99190 535.41118 -0.024 0.980641
## ever_marriedYes 0.94982 0.06722 14.129 < 0.0000000000000002
## work_typeGovt_job 2.08788 0.21572 9.679 < 0.0000000000000002
## work_typeNever_worked -10.60104 129.51224 -0.082 0.934763
## work_typePrivate 2.16865 0.20759 10.447 < 0.0000000000000002
## work_typeSelf-employed 2.57430 0.21377 12.042 < 0.0000000000000002
## Residence_typeUrban 0.24630 0.04857 5.071 0.00000039593
## smoking_statusnever smoked -0.37658 0.06530 -5.767 0.00000000808
## smoking_statussmokes -0.25953 0.07589 -3.420 0.000627
## smoking_statusUnknown -0.35267 0.07568 -4.660 0.00000316464
##
## (Intercept) ***
## genderMale
## genderOther
## ever_marriedYes ***
## work_typeGovt_job ***
## work_typeNever_worked
## work_typePrivate ***
## work_typeSelf-employed ***
## Residence_typeUrban ***
## smoking_statusnever smoked ***
## smoking_statussmokes ***
## smoking_statusUnknown ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10774 on 7771 degrees of freedom
## Residual deviance: 9760 on 7760 degrees of freedom
## AIC: 9784
##
## Number of Fisher Scoring iterations: 12
Evalute The Model
predicted_stroke_1 <- predict(object = model_glm_1, newdata = stroke_test)
predicted_stroke_1 <- ifelse(predicted_stroke_1 > 0.5,"Stroke","Healthy")
predicted_stroke_1 <- as.factor(predicted_stroke_1)
confusionMatrix(predicted_stroke_1, stroke_test$stroke, positive = "Stroke")## Confusion Matrix and Statistics
##
## Reference
## Prediction Healthy Stroke
## Healthy 838 36
## Stroke 137 11
##
## Accuracy : 0.8307
## 95% CI : (0.8063, 0.8532)
## No Information Rate : 0.954
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0462
##
## Mcnemar's Test P-Value : 0.00000000000002897
##
## Sensitivity : 0.23404
## Specificity : 0.85949
## Pos Pred Value : 0.07432
## Neg Pred Value : 0.95881
## Prevalence : 0.04599
## Detection Rate : 0.01076
## Detection Prevalence : 0.14481
## Balanced Accuracy : 0.54676
##
## 'Positive' Class : Stroke
##
Now The model more accurate predicting stroke people rather than labeling all of the data as healthy, but the accuracy suffer down to 0.83 and specificity increase to 0.85 but its specifity is fall into 0.23, it probably happen bacause many of the variable are imbalance proportions.
Model Improvement - Backward Elimination
backward <- step(object = model_glm_1, trace = 0)
summary(backward)##
## Call:
## glm(formula = stroke ~ ever_married + work_type + Residence_type +
## smoking_status, family = "binomial", data = train_up)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6259 -1.1806 0.3931 1.0615 2.5123
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.76292 0.20796 -13.286 < 0.0000000000000002
## ever_marriedYes 0.94922 0.06712 14.143 < 0.0000000000000002
## work_typeGovt_job 2.09255 0.21553 9.709 < 0.0000000000000002
## work_typeNever_worked -10.60274 129.55331 -0.082 0.934773
## work_typePrivate 2.17063 0.20752 10.460 < 0.0000000000000002
## work_typeSelf-employed 2.57814 0.21364 12.068 < 0.0000000000000002
## Residence_typeUrban 0.24718 0.04856 5.090 0.0000003580
## smoking_statusnever smoked -0.36889 0.06449 -5.720 0.0000000106
## smoking_statussmokes -0.25858 0.07589 -3.407 0.000656
## smoking_statusUnknown -0.34937 0.07560 -4.621 0.0000038104
##
## (Intercept) ***
## ever_marriedYes ***
## work_typeGovt_job ***
## work_typeNever_worked
## work_typePrivate ***
## work_typeSelf-employed ***
## Residence_typeUrban ***
## smoking_statusnever smoked ***
## smoking_statussmokes ***
## smoking_statusUnknown ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10774.3 on 7771 degrees of freedom
## Residual deviance: 9761.3 on 7762 degrees of freedom
## AIC: 9781.3
##
## Number of Fisher Scoring iterations: 12
Evaluate Improved Model
predicted_stroke_2 <- predict(object = backward, newdata = stroke_test)
predicted_stroke_2 <- ifelse(predicted_stroke_2 > 0.5,"Stroke","Healthy")
predicted_stroke_2 <- as.factor(predicted_stroke_2)
confusionMatrix(predicted_stroke_2, stroke_test$stroke, positive = "Stroke")## Confusion Matrix and Statistics
##
## Reference
## Prediction Healthy Stroke
## Healthy 834 36
## Stroke 141 11
##
## Accuracy : 0.8268
## 95% CI : (0.8022, 0.8495)
## No Information Rate : 0.954
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0433
##
## Mcnemar's Test P-Value : 0.000000000000005405
##
## Sensitivity : 0.23404
## Specificity : 0.85538
## Pos Pred Value : 0.07237
## Neg Pred Value : 0.95862
## Prevalence : 0.04599
## Detection Rate : 0.01076
## Detection Prevalence : 0.14873
## Balanced Accuracy : 0.54471
##
## 'Positive' Class : Stroke
##
the model did not improve at all instead the accuracy down to 0.826 so the first model seems better on overall.
KNN
Create new train and test data that has numerical category
stroke_split2 <- sample(nrow(stroke_data_clean), nrow(stroke_data_clean)*0.80)
stroke_train2 <- stroke_data_clean[stroke_split2, ] %>% select(c("age", "avg_glucose_level", "bmi", "stroke"))
stroke_test2 <- stroke_data_clean[-stroke_split2, ] %>% select(c("age", "avg_glucose_level", "bmi", "stroke"))Balance class data
set.seed(476)
train_up2 <- upSample(x = stroke_train2 %>% select(-stroke), # prediktor
y = stroke_train2$stroke, # target
list = F, # supaya target dan prediktor jadi 1 table
yname = "stroke" # nama target
)
table(train_up2$stroke)##
## Healthy Stroke
## 3888 3888
Scale train and test data
train_up_x <- train_up2 %>% select(-stroke) %>% scale()
train_up_y <- train_up2 %>% pull(stroke)
train_up_scale <- train_up2 %>% select(-stroke) %>% scale() %>% as.data.frame()
train_up_scale$stroke <- train_up_y
test_x <- stroke_test2 %>% select(-stroke) %>%
scale(center = attr(train_up_x, "scaled:center") ,
scale = attr(train_up_x, "scaled:scale"))
test_y <- stroke_test2 %>% pull(stroke)
test_scale <- test_x %>% as.data.frame() %>% mutate(stroke = test_y)Find Optimum K
sqrt(nrow(stroke_train2))## [1] 63.93747
model_knn2 <- knn(train=train_up_x, test = test_x, cl= train_up_y,k=63)
confusionMatrix(model_knn2, test_y, positive = "Stroke")## Confusion Matrix and Statistics
##
## Reference
## Prediction Healthy Stroke
## Healthy 659 13
## Stroke 314 36
##
## Accuracy : 0.68
## 95% CI : (0.6505, 0.7086)
## No Information Rate : 0.9521
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1052
##
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.73469
## Specificity : 0.67729
## Pos Pred Value : 0.10286
## Neg Pred Value : 0.98065
## Prevalence : 0.04795
## Detection Rate : 0.03523
## Detection Prevalence : 0.34247
## Balanced Accuracy : 0.70599
##
## 'Positive' Class : Stroke
##
it seems the model did not have good accuracy maybe because of k that too high we will try using other k.
Finding Another Optimum K
model_knn <- train(stroke ~ ., data=train_up_scale, method = "knn")
model_knn## k-Nearest Neighbors
##
## 7776 samples
## 3 predictor
## 2 classes: 'Healthy', 'Stroke'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 7776, 7776, 7776, 7776, 7776, 7776, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.9139253 0.8278035
## 7 0.8943706 0.7886775
## 9 0.8777029 0.7553145
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
The model train on many k values one of the best k value is 5 because it reached highest accuracy in train data
Confussion Matrix
predict_knn <- predict(model_knn, newdata = test_scale)
confusionMatrix(predict_knn, test_y, positive = "Stroke")## Confusion Matrix and Statistics
##
## Reference
## Prediction Healthy Stroke
## Healthy 813 28
## Stroke 160 21
##
## Accuracy : 0.816
## 95% CI : (0.7909, 0.8394)
## No Information Rate : 0.9521
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1159
##
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.42857
## Specificity : 0.83556
## Pos Pred Value : 0.11602
## Neg Pred Value : 0.96671
## Prevalence : 0.04795
## Detection Rate : 0.02055
## Detection Prevalence : 0.17710
## Balanced Accuracy : 0.63207
##
## 'Positive' Class : Stroke
##
The model of knn has overall lower accuracy from logistic regression but better on sensitivity.