LBB RM: CLassification Model For Stroke Problem

Musthofa Syarifudin

2021-04-28

Bussines Problem

Context

According to the World Health Organization (WHO) stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths. This dataset is used to predict whether a patient is likely to get stroke based on the input parameters like gender, age, various diseases, and smoking status. Each row in the data provides relavant information about the patient.

Attribute Information

id: unique identifier
gender: “Male”, “Female” or “Other”
age: age of the patient
hypertension: 0 if the patient doesn’t have hypertension, 1 if the patient has hypertension
heart_disease: 0 if the patient doesn’t have any heart diseases, 1 if the patient has a heart disease
ever_married: “No” or “Yes”
work_type: “children”, “Govt_jov”, “Never_worked”, “Private” or “Self-employed”
Residence_type: “Rural” or “Urban”
avg_glucose_level: average glucose level in blood
bmi: body mass index
smoking_status: “formerly smoked”, “never smoked”, “smokes” or “Unknown”*
stroke: 1 if the patient had a stroke or 0 if not *Note: “Unknown” in smoking_status means that the information is unavailable for this patient

Preparation

Load the library

library(tidyverse)
library(caret)
library(plotly)
library(data.table)
library(GGally)
library(car)
library(scales)
library(lmtest)
library(MLmetrics)
library(inspectdf)
options(scipen = 100, max.print = 1e+06)

Read the dataset

stroke_dataset <- read.csv("data/healthcare-dataset-stroke-data.csv", stringsAsFactors = TRUE)

stroke_dataset <- stroke_dataset %>% mutate_at(c("hypertension", "heart_disease", "stroke"), as.factor) %>% mutate(bmi = as.numeric(bmi))
stroke_dataset$stroke <- ifelse(stroke_dataset$stroke == "1", "Stroke", "Healthy") %>% factor()
stroke_dataset$heart_disease <- ifelse(stroke_dataset$heart_disease == "1", "Heart disease", "No Heart disease")
stroke_dataset$hypertension <- ifelse(stroke_dataset$hypertension == "1", "Hypertension", "No Hypertension")
rmarkdown::paged_table((stroke_dataset))

Checking dataset column types

str(stroke_dataset)

## 'data.frame':    5110 obs. of  12 variables:
##  $ id               : int  9046 51676 31112 60182 1665 56669 53882 10434 27419 60491 ...
##  $ gender           : Factor w/ 3 levels "Female","Male",..: 2 1 2 1 1 2 2 1 1 1 ...
##  $ age              : num  67 61 80 49 79 81 74 69 59 78 ...
##  $ hypertension     : chr  "No Hypertension" "No Hypertension" "No Hypertension" "No Hypertension" ...
##  $ heart_disease    : chr  "Heart disease" "No Heart disease" "Heart disease" "No Heart disease" ...
##  $ ever_married     : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 1 2 2 ...
##  $ work_type        : Factor w/ 5 levels "children","Govt_job",..: 4 5 4 4 5 4 4 4 4 4 ...
##  $ Residence_type   : Factor w/ 2 levels "Rural","Urban": 2 1 1 2 1 2 1 2 1 2 ...
##  $ avg_glucose_level: num  229 202 106 171 174 ...
##  $ bmi              : num  240 419 199 218 114 164 148 102 419 116 ...
##  $ smoking_status   : Factor w/ 4 levels "formerly smoked",..: 1 2 2 3 2 1 2 2 4 4 ...
##  $ stroke           : Factor w/ 2 levels "Healthy","Stroke": 2 2 2 2 2 2 2 2 2 2 ...

we are gonna remove some column:

id = because it’s id of data, does not contain special information

Check na values

stroke_data_clean <- stroke_dataset %>% select(-c(id))
colSums(is.na(stroke_dataset))

##                id            gender               age      hypertension 
##                 0                 0                 0                 0 
##     heart_disease      ever_married         work_type    Residence_type 
##                 0                 0                 0                 0 
## avg_glucose_level               bmi    smoking_status            stroke 
##                 0                 0                 0                 0

the data has zero na values. we can move on to next step

Summary of data

summary(stroke_dataset)

##        id           gender          age        hypertension      
##  Min.   :   67   Female:2994   Min.   : 0.08   Length:5110       
##  1st Qu.:17741   Male  :2115   1st Qu.:25.00   Class :character  
##  Median :36932   Other :   1   Median :45.00   Mode  :character  
##  Mean   :36518                 Mean   :43.23                     
##  3rd Qu.:54682                 3rd Qu.:61.00                     
##  Max.   :72940                 Max.   :82.00                     
##  heart_disease      ever_married         work_type    Residence_type
##  Length:5110        No :1757     children     : 687   Rural:2514    
##  Class :character   Yes:3353     Govt_job     : 657   Urban:2596    
##  Mode  :character                Never_worked :  22                 
##                                  Private      :2925                 
##                                  Self-employed: 819                 
##                                                                     
##  avg_glucose_level      bmi                smoking_status     stroke    
##  Min.   : 55.12    Min.   :  1.0   formerly smoked: 885   Healthy:4861  
##  1st Qu.: 77.25    1st Qu.:112.0   never smoked   :1892   Stroke : 249  
##  Median : 91.89    Median :158.0   smokes         : 789                 
##  Mean   :106.15    Mean   :172.2   Unknown        :1544                 
##  3rd Qu.:114.09    3rd Qu.:214.0                                        
##  Max.   :271.74    Max.   :419.0

Exploratory Data Analysis

Inspect proportion of data

stroke_data_clean %>% 
  inspect_cat() %>% 
  show_plot()

Inspect Numerical Distribution

stroke_data_clean %>% 
  inspect_num() %>% 
  show_plot()

Some distribution look like normal distribution

From Only Stroke Patient What we Know

stroke_data_clean %>% 
  filter(stroke == "Stroke") %>% 
 select_if(~class(.) == 'factor') %>% 
  inspect_cat() %>% 
  show_plot()

stroke_data_clean_cat <- stroke_data_clean %>% select_if(~class(.) == 'factor')

We know something from stroke point of view:

Majority of stroke patient are married, that mean when we use k means clustering the model will suspected with imbalance proportions
Other variable seems to have pretty or reasonable balance except ever_married and work_type

Split data

we are gonna split data into train and test. with the proportion of 80:20

RNGkind(sample.kind = "Rounding")

## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

set.seed(417) 
stroke_split <- sample(nrow(stroke_data_clean_cat), nrow(stroke_data_clean_cat)*0.80) 
stroke_train <- stroke_data_clean_cat[stroke_split, ]
stroke_test <- stroke_data_clean_cat[-stroke_split, ]

Plot train and test data distribution

stroke_train %>% 
  inspect_cat() %>% 
  show_plot()

stroke_test %>% 
  inspect_cat() %>% 
  show_plot()

Logistic Regression

Make logistic regression model

model_glm <- glm(formula = stroke ~ ., data = stroke_train, family = "binomial")
summary(model_glm)

## 
## Call:
## glm(formula = stroke ~ ., family = "binomial", data = stroke_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.5091  -0.3686  -0.3231  -0.2093   3.3839  
## 
## Coefficients:
##                               Estimate  Std. Error z value          Pr(>|z|)
## (Intercept)                  -5.344361    0.747230  -7.152 0.000000000000854
## genderMale                   -0.001673    0.149082  -0.011          0.991044
## genderOther                 -12.134090 1455.397559  -0.008          0.993348
## ever_marriedYes               0.900152    0.232252   3.876          0.000106
## work_typeGovt_job             1.754571    0.777759   2.256          0.024075
## work_typeNever_worked        -9.962160  352.788326  -0.028          0.977472
## work_typePrivate              1.912383    0.751371   2.545          0.010922
## work_typeSelf-employed        2.269847    0.764397   2.969          0.002983
## Residence_typeUrban           0.196425    0.146684   1.339          0.180538
## smoking_statusnever smoked   -0.394729    0.187386  -2.107          0.035160
## smoking_statussmokes         -0.318228    0.223262  -1.425          0.154054
## smoking_statusUnknown        -0.377827    0.222508  -1.698          0.089501
##                               
## (Intercept)                ***
## genderMale                    
## genderOther                   
## ever_marriedYes            ***
## work_typeGovt_job          *  
## work_typeNever_worked         
## work_typePrivate           *  
## work_typeSelf-employed     ** 
## Residence_typeUrban           
## smoking_statusnever smoked *  
## smoking_statussmokes          
## smoking_statusUnknown      .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1608.9  on 4087  degrees of freedom
## Residual deviance: 1526.1  on 4076  degrees of freedom
## AIC: 1550.1
## 
## Number of Fisher Scoring iterations: 14

Evaluation

Making prediction and create confusion matrix

predicted_stroke <- predict(object = model_glm, newdata = stroke_test)
predicted_stroke <- ifelse(predicted_stroke > 0.5,"Stroke","Healthy")
predicted_stroke <- as.factor(predicted_stroke)
confusionMatrix(predicted_stroke, stroke_test$stroke)

## Warning in confusionMatrix.default(predicted_stroke, stroke_test$stroke): Levels
## are not in the same order for reference and data. Refactoring data to match.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Healthy Stroke
##    Healthy     975     47
##    Stroke        0      0
##                                           
##                Accuracy : 0.954           
##                  95% CI : (0.9393, 0.966) 
##     No Information Rate : 0.954           
##     P-Value [Acc > NIR] : 0.5387          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 0.00000000001949
##                                           
##             Sensitivity : 1.000           
##             Specificity : 0.000           
##          Pos Pred Value : 0.954           
##          Neg Pred Value :   NaN           
##              Prevalence : 0.954           
##          Detection Rate : 0.954           
##    Detection Prevalence : 1.000           
##       Balanced Accuracy : 0.500           
##                                           
##        'Positive' Class : Healthy         
##

model has accuracy of 0.954 but specificity 0 that because the model did not guess any of the patient stroke instead predicting all patient healthy because majority of the data are healthy patient. Thus we need to balance train data for logistic regression increase the accuracy of minority class.

Balancing Class Data With Upsampling Method

set.seed(456)
train_up <- upSample(x = stroke_train %>% select(-stroke), # prediktor
                     y = stroke_train$stroke,  # target
                     list = F, # supaya target dan prediktor jadi 1 table
                     yname = "stroke" # nama target
                     )
table(train_up$stroke)

## 
## Healthy  Stroke 
##    3886    3886

after upsampling data has balanced class proportions

Plot Distribution of upsampled data

train_up %>% 
  inspect_cat() %>% 
  show_plot()

even though the target variable are balance other variable retained imbalance properties

Making New Linear Logistic Model With Upsampling Data

model_glm_1 <- glm(formula = stroke ~ ., data = train_up, family = "binomial")
summary(model_glm_1)

## 
## Call:
## glm(formula = stroke ~ ., family = "binomial", data = train_up)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6331  -1.1872   0.3904   1.0550   2.5059  
## 
## Coefficients:
##                             Estimate Std. Error z value             Pr(>|z|)
## (Intercept)                 -2.74281    0.20993 -13.065 < 0.0000000000000002
## genderMale                  -0.03263    0.05011  -0.651             0.514937
## genderOther                -12.99190  535.41118  -0.024             0.980641
## ever_marriedYes              0.94982    0.06722  14.129 < 0.0000000000000002
## work_typeGovt_job            2.08788    0.21572   9.679 < 0.0000000000000002
## work_typeNever_worked      -10.60104  129.51224  -0.082             0.934763
## work_typePrivate             2.16865    0.20759  10.447 < 0.0000000000000002
## work_typeSelf-employed       2.57430    0.21377  12.042 < 0.0000000000000002
## Residence_typeUrban          0.24630    0.04857   5.071        0.00000039593
## smoking_statusnever smoked  -0.37658    0.06530  -5.767        0.00000000808
## smoking_statussmokes        -0.25953    0.07589  -3.420             0.000627
## smoking_statusUnknown       -0.35267    0.07568  -4.660        0.00000316464
##                               
## (Intercept)                ***
## genderMale                    
## genderOther                   
## ever_marriedYes            ***
## work_typeGovt_job          ***
## work_typeNever_worked         
## work_typePrivate           ***
## work_typeSelf-employed     ***
## Residence_typeUrban        ***
## smoking_statusnever smoked ***
## smoking_statussmokes       ***
## smoking_statusUnknown      ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10774  on 7771  degrees of freedom
## Residual deviance:  9760  on 7760  degrees of freedom
## AIC: 9784
## 
## Number of Fisher Scoring iterations: 12

Evalute The Model

predicted_stroke_1 <- predict(object = model_glm_1, newdata = stroke_test)
predicted_stroke_1 <- ifelse(predicted_stroke_1 > 0.5,"Stroke","Healthy")
predicted_stroke_1 <- as.factor(predicted_stroke_1)
confusionMatrix(predicted_stroke_1, stroke_test$stroke, positive = "Stroke")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Healthy Stroke
##    Healthy     838     36
##    Stroke      137     11
##                                              
##                Accuracy : 0.8307             
##                  95% CI : (0.8063, 0.8532)   
##     No Information Rate : 0.954              
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : 0.0462             
##                                              
##  Mcnemar's Test P-Value : 0.00000000000002897
##                                              
##             Sensitivity : 0.23404            
##             Specificity : 0.85949            
##          Pos Pred Value : 0.07432            
##          Neg Pred Value : 0.95881            
##              Prevalence : 0.04599            
##          Detection Rate : 0.01076            
##    Detection Prevalence : 0.14481            
##       Balanced Accuracy : 0.54676            
##                                              
##        'Positive' Class : Stroke             
##

Now The model more accurate predicting stroke people rather than labeling all of the data as healthy, but the accuracy suffer down to 0.83 and specificity increase to 0.85 but its specifity is fall into 0.23, it probably happen bacause many of the variable are imbalance proportions.

Model Improvement - Backward Elimination

backward <- step(object = model_glm_1, trace = 0)
summary(backward)

## 
## Call:
## glm(formula = stroke ~ ever_married + work_type + Residence_type + 
##     smoking_status, family = "binomial", data = train_up)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6259  -1.1806   0.3931   1.0615   2.5123  
## 
## Coefficients:
##                             Estimate Std. Error z value             Pr(>|z|)
## (Intercept)                 -2.76292    0.20796 -13.286 < 0.0000000000000002
## ever_marriedYes              0.94922    0.06712  14.143 < 0.0000000000000002
## work_typeGovt_job            2.09255    0.21553   9.709 < 0.0000000000000002
## work_typeNever_worked      -10.60274  129.55331  -0.082             0.934773
## work_typePrivate             2.17063    0.20752  10.460 < 0.0000000000000002
## work_typeSelf-employed       2.57814    0.21364  12.068 < 0.0000000000000002
## Residence_typeUrban          0.24718    0.04856   5.090         0.0000003580
## smoking_statusnever smoked  -0.36889    0.06449  -5.720         0.0000000106
## smoking_statussmokes        -0.25858    0.07589  -3.407             0.000656
## smoking_statusUnknown       -0.34937    0.07560  -4.621         0.0000038104
##                               
## (Intercept)                ***
## ever_marriedYes            ***
## work_typeGovt_job          ***
## work_typeNever_worked         
## work_typePrivate           ***
## work_typeSelf-employed     ***
## Residence_typeUrban        ***
## smoking_statusnever smoked ***
## smoking_statussmokes       ***
## smoking_statusUnknown      ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10774.3  on 7771  degrees of freedom
## Residual deviance:  9761.3  on 7762  degrees of freedom
## AIC: 9781.3
## 
## Number of Fisher Scoring iterations: 12

Evaluate Improved Model

predicted_stroke_2 <- predict(object = backward, newdata = stroke_test)
predicted_stroke_2 <- ifelse(predicted_stroke_2 > 0.5,"Stroke","Healthy")
predicted_stroke_2 <- as.factor(predicted_stroke_2)
confusionMatrix(predicted_stroke_2, stroke_test$stroke, positive = "Stroke")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Healthy Stroke
##    Healthy     834     36
##    Stroke      141     11
##                                               
##                Accuracy : 0.8268              
##                  95% CI : (0.8022, 0.8495)    
##     No Information Rate : 0.954               
##     P-Value [Acc > NIR] : 1                   
##                                               
##                   Kappa : 0.0433              
##                                               
##  Mcnemar's Test P-Value : 0.000000000000005405
##                                               
##             Sensitivity : 0.23404             
##             Specificity : 0.85538             
##          Pos Pred Value : 0.07237             
##          Neg Pred Value : 0.95862             
##              Prevalence : 0.04599             
##          Detection Rate : 0.01076             
##    Detection Prevalence : 0.14873             
##       Balanced Accuracy : 0.54471             
##                                               
##        'Positive' Class : Stroke              
##

the model did not improve at all instead the accuracy down to 0.826 so the first model seems better on overall.

KNN

Create new train and test data that has numerical category

stroke_split2 <- sample(nrow(stroke_data_clean), nrow(stroke_data_clean)*0.80) 
stroke_train2 <- stroke_data_clean[stroke_split2, ] %>% select(c("age", "avg_glucose_level", "bmi", "stroke"))
stroke_test2 <- stroke_data_clean[-stroke_split2, ] %>% select(c("age", "avg_glucose_level", "bmi", "stroke"))

Balance class data

set.seed(476)
train_up2 <- upSample(x = stroke_train2 %>% select(-stroke), # prediktor
                     y = stroke_train2$stroke,  # target
                     list = F, # supaya target dan prediktor jadi 1 table
                     yname = "stroke" # nama target
                     )
table(train_up2$stroke)

## 
## Healthy  Stroke 
##    3888    3888

Scale train and test data

train_up_x <- train_up2 %>% select(-stroke) %>% scale() 
train_up_y <- train_up2 %>% pull(stroke)
train_up_scale <- train_up2 %>% select(-stroke) %>% scale() %>% as.data.frame()
train_up_scale$stroke <- train_up_y
test_x <- stroke_test2 %>% select(-stroke) %>% 
  scale(center = attr(train_up_x, "scaled:center") , 
        scale = attr(train_up_x, "scaled:scale"))
test_y <- stroke_test2 %>% pull(stroke)
test_scale <- test_x %>% as.data.frame() %>% mutate(stroke = test_y)

Find Optimum K

sqrt(nrow(stroke_train2))

## [1] 63.93747

model_knn2 <- knn(train=train_up_x, test = test_x, cl=  train_up_y,k=63)
confusionMatrix(model_knn2, test_y, positive = "Stroke")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Healthy Stroke
##    Healthy     659     13
##    Stroke      314     36
##                                              
##                Accuracy : 0.68               
##                  95% CI : (0.6505, 0.7086)   
##     No Information Rate : 0.9521             
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : 0.1052             
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.73469            
##             Specificity : 0.67729            
##          Pos Pred Value : 0.10286            
##          Neg Pred Value : 0.98065            
##              Prevalence : 0.04795            
##          Detection Rate : 0.03523            
##    Detection Prevalence : 0.34247            
##       Balanced Accuracy : 0.70599            
##                                              
##        'Positive' Class : Stroke             
##

it seems the model did not have good accuracy maybe because of k that too high we will try using other k.

Finding Another Optimum K

model_knn <- train(stroke ~ ., data=train_up_scale, method = "knn")
model_knn

## k-Nearest Neighbors 
## 
## 7776 samples
##    3 predictor
##    2 classes: 'Healthy', 'Stroke' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 7776, 7776, 7776, 7776, 7776, 7776, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.9139253  0.8278035
##   7  0.8943706  0.7886775
##   9  0.8777029  0.7553145
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.

The model train on many k values one of the best k value is 5 because it reached highest accuracy in train data

Confussion Matrix

predict_knn <- predict(model_knn, newdata = test_scale)
confusionMatrix(predict_knn, test_y, positive = "Stroke")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Healthy Stroke
##    Healthy     813     28
##    Stroke      160     21
##                                              
##                Accuracy : 0.816              
##                  95% CI : (0.7909, 0.8394)   
##     No Information Rate : 0.9521             
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : 0.1159             
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.42857            
##             Specificity : 0.83556            
##          Pos Pred Value : 0.11602            
##          Neg Pred Value : 0.96671            
##              Prevalence : 0.04795            
##          Detection Rate : 0.02055            
##    Detection Prevalence : 0.17710            
##       Balanced Accuracy : 0.63207            
##                                              
##        'Positive' Class : Stroke             
##

The model of knn has overall lower accuracy from logistic regression but better on sensitivity.