Brain Stroke Prediction based on Logistic Regression and K-NN

Introduction

Objectives

We will try to predict, based on historical patient data, whether a patient has had a brain stroke or not. Logistic regression and K-Nearest Neighbor (K-NN) will be used as the classification methods. The dataset was downloaded from Kaggle, and you can also download it from here.

Library and Setup

library(tidyverse) # data wrangling and visualisation
library(dplyr)     # data manipulation
library(rsample)   # train/test splitting (initial_split)
library(gtools)    # inv.logit()
library(class)     # knn()
library(caret)     # confusionMatrix()

Data Preparation

Read Data

brain <- read.csv("full_data.csv")
glimpse(brain)
## Rows: 4,981
## Columns: 11
## $ gender            <chr> "Male", "Male", "Female", "Female", "Male", "Male", …
## $ age               <dbl> 67, 80, 49, 79, 81, 74, 69, 78, 81, 61, 54, 79, 50, …
## $ hypertension      <int> 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1…
## $ heart_disease     <int> 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0…
## $ ever_married      <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes…
## $ work_type         <chr> "Private", "Private", "Private", "Self-employed", "P…
## $ Residence_type    <chr> "Urban", "Rural", "Urban", "Rural", "Urban", "Rural"…
## $ avg_glucose_level <dbl> 228.69, 105.92, 171.23, 174.12, 186.21, 70.09, 94.39…
## $ bmi               <dbl> 36.6, 32.5, 34.4, 24.0, 29.0, 27.4, 22.8, 24.2, 29.7…
## $ smoking_status    <chr> "formerly smoked", "never smoked", "smokes", "never …
## $ stroke            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…

Attribute Information

  1. gender: “Male”, “Female” or “Other”
  2. age: age of the patient
  3. hypertension: 0 if the patient doesn’t have hypertension, 1 if the patient has hypertension
  4. heart_disease: 0 if the patient doesn’t have any heart disease, 1 if the patient has a heart disease
  5. ever_married: “No” or “Yes”
  6. work_type: “children”, “Govt_job”, “Never_worked”, “Private” or “Self-employed”
  7. Residence_type: “Rural” or “Urban”
  8. avg_glucose_level: average glucose level in blood
  9. bmi: body mass index
  10. smoking_status: “formerly smoked”, “never smoked”, “smokes” or “Unknown”*
  11. stroke: 0 if the patient did not have a stroke, 1 if the patient had a stroke

*Note: “Unknown” in smoking_status means that the information is unavailable for this patient

Here is a fuller view of the dataset:

rmarkdown::paged_table(brain)

Data Wrangling

Some of the variables were read with an inappropriate data type, so we need to adjust the data type of those variables (the categorical columns become factors).

brain <- brain %>%
  mutate_if(is.character, as.factor) %>%
  mutate(hypertension = factor(hypertension, levels = c(0, 1)),
         heart_disease = factor(heart_disease, levels = c(0, 1)),
         stroke = factor(stroke, levels = c(0, 1))
         )
glimpse(brain)
## Rows: 4,981
## Columns: 11
## $ gender            <fct> Male, Male, Female, Female, Male, Male, Female, Fema…
## $ age               <dbl> 67, 80, 49, 79, 81, 74, 69, 78, 81, 61, 54, 79, 50, …
## $ hypertension      <fct> 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1…
## $ heart_disease     <fct> 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0…
## $ ever_married      <fct> Yes, Yes, Yes, Yes, Yes, Yes, No, Yes, Yes, Yes, Yes…
## $ work_type         <fct> Private, Private, Private, Self-employed, Private, P…
## $ Residence_type    <fct> Urban, Rural, Urban, Rural, Urban, Rural, Urban, Urb…
## $ avg_glucose_level <dbl> 228.69, 105.92, 171.23, 174.12, 186.21, 70.09, 94.39…
## $ bmi               <dbl> 36.6, 32.5, 34.4, 24.0, 29.0, 27.4, 22.8, 24.2, 29.7…
## $ smoking_status    <fct> formerly smoked, never smoked, smokes, never smoked,…
## $ stroke            <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…

The next step is to check for missing values, since missing data would interfere with the modeling later.

colSums(is.na(brain))
##            gender               age      hypertension     heart_disease 
##                 0                 0                 0                 0 
##      ever_married         work_type    Residence_type avg_glucose_level 
##                 0                 0                 0                 0 
##               bmi    smoking_status            stroke 
##                 0                 0                 0

Exploratory Data Analysis

Before modeling, we first need to look at the class proportions of the target variable.

# target class proportion
prop.table(table(brain$stroke))
## 
##         0         1 
## 0.9502108 0.0497892

We can see that the target classes are highly imbalanced. Here we will downsample the non-stroke class to make the dataset more balanced.

brain_0 <- brain %>%
  filter(stroke == 0) %>%
  head(302)
str(brain_0)
brain_1 <- brain %>%
  filter(stroke == 1)
str(brain_1)

Combining the two subsets into a new, more balanced dataset:

brain_new <- rbind(brain_0, brain_1)
str(brain_new)
prop.table(table(brain_new$stroke))
## 
##         0         1 
## 0.5490909 0.4509091

Here the proportion of the target variable is much more balanced, although the new dataset contains only 550 rows.
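As a side note, taking the first 302 non-stroke rows is a deterministic downsample; a random downsample would avoid any dependence on row order. A minimal sketch using dplyr::slice_sample (not used in the analysis below, so the selected rows would differ slightly):

# random downsample of the majority class (sketch only)
set.seed(100)
brain_0_random <- brain %>%
  filter(stroke == 0) %>%
  slice_sample(n = 302)
brain_new_random <- rbind(brain_0_random, brain_1)
prop.table(table(brain_new_random$stroke))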

Logistic Regression

Cross Validation

To evaluate the model and see its ability to predict new data, our data is divided into two parts: train data and test data. We call this process cross-validation (here, a simple train/test holdout split).

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)

index <- sample(nrow(brain_new), nrow(brain_new)*0.8)
brain_train <- brain_new[index,]
brain_test <- brain_new[-index,]
# re-check class imbalance
table(brain_train$stroke)
## 
##   0   1 
## 241 199
table(brain_test$stroke)
## 
##  0  1 
## 61 49

Both brain_train and brain_test are reasonably balanced.

Build Model

brain_stroke <- glm(formula = stroke ~., data = brain_train, family = "binomial")
summary(brain_stroke)
## 
## Call:
## glm(formula = stroke ~ ., family = "binomial", data = brain_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3908  -0.6343  -0.2224   0.7433   2.6375  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                -3.7927421  0.9752735  -3.889 0.000101 ***
## genderMale                  0.1638678  0.2614025   0.627 0.530738    
## age                         0.0843237  0.0101930   8.273  < 2e-16 ***
## hypertension1               0.9371155  0.3516469   2.665 0.007700 ** 
## heart_disease1              0.6137583  0.4110275   1.493 0.135377    
## ever_marriedYes             0.1863650  0.4374563   0.426 0.670094    
## work_typeGovt_job          -1.8797865  1.0481699  -1.793 0.072909 .  
## work_typePrivate           -2.0547462  1.0272452  -2.000 0.045473 *  
## work_typeSelf-employed     -2.3457367  1.0718963  -2.188 0.028641 *  
## Residence_typeUrban         0.1570400  0.2519370   0.623 0.533067    
## avg_glucose_level           0.0003707  0.0023634   0.157 0.875365    
## bmi                         0.0102397  0.0224775   0.456 0.648712    
## smoking_statusnever smoked -0.3224959  0.3371130  -0.957 0.338749    
## smoking_statussmokes        0.2046291  0.4128258   0.496 0.620121    
## smoking_statusUnknown      -0.1400686  0.3863571  -0.363 0.716951    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 605.95  on 439  degrees of freedom
## Residual deviance: 399.74  on 425  degrees of freedom
## AIC: 429.74
## 
## Number of Fisher Scoring iterations: 5

Reminder:

  • to change log of odds to odds: exp()
  • to change log of odds to probability: inv.logit()
inv.logit(-3.7862114)
## [1] 0.02217833
# odds ratio from the hypertension1 coefficient
exp(0.9371155)
## [1] 2.552608

In terms of odds: the odds of having a brain stroke for a patient with hypertension are about 2.55 times the odds for a patient without hypertension, holding the other variables constant.
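To make the difference between odds and probability concrete, here is a small sketch that applies the hypertension odds ratio to an arbitrary (hypothetical) baseline probability of 0.3:

# an odds ratio is not a probability ratio (sketch with a hypothetical baseline of 0.3)
p_base    <- 0.3                          # assumed baseline probability of stroke
odds_base <- p_base / (1 - p_base)        # baseline odds
odds_ht   <- odds_base * exp(0.9371155)   # odds multiplied by the hypertension odds ratio
odds_ht / (1 + odds_ht)                   # resulting probability, about 0.52 (not 0.3 * 2.55)

The odds are multiplied by roughly 2.55, while the probability only moves from 0.30 to about 0.52.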

Feature selection using stepwise

# feature selection for logistic regression can also be done with step()
step(object = brain_stroke, direction = "backward", trace = FALSE)
## 
## Call:  glm(formula = stroke ~ age + hypertension + heart_disease, family = "binomial", 
##     data = brain_train)
## 
## Coefficients:
##    (Intercept)             age   hypertension1  heart_disease1  
##       -4.76689         0.07536         0.84579         0.79346  
## 
## Degrees of Freedom: 439 Total (i.e. Null);  436 Residual
## Null Deviance:       606 
## Residual Deviance: 408.3     AIC: 416.3
# save a reduced model based on the stepwise result (heart_disease is additionally dropped here)
brain_stroke_m2 <- glm(formula = stroke ~ age + hypertension, family = "binomial", data = brain_train)
summary(brain_stroke_m2)
## 
## Call:
## glm(formula = stroke ~ age + hypertension, family = "binomial", 
##     data = brain_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1426  -0.6672  -0.1888   0.7038   3.0958  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -4.887913   0.514411  -9.502   <2e-16 ***
## age            0.079060   0.008195   9.647   <2e-16 ***
## hypertension1  0.831411   0.327411   2.539   0.0111 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 605.95  on 439  degrees of freedom
## Residual deviance: 412.74  on 437  degrees of freedom
## AIC: 418.74
## 
## Number of Fisher Scoring iterations: 5

Predict

brain_test$pred.Stroke <- predict(object = brain_stroke_m2, newdata = brain_test, type = "response")
# type = "response" returns the predicted probability of the positive class
brain_test$pred.Label <- ifelse(brain_test$pred.Stroke > 0.5, 1, 0) %>% as.factor()
# looking through the prediction result
rmarkdown::paged_table(brain_test %>% select(pred.Stroke, pred.Label, stroke))

Evaluation

# confusion matrix
confusionMatrix(data = brain_test$pred.Label, reference = brain_test$stroke, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 43 17
##          1 18 32
##                                           
##                Accuracy : 0.6818          
##                  95% CI : (0.5862, 0.7674)
##     No Information Rate : 0.5545          
##     P-Value [Acc > NIR] : 0.004366        
##                                           
##                   Kappa : 0.3573          
##                                           
##  Mcnemar's Test P-Value : 1.000000        
##                                           
##             Sensitivity : 0.6531          
##             Specificity : 0.7049          
##          Pos Pred Value : 0.6400          
##          Neg Pred Value : 0.7167          
##              Prevalence : 0.4455          
##          Detection Rate : 0.2909          
##    Detection Prevalence : 0.4545          
##       Balanced Accuracy : 0.6790          
##                                           
##        'Positive' Class : 1               
## 

The result shows that our logistic regression model has an accuracy of 68.18% on the test dataset, meaning that 68.18% of the test data is correctly classified. The sensitivity and specificity are 65.31% and 70.49%, which indicates that the model classifies negative outcomes somewhat better than positive ones. The precision/positive predictive value is 64.00%, meaning that 64.00% of our positive predictions are correct.

From the confusion matrix we get four model performance metrics: Accuracy, Sensitivity/Recall, Specificity, and Precision; a short check that recomputes them by hand follows the list below.

  • Accuracy: how accurately the model predicts the target classes overall
  • Sensitivity/Recall: how well the model detects the positive class
  • Specificity: how well the model detects the negative class
  • Pos Pred Value/Precision: how many of the predicted positives are actually positive
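These metrics can be verified directly from the confusion-matrix counts above (TP = 32, FN = 17, TN = 43, FP = 18); a minimal sketch:

# recomputing the metrics from the confusion matrix counts (sketch)
TP <- 32; FN <- 17; TN <- 43; FP <- 18
(TP + TN) / (TP + TN + FP + FN)  # accuracy    = 0.6818
TP / (TP + FN)                   # sensitivity = 0.6531
TN / (TN + FP)                   # specificity = 0.7049
TP / (TP + FP)                   # precision   = 0.6400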

Since our goal is to predict patients with a brain stroke so that the doctor can take the next step of treatment, we are most concerned with FN (False Negatives). A False Negative is a patient who is predicted as not having a brain stroke but actually has one.

  • Target variable = brain stroke / not brain stroke
  • Positive Class = Stroke
  • Metrics = Recall (reduce FN value)

On the other hand, the accuracy is not great yet; the model still needs improvement, for example in feature selection or in the choice of classification threshold.
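Since recall is the priority here, one simple option is to lower the classification threshold below 0.5, trading some precision for fewer false negatives. A minimal sketch with an assumed (arbitrary) cutoff of 0.4:

# lowering the cutoff to favour recall (sketch; 0.4 is an illustrative value, not a tuned one)
pred_label_04 <- ifelse(brain_test$pred.Stroke > 0.4, 1, 0) %>% factor(levels = c(0, 1))
confusionMatrix(data = pred_label_04, reference = brain_test$stroke, positive = "1")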

K-Nearest Neighbor

Cross Validation

To evaluate the model and see its ability to predict new data, our data is divided into two parts: train data and test data. We call this process cross-validation (here, a stratified train/test holdout split).

set.seed(100)

index <- initial_split(data = brain_new, prop = 0.8, strata = stroke)

brain_train <- training(index)
brain_test <- testing(index)
prop.table(table(brain_train$stroke))
## 
##         0         1 
## 0.5489749 0.4510251
table(brain_train$stroke)
## 
##   0   1 
## 241 198
prop.table(table(brain_test$stroke))
## 
##         0         1 
## 0.5495495 0.4504505
table(brain_test$stroke)
## 
##  0  1 
## 61 50

Both brain_train and brain_test are reasonably balanced.

#predictor data train 
train_x <- brain_train %>% select_if(is.numeric)

# target data train
train_y <- brain_train %>% select(stroke)

# predictor data test
test_x <- brain_test %>% select_if(is.numeric)

# target data test
test_y <- brain_test %>% select(stroke)

Data Pre-Processing

Scaling

Feature re-scaling is required in the data pre-processing step when the variables have different ranges, as the numeric predictors do here. Two common options are min-max normalization (sketched below for comparison) and z-score standardization, which is what we actually use.
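For comparison, min-max normalization could be done with a small helper function like the sketch below (normalize is a hypothetical helper, not used in this analysis). For test data, the training set's minimum and maximum should be reused, analogous to the centering and scaling attributes in the z-score code that follows.

# hypothetical min-max helper (sketch; the analysis below uses z-score standardization instead)
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
train_x_minmax <- as.data.frame(lapply(train_x, normalize))
summary(train_x_minmax)  # every column now ranges from 0 to 1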

Z-score Standardization

train_x <- scale(train_x)

test_x <- scale(test_x,
                center = attr(train_x, "scaled:center"),
                scale = attr(train_x, "scaled:scale"))
sqrt(nrow(train_x)) # rule-of-thumb starting point for k
## [1] 20.95233

Following the rule of thumb k ≈ √n (the square root of the number of training rows), we will use k = 20 in the model.
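The rule of thumb is only a starting point; k could also be tuned by comparing performance over a small grid of values. A minimal sketch (the grid is arbitrary, and strictly speaking tuning should use a separate validation set rather than the test set):

# trying several values of k and comparing test accuracy (sketch only)
set.seed(100)
sapply(c(5, 11, 15, 21, 25), function(k) {
  pred <- knn(train = train_x, test = test_x, cl = train_y$stroke, k = k)
  mean(pred == test_y$stroke)
})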

Predict

brain_pred <- knn(train = train_x,
                  test = test_x,
                  cl = train_y$stroke,
                  k = 20)

Evaluation

confusionMatrix(data = brain_pred,
                reference = test_y$stroke,
                positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 46 15
##          1 15 35
##                                           
##                Accuracy : 0.7297          
##                  95% CI : (0.6372, 0.8096)
##     No Information Rate : 0.5495          
##     P-Value [Acc > NIR] : 7.162e-05       
##                                           
##                   Kappa : 0.4541          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.7000          
##             Specificity : 0.7541          
##          Pos Pred Value : 0.7000          
##          Neg Pred Value : 0.7541          
##              Prevalence : 0.4505          
##          Detection Rate : 0.3153          
##    Detection Prevalence : 0.4505          
##       Balanced Accuracy : 0.7270          
##                                           
##        'Positive' Class : 1               
## 

The result shows that our K-NN model with k = 20 has an accuracy of 72.97% on the test dataset, meaning that 72.97% of the test data is correctly classified. The sensitivity and specificity are 70.00% and 75.41%, which indicates that both positive and negative outcomes are classified reasonably well. The precision/positive predictive value is 70.00%, meaning that 70% of our positive predictions are correct.

Summary

Judging by accuracy alone, neither the logistic regression model nor the K-NN model predicts the target class especially well. However, when we focus on Recall/Sensitivity, the K-NN method gives a better value than logistic regression.

There is no large difference between logistic regression and K-NN in terms of accuracy, but the K-NN model classifies both the positive and the negative class more accurately, so its sensitivity and specificity are both higher than those of the logistic regression model. Overall, K-NN performs better than logistic regression on this data.

Which model to choose depends on what we want to achieve, and accuracy may not be the best metric for this case. If we want to maximize both the number of correctly predicted positive and negative outcomes, we should choose the K-NN model.
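As a closing sketch, the headline metrics of both models can be collected into one comparison table (values copied from the two confusion matrices above):

# side-by-side comparison of the two models (values taken from the outputs above)
data.frame(
  model       = c("Logistic Regression", "K-NN (k = 20)"),
  accuracy    = c(0.6818, 0.7297),
  sensitivity = c(0.6531, 0.7000),
  specificity = c(0.7049, 0.7541),
  precision   = c(0.6400, 0.7000)
)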