Brain Stroke Prediction based on Logistic Regression and K-NN
Introduction
Objectives
We will try to predict, based on historical patient data, whether a patient has had a brain stroke or not. Logistic regression and K-Nearest Neighbor (K-NN) are used as the classification methods. The dataset is downloaded from Kaggle, which you can also download from here.
Library and Setup
library(tidyverse)
library(dplyr)
library(rsample)
library(gtools)
library(class) #for knn
library(caret) #for confusion matrix in knn
Data Preparation
Read Data
brain <- read.csv("full_data.csv")
glimpse(brain)
## Rows: 4,981
## Columns: 11
## $ gender <chr> "Male", "Male", "Female", "Female", "Male", "Male", …
## $ age <dbl> 67, 80, 49, 79, 81, 74, 69, 78, 81, 61, 54, 79, 50, …
## $ hypertension <int> 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1…
## $ heart_disease <int> 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0…
## $ ever_married <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes…
## $ work_type <chr> "Private", "Private", "Private", "Self-employed", "P…
## $ Residence_type <chr> "Urban", "Rural", "Urban", "Rural", "Urban", "Rural"…
## $ avg_glucose_level <dbl> 228.69, 105.92, 171.23, 174.12, 186.21, 70.09, 94.39…
## $ bmi <dbl> 36.6, 32.5, 34.4, 24.0, 29.0, 27.4, 22.8, 24.2, 29.7…
## $ smoking_status <chr> "formerly smoked", "never smoked", "smokes", "never …
## $ stroke <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
Attribute Information
- gender: “Male”, “Female” or “Other”
- age: age of the patient
- hypertension: 0 if the patient doesn’t have hypertension, 1 if the patient has hypertension
- heart_disease: 0 if the patient doesn’t have any heart disease, 1 if the patient has a heart disease
- ever_married: “No” or “Yes”
- work_type: “children”, “Govt_job”, “Never_worked”, “Private” or “Self-employed”
- Residence_type: “Rural” or “Urban”
- avg_glucose_level: average glucose level in blood
- bmi: body mass index
- smoking_status: “formerly smoked”, “never smoked”, “smokes” or “Unknown”*
- stroke: 0 if not or 1 if the patient had a stroke
*Note: “Unknown” in smoking_status means that the information is unavailable for this patient
Here is a fuller view of our dataset:
rmarkdown::paged_table(brain)
Data Wrangling
Some of the variables are stored with an inappropriate data type, so we need to adjust them: character columns become factors, and the binary indicators hypertension, heart_disease, and stroke are converted to factors as well.
brain <- brain %>%
mutate_if(is.character, as.factor) %>%
mutate(hypertension = factor(hypertension, levels = c(0, 1)),
heart_disease = factor(heart_disease, levels = c(0, 1)),
stroke = factor(stroke, levels = c(0, 1))
)
glimpse(brain)
## Rows: 4,981
## Columns: 11
## $ gender <fct> Male, Male, Female, Female, Male, Male, Female, Fema…
## $ age <dbl> 67, 80, 49, 79, 81, 74, 69, 78, 81, 61, 54, 79, 50, …
## $ hypertension <fct> 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1…
## $ heart_disease <fct> 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0…
## $ ever_married <fct> Yes, Yes, Yes, Yes, Yes, Yes, No, Yes, Yes, Yes, Yes…
## $ work_type <fct> Private, Private, Private, Self-employed, Private, P…
## $ Residence_type <fct> Urban, Rural, Urban, Rural, Urban, Rural, Urban, Urb…
## $ avg_glucose_level <dbl> 228.69, 105.92, 171.23, 174.12, 186.21, 70.09, 94.39…
## $ bmi <dbl> 36.6, 32.5, 34.4, 24.0, 29.0, 27.4, 22.8, 24.2, 29.7…
## $ smoking_status <fct> formerly smoked, never smoked, smokes, never smoked,…
## $ stroke <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
The next step is to check for missing values, so that they don't interfere with the modeling later.
colSums(is.na(brain))
## gender age hypertension heart_disease
## 0 0 0 0
## ever_married work_type Residence_type avg_glucose_level
## 0 0 0 0
## bmi smoking_status stroke
## 0 0 0
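Since every column shows zero missing values, nothing needs to be removed here. If missing values had appeared, a minimal sketch of one possible strategy (simply dropping incomplete rows; drop_na() comes from tidyr, which is loaded with the tidyverse) would be:
# drop any rows that contain at least one NA
brain <- brain %>%
  drop_na()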
Exploratory Data Analysis
Before doing the modeling, we first look at the class proportions of the target variable stroke.
# target class proportion
prop.table(table(brain$stroke))
##
## 0 1
## 0.9502108 0.0497892
We can see that the proportion of the target variable is highly imbalanced. Here we will downsample the majority class to make the dataset more balanced.
brain_0 <- brain %>%
  filter(stroke == 0) %>%
  head(302)
str(brain_0)
brain_1 <- brain %>%
  filter(stroke == 1)
str(brain_1)
Combining the two subsets gives a new, more balanced dataset.
brain_new <- rbind(brain_0, brain_1)
str(brain_new)
prop.table(table(brain_new$stroke))
##
## 0 1
## 0.5490909 0.4509091
Here the proportion of the target variable is much more balanced, although the data now consists of only 550 rows.
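As a side note, a similar downsampling can be done more directly with dplyr; a minimal sketch, assuming a random sample of the majority class is acceptable instead of taking its first 302 rows (brain_new_alt is just an illustrative name, and this variant keeps 248 rows per class):
set.seed(100)
brain_new_alt <- brain %>%
  group_by(stroke) %>%
  # keep as many rows per class as the minority class has (248)
  slice_sample(n = min(table(brain$stroke))) %>%
  ungroup()
prop.table(table(brain_new_alt$stroke))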
Logistic Regression
Cross Validation
To evaluate the model and see its ability to predict new data, the data is divided into two parts: train data and test data (a hold-out split, often loosely called cross-validation).
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
index <- sample(nrow(brain_new), nrow(brain_new)*0.8)
brain_train <- brain_new[index,]
brain_test <- brain_new[-index,]
# re-check class imbalance
table(brain_train$stroke)
##
## 0 1
## 241 199
table(brain_test$stroke)
##
## 0 1
## 61 49
Both brain_train and brain_test are fairly balanced.
Build Model
brain_stroke <- glm(formula = stroke ~., data = brain_train, family = "binomial")
summary(brain_stroke)
##
## Call:
## glm(formula = stroke ~ ., family = "binomial", data = brain_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3908 -0.6343 -0.2224 0.7433 2.6375
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.7927421 0.9752735 -3.889 0.000101 ***
## genderMale 0.1638678 0.2614025 0.627 0.530738
## age 0.0843237 0.0101930 8.273 < 2e-16 ***
## hypertension1 0.9371155 0.3516469 2.665 0.007700 **
## heart_disease1 0.6137583 0.4110275 1.493 0.135377
## ever_marriedYes 0.1863650 0.4374563 0.426 0.670094
## work_typeGovt_job -1.8797865 1.0481699 -1.793 0.072909 .
## work_typePrivate -2.0547462 1.0272452 -2.000 0.045473 *
## work_typeSelf-employed -2.3457367 1.0718963 -2.188 0.028641 *
## Residence_typeUrban 0.1570400 0.2519370 0.623 0.533067
## avg_glucose_level 0.0003707 0.0023634 0.157 0.875365
## bmi 0.0102397 0.0224775 0.456 0.648712
## smoking_statusnever smoked -0.3224959 0.3371130 -0.957 0.338749
## smoking_statussmokes 0.2046291 0.4128258 0.496 0.620121
## smoking_statusUnknown -0.1400686 0.3863571 -0.363 0.716951
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 605.95 on 439 degrees of freedom
## Residual deviance: 399.74 on 425 degrees of freedom
## AIC: 429.74
##
## Number of Fisher Scoring iterations: 5
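As a rough overall fit check (a side note, not part of the original workflow), a deviance-based pseudo R-squared can be computed directly from the fitted model:
# 1 - residual deviance / null deviance; about 0.34 using the deviances shown above
1 - brain_stroke$deviance / brain_stroke$null.deviance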
reminder:
- to change log of odds to odds: exp()
- to change log of odds to probability: inv.logit()
inv.logit(-3.7862114)
## [1] 0.02217833
# odds for the hypertension1 coefficient
exp(0.9371155)
## [1] 2.552608
The odds of having a brain stroke for a patient with hypertension are about 2.55 times the odds for a patient without hypertension, holding the other variables constant.
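The same conversion can be applied to all coefficients of the full model at once; a small sketch (odds_ratios is just an illustrative name):
# convert every log-odds coefficient of the full model into an odds ratio
odds_ratios <- exp(coef(brain_stroke))
odds_ratios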
Feature selection using stepwise
# feature selection for logistic regression can also be done with step()
step(object = brain_stroke, direction = "backward", trace = FALSE)
##
## Call: glm(formula = stroke ~ age + hypertension + heart_disease, family = "binomial",
## data = brain_train)
##
## Coefficients:
## (Intercept) age hypertension1 heart_disease1
## -4.76689 0.07536 0.84579 0.79346
##
## Degrees of Freedom: 439 Total (i.e. Null); 436 Residual
## Null Deviance: 606
## Residual Deviance: 408.3 AIC: 416.3
# build a reduced model based on the stepwise result (heart_disease is dropped here, since it was not significant in the full model)
brain_stroke_m2 <- glm(formula = stroke ~ age + hypertension, family = "binomial", data = brain_train)
summary(brain_stroke_m2)
##
## Call:
## glm(formula = stroke ~ age + hypertension, family = "binomial",
## data = brain_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1426 -0.6672 -0.1888 0.7038 3.0958
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.887913 0.514411 -9.502 <2e-16 ***
## age 0.079060 0.008195 9.647 <2e-16 ***
## hypertension1 0.831411 0.327411 2.539 0.0111 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 605.95 on 439 degrees of freedom
## Residual deviance: 412.74 on 437 degrees of freedom
## AIC: 418.74
##
## Number of Fisher Scoring iterations: 5
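A quick way to compare the full and the reduced model is their AIC (a sketch; the values themselves already appear in the summaries above, 429.74 for the full model and 418.74 for the reduced one):
# lower AIC indicates a better fit/complexity trade-off
AIC(brain_stroke, brain_stroke_m2)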
Predict
brain_test$pred.Stroke <- predict(object = brain_stroke_m2, newdata = brain_test, type = "response")
# Because we want probabilities, we use type = "response"
brain_test$pred.Label <- ifelse(brain_test$pred.Stroke > 0.5, 1, 0) %>% as.factor()
# looking through the prediction result
rmarkdown::paged_table(brain_test %>% select(pred.Stroke, pred.Label, stroke))
Evaluation
# confusion matrix
confusionMatrix(data = brain_test$pred.Label, reference = brain_test$stroke, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 43 17
## 1 18 32
##
## Accuracy : 0.6818
## 95% CI : (0.5862, 0.7674)
## No Information Rate : 0.5545
## P-Value [Acc > NIR] : 0.004366
##
## Kappa : 0.3573
##
## Mcnemar's Test P-Value : 1.000000
##
## Sensitivity : 0.6531
## Specificity : 0.7049
## Pos Pred Value : 0.6400
## Neg Pred Value : 0.7167
## Prevalence : 0.4455
## Detection Rate : 0.2909
## Detection Prevalence : 0.4545
## Balanced Accuracy : 0.6790
##
## 'Positive' Class : 1
##
The result shows that our logistic regression model has an accuracy of 68.18 % on the test dataset, meaning that 68.18 % of the test data is correctly classified. The sensitivity is 65.31 % and the specificity is 70.49 %, which indicates that the model classifies the negative class somewhat better than the positive class. The precision/positive predictive value is 64 %, meaning that 64 % of our positive predictions are correct.
From this we get 4 model performance metrics: Accuracy, Sensitivity/Recall, Precision, and Specificity.
- Accuracy: how accurately the model predicts the target class overall
- Sensitivity/Recall: how well the model identifies the positive class
- Specificity: how well the model identifies the negative class
- Pos Pred Value/Precision: how accurate the model's positive predictions are
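As a quick sanity check, these metrics can be reproduced by hand from the confusion matrix counts above (TP = 32, TN = 43, FP = 18, FN = 17):
TP <- 32; TN <- 43; FP <- 18; FN <- 17
(TP + TN) / (TP + TN + FP + FN) # accuracy    = 75/110 = 0.6818
TP / (TP + FN)                  # recall      = 32/49  = 0.6531
TN / (TN + FP)                  # specificity = 43/61  = 0.7049
TP / (TP + FP)                  # precision   = 32/50  = 0.6400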
Since our goal is to predict whether a patient has a brain stroke so that the doctor can take the next step of treatment, we are mainly concerned with FN (False Negative): the case where the model predicts no brain stroke while the patient actually has one.
- Target variable = brain stroke / not brain stroke
- Positive Class = Stroke
- Metrics = Recall (reduce FN value)
On the other hand, the accuracy is not that high, so this model still needs improvement, for example through better feature selection or a different classification threshold.
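One simple way to reduce false negatives without refitting the model is to lower the classification threshold below 0.5; a minimal sketch (the 0.3 cutoff and the name pred_label_03 are our own choices, and a lower cutoff will also produce more false positives):
# re-label the logistic regression probabilities with a lower cutoff
pred_label_03 <- ifelse(brain_test$pred.Stroke > 0.3, 1, 0) %>% as.factor()
confusionMatrix(data = pred_label_03, reference = brain_test$stroke, positive = "1")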
K-Nearest Neighbor
Cross Validation
As with logistic regression, we split the data into train data and test data to evaluate the model's ability to predict new data; this time the split is done with initial_split() and stratified on the target variable stroke.
set.seed(100)
index <- initial_split(data = brain_new, prop = 0.8, strata = stroke)
brain_train <- training(index)
brain_test <- testing(index)
prop.table(table(brain_train$stroke))
##
## 0 1
## 0.5489749 0.4510251
table(brain_train$stroke)
##
## 0 1
## 241 198
prop.table(table(brain_test$stroke))
##
## 0 1
## 0.5495495 0.4504505
table(brain_test$stroke)
##
## 0 1
## 61 50
Both brain_train and brain_test are fairly balanced. Next, we separate the numeric predictors from the target variable in both sets, since knn() takes them separately.
# predictor data train
train_x <- brain_train %>% select_if(is.numeric)
# target data train
train_y <- brain_train %>% select(stroke)
# predictor data test
test_x <- brain_test %>% select_if(is.numeric)
# target data test
test_y <- brain_test %>% select(stroke)
Data Pre-Processing
Scaling
Feature re-scaling is required in the data pre-processing step when the predictors have very different ranges; two common options are min-max normalization and z-score standardization. Here we use z-score standardization.
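For reference, a minimal min-max sketch could look like the following (normalize() is our own hypothetical helper, not part of the original workflow; on the test set the training minimum and maximum should be reused instead of the test set's own):
# hypothetical helper: rescale a numeric vector to the [0, 1] range
normalize <- function(x, min_x = min(x), max_x = max(x)) {
  (x - min_x) / (max_x - min_x)
}
train_x_minmax <- as.data.frame(lapply(train_x, normalize))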
Z-score Standardization
train_x <- scale(train_x)
test_x <- scale(test_x,
center = attr(train_x, "scaled:center"),
scale = attr(train_x, "scaled:scale"))
sqrt(nrow(train_x))
## [1] 20.95233
The square root of the number of rows in the training set is about 21, so we use k = 20 for the model.
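Optionally, instead of fixing k from the square-root rule, a small scan over candidate values can be done; a sketch (ideally a separate validation set would be used for this instead of the test set):
# try several odd values of k and report the one with the highest test accuracy
k_values <- seq(5, 41, by = 2)
acc <- sapply(k_values, function(k) {
  pred <- knn(train = train_x, test = test_x, cl = train_y$stroke, k = k)
  mean(pred == test_y$stroke)
})
k_values[which.max(acc)]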
Predict
brain_pred <- knn(train = train_x,
test = test_x,
cl = train_y$stroke,
k = 20)
Evaluation
confusionMatrix(data = brain_pred,
reference = test_y$stroke,
positive="1")## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 46 15
## 1 15 35
##
## Accuracy : 0.7297
## 95% CI : (0.6372, 0.8096)
## No Information Rate : 0.5495
## P-Value [Acc > NIR] : 7.162e-05
##
## Kappa : 0.4541
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.7000
## Specificity : 0.7541
## Pos Pred Value : 0.7000
## Neg Pred Value : 0.7541
## Prevalence : 0.4505
## Detection Rate : 0.3153
## Detection Prevalence : 0.4505
## Balanced Accuracy : 0.7270
##
## 'Positive' Class : 1
##
The result shows that our K-NN model with k = 20 has an accuracy of 72.97 % on the test dataset, meaning that 72.97 % of the test data is correctly classified. The sensitivity is 70 % and the specificity is 75.41 %, which indicates that both positive and negative outcomes are classified reasonably well. The precision/positive predictive value is 70 %, meaning that 70 % of our positive predictions are correct.
Summary
Judging by accuracy alone, neither the logistic regression model nor the K-Nearest Neighbor model predicts the target class particularly well. However, when we focus on Recall/Sensitivity, the K-NN method gives a better value than logistic regression.
There is no large difference between logistic regression and K-NN in terms of accuracy, but K-NN classifies both positive and negative outcomes correctly more often, so its sensitivity and specificity are higher than those of logistic regression. Overall, K-NN performs better than logistic regression on this data.
Which model to choose depends on what we want to achieve, and accuracy may not be the best metric in this case. If we want to maximize the number of correctly classified positive and negative outcomes, we should choose the K-NN model.
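For a final recap, the metrics reported above can be placed side by side (the numbers are copied from the two confusion matrices; model_comparison is just an illustrative name):
model_comparison <- data.frame(
  model       = c("Logistic Regression", "K-NN (k = 20)"),
  accuracy    = c(0.6818, 0.7297),
  recall      = c(0.6531, 0.7000),
  specificity = c(0.7049, 0.7541),
  precision   = c(0.6400, 0.7000)
)
model_comparison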