Introduction

In this project, we are going to attempt doing a stroke prediction using a deep learning model from the keras library using the dataset obtained from Kaggle.

According to the Mayo Clinic, Stroke is a medical condition in which the flow of blood to the brain is reduced or blocked completely. Stroke risk factors includes pre-existing medical condition such as high blood pressure and obesity.

Thus, this project aims to predict whether an individual is likely to get stroke based on the input parameters like gender, age, various diseases, and smoking status.

Pre-Modeling Steps

Import Libraries

library(keras)
library(dplyr)
library(ggplot2)
library(vtreat)
library(magrittr)
library(caret)
library(tidyverse)
library(RColorBrewer)
library(ggpubr)
library(rsample)

Import Dataset

df_stroke<-read.csv("healthcare-dataset-stroke-data.csv")
rmarkdown::paged_table(df_stroke)

Variable Description :

  1. id: unique identifier
  2. gender: “Male”, “Female” or “Other”
  3. age: age of the patient
  4. hypertension: 0 if the patient doesn’t have hypertension, 1 if the patient has hypertension
  5. heart_disease: 0 if the patient doesn’t have any heart diseases, 1 if the patient has a heart disease
  6. ever_married: “No” or “Yes”
  7. work_type: “children”, “Govt_jov”, “Never_worked”, “Private” or “Self-employed”
  8. Residence_type: “Rural” or “Urban”
  9. avg_glucose_level: average glucose level in blood
  10. bmi: body mass index
  11. smoking_status: “formerly smoked”, “never smoked”, “smokes” or “Unknown”*
  12. stroke: 1 if the patient had a stroke or 0 if not *Note: “Unknown” in smoking_status means that the information is unavailable for this patient

Data Preprocessing

Impute Missing Data

First of all, let’s check if the data has any missing values.

colSums(is.na(df_stroke))
##                id            gender               age      hypertension 
##                 0                 0                 0                 0 
##     heart_disease      ever_married         work_type    Residence_type 
##                 0                 0                 0                 0 
## avg_glucose_level               bmi    smoking_status            stroke 
##                 0                 0                 0                 0

Seems like there is none, however if we inspect the data we can see that the data has entries of “N/A” in the bmi column. Now let’s check how many of those entries exists in our data.

colSums(df_stroke=="N/A")
##                id            gender               age      hypertension 
##                 0                 0                 0                 0 
##     heart_disease      ever_married         work_type    Residence_type 
##                 0                 0                 0                 0 
## avg_glucose_level               bmi    smoking_status            stroke 
##                 0               201                 0                 0

Now we know, that the bmi column has 201 entries of “N/A”

First and foremost let’s separate our data into two parts, data with bmi column with numeric entries and entries of “N/A”. In which we will transform the data into the appropriate variable type.

stroke_numeric<-df_stroke %>% filter(bmi!="N/A") %>% mutate(bmi=as.numeric(bmi))
stroke_na<-df_stroke %>% filter(bmi=="N/A") %>% mutate(bmi=ifelse(bmi=="N/A",NA,bmi))

Now, let’s bind our data by rows.

df_stroke<-rbind(stroke_na,stroke_numeric)

And next we find how many missing data in our column.

colSums(is.na(df_stroke))
##                id            gender               age      hypertension 
##                 0                 0                 0                 0 
##     heart_disease      ever_married         work_type    Residence_type 
##                 0                 0                 0                 0 
## avg_glucose_level               bmi    smoking_status            stroke 
##                 0               201                 0                 0

Since the bmi column has empties, we will do imputation in our data.

Now, we group the data by intervals of 5 and by gender, then we impute using the mean of those categories.

df_stroke_cleaned<-df_stroke %>%  group_by(cut(age,seq(0,82,by=5)),gender) %>%
  mutate(bmi=ifelse(is.na(bmi),mean(bmi,na.rm=TRUE),bmi)) 
rmarkdown::paged_table(df_stroke_cleaned)

See that there are 13 columns in our data due to the additional column that we got from grouping our data.

Now, let’s deselect the id column and the column that we got from grouping our data.

df_cleaned<-df_stroke_cleaned %>% select(-id)
df_cleaned<-df_cleaned[,c(1:11)]

Exploratory Data Analysis

Now, we are going to perform exploratory data analysis on our data. First and foremost let’s convert all of the binary variables on our data into “yes” and "no/

df_eda<-df_cleaned %>% 
  mutate(hypertension=ifelse(hypertension==0,"No","Yes")) %>% 
  mutate(heart_disease=ifelse(heart_disease==0,"No","Yes")) %>% 
  mutate(stroke=ifelse(stroke==0,"No","Yes"))

First and foremost, we plot the target variable column which is the stroke column.

df_eda %>% 
  ggplot(aes(stroke))+geom_bar(aes(fill=stroke))+theme_minimal()+ylab("Count")+xlab("Stroke")+theme(legend.position ="none")

From the graph above we can inspect that most of the individuals in our dataset never had a stroke. This may cause possible issues in the modeling process due to the imbalance in our dataset.

a<-df_eda %>% 
  ggplot(aes(stroke))+geom_bar(aes(fill=gender),position = "dodge")+theme_minimal()+ylab("Count")+xlab("Stroke")+scale_fill_brewer(palette = "Set2")+ggtitle("Stroke by Gender")

b<-df_eda %>% 
  ggplot(aes(stroke))+geom_bar(aes(fill=ever_married),position = "dodge")+theme_minimal()+ylab("Count")+xlab("Stroke")+scale_fill_brewer(palette = "Set2")+ggtitle("Stroke by Marriage Status")
ggarrange(a,b)

From the graph above, we can observe that :

  • Individuals who had not have a stroke is mostly comprised of female and people who were ever married.
  • Individuals who had have a stroke is also mostly comprised of female and people who were ever married.
c<-df_eda %>% 
  ggplot(aes(stroke))+geom_bar(aes(fill=hypertension),position = "dodge")+theme_minimal()+ylab("Count")+xlab("Stroke")+scale_fill_brewer(palette = "Set2")+ggtitle("Stroke by Hypertension")

d<-df_eda %>% 
  ggplot(aes(stroke))+geom_bar(aes(fill=heart_disease),position = "dodge")+theme_minimal()+ylab("Count")+xlab("Stroke")+scale_fill_brewer(palette = "Set2")+ggtitle("Stroke by Heart Disease")
ggarrange(c,d)

From the graph above, we can observe that :

  • Individuals who never had a stroke is mostly comprised of people who doesn’t have hypertension and heart disease.
  • Individuals who ever had a stroke is also mostly comprised of people who doesn’t have hypertension and heart disease.
e<-df_eda %>% 
  ggplot(aes(stroke))+geom_bar(aes(fill=work_type),position = "dodge")+theme_minimal()+ylab("Count")+xlab("Stroke")+scale_fill_brewer(palette = "Set2")+ggtitle("Stroke by Work Type")

f<-df_eda %>% 
  ggplot(aes(stroke))+geom_bar(aes(fill=Residence_type),position = "dodge")+theme_minimal()+ylab("Count")+xlab("Stroke")+scale_fill_brewer(palette = "Set2")+ggtitle("Stroke by Residence Type")
ggarrange(e,f,ncol = 1)

From the graph above, we can observe that :

  • Individuals who never had a stroke are mostly comprised of people who is a private employee and people who lives in the urban area.
  • Individuals who ever had a stroke are also mostly comprised of people who is a private employee and people who lives in the urban area.
df_eda %>% 
  ggplot(aes(stroke))+geom_bar(aes(fill=smoking_status),position = "dodge")+theme_minimal()+ylab("Count")+xlab("Stroke")+scale_fill_brewer(palette = "Set2")+ggtitle("Stroke by Smoking Status")

People who ever and never had a stroke are mostly people who were a former smoker.

Now, let’s plot the boxplots for all of the quantitative variables based on the stroke columns.

x<-df_eda %>% ggplot(aes(age))+geom_boxplot(aes(fill=stroke))+theme_minimal()+coord_flip()+ggtitle("Age")
y<-df_eda %>% ggplot(aes(avg_glucose_level))+geom_boxplot(aes(fill=stroke))+theme_minimal()+coord_flip()+ggtitle("Average Glucose Level")
z<-df_eda %>% ggplot(aes(bmi))+geom_boxplot(aes(fill=stroke))+theme_minimal()+coord_flip()+ggtitle("Body Mass Index")
ggarrange(x,y,z,common.legend = T)

From the boxplots above we can observe that people who ever had a stroke has older age and higher glucose level than the people who haven’t.

One Hot Encoding

Since the deep learning algorithm requires all of our data to be in numerical form, we are going do one-hot encoding on our data.

treatplan<-designTreatmentsZ(df_cleaned,colnames(df_cleaned),verbose = F)

Now, we examine the score frame

scoreFrame <- treatplan %>%
    use_series(scoreFrame) %>%
    select(varName, origName, code)

We only want the rows with codes “clean” or “lev”

newvars <- scoreFrame %>%
    filter(code %in% c("clean","lev")) %>%
    use_series(varName)
df_stroke.treat <- prepare(treatplan, df_cleaned, varRestriction = newvars)
df_stroke.treat<-df_stroke.treat %>% mutate(stroke=as.factor(stroke))

Now, we can see that all of our categorical data are one hot encoded

rmarkdown::paged_table(df_stroke.treat)

Train-test Split

Now, let’s split our data into training and testing set. In this modeling process, we are going to split the test into 70% training data and 30% testing data.

set.seed(123)
id<-initial_split(data = df_stroke.treat, prop = 0.7, strata = "stroke")
train<-training(id)
test<-testing(id)

Now, let’s examine the target class in the training set of our data.

table(train$stroke)
## 
##    0    1 
## 3403  174

We can see that our target variables are quite imbalanced. So now we are going to do upsampling to even out our data classes.

df_train.sampled<-upSample(x = train  %>% select(-stroke),
                               y = as.factor(train$stroke), list = F,
                               yname = "stroke")

Perfect, now we have balanced classes.

table(df_train.sampled$stroke)
## 
##    0    1 
## 3403 3403

Feature Scaling

The deep learning algorithm also requires the data to be scaled prior to modeling, so now we scale the data and convert the explanatory variables into an array.

train.x_keras<-df_train.sampled%>% select(-stroke) %>% scale() %>% as.matrix()
train.y_keras<-to_categorical(df_train.sampled[,"stroke"])
test.x_keras<-test%>% select(-stroke) %>% scale() %>% as.matrix()
test.y_keras<-test[,"stroke"]
train.x_keras<-array_reshape(train.x_keras,dim=dim(train.x_keras))
test.x_keras<-array_reshape(test.x_keras,dim=dim(test.x_keras))

Modeling

First and foremost, we set the specifications for our model:

  • We are going to use 3 layers of neural network, with 56 nodes in the first layer, 28 in the second and 2 in the output layer.
  • The first and second layer will use relu function and the third layer will be a sigmoid function, since this is a classification problem
  • We will use adam optimizer with learning rate of 0.001
  • We wil also set the loss function to binary_crossentropy, since this is a classification problem
  • We set the metric to accuracy and set the epoch to 30 with batch size of 50
initializer <- initializer_random_normal(seed = 100)
model <- keras_model_sequential()
model %>% 
  layer_dense(input_shape = ncol(train.x_keras), # input
              units = 56, activation = "relu", name = "hidden_1") %>%  # hidden layer 1
  layer_dense(units = 28, activation = "relu", name = "hidden_2") %>% # hidden layer 2
  layer_dense(units = 2, activation = "sigmoid", name = "output")
model %>% 
  compile(loss = "binary_crossentropy",
          optimizer = optimizer_adam(lr = 0.001),
          metric = "accuracy")
history <- model %>% 
  keras::fit(train.x_keras,
      train.y_keras,
      epoch = 30, 
      batch_size = 50,verbose=T)

Now, let’s inspect our model’s accuracy from the 30th epoch of the deep learning model.

history$metrics$accuracy[30]
## [1] 0.9457096

From the 30th epoch, we got a pretty got accuracy at above 90%.

Now, let’s evaluate using the test set.

prediction <- model %>% 
  predict_classes(x = test.x_keras)

confusionMatrix(as.factor(prediction), reference = as.factor(test.y_keras),positive = "1")
## Warning in confusionMatrix.default(as.factor(prediction), reference =
## as.factor(test.y_keras), : Levels are not in the same order for reference and
## data. Refactoring data to match.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1458   75
##          1    0    0
##                                           
##                Accuracy : 0.9511          
##                  95% CI : (0.9391, 0.9613)
##     No Information Rate : 0.9511          
##     P-Value [Acc > NIR] : 0.5307          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.00000         
##             Specificity : 1.00000         
##          Pos Pred Value :     NaN         
##          Neg Pred Value : 0.95108         
##              Prevalence : 0.04892         
##          Detection Rate : 0.00000         
##    Detection Prevalence : 0.00000         
##       Balanced Accuracy : 0.50000         
##                                           
##        'Positive' Class : 1               
## 

We see that the model have a pretty good accuracy of 95.11%. However, it predicts all of the classes to be 0. In which due to the class imbalance in our testing data it will result in a high accuracy.

Model Improvement

From the model building we have seen some bias in the initial model in which it fails to classify any of the existing stroke cases in the testing set correctly. In the previous step, we tried using upsampling to handle the class imbalance. In this step, we are going to try setting weights to each class of the model.

First of all, we set another x and y variable of the original training set to a new variable.

train.x_keras.ori<-train%>% select(-stroke) %>% scale() %>% as.matrix()
train.y_keras.ori<-to_categorical(train[,"stroke"])
test.x_keras<-test%>% select(-stroke) %>% scale() %>% as.matrix()
test.y_keras<-test[,"stroke"]
train.x_keras.ori<-array_reshape(train.x_keras.ori,dim=dim(train.x_keras.ori))
test.x_keras<-array_reshape(test.x_keras,dim=dim(test.x_keras))

Now, we calculate the proportions

prop.table(table(train[,"stroke"]))
## 
##          0          1 
## 0.95135588 0.04864412

In this case, we will use 0.04864412 as weight of 1 and 0.95135588 as the weight of zero. We will also make adjustments to increase the epoch to give the model more frequency to work through the entire training dataset

initializer <- initializer_random_normal(seed = 100)
model2 <- keras_model_sequential()
model2 %>% 
  layer_dense(input_shape = ncol(train.x_keras.ori), # input
              units = 56, activation = "relu", name = "hidden_1") %>%  # hidden layer 1
  layer_dense(units = 28, activation = "relu", name = "hidden_2") %>% # hidden layer 2
  layer_dense(units = 2, activation = "sigmoid", name = "output")
model2 %>% 
  compile(loss = "binary_crossentropy",
          optimizer = optimizer_adam(lr = 0.001),
          metric = "accuracy")
history2 <- model2 %>% 
  keras::fit(train.x_keras.ori,
      train.y_keras.ori,
      epoch = 50, 
      batch_size = 50,verbose=T,
      class_weight = list("0"= 0.04864412 ,"1"=0.95135588))

Now, let’s calculate the accuracy in the 50th epoch.

history2$metrics$accuracy[50]
## [1] 0.8470786

It seems like we got a lower accuracy.

Now, let’s examine the performance on the testing set.

prediction2 <- model2 %>% 
  predict_classes(x = test.x_keras)

confusionMatrix(as.factor(prediction2), reference = as.factor(test.y_keras))
## Warning in confusionMatrix.default(as.factor(prediction2), reference =
## as.factor(test.y_keras)): Levels are not in the same order for reference and
## data. Refactoring data to match.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1458   75
##          1    0    0
##                                           
##                Accuracy : 0.9511          
##                  95% CI : (0.9391, 0.9613)
##     No Information Rate : 0.9511          
##     P-Value [Acc > NIR] : 0.5307          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.9511          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.9511          
##          Detection Rate : 0.9511          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 0               
## 

All in all, we got a performance that is exactly the same as the first model, which means that setting the class weights did not improve the model’s performance.

Conclusion

In conclusion, we constructed deep learning models. In both models, we attempted two different methods of handling class imbalance.

In which we use upSampling in the first model and adjusting class weights in the second model. In both models we got the same accuracy for the test method, however the initial model managed to perform better in the training set than the improved model.

All in all, the final model that is reasonable to choose based on the metrics is the initial model. This is due to the fact that the accuracy performance in the training and testing set managed to be higher than 90%. The accuracy of the testing and training model in the initial model also doesn’t significantly differ in which it is an indication that no overfitting or underfitting exists.

However, both models did not manage to correctly predict the occurrence of stroke in the testing set. Thus, while the accuracy in the final model is already quite high, it indicates that further improvement can still be made.

References