In this project, we are going to attempt doing a stroke prediction using a deep learning model from the keras library using the dataset obtained from Kaggle.
According to the Mayo Clinic, Stroke is a medical condition in which the flow of blood to the brain is reduced or blocked completely. Stroke risk factors includes pre-existing medical condition such as high blood pressure and obesity.
Thus, this project aims to predict whether an individual is likely to get stroke based on the input parameters like gender, age, various diseases, and smoking status.
library(keras)
library(dplyr)
library(ggplot2)
library(vtreat)
library(magrittr)
library(caret)
library(tidyverse)
library(RColorBrewer)
library(ggpubr)
library(rsample)df_stroke<-read.csv("healthcare-dataset-stroke-data.csv")rmarkdown::paged_table(df_stroke)Variable Description :
First of all, let’s check if the data has any missing values.
colSums(is.na(df_stroke))## id gender age hypertension
## 0 0 0 0
## heart_disease ever_married work_type Residence_type
## 0 0 0 0
## avg_glucose_level bmi smoking_status stroke
## 0 0 0 0
Seems like there is none, however if we inspect the data we can see that the data has entries of “N/A” in the bmi column. Now let’s check how many of those entries exists in our data.
colSums(df_stroke=="N/A")## id gender age hypertension
## 0 0 0 0
## heart_disease ever_married work_type Residence_type
## 0 0 0 0
## avg_glucose_level bmi smoking_status stroke
## 0 201 0 0
Now we know, that the bmi column has 201 entries of “N/A”
First and foremost let’s separate our data into two parts, data with bmi column with numeric entries and entries of “N/A”. In which we will transform the data into the appropriate variable type.
stroke_numeric<-df_stroke %>% filter(bmi!="N/A") %>% mutate(bmi=as.numeric(bmi))
stroke_na<-df_stroke %>% filter(bmi=="N/A") %>% mutate(bmi=ifelse(bmi=="N/A",NA,bmi))Now, let’s bind our data by rows.
df_stroke<-rbind(stroke_na,stroke_numeric)And next we find how many missing data in our column.
colSums(is.na(df_stroke))## id gender age hypertension
## 0 0 0 0
## heart_disease ever_married work_type Residence_type
## 0 0 0 0
## avg_glucose_level bmi smoking_status stroke
## 0 201 0 0
Since the bmi column has empties, we will do imputation in our data.
Now, we group the data by intervals of 5 and by gender, then we impute using the mean of those categories.
df_stroke_cleaned<-df_stroke %>% group_by(cut(age,seq(0,82,by=5)),gender) %>%
mutate(bmi=ifelse(is.na(bmi),mean(bmi,na.rm=TRUE),bmi))
rmarkdown::paged_table(df_stroke_cleaned)See that there are 13 columns in our data due to the additional column that we got from grouping our data.
Now, let’s deselect the id column and the column that we got from grouping our data.
df_cleaned<-df_stroke_cleaned %>% select(-id)
df_cleaned<-df_cleaned[,c(1:11)]Now, we are going to perform exploratory data analysis on our data. First and foremost let’s convert all of the binary variables on our data into “yes” and "no/
df_eda<-df_cleaned %>%
mutate(hypertension=ifelse(hypertension==0,"No","Yes")) %>%
mutate(heart_disease=ifelse(heart_disease==0,"No","Yes")) %>%
mutate(stroke=ifelse(stroke==0,"No","Yes"))First and foremost, we plot the target variable column which is the stroke column.
df_eda %>%
ggplot(aes(stroke))+geom_bar(aes(fill=stroke))+theme_minimal()+ylab("Count")+xlab("Stroke")+theme(legend.position ="none")From the graph above we can inspect that most of the individuals in our dataset never had a stroke. This may cause possible issues in the modeling process due to the imbalance in our dataset.
a<-df_eda %>%
ggplot(aes(stroke))+geom_bar(aes(fill=gender),position = "dodge")+theme_minimal()+ylab("Count")+xlab("Stroke")+scale_fill_brewer(palette = "Set2")+ggtitle("Stroke by Gender")
b<-df_eda %>%
ggplot(aes(stroke))+geom_bar(aes(fill=ever_married),position = "dodge")+theme_minimal()+ylab("Count")+xlab("Stroke")+scale_fill_brewer(palette = "Set2")+ggtitle("Stroke by Marriage Status")
ggarrange(a,b)From the graph above, we can observe that :
c<-df_eda %>%
ggplot(aes(stroke))+geom_bar(aes(fill=hypertension),position = "dodge")+theme_minimal()+ylab("Count")+xlab("Stroke")+scale_fill_brewer(palette = "Set2")+ggtitle("Stroke by Hypertension")
d<-df_eda %>%
ggplot(aes(stroke))+geom_bar(aes(fill=heart_disease),position = "dodge")+theme_minimal()+ylab("Count")+xlab("Stroke")+scale_fill_brewer(palette = "Set2")+ggtitle("Stroke by Heart Disease")
ggarrange(c,d)From the graph above, we can observe that :
e<-df_eda %>%
ggplot(aes(stroke))+geom_bar(aes(fill=work_type),position = "dodge")+theme_minimal()+ylab("Count")+xlab("Stroke")+scale_fill_brewer(palette = "Set2")+ggtitle("Stroke by Work Type")
f<-df_eda %>%
ggplot(aes(stroke))+geom_bar(aes(fill=Residence_type),position = "dodge")+theme_minimal()+ylab("Count")+xlab("Stroke")+scale_fill_brewer(palette = "Set2")+ggtitle("Stroke by Residence Type")
ggarrange(e,f,ncol = 1)From the graph above, we can observe that :
df_eda %>%
ggplot(aes(stroke))+geom_bar(aes(fill=smoking_status),position = "dodge")+theme_minimal()+ylab("Count")+xlab("Stroke")+scale_fill_brewer(palette = "Set2")+ggtitle("Stroke by Smoking Status")People who ever and never had a stroke are mostly people who were a former smoker.
Now, let’s plot the boxplots for all of the quantitative variables based on the stroke columns.
x<-df_eda %>% ggplot(aes(age))+geom_boxplot(aes(fill=stroke))+theme_minimal()+coord_flip()+ggtitle("Age")
y<-df_eda %>% ggplot(aes(avg_glucose_level))+geom_boxplot(aes(fill=stroke))+theme_minimal()+coord_flip()+ggtitle("Average Glucose Level")
z<-df_eda %>% ggplot(aes(bmi))+geom_boxplot(aes(fill=stroke))+theme_minimal()+coord_flip()+ggtitle("Body Mass Index")
ggarrange(x,y,z,common.legend = T)From the boxplots above we can observe that people who ever had a stroke has older age and higher glucose level than the people who haven’t.
Since the deep learning algorithm requires all of our data to be in numerical form, we are going do one-hot encoding on our data.
treatplan<-designTreatmentsZ(df_cleaned,colnames(df_cleaned),verbose = F)Now, we examine the score frame
scoreFrame <- treatplan %>%
use_series(scoreFrame) %>%
select(varName, origName, code)We only want the rows with codes “clean” or “lev”
newvars <- scoreFrame %>%
filter(code %in% c("clean","lev")) %>%
use_series(varName)
df_stroke.treat <- prepare(treatplan, df_cleaned, varRestriction = newvars)
df_stroke.treat<-df_stroke.treat %>% mutate(stroke=as.factor(stroke))Now, we can see that all of our categorical data are one hot encoded
rmarkdown::paged_table(df_stroke.treat)Now, let’s split our data into training and testing set. In this modeling process, we are going to split the test into 70% training data and 30% testing data.
set.seed(123)
id<-initial_split(data = df_stroke.treat, prop = 0.7, strata = "stroke")
train<-training(id)
test<-testing(id)Now, let’s examine the target class in the training set of our data.
table(train$stroke)##
## 0 1
## 3403 174
We can see that our target variables are quite imbalanced. So now we are going to do upsampling to even out our data classes.
df_train.sampled<-upSample(x = train %>% select(-stroke),
y = as.factor(train$stroke), list = F,
yname = "stroke")Perfect, now we have balanced classes.
table(df_train.sampled$stroke)##
## 0 1
## 3403 3403
The deep learning algorithm also requires the data to be scaled prior to modeling, so now we scale the data and convert the explanatory variables into an array.
train.x_keras<-df_train.sampled%>% select(-stroke) %>% scale() %>% as.matrix()
train.y_keras<-to_categorical(df_train.sampled[,"stroke"])
test.x_keras<-test%>% select(-stroke) %>% scale() %>% as.matrix()
test.y_keras<-test[,"stroke"]
train.x_keras<-array_reshape(train.x_keras,dim=dim(train.x_keras))
test.x_keras<-array_reshape(test.x_keras,dim=dim(test.x_keras))First and foremost, we set the specifications for our model:
initializer <- initializer_random_normal(seed = 100)
model <- keras_model_sequential()
model %>%
layer_dense(input_shape = ncol(train.x_keras), # input
units = 56, activation = "relu", name = "hidden_1") %>% # hidden layer 1
layer_dense(units = 28, activation = "relu", name = "hidden_2") %>% # hidden layer 2
layer_dense(units = 2, activation = "sigmoid", name = "output")
model %>%
compile(loss = "binary_crossentropy",
optimizer = optimizer_adam(lr = 0.001),
metric = "accuracy")
history <- model %>%
keras::fit(train.x_keras,
train.y_keras,
epoch = 30,
batch_size = 50,verbose=T)Now, let’s inspect our model’s accuracy from the 30th epoch of the deep learning model.
history$metrics$accuracy[30]## [1] 0.9457096
From the 30th epoch, we got a pretty got accuracy at above 90%.
Now, let’s evaluate using the test set.
prediction <- model %>%
predict_classes(x = test.x_keras)
confusionMatrix(as.factor(prediction), reference = as.factor(test.y_keras),positive = "1")## Warning in confusionMatrix.default(as.factor(prediction), reference =
## as.factor(test.y_keras), : Levels are not in the same order for reference and
## data. Refactoring data to match.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1458 75
## 1 0 0
##
## Accuracy : 0.9511
## 95% CI : (0.9391, 0.9613)
## No Information Rate : 0.9511
## P-Value [Acc > NIR] : 0.5307
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.00000
## Specificity : 1.00000
## Pos Pred Value : NaN
## Neg Pred Value : 0.95108
## Prevalence : 0.04892
## Detection Rate : 0.00000
## Detection Prevalence : 0.00000
## Balanced Accuracy : 0.50000
##
## 'Positive' Class : 1
##
We see that the model have a pretty good accuracy of 95.11%. However, it predicts all of the classes to be 0. In which due to the class imbalance in our testing data it will result in a high accuracy.
From the model building we have seen some bias in the initial model in which it fails to classify any of the existing stroke cases in the testing set correctly. In the previous step, we tried using upsampling to handle the class imbalance. In this step, we are going to try setting weights to each class of the model.
First of all, we set another x and y variable of the original training set to a new variable.
train.x_keras.ori<-train%>% select(-stroke) %>% scale() %>% as.matrix()
train.y_keras.ori<-to_categorical(train[,"stroke"])
test.x_keras<-test%>% select(-stroke) %>% scale() %>% as.matrix()
test.y_keras<-test[,"stroke"]
train.x_keras.ori<-array_reshape(train.x_keras.ori,dim=dim(train.x_keras.ori))
test.x_keras<-array_reshape(test.x_keras,dim=dim(test.x_keras))Now, we calculate the proportions
prop.table(table(train[,"stroke"]))##
## 0 1
## 0.95135588 0.04864412
In this case, we will use 0.04864412 as weight of 1 and 0.95135588 as the weight of zero. We will also make adjustments to increase the epoch to give the model more frequency to work through the entire training dataset
initializer <- initializer_random_normal(seed = 100)
model2 <- keras_model_sequential()
model2 %>%
layer_dense(input_shape = ncol(train.x_keras.ori), # input
units = 56, activation = "relu", name = "hidden_1") %>% # hidden layer 1
layer_dense(units = 28, activation = "relu", name = "hidden_2") %>% # hidden layer 2
layer_dense(units = 2, activation = "sigmoid", name = "output")
model2 %>%
compile(loss = "binary_crossentropy",
optimizer = optimizer_adam(lr = 0.001),
metric = "accuracy")
history2 <- model2 %>%
keras::fit(train.x_keras.ori,
train.y_keras.ori,
epoch = 50,
batch_size = 50,verbose=T,
class_weight = list("0"= 0.04864412 ,"1"=0.95135588))Now, let’s calculate the accuracy in the 50th epoch.
history2$metrics$accuracy[50]## [1] 0.8470786
It seems like we got a lower accuracy.
Now, let’s examine the performance on the testing set.
prediction2 <- model2 %>%
predict_classes(x = test.x_keras)
confusionMatrix(as.factor(prediction2), reference = as.factor(test.y_keras))## Warning in confusionMatrix.default(as.factor(prediction2), reference =
## as.factor(test.y_keras)): Levels are not in the same order for reference and
## data. Refactoring data to match.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1458 75
## 1 0 0
##
## Accuracy : 0.9511
## 95% CI : (0.9391, 0.9613)
## No Information Rate : 0.9511
## P-Value [Acc > NIR] : 0.5307
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.9511
## Neg Pred Value : NaN
## Prevalence : 0.9511
## Detection Rate : 0.9511
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 0
##
All in all, we got a performance that is exactly the same as the first model, which means that setting the class weights did not improve the model’s performance.
In conclusion, we constructed deep learning models. In both models, we attempted two different methods of handling class imbalance.
In which we use upSampling in the first model and adjusting class weights in the second model. In both models we got the same accuracy for the test method, however the initial model managed to perform better in the training set than the improved model.
All in all, the final model that is reasonable to choose based on the metrics is the initial model. This is due to the fact that the accuracy performance in the training and testing set managed to be higher than 90%. The accuracy of the testing and training model in the initial model also doesn’t significantly differ in which it is an indication that no overfitting or underfitting exists.
However, both models did not manage to correctly predict the occurrence of stroke in the testing set. Thus, while the accuracy in the final model is already quite high, it indicates that further improvement can still be made.
Soriano, F. (2021, January 26). Stroke Prediction Dataset. Kaggle. https://www.kaggle.com/fedesoriano/stroke-prediction-dataset
Stroke - Symptoms and causes. (2021, February 9). Mayo Clinic. https://www.mayoclinic.org/diseases-conditions/stroke/symptoms-causes/syc-20350113