A stroke is a medical condition in which poor blood flow to the brain causes cell death. According to the World Health Organization (WHO), stroke is the second leading cause of death globally, responsible for approximately 11% of total deaths. It is a dangerous but common condition that can greatly affect a person's quality of life. Potential risk factors include age, hypertension, cholesterol, smoking, diabetes, and obesity.
As a healthcare analytics student, what interests me is whether we can predict the stroke outcome for potential patients. This project aims to predict whether a person will have a stroke based on historical data. We mainly want to finish two tasks: identifying which features are associated with stroke, and building a model that predicts whether a patient will have a stroke.
This dataset is a record of patients' body features and their stroke status. It contains 5110 observations with 12 attributes. It is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about a patient.
Below is the attribute information:
*Note: “Unknown” in smoking_status means that the information is unavailable for this patient.
Here is the link to the dataset: https://www.kaggle.com/fedesoriano/stroke-prediction-dataset
Let’s import the data and get started.
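The original post does not show its package loads. Based on the functions used below, a minimal set of library calls would be the following (an assumption; adjust to your own setup):
# Packages assumed throughout the analysis (install them first if missing)
library(dplyr)        # %>% pipes and mutate()
library(tidyr)        # replace_na()
library(ggplot2)      # plots
library(gridExtra)    # grid.arrange()
library(ggmosaic)     # geom_mosaic()
library(data.table)   # copy()
library(fastDummies)  # dummy_cols()
library(caret)        # createDataPartition(), confusionMatrix()
library(randomForest) # randomForest()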
# Import data file
data <- read.csv("healthcare-dataset-stroke-data.csv")
str(data)
## 'data.frame': 5110 obs. of 12 variables:
## $ id : int 9046 51676 31112 60182 1665 56669 53882 10434 27419 60491 ...
## $ gender : chr "Male" "Female" "Male" "Female" ...
## $ age : num 67 61 80 49 79 81 74 69 59 78 ...
## $ hypertension : int 0 0 0 0 1 0 1 0 0 0 ...
## $ heart_disease : int 1 0 1 0 0 0 1 0 0 0 ...
## $ ever_married : chr "Yes" "Yes" "Yes" "Yes" ...
## $ work_type : chr "Private" "Self-employed" "Private" "Private" ...
## $ Residence_type : chr "Urban" "Rural" "Rural" "Urban" ...
## $ avg_glucose_level: num 229 202 106 171 174 ...
## $ bmi : chr "36.6" "N/A" "32.5" "34.4" ...
## $ smoking_status : chr "formerly smoked" "never smoked" "never smoked" "smokes" ...
## $ stroke : int 1 1 1 1 1 1 1 1 1 1 ...
summary(data)
## id gender age hypertension
## Min. : 67 Length:5110 Min. : 0.08 Min. :0.00000
## 1st Qu.:17741 Class :character 1st Qu.:25.00 1st Qu.:0.00000
## Median :36932 Mode :character Median :45.00 Median :0.00000
## Mean :36518 Mean :43.23 Mean :0.09746
## 3rd Qu.:54682 3rd Qu.:61.00 3rd Qu.:0.00000
## Max. :72940 Max. :82.00 Max. :1.00000
## heart_disease ever_married work_type Residence_type
## Min. :0.00000 Length:5110 Length:5110 Length:5110
## 1st Qu.:0.00000 Class :character Class :character Class :character
## Median :0.00000 Mode :character Mode :character Mode :character
## Mean :0.05401
## 3rd Qu.:0.00000
## Max. :1.00000
## avg_glucose_level bmi smoking_status stroke
## Min. : 55.12 Length:5110 Length:5110 Min. :0.00000
## 1st Qu.: 77.25 Class :character Class :character 1st Qu.:0.00000
## Median : 91.89 Mode :character Mode :character Median :0.00000
## Mean :106.15 Mean :0.04873
## 3rd Qu.:114.09 3rd Qu.:0.00000
## Max. :271.74 Max. :1.00000
From the structure and summary of the dataset, we can spot a few issues:
For the data cleaning part, we need to look at the column types and missing values. First, we drop the id column and deal with bmi: it should be a numeric column, but it is currently stored as character.
# Drop column id
stroke = subset(data, select = -c(id))
# Convert column bmi to numeric
suppressWarnings(stroke$bmi <- as.numeric(as.character(stroke$bmi)))
# Check for null
colSums(is.na(stroke))
## gender age hypertension heart_disease
## 0 0 0 0
## ever_married work_type Residence_type avg_glucose_level
## 0 0 0 0
## bmi smoking_status stroke
## 201 0 0
As we can see, there are 201 missing values in bmi. We replace the NAs with the column median.
# Replace missing bmi values with the column median
bmi.median = median(stroke$bmi, na.rm = TRUE)
stroke$bmi <- stroke$bmi %>% replace_na(bmi.median)
Next, we take a look at the other non-numeric columns. Let's continue with the gender column.
# Examine column gender
as.data.frame(table(stroke$gender))
## Var1 Freq
## 1 Female 2994
## 2 Male 2115
## 3 Other 1
Since there is only 1 row with 'Other' in the gender column, we can drop that row and treat gender as a binary feature.
# Drop the row with gender 'Other' (there is only 1 such row)
stroke = stroke[!stroke$gender == 'Other',]
Then we examine the columns work_type, Residence_type, and ever_married.
# Examine work_type
as.data.frame(table(stroke$work_type))
## Var1 Freq
## 1 children 687
## 2 Govt_job 657
## 3 Never_worked 22
## 4 Private 2924
## 5 Self-employed 819
as.data.frame(table(stroke$Residence_type))
## Var1 Freq
## 1 Rural 2513
## 2 Urban 2596
as.data.frame(table(stroke$ever_married))
## Var1 Freq
## 1 No 1756
## 2 Yes 3353
These all look fine. Last, we examine the smoking feature.
# Examine smoking
as.data.frame(table(stroke$smoking_status))
## Var1 Freq
## 1 formerly smoked 884
## 2 never smoked 1892
## 3 smokes 789
## 4 Unknown 1544
There is a large number of 'Unknown' values in smoking_status. We replace them with the most frequent category, which is 'never smoked'.
# Replace 'Unknown' with the most frequent category
stroke <- stroke %>% mutate(smoking_status = replace(smoking_status, smoking_status == "Unknown", "never smoked"))
We can now look at the data frame and check for NAs again.
head(stroke)
## gender age hypertension heart_disease ever_married work_type
## 1 Male 67 0 1 Yes Private
## 2 Female 61 0 0 Yes Self-employed
## 3 Male 80 0 1 Yes Private
## 4 Female 49 0 0 Yes Private
## 5 Female 79 1 0 Yes Self-employed
## 6 Male 81 0 0 Yes Private
## Residence_type avg_glucose_level bmi smoking_status stroke
## 1 Urban 228.69 36.6 formerly smoked 1
## 2 Rural 202.21 28.1 never smoked 1
## 3 Rural 105.92 32.5 never smoked 1
## 4 Urban 171.23 34.4 smokes 1
## 5 Rural 174.12 24.0 never smoked 1
## 6 Urban 186.21 29.0 formerly smoked 1
colSums(is.na(stroke))
## gender age hypertension heart_disease
## 0 0 0 0
## ever_married work_type Residence_type avg_glucose_level
## 0 0 0 0
## bmi smoking_status stroke
## 0 0 0
For the second part of the data preparation, we need to transform some columns for later analysis. The 0-1 binary variables are converted into factors.
# Save a copy of the dataset
stroke1 <- copy(stroke)
# Transform binary variables to factor
stroke$stroke <- as.factor(stroke$stroke)
stroke$hypertension <- as.factor(stroke$hypertension)
stroke$heart_disease <- as.factor(stroke$heart_disease)
stroke$ever_married <- as.factor(stroke$ever_married)
stroke$Residence_type <- as.factor(stroke$Residence_type)
# examine the dataset
stroke %>% head()
## gender age hypertension heart_disease ever_married work_type
## 1 Male 67 0 1 Yes Private
## 2 Female 61 0 0 Yes Self-employed
## 3 Male 80 0 1 Yes Private
## 4 Female 49 0 0 Yes Private
## 5 Female 79 1 0 Yes Self-employed
## 6 Male 81 0 0 Yes Private
## Residence_type avg_glucose_level bmi smoking_status stroke
## 1 Urban 228.69 36.6 formerly smoked 1
## 2 Rural 202.21 28.1 never smoked 1
## 3 Rural 105.92 32.5 never smoked 1
## 4 Urban 171.23 34.4 smokes 1
## 5 Rural 174.12 24.0 never smoked 1
## 6 Urban 186.21 29.0 formerly smoked 1
In case we need to run a regression later, we should encode the multi-level features as dummy variables and save the result in a separate data frame.
#For multi-level variables, get dummy
stroke_dummy <- dummy_cols(stroke,select_columns = c("gender","work_type","smoking_status"),remove_first_dummy = TRUE, remove_selected_columns = TRUE)
stroke_dummy %>% head()
## age hypertension heart_disease ever_married Residence_type avg_glucose_level
## 1 67 0 1 Yes Urban 228.69
## 2 61 0 0 Yes Rural 202.21
## 3 80 0 1 Yes Rural 105.92
## 4 49 0 0 Yes Urban 171.23
## 5 79 1 0 Yes Rural 174.12
## 6 81 0 0 Yes Urban 186.21
## bmi stroke gender_Male work_type_Govt_job work_type_Never_worked
## 1 36.6 1 1 0 0
## 2 28.1 1 0 0 0
## 3 32.5 1 1 0 0
## 4 34.4 1 0 0 0
## 5 24.0 1 0 0 0
## 6 29.0 1 1 0 0
## work_type_Private work_type_Self-employed smoking_status_never smoked
## 1 1 0 0
## 2 0 1 1
## 3 1 0 1
## 4 1 0 0
## 5 0 1 1
## 6 1 0 0
## smoking_status_smokes
## 1 0
## 2 0
## 3 0
## 4 1
## 5 0
## 6 0
This finishes the data cleaning, and we can move on to the EDA.
First, let's look at the distribution of each individual variable. For discrete variables, we use bar plots to show their distributions.
# Discrete
p1 <- ggplot(data = stroke) +geom_bar(mapping = aes(x = gender))
p2 <-ggplot(data = stroke) +geom_bar(mapping = aes(x = hypertension))
p3 <-ggplot(data = stroke) +geom_bar(mapping = aes(x = heart_disease))
p4 <-ggplot(data = stroke) +geom_bar(mapping = aes(x = ever_married))
grid.arrange(p1,p2,p3,p4, ncol= 2)
From these plots, we conclude that most patients have neither heart disease nor hypertension, and that females and married people form the majority.
p5 <-ggplot(data = stroke) +geom_bar(mapping = aes(x = work_type))
p6 <-ggplot(data = stroke) +geom_bar(mapping = aes(x = Residence_type))
p7 <-ggplot(data = stroke) +geom_bar(mapping = aes(x = smoking_status))
p8 <-ggplot(data = stroke) +geom_bar(mapping = aes(x = stroke))
grid.arrange(p5,p6,p7,p8, ncol= 2)
From these plots, we conclude that most patients did not have a stroke, and that private-sector workers and non-smokers form the majority.
For the continuous variables, we use histograms to display their distributions.
# Continuous
c1 <- ggplot(data = stroke) + geom_histogram(mapping = aes(x = age), binwidth = 0.5, col = 'steelblue')
c2 <- ggplot(data = stroke) + geom_histogram(mapping = aes(x = avg_glucose_level), binwidth = 0.5, col = 'steelblue')
c3 <- ggplot(data = stroke) + geom_histogram(mapping = aes(x = bmi), binwidth = 0.5, col = 'steelblue')
grid.arrange(c1,c2,c3, ncol= 2)
From these plots, we conclude that age is slightly left-skewed, while avg_glucose_level and bmi are right-skewed. The spike in the bmi plot is the result of replacing NAs with the median. The median seems to be a better imputation value than the mean, since bmi is right-skewed.
For the next part of the EDA, we look at the relationships between variables. The outcome in this dataset is stroke, so we first examine the connection between stroke and some of the input features. For a continuous input, a boxplot works best.
#stroke on age
ggplot(data = stroke, mapping = aes(x = stroke, y = age)) +geom_boxplot()+labs(title='Stroke on Age')
#stroke on glucose level
ggplot(data = stroke, mapping = aes(x = stroke, y = avg_glucose_level)) +geom_boxplot()+labs(title='Stroke on Glucose Level')
#stroke on bmi
ggplot(data = stroke, mapping = aes(x = stroke, y = bmi)) +geom_boxplot()+labs(title='Stroke on bmi')
We can conclude from these plots that patients who had a stroke are noticeably older than those who did not. They also have somewhat higher glucose levels and bmi, but the difference is less pronounced.
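To put a number on the age difference seen in the boxplots, a quick two-sample comparison could be added (an illustrative sketch, not part of the original analysis):
# Welch two-sample t-test: does mean age differ between the stroke groups?
t.test(age ~ stroke, data = stroke)
# The same comparison can be repeated for avg_glucose_level and bmi.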
Then we could look at the connection between stroke and some discrete input features. In this part, we use Mosaic Plot to plot the connection.
ggplot(data = stroke) +geom_mosaic(aes(x = product(stroke,work_type), fill=work_type)) + labs(title='Stroke on Work Type')
ggplot(data = stroke) +geom_mosaic(aes(x = product(stroke,smoking_status), fill=smoking_status)) + labs(title='Stroke on smoking_status')
ggplot(data = stroke) +geom_mosaic(aes(x = product(stroke,hypertension), fill=hypertension)) + labs(title='Stroke on hypertension')
ggplot(data = stroke) +geom_mosaic(aes(x = product(stroke,ever_married), fill=ever_married)) + labs(title='Stroke on ever_married')
From these plots, we can conclude that:
* Self-employed workers are more likely to have a stroke.
* Smoking seems to have little effect on stroke.
* Those who have hypertension are more likely to have a stroke.
* Those who have never married are less likely to have a stroke (interestingly).
Next, we want to look at the correlations among the input variables. One way to do this is through a correlation heatmap. To build it, we have to convert some qualitative variables into quantitative ones.
# Residence_type: Urban for 0, Rural for 1
stroke1$Residence_type[stroke1$Residence_type == "Urban"] <- 0
stroke1$Residence_type[stroke1$Residence_type == "Rural"] <- 1
# ever_married: No for 0, Yes for 1
stroke1$ever_married[stroke1$ever_married == "Yes"] <- 1
stroke1$ever_married[stroke1$ever_married == "No"] <- 0
# gender: Male for 0, Female for 1
stroke1$gender[stroke1$gender == "Male"] <- 0
stroke1$gender[stroke1$gender == "Female"] <- 1
suppressWarnings(stroke1$Residence_type <- as.numeric(as.character(stroke1$Residence_type)))
suppressWarnings(stroke1$ever_married <- as.numeric(as.character(stroke1$ever_married)))
suppressWarnings(stroke1$gender <- as.numeric(as.character(stroke1$gender)))
# Quantitative Variables: Correlation Map
stroke.quant = subset(stroke1, select = -c(work_type, smoking_status))
stroke.cor = round(cor(stroke.quant),2)
ggplot(data = reshape2::melt(stroke.cor),aes(x=Var1, y=Var2, fill=value)) + geom_tile() + scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1,1), space = "Lab", name="Pearson\nCorrelation") + geom_text(aes(Var2, Var1, label = value), color = "black", size = 4) + theme(axis.text.x = element_text(angle = 30))
As we can see from the heatmap, most features are not strongly correlated with one another, which is convenient for regression. The only noteworthy correlation is between age and ever_married, and the reason behind it is fairly obvious. Among all features, age has the largest correlation coefficient with stroke.
For continuous variables, we can also use a scatter plot to display their relationship.
cont.plot <- ggplot(data = stroke, aes(x= age, y = bmi, color = stroke))+geom_point()
cont.plot
As we can see, strokes are more common among older patients and patients with higher bmi.
From the EDA, we find that no single feature has a strong correlation with stroke. To quantify how each feature affects the outcome, we need to run a regression. Since stroke is a binary variable, logistic regression is a natural choice.
First, we need to build the training dataset. We use stroke_dummy for the regression. To evaluate the model, we split the dataset into a training set (70%) and a testing set (30%).
# Training set creation
Training <- createDataPartition(y = stroke_dummy$stroke, p = 0.7, list = FALSE)
training <- stroke_dummy[Training,]
testing <- stroke_dummy[-Training,]
We can examine the dimensions of the training and testing sets.
dim(training)
## [1] 3577 15
dim(testing)
## [1] 1532 15
Next, we carry out the regression using a generalized linear model (GLM) with a logit link. The model is trained on the training set.
model <- glm(stroke ~.,family=binomial(link='logit'), data=training)
summary(model)
##
## Call:
## glm(formula = stroke ~ ., family = binomial(link = "logit"),
## data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2310 -0.3171 -0.1675 -0.0972 3.4981
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.922908 0.806938 -7.340 2.14e-13 ***
## age 0.072741 0.006799 10.700 < 2e-16 ***
## hypertension1 0.563302 0.192796 2.922 0.00348 **
## heart_disease1 0.342615 0.227875 1.504 0.13270
## ever_marriedYes -0.233803 0.265039 -0.882 0.37770
## Residence_typeUrban -0.102680 0.164565 -0.624 0.53266
## avg_glucose_level 0.002709 0.001456 1.860 0.06287 .
## bmi -0.004730 0.013519 -0.350 0.72640
## gender_Male 0.004179 0.169271 0.025 0.98030
## work_type_Govt_job -1.129978 0.865039 -1.306 0.19146
## work_type_Never_worked -10.635415 375.185929 -0.028 0.97739
## work_type_Private -0.965535 0.844501 -1.143 0.25291
## `work_type_Self-employed` -1.438607 0.872193 -1.649 0.09906 .
## `smoking_status_never smoked` -0.244233 0.191801 -1.273 0.20289
## smoking_status_smokes -0.074173 0.261212 -0.284 0.77644
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1397.4 on 3576 degrees of freedom
## Residual deviance: 1116.0 on 3562 degrees of freedom
## AIC: 1146
##
## Number of Fisher Scoring iterations: 14
From the regression result, we can conclude that age and hypertension are highly significant positive predictors of stroke, avg_glucose_level and work_type_Self-employed are only marginally significant (at the 10% level), and the remaining features are not statistically significant.
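Because logistic regression coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are easier to interpret (a small sketch added for illustration):
# Convert the log-odds coefficients into odds ratios
round(exp(coef(model)), 3)
# For example, the age coefficient of about 0.073 corresponds to an odds ratio of roughly 1.08,
# i.e. each additional year of age multiplies the odds of stroke by about 1.08, holding the other features fixed.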
Now we can run the anova() function on the model to see the deviance of the regression model.
anova(model, test="Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: stroke
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 3576 1397.4
## age 1 253.376 3575 1144.0 < 2.2e-16 ***
## hypertension 1 10.619 3574 1133.4 0.001119 **
## heart_disease 1 3.970 3573 1129.5 0.046322 *
## ever_married 1 0.597 3572 1128.9 0.439705
## Residence_type 1 0.282 3571 1128.6 0.595324
## avg_glucose_level 1 3.978 3570 1124.6 0.046097 *
## bmi 1 0.203 3569 1124.4 0.652680
## gender_Male 1 0.040 3568 1124.3 0.841968
## work_type_Govt_job 1 0.019 3567 1124.3 0.890438
## work_type_Never_worked 1 0.105 3566 1124.2 0.746112
## work_type_Private 1 4.417 3565 1119.8 0.035590 *
## `work_type_Self-employed` 1 2.017 3564 1117.8 0.155541
## `smoking_status_never smoked` 1 1.668 3563 1116.1 0.196522
## smoking_status_smokes 1 0.081 3562 1116.0 0.775904
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ANOVA shows how each feature lowers the null deviance toward the residual deviance. Besides the significant features from the regression summary, avg_glucose_level also significantly lowers the deviance. The features with large p-values indicate that, even without them, the model would explain roughly the same share of the total variation.
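To summarize this in a single number, we could compute McFadden's pseudo-R² from the null and residual deviance (a small sketch, not part of the original analysis):
# McFadden's pseudo R-squared: 1 - residual deviance / null deviance
round(1 - model$deviance / model$null.deviance, 3)
# Roughly 0.20 here (1 - 1116.0 / 1397.4), so the predictors explain only a modest share of the deviance.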
With the model trained, we could predict on the testing set.
#predict the stroke variable based on the testing dataset
model.prob = predict(model, testing, type="response")
To evaluate the predictions, we use the confusionMatrix function from the caret package. We set the classification threshold at 0.5, meaning that if the predicted probability of stroke is greater than 0.5, we classify the patient as likely to have a stroke.
# use caret and compute a confusion matrix
confusionMatrix(data = as.factor(as.numeric(model.prob>0.5)), reference = as.factor(testing$stroke))
## Warning in confusionMatrix.default(data = as.factor(as.numeric(model.prob > :
## Levels are not in the same order for reference and data. Refactoring data to
## match.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1458 74
## 1 0 0
##
## Accuracy : 0.9517
## 95% CI : (0.9397, 0.9619)
## No Information Rate : 0.9517
## P-Value [Acc > NIR] : 0.5309
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.9517
## Neg Pred Value : NaN
## Prevalence : 0.9517
## Detection Rate : 0.9517
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 0
##
From the confusion matrix report, the accuracy is 0.95, which looks high. However, a closer look shows that the model simply predicts 0 for almost every patient. The outcome stroke is so imbalanced that the model learns that nearly all outcomes are 0, and it fails to distinguish patients who had a stroke from those who did not.
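To see why 95% accuracy is not impressive here, it helps to compare it with the no-information rate, i.e. the accuracy of always predicting "no stroke" (a quick sanity check, not in the original write-up):
# Class prevalence in the test set: always predicting 0 already achieves about 95% accuracy
prop.table(table(testing$stroke))
# The reported accuracy (0.9517) equals this no-information rate, so the model adds nothing at the 0.5 cutoff.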
What if we change the threshold? We found a helpful answer on Cross Validated (https://stats.stackexchange.com/questions/199978/optimizing-probability-thresholds-in-a-glm-model-in-caret) showing how to plot the relationship between the threshold and accuracy.
threshold <- seq(0.3,0.6,0.01)
accuracy <- NULL
for (i in seq(along = threshold)){
prediction <- ifelse(model$fitted.values >= threshold[i], 1, 0) # predict with this cutoff
accuracy <- c(accuracy,length(which(training$stroke ==prediction))/length(prediction)*100)
}
plot(threshold, accuracy, pch =19,type='b',col= "steelblue",main ="Logistic Regression", xlab="Cutoff Level", ylab = "Accuracy %")
The result is not encouraging: we cannot raise the accuracy simply by changing the cutoff threshold.
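Accuracy alone also hides the sensitivity/specificity trade-off. One alternative (a sketch assuming the pROC package, which is not used in the original analysis) is to look at an ROC curve on the test set:
# ROC curve for the logistic regression on the test set (assumes pROC is installed)
library(pROC)
roc.obj <- roc(testing$stroke, model.prob)
auc(roc.obj)            # area under the curve
coords(roc.obj, "best") # cutoff maximizing sensitivity + specificity (Youden's index)
plot(roc.obj, main = "ROC curve - logistic regression")
A threshold chosen this way trades some overall accuracy for a better chance of actually flagging stroke cases.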
Maybe a different model would help? We can try a random forest classifier and compare the outcome.
stroke.rf <- randomForest(stroke ~ ., data = stroke, importance = TRUE,proximity = TRUE)
print(stroke.rf)
##
## Call:
## randomForest(formula = stroke ~ ., data = stroke, importance = TRUE, proximity = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 4.97%
## Confusion matrix:
## 0 1 class.error
## 0 4854 6 0.001234568
## 1 248 1 0.995983936
The random forest classifier reaches an accuracy of 95.03% (based on the out-of-bag error), but its confusion matrix still looks much like the one from the logistic regression. So the class imbalance in the dataset strongly affects the prediction results for both models.
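Since the forest was grown with importance = TRUE, one quick follow-up (a sketch, not in the original write-up) is to check which features it relies on most:
# Variable importance from the fitted random forest
importance(stroke.rf)
varImpPlot(stroke.rf, main = "Variable importance")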
In this project, we used logistic regression to explore the relationship between stroke and the input features. We concluded that age, hypertension, and work_type_Self-employed affect the probability of having a stroke. We also used logistic regression and a random forest to build prediction models for stroke. Both models reach an accuracy of about 95%, but the class imbalance of the dataset limits how meaningful that accuracy is. In the future, we would look at approaches such as anomaly detection to address this problem.
Our recommendation would be:
https://stackoverflow.com/questions/46028360/confusionmatrix-for-logistic-regression-in-r
https://tidyr.tidyverse.org/reference/replace_na.html
https://www.r-bloggers.com/2015/09/how-to-perform-a-logistic-regression-in-r/
https://stats.idre.ucla.edu/r/dae/logit-regression/
http://rstudio-pubs-static.s3.amazonaws.com/74431_8cbd662559f6451f9cd411545f28107f.html
https://www.statmethods.net/advstats/glm.html
https://www.geeksforgeeks.org/random-forest-approach-for-classification-in-r-programming/
https://daviddalpiaz.github.io/r4sl/the-caret-package.html
http://appliedpredictivemodeling.com/blog/2014/2/1/lw6har9oewknvus176q4o41alqw2ow
https://rdrr.io/cran/caret/man/thresholder.html
https://www.theissaclee.com/post/logistic-regression-beta/#fig:roc
https://stackoverflow.com/questions/65844978/caret-classification-thresholds