A stroke is a medical condition in which poor blood flow to the brain causes cell death. According to the World Health Organization (WHO), stroke is the second leading cause of death globally, responsible for approximately 11% of total deaths. It is a dangerous but common condition that can greatly affect a person's quality of life. Potential risk factors include age, hypertension, cholesterol, smoking, diabetes, and obesity.
As a healthcare analytics student, what interests me is whether we can predict the stroke outcome for potential patients. This project aims to predict whether a person will have a stroke based on historical data. We mainly want to finish two tasks: identifying which features are associated with stroke, and building a model that predicts whether a patient will have a stroke.
This dataset is a record of patients' body features and their stroke status. It contains 5110 observations with 12 attributes. It is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about a patient.
Below is the attribute information:
*Note: “Unknown” in smoking_status means that the information is unavailable for this patient.
Here is the link to the dataset: https://www.kaggle.com/fedesoriano/stroke-prediction-dataset
Let’s import the data and get started.
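The original post does not show its package loads. Based on the functions used below, a minimal set of library calls would be the following (an assumption; adjust to your own setup):
# Packages assumed throughout the analysis (install them first if missing)
library(dplyr)        # %>% pipes and mutate()
library(tidyr)        # replace_na()
library(ggplot2)      # plots
library(gridExtra)    # grid.arrange()
library(ggmosaic)     # geom_mosaic()
library(data.table)   # copy()
library(fastDummies)  # dummy_cols()
library(caret)        # createDataPartition(), confusionMatrix()
library(randomForest) # randomForest()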
# Import data file
data <- read.csv("healthcare-dataset-stroke-data.csv")
str(data)
## 'data.frame': 5110 obs. of 12 variables:
## $ id : int 9046 51676 31112 60182 1665 56669 53882 10434 27419 60491 ...
## $ gender : chr "Male" "Female" "Male" "Female" ...
## $ age : num 67 61 80 49 79 81 74 69 59 78 ...
## $ hypertension : int 0 0 0 0 1 0 1 0 0 0 ...
## $ heart_disease : int 1 0 1 0 0 0 1 0 0 0 ...
## $ ever_married : chr "Yes" "Yes" "Yes" "Yes" ...
## $ work_type : chr "Private" "Self-employed" "Private" "Private" ...
## $ Residence_type : chr "Urban" "Rural" "Rural" "Urban" ...
## $ avg_glucose_level: num 229 202 106 171 174 ...
## $ bmi : chr "36.6" "N/A" "32.5" "34.4" ...
## $ smoking_status : chr "formerly smoked" "never smoked" "never smoked" "smokes" ...
## $ stroke : int 1 1 1 1 1 1 1 1 1 1 ...
summary(data)
## id gender age hypertension
## Min. : 67 Length:5110 Min. : 0.08 Min. :0.00000
## 1st Qu.:17741 Class :character 1st Qu.:25.00 1st Qu.:0.00000
## Median :36932 Mode :character Median :45.00 Median :0.00000
## Mean :36518 Mean :43.23 Mean :0.09746
## 3rd Qu.:54682 3rd Qu.:61.00 3rd Qu.:0.00000
## Max. :72940 Max. :82.00 Max. :1.00000
## heart_disease ever_married work_type Residence_type
## Min. :0.00000 Length:5110 Length:5110 Length:5110
## 1st Qu.:0.00000 Class :character Class :character Class :character
## Median :0.00000 Mode :character Mode :character Mode :character
## Mean :0.05401
## 3rd Qu.:0.00000
## Max. :1.00000
## avg_glucose_level bmi smoking_status stroke
## Min. : 55.12 Length:5110 Length:5110 Min. :0.00000
## 1st Qu.: 77.25 Class :character Class :character 1st Qu.:0.00000
## Median : 91.89 Mode :character Mode :character Median :0.00000
## Mean :106.15 Mean :0.04873
## 3rd Qu.:114.09 3rd Qu.:0.00000
## Max. :271.74 Max. :1.00000
From the structure and summary of the dataset, we can spot a few issues:
For the data cleaning part, we need to look at the column types and missing values. First, we drop the id column and deal with bmi: it should be a numeric column, but it is currently stored as character.
# Drop column id
stroke = subset(data, select = -c(id))
# Convert column bmi to numeric
suppressWarnings(stroke$bmi <- as.numeric(as.character(stroke$bmi)))
# Check for null
colSums(is.na(stroke))
## gender age hypertension heart_disease
## 0 0 0 0
## ever_married work_type Residence_type avg_glucose_level
## 0 0 0 0
## bmi smoking_status stroke
## 201 0 0
As we can see, there are 201 missing values in bmi. We replace the NAs with the column median.
# Replace missing bmi values with the column median
bmi.median = median(stroke$bmi, na.rm = TRUE)
stroke$bmi <- stroke$bmi %>% replace_na(bmi.median)
Next, we take a look at the other non-numeric columns. Let's continue with the gender column.
# Examine column gender
as.data.frame(table(stroke$gender))
## Var1 Freq
## 1 Female 2994
## 2 Male 2115
## 3 Other 1
Since there is only 1 row with 'Other' in the gender column, we can drop that row and treat gender as a binary feature.
# Drop the row with gender 'Other' (there is only 1 such row)
stroke = stroke[!stroke$gender == 'Other',]
Then we examine the columns work_type, Residence_type, and ever_married.
# Examine work_type
as.data.frame(table(stroke$work_type))
## Var1 Freq
## 1 children 687
## 2 Govt_job 657
## 3 Never_worked 22
## 4 Private 2924
## 5 Self-employed 819
as.data.frame(table(stroke$Residence_type))
## Var1 Freq
## 1 Rural 2513
## 2 Urban 2596
as.data.frame(table(stroke$ever_married))
## Var1 Freq
## 1 No 1756
## 2 Yes 3353
These all look fine. Last, we examine the smoking feature.
# Examine smoking
as.data.frame(table(stroke$smoking_status))
## Var1 Freq
## 1 formerly smoked 884
## 2 never smoked 1892
## 3 smokes 789
## 4 Unknown 1544
There is a large number of 'Unknown' values in smoking_status. We replace them with the most frequent category, which is 'never smoked'.
# Replace 'Unknown' with the most frequent category
stroke <- stroke %>% mutate(smoking_status = replace(smoking_status, smoking_status == "Unknown", "never smoked"))
We can now look at the data frame and check for NAs again.
head(stroke)
## gender age hypertension heart_disease ever_married work_type
## 1 Male 67 0 1 Yes Private
## 2 Female 61 0 0 Yes Self-employed
## 3 Male 80 0 1 Yes Private
## 4 Female 49 0 0 Yes Private
## 5 Female 79 1 0 Yes Self-employed
## 6 Male 81 0 0 Yes Private
## Residence_type avg_glucose_level bmi smoking_status stroke
## 1 Urban 228.69 36.6 formerly smoked 1
## 2 Rural 202.21 28.1 never smoked 1
## 3 Rural 105.92 32.5 never smoked 1
## 4 Urban 171.23 34.4 smokes 1
## 5 Rural 174.12 24.0 never smoked 1
## 6 Urban 186.21 29.0 formerly smoked 1
colSums(is.na(stroke))
## gender age hypertension heart_disease
## 0 0 0 0
## ever_married work_type Residence_type avg_glucose_level
## 0 0 0 0
## bmi smoking_status stroke
## 0 0 0
For the second part of the data preparation, we need to transform some columns for later analysis. The 0-1 binary variables are converted into factors.
# Save a copy of the dataset
stroke1 <- copy(stroke)
# Transform binary variables to factor
stroke$stroke <- as.factor(stroke$stroke)
stroke$hypertension <- as.factor(stroke$hypertension)
stroke$heart_disease <- as.factor(stroke$heart_disease)
stroke$ever_married <- as.factor(stroke$ever_married)
stroke$Residence_type <- as.factor(stroke$Residence_type)
# examine the dataset
stroke %>% head()
## gender age hypertension heart_disease ever_married work_type
## 1 Male 67 0 1 Yes Private
## 2 Female 61 0 0 Yes Self-employed
## 3 Male 80 0 1 Yes Private
## 4 Female 49 0 0 Yes Private
## 5 Female 79 1 0 Yes Self-employed
## 6 Male 81 0 0 Yes Private
## Residence_type avg_glucose_level bmi smoking_status stroke
## 1 Urban 228.69 36.6 formerly smoked 1
## 2 Rural 202.21 28.1 never smoked 1
## 3 Rural 105.92 32.5 never smoked 1
## 4 Urban 171.23 34.4 smokes 1
## 5 Rural 174.12 24.0 never smoked 1
## 6 Urban 186.21 29.0 formerly smoked 1
In case we need to run a regression later, we should encode the multi-level features as dummy variables and save the result in a separate data frame.
#For multi-level variables, get dummy
stroke_dummy <- dummy_cols(stroke,select_columns = c("gender","work_type","smoking_status"),remove_first_dummy = TRUE, remove_selected_columns = TRUE)
stroke_dummy %>% head()
## age hypertension heart_disease ever_married Residence_type avg_glucose_level
## 1 67 0 1 Yes Urban 228.69
## 2 61 0 0 Yes Rural 202.21
## 3 80 0 1 Yes Rural 105.92
## 4 49 0 0 Yes Urban 171.23
## 5 79 1 0 Yes Rural 174.12
## 6 81 0 0 Yes Urban 186.21
## bmi stroke gender_Male work_type_Govt_job work_type_Never_worked
## 1 36.6 1 1 0 0
## 2 28.1 1 0 0 0
## 3 32.5 1 1 0 0
## 4 34.4 1 0 0 0
## 5 24.0 1 0 0 0
## 6 29.0 1 1 0 0
## work_type_Private work_type_Self-employed smoking_status_never smoked
## 1 1 0 0
## 2 0 1 1
## 3 1 0 1
## 4 1 0 0
## 5 0 1 1
## 6 1 0 0
## smoking_status_smokes
## 1 0
## 2 0
## 3 0
## 4 1
## 5 0
## 6 0
This finishes the data cleaning, and we can move on to the EDA.
First, let's look at the distribution of each individual variable. For discrete variables, we use bar plots to show their distributions.
# Discrete
p1 <- ggplot(data = stroke) +geom_bar(mapping = aes(x = gender))
p2 <-ggplot(data = stroke) +geom_bar(mapping = aes(x = hypertension))
p3 <-ggplot(data = stroke) +geom_bar(mapping = aes(x = heart_disease))
p4 <-ggplot(data = stroke) +geom_bar(mapping = aes(x = ever_married))
grid.arrange(p1,p2,p3,p4, ncol= 2)
From these plots, we conclude that most patients have neither heart disease nor hypertension, and that females and married people form the majority.
p5 <-ggplot(data = stroke) +geom_bar(mapping = aes(x = work_type))
p6 <-ggplot(data = stroke) +geom_bar(mapping = aes(x = Residence_type))
p7 <-ggplot(data = stroke) +geom_bar(mapping = aes(x = smoking_status))
p8 <-ggplot(data = stroke) +geom_bar(mapping = aes(x = stroke))
grid.arrange(p5,p6,p7,p8, ncol= 2)
From these plots, we conclude that most patients did not have a stroke, and that private-sector workers and non-smokers form the majority.
For the continuous variables, we use histograms to display their distributions.
# Continuous
c1 <- ggplot(data = stroke) + geom_histogram(mapping = aes(x = age), binwidth = 0.5, col = 'steelblue')
c2 <- ggplot(data = stroke) + geom_histogram(mapping = aes(x = avg_glucose_level), binwidth = 0.5, col = 'steelblue')
c3 <- ggplot(data = stroke) + geom_histogram(mapping = aes(x = bmi), binwidth = 0.5, col = 'steelblue')
grid.arrange(c1,c2,c3, ncol= 2)
From these plots, we conclude that age is slightly left-skewed, while avg_glucose_level and bmi are right-skewed. The spike in the bmi plot is the result of replacing NAs with the median. The median seems to be a better imputation value than the mean, since bmi is right-skewed.
For the next part of the EDA, we look at the relationships between variables. The outcome in this dataset is stroke, so we first examine the connection between stroke and some of the input features. For a continuous input, a boxplot works best.
#stroke on age
ggplot(data = stroke, mapping = aes(x = stroke, y = age)) +geom_boxplot()+labs(title='Stroke on Age')
#stroke on glucose level
ggplot(data = stroke, mapping = aes(x = stroke, y = avg_glucose_level)) +geom_boxplot()+labs(title='Stroke on Glucose Level')
#stroke on bmi
ggplot(data = stroke, mapping = aes(x = stroke, y = bmi)) +geom_boxplot()+labs(title='Stroke on bmi')
We can conclude from these plots that patients who had a stroke are noticeably older than those who did not. They also have somewhat higher glucose levels and bmi, but the difference is less pronounced.
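To put a number on the age difference seen in the boxplots, a quick two-sample comparison could be added (an illustrative sketch, not part of the original analysis):
# Welch two-sample t-test: does mean age differ between the stroke groups?
t.test(age ~ stroke, data = stroke)
# The same comparison can be repeated for avg_glucose_level and bmi.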
Then we could look at the connection between stroke and some discrete input features. In this part, we use Mosaic Plot to plot the connection.
ggplot(data = stroke) +geom_mosaic(aes(x = product(stroke,work_type), fill=work_type)) + labs(title='Stroke on Work Type')
ggplot(data = stroke) +geom_mosaic(aes(x = product(stroke,smoking_status), fill=smoking_status)) + labs(title='Stroke on smoking_status')
ggplot(data = stroke) +geom_mosaic(aes(x = product(stroke,hypertension), fill=hypertension)) + labs(title='Stroke on hypertension')
ggplot(data = stroke) +geom_mosaic(aes(x = product(stroke,ever_married), fill=ever_married)) + labs(title='Stroke on ever_married')
From these plots, we can conclude that:
* Self-employed workers are more likely to have a stroke.
* Smoking seems to have little effect on stroke.
* Those who have hypertension are more likely to have a stroke.
* Those who have never married are less likely to have a stroke (interestingly).
Next, we want to look at the correlations among the input variables. One way to do this is through a correlation heatmap. To build it, we have to convert some qualitative variables into quantitative ones.
# Residence_type: Urban for 0, Rural for 1
stroke1$Residence_type[stroke1$Residence_type == "Urban"] <- 0
stroke1$Residence_type[stroke1$Residence_type == "Rural"] <- 1
# ever_married: No for 0, Yes for 1
stroke1$ever_married[stroke1$ever_married == "Yes"] <- 1
stroke1$ever_married[stroke1$ever_married == "No"] <- 0
# gender: Male for 0, Female for 1
stroke1$gender[stroke1$gender == "Male"] <- 0
stroke1$gender[stroke1$gender == "Female"] <- 1
suppressWarnings(stroke1$Residence_type <- as.numeric(as.character(stroke1$Residence_type)))
suppressWarnings(stroke1$ever_married <- as.numeric(as.character(stroke1$ever_married)))
suppressWarnings(stroke1$gender <- as.numeric(as.character(stroke1$gender)))
# Quantitative Variables: Correlation Map
stroke.quant = subset(stroke1, select = -c(work_type, smoking_status))
stroke.cor = round(cor(stroke.quant),2)
ggplot(data = reshape2::melt(stroke.cor),aes(x=Var1, y=Var2, fill=value)) + geom_tile() + scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1,1), space = "Lab", name="Pearson\nCorrelation") + geom_text(aes(Var2, Var1, label = value), color = "black", size = 4) + theme(axis.text.x = element_text(angle = 30))
As we can see from the heatmap, most features are not strongly correlated with one another, which is convenient for regression. The only noteworthy correlation is between age and ever_married, and the reason behind it is fairly obvious. Among all features, age has the largest correlation coefficient with stroke.
For continuous variables, we can also use a scatter plot to display their relationship.
cont.plot <- ggplot(data = stroke, aes(x= age, y = bmi, color = stroke))+geom_point()
cont.plot
As we can see, strokes are more common among older patients and patients with higher bmi.
From the EDA, we find that no single feature has a strong correlation with stroke. To quantify how each feature affects the outcome, we need to run a regression. Since stroke is a binary variable, logistic regression is a natural choice.
First, we need to build the training dataset. We use stroke_dummy for the regression. To evaluate the model, we split the dataset into a training set (70%) and a testing set (30%).
# Training set creation
Training <- createDataPartition(y = stroke_dummy$stroke, p = 0.7, list = FALSE)
training <- stroke_dummy[Training,]
testing <- stroke_dummy[-Training,]
We can examine the dimensions of the training and testing sets.
dim(training)
## [1] 3577 15
dim(testing)
## [1] 1532 15
Next, we carry out the regression using a generalized linear model (GLM) with a logit link. The model is trained on the training set.
model <- glm(stroke ~.,family=binomial(link='logit'), data=training)
summary(model)
##
## Call:
## glm(formula = stroke ~ ., family = binomial(link = "logit"),
## data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2310 -0.3171 -0.1675 -0.0972 3.4981
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.922908 0.806938 -7.340 2.14e-13 ***
## age 0.072741 0.006799 10.700 < 2e-16 ***
## hypertension1 0.563302 0.192796 2.922 0.00348 **
## heart_disease1 0.342615 0.227875 1.504 0.13270
## ever_marriedYes -0.233803 0.265039 -0.882 0.37770
## Residence_typeUrban -0.102680 0.164565 -0.624 0.53266
## avg_glucose_level 0.002709 0.001456 1.860 0.06287 .
## bmi -0.004730 0.013519 -0.350 0.72640
## gender_Male 0.004179 0.169271 0.025 0.98030
## work_type_Govt_job -1.129978 0.865039 -1.306 0.19146
## work_type_Never_worked -10.635415 375.185929 -0.028 0.97739
## work_type_Private -0.965535 0.844501 -1.143 0.25291
## `work_type_Self-employed` -1.438607 0.872193 -1.649 0.09906 .
## `smoking_status_never smoked` -0.244233 0.191801 -1.273 0.20289
## smoking_status_smokes -0.074173 0.261212 -0.284 0.77644
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1397.4 on 3576 degrees of freedom
## Residual deviance: 1116.0 on 3562 degrees of freedom
## AIC: 1146
##
## Number of Fisher Scoring iterations: 14
From the regression result, we can conclude that age and hypertension are highly significant positive predictors of stroke, avg_glucose_level and work_type_Self-employed are only marginally significant (at the 10% level), and the remaining features are not statistically significant.
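Because logistic regression coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are easier to interpret (a small sketch added for illustration):
# Convert the log-odds coefficients into odds ratios
round(exp(coef(model)), 3)
# For example, the age coefficient of about 0.073 corresponds to an odds ratio of roughly 1.08,
# i.e. each additional year of age multiplies the odds of stroke by about 1.08, holding the other features fixed.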
Now we can run the anova() function on the model to see the deviance of the regression model.
anova(model, test="Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: stroke
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 3576 1397.4
## age 1 253.376 3575 1144.0 < 2.2e-16 ***
## hypertension 1 10.619 3574 1133.4 0.001119 **
## heart_disease 1 3.970 3573 1129.5 0.046322 *
## ever_married 1 0.597 3572 1128.9 0.439705
## Residence_type 1 0.282 3571 1128.6 0.595324
## avg_glucose_level 1 3.978 3570 1124.6 0.046097 *
## bmi 1 0.203 3569 1124.4 0.652680
## gender_Male 1 0.040 3568 1124.3 0.841968
## work_type_Govt_job 1 0.019 3567 1124.3 0.890438
## work_type_Never_worked 1 0.105 3566 1124.2 0.746112
## work_type_Private 1 4.417 3565 1119.8 0.035590 *
## `work_type_Self-employed` 1 2.017 3564 1117.8 0.155541
## `smoking_status_never smoked` 1 1.668 3563 1116.1 0.196522
## smoking_status_smokes 1 0.081 3562 1116.0 0.775904
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ANOVA shows how each feature lowers the null deviance toward the residual deviance. Besides the significant features from the regression summary, avg_glucose_level also significantly lowers the deviance. The features with large p-values indicate that, even without them, the model would explain roughly the same share of the total variation.
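To summarize this in a single number, we could compute McFadden's pseudo-R² from the null and residual deviance (a small sketch, not part of the original analysis):
# McFadden's pseudo R-squared: 1 - residual deviance / null deviance
round(1 - model$deviance / model$null.deviance, 3)
# Roughly 0.20 here (1 - 1116.0 / 1397.4), so the predictors explain only a modest share of the deviance.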
With the model trained, we could predict on the testing set.
#predict the stroke variable based on the testing dataset
model.prob = predict(model, testing, type="response")
To evaluate the predictions, we use the confusionMatrix function from the caret package. We set the classification threshold at 0.5, meaning that if the predicted probability of stroke is greater than 0.5, we classify the patient as likely to have a stroke.
# use caret and compute a confusion matrix
confusionMatrix(data = as.factor(as.numeric(model.prob>0.5)), reference = as.factor(testing$stroke))
## Warning in confusionMatrix.default(data = as.factor(as.numeric(model.prob > :
## Levels are not in the same order for reference and data. Refactoring data to
## match.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1458 74
## 1 0 0
##
## Accuracy : 0.9517
## 95% CI : (0.9397, 0.9619)
## No Information Rate : 0.9517
## P-Value [Acc > NIR] : 0.5309
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.9517
## Neg Pred Value : NaN
## Prevalence : 0.9517
## Detection Rate : 0.9517
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 0
##
From the confusion matrix report, the accuracy is 0.95, which looks high. However, a closer look shows that the model simply predicts 0 for almost every patient. The outcome stroke is so imbalanced that the model learns that nearly all outcomes are 0, and it fails to distinguish patients who had a stroke from those who did not.
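To see why 95% accuracy is not impressive here, it helps to compare it with the no-information rate, i.e. the accuracy of always predicting "no stroke" (a quick sanity check, not in the original write-up):
# Class prevalence in the test set: always predicting 0 already achieves about 95% accuracy
prop.table(table(testing$stroke))
# The reported accuracy (0.9517) equals this no-information rate, so the model adds nothing at the 0.5 cutoff.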
What if we change the threshold? We found a helpful answer on Cross Validated (https://stats.stackexchange.com/questions/199978/optimizing-probability-thresholds-in-a-glm-model-in-caret) showing how to plot the relationship between the threshold and accuracy.
threshold <- seq(0.3,0.6,0.01)
accuracy <- NULL
for (i in seq(along = threshold)){
prediction <- ifelse(model$fitted.values >= threshold[i], 1, 0) # predict with this cutoff
accuracy <- c(accuracy,length(which(training$stroke ==prediction))/length(prediction)*100)
}
plot(threshold, accuracy, pch =19,type='b',col= "steelblue",main ="Logistic Regression", xlab="Cutoff Level", ylab = "Accuracy %")
The result is not encouraging: we cannot raise the accuracy simply by changing the cutoff threshold.
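Accuracy alone also hides the sensitivity/specificity trade-off. One alternative (a sketch assuming the pROC package, which is not used in the original analysis) is to look at an ROC curve on the test set:
# ROC curve for the logistic regression on the test set (assumes pROC is installed)
library(pROC)
roc.obj <- roc(testing$stroke, model.prob)
auc(roc.obj)            # area under the curve
coords(roc.obj, "best") # cutoff maximizing sensitivity + specificity (Youden's index)
plot(roc.obj, main = "ROC curve - logistic regression")
A threshold chosen this way trades some overall accuracy for a better chance of actually flagging stroke cases.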
Maybe a different model would help? We can try a random forest classifier and compare the outcome.
stroke.rf <- randomForest(stroke ~ ., data = stroke, importance = TRUE,proximity = TRUE)
print(stroke.rf)
##
## Call:
## randomForest(formula = stroke ~ ., data = stroke, importance = TRUE, proximity = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 4.97%
## Confusion matrix:
## 0 1 class.error
## 0 4854 6 0.001234568
## 1 248 1 0.995983936
The random forest classifier reaches an accuracy of 95.03% (based on the out-of-bag error), but its confusion matrix still looks much like the one from the logistic regression. So the class imbalance in the dataset strongly affects the prediction results for both models.
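Since the forest was grown with importance = TRUE, one quick follow-up (a sketch, not in the original write-up) is to check which features it relies on most:
# Variable importance from the fitted random forest
importance(stroke.rf)
varImpPlot(stroke.rf, main = "Variable importance")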
In this project, we used logistic regression to explore the relationship between stroke and the input features. We concluded that age, hypertension, and work_type_Self-employed affect the probability of having a stroke. We also used logistic regression and a random forest to build prediction models for stroke. Both models reach an accuracy of about 95%, but the class imbalance of the dataset limits how meaningful that accuracy is. In the future, we would look at approaches such as anomaly detection to address this problem.
Our recommendation would be:
https://stackoverflow.com/questions/46028360/confusionmatrix-for-logistic-regression-in-r
https://tidyr.tidyverse.org/reference/replace_na.html
https://www.r-bloggers.com/2015/09/how-to-perform-a-logistic-regression-in-r/
https://stats.idre.ucla.edu/r/dae/logit-regression/
http://rstudio-pubs-static.s3.amazonaws.com/74431_8cbd662559f6451f9cd411545f28107f.html
https://www.statmethods.net/advstats/glm.html
https://www.geeksforgeeks.org/random-forest-approach-for-classification-in-r-programming/
https://daviddalpiaz.github.io/r4sl/the-caret-package.html
http://appliedpredictivemodeling.com/blog/2014/2/1/lw6har9oewknvus176q4o41alqw2ow
https://rdrr.io/cran/caret/man/thresholder.html
https://www.theissaclee.com/post/logistic-regression-beta/#fig:roc
https://stackoverflow.com/questions/65844978/caret-classification-thresholds