Abstract

Early detection of stroke can prevent or delay its onset and complications. This study seeks to identify the risk factors that are highly significant in developing stroke, so that practitioners can make an accurate prognosis. Hence, a model was developed to validate the risk factors of stroke. A retrospective study was conducted on a sample of 5110 records extracted from Kaggle, consisting of eleven (11) independent variables and one (1) dependent variable. Because the data was imbalanced, Random Over-Sampling Examples (ROSE) was used to resample it. Association between variables was tested using the point-biserial correlation. Binary logistic regression analysis was used to determine the predictors and to develop the predictive model based on demographic and health characteristics. Variables found significant using stepAIC were retained in the final model. A confusion matrix was used as a performance measure for the model, giving an accuracy of 79% and a precision of 78%. McFadden's pseudo R-squared was used to check how well the model fit the data; a result of 0.32 suggests that the model has high predictive power. Variables shown to be significant predictors of stroke are age, hypertension, heart disease, smoking status, gender (female), average glucose level, work type, and residence type. The predictive model developed in this study contains variables that can be easily collected in practice, meaning it can readily be used in clinical settings.

Introduction

According to the World Health Organization (WHO), stroke is the second leading cause of death globally, responsible for approximately 11% of total deaths. Stroke develops when an artery in the brain leaks or is blocked. Research by Healthline likewise ranks stroke as the second most deadly disease in the world. Annually, 15 million people worldwide suffer a stroke. Of these, 5 million die and another 5 million are left permanently disabled, placing a burden on family and community. When a stroke occurs, the affected areas of the brain do not receive enough oxygen and blood, and as a result brain tissue begins to die. Depending on the area of the brain affected, this damage causes changes in certain sensory, motor or cognitive functions.

Objective

The primary goal of this study is to develop a model that identifies the main risk factors of stroke by taking historical data such as age, gender, BMI, glucose level, smoking status, and other factors as inputs, and to fit a classification algorithm to validate the relationship these factors have with the response variable (stroke).

Study Design

This is a retrospective descriptive study design involving the review of stroke case hospital records. It is retrospective in the sense that data is collected from records, or by asking participants to recall their most probable period of exposure, because the event of interest (occurrence of stroke) has already happened in each individual at the time of enrollment. Research questions about possible associations between an outcome and an exposure were formulated to further investigate the potential relationships.

Method of Data Analysis

The statistical methods implemented for our analysis are descriptive statistics, exploratory data analysis and binary logistic regression. Logistic regression is a supervised machine learning classification algorithm: it learns, as the name suggests, under "supervision" from labelled examples. Classification algorithms are used to solve problems in which the output variable is categorical or discrete, for example "Yes" or "No", "Male" or "Female". The data we are working with is imbalanced, which, if left unchecked, would bias the model toward the majority class. Our countermeasure is to use an oversampling technique to balance the data. There are many classification algorithms, but for this project only logistic regression was used.

Data

This dataset was extracted freely from Kaggle, a data science and machine learning community, for research purposes. It provides relevant information about patients such as gender, age, various diseases, and smoking status. With the data obtained, this project sought to predict whether a patient is likely to have a stroke based on the input parameters given.

Attributes

Below are the column names and some information about them.

Logistic Regression Model

Logistic Regression Model: The logistic regression model, also known as the logit model, investigates the relationship between one or more predictor variables and a categorical dependent variable, and estimates the likelihood of its occurrence by fitting the data to a logistic curve. There are three types of logistic regression models: binary logistic regression, multinomial logistic regression and ordinal logistic regression. Binary logistic regression is used when the response variable is dichotomous and the independent variables are either continuous or categorical; multinomial logistic regression is used when the response variable is nominal with more than two categories; and ordinal logistic regression is used when there are three or more categories with a natural ordering to the levels. The dataset under consideration has 5110 distinct records, where the binary target variable takes the values 0 or 1, with 0 representing no stroke and 1 representing stroke. There are 4861 cases in class 0 and 249 cases in class 1.
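For reference, the standard form of the binary logit model relates the probability p of the event (here, stroke) to the predictors through the log-odds:

$$\operatorname{logit}(p) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k, \qquad p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}$$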

The Likelihood Ratio Test

The likelihood ratio test was incorporated in this study as a test of goodness of fit for the logistic regression model. It is based on comparing the deviances of two competing nested models: the deviance of the full model, i.e. the model with all regressors (likelihood L1), is compared to the deviance of the reduced model, i.e. the model with relatively fewer predictors (likelihood L0).
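The test statistic compares the two deviances (equivalently, the log-likelihoods of the nested models) and follows a chi-squared distribution:

$$G = -2\ln\frac{L_0}{L_1} = D_0 - D_1 \sim \chi^2_q,$$

where $D_0$ and $D_1$ are the deviances of the reduced and full models and $q$ is the number of predictors dropped from the full model.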

The null and alternative hypotheses are stated as;

H0: There is no statistically significant difference between the reduced model and the full model.
H1: The full model fits the data significantly better than the reduced model.

When the null hypothesis is not rejected at 5% level of significance, the reduced model is preferred and hence, becomes the final model for the data.

McFadden R Squared

Logistic regression model coefficients are estimated using the maximum likelihood method: the parameter estimates are the values that maximize the likelihood of the observed data. McFadden's R-squared is one of the forms of pseudo R-squared.
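McFadden's measure compares the log-likelihood of the fitted model with that of an intercept-only (null) model:

$$R^2_{\text{McFadden}} = 1 - \frac{\ln L_{\text{model}}}{\ln L_{\text{null}}},$$

so a value of 0 means no improvement over the null model, while values approaching 1 indicate a near-perfect fit.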

Model Accuracy/Precision/Area Under Curve (AUC)

The accuracy of a logistic regression model refers to the proportion of correct classifications. To estimate the accuracy of the final model, predictions were made on the test set and the predicted probabilities rounded to the nearest binary digit. Precision is another indicator of a machine learning model's performance. The area under the curve (AUC) evaluates how well the model distinguishes between the positive and negative classes. The results of the prediction were finally summarized in a confusion matrix.
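In terms of the confusion matrix counts (true positives TP, true negatives TN, false positives FP and false negatives FN), the two headline metrics are:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}$$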

Results and Discussion

In this chapter we present the analysis of the data introduced earlier; an exploratory data analysis was performed to see which risk factors contribute to the likelihood of getting stroke. We then described the data using summary statistics such as the mean, median and standard deviation for the numerical variables, and frequencies and percentages for the categorical variables.

Conclusion and Recommendation

In this chapter we summarize the research findings, draw conclusions from the analysis and offer recommendations.

Descriptive and Exploratory Analysis

Packages
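The libraries below are assumed from the functions called later in the analysis; the Kaggle file name in the import is likewise an assumption:

# Libraries inferred from the functions used in this analysis
library(psych)      # describe()
library(ggplot2)    # plots
library(gridExtra)  # grid.arrange()
library(dplyr)      # %>% pipelines
library(corrr)      # correlate(), focus()
library(ROSE)       # ovun.sample()
library(MASS)       # stepAIC()
library(caret)      # varImp(), confusionMatrix()
library(pscl)       # pR2()
library(pROC)       # auc()

# Import the Kaggle stroke dataset (file name assumed)
load <- read.csv("healthcare-dataset-stroke-data.csv", stringsAsFactors = FALSE)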

attach(load)
describe(load)
##                   vars    n     mean       sd   median  trimmed      mad   min
## id                   1 5110 36517.83 21161.72 36932.00 36542.26 27413.27 67.00
## gender*              2 5110     1.41     0.49     1.00     1.39     0.00  1.00
## age                  3 5110    43.23    22.61    45.00    43.61    26.69  0.08
## hypertension         4 5110     0.10     0.30     0.00     0.00     0.00  0.00
## heart_disease        5 5110     0.05     0.23     0.00     0.00     0.00  0.00
## ever_married*        6 5110     1.66     0.48     2.00     1.70     0.00  1.00
## work_type*           7 5110     3.50     1.28     4.00     3.62     0.00  1.00
## Residence_type*      8 5110     1.51     0.50     2.00     1.51     0.00  1.00
## avg_glucose_level    9 5110   106.15    45.28    91.88    97.85    26.06 55.12
## bmi*                10 5110   172.19    88.96   158.00   163.08    74.13  1.00
## smoking_status*     11 5110     2.59     1.09     2.00     2.61     1.48  1.00
## stroke              12 5110     0.05     0.22     0.00     0.00     0.00  0.00
##                        max    range  skew kurtosis     se
## id                72940.00 72873.00 -0.02    -1.21 296.03
## gender*               3.00     2.00  0.35    -1.86   0.01
## age                  82.00    81.92 -0.14    -0.99   0.32
## hypertension          1.00     1.00  2.71     5.37   0.00
## heart_disease         1.00     1.00  3.94    13.57   0.00
## ever_married*         2.00     1.00 -0.66    -1.57   0.01
## work_type*            5.00     4.00 -0.91    -0.49   0.02
## Residence_type*       2.00     1.00 -0.03    -2.00   0.01
## avg_glucose_level   271.74   216.62  1.57     1.68   0.63
## bmi*                419.00   418.00  0.97     0.87   1.24
## smoking_status*       4.00     3.00  0.08    -1.35   0.02
## stroke                1.00     1.00  4.19    15.57   0.00

Replacing NA with the mean

# bmi is imported as text because missing values are coded as the string "N/A";
# coerce it to numeric first so those entries become NA (coercion step assumed)
load$bmi <- as.numeric(load$bmi)
load$bmi[is.na(load$bmi)] <- mean(load$bmi, na.rm = TRUE)
str(load)
## 'data.frame':    5110 obs. of  12 variables:
##  $ id               : int  9046 51676 31112 60182 1665 56669 53882 10434 27419 60491 ...
##  $ gender           : chr  "Male" "Female" "Male" "Female" ...
##  $ age              : num  67 61 80 49 79 81 74 69 59 78 ...
##  $ hypertension     : int  0 0 0 0 1 0 1 0 0 0 ...
##  $ heart_disease    : int  1 0 1 0 0 0 1 0 0 0 ...
##  $ ever_married     : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ work_type        : chr  "Private" "Self-employed" "Private" "Private" ...
##  $ Residence_type   : chr  "Urban" "Rural" "Rural" "Urban" ...
##  $ avg_glucose_level: num  229 202 106 171 174 ...
##  $ bmi              : num  36.6 28.9 32.5 34.4 24 ...
##  $ smoking_status   : chr  "formerly smoked" "never smoked" "never smoked" "smokes" ...
##  $ stroke           : int  1 1 1 1 1 1 1 1 1 1 ...
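# The code below refers to `data`; the copy step was not shown in the listing,
# so it is reconstructed here as an assumption. The id column is dropped (it
# does not appear in the later correlation or model output), and the 0/1
# indicator columns are recoded as text so ggplot treats them as discrete.
data <- load
data$id <- NULL
data$stroke <- as.character(data$stroke)
data$hypertension <- as.character(data$hypertension)
data$heart_disease <- as.character(data$heart_disease)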
g1<-ggplot(data, aes(x = stroke, fill = ever_married))+
  geom_bar(position = "fill")+
  stat_count(geom = "text",
             aes(label = stat(count)),
             position = position_fill(vjust = 0.5), color = "black")
g2<-ggplot(data, aes(x = stroke, fill = gender))+
  geom_bar(position = "fill")+
  stat_count(geom = "text",
             aes(label = stat(count)),
             position = position_fill(vjust = 0.5), color = "black")
g3<-ggplot(data, aes(x = stroke, fill = smoking_status))+
  geom_bar(position = "fill")+
  stat_count(geom = "text",
             aes(label = stat(count)),
             position = position_fill(vjust = 0.5), color = "black")
g4<-ggplot(data, aes(x = stroke, fill = Residence_type))+
  geom_bar(position = "fill")+
  stat_count(geom = "text",
             aes(label = stat(count)),
             position = position_fill(vjust = 0.5), color = "black")
g5<-ggplot(data, aes(x = stroke, fill = hypertension))+
  geom_bar(position = "fill")+
  stat_count(geom = "text",
             aes(label = stat(count)),
             position = position_fill(vjust = 0.5), color = "yellow")
g6<-ggplot(data, aes(x = stroke, fill = heart_disease))+
  geom_bar(position = "fill")+
  stat_count(geom = "text",
             aes(label = stat(count)),
             position = position_fill(vjust = 0.5), color = "green")
grid.arrange(grobs=list(g1,g2,g3,g4,g5,g6),nrow=3, top="Distribution of the categorical variables")

Checking for outliers in the data

b1<-data %>%
  ggplot(aes(x=stroke, y = avg_glucose_level, fill=stroke))+geom_boxplot()
b2<-data %>%
  ggplot(aes(x=stroke, y = bmi, fill=stroke))+geom_boxplot()
grid.arrange(grobs=list(b1,b2))

Replacing the outliers

Since the data is imbalanced, removing the outliers might lose a lot of information from the data, so instead of dropping them I capped values above the upper quartile.

#For glucose
quantile(data$avg_glucose_level)
##      0%     25%     50%     75%    100% 
##  55.120  77.245  91.885 114.090 271.740
data$avg_glucose_level<- ifelse(data$avg_glucose_level>114.09,120, data$avg_glucose_level)
ggplot(data, aes(x=stroke, y = avg_glucose_level, fill=stroke))+geom_boxplot()

#For bmi
quantile(data$bmi)
##   0%  25%  50%  75% 100% 
## 10.3 23.8 28.4 32.8 97.6
data$bmi<- ifelse(data$bmi>32.8,42,data$bmi)
ggplot(data, aes(x=stroke, y = bmi, fill=stroke))+geom_boxplot()

#Distribution of age
ggplot(data,aes(x=age,fill=stroke))+geom_density(alpha=0.5)

#Gender with stroke
ggplot(data, aes(x = stroke, fill = gender))+
  geom_bar(position = "fill")+
  stat_count(geom = "text",
             aes(label = stat(count)),
             position = position_fill(vjust = 0.5), color = "black")

#barplot with stroke and ever_married
ggplot(data, aes(x = stroke, fill = ever_married))+
  geom_bar(position = "fill")+
  stat_count(geom = "text",
             aes(label = stat(count)),
             position = position_fill(vjust = 0.5), color = "black")

Patients who tested positive for stroke have a higher median BMI than patients who tested negative. Based on the point-biserial correlation, the p-value (0.000) is less than the 5% alpha level; hence we reject the null hypothesis and conclude that there is an association between BMI and stroke. Patients who tested positive for stroke also had a higher median avg_glucose_level than patients who tested negative; again the p-value (0.000) is less than the 5% alpha level, so we reject the null hypothesis and conclude that there is a statistically significant association between avg_glucose_level and stroke. We also observed that most of the stroke patients are aged 40 and above; thus the number of strokes increases as age increases, and the point-biserial correlation likewise shows an association between stroke and age. Patients who have ever married accounted for 220 stroke cases, far more than patients who never married, so ever-married patients appear more likely to get stroke than never-married patients. Female patients who had stroke outnumbered male patients who had stroke; based on the figure above, females appear more likely to get stroke than males.
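The point-biserial correlation tests referenced above do not appear in the listing; a minimal sketch is given below. Since the point-biserial correlation between a binary and a continuous variable is just the Pearson correlation, cor.test() can be used directly (variable names as above):

# Point-biserial correlation: Pearson correlation between the 0/1 stroke
# indicator and each continuous predictor (sketch)
stroke_num <- as.numeric(as.character(data$stroke))
cor.test(stroke_num, data$bmi)                # bmi vs stroke
cor.test(stroke_num, data$avg_glucose_level)  # glucose level vs stroke
cor.test(stroke_num, data$age)                # age vs stroke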

#transforming some qualitative variables to quantitative variables.
# Residence_type: Urban for 0, Rural for 1
data$Residence_type[data$Residence_type == "Urban"] <- 0
data$Residence_type[data$Residence_type == "Rural"] <- 1

# ever_married: No for 0, Yes for 1
data$ever_married[data$ever_married == "Yes"] <- 1
data$ever_married[data$ever_married == "No"] <- 0

# gender: Male for 0, Female for 1
data$gender[data$gender == "Male"] <- 0
data$gender[data$gender == "Female"] <- 1

#work_type: children for 0, Private for 1, Govt_job for 2, Never_worked for 3, Self-employed for 4
data$work_type[data$work_type=="children"] <- 0
data$work_type[data$work_type=="Private"] <- 1
data$work_type[data$work_type=="Govt_job"] <- 2
data$work_type[data$work_type=="Never_worked"] <- 3
data$work_type[data$work_type=="Self-employed"] <- 4

#smoking_status: 0 for formerly_smoked, 1 for never_smoked, 2 for smokes, 3 for Unknown
data$smoking_status[data$smoking_status=="formerly smoked"] <- 0
data$smoking_status[data$smoking_status=="never smoked"] <- 1
data$smoking_status[data$smoking_status=="smokes"] <- 2
data$smoking_status[data$smoking_status=="Unknown"] <- 3

# Converting the recoded text columns to numeric (for the correlation analysis)
data$gender <- as.numeric(data$gender)
data$smoking_status <- as.numeric(data$smoking_status)
data$ever_married <- as.numeric(data$ever_married)
data$stroke <- as.numeric(data$stroke)
data$work_type <- as.numeric(data$work_type)
data$Residence_type <- as.numeric(data$Residence_type)
data$hypertension <- as.numeric(data$hypertension)
data$heart_disease <- as.numeric(data$heart_disease)

Heat map

stroke.cor = round(cor(data),2)
ggplot(data = reshape2::melt(stroke.cor),aes(x=Var1, y=Var2, fill=value)) +
  geom_tile() +  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1,1), space = "Lab", name="Pearson\nCorrelation") +
  geom_text(aes(Var2, Var1, label = value), color = "black", size = 4) + 
  theme(axis.text.x = element_text(angle = 30))

cor_onehot <- correlate(stroke.cor)
## 
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
cor_onehot %>% focus(stroke)
## # A tibble: 10 x 2
##    term               stroke
##    <chr>               <dbl>
##  1 gender            -0.178 
##  2 age                0.238 
##  3 hypertension       0.110 
##  4 heart_disease      0.125 
##  5 ever_married       0.0919
##  6 work_type          0.0425
##  7 Residence_type    -0.214 
##  8 avg_glucose_level -0.0292
##  9 bmi               -0.0424
## 10 smoking_status    -0.227
cor_onehot %>%
  focus(stroke) %>%
  mutate(rowname = reorder(term, stroke)) %>%
  ggplot(aes(term, stroke)) +
  geom_col() + coord_flip() +
  theme_bw()

As mentioned in the methodology, the cor() function in R was used to perform a correlation analysis to determine how the variables are related to one another. As we can see from the heat map, most features are not highly correlated with each other, which is suitable for regression. The only noteworthy correlation is between the age and ever_married variables. Among all features, age has the largest correlation coefficient (0.25) with stroke.

Model Building

set.seed(123)

The data, balanced with Random Over-Sampling Examples (ROSE), was split into an 80% training set and a 20% testing (validation) set; the full model below was fitted on the training set using all the independent variables.

Oversampling

#Checking for the range of the imbalanced data 
table(data$stroke)
## 
##    0    1 
## 4861  249
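# The coefficient names in the model output below (e.g. hypertension1,
# work_type2) show that the recoded categorical predictors entered the model
# as factors; that conversion was not shown, so it is reconstructed here.
data$gender <- as.factor(data$gender)
data$hypertension <- as.factor(data$hypertension)
data$heart_disease <- as.factor(data$heart_disease)
data$ever_married <- as.factor(data$ever_married)
data$work_type <- as.factor(data$work_type)
data$Residence_type <- as.factor(data$Residence_type)
data$smoking_status <- as.factor(data$smoking_status)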
#OverSampling
ov <- ovun.sample(stroke~.,data = data, method = "over", N =9722)$data
#splitting the data into train and test using 80% and 20%
split  <- sample(2,nrow(ov), replace = TRUE, prob = c(0.8,0.2))
traindata <- ov[split==1,]
testdata <- ov[split==2,]
#Balanced data on the training set
table(traindata$stroke)
## 
##    0    1 
## 3937 3850
#Balanced data on the test set
table(testdata$stroke)
## 
##    0    1 
##  924 1011

Logit Model

#Logistic
model5 <- glm(stroke ~.,family=binomial(link='logit'), data=traindata)
summary(model5)
## 
## Call:
## glm(formula = stroke ~ ., family = binomial(link = "logit"), 
##     data = traindata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4323  -0.6950  -0.1954   0.7235   2.6220  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -4.301699   0.260689 -16.501  < 2e-16 ***
## gender1             0.138004   0.059782   2.308 0.020974 *  
## age                 0.082166   0.002307  35.608  < 2e-16 ***
## hypertension1       0.574607   0.077315   7.432 1.07e-13 ***
## heart_disease1      0.326886   0.096557   3.385 0.000711 ***
## ever_married1       0.009057   0.095092   0.095 0.924120    
## work_type1         -1.424344   0.239496  -5.947 2.73e-09 ***
## work_type2         -1.567685   0.246751  -6.353 2.11e-10 ***
## work_type3        -12.340414 206.285281  -0.060 0.952297    
## work_type4         -1.736391   0.253357  -6.854 7.21e-12 ***
## Residence_type1    -0.112252   0.058016  -1.935 0.053007 .  
## avg_glucose_level   0.008770   0.001424   6.160 7.28e-10 ***
## bmi                 0.003833   0.004227   0.907 0.364538    
## smoking_status1    -0.324294   0.076259  -4.253 2.11e-05 ***
## smoking_status2     0.151609   0.091373   1.659 0.097067 .  
## smoking_status3    -0.077427   0.089592  -0.864 0.387466    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10794.1  on 7786  degrees of freedom
## Residual deviance:  7391.1  on 7771  degrees of freedom
## AIC: 7423.1
## 
## Number of Fisher Scoring iterations: 13

Reduced Model using StepAIC

RModel= stepAIC(model5)
## Start:  AIC=7423.07
## stroke ~ gender + age + hypertension + heart_disease + ever_married + 
##     work_type + Residence_type + avg_glucose_level + bmi + smoking_status
## 
##                     Df Deviance    AIC
## - ever_married       1   7391.1 7421.1
## - bmi                1   7391.9 7421.9
## <none>                   7391.1 7423.1
## - Residence_type     1   7394.8 7424.8
## - gender             1   7396.4 7426.4
## - heart_disease      1   7402.9 7432.9
## - smoking_status     3   7427.9 7453.9
## - avg_glucose_level  1   7429.0 7459.0
## - work_type          4   7440.9 7464.9
## - hypertension       1   7448.0 7478.0
## - age                1   9166.3 9196.3
## 
## Step:  AIC=7421.08
## stroke ~ gender + age + hypertension + heart_disease + work_type + 
##     Residence_type + avg_glucose_level + bmi + smoking_status
## 
##                     Df Deviance    AIC
## - bmi                1   7391.9 7419.9
## <none>                   7391.1 7421.1
## - Residence_type     1   7394.8 7422.8
## - gender             1   7396.4 7424.4
## - heart_disease      1   7402.9 7430.9
## - smoking_status     3   7428.4 7452.4
## - avg_glucose_level  1   7429.3 7457.3
## - work_type          4   7441.7 7463.7
## - hypertension       1   7448.1 7476.1
## - age                1   9341.8 9369.8
## 
## Step:  AIC=7419.91
## stroke ~ gender + age + hypertension + heart_disease + work_type + 
##     Residence_type + avg_glucose_level + smoking_status
## 
##                     Df Deviance    AIC
## <none>                   7391.9 7419.9
## - Residence_type     1   7395.8 7421.8
## - gender             1   7397.2 7423.2
## - heart_disease      1   7404.0 7430.0
## - smoking_status     3   7429.3 7451.3
## - avg_glucose_level  1   7433.3 7459.3
## - work_type          4   7441.7 7461.7
## - hypertension       1   7450.5 7476.5
## - age                1   9353.0 9379.0
summary(RModel)
## 
## Call:
## glm(formula = stroke ~ gender + age + hypertension + heart_disease + 
##     work_type + Residence_type + avg_glucose_level + smoking_status, 
##     family = binomial(link = "logit"), data = traindata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4365  -0.6973  -0.2001   0.7253   2.6336  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -4.230674   0.248034 -17.057  < 2e-16 ***
## gender1             0.137445   0.059774   2.299 0.021481 *  
## age                 0.081925   0.002245  36.493  < 2e-16 ***
## hypertension1       0.580479   0.077028   7.536 4.85e-14 ***
## heart_disease1      0.329651   0.096115   3.430 0.000604 ***
## work_type1         -1.370205   0.225592  -6.074 1.25e-09 ***
## work_type2         -1.516512   0.234847  -6.457 1.06e-10 ***
## work_type3        -12.326203 206.289557  -0.060 0.952353    
## work_type4         -1.683757   0.240879  -6.990 2.75e-12 ***
## Residence_type1    -0.114349   0.057824  -1.978 0.047982 *  
## avg_glucose_level   0.009005   0.001399   6.436 1.23e-10 ***
## smoking_status1    -0.328007   0.075831  -4.326 1.52e-05 ***
## smoking_status2     0.148355   0.091313   1.625 0.104230    
## smoking_status3    -0.084855   0.089005  -0.953 0.340397    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10794.1  on 7786  degrees of freedom
## Residual deviance:  7391.9  on 7773  degrees of freedom
## AIC: 7419.9
## 
## Number of Fisher Scoring iterations: 13

Likelihood Ratio Test

anova(RModel,model5,test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: stroke ~ gender + age + hypertension + heart_disease + work_type + 
##     Residence_type + avg_glucose_level + smoking_status
## Model 2: stroke ~ gender + age + hypertension + heart_disease + ever_married + 
##     work_type + Residence_type + avg_glucose_level + bmi + smoking_status
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1      7773     7391.9                     
## 2      7771     7391.1  2   0.8432    0.656

The p-value for the likelihood ratio test (0.656) is greater than α (0.05), so we fail to reject the null hypothesis: the full model is not significantly better than the reduced model, and the reduced model is retained as the final model.

Variable Importance

VImpot = varImp(RModel)
VImpot
##                       Overall
## gender1            2.29941845
## age               36.49327652
## hypertension1      7.53599220
## heart_disease1     3.42974569
## work_type1         6.07382564
## work_type2         6.45744706
## work_type3         0.05975195
## work_type4         6.99006509
## Residence_type1    1.97752525
## avg_glucose_level  6.43613650
## smoking_status1    4.32551869
## smoking_status2    1.62468372
## smoking_status3    0.95338119

Higher values indicate greater importance, and these results match up nicely with the p-values from the model. Age is by far the most important predictor, followed by hypertension1, work_type4, work_type2, avg_glucose_level, work_type1, smoking_status1, heart_disease1, gender1, Residence_type1, smoking_status2, smoking_status3 and work_type3.

Checking the predictive power of the model

#Predictive Power
pR2(RModel)["McFadden"]
## fitting null model for pseudo-r2
##  McFadden 
## 0.3151896

R-squared tells us how well a model fits the data, and since the ordinary R-squared cannot be computed for logistic regression, we calculated McFadden's pseudo R-squared, which ranges from 0 to just under 1, with higher values indicating a better fit. The model gives a McFadden's R-squared of 0.32, suggesting that the model fits the data well and has high predictive power.

Confusion Matrix

model.prob2 = predict(model5, testdata, type="response")
confusionMatrix(data = as.factor(as.numeric(model.prob2>0.5)), reference = as.factor(testdata$stroke))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 710 200
##          1 214 811
##                                           
##                Accuracy : 0.786           
##                  95% CI : (0.7671, 0.8041)
##     No Information Rate : 0.5225          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.5709          
##                                           
##  Mcnemar's Test P-Value : 0.5229          
##                                           
##             Sensitivity : 0.7684          
##             Specificity : 0.8022          
##          Pos Pred Value : 0.7802          
##          Neg Pred Value : 0.7912          
##              Prevalence : 0.4775          
##          Detection Rate : 0.3669          
##    Detection Prevalence : 0.4703          
##       Balanced Accuracy : 0.7853          
##                                           
##        'Positive' Class : 0               
## 

From the table above, out of 1011 patients who had stroke, the model correctly predicted 811 of them, and out of 924 patients who had no stroke, the model correctly predicted 710. The model accuracy is 79% and the precision is 78%.
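As a quick check, these headline figures can be recomputed directly from the matrix; note that caret has taken class 0 as the 'positive' class in this output:

# Recomputing accuracy and precision from the confusion matrix counts
cm <- matrix(c(710, 214, 200, 811), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))
sum(diag(cm)) / sum(cm)        # accuracy: (710 + 811) / 1935 ~ 0.786
cm["0", "0"] / sum(cm["0", ])  # precision (positive class 0): 710 / 910 ~ 0.780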

Area Under Curve (AUC)

auc(testdata$stroke,model.prob2)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.8448

The higher the area under the curve (AUC), the better the performance of the model. An AUC of approximately 0.84 suggests that the model has good power to distinguish between the positive and negative classes.
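For completeness, the ROC curve behind this AUC can be drawn with pROC using the same objects as above; a minimal sketch:

# Plot the ROC curve for the test-set predictions (pROC)
roc_obj <- roc(testdata$stroke, model.prob2)
plot(roc_obj, print.auc = TRUE)  # curve with the AUC annotated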

Conclusion

Identifying those who are at high risk of stroke is an important part of disease prevention and treatment. This paper proposed a stroke predictive equation to help researchers and practitioners better understand the risk variables of stroke in order to diagnose, prevent and treat it. Logistic regression was used to discover the relationship between stroke and the input features. Of the 10 explanatory variables, eight (8) were retained as significant after applying stepAIC to the full model: age, gender, hypertension, heart disease, work type, residence type, average glucose level and smoking status were all observed to affect the possibility of getting stroke. The dataset was imbalanced, as patients who had stroke were far fewer than patients who had no stroke; for that reason, Random Over-Sampling Examples (ROSE) was used to balance it. 80% of the data was used as the training set and 20% as the testing set. After fitting the model on the training set, the test set was used to make predictions. Out of 1011 patients who had stroke, the model correctly predicted 811 of them, and out of 924 patients who had no stroke, the model correctly predicted 710. The accuracy and precision of this model were 79% and 78% respectively.

Recommendation

  • Pay attention to older people: age is the decisive factor in the risk of stroke.
  • Treat hypertension seriously; it is a significant predictor of stroke.
  • Patients who smoke or have a history of smoking should be charged higher rates. In particular, patients who formerly smoked showed a higher percentage of stroke than those who currently smoke.
  • Patients with a high BMI should exercise often.