I chose a dataset from Kaggle (https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset/) that contains demographic and health information about hospital patients, along with whether or not each patient had a stroke. The author chose not to disclose the source of the data. The dataset includes the following variables:
id (integer) - the unique identifying number given to each patient
gender (categorical) - the gender of the patient: male, female, or other
age (continuous) - the age of the patient
hypertension (categorical) - 0 if the patient doesn’t have hypertension, 1 if the patient does
heart_disease (categorical) - 0 if the patient doesn’t have heart disease, 1 if the patient does
ever_married (categorical) - Yes if the patient has been married before, No if not
work_type (categorical) - the type of work done by the patient
residence_type (categorical) - the type of area in which the patient lived: urban or rural
avg_glucose_level (continuous) - the average blood glucose level of the patient
bmi (continuous) - the body mass index (BMI) of the patient
smoking_status (categorical) - the smoking status of the patient with the options “formerly smoked”, “never smoked”, “smokes” or “Unknown”
stroke (categorical) - 0 if the patient has not had a stroke, 1 if the patient has
High blood sugar levels are commonly seen in diabetic patients, who are known to have a higher likelihood of stroke. Using this dataset, we will examine how a patient’s average glucose level is associated with having a stroke, by fitting a simple logistic regression.
Let us first examine our single predictor, avg_glucose_level, to see how it is distributed and if it is skewed.
We can see that the distribution is bimodal and strongly right-skewed.
For this regression, we will proceed without transforming the predictor variable; however, in the future we may choose to discretize it to make the model easier to interpret.
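One way to put a number on the skew described above is the moment coefficient of skewness, which is positive for a right-skewed variable. The sketch below is only an illustration: the `sample_skewness` helper and the glucose values are hypothetical and are not taken from the dataset.

```python
def sample_skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Made-up values for illustration only (NOT from the stroke dataset):
# most readings cluster at the low end, with a few very high ones.
illustrative_glucose = [70, 75, 80, 85, 90, 95, 100, 105, 200, 230]

skew = sample_skewness(illustrative_glucose)
print(f"skewness = {skew:.2f}")  # a positive value indicates right skew
```

Applied to the real `avg_glucose_level` column, a clearly positive value would agree with the right skew seen in the plot.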
We will proceed by constructing the simple logistic regression model.
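The report does not show the code used to fit the model. As a minimal sketch of what fitting a simple logistic regression involves, the Python below maximizes the log-likelihood with Newton-Raphson steps. The function name `fit_simple_logistic` and the ten data points are assumptions made for illustration; this is not the author's code, and the data are not from the stroke dataset.

```python
import math


def fit_simple_logistic(x, y, iterations=25):
    """Fit P(y = 1) = 1 / (1 + exp(-(b0 + b1 * x))) by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # Gradient of the log-likelihood (g0, g1) and the entries of the
        # 2x2 information matrix X^T W X (h00, h01, h11).
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: beta <- beta + (X^T W X)^{-1} * gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Made-up toy data for illustration only (NOT from the stroke dataset).
toy_glucose = [80, 90, 100, 110, 120, 150, 180, 200, 220, 250]
toy_stroke = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_simple_logistic(toy_glucose, toy_stroke)
print(f"intercept = {b0:.4f}, slope = {b1:.4f}")
```

The coefficient table that follows comes from the report's own fit to the full dataset, not from this toy example.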
Term | Estimate | Std. Error | z value | Pr(>\|z\|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|
(Intercept) | -4.4363848 | 0.1756788 | -25.252819 | 0 | -4.7859217 | -4.0966562 |
avg_glucose_level | 0.0112496 | 0.0012169 | 9.244668 | 0 | 0.0088386 | 0.0136146 |
The table above indicates that average glucose level is positively associated with the odds of a stroke: the estimated coefficient is \(\beta_1 = 0.01125\), with a p-value close to zero. The 95% confidence interval \([0.0088386, 0.0136146]\) lies entirely above zero, which also supports this positive relationship. This agrees with what we expected.
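The z value and p-value in the table can be checked by hand from the estimate and its standard error. The sketch below recomputes the Wald statistic, its two-sided normal p-value, and a Wald 95% interval. The inputs are the rounded values from the table, so small discrepancies are expected, and the variable names are chosen here for illustration.

```python
import math

estimate = 0.0112496   # avg_glucose_level coefficient, from the table
std_error = 0.0012169  # its standard error, from the table

# Wald statistic: estimate divided by its standard error.
z_value = estimate / std_error

# Two-sided p-value under a standard normal reference distribution.
p_value = math.erfc(abs(z_value) / math.sqrt(2))

# Wald 95% interval: estimate +/- 1.959964 standard errors.
z_crit = 1.959964
wald_low = estimate - z_crit * std_error
wald_high = estimate + z_crit * std_error

print(f"z = {z_value:.3f}, p = {p_value:.2e}")
print(f"Wald 95% CI = [{wald_low:.5f}, {wald_high:.5f}]")
```

The Wald interval comes out at roughly \([0.00886, 0.01363]\), slightly different from the interval in the table. That would be consistent with the table's interval having been computed by profile likelihood rather than by the Wald formula, although the report does not say which method was used.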
Continuing, we will interpret our results through the odds ratio.
Term | Estimate | Std. Error | z value | Pr(>\|z\|) | odds.ratio |
---|---|---|---|---|---|
(Intercept) | -4.4363848 | 0.1756788 | -25.252819 | 0 | 0.0118387 |
avg_glucose_level | 0.0112496 | 0.0012169 | 9.244668 | 0 | 1.0113131 |
The odds ratio is \(1.01131\), which indicates that a 1-unit increase in average blood glucose level corresponds to an increase of about 1.1% in the odds of a stroke. Note that this is a change in the odds, not in the probability itself.
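The odds ratio is simply the exponential of the coefficient, so it can be reproduced directly from the table. The short sketch below does this and also rescales it to a 10-unit increase, which is often easier to read for a predictor with a wide range. The 10-unit step is an illustrative choice, not something taken from the report.

```python
import math

slope = 0.0112496  # avg_glucose_level coefficient, from the table

# Odds ratio for a 1-unit increase in avg_glucose_level.
odds_ratio_1 = math.exp(slope)

# Odds ratio for a 10-unit increase (illustrative step size).
odds_ratio_10 = math.exp(10 * slope)

# Percent change in the odds of stroke implied by each odds ratio.
pct_change_1 = (odds_ratio_1 - 1) * 100
pct_change_10 = (odds_ratio_10 - 1) * 100

print(f"1-unit odds ratio:  {odds_ratio_1:.7f} ({pct_change_1:.2f}% higher odds)")
print(f"10-unit odds ratio: {odds_ratio_10:.4f} ({pct_change_10:.1f}% higher odds)")
```

Because the effect is multiplicative on the odds, the 10-unit odds ratio is the 1-unit odds ratio raised to the tenth power, not ten times the 1-unit percentage.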
We will also include some goodness-of-fit measures for the model.
Residual deviance | Null deviance | AIC |
---|---|---|
1653 | 1728 | 1657 |
We will use these measures to compare different models at a later date.
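The three numbers in the table are related, which gives a quick consistency check. For a binary outcome recorded per patient, the deviance equals \(-2\) times the log-likelihood, so the AIC should equal the residual deviance plus twice the number of estimated parameters. The drop from the null deviance to the residual deviance can also be read as a likelihood-ratio statistic with one degree of freedom. The sketch below uses the rounded values from the table, so its results are approximate, and the variable names are chosen here for illustration.

```python
import math

residual_deviance = 1653  # rounded, from the table
null_deviance = 1728      # rounded, from the table
n_parameters = 2          # intercept and avg_glucose_level slope

# AIC = residual deviance + 2 * (number of parameters) for ungrouped binary data.
aic = residual_deviance + 2 * n_parameters

# Likelihood-ratio statistic: reduction in deviance from adding the predictor.
deviance_drop = null_deviance - residual_deviance

# Upper-tail probability of a chi-square variable with 1 degree of freedom:
# P(Z**2 > x) = erfc(sqrt(x / 2)) for a standard normal Z.
lr_p_value = math.erfc(math.sqrt(deviance_drop / 2))

print(f"AIC = {aic}")
print(f"deviance drop = {deviance_drop}, p = {lr_p_value:.2e}")
```

The recomputed AIC matches the value in the table, and the small p-value for the deviance drop agrees with the Wald test on the slope.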
Finally, we will create graphs for our model.
The left plot shows the logistic curve fitted by the model: the predicted probability of a stroke increases as the average blood glucose level increases. The right plot shows the rate of change of the fitted curve. The rate of change rises throughout the plotted range, so the fitted curve does not show the full S-shape of the standard logistic curve.
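The shape of both plots follows from the fitted coefficients. The predicted probability is \(p(x) = 1 / (1 + e^{-(\beta_0 + \beta_1 x)})\), and its rate of change is \(\beta_1 \, p(x) \, (1 - p(x))\), which peaks where \(p(x) = 0.5\), that is, at \(x = -\beta_0 / \beta_1\). The sketch below evaluates these quantities with the coefficients from the table. The function names and the grid of glucose values are assumptions made for illustration, and the report does not state the observed range of `avg_glucose_level`.

```python
import math

B0 = -4.4363848  # intercept, from the coefficient table
B1 = 0.0112496   # avg_glucose_level slope, from the coefficient table


def stroke_probability(glucose):
    """Predicted probability of stroke at a given average glucose level."""
    return 1.0 / (1.0 + math.exp(-(B0 + B1 * glucose)))


def rate_of_change(glucose):
    """Derivative of the predicted probability: B1 * p * (1 - p)."""
    p = stroke_probability(glucose)
    return B1 * p * (1.0 - p)


# Glucose level at which p = 0.5; the curve is steepest here.
inflection = -B0 / B1
print(f"inflection point at avg_glucose_level = {inflection:.1f}")

# Hypothetical grid of glucose values, chosen only for illustration.
for glucose in (50, 100, 150, 200, 250, 300):
    p = stroke_probability(glucose)
    slope = rate_of_change(glucose)
    print(f"glucose = {glucose}: p = {p:.4f}, dp/dx = {slope:.6f}")
```

The inflection point is at roughly 394. If the plotted glucose values all lie below that level, the plot covers only the lower, convex part of the S-curve, which would explain why the rate of change keeps increasing.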