1 Description of the Dataset

I chose the stroke prediction dataset available on Kaggle (https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset/), which contains demographic and health information about hospital patients and records whether or not each patient had a stroke. The dataset's author chose not to disclose the source of the data. The dataset includes the following variables:

id (integer) - the unique identifying number given to each patient

gender (categorical) - the gender of the patient, male, female, or other

age (continuous) - the age of the patient

hypertension (categorical) - 0 if the patient doesn’t have hypertension, 1 if the patient does

heart_disease (categorical) - 0 if the patient doesn’t have heart disease, 1 if the patient does

ever_married (categorical) - Yes if the patient has been married before, No if not

work_type (categorical) - the type of work done by the patient

residence_type (categorical) - the area in which the patient lived, urban or rural

avg_glucose_level (continuous) - the average blood glucose level of the patient

bmi (continuous) - the body mass index (BMI) of the patient

smoking_status (categorical) - the smoking status of the patient with the options “formerly smoked”, “never smoked”, “smokes” or “Unknown”

stroke (categorical) - 0 if the patient has not had a stroke, 1 if the patient has

2 Research Question

High blood sugar levels are common in diabetic patients, who are known to have a higher risk of stroke. Using this dataset, we will examine how a patient's average glucose level is associated with the patient having had a stroke, by fitting a simple logistic regression.
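For reference, a simple logistic regression with one predictor can be written as below. Here \(p\) denotes the probability of a stroke and \(x\) denotes avg_glucose_level; both symbols are notation assumed for this sketch, while \(\beta_0\) and \(\beta_1\) are the intercept and slope to be estimated.

\[
\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x
\qquad\Longleftrightarrow\qquad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}
\]

Under this form, \(\beta_1\) is the change in the log-odds of a stroke for a 1-unit increase in \(x\), and \(e^{\beta_1}\) is the corresponding odds ratio.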

3 Exploratory Data Analysis

Let us first examine our single predictor, avg_glucose_level, to see how it is distributed and if it is skewed.

We can see that the distribution is bimodal and heavily right-skewed. For this regression, we will proceed without transforming the predictor variable; however, in the future we may choose to discretize it to make the model easier to interpret.
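As a rough illustration of what discretizing the predictor could look like, the sketch below assigns glucose values to bins. The cut points, the labels, and the example values are all hypothetical placeholders chosen for this sketch; they are not clinical thresholds and are not taken from the dataset.

```python
from bisect import bisect_right

# Hypothetical cut points (placeholders, not clinical thresholds).
CUT_POINTS = [100.0, 150.0, 200.0]
# One label per bin; each bin includes its lower bound and excludes its upper bound.
LABELS = ["<100", "100-150", "150-200", ">=200"]


def discretize(glucose):
    """Return the label of the bin that contains a glucose value."""
    return LABELS[bisect_right(CUT_POINTS, glucose)]


# Made-up example values, not rows from the dataset.
example_values = [85.3, 100.0, 149.9, 228.7]
binned = [discretize(value) for value in example_values]
print(binned)
```

With a binned predictor, each coefficient would compare one bin against a reference bin, which can be easier to explain than a per-unit effect.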

We will proceed by constructing the simple logistic regression model.

4 Simple Logistic Regression

Summary statistics of the regression coefficients

                    Estimate     Std. Error   z value      Pr(>|z|)   2.5 %        97.5 %
(Intercept)         -4.4363848   0.1756788    -25.252819   0          -4.7859217   -4.0966562
avg_glucose_level   0.0112496    0.0012169    9.244668     0          0.0088386    0.0136146

The table above indicates that average glucose level is positively associated with the odds of a stroke: the estimated slope is \(\beta_1 = 0.0112\), with a p-value close to zero. The 95% confidence interval \([0.0088386, 0.0136146]\) lies entirely above zero, which also supports this positive relationship. This matches what we expected.
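The reported z value and the approximate width of the interval can be checked by hand from the estimate and its standard error. The sketch below is only a verification aid using the rounded values printed in the table; it computes a Wald interval, which is an assumption on my part and may differ slightly from the interval in the table if that one was computed by a different method (for example, profile likelihood).

```python
import math

# Slope estimate and standard error, copied from the coefficient table.
estimate = 0.0112496
std_error = 0.0012169

# Wald z statistic: the estimate divided by its standard error.
z_value = estimate / std_error

# Two-sided p-value under the standard normal distribution.
p_value = math.erfc(abs(z_value) / math.sqrt(2))

# Wald 95% interval; 1.959964 is the 97.5th percentile of the standard normal.
z_crit = 1.959964
wald_ci = (estimate - z_crit * std_error, estimate + z_crit * std_error)

print(f"z = {z_value:.4f}, p = {p_value:.3g}")
print(f"Wald 95% CI = [{wald_ci[0]:.7f}, {wald_ci[1]:.7f}]")
```

The p-value is far too small to display at the table's precision, which is why it appears as 0 there.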

Continuing, we will interpret our results through the odds ratio.

Summary statistics with odds ratios

                    Estimate     Std. Error   z value      Pr(>|z|)   odds.ratio
(Intercept)         -4.4363848   0.1756788    -25.252819   0          0.0118387
avg_glucose_level   0.0112496    0.0012169    9.244668     0          1.0113131

The odds ratio for avg_glucose_level is \(1.01131\), which indicates that a 1-unit increase in average blood glucose level corresponds to an increase of about 1.1% in the odds of the patient having a stroke. Note that this is a change in the odds, not in the probability.
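The odds ratios are simply the exponentiated coefficients, so they can be reproduced directly. The sketch below uses the rounded estimates from the table; the 10-unit comparison is an extra illustration added here, not a result reported above.

```python
import math

# Coefficient estimates copied from the table.
beta0 = -4.4363848  # intercept
beta1 = 0.0112496   # slope for avg_glucose_level

# Odds ratio for a 1-unit increase in average glucose level.
odds_ratio_1 = math.exp(beta1)

# Odds ratio for a 10-unit increase (illustrative, not reported in the table).
odds_ratio_10 = math.exp(10 * beta1)

# Exponentiated intercept: the model's odds of a stroke at a glucose level of 0.
baseline_odds = math.exp(beta0)

# Percentage change in the odds per 1-unit increase.
percent_change_1 = (odds_ratio_1 - 1) * 100

print(f"OR per 1 unit   = {odds_ratio_1:.7f} ({percent_change_1:.2f}% higher odds)")
print(f"OR per 10 units = {odds_ratio_10:.4f}")
print(f"exp(intercept)  = {baseline_odds:.7f}")
```

Because the effect multiplies the odds, larger differences in glucose level compound: a 10-unit increase corresponds to roughly 12% higher odds rather than exactly ten times the 1-unit percentage.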

We will also include some goodness-of-fit measures for the model.

Residual deviance   Null deviance   AIC
1653                1728            1657

We will use these measures to compare different models at a later date.
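As a quick consistency check on these measures, the sketch below recomputes the AIC from the residual deviance and forms a likelihood-ratio statistic from the two deviances. It assumes ungrouped 0/1 responses (so that the residual deviance equals minus twice the log-likelihood) and uses the rounded values shown above; the likelihood-ratio test itself is an addition for illustration and is not part of the reported output.

```python
import math

# Rounded values copied from the goodness-of-fit table.
residual_deviance = 1653.0
null_deviance = 1728.0
n_parameters = 2  # intercept and slope

# For ungrouped binary data, AIC = residual deviance + 2 * number of parameters.
aic = residual_deviance + 2 * n_parameters

# Likelihood-ratio statistic comparing the fitted model with the intercept-only model.
lr_statistic = null_deviance - residual_deviance

# Upper-tail probability of a chi-square distribution with 1 degree of freedom:
# P(X > x) = erfc(sqrt(x / 2)).
lr_p_value = math.erfc(math.sqrt(lr_statistic / 2))

print(f"AIC = {aic:.0f}")
print(f"LR statistic = {lr_statistic:.0f} on 1 df, p = {lr_p_value:.3g}")
```

The recomputed AIC agrees with the reported value, and the drop in deviance suggests that adding avg_glucose_level improves on the intercept-only model.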

Finally, we will create graphs for our model.

The left plot shows the logistic curve fitted by the model, in which the probability of a stroke increases with average blood glucose level; the right plot shows the rate of change of that curve. The rate of change increases throughout the plotted range, so the fitted curve does not take the standard S-shape of a logistic curve. This is consistent with the estimated coefficients: the inflection point of the curve, where the predicted probability equals 0.5, lies at \(-\beta_0/\beta_1 \approx 394\), so for glucose levels below that value only the lower, convex part of the S-curve appears.
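The shape of the fitted curve can also be checked numerically from the coefficients. The sketch below evaluates the fitted probability and its derivative on a hypothetical grid of glucose values; the grid is an assumption made for illustration and is not taken from the data, and the function names are my own.

```python
import math

# Coefficient estimates copied from the regression table.
BETA0 = -4.4363848
BETA1 = 0.0112496


def stroke_probability(glucose):
    """Fitted probability of a stroke at a given average glucose level."""
    return 1.0 / (1.0 + math.exp(-(BETA0 + BETA1 * glucose)))


def rate_of_change(glucose):
    """Derivative of the fitted probability with respect to glucose: beta1 * p * (1 - p)."""
    p = stroke_probability(glucose)
    return BETA1 * p * (1.0 - p)


# Glucose level at which the fitted probability reaches 0.5.
inflection_point = -BETA0 / BETA1

# Hypothetical grid of glucose values (not taken from the dataset).
grid = [50, 100, 150, 200, 250, 300]
rates = [rate_of_change(g) for g in grid]

for g, r in zip(grid, rates):
    print(f"glucose={g:3d}  p={stroke_probability(g):.4f}  dp/dx={r:.6f}")
print(f"inflection point = {inflection_point:.1f}")
```

On this grid the derivative keeps increasing, because every fitted probability is below 0.5; the curve would only begin to flatten beyond the inflection point.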