1 Description of the Dataset

I chose a dataset from Kaggle, available at https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset/, that contains demographic and clinical information about hospital patients and records whether or not each patient had a stroke. The dataset's author chose not to disclose the source of the data. The dataset includes the following variables:

id (integer) - the unique identifying number given to each patient

gender (categorical) - the gender of the patient, male, female, or other

age (continuous) - the age of the patient

hypertension (categorical) - 0 if the patient doesn’t have hypertension, 1 if the patient does

heart_disease (categorical) - 0 if the patient doesn’t have heart disease, 1 if the patient does

ever_married (categorical) - Yes if the patient has been married before, No if not

work_type (categorical) - the type of work done by the patient

Residence_type (categorical) - the type of area in which the patient lived, urban or rural

avg_glucose_level (continuous) - the average blood glucose level of the patient

bmi (continuous) - the body mass index (BMI) of the patient

smoking_status (categorical) - the smoking status of the patient with the options “formerly smoked”, “never smoked”, “smokes” or “Unknown”

stroke (categorical) - 0 if the patient has not had a stroke, 1 if the patient has

I removed all observations with missing values from the dataset.
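The removal of incomplete observations can be sketched as follows. This is a minimal illustration rather than the code used for the report: the sample rows are made up, and the missing-value codes in MISSING (an empty field, "N/A", "NA") are assumptions about how the file marks a missing entry.

```python
import csv
import io

# Tiny made-up sample in the layout of the dataset (values are illustrative only).
SAMPLE = """id,age,bmi,stroke
1,67,36.6,1
2,61,N/A,1
3,80,32.5,1
4,49,,0
"""

# Assumed codes for a missing value; adjust to match the actual file.
MISSING = {"", "N/A", "NA"}


def drop_incomplete(rows):
    """Keep only the rows in which no field holds a missing-value code."""
    return [
        row for row in rows
        if not any(value.strip() in MISSING for value in row.values())
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
complete = drop_incomplete(rows)
print(len(rows), "rows read,", len(complete), "rows kept")
```

Note that "Unknown" is not treated as missing here, since smoking_status keeps "Unknown" as a category of its own in the models later in the report.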

2 Research Question

The objective of this assignment is to identify risk factors for stroke.

3 Exploratory Analysis

To begin, we examine our predictor variables for possible issues:

We find that multicollinearity is not a major issue for this dataset. We will take a closer look at our continuous variables.

Looking at these histograms, we find that age and bmi are unimodal (bmi with some right skew), whereas avg_glucose_level is bimodal. We therefore discretize avg_glucose_level into the grouped variable grp.glucose.
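The discretization step can be sketched as below. The cut points are an assumption inferred from the group labels that appear in the model tables (50-100, 150-200, 200-300, with 100-150 as the remaining group); whether each bin is closed on the left or on the right is not stated in the report, so left-closed bins are assumed here.

```python
from bisect import bisect_right

# Assumed cut points; the report only shows the resulting group labels.
EDGES = [50, 100, 150, 200, 300]
LABELS = ["50-100", "100-150", "150-200", "200-300"]


def glucose_group(level):
    """Map an average glucose level to its group label (left-closed bins)."""
    if not EDGES[0] <= level <= EDGES[-1]:
        raise ValueError(f"glucose level {level} is outside {EDGES[0]}-{EDGES[-1]}")
    # bisect_right gives the position of the first edge greater than level,
    # so subtracting 1 gives the index of the bin that contains level.
    idx = bisect_right(EDGES, level) - 1
    # The top edge (300) is placed in the last bin instead of opening a new one.
    return LABELS[min(idx, len(LABELS) - 1)]


for value in (55.12, 100, 174.5, 228.69):
    print(value, "->", glucose_group(value))
```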

We proceed with building the model.

4 Building the Model

We begin by fitting the full and reduced logistic regression models. The reduced model includes only glucose level, hypertension, BMI, and smoking status, which are all known stroke risk factors.

Summary of inferential statistics of the full model
Term                            Estimate    Std. Error      z value   Pr(>|z|)
(Intercept)                   -6.8864472     1.0747962   -6.4072119  0.0000000
genderMale                    -0.0085132     0.1541947   -0.0552107  0.9559706
genderOther                  -11.1461654  2399.5447565   -0.0046451  0.9962937
age                            0.0741004     0.0063784   11.6174165  0.0000000
hypertension                   0.4980711     0.1758955    2.8316314  0.0046311
heart_disease                  0.3605662     0.2069584    1.7422157  0.0814707
ever_marriedYes               -0.1082745     0.2473676   -0.4377068  0.6615988
work_typeGovt_job             -0.7362968     1.1150947   -0.6602998  0.5090614
work_typeNever_worked        -10.8272064   509.1995873   -0.0212632  0.9830357
work_typePrivate              -0.5650374     1.1013832   -0.5130252  0.6079337
work_typeSelf-employed        -1.0057383     1.1202760   -0.8977595  0.3693138
Residence_typeUrban           -0.0026481     0.1500144   -0.0176523  0.9859162
bmi                            0.0046862     0.0118820    0.3943961  0.6932886
smoking_statusnever smoked    -0.0669628     0.1887562   -0.3547582  0.7227707
smoking_statussmokes           0.3340425     0.2300076    1.4523111  0.1464151
smoking_statusUnknown         -0.2724636     0.2472262   -1.1020824  0.2704259
grp.glucose150-200             0.6654695     0.2762230    2.4091751  0.0159886
grp.glucose200-300             0.4383726     0.2407938    1.8205310  0.0686782
grp.glucose50-100             -0.1173846     0.2030818   -0.5780163  0.5632531

Summary of inferential statistics of the reduced model
Term                            Estimate    Std. Error      z value   Pr(>|z|)
(Intercept)                   -2.9509721     0.3590873   -8.2179785  0.0000000
hypertension                   1.0908654     0.1729980    6.3056546  0.0000000
bmi                           -0.0060480     0.0098366   -0.6148442  0.5386576
smoking_statusnever smoked    -0.3562895     0.1813516   -1.9646341  0.0494566
smoking_statussmokes          -0.1421640     0.2193225   -0.6481961  0.5168581
smoking_statusUnknown         -0.9733116     0.2423512   -4.0161211  0.0000592
grp.glucose150-200             0.9576615     0.2650586    3.6130175  0.0003027
grp.glucose200-300             1.1199193     0.2340747    4.7844529  0.0000017
grp.glucose50-100             -0.0591308     0.1950295   -0.3031889  0.7617459

With these two models as the bounds of the search, we use automatic variable selection to arrive at our final model.

Summary of inferential statistics of the final model
Term                            Estimate    Std. Error      z value   Pr(>|z|)
(Intercept)                   -7.3901581     0.6047444  -12.2202996  0.0000000
hypertension                   0.4909446     0.1754564    2.7981009  0.0051404
bmi                            0.0041574     0.0117457    0.3539487  0.7233773
smoking_statusnever smoked    -0.0572202     0.1865683   -0.3066985  0.7590729
smoking_statussmokes           0.3421568     0.2289259    1.4946185  0.1350140
smoking_statusUnknown         -0.2527804     0.2449583   -1.0319323  0.3021038
grp.glucose150-200             0.6523146     0.2758073    2.3651101  0.0180247
grp.glucose200-300             0.4535649     0.2395194    1.8936459  0.0582720
grp.glucose50-100             -0.1111053     0.2023739   -0.5490100  0.5829986
age                            0.0694848     0.0058696   11.8381005  0.0000000
heart_disease                  0.3848055     0.2045226    1.8814813  0.0599065

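Aside from the p-values, we compare a few global goodness-of-fit measures across the three models. The final model has the lowest AIC (1390.659), and its residual deviance is close to that of the full model.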

Comparison of global goodness-of-fit statistics
Model           Residual.Deviance   Null.Deviance        AIC
full.model               1362.084        1728.386   1400.084
reduced.model            1591.834        1728.386   1609.834
final.model              1368.659        1728.386   1390.659
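The AIC column can be checked by hand from the deviances. The sketch below assumes that the models are logistic regressions fitted to individual 0/1 outcomes, in which case AIC equals the residual deviance plus twice the number of estimated coefficients; the coefficient counts (19, 9, and 11) are the row counts of the three summary tables. As an additional check that is not part of the original analysis, it also runs a likelihood-ratio test of the final model against the full model. The helper chi2_sf_even_df is a hypothetical name, and its closed form holds only for even degrees of freedom.

```python
import math

# Residual deviance and number of estimated coefficients for each model.
models = {
    "full.model": (1362.084, 19),
    "reduced.model": (1591.834, 9),
    "final.model": (1368.659, 11),
}


def aic(deviance, n_params):
    """AIC for a logistic model on 0/1 outcomes: deviance + 2 * parameters."""
    return deviance + 2 * n_params


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    half = x / 2
    return math.exp(-half) * sum(
        half ** i / math.factorial(i) for i in range(df // 2)
    )


for name, (deviance, n_params) in models.items():
    print(name, "AIC =", round(aic(deviance, n_params), 3))

# Likelihood-ratio test: the final model is nested within the full model.
lr_stat = models["final.model"][0] - models["full.model"][0]
lr_df = models["full.model"][1] - models["final.model"][1]
p_value = chi2_sf_even_df(lr_stat, lr_df)
print("LR statistic =", round(lr_stat, 3), "on", lr_df, "df, p =", round(p_value, 3))
```

Under these assumptions the computed AIC values agree with the table, and the p-value of the likelihood-ratio test is well above 0.05, which is consistent with dropping the four variables that are absent from the final model.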

5 Final Model

Through the process of automatic variable selection, the variables gender, ever_married, work_type, and Residence_type were all removed from the final model. Our final model includes the variables bmi and smoking_status even though neither was found to be statistically significant, because these two variables have clinical importance. We will look at the odds ratios to interpret the model.

Summary Stats with Odds Ratios
Term                            Estimate    Std. Error      z value   Pr(>|z|)  odds.ratio
(Intercept)                   -7.3901581     0.6047444  -12.2202996  0.0000000   0.0006173
hypertension                   0.4909446     0.1754564    2.7981009  0.0051404   1.6338589
bmi                            0.0041574     0.0117457    0.3539487  0.7233773   1.0041660
smoking_statusnever smoked    -0.0572202     0.1865683   -0.3066985  0.7590729   0.9443861
smoking_statussmokes           0.3421568     0.2289259    1.4946185  0.1350140   1.4079811
smoking_statusUnknown         -0.2527804     0.2449583   -1.0319323  0.3021038   0.7766384
grp.glucose150-200             0.6523146     0.2758073    2.3651101  0.0180247   1.9199797
grp.glucose200-300             0.4535649     0.2395194    1.8936459  0.0582720   1.5739130
grp.glucose50-100             -0.1111053     0.2023739   -0.5490100  0.5829986   0.8948445
age                            0.0694848     0.0058696   11.8381005  0.0000000   1.0719558
heart_disease                  0.3848055     0.2045226    1.8814813  0.0599065   1.4693285

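Looking at the odds ratios for the groups of the variable grp.glucose, we find that, holding the other variables fixed, a person with an average blood glucose level of 150-200 has about 1.9 times the odds of having a stroke compared with someone in the reference group (the level omitted from the table, which appears to be 100-150). The odds are higher in both groups above 150 than in the reference group and slightly lower in the 50-100 group, so the odds of a stroke are generally higher at higher blood glucose levels. The pattern is not strictly increasing, however: the odds ratio for 200-300 (1.57) is smaller than that for 150-200 (1.92) and is not significant at the 0.05 level.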
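The odds ratios in the table are the exponentiated estimates, which can be reproduced as in the sketch below. The 95% intervals are Wald intervals computed on the log-odds scale; they are an illustrative addition and are not reported in the original analysis. The function name odds_ratio_ci is hypothetical.

```python
import math

# (estimate, standard error) pairs copied from the final-model table.
coefs = {
    "hypertension": (0.4909446, 0.1754564),
    "grp.glucose150-200": (0.6523146, 0.2758073),
    "grp.glucose200-300": (0.4535649, 0.2395194),
    "grp.glucose50-100": (-0.1111053, 0.2023739),
    "age": (0.0694848, 0.0058696),
}

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def odds_ratio_ci(estimate, std_error, z=Z_95):
    """Return (odds ratio, lower, upper) from a Wald interval on the log-odds scale."""
    return (
        math.exp(estimate),
        math.exp(estimate - z * std_error),
        math.exp(estimate + z * std_error),
    )


for name, (estimate, std_error) in coefs.items():
    ratio, low, high = odds_ratio_ci(estimate, std_error)
    print(f"{name}: OR = {ratio:.3f}, 95% CI ({low:.3f}, {high:.3f})")
```

An interval that contains 1 corresponds to a coefficient that is not significant at the 0.05 level in the table.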

6 Summary and Conclusion

In this report, we examined the association between several possible risk factors and stroke. The dataset that we chose included 10 explanatory variables, 3 of which were continuous and the rest categorical. After some initial exploratory analysis, we chose to discretize the variable avg_glucose_level due to its bimodal distribution.

Because smoking status, hypertension, average glucose level, and BMI are known risk factors for a stroke, we included these clinically important variables in our model regardless of statistical significance. Our final model has the explanatory variables hypertension, age, heart_disease, bmi, smoking_status (with three dummy variables), and grp.glucose (with three dummy variables). The variables smoking_status and bmi were not statistically significant, but we have left them in the model due to their clinical importance.
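As an illustration of how the final model described above turns a patient profile into a predicted probability, the sketch below applies the inverse logit to the linear predictor built from the reported estimates. The patient is hypothetical, and the reference levels (formerly smoked for smoking_status, and the omitted glucose group) are inferred from the levels that do not appear in the coefficient table.

```python
import math

# Final-model estimates copied from the report.
COEF = {
    "(Intercept)": -7.3901581,
    "hypertension": 0.4909446,
    "bmi": 0.0041574,
    "smoking_statusnever smoked": -0.0572202,
    "smoking_statussmokes": 0.3421568,
    "smoking_statusUnknown": -0.2527804,
    "grp.glucose150-200": 0.6523146,
    "grp.glucose200-300": 0.4535649,
    "grp.glucose50-100": -0.1111053,
    "age": 0.0694848,
    "heart_disease": 0.3848055,
}


def stroke_probability(patient):
    """Inverse logit of the linear predictor; terms absent from patient count as 0."""
    eta = COEF["(Intercept)"] + sum(
        COEF[name] * value for name, value in patient.items()
    )
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical patient: 70 years old, hypertensive, BMI 28, current smoker,
# average glucose in the 150-200 group, no heart disease.
patient = {
    "age": 70,
    "hypertension": 1,
    "bmi": 28,
    "smoking_statussmokes": 1,
    "grp.glucose150-200": 1,
}
probability = stroke_probability(patient)
print(f"predicted probability of stroke: {probability:.3f}")
```

Dummy variables are set to 1 for the level that applies and left out otherwise, so a patient in a reference level contributes nothing for that variable.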