I chose a dataset from Kaggle (https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset/) that includes demographic and health information about hospital patients and whether or not they had a stroke. The author chose not to disclose the source of the data. The dataset includes the following variables:
id (integer) - the unique identifying number given to each patient
gender (categorical) - the gender of the patient, male, female, or other
age (continuous) - the age of the patient
hypertension (categorical) - 0 if the patient doesn’t have hypertension, 1 if the patient does
heart_disease (categorical) - 0 if the patient doesn’t have heart disease, 1 if the patient does
ever_married (categorical) - Yes if the patient has been married before, No if not
work_type (categorical) - the type of work done by the patient
Residence_type (categorical) - the area in which the patient lived, urban or rural
avg_glucose_level (continuous) - the average blood glucose level of the patient
bmi (continuous) - the body mass index (BMI) of the patient
smoking_status (categorical) - the smoking status of the patient with the options “formerly smoked”, “never smoked”, “smokes” or “Unknown”
stroke (categorical) - 0 if the patient has not had a stroke, 1 if the patient has
I chose to eliminate all missing values from the dataset.
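One common way to eliminate missing values is to drop every row that contains one. The sketch below assumes that approach, and assumes that a missing entry is encoded as an empty field, `N/A`, or `NA`. The three sample rows are made up for illustration and are not records from the dataset.

```python
import csv
import io

# Hypothetical sample rows; the real file has more columns and records.
SAMPLE = """id,age,bmi,stroke
1,67,36.6,1
2,61,N/A,1
3,80,32.5,1
"""

MISSING = {"", "N/A", "NA"}  # assumed encodings of a missing value


def drop_incomplete(rows):
    """Keep only the rows in which no field is a missing-value marker."""
    return [
        row for row in rows
        if not any(value.strip() in MISSING for value in row.values())
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
complete = drop_incomplete(rows)
print(len(rows), "rows read,", len(complete), "rows kept")
```

Note that the smoking_status value "Unknown" is kept as its own category in the models, so it is deliberately not treated as missing here.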
The objective for this assignment is to identify different risk factors for a stroke.
To begin, we examine our predictor variables for possible issues:
We find that multicollinearity is not a major issue for this dataset. We will take a closer look at our continuous variables.
Looking at these histograms, we find that age and bmi are unimodal (with bmi having some right skew), but avg_glucose_level is bimodal. Therefore, we discretize avg_glucose_level into the grouped variable grp.glucose.
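As a rough illustration of the discretization step, the sketch below maps a glucose reading to a group label. The cut points (100, 150, 200) and the `100-150` label are assumptions inferred from the level names that appear in the model tables; the rule that a boundary value goes into the upper group is also an assumption.

```python
import bisect

# Assumed bin edges and labels, inferred from the grp.glucose level names.
# The original analysis may have used different cut points or edge rules.
EDGES = [100, 150, 200]
LABELS = ["50-100", "100-150", "150-200", "200-300"]


def glucose_group(avg_glucose_level):
    """Map a continuous glucose reading to its (assumed) group label."""
    # bisect_right counts how many edges are <= the reading, which is
    # the index of the group the reading falls into.
    return LABELS[bisect.bisect_right(EDGES, avg_glucose_level)]


for reading in (75.0, 120.5, 150.0, 228.7):
    print(reading, "->", glucose_group(reading))
```

Readings below 50 or above 300 would fall into the first and last groups under this sketch.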
We proceed with building the model.
We begin by building the full model and a reduced model. The reduced model includes only glucose level, hypertension, BMI, and smoking status, which are all known stroke risk factors. The two tables below show the full model first, then the reduced model.
Term | Estimate | Std. Error | z value | Pr(>\|z\|) |
---|---|---|---|---|
(Intercept) | -6.8864472 | 1.0747962 | -6.4072119 | 0.0000000 |
genderMale | -0.0085132 | 0.1541947 | -0.0552107 | 0.9559706 |
genderOther | -11.1461654 | 2399.5447565 | -0.0046451 | 0.9962937 |
age | 0.0741004 | 0.0063784 | 11.6174165 | 0.0000000 |
hypertension | 0.4980711 | 0.1758955 | 2.8316314 | 0.0046311 |
heart_disease | 0.3605662 | 0.2069584 | 1.7422157 | 0.0814707 |
ever_marriedYes | -0.1082745 | 0.2473676 | -0.4377068 | 0.6615988 |
work_typeGovt_job | -0.7362968 | 1.1150947 | -0.6602998 | 0.5090614 |
work_typeNever_worked | -10.8272064 | 509.1995873 | -0.0212632 | 0.9830357 |
work_typePrivate | -0.5650374 | 1.1013832 | -0.5130252 | 0.6079337 |
work_typeSelf-employed | -1.0057383 | 1.1202760 | -0.8977595 | 0.3693138 |
Residence_typeUrban | -0.0026481 | 0.1500144 | -0.0176523 | 0.9859162 |
bmi | 0.0046862 | 0.0118820 | 0.3943961 | 0.6932886 |
smoking_statusnever smoked | -0.0669628 | 0.1887562 | -0.3547582 | 0.7227707 |
smoking_statussmokes | 0.3340425 | 0.2300076 | 1.4523111 | 0.1464151 |
smoking_statusUnknown | -0.2724636 | 0.2472262 | -1.1020824 | 0.2704259 |
grp.glucose150-200 | 0.6654695 | 0.2762230 | 2.4091751 | 0.0159886 |
grp.glucose200-300 | 0.4383726 | 0.2407938 | 1.8205310 | 0.0686782 |
grp.glucose50-100 | -0.1173846 | 0.2030818 | -0.5780163 | 0.5632531 |
Term | Estimate | Std. Error | z value | Pr(>\|z\|) |
---|---|---|---|---|
(Intercept) | -2.9509721 | 0.3590873 | -8.2179785 | 0.0000000 |
hypertension | 1.0908654 | 0.1729980 | 6.3056546 | 0.0000000 |
bmi | -0.0060480 | 0.0098366 | -0.6148442 | 0.5386576 |
smoking_statusnever smoked | -0.3562895 | 0.1813516 | -1.9646341 | 0.0494566 |
smoking_statussmokes | -0.1421640 | 0.2193225 | -0.6481961 | 0.5168581 |
smoking_statusUnknown | -0.9733116 | 0.2423512 | -4.0161211 | 0.0000592 |
grp.glucose150-200 | 0.9576615 | 0.2650586 | 3.6130175 | 0.0003027 |
grp.glucose200-300 | 1.1199193 | 0.2340747 | 4.7844529 | 0.0000017 |
grp.glucose50-100 | -0.0591308 | 0.1950295 | -0.3031889 | 0.7617459 |
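The z value and p-value columns in the coefficient tables follow directly from the estimate and its standard error. As a minimal sketch, assuming the usual Wald test with a two-sided normal p-value, the bmi row of the reduced model can be recomputed as follows (the function name is illustrative):

```python
import math


def wald_z_and_p(estimate, std_error):
    """Return the Wald z statistic and its two-sided normal p-value."""
    z = estimate / std_error
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for a standard normal.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# bmi row of the reduced model
z, p = wald_z_and_p(-0.0060480, 0.0098366)
print(f"z = {z:.4f}, p = {p:.4f}")
```

The result should agree with the table up to rounding of the printed inputs.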
Using the reduced model as the lower bound and the full model as the upper bound, we apply automatic variable selection to obtain our final model.
Term | Estimate | Std. Error | z value | Pr(>\|z\|) |
---|---|---|---|---|
(Intercept) | -7.3901581 | 0.6047444 | -12.2202996 | 0.0000000 |
hypertension | 0.4909446 | 0.1754564 | 2.7981009 | 0.0051404 |
bmi | 0.0041574 | 0.0117457 | 0.3539487 | 0.7233773 |
smoking_statusnever smoked | -0.0572202 | 0.1865683 | -0.3066985 | 0.7590729 |
smoking_statussmokes | 0.3421568 | 0.2289259 | 1.4946185 | 0.1350140 |
smoking_statusUnknown | -0.2527804 | 0.2449583 | -1.0319323 | 0.3021038 |
grp.glucose150-200 | 0.6523146 | 0.2758073 | 2.3651101 | 0.0180247 |
grp.glucose200-300 | 0.4535649 | 0.2395194 | 1.8936459 | 0.0582720 |
grp.glucose50-100 | -0.1111053 | 0.2023739 | -0.5490100 | 0.5829986 |
age | 0.0694848 | 0.0058696 | 11.8381005 | 0.0000000 |
heart_disease | 0.3848055 | 0.2045226 | 1.8814813 | 0.0599065 |
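To make the final model concrete, the sketch below turns its coefficients into a predicted stroke probability through the inverse logit. The patient is hypothetical and the function name is illustrative; only the coefficients come from the table above.

```python
import math

# Coefficients copied from the final model table.
COEF = {
    "(Intercept)": -7.3901581,
    "hypertension": 0.4909446,
    "bmi": 0.0041574,
    "smoking_statusnever smoked": -0.0572202,
    "smoking_statussmokes": 0.3421568,
    "smoking_statusUnknown": -0.2527804,
    "grp.glucose150-200": 0.6523146,
    "grp.glucose200-300": 0.4535649,
    "grp.glucose50-100": -0.1111053,
    "age": 0.0694848,
    "heart_disease": 0.3848055,
}


def predicted_probability(features):
    """Inverse logit of the linear predictor for one patient."""
    eta = COEF["(Intercept)"] + sum(
        COEF[name] * value for name, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical patient, not a record from the dataset: aged 70, with
# hypertension, no heart disease, BMI 28, never smoked, glucose 150-200.
patient = {
    "age": 70,
    "hypertension": 1,
    "heart_disease": 0,
    "bmi": 28,
    "smoking_statusnever smoked": 1,
    "grp.glucose150-200": 1,
}
probability = predicted_probability(patient)
print(f"predicted probability of stroke: {probability:.3f}")
```

Dummy variables that are omitted from the dictionary are treated as 0, which places the patient in the reference level of that factor.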
Besides the p-values of the individual coefficients, we also compare the models on a few global goodness-of-fit measures.
Model | Residual deviance | Null deviance | AIC |
---|---|---|---|
full.model | 1362.084 | 1728.386 | 1400.084 |
reduced.model | 1591.834 | 1728.386 | 1609.834 |
final.model | 1368.659 | 1728.386 | 1390.659 |
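The AIC values in this table can be checked by hand. For a logistic regression on individual binary outcomes, the residual deviance equals minus twice the log-likelihood, so AIC is the residual deviance plus twice the number of estimated coefficients. The coefficient counts below are the row counts of the three coefficient tables; the function name is illustrative.

```python
# (residual deviance, number of estimated coefficients) for each model
MODELS = {
    "full.model": (1362.084, 19),
    "reduced.model": (1591.834, 9),
    "final.model": (1368.659, 11),
}


def aic(residual_deviance, n_coefficients):
    """AIC for a binary logistic regression: deviance + 2k."""
    return residual_deviance + 2 * n_coefficients


aic_by_model = {name: aic(dev, k) for name, (dev, k) in MODELS.items()}
best = min(aic_by_model, key=aic_by_model.get)

for name, value in aic_by_model.items():
    print(f"{name}: AIC = {value:.3f}")
print("lowest AIC:", best)
```

The final model has the lowest AIC of the three, even though the full model has a slightly lower residual deviance, because the full model pays a penalty for eight extra coefficients.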
Through automatic variable selection, the variables gender, ever_married, work_type, and Residence_type were removed from the final model. The final model retains bmi and smoking_status even though neither was statistically significant, because both are clinically important. We will use odds ratios to interpret the model.
Term | Estimate | Std. Error | z value | Pr(>\|z\|) | Odds ratio |
---|---|---|---|---|---|
(Intercept) | -7.3901581 | 0.6047444 | -12.2202996 | 0.0000000 | 0.0006173 |
hypertension | 0.4909446 | 0.1754564 | 2.7981009 | 0.0051404 | 1.6338589 |
bmi | 0.0041574 | 0.0117457 | 0.3539487 | 0.7233773 | 1.0041660 |
smoking_statusnever smoked | -0.0572202 | 0.1865683 | -0.3066985 | 0.7590729 | 0.9443861 |
smoking_statussmokes | 0.3421568 | 0.2289259 | 1.4946185 | 0.1350140 | 1.4079811 |
smoking_statusUnknown | -0.2527804 | 0.2449583 | -1.0319323 | 0.3021038 | 0.7766384 |
grp.glucose150-200 | 0.6523146 | 0.2758073 | 2.3651101 | 0.0180247 | 1.9199797 |
grp.glucose200-300 | 0.4535649 | 0.2395194 | 1.8936459 | 0.0582720 | 1.5739130 |
grp.glucose50-100 | -0.1111053 | 0.2023739 | -0.5490100 | 0.5829986 | 0.8948445 |
age | 0.0694848 | 0.0058696 | 11.8381005 | 0.0000000 | 1.0719558 |
heart_disease | 0.3848055 | 0.2045226 | 1.8814813 | 0.0599065 | 1.4693285 |
Looking at the odds ratios for the groups in the variable grp.glucose, we find that (holding the other variables fixed) a person with an average blood glucose level of 150-200 has about 1.9 times the odds of a stroke of someone in the reference glucose group. The reference is the level with no row in the table, which from the three levels shown appears to be 100-150 rather than 0-50. In general, the odds of a stroke are higher in the two groups above 150 (odds ratios 1.92 and 1.57) than in the 50-100 group (0.89), although the increase is not monotone and only the 150-200 estimate is significant at the 0.05 level.
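Each odds ratio in the table is the exponential of the corresponding estimate. The sketch below recomputes two of them and adds a 95% Wald interval; the interval is an addition for illustration and is not reported in the tables, and the function name is hypothetical.

```python
import math


def odds_ratio_with_ci(estimate, std_error, z_crit=1.96):
    """Return (odds ratio, lower bound, upper bound) for a Wald interval."""
    return (
        math.exp(estimate),
        math.exp(estimate - z_crit * std_error),
        math.exp(estimate + z_crit * std_error),
    )


# Estimates and standard errors copied from the final model table.
hypertension = odds_ratio_with_ci(0.4909446, 0.1754564)
glucose_150_200 = odds_ratio_with_ci(0.6523146, 0.2758073)

for label, (ratio, low, high) in [
    ("hypertension", hypertension),
    ("grp.glucose150-200", glucose_150_200),
]:
    print(f"{label}: OR = {ratio:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

An interval that excludes 1 corresponds to a coefficient that is significant at roughly the 0.05 level, which matches the p-values of these two rows.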
In this report, we examined the association between several possible risk factors and stroke. The dataset that we chose included 10 explanatory variables, 3 of which were continuous and the rest categorical. After some initial exploratory analysis, we chose to discretize avg_glucose_level because of its bimodal distribution.
Because smoking status, hypertension, average glucose level, and BMI are known risk factors for stroke, we included these clinically important variables in our model regardless of statistical significance. Our final model has the explanatory variables hypertension, age, heart_disease, bmi, smoking_status (three dummy variables), and grp.glucose (three dummy variables). The variables smoking_status and bmi were not statistically significant, but we kept them in the model because of their clinical importance.