This data set contains medical and demographic data for 100,000 individuals, together with whether each individual has diabetes. It has 9 variables. Five are categorical: the patient’s gender, whether the patient has hypertension (a medical condition that causes consistently elevated blood pressure), whether the patient has heart disease, the patient’s smoking history, and whether the patient has diabetes. Four are numeric: the patient’s age, BMI (body mass index), HbA1c level (a measure of blood sugar), and blood glucose level. The 8 predictor variables are all either risk factors for diabetes or associated with it.
This data set was intended to assist in building predictive models and machine learning algorithms that predict whether a patient has diabetes. We will build a multiple logistic regression model using the 8 predictor variables to estimate the probability that an individual has diabetes. From a practical perspective, building this model serves two functions. First, through coefficient analysis we can determine which variables have the most influence on whether a patient develops diabetes. Second, the completed model can serve as a tool for diagnosing diabetes and for determining whether patients are at risk of developing it.
Based on an initial visual inspection of the data set, most of the variables seem appropriate to use in model building. The one exception is smoking_history, a categorical variable whose categories would make it hard to interpret in a model. “ever” makes little sense as a category and is most likely a spelling mistake for “never”. “not current” is vague and could mean either “never” or “former”. The category for missing data, “No Info”, is the largest, with over 35% of the observations. Because “not current” obscures any meaningful analysis of the effect of being a former smoker, we will modify this variable to focus solely on current smokers versus non-current smokers: all categories except “current” will be combined into “not current”. To deal with the large number of “No Info” observations, we will use single imputation based on a logistic model that predicts smoking_history from the rest of the predictor variables. Using this model we obtain, for each missing observation, the predicted probability that its smoking status is “current”, and we then draw the imputed smoking status at random with that probability.
## # A tibble: 6 × 2
## smoking_history n
## <chr> <int>
## 1 No Info 35816
## 2 current 9286
## 3 ever 4004
## 4 former 9352
## 5 never 35095
## 6 not current 6447
## Warning in rbinom(length(probsNA), size = 1, prob = probsNA): NAs produced
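The recoding and single-imputation steps described above can be sketched as follows. This is a minimal illustration on simulated data, not the code used for this report: the data frame `toy`, the column `smoker`, and the choice of `age` and `bmi` as the only predictors are all assumptions made for the example.

```r
set.seed(1)

# Toy stand-in for the real data (every value here is simulated)
n <- 500
toy <- data.frame(
  age = runif(n, 18, 80),
  bmi = rnorm(n, 27, 5),
  smoking_history = sample(
    c("No Info", "current", "ever", "former", "never", "not current"),
    n, replace = TRUE
  ),
  stringsAsFactors = FALSE
)

# 1 = current smoker, 0 = any other known category, NA = "No Info"
toy$smoker <- ifelse(toy$smoking_history == "No Info", NA,
                     as.integer(toy$smoking_history == "current"))

# Logistic model for smoking status, fitted on the rows where it is known
known <- !is.na(toy$smoker)
imp_fit <- glm(smoker ~ age + bmi, family = binomial, data = toy[known, ])

# Predicted probability of "current" for each missing row
p_missing <- predict(imp_fit, newdata = toy[!known, ], type = "response")

# Single imputation: one Bernoulli draw per missing row
toy$smoker[!known] <- rbinom(length(p_missing), size = 1, prob = p_missing)
```

`rbinom` emits the warning "NAs produced" when any entry of `prob` is `NA`, so checking `anyNA(p_missing)` before drawing is a cheap safeguard against leaving some observations unimputed.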
Observing the correlation matrix plot, there is little concern about
multicollinearity among the numeric variables. Looking at the density
plots, the distributions of all the numeric variables depart from
normality: age is spread fairly evenly across its range, bmi and
blood_glucose_level are right skewed, and hba1c is bimodal and right
skewed. To correct this without invalidating our association analysis we
will discretize these variables.
The following graphs will be used to evenly discretize the data.
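Discretizing a numeric variable in R is usually done with `cut()`. The sketch below uses break points suggested by the group labels that appear in the coefficient tables of this report (20-39, 40-59, 60-80 for age; 20-30, 30-40, 40+ for BMI). The labels for the lowest groups (`"0-19"` and `"<20"`) and the handling of boundary values are assumptions, and the input values are made up.

```r
# Made-up values; in the report these would be the age and bmi columns
age <- c(5, 19, 20, 39.5, 40, 59, 60, 80)
bmi <- c(15.2, 20, 29.9, 30, 39.9, 40, 55)

# right = FALSE makes each interval closed on the left: [20, 40), [40, 60), ...
grp.age <- cut(age, breaks = c(0, 20, 40, 60, Inf),
               labels = c("0-19", "20-39", "40-59", "60-80"), right = FALSE)
grp.bmi <- cut(bmi, breaks = c(0, 20, 30, 40, Inf),
               labels = c("<20", "20-30", "30-40", "40+"), right = FALSE)

# Group sizes, to check how evenly the cut points split the data
table(grp.age)
table(grp.bmi)
```

`cut()` returns an unordered factor by default, and `glm()` treats the first level of such a factor as the reference category, so the order of `labels` determines which group the other coefficients are compared against.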
After fitting the complete logistic regression model, and before using an automatic variable selection algorithm, we can already eliminate the variables HbA1c_level and blood_glucose_level from the model based on the extremely large p-values for all of their dummy variables. Beyond those two variables, it is unlikely, judging by the p-values alone, that a stepwise variable selection algorithm will remove any of the remaining variables, as each of them has at least one dummy variable with a near-zero p-value.
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -43.0031695 | 225.9967989 | -0.1902822 | 0.8490880 |
| genderMale | 0.2997341 | 0.0336543 | 8.9062750 | 0.0000000 |
| hypertension1 | 0.8009436 | 0.0430654 | 18.5982966 | 0.0000000 |
| heart_disease1 | 0.8535046 | 0.0551040 | 15.4889728 | 0.0000000 |
| smoking_historycurrent | -0.2692728 | 0.0371861 | -7.2412213 | 0.0000000 |
| grp.age60-80 | 3.0269617 | 0.1323684 | 22.8677091 | 0.0000000 |
| grp.age40-59 | 2.1997468 | 0.1331250 | 16.5239162 | 0.0000000 |
| grp.age20-39 | 0.9500167 | 0.1404143 | 6.7658116 | 0.0000000 |
| grp.bmi20-30 | 0.5736295 | 0.1277435 | 4.4904779 | 0.0000071 |
| grp.bmi30-40 | 1.4432143 | 0.1295781 | 11.1377977 | 0.0000000 |
| grp.bmi40+ | 2.3192369 | 0.1363241 | 17.0126739 | 0.0000000 |
| grp.HbA1c_level5.5-7 | 18.8281594 | 144.7189447 | 0.1301016 | 0.8964861 |
| grp.HbA1c_level7.5+ | 42.3625428 | 475.9488203 | 0.0890065 | 0.9290767 |
| grp.blood_glucose_level126-160 | 18.1085257 | 173.5827135 | 0.1043222 | 0.9169137 |
| grp.blood_glucose_level200+ | 19.6751416 | 173.5827139 | 0.1133474 | 0.9097552 |
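Very large estimates paired with even larger standard errors, plus the `glm.fit` warning about fitted probabilities of 0 or 1, are what logistic regression typically produces when a predictor separates the outcome perfectly or almost perfectly. Whether that is what happened in the table above cannot be confirmed from the output alone; the toy example below only reproduces the pattern on made-up data.

```r
# Made-up data in which x >= 5 perfectly predicts y = 1 (complete separation)
x <- 1:8
y <- c(0, 0, 0, 0, 1, 1, 1, 1)

# Collect the warnings instead of printing them
fit_warnings <- character(0)
sep_fit <- withCallingHandlers(
  glm(y ~ x, family = binomial),
  warning = function(w) {
    fit_warnings <<- c(fit_warnings, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

fit_warnings
coef(summary(sep_fit))
```

In this situation the Wald z statistics shrink toward zero because the standard errors blow up, so a large p-value does not by itself mean that the predictor is unimportant.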
As expected, the backward AIC variable selection algorithm fails to remove any variables.
## Start: AIC=46776.13
## diabetes ~ gender + hypertension + heart_disease + smoking_history +
## grp.age + grp.bmi
##
## Df Deviance AIC
## <none> 46754 46776
## - smoking_history 1 46837 46857
## - gender 1 46893 46913
## - heart_disease 1 47176 47196
## - hypertension 1 47381 47401
## - grp.bmi 3 48849 48865
## - grp.age 3 50684 50700
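Output of this shape is produced by `step()` with backward elimination. A minimal sketch on simulated data follows; the data frame `sim` and its columns are hypothetical and stand in for the real predictors.

```r
set.seed(42)

# Simulated data: y depends on x1 and x2, while noise is unrelated to y
n <- 400
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * sim$x1 - 0.8 * sim$x2))

start_fit <- glm(y ~ x1 + x2 + noise, family = binomial, data = sim)

# Backward elimination: a term is dropped only if removing it lowers the AIC
selected <- step(start_fit, direction = "backward", trace = 0)

formula(selected)
AIC(start_fit, selected)
```

When `<none>` is the first row of the printed table, as in the output above, removing any single term would raise the AIC, so the procedure stops without dropping anything.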
Comparing the full model with the reduced model (which omits blood_glucose_level and HbA1c_level) shows that the full model actually has a much lower AIC and residual deviance. The reduction in model complexity therefore does not justify the loss of fit, and our final model will be the complete model.
| | Residual deviance | Null deviance | AIC |
|---|---|---|---|
| full.model | 24909.63 | 58159.68 | 24939.63 |
| reduced.model | 46754.13 | 58159.68 | 46776.13 |
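The numbers in this table can be checked by hand. For a logistic regression on individual 0/1 outcomes, the AIC equals the residual deviance plus twice the number of estimated coefficients. The coefficient counts used below (15 and 11) are read off the full coefficient table and the reduced model formula. The likelihood-ratio comparison at the end is an addition for illustration and was not part of the original analysis.

```r
# Residual deviances copied from the comparison table
dev_full    <- 24909.63
dev_reduced <- 46754.13

# Number of estimated coefficients, including the intercept
k_full    <- 15
k_reduced <- 11

# AIC = residual deviance + 2 * k
aic_full    <- dev_full + 2 * k_full         # 24939.63
aic_reduced <- dev_reduced + 2 * k_reduced   # 46776.13

# Likelihood-ratio statistic for the four extra coefficients
lr_stat <- dev_reduced - dev_full            # 21844.5
p_value <- pchisq(lr_stat, df = k_full - k_reduced, lower.tail = FALSE)
```

Both computed AIC values match the table, and a drop in deviance of about 21,844 for four additional coefficients is far larger than the penalty of 8 that AIC charges for them.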
| | Estimate | Std. Error | z value | Pr(>\|z\|) | odds.ratio |
|---|---|---|---|---|---|
| (Intercept) | -43.0032 | 225.9968 | -0.1903 | 0.8491 | 0.0000 |
| genderMale | 0.2997 | 0.0337 | 8.9063 | 0.0000 | 1.3495 |
| hypertension1 | 0.8009 | 0.0431 | 18.5983 | 0.0000 | 2.2276 |
| heart_disease1 | 0.8535 | 0.0551 | 15.4890 | 0.0000 | 2.3479 |
| smoking_historycurrent | -0.2693 | 0.0372 | -7.2412 | 0.0000 | 0.7639 |
| grp.age60-80 | 3.0270 | 0.1324 | 22.8677 | 0.0000 | 20.6344 |
| grp.age40-59 | 2.1997 | 0.1331 | 16.5239 | 0.0000 | 9.0227 |
| grp.age20-39 | 0.9500 | 0.1404 | 6.7658 | 0.0000 | 2.5858 |
| grp.bmi20-30 | 0.5736 | 0.1277 | 4.4905 | 0.0000 | 1.7747 |
| grp.bmi30-40 | 1.4432 | 0.1296 | 11.1378 | 0.0000 | 4.2343 |
| grp.bmi40+ | 2.3192 | 0.1363 | 17.0127 | 0.0000 | 10.1679 |
| grp.HbA1c_level5.5-7 | 18.8282 | 144.7189 | 0.1301 | 0.8965 | 150302338.4585 |
| grp.HbA1c_level7.5+ | 42.3625 | 475.9488 | 0.0890 | 0.9291 | 2499300963049995776.0000 |
| grp.blood_glucose_level126-160 | 18.1085 | 173.5827 | 0.1043 | 0.9169 | 73186805.0548 |
| grp.blood_glucose_level200+ | 19.6751 | 173.5827 | 0.1133 | 0.9098 | 350594748.4398 |
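The odds.ratio column is the exponential of the Estimate column. The sketch below reproduces it for four of the binary predictors, using estimates and standard errors copied from the first coefficient table, and adds Wald 95% intervals. The intervals are an illustrative addition and were not part of the original output.

```r
# Estimates and standard errors copied from the coefficient table
est <- c(genderMale = 0.2997341, hypertension1 = 0.8009436,
         heart_disease1 = 0.8535046, smoking_historycurrent = -0.2692728)
se  <- c(0.0336543, 0.0430654, 0.0551040, 0.0371861)

# Odds ratio = exp(log-odds coefficient)
odds_ratio <- exp(est)

# Wald 95% interval, built on the log-odds scale and then exponentiated
ci_lower <- exp(est - qnorm(0.975) * se)
ci_upper <- exp(est + qnorm(0.975) * se)

round(cbind(odds_ratio, ci_lower, ci_upper), 4)
```

An odds ratio is a ratio of odds, not of probabilities. The two are close only when the outcome is rare in both groups being compared.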
Our final model includes all 8 initial explanatory variables. The reference levels for all variables are the following.
Looking at the odds ratios, the most striking result is the extremely large ratios for the higher tiers of HbA1c and blood glucose relative to their base levels. These estimates explain why the two variables contribute so much to the model's goodness of fit in spite of their large p-values. The very large standard errors, together with the glm.fit warning that fitted probabilities of 0 or 1 occurred, are the pattern typically produced by (quasi-)complete separation, in which case the Wald p-values and the exact size of these odds ratios are unreliable. So while it is hard to say how well these estimates reflect the overall population, at least in this data set high blood glucose and HbA1c are by far the most important factors in determining the probability of a diabetes diagnosis. The odds ratios for the age and bmi dummy variables show that the odds of a diabetes diagnosis rise steeply with each step up in age group and BMI group. Being diagnosed with hypertension or heart disease each more than doubles the odds of a diabetes diagnosis, and males have 35% higher odds than females. One somewhat surprising result is that being a current smoker is associated with lower odds of a diabetes diagnosis (odds ratio 0.76). Smoking is not exactly thought of as a healthy pastime, and since over a third of the smoking values were imputed, this association should not be read as a benefit of smoking.