1 Data Set Overview

This data set contains medical and demographic data for 100,000 individuals, along with whether each individual has diabetes. It has 9 variables. Five are categorical: the patient’s gender, whether the patient has hypertension (a medical condition that causes consistently elevated blood pressure), whether the patient has heart disease, the patient’s smoking history, and whether the patient has diabetes. Four are numeric: the patient’s age, BMI (body mass index), HbA1c level (a measure of blood sugar), and blood glucose level. The 8 predictor variables are all either risk factors for diabetes or associated with it.

2 Goals of Analysis

This data set was intended to assist in building predictive models and machine learning algorithms that predict whether a patient has diabetes. We will build a multiple logistic regression model using the 8 predictor variables to estimate the probability that an individual has diabetes. From a practical perspective, building this model serves two functions. First, through coefficient analysis we can determine which variables have the most influence on whether a patient develops diabetes. Second, the completed model will serve as a tool for diagnosing diabetes and for determining whether patients are at risk of developing it.

3 Initial Variable Inspection

Based on an initial visual inspection of the data set, most of the variables seem appropriate to use in model building. The one exception is smoking_history, a categorical variable whose categories have several issues that would make it hard to interpret in a model. “ever” is a category that makes little sense and is most likely a spelling mistake for “never”. “not current” is a vague category that could imply either “never” or “former”. The category for missing data, “No Info”, is the largest category, with over 35% of the observations.

Because “not current” obscures any meaningful analysis of the effects of being a former smoker, we will modify this variable to focus solely on the effect of being a current smoker versus a non-current smoker. All categories except “current” will be combined with “not current”. To deal with the large number of “No Info” observations, we will use single imputation based on a logistic model that predicts smoking_history from the rest of the predictor variables. Using this model we can obtain, for each missing observation, the predicted probability of being “current”. Then, using this predicted probability, we can generate a predicted value for the smoking variable.

## # A tibble: 6 × 2
##   smoking_history     n
##   <chr>           <int>
## 1 No Info         35816
## 2 current          9286
## 3 ever             4004
## 4 former           9352
## 5 never           35095
## 6 not current      6447
## Warning in rbinom(length(probsNA), size = 1, prob = probsNA): NAs produced
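
The recoding and single-imputation steps described above can be sketched as follows. This is a minimal illustration in Python rather than the code used for the report: the function names and the example predicted probabilities are hypothetical, and the logistic model that would supply those probabilities is not shown. The warning in the output above suggests that some predicted probabilities were missing or invalid, so the sketch leaves such rows unimputed instead of filling them silently.

```python
import random

def recode_smoking(category):
    """Collapse the six raw categories to 'current', 'not current', or None (missing)."""
    if category == "No Info":
        return None
    return "current" if category == "current" else "not current"

def impute_smoker(prob_current, rng):
    """Draw one Bernoulli value from a predicted probability of being a current smoker.

    Returns None when the probability itself is missing, so the gap stays visible.
    """
    if prob_current is None:
        return None
    return "current" if rng.random() < prob_current else "not current"

rng = random.Random(42)  # fixed seed so the imputation is reproducible
raw = ["never", "No Info", "current", "ever", "No Info", "former"]

# Hypothetical predicted probabilities for the two "No Info" rows (by index);
# the second one is missing to mimic the situation behind the warning.
predicted = {1: 0.12, 4: None}

recoded = [recode_smoking(c) for c in raw]
completed = [
    value if value is not None else impute_smoker(predicted[i], rng)
    for i, value in enumerate(recoded)
]
print(completed)
```

Any row that is still missing after this step would need to be handled explicitly, for example by dropping it or by checking why its predicted probability could not be computed.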

Observing the correlation matrix plot, there is little concern about multicollinearity among the numeric variables. Looking at the density plots, the distributions of all the numeric variables have issues with normality: age is more continuously distributed, bmi and blood_glucose_level are right skewed, and HbA1c_level is bimodal and right skewed. To correct this without invalidating our association analysis, we will discretize these variables.

The following graphs will be used to discretize the data evenly.
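
As a rough sketch of the discretization step, the snippet below assigns a numeric value to a labelled group. The age and BMI group labels are taken from the levels that appear in the model output later in the report; treating each group as including its lower edge is an assumption, and the helper name `discretize` is hypothetical.

```python
from bisect import bisect_right

def discretize(value, lower_edges, labels):
    """Return the label of the bin whose lower edge is the largest one <= value."""
    idx = bisect_right(lower_edges, value) - 1
    if idx < 0:
        raise ValueError(f"{value} is below the first bin edge")
    return labels[idx]

# Lower edges of each group (assumed lower-inclusive); the last group is open-ended here.
AGE_EDGES = [0, 20, 40, 60]
AGE_LABELS = ["0-19", "20-39", "40-59", "60-80"]
BMI_EDGES = [10, 20, 30, 40]
BMI_LABELS = ["10-20", "20-30", "30-40", "40+"]

print(discretize(37, AGE_EDGES, AGE_LABELS))
print(discretize(27.3, BMI_EDGES, BMI_LABELS))
```

The sketch does not check an upper bound, so a value above the last stated range (for example an age over 80) would still fall into the final group.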

4 Building the Logistic Model

After fitting the complete logistic regression model, and before using an automatic variable selection algorithm, we can already eliminate the variables HbA1c_level and blood_glucose_level from the model based on the extremely large p-values of all their associated dummy variables. Apart from those two variables, judging simply by the p-values it is unlikely that a stepwise variable selection algorithm will remove any of the remaining variables, as each of them has at least one dummy variable with a near-zero p-value.

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Full model coefficients
Term                               Estimate   Std. Error     z value   Pr(>|z|)
(Intercept)                     -43.0031695  225.9967989  -0.1902822  0.8490880
genderMale                        0.2997341    0.0336543   8.9062750  0.0000000
hypertension1                     0.8009436    0.0430654  18.5982966  0.0000000
heart_disease1                    0.8535046    0.0551040  15.4889728  0.0000000
smoking_historycurrent           -0.2692728    0.0371861  -7.2412213  0.0000000
grp.age60-80                      3.0269617    0.1323684  22.8677091  0.0000000
grp.age40-59                      2.1997468    0.1331250  16.5239162  0.0000000
grp.age20-39                      0.9500167    0.1404143   6.7658116  0.0000000
grp.bmi20-30                      0.5736295    0.1277435   4.4904779  0.0000071
grp.bmi30-40                      1.4432143    0.1295781  11.1377977  0.0000000
grp.bmi40+                        2.3192369    0.1363241  17.0126739  0.0000000
grp.HbA1c_level5.5-7             18.8281594  144.7189447   0.1301016  0.8964861
grp.HbA1c_level7.5+              42.3625428  475.9488203   0.0890065  0.9290767
grp.blood_glucose_level126-160   18.1085257  173.5827135   0.1043222  0.9169137
grp.blood_glucose_level200+      19.6751416  173.5827139   0.1133474  0.9097552

As expected, the backward AIC variable selection algorithm fails to remove any variables.

## Start:  AIC=46776.13
## diabetes ~ gender + hypertension + heart_disease + smoking_history + 
##     grp.age + grp.bmi
## 
##                   Df Deviance   AIC
## <none>                  46754 46776
## - smoking_history  1    46837 46857
## - gender           1    46893 46913
## - heart_disease    1    47176 47196
## - hypertension     1    47381 47401
## - grp.bmi          3    48849 48865
## - grp.age          3    50684 50700
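
The decision made at this step can be sketched in a few lines. The AIC values are copied from the output above; the function `backward_step` is a hypothetical illustration of the rule "drop a term only if doing so lowers the AIC", not the selection routine used for the report.

```python
def backward_step(current_aic, aic_if_dropped):
    """Return the term whose removal gives the lowest AIC, or None if no removal helps."""
    best_term = min(aic_if_dropped, key=aic_if_dropped.get)
    if aic_if_dropped[best_term] < current_aic:
        return best_term
    return None

current_aic = 46776.13  # the "<none>" row: AIC with every term kept
aic_if_dropped = {
    "smoking_history": 46857,
    "gender": 46913,
    "heart_disease": 47196,
    "hypertension": 47401,
    "grp.bmi": 48865,
    "grp.age": 50700,
}

print(backward_step(current_aic, aic_if_dropped))  # None: every removal raises the AIC
```

Because even the cheapest removal (smoking_history) raises the AIC, the search stops at its starting model.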

5 Model Comparison

Comparing the full model with the reduced model (which omits blood_glucose_level and HbA1c_level) shows that the full model actually has much better AIC and deviance values. This indicates that the reduction in model complexity does not justify the loss of fit. Our final model will therefore be the complete model.

Comparison of global goodness-of-fit statistics
Model          Deviance.residual  Null.Deviance.Residual       AIC
full.model              24909.63                58159.68  24939.63
reduced.model           46754.13                58159.68  46776.13
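
The AIC values in this table can be cross-checked against the deviances. The sketch below assumes the models were fitted to individual 0/1 outcomes, in which case the residual deviance equals minus twice the log-likelihood and AIC = deviance + 2k, where k is the number of estimated coefficients. The function name is hypothetical; the coefficient counts are read off the coefficient tables in this report.

```python
def aic_from_deviance(residual_deviance, n_coefficients):
    """AIC for a binary logistic model fitted to ungrouped 0/1 outcomes.

    In that setting the residual deviance equals -2 * log-likelihood,
    so AIC = deviance + 2 * (number of estimated coefficients).
    """
    return residual_deviance + 2 * n_coefficients

# The full model has 15 coefficient rows (including the intercept); the reduced
# model drops the 4 HbA1c and blood glucose dummies, leaving 11.
full_aic = aic_from_deviance(24909.63, 15)
reduced_aic = aic_from_deviance(46754.13, 11)

print(round(full_aic, 2), round(reduced_aic, 2))
print(round(reduced_aic - full_aic, 2))
```

Both results agree with the AIC column, and the gap between the two models is far larger than the 8-unit penalty for the four extra coefficients.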

6 Coefficient Interpretation

Summary Stats with Odds Ratios
Term                            Estimate  Std. Error  z value  Pr(>|z|)                odds.ratio
(Intercept)                     -43.0032    225.9968  -0.1903    0.8491                    0.0000
genderMale                        0.2997      0.0337   8.9063    0.0000                    1.3495
hypertension1                     0.8009      0.0431  18.5983    0.0000                    2.2276
heart_disease1                    0.8535      0.0551  15.4890    0.0000                    2.3479
smoking_historycurrent           -0.2693      0.0372  -7.2412    0.0000                    0.7639
grp.age60-80                      3.0270      0.1324  22.8677    0.0000                   20.6344
grp.age40-59                      2.1997      0.1331  16.5239    0.0000                    9.0227
grp.age20-39                      0.9500      0.1404   6.7658    0.0000                    2.5858
grp.bmi20-30                      0.5736      0.1277   4.4905    0.0000                    1.7747
grp.bmi30-40                      1.4432      0.1296  11.1378    0.0000                    4.2343
grp.bmi40+                        2.3192      0.1363  17.0127    0.0000                   10.1679
grp.HbA1c_level5.5-7             18.8282    144.7189   0.1301    0.8965            150302338.4585
grp.HbA1c_level7.5+              42.3625    475.9488   0.0890    0.9291  2499300963049995776.0000
grp.blood_glucose_level126-160   18.1085    173.5827   0.1043    0.9169             73186805.0548
grp.blood_glucose_level200+      19.6751    173.5827   0.1133    0.9098            350594748.4398

Our final model includes all 8 initial explanatory variables. The reference levels for the variables are the following.

  • gender = Female
  • hypertension = 0 (no)
  • heart_disease = 0 (no)
  • smoking_history = not current
  • grp.age = 0-19
  • grp.bmi = 10-20
  • grp.HbA1c_level = 3.5-5
  • grp.blood_glucose_level = 80-100

Looking at the odds ratios, the most striking results are the enormous ratios for the higher tiers of HbA1c and blood glucose level compared to their reference levels. These dramatic estimates explain why the two variables contributed so much to the model’s goodness of fit in spite of their low statistical significance. The very large estimates and standard errors, together with the glm.fit warning that fitted probabilities were numerically 0 or 1, are typical of (quasi-)complete separation, in which case Wald p-values are unreliable. While this makes it hard to say whether the results reflect the overall population, at least in this data set high blood glucose and HbA1c are by far the most important factors in determining the probability of a diabetes diagnosis.

The odds ratios for the age and BMI dummy variables show that the odds of being diagnosed with diabetes climb steeply with each step up in age group and BMI group. Being diagnosed with hypertension or heart disease each more than doubles the odds of a diabetes diagnosis. Males have 35% higher odds of a diabetes diagnosis than females. One somewhat surprising result is that being a current smoker is associated with lower odds of a diabetes diagnosis (odds ratio 0.76). Smoking is not exactly thought of as a healthy pastime, but in this model the numbers show at least one apparent benefit.
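
The odds ratios in the table are the exponentiated estimates, and an odds ratio multiplies the odds rather than the probability. The sketch below illustrates both points using two estimates from the full model. The baseline probabilities are made-up values chosen only for illustration, and the helper names are hypothetical.

```python
import math

def odds_ratio(coefficient):
    """Odds ratio implied by a logistic regression coefficient."""
    return math.exp(coefficient)

def shift_probability(baseline_prob, ratio):
    """Probability after multiplying the baseline odds by an odds ratio."""
    odds = baseline_prob / (1 - baseline_prob)
    new_odds = odds * ratio
    return new_odds / (1 + new_odds)

or_male = odds_ratio(0.2997341)          # genderMale estimate
or_hypertension = odds_ratio(0.8009436)  # hypertension1 estimate
print(round(or_male, 4), round(or_hypertension, 4))

# Hypothetical baseline probabilities: the same odds ratio changes the
# probability by different amounts depending on where the baseline sits.
for baseline in (0.05, 0.50):
    print(baseline, round(shift_probability(baseline, or_male), 4))
```

The relative change in probability is close to the odds ratio only when the baseline probability is small, which is why the interpretation is stated in terms of odds.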