Introduction

Diabetes is a global health challenge that affects millions of people worldwide, characterized by the body’s inability to effectively regulate blood sugar levels. This chronic metabolic disorder can lead to severe health complications, including heart disease, kidney failure, stroke, and nerve damage. According to the World Health Organization, diabetes has become a rapidly growing health concern, with global prevalence nearly quadrupling from 108 million in 1980 to 422 million in 2014 (National Institutes of Health).

Data science plays a huge role in healthcare through early detection and personalized risk assessment. By analyzing complex health data sets, researchers can identify subtle patterns and predictors that might indicate an individual’s likelihood of developing diabetes. This project leverages three distinct data sets representing different populations to develop predictive models that could ultimately help healthcare professionals devise preventive interventions more effectively.

This project aims to demonstrate how statistical modeling techniques can be applied to health data to predict diabetes risk by examining various demographic, physiological, and lifestyle factors.

Diabetes Pima

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage. (Kaggle)

It consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

library(dplyr)

pima_data <- read.csv("DiabetesPima.csv")
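Before cleaning, it is worth taking a quick structural look at the file (a minimal sketch using base R only):

str(pima_data)          # column types and dimensions
summary(pima_data$BMI)  # zero values here stand in for missing measurements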

We filter out rows where BMI equals zero (a placeholder for missing measurements) to keep the data set clean.

pima_clean <- pima_data %>% 
  filter(BMI != 0)

Plot 1 - Density Plots of Predictors by Outcome

library(reshape2)
library(ggplot2)

data_melt <- melt(pima_clean, id.vars = "Outcome")

ggplot(data_melt, aes(x = value, fill = as.factor(Outcome))) +
  geom_density(alpha = 0.5) +
  facet_wrap(~variable, scales = "free") +
  labs(title = "Density Plots of Predictors by Outcome",
       x = "Value",
       y = "Density",
       fill = "Outcome")

Based on the graphs, as a starting point we could include the following predictors in the logistic regression model:

  • The glucose distribution shows a clear bimodal pattern, with a distinct peak for the non-diabetic group at lower glucose levels and another peak for the diabetic group at higher glucose levels. This suggests that glucose level is a fundamental factor in determining diabetes status. Higher glucose levels are strongly associated with the presence of diabetes, as diabetes is characterized by the body’s inability to effectively regulate blood sugar.(WHO)

  • The BMI distribution also exhibits a bimodal pattern, with the diabetic group skewed towards higher BMI values. This aligns with the known risk factor of obesity and being overweight, which are closely linked to the development of type 2 diabetes. Individuals with a higher BMI tend to have increased insulin resistance, a key contributor to diabetes.

  • The age distribution shows that the diabetic group tends to be older than the non-diabetic group. This is consistent with the epidemiological evidence that the risk of developing diabetes increases with age. As individuals grow older, various physiological changes can lead to decreased insulin sensitivity and impaired glucose regulation, putting them at higher risk of diabetes.

  • The insulin distribution reveals a right-skewed pattern, with the diabetic group displaying higher insulin levels. This is likely indicative of insulin resistance, a hallmark of type 2 diabetes. In this condition, the body’s cells become less responsive to insulin, leading to elevated insulin levels as the pancreas tries to compensate by producing more of the hormone.

  • The pregnancies distribution shows a distinct difference between the diabetic and non-diabetic groups, with the diabetic group having a higher number of pregnancies. This aligns with the known risk factor of gestational diabetes, where women develop diabetes during pregnancy. Pregnancies can stress the body’s glucose regulation system, and women who experience gestational diabetes are at an increased risk of developing type 2 diabetes later in life.

While the other variables (skin thickness, blood pressure) also show some distributional patterns, the positive and negative outcome groups have very similar distributions with large areas of overlap, so we assume they would not contribute much to the model.
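To complement the visual comparison, we can also look at group means of the candidate predictors in the cleaned data (a minimal sketch using dplyr, which is already loaded):

pima_clean %>%
  group_by(Outcome) %>%
  summarise(across(c(Glucose, BMI, Age, Insulin, Pregnancies), mean))  # mean of each predictor per group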

Draft Regression Model

model <- glm(Outcome ~ Pregnancies+Glucose+BMI+Insulin+Age, data = pima_clean, family = "binomial")
Coefficient    Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)     -8.9288      0.716     -12.470   < 2e-16    ***
Pregnancies      0.1069      0.0316      3.382   0.00072    ***
Glucose          0.0346      0.0036      9.584   < 2e-16    ***
BMI              0.0967      0.0147      6.564   5.25e-11   ***
Insulin         -0.001       0.0008     -1.190   0.23423
Age              0.0127      0.0092      1.383   0.16666
predicted_prob <- predict(model, type = "response")
cutoff <- 0.5
predicted_class <- ifelse(predicted_prob > cutoff, 1, 0)

conf_matrix <- table(Predicted = predicted_class, Actual = pima_clean$Outcome)
success_rate <- mean(predicted_class == pima_clean$Outcome)

After building the model with all five predictors, we can see that Insulin and Age have low significance. That could be because they are highly correlated with other variables used in the model: Age is highly correlated with Pregnancies, and Insulin is correlated with BMI and Glucose, as shown in the heat map below (Plot 2).
I calculated the success rate of this model and it is 0.7688; if we exclude those two variables, the success rate increases to 0.7715. More on that later.
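A more formal check of whether Insulin and Age can be dropped is a likelihood-ratio test between the nested models (a sketch; model is the five-predictor fit above):

reduced_model <- glm(Outcome ~ Pregnancies + Glucose + BMI, data = pima_clean, family = "binomial")
anova(reduced_model, model, test = "Chisq")  # a large p-value supports dropping Insulin and Age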

Plot 2 - Correlation Heatmap of Predictors

corr_matrix <- cor(pima_data[, -8], use = "complete.obs")

corr_melt <- melt(corr_matrix)
ggplot(corr_melt, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0) +
  labs(title = "Correlation Heatmap of Predictors",
       x = "",
       y = "") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Final Regression Model

model <- glm(Outcome ~ Pregnancies+Glucose+BMI, data = pima_clean, family = "binomial")

Log Odds Equation

\[ \text{Log-Odds} = -8.5057 + 0.133371 \cdot \text{Pregnancies} + 0.034209 \cdot \text{Glucose} + 0.092828 \cdot \text{BMI} \]

Odds Equation

\[ \text{Odds} = e^{-8.5057+ 0.133371 \cdot \text{Pregnancies} + 0.034209 \cdot \text{Glucose} + 0.092828 \cdot \text{BMI}} \]

  • Each additional pregnancy increases the log-odds of diabetes by 0.133371, or increases the odds by a factor of  \(e^{0.133371} \approx 1.14\) . This means the odds of being diabetic increase by 14% for each additional pregnancy.

  • Each unit increase in glucose level increases the log-odds of diabetes by 0.0342, or increases the odds by a factor of  \(e^{0.03421} \approx 1.03\) . This corresponds to a 3% increase in the odds of diabetes per unit increase in glucose.

  • Each unit increase in BMI increases the log-odds of diabetes by 0.0928, or increases the odds by a factor of  \(e^{0.0928} \approx 1.10\). This means the odds of being diabetic increase by 10% for every 1-unit increase in BMI.
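The same odds multipliers can be read directly from the fitted model instead of computing them by hand (values should match the equation above up to rounding):

exp(coef(model))     # odds ratios for each predictor
exp(confint(model))  # profile-likelihood 95% confidence intervals on the odds-ratio scale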

Model Evaluation

predicted_prob <- predict(model, type = "response")
predicted_class <- ifelse(predicted_prob > 0.5, 1, 0)

conf_matrix <- table(Predicted = predicted_class, Actual = pima_clean$Outcome)
success_rate <- mean(predicted_class == pima_clean$Outcome)

Confusion Matrix (cutoff=0.5):

Predicted \ Actual     0     1
0                    432   114
1                     59   152

Success rate is 0.7715 and it’s calculated by formula: \[ \text{Success Rate} = \frac{\text{True Positives} + \text{True Negatives}}{\text{Total Observations}} \]

We have 59 false positive cases; these are cases where the model incorrectly predicts someone as diabetic even though they are non-diabetic. False positives might lead to unnecessary stress and medical interventions, such as additional tests or treatments, which could be costly and anxiety-inducing for the patient.

On the other hand, we have 114 false negatives. These are cases where the model incorrectly predicts someone as non-diabetic even though they are diabetic. False negatives are more concerning because diabetic patients might go undiagnosed and miss critical early interventions. This could lead to severe health complications like cardiovascular issues, kidney disease, or vision problems.
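The balance between these two error types can be summarized with sensitivity and specificity computed from the confusion matrix above (a minimal sketch using the conf_matrix object):

TP <- conf_matrix["1", "1"]; TN <- conf_matrix["0", "0"]
FP <- conf_matrix["1", "0"]; FN <- conf_matrix["0", "1"]
sensitivity <- TP / (TP + FN)  # share of diabetic patients the model correctly flags
specificity <- TN / (TN + FP)  # share of non-diabetic patients the model correctly clears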

Confusion Matrix (cutoff=0.35):

While the cutoff of 0.5 provides a reasonable balance between false positives and false negatives, in medical contexts it is often beneficial, as mentioned previously, to prioritize minimizing false negatives over false positives because of the higher medical risk associated with undiagnosed diabetes.

predicted_prob <- predict(model, type = "response")
predicted_class_0_3 <- ifelse(predicted_prob > 0.35, 1, 0)

conf_matrix_0_3 <- table(Predicted = predicted_class_0_3, Actual = pima_clean$Outcome)
success_rate_0_3 <- mean(predicted_class_0_3 == pima_clean$Outcome)
Predicted \ Actual     0     1
0                    383    77
1                    108   189

In this case we get a success rate of 0.7556, which is lower than the 0.5-cutoff success rate, but we are willing to accept more false positives in exchange for fewer false negatives.
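To see how the cutoff shifts this trade-off, one can scan a grid of cutoffs and track accuracy and the number of false negatives (a minimal sketch):

cutoffs <- seq(0.2, 0.6, by = 0.05)
sapply(cutoffs, function(cut) {
  pred <- ifelse(predicted_prob > cut, 1, 0)
  c(cutoff = cut,
    accuracy = mean(pred == pima_clean$Outcome),
    false_negatives = sum(pred == 0 & pima_clean$Outcome == 1))
})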

Hypothesis Test for Success Rate at 0.35 Cutoff

The success rate of the null model is 0.651, which means that without any data or predictors we could achieve that rate by simply giving every patient a negative diagnosis. Let’s check whether our model is significantly better than the null model.

  • Null Hypothesis ( H_0 ): The model’s success rate is equal to the null success rate:  p = 0.651 

  • Alternative Hypothesis ( H_a ): The model’s success rate is better than the null success rate:  p > 0.651 

We use a one-sided z-test for proportions:

\[ z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0 (1 - p_0)}{n}}} \]

p_hat <- success_rate_0_3  # success rate at the 0.35 cutoff (0.7556)
p_0 <- 0.651
n <- nrow(pima_clean)
z <- (p_hat - p_0) / sqrt((p_0 * (1 - p_0)) / n)
p_value <- 1 - pnorm(z)

We get: \(z = 6.0386\) and \(p = 7.773265 \times 10^{-10}\)

The p-value is far below any reasonable significance level (for example \(\alpha = 0.05\) ).

We reject the null hypothesis: the model’s success rate of 75.56% is statistically significantly better than the baseline success rate of 65.1%.
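As a cross-check, the same one-sided test can be run with R’s built-in proportion test (continuity correction turned off so it matches the z-statistic above):

successes <- sum(predicted_class_0_3 == pima_clean$Outcome)
prop.test(successes, n = nrow(pima_clean), p = 0.651,
          alternative = "greater", correct = FALSE)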

At a cutoff of 0.35, the model focuses on reducing false negatives (failing to detect diabetes) while allowing for more false positives. This approach aligns well with medical priorities, as identifying potential diabetes cases for further testing is critical.

Diabetes Dataset 2019

Diabetes_2019 <- read.csv("diabetesTask2.csv")

This dataset was collected by Neha Prerna Tigga and Dr. Shruti Garg of the Department of Computer Science and Engineering, BIT Mesra, India, for diabetes research. In the study, a total of 952 participants aged 18 and above were selected, of which 580 are males and 372 are females. The participants were asked to answer a questionnaire prepared based on factors that could lead to diabetes. (Prediction of Type 2 Diabetes using Machine Learning Classification Methods)

We will use this data set to build a logistic regression model to predict diabetes diagnoses. Since this data set is quite different from the previous one, we start by analyzing all of the available variables and selecting the best predictors for the model.

Diabetes_2019_clean <- Diabetes_2019 %>%
  filter(trimws(Diabetic) != "" & !is.na(BMI)) %>%        # drop rows with a blank diagnosis or missing BMI
  mutate(
    Diabetic = ifelse(trimws(Diabetic) == "yes", 1, 0),   # encode the outcome as 0/1
    BMI = as.numeric(BMI),
    Sleep = as.numeric(Sleep),
    # the raw column is spelled "Pregancies" in the source file; treat missing values as 0
    Pregnancies = ifelse(is.na(Pregancies), 0, Pregancies)
  )

Plot 3 - Density Plots of Continuous Predictors

The density plots show how continuous predictors like BMI, Sleep, SoundSleep, and Pregnancies are distributed across diabetic and non-diabetic groups. By comparing the two distributions for each variable, we can identify patterns and assess their potential utility in predicting diabetes.

continuous_vars <- c("BMI", "Sleep", "SoundSleep", "Pregnancies")
data_continuous <- Diabetes_2019_clean %>% dplyr::select(all_of(continuous_vars), Diabetic)

data_continuous_melt <- melt(data_continuous, id.vars = "Diabetic")

ggplot(data_continuous_melt, aes(x = value, fill = as.factor(Diabetic))) +
  geom_density(alpha = 0.5) +
  facet_wrap(~variable, scales = "free") +
  labs(title = "Density Plots of Continuous Predictors by Diabetic Status",
       x = "Value",
       y = "Density",
       fill = "Diabetic")

  • From the BMI plot we observe that individuals with diabetes tend to have higher BMI values compared to those without diabetes. However, the two distributions overlap heavily; we can still include this predictor and see how it performs.

  • In the case of Sleep, the distributions for diabetic and non-diabetic groups overlap significantly, but the non-diabetic group shows an interesting shape, so it should be worth including.

  • SoundSleep, which measures the hours of quality sleep, exhibits a similar pattern to Sleep, with significant overlap between the two groups. This suggests it may not be a strong predictor unless combined with other variables or analyzed in more detail.

  • Pregnancies shows an interesting pattern. While most individuals in both groups have few pregnancies, the diabetic group has a slightly higher proportion at higher pregnancy counts. This could make Pregnancies a useful predictor, particularly when combined with other demographic or health-related variables.

Based on these observations, Pregnancies appears to be the most promising predictor among these variables due to its clearer distinction between the two groups. Sleep could also contribute valuable information, while BMI and SoundSleep may play smaller roles unless they interact with other predictors in the model.

Plot 4 - Bar Plots of Categorical Predictors

categorical_vars <- c("Gender", "FamilyDiabetes", "highBP", "Smoking", "Alcohol","RegularMedicine", "BPLevel", "UriationFreq", "Age")
data_categorical <- Diabetes_2019_clean %>% select(all_of(categorical_vars), Diabetic)

data_categorical_melt <- melt(data_categorical, id.vars = "Diabetic")

ggplot(data_categorical_melt, aes(x = value, fill = as.factor(Diabetic))) +
  geom_bar(position = "dodge", alpha = 0.8) +
  facet_wrap(~variable, scales = "free_x") +
  labs(title = "Bar Plots of Categorical Predictors by Diabetic Status",
       x = "Category",
       y = "Count",
       fill = "Diabetic")+
    theme(
    axis.text.x = element_text(angle = 20, hjust = 1)
  )

  • For Gender, there seems to be little distinction between male and female groups in diabetic status; the difference we do see is likely explained by males being over-represented in the dataset, suggesting Gender might not be a significant predictor on its own.

  • In the case of FamilyDiabetes, having a family history of diabetes is clearly more common among those with diabetes, making this variable a strong predictor.

  • highBP shows a strong relationship with diabetes status. Individuals with high blood pressure are more likely to be diabetic, indicating its potential predictive power.

  • Smoking does not show a stark difference between diabetic and non-diabetic groups, suggesting limited usefulness. Similarly, Alcohol consumption does not exhibit a clear pattern in this dataset.

  • RegularMedicine stands out as a strong predictor. Those taking regular medication are more likely to be diabetic, likely reflecting existing health conditions.

  • For BPLevel, higher blood pressure levels (categories like “high”) appear to correlate strongly with diabetes. This variable is likely important for the model; highBP may not remain significant when paired with it.

  • UriationFreq (frequent urination) and Age provide mixed information. UriationFreq shows a similar distribution for diabetic and non-diabetic patients, but Age should be a good predictor: patients aged 60 and above are more likely to have diabetes.

From this analysis, key predictors to include in the model are FamilyDiabetes, highBP, RegularMedicine, BPLevel and Age, as they show clear distinctions between diabetic and non-diabetic groups. Variables like Smoking, Alcohol, UriationFreq and possibly Gender may be less informative.
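These visual impressions can be backed up with simple proportion tables; for example, the share of diabetics within each RegularMedicine category (a sketch, and the same check works for the other categorical variables):

prop.table(table(Diabetes_2019_clean$RegularMedicine,
                 Diabetes_2019_clean$Diabetic), margin = 1)  # row-wise proportions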

Draft Regression Model

model <- glm(Diabetic~BMI+Sleep+SoundSleep+Pregnancies+FamilyDiabetes+RegularMedicine+BPLevel+highBP+Age, family = "binomial", data = Diabetes_2019_clean)
Coefficient              Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)               -2.8314       0.9446   -3.0000    0.0027    **
BMI                        0.0130       0.0200    0.6490    0.5161
Sleep                     -0.1182       0.1064   -1.1110    0.2666
SoundSleep                 0.2563       0.0816    3.1410    0.0017    **
Pregnancies                0.3063       0.1110    2.7600    0.0058    **
FamilyDiabetes (yes)       1.3041       0.2295    5.6830    <0.0001   ***
RegularMedicine (yes)      2.3301       0.2480    9.3940    <0.0001   ***
BPLevel (High)             0.6392       1.1866    0.5390    0.5901
BPLevel (low)            -16.2072     736.1063   -0.0220    0.9824
BPLevel (Low)            -15.5195    2231.3110   -0.0070    0.9945
BPLevel (normal)          -1.1656       0.3426   -3.4020    0.0007    ***
BPLevel (normal)         -12.3844    3956.1804   -0.0030    0.9975
highBP (yes)              -0.3891       0.3452   -1.1270    0.2597
Age (50–59)                0.2146       0.3053    0.7030    0.4821
Age (60 or older)          1.5555       0.3468    4.4850    <0.0001   ***
Age (less than 40)        -1.2360       0.2999   -4.1210    <0.0001   ***
predicted_prob <- predict(model, type = "response")
predicted_class <- ifelse(predicted_prob > 0.51, 1, 0)

conf_matrix <- table(Predicted = predicted_class, Actual = Diabetes_2019_clean$Diabetic)
success_rate <- mean(predicted_class == Diabetes_2019_clean$Diabetic)

After building the model with these 9 predictors (BMI, Sleep, SoundSleep, Pregnancies, FamilyDiabetes, highBP, RegularMedicine, BPLevel, Age), we can see that BMI, Sleep and highBP have low significance in the model. That could be because they are highly correlated with other predictors or are not directly contributing to diabetes; in the case of highBP, as mentioned earlier, BPLevel provides the same information.
I calculated the success rate of this model and it is 0.8660. If we exclude those three variables the success rate increases to 0.8766, but keeping the Sleep variable gives an even better success rate of 0.8871, so we keep 7 variables: Sleep, SoundSleep, Pregnancies, FamilyDiabetes, RegularMedicine, BPLevel, Age.
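The comparison between these candidate models can be automated with a small helper that refits a given formula and reports its in-sample success rate at the 0.51 cutoff (a sketch of the procedure described above):

success_at_cutoff <- function(formula, data, cutoff = 0.51) {
  fit <- glm(formula, family = "binomial", data = data)
  pred <- ifelse(predict(fit, type = "response") > cutoff, 1, 0)
  mean(pred == data$Diabetic)  # in-sample success rate
}
success_at_cutoff(Diabetic ~ Sleep + SoundSleep + Pregnancies + FamilyDiabetes +
                    RegularMedicine + BPLevel + Age, Diabetes_2019_clean)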

Final Regression Model

model <- glm(Diabetic~Sleep+SoundSleep+Pregnancies+FamilyDiabetes+RegularMedicine+BPLevel+Age, family = "binomial", data = Diabetes_2019_clean)

Log Odds Equation

\[ \text{Log-Odds} = -2.64803 - 0.12460 \cdot \text{Sleep} + 0.24337 \cdot \text{SoundSleep} + 0.30791 \cdot \text{Pregnancies} \\ + 1.32986 \cdot \text{FamilyDiabetes}_{\text{yes}} + 2.25135 \cdot \text{RegularMedicine}_{\text{yes}} + 0.72135 \cdot \text{BPLevel}_{\text{High}} \\- 15.96476 \cdot \text{BPLevel}_{\text{low}} - 15.26225 \cdot \text{BPLevel}_{\text{Low}} - 0.91216 \cdot \text{BPLevel}_{\text{normal}} \\- 12.69628 \cdot \text{BPLevel}_{\text{normal}} + 0.23677 \cdot \text{Age}_{50-59} + 1.55498 \cdot \text{Age}_{60 \text{ or older}} \\- 1.22500 \cdot \text{Age}_{\text{less than 40}} \]

Odds Equation

\[ \text{Odds} = e^{\text{Log-Odds}} = ... \]

  • The intercept (-2.64803) represents the baseline log-odds when all predictors are zero. In this case, the odds of being diabetic are approximately  \(e^{-2.64803} \approx 0.071\) , corresponding to a probability of about 6.6%.

  • For sleep, a one-unit increase reduces the log-odds by 0.12460. The odds are multiplied by  \(e^{-0.12460} \approx 0.882\) , which means an 11.8% reduction in the odds.

  • Sound sleep has a positive effect, with each additional unit increasing the log-odds by 0.24337. This translates to  \(e^{0.24337} \approx 1.275\) , or a 27.5% increase in the odds.

  • Each additional pregnancy raises the log-odds by 0.30791, meaning the odds are multiplied by  \(e^{0.30791} \approx 1.361\) , an increase of 36.1%.

  • Having a family history of diabetes significantly raises the log-odds by 1.32986, increasing the odds by nearly four times ( \(e^{1.32986} \approx 3.78\) ).

  • Taking regular medicine has an even greater effect, with the log-odds increasing by 2.25135, which multiplies the odds by about  \(e^{2.25135} \approx 9.49\) , an increase of 849%.

  • Blood pressure level has mixed effects. High BP increases the odds by  \(e^{0.72135} \approx 2.057\) , or 105.7%. Low BP, on the other hand, has an extremely negative coefficient (-15.96476), driving the predicted probability effectively to zero (see the cross-tabulation sketch after this list). Normal BP decreases the odds by  \(e^{-0.91216} \approx 0.402\) , a reduction of 59.8%.

  • Age categories also play a role. Individuals aged 50–59 see a slight increase in odds (+26.7%), while those 60 or older experience a significant rise (+373%). Conversely, being younger than 40 decreases the odds substantially by about 70.4%.
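The extreme negative coefficients (around -16) with huge standard errors for the rare BPLevel categories, together with the duplicated low/normal rows in the coefficient table, suggest sparse and inconsistently labelled levels in the raw data; those categories may contain very few (or no) diabetic cases. A quick cross-tabulation makes this easy to verify (a sketch):

table(BPLevel = Diabetes_2019_clean$BPLevel,
      Diabetic = Diabetes_2019_clean$Diabetic)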

Model Evaluation

predicted_prob <- predict(model, type = "response")
predicted_class <- ifelse(predicted_prob > 0.51, 1, 0)

conf_matrix <- table(Predicted = predicted_class, Actual = Diabetes_2019_clean$Diabetic)
success_rate <- mean(predicted_class == Diabetes_2019_clean$Diabetic)

Confusion Matrix (cutoff=0.51):

Predicted \ Actual     0     1
0                    635    58
1                     49   207

Success rate is 0.8871 and it’s calculated by formula: \[ \text{Success Rate} = \frac{\text{True Positives} + \text{True Negatives}}{\text{Total Observations}} \]

We have 49 false positive cases; these are cases where the model incorrectly predicts someone as diabetic even though they are non-diabetic. False positives might lead to unnecessary stress and medical interventions, such as additional tests or treatments, which could be costly and anxiety-inducing for the patient.

On the other hand, we have 58 false negatives. These are cases where the model incorrectly predicts someone as non-diabetic even though they are diabetic. False negatives are more concerning because diabetic patients might go undiagnosed and miss critical early interventions. This could lead to severe health complications like cardiovascular issues, kidney disease, or vision problems.

Confusion Matrix (cutoff=0.35):

While the cutoff of 0.51 gives the model’s best success rate, in medical contexts it is often beneficial, as mentioned previously, to prioritize minimizing false negatives over false positives because of the higher medical risk associated with undiagnosed diabetes.

predicted_prob <- predict(model, type = "response")
predicted_class <- ifelse(predicted_prob > 0.35, 1, 0)

conf_matrix <- table(Predicted = predicted_class, Actual = Diabetes_2019_clean$Diabetic)
success_rate <- mean(predicted_class == Diabetes_2019_clean$Diabetic)
Predicted \ Actual     0     1
0                    383    77
1                    108   189

In this case we get a success rate of 0.8608, which is lower than the 0.51-cutoff success rate, but since we prioritize minimizing the risk of false negatives it should be preferable in this context.

Hypothesis Test for Success Rate at 0.35 Cutoff

The success rate of the null model is 0.7205, which means that without any data or predictors we could achieve that rate by simply giving every patient a negative diagnosis. Let’s check whether our model is significantly better than the null model.

  • Null Hypothesis ( H_0 ): The model’s success rate is equal to the null success rate:  p = 0.7205 

  • Alternative Hypothesis ( H_a ): The model’s success rate is better than the null success rate:  p > 0.7205

We will again use a one-sided z-test for proportions:

p_hat <- success_rate
n <- nrow(Diabetes_2019_clean)
p_0 <- 1 - sum(Diabetes_2019_clean$Diabetic) / n
z <- (p_hat - p_0) / sqrt((p_0 * (1 - p_0)) / n)
p_value <- 1 - pnorm(z)

We get: \(z = 9.6255\). The z-score is extremely high, indicating that the observed value is far from the null hypothesis value in terms of standard deviations. Given this, the corresponding p-value will be incredibly small, effectively close to 0.

We can confidently reject the null hypothesis: the model’s success rate of 86.08% is statistically significantly better than the baseline success rate of 72.05%.

At a cutoff value of 0.35, the model focuses on reducing false negatives (failing to detect diabetes) while allowing for more false positives. This approach aligns well with medical priorities, as identifying potential diabetes cases for further testing is critical.

Model Comparison - Pima vs Diabetes 2019

The first and second models we built are quite different because of the data they were built on. The first model used the DiabetesPima dataset, which focused on physical health metrics like glucose levels, BMI, and blood pressure. These variables gave a direct look at the body’s condition, making it easier to connect them to diabetes risk. For example, glucose and BMI stood out as strong predictors, which isn’t surprising since they’re well-known risk factors for diabetes. Overall, this model was pretty straightforward, using just a few continuous variables to predict outcomes.

On the other hand, the second model, based on the Diabetes 2019 dataset, included a mix of health, lifestyle, and demographic information. It looked at things like family history of diabetes, whether someone takes regular medicine, how much they sleep, and their blood pressure levels. This dataset also included a lot of categories, like different age groups and behaviors like smoking or junk food consumption. As a result, the model took a broader approach, capturing not just physical health but also how someone’s habits and background affect their risk.

Some variables overlapped between the two models, like BMI and pregnancies, which seemed important in both cases. I didn’t end up using BMI in the second model because it wasn’t really helping the model predict well, but Pregnancies was used in both models. Pregnancies had a stronger effect in Model 2: its coefficient was 0.30791 compared to 0.133371 in Model 1 (odds multipliers of about 1.36 versus 1.14 per additional pregnancy), possibly because the second dataset included more context, like age groups and family history. The first model was simpler (3 predictors) and focused only on health metrics, while the second one (7 predictors, some with multiple categories) brought in lifestyle and behavior, adding complexity and a more complete picture of risk factors.

The populations were also very different. The first dataset focused only on Pima Indian women over 21, whereas the second included a more mixed group, with both men and women, various age groups, and lifestyle factors. This difference in who the data represented likely influenced which variables were important in each model.

The second model performed a bit better because it included more varied predictors, like family diabetes history and regular medicine use. These kinds of factors captured risks that the first dataset couldn’t account for. Essentially, the Pima dataset showed how physical health drives diabetes risk, while the Diabetes Dataset 2019 highlighted how combining health, habits, and background gives a fuller understanding of what increases someone’s chances of developing diabetes.

BRFSS Data Analysis

The Behavioral Risk Factor Surveillance System (BRFSS) is the nation’s premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world.

In this part we will analyze a subset of that data and build a model on it to predict diabetes.

behaviour_data <- read.csv("diabetesTask3Complete.csv")
mystery_data <- read.csv("diabetesTask3Mystery.csv")
behaviour_clean <- behaviour_data

Plot 5 - Density Plots of Continuous Predictors

These density plots show how continuous predictors like BMI, MentHlth and PhysHlth are distributed across diabetic and non-diabetic groups.

continuous_vars <- c("BMI", "MentHlth", "PhysHlth")
data_continuous <- behaviour_clean %>% dplyr::select(all_of(continuous_vars), Diabetes_binary)

data_continuous_melt <- melt(data_continuous, id.vars = "Diabetes_binary")

ggplot(data_continuous_melt, aes(x = value, fill = as.factor(Diabetes_binary))) +
  geom_density(alpha = 0.5) +
  facet_wrap(~variable, scales = "free") +
  labs(title = "Density Plots of Continuous Predictors by Diabetic Status",
       x = "Value",
       y = "Density",
       fill = "Diabetic")

  • Diabetic individuals tend to have higher BMI values compared to non-diabetic individuals, with a noticeable shift in the distribution. Despite this, there is still significant overlap between the two groups, suggesting BMI could be a useful predictor but may require additional context or interaction with other variables to improve its discriminatory power.

  • The distribution of the number of mentally unhealthy days is heavily skewed towards zero for both groups, with almost identical patterns between diabetic and non-diabetic individuals. This overlap indicates that MentHlth is unlikely to be a strong predictor of diabetes.

  • The distribution of physically unhealthy days shows a slightly higher proportion of diabetic individuals reporting more unhealthy days compared to non-diabetic individuals, particularly in the tail of the distribution. However, the overall overlap between the two groups is significant, meaning its predictive value might be limited unless used in combination with other variables.

Based on these observations, BMI appears to be the most promising of the continuous predictors, given its shift in distribution between groups. PhysHlth may provide some additional information, while MentHlth is less likely to contribute meaningfully to diabetes prediction on its own.

Plot 6 - Bar Plots of Categorical Predictors

categorical_vars <- c("HighBP", "HighChol", "Sex", "CholCheck", "Smoker", "Stroke", "HeartDiseaseorAttack", "PhysActivity", "Fruits", "Veggies", "HvyAlcoholConsump", "AnyHealthcare", "NoDocbcCost", "GeneralHealth", "DiffWalk")
data_categorical <- behaviour_clean %>% select(all_of(categorical_vars), Diabetes_binary)

data_categorical_melt <- melt(data_categorical, id.vars = "Diabetes_binary")

ggplot(data_categorical_melt, aes(x = value, fill = as.factor(Diabetes_binary))) +
  geom_bar(position = "dodge", alpha = 0.8) +
  facet_wrap(~variable, scales = "free_x") +
  labs(
    title = "Bar Plots of Categorical Predictors by Diabetic Status",
    x = "Category",
    y = "Count",
    fill = "Diabetic"
  ) +
  theme(
    axis.text.x = element_text(angle = 20, hjust = 1)
  )

  • From the HighBP plot, we observe that individuals with diabetes are more likely to have high blood pressure compared to those without diabetes. Although there is overlap between the groups, this variable appears to have a strong association with diabetes and should be included as a predictor.

  • For HighChol, a similar pattern is observed, with diabetics more frequently reporting high cholesterol levels. Despite some overlap, this predictor shows potential for distinguishing diabetic from non-diabetic individuals.

  • The Sex variable shows minimal difference in distribution between diabetics and non-diabetics, indicating it may not be a strong predictor on its own. However, it could still interact with other variables in the model.

  • CholCheck shows a high proportion of individuals in both groups having had their cholesterol checked, with slightly higher rates among diabetics. While not a primary predictor, it might still contribute to the model in combination with other factors.

  • Smoker displays a relatively even distribution between diabetics and non-diabetics, suggesting it may not be a strong standalone predictor. However, it could add value when combined with related variables like HeartDiseaseorAttack or Stroke.

  • Stroke and HeartDiseaseorAttack both demonstrate clear distinctions between the groups, with diabetics showing higher frequencies in the affirmative categories. These variables appear to be strong predictors and should be included in the model.

  • PhysActivity reveals that diabetics are less likely to engage in physical activity than non-diabetics. This difference makes it a potentially valuable predictor.

  • Fruits and Veggies consumption shows slight differences between the groups, with diabetics slightly less likely to consume fruits or vegetables daily. These variables could provide modest contributions to the model.

  • HvyAlcoholConsump indicates low levels of heavy drinking across both groups, with minimal differences. This suggests it may not be a key predictor unless combined with other lifestyle factors.

  • AnyHealthcare and NoDocbcCost reveal interesting socioeconomic patterns. Diabetics are more likely to have healthcare coverage but also face cost-related barriers to seeing a doctor. These variables could add meaningful context to the model.

  • GeneralHealth highlights a strong gradient, with diabetics more likely to rate their health as poor or fair compared to non-diabetics. This variable shows significant promise as a predictor.

  • DiffWalk, which measures difficulty walking, is much more common among diabetics, making it a potentially strong predictor.

Plot 7 - Income vs Diabetes

ggplot(behaviour_clean, aes(Income, fill = as.factor(Diabetes_binary))) +
  geom_bar(position = "dodge") +
  labs(title = "Income Bar Plot by Diabetic Status",
       x = "Category",
       y = "Count",
       fill = "Diabetic")+
    theme(
    axis.text.x = element_text(angle = 20, hjust = 1)
  )

From the Income plot, we observe notable trends in the distribution of income categories by diabetic status. Individuals with diabetes are more concentrated in the lower-income categories (e.g., “Less than $10K” to “Less than $35K”) compared to those without diabetes. In higher-income brackets, especially “More than $75K,” individuals without diabetes are more prevalent, showing a significant disparity.

This pattern suggests that income might be an important socioeconomic predictor of diabetes, as lower income levels could correlate with limited access to healthcare, nutritious food, or other factors influencing health. Including Income as a variable in the analysis could provide valuable insights into the relationship between socioeconomic status and diabetes risk.

Based on these observations, Income, Stroke, HeartDiseaseorAttack, HighBP, and GeneralHealth appear to be the most promising predictors due to their clear distinctions between the groups. PhysActivity, DiffWalk, and socioeconomic factors like AnyHealthcare and NoDocbcCost also show potential, while variables like Sex and HvyAlcoholConsump may play smaller roles unless they interact with other predictors.
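As a rough numerical screen to accompany these visual impressions, a chi-squared test of association can be run between each categorical predictor and diabetes status (a sketch; smaller p-values suggest stronger association, not necessarily predictive importance):

sapply(categorical_vars, function(v) {
  chisq.test(table(behaviour_clean[[v]], behaviour_clean$Diabetes_binary))$p.value
})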

Draft Regression Model

model <- glm(
  Diabetes_binary ~ PhysHlth+BMI+Income+Stroke+HeartDiseaseorAttack+HighBP+GeneralHealth+PhysActivity+DiffWalk+AnyHealthcare+NoDocbcCost,
  data = behaviour_clean,
  family = "binomial"
)
Coefficient                 Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)                  -3.0962       0.2818   -10.986   <0.0001   ***
PhysHlth                     -0.0042       0.0040    -1.044    0.2966
BMI                           0.0643       0.0051    12.557   <0.0001   ***
Income (Less than $15K)       0.1320       0.1828     0.722    0.4702
Income (Less than $20K)       0.2161       0.1730     1.249    0.2117
Income (Less than $25K)       0.1076       0.1657     0.649    0.5163
Income (Less than $35K)       0.1104       0.1611     0.685    0.4931
Income (Less than $50K)      -0.0807       0.1570    -0.514    0.6071
Income (Less than $75K)      -0.1045       0.1554    -0.673    0.5012
Income (More than $75K)      -0.3718       0.1514    -2.455    0.0141    *
Stroke                        0.3021       0.1361     2.221    0.0264    *
HeartDiseaseorAttack          0.6068       0.0920     6.593   <0.0001   ***
HighBP (not high BP)         -0.9972       0.0632   -15.771   <0.0001   ***
GeneralHealth (fair)          2.0831       0.1477    14.106   <0.0001   ***
GeneralHealth (good)          1.5847       0.1285    12.334   <0.0001   ***
GeneralHealth (poor)          1.9638       0.1875    10.473   <0.0001   ***
GeneralHealth (very good)     0.8748       0.1298     6.739   <0.0001   ***
PhysActivity                 -0.0528       0.0702    -0.752    0.4519
DiffWalk                      0.1476       0.0862     1.711    0.0871    .
AnyHealthcare                 0.3146       0.1527     2.061    0.0393    *
NoDocbcCost                  -0.2067       0.1085    -1.905    0.0568    .
train_probs <- predict(model, behaviour_clean, type = "response")

train_preds <- ifelse(train_probs > 0.5, 1, 0)
train_conf_matrix <- table(Predicted = train_preds, Actual = behaviour_clean$Diabetes_binary)
success_rate <- mean(train_preds == behaviour_clean$Diabetes_binary)
Final Regression Model

Dropping the predictors with low significance in the table above (PhysHlth, PhysActivity and DiffWalk) leaves the following model:

model <- glm(
  Diabetes_binary ~ BMI+Income+Stroke+HeartDiseaseorAttack+HighBP+GeneralHealth+AnyHealthcare+NoDocbcCost,
  data = behaviour_clean,
  family = "binomial"
)

Log Odds Equation

\[ \text{Log-Odds} = -3.15670 + 0.06559 \cdot \text{BMI} + 0.14363 \cdot \text{Income}_{\text{Less than \$15K}} + 0.21487 \cdot \text{Income}_{\text{Less than \$20K}} \\ + 0.10205 \cdot \text{Income}_{\text{Less than \$25K}} + 0.10489 \cdot \text{Income}_{\text{Less than \$35K}} -0.09945 \cdot \text{Income}_{\text{Less than \$50K}} \\-0.12304 \cdot \text{Income}_{\text{Less than \$75K}} -0.39908 \cdot \text{Income}_{\text{More than \$75K}} + 0.31278 \cdot \text{Stroke} + 0.61987 \cdot \text{HeartDiseaseorAttack} \\ -1.00349 \cdot \text{HighBP}_{\text{not high BP}} + 2.09343 \cdot \text{GeneralHealth}_{\text{fair}} + 1.59329 \cdot \text{GeneralHealth}_{\text{good}} + 1.96904 \cdot \text{GeneralHealth}_{\text{poor}} \\ + 0.87719 \cdot \text{GeneralHealth}_{\text{very good}} + 0.31719 \cdot \text{AnyHealthcare} -0.20583 \cdot \text{NoDocbcCost} \]

Odds Equation

\[ \text{Odds} = e^{\text{Log-Odds}} = ... \]

The intercept of -3.15670 represents the baseline log-odds when all predictors are zero. In this case, the odds of being diabetic are approximately \(e^{-3.15670} \approx 0.043\), corresponding to a probability of about 4.1%.

For BMI, a one-unit increase in BMI raises the log-odds by 0.06559. The odds are multiplied by \(e^{0.06559} \approx 1.068\), meaning the odds increase by 6.8%.

Income categories show varied effects. For individuals with an income “Less than $15K”, the log-odds increase by 0.14363, which means the odds are multiplied by \(e^{0.14363} \approx 1.154\), an increase of 15.4%. For those “Less than $20K”, the increase is 0.21487, which translates to \(e^{0.21487} \approx 1.239\), or a 23.9% increase in the odds. The effect shrinks for the higher income categories, with the largest reduction seen in the “More than $75K” category, where the log-odds decrease by 0.39908, leading to \(e^{-0.39908} \approx 0.671\), or a 32.9% decrease in the odds.

For stroke, the log-odds increase by 0.31278, meaning the odds are multiplied by \(e^{0.31278} \approx 1.367\), a 36.7% increase in the odds.

For heart disease or attack, a positive effect is observed, with the log-odds increasing by 0.61987. The odds are multiplied by \(e^{0.61987} \approx 1.858\), representing an 85.8% increase in the odds.

For individuals without high blood pressure (the “not high BP” category), the log-odds decrease by 1.00349. The odds are multiplied by \(e^{-1.00349} \approx 0.367\), a 63.3% decrease in the odds.

For general health, the “fair” category increases the log-odds by 2.09343, multiplying the odds by \(e^{2.09343} \approx 8.1\), or an 810% increase. “Good” general health increases the odds by \(e^{1.59329} \approx 4.9\), a 389% increase. The “poor” category has a similar effect, with an odds increase of \(e^{1.96904} \approx 7.2\), or a 620% increase. “Very good” general health increases the odds by \(e^{0.87719} \approx 2.4\), or a 140% increase.

Having any healthcare coverage increases the log-odds by 0.31719, which multiplies the odds by \(e^{0.31719} \approx 1.373\), or a 37.3% increase.

For individuals who could not see a doctor because of cost, the log-odds decrease by 0.20583. The odds are multiplied by \(e^{-0.20583} \approx 0.814\), an 18.6% decrease in the odds.

Model Evaluation

train_probs <- predict(model, behaviour_clean, type = "response")
train_preds <- ifelse(train_probs > 0.50, 1, 0)
train_conf_matrix <- table(Predicted = train_preds, Actual = behaviour_clean$Diabetes_binary)
success_rate <- mean(train_preds == behaviour_clean$Diabetes_binary)

Confusion Matrix (cutoff=0.5):

Predicted \ Actual      0      1
0                    2138    747
1                     854   2261

The success rate of the model is 0.7332 when we use 0.5 as the cutoff. There are 854 false positive cases, where the model mistakenly identifies non-diabetic individuals as diabetic, and 747 false negative cases, where the model fails to detect diabetes in individuals who are actually diabetic. As already mentioned, the false negatives are particularly problematic: undiagnosed diabetes means patients may miss out on critical early interventions, increasing the risk of serious complications such as heart disease, kidney failure, and vision loss.

Confusion Matrix (cutoff=0.33):

While the cutoff of 0.5 provides a reasonable success rate for the model, in medical contexts it is often beneficial, as mentioned previously, to prioritize minimizing false negatives over false positives because of the higher medical risk associated with undiagnosed diabetes.

train_probs <- predict(model, behaviour_clean, type = "response")
train_preds <- ifelse(train_probs > 0.33, 1, 0)
train_conf_matrix <- table(Predicted = train_preds, Actual = behaviour_clean$Diabetes_binary)
success_rate <- mean(train_preds == behaviour_clean$Diabetes_binary)
Predicted \ Actual      0      1
0                    1526    331
1                    1466   2677

In this case we get a success rate of 0.7005, which is lower than the 0.5-cutoff success rate, but again, since we prioritize minimizing the risk of false negatives, it should be preferable in this context.

Hypothesis Test for Success Rate at 0.33 Cutoff

The success rate of the null model is 0.5013, which means that without any data or predictors we could achieve that rate by simply predicting the slightly more common class (here, a positive diagnosis) for every patient. Let’s check whether our model is significantly better than the null model.

  • Null Hypothesis ( H_0 ): The model’s success rate is equal to the null success rate:  p = 0.5013 

  • Alternative Hypothesis ( H_a ): The model’s success rate is better than the null success rate:  p > 0.5013

We will again use a one-sided z-test for proportions:

p_hat <- success_rate
n <- nrow(behaviour_clean)
p_0 <- sum(behaviour_clean$Diabetes_binary) / n
z <- (p_hat - p_0) / sqrt((p_0 * (1 - p_0)) / n)
p_value <- 1 - pnorm(z)

We get: \(z = 30.8549\). The z-score is extremely high, indicating that the observed value is far from the null hypothesis value in terms of standard deviations. As a result, the corresponding p-value will again be very small, essentially zero.

This allows us to confidently reject the null hypothesis. The model’s success rate of 70.05% is statistically significantly better than the default success rate of 50.13%.

At a cutoff value of 0.33, the model prioritizes reducing false negatives (missing diabetes cases) at the cost of increasing false positives. This approach is in line with medical priorities, as detecting potential diabetes cases for further investigation is crucial.

BRFSS Model Comparison

Even though, of the three models discussed, the BRFSS model does not have the highest success rate, it stands out from the others because it shifts focus away from variables like Pregnancies, which were prominent in Models 1 and 2, and instead incorporates a broader set of socioeconomic and health-related factors. This broader scope makes Model 3 a more well-rounded tool for assessing diabetes risk, but this data set also has some downsides, which we will discuss later.

In terms of BMI, all three models include it in the analysis, but its role and influence vary significantly. In Model 1, BMI is a straightforward and essential predictor of diabetes risk, with a coefficient of 0.092828, reflecting a direct relationship. Model 2 also considered BMI, but its impact was less prominent once lifestyle factors such as Sleep and RegularMedicine were included, so we did not end up using it in Model 2. In Model 3, BMI has a smaller coefficient than in the first Pima model, 0.06559, suggesting it is still relevant but plays a more minor role compared to socioeconomic factors like Income or health conditions such as Heart Disease or Stroke.

For example, consider two individuals with identical BMI values of 30. In Model 1, the odds of diabetes are primarily driven by their BMI and Glucose levels, leading to a straightforward calculation. In Model 3, however, other factors like their income bracket, history of Stroke or Heart Disease, and self-reported General Health contribute more heavily to their risk profile.

Income is a particularly unique addition in Model 3. Lower income levels (e.g., Less than $15K) significantly increase diabetes risk, reflecting the broader societal impacts of economic disadvantage on health. Similarly, poor General Health ratings (e.g., “fair” or “poor”) are strong predictors in Model 3, offering a nuanced understanding of how perceived health relates to diabetes risk.

In contrast, Model 1’s narrow focus on physical health metrics, like BMI, limits its ability to capture these broader influences. Model 2 improves upon this by adding lifestyle factors like Sleep and demographic variables, but it still lacks the socioeconomic and general health dimensions that define Model 3. As mentioned above, the BRFSS data set has some limitations: it consists of randomly selected participants contacted by phone, and an individual may have difficulty remembering events that occurred a long time ago or the frequency of certain behaviors. Some respondents may over-report socially desirable behaviors while under-reporting behaviors they perceive to be less acceptable. So while the data in the first two datasets was carefully collected, mostly from clinical measurements, here it can be highly subjective. (Commonwealth of Massachusetts)

By incorporating BMI alongside variables such as Income and General Health, this model provides a richer and more context-aware prediction. It shows that while BMI is an important risk factor, diabetes is influenced by a complex interplay of physical, lifestyle, and socioeconomic conditions.

mystery_data$PredictionProb <- predict(model, newdata = mystery_data, type = "response")

mystery_data$PredictionBinary <- ifelse(mystery_data$PredictionProb > 0.5, 1, 0)

mystery_data$PredictionCategory <- ifelse(
  mystery_data$PredictionBinary == 1,
  "predicted diabetes",
  "predicted not diabetes"
)

output <- mystery_data[, c("ID", "PredictionProb", "PredictionBinary", "PredictionCategory")]

write.csv(output, "predictions_mystery_data.csv", row.names = FALSE)

Ethics Behind Models

Mining anonymous health data, like the datasets we used to build our models, raises both opportunities and ethical concerns. On the positive side, this type of research can identify critical predictors of diseases like diabetes, as seen in our models. For instance, BMI, age, and lifestyle factors like sleep or blood pressure provided actionable insights that could lead to targeted interventions. By analyzing these predictors, healthcare systems can better allocate resources, design prevention programs, and implement early diagnostic strategies, potentially reducing the burden of chronic diseases on individuals and the system.

However, even anonymized data is not immune to risks. A major concern is fingerprinting - the possibility that unique combinations of variables in a dataset, such as specific age ranges, rare health conditions, or unusual income categories, could allow for re-identification of individuals when combined with other datasets. For example, a person with high blood pressure, poor general health, and a certain income level could be indirectly linked back to an individual, even if explicit identifiers like names or addresses are removed. This breach of privacy undermines the principle of anonymity and could lead to stigma, discrimination, or even exploitation, particularly by insurers or employers.

Additionally, questions of consent and equity must be addressed. Were participants aware their data would be used in this way, and do they benefit from the findings? If our models were developed into commercial tools, the cost of using these predictive insights might exclude vulnerable populations, ironically those most represented in the datasets. These financial barriers could widen health disparities instead of addressing them. Ensuring that such data mining efforts remain ethical requires robust anonymization techniques, transparency in data usage, and equitable application of research outcomes, ensuring the benefits extend to all who contribute their data.

Conclusion

This project explored how data science can uncover key predictors of diabetes risk using three very different datasets. Each dataset told its own story, from focusing on physical health metrics like glucose and BMI in the Pima dataset to adding lifestyle and behavioral details in the Diabetes 2019 data, and finally incorporating socioeconomic and general health factors in the BRFSS dataset. Through these models, we saw how diabetes isn’t just about health metrics—it’s about the broader context of a person’s life.

The results showed clear patterns. BMI consistently played an important role across all datasets, but its impact varied depending on what other factors were included. In Model 3, the addition of income and general health variables revealed how socioeconomic status and self-perceived health influence diabetes risk. This makes it clear that diabetes isn’t just a medical issue; it’s tied to social and economic conditions, something the earlier models didn’t capture.

While the models provided valuable insights, there were challenges. Self-reported data, like in the BRFSS dataset, has limitations, and there’s always a risk of over-fitting with complex models. There’s also the ethical consideration of using health data responsibly, especially when data could unintentionally identify individuals despite anonymization. Balancing the potential benefits of these insights with the risks is a critical part of this kind of work.

This project shows how predictive modeling can connect data to real-world impacts. By improving these models and carefully considering the ethical challenges, we can better support diabetes prevention and care in ways that are both practical and fair.

Bibliography

Worldwide trends in diabetes since 1980 (2016) - https://pmc.ncbi.nlm.nih.gov/articles/PMC5081106/#

Pima Indians Diabetes Database - https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database

WHO GLOBAL REPORT ON DIABETES (2016) - https://iris.who.int/bitstream/handle/10665/204871/9789241565257_eng.pdf

Behavioral Risk Factor Surveillance System (BRFSS) Data - https://www.mass.gov/info-details/behavioral-risk-factor-surveillance-system-brfss-data#