Diabetes is a global health challenge that affects millions of people worldwide, characterized by the body’s inability to effectively regulate blood sugar levels. This chronic metabolic disorder can lead to severe health complications, including heart disease, kidney failure, stroke, and nerve damage. According to the World Health Organization, diabetes has become a rapidly growing health concern, with global prevalence nearly quadrupling from 108 million in 1980 to 422 million in 2014. (National Institute of Health).)
Data science plays huge role in healthcare through early detection and personalized risk assessment. By analyzing complex health data sets, researchers can identify subtle patterns and predictors that might indicate an individual’s likelihood of developing diabetes. This project leverages three distinct data sets representing different populations to develop predictive models that could ultimately help healthcare professionals to devise preventive interventions more effectively.
This project aims to demonstrate how statistical modeling techniques can be applied to health data to predict diabetes risk. By examining various demographic, physiological, and lifestyle factors.
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.(Kagle)
It consists of several medical predictor variables and one target
variable, Outcome. Predictor variables includes the number
of pregnancies the patient has had, their BMI, insulin level, age, and
so on.
library(dplyr)
pima_data <- read.csv("DiabetesPima.csv")
We will be filtering BMI values which are equal zero, to keep data set clean.
pima_clean <- pima_data %>%
filter(BMI != 0)
library(reshape2)
library(ggplot2)
data_melt <- melt(pima_clean, id.vars = "Outcome")
ggplot(data_melt, aes(x = value, fill = as.factor(Outcome))) +
geom_density(alpha = 0.5) +
facet_wrap(~variable, scales = "free") +
labs(title = "Density Plots of Predictors by Outcome",
x = "Value",
y = "Density",
fill = "Outcome")
Based on the graphs, at the begining we could include the following predictors in the logistic regression model:
The glucose distribution shows a clear bimodal
pattern, with a distinct peak for the non-diabetic group at lower
glucose levels and another peak for the diabetic group at higher glucose
levels. This suggests that glucose level is a fundamental factor in
determining diabetes status. Higher glucose levels are strongly
associated with the presence of diabetes, as diabetes is characterized
by the body’s inability to effectively regulate blood sugar.(WHO)
The BMI distribution also exhibits a bimodal
pattern, with the diabetic group skewed towards higher BMI
values. This aligns with the known risk factor of obesity and being
overweight, which are closely linked to the development of type 2
diabetes. Individuals with a higher BMI tend to have increased insulin
resistance, a key contributor to diabetes.
The age distribution shows that the diabetic group
tends to be older than the non-diabetic group. This is consistent with
the epidemiological evidence that the risk of developing diabetes
increases with age. As individuals grow older, various physiological
changes can lead to decreased insulin sensitivity and impaired glucose
regulation, putting them at higher risk of diabetes.
The insulin distribution reveals a right-skewed
pattern, with the diabetic group displaying higher insulin levels. This
is likely indicative of insulin resistance, a hallmark of type 2
diabetes. In this condition, the body’s cells become less responsive to
insulin, leading to elevated insulin levels as the pancreas tries to
compensate by producing more of the hormone.
The pregnancies distribution shows a distinct difference between the diabetic and non-diabetic groups, with the diabetic group having a higher number of pregnancies. This aligns with the known risk factor of gestational diabetes, where women develop diabetes during pregnancy. Pregnancies can stress the body’s glucose regulation system, and women who experience gestational diabetes are at an increased risk of developing type 2 diabetes later in life.
While the other variables (Skin Thickness,
Blood preasure) also show some distributional patterns,
both positive and negative outcomes have similar distribution, with huge
overlapped areas. So we would assume that won’t be contributing to model
much.
model <- glm(Outcome ~ Pregnancies+Glucose+BMI+Insulin+Age, data = pima_clean, family = "binomial")
| Coefficient | Estimate | Std. Error | z value | Pr(> |
|---|---|---|---|---|
| (Intercept) | -8.9288 | 0.716 | -12.470 | < 2e-16 *** |
| Pregnancies | 0.1069 | 0.0316 | 3.382 | 0.00072 *** |
| Glucose | 0.0346 | 0.0036 | 9.584 | < 2e-16 *** |
| BMI | 0.0967 | 0.0147 | 6.564 | 5.25e-11 *** |
| Insulin | -0.001 | 0.0008 | -1.190 | 0.23423 |
| Age | 0.0127 | 0.0092 | 1.383 | 0.16666 |
predicted_prob <- predict(model, type = "response")
cutoff <- 0.5
predicted_class <- ifelse(predicted_prob > cutoff, 1, 0)
conf_matrix <- table(Predicted = predicted_class, Actual = pima_clean$Outcome)
success_rate <- mean(predicted_class == pima_clean$Outcome)
After building model with all those 5 predictors, we can see that
Insulin and Age have low significance in the
model. That could be because they have high correlation with other
variables use in the model, in case of Age it has high
correlation with Pregnancies variable, and
Insulin level is also correlated with BMI and
Glucose level as we can see in bellow heat map.(Plot
2)
I calculated success rate of this model and it’s 0.7688, if we exclude
those two variables, success rate increases to 0.7718, more on that
later.
corr_matrix <- cor(pima_data[, -8], use = "complete.obs")
corr_melt <- melt(corr_matrix)
ggplot(corr_melt, aes(x = Var1, y = Var2, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0) +
labs(title = "Correlation Heatmap of Predictors",
x = "",
y = "") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
model <- glm(Outcome ~ Pregnancies+Glucose+BMI, data = pima_clean, family = "binomial")
\[ \text{Log-Odds} = -8.5057 + 0.133371 \cdot \text{Pregnancies} + 0.034209 \cdot \text{Glucose} + 0.092828 \cdot \text{BMI} \]
\[ \text{Odds} = e^{-8.5057+ 0.133371 \cdot \text{Pregnancies} + 0.034209 \cdot \text{Glucose} + 0.092828 \cdot \text{BMI}} \]
Each additional pregnancy increases the log-odds of diabetes by 0.133371, or increases the odds by a factor of \(e^{0.133371} \approx 1.14\) . This means the odds of being diabetic increase by 14% for each additional pregnancy.
Each unit increase in glucose level increases the log-odds of diabetes by 0.0342, or increases the odds by a factor of \(e^{0.03421} \approx 1.03\) . This corresponds to a 3% increase in the odds of diabetes per unit increase in glucose.
Each unit increase in BMI increases the log-odds of diabetes by 0.0928, or increases the odds by a factor of \(e^{0.0928} \approx 1.10\). This means the odds of being diabetic increase by 10% for every 1-unit increase in BMI.
predicted_prob <- predict(model, type = "response")
predicted_class <- ifelse(predicted_prob > 0.5, 1, 0)
conf_matrix <- table(Predicted = predicted_class, Actual = pima_clean$Outcome)
success_rate <- mean(predicted_class == pima_clean$Outcome)
| Predicted \ Actual | 0 | 1 |
|---|---|---|
| 0 | 432 | 114 |
| 1 | 59 | 152 |
Success rate is 0.7715 and it’s calculated by formula: \[ \text{Success Rate} = \frac{\text{True Positives} + \text{True Negatives}}{\text{Total Observations}} \]
We have 59 false positive case, these are cases where the model incorrectly predicts someone as diabetic, even though they are non-diabetic. False positives might lead to unnecessary stress and medical interventions, such as additional tests or treatments, which could be costly and anxiety-inducing for the patient.
On the other hand we have 114 false negatives. These are cases where the model incorrectly predicts someone as non-diabetic, even though they are diabetic. False negatives are more concerning because diabetic patients might go undiagnosed and miss critical early interventions. This could lead to severe health complications like cardiovascular issues, kidney disease, or vision problems.
While the cutoff of 0.5 provides a reasonable balance between false positives and false negatives, in medical contexts, as mentioned previously it is often beneficial to prioritize minimizing false negatives over false positives due to the higher medical risk associated with undiagnosed diabetes.
predicted_prob <- predict(model, type = "response")
predicted_class_0_3 <- ifelse(predicted_prob > 0.35, 1, 0)
conf_matrix_0_3 <- table(Predicted = predicted_class_0_3, Actual = pima_clean$Outcome)
success_rate_0_3 <- mean(predicted_class_0_3 == pima_clean$Outcome)
| Predicted \ Actual | 0 | 1 |
|---|---|---|
| 0 | 383 | 77 |
| 1 | 108 | 189 |
In this case we get success rate of 0.7556, which is lower than one 0.5 cutoff success rate, but we are willing to prioritize minimizing risk of false negatives, on behalf of false positives.
Success rate of the null model is 0.651, which means that without any data or predictors, we could achive that success rate by just giving patients negative diagnosis, let’s check is our mode is significantly better that null model.
Null Hypothesis ( H_0 ): The model’s success rate is equal to the null success rate: p = 0.651
Alternative Hypothesis ( H_a ): The model’s success rate is better than the null success rate: p > 0.651
We use a one-sided z-test for proportions:
\[ z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0 (1 - p_0)}{n}}} \]
p_hat <- success_rate
p_0 <- 0.651
n <- nrow(pima_clean)
z <- (p_hat - p_0) / sqrt((p_0 * (1 - p_0)) / n)
p_value <- 1 - pnorm(z)
We get: \(z = 6.0386\) and \(p = 7.773265 \times 10^{-10}\)
The p-value is far below any reasonable significance level (for example \(\alpha = 0.05\) ).
We reject the null hypothesis, model’s success rate of 75.56% is statistically significantly better than the default success rate of 65.1%.
At a cutoff of 0.35, the model focuses on reducing false negatives (failing to detect diabetes) while allowing for more false positives. This approach aligns well with medical priorities, as identifying potential diabetes cases for further testing is critical.
Diabetes_2019 <- read.csv("diabetesTask2.csv")
This dataset was collected by Neha Prerna Tigga and Dr. Shruti Garg of the Department of Computer Science and Engineering, BIT Mesra, India. for dibeties research. In the study, total 952 participants are selected aged 18 and above, out of which 580 are males and 372 are females. The participants were asked to answer a questionnaire which was prepared based on the constraints that could lead to diabetes.(Prediction of Type 2 Diabetes using Machine Learning Classification Methods)
We will be using this data set to build our logistical regression model, to predict diabetes diagnoses. Since this data set is completely different from previous one, we will start by analyzing all the variables that we have and try to select best predictors for the model.
Diabetes_2019_clean <- Diabetes_2019 %>%
filter(trimws(Diabetic)!="" & !is.na(BMI)) %>%
mutate(
Diabetic = ifelse(trimws(Diabetic) == "yes", 1, 0),
BMI = as.numeric(BMI),
Sleep = as.numeric(Sleep),
Pregnancies = ifelse(is.na(Pregancies), 0, Pregancies)
)
The density plots show how continuous predictors like BMI, Sleep, SoundSleep, and Pregnancies are distributed across diabetic and non-diabetic groups. By comparing the two distributions for each variable, we can identify patterns and assess their potential utility in predicting diabetes.
continuous_vars <- c("BMI", "Sleep", "SoundSleep", "Pregnancies")
data_continuous <- Diabetes_2019_clean %>% dplyr::select(all_of(continuous_vars), Diabetic)
data_continuous_melt <- melt(data_continuous, id.vars = "Diabetic")
ggplot(data_continuous_melt, aes(x = value, fill = as.factor(Diabetic))) +
geom_density(alpha = 0.5) +
facet_wrap(~variable, scales = "free") +
labs(title = "Density Plots of Continuous Predictors by Diabetic Status",
x = "Value",
y = "Density",
fill = "Diabetic")
From BMI plot we observe that individuals with diabetes tend to have higher BMI values compared to those without diabetes. But we see huge overlap of the graphs, we can still include this predictor and see how it performs.
In the case of Sleep, the distributions for diabetic and non-diabetic groups overlap significantly, but there is interesting distribution of non diabetic patients, so it should be worth including.
SoundSleep, which measures the hours of quality sleep, exhibits a similar pattern to Sleep, with significant overlap between the two groups. This suggests it may not be a strong predictor unless combined with other variables or analyzed in more detail.
Pregnancies shows an interesting pattern. While most individuals in both groups have few pregnancies, the diabetic group has a slightly higher proportion at higher pregnancy counts. This could make Pregnancies a useful predictor, particularly when combined with other demographic or health-related variables.
Based on these observations, Pragnancies appears to be the most promising predictor among these variables due to its clear distinction between the two groups. Sleep could also contribute valuable information, while BMI and SoundSleep may play smaller roles unless they interact with other predictors in the model.
categorical_vars <- c("Gender", "FamilyDiabetes", "highBP", "Smoking", "Alcohol","RegularMedicine", "BPLevel", "UriationFreq", "Age")
data_categorical <- Diabetes_2019_clean %>% select(all_of(categorical_vars), Diabetic)
data_categorical_melt <- melt(data_categorical, id.vars = "Diabetic")
ggplot(data_categorical_melt, aes(x = value, fill = as.factor(Diabetic))) +
geom_bar(position = "dodge", alpha = 0.8) +
facet_wrap(~variable, scales = "free_x") +
labs(title = "Bar Plots of Categorical Predictors by Diabetic Status",
x = "Category",
y = "Count",
fill = "Diabetic")+
theme(
axis.text.x = element_text(angle = 20, hjust = 1)
)
For Gender, there seems to be little distinction
between male and female groups in diabetic status, difference that we
see must be caused by the fact that males are represented more in the
dataset. suggesting it might not be a significant predictor on its
own.
In the case of FamilyDiabetes, having a family
history of diabetes is clearly more common among those with diabetes,
making this variable a strong predictor.
highBP shows a strong relationship with diabetes
status. Individuals with high blood pressure are more likely to be
diabetic, indicating its potential predictive power.
Smoking does not show a stark difference between
diabetic and non-diabetic groups, suggesting limited usefulness.
Similarly, Alcohol consumption does not exhibit a clear
pattern in this dataset.
RegularMedicine stands out as a strong predictor.
Those taking regular medication are more likely to be diabetic, likely
reflecting existing health conditions.
For BPLevel, higher blood pressure levels
(categories like “high”) appear to correlate strongly with diabetes.
This variable is likely important for the model. highBP
might not be significant paired with this variable.
UriationFreq (frequent urination) and
Age provide additional useful information.
UriationFreq seems to have same distribution of diabetic
and non-diabetic patients, but Age on the other hand should
be good predictor patients 60+ are more likely to have
diabetes.
From this analysis, key predictors to include in the model are
FamilyDiabetes, highBP,
RegularMedicine, BPLevel and Age,
as they show clear distinctions between diabetic and non-diabetic
groups. Variables like Smoking, Alcohol,
UriationFreq and possibly Gender may be less
informative.
model <- glm(Diabetic~BMI+Sleep+SoundSleep+Pregnancies+FamilyDiabetes+RegularMedicine+BPLevel+highBP+Age, family = "binomial", data = Diabetes_2019_clean)
| Coefficient | Estimate | Std. Error | z value | Pr(> |
|---|---|---|---|---|
| (Intercept) | -2.8314 | 0.9446 | -3.0000 | 0.0027 ** |
| BMI | 0.0130 | 0.0200 | 0.6490 | 0.5161 |
| Sleep | -0.1182 | 0.1064 | -1.1110 | 0.2666 |
| SoundSleep | 0.2563 | 0.0816 | 3.1410 | 0.0017 ** |
| Pregnancies | 0.3063 | 0.1110 | 2.7600 | 0.0058 ** |
| FamilyDiabetes (yes) | 1.3041 | 0.2295 | 5.6830 | <0.0001 *** |
| RegularMedicine (yes) | 2.3301 | 0.2480 | 9.3940 | <0.0001 *** |
| BPLevel (High) | 0.6392 | 1.1866 | 0.5390 | 0.5901 |
| BPLevel (low) | -16.2072 | 736.1063 | -0.0220 | 0.9824 |
| BPLevel (Low) | -15.5195 | 2231.3110 | -0.0070 | 0.9945 |
| BPLevel (normal) | -1.1656 | 0.3426 | -3.4020 | 0.0007 *** |
| BPLevel (normal) | -12.3844 | 3956.1804 | -0.0030 | 0.9975 |
| highBP (yes) | -0.3891 | 0.3452 | -1.1270 | 0.2597 |
| Age (50–59) | 0.2146 | 0.3053 | 0.7030 | 0.4821 |
| Age (60 or older) | 1.5555 | 0.3468 | 4.4850 | <0.0001 *** |
| Age (less than 40) | -1.2360 | 0.2999 | -4.1210 | <0.0001 *** |
predicted_prob <- predict(model, type = "response")
predicted_class <- ifelse(predicted_prob > 0.51, 1, 0)
conf_matrix <- table(Predicted = predicted_class, Actual = Diabetes_2019_clean$Diabetic)
success_rate <- mean(predicted_class == Diabetes_2019_clean$Diabetic)
After building model with these 9 predictors: BMI,
Sleep, SoundSleep , Pregnancies,
FamilyDiabetes , highBP,
RegularMedicine, BPLevel, Age. we
can see that BMI, Sleep and
highBP have low significance in the model. That could be
because they have high correlation or aren’t direcly contributing to
diabetes, in case of highBP as I mentioned earlier,
BPLevel provides same information.
I calculated success rate of this model and it’s 0.8660, if we exclude
those those variables, success rate increases to 0.8766, but if we keep
sleep variable it gives better success rate 0.8871, so we will be
keeping 7 variables: Sleep, SoundSleep ,
Pregnancies, FamilyDiabetes,
RegularMedicine, BPLevel,
Age.
model <- glm(Diabetic~Sleep+SoundSleep+Pregnancies+FamilyDiabetes+RegularMedicine+BPLevel+Age, family = "binomial", data = Diabetes_2019_clean)
\[ \text{Log-Odds} = -2.64803 - 0.12460 \cdot \text{Sleep} + 0.24337 \cdot \text{SoundSleep} + 0.30791 \cdot \text{Pregnancies} \\ + 1.32986 \cdot \text{FamilyDiabetes}_{\text{yes}} + 2.25135 \cdot \text{RegularMedicine}_{\text{yes}} + 0.72135 \cdot \text{BPLevel}_{\text{High}} \\- 15.96476 \cdot \text{BPLevel}_{\text{low}} - 15.26225 \cdot \text{BPLevel}_{\text{Low}} - 0.91216 \cdot \text{BPLevel}_{\text{normal}} \\- 12.69628 \cdot \text{BPLevel}_{\text{normal}} + 0.23677 \cdot \text{Age}_{50-59} + 1.55498 \cdot \text{Age}_{60 \text{ or older}} \\- 1.22500 \cdot \text{Age}_{\text{less than 40}} \]
\[ \text{Odds} = e^{\text{Log-Odds}} = ... \]
The intercept (-2.64803) represents the baseline log-odds when all predictors are zero. In this case, the odds of being diabetic are approximately \(e^{-2.64803} \approx 0.071\) , corresponding to a probability of about 7.1%
For sleep, a one-unit increase reduces the log-odds by 0.12460. The odds are multiplied by \(e^{-0.12460} \approx 0.882\) , which means an 11.8% reduction in the odds.
Sound sleep has a positive effect, with each additional unit increasing the log-odds by 0.24337. This translates to \(e^{0.24337} \approx 1.275\) , or a 27.5% increase in the odds.
Each additional pregnancy raises the log-odds by 0.30791, meaning the odds are multiplied by \(e^{0.30791} \approx 1.361\) , an increase of 36.1%.
Having a family history of diabetes significantly raises the log-odds by 1.32986, increasing the odds by nearly four times ( \(e^{1.32986} \approx 3.78\) ).
Taking regular medicine has an even greater effect, with the log-odds increasing by 2.25135, which multiplies the odds by about \(e^{2.25135} \approx 9.49\) , an increase of 849%.
Blood pressure level has mixed effects. High BP increases the odds by $e^{0.72135} \approx 2.057 $, or 105.7%. Low BP, on the other hand, drastically reduces the log-odds (-15.96476), making the probability effectively zero. Normal BP decreases the odds by \(e^{-0.91216} \approx 0.402\) , a reduction of 59.8%.
Age categories also play a role. Individuals aged 50–59 see a slight increase in odds (+26.7%), while those 60 or older experience a significant rise (+373%). Conversely, being younger than 40 decreases the odds substantially by about 70.4%.
predicted_prob <- predict(model, type = "response")
predicted_class <- ifelse(predicted_prob > 0.51, 1, 0)
conf_matrix <- table(Predicted = predicted_class, Actual = Diabetes_2019_clean$Diabetic)
success_rate <- mean(predicted_class == Diabetes_2019_clean$Diabetic)
| Predicted \ Actual | 0 | 1 |
|---|---|---|
| 0 | 635 | 58 |
| 1 | 49 | 207 |
Success rate is 0.8871 and it’s calculated by formula: \[ \text{Success Rate} = \frac{\text{True Positives} + \text{True Negatives}}{\text{Total Observations}} \]
We have 49 false positive case, these are cases where the model incorrectly predicts someone as diabetic, even though they are non-diabetic. False positives might lead to unnecessary stress and medical interventions, such as additional tests or treatments, which could be costly and anxiety-inducing for the patient.
On the other hand we have 58 false negatives. These are cases where the model incorrectly predicts someone as non-diabetic, even though they are diabetic. False negatives are more concerning because diabetic patients might go undiagnosed and miss critical early interventions. This could lead to severe health complications like cardiovascular issues, kidney disease, or vision problems.
While the cutoff of 0.51 provides the best success rate for the model, in medical contexts, as mentioned previously it is often beneficial to prioritize minimizing false negatives over false positives due to the higher medical risk associated with undiagnosed diabetes.
predicted_prob <- predict(model, type = "response")
predicted_class <- ifelse(predicted_prob > 0.35, 1, 0)
conf_matrix <- table(Predicted = predicted_class, Actual = Diabetes_2019_clean$Diabetic)
success_rate <- mean(predicted_class == Diabetes_2019_clean$Diabetic)
| Predicted \ Actual | 0 | 1 |
|---|---|---|
| 0 | 383 | 77 |
| 1 | 108 | 189 |
In this case we get success rate of 0.8608 which is lower than one 0.51 cutoff success rate, but since we prioritize minimizing risk of false negatives it shoud be the best in this context.
Success rate of the null model is 0.8608, which means that without any data or predictors, we could achive that success rate by just giving patients negative diagnosis, let’s check is our mode is significantly better that null model.
Null Hypothesis ( H_0 ): The model’s success rate is equal to the null success rate: p = 0.7205
Alternative Hypothesis ( H_a ): The model’s success rate is better than the null success rate: p > 0.7205
We will again use a one-sided z-test for proportions:
p_hat <- success_rate
n <- nrow(Diabetes_2019_clean)
p_0 <- 1 - sum(Diabetes_2019_clean$Diabetic) / n
z <- (p_hat - p_0) / sqrt((p_0 * (1 - p_0)) / n)
p_value <- 1 - pnorm(z)
We get: \(z = 9.6255\). The z-score is extremely high, indicating that the observed value is far from the null hypothesis value in terms of standard deviations. Given this, the corresponding p-value will be incredibly small, effectively close to 0.
We can confidently reject the null hypothesis. Model’s success rate of 86.08% is statistically significantly better than the default success rate of 72.05%.
At a cutoff value of 0.35, the model focuses on reducing false negatives (failing to detect diabetes) while allowing for more false positives. This approach aligns well with medical priorities, as identifying potential diabetes cases for further testing is critical.
First and Second models that we built are quite different because of the data they were built on. The first model used the DiabetesPima dataset, which focused on physical health metrics like glucose levels, BMI, and blood pressure. These variables gave a direct look at the body’s condition, making it easier to connect them to diabetes risk. For example, glucose and BMI stood out as strong predictors, which isn’t surprising since they’re well-known risk factors for diabetes. Overall, this model was pretty straightforward, using just a few continuous variables to predict outcomes.
On the other hand, the second model, based on the Diabetes 2019 dataset, included a mix of health, lifestyle, and demographic information. It looked at things like family history of diabetes, whether someone takes regular medicine, how much they sleep, and their blood pressure levels. This dataset also included a lot of categories, like different age groups and behaviors like smoking or junk food consumption. As a result, the model took a broader approach, capturing not just physical health but also how someone’s habits and background affect their risk
Some variables overlapped between the two models, like
BMI and pregnancies, which seemed important in
both cases. I didn’t end up using BMI in second model cause it wasn’t
really helping model to predict well, buy pregnancies was
used in both models. pregnancies had a stronger effect in
model 2 - coefficient for this predictor was 0.30791 compared to
0.133371 in model 1, possibly because the second dataset included more
context, like age groups and family history. The first model was
simpler(3 predictors) and focused only on health metrics, while second
one(7 predictors, some with multiple categories) brought in lifestyle
and behavior, adding complexity and a more complete picture of risk
factors.
The populations were also very different. The first dataset focused only on Pima Indian women over 21, whereas seoond included a more mixed group, with both men and women, various age groups, and lifestyle factors. This difference in who the data represented likely influenced which variables were important in each model.
The second model performed a bit better because it included more varied predictors, like family diabetes history and regular medicine use. These kinds of factors captured risks that the seond dataset couldn’t account for. Essentially, Pima dataset showed how physical health drives diabetes risk, while Diabetes Dataset 2019 highlighted how combining health, habits, and background gives a fuller understanding of what increases someone’s chances of developing diabetes.
The Behavioral Risk Factor Surveillance System (BRFSS) is the nation’s premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world.
In this part we will be analyzing some chunk of that data and building model on it, to predict diabetes.
behaviour_data <- read.csv("diabetesTask3Complete.csv")
mystery_data <- read.csv("diabetesTask3Mystery.csv")
behaviour_clean <- behaviour_data
These density plots show how continuous predictors like BMI, MentHlth and PhysHlth are distributed across diabetic and non-diabetic groups.
continuous_vars <- c("BMI", "MentHlth", "PhysHlth")
data_continuous <- behaviour_clean %>% dplyr::select(all_of(continuous_vars), Diabetes_binary)
data_continuous_melt <- melt(data_continuous, id.vars = "Diabetes_binary")
ggplot(data_continuous_melt, aes(x = value, fill = as.factor(Diabetes_binary))) +
geom_density(alpha = 0.5) +
facet_wrap(~variable, scales = "free") +
labs(title = "Density Plots of Continuous Predictors by Diabetic Status",
x = "Value",
y = "Density",
fill = "Diabetic")
Diabetic individuals tend to have higher BMI values
compared to non-diabetic individuals, with a noticeable shift in the
distribution. Despite this, there is still significant overlap between
the two groups, suggesting BMI could be a useful predictor
but may require additional context or interaction with other variables
to improve its discriminatory power.
The distribution of the number of mentally unhealthy days is
heavily skewed towards zero for both groups, with almost identical
patterns between diabetic and non-diabetic individuals. This overlap
indicates that MentHlth should not be a strong predictor of
diabetes.
The distribution of physically unhealthy days shows a slightly higher proportion of diabetic individuals reporting more unhealthy days compared to non-diabetic individuals, particularly in the tail of the distribution. However, the overall overlap between the two groups is significant, meaning its predictive value might be limited unless used in combination with other variables.
Based on these observations, variables BMI appears to be
the most promising predictor among the continuous variables, given its
shift in distribution between groups. PhysHlth may provide
some additional information, while MentHlth is less likely
to contribute meaningfully to diabetes prediction on its own.
categorical_vars <- c("HighBP", "HighChol", "Sex", "CholCheck", "Smoker", "Stroke", "HeartDiseaseorAttack", "PhysActivity", "Fruits", "Veggies", "HvyAlcoholConsump", "AnyHealthcare", "NoDocbcCost", "GeneralHealth", "DiffWalk")
data_categorical <- behaviour_clean %>% select(all_of(categorical_vars), Diabetes_binary)
data_categorical_melt <- melt(data_categorical, id.vars = "Diabetes_binary")
ggplot(data_categorical_melt, aes(x = value, fill = as.factor(Diabetes_binary))) +
geom_bar(position = "dodge", alpha = 0.8) +
facet_wrap(~variable, scales = "free_x") +
labs(
title = "Bar Plots of Categorical Predictors by Diabetic Status",
x = "Category",
y = "Count",
fill = "Diabetic"
) +
theme(
axis.text.x = element_text(angle = 20, hjust = 1)
)
From the HighBP plot, we observe that individuals
with diabetes are more likely to have high blood pressure compared to
those without diabetes. Although there is overlap between the groups,
this variable appears to have a strong association with diabetes and
should be included as a predictor.
For HighChol, a similar pattern is observed, with
diabetics more frequently reporting high cholesterol levels. Despite
some overlap, this predictor shows potential for distinguishing diabetic
from non-diabetic individuals.
The Sex variable shows minimal difference in
distribution between diabetics and non-diabetics, indicating it may not
be a strong predictor on its own. However, it could still interact with
other variables in the model.
CholCheck shows a high proportion of individuals in
both groups having had their cholesterol checked, with slightly higher
rates among diabetics. While not a primary predictor, it might still
contribute to the model in combination with other factors.
Smoker displays a relatively even distribution between diabetics
and non-diabetics, suggesting it may not be a strong standalone
predictor. However, it could add value when combined with related
variables like HeartDiseaseorAttack or
Stroke.
Stroke and HeartDiseaseorAttack both demonstrate
clear distinctions between the groups, with diabetics showing higher
frequencies in the affirmative categories. These variables appear to be
strong predictors and should be included in the model.
PhysActivity reveals that diabetics are less likely
to engage in physical activity than non-diabetics. This difference makes
it a potentially valuable predictor.
Fruits and Veggies consumption shows
slight differences between the groups, with diabetics slightly less
likely to consume fruits or vegetables daily. These variables could
provide modest contributions to the model.
HvyAlcoholConsump indicates low levels of heavy
drinking across both groups, with minimal differences. This suggests it
may not be a key predictor unless combined with other lifestyle
factors.
AnyHealthcare and NoDocbcCost reveal
interesting socioeconomic patterns. Diabetics are more likely to have
healthcare coverage but also face cost-related barriers to seeing a
doctor. These variables could add meaningful context to the
model.
GeneralHealth highlights a strong gradient, with
diabetics more likely to rate their health as poor or fair compared to
non-diabetics. This variable shows significant promise as a
predictor.
DiffWalk, which measures difficulty walking, is much
more common among diabetics, making it a potentially strong
predictor.
ggplot(behaviour_clean, aes(Income, fill = as.factor(Diabetes_binary))) +
geom_bar(position = "dodge") +
labs(title = "Income Bar Plot by Diabetic Status",
x = "Category",
y = "Count",
fill = "Diabetic")+
theme(
axis.text.x = element_text(angle = 20, hjust = 1)
)
From the Income plot, we observe notable trends in the
distribution of income categories by diabetic status. Individuals with
diabetes are more concentrated in the lower-income categories (e.g.,
“Less than $10K” to “Less than $35K”) compared to those without
diabetes. In higher-income brackets, especially “More than $75K,”
individuals without diabetes are more prevalent, showing a significant
disparity.
This pattern suggests that income might be an important socioeconomic predictor of diabetes, as lower income levels could correlate with limited access to healthcare, nutritious food, or other factors influencing health. Including Income as a variable in the analysis could provide valuable insights into the relationship between socioeconomic status and diabetes risk.
Based on these observations, Income,
Stroke, HeartDiseaseorAttack,
HighBP, and GeneralHealth appear to be the
most promising predictors due to their clear distinctions between the
groups. PhysActivity, DiffWalk, and
socioeconomic factors like AnyHealthcare and
NoDocbcCost also show potential, while variables like
Sex and HvyAlcoholConsump may play smaller
roles unless they interact with other predictors.
model <- glm(
Diabetes_binary ~ PhysHlth+BMI+Income+Stroke+HeartDiseaseorAttack+HighBP+GeneralHealth+PhysActivity+DiffWalk+AnyHealthcare+NoDocbcCost,
data = behaviour_clean,
family = "binomial"
)
| Coefficient | Estimate | Std. Error | z value | Pr(> |
|---|---|---|---|---|
| (Intercept) | -3.0962 | 0.2818 | -10.986 | <0.0001 *** |
| PhysHlth | -0.0042 | 0.0040 | -1.044 | 0.2966 |
| BMI | 0.0643 | 0.0051 | 12.557 | <0.0001 *** |
| Income (Less than $15K) | 0.1320 | 0.1828 | 0.722 | 0.4702 |
| Income (Less than $20K) | 0.2161 | 0.1730 | 1.249 | 0.2117 |
| Income (Less than $25K) | 0.1076 | 0.1657 | 0.649 | 0.5163 |
| Income (Less than $35K) | 0.1104 | 0.1611 | 0.685 | 0.4931 |
| Income (Less than $50K) | -0.0807 | 0.1570 | -0.514 | 0.6071 |
| Income (Less than $75K) | -0.1045 | 0.1554 | -0.673 | 0.5012 |
| Income (More than $75K) | -0.3718 | 0.1514 | -2.455 | 0.0141 * |
| Stroke | 0.3021 | 0.1361 | 2.221 | 0.0264 * |
| HeartDiseaseorAttack | 0.6068 | 0.0920 | 6.593 | <0.0001 *** |
| HighBP (not high BP) | -0.9972 | 0.0632 | -15.771 | <0.0001 *** |
| GeneralHealth (fair) | 2.0831 | 0.1477 | 14.106 | <0.0001 *** |
| GeneralHealth (good) | 1.5847 | 0.1285 | 12.334 | <0.0001 *** |
| GeneralHealth (poor) | 1.9638 | 0.1875 | 10.473 | <0.0001 *** |
| GeneralHealth (very good) | 0.8748 | 0.1298 | 6.739 | <0.0001 *** |
| PhysActivity | -0.0528 | 0.0702 | -0.752 | 0.4519 |
| DiffWalk | 0.1476 | 0.0862 | 1.711 | 0.0871 . |
| AnyHealthcare | 0.3146 | 0.1527 | 2.061 | 0.0393 * |
| NoDocbcCost | -0.2067 | 0.1085 | -1.905 | 0.0568 . |
train_probs <- predict(model, behaviour_clean, type = "response")
train_preds <- ifelse(train_probs > 0.5, 1, 0)
train_conf_matrix <- table(Predicted = train_preds, Actual = behaviour_clean$Diabetes_binary)
success_rate <- mean(train_preds == behaviour_clean$Diabetes_binary)
model <- glm(
Diabetes_binary ~ BMI+Income+Stroke+HeartDiseaseorAttack+HighBP+GeneralHealth+AnyHealthcare+NoDocbcCost,
data = behaviour_clean,
family = "binomial"
)
\[ \text{Log-Odds} = -3.15670 + 0.06559 \cdot \text{BMI} + 0.14363 \cdot \text{Income}_{\text{Less than \$15K}} + 0.21487 \cdot \text{Income}_{\text{Less than \$20K}} \\ + 0.10205 \cdot \text{Income}_{\text{Less than \$25K}} + 0.10489 \cdot \text{Income}_{\text{Less than \$35K}} -0.09945 \cdot \text{Income}_{\text{Less than \$50K}} \\-0.12304 \cdot \text{Income}_{\text{Less than \$75K}} -0.39908 \cdot \text{Income}_{\text{More than \$75K}} + 0.31278 \cdot \text{Stroke} + 0.61987 \cdot \text{HeartDiseaseorAttack} \\ -1.00349 \cdot \text{HighBP}_{\text{not high BP}} + 2.09343 \cdot \text{GeneralHealth}_{\text{fair}} + 1.59329 \cdot \text{GeneralHealth}_{\text{good}} + 1.96904 \cdot \text{GeneralHealth}_{\text{poor}} \\ + 0.87719 \cdot \text{GeneralHealth}_{\text{very good}} + 0.31719 \cdot \text{AnyHealthcare} -0.20583 \cdot \text{NoDocbcCost} \]
\[ \text{Odds} = e^{\text{Log-Odds}} = ... \]
The intercept of -3.15670 represents the baseline log-odds when all predictors are zero. In this case, the odds of being diabetic are approximately \(e^{-3.15670} \approx 0.043\), corresponding to a probability of about 4.3%.
For BMI, a one-unit increase in BMI raises the log-odds by 0.06559. The odds are multiplied by \(e^{0.06559} \approx 1.068\), meaning the odds increase by 6.8%.
Income categories show various effects. For individuals with an income “Less than $15K”, the log-odds increase by 0.14363, which means the odds are multiplied by \(e^{0.14363} \approx 1.154\), an increase of 15.4%. For those “Less than $20K”, the increase is 0.21487, which translates to a \(e^{0.21487} \approx 1.239\), or a 23.9% increase in the odds. The effect decreases for other income categories, with the most significant reduction seen in the “More than $75K” category, where the log-odds decrease by 0.39908, leading to \(e^{-0.39908} \approx 0.671\), or a 32.9% decrease in the odds.
For stroke, the log-odds increase by 0.31278, meaning the odds are multiplied by \(e^{0.31278} \approx 1.367\), a 36.7% increase in the odds.
For heart disease or attack, a positive effect is observed, with the log-odds increasing by 0.61987. The odds are multiplied by \(e^{0.61987} \approx 1.858\), representing an 85.8% increase in the odds.
For individuals with high blood pressure (not “high BP”), the log-odds decrease by -1.00349. The odds are multiplied by \(e^{-1.00349} \approx 0.367\), a 63.3% decrease in the odds
For general health, the “fair” category increases the log-odds by 2.09343, multiplying the odds by \(e^{2.09343} \approx 8.1\), or an 810% increase. “Good” general health increases the odds by \(e^{1.59329} \approx 4.9\), a 389% increase. The “poor” category has a similar effect, with an odds increase of \(e^{1.96904} \approx 7.2\), or a 620% increase. “Very good” general health increases the odds by \(e^{0.87719} \approx 2.4\), or a 140% increase.
Having any healthcare increases the odds by 0.31719, which multiplies the odds by \(e^{0.31719} \approx 1.373\), or a 37.3% increase.
For individuals who do not have a doctor due to cost, the log-odds decrease by -0.20583. The odds are multiplied by \(e^{-0.20583} \approx 0.814\), a 18.6% decrease in the odds.
train_probs <- predict(model, behaviour_clean, type = "response")
train_preds <- ifelse(train_probs > 0.50, 1, 0)
train_conf_matrix <- table(Predicted = train_preds, Actual = behaviour_clean$Diabetes_binary)
success_rate <- mean(train_preds == behaviour_clean$Diabetes_binary)
| Predicted \ Actual | 0 | 1 |
|---|---|---|
| 0 | 2138 | 747 |
| 1 | 854 | 2261 |
Success rate of the model is 0.7332, when we use 0.5 as cutoff. There are 747 false positive cases, where the model mistakenly identifies non-diabetic individuals as diabetic. Additionally, there are 854 false negative cases, where the model fails to detect diabetes in individuals who are actually diabetic. As we already mentioned this is particularly problematic as undiagnosed diabetes means patients may miss out on critical early interventions, increasing the risk of serious complications such as heart disease, kidney failure, and vision loss.
While the cutoff of 0.5 provides pretty reasonable success rate for the model, in medical contexts, as mentioned previously it is often beneficial to prioritize minimizing false negatives over false positives due to the higher medical risk associated with undiagnosed diabetes.
train_probs <- predict(model, behaviour_clean, type = "response")
train_preds <- ifelse(train_probs > 0.33, 1, 0)
train_conf_matrix <- table(Predicted = train_preds, Actual = behaviour_clean$Diabetes_binary)
success_rate <- mean(train_preds == behaviour_clean$Diabetes_binary)
| Predicted \ Actual | 0 | 1 |
|---|---|---|
| 0 | 1526 | 331 |
| 1 | 1466 | 2677 |
In this case we get success rate of 0.7005 which is worse than 0.5 cutoff success rate, but again since we prioritize minimizing risk of false negatives it shoud be the best in this context.
Success rate of the null model is 0.5013, which means that without any data or predictors, we could achive that success rate by just giving patients negative diagnosis, let’s check is our mode is significantly better that null model.
Null Hypothesis ( H_0 ): The model’s success rate is equal to the null success rate: p = 0.5013
Alternative Hypothesis ( H_a ): The model’s success rate is better than the null success rate: p > 0.5013
We will again use a one-sided z-test for proportions:
p_hat <- success_rate
n <- nrow(behaviour_clean)
p_0 <- sum(behaviour_clean$Diabetes_binary) / n
z <- (p_hat - p_0) / sqrt((p_0 * (1 - p_0)) / n)
p_value <- 1 - pnorm(z)
We get: z = 30.8549 . The z-score is extremely high, indicating that the observed value is significantly far from the null hypothesis value in terms of standard deviations. As a result, the corresponding p-value will again be very small, essentially close to 0.
This allows us to confidently reject the null hypothesis. The model’s success rate of 70.05% is statistically significantly better than the default success rate of 50.13%.
At a cutoff value of 0.33, the model prioritizes reducing false negatives (missing diabetes cases) at the cost of increasing false positives. This approach is in line with medical priorities, as detecting potential diabetes cases for further investigation is crucial.
Even though, out of all three models discused, BRFSS Model doesn’t
have highest success rate, it stands out from others, because it shifts
focus away from variables like Pregnancies, which were
prominent in Models 1 and 2, and instead incorporates a broader set of
socioeconomic and health-related factors. This broader scope makes Model
3 a more well-rounded tool for assessing diabetes risk, but this data
set also has some downsides which we will discuss later.
In terms of BMI, all three models include it, but its
role and influence vary significantly. In Model 1, BMI is a
straightforward and essential predictor of diabetes risk, with a
coefficient of 0.092828 , reflecting a direct relationship. Similarly,
Model 2 includes BMI, but its impact is less prominent
because other lifestyle factors, such as Sleep and
RegularMedicine, so we didn’t end up using it Model 2. In
Model 3, BMI has smaller coeficent than the first Pima
Model, its value is 0.06559 , suggesting it is still relevant but plays
a more minor role compared to socioeconomic factors like
Income or health conditions such as Heart Disease or
Stroke.
For example, consider two individuals with identical BMI
values of 30. In Model 1, the odds of diabetes are primarily driven by
their BMI and Glucose levels, leading to a straightforward
calculation. In Model 3, however, other factors like their income
bracket, history of Stroke or Heart Disease, and self-reported General
Health contribute more heavily to their risk profile.
Income is a particularly unique addition in Model 3.
Lower income levels (e.g., Less than $15K) significantly increase
diabetes risk, reflecting the broader societal impacts of economic
disadvantage on health. Similarly, poor General Health ratings (e.g.,
“fair” or “poor”) are strong predictors in Model 3, offering a nuanced
understanding of how perceived health relates to diabetes risk.
In contrast, Model 1’s narrow focus on physical health metrics, like
BMI, limits its ability to capture these broader
influences. Model 2 improves upon this by adding lifestyle factors like
Sleep and demographic variables, but it still lacks the
socioeconomic and general health dimensions that define Model 3. As I
mentioned above, BRFSS data set has some limitations, since it’s made of
randomly selected participants, who are contacted by phone - an
individual may have difficulty remembering events that occurred a long
time ago or the frequency of certain behaviors. Some respondents may
overreport socially desirable behaviors, while underreporting behaviors
they perceive to be less acceptable, so if in first two datasets data
was carefully crafted and sourced by mostly clinical trials, here it can
be highly subjective (Commonwealth
of Massachusetts)
By incorporating BMI alongside variables such as
Income and General Health, this model provides a richer and
more context-aware prediction. It shows that while BMI is
an important risk factor, diabetes is influenced by a complex interplay
of physical, lifestyle, and socioeconomic conditions.
mystery_data$PredictionProb <- predict(model, newdata = mystery_data, type = "response")
mystery_data$PredictionBinary <- ifelse(mystery_data$PredictionProb > 0.5, 1, 0)
mystery_data$PredictionCategory <- ifelse(
mystery_data$PredictionBinary == 1,
"predicted diabetes",
"predicted not diabetes"
)
output <- mystery_data[, c("ID", "PredictionProb", "PredictionBinary", "PredictionCategory")]
write.csv(output, "predictions_mystery_data.csv", row.names = FALSE)
Mining anonymous health data, like the datasets we used to build our models, raises both opportunities and ethical concerns. On the positive side, this type of research can identify critical predictors of diseases like diabetes, as seen in our models. For instance, BMI, age, and lifestyle factors like sleep or blood pressure provided actionable insights that could lead to targeted interventions. By analyzing these predictors, healthcare systems can better allocate resources, design prevention programs, and implement early diagnostic strategies, potentially reducing the burden of chronic diseases on individuals and the system.
However, even anonymized data is not immune to risks. A major concern is fingerprinting - the possibility that unique combinations of variables in a dataset, such as specific age ranges, rare health conditions, or unusual income categories, could allow for re-identification of individuals when combined with other datasets. For example, a person with high blood pressure, poor general health, and a certain income level could be indirectly linked back to an individual, even if explicit identifiers like names or addresses are removed. This breach of privacy undermines the principle of anonymity and could lead to stigma, discrimination, or even exploitation, particularly by insurers or employers.
Additionally, questions of consent and equity must be addressed. Were participants aware their data would be used in this way, and do they benefit from the findings? If our models were developed into commercial tools, the cost of using these predictive insights might exclude vulnerable populations, ironically those most represented in the datasets. These financial barriers could widen health disparities instead of addressing them. Ensuring that such data mining efforts remain ethical requires robust anonymization techniques, transparency in data usage, and equitable application of research outcomes, ensuring the benefits extend to all who contribute their data.
This project explored how data science can uncover key predictors of diabetes risk using three very different datasets. Each dataset told its own story, from focusing on physical health metrics like glucose and BMI in the Pima dataset to adding lifestyle and behavioral details in the Diabetes 2019 data, and finally incorporating socioeconomic and general health factors in the BRFSS dataset. Through these models, we saw how diabetes isn’t just about health metrics—it’s about the broader context of a person’s life.
The results showed clear patterns. BMI consistently played an important role across all datasets, but its impact varied depending on what other factors were included. In Model 3, the addition of income and general health variables revealed how socioeconomic status and self-perceived health influence diabetes risk. This makes it clear that diabetes isn’t just a medical issue; it’s tied to social and economic conditions, something the earlier models didn’t capture.
While the models provided valuable insights, there were challenges. Self-reported data, like in the BRFSS dataset, has limitations, and there’s always a risk of over-fitting with complex models. There’s also the ethical consideration of using health data responsibly, especially when data could unintentionally identify individuals despite anonymization. Balancing the potential benefits of these insights with the risks is a critical part of this kind of work.
This project shows how predictive modeling can connect data to real-world impacts. By improving these models and carefully considering the ethical challenges, we can better support diabetes prevention and care in ways that are both practical and fair.
Worldwide trends in diabetes since 1980 (2016) - https://pmc.ncbi.nlm.nih.gov/articles/PMC5081106/#
Pima Indians Diabetes Database - https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database
WHO GLOBAL REPORT ON DIABETES (2016) - https://iris.who.int/bitstream/handle/10665/204871/9789241565257_eng.pdf
Behavioral Risk Factor Surveillance System (BRFSS) Data - https://www.mass.gov/info-details/behavioral-risk-factor-surveillance-system-brfss-data#