Introduction
For my final statistics project, titled “Binary
Insights into Diabetes: Health Indicators and Implications,” I
aim to explore the intricate relationships between age, body mass index
(BMI), and the risk of developing diabetes. Diabetes remains a pressing
global health issue, and understanding its risk factors is critical for
prevention and management. Age is a significant determinant of diabetes
risk, with prevalence increasing markedly as individuals grow older.
Research shows that middle-aged and elderly populations are particularly
vulnerable, with diabetes rates spiking after the age of 45 and reaching
nearly 25% in those aged 65 and older. This trend reflects the
cumulative effects of aging on insulin resistance and metabolic health.
Additionally, BMI plays a pivotal role in diabetes risk across all age
groups. Higher BMI is strongly associated with an increased likelihood
of developing diabetes, with this effect being particularly pronounced
in younger adults. However, as age advances, the relative impact of BMI
on diabetes risk appears to diminish. This project will examine these
relationships using statistical methods to highlight how increasing age
and BMI interact to influence diabetes risk. By analyzing these factors,
the study seeks to provide actionable insights into how targeted
interventions can mitigate this growing health challenge.
Link to dataset:
https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset/data?select=diabetes_binary_5050split_health_indicators_BRFSS2015.csv
# Load necessary libraries
library(tidyverse)
## Warning: package 'tidyr' was built under R version 4.4.2
## Warning: package 'dplyr' was built under R version 4.4.2
## Warning: package 'lubridate' was built under R version 4.4.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
library(car)
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.4.2
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(caret)
## Warning: package 'caret' was built under R version 4.4.2
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 4.4.2
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(dplyr)
library(ggplot2)
Data Preparation
First, we load the dataset.
dataset <- read_delim("C:/Users/Akshay Dembra/Downloads/Stats_Selected_Dataset/diabetes_binary_5050split_health_indicators_BRFSS2015_1.csv", delim = ",")
## Rows: 70692 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (22): Diabetes_binary, HighBP, HighChol, CholCheck, BMI, Smoker, Stroke,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(dataset)
Analysis
Pair 1: BMI_Category vs. Diabetes Status
Visualization
# Create a new variable: BMI category
dataset <- dataset %>%
  mutate(BMI_Category = case_when(
    BMI < 18.5 ~ "Underweight",
    BMI < 25 ~ "Normal weight",
    BMI < 30 ~ "Overweight",
    TRUE ~ "Obese"
  ))
ggplot(dataset, aes(x = factor(Diabetes_binary), fill = factor(BMI_Category))) +
  geom_bar(position = "dodge") +
  labs(title = "Distribution of BMI Categories by Diabetes Status", x = "Diabetes Status (0 = No, 1 = Yes)", y = "Count")
Insight Gathered:
The grouped bar chart shows how many respondents fall into each BMI
category, split by diabetes status. It displays counts rather than
medians or outliers, so it supports only a comparison of category
frequencies. A higher count of obese respondents in the diabetes group
than in the non-diabetes group would indicate a potential link between
higher BMI and diabetes risk.
Correlation
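# Note: factor() orders the category labels alphabetically
# (Normal weight, Obese, Overweight, Underweight), not from lowest to highest BMI
correlation_bmi_diabetes <- cor(as.numeric(factor(dataset$BMI_Category)), dataset$Diabetes_binary, method = "spearman")
print(correlation_bmi_diabetes)
## [1] 0.04659539
Because the category codes follow alphabetical order rather than the underweight-to-obese ordering, this coefficient does not measure a monotonic relationship between BMI level and diabetes status and should be read with caution.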
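One way to make the rank correlation respect the BMI scale is to encode the categories as an ordered factor first. The following is a minimal sketch on made-up values (the vectors `bmi` and `diabetes` are hypothetical and are not drawn from the dataset); it only illustrates how the two codings differ.

```r
# Hypothetical toy data for illustration only (not from the BRFSS dataset)
bmi      <- c(17, 22, 24, 27, 29, 31, 35, 40)
diabetes <- c(0, 0, 0, 1, 0, 1, 1, 1)

# right = FALSE gives the intervals [0, 18.5), [18.5, 25), [25, 30), [30, Inf)
bmi_category <- cut(
  bmi,
  breaks = c(0, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Normal weight", "Overweight", "Obese"),
  right = FALSE,
  ordered_result = TRUE
)

# Integer codes follow the BMI scale: 1 = Underweight, ..., 4 = Obese
bmi_code <- as.integer(bmi_category)

# Alphabetical coding, as produced by factor() on the character labels
alpha_code <- as.integer(factor(as.character(bmi_category)))

rho_ordered <- cor(bmi_code, diabetes, method = "spearman")
rho_alpha   <- cor(alpha_code, diabetes, method = "spearman")
c(ordered = rho_ordered, alphabetical = rho_alpha)
```

On this toy example the ordered coding gives a clearly larger coefficient than the alphabetical coding, which is the pattern to expect whenever the outcome rises with BMI.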
Pair 2: Age group vs. High Blood Pressure
Visualization
# Create a new variable: Age group
# Age is a 13-level BRFSS category code (1 = 18-24, 2 = 25-29, ..., 13 = 80+),
# not age in years; labels below follow that coding (verify against the codebook)
dataset <- dataset %>%
  mutate(Age_Group = case_when(
    Age <= 3 ~ "18-34",
    Age <= 6 ~ "35-49",
    Age <= 9 ~ "50-64",
    Age <= 12 ~ "65-79",
    TRUE ~ "80+"
  ))
ggplot(dataset, aes(x = factor(HighBP), fill = factor(Age_Group))) +
  geom_bar(position = "dodge") +
  labs(title = "Distribution of Age Groups by High Blood Pressure", x = "High Blood Pressure (0 = No, 1 = Yes)", y = "Count")
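As I understand the BRFSS 2015 codebook, `Age` is a 13-level code for five-year age bands rather than age in years. The sketch below maps codes to bands; the lookup vector `age_labels` and the helper `age_band()` are hypothetical names introduced here, and the labels are an assumption to check against the codebook.

```r
# Assumed mapping of BRFSS 2015 age codes to five-year bands;
# verify against the official codebook before relying on it.
age_labels <- c("18-24", "25-29", "30-34", "35-39", "40-44", "45-49",
                "50-54", "55-59", "60-64", "65-69", "70-74", "75-79", "80+")

# Hypothetical helper: convert integer codes 1..13 to an ordered factor
age_band <- function(age_code) {
  factor(age_labels[age_code], levels = age_labels, ordered = TRUE)
}

example_bands <- age_band(c(1, 9, 13))
example_bands
```

Keeping the bands as an ordered factor means plots and tables list them from youngest to oldest instead of alphabetically.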
ANOVA Test
Null Hypothesis
Null hypothesis (H0): The mean BMI is the same across the general health categories.
Conduct ANOVA
We will perform an ANOVA test to evaluate this hypothesis:
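# Perform ANOVA test
# Note: GenHlth is stored as a number, so aov() fits a single linear trend (Df = 1 below).
# Use factor(GenHlth) to compare the mean BMI of each category separately.
anova_result <- aov(BMI ~ GenHlth, data = dataset)
summary(anova_result)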
##                Df  Sum Sq Mean Sq F value Pr(>F)
## GenHlth         1  256739  256739    5465 <2e-16 ***
## Residuals   70690 3320815      47
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation of Results
P-value: The ANOVA summary reports p < 2e-16, far below 0.05, so we reject the null hypothesis: BMI differs significantly with general health status.
F-statistic: F = 5465 on 1 and 70,690 degrees of freedom, so the variation in BMI explained by GenHlth is large relative to the residual variation. The single degree of freedom shows that GenHlth entered the model as a numeric score, so this is a test of a linear trend in BMI along the health scale rather than a comparison of each category's mean.
Conclusion
Since p < 0.05, there is enough evidence to conclude that BMI varies with general health status. With more than 70,000 observations even small differences reach statistical significance, so the size of the differences should be examined alongside the p-value.
These results can help health professionals understand how general health status relates to BMI and guide targeted interventions or further research into specific health categories.
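To compare the mean of each category, the grouping variable has to be a factor. The sketch below uses simulated data (the data frame `toy` and its columns are hypothetical, and a 1 to 5 health rating is assumed) to show how the degrees of freedom change between a numeric and a factor predictor.

```r
set.seed(1)

# Simulated data for illustration only: a 1-5 health rating and a BMI-like outcome
toy <- data.frame(gen_hlth = rep(1:5, each = 40))
toy$bmi <- 25 + 1.2 * toy$gen_hlth + rnorm(nrow(toy), sd = 4)

# Numeric predictor: a single slope, so the term uses 1 degree of freedom
fit_numeric <- aov(bmi ~ gen_hlth, data = toy)

# Factor predictor: one mean per category, so the term uses 5 - 1 = 4 degrees of freedom
fit_factor <- aov(bmi ~ factor(gen_hlth), data = toy)

df_numeric <- summary(fit_numeric)[[1]]$Df[1]
df_factor  <- summary(fit_factor)[[1]]$Df[1]
c(numeric = df_numeric, factor = df_factor)
```

With the factor version, a follow-up such as `TukeyHSD(fit_factor)` can then indicate which pairs of categories differ.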
ANOVA Boxplot
To visualize differences in BMI across different general health categories, we can use a boxplot:
# Boxplot for BMI by General Health
# factor() gives one box per health category instead of a single box over a continuous axis
ggplot(dataset, aes(x = factor(GenHlth), y = BMI)) +
  geom_boxplot() +
  labs(title = "BMI by General Health Category", x = "General Health", y = "BMI")
Regression Plot
To visualize the relationship between Age and BMI, we can use a scatter plot with a regression line. Age here is the survey's ordinal age-category code rather than age in years, so the x-axis shows category codes:
# Scatter plot with regression line
ggplot(dataset, aes(x = Age, y = BMI)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", col = "blue") +
  labs(title = "Relationship between Age and BMI", x = "Age (category code)", y = "BMI")
## `geom_smooth()` using formula = 'y ~ x'
Logistic Regression Model
The logistic regression model was fitted to predict the probability of having diabetes based on the selected health indicators. Below are the coefficients from the model:
# Build a Generalized Linear Model (Logistic Regression)
# Predicting diabetes status based on BMI, physical activity, and smoking status
glm_model <- glm(Diabetes_binary ~ BMI + PhysActivity + Smoker,
                 data = dataset, family = binomial)
# Summary of the model
summary(glm_model)
##
## Call:
## glm(formula = Diabetes_binary ~ BMI + PhysActivity + Smoker,
## family = binomial, data = dataset)
##
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -2.669419   0.045274  -58.96   <2e-16 ***
## BMI           0.096964   0.001381   70.20   <2e-16 ***
## PhysActivity -0.502127   0.017771  -28.25   <2e-16 ***
## Smoker        0.326924   0.015985   20.45   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 98000 on 70691 degrees of freedom
## Residual deviance: 89857 on 70688 degrees of freedom
## AIC: 89865
##
## Number of Fisher Scoring iterations: 4
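The coefficients above are on the log-odds scale, which is hard to read directly. As a rough sketch, the block below copies the reported estimates and standard errors into plain vectors (the names `coefs` and `std_err` are introduced here for illustration) and converts them to odds ratios with Wald 95% intervals.

```r
# Estimates and standard errors copied from the model summary (log-odds scale)
coefs   <- c(BMI = 0.096964, PhysActivity = -0.502127, Smoker = 0.326924)
std_err <- c(BMI = 0.001381, PhysActivity = 0.017771, Smoker = 0.015985)

# Exponentiating a log-odds coefficient gives an odds ratio
odds_ratio <- exp(coefs)

# Wald 95% limits, computed on the log-odds scale and then exponentiated
ci_lower <- exp(coefs - qnorm(0.975) * std_err)
ci_upper <- exp(coefs + qnorm(0.975) * std_err)

or_table <- round(cbind(odds_ratio, ci_lower, ci_upper), 3)
or_table
```

Read this way, each additional BMI unit multiplies the odds of diabetes by roughly 1.10, physical activity is associated with about 40% lower odds, and smoking with about 39% higher odds, holding the other predictors fixed. With the fitted object available, `exp(coef(glm_model))` gives the same ratios along with the intercept.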
# Model diagnostics - Checking for multicollinearity using Variance Inflation Factor (VIF)
vif(glm_model)
##          BMI PhysActivity       Smoker
##     1.010116     1.013019     1.004818
# Diagnose the model by checking residuals
# Model diagnostics - Residuals analysis
# Store the deviance residuals under a name that does not shadow the residuals() function
dev_resid <- residuals(glm_model, type = "deviance")
plot(dev_resid, main = "Residuals Plot", ylab = "Deviance Residuals", xlab = "Index")
abline(h = 0, col = "red")
Hypothesis 1: Smoking Status and Diabetes Prevalence
Null Hypothesis (H0): There is no difference in diabetes prevalence between smokers and non-smokers.
Neyman-Pearson Framework
Test: Two-proportion z-test
Alpha Level (Type I Error): 0.05
Power (1 - Type II Error): 0.8
Minimum Effect Size: Cohen's h = 0.1 (chosen based on practical significance)
Neyman-Pearson Framework and Alternative Hypothesis
In the Neyman-Pearson framework, hypothesis testing involves both a null hypothesis ($H_0$) and an alternative hypothesis ($H_1$). The null hypothesis typically represents a statement of no effect or no difference, while the alternative hypothesis is the opposite, suggesting that there is an effect or a difference.
Null Hypothesis ($H_0$): There is no difference in diabetes prevalence between smokers and non-smokers.
Alternative Hypothesis ($H_1$): There is a difference in diabetes prevalence between smokers and non-smokers.
Selection of Alpha and Power Levels
I chose the alpha (significance level) and power levels for my test for the following reasons:
Alpha Level (α): This is the probability of rejecting the null hypothesis when it is true (Type I error). Commonly, α is set at 0.05, meaning there is a 5% risk of concluding that an effect exists when it actually does not.
Power Level: This is the probability of correctly rejecting the null hypothesis when it is false (1 - Type II error). A power of 0.8 (80%) is often used, indicating a reasonable chance of detecting an effect if there is one.
For instance, if the consequences of a Type I error are severe, I might choose a lower alpha.
Sample Size Calculation
# Calculate sample size needed for adequate power
library(pwr)
effect_size <- 0.1
alpha <- 0.05
power <- 0.8
sample_size <- pwr.2p.test(h = effect_size, sig.level = alpha, power = power)$n
sample_size
## [1] 1569.772
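The required sample size can be cross-checked by hand with the usual normal-approximation formula for comparing two proportions, n = 2 * ((z_(1 - α/2) + z_power) / h)^2 per group. The sketch below uses only base R; it is an approximation and is not meant to reproduce the internals of `pwr.2p.test`.

```r
alpha <- 0.05
power <- 0.8
h     <- 0.1  # Cohen's h, the minimum effect size chosen above

# Normal-approximation sample size per group for a two-sided two-proportion test
z_alpha <- qnorm(1 - alpha / 2)
z_beta  <- qnorm(power)
n_per_group <- 2 * ((z_alpha + z_beta) / h)^2
n_per_group
```

The result is close to the 1569.772 reported above, so about 1,570 respondents per group are needed. The dataset has 70,692 rows, far more than this requirement.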
Perform the Test
# Subset data by smoking status
smokers <- dataset %>% filter(Smoker == 1)
non_smokers <- dataset %>% filter(Smoker == 0)
# Proportion of diabetes in each group
p1 <- mean(smokers$Diabetes_binary)
p2 <- mean(non_smokers$Diabetes_binary)
# Perform two-proportion z-test
prop.test(x = c(sum(smokers$Diabetes_binary), sum(non_smokers$Diabetes_binary)),
          n = c(nrow(smokers), nrow(non_smokers)),
          alternative = "two.sided")
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(sum(smokers$Diabetes_binary), sum(non_smokers$Diabetes_binary)) out of c(nrow(smokers), nrow(non_smokers))
## X-squared = 522.48, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## 0.07872292 0.09348572
## sample estimates:
## prop 1 prop 2
## 0.5451813 0.4590769
Insight Gathered
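The p-value (< 2.2e-16) is far below the alpha level of 0.05, so we reject the null hypothesis. In this sample, 54.5% of smokers have diabetes compared with 45.9% of non-smokers, a difference of about 8.6 percentage points (95% CI: 7.9 to 9.3). Because the dataset is a balanced 50/50 split of diabetic and non-diabetic respondents, these proportions describe the sample rather than population prevalence.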
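To relate the observed difference to the minimum effect size of 0.1 set in the Neyman-Pearson plan, the two sample proportions can be converted to Cohen's h with the arcsine transformation. This is a small sketch using the proportions printed in the test output; the variable names are introduced here for illustration.

```r
# Sample proportions copied from the prop.test output
p_smokers     <- 0.5451813
p_non_smokers <- 0.4590769

# Cohen's h: difference of arcsine-transformed proportions
cohens_h <- 2 * asin(sqrt(p_smokers)) - 2 * asin(sqrt(p_non_smokers))
cohens_h
```

The observed h is about 0.17, which exceeds the planned minimum of 0.1 but is still below the conventional 0.2 threshold for a small effect. The `pwr` package's `ES.h(p1, p2)` computes the same quantity.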
Visualization for Hypothesis 1
ggplot(dataset, aes(x = factor(Smoker), fill = factor(Diabetes_binary))) +
  geom_bar(position = "fill") +
  labs(title = "Diabetes Prevalence by Smoking Status",
       x = "Smoking Status", y = "Proportion of Diabetes") +
  scale_fill_discrete(name = "Diabetes")
Hypothesis 2: Obesity Status and Diabetes Prevalence
Null Hypothesis (H0): There is no difference in diabetes prevalence between obese and non-obese individuals.
Fisher’s Significance Testing Framework
Perform the Test
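# Subset data by obesity status (BMI > 30 considered obese)
# Note: the BMI_Category variable created earlier labels BMI >= 30 as "Obese",
# so respondents with a BMI of exactly 30 are counted as non-obese here
obese <- dataset %>% filter(BMI > 30)
non_obese <- dataset %>% filter(BMI <= 30)
# Create contingency table
table_obesity <- table(dataset$Diabetes_binary, dataset$BMI > 30)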
# Perform chi-square test
chisq.test(table_obesity)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table_obesity
## X-squared = 5252.7, df = 1, p-value < 2.2e-16
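With a sample this large, almost any difference yields a tiny p-value, so an effect size helps judge the strength of the association. For a 2 x 2 table the phi coefficient is sqrt(X² / n). The sketch below uses the reported statistic and the row count from the data load; because the reported statistic includes Yates' continuity correction, the value is a slight approximation.

```r
# Values copied from the output above
chi_sq <- 5252.7   # X-squared from chisq.test (with Yates' correction)
n_obs  <- 70692    # number of rows reported when the dataset was loaded

# Phi coefficient for a 2 x 2 table (equal to Cramér's V in this case)
phi <- sqrt(chi_sq / n_obs)
phi
```

A phi of roughly 0.27 points to a moderate association between obesity status and diabetes in this sample.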
Visualization for Hypothesis 2
ggplot(dataset, aes(x = factor(BMI > 30), fill = factor(Diabetes_binary))) +
  geom_bar(position = "fill") +
  labs(title = "Diabetes Prevalence by Obesity Status",
       x = "Obesity Status", y = "Proportion of Diabetes") +
  scale_fill_discrete(name = "Diabetes")
Interpretation of Results
Interpreting the results of the statistical tests:
P-values: A p-value less than α leads us to reject the null hypothesis in favor of the alternative hypothesis. Both tests above returned p < 2.2e-16, well below α = 0.05, so both null hypotheses are rejected: diabetes prevalence differs between smokers and non-smokers, and between obese and non-obese individuals.
Conclusion: This analysis shows that smoking and obesity are both associated with diabetes prevalence in this sample. These are associations in survey data and do not by themselves establish causation. Further research could explore other factors or interactions between variables.
Conclusion
In conclusion, we have explored the critical relationships between age, BMI, and their combined influence on diabetes risk. The findings underscore the significant role of age as a primary determinant of diabetes prevalence, with risk increasing substantially as individuals grow older. This trend highlights the physiological changes associated with aging, such as declining insulin sensitivity and metabolic efficiency, which contribute to the heightened vulnerability of middle-aged and elderly populations.
Moreover, the analysis reaffirms the strong association between BMI and diabetes risk across all age groups. Elevated BMI is a key driver of diabetes development due to its link with insulin resistance and chronic inflammation. However, the interplay between age and BMI reveals an evolving dynamic: while higher BMI has a pronounced impact on diabetes risk in younger adults, its relative influence diminishes with advancing age. This suggests that other age-related factors may overshadow BMI’s role in older populations.
These insights emphasize the importance of adopting a nuanced approach to diabetes prevention and management. For younger individuals, interventions targeting weight management and lifestyle modifications are crucial to mitigating early-onset diabetes risk. For older populations, strategies should address broader age-related health challenges alongside weight management to reduce their vulnerability effectively. By understanding how age and BMI interact to influence diabetes risk, this study highlights the need for tailored public health initiatives that consider these interdependent factors to combat this growing global health issue.