Final Project Akshay Kumar Binary Insights into Diabetes

Introduction
For my final statistics project, titled “Binary Insights into Diabetes: Health Indicators and Implications,” I aim to explore the intricate relationships between age, body mass index (BMI), and the risk of developing diabetes. Diabetes remains a pressing global health issue, and understanding its risk factors is critical for prevention and management.

Age is a significant determinant of diabetes risk, with prevalence increasing markedly as individuals grow older. Research shows that middle-aged and elderly populations are particularly vulnerable, with diabetes rates spiking after the age of 45 and reaching nearly 25% in those aged 65 and older. This trend reflects the cumulative effects of aging on insulin resistance and metabolic health. Additionally, BMI plays a pivotal role in diabetes risk across all age groups. Higher BMI is strongly associated with an increased likelihood of developing diabetes, with this effect being particularly pronounced in younger adults. However, as age advances, the relative impact of BMI on diabetes risk appears to diminish.

This project will examine these relationships using statistical methods to highlight how increasing age and BMI interact to influence diabetes risk. By analyzing these factors, the study seeks to provide actionable insights into how targeted interventions can mitigate this growing health challenge.

Link to dataset:
https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset/data?select=diabetes_binary_5050split_health_indicators_BRFSS2015.csv

# Load necessary libraries
library(tidyverse)

## Warning: package 'tidyr' was built under R version 4.4.2

## Warning: package 'dplyr' was built under R version 4.4.2

## Warning: package 'lubridate' was built under R version 4.4.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(broom)
library(car)

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.4.2

## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

library(caret)

## Warning: package 'caret' was built under R version 4.4.2

## Loading required package: lattice

## Warning: package 'lattice' was built under R version 4.4.2

## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

library(dplyr)
library(ggplot2)

Data Preparation

First, we load the dataset.

dataset <- read_delim("C:/Users/Akshay Dembra/Downloads/Stats_Selected_Dataset/diabetes_binary_5050split_health_indicators_BRFSS2015_1.csv", delim = ",")

## Rows: 70692 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (22): Diabetes_binary, HighBP, HighChol, CholCheck, BMI, Smoker, Stroke,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(dataset)

Analysis

Pair 1: BMI_Category vs. Diabetes Status

Visualization

# Create a new variable: BMI category
dataset <- dataset %>%
  mutate(BMI_Category = case_when(
    BMI < 18.5 ~ "Underweight",
    BMI < 25 ~ "Normal weight",
    BMI < 30 ~ "Overweight",
    TRUE ~ "Obese"
  ))

ggplot(dataset, aes(x = factor(Diabetes_binary), fill = factor(BMI_Category))) +
  geom_bar(position = "dodge") +
  labs(title = "Distribution of BMI Categories by Diabetes Status", x = "Diabetes Status (0 = No, 1 = Yes)", y = "Count")

Insight Gathered:

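The grouped bar chart compares how many respondents fall into each BMI category among those with and without diabetes. Because it shows counts per category rather than the BMI distribution itself, it cannot display medians or outliers. What it can indicate is whether the higher BMI categories are more heavily represented in the diabetes group, which would point to a link between higher BMI and diabetes risk.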

Correlation

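# Order the levels explicitly; factor() alone sorts them alphabetically
bmi_rank <- as.numeric(factor(dataset$BMI_Category, levels = c("Underweight", "Normal weight", "Overweight", "Obese")))
correlation_bmi_diabetes <- cor(bmi_rank, dataset$Diabetes_binary, method = "spearman")
print(correlation_bmi_diabetes)

The value originally printed here, 0.04659539, came from the default alphabetical coding (Normal weight = 1, Obese = 2, Overweight = 3, Underweight = 4), so it does not measure an ordinal association between BMI category and diabetes status.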

Pair 2: Age group vs. High Blood Pressure

Visualization

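# Create a new variable: Age group
# Age is assumed to be the BRFSS 13-level code (1 = 18-24, 2 = 25-29, ..., 13 = 80+),
# so each label below spans three consecutive five-year codes
dataset <- dataset %>%
  mutate(Age_Group = case_when(
    Age <= 3 ~ "18-34",
    Age <= 6 ~ "35-49",
    Age <= 9 ~ "50-64",
    Age <= 12 ~ "65-79",
    TRUE ~ "80+"
  ))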

ggplot(dataset, aes(x = factor(HighBP), fill = factor(Age_Group))) +
  geom_bar(position = "dodge") +
  labs(title = "Distribution of Age Groups by High Blood Pressure", x = "High Blood Pressure (0 = No, 1 = Yes)", y = "Count")

ANOVA Test

Null Hypothesis

H0: The mean BMI is the same across all general health categories (GenHlth).

Conduct ANOVA

We will perform an ANOVA test to evaluate this hypothesis:

# Perform ANOVA test
anova_result <- aov(BMI ~ GenHlth, data = dataset)
summary(anova_result)

##                Df  Sum Sq Mean Sq F value Pr(>F)    
## GenHlth         1  256739  256739    5465 <2e-16 ***
## Residuals   70690 3320815      47                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation of Results

P-value: The ANOVA summary reports p < 2e-16, far below 0.05, so we reject the null hypothesis: BMI differs significantly with general health rating.

F-statistic: F = 5465 on 1 and 70,690 degrees of freedom. A large F-statistic means the variation in BMI explained by GenHlth is large relative to the variation left within groups.

Degrees of freedom: GenHlth is stored as a number, so aov() fitted a single linear trend (Df = 1) rather than a separate mean for each category. Testing the null hypothesis exactly as stated requires treating it as a factor, for example aov(BMI ~ factor(GenHlth), data = dataset).

Conclusion

Because p < 0.05, there is enough evidence to conclude that BMI varies across general health ratings. The effect is modest in size: GenHlth accounts for about 7% of the total variation in BMI (256739 of 3577554 in sum of squares), and with more than 70,000 observations even small differences reach significance.

These results can help health professionals understand how general health status relates to BMI and guide targeted interventions or further research into specific health categories.
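
As a minimal sketch of why the coding of the grouping variable matters, the example below fits the same model twice on simulated data, once with the rating as a number and once as a factor, and compares the degrees of freedom. The data values and the names sim, GenHlth_f, fit_numeric, and fit_factor are illustrative assumptions, not taken from the BRFSS file.

```r
# Simulated stand-in for the survey data (hypothetical values, not the BRFSS file)
set.seed(42)
n <- 500
sim <- data.frame(GenHlth = sample(1:5, n, replace = TRUE))
sim$BMI <- 25 + 1.2 * sim$GenHlth + rnorm(n, sd = 6)
sim$GenHlth_f <- factor(sim$GenHlth)

# Numeric predictor: aov() fits one slope, so the term gets 1 degree of freedom
fit_numeric <- aov(BMI ~ GenHlth, data = sim)

# Factor predictor: one mean per category, so the term gets 5 - 1 = 4 degrees of freedom
fit_factor <- aov(BMI ~ GenHlth_f, data = sim)

df_numeric <- summary(fit_numeric)[[1]]$Df[1]
df_factor  <- summary(fit_factor)[[1]]$Df[1]
c(numeric = df_numeric, factor = df_factor)

# Pairwise comparisons between categories are available for the factor version
tukey <- TukeyHSD(fit_factor)
head(tukey[[1]])
```

With five categories the factor version yields ten pairwise comparisons, which is where differences between specific health ratings would show up.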

ANOVA Boxplot

To visualize differences in BMI across different general health categories, we can use a boxplot:

# Boxplot for BMI by General Health
# GenHlth is numeric, so convert it to a factor to draw one box per category
ggplot(dataset, aes(x = factor(GenHlth), y = BMI)) +
  geom_boxplot() +
  labs(title = "BMI by General Health Category", x = "General Health", y = "BMI")

Regression Plot

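To visualize the relationship between Age and BMI, we can use a scatter plot with a regression line. Age in this dataset is a coded age category rather than age in years, so the points fall in vertical bands and the fitted slope is the change in BMI per category step: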

# Scatter plot with regression line
ggplot(dataset, aes(x = Age, y = BMI)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", col = "blue") +
  labs(title = "Relationship between Age and BMI", x = "Age", y = "BMI")

## `geom_smooth()` using formula = 'y ~ x'

Logistic Regression Model

The logistic regression model was fitted to predict the probability of having diabetes based on the selected health indicators. Below are the coefficients from the model:

# Build a Generalized Linear Model (Logistic Regression)
# Predicting diabetes status based on BMI, physical activity, and smoking status
glm_model <- glm(Diabetes_binary ~ BMI + PhysActivity + Smoker, 
                 data = dataset, family = binomial)

# Summary of the model
summary(glm_model)

## 
## Call:
## glm(formula = Diabetes_binary ~ BMI + PhysActivity + Smoker, 
##     family = binomial, data = dataset)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.669419   0.045274  -58.96   <2e-16 ***
## BMI           0.096964   0.001381   70.20   <2e-16 ***
## PhysActivity -0.502127   0.017771  -28.25   <2e-16 ***
## Smoker        0.326924   0.015985   20.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 98000  on 70691  degrees of freedom
## Residual deviance: 89857  on 70688  degrees of freedom
## AIC: 89865
## 
## Number of Fisher Scoring iterations: 4
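
The coefficients above are on the log-odds scale, which is hard to read directly. The sketch below converts them to odds ratios and to one predicted probability. It copies the estimates from the printed summary so that it runs on its own; with the fitted object available, exp(coef(glm_model)) should give the same odds ratios. The respondent profile (BMI 30, physically active, non-smoker) is a hypothetical chosen for illustration.

```r
# Estimates copied from the summary(glm_model) output above
coefs <- c(Intercept = -2.669419, BMI = 0.096964,
           PhysActivity = -0.502127, Smoker = 0.326924)

# Odds ratio for a one-unit increase in each predictor, others held fixed
odds_ratios <- exp(coefs[-1])
round(odds_ratios, 3)

# Predicted probability for a hypothetical respondent:
# BMI = 30, physically active (1), non-smoker (0)
eta <- coefs["Intercept"] + coefs["BMI"] * 30 +
  coefs["PhysActivity"] * 1 + coefs["Smoker"] * 0
p_hat <- unname(plogis(eta))
p_hat
```

Read this way, each additional BMI unit multiplies the odds of diabetes by about 1.10, physical activity multiplies them by about 0.61, and smoking by about 1.39. Because the dataset is a 50/50 split of diabetic and non-diabetic respondents, the predicted probability describes this balanced sample and not prevalence in the general population.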

# Model diagnostics - Checking for multicollinearity using Variance Inflation Factor (VIF)
vif(glm_model)

##          BMI PhysActivity       Smoker 
##     1.010116     1.013019     1.004818

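Multicollinearity: The Variance Inflation Factor (VIF) helps identify whether any explanatory variables are highly correlated with each other. Common rules of thumb flag a VIF above 5 or, more leniently, above 10. All three VIFs here are close to 1, so multicollinearity is not a concern for this model.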
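
To make the VIF less of a black box, here is a small sketch of the textbook linear-model definition, VIF = 1 / (1 - R²), where R² comes from regressing one predictor on the others. It uses simulated predictors (x1, x2, x3 are hypothetical) and base R only. The vif() function from car may compute the value differently for a glm, so this illustrates the idea and is not a reimplementation of that function.

```r
# Simulated predictors: x2 is built to be strongly correlated with x1
set.seed(1)
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n)  # unrelated to the other two

# VIF for a predictor = 1 / (1 - R^2) from regressing it on the other predictors
vif_x1 <- 1 / (1 - summary(lm(x1 ~ x2 + x3))$r.squared)
vif_x3 <- 1 / (1 - summary(lm(x3 ~ x1 + x2))$r.squared)

round(c(x1 = vif_x1, x3 = vif_x3), 2)
```

The correlated predictor gets a large VIF while the unrelated one stays near 1, which is the pattern the values for BMI, PhysActivity, and Smoker follow.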

# Model diagnostics - Residuals analysis
# Stored as dev_resid to avoid reusing the name of the residuals() function
dev_resid <- residuals(glm_model, type = "deviance")
plot(dev_resid, main = "Residuals Plot", ylab = "Deviance Residuals", xlab = "Index")
abline(h = 0, col = "red")

Hypothesis 1: Smoking Status and Diabetes Prevalence

Null Hypothesis (H0): There is no difference in diabetes prevalence between smokers and non-smokers.

Neyman-Pearson Framework

Test: Two-proportion z-test
Alpha Level (Type I Error): 0.05
Power (1 - Type II Error): 0.8

Minimum Effect Size: 0.1 on the Cohen's h scale (chosen based on practical significance)

Neyman-Pearson Framework and Alternative Hypothesis

In the Neyman-Pearson framework, hypothesis testing involves both a null hypothesis ($H_0$) and an alternative hypothesis ($H_1$). The null hypothesis typically represents a statement of no effect or no difference, while the alternative hypothesis is the opposite, suggesting that there is an effect or a difference.

Null Hypothesis ($H_0$): The statement being tested; here, “There is no difference in diabetes prevalence between smokers and non-smokers.”

Alternative Hypothesis ($H_1$): Here, “Diabetes prevalence differs between smokers and non-smokers.”

Selection of Alpha and Power Levels

Why I chose specific alpha (significance level) and power levels for my test:

Alpha Level (α): This is the probability of rejecting the null hypothesis when it is true (Type I error). Commonly, α is set at 0.05, meaning there is a 5% risk of concluding that an effect exists when it actually does not.

Power Level: This is the probability of correctly rejecting the null hypothesis when it is false (1 - Type II error). A power of 0.8 (80%) is often used, indicating a reasonable chance of detecting an effect if there is one.

For instance, if the consequences of a Type I error are severe, I might choose a lower alpha.

Sample Size Calculation

# Calculate sample size needed for adequate power
library(pwr)
effect_size <- 0.1
alpha <- 0.05
power <- 0.8

sample_size <- pwr.2p.test(h = effect_size, sig.level = alpha, power = power)$n
sample_size

## [1] 1569.772
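
The figure above is the number of observations needed in each group, so rounding up gives 1,570 smokers and 1,570 non-smokers. As a rough cross-check, the sketch below reproduces it with the usual normal approximation for a two-sided, two-sample test, n ≈ 2 * ((z_(1-α/2) + z_power) / h)². This formula is an approximation assumed here for illustration, which is why it matches the pwr result only to about two decimal places.

```r
# Planning values from the Neyman-Pearson setup above
h     <- 0.1   # minimum effect size (Cohen's h)
alpha <- 0.05  # Type I error rate
power <- 0.8   # 1 - Type II error rate

# Normal-approximation sample size per group for a two-sided test
n_approx <- 2 * ((qnorm(1 - alpha / 2) + qnorm(power)) / h)^2
n_approx

# Round up, since a fraction of a respondent cannot be sampled
n_per_group <- ceiling(n_approx)
n_per_group
```

The formula also shows how the inputs trade off: halving h quadruples the required sample, and lowering alpha or raising power both increase it.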

Perform the Test

# Subset data by smoking status
smokers <- dataset %>% filter(Smoker == 1)
non_smokers <- dataset %>% filter(Smoker == 0)

# Proportion of diabetes in each group
p1 <- mean(smokers$Diabetes_binary)
p2 <- mean(non_smokers$Diabetes_binary)

# Perform two-proportion z-test
prop.test(x = c(sum(smokers$Diabetes_binary), sum(non_smokers$Diabetes_binary)),
          n = c(nrow(smokers), nrow(non_smokers)),
          alternative = "two.sided")

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(sum(smokers$Diabetes_binary), sum(non_smokers$Diabetes_binary)) out of c(nrow(smokers), nrow(non_smokers))
## X-squared = 522.48, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.07872292 0.09348572
## sample estimates:
##    prop 1    prop 2 
## 0.5451813 0.4590769

Insight Gathered

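The p-value (< 2.2e-16) is far below the alpha level of 0.05, so we reject the null hypothesis. In this 50/50 balanced sample, diabetes prevalence is 54.5% among smokers and 45.9% among non-smokers, a difference of about 8.6 percentage points (95% confidence interval: 7.9 to 9.3 points).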
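
Since the minimum effect size was set on the Cohen's h scale, it helps to put the observed difference on the same scale. The sketch below applies the arcsine transformation, h = 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2)), to the two sample proportions copied from the prop.test output. The names h_observed and diff_points are introduced here for illustration.

```r
# Sample proportions copied from the prop.test output above
p1 <- 0.5451813  # diabetes prevalence among smokers
p2 <- 0.4590769  # diabetes prevalence among non-smokers

# Cohen's h: difference between arcsine-transformed proportions
h_observed <- 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))
h_observed

# The same difference in percentage points
diff_points <- (p1 - p2) * 100
diff_points
```

The observed h is roughly 0.17, which is above the planned minimum of 0.1 and still below the 0.2 that Cohen's conventions label a small effect.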

Visualization for Hypothesis 1

ggplot(dataset, aes(x = factor(Smoker), fill = factor(Diabetes_binary))) +
  geom_bar(position = "fill") +
  labs(title = "Diabetes Prevalence by Smoking Status",
       x = "Smoking Status", y = "Proportion of Diabetes") +
  scale_fill_discrete(name = "Diabetes")

Hypothesis 2: Obesity Status and Diabetes Prevalence

Null Hypothesis (H0): There is no difference in diabetes prevalence between obese and non-obese individuals.

Fisher’s Significance Testing Framework

Test: Chi-square test of independence

Perform the Test

# Obesity is defined here as BMI > 30; note that BMI_Category above also counts BMI = 30 as obese
# Create contingency table of diabetes status by obesity status
table_obesity <- table(dataset$Diabetes_binary, dataset$BMI > 30)

# Perform chi-square test
chisq.test(table_obesity)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table_obesity
## X-squared = 5252.7, df = 1, p-value < 2.2e-16
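
With more than 70,000 rows almost any association gives a tiny p-value, so an effect size is a useful companion to the test. For a 2 x 2 table, one common choice is the phi coefficient, sqrt(X² / n). The sketch below computes it from the printed statistic and the row count reported when the data were loaded. The printed X² includes Yates' continuity correction, so the result is a slight underestimate of the uncorrected phi and should be read as approximate.

```r
# Values copied from the output above
chi_sq <- 5252.7  # X-squared from chisq.test (with Yates' correction)
n      <- 70692   # number of rows in the dataset

# Phi coefficient for a 2 x 2 table
phi <- sqrt(chi_sq / n)
phi
```

A phi of about 0.27 sits just under the 0.3 that Cohen's conventions call a medium effect, so the association between obesity and diabetes in this sample is substantial as well as statistically significant.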

Visualization for Hypothesis 2

ggplot(dataset, aes(x = factor(BMI > 30), fill = factor(Diabetes_binary))) +
  geom_bar(position = "fill") +
  labs(title = "Diabetes Prevalence by Obesity Status",
       x = "Obesity Status", y = "Proportion of Diabetes") +
  scale_fill_discrete(name = "Diabetes")

Interpretation of Results

Interpreting the results of the two hypothesis tests:

P-values: A p-value less than α leads to rejecting the null hypothesis in favor of the alternative hypothesis. Both tests report p < 2.2e-16, far below α = 0.05, so both null hypotheses are rejected: diabetes prevalence differs by smoking status and by obesity status.

Conclusion: This analysis shows that smoking and obesity are each associated with diabetes prevalence in this sample. Further research could explore other factors or interactions between variables.

Conclusion

In conclusion, we have explored the critical relationships between age, BMI, and their combined influence on diabetes risk. The findings underscore the significant role of age as a primary determinant of diabetes prevalence, with risk increasing substantially as individuals grow older. This trend highlights the physiological changes associated with aging, such as declining insulin sensitivity and metabolic efficiency, which contribute to the heightened vulnerability of middle-aged and elderly populations.

Moreover, the analysis reaffirms the strong association between BMI and diabetes risk across all age groups. Elevated BMI is a key driver of diabetes development due to its link with insulin resistance and chronic inflammation. However, the interplay between age and BMI reveals an evolving dynamic: while higher BMI has a pronounced impact on diabetes risk in younger adults, its relative influence diminishes with advancing age. This suggests that other age-related factors may overshadow BMI’s role in older populations.

These insights emphasize the importance of adopting a nuanced approach to diabetes prevention and management. For younger individuals, interventions targeting weight management and lifestyle modifications are crucial to mitigating early-onset diabetes risk. For older populations, strategies should address broader age-related health challenges alongside weight management to reduce their vulnerability effectively. By understanding how age and BMI interact to influence diabetes risk, this study highlights the need for tailored public health initiatives that consider these interdependent factors to combat this growing global health issue.
