Introduction
In this data dive, we explore relationships between variables in the BRFSS 2015 diabetes health indicators dataset (the balanced 50/50 split). The goal is to document the analysis process, refer to the data documentation where variable coding matters, and draw insights from visualizations and statistical summaries.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# dplyr and ggplot2 are already attached by library(tidyverse) above;
# the explicit calls below are redundant but harmless
library(dplyr)
library(ggplot2)

Data Preparation

First, we load the dataset.

dataset <- read_delim("C:/Users/Akshay Dembra/Downloads/Stats_Selected_Dataset/diabetes_binary_5050split_health_indicators_BRFSS2015_1.csv", delim = ",")
## Rows: 70692 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (22): Diabetes_binary, HighBP, HighChol, CholCheck, BMI, Smoker, Stroke,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dataset
# Create a new variable: BMI category
dataset <- dataset %>%
  mutate(BMI_Category = case_when(
    BMI < 18.5 ~ "Underweight",
    BMI < 25 ~ "Normal weight",
    BMI < 30 ~ "Overweight",
    TRUE ~ "Obese"
  ))

The new variable BMI_Category assigns each respondent to one of four groups based on BMI:

Underweight: BMI < 18.5

Normal weight: 18.5 ≤ BMI < 25

Overweight: 25 ≤ BMI < 30

Obese: BMI ≥ 30

These cut-offs follow the standard adult BMI classification, so “Obese” means a BMI of 30 or higher. Grouping BMI this way makes it easier to relate obesity to health indicators such as diabetes and high blood pressure.
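The same cut-offs can also be expressed with base R's cut(), which can return an ordered factor so that plots and rank-based statistics see the categories from lowest to highest BMI. This is a minimal sketch on made-up BMI values; bmi_example and bmi_cat are illustrative names, not part of the original analysis.

```r
# Made-up BMI values for illustration only (not taken from the dataset)
bmi_example <- c(17.0, 18.5, 24.9, 25.0, 29.9, 30.0, 45.0)

# right = FALSE makes each interval closed on the left: [18.5, 25), [25, 30), ...
bmi_cat <- cut(
  bmi_example,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Normal weight", "Overweight", "Obese"),
  right = FALSE,
  ordered_result = TRUE
)

print(data.frame(bmi_example, bmi_cat))
```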

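# Create a new variable: Age group
# Age is the 13-level BRFSS category _AGEG5YR (1 = 18-24, 2 = 25-29, ...,
# 13 = 80 or older), so each cut below spans three five-year codes.
dataset <- dataset %>%
  mutate(Age_Group = case_when(
    Age <= 3 ~ "18-34",
    Age <= 6 ~ "35-49",
    Age <= 9 ~ "50-64",
    Age <= 12 ~ "65-79",
    TRUE ~ "80+"
  ))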
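Note that Age in the BRFSS 2015 extract is a 13-level category code (1 = 18-24, 2 = 25-29, and so on in five-year steps up to 13 = 80 or older), not an age in years, so group labels are worth checking against the codebook. The sketch below lists which age bands fall into each block of three codes; age_code_labels, group_of_three and bands_by_group are illustrative names.

```r
# Five-year age bands for BRFSS codes 1..13 (per the BRFSS 2015 codebook, _AGEG5YR)
age_code_labels <- c(
  "18-24", "25-29", "30-34", "35-39", "40-44", "45-49", "50-54",
  "55-59", "60-64", "65-69", "70-74", "75-79", "80+"
)

age_code <- 1:13
# Block index for cuts at 3, 6, 9, 12: codes 1-3 -> 1, 4-6 -> 2, ..., 13 -> 5
group_of_three <- ceiling(age_code / 3)

# Age bands covered by each block of codes
bands_by_group <- split(age_code_labels, group_of_three)
print(bands_by_group)
```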

Analysis

Pair 1: BMI_Category vs. Diabetes Status

Visualization

ggplot(dataset, aes(x = factor(Diabetes_binary), fill = factor(BMI_Category))) +
  geom_bar(position = "dodge") +
  labs(title = "Distribution of BMI Categories by Diabetes Status", x = "Diabetes Status (0 = No, 1 = Yes)", y = "Count")

Insight Gathered:

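The dodged bar chart compares the number of respondents in each BMI category for those with and without diabetes. Respondents with diabetes are concentrated more heavily in the higher BMI categories, which points to a potential link between higher BMI and diabetes risk. Note that a bar chart of categories shows counts only; medians and outliers would require a boxplot of the raw BMI values.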
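Because raw counts depend on how many respondents fall in each group, a table of within-group proportions can be easier to read than bar heights. A minimal sketch on a tiny made-up data frame (toy and its values are illustrative, not results from the dataset):

```r
# Made-up rows for illustration only
toy <- data.frame(
  Diabetes_binary = c(0, 0, 0, 0, 1, 1, 1, 1),
  BMI_Category = c("Normal weight", "Normal weight", "Overweight", "Obese",
                   "Normal weight", "Overweight", "Obese", "Obese")
)

# Counts of each BMI category within each diabetes status
counts <- table(toy$Diabetes_binary, toy$BMI_Category)

# margin = 1 divides each row by its total, so every row sums to 1
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

With the real data, the same two calls would take dataset$Diabetes_binary and dataset$BMI_Category.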

Correlation

correlation_bmi_diabetes <- cor(as.numeric(factor(dataset$BMI_Category)), dataset$Diabetes_binary, method = "spearman")
print(correlation_bmi_diabetes)
## [1] 0.04659539

Conclusion:

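The correlation coefficient is positive but very small (about 0.05), so it indicates at most a very weak association, not a moderate one. Part of the reason is the coding: as.numeric(factor(...)) numbers the categories alphabetically (Normal weight, Obese, Overweight, Underweight) rather than from lowest to highest BMI, so this rank correlation does not measure whether diabetes becomes more common as the BMI category increases. For that pattern, the visualization is the more reliable evidence here.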
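One thing to check before reading this coefficient: factor() sorts character levels alphabetically unless told otherwise, so the numeric codes may not follow BMI from low to high. The sketch below uses made-up values to show how the codes, and the resulting Spearman coefficient, change once the levels are given explicitly; bmi_levels and the toy vectors are assumptions for illustration.

```r
bmi_levels <- c("Underweight", "Normal weight", "Overweight", "Obese")

# Made-up categories and outcomes for illustration only
toy_category <- c("Underweight", "Normal weight", "Overweight", "Obese",
                  "Normal weight", "Obese", "Overweight", "Obese")
toy_diabetes <- c(0, 0, 0, 1, 0, 1, 1, 1)

# Default coding: levels are sorted alphabetically
alphabetical_codes <- as.numeric(factor(toy_category))
# Explicit coding: levels follow increasing BMI
ordered_codes <- as.numeric(factor(toy_category, levels = bmi_levels))

print(levels(factor(toy_category)))
print(rbind(alphabetical_codes, ordered_codes))

# Spearman correlation under each coding
rho_alphabetical <- cor(alphabetical_codes, toy_diabetes, method = "spearman")
rho_ordered <- cor(ordered_codes, toy_diabetes, method = "spearman")
print(c(alphabetical = rho_alphabetical, ordered = rho_ordered))
```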

Pair 2: Age group vs. High Blood Pressure

Visualization

ggplot(dataset, aes(x = factor(HighBP), fill = factor(Age_Group))) +
  geom_bar(position = "dodge") +
  labs(title = "Distribution of Age Groups by High Blood Pressure", x = "High Blood Pressure (0 = No, 1 = Yes)", y = "Count")

Insight Gathered:

The bar plot of age group versus high blood pressure shows that, in the group defined by the Age <= 12 cut (Age codes 10 to 12), more respondents have high blood pressure than do not. Age in this dataset is a 13-level BRFSS category code rather than an age in years, and codes 10 to 12 correspond to ages 65 to 79, so the pattern points to older respondents and matches the usual expectation that high blood pressure becomes more common with age.
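Counts alone can mislead when the age groups differ in size, so it may help to look at the share of each group reporting high blood pressure. A minimal sketch with made-up values (toy_age_group and toy_high_bp are illustrative names):

```r
# Made-up group labels and 0/1 indicators for illustration only
toy_age_group <- c(rep("Group 1", 4), rep("Group 2", 6))
toy_high_bp   <- c(0, 0, 0, 1, 0, 0, 1, 1, 1, 1)

# The mean of a 0/1 indicator is the proportion of 1s, computed here per group
bp_rate <- tapply(toy_high_bp, toy_age_group, mean)
print(round(bp_rate, 2))
```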

Correlation

correlation_age_bp <- cor(as.numeric(factor(dataset$Age_Group)), dataset$HighBP, method = "spearman")
print(correlation_age_bp)
## [1] 0.3134982

Conclusion:

The correlation coefficient is positive (about 0.31), indicating a moderate association between age group and high blood pressure: respondents in older age groups are more likely to report high blood pressure. This is consistent with the visualization of age groups by high blood pressure status.
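cor() returns only the point estimate. If a test of the association is also wanted, cor.test() from base R's stats package reports a p-value alongside the same Spearman coefficient. A sketch on made-up codes (the toy vectors are illustrative; exact = FALSE is set because tied ranks rule out an exact p-value):

```r
# Made-up ordinal age-group codes and high blood pressure indicators
toy_age_code <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
toy_high_bp  <- c(0, 0, 0, 1, 0, 1, 1, 0, 1, 1)

result <- cor.test(toy_age_code, toy_high_bp, method = "spearman", exact = FALSE)

print(result$estimate)  # Spearman's rho
print(result$p.value)   # p-value from the asymptotic approximation
```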

Confidence Intervals

Confidence Interval for BMI

# Confidence interval for BMI of individuals with diabetes
bmi_diabetes <- dataset %>% filter(Diabetes_binary == 1) %>% select(BMI)
ci_bmi <- t.test(bmi_diabetes$BMI)$conf.int
print(ci_bmi)
## [1] 31.86724 32.02078
## attr(,"conf.level")
## [1] 0.95

Detailed Conclusion:

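The 95% confidence interval for the mean BMI of respondents with diabetes runs from about 31.87 to 32.02. The interval is narrow because the subgroup is large (about half of the 70,692 rows in this 50/50 split), and it lies entirely above 30, the threshold for the Obese category. On its own, the interval describes only the diabetes group; showing that BMI is higher among respondents with diabetes would require comparing it with the corresponding interval for respondents without diabetes.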
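For reference, the interval returned by t.test() is the sample mean plus or minus a t quantile times the standard error. The sketch below reproduces that by hand on made-up BMI values and then runs a two-sample Welch test, which is one way to compare the two groups directly. All values and names here are illustrative assumptions, not results from the dataset.

```r
# Made-up BMI values for illustration only
bmi_with_diabetes    <- c(31.2, 29.8, 35.1, 33.0, 28.4, 36.7, 30.9, 32.5)
bmi_without_diabetes <- c(24.1, 27.3, 22.8, 26.0, 29.5, 23.7, 25.2, 28.1)

# 95% interval by hand: mean +/- t quantile * standard error
n  <- length(bmi_with_diabetes)
se <- sd(bmi_with_diabetes) / sqrt(n)
manual_ci <- mean(bmi_with_diabetes) + c(-1, 1) * qt(0.975, df = n - 1) * se

# Same interval from t.test()
ttest_ci <- as.numeric(t.test(bmi_with_diabetes)$conf.int)
print(rbind(manual_ci, ttest_ci))

# Welch two-sample test: interval for the difference in group means
welch <- t.test(bmi_with_diabetes, bmi_without_diabetes)
print(welch$conf.int)
```

With the real data, the two vectors would be the BMI column filtered on Diabetes_binary == 1 and Diabetes_binary == 0.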