Summary
Dataset Description
The dataset used for this analysis is the Diabetes Health Indicators Dataset from Kaggle. It contains various health indicators related to diabetes status, including BMI, physical activity, and more. The file loaded below has 70,692 rows and 22 columns. The dataset was suggested by my TA, and it can be accessed here: https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset/data?select=diabetes_binary_5050split_health_indicators_BRFSS2015.csv
Project Goal
The main goal of this project is to explore the relationship between various health indicators and diabetes status. Specifically, we aim to understand how factors like BMI and physical activity correlate with diabetes prevalence.
Interesting Aspects for Further Investigation
1. BMI Distribution: The histogram of BMI shows a right-skewed distribution, indicating that most individuals have a BMI in the lower range, but there are outliers with higher BMI values. This aspect is worth investigating to see how BMI correlates with diabetes status.
2. Physical Activity vs. Diabetes Status: The bar chart comparing diabetes status based on physical activity suggests that physical activity might be linked to lower diabetes prevalence. This relationship warrants further exploration.
Plan Moving Forward
- Conduct a detailed statistical analysis to test hypotheses about the relationships between health indicators and diabetes.
- Explore additional variables such as age, smoking status, and alcohol consumption.
- Use machine learning models to predict diabetes status based on health indicators.
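As a rough illustration of the last step in the plan, a logistic regression is one simple model that could be tried first. The sketch below is only an assumption about how this might look: it runs on a small made-up data frame (`toy`) that borrows the column names `BMI`, `PhysActivity`, and `Diabetes_binary` from the real dataset, and the 80/20 split and 0.5 cutoff are arbitrary choices, not results from this project.

```r
# Toy stand-in for the real data; all values are simulated for illustration only.
set.seed(1)
n <- 300
toy <- data.frame(
  BMI = rnorm(n, mean = 29, sd = 6),
  PhysActivity = rbinom(n, 1, 0.7)
)
# Made-up relationship used only to generate labels for the toy data.
log_odds <- -3 + 0.1 * toy$BMI - 0.5 * toy$PhysActivity
toy$Diabetes_binary <- rbinom(n, 1, plogis(log_odds))

# Hold out 20% of the rows to check predictions on unseen data.
train_idx <- sample(n, size = 0.8 * n)
train_set <- toy[train_idx, ]
test_set <- toy[-train_idx, ]

# Logistic regression: models the log-odds of diabetes as a linear function.
model <- glm(Diabetes_binary ~ BMI + PhysActivity,
             data = train_set, family = binomial)

# Predicted probabilities on the held-out rows, thresholded at 0.5.
pred_prob <- predict(model, newdata = test_set, type = "response")
pred_class <- as.integer(pred_prob > 0.5)
accuracy <- mean(pred_class == test_set$Diabetes_binary)
accuracy
```

On the real data, `toy` would be replaced by `dataset`, and more of the 22 columns could be added to the formula.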
Initial Findings
Hypotheses
1. Hypothesis 1: Higher BMI is associated with a higher likelihood of having diabetes.
2. Hypothesis 2: Individuals who engage in regular physical activity have a lower prevalence of diabetes compared to those who do not.
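One possible way to test these two hypotheses formally is sketched here. It is a minimal example under assumptions: the data frame `toy` is simulated and only mimics the column names of the real dataset, and the choice of a Welch t-test and a chi-squared test is one reasonable option rather than the analysis actually carried out in this project.

```r
# Toy stand-in for the real data; all values are simulated for illustration only.
set.seed(42)
n <- 200
toy <- data.frame(
  Diabetes_binary = rep(c(0, 1), each = n / 2),
  BMI = c(rnorm(n / 2, mean = 27, sd = 5), rnorm(n / 2, mean = 32, sd = 6)),
  PhysActivity = c(rbinom(n / 2, 1, 0.8), rbinom(n / 2, 1, 0.6))
)

# Hypothesis 1: compare mean BMI between the two diabetes groups (Welch t-test).
bmi_test <- t.test(BMI ~ Diabetes_binary, data = toy)

# Hypothesis 2: test whether physical activity and diabetes status are independent.
activity_table <- table(PhysActivity = toy$PhysActivity,
                        Diabetes_binary = toy$Diabetes_binary)
activity_test <- chisq.test(activity_table)

bmi_test$p.value
activity_test$p.value
```

Because BMI is right-skewed, a rank-based alternative such as `wilcox.test(BMI ~ Diabetes_binary, data = toy)` may also be worth considering.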
Visualizations
Visualization for Hypothesis 1
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load necessary libraries
library(dplyr)
library(ggplot2)
# Read the comma-delimited dataset (the path is specific to the author's machine)
dataset <- read_delim("C:/Users/Akshay Dembra/Downloads/Stats_Selected_Dataset/diabetes_binary_5050split_health_indicators_BRFSS2015_1.csv", delim = ",")
## Rows: 70692 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (22): Diabetes_binary, HighBP, HighChol, CholCheck, BMI, Smoker, Stroke,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dataset
# Plot BMI vs Diabetes Status
ggplot(dataset, aes(x = BMI, y = as.factor(Diabetes_binary))) +
  geom_jitter(alpha = 0.5) +
  labs(
    title = "BMI vs Diabetes Status",
    x = "BMI",
    y = "Diabetes Status (0 = No, 1 = Yes)"
  )
Visualization for Hypothesis 2
# Plot Physical Activity vs Diabetes Status
ggplot(dataset, aes(x = as.factor(PhysActivity), fill = as.factor(Diabetes_binary))) +
  geom_bar(position = "dodge") +
  labs(
    title = "Diabetes Status Based on Physical Activity",
    x = "Physical Activity",
    y = "Count",
    fill = "Diabetes Status"
  )
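The bar chart above shows raw counts, which can be hard to compare when the two activity groups differ in size. A possible complement is to compute the share of each diabetes status within each activity group. The sketch below assumes a simulated data frame `toy` that only mimics the column names of the real dataset, so the printed proportions say nothing about the actual data.

```r
# Toy stand-in for the real data; all values are simulated for illustration only.
set.seed(7)
toy <- data.frame(
  PhysActivity = rbinom(500, 1, 0.7),
  Diabetes_binary = rbinom(500, 1, 0.5)
)

# Cross-tabulate activity (rows) against diabetes status (columns).
counts <- table(PhysActivity = toy$PhysActivity,
                Diabetes_binary = toy$Diabetes_binary)

# Convert counts to proportions within each activity group (each row sums to 1).
row_props <- prop.table(counts, margin = 1)
row_props
```

With the real data, `toy` would be replaced by `dataset`. The file name indicates a 50/50 split between the two diabetes classes, so these within-group shares would describe this balanced sample rather than prevalence in the general population.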