Diabetes Risk Analysis Study
Farah Talib | Brooke Morris | Tassia Drame
Introduction
Understanding the factors that influence diabetes risk is important because it helps individuals take preventive measures, supports healthcare providers in early detection, and contributes to improving overall public health outcomes.
It also allows people to make more informed lifestyle choices regarding physical activity, diet, and health monitoring.
Studying diabetes risk helps identify key variables, such as age, gender, and physical activity, that may contribute to the development of the condition and guide more effective prevention strategies.
Project Goal
The goal of this project is to explore the factors that may influence the risk of diabetes, including demographic and lifestyle variables such as age, gender, and physical activity.
Data
The dataset contains 6,000 observations and 19 variables describing diabetes risk characteristics.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.2.0
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(knitr)
library(readxl)
diabetes_risk_dataset <- read_excel("~/Desktop/diabetes_risk_dataset.xlsx", skip = 1)
names(diabetes_risk_dataset) <- make.names(names(diabetes_risk_dataset), unique = TRUE)
data.frame(Variable_Names = names(diabetes_risk_dataset)) |>
  kable(caption = "Variable Names in Diabetes Risk Dataset")

| Variable_Names |
|---|
| Patient_ID |
| age |
| gender |
| bmi |
| blood_pressure |
| fasting_glucose_level |
| insulin_level |
| HbA1c_level |
| cholesterol_level |
| triglycerides_level |
| physical_activity_level |
| daily_calorie_intake |
| sugar_intake_grams_per_day |
| sleep_hours |
| stress_level |
| family_history_diabetes |
| waist_circumference_cm |
| diabetes_risk_score |
| diabetes_risk_category |
The dataset contains multiple variables related to health, lifestyle, and demographic characteristics. Key variables include age, gender, BMI, blood pressure, glucose levels, and physical activity, which are used to assess diabetes risk.
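As a quick sanity check on the import step, the sketch below shows what `make.names(..., unique = TRUE)` does to raw column headers and how `dim()` reports the number of observations and variables. The header strings and the `toy` data frame are illustrative assumptions, not the actual spreadsheet contents; run against the real data, `dim(diabetes_risk_dataset)` would be expected to return 6000 and 19.

```r
# Hypothetical raw headers, as they might appear in a spreadsheet (assumption).
raw_names <- c("Patient ID", "age", "age", "HbA1c level")

# make.names() turns each header into a syntactically valid R name;
# unique = TRUE appends a numeric suffix to duplicated names.
clean_names <- make.names(raw_names, unique = TRUE)
print(clean_names)

# A small stand-in data frame (not the real dataset) to illustrate dim().
toy <- data.frame(matrix(0, nrow = 5, ncol = length(clean_names)))
names(toy) <- clean_names
toy_dim <- dim(toy)  # c(number of rows, number of columns)
print(toy_dim)
```

Headers that are already valid and unique, such as the underscore-separated names listed in the table above, pass through `make.names()` unchanged.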
library(DT)
datatable(diabetes_risk_dataset)
Analysis
This project examines diabetes risk by first looking at the overall distribution of risk within the dataset. It then explores how key numeric variables, such as age and BMI, relate to diabetes risk. Finally, the analysis incorporates additional demographic and lifestyle factors, including gender, physical activity, and family history, to better understand patterns and potential contributors to increased risk.
Target Variable Analysis
To better understand diabetes risk, the target variable (diabetes risk score and category) is analyzed using summary statistics and distribution plots. This analysis provides insight into the overall spread of risk values, including the average, minimum, and maximum scores. In addition, the distribution of participant age is examined to provide context for the dataset’s demographic composition.
Figure 1: Distribution of Diabetes Risk Score
This histogram shows the overall distribution of diabetes risk scores. The results indicate noticeable spikes at 0% and 100% risk values, with the remaining scores distributed across the range.
Figure 2: Distribution of Diabetes Risk Categories
This bar chart displays the frequency of individuals across the risk categories. The counts vary only slightly between the low, moderate, and high-risk groups.
Figure 3: Distribution of Age Across Participants
This bar chart illustrates the distribution of participant ages in the dataset, showing how individuals are spread across ages.
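The spikes described for Figure 1 can be quantified by counting how many scores sit exactly at the boundaries. The following is a minimal sketch on simulated values (`toy_scores` is an assumption made up for illustration); with the real data the same expressions would be applied to `diabetes_risk_dataset$diabetes_risk_score`.

```r
# Simulated scores for illustration only (assumption): exact 0s, exact 100s,
# and values strictly in between, loosely mimicking the described shape.
set.seed(1)
toy_scores <- c(rep(0, 20), rep(100, 30),
                round(runif(50, min = 1, max = 99), 1))

# Share of observations sitting exactly at each boundary.
share_at_0   <- mean(toy_scores == 0)
share_at_100 <- mean(toy_scores == 100)
print(c(at_0 = share_at_0, at_100 = share_at_100))

# Counts per 5-point bin. right = FALSE makes bins closed on the left, and
# include.lowest = TRUE keeps the value 100 inside the last bin.
bin_counts <- table(cut(toy_scores, breaks = seq(0, 100, by = 5),
                        include.lowest = TRUE, right = FALSE))
print(bin_counts)
```

The bin edges here are chosen for readability and may be aligned differently from the bins drawn by `geom_histogram(binwidth = 5)`.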
summary(diabetes_risk_dataset$diabetes_risk_score)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.00   12.90   43.90   48.69   88.80  100.00
# Figure 1: Distribution of Diabetes Risk Score
ggplot(diabetes_risk_dataset, aes(x = diabetes_risk_score)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "black") +
  theme_minimal()

# Figure 2: Distribution of Diabetes Risk Categories
ggplot(diabetes_risk_dataset, aes(x = diabetes_risk_category)) +
  geom_bar(fill = "darkred") +
  theme_minimal()

# Figure 3: Distribution of Age Across Participants
ggplot(diabetes_risk_dataset, aes(x = age)) +
  geom_bar(fill = "darkseagreen3", color = "firebrick") +
  theme_minimal() +
  labs(
    title = "Distribution of Age Across Participants",
    x = "Age",
    y = "Count"
  )
Two-Dimension Analysis
Two-dimensional analysis examines relationships between pairs of variables to determine how individual factors influence diabetes risk.
Figure 4: Diabetes Risk Score by Gender
This boxplot compares diabetes risk scores between males and females. The results suggest slight differences between genders, with females showing a higher median risk.
Figure 5: Physical Activity by Family History of Diabetes
This bar chart illustrates the relationship between family history and physical activity levels. Individuals with a family history of diabetes tend to show slightly lower activity levels.
Figure 6: BMI vs Diabetes Risk Score
This scatter plot examines the relationship between BMI and diabetes risk score. Each point represents an individual observation, while the fitted trend line illustrates the overall pattern. The results suggest a positive relationship, indicating that higher BMI values are generally associated with increased diabetes risk.
Figure 7: Age vs Diabetes Risk Score
This scatter plot shows the relationship between age and diabetes risk score. There appears to be no clear correlation between the two.
Figure 8: Sugar Intake vs Diabetes Risk Score
This scatter plot examines the relationship between daily sugar intake and diabetes risk score. The trend line suggests that higher sugar intake may be associated with increased diabetes risk.
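Visual impressions such as "positive relationship" or "no correlation" can be backed by a number. The sketch below computes Pearson correlations and the least-squares slope on simulated data; the `toy` data frame and the way its scores are generated are assumptions for illustration and say nothing about the real dataset's values.

```r
# Simulated data for illustration only (assumption): the score is built to
# rise with BMI and to be unrelated to age.
set.seed(42)
n <- 200
toy <- data.frame(
  bmi = rnorm(n, mean = 28, sd = 5),
  age = sample(18:80, n, replace = TRUE)
)
# Keep the simulated score inside the 0 to 100 range.
toy$diabetes_risk_score <- pmin(pmax(2 * toy$bmi + rnorm(n, sd = 10), 0), 100)

# Pearson correlation: values near 0 suggest no linear association,
# values near +1 or -1 a strong one.
r_bmi <- cor(toy$bmi, toy$diabetes_risk_score)
r_age <- cor(toy$age, toy$diabetes_risk_score)
print(round(c(bmi = r_bmi, age = r_age), 2))

# Slope of the straight line that geom_smooth(method = "lm") draws.
fit <- lm(diabetes_risk_score ~ bmi, data = toy)
slope_bmi <- unname(coef(fit)["bmi"])
print(slope_bmi)
```

Applied to the real data, the same calls would use `diabetes_risk_dataset` in place of `toy`; `cor()` returns `NA` if either column contains missing values unless `use = "complete.obs"` is supplied.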
# Figure 4: Diabetes Risk Score by Gender
ggplot(diabetes_risk_dataset, aes(x = gender, y = diabetes_risk_score, fill = gender)) +
  geom_boxplot() +
  theme_minimal()

# Figure 5: Physical Activity by Family History of Diabetes
ggplot(diabetes_risk_dataset,
       aes(x = physical_activity_level, fill = family_history_diabetes)) +
  geom_bar(position = "dodge") +
  theme_minimal()

# Figure 6: BMI vs Diabetes Risk Score
ggplot(diabetes_risk_dataset,
       aes(x = bmi, y = diabetes_risk_score)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

# Figure 7: Age vs Diabetes Risk Score
ggplot(diabetes_risk_dataset,
       aes(x = age, y = diabetes_risk_score)) +
  geom_point() +
  theme_minimal()

# Figure 8: Sugar Intake vs Diabetes Risk Score
ggplot(diabetes_risk_dataset,
       aes(x = sugar_intake_grams_per_day, y = diabetes_risk_score)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_minimal() +
  labs(
    title = "Sugar Intake vs Diabetes Risk Score",
    x = "Daily Sugar Intake (grams)",
    y = "Diabetes Risk Score"
  )
`geom_smooth()` using formula = 'y ~ x'
Three-Dimension Analysis
Three-dimensional analysis explores how multiple variables interact simultaneously. By incorporating a grouping variable such as gender, this analysis provides a deeper understanding of how different groups are distributed across levels of diabetes risk.
Figure 9: Diabetes Risk Category by Gender
This stacked bar chart shows the proportion of males and females within each diabetes risk category. While differences exist, the overall distribution suggests that gender alone may not strongly determine diabetes risk.
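The proportions that a filled stacked bar chart displays can also be tabulated directly. The sketch below uses a small made-up sample (the `toy` rows and the category labels "Low" and "High" are hypothetical) to show how within-category gender shares are computed.

```r
# Small made-up sample for illustration (assumption); labels are hypothetical.
toy <- data.frame(
  diabetes_risk_category = c("Low", "Low", "Low", "Low",
                             "High", "High", "High", "High"),
  gender = c("Female", "Male", "Male", "Female",
             "Female", "Female", "Female", "Male")
)

# Cross-tabulate category (rows) against gender (columns).
counts <- table(toy$diabetes_risk_category, toy$gender)

# margin = 1 divides each row by its row total, so the shares within every
# category sum to 1, which is what position = "fill" plots.
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

In this toy sample the "High" row is 75% female and the "Low" row is split evenly, which is the kind of comparison the chart makes visually.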
# Figure 9: Diabetes Risk Category by Gender
ggplot(diabetes_risk_dataset,
       aes(x = diabetes_risk_category, fill = gender)) +
  geom_bar(position = "fill") +
  theme_minimal()
Outlier Analysis
Outlier analysis was conducted to identify extreme values in diabetes risk scores that fall outside the typical range.
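Figure 10: Outliers in Diabetes Risk Score
This boxplot visualizes the spread of diabetes risk scores. Because the interquartile range is wide (12.90 to 88.80), the whiskers reach the minimum and maximum scores (0 and 100), so no individual points are flagged as outliers under the default 1.5 × IQR rule. The extreme values instead show up as the clusters at 0 and 100 seen in Figure 1.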
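Whether any score counts as an outlier can be checked from the quartiles in the `summary()` output reported earlier. The sketch below applies the 1.5 × IQR rule, which is the default that `geom_boxplot()` uses to decide which points to draw individually; the variable names are introduced here for illustration.

```r
# Quartiles taken from the summary() output reported in this document.
q1 <- 12.90
q3 <- 88.80

iqr <- q3 - q1                 # interquartile range
lower_fence <- q1 - 1.5 * iqr  # points below this would be drawn as outliers
upper_fence <- q3 + 1.5 * iqr  # points above this would be drawn as outliers
print(c(iqr = iqr, lower = lower_fence, upper = upper_fence))

# Observed range from the same summary: minimum 0, maximum 100.
any_outliers <- (0 < lower_fence) || (100 > upper_fence)
print(any_outliers)
```

Both fences (-100.95 and 202.65) fall outside the 0 to 100 range of the scores, so under this rule no observation is classified as an outlier.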
# Figure 10: Outliers in Diabetes Risk Score
ggplot(diabetes_risk_dataset, aes(y = diabetes_risk_score)) +
  geom_boxplot(fill = "orange") +
  theme_minimal()
Conclusion
Diabetes risk appears relatively balanced between males and females, with only slight differences.
Females show a slightly higher presence in higher-risk and pre-diabetes categories.
Gender alone does not appear to be a strong predictor of diabetes risk.
Individuals with a family history of diabetes tend to have slightly lower physical activity levels.
Physical activity may play a role in influencing diabetes risk.
Higher BMI levels are associated with increased diabetes risk, suggesting a positive relationship between body composition and risk.
Overall, diabetes risk appears to be associated with a combination of demographic, lifestyle, and clinical factors rather than a single variable.
Contact Information
Thanks for visiting our page!
tdrame1@students.kennesaw.edu
cmorr206@students.kennesaw.edu
ftalib1@students.kennesaw.edu