This project provides a comprehensive, data-driven exploration of diabetes and its associated risk factors using a dataset of 10,000 individuals. The analysis focuses on key clinical indicators such as fasting glucose, HbA1c, BMI, cholesterol levels, and blood pressure, along with lifestyle variables like smoking, alcohol consumption, and physical activity. Summary statistics, visualizations, and statistical tests are used to understand how these factors vary across demographic groups and how they contribute to overall health risk. In addition, clustering and K-Nearest Neighbors (KNN) classification techniques are applied to identify high-risk individuals and uncover meaningful patterns within the population. Overall, this study demonstrates how data analytics can effectively highlight important predictors of diabetes and support early risk detection.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'ggplot2' was built under R version 4.4.3
## Warning: package 'readr' was built under R version 4.4.3
## Warning: package 'lubridate' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 4.0.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(class)  # provides knn(); tidyr, lubridate, readr, dplyr, and ggplot2 are already attached by tidyverse
## Warning: package 'class' was built under R version 4.4.3
Diabetes_data_2 <- read_csv("C:/Users/MANISH/OneDrive/Desktop/CAP 484/Diabetes data 2.csv")
## Rows: 10000 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Sex, Ethnicity, Physical_Activity_Level, Alcohol_Consumption, Smok...
## dbl (16): S no., Age, BMI, Waist_Circumference, Fasting_Blood_Glucose, HbA1c...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(Diabetes_data_2)
str(Diabetes_data_2)
## spc_tbl_ [10,000 × 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ S no. : num [1:10000] 0 1 2 3 4 5 6 7 8 9 ...
## $ Age : num [1:10000] 58 48 34 62 27 40 58 38 42 30 ...
## $ Sex : chr [1:10000] "Female" "Male" "Female" "Male" ...
## $ Ethnicity : chr [1:10000] "White" "Asian" "Black" "Asian" ...
## $ BMI : num [1:10000] 35.8 24.1 25 32.7 33.5 33.6 33.2 26.9 27 24 ...
## $ Waist_Circumference : num [1:10000] 83.4 71.4 113.8 100.4 110.8 ...
## $ Fasting_Blood_Glucose : num [1:10000] 124 184 142 167 146 ...
## $ HbA1c : num [1:10000] 10.9 12.8 14.5 8.8 7.1 13.5 13.3 10.9 7 14 ...
## $ Blood_Pressure_Systolic : num [1:10000] 152 103 179 176 122 170 131 121 132 146 ...
## $ Blood_Pressure_Diastolic : num [1:10000] 114 91 104 118 97 90 80 83 118 83 ...
## $ Cholesterol_Total : num [1:10000] 198 262 261 183 203 ...
## $ Cholesterol_HDL : num [1:10000] 50.2 62 32.1 41.1 53.9 44.5 77.9 69.7 73.2 53.3 ...
## $ Cholesterol_LDL : num [1:10000] 99.2 146.4 164.1 84 92.8 ...
## $ GGT : num [1:10000] 37.5 88.5 56.2 34.4 81.9 77.5 52.1 72 76.4 14.5 ...
## $ Serum_Urate : num [1:10000] 7.2 6.1 6.9 5.4 7.4 6.4 4.7 5.6 6.2 6.9 ...
## $ Physical_Activity_Level : chr [1:10000] "Moderate" "Moderate" "Low" "Low" ...
## $ Dietary_Intake_Calories : num [1:10000] 1538 2653 1684 3796 3161 ...
## $ Alcohol_Consumption : chr [1:10000] "Moderate" "Moderate" "Heavy" "Moderate" ...
## $ Smoking_Status : chr [1:10000] "Never" "Current" "Former" "Never" ...
## $ Family_History_of_Diabetes : num [1:10000] 0 0 1 1 0 1 0 0 1 1 ...
## $ Previous_Gestational_Diabetes: num [1:10000] 1 1 0 0 0 1 0 1 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. `S no.` = col_double(),
## .. Age = col_double(),
## .. Sex = col_character(),
## .. Ethnicity = col_character(),
## .. BMI = col_double(),
## .. Waist_Circumference = col_double(),
## .. Fasting_Blood_Glucose = col_double(),
## .. HbA1c = col_double(),
## .. Blood_Pressure_Systolic = col_double(),
## .. Blood_Pressure_Diastolic = col_double(),
## .. Cholesterol_Total = col_double(),
## .. Cholesterol_HDL = col_double(),
## .. Cholesterol_LDL = col_double(),
## .. GGT = col_double(),
## .. Serum_Urate = col_double(),
## .. Physical_Activity_Level = col_character(),
## .. Dietary_Intake_Calories = col_double(),
## .. Alcohol_Consumption = col_character(),
## .. Smoking_Status = col_character(),
## .. Family_History_of_Diabetes = col_double(),
## .. Previous_Gestational_Diabetes = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
INTERPRETATION
The dataset contains 10,000 records with demographic, clinical, and lifestyle information.
Key health variables like BMI, glucose, HbA1c, blood pressure, and cholesterol are available in numeric form for analysis.
Lifestyle factors such as smoking, alcohol consumption, and physical activity are stored as categories, useful for behavioral comparisons.
Additional fields for family history of diabetes and previous gestational diabetes help identify individuals with higher diabetes risk.
Overall, the dataset provides a comprehensive view of factors associated with diabetes, making it suitable for data-driven health analysis.
summary(Diabetes_data_2)
## S no. Age Sex Ethnicity
## Min. : 0 Min. :20.00 Length:10000 Length:10000
## 1st Qu.:2500 1st Qu.:32.00 Class :character Class :character
## Median :5000 Median :45.00 Mode :character Mode :character
## Mean :5000 Mean :44.62
## 3rd Qu.:7499 3rd Qu.:57.00
## Max. :9999 Max. :69.00
## BMI Waist_Circumference Fasting_Blood_Glucose HbA1c
## Min. :18.50 Min. : 70.0 Min. : 70.0 Min. : 4.000
## 1st Qu.:24.10 1st Qu.: 82.2 1st Qu.:102.2 1st Qu.: 6.800
## Median :29.50 Median : 94.9 Median :134.5 Median : 9.500
## Mean :29.42 Mean : 94.8 Mean :134.8 Mean : 9.508
## 3rd Qu.:34.70 3rd Qu.:107.0 3rd Qu.:167.8 3rd Qu.:12.300
## Max. :40.00 Max. :120.0 Max. :200.0 Max. :15.000
## Blood_Pressure_Systolic Blood_Pressure_Diastolic Cholesterol_Total
## Min. : 90.0 Min. : 60.00 Min. :150.0
## 1st Qu.:112.0 1st Qu.: 75.00 1st Qu.:187.9
## Median :134.0 Median : 89.00 Median :225.5
## Mean :134.2 Mean : 89.56 Mean :225.2
## 3rd Qu.:157.0 3rd Qu.:105.00 3rd Qu.:262.4
## Max. :179.0 Max. :119.00 Max. :300.0
## Cholesterol_HDL Cholesterol_LDL GGT Serum_Urate
## Min. :30.00 Min. : 70.0 Min. : 10.00 Min. :3.000
## 1st Qu.:42.30 1st Qu.:101.7 1st Qu.: 32.60 1st Qu.:4.200
## Median :55.20 Median :134.4 Median : 55.45 Median :5.500
## Mean :55.02 Mean :134.4 Mean : 55.17 Mean :5.503
## 3rd Qu.:67.90 3rd Qu.:166.4 3rd Qu.: 77.50 3rd Qu.:6.800
## Max. :80.00 Max. :200.0 Max. :100.00 Max. :8.000
## Physical_Activity_Level Dietary_Intake_Calories Alcohol_Consumption
## Length:10000 Min. :1500 Length:10000
## Class :character 1st Qu.:2129 Class :character
## Mode :character Median :2727 Mode :character
## Mean :2742
## 3rd Qu.:3368
## Max. :3999
## Smoking_Status Family_History_of_Diabetes Previous_Gestational_Diabetes
## Length:10000 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :1.000 Median :1.0000
## Mean :0.507 Mean :0.5165
## 3rd Qu.:1.000 3rd Qu.:1.0000
## Max. :1.000 Max. :1.0000
INTERPRETATION
The dataset covers a wide age range (20–69 years) with balanced BMI, glucose, and cholesterol values, showing good variability for analysis.
Blood pressure levels (systolic & diastolic) show realistic medical ranges and indicate the presence of both normal and high BP cases.
HbA1c and fasting glucose values show considerable spread, suggesting a mix of normal, prediabetic, and high-risk individuals.
Lifestyle variables like physical activity, alcohol intake, and smoking status are categorical and well-distributed across the dataset.
Family history and gestational diabetes indicators are also included, helping analyze genetic and maternal risk factors for diabetes.
table(is.na(Diabetes_data_2))
##
## FALSE
## 210000
INTERPRETATION
The table shows that all values in the dataset are FALSE for NA, meaning no missing values are present.
A total of 210,000 data points (10,000 rows × 21 columns) have complete entries.
This indicates the dataset is clean and ready for analysis without needing any imputation.
Since there are no NA values, further preprocessing becomes easier and more reliable.
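A per-column count can be easier to read than a single pooled table, because it shows where any missing values sit. The sketch below uses a small stand-in data frame (`toy`, hypothetical) rather than the real file; with the real data the equivalent call would be `colSums(is.na(Diabetes_data_2))`.

```r
# Hypothetical stand-in data with one missing value in each column
toy <- data.frame(
  BMI = c(24.1, NA, 33.5),
  Smoking_Status = c("Never", "Current", NA)
)

# Count missing values column by column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Overall check: TRUE only if the data frame has no missing values at all
no_missing <- !anyNA(toy)
print(no_missing)
```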
glucose_mean <- mean(Diabetes_data_2$Fasting_Blood_Glucose, na.rm = TRUE)
glucose_median <- median(Diabetes_data_2$Fasting_Blood_Glucose, na.rm = TRUE)
glucose_mode <- mode(Diabetes_data_2$Fasting_Blood_Glucose)
glucose_sd <- sd(Diabetes_data_2$Fasting_Blood_Glucose, na.rm = TRUE)
print(glucose_mean)
## [1] 134.7762
print(glucose_median)
## [1] 134.5
print(glucose_mode)
## [1] "numeric"
print(glucose_sd)
## [1] 37.63354
INTERPRETATION
The average fasting glucose level is around 134.8 mg/dL, indicating moderately elevated glucose in the dataset.
The median glucose value (134.5 mg/dL) is very close to the mean, showing a fairly balanced distribution.
The mode is returned as “numeric” because R’s built-in mode() function shows data type, not statistical mode.
The standard deviation is 37.6, meaning glucose levels vary widely among individuals.
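Since base R's mode() reports the storage type, a statistical mode needs a small helper. The sketch below is one possible version; `stat_mode` is a hypothetical name, not a base R function, and the input vector is made up for illustration.

```r
# Statistical mode: the most frequent value(s) in a vector.
# stat_mode is a hypothetical helper, not part of base R.
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  counts <- table(x)
  # Keep every value tied for the highest frequency
  modes <- names(counts)[counts == max(counts)]
  if (is.numeric(x)) as.numeric(modes) else modes
}

# Made-up glucose readings; 134.5 appears twice
example_mode <- stat_mode(c(110, 134.5, 134.5, 150, 199))
print(example_mode)
```

For a continuous measurement such as fasting glucose, many values occur only once or twice, so the mode is usually less informative than the mean and median.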
hist(
Diabetes_data_2$Age,
breaks = 20,
main = "Distribution of Age",
xlab = "Age (years)",
col = "orange",
border = "black"
)
INTERPRETATION
The age distribution appears fairly uniform, indicating individuals from many age groups are represented evenly.
Most age groups have 300–450 individuals, showing no major imbalance in the dataset.
A slightly higher frequency is seen in younger ages (around 20–25), but overall variation is small.
This balanced age spread makes the dataset suitable for comparing diabetes risk across different age groups.
ggplot(Diabetes_data_2, aes(x = Fasting_Blood_Glucose)) +
geom_histogram(binwidth = 5, fill = "steelblue", color = "black") +
labs(
title = "Distribution of Glucose Levels",
x = "Fasting Blood Glucose (mg/dL)",
y = "Frequency"
) +
theme_minimal()
INTERPRETATION
The fasting blood glucose values range from around 70 to 200 mg/dL, covering normal, prediabetic, and high levels.
The distribution appears fairly uniform, with similar frequencies across most glucose ranges.
No strong skewness is visible, meaning both lower and higher glucose values occur consistently in the dataset.
A large number of individuals fall between 100–170 mg/dL, indicating many people are in borderline or elevated glucose categories.
ggplot(Diabetes_data_2, aes(x = factor(Family_History_of_Diabetes))) +
geom_bar(
fill = "red",
color = "white",
alpha = 0.9
) +
labs(
title = "Frequency of Family History of Diabetes",
x = "Family History of Diabetes (0 = No, 1 = Yes)",
y = "Count"
) +
theme_minimal()
INTERPRETATION
The bar chart shows two groups: individuals with and without a family history of diabetes.
Both groups appear to have almost equal counts, indicating a balanced distribution.
This means the dataset contains a similar number of people from both categories.
Such balance helps in making unbiased comparisons in later analysis.
Diabetes_data_2 %>%
group_by(Age) %>%
summarise(Average_Glucose = mean(Fasting_Blood_Glucose, na.rm = TRUE))
## # A tibble: 50 × 2
## Age Average_Glucose
## <dbl> <dbl>
## 1 20 135.
## 2 21 140.
## 3 22 135.
## 4 23 136.
## 5 24 134.
## 6 25 138.
## 7 26 135.
## 8 27 134.
## 9 28 138.
## 10 29 134.
## # ℹ 40 more rows
INTERPRETATION
The average fasting glucose levels remain fairly consistent across different ages, mostly ranging between 133–140 mg/dL.
There is no sharp rise or drop in glucose levels for any specific age, indicating a stable pattern across the population.
Small variations exist from age to age, but overall the trend suggests that age alone does not strongly influence average glucose levels in this dataset.
This consistency shows that other factors (BMI, lifestyle, genetics) may play a bigger role in glucose variation.
ggplot(Diabetes_data_2, aes(x = Smoking_Status, y = BMI, fill = Smoking_Status)) +
geom_boxplot(alpha = 0.85) +
labs(
title = "BMI Distribution Across Smoking Categories",
x = "Smoking Category",
y = "BMI"
) +
theme_minimal(base_size = 14)
INTERPRETATION
The median BMI is almost the same across all smoking groups (Current, Former, and Never).
All three categories show a similar spread of BMI values, indicating no major difference in weight patterns.
A few outliers exist in each group, but these do not significantly affect the overall distribution.
Overall, smoking status does not appear to strongly influence BMI in this dataset.
Diabetes_data_2 %>%
group_by(Physical_Activity_Level) %>%
summarise(Average_Glucose = mean(Fasting_Blood_Glucose, na.rm = TRUE))
## # A tibble: 3 × 2
## Physical_Activity_Level Average_Glucose
## <chr> <dbl>
## 1 High 135.
## 2 Low 134.
## 3 Moderate 135.
INTERPRETATION
The average glucose levels are very similar across all activity levels (High, Moderate, and Low).
The High and Moderate groups both average about 135 mg/dL and the Low group about 134 mg/dL, a gap of only around 1 mg/dL.
This suggests that physical activity level does not have a strong impact on fasting glucose in this dataset.
Overall, glucose levels remain fairly stable regardless of activity category.
Diabetes_data_2%>%
group_by(Alcohol_Consumption) %>%
summarise(
Avg_Systolic = mean(Blood_Pressure_Systolic, na.rm = TRUE),
Avg_Diastolic = mean(Blood_Pressure_Diastolic, na.rm = TRUE)
)
## # A tibble: 3 × 3
## Alcohol_Consumption Avg_Systolic Avg_Diastolic
## <chr> <dbl> <dbl>
## 1 Heavy 134. 89.8
## 2 Moderate 134. 89.6
## 3 None 134. 89.3
INTERPRETATION
Average systolic blood pressure is very similar across all groups, with only a small difference of about 1 mmHg.
Heavy drinkers show the lowest systolic BP, while moderate drinkers show the highest, but the difference is minimal.
Diastolic BP is also stable, ranging only from 89.3 mmHg (None) to 89.8 mmHg (Heavy).
Overall, alcohol consumption does not show a strong impact on blood pressure levels in this dataset.
Diabetes_data_2 %>%
group_by(Smoking_Status) %>%
summarise(
Average_Glucose = mean(Fasting_Blood_Glucose, na.rm = TRUE)
)
## # A tibble: 3 × 2
## Smoking_Status Average_Glucose
## <chr> <dbl>
## 1 Current 135.
## 2 Former 135.
## 3 Never 134.
INTERPRETATION
Average fasting glucose is nearly identical among smokers and non-smokers (Current = 135, Former = 135, Never = 134).
This suggests smoking status does not strongly impact glucose levels.
Overall, smoking does not appear to significantly influence fasting glucose in this dataset.
Diabetes_data_2 %>%
group_by(Alcohol_Consumption) %>%
summarise(Avg_BMI = mean(BMI, na.rm = TRUE))
## # A tibble: 3 × 2
## Alcohol_Consumption Avg_BMI
## <chr> <dbl>
## 1 Heavy 29.5
## 2 Moderate 29.6
## 3 None 29.2
anova_result <- aov(BMI ~ Alcohol_Consumption, data = Diabetes_data_2)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## Alcohol_Consumption 2 356 178.04 4.679 0.00931 **
## Residuals 9997 380402 38.05
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
INTERPRETATION
The average BMI is almost the same across all alcohol groups (Heavy, Moderate, None), with only very small differences.
Heavy and Moderate drinkers show slightly higher BMI than non-drinkers, but the gap is minimal.
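The ANOVA test (F = 4.679, p = 0.0093) indicates a statistically significant difference in mean BMI across alcohol consumption categories at the 0.05 level.
However, the group means differ by at most 0.4 BMI units, so with 10,000 observations the difference is statistically detectable but of little practical importance in this dataset.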
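Alongside the p-value, the practical size of the group difference can be gauged with eta squared, the share of total variance explained by the grouping. The sketch below computes it from the sums of squares printed in the ANOVA table; the variable names are illustrative.

```r
# Sums of squares copied from the ANOVA table above
ss_between  <- 356     # Alcohol_Consumption
ss_residual <- 380402  # Residuals

# Eta squared: proportion of BMI variance explained by alcohol group
eta_sq <- ss_between / (ss_between + ss_residual)
print(round(eta_sq, 5))
```

A value this close to zero means alcohol group accounts for well under 1% of the variation in BMI. With the fitted object available, `TukeyHSD(anova_result)` would show which pairs of groups differ.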
ggplot(Diabetes_data_2, aes(x = BMI, y = Blood_Pressure_Systolic)) +
geom_point(color = "blue") +
labs(title = "Relationship Between BMI and Systolic Blood Pressure",
x = "BMI",
y = "Systolic Blood Pressure") +
theme_minimal()
INTERPRETATION
The scatter plot shows no clear pattern between BMI and systolic blood pressure.
Data points are widely scattered, indicating a very weak or no relationship.
Blood pressure levels remain similar across different BMI values.
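A visual impression of "no pattern" can be backed by a correlation coefficient. The sketch below runs a Pearson correlation test on simulated stand-in data (`toy`, hypothetical, drawn independently over the same ranges as the real columns); on the real data the call would use `Diabetes_data_2$BMI` and `Diabetes_data_2$Blood_Pressure_Systolic`.

```r
set.seed(1)

# Hypothetical stand-in for the two columns, drawn independently
toy <- data.frame(
  BMI = runif(200, 18.5, 40),
  Blood_Pressure_Systolic = runif(200, 90, 179)
)

# Pearson correlation with a test of H0: correlation = 0
ct <- cor.test(toy$BMI, toy$Blood_Pressure_Systolic)
print(ct$estimate)  # correlation coefficient r
print(ct$p.value)   # p-value for the test
```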
Diabetes_data_2%>%
mutate(
Risk_Score = Fasting_Blood_Glucose + HbA1c + BMI
) %>%
arrange(desc(Risk_Score)) %>%
head(5)
## # A tibble: 5 × 22
## `S no.` Age Sex Ethnicity BMI Waist_Circumference Fasting_Blood_Glucose
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 4313 41 Male White 39.6 117. 199.
## 2 4973 25 Male Asian 39.4 97.4 200.
## 3 3908 29 Female Hispanic 39.6 92.8 199.
## 4 7021 28 Female Hispanic 39.5 112. 198.
## 5 4698 49 Female Black 38.9 91.2 198.
## # ℹ 15 more variables: HbA1c <dbl>, Blood_Pressure_Systolic <dbl>,
## # Blood_Pressure_Diastolic <dbl>, Cholesterol_Total <dbl>,
## # Cholesterol_HDL <dbl>, Cholesterol_LDL <dbl>, GGT <dbl>, Serum_Urate <dbl>,
## # Physical_Activity_Level <chr>, Dietary_Intake_Calories <dbl>,
## # Alcohol_Consumption <chr>, Smoking_Status <chr>,
## # Family_History_of_Diabetes <dbl>, Previous_Gestational_Diabetes <dbl>,
## # Risk_Score <dbl>
INTERPRETATION
The top 5 highest-risk profiles have extremely high glucose, HbA1c, and BMI values.
Their glucose levels are near 200 mg/dL and BMI about 40, indicating severe metabolic risk.
These individuals should be considered a priority group for medical intervention.
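Because the score adds raw values, fasting glucose (70–200 mg/dL) contributes far more than HbA1c (4–15) or BMI (18.5–40). One alternative, shown here as an assumption and not as the method used above, is to standardize each variable before summing so that all three carry equal weight. The `toy` rows are made up to show how the ranking can change.

```r
# Made-up profiles: row 3 has very high glucose only,
# row 1 has low glucose but very high HbA1c and BMI
toy <- data.frame(
  Fasting_Blood_Glucose = c(75, 134, 199),
  HbA1c = c(14.0, 9.5, 5.0),
  BMI = c(39.0, 29.5, 20.0)
)

# Raw sum, as in the analysis above: dominated by glucose
raw_score <- rowSums(toy)

# Sum of z-scores: each variable is centred and scaled first
z_score <- rowSums(scale(toy))

print(raw_score)
print(round(z_score, 2))
```

In this toy example the raw score ranks row 3 highest, while the standardized score ranks row 1 highest.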
Diabetes_data_2$Outcome <- ifelse(Diabetes_data_2$Fasting_Blood_Glucose > 126, 1, 0)
Diabetes_data_2$Age_Range <-Diabetes_data_2$Age %/% 10 * 10
Diabetes_data_2%>%
group_by(Age_Range) %>%
summarise(
Prevalence = mean(Outcome, na.rm = TRUE)
) %>%
arrange(desc(Prevalence))
## # A tibble: 5 × 2
## Age_Range Prevalence
## <dbl> <dbl>
## 1 20 0.591
## 2 60 0.580
## 3 50 0.557
## 4 40 0.556
## 5 30 0.550
INTERPRETATION
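Using fasting glucose above 126 mg/dL as a proxy for diabetes, the 20–29 age group has the highest prevalence (59.1%), followed by individuals in their 60s (58.0%).
The remaining groups fall between 55.0% and 55.7%, so prevalence varies by only about 4 percentage points across age bands.
The slightly higher figure among younger adults may reflect lifestyle or genetic factors, but the gap is too small to establish an age trend on its own.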
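The Outcome flag above uses a strict `> 126` comparison and numeric decade labels. A variant, offered as an assumption and not as the author's method, uses `>= 126` (the fasting cutoff is commonly stated as 126 mg/dL or higher) and labelled age bands built with cut(). The `toy` data frame is hypothetical.

```r
# Hypothetical stand-in rows, including a reading of exactly 126
toy <- data.frame(
  Age = c(20, 29, 30, 45, 69),
  Fasting_Blood_Glucose = c(126, 110, 150, 126.1, 90)
)

# Labelled decade bands: [20,30), [30,40), ..., [60,70)
toy$Age_Band <- cut(
  toy$Age,
  breaks = seq(20, 70, by = 10),
  right = FALSE,
  labels = c("20-29", "30-39", "40-49", "50-59", "60-69")
)

# Flag readings at or above the 126 mg/dL cutoff
toy$Above_Threshold <- as.integer(toy$Fasting_Blood_Glucose >= 126)

# Share of flagged individuals per band (NA for bands with no rows)
prevalence <- tapply(toy$Above_Threshold, toy$Age_Band, mean)
print(prevalence)
```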
Diabetes_data_2 %>%
group_by(Family_History_of_Diabetes) %>%
summarise(
Avg_Glucose = mean(Fasting_Blood_Glucose, na.rm = TRUE)
)
## # A tibble: 2 × 2
## Family_History_of_Diabetes Avg_Glucose
## <dbl> <dbl>
## 1 0 135.
## 2 1 135.
INTERPRETATION
Average fasting glucose is the same for those with and without a family history (both ~135 mg/dL).
Family history does not show a noticeable association with higher glucose levels.
Family history therefore shows no visible effect on average glucose in this dataset, although no formal test was run.
Diabetes_data_2%>%
group_by(Sex) %>%
summarise(
Avg_Glucose = mean(Fasting_Blood_Glucose, na.rm = TRUE)
)
## # A tibble: 2 × 2
## Sex Avg_Glucose
## <chr> <dbl>
## 1 Female 134.
## 2 Male 135.
INTERPRETATION
Males (135 mg/dL) and females (134 mg/dL) have nearly identical average glucose values.
The 1 mg/dL gap is negligible relative to the overall standard deviation of about 37.6 mg/dL, although no formal test was run.
Sex does not appear to influence glucose levels in this sample.
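A two-sample t-test would turn the comparison of group means into a formal check. The sketch below applies Welch's t-test to simulated stand-in data (`toy`, hypothetical); on the real data the formula would be the same with `data = Diabetes_data_2`.

```r
set.seed(2)

# Hypothetical stand-in: glucose drawn from the same range for both sexes
toy <- data.frame(
  Sex = rep(c("Female", "Male"), each = 100),
  Fasting_Blood_Glucose = runif(200, 70, 200)
)

# Welch two-sample t-test (t.test's default: unequal variances allowed)
tt <- t.test(Fasting_Blood_Glucose ~ Sex, data = toy)
print(tt$estimate)  # group means
print(tt$p.value)   # p-value for H0: equal means
```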
ggplot(Diabetes_data_2, aes(x = Cholesterol_Total)) +
geom_histogram(bins = 30,
fill = "green",
color = "white",
alpha = 0.85) +
theme_minimal(base_size = 14) +
labs(
title = "Distribution of Total Cholesterol",
x = "Total Cholesterol",
y = "Frequency"
)
INTERPRETATION
Total cholesterol values are evenly spread between 150–300 mg/dL.
There is no strong peak, showing a wide variation across individuals.
Cholesterol appears to follow a roughly uniform distribution rather than clustering.
ggplot(Diabetes_data_2, aes(x = Physical_Activity_Level,
fill = Physical_Activity_Level)) +
geom_bar(alpha = 0.9) +
theme_minimal(base_size = 14) +
labs(
title = "Count of Individuals by Physical Activity Level",
x = "Physical Activity Level",
y = "Count"
)
INTERPRETATION
The bar plot shows that the number of individuals in each physical activity category (High, Low, Moderate) is almost the same.
This indicates a balanced distribution of activity levels across the dataset.
No single physical activity group dominates, meaning lifestyle variation is evenly represented in the population.
ggplot(Diabetes_data_2, aes(x = Fasting_Blood_Glucose)) +
geom_density(fill = "darkblue", alpha = 0.7) +
theme_minimal(base_size = 14) +
labs(
title = "Density Plot of Fasting Blood Glucose",
x = "Fasting Blood Glucose (mg/dL)",
y = "Density"
)
INTERPRETATION
The density plot shows that fasting blood glucose values are spread fairly evenly across the range of 70–200 mg/dL.
The distribution is relatively flat, with no strong peak and no visible skew.
This suggests that fasting glucose levels in the dataset do not lean heavily toward very high or very low values and are fairly uniformly distributed.
top10_glucose <- Diabetes_data_2 %>%
arrange(desc(Fasting_Blood_Glucose)) %>%
slice(1:10)
top10_glucose
## # A tibble: 10 × 23
## `S no.` Age Sex Ethnicity BMI Waist_Circumference Fasting_Blood_Glucose
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 4890 58 Fema… Black 26.2 91 200
## 2 7244 34 Fema… Hispanic 24.7 115 200
## 3 9695 59 Fema… Black 19.7 74.4 200
## 4 419 31 Male Hispanic 18.9 84.9 200.
## 5 549 65 Fema… Hispanic 26.1 84.4 200.
## 6 566 64 Male Asian 33.6 82.6 200.
## 7 5708 40 Fema… Black 29.4 105. 200.
## 8 8964 69 Fema… Black 32.5 79.4 200.
## 9 9747 48 Male White 34.5 97.8 200.
## 10 1900 63 Fema… Hispanic 26.3 95.1 200.
## # ℹ 16 more variables: HbA1c <dbl>, Blood_Pressure_Systolic <dbl>,
## # Blood_Pressure_Diastolic <dbl>, Cholesterol_Total <dbl>,
## # Cholesterol_HDL <dbl>, Cholesterol_LDL <dbl>, GGT <dbl>, Serum_Urate <dbl>,
## # Physical_Activity_Level <chr>, Dietary_Intake_Calories <dbl>,
## # Alcohol_Consumption <chr>, Smoking_Status <chr>,
## # Family_History_of_Diabetes <dbl>, Previous_Gestational_Diabetes <dbl>,
## # Outcome <dbl>, Age_Range <dbl>
top10_glucose %>%
select(Age, Sex, BMI, Fasting_Blood_Glucose,
Blood_Pressure_Systolic, Cholesterol_Total,
Smoking_Status, Physical_Activity_Level)
## # A tibble: 10 × 8
## Age Sex BMI Fasting_Blood_Glucose Blood_Pressure_Systolic
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 58 Female 26.2 200 176
## 2 34 Female 24.7 200 167
## 3 59 Female 19.7 200 147
## 4 31 Male 18.9 200. 114
## 5 65 Female 26.1 200. 111
## 6 64 Male 33.6 200. 138
## 7 40 Female 29.4 200. 127
## 8 69 Female 32.5 200. 178
## 9 48 Male 34.5 200. 178
## 10 63 Female 26.3 200. 102
## # ℹ 3 more variables: Cholesterol_Total <dbl>, Smoking_Status <chr>,
## # Physical_Activity_Level <chr>
INTERPRETATION
The top 10 individuals with the highest fasting glucose levels all have readings at or near 200 mg/dL, the maximum in the dataset.
Their other attributes vary widely: BMI ranges from 18.9 to 34.5, age from 31 to 69, and systolic blood pressure from 102 to 178 mmHg, so very high glucose is not confined to a single profile.
Glucose at this level alone suggests these individuals require urgent monitoring, and those who also have high BMI or blood pressure warrant targeted intervention.
Diabetes_data_2 %>%
group_by(Family_History_of_Diabetes) %>%
summarise(
mean_BMI = mean(BMI, na.rm = TRUE),
count = n()
)
## # A tibble: 2 × 3
## Family_History_of_Diabetes mean_BMI count
## <dbl> <dbl> <int>
## 1 0 29.5 4930
## 2 1 29.4 5070
ggplot(Diabetes_data_2, aes(
x = factor(Family_History_of_Diabetes),
y = BMI,
fill = factor(Family_History_of_Diabetes)
)) +
geom_boxplot(alpha = 0.85) +
theme_minimal(base_size = 14) +
labs(
title = "BMI Comparison Based on Family History of Diabetes",
x = "Family History of Diabetes (0 = No, 1 = Yes)",
y = "BMI",
fill = "Family History"
)
INTERPRETATION
The average BMI is nearly the same for individuals with (29.4) and without (29.5) a family history of diabetes.
The boxplot also shows highly overlapping BMI distributions between the two groups.
This indicates that having a family history of diabetes is not strongly associated with higher BMI in this dataset.
# Select only the required numeric columns
cluster_data <- Diabetes_data_2[, c("BMI", "Fasting_Blood_Glucose", "Blood_Pressure_Systolic")]
# Scale the data
scaled_data <- scale(cluster_data)
# Run k-means with 3 clusters
set.seed(123)
kmeans_result <- kmeans(scaled_data, centers = 3)
# Add cluster labels back to original data
Diabetes_data_2$Cluster <- kmeans_result$cluster
# View first few rows with cluster labels
head(Diabetes_data_2)
## # A tibble: 6 × 24
## `S no.` Age Sex Ethnicity BMI Waist_Circumference Fasting_Blood_Glucose
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 0 58 Female White 35.8 83.4 124.
## 2 1 48 Male Asian 24.1 71.4 184.
## 3 2 34 Female Black 25 114. 142
## 4 3 62 Male Asian 32.7 100. 167.
## 5 4 27 Female Asian 33.5 111. 146.
## 6 5 40 Female Asian 33.6 96.1 75
## # ℹ 17 more variables: HbA1c <dbl>, Blood_Pressure_Systolic <dbl>,
## # Blood_Pressure_Diastolic <dbl>, Cholesterol_Total <dbl>,
## # Cholesterol_HDL <dbl>, Cholesterol_LDL <dbl>, GGT <dbl>, Serum_Urate <dbl>,
## # Physical_Activity_Level <chr>, Dietary_Intake_Calories <dbl>,
## # Alcohol_Consumption <chr>, Smoking_Status <chr>,
## # Family_History_of_Diabetes <dbl>, Previous_Gestational_Diabetes <dbl>,
## # Outcome <dbl>, Age_Range <dbl>, Cluster <int>
Diabetes_data_2 %>%
group_by(Cluster) %>%
summarise(
Avg_BMI = mean(BMI),
Avg_Glucose = mean(Fasting_Blood_Glucose),
Avg_BP = mean(Blood_Pressure_Systolic),
Count = n()
)
## # A tibble: 3 × 5
## Cluster Avg_BMI Avg_Glucose Avg_BP Count
## <int> <dbl> <dbl> <dbl> <int>
## 1 1 29.3 161. 111. 3102
## 2 2 29.9 95.5 133. 3773
## 3 3 28.9 157. 159. 3125
ggplot(Diabetes_data_2, aes(x = BMI, y = Fasting_Blood_Glucose, color = as.factor(Cluster))) +
geom_point() +
labs(title = "K-means Clustering (BMI vs Glucose)",
color = "Cluster")
INTERPRETATION
K-means clustering grouped individuals into three meaningful clusters based on BMI, fasting glucose, and blood pressure.
Cluster 1: High glucose (≈161 mg/dL) with moderate BMI and normal BP, a high diabetes-risk group.
Cluster 2: Normal glucose (≈95 mg/dL) but slightly elevated BP, a moderate cardiovascular-risk group.
Cluster 3: High glucose (≈157 mg/dL) and high BP, a combined diabetes and hypertension high-risk group.
The scatter plot shows that clusters are mainly separated by glucose and blood pressure, not BMI.
CONCLUSION
The clustering clearly identifies three risk profiles within the population: one low-risk group, one high-glucose group, and one group with both elevated glucose and blood pressure. This segmentation helps highlight which portions of the population may require targeted medical attention and preventive care.
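The choice of three clusters is fixed in the code above. A common way to check such a choice is the elbow method, which compares the total within-cluster sum of squares across several values of k. The sketch below runs it on random stand-in data (`toy`, hypothetical); for the real analysis `scaled_data` would take its place.

```r
set.seed(123)

# Hypothetical stand-in: 200 rows, 3 standardized columns
toy <- scale(matrix(runif(600), ncol = 3))

# Total within-cluster sum of squares for k = 1..6
ks <- 1:6
wss <- sapply(ks, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

print(data.frame(k = ks, wss = round(wss, 1)))
```

Plotting `wss` against `ks` and looking for the point where the curve flattens gives a rough guide to a reasonable k. On uniformly spread data there may be no clear elbow.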
Diabetes_data_2$Risk_Score <- Diabetes_data_2$BMI + Diabetes_data_2$Fasting_Blood_Glucose + Diabetes_data_2$HbA1c
threshold <- mean(Diabetes_data_2$Risk_Score, na.rm = TRUE)
Diabetes_data_2$Risk_Class <- ifelse(Diabetes_data_2$Risk_Score > threshold, "High", "Low")
features <- Diabetes_data_2[, c("BMI", "Fasting_Blood_Glucose", "HbA1c")]
labels <- Diabetes_data_2$Risk_Class
normalize <- function(x) {
return((x - min(x)) / (max(x)))
}
features_norm <- as.data.frame(lapply(features, normalize))
set.seed(123)
index <- sample(1:nrow(features_norm), 0.7 * nrow(features_norm))
train_data <- features_norm[index, ]
test_data <- features_norm[-index, ]
train_label <- labels[index]
test_label <- labels[-index]
predicted <- knn(train_data, test_data, train_label, k = 5)
mean(predicted == test_label)
## [1] 0.9886667
table(Predicted = predicted, Actual = test_label)
## Actual
## Predicted High Low
## High 1461 19
## Low 15 1505
ggplot(Diabetes_data_2, aes(x = BMI, y = Fasting_Blood_Glucose, color = Risk_Class)) +
geom_point() +
theme_minimal() +
labs(title = "High vs Low Risk Classification (True Labels)")
INTERPRETATION
A simple Risk Score was created using BMI + Fasting Glucose + HbA1c, and individuals were labeled as High or Low risk using the average score as the threshold.
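KNN (k = 5) was used to classify individuals based on their rescaled BMI, glucose, and HbA1c values. Note that normalize() divides by max(x) rather than by the range max(x) - min(x), so the features are shifted and scaled but not mapped exactly onto [0, 1].
The model achieved an accuracy of ~98.9% on the 30% test split (3,000 individuals). This is expected: Risk_Class is computed directly from the same three features, so KNN is recovering a known rule and not predicting an independent outcome.
The confusion matrix shows: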
1461 High-risk individuals correctly classified
1505 Low-risk individuals correctly classified
Only a very small number misclassified (19 Low → High, 15 High → Low)
The scatter plot shows a clear visual separation between High-risk (higher glucose) and Low-risk groups. Glucose drives the classification because it spans 70–200 mg/dL, a much wider range than BMI or HbA1c, and therefore dominates the unweighted Risk Score.
CONCLUSION
The KNN model separates individuals into High-risk and Low-risk categories with near-perfect accuracy. Because the labels are derived from BMI, fasting glucose, and HbA1c themselves, this shows that KNN can reproduce the score-based rule. Validating it for early diabetes-risk identification would require an independently diagnosed outcome.
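The normalize() helper in the KNN code divides by max(x), not by the range. A standard min-max scaler, sketched below under the hypothetical name `min_max`, maps each feature onto [0, 1]. The example vector is made up, and the accuracy reported above was produced with the original helper.

```r
# Standard min-max scaling: (x - min) / (max - min)
# min_max is a hypothetical helper name used for illustration
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

x <- c(70, 135, 200)

scaled_range <- min_max(x)              # 0, 0.5, 1
scaled_max   <- (x - min(x)) / max(x)   # 0, 0.325, 0.65

print(scaled_range)
print(scaled_max)
```

With division by max(x), the upper end of each feature depends on how far its minimum sits from zero, so features end up on different scales and carry unequal weight in the KNN distance.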
This analysis has yielded several meaningful insights into the risk factors and patterns associated with diabetes in the dataset. Key findings include:
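Fasting blood glucose, HbA1c, and BMI define the risk score used here, with glucose contributing most because of its wider numeric range; age showed little relationship with average glucose.
Lifestyle factors such as smoking status, physical activity, and alcohol consumption showed little association with glucose or BMI in this sample; the one statistically significant difference (BMI across alcohol groups, p = 0.0093) amounted to at most 0.4 BMI units.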
Genetic/family history alone did not appear to significantly elevate glucose or BMI levels, suggesting that behavioral and clinical metrics may carry greater weight in this cohort.
Clustering revealed distinct population segments — one with normal glucose but higher blood pressure, another with high glucose alone, and a third with both high glucose and high blood pressure — helping identify sub-groups for targeted intervention.
A simple K-Nearest-Neighbors (KNN) classification based on BMI, fasting glucose, and HbA1c reproduced the score-based “high-risk” label with about 98.9% accuracy, illustrating how such a model could be applied once an independently measured outcome is available.
Implications for practice: These results support the prioritisation of key clinical indicators such as glucose, HbA1c, BMI and blood pressure for screening and early intervention programmes. They also suggest that segmentation based on risk profiles can improve resource allocation — focusing efforts on individuals showing multiple metabolic red flags rather than relying solely on demographic or lifestyle categories.
Closing remark: Overall, the project underlines the power of data-driven approaches in revealing actionable insights for diabetes risk management. By leveraging accessible variables and straightforward models, healthcare practitioners can better identify high-risk individuals and tailor preventive strategies accordingly.