Introduction

This project provides a comprehensive, data-driven exploration of diabetes and its associated risk factors using a dataset of 10,000 individuals. The analysis focuses on key clinical indicators such as fasting glucose, HbA1c, BMI, cholesterol levels, and blood pressure, along with lifestyle variables like smoking, alcohol consumption, and physical activity. Summary statistics, visualizations, and statistical tests are used to understand how these factors vary across demographic groups and how they contribute to overall health risk. In addition, clustering and K-Nearest Neighbors (KNN) classification techniques are applied to identify high-risk individuals and uncover meaningful patterns within the population. Overall, this study demonstrates how data analytics can effectively highlight important predictors of diabetes and support early risk detection.

Libraries used

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.4.3

## Warning: package 'ggplot2' was built under R version 4.4.3

## Warning: package 'readr' was built under R version 4.4.3

## Warning: package 'lubridate' was built under R version 4.4.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# tidyverse already attaches dplyr, tidyr, readr, ggplot2, and lubridate,
# so only the class package (needed for knn) is loaded separately.
library(class)

## Warning: package 'class' was built under R version 4.4.3

Import the Dataset into the Markdown.

Diabetes_data_2 <- read_csv("C:/Users/MANISH/OneDrive/Desktop/CAP 484/Diabetes data 2.csv")

## Rows: 10000 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): Sex, Ethnicity, Physical_Activity_Level, Alcohol_Consumption, Smok...
## dbl (16): S no., Age, BMI, Waist_Circumference, Fasting_Blood_Glucose, HbA1c...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

View(Diabetes_data_2)

1.What is the structure of the diabetes dataset?

str(Diabetes_data_2)

## spc_tbl_ [10,000 × 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ S no.                        : num [1:10000] 0 1 2 3 4 5 6 7 8 9 ...
##  $ Age                          : num [1:10000] 58 48 34 62 27 40 58 38 42 30 ...
##  $ Sex                          : chr [1:10000] "Female" "Male" "Female" "Male" ...
##  $ Ethnicity                    : chr [1:10000] "White" "Asian" "Black" "Asian" ...
##  $ BMI                          : num [1:10000] 35.8 24.1 25 32.7 33.5 33.6 33.2 26.9 27 24 ...
##  $ Waist_Circumference          : num [1:10000] 83.4 71.4 113.8 100.4 110.8 ...
##  $ Fasting_Blood_Glucose        : num [1:10000] 124 184 142 167 146 ...
##  $ HbA1c                        : num [1:10000] 10.9 12.8 14.5 8.8 7.1 13.5 13.3 10.9 7 14 ...
##  $ Blood_Pressure_Systolic      : num [1:10000] 152 103 179 176 122 170 131 121 132 146 ...
##  $ Blood_Pressure_Diastolic     : num [1:10000] 114 91 104 118 97 90 80 83 118 83 ...
##  $ Cholesterol_Total            : num [1:10000] 198 262 261 183 203 ...
##  $ Cholesterol_HDL              : num [1:10000] 50.2 62 32.1 41.1 53.9 44.5 77.9 69.7 73.2 53.3 ...
##  $ Cholesterol_LDL              : num [1:10000] 99.2 146.4 164.1 84 92.8 ...
##  $ GGT                          : num [1:10000] 37.5 88.5 56.2 34.4 81.9 77.5 52.1 72 76.4 14.5 ...
##  $ Serum_Urate                  : num [1:10000] 7.2 6.1 6.9 5.4 7.4 6.4 4.7 5.6 6.2 6.9 ...
##  $ Physical_Activity_Level      : chr [1:10000] "Moderate" "Moderate" "Low" "Low" ...
##  $ Dietary_Intake_Calories      : num [1:10000] 1538 2653 1684 3796 3161 ...
##  $ Alcohol_Consumption          : chr [1:10000] "Moderate" "Moderate" "Heavy" "Moderate" ...
##  $ Smoking_Status               : chr [1:10000] "Never" "Current" "Former" "Never" ...
##  $ Family_History_of_Diabetes   : num [1:10000] 0 0 1 1 0 1 0 0 1 1 ...
##  $ Previous_Gestational_Diabetes: num [1:10000] 1 1 0 0 0 1 0 1 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `S no.` = col_double(),
##   ..   Age = col_double(),
##   ..   Sex = col_character(),
##   ..   Ethnicity = col_character(),
##   ..   BMI = col_double(),
##   ..   Waist_Circumference = col_double(),
##   ..   Fasting_Blood_Glucose = col_double(),
##   ..   HbA1c = col_double(),
##   ..   Blood_Pressure_Systolic = col_double(),
##   ..   Blood_Pressure_Diastolic = col_double(),
##   ..   Cholesterol_Total = col_double(),
##   ..   Cholesterol_HDL = col_double(),
##   ..   Cholesterol_LDL = col_double(),
##   ..   GGT = col_double(),
##   ..   Serum_Urate = col_double(),
##   ..   Physical_Activity_Level = col_character(),
##   ..   Dietary_Intake_Calories = col_double(),
##   ..   Alcohol_Consumption = col_character(),
##   ..   Smoking_Status = col_character(),
##   ..   Family_History_of_Diabetes = col_double(),
##   ..   Previous_Gestational_Diabetes = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

INTERPRETATION

The dataset contains 10,000 records with demographic, clinical, and lifestyle information.

Key health variables like BMI, glucose, HbA1c, blood pressure, and cholesterol are available in numeric form for analysis.

Lifestyle factors such as smoking, alcohol consumption, and physical activity are stored as categories, useful for behavioral comparisons.

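Additional fields for family history of diabetes and previous gestational diabetes are stored as 0/1 indicators and help identify individuals with higher diabetes risk. The dataset has no diagnosis or risk-classification column, so any outcome variable has to be derived from the clinical measurements.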

Overall, the dataset provides a comprehensive view of factors associated with diabetes, making it suitable for data-driven health analysis.

1.What is the summary of the diabetes dataset?

summary(Diabetes_data_2)

##      S no.           Age            Sex             Ethnicity        
##  Min.   :   0   Min.   :20.00   Length:10000       Length:10000      
##  1st Qu.:2500   1st Qu.:32.00   Class :character   Class :character  
##  Median :5000   Median :45.00   Mode  :character   Mode  :character  
##  Mean   :5000   Mean   :44.62                                        
##  3rd Qu.:7499   3rd Qu.:57.00                                        
##  Max.   :9999   Max.   :69.00                                        
##       BMI        Waist_Circumference Fasting_Blood_Glucose     HbA1c       
##  Min.   :18.50   Min.   : 70.0       Min.   : 70.0         Min.   : 4.000  
##  1st Qu.:24.10   1st Qu.: 82.2       1st Qu.:102.2         1st Qu.: 6.800  
##  Median :29.50   Median : 94.9       Median :134.5         Median : 9.500  
##  Mean   :29.42   Mean   : 94.8       Mean   :134.8         Mean   : 9.508  
##  3rd Qu.:34.70   3rd Qu.:107.0       3rd Qu.:167.8         3rd Qu.:12.300  
##  Max.   :40.00   Max.   :120.0       Max.   :200.0         Max.   :15.000  
##  Blood_Pressure_Systolic Blood_Pressure_Diastolic Cholesterol_Total
##  Min.   : 90.0           Min.   : 60.00           Min.   :150.0    
##  1st Qu.:112.0           1st Qu.: 75.00           1st Qu.:187.9    
##  Median :134.0           Median : 89.00           Median :225.5    
##  Mean   :134.2           Mean   : 89.56           Mean   :225.2    
##  3rd Qu.:157.0           3rd Qu.:105.00           3rd Qu.:262.4    
##  Max.   :179.0           Max.   :119.00           Max.   :300.0    
##  Cholesterol_HDL Cholesterol_LDL      GGT          Serum_Urate   
##  Min.   :30.00   Min.   : 70.0   Min.   : 10.00   Min.   :3.000  
##  1st Qu.:42.30   1st Qu.:101.7   1st Qu.: 32.60   1st Qu.:4.200  
##  Median :55.20   Median :134.4   Median : 55.45   Median :5.500  
##  Mean   :55.02   Mean   :134.4   Mean   : 55.17   Mean   :5.503  
##  3rd Qu.:67.90   3rd Qu.:166.4   3rd Qu.: 77.50   3rd Qu.:6.800  
##  Max.   :80.00   Max.   :200.0   Max.   :100.00   Max.   :8.000  
##  Physical_Activity_Level Dietary_Intake_Calories Alcohol_Consumption
##  Length:10000            Min.   :1500            Length:10000       
##  Class :character        1st Qu.:2129            Class :character   
##  Mode  :character        Median :2727            Mode  :character   
##                          Mean   :2742                               
##                          3rd Qu.:3368                               
##                          Max.   :3999                               
##  Smoking_Status     Family_History_of_Diabetes Previous_Gestational_Diabetes
##  Length:10000       Min.   :0.000              Min.   :0.0000               
##  Class :character   1st Qu.:0.000              1st Qu.:0.0000               
##  Mode  :character   Median :1.000              Median :1.0000               
##                     Mean   :0.507              Mean   :0.5165               
##                     3rd Qu.:1.000              3rd Qu.:1.0000               
##                     Max.   :1.000              Max.   :1.0000

INTERPRETATION

Ages range from 20 to 69 years (median 45). BMI (18.5–40.0), fasting glucose (70–200 mg/dL), and total cholesterol (150–300 mg/dL) have means almost equal to their medians, which points to roughly symmetric distributions.

Systolic (90–179 mmHg) and diastolic (60–119 mmHg) blood pressure cover both normal and hypertensive ranges.

HbA1c (4.0–15.0%, median 9.5%) and fasting glucose (median 134.5 mg/dL) are widely spread. Both medians sit above the usual diagnostic cut-offs (HbA1c 6.5%, fasting glucose 126 mg/dL), so more than half of the records have diabetic-range values.

Lifestyle variables like physical activity, alcohol intake, and smoking status are stored as character columns, so summary() reports only their length and class. Their level frequencies have to be tabulated separately.

Family history and gestational diabetes indicators are also included, helping analyze genetic and maternal risk factors for diabetes.
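Because summary() says nothing about how the categorical columns are distributed, a frequency table is the natural follow-up. The sketch below is illustrative only: `toy` is a made-up stand-in for the real data, and `level_shares()` is a hypothetical helper, not part of the original analysis.

```r
# Toy stand-in for the real data; column names follow the dataset, values are invented.
toy <- data.frame(
  Smoking_Status = c("Never", "Current", "Former", "Never", "Never", "Current"),
  Physical_Activity_Level = c("Low", "Moderate", "High", "Low", "Moderate", "Low"),
  stringsAsFactors = FALSE
)

# Share of rows falling in each level of one categorical column
level_shares <- function(df, column) {
  prop.table(table(df[[column]]))
}

smoking_shares <- level_shares(toy, "Smoking_Status")
print(round(smoking_shares, 3))
```

Applied to the real data, the same call would be `level_shares(Diabetes_data_2, "Smoking_Status")`.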

2.Are there any missing or duplicated records in the dataset?

table(is.na(Diabetes_data_2))

## 
##  FALSE 
## 210000

INTERPRETATION

The table shows that is.na() returns FALSE for every cell, meaning no missing values are present.

All 210,000 cells (10,000 rows × 21 columns) have complete entries.

This indicates that no imputation is needed. The command does not check for duplicated records, so that part of the question remains open.

Since there are no NA values, further preprocessing becomes easier and more reliable.
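The question also asks about duplicated records, which the NA table does not cover. A minimal sketch of such a check is shown below, assuming that the `S no.` column is only a row counter and should be ignored when comparing rows. The `toy` data frame is invented for illustration.

```r
# Toy stand-in; rows 2 and 3 are identical apart from the serial number.
toy <- data.frame(
  `S no.` = c(0, 1, 2, 3),
  Age = c(58, 48, 48, 62),
  BMI = c(35.8, 24.1, 24.1, 32.7),
  check.names = FALSE
)

# Rows that are identical in every column, including the serial number
n_exact_dupes <- sum(duplicated(toy))

# Rows that repeat once the serial-number column is left out (assumption: it is only an index)
content_cols <- setdiff(names(toy), "S no.")
n_content_dupes <- sum(duplicated(toy[, content_cols, drop = FALSE]))

print(c(exact = n_exact_dupes, ignoring_serial = n_content_dupes))
```

On the real data, replacing `toy` with `Diabetes_data_2` would give the two duplicate counts.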

3.What is the mean, median, mode, and standard deviation of Glucose and BMI?

glucose_mean <- mean(Diabetes_data_2$Fasting_Blood_Glucose, na.rm = TRUE)
glucose_median <- median(Diabetes_data_2$Fasting_Blood_Glucose, na.rm = TRUE)
glucose_mode <- mode(Diabetes_data_2$Fasting_Blood_Glucose)
glucose_sd <- sd(Diabetes_data_2$Fasting_Blood_Glucose, na.rm = TRUE)
print(glucose_mean)

## [1] 134.7762

print(glucose_median)

## [1] 134.5

print(glucose_mode)

## [1] "numeric"

print(glucose_sd)

## [1] 37.63354

INTERPRETATION

The average fasting glucose level is around 134.8 mg/dL, indicating moderately elevated glucose in the dataset.

The median glucose value (134.5 mg/dL) is very close to the mean, showing a fairly balanced distribution.

The mode is returned as “numeric” because R’s built-in mode() function shows data type, not statistical mode.

The standard deviation is 37.6, meaning glucose levels vary widely among individuals.
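Since base R's mode() returns the storage type, the statistical mode needs a small helper, and the question also asks for the same statistics on BMI. The sketch below uses a hypothetical `stat_mode()` function and an invented `toy_bmi` vector. It is one possible way to compute these values, not the original author's code.

```r
# Most frequent value(s) of a numeric vector; returns all ties
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}

# Mean, median, and standard deviation in one named vector
describe <- function(x) {
  c(
    mean = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    sd = sd(x, na.rm = TRUE)
  )
}

toy_bmi <- c(24.1, 29.5, 29.5, 33.5, 35.8, NA)  # invented values
print(stat_mode(toy_bmi))
print(describe(toy_bmi))
```

For the real data the calls would be `stat_mode(Diabetes_data_2$BMI)` and `describe(Diabetes_data_2$BMI)`, and likewise for Fasting_Blood_Glucose. For near-continuous measurements the mode can be shared by several values and is usually less informative than the mean or median.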

4.What is the distribution of the Age variable?

hist(
  Diabetes_data_2$Age,
  breaks = 20,
  main = "Distribution of Age",
  xlab = "Age (years)",
  col = "orange",
  border = "black"
)

INTERPRETATION

The age distribution appears fairly uniform, indicating individuals from many age groups are represented evenly.

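Most histogram bins contain 300–450 individuals, showing no major imbalance in the dataset.

A slightly higher count appears in the youngest bin (around 20–25). This may partly be an artifact of binning, because hist() closes its first interval on both ends and so can include one more integer age than the other bins.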

This balanced age spread makes the dataset suitable for comparing diabetes risk across different age groups.

5.Create a histogram to visualize the distribution of Glucose levels.

ggplot(Diabetes_data_2, aes(x = Fasting_Blood_Glucose)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "black") +
  labs(
    title = "Distribution of Glucose Levels",
    x = "Fasting Blood Glucose (mg/dL)",
    y = "Frequency"
  ) +
  theme_minimal()

INTERPRETATION

The fasting blood glucose values range from around 70 to 200 mg/dL, covering normal, prediabetic, and high levels.

The distribution appears fairly uniform, with similar frequencies across most glucose ranges.

No strong skewness is visible, meaning both lower and higher glucose values occur consistently in the dataset.

A large number of individuals fall between 100–170 mg/dL, indicating many people are in borderline or elevated glucose categories.

6.Visualize the frequency of family history of diabetes using a bar chart. (The dataset has no Outcome column at this stage, so Family_History_of_Diabetes is plotted.)

ggplot(Diabetes_data_2, aes(x = factor(Family_History_of_Diabetes))) +
  geom_bar(
    fill = "red",
    color = "white",
    alpha = 0.9
  ) +
  labs(
    title = "Frequency of Family History of Diabetes",
    x = "Family History of Diabetes (0 = No, 1 = Yes)",
    y = "Count"
  ) +
  theme_minimal()

INTERPRETATION

The bar chart shows two groups: individuals with and without a family history of diabetes.

Both groups appear to have almost equal counts, indicating a balanced distribution.

This means the dataset contains a similar number of people from both categories.

Such balance helps in making unbiased comparisons in later analysis.

7.How does the average glucose level vary across different Age groups? (Use group_by and summarise)

Diabetes_data_2 %>%
  group_by(Age) %>%
  summarise(Average_Glucose = mean(Fasting_Blood_Glucose, na.rm = TRUE))

## # A tibble: 50 × 2
##      Age Average_Glucose
##    <dbl>           <dbl>
##  1    20            135.
##  2    21            140.
##  3    22            135.
##  4    23            136.
##  5    24            134.
##  6    25            138.
##  7    26            135.
##  8    27            134.
##  9    28            138.
## 10    29            134.
## # ℹ 40 more rows

INTERPRETATION

The average fasting glucose levels remain fairly consistent across different ages, mostly ranging between 133–140 mg/dL.

There is no sharp rise or drop in glucose levels for any specific age, indicating a stable pattern across the population.

Small variations exist from age to age, but overall the trend suggests that age alone does not strongly influence average glucose levels in this dataset.

This consistency shows that other factors (BMI, lifestyle, genetics) may play a bigger role in glucose variation.

8.Create a boxplot comparing BMI distribution across different smoking categories.

ggplot(Diabetes_data_2, aes(x = Smoking_Status, y = BMI, fill = Smoking_Status)) +
  geom_boxplot(alpha = 0.85) +
  labs(
    title = "BMI Distribution Across Smoking Categories",
    x = "Smoking Category",
    y = "BMI"
  ) +
  theme_minimal(base_size = 14)

INTERPRETATION

The median BMI is almost the same across all smoking groups (Current, Former, and Never).

All three categories show a similar spread of BMI values, indicating no major difference in weight patterns.

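With BMI bounded between 18.5 and 40 and overall quartiles of 24.1 and 34.7, the 1.5 × IQR whisker limits (about 8.2 and 50.6) lie outside the data range, so the boxplots should show few or no outlier points.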

Overall, smoking status does not appear to strongly influence BMI in this dataset.

9.How does the average Glucose vary across different Physical Activity Levels?

Diabetes_data_2 %>%
  group_by(Physical_Activity_Level) %>%
  summarise(Average_Glucose = mean(Fasting_Blood_Glucose, na.rm = TRUE))

## # A tibble: 3 × 2
##   Physical_Activity_Level Average_Glucose
##   <chr>                             <dbl>
## 1 High                               135.
## 2 Low                                134.
## 3 Moderate                           135.

INTERPRETATION

The average glucose levels are very similar across all activity levels (High, Moderate, and Low).

High and Moderate activity both average about 135 mg/dL and Low about 134 mg/dL, so the gap between groups is roughly 1 mg/dL.

This suggests that physical activity level does not have a strong impact on fasting glucose in this dataset.

Overall, glucose levels remain fairly stable regardless of activity category.

10.Compare the average Blood Pressure (systolic & diastolic) across Alcohol Consumption categories.

Diabetes_data_2%>%
  group_by(Alcohol_Consumption) %>%
  summarise(
    Avg_Systolic = mean(Blood_Pressure_Systolic, na.rm = TRUE),
    Avg_Diastolic = mean(Blood_Pressure_Diastolic, na.rm = TRUE)
  )

## # A tibble: 3 × 3
##   Alcohol_Consumption Avg_Systolic Avg_Diastolic
##   <chr>                      <dbl>         <dbl>
## 1 Heavy                       134.          89.8
## 2 Moderate                    134.          89.6
## 3 None                        134.          89.3

INTERPRETATION

Average systolic blood pressure rounds to 134 mmHg in every alcohol category, so any differences between groups are below 1 mmHg.

Average diastolic BP rises slightly with alcohol intake (None 89.3, Moderate 89.6, Heavy 89.8 mmHg), a gap of only 0.5 mmHg.

Overall, alcohol consumption does not show a strong association with blood pressure levels in this dataset.

11.Is smoking associated with higher average fasting glucose?

Diabetes_data_2 %>%
  group_by(Smoking_Status) %>%
  summarise(
    Average_Glucose = mean(Fasting_Blood_Glucose, na.rm = TRUE)
  )

## # A tibble: 3 × 2
##   Smoking_Status Average_Glucose
##   <chr>                    <dbl>
## 1 Current                   135.
## 2 Former                    135.
## 3 Never                     134.

INTERPRETATION

Average fasting glucose is nearly identical among smokers and non-smokers (Current = 135, Former = 135, Never = 134).

Smoking status therefore does not appear to be associated with fasting glucose in this dataset, although no formal significance test was run.

12 Does Alcohol Consumption significantly affect BMI?

Diabetes_data_2 %>%
  group_by(Alcohol_Consumption) %>%
  summarise(Avg_BMI = mean(BMI, na.rm = TRUE))

## # A tibble: 3 × 2
##   Alcohol_Consumption Avg_BMI
##   <chr>                 <dbl>
## 1 Heavy                  29.5
## 2 Moderate               29.6
## 3 None                   29.2

anova_result <- aov(BMI ~ Alcohol_Consumption, data = Diabetes_data_2)
summary(anova_result)

##                       Df Sum Sq Mean Sq F value  Pr(>F)   
## Alcohol_Consumption    2    356  178.04   4.679 0.00931 **
## Residuals           9997 380402   38.05                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

INTERPRETATION

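The average BMI is almost the same across all alcohol groups (Heavy 29.5, Moderate 29.6, None 29.2), with a maximum gap of about 0.4 BMI units.

Heavy and Moderate drinkers show slightly higher BMI than non-drinkers, but the gap is minimal.

The ANOVA test is statistically significant at the 5% level (F = 4.679, p = 0.0093), so the group means are not all equal. With 10,000 observations, however, even very small differences reach significance: alcohol category accounts for only 356 of the 380,758 total sum of squares, about 0.09% of the variance in BMI.

Overall, alcohol intake has a detectable but practically negligible association with BMI in this dataset.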
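The gap between statistical and practical significance can be quantified with an effect size and a post-hoc comparison. The sketch below computes eta squared from an aov fit and runs Tukey's HSD. It uses simulated `toy` data with invented means and spread, so its numbers are not results from the real dataset. Only `reported_eta_sq` is derived from the ANOVA table above.

```r
set.seed(1)

# Simulated stand-in: three alcohol groups with slightly different mean BMI (invented values)
toy <- data.frame(
  Alcohol_Consumption = factor(rep(c("None", "Moderate", "Heavy"), each = 50)),
  BMI = c(rnorm(50, 29.2, 6), rnorm(50, 29.6, 6), rnorm(50, 29.5, 6))
)

fit <- aov(BMI ~ Alcohol_Consumption, data = toy)

# Eta squared = between-group sum of squares / total sum of squares
ss <- summary(fit)[[1]][["Sum Sq"]]
eta_sq <- ss[1] / sum(ss)

# Pairwise group comparisons with family-wise error control
tukey <- TukeyHSD(fit)

# Eta squared implied by the sums of squares in the ANOVA table reported above
reported_eta_sq <- 356 / (356 + 380402)

print(eta_sq)
print(tukey)
print(reported_eta_sq)
```

On the real data, `Alcohol_Consumption` is a character column and would likely need wrapping in factor() before calling TukeyHSD().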

13. What is the relationship between BMI and Blood Pressure?

ggplot(Diabetes_data_2, aes(x = BMI, y = Blood_Pressure_Systolic)) +
  geom_point(color = "blue") +
  labs(title = "Relationship Between BMI and Systolic Blood Pressure",
       x = "BMI",
       y = "Systolic Blood Pressure") +
  theme_minimal()

INTERPRETATION

The scatter plot shows no clear pattern between BMI and systolic blood pressure.

Data points are widely scattered, indicating a very weak or no relationship.

Blood pressure levels remain similar across different BMI values.
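A visual impression of "no pattern" can be backed by a correlation coefficient. The sketch below runs a Pearson correlation test on simulated, independent `toy` columns. The values are invented, and the real call would use the columns of Diabetes_data_2 instead.

```r
set.seed(42)

# Simulated stand-in: independent uniform draws over the ranges seen in the summary
toy <- data.frame(
  BMI = runif(200, 18.5, 40),
  Blood_Pressure_Systolic = runif(200, 90, 179)
)

# Pearson correlation with a confidence interval and p-value
ct <- cor.test(toy$BMI, toy$Blood_Pressure_Systolic, method = "pearson")
r <- unname(ct$estimate)

print(r)
print(ct$conf.int)
print(ct$p.value)
```

A coefficient close to zero with a confidence interval that covers zero would support the reading of the scatter plot.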

14.Identify the top 5 highest-risk profiles (high glucose + high HbA1c + high BMI)

Diabetes_data_2%>%
  mutate(
    Risk_Score = Fasting_Blood_Glucose + HbA1c + BMI
  ) %>%
  arrange(desc(Risk_Score)) %>%
  head(5)

## # A tibble: 5 × 22
##   `S no.`   Age Sex    Ethnicity   BMI Waist_Circumference Fasting_Blood_Glucose
##     <dbl> <dbl> <chr>  <chr>     <dbl>               <dbl>                 <dbl>
## 1    4313    41 Male   White      39.6               117.                   199.
## 2    4973    25 Male   Asian      39.4                97.4                  200.
## 3    3908    29 Female Hispanic   39.6                92.8                  199.
## 4    7021    28 Female Hispanic   39.5               112.                   198.
## 5    4698    49 Female Black      38.9                91.2                  198.
## # ℹ 15 more variables: HbA1c <dbl>, Blood_Pressure_Systolic <dbl>,
## #   Blood_Pressure_Diastolic <dbl>, Cholesterol_Total <dbl>,
## #   Cholesterol_HDL <dbl>, Cholesterol_LDL <dbl>, GGT <dbl>, Serum_Urate <dbl>,
## #   Physical_Activity_Level <chr>, Dietary_Intake_Calories <dbl>,
## #   Alcohol_Consumption <chr>, Smoking_Status <chr>,
## #   Family_History_of_Diabetes <dbl>, Previous_Gestational_Diabetes <dbl>,
## #   Risk_Score <dbl>

INTERPRETATION

The top 5 highest-risk profiles have extremely high glucose, HbA1c, and BMI values.

Their glucose levels are near 200 mg/dL and BMI about 40, indicating severe metabolic risk.

These individuals should be considered a priority group for medical intervention.
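One caveat on the score itself: it adds raw values, and fasting glucose (70–200 mg/dL) spans a much wider numeric range than BMI (18.5–40) or HbA1c (4–15), so the ranking is driven mostly by glucose. A possible alternative, sketched below on invented `toy` rows, is to standardise each variable before summing so that all three contribute on the same scale. This is an illustration of the idea, not the method used above.

```r
# Invented rows: row 2 has the highest glucose but low HbA1c and BMI
toy <- data.frame(
  Fasting_Blood_Glucose = c(90, 199, 120, 150),
  HbA1c = c(14.8, 5.0, 9.0, 12.0),
  BMI = c(39.5, 20.0, 28.0, 35.0)
)

# Raw sum, as in the analysis above: dominated by the glucose column
raw_score <- rowSums(toy)

# Sum of z-scores: each variable is centred and scaled to unit variance first
z_score <- rowSums(scale(toy))

print(unname(which.max(raw_score)))  # row with the highest raw score
print(unname(which.max(z_score)))    # row with the highest standardised score
```

In this toy example the raw score ranks the high-glucose row first, while the standardised score favours a row that is elevated on all three measures.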

15.Which age range has the highest prevalence of diabetes?

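# Proxy outcome: flag fasting glucose above 126 mg/dL as diabetic (the dataset has no diagnosis column)
Diabetes_data_2$Outcome <- ifelse(Diabetes_data_2$Fasting_Blood_Glucose > 126, 1, 0)
# Bin ages into decades (20, 30, ..., 60)
Diabetes_data_2$Age_Range <- Diabetes_data_2$Age %/% 10 * 10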



Diabetes_data_2%>%
  group_by(Age_Range) %>%
  summarise(
    Prevalence = mean(Outcome, na.rm = TRUE)
  ) %>%
  arrange(desc(Prevalence))

## # A tibble: 5 × 2
##   Age_Range Prevalence
##       <dbl>      <dbl>
## 1        20      0.591
## 2        60      0.580
## 3        50      0.557
## 4        40      0.556
## 5        30      0.550

INTERPRETATION

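Using fasting glucose above 126 mg/dL as a proxy for diabetes, the 20–29 age group has the highest prevalence (59.1%), followed by the 60–69 group (58.0%).

Prevalence in the other decades is 55.0–55.7%, so the spread across age groups is only about 4 percentage points.

The slightly higher figure among younger adults is not explained by this analysis. Lifestyle or genetic factors are possible but untested explanations.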

16.Is family history associated with higher glucose levels?

Diabetes_data_2 %>%
  group_by(Family_History_of_Diabetes) %>%
  summarise(
    Avg_Glucose = mean(Fasting_Blood_Glucose, na.rm = TRUE)
  )

## # A tibble: 2 × 2
##   Family_History_of_Diabetes Avg_Glucose
##                        <dbl>       <dbl>
## 1                          0        135.
## 2                          1        135.

INTERPRETATION

Average fasting glucose is the same for those with and without a family history (both ~135 mg/dL).

Family history does not show a noticeable association with higher glucose levels.

The comparison is descriptive only: no significance test was run, and family history is a coarse proxy for genetic background.

17.Is there a significant difference in average glucose levels between males and females?

Diabetes_data_2%>%
  group_by(Sex) %>%
  summarise(
    Avg_Glucose = mean(Fasting_Blood_Glucose, na.rm = TRUE)
  )

## # A tibble: 2 × 2
##   Sex    Avg_Glucose
##   <chr>        <dbl>
## 1 Female        134.
## 2 Male          135.

INTERPRETATION

Males (135 mg/dL) and females (134 mg/dL) have nearly identical average glucose values.

The 1 mg/dL gap is very small, but the group means alone cannot establish statistical significance; that requires a formal test.

Sex does not appear to meaningfully influence glucose levels in this sample.
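The question asks about a significant difference, which the group means alone cannot answer. A two-sample t-test is one reasonable choice. The sketch below runs Welch's version (R's default) on simulated `toy` data with invented values. On the real data the formula would stay the same with `data = Diabetes_data_2`.

```r
set.seed(7)

# Simulated stand-in: glucose drawn from the same distribution for both sexes
toy <- data.frame(
  Sex = rep(c("Female", "Male"), each = 100),
  Fasting_Blood_Glucose = runif(200, 70, 200)
)

# Welch two-sample t-test (does not assume equal variances)
tt <- t.test(Fasting_Blood_Glucose ~ Sex, data = toy)

print(tt$estimate)   # group means
print(tt$conf.int)   # confidence interval for the difference in means
print(tt$p.value)
```

A p-value above 0.05 together with a confidence interval that covers zero would support the conclusion of no significant difference.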

18.Create a histogram of Cholesterol_Total.

ggplot(Diabetes_data_2, aes(x = Cholesterol_Total)) +
  geom_histogram(bins = 30,
                 fill = "green",
                 color = "white",
                 alpha = 0.85) +
  theme_minimal(base_size = 14) +
  labs(
    title = "Distribution of Total Cholesterol",
    x = "Total Cholesterol",
    y = "Frequency"
  )

INTERPRETATION

Total cholesterol values are evenly spread between 150–300 mg/dL.

There is no strong peak, showing a wide variation across individuals.

Cholesterol appears to follow a roughly uniform distribution rather than clustering.

19.Create a bar plot showing the count of individuals in each Physical Activity Level.

ggplot(Diabetes_data_2, aes(x = Physical_Activity_Level, 
               fill = Physical_Activity_Level)) +
  geom_bar(alpha = 0.9) +
  theme_minimal(base_size = 14) +
  labs(
    title = "Count of Individuals by Physical Activity Level",
    x = "Physical Activity Level",
    y = "Count"
  )

INTERPRETATION

The bar plot shows that the number of individuals in each physical activity category (High, Low, Moderate) is almost the same.

This indicates a balanced distribution of activity levels across the dataset.

No single physical activity group dominates, meaning lifestyle variation is evenly represented in the population.

20.Create a density plot of fasting glucose to check skewness.

ggplot(Diabetes_data_2, aes(x = Fasting_Blood_Glucose)) +
  geom_density(fill = "darkblue", alpha = 0.7) +
  theme_minimal(base_size = 14) +   # a second theme_minimal() call would reset base_size
  labs(
    title = "Density Plot of Fasting Blood Glucose",
    x = "Fasting Blood Glucose (mg/dL)",
    y = "Density"
  )

INTERPRETATION

The density plot shows that fasting blood glucose values are spread fairly evenly across the range of 70–200 mg/dL.

The distribution is relatively flat with no strong peak, indicating low skewness.

This suggests that fasting glucose levels in the dataset do not lean heavily toward very high or very low values and are fairly uniformly distributed.
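Skewness can also be checked numerically rather than by eye. Base R has no skewness function, so the sketch below defines a hypothetical helper, `sample_skewness()`, using the moment-based formula (third central moment divided by the 1.5 power of the second). The test vectors are invented.

```r
# Moment-based skewness: m3 / m2^(3/2), using population (1/n) moments
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

symmetric <- c(1, 2, 3, 4, 5)   # symmetric around 3
right_tail <- c(1, 1, 1, 10)    # one large value pulls the tail to the right

print(sample_skewness(symmetric))
print(sample_skewness(right_tail))
```

Values near zero indicate a symmetric distribution, which is what the flat density plot suggests for `Diabetes_data_2$Fasting_Blood_Glucose`.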

21.What are the top 10 highest glucose values and their patient characteristics?

top10_glucose <- Diabetes_data_2 %>%
  arrange(desc(Fasting_Blood_Glucose)) %>%   
  slice(1:10)                             

top10_glucose

## # A tibble: 10 × 23
##    `S no.`   Age Sex   Ethnicity   BMI Waist_Circumference Fasting_Blood_Glucose
##      <dbl> <dbl> <chr> <chr>     <dbl>               <dbl>                 <dbl>
##  1    4890    58 Fema… Black      26.2                91                    200 
##  2    7244    34 Fema… Hispanic   24.7               115                    200 
##  3    9695    59 Fema… Black      19.7                74.4                  200 
##  4     419    31 Male  Hispanic   18.9                84.9                  200.
##  5     549    65 Fema… Hispanic   26.1                84.4                  200.
##  6     566    64 Male  Asian      33.6                82.6                  200.
##  7    5708    40 Fema… Black      29.4               105.                   200.
##  8    8964    69 Fema… Black      32.5                79.4                  200.
##  9    9747    48 Male  White      34.5                97.8                  200.
## 10    1900    63 Fema… Hispanic   26.3                95.1                  200.
## # ℹ 16 more variables: HbA1c <dbl>, Blood_Pressure_Systolic <dbl>,
## #   Blood_Pressure_Diastolic <dbl>, Cholesterol_Total <dbl>,
## #   Cholesterol_HDL <dbl>, Cholesterol_LDL <dbl>, GGT <dbl>, Serum_Urate <dbl>,
## #   Physical_Activity_Level <chr>, Dietary_Intake_Calories <dbl>,
## #   Alcohol_Consumption <chr>, Smoking_Status <chr>,
## #   Family_History_of_Diabetes <dbl>, Previous_Gestational_Diabetes <dbl>,
## #   Outcome <dbl>, Age_Range <dbl>

top10_glucose %>%
  select(Age, Sex, BMI, Fasting_Blood_Glucose, 
         Blood_Pressure_Systolic, Cholesterol_Total, 
         Smoking_Status, Physical_Activity_Level)

## # A tibble: 10 × 8
##      Age Sex      BMI Fasting_Blood_Glucose Blood_Pressure_Systolic
##    <dbl> <chr>  <dbl>                 <dbl>                   <dbl>
##  1    58 Female  26.2                  200                      176
##  2    34 Female  24.7                  200                      167
##  3    59 Female  19.7                  200                      147
##  4    31 Male    18.9                  200.                     114
##  5    65 Female  26.1                  200.                     111
##  6    64 Male    33.6                  200.                     138
##  7    40 Female  29.4                  200.                     127
##  8    69 Female  32.5                  200.                     178
##  9    48 Male    34.5                  200.                     178
## 10    63 Female  26.3                  200.                     102
## # ℹ 3 more variables: Cholesterol_Total <dbl>, Smoking_Status <chr>,
## #   Physical_Activity_Level <chr>

INTERPRETATION

The top 10 individuals with the highest fasting glucose levels all have readings at or very close to 200 mg/dL, the maximum in the dataset.

Their other characteristics are mixed: BMI ranges from 18.9 to 34.5, age from 31 to 69, and systolic blood pressure from 102 to 178 mmHg. No single co-factor is shared by the whole group.

Glucose at this level warrants close monitoring regardless of the other attributes, and the individuals who also have high BMI or blood pressure carry additional risk.

22.Are people with Family History of Diabetes more likely to have higher BMI?

Diabetes_data_2 %>%
  group_by(Family_History_of_Diabetes) %>%
  summarise(
    mean_BMI = mean(BMI, na.rm = TRUE),
    count = n()
  )

## # A tibble: 2 × 3
##   Family_History_of_Diabetes mean_BMI count
##                        <dbl>    <dbl> <int>
## 1                          0     29.5  4930
## 2                          1     29.4  5070

ggplot(Diabetes_data_2, aes(
  x = factor(Family_History_of_Diabetes),
  y = BMI,
  fill = factor(Family_History_of_Diabetes)
)) +
  geom_boxplot(alpha = 0.85) +
  theme_minimal(base_size = 14) +
  labs(
    title = "BMI Comparison Based on Family History of Diabetes",
    x = "Family History of Diabetes (0 = No, 1 = Yes)",
    y = "BMI",
    fill = "Family History"
  )

INTERPRETATION

The average BMI is nearly the same for individuals with (29.4) and without (29.5) a family history of diabetes.

The boxplot also shows highly overlapping BMI distributions between the two groups.

This indicates that having a family history of diabetes is not strongly associated with higher BMI in this dataset.

23.Use clustering (k-means) to group individuals based on BMI, Glucose, and Blood Pressure.

# Select only the required numeric columns
cluster_data <- Diabetes_data_2[, c("BMI", "Fasting_Blood_Glucose", "Blood_Pressure_Systolic")]

# Scale the data 
scaled_data <- scale(cluster_data)

# Run k-means with 3 clusters 
set.seed(123)  
kmeans_result <- kmeans(scaled_data, centers = 3)

# Add cluster labels back to original data
Diabetes_data_2$Cluster <- kmeans_result$cluster

# View first few rows with cluster labels
head(Diabetes_data_2)

## # A tibble: 6 × 24
##   `S no.`   Age Sex    Ethnicity   BMI Waist_Circumference Fasting_Blood_Glucose
##     <dbl> <dbl> <chr>  <chr>     <dbl>               <dbl>                 <dbl>
## 1       0    58 Female White      35.8                83.4                  124.
## 2       1    48 Male   Asian      24.1                71.4                  184.
## 3       2    34 Female Black      25                 114.                   142 
## 4       3    62 Male   Asian      32.7               100.                   167.
## 5       4    27 Female Asian      33.5               111.                   146.
## 6       5    40 Female Asian      33.6                96.1                   75 
## # ℹ 17 more variables: HbA1c <dbl>, Blood_Pressure_Systolic <dbl>,
## #   Blood_Pressure_Diastolic <dbl>, Cholesterol_Total <dbl>,
## #   Cholesterol_HDL <dbl>, Cholesterol_LDL <dbl>, GGT <dbl>, Serum_Urate <dbl>,
## #   Physical_Activity_Level <chr>, Dietary_Intake_Calories <dbl>,
## #   Alcohol_Consumption <chr>, Smoking_Status <chr>,
## #   Family_History_of_Diabetes <dbl>, Previous_Gestational_Diabetes <dbl>,
## #   Outcome <dbl>, Age_Range <dbl>, Cluster <int>

Diabetes_data_2 %>%
  group_by(Cluster) %>%
  summarise(
    Avg_BMI = mean(BMI),
    Avg_Glucose = mean(Fasting_Blood_Glucose),
    Avg_BP = mean(Blood_Pressure_Systolic),
    Count = n()
  )

## # A tibble: 3 × 5
##   Cluster Avg_BMI Avg_Glucose Avg_BP Count
##     <int>   <dbl>       <dbl>  <dbl> <int>
## 1       1    29.3       161.    111.  3102
## 2       2    29.9        95.5   133.  3773
## 3       3    28.9       157.    159.  3125

ggplot(Diabetes_data_2, aes(x = BMI, y = Fasting_Blood_Glucose, color = as.factor(Cluster))) +
  geom_point() +
  labs(title = "K-means Clustering (BMI vs Glucose)",
       color = "Cluster")

INTERPRETATION

K-means clustering grouped individuals into three meaningful clusters based on BMI, fasting glucose, and blood pressure.

Cluster 1: High glucose (≈161 mg/dL) with moderate BMI and normal systolic BP (≈111 mmHg). This is a high diabetes-risk group.

Cluster 2: Normal glucose (≈95 mg/dL) but slightly elevated systolic BP (≈133 mmHg). This is a moderate cardiovascular-risk group.

Cluster 3: High glucose (≈157 mg/dL) and high systolic BP (≈159 mmHg). This is a combined diabetes and hypertension high-risk group.

The cluster averages show that the groups are separated mainly by glucose and blood pressure, not BMI: average BMI is nearly the same (28.9 to 29.9) in all three clusters.
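One way to check which variables separate the clusters is to compare how far apart the standardized cluster centers lie on each variable. The sketch below is illustrative only: the data frame `demo` is simulated, and its group means are assumptions chosen to resemble the cluster averages reported above, not values taken from the project dataset. `nstart = 25` is also an assumption, added to make the k-means solution more stable.

```r
# Simulated stand-in for the three clustering columns (illustrative values only)
set.seed(123)
n <- 200
demo <- data.frame(
  BMI = rnorm(3 * n, mean = 29, sd = 5),
  Fasting_Blood_Glucose = rnorm(3 * n, mean = rep(c(160, 95, 158), each = n), sd = 15),
  Blood_Pressure_Systolic = rnorm(3 * n, mean = rep(c(111, 133, 159), each = n), sd = 8)
)

# Cluster on standardized columns, as in the analysis above
km <- kmeans(scale(demo), centers = 3, nstart = 25)

# Distance between the lowest and highest cluster center on each variable,
# in standard-deviation units; a small value means the variable barely
# distinguishes the clusters
center_spread <- apply(km$centers, 2, function(col) diff(range(col)))
print(round(center_spread, 2))
```

A variable whose spread is close to zero contributes little to the grouping, which is the pattern the cluster averages suggest for BMI.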

CONCLUSION

The clustering identifies three risk profiles within the population: a group with normal glucose but moderately elevated blood pressure, a high-glucose group with normal blood pressure, and a group with both elevated glucose and elevated blood pressure. This segmentation highlights which portions of the population may require targeted medical attention and preventive care.
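The choice of three clusters can be checked with the elbow method, which compares the total within-cluster sum of squares across several values of k. The following is a minimal sketch on simulated data (the `demo` data frame, the range `1:8`, and the `nstart` and `iter.max` settings are assumptions for illustration, not part of the original analysis).

```r
# Simulated stand-in for the three clustering columns (illustrative values only)
set.seed(123)
n <- 200
demo <- data.frame(
  BMI = rnorm(3 * n, mean = 29, sd = 5),
  Fasting_Blood_Glucose = rnorm(3 * n, mean = rep(c(160, 95, 158), each = n), sd = 15),
  Blood_Pressure_Systolic = rnorm(3 * n, mean = rep(c(111, 133, 159), each = n), sd = 8)
)
scaled_demo <- scale(demo)

# Total within-cluster sum of squares for each candidate number of clusters
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(scaled_demo, centers = k, nstart = 25, iter.max = 50)$tot.withinss
})

print(data.frame(k = k_values, wss = round(wss, 1)))
```

Plotting `wss` against `k_values` (for example with `plot(k_values, wss, type = "b")`) and looking for the point where the curve flattens gives a rough guide to a reasonable k.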

24.Use KNN to classify individuals into High-risk vs Low-risk based on BMI, Glucose, and HbA1c.

Diabetes_data_2$Risk_Score <- Diabetes_data_2$BMI + Diabetes_data_2$Fasting_Blood_Glucose + Diabetes_data_2$HbA1c

threshold <- mean(Diabetes_data_2$Risk_Score, na.rm = TRUE)

Diabetes_data_2$Risk_Class <- ifelse(Diabetes_data_2$Risk_Score > threshold, "High", "Low")



features <- Diabetes_data_2[, c("BMI", "Fasting_Blood_Glucose", "HbA1c")]
labels <- Diabetes_data_2$Risk_Class

# Min-max normalization: rescale each feature to the range [0, 1]
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}

features_norm <- as.data.frame(lapply(features, normalize))

set.seed(123)
index <- sample(1:nrow(features_norm), 0.7 * nrow(features_norm))

train_data <- features_norm[index, ]
test_data  <- features_norm[-index, ]

train_label <- labels[index]
test_label  <- labels[-index]

predicted <- knn(train_data, test_data, train_label, k = 5)

mean(predicted == test_label)

## [1] 0.9886667

table(Predicted = predicted, Actual = test_label)

##          Actual
## Predicted High  Low
##      High 1461   19
##      Low    15 1505

ggplot(Diabetes_data_2, aes(x = BMI, y = Fasting_Blood_Glucose, color = Risk_Class)) +
  geom_point() +
  theme_minimal() +
  labs(title = "High vs Low Risk Classification (True Labels)")

INTERPRETATION

A simple Risk Score was created using BMI + Fasting Glucose + HbA1c, and individuals were labeled as High or Low risk using the average score as the threshold.
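Because the Risk Score adds the three variables in their original units, the variable with the largest numeric spread carries most of the weight. The sketch below illustrates this on simulated data and compares it with one possible alternative, a sum of z-scores. The `demo` data frame, its means and standard deviations, and the z-score composite are all assumptions for illustration; they are not the project's data or method.

```r
# Simulated, independent stand-ins for the three features (illustrative values only)
set.seed(123)
demo <- data.frame(
  BMI = rnorm(500, mean = 29, sd = 5),
  Fasting_Blood_Glucose = rnorm(500, mean = 135, sd = 35),
  HbA1c = rnorm(500, mean = 7, sd = 1.5)
)

# Raw sum, as in the Risk Score above
raw_score <- demo$BMI + demo$Fasting_Blood_Glucose + demo$HbA1c
cor_raw <- sapply(demo, function(col) cor(col, raw_score))

# Hypothetical alternative: standardize each feature before summing
z_score <- rowSums(scale(demo))
cor_z <- sapply(demo, function(col) cor(col, z_score))

# Correlation of each feature with the two composite scores
print(round(rbind(raw_sum = cor_raw, z_sum = cor_z), 2))
```

In this simulation the raw sum tracks glucose almost exactly, while the standardized sum weights the three features roughly equally.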

KNN (k = 5) was used to classify individuals based on their normalized BMI, glucose, and HbA1c values.

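The model achieved an accuracy of ~98.9% on the 30% held-out test set (2,966 of 3,000 individuals correctly classified). Because the High/Low labels were derived from the same three features that the classifier uses, this figure shows how well KNN reproduces the threshold rule rather than how well it predicts an independent diabetes diagnosis. The confusion matrix shows: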

1461 High-risk individuals correctly classified

1505 Low-risk individuals correctly classified

Only 34 individuals misclassified (19 Low predicted as High, 15 High predicted as Low)
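The counts above can be turned into the usual per-class metrics. The sketch below re-enters the confusion matrix by hand and treats "High" as the positive class (that choice, and the variable names, are assumptions for illustration).

```r
# Confusion matrix from the output above (rows = predicted, columns = actual)
cm <- matrix(
  c(1461, 15,    # actual High: predicted High, predicted Low
    19, 1505),   # actual Low:  predicted High, predicted Low
  nrow = 2,
  dimnames = list(Predicted = c("High", "Low"), Actual = c("High", "Low"))
)

accuracy    <- sum(diag(cm)) / sum(cm)                    # correct / all
sensitivity <- cm["High", "High"] / sum(cm[, "High"])     # actual High found
specificity <- cm["Low", "Low"] / sum(cm[, "Low"])        # actual Low found
precision   <- cm["High", "High"] / sum(cm["High", ])     # predicted High that are High

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, precision = precision), 4))
```

Sensitivity is the metric to watch in a screening setting, since it reflects how many High-risk individuals would be missed.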

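The scatter plot shows a clear visual separation between the High-risk (higher glucose) and Low-risk groups, indicating that glucose is the strongest factor driving the classification. This is consistent with how the labels were built: the Risk Score adds unscaled values, so the variable with the largest numeric spread, most likely fasting glucose, carries the most weight.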

CONCLUSION

The KNN model separates individuals into the High-risk and Low-risk categories defined by the Risk Score with about 98.9% accuracy. This shows that BMI, fasting glucose, and HbA1c together reproduce the score-based risk classes reliably. Confirming KNN as a tool for early diabetes-risk identification would additionally require validating it against an independent outcome, such as the Outcome column in the dataset.
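The value k = 5 was fixed in advance; one way to check that choice is leave-one-out cross-validation with `knn.cv()` from the `class` package. The following is a minimal sketch on simulated data. The `demo` data frame, the candidate k values, and the helper `min_max()` are assumptions for illustration, and the labels are built with the same mean-threshold rule used above.

```r
library(class)

# Simulated stand-ins for the three features (illustrative values only)
set.seed(123)
demo <- data.frame(
  BMI = rnorm(600, mean = 29, sd = 5),
  Fasting_Blood_Glucose = rnorm(600, mean = 135, sd = 35),
  HbA1c = rnorm(600, mean = 7, sd = 1.5)
)

# Labels from the same rule as above: raw sum compared with its mean
score <- rowSums(demo)
labels <- factor(ifelse(score > mean(score), "High", "Low"))

# Min-max normalization to [0, 1] (hypothetical helper)
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
features_norm <- as.data.frame(lapply(demo, min_max))

# Leave-one-out cross-validated accuracy for several candidate k values
k_values <- c(1, 3, 5, 7, 9, 11)
cv_accuracy <- sapply(k_values, function(k) {
  pred <- knn.cv(features_norm, cl = labels, k = k)
  mean(as.character(pred) == as.character(labels))
})

print(data.frame(k = k_values, accuracy = round(cv_accuracy, 3)))
```

If accuracy is nearly flat across k, the choice of k = 5 is not critical; a clear peak would suggest using that value instead.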

CONCLUSION

This analysis has yielded several meaningful insights into the risk factors and patterns associated with diabetes in the dataset. Key findings include:

Fasting blood glucose and HbA1c emerge as the strongest markers tied to diabetes risk, especially when combined with elevated BMI and age.

Lifestyle factors such as smoking status, physical activity, and alcohol consumption showed relatively weaker direct associations with glucose or BMI in this sample.

Genetic/family history alone did not appear to significantly elevate glucose or BMI levels, suggesting that behavioral and clinical metrics may carry greater weight in this cohort.

Clustering revealed distinct population segments — one with normal glucose but higher blood pressure, another with high glucose alone, and a third with both high glucose and high blood pressure — helping identify sub-groups for targeted intervention.

A simple K-Nearest-Neighbors (KNN) classification based on BMI, fasting glucose, and HbA1c achieved high predictive accuracy for identifying “high-risk” individuals, demonstrating the practical value of predictive modelling in diabetes prevention.

Implications for practice: These results support the prioritisation of key clinical indicators such as glucose, HbA1c, BMI and blood pressure for screening and early intervention programmes. They also suggest that segmentation based on risk profiles can improve resource allocation — focusing efforts on individuals showing multiple metabolic red flags rather than relying solely on demographic or lifestyle categories.

Closing remark: Overall, the project underlines the power of data-driven approaches in revealing actionable insights for diabetes risk management. By leveraging accessible variables and straightforward models, healthcare practitioners can better identify high-risk individuals and tailor preventive strategies accordingly.

Data-Driven Insights into Diabetes and Its Associated Risk Factors

Student 1: Manish Chandra Joshi

Student 2: Anubhav Kashyap

2025-11-14
