Pima Indians Diabetes: Comprehensive Analysis

Setup and Data Loading

library(tidyverse)
library(ggplot2)

diabetes_data <- read.csv("diabetes.csv")

1 Understanding the data (Basic Exploration)

1.1 Question 1:- What is the structure of the dataset.

str(diabetes_data)

## 'data.frame':    768 obs. of  9 variables:
##  $ Pregnancies             : int  6 1 8 1 0 5 3 10 2 8 ...
##  $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ BloodPressure           : int  72 66 64 66 40 74 50 0 70 96 ...
##  $ SkinThickness           : int  35 29 0 23 35 0 32 0 45 0 ...
##  $ Insulin                 : int  0 0 0 94 168 0 88 0 543 0 ...
##  $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ Outcome                 : int  1 0 1 0 1 0 1 0 1 1 ...

summary(diabetes_data)

##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000

Interpretation: This dataset from diabetes_data contains health data across 768 records and 9 columns.

1.2 Question 2:- Are there any missing values in the dataset

colSums(is.na(diabetes_data))

##              Pregnancies                  Glucose            BloodPressure 
##                        0                        0                        0 
##            SkinThickness                  Insulin                      BMI 
##                        0                        0                        0 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                        0                        0                        0

Interpretation: The dataset has no missing values

2 Statistical Filter

2.1 Question 3:- Filtering the dataset for patients with a positive diabetes outcome

avg_age_diabetic <- diabetes_data %>%
  filter(Outcome == 1) %>%
  summarise(Average_Age = mean(Age, na.rm = TRUE))

print(paste("The average age of patients with diabetes is:", round(avg_age_diabetic$Average_Age, 2)))

## [1] "The average age of patients with diabetes is: 37.07"

Printing the result to the console:-average age of patients with diabetes is 37.07

2.2 Question 4:- Average age of diabetic women with pregnancy history

pregnant_diabetic_stats <- diabetes_data %>%
  filter(Outcome == 1 & Pregnancies > 0) %>%
  summarise(Avg_Age = mean(Age))

print(pregnant_diabetic_stats)

##    Avg_Age
## 1 38.46957

According to dataset Average age of diabetic women who get pregnancy :- 38.46957

2.3 Question 5:- Age Category Breakdown

age_breakdown <- diabetes_data %>%
  mutate(Age_Bracket = case_when(
    Age < 18 ~ "Less than 18",
    Age >= 19 & Age <= 40 ~ "19 to 40",
    Age > 40 ~ "Greater than 40"
  )) %>%
  group_by(Age_Bracket) %>%
  summarise(Count = n())

print(age_breakdown)

## # A tibble: 2 × 2
##   Age_Bracket     Count
##   <chr>           <int>
## 1 19 to 40          574
## 2 Greater than 40   194

check how many women age are between 0 to 18, 19 to 40 and more then 40

2.3.1 Analysis 34: Risky BMI Count

bmi_risk_audit <- diabetes_data %>%
  mutate(Risk_Level = case_when(
    BMI < 25 ~ "Healthy/Underweight",
    BMI >= 25 & BMI < 30 ~ "Overweight (Risky)",
    BMI >= 30 ~ "Obese (High Risk)"
  )) %>%
  group_by(Risk_Level) %>%
  summarise(Count = n())

print(bmi_risk_audit)

## # A tibble: 3 × 2
##   Risk_Level          Count
##   <chr>               <int>
## 1 Healthy/Underweight   117
## 2 Obese (High Risk)     472
## 3 Overweight (Risky)    179

Check BMI index and Risk rate (Healthy/Underweight:- 117 Obese(High Risk):-472 Overweight(Risky):- 179 )

3 Maternal History & Diabetes Correlation

3.1 Question 6:- How does the number of pregnancies impact the diabetes outcome?

# Summary of Pregnancy Data across all records
summary(diabetes_data$Pregnancies)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   3.000   3.845   6.000  17.000

# Average Pregnancies grouped by Health Outcome
diabetes_data %>%
  group_by(Outcome) %>%
  summarise(Avg_Pregnancies = mean(Pregnancies))

## # A tibble: 2 × 2
##   Outcome Avg_Pregnancies
##     <int>           <dbl>
## 1       0            3.30
## 2       1            4.87

Interpretation: The audit shows that women with a positive diabetes outcome have a higher average number of pregnancies compared to those with a negative outcome

3.2 Question 7:- Visual Distribution of Pregnancy vs. Outcome

ggplot(diabetes_data, aes(x = factor(Pregnancies), fill = factor(Outcome))) +
  geom_bar(position = "dodge") +
  scale_fill_manual(values = c("green", "red"), 
                    labels = c("Non-Diabetic", "Diabetic")) +
  labs(title = "Pregnancy Count vs. Diabetes Outcome", 
       x = "Number of Pregnancies", 
       y = "Count of Women",
       fill = "Health Status")

Interpretation: The visualization reveals a “tipping point” where the proportion of diabetic cases increases significantly as the pregnancy count exceeds 6-7

3.3 Question 8:- Auditing the High-Risk Pregnancy Segment (7+ pregnancies)

high_risk_pregnancy <- diabetes_data %>%
  filter(Pregnancies >= 7) %>%
  summarise(
    Record_Count = n(),
    Avg_Glucose = mean(Glucose),
    Diabetes_Rate_Percentage = mean(Outcome) * 100
  )

print(high_risk_pregnancy)

##   Record_Count Avg_Glucose Diabetes_Rate_Percentage
## 1          169    129.9941                 56.21302

Interpretation: In this 768-record dataset, women in the high-risk pregnancy category show a significantly higher diabetes rate

4 Glucose Level Audit

4.1 Question 9:- What is the average glucose level for diabetic vs non-diabetic patients?

glucose_audit <- diabetes_data %>%
  group_by(Outcome) %>%
  summarise(Average_Glucose = mean(Glucose))

print(glucose_audit)

## # A tibble: 2 × 2
##   Outcome Average_Glucose
##     <int>           <dbl>
## 1       0            110.
## 2       1            141.

Interpretation: The average glucose level of diabetic patients is significantly higher than that of non-diabetic individuals, which serves as a primary indicator for both financial and health risk audits.

4.2 Question 10:- Visualizing Average Glucose Levels

# Visualizing Average Glucose by Outcome
ggplot(diabetes_data, aes(x = factor(Outcome), y = Glucose, fill = factor(Outcome))) +
  stat_summary(fun = "mean", geom = "bar", color = "black", width = 0.6) +
  scale_fill_manual(values = c("green", "red"), 
                    labels = c("Non-Diabetic", "Diabetic")) +
  labs(title = "Average Glucose Level: Diabetic vs. Non-Diabetic",
       x = "Health Status",
       y = "Mean Glucose Concentration",
       fill = "Status") +
  theme_minimal()

Interpretation: The bar chart clearly illustrates a significant disparity in mean glucose levels between the two groups. In the context of a financial transparency audit, this visualization justifies the prioritization of glucose monitoring as a primary risk assessment tool, as the diabetic group maintains a substantially higher average baseline compared to non-diabetic subjects

5 Feature Engineering & Risk Indicators

###Question 11:- Creating New Risk Metrics (BMI Category and Insulin-Glucose Ratio)

# Categorizing BMI into Risk Levels
# Creating an Insulin-to-Glucose Ratio (a marker for insulin resistance)
diabetes_featured <- diabetes_data %>%
  mutate(
    BMI_Category = case_when(
      BMI < 18.5 ~ "Underweight",
      BMI >= 18.5 & BMI < 25 ~ "Healthy",
      BMI >= 25 & BMI < 30 ~ "Overweight",
      BMI >= 30 ~ "Obese"
    ),
    Insulin_Glucose_Ratio = ifelse(Glucose > 0, Insulin / Glucose, 0)
  )

# New features
head(diabetes_featured %>% select(BMI, BMI_Category, Insulin, Glucose, Insulin_Glucose_Ratio))

##    BMI BMI_Category Insulin Glucose Insulin_Glucose_Ratio
## 1 33.6        Obese       0     148              0.000000
## 2 26.6   Overweight       0      85              0.000000
## 3 23.3      Healthy       0     183              0.000000
## 4 28.1   Overweight      94      89              1.056180
## 5 43.1        Obese     168     137              1.226277
## 6 25.6   Overweight       0     116              0.000000

Interpretation: Through Feature Engineering, we transformed raw BMI numerical data into categorical risk groups. Additionally, we calculated the Insulin-to-Glucose ratio. In a financial transparency audit, these derived features provide a more nuanced view of patient health than raw data alone, allowing for better segmentation of high-risk profile

5.1 Question 12:- Age-Pregnancy Interaction (Maternal Risk Score)

# Creating an interaction feature between Age and Pregnancies
diabetes_featured <- diabetes_featured %>%
  mutate(Maternal_Risk_Index = Pregnancies * Age)

# Comparing this new index for Diabetic vs Non-Diabetic
ggplot(diabetes_featured, aes(x = factor(Outcome), y = Maternal_Risk_Index, fill = factor(Outcome))) +
  geom_boxplot() +
  scale_fill_manual(values = c("green", "red"), labels = c("Non-Diabetic", "Diabetic")) +
  labs(title = "Audit of Maternal Risk Index",
       x = "Health Status",
       y = "Index Value (Age * Pregnancies)")

Interpretation: We engineered a ‘Maternal Risk Index’ by multiplying Age and Pregnancy count. The audit visualization shows that diabetic patients typically have a much higher interaction score. This engineered feature helps capture the cumulative biological stress over time

5.2 Question 13:- Categorizing women into “Young” and “Old” life stages

# Creating a new categorical column based on Age
# Young: 21-35 | Old: 36 and above
diabetes_featured <- diabetes_featured %>%
  mutate(Life_Stage = ifelse(Age <= 35, "Young", "Old"))

# Count of women in each category
life_stage_count <- diabetes_featured %>%
  group_by(Life_Stage) %>%
  summarise(Count = n())

print(life_stage_count)

## # A tibble: 2 × 2
##   Life_Stage Count
##   <chr>      <int>
## 1 Old          270
## 2 Young        498

Interpretation: We have successfully engineered a new categorical feature, ‘Life_Stage’, to segment the population. In the context of a financial transparency audit, this categorization allows for a more targeted analysis of insurance or health risks across different age demographics

5.3 Question 14:- Visualizing Diabetes Outcome by Life Stage

ggplot(diabetes_featured, aes(x = Life_Stage, fill = factor(Outcome))) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = c("green", "red"), 
                    labels = c("Non-Diabetic", "Diabetic")) +
  labs(title = "Risk Proportion: Young vs. Old",
       x = "Life Stage",
       y = "Proportion",
       fill = "Health Status") +
  theme_minimal()

Interpretation: The visual audit indicates that the ‘Old’ category (Age > 35) has a significantly higher proportion of diabetic cases compared to the ‘Young’ category. This confirms that age is a critical multiplier for health risk transparency

5.4 Question 15:- Creating and Auditing the Maternal Risk Index (Stress Score)

# Maternal Risk Index = Age * Pregnancies
diabetes_featured <- diabetes_featured %>%
  mutate(Maternal_Risk_Index = Age * Pregnancies)

# Let's check the average Stress Score for Diabetic vs Non-Diabetic
stress_audit <- diabetes_featured %>%
  group_by(Outcome) %>%
  summarise(Avg_Stress_Score = mean(Maternal_Risk_Index))

print(stress_audit)

## # A tibble: 2 × 2
##   Outcome Avg_Stress_Score
##     <int>            <dbl>
## 1       0             123.
## 2       1             199.

5.5 Question 16:- Visualizing the Stress Score

ggplot(diabetes_featured, aes(x = factor(Outcome), y = Maternal_Risk_Index, fill = factor(Outcome))) +
  geom_boxplot(alpha = 0.7) +
  scale_fill_manual(values = c("green", "red"), labels = c("Non-Diabetic", "Diabetic")) +
  labs(title = "Audit of Maternal Risk Index (Stress Score)",
       subtitle = "Interaction between Age and Pregnancy Count",
       x = "Health Status (0: No, 1: Yes)",
       y = "Stress Score (Age * Pregnancies)") +
  theme_minimal()

Interpretation: We engineered a ‘Maternal Risk Index’ to capture the interaction between age and maternal history. The visualization reveals that diabetic patients have a significantly higher median stress score. This engineered feature is a powerful predictor for our transparency model as it accounts for cumulative health strain