Setup and Data Loading
## 'data.frame': 768 obs. of 9 variables:
## $ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...
## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ BloodPressure : int 72 66 64 66 40 74 50 0 70 96 ...
## $ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...
## $ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : int 50 31 32 21 33 30 26 29 53 54 ...
## $ Outcome : int 1 0 1 0 1 0 1 0 1 1 ...
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
Interpretation: This dataset from diabetes_data contains health data across 768 records and 9 columns.
avg_age_diabetic <- diabetes_data %>%
filter(Outcome == 1) %>%
summarise(Average_Age = mean(Age, na.rm = TRUE))
print(paste("The average age of patients with diabetes is:", round(avg_age_diabetic$Average_Age, 2)))## [1] "The average age of patients with diabetes is: 37.07"
Printing the result to the console:-average age of patients with diabetes is 37.07
pregnant_diabetic_stats <- diabetes_data %>%
filter(Outcome == 1 & Pregnancies > 0) %>%
summarise(Avg_Age = mean(Age))
print(pregnant_diabetic_stats)## Avg_Age
## 1 38.46957
According to dataset Average age of diabetic women who get pregnancy :- 38.46957
age_breakdown <- diabetes_data %>%
mutate(Age_Bracket = case_when(
Age < 18 ~ "Less than 18",
Age >= 19 & Age <= 40 ~ "19 to 40",
Age > 40 ~ "Greater than 40"
)) %>%
group_by(Age_Bracket) %>%
summarise(Count = n())
print(age_breakdown)## # A tibble: 2 × 2
## Age_Bracket Count
## <chr> <int>
## 1 19 to 40 574
## 2 Greater than 40 194
check how many women age are between 0 to 18, 19 to 40 and more then 40
bmi_risk_audit <- diabetes_data %>%
mutate(Risk_Level = case_when(
BMI < 25 ~ "Healthy/Underweight",
BMI >= 25 & BMI < 30 ~ "Overweight (Risky)",
BMI >= 30 ~ "Obese (High Risk)"
)) %>%
group_by(Risk_Level) %>%
summarise(Count = n())
print(bmi_risk_audit)## # A tibble: 3 × 2
## Risk_Level Count
## <chr> <int>
## 1 Healthy/Underweight 117
## 2 Obese (High Risk) 472
## 3 Overweight (Risky) 179
Check BMI index and Risk rate (Healthy/Underweight:- 117 Obese(High Risk):-472 Overweight(Risky):- 179 )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 3.000 3.845 6.000 17.000
# Average Pregnancies grouped by Health Outcome
diabetes_data %>%
group_by(Outcome) %>%
summarise(Avg_Pregnancies = mean(Pregnancies))## # A tibble: 2 × 2
## Outcome Avg_Pregnancies
## <int> <dbl>
## 1 0 3.30
## 2 1 4.87
Interpretation: The audit shows that women with a positive diabetes outcome have a higher average number of pregnancies compared to those with a negative outcome
ggplot(diabetes_data, aes(x = factor(Pregnancies), fill = factor(Outcome))) +
geom_bar(position = "dodge") +
scale_fill_manual(values = c("green", "red"),
labels = c("Non-Diabetic", "Diabetic")) +
labs(title = "Pregnancy Count vs. Diabetes Outcome",
x = "Number of Pregnancies",
y = "Count of Women",
fill = "Health Status")
Interpretation: The visualization reveals a “tipping point”
where the proportion of diabetic cases increases significantly as the
pregnancy count exceeds 6-7
high_risk_pregnancy <- diabetes_data %>%
filter(Pregnancies >= 7) %>%
summarise(
Record_Count = n(),
Avg_Glucose = mean(Glucose),
Diabetes_Rate_Percentage = mean(Outcome) * 100
)
print(high_risk_pregnancy)## Record_Count Avg_Glucose Diabetes_Rate_Percentage
## 1 169 129.9941 56.21302
Interpretation: In this 768-record dataset, women in the high-risk pregnancy category show a significantly higher diabetes rate
glucose_audit <- diabetes_data %>%
group_by(Outcome) %>%
summarise(Average_Glucose = mean(Glucose))
print(glucose_audit)## # A tibble: 2 × 2
## Outcome Average_Glucose
## <int> <dbl>
## 1 0 110.
## 2 1 141.
Interpretation: The average glucose level of diabetic patients is significantly higher than that of non-diabetic individuals, which serves as a primary indicator for both financial and health risk audits.
# Visualizing Average Glucose by Outcome
ggplot(diabetes_data, aes(x = factor(Outcome), y = Glucose, fill = factor(Outcome))) +
stat_summary(fun = "mean", geom = "bar", color = "black", width = 0.6) +
scale_fill_manual(values = c("green", "red"),
labels = c("Non-Diabetic", "Diabetic")) +
labs(title = "Average Glucose Level: Diabetic vs. Non-Diabetic",
x = "Health Status",
y = "Mean Glucose Concentration",
fill = "Status") +
theme_minimal()Interpretation: The bar chart clearly illustrates a significant disparity in mean glucose levels between the two groups. In the context of a financial transparency audit, this visualization justifies the prioritization of glucose monitoring as a primary risk assessment tool, as the diabetic group maintains a substantially higher average baseline compared to non-diabetic subjects
###Question 11:- Creating New Risk Metrics (BMI Category and Insulin-Glucose Ratio)
# Categorizing BMI into Risk Levels
# Creating an Insulin-to-Glucose Ratio (a marker for insulin resistance)
diabetes_featured <- diabetes_data %>%
mutate(
BMI_Category = case_when(
BMI < 18.5 ~ "Underweight",
BMI >= 18.5 & BMI < 25 ~ "Healthy",
BMI >= 25 & BMI < 30 ~ "Overweight",
BMI >= 30 ~ "Obese"
),
Insulin_Glucose_Ratio = ifelse(Glucose > 0, Insulin / Glucose, 0)
)
# New features
head(diabetes_featured %>% select(BMI, BMI_Category, Insulin, Glucose, Insulin_Glucose_Ratio))## BMI BMI_Category Insulin Glucose Insulin_Glucose_Ratio
## 1 33.6 Obese 0 148 0.000000
## 2 26.6 Overweight 0 85 0.000000
## 3 23.3 Healthy 0 183 0.000000
## 4 28.1 Overweight 94 89 1.056180
## 5 43.1 Obese 168 137 1.226277
## 6 25.6 Overweight 0 116 0.000000
Interpretation: Through Feature Engineering, we transformed raw BMI numerical data into categorical risk groups. Additionally, we calculated the Insulin-to-Glucose ratio. In a financial transparency audit, these derived features provide a more nuanced view of patient health than raw data alone, allowing for better segmentation of high-risk profile
# Creating an interaction feature between Age and Pregnancies
diabetes_featured <- diabetes_featured %>%
mutate(Maternal_Risk_Index = Pregnancies * Age)
# Comparing this new index for Diabetic vs Non-Diabetic
ggplot(diabetes_featured, aes(x = factor(Outcome), y = Maternal_Risk_Index, fill = factor(Outcome))) +
geom_boxplot() +
scale_fill_manual(values = c("green", "red"), labels = c("Non-Diabetic", "Diabetic")) +
labs(title = "Audit of Maternal Risk Index",
x = "Health Status",
y = "Index Value (Age * Pregnancies)")
Interpretation: We engineered a ‘Maternal Risk Index’ by
multiplying Age and Pregnancy count. The audit visualization shows that
diabetic patients typically have a much higher interaction score. This
engineered feature helps capture the cumulative biological stress over
time
# Creating a new categorical column based on Age
# Young: 21-35 | Old: 36 and above
diabetes_featured <- diabetes_featured %>%
mutate(Life_Stage = ifelse(Age <= 35, "Young", "Old"))
# Count of women in each category
life_stage_count <- diabetes_featured %>%
group_by(Life_Stage) %>%
summarise(Count = n())
print(life_stage_count)## # A tibble: 2 × 2
## Life_Stage Count
## <chr> <int>
## 1 Old 270
## 2 Young 498
Interpretation: We have successfully engineered a new categorical feature, ‘Life_Stage’, to segment the population. In the context of a financial transparency audit, this categorization allows for a more targeted analysis of insurance or health risks across different age demographics
ggplot(diabetes_featured, aes(x = Life_Stage, fill = factor(Outcome))) +
geom_bar(position = "fill") +
scale_fill_manual(values = c("green", "red"),
labels = c("Non-Diabetic", "Diabetic")) +
labs(title = "Risk Proportion: Young vs. Old",
x = "Life Stage",
y = "Proportion",
fill = "Health Status") +
theme_minimal()
Interpretation: The visual audit indicates that the ‘Old’
category (Age > 35) has a significantly higher proportion of diabetic
cases compared to the ‘Young’ category. This confirms that age is a
critical multiplier for health risk transparency
# Maternal Risk Index = Age * Pregnancies
diabetes_featured <- diabetes_featured %>%
mutate(Maternal_Risk_Index = Age * Pregnancies)
# Let's check the average Stress Score for Diabetic vs Non-Diabetic
stress_audit <- diabetes_featured %>%
group_by(Outcome) %>%
summarise(Avg_Stress_Score = mean(Maternal_Risk_Index))
print(stress_audit)## # A tibble: 2 × 2
## Outcome Avg_Stress_Score
## <int> <dbl>
## 1 0 123.
## 2 1 199.
ggplot(diabetes_featured, aes(x = factor(Outcome), y = Maternal_Risk_Index, fill = factor(Outcome))) +
geom_boxplot(alpha = 0.7) +
scale_fill_manual(values = c("green", "red"), labels = c("Non-Diabetic", "Diabetic")) +
labs(title = "Audit of Maternal Risk Index (Stress Score)",
subtitle = "Interaction between Age and Pregnancy Count",
x = "Health Status (0: No, 1: Yes)",
y = "Stress Score (Age * Pregnancies)") +
theme_minimal()
Interpretation: We engineered a ‘Maternal Risk Index’ to
capture the interaction between age and maternal history. The
visualization reveals that diabetic patients have a significantly higher
median stress score. This engineered feature is a powerful predictor for
our transparency model as it accounts for cumulative health
strain