Health Analytics for Diabetes: Risk Prediction and Diagnostic Classification

This project follows CRISP-DM methodology: Business Understanding, Data Understanding, Data Preparation, EDA & Modelling and Evaluation. Deployment is not considered in this project.

1. Business Understanding

• According to International Diabetes Federation (2025), approximately 589 million adults (20-79 years) are living with diabetes worldwide as of 2024 (7.3% of total world population).

• Early risk prediction of diabetes is crucial to individuals (especially high-risk population) to take preventive action sooner and prevent long-term health complications.Besides, accurate classification model is needed in medical field.

• At the national or regional level, this benefits healthcare providers, policymakers and communities by improving resource allocation and supporting targeted public health interventions.

• However, there is limited risk scoring tool in global context taking ethnicity, family history, diet, lifestyle, basic health indicators and socioeconomic status factors into consideration. There is also limited diagnostic classification tool that can accurately classify diagnosis.

1.1 Determine Business Objectives

Primary business objective:

• To develop an early diabetes risk prediction framework that incorporates multidimensional factors to support early intervention and resource planning.

• To develop a diagnostic classification framework to assist medical practitioners.

This framework will fulfill below initiatives:

• To help healthcare providers and policymakers to improve screening strategies and resource allocation.

• To support targeted public health interventions based on population-level risk patterns.

• To address limitations of existing global risk scores that do not fully consider diverse populations and socioeconomic contexts.

1.2 Situation Assessment

• Diabetes prevalence is increasing rapidly, yet early detection tools remain limited, especially for Asian or multi-ethnic populations.

• Current risk scores rarely include all major contributors such as ethnicity, family history, diet, lifestyle, basic health indicators and socioeconomic status.

• There is a strong need for an interpretable and population-relevant prediction and classification tool to support early intervention, public health planning and personalized prevention.

• Constraints include data quality, the need for ethical handling and clinical interpretability.

1.3 Determine Data Mining Goals

1.3.1 Data Mining Goals

• To develop a regression model for accurate prediction of individual’s diabetes risk score using demographic, lifestyle and medical variables.

• To develop a classification model to classify diabetic diagnosis (0 = No, 1 = Yes).

• To quantify key features that contribute most significantly to diabetes risk and diagnosis by using SHAP (SHapley Additive exPlanations).

• To compare the performance of multiple machine learning models (XGBoost vs Random Forest for regression/ Logistic Regression vs XGBoost Classifier for classification).

• To provide interpretable insights that can support early screening and preventive healthcare decision-making.

1.3.2 Data Mining Questions

• Which regression model is better in predicting diabetes risk?

• Which classification model is more accurate in classify diabetes diagnosis?

• Which is the most important feature in these models?

1.3.3 Data Mining Success Criteria

Besides achieving better performance metrics than baseline models:

Regression Model must achieve:
• R² ≥ 0.70 (good explanatory power)
• MAE minimized.
• RMSE minimized

Classification Model must achieve:
• Accuracy ≥ 80%
• F1-Score ≥ 0.75
• Recall ≥ 0.75

1.4 Produce Project Plan

1.4.1 Project Planning

Business Understanding > Data Understanding > Data Preparation > EDA > Modelling (Regression) + Modelling (Classification) > Model Assessment > Interpretation > Conclusion

1.4.2 Initial Assessment of Tools & Techniques

Tools:
• Programming Language: R
• Development Environment: RStudio 2025.09.2+418
• Libraries: tidyverse, janitor, forcats, recipes, caret etc.
• Data Source: Kaggle Diabetes Health Indicators Dataset
• Documentation: RMD

Techniques:

Exploratory Data Analysis (EDA):
• Correlation heatmap, violin plot ,propotional bar plot, scatter plot etc.
Modeling Techniques:
• Regression: XGBoost Regressor, Random Forest Regressor
• Classification: Logistic Regression, XGBoost Classifier
Evaluation Techniques:
• Performance metrics defined for Regression and Classification problem.
• SHAP results.
Validation & Performance Testing:
• Train-validate-test split: 70/15/15

2. Data Understanding

2.1 Data Collection

• Dataset name: Diabetes Health Indicators Dataset retrieved from Kaggle (Thalla, 2025)

• Contains 100,200 patient records for diabetes risk analysis with 31 features.

• Diabetes-related information including demographics, lifestyle factors, clinical measurements and laboratory results.

df <- read_csv("diabetes_dirty.csv") %>% 
  clean_names()

rows <- nrow(df)
cols <- ncol(df)
memory <- as.numeric(object.size(df)) / (1024^2)

rows; cols; memory

## [1] 100200

## [1] 31

## [1] 23.72153

The output above shows the basic properties of the diabetes dataset:

Number of rows (observations): 100200
Each row corresponds to one individual record in the dataset.
Number of columns (variables): 31
These include demographic, lifestyle, clinical and laboratory variables.
Approximate memory usage: 23.72 MB
This indicates the dataset is relatively ideal to handle in R.

2.2 Data Description

This section checks the data types of all variables to distinguish:
• categorical / ID-like variables (e.g., gender, ethnicity, diabetes_stage)
• numeric measurement variables (e.g., BMI, glucose, HbA1c, blood pressure)

# Create a table of variable names and their main class
dtype <- data.frame(
  variable = names(df),
  type     = sapply(df, function(x) class(x)[1]),
  stringsAsFactors = FALSE
)

dtype

##                                                              variable      type
## age                                                               age   numeric
## gender                                                         gender character
## ethnicity                                                   ethnicity character
## education_level                                       education_level character
## income_level                                             income_level character
## employment_status                                   employment_status character
## smoking_status                                         smoking_status character
## alcohol_consumption_per_week             alcohol_consumption_per_week   numeric
## physical_activity_minutes_per_week physical_activity_minutes_per_week   numeric
## diet_score                                                 diet_score   numeric
## sleep_hours_per_day                               sleep_hours_per_day   numeric
## screen_time_hours_per_day                   screen_time_hours_per_day   numeric
## family_history_diabetes                       family_history_diabetes   numeric
## hypertension_history                             hypertension_history   numeric
## cardiovascular_history                         cardiovascular_history   numeric
## bmi                                                               bmi   numeric
## waist_to_hip_ratio                                 waist_to_hip_ratio   numeric
## systolic_bp                                               systolic_bp   numeric
## diastolic_bp                                             diastolic_bp   numeric
## heart_rate                                                 heart_rate   numeric
## cholesterol_total                                   cholesterol_total   numeric
## hdl_cholesterol                                       hdl_cholesterol   numeric
## ldl_cholesterol                                       ldl_cholesterol   numeric
## triglycerides                                           triglycerides   numeric
## glucose_fasting                                       glucose_fasting   numeric
## glucose_postprandial                             glucose_postprandial   numeric
## insulin_level                                           insulin_level   numeric
## hba1c                                                           hba1c   numeric
## diabetes_risk_score                               diabetes_risk_score   numeric
## diabetes_stage                                         diabetes_stage character
## diagnosed_diabetes                                 diagnosed_diabetes   numeric

# Categorical / ID-like variables
id <- dtype$variable[
  dtype$type %in% c("character", "factor")
]

# Numeric measurement variables
num <- dtype$variable[
  dtype$type %in% c("numeric", "double", "integer")
]

id

## [1] "gender"            "ethnicity"         "education_level"  
## [4] "income_level"      "employment_status" "smoking_status"   
## [7] "diabetes_stage"

num

##  [1] "age"                                "alcohol_consumption_per_week"      
##  [3] "physical_activity_minutes_per_week" "diet_score"                        
##  [5] "sleep_hours_per_day"                "screen_time_hours_per_day"         
##  [7] "family_history_diabetes"            "hypertension_history"              
##  [9] "cardiovascular_history"             "bmi"                               
## [11] "waist_to_hip_ratio"                 "systolic_bp"                       
## [13] "diastolic_bp"                       "heart_rate"                        
## [15] "cholesterol_total"                  "hdl_cholesterol"                   
## [17] "ldl_cholesterol"                    "triglycerides"                     
## [19] "glucose_fasting"                    "glucose_postprandial"              
## [21] "insulin_level"                      "hba1c"                             
## [23] "diabetes_risk_score"                "diagnosed_diabetes"

The tables above summarise the data types in the dataset.

R reports 7 character variables and 24 numeric (double) variables
(as seen in the import message and the dtype output).
The categorical / ID-like variables include:
- gender, ethnicity, education_level, income_level, employment_status, smoking_status, diabetes_stage, diagnosed_diabetes.
The numeric measurement variables include:
- demographic: age
- lifestyle: alcohol_consumption_per_week, physical_activity_minutes_per_week, diet_score, sleep_hours_per_day, screen_time_hours_per_day
- clinical / lab: bmi, waist_to_hip_ratio, systolic_bp, diastolic_bp, heart_rate, cholesterol_total, hdl_cholesterol, ldl_cholesterol, triglycerides, glucose_fasting, glucose_postprandial, insulin_level, hba1c, diabetes_risk_score.
This confirms that the dataset structure is suitable for further descriptive analysis and modelling, although some categorical values (e.g., different capitalisation of Male / male, White / white) will need cleaning later.

2.3 Data Exploration

2.3.1 Descriptive Statistics (Numeric Variables)

This section summarises the continuous numeric variables in the diabetes dataset.
• Binary (0/1) variables such as medical history flags are excluded because their summary statistics are not meaningful.
• The goal is to inspect the general ranges,central tendency, and spread of the main clinical and lifestyle measurements.

# List of binary variables to exclude
binary_cols <- c(
  "family_history_diabetes",
  "hypertension_history",
  "cardiovascular_history",
  "diagnosed_diabetes"
)

# Select only continuous numeric columns
continuous <- df %>%
  select(where(is.numeric)) %>%
  select(-all_of(binary_cols))

summary(continuous)

##       age        alcohol_consumption_per_week
##  Min.   :18.00   Min.   : 0.000              
##  1st Qu.:39.00   1st Qu.: 1.000              
##  Median :50.00   Median : 2.000              
##  Mean   :50.12   Mean   : 2.004              
##  3rd Qu.:61.00   3rd Qu.: 3.000              
##  Max.   :90.00   Max.   :10.000              
##                                              
##  physical_activity_minutes_per_week   diet_score     sleep_hours_per_day
##  Min.   :  0.0                      Min.   : 0.000   Min.   : 3.000     
##  1st Qu.: 57.0                      1st Qu.: 4.800   1st Qu.: 6.300     
##  Median :100.0                      Median : 6.000   Median : 7.000     
##  Mean   :118.9                      Mean   : 5.995   Mean   : 6.998     
##  3rd Qu.:160.0                      3rd Qu.: 7.200   3rd Qu.: 7.700     
##  Max.   :833.0                      Max.   :10.000   Max.   :10.000     
##                                                                         
##  screen_time_hours_per_day      bmi         waist_to_hip_ratio  systolic_bp    
##  Min.   : 0.500            Min.   : 15.00   Min.   :0.6700     Min.   :  90.0  
##  1st Qu.: 4.300            1st Qu.: 23.20   1st Qu.:0.8200     1st Qu.: 106.0  
##  Median : 6.000            Median : 25.60   Median :0.8600     Median : 116.0  
##  Mean   : 5.997            Mean   : 25.68   Mean   :0.8561     Mean   : 115.9  
##  3rd Qu.: 7.700            3rd Qu.: 28.00   3rd Qu.:0.8900     3rd Qu.: 125.0  
##  Max.   :16.800            Max.   :255.90   Max.   :1.0600     Max.   :1002.5  
##  NA's   :130                                NA's   :50                         
##   diastolic_bp      heart_rate     cholesterol_total hdl_cholesterol
##  Min.   : 50.00   Min.   : 40.00   Min.   :100       Min.   :20.00  
##  1st Qu.: 70.00   1st Qu.: 64.00   1st Qu.:164       1st Qu.:47.00  
##  Median : 75.00   Median : 70.00   Median :186       Median :54.00  
##  Mean   : 75.23   Mean   : 69.79   Mean   :186       Mean   :54.04  
##  3rd Qu.: 81.00   3rd Qu.: 75.00   3rd Qu.:208       3rd Qu.:61.00  
##  Max.   :110.00   Max.   :586.99   Max.   :318       Max.   :98.00  
##                                    NA's   :140                      
##  ldl_cholesterol triglycerides   glucose_fasting glucose_postprandial
##  Min.   : 50     Min.   : 30.0   Min.   : 60.0   Min.   : 70         
##  1st Qu.: 78     1st Qu.: 91.0   1st Qu.:102.0   1st Qu.:139         
##  Median :102     Median :121.0   Median :111.0   Median :160         
##  Mean   :103     Mean   :121.5   Mean   :111.1   Mean   :160         
##  3rd Qu.:126     3rd Qu.:151.0   3rd Qu.:120.0   3rd Qu.:181         
##  Max.   :263     Max.   :344.0   Max.   :172.0   Max.   :287         
##                                                                      
##  insulin_level        hba1c       diabetes_risk_score
##  Min.   : 2.000   Min.   :4.000   Min.   : 2.70      
##  1st Qu.: 5.090   1st Qu.:5.970   1st Qu.:23.80      
##  Median : 8.790   Median :6.520   Median :29.00      
##  Mean   : 9.062   Mean   :6.521   Mean   :30.22      
##  3rd Qu.:12.450   3rd Qu.:7.070   3rd Qu.:35.60      
##  Max.   :32.220   Max.   :9.800   Max.   :67.20      
##                   NA's   :160

The descriptive statistics reveal several important characteristics:

Age ranges from 18 to 90 years, indicating that the dataset captures a wide adult age span suitable for diabetes and metabolic risk analysis.
Lifestyle variables (alcohol consumption, physical activity, diet score, sleep hours, screen time) show substantial variability:
- Sleep hours vary between 3 and 10 hours.
- Physical activity ranges from 0 to 833 minutes per week, suggesting extreme lifestyle differences that should be checked for outliers.
- Screen time ranges from 0.5 to 18 hours per day, which is unusually high at the upper end.
BMI ranges from 15 to 255.9.
- Values above 80 are physiologically unrealistic, suggesting potential data entry errors or extreme outliers requiring investigation.
Waist-to-hip ratio is mostly between 0.67 and 1.06, within plausible physiological limits.
Blood pressure:
- Systolic BP ranges from 90 to 1002.5, where values above 250 are physiologically impossible.
- Diastolic BP ranges from 50 to 110, mostly within realistic bounds.
- Systolic BP clearly contains erroneous values.
Heart rate ranges from 40 to 586 bpm.
- Anything above ~200 bpm is unrealistic, indicating significant outliers.
Cholesterol and triglycerides:
- Most values appear clinically plausible.
- Triglycerides have a maximum of 344, consistent with high-risk profiles but not impossible.
- Total cholesterol includes values up to 318, which is elevated but remains biologically credible.
Glucose levels show notable variation:
- Fasting glucose ranges from 60 to 172 mg/dL, covering normal, prediabetic, and diabetic ranges.
- Postprandial glucose ranges from 70 to 287 mg/dL, again showing broad metabolic diversity.
HbA1c ranges from 4.0 to 9.8%, consistent with normal to poorly controlled diabetes.
Insulin levels range from 2 to 32.2 µU/mL, which is plausible clinically but may still contain borderline high values.
Diabetes risk score ranges from 2.7 to 67.2, showing wide variability consistent with differing risk levels.

Overall, the descriptive statistics confirm that most clinical measurements are within realistic clinical ranges, although extreme outliers exist in variables such as BMI, systolic blood pressure, and heart rate. These values should be carefully examined and cleaned during the Data Quality Verification step. Several variables also contain missing values which will be assessed in the next section.

2.3.2 Distribution of Key Continuous Variables (Diabetes-Related Predictors)

• This section examines the distribution of some key predictors related to diabetes risk and diagnosis: bmi, glucose_fasting, glucose_postprandial, hba1c and diabetes_risk_score.
• Understanding these distributions helps identify abnormal values, skewness and overall trends relevant to predicting diabetes.

ggplot(df, aes(bmi)) +
  geom_histogram(binwidth = 1, fill = "steelblue") +
  labs(title = "BMI Distribution", x = "BMI", y = "Frequency")

Figure above shows that the BMI distribution is unimodal, peaking around 23–27. The distribution is strongly right-skewed due to a small number of extremely high BMI values.

ggplot(df, aes(glucose_fasting)) +
  geom_histogram(binwidth = 5, fill = "steelblue") +
  labs(title = "Fasting Glucose Distribution", x = "Fasting Glucose (mg/dL)", y = "Frequency")

Figure above shows that fasting glucose is unimodal, peaking around 100–120 mg/dL. The distribution is slightly right-skewed, reflecting individuals with elevated fasting glucose.

ggplot(df, aes(glucose_postprandial)) +
  geom_histogram(binwidth = 10, fill = "steelblue") +
  labs(title = "Postprandial Glucose Distribution", x = "Postprandial Glucose (mg/dL)", y = "Frequency")

Figure above shows that postprandial glucose is unimodal, with its highest density around 140–160 mg/dL. The distribution is moderately right-skewed due to high glucose responses.

ggplot(df, aes(hba1c)) +
  geom_histogram(binwidth = 0.2, fill = "steelblue") +
  labs(title = "HbA1c Distribution", x = "HbA1c (%)", y = "Frequency")

Figure above shows that the HbA1c distribution is unimodal, peaking around 6–7%. The distribution is slightly right-skewed, indicating the presence of individuals with poor glycaemic control.

ggplot(df, aes(diabetes_risk_score)) +
  geom_histogram(binwidth = 2, fill = "steelblue") +
  labs(title = "Diabetes Risk Score Distribution", x = "Risk Score", y = "Frequency")

Figure above shows that the diabetes_risk_score distribution is unimodal, peaking around 25–35. The distribution is clearly right-skewed, showing that many individuals fall in the moderate-to-high risk range.

These distributions highlight the variability in key diabetes predictors and will play an important role in modelling diabetes risk and classification.

2.3.3 Correlation Analysis

This section examines the correlation between the mentioned key diabetes-related numerical variables to identify linear relationships that may influence diabetes risk prediction and diagnosis classification.

# Select numeric variables relevant to diabetes
corrvars <- df %>%
  select(bmi, glucose_fasting, glucose_postprandial, hba1c, diabetes_risk_score)

# Compute correlation matrix (rounded)
corrmat <- round(cor(corrvars, use = "complete.obs"), 2)

# Convert to long format (tidyverse)
corr_long <- as.data.frame(corrmat) %>%
  rownames_to_column(var = "Var1") %>%
  pivot_longer(-Var1, names_to = "Var2", values_to = "corr")

# Preserve axis ordering so heatmap rows/cols match corrmat order
corr_long <- corr_long %>%
  mutate(
    Var1 = factor(Var1, levels = colnames(corrmat)),
    Var2 = factor(Var2, levels = colnames(corrmat))
  )

# Heatmap with correlation values
ggplot(corr_long, aes(x = Var1, y = Var2, fill = corr)) +
  geom_tile() +
  geom_text(aes(label = corr), color = "black", size = 4) +
  scale_fill_gradient2(
    low = "steelblue",
    mid = "white",
    high = "darkred",
    midpoint = 0,
    limits = c(-1, 1)
  ) +
  labs(
    title = "Correlation Heatmap (Diabetes-Related Variables)",
    x = "",
    y = "",
    fill = "Correlation"
  ) +
  coord_fixed() +                          
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    axis.text.y = element_text(angle = 0)
  )

Figure shows strong positive correlations between glucose_postprandial and hba1c (r = 0.93) and between glucose_fasting and hba1c (r = 0.70). Glucose_fasting also shows moderate correlation with glucose_postprandial (r = 0.59) and diabetes_risk_score (r = 0.47). In contrast, bmi has weak correlations with all markers (r ≤ 0.24). This indicates that glycemic variables move closely together, while bmi contributes very little linearly.

2.4 Data Quality Verification

# Inspect total data points in each feature to detect columns with missing values
values <- df %>%
  summarise(across(everything(), ~ sum(!is.na(.)))) %>%
  pivot_longer(cols = everything(),
               names_to = "variable",
               values_to = "count") %>%
  arrange(desc(count))

values

## # A tibble: 31 × 2
##    variable                            count
##    <chr>                               <int>
##  1 age                                100200
##  2 gender                             100200
##  3 ethnicity                          100200
##  4 income_level                       100200
##  5 smoking_status                     100200
##  6 alcohol_consumption_per_week       100200
##  7 physical_activity_minutes_per_week 100200
##  8 diet_score                         100200
##  9 sleep_hours_per_day                100200
## 10 family_history_diabetes            100200
## # ℹ 21 more rows

ggplot(values, aes(x = reorder(variable, count), y = count)) +
  geom_line(group = 1) +
  geom_point() +
  coord_flip() +
  labs(
    title = "Number of data points in each column",
    x = "Variable",
    y = "Number of Data Points"
  )

From the graph above, we can see that there are missing values in waist-to-hip ratio, education level, employment status, screen time hours per day, cholesterol total and hba1c.

# For duplicated row checking
# count and show indices
sum(duplicated(df))

## [1] 0

which(duplicated(df))

## integer(0)

# For outlier checking
# Define threshold
thresholds <- list(
  age                          = c(0, 120),
  systolic_bp                  = c(70, 250),
  diastolic_bp                 = c(40, 150),
  heart_rate                   = c(30, 220),
  cholesterol_total            = c(70, 400),
  hdl_cholesterol              = c(10, 200),
  ldl_cholesterol              = c(10, 300),
  triglycerides                = c(10, 2000),
  glucose_fasting              = c(20, 400),
  glucose_postprandial         = c(20, 600),
  hba1c                        = c(3, 20),
  bmi                          = c(10, 80),
  waist_to_hip_ratio           = c(0.35, 2.0),
  alcohol_consumption_per_week = c(0, 70),
  physical_activity_minutes_per_week = c(0, 10080),
  physical_activity_hours_per_week = c(0, 168),
  diet_score                   = c(0, 100),
  sleep_hours_per_day          = c(0, 24),
  screen_time_hours_per_day    = c(0, 24),
  insulin_level                = c(0.1, 2000),
  family_history_diabetes      = c(0,1),
  hypertension_history         = c(0,1),
  cardiovascular_history       = c(0,1),
  diabetes_risk_score          = c(0,100),
  diagnosed_diabetes           = c(0,1)
)

# NA-safe IQR flagger 
iqr_flag1 <- function(x) {
  n <- length(x)
  flag <- rep(FALSE, n)
  finite_idx <- is.finite(x)
  if(sum(finite_idx) == 0) return(flag)

  q1  <- quantile(x[finite_idx], 0.25, na.rm = TRUE)
  q3  <- quantile(x[finite_idx], 0.75, na.rm = TRUE)
  iqr <- q3 - q1
  lower <- q1 - 1.5 * iqr
  upper <- q3 + 1.5 * iqr

  flag[finite_idx] <- x[finite_idx] < lower | x[finite_idx] > upper
  flag
}

# Rules + IQR flagger 
rule_flag1 <- function(x, varname) {
  n <- length(x)
  flag <- rep(FALSE, n)
  finite_idx <- is.finite(x)

  if (sum(finite_idx) == 0) return(flag)

  if (varname %in% names(thresholds)) {
    th <- thresholds[[varname]]
    flag[finite_idx] <- x[finite_idx] < th[1] | x[finite_idx] > th[2]
  } else {
    flag <- iqr_flag1(x)
  }
  flag
}

# Apply to numeric columns in raw1
numeric_vars1 <- names(df)[sapply(df, is.numeric)]

# Compute flags as a data.frame
outlier_flags1 <- lapply(numeric_vars1, function(v) rule_flag1(df[[v]], v))
outlier_flags1 <- as.data.frame(setNames(outlier_flags1, numeric_vars1))

# Verification: NA should never be counted as outlier
na_flag_counts1 <- sapply(numeric_vars1, function(v) {
  sum(outlier_flags1[[v]] & is.na(df[[v]]))
})
print(na_flag_counts1)

##                                age       alcohol_consumption_per_week 
##                                  0                                  0 
## physical_activity_minutes_per_week                         diet_score 
##                                  0                                  0 
##                sleep_hours_per_day          screen_time_hours_per_day 
##                                  0                                  0 
##            family_history_diabetes               hypertension_history 
##                                  0                                  0 
##             cardiovascular_history                                bmi 
##                                  0                                  0 
##                 waist_to_hip_ratio                        systolic_bp 
##                                  0                                  0 
##                       diastolic_bp                         heart_rate 
##                                  0                                  0 
##                  cholesterol_total                    hdl_cholesterol 
##                                  0                                  0 
##                    ldl_cholesterol                      triglycerides 
##                                  0                                  0 
##                    glucose_fasting               glucose_postprandial 
##                                  0                                  0 
##                      insulin_level                              hba1c 
##                                  0                                  0 
##                diabetes_risk_score                 diagnosed_diabetes 
##                                  0                                  0

# Summary table
outlier_counts1 <- sapply(outlier_flags1, function(col) sum(col, na.rm = TRUE))

outlier_summary1 <- data.frame(
  variable = names(outlier_counts1),
  n_flagged = as.integer(outlier_counts1),
  pct_flagged = round(100 * outlier_counts1 / nrow(df), 2)
)

print(outlier_summary1[order(-outlier_summary1$n_flagged), ])

##                                                              variable n_flagged
## bmi                                                               bmi        50
## heart_rate                                                 heart_rate        40
## systolic_bp                                               systolic_bp        20
## age                                                               age         0
## alcohol_consumption_per_week             alcohol_consumption_per_week         0
## physical_activity_minutes_per_week physical_activity_minutes_per_week         0
## diet_score                                                 diet_score         0
## sleep_hours_per_day                               sleep_hours_per_day         0
## screen_time_hours_per_day                   screen_time_hours_per_day         0
## family_history_diabetes                       family_history_diabetes         0
## hypertension_history                             hypertension_history         0
## cardiovascular_history                         cardiovascular_history         0
## waist_to_hip_ratio                                 waist_to_hip_ratio         0
## diastolic_bp                                             diastolic_bp         0
## cholesterol_total                                   cholesterol_total         0
## hdl_cholesterol                                       hdl_cholesterol         0
## ldl_cholesterol                                       ldl_cholesterol         0
## triglycerides                                           triglycerides         0
## glucose_fasting                                       glucose_fasting         0
## glucose_postprandial                             glucose_postprandial         0
## insulin_level                                           insulin_level         0
## hba1c                                                           hba1c         0
## diabetes_risk_score                               diabetes_risk_score         0
## diagnosed_diabetes                                 diagnosed_diabetes         0
##                                    pct_flagged
## bmi                                       0.05
## heart_rate                                0.04
## systolic_bp                               0.02
## age                                       0.00
## alcohol_consumption_per_week              0.00
## physical_activity_minutes_per_week        0.00
## diet_score                                0.00
## sleep_hours_per_day                       0.00
## screen_time_hours_per_day                 0.00
## family_history_diabetes                   0.00
## hypertension_history                      0.00
## cardiovascular_history                    0.00
## waist_to_hip_ratio                        0.00
## diastolic_bp                              0.00
## cholesterol_total                         0.00
## hdl_cholesterol                           0.00
## ldl_cholesterol                           0.00
## triglycerides                             0.00
## glucose_fasting                           0.00
## glucose_postprandial                      0.00
## insulin_level                             0.00
## hba1c                                     0.00
## diabetes_risk_score                       0.00
## diagnosed_diabetes                        0.00

# Logical check
out_prefix <- "dq"

as_num <- function(x) suppressWarnings(as.numeric(as.character(x)))
safe_write <- function(df0, fname) {
  tryCatch(write.csv(df0, fname, row.names = FALSE), error = function(e) NULL)
}

checks <- list()
abnormal_list <- list()

add_rule <- function(key, desc, idx) {
  if (length(idx) == 0) return()
  checks[[key]] <<- list(rule = desc, n = length(idx), idx = idx)
  abnormal_list[[key]] <<- cbind(df[idx, , drop = FALSE], dq_rule = key)
}

# rule a: diastolic_bp should not > systolic_bp
if (all(c("systolic_bp", "diastolic_bp") %in% names(df))) {
  s <- as_num(df$systolic_bp); d <- as_num(df$diastolic_bp)
  add_rule("diastolic_gt_systolic", "diastolic > systolic",
           which(!is.na(s) & !is.na(d) & d > s))
}

# rule b: glucose_postprandial should not < glucose_fasting
if (all(c("glucose_fasting", "glucose_postprandial") %in% names(df))) {
  gf <- as_num(df$glucose_fasting); gp <- as_num(df$glucose_postprandial)
  add_rule("glucose_post_lt_fasting", "postprandial < fasting",
           which(!is.na(gf) & !is.na(gp) & gp < gf))
}

# rule c: hba1c < 5 & fasting_glucose > 200
if (all(c("hba1c", "glucose_fasting") %in% names(df))) {
  h <- as_num(df$hba1c); gf <- as_num(df$glucose_fasting)
  add_rule("hba1c_low_glucose_high", "hba1c < 5 & fasting_glucose > 200",
           which(!is.na(h) & !is.na(gf) & h < 5 & gf > 200))
}

## NULL

# build summary dataframe
if (length(checks) == 0) {
  summary_df <- data.frame(
    name = character(0), rule = character(0),
    n = integer(0), stringsAsFactors = FALSE
  )
  all_abnormal_df <- df[integer(0), , drop = FALSE]
} else {
  summary_df <- data.frame(
    name = names(checks),
    rule = sapply(checks, `[[`, "rule"),
    n = sapply(checks, `[[`, "n"),
    stringsAsFactors = FALSE
  )

  all_abnormal_df <- do.call(rbind, abnormal_list)
  rownames(all_abnormal_df) <- NULL
}

# write the two files
safe_write(summary_df, paste0(out_prefix, "_cross_variable_checks_summary.csv"))
safe_write(all_abnormal_df, paste0(out_prefix, "_cross_variable_checks_all_abnormal.csv"))

# return invisibly
invisible(list(
  checks = checks,
  summary = summary_df,
  abnormal = abnormal_list,
  all_abnormal = all_abnormal_df
))

print(summary_df)

##                                            name                   rule    n
## diastolic_gt_systolic     diastolic_gt_systolic   diastolic > systolic  117
## glucose_post_lt_fasting glucose_post_lt_fasting postprandial < fasting 2604

Upon inspection on the output, delta (diastolic_bp - systolic_bp) is in the range of 1 to 10, that is still considered normal in clinical terms.
However, for glucose_fasting-glucose_postprandial, delta of >30 should be flagged as abnormal and were excluded to improve data reliability in data cleaning.

3. Data Preparation

3.1 Data Selection, Cleaning, Construction, Integration and Formatting

All columns of features were retained except diabetes_stage which contains classification of pre-diabetic, type 1 or type 2. These information was not crucial for our data mining goals as our target variables are risk scoring and diagnostic classes (diabetic or non-diabetic).

raw <- readr::read_csv("diabetes_dirty.csv") %>%
select(-diabetes_stage) # remove diabetes_stage column 
#gender
#Before cleaning: frequency table
cat("Before cleaning:\n")

## Before cleaning:

print(table(raw$gender, useNA = "ifany"))

## 
## female Female FEMALE   male   Male   MALE  other  Other  OTHER 
##  16881  16708  16725  15902  16152  15817    680    642    693

#Standardize gender 
raw <- raw %>% 
  mutate(
    gender = str_to_lower(gender),            
    gender = str_squish(gender),             
gender = case_when(
gender %in% c("female","Female","FEMALE") ~ "Female",
gender %in% c("male","Male","MALE") ~ "Male",
gender %in% c("other","Other","OTHER" ) ~ "Other"
    )
  ) %>%
  mutate(gender = factor(gender, levels = c("Female", "Male","Other")))

#After cleaning: frequency table
cat("After cleaning:\n")

## After cleaning:

print(table(raw$gender))

## 
## Female   Male  Other 
##  50314  47871   2015

#ethnicity
#Before cleaning: frequency table
cat("Before cleaning:\n")

## Before cleaning:

print(table(raw$ethnicity, useNA = "ifany"))

## 
##      asian      Asian      ASIAN    Asian !     Asian!      black      Black 
##       2906       3013       2889        609       2475       4502       4506 
##      BLACK    Black !     Black!   hispanic   Hispanic   HISPANIC Hispanic ! 
##       4484        909       3630       5028       4975       5195        969 
##  Hispanic!      other      Other      OTHER    Other !     Other!      white 
##       3977       1254       1319       1307        232        945      11086 
##      White      WHITE    White !     White! 
##      11206      11380       2271       9133

#Standardize ethnicity 
raw <- raw %>%
  mutate(
    ethnicity = ethnicity %>%
      str_to_lower() %>%
      str_replace_all("[[:punct:]]", "") %>%
      str_squish(),
    ethnicity = case_when(
      ethnicity == "asian" ~ "Asian",
      ethnicity == "black" ~ "Black",
      ethnicity == "hispanic" ~ "Hispanic",
      ethnicity == "white" ~ "White",
      ethnicity == "other" ~ "Other",
      TRUE ~ "Unknown"
    )
  )
#After cleaning: frequency table
cat("After cleaning:\n")

## After cleaning:

print(table(raw$ethnicity, useNA = "ifany"))

## 
##    Asian    Black Hispanic    Other    White 
##    11892    18031    20144     5057    45076

#smoking_status
#Before cleaning: frequency table
cat("Before cleaning:\n")

## Before cleaning:

print(table(raw$smoking_status, useNA = "ifany"))

## 
## current Current CURRENT  former  Former  FORMER   never   Never   NEVER 
##    5038   10152    5030    4901   10100    5051   14854   30109   14965

#Standardize smoking_status
raw <- raw %>%
  mutate(
    smoking_status = smoking_status %>%
      str_to_lower() %>%                     
      str_replace_all("[[:punct:]]", "") %>% 
      str_squish(),                           
    smoking_status = case_when(
      smoking_status == "current" ~ "Current",
      smoking_status == "former"  ~ "Former",
      smoking_status == "never"   ~ "Never",
      TRUE ~ NA_character_         
    )
  )
#After cleaning: frequency table
cat("After cleaning:\n")

## After cleaning:

print(table(raw$smoking_status, useNA = "ifany"))

## 
## Current  Former   Never 
##   20220   20052   59928

#physical_activity_minutes_per_week
#convert to physical_activity_hours_per_week
raw <- raw %>%
  mutate(
    physical_activity_hours_per_week = round(ifelse(
      is.na(physical_activity_minutes_per_week),
      NA,
      physical_activity_minutes_per_week / 60
    ),1)
  )
raw <- raw %>%
  relocate(physical_activity_hours_per_week, .after = 7)
raw <- raw %>% 
  select(-physical_activity_minutes_per_week)
#check first 10 numbers
raw$physical_activity_hours_per_week[1:10]

##  [1] 3.6 2.4 0.9 0.8 1.8 2.1 0.9 1.2 1.9 1.4

#round diet_score,sleep_hours_per_day,screen_time_hours_per_day,bm,diabetes_risk_score to 1 d.p.
safe_round <- function(x, digits = 1) {
  ifelse(is.finite(x), round(x, digits), NA)
}

raw <- raw %>%
  mutate(
    diet_score = safe_round(diet_score, 1),
    sleep_hours_per_day = safe_round(sleep_hours_per_day, 1),
    screen_time_hours_per_day = safe_round(screen_time_hours_per_day, 1),
    bmi = safe_round(bmi, 1),
    diabetes_risk_score = safe_round(diabetes_risk_score, 2)
  )
#check first 10 numbers
raw$diet_score[1:10]

##  [1] 5.7 6.7 6.4 3.4 7.2 9.0 9.2 4.1 6.7 8.2

raw$sleep_hours_per_day[1:10]

##  [1]  7.9  6.5 10.0  6.6  7.4  6.2  7.8  9.0  8.5  5.3

raw$screen_time_hours_per_day[1:10]

##  [1]  7.9  8.7  8.1  5.2  5.0  5.4  8.0 12.9  8.5  7.4

raw$bmi[1:10]

##  [1] 30.5 23.1 22.2 26.8 21.2 26.1 25.1 23.9 24.7 26.7

raw$diabetes_risk_score[1:10]

##  [1] 29.6 23.0 44.7 38.2 23.5 23.5 36.1 34.2 26.7 30.0

#round waist_to_hip_ratio,insulin_level,hba1c to 2 d.p.
safe_round <- function(x, digits = 2) {
  ifelse(is.finite(x), round(x, digits), NA)
}
raw <- raw %>%
  mutate(
    waist_to_hip_ratio = safe_round(waist_to_hip_ratio, 2),
    insulin_level = safe_round(insulin_level, 2),
    hba1c = safe_round(hba1c, 2)
  )
#check first 10 numbers
raw$waist_to_hip_ratio[1:10]

##  [1] 0.89 0.80 0.81 0.88 0.78 0.85 0.88 0.86 0.84 0.81

raw$insulin_level[1:10]

##  [1]  6.36  2.00  5.07  5.28 12.74  8.77 10.14  8.96  5.70  4.49

raw$hba1c[1:10]

##  [1] 8.18 5.63 7.51 9.03 7.20 6.03 5.24 7.04 6.90 4.99

#round others to nearest whole number
safe_round0 <- function(x) {
  ifelse(is.finite(x), round(x, 0), NA)
}
raw <- raw %>%
  mutate(
    systolic_bp           = safe_round0(systolic_bp),
    diastolic_bp          = safe_round0(diastolic_bp),
    heart_rate            = safe_round0(heart_rate),
    cholesterol_total     = safe_round0(cholesterol_total),
    hdl_cholesterol       = safe_round0(hdl_cholesterol),
    ldl_cholesterol       = safe_round0(ldl_cholesterol),
    triglycerides         = safe_round0(triglycerides),
    glucose_fasting       = safe_round0(glucose_fasting),
    glucose_postprandial  = safe_round0(glucose_postprandial)
  )
#check first 10 numbers
raw$systolic_bp[1:10]

##  [1] 134 129 115 120  92  95 129 128 103 124

raw$diastolic_bp[1:10]

##  [1] 78 76 73 93 67 81 77 83 71 81

raw$heart_rate[1:10]

##  [1] 68 67 74 68 67 57 81 76 72 70

raw$cholesterol_total[1:10]

##  [1] 239 116 213 171 210 218 238 241 187 188

raw$hdl_cholesterol[1:10]

##  [1] 41 55 66 50 52 61 46 49 33 52

raw$ldl_cholesterol[1:10]

##  [1] 160  50  99  79 125 119 161 159 132 103

raw$triglycerides[1:10]

##  [1] 145  30  36 140 160 179 155 120  98 104

raw$glucose_fasting[1:10]

##  [1] 136  93 118 139 137 100 101 110 116  76

raw$glucose_postprandial[1:10]

##  [1] 236 150 195 253 184 133 100 189 172 109

#There are no duplication, no need to remove any duplicated record.

#For outlier checking after dropping diabetes_stage feature
#Define threshold
thresholds <- list(
  age                          = c(0, 120),
  systolic_bp                  = c(70, 250),
  diastolic_bp                 = c(40, 150),
  heart_rate                   = c(30, 220),
  cholesterol_total            = c(70, 400),
  hdl_cholesterol              = c(10, 200),
  ldl_cholesterol              = c(10, 300),
  triglycerides                = c(10, 2000),
  glucose_fasting              = c(20, 400),
  glucose_postprandial         = c(20, 600),
  hba1c                        = c(3, 20),
  bmi                          = c(10, 80),
  waist_to_hip_ratio           = c(0.35, 2.0),
  alcohol_consumption_per_week = c(0, 70),
  physical_activity_minutes_per_week = c(0, 10080),
  physical_activity_hours_per_week = c(0, 168),
  diet_score                   = c(0, 100),
  sleep_hours_per_day          = c(0, 24),
  screen_time_hours_per_day    = c(0, 24),
  insulin_level                = c(0.1, 2000),
  family_history_diabetes      = c(0,1),
  hypertension_history         = c(0,1),
  cardiovascular_history       = c(0,1),
  diabetes_risk_score          = c(0,100),
  diagnosed_diabetes           = c(0,1)
)

#NA-safe IQR flagger
iqr_flag <- function(x) {
  n <- length(x)
  flag <- rep(FALSE, n)
  finite_idx <- is.finite(x)
  if(sum(finite_idx) == 0) return(flag)
  q1  <- quantile(x[finite_idx], 0.25, na.rm = TRUE)
  q3  <- quantile(x[finite_idx], 0.75, na.rm = TRUE)
  iqr <- q3 - q1
  lower <- q1 - 1.5 * iqr
  upper <- q3 + 1.5 * iqr
  flag[finite_idx] <- x[finite_idx] < lower | x[finite_idx] > upper
  flag
}
#Rules IQR flagger
rule_flag <- function(x, varname) {
  n <- length(x)
  flag <- rep(FALSE, n)
  finite_idx <- is.finite(x)
  
  if(sum(finite_idx) == 0) return(flag)
  
  if (varname %in% names(thresholds)) {
    th <- thresholds[[varname]]
    flag[finite_idx] <- x[finite_idx] < th[1] | x[finite_idx] > th[2]
  } else {
    flag <- iqr_flag(x)
  }
  flag
}

#Apply to numeric columns 
numeric_vars <- names(raw)[sapply(raw, is.numeric)]

#Compute flags as a data.frame
outlier_flags <- lapply(numeric_vars, function(v) rule_flag(raw[[v]], v))
outlier_flags <- as.data.frame(setNames(outlier_flags, numeric_vars))

#Quick verification: ensure no NA is counted as outlier
na_flag_counts <- sapply(numeric_vars, function(v) {
  sum(outlier_flags[[v]] & is.na(raw[[v]]))
})
#Should be all zeros
print(na_flag_counts)

##                              age physical_activity_hours_per_week 
##                                0                                0 
##     alcohol_consumption_per_week                       diet_score 
##                                0                                0 
##              sleep_hours_per_day        screen_time_hours_per_day 
##                                0                                0 
##          family_history_diabetes             hypertension_history 
##                                0                                0 
##           cardiovascular_history                              bmi 
##                                0                                0 
##               waist_to_hip_ratio                      systolic_bp 
##                                0                                0 
##                     diastolic_bp                       heart_rate 
##                                0                                0 
##                cholesterol_total                  hdl_cholesterol 
##                                0                                0 
##                  ldl_cholesterol                    triglycerides 
##                                0                                0 
##                  glucose_fasting             glucose_postprandial 
##                                0                                0 
##                    insulin_level                            hba1c 
##                                0                                0 
##              diabetes_risk_score               diagnosed_diabetes 
##                                0                                0

#Summary
outlier_counts <- sapply(outlier_flags, function(col) sum(col, na.rm = TRUE))
outlier_summary <- data.frame(variable = names(outlier_counts),
                              n_flagged = as.integer(outlier_counts),
                              pct_flagged = round(100 * outlier_counts / nrow(raw), 2))
print(outlier_summary[order(-outlier_summary$n_flagged), ])

##                                                          variable n_flagged
## bmi                                                           bmi        50
## heart_rate                                             heart_rate        40
## systolic_bp                                           systolic_bp        20
## age                                                           age         0
## physical_activity_hours_per_week physical_activity_hours_per_week         0
## alcohol_consumption_per_week         alcohol_consumption_per_week         0
## diet_score                                             diet_score         0
## sleep_hours_per_day                           sleep_hours_per_day         0
## screen_time_hours_per_day               screen_time_hours_per_day         0
## family_history_diabetes                   family_history_diabetes         0
## hypertension_history                         hypertension_history         0
## cardiovascular_history                     cardiovascular_history         0
## waist_to_hip_ratio                             waist_to_hip_ratio         0
## diastolic_bp                                         diastolic_bp         0
## cholesterol_total                               cholesterol_total         0
## hdl_cholesterol                                   hdl_cholesterol         0
## ldl_cholesterol                                   ldl_cholesterol         0
## triglycerides                                       triglycerides         0
## glucose_fasting                                   glucose_fasting         0
## glucose_postprandial                         glucose_postprandial         0
## insulin_level                                       insulin_level         0
## hba1c                                                       hba1c         0
## diabetes_risk_score                           diabetes_risk_score         0
## diagnosed_diabetes                             diagnosed_diabetes         0
##                                  pct_flagged
## bmi                                     0.05
## heart_rate                              0.04
## systolic_bp                             0.02
## age                                     0.00
## physical_activity_hours_per_week        0.00
## alcohol_consumption_per_week            0.00
## diet_score                              0.00
## sleep_hours_per_day                     0.00
## screen_time_hours_per_day               0.00
## family_history_diabetes                 0.00
## hypertension_history                    0.00
## cardiovascular_history                  0.00
## waist_to_hip_ratio                      0.00
## diastolic_bp                            0.00
## cholesterol_total                       0.00
## hdl_cholesterol                         0.00
## ldl_cholesterol                         0.00
## triglycerides                           0.00
## glucose_fasting                         0.00
## glucose_postprandial                    0.00
## insulin_level                           0.00
## hba1c                                   0.00
## diabetes_risk_score                     0.00
## diagnosed_diabetes                      0.00

#Replace all outliers with <NA>
raw2 <- raw
for (v in names(outlier_flags)) {
    raw2[[v]][outlier_flags[[v]]] <- NA_real_
}
sapply(names(outlier_flags), function(v) {
  sum(outlier_flags[[v]] & !is.na(raw2[[v]]))
})

##                              age physical_activity_hours_per_week 
##                                0                                0 
##     alcohol_consumption_per_week                       diet_score 
##                                0                                0 
##              sleep_hours_per_day        screen_time_hours_per_day 
##                                0                                0 
##          family_history_diabetes             hypertension_history 
##                                0                                0 
##           cardiovascular_history                              bmi 
##                                0                                0 
##               waist_to_hip_ratio                      systolic_bp 
##                                0                                0 
##                     diastolic_bp                       heart_rate 
##                                0                                0 
##                cholesterol_total                  hdl_cholesterol 
##                                0                                0 
##                  ldl_cholesterol                    triglycerides 
##                                0                                0 
##                  glucose_fasting             glucose_postprandial 
##                                0                                0 
##                    insulin_level                            hba1c 
##                                0                                0 
##              diabetes_risk_score               diagnosed_diabetes 
##                                0                                0

#Change variable name back to raw
raw<-raw2

# Remove rows where glucose_postprandial is >30 mg/dL lower than glucose_fasting
gf <- suppressWarnings(as.numeric(raw$glucose_fasting))
gp <- suppressWarnings(as.numeric(raw$glucose_postprandial))

remove_idx <- gp < (gf - 30)
raw <- raw[ !(gp < (gf - 30)), ]
sum(remove_idx)

## [1] 46

#Fill missing values
# Variables NOT allowed to impute (avoid leakage)
no_impute <- c(
  "glucose_fasting", "glucose_postprandial", "insulin_level",
  "hba1c", "diabetes_risk_score", "diagnosed_diabetes"
)

# 1. Identify numeric & categorical columns
numeric_cols <- names(raw)[sapply(raw, is.numeric)]
cat_cols     <- names(raw)[sapply(raw, is.character)]

# 2. Numeric imputation (median), excluding no-impute variables
for (col in setdiff(numeric_cols, no_impute)) {
  med <- median(raw[[col]], na.rm = TRUE)
  raw[[col]][is.na(raw[[col]])] <- med
}

# 3. Categorical imputation (mode)
get_mode <- function(x) {
  ux <- unique(na.omit(x))
  ux[which.max(tabulate(match(x, ux)))]
}

for (col in cat_cols) {
  mode_val <- get_mode(raw[[col]])
  raw[[col]][is.na(raw[[col]])] <- mode_val
}

summary(raw)

##       age           gender       ethnicity         education_level   
##  Min.   :18.00   Female:50283   Length:100154      Length:100154     
##  1st Qu.:39.00   Male  :47856   Class :character   Class :character  
##  Median :50.00   Other : 2015   Mode  :character   Mode  :character  
##  Mean   :50.12                                                       
##  3rd Qu.:61.00                                                       
##  Max.   :90.00                                                       
##                                                                      
##  income_level       employment_status  smoking_status    
##  Length:100154      Length:100154      Length:100154     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  physical_activity_hours_per_week alcohol_consumption_per_week   diet_score    
##  Min.   : 0.000                   Min.   : 0.000               Min.   : 0.000  
##  1st Qu.: 0.900                   1st Qu.: 1.000               1st Qu.: 4.800  
##  Median : 1.700                   Median : 2.000               Median : 6.000  
##  Mean   : 1.981                   Mean   : 2.004               Mean   : 5.995  
##  3rd Qu.: 2.700                   3rd Qu.: 3.000               3rd Qu.: 7.200  
##  Max.   :13.900                   Max.   :10.000               Max.   :10.000  
##                                                                                
##  sleep_hours_per_day screen_time_hours_per_day family_history_diabetes
##  Min.   : 3.000      Min.   : 0.500            Min.   :0.0000         
##  1st Qu.: 6.300      1st Qu.: 4.300            1st Qu.:0.0000         
##  Median : 7.000      Median : 6.000            Median :0.0000         
##  Mean   : 6.998      Mean   : 5.997            Mean   :0.2195         
##  3rd Qu.: 7.700      3rd Qu.: 7.700            3rd Qu.:0.0000         
##  Max.   :10.000      Max.   :16.800            Max.   :1.0000         
##                                                                       
##  hypertension_history cardiovascular_history      bmi        waist_to_hip_ratio
##  Min.   :0.0000       Min.   :0.00000        Min.   :15.00   Min.   :0.6700    
##  1st Qu.:0.0000       1st Qu.:0.00000        1st Qu.:23.20   1st Qu.:0.8200    
##  Median :0.0000       Median :0.00000        Median :25.60   Median :0.8600    
##  Mean   :0.2508       Mean   :0.07912        Mean   :25.61   Mean   :0.8561    
##  3rd Qu.:1.0000       3rd Qu.:0.00000        3rd Qu.:28.00   3rd Qu.:0.8900    
##  Max.   :1.0000       Max.   :1.00000        Max.   :39.20   Max.   :1.0600    
##                                                                                
##   systolic_bp     diastolic_bp      heart_rate     cholesterol_total
##  Min.   : 90.0   Min.   : 50.00   Min.   : 40.00   Min.   :100      
##  1st Qu.:106.0   1st Qu.: 70.00   1st Qu.: 64.00   1st Qu.:164      
##  Median :116.0   Median : 75.00   Median : 70.00   Median :186      
##  Mean   :115.8   Mean   : 75.23   Mean   : 69.63   Mean   :186      
##  3rd Qu.:125.0   3rd Qu.: 81.00   3rd Qu.: 75.00   3rd Qu.:208      
##  Max.   :179.0   Max.   :110.00   Max.   :105.00   Max.   :318      
##                                                                     
##  hdl_cholesterol ldl_cholesterol triglycerides   glucose_fasting
##  Min.   :20.00   Min.   : 50     Min.   : 30.0   Min.   : 60.0  
##  1st Qu.:47.00   1st Qu.: 78     1st Qu.: 91.0   1st Qu.:102.0  
##  Median :54.00   Median :102     Median :121.0   Median :111.0  
##  Mean   :54.04   Mean   :103     Mean   :121.5   Mean   :111.1  
##  3rd Qu.:61.00   3rd Qu.:126     3rd Qu.:151.0   3rd Qu.:120.0  
##  Max.   :98.00   Max.   :263     Max.   :344.0   Max.   :172.0  
##                                                                 
##  glucose_postprandial insulin_level        hba1c       diabetes_risk_score
##  Min.   : 70.0        Min.   : 2.000   Min.   :4.000   Min.   : 2.70      
##  1st Qu.:139.0        1st Qu.: 5.090   1st Qu.:5.970   1st Qu.:23.80      
##  Median :160.0        Median : 8.790   Median :6.520   Median :29.00      
##  Mean   :160.1        Mean   : 9.062   Mean   :6.521   Mean   :30.22      
##  3rd Qu.:181.0        3rd Qu.:12.450   3rd Qu.:7.070   3rd Qu.:35.60      
##  Max.   :287.0        Max.   :32.220   Max.   :9.800   Max.   :67.20      
##                                        NA's   :160                        
##  diagnosed_diabetes
##  Min.   :0.0000    
##  1st Qu.:0.0000    
##  Median :1.0000    
##  Mean   :0.6002    
##  3rd Qu.:1.0000    
##  Max.   :1.0000    
##

#Feature Engineering
# BMI category
raw$bmi_category <- cut(
  raw$bmi,
  breaks = c(-Inf, 18.5, 24.9, 29.9, Inf),
  labels = c("Underweight", "Normal", "Overweight", "Obese")
)

# Blood pressure category
raw$bp_category <- case_when(
  raw$systolic_bp < 120 & raw$diastolic_bp < 80 ~ "Normal",
  raw$systolic_bp %in% 120:129 & raw$diastolic_bp < 80 ~ "Elevated",
  raw$systolic_bp %in% 130:139 | raw$diastolic_bp %in% 80:89 ~ "Stage 1 HTN",
  raw$systolic_bp >= 140 | raw$diastolic_bp >= 90 ~ "Stage 2 HTN",
  TRUE ~ NA_character_
)

# Convert selected categories to factors
factor_cols <- c(
  "gender", "ethnicity", "education_level", "income_level",
  "employment_status", "smoking_status",
  "family_history_diabetes", "hypertension_history",
  "cardiovascular_history", "bmi_category",
  "bp_category", "diagnosed_diabetes"
)

for (col in factor_cols) {
  if (col %in% names(raw))
    raw[[col]] <- as.factor(raw[[col]])
}

cat("\nPreview of engineered columns:\n")

## 
## Preview of engineered columns:

print(head(raw[c("bmi", "bmi_category", "systolic_bp", "diastolic_bp", "bp_category")], 10))

## # A tibble: 10 × 5
##      bmi bmi_category systolic_bp diastolic_bp bp_category
##    <dbl> <fct>              <dbl>        <dbl> <fct>      
##  1  30.5 Obese                134           78 Stage 1 HTN
##  2  23.1 Normal               129           76 Elevated   
##  3  22.2 Normal               115           73 Normal     
##  4  26.8 Overweight           120           93 Stage 2 HTN
##  5  21.2 Normal                92           67 Normal     
##  6  26.1 Overweight            95           81 Stage 1 HTN
##  7  25.1 Overweight           129           77 Elevated   
##  8  23.9 Normal               128           83 Stage 1 HTN
##  9  24.7 Normal               103           71 Normal     
## 10  26.7 Overweight           124           81 Stage 1 HTN

clean<-raw

4. Exploratory Data Analysis(EDA) & Modelling

4.1 Exploratory Data Analysis (EDA)

# Convert obvious categorical variables to factors
clean <- clean %>%
  mutate(
    diagnosed_diabetes = as.factor(diagnosed_diabetes),         # 0/1 or yes/no
    gender = as.factor(gender),
    ethnicity = as.factor(ethnicity),
    education_level = as.factor(education_level),
    income_level = as.factor(income_level),
    employment_status = as.factor(employment_status),
    smoking_status = as.factor(smoking_status),
    family_history_diabetes = as.factor(family_history_diabetes),
    hypertension_history = as.factor(hypertension_history),
    cardiovascular_history = as.factor(cardiovascular_history),
    bmi_category = as.factor(bmi_category),
    bp_category = as.factor(bp_category)
  )

# Create age groups for categorical analysis
clean <- clean %>%
  mutate(age_group = case_when(
    age < 30 ~ "<30",
    age >= 30 & age < 45 ~ "30-44",
    age >= 45 & age < 60 ~ "45-59",
    age >= 60 ~ "60+",
    TRUE ~ NA_character_
  )) %>%
  mutate(age_group = factor(age_group, levels = c("<30", "30-44", "45-59", "60+")))

4.1.1 Scatter plot: bmi vs glucose_fasting (colored by diagnosed_diabetes)

p1_data <- clean %>% drop_na(bmi, glucose_fasting, diagnosed_diabetes)
p1 <- ggplot(p1_data, aes(x = bmi, y = glucose_fasting, color = diagnosed_diabetes)) +
  geom_point(alpha = 0.6, size = 1.8, position = position_jitter(width = 0.2, height = 0)) +
  geom_smooth(method = "lm", se = FALSE, linetype = "dashed", aes(group = diagnosed_diabetes)) +
  labs(title = "BMI vs Fasting Glucose (colored by diagnosed diabetes)",
       x = "BMI", y = "Fasting glucose (mg/dL)",
       color = "Diagnosed") +
  theme_minimal()
print(p1)

Diagnosed diabetes is associated with higher fasting glucose.
BMI does not strongly separate individuals with and without diabetes.
There is substantial overlap, indicating the need for multivariate modeling.

4.1.2 Box + jitter: HbA1c by BMI category (color by diagnosed)

p2_data <- clean %>% drop_na(hba1c, bmi_category, diagnosed_diabetes)
p2 <- ggplot(p2_data, aes(x = bmi_category, y = hba1c)) +
geom_boxplot(outlier.shape = NA, alpha = 0.6) +
geom_jitter(aes(color = diagnosed_diabetes), width = 0.18, alpha = 0.7, size = 1.6) +
labs(title = "HbA1c distribution across BMI categories",
x = "BMI category", y = "HbA1c (%)", color = "Diagnosed") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 25, hjust = 1))
print(p2)

Diabetic individuals have consistently higher HbA1c across all BMI categories.
Non-diabetic individuals cluster at lower HbA1c, showing clear separation from diagnosed cases.
BMI category does not strongly influence HbA1c, as medians are similar across groups.
Large HbA1c variability exists within diabetics, regardless of BMI (underweight to obese).
HbA1c is a stronger indicator of diabetes status than BMI, based on clear vertical separation.

4.1.3 Correlation heatmap of numeric metabolic variables

num_vars <- c("age", "bmi", "waist_to_hip_ratio", "systolic_bp", "diastolic_bp",
"heart_rate", "cholesterol_total", "hdl_cholesterol", "ldl_cholesterol",
"triglycerides", "glucose_fasting", "glucose_postprandial", "insulin_level",
"hba1c", "diabetes_risk_score")
num_df <- clean %>% select(any_of(num_vars)) %>% drop_na()
corr_mat <- cor(num_df, use = "pairwise.complete.obs")
col_pal <- colorRampPalette(brewer.pal(11, "RdYlBu"))(50)
# show on screen
corrplot(corr_mat, method = "color", col = col_pal, tl.cex = 0.8,
addCoef.col = "black", number.cex = 0.6, order = "original",
title = "Correlation heatmap of numeric metabolic", mar = c(0,0,2,0))

BMI and Waist-to-Hip Ratio show strong correlation (0.77)
Total Cholesterol and LDL are very strongly correlated (0.91)
Fasting Glucose and Postprandial Glucose correlate moderately (0.59)
Triglycerides correlate moderately with BMI (0.39)
Diabetes Risk Score correlates most with Age (0.50) and Fasting Glucose (0.48), indicating they may be the key contributors.

4.1.4 Proportion bar: Diagnosed diabetes rate by age_group and gender

p4_data <- clean %>% drop_na(age_group, gender, diagnosed_diabetes) %>%
group_by(age_group, gender) %>%
summarise(n = n(), n_diab = sum(as.numeric(as.character(diagnosed_diabetes)) == 1, na.rm = TRUE), .groups = "drop") %>%
mutate(rate = n_diab / n)
p4 <- ggplot(p4_data, aes(x = age_group, y = rate, fill = gender)) +
geom_col(position = position_dodge(width = 0.9), width = 0.75) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(title = "Proportion of diagnosed diabetes by age group and gender",
x = "Age group", y = "Diagnosed diabetes (%)", fill = "Gender") +
theme_minimal()
print(p4)

Diagnosed diabetes increases steadily with age.
The 60+ age group has the highest proportion of diabetes.
Younger adults (<30) have the lowest diabetes rates.
Gender differences are small, with female, male and “other” showing fairly similar proportions within each age group.
The “other” gender category consistently shows slightly higher proportions in every age group.

4.1.5 Violin + facet plot: Insulin by BP category, faceted by smoking status

p5_data <- clean %>% drop_na(insulin_level, bp_category, smoking_status)
p5 <- ggplot(p5_data, aes(x = bp_category, y = insulin_level)) +
geom_violin(trim = TRUE, alpha = 0.6) +
geom_jitter(width = 0.15, size = 0.7, alpha = 0.5) +
facet_wrap(~ smoking_status, scales = "free_y") +
labs(title = "Insulin level by BP category, faceted by smoking status",
x = "BP category", y = "Insulin level") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 20, hjust = 1))
print(p5)

Insulin levels show similar distributions across all BP categories.
Smoking status (Current, Former, Never) does not markedly shift insulin levels.
All groups show a wide spread of insulin values.
The shape of the violin plots is consistent across facets, indicating that insulin variability remains similar regardless of BP or smoking group.

4.1.6 Class imbalance determination

p_diag_data <- clean %>%
drop_na(diagnosed_diabetes) %>%
group_by(diagnosed_diabetes) %>%
summarise(count = n(), .groups = "drop") %>%
mutate(pct = count / sum(count))
p_diag <- ggplot(p_diag_data, aes(x = as.numeric(as.character(diagnosed_diabetes)), y = count)) +
geom_col(width = 0.5) +
geom_text(aes(label = count), vjust = -0.5, size = 3.5) +
scale_x_continuous(breaks = c(0, 1), labels = c("0", "1")) + # show 0 and 1 on x-axis
labs(
title = "Count of Diagnosed Diabetes (0 / 1)",
x = "Diagnosed Diabetes (0 = No, 1 = Yes)",
y = "Count"
) +
theme_minimal()

print(p_diag)

There is a minor class imbalance of 40045 No diabetes to 60109 Yes diabetes. The ratio of 1 to 0 is 1.5. This is a marginally small imbalance and classification modelling can proceed (Hoffman, 2021).

4.2 Model Techniques Selection

Regression:

XGBoost Regressor: fast, powerful model capturing complex non-linear patterns
Random Forest: robust ensemble reducing overfitting for stable predictions

Classification:

XGBoost classifier: non-linear modeling that captures complex patterns
Logistic regression: simple, interpretable baseline for diagnosis classification

4.3 Test Design Generation

4.3.1 Regression

Data goal: Predict continuous diabetes risk score.
Metrics: MAE, RMSE, R².
Baseline: Mean predictor (predicts average risk).
Split: train 70% / val 15% / test 15% with fixed random seed
One-hot encoding to convert categorical variables for both models
Pre-processing: Feature scaling optional; tree models (RF, XGBoost) do not require scaling
Training & tuning: Train Random Forest and XGBoost on train; tune hyperparameters on validation
Validation: select best model based on MAE, RMSE, and R²
Final test: evaluate chosen regression model on the test set
Comparison: compare both models to baseline and to each other

4.3.2 Classification

Data goal: Classify diagnosis (binary)
Metrics: accuracy, precision, recall, F1, ROC-AUC
Baseline: majority-class predictor
Split: train 70% / val 15% / test 15% with fixed random seed
One hot encoding to convert categorical variable
Pre-processing: Feature scaling for LR, not required for XGBoost
Training & tuning: Train LR and XGBoost on train; tune hyperparameters on val
Validation: Select best threshold on validation
Final test: Evaluate chosen models
Comparison: Compare to baseline and each other

4.4 Model Building

4.4.1 Regression - XGBoost

set.seed(42)

# Split 70% train / 15% val / 15% test

# First split: 85% train+val / 15% test
split1 <- initial_split(clean, prop = 0.85)
train_val <- training(split1)
test_set  <- testing(split1)

# Second split: split train_val into 70% train / 15% val
split2 <- initial_split(train_val, prop = 70/85)
train_set <- training(split2)
val_set   <- testing(split2)

cat("Rows — Train:", nrow(train_set),
    "Val:", nrow(val_set),
    "Test:", nrow(test_set), "\n\n")

## Rows — Train: 70107 Val: 15023 Test: 15024

# Recipe: imputation, unknown categories, one-hot encoding, remove zero-variance
rec <- recipe(diabetes_risk_score ~ ., data = train_set) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_impute_mode(all_nominal_predictors()) %>%
  step_unknown(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
  step_zv(all_predictors())

rec_prep <- prep(rec, training = train_set, retain = TRUE)

train_baked <- bake(rec_prep, train_set)
val_baked   <- bake(rec_prep, val_set)
test_baked  <- bake(rec_prep, test_set)

cat("Baked columns:", ncol(train_baked), "\n")

## Baked columns: 64

# Convert baked datasets to sparse matrices
s_train <- sparse.model.matrix(diabetes_risk_score ~ . -1, train_baked)
s_val   <- sparse.model.matrix(diabetes_risk_score ~ . -1, val_baked)
s_test  <- sparse.model.matrix(diabetes_risk_score ~ . -1, test_baked)

dtrain <- xgb.DMatrix(s_train, label = train_baked$diabetes_risk_score)
dval   <- xgb.DMatrix(s_val,   label = val_baked$diabetes_risk_score)
dtest  <- xgb.DMatrix(s_test,  label = test_baked$diabetes_risk_score)

# Train 3 simple candidate models (safe tuning)
param_list <- list(
  list(eta = 0.1,  max_depth = 3),
  list(eta = 0.05, max_depth = 4),
  list(eta = 0.2,  max_depth = 2)
)

models <- vector("list", length(param_list))
rmses <- rep(Inf, length(param_list))

for (i in seq_along(param_list)) {

  p <- param_list[[i]]
  cat("Training model", i, "...\n")

  m <- tryCatch({
    xgb.train(
      params = list(
        booster = "gbtree",
        objective = "reg:squarederror",
        eval_metric = "rmse",
        eta = p$eta,
        max_depth = p$max_depth
      ),
      data = dtrain,
      nrounds = 2000,
      watchlist = list(train = dtrain, eval = dval),
      early_stopping_rounds = 50,
      verbose = 0
    )
  }, error = function(e) {
    cat(" → Model", i, "FAILED:", e$message, "\n")
    NULL
  })

  models[[i]] <- m

  score <- tryCatch(m$best_score, error = function(e) NA)

  if (is.null(score) || is.na(score) || !is.finite(score)) {
    cat(" → Invalid RMSE detected. Assigning RMSE = Inf\n")
    rmses[i] <- Inf
  } else {
    rmses[i] <- score
  }

  cat(" → Validation RMSE used:", rmses[i], "\n")
}

## Training model 1 ...
##  → Validation RMSE used: 0.2061891 
## Training model 2 ...
##  → Validation RMSE used: 0.2052611 
## Training model 3 ...
##  → Validation RMSE used: 0.2217658

# Select best model
best_index <- which.min(rmses)
best_model <- models[[best_index]]

if (is.null(best_model)) {
  stop("ERROR: All XGBoost models failed — no usable model was created.")
}

cat("\nBEST MODEL INDEX:", best_index, "\n")

## 
## BEST MODEL INDEX: 2

cat("BEST VALIDATION RMSE:", rmses[best_index], "\n\n")

## BEST VALIDATION RMSE: 0.2052611

# Test-set matrices (final model performance)
test_pred <- predict(best_model, dtest)
true <- test_baked$diabetes_risk_score

mae  <- mean(abs(test_pred - true))
rmse <- sqrt(mean((test_pred - true)^2))
r2   <- yardstick::rsq_trad_vec(true, test_pred)

final_metrics <- tibble(
Metric = c("MAE", "RMSE", "R²"),
Value = c(mae, rmse, r2)
)

final_metrics %>% knitr::kable()

Metric	Value
MAE	0.1567964
RMSE	0.2031207
R²	0.9995074

# Baseline (mean predictor)
mean_pred <- mean(train_set$diabetes_risk_score)

baseline_mae  <- mean(abs(mean_pred - true))
baseline_rmse <- sqrt(mean((mean_pred - true)^2))
baseline_r2   <- yardstick::rsq_trad_vec(true, rep(mean_pred, length(true)))

baseline_metrics <- tibble(
  Metric = c("MAE", "RMSE", "R²"),
  Value  = c(baseline_mae, baseline_rmse, baseline_r2)
) 

baseline_metrics%>% knitr::kable()

Metric	Value
MAE	7.279057
RMSE	9.152769
R²	-0.000159

#SHAP analysis (feature importance)
pred_data <- s_test

shap_contrib <- predict(best_model, pred_data, predcontrib = TRUE)

shap_dt <- as.data.table(shap_contrib)

bias_col <- names(shap_dt)[ncol(shap_dt)]
feature_cols <- setdiff(names(shap_dt), bias_col)

mean_abs_shap <- shap_dt[, lapply(.SD, function(x) mean(abs(x))), .SDcols = feature_cols]

mean_abs_shap_tidy <- melt(
  mean_abs_shap,
  variable.name = "feature",
  value.name = "mean_abs_shap"
)

top10 <- mean_abs_shap_tidy %>%
  arrange(desc(mean_abs_shap)) %>%
  slice_head(n = 10) %>%
  mutate(feature = fct_reorder(feature, mean_abs_shap))

ggplot(top10, aes(x = feature, y = mean_abs_shap)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Top 10 Features by Mean |SHAP| (XGBoost Regression)",
    x = NULL,
    y = "Mean Absolute SHAP Value"
  ) +
  theme_minimal(base_size = 13)

4.4.2 Regression - Random Forest

set.seed(42)

# 70/15/15 split (not stratified because regression)
split1 <- initial_split(clean, prop = 0.85)
train_val <- training(split1)
test_set  <- testing(split1)

split2 <- initial_split(train_val, prop = 70/85)
train_set <- training(split2)
val_set   <- testing(split2)

cat("Train:", nrow(train_set),
    " | Val:", nrow(val_set),
    " | Test:", nrow(test_set), "\n")

## Train: 70107  | Val: 15023  | Test: 15024

# Recipe — impute, one-hot encode, remove zv

rec <- recipe(diabetes_risk_score ~ ., data = train_set) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_impute_mode(all_nominal_predictors()) %>%
  step_unknown(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
  step_zv(all_predictors())

rec_prep <- prep(rec, training = train_set, retain = TRUE)

# Bake train / validation / test

train_baked <- bake(rec_prep, train_set)
val_baked   <- bake(rec_prep, val_set)
test_baked  <- bake(rec_prep, test_set)

y_train <- train_baked$diabetes_risk_score
y_val   <- val_baked$diabetes_risk_score
y_test  <- test_baked$diabetes_risk_score

X_train <- train_baked %>% select(-diabetes_risk_score)
X_val   <- val_baked %>% select(-diabetes_risk_score)
X_test  <- test_baked %>% select(-diabetes_risk_score)

# Sanity check
cat("Missing values in predictors (train/val/test):",
    sum(!complete.cases(X_train)), "/",
    sum(!complete.cases(X_val)), "/",
    sum(!complete.cases(X_test)), "\n")

## Missing values in predictors (train/val/test): 0 / 0 / 0

# RF requires X (predictors) and y (target)
X_train <- train_baked %>% select(-diabetes_risk_score)
y_train <- train_baked$diabetes_risk_score

X_val <- val_baked %>% select(-diabetes_risk_score)
y_val <- val_baked$diabetes_risk_score

# Train a reasonable RF model (500 trees)
rf_model <- ranger(
  x = X_train,
  y = y_train,
  num.trees = 500,
  mtry = floor(sqrt(ncol(X_train))),
  importance = "permutation",
  write.forest = TRUE, #required for shap
  seed = 42
)

## Growing trees.. Progress: 80%. Estimated remaining time: 7 seconds.
## Computing permutation importance.. Progress: 26%. Estimated remaining time: 1 minute, 27 seconds.
## Computing permutation importance.. Progress: 43%. Estimated remaining time: 1 minute, 24 seconds.
## Computing permutation importance.. Progress: 52%. Estimated remaining time: 1 minute, 28 seconds.
## Computing permutation importance.. Progress: 69%. Estimated remaining time: 56 seconds.
## Computing permutation importance.. Progress: 79%. Estimated remaining time: 42 seconds.
## Computing permutation importance.. Progress: 90%. Estimated remaining time: 21 seconds.

rf_model

## Ranger result
## 
## Call:
##  ranger(x = X_train, y = y_train, num.trees = 500, mtry = floor(sqrt(ncol(X_train))),      importance = "permutation", write.forest = TRUE, seed = 42) 
## 
## Type:                             Regression 
## Number of trees:                  500 
## Sample size:                      70107 
## Number of independent variables:  63 
## Mtry:                             7 
## Target node size:                 5 
## Variable importance mode:         permutation 
## Splitrule:                        variance 
## OOB prediction error (MSE):       2.837341 
## R squared (OOB):                  0.9653071

# Compute metrics 

test_pred <- predict(rf_model, data = test_baked)$predictions

test_mae  <- mean(abs(test_pred - y_test))
test_rmse <- sqrt(mean((test_pred - y_test)^2))
test_r2   <- yardstick::rsq_trad_vec(y_test, test_pred)

tibble(
  Metric = c("MAE", "RMSE", "R²"),
  Value = c(test_mae, test_rmse, test_r2)
) %>% knitr::kable()

Metric	Value
MAE	1.2955275
RMSE	1.6768921
R²	0.9664282

# Baseline predictor: always predicts mean of training target
mean_pred_rf <- mean(y_train)

baseline_mae_rf  <- mean(abs(mean_pred_rf - y_test))
baseline_rmse_rf <- sqrt(mean((mean_pred_rf - y_test)^2))
baseline_r2_rf   <- yardstick::rsq_trad_vec(y_test, rep(mean_pred_rf, length(y_test)))

baseline_rf_tbl <- tibble(
  Metric = c("MAE", "RMSE", "R²"),
  Value  = c(baseline_mae_rf, baseline_rmse_rf, baseline_r2_rf)
)

baseline_rf_tbl %>% knitr::kable()

Metric	Value
MAE	7.279057
RMSE	9.152769
R²	-0.000159

# Prediction wrapper for ranger
rf_pred <- function(object, newdata) {
  predict(object, data = newdata)$predictions
}

# Use a small test sample to keep SHAP fast
set.seed(42)
X_sample <- X_test %>% slice_sample(n = 200)

# Compute SHAP values (nsim = 10 for speed)
shap_mat <- fastshap::explain(
  rf_model,
  X = as.data.frame(X_train), 
  newdata = as.data.frame(X_sample),
  pred_wrapper = rf_pred,
  nsim = 10
)

# Convert SHAP matrix → data.frame
shap_df <- as.data.frame(shap_mat)

# Mean absolute SHAP per feature
shap_summary <- shap_df %>%
  summarise(across(everything(), ~ mean(abs(.x)))) %>%
  tidyr::pivot_longer(everything(),
                      names_to = "feature",
                      values_to = "mean_abs_shap") %>%
  arrange(desc(mean_abs_shap))

# Top 10 features
top10 <- shap_summary %>% slice_head(n = 10)

# Plot
ggplot(top10, aes(x = fct_reorder(feature, mean_abs_shap),
                  y = mean_abs_shap)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Top 10 SHAP-like Feature Importance — Random Forest",
    x = NULL,
    y = "Mean |SHAP| Contribution"
  ) +
  theme_minimal(base_size = 13)

4.4.3 Classification - Logistic Regression

# Data preparation
clean <- clean %>%
  dplyr::select(-diabetes_risk_score) %>%
  dplyr::mutate(
    diagnosed_diabetes = factor(diagnosed_diabetes, levels = c(0, 1))
  )

# Train / Validation / Test split (70 / 15 / 15)

set.seed(123)

idx_train <- caret::createDataPartition(
  clean$diagnosed_diabetes,
  p = 0.7,
  list = FALSE
)

train_data <- clean[idx_train, ]
temp_data  <- clean[-idx_train, ]

idx_val <- caret::createDataPartition(
  temp_data$diagnosed_diabetes,
  p = 0.5,
  list = FALSE
)

val_data  <- temp_data[idx_val, ]
test_data <- temp_data[-idx_val, ]


# Baseline model (majority class)

majority_class <- names(which.max(table(train_data$diagnosed_diabetes)))

baseline_preds <- factor(
  rep(majority_class, nrow(test_data)),
  levels = c(0, 1)
)

baseline_cm <- caret::confusionMatrix(
  baseline_preds,
  test_data$diagnosed_diabetes,
  positive = "1"
)

baseline_metrics <- data.frame(
  Accuracy  = round(baseline_cm$overall["Accuracy"], 4),
  Precision = round(baseline_cm$byClass["Precision"], 4),
  Recall    = round(baseline_cm$byClass["Recall"], 4),
  F1_Score  = round(baseline_cm$byClass["F1"], 4)
)

print("Baseline metrics:")

## [1] "Baseline metrics:"

print(baseline_metrics)

##          Accuracy Precision Recall F1_Score
## Accuracy   0.6002    0.6002      1   0.7501

# One-hot encoding

dummy <- caret::dummyVars(
  diagnosed_diabetes ~ .,
  data = train_data,
  fullRank = TRUE
)

X_train <- predict(dummy, train_data)
X_val   <- predict(dummy, val_data)
X_test  <- predict(dummy, test_data)

y_train <- train_data$diagnosed_diabetes
y_val   <- val_data$diagnosed_diabetes
y_test  <- test_data$diagnosed_diabetes


# Imputation (train only)

imputer <- caret::preProcess(X_train, method = "medianImpute")

X_train <- predict(imputer, X_train)
X_val   <- predict(imputer, X_val)
X_test  <- predict(imputer, X_test)


# Feature scaling

scaler <- caret::preProcess(X_train, method = c("center", "scale"))

X_train <- predict(scaler, X_train)
X_val   <- predict(scaler, X_val)
X_test  <- predict(scaler, X_test)

# Ensure identical columns
train_cols <- colnames(X_train)
X_val  <- X_val[, train_cols, drop = FALSE]
X_test <- X_test[, train_cols, drop = FALSE]

stopifnot(
  identical(colnames(X_train), colnames(X_val)),
  identical(colnames(X_train), colnames(X_test))
)


# Logistic Regression (LASSO)

set.seed(123)

cv_model <- glmnet::cv.glmnet(
  x = X_train,
  y = y_train,
  family = "binomial",
  alpha = 1
)

best_lambda <- cv_model$lambda.min


# Threshold selection (validation)

val_probs <- predict(
  cv_model,
  newx = X_val,
  s = best_lambda,
  type = "response"
)

thresholds <- seq(0.1, 0.9, by = 0.01)

f1_scores <- sapply(thresholds, function(t) {
  preds <- factor(ifelse(val_probs >= t, 1, 0), levels = c(0, 1))
  cm <- caret::confusionMatrix(preds, y_val, positive = "1")
  cm$byClass["F1"]
})

best_threshold <- thresholds[which.max(f1_scores)]
print(paste("Best threshold:", best_threshold))

## [1] "Best threshold: 0.64"

# Final test evaluation

test_probs <- predict(
  cv_model,
  newx = X_test,
  s = best_lambda,
  type = "response"
)

test_preds <- factor(
  ifelse(test_probs >= best_threshold, 1, 0),
  levels = c(0, 1)
)

test_cm <- caret::confusionMatrix(
  test_preds,
  y_test,
  positive = "1"
)

cat("\nLogistic Regression Confusion Matrix:\n")

## 
## Logistic Regression Confusion Matrix:

print(test_cm$table)

##           Reference
## Prediction    0    1
##          0 5666 1244
##          1  340 7772

final_metrics <- data.frame(
  Accuracy  = round(test_cm$overall["Accuracy"], 4),
  Precision = round(test_cm$byClass["Precision"], 4),
  Recall    = round(test_cm$byClass["Recall"], 4),
  F1_Score  = round(test_cm$byClass["F1"], 4)
)

print("Logistic Regression metrics:")

## [1] "Logistic Regression metrics:"

print(final_metrics)

##          Accuracy Precision Recall F1_Score
## Accuracy   0.8946    0.9581  0.862   0.9075

# ROC-AUC

roc_obj <- pROC::roc(
  response  = y_test,
  predictor = as.numeric(test_probs),
  levels    = c("0", "1"),
  direction = "<"
)

print(paste("ROC-AUC:", round(pROC::auc(roc_obj), 4)))

## [1] "ROC-AUC: 0.934"

# SHAP: Top 10 Mean |SHAP| (Global)

# Predictor object (glmnet-compatible)
X_train_df <- as.data.frame(X_train)
train_cols <- colnames(X_train)

predictor <- iml::Predictor$new(
  model = cv_model,
  data  = X_train_df,
  y     = y_train,
  predict.function = function(model, newdata) {
    
    # Add missing columns
    missing_cols <- setdiff(train_cols, colnames(newdata))
    if (length(missing_cols) > 0) newdata[missing_cols] <- 0
    
    # Remove extra columns
    extra_cols <- setdiff(colnames(newdata), train_cols)
    if (length(extra_cols) > 0)
      newdata <- newdata[, !(colnames(newdata) %in% extra_cols), drop = FALSE]
    
    # Reorder and convert to matrix
    newdata <- as.matrix(newdata[, train_cols, drop = FALSE])
    
    predict(
      model,
      newx = newdata,
      s = best_lambda,
      type = "response"
    )
  }
)

# Compute SHAP-based feature importance (mean |SHAP|)
shap_imp <- iml::FeatureImp$new(
  predictor,
  loss = "ce"
)

# Extract results and select top 10
top10_shap <- shap_imp$results %>%
  arrange(desc(importance)) %>%
  slice_head(n = 10) %>%
  mutate(feature = reorder(feature, importance))

# Horizontal bar plot
ggplot(top10_shap, aes(x = feature, y = importance)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Top 10 Features by Mean |SHAP| Value",
    x = NULL,
    y = "Mean Absolute SHAP Value"
  ) +
  theme_minimal(base_size = 13)

4.4.4 Classification - XGBoost Classifier

set.seed(42) 

# Splits (stratified 70/15/15)
split_1 <- initial_split(clean, prop = 0.85, strata = diagnosed_diabetes)
train_val <- training(split_1)
test_set <- testing(split_1)

prop_train_in_trainval <- 70/85
split_2 <- initial_split(train_val, prop = prop_train_in_trainval, strata = diagnosed_diabetes)
train_set <- training(split_2)
val_set <- testing(split_2)

cat("Rows - train:", nrow(train_set), " val:", nrow(val_set), " test:", nrow(test_set), "\n")

## Rows - train: 70106  val: 15024  test: 15024

prop.table(table(train_set$diagnosed_diabetes))

## 
##         0         1 
## 0.3998374 0.6001626

prop.table(table(val_set$diagnosed_diabetes))

## 
##         0         1 
## 0.3998269 0.6001731

prop.table(table(test_set$diagnosed_diabetes))

## 
##         0         1 
## 0.3998269 0.6001731

# Recipe: impute, one-hot encode, remove zero-variance
rec <- recipe(diagnosed_diabetes ~ ., data = train_set) %>% step_impute_median(all_numeric_predictors()) %>%
step_impute_mode(all_nominal_predictors()) %>% 
step_unknown(all_nominal_predictors()) %>% 
step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
step_zv(all_predictors())
rec_prep <- prep(rec, training = train_set, retain = TRUE)

# Bake train/val/test and build X/y
train_baked <- bake(rec_prep, new_data = train_set)
val_baked <- bake(rec_prep, new_data = val_set)
test_baked <- bake(rec_prep, new_data = test_set)
# Create X and y
y_train <- train_baked$diagnosed_diabetes
y_val <- val_baked$diagnosed_diabetes
y_test <- test_baked$diagnosed_diabetes

X_train <- train_baked %>% dplyr::select(-diagnosed_diabetes)
X_val <- val_baked %>% dplyr::select(-diagnosed_diabetes)
X_test <- test_baked %>% dplyr::select(-diagnosed_diabetes)

#Quick sanity check (should be zero after imputation)
cat("NA counts (train):", sum(!complete.cases(X_train)),
" (val):", sum(!complete.cases(X_val)),
" (test):", sum(!complete.cases(X_test)), "\n")

## NA counts (train): 0  (val): 0  (test): 0

# Convert to sparse matrices and DMatrices (with checks)
# Coerce labels to numeric 0/1
y_train_vec <- as.numeric(as.character(y_train))
y_val_vec <- as.numeric(as.character(y_val))
y_test_vec <- as.numeric(as.character(y_test))

# Create sparse model matrices
sparse_train <- sparse.model.matrix(~ . -1, data = X_train)
sparse_val <- sparse.model.matrix(~ . -1, data = X_val)
sparse_test <- sparse.model.matrix(~ . -1, data = X_test)

# Diagnostics
cat("Rows sparse_train:", nrow(sparse_train), " Labels:", length(y_train_vec), "\n")

## Rows sparse_train: 70106  Labels: 70106

cat("Rows sparse_val: ", nrow(sparse_val), " Labels:", length(y_val_vec), "\n")

## Rows sparse_val:  15024  Labels: 15024

cat("Rows sparse_test: ", nrow(sparse_test), " Labels:", length(y_test_vec), "\n")

## Rows sparse_test:  15024  Labels: 15024

# Final checks and DMatrix creation
if (nrow(sparse_train) != length(y_train_vec)) stop("Mismatch train rows vs labels")
if (nrow(sparse_val) != length(y_val_vec)) stop("Mismatch val rows vs labels")
if (nrow(sparse_test) != length(y_test_vec)) stop("Mismatch test rows vs labels")

dtrain <- xgb.DMatrix(data = sparse_train, label = y_train_vec)
dval <- xgb.DMatrix(data = sparse_val, label = y_val_vec)
dtest <- xgb.DMatrix(data = sparse_test, label = y_test_vec)


# Baseline: Majority Class

# Ensure outcome is factor with correct levels
train_set <- train_set %>%
  mutate(diagnosed_diabetes = factor(diagnosed_diabetes, levels = c(0, 1)))

test_set <- test_set %>%
  mutate(diagnosed_diabetes = factor(diagnosed_diabetes, levels = c(0, 1)))

# Majority class from TRAIN set
majority_class <- names(which.max(table(train_set$diagnosed_diabetes)))
cat("Majority class (train):", majority_class, "\n")

## Majority class (train): 1

# Baseline predictions on TEST set
baseline_results <- tibble(
  truth = test_set$diagnosed_diabetes,
  pred  = factor(rep(majority_class, nrow(test_set)), levels = c(0, 1))
)

# Compute metrics (positive class = "1")
baseline_metrics_xgb <- tibble(
  metric = c("accuracy", "precision (pos=1)", "recall (pos=1)", "F1 (pos=1)"),
  value  = c(
    accuracy_vec(baseline_results$truth, baseline_results$pred),
    precision_vec(baseline_results$truth, baseline_results$pred, estimator = "binary",event_level = "second"),
    recall_vec(baseline_results$truth, baseline_results$pred, estimator = "binary",event_level = "second"),
    f_meas_vec(baseline_results$truth, baseline_results$pred, estimator = "binary",event_level = "second")
  )
)

print(baseline_metrics_xgb)

## # A tibble: 4 × 2
##   metric            value
##   <chr>             <dbl>
## 1 accuracy          0.600
## 2 precision (pos=1) 0.600
## 3 recall (pos=1)    1    
## 4 F1 (pos=1)        0.750

# Hyperparameter grid
param_grid <- expand.grid(
eta = c(0.01, 0.1),
max_depth = c(3, 6),
subsample = c(0.8, 1),
colsample_bytree = c(0.8, 1),
min_child_weight = c(1, 5)
)
param_grid <- as_tibble(param_grid)
param_grid

## # A tibble: 32 × 5
##      eta max_depth subsample colsample_bytree min_child_weight
##    <dbl>     <dbl>     <dbl>            <dbl>            <dbl>
##  1  0.01         3       0.8              0.8                1
##  2  0.1          3       0.8              0.8                1
##  3  0.01         6       0.8              0.8                1
##  4  0.1          6       0.8              0.8                1
##  5  0.01         3       1                0.8                1
##  6  0.1          3       1                0.8                1
##  7  0.01         6       1                0.8                1
##  8  0.1          6       1                0.8                1
##  9  0.01         3       0.8              1                  1
## 10  0.1          3       0.8              1                  1
## # ℹ 22 more rows

# Grid search with early stopping (validate on val, monitor AUC)
best_auc <- 0
best_model <- NULL
best_params <- NULL
best_nrounds <- NULL

for(i in seq_len(nrow(param_grid))) {
params <- list(
  booster = "gbtree",
  objective = "binary:logistic",    
  eval_metric = "auc",
  eta = param_grid$eta[i],
  max_depth = param_grid$max_depth[i],
  subsample = param_grid$subsample[i],
  colsample_bytree = param_grid$colsample_bytree[i],
  min_child_weight = param_grid$min_child_weight[i],
  scale_pos_weight = sum(y_train_vec == 0) / sum(y_train_vec == 1)
)


watchlist <- list(train = dtrain, eval = dval)

xgb_mod <- xgb.train(
params = params,
data = dtrain,
nrounds = 5000,
watchlist = watchlist,
early_stopping_rounds = 50,
verbose = 0
)

val_auc <- xgb_mod$best_score
cat("Tried params:", i, " val_auc=", val_auc, "\n")
if (!is.null(val_auc) && val_auc > best_auc) {
best_auc <- val_auc
best_model <- xgb_mod
best_params <- params
best_nrounds <- xgb_mod$best_iteration
}
}

## Tried params: 1  val_auc= 0.9431315 
## Tried params: 2  val_auc= 0.9464957 
## Tried params: 3  val_auc= 0.9447166 
## Tried params: 4  val_auc= 0.9458329 
## Tried params: 5  val_auc= 0.9427986 
## Tried params: 6  val_auc= 0.9459068 
## Tried params: 7  val_auc= 0.9434923 
## Tried params: 8  val_auc= 0.9462354 
## Tried params: 9  val_auc= 0.9466534 
## Tried params: 10  val_auc= 0.9466191 
## Tried params: 11  val_auc= 0.9462675 
## Tried params: 12  val_auc= 0.9457532 
## Tried params: 13  val_auc= 0.9425437 
## Tried params: 14  val_auc= 0.9463158 
## Tried params: 15  val_auc= 0.9451053 
## Tried params: 16  val_auc= 0.945606 
## Tried params: 17  val_auc= 0.9443847 
## Tried params: 18  val_auc= 0.9462617 
## Tried params: 19  val_auc= 0.9447544 
## Tried params: 20  val_auc= 0.9461457 
## Tried params: 21  val_auc= 0.946011 
## Tried params: 22  val_auc= 0.9460361 
## Tried params: 23  val_auc= 0.9450045 
## Tried params: 24  val_auc= 0.94582 
## Tried params: 25  val_auc= 0.9465466 
## Tried params: 26  val_auc= 0.9467387 
## Tried params: 27  val_auc= 0.9464008 
## Tried params: 28  val_auc= 0.9462841 
## Tried params: 29  val_auc= 0.9425437 
## Tried params: 30  val_auc= 0.9462185 
## Tried params: 31  val_auc= 0.9452762 
## Tried params: 32  val_auc= 0.9458197

cat("Best validation AUC:", best_auc, "\n")

## Best validation AUC: 0.9467387

best_params

## $booster
## [1] "gbtree"
## 
## $objective
## [1] "binary:logistic"
## 
## $eval_metric
## [1] "auc"
## 
## $eta
## [1] 0.1
## 
## $max_depth
## [1] 3
## 
## $subsample
## [1] 0.8
## 
## $colsample_bytree
## [1] 1
## 
## $min_child_weight
## [1] 5
## 
## $scale_pos_weight
## [1] 0.6662151

best_nrounds

## [1] 84

# Choose threshold on validation (maximize F1)
# Predict on validation
val_proba <- predict(best_model, dval)
thresholds <- seq(0.01, 0.99, by = 0.01)
f1s <- map_dbl(thresholds, ~ {
preds <- factor(ifelse(val_proba >= .x, 1, 0), levels = c(0,1))
f_meas_vec(factor(y_val_vec, levels = c(0,1)), preds, estimator = "binary",event_level = "second")
})
best_idx <- which.max(f1s)
best_threshold <- thresholds[best_idx]
cat("Best threshold (by F1 on val):", best_threshold, " F1=", f1s[best_idx], "\n")

## Best threshold (by F1 on val): 0.28  F1= 0.9285036

# Defensive final evaluation & recovery chunk

recreate_results <- function() {
  # If results already exists, make sure it's sane
  if (exists("results")) {
    if (is.data.frame(results) && all(c("truth","prob","pred") %in% names(results))) {
      message("Using existing `results` object.")
      return(results)
    } else {
      warning("`results` exists but doesn't have required columns (truth, prob, pred). Will try to recreate.")
    }
  }

  # Need y_test_vec (numeric 0/1) or y_test (factor)
  if (!exists("y_test_vec")) stop("y_test_vec not found: can't recreate results without test labels.")
  if (!is.numeric(y_test_vec)) {
    y_test_vec <- as.numeric(as.character(y_test_vec))
  }
  n_test <- length(y_test_vec)

  # Build pred_proba from best_model (in memory) or saved file
  pred_proba <- NULL
  # 1) if already exists, validate length
  if (exists("pred_proba")) {
    if (length(pred_proba) != n_test) {
      warning("pred_proba exists but length doesn't match y_test_vec. Ignoring pred_proba and recomputing.")
      pred_proba <- NULL
    } else {
      message("Using existing `pred_proba` object.")
    }
  }

  # 2) try to predict with best_model
  if (is.null(pred_proba)) {
    if (exists("best_model") && inherits(best_model, "xgb.Booster")) {
      if (!exists("dtest")) stop("dtest DMatrix not found. Cannot predict with best_model without dtest.")
      message("Predicting with in-memory best_model...")
      pred_proba <- predict(best_model, dtest)
    } else {
      # try to load saved model file
      model_file <- "xgb_best.model"
      if (file.exists(model_file)) {
        message("Loading saved model from ", model_file)
        try({
          best_model_loaded <- xgb.load(model_file)
          if (!exists("dtest")) stop("dtest DMatrix not found. Cannot predict with loaded model without dtest.")
          pred_proba <- predict(best_model_loaded, dtest)
        }, silent = TRUE)
      } else {
        stop("No in-memory best_model and no saved model file 'xgb_best.model'. Cannot create predictions.")
      }
    }
  }

  # Validate pred_proba length
  if (is.null(pred_proba) || length(pred_proba) != n_test) {
    stop("pred_proba either NULL or length mismatch after attempts to compute it.")
  }

  # Determine threshold: prefer best_threshold; else if val_proba and y_val_vec exist compute F1; else default 0.5
  threshold_to_use <- 0.5
  if (exists("best_threshold")) {
    threshold_to_use <- best_threshold
    message("Using existing best_threshold = ", threshold_to_use)
  } else if (exists("val_proba") && exists("y_val_vec")) {
    # compute threshold maximizing F1 on validation
    message("Computing best threshold on validation set (max F1)...")
    thr_seq <- seq(0.01, 0.99, by = 0.01)
    f1s <- sapply(thr_seq, function(t) {
      preds_val <- factor(ifelse(val_proba >= t, 1, 0), levels = c(0,1))
      yardstick::f_meas_vec(factor(y_val_vec, levels = c(0,1)), preds_val, estimator = "binary",event_level = "second")
    })
    threshold_to_use <- thr_seq[which.max(f1s)]
    message("Computed threshold_from_val = ", threshold_to_use, " (max F1 on val = ", max(f1s, na.rm = TRUE), ")")
  } else {
    message("No best_threshold or validation probs found; defaulting threshold = 0.5")
  }

  # Build results tibble
  results_new <- tibble(
    truth = factor(y_test_vec, levels = c(0,1)),
    prob  = as.numeric(pred_proba),
    pred  = factor(ifelse(as.numeric(pred_proba) >= threshold_to_use, 1, 0), levels = c(0,1))
  )

  message("Recreated `results` (n = ", nrow(results_new), "). Threshold used = ", threshold_to_use)
  return(results_new)
}

# Run recovery & compute final metrics
results <- tryCatch(recreate_results(), error = function(e) { stop("Failed to recreate results: ", e$message) })

# Defensive type enforcement
results <- results %>%
  mutate(
    prob  = as.numeric(prob),
    pred  = factor(ifelse(prob >= best_threshold, 1, 0), levels = c(0, 1)),
    truth = factor(truth, levels = c(0, 1))
  )


# Compute classification metrics (class decision)
acc  <- accuracy_vec(results$truth, results$pred)
prec <- precision_vec(results$truth, results$pred, estimator = "binary",event_level = "second")
rec  <- recall_vec(results$truth, results$pred, estimator = "binary",event_level = "second")
f1   <- f_meas_vec(results$truth, results$pred, estimator = "binary",event_level = "second")

# Compute AUC (pROC preferred)
auc_p <- tryCatch({
  roc_obj <- pROC::roc(
    response = as.numeric(as.character(results$truth)),
    predictor = results$prob,
    quiet = TRUE
  )
  as.numeric(pROC::auc(roc_obj))
}, error = function(e) NA_real_)
final_tbl <- tibble(
  metric = c("accuracy","precision (pos=1)","recall (pos=1)","F1 (pos=1)","ROC-AUC"),
  value  = c(acc, prec, rec, f1, auc_p)
)

print(final_tbl)

## # A tibble: 5 × 2
##   metric            value
##   <chr>             <dbl>
## 1 accuracy          0.920
## 2 precision (pos=1) 0.999
## 3 recall (pos=1)    0.868
## 4 F1 (pos=1)        0.929
## 5 ROC-AUC           0.946

# Confusion matrix (rows = Truth, cols = Pred)
print(table(Truth = results$truth, Pred = results$pred))

##      Pred
## Truth    0    1
##     0 6000    7
##     1 1194 7823

# Prepare data for SHAP computation
if (!inherits(sparse_test, c("matrix","dgCMatrix"))) {
    pred_data <- sparse.model.matrix(~ . -1, data = X_test)
} else {
    pred_data <- sparse_test
}

# Compute SHAP values (feature contributions)
shap_contrib <- predict(best_model, pred_data, predcontrib = TRUE)

# Convert to data.table
shap_dt <- as.data.table(shap_contrib)

# Identify bias column (last column usually)
bias_col <- names(shap_dt)[ncol(shap_dt)]
feature_cols <- setdiff(names(shap_dt), bias_col)

# Compute mean absolute SHAP for each feature
mean_abs_shap <- shap_dt[, lapply(.SD, function(x) mean(abs(x))), .SDcols = feature_cols]

# Long format
mean_abs_shap_tidy <- melt(
    mean_abs_shap,
    variable.name = "feature",
    value.name = "mean_abs_shap"
)

# Top 10 features
top10 <- mean_abs_shap_tidy %>%
    arrange(desc(mean_abs_shap)) %>%
    slice_head(n = 10) %>%
    mutate(feature = fct_reorder(feature, mean_abs_shap))

# Bar Plot
p_bar <- ggplot(top10, aes(x = feature, y = mean_abs_shap)) +
    geom_col(fill = "steelblue") +
    coord_flip() +
    labs(
        title = "Top 10 Features by Mean |SHAP|",
        x = NULL,
        y = "Mean Absolute SHAP Value"
    ) +
    theme_minimal(base_size = 13)

# Display the plot
print(p_bar)

4.5 Model Assessment

4.5.1 Regression

Metrics: MAE, RMSE, and R²
Baseline: mean-based predictor

Metric	XGBoost	Random Forest	Baseline (Mean Predictor)
MAE	0.1568	1.2955	7.2791
RMSE	0.2031	1.6769	9.1528
R²	0.9995	0.9664	-0.0002

Both regression models outperform the mean predictor baseline.
XGBoost outperforms Random Forest in accuracy and having lower error.

4.5.2 Classification

Metrics: accuracy, precision, recall, F1-score, and ROC-AUC
Baseline: majority-class predictor

Metric	Baseline	Logistic Regression	XGBoost
Accuracy	0.6002	0.8946	0.920
Precision	0.6002	0.9581	0.999
Recall	1.0000	0.8620	0.868
F1-score	0.7501	0.9075	0.929
ROC-AUC	NA	0.934	0.946

Both classification models outperform the baseline.
XGBoost achieves the best overall results and is the preferred model for this task.

5. Evaluation

5.1 Results Evaluation & Interpretation

Problem 1: Diabetes Risk Score Prediction (XGBoost Regressor)

The XGBoost regressor achieved the lowest RMSE and MAE, and the highest R² among regression models tested.
Predicted risk scores closely followed observed values, indicating good model fit.
SHAP analysis revealed family history, age and physical activity hours per week as the top three predictors.
Insights: Risk scores increase with age and family history, while higher physical activity is associated with lower predicted risk.Lifestyle-related variables play a stronger role in continuous risk estimation than in binary diagnosis.
Suggestion: Use the risk score to stratify individuals into low, medium, and high-risk groups for targeted prevention.

Problem 2: Diabetes Diagnostic Classification (XGBoost Classifier)

The XGBoost classifier achieved the best performance based on highest accuracy, precision, recall, F1-score and ROC AUC.
High recall indicates effective identification of diabetic patients, reducing false negatives.
Confusion matrix analysis revealed that most classification errors were false negatives rather than false positives, highlighting areas for further sensitivity improvement while maintaining acceptable screening performance.
SHAP analysis identified HbA1c, glucose level and family history as the top three influential predictors.
Insight: Diagnostic predictions are primarily driven by clinical biomarkers (HbA1c and glucose), while family history provides additional risk context that supports early detection.
Suggestion: The classifier is best suited as a clinical decision-support or screening tool and should complement, rather than replace, professional medical diagnosis.

5.2 Review Processes

Business objectives, data mining goals and success criteria were fulfilled and achieved.

Future Improvements:

Incorporate larger and more diverse datasets to improve model robustness.
Include additional clinical variables such as longitudinal lab results to enhance predictive performance.
Apply real-world testing in clinical settings.
Explore model deployment and integration into healthcare decision-support systems.

6. Conclusion

All data mining goals were achieved, and the key questions on diabetes risk prediction, diagnosis accuracy, and feature importance were answered, fulfilling the success criteria.
For risk score prediction, the XGBoost Regressor achieved MAE = 0.1568, RMSE = 0.2031, and R² = 0.9995, outperforming Random Forest and baseline models.
For diagnostic classification, the XGBoost Classifier achieved the best performance with accuracy = 0.920, precision=0.999, recall = 0.868, F1 score = 0.929, and ROC-AUC = 0.946. It outperformed Logistic Regression.
SHAP analysis showed that family history drives risk score prediction, while HbA1c dominates diagnostic classification.
Overall, the study highlights the importance of explainable XGBoost models in targeted prevention and informed clinical decision making for diabetes management.

7. References

Hoffman,K (2021). Machine learning: How to handle class imbalance. Medium. https://medium.com/analytics-vidhya/machine-learning-how-to-handle-class-imbalance-920e48c3e970

International Diabetes Federation (2025). Diabetes facts and figures.https://idf.org/about-diabetes/diabetes-facts-figures/

Thalla M. K. (2025). Diabetes Health Indicators Dataset [Data set]. Kaggle. https://www.kaggle.com/datasets/mohankrishnathalla/diabetes-health-indicators-dataset?