This project follows CRISP-DM methodology: Business Understanding, Data Understanding, Data Preparation, EDA & Modelling and Evaluation. Deployment is not considered in this project.
1. Business Understanding
• According to International Diabetes Federation (2025),
approximately 589 million adults (20-79 years) are living with diabetes
worldwide as of 2024 (7.3% of total world population).
• Early risk prediction of diabetes is crucial to individuals
(especially high-risk population) to take preventive action sooner and
prevent long-term health complications.Besides, accurate classification
model is needed in medical field.
• At the national or regional level, this benefits healthcare
providers, policymakers and communities by improving resource allocation
and supporting targeted public health interventions.
• However, there is limited risk scoring tool in global context
taking ethnicity, family history, diet, lifestyle, basic health
indicators and socioeconomic status factors into consideration. There is
also limited diagnostic classification tool that can accurately classify
diagnosis.
1.1 Determine Business Objectives
Primary business objective:
• To develop an early diabetes risk prediction framework that
incorporates multidimensional factors to support early intervention and
resource planning.
• To develop a diagnostic classification framework to assist medical practitioners.
This framework will fulfill below initiatives:
• To help healthcare providers and policymakers to improve screening
strategies and resource allocation.
• To support targeted public health interventions based on
population-level risk patterns.
• To address limitations of existing global risk scores that do not
fully consider diverse populations and socioeconomic contexts.
1.2 Situation Assessment
• Diabetes prevalence is increasing rapidly, yet early detection
tools remain limited, especially for Asian or multi-ethnic
populations.
• Current risk scores rarely include all major contributors such as
ethnicity, family history, diet, lifestyle, basic health indicators and
socioeconomic status.
• There is a strong need for an interpretable and population-relevant
prediction and classification tool to support early intervention, public
health planning and personalized prevention.
• Constraints include data quality, the need for ethical handling and
clinical interpretability.
1.3 Determine Data Mining Goals
1.3.1 Data Mining Goals
• To develop a regression model for accurate prediction of
individual’s diabetes risk score using demographic, lifestyle and
medical variables.
• To develop a classification model to classify diabetic diagnosis (0
= No, 1 = Yes).
• To quantify key features that contribute most significantly to
diabetes risk and diagnosis by using SHAP (SHapley Additive
exPlanations).
• To compare the performance of multiple machine learning models
(XGBoost vs Random Forest for regression/ Logistic Regression vs XGBoost
Classifier for classification).
• To provide interpretable insights that can support early screening
and preventive healthcare decision-making.
1.3.2 Data Mining Questions
• Which regression model is better in predicting diabetes
risk?
• Which classification model is more accurate in classify diabetes
diagnosis?
• Which is the most important feature in these models?
1.3.3 Data Mining Success Criteria
Besides achieving better performance metrics than baseline
models:
Regression Model must achieve:
• R² ≥ 0.70 (good explanatory
power)
• MAE minimized.
• RMSE minimized
Classification Model must achieve:
• Accuracy ≥ 80%
•
F1-Score ≥ 0.75
• Recall ≥ 0.75
1.4 Produce Project Plan
1.4.1 Project Planning
Business Understanding > Data Understanding > Data Preparation > EDA > Modelling (Regression) + Modelling (Classification) > Model Assessment > Interpretation > Conclusion
1.4.2 Initial Assessment of Tools & Techniques
Tools:
• Programming Language: R
•
Development Environment: RStudio 2025.09.2+418
• Libraries:
tidyverse, janitor, forcats, recipes, caret etc.
• Data Source:
Kaggle Diabetes Health Indicators Dataset
• Documentation:
RMD
Techniques:
Exploratory Data Analysis (EDA):
• Correlation heatmap, violin
plot ,propotional bar plot, scatter plot etc.
Modeling
Techniques:
• Regression: XGBoost Regressor, Random Forest
Regressor
• Classification: Logistic Regression, XGBoost
Classifier
Evaluation Techniques:
• Performance metrics defined
for Regression and Classification problem.
• SHAP results.
Validation & Performance Testing:
• Train-validate-test split:
70/15/15
2. Data Understanding
2.1 Data Collection
• Dataset name: Diabetes Health Indicators Dataset
retrieved from Kaggle (Thalla, 2025)
• Contains 100,200 patient records for diabetes risk analysis with 31
features.
• Diabetes-related information including demographics, lifestyle
factors, clinical measurements and laboratory results.
df <- read_csv("diabetes_dirty.csv") %>%
clean_names()
rows <- nrow(df)
cols <- ncol(df)
memory <- as.numeric(object.size(df)) / (1024^2)
rows; cols; memory## [1] 100200
## [1] 31
## [1] 23.72153
The output above shows the basic properties of the diabetes dataset:
- Number of rows (observations): 100200
- Each row corresponds to one individual record in the dataset.
- Number of columns (variables): 31
- These include demographic, lifestyle, clinical and laboratory variables.
- Approximate memory usage: 23.72 MB
- This indicates the dataset is relatively ideal to handle in R.
2.2 Data Description
This section checks the data types of all variables to
distinguish:
• categorical / ID-like variables (e.g., gender,
ethnicity, diabetes_stage)
• numeric measurement variables (e.g.,
BMI, glucose, HbA1c, blood pressure)
# Create a table of variable names and their main class
dtype <- data.frame(
variable = names(df),
type = sapply(df, function(x) class(x)[1]),
stringsAsFactors = FALSE
)
dtype## variable type
## age age numeric
## gender gender character
## ethnicity ethnicity character
## education_level education_level character
## income_level income_level character
## employment_status employment_status character
## smoking_status smoking_status character
## alcohol_consumption_per_week alcohol_consumption_per_week numeric
## physical_activity_minutes_per_week physical_activity_minutes_per_week numeric
## diet_score diet_score numeric
## sleep_hours_per_day sleep_hours_per_day numeric
## screen_time_hours_per_day screen_time_hours_per_day numeric
## family_history_diabetes family_history_diabetes numeric
## hypertension_history hypertension_history numeric
## cardiovascular_history cardiovascular_history numeric
## bmi bmi numeric
## waist_to_hip_ratio waist_to_hip_ratio numeric
## systolic_bp systolic_bp numeric
## diastolic_bp diastolic_bp numeric
## heart_rate heart_rate numeric
## cholesterol_total cholesterol_total numeric
## hdl_cholesterol hdl_cholesterol numeric
## ldl_cholesterol ldl_cholesterol numeric
## triglycerides triglycerides numeric
## glucose_fasting glucose_fasting numeric
## glucose_postprandial glucose_postprandial numeric
## insulin_level insulin_level numeric
## hba1c hba1c numeric
## diabetes_risk_score diabetes_risk_score numeric
## diabetes_stage diabetes_stage character
## diagnosed_diabetes diagnosed_diabetes numeric
# Categorical / ID-like variables
id <- dtype$variable[
dtype$type %in% c("character", "factor")
]
# Numeric measurement variables
num <- dtype$variable[
dtype$type %in% c("numeric", "double", "integer")
]
id## [1] "gender" "ethnicity" "education_level"
## [4] "income_level" "employment_status" "smoking_status"
## [7] "diabetes_stage"
## [1] "age" "alcohol_consumption_per_week"
## [3] "physical_activity_minutes_per_week" "diet_score"
## [5] "sleep_hours_per_day" "screen_time_hours_per_day"
## [7] "family_history_diabetes" "hypertension_history"
## [9] "cardiovascular_history" "bmi"
## [11] "waist_to_hip_ratio" "systolic_bp"
## [13] "diastolic_bp" "heart_rate"
## [15] "cholesterol_total" "hdl_cholesterol"
## [17] "ldl_cholesterol" "triglycerides"
## [19] "glucose_fasting" "glucose_postprandial"
## [21] "insulin_level" "hba1c"
## [23] "diabetes_risk_score" "diagnosed_diabetes"
The tables above summarise the data types in the dataset.
- R reports 7 character variables and 24
numeric (double) variables
(as seen in the import message and thedtypeoutput). - The categorical / ID-like variables include:
gender,ethnicity,education_level,income_level,employment_status,smoking_status,diabetes_stage,diagnosed_diabetes.
- The numeric measurement variables include:
- demographic:
age - lifestyle:
alcohol_consumption_per_week,physical_activity_minutes_per_week,diet_score,sleep_hours_per_day,screen_time_hours_per_day - clinical / lab:
bmi,waist_to_hip_ratio,systolic_bp,diastolic_bp,heart_rate,cholesterol_total,hdl_cholesterol,ldl_cholesterol,triglycerides,glucose_fasting,glucose_postprandial,insulin_level,hba1c,diabetes_risk_score.
- demographic:
- This confirms that the dataset structure is suitable for further
descriptive analysis and modelling, although some categorical values
(e.g., different capitalisation of
Male/male,White/white) will need cleaning later.
2.3 Data Exploration
2.3.1 Descriptive Statistics (Numeric Variables)
This section summarises the continuous numeric variables in the
diabetes dataset.
• Binary (0/1) variables such as medical history flags are excluded
because their summary statistics are not meaningful.
• The goal is
to inspect the general ranges,central tendency, and spread of the main
clinical and lifestyle measurements.
# List of binary variables to exclude
binary_cols <- c(
"family_history_diabetes",
"hypertension_history",
"cardiovascular_history",
"diagnosed_diabetes"
)
# Select only continuous numeric columns
continuous <- df %>%
select(where(is.numeric)) %>%
select(-all_of(binary_cols))
summary(continuous)## age alcohol_consumption_per_week
## Min. :18.00 Min. : 0.000
## 1st Qu.:39.00 1st Qu.: 1.000
## Median :50.00 Median : 2.000
## Mean :50.12 Mean : 2.004
## 3rd Qu.:61.00 3rd Qu.: 3.000
## Max. :90.00 Max. :10.000
##
## physical_activity_minutes_per_week diet_score sleep_hours_per_day
## Min. : 0.0 Min. : 0.000 Min. : 3.000
## 1st Qu.: 57.0 1st Qu.: 4.800 1st Qu.: 6.300
## Median :100.0 Median : 6.000 Median : 7.000
## Mean :118.9 Mean : 5.995 Mean : 6.998
## 3rd Qu.:160.0 3rd Qu.: 7.200 3rd Qu.: 7.700
## Max. :833.0 Max. :10.000 Max. :10.000
##
## screen_time_hours_per_day bmi waist_to_hip_ratio systolic_bp
## Min. : 0.500 Min. : 15.00 Min. :0.6700 Min. : 90.0
## 1st Qu.: 4.300 1st Qu.: 23.20 1st Qu.:0.8200 1st Qu.: 106.0
## Median : 6.000 Median : 25.60 Median :0.8600 Median : 116.0
## Mean : 5.997 Mean : 25.68 Mean :0.8561 Mean : 115.9
## 3rd Qu.: 7.700 3rd Qu.: 28.00 3rd Qu.:0.8900 3rd Qu.: 125.0
## Max. :16.800 Max. :255.90 Max. :1.0600 Max. :1002.5
## NA's :130 NA's :50
## diastolic_bp heart_rate cholesterol_total hdl_cholesterol
## Min. : 50.00 Min. : 40.00 Min. :100 Min. :20.00
## 1st Qu.: 70.00 1st Qu.: 64.00 1st Qu.:164 1st Qu.:47.00
## Median : 75.00 Median : 70.00 Median :186 Median :54.00
## Mean : 75.23 Mean : 69.79 Mean :186 Mean :54.04
## 3rd Qu.: 81.00 3rd Qu.: 75.00 3rd Qu.:208 3rd Qu.:61.00
## Max. :110.00 Max. :586.99 Max. :318 Max. :98.00
## NA's :140
## ldl_cholesterol triglycerides glucose_fasting glucose_postprandial
## Min. : 50 Min. : 30.0 Min. : 60.0 Min. : 70
## 1st Qu.: 78 1st Qu.: 91.0 1st Qu.:102.0 1st Qu.:139
## Median :102 Median :121.0 Median :111.0 Median :160
## Mean :103 Mean :121.5 Mean :111.1 Mean :160
## 3rd Qu.:126 3rd Qu.:151.0 3rd Qu.:120.0 3rd Qu.:181
## Max. :263 Max. :344.0 Max. :172.0 Max. :287
##
## insulin_level hba1c diabetes_risk_score
## Min. : 2.000 Min. :4.000 Min. : 2.70
## 1st Qu.: 5.090 1st Qu.:5.970 1st Qu.:23.80
## Median : 8.790 Median :6.520 Median :29.00
## Mean : 9.062 Mean :6.521 Mean :30.22
## 3rd Qu.:12.450 3rd Qu.:7.070 3rd Qu.:35.60
## Max. :32.220 Max. :9.800 Max. :67.20
## NA's :160
The descriptive statistics reveal several important characteristics:
Age ranges from 18 to 90 years, indicating that the dataset captures a wide adult age span suitable for diabetes and metabolic risk analysis.
Lifestyle variables (alcohol consumption, physical activity, diet score, sleep hours, screen time) show substantial variability:
- Sleep hours vary between 3 and 10 hours.
- Physical activity ranges from 0 to 833 minutes per week, suggesting extreme lifestyle differences that should be checked for outliers.
- Screen time ranges from 0.5 to 18 hours per day, which is unusually high at the upper end.
BMI ranges from 15 to 255.9.
- Values above 80 are physiologically unrealistic, suggesting potential data entry errors or extreme outliers requiring investigation.
Waist-to-hip ratio is mostly between 0.67 and 1.06, within plausible physiological limits.
Blood pressure:
- Systolic BP ranges from 90 to 1002.5, where values above 250 are physiologically impossible.
- Diastolic BP ranges from 50 to 110, mostly within realistic bounds.
- Systolic BP clearly contains erroneous values.
Heart rate ranges from 40 to 586 bpm.
- Anything above ~200 bpm is unrealistic, indicating significant outliers.
Cholesterol and triglycerides:
- Most values appear clinically plausible.
- Triglycerides have a maximum of 344, consistent with high-risk profiles but not impossible.
- Total cholesterol includes values up to 318, which is elevated but remains biologically credible.
Glucose levels show notable variation:
- Fasting glucose ranges from 60 to 172 mg/dL, covering normal, prediabetic, and diabetic ranges.
- Postprandial glucose ranges from 70 to 287 mg/dL, again showing broad metabolic diversity.
HbA1c ranges from 4.0 to 9.8%, consistent with normal to poorly controlled diabetes.
Insulin levels range from 2 to 32.2 µU/mL, which is plausible clinically but may still contain borderline high values.
Diabetes risk score ranges from 2.7 to 67.2, showing wide variability consistent with differing risk levels.
Overall, the descriptive statistics confirm that most clinical measurements are within realistic clinical ranges, although extreme outliers exist in variables such as BMI, systolic blood pressure, and heart rate. These values should be carefully examined and cleaned during the Data Quality Verification step. Several variables also contain missing values which will be assessed in the next section.
2.3.3 Correlation Analysis
This section examines the correlation between the mentioned key diabetes-related numerical variables to identify linear relationships that may influence diabetes risk prediction and diagnosis classification.
# Select numeric variables relevant to diabetes
corrvars <- df %>%
select(bmi, glucose_fasting, glucose_postprandial, hba1c, diabetes_risk_score)
# Compute correlation matrix (rounded)
corrmat <- round(cor(corrvars, use = "complete.obs"), 2)
# Convert to long format (tidyverse)
corr_long <- as.data.frame(corrmat) %>%
rownames_to_column(var = "Var1") %>%
pivot_longer(-Var1, names_to = "Var2", values_to = "corr")
# Preserve axis ordering so heatmap rows/cols match corrmat order
corr_long <- corr_long %>%
mutate(
Var1 = factor(Var1, levels = colnames(corrmat)),
Var2 = factor(Var2, levels = colnames(corrmat))
)
# Heatmap with correlation values
ggplot(corr_long, aes(x = Var1, y = Var2, fill = corr)) +
geom_tile() +
geom_text(aes(label = corr), color = "black", size = 4) +
scale_fill_gradient2(
low = "steelblue",
mid = "white",
high = "darkred",
midpoint = 0,
limits = c(-1, 1)
) +
labs(
title = "Correlation Heatmap (Diabetes-Related Variables)",
x = "",
y = "",
fill = "Correlation"
) +
coord_fixed() +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
axis.text.y = element_text(angle = 0)
)Figure shows strong positive correlations between glucose_postprandial and hba1c (r = 0.93) and between glucose_fasting and hba1c (r = 0.70). Glucose_fasting also shows moderate correlation with glucose_postprandial (r = 0.59) and diabetes_risk_score (r = 0.47). In contrast, bmi has weak correlations with all markers (r ≤ 0.24). This indicates that glycemic variables move closely together, while bmi contributes very little linearly.
2.4 Data Quality Verification
# Inspect total data points in each feature to detect columns with missing values
values <- df %>%
summarise(across(everything(), ~ sum(!is.na(.)))) %>%
pivot_longer(cols = everything(),
names_to = "variable",
values_to = "count") %>%
arrange(desc(count))
values## # A tibble: 31 × 2
## variable count
## <chr> <int>
## 1 age 100200
## 2 gender 100200
## 3 ethnicity 100200
## 4 income_level 100200
## 5 smoking_status 100200
## 6 alcohol_consumption_per_week 100200
## 7 physical_activity_minutes_per_week 100200
## 8 diet_score 100200
## 9 sleep_hours_per_day 100200
## 10 family_history_diabetes 100200
## # ℹ 21 more rows
ggplot(values, aes(x = reorder(variable, count), y = count)) +
geom_line(group = 1) +
geom_point() +
coord_flip() +
labs(
title = "Number of data points in each column",
x = "Variable",
y = "Number of Data Points"
)From the graph above, we can see that there are missing values in waist-to-hip ratio, education level, employment status, screen time hours per day, cholesterol total and hba1c.
## [1] 0
## integer(0)
# For outlier checking
# Define threshold
thresholds <- list(
age = c(0, 120),
systolic_bp = c(70, 250),
diastolic_bp = c(40, 150),
heart_rate = c(30, 220),
cholesterol_total = c(70, 400),
hdl_cholesterol = c(10, 200),
ldl_cholesterol = c(10, 300),
triglycerides = c(10, 2000),
glucose_fasting = c(20, 400),
glucose_postprandial = c(20, 600),
hba1c = c(3, 20),
bmi = c(10, 80),
waist_to_hip_ratio = c(0.35, 2.0),
alcohol_consumption_per_week = c(0, 70),
physical_activity_minutes_per_week = c(0, 10080),
physical_activity_hours_per_week = c(0, 168),
diet_score = c(0, 100),
sleep_hours_per_day = c(0, 24),
screen_time_hours_per_day = c(0, 24),
insulin_level = c(0.1, 2000),
family_history_diabetes = c(0,1),
hypertension_history = c(0,1),
cardiovascular_history = c(0,1),
diabetes_risk_score = c(0,100),
diagnosed_diabetes = c(0,1)
)
# NA-safe IQR flagger
iqr_flag1 <- function(x) {
n <- length(x)
flag <- rep(FALSE, n)
finite_idx <- is.finite(x)
if(sum(finite_idx) == 0) return(flag)
q1 <- quantile(x[finite_idx], 0.25, na.rm = TRUE)
q3 <- quantile(x[finite_idx], 0.75, na.rm = TRUE)
iqr <- q3 - q1
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr
flag[finite_idx] <- x[finite_idx] < lower | x[finite_idx] > upper
flag
}
# Rules + IQR flagger
rule_flag1 <- function(x, varname) {
n <- length(x)
flag <- rep(FALSE, n)
finite_idx <- is.finite(x)
if (sum(finite_idx) == 0) return(flag)
if (varname %in% names(thresholds)) {
th <- thresholds[[varname]]
flag[finite_idx] <- x[finite_idx] < th[1] | x[finite_idx] > th[2]
} else {
flag <- iqr_flag1(x)
}
flag
}
# Apply to numeric columns in raw1
numeric_vars1 <- names(df)[sapply(df, is.numeric)]
# Compute flags as a data.frame
outlier_flags1 <- lapply(numeric_vars1, function(v) rule_flag1(df[[v]], v))
outlier_flags1 <- as.data.frame(setNames(outlier_flags1, numeric_vars1))
# Verification: NA should never be counted as outlier
na_flag_counts1 <- sapply(numeric_vars1, function(v) {
sum(outlier_flags1[[v]] & is.na(df[[v]]))
})
print(na_flag_counts1)## age alcohol_consumption_per_week
## 0 0
## physical_activity_minutes_per_week diet_score
## 0 0
## sleep_hours_per_day screen_time_hours_per_day
## 0 0
## family_history_diabetes hypertension_history
## 0 0
## cardiovascular_history bmi
## 0 0
## waist_to_hip_ratio systolic_bp
## 0 0
## diastolic_bp heart_rate
## 0 0
## cholesterol_total hdl_cholesterol
## 0 0
## ldl_cholesterol triglycerides
## 0 0
## glucose_fasting glucose_postprandial
## 0 0
## insulin_level hba1c
## 0 0
## diabetes_risk_score diagnosed_diabetes
## 0 0
# Summary table
outlier_counts1 <- sapply(outlier_flags1, function(col) sum(col, na.rm = TRUE))
outlier_summary1 <- data.frame(
variable = names(outlier_counts1),
n_flagged = as.integer(outlier_counts1),
pct_flagged = round(100 * outlier_counts1 / nrow(df), 2)
)
print(outlier_summary1[order(-outlier_summary1$n_flagged), ])## variable n_flagged
## bmi bmi 50
## heart_rate heart_rate 40
## systolic_bp systolic_bp 20
## age age 0
## alcohol_consumption_per_week alcohol_consumption_per_week 0
## physical_activity_minutes_per_week physical_activity_minutes_per_week 0
## diet_score diet_score 0
## sleep_hours_per_day sleep_hours_per_day 0
## screen_time_hours_per_day screen_time_hours_per_day 0
## family_history_diabetes family_history_diabetes 0
## hypertension_history hypertension_history 0
## cardiovascular_history cardiovascular_history 0
## waist_to_hip_ratio waist_to_hip_ratio 0
## diastolic_bp diastolic_bp 0
## cholesterol_total cholesterol_total 0
## hdl_cholesterol hdl_cholesterol 0
## ldl_cholesterol ldl_cholesterol 0
## triglycerides triglycerides 0
## glucose_fasting glucose_fasting 0
## glucose_postprandial glucose_postprandial 0
## insulin_level insulin_level 0
## hba1c hba1c 0
## diabetes_risk_score diabetes_risk_score 0
## diagnosed_diabetes diagnosed_diabetes 0
## pct_flagged
## bmi 0.05
## heart_rate 0.04
## systolic_bp 0.02
## age 0.00
## alcohol_consumption_per_week 0.00
## physical_activity_minutes_per_week 0.00
## diet_score 0.00
## sleep_hours_per_day 0.00
## screen_time_hours_per_day 0.00
## family_history_diabetes 0.00
## hypertension_history 0.00
## cardiovascular_history 0.00
## waist_to_hip_ratio 0.00
## diastolic_bp 0.00
## cholesterol_total 0.00
## hdl_cholesterol 0.00
## ldl_cholesterol 0.00
## triglycerides 0.00
## glucose_fasting 0.00
## glucose_postprandial 0.00
## insulin_level 0.00
## hba1c 0.00
## diabetes_risk_score 0.00
## diagnosed_diabetes 0.00
# Logical check
out_prefix <- "dq"
as_num <- function(x) suppressWarnings(as.numeric(as.character(x)))
safe_write <- function(df0, fname) {
tryCatch(write.csv(df0, fname, row.names = FALSE), error = function(e) NULL)
}
checks <- list()
abnormal_list <- list()
add_rule <- function(key, desc, idx) {
if (length(idx) == 0) return()
checks[[key]] <<- list(rule = desc, n = length(idx), idx = idx)
abnormal_list[[key]] <<- cbind(df[idx, , drop = FALSE], dq_rule = key)
}
# rule a: diastolic_bp should not > systolic_bp
if (all(c("systolic_bp", "diastolic_bp") %in% names(df))) {
s <- as_num(df$systolic_bp); d <- as_num(df$diastolic_bp)
add_rule("diastolic_gt_systolic", "diastolic > systolic",
which(!is.na(s) & !is.na(d) & d > s))
}
# rule b: glucose_postprandial should not < glucose_fasting
if (all(c("glucose_fasting", "glucose_postprandial") %in% names(df))) {
gf <- as_num(df$glucose_fasting); gp <- as_num(df$glucose_postprandial)
add_rule("glucose_post_lt_fasting", "postprandial < fasting",
which(!is.na(gf) & !is.na(gp) & gp < gf))
}
# rule c: hba1c < 5 & fasting_glucose > 200
if (all(c("hba1c", "glucose_fasting") %in% names(df))) {
h <- as_num(df$hba1c); gf <- as_num(df$glucose_fasting)
add_rule("hba1c_low_glucose_high", "hba1c < 5 & fasting_glucose > 200",
which(!is.na(h) & !is.na(gf) & h < 5 & gf > 200))
}## NULL
# build summary dataframe
if (length(checks) == 0) {
summary_df <- data.frame(
name = character(0), rule = character(0),
n = integer(0), stringsAsFactors = FALSE
)
all_abnormal_df <- df[integer(0), , drop = FALSE]
} else {
summary_df <- data.frame(
name = names(checks),
rule = sapply(checks, `[[`, "rule"),
n = sapply(checks, `[[`, "n"),
stringsAsFactors = FALSE
)
all_abnormal_df <- do.call(rbind, abnormal_list)
rownames(all_abnormal_df) <- NULL
}
# write the two files
safe_write(summary_df, paste0(out_prefix, "_cross_variable_checks_summary.csv"))
safe_write(all_abnormal_df, paste0(out_prefix, "_cross_variable_checks_all_abnormal.csv"))
# return invisibly
invisible(list(
checks = checks,
summary = summary_df,
abnormal = abnormal_list,
all_abnormal = all_abnormal_df
))
print(summary_df)## name rule n
## diastolic_gt_systolic diastolic_gt_systolic diastolic > systolic 117
## glucose_post_lt_fasting glucose_post_lt_fasting postprandial < fasting 2604
- Upon inspection on the output, delta (diastolic_bp - systolic_bp) is in the range of 1 to 10, that is still considered normal in clinical terms.
- However, for glucose_fasting-glucose_postprandial, delta of >30 should be flagged as abnormal and were excluded to improve data reliability in data cleaning.
3. Data Preparation
3.1 Data Selection, Cleaning, Construction, Integration and Formatting
All columns of features were retained except diabetes_stage which contains classification of pre-diabetic, type 1 or type 2. These information was not crucial for our data mining goals as our target variables are risk scoring and diagnostic classes (diabetic or non-diabetic).
raw <- readr::read_csv("diabetes_dirty.csv") %>%
select(-diabetes_stage) # remove diabetes_stage column
#gender
#Before cleaning: frequency table
cat("Before cleaning:\n")## Before cleaning:
##
## female Female FEMALE male Male MALE other Other OTHER
## 16881 16708 16725 15902 16152 15817 680 642 693
#Standardize gender
raw <- raw %>%
mutate(
gender = str_to_lower(gender),
gender = str_squish(gender),
gender = case_when(
gender %in% c("female","Female","FEMALE") ~ "Female",
gender %in% c("male","Male","MALE") ~ "Male",
gender %in% c("other","Other","OTHER" ) ~ "Other"
)
) %>%
mutate(gender = factor(gender, levels = c("Female", "Male","Other")))
#After cleaning: frequency table
cat("After cleaning:\n")## After cleaning:
##
## Female Male Other
## 50314 47871 2015
## Before cleaning:
##
## asian Asian ASIAN Asian ! Asian! black Black
## 2906 3013 2889 609 2475 4502 4506
## BLACK Black ! Black! hispanic Hispanic HISPANIC Hispanic !
## 4484 909 3630 5028 4975 5195 969
## Hispanic! other Other OTHER Other ! Other! white
## 3977 1254 1319 1307 232 945 11086
## White WHITE White ! White!
## 11206 11380 2271 9133
#Standardize ethnicity
raw <- raw %>%
mutate(
ethnicity = ethnicity %>%
str_to_lower() %>%
str_replace_all("[[:punct:]]", "") %>%
str_squish(),
ethnicity = case_when(
ethnicity == "asian" ~ "Asian",
ethnicity == "black" ~ "Black",
ethnicity == "hispanic" ~ "Hispanic",
ethnicity == "white" ~ "White",
ethnicity == "other" ~ "Other",
TRUE ~ "Unknown"
)
)
#After cleaning: frequency table
cat("After cleaning:\n")## After cleaning:
##
## Asian Black Hispanic Other White
## 11892 18031 20144 5057 45076
## Before cleaning:
##
## current Current CURRENT former Former FORMER never Never NEVER
## 5038 10152 5030 4901 10100 5051 14854 30109 14965
#Standardize smoking_status
raw <- raw %>%
mutate(
smoking_status = smoking_status %>%
str_to_lower() %>%
str_replace_all("[[:punct:]]", "") %>%
str_squish(),
smoking_status = case_when(
smoking_status == "current" ~ "Current",
smoking_status == "former" ~ "Former",
smoking_status == "never" ~ "Never",
TRUE ~ NA_character_
)
)
#After cleaning: frequency table
cat("After cleaning:\n")## After cleaning:
##
## Current Former Never
## 20220 20052 59928
#physical_activity_minutes_per_week
#convert to physical_activity_hours_per_week
raw <- raw %>%
mutate(
physical_activity_hours_per_week = round(ifelse(
is.na(physical_activity_minutes_per_week),
NA,
physical_activity_minutes_per_week / 60
),1)
)
raw <- raw %>%
relocate(physical_activity_hours_per_week, .after = 7)
raw <- raw %>%
select(-physical_activity_minutes_per_week)
#check first 10 numbers
raw$physical_activity_hours_per_week[1:10]## [1] 3.6 2.4 0.9 0.8 1.8 2.1 0.9 1.2 1.9 1.4
#round diet_score,sleep_hours_per_day,screen_time_hours_per_day,bm,diabetes_risk_score to 1 d.p.
safe_round <- function(x, digits = 1) {
ifelse(is.finite(x), round(x, digits), NA)
}
raw <- raw %>%
mutate(
diet_score = safe_round(diet_score, 1),
sleep_hours_per_day = safe_round(sleep_hours_per_day, 1),
screen_time_hours_per_day = safe_round(screen_time_hours_per_day, 1),
bmi = safe_round(bmi, 1),
diabetes_risk_score = safe_round(diabetes_risk_score, 2)
)
#check first 10 numbers
raw$diet_score[1:10]## [1] 5.7 6.7 6.4 3.4 7.2 9.0 9.2 4.1 6.7 8.2
## [1] 7.9 6.5 10.0 6.6 7.4 6.2 7.8 9.0 8.5 5.3
## [1] 7.9 8.7 8.1 5.2 5.0 5.4 8.0 12.9 8.5 7.4
## [1] 30.5 23.1 22.2 26.8 21.2 26.1 25.1 23.9 24.7 26.7
## [1] 29.6 23.0 44.7 38.2 23.5 23.5 36.1 34.2 26.7 30.0
#round waist_to_hip_ratio,insulin_level,hba1c to 2 d.p.
safe_round <- function(x, digits = 2) {
ifelse(is.finite(x), round(x, digits), NA)
}
raw <- raw %>%
mutate(
waist_to_hip_ratio = safe_round(waist_to_hip_ratio, 2),
insulin_level = safe_round(insulin_level, 2),
hba1c = safe_round(hba1c, 2)
)
#check first 10 numbers
raw$waist_to_hip_ratio[1:10]## [1] 0.89 0.80 0.81 0.88 0.78 0.85 0.88 0.86 0.84 0.81
## [1] 6.36 2.00 5.07 5.28 12.74 8.77 10.14 8.96 5.70 4.49
## [1] 8.18 5.63 7.51 9.03 7.20 6.03 5.24 7.04 6.90 4.99
#round others to nearest whole number
safe_round0 <- function(x) {
ifelse(is.finite(x), round(x, 0), NA)
}
raw <- raw %>%
mutate(
systolic_bp = safe_round0(systolic_bp),
diastolic_bp = safe_round0(diastolic_bp),
heart_rate = safe_round0(heart_rate),
cholesterol_total = safe_round0(cholesterol_total),
hdl_cholesterol = safe_round0(hdl_cholesterol),
ldl_cholesterol = safe_round0(ldl_cholesterol),
triglycerides = safe_round0(triglycerides),
glucose_fasting = safe_round0(glucose_fasting),
glucose_postprandial = safe_round0(glucose_postprandial)
)
#check first 10 numbers
raw$systolic_bp[1:10]## [1] 134 129 115 120 92 95 129 128 103 124
## [1] 78 76 73 93 67 81 77 83 71 81
## [1] 68 67 74 68 67 57 81 76 72 70
## [1] 239 116 213 171 210 218 238 241 187 188
## [1] 41 55 66 50 52 61 46 49 33 52
## [1] 160 50 99 79 125 119 161 159 132 103
## [1] 145 30 36 140 160 179 155 120 98 104
## [1] 136 93 118 139 137 100 101 110 116 76
## [1] 236 150 195 253 184 133 100 189 172 109
#There are no duplication, no need to remove any duplicated record.
#For outlier checking after dropping diabetes_stage feature
#Define threshold
thresholds <- list(
age = c(0, 120),
systolic_bp = c(70, 250),
diastolic_bp = c(40, 150),
heart_rate = c(30, 220),
cholesterol_total = c(70, 400),
hdl_cholesterol = c(10, 200),
ldl_cholesterol = c(10, 300),
triglycerides = c(10, 2000),
glucose_fasting = c(20, 400),
glucose_postprandial = c(20, 600),
hba1c = c(3, 20),
bmi = c(10, 80),
waist_to_hip_ratio = c(0.35, 2.0),
alcohol_consumption_per_week = c(0, 70),
physical_activity_minutes_per_week = c(0, 10080),
physical_activity_hours_per_week = c(0, 168),
diet_score = c(0, 100),
sleep_hours_per_day = c(0, 24),
screen_time_hours_per_day = c(0, 24),
insulin_level = c(0.1, 2000),
family_history_diabetes = c(0,1),
hypertension_history = c(0,1),
cardiovascular_history = c(0,1),
diabetes_risk_score = c(0,100),
diagnosed_diabetes = c(0,1)
)
#NA-safe IQR flagger
iqr_flag <- function(x) {
n <- length(x)
flag <- rep(FALSE, n)
finite_idx <- is.finite(x)
if(sum(finite_idx) == 0) return(flag)
q1 <- quantile(x[finite_idx], 0.25, na.rm = TRUE)
q3 <- quantile(x[finite_idx], 0.75, na.rm = TRUE)
iqr <- q3 - q1
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr
flag[finite_idx] <- x[finite_idx] < lower | x[finite_idx] > upper
flag
}
#Rules IQR flagger
rule_flag <- function(x, varname) {
n <- length(x)
flag <- rep(FALSE, n)
finite_idx <- is.finite(x)
if(sum(finite_idx) == 0) return(flag)
if (varname %in% names(thresholds)) {
th <- thresholds[[varname]]
flag[finite_idx] <- x[finite_idx] < th[1] | x[finite_idx] > th[2]
} else {
flag <- iqr_flag(x)
}
flag
}
#Apply to numeric columns
numeric_vars <- names(raw)[sapply(raw, is.numeric)]
#Compute flags as a data.frame
outlier_flags <- lapply(numeric_vars, function(v) rule_flag(raw[[v]], v))
outlier_flags <- as.data.frame(setNames(outlier_flags, numeric_vars))
#Quick verification: ensure no NA is counted as outlier
na_flag_counts <- sapply(numeric_vars, function(v) {
sum(outlier_flags[[v]] & is.na(raw[[v]]))
})
#Should be all zeros
print(na_flag_counts)## age physical_activity_hours_per_week
## 0 0
## alcohol_consumption_per_week diet_score
## 0 0
## sleep_hours_per_day screen_time_hours_per_day
## 0 0
## family_history_diabetes hypertension_history
## 0 0
## cardiovascular_history bmi
## 0 0
## waist_to_hip_ratio systolic_bp
## 0 0
## diastolic_bp heart_rate
## 0 0
## cholesterol_total hdl_cholesterol
## 0 0
## ldl_cholesterol triglycerides
## 0 0
## glucose_fasting glucose_postprandial
## 0 0
## insulin_level hba1c
## 0 0
## diabetes_risk_score diagnosed_diabetes
## 0 0
#Summary
outlier_counts <- sapply(outlier_flags, function(col) sum(col, na.rm = TRUE))
outlier_summary <- data.frame(variable = names(outlier_counts),
n_flagged = as.integer(outlier_counts),
pct_flagged = round(100 * outlier_counts / nrow(raw), 2))
print(outlier_summary[order(-outlier_summary$n_flagged), ])## variable n_flagged
## bmi bmi 50
## heart_rate heart_rate 40
## systolic_bp systolic_bp 20
## age age 0
## physical_activity_hours_per_week physical_activity_hours_per_week 0
## alcohol_consumption_per_week alcohol_consumption_per_week 0
## diet_score diet_score 0
## sleep_hours_per_day sleep_hours_per_day 0
## screen_time_hours_per_day screen_time_hours_per_day 0
## family_history_diabetes family_history_diabetes 0
## hypertension_history hypertension_history 0
## cardiovascular_history cardiovascular_history 0
## waist_to_hip_ratio waist_to_hip_ratio 0
## diastolic_bp diastolic_bp 0
## cholesterol_total cholesterol_total 0
## hdl_cholesterol hdl_cholesterol 0
## ldl_cholesterol ldl_cholesterol 0
## triglycerides triglycerides 0
## glucose_fasting glucose_fasting 0
## glucose_postprandial glucose_postprandial 0
## insulin_level insulin_level 0
## hba1c hba1c 0
## diabetes_risk_score diabetes_risk_score 0
## diagnosed_diabetes diagnosed_diabetes 0
## pct_flagged
## bmi 0.05
## heart_rate 0.04
## systolic_bp 0.02
## age 0.00
## physical_activity_hours_per_week 0.00
## alcohol_consumption_per_week 0.00
## diet_score 0.00
## sleep_hours_per_day 0.00
## screen_time_hours_per_day 0.00
## family_history_diabetes 0.00
## hypertension_history 0.00
## cardiovascular_history 0.00
## waist_to_hip_ratio 0.00
## diastolic_bp 0.00
## cholesterol_total 0.00
## hdl_cholesterol 0.00
## ldl_cholesterol 0.00
## triglycerides 0.00
## glucose_fasting 0.00
## glucose_postprandial 0.00
## insulin_level 0.00
## hba1c 0.00
## diabetes_risk_score 0.00
## diagnosed_diabetes 0.00
#Replace all outliers with <NA>
raw2 <- raw
for (v in names(outlier_flags)) {
raw2[[v]][outlier_flags[[v]]] <- NA_real_
}
sapply(names(outlier_flags), function(v) {
sum(outlier_flags[[v]] & !is.na(raw2[[v]]))
})## age physical_activity_hours_per_week
## 0 0
## alcohol_consumption_per_week diet_score
## 0 0
## sleep_hours_per_day screen_time_hours_per_day
## 0 0
## family_history_diabetes hypertension_history
## 0 0
## cardiovascular_history bmi
## 0 0
## waist_to_hip_ratio systolic_bp
## 0 0
## diastolic_bp heart_rate
## 0 0
## cholesterol_total hdl_cholesterol
## 0 0
## ldl_cholesterol triglycerides
## 0 0
## glucose_fasting glucose_postprandial
## 0 0
## insulin_level hba1c
## 0 0
## diabetes_risk_score diagnosed_diabetes
## 0 0
#Change variable name back to raw
raw<-raw2
# Remove rows where glucose_postprandial is >30 mg/dL lower than glucose_fasting
gf <- suppressWarnings(as.numeric(raw$glucose_fasting))
gp <- suppressWarnings(as.numeric(raw$glucose_postprandial))
remove_idx <- gp < (gf - 30)
raw <- raw[ !(gp < (gf - 30)), ]
sum(remove_idx)## [1] 46
#Fill missing values
# Variables NOT allowed to impute (avoid leakage)
no_impute <- c(
"glucose_fasting", "glucose_postprandial", "insulin_level",
"hba1c", "diabetes_risk_score", "diagnosed_diabetes"
)
# 1. Identify numeric & categorical columns
numeric_cols <- names(raw)[sapply(raw, is.numeric)]
cat_cols <- names(raw)[sapply(raw, is.character)]
# 2. Numeric imputation (median), excluding no-impute variables
for (col in setdiff(numeric_cols, no_impute)) {
med <- median(raw[[col]], na.rm = TRUE)
raw[[col]][is.na(raw[[col]])] <- med
}
# 3. Categorical imputation (mode)
get_mode <- function(x) {
ux <- unique(na.omit(x))
ux[which.max(tabulate(match(x, ux)))]
}
for (col in cat_cols) {
mode_val <- get_mode(raw[[col]])
raw[[col]][is.na(raw[[col]])] <- mode_val
}
summary(raw)## age gender ethnicity education_level
## Min. :18.00 Female:50283 Length:100154 Length:100154
## 1st Qu.:39.00 Male :47856 Class :character Class :character
## Median :50.00 Other : 2015 Mode :character Mode :character
## Mean :50.12
## 3rd Qu.:61.00
## Max. :90.00
##
## income_level employment_status smoking_status
## Length:100154 Length:100154 Length:100154
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## physical_activity_hours_per_week alcohol_consumption_per_week diet_score
## Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.900 1st Qu.: 1.000 1st Qu.: 4.800
## Median : 1.700 Median : 2.000 Median : 6.000
## Mean : 1.981 Mean : 2.004 Mean : 5.995
## 3rd Qu.: 2.700 3rd Qu.: 3.000 3rd Qu.: 7.200
## Max. :13.900 Max. :10.000 Max. :10.000
##
## sleep_hours_per_day screen_time_hours_per_day family_history_diabetes
## Min. : 3.000 Min. : 0.500 Min. :0.0000
## 1st Qu.: 6.300 1st Qu.: 4.300 1st Qu.:0.0000
## Median : 7.000 Median : 6.000 Median :0.0000
## Mean : 6.998 Mean : 5.997 Mean :0.2195
## 3rd Qu.: 7.700 3rd Qu.: 7.700 3rd Qu.:0.0000
## Max. :10.000 Max. :16.800 Max. :1.0000
##
## hypertension_history cardiovascular_history bmi waist_to_hip_ratio
## Min. :0.0000 Min. :0.00000 Min. :15.00 Min. :0.6700
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:23.20 1st Qu.:0.8200
## Median :0.0000 Median :0.00000 Median :25.60 Median :0.8600
## Mean :0.2508 Mean :0.07912 Mean :25.61 Mean :0.8561
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:28.00 3rd Qu.:0.8900
## Max. :1.0000 Max. :1.00000 Max. :39.20 Max. :1.0600
##
## systolic_bp diastolic_bp heart_rate cholesterol_total
## Min. : 90.0 Min. : 50.00 Min. : 40.00 Min. :100
## 1st Qu.:106.0 1st Qu.: 70.00 1st Qu.: 64.00 1st Qu.:164
## Median :116.0 Median : 75.00 Median : 70.00 Median :186
## Mean :115.8 Mean : 75.23 Mean : 69.63 Mean :186
## 3rd Qu.:125.0 3rd Qu.: 81.00 3rd Qu.: 75.00 3rd Qu.:208
## Max. :179.0 Max. :110.00 Max. :105.00 Max. :318
##
## hdl_cholesterol ldl_cholesterol triglycerides glucose_fasting
## Min. :20.00 Min. : 50 Min. : 30.0 Min. : 60.0
## 1st Qu.:47.00 1st Qu.: 78 1st Qu.: 91.0 1st Qu.:102.0
## Median :54.00 Median :102 Median :121.0 Median :111.0
## Mean :54.04 Mean :103 Mean :121.5 Mean :111.1
## 3rd Qu.:61.00 3rd Qu.:126 3rd Qu.:151.0 3rd Qu.:120.0
## Max. :98.00 Max. :263 Max. :344.0 Max. :172.0
##
## glucose_postprandial insulin_level hba1c diabetes_risk_score
## Min. : 70.0 Min. : 2.000 Min. :4.000 Min. : 2.70
## 1st Qu.:139.0 1st Qu.: 5.090 1st Qu.:5.970 1st Qu.:23.80
## Median :160.0 Median : 8.790 Median :6.520 Median :29.00
## Mean :160.1 Mean : 9.062 Mean :6.521 Mean :30.22
## 3rd Qu.:181.0 3rd Qu.:12.450 3rd Qu.:7.070 3rd Qu.:35.60
## Max. :287.0 Max. :32.220 Max. :9.800 Max. :67.20
## NA's :160
## diagnosed_diabetes
## Min. :0.0000
## 1st Qu.:0.0000
## Median :1.0000
## Mean :0.6002
## 3rd Qu.:1.0000
## Max. :1.0000
##
#Feature Engineering
# BMI category
raw$bmi_category <- cut(
raw$bmi,
breaks = c(-Inf, 18.5, 24.9, 29.9, Inf),
labels = c("Underweight", "Normal", "Overweight", "Obese")
)
# Blood pressure category
raw$bp_category <- case_when(
raw$systolic_bp < 120 & raw$diastolic_bp < 80 ~ "Normal",
raw$systolic_bp %in% 120:129 & raw$diastolic_bp < 80 ~ "Elevated",
raw$systolic_bp %in% 130:139 | raw$diastolic_bp %in% 80:89 ~ "Stage 1 HTN",
raw$systolic_bp >= 140 | raw$diastolic_bp >= 90 ~ "Stage 2 HTN",
TRUE ~ NA_character_
)
# Convert selected categories to factors
factor_cols <- c(
"gender", "ethnicity", "education_level", "income_level",
"employment_status", "smoking_status",
"family_history_diabetes", "hypertension_history",
"cardiovascular_history", "bmi_category",
"bp_category", "diagnosed_diabetes"
)
for (col in factor_cols) {
if (col %in% names(raw))
raw[[col]] <- as.factor(raw[[col]])
}
cat("\nPreview of engineered columns:\n")##
## Preview of engineered columns:
## # A tibble: 10 × 5
## bmi bmi_category systolic_bp diastolic_bp bp_category
## <dbl> <fct> <dbl> <dbl> <fct>
## 1 30.5 Obese 134 78 Stage 1 HTN
## 2 23.1 Normal 129 76 Elevated
## 3 22.2 Normal 115 73 Normal
## 4 26.8 Overweight 120 93 Stage 2 HTN
## 5 21.2 Normal 92 67 Normal
## 6 26.1 Overweight 95 81 Stage 1 HTN
## 7 25.1 Overweight 129 77 Elevated
## 8 23.9 Normal 128 83 Stage 1 HTN
## 9 24.7 Normal 103 71 Normal
## 10 26.7 Overweight 124 81 Stage 1 HTN
4. Exploratory Data Analysis(EDA) & Modelling
4.1 Exploratory Data Analysis (EDA)
# Convert obvious categorical variables to factors
clean <- clean %>%
mutate(
diagnosed_diabetes = as.factor(diagnosed_diabetes), # 0/1 or yes/no
gender = as.factor(gender),
ethnicity = as.factor(ethnicity),
education_level = as.factor(education_level),
income_level = as.factor(income_level),
employment_status = as.factor(employment_status),
smoking_status = as.factor(smoking_status),
family_history_diabetes = as.factor(family_history_diabetes),
hypertension_history = as.factor(hypertension_history),
cardiovascular_history = as.factor(cardiovascular_history),
bmi_category = as.factor(bmi_category),
bp_category = as.factor(bp_category)
)
# Create age groups for categorical analysis
clean <- clean %>%
mutate(age_group = case_when(
age < 30 ~ "<30",
age >= 30 & age < 45 ~ "30-44",
age >= 45 & age < 60 ~ "45-59",
age >= 60 ~ "60+",
TRUE ~ NA_character_
)) %>%
mutate(age_group = factor(age_group, levels = c("<30", "30-44", "45-59", "60+")))4.1.1 Scatter plot: bmi vs glucose_fasting (colored by diagnosed_diabetes)
p1_data <- clean %>% drop_na(bmi, glucose_fasting, diagnosed_diabetes)
p1 <- ggplot(p1_data, aes(x = bmi, y = glucose_fasting, color = diagnosed_diabetes)) +
geom_point(alpha = 0.6, size = 1.8, position = position_jitter(width = 0.2, height = 0)) +
geom_smooth(method = "lm", se = FALSE, linetype = "dashed", aes(group = diagnosed_diabetes)) +
labs(title = "BMI vs Fasting Glucose (colored by diagnosed diabetes)",
x = "BMI", y = "Fasting glucose (mg/dL)",
color = "Diagnosed") +
theme_minimal()
print(p1)- Diagnosed diabetes is associated with higher fasting glucose.
- BMI does not strongly separate individuals with and without diabetes.
- There is substantial overlap, indicating the need for multivariate modeling.
4.1.2 Box + jitter: HbA1c by BMI category (color by diagnosed)
p2_data <- clean %>% drop_na(hba1c, bmi_category, diagnosed_diabetes)
p2 <- ggplot(p2_data, aes(x = bmi_category, y = hba1c)) +
geom_boxplot(outlier.shape = NA, alpha = 0.6) +
geom_jitter(aes(color = diagnosed_diabetes), width = 0.18, alpha = 0.7, size = 1.6) +
labs(title = "HbA1c distribution across BMI categories",
x = "BMI category", y = "HbA1c (%)", color = "Diagnosed") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 25, hjust = 1))
print(p2)- Diabetic individuals have consistently higher HbA1c across all BMI categories.
- Non-diabetic individuals cluster at lower HbA1c, showing clear separation from diagnosed cases.
- BMI category does not strongly influence HbA1c, as medians are similar across groups.
- Large HbA1c variability exists within diabetics, regardless of BMI (underweight to obese).
- HbA1c is a stronger indicator of diabetes status than BMI, based on clear vertical separation.
4.1.3 Correlation heatmap of numeric metabolic variables
num_vars <- c("age", "bmi", "waist_to_hip_ratio", "systolic_bp", "diastolic_bp",
"heart_rate", "cholesterol_total", "hdl_cholesterol", "ldl_cholesterol",
"triglycerides", "glucose_fasting", "glucose_postprandial", "insulin_level",
"hba1c", "diabetes_risk_score")
num_df <- clean %>% select(any_of(num_vars)) %>% drop_na()
corr_mat <- cor(num_df, use = "pairwise.complete.obs")
col_pal <- colorRampPalette(brewer.pal(11, "RdYlBu"))(50)
# show on screen
corrplot(corr_mat, method = "color", col = col_pal, tl.cex = 0.8,
addCoef.col = "black", number.cex = 0.6, order = "original",
title = "Correlation heatmap of numeric metabolic", mar = c(0,0,2,0))- BMI and Waist-to-Hip Ratio show strong correlation (0.77)
- Total Cholesterol and LDL are very strongly correlated (0.91)
- Fasting Glucose and Postprandial Glucose correlate moderately (0.59)
- Triglycerides correlate moderately with BMI (0.39)
- Diabetes Risk Score correlates most with Age (0.50) and Fasting Glucose (0.48), indicating they may be the key contributors.
4.1.4 Proportion bar: Diagnosed diabetes rate by age_group and gender
p4_data <- clean %>% drop_na(age_group, gender, diagnosed_diabetes) %>%
group_by(age_group, gender) %>%
summarise(n = n(), n_diab = sum(as.numeric(as.character(diagnosed_diabetes)) == 1, na.rm = TRUE), .groups = "drop") %>%
mutate(rate = n_diab / n)
p4 <- ggplot(p4_data, aes(x = age_group, y = rate, fill = gender)) +
geom_col(position = position_dodge(width = 0.9), width = 0.75) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(title = "Proportion of diagnosed diabetes by age group and gender",
x = "Age group", y = "Diagnosed diabetes (%)", fill = "Gender") +
theme_minimal()
print(p4)- Diagnosed diabetes increases steadily with age.
- The 60+ age group has the highest proportion of diabetes.
- Younger adults (<30) have the lowest diabetes rates.
- Gender differences are small, with female, male and “other” showing fairly similar proportions within each age group.
- The “other” gender category consistently shows slightly higher proportions in every age group.
4.1.5 Violin + facet plot: Insulin by BP category, faceted by smoking status
p5_data <- clean %>% drop_na(insulin_level, bp_category, smoking_status)
p5 <- ggplot(p5_data, aes(x = bp_category, y = insulin_level)) +
geom_violin(trim = TRUE, alpha = 0.6) +
geom_jitter(width = 0.15, size = 0.7, alpha = 0.5) +
facet_wrap(~ smoking_status, scales = "free_y") +
labs(title = "Insulin level by BP category, faceted by smoking status",
x = "BP category", y = "Insulin level") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 20, hjust = 1))
print(p5)- Insulin levels show similar distributions across all BP categories.
- Smoking status (Current, Former, Never) does not markedly shift insulin levels.
- All groups show a wide spread of insulin values.
- The shape of the violin plots is consistent across facets, indicating that insulin variability remains similar regardless of BP or smoking group.
4.1.6 Class imbalance determination
p_diag_data <- clean %>%
drop_na(diagnosed_diabetes) %>%
group_by(diagnosed_diabetes) %>%
summarise(count = n(), .groups = "drop") %>%
mutate(pct = count / sum(count))
p_diag <- ggplot(p_diag_data, aes(x = as.numeric(as.character(diagnosed_diabetes)), y = count)) +
geom_col(width = 0.5) +
geom_text(aes(label = count), vjust = -0.5, size = 3.5) +
scale_x_continuous(breaks = c(0, 1), labels = c("0", "1")) + # show 0 and 1 on x-axis
labs(
title = "Count of Diagnosed Diabetes (0 / 1)",
x = "Diagnosed Diabetes (0 = No, 1 = Yes)",
y = "Count"
) +
theme_minimal()
print(p_diag)- There is a minor class imbalance of 40045 No diabetes to 60109 Yes diabetes. The ratio of 1 to 0 is 1.5. This is a marginally small imbalance and classification modelling can proceed (Hoffman, 2021).
4.2 Model Techniques Selection
- Regression:
XGBoost Regressor: fast, powerful model capturing complex non-linear
patterns
Random Forest: robust ensemble reducing overfitting for
stable predictions
- Classification:
XGBoost classifier: non-linear modeling that captures complex
patterns
Logistic regression: simple, interpretable baseline for
diagnosis classification
4.3 Test Design Generation
4.3.1 Regression
- Data goal: Predict continuous diabetes risk score.
- Metrics: MAE, RMSE, R².
- Baseline: Mean predictor (predicts average risk).
- Split: train 70% / val 15% / test 15% with fixed random
seed
- One-hot encoding to convert categorical variables for both
models
- Pre-processing: Feature scaling optional; tree models (RF, XGBoost)
do not require scaling
- Training & tuning: Train Random Forest and XGBoost on train;
tune hyperparameters on validation
- Validation: select best model based on MAE, RMSE, and R²
- Final test: evaluate chosen regression model on the test
set
- Comparison: compare both models to baseline and to each
other
4.3.2 Classification
- Data goal: Classify diagnosis (binary)
- Metrics: accuracy, precision, recall, F1, ROC-AUC
- Baseline: majority-class predictor
- Split: train 70% / val 15% / test 15% with fixed random
seed
- One hot encoding to convert categorical variable
- Pre-processing: Feature scaling for LR, not required for
XGBoost
- Training & tuning: Train LR and XGBoost on train; tune
hyperparameters on val
- Validation: Select best threshold on validation
- Final test: Evaluate chosen models
- Comparison: Compare to baseline and each other
4.4 Model Building
4.4.1 Regression - XGBoost
set.seed(42)
# Split 70% train / 15% val / 15% test
# First split: 85% train+val / 15% test
split1 <- initial_split(clean, prop = 0.85)
train_val <- training(split1)
test_set <- testing(split1)
# Second split: split train_val into 70% train / 15% val
split2 <- initial_split(train_val, prop = 70/85)
train_set <- training(split2)
val_set <- testing(split2)
cat("Rows — Train:", nrow(train_set),
"Val:", nrow(val_set),
"Test:", nrow(test_set), "\n\n")## Rows — Train: 70107 Val: 15023 Test: 15024
# Recipe: imputation, unknown categories, one-hot encoding, remove zero-variance
rec <- recipe(diabetes_risk_score ~ ., data = train_set) %>%
step_impute_median(all_numeric_predictors()) %>%
step_impute_mode(all_nominal_predictors()) %>%
step_unknown(all_nominal_predictors()) %>%
step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
step_zv(all_predictors())
rec_prep <- prep(rec, training = train_set, retain = TRUE)
train_baked <- bake(rec_prep, train_set)
val_baked <- bake(rec_prep, val_set)
test_baked <- bake(rec_prep, test_set)
cat("Baked columns:", ncol(train_baked), "\n")## Baked columns: 64
# Convert baked datasets to sparse matrices
s_train <- sparse.model.matrix(diabetes_risk_score ~ . -1, train_baked)
s_val <- sparse.model.matrix(diabetes_risk_score ~ . -1, val_baked)
s_test <- sparse.model.matrix(diabetes_risk_score ~ . -1, test_baked)
dtrain <- xgb.DMatrix(s_train, label = train_baked$diabetes_risk_score)
dval <- xgb.DMatrix(s_val, label = val_baked$diabetes_risk_score)
dtest <- xgb.DMatrix(s_test, label = test_baked$diabetes_risk_score)# Train 3 simple candidate models (safe tuning)
param_list <- list(
list(eta = 0.1, max_depth = 3),
list(eta = 0.05, max_depth = 4),
list(eta = 0.2, max_depth = 2)
)
models <- vector("list", length(param_list))
rmses <- rep(Inf, length(param_list))
for (i in seq_along(param_list)) {
p <- param_list[[i]]
cat("Training model", i, "...\n")
m <- tryCatch({
xgb.train(
params = list(
booster = "gbtree",
objective = "reg:squarederror",
eval_metric = "rmse",
eta = p$eta,
max_depth = p$max_depth
),
data = dtrain,
nrounds = 2000,
watchlist = list(train = dtrain, eval = dval),
early_stopping_rounds = 50,
verbose = 0
)
}, error = function(e) {
cat(" → Model", i, "FAILED:", e$message, "\n")
NULL
})
models[[i]] <- m
score <- tryCatch(m$best_score, error = function(e) NA)
if (is.null(score) || is.na(score) || !is.finite(score)) {
cat(" → Invalid RMSE detected. Assigning RMSE = Inf\n")
rmses[i] <- Inf
} else {
rmses[i] <- score
}
cat(" → Validation RMSE used:", rmses[i], "\n")
}## Training model 1 ...
## → Validation RMSE used: 0.2061891
## Training model 2 ...
## → Validation RMSE used: 0.2052611
## Training model 3 ...
## → Validation RMSE used: 0.2217658
# Select best model
best_index <- which.min(rmses)
best_model <- models[[best_index]]
if (is.null(best_model)) {
stop("ERROR: All XGBoost models failed — no usable model was created.")
}
cat("\nBEST MODEL INDEX:", best_index, "\n")##
## BEST MODEL INDEX: 2
## BEST VALIDATION RMSE: 0.2052611
# Test-set matrices (final model performance)
test_pred <- predict(best_model, dtest)
true <- test_baked$diabetes_risk_score
mae <- mean(abs(test_pred - true))
rmse <- sqrt(mean((test_pred - true)^2))
r2 <- yardstick::rsq_trad_vec(true, test_pred)
final_metrics <- tibble(
Metric = c("MAE", "RMSE", "R²"),
Value = c(mae, rmse, r2)
)
final_metrics %>% knitr::kable()| Metric | Value |
|---|---|
| MAE | 0.1567964 |
| RMSE | 0.2031207 |
| R² | 0.9995074 |
# Baseline (mean predictor)
mean_pred <- mean(train_set$diabetes_risk_score)
baseline_mae <- mean(abs(mean_pred - true))
baseline_rmse <- sqrt(mean((mean_pred - true)^2))
baseline_r2 <- yardstick::rsq_trad_vec(true, rep(mean_pred, length(true)))
baseline_metrics <- tibble(
Metric = c("MAE", "RMSE", "R²"),
Value = c(baseline_mae, baseline_rmse, baseline_r2)
)
baseline_metrics%>% knitr::kable()| Metric | Value |
|---|---|
| MAE | 7.279057 |
| RMSE | 9.152769 |
| R² | -0.000159 |
#SHAP analysis (feature importance)
pred_data <- s_test
shap_contrib <- predict(best_model, pred_data, predcontrib = TRUE)
shap_dt <- as.data.table(shap_contrib)
bias_col <- names(shap_dt)[ncol(shap_dt)]
feature_cols <- setdiff(names(shap_dt), bias_col)
mean_abs_shap <- shap_dt[, lapply(.SD, function(x) mean(abs(x))), .SDcols = feature_cols]
mean_abs_shap_tidy <- melt(
mean_abs_shap,
variable.name = "feature",
value.name = "mean_abs_shap"
)
top10 <- mean_abs_shap_tidy %>%
arrange(desc(mean_abs_shap)) %>%
slice_head(n = 10) %>%
mutate(feature = fct_reorder(feature, mean_abs_shap))
ggplot(top10, aes(x = feature, y = mean_abs_shap)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(
title = "Top 10 Features by Mean |SHAP| (XGBoost Regression)",
x = NULL,
y = "Mean Absolute SHAP Value"
) +
theme_minimal(base_size = 13)4.4.2 Regression - Random Forest
set.seed(42)
# 70/15/15 split (not stratified because regression)
split1 <- initial_split(clean, prop = 0.85)
train_val <- training(split1)
test_set <- testing(split1)
split2 <- initial_split(train_val, prop = 70/85)
train_set <- training(split2)
val_set <- testing(split2)
cat("Train:", nrow(train_set),
" | Val:", nrow(val_set),
" | Test:", nrow(test_set), "\n")## Train: 70107 | Val: 15023 | Test: 15024
# Recipe — impute, one-hot encode, remove zv
rec <- recipe(diabetes_risk_score ~ ., data = train_set) %>%
step_impute_median(all_numeric_predictors()) %>%
step_impute_mode(all_nominal_predictors()) %>%
step_unknown(all_nominal_predictors()) %>%
step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
step_zv(all_predictors())
rec_prep <- prep(rec, training = train_set, retain = TRUE)# Bake train / validation / test
train_baked <- bake(rec_prep, train_set)
val_baked <- bake(rec_prep, val_set)
test_baked <- bake(rec_prep, test_set)
y_train <- train_baked$diabetes_risk_score
y_val <- val_baked$diabetes_risk_score
y_test <- test_baked$diabetes_risk_score
X_train <- train_baked %>% select(-diabetes_risk_score)
X_val <- val_baked %>% select(-diabetes_risk_score)
X_test <- test_baked %>% select(-diabetes_risk_score)
# Sanity check
cat("Missing values in predictors (train/val/test):",
sum(!complete.cases(X_train)), "/",
sum(!complete.cases(X_val)), "/",
sum(!complete.cases(X_test)), "\n")## Missing values in predictors (train/val/test): 0 / 0 / 0
# RF requires X (predictors) and y (target)
X_train <- train_baked %>% select(-diabetes_risk_score)
y_train <- train_baked$diabetes_risk_score
X_val <- val_baked %>% select(-diabetes_risk_score)
y_val <- val_baked$diabetes_risk_score
# Train a reasonable RF model (500 trees)
rf_model <- ranger(
x = X_train,
y = y_train,
num.trees = 500,
mtry = floor(sqrt(ncol(X_train))),
importance = "permutation",
write.forest = TRUE, #required for shap
seed = 42
)## Growing trees.. Progress: 80%. Estimated remaining time: 7 seconds.
## Computing permutation importance.. Progress: 26%. Estimated remaining time: 1 minute, 27 seconds.
## Computing permutation importance.. Progress: 43%. Estimated remaining time: 1 minute, 24 seconds.
## Computing permutation importance.. Progress: 52%. Estimated remaining time: 1 minute, 28 seconds.
## Computing permutation importance.. Progress: 69%. Estimated remaining time: 56 seconds.
## Computing permutation importance.. Progress: 79%. Estimated remaining time: 42 seconds.
## Computing permutation importance.. Progress: 90%. Estimated remaining time: 21 seconds.
## Ranger result
##
## Call:
## ranger(x = X_train, y = y_train, num.trees = 500, mtry = floor(sqrt(ncol(X_train))), importance = "permutation", write.forest = TRUE, seed = 42)
##
## Type: Regression
## Number of trees: 500
## Sample size: 70107
## Number of independent variables: 63
## Mtry: 7
## Target node size: 5
## Variable importance mode: permutation
## Splitrule: variance
## OOB prediction error (MSE): 2.837341
## R squared (OOB): 0.9653071
# Compute metrics
test_pred <- predict(rf_model, data = test_baked)$predictions
test_mae <- mean(abs(test_pred - y_test))
test_rmse <- sqrt(mean((test_pred - y_test)^2))
test_r2 <- yardstick::rsq_trad_vec(y_test, test_pred)
tibble(
Metric = c("MAE", "RMSE", "R²"),
Value = c(test_mae, test_rmse, test_r2)
) %>% knitr::kable()| Metric | Value |
|---|---|
| MAE | 1.2955275 |
| RMSE | 1.6768921 |
| R² | 0.9664282 |
# Baseline predictor: always predicts mean of training target
mean_pred_rf <- mean(y_train)
baseline_mae_rf <- mean(abs(mean_pred_rf - y_test))
baseline_rmse_rf <- sqrt(mean((mean_pred_rf - y_test)^2))
baseline_r2_rf <- yardstick::rsq_trad_vec(y_test, rep(mean_pred_rf, length(y_test)))
baseline_rf_tbl <- tibble(
Metric = c("MAE", "RMSE", "R²"),
Value = c(baseline_mae_rf, baseline_rmse_rf, baseline_r2_rf)
)
baseline_rf_tbl %>% knitr::kable()| Metric | Value |
|---|---|
| MAE | 7.279057 |
| RMSE | 9.152769 |
| R² | -0.000159 |
# Prediction wrapper for ranger
rf_pred <- function(object, newdata) {
predict(object, data = newdata)$predictions
}
# Use a small test sample to keep SHAP fast
set.seed(42)
X_sample <- X_test %>% slice_sample(n = 200)
# Compute SHAP values (nsim = 10 for speed)
shap_mat <- fastshap::explain(
rf_model,
X = as.data.frame(X_train),
newdata = as.data.frame(X_sample),
pred_wrapper = rf_pred,
nsim = 10
)
# Convert SHAP matrix → data.frame
shap_df <- as.data.frame(shap_mat)
# Mean absolute SHAP per feature
shap_summary <- shap_df %>%
summarise(across(everything(), ~ mean(abs(.x)))) %>%
tidyr::pivot_longer(everything(),
names_to = "feature",
values_to = "mean_abs_shap") %>%
arrange(desc(mean_abs_shap))
# Top 10 features
top10 <- shap_summary %>% slice_head(n = 10)
# Plot
ggplot(top10, aes(x = fct_reorder(feature, mean_abs_shap),
y = mean_abs_shap)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(
title = "Top 10 SHAP-like Feature Importance — Random Forest",
x = NULL,
y = "Mean |SHAP| Contribution"
) +
theme_minimal(base_size = 13)4.4.3 Classification - Logistic Regression
# Data preparation
clean <- clean %>%
dplyr::select(-diabetes_risk_score) %>%
dplyr::mutate(
diagnosed_diabetes = factor(diagnosed_diabetes, levels = c(0, 1))
)
# Train / Validation / Test split (70 / 15 / 15)
set.seed(123)
idx_train <- caret::createDataPartition(
clean$diagnosed_diabetes,
p = 0.7,
list = FALSE
)
train_data <- clean[idx_train, ]
temp_data <- clean[-idx_train, ]
idx_val <- caret::createDataPartition(
temp_data$diagnosed_diabetes,
p = 0.5,
list = FALSE
)
val_data <- temp_data[idx_val, ]
test_data <- temp_data[-idx_val, ]
# Baseline model (majority class)
majority_class <- names(which.max(table(train_data$diagnosed_diabetes)))
baseline_preds <- factor(
rep(majority_class, nrow(test_data)),
levels = c(0, 1)
)
baseline_cm <- caret::confusionMatrix(
baseline_preds,
test_data$diagnosed_diabetes,
positive = "1"
)
baseline_metrics <- data.frame(
Accuracy = round(baseline_cm$overall["Accuracy"], 4),
Precision = round(baseline_cm$byClass["Precision"], 4),
Recall = round(baseline_cm$byClass["Recall"], 4),
F1_Score = round(baseline_cm$byClass["F1"], 4)
)
print("Baseline metrics:")## [1] "Baseline metrics:"
## Accuracy Precision Recall F1_Score
## Accuracy 0.6002 0.6002 1 0.7501
# One-hot encoding
dummy <- caret::dummyVars(
diagnosed_diabetes ~ .,
data = train_data,
fullRank = TRUE
)
X_train <- predict(dummy, train_data)
X_val <- predict(dummy, val_data)
X_test <- predict(dummy, test_data)
y_train <- train_data$diagnosed_diabetes
y_val <- val_data$diagnosed_diabetes
y_test <- test_data$diagnosed_diabetes
# Imputation (train only)
imputer <- caret::preProcess(X_train, method = "medianImpute")
X_train <- predict(imputer, X_train)
X_val <- predict(imputer, X_val)
X_test <- predict(imputer, X_test)
# Feature scaling
scaler <- caret::preProcess(X_train, method = c("center", "scale"))
X_train <- predict(scaler, X_train)
X_val <- predict(scaler, X_val)
X_test <- predict(scaler, X_test)
# Ensure identical columns
train_cols <- colnames(X_train)
X_val <- X_val[, train_cols, drop = FALSE]
X_test <- X_test[, train_cols, drop = FALSE]
stopifnot(
identical(colnames(X_train), colnames(X_val)),
identical(colnames(X_train), colnames(X_test))
)
# Logistic Regression (LASSO)
set.seed(123)
cv_model <- glmnet::cv.glmnet(
x = X_train,
y = y_train,
family = "binomial",
alpha = 1
)
best_lambda <- cv_model$lambda.min
# Threshold selection (validation)
val_probs <- predict(
cv_model,
newx = X_val,
s = best_lambda,
type = "response"
)
thresholds <- seq(0.1, 0.9, by = 0.01)
f1_scores <- sapply(thresholds, function(t) {
preds <- factor(ifelse(val_probs >= t, 1, 0), levels = c(0, 1))
cm <- caret::confusionMatrix(preds, y_val, positive = "1")
cm$byClass["F1"]
})
best_threshold <- thresholds[which.max(f1_scores)]
print(paste("Best threshold:", best_threshold))## [1] "Best threshold: 0.64"
# Final test evaluation
test_probs <- predict(
cv_model,
newx = X_test,
s = best_lambda,
type = "response"
)
test_preds <- factor(
ifelse(test_probs >= best_threshold, 1, 0),
levels = c(0, 1)
)
test_cm <- caret::confusionMatrix(
test_preds,
y_test,
positive = "1"
)
cat("\nLogistic Regression Confusion Matrix:\n")##
## Logistic Regression Confusion Matrix:
## Reference
## Prediction 0 1
## 0 5666 1244
## 1 340 7772
final_metrics <- data.frame(
Accuracy = round(test_cm$overall["Accuracy"], 4),
Precision = round(test_cm$byClass["Precision"], 4),
Recall = round(test_cm$byClass["Recall"], 4),
F1_Score = round(test_cm$byClass["F1"], 4)
)
print("Logistic Regression metrics:")## [1] "Logistic Regression metrics:"
## Accuracy Precision Recall F1_Score
## Accuracy 0.8946 0.9581 0.862 0.9075
# ROC-AUC
roc_obj <- pROC::roc(
response = y_test,
predictor = as.numeric(test_probs),
levels = c("0", "1"),
direction = "<"
)
print(paste("ROC-AUC:", round(pROC::auc(roc_obj), 4)))## [1] "ROC-AUC: 0.934"
# SHAP: Top 10 Mean |SHAP| (Global)
# Predictor object (glmnet-compatible)
X_train_df <- as.data.frame(X_train)
train_cols <- colnames(X_train)
predictor <- iml::Predictor$new(
model = cv_model,
data = X_train_df,
y = y_train,
predict.function = function(model, newdata) {
# Add missing columns
missing_cols <- setdiff(train_cols, colnames(newdata))
if (length(missing_cols) > 0) newdata[missing_cols] <- 0
# Remove extra columns
extra_cols <- setdiff(colnames(newdata), train_cols)
if (length(extra_cols) > 0)
newdata <- newdata[, !(colnames(newdata) %in% extra_cols), drop = FALSE]
# Reorder and convert to matrix
newdata <- as.matrix(newdata[, train_cols, drop = FALSE])
predict(
model,
newx = newdata,
s = best_lambda,
type = "response"
)
}
)
# Compute SHAP-based feature importance (mean |SHAP|)
shap_imp <- iml::FeatureImp$new(
predictor,
loss = "ce"
)
# Extract results and select top 10
top10_shap <- shap_imp$results %>%
arrange(desc(importance)) %>%
slice_head(n = 10) %>%
mutate(feature = reorder(feature, importance))
# Horizontal bar plot
ggplot(top10_shap, aes(x = feature, y = importance)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(
title = "Top 10 Features by Mean |SHAP| Value",
x = NULL,
y = "Mean Absolute SHAP Value"
) +
theme_minimal(base_size = 13)4.4.4 Classification - XGBoost Classifier
set.seed(42)
# Splits (stratified 70/15/15)
split_1 <- initial_split(clean, prop = 0.85, strata = diagnosed_diabetes)
train_val <- training(split_1)
test_set <- testing(split_1)
prop_train_in_trainval <- 70/85
split_2 <- initial_split(train_val, prop = prop_train_in_trainval, strata = diagnosed_diabetes)
train_set <- training(split_2)
val_set <- testing(split_2)
cat("Rows - train:", nrow(train_set), " val:", nrow(val_set), " test:", nrow(test_set), "\n")## Rows - train: 70106 val: 15024 test: 15024
##
## 0 1
## 0.3998374 0.6001626
##
## 0 1
## 0.3998269 0.6001731
##
## 0 1
## 0.3998269 0.6001731
# Recipe: impute, one-hot encode, remove zero-variance
rec <- recipe(diagnosed_diabetes ~ ., data = train_set) %>% step_impute_median(all_numeric_predictors()) %>%
step_impute_mode(all_nominal_predictors()) %>%
step_unknown(all_nominal_predictors()) %>%
step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
step_zv(all_predictors())
rec_prep <- prep(rec, training = train_set, retain = TRUE)
# Bake train/val/test and build X/y
train_baked <- bake(rec_prep, new_data = train_set)
val_baked <- bake(rec_prep, new_data = val_set)
test_baked <- bake(rec_prep, new_data = test_set)
# Create X and y
y_train <- train_baked$diagnosed_diabetes
y_val <- val_baked$diagnosed_diabetes
y_test <- test_baked$diagnosed_diabetes
X_train <- train_baked %>% dplyr::select(-diagnosed_diabetes)
X_val <- val_baked %>% dplyr::select(-diagnosed_diabetes)
X_test <- test_baked %>% dplyr::select(-diagnosed_diabetes)
#Quick sanity check (should be zero after imputation)
cat("NA counts (train):", sum(!complete.cases(X_train)),
" (val):", sum(!complete.cases(X_val)),
" (test):", sum(!complete.cases(X_test)), "\n")## NA counts (train): 0 (val): 0 (test): 0
# Convert to sparse matrices and DMatrices (with checks)
# Coerce labels to numeric 0/1
y_train_vec <- as.numeric(as.character(y_train))
y_val_vec <- as.numeric(as.character(y_val))
y_test_vec <- as.numeric(as.character(y_test))
# Create sparse model matrices
sparse_train <- sparse.model.matrix(~ . -1, data = X_train)
sparse_val <- sparse.model.matrix(~ . -1, data = X_val)
sparse_test <- sparse.model.matrix(~ . -1, data = X_test)
# Diagnostics
cat("Rows sparse_train:", nrow(sparse_train), " Labels:", length(y_train_vec), "\n")## Rows sparse_train: 70106 Labels: 70106
## Rows sparse_val: 15024 Labels: 15024
## Rows sparse_test: 15024 Labels: 15024
# Final checks and DMatrix creation
if (nrow(sparse_train) != length(y_train_vec)) stop("Mismatch train rows vs labels")
if (nrow(sparse_val) != length(y_val_vec)) stop("Mismatch val rows vs labels")
if (nrow(sparse_test) != length(y_test_vec)) stop("Mismatch test rows vs labels")
dtrain <- xgb.DMatrix(data = sparse_train, label = y_train_vec)
dval <- xgb.DMatrix(data = sparse_val, label = y_val_vec)
dtest <- xgb.DMatrix(data = sparse_test, label = y_test_vec)
# Baseline: Majority Class
# Ensure outcome is factor with correct levels
train_set <- train_set %>%
mutate(diagnosed_diabetes = factor(diagnosed_diabetes, levels = c(0, 1)))
test_set <- test_set %>%
mutate(diagnosed_diabetes = factor(diagnosed_diabetes, levels = c(0, 1)))
# Majority class from TRAIN set
majority_class <- names(which.max(table(train_set$diagnosed_diabetes)))
cat("Majority class (train):", majority_class, "\n")## Majority class (train): 1
# Baseline predictions on TEST set
baseline_results <- tibble(
truth = test_set$diagnosed_diabetes,
pred = factor(rep(majority_class, nrow(test_set)), levels = c(0, 1))
)
# Compute metrics (positive class = "1")
baseline_metrics_xgb <- tibble(
metric = c("accuracy", "precision (pos=1)", "recall (pos=1)", "F1 (pos=1)"),
value = c(
accuracy_vec(baseline_results$truth, baseline_results$pred),
precision_vec(baseline_results$truth, baseline_results$pred, estimator = "binary",event_level = "second"),
recall_vec(baseline_results$truth, baseline_results$pred, estimator = "binary",event_level = "second"),
f_meas_vec(baseline_results$truth, baseline_results$pred, estimator = "binary",event_level = "second")
)
)
print(baseline_metrics_xgb)## # A tibble: 4 × 2
## metric value
## <chr> <dbl>
## 1 accuracy 0.600
## 2 precision (pos=1) 0.600
## 3 recall (pos=1) 1
## 4 F1 (pos=1) 0.750
# Hyperparameter grid
param_grid <- expand.grid(
eta = c(0.01, 0.1),
max_depth = c(3, 6),
subsample = c(0.8, 1),
colsample_bytree = c(0.8, 1),
min_child_weight = c(1, 5)
)
param_grid <- as_tibble(param_grid)
param_grid## # A tibble: 32 × 5
## eta max_depth subsample colsample_bytree min_child_weight
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.01 3 0.8 0.8 1
## 2 0.1 3 0.8 0.8 1
## 3 0.01 6 0.8 0.8 1
## 4 0.1 6 0.8 0.8 1
## 5 0.01 3 1 0.8 1
## 6 0.1 3 1 0.8 1
## 7 0.01 6 1 0.8 1
## 8 0.1 6 1 0.8 1
## 9 0.01 3 0.8 1 1
## 10 0.1 3 0.8 1 1
## # ℹ 22 more rows
# Grid search with early stopping (validate on val, monitor AUC)
best_auc <- 0
best_model <- NULL
best_params <- NULL
best_nrounds <- NULL
for(i in seq_len(nrow(param_grid))) {
params <- list(
booster = "gbtree",
objective = "binary:logistic",
eval_metric = "auc",
eta = param_grid$eta[i],
max_depth = param_grid$max_depth[i],
subsample = param_grid$subsample[i],
colsample_bytree = param_grid$colsample_bytree[i],
min_child_weight = param_grid$min_child_weight[i],
scale_pos_weight = sum(y_train_vec == 0) / sum(y_train_vec == 1)
)
watchlist <- list(train = dtrain, eval = dval)
xgb_mod <- xgb.train(
params = params,
data = dtrain,
nrounds = 5000,
watchlist = watchlist,
early_stopping_rounds = 50,
verbose = 0
)
val_auc <- xgb_mod$best_score
cat("Tried params:", i, " val_auc=", val_auc, "\n")
if (!is.null(val_auc) && val_auc > best_auc) {
best_auc <- val_auc
best_model <- xgb_mod
best_params <- params
best_nrounds <- xgb_mod$best_iteration
}
}## Tried params: 1 val_auc= 0.9431315
## Tried params: 2 val_auc= 0.9464957
## Tried params: 3 val_auc= 0.9447166
## Tried params: 4 val_auc= 0.9458329
## Tried params: 5 val_auc= 0.9427986
## Tried params: 6 val_auc= 0.9459068
## Tried params: 7 val_auc= 0.9434923
## Tried params: 8 val_auc= 0.9462354
## Tried params: 9 val_auc= 0.9466534
## Tried params: 10 val_auc= 0.9466191
## Tried params: 11 val_auc= 0.9462675
## Tried params: 12 val_auc= 0.9457532
## Tried params: 13 val_auc= 0.9425437
## Tried params: 14 val_auc= 0.9463158
## Tried params: 15 val_auc= 0.9451053
## Tried params: 16 val_auc= 0.945606
## Tried params: 17 val_auc= 0.9443847
## Tried params: 18 val_auc= 0.9462617
## Tried params: 19 val_auc= 0.9447544
## Tried params: 20 val_auc= 0.9461457
## Tried params: 21 val_auc= 0.946011
## Tried params: 22 val_auc= 0.9460361
## Tried params: 23 val_auc= 0.9450045
## Tried params: 24 val_auc= 0.94582
## Tried params: 25 val_auc= 0.9465466
## Tried params: 26 val_auc= 0.9467387
## Tried params: 27 val_auc= 0.9464008
## Tried params: 28 val_auc= 0.9462841
## Tried params: 29 val_auc= 0.9425437
## Tried params: 30 val_auc= 0.9462185
## Tried params: 31 val_auc= 0.9452762
## Tried params: 32 val_auc= 0.9458197
## Best validation AUC: 0.9467387
## $booster
## [1] "gbtree"
##
## $objective
## [1] "binary:logistic"
##
## $eval_metric
## [1] "auc"
##
## $eta
## [1] 0.1
##
## $max_depth
## [1] 3
##
## $subsample
## [1] 0.8
##
## $colsample_bytree
## [1] 1
##
## $min_child_weight
## [1] 5
##
## $scale_pos_weight
## [1] 0.6662151
## [1] 84
# Choose threshold on validation (maximize F1)
# Predict on validation
val_proba <- predict(best_model, dval)
thresholds <- seq(0.01, 0.99, by = 0.01)
f1s <- map_dbl(thresholds, ~ {
preds <- factor(ifelse(val_proba >= .x, 1, 0), levels = c(0,1))
f_meas_vec(factor(y_val_vec, levels = c(0,1)), preds, estimator = "binary",event_level = "second")
})
best_idx <- which.max(f1s)
best_threshold <- thresholds[best_idx]
cat("Best threshold (by F1 on val):", best_threshold, " F1=", f1s[best_idx], "\n")## Best threshold (by F1 on val): 0.28 F1= 0.9285036
# Defensive final evaluation & recovery chunk
recreate_results <- function() {
# If results already exists, make sure it's sane
if (exists("results")) {
if (is.data.frame(results) && all(c("truth","prob","pred") %in% names(results))) {
message("Using existing `results` object.")
return(results)
} else {
warning("`results` exists but doesn't have required columns (truth, prob, pred). Will try to recreate.")
}
}
# Need y_test_vec (numeric 0/1) or y_test (factor)
if (!exists("y_test_vec")) stop("y_test_vec not found: can't recreate results without test labels.")
if (!is.numeric(y_test_vec)) {
y_test_vec <- as.numeric(as.character(y_test_vec))
}
n_test <- length(y_test_vec)
# Build pred_proba from best_model (in memory) or saved file
pred_proba <- NULL
# 1) if already exists, validate length
if (exists("pred_proba")) {
if (length(pred_proba) != n_test) {
warning("pred_proba exists but length doesn't match y_test_vec. Ignoring pred_proba and recomputing.")
pred_proba <- NULL
} else {
message("Using existing `pred_proba` object.")
}
}
# 2) try to predict with best_model
if (is.null(pred_proba)) {
if (exists("best_model") && inherits(best_model, "xgb.Booster")) {
if (!exists("dtest")) stop("dtest DMatrix not found. Cannot predict with best_model without dtest.")
message("Predicting with in-memory best_model...")
pred_proba <- predict(best_model, dtest)
} else {
# try to load saved model file
model_file <- "xgb_best.model"
if (file.exists(model_file)) {
message("Loading saved model from ", model_file)
try({
best_model_loaded <- xgb.load(model_file)
if (!exists("dtest")) stop("dtest DMatrix not found. Cannot predict with loaded model without dtest.")
pred_proba <- predict(best_model_loaded, dtest)
}, silent = TRUE)
} else {
stop("No in-memory best_model and no saved model file 'xgb_best.model'. Cannot create predictions.")
}
}
}
# Validate pred_proba length
if (is.null(pred_proba) || length(pred_proba) != n_test) {
stop("pred_proba either NULL or length mismatch after attempts to compute it.")
}
# Determine threshold: prefer best_threshold; else if val_proba and y_val_vec exist compute F1; else default 0.5
threshold_to_use <- 0.5
if (exists("best_threshold")) {
threshold_to_use <- best_threshold
message("Using existing best_threshold = ", threshold_to_use)
} else if (exists("val_proba") && exists("y_val_vec")) {
# compute threshold maximizing F1 on validation
message("Computing best threshold on validation set (max F1)...")
thr_seq <- seq(0.01, 0.99, by = 0.01)
f1s <- sapply(thr_seq, function(t) {
preds_val <- factor(ifelse(val_proba >= t, 1, 0), levels = c(0,1))
yardstick::f_meas_vec(factor(y_val_vec, levels = c(0,1)), preds_val, estimator = "binary",event_level = "second")
})
threshold_to_use <- thr_seq[which.max(f1s)]
message("Computed threshold_from_val = ", threshold_to_use, " (max F1 on val = ", max(f1s, na.rm = TRUE), ")")
} else {
message("No best_threshold or validation probs found; defaulting threshold = 0.5")
}
# Build results tibble
results_new <- tibble(
truth = factor(y_test_vec, levels = c(0,1)),
prob = as.numeric(pred_proba),
pred = factor(ifelse(as.numeric(pred_proba) >= threshold_to_use, 1, 0), levels = c(0,1))
)
message("Recreated `results` (n = ", nrow(results_new), "). Threshold used = ", threshold_to_use)
return(results_new)
}
# Run recovery & compute final metrics
results <- tryCatch(recreate_results(), error = function(e) { stop("Failed to recreate results: ", e$message) })
# Defensive type enforcement
results <- results %>%
mutate(
prob = as.numeric(prob),
pred = factor(ifelse(prob >= best_threshold, 1, 0), levels = c(0, 1)),
truth = factor(truth, levels = c(0, 1))
)
# Compute classification metrics (class decision)
acc <- accuracy_vec(results$truth, results$pred)
prec <- precision_vec(results$truth, results$pred, estimator = "binary",event_level = "second")
rec <- recall_vec(results$truth, results$pred, estimator = "binary",event_level = "second")
f1 <- f_meas_vec(results$truth, results$pred, estimator = "binary",event_level = "second")
# Compute AUC (pROC preferred)
auc_p <- tryCatch({
roc_obj <- pROC::roc(
response = as.numeric(as.character(results$truth)),
predictor = results$prob,
quiet = TRUE
)
as.numeric(pROC::auc(roc_obj))
}, error = function(e) NA_real_)
final_tbl <- tibble(
metric = c("accuracy","precision (pos=1)","recall (pos=1)","F1 (pos=1)","ROC-AUC"),
value = c(acc, prec, rec, f1, auc_p)
)
print(final_tbl)## # A tibble: 5 × 2
## metric value
## <chr> <dbl>
## 1 accuracy 0.920
## 2 precision (pos=1) 0.999
## 3 recall (pos=1) 0.868
## 4 F1 (pos=1) 0.929
## 5 ROC-AUC 0.946
# Confusion matrix (rows = Truth, cols = Pred)
print(table(Truth = results$truth, Pred = results$pred))## Pred
## Truth 0 1
## 0 6000 7
## 1 1194 7823
# Prepare data for SHAP computation
if (!inherits(sparse_test, c("matrix","dgCMatrix"))) {
pred_data <- sparse.model.matrix(~ . -1, data = X_test)
} else {
pred_data <- sparse_test
}
# Compute SHAP values (feature contributions)
shap_contrib <- predict(best_model, pred_data, predcontrib = TRUE)
# Convert to data.table
shap_dt <- as.data.table(shap_contrib)
# Identify bias column (last column usually)
bias_col <- names(shap_dt)[ncol(shap_dt)]
feature_cols <- setdiff(names(shap_dt), bias_col)
# Compute mean absolute SHAP for each feature
mean_abs_shap <- shap_dt[, lapply(.SD, function(x) mean(abs(x))), .SDcols = feature_cols]
# Long format
mean_abs_shap_tidy <- melt(
mean_abs_shap,
variable.name = "feature",
value.name = "mean_abs_shap"
)
# Top 10 features
top10 <- mean_abs_shap_tidy %>%
arrange(desc(mean_abs_shap)) %>%
slice_head(n = 10) %>%
mutate(feature = fct_reorder(feature, mean_abs_shap))
# Bar Plot
p_bar <- ggplot(top10, aes(x = feature, y = mean_abs_shap)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(
title = "Top 10 Features by Mean |SHAP|",
x = NULL,
y = "Mean Absolute SHAP Value"
) +
theme_minimal(base_size = 13)
# Display the plot
print(p_bar)4.5 Model Assessment
4.5.1 Regression
- Metrics: MAE, RMSE, and R²
- Baseline: mean-based predictor
| Metric | XGBoost | Random Forest | Baseline (Mean Predictor) |
|---|---|---|---|
| MAE | 0.1568 | 1.2955 | 7.2791 |
| RMSE | 0.2031 | 1.6769 | 9.1528 |
| R² | 0.9995 | 0.9664 | -0.0002 |
Both regression models outperform the mean predictor baseline.
XGBoost outperforms Random Forest in accuracy and having lower error.
4.5.2 Classification
- Metrics: accuracy, precision, recall, F1-score, and ROC-AUC
- Baseline: majority-class predictor
| Metric | Baseline | Logistic Regression | XGBoost |
|---|---|---|---|
| Accuracy | 0.6002 | 0.8946 | 0.920 |
| Precision | 0.6002 | 0.9581 | 0.999 |
| Recall | 1.0000 | 0.8620 | 0.868 |
| F1-score | 0.7501 | 0.9075 | 0.929 |
| ROC-AUC | NA | 0.934 | 0.946 |
Both classification models outperform the baseline.
XGBoost achieves the best overall results and is the preferred model for this task.
5. Evaluation
5.1 Results Evaluation & Interpretation
Problem 1: Diabetes Risk Score Prediction (XGBoost
Regressor)
The XGBoost regressor achieved the lowest RMSE and MAE, and the highest R² among regression models tested.
Predicted risk scores closely followed observed values, indicating good model fit.
SHAP analysis revealed family history, age and physical activity hours per week as the top three predictors.
Insights: Risk scores increase with age and family history, while higher physical activity is associated with lower predicted risk.Lifestyle-related variables play a stronger role in continuous risk estimation than in binary diagnosis.
Suggestion: Use the risk score to stratify individuals into low, medium, and high-risk groups for targeted prevention.
Problem 2: Diabetes Diagnostic Classification (XGBoost
Classifier)
The XGBoost classifier achieved the best performance based on highest accuracy, precision, recall, F1-score and ROC AUC.
High recall indicates effective identification of diabetic patients, reducing false negatives.
Confusion matrix analysis revealed that most classification errors were false negatives rather than false positives, highlighting areas for further sensitivity improvement while maintaining acceptable screening performance.
SHAP analysis identified HbA1c, glucose level and family history as the top three influential predictors.
Insight: Diagnostic predictions are primarily driven by clinical biomarkers (HbA1c and glucose), while family history provides additional risk context that supports early detection.
Suggestion: The classifier is best suited as a clinical decision-support or screening tool and should complement, rather than replace, professional medical diagnosis.
5.2 Review Processes
- Business objectives, data mining goals and success criteria were
fulfilled and achieved.
Future Improvements:
Incorporate larger and more diverse datasets to improve model robustness.
Include additional clinical variables such as longitudinal lab results to enhance predictive performance.
Apply real-world testing in clinical settings.
Explore model deployment and integration into healthcare decision-support systems.
6. Conclusion
All data mining goals were achieved, and the key questions on diabetes risk prediction, diagnosis accuracy, and feature importance were answered, fulfilling the success criteria.
For risk score prediction, the XGBoost Regressor achieved MAE = 0.1568, RMSE = 0.2031, and R² = 0.9995, outperforming Random Forest and baseline models.
For diagnostic classification, the XGBoost Classifier achieved the best performance with accuracy = 0.920, precision=0.999, recall = 0.868, F1 score = 0.929, and ROC-AUC = 0.946. It outperformed Logistic Regression.
SHAP analysis showed that family history drives risk score prediction, while HbA1c dominates diagnostic classification.
Overall, the study highlights the importance of explainable XGBoost models in targeted prevention and informed clinical decision making for diabetes management.
7. References
Hoffman,K (2021). Machine learning: How to handle class imbalance. Medium. https://medium.com/analytics-vidhya/machine-learning-how-to-handle-class-imbalance-920e48c3e970
International Diabetes Federation (2025). Diabetes facts and figures.https://idf.org/about-diabetes/diabetes-facts-figures/
Thalla M. K. (2025). Diabetes Health Indicators Dataset [Data set]. Kaggle. https://www.kaggle.com/datasets/mohankrishnathalla/diabetes-health-indicators-dataset?