1. Introduction

This report presents a comprehensive analysis of the Diabetes Risk Factors & Diagnosis Dataset, including data exploration, cleaning, regression modeling, and cluster analysis to identify diabetes risk subgroups.

1.1 Core objectives

Complete the full-process dataset processing: Clarify the basic characteristics (scale, variable types, distribution, etc.) and quality issues (duplicate values, outliers, class imbalance, etc.) of the dataset through exploratory analysis, and optimize data quality through data cleaning to provide a reliable data foundation for subsequent modeling; Build regression models: Based on patients’ demographic characteristics (age, gender, etc.) and clinical indicators (HbA1c, BMI, etc.), compare linear and nonlinear machine learning algorithms to construct a high-precision blood glucose level prediction model and identify key factors affecting blood glucose; Explore diabetes risk subgroups: Identify patient subgroups with different diabetes risk profiles through cluster analysis, and clarify the differences in clinical characteristics and diabetes prevalence among each subgroup; Realize individual risk assessment: Construct an interpretable classification model to accurately determine the diabetes risk level of individual patients, and reveal the core clinical factors driving risk, providing reference for stratified management and early screening of diabetes risk.

2. Data Exploration & Validation

2.1 Overview

The core objective of data exploration is to understand the basic characteristics of the dataset, the distribution patterns of variables, and the quality of the data, and to identify potential problems such as anomalies, imbalances, and rare categories in the data, so as to provide a clear basis for subsequent data cleaning and data analysis.

2.2 Raw Dataset Basic Information

The raw dataset consisted of 100,000 rows × 16 columns, spanning from 2015 to 2022. It was primarily used for two types of supervised learning tasks: classification and regression. In terms of sample size, 100,000 records constitute a large-scale dataset, providing ample sample support for subsequent modeling. The seven-year time span also ensures the temporal representativeness of the data, avoiding the randomness of data from a single year.

2.3 Data Integrity and Uniqueness

Missing Values: There are no NA values in the dataset, indicating that no information was missed during data collection, but there are outliers that need to be addressed during data cleaning. Duplicate Rows: 14 duplicate records were detected, accounting for 0.014%. Although the percentage is extremely low, duplicate records can lead to statistical bias (such as overestimating the proportion of a certain feature) and overfitting in model training. Therefore, they still need to be removed during the cleaning phase.

2.4 Interpretation of the target variable (diabetes status)

The distribution of the target variable diabetes is as follows: 91,500 entries (91.5%) for those without the disease (0) and 8,500 entries (8.5%) for those with the disease (1), exhibiting a highly unbalanced characteristic. The proportion of diabetic patients in the sample is far lower than that of non-patients, which is consistent with the basic characteristics of diabetes prevalence in the general population (the global prevalence of diabetes in adults is approximately 10%), but it also reflects the “natural imbalance” of the data.

# Module 1: Data Exploration Core Code
# Basic dataset validation
cat("Title: Diabetes Risk Factors & Diagnosis Dataset\n")
## Title: Diabetes Risk Factors & Diagnosis Dataset
cat("Purpose: This dataset can be used for supervised learning, e.g.,\n")
## Purpose: This dataset can be used for supervised learning, e.g.,
cat("- Classification: predict diabetes status (0/1)\n")
## - Classification: predict diabetes status (0/1)
cat("- Regression: predict continuous indicators such as HbA1c or blood glucose level\n")
## - Regression: predict continuous indicators such as HbA1c or blood glucose level
# Path compatibility (use current directory if data folder does not exist)
data_path <- "data/diabetes_dataset.csv"
if (!file.exists(data_path)) {
  data_path <- "./diabetes_dataset.csv" # Alternative path
}
diabetes_raw <- read.csv(data_path, stringsAsFactors = FALSE)

cat("Raw Dataset Basic Info\n")
## Raw Dataset Basic Info
cat("Rows:", nrow(diabetes_raw), "\nCols:", ncol(diabetes_raw))
## Rows: 100000 
## Cols: 16
cat("\nFirst 5 rows:\n")
## 
## First 5 rows:
print(head(diabetes_raw, 5))
##   year gender age location race.AfricanAmerican race.Asian race.Caucasian
## 1 2020 Female  32  Alabama                    0          0              0
## 2 2015 Female  29  Alabama                    0          1              0
## 3 2015   Male  18  Alabama                    0          0              0
## 4 2015   Male  41  Alabama                    0          0              1
## 5 2016 Female  52  Alabama                    1          0              0
##   race.Hispanic race.Other hypertension heart_disease smoking_history   bmi
## 1             0          1            0             0           never 27.32
## 2             0          0            0             0           never 19.95
## 3             0          1            0             0           never 23.76
## 4             0          0            0             0           never 27.32
## 5             0          0            0             0           never 23.75
##   hbA1c_level blood_glucose_level diabetes
## 1         5.0                 100        0
## 2         5.0                  90        0
## 3         4.8                 160        0
## 4         4.0                 159        0
## 5         6.5                  90        0
if ("year" %in% names(diabetes_raw)) {
  cat("\nYear range:", min(diabetes_raw$year, na.rm = TRUE), "~", max(diabetes_raw$year, na.rm = TRUE), "\n")
} else {
  cat("\nYear range: (No 'year' column found)\n")
}
## 
## Year range: 2015 ~ 2022
# Target variable (diabetes) distribution
diabetes_dist <- diabetes_raw %>%
  group_by(diabetes) %>%
  summarise(Count = n(), Percentage = round(n()/nrow(diabetes_raw)*100, 1))
cat("\nDiabetes Distribution:\n"); print(diabetes_dist)
## 
## Diabetes Distribution:
## # A tibble: 2 × 3
##   diabetes Count Percentage
##      <int> <int>      <dbl>
## 1        0 91500       91.5
## 2        1  8500        8.5
# Content & Structure: variable types + unique counts
cat("\nContent & Structure: Column Types & Unique Levels\n")
## 
## Content & Structure: Column Types & Unique Levels
col_type <- sapply(diabetes_raw, function(x) class(x)[1])
col_unique <- sapply(diabetes_raw, function(x) length(unique(x)))

structure_table <- data.frame(
  Variable = names(diabetes_raw),
  Type = col_type,
  Unique_Values = as.integer(col_unique),
  stringsAsFactors = FALSE
)
print(structure_table)
##                                  Variable      Type Unique_Values
## year                                 year   integer             7
## gender                             gender character             3
## age                                   age   numeric           102
## location                         location character            55
## race.AfricanAmerican race.AfricanAmerican   integer             2
## race.Asian                     race.Asian   integer             2
## race.Caucasian             race.Caucasian   integer             2
## race.Hispanic               race.Hispanic   integer             2
## race.Other                     race.Other   integer             2
## hypertension                 hypertension   integer             2
## heart_disease               heart_disease   integer             2
## smoking_history           smoking_history character             6
## bmi                                   bmi   numeric          4247
## hbA1c_level                   hbA1c_level   numeric            18
## blood_glucose_level   blood_glucose_level   integer            18
## diabetes                         diabetes   integer             2
# Data quality check (missing + duplicate values)
cat("\nData Quality Check\n")
## 
## Data Quality Check
# Missing values:
missing_summary <- diabetes_raw %>%
  summarise_all(~sum(is.na(.))) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Missing_Count") %>%
  filter(Missing_Count > 0)
if (nrow(missing_summary) == 0) {
  cat("1) Missing Values: None detected (NA)\n")
} else {
  cat("1) Missing Values: Found NA in these columns:\n")
  print(missing_summary)
}
## 1) Missing Values: None detected (NA)
# Duplicate values:
duplicate_count <- sum(duplicated(diabetes_raw))
cat("2) Duplicate Rows:", duplicate_count,
    "(", round(duplicate_count/nrow(diabetes_raw)*100, 4), "%)\n")
## 2) Duplicate Rows: 14 ( 0.014 %)

3. Numeric & Categorical Variable Analysis

3.1 Variable Structure and Types

The dataset contains 16 variables, divided into two categories: numerical and categorical/character variables: Numerical variables Covering year, age, 5 ethnic identifiers (race series), hypertension, heart disease, body mass index (BMI), glycated hemoglobin (hbA1c), blood glucose, and the target variable (diabetes); among them, ethnic identifier, hypertension, heart disease, and diabetes are binary variables (0/1 encoding). Categorical/Characteristic variables Gender, location, and smoking history, containing 3, 55, and 6 classes of values respectively.

3.2 Key Variable Statistical Characteristics

Age: Minimum 0.08 (abnormal for very young children), median 43 years, maximum 80 years, 1st percentile 1.08 years, indicating the presence of infants and young children (inconsistent with the target population for diabetes studies). Gender: “Female” (58,552 entries, 58.55%), “Male” (41,430 entries, 41.43%), “Other” (18 entries, 0.018%), “Other” is a rare category. BMI: Minimum 10.01, median 27.32, maximum 95.69, 99th percentile 48.79, indicating the presence of extremely high BMI outliers. HbA1c_level: Median 5.8, Mean 5.53, Maximum 9.0, 99th percentile 8.8 (≥6.5 is the clinical diagnostic threshold for diabetes). Blood_glucose_level: Median 140, Mean 138.06, Maximum 300, 99th percentile 280, indicating the presence of abnormally high blood glucose levels.

3.3 Distribution of Categorical Variables

Smoking history: “No Info” (not recorded, 35816 entries, 35.816%) and “never” (never smoked, 35095 entries, 35.095%) accounted for over 70%. The high proportion of the “No Info” category may affect the association analysis. Location: Covering 55 locations, the sample size is evenly distributed across regions (e.g., Hawaii and Iowa each have 2038 entries), indicating good data representativeness.

# Numeric Columns (Range & Quantiles)
# Quantiles: min, 1%, 25%, 50% (median), mean, 75%, 99%, max, standard deviation
cat("\nNumeric Columns (Range & Quantiles)\n")
## 
## Numeric Columns (Range & Quantiles)
num_cols <- names(diabetes_raw)[sapply(diabetes_raw, is.numeric)]
if (length(num_cols) == 0) {
  cat("No numeric columns found.\n")
} else {
  for (col in num_cols) {
    x <- diabetes_raw[[col]]
    cat("\n[", col, "]\n", sep = "")
    print(c(
      Minimum = min(x, na.rm = TRUE),
      P01_1Percent = quantile(x, 0.01, na.rm = TRUE),
      Q1_25Percent = quantile(x, 0.25, na.rm = TRUE),
      Median_50Percent = median(x, na.rm = TRUE),
      Mean_Average = mean(x, na.rm = TRUE),
      Q3_75Percent = quantile(x, 0.75, na.rm = TRUE),
      P99_99Percent = quantile(x, 0.99, na.rm = TRUE),
      Maximum = max(x, na.rm = TRUE),
      SD_StdDev = sd(x, na.rm = TRUE)
    ))
  }
}
## 
## [year]
##           Minimum   P01_1Percent.1%  Q1_25Percent.25%  Median_50Percent 
##       2015.000000       2015.000000       2019.000000       2019.000000 
##      Mean_Average  Q3_75Percent.75% P99_99Percent.99%           Maximum 
##       2018.360820       2019.000000       2019.000000       2022.000000 
##         SD_StdDev 
##          1.345239 
## 
## [age]
##           Minimum   P01_1Percent.1%  Q1_25Percent.25%  Median_50Percent 
##           0.08000           1.08000          24.00000          43.00000 
##      Mean_Average  Q3_75Percent.75% P99_99Percent.99%           Maximum 
##          41.88586          60.00000          80.00000          80.00000 
##         SD_StdDev 
##          22.51684 
## 
## [race.AfricanAmerican]
##           Minimum   P01_1Percent.1%  Q1_25Percent.25%  Median_50Percent 
##         0.0000000         0.0000000         0.0000000         0.0000000 
##      Mean_Average  Q3_75Percent.75% P99_99Percent.99%           Maximum 
##         0.2022300         0.0000000         1.0000000         1.0000000 
##         SD_StdDev 
##         0.4016648 
## 
## [race.Asian]
##           Minimum   P01_1Percent.1%  Q1_25Percent.25%  Median_50Percent 
##         0.0000000         0.0000000         0.0000000         0.0000000 
##      Mean_Average  Q3_75Percent.75% P99_99Percent.99%           Maximum 
##         0.2001500         0.0000000         1.0000000         1.0000000 
##         SD_StdDev 
##         0.4001145 
## 
## [race.Caucasian]
##           Minimum   P01_1Percent.1%  Q1_25Percent.25%  Median_50Percent 
##          0.000000          0.000000          0.000000          0.000000 
##      Mean_Average  Q3_75Percent.75% P99_99Percent.99%           Maximum 
##          0.198760          0.000000          1.000000          1.000000 
##         SD_StdDev 
##          0.399069 
## 
## [race.Hispanic]
##           Minimum   P01_1Percent.1%  Q1_25Percent.25%  Median_50Percent 
##         0.0000000         0.0000000         0.0000000         0.0000000 
##      Mean_Average  Q3_75Percent.75% P99_99Percent.99%           Maximum 
##         0.1988800         0.0000000         1.0000000         1.0000000 
##         SD_StdDev 
##         0.3991595 
## 
## [race.Other]
##           Minimum   P01_1Percent.1%  Q1_25Percent.25%  Median_50Percent 
##          0.000000          0.000000          0.000000          0.000000 
##      Mean_Average  Q3_75Percent.75% P99_99Percent.99%           Maximum 
##          0.199980          0.000000          1.000000          1.000000 
##         SD_StdDev 
##          0.399987 
## 
## [hypertension]
##           Minimum   P01_1Percent.1%  Q1_25Percent.25%  Median_50Percent 
##         0.0000000         0.0000000         0.0000000         0.0000000 
##      Mean_Average  Q3_75Percent.75% P99_99Percent.99%           Maximum 
##         0.0748500         0.0000000         1.0000000         1.0000000 
##         SD_StdDev 
##         0.2631505 
## 
## [heart_disease]
##           Minimum   P01_1Percent.1%  Q1_25Percent.25%  Median_50Percent 
##          0.000000          0.000000          0.000000          0.000000 
##      Mean_Average  Q3_75Percent.75% P99_99Percent.99%           Maximum 
##          0.039420          0.000000          1.000000          1.000000 
##         SD_StdDev 
##          0.194593 
## 
## [bmi]
##           Minimum   P01_1Percent.1%  Q1_25Percent.25%  Median_50Percent 
##         10.010000         14.600000         23.630000         27.320000 
##      Mean_Average  Q3_75Percent.75% P99_99Percent.99%           Maximum 
##         27.320767         29.580000         48.790100         95.690000 
##         SD_StdDev 
##          6.636783 
## 
## [hbA1c_level]
##           Minimum   P01_1Percent.1%  Q1_25Percent.25%  Median_50Percent 
##          3.500000          3.500000          4.800000          5.800000 
##      Mean_Average  Q3_75Percent.75% P99_99Percent.99%           Maximum 
##          5.527507          6.200000          8.800000          9.000000 
##         SD_StdDev 
##          1.070672 
## 
## [blood_glucose_level]
##           Minimum   P01_1Percent.1%  Q1_25Percent.25%  Median_50Percent 
##          80.00000          80.00000         100.00000         140.00000 
##      Mean_Average  Q3_75Percent.75% P99_99Percent.99%           Maximum 
##         138.05806         159.00000         280.00000         300.00000 
##         SD_StdDev 
##          40.70814 
## 
## [diabetes]
##           Minimum   P01_1Percent.1%  Q1_25Percent.25%  Median_50Percent 
##         0.0000000         0.0000000         0.0000000         0.0000000 
##      Mean_Average  Q3_75Percent.75% P99_99Percent.99%           Maximum 
##         0.0850000         0.0000000         1.0000000         1.0000000 
##         SD_StdDev 
##         0.2788831
# Unique levels + top10 levels + rare levels count
cat("\n\nCategorical Columns (Levels & Top Frequencies)\n\n")
## 
## 
## Categorical Columns (Levels & Top Frequencies)
cat_cols <- names(diabetes_raw)[!sapply(diabetes_raw, is.numeric)]
cat_cols <- setdiff(cat_cols, c("diabetes"))
cat_cols_to_show <- head(cat_cols, 10)

for (col in cat_cols_to_show) {
  cat("\n[", col, "]\n", sep = "")
  freq <- sort(table(diabetes_raw[[col]]), decreasing = TRUE)
  cat("Unique levels:", length(freq), "\n")
  cat("Top 10 levels:\n")
  print(head(freq, 10))

  # Count rare levels (<0.1% of total observations)
  rare_levels <- sum((freq / sum(freq)) < 0.001)
  cat("Rare levels (<0.1%):", rare_levels, "\n")
}
## 
## [gender]
## Unique levels: 3 
## Top 10 levels:
## 
## Female   Male  Other 
##  58552  41430     18 
## Rare levels (<0.1%): 1 
## 
## [location]
## Unique levels: 55 
## Top 10 levels:
## 
##     Hawaii       Iowa   Kentucky   Nebraska   Arkansas    Florida  Minnesota 
##       2038       2038       2038       2038       2037       2037       2037 
## New Jersey    Alabama   Delaware 
##       2037       2036       2036 
## Rare levels (<0.1%): 0 
## 
## [smoking_history]
## Unique levels: 6 
## Top 10 levels:
## 
##     No Info       never      former     current not current        ever 
##       35816       35095        9352        9286        6447        4004 
## Rare levels (<0.1%): 0
# Check special categories in smoking_history
cat("\nSpecial Category Check (Missing-like categories)\n")
## 
## Special Category Check (Missing-like categories)
N <- nrow(diabetes_raw)
if ("smoking_history" %in% names(diabetes_raw)) {
  noinfo_n <- sum(diabetes_raw$smoking_history == "No Info", na.rm=TRUE)
  cat("smoking_history == 'No Info':", noinfo_n, " (", round(noinfo_n/N*100, 3), "%)\n", sep="")
}
## smoking_history == 'No Info':35816 (35.816%)
if ("gender" %in% names(diabetes_raw)) {
  other_n <- sum(diabetes_raw$gender == "Other", na.rm=TRUE)
  cat("gender == 'Other':", other_n, " (", round(other_n/N*100, 4), "%)\n", sep="")
}
## gender == 'Other':18 (0.018%)

4. Exploratory Data Analysis Visualizations

4.1 Exploratory Visualization Results

Diabetes Label Distribution: Visually presents the imbalance of the target variable (far more non-diabetic samples than diabetic samples).

HbA1c Level Distribution: HbA1c is mainly concentrated in the 4-6 range, consistent with the characteristics of the non-diabetic population.

HbA1c by Diabetes: The median HbA1c in the diabetic group is significantly higher than that in the non-diabetic group, validating the strong correlation between HbA1c and diabetes.

# Module 2: 3 Key Visualizations
base_theme <- theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))

cat("\n=== 3 Key Visualizations (EDA) ===\n")
## 
## === 3 Key Visualizations (EDA) ===
# 1) Bar chart: diabetes distribution
# Purpose: Check target variable balance (critical for classification tasks)
if ("diabetes" %in% names(diabetes_raw)) {
  p1 <- ggplot(diabetes_raw, aes(x = factor(diabetes), fill = factor(diabetes))) +
    geom_bar() +
    labs(title = "Diabetes Label Distribution",
         x = "diabetes (0/1)", y = "Count", fill = "diabetes") +
    scale_fill_brewer(palette = "Set2") +
    base_theme
  print(p1)
}

# 2) Histogram: HbA1c/BMI distribution
# Purpose: Inspect distribution shape and potential outliers of key clinical variables
if ("hbA1c_level" %in% names(diabetes_raw)) {
  p2 <- ggplot(diabetes_raw, aes(x = hbA1c_level)) +
    geom_histogram(bins = 40, fill = "darkseagreen3", color = "white") +
    labs(title = "HbA1c Level Distribution",
         x = "HbA1c Level", y = "Count") +
    base_theme
  print(p2)
} else if ("bmi" %in% names(diabetes_raw)) {
  p2 <- ggplot(diabetes_raw, aes(x = bmi)) +
    geom_histogram(bins = 40, fill = "steelblue", color = "white") +
    labs(title = "BMI Distribution",
         x = "BMI", y = "Count") +
    base_theme
  print(p2)
}

# 3) Boxplot: HbA1c/BMI by diabetes status
# Purpose: Visually compare key indicators between diabetes positive/negative groups
if (all(c("diabetes", "hbA1c_level") %in% names(diabetes_raw))) {
  p3 <- ggplot(diabetes_raw, aes(x = factor(diabetes), y = hbA1c_level, fill = factor(diabetes))) +
    geom_boxplot(outlier.alpha = 0.3) +
    labs(title = "HbA1c by Diabetes Status",
         x = "diabetes (0/1)", y = "HbA1c Level", fill = "diabetes") +
    scale_fill_brewer(palette = "Pastel1") +
    base_theme
  print(p3)
} else if (all(c("diabetes", "bmi") %in% names(diabetes_raw))) {
  p3 <- ggplot(diabetes_raw, aes(x = factor(diabetes), y = bmi, fill = factor(diabetes))) +
    geom_boxplot(outlier.alpha = 0.3) +
    labs(title = "BMI by Diabetes Status",
         x = "diabetes (0/1)", y = "BMI", fill = "diabetes") +
    scale_fill_brewer(palette = "Pastel1") +
    base_theme
  print(p3)
}

4.2 Summary

This exploration identified the basic characteristics and potential problems of the dataset: the data was largely complete but contained duplicate rows; the target variable was categorically imbalanced; outliers were present in variables such as age, BMI, and blood glucose; and gender and smoking history included rare/unrecorded categories. These issues need to be addressed specifically in subsequent data cleaning to improve data quality.

5. Data Cleaning Process

5.1 Overview

The core objective of data cleaning is to address quality issues in data, such as duplicate rows, outliers, and rare categories, thereby improving the accuracy, rationality, and usability of the data and providing a high-quality dataset for subsequent supervised learning modeling.

5.2 Background and Objectives

During the data exploration phase, 14 duplicate records were found in the raw data, the age included outliers under 18 years old, the gender included the rare “Other” category, and there were extreme outliers in BMI, blood glucose, and glycated hemoglobin.

The cleanup objectives are to remove completely duplicate records to ensure data uniqueness. Convert key numerical fields such as age, BMI, HbA1c, and blood glucose to numeric types and handle invalid inputs. Remove records where the key modeling variables HbA1c and blood glucose are missing. Filter outliers based on medically reasonable ranges to reduce noise from data entry or measurement errors. Eliminate combinations of HbA1c and blood glucose that clearly contradict each other to avoid data that does not conform to clinical logic and interferes with the model. Standardize the format of categorical variables such as gender and smoking history while retaining categories with practical meaning.

# =========================
# Module 3: Data Cleaning (As Required)
# goals:
# 1) Delete completely duplicate records
# 2) BMI outliers: Set the range to 15-40, and set NA for values exceeding this range, then fill in the gaps using the BMI median.
# 3) Categorical field standardization
# 4) Data validation: No missing values, no BMI out of range
# =========================

library(tidyverse, warn.conflicts = FALSE)
library(stringr)
raw_path <- data_path


# Step 0: Load raw data
diabetes_raw <- read.csv(raw_path, stringsAsFactors = FALSE)

cat("\n=== Data Cleaning Start ===\n")
## 
## === Data Cleaning Start ===
cat("Raw rows:", nrow(diabetes_raw), "| Raw cols:", ncol(diabetes_raw), "\n")
## Raw rows: 100000 | Raw cols: 16
# 1) Delete duplicate records
diabetes_clean <- diabetes_raw %>%
  distinct()

dup_removed <- nrow(diabetes_raw) - nrow(diabetes_clean)
cat("\n1) Remove Duplicates\n")
## 
## 1) Remove Duplicates
cat("Rows:", nrow(diabetes_raw), "->", nrow(diabetes_clean),
    "| Removed:", dup_removed, "duplicate rows\n")
## Rows: 100000 -> 99986 | Removed: 14 duplicate rows
# 2) Column naming normalization
diabetes_clean <- diabetes_clean %>%
  rename_with(~ gsub(":", ".", .x))

cat("\n2) Rename Columns\n")
## 
## 2) Rename Columns
cat("Renamed ':' to '.' in column names.\n")
## Renamed ':' to '.' in column names.
# 3) Outlier handling
cat("\n3) Numeric Clipping\n")
## 
## 3) Numeric Clipping
# Ensure that the key column is numeric.
diabetes_clean <- diabetes_clean %>%
  mutate(
    bmi = as.numeric(bmi),
    hbA1c_level = as.numeric(hbA1c_level),
    blood_glucose_level = as.numeric(blood_glucose_level)
  ) %>%
  mutate(
    bmi = pmin(pmax(bmi, 15), 40),                   
    # BMI limited to [15, 40]
    hbA1c_level = pmin(hbA1c_level, 8.8),            
    # HbA1c upper limit 8.8
    blood_glucose_level = pmin(blood_glucose_level, 280) 
    # Upper limit of blood sugar 280
  )

cat("BMI after clip -> Min:", round(min(diabetes_clean$bmi), 2),
    "| Max:", round(max(diabetes_clean$bmi), 2), "\n")
## BMI after clip -> Min: 15 | Max: 40
cat("HbA1c after clip -> Max:", round(max(diabetes_clean$hbA1c_level), 2), "\n")
## HbA1c after clip -> Max: 8.8
cat("Glucose after clip -> Max:", round(max(diabetes_clean$blood_glucose_level), 2), "\n")
## Glucose after clip -> Max: 280
# 4) Added glucose_group (grouping by blood glucose level)
cat("\n4) Create glucose_group\n")
## 
## 4) Create glucose_group
diabetes_clean <- diabetes_clean %>%
  mutate(
    glucose_group = case_when(
      blood_glucose_level <= 140 ~ "Low",
      blood_glucose_level <= 200 ~ "Medium",
      blood_glucose_level <= 260 ~ "High",
      TRUE ~ "Very High"
    )
  )

# 5) Data Validation
cat("\n5) Data Validation\n")
## 
## 5) Data Validation
# 5.1 Duplicate row check
dup_after <- sum(duplicated(diabetes_clean))
cat("Duplicated rows after cleaning:", dup_after, "\n")
## Duplicated rows after cleaning: 1
# 5.2 Missing value check
na_total <- sum(is.na(diabetes_clean))
cat("Total NA after cleaning:", na_total, "\n")
## Total NA after cleaning: 0
# 5.3 Range check
bmi_bad <- sum(diabetes_clean$bmi < 15 | diabetes_clean$bmi > 40, na.rm = TRUE)
hba1c_bad <- sum(diabetes_clean$hbA1c_level > 8.8, na.rm = TRUE)
glu_bad <- sum(diabetes_clean$blood_glucose_level > 280, na.rm = TRUE)

cat("BMI out-of-range count:", bmi_bad, "\n")
## BMI out-of-range count: 0
cat("HbA1c > 8.8 count:", hba1c_bad, "\n")
## HbA1c > 8.8 count: 0
cat("Glucose > 280 count:", glu_bad, "\n")
## Glucose > 280 count: 0
# 6) Save the cleaned data
out_path <- "data/diabetes_dataset_cleaned.csv"
write.csv(diabetes_clean, file = out_path, row.names = FALSE, fileEncoding = "UTF-8")

cat("\n=== Save Cleaned Dataset ===\n")
## 
## === Save Cleaned Dataset ===
cat("Saved to:", out_path, "\n")
## Saved to: data/diabetes_dataset_cleaned.csv
cat("Cleaned Dimension:", nrow(diabetes_clean), "rows x", ncol(diabetes_clean), "cols\n")
## Cleaned Dimension: 99986 rows x 17 cols
cat("=== Data Cleaning Completed ===\n")
## === Data Cleaning Completed ===

5.3 Implementation Steps

  1. Loaded the raw diabetes dataset containing 100,000 rows and 16 columns.

  2. Removed completely duplicated records using a full-row comparison, eliminating 14 duplicate rows and ensuring that each observation was unique.

  3. Standardized column names by replacing the character : with . to improve naming consistency and compatibility with data processing functions.

  4. Handled outliers in key numeric variables using a clipping approach:

BMI values were restricted to the range 15–40. HbA1c values were capped at 8.8. Blood glucose levels were capped at 280. No records were removed and no statistical imputation was applied.

  1. Created a new categorical variable, glucose_group, by grouping blood glucose levels into four categories: Low, Medium, High, and Very High.

  2. Performed data validation to confirm that there were no duplicated records, no missing values introduced during processing, and that all numeric variables fell within their predefined ranges.

  3. Saved the cleaned dataset as a CSV file for subsequent analysis.

5.4 Summary

The data cleaning process improved the quality and consistency of the diabetes dataset by removing duplicate records, standardizing variable names, and controlling extreme values in key numeric variables through clipping. A new glucose-level grouping variable was also created to support stratified analysis. After validation, the final dataset contains 99,986 rows with consistent formatting and valid numeric ranges, making it suitable for further statistical analysis and modeling tasks.

6. Regression Modeling (Linear, Tree, Random Forest)

6.1 Overview

In the data analysis stage of this project, we are committed to exploring the complex relationship between key health indicators and patients’ Blood Glucose Level through Regression Analysis. Considering the complexity and potential nonlinear characteristics of medical data, we constructed and compared three different mechanisms of machine learning models: Lasso regression (linear regularization model), random forest (nonlinear integration model) and stacking Ensemble.

After data reprocessing and strict model evaluation, the results show that there are huge differences in the performance of different algorithms. The analysis shows that Random Forest performed well in all indicators, significantly surpassing the linear model. This reveals that there is a significant nonlinear interaction in the data set, so we select the random forest as the final prediction model of this project.

6.2 Research Questions

According to the project requirements, we need to identify the problems to be solved through data processing. For the regression analysis part, we put forward the following two core issues:

RQ1: Based on the patient’s demographic characteristics (such as age, gender) and clinical indicators (such as HbA1c, BMI), can a high-precision model be built to predict blood sugar levels?

RQ2: Compared with traditional linear statistical models, can nonlinear machine learning algorithms, such as random forests, better capture the complex relationship between health indicators and blood sugar?

6.3 Research Objectives

The specific research objectives of this part are as follows:

  1. Use Lasso, random forest and stacking technology to build a prediction model.

  2. By comparing RMSE and R², quantify the generalization ability of different models on the test set.

  3. Identify the most critical factors affecting blood sugar levels through model explanatory tools and verify the necessity of nonlinear modeling.

6.4 Expected Outcomes:

  1. High-precision model: Obtain a robust regression model of R² exceeding 0.7, which can reliably predict the patient’s blood sugar.

  2. Model comparison insight: Prove the superiority of nonlinear models over linear models in processing such physiological data.

  3. Key feature identification: Identify the top three variables,such as HbA1c or age that contribute the most to the prediction results to provide reference for medical screening.

# Module 4: Regression Analysis
# ==============================================================================
# Step 1: Check and install any missing packages
# ==============================================================================
required_packages <- c("tidyverse", "caret", "glmnet", "randomForest", "Metrics", "corrplot", "gridExtra")
new_packages <- required_packages[!(required_packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

library(tidyverse)    
library(caret)        
library(glmnet)       
library(randomForest)
library(Metrics)      
library(corrplot)     
library(gridExtra)    

# ==============================================================================
# Step 2: Data Preparation
# ==============================================================================
cat("\n--- Reading data ---\n")
## 
## --- Reading data ---
df <- read.csv("data/diabetes_dataset_cleaned.csv", stringsAsFactors = FALSE)

# 2.1 Examine the distribution of the target variable
p1 <- ggplot(df, aes(x = blood_glucose_level)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  theme_minimal() +
  labs(title = "Target variable distribution: blood glucose level", x = "blood sugar", y = "Frequency")

# 2.2 Correlation analysis of numerical variables
# Filtering numerical columns
numeric_vars <- df %>% select_if(is.numeric)
cor_matrix <- cor(numeric_vars, use = "complete.obs")

# Plotting correlation heatmaps
cat("Generating EDA charts...\n")
## Generating EDA charts...
corrplot(cor_matrix, method = "color", type = "upper",
         tl.col = "black", tl.cex = 0.8,
         title = "Correlation Matrix", mar = c(0,0,1,0))

print(p1) 

# ==============================================================================
# Step 3:Data preprocessing
# ==============================================================================
X <- df %>% select(-blood_glucose_level)
y <- df$blood_glucose_level

# Define feature type
categorical_features <- c("gender", "location", "smoking_history") 
# Ensure that categorical variables are present in the data.
categorical_features <- intersect(categorical_features, colnames(X))
numeric_features <- setdiff(colnames(X), categorical_features)

# Standardized numerical features
preprocess_steps <- preProcess(X[, numeric_features], method = c("center", "scale"))
X_numeric_processed <- predict(preprocess_steps, X[, numeric_features])

# One-hot coding classification features
if(length(categorical_features) > 0){
  dummy <- dummyVars(" ~ .", data = X[, categorical_features])
  X_categorical_encoded <- predict(dummy, newdata = X[, categorical_features])
  X_processed <- cbind(X_numeric_processed, X_categorical_encoded)
} else {
  X_processed <- X_numeric_processed
}

# Split the dataset (80% training, 20% testing)
set.seed(42)
train_index <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- X_processed[train_index, ]
X_test <- X_processed[-train_index, ]
y_train <- y[train_index]
y_test <- y[-train_index]

# Matrixing (adapted for glmnet)
X_train_matrix <- as.matrix(X_train)
X_test_matrix <- as.matrix(X_test)

# ==============================================================================
# Step 4: Modeling – This corresponds to the theoretical concept of "model fitting".
# ==============================================================================
cat("\n--- Training Lasso to return to normal ---\n")
## 
## --- Training Lasso to return to normal ---
# Model A: Lasso
lasso_cv <- cv.glmnet(x = X_train_matrix, y = y_train, alpha = 1, nfolds = 5)
lasso_model <- glmnet(x = X_train_matrix, y = y_train, alpha = 1, lambda = lasso_cv$lambda.min)

y_pred_lasso_train <- predict(lasso_model, X_train_matrix)
y_pred_lasso_test <- predict(lasso_model, X_test_matrix)

cat("\n--- Random Forest is being trained. ---\n")
## 
## --- Random Forest is being trained. ---
# Model B: Random Forest (Parameter Optimization)
rf_optimized <- randomForest(
  x = X_train, y = y_train,
  ntree = 300, mtry = 10, maxdepth = 15, nodesize = 3,
  importance = TRUE, keep.forest = TRUE
)

y_pred_rf_train <- predict(rf_optimized, X_train)
y_pred_rf_test <- predict(rf_optimized, X_test)

cat("\n--- Training a stacking model. ---\n")
## 
## --- Training a stacking model. ---
# Model C: Stacking
stack_train <- data.frame(lasso = as.vector(y_pred_lasso_train), rf = as.vector(y_pred_rf_train), y = y_train)
stack_test <- data.frame(lasso = as.vector(y_pred_lasso_test), rf = as.vector(y_pred_rf_test), y = y_test)

# Using linear regression as the meta-learner
stack_model <- lm(y ~ ., data = stack_train)
y_pred_stack <- predict(stack_model, stack_test)

# ==============================================================================
# Step 5: Model Diagnostics – “Testing the Hypothesis” of the Corresponding Theory
# ==============================================================================
cat("\n--- Execution model diagnostics (based on stacked models) ---\n")
## 
## --- Execution model diagnostics (based on stacked models) ---
# Calculate residuals
residuals_stack <- y_test - y_pred_stack
diagnostic_df <- data.frame(
  Predicted = y_pred_stack,
  Residuals = residuals_stack
)

# 5.1 Residual vs. Predicted Value 
p_resid <- ggplot(diagnostic_df, aes(x = Predicted, y = Residuals)) +
  geom_point(alpha = 0.5, color = "darkblue") +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Residual Plot", x = "Predicted", y = "Residual") +
  theme_minimal()

# 5.2 Q-Q plot (check for normality)
p_qq <- ggplot(diagnostic_df, aes(sample = Residuals)) +
  stat_qq(color = "darkblue", alpha = 0.5) +
  stat_qq_line(color = "red") +
  labs(title = "Residual Analysis - Normal Q-Q Plot", x = "Theoretical quantiles", y = "Sample quantiles") +
  theme_minimal()

# 5.3 Actual value vs. predicted value
p_fit <- ggplot(data.frame(Actual = y_test, Predicted = y_pred_stack), aes(x = Actual, y = Predicted)) +
  geom_point(alpha = 0.5, color = "forestgreen") +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Predicted vs Actual Plot", x = "Actual", y = "Predicted") +
  theme_minimal()

grid.arrange(p_resid, p_qq, p_fit, ncol = 3)

# ==============================================================================
# Step 6: Model Evaluation and Results Interpretation – Corresponding to the "Evaluation and Interpretation" of the Theory
# ==============================================================================
cat("\n====================== Final Model Evaluation Table ======================\n")
## 
## ====================== Final Model Evaluation Table ======================
# Define the evaluation function
calc_metrics <- function(actual, pred, model_name) {
  r2 <- R2(actual, pred)
  rmse_val <- rmse(actual, pred)
  mae_val <- mae(actual, pred)
  return(data.frame(Model = model_name, R2 = r2, RMSE = rmse_val, MAE = mae_val))
}

# Summary Results
results <- rbind(
  calc_metrics(y_test, as.vector(y_pred_lasso_test), "Lasso"),
  calc_metrics(y_test, y_pred_rf_test, "Random Forest"),
  calc_metrics(y_test, y_pred_stack, "Stacking")
)

print(results)
##           Model        R2     RMSE      MAE
## 1         Lasso 0.1616773 36.62113 30.54724
## 2 Random Forest 0.7441867 20.29623 16.84022
## 3      Stacking 0.7440471 20.32161 17.33117
cat("\n====================== Key features (based on random forest) ======================\n")
## 
## ====================== Key features (based on random forest) ======================
# Feature Importance Visualization
rf_imp <- importance(rf_optimized) %>%
  as.data.frame() %>%
  rownames_to_column("Feature") %>%
  arrange(desc(`%IncMSE`)) %>% # 
  head(10)

p_imp <- ggplot(rf_imp, aes(x = reorder(Feature, `%IncMSE`), y = `%IncMSE`)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Feature Importance(Top 10)", x = "Feature", y = "%IncMSE") +
  theme_minimal()

print(p_imp)

cat("\nAll steps are complete. Check the pop-up window.\n")
## 
## All steps are complete. Check the pop-up window.

6.5 Results Summary

We use 80% of the data for training and 20% for testing. By comparing the performance of the three models on the test set, the results are shown in the following table:

1)The victory of the nonlinear model: The R² of the random forest reached 0.744, which means that the model successfully explained about 74.4% of the fluctuations in blood glucose data. In contrast, the R²of Lasso regression is only 0.162. This huge performance gap strongly proves that there is a high degree of nonlinear relationship between independent variables and dependent variables.

2)Stacking model vs. Single model: Despite the use of a complex Stacking model, its performance (R²=0.7440) is almost equal to or even slightly lower that of the random forest (R²=0.7442). According to the Oakham razor principle, under the condition of equivalent effect, we give priority to random forests with lower calculation cost and easier to explain as the final model.

3)The error was greatly reduced: the RMSE of the random forest was reduced to 20.30, which was nearly 45% lower than the 36.62 of Lasso. This shows that the prediction accuracy of the model in practical applications has been qualitatively improved.

6.6 Visualization

(1)Correlation Matrix

The linear correlation between variables is shown. It is worth noting that although the linear correlation coefficient between some variables and blood sugar is not high, they are important in random forests, which further confirms the necessity of nonlinear modeling.

(2)Predicted vs. Actual Plot

Draw for random forest models. Unlike the linear model, the chart shows that the data points are closely surrounded by the diagonal, indicating that the predicted value is highly consistent with the real value.

(3)Residual Plot

The residual plot shows that most of the errors are concentrated near zero. Although there is still a small number of extreme “long tail” phenomena, the overall fit has improved significantly compared with before.

(4)Residual Analysis - Normal Q-Q Plot

It shows an obvious S-shaped curve, indicating that the residual distribution deviates from the theoretical normal distribution at both ends and has light tail characteristics. Given that the random forest is a Non-parametric model, it does not strictly rely on the normality hypothesis of residual, and the overall prediction accuracy of the model is high (R² > 0.74), so this distribution pattern is within the acceptable range.

(5)Feature Importance

Based on random forest algorithm extraction. The figure clearly shows HbA1c_level, age and other characteristics, which are the key factors that dominate the change of blood sugar.

6.7 Summary

This regression analysis has achieved remarkable success. By optimizing the data preprocessing process, the performance of the model has been greatly improved. Random forest proved to be the best algorithm to handle the diabetes data set. It not only raises R² to more than 74%, but also effectively reduces the prediction error. The results show that the prediction of blood glucose levels cannot only rely on simple linear weighting, but must take into account the complex interaction between physiological indicators. The current random forest model has a high practical value and can be used to assist the early screening of potential risks.

7. Classification Model (Logistic Regression)

7.1 Research Question

Can a Classification Model Predict Whether a Patient Has Diabetes?

7.2 Research Objective

The objective of this analysis is to use a classification model to predict whether a patient has diabetes based on a set of clinical features. By treating diabetes status as a binary outcome, this study aims to evaluate how well clinical variables such as HbA1c level, blood glucose level, BMI, age, and patient subgroup membership can be used to distinguish between diabetic and non-diabetic patients. In this project, patient subgroup membership derived from clustering is treated as an additional input feature to support the classification task.

7.3 Data Analysis Approach

This section outlines the overall analytical workflow used for the classification task. K-means clustering is first applied to selected clinical variables to identify patient subgroups with similar characteristics. The resulting cluster labels are used as additional information in the analysis.

Next, a logistic regression model is used as a classification approach to predict diabetes status. Clinical features and patient subgroup membership are included as input variables. The dataset is split into training and testing sets to evaluate the model’s performance on unseen data.

Model performance is assessed using accuracy, ROC-AUC, and a confusion matrix. Visualisation methods are also used to help interpret the classification results. The code implementation for this classification task is provided in Module 5.

# Module 5: Cluster Analysis & Diabetes Risk Classification
# Research Question: Are there distinct subgroups of patients with different diabetes risk profiles that may require differentiated clinical attention?
# Step 0 — Initialize Environment
rm(list = ls()[!ls() %in% c("out_path")]); gc() # Preserve key variables
##           used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells 2758580 147.4    7544374  403.0   7544374  403.0
## Vcells 4855729  37.1  374437353 2856.8 465131209 3548.7
# Step 1 — Load Cleaned Data, Standardize Column Names & Set Target Label
dat <- read_csv(out_path) |>
  clean_names()

stopifnot(is.data.frame(dat), "diabetes" %in% names(dat))
dat$diabetes <- factor(dat$diabetes, levels = c(0,1), labels = c("No","Yes"))

# Fix HbA1c column name (if needed)
if (!("hba1c_level" %in% names(dat)) && ("hb_a1c_level" %in% names(dat))) {
  names(dat)[names(dat)=="hb_a1c_level"] <- "hba1c_level"
}

cat("Dataset Dimensions - Rows:", nrow(dat), "Cols:", ncol(dat), "\n")
## Dataset Dimensions - Rows: 99986 Cols: 17
print(table(dat$diabetes))
## 
##    No   Yes 
## 91486  8500
# Step 2 — Visualization: Diabetes Outcome Distribution
barplot(table(dat$diabetes),
        main="Diabetes Outcome Distribution",
        xlab="Diabetes Status", ylab="Patient Count")

png("01_outcome_distribution.png", width=900, height=600, res=150)
barplot(table(dat$diabetes),
        main="Diabetes Outcome Distribution",
        xlab="Diabetes Status", ylab="Patient Count")
dev.off()
## png 
##   2
# Step 3 — Visualization: Key Clinical Variable Distributions
key_vars <- c("age","bmi","hba1c_level","blood_glucose_level")

# stopifnot(all(key_vars %in% names(dat)))
#
# par(mfrow=c(2,2))
# for(v in key_vars){
#   hist(dat[[v]], main=paste("Histogram:", v), xlab=v, col="grey")
# }
# par(mfrow=c(1,1))
#
# png("02_key_variables_hist.png", width=1200, height=800, res=150)
# par(mfrow=c(2,2))
# for(v in key_vars){
#   hist(dat[[v]], main=paste("Histogram:", v), xlab=v, col="grey")
# }
# par(mfrow=c(1,1))
# dev.off()


# Step 4 — Prepare Clustering Input: Select Variables & Standardize
# Select clinically meaningful variables for clustering
cluster_vars <- c("age","bmi","hba1c_level","blood_glucose_level","hypertension","heart_disease")
stopifnot(all(cluster_vars %in% names(dat)))

x_cluster <- dat[, cluster_vars]
x_scaled  <- scale(x_cluster) # Standardize variables to equalize scale impact

# Step 5 — Determine Optimal Cluster Number: Elbow Method (WSS)
# WSS = Within-Cluster Sum of Squares
set.seed(123)
ks <- 2:6 # Test cluster numbers from 2 to 6

wss <- sapply(ks, function(k){
  kmeans(x_scaled, centers=k, nstart=10, iter.max=20)$tot.withinss
})

plot(ks, wss, type="b", pch=19,
     main="Elbow Method for Optimal k (WSS)", xlab="Number of Clusters (k)", ylab="Total Within-Cluster Sum of Squares")

png("03_elbow.png", width=900, height=600, res=150)
plot(ks, wss, type="b", pch=19,
     main="Elbow Method for Optimal k (WSS)", xlab="Number of Clusters (k)", ylab="Total Within-Cluster Sum of Squares")
dev.off()
## png 
##   2
# Step 6 — Determine Optimal Cluster Number: Silhouette Score
# Silhouette Score ranges from -1 to 1; higher values indicate better cluster separation
set.seed(123)
n_all <- nrow(x_scaled)
m_sub <- min(3000, n_all) # Use subset to reduce computational cost
idx_sub <- sample.int(n_all, m_sub)
x_sub <- x_scaled[idx_sub, , drop=FALSE]

sil <- sapply(ks, function(k){
  km_tmp <- kmeans(x_sub, centers=k, nstart=10, iter.max=20)
  ss <- silhouette(km_tmp$cluster, dist(x_sub))
  mean(ss[,3])
})

plot(ks, sil, type="b", pch=19,
     main=paste("Silhouette Score (subset n =", m_sub, ")"),
     xlab="Number of Clusters (k)", ylab="Average Silhouette Score")

png("04_silhouette.png", width=900, height=600, res=150)
plot(ks, sil, type="b", pch=19,
     main=paste("Silhouette Score (subset n =", m_sub, ")"),
     xlab="Number of Clusters (k)", ylab="Average Silhouette Score")
dev.off()
## png 
##   2
# Step 7 — Run K-means Clustering with Optimal k
k <- 2 # Selected based on Elbow and Silhouette methods
set.seed(123)
km <- kmeans(x_scaled, centers=k, nstart=20, iter.max=30)

# Assign cluster labels to original dataset
dat$cluster <- factor(km$cluster)
print(table(dat$cluster))
## 
##     1     2 
## 89475 10511
# Step 8 — Visualization: PCA Projection of Clusters (Subset)
# Reduce dimensionality to 2D using PCA for visualization
set.seed(123)
m_pca <- min(5000, nrow(x_scaled)) # Use subset for faster plotting
idx_pca <- sample.int(nrow(x_scaled), m_pca)

pca <- prcomp(x_scaled[idx_pca, , drop=FALSE], center=FALSE, scale.=FALSE)

plot(pca$x[,1], pca$x[,2],
     col=as.numeric(dat$cluster[idx_pca]), pch=19,
     xlab="Principal Component 1", ylab="Principal Component 2",
     main=paste("PCA Projection of Patient Clusters (subset n =", m_pca, ")"))
legend("topright", legend=levels(dat$cluster),
       col=1:length(levels(dat$cluster)), pch=19, title="Cluster")

png("05_cluster_pca.png", width=900, height=650, res=150)
plot(pca$x[,1], pca$x[,2],
     col=as.numeric(dat$cluster[idx_pca]), pch=19,
     xlab="Principal Component 1", ylab="Principal Component 2",
     main=paste("PCA Projection of Patient Clusters (subset n =", m_pca, ")"))
legend("topright", legend=levels(dat$cluster),
       col=1:length(levels(dat$cluster)), pch=19, title="Cluster")
dev.off()
## png 
##   2
# Step 9 — Cluster Profile Analysis: Clinical Characteristics & Diabetes Prevalence
# Calculate mean clinical characteristics per cluster
cluster_profile <- aggregate(dat[, cluster_vars],
                             by=list(cluster=dat$cluster),
                             FUN=function(x) mean(x, na.rm=TRUE))

# Calculate diabetes prevalence per cluster
cluster_prev <- aggregate(as.numeric(dat$diabetes=="Yes"),
                          by=list(cluster=dat$cluster),
                          FUN=mean)
names(cluster_prev)[2] <- "diabetes_rate"

# Add cluster size information
cluster_n <- as.data.frame(table(dat$cluster))
names(cluster_n) <- c("cluster","n")
cluster_prev <- merge(cluster_prev, cluster_n, by="cluster")

cat("\nCluster Clinical Profile (Mean Values):\n")
## 
## Cluster Clinical Profile (Mean Values):
print(cluster_profile)
##   cluster      age      bmi hba1c_level blood_glucose_level hypertension
## 1       1 39.35981 26.75319    5.490445            136.5055    0.0000000
## 2       2 63.38959 29.80256    5.830644            150.0059    0.7121111
##   heart_disease
## 1     0.0000000
## 2     0.3750357
cat("\nDiabetes Prevalence by Cluster:\n")
## 
## Diabetes Prevalence by Cluster:
print(cluster_prev)
##   cluster diabetes_rate     n
## 1       1    0.06150321 89475
## 2       2    0.28512986 10511
# Save cluster analysis results
write_csv(cluster_profile, "cluster_profile.csv")
write_csv(cluster_prev, "cluster_prevalence.csv")

# Step 10 — Visualization: Diabetes Prevalence by Cluster
barplot(cluster_prev$diabetes_rate,
        names.arg = cluster_prev$cluster,
        main="Diabetes Prevalence by Patient Cluster",
        xlab="Cluster", ylab="Diabetes Prevalence Rate")

png("06_cluster_prevalence.png", width=900, height=600, res=150)
barplot(cluster_prev$diabetes_rate,
        names.arg = cluster_prev$cluster,
        main="Diabetes Prevalence by Patient Cluster",
        xlab="Cluster", ylab="Diabetes Prevalence Rate")
dev.off()
## png 
##   2
# Step 11 — Train-Test Split for Classification Model (80/20)
set.seed(123)
n <- nrow(dat)
train_idx <- sample.int(n, size=floor(0.8*n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Step 12 — Train Interpretable Classification Model: Logistic Regression
# Include cluster membership as a predictor to capture subgroup effects
race_cols <- grep("^race_", names(dat), value=TRUE)

# Convert categorical variables to factor type
train$gender <- as.factor(train$gender); test$gender <- as.factor(test$gender)
train$smoking_history <- as.factor(train$smoking_history); test$smoking_history <- as.factor(test$smoking_history)
train$cluster <- as.factor(train$cluster); test$cluster <- as.factor(test$cluster)

# Define model variables
model_vars <- c("age","bmi","hba1c_level","blood_glucose_level",
                "hypertension","heart_disease",
                "gender","smoking_history",
                race_cols,
                "cluster")
model_vars <- intersect(model_vars, names(dat)) # Ensure variables exist in dataset

# Build model formula
formula_str <- paste("diabetes ~", paste(model_vars, collapse=" + "))
m <- glm(as.formula(formula_str), data=train, family=binomial())

# Generate predictions on test set
prob <- predict(m, newdata=test, type="response")
pred <- factor(ifelse(prob>=0.5, "Yes","No"), levels=c("No","Yes"))

# Step 13 — Numeric Model Evaluation: Confusion Matrix, Accuracy, AUC
# Confusion Matrix
cm <- table(Predicted=pred, Actual=test$diabetes)
cat("\nConfusion Matrix (Test Set):\n")
## 
## Confusion Matrix (Test Set):
print(cm)
##          Actual
## Predicted    No   Yes
##       No  18130   648
##       Yes   147  1073
# Calculate accuracy and AUC
accuracy <- sum(diag(cm))/sum(cm)
roc_obj <- roc(test$diabetes, prob)
auc_val <- auc(roc_obj)

cat("\nModel Performance Metrics (Test Set):\n")
## 
## Model Performance Metrics (Test Set):
cat("Accuracy =", round(as.numeric(accuracy),4), "\n")
## Accuracy = 0.9602
cat("AUC      =", round(as.numeric(auc_val),4), "\n")
## AUC      = 0.9604
# Save evaluation results
write_csv(data.frame(metric=c("Accuracy","AUC"),
                     value=c(as.numeric(accuracy), as.numeric(auc_val))),
          "evaluation_summary.csv")

# Step 14 — Evaluation Plot: ROC Curve
# ROC Curve shows trade-off between Sensitivity and Specificity
plot(roc_obj,
     main=paste0("ROC Curve (AUC = ", round(as.numeric(auc_val),3), ")"),
     lwd=2)
abline(a=0, b=1, lty=2, col="gray") # Reference line (random guess)

png("09_roc_curve.png", width=800, height=600, res=150)
plot(roc_obj,
     main=paste0("ROC Curve (AUC = ", round(as.numeric(auc_val),3), ")"),
     lwd=2)
abline(a=0, b=1, lty=2, col="gray")
dev.off()
## png 
##   2
# Step 15 — Evaluation Plot: Predicted Probability Distributions
# Compare probability distributions between true positive and true negative cases
hist(prob[test$diabetes=="No"], breaks=30, xlim=c(0,1),
     main="Predicted Probability Distribution by True Diabetes Status",
     xlab="Predicted Probability of Diabetes", col="lightblue")
hist(prob[test$diabetes=="Yes"], breaks=30, add=TRUE, col="lightcoral")
legend("topright", legend=c("No Diabetes","Diabetes"), fill=c("lightblue","lightcoral"), bty="n")

png("10_predicted_probability_distribution.png", width=900, height=600, res=150)
hist(prob[test$diabetes=="No"], breaks=30, xlim=c(0,1),
     main="Predicted Probability Distribution by True Diabetes Status",
     xlab="Predicted Probability of Diabetes", col="lightblue")
hist(prob[test$diabetes=="Yes"], breaks=30, add=TRUE, col="lightcoral")
legend("topright", legend=c("No Diabetes","Diabetes"), fill=c("lightblue","lightcoral"), bty="n")
dev.off()
## png 
##   2
# Step 16 — Evaluation Plot: Confusion Matrix Heatmap
cm_mat <- as.matrix(cm)

image(1:ncol(cm_mat), 1:nrow(cm_mat), t(cm_mat),
      axes=FALSE, xlab="Actual Diabetes Status", ylab="Predicted Diabetes Status",
      main="Confusion Matrix Heatmap")
axis(1, at=1:ncol(cm_mat), labels=colnames(cm_mat))
axis(2, at=1:nrow(cm_mat), labels=rownames(cm_mat))
for (i in 1:nrow(cm_mat)) for (j in 1:ncol(cm_mat)) text(j, i, cm_mat[i,j])

png("11_confusion_matrix_heatmap.png", width=800, height=650, res=150)
image(1:ncol(cm_mat), 1:nrow(cm_mat), t(cm_mat),
      axes=FALSE, xlab="Actual Diabetes Status", ylab="Predicted Diabetes Status",
      main="Confusion Matrix Heatmap")
axis(1, at=1:ncol(cm_mat), labels=colnames(cm_mat))
axis(2, at=1:nrow(cm_mat), labels=rownames(cm_mat))
for (i in 1:nrow(cm_mat)) for (j in 1:ncol(cm_mat)) text(j, i, cm_mat[i,j])
dev.off()
## png 
##   2
# Step 17 — Identify Key Diabetes Risk Drivers: Odds Ratios (OR)
# Convert logistic regression coefficients to Odds Ratios for interpretability
coef_est <- coef(m)
or <- exp(coef_est)
ci <- suppressMessages(confint(m))
or_ci <- exp(ci)

# Create Odds Ratio table with 95% Confidence Intervals
or_table <- data.frame(
  term=names(or),
  OR=as.numeric(or),
  CI_low=as.numeric(or_ci[,1]),
  CI_high=as.numeric(or_ci[,2])
)
or_table <- or_table[or_table$term!="(Intercept)", ]
or_table <- or_table[order(or_table$OR, decreasing=TRUE), ]

cat("\nTop 15 Diabetes Risk Drivers (Odds Ratios with 95% CI):\n")
## 
## Top 15 Diabetes Risk Drivers (Odds Ratios with 95% CI):
print(head(or_table, 15))
##                     term         OR    CI_low    CI_high
## 4            hba1c_level 10.3269128 9.5649455 11.1706585
## 20              cluster2  1.6255366 1.2106706  2.1848133
## 7          heart_disease  1.4717010 1.1392650  1.8978990
## 6           hypertension  1.3957743 1.0599960  1.8356866
## 8             genderMale  1.2502490 1.1550129  1.3532736
## 15 race_african_american  1.1338408 1.0031535  1.2817232
## 3                    bmi  1.1233347 1.1151538  1.1316092
## 16            race_asian  1.0713444 0.9467367  1.2124417
## 18         race_hispanic  1.0579636 0.9347348  1.1975102
## 2                    age  1.0469266 1.0443390  1.0495477
## 5    blood_glucose_level  1.0341969 1.0331125  1.0352995
## 17        race_caucasian  1.0018931 0.8846612  1.1346848
## 10   smoking_historyever  0.9798216 0.7986695  1.1997395
## 11 smoking_historyformer  0.9517365 0.8165210  1.1100683
## 12  smoking_historynever  0.8705310 0.7621313  0.9956972
write_csv(or_table, "logistic_or_drivers.csv")

# Step 18 — Visualization: Top Odds Ratio Features
topN <- 10
top_or <- head(or_table, topN)

# Adjust plot margins for long variable names
op <- par(mar = c(5, 14, 4, 2))
barplot(
  rev(top_or$OR),
  names.arg = rev(top_or$term),
  horiz = TRUE,
  las = 1,
  cex.names = 0.9,
  main = "Top 10 Diabetes Risk Drivers (Odds Ratios)",
  xlab = "Odds Ratio (OR)",
  col = "steelblue"
)
abline(v=1, lty=2, col="red") # OR=1 means no effect

par(op)

# Save plot
dev.copy(png, filename="07_top_or_drivers.png", width=1200, height=700, res=150)
## png 
##   3
dev.off()
## png 
##   2
# Step 19 — Driver Heterogeneity: Key Clinical Variables by Cluster (Boxplots)
# Compare distribution of key variables across clusters
par(mfrow=c(2,2))
for(v in key_vars){
  boxplot(dat[[v]] ~ dat$cluster,
          main=paste(v, "by Cluster"),
          xlab="Cluster", ylab=v,
          col=c("lightblue","lightcoral"))
}

par(mfrow=c(1,1))

png("08_drivers_by_cluster.png", width=1200, height=800, res=150)
par(mfrow=c(2,2))
for(v in key_vars){
  boxplot(dat[[v]] ~ dat$cluster,
          main=paste(v, "by Cluster"),
          xlab="Cluster", ylab=v,
          col=c("lightblue","lightcoral"))
}
par(mfrow=c(1,1))
dev.off()
## png 
##   2
# Save complete workspace for future reference
saveRDS(list(dat=dat, x_scaled=x_scaled, km=km, k=k, ks=ks, wss=wss, sil=sil,
             cluster_profile=cluster_profile, cluster_prev=cluster_prev,
             train=train, test=test, m=m, prob=prob, pred=pred, cm=cm,
             roc_obj=roc_obj, auc_val=auc_val, or_table=or_table),
        "workspace_final.rds")

cat("\nCluster Analysis & Classification Modeling Completed Successfully!\n")
## 
## Cluster Analysis & Classification Modeling Completed Successfully!
cat("All output files and plots have been saved to the working directory.\n")
## All output files and plots have been saved to the working directory.

7.4 Model specification and processes

The diabetes prediction task is formulated as a binary classification problem, where the target variable indicates whether a patient has diabetes. A logistic regression model is adopted due to its suitability for binary outcomes and its interpretability.

Prior to model training, K-means clustering is applied to the dataset to generate patient subgroup information. Clinical variables including age, BMI, HbA1c level, blood glucose level, hypertension status, and heart disease status are first standardised to ensure comparable scales across features. K-means clustering is then performed on the standardised data to identify patient subgroups representing different combinations of metabolic and cardiovascular risk factors.

The number of clusters is determined using a combination of the Elbow Method and Silhouette Analysis. The elbow plot based on within-cluster sum of squares (WSS) shows a clear reduction in the rate of improvement at lower values of K, suggesting that most of the cluster structure can be captured with a small number of clusters. In addition, the silhouette plot indicates that K = 2 achieves the highest average silhouette score, implying better cluster cohesion and separation compared to larger values of K. Based on the results of both methods, K = 2 is selected as an appropriate number of clusters. This choice balances cluster quality and interpretability, while also aligning well with the binary nature of the diabetes classification task by grouping patients into lower-risk and higher-risk subgroups.

The resulting cluster assignments are added to the dataset as a categorical variable representing patient subgroup membership. After incorporating the cluster variable, the dataset is randomly split into training and testing sets using an 80/20 ratio, with a fixed random seed to ensure reproducibility. The logistic regression classification model is trained on the training set using key clinical variables, including HbA1c level, blood glucose level, BMI, age, and cluster membership as input features. Continuous variables are standardised prior to model fitting.

Model performance is evaluated on the test set using accuracy, ROC-AUC, and a confusion matrix to assess the model’s predictive ability and discriminative performance.

7.5 Results

The performance of the logistic regression classification model is evaluated on the test dataset using multiple evaluation metrics. Overall, the model demonstrates strong predictive ability in distinguishing between diabetic and non-diabetic patients.

The model achieves an accuracy of 0.9602, indicating that most observations in the test set are correctly classified. In addition, the ROC-AUC score of 0.9604 suggests a high level of discriminative ability across different classification thresholds, showing that the model performs well despite the class imbalance in the data.

The confusion matrix provides further insight into the classification results. Out of all test observations, 18,130 non-diabetic patients are correctly classified as non-diabetic, while 1,073 diabetic patients are correctly identified as diabetic. The number of misclassifications is relatively small, with 147 false positives and 648 false negatives. The large concentration of observations along the diagonal indicates that the model’s predictions are generally stable and reliable.

Further evidence of the model’s performance is provided by the predicted probability distribution by true class. The distribution shows a clear separation between diabetic and non-diabetic patients. Most non-diabetic patients receive predicted probabilities close to zero, while diabetic patients tend to have much higher predicted probabilities, often close to one. The limited overlap between the two distributions corresponds to the observed misclassifications in the confusion matrix and indicates that the model is able to assign diabetes risk at the individual level with a high degree of confidence.

Overall, these results suggest that the logistic regression model generalises well to unseen data and is suitable for predicting diabetes status based on clinical features and patient subgroup membership.

7.6 Discussion

The results of this study indicate that the logistic regression classification model performs well in predicting diabetes status based on clinical features and patient subgroup membership. The high accuracy and ROC-AUC values suggest that the model is able to effectively distinguish between diabetic and non-diabetic patients. The predicted probability distributions further show that the model assigns low diabetes risk to most non-diabetic individuals and high risk to most diabetic individuals, indicating strong confidence in its predictions.

The inclusion of patient subgroup membership derived from K-means clustering plays an important role in improving the interpretability of the classification results. Boxplots comparing key clinical variables across clusters show clear and consistent differences between the two identified subgroups. Patients in Cluster 2 tend to be older and have higher median values of BMI, HbA1c level, and blood glucose level compared to patients in Cluster 1. These characteristics are commonly associated with higher diabetes risk, suggesting that Cluster 2 represents a higher-risk subgroup, while Cluster 1 corresponds to a lower-risk group.

The observed differences across multiple clinical variables indicate that the clustering process captures meaningful combinations of risk factors rather than random variation. This helps explain why cluster membership is a significant predictor in the logistic regression model and supports its inclusion as an additional feature. By incorporating subgroup information, the model is able to account for aggregated risk patterns that may not be fully reflected by individual clinical variables alone.

Despite the strong overall performance, some misclassifications remain, particularly false negatives, where diabetic patients are incorrectly classified as non-diabetic. This suggests that certain patients may exhibit less distinct clinical profiles, making diabetes status more difficult to predict using the current feature set. Future work could explore alternative classification thresholds or more complex models to further reduce misclassification. Additionally, performing clustering within the training set only may help improve the robustness of subgroup-based features.

7.7 Conclusion

This study addresses the question of whether a classification model can predict diabetes status based on clinical features and patient subgroup membership. The results show that a supervised logistic regression model is able to accurately distinguish between diabetic and non-diabetic patients, achieving strong predictive performance on the test dataset. The inclusion of patient subgroups derived from K-means clustering provides additional insight into diabetes risk patterns and contributes to the interpretability of the model. The identified subgroups capture meaningful differences in clinical characteristics, and subgroup membership is shown to be a useful feature for diabetes classification.

Overall, the findings demonstrate that classification-based approaches can be effectively used to predict diabetes status at the individual level using clinical data. This suggests that such models may be valuable for supporting diabetes risk assessment and identifying higher-risk patient groups in a data-driven and interpretable manner.