1. Introduction

1.1 Project Background

Diabetes is a chronic health condition that affects how the body turns food into energy. Early detection and management of blood glucose levels are critical for preventing complications. This project analyzes medical history and demographic data to predict diabetes-related indicators.

1.2 Project Objectives

In accordance with the project requirements, we aim to perform two types of predictive tasks:

  • Classification Task: Predict whether a patient has diabetes (binary classification) based on health indicators such as age, BMI, and HbA1c levels.

  • Regression Task: Predict the specific blood glucose levels (continuous variable) based on related physiological features.

1.3 Dataset Source

We selected the “Diabetes Prediction Dataset” from Kaggle. It contains 100,000 healthcare records, making it suitable for training models. The data represents patients of varying ages and health conditions, including critical indicators like HbA1c_level and blood_glucose_level.

2. Data Import and Overview

We begin by importing the dataset and examining its structure and dimensions.

df_intial <- read.csv("C:/Users/zhang/Desktop/WQD7004/group/submission/intial_diabetes_prediction_dataset.csv")
head(df_intial)
##   gender age hypertension heart_disease smoking_history   bmi HbA1c_level
## 1 Female  80            0             1           never 25.19         6.6
## 2 Female  54            0             0         No Info 27.32         6.6
## 3   Male  28            0             0           never 27.32         5.7
## 4 Female  36            0             0         current 23.45         5.0
## 5   Male  76            1             1         current 20.14         4.8
## 6 Female  20            0             0           never 27.32         6.6
##   blood_glucose_level diabetes
## 1                 140        0
## 2                  80        0
## 3                 158        0
## 4                 155        0
## 5                 155        0
## 6                  85        0
dim(df_intial)
## [1] 100000      9
str(df_intial)
## 'data.frame':    100000 obs. of  9 variables:
##  $ gender             : chr  "Female" "Female" "Male" "Female" ...
##  $ age                : num  80 54 28 36 76 20 44 79 42 32 ...
##  $ hypertension       : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ heart_disease      : int  1 0 0 0 1 0 0 0 0 0 ...
##  $ smoking_history    : chr  "never" "No Info" "never" "current" ...
##  $ bmi                : num  25.2 27.3 27.3 23.4 20.1 ...
##  $ HbA1c_level        : num  6.6 6.6 5.7 5 4.8 6.6 6.5 5.7 4.8 5 ...
##  $ blood_glucose_level: int  140 80 158 155 155 85 200 85 145 100 ...
##  $ diabetes           : int  0 0 0 0 0 0 1 0 0 0 ...

The dataset is substantial, containing 100,000 observations (rows) and 9 variables (columns). It is a mix of numerical types (int and num) and character types (chr).

summary(df_intial)
##     gender               age         hypertension     heart_disease    
##  Length:100000      Min.   : 0.08   Min.   :0.00000   Min.   :0.00000  
##  Class :character   1st Qu.:24.00   1st Qu.:0.00000   1st Qu.:0.00000  
##  Mode  :character   Median :43.00   Median :0.00000   Median :0.00000  
##                     Mean   :41.89   Mean   :0.07485   Mean   :0.03942  
##                     3rd Qu.:60.00   3rd Qu.:0.00000   3rd Qu.:0.00000  
##                     Max.   :80.00   Max.   :1.00000   Max.   :1.00000  
##  smoking_history         bmi         HbA1c_level    blood_glucose_level
##  Length:100000      Min.   :10.01   Min.   :3.500   Min.   : 80.0      
##  Class :character   1st Qu.:23.63   1st Qu.:4.800   1st Qu.:100.0      
##  Mode  :character   Median :27.32   Median :5.800   Median :140.0      
##                     Mean   :27.32   Mean   :5.528   Mean   :138.1      
##                     3rd Qu.:29.58   3rd Qu.:6.200   3rd Qu.:159.0      
##                     Max.   :95.69   Max.   :9.000   Max.   :300.0      
##     diabetes    
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.085  
##  3rd Qu.:0.000  
##  Max.   :1.000
sum(is.na(df_intial))
## [1] 0

The summary of the diabetes column shows a mean of 0.085, indicating that only 8.5% of the patients have diabetes. This severe imbalance will require special attention during modeling. Patient ages range from 0.08 years (infants) to 80 years, and there are no missing values in the file.

3. Data pre-processing

This module performs the data preprocessing procedure, which consists of six phases: data cleaning, one-hot encoding of categorical variables, standardization of numerical variables, outlier detection based on the 3σ rule, outlier treatment via truncation, and elimination of multicollinearity among categorical variables.

The overall goal of this phase is to generate data that can be used for EDA and modeling.

3.1 Data cleaning

Step 1. Missing Value Handling and Duplicate Removal:

We confirm that the dataset contains no missing values (NA) and remove a total of 3,854 fully duplicated rows.

Step 2. Categorical Variables:

  • Remove redundant categories:

Gender: A small number of samples (18 rows) with the value “Other” are removed, retaining only the categories Male and Female.

Smoking History: No modification is made, and the category “No Info” is kept as an independent classification level, as it represents an unknown state in itself.

  • Data Type Conversion:

Explicitly convert the categorical variables gender, smoking_history, hypertension, heart_disease, and diabetes into the Factor data type in R.

Step 3. Numerical Variables:

The method used here is outlier detection and handling via the IQR method, applied to the variables age, bmi, HbA1c_level, and blood_glucose_level.

Notably, while IQR detected many outliers in BMI and blood glucose levels, these values were retained—because such high readings are key pathological features of diabetes (relevant to the dataset’s focus on diabetes prediction).

After this data cleaning process, a cleaned dataset is finalized for the EDA phase.

# preparatory work
library(tidyverse)    # Load the required libraries
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'ggplot2' was built under R version 4.5.2
## Warning: package 'tidyr' was built under R version 4.5.2
## Warning: package 'readr' was built under R version 4.5.2
## Warning: package 'purrr' was built under R version 4.5.2
## Warning: package 'dplyr' was built under R version 4.5.2
## Warning: package 'forcats' was built under R version 4.5.2
## Warning: package 'lubridate' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data_path <- "C:/Users/zhang/Desktop/WQD7004/group/submission/diabetes_prediction_dataset.csv"    # Load the data set
df <- read_csv(data_path)
## Rows: 100000 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): gender, smoking_history
## dbl (7): age, hypertension, heart_disease, bmi, HbA1c_level, blood_glucose_l...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cat("--- Preparation step: Data Loading and Initial Overview ---\n")
## --- Preparation step: Data Loading and Initial Overview ---
print(paste("Initial dataset dimension (Number of rows, number of columns):", nrow(df), ncol(df)))
## [1] "Initial dataset dimension (Number of rows, number of columns): 100000 9"
print(glimpse(df))
## Rows: 100,000
## Columns: 9
## $ gender              <chr> "Female", "Female", "Male", "Female", "Male", "Fem…
## $ age                 <dbl> 80, 54, 28, 36, 76, 20, 44, 79, 42, 32, 53, 54, 78…
## $ hypertension        <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ heart_disease       <dbl> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ smoking_history     <chr> "never", "No Info", "never", "current", "current",…
## $ bmi                 <dbl> 25.19, 27.32, 27.32, 23.45, 20.14, 27.32, 19.31, 2…
## $ HbA1c_level         <dbl> 6.6, 6.6, 5.7, 5.0, 4.8, 6.6, 6.5, 5.7, 4.8, 5.0, …
## $ blood_glucose_level <dbl> 140, 80, 158, 155, 155, 85, 200, 85, 145, 100, 85,…
## $ diabetes            <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## # A tibble: 100,000 × 9
##    gender   age hypertension heart_disease smoking_history   bmi HbA1c_level
##    <chr>  <dbl>        <dbl>         <dbl> <chr>           <dbl>       <dbl>
##  1 Female    80            0             1 never            25.2         6.6
##  2 Female    54            0             0 No Info          27.3         6.6
##  3 Male      28            0             0 never            27.3         5.7
##  4 Female    36            0             0 current          23.4         5  
##  5 Male      76            1             1 current          20.1         4.8
##  6 Female    20            0             0 never            27.3         6.6
##  7 Female    44            0             0 never            19.3         6.5
##  8 Female    79            0             0 No Info          23.9         5.7
##  9 Male      42            0             0 never            33.6         4.8
## 10 Female    32            0             0 never            27.3         5  
## # ℹ 99,990 more rows
## # ℹ 2 more variables: blood_glucose_level <dbl>, diabetes <dbl>
df_clean <- df    # Create a working copy

# data cleaning
# 1. Missing value check (review)
cat("\n--- 1: Missing Value Check ---\n")
## 
## --- 1: Missing Value Check ---
missing_values <- colSums(is.na(df_clean))
print("The number of missing values in each column:")
## [1] "The number of missing values in each column:"
print(missing_values[missing_values > 0])    # Confirm no NA
## named numeric(0)
# 2. Remove duplicate rows
cat("\n--- 2: Remove duplicate rows ---\n")
## 
## --- 2: Remove duplicate rows ---
initial_rows <- nrow(df_clean)
df_clean <- df_clean %>%
   distinct()    # Remove all completely duplicate rows
rows_removed_duplicates <- initial_rows - nrow(df_clean)
print(paste("The number of duplicate rows removed:", rows_removed_duplicates))
## [1] "The number of duplicate rows removed: 3854"
# 3. Standardization and Coding of Categorical Variables
cat("\n--- 3: Standardization and Coding of Categorical Variables ---\n")
## 
## --- 3: Standardization and Coding of Categorical Variables ---
# 3.1. gender
print("the sole value of gender:")
## [1] "the sole value of gender:"
print(unique(df_clean$gender))
## [1] "Female" "Male"   "Other"
rows_removed_gender <- nrow(df_clean) - nrow(df_clean %>% filter(gender != "Other"))    # Remove "Other"
df_clean <- df_clean %>%
  filter(gender != "Other")
print(paste("The number of rows for the 'Other' gender category that have been removed:", rows_removed_gender))
## [1] "The number of rows for the 'Other' gender category that have been removed: 18"
# 3.2. smoking_history 
print("The sole value of smoking_history:")
## [1] "The sole value of smoking_history:"
print(unique(df_clean$smoking_history))
## [1] "never"       "No Info"     "current"     "former"      "ever"       
## [6] "not current"
# 3.3. Convert to Factor
df_clean <- df_clean %>%
  mutate(
    gender = as.factor(gender),
    smoking_history = as.factor(smoking_history),
    hypertension = as.factor(hypertension),
    heart_disease = as.factor(heart_disease),
    diabetes = as.factor(diabetes)
  )

# Define a function to identify IQR outliers
# IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are treated as outliers
get_outlier_bounds <- function(x) {
  q1 <- quantile(x, 0.25, na.rm = TRUE)
  q3 <- quantile(x, 0.75, na.rm = TRUE)
  iqr <- q3 - q1
  lower_bound <- q1 - 1.5 * iqr
  upper_bound <- q3 + 1.5 * iqr
  return(c(lower_bound, upper_bound))
}

# 4. Handling of outliers
# 4.1 age
cat("\n--- 4.1: Handling of abnormal age values  ---\n")
## 
## --- 4.1: Handling of abnormal age values  ---
age_bounds <- get_outlier_bounds(df_clean$age)
print(paste("Age IQR range:", age_bounds[1], "to", age_bounds[2]))
## [1] "Age IQR range: -28.5 to 111.5"
# 4.2 bmi
cat("\n--- 4.2: Handling of abnormal bmi values ---\n")
## 
## --- 4.2: Handling of abnormal bmi values ---
bmi_bounds <- get_outlier_bounds(df_clean$bmi)
print(paste("BMI IQR range:", bmi_bounds[1], "to", bmi_bounds[2]))
## [1] "BMI IQR range: 13.71 to 39.55"
bmi_outliers <- df_clean %>%    # Identify outliers outside the IQR range
  filter(bmi < bmi_bounds[1] | bmi > bmi_bounds[2])
print(paste("The number of BMI IQR' outliers:", nrow(bmi_outliers)))
## [1] "The number of BMI IQR' outliers: 5354"
# 4.3. HbA1c_level 
cat("\n--- 4.3: Handling the outliers of HbA1c_level ---\n")
## 
## --- 4.3: Handling the outliers of HbA1c_level ---
hba1c_bounds <- get_outlier_bounds(df_clean$HbA1c_level)
print(paste("HbA1c_level IQR range:", hba1c_bounds[1], "to", hba1c_bounds[2]))
## [1] "HbA1c_level IQR range: 2.7 to 8.3"
hba1c_outliers <- df_clean %>%
  filter(HbA1c_level < hba1c_bounds[1] | HbA1c_level > hba1c_bounds[2])
print(paste("The number of HbA1c_level IQR' outliers:", nrow(hba1c_outliers)))
## [1] "The number of HbA1c_level IQR' outliers: 1312"
# 4.4. blood_glucose_level 
cat("\n--- 4.4: Handling the outliers of blood_glucose_level ---\n")
## 
## --- 4.4: Handling the outliers of blood_glucose_level ---
bgl_bounds <- get_outlier_bounds(df_clean$blood_glucose_level)
print(paste("blood_glucose_level IQR range:", bgl_bounds[1], "to", bgl_bounds[2]))
## [1] "blood_glucose_level IQR range: 11.5 to 247.5"
bgl_outliers <- df_clean %>%
  filter(blood_glucose_level < bgl_bounds[1] | blood_glucose_level > bgl_bounds[2])
print(paste("The number of blood_glucose_level IQR' outliers :", nrow(bgl_outliers)))
## [1] "The number of blood_glucose_level IQR' outliers : 2031"
# 5. Final Result Summary
cat("\n--- 5: Final Result Summary ---\n")
## 
## --- 5: Final Result Summary ---
# Count the total number of rows
final_rows <- nrow(df_clean)
total_rows_removed <- initial_rows - final_rows
print(paste("The dimension of the cleaned dataset (number of rows, number of columns):", final_rows, ncol(df_clean)))
## [1] "The dimension of the cleaned dataset (number of rows, number of columns): 96128 9"
print(paste("The total number of rows removed:", total_rows_removed))
## [1] "The total number of rows removed: 3872"
print(glimpse(df_clean))
## Rows: 96,128
## Columns: 9
## $ gender              <fct> Female, Female, Male, Female, Male, Female, Female…
## $ age                 <dbl> 80, 54, 28, 36, 76, 20, 44, 79, 42, 32, 53, 54, 78…
## $ hypertension        <fct> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ heart_disease       <fct> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ smoking_history     <fct> never, No Info, never, current, current, never, ne…
## $ bmi                 <dbl> 25.19, 27.32, 27.32, 23.45, 20.14, 27.32, 19.31, 2…
## $ HbA1c_level         <dbl> 6.6, 6.6, 5.7, 5.0, 4.8, 6.6, 6.5, 5.7, 4.8, 5.0, …
## $ blood_glucose_level <dbl> 140, 80, 158, 155, 155, 85, 200, 85, 145, 100, 85,…
## $ diabetes            <fct> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## # A tibble: 96,128 × 9
##    gender   age hypertension heart_disease smoking_history   bmi HbA1c_level
##    <fct>  <dbl> <fct>        <fct>         <fct>           <dbl>       <dbl>
##  1 Female    80 0            1             never            25.2         6.6
##  2 Female    54 0            0             No Info          27.3         6.6
##  3 Male      28 0            0             never            27.3         5.7
##  4 Female    36 0            0             current          23.4         5  
##  5 Male      76 1            1             current          20.1         4.8
##  6 Female    20 0            0             never            27.3         6.6
##  7 Female    44 0            0             never            19.3         6.5
##  8 Female    79 0            0             No Info          23.9         5.7
##  9 Male      42 0            0             never            33.6         4.8
## 10 Female    32 0            0             never            27.3         5  
## # ℹ 96,118 more rows
## # ℹ 2 more variables: blood_glucose_level <dbl>, diabetes <fct>
# 6. Save the data after cleaning
write_csv(df_clean, "diabetes_prediction_dataset_advanced_cleaned.csv")

3.2 Categorical Variables: One-hot Encoding

This section applies one-hot encoding to the categorical variables (specifically gender and smoking_history), converting them into 0/1 dummy-variable columns with R’s model.matrix function. Most models cannot work directly with character categories, so this encoding makes the variables usable in subsequent model building without imposing a spurious numerical ordering.

The workflow consists of 5 key steps:

  1. Import the data: Load the cleaned diabetes prediction dataset, with strings treated as factor types.

  2. Apply one-hot encoding: Use model.matrix (with -1 to exclude the intercept term) to encode gender and smoking_history.

  3. Convert to data frame: Transform the encoded result into a data frame format.

  4. Remove original categorical columns: Extract non-categorical (numeric/target) columns from the original dataset.

  5. Consolidate final data: Combine the numeric/target columns with the encoded dummy columns.

Finally, str() is used to inspect the processed dataset, ensuring all columns are ready for subsequent standardization and laying the groundwork for the final model construction.

# One-hot encoding: gender, smoking_history

# 1. Import data
data <- read.csv("diabetes_prediction_dataset_advanced_cleaned.csv", stringsAsFactors = TRUE) 

# 2. Use model.matrix for one-hot coding
encoded_features <- model.matrix( ~ gender + smoking_history - 1, data = data)

# 3. Convert the result into a data frame
encoded_df <- as.data.frame(encoded_features)

# 4. Remove the original categorical columns and merge the data
cols_to_keep <- names(data)[!names(data) %in% c("gender", "smoking_history")]
data_numeric_and_target <- data[, cols_to_keep]

# 5. Final data consolidation
final_data <- cbind(data_numeric_and_target, encoded_df)

# Check the processed data; now all columns are available for standardization
str(final_data)
## 'data.frame':    96128 obs. of  14 variables:
##  $ age                       : num  80 54 28 36 76 20 44 79 42 32 ...
##  $ hypertension              : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ heart_disease             : int  1 0 0 0 1 0 0 0 0 0 ...
##  $ bmi                       : num  25.2 27.3 27.3 23.4 20.1 ...
##  $ HbA1c_level               : num  6.6 6.6 5.7 5 4.8 6.6 6.5 5.7 4.8 5 ...
##  $ blood_glucose_level       : int  140 80 158 155 155 85 200 85 145 100 ...
##  $ diabetes                  : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ genderFemale              : num  1 1 0 1 0 1 1 1 0 1 ...
##  $ genderMale                : num  0 0 1 0 1 0 0 0 1 0 ...
##  $ smoking_historyever       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ smoking_historyformer     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ smoking_historynever      : num  1 0 1 0 0 1 1 0 1 1 ...
##  $ smoking_historyNo Info    : num  0 1 0 0 0 0 0 1 0 0 ...
##  $ smoking_historynot current: num  0 0 0 0 0 0 0 0 0 0 ...
write_csv(final_data, "One-hot.csv")

3.3 Standardization of Numerical Variables

This section focuses on standardization of numerical variables (e.g., age, BMI): the goal is to scale these continuous features to have a mean of 0 and a standard deviation of 1. This preprocessing step helps speed up model training convergence, boost model performance, and eliminate biases caused by differences in feature scales.

# 1. Read data
df <- read.csv("diabetes_prediction_dataset_advanced_cleaned.csv")

# 2. Determine the column names of the numerical variables that require standardization.
numerical_cols <- c("age", "bmi", "HbA1c_level", "blood_glucose_level")

# 3. Perform Z-score standardization on the selected numerical columns.
df_standardized <- df
df_standardized[numerical_cols] <- scale(df[numerical_cols])

# 4. View the first few rows and statistical summary of the standardized DataFrame.
cat("--- Standardized DataFrame (first 6 rows) ---\n")
## --- Standardized DataFrame (first 6 rows) ---
print(head(df_standardized))
##   gender       age hypertension heart_disease smoking_history          bmi
## 1 Female  1.700700            0             1           never -0.314939405
## 2 Female  0.543258            0             0         No Info -0.000214287
## 3   Male -0.614184            0             0           never -0.000214287
## 4 Female -0.258048            0             0         current -0.572038798
## 5   Male  1.522632            1             1         current -1.061118676
## 6 Female -0.970320            0             0           never -0.000214287
##   HbA1c_level blood_glucose_level diabetes
## 1   0.9945423          0.04355774        0
## 2   0.9945423         -1.42303368        0
## 3   0.1559482          0.48353517        0
## 4  -0.4962917          0.41020560        0
## 5  -0.6826459          0.41020560        0
## 6   0.9945423         -1.30081773        0
cat("\n--- Summary statistics of standardized numerical variables (the mean should be close to 0 and the standard deviation close to 1) ---\n")
## 
## --- Summary statistics of standardized numerical variables (the mean should be close to 0 and the standard deviation close to 1) ---
print(summary(df_standardized[numerical_cols]))
##       age                bmi              HbA1c_level      blood_glucose_level
##  Min.   :-1.85710   Min.   :-2.5579100   Min.   :-1.8939   Min.   :-1.42303   
##  1st Qu.:-0.79225   1st Qu.:-0.5794267   1st Qu.:-0.6826   1st Qu.:-0.93417   
##  Median : 0.05357   Median :-0.0002143   Median : 0.2491   Median : 0.04356   
##  Mean   : 0.00000   Mean   : 0.0000000   Mean   : 0.0000   Mean   : 0.00000   
##  3rd Qu.: 0.76584   3rd Qu.: 0.3750917   3rd Qu.: 0.6218   3rd Qu.: 0.50798   
##  Max.   : 1.70070   Max.   :10.1020187   Max.   : 3.2308   Max.   : 3.95447
# 5. Save the standardized data to a new CSV file.
write.csv(df_standardized, "diabetes_prediction_dataset_standardized.csv", row.names = FALSE)

3.4 The issue of outliers after standardization

This section addresses outlier handling post-standardization, using the 3σ rule for detection and truncation for treatment:

• After data standardization, outliers are identified as values with an absolute Z-score > 3.

• To reduce outliers’ excessive impact on the model, these extreme values are truncated: Z-scores > 3 are capped at 3, and Z-scores < -3 are capped at -3.

3.4.1 Outlier Detection
# 1. Read the standardized data
df_standardized <- read.csv("diabetes_prediction_dataset_standardized.csv")

# 2. Identify the column names of the standardized numerical variables
standardized_cols <- c("age", "bmi", "HbA1c_level", "blood_glucose_level")

# Initialize a data frame for storing outlier information
outliers_summary <- data.frame()

# 3.  Loop through each column to check whether there are outliers with an absolute Z-score greater than 3
for (col in standardized_cols) {
  # Find the row indices of the rows where the absolute Z-score is greater than 3
  outlier_indices <- which(abs(df_standardized[[col]]) > 3)
  if (length(outlier_indices) > 0) {
    # Extract the outliers
    outlier_values <- df_standardized[outlier_indices, col]
    
    # Construct a summary of outlier information
     temp_df <- data.frame(
      Variable = col,
      Outlier_Count = length(outlier_indices),
      Max_Z_score = max(outlier_values),
      Min_Z_score = min(outlier_values),
      stringsAsFactors = FALSE
     )
    
     outliers_summary <- rbind(outliers_summary, temp_df)
    
     cat(sprintf("'%s' outliers were detected in the variable '%s' (Z-score > 3 or < -3).\n", col, length(outlier_indices)))
  } else {
    cat(sprintf("No outliers with an absolute Z-score greater than 3 were detected in the variable '%s'.\n", col))
  }
}
## No outliers with an absolute Z-score greater than 3 were detected in the variable 'age'.
## 1211 outliers were detected in the variable 'bmi' (Z-score > 3 or < -3).
## 1312 outliers were detected in the variable 'HbA1c_level' (Z-score > 3 or < -3).
## 1397 outliers were detected in the variable 'blood_glucose_level' (Z-score > 3 or < -3).
cat("\n--- Outlier Summary ---\n")
## 
## --- Outlier Summary ---
if (nrow(outliers_summary) > 0) {
    print(outliers_summary)
} else {
  cat("No outliers with an absolute Z-score greater than 3 were detected in any of the numerical variables.\n")
}
##              Variable Outlier_Count Max_Z_score Min_Z_score
## 1                 bmi          1211   10.102019    3.002234
## 2         HbA1c_level          1312    3.230793    3.044439
## 3 blood_glucose_level          1397    3.954468    3.465604
3.4.2 Outlier Truncation
# 1. Read the standardized data
df_standardized <- read.csv("diabetes_prediction_dataset_standardized.csv")

# 2. Identify the columns that need to be truncated
standardized_cols <- c("age", "bmi", "HbA1c_level", "blood_glucose_level")

# 3. Perform the Z-score truncation (Winsorization) operation
df_truncated <- df_standardized

for (col in standardized_cols) {
  # Upper truncation limit: Replace all Z-scores greater than 3 with 3
  df_truncated[[col]][df_truncated[[col]] > 3] <- 3
  
  # Lower truncation limit: Replace all Z-scores less than -3 with -3
  df_truncated[[col]][df_truncated[[col]] < -3] <- -3
}

# 4. Save the truncated data to a new CSV file
output_filename <- "diabetes_prediction_dataset_truncated.csv"
write.csv(df_truncated, output_filename, row.names = FALSE)

cat(sprintf("✅ The complete data after truncation has been successfully saved to a file :%s\n", output_filename))
## ✅ The complete data after truncation has been successfully saved to a file :diabetes_prediction_dataset_truncated.csv
# 5. Verify the save result again
cat("\n--- Summary statistics of key numerical columns after truncation ---\n")
## 
## --- Summary statistics of key numerical columns after truncation ---
print(summary(df_truncated[standardized_cols]))
##       age                bmi              HbA1c_level       blood_glucose_level
##  Min.   :-1.85710   Min.   :-2.5579100   Min.   :-1.89395   Min.   :-1.42303   
##  1st Qu.:-0.79225   1st Qu.:-0.5794267   1st Qu.:-0.68265   1st Qu.:-0.93417   
##  Median : 0.05357   Median :-0.0002143   Median : 0.24913   Median : 0.04356   
##  Mean   : 0.00000   Mean   :-0.0102627   Mean   :-0.00187   Mean   :-0.01019   
##  3rd Qu.: 0.76584   3rd Qu.: 0.3750917   3rd Qu.: 0.62183   3rd Qu.: 0.50798   
##  Max.   : 1.70070   Max.   : 3.0000000   Max.   : 3.00000   Max.   : 3.00000

3.5 Eliminate multicollinearity among categorical variables

This section covers resolving multicollinearity among the categorical dummy variables: to support modeling, one dummy variable is removed from each categorical group (specifically genderMale and smoking_historyNo Info). This step:

• Prevents the Dummy Variable Trap (ensuring the model matrix is full-rank),

• Avoids misleading results from strong linear correlations between variables,

• Improves model stability, interpretability, and prediction accuracy.

Finally, the processed data (saved as final_model_ready.csv) is prepared for modeling.

  • Since the regression task employs a decision tree, which is not sensitive to multicollinearity, the cleaned (non-encoded) dataset can also be used for that model.
# 1. Eliminate multicollinearity among categorical variables
df_processed <- read.csv("C:/Users/zhang/Desktop/WQD7004/group/submission/均值_标准差(全).csv")

# 2. Solve the problem of multicollinearity (Dummy Variable Trap)

# 2.1. Remove one column from the 'gender' group (e.g., remove genderMale)
df_final <- subset(df_processed, select = -c(genderMale))

# 2.2. Remove one column from the 'smoking_history' group (e.g., remove smoking_historyNo Info)
df_final <- subset(df_final, select = -c(smoking_historyNo.Info)) 

# 3. Check the column names of the final dataframe to confirm that the target columns have been removed
cat("--- Column names of the final model-ready dataframe ---\n")
## --- Column names of the final model-ready dataframe ---
print(names(df_final))
##  [1] "age"                        "hypertension"              
##  [3] "heart_disease"              "bmi"                       
##  [5] "HbA1c_level"                "blood_glucose_level"       
##  [7] "diabetes"                   "genderFemale"              
##  [9] "smoking_historyever"        "smoking_historyformer"     
## [11] "smoking_historynever"       "smoking_historynot.current"
# 4. Save the final model-ready data
output_filename <- "C:/Users/zhang/Desktop/WQD7004/group/submission/final_model_ready.csv"
write.csv(df_final, output_filename, row.names = FALSE)

cat(sprintf("\nAll preprocessing on the data has been completed, saved as:\n"), output_filename)
## 
## All preprocessing on the data has been completed, saved as:
##  C:/Users/zhang/Desktop/WQD7004/group/submission/final_model_ready.csv

3.6 Summary

This section summarizes the systematic preprocessing workflow applied to the raw diabetes prediction dataset, aiming to prepare high-quality data for modeling:

  1. Data cleaning: Removed 3,854 duplicates, excluded “Other” gender samples (18 rows), and confirmed no missing values (creating a foundation for EDA).

  2. One-hot Encoding: Converted gender and smoking_history (categorical variables) into model-recognizable numerical matrices.

  3. Z-score standardization: Scaled numerical features (e.g., age, BMI, glucose levels) to eliminate scale differences.

  4. Outlier handling: Detected outliers via the 3σ rule (post-standardization) and truncated extreme Z-scores to the [-3, 3] range (Winsorization).

  5. Multicollinearity resolution: Removed baseline dummy variables to avoid the “Dummy Variable Trap” and reduce strong linear correlations.

The final output (final_model_ready.csv) is optimized for stability, interpretability, and feature distribution—ready for machine learning training.

4. Exploratory Data Analysis (EDA)

Following the data cleaning process, we now analyze the processed dataset to identify patterns, correlations, and feature distributions that will inform our predictive models.

4.1 Distribution of Target Variables

We first examine the two target variables for our project: diabetes (Classification target) and blood_glucose_level (Regression target).

## Warning: package 'gridExtra' was built under R version 4.5.2
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
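
The plotting code for this chunk is not echoed in the knitted output. A minimal sketch of how the two target-variable plots could be reproduced from the cleaned data (assuming df_clean from Section 3.1 is still in memory) is shown below.

# Sketch (assumption): distribution of the classification target and the regression target
p_class <- ggplot(df_clean, aes(x = diabetes, fill = diabetes)) +
  geom_bar() +
  labs(title = "Class Distribution of Diabetes", x = "Diabetes (0 = No, 1 = Yes)", y = "Count") +
  theme_minimal()

p_glucose <- ggplot(df_clean, aes(x = blood_glucose_level)) +
  geom_histogram(bins = 40, fill = "steelblue", colour = "white") +
  labs(title = "Distribution of Blood Glucose Level", x = "blood_glucose_level", y = "Count") +
  theme_minimal()

gridExtra::grid.arrange(p_class, p_glucose, ncol = 2)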

The visualization of class distribution reveals a severe imbalance between the negative class (non-diabetic) and the positive class (diabetic), even after data cleaning. This structural disparity confirms that relying solely on Accuracy as a performance metric would be misleading; therefore, the model evaluation must prioritize robust metrics like AUC-ROC and F1-score to ensure sensitivity to the minority class.

The blood glucose distribution is not merely right-skewed but exhibits a multimodal pattern with distinct peaks. This discreteness suggests the data reflects specific clinical testing thresholds, where extreme values serve as strong indicators of diabetic conditions.

4.2 Correlation Analysis

To determine which physiological features are most relevant for predicting blood glucose and diabetes, we analyze the correlation matrix of the cleaned numerical variables.

## Warning: package 'corrplot' was built under R version 4.5.2
## corrplot 0.95 loaded
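
The code that generated the correlation heatmap is likewise not echoed. A minimal sketch, assuming df_clean is available and coercing the factor diabetes back to numeric for the correlation computation, is:

# Sketch (assumption): correlation matrix of numeric features plus the diabetes indicator
num_vars <- df_clean %>%
  mutate(diabetes = as.numeric(as.character(diabetes))) %>%
  select(age, bmi, HbA1c_level, blood_glucose_level, diabetes)

corrplot(cor(num_vars), method = "color", type = "upper",
         addCoef.col = "black", tl.col = "black")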

We observe the highest correlation between HbA1c_level and diabetes, as well as blood_glucose_level and diabetes. This suggests these are the most critical features for the classification task. There is a moderate positive correlation between age and bmi, and between HbA1c_level and blood_glucose_level. These correlation coefficients are well below the threshold for severe multicollinearity. This suggests they can coexist effectively in regression or classification models without causing stability issues.

4.3 Impact of Key Features

We further visualize how HbA1c_level and Age differ between diabetic and non-diabetic patients in the cleaned dataset.
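
The boxplot code is hidden in the report; a minimal sketch, again assuming df_clean, would be:

# Sketch (assumption): HbA1c_level and age by diabetes status
p_hba1c <- ggplot(df_clean, aes(x = diabetes, y = HbA1c_level, fill = diabetes)) +
  geom_boxplot() +
  labs(title = "HbA1c Level by Diabetes Status", x = "Diabetes") +
  theme_minimal()

p_age <- ggplot(df_clean, aes(x = diabetes, y = age, fill = diabetes)) +
  geom_boxplot() +
  labs(title = "Age by Diabetes Status", x = "Diabetes") +
  theme_minimal()

gridExtra::grid.arrange(p_hba1c, p_age, ncol = 2)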

The boxplots demonstrate that HbA1c_level and age are powerful discriminators. Diabetic patients show significantly higher median HbA1c levels compared to non-diabetics, creating a clear decision boundary for classification. While age is also positively correlated with diabetes, the presence of outliers in the lower age range of the diabetic group is notable. This indicates that while the disease is prevalent among older adults, the model must also be sensitive enough to capture less common cases in younger demographics.

4.4 Smoking History Analysis

In addition to numerical physiological markers, we examine the influence of lifestyle factors on disease status.
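
A minimal sketch of the stacked (proportional) bar chart described below, assuming df_clean, is:

# Sketch (assumption): proportion of diabetic patients within each smoking_history category
ggplot(df_clean, aes(x = smoking_history, fill = diabetes)) +
  geom_bar(position = "fill") +
  labs(title = "Diabetes Proportion by Smoking History",
       x = "smoking_history", y = "Proportion") +
  theme_minimal()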

The stacked bar chart reveals significant variations in diabetes prevalence across smoking categories. Notably, the former smokers group exhibits the highest diabetes risk. This pattern may suggest reverse causality, where individuals quit smoking following health complications. Conversely, the No Info category shows the lowest diabetes proportion, indicating that missing data in this field is not random but rather informative, likely representing a distinct, lower-risk demographic. This supports treating No Info as a separate category during feature engineering rather than imputing it.

4.5 Age vs BMI Interaction

Finally, to capture more complex dependencies, we investigate the interaction between age and bmi.
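
A minimal sketch of the scatter plot discussed below, assuming df_clean, is:

# Sketch (assumption): age vs. bmi, coloured by diabetes status
ggplot(df_clean, aes(x = age, y = bmi, colour = diabetes)) +
  geom_point(alpha = 0.3, size = 0.8) +
  scale_colour_manual(values = c("0" = "grey60", "1" = "red")) +
  labs(title = "Age vs BMI by Diabetes Status") +
  theme_minimal()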

The scatter plot highlights a critical non-linear interaction between Age and BMI. Even individuals with high BMI show a very low prevalence of diabetes if they are young (< 30 years), suggesting age acts as a gating factor. The density of positive cases (red dots) is highest where both Age and BMI are elevated. This distinct clustering confirms that the relationship between these features is non-linear. Consequently, tree-based models (like Random Forest or XGBoost) would likely outperform simple linear models, as they can naturally capture these complex threshold-based interactions without manual feature engineering.

5. Classification Modeling and Evaluation

5.1 Objective of Classification Analysis

This section conducts classification analysis on the prepared dataset to predict diabetes status. To achieve this goal, two classification models were employed: logistic regression and a decision tree. Since logistic regression is simple and highly interpretable, it was used as the baseline model, while the decision tree was used to capture potential nonlinear relationships in the data. Both models were trained on the same training set and evaluated on the same test set. Their performance was compared using the confusion matrix, accuracy, balanced accuracy, and AUC to determine the model most suitable for predicting diabetes risk.

library(caret)
## Warning: package 'caret' was built under R version 4.5.2
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(pROC)
## Warning: package 'pROC' was built under R version 4.5.2
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.5.2
df <- read.csv("C:/Users/zhang/Desktop/WQD7004/group/submission/final_model_ready.csv")
df$diabetes <- factor(df$diabetes, levels = c(0, 1))
set.seed(123)

# Divide the training set and test set
train_index <- createDataPartition(df$diabetes, p = 0.8, list = FALSE)
train_set <- df[train_index, ]
test_set  <- df[-train_index, ]

# Determine the ratio of the training/test set to the original dataset
cat("--- Original data set distribution ---\n")
## --- Original data set distribution ---
print(prop.table(table(df$diabetes)))
## 
##          0          1 
## 0.91176348 0.08823652
cat("\n--- Training set distribution ---\n")
## 
## --- Training set distribution ---
print(prop.table(table(train_set$diabetes)))
## 
##          0          1 
## 0.91175897 0.08824103
cat("\n--- Test set distribution ---\n")
## 
## --- Test set distribution ---
print(prop.table(table(test_set$diabetes)))
## 
##          0          1 
## 0.91178153 0.08821847
# Establish a binary classification Logistic Regression model
logit_model <- glm(
  diabetes ~ .,
  data = train_set,
  family = binomial 
)

# View the model results
cat("==== Overview of the Logistic Regression Model ====\n")
## ==== Overview of the Logistic Regression Model ====
summary(logit_model)
## 
## Call:
## glm(formula = diabetes ~ ., family = binomial, data = train_set)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                -5.28390    0.06027 -87.676  < 2e-16 ***
## age                         1.01886    0.02789  36.526  < 2e-16 ***
## hypertension                0.76830    0.05249  14.638  < 2e-16 ***
## heart_disease               0.72188    0.06757  10.684  < 2e-16 ***
## bmi                         0.65920    0.02091  31.525  < 2e-16 ***
## HbA1c_level                 2.50098    0.04227  59.162  < 2e-16 ***
## blood_glucose_level         1.40654    0.02240  62.787  < 2e-16 ***
## genderFemale               -0.28012    0.04010  -6.985 2.84e-12 ***
## smoking_historyever         0.28589    0.09158   3.122   0.0018 ** 
## smoking_historyformer       0.34257    0.06098   5.618 1.93e-08 ***
## smoking_historynever        0.25444    0.04751   5.356 8.52e-08 ***
## smoking_historynot.current  0.19776    0.07937   2.492   0.0127 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 45903  on 76902  degrees of freedom
## Residual deviance: 18195  on 76891  degrees of freedom
## AIC: 18219
## 
## Number of Fisher Scoring iterations: 8
cat("\n==== Interpretation of Key Parameters ====\n")
## 
## ==== Interpretation of Key Parameters ====
cat("\nIf the Pr(>|z|) column value is less than 0.05, it indicates that this variable has a significant impact on the prediction of diabetes.\n")
## 
## If the Pr(>|z|) column value is less than 0.05, it indicates that this variable has a significant impact on the prediction of diabetes.
# Extraction coefficient
coef(summary(logit_model))
##                              Estimate Std. Error    z value      Pr(>|z|)
## (Intercept)                -5.2838976 0.06026584 -87.676488  0.000000e+00
## age                         1.0188592 0.02789395  36.526167 4.262365e-292
## hypertension                0.7682954 0.05248615  14.638059  1.605874e-48
## heart_disease               0.7218801 0.06756506  10.684222  1.206565e-26
## bmi                         0.6592013 0.02091063  31.524702 3.985117e-218
## HbA1c_level                 2.5009796 0.04227356  59.161789  0.000000e+00
## blood_glucose_level         1.4065397 0.02240172  62.787120  0.000000e+00
## genderFemale               -0.2801171 0.04010017  -6.985435  2.839755e-12
## smoking_historyever         0.2858929 0.09158300   3.121681  1.798218e-03
## smoking_historyformer       0.3425748 0.06097860   5.617951  1.932349e-08
## smoking_historynever        0.2544411 0.04750734   5.355826  8.516625e-08
## smoking_historynot.current  0.1977617 0.07937377   2.491524  1.271962e-02
# Predicted using the test_set
test_prob <- predict(
  logit_model,
  newdata = test_set,
  type = "response"
)

test_pred <- ifelse(test_prob > 0.5, 1, 0)
test_pred <- factor(test_pred, levels = c(0, 1))

cat("==== Model prediction comparison with actual results (confusion matrix) ====\n")
## ==== Model prediction comparison with actual results (confusion matrix) ====
cm <- confusionMatrix(test_pred, test_set$diabetes)
print(cm$table)
##           Reference
## Prediction     0     1
##          0 17355   602
##          1   174  1094
cat("\n==== Core performance indicators of the model ====\n")
## 
## ==== Core performance indicators of the model ====
cat(sprintf("Accuracy: %.2f%%\n", cm$overall["Accuracy"] * 100))
## Accuracy: 95.96%
cat(sprintf("Kappa: %.4f\n", cm$overall["Kappa"]))
## Kappa: 0.7168
cat("\n==== In-depth evaluation of classification effectiveness ====\n")
## 
## ==== In-depth evaluation of classification effectiveness ====
cat(sprintf("Sensitivity/Recall: %.2f%% - The ability to identify patients\n", cm$byClass["Sensitivity"] * 100))
## Sensitivity/Recall: 99.01% - The ability to identify patients
cat(sprintf("Specificity: %.2f%% - The ability to eliminate healthy individuals\n", cm$byClass["Specificity"] * 100))
## Specificity: 64.50% - The ability to eliminate healthy individuals
fourfoldplot(as.table(cm$table), color = c("#CC6666", "#99CC99"),
             conf.level = 0, margin = 1,
             main = "Model prediction result distribution chart")

# evaluate
roc_obj <- roc(
  response = test_set$diabetes,
  predictor = test_prob
)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
cat("==== Model discrimination ability assessment ====\n")
## ==== Model discrimination ability assessment ====
cat(sprintf("AUC: %.4f\n", auc(roc_obj)))
## AUC: 0.9629
# ROC
plot(
  roc_obj,
  col = "blue",
  main = "ROC Curve for Logistic Regression Model"
)

plot(roc_obj, 
     print.auc = TRUE,           
     auc.polygon = TRUE,  
     grid = c(0.1, 0.2),
     grid.col = c("green", "red"), 
     max.auc.polygon = TRUE,
     auc.polygon.col = "lightblue", 
     print.thres = TRUE,  
     main = paste("ROC(AUC =", round(auc(roc_obj), 3), ")"))

# decision tree uses the same train_set for a fair comparison
tree_model <- rpart(
  diabetes ~ .,
  data = train_set,
  method = "class"
)

# visualization
rpart.plot(tree_model, type = 2, extra = 104)

# Predicted using the test_set
tree_prob <- predict(
  tree_model,
  newdata = test_set,
  type = "prob"
)[, "1"]

tree_pred <- ifelse(tree_prob > 0.5, 1, 0)
tree_pred <- factor(tree_pred, levels = c(0, 1))

cat("==== Decision Tree Model: Classification Performance Evaluation ====\n")
## ==== Decision Tree Model: Classification Performance Evaluation ====
cm_tree <- confusionMatrix(tree_pred, test_set$diabetes)
print(cm_tree$table)
##           Reference
## Prediction     0     1
##          0 17529   552
##          1     0  1144
cat("\n--- Interpretation of Key Performance Indicators ---\n")
## 
## --- Interpretation of Key Performance Indicators ---
cat(sprintf("1. Accuracy: %.2f%% \n", cm_tree$overall["Accuracy"] * 100))
## 1. Accuracy: 97.13%
cat(sprintf("2. Sensitivity: %.2f%% - The model's ability to retrieve patients' information\n", cm_tree$byClass["Sensitivity"] * 100))
## 2. Sensitivity: 100.00% - The model's ability to retrieve patients' information
cat(sprintf("3. Specificity: %.2f%% - The model's ability to exclude healthy individuals\n", cm_tree$byClass["Specificity"] * 100))
## 3. Specificity: 67.45% - The model's ability to exclude healthy individuals
tree_roc <- roc(
  response = test_set$diabetes,
  predictor = tree_prob
)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
cat("==== Performance evaluation of decision tree model ====\n")
## ==== Performance evaluation of decision tree model ====
auc_val <- auc(tree_roc)
cat(sprintf("AUC: %.4f\n", auc_val))
## AUC: 0.8373
# Comprehensive Comparison Chart
plot(roc_obj, col = "blue", lwd = 2, main = "ROC Comparison: Logit vs Tree")
plot(tree_roc, col = "red", lwd = 2, add = TRUE) 
legend("bottomright", legend = c("Logistic Regression", "Decision Tree"),
       col = c("blue", "red"), lwd = 2)

cat("---- Comparison Summary ----\n")
## ---- Comparison Summary ----
cat(sprintf("Logit AUC: %.4f | Tree AUC: %.4f\n", auc(roc_obj), auc(tree_roc)))
## Logit AUC: 0.9629 | Tree AUC: 0.8373
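
Section 4.1 also flags F1-score as a useful metric for the minority class. It is not reported above, but it can be recovered from the same predictions; a minimal sketch, which re-runs confusionMatrix with the diabetic class ("1") explicitly set as the positive class (an assumption, since caret's default positive level here is "0"), is:

# Sketch (assumption): F1-score for the diabetic (minority) class for both classifiers
cm_logit_pos <- confusionMatrix(test_pred, test_set$diabetes, positive = "1")
cm_tree_pos  <- confusionMatrix(tree_pred, test_set$diabetes, positive = "1")
cat(sprintf("Logit F1 (diabetic class): %.4f | Tree F1 (diabetic class): %.4f\n",
            cm_logit_pos$byClass["F1"], cm_tree_pos$byClass["F1"]))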

5.2 Analysis of the results of the logistic regression model

The logistic regression model was built using the training dataset to predict diabetes status. The results show that several key variables have strong and statistically significant effects on diabetes. Blood glucose level has a large positive coefficient of 1.41, and HbA1c level shows an even stronger effect with a coefficient of 2.50, indicating that higher glucose-related indicators greatly increase the likelihood of diabetes. Age and BMI also have positive coefficients of 1.02 and 0.66, which means older individuals and those with higher BMI are more likely to develop diabetes. All these variables have extremely small p-values (p < 2e-16), showing that their effects are highly significant. In addition, the model deviance decreases sharply from a null deviance of 45,903 to a residual deviance of 18,195, which indicates that the model explains the data much better than a baseline model with no predictors. Overall, the logistic regression model is statistically robust and successfully identifies clinically meaningful risk factors for diabetes.

5.3 Analysis of Prediction Results of the Test Set

The confusion matrix shows that the classification model performs well on the test dataset. Out of all test cases, the model achieves an overall accuracy of approximately 96%, which is clearly higher than the no information rate of 91.18%, and this improvement is statistically significant (p < 2.2e-16). The model correctly classifies 17,355 non-diabetic cases and 1,094 diabetic cases, while misclassifying 174 non-diabetic cases as diabetic and 602 diabetic cases as non-diabetic. The sensitivity is very high at 0.99, indicating that the model is extremely effective at identifying non-diabetic individuals. However, the specificity is lower at 0.65, which means that some diabetic cases are still incorrectly predicted as non-diabetic. This imbalance is expected because diabetic cases account for only about 8.8% of the dataset. Despite this limitation, the balanced accuracy of 0.82 and a Kappa value of 0.72 suggest that the model provides reliable classification performance beyond simple majority guessing.

5.4 ROC curve analysis of the logistic regression model

The ROC curve evaluates the overall classification performance of the logistic regression model across different decision thresholds. The model achieves an AUC value of 0.9629, which indicates excellent discriminative ability between diabetic and non-diabetic individuals. An AUC close to 1 means that the model can correctly rank a randomly chosen diabetic individual higher than a non-diabetic individual with very high probability. This strong AUC result shows that the model performs well across a wide range of thresholds and does not rely on a single cutoff value such as 0.5. Therefore, despite the class imbalance in the dataset, the ROC and AUC results confirm that the logistic regression model provides robust and reliable overall classification performance for diabetes risk prediction.

5.5 Overall performance analysis of the decision tree model

The decision tree model shows strong classification performance on the test dataset. The model achieves an overall accuracy of 97.13%, which is higher than the logistic regression model and also significantly higher than the no information rate of 91.18%. The confusion matrix indicates that the model correctly classifies 17,529 non-diabetic cases and 1,144 diabetic cases, with no non-diabetic cases misclassified as diabetic, although 552 diabetic cases are still missed. The sensitivity reaches 1.00, meaning that all non-diabetic individuals are correctly identified, while the specificity improves to 0.67, slightly higher than that of the logistic regression model. The balanced accuracy of 0.84 and a Kappa value of approximately 0.79 indicate good agreement beyond chance. However, despite these strong classification metrics, the AUC value of the decision tree model is 0.8373, which is notably lower than the logistic regression model’s AUC of 0.9629, suggesting weaker overall discriminative ability across different thresholds.

5.6 Comparison of Logistic Regression Model and Decision Tree Model

Compared with logistic regression, the decision tree model achieves higher accuracy and better specificity on the test dataset. However, the logistic regression model shows a much higher AUC value of 0.9629, compared to 0.8373 for the decision tree, indicating stronger overall discriminative ability. This suggests that while the decision tree performs very well at certain decision points, logistic regression provides more stable and reliable performance across different thresholds. Therefore, logistic regression is more suitable for diabetes risk screening, while the decision tree offers better interpretability for understanding decision rules. Although the decision tree shows slightly higher accuracy (97.13%), the significantly higher AUC of logistic regression (0.9629 vs 0.8373) indicates that the logistic regression model is more robust in ranking probability risks, which is crucial for medical screening where threshold adjustment might be needed.

6. Regression Modeling: Quantitative Prediction of Blood Glucose

6.1 Advanced Modeling Strategy

In this analytical phase, our objective transitions from categorical classification to the high-resolution quantitative estimation of blood_glucose_level. Unlike binary diagnostics, point-in-time glucose prediction presents a non-trivial challenge due to the high stochasticity of human metabolic responses. To navigate this complexity, we implemented a robust comparative framework: an Ordinary Least Squares (OLS) Linear Regression serves as the diagnostic baseline to capture global trends, while a Decision Tree Regressor (ANOVA method) was deployed to account for localized non-linear interactions and regional thresholds that linear mappings typically fail to resolve. The modeling pipeline utilized an 80/20 train-test split, with Root Mean Squared Error (RMSE) designated as the primary evaluation metric to strictly penalize significant deviations—a critical requirement in clinical safety contexts.

library(tidyverse)
library(caret)
library(corrplot)
library(rpart)
library(rpart.plot)

file_path <- "C:/Users/zhang/Desktop/WQD7004/group/submission/diabetes_prediction_dataset_advanced_cleaned.csv"
df <- read.csv(file_path)

df$gender <- as.factor(df$gender)
df$smoking_history <- as.factor(df$smoking_history)
df$hypertension <- as.factor(df$hypertension)
df$heart_disease <- as.factor(df$heart_disease)

numeric_vars <- df %>% select(age, bmi, HbA1c_level, blood_glucose_level)
cor_matrix <- cor(numeric_vars)
corrplot(cor_matrix, method = "number", type = "upper", tl.col = "black", title = "Correlation Heatmap")

set.seed(123) 
trainIndex <- createDataPartition(df$blood_glucose_level, p = 0.8, list = FALSE, times = 1)
train_set <- df[trainIndex,]
test_set  <- df[-trainIndex,]

model_lm <- lm(blood_glucose_level ~ . - diabetes, data = train_set)
pred_lm <- predict(model_lm, newdata = test_set)
rmse_lm <- sqrt(mean((test_set$blood_glucose_level - pred_lm)^2))

coef_data <- as.data.frame(summary(model_lm)$coefficients)
coef_data$Feature <- rownames(coef_data)
colnames(coef_data)[1] <- "Estimate"

ggplot(coef_data[-1, ], aes(x = reorder(Feature, Estimate), y = Estimate)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  theme_bw() + 
  labs(title = "Clinical Feature Impact on Blood Glucose", 
       subtitle = "Linear Regression Coefficient Analysis",
       x = "Physiological Indicators", 
       y = "Coefficient Estimate")

par(mfrow = c(2, 2))
plot(model_lm)

par(mfrow = c(1, 1))

tree_model <- rpart(blood_glucose_level ~ . - diabetes, data = train_set, method = "anova")
rpart.plot(tree_model, main = "Decision Tree for Glucose Prediction", digits = 3, extra = 1)

pred_tree <- predict(tree_model, newdata = test_set)
rmse_tree <- sqrt(mean((test_set$blood_glucose_level - pred_tree)^2))

cat("Linear Regression RMSE:", round(rmse_lm, 2), "\n")
## Linear Regression RMSE: 39.77
cat("Decision Tree RMSE:", round(rmse_tree, 2), "\n")
## Decision Tree RMSE: 38.99

6.2 Model Diagnostics: Evaluating Linear Assumptions

Prior to performance validation, we conducted a rigorous diagnostic audit of the OLS assumptions. As illustrated in the Residual Plots, although the error distribution is centered around the zero-mean axis, the Normal Q-Q Plot reveals distinct “heavy tails” and potential outliers. This evidence suggests non-normal errors and potential heteroscedasticity; the linear assumption remains valid for patients within the “average” glycemic range but loses predictive reliability during extreme fluctuations. This observed “diagnostic gap” reinforces our rationale for integrating the Decision Tree’s hierarchical partitioning, which is inherently better suited for data exhibiting such non-normal variance.

6.3 Quantitative Performance and the “Performance Plateau”

The empirical evaluation on the hold-out test set yielded the following results:

• Linear Regression RMSE: 39.77

• Decision Tree RMSE: 38.99

While the Decision Tree achieved a lower RMSE, the marginal improvement of approximately 0.78 units indicates a clear “performance plateau”. This suggests that the bottleneck in accuracy is not algorithmic but rather a reflection of the feature-target mismatch. Static demographic markers (Age, BMI) and historical medical data act as “slow” predictors, whereas blood glucose is a “fast” dynamic variable governed by unobserved, high-frequency factors such as acute stress, immediate carbohydrate load, or physical exertion. The test-set \(R^{2}\) (see the sketch below) further confirms that static indicators have a finite upper bound in their capacity to explain the variance of real-time metabolic states.
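
Because the test-set \(R^{2}\) is referenced above but not printed in the modeling chunk, it can be derived from the predictions already computed in Section 6.1; a minimal sketch is:

# Sketch (assumption): test-set R-squared for both regression models
ss_tot  <- sum((test_set$blood_glucose_level - mean(test_set$blood_glucose_level))^2)
r2_lm   <- 1 - sum((test_set$blood_glucose_level - pred_lm)^2)  / ss_tot
r2_tree <- 1 - sum((test_set$blood_glucose_level - pred_tree)^2) / ss_tot
cat(sprintf("Linear Regression R-squared: %.3f | Decision Tree R-squared: %.3f\n", r2_lm, r2_tree))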

6.4 Predictor Hierarchy: The Biological Primacy of HbA1c

The most robust insight derived from our feature importance analysis is the overwhelming dominance of HbA1c_level (see the coefficient plot in Section 6.1). From a clinical physiology standpoint, this aligns perfectly with established medical theory: since HbA1c reflects the average glycation over a 90-day cycle, it serves as a stable physiological anchor for current glucose levels. Interestingly, while clinical markers like hypertension and heart_disease show moderate coefficients, their impact remains secondary to biochemical indicators. This hierarchy implies that metabolic forecasting models should prioritize high-fidelity biochemical markers over broad demographic profiling to improve predictive precision.
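
The tree model's own importance ranking is not printed in the chunk above; it can be inspected from the fitted rpart object, as in this sketch:

# Sketch: relative variable importance stored in the fitted regression tree
importance <- sort(tree_model$variable.importance, decreasing = TRUE)
print(round(100 * importance / sum(importance), 1))   # share of total importance (%)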

6.5 Synthesis and Future Trajectory

Our analysis confirms that while baseline clinical data can effectively “bracket” a patient’s expected glucose range, high-fidelity point prediction remains elusive within this static dataset. The Decision Tree is the superior choice for this task due to its ability to handle non-linear nuances missed by the linear baseline. To surpass the current performance plateau, future research must move beyond static history and integrate high-frequency, dynamic data—such as Continuous Glucose Monitor (CGM) time-series or real-time nutritional logs—to bridge the existing information gap and achieve clinical-grade precision.

7. Conclusion

7.1 Project Achievements

We successfully addressed the core tasks specified in the assignment:

  • Data Preprocessing: For the 100,000-sample diabetes dataset, we completed a rigorous 6-step cleaning process (duplicate removal, outlier handling, encoding, standardization, etc.), resolving issues like duplicate records (3,854 entries), abnormal categories (18 Other gender samples), and multicollinearity, resulting in a model-ready dataset.

  • Dual Modeling Tasks:

    • Classification Task (Diabetes Prediction): Built a logistic regression baseline (95.96% accuracy, 0.9629 AUC) and a decision tree (97.13% accuracy, 0.8373 AUC). The results showed that HbA1c and blood_glucose_level are the most critical predictors, with the logistic regression achieving better generalization (higher AUC).

    • Regression Task (Blood Glucose Prediction): Constructed a linear regression (RMSE=39.77) and a decision tree regression (RMSE=38.99). Residual analysis revealed heteroscedasticity in the linear model, while the decision tree better captured non-linear relationships (though prone to overfitting).

  • Practical Value: The models provide actionable insights for clinical diabetes screening, e.g., prioritizing patients with HbA1c > 6.5% or blood_glucose_level > 140 mg/dL for further diagnosis.

7.2 Limitations

Despite meeting the assignment requirements, the project has two key limitations:

  • Data Imbalance: The dataset contains only 8.5% diabetic patients, leading to low specificity (64.5%-67.45%) in classification models (over-predicting non-diabetic cases).

  • Static Feature Limitation: The dataset lacks temporal features (e.g., blood glucose trends over time), limiting the regression model’s ability to predict dynamic glucose fluctuations.