Diabetes is a chronic health condition that affects how the body turns food into energy. Early detection and management of blood glucose levels are critical for preventing complications. This project analyzes medical history and demographic data to predict diabetes indicators.
In accordance with the project requirements, we aim to perform two types of predictive tasks:
Classification Task: Predict whether a patient has diabetes (binary classification) based on health indicators such as age, BMI, and HbA1c levels.
Regression Task: Predict the specific blood glucose levels (continuous variable) based on related physiological features.
We selected the “Diabetes Prediction Dataset” from Kaggle. It contains 100,000 healthcare records, making it suitable for training models. The data represents patients of varying ages and health conditions, including critical indicators such as HbA1c_level and blood_glucose_level.
We begin by importing the dataset and examining its structure and dimensions.
df_intial <- read.csv("C:/Users/zhang/Desktop/WQD7004/group/submission/intial_diabetes_prediction_dataset.csv")
head(df_intial)
## gender age hypertension heart_disease smoking_history bmi HbA1c_level
## 1 Female 80 0 1 never 25.19 6.6
## 2 Female 54 0 0 No Info 27.32 6.6
## 3 Male 28 0 0 never 27.32 5.7
## 4 Female 36 0 0 current 23.45 5.0
## 5 Male 76 1 1 current 20.14 4.8
## 6 Female 20 0 0 never 27.32 6.6
## blood_glucose_level diabetes
## 1 140 0
## 2 80 0
## 3 158 0
## 4 155 0
## 5 155 0
## 6 85 0
dim(df_intial)
## [1] 100000 9
str(df_intial)
## 'data.frame': 100000 obs. of 9 variables:
## $ gender : chr "Female" "Female" "Male" "Female" ...
## $ age : num 80 54 28 36 76 20 44 79 42 32 ...
## $ hypertension : int 0 0 0 0 1 0 0 0 0 0 ...
## $ heart_disease : int 1 0 0 0 1 0 0 0 0 0 ...
## $ smoking_history : chr "never" "No Info" "never" "current" ...
## $ bmi : num 25.2 27.3 27.3 23.4 20.1 ...
## $ HbA1c_level : num 6.6 6.6 5.7 5 4.8 6.6 6.5 5.7 4.8 5 ...
## $ blood_glucose_level: int 140 80 158 155 155 85 200 85 145 100 ...
## $ diabetes : int 0 0 0 0 0 0 1 0 0 0 ...
The dataset is substantial, containing 100,000 observations (rows) and 9 variables (columns). It is a mix of numerical types (int and num) and character types (chr).
summary(df_intial)
## gender age hypertension heart_disease
## Length:100000 Min. : 0.08 Min. :0.00000 Min. :0.00000
## Class :character 1st Qu.:24.00 1st Qu.:0.00000 1st Qu.:0.00000
## Mode :character Median :43.00 Median :0.00000 Median :0.00000
## Mean :41.89 Mean :0.07485 Mean :0.03942
## 3rd Qu.:60.00 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :80.00 Max. :1.00000 Max. :1.00000
## smoking_history bmi HbA1c_level blood_glucose_level
## Length:100000 Min. :10.01 Min. :3.500 Min. : 80.0
## Class :character 1st Qu.:23.63 1st Qu.:4.800 1st Qu.:100.0
## Mode :character Median :27.32 Median :5.800 Median :140.0
## Mean :27.32 Mean :5.528 Mean :138.1
## 3rd Qu.:29.58 3rd Qu.:6.200 3rd Qu.:159.0
## Max. :95.69 Max. :9.000 Max. :300.0
## diabetes
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.085
## 3rd Qu.:0.000
## Max. :1.000
sum(is.na(df_intial))
## [1] 0
The summary output shows that diabetes has a mean of 0.085, indicating that only 8.5% of the patients have diabetes. This severe imbalance will require special attention during modeling. Patients range from 0.08 years old (infants) to 80 years old, and there are no empty cells in the file.
This module performs the data preprocessing procedure, which consists of 6 phases:
data cleaning, one-hot encoding of categorical variables, standardization of numerical variables, outlier detection based on the 3σ rule, outlier treatment via truncation, and elimination of multicollinearity among categorical variables.
The overall goal of this phase is to generate data that can be used for EDA and modeling.
Step 1. Missing Value Handling and Duplicate Removal:
It is confirmed that there are no missing values (NA) in the dataset; a total of 3,854 duplicate rows are removed.
Step 2. Categorical Variables:
Gender: A small number of samples (18 rows) with the value “Other” are removed, retaining only the categories Male and Female.
Smoking History: No modification is made, and the category “No Info” is kept as an independent classification level, as it represents an unknown state in itself.
The categorical variables gender, smoking_history, hypertension, heart_disease, and diabetes are explicitly converted to the factor data type in R.
Step 3. Numerical Variables:
Outliers are detected and handled via the IQR method, applied to the variables age, bmi, HbA1c_level, and blood_glucose_level.
Notably, while the IQR method flagged many outliers in BMI and blood glucose levels, these values were retained, because such high readings are key pathological features of diabetes (relevant to the dataset’s focus on diabetes prediction).
After this data cleaning process, a cleaned dataset is finalized for the EDA phase.
# preparatory work
library(tidyverse) # Load the required libraries
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'ggplot2' was built under R version 4.5.2
## Warning: package 'tidyr' was built under R version 4.5.2
## Warning: package 'readr' was built under R version 4.5.2
## Warning: package 'purrr' was built under R version 4.5.2
## Warning: package 'dplyr' was built under R version 4.5.2
## Warning: package 'forcats' was built under R version 4.5.2
## Warning: package 'lubridate' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data_path <- "C:/Users/zhang/Desktop/WQD7004/group/submission/diabetes_prediction_dataset.csv" # Load the data set
df <- read_csv(data_path)
## Rows: 100000 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): gender, smoking_history
## dbl (7): age, hypertension, heart_disease, bmi, HbA1c_level, blood_glucose_l...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cat("--- Preparation step: Data Loading and Initial Overview ---\n")
## --- Preparation step: Data Loading and Initial Overview ---
print(paste("Initial dataset dimension (Number of rows, number of columns):", nrow(df), ncol(df)))
## [1] "Initial dataset dimension (Number of rows, number of columns): 100000 9"
print(glimpse(df))
## Rows: 100,000
## Columns: 9
## $ gender <chr> "Female", "Female", "Male", "Female", "Male", "Fem…
## $ age <dbl> 80, 54, 28, 36, 76, 20, 44, 79, 42, 32, 53, 54, 78…
## $ hypertension <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ heart_disease <dbl> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ smoking_history <chr> "never", "No Info", "never", "current", "current",…
## $ bmi <dbl> 25.19, 27.32, 27.32, 23.45, 20.14, 27.32, 19.31, 2…
## $ HbA1c_level <dbl> 6.6, 6.6, 5.7, 5.0, 4.8, 6.6, 6.5, 5.7, 4.8, 5.0, …
## $ blood_glucose_level <dbl> 140, 80, 158, 155, 155, 85, 200, 85, 145, 100, 85,…
## $ diabetes <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## # A tibble: 100,000 × 9
## gender age hypertension heart_disease smoking_history bmi HbA1c_level
## <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 Female 80 0 1 never 25.2 6.6
## 2 Female 54 0 0 No Info 27.3 6.6
## 3 Male 28 0 0 never 27.3 5.7
## 4 Female 36 0 0 current 23.4 5
## 5 Male 76 1 1 current 20.1 4.8
## 6 Female 20 0 0 never 27.3 6.6
## 7 Female 44 0 0 never 19.3 6.5
## 8 Female 79 0 0 No Info 23.9 5.7
## 9 Male 42 0 0 never 33.6 4.8
## 10 Female 32 0 0 never 27.3 5
## # ℹ 99,990 more rows
## # ℹ 2 more variables: blood_glucose_level <dbl>, diabetes <dbl>
df_clean <- df # Create a working copy
# data cleaning
# 1. Missing value check (review)
cat("\n--- 1: Missing Value Check ---\n")
##
## --- 1: Missing Value Check ---
missing_values <- colSums(is.na(df_clean))
print("The number of missing values in each column:")
## [1] "The number of missing values in each column:"
print(missing_values[missing_values > 0]) # Confirm no NA
## named numeric(0)
# 2. Remove duplicate rows
cat("\n--- 2: Remove duplicate rows ---\n")
##
## --- 2: Remove duplicate rows ---
initial_rows <- nrow(df_clean)
df_clean <- df_clean %>%
distinct() # Remove all completely duplicate rows
rows_removed_duplicates <- initial_rows - nrow(df_clean)
print(paste("The number of duplicate rows removed:", rows_removed_duplicates))
## [1] "The number of duplicate rows removed: 3854"
# 3. Standardization and Coding of Categorical Variables
cat("\n--- 3: Standardization and Coding of Categorical Variables ---\n")
##
## --- 3: Standardization and Coding of Categorical Variables ---
# 3.1. gender
print("the sole value of gender:")
## [1] "the sole value of gender:"
print(unique(df_clean$gender))
## [1] "Female" "Male" "Other"
rows_removed_gender <- nrow(df_clean) - nrow(df_clean %>% filter(gender != "Other")) # Remove "Other"
df_clean <- df_clean %>%
filter(gender != "Other")
print(paste("The number of rows for the 'Other' gender category that have been removed:", rows_removed_gender))
## [1] "The number of rows for the 'Other' gender category that have been removed: 18"
# 3.2. smoking_history
print("The sole value of smoking_history:")
## [1] "The sole value of smoking_history:"
print(unique(df_clean$smoking_history))
## [1] "never" "No Info" "current" "former" "ever"
## [6] "not current"
# 3.3. Convert to Factor
df_clean <- df_clean %>%
mutate(
gender = as.factor(gender),
smoking_history = as.factor(smoking_history),
hypertension = as.factor(hypertension),
heart_disease = as.factor(heart_disease),
diabetes = as.factor(diabetes)
)
# Define a function to identify IQR outliers
# IQR outliers: values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
get_outlier_bounds <- function(x) {
q1 <- quantile(x, 0.25, na.rm = TRUE)
q3 <- quantile(x, 0.75, na.rm = TRUE)
iqr <- q3 - q1
lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr
return(c(lower_bound, upper_bound))
}
# 4. Handling of outliers
# 4.1 age
cat("\n--- 4.1: Handling of abnormal age values ---\n")
##
## --- 4.1: Handling of abnormal age values ---
age_bounds <- get_outlier_bounds(df_clean$age)
print(paste("Age IQR range:", age_bounds[1], "to", age_bounds[2]))
## [1] "Age IQR range: -28.5 to 111.5"
# 4.2 bmi
cat("\n--- 4.2: Handling of abnormal bmi values ---\n")
##
## --- 4.2: Handling of abnormal bmi values ---
bmi_bounds <- get_outlier_bounds(df_clean$bmi)
print(paste("BMI IQR range:", bmi_bounds[1], "to", bmi_bounds[2]))
## [1] "BMI IQR range: 13.71 to 39.55"
bmi_outliers <- df_clean %>% # Identify outliers outside the IQR range
filter(bmi < bmi_bounds[1] | bmi > bmi_bounds[2])
print(paste("The number of BMI IQR' outliers:", nrow(bmi_outliers)))
## [1] "The number of BMI IQR' outliers: 5354"
# 4.3. HbA1c_level
cat("\n--- 4.3: Handling the outliers of HbA1c_level ---\n")
##
## --- 4.3: Handling the outliers of HbA1c_level ---
hba1c_bounds <- get_outlier_bounds(df_clean$HbA1c_level)
print(paste("HbA1c_level IQR range:", hba1c_bounds[1], "to", hba1c_bounds[2]))
## [1] "HbA1c_level IQR range: 2.7 to 8.3"
hba1c_outliers <- df_clean %>%
filter(HbA1c_level < hba1c_bounds[1] | HbA1c_level > hba1c_bounds[2])
print(paste("The number of HbA1c_level IQR' outliers:", nrow(hba1c_outliers)))
## [1] "The number of HbA1c_level IQR' outliers: 1312"
# 4.4. blood_glucose_level
cat("\n--- 4.4: Handling the outliers of blood_glucose_level ---\n")
##
## --- 4.4: Handling the outliers of blood_glucose_level ---
bgl_bounds <- get_outlier_bounds(df_clean$blood_glucose_level)
print(paste("blood_glucose_level IQR range:", bgl_bounds[1], "to", bgl_bounds[2]))
## [1] "blood_glucose_level IQR range: 11.5 to 247.5"
bgl_outliers <- df_clean %>%
filter(blood_glucose_level < bgl_bounds[1] | blood_glucose_level > bgl_bounds[2])
print(paste("The number of blood_glucose_level IQR' outliers :", nrow(bgl_outliers)))
## [1] "The number of blood_glucose_level IQR' outliers : 2031"
# 5. Final Result Summary
cat("\n--- 5: Final Result Summary ---\n")
##
## --- 5: Final Result Summary ---
# Count the total number of rows
final_rows <- nrow(df_clean)
total_rows_removed <- initial_rows - final_rows
print(paste("The dimension of the cleaned dataset (number of rows, number of columns):", final_rows, ncol(df_clean)))
## [1] "The dimension of the cleaned dataset (number of rows, number of columns): 96128 9"
print(paste("The total number of rows removed:", total_rows_removed))
## [1] "The total number of rows removed: 3872"
print(glimpse(df_clean))
## Rows: 96,128
## Columns: 9
## $ gender <fct> Female, Female, Male, Female, Male, Female, Female…
## $ age <dbl> 80, 54, 28, 36, 76, 20, 44, 79, 42, 32, 53, 54, 78…
## $ hypertension <fct> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ heart_disease <fct> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ smoking_history <fct> never, No Info, never, current, current, never, ne…
## $ bmi <dbl> 25.19, 27.32, 27.32, 23.45, 20.14, 27.32, 19.31, 2…
## $ HbA1c_level <dbl> 6.6, 6.6, 5.7, 5.0, 4.8, 6.6, 6.5, 5.7, 4.8, 5.0, …
## $ blood_glucose_level <dbl> 140, 80, 158, 155, 155, 85, 200, 85, 145, 100, 85,…
## $ diabetes <fct> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## # A tibble: 96,128 × 9
## gender age hypertension heart_disease smoking_history bmi HbA1c_level
## <fct> <dbl> <fct> <fct> <fct> <dbl> <dbl>
## 1 Female 80 0 1 never 25.2 6.6
## 2 Female 54 0 0 No Info 27.3 6.6
## 3 Male 28 0 0 never 27.3 5.7
## 4 Female 36 0 0 current 23.4 5
## 5 Male 76 1 1 current 20.1 4.8
## 6 Female 20 0 0 never 27.3 6.6
## 7 Female 44 0 0 never 19.3 6.5
## 8 Female 79 0 0 No Info 23.9 5.7
## 9 Male 42 0 0 never 33.6 4.8
## 10 Female 32 0 0 never 27.3 5
## # ℹ 96,118 more rows
## # ℹ 2 more variables: blood_glucose_level <dbl>, diabetes <fct>
# 6. Save the data after cleaning
write_csv(df_clean, "diabetes_prediction_dataset_advanced_cleaned.csv")
This section covers the One-hot Encoding process for categorical variables, which focuses on converting categorical features (specifically gender and smoking_history) into 0/1 dummy variable columns using R’s model.matrix function. This encoding avoids numerical misinterpretation of categorical data (which most models cannot directly recognize), making the variables compatible with subsequent model building.
The workflow consists of 5 key steps:
Import the data: Load the cleaned diabetes prediction dataset, with strings treated as factor types.
Apply one-hot encoding: Use model.matrix (with -1 to exclude the intercept term) to encode gender and smoking_history.
Convert to data frame: Transform the encoded result into a data frame format.
Remove original categorical columns: Extract non-categorical (numeric/target) columns from the original dataset.
Consolidate final data: Combine the numeric/target columns with the encoded dummy columns.
Finally, str() is used to inspect the processed dataset, ensuring all columns are ready for subsequent standardization and laying the groundwork for the final model construction.
# One-hot encoding: gender, smoking_history
# 1. Import data
data <- read.csv("diabetes_prediction_dataset_advanced_cleaned.csv", stringsAsFactors = TRUE)
# 2. Use model.matrix for one-hot coding
encoded_features <- model.matrix( ~ gender + smoking_history - 1, data = data)
# 3. Convert the result into a data frame
encoded_df <- as.data.frame(encoded_features)
# 4. Remove the original categorical columns and merge the data
cols_to_keep <- names(data)[!names(data) %in% c("gender", "smoking_history")]
data_numeric_and_target <- data[, cols_to_keep]
# 5. Final data consolidation
final_data <- cbind(data_numeric_and_target, encoded_df)
# Check the processed data; now all columns are available for standardization
str(final_data)
## 'data.frame': 96128 obs. of 14 variables:
## $ age : num 80 54 28 36 76 20 44 79 42 32 ...
## $ hypertension : int 0 0 0 0 1 0 0 0 0 0 ...
## $ heart_disease : int 1 0 0 0 1 0 0 0 0 0 ...
## $ bmi : num 25.2 27.3 27.3 23.4 20.1 ...
## $ HbA1c_level : num 6.6 6.6 5.7 5 4.8 6.6 6.5 5.7 4.8 5 ...
## $ blood_glucose_level : int 140 80 158 155 155 85 200 85 145 100 ...
## $ diabetes : int 0 0 0 0 0 0 1 0 0 0 ...
## $ genderFemale : num 1 1 0 1 0 1 1 1 0 1 ...
## $ genderMale : num 0 0 1 0 1 0 0 0 1 0 ...
## $ smoking_historyever : num 0 0 0 0 0 0 0 0 0 0 ...
## $ smoking_historyformer : num 0 0 0 0 0 0 0 0 0 0 ...
## $ smoking_historynever : num 1 0 1 0 0 1 1 0 1 1 ...
## $ smoking_historyNo Info : num 0 1 0 0 0 0 0 1 0 0 ...
## $ smoking_historynot current: num 0 0 0 0 0 0 0 0 0 0 ...
write_csv(final_data, "One-hot.csv")
This section focuses on standardization of numerical variables (e.g., age, BMI): the goal is to scale these continuous features to have a mean of 0 and a standard deviation of 1. This preprocessing step helps speed up model training convergence, boost model performance, and eliminate biases caused by differences in feature scales.
# 1. Read data
df <- read.csv("diabetes_prediction_dataset_advanced_cleaned.csv")
# 2. Determine the column names of the numerical variables that require standardization.
numerical_cols <- c("age", "bmi", "HbA1c_level", "blood_glucose_level")
# 3. Perform Z-score standardization on the selected numerical columns.
df_standardized <- df
df_standardized[numerical_cols] <- scale(df[numerical_cols])
# 4. View the first few rows and statistical summary of the standardized DataFrame.
cat("--- Standardized DataFrame (first 6 rows) ---\n")
## --- Standardized DataFrame (first 6 rows) ---
print(head(df_standardized))
## gender age hypertension heart_disease smoking_history bmi
## 1 Female 1.700700 0 1 never -0.314939405
## 2 Female 0.543258 0 0 No Info -0.000214287
## 3 Male -0.614184 0 0 never -0.000214287
## 4 Female -0.258048 0 0 current -0.572038798
## 5 Male 1.522632 1 1 current -1.061118676
## 6 Female -0.970320 0 0 never -0.000214287
## HbA1c_level blood_glucose_level diabetes
## 1 0.9945423 0.04355774 0
## 2 0.9945423 -1.42303368 0
## 3 0.1559482 0.48353517 0
## 4 -0.4962917 0.41020560 0
## 5 -0.6826459 0.41020560 0
## 6 0.9945423 -1.30081773 0
cat("\n--- Summary statistics of standardized numerical variables (the mean should be close to 0 and the standard deviation close to 1) ---\n")
##
## --- Summary statistics of standardized numerical variables (the mean should be close to 0 and the standard deviation close to 1) ---
print(summary(df_standardized[numerical_cols]))
## age bmi HbA1c_level blood_glucose_level
## Min. :-1.85710 Min. :-2.5579100 Min. :-1.8939 Min. :-1.42303
## 1st Qu.:-0.79225 1st Qu.:-0.5794267 1st Qu.:-0.6826 1st Qu.:-0.93417
## Median : 0.05357 Median :-0.0002143 Median : 0.2491 Median : 0.04356
## Mean : 0.00000 Mean : 0.0000000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.76584 3rd Qu.: 0.3750917 3rd Qu.: 0.6218 3rd Qu.: 0.50798
## Max. : 1.70070 Max. :10.1020187 Max. : 3.2308 Max. : 3.95447
# 5. Save the standardized data to a new CSV file.
write.csv(df_standardized, "diabetes_prediction_dataset_standardized.csv", row.names = FALSE)
This section addresses outlier handling post-standardization, using the 3σ rule for detection and truncation for treatment:
• After data standardization, outliers are identified as values with an absolute Z-score > 3.
• To reduce outliers’ excessive impact on the model, these extreme values are truncated: Z-scores > 3 are capped at 3, and Z-scores < -3 are capped at -3.
# 1. Read the standardized data
df_standardized <- read.csv("diabetes_prediction_dataset_standardized.csv")
# 2. Identify the column names of the standardized numerical variables
standardized_cols <- c("age", "bmi", "HbA1c_level", "blood_glucose_level")
# Initialize a data frame for storing outlier information
outliers_summary <- data.frame()
# 3. Loop through each column to check whether there are outliers with an absolute Z-score greater than 3
for (col in standardized_cols) {
# Find the row indices of the rows where the absolute Z-score is greater than 3
outlier_indices <- which(abs(df_standardized[[col]]) > 3)
if (length(outlier_indices) > 0) {
# Extract the outliers
outlier_values <- df_standardized[outlier_indices, col]
# Construct a summary of outlier information
temp_df <- data.frame(
Variable = col,
Outlier_Count = length(outlier_indices),
Max_Z_score = max(outlier_values),
Min_Z_score = min(outlier_values),
stringsAsFactors = FALSE
)
outliers_summary <- rbind(outliers_summary, temp_df)
cat(sprintf("'%s' outliers were detected in the variable '%s' (Z-score > 3 or < -3).\n", col, length(outlier_indices)))
} else {
cat(sprintf("No outliers with an absolute Z-score greater than 3 were detected in the variable '%s'.\n", col))
}
}
## No outliers with an absolute Z-score greater than 3 were detected in the variable 'age'.
## 1211 outliers were detected in the variable 'bmi' (Z-score > 3 or < -3).
## 1312 outliers were detected in the variable 'HbA1c_level' (Z-score > 3 or < -3).
## 1397 outliers were detected in the variable 'blood_glucose_level' (Z-score > 3 or < -3).
cat("\n--- Outlier Summary ---\n")
##
## --- Outlier Summary ---
if (nrow(outliers_summary) > 0) {
print(outliers_summary)
} else {
cat("No outliers with an absolute Z-score greater than 3 were detected in any of the numerical variables.\n")
}
## Variable Outlier_Count Max_Z_score Min_Z_score
## 1 bmi 1211 10.102019 3.002234
## 2 HbA1c_level 1312 3.230793 3.044439
## 3 blood_glucose_level 1397 3.954468 3.465604
# 1. Read the standardized data
df_standardized <- read.csv("diabetes_prediction_dataset_standardized.csv")
# 2. Identify the columns that need to be truncated
standardized_cols <- c("age", "bmi", "HbA1c_level", "blood_glucose_level")
# 3. Perform the Z-score truncation (Winsorization) operation
df_truncated <- df_standardized
for (col in standardized_cols) {
# Upper truncation limit: Replace all Z-scores greater than 3 with 3
df_truncated[[col]][df_truncated[[col]] > 3] <- 3
# Lower truncation limit: Replace all Z-scores less than -3 with -3
df_truncated[[col]][df_truncated[[col]] < -3] <- -3
}
# 4. Save the truncated data to a new CSV file
output_filename <- "diabetes_prediction_dataset_truncated.csv"
write.csv(df_truncated, output_filename, row.names = FALSE)
cat(sprintf("✅ The complete data after truncation has been successfully saved to a file :%s\n", output_filename))
## ✅ The complete data after truncation has been successfully saved to a file :diabetes_prediction_dataset_truncated.csv
# 5. Verify the save result again
cat("\n--- Summary statistics of key numerical columns after truncation ---\n")
##
## --- Summary statistics of key numerical columns after truncation ---
print(summary(df_truncated[standardized_cols]))
## age bmi HbA1c_level blood_glucose_level
## Min. :-1.85710 Min. :-2.5579100 Min. :-1.89395 Min. :-1.42303
## 1st Qu.:-0.79225 1st Qu.:-0.5794267 1st Qu.:-0.68265 1st Qu.:-0.93417
## Median : 0.05357 Median :-0.0002143 Median : 0.24913 Median : 0.04356
## Mean : 0.00000 Mean :-0.0102627 Mean :-0.00187 Mean :-0.01019
## 3rd Qu.: 0.76584 3rd Qu.: 0.3750917 3rd Qu.: 0.62183 3rd Qu.: 0.50798
## Max. : 1.70070 Max. : 3.0000000 Max. : 3.00000 Max. : 3.00000
This section covers resolving multicollinearity in categorical variables: to support modeling, one dummy variable is removed from each categorical group (specifically genderMale and smoking_historyNo Info). This step:
• Prevents the Dummy Variable Trap (ensuring the model matrix is full-rank),
• Avoids misleading results from strong linear correlations between variables,
• Improves model stability, interpretability, and prediction accuracy.
Finally, the processed data (saved as final_model_ready.csv) is prepared for modeling.
# 1. Eliminate multicollinearity among categorical variables
df_processed <- read.csv("C:/Users/zhang/Desktop/WQD7004/group/submission/均值_标准差(全).csv")
# 2. Solve the problem of multicollinearity (Dummy Variable Trap)
# 2.1. Remove one column from the 'gender' group (e.g., remove genderMale)
df_final <- subset(df_processed, select = -c(genderMale))
# 2.2. Remove one column from the 'smoking_history' group (e.g., remove smoking_historyNo Info)
df_final <- subset(df_final, select = -c(smoking_historyNo.Info))
# 3. Check the column names of the final dataframe to confirm that the target columns have been removed
cat("--- Column names of the final model-ready dataframe ---\n")
## --- Column names of the final model-ready dataframe ---
print(names(df_final))
## [1] "age" "hypertension"
## [3] "heart_disease" "bmi"
## [5] "HbA1c_level" "blood_glucose_level"
## [7] "diabetes" "genderFemale"
## [9] "smoking_historyever" "smoking_historyformer"
## [11] "smoking_historynever" "smoking_historynot.current"
# 4. Save the final model-ready data
output_filename <- "C:/Users/zhang/Desktop/WQD7004/group/submission/final_model_ready.csv"
write.csv(df_final, output_filename, row.names = FALSE)
cat(sprintf("\nAll preprocessing on the data has been completed, saved as:\n"), output_filename)
##
## All preprocessing on the data has been completed, saved as:
## C:/Users/zhang/Desktop/WQD7004/group/submission/final_model_ready.csv
This section summarizes the 6-stage systematic preprocessing workflow applied to the raw diabetes prediction dataset, aiming to prepare high-quality data for modeling:
Data cleaning: Removed 3,854 duplicates, excluded “Other” gender samples (18 rows), and confirmed no missing values (creating a foundation for EDA).
One-hot Encoding: Converted gender and smoking_history (categorical variables) into model-recognizable numerical matrices.
Z-score standardization: Scaled numerical features (e.g., age, BMI, glucose levels) to eliminate scale differences.
Outlier handling: Detected outliers via the 3σ rule (post-standardization) and truncated extreme Z-scores to the [-3, 3] range (Winsorization).
Multicollinearity resolution: Removed baseline dummy variables to avoid the “Dummy Variable Trap” and reduce strong linear correlations.
The final output (final_model_ready.csv) is optimized for stability, interpretability, and feature distribution—ready for machine learning training.
Following the data cleaning process, we now analyze the processed dataset to identify patterns, correlations, and feature distributions that will inform our predictive models.
We first examine the two target variables for our project:
diabetes (Classification target) and
blood_glucose_level (Regression target).
## Warning: package 'gridExtra' was built under R version 4.5.2
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
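The plotting chunk for the class-distribution and blood-glucose figures discussed below is not echoed in the report; the sketch that follows shows one way they could be produced with ggplot2 and gridExtra, assuming the cleaned data frame df_clean (with diabetes stored as a factor) is still in memory.
# Sketch only: class distribution of the classification target and the
# distribution of the regression target (assumes df_clean from the cleaning step).
library(ggplot2)   # already attached via tidyverse; repeated for completeness
library(gridExtra)
p_class <- ggplot(df_clean, aes(x = diabetes, fill = diabetes)) +
  geom_bar() +
  labs(title = "Class Distribution of Diabetes", x = "Diabetes (0 = No, 1 = Yes)", y = "Count") +
  theme_minimal()
p_glucose <- ggplot(df_clean, aes(x = blood_glucose_level)) +
  geom_histogram(bins = 40, fill = "steelblue", colour = "white") +
  labs(title = "Distribution of Blood Glucose Level", x = "blood_glucose_level", y = "Count") +
  theme_minimal()
grid.arrange(p_class, p_glucose, ncol = 2)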
The visualization of class distribution reveals a severe imbalance between the negative class (non-diabetic) and the positive class (diabetic), even after data cleaning. This structural disparity confirms that relying solely on Accuracy as a performance metric would be misleading; therefore, the model evaluation must prioritize robust metrics like AUC-ROC and F1-score to ensure sensitivity to the minority class.
The blood glucose distribution is not merely right-skewed but exhibits a multimodal pattern with distinct peaks. This discreteness suggests the data reflects specific clinical testing thresholds, where extreme values serve as strong indicators of diabetic conditions.
To determine which physiological features are most relevant for predicting blood glucose and diabetes, we analyze the correlation matrix of the cleaned numerical variables.
## Warning: package 'corrplot' was built under R version 4.5.2
## corrplot 0.95 loaded
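The corrplot call itself is hidden in the rendered report; a minimal sketch of how this matrix could be computed is shown below, assuming df_clean is available (the factor target is converted back to numeric before computing correlations).
# Sketch only: correlation matrix of the cleaned numerical variables plus the target.
num_vars <- df_clean %>%
  mutate(diabetes = as.numeric(as.character(diabetes))) %>%
  select(age, bmi, HbA1c_level, blood_glucose_level, diabetes)
cor_matrix <- cor(num_vars)
corrplot(cor_matrix, method = "color", type = "upper",
         addCoef.col = "black", tl.col = "black",
         title = "Correlation Matrix of Numerical Variables", mar = c(0, 0, 1, 0))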
We observe the highest correlation between HbA1c_level and diabetes, as well as blood_glucose_level and diabetes. This suggests these are the most critical features for the classification task. There is a moderate positive correlation between age and bmi, and between HbA1c_level and blood_glucose_level. These correlation coefficients are well below the threshold for severe multicollinearity. This suggests they can coexist effectively in regression or classification models without causing stability issues.
We further visualize how HbA1c_level and Age differ between diabetic and non-diabetic patients in the cleaned dataset.
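The boxplot chunk is likewise not echoed; a hypothetical sketch, again assuming df_clean:
# Sketch only: compare HbA1c_level and age across diabetes status.
p_hba1c <- ggplot(df_clean, aes(x = diabetes, y = HbA1c_level, fill = diabetes)) +
  geom_boxplot() +
  labs(title = "HbA1c Level by Diabetes Status", x = "Diabetes", y = "HbA1c_level") +
  theme_minimal()
p_age <- ggplot(df_clean, aes(x = diabetes, y = age, fill = diabetes)) +
  geom_boxplot() +
  labs(title = "Age by Diabetes Status", x = "Diabetes", y = "Age") +
  theme_minimal()
gridExtra::grid.arrange(p_hba1c, p_age, ncol = 2)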
The boxplots demonstrate that HbA1c_level and age are powerful discriminators. Diabetic patients show significantly higher median HbA1c levels compared to non-diabetics, creating a clear decision boundary for classification. While age is also positively correlated with diabetes, the presence of outliers in the lower age range of the diabetic group is notable. This indicates that while the disease is prevalent among older adults, the model must also be sensitive enough to capture less common cases in younger demographics.
In addition to numerical physiological markers, we examine the influence of lifestyle factors on disease status.
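The stacked bar chart discussed below could be produced as in the following sketch (assuming df_clean with factor columns):
# Sketch only: proportional stacked bars of diabetes status within each smoking category.
ggplot(df_clean, aes(x = smoking_history, fill = diabetes)) +
  geom_bar(position = "fill") +
  labs(title = "Diabetes Prevalence by Smoking History",
       x = "smoking_history", y = "Proportion", fill = "Diabetes") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))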
The stacked bar chart reveals significant variations in diabetes prevalence across smoking categories. Notably, the former smokers group exhibits the highest diabetes risk. This pattern may suggest reverse causality, where individuals quit smoking following health complications. Conversely, the No Info category shows the lowest diabetes proportion, indicating that missing data in this field is not random but rather informative, likely representing a distinct, lower-risk demographic. This suggests that No Info should be treated as a separate category during feature engineering rather than imputed.
Finally, to capture more complex dependencies, we investigate the interaction between age and bmi.
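A sketch of the scatter plot used for this interaction analysis, assuming df_clean; transparency keeps the roughly 96,000 points readable.
# Sketch only: age vs. bmi, coloured by diabetes status.
ggplot(df_clean, aes(x = age, y = bmi, colour = diabetes)) +
  geom_point(alpha = 0.3, size = 0.8) +
  scale_colour_manual(values = c("0" = "grey60", "1" = "red")) +
  labs(title = "Interaction between Age and BMI by Diabetes Status",
       x = "Age", y = "BMI", colour = "Diabetes") +
  theme_minimal()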
The scatter plot highlights a critical non-linear interaction between Age and BMI. Even individuals with high BMI show a very low prevalence of diabetes if they are young (< 30 years), suggesting age acts as a gating factor. The density of positive cases (red dots) is highest where both Age and BMI are elevated. This distinct clustering confirms that the relationship between these features is non-linear. Consequently, tree-based models (like Random Forest or XGBoost) would likely outperform simple linear models, as they can naturally capture these complex threshold-based interactions without manual feature engineering.
This section aims to conduct classification analysis using the given dataset to predict diabetes status. To achieve this goal, two classification models were employed: logistic regression and decision tree. Since logistic regression is simple and highly interpretable, it was used as the baseline model, while the decision tree model was used to capture potential nonlinear relationships in the data. Both models were trained using the same training dataset and evaluated on the test dataset. The performance of the two models was compared using classification metrics such as the confusion matrix, accuracy, balanced accuracy, and AUC to determine the model that is most suitable for predicting diabetes risk.
library(caret)
## Warning: package 'caret' was built under R version 4.5.2
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(pROC)
## Warning: package 'pROC' was built under R version 4.5.2
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.5.2
df <- read.csv("C:/Users/zhang/Desktop/WQD7004/group/submission/final_model_ready.csv")
df$diabetes <- factor(df$diabetes, levels = c(0, 1))
set.seed(123)
# Divide the training set and test set
train_index <- createDataPartition(df$diabetes, p = 0.8, list = FALSE)
train_set <- df[train_index, ]
test_set <- df[-train_index, ]
# Determine the ratio of the training/test set to the original dataset
cat("--- Original data set distribution ---\n")
## --- Original data set distribution ---
print(prop.table(table(df$diabetes)))
##
## 0 1
## 0.91176348 0.08823652
cat("\n--- Training set distribution ---\n")
##
## --- Training set distribution ---
print(prop.table(table(train_set$diabetes)))
##
## 0 1
## 0.91175897 0.08824103
cat("\n--- Test set distribution ---\n")
##
## --- Test set distribution ---
print(prop.table(table(test_set$diabetes)))
##
## 0 1
## 0.91178153 0.08821847
# Establish a binary classification Logistic Regression model
logit_model <- glm(
diabetes ~ .,
data = train_set,
family = binomial
)
# View the model results
cat("==== Overview of the Logistic Regression Model ====\n")
## ==== Overview of the Logistic Regression Model ====
summary(logit_model)
##
## Call:
## glm(formula = diabetes ~ ., family = binomial, data = train_set)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.28390 0.06027 -87.676 < 2e-16 ***
## age 1.01886 0.02789 36.526 < 2e-16 ***
## hypertension 0.76830 0.05249 14.638 < 2e-16 ***
## heart_disease 0.72188 0.06757 10.684 < 2e-16 ***
## bmi 0.65920 0.02091 31.525 < 2e-16 ***
## HbA1c_level 2.50098 0.04227 59.162 < 2e-16 ***
## blood_glucose_level 1.40654 0.02240 62.787 < 2e-16 ***
## genderFemale -0.28012 0.04010 -6.985 2.84e-12 ***
## smoking_historyever 0.28589 0.09158 3.122 0.0018 **
## smoking_historyformer 0.34257 0.06098 5.618 1.93e-08 ***
## smoking_historynever 0.25444 0.04751 5.356 8.52e-08 ***
## smoking_historynot.current 0.19776 0.07937 2.492 0.0127 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 45903 on 76902 degrees of freedom
## Residual deviance: 18195 on 76891 degrees of freedom
## AIC: 18219
##
## Number of Fisher Scoring iterations: 8
cat("\n==== Interpretation of Key Parameters ====\n")
##
## ==== Interpretation of Key Parameters ====
cat("\nIf the Pr(>|z|) column value is less than 0.05, it indicates that this variable has a significant impact on the prediction of diabetes.\n")
##
## If the Pr(>|z|) column value is less than 0.05, it indicates that this variable has a significant impact on the prediction of diabetes.
# Extraction coefficient
coef(summary(logit_model))
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.2838976 0.06026584 -87.676488 0.000000e+00
## age 1.0188592 0.02789395 36.526167 4.262365e-292
## hypertension 0.7682954 0.05248615 14.638059 1.605874e-48
## heart_disease 0.7218801 0.06756506 10.684222 1.206565e-26
## bmi 0.6592013 0.02091063 31.524702 3.985117e-218
## HbA1c_level 2.5009796 0.04227356 59.161789 0.000000e+00
## blood_glucose_level 1.4065397 0.02240172 62.787120 0.000000e+00
## genderFemale -0.2801171 0.04010017 -6.985435 2.839755e-12
## smoking_historyever 0.2858929 0.09158300 3.121681 1.798218e-03
## smoking_historyformer 0.3425748 0.06097860 5.617951 1.932349e-08
## smoking_historynever 0.2544411 0.04750734 5.355826 8.516625e-08
## smoking_historynot.current 0.1977617 0.07937377 2.491524 1.271962e-02
# Predicted using the test_set
test_prob <- predict(
logit_model,
newdata = test_set,
type = "response"
)
test_pred <- ifelse(test_prob > 0.5, 1, 0)
test_pred <- factor(test_pred, levels = c(0, 1))
cat("==== Model prediction comparison with actual results (confusion matrix) ====\n")
## ==== Model prediction comparison with actual results (confusion matrix) ====
cm <- confusionMatrix(test_pred, test_set$diabetes)
print(cm$table)
## Reference
## Prediction 0 1
## 0 17355 602
## 1 174 1094
cat("\n==== Core performance indicators of the model ====\n")
##
## ==== Core performance indicators of the model ====
cat(sprintf("Accuracy: %.2f%%\n", cm$overall["Accuracy"] * 100))
## Accuracy: 95.96%
cat(sprintf("Kappa: %.4f\n", cm$overall["Kappa"]))
## Kappa: 0.7168
cat("\n==== In-depth evaluation of classification effectiveness ====\n")
##
## ==== In-depth evaluation of classification effectiveness ====
cat(sprintf("Sensitivity/Recall: %.2f%% - The ability to identify patients\n", cm$byClass["Sensitivity"] * 100))
## Sensitivity/Recall: 99.01% - The ability to identify patients
cat(sprintf("Specificity: %.2f%% - The ability to eliminate healthy individuals\n", cm$byClass["Specificity"] * 100))
## Specificity: 64.50% - The ability to eliminate healthy individuals
fourfoldplot(as.table(cm$table), color = c("#CC6666", "#99CC99"),
conf.level = 0, margin = 1,
main = "Model prediction result distribution chart")
# evaluate
roc_obj <- roc(
response = test_set$diabetes,
predictor = test_prob
)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
cat("==== Model discrimination ability assessment ====\n")
## ==== Model discrimination ability assessment ====
cat(sprintf("AUC: %.4f\n", auc(roc_obj)))
## AUC: 0.9629
# ROC
plot(
roc_obj,
col = "blue",
main = "ROC Curve for Logistic Regression Model"
)
plot(roc_obj,
print.auc = TRUE,
auc.polygon = TRUE,
grid = c(0.1, 0.2),
grid.col = c("green", "red"),
max.auc.polygon = TRUE,
auc.polygon.col = "lightblue",
print.thres = TRUE,
main = paste("ROC(AUC =", round(auc(roc_obj), 3), ")"))
# decision tree uses the same train_set for a fair comparison
tree_model <- rpart(
diabetes ~ .,
data = train_set,
method = "class"
)
# visualization
rpart.plot(tree_model, type = 2, extra = 104)
# Predicted using the test_set
tree_prob <- predict(
tree_model,
newdata = test_set,
type = "prob"
)[, "1"]
tree_pred <- ifelse(tree_prob > 0.5, 1, 0)
tree_pred <- factor(tree_pred, levels = c(0, 1))
cat("==== Decision Tree Model: Classification Performance Evaluation ====\n")
## ==== Decision Tree Model: Classification Performance Evaluation ====
cm_tree <- confusionMatrix(tree_pred, test_set$diabetes)
print(cm_tree$table)
## Reference
## Prediction 0 1
## 0 17529 552
## 1 0 1144
cat("\n--- Interpretation of Key Performance Indicators ---\n")
##
## --- Interpretation of Key Performance Indicators ---
cat(sprintf("1. Accuracy: %.2f%% \n", cm_tree$overall["Accuracy"] * 100))
## 1. Accuracy: 97.13%
cat(sprintf("2. Sensitivity: %.2f%% - The model's ability to retrieve patients' information\n", cm_tree$byClass["Sensitivity"] * 100))
## 2. Sensitivity: 100.00% - The model's ability to retrieve patients' information
cat(sprintf("3. Specificity: %.2f%% - The model's ability to exclude healthy individuals\n", cm_tree$byClass["Specificity"] * 100))
## 3. Specificity: 67.45% - The model's ability to exclude healthy individuals
tree_roc <- roc(
response = test_set$diabetes,
predictor = tree_prob
)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
cat("==== Performance evaluation of decision tree model ====\n")
## ==== Performance evaluation of decision tree model ====
auc_val <- auc(tree_roc)
cat(sprintf("AUC: %.4f\n", auc_val))
## AUC: 0.8373
# Comprehensive Comparison Chart
plot(roc_obj, col = "blue", lwd = 2, main = "ROC Comparison: Logit vs Tree")
plot(tree_roc, col = "red", lwd = 2, add = TRUE)
legend("bottomright", legend = c("Logistic Regression", "Decision Tree"),
col = c("blue", "red"), lwd = 2)
cat("---- Comparison Summary ----\n")
## ---- Comparison Summary ----
cat(sprintf("Logit AUC: %.4f | Tree AUC: %.4f\n", auc(roc_obj), auc(tree_roc)))
## Logit AUC: 0.9629 | Tree AUC: 0.8373
The logistic regression model was built using the training dataset to predict diabetes status. The results show that several key variables have strong and statistically significant effects on diabetes. Blood glucose level has a large positive coefficient of 1.41, and HbA1c level shows an even stronger effect with a coefficient of 2.50, indicating that higher glucose-related indicators greatly increase the likelihood of diabetes. Age and BMI also have positive coefficients of 1.02 and 0.66, which means older individuals and those with higher BMI are more likely to develop diabetes. All these variables have extremely small p-values (p < 2e-16), showing that their effects are highly significant. In addition, the model deviance decreases sharply from a null deviance of 45,903 to a residual deviance of 18,195, which indicates that the model explains the data much better than a baseline model with no predictors. Overall, the logistic regression model is statistically robust and successfully identifies clinically meaningful risk factors for diabetes.
The confusion matrix shows that the classification model performs well on the test dataset. Out of all test cases, the model achieves an overall accuracy of approximately 96%, which is clearly higher than the no information rate of 91.18%, and this improvement is statistically significant (p < 2.2e-16). The model correctly classifies 17,355 non-diabetic cases and 1,094 diabetic cases, while misclassifying 174 non-diabetic cases as diabetic and missing 602 diabetic cases. The sensitivity is very high at 0.99, indicating that the model is extremely effective at identifying non-diabetic individuals. However, the specificity is lower at 0.65, which means that some diabetic cases are still incorrectly predicted as non-diabetic. This imbalance is expected because diabetic cases account for only about 8.8% of the dataset. Despite this limitation, the balanced accuracy of 0.82 and a Kappa value of 0.72 suggest that the model provides reliable classification performance beyond simple majority guessing.
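The no information rate, p-value, balanced accuracy, and Kappa cited above are not printed in the chunk output but are stored in the caret confusionMatrix object; the short sketch below shows how they could be retrieved from cm (F1 is included as well, since the earlier EDA discussion recommends it for imbalanced data).
# Sketch only: pull additional metrics from the existing confusionMatrix object `cm`.
cat(sprintf("No Information Rate: %.4f\n", cm$overall["AccuracyNull"]))
cat(sprintf("Accuracy p-value vs NIR: %.3g\n", cm$overall["AccuracyPValue"]))
cat(sprintf("Balanced Accuracy: %.4f\n", cm$byClass["Balanced Accuracy"]))
cat(sprintf("F1 score: %.4f\n", cm$byClass["F1"]))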
The ROC curve evaluates the overall classification performance of the logistic regression model across different decision thresholds. The model achieves an AUC value of 0.9629, which indicates excellent discriminative ability between diabetic and non-diabetic individuals. An AUC close to 1 means that the model can correctly rank a randomly chosen diabetic individual higher than a non-diabetic individual with very high probability. This strong AUC result shows that the model performs well across a wide range of thresholds and does not rely on a single cutoff value such as 0.5. Therefore, despite the class imbalance in the dataset, the ROC and AUC results confirm that the logistic regression model provides robust and reliable overall classification performance for diabetes risk prediction.
The decision tree model shows strong classification performance on the test dataset. The model achieves an overall accuracy of 97.13%, which is higher than the logistic regression model and also significantly higher than the no information rate of 91.18%. The confusion matrix indicates that the model correctly classifies 17,529 non-diabetic cases and 1,144 diabetic cases, misses 552 diabetic cases, and misclassifies no non-diabetic cases as diabetic. The sensitivity reaches 1.00, meaning that all non-diabetic individuals are correctly identified, while the specificity improves to 0.67, which is slightly higher than that of the logistic regression model. The balanced accuracy of 0.84 and a Kappa value of approximately 0.79 indicate good agreement beyond chance. However, despite these strong classification metrics, the AUC value of the decision tree model is 0.8373, which is notably lower than the logistic regression model’s AUC of 0.9629, suggesting weaker overall discriminative ability across different thresholds.
Compared with logistic regression, the decision tree model achieves higher accuracy and better specificity on the test dataset. However, the logistic regression model shows a much higher AUC value of 0.9629, compared to 0.8373 for the decision tree, indicating stronger overall discriminative ability. This suggests that while the decision tree performs very well at certain decision points, logistic regression provides more stable and reliable performance across different thresholds. Therefore, logistic regression is more suitable for diabetes risk screening, while the decision tree offers better interpretability for understanding decision rules. Although the decision tree shows slightly higher accuracy (97.13%), the significantly higher AUC of logistic regression (0.9629 vs 0.8373) suggests that the logistic regression model is more robust in ranking probability risks, which is crucial for medical screening where threshold adjustment might be needed.
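For convenience, the headline metrics of the two classifiers can be collected into a single table from the objects created above; the column selection below is illustrative, not part of the original report.
# Sketch only: side-by-side comparison assembled from cm, cm_tree, roc_obj and tree_roc.
comparison <- data.frame(
  Model       = c("Logistic Regression", "Decision Tree"),
  Accuracy    = unname(c(cm$overall["Accuracy"], cm_tree$overall["Accuracy"])),
  Sensitivity = unname(c(cm$byClass["Sensitivity"], cm_tree$byClass["Sensitivity"])),
  Specificity = unname(c(cm$byClass["Specificity"], cm_tree$byClass["Specificity"])),
  AUC         = c(as.numeric(auc(roc_obj)), as.numeric(auc(tree_roc)))
)
print(comparison, row.names = FALSE)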
In this analytical phase, our objective transitions from categorical classification to the high-resolution quantitative estimation of blood_glucose_level. Unlike binary diagnostics, point-in-time glucose prediction presents a non-trivial challenge due to the high stochasticity of human metabolic responses. To navigate this complexity, we implemented a robust comparative framework: an Ordinary Least Squares (OLS) Linear Regression serves as the diagnostic baseline to capture global trends, while a Decision Tree Regressor (ANOVA method) was deployed to account for localized non-linear interactions and regional thresholds that linear mappings typically fail to resolve. The modeling pipeline utilized an 80/20 train-test split, with Root Mean Squared Error (RMSE) designated as the primary evaluation metric to strictly penalize significant deviations—a critical requirement in clinical safety contexts.
library(tidyverse)
library(caret)
library(corrplot)
library(rpart)
library(rpart.plot)
file_path <- "C:/Users/zhang/Desktop/WQD7004/group/submission/diabetes_prediction_dataset_advanced_cleaned.csv"
df <- read.csv(file_path)
df$gender <- as.factor(df$gender)
df$smoking_history <- as.factor(df$smoking_history)
df$hypertension <- as.factor(df$hypertension)
df$heart_disease <- as.factor(df$heart_disease)
numeric_vars <- df %>% select(age, bmi, HbA1c_level, blood_glucose_level)
cor_matrix <- cor(numeric_vars)
corrplot(cor_matrix, method = "number", type = "upper", tl.col = "black", title = "Correlation Heatmap")
set.seed(123)
trainIndex <- createDataPartition(df$blood_glucose_level, p = 0.8, list = FALSE, times = 1)
train_set <- df[trainIndex,]
test_set <- df[-trainIndex,]
model_lm <- lm(blood_glucose_level ~ . - diabetes, data = train_set)
pred_lm <- predict(model_lm, newdata = test_set)
rmse_lm <- sqrt(mean((test_set$blood_glucose_level - pred_lm)^2))
coef_data <- as.data.frame(summary(model_lm)$coefficients)
coef_data$Feature <- rownames(coef_data)
colnames(coef_data)[1] <- "Estimate"
ggplot(coef_data[-1, ], aes(x = reorder(Feature, Estimate), y = Estimate)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
theme_bw() +
labs(title = "Clinical Feature Impact on Blood Glucose",
subtitle = "Linear Regression Coefficient Analysis",
x = "Physiological Indicators",
y = "Coefficient Estimate")
par(mfrow = c(2, 2))
plot(model_lm)
par(mfrow = c(1, 1))
tree_model <- rpart(blood_glucose_level ~ . - diabetes, data = train_set, method = "anova")
rpart.plot(tree_model, main = "Decision Tree for Glucose Prediction", digits = 3, extra = 1)
pred_tree <- predict(tree_model, newdata = test_set)
rmse_tree <- sqrt(mean((test_set$blood_glucose_level - pred_tree)^2))
cat("Linear Regression RMSE:", round(rmse_lm, 2), "\n")
## Linear Regression RMSE: 39.77
cat("Decision Tree RMSE:", round(rmse_tree, 2), "\n")
## Decision Tree RMSE: 38.99
Prior to performance validation, we conducted a diagnostic audit of the OLS assumptions. As illustrated in the residual plots, although the error distribution is centered around the zero-mean axis, the Normal Q-Q plot reveals distinct “heavy tails” and potential outliers. This evidence suggests non-normal errors and likely heteroscedasticity; the linear assumption remains valid for patients within the “average” glycemic range but loses predictive reliability during extreme fluctuations. This observed “diagnostic gap” reinforces our rationale for integrating the Decision Tree’s hierarchical partitioning, which is inherently better suited to data exhibiting such non-normal variance.
The empirical evaluation on the hold-out test set yielded the following results:
• Linear Regression RMSE: 39.77
• Decision Tree RMSE: 38.99
While the Decision Tree achieved a lower RMSE, the marginal improvement of approximately 0.78 units indicates a clear “performance plateau”. This suggests that the bottleneck in accuracy is not algorithmic but rather a reflection of the feature-target mismatch. Static demographic markers (Age, BMI) and historical medical data act as “slow” predictors, whereas blood glucose is a “fast” dynamic variable governed by unobserved, high-frequency factors such as acute stress, immediate carbohydrate load, or physical exertion. The observed R² further confirms that static indicators have a finite upper bound in their capacity to explain the variance of real-time metabolic states.
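R² is referenced here but not printed by the regression chunk; a minimal sketch of how it could be computed on the hold-out set from the existing prediction vectors pred_lm and pred_tree:
# Sketch only: hold-out R-squared for both regression models.
sst     <- sum((test_set$blood_glucose_level - mean(test_set$blood_glucose_level))^2)
r2_lm   <- 1 - sum((test_set$blood_glucose_level - pred_lm)^2)  / sst
r2_tree <- 1 - sum((test_set$blood_glucose_level - pred_tree)^2) / sst
cat(sprintf("Linear Regression R-squared: %.3f\n", r2_lm))
cat(sprintf("Decision Tree R-squared: %.3f\n", r2_tree))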
The most robust insight derived from our feature importance analysis is the overwhelming dominance of HbA1c_level (Figure 3.3). From a clinical physiology standpoint, this aligns perfectly with established medical theory: since HbA1c reflects the average glycation over a 90-day cycle, it serves as a stable physiological anchor for current glucose levels. Interestingly, while clinical markers like hypertension and heart_disease show moderate coefficients, their impact remains secondary to biochemical indicators. This hierarchy implies that metabolic forecasting models should prioritize high-fidelity biochemical markers over broad demographic profiling to improve predictive precision.
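The dominance of HbA1c_level can also be checked directly on the fitted regression tree, which stores split-based importance in variable.importance; a sketch assuming tree_model from the chunk above:
# Sketch only: split-based variable importance of the rpart regression tree.
importance <- tree_model$variable.importance
print(round(sort(importance, decreasing = TRUE), 1))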
Our analysis confirms that while baseline clinical data can effectively “bracket” a patient’s expected glucose range, high-fidelity point prediction remains elusive within this static dataset. The Decision Tree is the superior choice for this task due to its ability to handle non-linear nuances missed by the linear baseline. To surpass the current performance plateau, future research must move beyond static history and integrate high-frequency, dynamic data—such as Continuous Glucose Monitor (CGM) time-series or real-time nutritional logs—to bridge the existing information gap and achieve clinical-grade precision.
We successfully addressed the core tasks specified in the assignment:
Data Preprocessing: For the 100,000-sample diabetes dataset, we completed a rigorous 6-step cleaning process (duplicate removal, outlier handling, encoding, standardization, etc.), resolving issues like duplicate records (3,854 entries), abnormal categories (18 Other gender samples), and multicollinearity, resulting in a model-ready dataset.
Dual Modeling Tasks:
Classification Task (Diabetes Prediction): Built a logistic regression baseline (95.96% accuracy, 0.9629 AUC) and a decision tree (97.13% accuracy, 0.8373 AUC). The results showed that HbA1c and blood_glucose_level are the most critical predictors, with the logistic regression achieving better generalization (higher AUC).
Regression Task (Blood Glucose Prediction): Constructed a linear regression (RMSE=39.77) and a decision tree regression (RMSE=38.99). Residual analysis revealed heteroscedasticity in the linear model, while the decision tree better captured non-linear relationships (though prone to overfitting).
Practical Value: The models provide actionable insights for clinical diabetes screening, e.g., prioritizing patients with HbA1c > 6.5% or blood_glucose_level > 140 mg/dL for further diagnosis.
Despite meeting the assignment requirements, the project has two key limitations:
Data Imbalance: The dataset contains only 8.5% diabetic patients, leading to low specificity (64.5%-67.45%) in classification models (over-predicting non-diabetic cases).
Static Feature Limitation: The dataset lacks temporal features (e.g., blood glucose trends over time), limiting the regression model’s ability to predict dynamic glucose fluctuations.