RPubs link: https://rpubs.com/oscarhew/WQD7004Report
Diabetes Mellitus is a chronic metabolic disorder characterized by the body’s inability to regulate blood glucose levels effectively. With its growing prevalence worldwide, early detection and continuous monitoring are critical for reducing health risks and improving patient outcomes. The rise in data science and machine learning has opened new opportunities for building predictive models that can aid healthcare providers in identifying at-risk individuals and personalizing treatment plans.
In this project, we utilize a health dataset containing biometric and lifestyle-related features to build two predictive models: one to classify individuals at risk of developing diabetes, and another to predict blood glucose levels. These models aim to enhance early screening and support data-driven healthcare decisions.
This document explores a dataset related to diabetes prediction. The data includes demographic and medical attributes of individuals and is used to perform data preprocessing, exploratory data analysis (EDA), classification modeling to detect diabetes, and regression modeling to estimate blood glucose levels.
Title: Diabetes Prediction Dataset
Year: Not explicitly stated; last updated on Kaggle around 2 years ago.
Source: Obtained from Kaggle: Diabetes Prediction Dataset
Purpose: To analyze and predict diabetes occurrence and blood glucose levels using individual health features such as BMI, HbA1c level, and smoking history.
Dimension: 100,000 rows (individual records) × 9 columns (features)
Content (Features): gender, age, hypertension, heart_disease, smoking_history, bmi, HbA1c_level, blood_glucose_level, and diabetes.
Structure: two character columns (gender, smoking_history) and seven numeric columns, including the two targets: diabetes (binary: 0 or 1) and blood_glucose_level (continuous numeric).
Summary: the key numeric features (bmi, blood_glucose_level, HbA1c_level) show strong predictive potential, while the categorical variables (gender and smoking_history) may require cleaning or encoding prior to effective modeling.
To initiate the analysis, a suite of R libraries was loaded to
support data ingestion, wrangling, visualization, and machine learning.
Packages such as readr, dplyr, and
tidyr facilitated data import and transformation. For
modeling, libraries like caret, randomForest,
xgboost, and glmnet were used to develop and
evaluate both classification and regression models. Visualizations were
constructed using ggplot2 and ggcorrplot to
explore feature relationships. The dataset was imported using
read_csv() for efficient parsing of its structured
format.
# install.packages("ggcorrplot", quite = TRUE, repos = "http://cran.us.r-project.org")
# install.packages("caTools", quite = TRUE, repos = "http://cran.us.r-project.org")
# install.packages("e1071", quite = TRUE, repos = "http://cran.us.r-project.org")
# install.packages("Metrics", quite = TRUE, repos = "http://cran.us.r-project.org")
# install.packages("xgboost", quite = TRUE, repos = "http://cran.us.r-project.org")
# install.packages("caret", quite = TRUE, repos = "http://cran.us.r-project.org")
# install.packages("yardstick", quite = TRUE, repos = "http://cran.us.r-project.org")
# install.packages("MLmetrics", quite = TRUE, repos = "http://cran.us.r-project.org") # Added this for classification models
# install.packages("glmnet", quite = TRUE, repos = "http://cran.us.r-project.org")
# install.packages("pROC", quite = TRUE, repos = "http://cran.us.r-project.org") # Added this for ROC curves
# install.packages("ggplot2", quite = TRUE, repos = "http://cran.us.r-project.org") # Added this for plotting
# install.packages("randomForest", quite = TRUE, repos = "http://cran.us.r-project.org") # Added this for randomForest
# install.packages("rpart", quite = TRUE, repos = "http://cran.us.r-project.org") # Added this for rpart
Data cleaning was carried out with the dual goals of ensuring data integrity and preparing the dataset for machine learning algorithms, most of which require clean, numeric inputs.
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
library(ggcorrplot)
## Loading required package: ggplot2
df <- read_csv("sample_data/diabetes_prediction_dataset.csv")
## Rows: 100000 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): gender, smoking_history
## dbl (7): age, hypertension, heart_disease, bmi, HbA1c_level, blood_glucose_l...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(df)
## # A tibble: 6 × 9
## gender age hypertension heart_disease smoking_history bmi HbA1c_level
## <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 Female 32 0 0 never 27.3 5
## 2 Female 54 0 0 former 54.7 6
## 3 Female 20 0 0 never 22.2 3.5
## 4 Female 44 0 0 never 24.9 6.1
## 5 Female 3 0 0 No Info 19.3 6.5
## 6 Female 74 0 0 No Info 28.1 5
## # ℹ 2 more variables: blood_glucose_level <dbl>, diabetes <dbl>
glimpse(df)
## Rows: 100,000
## Columns: 9
## $ gender <chr> "Female", "Female", "Female", "Female", "Female", …
## $ age <dbl> 32.00, 54.00, 20.00, 44.00, 3.00, 74.00, 80.00, 27…
## $ hypertension <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ heart_disease <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ smoking_history <chr> "never", "former", "never", "never", "No Info", "N…
## $ bmi <dbl> 27.32, 54.70, 22.19, 24.93, 19.27, 28.12, 20.05, 3…
## $ HbA1c_level <dbl> 5.0, 6.0, 3.5, 6.1, 6.5, 5.0, 5.7, 5.7, 6.0, 6.1, …
## $ blood_glucose_level <dbl> 100, 105, 105, 105, 105, 105, 90, 105, 105, 105, 1…
## $ diabetes <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
summary(df)
## gender age hypertension heart_disease
## Length:100000 Min. : 0.08 Min. :0.00000 Min. :0.00000
## Class :character 1st Qu.:24.00 1st Qu.:0.00000 1st Qu.:0.00000
## Mode :character Median :43.00 Median :0.00000 Median :0.00000
## Mean :41.89 Mean :0.07485 Mean :0.03942
## 3rd Qu.:60.00 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :80.00 Max. :1.00000 Max. :1.00000
## NA's :1 NA's :1
## smoking_history bmi HbA1c_level blood_glucose_level
## Length:100000 Min. :10.01 Min. :3.500 Min. : 80.00
## Class :character 1st Qu.:23.63 1st Qu.:4.800 1st Qu.: 90.00
## Mode :character Median :27.32 Median :5.800 Median : 90.00
## Mean :27.32 Mean :5.528 Mean : 98.47
## 3rd Qu.:29.58 3rd Qu.:6.200 3rd Qu.: 90.00
## Max. :95.69 Max. :9.000 Max. :300.00
## NA's :1 NA's :2
## diabetes
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.085
## 3rd Qu.:0.000
## Max. :1.000
##
The dataset contained a very small proportion of missing entries,
primarily in the bmi, HbA1c_level, and
age columns. Since these represented less than 0.01% of the
total rows, a complete-case strategy was adopted using
drop_na(). This allowed preservation of data quality
without significantly reducing the sample size.
# Some columns contain null values; since only a few rows are affected,
# remove them
df <- drop_na(df)
glimpse(df)
## Rows: 99,993
## Columns: 9
## $ gender <chr> "Female", "Female", "Female", "Female", "Female", …
## $ age <dbl> 32.00, 54.00, 20.00, 44.00, 3.00, 74.00, 80.00, 27…
## $ hypertension <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ heart_disease <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ smoking_history <chr> "never", "former", "never", "never", "No Info", "N…
## $ bmi <dbl> 27.32, 54.70, 22.19, 24.93, 19.27, 28.12, 20.05, 3…
## $ HbA1c_level <dbl> 5.0, 6.0, 3.5, 6.1, 6.5, 5.0, 5.7, 5.7, 6.0, 6.1, …
## $ blood_glucose_level <dbl> 100, 105, 105, 105, 105, 105, 90, 105, 105, 105, 1…
## $ diabetes <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
# We dropped the 7 rows that contained null values
Data validation also revealed some entries where age was
less than or equal to 1. These records were deemed likely erroneous or
irrelevant to the adult-onset diabetes focus of this project and were
excluded from the dataset.
# Gender has an "Other" category with only a few records; check its frequency
# (during encoding below, these rows are grouped with Male as 1)
df %>%
group_by(gender) %>%
count()
## # A tibble: 3 × 2
## # Groups: gender [3]
## gender n
## <chr> <int>
## 1 Female 58550
## 2 Male 41425
## 3 Other 18
glimpse(df)
## Rows: 99,993
## Columns: 9
## $ gender <chr> "Female", "Female", "Female", "Female", "Female", …
## $ age <dbl> 32.00, 54.00, 20.00, 44.00, 3.00, 74.00, 80.00, 27…
## $ hypertension <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ heart_disease <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ smoking_history <chr> "never", "former", "never", "never", "No Info", "N…
## $ bmi <dbl> 27.32, 54.70, 22.19, 24.93, 19.27, 28.12, 20.05, 3…
## $ HbA1c_level <dbl> 5.0, 6.0, 3.5, 6.1, 6.5, 5.0, 5.7, 5.7, 6.0, 6.1, …
## $ blood_glucose_level <dbl> 100, 105, 105, 105, 105, 105, 90, 105, 105, 105, 1…
## $ diabetes <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
Two categorical variables required transformation: gender was encoded as 0 for Female and 1 for Male, and smoking_history values of "never" and "No Info" were encoded as 0, while other responses (e.g., "former", "current") were grouped as 1, indicating a known smoking history. These transformations were executed using conditional logic with ifelse() and ensured that all model inputs were numeric.
# A few rows have age of 1 or below; treat these as noisy data and remove them
df <- df %>% filter(age > 1)
glimpse(df)
## Rows: 98,999
## Columns: 9
## $ gender <chr> "Female", "Female", "Female", "Female", "Female", …
## $ age <dbl> 32, 54, 20, 44, 3, 74, 80, 27, 30, 75, 11, 68, 33,…
## $ hypertension <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ heart_disease <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ smoking_history <chr> "never", "former", "never", "never", "No Info", "N…
## $ bmi <dbl> 27.32, 54.70, 22.19, 24.93, 19.27, 28.12, 20.05, 3…
## $ HbA1c_level <dbl> 5.0, 6.0, 3.5, 6.1, 6.5, 5.0, 5.7, 5.7, 6.0, 6.1, …
## $ blood_glucose_level <dbl> 100, 105, 105, 105, 105, 105, 90, 105, 105, 105, 1…
## $ diabetes <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
# Convert gender to numerical (0 and 1)
df$gender <- ifelse(df$gender == "Female", 0, 1)
# Convert smoking_history to numerical (0 and 1)
df$smoking_history <- ifelse(df$smoking_history %in% c("never", "No Info"), 0, 1)
glimpse(df)
## Rows: 98,999
## Columns: 9
## $ gender <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0,…
## $ age <dbl> 32, 54, 20, 44, 3, 74, 80, 27, 30, 75, 11, 68, 33,…
## $ hypertension <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ heart_disease <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ smoking_history <dbl> 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0,…
## $ bmi <dbl> 27.32, 54.70, 22.19, 24.93, 19.27, 28.12, 20.05, 3…
## $ HbA1c_level <dbl> 5.0, 6.0, 3.5, 6.1, 6.5, 5.0, 5.7, 5.7, 6.0, 6.1, …
## $ blood_glucose_level <dbl> 100, 105, 105, 105, 105, 105, 90, 105, 105, 105, 1…
## $ diabetes <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
After completing these preprocessing steps, the final dataset consisted of 98,999 observations with clean, model-ready variables.
# Create a correlation matrix for the numeric columns
numeric_cols <- sapply(df, is.numeric)
cor_matrix <- cor(df[, numeric_cols], use = "complete.obs")
# Visualize the correlation matrix using ggcorrplot
ggcorrplot(cor_matrix, hc.order = TRUE, type = "lower",
lab = TRUE, lab_size = 3,
method = "circle", colors = c("tomato2", "white", "springgreen3"))
Numeric features are analyzed to find the most important ones. The sapply() function identifies the numeric columns, cor() calculates the correlation matrix, and the ggcorrplot library visualizes it. From the plot, the strongest correlations are between blood_glucose_level and diabetes (0.84), HbA1c_level and diabetes (0.40), and both age with bmi and HbA1c_level with blood_glucose_level (around 0.34).
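As a quick cross-check of the heatmap, the correlations with the diabetes column can also be ranked programmatically. This is a small sketch reusing the cor_matrix object created above; diab_cor is a new helper name introduced here.
# Rank the absolute correlations of every feature with the diabetes target
diab_cor <- cor_matrix[, "diabetes"]
sort(abs(diab_cor[names(diab_cor) != "diabetes"]), decreasing = TRUE)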
#BMI vs diabetes
ggplot(df, aes(x = bmi, fill = factor(diabetes))) +
geom_histogram(bins = 30, alpha = 0.7, position = "identity") +
labs(title = "BMI distribution by Diabetes", x = "BMI", y = "Frequency", fill = "Diabetes")
#smoking_history vs diabetes
ggplot(df, aes(x = smoking_history, fill = factor(diabetes))) +
geom_bar(position = "dodge") +
labs(title = "Smoking History by Diabetes", x = "Smoking History", y = "Frequency", fill = "Diabetes") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
#Age vs diabetes
ggplot(df,aes(x=age, fill = factor(diabetes))) +
geom_histogram(bins=30,alpha=0.7,position="identity")+
labs(title="Age Distribution by Diabetes",x="Age",y="Frequency",fill="Diabetes")
Histograms and bar charts are plotted with ggplot() (geom_histogram() and geom_bar()) to show the distribution of age, BMI, and smoking status by diabetes outcome. Age and BMI are key features that help identify diabetes risk. Older people are more likely to have diabetes, although some younger individuals (from roughly age 10 upwards) are also affected. A BMI above 25 (overweight) increases diabetes risk, though a small number of people with a normal BMI also have diabetes. For smoking history, most diabetic cases fall in the never/No Info group, but this largely reflects that group's size; those with a known smoking history show a higher rate of diabetes within their own group, suggesting smoking may increase the risk.
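The within-group comparison can be checked directly with a brief sketch using the encoded smoking_history column (0 = never/No Info, 1 = known smoking history); this simply summarises the existing df with dplyr.
# Diabetes prevalence within each smoking_history group
df %>%
  group_by(smoking_history) %>%
  summarise(n = n(), diabetes_rate = mean(diabetes))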
glimpse(df)
## Rows: 98,999
## Columns: 9
## $ gender <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0,…
## $ age <dbl> 32, 54, 20, 44, 3, 74, 80, 27, 30, 75, 11, 68, 33,…
## $ hypertension <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ heart_disease <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ smoking_history <dbl> 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0,…
## $ bmi <dbl> 27.32, 54.70, 22.19, 24.93, 19.27, 28.12, 20.05, 3…
## $ HbA1c_level <dbl> 5.0, 6.0, 3.5, 6.1, 6.5, 5.0, 5.7, 5.7, 6.0, 6.1, …
## $ blood_glucose_level <dbl> 100, 105, 105, 105, 105, 105, 90, 105, 105, 105, 1…
## $ diabetes <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
# Load required libraries
library(dplyr)
library(caret)
## Loading required package: lattice
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(xgboost)
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
##
## slice
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-9
library(rpart)
# Work with a copy of the cleaned data
clean_data <- df
# Convert categorical variables to factors
clean_data$gender <- as.factor(clean_data$gender)
clean_data$diabetes <- as.factor(clean_data$diabetes)
# Split features and target variables
X <- clean_data %>% select(-diabetes, -blood_glucose_level)
y_class <- clean_data$diabetes
y_reg <- clean_data$blood_glucose_level
# Split data into training and testing sets
set.seed(42)
train_index <- createDataPartition(y_class, p = 0.8, list = FALSE)
X_train <- X[train_index,]
X_test <- X[-train_index,]
y_class_train <- y_class[train_index]
y_class_test <- y_class[-train_index]
y_reg_train <- y_reg[train_index]
y_reg_test <- y_reg[-train_index]
# Convert factor columns in X_train and X_test to numeric
# (handles factors remaining after initial processing; note that as.numeric()
# on a factor returns its level index, e.g. 1 and 2 for a two-level factor)
X_train[] <- lapply(X_train, function(x) if(is.factor(x)) as.numeric(x) else x)
X_test[] <- lapply(X_test, function(x) if(is.factor(x)) as.numeric(x) else x)
# Install and load necessary packages
if(!require(randomForest)){install.packages("randomForest")}
if(!require(e1071)){install.packages("e1071")}
## Loading required package: e1071
if(!require(rpart)){install.packages("rpart")}
if(!require(caret)){install.packages("caret")}
if(!require(MLmetrics)){install.packages("MLmetrics")}
## Loading required package: MLmetrics
##
## Attaching package: 'MLmetrics'
## The following objects are masked from 'package:caret':
##
## MAE, RMSE
## The following object is masked from 'package:base':
##
## Recall
if(!require(MLmetrics)){install.packages("glmnet")}
install.packages("glmnet")
## Warning: package 'glmnet' is in use and will not be installed
library(dplyr)
library(caret)
library(randomForest)
library(xgboost)
library(glmnet)
library(rpart)
Research Question:
How accurately can we predict whether an individual has diabetes
using health-related features such as age, BMI, smoking status, and
HbA1c level?
In this task, the target variable was diabetes (binary).
The features used excluded blood_glucose_level to avoid
target leakage. The dataset was split into 80% training and 20% testing
using stratified sampling (createDataPartition() from
caret) to preserve the original class distribution.
All predictor variables were already numeric or transformed to be so. No scaling was required for the tree-based models. This classification problem was modeled using several algorithms, including Random Forest, XGBoost, KNN, and Logistic Regression, evaluated using standard metrics such as the F1 score and AUC.
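As a quick verification, a small sketch using the split objects created above compares the class proportions in the training and test sets to confirm that stratification preserved the distribution.
# Class proportions should be nearly identical across the two splits
prop.table(table(y_class_train))
prop.table(table(y_class_test))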
The first classification model is Random Forest. The training split is fitted to the random forest model, and the trained model is then used to predict the class of the unseen test data (X_test). The F1 score of the Random Forest model is calculated from the confusion matrix produced by confusionMatrix().
# Classification Models
# 1. Random Forest
rf_class <- randomForest(x = X_train, y = y_class_train)
rf_pred <- predict(rf_class, X_test)
rf_f1 <- confusionMatrix(rf_pred, y_class_test)$byClass["F1"]
The second classification model is XGBoost. The training data is fitted to the XGBoost model, and the target (y_class_train) is converted to numeric because xgboost() expects a numeric label. Calling as.numeric() on the factor yields classes 1 and 2, whereas XGBoost expects 0 and 1, so 1 is subtracted to meet that requirement. Since this is a classification task, the objective is set to "binary:logistic", which outputs the probability that each observation belongs to the positive class.
After training, the model predicts probabilities for the unseen test data. Each prediction greater than 0.5 is classified as the positive class (1); otherwise it is assigned to the negative class (0). The F1 score is then calculated for later comparison.
# 2. XGBoost
# Ensure X_train and X_test are numeric matrices
xgb_class <- xgboost(data = as.matrix(X_train), label = as.numeric(y_class_train) - 1,
nrounds = 100, objective = "binary:logistic")
## [1] train-logloss:0.476237
## [2] train-logloss:0.357456
## [3] train-logloss:0.283952
## [4] train-logloss:0.235927
## [5] train-logloss:0.203585
## [6] train-logloss:0.181388
## [7] train-logloss:0.165909
## [8] train-logloss:0.155104
## [9] train-logloss:0.147334
## [10] train-logloss:0.141840
## [11] train-logloss:0.137837
## [12] train-logloss:0.134925
## [13] train-logloss:0.132849
## [14] train-logloss:0.131287
## [15] train-logloss:0.130153
## [16] train-logloss:0.129284
## [17] train-logloss:0.128450
## [18] train-logloss:0.127943
## [19] train-logloss:0.127495
## [20] train-logloss:0.127191
## [21] train-logloss:0.126905
## [22] train-logloss:0.126746
## [23] train-logloss:0.126338
## [24] train-logloss:0.126195
## [25] train-logloss:0.126068
## [26] train-logloss:0.126016
## [27] train-logloss:0.125904
## [28] train-logloss:0.125681
## [29] train-logloss:0.125593
## [30] train-logloss:0.125462
## [31] train-logloss:0.125412
## [32] train-logloss:0.125157
## [33] train-logloss:0.124921
## [34] train-logloss:0.124657
## [35] train-logloss:0.124619
## [36] train-logloss:0.124212
## [37] train-logloss:0.123980
## [38] train-logloss:0.123878
## [39] train-logloss:0.123853
## [40] train-logloss:0.123760
## [41] train-logloss:0.123483
## [42] train-logloss:0.123196
## [43] train-logloss:0.122979
## [44] train-logloss:0.122828
## [45] train-logloss:0.122693
## [46] train-logloss:0.122645
## [47] train-logloss:0.122521
## [48] train-logloss:0.122179
## [49] train-logloss:0.122152
## [50] train-logloss:0.122131
## [51] train-logloss:0.121743
## [52] train-logloss:0.121720
## [53] train-logloss:0.121664
## [54] train-logloss:0.121645
## [55] train-logloss:0.121415
## [56] train-logloss:0.120917
## [57] train-logloss:0.120760
## [58] train-logloss:0.120537
## [59] train-logloss:0.120248
## [60] train-logloss:0.120020
## [61] train-logloss:0.119783
## [62] train-logloss:0.119767
## [63] train-logloss:0.119607
## [64] train-logloss:0.119581
## [65] train-logloss:0.119125
## [66] train-logloss:0.118922
## [67] train-logloss:0.118691
## [68] train-logloss:0.118529
## [69] train-logloss:0.118364
## [70] train-logloss:0.118042
## [71] train-logloss:0.118013
## [72] train-logloss:0.118002
## [73] train-logloss:0.117697
## [74] train-logloss:0.117463
## [75] train-logloss:0.117399
## [76] train-logloss:0.117364
## [77] train-logloss:0.117203
## [78] train-logloss:0.117071
## [79] train-logloss:0.117001
## [80] train-logloss:0.116625
## [81] train-logloss:0.116615
## [82] train-logloss:0.116469
## [83] train-logloss:0.116354
## [84] train-logloss:0.116164
## [85] train-logloss:0.115985
## [86] train-logloss:0.115747
## [87] train-logloss:0.115560
## [88] train-logloss:0.115328
## [89] train-logloss:0.115192
## [90] train-logloss:0.115103
## [91] train-logloss:0.114983
## [92] train-logloss:0.114860
## [93] train-logloss:0.114844
## [94] train-logloss:0.114823
## [95] train-logloss:0.114734
## [96] train-logloss:0.114589
## [97] train-logloss:0.114426
## [98] train-logloss:0.114412
## [99] train-logloss:0.114372
## [100] train-logloss:0.114238
xgb_pred <- predict(xgb_class, as.matrix(X_test))
xgb_pred <- as.factor(ifelse(xgb_pred > 0.5, 1, 0))
xgb_f1 <- confusionMatrix(xgb_pred, y_class_test)$byClass["F1"]
The third classification model is K-Nearest Neighbours (KNN). The model is trained on the training data with 5-fold cross-validation, with features centered and scaled. After training, the KNN model predicts the classes of the test data and its F1 score is calculated for later comparison.
knn_class <- train(x = X_train, y = y_class_train, method = "knn",
trControl = trainControl(method = "cv", number = 5), # 5-fold cross-validation
preProcess = c("center", "scale")) # Center and scale features
## Warning: Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
knn_pred <- predict(knn_class, X_test)
knn_f1 <- confusionMatrix(knn_pred, y_class_test)$byClass["F1"]
The last classification model is logistic regression, fitted with the glm() function. Logistic regression is a type of generalised linear model; setting family = "binomial" tells glm() to fit a binary (logistic) model. After training, the model predicts the probability that each test observation belongs to the positive class; probabilities greater than 0.5 are classified as positive (1), otherwise negative (0). The F1 score of the logistic regression model is then calculated.
# 4. Logistic Regression
lr_class <- glm(y_class_train ~ ., data = X_train, family = "binomial")
lr_pred <- predict(lr_class, X_test, type = "response")
lr_pred <- as.factor(ifelse(lr_pred > 0.5, 1, 0))
lr_f1 <- confusionMatrix(lr_pred, y_class_test)$byClass["F1"]
To find the best classification model, the models are compared by their F1 scores. Vectors store the model names and their F1 scores, the model with the maximum F1 score is identified with which.max(), and rbind() combines the vectors row-wise to display every model alongside its F1 score.
# Print best classification model
class_models <- c("Random Forest", "XGBoost", "KNN", "Logistic Regression")
class_f1_scores <- c(rf_f1, xgb_f1, knn_f1, lr_f1)
best_class_model <- class_models[which.max(class_f1_scores)]
f1_scores_matrix <- rbind(class_models, class_f1_scores)
f1_scores_matrix
## F1 F1 F1
## class_models "Random Forest" "XGBoost" "KNN"
## class_f1_scores "0.975834726792168" "0.975514731744271" "0.970711523379398"
## F1
## class_models "Logistic Regression"
## class_f1_scores "0.970884809818605"
cat("\n")
cat("Best Classification Model:", best_class_model, "with F1 Score:", max(class_f1_scores), "\n")
## Best Classification Model: Random Forest with F1 Score: 0.9758347
Besides the F1 score, ROC-AUC is used to compare the classification models. The higher the AUC, the better the model separates the positive and negative classes across different classification thresholds. The ROC curve must be computed before the AUC can be calculated; the ROC curves are plotted below as a visual aid.
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(ggplot2)
library(yardstick)
##
## Attaching package: 'yardstick'
## The following objects are masked from 'package:caret':
##
## precision, recall, sensitivity, specificity
## The following object is masked from 'package:readr':
##
## spec
# Calculate ROC curves
rf_roc <- roc(y_class_test, predict(rf_class, X_test, type = "prob")[,2])
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
xgb_roc <- roc(y_class_test, predict(xgb_class, as.matrix(X_test), type = "response"))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
knn_roc <- roc(y_class_test, predict(knn_class, X_test, type = "prob")[,2], levels = c("0", "1"))
## Setting direction: controls < cases
lr_roc <- roc(y_class_test, predict(lr_class, X_test, type = "response"))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
#X_train <- X[train_index,]
#X_test <- X[-train_index,]
# Create a data frame for plotting ROC curves
roc_data <- bind_rows(
coords(rf_roc, "all") %>% mutate(Model = "Random Forest"),
coords(xgb_roc, "all") %>% mutate(Model = "XGBoost"),
coords(knn_roc, "all") %>% mutate(Model = "KNN"),
coords(lr_roc, "all") %>% mutate(Model = "Logistic Regression")
)
# Plot ROC curves
ggplot(roc_data, aes(x = 1 - specificity, y = sensitivity, color = Model)) +
geom_line(size = 1) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "gray") +
labs(title = "ROC Curves for Classification Models",
x = "False Positive Rate (1 - Specificity)",
y = "True Positive Rate (Sensitivity)") +
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Print AUC for each model
cat("\nClassification Model AUC Scores:\n")
##
## Classification Model AUC Scores:
cat("Random Forest AUC:", auc(rf_roc), "\n")
## Random Forest AUC: 0.9077738
cat("XGBoost AUC:", auc(xgb_roc), "\n")
## XGBoost AUC: 0.9449312
cat("KNN AUC:", auc(knn_roc), "\n")
## KNN AUC: 0.8860344
cat("Logistic Regression AUC:", auc(lr_roc), "\n")
## Logistic Regression AUC: 0.9305243
From the results above, the models have very similar F1 scores. The Random Forest classifier has the highest F1 score (97.58%), followed closely by XGBoost (97.55%), then Logistic Regression (97.09%), with KNN having the lowest F1 score (97.07%).
In terms of AUC, XGBoost is the highest at 94.5%, followed by Logistic Regression (93.1%), then Random Forest (90.8%), and lastly KNN (88.6%).
Since the F1 scores of the XGBoost and Random Forest models are very close and XGBoost has the highest AUC, XGBoost can be considered the best classification model for this dataset.
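For completeness, both metrics can be collected into a single table; this is a small sketch reusing the F1 scores and ROC objects computed above.
# Side-by-side comparison of F1 and AUC for the four classifiers
data.frame(
  Model = class_models,
  F1    = round(as.numeric(class_f1_scores), 4),
  AUC   = round(c(auc(rf_roc), auc(xgb_roc), auc(knn_roc), auc(lr_roc)), 4)
)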
Research Question:
Can we estimate a person’s blood glucose level using the same set of
health and demographic features?
For the regression task, blood_glucose_level served as
the continuous target variable. Predictors included all cleaned features
except diabetes, to avoid redundancy. The same train-test
split used in the classification task was reused for consistency.
Since all variables were numeric, the data was readily compatible with the regression models that were subsequently applied: Random Forest, XGBoost, Decision Tree, and Linear Regression.
First, we use random forest. Random forest works for both classification and regression because it produces numeric output: in classification this output is the probability of belonging to a class, while in regression it is the predicted value itself.
# 1. Random Forest
rf_reg <- randomForest(x = X_train, y = y_reg_train)
rf_reg_pred <- predict(rf_reg, X_test)
rf_rmse <- sqrt(mean((rf_reg_pred - y_reg_test)^2))
Second, we use XGBoost with 100 boosting rounds and squared error ("reg:squarederror") as the training objective. Since RMSE is our regression evaluation metric, the squared-error objective is a suitable choice.
# 2. XGBoost
# Ensure X_train and X_test are numeric matrices
xgb_reg <- xgboost(data = as.matrix(X_train), label = y_reg_train,
nrounds = 100, objective = "reg:squarederror")
## [1] train-rmse:75.184338
## [2] train-rmse:56.055078
## [3] train-rmse:43.702838
## [4] train-rmse:36.104619
## [5] train-rmse:31.703509
## [6] train-rmse:29.276379
## [7] train-rmse:27.995957
## [8] train-rmse:27.336362
## [9] train-rmse:26.954447
## [10] train-rmse:26.739591
## [11] train-rmse:26.623635
## [12] train-rmse:26.539067
## [13] train-rmse:26.472835
## [14] train-rmse:26.418347
## [15] train-rmse:26.372162
## [16] train-rmse:26.319027
## [17] train-rmse:26.290800
## [18] train-rmse:26.256199
## [19] train-rmse:26.243422
## [20] train-rmse:26.201184
## [21] train-rmse:26.165404
## [22] train-rmse:26.149298
## [23] train-rmse:26.135755
## [24] train-rmse:26.108132
## [25] train-rmse:26.098812
## [26] train-rmse:26.076205
## [27] train-rmse:26.061427
## [28] train-rmse:26.016878
## [29] train-rmse:25.996775
## [30] train-rmse:25.941656
## [31] train-rmse:25.910325
## [32] train-rmse:25.903199
## [33] train-rmse:25.897452
## [34] train-rmse:25.859564
## [35] train-rmse:25.847892
## [36] train-rmse:25.831328
## [37] train-rmse:25.808744
## [38] train-rmse:25.794547
## [39] train-rmse:25.789847
## [40] train-rmse:25.753101
## [41] train-rmse:25.710986
## [42] train-rmse:25.688768
## [43] train-rmse:25.628960
## [44] train-rmse:25.614830
## [45] train-rmse:25.593270
## [46] train-rmse:25.575391
## [47] train-rmse:25.546083
## [48] train-rmse:25.541652
## [49] train-rmse:25.539284
## [50] train-rmse:25.511068
## [51] train-rmse:25.470559
## [52] train-rmse:25.420285
## [53] train-rmse:25.388558
## [54] train-rmse:25.380959
## [55] train-rmse:25.352225
## [56] train-rmse:25.341326
## [57] train-rmse:25.318267
## [58] train-rmse:25.291493
## [59] train-rmse:25.273671
## [60] train-rmse:25.249120
## [61] train-rmse:25.191906
## [62] train-rmse:25.166969
## [63] train-rmse:25.136147
## [64] train-rmse:25.130765
## [65] train-rmse:25.114637
## [66] train-rmse:25.078193
## [67] train-rmse:25.045497
## [68] train-rmse:25.034148
## [69] train-rmse:25.024315
## [70] train-rmse:24.988586
## [71] train-rmse:24.948577
## [72] train-rmse:24.915647
## [73] train-rmse:24.889723
## [74] train-rmse:24.876717
## [75] train-rmse:24.867677
## [76] train-rmse:24.845676
## [77] train-rmse:24.840690
## [78] train-rmse:24.831774
## [79] train-rmse:24.804136
## [80] train-rmse:24.800562
## [81] train-rmse:24.775386
## [82] train-rmse:24.748655
## [83] train-rmse:24.746997
## [84] train-rmse:24.715903
## [85] train-rmse:24.687394
## [86] train-rmse:24.636124
## [87] train-rmse:24.607074
## [88] train-rmse:24.585421
## [89] train-rmse:24.580992
## [90] train-rmse:24.579400
## [91] train-rmse:24.575778
## [92] train-rmse:24.570698
## [93] train-rmse:24.561342
## [94] train-rmse:24.549254
## [95] train-rmse:24.532371
## [96] train-rmse:24.500695
## [97] train-rmse:24.470716
## [98] train-rmse:24.450032
## [99] train-rmse:24.443107
## [100] train-rmse:24.422137
xgb_reg_pred <- predict(xgb_reg, as.matrix(X_test))
xgb_rmse <- sqrt(mean((xgb_reg_pred - y_reg_test)^2))
Third, we use a decision tree fitted with rpart(), using its default method for a numeric response, "anova".
# 3. Decision Tree
dt_reg <- rpart(y_reg_train ~ ., data = X_train)
dt_reg_pred <- predict(dt_reg, X_test)
dt_rmse <- sqrt(mean((dt_reg_pred - y_reg_test)^2))
Lastly, we use linear regression, since the target is a continuous numeric value.
# 4. Linear Regression
lr_reg <- lm(y_reg_train ~ ., data = X_train)
lr_reg_pred <- predict(lr_reg, X_test)
lr_rmse <- sqrt(mean((lr_reg_pred - y_reg_test)^2))
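Ridge regression was loaded via glmnet but is not fitted in this report; as a hedged sketch, it could be added alongside the four models above (alpha = 0 selects the ridge penalty, and cv.glmnet chooses lambda by cross-validation; ridge_cv, ridge_pred, and ridge_rmse are new names introduced here).
# Optional: ridge regression via glmnet, with lambda selected by cross-validation
ridge_cv   <- cv.glmnet(as.matrix(X_train), y_reg_train, alpha = 0)
ridge_pred <- predict(ridge_cv, newx = as.matrix(X_test), s = "lambda.min")
ridge_rmse <- sqrt(mean((as.numeric(ridge_pred) - y_reg_test)^2))
ridge_rmse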
RMSE is used to select the best regression model; a lower RMSE means the predictions deviate less from the actual values.
# Print best regression model
reg_models <- c("Random Forest", "XGBoost", "Decision Tree", "Linear Regression")
reg_rmse_scores <- c(rf_rmse, xgb_rmse, dt_rmse, lr_rmse)
best_reg_model <- reg_models[which.min(reg_rmse_scores)]
rmse_scores_matrix <- rbind(reg_models, reg_rmse_scores)
rmse_scores_matrix
## [,1] [,2] [,3]
## reg_models "Random Forest" "XGBoost" "Decision Tree"
## reg_rmse_scores "27.5938024602372" "28.1465381391684" "27.9635129276575"
## [,4]
## reg_models "Linear Regression"
## reg_rmse_scores "31.4164296070712"
cat("Best Regression Model:", best_reg_model, "with RMSE:", min(reg_rmse_scores))
## Best Regression Model: Random Forest with RMSE: 27.5938
The best models are stored as RDS files so they can later be deployed in our Shiny app.
saveRDS(rf_class, "diabetes_model.rds")
saveRDS(rf_reg, "glucose_model.rds")
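As a minimal sketch of how the saved models might be used at deployment time, one held-out row is scored below; in the Shiny app the inputs would need the same column names and encoding as the training features (diabetes_model, glucose_model, and new_obs are new names introduced for this example).
# Load the saved models and score one row of the test set as a sanity check
diabetes_model <- readRDS("diabetes_model.rds")
glucose_model  <- readRDS("glucose_model.rds")

new_obs <- X_test[1, ]  # same feature columns and encoding as used in training

predict(diabetes_model, new_obs, type = "prob")  # class probabilities for diabetes
predict(glucose_model, new_obs)                  # predicted blood glucose level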
For the classification model (predicting diabetes), Random Forest performed best, with an F1 score of 0.9758 and an AUC of 0.9078, and it is the model saved for deployment.
For the regression model (predicting blood glucose level), Random Forest was again the best model, with the lowest RMSE of 27.59.