RPubs link: https://rpubs.com/oscarhew/WQD7004Report
Diabetes Mellitus is a chronic metabolic disorder characterized by the body’s inability to regulate blood glucose levels effectively. With its growing prevalence worldwide, early detection and continuous monitoring are critical for reducing health risks and improving patient outcomes. The rise in data science and machine learning has opened new opportunities for building predictive models that can aid healthcare providers in identifying at-risk individuals and personalizing treatment plans.
In this project, we utilize a health dataset containing biometric and lifestyle-related features to build two predictive models: one to classify individuals at risk of developing diabetes, and another to predict blood glucose levels. These models aim to enhance early screening and support data-driven healthcare decisions.
This document explores a dataset related to diabetes prediction. The data includes demographic and medical attributes of individuals and is used to perform data preprocessing, exploratory data analysis (EDA), classification modeling to detect diabetes, and regression modeling to estimate blood glucose levels.
Title: Diabetes Prediction Dataset
Year: Not explicitly stated; last updated on Kaggle around 2 years ago.
Source: Obtained from Kaggle: Diabetes Prediction Dataset
Purpose: To analyze and predict diabetes occurrence and blood glucose levels using individual health features such as BMI, HbA1c level, and smoking history.
Dimension: 100,000 rows (individual records) × 9 columns (features)
Content (Features): gender, age, hypertension, heart_disease, smoking_history, bmi, HbA1c_level, blood_glucose_level, and diabetes.
Structure: two character columns (gender, smoking_history) and seven numeric columns, including the two targets: diabetes (binary: 0 or 1) and blood_glucose_level (continuous numeric).
Summary: the key numeric features (bmi, blood_glucose_level, HbA1c_level) show strong predictive potential, while the categorical variables (gender and smoking_history) may require cleaning or encoding prior to effective modeling.
To initiate the analysis, a suite of R libraries was loaded to
support data ingestion, wrangling, visualization, and machine learning.
Packages such as readr, dplyr, and
tidyr facilitated data import and transformation. For
modeling, libraries like caret, randomForest,
xgboost, and glmnet were used to develop and
evaluate both classification and regression models. Visualizations were
constructed using ggplot2 and ggcorrplot to
explore feature relationships. The dataset was imported using
read_csv() for efficient parsing of its structured
format.
# install.packages("ggcorrplot", quite = TRUE, repos = "http://cran.us.r-project.org")
# install.packages("caTools", quite = TRUE, repos = "http://cran.us.r-project.org")
# install.packages("e1071", quite = TRUE, repos = "http://cran.us.r-project.org")
# install.packages("Metrics", quite = TRUE, repos = "http://cran.us.r-project.org")
# install.packages("xgboost", quite = TRUE, repos = "http://cran.us.r-project.org")
# install.packages("caret", quite = TRUE, repos = "http://cran.us.r-project.org")
# install.packages("yardstick", quite = TRUE, repos = "http://cran.us.r-project.org")
# install.packages("MLmetrics", quite = TRUE, repos = "http://cran.us.r-project.org") # Added this for classification models
# install.packages("glmnet", quite = TRUE, repos = "http://cran.us.r-project.org")
# install.packages("pROC", quite = TRUE, repos = "http://cran.us.r-project.org") # Added this for ROC curves
# install.packages("ggplot2", quite = TRUE, repos = "http://cran.us.r-project.org") # Added this for plotting
# install.packages("randomForest", quite = TRUE, repos = "http://cran.us.r-project.org") # Added this for randomForest
# install.packages("rpart", quite = TRUE, repos = "http://cran.us.r-project.org") # Added this for rpart
Data cleaning was carried out with the dual goals of ensuring data integrity and preparing the dataset for machine learning algorithms, most of which require clean, numeric inputs.
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
library(ggcorrplot)
## Loading required package: ggplot2
df <- read_csv("sample_data/diabetes_prediction_dataset.csv")
## Rows: 100000 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): gender, smoking_history
## dbl (7): age, hypertension, heart_disease, bmi, HbA1c_level, blood_glucose_l...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(df)
## # A tibble: 6 × 9
## gender age hypertension heart_disease smoking_history bmi HbA1c_level
## <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 Female 32 0 0 never 27.3 5
## 2 Female 54 0 0 former 54.7 6
## 3 Female 20 0 0 never 22.2 3.5
## 4 Female 44 0 0 never 24.9 6.1
## 5 Female 3 0 0 No Info 19.3 6.5
## 6 Female 74 0 0 No Info 28.1 5
## # ℹ 2 more variables: blood_glucose_level <dbl>, diabetes <dbl>
glimpse(df)
## Rows: 100,000
## Columns: 9
## $ gender <chr> "Female", "Female", "Female", "Female", "Female", …
## $ age <dbl> 32.00, 54.00, 20.00, 44.00, 3.00, 74.00, 80.00, 27…
## $ hypertension <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ heart_disease <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ smoking_history <chr> "never", "former", "never", "never", "No Info", "N…
## $ bmi <dbl> 27.32, 54.70, 22.19, 24.93, 19.27, 28.12, 20.05, 3…
## $ HbA1c_level <dbl> 5.0, 6.0, 3.5, 6.1, 6.5, 5.0, 5.7, 5.7, 6.0, 6.1, …
## $ blood_glucose_level <dbl> 100, 105, 105, 105, 105, 105, 90, 105, 105, 105, 1…
## $ diabetes <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
summary(df)
## gender age hypertension heart_disease
## Length:100000 Min. : 0.08 Min. :0.00000 Min. :0.00000
## Class :character 1st Qu.:24.00 1st Qu.:0.00000 1st Qu.:0.00000
## Mode :character Median :43.00 Median :0.00000 Median :0.00000
## Mean :41.89 Mean :0.07485 Mean :0.03942
## 3rd Qu.:60.00 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :80.00 Max. :1.00000 Max. :1.00000
## NA's :1 NA's :1
## smoking_history bmi HbA1c_level blood_glucose_level
## Length:100000 Min. :10.01 Min. :3.500 Min. : 80.00
## Class :character 1st Qu.:23.63 1st Qu.:4.800 1st Qu.: 90.00
## Mode :character Median :27.32 Median :5.800 Median : 90.00
## Mean :27.32 Mean :5.528 Mean : 98.47
## 3rd Qu.:29.58 3rd Qu.:6.200 3rd Qu.: 90.00
## Max. :95.69 Max. :9.000 Max. :300.00
## NA's :1 NA's :2
## diabetes
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.085
## 3rd Qu.:0.000
## Max. :1.000
##
The dataset contained a very small proportion of missing entries,
primarily in the bmi, HbA1c_level, and
age columns. Since these represented less than 0.01% of the
total rows, a complete-case strategy was adopted using
drop_na(). This allowed preservation of data quality
without significantly reducing the sample size.
# Some columns contain null values; since only a few rows are affected,
# remove them
df <- drop_na(df)
glimpse(df)
## Rows: 99,993
## Columns: 9
## $ gender <chr> "Female", "Female", "Female", "Female", "Female", …
## $ age <dbl> 32.00, 54.00, 20.00, 44.00, 3.00, 74.00, 80.00, 27…
## $ hypertension <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ heart_disease <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ smoking_history <chr> "never", "former", "never", "never", "No Info", "N…
## $ bmi <dbl> 27.32, 54.70, 22.19, 24.93, 19.27, 28.12, 20.05, 3…
## $ HbA1c_level <dbl> 5.0, 6.0, 3.5, 6.1, 6.5, 5.0, 5.7, 5.7, 6.0, 6.1, …
## $ blood_glucose_level <dbl> 100, 105, 105, 105, 105, 105, 90, 105, 105, 105, 1…
## $ diabetes <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
# We dropped the 7 rows that contained null values
Data validation also revealed some entries where age was
less than or equal to 1. These records were deemed likely erroneous or
irrelevant to the adult-onset diabetes focus of this project and were
excluded from the dataset.
# Gender has an "Other" category with only a few records; check its frequency
# (during encoding below, these rows are grouped with Male as 1)
df %>%
group_by(gender) %>%
count()
## # A tibble: 3 × 2
## # Groups: gender [3]
## gender n
## <chr> <int>
## 1 Female 58550
## 2 Male 41425
## 3 Other 18
glimpse(df)
## Rows: 99,993
## Columns: 9
## $ gender <chr> "Female", "Female", "Female", "Female", "Female", …
## $ age <dbl> 32.00, 54.00, 20.00, 44.00, 3.00, 74.00, 80.00, 27…
## $ hypertension <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ heart_disease <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ smoking_history <chr> "never", "former", "never", "never", "No Info", "N…
## $ bmi <dbl> 27.32, 54.70, 22.19, 24.93, 19.27, 28.12, 20.05, 3…
## $ HbA1c_level <dbl> 5.0, 6.0, 3.5, 6.1, 6.5, 5.0, 5.7, 5.7, 6.0, 6.1, …
## $ blood_glucose_level <dbl> 100, 105, 105, 105, 105, 105, 90, 105, 105, 105, 1…
## $ diabetes <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
Two categorical variables required transformation: gender was encoded as 0 for Female and 1 for Male, and smoking_history values of "never" and "No Info" were encoded as 0, while other responses (e.g., "former", "current") were grouped as 1, indicating a known smoking history. These transformations were executed using conditional logic with ifelse() and ensured that all model inputs were numeric.
# A few rows have age of 1 or below; treat these as noisy data and remove them
df <- df %>% filter(age > 1)
glimpse(df)
## Rows: 98,999
## Columns: 9
## $ gender <chr> "Female", "Female", "Female", "Female", "Female", …
## $ age <dbl> 32, 54, 20, 44, 3, 74, 80, 27, 30, 75, 11, 68, 33,…
## $ hypertension <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ heart_disease <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ smoking_history <chr> "never", "former", "never", "never", "No Info", "N…
## $ bmi <dbl> 27.32, 54.70, 22.19, 24.93, 19.27, 28.12, 20.05, 3…
## $ HbA1c_level <dbl> 5.0, 6.0, 3.5, 6.1, 6.5, 5.0, 5.7, 5.7, 6.0, 6.1, …
## $ blood_glucose_level <dbl> 100, 105, 105, 105, 105, 105, 90, 105, 105, 105, 1…
## $ diabetes <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
# Convert gender to numerical (0 and 1)
df$gender <- ifelse(df$gender == "Female", 0, 1)
# Convert smoking_history to numerical (0 and 1)
df$smoking_history <- ifelse(df$smoking_history %in% c("never", "No Info"), 0, 1)
glimpse(df)
## Rows: 98,999
## Columns: 9
## $ gender <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0,…
## $ age <dbl> 32, 54, 20, 44, 3, 74, 80, 27, 30, 75, 11, 68, 33,…
## $ hypertension <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ heart_disease <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ smoking_history <dbl> 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0,…
## $ bmi <dbl> 27.32, 54.70, 22.19, 24.93, 19.27, 28.12, 20.05, 3…
## $ HbA1c_level <dbl> 5.0, 6.0, 3.5, 6.1, 6.5, 5.0, 5.7, 5.7, 6.0, 6.1, …
## $ blood_glucose_level <dbl> 100, 105, 105, 105, 105, 105, 90, 105, 105, 105, 1…
## $ diabetes <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
After completing these preprocessing steps, the final dataset consisted of 98,999 observations with clean, model-ready variables.
# Create a correlation matrix for the numeric columns
numeric_cols <- sapply(df, is.numeric)
cor_matrix <- cor(df[, numeric_cols], use = "complete.obs")
# Visualize the correlation matrix using ggcorrplot
ggcorrplot(cor_matrix, hc.order = TRUE, type = "lower",
lab = TRUE, lab_size = 3,
method = "circle", colors = c("tomato2", "white", "springgreen3"))
Numeric features are analyzed to find the most important ones. The sapply() function identifies the numeric columns, cor() calculates the correlation matrix, and the ggcorrplot library visualizes it. From the plot, the strongest correlations are between blood_glucose_level and diabetes (0.84), HbA1c_level and diabetes (0.40), and both age with bmi and HbA1c_level with blood_glucose_level (around 0.34).
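As a quick cross-check of the heatmap, the correlations with the diabetes column can also be ranked programmatically. This is a small sketch reusing the cor_matrix object created above; diab_cor is a new helper name introduced here.
# Rank the absolute correlations of every feature with the diabetes target
diab_cor <- cor_matrix[, "diabetes"]
sort(abs(diab_cor[names(diab_cor) != "diabetes"]), decreasing = TRUE)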
#BMI vs diabetes
ggplot(df, aes(x = bmi, fill = factor(diabetes))) +
geom_histogram(bins = 30, alpha = 0.7, position = "identity") +
labs(title = "BMI distribution by Diabetes", x = "BMI", y = "Frequency", fill = "Diabetes")
#smoking_history vs diabetes
ggplot(df, aes(x = smoking_history, fill = factor(diabetes))) +
geom_bar(position = "dodge") +
labs(title = "Smoking History by Diabetes", x = "Smoking History", y = "Frequency", fill = "Diabetes") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
#Age vs diabetes
ggplot(df,aes(x=age, fill = factor(diabetes))) +
geom_histogram(bins=30,alpha=0.7,position="identity")+
labs(title="Age Distribution by Diabetes",x="Age",y="Frequency",fill="Diabetes")
Histograms and bar charts are plotted with ggplot() (geom_histogram() and geom_bar()) to show the distribution of age, BMI, and smoking status by diabetes outcome. Age and BMI are key features that help identify diabetes risk. Older people are more likely to have diabetes, although some younger individuals (from roughly age 10 upwards) are also affected. A BMI above 25 (overweight) increases diabetes risk, though a small number of people with a normal BMI also have diabetes. For smoking history, most diabetic cases fall in the never/No Info group, but this largely reflects that group's size; those with a known smoking history show a higher rate of diabetes within their own group, suggesting smoking may increase the risk.
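The within-group comparison can be checked directly with a brief sketch using the encoded smoking_history column (0 = never/No Info, 1 = known smoking history); this simply summarises the existing df with dplyr.
# Diabetes prevalence within each smoking_history group
df %>%
  group_by(smoking_history) %>%
  summarise(n = n(), diabetes_rate = mean(diabetes))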
glimpse(df)
## Rows: 98,999
## Columns: 9
## $ gender <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0,…
## $ age <dbl> 32, 54, 20, 44, 3, 74, 80, 27, 30, 75, 11, 68, 33,…
## $ hypertension <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ heart_disease <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ smoking_history <dbl> 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0,…
## $ bmi <dbl> 27.32, 54.70, 22.19, 24.93, 19.27, 28.12, 20.05, 3…
## $ HbA1c_level <dbl> 5.0, 6.0, 3.5, 6.1, 6.5, 5.0, 5.7, 5.7, 6.0, 6.1, …
## $ blood_glucose_level <dbl> 100, 105, 105, 105, 105, 105, 90, 105, 105, 105, 1…
## $ diabetes <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
# Load required libraries
library(dplyr)
library(caret)
## Loading required package: lattice
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(xgboost)
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
##
## slice
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-9
library(rpart)
# Work with a copy of the cleaned data
clean_data <- df
# Convert categorical variables to factors
clean_data$gender <- as.factor(clean_data$gender)
clean_data$diabetes <- as.factor(clean_data$diabetes)
# Split features and target variables
X <- clean_data %>% select(-diabetes, -blood_glucose_level)
y_class <- clean_data$diabetes
y_reg <- clean_data$blood_glucose_level
# Split data into training and testing sets
set.seed(42)
train_index <- createDataPartition(y_class, p = 0.8, list = FALSE)
X_train <- X[train_index,]
X_test <- X[-train_index,]
y_class_train <- y_class[train_index]
y_class_test <- y_class[-train_index]
y_reg_train <- y_reg[train_index]
y_reg_test <- y_reg[-train_index]
# Convert factor columns in X_train and X_test to numeric
# (handles factors remaining after initial processing; note that as.numeric()
# on a factor returns its level index, e.g. 1 and 2 for a two-level factor)
X_train[] <- lapply(X_train, function(x) if(is.factor(x)) as.numeric(x) else x)
X_test[] <- lapply(X_test, function(x) if(is.factor(x)) as.numeric(x) else x)
# Install and load necessary packages
if(!require(randomForest)){install.packages("randomForest")}
if(!require(e1071)){install.packages("e1071")}
## Loading required package: e1071
if(!require(rpart)){install.packages("rpart")}
if(!require(caret)){install.packages("caret")}
if(!require(MLmetrics)){install.packages("MLmetrics")}
## Loading required package: MLmetrics
##
## Attaching package: 'MLmetrics'
## The following objects are masked from 'package:caret':
##
## MAE, RMSE
## The following object is masked from 'package:base':
##
## Recall
if(!require(MLmetrics)){install.packages("glmnet")}
install.packages("glmnet")
## Warning: package 'glmnet' is in use and will not be installed
library(dplyr)
library(caret)
library(randomForest)
library(xgboost)
library(glmnet)
library(rpart)
Research Question:
How accurately can we predict whether an individual has diabetes
using health-related features such as age, BMI, smoking status, and
HbA1c level?
In this task, the target variable was diabetes (binary).
The features used excluded blood_glucose_level to avoid
target leakage. The dataset was split into 80% training and 20% testing
using stratified sampling (createDataPartition() from
caret) to preserve the original class distribution.
All predictor variables were already numeric or transformed to be so. No scaling was required for the tree-based models. This classification problem was modeled using several algorithms, including Random Forest, XGBoost, KNN, and Logistic Regression, evaluated using standard metrics such as the F1 score and AUC.
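As a quick verification, a small sketch using the split objects created above compares the class proportions in the training and test sets to confirm that stratification preserved the distribution.
# Class proportions should be nearly identical across the two splits
prop.table(table(y_class_train))
prop.table(table(y_class_test))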
The first classification model is Random Forest. The training split is fitted to the random forest model, and the trained model is then used to predict the class of the unseen test data (X_test). The F1 score of the Random Forest model is calculated from the confusion matrix produced by confusionMatrix().
# Classification Models
# 1. Random Forest
rf_class <- randomForest(x = X_train, y = y_class_train)
rf_pred <- predict(rf_class, X_test)
rf_f1 <- confusionMatrix(rf_pred, y_class_test)$byClass["F1"]
The second classification model is XGBoost. The training data is fitted to the XGBoost model, and the target (y_class_train) is converted to numeric because xgboost() expects a numeric label. Calling as.numeric() on the factor yields classes 1 and 2, whereas XGBoost expects 0 and 1, so 1 is subtracted to meet that requirement. Since this is a classification task, the objective is set to "binary:logistic", which outputs the probability that each observation belongs to the positive class.
After training, the model predicts probabilities for the unseen test data. Each prediction greater than 0.5 is classified as the positive class (1); otherwise it is assigned to the negative class (0). The F1 score is then calculated for later comparison.
# 2. XGBoost
# Ensure X_train and X_test are numeric matrices
xgb_class <- xgboost(data = as.matrix(X_train), label = as.numeric(y_class_train) - 1,
nrounds = 100, objective = "binary:logistic")
## [1] train-logloss:0.476237
## [2] train-logloss:0.357456
## [3] train-logloss:0.283952
## [4] train-logloss:0.235927
## [5] train-logloss:0.203585
## [6] train-logloss:0.181388
## [7] train-logloss:0.165909
## [8] train-logloss:0.155104
## [9] train-logloss:0.147334
## [10] train-logloss:0.141840
## [11] train-logloss:0.137837
## [12] train-logloss:0.134925
## [13] train-logloss:0.132849
## [14] train-logloss:0.131287
## [15] train-logloss:0.130153
## [16] train-logloss:0.129284
## [17] train-logloss:0.128450
## [18] train-logloss:0.127943
## [19] train-logloss:0.127495
## [20] train-logloss:0.127191
## [21] train-logloss:0.126905
## [22] train-logloss:0.126746
## [23] train-logloss:0.126338
## [24] train-logloss:0.126195
## [25] train-logloss:0.126068
## [26] train-logloss:0.126016
## [27] train-logloss:0.125904
## [28] train-logloss:0.125681
## [29] train-logloss:0.125593
## [30] train-logloss:0.125462
## [31] train-logloss:0.125412
## [32] train-logloss:0.125157
## [33] train-logloss:0.124921
## [34] train-logloss:0.124657
## [35] train-logloss:0.124619
## [36] train-logloss:0.124212
## [37] train-logloss:0.123980
## [38] train-logloss:0.123878
## [39] train-logloss:0.123853
## [40] train-logloss:0.123760
## [41] train-logloss:0.123483
## [42] train-logloss:0.123196
## [43] train-logloss:0.122979
## [44] train-logloss:0.122828
## [45] train-logloss:0.122693
## [46] train-logloss:0.122645
## [47] train-logloss:0.122521
## [48] train-logloss:0.122179
## [49] train-logloss:0.122152
## [50] train-logloss:0.122131
## [51] train-logloss:0.121743
## [52] train-logloss:0.121720
## [53] train-logloss:0.121664
## [54] train-logloss:0.121645
## [55] train-logloss:0.121415
## [56] train-logloss:0.120917
## [57] train-logloss:0.120760
## [58] train-logloss:0.120537
## [59] train-logloss:0.120248
## [60] train-logloss:0.120020
## [61] train-logloss:0.119783
## [62] train-logloss:0.119767
## [63] train-logloss:0.119607
## [64] train-logloss:0.119581
## [65] train-logloss:0.119125
## [66] train-logloss:0.118922
## [67] train-logloss:0.118691
## [68] train-logloss:0.118529
## [69] train-logloss:0.118364
## [70] train-logloss:0.118042
## [71] train-logloss:0.118013
## [72] train-logloss:0.118002
## [73] train-logloss:0.117697
## [74] train-logloss:0.117463
## [75] train-logloss:0.117399
## [76] train-logloss:0.117364
## [77] train-logloss:0.117203
## [78] train-logloss:0.117071
## [79] train-logloss:0.117001
## [80] train-logloss:0.116625
## [81] train-logloss:0.116615
## [82] train-logloss:0.116469
## [83] train-logloss:0.116354
## [84] train-logloss:0.116164
## [85] train-logloss:0.115985
## [86] train-logloss:0.115747
## [87] train-logloss:0.115560
## [88] train-logloss:0.115328
## [89] train-logloss:0.115192
## [90] train-logloss:0.115103
## [91] train-logloss:0.114983
## [92] train-logloss:0.114860
## [93] train-logloss:0.114844
## [94] train-logloss:0.114823
## [95] train-logloss:0.114734
## [96] train-logloss:0.114589
## [97] train-logloss:0.114426
## [98] train-logloss:0.114412
## [99] train-logloss:0.114372
## [100] train-logloss:0.114238
xgb_pred <- predict(xgb_class, as.matrix(X_test))
xgb_pred <- as.factor(ifelse(xgb_pred > 0.5, 1, 0))
xgb_f1 <- confusionMatrix(xgb_pred, y_class_test)$byClass["F1"]
The third classification model is K-Nearest Neighbours (KNN). The model is trained on the training data with 5-fold cross-validation, with features centered and scaled. After training, the KNN model predicts the classes of the test data and its F1 score is calculated for later comparison.
knn_class <- train(x = X_train, y = y_class_train, method = "knn",
trControl = trainControl(method = "cv", number = 5), # 5-fold cross-validation
preProcess = c("center", "scale")) # Center and scale features
## Warning: Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
knn_pred <- predict(knn_class, X_test)
knn_f1 <- confusionMatrix(knn_pred, y_class_test)$byClass["F1"]
The last classification model is logistic regression, fitted with the glm() function. Logistic regression is a type of generalised linear model; setting family = "binomial" tells glm() to fit a binary (logistic) model. After training, the model predicts the probability that each test observation belongs to the positive class; probabilities greater than 0.5 are classified as positive (1), otherwise negative (0). The F1 score of the logistic regression model is then calculated.
# 4. Logistic Regression
lr_class <- glm(y_class_train ~ ., data = X_train, family = "binomial")
lr_pred <- predict(lr_class, X_test, type = "response")
lr_pred <- as.factor(ifelse(lr_pred > 0.5, 1, 0))
lr_f1 <- confusionMatrix(lr_pred, y_class_test)$byClass["F1"]
To find the best classification model, the models are compared by their F1 scores. Vectors store the model names and their F1 scores, the model with the maximum F1 score is identified with which.max(), and rbind() combines the vectors row-wise to display every model alongside its F1 score.
# Print best classification model
class_models <- c("Random Forest", "XGBoost", "KNN", "Logistic Regression")
class_f1_scores <- c(rf_f1, xgb_f1, knn_f1, lr_f1)
best_class_model <- class_models[which.max(class_f1_scores)]
f1_scores_matrix <- rbind(class_models, class_f1_scores)
f1_scores_matrix
## F1 F1 F1
## class_models "Random Forest" "XGBoost" "KNN"
## class_f1_scores "0.975834726792168" "0.975514731744271" "0.970711523379398"
## F1
## class_models "Logistic Regression"
## class_f1_scores "0.970884809818605"
cat("\n")
cat("Best Classification Model:", best_class_model, "with F1 Score:", max(class_f1_scores), "\n")
## Best Classification Model: Random Forest with F1 Score: 0.9758347
Besides the F1 score, ROC-AUC is used to compare the classification models. The higher the AUC, the better the model separates the positive and negative classes across different classification thresholds. The ROC curve must be computed before the AUC can be calculated; the ROC curves are plotted below as a visual aid.
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(ggplot2)
library(yardstick)
##
## Attaching package: 'yardstick'
## The following objects are masked from 'package:caret':
##
## precision, recall, sensitivity, specificity
## The following object is masked from 'package:readr':
##
## spec
# Calculate ROC curves
rf_roc <- roc(y_class_test, predict(rf_class, X_test, type = "prob")[,2])
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
xgb_roc <- roc(y_class_test, predict(xgb_class, as.matrix(X_test), type = "response"))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
knn_roc <- roc(y_class_test, predict(knn_class, X_test, type = "prob")[,2], levels = c("0", "1"))
## Setting direction: controls < cases
lr_roc <- roc(y_class_test, predict(lr_class, X_test, type = "response"))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
#X_train <- X[train_index,]
#X_test <- X[-train_index,]
# Create a data frame for plotting ROC curves
roc_data <- bind_rows(
coords(rf_roc, "all") %>% mutate(Model = "Random Forest"),
coords(xgb_roc, "all") %>% mutate(Model = "XGBoost"),
coords(knn_roc, "all") %>% mutate(Model = "KNN"),
coords(lr_roc, "all") %>% mutate(Model = "Logistic Regression")
)
# Plot ROC curves
ggplot(roc_data, aes(x = 1 - specificity, y = sensitivity, color = Model)) +
geom_line(size = 1) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "gray") +
labs(title = "ROC Curves for Classification Models",
x = "False Positive Rate (1 - Specificity)",
y = "True Positive Rate (Sensitivity)") +
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Print AUC for each model
cat("\nClassification Model AUC Scores:\n")
##
## Classification Model AUC Scores:
cat("Random Forest AUC:", auc(rf_roc), "\n")
## Random Forest AUC: 0.9077738
cat("XGBoost AUC:", auc(xgb_roc), "\n")
## XGBoost AUC: 0.9449312
cat("KNN AUC:", auc(knn_roc), "\n")
## KNN AUC: 0.8860344
cat("Logistic Regression AUC:", auc(lr_roc), "\n")
## Logistic Regression AUC: 0.9305243
From the results above, the models have very similar F1 scores. The Random Forest classifier has the highest F1 score (97.58%), followed closely by XGBoost (97.55%), then Logistic Regression (97.09%), with KNN having the lowest F1 score (97.07%).
In terms of AUC, XGBoost is the highest at 94.5%, followed by Logistic Regression (93.1%), then Random Forest (90.8%), and lastly KNN (88.6%).
Since the F1 scores of the XGBoost and Random Forest models are very close and XGBoost has the highest AUC, XGBoost can be considered the best classification model for this dataset.
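For completeness, both metrics can be collected into a single table; this is a small sketch reusing the F1 scores and ROC objects computed above.
# Side-by-side comparison of F1 and AUC for the four classifiers
data.frame(
  Model = class_models,
  F1    = round(as.numeric(class_f1_scores), 4),
  AUC   = round(c(auc(rf_roc), auc(xgb_roc), auc(knn_roc), auc(lr_roc)), 4)
)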
Research Question:
Can we estimate a person’s blood glucose level using the same set of
health and demographic features?
For the regression task, blood_glucose_level served as
the continuous target variable. Predictors included all cleaned features
except diabetes, to avoid redundancy. The same train-test
split used in the classification task was reused for consistency.
Since all variables were numeric, the data was readily compatible with the regression models that were subsequently applied: Random Forest, XGBoost, Decision Tree, and Linear Regression.
First, we use random forest. Random forest works for both classification and regression because it produces numeric output: in classification this output is the probability of belonging to a class, while in regression it is the predicted value itself.
# 1. Random Forest
rf_reg <- randomForest(x = X_train, y = y_reg_train)
rf_reg_pred <- predict(rf_reg, X_test)
rf_rmse <- sqrt(mean((rf_reg_pred - y_reg_test)^2))
Second, we use XGBoost with 100 boosting rounds and squared error ("reg:squarederror") as the training objective. Since RMSE is our regression evaluation metric, the squared-error objective is a suitable choice.
# 2. XGBoost
# Ensure X_train and X_test are numeric matrices
xgb_reg <- xgboost(data = as.matrix(X_train), label = y_reg_train,
nrounds = 100, objective = "reg:squarederror")
## [1] train-rmse:75.184338
## [2] train-rmse:56.055078
## [3] train-rmse:43.702838
## [4] train-rmse:36.104619
## [5] train-rmse:31.703509
## [6] train-rmse:29.276379
## [7] train-rmse:27.995957
## [8] train-rmse:27.336362
## [9] train-rmse:26.954447
## [10] train-rmse:26.739591
## [11] train-rmse:26.623635
## [12] train-rmse:26.539067
## [13] train-rmse:26.472835
## [14] train-rmse:26.418347
## [15] train-rmse:26.372162
## [16] train-rmse:26.319027
## [17] train-rmse:26.290800
## [18] train-rmse:26.256199
## [19] train-rmse:26.243422
## [20] train-rmse:26.201184
## [21] train-rmse:26.165404
## [22] train-rmse:26.149298
## [23] train-rmse:26.135755
## [24] train-rmse:26.108132
## [25] train-rmse:26.098812
## [26] train-rmse:26.076205
## [27] train-rmse:26.061427
## [28] train-rmse:26.016878
## [29] train-rmse:25.996775
## [30] train-rmse:25.941656
## [31] train-rmse:25.910325
## [32] train-rmse:25.903199
## [33] train-rmse:25.897452
## [34] train-rmse:25.859564
## [35] train-rmse:25.847892
## [36] train-rmse:25.831328
## [37] train-rmse:25.808744
## [38] train-rmse:25.794547
## [39] train-rmse:25.789847
## [40] train-rmse:25.753101
## [41] train-rmse:25.710986
## [42] train-rmse:25.688768
## [43] train-rmse:25.628960
## [44] train-rmse:25.614830
## [45] train-rmse:25.593270
## [46] train-rmse:25.575391
## [47] train-rmse:25.546083
## [48] train-rmse:25.541652
## [49] train-rmse:25.539284
## [50] train-rmse:25.511068
## [51] train-rmse:25.470559
## [52] train-rmse:25.420285
## [53] train-rmse:25.388558
## [54] train-rmse:25.380959
## [55] train-rmse:25.352225
## [56] train-rmse:25.341326
## [57] train-rmse:25.318267
## [58] train-rmse:25.291493
## [59] train-rmse:25.273671
## [60] train-rmse:25.249120
## [61] train-rmse:25.191906
## [62] train-rmse:25.166969
## [63] train-rmse:25.136147
## [64] train-rmse:25.130765
## [65] train-rmse:25.114637
## [66] train-rmse:25.078193
## [67] train-rmse:25.045497
## [68] train-rmse:25.034148
## [69] train-rmse:25.024315
## [70] train-rmse:24.988586
## [71] train-rmse:24.948577
## [72] train-rmse:24.915647
## [73] train-rmse:24.889723
## [74] train-rmse:24.876717
## [75] train-rmse:24.867677
## [76] train-rmse:24.845676
## [77] train-rmse:24.840690
## [78] train-rmse:24.831774
## [79] train-rmse:24.804136
## [80] train-rmse:24.800562
## [81] train-rmse:24.775386
## [82] train-rmse:24.748655
## [83] train-rmse:24.746997
## [84] train-rmse:24.715903
## [85] train-rmse:24.687394
## [86] train-rmse:24.636124
## [87] train-rmse:24.607074
## [88] train-rmse:24.585421
## [89] train-rmse:24.580992
## [90] train-rmse:24.579400
## [91] train-rmse:24.575778
## [92] train-rmse:24.570698
## [93] train-rmse:24.561342
## [94] train-rmse:24.549254
## [95] train-rmse:24.532371
## [96] train-rmse:24.500695
## [97] train-rmse:24.470716
## [98] train-rmse:24.450032
## [99] train-rmse:24.443107
## [100] train-rmse:24.422137
xgb_reg_pred <- predict(xgb_reg, as.matrix(X_test))
xgb_rmse <- sqrt(mean((xgb_reg_pred - y_reg_test)^2))
Third, we use a decision tree fitted with rpart(), using its default method for a numeric response, "anova".
# 3. Decision Tree
dt_reg <- rpart(y_reg_train ~ ., data = X_train)
dt_reg_pred <- predict(dt_reg, X_test)
dt_rmse <- sqrt(mean((dt_reg_pred - y_reg_test)^2))
Lastly, we use linear regression, since the target is a continuous numeric value.
# 4. Linear Regression
lr_reg <- lm(y_reg_train ~ ., data = X_train)
lr_reg_pred <- predict(lr_reg, X_test)
lr_rmse <- sqrt(mean((lr_reg_pred - y_reg_test)^2))
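Ridge regression was loaded via glmnet but is not fitted in this report; as a hedged sketch, it could be added alongside the four models above (alpha = 0 selects the ridge penalty, and cv.glmnet chooses lambda by cross-validation; ridge_cv, ridge_pred, and ridge_rmse are new names introduced here).
# Optional: ridge regression via glmnet, with lambda selected by cross-validation
ridge_cv   <- cv.glmnet(as.matrix(X_train), y_reg_train, alpha = 0)
ridge_pred <- predict(ridge_cv, newx = as.matrix(X_test), s = "lambda.min")
ridge_rmse <- sqrt(mean((as.numeric(ridge_pred) - y_reg_test)^2))
ridge_rmse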
RMSE is used to select the best regression model; a lower RMSE means the predictions deviate less from the actual values.
# Print best regression model
reg_models <- c("Random Forest", "XGBoost", "Decision Tree", "Linear Regression")
reg_rmse_scores <- c(rf_rmse, xgb_rmse, dt_rmse, lr_rmse)
best_reg_model <- reg_models[which.min(reg_rmse_scores)]
rmse_scores_matrix <- rbind(reg_models, reg_rmse_scores)
rmse_scores_matrix
## [,1] [,2] [,3]
## reg_models "Random Forest" "XGBoost" "Decision Tree"
## reg_rmse_scores "27.5938024602372" "28.1465381391684" "27.9635129276575"
## [,4]
## reg_models "Linear Regression"
## reg_rmse_scores "31.4164296070712"
cat("Best Regression Model:", best_reg_model, "with RMSE:", min(reg_rmse_scores))
## Best Regression Model: Random Forest with RMSE: 27.5938
The best models are stored as RDS files so they can later be deployed in our Shiny app.
saveRDS(rf_class, "diabetes_model.rds")
saveRDS(rf_reg, "glucose_model.rds")
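As a minimal sketch of how the saved models might be used at deployment time, one held-out row is scored below; in the Shiny app the inputs would need the same column names and encoding as the training features (diabetes_model, glucose_model, and new_obs are new names introduced for this example).
# Load the saved models and score one row of the test set as a sanity check
diabetes_model <- readRDS("diabetes_model.rds")
glucose_model  <- readRDS("glucose_model.rds")

new_obs <- X_test[1, ]  # same feature columns and encoding as used in training

predict(diabetes_model, new_obs, type = "prob")  # class probabilities for diabetes
predict(glucose_model, new_obs)                  # predicted blood glucose level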
For the classification model (predicting diabetes), Random Forest performed best, with an F1 score of 0.9758 and an AUC of 0.9078, and it is the model saved for deployment.
For the regression model (predicting blood glucose level), Random Forest was again the best model, with the lowest RMSE of 27.59.