| Group Members |
|---|
| XU YIRAN 25088321 |
| WANG JIACUN 25075892 |
| SAM YEE THONG WAH 24209652 |
| LUO YUN 25088950 |
| HUANG MAO WEI 24231728 |
Diabetes is one of the most common chronic diseases worldwide and has become a major public health concern. Early identification of diabetes risk factors can help healthcare providers and individuals take preventive actions to reduce complications and improve health outcomes.
In this project, the Behavioral Risk Factor Surveillance System (BRFSS) 2015 dataset was selected for analysis. The dataset contains 253,680 observations and 22 health-related variables, including demographic characteristics, lifestyle habits, and medical conditions. Examples of these variables include age, body mass index (BMI), smoking status, physical activity, blood pressure, cholesterol level, education, and income.
Using this large-scale dataset, the project applies data science and machine learning techniques to explore relationships between health indicators and diabetes status. Both classification and regression approaches are used to generate predictive insights and support healthcare-related decision making.
The objectives of this project are:
The dataset used in this project is the Diabetes Health Indicators Dataset obtained from the Behavioral Risk Factor Surveillance System (BRFSS) 2015 conducted by the Centers for Disease Control and Prevention (CDC).
This dataset contains information related to individuals’ health conditions, lifestyle habits, and demographic characteristics. Some of the variables included in the dataset are BMI, smoking habits, physical activity, blood pressure, cholesterol level, age, and income.
Diabetes is one of the most common health problems worldwide and has become an important public health issue. By using machine learning techniques, this project aims to study the relationship between health indicators and diabetes risk.
This dataset was selected because it is suitable for both classification and regression analysis. It also contains a large number of observations, which makes it appropriate for machine learning modeling and prediction tasks.
In total, this specific dataset contains 253,680 observations and 22 variables. The large number of observations makes the dataset exceptionally suitable for training robust machine learning models and handling complex prediction tasks.
To check whether the dataset meets the project requirement, we calculated the dataset size. The dataset has 253,680 rows and 22 columns, giving a total of 5,580,960 data points. This is more than the required 100,000, so the dataset is suitable for this project.
# Load the required libraries
library(readr)
library(dplyr)
# Read the dataset (Make sure the file is in your working directory)
diabetes <- read_csv("diabetes_binary_health_indicators_BRFSS2015.csv")
# Verify the dimensions programmatically
dim(diabetes)
## [1] 253680 22
The variables in this dataset can be divided into three main groups, which are health condition variables, lifestyle habit variables, and demographic variables.
Examples of health condition variables include blood pressure and cholesterol level, while lifestyle variables include smoking habits, physical activity, and diet-related information. Demographic variables include age, education level, and income category.
Based on the objectives of this project, the variables are used for two different types of machine learning tasks:
Diabetes_binary
is used as the target variable to predict whether an individual has
diabetes.GenHlth (General
Health Score) is used as the numerical target variable in the regression
analysis.# Display dataset structure
str(diabetes)
## spc_tbl_ [253,680 × 22] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Diabetes_binary : num [1:253680] 0 0 0 0 0 0 0 0 1 0 ...
## $ HighBP : num [1:253680] 1 0 1 1 1 1 1 1 1 0 ...
## $ HighChol : num [1:253680] 1 0 1 0 1 1 0 1 1 0 ...
## $ CholCheck : num [1:253680] 1 0 1 1 1 1 1 1 1 1 ...
## $ BMI : num [1:253680] 40 25 28 27 24 25 30 25 30 24 ...
## $ Smoker : num [1:253680] 1 1 0 0 0 1 1 1 1 0 ...
## $ Stroke : num [1:253680] 0 0 0 0 0 0 0 0 0 0 ...
## $ HeartDiseaseorAttack: num [1:253680] 0 0 0 0 0 0 0 0 1 0 ...
## $ PhysActivity : num [1:253680] 0 1 0 1 1 1 0 1 0 0 ...
## $ Fruits : num [1:253680] 0 0 1 1 1 1 0 0 1 0 ...
## $ Veggies : num [1:253680] 1 0 0 1 1 1 0 1 1 1 ...
## $ HvyAlcoholConsump : num [1:253680] 0 0 0 0 0 0 0 0 0 0 ...
## $ AnyHealthcare : num [1:253680] 1 0 1 1 1 1 1 1 1 1 ...
## $ NoDocbcCost : num [1:253680] 0 1 1 0 0 0 0 0 0 0 ...
## $ GenHlth : num [1:253680] 5 3 5 2 2 2 3 3 5 2 ...
## $ MentHlth : num [1:253680] 18 0 30 0 3 0 0 0 30 0 ...
## $ PhysHlth : num [1:253680] 15 0 30 0 0 2 14 0 30 0 ...
## $ DiffWalk : num [1:253680] 1 0 1 0 0 0 0 1 1 0 ...
## $ Sex : num [1:253680] 0 0 0 0 0 1 0 0 0 1 ...
## $ Age : num [1:253680] 9 7 9 11 11 10 9 11 9 8 ...
## $ Education : num [1:253680] 4 6 4 3 5 6 6 4 5 4 ...
## $ Income : num [1:253680] 3 1 8 6 4 8 7 4 1 3 ...
## - attr(*, "spec")=
## .. cols(
## .. Diabetes_binary = col_double(),
## .. HighBP = col_double(),
## .. HighChol = col_double(),
## .. CholCheck = col_double(),
## .. BMI = col_double(),
## .. Smoker = col_double(),
## .. Stroke = col_double(),
## .. HeartDiseaseorAttack = col_double(),
## .. PhysActivity = col_double(),
## .. Fruits = col_double(),
## .. Veggies = col_double(),
## .. HvyAlcoholConsump = col_double(),
## .. AnyHealthcare = col_double(),
## .. NoDocbcCost = col_double(),
## .. GenHlth = col_double(),
## .. MentHlth = col_double(),
## .. PhysHlth = col_double(),
## .. DiffWalk = col_double(),
## .. Sex = col_double(),
## .. Age = col_double(),
## .. Education = col_double(),
## .. Income = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
To better understand the dataset structure, the data type of each variable was checked. This step helps ensure that the variables are in a suitable format before data cleaning and machine learning.
# Display the data type of each variable
sapply(diabetes, class)
## Diabetes_binary HighBP HighChol
## "numeric" "numeric" "numeric"
## CholCheck BMI Smoker
## "numeric" "numeric" "numeric"
## Stroke HeartDiseaseorAttack PhysActivity
## "numeric" "numeric" "numeric"
## Fruits Veggies HvyAlcoholConsump
## "numeric" "numeric" "numeric"
## AnyHealthcare NoDocbcCost GenHlth
## "numeric" "numeric" "numeric"
## MentHlth PhysHlth DiffWalk
## "numeric" "numeric" "numeric"
## Sex Age Education
## "numeric" "numeric" "numeric"
## Income
## "numeric"
The result shows that most variables are stored as numeric or integer values. However, some variables actually represent categories or levels, such as education level, income category, and general health score. Therefore, we need to understand the meaning of each variable before using them in the analysis.
The summary statistics below provide a general overview of the variables included in the dataset.
# Display summary statistics
summary(diabetes)
## Diabetes_binary HighBP HighChol CholCheck
## Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:1.0000
## Median :0.0000 Median :0.000 Median :0.0000 Median :1.0000
## Mean :0.1393 Mean :0.429 Mean :0.4241 Mean :0.9627
## 3rd Qu.:0.0000 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.000 Max. :1.0000 Max. :1.0000
## BMI Smoker Stroke HeartDiseaseorAttack
## Min. :12.00 Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:24.00 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :27.00 Median :0.0000 Median :0.00000 Median :0.00000
## Mean :28.38 Mean :0.4432 Mean :0.04057 Mean :0.09419
## 3rd Qu.:31.00 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :98.00 Max. :1.0000 Max. :1.00000 Max. :1.00000
## PhysActivity Fruits Veggies HvyAlcoholConsump
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000
## Median :1.0000 Median :1.0000 Median :1.0000 Median :0.0000
## Mean :0.7565 Mean :0.6343 Mean :0.8114 Mean :0.0562
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## AnyHealthcare NoDocbcCost GenHlth MentHlth
## Min. :0.0000 Min. :0.00000 Min. :1.000 Min. : 0.000
## 1st Qu.:1.0000 1st Qu.:0.00000 1st Qu.:2.000 1st Qu.: 0.000
## Median :1.0000 Median :0.00000 Median :2.000 Median : 0.000
## Mean :0.9511 Mean :0.08418 Mean :2.511 Mean : 3.185
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:3.000 3rd Qu.: 2.000
## Max. :1.0000 Max. :1.00000 Max. :5.000 Max. :30.000
## PhysHlth DiffWalk Sex Age
## Min. : 0.000 Min. :0.0000 Min. :0.0000 Min. : 1.000
## 1st Qu.: 0.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 6.000
## Median : 0.000 Median :0.0000 Median :0.0000 Median : 8.000
## Mean : 4.242 Mean :0.1682 Mean :0.4403 Mean : 8.032
## 3rd Qu.: 3.000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:10.000
## Max. :30.000 Max. :1.0000 Max. :1.0000 Max. :13.000
## Education Income
## Min. :1.00 Min. :1.000
## 1st Qu.:4.00 1st Qu.:5.000
## Median :5.00 Median :7.000
## Mean :5.05 Mean :6.054
## 3rd Qu.:6.00 3rd Qu.:8.000
## Max. :6.00 Max. :8.000
Before building machine learning models, the dataset is checked for missing values and duplicate records to ensure data quality.
# Check missing values
total_na <- sum(is.na(diabetes))
print(paste("Total Missing Values:", total_na))
## [1] "Total Missing Values: 0"
# Check duplicate rows
total_duplicates <- sum(duplicated(diabetes))
print(paste("Total Duplicate Rows:", total_duplicates))
## [1] "Total Duplicate Rows: 24206"
The dataset does not contain any missing values, so no imputation or replacement is required.
A total of 24,206 duplicate rows were identified in the dataset. Since this dataset comes from a large public health survey, it is possible for different individuals to share similar health profiles and responses. Therefore, the duplicate rows are retained for further analysis.
Several variables in the dataset are stored as numeric values such as 0 and 1, but they actually represent categorical health conditions or responses.
To improve readability and prepare the dataset for machine learning analysis, these variables are converted into factor variables with meaningful labels.
# Transform selected variables into factors
diabetes_cleaned <- diabetes %>%
mutate(
# Diabetes status
Diabetes_binary = factor(
Diabetes_binary,
levels = c(0, 1),
labels = c("No_Diabetes", "Diabetes")
),
# Health condition indicators
HighBP = factor(
HighBP,
levels = c(0, 1),
labels = c("No_HighBP", "HighBP")
),
HighChol = factor(
HighChol,
levels = c(0, 1),
labels = c("No_HighChol", "HighChol")
),
CholCheck = factor(
CholCheck,
levels = c(0, 1),
labels = c("No_Check", "Checked")
),
Stroke = factor(
Stroke,
levels = c(0, 1),
labels = c("No_Stroke", "Stroke")
),
HeartDiseaseorAttack = factor(
HeartDiseaseorAttack,
levels = c(0, 1),
labels = c("No_HD", "HeartDisease")
)
)
# Verify transformed variables
str(diabetes_cleaned[, c(
"Diabetes_binary",
"HighBP",
"HighChol",
"CholCheck",
"Stroke",
"HeartDiseaseorAttack"
)])
## tibble [253,680 × 6] (S3: tbl_df/tbl/data.frame)
## $ Diabetes_binary : Factor w/ 2 levels "No_Diabetes",..: 1 1 1 1 1 1 1 1 2 1 ...
## $ HighBP : Factor w/ 2 levels "No_HighBP","HighBP": 2 1 2 2 2 2 2 2 2 1 ...
## $ HighChol : Factor w/ 2 levels "No_HighChol",..: 2 1 2 1 2 2 1 2 2 1 ...
## $ CholCheck : Factor w/ 2 levels "No_Check","Checked": 2 1 2 2 2 2 2 2 2 2 ...
## $ Stroke : Factor w/ 2 levels "No_Stroke","Stroke": 1 1 1 1 1 1 1 1 1 1 ...
## $ HeartDiseaseorAttack: Factor w/ 2 levels "No_HD","HeartDisease": 1 1 1 1 1 1 1 1 2 1 ...
The remaining lifestyle and demographic variables are also converted into factor variables to improve readability and support further analysis.
# Transform lifestyle and demographic variables into factors
diabetes_cleaned <- diabetes_cleaned %>%
mutate(
# Lifestyle variables
Smoker = factor(
Smoker,
levels = c(0, 1),
labels = c("Non_Smoker", "Smoker")
),
PhysActivity = factor(
PhysActivity,
levels = c(0, 1),
labels = c("No_Activity", "Active")
),
Fruits = factor(
Fruits,
levels = c(0, 1),
labels = c("No_Fruits", "Fruits")
),
Veggies = factor(
Veggies,
levels = c(0, 1),
labels = c("No_Veggies", "Veggies")
),
HvyAlcoholConsump = factor(
HvyAlcoholConsump,
levels = c(0, 1),
labels = c("No_HeavyDrink", "HeavyDrink")
),
# Demographic variables
Sex = factor(
Sex,
levels = c(0, 1),
labels = c("Female", "Male")
),
AnyHealthcare = factor(
AnyHealthcare,
levels = c(0, 1),
labels = c("No_Coverage", "Has_Coverage")
),
NoDocbcCost = factor(
NoDocbcCost,
levels = c(0, 1),
labels = c("No", "Yes_CostBarrier")
),
DiffWalk = factor(
DiffWalk,
levels = c(0, 1),
labels = c("No_Difficulty", "Difficulty_Walking")
)
)
# Verify transformed variables
str(diabetes_cleaned[, c(
"Smoker",
"PhysActivity",
"Fruits",
"Veggies",
"HvyAlcoholConsump",
"Sex",
"AnyHealthcare",
"NoDocbcCost",
"DiffWalk"
)])
## tibble [253,680 × 9] (S3: tbl_df/tbl/data.frame)
## $ Smoker : Factor w/ 2 levels "Non_Smoker","Smoker": 2 2 1 1 1 2 2 2 2 1 ...
## $ PhysActivity : Factor w/ 2 levels "No_Activity",..: 1 2 1 2 2 2 1 2 1 1 ...
## $ Fruits : Factor w/ 2 levels "No_Fruits","Fruits": 1 1 2 2 2 2 1 1 2 1 ...
## $ Veggies : Factor w/ 2 levels "No_Veggies","Veggies": 2 1 1 2 2 2 1 2 2 2 ...
## $ HvyAlcoholConsump: Factor w/ 2 levels "No_HeavyDrink",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Sex : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 2 1 1 1 2 ...
## $ AnyHealthcare : Factor w/ 2 levels "No_Coverage",..: 2 1 2 2 2 2 2 2 2 2 ...
## $ NoDocbcCost : Factor w/ 2 levels "No","Yes_CostBarrier": 1 2 2 1 1 1 1 1 1 1 ...
## $ DiffWalk : Factor w/ 2 levels "No_Difficulty",..: 2 1 2 1 1 1 1 2 2 1 ...
For EDA, some ordinal variables are converted into ordered factors to
make the categories easier to read in tables and charts.
GenHlth is also converted into an ordered factor in this
section because its values represent ordered health levels.
However, GenHlth is still used as the target variable
for the regression task. In the later modeling section, it will be
converted back into a numerical score for regression analysis.
Other variables such as Age, Education, and
Income are also ordinal variables because their categories
follow a meaningful order.
To prepare the dataset for further analysis, these variables are converted into ordered factors with labels based on the BRFSS 2015 documentation.
# Transform ordinal variables into ordered factors
diabetes_cleaned <- diabetes_cleaned %>%
mutate(
# General health condition
GenHlth = factor(
GenHlth,
levels = 1:5,
labels = c("Excellent", "Very_Good", "Good", "Fair", "Poor"),
ordered = TRUE
),
# Age groups
Age = factor(
Age,
levels = 1:13,
labels = c(
"18-24", "25-29", "30-34", "35-39", "40-44", "45-49",
"50-54", "55-59", "60-64", "65-69", "70-74", "75-79", "80+"
),
ordered = TRUE
),
# Education level
Education = factor(
Education,
levels = 1:6,
labels = c(
"Never_Attended", "Elementary", "Some_HighSchool",
"HighSchool_Graduate", "Some_College", "College_Graduate"
),
ordered = TRUE
),
# Income groups
Income = factor(
Income,
levels = 1:8,
labels = c(
"<10k", "10k-15k", "15k-20k", "20k-25k",
"25k-35k", "35k-50k", "50k-75k", "75k+"
),
ordered = TRUE
)
)
# Verify transformed variables
str(diabetes_cleaned[, c("GenHlth", "Age", "Education", "Income")])
## tibble [253,680 × 4] (S3: tbl_df/tbl/data.frame)
## $ GenHlth : Ord.factor w/ 5 levels "Excellent"<"Very_Good"<..: 5 3 5 2 2 2 3 3 5 2 ...
## $ Age : Ord.factor w/ 13 levels "18-24"<"25-29"<..: 9 7 9 11 11 10 9 11 9 8 ...
## $ Education: Ord.factor w/ 6 levels "Never_Attended"<..: 4 6 4 3 5 6 6 4 5 4 ...
## $ Income : Ord.factor w/ 8 levels "<10k"<"10k-15k"<..: 3 1 8 6 4 8 7 4 1 3 ...
After cleaning and transforming the variables, a final check is performed to confirm the structure and quality of the cleaned dataset.
# Check dimension of cleaned dataset
dim(diabetes_cleaned)
## [1] 253680 22
# Check missing values after cleaning
sum(is.na(diabetes_cleaned))
## [1] 0
# Check target variable distribution
summary(diabetes_cleaned$Diabetes_binary)
## No_Diabetes Diabetes
## 218334 35346
The cleaned dataset contains 253,680 observations and 22 variables. No missing values are found after the cleaning process.
The dataset is now ready for exploratory data analysis, classification modeling, and regression analysis.
After completing the data cleaning process, exploratory data analysis (EDA) is performed to better understand the dataset and the distribution of important variables.
The following graphs show the distribution of diabetes cases and BMI values in the dataset.
library(ggplot2)
library(gridExtra)
plot_target_class <- ggplot(diabetes_cleaned, aes(x = Diabetes_binary, fill = Diabetes_binary)) +
geom_bar(color = "black", alpha = 0.8, width = 0.5) +
scale_fill_manual(values = c("#2a9d8f", "#e76f51")) +
labs(
title = "Distribution of Diabetes Cases",
x = "Diabetes Status",
y = "Count"
) +
theme_minimal() +
theme(legend.position = "none")
plot_target_reg <- ggplot(diabetes_cleaned, aes(x = BMI)) +
geom_histogram(
binwidth = 1,
fill = "#457b9d",
color = "black",
alpha = 0.7
) +
labs(
title = "Distribution of BMI",
x = "BMI",
y = "Frequency"
) +
theme_minimal()
grid.arrange(plot_target_class, plot_target_reg, ncol = 2)
The bar chart shows that the number of individuals without diabetes is higher than the number of individuals with diabetes.
The BMI histogram shows that most BMI values are concentrated between 20 and 35, with some higher BMI values appearing in the dataset.
In this section, several important health and lifestyle variables are compared with diabetes status to explore possible relationships in the dataset.
# Diabetes vs High Blood Pressure
p1 <- ggplot(diabetes_cleaned, aes(x = HighBP, fill = Diabetes_binary)) +
geom_bar(position = "dodge") +
labs(
title = "Diabetes vs High Blood Pressure",
x = "High Blood Pressure",
y = "Count"
) +
theme_minimal()
# Diabetes vs Smoking
p2 <- ggplot(diabetes_cleaned, aes(x = Smoker, fill = Diabetes_binary)) +
geom_bar(position = "dodge") +
labs(
title = "Diabetes vs Smoking",
x = "Smoking Status",
y = "Count"
) +
theme_minimal()
# Diabetes vs Physical Activity
p3 <- ggplot(diabetes_cleaned, aes(x = PhysActivity, fill = Diabetes_binary)) +
geom_bar(position = "dodge") +
labs(
title = "Diabetes vs Physical Activity",
x = "Physical Activity",
y = "Count"
) +
theme_minimal()
grid.arrange(p1, p2, p3, ncol = 1)
The graphs show that individuals with high blood pressure appear more likely to have diabetes compared to those without high blood pressure.
The smoking graph suggests that smokers have a slightly higher number of diabetes cases compared to non-smokers.
For physical activity, people who are physically active generally show fewer diabetes cases than those with no physical activity.
To better understand the relationships between numerical variables, a correlation matrix is created using selected health-related variables.
library(corrplot)
# Select numeric variables
numeric_data <- diabetes %>%
select(
Diabetes_binary,
BMI,
HighBP,
HighChol,
PhysActivity,
MentHlth,
PhysHlth
)
# Create correlation matrix
cor_matrix <- cor(numeric_data)
# Plot correlation matrix
corrplot(
cor_matrix,
method = "color",
type = "upper",
tl.col = "black",
tl.srt = 45
)
The correlation matrix shows that diabetes has a positive relationship with variables such as BMI, high blood pressure, and high cholesterol.
BMI also shows moderate relationships with physical health and diabetes status.
Overall, the relationships are not extremely strong, which suggests that multiple variables together may be needed for effective prediction models.
This part aims to build a binary classification model using the cleaned diabetes health indicators dataset. The target variable, Diabetes_binary, indicates whether an individual has diabetes or not. The dataset includes several health, lifestyle, and demographic factors.
The analysis is conducted through a structured machine learning workflow. First, the dataset is explored to understand its structure, feature characteristics, and target class distribution. Next, data cleaning and transformation are carried out to check data quality and transform format. Feature encoding is then applied to prepare the data for model training. The dataset is split into training and testing sets using stratified sampling, and class imbalance is handled on the training set to reduce bias toward the majority class. After that, several classification models are trained and evaluated using multiple performance metrics, including accuracy, precision, recall, F1-score, ROC-AUC. Finally, SHAP analysis is used to explain the best-performing model and identify the key predictors of diabetes risk.
The first step of this study is to understand the structure and characteristics of the diabetes dataset before conducting model development.
# Load required library
library(readr)
library(dplyr)
# Load dataset
diabetes_data <- read_csv("diabetes_cleaned.csv")
# View the first few rows
head(diabetes_data)
## # A tibble: 6 × 22
## Diabetes_binary HighBP HighChol CholCheck BMI Smoker Stroke
## <chr> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 No_Diabetes HighBP HighChol Checked 40 Smoker No_Stroke
## 2 No_Diabetes No_HighBP No_HighChol No_Check 25 Smoker No_Stroke
## 3 No_Diabetes HighBP HighChol Checked 28 Non_Smoker No_Stroke
## 4 No_Diabetes HighBP No_HighChol Checked 27 Non_Smoker No_Stroke
## 5 No_Diabetes HighBP HighChol Checked 24 Non_Smoker No_Stroke
## 6 No_Diabetes HighBP HighChol Checked 25 Smoker No_Stroke
## # ℹ 15 more variables: HeartDiseaseorAttack <chr>, PhysActivity <chr>,
## # Fruits <chr>, Veggies <chr>, HvyAlcoholConsump <chr>, AnyHealthcare <chr>,
## # NoDocbcCost <chr>, GenHlth <chr>, MentHlth <dbl>, PhysHlth <dbl>,
## # DiffWalk <chr>, Sex <chr>, Age <chr>, Education <chr>, Income <chr>
# Check dataset dimension
dim(diabetes_data)
## [1] 253680 22
# Summary statistics
summary(diabetes_data)
## Diabetes_binary HighBP HighChol CholCheck
## Length:253680 Length:253680 Length:253680 Length:253680
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## BMI Smoker Stroke HeartDiseaseorAttack
## Min. :12.00 Length:253680 Length:253680 Length:253680
## 1st Qu.:24.00 Class :character Class :character Class :character
## Median :27.00 Mode :character Mode :character Mode :character
## Mean :28.38
## 3rd Qu.:31.00
## Max. :98.00
## PhysActivity Fruits Veggies HvyAlcoholConsump
## Length:253680 Length:253680 Length:253680 Length:253680
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## AnyHealthcare NoDocbcCost GenHlth MentHlth
## Length:253680 Length:253680 Length:253680 Min. : 0.000
## Class :character Class :character Class :character 1st Qu.: 0.000
## Mode :character Mode :character Mode :character Median : 0.000
## Mean : 3.185
## 3rd Qu.: 2.000
## Max. :30.000
## PhysHlth DiffWalk Sex Age
## Min. : 0.000 Length:253680 Length:253680 Length:253680
## 1st Qu.: 0.000 Class :character Class :character Class :character
## Median : 0.000 Mode :character Mode :character Mode :character
## Mean : 4.242
## 3rd Qu.: 3.000
## Max. :30.000
## Education Income
## Length:253680 Length:253680
## Class :character Class :character
## Mode :character Mode :character
##
##
##
The dataset contains 253,680 records and 22 variables, including one target variable and 21 predictor variables. The target variable is Diabetes_binary, which indicates whether an individual has diabetes or not. Therefore, this study is formulated as a binary classification task.
# Check target variable distribution
table(diabetes_data$Diabetes_binary)
##
## Diabetes No_Diabetes
## 35346 218334
# Check target variable percentage
prop.table(table(diabetes_data$Diabetes_binary)) * 100
##
## Diabetes No_Diabetes
## 13.9333 86.0667
The target variable distribution shows that the dataset is imbalanced, with 86.07% of records belonging to the no-diabetes class and 13.93% belonging to the diabetes class.
# Check data types of each variable
sapply(diabetes_data, class)
## Diabetes_binary HighBP HighChol
## "character" "character" "character"
## CholCheck BMI Smoker
## "character" "numeric" "character"
## Stroke HeartDiseaseorAttack PhysActivity
## "character" "character" "character"
## Fruits Veggies HvyAlcoholConsump
## "character" "character" "character"
## AnyHealthcare NoDocbcCost GenHlth
## "character" "character" "character"
## MentHlth PhysHlth DiffWalk
## "numeric" "numeric" "character"
## Sex Age Education
## "character" "character" "character"
## Income
## "character"
| Type | Fields |
|---|---|
| Target variable | Diabetes_binary |
| Numerical variables | BMI, MentHlth, PhysHlth |
| Binary categorical variables | HighBP, HighChol, Smoker,
Stroke, Sex, etc. |
| Ordinal categorical variables | GenHlth, Age, Education,
Income |
Based on the initial data understanding, the dataset is appropriate for binary classification modelling. However, the imbalance in the target variable indicates that model evaluation should not rely only on accuracy. Metrics such as recall, F1-score and ROC-AUC should be used in later stages.
# Transform target variable
diabetes_data$Diabetes_binary <- ifelse( diabetes_data$Diabetes_binary == "Diabetes", 1, 0 )
# Check the results
table(diabetes_data$Diabetes_binary)
##
## 0 1
## 218334 35346
The target variable Diabetes_binary was transformed into binary numerical format, where No_Diabetes was encoded as 0 and Diabetes was encoded as 1.
diabetes_data$HighBP <- ifelse(diabetes_data$HighBP == "HighBP", 1, 0)
diabetes_data$HighChol <- ifelse(diabetes_data$HighChol == "HighChol", 1, 0)
diabetes_data$CholCheck <- ifelse(diabetes_data$CholCheck == "Checked", 1, 0)
diabetes_data$Smoker <- ifelse(diabetes_data$Smoker == "Smoker", 1, 0)
diabetes_data$Stroke <- ifelse(diabetes_data$Stroke == "Stroke", 1, 0)
diabetes_data$HeartDiseaseorAttack <- ifelse(
diabetes_data$HeartDiseaseorAttack == "HeartDisease", 1, 0 )
diabetes_data$PhysActivity <- ifelse(diabetes_data$PhysActivity == "Active", 1, 0)
diabetes_data$Fruits <- ifelse(diabetes_data$Fruits == "Fruits", 1, 0)
diabetes_data$Veggies <- ifelse(diabetes_data$Veggies == "Veggies", 1, 0)
diabetes_data$HvyAlcoholConsump <- ifelse( diabetes_data$HvyAlcoholConsump == "HeavyDrink", 1, 0 )
diabetes_data$AnyHealthcare <- ifelse( diabetes_data$AnyHealthcare == "Has_Coverage", 1, 0 )
diabetes_data$NoDocbcCost <- ifelse( diabetes_data$NoDocbcCost == "Yes_CostBarrier", 1, 0 )
diabetes_data$DiffWalk <- ifelse( diabetes_data$DiffWalk == "Difficulty_Walking", 1, 0 )
diabetes_data$Sex <- ifelse( diabetes_data$Sex == "Male", 1, 0 )
Binary variables such as HighBP, HighChol, Smoker, Stroke, PhysActivity, DiffWalk, and Sex were prepared for conversion into 0/1 numerical format.
# GenHlth
diabetes_data$GenHlth <- factor(
diabetes_data$GenHlth,
levels = c("Excellent", "Very_Good", "Good", "Fair", "Poor"),
ordered = TRUE
)
diabetes_data$GenHlth <- as.numeric(diabetes_data$GenHlth)
# Age
diabetes_data$Age <- factor(
diabetes_data$Age,
levels = c("18-24", "25-29", "30-34", "35-39", "40-44",
"45-49", "50-54", "55-59", "60-64", "65-69",
"70-74", "75-79", "80+"),
ordered = TRUE
)
diabetes_data$Age <- as.numeric(diabetes_data$Age)
# Education
diabetes_data$Education <- factor(
diabetes_data$Education,
levels = c("Never_Attended", "Elementary", "Some_HighSchool",
"HighSchool_Graduate", "Some_College", "College_Graduate"),
ordered = TRUE
)
diabetes_data$Education <- as.numeric(diabetes_data$Education)
# Income
diabetes_data$Income <- factor(
diabetes_data$Income,
levels = c("<10k", "10k-15k", "15k-20k", "20k-25k",
"25k-35k", "35k-50k", "50k-75k", "75k+"),
ordered = TRUE
)
diabetes_data$Income <- as.numeric(diabetes_data$Income)
Ordinal variables such as GenHlth, Age, Education, and Income were transformed according to their natural order.
# Check whether all variables are numeric
sapply(diabetes_data, class)
## Diabetes_binary HighBP HighChol
## "numeric" "numeric" "numeric"
## CholCheck BMI Smoker
## "numeric" "numeric" "numeric"
## Stroke HeartDiseaseorAttack PhysActivity
## "numeric" "numeric" "numeric"
## Fruits Veggies HvyAlcoholConsump
## "numeric" "numeric" "numeric"
## AnyHealthcare NoDocbcCost GenHlth
## "numeric" "numeric" "numeric"
## MentHlth PhysHlth DiffWalk
## "numeric" "numeric" "numeric"
## Sex Age Education
## "numeric" "numeric" "numeric"
## Income
## "numeric"
# View first few rows
head(diabetes_data)
## # A tibble: 6 × 22
## Diabetes_binary HighBP HighChol CholCheck BMI Smoker Stroke
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 1 1 1 40 1 0
## 2 0 0 0 0 25 1 0
## 3 0 1 1 1 28 0 0
## 4 0 1 0 1 27 0 0
## 5 0 1 1 1 24 0 0
## 6 0 1 1 1 25 1 0
## # ℹ 15 more variables: HeartDiseaseorAttack <dbl>, PhysActivity <dbl>,
## # Fruits <dbl>, Veggies <dbl>, HvyAlcoholConsump <dbl>, AnyHealthcare <dbl>,
## # NoDocbcCost <dbl>, GenHlth <dbl>, MentHlth <dbl>, PhysHlth <dbl>,
## # DiffWalk <dbl>, Sex <dbl>, Age <dbl>, Education <dbl>, Income <dbl>
# Save the dataset
write.csv(diabetes_data, "diabetes_transformed.csv", row.names = FALSE)
After transformation, all variables were converted into numerical format and became suitable for machine learning algorithms. Binary categorical variables were encoded as 0 and 1, while ordinal variables were encoded according to their natural order. Numerical variables such as BMI, MentHlth, and PhysHlth were kept as numerical features. The transformed dataset was then used for dataset splitting and subsequent model development.
The transformed dataset was split into three subsets: training set, validation set, and testing set. The training set was used to train the classification models, the validation set was used for model tuning and model selection, and the testing set was used for final model evaluation. In this study, 60% of the data was assigned to the training set, 20% to the validation set, and 20% to the testing set. Stratified sampling was applied during the splitting process to preserve the original distribution of the target variable in each subset.
library(caret)
library(readr)
# Load transformed dataset
diabetes_transformed <- read_csv("diabetes_transformed.csv")
set.seed(42)
# First split: 60% training, 40% temporary data
train_index <- createDataPartition(
diabetes_transformed$Diabetes_binary,
p = 0.6,
list = FALSE
)
train_data <- diabetes_transformed[train_index, ]
temp_data <- diabetes_transformed[-train_index, ]
# Second split: split temporary data into 20% validation and 20% testing
valid_index <- createDataPartition(
temp_data$Diabetes_binary,
p = 0.5,
list = FALSE
)
valid_data <- temp_data[valid_index, ]
test_data <- temp_data[-valid_index, ]
# Summary of dataset dimensions and class distribution
summary_table <- data.frame(
Dataset = c("Training Set", "Validation Set", "Test Set"),
Rows = c(nrow(train_data), nrow(valid_data), nrow(test_data)),
Columns = c(ncol(train_data), ncol(valid_data), ncol(test_data)),
Class_0_Percentage = c(
prop.table(table(train_data$Diabetes_binary))[1] * 100,
prop.table(table(valid_data$Diabetes_binary))[1] * 100,
prop.table(table(test_data$Diabetes_binary))[1] * 100
),
Class_1_Percentage = c(
prop.table(table(train_data$Diabetes_binary))[2] * 100,
prop.table(table(valid_data$Diabetes_binary))[2] * 100,
prop.table(table(test_data$Diabetes_binary))[2] * 100
)
)
summary_table
## Dataset Rows Columns Class_0_Percentage Class_1_Percentage
## 1 Training Set 152208 22 86.04147 13.95853
## 2 Validation Set 50736 22 86.17944 13.82056
## 3 Test Set 50736 22 86.02964 13.97036
After splitting, the training, validation, and testing sets maintained a similar target class distribution to the original dataset. This ensured that model training, model selection, and final evaluation were performed on representative subsets of the data.
The target variable in this dataset is imbalanced, with the majority of records belonging to the No_Diabetes class and a much smaller proportion belonging to the Diabetes class. To address this issue, class imbalance handling was applied only to the training set. The validation set and testing set were kept unchanged to preserve the original real-world class distribution. This prevents data leakage and ensures that model evaluation remains reliable and realistic.
# Check class distribution in training set
table(train_data$Diabetes_binary)
##
## 0 1
## 130962 21246
The current training set maintains the imbalanced structure of the original data.
library(dplyr)
set.seed(42)
# Separate majority and minority classes
majority_class <- train_data %>% filter(Diabetes_binary == 0)
minority_class <- train_data %>% filter(Diabetes_binary == 1)
ratio <- 2
# Randomly undersample majority class to 2 times the minority class
majority_downsampled <- majority_class %>% sample_n(ratio * nrow(minority_class))
train_balanced <- bind_rows(minority_class, majority_downsampled)
train_balanced <- train_balanced %>% sample_frac(1)
# Check class distribution after balancing
table(train_balanced$Diabetes_binary)
##
## 0 1
## 42492 21246
prop.table(table(train_balanced$Diabetes_binary)) * 100
##
## 0 1
## 66.66667 33.33333
Only the training set was balanced. Instead of creating a fully balanced 1:1 class ratio, the majority class was downsampled to approximately twice the size of the minority class. This produced a 2:1 ratio between non-diabetic and diabetic cases, reducing class imbalance while avoiding excessive information loss from the majority class.
Three classification models were trained and compared in this study: Decision Tree, Random Forest, and XGBoost. Decision Tree was used as the baseline model because it is simple, interpretable, and does not require feature scaling. Random Forest and XGBoost were selected as advanced ensemble models because they are suitable for structured tabular data and can handle nonlinear relationships between predictors and the target variable.
The models were first trained using the balanced training set. The validation set was then used for hyperparameter tuning and model selection. The testing set was not used during training or tuning, and was only used for final model evaluation. This design helps prevent data leakage and provides a more reliable estimate of model performance on unseen data.
After comparing the models on the validation set, the best-performing model was selected and evaluated on the testing set. Finally, feature importance analysis was conducted on the best model to identify the most influential predictors of diabetes classification.
library(rpart)
library(rpart.plot)
library(ranger)
library(xgboost)
library(caret)
library(pROC)
library(dplyr)
Decision Tree and Random Forest require a categorical target variable, which is then converted into a factor. XGBoost, however, requires numeric 0/1 labels and does not perform this process.
# For Decision Tree and Random Forest
train_balanced$Diabetes_factor <- as.factor(train_balanced$Diabetes_binary)
valid_data$Diabetes_factor <- as.factor(valid_data$Diabetes_binary)
test_data$Diabetes_factor <- as.factor(test_data$Diabetes_binary)
# For XGBoost
train_label <- train_balanced$Diabetes_binary
valid_label <- valid_data$Diabetes_binary
test_label <- test_data$Diabetes_binary
Define feature columns
# Remove target-related columns
feature_cols <- setdiff(names(train_balanced), c("Diabetes_binary", "Diabetes_factor"))
x_train <- train_balanced[, feature_cols]
x_valid <- valid_data[, feature_cols]
x_test <- test_data[, feature_cols]
Decision Tree was selected as the baseline model because it is simple, interpretable, and does not require feature scaling.
First, use a simple parameter tuning version to compare different cp values.
dt_grid <- expand.grid(
cp = c(0.001, 0.005, 0.01, 0.02)
)
dt_results <- data.frame()
for (i in 1:nrow(dt_grid)) {
cp_value <- dt_grid$cp[i]
dt_model <- rpart(
Diabetes_factor ~ .,
data = train_balanced[, c(feature_cols, "Diabetes_factor")],
method = "class",
control = rpart.control(cp = cp_value)
)
# Predict probability on validation set
dt_prob <- predict(dt_model, valid_data[, feature_cols], type = "prob")[, "1"]
# Convert probability to class
dt_pred <- ifelse(dt_prob >= 0.5, 1, 0)
# Calculate validation AUC
dt_auc <- auc(valid_data$Diabetes_binary, dt_prob)
# Store result
dt_results <- rbind(
dt_results,
data.frame(
cp = cp_value,
Validation_AUC = as.numeric(dt_auc)))
}
dt_results
## cp Validation_AUC
## 1 0.001 0.7971714
## 2 0.005 0.7403289
## 3 0.010 0.7385117
## 4 0.020 0.7235862
Choose the validation set with the best AUC
best_dt_cp <- dt_results$cp[which.max(dt_results$Validation_AUC)]
best_dt_cp
## [1] 0.001
Retrain the optimal Decision Tree
best_dt_model <- rpart(
Diabetes_factor ~ .,
data = train_balanced[, c(feature_cols, "Diabetes_factor")],
method = "class",
control = rpart.control(cp = best_dt_cp)
)
The final Decision Tree model was trained using cp = 0.001, which achieved the best validation performance during hyperparameter tuning. This setting allowed the tree to capture more detailed decision patterns while still applying pruning to reduce overfitting.
Random Forest was implemented using the ranger package in R to improve training speed and computational efficiency. As an ensemble learning method, Random Forest builds multiple decision trees from bootstrap samples and combines their predictions through majority voting.
A simple parameter tuning process was applied to compare different
mtry values. The mtry parameter controls the
number of variables randomly selected at each split in the Random Forest
model.
rf_grid <- expand.grid(
mtry = c(3, 5, 7, 9)
)
rf_results <- vector("list", nrow(rf_grid))
train_rf <- train_balanced[, c(feature_cols, "Diabetes_factor")]
valid_x <- valid_data[, feature_cols]
for (i in seq_len(nrow(rf_grid))) {
mtry_value <- rf_grid$mtry[i]
set.seed(42)
rf_model <- ranger(
Diabetes_factor ~ .,
data = train_rf,
num.trees = 300,
mtry = mtry_value,
probability = TRUE,
importance = "impurity",
num.threads = parallel::detectCores() - 1
)
# Predict probability on validation set
rf_prob <- predict(rf_model, data = valid_x)$predictions[, "1"]
# Calculate validation AUC
rf_auc <- auc(valid_data$Diabetes_binary, rf_prob)
rf_results[[i]] <- data.frame( mtry = mtry_value, Validation_AUC = as.numeric(rf_auc))
}
rf_results <- bind_rows(rf_results)
rf_results
## mtry Validation_AUC
## 1 3 0.8300043
## 2 5 0.8253878
## 3 7 0.8222375
## 4 9 0.8192751
Choose the best mtry
best_rf_mtry <- rf_results$mtry[which.max(rf_results$Validation_AUC)]
best_rf_mtry
## [1] 3
Retrain the optimal Random Forest
set.seed(42)
best_rf_model <- ranger(
Diabetes_factor ~ .,
data = train_rf,
num.trees = 300,
mtry = best_rf_mtry,
probability = TRUE,
importance = "impurity",
num.threads = parallel::detectCores() - 1
)
The final Random Forest model was trained using mtry = 3, which produced the best validation performance during tuning. This means that three randomly selected features were considered at each split in the trees.
XGBoost was implemented using the xgboost package in R as an advanced boosting-based classification model. Unlike Random Forest, which builds trees independently, XGBoost builds trees sequentially, with each tree correcting errors from previous trees. In this study, XGBoost was trained on the balanced training set using the binary:logistic objective for diabetes binary classification. Key hyperparameters, including max_depth, eta, subsample, and colsample_bytree, were tuned using the validation set before comparing its performance with Decision Tree and Random Forest.
Prepare XGBoost matrix
dtrain <- xgb.DMatrix( data = as.matrix(x_train), label = train_label )
dvalid <- xgb.DMatrix( data = as.matrix(x_valid), label = valid_label )
dtest <- xgb.DMatrix( data = as.matrix(x_test), label = test_label )
Train XGBoost with validation tuning
xgb_grid <- expand.grid(
max_depth = c(3, 5, 7),
eta = c(0.05, 0.1),
subsample = c(0.8, 1.0),
colsample_bytree = c(0.8, 1.0)
)
xgb_results <- data.frame()
for (i in 1:nrow(xgb_grid)) {
params <- list(
objective = "binary:logistic",
eval_metric = "auc",
max_depth = xgb_grid$max_depth[i],
eta = xgb_grid$eta[i],
subsample = xgb_grid$subsample[i],
colsample_bytree = xgb_grid$colsample_bytree[i]
)
set.seed(42)
xgb_model <- xgb.train(
params = params,
data = dtrain,
nrounds = 300,
watchlist = list(valid = dvalid),
early_stopping_rounds = 20,
verbose = 0
)
xgb_prob <- predict(xgb_model, dvalid)
xgb_auc <- auc(valid_label, xgb_prob)
if (!is.null(xgb_model$best_iteration)) {
best_iter <- xgb_model$best_iteration
} else {
best_iter <- 300
}
xgb_results <- rbind(
xgb_results,
data.frame(
max_depth = xgb_grid$max_depth[i],
eta = xgb_grid$eta[i],
subsample = xgb_grid$subsample[i],
colsample_bytree = xgb_grid$colsample_bytree[i],
best_iteration = best_iter,
Validation_AUC = as.numeric(xgb_auc) ) )
}
head(xgb_results)
## max_depth eta subsample colsample_bytree best_iteration Validation_AUC
## 1 3 0.05 0.8 0.8 300 0.8341898
## 2 5 0.05 0.8 0.8 300 0.8342961
## 3 7 0.05 0.8 0.8 300 0.8337384
## 4 3 0.10 0.8 0.8 300 0.8343263
## 5 5 0.10 0.8 0.8 300 0.8338556
## 6 7 0.10 0.8 0.8 300 0.8333974
Choose the parameter that best optimizes the validation set AUC
best_xgb <- xgb_results[which.max(xgb_results$Validation_AUC), ]
best_xgb
## max_depth eta subsample colsample_bytree best_iteration Validation_AUC
## 4 3 0.1 0.8 0.8 300 0.8343263
Retrain the optimal XGBoost
best_xgb_params <- list(
objective = "binary:logistic",
eval_metric = "auc",
max_depth = best_xgb$max_depth,
eta = best_xgb$eta,
subsample = best_xgb$subsample,
colsample_bytree = best_xgb$colsample_bytree
)
set.seed(42)
best_xgb_model <- xgb.train(
params = best_xgb_params,
data = dtrain,
nrounds = best_xgb$best_iteration,
verbose = 0
)
The final XGBoost model was trained using the best hyperparameter combination identified from the validation set. The selected parameters were max_depth = 3, eta = 0.1, subsample = 0.8, colsample_bytree = 0.8, and best_iteration = 300. This configuration achieved a validation AUC of 0.8343, which was the highest among the tested XGBoost parameter combinations. Therefore, this parameter setting was used as the final XGBoost model for comparison with Decision Tree and Random Forest.
First, compare the three models using the validation set
# Decision Tree validation prediction
dt_valid_prob <- predict(best_dt_model, valid_data[, feature_cols], type = "prob")[, "1"]
dt_valid_pred <- ifelse(dt_valid_prob >= 0.5, 1, 0)
# Random Forest validation prediction
rf_valid_prob <- predict(best_rf_model, data = valid_data[, feature_cols])$predictions[, "1"]
rf_valid_pred <- ifelse(rf_valid_prob >= 0.5, 1, 0)
# XGBoost validation prediction
xgb_valid_prob <- predict(best_xgb_model, dvalid)
xgb_valid_pred <- ifelse(xgb_valid_prob >= 0.5, 1, 0)
evaluate_model <- function(actual, predicted, prob) {
cm <- confusionMatrix( as.factor(predicted), as.factor(actual), positive = "1")
auc_value <- auc(actual, prob)
result <- data.frame(
Accuracy = cm$overall["Accuracy"],
Precision = cm$byClass["Precision"],
Recall = cm$byClass["Recall"],
F1 = cm$byClass["F1"],
ROC_AUC = as.numeric(auc_value)
)
return(result)
}
Compare results and visualization
library(ggplot2)
library(tidyr)
library(tibble)
library(pROC)
library(patchwork)
dt_valid_eval <- evaluate_model(valid_data$Diabetes_binary, dt_valid_pred, dt_valid_prob)
rf_valid_eval <- evaluate_model(valid_data$Diabetes_binary, rf_valid_pred, rf_valid_prob)
xgb_valid_eval <- evaluate_model(valid_data$Diabetes_binary, xgb_valid_pred, xgb_valid_prob)
validation_results <- rbind(
Decision_Tree = dt_valid_eval,
Random_Forest = rf_valid_eval,
XGBoost = xgb_valid_eval
)
validation_results
## Accuracy Precision Recall F1 ROC_AUC
## Decision_Tree 0.8028816 0.3666220 0.5858528 0.4510073 0.7971714
## Random_Forest 0.8213300 0.3978302 0.5700228 0.4686089 0.8300043
## XGBoost 0.8166588 0.3934487 0.6029663 0.4761797 0.8341929
# transform format
metrics_df <- as.data.frame(validation_results) %>%
rownames_to_column(var = "Model") %>%
pivot_longer(cols = -Model, names_to = "Metric", values_to = "Score") %>%
filter(Metric != "Precision")
# Plot bar chart
bar_plot <- ggplot(metrics_df, aes(x = Metric, y = Score, fill = Metric)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = round(Score, 3)), vjust = -0.5, size = 3.5) +
facet_wrap(~ Model, nrow = 1) +
scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 11),
strip.text = element_text(size = 13, face = "bold"),
plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
legend.position = "none",
plot.margin = ggplot2::margin(t = 10, r = 10, b = 30, l = 10, unit = "pt")
) +
labs(title = "Validation Results: Metrics Comparison",
x = NULL,
y = "Score")
# Plot ROC curve
roc_dt <- roc(valid_data$Diabetes_binary, dt_valid_prob)
roc_rf <- roc(valid_data$Diabetes_binary, rf_valid_prob)
roc_xgb <- roc(valid_data$Diabetes_binary, xgb_valid_prob)
auc_dt <- round(auc(roc_dt), 3)
auc_rf <- round(auc(roc_rf), 3)
auc_xgb <- round(auc(roc_xgb), 3)
plot_roc <- function(roc_obj, auc_val, model_name, line_color) {
ggroc(roc_obj, color = line_color, size = 1) +
geom_abline(intercept = 1, slope = 1, linetype = "dashed", color = "darkgray") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold", size = 13),
plot.margin = ggplot2::margin(t = 10, r = 10, b = 10, l = 10, unit = "pt")
) +
labs(title = paste(model_name, "\nAUC =", auc_val),
x = "Specificity", y = "Sensitivity")
}
# Plot
p_roc_dt <- plot_roc(roc_dt, auc_dt, "Decision Tree", "#F8766D")
p_roc_rf <- plot_roc(roc_rf, auc_rf, "Random Forest", "#00BA38")
p_roc_xgb <- plot_roc(roc_xgb, auc_xgb, "XGBoost", "#619CFF")
roc_combined <- p_roc_dt | p_roc_rf | p_roc_xgb
final_combined_plot <- bar_plot / roc_combined +
plot_annotation(
title = 'Model Evaluation Summary',
theme = theme(plot.title = element_text(hjust = 0.5, size = 18, face = "bold"))
)
# show plot
print(final_combined_plot)
The figure summarizes the validation performance of the three models using metric comparison and ROC curve analysis. The upper section compares the key classification metrics, while the lower section presents the ROC curves for each model. Overall, Random Forest achieved the highest accuracy at 0.821, but XGBoost performed better on the more important indicators for this imbalanced classification task. Specifically, XGBoost achieved the highest recall of 0.603, the highest F1-score of 0.476, and the highest ROC-AUC of 0.834.
The validation results show that XGBoost achieved the best overall performance among the three models. Although Random Forest obtained the highest accuracy, XGBoost achieved the highest recall, F1-score, and ROC-AUC. Since the dataset is imbalanced and the main objective is to identify diabetic cases, recall and ROC-AUC were considered more important than accuracy alone. Therefore, XGBoost was selected as the best-performing model for final testing and further SHAP explainability analysis.
# XGBoost test prediction
xgb_test_prob <- predict(best_xgb_model, dtest)
xgb_test_pred <- ifelse(xgb_test_prob >= 0.5, 1, 0)
xgb_test_eval <- evaluate_model(test_data$Diabetes_binary, xgb_test_pred, xgb_test_prob)
xgb_test_eval
## Accuracy Precision Recall F1 ROC_AUC
## Accuracy 0.8144119 0.3915999 0.5932562 0.4717828 0.829068
This data shows that XGBoost remains robust on the test set and has good discriminative performance, which will be visualized and explained further below.
library(ggplot2)
library(dplyr)
library(tidyr)
library(gridExtra)
library(scales)
library(caret)
library(pROC)
xgb_metric_df <- as.data.frame(xgb_test_eval)
xgb_metric_long <- xgb_metric_df %>%
pivot_longer(
cols = everything(),
names_to = "Metric",
values_to = "Value"
)
xgb_metric_long <- xgb_metric_long %>%
filter(Metric %in% c("Accuracy", "Precision", "Recall", "F1", "ROC_AUC"))
xgb_metric_long$Metric <- factor(
xgb_metric_long$Metric,
levels = c("Accuracy", "Precision", "Recall", "F1", "ROC_AUC")
)
# Bar chart
xgb_bar_plot <- ggplot(
xgb_metric_long,
aes(x = Metric, y = Value, fill = Metric)
) +
geom_col( width = 0.65, show.legend = FALSE, color = "white", linewidth = 0.8
) +
geom_text(
aes(label = round(Value, 3)),
vjust = -0.6,
size = 4.8,
fontface = "bold",
color = "#222222"
) +
scale_y_continuous(
limits = c(0, 1),
labels = percent_format(accuracy = 1),
expand = expansion(mult = c(0, 0.08))
) +
scale_fill_manual(
values = c(
"Accuracy" = "#2F80ED",
"Precision" = "#F2994A",
"Recall" = "#27AE60",
"F1" = "#EB5757",
"ROC_AUC" = "#56CCF2"
)
) +
labs(
title = "XGBoost Test Set Performance",
x = NULL,
y = "Score"
) +
theme_minimal(base_size = 14) +
theme(
plot.title = element_text( face = "bold", size = 18, hjust = 0.5, color = "#222222" ),
plot.subtitle = element_text( size = 12, hjust = 0.5, color = "gray40" ),
axis.text.x = element_text( face = "bold", size = 12, color = "#222222" ),
axis.text.y = element_text( size = 11, color = "#222222" ),
axis.title.y = element_text( face = "bold", size = 12 ),
panel.grid.major.x = element_blank(),
panel.grid.minor = element_blank(),
plot.background = element_rect(fill = "white", color = NA),
panel.background = element_rect(fill = "white", color = NA)
)
# Confusion matrix data
cm <- table(
Actual = test_data$Diabetes_binary,
Predicted = xgb_test_pred
)
cm_df <- as.data.frame(cm)
cm_df <- cm_df %>% mutate( Label = Freq )
cm_df$Actual <- factor(
cm_df$Actual,
levels = c(0, 1),
labels = c("No Diabetes", "Diabetes")
)
cm_df$Predicted <- factor(
cm_df$Predicted,
levels = c(0, 1),
labels = c("No Diabetes", "Diabetes")
)
# Plot
xgb_cm_plot <- ggplot( cm_df, aes(x = Predicted, y = Actual, fill = Freq)
) +
geom_tile( color = "white", linewidth = 1.2
) +
geom_text( aes(label = Label), size = 6, fontface = "bold", color = "white"
) +
scale_fill_gradient( low = "#BDEBFF", high = "#0077B6"
) +
labs(
title = "XGBoost Confusion Matrix",
x = "Predicted Class",
y = "Actual Class",
fill = "Count"
) +
theme_minimal(base_size = 14) +
theme(
plot.title = element_text( face = "bold", size = 18, hjust = 0.5, color = "#222222" ),
plot.subtitle = element_text( size = 12, hjust = 0.5, color = "gray40" ),
axis.text.x = element_text( face = "bold", size = 12, color = "#222222" ),
axis.text.y = element_text( face = "bold", size = 12, color = "#222222" ),
axis.title.x = element_text( face = "bold", size = 12 ),
axis.title.y = element_text( face = "bold", size = 12 ),
panel.grid = element_blank(),
legend.position = "right",
plot.background = element_rect(fill = "white", color = NA),
panel.background = element_rect(fill = "white", color = NA)
)
combined_xgb_test_plot <- grid.arrange( xgb_bar_plot, xgb_cm_plot, ncol = 2 )
The final XGBoost model achieved an accuracy of 0.814, precision of 0.392, recall of 0.593, F1-score of 0.472, and ROC-AUC of 0.829 on the test set. These results are close to the validation performance, indicating that the model generalizes reasonably well to unseen data.
The confusion matrix shows that the XGBoost model achieved a reasonable level of classification performance on the test set, but there is still room for improvement. The model correctly identified 37,115 non-diabetic cases and 4,205 diabetic cases. However, 2,883 diabetic cases were misclassified as non-diabetic, indicating that the model still missed some positive cases. In addition, 6,533 non-diabetic cases were incorrectly classified as diabetic, suggesting that false positives were also present.
Overall, the model was able to capture a portion of diabetes cases, but its classification performance was not perfect. Therefore, we should consider evaluation metrics such as recall, precision, F1 score, and ROC-AUC in a comprehensive manner.
library(ggplot2)
library(dplyr)
library(scales)
xgb_importance <- xgb.importance( feature_names = feature_cols, model = best_xgb_model )
# Prepare importance data
importance_df <- xgb_importance %>%
as.data.frame() %>%
arrange(desc(Gain)) %>%
slice(1:15)
importance_df$Feature <- factor(
importance_df$Feature,
levels = rev(importance_df$Feature)
)
# Plot bar chart
xgb_imp_plot <- ggplot(importance_df, aes(x = Feature, y = Gain, fill = Gain)) +
geom_col(width = 0.7, color = "white", linewidth = 0.8) +
geom_text(
aes(label = round(Gain, 3)),
hjust = -0.15,
size = 4.3,
fontface = "bold",
color = "#222222"
) +
coord_flip() +
scale_y_continuous( labels = number_format(accuracy = 0.001),
expand = expansion(mult = c(0, 0.12))
) +
scale_fill_gradient( low = "#8FD3FF", high = "#0077B6"
) +
labs(
title = "XGBoost Feature Importance",
x = NULL,
y = "Feature Importance Score"
) +
theme_minimal(base_size = 14) +
theme(
plot.title = element_text( face = "bold", size = 18, hjust = 0.5, color = "#222222" ),
plot.subtitle = element_text( size = 12, hjust = 0.5, color = "gray40" ),
axis.text.x = element_text( size = 11, color = "#222222" ),
axis.text.y = element_text( face = "bold", size = 12, color = "#222222" ),
axis.title.x = element_text( face = "bold", size = 12 ),
panel.grid.major.y = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "none",
plot.background = element_rect(fill = "white", color = NA),
panel.background = element_rect(fill = "white", color = NA)
)
print(xgb_imp_plot)
The XGBoost feature importance plot shows that GenHlth was the most influential predictor, with the highest importance score of 0.301, followed by HighBP with a score of 0.230. This indicates that general health status and high blood pressure played the most important roles in the model’s diabetes prediction. BMI and Age were also important predictors, with importance scores of 0.132 and 0.113, respectively. Other variables such as HighChol, DiffWalk, Income, and HeartDiseaseorAttack contributed less to the model, while PhysActivity had the lowest importance score. Overall, the model mainly relied on general health condition, blood pressure status, BMI, and age when predicting diabetes risk.
In this classification modelling section, three models were developed and compared for diabetes prediction: Decision Tree, Random Forest, and XGBoost. The Decision Tree model was used as the baseline classifier, while Random Forest and XGBoost were applied as ensemble models to improve predictive performance.
The models were trained using the balanced training dataset and tuned using the validation set. Based on the validation results, XGBoost achieved the best overall performance. XGBoost achieved an accuracy of 0.814, precision of 0.392, recall of 0.593, F1-score of 0.472, and ROC-AUC of 0.829 in the test set. These results suggest that the model maintained reasonably consistent performance on unseen data and demonstrated good overall discrimination ability. However, the relatively low precision and moderate recall indicate that the model still has limitations in accurately identifying diabetic cases. Therefore, the XGBoost model can be considered useful for diabetes risk prediction, but its results should be interpreted cautiously, especially in practical healthcare related applications.
In order to predict the health score of an individual, 3 regression models were trained and evaluated: Linear Regression, Random Forest and XGBoost. The Linear Regression model acts as a baseline for the comparisons of the 3 models.
library(tidyverse)
library(caret)
library(xgboost)
library(ordinal)
library(Metrics)
library(ranger) #faster than randomForest package, lower mem usage, multi-threading
library(tidyr)
library(ggplot2)
df <- read.csv("diabetes_binary_health_indicators_BRFSS2015.csv")
#Convert Ordinal Variables to factors
df$Age <- as.factor(df$Age)
df$Income <- as.factor(df$Income)
df$Education <- as.factor(df$Education)
#Train-Test Split 80-20
set.seed(13)
train_index <- createDataPartition(
df$GenHlth,
p = 0.8,
list = FALSE
)
train <- df[train_index, ]
test <- df[-train_index, ]
#Linear Regression
lm_model <- lm(
GenHlth ~ . ,
data = train
)
#Linear regression prediction
pred_lm <- predict(
lm_model,
newdata = test
)
#Evaluate metrics of the linear regression model
lm_rmse <- rmse(test$GenHlth, pred_lm)
lm_mae <- mae(test$GenHlth, pred_lm)
lm_r2 <- cor(test$GenHlth, pred_lm)^2
#Random Forest Regression using ranger for faster speed and lower mem usage
rf_model <- ranger(
GenHlth ~ . ,
data = train,
num.trees = 500,
seed = 123,
importance = "permutation"
)
## Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 8 seconds.
## Growing trees.. Progress: 39%. Estimated remaining time: 1 minute, 35 seconds.
## Growing trees.. Progress: 59%. Estimated remaining time: 1 minute, 3 seconds.
## Growing trees.. Progress: 79%. Estimated remaining time: 33 seconds.
## Growing trees.. Progress: 99%. Estimated remaining time: 1 seconds.
## Computing permutation importance.. Progress: 13%. Estimated remaining time: 3 minutes, 31 seconds.
## Computing permutation importance.. Progress: 25%. Estimated remaining time: 3 minutes, 5 seconds.
## Computing permutation importance.. Progress: 38%. Estimated remaining time: 2 minutes, 30 seconds.
## Computing permutation importance.. Progress: 51%. Estimated remaining time: 1 minute, 59 seconds.
## Computing permutation importance.. Progress: 63%. Estimated remaining time: 1 minute, 31 seconds.
## Computing permutation importance.. Progress: 75%. Estimated remaining time: 1 minute, 3 seconds.
## Computing permutation importance.. Progress: 86%. Estimated remaining time: 37 seconds.
## Computing permutation importance.. Progress: 98%. Estimated remaining time: 5 seconds.
#Random Forest prediction
pred_rf <- predict(
rf_model,
#newdata = test
data = test
)$predictions
#Evaluate Metrics of the Random Forest model
rf_rmse <- rmse(test$GenHlth, pred_rf)
rf_mae <- mae(test$GenHlth, pred_rf)
rf_r2 <- cor(test$GenHlth, pred_rf)^2
#XGBoost Regression
#One-Hot Encoding
dummy <- dummyVars(
GenHlth ~ .,
data = train
)
x_train <- predict(dummy, train)
x_test <- predict(dummy, test)
y_train <- train$GenHlth
y_test <- test$GenHlth
xgb_model <- xgboost(
x = x_train,
y = y_train,
objective = "reg:squarederror",
nrounds = 200,
max_depth = 6,
learning_rate = 0.1
)
#XGBoost Prediction
pred_xgb <- predict(
xgb_model,
x_test
)
#Evaluation Metrics of XGBoost model with test data
xgb_rmse <- rmse(y_test, pred_xgb)
xgb_mae <- mae(y_test, pred_xgb)
xgb_r2 <- cor(y_test, pred_xgb)^2
#Compare the models
results <- data.frame(
Model = c(
"Linear Regression",
"Random Forest",
"XGBoost"
),
RMSE = c(
lm_rmse,
rf_rmse,
xgb_rmse
),
MAE = c(
lm_mae,
rf_mae,
xgb_mae
),
R2 = c(
lm_r2,
rf_r2,
xgb_r2
)
)
print(results)
## Model RMSE MAE R2
## 1 Linear Regression 0.7895687 0.6311160 0.4520434
## 2 Random Forest 0.7782872 0.6252869 0.4677728
## 3 XGBoost 0.7739594 0.6212268 0.4735073
#Plotting graphs for models evaluation
results_long <- pivot_longer(
results,
cols = c(RMSE, MAE, R2),
names_to = "Metric",
values_to = "Value"
)
#1.Bar charts to compare model performance MAE, RMSE, R2
ggplot(results_long,
aes(
x = Model,y = Value,fill = Model
)
) +
geom_col() +
geom_text(aes(label = round(Value, 3)),vjust = -0.5)+
facet_wrap(~Metric, scales = "free_y") +
theme_minimal() +
theme(
axis.text.x = element_text(
angle = 45,
hjust = 1
)
) + labs( title = "Models Performance Comparison")
Based on the bar charts comparing the MAE, RMSE and R2 values of the
three models, XGBoost model has the best performance among the three
models. Linear regression as the baseline model has the highest MAE
(0.631) and RMSE(0.79), both Random Forest and XGBoost models perform
better than Linear Regression, with the XGBoost model performing
slightly better than the Random Forest.
In terms of MAE and RMSE, the smaller the values the better as it indicates less error in our model predictions, XGBoost has the smallest MAE of 0.621 and RMSE of 0.774 among the three models.
In terms of the coeficient of determination (R2), a larger R2 value is more preferred, the XGBoost model has the highest R2 value among three models which is 0.474, which means the XGBoost model able to explains about 47.4% of the variation in the health score.
#2.Actual vs Predicted Scatter plot
scatter_df <- data.frame(
Actual = test$GenHlth,
Linear_Regression = pred_lm,
Random_Forest = pred_rf,
XGBoost = pred_xgb
)
scatter_long <- scatter_df %>%
pivot_longer(
cols = c(
Linear_Regression,
Random_Forest,
XGBoost
),
names_to = "Model",
values_to = "Predicted"
)
#3 Scatter Plots Comparison
ggplot( scatter_long,
aes(
x = Actual,
y = Predicted
)
) +
geom_point(
alpha = 0.1,
size = 0.5
) +
geom_abline(
slope = 1,
intercept = 0,
linetype = "dashed"
) +
facet_wrap(~Model) +
theme_minimal() +
labs(
title = "Actual vs Predicted Health Score",
x = "Actual GenHlth",
y = "Predicted GenHlth"
)
From the 3 scatter plots of the 3 models above, we can see that the
predicted values are quite spread out from the diagonal line. The
diagonal reference line indicates the perfect match between the actual
and predicted values. All the 3 models showing positive relationship
between the predicted and actual data, indicating the models somehow
learn some patterns from the training data but the predicted data are
quite dispersed from the diagonal line, which indicates our models
prediction are not perfectly aligned with the actual value. Among the 3
models, XGBoost showing the strongest alignment with the diagonal
line.
#3 alternative boxplot
comparison_df <- data.frame(
Actual = test$GenHlth,
Linear = pred_lm,
RF = pred_rf,
XGB = pred_xgb
) %>%
pivot_longer(
cols = c(Linear, RF, XGB),
names_to = "Model",
values_to = "Prediction"
)
ggplot(
comparison_df,
aes(
x = factor(Actual),
y = Prediction
)
) +
geom_boxplot() +
facet_wrap(~Model)
A box plot was generated for each models to better inspect the spread of
the predicted health scores as opposed to the actual health score. The
box plot showing the Median of the predicted health scores on the actual
health score, the 25th & 75th percentile of the predictions and the
outliers. For example, when the actual health score is 4, we can see the
box plot showing the prediction around 3.3 to 3.8. Notice that here are
some overlapping in the predictions for actual health scores of 2,3
& 4, which indicates the models are capable to predict very healthy
(5) and very unhealthy individuals(1) with clear separation, but might
struggle to correctly predict the health score of a fair or good health
individual with health score ranging from 2~3 or 3~4. The box plot among
the 3 models showing not much difference, but somehow XGBoost model’s
box plot showing better medians separation and narrower distributions of
predicted values which means the model is making more consistent
predictions for each actual health score.
After model comparison, XGBoost was selected as the best-performing model based on its validation performance. To improve model interpretability, SHAP analysis was applied to explain how different predictors contributed to diabetes classification.
library(shapviz)
library(ggplot2)
library(xgboost)
shap_values <- predict( best_xgb_model, dtest, predcontrib = TRUE)
shap_values_no_bias <- shap_values[, -ncol(shap_values)]
shap_xgb <- shapviz( shap_values_no_bias, X = test_data[, feature_cols])
shap_blue <- "#1E88E5"
shap_pink <- "#FF0051"
# Extract SHAP matrix from shapviz object
shap_matrix <- shap_xgb$S
# Compute mean absolute SHAP value
shap_importance <- data.frame(
Feature = colnames(shap_matrix),
MeanAbsSHAP = colMeans(abs(shap_matrix))
) %>%
arrange(desc(MeanAbsSHAP)) %>%
dplyr::slice(1:10)
# Make feature order nicer for plotting
shap_importance$Feature <- factor(
shap_importance$Feature,
levels = rev(shap_importance$Feature)
)
# Plot
p_shap_bar <- ggplot(shap_importance, aes(x = Feature, y = MeanAbsSHAP)) +
geom_col(fill = shap_pink, width = 0.7) +
geom_text(
aes(label = sprintf("%.3f", MeanAbsSHAP)),
hjust = -0.1,
size = 4,
fontface = "bold"
) +
coord_flip() +
expand_limits(y = max(shap_importance$MeanAbsSHAP) * 1.15) +
labs(
title = "SHAP Global Feature Importance",
x = NULL,
y = "Mean Absolute SHAP Value"
) +
theme_minimal(base_size = 13) +
theme(
plot.title = element_text(hjust = 0.5, face = "bold", size = 16),
axis.title.y = element_blank(),
axis.title.x = element_text(face = "bold", size = 13),
axis.text.y = element_text(size = 11),
axis.text.x = element_text(size = 11),
panel.grid.minor = element_blank(),
panel.grid.major.y = element_blank()
)
# beeswarm
p_shap_beeswarm <- sv_importance( shap_xgb, kind = "beeswarm", max_display = 10
) +
scale_color_gradient( low = shap_blue, high = shap_pink, name = "Feature value"
) +
labs( title = "SHAP Beeswarm Plot", x = "SHAP Value", y = NULL
) +
theme_minimal(base_size = 13) +
theme(
plot.title = element_text(hjust = 0.5, face = "bold", size = 16),
axis.title.x = element_text(face = "bold", size = 13),
axis.text.y = element_text(size = 11),
legend.title = element_text(face = "bold"),
panel.grid.minor = element_blank()
)
# Combine SHAP bar plot and beeswarm plot side by side
p_shap_combined <- p_shap_bar + p_shap_beeswarm +
plot_layout(widths = c(1, 1.2)) +
plot_annotation(
title = "SHAP Model Interpretation",
theme = theme(
plot.title = element_text(
hjust = 0.5,
face = "bold",
size = 18 ) ) )
p_shap_combined
SHAP Global Feature Importance
The SHAP global feature importance plot shows that GenHlth was the most influential predictor in the XGBoost model, with the highest mean absolute SHAP value of 0.612. This means that general health status had the strongest overall impact on the model’s diabetes prediction. Age, HighBP, and BMI were also important features, with similar SHAP values around 0.40, followed by HighChol with a value of 0.302. Overall, the model mainly relied on general health condition, age, blood pressure, BMI, and cholesterol status when making predictions.
SHAP Beeswarm Plot
The SHAP beeswarm plot further explains how each feature affected the prediction direction. Positive SHAP values push the prediction toward the Diabetes class, while negative SHAP values reduce the predicted diabetes risk. The plot shows that poorer general health, older age, high blood pressure, higher BMI, and high cholesterol generally increased the model’s predicted diabetes risk. In contrast, better health status, younger age, absence of high blood pressure, lower BMI, and normal cholesterol tended to reduce the prediction toward diabetes.
This project successfully applied machine learning techniques to the BRFSS 2015 dataset to predict diabetes risk (classification) and general health scores (regression). Across both tasks, tree-based ensemble methods, specifically XGBoost, demonstrated the most robust predictive capabilities.
GenHlth): We further analyzed the dataset by
treating the general health score as a continuous target variable.
XGBoost once again outperformed Linear Regression and Random Forest,
achieving the lowest Root Mean Square Error (RMSE = 0.774) and Mean
Absolute Error (MAE = 0.621). The model captured approximately 47.4% of
the variance (R-squared = 0.474) in the health scores. Box plot analyses
revealed that while the model clearly distinguishes between extreme
health states (excellent vs. poor), it exhibits some overlap when
predicting intermediate health scores.Through feature importance ranking and SHAP explainability analysis, several critical factors were consistently identified as major drivers of diabetes risk and overall health:
GenHlth):
Self-reported general health was the most dominant predictor of
diabetes. The SHAP analysis clearly indicated that a poorer general
health score heavily pushes the model’s prediction toward positive
diabetes status.HighBP and
HighChol): High blood pressure and high
cholesterol were the next most significant predictors. The presence of
these conditions drastically increases the likelihood of being
classified as diabetic, reinforcing the known medical consensus that
metabolic syndrome components are deeply interconnected.Age and
BMI): Advancing age and higher Body Mass Index
(BMI) also showed strong positive correlations with diabetes risk.PhysActivity) showed lower relative
importance in the final models compared to established medical
conditions (like blood pressure) and demographics, though they still
contribute to the overall health profile.While the models provide valuable insights, several limitations must be acknowledged:
This project demonstrates the strong potential of machine learning—particularly XGBoost—in analyzing large-scale public health data to uncover complex relationships between lifestyle, demographics, and chronic diseases.
From a public health perspective, the findings strongly emphasize the importance of early intervention in managing blood pressure, cholesterol, and BMI to mitigate diabetes risk. Healthcare providers and policymakers can utilize these insights to design targeted awareness campaigns and preventive care strategies, focusing especially on aging populations and individuals reporting declining general health. Moving forward, incorporating more direct clinical measurements could significantly enhance the predictive power and clinical utility of these machine learning models.
| Group Members | Contribution |
|---|---|
| XU YIRAN 25088321 | Data Collection, Understanding & Cleaning |
| WANG JIACUN 25075892 | Classification Modeling & Evaluation |
| SAM YEE THONG WAH 24209652 | Regression Modeling & Evaluation, Final Checking & Video Compilation |
| LUO YUN 25088950 | Interpretation, Explainability & Conclusion |
| HUANG MAO WEI 24231728 | Introduction, Project Workflow & PPT |