Group Members
XU YIRAN 25088321
WANG JIACUN 25075892
SAM YEE THONG WAH 24209652
LUO YUN 25088950
HUANG MAO WEI 24231728

Introduction

Diabetes is one of the most common chronic diseases worldwide and has become a major public health concern. Early identification of diabetes risk factors can help healthcare providers and individuals take preventive actions to reduce complications and improve health outcomes.

In this project, the Behavioral Risk Factor Surveillance System (BRFSS) 2015 dataset was selected for analysis. The dataset contains 253,680 observations and 22 health-related variables, including demographic characteristics, lifestyle habits, and medical conditions. Examples of these variables include age, body mass index (BMI), smoking status, physical activity, blood pressure, cholesterol level, education, and income.

Using this large-scale dataset, the project applies data science and machine learning techniques to explore relationships between health indicators and diabetes status. Both classification and regression approaches are used to generate predictive insights and support healthcare-related decision making.

Objectives

The objectives of this project are:

  1. To determine whether an individual is diabetic
  2. To predict the health score of an individual based on the health indicators.
  3. To identify importance of health, lifestyle, and demographic factors associated with diabetes.
  4. To develop and evaluate classification models for predicting diabetes status.
  5. To develop and evaluate regression models for analyzing relationships among health-related variables.
  6. To interpret the modeling results and provide meaningful insights for diabetes awareness and prevention.

1. Data Collection & Overview

1.1 Dataset Source and Background

The dataset used in this project is the Diabetes Health Indicators Dataset obtained from the Behavioral Risk Factor Surveillance System (BRFSS) 2015 conducted by the Centers for Disease Control and Prevention (CDC).

This dataset contains information related to individuals’ health conditions, lifestyle habits, and demographic characteristics. Some of the variables included in the dataset are BMI, smoking habits, physical activity, blood pressure, cholesterol level, age, and income.

1.2 Motivation for Selection

Diabetes is one of the most common health problems worldwide and has become an important public health issue. By using machine learning techniques, this project aims to study the relationship between health indicators and diabetes risk.

This dataset was selected because it is suitable for both classification and regression analysis. It also contains a large number of observations, which makes it appropriate for machine learning modeling and prediction tasks.

1.3 Dimension and Compliance Check

In total, this specific dataset contains 253,680 observations and 22 variables. The large number of observations makes the dataset exceptionally suitable for training robust machine learning models and handling complex prediction tasks.

To check whether the dataset meets the project requirement, we calculated the dataset size. The dataset has 253,680 rows and 22 columns, giving a total of 5,580,960 data points. This is more than the required 100,000, so the dataset is suitable for this project.

# Load the required libraries
library(readr)
library(dplyr)

# Read the dataset (Make sure the file is in your working directory)
diabetes <- read_csv("diabetes_binary_health_indicators_BRFSS2015.csv")

# Verify the dimensions programmatically
dim(diabetes)
## [1] 253680     22

2. Dataset Structure & Content Analysis

2.1 Variable Overview and Problem Mapping

The variables in this dataset can be divided into three main groups, which are health condition variables, lifestyle habit variables, and demographic variables.

Examples of health condition variables include blood pressure and cholesterol level, while lifestyle variables include smoking habits, physical activity, and diet-related information. Demographic variables include age, education level, and income category.

Based on the objectives of this project, the variables are used for two different types of machine learning tasks:

  • Classification Target: Diabetes_binary is used as the target variable to predict whether an individual has diabetes.
  • Regression Target: GenHlth (General Health Score) is used as the numerical target variable in the regression analysis.
# Display dataset structure  
str(diabetes)  
## spc_tbl_ [253,680 × 22] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Diabetes_binary     : num [1:253680] 0 0 0 0 0 0 0 0 1 0 ...
##  $ HighBP              : num [1:253680] 1 0 1 1 1 1 1 1 1 0 ...
##  $ HighChol            : num [1:253680] 1 0 1 0 1 1 0 1 1 0 ...
##  $ CholCheck           : num [1:253680] 1 0 1 1 1 1 1 1 1 1 ...
##  $ BMI                 : num [1:253680] 40 25 28 27 24 25 30 25 30 24 ...
##  $ Smoker              : num [1:253680] 1 1 0 0 0 1 1 1 1 0 ...
##  $ Stroke              : num [1:253680] 0 0 0 0 0 0 0 0 0 0 ...
##  $ HeartDiseaseorAttack: num [1:253680] 0 0 0 0 0 0 0 0 1 0 ...
##  $ PhysActivity        : num [1:253680] 0 1 0 1 1 1 0 1 0 0 ...
##  $ Fruits              : num [1:253680] 0 0 1 1 1 1 0 0 1 0 ...
##  $ Veggies             : num [1:253680] 1 0 0 1 1 1 0 1 1 1 ...
##  $ HvyAlcoholConsump   : num [1:253680] 0 0 0 0 0 0 0 0 0 0 ...
##  $ AnyHealthcare       : num [1:253680] 1 0 1 1 1 1 1 1 1 1 ...
##  $ NoDocbcCost         : num [1:253680] 0 1 1 0 0 0 0 0 0 0 ...
##  $ GenHlth             : num [1:253680] 5 3 5 2 2 2 3 3 5 2 ...
##  $ MentHlth            : num [1:253680] 18 0 30 0 3 0 0 0 30 0 ...
##  $ PhysHlth            : num [1:253680] 15 0 30 0 0 2 14 0 30 0 ...
##  $ DiffWalk            : num [1:253680] 1 0 1 0 0 0 0 1 1 0 ...
##  $ Sex                 : num [1:253680] 0 0 0 0 0 1 0 0 0 1 ...
##  $ Age                 : num [1:253680] 9 7 9 11 11 10 9 11 9 8 ...
##  $ Education           : num [1:253680] 4 6 4 3 5 6 6 4 5 4 ...
##  $ Income              : num [1:253680] 3 1 8 6 4 8 7 4 1 3 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Diabetes_binary = col_double(),
##   ..   HighBP = col_double(),
##   ..   HighChol = col_double(),
##   ..   CholCheck = col_double(),
##   ..   BMI = col_double(),
##   ..   Smoker = col_double(),
##   ..   Stroke = col_double(),
##   ..   HeartDiseaseorAttack = col_double(),
##   ..   PhysActivity = col_double(),
##   ..   Fruits = col_double(),
##   ..   Veggies = col_double(),
##   ..   HvyAlcoholConsump = col_double(),
##   ..   AnyHealthcare = col_double(),
##   ..   NoDocbcCost = col_double(),
##   ..   GenHlth = col_double(),
##   ..   MentHlth = col_double(),
##   ..   PhysHlth = col_double(),
##   ..   DiffWalk = col_double(),
##   ..   Sex = col_double(),
##   ..   Age = col_double(),
##   ..   Education = col_double(),
##   ..   Income = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

2.2 Inspection of Data Types

To better understand the dataset structure, the data type of each variable was checked. This step helps ensure that the variables are in a suitable format before data cleaning and machine learning.

# Display the data type of each variable
sapply(diabetes, class)
##      Diabetes_binary               HighBP             HighChol 
##            "numeric"            "numeric"            "numeric" 
##            CholCheck                  BMI               Smoker 
##            "numeric"            "numeric"            "numeric" 
##               Stroke HeartDiseaseorAttack         PhysActivity 
##            "numeric"            "numeric"            "numeric" 
##               Fruits              Veggies    HvyAlcoholConsump 
##            "numeric"            "numeric"            "numeric" 
##        AnyHealthcare          NoDocbcCost              GenHlth 
##            "numeric"            "numeric"            "numeric" 
##             MentHlth             PhysHlth             DiffWalk 
##            "numeric"            "numeric"            "numeric" 
##                  Sex                  Age            Education 
##            "numeric"            "numeric"            "numeric" 
##               Income 
##            "numeric"

The result shows that most variables are stored as numeric or integer values. However, some variables actually represent categories or levels, such as education level, income category, and general health score. Therefore, we need to understand the meaning of each variable before using them in the analysis.

3. Summary Statistics

The summary statistics below provide a general overview of the variables included in the dataset.

# Display summary statistics
summary(diabetes)
##  Diabetes_binary      HighBP         HighChol        CholCheck     
##  Min.   :0.0000   Min.   :0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:1.0000  
##  Median :0.0000   Median :0.000   Median :0.0000   Median :1.0000  
##  Mean   :0.1393   Mean   :0.429   Mean   :0.4241   Mean   :0.9627  
##  3rd Qu.:0.0000   3rd Qu.:1.000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.000   Max.   :1.0000   Max.   :1.0000  
##       BMI            Smoker           Stroke        HeartDiseaseorAttack
##  Min.   :12.00   Min.   :0.0000   Min.   :0.00000   Min.   :0.00000     
##  1st Qu.:24.00   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000     
##  Median :27.00   Median :0.0000   Median :0.00000   Median :0.00000     
##  Mean   :28.38   Mean   :0.4432   Mean   :0.04057   Mean   :0.09419     
##  3rd Qu.:31.00   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.00000     
##  Max.   :98.00   Max.   :1.0000   Max.   :1.00000   Max.   :1.00000     
##   PhysActivity        Fruits          Veggies       HvyAlcoholConsump
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   
##  1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000   
##  Median :1.0000   Median :1.0000   Median :1.0000   Median :0.0000   
##  Mean   :0.7565   Mean   :0.6343   Mean   :0.8114   Mean   :0.0562   
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000   
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   
##  AnyHealthcare     NoDocbcCost         GenHlth         MentHlth     
##  Min.   :0.0000   Min.   :0.00000   Min.   :1.000   Min.   : 0.000  
##  1st Qu.:1.0000   1st Qu.:0.00000   1st Qu.:2.000   1st Qu.: 0.000  
##  Median :1.0000   Median :0.00000   Median :2.000   Median : 0.000  
##  Mean   :0.9511   Mean   :0.08418   Mean   :2.511   Mean   : 3.185  
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:3.000   3rd Qu.: 2.000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :5.000   Max.   :30.000  
##     PhysHlth         DiffWalk           Sex              Age        
##  Min.   : 0.000   Min.   :0.0000   Min.   :0.0000   Min.   : 1.000  
##  1st Qu.: 0.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 6.000  
##  Median : 0.000   Median :0.0000   Median :0.0000   Median : 8.000  
##  Mean   : 4.242   Mean   :0.1682   Mean   :0.4403   Mean   : 8.032  
##  3rd Qu.: 3.000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:10.000  
##  Max.   :30.000   Max.   :1.0000   Max.   :1.0000   Max.   :13.000  
##    Education        Income     
##  Min.   :1.00   Min.   :1.000  
##  1st Qu.:4.00   1st Qu.:5.000  
##  Median :5.00   Median :7.000  
##  Mean   :5.05   Mean   :6.054  
##  3rd Qu.:6.00   3rd Qu.:8.000  
##  Max.   :6.00   Max.   :8.000

4. Data Cleaning

Before building machine learning models, the dataset is checked for missing values and duplicate records to ensure data quality.

4.1 Missing Values and Duplicate Check

# Check missing values
total_na <- sum(is.na(diabetes))
print(paste("Total Missing Values:", total_na))
## [1] "Total Missing Values: 0"
# Check duplicate rows
total_duplicates <- sum(duplicated(diabetes))
print(paste("Total Duplicate Rows:", total_duplicates))
## [1] "Total Duplicate Rows: 24206"

Interpretation

The dataset does not contain any missing values, so no imputation or replacement is required.

A total of 24,206 duplicate rows were identified in the dataset. Since this dataset comes from a large public health survey, it is possible for different individuals to share similar health profiles and responses. Therefore, the duplicate rows are retained for further analysis.

4.2 Categorical Variable Transformation

Several variables in the dataset are stored as numeric values such as 0 and 1, but they actually represent categorical health conditions or responses.

To improve readability and prepare the dataset for machine learning analysis, these variables are converted into factor variables with meaningful labels.

# Transform selected variables into factors
diabetes_cleaned <- diabetes %>%
  mutate(
    
    # Diabetes status
    Diabetes_binary = factor(
      Diabetes_binary,
      levels = c(0, 1),
      labels = c("No_Diabetes", "Diabetes")
    ),
    
    # Health condition indicators
    HighBP = factor(
      HighBP,
      levels = c(0, 1),
      labels = c("No_HighBP", "HighBP")
    ),
    
    HighChol = factor(
      HighChol,
      levels = c(0, 1),
      labels = c("No_HighChol", "HighChol")
    ),
    
    CholCheck = factor(
      CholCheck,
      levels = c(0, 1),
      labels = c("No_Check", "Checked")
    ),
    
    Stroke = factor(
      Stroke,
      levels = c(0, 1),
      labels = c("No_Stroke", "Stroke")
    ),
    
    HeartDiseaseorAttack = factor(
      HeartDiseaseorAttack,
      levels = c(0, 1),
      labels = c("No_HD", "HeartDisease")
    )
  )

# Verify transformed variables
str(diabetes_cleaned[, c(
  "Diabetes_binary",
  "HighBP",
  "HighChol",
  "CholCheck",
  "Stroke",
  "HeartDiseaseorAttack"
)])
## tibble [253,680 × 6] (S3: tbl_df/tbl/data.frame)
##  $ Diabetes_binary     : Factor w/ 2 levels "No_Diabetes",..: 1 1 1 1 1 1 1 1 2 1 ...
##  $ HighBP              : Factor w/ 2 levels "No_HighBP","HighBP": 2 1 2 2 2 2 2 2 2 1 ...
##  $ HighChol            : Factor w/ 2 levels "No_HighChol",..: 2 1 2 1 2 2 1 2 2 1 ...
##  $ CholCheck           : Factor w/ 2 levels "No_Check","Checked": 2 1 2 2 2 2 2 2 2 2 ...
##  $ Stroke              : Factor w/ 2 levels "No_Stroke","Stroke": 1 1 1 1 1 1 1 1 1 1 ...
##  $ HeartDiseaseorAttack: Factor w/ 2 levels "No_HD","HeartDisease": 1 1 1 1 1 1 1 1 2 1 ...

4.3 Lifestyle and Demographic Variable Transformation

The remaining lifestyle and demographic variables are also converted into factor variables to improve readability and support further analysis.

# Transform lifestyle and demographic variables into factors
diabetes_cleaned <- diabetes_cleaned %>%
  mutate(
    
    # Lifestyle variables
    Smoker = factor(
      Smoker,
      levels = c(0, 1),
      labels = c("Non_Smoker", "Smoker")
    ),
    
    PhysActivity = factor(
      PhysActivity,
      levels = c(0, 1),
      labels = c("No_Activity", "Active")
    ),
    
    Fruits = factor(
      Fruits,
      levels = c(0, 1),
      labels = c("No_Fruits", "Fruits")
    ),
    
    Veggies = factor(
      Veggies,
      levels = c(0, 1),
      labels = c("No_Veggies", "Veggies")
    ),
    
    HvyAlcoholConsump = factor(
      HvyAlcoholConsump,
      levels = c(0, 1),
      labels = c("No_HeavyDrink", "HeavyDrink")
    ),
    
    # Demographic variables
    Sex = factor(
      Sex,
      levels = c(0, 1),
      labels = c("Female", "Male")
    ),
    
    AnyHealthcare = factor(
      AnyHealthcare,
      levels = c(0, 1),
      labels = c("No_Coverage", "Has_Coverage")
    ),
    
    NoDocbcCost = factor(
      NoDocbcCost,
      levels = c(0, 1),
      labels = c("No", "Yes_CostBarrier")
    ),
    
    DiffWalk = factor(
      DiffWalk,
      levels = c(0, 1),
      labels = c("No_Difficulty", "Difficulty_Walking")
    )
  )

# Verify transformed variables
str(diabetes_cleaned[, c(
  "Smoker",
  "PhysActivity",
  "Fruits",
  "Veggies",
  "HvyAlcoholConsump",
  "Sex",
  "AnyHealthcare",
  "NoDocbcCost",
  "DiffWalk"
)])
## tibble [253,680 × 9] (S3: tbl_df/tbl/data.frame)
##  $ Smoker           : Factor w/ 2 levels "Non_Smoker","Smoker": 2 2 1 1 1 2 2 2 2 1 ...
##  $ PhysActivity     : Factor w/ 2 levels "No_Activity",..: 1 2 1 2 2 2 1 2 1 1 ...
##  $ Fruits           : Factor w/ 2 levels "No_Fruits","Fruits": 1 1 2 2 2 2 1 1 2 1 ...
##  $ Veggies          : Factor w/ 2 levels "No_Veggies","Veggies": 2 1 1 2 2 2 1 2 2 2 ...
##  $ HvyAlcoholConsump: Factor w/ 2 levels "No_HeavyDrink",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Sex              : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 2 1 1 1 2 ...
##  $ AnyHealthcare    : Factor w/ 2 levels "No_Coverage",..: 2 1 2 2 2 2 2 2 2 2 ...
##  $ NoDocbcCost      : Factor w/ 2 levels "No","Yes_CostBarrier": 1 2 2 1 1 1 1 1 1 1 ...
##  $ DiffWalk         : Factor w/ 2 levels "No_Difficulty",..: 2 1 2 1 1 1 1 2 2 1 ...

4.4 Ordinal Variable Transformation

For EDA, some ordinal variables are converted into ordered factors to make the categories easier to read in tables and charts. GenHlth is also converted into an ordered factor in this section because its values represent ordered health levels.

However, GenHlth is still used as the target variable for the regression task. In the later modeling section, it will be converted back into a numerical score for regression analysis.

Other variables such as Age, Education, and Income are also ordinal variables because their categories follow a meaningful order.

To prepare the dataset for further analysis, these variables are converted into ordered factors with labels based on the BRFSS 2015 documentation.

# Transform ordinal variables into ordered factors
diabetes_cleaned <- diabetes_cleaned %>%
  mutate(
    
    # General health condition
    GenHlth = factor(
      GenHlth,
      levels = 1:5,
      labels = c("Excellent", "Very_Good", "Good", "Fair", "Poor"),
      ordered = TRUE
    ),
    
    # Age groups
    Age = factor(
      Age,
      levels = 1:13,
      labels = c(
        "18-24", "25-29", "30-34", "35-39", "40-44", "45-49",
        "50-54", "55-59", "60-64", "65-69", "70-74", "75-79", "80+"
      ),
      ordered = TRUE
    ),
    
    # Education level
    Education = factor(
      Education,
      levels = 1:6,
      labels = c(
        "Never_Attended", "Elementary", "Some_HighSchool",
        "HighSchool_Graduate", "Some_College", "College_Graduate"
      ),
      ordered = TRUE
    ),
    
    # Income groups
    Income = factor(
      Income,
      levels = 1:8,
      labels = c(
        "<10k", "10k-15k", "15k-20k", "20k-25k",
        "25k-35k", "35k-50k", "50k-75k", "75k+"
      ),
      ordered = TRUE
    )
  )

# Verify transformed variables
str(diabetes_cleaned[, c("GenHlth", "Age", "Education", "Income")])
## tibble [253,680 × 4] (S3: tbl_df/tbl/data.frame)
##  $ GenHlth  : Ord.factor w/ 5 levels "Excellent"<"Very_Good"<..: 5 3 5 2 2 2 3 3 5 2 ...
##  $ Age      : Ord.factor w/ 13 levels "18-24"<"25-29"<..: 9 7 9 11 11 10 9 11 9 8 ...
##  $ Education: Ord.factor w/ 6 levels "Never_Attended"<..: 4 6 4 3 5 6 6 4 5 4 ...
##  $ Income   : Ord.factor w/ 8 levels "<10k"<"10k-15k"<..: 3 1 8 6 4 8 7 4 1 3 ...

4.5 Post-Cleaning Verification

After cleaning and transforming the variables, a final check is performed to confirm the structure and quality of the cleaned dataset.

# Check dimension of cleaned dataset
dim(diabetes_cleaned)
## [1] 253680     22
# Check missing values after cleaning
sum(is.na(diabetes_cleaned))
## [1] 0
# Check target variable distribution
summary(diabetes_cleaned$Diabetes_binary)
## No_Diabetes    Diabetes 
##      218334       35346

Data Cleaning Summary

The cleaned dataset contains 253,680 observations and 22 variables. No missing values are found after the cleaning process.

The dataset is now ready for exploratory data analysis, classification modeling, and regression analysis.

5. Exploratory Data Analysis (EDA)

After completing the data cleaning process, exploratory data analysis (EDA) is performed to better understand the dataset and the distribution of important variables.

5.1 Distribution of Diabetes Cases and BMI

The following graphs show the distribution of diabetes cases and BMI values in the dataset.

library(ggplot2)
library(gridExtra)

plot_target_class <- ggplot(diabetes_cleaned, aes(x = Diabetes_binary, fill = Diabetes_binary)) +
  geom_bar(color = "black", alpha = 0.8, width = 0.5) +
  scale_fill_manual(values = c("#2a9d8f", "#e76f51")) +
  labs(
    title = "Distribution of Diabetes Cases",
    x = "Diabetes Status",
    y = "Count"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

plot_target_reg <- ggplot(diabetes_cleaned, aes(x = BMI)) +
  geom_histogram(
    binwidth = 1,
    fill = "#457b9d",
    color = "black",
    alpha = 0.7
  ) +
  labs(
    title = "Distribution of BMI",
    x = "BMI",
    y = "Frequency"
  ) +
  theme_minimal()

grid.arrange(plot_target_class, plot_target_reg, ncol = 2)

Interpretation

The bar chart shows that the number of individuals without diabetes is higher than the number of individuals with diabetes.

The BMI histogram shows that most BMI values are concentrated between 20 and 35, with some higher BMI values appearing in the dataset.

5.2 Relationship Between Diabetes and Health Factors

In this section, several important health and lifestyle variables are compared with diabetes status to explore possible relationships in the dataset.

# Diabetes vs High Blood Pressure
p1 <- ggplot(diabetes_cleaned, aes(x = HighBP, fill = Diabetes_binary)) +
  geom_bar(position = "dodge") +
  labs(
    title = "Diabetes vs High Blood Pressure",
    x = "High Blood Pressure",
    y = "Count"
  ) +
  theme_minimal()

# Diabetes vs Smoking
p2 <- ggplot(diabetes_cleaned, aes(x = Smoker, fill = Diabetes_binary)) +
  geom_bar(position = "dodge") +
  labs(
    title = "Diabetes vs Smoking",
    x = "Smoking Status",
    y = "Count"
  ) +
  theme_minimal()

# Diabetes vs Physical Activity
p3 <- ggplot(diabetes_cleaned, aes(x = PhysActivity, fill = Diabetes_binary)) +
  geom_bar(position = "dodge") +
  labs(
    title = "Diabetes vs Physical Activity",
    x = "Physical Activity",
    y = "Count"
  ) +
  theme_minimal()

grid.arrange(p1, p2, p3, ncol = 1)

Interpretation

The graphs show that individuals with high blood pressure appear more likely to have diabetes compared to those without high blood pressure.

The smoking graph suggests that smokers have a slightly higher number of diabetes cases compared to non-smokers.

For physical activity, people who are physically active generally show fewer diabetes cases than those with no physical activity.

5.3 Correlation Analysis

To better understand the relationships between numerical variables, a correlation matrix is created using selected health-related variables.

library(corrplot)

# Select numeric variables
numeric_data <- diabetes %>%
  select(
    Diabetes_binary,
    BMI,
    HighBP,
    HighChol,
    PhysActivity,
    MentHlth,
    PhysHlth
  )

# Create correlation matrix
cor_matrix <- cor(numeric_data)

# Plot correlation matrix
corrplot(
  cor_matrix,
  method = "color",
  type = "upper",
  tl.col = "black",
  tl.srt = 45
)

Interpretation

The correlation matrix shows that diabetes has a positive relationship with variables such as BMI, high blood pressure, and high cholesterol.

BMI also shows moderate relationships with physical health and diabetes status.

Overall, the relationships are not extremely strong, which suggests that multiple variables together may be needed for effective prediction models.

6. Classification

Determine whether an individual is diabetic.

This part aims to build a binary classification model using the cleaned diabetes health indicators dataset. The target variable, Diabetes_binary, indicates whether an individual has diabetes or not. The dataset includes several health, lifestyle, and demographic factors.

The analysis is conducted through a structured machine learning workflow. First, the dataset is explored to understand its structure, feature characteristics, and target class distribution. Next, data cleaning and transformation are carried out to check data quality and transform format. Feature encoding is then applied to prepare the data for model training. The dataset is split into training and testing sets using stratified sampling, and class imbalance is handled on the training set to reduce bias toward the majority class. After that, several classification models are trained and evaluated using multiple performance metrics, including accuracy, precision, recall, F1-score, ROC-AUC. Finally, SHAP analysis is used to explain the best-performing model and identify the key predictors of diabetes risk.

6.1 Data Understanding

The first step of this study is to understand the structure and characteristics of the diabetes dataset before conducting model development.

# Load required library
library(readr)
library(dplyr)

# Load dataset
diabetes_data <- read_csv("diabetes_cleaned.csv")

# View the first few rows
head(diabetes_data)
## # A tibble: 6 × 22
##   Diabetes_binary HighBP    HighChol    CholCheck   BMI Smoker     Stroke   
##   <chr>           <chr>     <chr>       <chr>     <dbl> <chr>      <chr>    
## 1 No_Diabetes     HighBP    HighChol    Checked      40 Smoker     No_Stroke
## 2 No_Diabetes     No_HighBP No_HighChol No_Check     25 Smoker     No_Stroke
## 3 No_Diabetes     HighBP    HighChol    Checked      28 Non_Smoker No_Stroke
## 4 No_Diabetes     HighBP    No_HighChol Checked      27 Non_Smoker No_Stroke
## 5 No_Diabetes     HighBP    HighChol    Checked      24 Non_Smoker No_Stroke
## 6 No_Diabetes     HighBP    HighChol    Checked      25 Smoker     No_Stroke
## # ℹ 15 more variables: HeartDiseaseorAttack <chr>, PhysActivity <chr>,
## #   Fruits <chr>, Veggies <chr>, HvyAlcoholConsump <chr>, AnyHealthcare <chr>,
## #   NoDocbcCost <chr>, GenHlth <chr>, MentHlth <dbl>, PhysHlth <dbl>,
## #   DiffWalk <chr>, Sex <chr>, Age <chr>, Education <chr>, Income <chr>
# Check dataset dimension
dim(diabetes_data)
## [1] 253680     22
# Summary statistics
summary(diabetes_data)
##  Diabetes_binary       HighBP            HighChol          CholCheck        
##  Length:253680      Length:253680      Length:253680      Length:253680     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##       BMI           Smoker             Stroke          HeartDiseaseorAttack
##  Min.   :12.00   Length:253680      Length:253680      Length:253680       
##  1st Qu.:24.00   Class :character   Class :character   Class :character    
##  Median :27.00   Mode  :character   Mode  :character   Mode  :character    
##  Mean   :28.38                                                             
##  3rd Qu.:31.00                                                             
##  Max.   :98.00                                                             
##  PhysActivity          Fruits            Veggies          HvyAlcoholConsump 
##  Length:253680      Length:253680      Length:253680      Length:253680     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  AnyHealthcare      NoDocbcCost          GenHlth             MentHlth     
##  Length:253680      Length:253680      Length:253680      Min.   : 0.000  
##  Class :character   Class :character   Class :character   1st Qu.: 0.000  
##  Mode  :character   Mode  :character   Mode  :character   Median : 0.000  
##                                                           Mean   : 3.185  
##                                                           3rd Qu.: 2.000  
##                                                           Max.   :30.000  
##     PhysHlth        DiffWalk             Sex                Age           
##  Min.   : 0.000   Length:253680      Length:253680      Length:253680     
##  1st Qu.: 0.000   Class :character   Class :character   Class :character  
##  Median : 0.000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 4.242                                                           
##  3rd Qu.: 3.000                                                           
##  Max.   :30.000                                                           
##   Education            Income         
##  Length:253680      Length:253680     
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 

The dataset contains 253,680 records and 22 variables, including one target variable and 21 predictor variables. The target variable is Diabetes_binary, which indicates whether an individual has diabetes or not. Therefore, this study is formulated as a binary classification task.

# Check target variable distribution
table(diabetes_data$Diabetes_binary)
## 
##    Diabetes No_Diabetes 
##       35346      218334
# Check target variable percentage
prop.table(table(diabetes_data$Diabetes_binary)) * 100
## 
##    Diabetes No_Diabetes 
##     13.9333     86.0667

The target variable distribution shows that the dataset is imbalanced, with 86.07% of records belonging to the no-diabetes class and 13.93% belonging to the diabetes class.

# Check data types of each variable
sapply(diabetes_data, class)
##      Diabetes_binary               HighBP             HighChol 
##          "character"          "character"          "character" 
##            CholCheck                  BMI               Smoker 
##          "character"            "numeric"          "character" 
##               Stroke HeartDiseaseorAttack         PhysActivity 
##          "character"          "character"          "character" 
##               Fruits              Veggies    HvyAlcoholConsump 
##          "character"          "character"          "character" 
##        AnyHealthcare          NoDocbcCost              GenHlth 
##          "character"          "character"          "character" 
##             MentHlth             PhysHlth             DiffWalk 
##            "numeric"            "numeric"          "character" 
##                  Sex                  Age            Education 
##          "character"          "character"          "character" 
##               Income 
##          "character"
Type Fields
Target variable Diabetes_binary
Numerical variables BMI, MentHlth, PhysHlth
Binary categorical variables HighBP, HighChol, Smoker, Stroke, Sex, etc.
Ordinal categorical variables GenHlth, Age, Education, Income

Based on the initial data understanding, the dataset is appropriate for binary classification modelling. However, the imbalance in the target variable indicates that model evaluation should not rely only on accuracy. Metrics such as recall, F1-score and ROC-AUC should be used in later stages.

6.2 Data Transformation

6.2.1 Transform target variable

# Transform target variable
diabetes_data$Diabetes_binary <- ifelse( diabetes_data$Diabetes_binary == "Diabetes", 1, 0 )

# Check the results
table(diabetes_data$Diabetes_binary)
## 
##      0      1 
## 218334  35346

The target variable Diabetes_binary was transformed into binary numerical format, where No_Diabetes was encoded as 0 and Diabetes was encoded as 1.

6.2.2 Transform binary categorical variables

diabetes_data$HighBP <- ifelse(diabetes_data$HighBP == "HighBP", 1, 0)
diabetes_data$HighChol <- ifelse(diabetes_data$HighChol == "HighChol", 1, 0)
diabetes_data$CholCheck <- ifelse(diabetes_data$CholCheck == "Checked", 1, 0)
diabetes_data$Smoker <- ifelse(diabetes_data$Smoker == "Smoker", 1, 0)
diabetes_data$Stroke <- ifelse(diabetes_data$Stroke == "Stroke", 1, 0)

diabetes_data$HeartDiseaseorAttack <- ifelse(
  diabetes_data$HeartDiseaseorAttack == "HeartDisease", 1, 0 )

diabetes_data$PhysActivity <- ifelse(diabetes_data$PhysActivity == "Active", 1, 0)
diabetes_data$Fruits <- ifelse(diabetes_data$Fruits == "Fruits", 1, 0)
diabetes_data$Veggies <- ifelse(diabetes_data$Veggies == "Veggies", 1, 0)

diabetes_data$HvyAlcoholConsump <- ifelse( diabetes_data$HvyAlcoholConsump == "HeavyDrink", 1, 0 )
diabetes_data$AnyHealthcare <- ifelse( diabetes_data$AnyHealthcare == "Has_Coverage", 1, 0 )
diabetes_data$NoDocbcCost <- ifelse( diabetes_data$NoDocbcCost == "Yes_CostBarrier", 1, 0 )
diabetes_data$DiffWalk <- ifelse( diabetes_data$DiffWalk == "Difficulty_Walking", 1, 0 )
diabetes_data$Sex <- ifelse( diabetes_data$Sex == "Male", 1, 0 )

Binary variables such as HighBP, HighChol, Smoker, Stroke, PhysActivity, DiffWalk, and Sex were prepared for conversion into 0/1 numerical format.

6.2.3 Transform ordinal variables

# GenHlth
diabetes_data$GenHlth <- factor(
  diabetes_data$GenHlth,
  levels = c("Excellent", "Very_Good", "Good", "Fair", "Poor"),
  ordered = TRUE
)

diabetes_data$GenHlth <- as.numeric(diabetes_data$GenHlth)

# Age
diabetes_data$Age <- factor(
  diabetes_data$Age,
  levels = c("18-24", "25-29", "30-34", "35-39", "40-44",
             "45-49", "50-54", "55-59", "60-64", "65-69",
             "70-74", "75-79", "80+"),
  ordered = TRUE
)

diabetes_data$Age <- as.numeric(diabetes_data$Age)

# Education
diabetes_data$Education <- factor(
  diabetes_data$Education,
  levels = c("Never_Attended", "Elementary", "Some_HighSchool",
             "HighSchool_Graduate", "Some_College", "College_Graduate"),
  ordered = TRUE
)

diabetes_data$Education <- as.numeric(diabetes_data$Education)

# Income
diabetes_data$Income <- factor(
  diabetes_data$Income,
  levels = c("<10k", "10k-15k", "15k-20k", "20k-25k",
             "25k-35k", "35k-50k", "50k-75k", "75k+"),
  ordered = TRUE
)

diabetes_data$Income <- as.numeric(diabetes_data$Income)

Ordinal variables such as GenHlth, Age, Education, and Income were transformed according to their natural order.

6.2.4 Check transformed dataset

# Check whether all variables are numeric
sapply(diabetes_data, class)
##      Diabetes_binary               HighBP             HighChol 
##            "numeric"            "numeric"            "numeric" 
##            CholCheck                  BMI               Smoker 
##            "numeric"            "numeric"            "numeric" 
##               Stroke HeartDiseaseorAttack         PhysActivity 
##            "numeric"            "numeric"            "numeric" 
##               Fruits              Veggies    HvyAlcoholConsump 
##            "numeric"            "numeric"            "numeric" 
##        AnyHealthcare          NoDocbcCost              GenHlth 
##            "numeric"            "numeric"            "numeric" 
##             MentHlth             PhysHlth             DiffWalk 
##            "numeric"            "numeric"            "numeric" 
##                  Sex                  Age            Education 
##            "numeric"            "numeric"            "numeric" 
##               Income 
##            "numeric"
# View first few rows
head(diabetes_data)
## # A tibble: 6 × 22
##   Diabetes_binary HighBP HighChol CholCheck   BMI Smoker Stroke
##             <dbl>  <dbl>    <dbl>     <dbl> <dbl>  <dbl>  <dbl>
## 1               0      1        1         1    40      1      0
## 2               0      0        0         0    25      1      0
## 3               0      1        1         1    28      0      0
## 4               0      1        0         1    27      0      0
## 5               0      1        1         1    24      0      0
## 6               0      1        1         1    25      1      0
## # ℹ 15 more variables: HeartDiseaseorAttack <dbl>, PhysActivity <dbl>,
## #   Fruits <dbl>, Veggies <dbl>, HvyAlcoholConsump <dbl>, AnyHealthcare <dbl>,
## #   NoDocbcCost <dbl>, GenHlth <dbl>, MentHlth <dbl>, PhysHlth <dbl>,
## #   DiffWalk <dbl>, Sex <dbl>, Age <dbl>, Education <dbl>, Income <dbl>
# Save the dataset
write.csv(diabetes_data, "diabetes_transformed.csv", row.names = FALSE)

After transformation, all variables were converted into numerical format and became suitable for machine learning algorithms. Binary categorical variables were encoded as 0 and 1, while ordinal variables were encoded according to their natural order. Numerical variables such as BMI, MentHlth, and PhysHlth were kept as numerical features. The transformed dataset was then used for dataset splitting and subsequent model development.

6.3 Train-Validation-Test Split

The transformed dataset was split into three subsets: training set, validation set, and testing set. The training set was used to train the classification models, the validation set was used for model tuning and model selection, and the testing set was used for final model evaluation. In this study, 60% of the data was assigned to the training set, 20% to the validation set, and 20% to the testing set. Stratified sampling was applied during the splitting process to preserve the original distribution of the target variable in each subset.

library(caret)
library(readr)

# Load transformed dataset
diabetes_transformed <- read_csv("diabetes_transformed.csv")

set.seed(42)
# First split: 60% training, 40% temporary data
train_index <- createDataPartition(
  diabetes_transformed$Diabetes_binary,
  p = 0.6,
  list = FALSE
)

train_data <- diabetes_transformed[train_index, ]
temp_data <- diabetes_transformed[-train_index, ]

# Second split: split temporary data into 20% validation and 20% testing
valid_index <- createDataPartition(
  temp_data$Diabetes_binary,
  p = 0.5,
  list = FALSE
)

valid_data <- temp_data[valid_index, ]
test_data <- temp_data[-valid_index, ]

# Summary of dataset dimensions and class distribution
summary_table <- data.frame(
  Dataset = c("Training Set", "Validation Set", "Test Set"),
  Rows = c(nrow(train_data), nrow(valid_data), nrow(test_data)),
  Columns = c(ncol(train_data), ncol(valid_data), ncol(test_data)),
  Class_0_Percentage = c(
    prop.table(table(train_data$Diabetes_binary))[1] * 100,
    prop.table(table(valid_data$Diabetes_binary))[1] * 100,
    prop.table(table(test_data$Diabetes_binary))[1] * 100
  ),
  Class_1_Percentage = c(
    prop.table(table(train_data$Diabetes_binary))[2] * 100,
    prop.table(table(valid_data$Diabetes_binary))[2] * 100,
    prop.table(table(test_data$Diabetes_binary))[2] * 100
  )
)

summary_table
##          Dataset   Rows Columns Class_0_Percentage Class_1_Percentage
## 1   Training Set 152208      22           86.04147           13.95853
## 2 Validation Set  50736      22           86.17944           13.82056
## 3       Test Set  50736      22           86.02964           13.97036

After splitting, the training, validation, and testing sets maintained a similar target class distribution to the original dataset. This ensured that model training, model selection, and final evaluation were performed on representative subsets of the data.

6.4 Class Imbalance Handling

The target variable in this dataset is imbalanced, with the majority of records belonging to the No_Diabetes class and a much smaller proportion belonging to the Diabetes class. To address this issue, class imbalance handling was applied only to the training set. The validation set and testing set were kept unchanged to preserve the original real-world class distribution. This prevents data leakage and ensures that model evaluation remains reliable and realistic.

6.4.1 Check class distribution before balancing

# Check class distribution in training set
table(train_data$Diabetes_binary)
## 
##      0      1 
## 130962  21246

The current training set maintains the imbalanced structure of the original data.

6.4.2 Random Undersampling

library(dplyr)

set.seed(42)

# Separate majority and minority classes
majority_class <- train_data %>% filter(Diabetes_binary == 0)
minority_class <- train_data %>% filter(Diabetes_binary == 1)

ratio <- 2
# Randomly undersample majority class to 2 times the minority class
majority_downsampled <- majority_class %>% sample_n(ratio * nrow(minority_class))
train_balanced <- bind_rows(minority_class, majority_downsampled)
train_balanced <- train_balanced %>% sample_frac(1)

# Check class distribution after balancing
table(train_balanced$Diabetes_binary)
## 
##     0     1 
## 42492 21246
prop.table(table(train_balanced$Diabetes_binary)) * 100
## 
##        0        1 
## 66.66667 33.33333

Only the training set was balanced. Instead of creating a fully balanced 1:1 class ratio, the majority class was downsampled to approximately twice the size of the minority class. This produced a 2:1 ratio between non-diabetic and diabetic cases, reducing class imbalance while avoiding excessive information loss from the majority class.

6.5 Classification Modeling & Evaluation

Three classification models were trained and compared in this study: Decision Tree, Random Forest, and XGBoost. Decision Tree was used as the baseline model because it is simple, interpretable, and does not require feature scaling. Random Forest and XGBoost were selected as advanced ensemble models because they are suitable for structured tabular data and can handle nonlinear relationships between predictors and the target variable.

The models were first trained using the balanced training set. The validation set was then used for hyperparameter tuning and model selection. The testing set was not used during training or tuning, and was only used for final model evaluation. This design helps prevent data leakage and provides a more reliable estimate of model performance on unseen data.

After comparing the models on the validation set, the best-performing model was selected and evaluated on the testing set. Finally, feature importance analysis was conducted on the best model to identify the most influential predictors of diabetes classification.

6.5.1 Load required packages

library(rpart)
library(rpart.plot)
library(ranger)
library(xgboost)
library(caret)
library(pROC)
library(dplyr)

6.5.2 Prepare target variables

Decision Tree and Random Forest require a categorical target variable, which is then converted into a factor. XGBoost, however, requires numeric 0/1 labels and does not perform this process.

# For Decision Tree and Random Forest
train_balanced$Diabetes_factor <- as.factor(train_balanced$Diabetes_binary)
valid_data$Diabetes_factor <- as.factor(valid_data$Diabetes_binary)
test_data$Diabetes_factor <- as.factor(test_data$Diabetes_binary)

# For XGBoost
train_label <- train_balanced$Diabetes_binary
valid_label <- valid_data$Diabetes_binary
test_label  <- test_data$Diabetes_binary

Define feature columns

# Remove target-related columns
feature_cols <- setdiff(names(train_balanced), c("Diabetes_binary", "Diabetes_factor"))

x_train <- train_balanced[, feature_cols]
x_valid <- valid_data[, feature_cols]
x_test  <- test_data[, feature_cols]

6.5.3 Decision Tree

Decision Tree was selected as the baseline model because it is simple, interpretable, and does not require feature scaling.

First, use a simple parameter tuning version to compare different cp values.

dt_grid <- expand.grid(
  cp = c(0.001, 0.005, 0.01, 0.02)
)

dt_results <- data.frame()

for (i in 1:nrow(dt_grid)) {
  cp_value <- dt_grid$cp[i]
  dt_model <- rpart(
    Diabetes_factor ~ .,
    data = train_balanced[, c(feature_cols, "Diabetes_factor")],
    method = "class",
    control = rpart.control(cp = cp_value)
  )
  
  # Predict probability on validation set
  dt_prob <- predict(dt_model, valid_data[, feature_cols], type = "prob")[, "1"]
  
  # Convert probability to class
  dt_pred <- ifelse(dt_prob >= 0.5, 1, 0)
  
  # Calculate validation AUC
  dt_auc <- auc(valid_data$Diabetes_binary, dt_prob)
  
  # Store result
  dt_results <- rbind(
    dt_results,
    data.frame(
      cp = cp_value,
      Validation_AUC = as.numeric(dt_auc)))
}

dt_results
##      cp Validation_AUC
## 1 0.001      0.7971714
## 2 0.005      0.7403289
## 3 0.010      0.7385117
## 4 0.020      0.7235862

Choose the validation set with the best AUC

best_dt_cp <- dt_results$cp[which.max(dt_results$Validation_AUC)]
best_dt_cp
## [1] 0.001

Retrain the optimal Decision Tree

best_dt_model <- rpart(
  Diabetes_factor ~ .,
  data = train_balanced[, c(feature_cols, "Diabetes_factor")],
  method = "class",
  control = rpart.control(cp = best_dt_cp)
)

The final Decision Tree model was trained using cp = 0.001, which achieved the best validation performance during hyperparameter tuning. This setting allowed the tree to capture more detailed decision patterns while still applying pruning to reduce overfitting.

6.5.4 Random Forest

Random Forest was implemented using the ranger package in R to improve training speed and computational efficiency. As an ensemble learning method, Random Forest builds multiple decision trees from bootstrap samples and combines their predictions through majority voting.

A simple parameter tuning process was applied to compare different mtry values. The mtry parameter controls the number of variables randomly selected at each split in the Random Forest model.

rf_grid <- expand.grid(
  mtry = c(3, 5, 7, 9)
)

rf_results <- vector("list", nrow(rf_grid))
train_rf <- train_balanced[, c(feature_cols, "Diabetes_factor")]
valid_x <- valid_data[, feature_cols]

for (i in seq_len(nrow(rf_grid))) {
  mtry_value <- rf_grid$mtry[i]
  set.seed(42)
  
  rf_model <- ranger(
    Diabetes_factor ~ .,
    data = train_rf,
    num.trees = 300,
    mtry = mtry_value,
    probability = TRUE,
    importance = "impurity",
    num.threads = parallel::detectCores() - 1
  )
  
  # Predict probability on validation set
  rf_prob <- predict(rf_model, data = valid_x)$predictions[, "1"]
  
  # Calculate validation AUC
  rf_auc <- auc(valid_data$Diabetes_binary, rf_prob)
  
  rf_results[[i]] <- data.frame( mtry = mtry_value, Validation_AUC = as.numeric(rf_auc))
}

rf_results <- bind_rows(rf_results)
rf_results
##   mtry Validation_AUC
## 1    3      0.8300043
## 2    5      0.8253878
## 3    7      0.8222375
## 4    9      0.8192751

Choose the best mtry

best_rf_mtry <- rf_results$mtry[which.max(rf_results$Validation_AUC)]
best_rf_mtry
## [1] 3

Retrain the optimal Random Forest

set.seed(42)
best_rf_model <- ranger(
  Diabetes_factor ~ .,
  data = train_rf,
  num.trees = 300,
  mtry = best_rf_mtry,
  probability = TRUE,
  importance = "impurity",
  num.threads = parallel::detectCores() - 1
)

The final Random Forest model was trained using mtry = 3, which produced the best validation performance during tuning. This means that three randomly selected features were considered at each split in the trees.

6.5.5 XGBoost

XGBoost was implemented using the xgboost package in R as an advanced boosting-based classification model. Unlike Random Forest, which builds trees independently, XGBoost builds trees sequentially, with each tree correcting errors from previous trees. In this study, XGBoost was trained on the balanced training set using the binary:logistic objective for diabetes binary classification. Key hyperparameters, including max_depth, eta, subsample, and colsample_bytree, were tuned using the validation set before comparing its performance with Decision Tree and Random Forest.

Prepare XGBoost matrix

dtrain <- xgb.DMatrix( data = as.matrix(x_train), label = train_label )
dvalid <- xgb.DMatrix( data = as.matrix(x_valid), label = valid_label )
dtest <- xgb.DMatrix( data = as.matrix(x_test), label = test_label )

Train XGBoost with validation tuning

xgb_grid <- expand.grid(
  max_depth = c(3, 5, 7),
  eta = c(0.05, 0.1),
  subsample = c(0.8, 1.0),
  colsample_bytree = c(0.8, 1.0)
)

xgb_results <- data.frame()
for (i in 1:nrow(xgb_grid)) {
  params <- list(
    objective = "binary:logistic",
    eval_metric = "auc",
    max_depth = xgb_grid$max_depth[i],
    eta = xgb_grid$eta[i],
    subsample = xgb_grid$subsample[i],
    colsample_bytree = xgb_grid$colsample_bytree[i]
  )
  
  set.seed(42)
  xgb_model <- xgb.train(
    params = params,
    data = dtrain,
    nrounds = 300,
    watchlist = list(valid = dvalid),
    early_stopping_rounds = 20,
    verbose = 0
  )
  
  xgb_prob <- predict(xgb_model, dvalid)
  xgb_auc <- auc(valid_label, xgb_prob)
  
  if (!is.null(xgb_model$best_iteration)) {
  best_iter <- xgb_model$best_iteration
  } else {
  best_iter <- 300
  }
  
  xgb_results <- rbind(
    xgb_results,
    data.frame(
      max_depth = xgb_grid$max_depth[i],
      eta = xgb_grid$eta[i],
      subsample = xgb_grid$subsample[i],
      colsample_bytree = xgb_grid$colsample_bytree[i],
      best_iteration = best_iter,
      Validation_AUC = as.numeric(xgb_auc) ) )
}

head(xgb_results)
##   max_depth  eta subsample colsample_bytree best_iteration Validation_AUC
## 1         3 0.05       0.8              0.8            300      0.8341898
## 2         5 0.05       0.8              0.8            300      0.8342961
## 3         7 0.05       0.8              0.8            300      0.8337384
## 4         3 0.10       0.8              0.8            300      0.8343263
## 5         5 0.10       0.8              0.8            300      0.8338556
## 6         7 0.10       0.8              0.8            300      0.8333974

Choose the parameter that best optimizes the validation set AUC

best_xgb <- xgb_results[which.max(xgb_results$Validation_AUC), ]
best_xgb
##   max_depth eta subsample colsample_bytree best_iteration Validation_AUC
## 4         3 0.1       0.8              0.8            300      0.8343263

Retrain the optimal XGBoost

best_xgb_params <- list(
  objective = "binary:logistic",
  eval_metric = "auc",
  max_depth = best_xgb$max_depth,
  eta = best_xgb$eta,
  subsample = best_xgb$subsample,
  colsample_bytree = best_xgb$colsample_bytree
)

set.seed(42)
best_xgb_model <- xgb.train(
  params = best_xgb_params,
  data = dtrain,
  nrounds = best_xgb$best_iteration,
  verbose = 0
)

The final XGBoost model was trained using the best hyperparameter combination identified from the validation set. The selected parameters were max_depth = 3, eta = 0.1, subsample = 0.8, colsample_bytree = 0.8, and best_iteration = 300. This configuration achieved a validation AUC of 0.8343, which was the highest among the tested XGBoost parameter combinations. Therefore, this parameter setting was used as the final XGBoost model for comparison with Decision Tree and Random Forest.

6.5.6 Compare Models on Validation Set

First, compare the three models using the validation set

# Decision Tree validation prediction
dt_valid_prob <- predict(best_dt_model, valid_data[, feature_cols], type = "prob")[, "1"]
dt_valid_pred <- ifelse(dt_valid_prob >= 0.5, 1, 0)

# Random Forest validation prediction
rf_valid_prob <- predict(best_rf_model, data = valid_data[, feature_cols])$predictions[, "1"]
rf_valid_pred <- ifelse(rf_valid_prob >= 0.5, 1, 0)

# XGBoost validation prediction
xgb_valid_prob <- predict(best_xgb_model, dvalid)
xgb_valid_pred <- ifelse(xgb_valid_prob >= 0.5, 1, 0)
evaluate_model <- function(actual, predicted, prob) {
  cm <- confusionMatrix( as.factor(predicted), as.factor(actual), positive = "1")
  auc_value <- auc(actual, prob)
  
  result <- data.frame(
    Accuracy = cm$overall["Accuracy"],
    Precision = cm$byClass["Precision"],
    Recall = cm$byClass["Recall"],
    F1 = cm$byClass["F1"],
    ROC_AUC = as.numeric(auc_value)
  )
  
  return(result)
}

Compare results and visualization

library(ggplot2)
library(tidyr)
library(tibble)
library(pROC)
library(patchwork)

dt_valid_eval <- evaluate_model(valid_data$Diabetes_binary, dt_valid_pred, dt_valid_prob)
rf_valid_eval <- evaluate_model(valid_data$Diabetes_binary, rf_valid_pred, rf_valid_prob)
xgb_valid_eval <- evaluate_model(valid_data$Diabetes_binary, xgb_valid_pred, xgb_valid_prob)

validation_results <- rbind(
  Decision_Tree = dt_valid_eval,
  Random_Forest = rf_valid_eval,
  XGBoost = xgb_valid_eval
)

validation_results
##                Accuracy Precision    Recall        F1   ROC_AUC
## Decision_Tree 0.8028816 0.3666220 0.5858528 0.4510073 0.7971714
## Random_Forest 0.8213300 0.3978302 0.5700228 0.4686089 0.8300043
## XGBoost       0.8166588 0.3934487 0.6029663 0.4761797 0.8341929
# transform format
metrics_df <- as.data.frame(validation_results) %>%
  rownames_to_column(var = "Model") %>%
  pivot_longer(cols = -Model, names_to = "Metric", values_to = "Score") %>%
  filter(Metric != "Precision")

# Plot bar chart
bar_plot <- ggplot(metrics_df, aes(x = Metric, y = Score, fill = Metric)) +
  geom_bar(stat = "identity", position = "dodge") + 
  geom_text(aes(label = round(Score, 3)), vjust = -0.5, size = 3.5) + 
  facet_wrap(~ Model, nrow = 1) + 
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) + 
  theme_minimal() + 
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 11),
    strip.text = element_text(size = 13, face = "bold"),
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
    legend.position = "none",
    plot.margin = ggplot2::margin(t = 10, r = 10, b = 30, l = 10, unit = "pt") 
  ) +
  labs(title = "Validation Results: Metrics Comparison",
       x = NULL,
       y = "Score")

# Plot ROC curve
roc_dt  <- roc(valid_data$Diabetes_binary, dt_valid_prob)
roc_rf  <- roc(valid_data$Diabetes_binary, rf_valid_prob)
roc_xgb <- roc(valid_data$Diabetes_binary, xgb_valid_prob)

auc_dt  <- round(auc(roc_dt), 3)
auc_rf  <- round(auc(roc_rf), 3)
auc_xgb <- round(auc(roc_xgb), 3)

plot_roc <- function(roc_obj, auc_val, model_name, line_color) {
  ggroc(roc_obj, color = line_color, size = 1) +
    geom_abline(intercept = 1, slope = 1, linetype = "dashed", color = "darkgray") +
    theme_minimal() + 
    theme(
      plot.title = element_text(hjust = 0.5, face = "bold", size = 13),
      plot.margin = ggplot2::margin(t = 10, r = 10, b = 10, l = 10, unit = "pt")
    ) +
    labs(title = paste(model_name, "\nAUC =", auc_val), 
         x = "Specificity", y = "Sensitivity")
}

# Plot
p_roc_dt  <- plot_roc(roc_dt, auc_dt, "Decision Tree", "#F8766D") 
p_roc_rf  <- plot_roc(roc_rf, auc_rf, "Random Forest", "#00BA38")
p_roc_xgb <- plot_roc(roc_xgb, auc_xgb, "XGBoost", "#619CFF")


roc_combined <- p_roc_dt | p_roc_rf | p_roc_xgb
final_combined_plot <- bar_plot / roc_combined + 
  plot_annotation(
    title = 'Model Evaluation Summary',
    theme = theme(plot.title = element_text(hjust = 0.5, size = 18, face = "bold"))
  )

# show plot
print(final_combined_plot)

The figure summarizes the validation performance of the three models using metric comparison and ROC curve analysis. The upper section compares the key classification metrics, while the lower section presents the ROC curves for each model. Overall, Random Forest achieved the highest accuracy at 0.821, but XGBoost performed better on the more important indicators for this imbalanced classification task. Specifically, XGBoost achieved the highest recall of 0.603, the highest F1-score of 0.476, and the highest ROC-AUC of 0.834.

The validation results show that XGBoost achieved the best overall performance among the three models. Although Random Forest obtained the highest accuracy, XGBoost achieved the highest recall, F1-score, and ROC-AUC. Since the dataset is imbalanced and the main objective is to identify diabetic cases, recall and ROC-AUC were considered more important than accuracy alone. Therefore, XGBoost was selected as the best-performing model for final testing and further SHAP explainability analysis.

6.5.7 Final Evaluation on Test Set

# XGBoost test prediction
xgb_test_prob <- predict(best_xgb_model, dtest)
xgb_test_pred <- ifelse(xgb_test_prob >= 0.5, 1, 0)
xgb_test_eval <- evaluate_model(test_data$Diabetes_binary, xgb_test_pred, xgb_test_prob)
xgb_test_eval
##           Accuracy Precision    Recall        F1  ROC_AUC
## Accuracy 0.8144119 0.3915999 0.5932562 0.4717828 0.829068

This data shows that XGBoost remains robust on the test set and has good discriminative performance, which will be visualized and explained further below.

library(ggplot2)
library(dplyr)
library(tidyr)
library(gridExtra)
library(scales)
library(caret)
library(pROC)


xgb_metric_df <- as.data.frame(xgb_test_eval)
xgb_metric_long <- xgb_metric_df %>%
  pivot_longer(
    cols = everything(),
    names_to = "Metric",
    values_to = "Value"
  )

xgb_metric_long <- xgb_metric_long %>%
  filter(Metric %in% c("Accuracy", "Precision", "Recall", "F1", "ROC_AUC"))

xgb_metric_long$Metric <- factor(
  xgb_metric_long$Metric,
  levels = c("Accuracy", "Precision", "Recall", "F1", "ROC_AUC")
)

# Bar chart
xgb_bar_plot <- ggplot(
  xgb_metric_long,
  aes(x = Metric, y = Value, fill = Metric)
  ) +
  geom_col( width = 0.65, show.legend = FALSE, color = "white", linewidth = 0.8
  ) +
  geom_text(
    aes(label = round(Value, 3)),
    vjust = -0.6,
    size = 4.8,
    fontface = "bold",
    color = "#222222"
  ) +
  scale_y_continuous(
    limits = c(0, 1),
    labels = percent_format(accuracy = 1),
    expand = expansion(mult = c(0, 0.08))
  ) +
  scale_fill_manual(
    values = c(
      "Accuracy"  = "#2F80ED",
      "Precision" = "#F2994A",
      "Recall"    = "#27AE60",
      "F1"        = "#EB5757",
      "ROC_AUC"   = "#56CCF2"
    )
  ) +
  labs(
    title = "XGBoost Test Set Performance",
    x = NULL,
    y = "Score"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text( face = "bold", size = 18, hjust = 0.5, color = "#222222" ),
    plot.subtitle = element_text( size = 12, hjust = 0.5, color = "gray40" ),
    axis.text.x = element_text( face = "bold", size = 12, color = "#222222" ),
    axis.text.y = element_text( size = 11, color = "#222222" ),
    axis.title.y = element_text( face = "bold", size = 12 ),
    panel.grid.major.x = element_blank(),
    panel.grid.minor = element_blank(),
    plot.background = element_rect(fill = "white", color = NA),
    panel.background = element_rect(fill = "white", color = NA)
  )

# Confusion matrix data
cm <- table(
  Actual = test_data$Diabetes_binary,
  Predicted = xgb_test_pred
)
cm_df <- as.data.frame(cm)
cm_df <- cm_df %>% mutate( Label = Freq )

cm_df$Actual <- factor(
  cm_df$Actual,
  levels = c(0, 1),
  labels = c("No Diabetes", "Diabetes")
)

cm_df$Predicted <- factor(
  cm_df$Predicted,
  levels = c(0, 1),
  labels = c("No Diabetes", "Diabetes")
)
# Plot
xgb_cm_plot <- ggplot( cm_df, aes(x = Predicted, y = Actual, fill = Freq)
) +
  geom_tile( color = "white", linewidth = 1.2
  ) +
  geom_text( aes(label = Label), size = 6, fontface = "bold", color = "white"
  ) +
  scale_fill_gradient( low = "#BDEBFF", high = "#0077B6"
  ) +
  labs(
    title = "XGBoost Confusion Matrix",
    x = "Predicted Class",
    y = "Actual Class",
    fill = "Count"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text( face = "bold", size = 18, hjust = 0.5, color = "#222222" ),
    plot.subtitle = element_text( size = 12, hjust = 0.5, color = "gray40" ),
    axis.text.x = element_text( face = "bold", size = 12, color = "#222222" ),
    axis.text.y = element_text( face = "bold", size = 12, color = "#222222" ),
    axis.title.x = element_text( face = "bold", size = 12 ),
    axis.title.y = element_text( face = "bold", size = 12 ),
    panel.grid = element_blank(),
    legend.position = "right",
    plot.background = element_rect(fill = "white", color = NA),
    panel.background = element_rect(fill = "white", color = NA)
  )

combined_xgb_test_plot <- grid.arrange( xgb_bar_plot, xgb_cm_plot, ncol = 2 )

The final XGBoost model achieved an accuracy of 0.814, precision of 0.392, recall of 0.593, F1-score of 0.472, and ROC-AUC of 0.829 on the test set. These results are close to the validation performance, indicating that the model generalizes reasonably well to unseen data.

The confusion matrix shows that the XGBoost model achieved a reasonable level of classification performance on the test set, but there is still room for improvement. The model correctly identified 37,115 non-diabetic cases and 4,205 diabetic cases. However, 2,883 diabetic cases were misclassified as non-diabetic, indicating that the model still missed some positive cases. In addition, 6,533 non-diabetic cases were incorrectly classified as diabetic, suggesting that false positives were also present.

Overall, the model was able to capture a portion of diabetes cases, but its classification performance was not perfect. Therefore, we should consider evaluation metrics such as recall, precision, F1 score, and ROC-AUC in a comprehensive manner.

6.5.8 Feature Importance Ranking

library(ggplot2)
library(dplyr)
library(scales)

xgb_importance <- xgb.importance( feature_names = feature_cols, model = best_xgb_model )

# Prepare importance data
importance_df <- xgb_importance %>%
  as.data.frame() %>%
  arrange(desc(Gain)) %>%
  slice(1:15)

importance_df$Feature <- factor(
  importance_df$Feature,
  levels = rev(importance_df$Feature)
)


# Plot bar chart
xgb_imp_plot <- ggplot(importance_df, aes(x = Feature, y = Gain, fill = Gain)) +
  geom_col(width = 0.7, color = "white", linewidth = 0.8) +
  geom_text(
    aes(label = round(Gain, 3)),
    hjust = -0.15,
    size = 4.3,
    fontface = "bold",
    color = "#222222"
  ) +
  coord_flip() +
  scale_y_continuous( labels = number_format(accuracy = 0.001),
    expand = expansion(mult = c(0, 0.12))
  ) +
  scale_fill_gradient( low = "#8FD3FF", high = "#0077B6"
  ) +
  labs(
    title = "XGBoost Feature Importance",
    x = NULL,
    y = "Feature Importance Score"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text( face = "bold", size = 18, hjust = 0.5, color = "#222222" ),
    plot.subtitle = element_text( size = 12, hjust = 0.5, color = "gray40" ),
    axis.text.x = element_text( size = 11, color = "#222222" ),
    axis.text.y = element_text( face = "bold", size = 12, color = "#222222" ),
    axis.title.x = element_text( face = "bold", size = 12 ),
    panel.grid.major.y = element_blank(),
    panel.grid.minor = element_blank(),
    legend.position = "none",
    plot.background = element_rect(fill = "white", color = NA),
    panel.background = element_rect(fill = "white", color = NA)
  )

print(xgb_imp_plot)

The XGBoost feature importance plot shows that GenHlth was the most influential predictor, with the highest importance score of 0.301, followed by HighBP with a score of 0.230. This indicates that general health status and high blood pressure played the most important roles in the model’s diabetes prediction. BMI and Age were also important predictors, with importance scores of 0.132 and 0.113, respectively. Other variables such as HighChol, DiffWalk, Income, and HeartDiseaseorAttack contributed less to the model, while PhysActivity had the lowest importance score. Overall, the model mainly relied on general health condition, blood pressure status, BMI, and age when predicting diabetes risk.

6.6 Classification Models Interpretation

In this classification modelling section, three models were developed and compared for diabetes prediction: Decision Tree, Random Forest, and XGBoost. The Decision Tree model was used as the baseline classifier, while Random Forest and XGBoost were applied as ensemble models to improve predictive performance.

The models were trained using the balanced training dataset and tuned using the validation set. Based on the validation results, XGBoost achieved the best overall performance. XGBoost achieved an accuracy of 0.814, precision of 0.392, recall of 0.593, F1-score of 0.472, and ROC-AUC of 0.829 in the test set. These results suggest that the model maintained reasonably consistent performance on unseen data and demonstrated good overall discrimination ability. However, the relatively low precision and moderate recall indicate that the model still has limitations in accurately identifying diabetic cases. Therefore, the XGBoost model can be considered useful for diabetes risk prediction, but its results should be interpreted cautiously, especially in practical healthcare related applications.

7. Regression Modeling of Health Score & Evaluation

In order to predict the health score of an individual, 3 regression models were trained and evaluated: Linear Regression, Random Forest and XGBoost. The Linear Regression model acts as a baseline for the comparisons of the 3 models.

7.1 Required Libraries

library(tidyverse)
library(caret)
library(xgboost)
library(ordinal)
library(Metrics)
library(ranger) #faster than randomForest package, lower mem usage, multi-threading
library(tidyr)
library(ggplot2)

7.2 Data Preparation Before Regression Modeling

df <- read.csv("diabetes_binary_health_indicators_BRFSS2015.csv")

#Convert Ordinal Variables to factors
df$Age <- as.factor(df$Age)
df$Income <- as.factor(df$Income)
df$Education <- as.factor(df$Education)

#Train-Test Split 80-20
set.seed(13)
train_index <- createDataPartition(
  df$GenHlth,
  p = 0.8,
  list = FALSE
)
train <- df[train_index, ]
test <- df[-train_index, ]

7.3 Linear Regression for Health Score Prediction

#Linear Regression
lm_model <- lm(
  GenHlth ~ . ,
  data = train
)

#Linear regression prediction
pred_lm <- predict(
  lm_model,
  newdata = test
)

#Evaluate metrics of the linear regression model
lm_rmse <- rmse(test$GenHlth, pred_lm)
lm_mae  <- mae(test$GenHlth, pred_lm)
lm_r2   <- cor(test$GenHlth, pred_lm)^2

7.4 Random Forest for Health Score Prediction

#Random Forest Regression using ranger for faster speed and lower mem usage
rf_model <- ranger(
  GenHlth ~ . ,
  data = train,
  num.trees = 500,
  seed = 123,
  importance = "permutation"
)
## Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 8 seconds.
## Growing trees.. Progress: 39%. Estimated remaining time: 1 minute, 35 seconds.
## Growing trees.. Progress: 59%. Estimated remaining time: 1 minute, 3 seconds.
## Growing trees.. Progress: 79%. Estimated remaining time: 33 seconds.
## Growing trees.. Progress: 99%. Estimated remaining time: 1 seconds.
## Computing permutation importance.. Progress: 13%. Estimated remaining time: 3 minutes, 31 seconds.
## Computing permutation importance.. Progress: 25%. Estimated remaining time: 3 minutes, 5 seconds.
## Computing permutation importance.. Progress: 38%. Estimated remaining time: 2 minutes, 30 seconds.
## Computing permutation importance.. Progress: 51%. Estimated remaining time: 1 minute, 59 seconds.
## Computing permutation importance.. Progress: 63%. Estimated remaining time: 1 minute, 31 seconds.
## Computing permutation importance.. Progress: 75%. Estimated remaining time: 1 minute, 3 seconds.
## Computing permutation importance.. Progress: 86%. Estimated remaining time: 37 seconds.
## Computing permutation importance.. Progress: 98%. Estimated remaining time: 5 seconds.
#Random Forest prediction
pred_rf <- predict(
  rf_model,
  #newdata = test
  data = test
)$predictions

#Evaluate Metrics of the Random Forest model
rf_rmse <- rmse(test$GenHlth, pred_rf)
rf_mae  <- mae(test$GenHlth, pred_rf)
rf_r2   <- cor(test$GenHlth, pred_rf)^2

7.5 XGBoost Modeling for Health Score

#XGBoost Regression
#One-Hot Encoding
dummy <- dummyVars(
  GenHlth ~ .,
  data = train
)
x_train <- predict(dummy, train)
x_test  <- predict(dummy, test)
y_train <- train$GenHlth
y_test  <- test$GenHlth
xgb_model <- xgboost(
  x = x_train,
  y = y_train,
  objective = "reg:squarederror",
  nrounds = 200,
  max_depth = 6,
  learning_rate = 0.1
)

#XGBoost Prediction
pred_xgb <- predict(
  xgb_model,
  x_test
)

#Evaluation Metrics of XGBoost model with test data
xgb_rmse <- rmse(y_test, pred_xgb)
xgb_mae  <- mae(y_test, pred_xgb)
xgb_r2   <- cor(y_test, pred_xgb)^2

7.6 Evaluation of Models Performance: RMSE, MAE & R2

#Compare the models
results <- data.frame(
  Model = c(
    "Linear Regression",
    "Random Forest",
    "XGBoost"
  ),
  RMSE = c(
    lm_rmse,
    rf_rmse,
    xgb_rmse
  ),
  MAE = c(
    lm_mae,
    rf_mae,
    xgb_mae
  ),
  R2 = c(
    lm_r2,
    rf_r2,
    xgb_r2
  )
)
print(results)
##               Model      RMSE       MAE        R2
## 1 Linear Regression 0.7895687 0.6311160 0.4520434
## 2     Random Forest 0.7782872 0.6252869 0.4677728
## 3           XGBoost 0.7739594 0.6212268 0.4735073
#Plotting graphs for models evaluation
results_long <- pivot_longer(
  results,
  cols = c(RMSE, MAE, R2),
  names_to = "Metric",
  values_to = "Value"
)

#1.Bar charts to compare model performance MAE, RMSE, R2
ggplot(results_long, 
  aes(
    x = Model,y = Value,fill = Model
    )
  ) +
  geom_col() + 
  geom_text(aes(label = round(Value, 3)),vjust = -0.5)+
  facet_wrap(~Metric, scales = "free_y") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(
      angle = 45,
      hjust = 1
    )
  ) + labs( title = "Models Performance Comparison")

Based on the bar charts comparing the MAE, RMSE and R2 values of the three models, XGBoost model has the best performance among the three models. Linear regression as the baseline model has the highest MAE (0.631) and RMSE(0.79), both Random Forest and XGBoost models perform better than Linear Regression, with the XGBoost model performing slightly better than the Random Forest.

In terms of MAE and RMSE, the smaller the values the better as it indicates less error in our model predictions, XGBoost has the smallest MAE of 0.621 and RMSE of 0.774 among the three models.

In terms of the coeficient of determination (R2), a larger R2 value is more preferred, the XGBoost model has the highest R2 value among three models which is 0.474, which means the XGBoost model able to explains about 47.4% of the variation in the health score.

7.7 Scatter Plot of the Actual vs Predicted Values

#2.Actual vs Predicted Scatter plot
scatter_df <- data.frame(
  Actual = test$GenHlth,
  Linear_Regression = pred_lm,
  Random_Forest = pred_rf,
  XGBoost = pred_xgb
)
scatter_long <- scatter_df %>%
  pivot_longer(
    cols = c(
      Linear_Regression,
      Random_Forest,
      XGBoost
    ),
    names_to = "Model",
    values_to = "Predicted"
  )
#3 Scatter Plots Comparison
ggplot( scatter_long,
  aes(
    x = Actual,
    y = Predicted
  )
) +
  geom_point(
    alpha = 0.1,
    size = 0.5
  ) +
  geom_abline(
    slope = 1,
    intercept = 0,
    linetype = "dashed"
  ) +
  facet_wrap(~Model) +
  theme_minimal() +
  labs(
    title = "Actual vs Predicted Health Score",
    x = "Actual GenHlth",
    y = "Predicted GenHlth"
  )

From the 3 scatter plots of the 3 models above, we can see that the predicted values are quite spread out from the diagonal line. The diagonal reference line indicates the perfect match between the actual and predicted values. All the 3 models showing positive relationship between the predicted and actual data, indicating the models somehow learn some patterns from the training data but the predicted data are quite dispersed from the diagonal line, which indicates our models prediction are not perfectly aligned with the actual value. Among the 3 models, XGBoost showing the strongest alignment with the diagonal line.

7.8 Box Plot of the Actual vs Predicted Values

#3 alternative boxplot
comparison_df <- data.frame(
  Actual = test$GenHlth,
  Linear = pred_lm,
  RF = pred_rf,
  XGB = pred_xgb
) %>%
  pivot_longer(
    cols = c(Linear, RF, XGB),
    names_to = "Model",
    values_to = "Prediction"
  )

ggplot(
  comparison_df,
  aes(
    x = factor(Actual),
    y = Prediction
  )
) +
  geom_boxplot() +
  facet_wrap(~Model)

A box plot was generated for each models to better inspect the spread of the predicted health scores as opposed to the actual health score. The box plot showing the Median of the predicted health scores on the actual health score, the 25th & 75th percentile of the predictions and the outliers. For example, when the actual health score is 4, we can see the box plot showing the prediction around 3.3 to 3.8. Notice that here are some overlapping in the predictions for actual health scores of 2,3 & 4, which indicates the models are capable to predict very healthy (5) and very unhealthy individuals(1) with clear separation, but might struggle to correctly predict the health score of a fair or good health individual with health score ranging from 2~3 or 3~4. The box plot among the 3 models showing not much difference, but somehow XGBoost model’s box plot showing better medians separation and narrower distributions of predicted values which means the model is making more consistent predictions for each actual health score.

8. SHAP Explainability

After model comparison, XGBoost was selected as the best-performing model based on its validation performance. To improve model interpretability, SHAP analysis was applied to explain how different predictors contributed to diabetes classification.

library(shapviz)
library(ggplot2)
library(xgboost)

shap_values <- predict( best_xgb_model, dtest, predcontrib = TRUE)
shap_values_no_bias <- shap_values[, -ncol(shap_values)]
shap_xgb <- shapviz( shap_values_no_bias, X = test_data[, feature_cols])
shap_blue <- "#1E88E5"
shap_pink <- "#FF0051"

# Extract SHAP matrix from shapviz object
shap_matrix <- shap_xgb$S

# Compute mean absolute SHAP value
shap_importance <- data.frame(
  Feature = colnames(shap_matrix),
  MeanAbsSHAP = colMeans(abs(shap_matrix))
) %>%
  arrange(desc(MeanAbsSHAP)) %>%
  dplyr::slice(1:10)

# Make feature order nicer for plotting
shap_importance$Feature <- factor(
  shap_importance$Feature,
  levels = rev(shap_importance$Feature)
)

# Plot
p_shap_bar <- ggplot(shap_importance, aes(x = Feature, y = MeanAbsSHAP)) +
  geom_col(fill = shap_pink, width = 0.7) +
  geom_text(
    aes(label = sprintf("%.3f", MeanAbsSHAP)),
    hjust = -0.1,
    size = 4,
    fontface = "bold"
  ) +
  coord_flip() +
  expand_limits(y = max(shap_importance$MeanAbsSHAP) * 1.15) +
  labs(
    title = "SHAP Global Feature Importance",
    x = NULL,
    y = "Mean Absolute SHAP Value"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 16),
    axis.title.y = element_blank(),
    axis.title.x = element_text(face = "bold", size = 13),
    axis.text.y = element_text(size = 11),
    axis.text.x = element_text(size = 11),
    panel.grid.minor = element_blank(),
    panel.grid.major.y = element_blank()
  )


# beeswarm
p_shap_beeswarm <- sv_importance( shap_xgb, kind = "beeswarm", max_display = 10
) +
  scale_color_gradient( low = shap_blue, high = shap_pink, name = "Feature value"
  ) +
  labs( title = "SHAP Beeswarm Plot", x = "SHAP Value", y = NULL
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 16),
    axis.title.x = element_text(face = "bold", size = 13),
    axis.text.y = element_text(size = 11),
    legend.title = element_text(face = "bold"),
    panel.grid.minor = element_blank()
  )

# Combine SHAP bar plot and beeswarm plot side by side
p_shap_combined <- p_shap_bar + p_shap_beeswarm +
  plot_layout(widths = c(1, 1.2)) +
  plot_annotation(
    title = "SHAP Model Interpretation",
    theme = theme(
      plot.title = element_text(
        hjust = 0.5,
        face = "bold",
        size = 18 ) ) )

p_shap_combined

SHAP Global Feature Importance

The SHAP global feature importance plot shows that GenHlth was the most influential predictor in the XGBoost model, with the highest mean absolute SHAP value of 0.612. This means that general health status had the strongest overall impact on the model’s diabetes prediction. Age, HighBP, and BMI were also important features, with similar SHAP values around 0.40, followed by HighChol with a value of 0.302. Overall, the model mainly relied on general health condition, age, blood pressure, BMI, and cholesterol status when making predictions.

SHAP Beeswarm Plot

The SHAP beeswarm plot further explains how each feature affected the prediction direction. Positive SHAP values push the prediction toward the Diabetes class, while negative SHAP values reduce the predicted diabetes risk. The plot shows that poorer general health, older age, high blood pressure, higher BMI, and high cholesterol generally increased the model’s predicted diabetes risk. In contrast, better health status, younger age, absence of high blood pressure, lower BMI, and normal cholesterol tended to reduce the prediction toward diabetes.

9. Results and Conclusion

9.1 Summary of Modeling Results

This project successfully applied machine learning techniques to the BRFSS 2015 dataset to predict diabetes risk (classification) and general health scores (regression). Across both tasks, tree-based ensemble methods, specifically XGBoost, demonstrated the most robust predictive capabilities.

  • Classification (Predicting Diabetes Status): Dealing with a highly imbalanced dataset (86% non-diabetic vs. 14% diabetic), we prioritized metrics like Recall and ROC-AUC over strict accuracy. After addressing the class imbalance using random undersampling on the training data, XGBoost emerged as the best-performing model. It achieved an ROC-AUC of 0.829, an Accuracy of 0.814, and a Recall of 0.593 on the unseen test set. While the model occasionally struggled with false positives, its ability to identify actual diabetic cases outperformed the Decision Tree and Random Forest models.
  • Regression (Predicting General Health Score - GenHlth): We further analyzed the dataset by treating the general health score as a continuous target variable. XGBoost once again outperformed Linear Regression and Random Forest, achieving the lowest Root Mean Square Error (RMSE = 0.774) and Mean Absolute Error (MAE = 0.621). The model captured approximately 47.4% of the variance (R-squared = 0.474) in the health scores. Box plot analyses revealed that while the model clearly distinguishes between extreme health states (excellent vs. poor), it exhibits some overlap when predicting intermediate health scores.

9.2 Key Risk Factors Identified

Through feature importance ranking and SHAP explainability analysis, several critical factors were consistently identified as major drivers of diabetes risk and overall health:

  • General Health (GenHlth): Self-reported general health was the most dominant predictor of diabetes. The SHAP analysis clearly indicated that a poorer general health score heavily pushes the model’s prediction toward positive diabetes status.
  • Cardiovascular Indicators (HighBP and HighChol): High blood pressure and high cholesterol were the next most significant predictors. The presence of these conditions drastically increases the likelihood of being classified as diabetic, reinforcing the known medical consensus that metabolic syndrome components are deeply interconnected.
  • Demographics and Body Metrics (Age and BMI): Advancing age and higher Body Mass Index (BMI) also showed strong positive correlations with diabetes risk.
  • Lifestyle Factors: Interestingly, variables like physical activity (PhysActivity) showed lower relative importance in the final models compared to established medical conditions (like blood pressure) and demographics, though they still contribute to the overall health profile.

9.3 Limitations

While the models provide valuable insights, several limitations must be acknowledged:

  1. Imperfect Recall and Precision: Despite balancing the training set, the classification model still missed some positive diabetes cases (false negatives) and misclassified a portion of healthy individuals as diabetic (false positives). This suggests that predicting diabetes perfectly requires more granular clinical data (e.g., actual blood glucose levels) rather than survey responses alone.
  2. Unmeasured Confounders: The regression model explained 47.4% of the variance in general health. This indicates that over half of the variation in an individual’s health status is driven by factors not present in this dataset, such as genetics, specific dietary habits, mental stress, or environmental factors.
  3. Self-Reporting Bias: The BRFSS relies on telephone surveys, meaning the data is self-reported. Variables like weight (used to calculate BMI), physical activity, and even general health are subject to human bias and recall error.

9.4 Final Conclusion

This project demonstrates the strong potential of machine learning—particularly XGBoost—in analyzing large-scale public health data to uncover complex relationships between lifestyle, demographics, and chronic diseases.

From a public health perspective, the findings strongly emphasize the importance of early intervention in managing blood pressure, cholesterol, and BMI to mitigate diabetes risk. Healthcare providers and policymakers can utilize these insights to design targeted awareness campaigns and preventive care strategies, focusing especially on aging populations and individuals reporting declining general health. Moving forward, incorporating more direct clinical measurements could significantly enhance the predictive power and clinical utility of these machine learning models.

10. Members Contribution

Group Members Contribution
XU YIRAN 25088321 Data Collection, Understanding & Cleaning
WANG JIACUN 25075892 Classification Modeling & Evaluation
SAM YEE THONG WAH 24209652 Regression Modeling & Evaluation, Final Checking & Video Compilation
LUO YUN 25088950 Interpretation, Explainability & Conclusion
HUANG MAO WEI 24231728 Introduction, Project Workflow & PPT