1 Introduction

Diabetes is a global health concern as it is associated with many long term complications such as cardiovascular disease, stroke, kidney disease, and reduced quality of life. It is essential for us to identify individuals with higher diabetes risk earlier so we can support better health screening, prevention planning, and intervention strategies.

This project uses the Diabetes Health Indicators Dataset obtained from Kaggle, which is derived from the 2015 Behavioral Risk Factor Surveillance System (BRFSS). The dataset mainly contains health survey responses from adults in the United States and cover the following datapoints:

Demographic variables
- Age
- Education
- Income
Lifestyle variables
- Smoking status
- Physical activity
Healthcare access variables
- Access to healthcare
- Ability to afford medical care
Health condition variables
- BMI
- High blood pressure
- High cholesterol
- General health
- Mental health
- Physical health
- Diabetes status

The dataset contains 253,680 observations and 22 variables, making it suitable for data exploration, cleaning, visualization, and machine learning analysis. The main target variable is the Diabetes Status which represents respondents with prediabetes or diabetes.

This project applies R programming techniques to understand the dataset, clean and transform the data, explore important health patterns, and build predictive models. Two main analytical problems are considered. First, a classification problem is used to predict whether an individual has diabetes based on selected health indicators. Second, a regression problem is used to predict BMI using demographic, lifestyle, and health-related variables. By combining both approaches, the project aims to identify important diabetes-related risk factors and evaluate how well different machine learning models perform on health indicator data.

This project uses the R programming techniques to analyze, clean and transform the data so that we can explore patterns and build predictive models. Two main analytical problems are considered:

Classification: Predict whether an individual has diabetes based on selected health indicators.
Regression: Predict BMI using demographic, lifestyle, and health-related variables.

1.1 Project Objectives

The main objective of this project is to analyse the BRFSS 2015 Diabetes Health Indicators dataset and develop predictive models that can support diabetes risk understanding and BMI prediction.

To understand the dataset, including its size, variables, and summary statistics.
To clean and prepare the data by checking for missing values, duplicates, and BMI categories.
To explore the data using charts, tables, distributions, and correlation analysis.
To identify important factors linked to diabetes risk.
To build and compare classification models to predict diabetes status.
To evaluate the classification models using accuracy, recall, F1-score, balanced accuracy, and AUC.
To build and compare regression models to predict BMI.
To evaluate the regression models using RMSE, MAE, and R-squared.
To interpret the results and discuss the strengths, limitations, and possible improvements of the models.

1.2 Research Questions

Can we predict diabetes status by using health indicators and medical history of an individual?
How well can BMI be predicted using demographic, lifestyle, and health condition variables?

2 A. Dataset Understanding

2.1 Load Required Libraries

The following code loads the required libraries for data manipulation and visualization.

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.5.3

## Warning: package 'ggplot2' was built under R version 4.5.3

## Warning: package 'tibble' was built under R version 4.5.3

## Warning: package 'tidyr' was built under R version 4.5.3

## Warning: package 'readr' was built under R version 4.5.3

## Warning: package 'dplyr' was built under R version 4.5.3

## Warning: package 'forcats' was built under R version 4.5.3

## Warning: package 'lubridate' was built under R version 4.5.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.1     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.3     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(corrplot)

## Warning: package 'corrplot' was built under R version 4.5.3

## corrplot 0.95 loaded

library(PerformanceAnalytics)

## Warning: package 'PerformanceAnalytics' was built under R version 4.5.3

## Loading required package: xts

## Warning: package 'xts' was built under R version 4.5.3

## Loading required package: zoo

## Warning: package 'zoo' was built under R version 4.5.3

## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## 
## ######################### Warning from 'xts' package ##########################
## #                                                                             #
## # The dplyr lag() function breaks how base R's lag() function is supposed to  #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or       #
## # source() into this session won't work correctly.                            #
## #                                                                             #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop           #
## # dplyr from breaking base R's lag() function.                                #
## #                                                                             #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning.  #
## #                                                                             #
## ###############################################################################
## 
## Attaching package: 'xts'
## 
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## 
## 
## Attaching package: 'PerformanceAnalytics'
## 
## The following object is masked from 'package:graphics':
## 
##     legend

2.2 Import Dataset

The following code reads the diabetes dataset into R.

df_raw <- read.csv("diabetes_binary_health_indicators_BRFSS2015.csv")

head(df_raw)

2.3 Variable Description

The dataset contains 22 variables describing demographic, lifestyle, and health indicators:

Binary variables: HighBP, HighChol, Smoker, Stroke, etc.
Continuous variables: BMI, MentHlth, PhysHlth
Ordinal variables: Age, Education, Income, GenHlth

See full mapping in Table below.

Variable	Description
Diabetes_binary	0 = no diabetes, 1 = prediabetes or diabetes
HighBP	0 = no high blood pressure, 1 = high blood pressure
HighChol	0 = no high cholesterol, 1 = high cholesterol
CholCheck	0 = no cholesterol check in 5 years, 1 = yes
BMI	Body Mass Index
Smoker	0 = no, 1 = yes (≥100 cigarettes lifetime)
Stroke	0 = no, 1 = yes
HeartDiseaseorAttack	0 = no, 1 = yes
PhysActivity	0 = no, 1 = yes (past 30 days)
Fruits	0 = no, 1 = yes (≥1 fruit/day)
Veggies	0 = no, 1 = yes (≥1 vegetable/day)
HvyAlcoholConsump	0 = no, 1 = yes (heavy drinking)
AnyHealthcare	0 = no, 1 = yes
NoDocbcCost	0 = no, 1 = yes (couldn’t see doctor due to cost)
GenHlth	1 = excellent to 5 = poor
MentHlth	Days of poor mental health (0–30)
PhysHlth	Days of physical illness (0–30)
DiffWalk	0 = no, 1 = yes
Sex	0 = female, 1 = male
Age	Age category (1–13)
Education	Education level (1–6)
Income	Income level (1–8)

2.4 Dataset Structure

The following code provides a quick overview of dataset structure.

glimpse(df_raw)

## Rows: 253,680
## Columns: 22
## $ Diabetes_binary      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0…
## $ HighBP               <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1…
## $ HighChol             <dbl> 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1…
## $ CholCheck            <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ BMI                  <dbl> 40, 25, 28, 27, 24, 25, 30, 25, 30, 24, 25, 34, 2…
## $ Smoker               <dbl> 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0…
## $ Stroke               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ HeartDiseaseorAttack <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
## $ PhysActivity         <dbl> 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1…
## $ Fruits               <dbl> 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1…
## $ Veggies              <dbl> 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1…
## $ HvyAlcoholConsump    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ AnyHealthcare        <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ NoDocbcCost          <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ GenHlth              <dbl> 5, 3, 5, 2, 2, 2, 3, 3, 5, 2, 3, 3, 3, 4, 4, 2, 3…
## $ MentHlth             <dbl> 18, 0, 30, 0, 3, 0, 0, 0, 30, 0, 0, 0, 0, 0, 30, …
## $ PhysHlth             <dbl> 15, 0, 30, 0, 0, 2, 14, 0, 30, 0, 0, 30, 15, 0, 2…
## $ DiffWalk             <dbl> 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0…
## $ Sex                  <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0…
## $ Age                  <dbl> 9, 7, 9, 11, 11, 10, 9, 11, 9, 8, 13, 10, 7, 11, …
## $ Education            <dbl> 4, 6, 4, 3, 5, 6, 6, 4, 5, 4, 6, 5, 5, 4, 6, 6, 4…
## $ Income               <dbl> 3, 1, 8, 6, 4, 8, 7, 4, 1, 3, 8, 1, 7, 6, 2, 8, 3…

2.5 Dataset Dimension

The following code checks the number of observations and variables.

dim(df_raw)

## [1] 253680     22

2.6 Dataset Summary

The following code generates statistical summary of all variables.

summary(df_raw)

##  Diabetes_binary      HighBP         HighChol        CholCheck     
##  Min.   :0.0000   Min.   :0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:1.0000  
##  Median :0.0000   Median :0.000   Median :0.0000   Median :1.0000  
##  Mean   :0.1393   Mean   :0.429   Mean   :0.4241   Mean   :0.9627  
##  3rd Qu.:0.0000   3rd Qu.:1.000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.000   Max.   :1.0000   Max.   :1.0000  
##       BMI            Smoker           Stroke        HeartDiseaseorAttack
##  Min.   :12.00   Min.   :0.0000   Min.   :0.00000   Min.   :0.00000     
##  1st Qu.:24.00   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000     
##  Median :27.00   Median :0.0000   Median :0.00000   Median :0.00000     
##  Mean   :28.38   Mean   :0.4432   Mean   :0.04057   Mean   :0.09419     
##  3rd Qu.:31.00   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.00000     
##  Max.   :98.00   Max.   :1.0000   Max.   :1.00000   Max.   :1.00000     
##   PhysActivity        Fruits          Veggies       HvyAlcoholConsump
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   
##  1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000   
##  Median :1.0000   Median :1.0000   Median :1.0000   Median :0.0000   
##  Mean   :0.7565   Mean   :0.6343   Mean   :0.8114   Mean   :0.0562   
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000   
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   
##  AnyHealthcare     NoDocbcCost         GenHlth         MentHlth     
##  Min.   :0.0000   Min.   :0.00000   Min.   :1.000   Min.   : 0.000  
##  1st Qu.:1.0000   1st Qu.:0.00000   1st Qu.:2.000   1st Qu.: 0.000  
##  Median :1.0000   Median :0.00000   Median :2.000   Median : 0.000  
##  Mean   :0.9511   Mean   :0.08418   Mean   :2.511   Mean   : 3.185  
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:3.000   3rd Qu.: 2.000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :5.000   Max.   :30.000  
##     PhysHlth         DiffWalk           Sex              Age        
##  Min.   : 0.000   Min.   :0.0000   Min.   :0.0000   Min.   : 1.000  
##  1st Qu.: 0.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 6.000  
##  Median : 0.000   Median :0.0000   Median :0.0000   Median : 8.000  
##  Mean   : 4.242   Mean   :0.1682   Mean   :0.4403   Mean   : 8.032  
##  3rd Qu.: 3.000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:10.000  
##  Max.   :30.000   Max.   :1.0000   Max.   :1.0000   Max.   :13.000  
##    Education        Income     
##  Min.   :1.00   Min.   :1.000  
##  1st Qu.:4.00   1st Qu.:5.000  
##  Median :5.00   Median :7.000  
##  Mean   :5.05   Mean   :6.054  
##  3rd Qu.:6.00   3rd Qu.:8.000  
##  Max.   :6.00   Max.   :8.000

The original dataset contains 253,680 observations and 22 variables, and contains no missing values. The variables mainly consist of binary categorical variables and numerical variables such as BMI, Mental Health, and Physical Health. The target variable is Diabetes_binary, where:

0 = No diabetes
1 = Prediabetes or diabetes

2.7 Check Complete and Incomplete Cases

The following code checks complete and incomplete records in the dataset.

nrow(na.omit(df_raw))

## [1] 253680

nrow(df_raw[!complete.cases(df_raw), ])

## [1] 0

nrow(df_raw[complete.cases(df_raw), ])

## [1] 253680

2.8 Check Missing Values

The following code counts missing values for each variable column.

cn <- names(df_raw)

for(i in cn){
  print(paste(i, sum(is.na(df_raw[i]))))
}

## [1] "Diabetes_binary 0"
## [1] "HighBP 0"
## [1] "HighChol 0"
## [1] "CholCheck 0"
## [1] "BMI 0"
## [1] "Smoker 0"
## [1] "Stroke 0"
## [1] "HeartDiseaseorAttack 0"
## [1] "PhysActivity 0"
## [1] "Fruits 0"
## [1] "Veggies 0"
## [1] "HvyAlcoholConsump 0"
## [1] "AnyHealthcare 0"
## [1] "NoDocbcCost 0"
## [1] "GenHlth 0"
## [1] "MentHlth 0"
## [1] "PhysHlth 0"
## [1] "DiffWalk 0"
## [1] "Sex 0"
## [1] "Age 0"
## [1] "Education 0"
## [1] "Income 0"

2.9 Check Duplicate Entries

The following code checks duplicate records in the dataset.

sum(duplicated(df_raw))

## [1] 24206

2.10 Check Extreme Outliers

The following code visualizes the distributions and potential extreme outliers for the continuous variables (BMI, MentHlth, and PhysHlth) using side-by-side histograms and boxplots.

library(ggplot2)
library(patchwork)

# 1. BMI (Body Mass Index)
h_bmi <- ggplot(df_raw, aes(x = BMI)) +
  geom_histogram(fill = "steelblue", color = "white", bins = 35) +
  labs(title = "Distribution of BMI", x = "BMI", y = "Frequency") +
  theme_minimal()

b_bmi <- ggplot(df_raw, aes(x = BMI)) +
  geom_boxplot(fill = "lightblue", outlier.color = "red", outlier.shape = 19) +
  labs(title = "Outlier Check: BMI", x = "BMI") +
  theme_minimal() +
  theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())

# 2. MentHlth
h_ment <- ggplot(df_raw, aes(x = MentHlth)) +
  geom_histogram(fill = "mediumseagreen", color = "white", bins = 30) +
  labs(title = "Distribution of Mental Health", x = "Days (0-30)", y = "Frequency") +
  theme_minimal()

b_ment <- ggplot(df_raw, aes(x = MentHlth)) +
  geom_boxplot(fill = "lightgreen", outlier.color = "red", outlier.shape = 19) +
  labs(title = "Outlier Check: MentHlth", x = "Days (0-30)") +
  theme_minimal() +
  theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())

# 3. PhysHlth 
h_phys <- ggplot(df_raw, aes(x = PhysHlth)) +
  geom_histogram(fill = "indianred", color = "white", bins = 30) +
  labs(title = "Distribution of Physical Health", x = "Days (0-30)", y = "Frequency") +
  theme_minimal()

b_phys <- ggplot(df_raw, aes(x = PhysHlth)) +
  geom_boxplot(fill = "lightcoral", outlier.color = "red", outlier.shape = 19) +
  labs(title = "Outlier Check: PhysHlth", x = "Days (0-30)") +
  theme_minimal() +
  theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())

# Combine and Display all plots side-by-side
(h_bmi | b_bmi) / 
(h_ment | b_ment) / 
(h_phys | b_phys)

Interpretation:

BMI - The data is positively skewed as most individuals have a BMI between 24 and 31. There are a few individuals who have an extreme BMI record of 80-100, which are biologically possible in extreme cases of severe obesity.

MentHlth - The distribution is heavily positively skewed, featuring a massive spike at 0. This indicates that the vast majority of individuals reported zero days of poor mental health in the past month. Overall, the data points are within the logical 0 to 30 day boundary, thus there are no outliers that need to be removed
PhysHlth - The distribution is heavily positively skewed, featuring a massive spike at 0. This indicates most individuals experience no days of physical illness. There is also a small secondary peak at 30 days, representing a subset of individuals with chronic, daily physical illness. Overall, the data points are within the logical 0 to 30 day boundary, thus there are no outliers that need to be removed.

3 B. Data Cleaning

3.1 Missing Values of the Dataset

As previously verified, the initial dataset contains no missing values.

any(is.na(df_raw))

## [1] FALSE

colSums(is.na(df_raw))

##      Diabetes_binary               HighBP             HighChol 
##                    0                    0                    0 
##            CholCheck                  BMI               Smoker 
##                    0                    0                    0 
##               Stroke HeartDiseaseorAttack         PhysActivity 
##                    0                    0                    0 
##               Fruits              Veggies    HvyAlcoholConsump 
##                    0                    0                    0 
##        AnyHealthcare          NoDocbcCost              GenHlth 
##                    0                    0                    0 
##             MentHlth             PhysHlth             DiffWalk 
##                    0                    0                    0 
##                  Sex                  Age            Education 
##                    0                    0                    0 
##               Income 
##                    0

3.2 Inject Random Missing Values

The following code artificially introduces missing values into multiple variables to simulate missing data scenarios for methodological demonstration purposes.

set.seed(123)

# Work only on a copy
df_clean <- df_raw

# Inject missing values into BMI
df_clean$BMI[sample(1:nrow(df_clean), 500)] <- NA

# Inject missing values into Mental Health
df_clean$MentHlth[sample(1:nrow(df_clean), 300)] <- NA

# Inject missing values into Physical Health
df_clean$PhysHlth[sample(1:nrow(df_clean), 300)] <- NA

# Check total missing values per column
colSums(is.na(df_clean))

##      Diabetes_binary               HighBP             HighChol 
##                    0                    0                    0 
##            CholCheck                  BMI               Smoker 
##                    0                  500                    0 
##               Stroke HeartDiseaseorAttack         PhysActivity 
##                    0                    0                    0 
##               Fruits              Veggies    HvyAlcoholConsump 
##                    0                    0                    0 
##        AnyHealthcare          NoDocbcCost              GenHlth 
##                    0                    0                    0 
##             MentHlth             PhysHlth             DiffWalk 
##                  300                  300                    0 
##                  Sex                  Age            Education 
##                    0                    0                    0 
##               Income 
##                    0

3.3 Replace Missing Values

The following code replaces missing values using mean imputation. Mean imputation is used for numerical variables as it is simple and preserves dataset size. Although it reduces variance, it is suitable since the proportion of missing values is small.

# Replace BMI missing values
df_clean$BMI[is.na(df_clean$BMI)] <- mean(df_clean$BMI, na.rm = TRUE)

# Replace Mental Health missing values
df_clean$MentHlth[is.na(df_clean$MentHlth)] <- mean(df_clean$MentHlth, na.rm = TRUE)

# Replace Physical Health missing values
df_clean$PhysHlth[is.na(df_clean$PhysHlth)] <- mean(df_clean$PhysHlth, na.rm = TRUE)

# Check missing values after cleaning
colSums(is.na(df_clean))

##      Diabetes_binary               HighBP             HighChol 
##                    0                    0                    0 
##            CholCheck                  BMI               Smoker 
##                    0                    0                    0 
##               Stroke HeartDiseaseorAttack         PhysActivity 
##                    0                    0                    0 
##               Fruits              Veggies    HvyAlcoholConsump 
##                    0                    0                    0 
##        AnyHealthcare          NoDocbcCost              GenHlth 
##                    0                    0                    0 
##             MentHlth             PhysHlth             DiffWalk 
##                    0                    0                    0 
##                  Sex                  Age            Education 
##                    0                    0                    0 
##               Income 
##                    0

3.4 Duplicate Entries of Dataset

As previously verified, the initial dataset contains duplicate entries.

sum(duplicated(df_clean))

## [1] 24046

3.5 Remove Duplicate Records

The following code removes duplicate entries. Duplicate records were checked to ensure data consistency and avoid bias during model training.

df_clean <- distinct(df_clean)

sum(duplicated(df_clean))

## [1] 0

3.6 Normalize BMI

The following code normalizes BMI using Min-Max scaling. BMI values were normalized using Min-Max normalization to scale the values between 0 and 1. This helps improve model performance and ensures consistent variable scaling. Min-max normalization was applied to ensure BMI is on the same scale as other features, preventing dominance during modeling.

df_clean$BMI_Normalized <- (df_clean$BMI - min(df_clean$BMI)) /
                     (max(df_clean$BMI) - min(df_clean$BMI))

3.7 Check BMI Transformation

The following code compares original and transformed BMI.

summary(df_clean$BMI)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   24.00   27.00   28.69   32.00   98.00

summary(df_clean$BMI_Normalized)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1395  0.1744  0.1940  0.2326  1.0000

3.8 Create BMI Category

The following code transforms BMI into categorical groups.

df_clean <- df_clean %>%
  mutate(BMI_Category = case_when(
    BMI < 18.5 ~ "Underweight",
    BMI < 25 ~ "Normal",
    BMI < 30 ~ "Overweight",
    TRUE ~ "Obese"
  ))

table(df_clean$BMI_Category)

## 
##      Normal       Obese  Overweight Underweight 
##       58773       84700       83118        3043

glimpse(df_clean)

## Rows: 229,634
## Columns: 24
## $ Diabetes_binary      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0…
## $ HighBP               <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1…
## $ HighChol             <dbl> 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1…
## $ CholCheck            <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ BMI                  <dbl> 40, 25, 28, 27, 24, 25, 30, 25, 30, 24, 25, 34, 2…
## $ Smoker               <dbl> 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0…
## $ Stroke               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ HeartDiseaseorAttack <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
## $ PhysActivity         <dbl> 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1…
## $ Fruits               <dbl> 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1…
## $ Veggies              <dbl> 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1…
## $ HvyAlcoholConsump    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ AnyHealthcare        <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ NoDocbcCost          <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ GenHlth              <dbl> 5, 3, 5, 2, 2, 2, 3, 3, 5, 2, 3, 3, 3, 4, 4, 2, 3…
## $ MentHlth             <dbl> 18, 0, 30, 0, 3, 0, 0, 0, 30, 0, 0, 0, 0, 0, 30, …
## $ PhysHlth             <dbl> 15, 0, 30, 0, 0, 2, 14, 0, 30, 0, 0, 30, 15, 0, 2…
## $ DiffWalk             <dbl> 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0…
## $ Sex                  <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0…
## $ Age                  <dbl> 9, 7, 9, 11, 11, 10, 9, 11, 9, 8, 13, 10, 7, 11, …
## $ Education            <dbl> 4, 6, 4, 3, 5, 6, 6, 4, 5, 4, 6, 5, 5, 4, 6, 6, 4…
## $ Income               <dbl> 3, 1, 8, 6, 4, 8, 7, 4, 1, 3, 8, 1, 7, 6, 2, 8, 3…
## $ BMI_Normalized       <dbl> 0.3255814, 0.1511628, 0.1860465, 0.1744186, 0.139…
## $ BMI_Category         <chr> "Obese", "Overweight", "Overweight", "Overweight"…

3.9 Save Cleaned Dataset

# Save cleaned dataset
write.csv(df_clean,
          "diabetes_cleaned.csv",
          row.names = FALSE)

cat("Cleaned dataset saved successfully as 'diabetes_cleaned.csv'")

## Cleaned dataset saved successfully as 'diabetes_cleaned.csv'

4 C. Data Exploration

4.1 Frequency Check

The following code checks distribution of key variables. Frequency check is used to understand the distribution of categorical variables in the dataset. It helps identify imbalance in key health indicators such as smoking, high blood pressure, and high cholesterol, which may influence model performance. This is important because class imbalance in health indicators may affect model bias when predicting diabetes risk.

cols <- c("HighBP", "HighChol", "Smoker")

for(i in cols){
  cat("\nFrequency Distribution of", i, ":\n")
  print(table(df_clean[[i]]))
}

## 
## Frequency Distribution of HighBP :
## 
##      0      1 
## 125338 104296 
## 
## Frequency Distribution of HighChol :
## 
##      0      1 
## 128247 101387 
## 
## Frequency Distribution of Smoker :
## 
##      0      1 
## 122695 106939

Interpretation:

The results reveal that these three specific predictor variables are remarkably well-balanced as it is maintaining roughly a 55 and 45 split. Therefore, feature imbalance within these specific lifestyle indicators will not introduce bias into the predictive models.

4.2 Target Variable Distribution Analysis

The following code visualizes diabetes cases. The dataset is imbalanced, with a lower proportion of positive diabetes cases compared to non-diabetes cases.

ggplot(df_clean, aes(x = factor(Diabetes_binary))) +
  geom_bar(fill = "steelblue") +
  theme_minimal() +
  labs(title = "Distribution of Diabetes Cases",
       x = "Diabetes (0 = No Diabetes, 1 = Diabetes)",
       y = "Count")

4.3 Correlation Analysis by using Heatmap

The following code generates a correlation matrix for all variables. This provides a clear view of the dataset, highlighting which health indicators are highly correlated with each other before building predictive models.

The matrix uses a color gradient where blue indicates positive correlation, orange or red indicates negative correlation and white indicates no linear relationship.

library(corrplot)

df_numeric <- df_clean %>% 
  select(where(is.numeric))

global_cor <- cor(df_numeric, method = "spearman")

corrplot(global_cor, 
         method = "color", 
         type = "upper",        
         tl.col = "black",      
         tl.srt = 45,           
         tl.cex = 0.5,          
         diag = FALSE,          
         title = "Global Correlation Matrix",
         mar = c(0,0,2,0))

Interpretation:

Diabetes_binary is positively correlated with GenHlth, HighBP, HighChol, BMI, Age and DiffWalk indicating diabetes risk rises alongside poorer general health, high blood pressure, high cholesterol, increased BMI, older age and difficulty walking. Conversely, it shows weak negative correlations with Income, Education, and PhysActivity, indicating that higher socioeconomic status and active lifestyles act as protective factors against the disease.
Income and Education are highly positively correlated with each other. Both display negative correlations with GenHlth, PhysHlth, and DiffWalk, indicating that higher socioeconomic status is associated with fewer physical health limitations.
Poorer general health (GenHlth) is strongly positively correlated with physical illness days (PhysHlth) and mobility issues (DiffWalk).
Age shows expected positive correlations with cardiovascular risks, specifically HighBP and HeartDiseaseorAttack.
The majority of the matrix is white, indicating low multicollinearity across most variables. This means the features are largely independent, which is ideal for predictive machine learning models.

4.4 Multivariate Analysis: Age, BMI, and Diabetes

The following visualization explores the interaction between three variables simultaneously. It plots Age against BMI, while coloring the points based on Diabetes status to reveal complex interactions.

The diabetes cases were represented by red dots while the non-diabetes cases were represented by blue dots.

library(ggplot2)

ggplot(df_clean, aes(x = factor(Age), y = BMI, color = factor(Diabetes_binary))) +
  
  geom_jitter(alpha = 0.4, width = 0.25, size = 1) + 
  scale_color_manual(values = c("steelblue", "indianred"),
                     labels = c("0 = No Diabetes", "1 = Diabetes")) +
  labs(
    title = "Multivariate Analysis: BMI vs. Age colored by Diabetes Status",
    x = "Age Category (1 = Youngest, 13 = Oldest)",
    y = "Body Mass Index (BMI)",
    color = "Diabetes Status"
  ) +
  theme_minimal() +
  # Makes the x-axis easier to read
  theme(legend.position = "bottom")

Interpretation: The plot shows a positive association between age and diabetes cases, as the red dots are visibly denser on the right side of the plot (older age categories) and in the upper half (higher BMI ranges). This reinforces that older age combined with high BMI creates a compounding risk factor.

5 D. Data Analysis and Machine Learning Modelling

5.1 Objective of Data Analysis

The main objective of this analysis is to identify key health-related factors associated with diabetes risk and compare multiple predictive models using fair train-test evaluation.

This study focuses on two types of predictive problems:

Classification Problem - Predict whether an individual has diabetes based on health indicators.
Regression Problem - Predict BMI using demographic, lifestyle, and health condition variables.

This dual approach allows the project to identify individuals at risk of diabetes and evaluate which machine learning method performs best for each task.

5.2 Modelling Preparation

A stratified 70:30 train-test split is used. Stratification is important for the classification task because the dataset is imbalanced, with many more non-diabetes cases than diabetes cases.

library(rpart)
library(MASS)
library(nnet)
library(class)

set.seed(7004)

class_features <- c(
  "BMI", "HighBP", "HighChol", "Smoker", "PhysActivity",
  "Age", "Income", "GenHlth", "DiffWalk", "HeartDiseaseorAttack"
)

df_model <- df_clean %>%
  dplyr::select(Diabetes_binary, all_of(class_features))

diabetes_0_index <- which(df_model$Diabetes_binary == 0)
diabetes_1_index <- which(df_model$Diabetes_binary == 1)

train_index <- c(
  sample(diabetes_0_index, floor(0.70 * length(diabetes_0_index))),
  sample(diabetes_1_index, floor(0.70 * length(diabetes_1_index)))
)

train_class <- df_model[train_index, ]
test_class <- df_model[-train_index, ]

train_class$Diabetes_binary_factor <- factor(
  train_class$Diabetes_binary,
  levels = c(0, 1),
  labels = c("No_Diabetes", "Diabetes")
)

test_class$Diabetes_binary_factor <- factor(
  test_class$Diabetes_binary,
  levels = c(0, 1),
  labels = c("No_Diabetes", "Diabetes")
)

cat("Training records:", nrow(train_class), "\n")

## Training records: 160743

cat("Testing records:", nrow(test_class), "\n\n")

## Testing records: 68891

cat("Training diabetes distribution:")

## Training diabetes distribution:

print(prop.table(table(train_class$Diabetes_binary)))

## 
##         0         1 
## 0.8471473 0.1528527

cat("\nTesting diabetes distribution:")

## 
## Testing diabetes distribution:

print(prop.table(table(test_class$Diabetes_binary)))

## 
##         0         1 
## 0.8471498 0.1528502

5.3 1. Classification Analysis: Predicting Diabetes Risk

5.3.1 Research Question 1

Can diabetes status be predicted using BMI, blood pressure, cholesterol, lifestyle factors, age, income, general health, walking difficulty, and heart disease history?

5.3.2 Classification Models Used

Four classification models are compared:

Logistic Regression - baseline statistical classification model.
Decision Tree - interpretable rule-based machine learning model.
Linear Discriminant Analysis (LDA) - probability-based classification model.
Neural Network - non-linear machine learning model.

5.3.3 Classification Correlation Analysis

relation <- cor(df_model, method = "spearman")

corrplot(relation, method = "color")

corrplot(relation, method = "circle")

5.3.3.1 Interpretation

BMI, high blood pressure, high cholesterol, poorer general health, walking difficulty, heart disease history, and age show positive associations with diabetes risk. Physical activity and income show weaker negative associations, suggesting protective effects.

5.3.4 Build Classification Models

classification_formula <- as.formula(
  paste("Diabetes_binary_factor ~", paste(class_features, collapse = " + "))
)

classification_formula_glm <- as.formula(
  paste("Diabetes_binary ~", paste(class_features, collapse = " + "))
)

# Model 1: Logistic Regression
model_logistic <- glm(
  classification_formula_glm,
  data = train_class,
  family = binomial()
)

prob_logistic <- predict(model_logistic, newdata = test_class, type = "response")

# Model 2: Decision Tree
model_class_tree <- rpart(
  classification_formula,
  data = train_class,
  method = "class",
  control = rpart.control(cp = 0.001, minsplit = 250, xval = 5)
)

prob_class_tree <- predict(model_class_tree, newdata = test_class, type = "prob")[, "Diabetes"]

# Model 3: Linear Discriminant Analysis(LDA)
model_lda <- lda(classification_formula, data = train_class)

prob_lda <- predict(model_lda, newdata = test_class)$posterior[, "Diabetes"]

# Model 4: Neural Network
# 40000-row training sample 
nnet_class_index <- sample(seq_len(nrow(train_class)), min(40000, nrow(train_class)))

model_nnet_class <- nnet(
  classification_formula,
  data = train_class[nnet_class_index, ],
  size = 5,
  decay = 0.001,
  maxit = 100,
  trace = FALSE,
  MaxNWts = 10000
)

prob_nnet_class <- as.numeric(predict(model_nnet_class, newdata = test_class, type = "raw"))

5.3.5 Classification Evaluation Metrics

Accuracy alone can be misleading because the diabetes class is the minority class. Therefore, precision, recall, F1-score, balanced accuracy, and AUC are also used. The model with the strongest F1-score is treated as the best classification model because F1 balances precision and recall for diabetes detection.

calculate_auc <- function(actual, probability) {
  actual <- as.numeric(actual)
  positive <- actual == 1
  n_positive <- sum(positive)
  n_negative <- sum(!positive)
  ranks <- rank(probability, ties.method = "average")
  
  (sum(ranks[positive]) - n_positive * (n_positive + 1) / 2) /
    (n_positive * n_negative)
}

classification_metrics <- function(actual, probability, threshold = 0.50) {
  predicted <- ifelse(probability >= threshold, 1, 0)
  
  TP <- sum(predicted == 1 & actual == 1)
  TN <- sum(predicted == 0 & actual == 0)
  FP <- sum(predicted == 1 & actual == 0)
  FN <- sum(predicted == 0 & actual == 1)
  
  precision <- ifelse(TP + FP == 0, 0, TP / (TP + FP))
  recall <- ifelse(TP + FN == 0, 0, TP / (TP + FN))
  specificity <- ifelse(TN + FP == 0, 0, TN / (TN + FP))
  f1 <- ifelse(precision + recall == 0, 0, 2 * precision * recall / (precision + recall))
  
  data.frame(
    Accuracy = (TP + TN) / (TP + TN + FP + FN),
    Precision = precision,
    Recall = recall,
    Specificity = specificity,
    F1_Score = f1,
    Balanced_Accuracy = (recall + specificity) / 2,
    AUC = calculate_auc(actual, probability)
  )
}

classification_results <- rbind(
  Logistic_Regression = classification_metrics(test_class$Diabetes_binary, prob_logistic),
  Decision_Tree = classification_metrics(test_class$Diabetes_binary, prob_class_tree),
  Linear_Discriminant_Analysis = classification_metrics(test_class$Diabetes_binary, prob_lda),
  Neural_Network = classification_metrics(test_class$Diabetes_binary, prob_nnet_class)
)

classification_results <- round(classification_results, 4)
classification_results

best_classification_model <- rownames(classification_results)[
  which.max(classification_results$F1_Score)
]

cat("Best classification model based on F1-score:", best_classification_model)

## Best classification model based on F1-score: Linear_Discriminant_Analysis

best_classification_probability <- switch(
  best_classification_model,
  "Logistic_Regression" = prob_logistic,
  "Decision_Tree" = prob_class_tree,
  "Linear_Discriminant_Analysis" = prob_lda,
  "Neural_Network" = prob_nnet_class
)

best_classification_prediction <- ifelse(best_classification_probability >= 0.50, 1, 0)

table(
  Predicted = best_classification_prediction,
  Actual = test_class$Diabetes_binary
)

##          Actual
## Predicted     0     1
##         0 56599  8690
##         1  1762  1840

5.3.5.1 Classification Model Interpretation

The classification results show that all models achieve high accuracy because most respondents are non-diabetic. However, recall and F1-score reveal the more important issue: diabetes cases are harder to detect than non-diabetes cases. Among the four models, the best model is selected using F1-score because it balances false positives and false negatives better than accuracy in an imbalanced health-risk dataset.

The logistic regression and LDA models are useful because they are more interpretable, while the decision tree gives a rule-based structure and the neural network can model non-linear relationships. For a public health screening problem, a future team could also test a lower probability threshold to increase recall if identifying more at-risk individuals is more important than reducing false alarms.

5.3.6 Classification Prediction Example

sample_patient_class <- data.frame(
  BMI = 30,
  HighBP = 1,
  HighChol = 1,
  Smoker = 0,
  PhysActivity = 1,
  Age = 10,
  Income = 5,
  GenHlth = 3,
  DiffWalk = 0,
  HeartDiseaseorAttack = 0
)

predict(model_logistic, newdata = sample_patient_class, type = "response")

##         1 
## 0.2813032

5.3.6.1 Prediction Interpretation

The output is the estimated probability that the sample individual belongs to the diabetes class. A probability above 0.50 is classified as diabetes risk, while a probability of 0.50 or below is classified as no diabetes risk under the default decision rule.

5.4 2. Regression Analysis: BMI Prediction

5.4.1 Research Question 2

How well can BMI be predicted using demographic, lifestyle, and health condition variables?

5.4.2 Regression Models Used

Four regression models are compared:

Multiple Linear Regression - baseline regression model.
Stepwise Linear Regression - feature-selected regression model using AIC.
Regression Tree - rule-based non-linear regression model.
Neural Network Regression - non-linear machine learning regression model.

5.4.3 Regression Dataset Preparation

regression_features <- c(
  "HighBP", "HighChol", "CholCheck", "Smoker", "Stroke",
  "HeartDiseaseorAttack", "PhysActivity", "Fruits", "Veggies",
  "HvyAlcoholConsump", "AnyHealthcare", "NoDocbcCost", "GenHlth",
  "MentHlth", "PhysHlth", "DiffWalk", "Sex", "Age", "Education",
  "Income", "Diabetes_binary"
)

df_regression <- df_clean %>%
  dplyr::select(BMI, all_of(regression_features))

train_regression <- df_regression[train_index, ]
test_regression <- df_regression[-train_index, ]

regression_formula <- as.formula(
  paste("BMI ~", paste(regression_features, collapse = " + "))
)

5.4.4 Regression Correlation Analysis

df_reg_corr <- df_clean %>%
  dplyr::select(BMI, Age, PhysHlth, MentHlth, Income, Education, GenHlth, Diabetes_binary)

chart.Correlation(df_reg_corr, histogram = TRUE)

5.4.4.1 Interpretation

BMI has weak to moderate relationships with the selected predictors. General health, diabetes status, high blood pressure, and physical health are relevant because they are connected with obesity-related health risk, while income and education provide socioeconomic context.

5.4.5 Build Regression Models

# Model 1: Multiple Linear Regression
model_lm <- lm(regression_formula, data = train_regression)

pred_lm <- predict(model_lm, newdata = test_regression)

# Model 2: Stepwise Linear Regression
model_stepwise <- stepAIC(model_lm, direction = "both", trace = FALSE)

pred_stepwise <- predict(model_stepwise, newdata = test_regression)

# Model 3: Regression Tree
model_reg_tree <- rpart(
  regression_formula,
  data = train_regression,
  method = "anova",
  control = rpart.control(cp = 0.001, minsplit = 250, xval = 5)
)

pred_reg_tree <- predict(model_reg_tree, newdata = test_regression)

# Model 4: Neural Network Regression
# predictors standardised before neural-network fitting to avoid scale dominance
scale_center <- sapply(train_regression[, regression_features], mean)
scale_spread <- sapply(train_regression[, regression_features], sd)
scale_spread[scale_spread == 0] <- 1

train_regression_scaled <- as.data.frame(
  scale(train_regression[, regression_features],
        center = scale_center,
        scale = scale_spread)
)
train_regression_scaled$BMI <- train_regression$BMI

test_regression_scaled <- as.data.frame(
  scale(test_regression[, regression_features],
        center = scale_center,
        scale = scale_spread)
)

nnet_reg_index <- sample(seq_len(nrow(train_regression_scaled)),
                         min(40000, nrow(train_regression_scaled)))

model_nnet_reg <- nnet(
  regression_formula,
  data = train_regression_scaled[nnet_reg_index, ],
  size = 5,
  decay = 0.01,
  maxit = 150,
  linout = TRUE,
  trace = FALSE,
  MaxNWts = 10000
)

pred_nnet_reg <- as.numeric(predict(model_nnet_reg, newdata = test_regression_scaled))

5.4.6 Regression Evaluation Metrics

Regression models are compared using RMSE, MAE, and test-set R-squared. The best model is selected based on the lowest RMSE because RMSE directly measures BMI prediction error and penalizes larger prediction mistakes.

regression_metrics <- function(actual, predicted) {
  data.frame(
    RMSE = sqrt(mean((actual - predicted)^2)),
    MAE = mean(abs(actual - predicted)),
    R_Squared = 1 - sum((actual - predicted)^2) /
      sum((actual - mean(actual))^2)
  )
}

regression_results <- rbind(
  Linear_Regression = regression_metrics(test_regression$BMI, pred_lm),
  Stepwise_Regression = regression_metrics(test_regression$BMI, pred_stepwise),
  Regression_Tree = regression_metrics(test_regression$BMI, pred_reg_tree),
  Neural_Network = regression_metrics(test_regression$BMI, pred_nnet_reg)
)

regression_results <- round(regression_results, 4)
regression_results

best_regression_model <- rownames(regression_results)[
  which.min(regression_results$RMSE)
]

cat("Best regression model based on RMSE:", best_regression_model)

## Best regression model based on RMSE: Neural_Network

5.4.6.1 Regression Model Interpretation

The regression comparison shows how much each model reduces BMI prediction error on unseen test data. Linear regression is useful as a baseline and for explaining direction of relationships. Stepwise regression checks whether a smaller subset of predictors can perform similarly. The regression tree captures non-linear decision rules, while the neural network captures more flexible non-linear patterns.

The best regression model is selected using RMSE. A lower RMSE means the predicted BMI values are closer to the actual BMI values on average. The test-set R-squared should be interpreted cautiously because BMI is influenced by many factors not available in the BRFSS health-indicator variables.

5.4.7 Regression Prediction Example

sample_patient_regression <- data.frame(
  HighBP = 1,
  HighChol = 1,
  CholCheck = 1,
  Smoker = 0,
  Stroke = 0,
  HeartDiseaseorAttack = 0,
  PhysActivity = 1,
  Fruits = 1,
  Veggies = 1,
  HvyAlcoholConsump = 0,
  AnyHealthcare = 1,
  NoDocbcCost = 0,
  GenHlth = 3,
  MentHlth = 2,
  PhysHlth = 5,
  DiffWalk = 0,
  Sex = 1,
  Age = 10,
  Education = 4,
  Income = 6,
  Diabetes_binary = 0
)

predict(model_lm, newdata = sample_patient_regression, interval = "prediction")

##        fit      lwr      upr
## 1 29.25299 16.80315 41.70284

5.4.7.1 Prediction Interpretation

The regression output provides an estimated BMI for the sample individual. The prediction interval shows uncertainty for an individual-level BMI prediction, which is expected because BMI is affected by biological, dietary, socioeconomic, and behavioral factors that are not fully captured in this dataset.

5.5 Model-Based Visualizations

classification_plot_data <- data.frame(
  Actual = factor(test_class$Diabetes_binary),
  Predicted_Probability = best_classification_probability
)

ggplot(classification_plot_data, aes(x = Actual, y = Predicted_Probability)) +
  geom_boxplot(fill = "lightblue") +
  labs(
    title = "Predicted Diabetes Probability by Actual Diabetes Status",
    x = "Actual Diabetes Status",
    y = "Predicted Probability"
  ) +
  theme_minimal()

regression_plot_data <- data.frame(
  Actual_BMI = test_regression$BMI,
  Predicted_BMI = pred_nnet_reg
)

ggplot(regression_plot_data, aes(x = Actual_BMI, y = Predicted_BMI)) +
  geom_point(alpha = 0.25, color = "steelblue") +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  labs(
    title = "Actual BMI vs Predicted BMI",
    x = "Actual BMI",
    y = "Predicted BMI"
  ) +
  theme_minimal()

5.5.0.1 Visualization Interpretation

The classification plot compares predicted diabetes probabilities between actual non-diabetes and diabetes groups. A better classifier should assign higher probabilities to the actual diabetes group. The regression plot compares actual BMI with predicted BMI; points closer to the red dashed line indicate better prediction accuracy.

6 Conclusion

This project analyzed the BRFSS 2015 Diabetes Health Indicators dataset to identify factors associated with diabetes risk and BMI levels using classification and regression techniques in R.

The dataset exploration and cleaning process showed that the dataset was generally complete and well-structured. Missing values were artificially introduced for methodological demonstration and handled using mean imputation, while duplicate records were removed to reduce repeated-record bias. BMI normalization and BMI category creation were also added during preprocessing.

For the classification task, four models were compared: logistic regression, decision tree, LDA, and neural network. The use of a stratified train-test split allowed fairer holdout evaluation. Because the dataset is imbalanced, F1-score, recall, balanced accuracy, and AUC were considered alongside accuracy. The best classification model was selected based on F1-score, which is more appropriate than accuracy when diabetes cases are the minority class.

For the regression task, four models were compared: multiple linear regression, stepwise regression, regression tree, and neural network regression. The best regression model was selected based on the lowest RMSE on the test dataset. RMSE and MAE showed the average BMI prediction error, while test-set R-squared showed the proportion of BMI variation explained by the available health indicators.

Overall, the analysis demonstrates that health, lifestyle, and demographic factors can support diabetes risk prediction, although class imbalance makes diabetes detection more difficult than non-diabetes detection. The regression results also suggest that BMI is only partly explained by the available survey indicators, meaning future modelling could improve by adding dietary, genetic, medication, and more detailed physical activity variables.

Predictive Modeling of Diabetes Risk and BMI Using CDC Health Indicators

Group 12: Tan Jun Qi (25079850) Lau Sue Nee (25091564) Foo Jia En (25060168) Bernard Joshua A/L Raja Rajan (23121618) Ng Hua Ching (22108471)

1 Introduction

1.1 Project Objectives

1.2 Research Questions

2 A. Dataset Understanding

2.1 Load Required Libraries

2.2 Import Dataset

2.3 Variable Description

2.4 Dataset Structure

2.5 Dataset Dimension

2.6 Dataset Summary

2.7 Check Complete and Incomplete Cases

2.8 Check Missing Values

2.9 Check Duplicate Entries

2.10 Check Extreme Outliers

3 B. Data Cleaning

3.1 Missing Values of the Dataset

3.2 Inject Random Missing Values

3.3 Replace Missing Values

3.4 Duplicate Entries of Dataset

3.5 Remove Duplicate Records

3.6 Normalize BMI

3.7 Check BMI Transformation

3.8 Create BMI Category

3.9 Save Cleaned Dataset

4 C. Data Exploration

4.1 Frequency Check

4.2 Target Variable Distribution Analysis

4.3 Correlation Analysis by using Heatmap

4.4 Multivariate Analysis: Age, BMI, and Diabetes

5 D. Data Analysis and Machine Learning Modelling

5.1 Objective of Data Analysis

5.2 Modelling Preparation

5.3 1. Classification Analysis: Predicting Diabetes Risk

5.3.1 Research Question 1

5.3.2 Classification Models Used

5.3.3 Classification Correlation Analysis

5.3.3.1 Interpretation

5.3.4 Build Classification Models

5.3.5 Classification Evaluation Metrics

5.3.5.1 Classification Model Interpretation

5.3.6 Classification Prediction Example

5.3.6.1 Prediction Interpretation

5.4 2. Regression Analysis: BMI Prediction

5.4.1 Research Question 2

5.4.2 Regression Models Used

5.4.3 Regression Dataset Preparation

5.4.4 Regression Correlation Analysis

5.4.4.1 Interpretation

5.4.5 Build Regression Models

5.4.6 Regression Evaluation Metrics

5.4.6.1 Regression Model Interpretation

5.4.7 Regression Prediction Example

5.4.7.1 Prediction Interpretation

5.5 Model-Based Visualizations

5.5.0.1 Visualization Interpretation

6 Conclusion

Group 12:

Tan Jun Qi (25079850)
Lau Sue Nee (25091564)
Foo Jia En (25060168)
Bernard Joshua A/L Raja Rajan (23121618)
Ng Hua Ching (22108471)