Group 12:
Tan Jun Qi (25079850)
Lau Sue Nee
(25091564)
Foo Jia En (25060168)
Bernard Joshua A/L Raja Rajan
(23121618)
Ng Hua Ching (22108471)
Diabetes is a global health concern as it is associated with many long term complications such as cardiovascular disease, stroke, kidney disease, and reduced quality of life. It is essential for us to identify individuals with higher diabetes risk earlier so we can support better health screening, prevention planning, and intervention strategies.
This project uses the Diabetes Health Indicators Dataset obtained from Kaggle, which is derived from the 2015 Behavioral Risk Factor Surveillance System (BRFSS). The dataset mainly contains health survey responses from adults in the United States and cover the following datapoints:
The dataset contains 253,680 observations and 22 variables, making it suitable for data exploration, cleaning, visualization, and machine learning analysis. The main target variable is the Diabetes Status which represents respondents with prediabetes or diabetes.
This project applies R programming techniques to understand the dataset, clean and transform the data, explore important health patterns, and build predictive models. Two main analytical problems are considered. First, a classification problem is used to predict whether an individual has diabetes based on selected health indicators. Second, a regression problem is used to predict BMI using demographic, lifestyle, and health-related variables. By combining both approaches, the project aims to identify important diabetes-related risk factors and evaluate how well different machine learning models perform on health indicator data.
This project uses the R programming techniques to analyze, clean and transform the data so that we can explore patterns and build predictive models. Two main analytical problems are considered:
The main objective of this project is to analyse the BRFSS 2015 Diabetes Health Indicators dataset and develop predictive models that can support diabetes risk understanding and BMI prediction.
The following code loads the required libraries for data manipulation and visualization.
## Warning: package 'tidyverse' was built under R version 4.5.3
## Warning: package 'ggplot2' was built under R version 4.5.3
## Warning: package 'tibble' was built under R version 4.5.3
## Warning: package 'tidyr' was built under R version 4.5.3
## Warning: package 'readr' was built under R version 4.5.3
## Warning: package 'dplyr' was built under R version 4.5.3
## Warning: package 'forcats' was built under R version 4.5.3
## Warning: package 'lubridate' was built under R version 4.5.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.1 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.3 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Warning: package 'corrplot' was built under R version 4.5.3
## corrplot 0.95 loaded
## Warning: package 'PerformanceAnalytics' was built under R version 4.5.3
## Loading required package: xts
## Warning: package 'xts' was built under R version 4.5.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.5.3
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
##
## ######################### Warning from 'xts' package ##########################
## # #
## # The dplyr lag() function breaks how base R's lag() function is supposed to #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or #
## # source() into this session won't work correctly. #
## # #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop #
## # dplyr from breaking base R's lag() function. #
## # #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning. #
## # #
## ###############################################################################
##
## Attaching package: 'xts'
##
## The following objects are masked from 'package:dplyr':
##
## first, last
##
##
## Attaching package: 'PerformanceAnalytics'
##
## The following object is masked from 'package:graphics':
##
## legend
The following code reads the diabetes dataset into R.
The dataset contains 22 variables describing demographic, lifestyle, and health indicators:
See full mapping in Table below.
| Variable | Description |
|---|---|
| Diabetes_binary | 0 = no diabetes, 1 = prediabetes or diabetes |
| HighBP | 0 = no high blood pressure, 1 = high blood pressure |
| HighChol | 0 = no high cholesterol, 1 = high cholesterol |
| CholCheck | 0 = no cholesterol check in 5 years, 1 = yes |
| BMI | Body Mass Index |
| Smoker | 0 = no, 1 = yes (≥100 cigarettes lifetime) |
| Stroke | 0 = no, 1 = yes |
| HeartDiseaseorAttack | 0 = no, 1 = yes |
| PhysActivity | 0 = no, 1 = yes (past 30 days) |
| Fruits | 0 = no, 1 = yes (≥1 fruit/day) |
| Veggies | 0 = no, 1 = yes (≥1 vegetable/day) |
| HvyAlcoholConsump | 0 = no, 1 = yes (heavy drinking) |
| AnyHealthcare | 0 = no, 1 = yes |
| NoDocbcCost | 0 = no, 1 = yes (couldn’t see doctor due to cost) |
| GenHlth | 1 = excellent to 5 = poor |
| MentHlth | Days of poor mental health (0–30) |
| PhysHlth | Days of physical illness (0–30) |
| DiffWalk | 0 = no, 1 = yes |
| Sex | 0 = female, 1 = male |
| Age | Age category (1–13) |
| Education | Education level (1–6) |
| Income | Income level (1–8) |
The following code provides a quick overview of dataset structure.
## Rows: 253,680
## Columns: 22
## $ Diabetes_binary <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0…
## $ HighBP <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1…
## $ HighChol <dbl> 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1…
## $ CholCheck <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ BMI <dbl> 40, 25, 28, 27, 24, 25, 30, 25, 30, 24, 25, 34, 2…
## $ Smoker <dbl> 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0…
## $ Stroke <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ HeartDiseaseorAttack <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
## $ PhysActivity <dbl> 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1…
## $ Fruits <dbl> 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1…
## $ Veggies <dbl> 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1…
## $ HvyAlcoholConsump <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ AnyHealthcare <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ NoDocbcCost <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ GenHlth <dbl> 5, 3, 5, 2, 2, 2, 3, 3, 5, 2, 3, 3, 3, 4, 4, 2, 3…
## $ MentHlth <dbl> 18, 0, 30, 0, 3, 0, 0, 0, 30, 0, 0, 0, 0, 0, 30, …
## $ PhysHlth <dbl> 15, 0, 30, 0, 0, 2, 14, 0, 30, 0, 0, 30, 15, 0, 2…
## $ DiffWalk <dbl> 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0…
## $ Sex <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0…
## $ Age <dbl> 9, 7, 9, 11, 11, 10, 9, 11, 9, 8, 13, 10, 7, 11, …
## $ Education <dbl> 4, 6, 4, 3, 5, 6, 6, 4, 5, 4, 6, 5, 5, 4, 6, 6, 4…
## $ Income <dbl> 3, 1, 8, 6, 4, 8, 7, 4, 1, 3, 8, 1, 7, 6, 2, 8, 3…
The following code checks the number of observations and variables.
## [1] 253680 22
The following code generates statistical summary of all variables.
## Diabetes_binary HighBP HighChol CholCheck
## Min. :0.0000 Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:1.0000
## Median :0.0000 Median :0.000 Median :0.0000 Median :1.0000
## Mean :0.1393 Mean :0.429 Mean :0.4241 Mean :0.9627
## 3rd Qu.:0.0000 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.000 Max. :1.0000 Max. :1.0000
## BMI Smoker Stroke HeartDiseaseorAttack
## Min. :12.00 Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:24.00 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :27.00 Median :0.0000 Median :0.00000 Median :0.00000
## Mean :28.38 Mean :0.4432 Mean :0.04057 Mean :0.09419
## 3rd Qu.:31.00 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :98.00 Max. :1.0000 Max. :1.00000 Max. :1.00000
## PhysActivity Fruits Veggies HvyAlcoholConsump
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000
## Median :1.0000 Median :1.0000 Median :1.0000 Median :0.0000
## Mean :0.7565 Mean :0.6343 Mean :0.8114 Mean :0.0562
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## AnyHealthcare NoDocbcCost GenHlth MentHlth
## Min. :0.0000 Min. :0.00000 Min. :1.000 Min. : 0.000
## 1st Qu.:1.0000 1st Qu.:0.00000 1st Qu.:2.000 1st Qu.: 0.000
## Median :1.0000 Median :0.00000 Median :2.000 Median : 0.000
## Mean :0.9511 Mean :0.08418 Mean :2.511 Mean : 3.185
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:3.000 3rd Qu.: 2.000
## Max. :1.0000 Max. :1.00000 Max. :5.000 Max. :30.000
## PhysHlth DiffWalk Sex Age
## Min. : 0.000 Min. :0.0000 Min. :0.0000 Min. : 1.000
## 1st Qu.: 0.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 6.000
## Median : 0.000 Median :0.0000 Median :0.0000 Median : 8.000
## Mean : 4.242 Mean :0.1682 Mean :0.4403 Mean : 8.032
## 3rd Qu.: 3.000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:10.000
## Max. :30.000 Max. :1.0000 Max. :1.0000 Max. :13.000
## Education Income
## Min. :1.00 Min. :1.000
## 1st Qu.:4.00 1st Qu.:5.000
## Median :5.00 Median :7.000
## Mean :5.05 Mean :6.054
## 3rd Qu.:6.00 3rd Qu.:8.000
## Max. :6.00 Max. :8.000
The original dataset contains 253,680 observations and 22 variables, and contains no missing values. The variables mainly consist of binary categorical variables and numerical variables such as BMI, Mental Health, and Physical Health. The target variable is Diabetes_binary, where:
The following code checks complete and incomplete records in the dataset.
## [1] 253680
## [1] 0
## [1] 253680
The following code counts missing values for each variable column.
## [1] "Diabetes_binary 0"
## [1] "HighBP 0"
## [1] "HighChol 0"
## [1] "CholCheck 0"
## [1] "BMI 0"
## [1] "Smoker 0"
## [1] "Stroke 0"
## [1] "HeartDiseaseorAttack 0"
## [1] "PhysActivity 0"
## [1] "Fruits 0"
## [1] "Veggies 0"
## [1] "HvyAlcoholConsump 0"
## [1] "AnyHealthcare 0"
## [1] "NoDocbcCost 0"
## [1] "GenHlth 0"
## [1] "MentHlth 0"
## [1] "PhysHlth 0"
## [1] "DiffWalk 0"
## [1] "Sex 0"
## [1] "Age 0"
## [1] "Education 0"
## [1] "Income 0"
The following code checks duplicate records in the dataset.
## [1] 24206
The following code visualizes the distributions and potential extreme
outliers for the continuous variables (BMI,
MentHlth, and PhysHlth) using side-by-side
histograms and boxplots.
library(ggplot2)
library(patchwork)
# 1. BMI (Body Mass Index)
h_bmi <- ggplot(df_raw, aes(x = BMI)) +
geom_histogram(fill = "steelblue", color = "white", bins = 35) +
labs(title = "Distribution of BMI", x = "BMI", y = "Frequency") +
theme_minimal()
b_bmi <- ggplot(df_raw, aes(x = BMI)) +
geom_boxplot(fill = "lightblue", outlier.color = "red", outlier.shape = 19) +
labs(title = "Outlier Check: BMI", x = "BMI") +
theme_minimal() +
theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())
# 2. MentHlth
h_ment <- ggplot(df_raw, aes(x = MentHlth)) +
geom_histogram(fill = "mediumseagreen", color = "white", bins = 30) +
labs(title = "Distribution of Mental Health", x = "Days (0-30)", y = "Frequency") +
theme_minimal()
b_ment <- ggplot(df_raw, aes(x = MentHlth)) +
geom_boxplot(fill = "lightgreen", outlier.color = "red", outlier.shape = 19) +
labs(title = "Outlier Check: MentHlth", x = "Days (0-30)") +
theme_minimal() +
theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())
# 3. PhysHlth
h_phys <- ggplot(df_raw, aes(x = PhysHlth)) +
geom_histogram(fill = "indianred", color = "white", bins = 30) +
labs(title = "Distribution of Physical Health", x = "Days (0-30)", y = "Frequency") +
theme_minimal()
b_phys <- ggplot(df_raw, aes(x = PhysHlth)) +
geom_boxplot(fill = "lightcoral", outlier.color = "red", outlier.shape = 19) +
labs(title = "Outlier Check: PhysHlth", x = "Days (0-30)") +
theme_minimal() +
theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())
# Combine and Display all plots side-by-side
(h_bmi | b_bmi) /
(h_ment | b_ment) /
(h_phys | b_phys)Interpretation:
MentHlth - The distribution is heavily positively skewed, featuring a massive spike at 0. This indicates that the vast majority of individuals reported zero days of poor mental health in the past month. Overall, the data points are within the logical 0 to 30 day boundary, thus there are no outliers that need to be removed
PhysHlth - The distribution is heavily positively skewed, featuring a massive spike at 0. This indicates most individuals experience no days of physical illness. There is also a small secondary peak at 30 days, representing a subset of individuals with chronic, daily physical illness. Overall, the data points are within the logical 0 to 30 day boundary, thus there are no outliers that need to be removed.
As previously verified, the initial dataset contains no missing values.
## [1] FALSE
## Diabetes_binary HighBP HighChol
## 0 0 0
## CholCheck BMI Smoker
## 0 0 0
## Stroke HeartDiseaseorAttack PhysActivity
## 0 0 0
## Fruits Veggies HvyAlcoholConsump
## 0 0 0
## AnyHealthcare NoDocbcCost GenHlth
## 0 0 0
## MentHlth PhysHlth DiffWalk
## 0 0 0
## Sex Age Education
## 0 0 0
## Income
## 0
The following code artificially introduces missing values into multiple variables to simulate missing data scenarios for methodological demonstration purposes.
set.seed(123)
# Work only on a copy
df_clean <- df_raw
# Inject missing values into BMI
df_clean$BMI[sample(1:nrow(df_clean), 500)] <- NA
# Inject missing values into Mental Health
df_clean$MentHlth[sample(1:nrow(df_clean), 300)] <- NA
# Inject missing values into Physical Health
df_clean$PhysHlth[sample(1:nrow(df_clean), 300)] <- NA
# Check total missing values per column
colSums(is.na(df_clean))## Diabetes_binary HighBP HighChol
## 0 0 0
## CholCheck BMI Smoker
## 0 500 0
## Stroke HeartDiseaseorAttack PhysActivity
## 0 0 0
## Fruits Veggies HvyAlcoholConsump
## 0 0 0
## AnyHealthcare NoDocbcCost GenHlth
## 0 0 0
## MentHlth PhysHlth DiffWalk
## 300 300 0
## Sex Age Education
## 0 0 0
## Income
## 0
The following code replaces missing values using mean imputation. Mean imputation is used for numerical variables as it is simple and preserves dataset size. Although it reduces variance, it is suitable since the proportion of missing values is small.
# Replace BMI missing values
df_clean$BMI[is.na(df_clean$BMI)] <- mean(df_clean$BMI, na.rm = TRUE)
# Replace Mental Health missing values
df_clean$MentHlth[is.na(df_clean$MentHlth)] <- mean(df_clean$MentHlth, na.rm = TRUE)
# Replace Physical Health missing values
df_clean$PhysHlth[is.na(df_clean$PhysHlth)] <- mean(df_clean$PhysHlth, na.rm = TRUE)
# Check missing values after cleaning
colSums(is.na(df_clean))## Diabetes_binary HighBP HighChol
## 0 0 0
## CholCheck BMI Smoker
## 0 0 0
## Stroke HeartDiseaseorAttack PhysActivity
## 0 0 0
## Fruits Veggies HvyAlcoholConsump
## 0 0 0
## AnyHealthcare NoDocbcCost GenHlth
## 0 0 0
## MentHlth PhysHlth DiffWalk
## 0 0 0
## Sex Age Education
## 0 0 0
## Income
## 0
As previously verified, the initial dataset contains duplicate entries.
## [1] 24046
The following code removes duplicate entries. Duplicate records were checked to ensure data consistency and avoid bias during model training.
## [1] 0
The following code normalizes BMI using Min-Max scaling. BMI values were normalized using Min-Max normalization to scale the values between 0 and 1. This helps improve model performance and ensures consistent variable scaling. Min-max normalization was applied to ensure BMI is on the same scale as other features, preventing dominance during modeling.
The following code compares original and transformed BMI.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.00 24.00 27.00 28.69 32.00 98.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1395 0.1744 0.1940 0.2326 1.0000
The following code transforms BMI into categorical groups.
df_clean <- df_clean %>%
mutate(BMI_Category = case_when(
BMI < 18.5 ~ "Underweight",
BMI < 25 ~ "Normal",
BMI < 30 ~ "Overweight",
TRUE ~ "Obese"
))
table(df_clean$BMI_Category)##
## Normal Obese Overweight Underweight
## 58773 84700 83118 3043
## Rows: 229,634
## Columns: 24
## $ Diabetes_binary <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0…
## $ HighBP <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1…
## $ HighChol <dbl> 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1…
## $ CholCheck <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ BMI <dbl> 40, 25, 28, 27, 24, 25, 30, 25, 30, 24, 25, 34, 2…
## $ Smoker <dbl> 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0…
## $ Stroke <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ HeartDiseaseorAttack <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
## $ PhysActivity <dbl> 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1…
## $ Fruits <dbl> 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1…
## $ Veggies <dbl> 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1…
## $ HvyAlcoholConsump <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ AnyHealthcare <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ NoDocbcCost <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ GenHlth <dbl> 5, 3, 5, 2, 2, 2, 3, 3, 5, 2, 3, 3, 3, 4, 4, 2, 3…
## $ MentHlth <dbl> 18, 0, 30, 0, 3, 0, 0, 0, 30, 0, 0, 0, 0, 0, 30, …
## $ PhysHlth <dbl> 15, 0, 30, 0, 0, 2, 14, 0, 30, 0, 0, 30, 15, 0, 2…
## $ DiffWalk <dbl> 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0…
## $ Sex <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0…
## $ Age <dbl> 9, 7, 9, 11, 11, 10, 9, 11, 9, 8, 13, 10, 7, 11, …
## $ Education <dbl> 4, 6, 4, 3, 5, 6, 6, 4, 5, 4, 6, 5, 5, 4, 6, 6, 4…
## $ Income <dbl> 3, 1, 8, 6, 4, 8, 7, 4, 1, 3, 8, 1, 7, 6, 2, 8, 3…
## $ BMI_Normalized <dbl> 0.3255814, 0.1511628, 0.1860465, 0.1744186, 0.139…
## $ BMI_Category <chr> "Obese", "Overweight", "Overweight", "Overweight"…
The following code checks distribution of key variables. Frequency check is used to understand the distribution of categorical variables in the dataset. It helps identify imbalance in key health indicators such as smoking, high blood pressure, and high cholesterol, which may influence model performance. This is important because class imbalance in health indicators may affect model bias when predicting diabetes risk.
cols <- c("HighBP", "HighChol", "Smoker")
for(i in cols){
cat("\nFrequency Distribution of", i, ":\n")
print(table(df_clean[[i]]))
}##
## Frequency Distribution of HighBP :
##
## 0 1
## 125338 104296
##
## Frequency Distribution of HighChol :
##
## 0 1
## 128247 101387
##
## Frequency Distribution of Smoker :
##
## 0 1
## 122695 106939
Interpretation:
The following code visualizes diabetes cases. The dataset is imbalanced, with a lower proportion of positive diabetes cases compared to non-diabetes cases.
ggplot(df_clean, aes(x = factor(Diabetes_binary))) +
geom_bar(fill = "steelblue") +
theme_minimal() +
labs(title = "Distribution of Diabetes Cases",
x = "Diabetes (0 = No Diabetes, 1 = Diabetes)",
y = "Count")The following code generates a correlation matrix for all variables. This provides a clear view of the dataset, highlighting which health indicators are highly correlated with each other before building predictive models.
The matrix uses a color gradient where blue indicates positive correlation, orange or red indicates negative correlation and white indicates no linear relationship.
library(corrplot)
df_numeric <- df_clean %>%
select(where(is.numeric))
global_cor <- cor(df_numeric, method = "spearman")
corrplot(global_cor,
method = "color",
type = "upper",
tl.col = "black",
tl.srt = 45,
tl.cex = 0.5,
diag = FALSE,
title = "Global Correlation Matrix",
mar = c(0,0,2,0)) Interpretation:
Diabetes_binary is positively correlated with GenHlth, HighBP, HighChol, BMI, Age and DiffWalk indicating diabetes risk rises alongside poorer general health, high blood pressure, high cholesterol, increased BMI, older age and difficulty walking. Conversely, it shows weak negative correlations with Income, Education, and PhysActivity, indicating that higher socioeconomic status and active lifestyles act as protective factors against the disease.
Income and Education are highly positively correlated with each other. Both display negative correlations with GenHlth, PhysHlth, and DiffWalk, indicating that higher socioeconomic status is associated with fewer physical health limitations.
Poorer general health (GenHlth) is strongly positively correlated with physical illness days (PhysHlth) and mobility issues (DiffWalk).
Age shows expected positive correlations with cardiovascular risks, specifically HighBP and HeartDiseaseorAttack.
The majority of the matrix is white, indicating low multicollinearity across most variables. This means the features are largely independent, which is ideal for predictive machine learning models.
The following visualization explores the interaction between three variables simultaneously. It plots Age against BMI, while coloring the points based on Diabetes status to reveal complex interactions.
The diabetes cases were represented by red dots while the non-diabetes cases were represented by blue dots.
library(ggplot2)
ggplot(df_clean, aes(x = factor(Age), y = BMI, color = factor(Diabetes_binary))) +
geom_jitter(alpha = 0.4, width = 0.25, size = 1) +
scale_color_manual(values = c("steelblue", "indianred"),
labels = c("0 = No Diabetes", "1 = Diabetes")) +
labs(
title = "Multivariate Analysis: BMI vs. Age colored by Diabetes Status",
x = "Age Category (1 = Youngest, 13 = Oldest)",
y = "Body Mass Index (BMI)",
color = "Diabetes Status"
) +
theme_minimal() +
# Makes the x-axis easier to read
theme(legend.position = "bottom")Interpretation: The plot shows a positive association between age and diabetes cases, as the red dots are visibly denser on the right side of the plot (older age categories) and in the upper half (higher BMI ranges). This reinforces that older age combined with high BMI creates a compounding risk factor.
The main objective of this analysis is to identify key health-related factors associated with diabetes risk and compare multiple predictive models using fair train-test evaluation.
This study focuses on two types of predictive problems:
This dual approach allows the project to identify individuals at risk of diabetes and evaluate which machine learning method performs best for each task.
A stratified 70:30 train-test split is used. Stratification is important for the classification task because the dataset is imbalanced, with many more non-diabetes cases than diabetes cases.
library(rpart)
library(MASS)
library(nnet)
library(class)
set.seed(7004)
class_features <- c(
"BMI", "HighBP", "HighChol", "Smoker", "PhysActivity",
"Age", "Income", "GenHlth", "DiffWalk", "HeartDiseaseorAttack"
)
df_model <- df_clean %>%
dplyr::select(Diabetes_binary, all_of(class_features))
diabetes_0_index <- which(df_model$Diabetes_binary == 0)
diabetes_1_index <- which(df_model$Diabetes_binary == 1)
train_index <- c(
sample(diabetes_0_index, floor(0.70 * length(diabetes_0_index))),
sample(diabetes_1_index, floor(0.70 * length(diabetes_1_index)))
)
train_class <- df_model[train_index, ]
test_class <- df_model[-train_index, ]
train_class$Diabetes_binary_factor <- factor(
train_class$Diabetes_binary,
levels = c(0, 1),
labels = c("No_Diabetes", "Diabetes")
)
test_class$Diabetes_binary_factor <- factor(
test_class$Diabetes_binary,
levels = c(0, 1),
labels = c("No_Diabetes", "Diabetes")
)
cat("Training records:", nrow(train_class), "\n")## Training records: 160743
## Testing records: 68891
## Training diabetes distribution:
##
## 0 1
## 0.8471473 0.1528527
##
## Testing diabetes distribution:
##
## 0 1
## 0.8471498 0.1528502
Can diabetes status be predicted using BMI, blood pressure, cholesterol, lifestyle factors, age, income, general health, walking difficulty, and heart disease history?
Four classification models are compared:
BMI, high blood pressure, high cholesterol, poorer general health, walking difficulty, heart disease history, and age show positive associations with diabetes risk. Physical activity and income show weaker negative associations, suggesting protective effects.
classification_formula <- as.formula(
paste("Diabetes_binary_factor ~", paste(class_features, collapse = " + "))
)
classification_formula_glm <- as.formula(
paste("Diabetes_binary ~", paste(class_features, collapse = " + "))
)
# Model 1: Logistic Regression
model_logistic <- glm(
classification_formula_glm,
data = train_class,
family = binomial()
)
prob_logistic <- predict(model_logistic, newdata = test_class, type = "response")
# Model 2: Decision Tree
model_class_tree <- rpart(
classification_formula,
data = train_class,
method = "class",
control = rpart.control(cp = 0.001, minsplit = 250, xval = 5)
)
prob_class_tree <- predict(model_class_tree, newdata = test_class, type = "prob")[, "Diabetes"]
# Model 3: Linear Discriminant Analysis(LDA)
model_lda <- lda(classification_formula, data = train_class)
prob_lda <- predict(model_lda, newdata = test_class)$posterior[, "Diabetes"]
# Model 4: Neural Network
# 40000-row training sample
nnet_class_index <- sample(seq_len(nrow(train_class)), min(40000, nrow(train_class)))
model_nnet_class <- nnet(
classification_formula,
data = train_class[nnet_class_index, ],
size = 5,
decay = 0.001,
maxit = 100,
trace = FALSE,
MaxNWts = 10000
)
prob_nnet_class <- as.numeric(predict(model_nnet_class, newdata = test_class, type = "raw"))Accuracy alone can be misleading because the diabetes class is the minority class. Therefore, precision, recall, F1-score, balanced accuracy, and AUC are also used. The model with the strongest F1-score is treated as the best classification model because F1 balances precision and recall for diabetes detection.
calculate_auc <- function(actual, probability) {
actual <- as.numeric(actual)
positive <- actual == 1
n_positive <- sum(positive)
n_negative <- sum(!positive)
ranks <- rank(probability, ties.method = "average")
(sum(ranks[positive]) - n_positive * (n_positive + 1) / 2) /
(n_positive * n_negative)
}
classification_metrics <- function(actual, probability, threshold = 0.50) {
predicted <- ifelse(probability >= threshold, 1, 0)
TP <- sum(predicted == 1 & actual == 1)
TN <- sum(predicted == 0 & actual == 0)
FP <- sum(predicted == 1 & actual == 0)
FN <- sum(predicted == 0 & actual == 1)
precision <- ifelse(TP + FP == 0, 0, TP / (TP + FP))
recall <- ifelse(TP + FN == 0, 0, TP / (TP + FN))
specificity <- ifelse(TN + FP == 0, 0, TN / (TN + FP))
f1 <- ifelse(precision + recall == 0, 0, 2 * precision * recall / (precision + recall))
data.frame(
Accuracy = (TP + TN) / (TP + TN + FP + FN),
Precision = precision,
Recall = recall,
Specificity = specificity,
F1_Score = f1,
Balanced_Accuracy = (recall + specificity) / 2,
AUC = calculate_auc(actual, probability)
)
}
classification_results <- rbind(
Logistic_Regression = classification_metrics(test_class$Diabetes_binary, prob_logistic),
Decision_Tree = classification_metrics(test_class$Diabetes_binary, prob_class_tree),
Linear_Discriminant_Analysis = classification_metrics(test_class$Diabetes_binary, prob_lda),
Neural_Network = classification_metrics(test_class$Diabetes_binary, prob_nnet_class)
)
classification_results <- round(classification_results, 4)
classification_resultsbest_classification_model <- rownames(classification_results)[
which.max(classification_results$F1_Score)
]
cat("Best classification model based on F1-score:", best_classification_model)## Best classification model based on F1-score: Linear_Discriminant_Analysis
best_classification_probability <- switch(
best_classification_model,
"Logistic_Regression" = prob_logistic,
"Decision_Tree" = prob_class_tree,
"Linear_Discriminant_Analysis" = prob_lda,
"Neural_Network" = prob_nnet_class
)
best_classification_prediction <- ifelse(best_classification_probability >= 0.50, 1, 0)
table(
Predicted = best_classification_prediction,
Actual = test_class$Diabetes_binary
)## Actual
## Predicted 0 1
## 0 56599 8690
## 1 1762 1840
The classification results show that all models achieve high accuracy because most respondents are non-diabetic. However, recall and F1-score reveal the more important issue: diabetes cases are harder to detect than non-diabetes cases. Among the four models, the best model is selected using F1-score because it balances false positives and false negatives better than accuracy in an imbalanced health-risk dataset.
The logistic regression and LDA models are useful because they are more interpretable, while the decision tree gives a rule-based structure and the neural network can model non-linear relationships. For a public health screening problem, a future team could also test a lower probability threshold to increase recall if identifying more at-risk individuals is more important than reducing false alarms.
sample_patient_class <- data.frame(
BMI = 30,
HighBP = 1,
HighChol = 1,
Smoker = 0,
PhysActivity = 1,
Age = 10,
Income = 5,
GenHlth = 3,
DiffWalk = 0,
HeartDiseaseorAttack = 0
)
predict(model_logistic, newdata = sample_patient_class, type = "response")## 1
## 0.2813032
The output is the estimated probability that the sample individual belongs to the diabetes class. A probability above 0.50 is classified as diabetes risk, while a probability of 0.50 or below is classified as no diabetes risk under the default decision rule.
How well can BMI be predicted using demographic, lifestyle, and health condition variables?
Four regression models are compared:
regression_features <- c(
"HighBP", "HighChol", "CholCheck", "Smoker", "Stroke",
"HeartDiseaseorAttack", "PhysActivity", "Fruits", "Veggies",
"HvyAlcoholConsump", "AnyHealthcare", "NoDocbcCost", "GenHlth",
"MentHlth", "PhysHlth", "DiffWalk", "Sex", "Age", "Education",
"Income", "Diabetes_binary"
)
df_regression <- df_clean %>%
dplyr::select(BMI, all_of(regression_features))
train_regression <- df_regression[train_index, ]
test_regression <- df_regression[-train_index, ]
regression_formula <- as.formula(
paste("BMI ~", paste(regression_features, collapse = " + "))
)df_reg_corr <- df_clean %>%
dplyr::select(BMI, Age, PhysHlth, MentHlth, Income, Education, GenHlth, Diabetes_binary)
chart.Correlation(df_reg_corr, histogram = TRUE)BMI has weak to moderate relationships with the selected predictors. General health, diabetes status, high blood pressure, and physical health are relevant because they are connected with obesity-related health risk, while income and education provide socioeconomic context.
# Model 1: Multiple Linear Regression
model_lm <- lm(regression_formula, data = train_regression)
pred_lm <- predict(model_lm, newdata = test_regression)
# Model 2: Stepwise Linear Regression
model_stepwise <- stepAIC(model_lm, direction = "both", trace = FALSE)
pred_stepwise <- predict(model_stepwise, newdata = test_regression)
# Model 3: Regression Tree
model_reg_tree <- rpart(
regression_formula,
data = train_regression,
method = "anova",
control = rpart.control(cp = 0.001, minsplit = 250, xval = 5)
)
pred_reg_tree <- predict(model_reg_tree, newdata = test_regression)
# Model 4: Neural Network Regression
# predictors standardised before neural-network fitting to avoid scale dominance
scale_center <- sapply(train_regression[, regression_features], mean)
scale_spread <- sapply(train_regression[, regression_features], sd)
scale_spread[scale_spread == 0] <- 1
train_regression_scaled <- as.data.frame(
scale(train_regression[, regression_features],
center = scale_center,
scale = scale_spread)
)
train_regression_scaled$BMI <- train_regression$BMI
test_regression_scaled <- as.data.frame(
scale(test_regression[, regression_features],
center = scale_center,
scale = scale_spread)
)
nnet_reg_index <- sample(seq_len(nrow(train_regression_scaled)),
min(40000, nrow(train_regression_scaled)))
model_nnet_reg <- nnet(
regression_formula,
data = train_regression_scaled[nnet_reg_index, ],
size = 5,
decay = 0.01,
maxit = 150,
linout = TRUE,
trace = FALSE,
MaxNWts = 10000
)
pred_nnet_reg <- as.numeric(predict(model_nnet_reg, newdata = test_regression_scaled))Regression models are compared using RMSE, MAE, and test-set R-squared. The best model is selected based on the lowest RMSE because RMSE directly measures BMI prediction error and penalizes larger prediction mistakes.
regression_metrics <- function(actual, predicted) {
data.frame(
RMSE = sqrt(mean((actual - predicted)^2)),
MAE = mean(abs(actual - predicted)),
R_Squared = 1 - sum((actual - predicted)^2) /
sum((actual - mean(actual))^2)
)
}
regression_results <- rbind(
Linear_Regression = regression_metrics(test_regression$BMI, pred_lm),
Stepwise_Regression = regression_metrics(test_regression$BMI, pred_stepwise),
Regression_Tree = regression_metrics(test_regression$BMI, pred_reg_tree),
Neural_Network = regression_metrics(test_regression$BMI, pred_nnet_reg)
)
regression_results <- round(regression_results, 4)
regression_resultsbest_regression_model <- rownames(regression_results)[
which.min(regression_results$RMSE)
]
cat("Best regression model based on RMSE:", best_regression_model)## Best regression model based on RMSE: Neural_Network
The regression comparison shows how much each model reduces BMI prediction error on unseen test data. Linear regression is useful as a baseline and for explaining direction of relationships. Stepwise regression checks whether a smaller subset of predictors can perform similarly. The regression tree captures non-linear decision rules, while the neural network captures more flexible non-linear patterns.
The best regression model is selected using RMSE. A lower RMSE means the predicted BMI values are closer to the actual BMI values on average. The test-set R-squared should be interpreted cautiously because BMI is influenced by many factors not available in the BRFSS health-indicator variables.
sample_patient_regression <- data.frame(
HighBP = 1,
HighChol = 1,
CholCheck = 1,
Smoker = 0,
Stroke = 0,
HeartDiseaseorAttack = 0,
PhysActivity = 1,
Fruits = 1,
Veggies = 1,
HvyAlcoholConsump = 0,
AnyHealthcare = 1,
NoDocbcCost = 0,
GenHlth = 3,
MentHlth = 2,
PhysHlth = 5,
DiffWalk = 0,
Sex = 1,
Age = 10,
Education = 4,
Income = 6,
Diabetes_binary = 0
)
predict(model_lm, newdata = sample_patient_regression, interval = "prediction")## fit lwr upr
## 1 29.25299 16.80315 41.70284
The regression output provides an estimated BMI for the sample individual. The prediction interval shows uncertainty for an individual-level BMI prediction, which is expected because BMI is affected by biological, dietary, socioeconomic, and behavioral factors that are not fully captured in this dataset.
classification_plot_data <- data.frame(
Actual = factor(test_class$Diabetes_binary),
Predicted_Probability = best_classification_probability
)
ggplot(classification_plot_data, aes(x = Actual, y = Predicted_Probability)) +
geom_boxplot(fill = "lightblue") +
labs(
title = "Predicted Diabetes Probability by Actual Diabetes Status",
x = "Actual Diabetes Status",
y = "Predicted Probability"
) +
theme_minimal()regression_plot_data <- data.frame(
Actual_BMI = test_regression$BMI,
Predicted_BMI = pred_nnet_reg
)
ggplot(regression_plot_data, aes(x = Actual_BMI, y = Predicted_BMI)) +
geom_point(alpha = 0.25, color = "steelblue") +
geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
labs(
title = "Actual BMI vs Predicted BMI",
x = "Actual BMI",
y = "Predicted BMI"
) +
theme_minimal()The classification plot compares predicted diabetes probabilities between actual non-diabetes and diabetes groups. A better classifier should assign higher probabilities to the actual diabetes group. The regression plot compares actual BMI with predicted BMI; points closer to the red dashed line indicate better prediction accuracy.
This project analyzed the BRFSS 2015 Diabetes Health Indicators dataset to identify factors associated with diabetes risk and BMI levels using classification and regression techniques in R.
The dataset exploration and cleaning process showed that the dataset was generally complete and well-structured. Missing values were artificially introduced for methodological demonstration and handled using mean imputation, while duplicate records were removed to reduce repeated-record bias. BMI normalization and BMI category creation were also added during preprocessing.
For the classification task, four models were compared: logistic regression, decision tree, LDA, and neural network. The use of a stratified train-test split allowed fairer holdout evaluation. Because the dataset is imbalanced, F1-score, recall, balanced accuracy, and AUC were considered alongside accuracy. The best classification model was selected based on F1-score, which is more appropriate than accuracy when diabetes cases are the minority class.
For the regression task, four models were compared: multiple linear regression, stepwise regression, regression tree, and neural network regression. The best regression model was selected based on the lowest RMSE on the test dataset. RMSE and MAE showed the average BMI prediction error, while test-set R-squared showed the proportion of BMI variation explained by the available health indicators.
Overall, the analysis demonstrates that health, lifestyle, and demographic factors can support diabetes risk prediction, although class imbalance makes diabetes detection more difficult than non-diabetes detection. The regression results also suggest that BMI is only partly explained by the available survey indicators, meaning future modelling could improve by adding dietary, genetic, medication, and more detailed physical activity variables.