Diabetes is a widespread and chronic health condition that affects millions of people globally, and its impact is particularly pronounced in the United States. According to the Centers for Disease Control and Prevention (CDC), approximately 34.2 million Americans had diabetes as of 2018, with an additional 88 million living with prediabetes. Diabetes leads to serious health complications such as heart disease, kidney failure, vision loss, and amputations, while also significantly reducing quality of life and life expectancy for those affected. Furthermore, the economic burden is substantial, with the annual cost of diagnosed diabetes estimated at $327 billion.
Diabetes is characterized by the body’s inability to effectively regulate blood glucose levels, either due to insufficient insulin production (Type 1 diabetes) or an inability to use insulin properly (Type 2 diabetes). Early detection of diabetes and prediabetes is crucial, as it allows individuals to adopt lifestyle changes and receive treatments that can help manage or delay the onset of complications.
The goal of this report is to leverage the Behavioral Risk Factor Surveillance System (BRFSS) 2015 dataset—a large, national survey conducted by the CDC—to build predictive models that can accurately assess diabetes risk based on various health indicators. The BRFSS collects data annually from over 400,000 participants on topics such as health behaviors, chronic health conditions, and the use of preventive services. This dataset offers a rich source of information to investigate the factors most closely associated with diabetes.
This study aims to apply machine learning techniques to predict an individual’s diabetes risk based on the responses provided in the BRFSS survey. Additionally, the study seeks to identify which factors are most predictive of diabetes, providing valuable insights for healthcare practitioners and policymakers.
The study is designed to address the following key research questions:
Can survey responses from the BRFSS accurately predict whether an individual has diabetes, prediabetes, or no diabetes?
What are the most significant risk factors for diabetes according to the data, and how do these factors contribute to overall diabetes risk?
Can a subset of the available health indicators be used to build an accurate and efficient predictive model for diabetes risk?
Is it possible to develop a simplified model that uses fewer survey questions while still accurately predicting an individual’s diabetes risk, which could be deployed in public health initiatives?
The primary objectives of this study are:
To develop and evaluate predictive models for diabetes risk using data from the BRFSS 2015.
To identify the most important health indicators and risk factors that contribute to the likelihood of an individual developing diabetes.
To use feature selection techniques to create a more efficient predictive model that requires fewer inputs but still maintains high accuracy.
To provide insights that can be used to develop a streamlined set of survey questions, which can be implemented in future public health screenings for early diabetes detection.
This study aims to contribute to the growing body of research in the field of public health and machine learning, with the ultimate goal of improving diabetes prevention efforts and public health strategies.
# Load necessary libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Warning: package 'caret' was built under R version 4.4.1
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
# Import the data
diabetesUS <- read.csv("data_input/diabetes_012_health_indicators_BRFSS2015.csv")
The first step involves loading the necessary libraries and importing the dataset from the Behavioral Risk Factor Surveillance System (BRFSS) 2015 survey. The dataset consists of 253,680 rows and 22 columns, representing individuals’ health-related survey responses. This dataset will be used to predict diabetes status and identify key risk factors associated with diabetes and prediabetes.
# Display the first few rows of the data
head(diabetesUS)
## Diabetes_012 HighBP HighChol CholCheck BMI Smoker Stroke HeartDiseaseorAttack
## 1 0 1 1 1 40 1 0 0
## 2 0 0 0 0 25 1 0 0
## 3 0 1 1 1 28 0 0 0
## 4 0 1 0 1 27 0 0 0
## 5 0 1 1 1 24 0 0 0
## 6 0 1 1 1 25 1 0 0
## PhysActivity Fruits Veggies HvyAlcoholConsump AnyHealthcare NoDocbcCost
## 1 0 0 1 0 1 0
## 2 1 0 0 0 0 1
## 3 0 1 0 0 1 1
## 4 1 1 1 0 1 0
## 5 1 1 1 0 1 0
## 6 1 1 1 0 1 0
## GenHlth MentHlth PhysHlth DiffWalk Sex Age Education Income
## 1 5 18 15 1 0 9 4 3
## 2 3 0 0 0 0 7 6 1
## 3 5 30 30 1 0 9 4 8
## 4 2 0 0 0 0 11 3 6
## 5 2 3 0 0 0 11 5 4
## 6 2 0 2 0 1 10 6 8
The head() function provides a preview of the first six rows of the dataset. This allows us to inspect the data structure, the variable names, and a small sample of the data. From the initial rows, we can see that each row represents an individual’s survey response, with columns corresponding to health-related variables such as HighBP (high blood pressure), CholCheck (cholesterol check), BMI, and Diabetes_012, which indicates their diabetes status.
tail(diabetesUS)
## Diabetes_012 HighBP HighChol CholCheck BMI Smoker Stroke
## 253675 0 0 0 1 27 0 0
## 253676 0 1 1 1 45 0 0
## 253677 2 1 1 1 18 0 0
## 253678 0 0 0 1 28 0 0
## 253679 0 1 0 1 23 0 0
## 253680 2 1 1 1 25 0 0
## HeartDiseaseorAttack PhysActivity Fruits Veggies HvyAlcoholConsump
## 253675 0 0 0 1 0
## 253676 0 0 1 1 0
## 253677 0 0 0 0 0
## 253678 0 1 1 0 0
## 253679 0 0 1 1 0
## 253680 1 1 1 0 0
## AnyHealthcare NoDocbcCost GenHlth MentHlth PhysHlth DiffWalk Sex Age
## 253675 1 0 1 0 0 0 0 3
## 253676 1 0 3 0 5 0 1 5
## 253677 1 0 4 0 0 1 0 11
## 253678 1 0 1 0 0 0 0 2
## 253679 1 0 3 0 0 0 1 7
## 253680 1 0 2 0 0 0 0 9
## Education Income
## 253675 6 5
## 253676 6 7
## 253677 2 4
## 253678 5 2
## 253679 5 1
## 253680 6 2
The tail() function is used to inspect the last six rows of the dataset, which helps confirm that the data has been imported in full. The final rows show the same structure and sensible values as the first rows, giving additional confidence that the file was read correctly and is complete.
dim(diabetesUS)
## [1] 253680 22
The dim() function reveals that the dataset contains 253,680 observations (rows) and 22 variables (columns). The large sample size provides ample data for fitting and evaluating predictive models, and the 22 variables span a broad range of health indicators that will be valuable for predicting diabetes status.
names(diabetesUS)
## [1] "Diabetes_012" "HighBP" "HighChol"
## [4] "CholCheck" "BMI" "Smoker"
## [7] "Stroke" "HeartDiseaseorAttack" "PhysActivity"
## [10] "Fruits" "Veggies" "HvyAlcoholConsump"
## [13] "AnyHealthcare" "NoDocbcCost" "GenHlth"
## [16] "MentHlth" "PhysHlth" "DiffWalk"
## [19] "Sex" "Age" "Education"
## [22] "Income"
The dataset used in this analysis is sourced from the Behavioral Risk Factor Surveillance System (BRFSS) 2015 survey, which is managed by the Centers for Disease Control and Prevention (CDC). The BRFSS collects data on health-related risk behaviors, chronic health conditions, and use of preventive services through telephone surveys conducted annually. For this analysis, we focus on predicting diabetes status based on survey responses related to health indicators.
The target variable in this dataset is Diabetes_012, which categorizes individuals into three distinct groups: 0 (no diabetes), 1 (prediabetes), and 2 (diabetes).
This target variable captures the different stages of diabetes risk, allowing us to explore predictive models based on health indicators.
The features in this dataset represent various health conditions, lifestyle choices, and demographic information. Below is a brief explanation of each feature:
HighBP: Binary indicator of whether the individual has been diagnosed with high blood pressure (1: Yes, 0: No).
HighChol: Binary indicator of whether the individual has been diagnosed with high cholesterol (1: Yes, 0: No).
CholCheck: Binary indicator of whether the individual has had a cholesterol check within the past five years (1: Yes, 0: No).
BMI: Body Mass Index, a continuous variable representing an individual’s weight in relation to their height.
Smoker: Binary indicator of whether the individual has smoked at least 100 cigarettes in their lifetime (1: Yes, 0: No).
Stroke: Binary indicator of whether the individual has been diagnosed with a stroke (1: Yes, 0: No).
HeartDisease: Binary indicator of whether the individual has been diagnosed with coronary heart disease or a myocardial infarction (1: Yes, 0: No).
PhysActivity: Binary indicator of whether the individual has engaged in any physical activity or exercise other than their regular job in the past 30 days (1: Yes, 0: No).
Fruits: Binary indicator of whether the individual consumes fruit at least once per day (1: Yes, 0: No).
Veg: Binary indicator of whether the individual consumes vegetables at least once per day (1: Yes, 0: No).
Alcohol: Binary indicator of heavy alcohol consumption (1: Yes, 0: No). Defined as more than 14 drinks per week for men and more than 7 drinks per week for women.
HealthCoverage: Binary indicator of whether the individual has any kind of health care coverage, including health insurance, prepaid plans, or government plans (1: Yes, 0: No).
CostDoc: Binary indicator of whether the individual could not see a doctor in the past 12 months due to cost (1: Yes, 0: No).
GenHealth: Self-reported general health status, ranging from 1 (Excellent) to 5 (Poor).
MentalHealth: Number of days in the past 30 days where the individual’s mental health was not good.
PhysicalHealth: Number of days in the past 30 days where the individual’s physical health was not good.
DiffWalk: Binary indicator of whether the individual has serious difficulty walking or climbing stairs (1: Yes, 0: No).
Sex: Binary indicator of the individual’s sex (1: Male, 0: Female).
Age: Age categorized into 13 levels ranging from 18-24 (Level 1) to 80 or older (Level 13).
Education: Highest level of education completed, ranging from 1 (Never attended school) to 6 (College graduate).
Income: Household income from all sources, categorized into 8 levels (1: Less than $10,000 to 8: $75,000 or more).
This dataset provides a comprehensive snapshot of individuals’ health and lifestyle, which can be used to predict diabetes risk. Each feature is either binary, categorical, or continuous, representing different aspects of an individual’s health profile.
By understanding these features, we can begin to explore how they correlate with diabetes status and identify key predictors that can help improve early diagnosis and intervention.
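Before moving on, a quick tabulation of the target variable gives an immediate sense of how the three classes are distributed (a small sketch using the diabetesUS object imported above; output omitted here):
# Tabulate the three diabetes classes and their relative proportions
table(diabetesUS$Diabetes_012)
round(prop.table(table(diabetesUS$Diabetes_012)), 3)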
# Checking data types and converting to appropriate types
str(diabetesUS)
## 'data.frame': 253680 obs. of 22 variables:
## $ Diabetes_012 : num 0 0 0 0 0 0 0 0 2 0 ...
## $ HighBP : num 1 0 1 1 1 1 1 1 1 0 ...
## $ HighChol : num 1 0 1 0 1 1 0 1 1 0 ...
## $ CholCheck : num 1 0 1 1 1 1 1 1 1 1 ...
## $ BMI : num 40 25 28 27 24 25 30 25 30 24 ...
## $ Smoker : num 1 1 0 0 0 1 1 1 1 0 ...
## $ Stroke : num 0 0 0 0 0 0 0 0 0 0 ...
## $ HeartDiseaseorAttack: num 0 0 0 0 0 0 0 0 1 0 ...
## $ PhysActivity : num 0 1 0 1 1 1 0 1 0 0 ...
## $ Fruits : num 0 0 1 1 1 1 0 0 1 0 ...
## $ Veggies : num 1 0 0 1 1 1 0 1 1 1 ...
## $ HvyAlcoholConsump : num 0 0 0 0 0 0 0 0 0 0 ...
## $ AnyHealthcare : num 1 0 1 1 1 1 1 1 1 1 ...
## $ NoDocbcCost : num 0 1 1 0 0 0 0 0 0 0 ...
## $ GenHlth : num 5 3 5 2 2 2 3 3 5 2 ...
## $ MentHlth : num 18 0 30 0 3 0 0 0 30 0 ...
## $ PhysHlth : num 15 0 30 0 0 2 14 0 30 0 ...
## $ DiffWalk : num 1 0 1 0 0 0 0 1 1 0 ...
## $ Sex : num 0 0 0 0 0 1 0 0 0 1 ...
## $ Age : num 9 7 9 11 11 10 9 11 9 8 ...
## $ Education : num 4 6 4 3 5 6 6 4 5 4 ...
## $ Income : num 3 1 8 6 4 8 7 4 1 3 ...
The str() function provides a summary of the structure of the dataset. It indicates that the dataset contains 253,680 observations (rows) and 22 variables (columns). At this stage, we inspect the data types of each variable. We see that most variables, including the target variable Diabetes_012, are initially stored as numeric. However, Diabetes_012 is a categorical variable with three levels (0: no diabetes, 1: prediabetes, 2: diabetes), so it should be converted into a factor for proper analysis. The other variables, such as HighBP, CholCheck, and BMI, represent health-related indicators and remain numeric for now. This step ensures that the variables are being treated with the appropriate data types for further analysis and modeling.
# Checking for missing values
colSums(is.na(diabetesUS))
## Diabetes_012 HighBP HighChol
## 0 0 0
## CholCheck BMI Smoker
## 0 0 0
## Stroke HeartDiseaseorAttack PhysActivity
## 0 0 0
## Fruits Veggies HvyAlcoholConsump
## 0 0 0
## AnyHealthcare NoDocbcCost GenHlth
## 0 0 0
## MentHlth PhysHlth DiffWalk
## 0 0 0
## Sex Age Education
## 0 0 0
## Income
## 0
anyNA(diabetesUS)
## [1] FALSE
The first line of code checks for any missing values in each column of the dataset by summing up the number of NA values per column. The second line checks if there are any missing values in the entire dataset. The results show that there are no missing values in the dataset, as indicated by the zeros in the output and the FALSE result from the anyNA() function. This confirms that the dataset is clean and complete, which simplifies the data preprocessing process because no imputation or removal of missing data is needed.
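Had any missing values been present, a simple remedy such as median imputation could have been applied before modeling. The line below is purely illustrative and has no effect on this complete dataset:
# Illustrative only: median-impute BMI if any values were missing (a no-op here)
diabetesUS$BMI[is.na(diabetesUS$BMI)] <- median(diabetesUS$BMI, na.rm = TRUE)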
# Converting target variable to factor
diabetesUS$Diabetes_012 <- as.factor(diabetesUS$Diabetes_012)
str(diabetesUS)
## 'data.frame': 253680 obs. of 22 variables:
## $ Diabetes_012 : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 3 1 ...
## $ HighBP : num 1 0 1 1 1 1 1 1 1 0 ...
## $ HighChol : num 1 0 1 0 1 1 0 1 1 0 ...
## $ CholCheck : num 1 0 1 1 1 1 1 1 1 1 ...
## $ BMI : num 40 25 28 27 24 25 30 25 30 24 ...
## $ Smoker : num 1 1 0 0 0 1 1 1 1 0 ...
## $ Stroke : num 0 0 0 0 0 0 0 0 0 0 ...
## $ HeartDiseaseorAttack: num 0 0 0 0 0 0 0 0 1 0 ...
## $ PhysActivity : num 0 1 0 1 1 1 0 1 0 0 ...
## $ Fruits : num 0 0 1 1 1 1 0 0 1 0 ...
## $ Veggies : num 1 0 0 1 1 1 0 1 1 1 ...
## $ HvyAlcoholConsump : num 0 0 0 0 0 0 0 0 0 0 ...
## $ AnyHealthcare : num 1 0 1 1 1 1 1 1 1 1 ...
## $ NoDocbcCost : num 0 1 1 0 0 0 0 0 0 0 ...
## $ GenHlth : num 5 3 5 2 2 2 3 3 5 2 ...
## $ MentHlth : num 18 0 30 0 3 0 0 0 30 0 ...
## $ PhysHlth : num 15 0 30 0 0 2 14 0 30 0 ...
## $ DiffWalk : num 1 0 1 0 0 0 0 1 1 0 ...
## $ Sex : num 0 0 0 0 0 1 0 0 0 1 ...
## $ Age : num 9 7 9 11 11 10 9 11 9 8 ...
## $ Education : num 4 6 4 3 5 6 6 4 5 4 ...
## $ Income : num 3 1 8 6 4 8 7 4 1 3 ...
In this step, the target variable Diabetes_012 is converted from a numeric type to a factor. This conversion is crucial because Diabetes_012 is a categorical variable with three possible levels: 0 for no diabetes, 1 for prediabetes, and 2 for diabetes. Treating it as a factor ensures that predictive models correctly interpret it as a categorical outcome. After the conversion, another str() check confirms that Diabetes_012 is now stored as a factor with three levels, ensuring proper handling in the upcoming analysis and modeling phases.
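For plots and reports, it can also be convenient to attach descriptive labels to the factor levels. The sketch below shows one way to do this on a copy of the data; the analysis that follows keeps the original 0/1/2 coding:
# Optional: descriptive labels for the diabetes classes (applied to a copy only)
diabetes_labelled <- diabetesUS
diabetes_labelled$Diabetes_012 <- factor(diabetes_labelled$Diabetes_012,
                                         levels = c("0", "1", "2"),
                                         labels = c("No Diabetes", "Prediabetes", "Diabetes"))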
# Renaming target variable for consistency
names(diabetesUS)[names(diabetesUS) == "Diabetes_012"] <- "Diabetes"
# Selecting and renaming relevant columns based on their actual names in the dataset
relevant_columns <- c("Diabetes", "HighBP", "HighChol", "CholCheck", "BMI", "Smoker",
"Stroke", "HeartDiseaseorAttack", "PhysActivity", "Fruits", "Veggies",
"HvyAlcoholConsump", "AnyHealthcare", "NoDocbcCost", "GenHlth",
"MentHlth", "PhysHlth", "DiffWalk", "Sex", "Age", "Education", "Income")
# Subset the data to include only the relevant columns
diabetesUS <- diabetesUS %>% select(all_of(relevant_columns))
# Renaming columns for easier access if needed (already consistent in this case)
names(diabetesUS) <- c("Diabetes", "HighBP", "HighChol", "CholCheck", "BMI", "Smoker",
"Stroke", "HeartDisease", "PhysActivity", "Fruit", "Veg", "Alcohol",
"HealthCoverage", "CostDoc", "GenHealth", "MentalHealth", "PhysicalHealth",
"DiffWalk", "Sex", "Age", "Education", "Income")
# Checking the structure again after subsetting and renaming
str(diabetesUS)
## 'data.frame': 253680 obs. of 22 variables:
## $ Diabetes : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 3 1 ...
## $ HighBP : num 1 0 1 1 1 1 1 1 1 0 ...
## $ HighChol : num 1 0 1 0 1 1 0 1 1 0 ...
## $ CholCheck : num 1 0 1 1 1 1 1 1 1 1 ...
## $ BMI : num 40 25 28 27 24 25 30 25 30 24 ...
## $ Smoker : num 1 1 0 0 0 1 1 1 1 0 ...
## $ Stroke : num 0 0 0 0 0 0 0 0 0 0 ...
## $ HeartDisease : num 0 0 0 0 0 0 0 0 1 0 ...
## $ PhysActivity : num 0 1 0 1 1 1 0 1 0 0 ...
## $ Fruit : num 0 0 1 1 1 1 0 0 1 0 ...
## $ Veg : num 1 0 0 1 1 1 0 1 1 1 ...
## $ Alcohol : num 0 0 0 0 0 0 0 0 0 0 ...
## $ HealthCoverage: num 1 0 1 1 1 1 1 1 1 1 ...
## $ CostDoc : num 0 1 1 0 0 0 0 0 0 0 ...
## $ GenHealth : num 5 3 5 2 2 2 3 3 5 2 ...
## $ MentalHealth : num 18 0 30 0 3 0 0 0 30 0 ...
## $ PhysicalHealth: num 15 0 30 0 0 2 14 0 30 0 ...
## $ DiffWalk : num 1 0 1 0 0 0 0 1 1 0 ...
## $ Sex : num 0 0 0 0 0 1 0 0 0 1 ...
## $ Age : num 9 7 9 11 11 10 9 11 9 8 ...
## $ Education : num 4 6 4 3 5 6 6 4 5 4 ...
## $ Income : num 3 1 8 6 4 8 7 4 1 3 ...
Here, we first rename the target variable Diabetes_012 to Diabetes for better readability and consistency throughout the analysis. Next, we select only the relevant columns that are necessary for the analysis. These columns represent various health indicators, lifestyle factors, and demographic information that will be used to predict diabetes status. The selected columns include variables like HighBP (high blood pressure), BMI (body mass index), PhysActivity (physical activity), GenHlth (general health), and others. Renaming some of the columns also ensures ease of interpretation and a consistent naming convention in future code.
The imported CSV is itself a curated extract of the full BRFSS 2015 survey, which contains over 300 variables; the subsetting step above simply confirms that we retain the 22 health indicators most relevant to predicting diabetes. Working with this focused set keeps the data manageable and the analysis efficient.
After renaming the columns and subsetting the dataset, we use str() again to verify that the structure of the data is correct. The dataset now contains 253,680 observations across 22 variables, including the newly renamed and relevant columns. The Diabetes column remains a factor with three levels, and the health indicators such as HighBP, BMI, and PhysActivity remain numeric. This final check confirms that the data is properly prepared and structured for further exploratory analysis and predictive modeling.
# Descriptive Statistics
summary(diabetesUS)
## Diabetes HighBP HighChol CholCheck BMI
## 0:213703 Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :12.00
## 1: 4631 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:24.00
## 2: 35346 Median :0.000 Median :0.0000 Median :1.0000 Median :27.00
## Mean :0.429 Mean :0.4241 Mean :0.9627 Mean :28.38
## 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:31.00
## Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :98.00
## Smoker Stroke HeartDisease PhysActivity
## Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:1.0000
## Median :0.0000 Median :0.00000 Median :0.00000 Median :1.0000
## Mean :0.4432 Mean :0.04057 Mean :0.09419 Mean :0.7565
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1.0000
## Fruit Veg Alcohol HealthCoverage
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:1.0000
## Median :1.0000 Median :1.0000 Median :0.0000 Median :1.0000
## Mean :0.6343 Mean :0.8114 Mean :0.0562 Mean :0.9511
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## CostDoc GenHealth MentalHealth PhysicalHealth
## Min. :0.00000 Min. :1.000 Min. : 0.000 Min. : 0.000
## 1st Qu.:0.00000 1st Qu.:2.000 1st Qu.: 0.000 1st Qu.: 0.000
## Median :0.00000 Median :2.000 Median : 0.000 Median : 0.000
## Mean :0.08418 Mean :2.511 Mean : 3.185 Mean : 4.242
## 3rd Qu.:0.00000 3rd Qu.:3.000 3rd Qu.: 2.000 3rd Qu.: 3.000
## Max. :1.00000 Max. :5.000 Max. :30.000 Max. :30.000
## DiffWalk Sex Age Education
## Min. :0.0000 Min. :0.0000 Min. : 1.000 Min. :1.00
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 6.000 1st Qu.:4.00
## Median :0.0000 Median :0.0000 Median : 8.000 Median :5.00
## Mean :0.1682 Mean :0.4403 Mean : 8.032 Mean :5.05
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:10.000 3rd Qu.:6.00
## Max. :1.0000 Max. :1.0000 Max. :13.000 Max. :6.00
## Income
## Min. :1.000
## 1st Qu.:5.000
## Median :7.000
## Mean :6.054
## 3rd Qu.:8.000
## Max. :8.000
The summary statistics for the diabetesUS dataset give us an overview of the distribution and central tendencies for the various variables.
This is a binary variable indicating whether the individual has high blood pressure (1 = Yes, 0 = No).
Approximately 43% of the individuals in the dataset have high blood pressure (mean = 0.429). This is consistent with broader public health data that shows a significant portion of the population suffers from hypertension.
High blood pressure is a known risk factor for diabetes, particularly type 2 diabetes. The high prevalence of hypertension in the dataset suggests that managing blood pressure could be a key strategy for mitigating diabetes risk.
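To see this relationship directly in the data, the prevalence of high blood pressure can be summarised within each diabetes class (a brief dplyr sketch using the renamed columns; output omitted):
# Proportion of respondents with high blood pressure within each diabetes class
diabetesUS %>%
  group_by(Diabetes) %>%
  summarise(HighBP_rate = mean(HighBP))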
This is a binary variable indicating whether the individual has high cholesterol (1 = Yes, 0 = No).
Approximately 42% of the individuals in the dataset have high cholesterol (mean = 0.4241).
High cholesterol levels are commonly associated with a higher risk of cardiovascular diseases and are often found in individuals with diabetes. This makes HighChol an important variable to monitor, as high cholesterol can exacerbate complications in diabetic patients.
This binary variable indicates whether the individual has had their cholesterol checked within the past five years (1 = Yes, 0 = No).
The majority of individuals (about 96.27%) have had their cholesterol checked recently, with a mean of 0.9627.
Regular cholesterol checks are an important part of preventive health care, particularly for those at risk of cardiovascular diseases and diabetes. Since cholesterol is linked to heart disease and diabetes, it’s encouraging to see a high proportion of individuals adhering to cholesterol screenings.
BMI is a numerical variable representing an individual’s body mass index. It is calculated as weight in kilograms divided by the square of height in meters (kg/m²).
The BMI values range from a minimum of 12 to a maximum of 98, with a median of 27 and a mean of 28.38. This suggests that the dataset includes individuals across a wide spectrum of weight categories, from underweight to severely obese.
BMI is one of the most important predictors of diabetes, with higher BMI values indicating a greater likelihood of obesity—a significant risk factor for developing diabetes. The dataset shows that individuals with higher BMI are more likely to be classified as having diabetes, reinforcing the well-established link between obesity and diabetes risk.
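The same kind of grouped summary makes the BMI gradient across the diabetes classes explicit (again a small sketch; output omitted):
# Median and mean BMI within each diabetes class
diabetesUS %>%
  group_by(Diabetes) %>%
  summarise(median_BMI = median(BMI), mean_BMI = mean(BMI))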
This binary variable indicates whether the individual has smoked at least 100 cigarettes in their lifetime (1 = Yes, 0 = No).
About 44.32% of the individuals meet this definition of a smoker (mean = 0.4432).
Smoking is a known risk factor for many chronic diseases, including diabetes. Smoking can lead to insulin resistance and poor circulation, which complicates diabetes management. Given the relatively high smoking rate in this dataset, smoking cessation programs could be beneficial for diabetes prevention efforts.
This binary variable indicates whether the individual has ever had a stroke (1 = Yes, 0 = No).
Only 4% of the individuals have had a stroke (mean = 0.04057), making it a relatively rare condition within this dataset.
Stroke is associated with a variety of health issues, including diabetes. Individuals with a history of stroke may be at higher risk for developing or exacerbating diabetic conditions due to poor circulation and other complications.
This binary variable indicates whether the individual has had heart disease or a heart attack (1 = Yes, 0 = No).
About 9.42% of the individuals report having heart disease or a heart attack (mean = 0.09419).
Heart disease is a significant comorbidity for diabetes. Individuals with diabetes are more likely to suffer from cardiovascular diseases, so the presence of heart disease in the dataset is an important indicator of diabetes severity and the need for integrated care.
This binary variable indicates whether the individual engages in physical activity or exercise during the past 30 days (1 = Yes, 0 = No).
About 75.65% of the individuals engage in physical activity (mean = 0.7565), suggesting that a large portion of the population is physically active.
Physical activity is a protective factor against diabetes, as regular exercise helps in maintaining healthy body weight and insulin sensitivity. The high proportion of individuals reporting physical activity in the dataset is a positive indicator, although its effectiveness in reducing diabetes risk will need to be assessed in conjunction with other variables.
This binary variable indicates whether the individual consumes fruits at least once per day (1 = Yes, 0 = No).
The mean for Fruit is 0.6343, indicating that 63.43% of the population consumes fruits daily.
Regular fruit consumption is associated with a healthy diet and reduced risk of chronic diseases, including diabetes. The consumption of fruits provides essential vitamins and fiber that help regulate blood sugar levels.
This binary variable indicates whether the individual consumes vegetables at least once per day (1 = Yes, 0 = No).
The mean for Veg is 0.8114, indicating that about 81.14% of the population consumes vegetables daily.
Like fruit consumption, regular vegetable intake is linked to better overall health. Vegetables are rich in nutrients and fiber, which are important for managing blood sugar and weight, both key factors in diabetes prevention.
This binary variable indicates whether the individual engages in heavy alcohol consumption, defined as more than 14 drinks per week for men and more than 7 drinks per week for women (1 = Yes, 0 = No).
Only 5.62% of the population engages in heavy alcohol consumption (mean = 0.0562).
This binary variable indicates whether the individual has health coverage (1 = Yes, 0 = No).
The overwhelming majority of individuals (95.11%) have health coverage, with a mean of 0.9511.
Access to health coverage is critical for managing chronic diseases like diabetes. Individuals with health coverage are more likely to receive preventive care, early diagnosis, and effective management of their conditions.
This binary variable indicates whether the individual was unable to see a doctor in the past year due to cost (1 = Yes, 0 = No).
About 8.42% of the population was unable to see a doctor due to cost (mean = 0.08418).
Financial barriers to healthcare can lead to delayed diagnoses and poor management of chronic conditions like diabetes. Addressing the cost of care could help improve health outcomes for at-risk populations.
This categorical variable rates the individual’s overall general health on a scale from 1 (Excellent) to 5 (Poor).
The median value for GenHealth is 2, indicating that the typical respondent rates their health as very good on the 1 (Excellent) to 5 (Poor) scale.
General health perception is a holistic measure that correlates with various chronic conditions, including diabetes. Individuals who rate their health as poor are likely dealing with multiple health challenges, including diabetes.
This numeric variable indicates the number of days in the past month that the individual experienced poor mental health.
The median number of poor mental health days is 0, though the distribution has a long tail with some individuals experiencing mental health challenges for up to 30 days.
Poor mental health is often linked with chronic diseases such as diabetes, both as a cause and a consequence. Individuals with diabetes may experience mental health challenges related to the stress of managing the disease, while poor mental health can also lead to neglect of healthy behaviors, further increasing diabetes risk.
This numeric variable indicates the number of days in the past month that the individual experienced poor physical health.
The median number of poor physical health days is 0, with similar tail behavior to MentalHealth. Some individuals report up to 30 days of poor physical health in a month.
Poor physical health is closely related to diabetes management. Individuals who experience frequent poor physical health are more likely to have difficulty managing their condition, leading to worse outcomes.
This binary variable indicates whether the individual has difficulty walking or climbing stairs (1 = Yes, 0 = No).
About 16.82% of the population reports difficulty walking or climbing stairs (mean = 0.1682).
Difficulty walking is often associated with chronic conditions such as diabetes, particularly in individuals with obesity or complications like neuropathy. Interventions to improve mobility, such as physical therapy or weight loss programs, could help improve quality of life for these individuals.
This binary variable represents the individual’s sex (0 = Female, 1 = Male).
The mean for Sex is 0.4403, indicating that about 44% of respondents are male and 56% are female.
Diabetes risk can vary by sex due to hormonal differences, lifestyle factors, and other variables. Understanding these differences can help tailor interventions and treatment options for different populations.
This categorical variable represents the individual’s age group, ranging from 1 (18-24 years) to 13 (80 years and older).
The median age group is 8, corresponding to ages 55-59, with the mean age group being approximately 8.03.
Age is a significant risk factor for diabetes, with older individuals being more likely to develop the condition. This dataset includes a substantial number of older individuals, which aligns with the higher prevalence of diabetes in this age group.
This categorical variable indicates the highest level of education attained by the individual, ranging from 1 (No schooling) to 6 (College graduate).
The mean education level is approximately 5.05, indicating that most individuals have at least some college education.
Education is an important social determinant of health, with higher levels of education being associated with healthier behaviors, better access to healthcare, and lower risk of chronic diseases like diabetes.
This categorical variable indicates the individual’s income level, on a scale from 1 (Low Income) to 8 (High Income).
The median income level is 7, indicating that the majority of the population falls within a higher income bracket.
Income is an important social determinant of health, influencing access to healthcare, healthy food, and the ability to maintain a healthy lifestyle. Higher income typically correlates with lower diabetes risk due to better access to resources for managing the condition.
The visualizations help us explore the distribution of the diabetes classes and their relationships with key variables.
# Bar Plot for the Target Variable (Diabetes)
ggplot(diabetesUS, aes(x=Diabetes)) +
geom_bar(fill="steelblue") +
geom_text(stat='count', aes(label=scales::percent(after_stat(count)/sum(after_stat(count)))),
vjust=-0.5, color="black") +
labs(title="Distribution of Diabetes Classes", x="Diabetes Status", y="Count") +
theme_minimal()
The bar plot visualization illustrates the distribution of the diabetes classes within the dataset. The majority of respondents, approximately 84%, fall into the No Diabetes category (class 0), which reflects the prevalence of non-diabetic individuals. Meanwhile, 14% of respondents are categorized as having diabetes (class 2), and only 2% fall into the prediabetes group (class 1).
This clear imbalance in the distribution is critical to understanding the dataset, as it can affect the performance of predictive models. Specifically, the underrepresentation of the prediabetes class may lead to challenges in accurately identifying individuals in this category. The visualization highlights the need to consider this class imbalance during the modeling process, perhaps through resampling techniques or adjusted evaluation metrics.
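One way to probe the effect of this imbalance would be to construct a class-balanced version of the data before modeling. The sketch below uses caret's downSample() to equalise the three classes; this is an illustrative option rather than a step actually taken later in this report:
# Illustrative: down-sample the majority class so all three classes have equal counts
set.seed(123)
balanced <- downSample(x = diabetesUS[, setdiff(names(diabetesUS), "Diabetes")],
                       y = diabetesUS$Diabetes,
                       yname = "Diabetes")
table(balanced$Diabetes)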
# Box Plot for BMI by Diabetes Status
ggplot(diabetesUS, aes(x=Diabetes, y=BMI, fill=Diabetes)) +
geom_boxplot(outlier.colour="red", outlier.shape=16, outlier.size=2) +
scale_fill_manual(values=c("0"="lightblue", "1"="lightgreen", "2"="lightcoral")) +
labs(title="BMI Distribution by Diabetes Status", x="Diabetes Status", y="BMI") +
theme_minimal()
This box plot visualizes the distribution of BMI values across different diabetes status categories (0, 1, and 2). The plot reveals some interesting trends:
No Diabetes (Class 0): The individuals in this group tend to have a lower median BMI, with the interquartile range mostly within 20 to 30. There are fewer extreme outliers, but some individuals still reach BMI values of 50 or higher.
Prediabetes (Class 1): The median BMI is slightly higher than that of the no-diabetes group, and the interquartile range is shifted slightly upwards, mostly falling between 25 and 35. This reflects the known association between increased BMI and higher diabetes risk.
Diabetes (Class 2): This group shows the highest median BMI, with the interquartile range extending from approximately 25 to 35. There are more individuals in this group who have very high BMI values, even reaching up to 100, demonstrating the significant link between obesity and diabetes.
The visualization confirms that as the diabetes status progresses from class 0 (no diabetes) to class 2 (diabetes), there is an observable increase in BMI values, reinforcing the strong correlation between higher BMI and the risk of developing diabetes. The spread of BMI values among individuals with diabetes also indicates that while high BMI is a strong predictor, there are cases across a range of BMI levels.
# Scatter Plot of Age vs. BMI by Diabetes Status
ggplot(diabetesUS, aes(x=Age, y=BMI, color=Diabetes)) +
geom_jitter(alpha=0.6) +
geom_smooth(method='loess', se=FALSE) +
scale_color_manual(values=c("0"="lightblue", "1"="lightgreen", "2"="lightcoral")) +
labs(title="Age vs BMI by Diabetes Status", x="Age", y="BMI") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The scatter plot titled “Age vs BMI by Diabetes Status” shows the relationship between an individual’s age category (X-axis, using the 13-level BRFSS age coding) and their Body Mass Index (BMI) (Y-axis) across different diabetes statuses (indicated by colors).
Key Insights:
The majority of the individuals, represented by the light blue points, belong to the Class 0 (No Diabetes) category. These points are densely packed across the BMI range and span all age categories, with the greatest concentration around age categories 5 to 10 (roughly ages 40 to 69).
Individuals in the Class 2 (Diabetes) category, represented by light coral points, tend to cluster at higher BMIs, with some dispersion across the age groups. There is a subtle pattern suggesting that those with diabetes tend to have higher BMI values, though the data also shows significant overlap with Class 0.
Class 1 (Prediabetes) individuals, represented by light green, are fewer and more scattered but seem to lie between the clusters of Class 0 and Class 2, particularly at moderate BMI levels.
The loess smoothed curves for each class provide an overall trend for how BMI changes with age for each diabetes status:
The trend for Class 0 (light blue line) remains fairly flat across age groups, indicating that BMI does not change much with age for those without diabetes.
The trend for Class 1 (light green line) shows a slight curve, suggesting a modest rise in BMI for certain age groups.
The trend for Class 2 (light coral line) appears slightly higher, reflecting that individuals with diabetes tend to have higher BMI values on average across most age groups.
Possible Improvements for Analysis:
The scatter plot helps in visualizing the interaction between age and BMI for different diabetes statuses, but the overlap suggests that additional factors might be needed to improve class separation.
Introducing interactive elements could allow for dynamic filtering or hovering to see detailed information for individual points, improving usability.
The addition of other visual techniques, such as faceting by age range or including more predictive features, might help to further explore the patterns in the data.
This visualization gives a preliminary look into how age and BMI interact with diabetes status, but further analysis with additional features may be necessary to extract more definitive patterns.
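As one concrete variant of the faceting idea mentioned above, the same view could be split into one panel per diabetes class, which reduces overplotting by the dominant class (a sketch; not rendered in this report):
# Facet the Age vs BMI view by diabetes status to reduce overplotting
ggplot(diabetesUS, aes(x=Age, y=BMI)) +
  geom_jitter(alpha=0.3, color="steelblue") +
  facet_wrap(~ Diabetes) +
  labs(title="Age vs BMI by Diabetes Status (faceted)", x="Age category", y="BMI") +
  theme_minimal()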
# Load necessary libraries
library(nnet)
# Split the data into training and testing sets
set.seed(123)
trainIndex <- createDataPartition(diabetesUS$Diabetes, p = .8,
list = FALSE,
times = 1)
diabetesTrain <- diabetesUS[ trainIndex,]
diabetesTest <- diabetesUS[-trainIndex,]
# Check the structure of the training data
str(diabetesTrain)
## 'data.frame': 202945 obs. of 22 variables:
## $ Diabetes : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 3 1 1 1 1 ...
## $ HighBP : num 1 1 1 1 1 1 0 1 0 0 ...
## $ HighChol : num 1 1 0 0 1 1 0 1 0 1 ...
## $ CholCheck : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BMI : num 40 28 27 30 25 30 24 34 26 33 ...
## $ Smoker : num 1 0 0 1 1 1 0 1 1 1 ...
## $ Stroke : num 0 0 0 0 0 0 0 0 0 1 ...
## $ HeartDisease : num 0 0 0 0 0 1 0 0 0 0 ...
## $ PhysActivity : num 0 0 1 0 1 0 0 0 0 1 ...
## $ Fruit : num 0 1 1 0 0 1 0 1 0 0 ...
## $ Veg : num 1 0 1 0 1 1 1 1 1 1 ...
## $ Alcohol : num 0 0 0 0 0 0 0 0 0 0 ...
## $ HealthCoverage: num 1 1 1 1 1 1 1 1 1 1 ...
## $ CostDoc : num 0 1 0 0 0 0 0 0 0 1 ...
## $ GenHealth : num 5 5 2 3 3 5 2 3 3 4 ...
## $ MentalHealth : num 18 30 0 0 0 30 0 0 0 30 ...
## $ PhysicalHealth: num 15 30 0 14 0 30 0 30 15 28 ...
## $ DiffWalk : num 1 1 0 0 1 1 0 1 0 0 ...
## $ Sex : num 0 0 0 0 0 0 1 0 0 0 ...
## $ Age : num 9 9 11 9 11 9 8 10 7 4 ...
## $ Education : num 4 4 3 6 4 5 4 5 5 6 ...
## $ Income : num 3 8 6 7 4 1 3 1 7 2 ...
We split the dataset into 80% training data (diabetesTrain, with 202,945 observations) and 20% testing data (diabetesTest, with 50,735 observations). The training data is used to fit the model, and the testing data is used to evaluate its performance.
The structure of the training data shows that we have 22 variables, with Diabetes being the target variable and predictors including health indicators such as HighBP (high blood pressure), BMI, Smoker, and Age.
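Because createDataPartition() samples within the levels of the outcome, the class distribution should be preserved almost exactly in both subsets; a quick optional check (output omitted):
# Confirm that the stratified split preserves the Diabetes class proportions
round(prop.table(table(diabetesTrain$Diabetes)), 3)
round(prop.table(table(diabetesTest$Diabetes)), 3)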
# Fit the multinomial logistic regression model
multinom_model <- multinom(Diabetes ~ ., data = diabetesTrain)
## # weights: 69 (44 variable)
## initial value 222957.870925
## iter 10 value 101223.235405
## iter 20 value 96930.165949
## iter 30 value 94960.737410
## iter 40 value 90327.861863
## iter 50 value 83809.588604
## iter 60 value 81300.135819
## final value 81297.342941
## converged
# Summary of the model
summary(multinom_model)
## Call:
## multinom(formula = Diabetes ~ ., data = diabetesTrain)
##
## Coefficients:
## (Intercept) HighBP HighChol CholCheck BMI Smoker Stroke
## 1 -7.807732 0.3586665 0.5858527 0.9150845 0.05088379 0.001540787 -0.08471365
## 2 -7.937181 0.7677284 0.5968404 1.2467178 0.06313342 -0.016817803 0.16450500
## HeartDisease PhysActivity Fruit Veg Alcohol HealthCoverage
## 1 -0.04935111 -0.02020710 -0.0201531 -0.07022140 -0.1720663 -0.05554856
## 2 0.22234286 -0.05085253 -0.0529354 -0.01906576 -0.7882631 0.10005859
## CostDoc GenHealth MentalHealth PhysicalHealth DiffWalk Sex
## 1 0.39016482 0.3121362 0.006437564 -0.006032139 -0.01076913 0.1120458
## 2 0.05460354 0.5500156 -0.003064860 -0.008651601 0.13129153 0.2590316
## Age Education Income
## 1 0.1270228 -0.06846306 -0.06616799
## 2 0.1265264 -0.03427455 -0.05563494
##
## Std. Errors:
## (Intercept) HighBP HighChol CholCheck BMI Smoker
## 1 0.2174001 0.03761426 0.03604026 0.15065323 0.002245444 0.03467782
## 2 0.1048136 0.01653106 0.01524569 0.07626469 0.001023702 0.01481944
## Stroke HeartDisease PhysActivity Fruit Veg Alcohol
## 1 0.07412508 0.05142567 0.03850990 0.03616178 0.04177495 0.08188444
## 2 0.02815491 0.02005668 0.01625447 0.01538451 0.01791127 0.04364012
## HealthCoverage CostDoc GenHealth MentalHealth PhysicalHealth DiffWalk
## 1 0.07890535 0.05462893 0.021010233 0.002153117 0.0021236322 0.04628491
## 2 0.03768804 0.02590414 0.009146583 0.000961553 0.0008845318 0.01909544
## Sex Age Education Income
## 1 0.03542561 0.007242151 0.018201381 0.009359981
## 2 0.01512793 0.003146744 0.007842207 0.004011602
##
## Residual Deviance: 162594.7
## AIC: 162682.7
We fit a multinomial logistic regression model using the training data. This model estimates the probability of an individual being in each of the three diabetes classes based on the health indicators.
The coefficients from the model represent the estimated change in the log-odds of being in a particular class (Class 1 or Class 2) compared to Class 0 (the reference class) for each unit change in the predictor variable.
For example, the coefficient for HighBP (high blood pressure) in Class 2 (Diabetes) is 0.7677, indicating that having high blood pressure increases the log-odds of being diabetic (Class 2) compared to being non-diabetic (Class 0). Similarly, the BMI coefficient in Class 2 is 0.0631, meaning higher BMI is associated with a higher risk of diabetes.
Notably, some coefficients have negative values (e.g., Stroke in Class 1), suggesting a lower likelihood of prediabetes in individuals with these characteristics.
The model converged after roughly 60 iterations, reaching a final residual deviance of 162,594.7 and an AIC (Akaike Information Criterion) of 162,682.7. A lower AIC indicates a better trade-off between goodness of fit and model complexity.
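Since multinom() reports coefficients on the log-odds scale, exponentiating them gives more interpretable odds ratios, and approximate Wald z-tests can be formed from the reported standard errors. This is a standard post-hoc sketch rather than output shown above:
# Odds ratios relative to the reference class (0 = no diabetes)
exp(coef(multinom_model))

# Approximate Wald z-statistics and two-sided p-values
z_stats <- summary(multinom_model)$coefficients / summary(multinom_model)$standard.errors
p_vals  <- 2 * pnorm(abs(z_stats), lower.tail = FALSE)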
# Predict the probabilities
multinom_pred <- predict(multinom_model, diabetesTest, type = "probs")
We use the model to predict the probabilities for each class (0, 1, or 2) on the testing data. The predicted class for each individual is the one with the highest probability:
# Predict the class with the highest probability
multinom_pred_class <- apply(multinom_pred, 1, which.max) - 1
# Convert predictions to factor to match levels of actual data
multinom_pred_class <- factor(multinom_pred_class, levels = levels(diabetesTest$Diabetes))
# Model Performance Evaluation: Confusion Matrix
confusionMatrix(multinom_pred_class, diabetesTest$Diabetes)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2
## 0 41654 857 5815
## 1 0 0 0
## 2 1086 69 1254
##
## Overall Statistics
##
## Accuracy : 0.8457
## 95% CI : (0.8426, 0.8489)
## No Information Rate : 0.8424
## P-Value [Acc > NIR] : 0.02041
##
## Kappa : 0.1922
##
## Mcnemar's Test P-Value : < 2e-16
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2
## Sensitivity 0.9746 0.00000 0.17739
## Specificity 0.1655 1.00000 0.97355
## Pos Pred Value 0.8619 NaN 0.52055
## Neg Pred Value 0.5492 0.98175 0.87967
## Prevalence 0.8424 0.01825 0.13933
## Detection Rate 0.8210 0.00000 0.02472
## Detection Prevalence 0.9525 0.00000 0.04748
## Balanced Accuracy 0.5700 0.50000 0.57547
The confusion matrix provides a summary of the model’s performance by comparing the predicted diabetes classes to the actual classes in the test data:
The overall accuracy of the model is 84.57%, which suggests that the model correctly predicts diabetes status for 84.57% of the test cases. However, the balanced accuracy indicates that performance across all classes is uneven.
Key metrics for each class include:
The model performs very well in detecting non-diabetic individuals (Class 0, sensitivity 0.97) but performs poorly in identifying prediabetic (Class 1, sensitivity 0.00) and diabetic individuals (Class 2, sensitivity 0.18).
Specificity is very high for Classes 1 and 2, but this largely reflects how rarely the model predicts those classes (it never predicts Class 1 at all), while specificity for Class 0 is low (0.17).
The model has high precision (positive predictive value) for Class 0 (0.86) and moderate precision for Class 2 (0.52).
High Performance for Class 0: The model performs exceptionally well at identifying non-diabetic individuals. This high sensitivity for Class 0 likely stems from the significant class imbalance in the data, where 84% of the individuals belong to Class 0.
Struggles with Minority Classes: The model struggles to correctly identify prediabetic (Class 1) and diabetic individuals (Class 2). This is evident in the low sensitivity for these classes. One reason for this may be the class imbalance, with Class 1 representing only 2% of the data and Class 2 representing 14%. As a result, the model may be biased towards predicting Class 0, leading to poor performance on the minority classes.
Class Imbalance Issue: Addressing the class imbalance through techniques such as over-sampling, under-sampling, or using more sophisticated models like Random Forests or Gradient Boosting Machines could improve performance on the minority classes.
Model Refinement: Further refinement of the model could focus on improving prediction accuracy for Class 1 and Class 2, possibly by incorporating interaction terms, non-linear models, or more features to better capture the patterns associated with diabetes risk.
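One simple refinement in this direction would be to re-fit the multinomial model with case weights that are inversely proportional to class frequency, so that errors on the rare prediabetes and diabetes classes are penalised more heavily. The sketch below illustrates the idea; it was not carried out in this report:
# Inverse-frequency case weights: rarer classes receive larger weights
class_freq <- table(diabetesTrain$Diabetes)
case_wts <- as.numeric(1 / class_freq[as.character(diabetesTrain$Diabetes)])

# Re-fit the multinomial model using the case weights
multinom_weighted <- multinom(Diabetes ~ ., data = diabetesTrain, weights = case_wts)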
We apply Recursive Feature Elimination (RFE) to identify the most important features for predicting diabetes status. RFE is a technique that recursively removes features and builds models on the remaining attributes to identify which features contribute the most to the model’s accuracy.
# 1. Data Sampling and Splitting
## We first take a random sample of 10,000 observations from the dataset to speed up the feature selection process. The data is then split into 80% training and 20% testing sets, similar to our earlier approach:
set.seed(123)
sample_index <- sample(nrow(diabetesUS), 10000)
diabetes_sample <- diabetesUS[sample_index,]
# Split the sample data into training and testing sets
set.seed(123)
trainIndex <- createDataPartition(diabetes_sample$Diabetes, p = .8,
list = FALSE,
times = 1)
diabetesTrain <- diabetes_sample[ trainIndex,]
diabetesTest <- diabetes_sample[-trainIndex,]
# 2. Recursive Feature Elimination
## Next, we apply RFE using random forest functions (rfFuncs) as the core modeling approach. RFE aims to select the subset of features that result in the highest predictive accuracy:
control <- rfeControl(functions=rfFuncs, method="cv", number=10)
# Perform Recursive Feature Elimination
results <- rfe(diabetesTrain[, -1], diabetesTrain$Diabetes, sizes=c(1:10), rfeControl=control)
# Print and plot the results
print(results)
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold)
##
## Resampling performance over subset size:
##
## Variables Accuracy Kappa AccuracySD KappaSD Selected
## 1 0.8438 0.0002567 0.0026500 0.0008116
## 2 0.8435 0.0198864 0.0021740 0.0177828
## 3 0.8448 0.0097353 0.0008708 0.0307858
## 4 0.8444 0.1575691 0.0059285 0.0455334
## 5 0.8479 0.1541063 0.0051702 0.0317733 *
## 6 0.8474 0.1315774 0.0042080 0.0167048
## 7 0.8443 0.0993191 0.0036740 0.0292737
## 8 0.8474 0.1162518 0.0067247 0.0533508
## 9 0.8454 0.1966008 0.0077827 0.0371041
## 10 0.8415 0.1674637 0.0082106 0.0524412
## 21 0.8448 0.1506721 0.0061733 0.0462215
##
## The top 5 variables (out of 5):
## GenHealth, BMI, HighBP, Age, HighChol
predictors(results)
## [1] "GenHealth" "BMI" "HighBP" "Age" "HighChol"
plot(results, type=c("g", "o"))
Cross-validation: We used 10-fold cross-validation to assess the accuracy of the models for each subset of features. This ensures that the model is evaluated on different splits of the data and reduces the risk of overfitting.
Feature Subset Sizes: The RFE process evaluated candidate subsets of 1 to 10 variables, plus the full set of 21 predictors.
The plot of cross-validated accuracy against the number of retained features shows that the highest accuracy is achieved when using 5 features, suggesting that these top 5 features are the most influential in predicting diabetes status.
RFE identified the following top 5 features: GenHealth, BMI, HighBP, Age, and HighChol.
These features are the most predictive of diabetes risk, as identified by the RFE process. They reflect both objective measurements and self-reported health conditions, reinforcing their importance in predicting chronic diseases like diabetes.
The most important features align with medical knowledge, where general health, BMI, high blood pressure, age, and high cholesterol are known risk factors for diabetes.
The model reached its highest accuracy when using 5 features (around 84.8%). This indicates that including more features beyond these top 5 does not necessarily improve the model’s performance, and may even slightly reduce it due to overfitting or noise from less relevant features.
The fact that a model with only 5 features can achieve similar accuracy to a model with all 21 features highlights the importance of simplicity. By focusing on these key features, we can develop a more interpretable and efficient model without sacrificing accuracy.
The Recursive Feature Elimination process has identified the most predictive features for diabetes risk. This result is promising because it allows us to build a more interpretable model with fewer features while maintaining high predictive accuracy. Future steps may include using these selected features to refine the predictive model and evaluate its performance on unseen data.
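As a concrete next step, the reduced model suggested by RFE can be fitted and scored on held-out data. The sketch below uses the five selected predictors and the sampled training/testing split created for the RFE step; it illustrates the workflow rather than reporting results:
# Fit a multinomial model using only the five RFE-selected predictors
reduced_model <- multinom(Diabetes ~ GenHealth + BMI + HighBP + Age + HighChol,
                          data = diabetesTrain)

# Score the reduced model on the held-out test set
reduced_pred <- predict(reduced_model, diabetesTest)
confusionMatrix(reduced_pred, diabetesTest$Diabetes)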
In this report, we utilized the BRFSS 2015 dataset to explore predictive modeling for diabetes risk and successfully addressed the research questions. Through data preprocessing, exploratory data analysis, and predictive modeling, we gained valuable insights into the key predictors of diabetes and created an efficient, streamlined model for diabetes risk assessment.
The dataset exhibited significant class imbalance, with the vast majority of participants classified as non-diabetic (84%), and smaller proportions as prediabetic (2%) or diabetic (14%). This imbalance posed challenges for predictive modeling, particularly for the minority classes (prediabetes and diabetes). Addressing this imbalance will be crucial for improving model sensitivity across all classes.
The Recursive Feature Elimination (RFE) process identified five key features that are most predictive of diabetes risk:
General Health (GenHealth): Individuals reporting worse general health are more likely to be diabetic, indicating that overall health status is strongly correlated with diabetes risk.
Body Mass Index (BMI): Higher BMI is associated with a significantly increased risk of diabetes, reaffirming the link between obesity and diabetes.
High Blood Pressure (HighBP): Those with high blood pressure are at a higher risk of diabetes, as this is a known risk factor for cardiovascular and metabolic diseases.
Age: Older individuals are more likely to develop diabetes, which reflects the increasing prevalence of diabetes with age.
High Cholesterol (HighChol): High cholesterol levels were also identified as a significant predictor, aligning with the understanding that cholesterol management is important in diabetes prevention.
The multinomial logistic regression model achieved an overall accuracy of approximately 84.57%. While the model performed well for predicting non-diabetic cases (class 0), it struggled with the minority classes (prediabetes and diabetes). The sensitivity was high for class 0 but much lower for classes 1 and 2. This indicates that further refinement is needed, particularly for improving the prediction of prediabetes and diabetes cases.
Using RFE, we demonstrated that a subset of five variables (GenHealth, BMI, HighBP, Age, and HighChol) could provide comparable predictive performance to models that use all 21 features. This streamlining could help create a more efficient and accessible diabetes risk assessment tool that focuses on the most impactful variables.
Yes, the BRFSS survey data provides a strong foundation for accurate predictions of diabetes risk, with a multinomial logistic regression model achieving a reasonable level of accuracy (84.57%).
The most predictive risk factors were identified as General Health, BMI, High Blood Pressure, Age, and High Cholesterol. These factors were the most influential in predicting diabetes risk in the population.
Yes, the RFE process confirmed that a subset of five key variables can be used to build a predictive model with good accuracy, offering a more efficient approach to diabetes risk assessment.
Absolutely. The RFE process demonstrated that using a short form based on the five most predictive variables could maintain strong predictive performance, making it a viable tool for public health screening.
Based on these findings, several steps could further improve prediction accuracy and help identify individuals at high risk of diabetes:
Techniques such as SMOTE (Synthetic Minority Over-sampling Technique) or under-sampling of the majority class should be considered to better balance the dataset and enhance model performance, particularly for prediabetes and diabetes classes.
Implementing other machine learning models like Random Forest, Gradient Boosting Machines, or Neural Networks may help handle class imbalance more effectively and provide better performance across all classes.
Evaluating model performance using Precision, Recall, and F1-Score (especially for the minority classes) can provide more detailed insights into model effectiveness and areas for improvement; a brief sketch follows this list.
Performing detailed feature importance analysis using methods like SHAP (SHapley Additive exPlanations) values would provide greater transparency and understanding of how each feature influences the prediction outcome.
Improving model interpretability through techniques like SHAP or LIME (Local Interpretable Model-agnostic Explanations) can help clinicians and public health officials better understand the drivers of diabetes risk and make more informed decisions.
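For the metric-focused recommendation above, per-class Precision, Recall, and F1 can be pulled directly from the caret confusion matrix. The sketch assumes the predictions and matching test labels from the multinomial-model section are still in scope (note that diabetesTest is later overwritten by the 10,000-row RFE sample):
# Per-class Precision, Recall, and F1 from the caret confusion matrix
cm <- confusionMatrix(multinom_pred_class, diabetesTest$Diabetes)
cm$byClass[, c("Precision", "Recall", "F1")]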
This report demonstrates that the BRFSS 2015 data is a valuable resource for predicting diabetes risk. By identifying key risk factors and leveraging feature selection techniques, we successfully created an efficient and accurate predictive model. This research highlights the potential for creating streamlined screening tools that can help identify individuals at risk of diabetes earlier, ultimately contributing to improved public health outcomes. Further research and model refinement are recommended to continue improving the performance of these predictive tools.