Introduction


Diabetes is generally characterized by the body’s inability to produce sufficient amounts of insulin or to use the insulin effectively as needed. Diabetes is one of the most prevalent chronic diseases in many countries, affecting millions of people each year and placing a significant economic burden on the economy. Diabetes is a serious chronic disease that can lead to a loss of the ability to effectively regulate blood sugar levels, reducing quality of life and life expectancy. It can cause serious complications such as heart disease, blindness, leg amputations, and kidney disease with chronically high levels of sugar remaining in the blood of diabetes patients. There is no cure for diabetes, but strategies such as weight loss, a healthy diet, exercise and treatment can reduce the damage caused by the disease in many patients. Diabetes risk prediction models will be an important tool for the public and public health agencies as they may lead to more targeted treatments.

Problem Statement


It is also important to recognize the magnitude of the diabetes problem. The Centers for Disease Control and Prevention says that 34.2 million US citizens had diabetes in 2018, and 88 million had prediabetes. Additionally, the CDC estimates that about 1 in 5 of those with diabetes and 8 in 10 of those with prediabetes are unaware of their risk. There are many types of diabetes, and its prevalence varies by age, education, income, location, race, and other social determinants of health. Much of the burden of disease also falls on people of lower socioeconomic status. Diabetes also places a huge burden on the economy of the country. There is a need to develop a diabetes classification model which can help for early identification of the disease, using diabetes health indicators.

Research Questions


  • What are the trends or patterns of relevant factors contributing towards diabetes?
  • What algorithm to be used in the diabetes prediction model?
  • Which prediction model works the best by achieving the highest accuracy?

Research Objectives


  • To explore relevant factors contributing towards diabetes.
  • To develop diabetes prediction model by machine learning algorithms.
  • To evaluate the performance of diabetes prediction model.

Methodology


OSEMN

We follows the OSEMN Data Science Project Lifecycle in this study.

  1. Obtain: Data sample is obtained from Kaggle.
  2. Scrub: Data cleaning is performed by removing or replacing missing values with mean or mode imputation.
  3. Explore: Data visualization, correlation test and feature importance are carried out.
  4. Modeling: Prediction model is developed based on classification (Random Forest, Logistic Regression & K-Nearest Neighbors) and regression (Decision Tree, Multiple Linear Model & K-Nearest Neighbors) methods.
  5. Interpret: Comparative test between the prediction models are measured by percentage of accuracy and R-squared.

Library

library(dplyr) #for data cleaning/modification
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse) #for data cleaning/modification
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   0.3.5
## ✔ tibble  3.1.8     ✔ stringr 1.5.0
## ✔ tidyr   1.2.1     ✔ forcats 0.5.2
## ✔ readr   2.1.3     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(lubridate) #for data cleaning/modification
## Loading required package: timechange
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(stringr) #for data cleaning/modification
library(readxl) #for data sampling
library(knitr) #for data sampling
library(ggplot2) #for data visualization
library(plotly) #for data visualization
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(creditmodel) #for correlation
## Package 'creditmodel' version 1.3.1
## 
## Attaching package: 'creditmodel'
## 
## The following object is masked from 'package:stringr':
## 
##     str_match
library(ggcorrplot) #for correlation
library(caret) #for data modeling
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(randomForest) #for data modeling
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
library (class) #for data modeling
library(tree) #for data modeling



Obtain


Data Description

The original dataset we started with, was the US Center for Disease Control’s (CDC) Behavioral Risk Factor Surveillance System’s (BRFSS) survey of 2015. The dataset can be found from Kaggle (https://www.kaggle.com/datasets/cdc/behavioral-risk-factor-surveillance-system?resource=download&select=2015.csv).

According to the Kaggle page, the Behavioral Risk Factor Surveillance System (BRFSS) is the nation’s premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world.

So, that is a big dataset of more than 400,000 participants. It also contains 330 columns of factor information, covering a wide range of topics from the user’s demographics to current situation, survey understanding and consent, healthcare access, physical activity, consumption behavior (tobacco, alcohol, food types) and of course, data on health conditions (like eye problems, mental health, asthma, cancer, kidney disease, diabetes, high blood pressure etc.), medication.

Overall it is a very comprehensive survey. However, that makes it hard to handle in its entirety.

We have therefore filtered out 22 most relevant factors that relates to diabetes based on the study in Kaggle (https://www.kaggle.com/code/alexteboul/diabetes-health-indicators-dataset-notebook/notebook).

A brief explanation of the dataset used and factors are given below.

Details Description
Title Diabetes Health Indicators Dataset
Source Kaggle
Link https://www.kaggle.com/datasets/cdc/behavioral-risk-factor-surveillance-system
Years 2015
File Name diabetes.xlsx
File Size 36.1 MB
Dimension Rows: 441456 | Attributes: 22
Purpose To record the relevant factors of diabetes.



Data Attributes

No Attribute Question.Description Response
1 Diabetes_012 Main Response Variable. (Ever told) you have diabetes (If “Yes” and respondent is female, ask “Was this only when you were pregnant?” 1 = Yes; 2 = Yes, but female told only during pregnancy; 3 = No; 4 = No, pre-diabetes or borderline diabetes; 7 = Don’t know/Not Sure; 9 = Refused BLANK: Not asked or Missing
2 HighBP Adults who have been told they have high blood pressure by a doctor, nurse, or other health professional 1 = No; 2 = Yes; 9 = Don’t know/Not Sure/Refused/Missing
3 HighChol Have you EVER been told by a doctor, nurse or other health professional that your blood cholesterol is high? 1 = Yes; 2 = No; 7 = Don’t know/Not Sure; 9 = Refused; BLANK = Not asked or Missing
4 CholCheck Cholesterol check within past five years 1 = Had cholesterol checked in past 5 years; 2 = Did not have cholesterol checked in past 5 years; 7 = Have never had cholesterol checked; 9 = Don’t know/Not Sure Or Refused/Missing
5 BMI Body Mass Index (BMI) 1-9999 = 1 or greater; BLANK = Don’t know/Refused/Missing
6 Smoker Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes] 1 = Yes; 2 = No; 7 = Don’t know/Not Sure; 9 = Refused; BLANK = Not asked or Missing
7 Stroke 1 = Yes; 2 = No; 7 = Don’t know/Not Sure; 9 = Refused
8 HeartDiseaseorAttack Respondents that have ever reported having coronary heart disease (CHD) or myocardial infarction (MI) 1 = Reported having MI or CHD; 2 = Did not report having MI or CHD; BLANK = Not asked or Missing
9 PhysActivity Adults who reported doing physical activity or exercise during the past 30 days other than their regular job 1 = Had physical activity or exercise; 2 = No physical activity or exercise in last 30 days; 9 = Don’t know/Refused/Missing
10 Fruits Consume Fruit 1 or more times per day 1 = Consumed fruit one or more times per day; 2 = Consumed fruit less than one time per day; 9 = Don´t know, refused or missing values
11 Veggies Consume Vegetables 1 or more times per day 1 = Consumed vegetables one or more times per day; 2 = Consumed vegetables less than one time per day; 9 = Don´t know, refused or missing values
12 HvyAlcoholConsump Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week) 1 = No; 2 = Yes, 9 = Don’t know/Refused/Missing
13 AnyHealthcare Do you have any kind of health care coverage, including health insurance, prepaid plans such as HMOs, or government plans such as Medicare, or Indian Health Service? 1 = Yes; 2 = No; 7 = Don’t know/Not Sure; 9 = Refused
14 NoDocbcCost Was there a time in the past 12 months when you needed to see a doctor but could not because of cost? 1 = Yes; 2 = No; 7 = Don’t know/Not Sure; 9 = Refused; BLANK = Not asked or Missing
15 GenHlth Would you say that in general your health is: 1 = Excellent; 2 = Very good; 3 = Good; 4 = Fair; 5 = Poor; 7 = Don’t know/Not Sure; 9 = Refused; BLANK = Not asked or Missing
16 MentHlth Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good? 1-30 = Number of days; 88 = None; 77 = Don’t know/Not sure; BLANK = Refused
17 PhysHlth Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good? 1-30 = Number of days; 88 = None; 77 = Don’t know/Not sure; BLANK = Refused
18 DiffWalk Do you have serious difficulty walking or climbing stairs? 1 = Yes; 2 = No; 7 = Don’t know/Not Sure; 9 = Refused; BLANK = Not asked or Missing
19 Sex Indicate sex of respondent. 1 = Male; 2 = Female
20 Age Fourteen-level age category 1 = Age 18 to 24; 2 = Age 25 to 29; 3 = Age 30 to 34; 4 = Age 35 to 39; 5 = Age 40 to 44; 6 = Age 45 to 49; 7 = Age 50 to 54; 8 = Age 55 to 59; 9 = Age 60 to 64; 10 = Age 65 to 69; 11 = Age 70 to 74; 12 = Age 75 to 79; 13 = Age 80 or older; 14 = Don’t know/Refused/Missing
21 Education What is the highest grade or year of school you completed? 1 = Never attended school or only kindergarten; 2 = Grades 1 through 8 (Elementary); 3 = Grades 9 through 11 (Some high school); 4 = Grade 12 or GED (High school graduate); 5 = College 1 year to 3 years (Some college or technical school); 6 = College 4 years or more (College graduate); 9 = Refused
22 Income Is your annual household income from all sources: (If respondent refuses at any income level, code “Refused.”) 1 = Less than $10,000; 2 = Less than $15,000 ($10,000 to less than $15,000); 3 = Less than $20,000 ($15,000 to less than $20,000); 4 = Less than $25,000 ($20,000 to less than $25,000); 5 = Less than $35,000 ($25,000 to less than $35,000); 6 = Less than $50,000 ($35,000 to less than $50,000); 7 = Less than $75,000 ($50,000 to less than $75,000); 8 = $75,000 or more; 77 = Don’t know/Not sure; 99 = Refused; BLANK = Not asked or Missing



Data Overview

# read the data from excel
diabetes_df<-data.frame(read_excel("diabetes.xlsx"))
head(diabetes_df)
##   Diabetes_012 HighBP HighChol CholCheck  BMI Smoker Stroke
## 1            3      2        1         1 4018      1      2
## 2            3      1        2         2 2509      1      2
## 3            3      1        1         1 2204     NA      1
## 4            3      2        1         1 2819      2      2
## 5            3      1        2         1 2437      2      2
## 6            3      2        2         1 2652      2      2
##   HeartDiseaseorAttack PhysActivity Fruits Veggies HvyAlcoholConsump
## 1                    2            2      2       1                 1
## 2                    2            1      2       2                 1
## 3                   NA            9      9       9                 9
## 4                    2            2      1       2                 1
## 5                    2            2      9       1                 1
## 6                    2            1      1       1                 1
##   AnyHealthcare NoDocbcCost GenHlth MentHlth PhysHlth DiffWalk Sex Age
## 1             1           2       5       18       15        1   2   9
## 2             2           1       3       88       88        2   2   7
## 3             1           2       4       88       15       NA   2  11
## 4             1           1       5       30       30        1   2   9
## 5             1           2       5       88       20        2   2   9
## 6             1           2       2       88       88        2   2  11
##   Education Income
## 1         4      3
## 2         6      1
## 3         4     99
## 4         4      8
## 5         5     77
## 6         3      6



Scrub


Data Checking

Glimpse of Dataset

glimpse(diabetes_df)
## Rows: 441,456
## Columns: 22
## $ Diabetes_012         <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 3, 3, 3, 1…
## $ HighBP               <dbl> 2, 1, 1, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2…
## $ HighChol             <dbl> 1, 2, 1, 1, 2, 2, 1, 1, NA, 1, 2, 1, 1, 1, 2, 1, …
## $ CholCheck            <dbl> 1, 2, 1, 1, 1, 1, 1, 1, 9, 1, 1, 1, 1, 1, 1, 1, 1…
## $ BMI                  <dbl> 4018, 2509, 2204, 2819, 2437, 2652, 2389, 3382, 1…
## $ Smoker               <dbl> 1, 1, NA, 2, 2, 2, 2, 1, 2, 1, 1, 2, 1, 2, 2, 1, …
## $ Stroke               <dbl> 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2…
## $ HeartDiseaseorAttack <dbl> 2, 2, NA, 2, 2, 2, 2, NA, 2, 2, 2, 1, 2, 2, 1, 2,…
## $ PhysActivity         <dbl> 2, 1, 9, 2, 2, 1, 1, 1, 1, 1, 2, 2, 9, 9, 9, 1, 2…
## $ Fruits               <dbl> 2, 2, 9, 1, 9, 1, 1, 1, 9, 1, 2, 2, 9, 1, 9, 2, 1…
## $ Veggies              <dbl> 1, 2, 9, 2, 1, 1, 1, 1, 9, 1, 2, 1, 9, 9, 9, 1, 1…
## $ HvyAlcoholConsump    <dbl> 1, 1, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ AnyHealthcare        <dbl> 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ NoDocbcCost          <dbl> 2, 1, 2, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2…
## $ GenHlth              <dbl> 5, 3, 4, 5, 5, 2, 2, 5, 5, 2, 3, 5, 5, 4, 3, 3, 5…
## $ MentHlth             <dbl> 18, 88, 88, 30, 88, 88, 3, 88, 88, 88, 88, 88, 88…
## $ PhysHlth             <dbl> 15, 88, 15, 30, 20, 88, 88, 8, 77, 2, 14, 88, 5, …
## $ DiffWalk             <dbl> 1, 2, NA, 1, 2, 2, 2, 1, 7, 2, 2, 1, 1, 1, 1, 1, …
## $ Sex                  <dbl> 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2…
## $ Age                  <dbl> 9, 7, 11, 9, 9, 11, 11, 13, 13, 10, 9, 13, 11, 13…
## $ Education            <dbl> 4, 6, 4, 4, 5, 3, 5, 3, 3, 6, 6, 5, 5, 5, 5, 4, 5…
## $ Income               <dbl> 3, 1, 99, 8, 77, 6, 4, 3, 77, 8, 7, 77, 5, 7, 77,…


Data Quality Overview

# check NA values
any(is.na(diabetes_df))
## [1] TRUE
# count the total rows with no NA values
nrow(na.omit(diabetes_df))
## [1] 343606
# count the total rows with not complete cases
nrow(diabetes_df[!complete.cases(diabetes_df),])
## [1] 97850



Data Cleaning/Modification

Remove Duplicate Records

diabetes_df <- unique(diabetes_df)
nrow(diabetes_df)
## [1] 434358

7,098 duplicate records removed.

Now, Replace missing (““) values to NA and show column names with NA values

##  [1] "Diabetes_012"         "HighChol"             "BMI"                 
##  [4] "Smoker"               "HeartDiseaseorAttack" "NoDocbcCost"         
##  [7] "GenHlth"              "PhysHlth"             "DiffWalk"            
## [10] "Income"

Diabetes_012 is the target variable, so remove all records with NA values for this column

# remove NA 
diabetes_df<-diabetes_df %>%  filter(!is.na(Diabetes_012))
unique(diabetes_df$Diabetes_012)
## [1] 3 1 4 2 7 9

Mean imputation is used to fill NA values of BMI column

# replace NA by mean
diabetes_df<-diabetes_df %>% replace_na(list(BMI=round(mean(diabetes_df$BMI, na.rm = T), 2)))

Mode imputation is used to remove NA values of high cholesterol, smoker, heart disease or attack, no cost to consult doctor, general health, mental health, difficulty walking and income.

# replace NA by mode
diabetes_df<-diabetes_df %>% replace_na(list(HighChol=Mode(diabetes_df$HighChol)))
diabetes_df<-diabetes_df %>% replace_na(list(Smoker=Mode(diabetes_df$Smoker)))
diabetes_df<-diabetes_df %>% replace_na(list(HeartDiseaseorAttack=Mode(diabetes_df$HeartDiseaseorAttack)))
diabetes_df<-diabetes_df %>% replace_na(list(NoDocbcCost=Mode(diabetes_df$NoDocbcCost)))
diabetes_df<-diabetes_df %>% replace_na(list(GenHlth=Mode(diabetes_df$GenHlth)))
diabetes_df<-diabetes_df %>% replace_na(list(PhysHlth=Mode(diabetes_df$PhysHlth)))
diabetes_df<-diabetes_df %>% replace_na(list(DiffWalk=Mode(diabetes_df$DiffWalk)))
diabetes_df<-diabetes_df %>% replace_na(list(Income=Mode(diabetes_df$Income)))

Now, check names of columns with NA values.

## [1] "Diabetes_012 0"
## [1] "HighBP 0"
## [1] "HighChol 0"
## [1] "CholCheck 0"
## [1] "BMI 0"
## [1] "Smoker 0"
## [1] "Stroke 0"
## [1] "HeartDiseaseorAttack 0"
## [1] "PhysActivity 0"
## [1] "Fruits 0"
## [1] "Veggies 0"
## [1] "HvyAlcoholConsump 0"
## [1] "AnyHealthcare 0"
## [1] "NoDocbcCost 0"
## [1] "GenHlth 0"
## [1] "MentHlth 0"
## [1] "PhysHlth 0"
## [1] "DiffWalk 0"
## [1] "Sex 0"
## [1] "Age 0"
## [1] "Education 0"
## [1] "Income 0"

All duplicate, missing (““) and NA values are removed.

Diabetes_012, We make this ordinal. 0 is for no diabetes or only during pregnancy, 1 is for pre-diabetes or borderline diabetes and diabetes. Remove all 7 (don’t knows), all 9 (refused)

## [1] 0 1

High blood pressure Change 1 to 0 so it represents No high blood pressure and 2 to 1 so it represents high blood pressure and remove 9 (don’t know)

## [1] 1 0

High Cholesterol, Change 2 to 0 because it is No. Remove all 7 (don’t knows), 9 (refused)

## [1] 1 0

Cholesterol Check, Change 3 to 0 and 2 to 0 for Not checked cholesterol in past 5 years. Remove 9 (Don’t know)

## [1] 1 0

BMI (no changes, just note that these are BMI * 100. So for example a BMI of 2608 is really 26.08)

## [1] 22.04

Smoker, Change 2 to 0 because it is No, Remove all 7 (don’t knows) and 9 (refused)

## [1] 1 0

Stroke, Change 2 to 0 because it is No, Remove all 7 (don’t knows) and 9 (refused)

## [1] 0 1

Heart Disease or Attack, Change 2 to 0 because this means did not have MI or CHD

## [1] 0 1

Physical Activity, change 2 to 0 for no physical activity, Remove all 9 (don’t know/refused)

## [1] 0 1

Fruits, Change 2 to 0. this means no fruit consumed per day. 1 will mean consumed 1 or more pieces of fruit per day remove all don’t knows and missing 9

## [1] 0 1

Veggies, Change 2 to 0. this means no vegetables consumed per day. 1 will mean consumed 1 or more pieces of vegetable per day remove all don’t knows and missing 9

## [1] 1 0

Heavy Alcohol Consumption, Change 1 to 0 (1 was no for heavy drinking). change all 2 to 1 (2 was yes for heavy drinking) remove all don’t knows and missing 9

## [1] 0 1

Any Health care, 1 is yes, change 2 to 0 because it is No health care access remove 7 and 9 for don’t know or refused

## [1] 1 0

No cost to consult doctor, Change 2 to 0 for no, 1 is already yes, remove 7 for don’t know and 9 for refused

## [1] 0 1

General Health, This is an ordinal variable that I want to keep (1 is Excellent -> 5 is Poor) Remove 7 and 9 for don’t know and refused

## [1] 5 3 2 4 1

Mental Health, already in days so keep that, scale will be 0-30 change 88 to 0 because it means none (no bad mental health days). Remove 77 and 99 for don’t know not sure and refused

##  [1] 18  0 30  3  5 15 10  6 20  2 25  1 29  4  7  8 21 14 26  9 16 28 11 12 24
## [26] 17 13 23 27 19 22

Physical Health, already in days so keep that, scale will be 0-30 change 88 to 0 because it means none (no bad physical health days). Remove 77 and 99 for don’t know not sure and refused

##  [1] 15  0 30  8  2 14  6 28  1  7 10 20  3  5 17  4 19 21 12 25 27 22 29 24  9
## [26] 16 18 23 13 26 11

Difficulty Walking, change 2 to 0 for no. 1 is already yes, remove 7 and 9 for don’t know not sure and refused

## [1] 1 0

Sex, change 2 to 0 (female as 0). Male is 1

## [1] 0 1

Age, already ordinal. 1 is 18-24 all the way up to 13 with 80 and older. 5 year increments. Remove 14 because it is don’t know or missing

##  [1]  9  7 11 13 10  8  4  6  2  3 12  5  1

Education, This is already an ordinal variable with 1 being never attended school or kindergarten only up to 6 being college 4 years or more Scale here is 1-6. Remove 9 for refused

## [1] 4 6 3 5 2 1

Income is already ordinal with 1 being less than $10,000 all the way up to 8 being $75,000 or more. Remove 77 and 99 for don’t know and refused

## [1] 3 1 8 6 4 7 2 5

The binary columns are:

  1. Difficulty Walking
  2. Sex
  3. No Cost to Consult Doctor
  4. Any Health Care
  5. Heavy Alcohol Consumption
  6. Veggies
  7. Fruits
  8. Physical Activity
  9. Heart Disease or Attack
  10. Stroke
  11. Smoker
  12. Cholesterol Check
  13. High Cholesterol
  14. High Blood Pressure


Cleaned Data

Summary of attributes representation on cleaned data:

No Attribute Description
1 Diabetes_012 0 = no diabetes 1 = prediabetes 2 = diabetes
2 HighBP 0 = no high BP 1 = high BP
3 HighChol 0 = no high cholesterol 1 = high cholesterol
4 CholCheck 0 = no cholesterol check in 5 years 1 = yes cholesterol check in 5 years
5 BMI Body Mass Index
6 Smoker Smoked at least 100 cigarettes in entire life 0 = no 1 = yes
7 Stroke Stroke 0 = no 1 = yes
8 HeartDiseaseorAttack Coronary Heart Disease (CHD) or Myocardial Infarction (MI) 0 = no 1 = yes
9 PhysActivity Physical activity in past 30 days - not including job 0 = no 1 = yes
10 Fruits Consume Fruit 1 or more times per day 0 = no 1 = yes
11 Veggies Consume Vegetables 1 or more times per day 0 = no 1 = yes
12 HvyAlcoholConsump Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week) 0 = no 1 = yes
13 AnyHealthcare Health care coverage 0 = no 1 = yes
14 NoDocbcCost No cost to consult doctor in the past 12 months 0 = no 1 = yes
15 GenHlth General health 1 = excellent 2 = very good 3 = good 4 = fair 5 = poor
16 MentHlth Mental health including stress, depression, and problems with emotions, for the past 30 days was not good, scale 1-30 days
17 PhysHlth Physical health including physical illness and injury for the past 30 days was not good, scale 1-30 days
18 DiffWalk Serious difficulty walking or climbing stairs 0 = no 1 = yes
19 Sex 0 = female 1 = male
20 Age 1 = Age 18 to 24; 2 = Age 25 to 29; 3 = Age 30 to 34; 4 = Age 35 to 39; 5 = Age 40 to 44; 6 = Age 45 to 49; 7 = Age 50 to 54; 8 = Age 55 to 59; 9 = Age 60 to 64; 10 = Age 65 to 69; 11 = Age 70 to 74; 12 = Age 75 to 79; 13 = Age 80 or older
21 Education Education level 1 = Never attended school or only kindergarten 2 = Grades 1 through 8 (Elementary) 3 = Grades 9 through 11 (Some high school) 4 = Grade 12 or GED (High school graduate) 5 = College 1 year to 3 years (Some college or technical school) 6 = College 4 years or more (College graduate)
22 Income Income scale 1 = Less than $10,000; 2 = Less than $15,000 ($10,000 to less than $15,000); 3 = Less than $20,000 ($15,000 to less than $20,000); 4 = Less than $25,000 ($20,000 to less than $25,000); 5 = Less than $35,000 ($25,000 to less than $35,000); 6 = Less than $50,000 ($35,000 to less than $50,000); 7 = Less than $75,000 ($50,000 to less than $75,000); 8 = $75,000 or more

Cleaned data dimension:

dim(diabetes_df)
## [1] 292188     22

The cleaned data consists of 292,188 rows and 22 columns.

Explore


Data Structure

str(diabetes_df)
## 'data.frame':    292188 obs. of  22 variables:
##  $ Diabetes_012        : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ HighBP              : num  1 0 1 1 1 1 1 1 1 1 ...
##  $ HighChol            : num  1 0 1 0 1 1 1 0 1 1 ...
##  $ CholCheck           : num  1 0 1 1 1 1 1 1 1 1 ...
##  $ BMI                 : num  40.2 25.1 28.2 26.5 23.9 ...
##  $ Smoker              : num  1 1 0 0 0 1 1 1 1 1 ...
##  $ Stroke              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ HeartDiseaseorAttack: num  0 0 0 0 0 0 0 0 0 1 ...
##  $ PhysActivity        : num  0 1 0 1 1 1 1 0 1 0 ...
##  $ Fruits              : num  0 0 1 1 1 1 1 0 0 1 ...
##  $ Veggies             : num  1 0 0 1 1 1 1 0 1 1 ...
##  $ HvyAlcoholConsump   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ AnyHealthcare       : num  1 0 1 1 1 1 1 1 1 1 ...
##  $ NoDocbcCost         : num  0 1 1 0 0 1 0 0 0 0 ...
##  $ GenHlth             : num  5 3 5 2 2 5 2 3 3 5 ...
##  $ MentHlth            : num  18 0 30 0 3 0 0 0 0 30 ...
##  $ PhysHlth            : num  15 0 30 0 0 8 2 14 0 30 ...
##  $ DiffWalk            : num  1 0 1 0 0 1 0 0 1 1 ...
##  $ Sex                 : num  0 0 0 0 0 1 1 0 0 0 ...
##  $ Age                 : num  9 7 9 11 11 13 10 9 11 9 ...
##  $ Education           : num  4 6 4 3 5 3 6 6 4 5 ...
##  $ Income              : num  3 1 8 6 4 3 8 7 4 1 ...

All attributes are numerical data type. Hence, we have to convert the data type for categorical and ordinal data to factor.

diabetes_df$Diabetes_012 <- as.factor(diabetes_df$Diabetes_012)
diabetes_df$HighBP <- as.factor(diabetes_df$HighBP)
diabetes_df$HighChol <- as.factor(diabetes_df$HighChol)
diabetes_df$CholCheck <- as.factor(diabetes_df$CholCheck)
diabetes_df$Smoker <- as.factor(diabetes_df$Smoker)
diabetes_df$Stroke <- as.factor(diabetes_df$Stroke)
diabetes_df$HeartDiseaseorAttack <- as.factor(diabetes_df$HeartDiseaseorAttack)
diabetes_df$PhysActivity <- as.factor(diabetes_df$PhysActivity)
diabetes_df$Fruits <- as.factor(diabetes_df$Fruits)
diabetes_df$Veggies <- as.factor(diabetes_df$Veggies)
diabetes_df$HvyAlcoholConsump <- as.factor(diabetes_df$HvyAlcoholConsump)
diabetes_df$AnyHealthcare <- as.factor(diabetes_df$AnyHealthcare)
diabetes_df$NoDocbcCost <- as.factor(diabetes_df$NoDocbcCost)
diabetes_df$GenHlth <- as.factor(diabetes_df$GenHlth)
diabetes_df$DiffWalk <- as.factor(diabetes_df$DiffWalk)
diabetes_df$Sex <- as.factor(diabetes_df$Sex)
diabetes_df$Age <- as.factor(diabetes_df$Age)
diabetes_df$Education <- as.factor(diabetes_df$Education)
diabetes_df$Income <- as.factor(diabetes_df$Income)



Descriptive Analysis

summary(diabetes_df)
##  Diabetes_012 HighBP     HighChol   CholCheck       BMI        Smoker    
##  0:249162     0:174087   0:180933   0: 41928   Min.   :12.02   0:162828  
##  1: 43026     1:118101   1:111255   1:250260   1st Qu.:24.13   1:129360  
##                                                Median :27.44             
##                                                Mean   :28.29             
##                                                3rd Qu.:31.00             
##                                                Max.   :97.65             
##                                                                          
##  Stroke     HeartDiseaseorAttack PhysActivity Fruits     Veggies   
##  0:281019   0:267180             0: 74193     0:111368   0: 58574  
##  1: 11169   1: 25008             1:217995     1:180820   1:233614  
##                                                                    
##                                                                    
##                                                                    
##                                                                    
##                                                                    
##  HvyAlcoholConsump AnyHealthcare NoDocbcCost GenHlth       MentHlth     
##  0:275094          0: 19938      0:263429    1: 50974   Min.   : 0.000  
##  1: 17094          1:272250      1: 28759    2:101380   1st Qu.: 0.000  
##                                              3: 89498   Median : 0.000  
##                                              4: 36721   Mean   : 3.346  
##                                              5: 13615   3rd Qu.: 2.000  
##                                                         Max.   :30.000  
##                                                                         
##     PhysHlth      DiffWalk   Sex             Age         Education 
##  Min.   : 0.000   0:244363   0:164330   9      : 35386   1:   281  
##  1st Qu.: 0.000   1: 47825   1:127858   10     : 33976   2:  5624  
##  Median : 0.000                         8      : 33169   3: 12464  
##  Mean   : 4.194                         7      : 28762   4: 76199  
##  3rd Qu.: 3.000                         11     : 24836   5: 82321  
##  Max.   :30.000                         6      : 22146   6:115299  
##                                         (Other):113913             
##      Income     
##  8      :95843  
##  7      :49170  
##  6      :42884  
##  5      :31231  
##  4      :24974  
##  3      :20263  
##  (Other):27823

The summary shows the total number of factor data and statistical measures for numerical data.

Data Visualization

1. Exploratory Data Analysis

# pie chart
plot_ly(diabetes_df, labels=~Diabetes_012, type="pie", title="Diabetes Status")

There are 85.3% of respondent don’t have diabetes, 14.7% are facing diabetes health problem.

# bar graph
ggplotly(ggplot(diabetes_df, aes(HighBP, fill=Diabetes_012))+geom_bar()+ggtitle("Total Number of High Blood Pressure"))

Most of the respondents do not have high blood pressure. We can observe that Most of the respondents having diabetes in the high blood pressure group than no high blood pressure group.

ggplotly(ggplot(diabetes_df, aes(HighChol, fill=Diabetes_012))+geom_bar()+ggtitle("Total Number of High Cholesterol"))

Most of the respondents do not have high cholesterol. We can observe that most of the respondents having diabetes in the high cholesterol group than no high cholesterol group.

ggplotly(ggplot(diabetes_df, aes(CholCheck, fill=Diabetes_012))+geom_bar()+ggtitle("Total Number of Cholesterol Check"))

Most of the respondents did cholesterol health check in the past 5 years.

# histogram
ggplotly(ggplot(diabetes_df, aes(BMI)) + geom_histogram(binwidth=3, colour="black", fill="white")+ ggtitle("Distribution of BMI"))

The BMI distribution is skewed to left. Most of the respondents have 23-28kg/m^2, which is from normal or overweight category.

# boxplot
ggplot(diabetes_df, aes(x=BMI, y=Diabetes_012, fill=Diabetes_012)) + geom_boxplot() + ggtitle("Distribution of BMI")



The BMI mean is higher in diabetes category.

ggplotly(ggplot(diabetes_df, aes(Smoker, fill=Diabetes_012))+geom_bar()+ggtitle("Total Number of Smoker"))

Most of the respondents are not a smoker. We can observe that slightly most of the respondents having diabetes in smoker category than non-smoker.

ggplotly(ggplot(diabetes_df, aes(HeartDiseaseorAttack, fill=Diabetes_012))+geom_bar()+ggtitle("Total Number of Heart Disease/Attack"))

Most of the respondents do not have heart disease.

ggplotly(ggplot(diabetes_df, aes(PhysActivity, fill=Diabetes_012))+geom_bar()+ggtitle("Total Number of Doing Physical Activity"))

Most of the respondents did physical activites in the past 30 days.

ggplotly(ggplot(diabetes_df, aes(Fruits, fill=Diabetes_012))+geom_bar()+ggtitle("Total Number of Fruits Consumption"))

Most of the respondents consumed fruits.

ggplotly(ggplot(diabetes_df, aes(Veggies, fill=Diabetes_012))+geom_bar()+ggtitle("Total Number of Vegetables Consumption"))

Most of the respondents consumed vegetables.

ggplotly(ggplot(diabetes_df, aes(HvyAlcoholConsump, fill=Diabetes_012))+geom_bar()+ggtitle("Total Number of Heavy Alcohol Consumption"))

Most of the respondents did not have heavy alcohol consumption.

ggplotly(ggplot(diabetes_df, aes(AnyHealthcare, fill=Diabetes_012))+geom_bar()+ggtitle("Total Number of Healthcare Coverage"))

Most of the respondents have healthcare coverage.

ggplotly(ggplot(diabetes_df, aes(NoDocbcCost, fill=Diabetes_012))+geom_bar()+ggtitle("Total Number of No Cost for Doctor Consultation"))

Most of the respondents can afford doctor consultant fee.

ggplotly(ggplot(diabetes_df, aes(GenHlth, fill=Diabetes_012))+geom_bar()+ggtitle("Total Number of General Health Level"))

Most of the respondents are healthy in general. Most of the respondents having diabetes in good condition (level 3).

ggplotly(ggplot(diabetes_df, aes(MentHlth)) + geom_histogram(binwidth=3, colour="black", fill="white")+ ggtitle("Distribution of Days of Having Mental Health Problem"))

The distribution of having mental health days is skewed to left. Most of the respondents have less than 3 days facing mental health for the past 30 days.

ggplot(diabetes_df, aes(x=MentHlth, y=Diabetes_012, fill=Diabetes_012)) + geom_boxplot() + ggtitle("Distribution of Days of Having Mental Health Problem")



The mean of mental health for each diabetes level are the same, which is nearly to 0.

ggplotly(ggplot(diabetes_df, aes(PhysHlth)) + geom_histogram(binwidth=3, colour="black", fill="white")+ ggtitle("Distribution of Days of Having Physical Health Problem"))

The distribution of having physical health days is skewed to left. Most of the respondents have less than 3 days facing physical health for the past 30 days.

ggplot(diabetes_df, aes(x=PhysHlth, y=Diabetes_012, fill=Diabetes_012)) + geom_boxplot() + ggtitle("Distribution of Days of Having Physical Health Problem")



The mean for facing physical health is higher in diabetes category.

ggplotly(ggplot(diabetes_df, aes(DiffWalk, fill=Diabetes_012))+geom_bar()+ggtitle("Total Number of Facing Walking Difficulty"))

Most of the respondents do not facing any difficulty in walking.

ggplotly(ggplot(diabetes_df, aes(Sex, fill=Diabetes_012))+geom_bar()+ggtitle("Total Number of Gender"))

Most of the respondents are female.

ggplotly(ggplot(diabetes_df, aes(Education, fill=Diabetes_012))+geom_bar()+ggtitle("Total Number By Education Level"))

Most of the respondents are college gradute. Most of the high school graduate respondents having diabetes than other education levels.

ggplotly(ggplot(diabetes_df, aes(Income, fill=Diabetes_012))+geom_bar()+ggtitle("Total Number By Income Group"))

Most of the respondents are from high income category.

2. Most of the interesting plots with multiple variables

ggplotly(ggplot(diabetes_df, aes(HighBP, fill=Diabetes_012))+geom_bar()+ggtitle("Total Number of High Blood Pressure vs High Cholesterol")+facet_wrap(~HighChol))

Most the respondents who having diabetes, also have high blood pressure and high cholesterol.

ggplotly(ggplot(diabetes_df, aes(Smoker, fill=Diabetes_012))+geom_bar()+ggtitle("Total Number of Smoker vs Heavy Alcohol Consumption")+facet_wrap(~HvyAlcoholConsump))

Most respondents do not drink heavy alcohol. However, smoking is one of the risk factors for diabetes, as slightly most of the people facing diabetes than non-smokers despite not drinking heavily.

ggplotly(ggplot(diabetes_df, aes(HeartDiseaseorAttack, fill=Diabetes_012))+geom_bar()+ggtitle("Total Number of Facing Heart Disease vs Stroke")+facet_wrap(~Stroke))

Most of the respondents do not have heart disease and stroke.

ggplotly(ggplot(diabetes_df, aes(Smoker, fill=Sex))+geom_bar()+ggtitle("Total Number of Smoker vs Heavy Alcohol Consumption by Gender")+facet_wrap(~HvyAlcoholConsump))

Among non-heavy drinking respondents, the total number of male smokers and non-smokers were nearly identical.

ggplotly(ggplot(diabetes_df, aes(Income, fill=NoDocbcCost))+geom_bar()+ggtitle("Total Number of Income Level vs General Health by No Cost for Doctor Consultation")+facet_wrap(~GenHlth))

High income respondents are healthier than low income respondents, although low income respondent can still afford doctor consultant fee.

Relationship

1. Between categorical variables by Cramer’s V

# cramer's V
r1 <- diabetes_df %>% select_if(is.factor) %>% char_cor() %>% round(2)
# correlation matrix
ggcorrplot(r1)



From the Cramer’s V correlation matrix, we found that all the categorical variables have weak or moderate strong positive relationship between each other. Let’s confirm with the Cramer’s V correlation coefficient.

##                      Diabetes_012 HighBP HighChol CholCheck Smoker Stroke
## Diabetes_012                 1.00   0.28     0.23      0.13   0.06   0.11
## HighBP                       0.28   1.00     0.32      0.21   0.09   0.13
## HighChol                     0.23   0.32     1.00      0.28   0.08   0.10
## CholCheck                    0.13   0.21     0.28      1.00   0.01   0.05
## Smoker                       0.06   0.09     0.08      0.01   1.00   0.06
## Stroke                       0.11   0.13     0.10      0.05   0.06   1.00
## HeartDiseaseorAttack         0.18   0.21     0.19      0.09   0.11   0.20
## PhysActivity                 0.11   0.11     0.06      0.02   0.08   0.06
## Fruits                       0.03   0.02     0.02      0.07   0.07   0.01
## Veggies                      0.05   0.04     0.01      0.06   0.02   0.03
## HvyAlcoholConsump            0.06   0.01     0.02      0.03   0.11   0.02
## AnyHealthcare                0.04   0.07     0.09      0.22   0.02   0.02
## NoDocbcCost                  0.02   0.00     0.02      0.12   0.05   0.03
## GenHlth                      0.29   0.29     0.20      0.06   0.16   0.19
## DiffWalk                     0.22   0.23     0.15      0.09   0.12   0.18
## Sex                          0.02   0.04     0.01      0.05   0.10   0.00
## Age                          0.22   0.38     0.34      0.39   0.14   0.14
## Education                    0.11   0.11     0.02      0.10   0.17   0.07
## Income                       0.15   0.14     0.04      0.12   0.12   0.11
##                      HeartDiseaseorAttack PhysActivity Fruits Veggies
## Diabetes_012                         0.18         0.11   0.03    0.05
## HighBP                               0.21         0.11   0.02    0.04
## HighChol                             0.19         0.06   0.02    0.01
## CholCheck                            0.09         0.02   0.07    0.06
## Smoker                               0.11         0.08   0.07    0.02
## Stroke                               0.20         0.06   0.01    0.03
## HeartDiseaseorAttack                 1.00         0.08   0.01    0.03
## PhysActivity                         0.08         1.00   0.14    0.15
## Fruits                               0.01         0.14   1.00    0.26
## Veggies                              0.03         0.15   0.26    1.00
## HvyAlcoholConsump                    0.03         0.01   0.04    0.02
## AnyHealthcare                        0.03         0.04   0.04    0.04
## NoDocbcCost                          0.02         0.06   0.05    0.03
## GenHlth                              0.26         0.26   0.09    0.11
## DiffWalk                             0.21         0.24   0.04    0.07
## Sex                                  0.08         0.03   0.09    0.07
## Age                                  0.24         0.09   0.10    0.04
## Education                            0.08         0.19   0.11    0.15
## Income                               0.12         0.19   0.08    0.15
##                      HvyAlcoholConsump AnyHealthcare NoDocbcCost GenHlth
## Diabetes_012                      0.06          0.04        0.02    0.29
## HighBP                            0.01          0.07        0.00    0.29
## HighChol                          0.02          0.09        0.02    0.20
## CholCheck                         0.03          0.22        0.12    0.06
## Smoker                            0.11          0.02        0.05    0.16
## Stroke                            0.02          0.02        0.03    0.19
## HeartDiseaseorAttack              0.03          0.03        0.02    0.26
## PhysActivity                      0.01          0.04        0.06    0.26
## Fruits                            0.04          0.04        0.05    0.09
## Veggies                           0.02          0.04        0.03    0.11
## HvyAlcoholConsump                 1.00          0.02        0.01    0.04
## AnyHealthcare                     0.02          1.00        0.26    0.05
## NoDocbcCost                       0.01          0.26        1.00    0.18
## GenHlth                           0.04          0.05        0.18    1.00
## DiffWalk                          0.04          0.02        0.11    0.49
## Sex                               0.01          0.04        0.04    0.02
## Age                               0.05          0.18        0.15    0.08
## Education                         0.02          0.17        0.12    0.14
## Income                            0.05          0.20        0.22    0.19
##                      DiffWalk  Sex  Age Education Income
## Diabetes_012             0.22 0.02 0.22      0.11   0.15
## HighBP                   0.23 0.04 0.38      0.11   0.14
## HighChol                 0.15 0.01 0.34      0.02   0.04
## CholCheck                0.09 0.05 0.39      0.10   0.12
## Smoker                   0.12 0.10 0.14      0.17   0.12
## Stroke                   0.18 0.00 0.14      0.07   0.11
## HeartDiseaseorAttack     0.21 0.08 0.24      0.08   0.12
## PhysActivity             0.24 0.03 0.09      0.19   0.19
## Fruits                   0.04 0.09 0.10      0.11   0.08
## Veggies                  0.07 0.07 0.04      0.15   0.15
## HvyAlcoholConsump        0.04 0.01 0.05      0.02   0.05
## AnyHealthcare            0.02 0.04 0.18      0.17   0.20
## NoDocbcCost              0.11 0.04 0.15      0.12   0.22
## GenHlth                  0.49 0.02 0.08      0.14   0.19
## DiffWalk                 1.00 0.07 0.23      0.17   0.29
## Sex                      0.07 1.00 0.05      0.04   0.12
## Age                      0.23 0.05 1.00      0.07   0.09
## Education                0.17 0.04 0.07      1.00   0.21
## Income                   0.29 0.12 0.09      0.21   1.00

Most of the categorical variables have correlation coefficient less than 0.3. The highest coefficient is 0.49, from general health and difficult in walking. Hence, we can say that general health is moderate strong correlated to difficult in walking.

2. Between numerical variables by Pearson correlation

# pearson
r2 <- diabetes_df %>% select_if(is.numeric) %>% cor(method = "pearson") %>% round(2)
ggcorrplot(r2)

# scatterplot
ggplotly(ggplot(diabetes_df, aes(x=PhysHlth, y=MentHlth)) + geom_smooth(method=lm, se=FALSE) + ggtitle("Physical Health against Mental Health"))
## `geom_smooth()` using formula = 'y ~ x'
ggplotly(ggplot(diabetes_df, aes(x=BMI, y=MentHlth)) + geom_smooth(method=lm, se=FALSE) + ggtitle("BMI against Mental Health"))
## `geom_smooth()` using formula = 'y ~ x'
ggplotly(ggplot(diabetes_df, aes(x=BMI, y=PhysHlth)) + geom_smooth(method=lm, se=FALSE) + ggtitle("BMI against Physical Health"))
## `geom_smooth()` using formula = 'y ~ x'



All the numerical variables are positive correlated. Let’s discover the strength of relationship below.

##           BMI MentHlth PhysHlth
## BMI      1.00     0.07     0.11
## MentHlth 0.07     1.00     0.34
## PhysHlth 0.11     0.34     1.00

They have poor positive relationship to each other.

From the correlation coefficients between categorical variables and between numerical variables, all the variables have weak relationship. This also can conclude that our dataset do not have redundant variables.

Redundant variable: correlation > 0.75

Feature Importance

# use roc_curve area as score
roc_imp <- filterVarImp(x = diabetes_df[,2:ncol(diabetes_df)], y = diabetes_df$Diabetes_012)

# sort the score in decreasing order
roc_imp <- data.frame(cbind(variable = rownames(roc_imp), score = roc_imp[,1]))
roc_imp$score <- as.double(roc_imp$score)
roc_imp[order(roc_imp$score,decreasing = TRUE),]
##                variable     score
## 14              GenHlth 0.7210405
## 1                HighBP 0.6937586
## 4                   BMI 0.6844951
## 19                  Age 0.6715637
## 2              HighChol 0.6565158
## 17             DiffWalk 0.6172731
## 21               Income 0.6149303
## 16             PhysHlth 0.6097001
## 20            Education 0.5791783
## 7  HeartDiseaseorAttack 0.5707880
## 8          PhysActivity 0.5672897
## 3             CholCheck 0.5654546
## 5                Smoker 0.5407487
## 6                Stroke 0.5291810
## 10              Veggies 0.5255615
## 15             MentHlth 0.5224938
## 9                Fruits 0.5190999
## 11    HvyAlcoholConsump 0.5188765
## 18                  Sex 0.5132233
## 12        AnyHealthcare 0.5131637
## 13          NoDocbcCost 0.5089957

The top 3 most important features are: General Health, High Blood Pressure, BMI

The top 3 least important features are: Gender, Healthcare, No Cost to Consult Doctor



Modelling & Interpret


Data Balancing

Our target variable ‘Diabetes_012’ is imbalanced. Most records represent not diabetes cases. Also our dataset is quite large and would require a lot of computational resources for modeling. To solve these problems we perform undersampling of not diabetes records to have the equal amount for both groups.

# undersampling
set.seed(123)

balance <- function(df) {
  number_of_diabetes <- nrow(df[df$Diabetes_012 == 1, ])

  no_diabetes_ids <- which(df$Diabetes_012 == 0)
  sample_ids <- sample(no_diabetes_ids, number_of_diabetes)

  sample_no_diabetes_df <- df[sample_ids, ]
  only_diabetes_df <- df[df$Diabetes_012 == 1, ]

  return(rbind(sample_no_diabetes_df, only_diabetes_df))
}

balanced_df <- balance(diabetes_df)
invisible(gc())

print(paste("Number of rows before balancing:", nrow(diabetes_df)))
## [1] "Number of rows before balancing: 292188"
print(paste("Number of rows after balancing:", nrow(balanced_df)))
## [1] "Number of rows after balancing: 86052"
table(balanced_df$Diabetes_012)
## 
##     0     1 
## 43026 43026



Data Splitting

For modeling, we splitted the balanced data into 80% training set and 20% test set.

# Test train split function
get_train_test <- function(df, target_name) {
  train_ids <- createDataPartition(df$Diabetes_012, p=0.8, list=FALSE)
  df_train <- df[train_ids,]
  df_test <- df[-train_ids,]

  x_train <- df_train[, !(names(df_train) %in% target_name)]
  y_train <- df_train[, target_name]

  x_test <- df_test[ , !(names(df_test) %in% target_name)]
  y_test <- df_test[, target_name]

  return(list("df_train" = df_train, 
              "df_test" = df_test, 
              "x_train" = x_train, 
              "y_train" = y_train,
              "x_test" = x_test,
              "y_test" = y_test))
   
}
# Train test split
train_test <- get_train_test(balanced_df, 'Diabetes_012')
invisible(gc())



Classification

Diabetes Prediction Models:

1. Random Forest (RF)

# RF
rf_model <- randomForest(Diabetes_012~., data=train_test[["df_train"]])

predictions <- predict(rf_model,train_test[["x_test"]])
confusionMatrix(predictions, train_test[["y_test"]])
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 6087 1744
##          1 2518 6861
##                                           
##                Accuracy : 0.7524          
##                  95% CI : (0.7458, 0.7588)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5047          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7074          
##             Specificity : 0.7973          
##          Pos Pred Value : 0.7773          
##          Neg Pred Value : 0.7315          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3537          
##    Detection Prevalence : 0.4550          
##       Balanced Accuracy : 0.7524          
##                                           
##        'Positive' Class : 0               
## 
invisible(gc())

The accuracy of the diabetes prediction is 75%.

Feature importance plot:

randomForest::varImpPlot(rf_model)



The most important features are: BMI, Age, General Health

The least important features are: AnyHealthCare, HvyAlcoholConsump, Stroke, NoDocbcCost

Modeling using only most important features:

df_important <- balanced_df[ , c("BMI", "Diabetes_012", "Age", "GenHlth")]

train_test_rf2 <- get_train_test(df_important, 'Diabetes_012')

rf_model_2 <- randomForest(Diabetes_012~., data=train_test_rf2[["df_train"]])

predictions <- predict(rf_model_2,train_test_rf2[["x_test"]])
confusionMatrix(predictions, train_test_rf2[["y_test"]])
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5885 1884
##          1 2720 6721
##                                           
##                Accuracy : 0.7325          
##                  95% CI : (0.7258, 0.7391)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.465           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.6839          
##             Specificity : 0.7811          
##          Pos Pred Value : 0.7575          
##          Neg Pred Value : 0.7119          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3420          
##    Detection Prevalence : 0.4514          
##       Balanced Accuracy : 0.7325          
##                                           
##        'Positive' Class : 0               
## 
invisible(gc())

We can see that using 3 most important features gives us 73% accuracy.

2. Logistic regression (LR)

# LR
lr <- glm (Diabetes_012~.,data = train_test[["df_train"]] , family = binomial)
summary(lr)
## 
## Call:
## glm(formula = Diabetes_012 ~ ., family = binomial, data = train_test[["df_train"]])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4084  -0.8010  -0.0193   0.8183   3.1766  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -5.713863   0.287344 -19.885  < 2e-16 ***
## HighBP1                0.686903   0.020131  34.122  < 2e-16 ***
## HighChol1              0.541555   0.019672  27.529  < 2e-16 ***
## CholCheck1             0.851680   0.041981  20.287  < 2e-16 ***
## BMI                    0.069035   0.001617  42.680  < 2e-16 ***
## Smoker1               -0.034020   0.019408  -1.753 0.079626 .  
## Stroke1                0.148176   0.042666   3.473 0.000515 ***
## HeartDiseaseorAttack1  0.249878   0.030076   8.308  < 2e-16 ***
## PhysActivity1         -0.062896   0.021645  -2.906 0.003663 ** 
## Fruits1               -0.020535   0.020066  -1.023 0.306138    
## Veggies1              -0.041328   0.023746  -1.740 0.081794 .  
## HvyAlcoholConsump1    -0.632887   0.047934 -13.203  < 2e-16 ***
## AnyHealthcare1         0.099366   0.044813   2.217 0.026599 *  
## NoDocbcCost1           0.083695   0.033820   2.475 0.013336 *  
## GenHlth2               0.687204   0.037849  18.157  < 2e-16 ***
## GenHlth3               1.311931   0.037465  35.017  < 2e-16 ***
## GenHlth4               1.765750   0.043616  40.484  < 2e-16 ***
## GenHlth5               1.897044   0.059028  32.138  < 2e-16 ***
## MentHlth              -0.002076   0.001305  -1.590 0.111725    
## PhysHlth              -0.001403   0.001280  -1.096 0.273101    
## DiffWalk1              0.139007   0.026620   5.222 1.77e-07 ***
## Sex1                   0.234423   0.019670  11.918  < 2e-16 ***
## Age2                   0.041629   0.127708   0.326 0.744449    
## Age3                   0.214900   0.117191   1.834 0.066689 .  
## Age4                   0.657148   0.110420   5.951 2.66e-09 ***
## Age5                   0.944047   0.107294   8.799  < 2e-16 ***
## Age6                   1.199964   0.104914  11.438  < 2e-16 ***
## Age7                   1.345305   0.103156  13.041  < 2e-16 ***
## Age8                   1.407128   0.102510  13.727  < 2e-16 ***
## Age9                   1.648687   0.102142  16.141  < 2e-16 ***
## Age10                  1.811573   0.102317  17.705  < 2e-16 ***
## Age11                  1.966091   0.103473  19.001  < 2e-16 ***
## Age12                  1.903056   0.105262  18.079  < 2e-16 ***
## Age13                  1.779927   0.105080  16.939  < 2e-16 ***
## Education2             0.028094   0.266214   0.106 0.915954    
## Education3            -0.109151   0.262643  -0.416 0.677712    
## Education4            -0.215612   0.260089  -0.829 0.407109    
## Education5            -0.152908   0.260235  -0.588 0.556816    
## Education6            -0.281317   0.260374  -1.080 0.279948    
## Income2               -0.076909   0.056223  -1.368 0.171337    
## Income3               -0.105254   0.053368  -1.972 0.048585 *  
## Income4               -0.129029   0.051986  -2.482 0.013066 *  
## Income5               -0.200807   0.051194  -3.922 8.76e-05 ***
## Income6               -0.282523   0.050004  -5.650 1.60e-08 ***
## Income7               -0.349961   0.050186  -6.973 3.10e-12 ***
## Income8               -0.491713   0.049522  -9.929  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 95435  on 68841  degrees of freedom
## Residual deviance: 69435  on 68796  degrees of freedom
## AIC: 69527
## 
## Number of Fisher Scoring iterations: 5
lr_predictions <- predict(lr, train_test[["x_test"]], type = 'response')
lr_predictions <- as.factor(ifelse(lr_predictions > 0.5,1,0))

confusionMatrix(lr_predictions, train_test[["y_test"]])
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 6265 1914
##          1 2340 6691
##                                           
##                Accuracy : 0.7528          
##                  95% CI : (0.7463, 0.7592)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5056          
##                                           
##  Mcnemar's Test P-Value : 7.214e-11       
##                                           
##             Sensitivity : 0.7281          
##             Specificity : 0.7776          
##          Pos Pred Value : 0.7660          
##          Neg Pred Value : 0.7409          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3640          
##    Detection Prevalence : 0.4752          
##       Balanced Accuracy : 0.7528          
##                                           
##        'Positive' Class : 0               
## 
invisible(gc())

Prediction accuracy is 75%.

The least important features according to the p-values of their coefficients are: Education, Veggies, Fruits, Smoker

3. K-Nearest Neighbors (KNN)

To train KNN model, we need to scale our numerical features first.

scaled_df <- balanced_df %>% mutate_at(c("BMI", "MentHlth", "PhysHlth"), ~(scale(.) %>% as.vector))
glimpse(scaled_df)
## Rows: 86,052
## Columns: 22
## $ Diabetes_012         <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ HighBP               <fct> 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0…
## $ HighChol             <fct> 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0…
## $ CholCheck            <fct> 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0…
## $ BMI                  <dbl> 1.3611031370, -2.0793836239, 0.1405155102, -0.853…
## $ Smoker               <fct> 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1…
## $ Stroke               <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ HeartDiseaseorAttack <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
## $ PhysActivity         <fct> 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0…
## $ Fruits               <fct> 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0…
## $ Veggies              <fct> 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1…
## $ HvyAlcoholConsump    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ AnyHealthcare        <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1…
## $ NoDocbcCost          <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1…
## $ GenHlth              <fct> 3, 4, 2, 2, 2, 1, 2, 1, 1, 4, 3, 1, 5, 2, 2, 3, 2…
## $ MentHlth             <dbl> 0.14139593, -0.46860902, -0.46860902, -0.22460704…
## $ PhysHlth             <dbl> -0.5705128, 2.4504253, 0.1343727, -0.3691169, -0.…
## $ DiffWalk             <fct> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Sex                  <fct> 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0…
## $ Age                  <fct> 9, 8, 7, 7, 6, 8, 2, 3, 1, 10, 13, 5, 7, 11, 8, 6…
## $ Education            <fct> 5, 3, 5, 3, 4, 6, 6, 6, 6, 6, 4, 6, 3, 3, 5, 5, 5…
## $ Income               <fct> 5, 4, 8, 7, 8, 8, 7, 4, 7, 8, 5, 8, 2, 2, 8, 8, 7…
# KNN
train_test_knn <- get_train_test(scaled_df, "Diabetes_012")
knn_pred <- knn(train_test_knn[["x_train"]] , train_test_knn[["x_test"]] , train_test_knn[["y_train"]] , k = 20)
confusionMatrix(knn_pred, train_test_knn[["y_test"]])
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5942 1816
##          1 2663 6789
##                                           
##                Accuracy : 0.7397          
##                  95% CI : (0.7331, 0.7463)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4795          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.6905          
##             Specificity : 0.7890          
##          Pos Pred Value : 0.7659          
##          Neg Pred Value : 0.7183          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3453          
##    Detection Prevalence : 0.4508          
##       Balanced Accuracy : 0.7397          
##                                           
##        'Positive' Class : 0               
## 
invisible(gc())

The accuracy of KNN algorithms is 74%.

4. Summary

We trained 3 machine learning models to predict diabetes. All of them showed similar level of accuracy of 74-75%. Random Forest and Logistic Regression models showed better accuracy than KNN model.

Modeling showed that the most important features for predicting diabetes are:
I. BMI
II. Age
III. General Health
IV. High blood pressure
V. Income

The least important features are:
I. AnyHealthCare
II. HvyAlcoholConsump
III. Stroke
IV. NoDocbcCost
V. Education
VI. Veggies
VII. Fruits
VIII. Smoker

Modeling with 3 most important features (BMI, Age, General Health) provided 73% accuracy.



Regression

BMI Prediction

1. Decision Tree (DT)

# DT
train_test_bmi <- get_train_test(balanced_df, 'BMI')
tree_bmi <- tree (BMI ~ ., data = train_test_bmi[["df_train"]])
pred = predict(tree_bmi, newdata = train_test_bmi[["x_test"]])
postResample(pred = pred, obs = train_test_bmi[["y_test"]])
##     RMSE Rsquared      MAE 
## 6.537785 0.131683 4.671237
invisible(gc())

R squared value is low, which means the model doesn’t fit the data good enough.

2. Linear Model (LM)

# LM
linear_bmi <- lm(BMI~., data = train_test_bmi[["df_train"]])
predicted_bmi <- predict ( linear_bmi , train_test_bmi[["x_test"]])
summary(linear_bmi)
## 
## Call:
## lm(formula = BMI ~ ., data = train_test_bmi[["df_train"]])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.675  -3.965  -0.804   2.905  71.989 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           23.974133   0.708669  33.830  < 2e-16 ***
## Diabetes_0121          2.473905   0.057742  42.844  < 2e-16 ***
## HighBP1                2.065539   0.057217  36.100  < 2e-16 ***
## HighChol1              0.113044   0.054558   2.072 0.038272 *  
## CholCheck1             0.824533   0.092119   8.951  < 2e-16 ***
## Smoker1               -0.504504   0.050574  -9.976  < 2e-16 ***
## Stroke1               -0.964483   0.107016  -9.013  < 2e-16 ***
## HeartDiseaseorAttack1 -0.253403   0.077457  -3.272 0.001070 ** 
## PhysActivity1         -1.061638   0.056450 -18.807  < 2e-16 ***
## Fruits1               -0.409091   0.051907  -7.881 3.29e-15 ***
## Veggies1               0.075682   0.061378   1.233 0.217566    
## HvyAlcoholConsump1    -0.973331   0.117322  -8.296  < 2e-16 ***
## AnyHealthcare1         0.119642   0.109784   1.090 0.275810    
## NoDocbcCost1          -0.307977   0.085002  -3.623 0.000291 ***
## GenHlth2               1.163169   0.084934  13.695  < 2e-16 ***
## GenHlth3               2.108005   0.087075  24.209  < 2e-16 ***
## GenHlth4               2.091706   0.105653  19.798  < 2e-16 ***
## GenHlth5               1.554077   0.143001  10.868  < 2e-16 ***
## MentHlth               0.001772   0.003314   0.535 0.592788    
## PhysHlth              -0.002752   0.003305  -0.833 0.405013    
## DiffWalk1              2.365806   0.070208  33.697  < 2e-16 ***
## Sex1                   0.135490   0.050819   2.666 0.007675 ** 
## Age2                   1.730618   0.201948   8.570  < 2e-16 ***
## Age3                   2.589935   0.193109  13.412  < 2e-16 ***
## Age4                   2.479602   0.189096  13.113  < 2e-16 ***
## Age5                   2.699344   0.184402  14.638  < 2e-16 ***
## Age6                   2.382123   0.178276  13.362  < 2e-16 ***
## Age7                   1.896099   0.172237  11.009  < 2e-16 ***
## Age8                   1.554523   0.169983   9.145  < 2e-16 ***
## Age9                   0.877739   0.168728   5.202 1.98e-07 ***
## Age10                  0.601376   0.169921   3.539 0.000402 ***
## Age11                 -0.413402   0.174824  -2.365 0.018049 *  
## Age12                 -1.286323   0.182524  -7.047 1.84e-12 ***
## Age13                 -3.011033   0.181981 -16.546  < 2e-16 ***
## Education2             0.211805   0.694745   0.305 0.760468    
## Education3             0.634848   0.686999   0.924 0.355444    
## Education4             0.679325   0.681007   0.998 0.318511    
## Education5             0.758405   0.681377   1.113 0.265692    
## Education6             0.307953   0.681824   0.452 0.651515    
## Income2                0.251365   0.139639   1.800 0.071848 .  
## Income3                0.339004   0.133268   2.544 0.010968 *  
## Income4                0.289942   0.130781   2.217 0.026627 *  
## Income5                0.355019   0.128985   2.752 0.005918 ** 
## Income6                0.496276   0.126203   3.932 8.42e-05 ***
## Income7                0.511397   0.126945   4.028 5.62e-05 ***
## Income8                0.074972   0.124615   0.602 0.547423    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.328 on 68796 degrees of freedom
## Multiple R-squared:  0.1838, Adjusted R-squared:  0.1832 
## F-statistic: 344.2 on 45 and 68796 DF,  p-value: < 2.2e-16
postResample(pred = predicted_bmi, obs = train_test_bmi[["y_test"]])
##      RMSE  Rsquared       MAE 
## 6.3151518 0.1898009 4.5101785
invisible(gc())

R squared value is low as well.

3. K-Nearest Neighbors (KNN)

# KNN
scaled_df_bmi <- balanced_df %>% mutate_at(c("MentHlth", "PhysHlth"), ~(scale(.) %>% as.vector))
train_test_bmi_scaled <- get_train_test(scaled_df_bmi, 'BMI')
knnmodel = knnreg(train_test_bmi_scaled[["x_train"]], train_test_bmi_scaled[["y_train"]], k = 10)
pred_y = predict(knnmodel, train_test_bmi_scaled[["x_test"]])
postResample(pred = pred_y, obs = train_test_bmi_scaled[["y_test"]])
##      RMSE  Rsquared       MAE 
## 6.5290902 0.1457319 4.6713351
invisible(gc())

R squared score for this algorithm is very low as well.

4. Summary

We built 3 machine learning models to predict BMI from the other variables. All algorithms got a low R squared score. We make the conclusion that all algorithms that we tried did not provide a good fit for the given data. RMSE score was in the range from 6.3 to 6.6.



Interpret

We can make a conclusion that BMI, age and general health of a person can be used to predict the risk of getting diabetes. The other important factors are high blood pressure and person’s income. Such factors as either a person is or not a smoker, eats fruits and vegetables, and the level of education did not prove to be important in predicting diabetes.

Conclusion


In conclusion, we have achieved our 3 research objectives.

RO1: To explore relevant factors contributing towards diabetes.
We found that BMI, age, general health, high blood pressure and person’s income are the most relevant factors contributing towards diabetes.

RO2: To develop diabetes prediction model by machine learning algorithms.
Classification models such as Random Forest, Logistic Regression and KNN are developed for diabetes prediction.

RO3: To evaluate the performance of diabetes prediction model.
All of the classification models achieved similar results, which have accuracy of 74%-75%.


Future Works


  1. To employ a visualization dashboard in order to raise awareness of the factors of diabetes in global view.
  2. To develop more diabetes prediction models such as Naive Bayes, AdaBoost, Bagging, Support Vector Machine and Multilayer Perceptron for comparative test.
  3. To encourage local government or every nation to provide diabetes-related open source data for analysis.

References


  1. Alexteboul, A. (2022, March 10). Diabetes Health Indicators Dataset Notebook. Kaggle. https://www.kaggle.com/code/alexteboul/diabetes-health-indicators-dataset-notebook/notebook

  2. Behavioral Risk Factor Surveillance System. (2017, August 24). Kaggle. https://www.kaggle.com/datasets/cdc/behavioral-risk-factor-surveillance-system?resource=download

  3. Chapter 77 Feature selection in r | EDAV Fall 2021 Tues/Thurs Community Contributions. (n.d.). https://jtr13.github.io/cc21fall2/feature-selection-in-r.html

  4. char_cor_vars function - RDocumentation. (n.d.). https://www.rdocumentation.org/packages/creditmodel/versions/1.3.1/topics/char_cor_vars

  5. Zach, Z. (2021, May 18). Understanding the Null Hypothesis for Linear Regression. Statology. https://www.statology.org/null-hypothesis-for-linear-regression/