Diabetes is generally characterized by the body’s inability to
produce sufficient amounts of insulin or to use the insulin effectively
as needed. Diabetes is one of the most prevalent chronic diseases in
many countries, affecting millions of people each year and placing a
significant economic burden on the economy. Diabetes is a serious
chronic disease that can lead to a loss of the ability to effectively
regulate blood sugar levels, reducing quality of life and life
expectancy. It can cause serious complications such as heart disease,
blindness, leg amputations, and kidney disease with chronically high
levels of sugar remaining in the blood of diabetes patients. There is no
cure for diabetes, but strategies such as weight loss, a healthy diet,
exercise and treatment can reduce the damage caused by the disease in
many patients. Diabetes risk prediction models will be an important tool
for the public and public health agencies as they may lead to more
targeted treatments.
It is also important to recognize the magnitude of the diabetes
problem. The Centers for Disease Control and Prevention says that 34.2
million US citizens had diabetes in 2018, and 88 million had
prediabetes. Additionally, the CDC estimates that about 1 in 5 of those
with diabetes and 8 in 10 of those with prediabetes are unaware of their
risk. There are many types of diabetes, and its prevalence varies by
age, education, income, location, race, and other social determinants of
health. Much of the burden of disease also falls on people of lower
socioeconomic status. Diabetes also places a huge burden on the economy
of the country. There is a need to develop a diabetes classification
model which can help for early identification of the disease, using
diabetes health indicators.
We follows the OSEMN Data Science Project Lifecycle in this study.
library(dplyr) #for data cleaning/modification
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse) #for data cleaning/modification
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ stringr 1.5.0
## ✔ tidyr 1.2.1 ✔ forcats 0.5.2
## ✔ readr 2.1.3
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(lubridate) #for data cleaning/modification
## Loading required package: timechange
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(stringr) #for data cleaning/modification
library(readxl) #for data sampling
library(knitr) #for data sampling
library(ggplot2) #for data visualization
library(plotly) #for data visualization
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(creditmodel) #for correlation
## Package 'creditmodel' version 1.3.1
##
## Attaching package: 'creditmodel'
##
## The following object is masked from 'package:stringr':
##
## str_match
library(ggcorrplot) #for correlation
library(caret) #for data modeling
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(randomForest) #for data modeling
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:ggplot2':
##
## margin
##
## The following object is masked from 'package:dplyr':
##
## combine
library (class) #for data modeling
library(tree) #for data modeling
The original dataset we started with, was the US Center for Disease Control’s (CDC) Behavioral Risk Factor Surveillance System’s (BRFSS) survey of 2015. The dataset can be found from Kaggle (https://www.kaggle.com/datasets/cdc/behavioral-risk-factor-surveillance-system?resource=download&select=2015.csv).
According to the Kaggle page, the Behavioral Risk Factor Surveillance System (BRFSS) is the nation’s premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world.
So, that is a big dataset of more than 400,000 participants. It also contains 330 columns of factor information, covering a wide range of topics from the user’s demographics to current situation, survey understanding and consent, healthcare access, physical activity, consumption behavior (tobacco, alcohol, food types) and of course, data on health conditions (like eye problems, mental health, asthma, cancer, kidney disease, diabetes, high blood pressure etc.), medication.
Overall it is a very comprehensive survey. However, that makes it hard to handle in its entirety.
We have therefore filtered out 22 most relevant factors that relates to diabetes based on the study in Kaggle (https://www.kaggle.com/code/alexteboul/diabetes-health-indicators-dataset-notebook/notebook).
A brief explanation of the dataset used and factors are given below.
| Details | Description |
|---|---|
| Title | Diabetes Health Indicators Dataset |
| Source | Kaggle |
| Link | https://www.kaggle.com/datasets/cdc/behavioral-risk-factor-surveillance-system |
| Years | 2015 |
| File Name | diabetes.xlsx |
| File Size | 36.1 MB |
| Dimension | Rows: 441456 | Attributes: 22 |
| Purpose | To record the relevant factors of diabetes. |
| No | Attribute | Question.Description | Response |
|---|---|---|---|
| 1 | Diabetes_012 | Main Response Variable. (Ever told) you have diabetes (If “Yes” and respondent is female, ask “Was this only when you were pregnant?” | 1 = Yes; 2 = Yes, but female told only during pregnancy; 3 = No; 4 = No, pre-diabetes or borderline diabetes; 7 = Don’t know/Not Sure; 9 = Refused BLANK: Not asked or Missing |
| 2 | HighBP | Adults who have been told they have high blood pressure by a doctor, nurse, or other health professional | 1 = No; 2 = Yes; 9 = Don’t know/Not Sure/Refused/Missing |
| 3 | HighChol | Have you EVER been told by a doctor, nurse or other health professional that your blood cholesterol is high? | 1 = Yes; 2 = No; 7 = Don’t know/Not Sure; 9 = Refused; BLANK = Not asked or Missing |
| 4 | CholCheck | Cholesterol check within past five years | 1 = Had cholesterol checked in past 5 years; 2 = Did not have cholesterol checked in past 5 years; 7 = Have never had cholesterol checked; 9 = Don’t know/Not Sure Or Refused/Missing |
| 5 | BMI | Body Mass Index (BMI) | 1-9999 = 1 or greater; BLANK = Don’t know/Refused/Missing |
| 6 | Smoker | Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes] | 1 = Yes; 2 = No; 7 = Don’t know/Not Sure; 9 = Refused; BLANK = Not asked or Missing |
| 7 | Stroke | 1 = Yes; 2 = No; 7 = Don’t know/Not Sure; 9 = Refused | |
| 8 | HeartDiseaseorAttack | Respondents that have ever reported having coronary heart disease (CHD) or myocardial infarction (MI) | 1 = Reported having MI or CHD; 2 = Did not report having MI or CHD; BLANK = Not asked or Missing |
| 9 | PhysActivity | Adults who reported doing physical activity or exercise during the past 30 days other than their regular job | 1 = Had physical activity or exercise; 2 = No physical activity or exercise in last 30 days; 9 = Don’t know/Refused/Missing |
| 10 | Fruits | Consume Fruit 1 or more times per day | 1 = Consumed fruit one or more times per day; 2 = Consumed fruit less than one time per day; 9 = Don´t know, refused or missing values |
| 11 | Veggies | Consume Vegetables 1 or more times per day | 1 = Consumed vegetables one or more times per day; 2 = Consumed vegetables less than one time per day; 9 = Don´t know, refused or missing values |
| 12 | HvyAlcoholConsump | Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week) | 1 = No; 2 = Yes, 9 = Don’t know/Refused/Missing |
| 13 | AnyHealthcare | Do you have any kind of health care coverage, including health insurance, prepaid plans such as HMOs, or government plans such as Medicare, or Indian Health Service? | 1 = Yes; 2 = No; 7 = Don’t know/Not Sure; 9 = Refused |
| 14 | NoDocbcCost | Was there a time in the past 12 months when you needed to see a doctor but could not because of cost? | 1 = Yes; 2 = No; 7 = Don’t know/Not Sure; 9 = Refused; BLANK = Not asked or Missing |
| 15 | GenHlth | Would you say that in general your health is: | 1 = Excellent; 2 = Very good; 3 = Good; 4 = Fair; 5 = Poor; 7 = Don’t know/Not Sure; 9 = Refused; BLANK = Not asked or Missing |
| 16 | MentHlth | Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good? | 1-30 = Number of days; 88 = None; 77 = Don’t know/Not sure; BLANK = Refused |
| 17 | PhysHlth | Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good? | 1-30 = Number of days; 88 = None; 77 = Don’t know/Not sure; BLANK = Refused |
| 18 | DiffWalk | Do you have serious difficulty walking or climbing stairs? | 1 = Yes; 2 = No; 7 = Don’t know/Not Sure; 9 = Refused; BLANK = Not asked or Missing |
| 19 | Sex | Indicate sex of respondent. | 1 = Male; 2 = Female |
| 20 | Age | Fourteen-level age category | 1 = Age 18 to 24; 2 = Age 25 to 29; 3 = Age 30 to 34; 4 = Age 35 to 39; 5 = Age 40 to 44; 6 = Age 45 to 49; 7 = Age 50 to 54; 8 = Age 55 to 59; 9 = Age 60 to 64; 10 = Age 65 to 69; 11 = Age 70 to 74; 12 = Age 75 to 79; 13 = Age 80 or older; 14 = Don’t know/Refused/Missing |
| 21 | Education | What is the highest grade or year of school you completed? | 1 = Never attended school or only kindergarten; 2 = Grades 1 through 8 (Elementary); 3 = Grades 9 through 11 (Some high school); 4 = Grade 12 or GED (High school graduate); 5 = College 1 year to 3 years (Some college or technical school); 6 = College 4 years or more (College graduate); 9 = Refused |
| 22 | Income | Is your annual household income from all sources: (If respondent refuses at any income level, code “Refused.”) | 1 = Less than $10,000; 2 = Less than $15,000 ($10,000 to less than $15,000); 3 = Less than $20,000 ($15,000 to less than $20,000); 4 = Less than $25,000 ($20,000 to less than $25,000); 5 = Less than $35,000 ($25,000 to less than $35,000); 6 = Less than $50,000 ($35,000 to less than $50,000); 7 = Less than $75,000 ($50,000 to less than $75,000); 8 = $75,000 or more; 77 = Don’t know/Not sure; 99 = Refused; BLANK = Not asked or Missing |
# read the data from excel
diabetes_df<-data.frame(read_excel("diabetes.xlsx"))
head(diabetes_df)
## Diabetes_012 HighBP HighChol CholCheck BMI Smoker Stroke
## 1 3 2 1 1 4018 1 2
## 2 3 1 2 2 2509 1 2
## 3 3 1 1 1 2204 NA 1
## 4 3 2 1 1 2819 2 2
## 5 3 1 2 1 2437 2 2
## 6 3 2 2 1 2652 2 2
## HeartDiseaseorAttack PhysActivity Fruits Veggies HvyAlcoholConsump
## 1 2 2 2 1 1
## 2 2 1 2 2 1
## 3 NA 9 9 9 9
## 4 2 2 1 2 1
## 5 2 2 9 1 1
## 6 2 1 1 1 1
## AnyHealthcare NoDocbcCost GenHlth MentHlth PhysHlth DiffWalk Sex Age
## 1 1 2 5 18 15 1 2 9
## 2 2 1 3 88 88 2 2 7
## 3 1 2 4 88 15 NA 2 11
## 4 1 1 5 30 30 1 2 9
## 5 1 2 5 88 20 2 2 9
## 6 1 2 2 88 88 2 2 11
## Education Income
## 1 4 3
## 2 6 1
## 3 4 99
## 4 4 8
## 5 5 77
## 6 3 6
Glimpse of Dataset
glimpse(diabetes_df)
## Rows: 441,456
## Columns: 22
## $ Diabetes_012 <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 3, 3, 3, 1…
## $ HighBP <dbl> 2, 1, 1, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2…
## $ HighChol <dbl> 1, 2, 1, 1, 2, 2, 1, 1, NA, 1, 2, 1, 1, 1, 2, 1, …
## $ CholCheck <dbl> 1, 2, 1, 1, 1, 1, 1, 1, 9, 1, 1, 1, 1, 1, 1, 1, 1…
## $ BMI <dbl> 4018, 2509, 2204, 2819, 2437, 2652, 2389, 3382, 1…
## $ Smoker <dbl> 1, 1, NA, 2, 2, 2, 2, 1, 2, 1, 1, 2, 1, 2, 2, 1, …
## $ Stroke <dbl> 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2…
## $ HeartDiseaseorAttack <dbl> 2, 2, NA, 2, 2, 2, 2, NA, 2, 2, 2, 1, 2, 2, 1, 2,…
## $ PhysActivity <dbl> 2, 1, 9, 2, 2, 1, 1, 1, 1, 1, 2, 2, 9, 9, 9, 1, 2…
## $ Fruits <dbl> 2, 2, 9, 1, 9, 1, 1, 1, 9, 1, 2, 2, 9, 1, 9, 2, 1…
## $ Veggies <dbl> 1, 2, 9, 2, 1, 1, 1, 1, 9, 1, 2, 1, 9, 9, 9, 1, 1…
## $ HvyAlcoholConsump <dbl> 1, 1, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ AnyHealthcare <dbl> 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ NoDocbcCost <dbl> 2, 1, 2, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2…
## $ GenHlth <dbl> 5, 3, 4, 5, 5, 2, 2, 5, 5, 2, 3, 5, 5, 4, 3, 3, 5…
## $ MentHlth <dbl> 18, 88, 88, 30, 88, 88, 3, 88, 88, 88, 88, 88, 88…
## $ PhysHlth <dbl> 15, 88, 15, 30, 20, 88, 88, 8, 77, 2, 14, 88, 5, …
## $ DiffWalk <dbl> 1, 2, NA, 1, 2, 2, 2, 1, 7, 2, 2, 1, 1, 1, 1, 1, …
## $ Sex <dbl> 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2…
## $ Age <dbl> 9, 7, 11, 9, 9, 11, 11, 13, 13, 10, 9, 13, 11, 13…
## $ Education <dbl> 4, 6, 4, 4, 5, 3, 5, 3, 3, 6, 6, 5, 5, 5, 5, 4, 5…
## $ Income <dbl> 3, 1, 99, 8, 77, 6, 4, 3, 77, 8, 7, 77, 5, 7, 77,…
Data Quality Overview
# check NA values
any(is.na(diabetes_df))
## [1] TRUE
# count the total rows with no NA values
nrow(na.omit(diabetes_df))
## [1] 343606
# count the total rows with not complete cases
nrow(diabetes_df[!complete.cases(diabetes_df),])
## [1] 97850
Remove Duplicate Records
diabetes_df <- unique(diabetes_df)
nrow(diabetes_df)
## [1] 434358
7,098 duplicate records removed.
Now, Replace missing (““) values to NA and show column names with NA values
## [1] "Diabetes_012" "HighChol" "BMI"
## [4] "Smoker" "HeartDiseaseorAttack" "NoDocbcCost"
## [7] "GenHlth" "PhysHlth" "DiffWalk"
## [10] "Income"
Diabetes_012 is the target variable, so remove all records with NA values for this column
# remove NA
diabetes_df<-diabetes_df %>% filter(!is.na(Diabetes_012))
unique(diabetes_df$Diabetes_012)
## [1] 3 1 4 2 7 9
Mean imputation is used to fill NA values of BMI column
# replace NA by mean
diabetes_df<-diabetes_df %>% replace_na(list(BMI=round(mean(diabetes_df$BMI, na.rm = T), 2)))
Mode imputation is used to remove NA values of high cholesterol, smoker, heart disease or attack, no cost to consult doctor, general health, mental health, difficulty walking and income.
# replace NA by mode
diabetes_df<-diabetes_df %>% replace_na(list(HighChol=Mode(diabetes_df$HighChol)))
diabetes_df<-diabetes_df %>% replace_na(list(Smoker=Mode(diabetes_df$Smoker)))
diabetes_df<-diabetes_df %>% replace_na(list(HeartDiseaseorAttack=Mode(diabetes_df$HeartDiseaseorAttack)))
diabetes_df<-diabetes_df %>% replace_na(list(NoDocbcCost=Mode(diabetes_df$NoDocbcCost)))
diabetes_df<-diabetes_df %>% replace_na(list(GenHlth=Mode(diabetes_df$GenHlth)))
diabetes_df<-diabetes_df %>% replace_na(list(PhysHlth=Mode(diabetes_df$PhysHlth)))
diabetes_df<-diabetes_df %>% replace_na(list(DiffWalk=Mode(diabetes_df$DiffWalk)))
diabetes_df<-diabetes_df %>% replace_na(list(Income=Mode(diabetes_df$Income)))
Now, check names of columns with NA values.
## [1] "Diabetes_012 0"
## [1] "HighBP 0"
## [1] "HighChol 0"
## [1] "CholCheck 0"
## [1] "BMI 0"
## [1] "Smoker 0"
## [1] "Stroke 0"
## [1] "HeartDiseaseorAttack 0"
## [1] "PhysActivity 0"
## [1] "Fruits 0"
## [1] "Veggies 0"
## [1] "HvyAlcoholConsump 0"
## [1] "AnyHealthcare 0"
## [1] "NoDocbcCost 0"
## [1] "GenHlth 0"
## [1] "MentHlth 0"
## [1] "PhysHlth 0"
## [1] "DiffWalk 0"
## [1] "Sex 0"
## [1] "Age 0"
## [1] "Education 0"
## [1] "Income 0"
All duplicate, missing (““) and NA values are removed.
Diabetes_012, We make this ordinal. 0 is for no diabetes or only during pregnancy, 1 is for pre-diabetes or borderline diabetes and diabetes. Remove all 7 (don’t knows), all 9 (refused)
## [1] 0 1
High blood pressure Change 1 to 0 so it represents No high blood pressure and 2 to 1 so it represents high blood pressure and remove 9 (don’t know)
## [1] 1 0
High Cholesterol, Change 2 to 0 because it is No. Remove all 7 (don’t knows), 9 (refused)
## [1] 1 0
Cholesterol Check, Change 3 to 0 and 2 to 0 for Not checked cholesterol in past 5 years. Remove 9 (Don’t know)
## [1] 1 0
BMI (no changes, just note that these are BMI * 100. So for example a BMI of 2608 is really 26.08)
## [1] 22.04
Smoker, Change 2 to 0 because it is No, Remove all 7 (don’t knows) and 9 (refused)
## [1] 1 0
Stroke, Change 2 to 0 because it is No, Remove all 7 (don’t knows) and 9 (refused)
## [1] 0 1
Heart Disease or Attack, Change 2 to 0 because this means did not have MI or CHD
## [1] 0 1
Physical Activity, change 2 to 0 for no physical activity, Remove all 9 (don’t know/refused)
## [1] 0 1
Fruits, Change 2 to 0. this means no fruit consumed per day. 1 will mean consumed 1 or more pieces of fruit per day remove all don’t knows and missing 9
## [1] 0 1
Veggies, Change 2 to 0. this means no vegetables consumed per day. 1 will mean consumed 1 or more pieces of vegetable per day remove all don’t knows and missing 9
## [1] 1 0
Heavy Alcohol Consumption, Change 1 to 0 (1 was no for heavy drinking). change all 2 to 1 (2 was yes for heavy drinking) remove all don’t knows and missing 9
## [1] 0 1
Any Health care, 1 is yes, change 2 to 0 because it is No health care access remove 7 and 9 for don’t know or refused
## [1] 1 0
No cost to consult doctor, Change 2 to 0 for no, 1 is already yes, remove 7 for don’t know and 9 for refused
## [1] 0 1
General Health, This is an ordinal variable that I want to keep (1 is Excellent -> 5 is Poor) Remove 7 and 9 for don’t know and refused
## [1] 5 3 2 4 1
Mental Health, already in days so keep that, scale will be 0-30 change 88 to 0 because it means none (no bad mental health days). Remove 77 and 99 for don’t know not sure and refused
## [1] 18 0 30 3 5 15 10 6 20 2 25 1 29 4 7 8 21 14 26 9 16 28 11 12 24
## [26] 17 13 23 27 19 22
Physical Health, already in days so keep that, scale will be 0-30 change 88 to 0 because it means none (no bad physical health days). Remove 77 and 99 for don’t know not sure and refused
## [1] 15 0 30 8 2 14 6 28 1 7 10 20 3 5 17 4 19 21 12 25 27 22 29 24 9
## [26] 16 18 23 13 26 11
Difficulty Walking, change 2 to 0 for no. 1 is already yes, remove 7 and 9 for don’t know not sure and refused
## [1] 1 0
Sex, change 2 to 0 (female as 0). Male is 1
## [1] 0 1
Age, already ordinal. 1 is 18-24 all the way up to 13 with 80 and older. 5 year increments. Remove 14 because it is don’t know or missing
## [1] 9 7 11 13 10 8 4 6 2 3 12 5 1
Education, This is already an ordinal variable with 1 being never attended school or kindergarten only up to 6 being college 4 years or more Scale here is 1-6. Remove 9 for refused
## [1] 4 6 3 5 2 1
Income is already ordinal with 1 being less than $10,000 all the way up to 8 being $75,000 or more. Remove 77 and 99 for don’t know and refused
## [1] 3 1 8 6 4 7 2 5
The binary columns are:
Summary of attributes representation on cleaned data:
| No | Attribute | Description |
|---|---|---|
| 1 | Diabetes_012 | 0 = no diabetes 1 = prediabetes 2 = diabetes |
| 2 | HighBP | 0 = no high BP 1 = high BP |
| 3 | HighChol | 0 = no high cholesterol 1 = high cholesterol |
| 4 | CholCheck | 0 = no cholesterol check in 5 years 1 = yes cholesterol check in 5 years |
| 5 | BMI | Body Mass Index |
| 6 | Smoker | Smoked at least 100 cigarettes in entire life 0 = no 1 = yes |
| 7 | Stroke | Stroke 0 = no 1 = yes |
| 8 | HeartDiseaseorAttack | Coronary Heart Disease (CHD) or Myocardial Infarction (MI) 0 = no 1 = yes |
| 9 | PhysActivity | Physical activity in past 30 days - not including job 0 = no 1 = yes |
| 10 | Fruits | Consume Fruit 1 or more times per day 0 = no 1 = yes |
| 11 | Veggies | Consume Vegetables 1 or more times per day 0 = no 1 = yes |
| 12 | HvyAlcoholConsump | Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week) 0 = no 1 = yes |
| 13 | AnyHealthcare | Health care coverage 0 = no 1 = yes |
| 14 | NoDocbcCost | No cost to consult doctor in the past 12 months 0 = no 1 = yes |
| 15 | GenHlth | General health 1 = excellent 2 = very good 3 = good 4 = fair 5 = poor |
| 16 | MentHlth | Mental health including stress, depression, and problems with emotions, for the past 30 days was not good, scale 1-30 days |
| 17 | PhysHlth | Physical health including physical illness and injury for the past 30 days was not good, scale 1-30 days |
| 18 | DiffWalk | Serious difficulty walking or climbing stairs 0 = no 1 = yes |
| 19 | Sex | 0 = female 1 = male |
| 20 | Age | 1 = Age 18 to 24; 2 = Age 25 to 29; 3 = Age 30 to 34; 4 = Age 35 to 39; 5 = Age 40 to 44; 6 = Age 45 to 49; 7 = Age 50 to 54; 8 = Age 55 to 59; 9 = Age 60 to 64; 10 = Age 65 to 69; 11 = Age 70 to 74; 12 = Age 75 to 79; 13 = Age 80 or older |
| 21 | Education | Education level 1 = Never attended school or only kindergarten 2 = Grades 1 through 8 (Elementary) 3 = Grades 9 through 11 (Some high school) 4 = Grade 12 or GED (High school graduate) 5 = College 1 year to 3 years (Some college or technical school) 6 = College 4 years or more (College graduate) |
| 22 | Income | Income scale 1 = Less than $10,000; 2 = Less than $15,000 ($10,000 to less than $15,000); 3 = Less than $20,000 ($15,000 to less than $20,000); 4 = Less than $25,000 ($20,000 to less than $25,000); 5 = Less than $35,000 ($25,000 to less than $35,000); 6 = Less than $50,000 ($35,000 to less than $50,000); 7 = Less than $75,000 ($50,000 to less than $75,000); 8 = $75,000 or more |
Cleaned data dimension:
dim(diabetes_df)
## [1] 292188 22
The cleaned data consists of 292,188 rows and 22 columns.
str(diabetes_df)
## 'data.frame': 292188 obs. of 22 variables:
## $ Diabetes_012 : num 0 0 0 0 0 0 0 0 0 1 ...
## $ HighBP : num 1 0 1 1 1 1 1 1 1 1 ...
## $ HighChol : num 1 0 1 0 1 1 1 0 1 1 ...
## $ CholCheck : num 1 0 1 1 1 1 1 1 1 1 ...
## $ BMI : num 40.2 25.1 28.2 26.5 23.9 ...
## $ Smoker : num 1 1 0 0 0 1 1 1 1 1 ...
## $ Stroke : num 0 0 0 0 0 0 0 0 0 0 ...
## $ HeartDiseaseorAttack: num 0 0 0 0 0 0 0 0 0 1 ...
## $ PhysActivity : num 0 1 0 1 1 1 1 0 1 0 ...
## $ Fruits : num 0 0 1 1 1 1 1 0 0 1 ...
## $ Veggies : num 1 0 0 1 1 1 1 0 1 1 ...
## $ HvyAlcoholConsump : num 0 0 0 0 0 0 0 0 0 0 ...
## $ AnyHealthcare : num 1 0 1 1 1 1 1 1 1 1 ...
## $ NoDocbcCost : num 0 1 1 0 0 1 0 0 0 0 ...
## $ GenHlth : num 5 3 5 2 2 5 2 3 3 5 ...
## $ MentHlth : num 18 0 30 0 3 0 0 0 0 30 ...
## $ PhysHlth : num 15 0 30 0 0 8 2 14 0 30 ...
## $ DiffWalk : num 1 0 1 0 0 1 0 0 1 1 ...
## $ Sex : num 0 0 0 0 0 1 1 0 0 0 ...
## $ Age : num 9 7 9 11 11 13 10 9 11 9 ...
## $ Education : num 4 6 4 3 5 3 6 6 4 5 ...
## $ Income : num 3 1 8 6 4 3 8 7 4 1 ...
All attributes are numerical data type. Hence, we have to convert the data type for categorical and ordinal data to factor.
diabetes_df$Diabetes_012 <- as.factor(diabetes_df$Diabetes_012)
diabetes_df$HighBP <- as.factor(diabetes_df$HighBP)
diabetes_df$HighChol <- as.factor(diabetes_df$HighChol)
diabetes_df$CholCheck <- as.factor(diabetes_df$CholCheck)
diabetes_df$Smoker <- as.factor(diabetes_df$Smoker)
diabetes_df$Stroke <- as.factor(diabetes_df$Stroke)
diabetes_df$HeartDiseaseorAttack <- as.factor(diabetes_df$HeartDiseaseorAttack)
diabetes_df$PhysActivity <- as.factor(diabetes_df$PhysActivity)
diabetes_df$Fruits <- as.factor(diabetes_df$Fruits)
diabetes_df$Veggies <- as.factor(diabetes_df$Veggies)
diabetes_df$HvyAlcoholConsump <- as.factor(diabetes_df$HvyAlcoholConsump)
diabetes_df$AnyHealthcare <- as.factor(diabetes_df$AnyHealthcare)
diabetes_df$NoDocbcCost <- as.factor(diabetes_df$NoDocbcCost)
diabetes_df$GenHlth <- as.factor(diabetes_df$GenHlth)
diabetes_df$DiffWalk <- as.factor(diabetes_df$DiffWalk)
diabetes_df$Sex <- as.factor(diabetes_df$Sex)
diabetes_df$Age <- as.factor(diabetes_df$Age)
diabetes_df$Education <- as.factor(diabetes_df$Education)
diabetes_df$Income <- as.factor(diabetes_df$Income)
summary(diabetes_df)
## Diabetes_012 HighBP HighChol CholCheck BMI Smoker
## 0:249162 0:174087 0:180933 0: 41928 Min. :12.02 0:162828
## 1: 43026 1:118101 1:111255 1:250260 1st Qu.:24.13 1:129360
## Median :27.44
## Mean :28.29
## 3rd Qu.:31.00
## Max. :97.65
##
## Stroke HeartDiseaseorAttack PhysActivity Fruits Veggies
## 0:281019 0:267180 0: 74193 0:111368 0: 58574
## 1: 11169 1: 25008 1:217995 1:180820 1:233614
##
##
##
##
##
## HvyAlcoholConsump AnyHealthcare NoDocbcCost GenHlth MentHlth
## 0:275094 0: 19938 0:263429 1: 50974 Min. : 0.000
## 1: 17094 1:272250 1: 28759 2:101380 1st Qu.: 0.000
## 3: 89498 Median : 0.000
## 4: 36721 Mean : 3.346
## 5: 13615 3rd Qu.: 2.000
## Max. :30.000
##
## PhysHlth DiffWalk Sex Age Education
## Min. : 0.000 0:244363 0:164330 9 : 35386 1: 281
## 1st Qu.: 0.000 1: 47825 1:127858 10 : 33976 2: 5624
## Median : 0.000 8 : 33169 3: 12464
## Mean : 4.194 7 : 28762 4: 76199
## 3rd Qu.: 3.000 11 : 24836 5: 82321
## Max. :30.000 6 : 22146 6:115299
## (Other):113913
## Income
## 8 :95843
## 7 :49170
## 6 :42884
## 5 :31231
## 4 :24974
## 3 :20263
## (Other):27823
The summary shows the total number of factor data and statistical
measures for numerical data.
1. Exploratory Data Analysis
# pie chart
plot_ly(diabetes_df, labels=~Diabetes_012, type="pie", title="Diabetes Status")
There are 85.3% of respondent don’t have diabetes, 14.7% are facing diabetes health problem.
# bar graph
ggplotly(ggplot(diabetes_df, aes(HighBP, fill=Diabetes_012))+geom_bar()+ggtitle("Total Number of High Blood Pressure"))
Most of the respondents do not have high blood pressure. We can observe that Most of the respondents having diabetes in the high blood pressure group than no high blood pressure group.
ggplotly(ggplot(diabetes_df, aes(HighChol, fill=Diabetes_012))+geom_bar()+ggtitle("Total Number of High Cholesterol"))
Most of the respondents do not have high cholesterol. We can observe that most of the respondents having diabetes in the high cholesterol group than no high cholesterol group.
ggplotly(ggplot(diabetes_df, aes(CholCheck, fill=Diabetes_012))+geom_bar()+ggtitle("Total Number of Cholesterol Check"))
Most of the respondents did cholesterol health check in the past 5 years.
# histogram
ggplotly(ggplot(diabetes_df, aes(BMI)) + geom_histogram(binwidth=3, colour="black", fill="white")+ ggtitle("Distribution of BMI"))
The BMI distribution is skewed to left. Most of the respondents have 23-28kg/m^2, which is from normal or overweight category.
# boxplot
ggplot(diabetes_df, aes(x=BMI, y=Diabetes_012, fill=Diabetes_012)) + geom_boxplot() + ggtitle("Distribution of BMI")
The BMI mean is higher in diabetes category.
ggplotly(ggplot(diabetes_df, aes(Smoker, fill=Diabetes_012))+geom_bar()+ggtitle("Total Number of Smoker"))
Most of the respondents are not a smoker. We can observe that slightly most of the respondents having diabetes in smoker category than non-smoker.
ggplotly(ggplot(diabetes_df, aes(HeartDiseaseorAttack, fill=Diabetes_012))+geom_bar()+ggtitle("Total Number of Heart Disease/Attack"))
Most of the respondents do not have heart disease.
ggplotly(ggplot(diabetes_df, aes(PhysActivity, fill=Diabetes_012))+geom_bar()+ggtitle("Total Number of Doing Physical Activity"))
Most of the respondents did physical activites in the past 30 days.
ggplotly(ggplot(diabetes_df, aes(Fruits, fill=Diabetes_012))+geom_bar()+ggtitle("Total Number of Fruits Consumption"))
Most of the respondents consumed fruits.
ggplotly(ggplot(diabetes_df, aes(Veggies, fill=Diabetes_012))+geom_bar()+ggtitle("Total Number of Vegetables Consumption"))
Most of the respondents consumed vegetables.
ggplotly(ggplot(diabetes_df, aes(HvyAlcoholConsump, fill=Diabetes_012))+geom_bar()+ggtitle("Total Number of Heavy Alcohol Consumption"))
Most of the respondents did not have heavy alcohol consumption.
ggplotly(ggplot(diabetes_df, aes(AnyHealthcare, fill=Diabetes_012))+geom_bar()+ggtitle("Total Number of Healthcare Coverage"))
Most of the respondents have healthcare coverage.
ggplotly(ggplot(diabetes_df, aes(NoDocbcCost, fill=Diabetes_012))+geom_bar()+ggtitle("Total Number of No Cost for Doctor Consultation"))
Most of the respondents can afford doctor consultant fee.
ggplotly(ggplot(diabetes_df, aes(GenHlth, fill=Diabetes_012))+geom_bar()+ggtitle("Total Number of General Health Level"))
Most of the respondents are healthy in general. Most of the respondents having diabetes in good condition (level 3).
ggplotly(ggplot(diabetes_df, aes(MentHlth)) + geom_histogram(binwidth=3, colour="black", fill="white")+ ggtitle("Distribution of Days of Having Mental Health Problem"))
The distribution of having mental health days is skewed to left. Most of the respondents have less than 3 days facing mental health for the past 30 days.
ggplot(diabetes_df, aes(x=MentHlth, y=Diabetes_012, fill=Diabetes_012)) + geom_boxplot() + ggtitle("Distribution of Days of Having Mental Health Problem")
The mean of mental health for each diabetes level are the same,
which is nearly to 0.
ggplotly(ggplot(diabetes_df, aes(PhysHlth)) + geom_histogram(binwidth=3, colour="black", fill="white")+ ggtitle("Distribution of Days of Having Physical Health Problem"))
The distribution of having physical health days is skewed to left. Most of the respondents have less than 3 days facing physical health for the past 30 days.
ggplot(diabetes_df, aes(x=PhysHlth, y=Diabetes_012, fill=Diabetes_012)) + geom_boxplot() + ggtitle("Distribution of Days of Having Physical Health Problem")
The mean for facing physical health is higher in diabetes
category.
ggplotly(ggplot(diabetes_df, aes(DiffWalk, fill=Diabetes_012))+geom_bar()+ggtitle("Total Number of Facing Walking Difficulty"))
Most of the respondents do not facing any difficulty in walking.
ggplotly(ggplot(diabetes_df, aes(Sex, fill=Diabetes_012))+geom_bar()+ggtitle("Total Number of Gender"))
Most of the respondents are female.
ggplotly(ggplot(diabetes_df, aes(Education, fill=Diabetes_012))+geom_bar()+ggtitle("Total Number By Education Level"))
Most of the respondents are college gradute. Most of the high school graduate respondents having diabetes than other education levels.
ggplotly(ggplot(diabetes_df, aes(Income, fill=Diabetes_012))+geom_bar()+ggtitle("Total Number By Income Group"))
Most of the respondents are from high income category.
2. Most of the interesting plots with multiple variables
ggplotly(ggplot(diabetes_df, aes(HighBP, fill=Diabetes_012))+geom_bar()+ggtitle("Total Number of High Blood Pressure vs High Cholesterol")+facet_wrap(~HighChol))
Most the respondents who having diabetes, also have high blood pressure and high cholesterol.
ggplotly(ggplot(diabetes_df, aes(Smoker, fill=Diabetes_012))+geom_bar()+ggtitle("Total Number of Smoker vs Heavy Alcohol Consumption")+facet_wrap(~HvyAlcoholConsump))
Most respondents do not drink heavy alcohol. However, smoking is one of the risk factors for diabetes, as slightly most of the people facing diabetes than non-smokers despite not drinking heavily.
ggplotly(ggplot(diabetes_df, aes(HeartDiseaseorAttack, fill=Diabetes_012))+geom_bar()+ggtitle("Total Number of Facing Heart Disease vs Stroke")+facet_wrap(~Stroke))
Most of the respondents do not have heart disease and stroke.
ggplotly(ggplot(diabetes_df, aes(Smoker, fill=Sex))+geom_bar()+ggtitle("Total Number of Smoker vs Heavy Alcohol Consumption by Gender")+facet_wrap(~HvyAlcoholConsump))
Among non-heavy drinking respondents, the total number of male smokers and non-smokers were nearly identical.
ggplotly(ggplot(diabetes_df, aes(Income, fill=NoDocbcCost))+geom_bar()+ggtitle("Total Number of Income Level vs General Health by No Cost for Doctor Consultation")+facet_wrap(~GenHlth))
High income respondents are healthier than low income respondents,
although low income respondent can still afford doctor consultant fee.
1. Between categorical variables by Cramer’s V
# cramer's V
r1 <- diabetes_df %>% select_if(is.factor) %>% char_cor() %>% round(2)
# correlation matrix
ggcorrplot(r1)
From the Cramer’s V correlation matrix, we found that all the
categorical variables have weak or moderate strong positive relationship
between each other. Let’s confirm with the Cramer’s V correlation
coefficient.
## Diabetes_012 HighBP HighChol CholCheck Smoker Stroke
## Diabetes_012 1.00 0.28 0.23 0.13 0.06 0.11
## HighBP 0.28 1.00 0.32 0.21 0.09 0.13
## HighChol 0.23 0.32 1.00 0.28 0.08 0.10
## CholCheck 0.13 0.21 0.28 1.00 0.01 0.05
## Smoker 0.06 0.09 0.08 0.01 1.00 0.06
## Stroke 0.11 0.13 0.10 0.05 0.06 1.00
## HeartDiseaseorAttack 0.18 0.21 0.19 0.09 0.11 0.20
## PhysActivity 0.11 0.11 0.06 0.02 0.08 0.06
## Fruits 0.03 0.02 0.02 0.07 0.07 0.01
## Veggies 0.05 0.04 0.01 0.06 0.02 0.03
## HvyAlcoholConsump 0.06 0.01 0.02 0.03 0.11 0.02
## AnyHealthcare 0.04 0.07 0.09 0.22 0.02 0.02
## NoDocbcCost 0.02 0.00 0.02 0.12 0.05 0.03
## GenHlth 0.29 0.29 0.20 0.06 0.16 0.19
## DiffWalk 0.22 0.23 0.15 0.09 0.12 0.18
## Sex 0.02 0.04 0.01 0.05 0.10 0.00
## Age 0.22 0.38 0.34 0.39 0.14 0.14
## Education 0.11 0.11 0.02 0.10 0.17 0.07
## Income 0.15 0.14 0.04 0.12 0.12 0.11
## HeartDiseaseorAttack PhysActivity Fruits Veggies
## Diabetes_012 0.18 0.11 0.03 0.05
## HighBP 0.21 0.11 0.02 0.04
## HighChol 0.19 0.06 0.02 0.01
## CholCheck 0.09 0.02 0.07 0.06
## Smoker 0.11 0.08 0.07 0.02
## Stroke 0.20 0.06 0.01 0.03
## HeartDiseaseorAttack 1.00 0.08 0.01 0.03
## PhysActivity 0.08 1.00 0.14 0.15
## Fruits 0.01 0.14 1.00 0.26
## Veggies 0.03 0.15 0.26 1.00
## HvyAlcoholConsump 0.03 0.01 0.04 0.02
## AnyHealthcare 0.03 0.04 0.04 0.04
## NoDocbcCost 0.02 0.06 0.05 0.03
## GenHlth 0.26 0.26 0.09 0.11
## DiffWalk 0.21 0.24 0.04 0.07
## Sex 0.08 0.03 0.09 0.07
## Age 0.24 0.09 0.10 0.04
## Education 0.08 0.19 0.11 0.15
## Income 0.12 0.19 0.08 0.15
## HvyAlcoholConsump AnyHealthcare NoDocbcCost GenHlth
## Diabetes_012 0.06 0.04 0.02 0.29
## HighBP 0.01 0.07 0.00 0.29
## HighChol 0.02 0.09 0.02 0.20
## CholCheck 0.03 0.22 0.12 0.06
## Smoker 0.11 0.02 0.05 0.16
## Stroke 0.02 0.02 0.03 0.19
## HeartDiseaseorAttack 0.03 0.03 0.02 0.26
## PhysActivity 0.01 0.04 0.06 0.26
## Fruits 0.04 0.04 0.05 0.09
## Veggies 0.02 0.04 0.03 0.11
## HvyAlcoholConsump 1.00 0.02 0.01 0.04
## AnyHealthcare 0.02 1.00 0.26 0.05
## NoDocbcCost 0.01 0.26 1.00 0.18
## GenHlth 0.04 0.05 0.18 1.00
## DiffWalk 0.04 0.02 0.11 0.49
## Sex 0.01 0.04 0.04 0.02
## Age 0.05 0.18 0.15 0.08
## Education 0.02 0.17 0.12 0.14
## Income 0.05 0.20 0.22 0.19
## DiffWalk Sex Age Education Income
## Diabetes_012 0.22 0.02 0.22 0.11 0.15
## HighBP 0.23 0.04 0.38 0.11 0.14
## HighChol 0.15 0.01 0.34 0.02 0.04
## CholCheck 0.09 0.05 0.39 0.10 0.12
## Smoker 0.12 0.10 0.14 0.17 0.12
## Stroke 0.18 0.00 0.14 0.07 0.11
## HeartDiseaseorAttack 0.21 0.08 0.24 0.08 0.12
## PhysActivity 0.24 0.03 0.09 0.19 0.19
## Fruits 0.04 0.09 0.10 0.11 0.08
## Veggies 0.07 0.07 0.04 0.15 0.15
## HvyAlcoholConsump 0.04 0.01 0.05 0.02 0.05
## AnyHealthcare 0.02 0.04 0.18 0.17 0.20
## NoDocbcCost 0.11 0.04 0.15 0.12 0.22
## GenHlth 0.49 0.02 0.08 0.14 0.19
## DiffWalk 1.00 0.07 0.23 0.17 0.29
## Sex 0.07 1.00 0.05 0.04 0.12
## Age 0.23 0.05 1.00 0.07 0.09
## Education 0.17 0.04 0.07 1.00 0.21
## Income 0.29 0.12 0.09 0.21 1.00
Most of the categorical variables have correlation coefficient less than 0.3. The highest coefficient is 0.49, from general health and difficult in walking. Hence, we can say that general health is moderate strong correlated to difficult in walking.
2. Between numerical variables by Pearson correlation
# pearson
r2 <- diabetes_df %>% select_if(is.numeric) %>% cor(method = "pearson") %>% round(2)
ggcorrplot(r2)
# scatterplot
ggplotly(ggplot(diabetes_df, aes(x=PhysHlth, y=MentHlth)) + geom_smooth(method=lm, se=FALSE) + ggtitle("Physical Health against Mental Health"))
## `geom_smooth()` using formula = 'y ~ x'
ggplotly(ggplot(diabetes_df, aes(x=BMI, y=MentHlth)) + geom_smooth(method=lm, se=FALSE) + ggtitle("BMI against Mental Health"))
## `geom_smooth()` using formula = 'y ~ x'
ggplotly(ggplot(diabetes_df, aes(x=BMI, y=PhysHlth)) + geom_smooth(method=lm, se=FALSE) + ggtitle("BMI against Physical Health"))
## `geom_smooth()` using formula = 'y ~ x'
All the numerical variables are positive correlated. Let’s
discover the strength of relationship below.
## BMI MentHlth PhysHlth
## BMI 1.00 0.07 0.11
## MentHlth 0.07 1.00 0.34
## PhysHlth 0.11 0.34 1.00
They have poor positive relationship to each other.
From the correlation coefficients between categorical variables and between numerical variables, all the variables have weak relationship. This also can conclude that our dataset do not have redundant variables.
Redundant variable: correlation > 0.75
# use roc_curve area as score
roc_imp <- filterVarImp(x = diabetes_df[,2:ncol(diabetes_df)], y = diabetes_df$Diabetes_012)
# sort the score in decreasing order
roc_imp <- data.frame(cbind(variable = rownames(roc_imp), score = roc_imp[,1]))
roc_imp$score <- as.double(roc_imp$score)
roc_imp[order(roc_imp$score,decreasing = TRUE),]
## variable score
## 14 GenHlth 0.7210405
## 1 HighBP 0.6937586
## 4 BMI 0.6844951
## 19 Age 0.6715637
## 2 HighChol 0.6565158
## 17 DiffWalk 0.6172731
## 21 Income 0.6149303
## 16 PhysHlth 0.6097001
## 20 Education 0.5791783
## 7 HeartDiseaseorAttack 0.5707880
## 8 PhysActivity 0.5672897
## 3 CholCheck 0.5654546
## 5 Smoker 0.5407487
## 6 Stroke 0.5291810
## 10 Veggies 0.5255615
## 15 MentHlth 0.5224938
## 9 Fruits 0.5190999
## 11 HvyAlcoholConsump 0.5188765
## 18 Sex 0.5132233
## 12 AnyHealthcare 0.5131637
## 13 NoDocbcCost 0.5089957
The top 3 most important features are: General Health, High Blood Pressure, BMI
The top 3 least important features are: Gender, Healthcare, No Cost to Consult Doctor
Our target variable ‘Diabetes_012’ is imbalanced. Most records represent not diabetes cases. Also our dataset is quite large and would require a lot of computational resources for modeling. To solve these problems we perform undersampling of not diabetes records to have the equal amount for both groups.
# undersampling
set.seed(123)
balance <- function(df) {
number_of_diabetes <- nrow(df[df$Diabetes_012 == 1, ])
no_diabetes_ids <- which(df$Diabetes_012 == 0)
sample_ids <- sample(no_diabetes_ids, number_of_diabetes)
sample_no_diabetes_df <- df[sample_ids, ]
only_diabetes_df <- df[df$Diabetes_012 == 1, ]
return(rbind(sample_no_diabetes_df, only_diabetes_df))
}
balanced_df <- balance(diabetes_df)
invisible(gc())
print(paste("Number of rows before balancing:", nrow(diabetes_df)))
## [1] "Number of rows before balancing: 292188"
print(paste("Number of rows after balancing:", nrow(balanced_df)))
## [1] "Number of rows after balancing: 86052"
table(balanced_df$Diabetes_012)
##
## 0 1
## 43026 43026
For modeling, we splitted the balanced data into 80% training set and 20% test set.
# Test train split function
get_train_test <- function(df, target_name) {
train_ids <- createDataPartition(df$Diabetes_012, p=0.8, list=FALSE)
df_train <- df[train_ids,]
df_test <- df[-train_ids,]
x_train <- df_train[, !(names(df_train) %in% target_name)]
y_train <- df_train[, target_name]
x_test <- df_test[ , !(names(df_test) %in% target_name)]
y_test <- df_test[, target_name]
return(list("df_train" = df_train,
"df_test" = df_test,
"x_train" = x_train,
"y_train" = y_train,
"x_test" = x_test,
"y_test" = y_test))
}
# Train test split
train_test <- get_train_test(balanced_df, 'Diabetes_012')
invisible(gc())
Diabetes Prediction Models:
1. Random Forest (RF)
# RF
rf_model <- randomForest(Diabetes_012~., data=train_test[["df_train"]])
predictions <- predict(rf_model,train_test[["x_test"]])
confusionMatrix(predictions, train_test[["y_test"]])
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6087 1744
## 1 2518 6861
##
## Accuracy : 0.7524
## 95% CI : (0.7458, 0.7588)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5047
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7074
## Specificity : 0.7973
## Pos Pred Value : 0.7773
## Neg Pred Value : 0.7315
## Prevalence : 0.5000
## Detection Rate : 0.3537
## Detection Prevalence : 0.4550
## Balanced Accuracy : 0.7524
##
## 'Positive' Class : 0
##
invisible(gc())
The accuracy of the diabetes prediction is 75%.
Feature importance plot:
randomForest::varImpPlot(rf_model)
The most important features are: BMI, Age, General Health
The least important features are: AnyHealthCare, HvyAlcoholConsump,
Stroke, NoDocbcCost
Modeling using only most important features:
df_important <- balanced_df[ , c("BMI", "Diabetes_012", "Age", "GenHlth")]
train_test_rf2 <- get_train_test(df_important, 'Diabetes_012')
rf_model_2 <- randomForest(Diabetes_012~., data=train_test_rf2[["df_train"]])
predictions <- predict(rf_model_2,train_test_rf2[["x_test"]])
confusionMatrix(predictions, train_test_rf2[["y_test"]])
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5885 1884
## 1 2720 6721
##
## Accuracy : 0.7325
## 95% CI : (0.7258, 0.7391)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.465
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.6839
## Specificity : 0.7811
## Pos Pred Value : 0.7575
## Neg Pred Value : 0.7119
## Prevalence : 0.5000
## Detection Rate : 0.3420
## Detection Prevalence : 0.4514
## Balanced Accuracy : 0.7325
##
## 'Positive' Class : 0
##
invisible(gc())
We can see that using 3 most important features gives us 73%
accuracy.
2. Logistic regression (LR)
# LR
lr <- glm (Diabetes_012~.,data = train_test[["df_train"]] , family = binomial)
summary(lr)
##
## Call:
## glm(formula = Diabetes_012 ~ ., family = binomial, data = train_test[["df_train"]])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4084 -0.8010 -0.0193 0.8183 3.1766
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.713863 0.287344 -19.885 < 2e-16 ***
## HighBP1 0.686903 0.020131 34.122 < 2e-16 ***
## HighChol1 0.541555 0.019672 27.529 < 2e-16 ***
## CholCheck1 0.851680 0.041981 20.287 < 2e-16 ***
## BMI 0.069035 0.001617 42.680 < 2e-16 ***
## Smoker1 -0.034020 0.019408 -1.753 0.079626 .
## Stroke1 0.148176 0.042666 3.473 0.000515 ***
## HeartDiseaseorAttack1 0.249878 0.030076 8.308 < 2e-16 ***
## PhysActivity1 -0.062896 0.021645 -2.906 0.003663 **
## Fruits1 -0.020535 0.020066 -1.023 0.306138
## Veggies1 -0.041328 0.023746 -1.740 0.081794 .
## HvyAlcoholConsump1 -0.632887 0.047934 -13.203 < 2e-16 ***
## AnyHealthcare1 0.099366 0.044813 2.217 0.026599 *
## NoDocbcCost1 0.083695 0.033820 2.475 0.013336 *
## GenHlth2 0.687204 0.037849 18.157 < 2e-16 ***
## GenHlth3 1.311931 0.037465 35.017 < 2e-16 ***
## GenHlth4 1.765750 0.043616 40.484 < 2e-16 ***
## GenHlth5 1.897044 0.059028 32.138 < 2e-16 ***
## MentHlth -0.002076 0.001305 -1.590 0.111725
## PhysHlth -0.001403 0.001280 -1.096 0.273101
## DiffWalk1 0.139007 0.026620 5.222 1.77e-07 ***
## Sex1 0.234423 0.019670 11.918 < 2e-16 ***
## Age2 0.041629 0.127708 0.326 0.744449
## Age3 0.214900 0.117191 1.834 0.066689 .
## Age4 0.657148 0.110420 5.951 2.66e-09 ***
## Age5 0.944047 0.107294 8.799 < 2e-16 ***
## Age6 1.199964 0.104914 11.438 < 2e-16 ***
## Age7 1.345305 0.103156 13.041 < 2e-16 ***
## Age8 1.407128 0.102510 13.727 < 2e-16 ***
## Age9 1.648687 0.102142 16.141 < 2e-16 ***
## Age10 1.811573 0.102317 17.705 < 2e-16 ***
## Age11 1.966091 0.103473 19.001 < 2e-16 ***
## Age12 1.903056 0.105262 18.079 < 2e-16 ***
## Age13 1.779927 0.105080 16.939 < 2e-16 ***
## Education2 0.028094 0.266214 0.106 0.915954
## Education3 -0.109151 0.262643 -0.416 0.677712
## Education4 -0.215612 0.260089 -0.829 0.407109
## Education5 -0.152908 0.260235 -0.588 0.556816
## Education6 -0.281317 0.260374 -1.080 0.279948
## Income2 -0.076909 0.056223 -1.368 0.171337
## Income3 -0.105254 0.053368 -1.972 0.048585 *
## Income4 -0.129029 0.051986 -2.482 0.013066 *
## Income5 -0.200807 0.051194 -3.922 8.76e-05 ***
## Income6 -0.282523 0.050004 -5.650 1.60e-08 ***
## Income7 -0.349961 0.050186 -6.973 3.10e-12 ***
## Income8 -0.491713 0.049522 -9.929 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 95435 on 68841 degrees of freedom
## Residual deviance: 69435 on 68796 degrees of freedom
## AIC: 69527
##
## Number of Fisher Scoring iterations: 5
lr_predictions <- predict(lr, train_test[["x_test"]], type = 'response')
lr_predictions <- as.factor(ifelse(lr_predictions > 0.5,1,0))
confusionMatrix(lr_predictions, train_test[["y_test"]])
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 6265 1914
## 1 2340 6691
##
## Accuracy : 0.7528
## 95% CI : (0.7463, 0.7592)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5056
##
## Mcnemar's Test P-Value : 7.214e-11
##
## Sensitivity : 0.7281
## Specificity : 0.7776
## Pos Pred Value : 0.7660
## Neg Pred Value : 0.7409
## Prevalence : 0.5000
## Detection Rate : 0.3640
## Detection Prevalence : 0.4752
## Balanced Accuracy : 0.7528
##
## 'Positive' Class : 0
##
invisible(gc())
Prediction accuracy is 75%.
The least important features according to the p-values of their
coefficients are: Education, Veggies, Fruits, Smoker
3. K-Nearest Neighbors (KNN)
To train KNN model, we need to scale our numerical features first.
scaled_df <- balanced_df %>% mutate_at(c("BMI", "MentHlth", "PhysHlth"), ~(scale(.) %>% as.vector))
glimpse(scaled_df)
## Rows: 86,052
## Columns: 22
## $ Diabetes_012 <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ HighBP <fct> 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0…
## $ HighChol <fct> 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0…
## $ CholCheck <fct> 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0…
## $ BMI <dbl> 1.3611031370, -2.0793836239, 0.1405155102, -0.853…
## $ Smoker <fct> 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1…
## $ Stroke <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ HeartDiseaseorAttack <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
## $ PhysActivity <fct> 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0…
## $ Fruits <fct> 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0…
## $ Veggies <fct> 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1…
## $ HvyAlcoholConsump <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ AnyHealthcare <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1…
## $ NoDocbcCost <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1…
## $ GenHlth <fct> 3, 4, 2, 2, 2, 1, 2, 1, 1, 4, 3, 1, 5, 2, 2, 3, 2…
## $ MentHlth <dbl> 0.14139593, -0.46860902, -0.46860902, -0.22460704…
## $ PhysHlth <dbl> -0.5705128, 2.4504253, 0.1343727, -0.3691169, -0.…
## $ DiffWalk <fct> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Sex <fct> 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0…
## $ Age <fct> 9, 8, 7, 7, 6, 8, 2, 3, 1, 10, 13, 5, 7, 11, 8, 6…
## $ Education <fct> 5, 3, 5, 3, 4, 6, 6, 6, 6, 6, 4, 6, 3, 3, 5, 5, 5…
## $ Income <fct> 5, 4, 8, 7, 8, 8, 7, 4, 7, 8, 5, 8, 2, 2, 8, 8, 7…
# KNN
train_test_knn <- get_train_test(scaled_df, "Diabetes_012")
knn_pred <- knn(train_test_knn[["x_train"]] , train_test_knn[["x_test"]] , train_test_knn[["y_train"]] , k = 20)
confusionMatrix(knn_pred, train_test_knn[["y_test"]])
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5942 1816
## 1 2663 6789
##
## Accuracy : 0.7397
## 95% CI : (0.7331, 0.7463)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4795
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.6905
## Specificity : 0.7890
## Pos Pred Value : 0.7659
## Neg Pred Value : 0.7183
## Prevalence : 0.5000
## Detection Rate : 0.3453
## Detection Prevalence : 0.4508
## Balanced Accuracy : 0.7397
##
## 'Positive' Class : 0
##
invisible(gc())
The accuracy of KNN algorithms is 74%.
4. Summary
We trained 3 machine learning models to predict diabetes. All of them
showed similar level of accuracy of 74-75%. Random Forest and Logistic
Regression models showed better accuracy than KNN model.
Modeling showed that the most important features for predicting
diabetes are:
I. BMI
II. Age
III. General Health
IV. High blood pressure
V. Income
The least important features are:
I. AnyHealthCare
II. HvyAlcoholConsump
III. Stroke
IV. NoDocbcCost
V. Education
VI. Veggies
VII. Fruits
VIII. Smoker
Modeling with 3 most important features (BMI, Age, General Health) provided 73% accuracy.
BMI Prediction
1. Decision Tree (DT)
# DT
train_test_bmi <- get_train_test(balanced_df, 'BMI')
tree_bmi <- tree (BMI ~ ., data = train_test_bmi[["df_train"]])
pred = predict(tree_bmi, newdata = train_test_bmi[["x_test"]])
postResample(pred = pred, obs = train_test_bmi[["y_test"]])
## RMSE Rsquared MAE
## 6.537785 0.131683 4.671237
invisible(gc())
R squared value is low, which means the model doesn’t fit the data
good enough.
2. Linear Model (LM)
# LM
linear_bmi <- lm(BMI~., data = train_test_bmi[["df_train"]])
predicted_bmi <- predict ( linear_bmi , train_test_bmi[["x_test"]])
summary(linear_bmi)
##
## Call:
## lm(formula = BMI ~ ., data = train_test_bmi[["df_train"]])
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.675 -3.965 -0.804 2.905 71.989
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.974133 0.708669 33.830 < 2e-16 ***
## Diabetes_0121 2.473905 0.057742 42.844 < 2e-16 ***
## HighBP1 2.065539 0.057217 36.100 < 2e-16 ***
## HighChol1 0.113044 0.054558 2.072 0.038272 *
## CholCheck1 0.824533 0.092119 8.951 < 2e-16 ***
## Smoker1 -0.504504 0.050574 -9.976 < 2e-16 ***
## Stroke1 -0.964483 0.107016 -9.013 < 2e-16 ***
## HeartDiseaseorAttack1 -0.253403 0.077457 -3.272 0.001070 **
## PhysActivity1 -1.061638 0.056450 -18.807 < 2e-16 ***
## Fruits1 -0.409091 0.051907 -7.881 3.29e-15 ***
## Veggies1 0.075682 0.061378 1.233 0.217566
## HvyAlcoholConsump1 -0.973331 0.117322 -8.296 < 2e-16 ***
## AnyHealthcare1 0.119642 0.109784 1.090 0.275810
## NoDocbcCost1 -0.307977 0.085002 -3.623 0.000291 ***
## GenHlth2 1.163169 0.084934 13.695 < 2e-16 ***
## GenHlth3 2.108005 0.087075 24.209 < 2e-16 ***
## GenHlth4 2.091706 0.105653 19.798 < 2e-16 ***
## GenHlth5 1.554077 0.143001 10.868 < 2e-16 ***
## MentHlth 0.001772 0.003314 0.535 0.592788
## PhysHlth -0.002752 0.003305 -0.833 0.405013
## DiffWalk1 2.365806 0.070208 33.697 < 2e-16 ***
## Sex1 0.135490 0.050819 2.666 0.007675 **
## Age2 1.730618 0.201948 8.570 < 2e-16 ***
## Age3 2.589935 0.193109 13.412 < 2e-16 ***
## Age4 2.479602 0.189096 13.113 < 2e-16 ***
## Age5 2.699344 0.184402 14.638 < 2e-16 ***
## Age6 2.382123 0.178276 13.362 < 2e-16 ***
## Age7 1.896099 0.172237 11.009 < 2e-16 ***
## Age8 1.554523 0.169983 9.145 < 2e-16 ***
## Age9 0.877739 0.168728 5.202 1.98e-07 ***
## Age10 0.601376 0.169921 3.539 0.000402 ***
## Age11 -0.413402 0.174824 -2.365 0.018049 *
## Age12 -1.286323 0.182524 -7.047 1.84e-12 ***
## Age13 -3.011033 0.181981 -16.546 < 2e-16 ***
## Education2 0.211805 0.694745 0.305 0.760468
## Education3 0.634848 0.686999 0.924 0.355444
## Education4 0.679325 0.681007 0.998 0.318511
## Education5 0.758405 0.681377 1.113 0.265692
## Education6 0.307953 0.681824 0.452 0.651515
## Income2 0.251365 0.139639 1.800 0.071848 .
## Income3 0.339004 0.133268 2.544 0.010968 *
## Income4 0.289942 0.130781 2.217 0.026627 *
## Income5 0.355019 0.128985 2.752 0.005918 **
## Income6 0.496276 0.126203 3.932 8.42e-05 ***
## Income7 0.511397 0.126945 4.028 5.62e-05 ***
## Income8 0.074972 0.124615 0.602 0.547423
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.328 on 68796 degrees of freedom
## Multiple R-squared: 0.1838, Adjusted R-squared: 0.1832
## F-statistic: 344.2 on 45 and 68796 DF, p-value: < 2.2e-16
postResample(pred = predicted_bmi, obs = train_test_bmi[["y_test"]])
## RMSE Rsquared MAE
## 6.3151518 0.1898009 4.5101785
invisible(gc())
R squared value is low as well.
3. K-Nearest Neighbors (KNN)
# KNN
scaled_df_bmi <- balanced_df %>% mutate_at(c("MentHlth", "PhysHlth"), ~(scale(.) %>% as.vector))
train_test_bmi_scaled <- get_train_test(scaled_df_bmi, 'BMI')
knnmodel = knnreg(train_test_bmi_scaled[["x_train"]], train_test_bmi_scaled[["y_train"]], k = 10)
pred_y = predict(knnmodel, train_test_bmi_scaled[["x_test"]])
postResample(pred = pred_y, obs = train_test_bmi_scaled[["y_test"]])
## RMSE Rsquared MAE
## 6.5290902 0.1457319 4.6713351
invisible(gc())
R squared score for this algorithm is very low as well.
4. Summary
We built 3 machine learning models to predict BMI from the other variables. All algorithms got a low R squared score. We make the conclusion that all algorithms that we tried did not provide a good fit for the given data. RMSE score was in the range from 6.3 to 6.6.
We can make a conclusion that BMI, age and general health of a person
can be used to predict the risk of getting diabetes. The other important
factors are high blood pressure and person’s income. Such factors as
either a person is or not a smoker, eats fruits and vegetables, and the
level of education did not prove to be important in predicting diabetes.
In conclusion, we have achieved our 3 research objectives.
RO1: To explore relevant factors contributing
towards diabetes.
We found that BMI, age, general health, high blood pressure and person’s
income are the most relevant factors contributing towards diabetes.
RO2: To develop diabetes prediction model by machine
learning algorithms.
Classification models such as Random Forest, Logistic Regression and KNN
are developed for diabetes prediction.
RO3: To evaluate the performance of diabetes
prediction model.
All of the classification models achieved similar results, which have
accuracy of 74%-75%.
Alexteboul, A. (2022, March 10). Diabetes Health Indicators Dataset Notebook. Kaggle. https://www.kaggle.com/code/alexteboul/diabetes-health-indicators-dataset-notebook/notebook
Behavioral Risk Factor Surveillance System. (2017, August 24). Kaggle. https://www.kaggle.com/datasets/cdc/behavioral-risk-factor-surveillance-system?resource=download
Chapter 77 Feature selection in r | EDAV Fall 2021 Tues/Thurs Community Contributions. (n.d.). https://jtr13.github.io/cc21fall2/feature-selection-in-r.html
char_cor_vars function - RDocumentation. (n.d.). https://www.rdocumentation.org/packages/creditmodel/versions/1.3.1/topics/char_cor_vars
Zach, Z. (2021, May 18). Understanding the Null Hypothesis for Linear Regression. Statology. https://www.statology.org/null-hypothesis-for-linear-regression/