This is my LBB for classification on Machine Learning II. For this LBB, I get data about “heart disease”.
Heart Disease is among the most prevalent chronic diseases in the United States, impacting millions of Americans each year and exerting a significant financial burden on the economy. In the United States alone, heart disease claims roughly 647,000 lives each year — making it the leading cause of death. The build up of plaques inside larger coronary arteries, molecular changes associated with aging, chronic inflammation, high blood pressure, and diabetes are all causes of and risk factors for heart disease.
While there are different types of coronary heart disease, the majority of individuals only learn they have the disease following symptoms such as chest pain, a heart attack, or sudden cardiac arrest. This fact highlights the importance of preventative measures and tests that can accurately predict heart disease in the population prior to negative outcomes like myocardial infarctions (heart attacks) taking place.
The Centers for Disease Control and Prevention has identified high blood pressure, high blood cholesterol, and smoking as three key risk factors for heart disease. Roughly half of Americans have at least one of these three risk factors. The National Heart, Lung, and Blood Institute highlights a wider array of factors such as Age, Environment and Occupation, Family History and Genetics, Lifestyle Habits, Other Medical Conditions, Race or Ethnicity, and Sex for clinicians to use in diagnosing coronary heart disease. Diagnosis tends to be driven by an initial survey of these common risk factors followed by bloodwork and other tests.
Make model for health indicator to predict heart disease or attack ?
Target data is HeartDiseaseorAttack
Use the library() function to import library packages that we needed for the modeling.
# import library
library(dplyr) # data wrangling
library(e1071) # naive bayes
library(caret) # confusion matrix
library(ROCR) # ROC curve
library(partykit) # decision tree / `ctree()`
Use read.csv() function to read the data and head() function to check the data.
# read data
heart_dis <- read.csv("data_input/heart_disease_health_indicators_BRFSS2015.csv")
# check data
head(heart_dis)
#> HeartDiseaseorAttack HighBP HighChol CholCheck BMI Smoker Stroke Diabetes
#> 1 0 1 1 1 40 1 0 0
#> 2 0 0 0 0 25 1 0 0
#> 3 0 1 1 1 28 0 0 0
#> 4 0 1 0 1 27 0 0 0
#> 5 0 1 1 1 24 0 0 0
#> 6 0 1 1 1 25 1 0 0
#> PhysActivity Fruits Veggies HvyAlcoholConsump AnyHealthcare NoDocbcCost
#> 1 0 0 1 0 1 0
#> 2 1 0 0 0 0 1
#> 3 0 1 0 0 1 1
#> 4 1 1 1 0 1 0
#> 5 1 1 1 0 1 0
#> 6 1 1 1 0 1 0
#> GenHlth MentHlth PhysHlth DiffWalk Sex Age Education Income
#> 1 5 18 15 1 0 9 4 3
#> 2 3 0 0 0 0 7 6 1
#> 3 5 30 30 1 0 9 4 8
#> 4 2 0 0 0 0 11 3 6
#> 5 2 3 0 0 0 11 5 4
#> 6 2 0 2 0 1 10 6 8
We use dim() to check the dimension data.
# check dimension data
dim(heart_dis)
#> [1] 253680 22
Our data have 253680 rows and 22 columns.
We check the data type from object heart_dis by using glimpse() function.
# check data type
glimpse(heart_dis)
#> Rows: 253,680
#> Columns: 22
#> $ HeartDiseaseorAttack <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ HighBP <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1~
#> $ HighChol <dbl> 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1~
#> $ CholCheck <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
#> $ BMI <dbl> 40, 25, 28, 27, 24, 25, 30, 25, 30, 24, 25, 34, 2~
#> $ Smoker <dbl> 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0~
#> $ Stroke <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0~
#> $ Diabetes <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 2, 0, 0, 0~
#> $ PhysActivity <dbl> 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1~
#> $ Fruits <dbl> 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1~
#> $ Veggies <dbl> 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1~
#> $ HvyAlcoholConsump <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ AnyHealthcare <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
#> $ NoDocbcCost <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0~
#> $ GenHlth <dbl> 5, 3, 5, 2, 2, 2, 3, 3, 5, 2, 3, 3, 3, 4, 4, 2, 3~
#> $ MentHlth <dbl> 18, 0, 30, 0, 3, 0, 0, 0, 30, 0, 0, 0, 0, 0, 30, ~
#> $ PhysHlth <dbl> 15, 0, 30, 0, 0, 2, 14, 0, 30, 0, 0, 30, 15, 0, 2~
#> $ DiffWalk <dbl> 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0~
#> $ Sex <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0~
#> $ Age <dbl> 9, 7, 9, 11, 11, 10, 9, 11, 9, 8, 13, 10, 7, 11, ~
#> $ Education <dbl> 4, 6, 4, 3, 5, 6, 6, 4, 5, 4, 6, 5, 5, 4, 6, 6, 4~
#> $ Income <dbl> 3, 1, 8, 6, 4, 8, 7, 4, 1, 3, 8, 1, 7, 6, 2, 8, 3~
What variable that we need to change the data type ?
# adjust data type
heart_dis <- heart_dis %>%
mutate_at(vars(HeartDiseaseorAttack, HighBP, HighChol, CholCheck, Smoker, Stroke, PhysActivity, Fruits, Veggies, HvyAlcoholConsump, AnyHealthcare, NoDocbcCost, DiffWalk, Sex), as.factor)
glimpse(heart_dis)
#> Rows: 253,680
#> Columns: 22
#> $ HeartDiseaseorAttack <fct> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ HighBP <fct> 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1~
#> $ HighChol <fct> 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1~
#> $ CholCheck <fct> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
#> $ BMI <dbl> 40, 25, 28, 27, 24, 25, 30, 25, 30, 24, 25, 34, 2~
#> $ Smoker <fct> 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0~
#> $ Stroke <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0~
#> $ Diabetes <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 2, 0, 0, 0~
#> $ PhysActivity <fct> 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1~
#> $ Fruits <fct> 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1~
#> $ Veggies <fct> 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1~
#> $ HvyAlcoholConsump <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ AnyHealthcare <fct> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
#> $ NoDocbcCost <fct> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0~
#> $ GenHlth <dbl> 5, 3, 5, 2, 2, 2, 3, 3, 5, 2, 3, 3, 3, 4, 4, 2, 3~
#> $ MentHlth <dbl> 18, 0, 30, 0, 3, 0, 0, 0, 30, 0, 0, 0, 0, 0, 30, ~
#> $ PhysHlth <dbl> 15, 0, 30, 0, 0, 2, 14, 0, 30, 0, 0, 30, 15, 0, 2~
#> $ DiffWalk <fct> 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0~
#> $ Sex <fct> 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0~
#> $ Age <dbl> 9, 7, 9, 11, 11, 10, 9, 11, 9, 8, 13, 10, 7, 11, ~
#> $ Education <dbl> 4, 6, 4, 3, 5, 6, 6, 4, 5, 4, 6, 5, 5, 4, 6, 6, 4~
#> $ Income <dbl> 3, 1, 8, 6, 4, 8, 7, 4, 1, 3, 8, 1, 7, 6, 2, 8, 3~
str(heart_dis)
#> 'data.frame': 253680 obs. of 22 variables:
#> $ HeartDiseaseorAttack: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 1 ...
#> $ HighBP : Factor w/ 2 levels "0","1": 2 1 2 2 2 2 2 2 2 1 ...
#> $ HighChol : Factor w/ 2 levels "0","1": 2 1 2 1 2 2 1 2 2 1 ...
#> $ CholCheck : Factor w/ 2 levels "0","1": 2 1 2 2 2 2 2 2 2 2 ...
#> $ BMI : num 40 25 28 27 24 25 30 25 30 24 ...
#> $ Smoker : Factor w/ 2 levels "0","1": 2 2 1 1 1 2 2 2 2 1 ...
#> $ Stroke : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
#> $ Diabetes : num 0 0 0 0 0 0 0 0 2 0 ...
#> $ PhysActivity : Factor w/ 2 levels "0","1": 1 2 1 2 2 2 1 2 1 1 ...
#> $ Fruits : Factor w/ 2 levels "0","1": 1 1 2 2 2 2 1 1 2 1 ...
#> $ Veggies : Factor w/ 2 levels "0","1": 2 1 1 2 2 2 1 2 2 2 ...
#> $ HvyAlcoholConsump : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
#> $ AnyHealthcare : Factor w/ 2 levels "0","1": 2 1 2 2 2 2 2 2 2 2 ...
#> $ NoDocbcCost : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 1 1 1 ...
#> $ GenHlth : num 5 3 5 2 2 2 3 3 5 2 ...
#> $ MentHlth : num 18 0 30 0 3 0 0 0 30 0 ...
#> $ PhysHlth : num 15 0 30 0 0 2 14 0 30 0 ...
#> $ DiffWalk : Factor w/ 2 levels "0","1": 2 1 2 1 1 1 1 2 2 1 ...
#> $ Sex : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 2 ...
#> $ Age : num 9 7 9 11 11 10 9 11 9 8 ...
#> $ Education : num 4 6 4 3 5 6 6 4 5 4 ...
#> $ Income : num 3 1 8 6 4 8 7 4 1 3 ...
The data type is already appropriate, we can do the next process.
We can check the missing value in the data by using anyNA(), is.na() and colsum()/nrow() with dplyr format.
# Is there any NA ?
anyNA(heart_dis)
#> [1] FALSE
# How much the missing value for each column ?
heart_dis %>%
is.na() %>%
colSums()/nrow(heart_dis)
#> HeartDiseaseorAttack HighBP HighChol
#> 0 0 0
#> CholCheck BMI Smoker
#> 0 0 0
#> Stroke Diabetes PhysActivity
#> 0 0 0
#> Fruits Veggies HvyAlcoholConsump
#> 0 0 0
#> AnyHealthcare NoDocbcCost GenHlth
#> 0 0 0
#> MentHlth PhysHlth DiffWalk
#> 0 0 0
#> Sex Age Education
#> 0 0 0
#> Income
#> 0
There is no missing value, so we can use the data for the next process.
#Check summary data
summary(heart_dis)
#> HeartDiseaseorAttack HighBP HighChol CholCheck BMI
#> 0:229787 0:144851 0:146089 0: 9470 Min. :12.00
#> 1: 23893 1:108829 1:107591 1:244210 1st Qu.:24.00
#> Median :27.00
#> Mean :28.38
#> 3rd Qu.:31.00
#> Max. :98.00
#> Smoker Stroke Diabetes PhysActivity Fruits Veggies
#> 0:141257 0:243388 Min. :0.0000 0: 61760 0: 92782 0: 47839
#> 1:112423 1: 10292 1st Qu.:0.0000 1:191920 1:160898 1:205841
#> Median :0.0000
#> Mean :0.2969
#> 3rd Qu.:0.0000
#> Max. :2.0000
#> HvyAlcoholConsump AnyHealthcare NoDocbcCost GenHlth MentHlth
#> 0:239424 0: 12417 0:232326 Min. :1.000 Min. : 0.000
#> 1: 14256 1:241263 1: 21354 1st Qu.:2.000 1st Qu.: 0.000
#> Median :2.000 Median : 0.000
#> Mean :2.511 Mean : 3.185
#> 3rd Qu.:3.000 3rd Qu.: 2.000
#> Max. :5.000 Max. :30.000
#> PhysHlth DiffWalk Sex Age Education
#> Min. : 0.000 0:211005 0:141974 Min. : 1.000 Min. :1.00
#> 1st Qu.: 0.000 1: 42675 1:111706 1st Qu.: 6.000 1st Qu.:4.00
#> Median : 0.000 Median : 8.000 Median :5.00
#> Mean : 4.242 Mean : 8.032 Mean :5.05
#> 3rd Qu.: 3.000 3rd Qu.:10.000 3rd Qu.:6.00
#> Max. :30.000 Max. :13.000 Max. :6.00
#> Income
#> Min. :1.000
#> 1st Qu.:5.000
#> Median :7.000
#> Mean :6.054
#> 3rd Qu.:8.000
#> Max. :8.000
Use prop.table() function to check the proportion of each levels at target variable. The levels at targets variable are 0 = “no” and 1 = “yes”
# check proportion for target variable `HeartDiseaseorAttack`
prop.table(table(heart_dis$HeartDiseaseorAttack))
#>
#> 0 1
#> 0.90581441 0.09418559
# check number of rows for target variable `HeartDiseaseorAttack`
table(heart_dis$HeartDiseaseorAttack)
#>
#> 0 1
#> 229787 23893
Based on the prop.table(), there is an imbalanced proportion between each level in the target variable. We will use downsampling to balance the proportion.
We can see the correlation between Target and Predictor Variable with boxplot.
# Numeric Variable :
# 1 = heart_dis$BMI
# 2 = heart_dis$Diabetes
# 3 = heart_dis$GenHlth
# 4 = heart_dis$MentHlth
# 5 = heart_dis$PhysHlth
# 6 = heart_dis$Age
# 7 = heart_dis$Education
# 8 = heart_dis$Income
# 9 = heart_dis$HeartDiseaseorAttack (Target)
boxplot(heart_dis$BMI, heart_dis$Diabetes, heart_dis$GenHlth, heart_dis$MentHlth, heart_dis$PhysHlth, heart_dis$Age, heart_dis$Education, heart_dis$Income, heart_dis$HeartDiseaseorAttack)
Based on the boxplot above, we get the insight, such as :
For Predictor : Diabetes, GenHlth, MentHlth, PhysHlth, Age, Education, and Income => When 0 < predictor < 1, the risk for Heart Disease or Attack is between 0 - 1.
For variable BMI, there is no correlattion (zero risk) for Heart Disease or Attack
Diabetes, MentHlth and PhysHlth variable have many outlier (> 5)
There is outlier < 5 for variable Diabetes, GenHlth and HeartDiseaseorAttack
# Part A - character(factor) variable :
# 1 = heart_dis$HighBP
# 2 = heart_dis$HighChol
# 3 = heart_dis$CholCheck
# 4 = heart_dis$Smoker
# 5 = heart_dis$Stroke
# 6 = heart_dis$PhysActivity
# 7 = heart_dis$HeartDiseaseorAttack (Target)
boxplot(heart_dis$HighBP, heart_dis$HighChol, heart_dis$CholCheck, heart_dis$Smoker, heart_dis$Stroke, heart_dis$PhysActivity, heart_dis$HeartDiseaseorAttack)
Based on the boxplot above, we get the insight, such as :
For Predictor : HighBP, HighChol, CholCheck, Smoker, Stroke, and PhysActivity => When 0 < predictor < 1, the risk for Heart Disease or Attack is between 0 - 1.
CholCheckdan PhysActivity variable have 1 lower outlier for each variable
Stroke and HeartDiseaseorAttack variable have 1 upper outlier for each variable
# Part B - character(factor) variable :
# 1 = heart_dis$Fruits
# 2 = heart_dis$Veggies
# 3 = heart_dis$HvyAlcoholConsump
# 4 = heart_dis$AnyHealthcare
# 5 = heart_dis$NoDocbcCost
# 6 = heart_dis$DiffWalk
# 7 = heart_dis$Sex
# 8 = heart_dis$HeartDiseaseorAttack (Target)
boxplot(heart_dis$Fruits, heart_dis$Veggies, heart_dis$HvyAlcoholConsump, heart_dis$AnyHealthcare, heart_dis$NoDocbcCost, heart_dis$DiffWalk, heart_dis$Sex, heart_dis$HeartDiseaseorAttack)
Based on the boxplot above, we get the insight, such as :
For Predictor : Fruits, Veggies, HvyAlcoholConsump, AnyHealthcare, NoDocbcCost, DiffWalk, and Sex => When 0 < predictor < 1, the risk for Heart Disease or Attack is between 0 - 1.
Veggiesdan AnyHealthcare variable have 1 lower outlier for each variable
HvyAlcoholConsump, NoDocbcCost, DiffWalk and HeartDiseaseorAttack variable have 1 upper outlier for each variable
We split data into train data and test data.
RNGkind(sample.kind = "Rounding")
set.seed(100)
# train-test splitting
# n sample = 75% for train
index <- sample(nrow(heart_dis), nrow(heart_dis)*0.75)
heart_d_train <- heart_dis[index,]
heart_d_test <- heart_dis[-index,]
prop.table(table(heart_dis$HeartDiseaseorAttack))
#>
#> 0 1
#> 0.90581441 0.09418559
prop.table(table(heart_d_train$HeartDiseaseorAttack))
#>
#> 0 1
#> 0.90541364 0.09458636
prop.table(table(heart_d_test$HeartDiseaseorAttack))
#>
#> 0 1
#> 0.90701671 0.09298329
Based on the prop.table(), there is an imbalanced proportion between each level in the target variable of data train and data test. We will use downsampling to balance the proportion.
# downsampling - data train only
RNGkind(sample.kind = "Rounding")
set.seed(100)
heart_dis_train <- downSample(x = heart_d_train %>% select(-HeartDiseaseorAttack),
y = heart_d_train$HeartDiseaseorAttack,
yname = "HeartDiseaseorAttack")
head(heart_dis_train)
#> HighBP HighChol CholCheck BMI Smoker Stroke Diabetes PhysActivity Fruits
#> 1 0 0 1 26 0 0 0 1 1
#> 2 0 0 1 22 0 0 0 0 1
#> 3 0 0 1 20 1 0 0 1 0
#> 4 0 0 1 31 1 0 0 0 0
#> 5 1 1 1 29 1 0 2 0 0
#> 6 1 0 1 26 1 0 0 1 1
#> Veggies HvyAlcoholConsump AnyHealthcare NoDocbcCost GenHlth MentHlth PhysHlth
#> 1 1 0 1 0 2 0 0
#> 2 1 0 1 0 1 5 0
#> 3 1 0 1 0 2 0 1
#> 4 1 0 0 0 4 0 10
#> 5 0 0 1 0 4 14 0
#> 6 1 0 1 0 2 0 0
#> DiffWalk Sex Age Education Income HeartDiseaseorAttack
#> 1 0 1 8 6 8 0
#> 2 0 0 5 6 8 0
#> 3 0 0 2 5 4 0
#> 4 0 1 4 2 4 0
#> 5 0 0 8 4 4 0
#> 6 0 1 12 5 8 0
# check the column names of data `train`
names(heart_dis_train)
#> [1] "HighBP" "HighChol" "CholCheck"
#> [4] "BMI" "Smoker" "Stroke"
#> [7] "Diabetes" "PhysActivity" "Fruits"
#> [10] "Veggies" "HvyAlcoholConsump" "AnyHealthcare"
#> [13] "NoDocbcCost" "GenHlth" "MentHlth"
#> [16] "PhysHlth" "DiffWalk" "Sex"
#> [19] "Age" "Education" "Income"
#> [22] "HeartDiseaseorAttack"
# check the proportion of the levels in the target variable of data `train`
prop.table(table(heart_dis_train$HeartDiseaseorAttack))
#>
#> 0 1
#> 0.5 0.5
Based on prop.table() above, there is a balanced proportion between each level in our target variable of the data train. We can do the next process.
Use naiveBayes() function to do the fitting model by data train.
# train
heart_d_naive <- naiveBayes(x = heart_dis_train %>% select(-HeartDiseaseorAttack),
y = heart_dis_train$HeartDiseaseorAttack)
We use data heart_d_naive$table to make model interpretation based on target variable heart_dis_train$HeartDiseaseorAttack for each predictor variable.
# 1. Model interpretation : HighBP
# HighBP : Respondents that have ever reported having coronary heart disease (CHD) or myocardial infarction (MI)
heart_d_naive$tables$HighBP
#> HighBP
#> heart_dis_train$HeartDiseaseorAttack 0 1
#> 0 0.6054123 0.3945877
#> 1 0.2504445 0.7495555
Based on the model above, we get the insight, such as :
# 2. Model Interpretation : HighChol
# HighChol : Have you EVER been told by a doctor, nurse or other health professional that your blood cholesterol is high?
heart_d_naive$tables$HighChol
#> HighChol
#> heart_dis_train$HeartDiseaseorAttack 0 1
#> 0 0.6024672 0.3975328
#> 1 0.2986775 0.7013225
Based on the model above, we get the insight, such as :
# 3. Model Interpretation : CholCheck
# CholCheck : Cholesterol check within past five years
heart_d_naive$tables$CholCheck
#> CholCheck
#> heart_dis_train$HeartDiseaseorAttack 0 1
#> 0 0.03995332 0.96004668
#> 1 0.01161369 0.98838631
Based on the model above, we get the insight, such as :
# 4. Model Interpretation : BMI
# BMI : Body Mass Index (BMI)
heart_d_naive$tables$BMI
#> BMI
#> heart_dis_train$HeartDiseaseorAttack [,1] [,2]
#> 0 28.28167 6.668373
#> 1 29.49216 6.731924
Based on the model above, we get the insight, such as :
# 5. Model Interpretation : Smoker
# Smoker : Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes]
heart_d_naive$tables$Smoker
#> Smoker
#> heart_dis_train$HeartDiseaseorAttack 0 1
#> 0 0.5768504 0.4231496
#> 1 0.3801400 0.6198600
Based on the model above, we get the insight, such as :
# 6. Model Interpretation : Stroke
# Stroke : (Ever told) you had a stroke.
heart_d_naive$tables$Stroke
#> Stroke
#> heart_dis_train$HeartDiseaseorAttack 0 1
#> 0 0.97260502 0.02739498
#> 1 0.83318515 0.16681485
Based on the model above, we get the insight, such as :
# 7. Model Interpretation : Diabetes
# Diabetes : 0 is no diabetes, 1 is pre-diabetes, and 2 is diabetes
heart_d_naive$tables$Diabetes
#> Diabetes
#> heart_dis_train$HeartDiseaseorAttack [,1] [,2]
#> 0 0.2556679 0.6541666
#> 1 0.6917093 0.9364824
Based on the model above, we get the insight, such as :
# 8. Model Interpretation : PhysActivity
# PhysActivity : Adults who reported doing physical activity or exercise during the past 30 days other than their regular job
heart_d_naive$tables$PhysActivity
#> PhysActivity
#> heart_dis_train$HeartDiseaseorAttack 0 1
#> 0 0.2320516 0.7679484
#> 1 0.3618582 0.6381418
Based on the model above, we get the insight, such as :
# 9. Model Interpretation : Fruits
# Fruits : Consume Fruit 1 or more times per day
heart_d_naive$tables$Fruits
#> Fruits
#> heart_dis_train$HeartDiseaseorAttack 0 1
#> 0 0.3637475 0.6362525
#> 1 0.3954768 0.6045232
Based on the model above, we get the insight, such as :
# 10. Model Interpretation : Veggies
# Veggies : Consume Vegetables 1 or more times per day
heart_d_naive$tables$Veggies
#> Veggies
#> heart_dis_train$HeartDiseaseorAttack 0 1
#> 0 0.1844299 0.8155701
#> 1 0.2352189 0.7647811
Based on the model above, we get the insight, such as :
# 11. Model Interpretation : HvyAlcoholConsump
# HvyAlcoholConsump : Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week)
heart_d_naive$tables$HvyAlcoholConsump
#> HvyAlcoholConsump
#> heart_dis_train$HeartDiseaseorAttack 0 1
#> 0 0.94004223 0.05995777
#> 1 0.96415870 0.03584130
Based on the model above, we get the insight, such as :
# 12. Model Interpretation : AnyHealthcare
# AnyHealthcare : Do you have any kind of health care coverage, including health insurance, prepaid plans such as HMOs, or government plans such as Medicare, or Indian Health Service?
heart_d_naive$tables$AnyHealthcare
#> AnyHealthcare
#> heart_dis_train$HeartDiseaseorAttack 0 1
#> 0 0.05067793 0.94932207
#> 1 0.03673038 0.96326962
Based on the model above, we get the insight, such as :
# 13. Model Interpretation : NoDocbcCost
# NoDocbcCost : Was there a time in the past 12 months when you needed to see a doctor but could not because of cost?
heart_d_naive$tables$NoDocbcCost
#> NoDocbcCost
#> heart_dis_train$HeartDiseaseorAttack 0 1
#> 0 0.92076017 0.07923983
#> 1 0.88725272 0.11274728
Based on the model above, we get the insight, such as :
# 14. Model Interpretation : GenHlth
# GenHlth : Would you say that in general your health is ?
heart_d_naive$tables$GenHlth
#> GenHlth
#> heart_dis_train$HeartDiseaseorAttack [,1] [,2]
#> 0 2.414314 1.025782
#> 1 3.372749 1.084166
Based on the model above, we get the insight, such as :
# 15. Model Interpretation : MentHlth
# MentHlth : Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good?
heart_d_naive$tables$MentHlth
#> MentHlth
#> heart_dis_train$HeartDiseaseorAttack [,1] [,2]
#> 0 3.068349 7.295966
#> 1 4.724050 9.251305
Based on the model above, we get the insight, such as :
# 16. Model Interpretation : PhysHlth
# PhysHlth : Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good?
heart_d_naive$tables$PhysHlth
#> PhysHlth
#> heart_dis_train$HeartDiseaseorAttack [,1] [,2]
#> 0 3.697099 8.138909
#> 1 9.212658 11.920467
Based on the model above, we get the insight, such as :
# 17. Model Interpretation : DiffWalk
# DiffWalk : Do you have serious difficulty walking or climbing stairs?
heart_d_naive$tables$DiffWalk
#> DiffWalk
#> heart_dis_train$HeartDiseaseorAttack 0 1
#> 0 0.8596355 0.1403645
#> 1 0.5853523 0.4146477
Based on the model above, we get the insight, such as :
# 18. Model Interpretation : Sex
# Sex : Indicate sex of respondent : 0 = male and 1 = female
heart_d_naive$tables$Sex
#> Sex
#> heart_dis_train$HeartDiseaseorAttack 0 1
#> 0 0.5806290 0.4193710
#> 1 0.4229829 0.5770171
Based on the model above, we get the insight, such as :
# 19. Model Interpretation : Age
# Age : Fourteen-level age category
heart_d_naive$tables$Age
#> Age
#> heart_dis_train$HeartDiseaseorAttack [,1] [,2]
#> 0 7.798066 3.052001
#> 1 10.109969 2.221211
Based on the model above, we get the insight, such as :
# 20. Model Interpretation : Education
# Education : What is the highest grade or year of school you completed?
heart_d_naive$tables$Education
#> Education
#> heart_dis_train$HeartDiseaseorAttack [,1] [,2]
#> 0 5.072016 0.9753468
#> 1 4.740887 1.0649576
Based on the model above, we get the insight, such as :
# 21. Model Interpretation : Income
# Income : Is your annual household income from all sources:(If respondent refuses at any income level, code "Refused.")
heart_d_naive$tables$Income
#> Income
#> heart_dis_train$HeartDiseaseorAttack [,1] [,2]
#> 0 6.138753 2.029739
#> 1 5.139198 2.197479
Based on the model above, we get the insight, such as :
We do predict class from test data by using predict() function. Then, we do model evaluation by using confusionMatrix().
# Predict **class** from test data with function `predict()`
heart_pred_class <- predict(object = heart_d_naive,
newdata = heart_d_test,
type = "class")
head(heart_pred_class)
#> [1] 0 1 0 0 0 1
#> Levels: 0 1
# Model Evaluation with `confusionMatrix()`
confusionMatrix(data = heart_pred_class, #predict result
reference = heart_d_test$HeartDiseaseorAttack) #actual data
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 43990 1793
#> 1 13533 4104
#>
#> Accuracy : 0.7583
#> 95% CI : (0.755, 0.7617)
#> No Information Rate : 0.907
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.2433
#>
#> Mcnemar's Test P-Value : <0.0000000000000002
#>
#> Sensitivity : 0.7647
#> Specificity : 0.6959
#> Pos Pred Value : 0.9608
#> Neg Pred Value : 0.2327
#> Prevalence : 0.9070
#> Detection Rate : 0.6936
#> Detection Prevalence : 0.7219
#> Balanced Accuracy : 0.7303
#>
#> 'Positive' Class : 0
#>
Based on the model evaluation above, we get the insight, such as :
We do predict probability of data naive bayes based on data test by using predict() function. Then, we build ROC model by compare predict probability and actual (ROC curve). We can build AUC model by using performance() function.
# Take predict result to format probability
heart_pred_prob <- predict(object = heart_d_naive,
newdata = heart_d_test,
type = "raw")
head(heart_pred_prob)
#> 0 1
#> [1,] 0.8342897942 0.165710206
#> [2,] 0.0059961511 0.994003849
#> [3,] 0.9946525796 0.005347420
#> [4,] 0.6013037878 0.398696212
#> [5,] 0.9977844452 0.002215555
#> [6,] 0.0004100749 0.999589925
# ROC : predict vs actual
data_roc <- data.frame(pred_prob = heart_pred_prob[, '1'],
actual = ifelse(heart_d_test$HeartDiseaseorAttack == '1',1, 0))
head(data_roc)
#> pred_prob actual
#> 1 0.165710206 0
#> 2 0.994003849 0
#> 3 0.005347420 0
#> 4 0.398696212 0
#> 5 0.002215555 0
#> 6 0.999589925 0
# objek prediction
heart_roc <- prediction(predictions = data_roc$pred_prob,
labels = data_roc$actual)
# ROC curve
plot(performance(heart_roc,"tpr","fpr"))
# AUC value
heart_auc <- performance(heart_roc, measure = "auc")
heart_auc@y.values
#> [[1]]
#> [1] 0.8144615
Based on the model above, we get the insight, such as :
AUC = 0.8142672, so we can conclude that model heart_d_naive is good enough at separating classes (1 and 0).
Use ctree() function to do the fitting model by data train with all predictor variables.
heart_tree <- ctree(formula = HeartDiseaseorAttack ~.,
data = heart_dis_train)
# print struktur tree
# heart_tree
# visualization decision tree
plot(heart_tree, type = "simple")
# predict class in data test
pred_heart_test <- predict(object = heart_tree,
newdata = heart_d_test,
type = "response")
# confusion matrix data test
confusionMatrix(data = pred_heart_test,
reference = heart_d_test$HeartDiseaseorAttack,
positive = "1")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 41452 1169
#> 1 16071 4728
#>
#> Accuracy : 0.7282
#> 95% CI : (0.7247, 0.7316)
#> No Information Rate : 0.907
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.2448
#>
#> Mcnemar's Test P-Value : <0.0000000000000002
#>
#> Sensitivity : 0.80176
#> Specificity : 0.72062
#> Pos Pred Value : 0.22732
#> Neg Pred Value : 0.97257
#> Prevalence : 0.09298
#> Detection Rate : 0.07455
#> Detection Prevalence : 0.32796
#> Balanced Accuracy : 0.76119
#>
#> 'Positive' Class : 1
#>
Before tuning, we get sensitivity/recall value (0.80176) in data test.
# predict class in data train
pred_heart_train <- predict(object = heart_tree,
newdata = heart_dis_train,
type = "response")
# confusion matrix data train
confusionMatrix(data = pred_heart_train,
reference = heart_dis_train$HeartDiseaseorAttack,
positive = "1")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 13213 3373
#> 1 4783 14623
#>
#> Accuracy : 0.7734
#> 95% CI : (0.769, 0.7777)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 0.5468
#>
#> Mcnemar's Test P-Value : < 0.00000000000000022
#>
#> Sensitivity : 0.8126
#> Specificity : 0.7342
#> Pos Pred Value : 0.7535
#> Neg Pred Value : 0.7966
#> Prevalence : 0.5000
#> Detection Rate : 0.4063
#> Detection Prevalence : 0.5392
#> Balanced Accuracy : 0.7734
#>
#> 'Positive' Class : 1
#>
Before tuning, we get sensitivity/recall value (0.8126) in data train.
We try to tuning model in data train with mincriterion = 0.05, minsplit = 300, minbucket = 200.
After tuning model on above, we do evaluation model in data test by using predict() and confusionMatrix() function.
# predict class in data test only
pred_heart_test_tuned <- predict(object = heart_tree_complex,
newdata = heart_d_test,
type = "response")
# confusion matrix data test only
confusionMatrix(data = pred_heart_test_tuned,reference = heart_d_test$HeartDiseaseorAttack,
positive = "1")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 40652 1153
#> 1 16871 4744
#>
#> Accuracy : 0.7158
#> 95% CI : (0.7123, 0.7193)
#> No Information Rate : 0.907
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.2328
#>
#> Mcnemar's Test P-Value : <0.0000000000000002
#>
#> Sensitivity : 0.80448
#> Specificity : 0.70671
#> Pos Pred Value : 0.21948
#> Neg Pred Value : 0.97242
#> Prevalence : 0.09298
#> Detection Rate : 0.07480
#> Detection Prevalence : 0.34082
#> Balanced Accuracy : 0.75559
#>
#> 'Positive' Class : 1
#>
Based on the model evaluation above, we get the insight, such as :
Before-Tuned: We get data on sensitivity value 0.80176 (minimizing FN = “recall”), which is high enough to Model Interpretation to solve the business problem.
After-Tuned: We get data on sensitivity value 0.80448 (minimizing FN = “recall”), The results did not change much after tuning the tree.
My conclusion from analysis data on above are : We prefer to use Naive Bayes model to solving the business problem, because result of Model Evaluation is more precise / better.
I get the data for this LBB at : https://www.kaggle.com/alexteboul/heart-disease-health-indicators-dataset?select=heart_disease_health_indicators_BRFSS2015.csv.