HEART DISEASE

1. Introduction

This is my LBB for classification on Machine Learning II. For this LBB, I get data about “heart disease”.

Heart Disease is among the most prevalent chronic diseases in the United States, impacting millions of Americans each year and exerting a significant financial burden on the economy. In the United States alone, heart disease claims roughly 647,000 lives each year — making it the leading cause of death. The build up of plaques inside larger coronary arteries, molecular changes associated with aging, chronic inflammation, high blood pressure, and diabetes are all causes of and risk factors for heart disease.

While there are different types of coronary heart disease, the majority of individuals only learn they have the disease following symptoms such as chest pain, a heart attack, or sudden cardiac arrest. This fact highlights the importance of preventative measures and tests that can accurately predict heart disease in the population prior to negative outcomes like myocardial infarctions (heart attacks) taking place.

The Centers for Disease Control and Prevention has identified high blood pressure, high blood cholesterol, and smoking as three key risk factors for heart disease. Roughly half of Americans have at least one of these three risk factors. The National Heart, Lung, and Blood Institute highlights a wider array of factors such as Age, Environment and Occupation, Family History and Genetics, Lifestyle Habits, Other Medical Conditions, Race or Ethnicity, and Sex for clinicians to use in diagnosing coronary heart disease. Diagnosis tends to be driven by an initial survey of these common risk factors followed by bloodwork and other tests.

1.1. Data Outline

HeartDiseaseorAttack : Respondents that have ever reported having coronary heart disease (CHD)or myocardial infarction (MI)
HighBP : Respondents that have ever reported having coronary heart disease (CHD)or myocardial infarction (MI)
HighChol : Have you EVER been told by a doctor, nurse or other health professional that your blood cholesterol is high?
CholCheck : Cholesterol check within past five years
BMI : Body Mass Index (BMI)
Smoker : Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes]
Stroke : (Ever told) you had a stroke.
Diabetes : 0 is no diabetes, 1 is pre-diabetes, and 2 is diabetes
PhysActivity : Adults who reported doing physical activity or exercise during the past 30 days other than their regular job
Fruits : Consume Fruit 1 or more times per day
Veggies : Consume Vegetables 1 or more times per day
HvyAlcoholConsump : Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week)
AnyHealthcare : Do you have any kind of health care coverage, including health insurance, prepaid plans such as HMOs, or government plans such as Medicare, or Indian Health Service?
NoDocbcCost : Was there a time in the past 12 months when you needed to see a doctor but could not because of cost?
GenHlth : Would you say that in general your health is ?
MentHlth : Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good?
PhysHlth : Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good?
DiffWalk : Do you have serious difficulty walking or climbing stairs?
Sex : Indicate sex of respondent : 0 = male and 1 = female
Age : Fourteen-level age category
Education : What is the highest grade or year of school you completed?
Income : Is your annual household income from all sources: (If respondent refuses at any income level, code “Refused.”)

1.2. Business Problem

Make model for health indicator to predict heart disease or attack ?

Target data is HeartDiseaseorAttack

2. Import Library and Read Data

2.1. Import Library

Use the library() function to import library packages that we needed for the modeling.

# import library
library(dplyr) # data wrangling
library(e1071) # naive bayes
library(caret) # confusion matrix
library(ROCR) # ROC curve
library(partykit) # decision tree / `ctree()`

2.2. Read data

Use read.csv() function to read the data and head() function to check the data.

# read data
heart_dis <- read.csv("data_input/heart_disease_health_indicators_BRFSS2015.csv")


# check data
head(heart_dis)

#>   HeartDiseaseorAttack HighBP HighChol CholCheck BMI Smoker Stroke Diabetes
#> 1                    0      1        1         1  40      1      0        0
#> 2                    0      0        0         0  25      1      0        0
#> 3                    0      1        1         1  28      0      0        0
#> 4                    0      1        0         1  27      0      0        0
#> 5                    0      1        1         1  24      0      0        0
#> 6                    0      1        1         1  25      1      0        0
#>   PhysActivity Fruits Veggies HvyAlcoholConsump AnyHealthcare NoDocbcCost
#> 1            0      0       1                 0             1           0
#> 2            1      0       0                 0             0           1
#> 3            0      1       0                 0             1           1
#> 4            1      1       1                 0             1           0
#> 5            1      1       1                 0             1           0
#> 6            1      1       1                 0             1           0
#>   GenHlth MentHlth PhysHlth DiffWalk Sex Age Education Income
#> 1       5       18       15        1   0   9         4      3
#> 2       3        0        0        0   0   7         6      1
#> 3       5       30       30        1   0   9         4      8
#> 4       2        0        0        0   0  11         3      6
#> 5       2        3        0        0   0  11         5      4
#> 6       2        0        2        0   1  10         6      8

We use dim() to check the dimension data.

# check dimension data
dim(heart_dis)

#> [1] 253680     22

Our data have 253680 rows and 22 columns.

3. Data Wrangling

3.1. Check Data Type

We check the data type from object heart_dis by using glimpse() function.

# check data type
glimpse(heart_dis)

#> Rows: 253,680
#> Columns: 22
#> $ HeartDiseaseorAttack <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ HighBP               <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1~
#> $ HighChol             <dbl> 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1~
#> $ CholCheck            <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
#> $ BMI                  <dbl> 40, 25, 28, 27, 24, 25, 30, 25, 30, 24, 25, 34, 2~
#> $ Smoker               <dbl> 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0~
#> $ Stroke               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0~
#> $ Diabetes             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 2, 0, 0, 0~
#> $ PhysActivity         <dbl> 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1~
#> $ Fruits               <dbl> 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1~
#> $ Veggies              <dbl> 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1~
#> $ HvyAlcoholConsump    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ AnyHealthcare        <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
#> $ NoDocbcCost          <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0~
#> $ GenHlth              <dbl> 5, 3, 5, 2, 2, 2, 3, 3, 5, 2, 3, 3, 3, 4, 4, 2, 3~
#> $ MentHlth             <dbl> 18, 0, 30, 0, 3, 0, 0, 0, 30, 0, 0, 0, 0, 0, 30, ~
#> $ PhysHlth             <dbl> 15, 0, 30, 0, 0, 2, 14, 0, 30, 0, 0, 30, 15, 0, 2~
#> $ DiffWalk             <dbl> 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0~
#> $ Sex                  <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0~
#> $ Age                  <dbl> 9, 7, 9, 11, 11, 10, 9, 11, 9, 8, 13, 10, 7, 11, ~
#> $ Education            <dbl> 4, 6, 4, 3, 5, 6, 6, 4, 5, 4, 6, 5, 5, 4, 6, 6, 4~
#> $ Income               <dbl> 3, 1, 8, 6, 4, 8, 7, 4, 1, 3, 8, 1, 7, 6, 2, 8, 3~

What variable that we need to change the data type ?

HeartDiseaseorAttack => as factor
HighBP => as factor
HighChol => as factor
CholCheck => as factor
Smoker => as factor
Stroke => as factor
PhysActivity => as factor
Fruits => as factor
Veggies => as factor
HvyAlcoholConsump => as factor
AnyHealthcare => as factor
NoDocbcCost => as factor
DiffWalk => as factor
Sex => as factor

# adjust data type
heart_dis <- heart_dis %>% 
  mutate_at(vars(HeartDiseaseorAttack, HighBP, HighChol, CholCheck, Smoker, Stroke, PhysActivity, Fruits, Veggies, HvyAlcoholConsump, AnyHealthcare, NoDocbcCost, DiffWalk, Sex), as.factor)

glimpse(heart_dis)

#> Rows: 253,680
#> Columns: 22
#> $ HeartDiseaseorAttack <fct> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ HighBP               <fct> 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1~
#> $ HighChol             <fct> 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1~
#> $ CholCheck            <fct> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
#> $ BMI                  <dbl> 40, 25, 28, 27, 24, 25, 30, 25, 30, 24, 25, 34, 2~
#> $ Smoker               <fct> 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0~
#> $ Stroke               <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0~
#> $ Diabetes             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 2, 0, 0, 0~
#> $ PhysActivity         <fct> 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1~
#> $ Fruits               <fct> 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1~
#> $ Veggies              <fct> 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1~
#> $ HvyAlcoholConsump    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ AnyHealthcare        <fct> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
#> $ NoDocbcCost          <fct> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0~
#> $ GenHlth              <dbl> 5, 3, 5, 2, 2, 2, 3, 3, 5, 2, 3, 3, 3, 4, 4, 2, 3~
#> $ MentHlth             <dbl> 18, 0, 30, 0, 3, 0, 0, 0, 30, 0, 0, 0, 0, 0, 30, ~
#> $ PhysHlth             <dbl> 15, 0, 30, 0, 0, 2, 14, 0, 30, 0, 0, 30, 15, 0, 2~
#> $ DiffWalk             <fct> 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0~
#> $ Sex                  <fct> 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0~
#> $ Age                  <dbl> 9, 7, 9, 11, 11, 10, 9, 11, 9, 8, 13, 10, 7, 11, ~
#> $ Education            <dbl> 4, 6, 4, 3, 5, 6, 6, 4, 5, 4, 6, 5, 5, 4, 6, 6, 4~
#> $ Income               <dbl> 3, 1, 8, 6, 4, 8, 7, 4, 1, 3, 8, 1, 7, 6, 2, 8, 3~

str(heart_dis)

#> 'data.frame':    253680 obs. of  22 variables:
#>  $ HeartDiseaseorAttack: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 1 ...
#>  $ HighBP              : Factor w/ 2 levels "0","1": 2 1 2 2 2 2 2 2 2 1 ...
#>  $ HighChol            : Factor w/ 2 levels "0","1": 2 1 2 1 2 2 1 2 2 1 ...
#>  $ CholCheck           : Factor w/ 2 levels "0","1": 2 1 2 2 2 2 2 2 2 2 ...
#>  $ BMI                 : num  40 25 28 27 24 25 30 25 30 24 ...
#>  $ Smoker              : Factor w/ 2 levels "0","1": 2 2 1 1 1 2 2 2 2 1 ...
#>  $ Stroke              : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ Diabetes            : num  0 0 0 0 0 0 0 0 2 0 ...
#>  $ PhysActivity        : Factor w/ 2 levels "0","1": 1 2 1 2 2 2 1 2 1 1 ...
#>  $ Fruits              : Factor w/ 2 levels "0","1": 1 1 2 2 2 2 1 1 2 1 ...
#>  $ Veggies             : Factor w/ 2 levels "0","1": 2 1 1 2 2 2 1 2 2 2 ...
#>  $ HvyAlcoholConsump   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ AnyHealthcare       : Factor w/ 2 levels "0","1": 2 1 2 2 2 2 2 2 2 2 ...
#>  $ NoDocbcCost         : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 1 1 1 ...
#>  $ GenHlth             : num  5 3 5 2 2 2 3 3 5 2 ...
#>  $ MentHlth            : num  18 0 30 0 3 0 0 0 30 0 ...
#>  $ PhysHlth            : num  15 0 30 0 0 2 14 0 30 0 ...
#>  $ DiffWalk            : Factor w/ 2 levels "0","1": 2 1 2 1 1 1 1 2 2 1 ...
#>  $ Sex                 : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 2 ...
#>  $ Age                 : num  9 7 9 11 11 10 9 11 9 8 ...
#>  $ Education           : num  4 6 4 3 5 6 6 4 5 4 ...
#>  $ Income              : num  3 1 8 6 4 8 7 4 1 3 ...

The data type is already appropriate, we can do the next process.

3.2. Check Missing Value

We can check the missing value in the data by using anyNA(), is.na() and colsum()/nrow() with dplyr format.

# Is there any NA ?
anyNA(heart_dis)

#> [1] FALSE

# How much the missing value for each column ?
heart_dis %>% 
  is.na() %>% 
  colSums()/nrow(heart_dis)

#> HeartDiseaseorAttack               HighBP             HighChol 
#>                    0                    0                    0 
#>            CholCheck                  BMI               Smoker 
#>                    0                    0                    0 
#>               Stroke             Diabetes         PhysActivity 
#>                    0                    0                    0 
#>               Fruits              Veggies    HvyAlcoholConsump 
#>                    0                    0                    0 
#>        AnyHealthcare          NoDocbcCost              GenHlth 
#>                    0                    0                    0 
#>             MentHlth             PhysHlth             DiffWalk 
#>                    0                    0                    0 
#>                  Sex                  Age            Education 
#>                    0                    0                    0 
#>               Income 
#>                    0

There is no missing value, so we can use the data for the next process.

#Check summary data
summary(heart_dis)

#>  HeartDiseaseorAttack HighBP     HighChol   CholCheck       BMI       
#>  0:229787             0:144851   0:146089   0:  9470   Min.   :12.00  
#>  1: 23893             1:108829   1:107591   1:244210   1st Qu.:24.00  
#>                                                        Median :27.00  
#>                                                        Mean   :28.38  
#>                                                        3rd Qu.:31.00  
#>                                                        Max.   :98.00  
#>  Smoker     Stroke        Diabetes      PhysActivity Fruits     Veggies   
#>  0:141257   0:243388   Min.   :0.0000   0: 61760     0: 92782   0: 47839  
#>  1:112423   1: 10292   1st Qu.:0.0000   1:191920     1:160898   1:205841  
#>                        Median :0.0000                                     
#>                        Mean   :0.2969                                     
#>                        3rd Qu.:0.0000                                     
#>                        Max.   :2.0000                                     
#>  HvyAlcoholConsump AnyHealthcare NoDocbcCost    GenHlth         MentHlth     
#>  0:239424          0: 12417      0:232326    Min.   :1.000   Min.   : 0.000  
#>  1: 14256          1:241263      1: 21354    1st Qu.:2.000   1st Qu.: 0.000  
#>                                              Median :2.000   Median : 0.000  
#>                                              Mean   :2.511   Mean   : 3.185  
#>                                              3rd Qu.:3.000   3rd Qu.: 2.000  
#>                                              Max.   :5.000   Max.   :30.000  
#>     PhysHlth      DiffWalk   Sex             Age           Education   
#>  Min.   : 0.000   0:211005   0:141974   Min.   : 1.000   Min.   :1.00  
#>  1st Qu.: 0.000   1: 42675   1:111706   1st Qu.: 6.000   1st Qu.:4.00  
#>  Median : 0.000                         Median : 8.000   Median :5.00  
#>  Mean   : 4.242                         Mean   : 8.032   Mean   :5.05  
#>  3rd Qu.: 3.000                         3rd Qu.:10.000   3rd Qu.:6.00  
#>  Max.   :30.000                         Max.   :13.000   Max.   :6.00  
#>      Income     
#>  Min.   :1.000  
#>  1st Qu.:5.000  
#>  Median :7.000  
#>  Mean   :6.054  
#>  3rd Qu.:8.000  
#>  Max.   :8.000

4. Exploratory Data Analysis

4.1. Target Variable Proportion

Use prop.table() function to check the proportion of each levels at target variable. The levels at targets variable are 0 = “no” and 1 = “yes”

# check proportion for target variable `HeartDiseaseorAttack`
prop.table(table(heart_dis$HeartDiseaseorAttack))

#> 
#>          0          1 
#> 0.90581441 0.09418559

# check number of rows for target variable `HeartDiseaseorAttack`
table(heart_dis$HeartDiseaseorAttack)

#> 
#>      0      1 
#> 229787  23893

Based on the prop.table(), there is an imbalanced proportion between each level in the target variable. We will use downsampling to balance the proportion.

4.2. Correlation Between Target and Predictor Variable

We can see the correlation between Target and Predictor Variable with boxplot.

# Numeric Variable : 
# 1 = heart_dis$BMI
# 2 = heart_dis$Diabetes
# 3 = heart_dis$GenHlth
# 4 = heart_dis$MentHlth
# 5 = heart_dis$PhysHlth
# 6 = heart_dis$Age
# 7 = heart_dis$Education
# 8 = heart_dis$Income
# 9 = heart_dis$HeartDiseaseorAttack (Target)


boxplot(heart_dis$BMI, heart_dis$Diabetes, heart_dis$GenHlth, heart_dis$MentHlth, heart_dis$PhysHlth, heart_dis$Age, heart_dis$Education, heart_dis$Income, heart_dis$HeartDiseaseorAttack)

Based on the boxplot above, we get the insight, such as :

For Predictor : Diabetes, GenHlth, MentHlth, PhysHlth, Age, Education, and Income => When 0 < predictor < 1, the risk for Heart Disease or Attack is between 0 - 1.
For variable BMI, there is no correlattion (zero risk) for Heart Disease or Attack
Diabetes, MentHlth and PhysHlth variable have many outlier (> 5)
There is outlier < 5 for variable Diabetes, GenHlth and HeartDiseaseorAttack

# Part A - character(factor) variable :
# 1 = heart_dis$HighBP
# 2 = heart_dis$HighChol
# 3 = heart_dis$CholCheck
# 4 = heart_dis$Smoker
# 5 = heart_dis$Stroke
# 6 = heart_dis$PhysActivity
# 7 = heart_dis$HeartDiseaseorAttack (Target)


boxplot(heart_dis$HighBP, heart_dis$HighChol, heart_dis$CholCheck, heart_dis$Smoker, heart_dis$Stroke, heart_dis$PhysActivity, heart_dis$HeartDiseaseorAttack)

Based on the boxplot above, we get the insight, such as :

For Predictor : HighBP, HighChol, CholCheck, Smoker, Stroke, and PhysActivity => When 0 < predictor < 1, the risk for Heart Disease or Attack is between 0 - 1.
CholCheckdan PhysActivity variable have 1 lower outlier for each variable
Stroke and HeartDiseaseorAttack variable have 1 upper outlier for each variable

# Part B - character(factor) variable :
# 1 = heart_dis$Fruits
# 2 = heart_dis$Veggies
# 3 = heart_dis$HvyAlcoholConsump
# 4 = heart_dis$AnyHealthcare
# 5 = heart_dis$NoDocbcCost
# 6 = heart_dis$DiffWalk
# 7 = heart_dis$Sex
# 8 = heart_dis$HeartDiseaseorAttack (Target)


boxplot(heart_dis$Fruits, heart_dis$Veggies, heart_dis$HvyAlcoholConsump, heart_dis$AnyHealthcare, heart_dis$NoDocbcCost, heart_dis$DiffWalk, heart_dis$Sex, heart_dis$HeartDiseaseorAttack)

Based on the boxplot above, we get the insight, such as :

For Predictor : Fruits, Veggies, HvyAlcoholConsump, AnyHealthcare, NoDocbcCost, DiffWalk, and Sex => When 0 < predictor < 1, the risk for Heart Disease or Attack is between 0 - 1.
Veggiesdan AnyHealthcare variable have 1 lower outlier for each variable
HvyAlcoholConsump, NoDocbcCost, DiffWalk and HeartDiseaseorAttack variable have 1 upper outlier for each variable

5. Cross Validation

5.1. Cross Validation - train-test splitting

We split data into train data and test data.

RNGkind(sample.kind = "Rounding")
set.seed(100)

# train-test splitting
# n sample = 75% for train
index <- sample(nrow(heart_dis), nrow(heart_dis)*0.75)

heart_d_train <- heart_dis[index,]

heart_d_test <- heart_dis[-index,]

5.2. Target Class Proportion

prop.table(table(heart_dis$HeartDiseaseorAttack))

#> 
#>          0          1 
#> 0.90581441 0.09418559

prop.table(table(heart_d_train$HeartDiseaseorAttack))

#> 
#>          0          1 
#> 0.90541364 0.09458636

prop.table(table(heart_d_test$HeartDiseaseorAttack))

#> 
#>          0          1 
#> 0.90701671 0.09298329

Based on the prop.table(), there is an imbalanced proportion between each level in the target variable of data train and data test. We will use downsampling to balance the proportion.

# downsampling - data train only
RNGkind(sample.kind = "Rounding")
set.seed(100)

heart_dis_train <- downSample(x = heart_d_train %>% select(-HeartDiseaseorAttack),
                         y = heart_d_train$HeartDiseaseorAttack,
                         yname = "HeartDiseaseorAttack")
head(heart_dis_train)

#>   HighBP HighChol CholCheck BMI Smoker Stroke Diabetes PhysActivity Fruits
#> 1      0        0         1  26      0      0        0            1      1
#> 2      0        0         1  22      0      0        0            0      1
#> 3      0        0         1  20      1      0        0            1      0
#> 4      0        0         1  31      1      0        0            0      0
#> 5      1        1         1  29      1      0        2            0      0
#> 6      1        0         1  26      1      0        0            1      1
#>   Veggies HvyAlcoholConsump AnyHealthcare NoDocbcCost GenHlth MentHlth PhysHlth
#> 1       1                 0             1           0       2        0        0
#> 2       1                 0             1           0       1        5        0
#> 3       1                 0             1           0       2        0        1
#> 4       1                 0             0           0       4        0       10
#> 5       0                 0             1           0       4       14        0
#> 6       1                 0             1           0       2        0        0
#>   DiffWalk Sex Age Education Income HeartDiseaseorAttack
#> 1        0   1   8         6      8                    0
#> 2        0   0   5         6      8                    0
#> 3        0   0   2         5      4                    0
#> 4        0   1   4         2      4                    0
#> 5        0   0   8         4      4                    0
#> 6        0   1  12         5      8                    0

# check the column names of data `train`
names(heart_dis_train)

#>  [1] "HighBP"               "HighChol"             "CholCheck"           
#>  [4] "BMI"                  "Smoker"               "Stroke"              
#>  [7] "Diabetes"             "PhysActivity"         "Fruits"              
#> [10] "Veggies"              "HvyAlcoholConsump"    "AnyHealthcare"       
#> [13] "NoDocbcCost"          "GenHlth"              "MentHlth"            
#> [16] "PhysHlth"             "DiffWalk"             "Sex"                 
#> [19] "Age"                  "Education"            "Income"              
#> [22] "HeartDiseaseorAttack"

# check the proportion of the levels in the target variable of data `train`
prop.table(table(heart_dis_train$HeartDiseaseorAttack))

#> 
#>   0   1 
#> 0.5 0.5

Based on prop.table() above, there is a balanced proportion between each level in our target variable of the data train. We can do the next process.

6. Modeling : Naive Bayes

6.1. Model Fitting

Use naiveBayes() function to do the fitting model by data train.

# train
heart_d_naive <- naiveBayes(x =  heart_dis_train %>% select(-HeartDiseaseorAttack),
                          y = heart_dis_train$HeartDiseaseorAttack)

6.2. Model Interpretation

We use data heart_d_naive$table to make model interpretation based on target variable heart_dis_train$HeartDiseaseorAttack for each predictor variable.

# 1. Model interpretation : HighBP
# HighBP : Respondents that have ever reported having coronary heart disease (CHD) or myocardial infarction (MI)

heart_d_naive$tables$HighBP

#>                                     HighBP
#> heart_dis_train$HeartDiseaseorAttack         0         1
#>                                    0 0.6054123 0.3945877
#>                                    1 0.2504445 0.7495555

Based on the model above, we get the insight, such as :

The probability of non-HighBP patient who are not at risk of heart attack is 60.54%
The probability of non-HighBP patient at risk of heart attack is 25.04%
The probability of HighBP patient who are not at risk of heart attack is 39.46%
The probability of HighBP patient being at risk of heart attack is 74.96%

# 2. Model Interpretation : HighChol
# HighChol : Have you EVER been told by a doctor, nurse or other health professional that your blood cholesterol is high?

heart_d_naive$tables$HighChol

#>                                     HighChol
#> heart_dis_train$HeartDiseaseorAttack         0         1
#>                                    0 0.6024672 0.3975328
#>                                    1 0.2986775 0.7013225

Based on the model above, we get the insight, such as :

The probability of non-HighChol patient who are not at risk of heart attack is 60.25%
The probability of non-HighChol patient at risk of heart attack is 29.87%
The probability of HighChol patient who are not at risk of heart attack is 39.75%
The probability of HighChol patient being at risk of heart attack is 70.13%

# 3. Model Interpretation : CholCheck
# CholCheck : Cholesterol check within past five years

heart_d_naive$tables$CholCheck

#>                                     CholCheck
#> heart_dis_train$HeartDiseaseorAttack          0          1
#>                                    0 0.03995332 0.96004668
#>                                    1 0.01161369 0.98838631

Based on the model above, we get the insight, such as :

The probability of non-CholCheck patient who are not at risk of heart attack is 39.95%
The probability of non-CholCheck patient at risk of heart attack is 11.61%
The probability of CholCheck patient who are not at risk of heart attack is 96.00%
The probability of CholCheck patient being at risk of heart attack is 98.84%

# 4. Model Interpretation : BMI
# BMI : Body Mass Index (BMI)

heart_d_naive$tables$BMI

#>                                     BMI
#> heart_dis_train$HeartDiseaseorAttack     [,1]     [,2]
#>                                    0 28.28167 6.668373
#>                                    1 29.49216 6.731924

Based on the model above, we get the insight, such as :

The probability of non-BMI patient who are not at risk of heart attack is 28.28
The probability of non-BMI patient at risk of heart attack is 29.49
The probability of BMI patient who are not at risk of heart attack is 6.67
The probability of BMI patient being at risk of heart attack is 6.73

# 5. Model Interpretation : Smoker
# Smoker : Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes]

heart_d_naive$tables$Smoker

#>                                     Smoker
#> heart_dis_train$HeartDiseaseorAttack         0         1
#>                                    0 0.5768504 0.4231496
#>                                    1 0.3801400 0.6198600

Based on the model above, we get the insight, such as :

The probability of non-Smoker patient who are not at risk of heart attack is 57.69%
The probability of non-Smoker patient at risk of heart attack is 38.01%
The probability of Smoker patient who are not at risk of heart attack is 42.31%
The probability of Smoker patient being at risk of heart attack is 61.99%

# 6. Model Interpretation : Stroke
# Stroke : (Ever told) you had a stroke.
heart_d_naive$tables$Stroke

#>                                     Stroke
#> heart_dis_train$HeartDiseaseorAttack          0          1
#>                                    0 0.97260502 0.02739498
#>                                    1 0.83318515 0.16681485

Based on the model above, we get the insight, such as :

The probability of non-Stroke patient who are not at risk of heart attack is 97.26%
The probability of non-Stroke patient at risk of heart attack is 83.32%
The probability of Stroke patient who are not at risk of heart attack is 27.39%
The probability of Stroke patient being at risk of heart attack is 16.68%

# 7. Model Interpretation : Diabetes
# Diabetes : 0 is no diabetes, 1 is pre-diabetes, and 2 is diabetes

heart_d_naive$tables$Diabetes

#>                                     Diabetes
#> heart_dis_train$HeartDiseaseorAttack      [,1]      [,2]
#>                                    0 0.2556679 0.6541666
#>                                    1 0.6917093 0.9364824

Based on the model above, we get the insight, such as :

The probability of non-Diabetes patient who are not at risk of heart attack is 25.57%
The probability of non-Diabetes patient at risk of heart attack is 69.17%
The probability of Diabetes patient who are not at risk of heart attack is 65.42%
The probability of Diabetes patient being at risk of heart attack is 93.65%

# 8. Model Interpretation : PhysActivity
# PhysActivity : Adults who reported doing physical activity or exercise during the past 30 days other than their regular job

heart_d_naive$tables$PhysActivity

#>                                     PhysActivity
#> heart_dis_train$HeartDiseaseorAttack         0         1
#>                                    0 0.2320516 0.7679484
#>                                    1 0.3618582 0.6381418

Based on the model above, we get the insight, such as :

The probability of non-PhysActivity patient who are not at risk of heart attack is 23.21%
The probability of non-PhysActivity patient at risk of heart attack is 36.19%
The probability of PhysActivity patient who are not at risk of heart attack is 76.79%
The probability of PhysActivity patient being at risk of heart attack is 63.81%

# 9. Model Interpretation : Fruits
# Fruits : Consume Fruit 1 or more times per day

heart_d_naive$tables$Fruits

#>                                     Fruits
#> heart_dis_train$HeartDiseaseorAttack         0         1
#>                                    0 0.3637475 0.6362525
#>                                    1 0.3954768 0.6045232

Based on the model above, we get the insight, such as :

The probability of non-Fruits patient who are not at risk of heart attack is 36.37%
The probability of non-Fruits patient at risk of heart attack is 39.55%
The probability of Fruits patient who are not at risk of heart attack is 63.63%
The probability of Fruits patient being at risk of heart attack is 60.45%

# 10. Model Interpretation : Veggies
# Veggies : Consume Vegetables 1 or more times per day

heart_d_naive$tables$Veggies

#>                                     Veggies
#> heart_dis_train$HeartDiseaseorAttack         0         1
#>                                    0 0.1844299 0.8155701
#>                                    1 0.2352189 0.7647811

Based on the model above, we get the insight, such as :

The probability of non-Veggies patient who are not at risk of heart attack is 18.44%
The probability of non-Veggies patient at risk of heart attack is 23.52%
The probability of Veggies patient who are not at risk of heart attack is 81.56%
The probability of Veggies patient being at risk of heart attack is 76.48%

# 11. Model Interpretation : HvyAlcoholConsump
# HvyAlcoholConsump : Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week)

heart_d_naive$tables$HvyAlcoholConsump

#>                                     HvyAlcoholConsump
#> heart_dis_train$HeartDiseaseorAttack          0          1
#>                                    0 0.94004223 0.05995777
#>                                    1 0.96415870 0.03584130

Based on the model above, we get the insight, such as :

The probability of non-HvyAlcoholConsump patient who are not at risk of heart attack is 94.00%
The probability of non-HvyAlcoholConsump patient at risk of heart attack is 96.42%
The probability of HvyAlcoholConsump patient who are not at risk of heart attack is 59.96%
The probability of HvyAlcoholConsump patient being at risk of heart attack is 35.84%

# 12. Model Interpretation : AnyHealthcare
# AnyHealthcare : Do you have any kind of health care coverage, including health insurance, prepaid plans such as HMOs, or government plans such as Medicare, or Indian Health Service?

heart_d_naive$tables$AnyHealthcare

#>                                     AnyHealthcare
#> heart_dis_train$HeartDiseaseorAttack          0          1
#>                                    0 0.05067793 0.94932207
#>                                    1 0.03673038 0.96326962

Based on the model above, we get the insight, such as :

The probability of non-AnyHealthcare patient who are not at risk of heart attack is 5.07%
The probability of non-AnyHealthcare patient at risk of heart attack is 3.67%
The probability of AnyHealthcare patient who are not at risk of heart attack is 94.93%
The probability of AnyHealthcare patient being at risk of heart attack is 96.33%

# 13. Model Interpretation : NoDocbcCost
# NoDocbcCost : Was there a time in the past 12 months when you needed to see a doctor but could not because of cost?
  
heart_d_naive$tables$NoDocbcCost

#>                                     NoDocbcCost
#> heart_dis_train$HeartDiseaseorAttack          0          1
#>                                    0 0.92076017 0.07923983
#>                                    1 0.88725272 0.11274728

Based on the model above, we get the insight, such as :

The probability of non-NoDocbcCost patient who are not at risk of heart attack is 92.08%
The probability of non-NoDocbcCost patient at risk of heart attack is 8.73%
The probability of NoDocbcCost patient who are not at risk of heart attack is 7.92%
The probability of NoDocbcCost patient being at risk of heart attack is 11.27%

# 14. Model Interpretation : GenHlth
# GenHlth : Would you say that in general your health is ?

heart_d_naive$tables$GenHlth

#>                                     GenHlth
#> heart_dis_train$HeartDiseaseorAttack     [,1]     [,2]
#>                                    0 2.414314 1.025782
#>                                    1 3.372749 1.084166

Based on the model above, we get the insight, such as :

The probability of non-GenHlth patient who are not at risk of heart attack is 2.414
The probability of non-GenHlth patient at risk of heart attack is 3.373
The probability of GenHlth patient who are not at risk of heart attack is 1.026
The probability of GenHlth patient being at risk of heart attack is 1.084

# 15. Model Interpretation : MentHlth
# MentHlth : Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good?
                              
heart_d_naive$tables$MentHlth

#>                                     MentHlth
#> heart_dis_train$HeartDiseaseorAttack     [,1]     [,2]
#>                                    0 3.068349 7.295966
#>                                    1 4.724050 9.251305

Based on the model above, we get the insight, such as :

The probability of non-MentHlth patient who are not at risk of heart attack is 3.068
The probability of non-MentHlth patient at risk of heart attack is 4.724
The probability of MentHlth patient who are not at risk of heart attack is 7.296
The probability of MentHlth patient being at risk of heart attack is 9.251

# 16. Model Interpretation : PhysHlth
# PhysHlth : Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good?

heart_d_naive$tables$PhysHlth

#>                                     PhysHlth
#> heart_dis_train$HeartDiseaseorAttack     [,1]      [,2]
#>                                    0 3.697099  8.138909
#>                                    1 9.212658 11.920467

Based on the model above, we get the insight, such as :

The probability of non-PhysHlth patient who are not at risk of heart attack is 3.697
The probability of non-PhysHlth patient at risk of heart attack is 9.213
The probability of PhysHlth patient who are not at risk of heart attack is 8.139
The probability of PhysHlth patient being at risk of heart attack is 11.920

# 17. Model Interpretation : DiffWalk
# DiffWalk : Do you have serious difficulty walking or climbing stairs?

heart_d_naive$tables$DiffWalk

#>                                     DiffWalk
#> heart_dis_train$HeartDiseaseorAttack         0         1
#>                                    0 0.8596355 0.1403645
#>                                    1 0.5853523 0.4146477

Based on the model above, we get the insight, such as :

The probability of non-DiffWalk patient who are not at risk of heart attack is 85.96%
The probability of non-DiffWalk patient at risk of heart attack is 58.54%
The probability of DiffWalk patient who are not at risk of heart attack is 14.04%
The probability of DiffWalk patient being at risk of heart attack is 41.46%

# 18. Model Interpretation : Sex
# Sex : Indicate sex of respondent : 0 = male and 1 = female

heart_d_naive$tables$Sex

#>                                     Sex
#> heart_dis_train$HeartDiseaseorAttack         0         1
#>                                    0 0.5806290 0.4193710
#>                                    1 0.4229829 0.5770171

Based on the model above, we get the insight, such as :

The probability of Sex : Male patient who are not at risk of heart attack is 58.06%
The probability of Sex : Male patient at risk of heart attack is 42.30%
The probability of Sex : Female patient who are not at risk of heart attack is 41.94%
The probability of Sex : Female patient being at risk of heart attack is 57.70%

# 19. Model Interpretation : Age
# Age : Fourteen-level age category

heart_d_naive$tables$Age

#>                                     Age
#> heart_dis_train$HeartDiseaseorAttack      [,1]     [,2]
#>                                    0  7.798066 3.052001
#>                                    1 10.109969 2.221211

Based on the model above, we get the insight, such as :

The probability of Age : Fourteen-level category and under patient who are not at risk of heart attack is 7.798
The probability of Age : Fourteen-level category and under patient at risk of heart attack is 10.110
The probability of Age : above Fourteen-level category patient who are not at risk of heart attack is 3.052
The probability of Age : above Fourteen-level category patient being at risk of heart attack is 2.221

# 20. Model Interpretation : Education
# Education : What is the highest grade or year of school you completed?

heart_d_naive$tables$Education

#>                                     Education
#> heart_dis_train$HeartDiseaseorAttack     [,1]      [,2]
#>                                    0 5.072016 0.9753468
#>                                    1 4.740887 1.0649576

Based on the model above, we get the insight, such as :

The probability of non-Education patient who are not at risk of heart attack is 5.072
The probability of non-Education patient at risk of heart attack is 4.741
The probability of Education patient who are not at risk of heart attack is 0.975
The probability of Education patient being at risk of heart attack is 1.065

# 21. Model Interpretation : Income
# Income : Is your annual household income from all sources:(If respondent refuses at any income level, code "Refused.")

heart_d_naive$tables$Income

#>                                     Income
#> heart_dis_train$HeartDiseaseorAttack     [,1]     [,2]
#>                                    0 6.138753 2.029739
#>                                    1 5.139198 2.197479

Based on the model above, we get the insight, such as :

The probability of non-Income / Refused patient who are not at risk of heart attack is 6.139
The probability of non-Income / Refused patient at risk of heart attack is 5.139
The probability of Income patient who are not at risk of heart attack is 2.030
The probability of Income patient being at risk of heart attack is 2.197

6.3. Model Evaluation

We do predict class from test data by using predict() function. Then, we do model evaluation by using confusionMatrix().

# Predict **class** from test data with function `predict()`
heart_pred_class <- predict(object = heart_d_naive,
                            newdata = heart_d_test,
                            type = "class")
head(heart_pred_class)

#> [1] 0 1 0 0 0 1
#> Levels: 0 1

# Model Evaluation with `confusionMatrix()`
confusionMatrix(data = heart_pred_class, #predict result
                reference = heart_d_test$HeartDiseaseorAttack) #actual data

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction     0     1
#>          0 43990  1793
#>          1 13533  4104
#>                                              
#>                Accuracy : 0.7583             
#>                  95% CI : (0.755, 0.7617)    
#>     No Information Rate : 0.907              
#>     P-Value [Acc > NIR] : 1                  
#>                                              
#>                   Kappa : 0.2433             
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.7647             
#>             Specificity : 0.6959             
#>          Pos Pred Value : 0.9608             
#>          Neg Pred Value : 0.2327             
#>              Prevalence : 0.9070             
#>          Detection Rate : 0.6936             
#>    Detection Prevalence : 0.7219             
#>       Balanced Accuracy : 0.7303             
#>                                              
#>        'Positive' Class : 0                  
#>

Based on the model evaluation above, we get the insight, such as :

We focus on minimizing FN (Refer to Sensitivity : 0.7647 ) = “recall”, because we want to avoid patients who have heart disease, but are predicted not to have heart disease. It is more dangerous for the patient.

6.3. ROC and AUC Model

We do predict probability of data naive bayes based on data test by using predict() function. Then, we build ROC model by compare predict probability and actual (ROC curve). We can build AUC model by using performance() function.

# Take predict result to format probability
heart_pred_prob <- predict(object = heart_d_naive,
                           newdata = heart_d_test,
                           type = "raw")
head(heart_pred_prob)

#>                 0           1
#> [1,] 0.8342897942 0.165710206
#> [2,] 0.0059961511 0.994003849
#> [3,] 0.9946525796 0.005347420
#> [4,] 0.6013037878 0.398696212
#> [5,] 0.9977844452 0.002215555
#> [6,] 0.0004100749 0.999589925

# ROC :  predict vs actual
data_roc <- data.frame(pred_prob = heart_pred_prob[, '1'],
                       actual = ifelse(heart_d_test$HeartDiseaseorAttack == '1',1, 0))
head(data_roc)

#>     pred_prob actual
#> 1 0.165710206      0
#> 2 0.994003849      0
#> 3 0.005347420      0
#> 4 0.398696212      0
#> 5 0.002215555      0
#> 6 0.999589925      0

# objek prediction
heart_roc <- prediction(predictions = data_roc$pred_prob,
                        labels = data_roc$actual)
# ROC curve
plot(performance(heart_roc,"tpr","fpr"))

# AUC value
heart_auc <- performance(heart_roc, measure = "auc")
heart_auc@y.values

#> [[1]]
#> [1] 0.8144615

Based on the model above, we get the insight, such as :

AUC = 0.8142672, so we can conclude that model heart_d_naive is good enough at separating classes (1 and 0).

7. Modeling : Decision Tree

7.1. Model Fitting

Use ctree() function to do the fitting model by data train with all predictor variables.

heart_tree <- ctree(formula = HeartDiseaseorAttack ~.,
                       data = heart_dis_train)

# print struktur tree
# heart_tree

# visualization decision tree
plot(heart_tree, type = "simple")

7.2. Model Evaluation

# predict class in data test
pred_heart_test <- predict(object = heart_tree,
                          newdata = heart_d_test,
                          type = "response")

# confusion matrix data test
confusionMatrix(data = pred_heart_test,
                reference = heart_d_test$HeartDiseaseorAttack,
                positive = "1")

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction     0     1
#>          0 41452  1169
#>          1 16071  4728
#>                                              
#>                Accuracy : 0.7282             
#>                  95% CI : (0.7247, 0.7316)   
#>     No Information Rate : 0.907              
#>     P-Value [Acc > NIR] : 1                  
#>                                              
#>                   Kappa : 0.2448             
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.80176            
#>             Specificity : 0.72062            
#>          Pos Pred Value : 0.22732            
#>          Neg Pred Value : 0.97257            
#>              Prevalence : 0.09298            
#>          Detection Rate : 0.07455            
#>    Detection Prevalence : 0.32796            
#>       Balanced Accuracy : 0.76119            
#>                                              
#>        'Positive' Class : 1                  
#>

Before tuning, we get sensitivity/recall value (0.80176) in data test.

# predict class in data train
pred_heart_train <- predict(object = heart_tree,
                          newdata = heart_dis_train,
                          type = "response")

# confusion matrix data train
confusionMatrix(data = pred_heart_train,
                reference = heart_dis_train$HeartDiseaseorAttack,
                positive = "1")

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction     0     1
#>          0 13213  3373
#>          1  4783 14623
#>                                                
#>                Accuracy : 0.7734               
#>                  95% CI : (0.769, 0.7777)      
#>     No Information Rate : 0.5                  
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.5468               
#>                                                
#>  Mcnemar's Test P-Value : < 0.00000000000000022
#>                                                
#>             Sensitivity : 0.8126               
#>             Specificity : 0.7342               
#>          Pos Pred Value : 0.7535               
#>          Neg Pred Value : 0.7966               
#>              Prevalence : 0.5000               
#>          Detection Rate : 0.4063               
#>    Detection Prevalence : 0.5392               
#>       Balanced Accuracy : 0.7734               
#>                                                
#>        'Positive' Class : 1                    
#>

Before tuning, we get sensitivity/recall value (0.8126) in data train.

We try to tuning model in data train with mincriterion = 0.05, minsplit = 300, minbucket = 200.

After tuning model on above, we do evaluation model in data test by using predict() and confusionMatrix() function.

# predict class in data test only
pred_heart_test_tuned <- predict(object = heart_tree_complex,
                                newdata = heart_d_test,
                                type = "response")

# confusion matrix data test only
confusionMatrix(data = pred_heart_test_tuned,reference = heart_d_test$HeartDiseaseorAttack,
                positive = "1")

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction     0     1
#>          0 40652  1153
#>          1 16871  4744
#>                                              
#>                Accuracy : 0.7158             
#>                  95% CI : (0.7123, 0.7193)   
#>     No Information Rate : 0.907              
#>     P-Value [Acc > NIR] : 1                  
#>                                              
#>                   Kappa : 0.2328             
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.80448            
#>             Specificity : 0.70671            
#>          Pos Pred Value : 0.21948            
#>          Neg Pred Value : 0.97242            
#>              Prevalence : 0.09298            
#>          Detection Rate : 0.07480            
#>    Detection Prevalence : 0.34082            
#>       Balanced Accuracy : 0.75559            
#>                                              
#>        'Positive' Class : 1                  
#>

Based on the model evaluation above, we get the insight, such as :

Before-Tuned: We get data on sensitivity value 0.80176 (minimizing FN = “recall”), which is high enough to Model Interpretation to solve the business problem.
After-Tuned: We get data on sensitivity value 0.80448 (minimizing FN = “recall”), The results did not change much after tuning the tree.

8. Conclusion

My conclusion from analysis data on above are : We prefer to use Naive Bayes model to solving the business problem, because result of Model Evaluation is more precise / better.

9. Reference

I get the data for this LBB at : https://www.kaggle.com/alexteboul/heart-disease-health-indicators-dataset?select=heart_disease_health_indicators_BRFSS2015.csv.

LBB-C2-Melia Sie

Melia Sie

12/22/2021

HEART DISEASE

1. Introduction

1.1. Data Outline

1.2. Business Problem

2. Import Library and Read Data

2.1. Import Library

2.2. Read data

3. Data Wrangling

3.1. Check Data Type

3.2. Check Missing Value

4. Exploratory Data Analysis

4.1. Target Variable Proportion

4.2. Correlation Between Target and Predictor Variable

5. Cross Validation

5.1. Cross Validation - train-test splitting

5.2. Target Class Proportion

6. Modeling : Naive Bayes

6.1. Model Fitting

6.2. Model Interpretation

6.3. Model Evaluation

6.3. ROC and AUC Model

7. Modeling : Decision Tree

7.1. Model Fitting

7.2. Model Evaluation

8. Conclusion

9. Reference