Introduction

According to the [CDC](https://www.cdc.gov/heartdisease/risk_factors.htm), heart disease is one of the leading causes of death for people in the US. In this project, we will predict whether a patient has heart disease using a Naive Bayes Classifier, a Decision Tree, and a Random Forest. The dataset comes from the CDC's 2020 annual survey of 400k adults about their health status (via Kaggle).

First, load the required libraries:

library(dplyr)
library(tidyverse)
library(GGally)
library(ggplot2)
library(e1071)
library(rsample)
library(caret)
library(partykit)
library(randomForest)

Data Preparation

Input Data

Read the data into the heart object. We use the parameter stringsAsFactors = TRUE so that all character columns are automatically stored as factors.

heart <- read.csv("heart.csv", stringsAsFactors = TRUE)

Preview the data:

head(heart)

Description:

  • Heart Disease: Respondents who have ever reported having coronary heart disease (CHD) or myocardial infarction (MI).
  • BMI: Body Mass Index.
  • Smoking: Smoked at least 100 cigarettes in their entire life.
  • Alcohol Drinking: Adult men having more than 14 drinks per week and adult women having more than 7 drinks per week.
  • Stroke: Ever had a stroke.
  • Physical Health: Number of days physical health was not good during the past 30 days.
  • Mental Health: Number of days mental health was not good during the past 30 days.
  • Diff Walking: Has serious difficulty walking or climbing stairs.
  • Sex: Male or Female.
  • Age Category: Fourteen-level age category.
  • Race: Imputed race/ethnicity value.
  • Diabetic: Ever had diabetes.
  • Physical Activity: Adults who reported doing physical activity or exercise during the past 30 days other than their regular job.
  • Gen Health: General health condition.
  • Sleep Time: Hours of sleep in a 24-hour period.
  • Asthma: Ever had asthma.
  • KidneyDisease: Ever had kidney disease.
  • SkinCancer: Ever had skin cancer.

Data Structure

Check the number of columns and rows.

dim(heart)
## [1] 319795     18

The data contains 319,795 rows and 18 columns.

View all columns and the data types.

glimpse(heart)
## Rows: 319,795
## Columns: 18
## $ HeartDisease     <fct> No, No, No, No, No, Yes, No, No, No, No, Yes, No, No,~
## $ BMI              <dbl> 16.60, 20.34, 26.58, 24.21, 23.71, 28.87, 21.63, 31.6~
## $ Smoking          <fct> Yes, No, Yes, No, No, Yes, No, Yes, No, No, Yes, Yes,~
## $ AlcoholDrinking  <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, N~
## $ Stroke           <fct> No, Yes, No, No, No, No, No, No, No, No, No, No, No, ~
## $ PhysicalHealth   <dbl> 3, 0, 20, 0, 28, 6, 15, 5, 0, 0, 30, 0, 0, 7, 0, 1, 5~
## $ MentalHealth     <dbl> 30, 0, 30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 30, 0, 2,~
## $ DiffWalking      <fct> No, No, No, No, Yes, Yes, No, Yes, No, Yes, Yes, No, ~
## $ Sex              <fct> Female, Female, Male, Female, Female, Female, Female,~
## $ AgeCategory      <fct> 55-59, 80 or older, 65-69, 75-79, 40-44, 75-79, 70-74~
## $ Race             <fct> White, White, White, White, White, Black, White, Whit~
## $ Diabetic         <fct> "Yes", "No", "Yes", "No", "No", "No", "No", "Yes", "N~
## $ PhysicalActivity <fct> Yes, Yes, Yes, No, Yes, No, Yes, No, No, Yes, No, Yes~
## $ GenHealth        <fct> Very good, Very good, Fair, Good, Very good, Fair, Fa~
## $ SleepTime        <dbl> 5, 7, 8, 6, 8, 12, 4, 9, 5, 10, 15, 5, 8, 7, 5, 6, 10~
## $ Asthma           <fct> Yes, No, Yes, No, No, No, Yes, Yes, No, No, Yes, No, ~
## $ KidneyDisease    <fct> No, No, No, No, No, No, No, No, Yes, No, No, No, No, ~
## $ SkinCancer       <fct> Yes, No, No, Yes, No, No, Yes, No, No, No, No, No, No~

All columns are stored with the correct data type.

Pre-processing Data

Check for missing values.

colSums(is.na(heart))
##     HeartDisease              BMI          Smoking  AlcoholDrinking 
##                0                0                0                0 
##           Stroke   PhysicalHealth     MentalHealth      DiffWalking 
##                0                0                0                0 
##              Sex      AgeCategory             Race         Diabetic 
##                0                0                0                0 
## PhysicalActivity        GenHealth        SleepTime           Asthma 
##                0                0                0                0 
##    KidneyDisease       SkinCancer 
##                0                0

No missing values were found. Now the data is ready to explore.

Exploratory Data Analysis

Let’s see the summary of all columns.

summary(heart)
##  HeartDisease      BMI        Smoking      AlcoholDrinking Stroke      
##  No :292422   Min.   :12.02   No :187887   No :298018      No :307726  
##  Yes: 27373   1st Qu.:24.03   Yes:131908   Yes: 21777      Yes: 12069  
##               Median :27.34                                            
##               Mean   :28.33                                            
##               3rd Qu.:31.42                                            
##               Max.   :94.85                                            
##                                                                        
##  PhysicalHealth    MentalHealth    DiffWalking      Sex        
##  Min.   : 0.000   Min.   : 0.000   No :275385   Female:167805  
##  1st Qu.: 0.000   1st Qu.: 0.000   Yes: 44410   Male  :151990  
##  Median : 0.000   Median : 0.000                               
##  Mean   : 3.372   Mean   : 3.898                               
##  3rd Qu.: 2.000   3rd Qu.: 3.000                               
##  Max.   :30.000   Max.   :30.000                               
##                                                                
##       AgeCategory                                 Race       
##  65-69      : 34151   American Indian/Alaskan Native:  5202  
##  60-64      : 33686   Asian                         :  8068  
##  70-74      : 31065   Black                         : 22939  
##  55-59      : 29757   Hispanic                      : 27446  
##  50-54      : 25382   Other                         : 10928  
##  80 or older: 24153   White                         :245212  
##  (Other)    :141601                                          
##                     Diabetic      PhysicalActivity     GenHealth     
##  No                     :269653   No : 71838       Excellent: 66842  
##  No, borderline diabetes:  6781   Yes:247957       Fair     : 34677  
##  Yes                    : 40802                    Good     : 93129  
##  Yes (during pregnancy) :  2559                    Poor     : 11289  
##                                                    Very good:113858  
##                                                                      
##                                                                      
##    SleepTime      Asthma       KidneyDisease SkinCancer  
##  Min.   : 1.000   No :276923   No :308016    No :289976  
##  1st Qu.: 6.000   Yes: 42872   Yes: 11779    Yes: 29819  
##  Median : 7.000                                          
##  Mean   : 7.097                                          
##  3rd Qu.: 8.000                                          
##  Max.   :24.000                                          
## 

Before doing the analysis, we have to inspect the distribution of all variables. Categorical variables:

ggplot(gather(heart %>% select_if(is.factor)), aes(value)) + 
  geom_bar(fill = "navy") +  # geom_bar counts rows per level and takes no bins argument
  facet_wrap(~key, scales = 'free_x') +
  theme_minimal()

Numerical variables:

ggplot(gather(heart %>% select_if(is.numeric)), aes(value)) + 
  geom_histogram(bins = 10, fill = "navy") + 
  facet_wrap(~key, scales = 'free_x') +
  theme_minimal()

Correlation between numeric variables:

ggcorr(heart, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2, low = "black", high = "blue")

Naive Bayes Classifier

The Naive Bayes Classifier is a classification algorithm based on Bayes' theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. The Naive Bayes assumptions are that the predictors are mutually independent and carry equal weight.
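
To make the idea concrete, here is a minimal hand computation of Bayes' theorem for a single predictor in our data (illustrative only, not part of the modelling pipeline):

# P(HeartDisease = Yes | Smoking = Yes)
#   = P(Smoking = Yes | Yes) * P(Yes) / P(Smoking = Yes)
p_yes             <- mean(heart$HeartDisease == "Yes")
p_smoke_given_yes <- mean(heart$Smoking[heart$HeartDisease == "Yes"] == "Yes")
p_smoke           <- mean(heart$Smoking == "Yes")
p_smoke_given_yes * p_yes / p_smoke

Naive Bayes performs this computation for every predictor at once, multiplying the per-predictor likelihoods under the independence assumption.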

Advantages:

  • Computing time is relatively faster than other classification models, because it only computes the proportion of the frequency table
  • Often used as a baseline model or benchmark, which is a simple model (reference) that we will compare with more complex models.

Disadvantages: skewness due to data scarcity; bias arises when there are events that occur rarely or not at all in the training data. This is why we will apply Laplace smoothing when building the model.

Cross Validation

Split the data into 80% training data and 20% test data.

RNGkind(sample.kind = "Rounding")
set.seed(100)

index <- sample(nrow(heart), nrow(heart)*0.8)
heart_train <- heart[index,]
heart_test <- heart[-index,]

After splitting the data, check the class balance of the training data.

prop.table(table(heart_train$HeartDisease))
## 
##         No        Yes 
## 0.91363608 0.08636392
table(heart_train$HeartDisease)
## 
##     No    Yes 
## 233741  22095

It is not balanced, so we need to upsample or downsample.

  • Upsampling: add minority-class observations to balance with the majority class by duplicating minority observations. Disadvantage: it only duplicates data and does not add new information.

  • Downsampling: reduce the majority-class observations to balance with the minority class. Disadvantage: it removes information from the data; commonly used when we have a lot of data.

We will use the downsampling method because the minority class (HeartDisease = Yes) still has 22,095 observations, which is plenty to train on.

heart_train_down <- downSample(x = heart_train %>% select(-HeartDisease),
                         y = as.factor(heart_train$HeartDisease),
                         yname = "HeartDisease")

Re-check the class proportions.

prop.table(table(heart_train_down$HeartDisease))
## 
##  No Yes 
## 0.5 0.5

Now it is balanced.

dim(heart_train_down)
## [1] 44190    18

After downsampling, there are 44,190 observations.

Build Model

# laplace = 1 applies Laplace smoothing, so levels that never occur within a
# class still get a small non-zero probability
model_naive_bayes <- naiveBayes(formula = HeartDisease ~ ., data = heart_train_down, laplace = 1)
model_naive_bayes
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##  No Yes 
## 0.5 0.5 
## 
## Conditional probabilities:
##      BMI
## Y         [,1]     [,2]
##   No  28.17281 6.283340
##   Yes 29.41582 6.611655
## 
##      Smoking
## Y            No       Yes
##   No  0.5990406 0.4009594
##   Yes 0.4133593 0.5866407
## 
##      AlcoholDrinking
## Y             No        Yes
##   No  0.92971897 0.07028103
##   Yes 0.95868217 0.04131783
## 
##      Stroke
## Y             No        Yes
##   No  0.97361633 0.02638367
##   Yes 0.83871114 0.16128886
## 
##      PhysicalHealth
## Y         [,1]      [,2]
##   No  3.004164  7.466262
##   Yes 7.841412 11.512455
## 
##      MentalHealth
## Y         [,1]     [,2]
##   No  3.854809 7.791103
##   Yes 4.625571 9.146825
## 
##      DiffWalking
## Y            No       Yes
##   No  0.8827895 0.1172105
##   Yes 0.6352446 0.3647554
## 
##      Sex
## Y        Female      Male
##   No  0.5389419 0.4610581
##   Yes 0.4077929 0.5922071
## 
##      AgeCategory
## Y           18-24       25-29       30-34       35-39       40-44       45-49
##   No  0.069477112 0.059978288 0.061697123 0.069658042 0.068391532 0.072371992
##   Yes 0.005156504 0.005111272 0.008096617 0.011081961 0.018228695 0.027139497
##      AgeCategory
## Y           50-54       55-59       60-64       65-69       70-74       75-79
##   No  0.081373259 0.096616609 0.106748688 0.103944274 0.090012665 0.057083409
##   Yes 0.050253302 0.080513841 0.122625294 0.147231771 0.177582775 0.147503166
##      AgeCategory
## Y     80 or older
##   No  0.062647006
##   Yes 0.199475303
## 
##      Race
## Y     American Indian/Alaskan Native       Asian       Black    Hispanic
##   No                     0.015112438 0.024478530 0.072937876 0.091715307
##   Yes                    0.020044342 0.009909054 0.063300303 0.051309895
##      Race
## Y           Other       White
##   No  0.034342337 0.761413511
##   Yes 0.032894439 0.822541966
## 
##      Diabetic
## Y              No No, borderline diabetes         Yes Yes (during pregnancy)
##   No  0.861577447             0.021086927 0.108602199            0.008733427
##   Yes 0.639938459             0.028870085 0.327254627            0.003936830
## 
##      PhysicalActivity
## Y            No       Yes
##   No  0.2079015 0.7920985
##   Yes 0.3601394 0.6398606
## 
##      GenHealth
## Y      Excellent       Fair       Good       Poor  Very good
##   No  0.22402715 0.09239819 0.28800905 0.02488688 0.37067873
##   Yes 0.05461538 0.25941176 0.34850679 0.14004525 0.19742081
## 
##      SleepTime
## Y         [,1]     [,2]
##   No  7.109618 1.397189
##   Yes 7.135461 1.788307
## 
##      Asthma
## Y            No       Yes
##   No  0.8698918 0.1301082
##   Yes 0.8186632 0.1813368
## 
##      KidneyDisease
## Y             No        Yes
##   No  0.97035797 0.02964203
##   Yes 0.87391954 0.12608046
## 
##      SkinCancer
## Y             No        Yes
##   No  0.91541838 0.08458162
##   Yes 0.81911572 0.18088428

Prediction

Predict on the test data and compare the prediction results with the actual labels.

pred_nb <- predict(model_naive_bayes, newdata = heart_test, type = "class")
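
If class probabilities are needed instead of hard labels, predict() for e1071 Naive Bayes models also accepts type = "raw" (a quick sketch):

pred_nb_prob <- predict(model_naive_bayes, newdata = heart_test, type = "raw")
head(pred_nb_prob)  # one row per test observation, columns "No" and "Yes"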

Model Evaluation

Using confusionMatrix() to evaluate our model. Confusion matrix is a table that shows:

  • TP (True Positive) = When we predict positive class, and it’s true.
  • TN (True Negative) = When we predict negative class, and it’s true.
  • FP (False Positive) = When we predict positive class, and it’s not true.
  • FN (False Negative) = When we predict negative class, and it’s not true.

We will get information about:

  • Accuracy: How accurately the model predicts the target class.
  • Sensitivity/Recall: How well the model captures the positive class.
  • Specificity: How well the model captures the negative class.
  • Pos Pred Value/Precision: How precise the model's positive-class predictions are.

confusionMatrix(data = pred_nb, reference = heart_test$HeartDisease, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  46745  1854
##        Yes 11936  3424
##                                              
##                Accuracy : 0.7844             
##                  95% CI : (0.7812, 0.7876)   
##     No Information Rate : 0.9175             
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : 0.2382             
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.64873            
##             Specificity : 0.79660            
##          Pos Pred Value : 0.22292            
##          Neg Pred Value : 0.96185            
##              Prevalence : 0.08252            
##          Detection Rate : 0.05353            
##    Detection Prevalence : 0.24015            
##       Balanced Accuracy : 0.72266            
##                                              
##        'Positive' Class : Yes                
## 

The evaluation results of the Naive Bayes model:

  • Accuracy = 0.78

  • Sensitivity/Recall = 0.65

  • Specificity = 0.80

  • Pos Pred Value/Precision = 0.22
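
As a sanity check, these figures can be reproduced by hand from the confusion-matrix counts above:

tp <- 3424; tn <- 46745; fp <- 11936; fn <- 1854
c(accuracy    = (tp + tn) / (tp + tn + fp + fn),
  recall      = tp / (tp + fn),
  specificity = tn / (tn + fp),
  precision   = tp / (tp + fp))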

Decision Tree

The decision tree algorithm builds a classification model in the form of a tree structure. It uses if-then rules that are collectively exhaustive and mutually exclusive for classification. The process repeatedly breaks the data down into smaller subsets, growing the tree incrementally; the final structure looks like a tree with nodes and leaves. Unlike Naive Bayes, a decision tree does not require the predictors to be mutually independent.

Advantages:

  • Can be used for both numerical and categorical predictors
  • Can handle outliers
  • Interpretable
  • Robust

Disadvantages:

  • Prone to overfitting

Build Model

model_decision_tree <- ctree(formula = HeartDisease ~ .,
                             data = heart_train_down)
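
partykit trees can be plotted directly, which helps interpretation; note that with roughly 44k training rows the fitted tree is likely too large to read comfortably (a sketch):

plot(model_decision_tree, type = "simple")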

Prediction

Predict on the test data and compare the prediction results with the actual labels.

pred_test_dt <- predict(object = model_decision_tree,
                        newdata = heart_test,
                        type = "response")

Model Evaluation

Using confusionMatrix() to evaluate our model.

confusionMatrix(pred_test_dt, reference = heart_test$HeartDisease, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  40269   840
##        Yes 18412  4438
##                                              
##                Accuracy : 0.699              
##                  95% CI : (0.6954, 0.7025)   
##     No Information Rate : 0.9175             
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : 0.2096             
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.84085            
##             Specificity : 0.68624            
##          Pos Pred Value : 0.19422            
##          Neg Pred Value : 0.97957            
##              Prevalence : 0.08252            
##          Detection Rate : 0.06939            
##    Detection Prevalence : 0.35726            
##       Balanced Accuracy : 0.76354            
##                                              
##        'Positive' Class : Yes                
## 

The evaluation results of the Decision Tree model:

  • Accuracy = 0.70

  • Sensitivity/Recall = 0.84

  • Specificity = 0.69

  • Pos Pred Value/Precision = 0.19

Random Forest

Random forest is an ensemble learning method for classification, regression, and other tasks. It operates by constructing a multitude of decision trees at training time and outputs the class that is the mode of the individual trees' predicted classes (classification) or their mean prediction (regression).
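
To illustrate "the mode of the classes", here is a toy majority vote over hypothetical per-tree predictions (purely illustrative):

votes <- c("Yes", "No", "Yes", "Yes", "No")  # predictions from 5 hypothetical trees
names(which.max(table(votes)))               # the forest's vote: "Yes"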

Advantages:

  • Reduces both the bias and the variance of a single decision tree,
  • Robust for prediction,
  • Automatic feature selection: predictors are selected at random when building each tree (controlled by the mtry parameter),
  • Out-of-Bag (OOB) error can substitute for model evaluation on test data.

Disadvantages:

  • Less interpretable (a black-box model),
  • Training cost is very large (in terms of hardware and computational time), though this can be reduced by dropping less informative predictors.

Build Model

Using the training data we created, we will build a random forest model with k-fold cross-validation (k = 5), repeating the creation of the k-fold sets 3 times. The trainControl() arguments used are:

  • method: the cross-validation method
  • number: the number of folds k in k-fold CV
  • repeats: how many times the k-fold CV is repeated

set.seed(100)
control <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
# model_random_forest <- train(HeartDisease ~ ., data = heart_train_down, method = "rf", trControl = control)

The random forest model we built has been saved to an .RDS file, so it can be reloaded without retraining:

# saveRDS(model_random_forest, "model_random_forest.RDS")
model_random_forest <- readRDS("model_random_forest.RDS")
model_random_forest
## Random Forest 
## 
## 44190 samples
##    17 predictor
##     2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 35352, 35352, 35352, 35352, 35352, 35352, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7553519  0.5107038
##   19    0.7403787  0.4807573
##   37    0.7330693  0.4661386
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
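
Because caret stores the underlying randomForest object in $finalModel, the out-of-bag (OOB) error mentioned earlier can be inspected directly (a quick sketch):

model_random_forest$finalModel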

Prediction

Predict on the test data and compare the prediction results with the actual labels.

pred_rf <- predict(object = model_random_forest,
                   newdata  = heart_test)

Model Evaluation

Using confusionMatrix() to evaluate our model.

confusionMatrix(pred_rf, reference = heart_test$HeartDisease, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  42603  1129
##        Yes 16078  4149
##                                              
##                Accuracy : 0.731              
##                  95% CI : (0.7275, 0.7344)   
##     No Information Rate : 0.9175             
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : 0.2237             
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.78609            
##             Specificity : 0.72601            
##          Pos Pred Value : 0.20512            
##          Neg Pred Value : 0.97418            
##              Prevalence : 0.08252            
##          Detection Rate : 0.06487            
##    Detection Prevalence : 0.31625            
##       Balanced Accuracy : 0.75605            
##                                              
##        'Positive' Class : Yes                
## 

The evaluation results of the Random Forest model:

  • Accuracy = 0.73

  • Sensitivity/Recall = 0.79

  • Specificity = 0.73

  • Pos Pred Value/Precision = 0.21

Variable Importance

plot(varImp(model_random_forest))

Based on the result above, the variables that contribute most to predicting heart disease are DiffWalking, Diabetic, AgeCategory 80 or older, GenHealth, Stroke, and PhysicalHealth.

Conclusion

Comparison of all models:

Model            Accuracy   Sensitivity/Recall   Specificity   Precision
Naive Bayes      0.78       0.65                 0.80          0.22
Decision Tree    0.70       0.84                 0.69          0.19
Random Forest    0.73       0.79                 0.73          0.21

Based on the results above, all models have broadly similar evaluation results. We want to correctly target people who are likely to have heart disease, understand which variables indicate heart disease, and give suitable treatment to those at risk. To detect heart disease effectively, we want a model with both high Accuracy and high Sensitivity/Recall. Naive Bayes has the highest accuracy but the lowest recall, and the Decision Tree has the highest recall but the lowest accuracy; the Random Forest model offers the best balance of the two, so it is our choice.

However, all models have a low Pos Pred Value/Precision of around 19-22%, which needs to be improved. For a future project, in order to build a better model, we can try several options:

  • Use the upsampling method for the class imbalance, keeping in mind that this dataset has quite a lot of observations, so the model will need more computational time to run.
  • Find the best parameters for each model (model tuning), as sketched below.
  • Select suitable variables or reduce the number of variables.
  • Analyze using other methods.
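
A minimal sketch of what manual tuning could look like with caret, reusing the control object and heart_train_down from above (the mtry grid values are illustrative, and training is commented out because it is expensive):

# try several numbers of randomly sampled predictors per split
tune_grid <- expand.grid(mtry = c(2, 4, 6, 8))
# model_rf_tuned <- train(HeartDisease ~ ., data = heart_train_down,
#                         method = "rf", trControl = control,
#                         tuneGrid = tune_grid)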