According to the [CDC](https://www.cdc.gov/heartdisease/risk_factors.htm), heart disease is one of the leading causes of death in the US. In this project, we will predict whether a hospital patient has heart disease using a Naive Bayes classifier, a decision tree, and a random forest. The dataset comes from the CDC's 2020 annual survey of 400k adults about their health status (via Kaggle).
First, import the libraries:
library(dplyr)
library(tidyverse)
library(GGally)
library(ggplot2)
library(e1071)
library(rsample)
library(caret)
library(partykit)
library(randomForest)
Read the data into the heart object. We use the parameter stringsAsFactors = TRUE so that all character columns are automatically stored as factors.
heart <- read.csv("heart.csv", stringsAsFactors = TRUE)
Overview of the data:
head(heart)
Check the number of columns and rows.
dim(heart)
## [1] 319795 18
The data contains 319,795 rows and 18 columns.
View all columns and the data types.
glimpse(heart)
## Rows: 319,795
## Columns: 18
## $ HeartDisease <fct> No, No, No, No, No, Yes, No, No, No, No, Yes, No, No,~
## $ BMI <dbl> 16.60, 20.34, 26.58, 24.21, 23.71, 28.87, 21.63, 31.6~
## $ Smoking <fct> Yes, No, Yes, No, No, Yes, No, Yes, No, No, Yes, Yes,~
## $ AlcoholDrinking <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, N~
## $ Stroke <fct> No, Yes, No, No, No, No, No, No, No, No, No, No, No, ~
## $ PhysicalHealth <dbl> 3, 0, 20, 0, 28, 6, 15, 5, 0, 0, 30, 0, 0, 7, 0, 1, 5~
## $ MentalHealth <dbl> 30, 0, 30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 30, 0, 2,~
## $ DiffWalking <fct> No, No, No, No, Yes, Yes, No, Yes, No, Yes, Yes, No, ~
## $ Sex <fct> Female, Female, Male, Female, Female, Female, Female,~
## $ AgeCategory <fct> 55-59, 80 or older, 65-69, 75-79, 40-44, 75-79, 70-74~
## $ Race <fct> White, White, White, White, White, Black, White, Whit~
## $ Diabetic <fct> "Yes", "No", "Yes", "No", "No", "No", "No", "Yes", "N~
## $ PhysicalActivity <fct> Yes, Yes, Yes, No, Yes, No, Yes, No, No, Yes, No, Yes~
## $ GenHealth <fct> Very good, Very good, Fair, Good, Very good, Fair, Fa~
## $ SleepTime <dbl> 5, 7, 8, 6, 8, 12, 4, 9, 5, 10, 15, 5, 8, 7, 5, 6, 10~
## $ Asthma <fct> Yes, No, Yes, No, No, No, Yes, Yes, No, No, Yes, No, ~
## $ KidneyDisease <fct> No, No, No, No, No, No, No, No, Yes, No, No, No, No, ~
## $ SkinCancer <fct> Yes, No, No, Yes, No, No, Yes, No, No, No, No, No, No~
All columns already have the correct data type.
Check for missing values.
colSums(is.na(heart))
##     HeartDisease              BMI          Smoking  AlcoholDrinking
## 0 0 0 0
## Stroke PhysicalHealth MentalHealth DiffWalking
## 0 0 0 0
## Sex AgeCategory Race Diabetic
## 0 0 0 0
## PhysicalActivity GenHealth SleepTime Asthma
## 0 0 0 0
## KidneyDisease SkinCancer
## 0 0
No missing values found. Now the data is ready to explore.
Let’s see the summary of all columns.
summary(heart)
##  HeartDisease      BMI        Smoking      AlcoholDrinking   Stroke
## No :292422 Min. :12.02 No :187887 No :298018 No :307726
## Yes: 27373 1st Qu.:24.03 Yes:131908 Yes: 21777 Yes: 12069
## Median :27.34
## Mean :28.33
## 3rd Qu.:31.42
## Max. :94.85
##
## PhysicalHealth MentalHealth DiffWalking Sex
## Min. : 0.000 Min. : 0.000 No :275385 Female:167805
## 1st Qu.: 0.000 1st Qu.: 0.000 Yes: 44410 Male :151990
## Median : 0.000 Median : 0.000
## Mean : 3.372 Mean : 3.898
## 3rd Qu.: 2.000 3rd Qu.: 3.000
## Max. :30.000 Max. :30.000
##
## AgeCategory Race
## 65-69 : 34151 American Indian/Alaskan Native: 5202
## 60-64 : 33686 Asian : 8068
## 70-74 : 31065 Black : 22939
## 55-59 : 29757 Hispanic : 27446
## 50-54 : 25382 Other : 10928
## 80 or older: 24153 White :245212
## (Other) :141601
## Diabetic PhysicalActivity GenHealth
## No :269653 No : 71838 Excellent: 66842
## No, borderline diabetes: 6781 Yes:247957 Fair : 34677
## Yes : 40802 Good : 93129
## Yes (during pregnancy) : 2559 Poor : 11289
## Very good:113858
##
##
## SleepTime Asthma KidneyDisease SkinCancer
## Min. : 1.000 No :276923 No :308016 No :289976
## 1st Qu.: 6.000 Yes: 42872 Yes: 11779 Yes: 29819
## Median : 7.000
## Mean : 7.097
## 3rd Qu.: 8.000
## Max. :24.000
##
Before doing the analysis, we have to inspect the distribution of all variables. Categorical variables:
ggplot(gather(heart %>% select_if(is.factor)), aes(value)) +
  geom_bar(fill = "navy") +
  facet_wrap(~key, scales = 'free_x') +
  theme_minimal()
Numerical variables:
ggplot(gather(heart %>% select_if(is.numeric)), aes(value)) +
  geom_histogram(bins = 10, fill = "navy") +
  facet_wrap(~key, scales = 'free_x') +
  theme_minimal()
Correlation between the numeric variables:
ggcorr(heart, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2, low = "black", high = "blue")
The Naive Bayes classifier is a classification algorithm based on Bayes' theorem, with an assumption of independence among the predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. The Naive Bayes assumptions are that the predictors are mutually independent and carry the same weight; the formula after the list below shows the factorization this implies.
Advantages: simple and fast to train, and it performs reasonably well even with limited training data.
Disadvantages: probability estimates are skewed by data scarcity, and bias arises for events that occur rarely or not at all in the training data.
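Concretely, with predictor values $x_1, \dots, x_n$, the independence assumption lets the posterior for each class factor into one term per predictor. This is just Bayes' theorem plus the naive assumption, shown as a sketch of the scoring rule rather than e1071's exact internals:

$$P(\text{Yes} \mid x_1, \dots, x_n) \;\propto\; P(\text{Yes}) \prod_{i=1}^{n} P(x_i \mid \text{Yes})$$

The same score is computed for the No class, and the class with the larger score is predicted.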
Split the data into an 80% training set and a 20% test set.
RNGkind(sample.kind = "Rounding")
set.seed(100)
index <- sample(nrow(heart), nrow(heart)*0.8)
heart_train <- heart[index,]
heart_test <- heart[-index,]
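As an aside, since rsample is already loaded, a stratified split is an alternative that guarantees both sets keep the original Yes/No ratio. This is only a sketch using rsample's initial_split() with its strata argument; the rest of the project keeps the base-R split above.
# Sketch: stratified 80/20 split with rsample (alternative to base sample())
# strata keeps the HeartDisease proportions similar in train and test
set.seed(100)
heart_split <- initial_split(heart, prop = 0.8, strata = HeartDisease)
heart_train_strat <- training(heart_split)
heart_test_strat <- testing(heart_split)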
After splitting the data, check the class balance of the training set.
prop.table(table(heart_train$HeartDisease))
##
## No Yes
## 0.91363608 0.08636392
table(heart_train$HeartDisease)
##
## No Yes
## 233741 22095
The classes are imbalanced, so we need to upsample or downsample.
Upsampling: add minority-class observations to balance with the majority class by duplicating data from the minority class. Disadvantage: it only duplicates existing rows and adds no new information (see the upSample() sketch below).
Downsampling: reduce the majority-class observations to balance with the minority class. Disadvantage: it throws away information; it is commonly used when we have a lot of data.
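For reference, the upsampling counterpart in caret looks like this. A sketch only, using caret::upSample(), which mirrors the downSample() call used below; we do not use its result here.
# Sketch: upsampling the minority class by duplicating its rows
heart_train_up <- upSample(x = heart_train %>% select(-HeartDisease),
                           y = heart_train$HeartDisease,
                           yname = "HeartDisease")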
We will use the downsampling method because even the minority class (HeartDisease = Yes) still has 22,095 observations, which is plenty of data.
heart_train_down <- downSample(x = heart_train %>% select(-HeartDisease),
y = as.factor(heart_train$HeartDisease),
                               yname = "HeartDisease")
Re-check the class balance.
prop.table(table(heart_train_down$HeartDisease))
##
## No Yes
## 0.5 0.5
Now it is balanced.
dim(heart_train_down)
## [1] 44190 18
After downsampling, there are 44,190 observations, 22,095 per class.
Build the Naive Bayes model on the downsampled training set, using Laplace smoothing (laplace = 1):
model_naive_bayes <- naiveBayes(formula = HeartDisease ~ ., data = heart_train_down, laplace = 1)
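The laplace = 1 argument applies standard add-one (Laplace) smoothing to every categorical predictor. For a predictor $x_i$ with $k$ levels, the conditional probability of level $v$ given class $y$ is estimated as:

$$\hat{P}(x_i = v \mid y) = \frac{n_{v,y} + 1}{n_y + k}$$

where $n_{v,y}$ is the number of training rows of class $y$ with level $v$ and $n_y$ is the total number of rows of class $y$. This keeps a level that never occurs within a class from getting probability zero and wiping out the whole product.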
model_naive_bayes
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## No Yes
## 0.5 0.5
##
## Conditional probabilities:
## BMI
## Y [,1] [,2]
## No 28.17281 6.283340
## Yes 29.41582 6.611655
##
## Smoking
## Y No Yes
## No 0.5990406 0.4009594
## Yes 0.4133593 0.5866407
##
## AlcoholDrinking
## Y No Yes
## No 0.92971897 0.07028103
## Yes 0.95868217 0.04131783
##
## Stroke
## Y No Yes
## No 0.97361633 0.02638367
## Yes 0.83871114 0.16128886
##
## PhysicalHealth
## Y [,1] [,2]
## No 3.004164 7.466262
## Yes 7.841412 11.512455
##
## MentalHealth
## Y [,1] [,2]
## No 3.854809 7.791103
## Yes 4.625571 9.146825
##
## DiffWalking
## Y No Yes
## No 0.8827895 0.1172105
## Yes 0.6352446 0.3647554
##
## Sex
## Y Female Male
## No 0.5389419 0.4610581
## Yes 0.4077929 0.5922071
##
## AgeCategory
## Y 18-24 25-29 30-34 35-39 40-44 45-49
## No 0.069477112 0.059978288 0.061697123 0.069658042 0.068391532 0.072371992
## Yes 0.005156504 0.005111272 0.008096617 0.011081961 0.018228695 0.027139497
## AgeCategory
## Y 50-54 55-59 60-64 65-69 70-74 75-79
## No 0.081373259 0.096616609 0.106748688 0.103944274 0.090012665 0.057083409
## Yes 0.050253302 0.080513841 0.122625294 0.147231771 0.177582775 0.147503166
## AgeCategory
## Y 80 or older
## No 0.062647006
## Yes 0.199475303
##
## Race
## Y American Indian/Alaskan Native Asian Black Hispanic
## No 0.015112438 0.024478530 0.072937876 0.091715307
## Yes 0.020044342 0.009909054 0.063300303 0.051309895
## Race
## Y Other White
## No 0.034342337 0.761413511
## Yes 0.032894439 0.822541966
##
## Diabetic
## Y No No, borderline diabetes Yes Yes (during pregnancy)
## No 0.861577447 0.021086927 0.108602199 0.008733427
## Yes 0.639938459 0.028870085 0.327254627 0.003936830
##
## PhysicalActivity
## Y No Yes
## No 0.2079015 0.7920985
## Yes 0.3601394 0.6398606
##
## GenHealth
## Y Excellent Fair Good Poor Very good
## No 0.22402715 0.09239819 0.28800905 0.02488688 0.37067873
## Yes 0.05461538 0.25941176 0.34850679 0.14004525 0.19742081
##
## SleepTime
## Y [,1] [,2]
## No 7.109618 1.397189
## Yes 7.135461 1.788307
##
## Asthma
## Y No Yes
## No 0.8698918 0.1301082
## Yes 0.8186632 0.1813368
##
## KidneyDisease
## Y No Yes
## No 0.97035797 0.02964203
## Yes 0.87391954 0.12608046
##
## SkinCancer
## Y No Yes
## No 0.91541838 0.08458162
## Yes 0.81911572 0.18088428
Predict on the test set and compare the predictions with the actual labels.
pred_nb <- predict(model_naive_bayes, newdata = heart_test, type = "class")
Use confusionMatrix() to evaluate the model. A confusion matrix is a table that cross-tabulates the predicted class against the actual class, giving the counts of true positives, true negatives, false positives, and false negatives. From it we get metrics including Accuracy, Sensitivity/Recall, Specificity, and Pos Pred Value/Precision.
confusionMatrix(data = pred_nb, reference = heart_test$HeartDisease, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 46745 1854
## Yes 11936 3424
##
## Accuracy : 0.7844
## 95% CI : (0.7812, 0.7876)
## No Information Rate : 0.9175
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2382
##
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.64873
## Specificity : 0.79660
## Pos Pred Value : 0.22292
## Neg Pred Value : 0.96185
## Prevalence : 0.08252
## Detection Rate : 0.05353
## Detection Prevalence : 0.24015
## Balanced Accuracy : 0.72266
##
## 'Positive' Class : Yes
##
The evaluation results of the Naive Bayes model:
Accuracy = 0.78
Sensitivity/Recall = 0.65
Specificity = 0.80
Pos Pred Value/Precision = 0.22
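To make these metric definitions concrete, here is a quick sketch that recomputes them by hand from the confusion matrix above (TP, FN, FP, TN are read directly off the table):
# Recompute the Naive Bayes metrics from the confusion matrix above
TP <- 3424; FN <- 1854   # actual Yes: predicted Yes / predicted No
FP <- 11936; TN <- 46745 # actual No: predicted Yes / predicted No
(TP + TN) / (TP + TN + FP + FN)  # Accuracy, ~0.78
TP / (TP + FN)                   # Sensitivity/Recall, ~0.65
TN / (TN + FP)                   # Specificity, ~0.80
TP / (TP + FP)                   # Pos Pred Value/Precision, ~0.22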
The decision tree algorithm builds the classification model in the form of a tree structure. It uses if-then rules that are collectively exhaustive and mutually exclusive: the data is repeatedly broken down into smaller subsets as the tree grows incrementally, and the final structure is a tree with nodes and leaves. The decision tree's assumption here is mutual independence between predictors.
Advantages: easy to interpret and visualize, and it handles both numeric and categorical predictors.
Disadvantages: prone to overfitting, and small changes in the data can produce a very different tree.
model_decision_tree <- ctree(formula = HeartDisease ~ .,
                             data = heart_train_down)
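Before predicting, the fitted tree can be inspected visually. A sketch: type = "simple" is a partykit plotting option that keeps a tree this large somewhat readable.
# Sketch: visualize the fitted conditional inference tree
plot(model_decision_tree, type = "simple")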
Predict on the test set and compare the predictions with the actual labels.
pred_test_dt <- predict(object = model_decision_tree,
                        newdata = heart_test,
                        type = "response")
Use confusionMatrix() to evaluate the model.
confusionMatrix(pred_test_dt, reference = heart_test$HeartDisease, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 40269 840
## Yes 18412 4438
##
## Accuracy : 0.699
## 95% CI : (0.6954, 0.7025)
## No Information Rate : 0.9175
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2096
##
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.84085
## Specificity : 0.68624
## Pos Pred Value : 0.19422
## Neg Pred Value : 0.97957
## Prevalence : 0.08252
## Detection Rate : 0.06939
## Detection Prevalence : 0.35726
## Balanced Accuracy : 0.76354
##
## 'Positive' Class : Yes
##
The evaluation results of the decision tree model:
Accuracy = 0.70
Sensitivity/Recall = 0.84
Specificity = 0.69
Pos Pred Value/Precision = 0.19
Random forests (random decision trees) are an ensemble learning method for classification, regression, and other tasks. They operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the individual trees' classes (classification) or their mean prediction (regression).
Advantages: reduces overfitting compared to a single tree, and considers only a random subset of predictors at each split (controlled by mtry).
Disadvantages: harder to interpret than a single tree and slower to train.
From the training data, we will build a random forest with k-fold cross-validation (k = 5), where the creation of the k-fold sets is repeated 3 times. The trainControl() arguments used are:
method: the method for cross-validation
number: the number of folds k in k-fold CV
repeats: the number of repetitions of the k-fold CV
set.seed(100)
control <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
# model_random_forest <- train(HeartDisease ~ ., data = heart_train_down, method = "rf", trControl = control)
The fitted random forest model was saved to an .RDS file, so we reload it here instead of retraining.
# saveRDS(model_random_forest, "model_random_forest.RDS")
model_random_forest <- readRDS("model_random_forest.RDS")
model_random_forest
## Random Forest
##
## 44190 samples
## 17 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 35352, 35352, 35352, 35352, 35352, 35352, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.7553519 0.5107038
## 19 0.7403787 0.4807573
## 37 0.7330693 0.4661386
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
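If we wanted to search specific mtry values instead of caret's default grid, a sketch would look like this (tuneGrid is a standard caret::train argument; the call is left commented out, like the original, because it is slow):
# Sketch: custom mtry grid for caret::train
rf_grid <- expand.grid(mtry = c(2, 5, 10))
# model_rf_tuned <- train(HeartDisease ~ ., data = heart_train_down,
#                         method = "rf", trControl = control, tuneGrid = rf_grid)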
Predict on the test set and compare the predictions with the actual labels.
pred_rf <- predict(object = model_random_forest,
                   newdata = heart_test)
Use confusionMatrix() to evaluate the model.
confusionMatrix(pred_rf, reference = heart_test$HeartDisease, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 42603 1129
## Yes 16078 4149
##
## Accuracy : 0.731
## 95% CI : (0.7275, 0.7344)
## No Information Rate : 0.9175
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2237
##
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.78609
## Specificity : 0.72601
## Pos Pred Value : 0.20512
## Neg Pred Value : 0.97418
## Prevalence : 0.08252
## Detection Rate : 0.06487
## Detection Prevalence : 0.31625
## Balanced Accuracy : 0.75605
##
## 'Positive' Class : Yes
##
The evaluation results of the random forest model:
Accuracy = 0.73
Sensitivity/Recall = 0.79
Specificity = 0.73
Pos Pred Value/Precision = 0.21
plot(varImp(model_random_forest))
Based on the result above, the variables most important for predicting heart disease are DiffWalking, Diabetic, AgeCategory 80 or older, GenHealth, Stroke, and PhysicalHealth.
Comparison of all models:

| Model | Accuracy | Sensitivity/Recall | Specificity | Pos Pred Value/Precision |
|---|---|---|---|---|
| Naive Bayes | 0.78 | 0.65 | 0.80 | 0.22 |
| Decision Tree | 0.70 | 0.84 | 0.69 | 0.19 |
| Random Forest | 0.73 | 0.79 | 0.73 | 0.21 |
Based on the table above, all models have broadly similar evaluation results. We want to correctly identify people who are likely to have heart disease, learn which variables indicate heart disease, and give those people suitable treatment. For that, the model should have both high Accuracy and high Sensitivity/Recall. Naive Bayes has the highest Accuracy and the decision tree the highest Recall, but the random forest offers the best balance of the two, so it is the model we choose.
However, all models have a low Pos Pred Value/Precision of around 19-22%, which needs to be improved. For future projects, in order to build a better model, we could:
tune the classification threshold instead of using the default 0.5 cutoff (see the sketch below);
try upsampling or other resampling strategies instead of downsampling;
tune the model hyperparameters further;
engineer or add more informative features.
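As one example of the first option, here is a minimal sketch of threshold tuning on the random forest. The 0.7 cutoff is an arbitrary illustration, and type = "prob" is the standard way to get class probabilities from a caret model:
# Sketch: raise the cutoff above 0.5 to trade recall for precision
prob_rf <- predict(model_random_forest, newdata = heart_test, type = "prob")
pred_strict <- factor(ifelse(prob_rf$Yes > 0.7, "Yes", "No"),
                      levels = c("No", "Yes"))
confusionMatrix(pred_strict, reference = heart_test$HeartDisease, positive = "Yes")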