1 Introduction

Heart disease is the leading cause of death in the United States, causing about 1 in 4 deaths. The term “heart disease” refers to several types of heart conditions; in the United States, the most common type is coronary artery disease (CAD), which can lead to heart attack. The dataset used here comes from Cleveland, a major city in the U.S. state of Ohio. In this analysis we build a model to predict heart disease from the Cleveland data.

Source dataset: https://www.kaggle.com/ronitf/heart-disease-uci

2 Import Library

library(dplyr)   # data manipulation
library(tidyr)   # data tidying
library(MASS)    # statistical functions
library(caret)   # data splitting and model evaluation
library(ggplot2) # plotting (used for the prediction density plot below)

3 Read Data

# read the raw data and inspect its structure
heart <- read.csv("heart.csv", stringsAsFactors = TRUE)
glimpse(heart)
## Rows: 303
## Columns: 14
## $ age      <int> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58, 5…
## $ sex      <int> 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1…
## $ cp       <int> 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3, 0…
## $ trestbps <int> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130, 1…
## $ chol     <int> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275, 2…
## $ fbs      <int> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
## $ restecg  <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1…
## $ thalach  <int> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139, 1…
## $ exang    <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…
## $ oldpeak  <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2, 0…
## $ slope    <int> 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2, 1…
## $ ca       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0…
## $ thal     <int> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3…
## $ target   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…

Attribute information:

age : age in years.

sex : sex (1 = male; 0 = female).

cp : chest pain type (4 values).

trestbps : resting blood pressure.

chol : serum cholesterol in mg/dl.

fbs : fasting blood sugar > 120 mg/dl.

restecg : resting electrocardiographic results (values 0,1,2).

thalach : maximum heart rate achieved.

exang : exercise induced angina.

oldpeak : ST depression induced by exercise relative to rest.

slope : the slope of the peak exercise ST segment.

ca : number of major vessels (0-3) colored by fluoroscopy.

thal : 3 = normal; 6 = fixed defect; 7 = reversible defect.

target : 0 = Health; 1 = Not Health (presence of heart disease).

4 Exploratory Data Analysis

Change Data Type

# convert integer-coded columns to factors and relabel the target
heart <- heart %>% 
    mutate_if(is.integer, as.factor) %>% 
    mutate(target = factor(target, levels = c(0, 1), labels = c("Health", "Not Health")))
head(heart)
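
A quick way to confirm the conversion (a sketch): every column except oldpeak should now be a factor, and target should carry the two labels defined above.

# check the classes of all columns and the labels of the target factor
sapply(heart, class)
levels(heart$target)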

Check Missing Values

colSums(is.na(heart))
##      age      sex       cp trestbps     chol      fbs  restecg  thalach 
##        0        0        0        0        0        0        0        0 
##    exang  oldpeak    slope       ca     thal   target 
##        0        0        0        0        0        0

There are no missing values in our dataset.
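
A one-line equivalent check (a sketch) is anyNA(), which returns a single logical value.

# TRUE if any cell in the data frame is NA
anyNA(heart)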

Check Class Proportions

prop.table(table(heart$target))
## 
##     Health Not Health 
##  0.4554455  0.5445545

The class proportions are fairly balanced, so we can continue to the next step.
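
The same proportions can be visualized with a simple bar chart (a sketch using ggplot2, which is loaded above).

# bar chart of the target classes
ggplot(heart, aes(x = target, fill = target)) +
  geom_bar() +
  labs(title = "Class Distribution of Target", x = "Target", y = "Count") +
  theme_minimal()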

Train-Test Split

We split the dataset into training and test sets, using 80% of the data for training and 20% for testing.

set.seed(212)

# randomly sample 80% of the row indices for training
index <- sample(nrow(heart), nrow(heart) * 0.8)

heart_train <- heart[index, ]
heart_test  <- heart[-index, ]
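
A stratified alternative (a sketch, keeping the same 80/20 ratio) is caret::createDataPartition(), which preserves the class proportions of target in both sets.

set.seed(212)
# sample within each target class so the Health / Not Health proportions are preserved
idx_strat <- createDataPartition(heart$target, p = 0.8, list = FALSE)
heart_train_strat <- heart[idx_strat, ]
heart_test_strat  <- heart[-idx_strat, ]
prop.table(table(heart_train_strat$target))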

5 Modelling

We now build a machine learning model to predict heart disease.

5.1 Logistic Regression

model_logistic <- glm(formula = target ~ sex+cp+fbs+exang+oldpeak+slope+ca+thal, 
                      family = "binomial",
                      data = heart_train)
summary(model_logistic)
## 
## Call:
## glm(formula = target ~ sex + cp + fbs + exang + oldpeak + slope + 
##     ca + thal, family = "binomial", data = heart_train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.86182  -0.35074   0.08587   0.36404   3.06243  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.7316     4.2749   0.171 0.864122    
## sex1         -1.9091     0.6021  -3.171 0.001521 ** 
## cp1           0.9981     0.6662   1.498 0.134070    
## cp2           2.0423     0.5949   3.433 0.000597 ***
## cp3           2.2555     0.7997   2.821 0.004794 ** 
## fbs1          0.4578     0.6333   0.723 0.469792    
## exang1       -0.9428     0.5094  -1.851 0.064226 .  
## oldpeak      -0.5059     0.2550  -1.984 0.047213 *  
## slope1       -1.3048     0.8848  -1.475 0.140283    
## slope2        0.6994     0.9782   0.715 0.474593    
## ca1          -2.1061     0.5798  -3.633 0.000281 ***
## ca2          -2.6051     0.8824  -2.952 0.003154 ** 
## ca3          -2.1581     0.9293  -2.322 0.020216 *  
## ca4           1.4331     1.5944   0.899 0.368728    
## thal1         2.2960     4.2561   0.539 0.589578    
## thal2         2.3008     4.1713   0.552 0.581237    
## thal3         0.9524     4.1754   0.228 0.819567    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 334.42  on 241  degrees of freedom
## Residual deviance: 146.50  on 225  degrees of freedom
## AIC: 180.5
## 
## Number of Fisher Scoring iterations: 6
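
The coefficients above are on the log-odds scale; exponentiating them gives odds ratios, which are often easier to interpret (a small sketch, not part of the original workflow).

# odds ratios: values above 1 increase the odds of "Not Health", values below 1 decrease them
exp(coef(model_logistic))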

5.2 Prediction

# predicted probability of "Not Health" for each test observation
heart_test$prob_heart <- predict(model_logistic, type = "response", newdata = heart_test)
ggplot(heart_test, aes(x = prob_heart)) +
  geom_density(lwd = 0.5) +
  labs(title = "Distribution of Predicted Probabilities") +
  theme_minimal()

Based on the plot above, the predicted probabilities tend toward 1, which corresponds to the Not Health class.

# classify as "Not Health" when the predicted probability exceeds 0.5
heart_test$pred_heart <- factor(ifelse(heart_test$prob_heart > 0.5, "Not Health", "Health"))

5.3 Evaluation

cm_logis <- confusionMatrix(heart_test$pred_heart, heart_test$target, positive = "Not Health")
cm_logis
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   Health Not Health
##   Health         19          4
##   Not Health      6         32
##                                           
##                Accuracy : 0.8361          
##                  95% CI : (0.7191, 0.9185)
##     No Information Rate : 0.5902          
##     P-Value [Acc > NIR] : 3.428e-05       
##                                           
##                   Kappa : 0.6569          
##                                           
##  Mcnemar's Test P-Value : 0.7518          
##                                           
##             Sensitivity : 0.8889          
##             Specificity : 0.7600          
##          Pos Pred Value : 0.8421          
##          Neg Pred Value : 0.8261          
##              Prevalence : 0.5902          
##          Detection Rate : 0.5246          
##    Detection Prevalence : 0.6230          
##       Balanced Accuracy : 0.8244          
##                                           
##        'Positive' Class : Not Health      
## 

Based on the confusion matrix, the model achieves Accuracy 83.6%, Sensitivity/Recall 88.9%, Specificity 76.0%, and Precision (Pos Pred Value) 84.2%.
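
The same metrics can be recovered directly from the confusion matrix counts (a sketch using the cells above, with "Not Health" as the positive class).

# counts taken from the confusion matrix above
TP <- 32  # predicted Not Health, actually Not Health
TN <- 19  # predicted Health, actually Health
FP <- 6   # predicted Not Health, actually Health
FN <- 4   # predicted Health, actually Not Health

accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # 0.836
sensitivity <- TP / (TP + FN)                   # 0.889 (recall)
specificity <- TN / (TN + FP)                   # 0.760
precision   <- TP / (TP + FP)                   # 0.842 (pos pred value)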

6 Conclusion

Our main metric is Sensitivity/Recall, which reaches 88.9%. We want to miss as few diseased patients as possible, i.e. keep False Negatives low, even if that means tolerating some False Positives. With this behavior the model can serve as a pre-screening tool: when a patient is labeled positive (Not Health), the doctor can follow up with a more detailed examination.
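
If we wanted to push recall even higher at the cost of some precision, we could lower the classification threshold (a sketch; 0.4 is an arbitrary illustrative value, not a tuned one).

# a lower threshold flags more patients as "Not Health",
# reducing False Negatives (higher recall) but allowing more False Positives
heart_test$pred_heart_04 <- factor(
  ifelse(heart_test$prob_heart > 0.4, "Not Health", "Health"),
  levels = c("Health", "Not Health")
)
confusionMatrix(heart_test$pred_heart_04, heart_test$target, positive = "Not Health")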