1 Introduction

Heart disease illustration. (Source: Webmd)

According to World Health Organization (WHO), ischemic heart disease is the 1st leading cause of death worldwide, which is responsible for 16% of the world’s total deaths. This project aims to predict presence of heart disease in a patient based on several characteristics using a dataset named Cleveland Heart Disease Dataset from Kaggle. Since the target variable is categorical, I will be using logistic regression and k-nearest neighbours algorithms to do the prediction.

2 Data Preparation

Load the required libraries.

library(tidyverse)
library(car)
library(caret)
library(e1071)
library(naniar)
library(class)
library(psych)

Load the dataset.

heart <- read.csv("heartattack.csv")
rmarkdown::paged_table(heart)
str(heart)
## 'data.frame':    303 obs. of  14 variables:
##  $ ï..age  : int  63 37 41 56 57 57 56 44 52 57 ...
##  $ sex     : int  1 1 0 1 0 1 0 1 1 1 ...
##  $ cp      : int  3 2 1 1 0 0 1 1 2 2 ...
##  $ trestbps: int  145 130 130 120 120 140 140 120 172 150 ...
##  $ chol    : int  233 250 204 236 354 192 294 263 199 168 ...
##  $ fbs     : int  1 0 0 0 0 0 0 0 1 0 ...
##  $ restecg : int  0 1 0 1 1 1 0 1 1 1 ...
##  $ thalach : int  150 187 172 178 163 148 153 173 162 174 ...
##  $ exang   : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ oldpeak : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ slope   : int  0 0 2 2 2 1 1 2 2 2 ...
##  $ ca      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ thal    : int  1 2 2 2 2 1 2 3 3 2 ...
##  $ target  : int  1 1 1 1 1 1 1 1 1 1 ...

In total, there are 14 variables in this dataset, with target as the target variable.

Variables explanation:
* Age: Age of the patient
* Sex: Sex of the patient (0 = Female, 1 = Male)
* cp: Chest Pain type
- Value 0: typical angina
- Value 1: atypical angina
- Value 2: non-anginal pain
- Value 3: asymptomatic
* trestbps: Resting blood pressure (in mm/Hg) on admission to the hospital
* chol: Serum cholestoral (in mg/dl)
* fbs: Fasting blood sugar classification (1 = > 120 mg/dl, 0 = <= 120 mg/dl)
* restecg: Resting electrocardiographic results
- Value 0: showing probable or definite left ventricular hypertrophy by Estes’ criteria
- Value 1: normal
- Value 2: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
* thalach: Maximum heart rate achieved
* exang: Exercised induced angina (1 = yes, 0 = no)
* oldpeak: ST depression induced by exercise relative to rest
* slope: The slope of the peak exercise ST segment
- Value 0: descending
- Value 1: flat
- Value 2: ascending
* ca: Number of major vessels (0 - 3)
* thal: Thalium Stress Test Result (1 = fixed defect; 2 = normal; 3 = reversable defect)
* target: Heart attack (1 = no, 0 = yes)

Some variables do not have the correct data type. Furthermore, to make the interpretation easier, the code for the variable of interest, presence of heart disease which previously coded as 0, will be coded as 1 and categorical variables will be labelled.

heart$cp <- as.factor(heart$cp)
heart$restecg <- as.factor(heart$restecg)
heart$slope <- as.factor(heart$slope)
heart$ca <- as.factor(heart$ca)
heart$thal <- as.factor(heart$thal)
heart$sex <- as.factor(heart$sex)
heart$fbs <- as.factor(heart$fbs)
heart$exang <- as.factor(heart$exang)
heart$target <- as.factor(heart$target)

heart$target2 <- ifelse(heart$target == 1, 0, "x")
heart$target2 <- ifelse(heart$target2 == "x", 1, 0)
heart$target2 <- as.factor(heart$target2)

heart <- heart %>% select(-target)
names(heart)[1] <- "age"
names(heart)[14] <- "target"

heart <- heart %>% mutate(sex = factor(sex, levels = c(0,1), labels = c("Female", "Male")),
         fbs = factor(fbs, levels = c(0,1), labels = c("<= 120 mg/dl", "> 120 mg/dl")),
         exang = factor(exang, levels = c(0,1), labels = c("No", "Yes")),
         target = factor(target, levels = c(0,1), labels = c("No", "Yes")),
         thal = factor(thal, levels = c(1,2,3), labels = c("Fixed defect", "Normal", "Reversable defect")),
         slope = factor(slope, levels = c(0,1,2), labels = c("Descending", "Flat", "Ascending")),
         cp = factor(cp, levels = c(0,1,2, 3), labels = c("Typical angina", "Atypical angina", "Non-anginal pain", "Asymptomatic")),
         restecg = factor(restecg, levels = c(0,1,2), labels = c("Left ventricular hypertrophy", "Normal", "ST-T wave abnormality")))

Check if there is any missing value.

anyNA(heart)
## [1] TRUE

Looks like there are missing values in this dataset.

colSums(is.na(heart))
##      age      sex       cp trestbps     chol      fbs  restecg  thalach 
##        0        0        0        0        0        0        0        0 
##    exang  oldpeak    slope       ca     thal   target 
##        0        0        0        0        2        0

There are 2 missing values in thal variable. We will eliminate them later.

summary(heart)
##       age            sex                     cp         trestbps    
##  Min.   :29.00   Female: 96   Typical angina  :143   Min.   : 94.0  
##  1st Qu.:47.50   Male  :207   Atypical angina : 50   1st Qu.:120.0  
##  Median :55.00                Non-anginal pain: 87   Median :130.0  
##  Mean   :54.37                Asymptomatic    : 23   Mean   :131.6  
##  3rd Qu.:61.00                                       3rd Qu.:140.0  
##  Max.   :77.00                                       Max.   :200.0  
##       chol                 fbs                              restecg   
##  Min.   :126.0   <= 120 mg/dl:258   Left ventricular hypertrophy:147  
##  1st Qu.:211.0   > 120 mg/dl : 45   Normal                      :152  
##  Median :240.0                      ST-T wave abnormality       :  4  
##  Mean   :246.3                                                        
##  3rd Qu.:274.5                                                        
##  Max.   :564.0                                                        
##     thalach      exang        oldpeak            slope     ca     
##  Min.   : 71.0   No :204   Min.   :0.00   Descending: 21   0:175  
##  1st Qu.:133.5   Yes: 99   1st Qu.:0.00   Flat      :140   1: 65  
##  Median :153.0             Median :0.80   Ascending :142   2: 38  
##  Mean   :149.6             Mean   :1.04                    3: 20  
##  3rd Qu.:166.0             3rd Qu.:1.60                    4:  5  
##  Max.   :202.0             Max.   :6.20                           
##                 thal     target   
##  Fixed defect     : 18   No :165  
##  Normal           :166   Yes:138  
##  Reversable defect:117            
##  NA's             :  2            
##                                   
## 

Based on the summary, it looks like ca or number of major vessels range from 0 to 4, while from the dataset explanation, it only ranges from 0 - 3, so observations with ca > 3 and missing values will be eliminated from the dataset

heart <- heart %>% filter(ca != 4 | thal != 0)
heart$thal <- droplevels(heart$thal)
heart$ca <- droplevels(heart$ca)

Before carrying out the analysis, we have to make sure the target variable has an equal (or almost equal) proportion for each of its categories, so the model will be able to predict both positive and negative class in a balanced way.

prop.table(table(heart$target))
## 
##        No       Yes 
## 0.5445545 0.4554455

Looks like both categories have almost equal proportion, so we can proceed with the analysis.

3 Exploratory Data Analysis

To see the relationship between each predictor variable and the target variable, we can visualize it.

3.1 Sex

ggplot(heart, aes(target, fill = sex)) +
  geom_bar(position = "fill") + labs(title = "Heart Attack based on Sex",
                                     x = "Heart Attack",
                                     y = "Proportion")

Based on sex, heart disease are mostly experienced by male patients.

3.2 Chest Pain Type

ggplot(heart, aes(target, fill = cp)) +
  geom_bar(position = "fill") + labs(title = "Heart Attack based on Chest Pain Type",
                                     x = "Heart Attack",
                                     y = "Proportion")

Based on the chest pain type, the majority of people with heart disease experience typical angina type of chest pain.

3.3 Resting Blood Pressure

ggplot(heart, aes(target, trestbps)) +
  geom_boxplot() + labs(title = "Heart Attack based on Resting Blood Pressure",
                                     x = "Heart Attack",
                                     y = "in mm/Hg")

Based on resting blood pressure, there was no difference in resting blood pressure between patients with heart disease and healthy patients.

3.4 Cholesterol Serum

ggplot(heart, aes(target, chol)) +
  geom_boxplot() + labs(title = "Heart Attack based on Cholesterol Serum",
                                     x = "Heart Attack",
                                     y = "in mg/dl")

There is not much difference in terms of cholesterol serum between patients with heart diseases and healthy patients.

3.5 Fasting Blood Sugar

ggplot(heart, aes(target, fill = fbs)) +
  geom_bar(position = "fill") + labs(title = "Heart Attack based on Fasting Blood Sugar",
                                     x = "Heart Attack", 
                                     y = "Proportion")

Based on fasting blood sugar levels, there was no difference in fasting blood sugar levels in patients with heart disease and healthy patients.

3.6 Resting ECG Result

ggplot(heart, aes(target, fill = restecg)) +
  geom_bar(position = "fill") + labs(title = "Heart Attack based on Resting ECG Results",
                                     x = "Heart Attack", 
                                     y = "Proportion")

Based on resting ecg results, most heart disease patients have left ventricular hypertrophy.

3.7 Max Heart Rate

ggplot(heart, aes(target, thalach)) +
  geom_boxplot() + labs(title = "Heart attack and Maximum Heart Rate",
                        x = "Heart Attack")

Heart disease patients tend to have a lower heart rate than healthy patients.

3.8 Exercised Induced Angina

ggplot(heart, aes(target, fill = exang)) +
  geom_bar(position = "fill") + labs(title = "Heart Attack based on Exercised Induced Angina",
                                     x = "Heart Attack", 
                                     y = "Proportion")

More than half of heart diseases patients experience exercised induced angina.

3.9 ST depression induced by exercise relative to rest

ggplot(heart, aes(target, oldpeak)) +
  geom_boxplot() + labs(title = "Heart attack and ST depression induced by exercise relative to rest",
                        x = "Heart Attack")

In terms of ST depression induced by exercise relative to rest, heart disease patients have a higher level of ST depression induced by exercise relative to rest than healthy patients.

3.10 Slope

ggplot(heart, aes(target, fill = slope)) +
  geom_bar(position = "fill") + labs(title = "Heart Attack based on Slope",
                                     x = "Heart Attack", 
                                     y = "Proportion")

Most patients with heart disease have a flat slope.

3.11 Number of major vessels

ggplot(heart, aes(target, fill = ca)) +
  geom_bar(position = "fill") + labs(title = "Heart Attack based on Number of Major Vessels",
                                     x = "Heart Attack")

There is a difference in the number of major vessels between a patient with heart disease and a healthy patient. In healthy patients, most have 0 number of major vessels, while in patients with heart disease, the number of major vessels is more diverse.

3.12 Thalium Stress Test Results

ggplot(heart, aes(target, fill = thal)) +
 geom_bar(position = "fill") +labs(title = "Heart Attack based on Thalium Stress Test Result",
                                     x = "Heart Attack", 
                                     y = "Proportion")

Based on thalium stress test result, majority of heart attack patients have reversable defect.

Based on visualizations, there are differences between people with heart disease and people who are healthy in terms of sex, chest pain type, resting ecg result, maximum heart rate, exercised induces angina, st depression induced by exercise relative to rest, slope, number of major vessels, dan thalium stress test result. Sehingga, variabel sex, cp, restecg, thalach, exang, oldpeak, slope, ca, and thal will be used as the predictor variables.

4 Modeling

4.1 Logistic Regression

Split the dataset into train and test data.

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(111)
index <- sample(nrow(heart), 
       nrow(heart)*0.8)
train <- heart[index, ]
test <- heart[-index, ]
model <- glm(formula = target~sex+cp+restecg+thalach+exang+oldpeak+slope+ca+thal, family = "binomial", data = train)
summary(model)
## 
## Call:
## glm(formula = target ~ sex + cp + restecg + thalach + exang + 
##     oldpeak + slope + ca + thal, family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2364  -0.4174  -0.1084   0.2445   3.0058  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   1.23716    2.24490   0.551 0.581566    
## sexMale                       1.31466    0.61252   2.146 0.031849 *  
## cpAtypical angina            -0.97774    0.65276  -1.498 0.134169    
## cpNon-anginal pain           -2.42021    0.61006  -3.967 7.27e-05 ***
## cpAsymptomatic               -2.37496    0.78049  -3.043 0.002343 ** 
## restecgNormal                -0.67534    0.43897  -1.538 0.123932    
## restecgST-T wave abnormality -0.06474    2.75386  -0.024 0.981245    
## thalach                      -0.02117    0.01220  -1.735 0.082736 .  
## exangYes                      0.67992    0.50380   1.350 0.177151    
## oldpeak                       0.30482    0.28024   1.088 0.276730    
## slopeFlat                     0.57626    0.92328   0.624 0.532531    
## slopeAscending               -0.90963    1.01719  -0.894 0.371187    
## ca1                           2.19294    0.58350   3.758 0.000171 ***
## ca2                           2.95847    0.97920   3.021 0.002517 ** 
## ca3                           1.80597    0.95687   1.887 0.059111 .  
## ca4                          -1.70119    1.74417  -0.975 0.329382    
## thalNormal                    0.29145    0.87017   0.335 0.737676    
## thalReversable defect         2.00885    0.83375   2.409 0.015979 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 330.69  on 239  degrees of freedom
## Residual deviance: 142.28  on 222  degrees of freedom
##   (2 observations deleted due to missingness)
## AIC: 178.28
## 
## Number of Fisher Scoring iterations: 6

From this model, we get AIC value of 178.28. But, some predictor variables are not significant to the model. We can utilize stepwise selection method to choose more significant predictor variables.

model.step <- step(model, direction = "backward", trace = F)
summary(model.step)
## 
## Call:
## glm(formula = target ~ sex + cp + thalach + exang + slope + ca + 
##     thal, family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3085  -0.4232  -0.1013   0.2984   2.8713  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            1.58965    2.03998   0.779 0.435834    
## sexMale                1.33564    0.60313   2.215 0.026794 *  
## cpAtypical angina     -1.08155    0.64611  -1.674 0.094141 .  
## cpNon-anginal pain    -2.43423    0.60097  -4.051 5.11e-05 ***
## cpAsymptomatic        -2.07100    0.73271  -2.826 0.004706 ** 
## thalach               -0.02143    0.01200  -1.785 0.074284 .  
## exangYes               0.74653    0.49847   1.498 0.134224    
## slopeFlat              0.17654    0.80244   0.220 0.825870    
## slopeAscending        -1.51481    0.82250  -1.842 0.065516 .  
## ca1                    2.21813    0.57211   3.877 0.000106 ***
## ca2                    3.13681    0.95391   3.288 0.001008 ** 
## ca3                    2.13707    0.93133   2.295 0.021754 *  
## ca4                   -1.86631    1.65260  -1.129 0.258763    
## thalNormal             0.30321    0.84731   0.358 0.720450    
## thalReversable defect  1.99250    0.81843   2.435 0.014910 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 330.69  on 239  degrees of freedom
## Residual deviance: 145.84  on 225  degrees of freedom
##   (2 observations deleted due to missingness)
## AIC: 175.84
## 
## Number of Fisher Scoring iterations: 6

From the model created using stepwise method, we get AIC value of 175.84, lower than the previous model. So, we will use this model to do the prediction.

4.1.1 Prediction

test$pred <- predict(object = model.step, #  model
        newdata = test,  # data baru
        type = "response")

test$pred.label <- ifelse(test$pred > 0.5, "1", "0") %>% factor(levels = c(0,1), labels = c("No", "Yes"))

4.1.2 Model Evaluation

eval_log <- confusionMatrix(data = test$pred.label, 
                reference = test$target,
                positive = "Yes")

eval_log <- data.frame(Accuracy = paste0(round(eval_log$overall[1]*100, 2), "%"),
                          Recall = paste0(round(eval_log$byClass[1]*100, 2), "%"),
                          Precision = paste0(round(eval_log$byClass[3]*100, 2), "%"))

eval_log
##   Accuracy Recall Precision
## 1   88.52% 85.71%    88.89%

From the prediction results, we obtain an accuracy rate of 88.52%, which means the model is able to predict the target variable accurately as much as 88.52% of the overall data. In addition, this models gives us sensitivity rate of 85,71%, which means that the model can accurately predict 85.71% of the presence of heart disease from all patients who actually have heart disease. This model is also able to predict 88.9% of positive cases (incidence of heart disease) accurately from all positive predicted cases as shown by the Precision value.

4.2 kNN

Another algorithm that can be used to predict categorical target variables is kNN (k-nearest neighbors). The kNN algorithm works by calculating the distance between individual observations and assuming that observations with similar characteristics are in close proximity to each other. Therefore, the predictor variables in this algorithm are numeric variables.

heart <- read.csv("heartattack.csv")
summary(heart)
##      ï..age           sex               cp           trestbps    
##  Min.   :29.00   Min.   :0.0000   Min.   :0.000   Min.   : 94.0  
##  1st Qu.:47.50   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:120.0  
##  Median :55.00   Median :1.0000   Median :1.000   Median :130.0  
##  Mean   :54.37   Mean   :0.6832   Mean   :0.967   Mean   :131.6  
##  3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:140.0  
##  Max.   :77.00   Max.   :1.0000   Max.   :3.000   Max.   :200.0  
##       chol            fbs            restecg          thalach     
##  Min.   :126.0   Min.   :0.0000   Min.   :0.0000   Min.   : 71.0  
##  1st Qu.:211.0   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:133.5  
##  Median :240.0   Median :0.0000   Median :1.0000   Median :153.0  
##  Mean   :246.3   Mean   :0.1485   Mean   :0.5281   Mean   :149.6  
##  3rd Qu.:274.5   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:166.0  
##  Max.   :564.0   Max.   :1.0000   Max.   :2.0000   Max.   :202.0  
##      exang           oldpeak         slope             ca        
##  Min.   :0.0000   Min.   :0.00   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.80   Median :1.000   Median :0.0000  
##  Mean   :0.3267   Mean   :1.04   Mean   :1.399   Mean   :0.7294  
##  3rd Qu.:1.0000   3rd Qu.:1.60   3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :6.20   Max.   :2.000   Max.   :4.0000  
##       thal           target      
##  Min.   :0.000   Min.   :0.0000  
##  1st Qu.:2.000   1st Qu.:0.0000  
##  Median :2.000   Median :1.0000  
##  Mean   :2.314   Mean   :0.5446  
##  3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :3.000   Max.   :1.0000

In KNN, all variables must be of the same scale. Therefore, the variables need to be standardized first. Standardization can be done using the scale () function. In addition, for categorical variables, a dummy variable is made.

4.2.1 Data Preparation

heart_new <- heart %>% select(c(sex, cp, thalach, exang, slope, ca, thal, target))

# Scale the numerical predictor variables
heart_new[, "thalach"] <- scale(heart_new[, "thalach"])
head(heart_new)
##   sex cp     thalach exang slope ca thal target
## 1   1  3  0.01541728     0     0  0    1      1
## 2   1  2  1.63077374     0     0  0    2      1
## 3   0  1  0.97589950     0     2  0    2      1
## 4   1  1  1.23784920     0     2  0    2      1
## 5   0  0  0.58297496     1     2  0    2      1
## 6   1  0 -0.07189928     0     1  0    1      1
# Dummy code vars with 3 or more levels

cp <- as.data.frame(dummy.code(heart_new$cp))
names(cp) <- paste0("cp", names(cp))
slope <- as.data.frame(dummy.code(heart_new$slope))
names(slope) <- paste0("slope", names(slope))
ca <- as.data.frame(dummy.code(heart_new$ca))
names(ca) <- paste0("ca", names(ca))
thal <- as.data.frame(dummy.code(heart_new$thal))
names(thal) <- paste0("thal", names(thal))

heart_new <- heart_new %>% select(-c(cp, slope, ca, thal)) %>% cbind(c(cp, slope, ca, thal))

names(heart_new)
##  [1] "sex"     "thalach" "exang"   "target"  "cp0"     "cp2"     "cp1"    
##  [8] "cp3"     "slope2"  "slope1"  "slope0"  "ca0"     "ca1"     "ca2"    
## [15] "ca3"     "ca4"     "thal2"   "thal3"   "thal1"   "thal0"

4.2.2 Modeling

Split the dataset into train and test data.

# Data splitting
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(112)
index <- sample(nrow(heart_new), 
       nrow(heart_new)*0.8)
train <- heart_new[index, ]
test <- heart_new[-index, ]

trainx <- train %>% select(-target)
testx <- test %>% select(-target)
train_y <- train$target
test_y <- test$target %>% as.factor()

Before making a prediction, determine the value of k.

k <- round(sqrt(nrow(train)))

pred <- knn(train = trainx,
    test = testx,
    cl = train_y,
    k = k)

4.2.3 Model Evaluation

eval_knn <- confusionMatrix(data = pred,
                reference = test_y,
                positive = "1")

eval_knn <- data.frame(Accuracy = paste0(round(eval_knn$overall[1]*100, 2), "%"),
                          Recall = paste0(round(eval_knn$byClass[1]*100, 2), "%"),
                          Precision = paste0(round(eval_knn$byClass[3]*100, 2), "%"))

eval_knn
##   Accuracy Recall Precision
## 1   80.33% 93.33%    73.68%

This model can accurately predict 81.97% of positive and negative cases from the overall data. Of the total positive cases, this model is able to accurately predict as many as 70.97% of positive cases as shown by Recall value. In addition, from all cases that are predicted to be positive, 91.67% of these cases are predicted accurately by the model.

5 Conclusion

Patients with heart disease and patients who based on their characteristics have a higher chance for heart disease need to get immediate intervention to prevent worse things from happening, so it would be better if more patients are predicted to have heart disease and get treatment. Therefore, we need a model that has the ability to accurately detect as many positive cases as possible, shown by a high recall rate. Of the 2 algorithms that have been used, the logistic regression model has a higher recall rate, so the logistic regression model is a better model for predicting cases of heart disease in this dataset.