Heart disease illustration. (Source: Webmd)
According to World Health Organization (WHO), ischemic heart disease is the 1st leading cause of death worldwide, which is responsible for 16% of the world’s total deaths. This project aims to predict presence of heart disease in a patient based on several characteristics using a dataset named Cleveland Heart Disease Dataset from Kaggle. Since the target variable is categorical, I will be using logistic regression and k-nearest neighbours algorithms to do the prediction.
Load the required libraries.
library(tidyverse)
library(car)
library(caret)
library(e1071)
library(naniar)
library(class)
library(psych)
Load the dataset.
<- read.csv("heartattack.csv")
heart ::paged_table(heart) rmarkdown
str(heart)
## 'data.frame': 303 obs. of 14 variables:
## $ ï..age : int 63 37 41 56 57 57 56 44 52 57 ...
## $ sex : int 1 1 0 1 0 1 0 1 1 1 ...
## $ cp : int 3 2 1 1 0 0 1 1 2 2 ...
## $ trestbps: int 145 130 130 120 120 140 140 120 172 150 ...
## $ chol : int 233 250 204 236 354 192 294 263 199 168 ...
## $ fbs : int 1 0 0 0 0 0 0 0 1 0 ...
## $ restecg : int 0 1 0 1 1 1 0 1 1 1 ...
## $ thalach : int 150 187 172 178 163 148 153 173 162 174 ...
## $ exang : int 0 0 0 0 1 0 0 0 0 0 ...
## $ oldpeak : num 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
## $ slope : int 0 0 2 2 2 1 1 2 2 2 ...
## $ ca : int 0 0 0 0 0 0 0 0 0 0 ...
## $ thal : int 1 2 2 2 2 1 2 3 3 2 ...
## $ target : int 1 1 1 1 1 1 1 1 1 1 ...
In total, there are 14 variables in this dataset, with target
as the target variable.
Variables explanation:
* Age
: Age of the patient
* Sex
: Sex of the patient (0 = Female, 1 = Male)
* cp
: Chest Pain type
- Value 0: typical angina
- Value 1: atypical angina
- Value 2: non-anginal pain
- Value 3: asymptomatic
* trestbps
: Resting blood pressure (in mm/Hg) on admission to the hospital
* chol
: Serum cholestoral (in mg/dl)
* fbs
: Fasting blood sugar classification (1 = > 120 mg/dl, 0 = <= 120 mg/dl)
* restecg
: Resting electrocardiographic results
- Value 0: showing probable or definite left ventricular hypertrophy by Estes’ criteria
- Value 1: normal
- Value 2: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
* thalach
: Maximum heart rate achieved
* exang
: Exercised induced angina (1 = yes, 0 = no)
* oldpeak
: ST depression induced by exercise relative to rest
* slope
: The slope of the peak exercise ST segment
- Value 0: descending
- Value 1: flat
- Value 2: ascending
* ca
: Number of major vessels (0 - 3)
* thal
: Thalium Stress Test Result (1 = fixed defect; 2 = normal; 3 = reversable defect)
* target
: Heart attack (1 = no, 0 = yes)
Some variables do not have the correct data type. Furthermore, to make the interpretation easier, the code for the variable of interest, presence of heart disease which previously coded as 0, will be coded as 1 and categorical variables will be labelled.
$cp <- as.factor(heart$cp)
heart$restecg <- as.factor(heart$restecg)
heart$slope <- as.factor(heart$slope)
heart$ca <- as.factor(heart$ca)
heart$thal <- as.factor(heart$thal)
heart$sex <- as.factor(heart$sex)
heart$fbs <- as.factor(heart$fbs)
heart$exang <- as.factor(heart$exang)
heart$target <- as.factor(heart$target)
heart
$target2 <- ifelse(heart$target == 1, 0, "x")
heart$target2 <- ifelse(heart$target2 == "x", 1, 0)
heart$target2 <- as.factor(heart$target2)
heart
<- heart %>% select(-target)
heart names(heart)[1] <- "age"
names(heart)[14] <- "target"
<- heart %>% mutate(sex = factor(sex, levels = c(0,1), labels = c("Female", "Male")),
heart fbs = factor(fbs, levels = c(0,1), labels = c("<= 120 mg/dl", "> 120 mg/dl")),
exang = factor(exang, levels = c(0,1), labels = c("No", "Yes")),
target = factor(target, levels = c(0,1), labels = c("No", "Yes")),
thal = factor(thal, levels = c(1,2,3), labels = c("Fixed defect", "Normal", "Reversable defect")),
slope = factor(slope, levels = c(0,1,2), labels = c("Descending", "Flat", "Ascending")),
cp = factor(cp, levels = c(0,1,2, 3), labels = c("Typical angina", "Atypical angina", "Non-anginal pain", "Asymptomatic")),
restecg = factor(restecg, levels = c(0,1,2), labels = c("Left ventricular hypertrophy", "Normal", "ST-T wave abnormality")))
Check if there is any missing value.
anyNA(heart)
## [1] TRUE
Looks like there are missing values in this dataset.
colSums(is.na(heart))
## age sex cp trestbps chol fbs restecg thalach
## 0 0 0 0 0 0 0 0
## exang oldpeak slope ca thal target
## 0 0 0 0 2 0
There are 2 missing values in thal
variable. We will eliminate them later.
summary(heart)
## age sex cp trestbps
## Min. :29.00 Female: 96 Typical angina :143 Min. : 94.0
## 1st Qu.:47.50 Male :207 Atypical angina : 50 1st Qu.:120.0
## Median :55.00 Non-anginal pain: 87 Median :130.0
## Mean :54.37 Asymptomatic : 23 Mean :131.6
## 3rd Qu.:61.00 3rd Qu.:140.0
## Max. :77.00 Max. :200.0
## chol fbs restecg
## Min. :126.0 <= 120 mg/dl:258 Left ventricular hypertrophy:147
## 1st Qu.:211.0 > 120 mg/dl : 45 Normal :152
## Median :240.0 ST-T wave abnormality : 4
## Mean :246.3
## 3rd Qu.:274.5
## Max. :564.0
## thalach exang oldpeak slope ca
## Min. : 71.0 No :204 Min. :0.00 Descending: 21 0:175
## 1st Qu.:133.5 Yes: 99 1st Qu.:0.00 Flat :140 1: 65
## Median :153.0 Median :0.80 Ascending :142 2: 38
## Mean :149.6 Mean :1.04 3: 20
## 3rd Qu.:166.0 3rd Qu.:1.60 4: 5
## Max. :202.0 Max. :6.20
## thal target
## Fixed defect : 18 No :165
## Normal :166 Yes:138
## Reversable defect:117
## NA's : 2
##
##
Based on the summary, it looks like ca
or number of major vessels range from 0 to 4, while from the dataset explanation, it only ranges from 0 - 3, so observations with ca
> 3 and missing values will be eliminated from the dataset
<- heart %>% filter(ca != 4 | thal != 0)
heart $thal <- droplevels(heart$thal)
heart$ca <- droplevels(heart$ca) heart
Before carrying out the analysis, we have to make sure the target variable has an equal (or almost equal) proportion for each of its categories, so the model will be able to predict both positive and negative class in a balanced way.
prop.table(table(heart$target))
##
## No Yes
## 0.5445545 0.4554455
Looks like both categories have almost equal proportion, so we can proceed with the analysis.
To see the relationship between each predictor variable and the target variable, we can visualize it.
ggplot(heart, aes(target, fill = sex)) +
geom_bar(position = "fill") + labs(title = "Heart Attack based on Sex",
x = "Heart Attack",
y = "Proportion")
Based on sex, heart disease are mostly experienced by male patients.
ggplot(heart, aes(target, fill = cp)) +
geom_bar(position = "fill") + labs(title = "Heart Attack based on Chest Pain Type",
x = "Heart Attack",
y = "Proportion")
Based on the chest pain type, the majority of people with heart disease experience typical angina type of chest pain.
ggplot(heart, aes(target, trestbps)) +
geom_boxplot() + labs(title = "Heart Attack based on Resting Blood Pressure",
x = "Heart Attack",
y = "in mm/Hg")
Based on resting blood pressure, there was no difference in resting blood pressure between patients with heart disease and healthy patients.
ggplot(heart, aes(target, chol)) +
geom_boxplot() + labs(title = "Heart Attack based on Cholesterol Serum",
x = "Heart Attack",
y = "in mg/dl")
There is not much difference in terms of cholesterol serum between patients with heart diseases and healthy patients.
ggplot(heart, aes(target, fill = fbs)) +
geom_bar(position = "fill") + labs(title = "Heart Attack based on Fasting Blood Sugar",
x = "Heart Attack",
y = "Proportion")
Based on fasting blood sugar levels, there was no difference in fasting blood sugar levels in patients with heart disease and healthy patients.
ggplot(heart, aes(target, fill = restecg)) +
geom_bar(position = "fill") + labs(title = "Heart Attack based on Resting ECG Results",
x = "Heart Attack",
y = "Proportion")
Based on resting ecg results, most heart disease patients have left ventricular hypertrophy.
ggplot(heart, aes(target, thalach)) +
geom_boxplot() + labs(title = "Heart attack and Maximum Heart Rate",
x = "Heart Attack")
Heart disease patients tend to have a lower heart rate than healthy patients.
ggplot(heart, aes(target, fill = exang)) +
geom_bar(position = "fill") + labs(title = "Heart Attack based on Exercised Induced Angina",
x = "Heart Attack",
y = "Proportion")
More than half of heart diseases patients experience exercised induced angina.
ggplot(heart, aes(target, oldpeak)) +
geom_boxplot() + labs(title = "Heart attack and ST depression induced by exercise relative to rest",
x = "Heart Attack")
In terms of ST depression induced by exercise relative to rest, heart disease patients have a higher level of ST depression induced by exercise relative to rest than healthy patients.
ggplot(heart, aes(target, fill = slope)) +
geom_bar(position = "fill") + labs(title = "Heart Attack based on Slope",
x = "Heart Attack",
y = "Proportion")
Most patients with heart disease have a flat slope.
ggplot(heart, aes(target, fill = ca)) +
geom_bar(position = "fill") + labs(title = "Heart Attack based on Number of Major Vessels",
x = "Heart Attack")
There is a difference in the number of major vessels between a patient with heart disease and a healthy patient. In healthy patients, most have 0 number of major vessels, while in patients with heart disease, the number of major vessels is more diverse.
ggplot(heart, aes(target, fill = thal)) +
geom_bar(position = "fill") +labs(title = "Heart Attack based on Thalium Stress Test Result",
x = "Heart Attack",
y = "Proportion")
Based on thalium stress test result, majority of heart attack patients have reversable defect.
Based on visualizations, there are differences between people with heart disease and people who are healthy in terms of sex
, chest pain type
, resting ecg result
, maximum heart rate
, exercised induces angina
, st depression induced by exercise relative to rest
, slope
, number of major vessels
, dan thalium stress test result
. Sehingga, variabel sex
, cp
, restecg
, thalach
, exang
, oldpeak
, slope
, ca
, and thal
will be used as the predictor variables.
Split the dataset into train and test data.
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(111)
<- sample(nrow(heart),
index nrow(heart)*0.8)
<- heart[index, ]
train <- heart[-index, ] test
<- glm(formula = target~sex+cp+restecg+thalach+exang+oldpeak+slope+ca+thal, family = "binomial", data = train)
model summary(model)
##
## Call:
## glm(formula = target ~ sex + cp + restecg + thalach + exang +
## oldpeak + slope + ca + thal, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2364 -0.4174 -0.1084 0.2445 3.0058
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.23716 2.24490 0.551 0.581566
## sexMale 1.31466 0.61252 2.146 0.031849 *
## cpAtypical angina -0.97774 0.65276 -1.498 0.134169
## cpNon-anginal pain -2.42021 0.61006 -3.967 7.27e-05 ***
## cpAsymptomatic -2.37496 0.78049 -3.043 0.002343 **
## restecgNormal -0.67534 0.43897 -1.538 0.123932
## restecgST-T wave abnormality -0.06474 2.75386 -0.024 0.981245
## thalach -0.02117 0.01220 -1.735 0.082736 .
## exangYes 0.67992 0.50380 1.350 0.177151
## oldpeak 0.30482 0.28024 1.088 0.276730
## slopeFlat 0.57626 0.92328 0.624 0.532531
## slopeAscending -0.90963 1.01719 -0.894 0.371187
## ca1 2.19294 0.58350 3.758 0.000171 ***
## ca2 2.95847 0.97920 3.021 0.002517 **
## ca3 1.80597 0.95687 1.887 0.059111 .
## ca4 -1.70119 1.74417 -0.975 0.329382
## thalNormal 0.29145 0.87017 0.335 0.737676
## thalReversable defect 2.00885 0.83375 2.409 0.015979 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 330.69 on 239 degrees of freedom
## Residual deviance: 142.28 on 222 degrees of freedom
## (2 observations deleted due to missingness)
## AIC: 178.28
##
## Number of Fisher Scoring iterations: 6
From this model, we get AIC value of 178.28. But, some predictor variables are not significant to the model. We can utilize stepwise selection method to choose more significant predictor variables.
<- step(model, direction = "backward", trace = F)
model.step summary(model.step)
##
## Call:
## glm(formula = target ~ sex + cp + thalach + exang + slope + ca +
## thal, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3085 -0.4232 -0.1013 0.2984 2.8713
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.58965 2.03998 0.779 0.435834
## sexMale 1.33564 0.60313 2.215 0.026794 *
## cpAtypical angina -1.08155 0.64611 -1.674 0.094141 .
## cpNon-anginal pain -2.43423 0.60097 -4.051 5.11e-05 ***
## cpAsymptomatic -2.07100 0.73271 -2.826 0.004706 **
## thalach -0.02143 0.01200 -1.785 0.074284 .
## exangYes 0.74653 0.49847 1.498 0.134224
## slopeFlat 0.17654 0.80244 0.220 0.825870
## slopeAscending -1.51481 0.82250 -1.842 0.065516 .
## ca1 2.21813 0.57211 3.877 0.000106 ***
## ca2 3.13681 0.95391 3.288 0.001008 **
## ca3 2.13707 0.93133 2.295 0.021754 *
## ca4 -1.86631 1.65260 -1.129 0.258763
## thalNormal 0.30321 0.84731 0.358 0.720450
## thalReversable defect 1.99250 0.81843 2.435 0.014910 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 330.69 on 239 degrees of freedom
## Residual deviance: 145.84 on 225 degrees of freedom
## (2 observations deleted due to missingness)
## AIC: 175.84
##
## Number of Fisher Scoring iterations: 6
From the model created using stepwise method, we get AIC value of 175.84, lower than the previous model. So, we will use this model to do the prediction.
$pred <- predict(object = model.step, # model
testnewdata = test, # data baru
type = "response")
$pred.label <- ifelse(test$pred > 0.5, "1", "0") %>% factor(levels = c(0,1), labels = c("No", "Yes")) test
<- confusionMatrix(data = test$pred.label,
eval_log reference = test$target,
positive = "Yes")
<- data.frame(Accuracy = paste0(round(eval_log$overall[1]*100, 2), "%"),
eval_log Recall = paste0(round(eval_log$byClass[1]*100, 2), "%"),
Precision = paste0(round(eval_log$byClass[3]*100, 2), "%"))
eval_log
## Accuracy Recall Precision
## 1 88.52% 85.71% 88.89%
From the prediction results, we obtain an accuracy rate of 88.52%, which means the model is able to predict the target variable accurately as much as 88.52% of the overall data. In addition, this models gives us sensitivity rate of 85,71%, which means that the model can accurately predict 85.71% of the presence of heart disease from all patients who actually have heart disease. This model is also able to predict 88.9% of positive cases (incidence of heart disease) accurately from all positive predicted cases as shown by the Precision value.
Another algorithm that can be used to predict categorical target variables is kNN (k-nearest neighbors). The kNN algorithm works by calculating the distance between individual observations and assuming that observations with similar characteristics are in close proximity to each other. Therefore, the predictor variables in this algorithm are numeric variables.
<- read.csv("heartattack.csv")
heart summary(heart)
## ï..age sex cp trestbps
## Min. :29.00 Min. :0.0000 Min. :0.000 Min. : 94.0
## 1st Qu.:47.50 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:120.0
## Median :55.00 Median :1.0000 Median :1.000 Median :130.0
## Mean :54.37 Mean :0.6832 Mean :0.967 Mean :131.6
## 3rd Qu.:61.00 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:140.0
## Max. :77.00 Max. :1.0000 Max. :3.000 Max. :200.0
## chol fbs restecg thalach
## Min. :126.0 Min. :0.0000 Min. :0.0000 Min. : 71.0
## 1st Qu.:211.0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:133.5
## Median :240.0 Median :0.0000 Median :1.0000 Median :153.0
## Mean :246.3 Mean :0.1485 Mean :0.5281 Mean :149.6
## 3rd Qu.:274.5 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:166.0
## Max. :564.0 Max. :1.0000 Max. :2.0000 Max. :202.0
## exang oldpeak slope ca
## Min. :0.0000 Min. :0.00 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00 1st Qu.:1.000 1st Qu.:0.0000
## Median :0.0000 Median :0.80 Median :1.000 Median :0.0000
## Mean :0.3267 Mean :1.04 Mean :1.399 Mean :0.7294
## 3rd Qu.:1.0000 3rd Qu.:1.60 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :1.0000 Max. :6.20 Max. :2.000 Max. :4.0000
## thal target
## Min. :0.000 Min. :0.0000
## 1st Qu.:2.000 1st Qu.:0.0000
## Median :2.000 Median :1.0000
## Mean :2.314 Mean :0.5446
## 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :3.000 Max. :1.0000
In KNN, all variables must be of the same scale. Therefore, the variables need to be standardized first. Standardization can be done using the scale ()
function. In addition, for categorical variables, a dummy variable is made.
<- heart %>% select(c(sex, cp, thalach, exang, slope, ca, thal, target))
heart_new
# Scale the numerical predictor variables
"thalach"] <- scale(heart_new[, "thalach"])
heart_new[, head(heart_new)
## sex cp thalach exang slope ca thal target
## 1 1 3 0.01541728 0 0 0 1 1
## 2 1 2 1.63077374 0 0 0 2 1
## 3 0 1 0.97589950 0 2 0 2 1
## 4 1 1 1.23784920 0 2 0 2 1
## 5 0 0 0.58297496 1 2 0 2 1
## 6 1 0 -0.07189928 0 1 0 1 1
# Dummy code vars with 3 or more levels
<- as.data.frame(dummy.code(heart_new$cp))
cp names(cp) <- paste0("cp", names(cp))
<- as.data.frame(dummy.code(heart_new$slope))
slope names(slope) <- paste0("slope", names(slope))
<- as.data.frame(dummy.code(heart_new$ca))
ca names(ca) <- paste0("ca", names(ca))
<- as.data.frame(dummy.code(heart_new$thal))
thal names(thal) <- paste0("thal", names(thal))
<- heart_new %>% select(-c(cp, slope, ca, thal)) %>% cbind(c(cp, slope, ca, thal))
heart_new
names(heart_new)
## [1] "sex" "thalach" "exang" "target" "cp0" "cp2" "cp1"
## [8] "cp3" "slope2" "slope1" "slope0" "ca0" "ca1" "ca2"
## [15] "ca3" "ca4" "thal2" "thal3" "thal1" "thal0"
Split the dataset into train and test data.
# Data splitting
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(112)
<- sample(nrow(heart_new),
index nrow(heart_new)*0.8)
<- heart_new[index, ]
train <- heart_new[-index, ]
test
<- train %>% select(-target)
trainx <- test %>% select(-target)
testx <- train$target
train_y <- test$target %>% as.factor() test_y
Before making a prediction, determine the value of k.
<- round(sqrt(nrow(train)))
k
<- knn(train = trainx,
pred test = testx,
cl = train_y,
k = k)
<- confusionMatrix(data = pred,
eval_knn reference = test_y,
positive = "1")
<- data.frame(Accuracy = paste0(round(eval_knn$overall[1]*100, 2), "%"),
eval_knn Recall = paste0(round(eval_knn$byClass[1]*100, 2), "%"),
Precision = paste0(round(eval_knn$byClass[3]*100, 2), "%"))
eval_knn
## Accuracy Recall Precision
## 1 80.33% 93.33% 73.68%
This model can accurately predict 81.97% of positive and negative cases from the overall data. Of the total positive cases, this model is able to accurately predict as many as 70.97% of positive cases as shown by Recall value. In addition, from all cases that are predicted to be positive, 91.67% of these cases are predicted accurately by the model.
Patients with heart disease and patients who based on their characteristics have a higher chance for heart disease need to get immediate intervention to prevent worse things from happening, so it would be better if more patients are predicted to have heart disease and get treatment. Therefore, we need a model that has the ability to accurately detect as many positive cases as possible, shown by a high recall rate. Of the 2 algorithms that have been used, the logistic regression model has a higher recall rate, so the logistic regression model is a better model for predicting cases of heart disease in this dataset.