In this project, we will try to predict heart disease for patients in a hospital. We will use two supervised learning models: logistic regression and k-Nearest Neighbors (KNN).
We will use the libraries below for this project:
library(dplyr)
library(class)
library(caret) # for confusion matrix
The dataset was obtained from Kaggle. The heart disease dataset consists of 14 columns and 303 rows.
# Read Data
heart <- read.csv("heart.csv")
# Inspect Data
head(heart)
## ï..age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1 63 1 3 145 233 1 0 150 0 2.3 0 0 1
## 2 37 1 2 130 250 0 1 187 0 3.5 0 0 2
## 3 41 0 1 130 204 0 0 172 0 1.4 2 0 2
## 4 56 1 1 120 236 0 1 178 0 0.8 2 0 2
## 5 57 0 0 120 354 0 1 163 1 0.6 2 0 2
## 6 57 1 0 140 192 0 1 148 0 0.4 1 0 1
## target
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
Below is the detailed information about each variable in the dataset:
ï..age: The person's age in years (the "ï.." prefix is a byte-order-mark artifact introduced by read.csv when the file starts with a UTF-8 BOM)
sex: The person's sex (1 = male, 0 = female)
cp: The type of chest pain experienced (0 = typical angina, 1 = atypical angina, 2 = non-anginal pain, 3 = asymptomatic)
trestbps: The person's resting blood pressure (mm Hg on admission to the hospital)
chol: The person's cholesterol measurement in mg/dl
fbs: The person's fasting blood sugar (> 120 mg/dl, 1 = true; 0 = false)
restecg: Resting electrocardiographic measurement (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria)
thalach: The person's maximum heart rate achieved
exang: Exercise-induced angina (1 = yes; 0 = no)
oldpeak: ST depression induced by exercise relative to rest
slope: The slope of the peak exercise ST segment (0 = upsloping, 1 = flat, 2 = downsloping)
ca: The number of major vessels (0-3; a few rows are coded 4, which marks a missing value in the original UCI data)
thal: A blood disorder called thalassemia (coded 0-3 in this Kaggle version; the original UCI coding was 3 = normal, 6 = fixed defect, 7 = reversible defect)
target: Heart disease (0 = no, 1 = yes)
As we can see in the data description above, several variables are categorical: sex, cp, fbs, restecg, exang, slope, ca, thal, and target. We must convert them into factors.
# Data Wrangling
heart_clean <- heart %>%
mutate(sex = factor(sex, levels=c(0,1), labels = c("female", "male")),
cp = factor(cp, levels=c(0,1,2,3), labels=c("typical angina", "atypical angina", "non-anginal pain", "asymptomatic")),
fbs = factor(fbs, levels = c(0,1), labels = c("false", "true")),
restecg = as.factor(restecg),
exang = factor(exang, levels=c(0,1), labels = c("no", "yes")),
slope = factor(slope, levels=c(0,1,2), labels = c("upsloping", "flat", "downsloping")),
ca = as.factor(ca),
thal = as.factor(thal),
target = factor(target, levels = c(0,1), labels=c("no", "yes")))
glimpse(heart_clean)
## Rows: 303
## Columns: 14
## $ ï..age <int> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58, 5~
## $ sex <fct> male, male, female, male, female, male, female, male, mal~
## $ cp <fct> asymptomatic, non-anginal pain, atypical angina, atypical ang~
## $ trestbps <int> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130, 1~
## $ chol <int> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275, 2~
## $ fbs <fct> true, false, false, false, false, false, false, false, true, ~
## $ restecg <fct> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1~
## $ thalach <int> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139, 1~
## $ exang <fct> no, no, no, no, yes, no, no, no, no, no, no, no, no, yes, no,~
## $ oldpeak <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2, 0~
## $ slope <fct> upsloping, upsloping, downsloping, downsloping, downsloping, ~
## $ ca <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0~
## $ thal <fct> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3~
## $ target <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, y~
We will try to check the target class proportion.
# target class proportion
prop.table(table(heart_clean$target))
##
## no yes
## 0.4554455 0.5445545
As we can see, the target class proportion is reasonably balanced between the two classes (about 46% "no" and 54% "yes"), so no resampling is needed.
After that, we will check the range of each variable.
# range for each variable
summary(heart_clean)
## ï..age sex cp trestbps
## Min. :29.00 female: 96 typical angina :143 Min. : 94.0
## 1st Qu.:47.50 male :207 atypical angina : 50 1st Qu.:120.0
## Median :55.00 non-anginal pain: 87 Median :130.0
## Mean :54.37 asymptomatic : 23 Mean :131.6
## 3rd Qu.:61.00 3rd Qu.:140.0
## Max. :77.00 Max. :200.0
## chol fbs restecg thalach exang oldpeak
## Min. :126.0 false:258 0:147 Min. : 71.0 no :204 Min. :0.00
## 1st Qu.:211.0 true : 45 1:152 1st Qu.:133.5 yes: 99 1st Qu.:0.00
## Median :240.0 2: 4 Median :153.0 Median :0.80
## Mean :246.3 Mean :149.6 Mean :1.04
## 3rd Qu.:274.5 3rd Qu.:166.0 3rd Qu.:1.60
## Max. :564.0 Max. :202.0 Max. :6.20
## slope ca thal target
## upsloping : 21 0:175 0: 2 no :138
## flat :140 1: 65 1: 18 yes:165
## downsloping:142 2: 38 2:166
## 3: 20 3:117
## 4: 5
##
As we can see in the output above, the variables have very different ranges (chol reaches 564 while oldpeak stays below 7); we will need to rescale them later for the distance-based kNN model.
Next, we will check the data for missing values.
anyNA(heart_clean)
## [1] FALSE
Since the output is FALSE, we can conclude that there are no missing values in this dataset.
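If anyNA() had returned TRUE, a minimal follow-up check (not part of the original workflow) would be to count the missing values per column:
# Count the missing values in each column (all zeros for this dataset)
colSums(is.na(heart_clean))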
Before we build the models, we should split the dataset into train and test sets in order to perform model validation. We will split the heart_clean dataset into 80% train and 20% test using the sample() function, and call set.seed() with the seed 417 so the split is reproducible. The train set will be used for modelling, and the test set will be used to check whether the model generalizes to unseen data.
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(417)
index <- sample(nrow(heart_clean), nrow(heart_clean)*0.8)
heart_train <- heart_clean[index,]
heart_test <- heart_clean[-index,]
Let's take a look at the distribution of the target variable in the train data using prop.table(table(data)) to make sure that the train data also has a balanced proportion of the target classes.
# target class proportion for data train
prop.table(table(heart_train$target))
##
## no yes
## 0.4628099 0.5371901
One technique for selecting predictor variables is the stepwise regression algorithm. First, we will create a logistic regression model with the glm() function to predict target using all variables. Then, we will use the step() function with direction="backward" to perform backward elimination, which repeatedly drops the predictor whose removal lowers the model's AIC the most and stops when no removal improves the AIC.
model_log <- glm(target~., data = heart_train, family = "binomial")
model_step <- step(model_log, direction = "backward", trace = 0)
summary(model_step)
##
## Call:
## glm(formula = target ~ sex + cp + trestbps + thalach + exang +
## slope + ca + thal, family = "binomial", data = heart_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.93413 -0.27363 0.07124 0.37508 2.11624
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.55786 4.54776 -0.123 0.902371
## sexmale -1.89609 0.64079 -2.959 0.003087 **
## cpatypical angina 1.25096 0.69705 1.795 0.072708 .
## cpnon-anginal pain 2.40113 0.64668 3.713 0.000205 ***
## cpasymptomatic 2.12390 0.81513 2.606 0.009171 **
## trestbps -0.02279 0.01233 -1.848 0.064534 .
## thalach 0.02431 0.01324 1.836 0.066330 .
## exangyes -1.20845 0.53547 -2.257 0.024019 *
## slopeflat -0.52100 0.94624 -0.551 0.581908
## slopedownsloping 1.58712 0.98648 1.609 0.107644
## ca1 -2.91998 0.62354 -4.683 2.83e-06 ***
## ca2 -4.38372 0.94062 -4.660 3.15e-06 ***
## ca3 -2.60980 0.95458 -2.734 0.006257 **
## ca4 1.66614 1.83606 0.907 0.364168
## thal1 2.96444 3.95135 0.750 0.453113
## thal2 2.10669 3.86836 0.545 0.586033
## thal3 0.73313 3.86960 0.189 0.849733
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 334.14 on 241 degrees of freedom
## Residual deviance: 137.59 on 225 degrees of freedom
## AIC: 171.59
##
## Number of Fisher Scoring iterations: 6
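The coefficients above are on the log-odds scale. An optional way to read them (a brief sketch, not part of the original workflow) is to exponentiate them into odds ratios:
# Convert the log-odds coefficients into odds ratios
exp(coef(model_step))
# For example, exp(-1.89609) is about 0.15: holding the other predictors fixed,
# being male multiplies the odds of heart disease by roughly 0.15 in this model.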
Then, we will predict the probability of heart disease for each row of heart_test and store it in a new column of the heart_test dataset. After that, we classify the prediction pred_log: if pred_log is greater than 0.5, it is classified as "yes"; otherwise, it is classified as "no".
# Probability prediction
heart_test$pred_log <- predict(model_step, newdata = heart_test, type = "response")
# Classifying the class
heart_test$pred_Lablog <- ifelse(heart_test$pred_log > 0.5, "yes", "no") %>%
as.factor()
In this step, we will validate whether or not our model does a good job of predicting unseen data. We will build a confusion matrix for the logistic regression results based on the actual labels from the target variable in heart_test and the predicted labels from pred_Lablog in heart_test, using "yes" as the positive class.
There are several evaluation metrics we can use to interpret the model:
# Model Evaluation
confusionMatrix(data = heart_test$pred_Lablog,
reference = heart_test$target,
positive = "yes")## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 19 4
## yes 7 31
##
## Accuracy : 0.8197
## 95% CI : (0.7002, 0.9064)
## No Information Rate : 0.5738
## P-Value [Acc > NIR] : 4.229e-05
##
## Kappa : 0.6258
##
## Mcnemar's Test P-Value : 0.5465
##
## Sensitivity : 0.8857
## Specificity : 0.7308
## Pos Pred Value : 0.8158
## Neg Pred Value : 0.8261
## Prevalence : 0.5738
## Detection Rate : 0.5082
## Detection Prevalence : 0.6230
## Balanced Accuracy : 0.8082
##
## 'Positive' Class : yes
##
Observe from the confusion matrix that the logistic regression model achieves an accuracy of 81.97%, a sensitivity (recall) of 88.57%, a specificity of 73.08%, and a precision (positive predictive value) of 81.58% on the test data.
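To make these numbers concrete, here is a minimal sketch (not part of the original workflow) that recomputes the same metrics by hand from the four cells of the confusion matrix:
# Cells taken from the confusion matrix above
TP <- 31; TN <- 19; FP <- 7; FN <- 4
(TP + TN) / (TP + TN + FP + FN) # accuracy: 0.8197
TP / (TP + FN) # sensitivity/recall: 0.8857
TN / (TN + FP) # specificity: 0.7308
TP / (TP + FP) # precision: 0.8158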
Now let's explore a classification model using the k-Nearest Neighbor algorithm. For kNN, we need to perform one more data pre-processing step: for both the train and test sets, we drop the categorical variables, keeping only the numeric predictors, and we separate the predictors from the target.
# data train predictor
train_x <- heart_train %>%
select_if(is.numeric)
# data train target
train_y <- heart_train %>%
select(target)
# data test predictor
test_x <- heart_test %>%
select_if(is.numeric) %>%
select(-c(pred_log))
# data test target
test_y <- heart_test %>%
select(target)
The distance calculation in kNN depends heavily on the measurement scale of the input features. Any variable whose range is much wider than the others can dominate the distance and mislead the classifier, so let's apply normalization to rescale the features to a common range of values.
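To illustrate this scale dependence (a quick hypothetical check, not part of the original workflow), compare the distance between the first two patients before and after scaling; the raw distance is driven almost entirely by chol, the largest-scale feature:
# Distance between the first two rows, raw vs scaled
dist(train_x[1:2, ]) # dominated by chol (hundreds of mg/dl)
dist(scale(train_x)[1:2, ]) # after scaling, every feature contributes comparably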
To normalize the features in train_x, we will use the scale() function. For the testing data, we will normalize each feature using the center and scale attributes obtained from the scaled train_x data, so the test set gets exactly the same transformation as the train set.
# scaling
train_x <- scale(train_x)
test_x <- scale(test_x,
center = attr(train_x, "scaled:center"),
scale = attr(train_x, "scaled:scale"))
After normalizing our data, we need to find a suitable k for our kNN model. A common rule of thumb is to take the square root of the number of rows in the training data.
# number of k
sqrt(nrow(train_x))
## [1] 15.55635
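Rounding down gives k = 15, which is also odd; an odd k is preferred for a two-class problem because it prevents tied votes. One way to encode this rule is a small helper (choose_k is a name introduced here for illustration):
# Hypothetical helper: take floor(sqrt(n)) and force it to be odd
choose_k <- function(n) {
  k <- floor(sqrt(n))
  if (k %% 2 == 0) k + 1 else k
}
choose_k(nrow(train_x)) # 15 for the 242-row train set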
Using this k (we take 15), we will predict the class of each test observation with the knn() function, passing train_x as the training predictors and the target variable from train_y as the class labels.
# predict
heart_pred <- knn(train = train_x,
test = test_x,
cl = train_y$target,
k = 15)
head(heart_pred)
## [1] yes yes no yes yes no
## Levels: no yes
In this section, we will validate whether or not our kNN model does a good job of predicting unseen data. We will build a confusion matrix for the kNN predictions based on the actual labels from target in test_y and the predicted labels in heart_pred, using "yes" as the positive class.
# Model Evaluation
confusionMatrix(data = heart_pred,
reference = test_y$target,
positive = "yes")## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 15 7
## yes 11 28
##
## Accuracy : 0.7049
## 95% CI : (0.5743, 0.8148)
## No Information Rate : 0.5738
## P-Value [Acc > NIR] : 0.02452
##
## Kappa : 0.3845
##
## Mcnemar's Test P-Value : 0.47950
##
## Sensitivity : 0.8000
## Specificity : 0.5769
## Pos Pred Value : 0.7179
## Neg Pred Value : 0.6818
## Prevalence : 0.5738
## Detection Rate : 0.4590
## Detection Prevalence : 0.6393
## Balanced Accuracy : 0.6885
##
## 'Positive' Class : yes
##
Observe from the confusion matrix that the kNN model achieves an accuracy of 70.49%, a sensitivity (recall) of 80.00%, a specificity of 57.69%, and a precision of 71.79% on the test data, weaker than logistic regression on every metric.
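For a side-by-side comparison, here is a minimal sketch; it assumes we had stored the two confusionMatrix() results in objects named log_cm and knn_cm (names introduced here for illustration):
# Compare both models on the metrics we care about
# (log_cm and knn_cm are hypothetical names for the stored confusionMatrix() results)
metrics <- c("Sensitivity", "Specificity", "Pos Pred Value")
data.frame(logistic = log_cm$byClass[metrics],
           knn = knn_cm$byClass[metrics])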
From a doctor's perspective, we want to minimize the cases we fail to detect: a patient who actually has heart disease but is predicted as healthy (a false negative) would go untreated, and the disease would worsen. So we focus on a higher recall (sensitivity) value, even if it lowers the precision. Comparing the two confusion matrices, logistic regression has the higher sensitivity (88.57% vs 80.00%), so logistic regression is the best model in this project for predicting heart disease.
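If recall is the priority, one further option (a sketch under our own assumptions, not part of the original analysis) is to lower the 0.5 classification threshold of the logistic model; this flags more patients as "yes", raising recall at the cost of precision. The 0.4 cutoff below is an arbitrary illustration, not a tuned value:
# Re-classify with a lower, illustrative cutoff to favor recall
pred_recall <- as.factor(ifelse(heart_test$pred_log > 0.4, "yes", "no"))
confusionMatrix(data = pred_recall,
                reference = heart_test$target,
                positive = "yes")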