Introduction

In this project, we will try to predict heart disease for patients in a hospital. We will use two supervised learning models: logistic regression and k-Nearest Neighbor (KNN).

Libraries

We will use the libraries below for this project:

library(dplyr) # for data wrangling
library(class) # for the knn() function
library(caret) # for confusion matrix

Data Preparation

The dataset was obtained from Kaggle (click for the source of the dataset). The heart disease dataset consists of 303 rows and 14 columns.

# Read Data
heart <- read.csv("heart.csv")
# Inspect Data
head(heart)
##   ï..age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1     63   1  3      145  233   1       0     150     0     2.3     0  0    1
## 2     37   1  2      130  250   0       1     187     0     3.5     0  0    2
## 3     41   0  1      130  204   0       0     172     0     1.4     2  0    2
## 4     56   1  1      120  236   0       1     178     0     0.8     2  0    2
## 5     57   0  0      120  354   0       1     163     1     0.6     2  0    2
## 6     57   1  0      140  192   0       1     148     0     0.4     1  0    1
##   target
## 1      1
## 2      1
## 3      1
## 4      1
## 5      1
## 6      1

Below is the detailed information about each variable in the dataset:

  • ï..age: The person’s age in years (the ï.. prefix is an encoding artifact from the byte-order mark in the CSV file)
  • sex: The person’s sex (1 = male, 0 = female)
  • cp: The chest pain experienced (Value 0: typical angina, Value 1: atypical angina, Value 2: non-anginal pain, Value 3: asymptomatic)
  • trestbps: The person’s resting blood pressure (mm Hg on admission to the hospital)
  • chol: The person’s cholesterol measurement in mg/dl
  • fbs: The person’s fasting blood sugar (> 120 mg/dl, 1 = true; 0 = false)
  • restecg: Resting electrocardiographic measurement (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes’ criteria)
  • thalach: The person’s maximum heart rate achieved
  • exang: Exercise induced angina (1 = yes; 0 = no)
  • oldpeak: ST depression induced by exercise relative to rest
  • slope: The slope of the peak exercise ST segment (Value 0: upsloping, Value 1: flat, Value 2: downsloping)
  • ca: The number of major vessels (0-3 in the documentation, although this version of the data also contains a few rows coded 4)
  • thal: A blood disorder called thalassemia, coded 0-3 in this version of the data (the original UCI coding was 3 = normal, 6 = fixed defect, 7 = reversible defect)
  • target: Heart disease (0 = no, 1 = yes)

As we can see in the data description above, several variables are categorical: sex, cp, fbs, restecg, exang, slope, ca, thal, and target. We must convert them into factors.

# Data Wrangling
heart_clean <- heart %>% 
  mutate(sex = factor(sex, levels=c(0,1), labels = c("female", "male")),
         cp = factor(cp, levels=c(0,1,2,3), labels=c("typical angina", "atypical angina", "non-anginal pain", "asymptomatic")),
         fbs = factor(fbs, levels = c(0,1), labels = c("false", "true")),
         restecg = as.factor(restecg),
         exang = factor(exang, levels=c(0,1), labels = c("no", "yes")),
         slope = factor(slope, levels=c(0,1,2), labels = c("upsloping", "flat", "downsloping")),
         ca = as.factor(ca),
         thal = as.factor(thal),
         target = factor(target, levels = c(0,1), labels=c("no", "yes")))
glimpse(heart_clean)
## Rows: 303
## Columns: 14
## $ ï..age   <int> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58, 5~
## $ sex      <fct> male, male, female, male, female, male, female, male, male, ~
## $ cp       <fct> asymptomatic, non-anginal pain, atypical angina, atypical ang~
## $ trestbps <int> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130, 1~
## $ chol     <int> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275, 2~
## $ fbs      <fct> true, false, false, false, false, false, false, false, true, ~
## $ restecg  <fct> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1~
## $ thalach  <int> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139, 1~
## $ exang    <fct> no, no, no, no, yes, no, no, no, no, no, no, no, no, yes, no,~
## $ oldpeak  <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2, 0~
## $ slope    <fct> upsloping, upsloping, downsloping, downsloping, downsloping, ~
## $ ca       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0~
## $ thal     <fct> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3~
## $ target   <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, y~

Exploratory Data Analysis

We will start by checking the target class proportion.

# target class proportion
prop.table(table(heart_clean$target))
## 
##        no       yes 
## 0.4554455 0.5445545

As we can see, the two target classes are reasonably balanced (45.5% “no” vs 54.5% “yes”), so we can proceed without resampling.

After that, we will check the range of each variable.

# range for each variable
summary(heart_clean)
##      ï..age          sex                     cp         trestbps    
##  Min.   :29.00   female: 96   typical angina  :143   Min.   : 94.0  
##  1st Qu.:47.50   male  :207   atypical angina : 50   1st Qu.:120.0  
##  Median :55.00                non-anginal pain: 87   Median :130.0  
##  Mean   :54.37                asymptomatic    : 23   Mean   :131.6  
##  3rd Qu.:61.00                                       3rd Qu.:140.0  
##  Max.   :77.00                                       Max.   :200.0  
##       chol          fbs      restecg    thalach      exang        oldpeak    
##  Min.   :126.0   false:258   0:147   Min.   : 71.0   no :204   Min.   :0.00  
##  1st Qu.:211.0   true : 45   1:152   1st Qu.:133.5   yes: 99   1st Qu.:0.00  
##  Median :240.0               2:  4   Median :153.0             Median :0.80  
##  Mean   :246.3                       Mean   :149.6             Mean   :1.04  
##  3rd Qu.:274.5                       3rd Qu.:166.0             3rd Qu.:1.60  
##  Max.   :564.0                       Max.   :202.0             Max.   :6.20  
##          slope     ca      thal    target   
##  upsloping  : 21   0:175   0:  2   no :138  
##  flat       :140   1: 65   1: 18   yes:165  
##  downsloping:142   2: 38   2:166            
##                    3: 20   3:117            
##                    4:  5                    
## 

As we can see in the output above, each variable has a different range; for example, chol reaches 564 while oldpeak tops out at 6.2. We will need to rescale the features before fitting KNN.

Next, we will check the data for missing values.

anyNA(heart_clean)
## [1] FALSE

Since the output is FALSE, we can conclude that there are no missing values in this dataset.

Algorithms

Logistic Regression

Cross Validation

Before we build the models, we should split the dataset into train and test sets in order to perform model validation. We will split heart_clean into 80% train and 20% test using the sample() function, with set.seed(417) so the split is reproducible. The train set will be used for modelling and the test set to validate whether the model performs well on unseen data.

RNGkind(sample.kind = "Rounding") 
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(417)

# draw 80% of the row indices for training
index <- sample(nrow(heart_clean), nrow(heart_clean)*0.8) 
heart_train <- heart_clean[index,] 
heart_test <- heart_clean[-index,] 
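
A quick sanity check of the split sizes (we will refer back to these row counts later):

# number of rows in each split
nrow(heart_train) # 242
nrow(heart_test)  # 61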

Let’s take a look at the distribution of the target variable in the train data using prop.table(table(...)) to make sure that the train data also has a balanced proportion of the target classes.

# target class proportion for data train
prop.table(table(heart_train$target))
## 
##        no       yes 
## 0.4628099 0.5371901

Model Fitting

One technique for selecting predictor variables is stepwise regression. First, we create a logistic regression model with the glm() function to predict target from all other variables. Then we use the step() function with direction="backward", which removes predictors one at a time and keeps the model with the lowest AIC.

# full model using all predictors
model_log <- glm(target~., data = heart_train, family = "binomial")
# backward elimination guided by AIC
model_step <- step(model_log, direction = "backward", trace = 0)
summary(model_step)
## 
## Call:
## glm(formula = target ~ sex + cp + trestbps + thalach + exang + 
##     slope + ca + thal, family = "binomial", data = heart_train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.93413  -0.27363   0.07124   0.37508   2.11624  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -0.55786    4.54776  -0.123 0.902371    
## sexmale            -1.89609    0.64079  -2.959 0.003087 ** 
## cpatypical angina   1.25096    0.69705   1.795 0.072708 .  
## cpnon-anginal pain  2.40113    0.64668   3.713 0.000205 ***
## cpasymptomatic      2.12390    0.81513   2.606 0.009171 ** 
## trestbps           -0.02279    0.01233  -1.848 0.064534 .  
## thalach             0.02431    0.01324   1.836 0.066330 .  
## exangyes           -1.20845    0.53547  -2.257 0.024019 *  
## slopeflat          -0.52100    0.94624  -0.551 0.581908    
## slopedownsloping    1.58712    0.98648   1.609 0.107644    
## ca1                -2.91998    0.62354  -4.683 2.83e-06 ***
## ca2                -4.38372    0.94062  -4.660 3.15e-06 ***
## ca3                -2.60980    0.95458  -2.734 0.006257 ** 
## ca4                 1.66614    1.83606   0.907 0.364168    
## thal1               2.96444    3.95135   0.750 0.453113    
## thal2               2.10669    3.86836   0.545 0.586033    
## thal3               0.73313    3.86960   0.189 0.849733    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 334.14  on 241  degrees of freedom
## Residual deviance: 137.59  on 225  degrees of freedom
## AIC: 171.59
## 
## Number of Fisher Scoring iterations: 6
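
Because the coefficients above are on the log-odds scale, we can exponentiate them to read them as odds ratios; a quick sketch:

# convert the log-odds coefficients into odds ratios
exp(coef(model_step))

For example, exp(-1.89609) is roughly 0.15 for sexmale: holding the other predictors fixed, the model estimates the odds of heart disease for a male patient at about 0.15 times those for a female patient.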

Then, we will predict the probability of heart disease for each observation in heart_test and store it in a new column, pred_log. After that, we classify each observation: if pred_log is greater than 0.5, it is classified as “yes”; otherwise it is classified as “no”.

# Probability prediction
heart_test$pred_log <- predict(model_step, newdata = heart_test, type = "response")
# Classifying the class
heart_test$pred_Lablog <- ifelse(heart_test$pred_log > 0.5, "yes", "no") %>% 
                          as.factor()

Model Evaluation

In this step, we will validate whether or not our model does a good job of predicting unseen data. We will build a confusion matrix for the logistic regression results, comparing the actual labels from the target variable in heart_test with the predicted labels in pred_Lablog, using “yes” as the positive class.

There are several evaluation metrics we can use to interpret the model:

  • Sensitivity/Recall: the proportion of actual positives that are correctly identified as positive.
  • Pos Pred Value/Precision: the proportion of cases classified as positive that are actually positive.
  • Accuracy: the proportion of all cases that are correctly classified.
  • Specificity: the proportion of actual negatives that are correctly identified as negative.

# Model Evaluation
confusionMatrix(data = heart_test$pred_Lablog,
                reference = heart_test$target,
                positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  19   4
##        yes  7  31
##                                           
##                Accuracy : 0.8197          
##                  95% CI : (0.7002, 0.9064)
##     No Information Rate : 0.5738          
##     P-Value [Acc > NIR] : 4.229e-05       
##                                           
##                   Kappa : 0.6258          
##                                           
##  Mcnemar's Test P-Value : 0.5465          
##                                           
##             Sensitivity : 0.8857          
##             Specificity : 0.7308          
##          Pos Pred Value : 0.8158          
##          Neg Pred Value : 0.8261          
##              Prevalence : 0.5738          
##          Detection Rate : 0.5082          
##    Detection Prevalence : 0.6230          
##       Balanced Accuracy : 0.8082          
##                                           
##        'Positive' Class : yes             
## 

Observe from the confusion matrix that:

  • Out of 35 actual “yes”, we classified 31 of them correctly.
  • Out of 26 actual “no”, we classified 19 of them correctly.
  • Out of all 61 observations in our test data, we classified 50 of them correctly.
  • The accuracy is 81.97%
  • The sensitivity or recall is 88.57%
  • The pos pred value or precision is 81.58%
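
We can hand-check these numbers directly from the confusion matrix counts (TP = 31, FN = 4, TN = 19, FP = 7):

# hand-checking the metrics from the confusion matrix counts
TP <- 31; FN <- 4; TN <- 19; FP <- 7
TP / (TP + FN)                  # recall      = 31/35 = 0.8857
TP / (TP + FP)                  # precision   = 31/38 = 0.8158
(TP + TN) / (TP + TN + FP + FN) # accuracy    = 50/61 = 0.8197
TN / (TN + FP)                  # specificity = 19/26 = 0.7308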

KNN

Cross Validation

Now let’s explore the classification model using the k-Nearest Neighbor algorithm. k-Nearest Neighbor needs one more data pre-processing step: its distance calculation only works with numeric features, so for both the train and test sets we drop the categorical variables (keeping the target aside) and separate the predictors from the target.

# data train predictor
train_x <-  heart_train %>% 
            select_if(is.numeric)

# data train target
train_y <- heart_train %>% 
            select(target)

# data test predictor (drop pred_log, the probability column added by the logistic model)
test_x <-  heart_test %>% 
          select_if(is.numeric) %>% 
          select(-c(pred_log))

# data test target
test_y <- heart_test %>% 
          select(target)

Model Fitting

The distance calculation in KNN depends heavily on the measurement scale of the input features. Any variable whose range differs greatly from the others can dominate the distance and mislead our classifier, so let’s apply a normalization technique to rescale the features to a standard range of values.

To normalize the features in train_x, we will use the scale() function. For the testing data, we normalize each feature using the center and scale attributes obtained from train_x, so that the test set is transformed with the training statistics and no information from the test data leaks into the model.

# scaling
train_x <- scale(train_x)
test_x <- scale(test_x,
                center = attr(train_x, "scaled:center"), 
                scale = attr(train_x, "scaled:scale")) 

After normalizing our data, we need to find the right k for our kNN model. A common method for choosing k is to take the square root of the number of training rows.

# number of k
sqrt(nrow(train_x))
## [1] 15.55635
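
Since an even k can produce tied votes in a two-class problem, we pick an odd integer near this square root; a small sketch (k_opt is just an illustrative name):

# choose an odd k near sqrt(n) to avoid tied votes
k_raw <- sqrt(nrow(train_x))      # 15.55635
k_opt <- 2 * floor(k_raw / 2) + 1 # 15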

Using k = 15, the odd value chosen above, we will predict the class of each observation in test_x with the knn() function, using train_x as the training predictors and the target variable from train_y as the class labels.

# predict
heart_pred <- knn(train = train_x, 
                 test = test_x,
                 cl = train_y$target, 
                 k = 15)
head(heart_pred)
## [1] yes yes no  yes yes no 
## Levels: no yes

Model Evaluation

In this section, we will validate whether or not our model does a good job of predicting unseen data. We will build a confusion matrix from the KNN prediction results, comparing the actual labels from target in test_y with the predicted labels in heart_pred, using “yes” as the positive class.

# Model Evaluation 
confusionMatrix(data = heart_pred,
                reference = test_y$target,
                positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  15   7
##        yes 11  28
##                                           
##                Accuracy : 0.7049          
##                  95% CI : (0.5743, 0.8148)
##     No Information Rate : 0.5738          
##     P-Value [Acc > NIR] : 0.02452         
##                                           
##                   Kappa : 0.3845          
##                                           
##  Mcnemar's Test P-Value : 0.47950         
##                                           
##             Sensitivity : 0.8000          
##             Specificity : 0.5769          
##          Pos Pred Value : 0.7179          
##          Neg Pred Value : 0.6818          
##              Prevalence : 0.5738          
##          Detection Rate : 0.4590          
##    Detection Prevalence : 0.6393          
##       Balanced Accuracy : 0.6885          
##                                           
##        'Positive' Class : yes             
## 

Observe from the confusion matrix that:

  • Out of 35 actual “yes”, we classified 28 of them correctly.
  • Out of 26 actual “no”, we classified 15 of them correctly.
  • Out of all 61 observations in our test data, we classified 43 of them correctly.
  • The accuracy is 70.49%
  • The sensitivity or recall is 80%
  • The pos pred value or precision is 71.79%

Conclusion

If we look at this problem from a doctor’s perspective, we want to minimize the cases we fail to detect: a patient who has heart disease but is predicted as healthy may go untreated and worsen. False negatives are therefore more costly than false positives, so we focus on a higher recall (sensitivity), even at the cost of some precision. Comparing the two confusion matrices, logistic regression has the higher sensitivity (88.57% vs 80%), so logistic regression is the better model for predicting heart disease in this project.
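
If we wanted to push the recall even higher, one option is to lower the 0.5 probability cutoff used for the logistic model, accepting more false positives in exchange for fewer false negatives. A small sketch (the 0.3 cutoff is an arbitrary illustrative value, not a tuned one):

# a lower cutoff flags more patients as "yes"; 0.3 is illustrative only
pred_lab_03 <- ifelse(heart_test$pred_log > 0.3, "yes", "no") %>% 
               as.factor()
confusionMatrix(data = pred_lab_03,
                reference = heart_test$target,
                positive = "yes")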