Hi, welcome to my Learning by Building for Classification. In this project, I will analyze heart disease using a dataset I found on kaggle.com

1 Load Library

Here are the libraries I will use in this LBB

library(tidyverse) # data wrangling
library(gtools)    # logit/inv.logit helpers
library(gmodels)   # CrossTable()
library(ggplot2)   # visualization
library(class)     # knn()
library(scales)    # axis/label formatting
library(GGally)    # ggcorr()
library(caret)     # confusionMatrix()
library(tibble)    # tibble data frames


options(scipen = 9999)

2 Load Dataset

Let’s just load the data for this project

heart <- read.csv("data_input/heart.csv")
colnames(heart)[1] = "age"
heart

The dataset has 14 columns, which are:

- age: age in years
- sex: sex (0 = female, 1 = male)
- cp: chest pain type (0–3)
- trestbps: resting blood pressure (mm Hg)
- chol: serum cholesterol (mg/dl)
- fbs: fasting blood sugar > 120 mg/dl (0 = false, 1 = true)
- restecg: resting electrocardiographic results (0–2)
- thalach: maximum heart rate achieved
- exang: exercise-induced angina (0 = no, 1 = yes)
- oldpeak: ST depression induced by exercise relative to rest
- slope: slope of the peak exercise ST segment
- ca: number of major vessels colored by fluoroscopy
- thal: thalassemia test result
- target: 0 = Healthy, 1 = Not Healthy

In this project, I will use the ggplot theme I’ve already created

white_theme <- theme(
  panel.background = element_rect(fill="white"),
  plot.background = element_rect(fill="white"),
  panel.grid.minor.x = element_blank(),
  panel.grid.major.x = element_blank(),
  panel.grid.minor.y = element_blank(),
  text = element_text(color="black"),
  axis.text = element_text(color="black"),
  strip.background =element_rect(fill="snow3"),
  strip.text = element_text(colour = 'black')
  )

3 Exploratory Data Analysis (EDA)

Before we continue, let’s check the correlation between our predictors and the target, so we can decide which predictors to use

ggcorr(heart, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

The predictors with the strongest correlation to our target are slope, thalach, restecg, and cp. We can use these when we build a glm() model

Now, let’s check whether our columns already have the proper data types

glimpse(heart)
## Rows: 303
## Columns: 14
## $ age      <int> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58...
## $ sex      <int> 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0...
## $ cp       <int> 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3...
## $ trestbps <int> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130...
## $ chol     <int> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275...
## $ fbs      <int> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ restecg  <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1...
## $ thalach  <int> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139...
## $ exang    <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0...
## $ oldpeak  <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2...
## $ slope    <int> 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2...
## $ ca       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2...
## $ thal     <int> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ target   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...

We need to change some data types before analyzing the data. I will convert target into a factor labeled Healthy and Not Healthy, and convert the binary predictors (sex, fbs, exang) into factors as well.

heart <- heart %>% 
  mutate(sex = factor(sex, levels = c(0,1), labels = c("Female", "Male")),
         fbs = factor(fbs, levels = c(0,1), labels = c("False", "True")),
         exang = factor(exang, levels = c(0,1), labels = c("No", "Yes")),
         target = factor(target, levels = c(0,1), 
                        labels = c("Healthy", "Not Healthy")))
glimpse(heart)
## Rows: 303
## Columns: 14
## $ age      <int> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58...
## $ sex      <fct> Male, Male, Female, Male, Female, Male, Female, Male, Male...
## $ cp       <int> 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3...
## $ trestbps <int> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130...
## $ chol     <int> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275...
## $ fbs      <fct> True, False, False, False, False, False, False, False, Tru...
## $ restecg  <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1...
## $ thalach  <int> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139...
## $ exang    <fct> No, No, No, No, Yes, No, No, No, No, No, No, No, No, Yes, ...
## $ oldpeak  <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2...
## $ slope    <int> 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2...
## $ ca       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2...
## $ thal     <int> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ target   <fct> Not Healthy, Not Healthy, Not Healthy, Not Healthy, Not He...

Now, let’s check whether there are missing values in our dataset

colSums(is.na(heart))
##      age      sex       cp trestbps     chol      fbs  restecg  thalach 
##        0        0        0        0        0        0        0        0 
##    exang  oldpeak    slope       ca     thal   target 
##        0        0        0        0        0        0

There are no missing values in our data

3.1 Pre-Processing Data

Before we continue, let’s check the class proportion of our target to determine whether the data is balanced

prop.table(table(heart$target))
## 
##     Healthy Not Healthy 
##   0.4554455   0.5445545

The target classes are reasonably balanced (about 46:54)

Now, let’s see the summary of our data

summary(heart)
##       age            sex            cp           trestbps          chol      
##  Min.   :29.00   Female: 96   Min.   :0.000   Min.   : 94.0   Min.   :126.0  
##  1st Qu.:47.50   Male  :207   1st Qu.:0.000   1st Qu.:120.0   1st Qu.:211.0  
##  Median :55.00                Median :1.000   Median :130.0   Median :240.0  
##  Mean   :54.37                Mean   :0.967   Mean   :131.6   Mean   :246.3  
##  3rd Qu.:61.00                3rd Qu.:2.000   3rd Qu.:140.0   3rd Qu.:274.5  
##  Max.   :77.00                Max.   :3.000   Max.   :200.0   Max.   :564.0  
##     fbs         restecg          thalach      exang        oldpeak    
##  False:258   Min.   :0.0000   Min.   : 71.0   No :204   Min.   :0.00  
##  True : 45   1st Qu.:0.0000   1st Qu.:133.5   Yes: 99   1st Qu.:0.00  
##              Median :1.0000   Median :153.0             Median :0.80  
##              Mean   :0.5281   Mean   :149.6             Mean   :1.04  
##              3rd Qu.:1.0000   3rd Qu.:166.0             3rd Qu.:1.60  
##              Max.   :2.0000   Max.   :202.0             Max.   :6.20  
##      slope             ca              thal               target   
##  Min.   :0.000   Min.   :0.0000   Min.   :0.000   Healthy    :138  
##  1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:2.000   Not Healthy:165  
##  Median :1.000   Median :0.0000   Median :2.000                    
##  Mean   :1.399   Mean   :0.7294   Mean   :2.314                    
##  3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:3.000                    
##  Max.   :2.000   Max.   :4.0000   Max.   :3.000

From the summary above, the youngest patient is 29 years old and the oldest is 77; there are 96 female and 207 male patients. cp takes values from 0 to 3, restecg takes values 0, 1, and 2, while trestbps, chol, and thalach cover a wide range of values.

3.2 Cross Validation

Let’s split our data into train and test sets. I will split it with an 85% and 15% proportion

RNGkind(sample.kind = "Rounding")
set.seed(100)
index <- sample(nrow(heart), nrow(heart)*0.85)
train <- heart[index,]
test <- heart[-index,]

Now, let’s check the proportion for target column in our train and test data

prop.table(table(train$target))
## 
##     Healthy Not Healthy 
##   0.4552529   0.5447471
prop.table(table(test$target))
## 
##     Healthy Not Healthy 
##   0.4565217   0.5434783

The target proportions in our train and test data are still balanced
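
As a side note, caret (already loaded) also provides createDataPartition(), which does a stratified split and keeps the class proportions similar in both sets by construction. A minimal sketch, assuming the same heart data and seed (train_strat and test_strat are hypothetical names, not used later):

# Stratified 85/15 split: sampling is done within each level of target
set.seed(100)
idx <- createDataPartition(heart$target, p = 0.85, list = FALSE)
train_strat <- heart[idx, ]
test_strat  <- heart[-idx, ]
prop.table(table(train_strat$target))  # proportions close to the full data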

4 Modelling

4.1 Logistic Regression

Now, let’s compare a model that uses all predictors with one that uses only the strongly correlated predictors

glm(target ~ ., train ,family = "binomial")
## 
## Call:  glm(formula = target ~ ., family = "binomial", data = train)
## 
## Coefficients:
## (Intercept)          age      sexMale           cp     trestbps         chol  
##    4.814425    -0.004785    -1.750236     0.939859    -0.021192    -0.004124  
##     fbsTrue      restecg      thalach     exangYes      oldpeak        slope  
##    0.050349     0.276302     0.014837    -1.059257    -0.706850     0.643376  
##          ca         thal  
##   -0.758366    -0.891504  
## 
## Degrees of Freedom: 256 Total (i.e. Null);  243 Residual
## Null Deviance:       354.2 
## Residual Deviance: 175.5     AIC: 203.5

Using all predictors, we get an AIC of 203.5

Now, let’s check the model using slope, thalach, restecg and cp

step(glm(target ~ slope + thalach + restecg + cp, train ,family = "binomial"), direction = "backward")
## Start:  AIC=265.7
## target ~ slope + thalach + restecg + cp
## 
##           Df Deviance    AIC
## - restecg  1   256.89 264.89
## <none>         255.70 265.70
## - thalach  1   265.33 273.33
## - slope    1   268.92 276.92
## - cp       1   292.24 300.24
## 
## Step:  AIC=264.89
## target ~ slope + thalach + cp
## 
##           Df Deviance    AIC
## <none>         256.89 264.89
## - thalach  1   266.41 272.41
## - slope    1   270.19 276.19
## - cp       1   294.38 300.38
## 
## Call:  glm(formula = target ~ slope + thalach + cp, family = "binomial", 
##     data = train)
## 
## Coefficients:
## (Intercept)        slope      thalach           cp  
##    -5.59122      1.02332      0.02338      0.94153  
## 
## Degrees of Freedom: 256 Total (i.e. Null);  253 Residual
## Null Deviance:       354.2 
## Residual Deviance: 256.9     AIC: 264.9

The AIC we get here is higher than the AIC from the model with all predictors. So, we will just use all predictors

model1 <- glm(target ~ ., train ,family = "binomial")
summary(model1)
## 
## Call:
## glm(formula = target ~ ., family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6436  -0.3717   0.1510   0.5962   2.5437  
## 
## Coefficients:
##              Estimate Std. Error z value   Pr(>|z|)    
## (Intercept)  4.814425   2.860851   1.683   0.092401 .  
## age         -0.004785   0.025127  -0.190   0.848966    
## sexMale     -1.750236   0.518216  -3.377   0.000732 ***
## cp           0.939859   0.206652   4.548 0.00000542 ***
## trestbps    -0.021192   0.011537  -1.837   0.066242 .  
## chol        -0.004124   0.004084  -1.010   0.312601    
## fbsTrue      0.050349   0.598407   0.084   0.932947    
## restecg      0.276302   0.380824   0.726   0.468122    
## thalach      0.014837   0.011216   1.323   0.185885    
## exangYes    -1.059257   0.459936  -2.303   0.021276 *  
## oldpeak     -0.706850   0.260123  -2.717   0.006580 ** 
## slope        0.643376   0.430717   1.494   0.135246    
## ca          -0.758366   0.223014  -3.401   0.000673 ***
## thal        -0.891504   0.318591  -2.798   0.005138 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 354.22  on 256  degrees of freedom
## Residual deviance: 175.54  on 243  degrees of freedom
## AIC: 203.54
## 
## Number of Fisher Scoring iterations: 6

Now, let’s find out whether backward stepwise selection gives us a smaller AIC

model2 <- step(object = glm(target ~ ., train, family = "binomial"), direction = "backward")
## Start:  AIC=203.54
## target ~ age + sex + cp + trestbps + chol + fbs + restecg + thalach + 
##     exang + oldpeak + slope + ca + thal
## 
##            Df Deviance    AIC
## - fbs       1   175.54 201.54
## - age       1   175.57 201.57
## - restecg   1   176.06 202.06
## - chol      1   176.53 202.53
## - thalach   1   177.34 203.34
## <none>          175.54 203.54
## - slope     1   177.75 203.75
## - trestbps  1   179.03 205.03
## - exang     1   180.85 206.85
## - thal      1   183.59 209.59
## - oldpeak   1   183.60 209.60
## - ca        1   187.77 213.77
## - sex       1   188.78 214.78
## - cp        1   199.76 225.76
## 
## Step:  AIC=201.54
## target ~ age + sex + cp + trestbps + chol + restecg + thalach + 
##     exang + oldpeak + slope + ca + thal
## 
##            Df Deviance    AIC
## - age       1   175.58 199.58
## - restecg   1   176.08 200.08
## - chol      1   176.54 200.54
## - thalach   1   177.36 201.36
## <none>          175.54 201.54
## - slope     1   177.76 201.76
## - trestbps  1   179.06 203.06
## - exang     1   180.85 204.85
## - oldpeak   1   183.70 207.70
## - thal      1   183.82 207.82
## - ca        1   187.89 211.89
## - sex       1   188.90 212.90
## - cp        1   200.38 224.38
## 
## Step:  AIC=199.58
## target ~ sex + cp + trestbps + chol + restecg + thalach + exang + 
##     oldpeak + slope + ca + thal
## 
##            Df Deviance    AIC
## - restecg   1   176.14 198.14
## - chol      1   176.65 198.65
## <none>          175.58 199.58
## - slope     1   177.82 199.82
## - thalach   1   178.01 200.01
## - trestbps  1   179.45 201.45
## - exang     1   180.85 202.85
## - oldpeak   1   183.71 205.71
## - thal      1   183.85 205.85
## - ca        1   188.70 210.70
## - sex       1   188.95 210.95
## - cp        1   200.41 222.41
## 
## Step:  AIC=198.14
## target ~ sex + cp + trestbps + chol + thalach + exang + oldpeak + 
##     slope + ca + thal
## 
##            Df Deviance    AIC
## - chol      1   177.52 197.52
## <none>          176.14 198.14
## - slope     1   178.40 198.40
## - thalach   1   178.67 198.67
## - trestbps  1   180.16 200.16
## - exang     1   181.33 201.33
## - oldpeak   1   184.15 204.15
## - thal      1   184.17 204.17
## - ca        1   189.16 209.16
## - sex       1   189.80 209.80
## - cp        1   200.75 220.75
## 
## Step:  AIC=197.52
## target ~ sex + cp + trestbps + thalach + exang + oldpeak + slope + 
##     ca + thal
## 
##            Df Deviance    AIC
## <none>          177.52 197.52
## - thalach   1   179.67 197.67
## - slope     1   179.67 197.67
## - trestbps  1   181.56 199.56
## - exang     1   182.79 200.79
## - oldpeak   1   186.52 204.52
## - thal      1   186.57 204.57
## - sex       1   189.82 207.82
## - ca        1   190.58 208.58
## - cp        1   203.07 221.07
summary(model2)
## 
## Call:
## glm(formula = target ~ sex + cp + trestbps + thalach + exang + 
##     oldpeak + slope + ca + thal, family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6138  -0.3894   0.1584   0.5505   2.5092  
## 
## Coefficients:
##             Estimate Std. Error z value   Pr(>|z|)    
## (Intercept)  3.82212    2.14716   1.780   0.075062 .  
## sexMale     -1.58325    0.48331  -3.276   0.001053 ** 
## cp           0.94813    0.20278   4.676 0.00000293 ***
## trestbps    -0.02177    0.01099  -1.981   0.047630 *  
## thalach      0.01444    0.01002   1.442   0.149344    
## exangYes    -1.03926    0.45284  -2.295   0.021735 *  
## oldpeak     -0.73241    0.25702  -2.850   0.004377 ** 
## slope        0.62350    0.42402   1.470   0.141436    
## ca          -0.74482    0.21581  -3.451   0.000558 ***
## thal        -0.92191    0.30936  -2.980   0.002882 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 354.22  on 256  degrees of freedom
## Residual deviance: 177.52  on 247  degrees of freedom
## AIC: 197.52
## 
## Number of Fisher Scoring iterations: 6

The AIC of model2 is smaller than that of model1. With the stepwise model, we get an AIC of 197.52 using the combination of sex, cp, trestbps, thalach, exang, oldpeak, slope, ca, and thal
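
As a quick sanity check (a minimal sketch, assuming model1 and model2 defined above), we can also compare the two AIC values side by side:

AIC(model1, model2)  # lower AIC indicates a better trade-off between fit and complexity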

Now, we will use model2 to make predictions on the test data

test$pred <- predict(model2, type = "response", newdata = test)

Now, let’s see the distribution of the predicted probabilities on the test data

ggplot(test, aes(x = pred)) +
  geom_density(lwd = 1) +
  labs(title = "Distribution of Probability of Data Prediction", 
       x = "Probability",
       y = "Density") +
  theme(plot.title = element_text(hjust = 0.5)) +
  white_theme

From the graph above, we can see that the predicted probabilities lean toward the Not Healthy class

test$pred <- factor(ifelse(test$pred > 0.5, "Not Healthy","Healthy"))
test[, c("pred", "target")]

From the table above, whenever the predicted probability is greater than 0.5, the patient is classified as Not Healthy
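
As a quick summary of that table (an optional sketch), we can cross-tabulate the predictions against the actual labels:

table(Prediction = test$pred, Actual = test$target)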

Since we want to predict Not Healthy patients, we must set Not Healthy as the positive class in confusionMatrix()

logreg_cm <- confusionMatrix(test$pred, test$target, positive = "Not Healthy")
logreg_cm
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    Healthy Not Healthy
##   Healthy          17           3
##   Not Healthy       4          22
##                                           
##                Accuracy : 0.8478          
##                  95% CI : (0.7113, 0.9366)
##     No Information Rate : 0.5435          
##     P-Value [Acc > NIR] : 0.000013        
##                                           
##                   Kappa : 0.6922          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8800          
##             Specificity : 0.8095          
##          Pos Pred Value : 0.8462          
##          Neg Pred Value : 0.8500          
##              Prevalence : 0.5435          
##          Detection Rate : 0.4783          
##    Detection Prevalence : 0.5652          
##       Balanced Accuracy : 0.8448          
##                                           
##        'Positive' Class : Not Healthy     
## 

We have just finished our Logistic Regression model. Now, let’s try K-Nearest Neighbor

4.2 K-Nearest Neighbor

Before we continue to k-NN, let’s scale the numeric columns

# Identify the numeric columns in the train and test data
numerictrain <- sapply(X = train, FUN = is.numeric)
numerictest <- sapply(X = test, FUN = is.numeric)

# Scale the numeric predictors of the train data (z-score standardization)
train_x <- scale(train[,numerictrain])
train_y <- train$target

# Scale the test data using the center and scale of the TRAIN data,
# so no information from the test set leaks into the preprocessing
test_x <- scale(test[,numerictest], 
                center = attr(train_x, "scaled:center"),
                scale = attr(train_x, "scaled:scale"))
test_y <- test$target

Now, let’s determine the value of k from the number of rows of train_x

round(sqrt(nrow(train_x)),0)
## [1] 16

Since we have 2 target classes, it is better to use an odd number of neighbors to avoid ties, so let’s use 15. Now, let’s make predictions with k-NN

knn <- knn(train = train_x,
    test = test_x,
    cl = train_y,
    k = 15)
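
Optionally, we can sanity-check this choice of k. A minimal sketch (assuming train_x, test_x, train_y, and test_y from above) that tries a few odd values of k and compares the test accuracy:

# Try several odd values of k and report the accuracy on the test set
for (k in c(5, 9, 13, 15, 17, 21)) {
  pred_k <- class::knn(train = train_x, test = test_x, cl = train_y, k = k)
  cat("k =", k, "accuracy =", round(mean(pred_k == test_y), 3), "\n")
}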

Just like with model2, it’s time to evaluate the predictions with confusionMatrix()

knn_cm <- confusionMatrix(knn, test$target, positive = "Not Healthy")
knn_cm
## Confusion Matrix and Statistics
## 
##              Reference
## Prediction    Healthy Not Healthy
##   Healthy          14           2
##   Not Healthy       7          23
##                                           
##                Accuracy : 0.8043          
##                  95% CI : (0.6609, 0.9064)
##     No Information Rate : 0.5435          
##     P-Value [Acc > NIR] : 0.0002066       
##                                           
##                   Kappa : 0.5981          
##                                           
##  Mcnemar's Test P-Value : 0.1824224       
##                                           
##             Sensitivity : 0.9200          
##             Specificity : 0.6667          
##          Pos Pred Value : 0.7667          
##          Neg Pred Value : 0.8750          
##              Prevalence : 0.5435          
##          Detection Rate : 0.5000          
##    Detection Prevalence : 0.6522          
##       Balanced Accuracy : 0.7933          
##                                           
##        'Positive' Class : Not Healthy     
## 

5 Model Evaluation of Logistic Regression and k-NN

eval_log <- data.frame(Accuracy = logreg_cm$overall[1],
                       Recall = logreg_cm$byClass[1],
                       Specificity = logreg_cm$byClass[2],
                       Precision = logreg_cm$byClass[3])


eval_knn <- data.frame(Accuracy = knn_cm$overall[1],
                       Recall = knn_cm$byClass[1],
                       Specificity = knn_cm$byClass[2],
                       Precision = knn_cm$byClass[3])

eval_total <- rbind(round(eval_log,2), round(eval_knn,2))
rownames(eval_total) <- c("Logistic Regression", "kNN")

eval_total

From the results above, both the Logistic Regression and k-NN models score highest on Recall, while for Accuracy, Specificity, and Precision, Logistic Regression gives the higher results.
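
To make these metrics concrete, here is a small sketch that computes them by hand from the logistic regression confusion matrix above, with Not Healthy as the positive class (TP = 22, TN = 17, FP = 4, FN = 3):

# Counts taken from logreg_cm; Not Healthy is the positive class
TP <- 22; TN <- 17; FP <- 4; FN <- 3

(TP + TN) / (TP + TN + FP + FN)  # Accuracy    ~ 0.85
TP / (TP + FN)                   # Recall      ~ 0.88
TN / (TN + FP)                   # Specificity ~ 0.81
TP / (TP + FP)                   # Precision   ~ 0.85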

6 Conclusion

In this case, since the model is used to distinguish Healthy from Not Healthy patients, I will focus on the Precision metric of the Logistic Regression model. Precision tells us how reliable a Not Healthy prediction is, so we avoid giving treatment to the wrong person; a wrong prediction here could be fatal.

Our Logistic Regression model seems to perform better than our k-NN model. I believe this is because the categorical variables we had to leave out of the k-NN model strongly affect model quality.