Heart Disease

Introduction
Library
Read Data
Exploratory Data Analysis
- Variable Transforming
- Data Visualization
Modeling
- Logistic Regression
- K-Nearest Neighbour
Evaluation
- Logistic Regression
- K-Nearest Neighbour
  - Data Training Evaluation
  - Data Test Evaluation
Conclusion
- Logistic Regression
- K-Nearest Neighbour

Introduction

We would like to use logistic regression model and K-Nearest Neighbour (K-NN) to predict the presence of heart disease in patients, which is target variable with 13 attributes, which are:

age: in years
sex: male or female
chest pain type (4 values)
resting blood pressure: detect blood pressure during rest (in mmHg)
serum cholestoral in mg/dl
fasting blood sugar > 120 mg/dl
resting electrocardiographic results (values 0,1,2)
maximum heart rate achieved
exercise induced angina
oldpeak = ST depression induced by exercise relative to rest
the slope of the peak exercise ST segment
number of major vessels (0-3) colored by flourosopy
thal: 3 = No Thalassemia; 6 = Fixed Defect Thalassemia; 7 = Reversible Defect Thalassemia

Library

library(tidyverse)
library(caret)
library(plotly)
library(data.table)
library(GGally)
library(tidymodels)
library(car)
library(scales)
library(lmtest)
library(dplyr)
library(MASS)
options(scipen = 100, max.print = 1e+06)

Read Data

heart <- read.csv('heart.csv')

Exploratory Data Analysis

Variable Transforming

We need to change some variables from numeric into character class. ca variable is not transformed into character class because it need to be know whether more visibled blood vessels will affect the model calculation or not.

glimpse(heart)

## Observations: 303
## Variables: 14
## $ age      <int> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58, …
## $ sex      <int> 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, …
## $ cp       <int> 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3, …
## $ trestbps <int> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130, …
## $ chol     <int> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275, …
## $ fbs      <int> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
## $ restecg  <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, …
## $ thalach  <int> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139, …
## $ exang    <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ oldpeak  <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2, …
## $ slope    <int> 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2, …
## $ ca       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, …
## $ thal     <int> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ target   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …

heart <- heart %>% 
  mutate(sex = ifelse(sex==1, "Male","Female"),
         fbs = ifelse(fbs == 1, "> 120 mg/dl", "< 120 mg/dl"),
         exang = ifelse(exang == 1, "Exercise Induced Angina" ,"No Exercise Induced Angina"),
         cp = ifelse(cp == 0, "Chest Pain Type 0",
                      ifelse(cp == 1, "Chest Pain Type 1", ifelse(cp==2, "Chest Pain Type 2", "Chest Pain Type 3"))),
         restecg = ifelse(restecg == 0, "Normal",
                           if_else(restecg == 1, "Abnormality", "Probable or Definite")),
         thal = ifelse(thal== 0, "No Thalassemia", ifelse(thal==1, "Normal Thalassemia", ifelse(thal==2, "Fixed Defect Thalassemia", "Reversible Defect Thalassemia"))),
         target = ifelse(target == 0, "Healthy", "Heart Disease"),
         slope = ifelse(slope == 0, "Peak Excercise ST Slope 0", ifelse(slope==1,"Peak Excercise ST Slope 1", "Peak Excercise ST Slope 2"))
         ) %>% 
  mutate_if(is.character, as.factor)


summary(heart)

##       age            sex                      cp         trestbps    
##  Min.   :29.00   Female: 96   Chest Pain Type 0:143   Min.   : 94.0  
##  1st Qu.:47.50   Male  :207   Chest Pain Type 1: 50   1st Qu.:120.0  
##  Median :55.00                Chest Pain Type 2: 87   Median :130.0  
##  Mean   :54.37                Chest Pain Type 3: 23   Mean   :131.6  
##  3rd Qu.:61.00                                        3rd Qu.:140.0  
##  Max.   :77.00                                        Max.   :200.0  
##       chol                fbs                      restecg       thalach     
##  Min.   :126.0   < 120 mg/dl:258   Abnormality         :152   Min.   : 71.0  
##  1st Qu.:211.0   > 120 mg/dl: 45   Normal              :147   1st Qu.:133.5  
##  Median :240.0                     Probable or Definite:  4   Median :153.0  
##  Mean   :246.3                                                Mean   :149.6  
##  3rd Qu.:274.5                                                3rd Qu.:166.0  
##  Max.   :564.0                                                Max.   :202.0  
##                         exang        oldpeak    
##  Exercise Induced Angina   : 99   Min.   :0.00  
##  No Exercise Induced Angina:204   1st Qu.:0.00  
##                                   Median :0.80  
##                                   Mean   :1.04  
##                                   3rd Qu.:1.60  
##                                   Max.   :6.20  
##                        slope           ca        
##  Peak Excercise ST Slope 0: 21   Min.   :0.0000  
##  Peak Excercise ST Slope 1:140   1st Qu.:0.0000  
##  Peak Excercise ST Slope 2:142   Median :0.0000  
##                                  Mean   :0.7294  
##                                  3rd Qu.:1.0000  
##                                  Max.   :4.0000  
##                             thal               target   
##  Fixed Defect Thalassemia     :166   Healthy      :138  
##  No Thalassemia               :  2   Heart Disease:165  
##  Normal Thalassemia           : 18                      
##  Reversible Defect Thalassemia:117                      
##                                                         
##

Data Visualization

In data visualization, we would try to visualize the relationship between heart disesase with each variable.

Sex

sex <- heart %>% 
  group_by(sex, target) %>% 
  summarise(count=n()) %>% 
  mutate(perc=count/sum(count))

ggplot(sex, aes(x=sex, y=count, fill=target)) + geom_col(position = "dodge") + coord_flip()+ 
  facet_wrap(~target, ncol=1,scale="fixed")+xlab("Sex") +
  ylab("Number") +
  ggtitle("Gender") +
  scale_fill_discrete(name = "Heart Disease", labels = c("Absence", "Presence")) + theme(plot.title = element_text(hjust = 0.5))

In our data, there is only 30 more male who has heart disease than female. However, there is much more healthy male than healthy female.

Age

age <- heart %>% 
  mutate(age=as.factor(age)) %>% 
  group_by(age, target) %>% 
  summarise(count=n()) %>% 
  mutate(perc=count/sum(count))

age2 <- ggplot(age, aes(x=age, y=count, fill=target)) + geom_bar(stat = "identity") + xlab("Age") + scale_x_discrete(breaks = c("30", "35","40","45","50","55","60","65","70","75","80")) +
  ylab("Number") +
  ggtitle("Age Distribution") +
  scale_fill_discrete(name = "Heart Disease", labels = c("Absence", "Presence")) + theme(plot.title = element_text(hjust = 0.5))

ggplotly(age2)

The patient in the data distributed from the age of 29 into 77. Age with the highest number with heart disease is 55.

Chest Pain

 cp <- heart %>% 
  group_by(cp, target) %>% 
  summarise(count=n()) %>% 
  mutate(perc=count/sum(count))

ggplot(cp, aes(x=cp, y=count, fill=target)) + geom_col(position = "dodge") + coord_flip() +
  ylab("Number") +
  ggtitle("Chest Pain") +
  scale_fill_discrete(name = "Heart Disease", labels = c("Absence", "Presence")) + theme(plot.title = element_text(hjust = 0.5))

Chest pain type 0 shows significant number of healthy patient compare to another types.

Resting Blood Presure

ggplot(heart, aes(x=trestbps, fill=target)) + geom_histogram(aes(y=..density..), color="grey17") +
  geom_density(alpha=.2, fill="yellow")+
  facet_wrap(~target, ncol=1,scale="fixed")+
  xlab("Resting Blood Pressure (mmHg)") + 
  ylab("Density/Count") +
  ggtitle("Resting Blood Pressure") +
  scale_fill_discrete(name = "Heart Disease", labels = c("Absence", "Presence")) + theme(plot.title = element_text(hjust = 0.5))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

In heart disease patients, many of them have blood pressure of 120 - 140 mmHG, although the number is not too far off from the healthy patients.

Cholestoral

ggplot(heart, aes(x=chol, fill=target)) + geom_histogram(aes(y=..density..), color="grey17") +
  geom_density(alpha=.2, fill="yellow")+
  facet_wrap(~target, ncol=1,scale="fixed")+
  xlab("Cholesterol") + 
  ylab("Density/Count") +
  ggtitle("Serum Cholestoral (mg/dl)") +
  scale_fill_discrete(name = "Heart Disease", labels = c("Absence", "Presence")) + theme(plot.title = element_text(hjust = 0.5))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

There is no significant differences between healthy and heart disease patients in their cholestoral (mg/dl).

Fasting Blood Sugar

 fbs <- heart %>% 
  group_by(fbs, target) %>% 
  summarise(count=n()) %>% 
  mutate(perc=count/sum(count))

ggplot(fbs, aes(x=fbs, y=count, fill=target)) + geom_col(position = "dodge") + coord_flip() +  xlab("fbs >120 mg/dl") +
  ylab("Number") +
  ggtitle("Fasting Blood Sugar") +
  scale_fill_discrete(name = "Heart Disease", labels = c("Absence", "Presence")) + theme(plot.title = element_text(hjust = 0.5))

There is no significant differences between healthy and heart disease patients in their fasting blood sugar, whether it is more or less than 120 mg/dl.

Resting Electrocardiographic Results

restecg <- heart %>% 
  group_by(restecg, target) %>% 
  summarise(count=n()) %>% 
  mutate(perc=count/sum(count))

ggplot(restecg, aes(x=restecg, y=count, fill=target)) + geom_col(position = "dodge") + coord_flip() +  facet_wrap(~target, ncol=1,scale="fixed")+ xlab("RER") +
  ylab("Number") +
  ggtitle("Resting Electrocardiographic Results") +
  scale_fill_discrete(name = "Heart Disease", labels = c("Absence", "Presence")) + theme(plot.title = element_text(hjust = 0.5))

There are twice of heart disease patients that detected with abnormality compare to healthy patients that detected with abnormality.

Maximum Heart Rate Achieved

ggplot(heart, aes(x=thalach , fill=target)) + geom_histogram(aes(y=..density..), color="grey17") +
  geom_density(alpha=.2, fill="yellow")+
  facet_wrap(~target, ncol=1,scale="fixed")+
  xlab("Maximum Heart Rate Achieved") + 
  ylab("Density/Count") +
  ggtitle("Maximum Heart Rate Achieved") +
  scale_fill_discrete(name = "Heart Disease", labels = c("Absence", "Presence")) + theme(plot.title = element_text(hjust = 0.5))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Patients with heart disease appear to have maximum of heart rate achieved of more than 150.

Exercise Induced Angina

ex <- heart %>% 
  group_by(exang, target) %>% 
  summarise(count=n()) %>% 
  mutate(perc=count/sum(count))

ggplot(ex, aes(x=exang, y=count, fill=target)) + geom_col(position = "dodge") + coord_flip() +  facet_wrap(~target, ncol=1,scale="fixed")+ xlab("RER") +
  ylab("EIA") +
  ggtitle("Exercise Induced Angina") +
  scale_fill_discrete(name = "Heart Disease", labels = c("Absence", "Presence")) + theme(plot.title = element_text(hjust = 0.5))

Most of the patient with heart disease detected with no exercise induced angina.

ST Depression

ggplot(heart, aes(x=oldpeak , fill=target)) + geom_histogram(aes(y=..density..), color="grey17") +
  geom_density(alpha=.2, fill="yellow")+
  facet_wrap(~target, ncol=1,scale="fixed")+
  xlab("Oldpeak") + 
  ylab("Density/Count") +
  ggtitle("ST depression induced by exercise relative to rest") +
  scale_fill_discrete(name = "Heart Disease", labels = c("Absence", "Presence")) + theme(plot.title = element_text(hjust = 0.5))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Most of the patient with heart disease detected with zero oldpeak.

Slope of The Peak Exercise

slope <- heart %>% 
  group_by(slope, target) %>% 
  summarise(count=n()) %>% 
  mutate(perc=count/sum(count))

ggplot(slope, aes(x=slope, y=count, fill=target)) + geom_col(position = "dodge") + coord_flip() +  facet_wrap(~target, ncol=1,scale="fixed")+ xlab("Slope") +
  ylab("Number") +
  ggtitle("the slope of the peak exercise ST segment") +
  scale_fill_discrete(name = "Heart Disease", labels = c("Absence", "Presence")) + theme(plot.title = element_text(hjust = 0.5))

More than half of the patient with heart disease detected with peak exercise ST slope 2, while more than half of the healthy patient with detected with peak exercise ST slope 1

Number of Major Vessels

ca <- heart %>% 
  group_by(ca, target) %>% 
  summarise(count=n()) %>% 
  mutate(perc=count/sum(count))

ggplot(ca, aes(x=ca, y=count, fill=target)) + geom_col(position = "dodge") + coord_flip() +  facet_wrap(~target, ncol=1,scale="fixed")+ xlab("Number of Major Vessels") +
  ylab("Number") +
  ggtitle("Major Vessels Colored by Flourosopy") +
  scale_fill_discrete(name = "Heart Disease", labels = c("Absence", "Presence")) + theme(plot.title = element_text(hjust = 0.5))

Most of the patient with heart disease detected with zero number of major vessel colored by flourosopy.

Thallasemia

thal <- heart %>% 
  group_by(thal, target) %>% 
  summarise(count=n()) %>% 
  mutate(perc=count/sum(count))

ggplot(thal, aes(x=thal, y=count, fill=target)) + geom_col(position = "dodge") + coord_flip() +  facet_wrap(~target, ncol=1,scale="fixed")+ xlab("Thalassemia Examination Result") +
  ylab("Number") +
  ggtitle("Thalassemia") +
  scale_fill_discrete(name = "Heart Disease", labels = c("Absence", "Presence")) + theme(plot.title = element_text(hjust = 0.5))

Most of the patient with heart disease detected with fixed defect of thalassemia.

Multivariables Interpretation

We would try to correlate the relationship between oldpeak, number of blood vessels to the patient who is healty or having a heart disease.

ggplot(heart, aes(x=age, y=oldpeak, color=sex, size=ca)) + geom_point(alpha=0.7) + facet_wrap(~target, ncol=1,scale="fixed") + xlab("Age") +
  ylab("Oldpeak") +
  scale_fill_discrete(name = "Heart Disease", labels = c("Absence", "Presence")) + theme(plot.title = element_text(hjust = 0.5))

From the graph above, we could interpretate:

Patient who has heart disease having a relatively low oldpeak.
Patient who has heart disease having a relatively low oldpeak number of blood vessels (ca).
Age has no siginificant effect whether the patient has heart disease or not.

Checking Data Availability

Before going further, we need to check if there is any not available data.

colSums(is.na(heart))

##      age      sex       cp trestbps     chol      fbs  restecg  thalach 
##        0        0        0        0        0        0        0        0 
##    exang  oldpeak    slope       ca     thal   target 
##        0        0        0        0        0        0

Checking Target Proportion

h <- heart %>% 
  group_by(target) %>% 
  summarise(count=n()) %>% 
  mutate(perc=count/sum(count))

ggplot(h, aes(x=target, y=perc*100, fill=target)) + geom_bar(stat = "identity") + coord_flip() + xlab("Patient Status") +
  ylab("Percentage (%)") +
  ggtitle("Patient Status Proportion") +
  scale_fill_discrete(name = "Heart Disease", labels = c("Absence", "Presence")) + theme(plot.title = element_text(hjust = 0.5))

Finally, we need to cross validate the target variable. The proportion is balanced, 54:46.

Modeling

There would be two models that we will use which are logistic regression and k-nearest neighbour (K-NN)

Logistic Regression

Data Splitting

First, we must slip the dataset into train and test dataset. The data will be split into ratio of 70% for data train and 30% for data test.

set.seed(303)
id <-sample(nrow(heart),nrow(heart)*0.7)
heart_train<-heart[id,]
heart_test<-heart[-id,]

Targetting All Variables

modelall <- glm(target~.,heart_train, family='binomial')

summary(modelall)

## 
## Call:
## glm(formula = target ~ ., family = "binomial", data = heart_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7597  -0.3682   0.1828   0.4810   2.8007  
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)   
## (Intercept)                        2.465457   3.332549   0.740  0.45941   
## age                               -0.009925   0.029102  -0.341  0.73306   
## sexMale                           -0.972760   0.594311  -1.637  0.10168   
## cpChest Pain Type 1                0.676937   0.650570   1.041  0.29809   
## cpChest Pain Type 2                1.683990   0.561586   2.999  0.00271 **
## cpChest Pain Type 3                1.876914   0.763048   2.460  0.01390 * 
## trestbps                          -0.021103   0.013683  -1.542  0.12301   
## chol                              -0.005397   0.005413  -0.997  0.31867   
## fbs> 120 mg/dl                     0.464879   0.694490   0.669  0.50325   
## restecgNormal                     -0.114183   0.481740  -0.237  0.81264   
## restecgProbable or Definite       -0.036626   2.214469  -0.017  0.98680   
## thalach                            0.017983   0.012822   1.403  0.16076   
## exangNo Exercise Induced Angina    1.286237   0.507204   2.536  0.01121 * 
## oldpeak                           -0.340274   0.261999  -1.299  0.19403   
## slopePeak Excercise ST Slope 1    -0.372844   1.008251  -0.370  0.71154   
## slopePeak Excercise ST Slope 2     0.646221   1.097124   0.589  0.55585   
## ca                                -0.658304   0.265271  -2.482  0.01308 * 
## thalNo Thalassemia                -1.687082   2.231244  -0.756  0.44958   
## thalNormal Thalassemia            -1.059155   0.980853  -1.080  0.28022   
## thalReversible Defect Thalassemia -1.427661   0.508727  -2.806  0.00501 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 292.00  on 211  degrees of freedom
## Residual deviance: 142.36  on 192  degrees of freedom
## AIC: 182.36
## 
## Number of Fisher Scoring iterations: 6

By using all the variables provided, we have numbers of variables that do not have significant effect for the target variable which could be seen on the p values. To eliminate those unsignificant variables, we would try stepwise method with backward direction.

Stepwise Method

model.b <- stepAIC(modelall, direction='backward')

## Start:  AIC=182.36
## target ~ age + sex + cp + trestbps + chol + fbs + restecg + thalach + 
##     exang + oldpeak + slope + ca + thal
## 
##            Df Deviance    AIC
## - restecg   2   142.42 178.42
## - age       1   142.48 180.48
## - fbs       1   142.82 180.82
## - chol      1   143.36 181.36
## - slope     2   146.04 182.04
## - oldpeak   1   144.12 182.12
## <none>          142.36 182.36
## - thalach   1   144.41 182.41
## - trestbps  1   144.84 182.84
## - sex       1   145.13 183.13
## - thal      3   150.81 184.81
## - ca        1   148.69 186.69
## - exang     1   148.87 186.87
## - cp        3   154.84 188.84
## 
## Step:  AIC=178.42
## target ~ age + sex + cp + trestbps + chol + fbs + thalach + exang + 
##     oldpeak + slope + ca + thal
## 
##            Df Deviance    AIC
## - age       1   142.58 176.58
## - fbs       1   142.87 176.87
## - chol      1   143.57 177.57
## - oldpeak   1   144.17 178.17
## - slope     2   146.38 178.38
## <none>          142.42 178.42
## - thalach   1   144.47 178.47
## - trestbps  1   144.94 178.94
## - sex       1   145.37 179.37
## - thal      3   150.90 180.90
## - ca        1   148.76 182.76
## - exang     1   148.97 182.97
## - cp        3   154.87 184.87
## 
## Step:  AIC=176.58
## target ~ sex + cp + trestbps + chol + fbs + thalach + exang + 
##     oldpeak + slope + ca + thal
## 
##            Df Deviance    AIC
## - fbs       1   143.02 175.02
## - chol      1   143.93 175.93
## - oldpeak   1   144.25 176.25
## - slope     2   146.51 176.51
## <none>          142.58 176.58
## - sex       1   145.48 177.48
## - thalach   1   145.64 177.64
## - trestbps  1   145.73 177.73
## - thal      3   151.10 179.10
## - exang     1   149.22 181.22
## - ca        1   149.83 181.83
## - cp        3   155.30 183.30
## 
## Step:  AIC=175.02
## target ~ sex + cp + trestbps + chol + thalach + exang + oldpeak + 
##     slope + ca + thal
## 
##            Df Deviance    AIC
## - chol      1   144.24 174.24
## - oldpeak   1   144.79 174.79
## - slope     2   146.93 174.93
## <none>          143.02 175.02
## - sex       1   145.84 175.84
## - trestbps  1   145.92 175.92
## - thalach   1   146.09 176.09
## - thal      3   151.40 177.40
## - exang     1   149.39 179.39
## - ca        1   149.85 179.85
## - cp        3   156.88 182.88
## 
## Step:  AIC=174.24
## target ~ sex + cp + trestbps + thalach + exang + oldpeak + slope + 
##     ca + thal
## 
##            Df Deviance    AIC
## <none>          144.24 174.24
## - oldpeak   1   146.30 174.30
## - sex       1   146.47 174.47
## - slope     2   148.57 174.57
## - thalach   1   147.16 175.16
## - trestbps  1   147.24 175.24
## - thal      3   152.49 176.49
## - exang     1   150.29 178.29
## - ca        1   151.44 179.44
## - cp        3   158.83 182.83

This method eliminates six unsignificant variables, which are age, fbs, chol, restecg, oldpeak, and slope. So therefore, we would use model.b as our prefer mode.

model.b <- glm(target ~ sex + cp + trestbps + thalach + exang + oldpeak + slope + 
    ca + thal,heart_train, family='binomial')
summary(model.b)

## 
## Call:
## glm(formula = target ~ sex + cp + trestbps + thalach + exang + 
##     oldpeak + slope + ca + thal, family = "binomial", data = heart_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7746  -0.3799   0.2014   0.4931   2.7368  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)   
## (Intercept)                        0.43152    2.58633   0.167  0.86749   
## sexMale                           -0.83157    0.56335  -1.476  0.13991   
## cpChest Pain Type 1                0.72184    0.63672   1.134  0.25692   
## cpChest Pain Type 2                1.79685    0.54745   3.282  0.00103 **
## cpChest Pain Type 3                1.93688    0.75034   2.581  0.00984 **
## trestbps                          -0.02183    0.01295  -1.685  0.09200 . 
## thalach                            0.01927    0.01156   1.666  0.09568 . 
## exangNo Exercise Induced Angina    1.21859    0.49692   2.452  0.01419 * 
## oldpeak                           -0.36267    0.25753  -1.408  0.15906   
## slopePeak Excercise ST Slope 1    -0.44129    0.98775  -0.447  0.65504   
## slopePeak Excercise ST Slope 2     0.64093    1.07046   0.599  0.54934   
## ca                                -0.64851    0.24833  -2.612  0.00901 **
## thalNo Thalassemia                -1.25894    2.45145  -0.514  0.60757   
## thalNormal Thalassemia            -0.84997    0.94975  -0.895  0.37082   
## thalReversible Defect Thalassemia -1.38805    0.49318  -2.815  0.00489 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 292.00  on 211  degrees of freedom
## Residual deviance: 144.24  on 197  degrees of freedom
## AIC: 174.24
## 
## Number of Fisher Scoring iterations: 6

Model Interpretation

exp(model.b$coefficients) %>% 
  data.frame()

##                                           .
## (Intercept)                       1.5395889
## sexMale                           0.4353669
## cpChest Pain Type 1               2.0582241
## cpChest Pain Type 2               6.0306449
## cpChest Pain Type 3               6.9370818
## trestbps                          0.9784109
## thalach                           1.0194538
## exangNo Exercise Induced Angina   3.3824019
## oldpeak                           0.6958139
## slopePeak Excercise ST Slope 1    0.6432050
## slopePeak Excercise ST Slope 2    1.8982437
## ca                                0.5228225
## thalNo Thalassemia                0.2839554
## thalNormal Thalassemia            0.4274267
## thalReversible Defect Thalassemia 0.2495606

Fom the model, model.b, we could interprete several things:

The odds of male to diagnosed with heart disease is 43% less than female.
Patient detected with no exercise induced angina (unstable angina) is 3 times more reluctant to heart disease compared with patient with exercise induced angina.
The odds of patient detected with oldpeak to diagnosed with heart disease is 70% less than those who don’t.

K-Nearest Neighbour

Variable Transforming

dv <- dummyVars(" ~.", data = heart)
heart2 <- data.frame(predict(dv, newdata = heart))

str(heart2)

## 'data.frame':    303 obs. of  28 variables:
##  $ age                               : num  63 37 41 56 57 57 56 44 52 57 ...
##  $ sex.Female                        : num  0 0 1 0 1 0 1 0 0 0 ...
##  $ sex.Male                          : num  1 1 0 1 0 1 0 1 1 1 ...
##  $ cp.Chest.Pain.Type.0              : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ cp.Chest.Pain.Type.1              : num  0 0 1 1 0 0 1 1 0 0 ...
##  $ cp.Chest.Pain.Type.2              : num  0 1 0 0 0 0 0 0 1 1 ...
##  $ cp.Chest.Pain.Type.3              : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ trestbps                          : num  145 130 130 120 120 140 140 120 172 150 ...
##  $ chol                              : num  233 250 204 236 354 192 294 263 199 168 ...
##  $ fbs...120.mg.dl                   : num  0 1 1 1 1 1 1 1 0 1 ...
##  $ fbs...120.mg.dl.1                 : num  1 0 0 0 0 0 0 0 1 0 ...
##  $ restecg.Abnormality               : num  0 1 0 1 1 1 0 1 1 1 ...
##  $ restecg.Normal                    : num  1 0 1 0 0 0 1 0 0 0 ...
##  $ restecg.Probable.or.Definite      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ thalach                           : num  150 187 172 178 163 148 153 173 162 174 ...
##  $ exang.Exercise.Induced.Angina     : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ exang.No.Exercise.Induced.Angina  : num  1 1 1 1 0 1 1 1 1 1 ...
##  $ oldpeak                           : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ slope.Peak.Excercise.ST.Slope.0   : num  1 1 0 0 0 0 0 0 0 0 ...
##  $ slope.Peak.Excercise.ST.Slope.1   : num  0 0 0 0 0 1 1 0 0 0 ...
##  $ slope.Peak.Excercise.ST.Slope.2   : num  0 0 1 1 1 0 0 1 1 1 ...
##  $ ca                                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ thal.Fixed.Defect.Thalassemia     : num  0 1 1 1 1 0 1 0 0 1 ...
##  $ thal.No.Thalassemia               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ thal.Normal.Thalassemia           : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ thal.Reversible.Defect.Thalassemia: num  0 0 0 0 0 0 0 1 1 0 ...
##  $ target.Healthy                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ target.Heart.Disease              : num  1 1 1 1 1 1 1 1 1 1 ...

We need to transform all the variables into numeric class, as KNN only processing numeric variables.

heart2 <- heart2 %>% 
dplyr::select(-c(target.Healthy, sex.Female, fbs...120.mg.dl.1 , exang.Exercise.Induced.Angina))
    
names(heart2)

##  [1] "age"                                "sex.Male"                          
##  [3] "cp.Chest.Pain.Type.0"               "cp.Chest.Pain.Type.1"              
##  [5] "cp.Chest.Pain.Type.2"               "cp.Chest.Pain.Type.3"              
##  [7] "trestbps"                           "chol"                              
##  [9] "fbs...120.mg.dl"                    "restecg.Abnormality"               
## [11] "restecg.Normal"                     "restecg.Probable.or.Definite"      
## [13] "thalach"                            "exang.No.Exercise.Induced.Angina"  
## [15] "oldpeak"                            "slope.Peak.Excercise.ST.Slope.0"   
## [17] "slope.Peak.Excercise.ST.Slope.1"    "slope.Peak.Excercise.ST.Slope.2"   
## [19] "ca"                                 "thal.Fixed.Defect.Thalassemia"     
## [21] "thal.No.Thalassemia"                "thal.Normal.Thalassemia"           
## [23] "thal.Reversible.Defect.Thalassemia" "target.Heart.Disease"

Then, we neglect variables with only two levels, such as sex variable.

Data Splitting

For KNN, we need to split data into the main data and label data.

heart2_train <- heart2[id,1:23]
heart2_test <- heart2[-id,1:23]

heart2_train_label <- heart2[id,24]%>% as.factor()
heart2_test_label <- heart2[-id,24] %>% as.factor()

Then, we need to scale the data as the scale from each variables are different.

heart2_train <- scale(heart2_train[,1:23])
heart2_test <- scale(heart2_test[,1:23],
                     center = attr(heart2_train, "scaled:center"), 
                     scale = attr(heart2_train, "scaled:scale"))

Prediction with KNN

sqrt(nrow(heart2_train))

## [1] 14.56022

Before modeling, we need to define the value of K. From the formula above, here we get the value of 15.

library(class)
pred.knn.train <- knn(train = heart2_train, test= heart2_train, cl=heart2_train_label, k=15)

pred.knn.test <- knn(train = heart2_train, test= heart2_test, cl=heart2_train_label, k=15)

Evaluation

Logistic Regression

Preparation

We put the predict value from the model into our data, heart_train and heart_test.

heart_test$pred.Target <- predict(model.b, heart_test, type = 'response')

heart_train$pred.Target <- predict(model.b, heart_train, type = 'response')

Tuning Cutoff

performa <- function(cutoff, prob, ref, postarget, negtarget) 
{
  predict <- factor(ifelse(prob >= cutoff, postarget, negtarget))
  conf <- caret::confusionMatrix(predict , ref, positive = postarget)
  acc <- conf$overall[1]
  rec <- conf$byClass[1]
  prec <- conf$byClass[3]
  spec <- conf$byClass[2]
  mat <- t(as.matrix(c(rec , acc , prec, spec))) 
  colnames(mat) <- c("recall", "accuracy", "precicion", "specificity")
  return(mat)
}

co <- seq(0.01,0.80,length=100)
result <- matrix(0,100,4)

for(i in 1:100){
  result[i,] = performa(cutoff = co[i], 
                     prob = heart_test$pred.Target, 
                     ref = heart_test$target, 
                     postarget = "Heart Disease", 
                     negtarget = "Healthy")
}

data_frame("Recall" = result[,1],
           "Accuracy" = result[,2],
           "Precision" = result[,3],
           "Specificity" = result[,4],
                   "Cutoff" = co) %>% 
  gather(key = "performa", value = "value", 1:4) %>% 
  ggplot(aes(x = Cutoff, y = value, col = performa)) +
  geom_line(lwd = 1.5) +
  scale_color_manual(values = c("darkred","darkgreen","orange", "blue")) +
  scale_y_continuous(breaks = seq(0,1,0.1), limits = c(0,1)) +
  scale_x_continuous(breaks = seq(0,1,0.1)) +
  labs(title = "Tradeoff model perfomance") +
  theme_minimal() +
  theme(legend.position = "top",
        panel.grid.minor.y = element_blank(),
        panel.grid.minor.x = element_blank())

## Warning: `data_frame()` is deprecated, use `tibble()`.
## This warning is displayed once per session.

Before doing evaluation, we need to inspect the good cutoff or threshold to maximize the accuracy, recall and precision value. Here from the graph, 0.53 is the most balance value.

heart_test$pred.Label <- ifelse(heart_test$pred.Target < 0.53, "Healthy", "Heart Disease") %>% as.factor()

heart_train$pred.Label <- ifelse(heart_train$pred.Target < 0.53, "Healthy", "Heart Disease") %>% as.factor()

Data Training Evaluation

logtrain <- confusionMatrix(heart_train$pred.Label, heart_train$target, positive="Heart Disease") %>% print()

## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      Healthy Heart Disease
##   Healthy            80            13
##   Heart Disease      16           103
##                                              
##                Accuracy : 0.8632             
##                  95% CI : (0.8095, 0.9064)   
##     No Information Rate : 0.5472             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.7232             
##                                              
##  Mcnemar's Test P-Value : 0.7103             
##                                              
##             Sensitivity : 0.8879             
##             Specificity : 0.8333             
##          Pos Pred Value : 0.8655             
##          Neg Pred Value : 0.8602             
##              Prevalence : 0.5472             
##          Detection Rate : 0.4858             
##    Detection Prevalence : 0.5613             
##       Balanced Accuracy : 0.8606             
##                                              
##        'Positive' Class : Heart Disease      
##

Data Test Evaluation

logtest <- confusionMatrix(heart_test$pred.Label, heart_test$target, positive="Heart Disease") %>% print()

## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      Healthy Heart Disease
##   Healthy            35             8
##   Heart Disease       7            41
##                                           
##                Accuracy : 0.8352          
##                  95% CI : (0.7427, 0.9047)
##     No Information Rate : 0.5385          
##     P-Value [Acc > NIR] : 0.000000002394  
##                                           
##                   Kappa : 0.6689          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8367          
##             Specificity : 0.8333          
##          Pos Pred Value : 0.8542          
##          Neg Pred Value : 0.8140          
##              Prevalence : 0.5385          
##          Detection Rate : 0.4505          
##    Detection Prevalence : 0.5275          
##       Balanced Accuracy : 0.8350          
##                                           
##        'Positive' Class : Heart Disease   
##

The train data give accuracy of 86.32%, while the data test give accurracy of 83.52%. Because the accuracy is not far, so we can assume the model is fit.

K-Nearest Neighbour

Data Training Evaluation

knn.train <- confusionMatrix(pred.knn.train, heart2_train_label, positive="1") %>% print()

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 76 18
##          1 20 98
##                                              
##                Accuracy : 0.8208             
##                  95% CI : (0.7624, 0.8699)   
##     No Information Rate : 0.5472             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.6376             
##                                              
##  Mcnemar's Test P-Value : 0.8711             
##                                              
##             Sensitivity : 0.8448             
##             Specificity : 0.7917             
##          Pos Pred Value : 0.8305             
##          Neg Pred Value : 0.8085             
##              Prevalence : 0.5472             
##          Detection Rate : 0.4623             
##    Detection Prevalence : 0.5566             
##       Balanced Accuracy : 0.8182             
##                                              
##        'Positive' Class : 1                  
##

Data Test Evaluation

knn.test <- confusionMatrix(pred.knn.test, heart2_test_label, positive="1") %>% print()

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 34  5
##          1  8 44
##                                           
##                Accuracy : 0.8571          
##                  95% CI : (0.7681, 0.9217)
##     No Information Rate : 0.5385          
##     P-Value [Acc > NIR] : 0.0000000001091 
##                                           
##                   Kappa : 0.7111          
##                                           
##  Mcnemar's Test P-Value : 0.5791          
##                                           
##             Sensitivity : 0.8980          
##             Specificity : 0.8095          
##          Pos Pred Value : 0.8462          
##          Neg Pred Value : 0.8718          
##              Prevalence : 0.5385          
##          Detection Rate : 0.4835          
##    Detection Prevalence : 0.5714          
##       Balanced Accuracy : 0.8537          
##                                           
##        'Positive' Class : 1               
##

The train data give accuracy of 82.08%, while the data test give accurracy of 85.71%. Because the accuracy is not far, so we can assume the model is fit.

Conclusion

Logistic Regression

eval_logit <- data_frame(Accuracy = logtest$overall[1],
           Recall = logtest$byClass[1],
           Precision = logtest$byClass[3]) %>% print()

## # A tibble: 1 x 3
##   Accuracy Recall Precision
##      <dbl>  <dbl>     <dbl>
## 1    0.835  0.837     0.854

K-Nearest Neighbour

eval_knn <- data_frame(Accuracy = knn.test$overall[1],
           Recall = knn.test$byClass[1],
           Precision = knn.test$byClass[3]) %>% print()

## # A tibble: 1 x 3
##   Accuracy Recall Precision
##      <dbl>  <dbl>     <dbl>
## 1    0.857  0.898     0.846

KNN give accuracy of 85.71%, while logistic regression model give accuracy of 83.51%. KNN also could predict the actual patient with heart disease with recall value of 89.79%, compared to logistic regression model with value of 83.67%.