Introduction

This data set is Cleveland heart disease. Heart disease describes a range of conditions that affect the heart from UC Irvine Repository.

Heart diseases include:
- Blood vessel disease, such as coronary artery disease
- Irregular heartbeats (arrhythmias)
- Heart problems you’re born with (congenital heart defects)
- Disease of the heart muscle
- Heart valve disease
Many forms of heart disease can be prevented or treated with healthy lifestyle choices. Heart disease symptoms depend on the type of heart disease such as dizziness, fatigue, and feeling short of breath.
source:Mayo Clinic

Changing your lifestyle can prevent and reverse certain types of heart disease. Below are five common risk factors associated to heart disease:
-High Blood Pressure
-High Low- density Lipoprotein (LDL) Cholesterol
-Diabetes
-Smoking and Secondhand Smoke Exposure
-Physical Inactivity & Unhealthy Diet
source:southcoasthealth

Data Preparation

Prerequisites

Importing Libraries

library(dplyr)
library(gtools) 
library(caret)
library(inspectdf)
library(rsample)
library(car)
library(class)

Importing Dataset

We will using read.table() function for importing the dataset because the data is on .data file type.

cl <- read.table("processed.cleveland.data", fileEncoding = "UTF-8", sep = ",")
rmarkdown::paged_table(cl)

Data Columns:
V1: age in years
V2: sex (1 = male; 0 = female)
V3: cp = chest pain type
– Value 1: typical angina
– Value 2: atypical angina
– Value 3: non-anginal pain
– Value 4: asymptomatic
V4: trestbps = resting blood pressure (in mm Hg on admission to the hospital)
V5: chol = serum cholestoral in mg/dl
V6: fbs = fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
V7: restecg = resting electrocardiographic results
– Value 0: normal
– Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
– Value 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria
V8: thalach = maximum heart rate achieved
V9: exang = exercise induced angina (1 = yes; 0 = no) V10: oldpeak = ST depression induced by exercise relative to rest V11: slope = the slope of the peak exercise ST segment – Value 1: upsloping – Value 2: flat – Value 3: downsloping V12: ca = number of major vessels (0-3) colored by flourosopy
V13: thal = (3) = normal; (6) = fixed defect; (7) = reversable defect V14: class = diagnosis of heart disease (angiographic disease status)
Diagonosis of heart disease (4 classes). It can be 2 classes by setting 0 for 0 values and 1 for non-0 values.

Change the column name so it more intuitive according to the column data above.

names(cl) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "class")

Change the class column become 2 classes

cl$class <-  sapply(X= as.character(cl$class),
                    FUN = switch,
                    "0" = "0",
                    "1" = "1",
                    "2" = "1",
                    "3" = "1",
                    "4" = "1")

head(cl)

#>   age sex cp trestbps chol fbs restecg thalach exang oldpeak slope  ca thal
#> 1  63   1  1      145  233   1       2     150     0     2.3     3 0.0  6.0
#> 2  67   1  4      160  286   0       2     108     1     1.5     2 3.0  3.0
#> 3  67   1  4      120  229   0       2     129     1     2.6     2 2.0  7.0
#> 4  37   1  3      130  250   0       0     187     0     3.5     3 0.0  3.0
#> 5  41   0  2      130  204   0       2     172     0     1.4     1 0.0  3.0
#> 6  56   1  2      120  236   0       0     178     0     0.8     1 0.0  3.0
#>   class
#> 1     0
#> 2     1
#> 3     1
#> 4     0
#> 5     0
#> 6     0

Data Processing

Duplicates

sum(duplicated(cl))

#> [1] 0

Missing Values

colSums(is.na(x = cl))

#>      age      sex       cp trestbps     chol      fbs  restecg  thalach 
#>        0        0        0        0        0        0        0        0 
#>    exang  oldpeak    slope       ca     thal    class 
#>        0        0        0        0        0        0

Data Types

glimpse(cl)

#> Rows: 303
#> Columns: 14
#> $ age      <dbl> 63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56, 56, 44, 52, 5…
#> $ sex      <dbl> 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1…
#> $ cp       <dbl> 1, 4, 4, 3, 2, 2, 4, 4, 4, 4, 4, 2, 3, 2, 3, 3, 2, 4, 3, 2, 1…
#> $ trestbps <dbl> 145, 160, 120, 130, 130, 120, 140, 120, 130, 140, 140, 140, 1…
#> $ chol     <dbl> 233, 286, 229, 250, 204, 236, 268, 354, 254, 203, 192, 294, 2…
#> $ fbs      <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0…
#> $ restecg  <dbl> 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 2…
#> $ thalach  <dbl> 150, 108, 129, 187, 172, 178, 160, 163, 147, 155, 148, 153, 1…
#> $ exang    <dbl> 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1…
#> $ oldpeak  <dbl> 2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4, 3.1, 0.4, 1.3, 0…
#> $ slope    <dbl> 3, 2, 2, 3, 1, 1, 3, 1, 2, 3, 2, 2, 2, 1, 1, 1, 3, 1, 1, 1, 2…
#> $ ca       <chr> "0.0", "3.0", "2.0", "0.0", "0.0", "0.0", "2.0", "0.0", "1.0"…
#> $ thal     <chr> "6.0", "3.0", "7.0", "3.0", "3.0", "3.0", "3.0", "3.0", "7.0"…
#> $ class    <chr> "0", "1", "1", "0", "0", "0", "1", "0", "1", "1", "0", "0", "…

Convert sex, fbs,cp,restecg,exang,slope,ca, thal, and class into factor data type.

cl_clean <- cl %>% 
  mutate_at(.vars = c("sex", "fbs", "cp", "restecg", "exang", "slope","ca","thal", "class"),
            .funs = as.factor) %>% 
  glimpse()

#> Rows: 303
#> Columns: 14
#> $ age      <dbl> 63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56, 56, 44, 52, 5…
#> $ sex      <fct> 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1…
#> $ cp       <fct> 1, 4, 4, 3, 2, 2, 4, 4, 4, 4, 4, 2, 3, 2, 3, 3, 2, 4, 3, 2, 1…
#> $ trestbps <dbl> 145, 160, 120, 130, 130, 120, 140, 120, 130, 140, 140, 140, 1…
#> $ chol     <dbl> 233, 286, 229, 250, 204, 236, 268, 354, 254, 203, 192, 294, 2…
#> $ fbs      <fct> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0…
#> $ restecg  <fct> 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 2…
#> $ thalach  <dbl> 150, 108, 129, 187, 172, 178, 160, 163, 147, 155, 148, 153, 1…
#> $ exang    <fct> 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1…
#> $ oldpeak  <dbl> 2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4, 3.1, 0.4, 1.3, 0…
#> $ slope    <fct> 3, 2, 2, 3, 1, 1, 3, 1, 2, 3, 2, 2, 2, 1, 1, 1, 3, 1, 1, 1, 2…
#> $ ca       <fct> 0.0, 3.0, 2.0, 0.0, 0.0, 0.0, 2.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1…
#> $ thal     <fct> 6.0, 3.0, 7.0, 3.0, 3.0, 3.0, 3.0, 3.0, 7.0, 7.0, 6.0, 3.0, 6…
#> $ class    <fct> 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0…

Exploratory Data Analysis

cl_clean %>% 
  inspect_cat()

#> # A tibble: 9 × 5
#>   col_name   cnt common common_pcnt levels          
#>   <chr>    <int> <chr>        <dbl> <named list>    
#> 1 ca           5 0.0           58.1 <tibble [5 × 3]>
#> 2 class        2 0             54.1 <tibble [2 × 3]>
#> 3 cp           4 4             47.5 <tibble [4 × 3]>
#> 4 exang        2 0             67.3 <tibble [2 × 3]>
#> 5 fbs          2 0             85.1 <tibble [2 × 3]>
#> 6 restecg      3 0             49.8 <tibble [3 × 3]>
#> 7 sex          2 1             68.0 <tibble [2 × 3]>
#> 8 slope        3 1             46.9 <tibble [3 × 3]>
#> 9 thal         4 3.0           54.8 <tibble [4 × 3]>

cl_clean %>% 
  inspect_num()

#> # A tibble: 5 × 10
#>   col_name   min    q1 median   mean    q3   max    sd pcnt_na hist             
#>   <chr>    <dbl> <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>   <dbl> <named list>     
#> 1 age         29   48    56    54.4   61    77    9.04       0 <tibble [25 × 2]>
#> 2 trestbps    94  120   130   132.   140   200   17.6        0 <tibble [22 × 2]>
#> 3 chol       126  211   241   247.   275   564   51.8        0 <tibble [23 × 2]>
#> 4 thalach     71  134.  153   150.   166   202   22.9        0 <tibble [27 × 2]>
#> 5 oldpeak      0    0     0.8   1.04   1.6   6.2  1.16       0 <tibble [13 × 2]>

💡 Insight:
- 67.98% of patients are male.
- The youngest patient is 29 years old, the oldest patient is 77 years old, and the average age of the patient is 54 years old.

Cross Validation

We will divide data into train data and test data using the rsample library. The proportion of train data is 80%.

RNGkind(sample.kind = "Rounding")
set.seed(150)

splitter <- initial_split(data = cl_clean, prop = 0.8) 

cl_train <- training(splitter)
cl_test <- testing(splitter)
cl_test

#>    age sex cp trestbps chol fbs restecg thalach exang oldpeak slope  ca thal
#> 1   41   0  2      130  204   0       2     172     0     1.4     1 0.0  3.0
#> 2   44   1  2      120  263   0       0     173     0     0.0     1 0.0  7.0
#> 3   58   1  2      120  284   0       2     160     0     1.8     2 0.0  3.0
#> 4   59   1  4      135  234   0       0     161     0     0.5     2 0.0  7.0
#> 5   42   1  4      140  226   0       0     178     0     0.0     1 0.0  3.0
#> 6   61   1  3      150  243   1       0     137     1     1.0     2 0.0  3.0
#> 7   65   0  4      150  225   0       2     114     0     1.0     2 3.0  7.0
#> 8   41   0  2      105  198   0       0     168     0     0.0     1 1.0  3.0
#> 9   44   1  4      112  290   0       2     153     0     0.0     1 1.0  3.0
#> 10  50   1  3      140  233   0       0     163     0     0.6     2 1.0  7.0
#> 11  51   1  1      125  213   0       2     125     1     1.4     1 1.0  3.0
#> 12  46   0  3      142  177   0       2     160     1     1.4     3 0.0  3.0
#> 13  58   1  4      128  216   0       2     131     1     2.2     2 3.0  7.0
#> 14  58   1  4      150  270   0       2     111     1     0.8     1 0.0  7.0
#> 15  45   1  4      104  208   0       2     148     1     3.0     2 0.0  3.0
#> 16  53   0  4      130  264   0       2     143     0     0.4     2 0.0  3.0
#> 17  39   1  3      140  321   0       2     182     0     0.0     1 0.0  3.0
#> 18  68   1  3      180  274   1       2     150     1     1.6     2 0.0  7.0
#> 19  52   1  2      120  325   0       0     172     0     0.2     1 0.0  3.0
#> 20  51   0  3      130  256   0       2     149     0     0.5     1 0.0  3.0
#> 21  62   0  4      160  164   0       2     145     0     6.2     3 3.0  7.0
#> 22  52   1  4      128  255   0       0     161     1     0.0     1 1.0  7.0
#> 23  49   1  3      120  188   0       0     139     0     2.0     2 3.0  7.0
#> 24  39   1  4      118  219   0       0     140     0     1.2     2 0.0  7.0
#> 25  48   1  4      130  256   1       2     150     1     0.0     1 2.0  7.0
#> 26  65   1  1      138  282   1       2     174     0     1.4     2 1.0  3.0
#> 27  45   0  2      130  234   0       2     175     0     0.6     2 0.0  3.0
#> 28  56   0  4      200  288   1       2     133     1     4.0     3 2.0  7.0
#> 29  29   1  2      130  204   0       2     202     0     0.0     1 0.0  3.0
#> 30  51   1  4      140  261   0       2     186     1     0.0     1 0.0  3.0
#> 31  59   1  1      170  288   0       2     159     0     0.2     2 0.0  7.0
#> 32  45   1  2      128  308   0       2     170     0     0.0     1 0.0  3.0
#> 33  42   0  4      102  265   0       2     122     0     0.6     2 0.0  3.0
#> 34  67   0  3      115  564   0       2     160     0     1.6     2 0.0  7.0
#> 35  55   1  4      160  289   0       2     145     1     0.8     2 1.0  7.0
#> 36  70   1  4      130  322   0       2     109     0     2.4     2 3.0  3.0
#> 37  58   1  4      125  300   0       2     171     0     0.0     1 2.0  7.0
#> 38  51   1  4      140  298   0       0     122     1     4.2     2 3.0  7.0
#> 39  43   1  4      132  247   1       2     143     1     0.1     2   ?  7.0
#> 40  68   0  3      120  211   0       2     115     0     1.5     2 0.0  3.0
#> 41  67   1  4      100  299   0       2     125     1     0.9     2 2.0  3.0
#> 42  45   0  4      138  236   0       2     152     1     0.2     2 0.0  3.0
#> 43  50   0  2      120  244   0       0     162     0     1.1     1 0.0  3.0
#> 44  58   1  4      128  259   0       2     130     1     3.0     2 2.0  7.0
#> 45  55   1  2      130  262   0       0     155     0     0.0     1 0.0  3.0
#> 46  56   1  1      120  193   0       2     162     0     1.9     2 0.0  7.0
#> 47  64   0  4      130  303   0       0     122     0     2.0     2 2.0  3.0
#> 48  41   0  3      112  268   0       2     172     1     0.0     1 0.0  3.0
#> 49  53   1  4      123  282   0       0      95     1     2.0     2 2.0  7.0
#> 50  66   1  4      112  212   0       2     132     1     0.1     1 1.0  3.0
#> 51  60   0  3      120  178   1       0      96     0     0.0     1 0.0  3.0
#> 52  58   1  4      100  234   0       0     156     0     0.1     1 1.0  7.0
#> 53  52   1  4      125  212   0       0     168     0     1.0     1 2.0  7.0
#> 54  62   1  2      128  208   1       2     140     0     0.0     1 0.0  3.0
#> 55  57   1  4      110  201   0       0     126     1     1.5     2 0.0  6.0
#> 56  60   0  1      150  240   0       0     171     0     0.9     1 0.0  3.0
#> 57  61   1  4      138  166   0       2     125     1     3.6     2 1.0  3.0
#> 58  71   0  4      112  149   0       0     125     0     1.6     2 0.0  3.0
#> 59  57   1  2      154  232   0       2     164     0     0.0     1 1.0  3.0
#> 60  57   1  4      110  335   0       0     143     1     3.0     2 1.0  7.0
#> 61  56   1  2      130  221   0       2     163     0     0.0     1 0.0  7.0
#>    class
#> 1      0
#> 2      0
#> 3      1
#> 4      0
#> 5      0
#> 6      0
#> 7      1
#> 8      0
#> 9      1
#> 10     1
#> 11     0
#> 12     0
#> 13     1
#> 14     1
#> 15     0
#> 16     0
#> 17     0
#> 18     1
#> 19     0
#> 20     0
#> 21     1
#> 22     1
#> 23     1
#> 24     1
#> 25     1
#> 26     1
#> 27     0
#> 28     1
#> 29     0
#> 30     0
#> 31     1
#> 32     0
#> 33     0
#> 34     0
#> 35     1
#> 36     1
#> 37     1
#> 38     1
#> 39     1
#> 40     0
#> 41     1
#> 42     0
#> 43     0
#> 44     1
#> 45     0
#> 46     0
#> 47     0
#> 48     0
#> 49     1
#> 50     1
#> 51     0
#> 52     1
#> 53     1
#> 54     0
#> 55     0
#> 56     0
#> 57     1
#> 58     0
#> 59     1
#> 60     1
#> 61     0

Imbalance Class
Then we will check the imbalance class of train and test data. The data should have a good balance because we need to keep data still and not be biased.

table(cl_train$class) %>% 
  prop.table()

#> 
#>         0         1 
#> 0.5454545 0.4545455

Our data has a good balance, it’s not be biased. Then, we can continue making the model.

Modeling

Logistic Regression

Build the model

1. All variable as predictor model

model_all <- glm(formula = class ~ ., 
                family = "binomial", 
                data = cl_train)

summary(model_all)

#> 
#> Call:
#> glm(formula = class ~ ., family = "binomial", data = cl_train)
#> 
#> Coefficients:
#>                Estimate  Std. Error z value Pr(>|z|)   
#> (Intercept)  -16.727003 1160.349633  -0.014  0.98850   
#> age           -0.033775    0.028932  -1.167  0.24305   
#> sex1           1.350498    0.620206   2.177  0.02944 * 
#> cp2            1.548061    0.941269   1.645  0.10004   
#> cp3            0.624294    0.824267   0.757  0.44881   
#> cp4            2.661976    0.828497   3.213  0.00131 **
#> trestbps       0.019276    0.012910   1.493  0.13541   
#> chol           0.003562    0.005190   0.686  0.49246   
#> fbs1          -0.682929    0.676709  -1.009  0.31288   
#> restecg1       0.491888    2.659196   0.185  0.85325   
#> restecg2       0.484779    0.443916   1.092  0.27481   
#> thalach       -0.024950    0.013593  -1.836  0.06643 . 
#> exang1         0.906034    0.513217   1.765  0.07750 . 
#> oldpeak        0.505875    0.268267   1.886  0.05933 . 
#> slope2         1.318425    0.532120   2.478  0.01322 * 
#> slope3         0.355442    1.027862   0.346  0.72949   
#> ca0.0         15.081081 1160.343611   0.013  0.98963   
#> ca1.0         16.787448 1160.343702   0.014  0.98846   
#> ca2.0         18.547202 1160.343934   0.016  0.98725   
#> ca3.0         16.850481 1160.343966   0.015  0.98841   
#> thal3.0       -1.734833    2.234844  -0.776  0.43759   
#> thal6.0       -1.773673    2.349699  -0.755  0.45034   
#> thal7.0       -0.493088    2.242270  -0.220  0.82594   
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 333.48  on 241  degrees of freedom
#> Residual deviance: 147.52  on 219  degrees of freedom
#> AIC: 193.52
#> 
#> Number of Fisher Scoring iterations: 15

2. Stepwise model

model_backward <- step(object = model_all,
                   direction = "backward")

#> Start:  AIC=193.52
#> class ~ age + sex + cp + trestbps + chol + fbs + restecg + thalach + 
#>     exang + oldpeak + slope + ca + thal
#> 
#>            Df Deviance    AIC
#> - restecg   2   148.73 190.73
#> - chol      1   147.99 191.99
#> - fbs       1   148.57 192.57
#> - age       1   148.90 192.90
#> <none>          147.52 193.52
#> - trestbps  1   149.80 193.80
#> - exang     1   150.61 194.61
#> - thal      3   154.84 194.84
#> - thalach   1   151.21 195.21
#> - oldpeak   1   151.24 195.24
#> - slope     2   154.37 196.37
#> - sex       1   152.58 196.58
#> - cp        3   167.80 207.80
#> - ca        4   172.54 210.54
#> 
#> Step:  AIC=190.73
#> class ~ age + sex + cp + trestbps + chol + fbs + thalach + exang + 
#>     oldpeak + slope + ca + thal
#> 
#>            Df Deviance    AIC
#> - chol      1   149.52 189.52
#> - fbs       1   149.80 189.80
#> - age       1   149.87 189.87
#> <none>          148.73 190.73
#> - trestbps  1   151.26 191.26
#> - exang     1   151.53 191.53
#> - thal      3   155.88 191.88
#> - thalach   1   152.24 192.24
#> - oldpeak   1   152.31 192.31
#> - sex       1   154.25 194.25
#> - slope     2   156.31 194.31
#> - cp        3   168.87 204.87
#> - ca        4   174.37 208.37
#> 
#> Step:  AIC=189.52
#> class ~ age + sex + cp + trestbps + fbs + thalach + exang + oldpeak + 
#>     slope + ca + thal
#> 
#>            Df Deviance    AIC
#> - age       1   150.46 188.46
#> - fbs       1   150.48 188.48
#> <none>          149.52 189.52
#> - trestbps  1   152.23 190.23
#> - exang     1   152.50 190.50
#> - thalach   1   152.71 190.71
#> - thal      3   156.90 190.90
#> - oldpeak   1   153.73 191.73
#> - sex       1   154.42 192.42
#> - slope     2   157.19 193.19
#> - cp        3   170.63 204.63
#> - ca        4   175.16 207.16
#> 
#> Step:  AIC=188.46
#> class ~ sex + cp + trestbps + fbs + thalach + exang + oldpeak + 
#>     slope + ca + thal
#> 
#>            Df Deviance    AIC
#> - fbs       1   151.31 187.31
#> <none>          150.46 188.46
#> - trestbps  1   152.47 188.47
#> - thalach   1   152.81 188.81
#> - exang     1   153.37 189.37
#> - thal      3   157.87 189.87
#> - oldpeak   1   155.18 191.18
#> - slope     2   157.85 191.85
#> - sex       1   155.90 191.90
#> - cp        3   171.75 203.75
#> - ca        4   175.97 205.97
#> 
#> Step:  AIC=187.31
#> class ~ sex + cp + trestbps + thalach + exang + oldpeak + slope + 
#>     ca + thal
#> 
#>            Df Deviance    AIC
#> - trestbps  1   153.16 187.16
#> <none>          151.31 187.31
#> - thalach   1   153.72 187.72
#> - exang     1   154.15 188.15
#> - thal      3   158.79 188.79
#> - oldpeak   1   156.47 190.47
#> - slope     2   158.57 190.57
#> - sex       1   156.60 190.60
#> - ca        4   175.97 203.97
#> - cp        3   173.99 203.99
#> 
#> Step:  AIC=187.16
#> class ~ sex + cp + thalach + exang + oldpeak + slope + ca + thal
#> 
#>           Df Deviance    AIC
#> - thalach  1   155.13 187.13
#> <none>         153.16 187.16
#> - exang    1   155.93 187.93
#> - thal     3   161.03 189.03
#> - sex      1   157.39 189.39
#> - slope    2   160.14 190.14
#> - oldpeak  1   159.57 191.57
#> - cp       3   175.12 203.12
#> - ca       4   177.13 203.13
#> 
#> Step:  AIC=187.13
#> class ~ sex + cp + exang + oldpeak + slope + ca + thal
#> 
#>           Df Deviance    AIC
#> <none>         155.13 187.13
#> - exang    1   158.54 188.54
#> - sex      1   158.63 188.63
#> - thal     3   163.66 189.66
#> - oldpeak  1   162.24 192.24
#> - slope    2   165.43 193.43
#> - ca       4   181.04 205.04
#> - cp       3   180.06 206.06

model_backward

#> 
#> Call:  glm(formula = class ~ sex + cp + exang + oldpeak + slope + ca + 
#>     thal, family = "binomial", data = cl_train)
#> 
#> Coefficients:
#> (Intercept)         sex1          cp2          cp3          cp4       exang1  
#>    -18.7917       1.0037       1.4235       0.3478       2.6281       0.9066  
#>     oldpeak       slope2       slope3        ca0.0        ca1.0        ca2.0  
#>      0.6464       1.4996       0.3402      15.2757      16.9758      18.2221  
#>       ca3.0      thal3.0      thal6.0      thal7.0  
#>     17.0174      -1.8197      -1.6453      -0.4763  
#> 
#> Degrees of Freedom: 241 Total (i.e. Null);  226 Residual
#> Null Deviance:       333.5 
#> Residual Deviance: 155.1     AIC: 187.1

summary(model_backward)

#> 
#> Call:
#> glm(formula = class ~ sex + cp + exang + oldpeak + slope + ca + 
#>     thal, family = "binomial", data = cl_train)
#> 
#> Coefficients:
#>              Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)  -18.7917  1125.7802  -0.017  0.98668    
#> sex1           1.0037     0.5457   1.839  0.06588 .  
#> cp2            1.4235     0.8938   1.593  0.11124    
#> cp3            0.3478     0.7888   0.441  0.65931    
#> cp4            2.6281     0.7863   3.343  0.00083 ***
#> exang1         0.9066     0.4888   1.855  0.06364 .  
#> oldpeak        0.6464     0.2540   2.545  0.01091 *  
#> slope2         1.4996     0.5019   2.988  0.00281 ** 
#> slope3         0.3402     0.9532   0.357  0.72115    
#> ca0.0         15.2757  1125.7739   0.014  0.98917    
#> ca1.0         16.9758  1125.7740   0.015  0.98797    
#> ca2.0         18.2221  1125.7742   0.016  0.98709    
#> ca3.0         17.0174  1125.7743   0.015  0.98794    
#> thal3.0       -1.8197     3.6157  -0.503  0.61476    
#> thal6.0       -1.6453     3.6983  -0.445  0.65641    
#> thal7.0       -0.4763     3.6224  -0.131  0.89539    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 333.48  on 241  degrees of freedom
#> Residual deviance: 155.13  on 226  degrees of freedom
#> AIC: 187.13
#> 
#> Number of Fisher Scoring iterations: 15

model_all$deviance

#> [1] 147.5159

model_backward$deviance

#> [1] 155.1281

model_all$aic

#> [1] 193.5159

model_backward$aic

#> [1] 187.1281

Based on deviance value, model_all is better than model_backward, but based on AIC value, model_backward is better than model_all. We can conclude that both are good models because the difference in values between the two models is not significant. Then we will continue using model_backward.

Logistic Regression Assumptions

1. Linearity of Predictor & Log of Odds
For numeric variables, a 1-value increase will increase the log of odds. Then we will check on numeric variables (oldpeak)

exp(model_backward$coefficients[7])

#>  oldpeak 
#> 1.908686

An increase of 1 oldpeak value will increase the possibility of someone having heart disease by 1.9 times. This is by understanding the oldpeak value, namely ST depression induced by exercise relative to rest. Based on sources regarding how to read an EKG from Siloamhospital and Alomedika, it can be seen that the higher the ST depression value, the higher the possibility of a patient having a heart disorder. So this model_backward fulfills the assumption of linearity of predictor & log of odds.

2. Independence of Observations
All of measurement on this data set is independent. There is no redundant observation as there is no duplicate data on this data set.

3. Multicollinearity
We will use vif() function from carpackage. The threshold of VIF value is below than 10.

vif(model_backward)

#>             GVIF Df GVIF^(1/(2*Df))
#> sex     1.471733  1        1.213150
#> cp      1.731642  3        1.095830
#> exang   1.150309  1        1.072525
#> oldpeak 1.568623  1        1.252447
#> slope   1.733007  2        1.147361
#> ca      1.534643  4        1.054996
#> thal    1.435561  3        1.062112

For glm() models use GVIF^(1/(2*df)) to see the multicollinearity values. All of model_backward variables can fulfill the threshold, so there is no multicollinearity in this model.

Prediction using Logistic Regression Model

p_heartcl <- predict(object = model_all,
        newdata = head(cl_test),
        type = "response")
ifelse(p_heartcl > 0.5, yes = 1, no = 0)

#> 1 2 3 4 5 6 
#> 0 0 0 1 0 0

cl_test$pred_risk <- predict(object = model_backward,
        newdata = cl_test,
        type = "response")
head(cl_test, 10)

#>    age sex cp trestbps chol fbs restecg thalach exang oldpeak slope  ca thal
#> 1   41   0  2      130  204   0       2     172     0     1.4     1 0.0  3.0
#> 2   44   1  2      120  263   0       0     173     0     0.0     1 0.0  7.0
#> 3   58   1  2      120  284   0       2     160     0     1.8     2 0.0  3.0
#> 4   59   1  4      135  234   0       0     161     0     0.5     2 0.0  7.0
#> 5   42   1  4      140  226   0       0     178     0     0.0     1 0.0  3.0
#> 6   61   1  3      150  243   1       0     137     1     1.0     2 0.0  3.0
#> 7   65   0  4      150  225   0       2     114     0     1.0     2 3.0  7.0
#> 8   41   0  2      105  198   0       0     168     0     0.0     1 1.0  3.0
#> 9   44   1  4      112  290   0       2     153     0     0.0     1 1.0  3.0
#> 10  50   1  3      140  233   0       0     163     0     0.6     2 1.0  7.0
#>    class  pred_risk
#> 1      0 0.04709934
#> 2      0 0.17290664
#> 3      1 0.43895054
#> 4      0 0.81186956
#> 5      0 0.15394398
#> 6      0 0.28256832
#> 7      1 0.92577208
#> 8      0 0.09866771
#> 9      1 0.49902984
#> 10     1 0.72043776

Classify the cl_test data based on pred_risk and store it in a new column named pred_label using ifelse.

cl_test$pred_label <- ifelse(cl_test$pred_risk > 0.5, yes = 1, no = 0)
head(cl_test,10)

#>    age sex cp trestbps chol fbs restecg thalach exang oldpeak slope  ca thal
#> 1   41   0  2      130  204   0       2     172     0     1.4     1 0.0  3.0
#> 2   44   1  2      120  263   0       0     173     0     0.0     1 0.0  7.0
#> 3   58   1  2      120  284   0       2     160     0     1.8     2 0.0  3.0
#> 4   59   1  4      135  234   0       0     161     0     0.5     2 0.0  7.0
#> 5   42   1  4      140  226   0       0     178     0     0.0     1 0.0  3.0
#> 6   61   1  3      150  243   1       0     137     1     1.0     2 0.0  3.0
#> 7   65   0  4      150  225   0       2     114     0     1.0     2 3.0  7.0
#> 8   41   0  2      105  198   0       0     168     0     0.0     1 1.0  3.0
#> 9   44   1  4      112  290   0       2     153     0     0.0     1 1.0  3.0
#> 10  50   1  3      140  233   0       0     163     0     0.6     2 1.0  7.0
#>    class  pred_risk pred_label
#> 1      0 0.04709934          0
#> 2      0 0.17290664          0
#> 3      1 0.43895054          0
#> 4      0 0.81186956          1
#> 5      0 0.15394398          0
#> 6      0 0.28256832          0
#> 7      1 0.92577208          1
#> 8      0 0.09866771          0
#> 9      1 0.49902984          0
#> 10     1 0.72043776          1

We will see the result of prediction with make aggregate table.

cl_test %>% 
  select(class,
         pred_risk,
         pred_label) %>%
  head(10)

#>    class  pred_risk pred_label
#> 1      0 0.04709934          0
#> 2      0 0.17290664          0
#> 3      1 0.43895054          0
#> 4      0 0.81186956          1
#> 5      0 0.15394398          0
#> 6      0 0.28256832          0
#> 7      1 0.92577208          1
#> 8      0 0.09866771          0
#> 9      1 0.49902984          0
#> 10     1 0.72043776          1

K-Nearest Neighbour (KNN)

Pre-processing data

Because of KNN is not good for categorical data, then we will make new data frame which only include numeric variable and 3 level and more categorical variable. This categorical variable will be convert into numeric.

head(cl_clean)

#>   age sex cp trestbps chol fbs restecg thalach exang oldpeak slope  ca thal
#> 1  63   1  1      145  233   1       2     150     0     2.3     3 0.0  6.0
#> 2  67   1  4      160  286   0       2     108     1     1.5     2 3.0  3.0
#> 3  67   1  4      120  229   0       2     129     1     2.6     2 2.0  7.0
#> 4  37   1  3      130  250   0       0     187     0     3.5     3 0.0  3.0
#> 5  41   0  2      130  204   0       2     172     0     1.4     1 0.0  3.0
#> 6  56   1  2      120  236   0       0     178     0     0.8     1 0.0  3.0
#>   class
#> 1     0
#> 2     1
#> 3     1
#> 4     0
#> 5     0
#> 6     0

cl_knn <- cl_clean %>% 
  select(-c("sex", "fbs", "exang", )) %>% 
  mutate_at(.vars = c("cp","restecg", "slope", "ca", "thal"), 
            .funs = as.numeric)
cl_knn

#>     age cp trestbps chol restecg thalach oldpeak slope ca thal class
#> 1    63  1      145  233       3     150     2.3     3  2    3     0
#> 2    67  4      160  286       3     108     1.5     2  5    2     1
#> 3    67  4      120  229       3     129     2.6     2  4    4     1
#> 4    37  3      130  250       1     187     3.5     3  2    2     0
#> 5    41  2      130  204       3     172     1.4     1  2    2     0
#> 6    56  2      120  236       1     178     0.8     1  2    2     0
#> 7    62  4      140  268       3     160     3.6     3  4    2     1
#> 8    57  4      120  354       1     163     0.6     1  2    2     0
#> 9    63  4      130  254       3     147     1.4     2  3    4     1
#> 10   53  4      140  203       3     155     3.1     3  2    4     1
#> 11   57  4      140  192       1     148     0.4     2  2    3     0
#> 12   56  2      140  294       3     153     1.3     2  2    2     0
#> 13   56  3      130  256       3     142     0.6     2  3    3     1
#> 14   44  2      120  263       1     173     0.0     1  2    4     0
#> 15   52  3      172  199       1     162     0.5     1  2    4     0
#> 16   57  3      150  168       1     174     1.6     1  2    2     0
#> 17   48  2      110  229       1     168     1.0     3  2    4     1
#> 18   54  4      140  239       1     160     1.2     1  2    2     0
#> 19   48  3      130  275       1     139     0.2     1  2    2     0
#> 20   49  2      130  266       1     171     0.6     1  2    2     0
#> 21   64  1      110  211       3     144     1.8     2  2    2     0
#> 22   58  1      150  283       3     162     1.0     1  2    2     0
#> 23   58  2      120  284       3     160     1.8     2  2    2     1
#> 24   58  3      132  224       3     173     3.2     1  4    4     1
#> 25   60  4      130  206       3     132     2.4     2  4    4     1
#> 26   50  3      120  219       1     158     1.6     2  2    2     0
#> 27   58  3      120  340       1     172     0.0     1  2    2     0
#> 28   66  1      150  226       1     114     2.6     3  2    2     0
#> 29   43  4      150  247       1     171     1.5     1  2    2     0
#> 30   40  4      110  167       3     114     2.0     2  2    4     1
#> 31   69  1      140  239       1     151     1.8     1  4    2     0
#> 32   60  4      117  230       1     160     1.4     1  4    4     1
#> 33   64  3      140  335       1     158     0.0     1  2    2     1
#> 34   59  4      135  234       1     161     0.5     2  2    4     0
#> 35   44  3      130  233       1     179     0.4     1  2    2     0
#> 36   42  4      140  226       1     178     0.0     1  2    2     0
#> 37   43  4      120  177       3     120     2.5     2  2    4     1
#> 38   57  4      150  276       3     112     0.6     2  3    3     1
#> 39   55  4      132  353       1     132     1.2     2  3    4     1
#> 40   61  3      150  243       1     137     1.0     2  2    2     0
#> 41   65  4      150  225       3     114     1.0     2  5    4     1
#> 42   40  1      140  199       1     178     1.4     1  2    4     0
#> 43   71  2      160  302       1     162     0.4     1  4    2     0
#> 44   59  3      150  212       1     157     1.6     1  2    2     0
#> 45   61  4      130  330       3     169     0.0     1  2    2     1
#> 46   58  3      112  230       3     165     2.5     2  3    4     1
#> 47   51  3      110  175       1     123     0.6     1  2    2     0
#> 48   50  4      150  243       3     128     2.6     2  2    4     1
#> 49   65  3      140  417       3     157     0.8     1  3    2     0
#> 50   53  3      130  197       3     152     1.2     3  2    2     0
#> 51   41  2      105  198       1     168     0.0     1  3    2     0
#> 52   65  4      120  177       1     140     0.4     1  2    4     0
#> 53   44  4      112  290       3     153     0.0     1  3    2     1
#> 54   44  2      130  219       3     188     0.0     1  2    2     0
#> 55   60  4      130  253       1     144     1.4     1  3    4     1
#> 56   54  4      124  266       3     109     2.2     2  3    4     1
#> 57   50  3      140  233       1     163     0.6     2  3    4     1
#> 58   41  4      110  172       3     158     0.0     1  2    4     1
#> 59   54  3      125  273       3     152     0.5     3  3    2     0
#> 60   51  1      125  213       3     125     1.4     1  3    2     0
#> 61   51  4      130  305       1     142     1.2     2  2    4     1
#> 62   46  3      142  177       3     160     1.4     3  2    2     0
#> 63   58  4      128  216       3     131     2.2     2  5    4     1
#> 64   54  3      135  304       1     170     0.0     1  2    2     0
#> 65   54  4      120  188       1     113     1.4     2  3    4     1
#> 66   60  4      145  282       3     142     2.8     2  4    4     1
#> 67   60  3      140  185       3     155     3.0     2  2    2     1
#> 68   54  3      150  232       3     165     1.6     1  2    4     0
#> 69   59  4      170  326       3     140     3.4     3  2    4     1
#> 70   46  3      150  231       1     147     3.6     2  2    2     1
#> 71   65  3      155  269       1     148     0.8     1  2    2     0
#> 72   67  4      125  254       1     163     0.2     2  4    4     1
#> 73   62  4      120  267       1      99     1.8     2  4    4     1
#> 74   65  4      110  248       3     158     0.6     1  4    3     1
#> 75   44  4      110  197       3     177     0.0     1  3    2     1
#> 76   65  3      160  360       3     151     0.8     1  2    2     0
#> 77   60  4      125  258       3     141     2.8     2  3    4     1
#> 78   51  3      140  308       3     142     1.5     1  3    2     0
#> 79   48  2      130  245       3     180     0.2     2  2    2     0
#> 80   58  4      150  270       3     111     0.8     1  2    4     1
#> 81   45  4      104  208       3     148     3.0     2  2    2     0
#> 82   53  4      130  264       3     143     0.4     2  2    2     0
#> 83   39  3      140  321       3     182     0.0     1  2    2     0
#> 84   68  3      180  274       3     150     1.6     2  2    4     1
#> 85   52  2      120  325       1     172     0.2     1  2    2     0
#> 86   44  3      140  235       3     180     0.0     1  2    2     0
#> 87   47  3      138  257       3     156     0.0     1  2    2     0
#> 88   53  3      128  216       3     115     0.0     1  2    1     0
#> 89   53  4      138  234       3     160     0.0     1  2    2     0
#> 90   51  3      130  256       3     149     0.5     1  2    2     0
#> 91   66  4      120  302       3     151     0.4     2  2    2     0
#> 92   62  4      160  164       3     145     6.2     3  5    4     1
#> 93   62  3      130  231       1     146     1.8     2  5    4     0
#> 94   44  3      108  141       1     175     0.6     2  2    2     0
#> 95   63  3      135  252       3     172     0.0     1  2    2     0
#> 96   52  4      128  255       1     161     0.0     1  3    4     1
#> 97   59  4      110  239       3     142     1.2     2  3    4     1
#> 98   60  4      150  258       3     157     2.6     2  4    4     1
#> 99   52  2      134  201       1     158     0.8     1  3    2     0
#> 100  48  4      122  222       3     186     0.0     1  2    2     0
#> 101  45  4      115  260       3     185     0.0     1  2    2     0
#> 102  34  1      118  182       3     174     0.0     1  2    2     0
#> 103  57  4      128  303       3     159     0.0     1  3    2     0
#> 104  71  3      110  265       3     130     0.0     1  3    2     0
#> 105  49  3      120  188       1     139     2.0     2  5    4     1
#> 106  54  2      108  309       1     156     0.0     1  2    4     0
#> 107  59  4      140  177       1     162     0.0     1  3    4     1
#> 108  57  3      128  229       3     150     0.4     2  3    4     1
#> 109  61  4      120  260       1     140     3.6     2  3    4     1
#> 110  39  4      118  219       1     140     1.2     2  2    4     1
#> 111  61  4      145  307       3     146     1.0     2  2    4     1
#> 112  56  4      125  249       3     144     1.2     2  3    2     1
#> 113  52  1      118  186       3     190     0.0     2  2    3     0
#> 114  43  4      132  341       3     136     3.0     2  2    4     1
#> 115  62  3      130  263       1      97     1.2     2  3    4     1
#> 116  41  2      135  203       1     132     0.0     2  2    3     0
#> 117  58  3      140  211       3     165     0.0     1  2    2     0
#> 118  35  4      138  183       1     182     1.4     1  2    2     0
#> 119  63  4      130  330       3     132     1.8     1  5    4     1
#> 120  65  4      135  254       3     127     2.8     2  3    4     1
#> 121  48  4      130  256       3     150     0.0     1  4    4     1
#> 122  63  4      150  407       3     154     4.0     2  5    4     1
#> 123  51  3      100  222       1     143     1.2     2  2    2     0
#> 124  55  4      140  217       1     111     5.6     3  2    4     1
#> 125  65  1      138  282       3     174     1.4     2  3    2     1
#> 126  45  2      130  234       3     175     0.6     2  2    2     0
#> 127  56  4      200  288       3     133     4.0     3  4    4     1
#> 128  54  4      110  239       1     126     2.8     2  3    4     1
#> 129  44  2      120  220       1     170     0.0     1  2    2     0
#> 130  62  4      124  209       1     163     0.0     1  2    2     0
#> 131  54  3      120  258       3     147     0.4     2  2    4     0
#> 132  51  3       94  227       1     154     0.0     1  3    4     0
#> 133  29  2      130  204       3     202     0.0     1  2    2     0
#> 134  51  4      140  261       3     186     0.0     1  2    2     0
#> 135  43  3      122  213       1     165     0.2     2  2    2     0
#> 136  55  2      135  250       3     161     1.4     2  2    2     0
#> 137  70  4      145  174       1     125     2.6     3  2    4     1
#> 138  62  2      120  281       3     103     1.4     2  3    4     1
#> 139  35  4      120  198       1     130     1.6     2  2    4     1
#> 140  51  3      125  245       3     166     2.4     2  2    2     0
#> 141  59  2      140  221       1     164     0.0     1  2    2     0
#> 142  59  1      170  288       3     159     0.2     2  2    4     1
#> 143  52  2      128  205       1     184     0.0     1  2    2     0
#> 144  64  3      125  309       1     131     1.8     2  2    4     1
#> 145  58  3      105  240       3     154     0.6     2  2    4     0
#> 146  47  3      108  243       1     152     0.0     1  2    2     1
#> 147  57  4      165  289       3     124     1.0     2  5    4     1
#> 148  41  3      112  250       1     179     0.0     1  2    2     0
#> 149  45  2      128  308       3     170     0.0     1  2    2     0
#> 150  60  3      102  318       1     160     0.0     1  3    2     0
#> 151  52  1      152  298       1     178     1.2     2  2    4     0
#> 152  42  4      102  265       3     122     0.6     2  2    2     0
#> 153  67  3      115  564       3     160     1.6     2  2    4     0
#> 154  55  4      160  289       3     145     0.8     2  3    4     1
#> 155  64  4      120  246       3      96     2.2     3  3    2     1
#> 156  70  4      130  322       3     109     2.4     2  5    2     1
#> 157  51  4      140  299       1     173     1.6     1  2    4     1
#> 158  58  4      125  300       3     171     0.0     1  4    4     1
#> 159  60  4      140  293       3     170     1.2     2  4    4     1
#> 160  68  3      118  277       1     151     1.0     1  3    4     0
#> 161  46  2      101  197       1     156     0.0     1  2    4     0
#> 162  77  4      125  304       3     162     0.0     1  5    2     1
#> 163  54  3      110  214       1     158     1.6     2  2    2     0
#> 164  58  4      100  248       3     122     1.0     2  2    2     0
#> 165  48  3      124  255       1     175     0.0     1  4    2     0
#> 166  57  4      132  207       1     168     0.0     1  2    4     0
#> 167  52  3      138  223       1     169     0.0     1  1    2     0
#> 168  54  2      132  288       3     159     0.0     1  3    2     0
#> 169  35  4      126  282       3     156     0.0     1  2    4     1
#> 170  45  2      112  160       1     138     0.0     2  2    2     0
#> 171  70  3      160  269       1     112     2.9     2  3    4     1
#> 172  53  4      142  226       3     111     0.0     1  2    4     0
#> 173  59  4      174  249       1     143     0.0     2  2    2     1
#> 174  62  4      140  394       3     157     1.2     2  2    2     0
#> 175  64  4      145  212       3     132     2.0     2  4    3     1
#> 176  57  4      152  274       1      88     1.2     2  3    4     1
#> 177  52  4      108  233       1     147     0.1     1  5    4     0
#> 178  56  4      132  184       3     105     2.1     2  3    3     1
#> 179  43  3      130  315       1     162     1.9     1  3    2     0
#> 180  53  3      130  246       3     173     0.0     1  5    2     0
#> 181  48  4      124  274       3     166     0.5     2  2    4     1
#> 182  56  4      134  409       3     150     1.9     2  4    4     1
#> 183  42  1      148  244       3     178     0.8     1  4    2     0
#> 184  59  1      178  270       3     145     4.2     3  2    4     0
#> 185  60  4      158  305       3     161     0.0     1  2    2     1
#> 186  63  2      140  195       1     179     0.0     1  4    2     0
#> 187  42  3      120  240       1     194     0.8     3  2    4     0
#> 188  66  2      160  246       1     120     0.0     2  5    3     1
#> 189  54  2      192  283       3     195     0.0     1  3    4     1
#> 190  69  3      140  254       3     146     2.0     2  5    4     1
#> 191  50  3      129  196       1     163     0.0     1  2    2     0
#> 192  51  4      140  298       1     122     4.2     2  5    4     1
#> 193  43  4      132  247       3     143     0.1     2  1    4     1
#> 194  62  4      138  294       1     106     1.9     2  5    2     1
#> 195  68  3      120  211       3     115     1.5     2  2    2     0
#> 196  67  4      100  299       3     125     0.9     2  4    2     1
#> 197  69  1      160  234       3     131     0.1     2  3    2     0
#> 198  45  4      138  236       3     152     0.2     2  2    2     0
#> 199  50  2      120  244       1     162     1.1     1  2    2     0
#> 200  59  1      160  273       3     125     0.0     1  2    2     1
#> 201  50  4      110  254       3     159     0.0     1  2    2     0
#> 202  64  4      180  325       1     154     0.0     1  2    2     0
#> 203  57  3      150  126       1     173     0.2     1  3    4     0
#> 204  64  3      140  313       1     133     0.2     1  2    4     0
#> 205  43  4      110  211       1     161     0.0     1  2    4     0
#> 206  45  4      142  309       3     147     0.0     2  5    4     1
#> 207  58  4      128  259       3     130     3.0     2  4    4     1
#> 208  50  4      144  200       3     126     0.9     2  2    4     1
#> 209  55  2      130  262       1     155     0.0     1  2    2     0
#> 210  62  4      150  244       1     154     1.4     2  2    2     1
#> 211  37  3      120  215       1     170     0.0     1  2    2     0
#> 212  38  1      120  231       1     182     3.8     2  2    4     1
#> 213  41  3      130  214       3     168     2.0     2  2    2     0
#> 214  66  4      178  228       1     165     1.0     2  4    4     1
#> 215  52  4      112  230       1     160     0.0     1  3    2     1
#> 216  56  1      120  193       3     162     1.9     2  2    4     0
#> 217  46  2      105  204       1     172     0.0     1  2    2     0
#> 218  46  4      138  243       3     152     0.0     2  2    2     0
#> 219  64  4      130  303       1     122     2.0     2  4    2     0
#> 220  59  4      138  271       3     182     0.0     1  2    2     0
#> 221  41  3      112  268       3     172     0.0     1  2    2     0
#> 222  54  3      108  267       3     167     0.0     1  2    2     0
#> 223  39  3       94  199       1     179     0.0     1  2    2     0
#> 224  53  4      123  282       1      95     2.0     2  4    4     1
#> 225  63  4      108  269       1     169     1.8     2  4    2     1
#> 226  34  2      118  210       1     192     0.7     1  2    2     0
#> 227  47  4      112  204       1     143     0.1     1  2    2     0
#> 228  67  3      152  277       1     172     0.0     1  3    2     0
#> 229  54  4      110  206       3     108     0.0     2  3    2     1
#> 230  66  4      112  212       3     132     0.1     1  3    2     1
#> 231  52  3      136  196       3     169     0.1     2  2    2     0
#> 232  55  4      180  327       2     117     3.4     2  2    2     1
#> 233  49  3      118  149       3     126     0.8     1  5    2     1
#> 234  74  2      120  269       3     121     0.2     1  3    2     0
#> 235  54  3      160  201       1     163     0.0     1  3    2     0
#> 236  54  4      122  286       3     116     3.2     2  4    2     1
#> 237  56  4      130  283       3     103     1.6     3  2    4     1
#> 238  46  4      120  249       3     144     0.8     1  2    4     1
#> 239  49  2      134  271       1     162     0.0     2  2    2     0
#> 240  42  2      120  295       1     162     0.0     1  2    2     0
#> 241  41  2      110  235       1     153     0.0     1  2    2     0
#> 242  41  2      126  306       1     163     0.0     1  2    2     0
#> 243  49  4      130  269       1     163     0.0     1  2    2     0
#> 244  61  1      134  234       1     145     2.6     2  4    2     1
#> 245  60  3      120  178       1      96     0.0     1  2    2     0
#> 246  67  4      120  237       1      71     1.0     2  2    2     1
#> 247  58  4      100  234       1     156     0.1     1  3    4     1
#> 248  47  4      110  275       3     118     1.0     2  3    2     1
#> 249  52  4      125  212       1     168     1.0     1  4    4     1
#> 250  62  2      128  208       3     140     0.0     1  2    2     0
#> 251  57  4      110  201       1     126     1.5     2  2    3     0
#> 252  58  4      146  218       1     105     2.0     2  3    4     1
#> 253  64  4      128  263       1     105     0.2     2  3    4     0
#> 254  51  3      120  295       3     157     0.6     1  2    2     0
#> 255  43  4      115  303       1     181     1.2     2  2    2     0
#> 256  42  3      120  209       1     173     0.0     2  2    2     0
#> 257  67  4      106  223       1     142     0.3     1  4    2     0
#> 258  76  3      140  197       2     116     1.1     2  2    2     0
#> 259  70  2      156  245       3     143     0.0     1  2    2     0
#> 260  57  2      124  261       1     141     0.3     1  2    4     1
#> 261  44  3      118  242       1     149     0.3     2  3    2     0
#> 262  58  2      136  319       3     152     0.0     1  4    2     1
#> 263  60  1      150  240       1     171     0.9     1  2    2     0
#> 264  44  3      120  226       1     169     0.0     1  2    2     0
#> 265  61  4      138  166       3     125     3.6     2  3    2     1
#> 266  42  4      136  315       1     125     1.8     2  2    3     1
#> 267  52  4      128  204       1     156     1.0     2  2    1     1
#> 268  59  3      126  218       1     134     2.2     2  3    3     1
#> 269  40  4      152  223       1     181     0.0     1  2    4     1
#> 270  42  3      130  180       1     150     0.0     1  2    2     0
#> 271  61  4      140  207       3     138     1.9     1  3    4     1
#> 272  66  4      160  228       3     138     2.3     1  2    3     0
#> 273  46  4      140  311       1     120     1.8     2  4    4     1
#> 274  71  4      112  149       1     125     1.6     2  2    2     0
#> 275  59  1      134  204       1     162     0.8     1  4    2     1
#> 276  64  1      170  227       3     155     0.6     2  2    4     0
#> 277  66  3      146  278       3     152     0.0     2  3    2     0
#> 278  39  3      138  220       1     152     0.0     2  2    2     0
#> 279  57  2      154  232       3     164     0.0     1  3    2     1
#> 280  58  4      130  197       1     131     0.6     2  2    2     0
#> 281  57  4      110  335       1     143     3.0     2  3    4     1
#> 282  47  3      130  253       1     179     0.0     1  2    2     0
#> 283  55  4      128  205       2     130     2.0     2  3    4     1
#> 284  35  2      122  192       1     174     0.0     1  2    2     0
#> 285  61  4      148  203       1     161     0.0     1  3    4     1
#> 286  58  4      114  318       2     140     4.4     3  5    3     1
#> 287  58  4      170  225       3     146     2.8     2  4    3     1
#> 288  58  2      125  220       1     144     0.4     2  1    4     0
#> 289  56  2      130  221       3     163     0.0     1  2    4     0
#> 290  56  2      120  240       1     169     0.0     3  2    2     0
#> 291  67  3      152  212       3     150     0.8     2  2    4     1
#> 292  55  2      132  342       1     166     1.2     1  2    2     0
#> 293  44  4      120  169       1     144     2.8     3  2    3     1
#> 294  63  4      140  187       3     144     4.0     1  4    4     1
#> 295  63  4      124  197       1     136     0.0     2  2    2     1
#> 296  41  2      120  157       1     182     0.0     1  2    2     0
#> 297  59  4      164  176       3      90     1.0     2  4    3     1
#> 298  57  4      140  241       1     123     0.2     2  2    4     1
#> 299  45  1      110  264       1     132     1.2     2  2    4     1
#> 300  68  4      144  193       1     141     3.4     2  4    4     1
#> 301  57  4      130  131       1     115     1.2     2  3    4     1
#> 302  57  2      130  236       3     174     0.0     2  3    2     1
#> 303  38  3      138  175       1     173     0.0     1  1    2     0

Cross Validation

RNGkind(sample.kind = "Rounding")
set.seed(150)

splitter <- initial_split(data = cl_knn, prop = 0.8)

train_knn <- training(splitter)
test_knn <- testing(splitter)

Then we will check the balance of train_knn data.

table(train_knn$class) %>% 
  prop.table()

#> 
#>         0         1 
#> 0.5454545 0.4545455

Our data has a good balance, it’s not be biased. Then, we can continue making the prediction using KNN method.

Prediction using KNN method

Split the train data and test data become predictor and target data.
predictor as x data
target as y data

cl_train_x <- train_knn %>% select(-class)

cl_test_x <- test_knn %>% select(-class)

cl_train_y <- train_knn$class

cl_test_y <- test_knn$class

Then we need to scaling the data for ensure that all variables have the same scale. Scaling process use scale() function.

cl_train_xs <- scale(cl_train_x)

cl_test_xs <- scale(cl_test_x,
                        center = attr(cl_train_xs, "scaled:center"),
                        scale = attr(cl_train_xs, "scaled:scale"))

Find the optimum k value

sqrt(nrow(train_knn))

#> [1] 15.55635

Because our class is 2 class (even), then k value must be an odd. We will use k = 15

KNN prediction will use knn() function from class library.

cl_knn <- knn(train = cl_train_xs,
                  test = cl_test_xs,
                  cl = cl_train_y,
                  k = 15
                  )
head(cl_knn)

#> [1] 0 0 0 1 0 0
#> Levels: 0 1

Save the prediction results into the test data. Because KNN makes predictions directly on the test data

test_knn$pred_label <- cl_knn
test_knn

#>    age cp trestbps chol restecg thalach oldpeak slope ca thal class pred_label
#> 1   41  2      130  204       3     172     1.4     1  2    2     0          0
#> 2   44  2      120  263       1     173     0.0     1  2    4     0          0
#> 3   58  2      120  284       3     160     1.8     2  2    2     1          0
#> 4   59  4      135  234       1     161     0.5     2  2    4     0          1
#> 5   42  4      140  226       1     178     0.0     1  2    2     0          0
#> 6   61  3      150  243       1     137     1.0     2  2    2     0          0
#> 7   65  4      150  225       3     114     1.0     2  5    4     1          1
#> 8   41  2      105  198       1     168     0.0     1  3    2     0          0
#> 9   44  4      112  290       3     153     0.0     1  3    2     1          0
#> 10  50  3      140  233       1     163     0.6     2  3    4     1          1
#> 11  51  1      125  213       3     125     1.4     1  3    2     0          0
#> 12  46  3      142  177       3     160     1.4     3  2    2     0          0
#> 13  58  4      128  216       3     131     2.2     2  5    4     1          1
#> 14  58  4      150  270       3     111     0.8     1  2    4     1          1
#> 15  45  4      104  208       3     148     3.0     2  2    2     0          1
#> 16  53  4      130  264       3     143     0.4     2  2    2     0          0
#> 17  39  3      140  321       3     182     0.0     1  2    2     0          0
#> 18  68  3      180  274       3     150     1.6     2  2    4     1          1
#> 19  52  2      120  325       1     172     0.2     1  2    2     0          0
#> 20  51  3      130  256       3     149     0.5     1  2    2     0          0
#> 21  62  4      160  164       3     145     6.2     3  5    4     1          1
#> 22  52  4      128  255       1     161     0.0     1  3    4     1          1
#> 23  49  3      120  188       1     139     2.0     2  5    4     1          1
#> 24  39  4      118  219       1     140     1.2     2  2    4     1          1
#> 25  48  4      130  256       3     150     0.0     1  4    4     1          1
#> 26  65  1      138  282       3     174     1.4     2  3    2     1          0
#> 27  45  2      130  234       3     175     0.6     2  2    2     0          0
#> 28  56  4      200  288       3     133     4.0     3  4    4     1          1
#> 29  29  2      130  204       3     202     0.0     1  2    2     0          0
#> 30  51  4      140  261       3     186     0.0     1  2    2     0          0
#> 31  59  1      170  288       3     159     0.2     2  2    4     1          0
#> 32  45  2      128  308       3     170     0.0     1  2    2     0          0
#> 33  42  4      102  265       3     122     0.6     2  2    2     0          0
#> 34  67  3      115  564       3     160     1.6     2  2    4     0          1
#> 35  55  4      160  289       3     145     0.8     2  3    4     1          1
#> 36  70  4      130  322       3     109     2.4     2  5    2     1          1
#> 37  58  4      125  300       3     171     0.0     1  4    4     1          1
#> 38  51  4      140  298       1     122     4.2     2  5    4     1          1
#> 39  43  4      132  247       3     143     0.1     2  1    4     1          1
#> 40  68  3      120  211       3     115     1.5     2  2    2     0          0
#> 41  67  4      100  299       3     125     0.9     2  4    2     1          1
#> 42  45  4      138  236       3     152     0.2     2  2    2     0          0
#> 43  50  2      120  244       1     162     1.1     1  2    2     0          0
#> 44  58  4      128  259       3     130     3.0     2  4    4     1          1
#> 45  55  2      130  262       1     155     0.0     1  2    2     0          0
#> 46  56  1      120  193       3     162     1.9     2  2    4     0          0
#> 47  64  4      130  303       1     122     2.0     2  4    2     0          1
#> 48  41  3      112  268       3     172     0.0     1  2    2     0          0
#> 49  53  4      123  282       1      95     2.0     2  4    4     1          1
#> 50  66  4      112  212       3     132     0.1     1  3    2     1          0
#> 51  60  3      120  178       1      96     0.0     1  2    2     0          0
#> 52  58  4      100  234       1     156     0.1     1  3    4     1          0
#> 53  52  4      125  212       1     168     1.0     1  4    4     1          0
#> 54  62  2      128  208       3     140     0.0     1  2    2     0          0
#> 55  57  4      110  201       1     126     1.5     2  2    3     0          0
#> 56  60  1      150  240       1     171     0.9     1  2    2     0          0
#> 57  61  4      138  166       3     125     3.6     2  3    2     1          1
#> 58  71  4      112  149       1     125     1.6     2  2    2     0          0
#> 59  57  2      154  232       3     164     0.0     1  3    2     1          0
#> 60  57  4      110  335       1     143     3.0     2  3    4     1          1
#> 61  56  2      130  221       3     163     0.0     1  2    4     0          0

Evaluation Logistic Regression Model and KNN Model

We will evaluate the model using confusion matrix from caret library
## The Logistic Regression Model

confusionMatrix(data = as.factor(cl_test$pred_label),
                reference = cl_test$class, 
                positive = "1")

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 28  6
#>          1  4 23
#>                                           
#>                Accuracy : 0.8361          
#>                  95% CI : (0.7191, 0.9185)
#>     No Information Rate : 0.5246          
#>     P-Value [Acc > NIR] : 0.0000003442    
#>                                           
#>                   Kappa : 0.6703          
#>                                           
#>  Mcnemar's Test P-Value : 0.7518          
#>                                           
#>             Sensitivity : 0.7931          
#>             Specificity : 0.8750          
#>          Pos Pred Value : 0.8519          
#>          Neg Pred Value : 0.8235          
#>              Prevalence : 0.4754          
#>          Detection Rate : 0.3770          
#>    Detection Prevalence : 0.4426          
#>       Balanced Accuracy : 0.8341          
#>                                           
#>        'Positive' Class : 1               
#>

The outcome demonstrates that, on the test dataset, our logistic model has an accuracy of 83.61%, which indicates that 83.61% of our data is correctly classified. The corresponding values for specificity and sensitivity are 79.31% and 87.50%, respectively. While a tiny percentage of adverse outcomes are accurately categorized, the most favorable outcomes are. According to the precision/positive anticipated value of 85.19%, 85.19% of our optimistic prediction is correct.

KNN

confusionMatrix(data = test_knn$pred_label,
                reference = test_knn$class,
                positive = "1")

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 28  8
#>          1  4 21
#>                                          
#>                Accuracy : 0.8033         
#>                  95% CI : (0.6816, 0.894)
#>     No Information Rate : 0.5246         
#>     P-Value [Acc > NIR] : 0.000005828    
#>                                          
#>                   Kappa : 0.603          
#>                                          
#>  Mcnemar's Test P-Value : 0.3865         
#>                                          
#>             Sensitivity : 0.7241         
#>             Specificity : 0.8750         
#>          Pos Pred Value : 0.8400         
#>          Neg Pred Value : 0.7778         
#>              Prevalence : 0.4754         
#>          Detection Rate : 0.3443         
#>    Detection Prevalence : 0.4098         
#>       Balanced Accuracy : 0.7996         
#>                                          
#>        'Positive' Class : 1              
#>

The outcome demonstrates that, on the test dataset, our K-NN with K = 15 has an accuracy of 83.33%, which indicates that 83.33% of our data is correctly classified. The corresponding values for specificity and sensitivity are 87.50% and 72.41%, respectively. While a tiny percentage of adverse outcomes are accurately categorized, the most favorable outcomes are. According to the precision/positive anticipated value of 84%, 84% of our optimistic prediction is correct.

Conclusion

Because of the patient’s psychology, we must minimize the false negative value. Based on those two methods, logistic regression and K-NN, the model’s ability to correctly predict actual data of people who are not healthy is better using logistic regression because its precision value is 85.19% greater than using the KNN method.

References

-https://archive.ics.uci.edu/dataset/45/heart+disease
-https://search.r-project.org/CRAN/refmans/kmed/html/heart.html
-https://www.mayoclinic.org/diseases-conditions/heart-disease/symptoms-causes/syc-20353118#:~:text=Heart%20disease%20describes%20a%20range,born%20with%20(congenital%20heart%20defects)
-https://www.southcoasthealth.com/posts/view/402-what-causes-heart-disease
-https://www.siloamhospitals.com/informasi-siloam/artikel/cara-membaca-ekg
-https://www.alomedika.com/diagnosis-banding-elevasi-segmen-st-pada-elektrokardiografi

Heart Disease Classification: Logistic Regression and K-Nearest Neighbor

Galuh Chynintya

2024-07-25

Introduction

Data Preparation

Prerequisites

Importing Libraries

Importing Dataset

Data Processing

Duplicates

Missing Values

Data Types

Exploratory Data Analysis

Cross Validation

Modeling

Logistic Regression

Build the model

Logistic Regression Assumptions

Prediction using Logistic Regression Model

K-Nearest Neighbour (KNN)

Pre-processing data

Cross Validation

Prediction using KNN method

Evaluation Logistic Regression Model and KNN Model

KNN

Conclusion

References