Heart Disease

Introduction

We would like to use logistic regression model and K-Nearest Neighbour (K-NN) to predict the presence of heart disease in patients, which is target variable with 13 attributes, which are:

  1. age: in years

  2. sex: male or female

  3. chest pain type (4 values)

  4. resting blood pressure: detect blood pressure during rest (in mmHg)

  5. serum cholestoral in mg/dl

  6. fasting blood sugar > 120 mg/dl

  7. resting electrocardiographic results (values 0,1,2)

  8. maximum heart rate achieved

  9. exercise induced angina

  10. oldpeak = ST depression induced by exercise relative to rest

  11. the slope of the peak exercise ST segment

  12. number of major vessels (0-3) colored by flourosopy

  13. thal: 3 = No Thalassemia; 6 = Fixed Defect Thalassemia; 7 = Reversible Defect Thalassemia

Exploratory Data Analysis

Variable Transforming

We need to change some variables from numeric into character class. ca variable is not transformed into character class because it need to be know whether more visibled blood vessels will affect the model calculation or not.

## Observations: 303
## Variables: 14
## $ age      <int> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58, …
## $ sex      <int> 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, …
## $ cp       <int> 3, 2, 1, 1, 0, 0, 1, 1, 2, 2, 0, 2, 1, 3, 3, 2, 2, 3, 0, 3, …
## $ trestbps <int> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130, …
## $ chol     <int> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275, …
## $ fbs      <int> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
## $ restecg  <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, …
## $ thalach  <int> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139, …
## $ exang    <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ oldpeak  <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2, …
## $ slope    <int> 0, 0, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 0, 2, 2, …
## $ ca       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, …
## $ thal     <int> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ target   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
##       age            sex                      cp         trestbps    
##  Min.   :29.00   Female: 96   Chest Pain Type 0:143   Min.   : 94.0  
##  1st Qu.:47.50   Male  :207   Chest Pain Type 1: 50   1st Qu.:120.0  
##  Median :55.00                Chest Pain Type 2: 87   Median :130.0  
##  Mean   :54.37                Chest Pain Type 3: 23   Mean   :131.6  
##  3rd Qu.:61.00                                        3rd Qu.:140.0  
##  Max.   :77.00                                        Max.   :200.0  
##       chol                fbs                      restecg       thalach     
##  Min.   :126.0   < 120 mg/dl:258   Abnormality         :152   Min.   : 71.0  
##  1st Qu.:211.0   > 120 mg/dl: 45   Normal              :147   1st Qu.:133.5  
##  Median :240.0                     Probable or Definite:  4   Median :153.0  
##  Mean   :246.3                                                Mean   :149.6  
##  3rd Qu.:274.5                                                3rd Qu.:166.0  
##  Max.   :564.0                                                Max.   :202.0  
##                         exang        oldpeak    
##  Exercise Induced Angina   : 99   Min.   :0.00  
##  No Exercise Induced Angina:204   1st Qu.:0.00  
##                                   Median :0.80  
##                                   Mean   :1.04  
##                                   3rd Qu.:1.60  
##                                   Max.   :6.20  
##                        slope           ca        
##  Peak Excercise ST Slope 0: 21   Min.   :0.0000  
##  Peak Excercise ST Slope 1:140   1st Qu.:0.0000  
##  Peak Excercise ST Slope 2:142   Median :0.0000  
##                                  Mean   :0.7294  
##                                  3rd Qu.:1.0000  
##                                  Max.   :4.0000  
##                             thal               target   
##  Fixed Defect Thalassemia     :166   Healthy      :138  
##  No Thalassemia               :  2   Heart Disease:165  
##  Normal Thalassemia           : 18                      
##  Reversible Defect Thalassemia:117                      
##                                                         
## 

Data Visualization

In data visualization, we would try to visualize the relationship between heart disesase with each variable.

Age

The patient in the data distributed from the age of 29 into 77. Age with the highest number with heart disease is 55.

Multivariables Interpretation

We would try to correlate the relationship between oldpeak, number of blood vessels to the patient who is healty or having a heart disease.

From the graph above, we could interpretate:

  1. Patient who has heart disease having a relatively low oldpeak.
  2. Patient who has heart disease having a relatively low oldpeak number of blood vessels (ca).
  3. Age has no siginificant effect whether the patient has heart disease or not.

Checking Data Availability

Before going further, we need to check if there is any not available data.

##      age      sex       cp trestbps     chol      fbs  restecg  thalach 
##        0        0        0        0        0        0        0        0 
##    exang  oldpeak    slope       ca     thal   target 
##        0        0        0        0        0        0

Modeling

There would be two models that we will use which are logistic regression and k-nearest neighbour (K-NN)

Logistic Regression

Data Splitting

First, we must slip the dataset into train and test dataset. The data will be split into ratio of 70% for data train and 30% for data test.

Targetting All Variables

## 
## Call:
## glm(formula = target ~ ., family = "binomial", data = heart_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7597  -0.3682   0.1828   0.4810   2.8007  
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)   
## (Intercept)                        2.465457   3.332549   0.740  0.45941   
## age                               -0.009925   0.029102  -0.341  0.73306   
## sexMale                           -0.972760   0.594311  -1.637  0.10168   
## cpChest Pain Type 1                0.676937   0.650570   1.041  0.29809   
## cpChest Pain Type 2                1.683990   0.561586   2.999  0.00271 **
## cpChest Pain Type 3                1.876914   0.763048   2.460  0.01390 * 
## trestbps                          -0.021103   0.013683  -1.542  0.12301   
## chol                              -0.005397   0.005413  -0.997  0.31867   
## fbs> 120 mg/dl                     0.464879   0.694490   0.669  0.50325   
## restecgNormal                     -0.114183   0.481740  -0.237  0.81264   
## restecgProbable or Definite       -0.036626   2.214469  -0.017  0.98680   
## thalach                            0.017983   0.012822   1.403  0.16076   
## exangNo Exercise Induced Angina    1.286237   0.507204   2.536  0.01121 * 
## oldpeak                           -0.340274   0.261999  -1.299  0.19403   
## slopePeak Excercise ST Slope 1    -0.372844   1.008251  -0.370  0.71154   
## slopePeak Excercise ST Slope 2     0.646221   1.097124   0.589  0.55585   
## ca                                -0.658304   0.265271  -2.482  0.01308 * 
## thalNo Thalassemia                -1.687082   2.231244  -0.756  0.44958   
## thalNormal Thalassemia            -1.059155   0.980853  -1.080  0.28022   
## thalReversible Defect Thalassemia -1.427661   0.508727  -2.806  0.00501 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 292.00  on 211  degrees of freedom
## Residual deviance: 142.36  on 192  degrees of freedom
## AIC: 182.36
## 
## Number of Fisher Scoring iterations: 6

By using all the variables provided, we have numbers of variables that do not have significant effect for the target variable which could be seen on the p values. To eliminate those unsignificant variables, we would try stepwise method with backward direction.

Stepwise Method

## Start:  AIC=182.36
## target ~ age + sex + cp + trestbps + chol + fbs + restecg + thalach + 
##     exang + oldpeak + slope + ca + thal
## 
##            Df Deviance    AIC
## - restecg   2   142.42 178.42
## - age       1   142.48 180.48
## - fbs       1   142.82 180.82
## - chol      1   143.36 181.36
## - slope     2   146.04 182.04
## - oldpeak   1   144.12 182.12
## <none>          142.36 182.36
## - thalach   1   144.41 182.41
## - trestbps  1   144.84 182.84
## - sex       1   145.13 183.13
## - thal      3   150.81 184.81
## - ca        1   148.69 186.69
## - exang     1   148.87 186.87
## - cp        3   154.84 188.84
## 
## Step:  AIC=178.42
## target ~ age + sex + cp + trestbps + chol + fbs + thalach + exang + 
##     oldpeak + slope + ca + thal
## 
##            Df Deviance    AIC
## - age       1   142.58 176.58
## - fbs       1   142.87 176.87
## - chol      1   143.57 177.57
## - oldpeak   1   144.17 178.17
## - slope     2   146.38 178.38
## <none>          142.42 178.42
## - thalach   1   144.47 178.47
## - trestbps  1   144.94 178.94
## - sex       1   145.37 179.37
## - thal      3   150.90 180.90
## - ca        1   148.76 182.76
## - exang     1   148.97 182.97
## - cp        3   154.87 184.87
## 
## Step:  AIC=176.58
## target ~ sex + cp + trestbps + chol + fbs + thalach + exang + 
##     oldpeak + slope + ca + thal
## 
##            Df Deviance    AIC
## - fbs       1   143.02 175.02
## - chol      1   143.93 175.93
## - oldpeak   1   144.25 176.25
## - slope     2   146.51 176.51
## <none>          142.58 176.58
## - sex       1   145.48 177.48
## - thalach   1   145.64 177.64
## - trestbps  1   145.73 177.73
## - thal      3   151.10 179.10
## - exang     1   149.22 181.22
## - ca        1   149.83 181.83
## - cp        3   155.30 183.30
## 
## Step:  AIC=175.02
## target ~ sex + cp + trestbps + chol + thalach + exang + oldpeak + 
##     slope + ca + thal
## 
##            Df Deviance    AIC
## - chol      1   144.24 174.24
## - oldpeak   1   144.79 174.79
## - slope     2   146.93 174.93
## <none>          143.02 175.02
## - sex       1   145.84 175.84
## - trestbps  1   145.92 175.92
## - thalach   1   146.09 176.09
## - thal      3   151.40 177.40
## - exang     1   149.39 179.39
## - ca        1   149.85 179.85
## - cp        3   156.88 182.88
## 
## Step:  AIC=174.24
## target ~ sex + cp + trestbps + thalach + exang + oldpeak + slope + 
##     ca + thal
## 
##            Df Deviance    AIC
## <none>          144.24 174.24
## - oldpeak   1   146.30 174.30
## - sex       1   146.47 174.47
## - slope     2   148.57 174.57
## - thalach   1   147.16 175.16
## - trestbps  1   147.24 175.24
## - thal      3   152.49 176.49
## - exang     1   150.29 178.29
## - ca        1   151.44 179.44
## - cp        3   158.83 182.83

This method eliminates six unsignificant variables, which are age, fbs, chol, restecg, oldpeak, and slope. So therefore, we would use model.b as our prefer mode.

## 
## Call:
## glm(formula = target ~ sex + cp + trestbps + thalach + exang + 
##     oldpeak + slope + ca + thal, family = "binomial", data = heart_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7746  -0.3799   0.2014   0.4931   2.7368  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)   
## (Intercept)                        0.43152    2.58633   0.167  0.86749   
## sexMale                           -0.83157    0.56335  -1.476  0.13991   
## cpChest Pain Type 1                0.72184    0.63672   1.134  0.25692   
## cpChest Pain Type 2                1.79685    0.54745   3.282  0.00103 **
## cpChest Pain Type 3                1.93688    0.75034   2.581  0.00984 **
## trestbps                          -0.02183    0.01295  -1.685  0.09200 . 
## thalach                            0.01927    0.01156   1.666  0.09568 . 
## exangNo Exercise Induced Angina    1.21859    0.49692   2.452  0.01419 * 
## oldpeak                           -0.36267    0.25753  -1.408  0.15906   
## slopePeak Excercise ST Slope 1    -0.44129    0.98775  -0.447  0.65504   
## slopePeak Excercise ST Slope 2     0.64093    1.07046   0.599  0.54934   
## ca                                -0.64851    0.24833  -2.612  0.00901 **
## thalNo Thalassemia                -1.25894    2.45145  -0.514  0.60757   
## thalNormal Thalassemia            -0.84997    0.94975  -0.895  0.37082   
## thalReversible Defect Thalassemia -1.38805    0.49318  -2.815  0.00489 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 292.00  on 211  degrees of freedom
## Residual deviance: 144.24  on 197  degrees of freedom
## AIC: 174.24
## 
## Number of Fisher Scoring iterations: 6

Model Interpretation

##                                           .
## (Intercept)                       1.5395889
## sexMale                           0.4353669
## cpChest Pain Type 1               2.0582241
## cpChest Pain Type 2               6.0306449
## cpChest Pain Type 3               6.9370818
## trestbps                          0.9784109
## thalach                           1.0194538
## exangNo Exercise Induced Angina   3.3824019
## oldpeak                           0.6958139
## slopePeak Excercise ST Slope 1    0.6432050
## slopePeak Excercise ST Slope 2    1.8982437
## ca                                0.5228225
## thalNo Thalassemia                0.2839554
## thalNormal Thalassemia            0.4274267
## thalReversible Defect Thalassemia 0.2495606

Fom the model, model.b, we could interprete several things:

  1. The odds of male to diagnosed with heart disease is 43% less than female.

  2. Patient detected with no exercise induced angina (unstable angina) is 3 times more reluctant to heart disease compared with patient with exercise induced angina.

  3. The odds of patient detected with oldpeak to diagnosed with heart disease is 70% less than those who don’t.

K-Nearest Neighbour

Variable Transforming

## 'data.frame':    303 obs. of  28 variables:
##  $ age                               : num  63 37 41 56 57 57 56 44 52 57 ...
##  $ sex.Female                        : num  0 0 1 0 1 0 1 0 0 0 ...
##  $ sex.Male                          : num  1 1 0 1 0 1 0 1 1 1 ...
##  $ cp.Chest.Pain.Type.0              : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ cp.Chest.Pain.Type.1              : num  0 0 1 1 0 0 1 1 0 0 ...
##  $ cp.Chest.Pain.Type.2              : num  0 1 0 0 0 0 0 0 1 1 ...
##  $ cp.Chest.Pain.Type.3              : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ trestbps                          : num  145 130 130 120 120 140 140 120 172 150 ...
##  $ chol                              : num  233 250 204 236 354 192 294 263 199 168 ...
##  $ fbs...120.mg.dl                   : num  0 1 1 1 1 1 1 1 0 1 ...
##  $ fbs...120.mg.dl.1                 : num  1 0 0 0 0 0 0 0 1 0 ...
##  $ restecg.Abnormality               : num  0 1 0 1 1 1 0 1 1 1 ...
##  $ restecg.Normal                    : num  1 0 1 0 0 0 1 0 0 0 ...
##  $ restecg.Probable.or.Definite      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ thalach                           : num  150 187 172 178 163 148 153 173 162 174 ...
##  $ exang.Exercise.Induced.Angina     : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ exang.No.Exercise.Induced.Angina  : num  1 1 1 1 0 1 1 1 1 1 ...
##  $ oldpeak                           : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ slope.Peak.Excercise.ST.Slope.0   : num  1 1 0 0 0 0 0 0 0 0 ...
##  $ slope.Peak.Excercise.ST.Slope.1   : num  0 0 0 0 0 1 1 0 0 0 ...
##  $ slope.Peak.Excercise.ST.Slope.2   : num  0 0 1 1 1 0 0 1 1 1 ...
##  $ ca                                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ thal.Fixed.Defect.Thalassemia     : num  0 1 1 1 1 0 1 0 0 1 ...
##  $ thal.No.Thalassemia               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ thal.Normal.Thalassemia           : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ thal.Reversible.Defect.Thalassemia: num  0 0 0 0 0 0 0 1 1 0 ...
##  $ target.Healthy                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ target.Heart.Disease              : num  1 1 1 1 1 1 1 1 1 1 ...

We need to transform all the variables into numeric class, as KNN only processing numeric variables.

##  [1] "age"                                "sex.Male"                          
##  [3] "cp.Chest.Pain.Type.0"               "cp.Chest.Pain.Type.1"              
##  [5] "cp.Chest.Pain.Type.2"               "cp.Chest.Pain.Type.3"              
##  [7] "trestbps"                           "chol"                              
##  [9] "fbs...120.mg.dl"                    "restecg.Abnormality"               
## [11] "restecg.Normal"                     "restecg.Probable.or.Definite"      
## [13] "thalach"                            "exang.No.Exercise.Induced.Angina"  
## [15] "oldpeak"                            "slope.Peak.Excercise.ST.Slope.0"   
## [17] "slope.Peak.Excercise.ST.Slope.1"    "slope.Peak.Excercise.ST.Slope.2"   
## [19] "ca"                                 "thal.Fixed.Defect.Thalassemia"     
## [21] "thal.No.Thalassemia"                "thal.Normal.Thalassemia"           
## [23] "thal.Reversible.Defect.Thalassemia" "target.Heart.Disease"

Then, we neglect variables with only two levels, such as sex variable.

Evaluation

Logistic Regression

Tuning Cutoff

## Warning: `data_frame()` is deprecated, use `tibble()`.
## This warning is displayed once per session.

Before doing evaluation, we need to inspect the good cutoff or threshold to maximize the accuracy, recall and precision value. Here from the graph, 0.53 is the most balance value.

Data Training Evaluation

## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      Healthy Heart Disease
##   Healthy            80            13
##   Heart Disease      16           103
##                                              
##                Accuracy : 0.8632             
##                  95% CI : (0.8095, 0.9064)   
##     No Information Rate : 0.5472             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.7232             
##                                              
##  Mcnemar's Test P-Value : 0.7103             
##                                              
##             Sensitivity : 0.8879             
##             Specificity : 0.8333             
##          Pos Pred Value : 0.8655             
##          Neg Pred Value : 0.8602             
##              Prevalence : 0.5472             
##          Detection Rate : 0.4858             
##    Detection Prevalence : 0.5613             
##       Balanced Accuracy : 0.8606             
##                                              
##        'Positive' Class : Heart Disease      
## 

Data Test Evaluation

## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      Healthy Heart Disease
##   Healthy            35             8
##   Heart Disease       7            41
##                                           
##                Accuracy : 0.8352          
##                  95% CI : (0.7427, 0.9047)
##     No Information Rate : 0.5385          
##     P-Value [Acc > NIR] : 0.000000002394  
##                                           
##                   Kappa : 0.6689          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8367          
##             Specificity : 0.8333          
##          Pos Pred Value : 0.8542          
##          Neg Pred Value : 0.8140          
##              Prevalence : 0.5385          
##          Detection Rate : 0.4505          
##    Detection Prevalence : 0.5275          
##       Balanced Accuracy : 0.8350          
##                                           
##        'Positive' Class : Heart Disease   
## 

The train data give accuracy of 86.32%, while the data test give accurracy of 83.52%. Because the accuracy is not far, so we can assume the model is fit.

K-Nearest Neighbour

Data Training Evaluation

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 76 18
##          1 20 98
##                                              
##                Accuracy : 0.8208             
##                  95% CI : (0.7624, 0.8699)   
##     No Information Rate : 0.5472             
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.6376             
##                                              
##  Mcnemar's Test P-Value : 0.8711             
##                                              
##             Sensitivity : 0.8448             
##             Specificity : 0.7917             
##          Pos Pred Value : 0.8305             
##          Neg Pred Value : 0.8085             
##              Prevalence : 0.5472             
##          Detection Rate : 0.4623             
##    Detection Prevalence : 0.5566             
##       Balanced Accuracy : 0.8182             
##                                              
##        'Positive' Class : 1                  
## 

Data Test Evaluation

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 34  5
##          1  8 44
##                                           
##                Accuracy : 0.8571          
##                  95% CI : (0.7681, 0.9217)
##     No Information Rate : 0.5385          
##     P-Value [Acc > NIR] : 0.0000000001091 
##                                           
##                   Kappa : 0.7111          
##                                           
##  Mcnemar's Test P-Value : 0.5791          
##                                           
##             Sensitivity : 0.8980          
##             Specificity : 0.8095          
##          Pos Pred Value : 0.8462          
##          Neg Pred Value : 0.8718          
##              Prevalence : 0.5385          
##          Detection Rate : 0.4835          
##    Detection Prevalence : 0.5714          
##       Balanced Accuracy : 0.8537          
##                                           
##        'Positive' Class : 1               
## 

The train data give accuracy of 82.08%, while the data test give accurracy of 85.71%. Because the accuracy is not far, so we can assume the model is fit.

Conclusion

Logistic Regression

## # A tibble: 1 x 3
##   Accuracy Recall Precision
##      <dbl>  <dbl>     <dbl>
## 1    0.835  0.837     0.854

K-Nearest Neighbour

## # A tibble: 1 x 3
##   Accuracy Recall Precision
##      <dbl>  <dbl>     <dbl>
## 1    0.857  0.898     0.846

KNN give accuracy of 85.71%, while logistic regression model give accuracy of 83.51%. KNN also could predict the actual patient with heart disease with recall value of 89.79%, compared to logistic regression model with value of 83.67%.

Dhaneswara Mandrasa T.