Heart Disease

1. Introduction

Background

Cardiovascular disease (CVD), including heart disease, remains the leading cause of death globally, accounting for approximately 31% of all deaths worldwide in 2019. Early detection and intervention are crucial in reducing the burden of disease and improving patient outcomes. Machine learning (ML) has the potential to aid clinicians in identifying patients at risk of heart disease through the analysis of clinical and laboratory data. Several studies have developed ML algorithms for predicting the risk of heart disease and achieved high levels of accuracy. ML models could therefore improve patient outcomes by enabling earlier detection and intervention, but further research is necessary to validate these models and ensure their effectiveness in real-world clinical settings.

📝 In this project we will trial Logistic Regression and K-Nearest Neighbors (KNN) as our machine learning models.

Data-set

Heart disease: this dataset was created in 1988 and consists of four databases: Cleveland, Hungary, Switzerland, and Long Beach V. It has 76 attributes, one of which is the predicted attribute, but all published experiments use only 14 of them. The “target” field indicates the presence of heart disease in the patient. It has an integer value of 0 for no disease and 1 for disease.

Attribute Information:

  1. age
  2. sex : 1 = male; 0 = female
  3. chest pain type : Value 0 = typical angina; Value 1 = atypical angina; Value 2 = non-anginal pain; Value 3 = asymptomatic
  4. resting blood pressure : in mm Hg on admission to the hospital
  5. serum cholesterol : in mg/dl
  6. fasting blood sugar > 120 mg/dl : (1 = true; 0 = false)
  7. resting electrocardiographic results (values 0, 1, 2)
  8. maximum heart rate achieved
  9. exercise induced angina : 1 = yes; 0 = no
  10. oldpeak = ST depression induced by exercise relative to rest
  11. Slope = the slope of the peak exercise ST segment (0 - 2)
  12. ca = number of major vessels (0-4) colored by flourosopy
  13. thal: 0 = normal; 1 = fixed defect; 2 = reversible defect; 3 = reversible defect after thallium stress testing
  14. target : 1 = disease; 0 = no disease

2. Preparation

Library

library(dplyr)
library(class) # package for `knn()`
library(caret) # confusion matrix
library(car) # multicollinearity
library(GGally) # ggcor

Data Preparation

Read the dataset heart.csv

df = read.csv("LBB DEAN/heart.csv")
glimpse(df)
#> Rows: 1,025
#> Columns: 14
#> $ age      <int> 52, 53, 70, 61, 62, 58, 58, 55, 46, 54, 71, 43, 34, 51, 52, 3…
#> $ sex      <int> 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1…
#> $ cp       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 0, 1, 2, 2…
#> $ trestbps <int> 125, 140, 145, 148, 138, 100, 114, 160, 120, 122, 112, 132, 1…
#> $ chol     <int> 212, 203, 174, 203, 294, 248, 318, 289, 249, 286, 149, 341, 2…
#> $ fbs      <int> 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0…
#> $ restecg  <int> 1, 0, 1, 1, 1, 0, 2, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0…
#> $ thalach  <int> 168, 155, 125, 161, 106, 122, 140, 145, 144, 116, 125, 136, 1…
#> $ exang    <int> 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0…
#> $ oldpeak  <dbl> 1.0, 3.1, 2.6, 0.0, 1.9, 1.0, 4.4, 0.8, 0.8, 3.2, 1.6, 3.0, 0…
#> $ slope    <int> 2, 0, 0, 2, 1, 1, 0, 1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 2, 1…
#> $ ca       <int> 2, 0, 0, 1, 3, 0, 3, 1, 0, 2, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0…
#> $ thal     <int> 3, 3, 3, 3, 2, 2, 1, 3, 3, 2, 2, 3, 2, 3, 0, 2, 2, 3, 2, 2, 2…
#> $ target   <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0…

df has 14 columns and 1,025 records of patient heart disease check-ups. Several columns need to be converted to factors: in this case sex, cp, fbs, restecg, exang, slope, ca, thal, and target.

df <- df %>% 
  mutate_at(vars(sex,cp,fbs,restecg,exang,slope,ca,thal,target), as.factor)
glimpse(df)
#> Rows: 1,025
#> Columns: 14
#> $ age      <int> 52, 53, 70, 61, 62, 58, 58, 55, 46, 54, 71, 43, 34, 51, 52, 3…
#> $ sex      <fct> 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1…
#> $ cp       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 0, 1, 2, 2…
#> $ trestbps <int> 125, 140, 145, 148, 138, 100, 114, 160, 120, 122, 112, 132, 1…
#> $ chol     <int> 212, 203, 174, 203, 294, 248, 318, 289, 249, 286, 149, 341, 2…
#> $ fbs      <fct> 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0…
#> $ restecg  <fct> 1, 0, 1, 1, 1, 0, 2, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0…
#> $ thalach  <int> 168, 155, 125, 161, 106, 122, 140, 145, 144, 116, 125, 136, 1…
#> $ exang    <fct> 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0…
#> $ oldpeak  <dbl> 1.0, 3.1, 2.6, 0.0, 1.9, 1.0, 4.4, 0.8, 0.8, 3.2, 1.6, 3.0, 0…
#> $ slope    <fct> 2, 0, 0, 2, 1, 1, 0, 1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 2, 1…
#> $ ca       <fct> 2, 0, 0, 1, 3, 0, 3, 1, 0, 2, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0…
#> $ thal     <fct> 3, 3, 3, 3, 2, 2, 1, 3, 3, 2, 2, 3, 2, 3, 0, 2, 2, 3, 2, 2, 2…
#> $ target   <fct> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0…

Check for missing values in the dataset df

colSums(is.na(df))
#>      age      sex       cp trestbps     chol      fbs  restecg  thalach 
#>        0        0        0        0        0        0        0        0 
#>    exang  oldpeak    slope       ca     thal   target 
#>        0        0        0        0        0        0

There are no missing values in the dataset.

EDA (Exploratory Data Analysis)

We do EDA to understand the distribution and characteristics of the dataset and to make sure the target classes are balanced so that the machine learning models work well.

summary(df)
#>       age        sex     cp         trestbps          chol     fbs     restecg
#>  Min.   :29.00   0:312   0:497   Min.   : 94.0   Min.   :126   0:872   0:497  
#>  1st Qu.:48.00   1:713   1:167   1st Qu.:120.0   1st Qu.:211   1:153   1:513  
#>  Median :56.00           2:284   Median :130.0   Median :240           2: 15  
#>  Mean   :54.43           3: 77   Mean   :131.6   Mean   :246                  
#>  3rd Qu.:61.00                   3rd Qu.:140.0   3rd Qu.:275                  
#>  Max.   :77.00                   Max.   :200.0   Max.   :564                  
#>     thalach      exang      oldpeak      slope   ca      thal    target 
#>  Min.   : 71.0   0:680   Min.   :0.000   0: 74   0:578   0:  7   0:499  
#>  1st Qu.:132.0   1:345   1st Qu.:0.000   1:482   1:226   1: 64   1:526  
#>  Median :152.0           Median :0.800   2:469   2:134   2:544          
#>  Mean   :149.1           Mean   :1.072           3: 69   3:410          
#>  3rd Qu.:166.0           3rd Qu.:1.800           4: 18                  
#>  Max.   :202.0           Max.   :6.200

insight :

  • the average age of patients is 54, and the sample is dominated by males (713 of 1,025)
  • the majority of patients (497) have chest pain type 0, i.e. typical angina
  • fasting blood sugar is mostly not elevated: 872 patients are at or below 120 mg/dl (fbs = 0)
round(prop.table(table(df$target)),1)
#> 
#>   0   1 
#> 0.5 0.5

insight : the proportion of the target classes is already balanced at 50:50

Correlation between predictors

ggcorr(df, label=TRUE)

insight :

  • the correlation heat map shows no high correlation between the predictors; this suggests there is no multicollinearity, so the relationship between predictors is consistent with the assumptions of logistic regression

3. Cross Validation

The process of dividing the dataset into data used to train the model and data used to test it is known as cross validation. We randomly select a sample of the heart disease records for the df_train variable, with the remainder going into df_test.

RNGkind(sample.kind = "Rounding") 
set.seed(417)

# index sampling
index <- sample(x = nrow(df), size = nrow(df)*0.8)
# sample(x = nrow(df), size = nrow(df)*0.2)

# splitting
df_train <- df[index,]
df_test <- df[-index,]

4. Modelling

We are going to make several models and test the performance.

Logistic Regression

📝 Logistic Regression produces coefficients on the log-of-odds scale, which cannot be interpreted directly. Log-of-odds values can be converted to odds and/or probabilities so they can be interpreted.
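
As a quick reminder of the conversion (a minimal sketch; the log-odds value below is illustrative, not model output):

# convert a log-odds value to odds and to probability (illustrative value)
log_odds <- 0.5
odds <- exp(log_odds)        # odds = exp(log-odds)
prob <- odds / (1 + odds)    # probability = odds / (1 + odds)
# plogis() converts log-odds straight to probability
plogis(log_odds)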

No Predictor

model_none <- glm(formula = target~1,
                  data = df_train,
                  family = "binomial")
summary(model_none)
#> 
#> Call:
#> glm(formula = target ~ 1, family = "binomial", data = df_train)
#> 
#> Deviance Residuals: 
#>    Min      1Q  Median      3Q     Max  
#> -1.209  -1.209   1.146   1.146   1.146  
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)
#> (Intercept)  0.07320    0.06989   1.047    0.295
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1135.7  on 819  degrees of freedom
#> Residual deviance: 1135.7  on 819  degrees of freedom
#> AIC: 1137.7
#> 
#> Number of Fisher Scoring iterations: 3
exp(0.07320)
#> [1] 1.075946

📈 Model interpretation :

  • The intercept estimate is 0.07320 on the log-odds scale, which corresponds to odds of exp(0.07320) ≈ 1.08 (a probability of about 0.52): across the training data, a patient is slightly more likely to have heart disease than not, i.e. the odds of not having it are about 0.93 to 1 relative to having it.

Categoric Predictor

sex, cp, fbs, restecg, exang, slope, ca, thal

model_categoric <- glm(formula = target~ sex+cp+fbs+restecg+exang+slope+ca+thal ,
                  data = df_train,
                  family = "binomial")
summary(model_categoric)
#> 
#> Call:
#> glm(formula = target ~ sex + cp + fbs + restecg + exang + slope + 
#>     ca + thal, family = "binomial", data = df_train)
#> 
#> Deviance Residuals: 
#>      Min        1Q    Median        3Q       Max  
#> -2.92220  -0.34695   0.07266   0.42072   3.14937  
#> 
#> Coefficients:
#>             Estimate Std. Error z value            Pr(>|z|)    
#> (Intercept)  -1.1045     1.8662  -0.592            0.553971    
#> sex1         -1.6801     0.3211  -5.232 0.00000016775552554 ***
#> cp1           1.1432     0.3352   3.410            0.000649 ***
#> cp2           2.1655     0.3246   6.672 0.00000000002528101 ***
#> cp3           2.1430     0.4207   5.094 0.00000035110360891 ***
#> fbs1          0.3031     0.3401   0.891            0.372822    
#> restecg1      0.4937     0.2353   2.098            0.035871 *  
#> restecg2     -1.7079     1.1767  -1.451            0.146669    
#> exang1       -0.8924     0.2742  -3.255            0.001134 ** 
#> slope1       -0.2613     0.4386  -0.596            0.551356    
#> slope2        1.7783     0.4471   3.977 0.00006964804733135 ***
#> ca1          -2.2969     0.3111  -7.383 0.00000000000015444 ***
#> ca2          -3.6182     0.4659  -7.766 0.00000000000000809 ***
#> ca3          -2.0375     0.5409  -3.767            0.000165 ***
#> ca4           2.2090     1.0406   2.123            0.033771 *  
#> thal1         2.2613     1.8522   1.221            0.222128    
#> thal2         2.6027     1.8037   1.443            0.149016    
#> thal3         0.7894     1.8030   0.438            0.661519    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1135.66  on 819  degrees of freedom
#> Residual deviance:  494.32  on 802  degrees of freedom
#> AIC: 530.32
#> 
#> Number of Fisher Scoring iterations: 6
exp(model_categoric$coefficients)
#> (Intercept)        sex1         cp1         cp2         cp3        fbs1 
#>  0.33138730  0.18634683  3.13683135  8.71918722  8.52480997  1.35410577 
#>    restecg1    restecg2      exang1      slope1      slope2         ca1 
#>  1.63828543  0.18125122  0.40966600  0.77004331  5.91973985  0.10056836 
#>         ca2         ca3         ca4       thal1       thal2       thal3 
#>  0.02683137  0.13035619  9.10643407  9.59600631 13.50008312  2.20203009

Any variable that lowers the odds :

  • sex1, restecg2, exang1, slope1, ca1, ca2, ca3

📈 Model interpretation (odds ratios relative to the base level, holding the other variables fixed; a sketch for adding confidence intervals follows this list) :

  • Patients with cp1 have 3.13 times the odds of heart disease compared to the base level (cp0)
  • Patients with cp2 have 8.71 times the odds of heart disease compared to the base level (cp0)
  • Patients with cp3 have 8.52 times the odds of heart disease compared to the base level (cp0)
  • Patients with fbs1 have 1.35 times the odds of heart disease compared to the base level (fbs0)
  • Patients with restecg1 have 1.63 times the odds of heart disease compared to the base level (restecg0)
  • Patients with slope2 have 5.91 times the odds of heart disease compared to the base level (slope0)
  • Patients with ca4 have 9.1 times the odds of heart disease compared to the base level (ca0)
  • Patients with thal1 have 9.59 times the odds of heart disease compared to the base level (thal0)
  • Patients with thal2 have 13.5 times the odds of heart disease compared to the base level (thal0)
  • Patients with thal3 have 2.2 times the odds of heart disease compared to the base level (thal0)
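
If we also want uncertainty around these odds ratios, a brief sketch using Wald confidence intervals via confint.default():

# odds ratios with 95% Wald confidence intervals (sketch)
exp(cbind(OR = coef(model_categoric), confint.default(model_categoric)))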

Numeric Predictor

age, trestbps, chol, thalach, oldpeak

model_numeric <- glm(formula = target~ age+trestbps+chol+thalach+oldpeak ,
                  data = df_train,
                  family = "binomial")
summary(model_numeric)
#> 
#> Call:
#> glm(formula = target ~ age + trestbps + chol + thalach + oldpeak, 
#>     family = "binomial", data = df_train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.1848  -0.7812   0.4288   0.8541   2.6114  
#> 
#> Coefficients:
#>              Estimate Std. Error z value             Pr(>|z|)    
#> (Intercept) -2.025752   1.098845  -1.844               0.0653 .  
#> age          0.003310   0.010839   0.305               0.7601    
#> trestbps    -0.011531   0.005264  -2.190               0.0285 *  
#> chol        -0.004096   0.001749  -2.342               0.0192 *  
#> thalach      0.035287   0.004557   7.744  0.00000000000000964 ***
#> oldpeak     -0.834227   0.092076  -9.060 < 0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1135.66  on 819  degrees of freedom
#> Residual deviance:  851.21  on 814  degrees of freedom
#> AIC: 863.21
#> 
#> Number of Fisher Scoring iterations: 4
exp(model_numeric$coefficients)
#> (Intercept)         age    trestbps        chol     thalach     oldpeak 
#>   0.1318946   1.0033152   0.9885348   0.9959125   1.0359168   0.4342098

Any variable that lowers the odds :

  • trestbps, chol, oldpeak

📈 Model interpretation :

  • For each additional year of age, the odds of heart disease are multiplied by about 1.00, i.e. essentially unchanged, provided that the other variables have the same value.
  • For each additional unit of thalach (maximum heart rate achieved), the odds of heart disease are multiplied by about 1.04, provided that the other variables have the same value.

All Predictor

model_all <- glm(formula = target~. ,
                  data = df_train,
                  family = "binomial")
summary(model_all)
#> 
#> Call:
#> glm(formula = target ~ ., family = "binomial", data = df_train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -3.0412  -0.2616   0.0570   0.4065   3.3227  
#> 
#> Coefficients:
#>              Estimate Std. Error z value          Pr(>|z|)    
#> (Intercept) -0.265828   2.309760  -0.115          0.908375    
#> age          0.026965   0.016411   1.643          0.100368    
#> sex1        -2.078567   0.375782  -5.531 0.000000031783653 ***
#> cp1          0.829104   0.353346   2.346          0.018954 *  
#> cp2          2.168814   0.341372   6.353 0.000000000210833 ***
#> cp3          2.466048   0.466150   5.290 0.000000122150326 ***
#> trestbps    -0.028018   0.007818  -3.584          0.000339 ***
#> chol        -0.006596   0.002753  -2.396          0.016579 *  
#> fbs1         0.412043   0.372173   1.107          0.268238    
#> restecg1     0.374898   0.255211   1.469          0.141840    
#> restecg2    -0.562131   1.926327  -0.292          0.770428    
#> thalach      0.025281   0.007816   3.235          0.001218 ** 
#> exang1      -0.729219   0.292905  -2.490          0.012789 *  
#> oldpeak     -0.511915   0.155308  -3.296          0.000980 ***
#> slope1      -0.323303   0.534445  -0.605          0.545224    
#> slope2       1.168961   0.563860   2.073          0.038159 *  
#> ca1         -2.542979   0.342386  -7.427 0.000000000000111 ***
#> ca2         -3.680525   0.521668  -7.055 0.000000000001722 ***
#> ca3         -1.675346   0.600219  -2.791          0.005251 ** 
#> ca4          1.891545   1.047329   1.806          0.070908 .  
#> thal1        2.940066   1.568853   1.874          0.060927 .  
#> thal2        2.764906   1.513894   1.826          0.067797 .  
#> thal3        1.091833   1.515119   0.721          0.471140    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1135.66  on 819  degrees of freedom
#> Residual deviance:  452.07  on 797  degrees of freedom
#> AIC: 498.07
#> 
#> Number of Fisher Scoring iterations: 6
exp(model_all$coefficients)
#> (Intercept)         age        sex1         cp1         cp2         cp3 
#>  0.76657125  1.02733167  0.12510939  2.29126495  8.74789990 11.77581908 
#>    trestbps        chol        fbs1    restecg1    restecg2     thalach 
#>  0.97237088  0.99342604  1.50989958  1.45484288  0.56999318  1.02560367 
#>      exang1     oldpeak      slope1      slope2         ca1         ca2 
#>  0.48228567  0.59934695  0.72375441  3.21864567  0.07863177  0.02520973 
#>         ca3         ca4       thal1       thal2       thal3 
#>  0.18724333  6.62960374 18.91708684 15.87755124  2.97973045

Any variable that lowers the odds :

sex1, restecg2, exang1, slope1, ca1, ca2, ca3, trestbps, chol, oldpeak

📈 Model interpretation (odds ratios taken from exp(model_all$coefficients), holding the other variables fixed) :

  • For each additional year of age, the odds of heart disease are multiplied by about 1.03, provided that the other variables have the same value.
  • For each additional unit of thalach, the odds of heart disease are multiplied by about 1.03, provided that the other variables have the same value.
  • Patients with cp1 have 2.29 times the odds of heart disease compared to the base level (cp0)
  • Patients with cp2 have 8.75 times the odds of heart disease compared to the base level (cp0)
  • Patients with cp3 have 11.78 times the odds of heart disease compared to the base level (cp0)
  • Patients with restecg1 have 1.45 times the odds of heart disease compared to the base level (restecg0)
  • Patients with slope2 have 3.22 times the odds of heart disease compared to the base level (slope0)
  • Patients with ca4 have 6.63 times the odds of heart disease compared to the base level (ca0)
  • Patients with thal1 have 18.92 times the odds of heart disease compared to the base level (thal0)
  • Patients with thal2 have 15.88 times the odds of heart disease compared to the base level (thal0)
  • Patients with thal3 have 2.98 times the odds of heart disease compared to the base level (thal0)

Step Model

We use the model with all predictors because of its low AIC and perform feature selection on it using backward stepwise regression.

model_step <- step(object = model_all,
                   direction = "backward", trace = F)
summary(model_step)
#> 
#> Call:
#> glm(formula = target ~ age + sex + cp + trestbps + chol + thalach + 
#>     exang + oldpeak + slope + ca + thal, family = "binomial", 
#>     data = df_train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -3.0096  -0.2815   0.0540   0.4044   3.3720  
#> 
#> Coefficients:
#>              Estimate Std. Error z value           Pr(>|z|)    
#> (Intercept)  0.298116   2.469446   0.121           0.903911    
#> age          0.024359   0.016267   1.497           0.134274    
#> sex1        -2.080038   0.373448  -5.570 0.0000000255013944 ***
#> cp1          0.853616   0.351534   2.428           0.015172 *  
#> cp2          2.237598   0.337174   6.636 0.0000000000321570 ***
#> cp3          2.518670   0.462415   5.447 0.0000000512930186 ***
#> trestbps    -0.027613   0.007554  -3.655           0.000257 ***
#> chol        -0.007103   0.002650  -2.680           0.007367 ** 
#> thalach      0.025386   0.007880   3.222           0.001274 ** 
#> exang1      -0.700975   0.291217  -2.407           0.016082 *  
#> oldpeak     -0.522742   0.150681  -3.469           0.000522 ***
#> slope1      -0.367936   0.521164  -0.706           0.480195    
#> slope2       1.150041   0.547990   2.099           0.035848 *  
#> ca1         -2.522397   0.333224  -7.570 0.0000000000000374 ***
#> ca2         -3.586970   0.508731  -7.051 0.0000000000017787 ***
#> ca3         -1.537753   0.581697  -2.644           0.008204 ** 
#> ca4          2.042608   1.037289   1.969           0.048932 *  
#> thal1        2.880288   1.851205   1.556           0.119732    
#> thal2        2.616895   1.800002   1.454           0.145994    
#> thal3        0.973965   1.799240   0.541           0.588287    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1135.66  on 819  degrees of freedom
#> Residual deviance:  455.76  on 800  degrees of freedom
#> AIC: 495.76
#> 
#> Number of Fisher Scoring iterations: 6

Compare between Step model and Model_all

# Deviance
model_all$deviance
#> [1] 452.0742
model_step$deviance
#> [1] 455.7572
# AIC
model_all$aic
#> [1] 498.0742
model_step$aic
#> [1] 495.7572

We use model_step because it has the lower AIC (495.76 vs 498.07 for model_all) and shows no sign of perfect separation.
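
Since library(car) was loaded for multicollinearity checking, a quick check on model_step can be sketched as below; for models with factor predictors, vif() reports generalized VIFs, and values well above 5-10 would be a warning sign:

# generalized VIF for the chosen model (sketch; car is already loaded above)
vif(model_step)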

KNN

k-NN or K-nearest neighbor classifies new data by comparing the characteristics of the new data (test data) with existing data (train data).

Proximity is measured with the Euclidean distance, and the k data points (neighbors) with the smallest distance to the new observation are selected. The class held by the majority of these neighbors becomes the class of the new data (majority voting).
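
As a small illustration of that distance calculation (the two patients below use hypothetical scaled values, not rows from the dataset):

# Euclidean distance between two hypothetical (scaled) patients
p1 <- c(age = 0.2, trestbps = -0.5, chol = 1.1, thalach = 0.3, oldpeak = -0.7)
p2 <- c(age = -0.4, trestbps = 0.1, chol = 0.6, thalach = -0.2, oldpeak = 0.5)
sqrt(sum((p1 - p2)^2))   # manual calculation
dist(rbind(p1, p2))      # same result via dist()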

Characteristics of k-NN :

  • Does not build an explicit model: it classifies on the spot
  • Does not learn parameters from the data; every time you want to classify, the training data must be provided again
  • No distributional assumptions
  • Can predict multiclass targets
  • Good for numerical predictors (because it classifies by distance), not good for categorical predictors
  • Robust: performance is good -> error is small
  • Not interpretable -> does not form an equation
knitr::include_graphics("assets/KNN.png")

# predictor variables in `train`
train_x <- df_train %>% select_if(is.numeric)

# predictor variables in `test`
test_x <- df_test %>% select_if(is.numeric)

# target variable in `train`
train_y <- df_train[,"target"]

# target variable in `test`
test_y <- df_test[,"target"]

Remember that distance measurements in kNN are highly dependent on the data scale of the predictor variables entered as input of the model. The existence of a predictor that has a range of values that is very different from other predictors can cause problems in the classification model. Therefore, let’s normalize the data to equalize the scale of each predictor variable so that it has a standard range of values.

To normalize the train_x data, please use the scale() function. Meanwhile, to normalize the test data, please use the same function but use the center and scale attributes obtained from the train_x data.


# scale train_x data
train_xs <- scale(train_x)

# scale test_x data
test_xs <- scale(x = test_x,
                center = attr(train_xs,"scaled:center"),
                scale = attr(train_xs,"scaled:scale"))
# starting value for k: round(sqrt(number of rows)); with two target classes we use the nearest odd number (15) in knn()
round(sqrt(nrow(test_xs)))
#> [1] 14

5. Prediction

Logistic Prediction

The goal of logistic regression is to use a linear model on the log-odds scale to predict probability (which can then be used for classification).

df_test$pred_risk <- predict(object = model_step, 
                     newdata = df_test, 
                     type = "response")

df_test$pred_Label <- ifelse(test = df_test$pred_risk > 0.5, yes = 1, no = 0)

df_train$pred_risk <- predict(object = model_step, 
                     newdata = df_train, 
                     type = "response")

df_train$pred_Label <- ifelse(test = df_train$pred_risk > 0.5, yes = 1, no = 0)

Insight :

  • Logistic regression equation :

\[log(\frac{p}{1-p}) = b_0 + b_1 x\]

  • We use Logistic Regression because it is interpretable and can process both numerical and categorical variables as predictors.

KNN

df_predKNN <- knn(train = train_xs, 
                 test = test_xs, 
                 cl = train_y, 
                 k = 15)
head(df_predKNN)
#> [1] 0 1 0 0 0 1
#> Levels: 0 1

Insight :

  • Picking an optimum k avoids ties in the majority voting (a small sketch follows this list):
    • k should be odd if the number of target classes is even
    • k should be even if the number of target classes is odd
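
A minimal sketch of applying this rule to our two-class target, starting from the square-root heuristic computed earlier:

# start from round(sqrt(n)) = 14 computed above, then bump to the next odd number
k_rule <- round(sqrt(nrow(test_xs)))
if (k_rule %% 2 == 0) k_rule <- k_rule + 1
k_rule   # 15, the value passed to knn() above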

6. Evaluation

💣 Business Case :

  • Machine Learning Engineers have been asked by doctors and hospital operational management to focus on the cases where a patient actually has heart disease but is predicted as healthy, since such a miss is fatal for the doctors' treatment, considering that on average thousands of patients come for cardiac re-examination.

Each model must have an error. We want to get a model with the smallest possible prediction error.

Logistic Regression

df_test %>% 
  select(target,pred_Label)

After making predictions using the model, there are still wrong predictions. In classification, we evaluate the model based on the confusion matrix:

table(predicted = df_test$pred_Label, actual = df_test$target)
#>          actual
#> predicted  0  1
#>         0 82 10
#>         1 22 91
  • Class determination:

    • positive class: the class that becomes a concern/observation
    • negative class: class that is not a concern/observation
  • Sample case:

    • Machine learning for detecting covid patients:
      • positive class: detected covid
      • negative class: detected healthy
  • Contents of the Confusion Matrix:

    • True Positive (TP): predicted positive and true (positive prediction; actual positive)
    • True Negative (TN): predicted negative and true (negative prediction; negative actual)
    • False Positive (FP): predicted positive but wrong (predictive positive; actual negative)
    • False Negative (FN): predicted negative but wrong (negative prediction; positive actual)

TRUE/FALSE:

  • TRUE -> the prediction is correct: the predicted result = the actual value
  • FALSE -> the prediction is wrong: the predicted result != the actual value

Indicate Heart Disease / Not Indicate Heart Disease

Positive class: Indicate Heart Disease, Negative class: Not Indicate Heart Disease

  • TP: predicted Indicate Heart Disease, and the patient indeed has heart disease
  • TN: predicted Not Indicate Heart Disease, and the patient indeed does not have heart disease
  • FP: predicted Indicate Heart Disease, but the patient is actually healthy -> causes panic
  • FN: predicted healthy (Not Indicate Heart Disease), but the patient actually has heart disease -> quite fatal (a recall sketch from these counts follows this list)
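
Using the counts from the confusion matrix above, the recall (sensitivity) for the positive class can be computed by hand; a small sketch:

# recall = TP / (TP + FN), taken from the table above
TP <- 91   # predicted heart disease, actually heart disease
FN <- 10   # predicted healthy, actually heart disease
TP / (TP + FN)   # ~0.90, matching the Sensitivity reported by confusionMatrix() below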

# confusion matrix
library(caret)

confusionMatrix(data = as.factor(df_test$pred_Label), 
                reference = as.factor(df_test$target), 
                positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 82 10
#>          1 22 91
#>                                               
#>                Accuracy : 0.8439              
#>                  95% CI : (0.7868, 0.8907)    
#>     No Information Rate : 0.5073              
#>     P-Value [Acc > NIR] : < 0.0000000000000002
#>                                               
#>                   Kappa : 0.6883              
#>                                               
#>  Mcnemar's Test P-Value : 0.05183             
#>                                               
#>             Sensitivity : 0.9010              
#>             Specificity : 0.7885              
#>          Pos Pred Value : 0.8053              
#>          Neg Pred Value : 0.8913              
#>              Prevalence : 0.4927              
#>          Detection Rate : 0.4439              
#>    Detection Prevalence : 0.5512              
#>       Balanced Accuracy : 0.8447              
#>                                               
#>        'Positive' Class : 1                   
#> 
  • Class Positive : indicate of heart disease
  • FN : Actual indicate of heart disease, Predict no indicate of heart disease
  • FP : Actual no indicate of heart disease, Predict indicate of heart disease
  • Metrics = Recall

The logistic regression is 84.39% accurate, but there are still wrong predictions (a threshold adjustment sketch follows this point) :

  • 10 patients who actually have heart disease were predicted as healthy by the logistic regression (false negatives)
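
Because false negatives are the costly error in this business case, one option is to lower the classification threshold so that more borderline patients fall into the positive class. A sketch, where 0.4 is an illustrative threshold rather than a tuned value and pred_Label_04 is a hypothetical column name:

# re-label predictions with a lower (illustrative) threshold of 0.4
df_test$pred_Label_04 <- ifelse(df_test$pred_risk > 0.4, yes = 1, no = 0)

confusionMatrix(data = as.factor(df_test$pred_Label_04),
                reference = as.factor(df_test$target),
                positive = "1")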

KNN Method

confusionMatrix(data = df_predKNN, 
                reference = as.factor(test_y), 
                positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 71 25
#>          1 33 76
#>                                           
#>                Accuracy : 0.7171          
#>                  95% CI : (0.6501, 0.7776)
#>     No Information Rate : 0.5073          
#>     P-Value [Acc > NIR] : 0.0000000007776 
#>                                           
#>                   Kappa : 0.4347          
#>                                           
#>  Mcnemar's Test P-Value : 0.358           
#>                                           
#>             Sensitivity : 0.7525          
#>             Specificity : 0.6827          
#>          Pos Pred Value : 0.6972          
#>          Neg Pred Value : 0.7396          
#>              Prevalence : 0.4927          
#>          Detection Rate : 0.3707          
#>    Detection Prevalence : 0.5317          
#>       Balanced Accuracy : 0.7176          
#>                                           
#>        'Positive' Class : 1               
#> 
  • Class Positive : indicate of heart disease
  • FN : Actual indicate of heart disease, Predict no indicate of heart disease
  • FP : Actual no indicate of heart disease, Predict indicate of heart disease
  • Metrics = Recall

The KNN model is 71.71% accurate, but there are still wrong predictions. Recalling the characteristics of KNN, most of our predictors are categorical and could not be used here; KNN works best with numerical predictors (because it classifies by distance).

KNN Improvement

df = read.csv("LBB DEAN/heart.csv")
df <- df %>%
  mutate_at(vars(sex,cp,fbs,restecg,exang,slope,ca,thal), as.factor)
glimpse(df)
#> Rows: 1,025
#> Columns: 14
#> $ age      <int> 52, 53, 70, 61, 62, 58, 58, 55, 46, 54, 71, 43, 34, 51, 52, 3…
#> $ sex      <fct> 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1…
#> $ cp       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 0, 1, 2, 2…
#> $ trestbps <int> 125, 140, 145, 148, 138, 100, 114, 160, 120, 122, 112, 132, 1…
#> $ chol     <int> 212, 203, 174, 203, 294, 248, 318, 289, 249, 286, 149, 341, 2…
#> $ fbs      <fct> 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0…
#> $ restecg  <fct> 1, 0, 1, 1, 1, 0, 2, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0…
#> $ thalach  <int> 168, 155, 125, 161, 106, 122, 140, 145, 144, 116, 125, 136, 1…
#> $ exang    <fct> 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0…
#> $ oldpeak  <dbl> 1.0, 3.1, 2.6, 0.0, 1.9, 1.0, 4.4, 0.8, 0.8, 3.2, 1.6, 3.0, 0…
#> $ slope    <fct> 2, 0, 0, 2, 1, 1, 0, 1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 2, 1…
#> $ ca       <fct> 2, 0, 0, 1, 3, 0, 3, 1, 0, 2, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0…
#> $ thal     <fct> 3, 3, 3, 3, 2, 2, 1, 3, 3, 2, 2, 3, 2, 3, 0, 2, 2, 3, 2, 2, 2…
#> $ target   <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0…
RNGkind(sample.kind = "Rounding") 
set.seed(417)

# index sampling
index <- sample(x = nrow(df), size = nrow(df)*0.8)
# sample(x = nrow(df), size = nrow(df)*0.2)

# splitting
df_train <- df[index,]
df_test <- df[-index,]
# predictor variables in `train`
train_x <- df_train %>% select_if(is.numeric)

# predictor variables in `test`
test_x <- df_test %>% select_if(is.numeric)

# target variable in `train`
train_y <- df_train[,"target"]

# target variable in `test`
test_y <- df_test[,"target"]

# scale train_x data
train_xs <- scale(train_x)

# scale test_x data
test_xs <- scale(x = test_x,
                center = attr(train_xs,"scaled:center"),
                scale = attr(train_xs,"scaled:scale"))
# starting value for k: round(sqrt(number of rows)); with two target classes we use the nearest odd number (15) in knn()
round(sqrt(nrow(test_xs)))
#> [1] 14
df_predKNN <- knn(train = train_xs, 
                 test = test_xs, 
                 cl = train_y, 
                 k = 15)
head(df_predKNN)
#> [1] 0 1 0 0 0 1
#> Levels: 0 1
confusionMatrix(data = df_predKNN, 
                reference = as.factor(test_y), 
                positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 104   1
#>          1   0 100
#>                                              
#>                Accuracy : 0.9951             
#>                  95% CI : (0.9731, 0.9999)   
#>     No Information Rate : 0.5073             
#>     P-Value [Acc > NIR] : <0.0000000000000002
#>                                              
#>                   Kappa : 0.9902             
#>                                              
#>  Mcnemar's Test P-Value : 1                  
#>                                              
#>             Sensitivity : 0.9901             
#>             Specificity : 1.0000             
#>          Pos Pred Value : 1.0000             
#>          Neg Pred Value : 0.9905             
#>              Prevalence : 0.4927             
#>          Detection Rate : 0.4878             
#>    Detection Prevalence : 0.4878             
#>       Balanced Accuracy : 0.9950             
#>                                              
#>        'Positive' Class : 1                  
#> 

Insight :

  • We successfully improved the model accuracy from 71.71% to 99.51%.
  • We focused on separating factor and non-factor data, following KNN best practice.
  • Even at 99.51% accuracy we should still pay attention to the 1 patient who has heart disease but was predicted as healthy, because we focus on the positive class = 1 (“indicates heart disease”).

7. Conclusion

  • Use KNN because, in this project, it achieved higher performance than logistic regression and is able to perform both binary and multi-class classification.
  • Logistic Regression produces coefficients on the log-of-odds scale, which cannot be interpreted directly; log-of-odds values can be converted to odds and/or probabilities so they can be interpreted.
  • The goal of logistic regression is to use a linear model on the log-odds scale to predict probability (which can then be used for classification).
  • Use Logistic Regression when interpretability matters, since it can process both numerical and categorical variables as predictors.
  • We created a Logistic Regression model as well as a K-NN model. We built the best Logistic Regression model by using all of its variables as predictors and then reducing them with stepwise selection. However, we need to lower the threshold to reduce the number of missed predictions for patients who are likely to have cardiovascular disease. Meanwhile, the K-NN model outperforms the Logistic Regression model in prediction, but it is not interpretable.
  • Shifting the threshold like this is only appropriate when the target classes are balanced. If the target class is imbalanced, shifting the threshold usually causes either the Recall or the Precision metric to fall. To ensure that the threshold shift remains optimal, we generally check the model’s goodness first using the AUC value (a short sketch follows this list).
  • KNN works best when the predictors are kept numeric rather than factor/categorical, because it classifies by distance.
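
As mentioned in the list above, the threshold check is usually paired with an AUC check; a sketch using the pROC package (not among the libraries loaded earlier, so this is only an illustration):

# AUC of the logistic regression predictions on the test set (sketch)
library(pROC)
roc_logit <- roc(response = df_test$target, predictor = df_test$pred_risk)
auc(roc_logit)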

8. Model Improvement

For the next model, we are going to trial Decision Tree and Random Forest to build a heart disease classification model and compare it with the models above, to find which model best suits the characteristics of our data.