Heart Disease

1. Introduction

Background

Cardiovascular disease (CVD), including heart disease, remains the leading cause of death globally, accounting for approximately 31% of all deaths worldwide in 2019. Early detection and intervention are crucial in reducing the burden of disease and improving patient outcomes. Machine learning (ML) has the potential to aid clinicians in identifying patients at risk of heart disease through the analysis of clinical and laboratory data. Several studies have developed ML algorithms for predicting the risk of heart disease and have achieved high levels of accuracy. ML models could improve patient outcomes by enabling earlier detection and intervention, but further research is necessary to validate these models and ensure their effectiveness in real-world clinical settings.

This project continues Heart Disease - Classification (Machine Learning vol1.0) by adding two machine learning models: Decision Tree and Random Forest.

  • 📝 In this project we trial Decision Tree and Random Forest as additional machine learning models
  • 📝 We compare the accuracy of all the machine learning models

Data-set

The heart-disease data set was created in 1988 and consists of four databases: Cleveland, Hungary, Switzerland, and Long Beach V. It has 76 attributes, including the predicted attribute, but all published experiments use only 14 of them. The “target” field refers to the presence of heart disease in the patient: an integer value of 0 for no disease and 1 for disease.

Attribute Information:

  1. age
  2. sex : 1 = male; 0 = female
  3. chest pain type (4 values): 0 = typical angina; 1 = atypical angina; 2 = non-anginal pain; 3 = asymptomatic
  4. resting blood pressure : resting blood pressure (in mm Hg on admission to the hospital)
  5. serum cholesterol : serum cholesterol in mg/dl
  6. fasting blood sugar > 120 mg/dl : (1 = true; 0 = false)
  7. resting electrocardiographic results (values 0, 1, 2)
  8. maximum heart rate achieved
  9. exercise induced angina : 1 = yes; 0 = no
  10. oldpeak = ST depression induced by exercise relative to rest
  11. Slope = the slope of the peak exercise ST segment (0 - 2)
  12. ca = number of major vessels (0-4) colored by fluoroscopy
  13. thal: 0 = normal; 1 = fixed defect; 2 = reversible defect; 3 = reversible defect after thallium stress testing
  14. target : 1 = disease; 0 = no disease

2. Preparation

Library

library(dplyr) # function of data wrangling 1
library(tidyr) # function of data wrangling 2
library(class) # package for `knn()`
library(caret) # confusion matrix
library(car) # multicollinearity
library(partykit) #function decision tree
library(randomForest) #function random forest
library(ROCR) #function ROC AUC
library(GGally) # ggcor

Data Preparation

Read dataset = heart.csv

df = read.csv("LBB DEAN/heart.csv")
glimpse(df)
#> Rows: 1,025
#> Columns: 14
#> $ age      <int> 52, 53, 70, 61, 62, 58, 58, 55, 46, 54, 71, 43, 34, 51, 52, 3…
#> $ sex      <int> 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1…
#> $ cp       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 0, 1, 2, 2…
#> $ trestbps <int> 125, 140, 145, 148, 138, 100, 114, 160, 120, 122, 112, 132, 1…
#> $ chol     <int> 212, 203, 174, 203, 294, 248, 318, 289, 249, 286, 149, 341, 2…
#> $ fbs      <int> 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0…
#> $ restecg  <int> 1, 0, 1, 1, 1, 0, 2, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0…
#> $ thalach  <int> 168, 155, 125, 161, 106, 122, 140, 145, 144, 116, 125, 136, 1…
#> $ exang    <int> 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0…
#> $ oldpeak  <dbl> 1.0, 3.1, 2.6, 0.0, 1.9, 1.0, 4.4, 0.8, 0.8, 3.2, 1.6, 3.0, 0…
#> $ slope    <int> 2, 0, 0, 2, 1, 1, 0, 1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 2, 1…
#> $ ca       <int> 2, 0, 0, 1, 3, 0, 3, 1, 0, 2, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0…
#> $ thal     <int> 3, 3, 3, 3, 2, 2, 1, 3, 3, 2, 2, 3, 2, 3, 0, 2, 2, 3, 2, 2, 2…
#> $ target   <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0…

df has 14 columns and 1,025 records of heart-disease patient check-ups. Several columns need to be converted to factors: in this case sex, cp, fbs, restecg, exang, slope, ca, thal, and target.

df <- df %>% 
  mutate_at(vars(sex,cp,fbs,restecg,exang,slope,ca,thal,target), as.factor)
glimpse(df)
#> Rows: 1,025
#> Columns: 14
#> $ age      <int> 52, 53, 70, 61, 62, 58, 58, 55, 46, 54, 71, 43, 34, 51, 52, 3…
#> $ sex      <fct> 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1…
#> $ cp       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 0, 1, 2, 2…
#> $ trestbps <int> 125, 140, 145, 148, 138, 100, 114, 160, 120, 122, 112, 132, 1…
#> $ chol     <int> 212, 203, 174, 203, 294, 248, 318, 289, 249, 286, 149, 341, 2…
#> $ fbs      <fct> 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0…
#> $ restecg  <fct> 1, 0, 1, 1, 1, 0, 2, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0…
#> $ thalach  <int> 168, 155, 125, 161, 106, 122, 140, 145, 144, 116, 125, 136, 1…
#> $ exang    <fct> 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0…
#> $ oldpeak  <dbl> 1.0, 3.1, 2.6, 0.0, 1.9, 1.0, 4.4, 0.8, 0.8, 3.2, 1.6, 3.0, 0…
#> $ slope    <fct> 2, 0, 0, 2, 1, 1, 0, 1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 2, 1…
#> $ ca       <fct> 2, 0, 0, 1, 3, 0, 3, 1, 0, 2, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0…
#> $ thal     <fct> 3, 3, 3, 3, 2, 2, 1, 3, 3, 2, 2, 3, 2, 3, 0, 2, 2, 3, 2, 2, 2…
#> $ target   <fct> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0…

Check for missing values in the data set:

colSums(is.na(df))
#>      age      sex       cp trestbps     chol      fbs  restecg  thalach 
#>        0        0        0        0        0        0        0        0 
#>    exang  oldpeak    slope       ca     thal   target 
#>        0        0        0        0        0        0

There are no missing values in the data set.

EDA (Exploratory Data Analysis)

We perform EDA to understand the distribution and characteristics of the data set and to make sure the target classes are balanced so that the machine learning models work well.

summary(df)
#>       age        sex     cp         trestbps          chol     fbs     restecg
#>  Min.   :29.00   0:312   0:497   Min.   : 94.0   Min.   :126   0:872   0:497  
#>  1st Qu.:48.00   1:713   1:167   1st Qu.:120.0   1st Qu.:211   1:153   1:513  
#>  Median :56.00           2:284   Median :130.0   Median :240           2: 15  
#>  Mean   :54.43           3: 77   Mean   :131.6   Mean   :246                  
#>  3rd Qu.:61.00                   3rd Qu.:140.0   3rd Qu.:275                  
#>  Max.   :77.00                   Max.   :200.0   Max.   :564                  
#>     thalach      exang      oldpeak      slope   ca      thal    target 
#>  Min.   : 71.0   0:680   Min.   :0.000   0: 74   0:578   0:  7   0:499  
#>  1st Qu.:132.0   1:345   1st Qu.:0.000   1:482   1:226   1: 64   1:526  
#>  Median :152.0           Median :0.800   2:469   2:134   2:544          
#>  Mean   :149.1           Mean   :1.072           3: 69   3:410          
#>  3rd Qu.:166.0           3rd Qu.:1.800           4: 18                  
#>  Max.   :202.0           Max.   :6.200

insight :

  • The average age of patients is 54, and the majority are male.
  • The majority of patients have chest pain type 0, which means typical angina.
  • Fasting blood sugar is mostly low: about 872 patients are below 120 mg/dl.
round(prop.table(table(df$target)),1)
#> 
#>   0   1 
#> 0.5 0.5

insight : The target proportion is already balanced at 50:50.

Correlation between predictors

ggcorr(df, label=TRUE)

insight =

  • The correlation heat map shows no high correlation between the predictors; from this we can conclude that the relationships between predictors are in accordance with the assumptions of logistic regression.

3. Cross Validation

The process of dividing the data set into a portion used to train the model and a portion used to test it is known as cross validation. We randomly select a sample of the heart-disease records for df_train and put the remainder in df_test.

RNGkind(sample.kind = "Rounding") 
set.seed(417)

# index sampling
index <- sample(x = nrow(df), size = nrow(df)*0.8)
# sample(x = nrow(df), size = nrow(df)*0.2)

# splitting
df_train <- df[index,]
df_test <- df[-index,]

4. Modelling

We are going to make several models and test the performance.

Logistic Regression

📝 Logistic regression produces coefficients on the log-of-odds scale, which cannot be interpreted directly. Log-of-odds values can be converted to odds and/or probabilities so that they can be interpreted.

No Predictor

model_none <- glm(formula = target~1,
                  data = df_train,
                  family = "binomial")
summary(model_none)
#> 
#> Call:
#> glm(formula = target ~ 1, family = "binomial", data = df_train)
#> 
#> Deviance Residuals: 
#>    Min      1Q  Median      3Q     Max  
#> -1.209  -1.209   1.146   1.146   1.146  
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)
#> (Intercept)  0.07320    0.06989   1.047    0.295
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1135.7  on 819  degrees of freedom
#> Residual deviance: 1135.7  on 819  degrees of freedom
#> AIC: 1137.7
#> 
#> Number of Fisher Scoring iterations: 3
exp(0.07320)
#> [1] 1.075946

📈 Model interpretation :

  • The intercept is 0.07320 on the log-of-odds scale, which corresponds to odds of about 1.08: across all the training data, a patient is about 1.08 times more likely to have heart disease than not (equivalently, the odds of not having heart disease are about 0.93 times the odds of having it).
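
The odds above come from exp(0.07320) as computed earlier. As a small additional check (not part of the original workflow), the same log-of-odds value can be turned into a probability with plogis():

# convert the intercept from log-of-odds to a probability of disease
plogis(0.07320)  # approximately 0.52, i.e. slightly more likely to have disease than not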

Categoric Predictor

sex,cp,fbs,restecg,exang,slope,ca,thal

model_categoric <- glm(formula = target~ sex+cp+fbs+restecg+exang+slope+ca+thal ,
                  data = df_train,
                  family = "binomial")
summary(model_categoric)
#> 
#> Call:
#> glm(formula = target ~ sex + cp + fbs + restecg + exang + slope + 
#>     ca + thal, family = "binomial", data = df_train)
#> 
#> Deviance Residuals: 
#>      Min        1Q    Median        3Q       Max  
#> -2.92220  -0.34695   0.07266   0.42072   3.14937  
#> 
#> Coefficients:
#>             Estimate Std. Error z value            Pr(>|z|)    
#> (Intercept)  -1.1045     1.8662  -0.592            0.553971    
#> sex1         -1.6801     0.3211  -5.232 0.00000016775552554 ***
#> cp1           1.1432     0.3352   3.410            0.000649 ***
#> cp2           2.1655     0.3246   6.672 0.00000000002528101 ***
#> cp3           2.1430     0.4207   5.094 0.00000035110360891 ***
#> fbs1          0.3031     0.3401   0.891            0.372822    
#> restecg1      0.4937     0.2353   2.098            0.035871 *  
#> restecg2     -1.7079     1.1767  -1.451            0.146669    
#> exang1       -0.8924     0.2742  -3.255            0.001134 ** 
#> slope1       -0.2613     0.4386  -0.596            0.551356    
#> slope2        1.7783     0.4471   3.977 0.00006964804733135 ***
#> ca1          -2.2969     0.3111  -7.383 0.00000000000015444 ***
#> ca2          -3.6182     0.4659  -7.766 0.00000000000000809 ***
#> ca3          -2.0375     0.5409  -3.767            0.000165 ***
#> ca4           2.2090     1.0406   2.123            0.033771 *  
#> thal1         2.2613     1.8522   1.221            0.222128    
#> thal2         2.6027     1.8037   1.443            0.149016    
#> thal3         0.7894     1.8030   0.438            0.661519    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1135.66  on 819  degrees of freedom
#> Residual deviance:  494.32  on 802  degrees of freedom
#> AIC: 530.32
#> 
#> Number of Fisher Scoring iterations: 6
exp(model_categoric$coefficients)
#> (Intercept)        sex1         cp1         cp2         cp3        fbs1 
#>  0.33138730  0.18634683  3.13683135  8.71918722  8.52480997  1.35410577 
#>    restecg1    restecg2      exang1      slope1      slope2         ca1 
#>  1.63828543  0.18125122  0.40966600  0.77004331  5.91973985  0.10056836 
#>         ca2         ca3         ca4       thal1       thal2       thal3 
#>  0.02683137  0.13035619  9.10643407  9.59600631 13.50008312  2.20203009

Variables that lower the odds:

  • sex1, restecg2, exang1, slope1, ca1, ca2, ca3

📈 Model interpretation :

  • People with cp1 have 3.13 times the odds of heart disease compared to the base level (cp0), holding the other variables fixed.
  • People with cp2 have 8.71 times the odds of heart disease compared to the base level (cp0), holding the other variables fixed.
  • People with cp3 have 8.52 times the odds of heart disease compared to the base level (cp0), holding the other variables fixed.
  • People with fbs1 have 1.35 times the odds of heart disease compared to the base level (fbs0), holding the other variables fixed.
  • People with restecg1 have 1.63 times the odds of heart disease compared to the base level (restecg0), holding the other variables fixed.
  • People with slope2 have 5.91 times the odds of heart disease compared to the base level (slope0), holding the other variables fixed.
  • People with ca4 have 9.10 times the odds of heart disease compared to the base level (ca0), holding the other variables fixed.
  • People with thal1 have 9.59 times the odds of heart disease compared to the base level (thal0), holding the other variables fixed.
  • People with thal2 have 13.50 times the odds of heart disease compared to the base level (thal0), holding the other variables fixed.
  • People with thal3 have 2.20 times the odds of heart disease compared to the base level (thal0), holding the other variables fixed.

Numeric Predictor

age,trestbps,chol,thalach,oldpeak

model_numeric <- glm(formula = target~ age+trestbps+chol+thalach+oldpeak ,
                  data = df_train,
                  family = "binomial")
summary(model_numeric)
#> 
#> Call:
#> glm(formula = target ~ age + trestbps + chol + thalach + oldpeak, 
#>     family = "binomial", data = df_train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.1848  -0.7812   0.4288   0.8541   2.6114  
#> 
#> Coefficients:
#>              Estimate Std. Error z value             Pr(>|z|)    
#> (Intercept) -2.025752   1.098845  -1.844               0.0653 .  
#> age          0.003310   0.010839   0.305               0.7601    
#> trestbps    -0.011531   0.005264  -2.190               0.0285 *  
#> chol        -0.004096   0.001749  -2.342               0.0192 *  
#> thalach      0.035287   0.004557   7.744  0.00000000000000964 ***
#> oldpeak     -0.834227   0.092076  -9.060 < 0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1135.66  on 819  degrees of freedom
#> Residual deviance:  851.21  on 814  degrees of freedom
#> AIC: 863.21
#> 
#> Number of Fisher Scoring iterations: 4
exp(model_numeric$coefficients)
#> (Intercept)         age    trestbps        chol     thalach     oldpeak 
#>   0.1318946   1.0033152   0.9885348   0.9959125   1.0359168   0.4342098

Variables that lower the odds:

  • trestbps, chol, oldpeak

📈 Model interpretation :

  • For each 1-point increase in age, the odds of heart disease increase by a factor of about 1.00, provided the other variables keep the same values.
  • For each 1-point increase in thalach, the odds of heart disease increase by a factor of about 1.04, provided the other variables keep the same values.

All Predictor

model_all <- glm(formula = target~. ,
                  data = df_train,
                  family = "binomial")
summary(model_all)
#> 
#> Call:
#> glm(formula = target ~ ., family = "binomial", data = df_train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -3.0412  -0.2616   0.0570   0.4065   3.3227  
#> 
#> Coefficients:
#>              Estimate Std. Error z value          Pr(>|z|)    
#> (Intercept) -0.265828   2.309760  -0.115          0.908375    
#> age          0.026965   0.016411   1.643          0.100368    
#> sex1        -2.078567   0.375782  -5.531 0.000000031783653 ***
#> cp1          0.829104   0.353346   2.346          0.018954 *  
#> cp2          2.168814   0.341372   6.353 0.000000000210833 ***
#> cp3          2.466048   0.466150   5.290 0.000000122150326 ***
#> trestbps    -0.028018   0.007818  -3.584          0.000339 ***
#> chol        -0.006596   0.002753  -2.396          0.016579 *  
#> fbs1         0.412043   0.372173   1.107          0.268238    
#> restecg1     0.374898   0.255211   1.469          0.141840    
#> restecg2    -0.562131   1.926327  -0.292          0.770428    
#> thalach      0.025281   0.007816   3.235          0.001218 ** 
#> exang1      -0.729219   0.292905  -2.490          0.012789 *  
#> oldpeak     -0.511915   0.155308  -3.296          0.000980 ***
#> slope1      -0.323303   0.534445  -0.605          0.545224    
#> slope2       1.168961   0.563860   2.073          0.038159 *  
#> ca1         -2.542979   0.342386  -7.427 0.000000000000111 ***
#> ca2         -3.680525   0.521668  -7.055 0.000000000001722 ***
#> ca3         -1.675346   0.600219  -2.791          0.005251 ** 
#> ca4          1.891545   1.047329   1.806          0.070908 .  
#> thal1        2.940066   1.568853   1.874          0.060927 .  
#> thal2        2.764906   1.513894   1.826          0.067797 .  
#> thal3        1.091833   1.515119   0.721          0.471140    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1135.66  on 819  degrees of freedom
#> Residual deviance:  452.07  on 797  degrees of freedom
#> AIC: 498.07
#> 
#> Number of Fisher Scoring iterations: 6
exp(model_all$coefficients)
#> (Intercept)         age        sex1         cp1         cp2         cp3 
#>  0.76657125  1.02733167  0.12510939  2.29126495  8.74789990 11.77581908 
#>    trestbps        chol        fbs1    restecg1    restecg2     thalach 
#>  0.97237088  0.99342604  1.50989958  1.45484288  0.56999318  1.02560367 
#>      exang1     oldpeak      slope1      slope2         ca1         ca2 
#>  0.48228567  0.59934695  0.72375441  3.21864567  0.07863177  0.02520973 
#>         ca3         ca4       thal1       thal2       thal3 
#>  0.18724333  6.62960374 18.91708684 15.87755124  2.97973045

Variables that lower the odds:

sex1, restecg2, exang1, slope1, ca1, ca2, ca3, trestbps, chol, oldpeak

📈 Model interpretation (odds ratios taken from exp(model_all$coefficients) above):

  • For each 1-point increase in age, the odds of heart disease increase by a factor of about 1.03, provided the other variables keep the same values.
  • For each 1-point increase in thalach, the odds of heart disease increase by a factor of about 1.03, provided the other variables keep the same values.
  • People with cp1 have 2.29 times the odds of heart disease compared to the base level (cp0), holding the other variables fixed.
  • People with cp2 have 8.75 times the odds of heart disease compared to the base level (cp0), holding the other variables fixed.
  • People with cp3 have 11.78 times the odds of heart disease compared to the base level (cp0), holding the other variables fixed.
  • People with restecg1 have 1.45 times the odds of heart disease compared to the base level (restecg0), holding the other variables fixed.
  • People with slope2 have 3.22 times the odds of heart disease compared to the base level (slope0), holding the other variables fixed.
  • People with ca4 have 6.63 times the odds of heart disease compared to the base level (ca0), holding the other variables fixed.
  • People with thal1 have 18.92 times the odds of heart disease compared to the base level (thal0), holding the other variables fixed.
  • People with thal2 have 15.88 times the odds of heart disease compared to the base level (thal0), holding the other variables fixed.
  • People with thal3 have 2.98 times the odds of heart disease compared to the base level (thal0), holding the other variables fixed.

Step Model

We start from the model with all predictors because of its low AIC and then perform feature selection using backward stepwise regression.

model_step <- step(object = model_all,
                   direction = "backward", trace = F)
summary(model_step)
#> 
#> Call:
#> glm(formula = target ~ age + sex + cp + trestbps + chol + thalach + 
#>     exang + oldpeak + slope + ca + thal, family = "binomial", 
#>     data = df_train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -3.0096  -0.2815   0.0540   0.4044   3.3720  
#> 
#> Coefficients:
#>              Estimate Std. Error z value           Pr(>|z|)    
#> (Intercept)  0.298116   2.469446   0.121           0.903911    
#> age          0.024359   0.016267   1.497           0.134274    
#> sex1        -2.080038   0.373448  -5.570 0.0000000255013944 ***
#> cp1          0.853616   0.351534   2.428           0.015172 *  
#> cp2          2.237598   0.337174   6.636 0.0000000000321570 ***
#> cp3          2.518670   0.462415   5.447 0.0000000512930186 ***
#> trestbps    -0.027613   0.007554  -3.655           0.000257 ***
#> chol        -0.007103   0.002650  -2.680           0.007367 ** 
#> thalach      0.025386   0.007880   3.222           0.001274 ** 
#> exang1      -0.700975   0.291217  -2.407           0.016082 *  
#> oldpeak     -0.522742   0.150681  -3.469           0.000522 ***
#> slope1      -0.367936   0.521164  -0.706           0.480195    
#> slope2       1.150041   0.547990   2.099           0.035848 *  
#> ca1         -2.522397   0.333224  -7.570 0.0000000000000374 ***
#> ca2         -3.586970   0.508731  -7.051 0.0000000000017787 ***
#> ca3         -1.537753   0.581697  -2.644           0.008204 ** 
#> ca4          2.042608   1.037289   1.969           0.048932 *  
#> thal1        2.880288   1.851205   1.556           0.119732    
#> thal2        2.616895   1.800002   1.454           0.145994    
#> thal3        0.973965   1.799240   0.541           0.588287    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1135.66  on 819  degrees of freedom
#> Residual deviance:  455.76  on 800  degrees of freedom
#> AIC: 495.76
#> 
#> Number of Fisher Scoring iterations: 6

Compare model_step with model_all

# Deviance
model_all$deviance
#> [1] 452.0742
model_step$deviance
#> [1] 455.7572
# AIC
model_all$aic
#> [1] 498.0742
model_step$aic
#> [1] 495.7572

We use model_step because it has the lower AIC and shows no signs of perfect separation.

KNN

k-NN or K-nearest neighbor classifies new data by comparing the characteristics of the new data (test data) with existing data (train data).

Proximity is measured by Euclidean distance: the k data points (neighbors) with the closest distance are selected, and the most common class among these neighbors becomes the class of the new data (majority voting), as illustrated by the small sketch after the list of characteristics below.

Characteristics of k-NN :

  • Does not create a model: classifies on the spot
  • Does not learn from the data in advance: every time you want to classify, you have to supply the training data again
  • No assumptions
  • Can predict multiclass
  • Good for numerical predictors (because it classifies by distance), not good for categorical predictors
  • Robust: performance is good -> error is small
  • Not interpretable -> does not form an equation
knitr::include_graphics("assets/KNN.png")
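
As a toy illustration of this idea (the points below are made up for demonstration and are not taken from the heart data), the distance computation and the majority vote can be written out by hand:

# four made-up training points with known classes
train_pts <- matrix(c(1, 1,
                      2, 1,
                      8, 9,
                      9, 8), ncol = 2, byrow = TRUE)
labels    <- factor(c("0", "0", "1", "1"))
new_pt    <- c(2, 2)

# Euclidean distance from the new point to every training point
d <- sqrt(rowSums((train_pts - matrix(new_pt, nrow = 4, ncol = 2, byrow = TRUE))^2))

# take the k = 3 nearest neighbours and let them vote
nearest <- order(d)[1:3]
names(which.max(table(labels[nearest])))  # majority class: "0"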

# predictor variables in `train`
train_x <- df_train %>% select_if(is.numeric)

# predictor variables in `test`
test_x <- df_test %>% select_if(is.numeric)

# target variable in `train`
train_y <- df_train[,"target"]

# target variable in `test`
test_y <- df_test[,"target"]

Remember that distance measurements in kNN are highly dependent on the data scale of the predictor variables entered as input of the model. The existence of a predictor that has a range of values that is very different from other predictors can cause problems in the classification model. Therefore, let’s normalize the data to equalize the scale of each predictor variable so that it has a standard range of values.

To normalize the train_x data, please use the scale() function. Meanwhile, to normalize the test data, please use the same function but use the center and scale attributes obtained from the train_x data.

# scale train_x data
train_xs <- scale(train_x)

# scale test_x data
test_xs <- scale(x = test_x,
                center = attr(train_xs,"scaled:center"),
                scale = attr(train_xs,"scaled:scale"))
# find optimum k
round(sqrt(nrow(test_xs)))
#> [1] 14
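
Since the target has two classes (an even number), an odd k avoids ties in the majority vote, which is why k = 15 is used in the knn() call later. A small helper sketch (our own addition, mirroring the rule of thumb above):

# rule-of-thumb k, bumped to the next odd number so a two-class vote cannot tie
k_opt <- round(sqrt(nrow(test_xs)))   # 14, as computed above
if (k_opt %% 2 == 0) k_opt <- k_opt + 1
k_opt                                 # 15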

Decision Tree

Decision Tree is a fairly simple tree-based model with robust, powerful performance for prediction. The Decision Tree produces a visualization in the form of a decision tree which can be interpreted easily.

Decision Tree additional characters:

  • Predictors are not assumed to be independent of one another, so multicollinearity is not a problem for the model.
  • Can handle numeric predictors that contain outliers.

Note: the Decision Tree is not limited to classification cases; it can also be used for regression. In this project our focus is on classification, but the idea is the same.

Structure

Let’s look at the structure of a Decision Tree and the terms that are often used, for example a tree that decides whether or not we will go out this weekend:

  • Root Node: the first branch in determining the target value, commonly referred to as the main predictor.
  • Interior Node: the next branch, using another predictor when the root node is not enough to determine the target.
  • Terminal/Leaf Node: the final decision, i.e. the predicted target value.

dtree_model <- ctree(formula = target ~.,
                     data = df_train,
                     control = ctree_control(mincriterion=0.95,
                                             minsplit=50,
                                             minbucket=20))
plot(dtree_model, type = "s")

Random Forest

Random Forest is a type of ensemble method consisting of many Decision Trees. Each Decision Tree has its own characteristics and is not related to the others. Random Forest makes use of the Bagging (Bootstrap and Aggregation) concept in its creation. Here is the process (a short code sketch follows the list):

  1. PROCESS 1 = Bootstrap sampling: generate data sets by random sampling (with replacement) from the full data, allowing duplicate rows.
  2. PROCESS 2 = One decision tree is built for each bootstrap sample. The mtry parameter controls how many predictor candidates are randomly selected at each split (automatic feature selection).
  3. PROCESS 3 = Each Decision Tree makes a prediction for a new observation.
  4. PROCESS 4 = Aggregation: the individual predictions are combined into a single prediction.
    • Classification case: majority voting
    • Regression case: average of the target values
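
A minimal sketch of this idea, fitting a forest directly with the randomForest package rather than through caret as done below (the ntree and mtry values here are illustrative assumptions, not the tuned values):

library(randomForest)
set.seed(417)
rf_sketch <- randomForest(target ~ ., data = df_train,
                          ntree = 500,  # number of bootstrapped trees
                          mtry  = 3)    # predictors sampled at each split
rf_sketch$err.rate[500, "OOB"]          # out-of-bag error after 500 trees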

Pros of Random Forests:

  • Suppresses the bias and variance of the Decision Tree, resulting in better predictive performance.
  • Automatic feature selection: Predictors are randomly selected in the making of the Decision Tree.
  • There is an out-of-bag error as a substitute for model evaluation.

One of the disadvantages of random forest is its computationally intensive and time-consuming nature. This can be mitigated by feature selection to reduce the number of predictors used in the model. If a large number of columns are found, we can remove columns with near-zero variance (less informative) using the nearZeroVar() function from the caret package.

Data Pre-processing

By applying feature selection techniques like nearZeroVar(), we can identify and remove predictors with very low variance, which often indicate little or no variation in the data. These low-variance predictors may not contribute much to the model’s predictive power and can be safely discarded, reducing the computational burden of the random forest algorithm.

By reducing the number of predictors, the random forest algorithm can become more efficient and faster in terms of computation time, without significantly sacrificing its predictive performance. It allows us to focus on the most informative predictors and improve the efficiency of the model training process.

# Data Pre-processing
# Please Run The Code Down Below
dim(df_train)
#> [1] 820  14
# feature selection using nearzerovar
zero_var <- nearZeroVar(df_train)
df_rf <- df_train %>% 
  select(-c(zero_var))

dim(df_rf)
#> [1] 820  14

insight :

  • No columns in the heart-disease training data have near-zero variance, so no predictors are removed (the dimensions stay 820 x 14).

K-Fold Cross Validation

Usually we do cross validation by dividing the data only into training and testing sets. K-Fold Cross Validation instead divides the data into \(k\) equal parts, where each part is used as the testing data in turn.

In this case we are using 10 as our K-Fold with repeats = 20.
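
To see what the folds look like before fitting anything, caret's createFolds() can be used on the training target (a small illustration only; the actual repeated-CV model fit is shown, commented out, below):

# split the 820 training rows into 10 roughly equal folds
set.seed(417)
folds <- createFolds(df_train$target, k = 10)
sapply(folds, length)  # about 82 rows per fold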

#set.seed(417)
#ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 20)

# make a model of random forest
#df_rf_model <- train(target~., df_train, method = "rf", trControl = ctrl)
#df_rf_model

# save the model
#saveRDS(df_rf_model, "df_forest.RDS")

# read model
df_forest = readRDS("df_forest.RDS")
df_forest
#> Random Forest 
#> 
#> 820 samples
#>  13 predictor
#>   2 classes: '0', '1' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold, repeated 20 times) 
#> Summary of sample sizes: 738, 738, 737, 739, 737, 737, ... 
#> Resampling results across tuning parameters:
#> 
#>   mtry  Accuracy   Kappa    
#>    2    0.9591782  0.9181581
#>   12    0.9823901  0.9647358
#>   22    0.9840365  0.9680364
#> 
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 22.

Insight :

  • 820 samples -> the number of rows in the data train used in making the model
  • 13 predictors -> the number of predictor variables in our train data
  • 2 classes -> the number of target classes that exist in our data
  • Summary of sample sizes -> the number of sample sizes in the training data resulting from the k-fold cross validation
  • mtry and accuracy show the number of mtry used and the accuracy value of each model mtry. This accuracy can be used as a reference for which model is the best based on its mtry.
df_forest$finalModel
#> 
#> Call:
#>  randomForest(x = x, y = y, mtry = param$mtry) 
#>                Type of random forest: classification
#>                      Number of trees: 500
#> No. of variables tried at each split: 22
#> 
#>         OOB estimate of  error rate: 0.73%
#> Confusion matrix:
#>     0   1 class.error
#> 0 392   3 0.007594937
#> 1   3 422 0.007058824

Insight :

  • Number of trees: 500 -> the random forest builds 500 trees.
  • No. of variables tried at each split: 22 -> mtry = 22; in this case the best mtry is 22, which gives the highest accuracy when evaluated on the bootstrap samples (which act as the training data for the individual decision trees in the random forest).
  • OOB estimate of error rate: 0.73% -> the out-of-bag error is computed on the out-of-bag samples (rows not seen during bootstrap sampling). In other words, the model accuracy on this held-out data is 100% - 0.73% = 99.27%. The confusion matrix shows the corresponding counts for the out-of-bag samples.
# run in console, then see panel `Plot`

library(animation)
ani.options(interval = 1, nmax = 15)
cv.ani(main = "Demonstration of the k-fold Cross Validation", bty = "l")

5. Prediction

Logistic Prediction

Logistic regression uses a linear model of the log-odds to predict probabilities, which can then be used for classification.

df_test$pred_risk <- predict(object = model_step, 
                     newdata = df_test, 
                     type = "response")

df_test_prob <- predict(object = model_step, 
                     newdata = df_test, 
                     type = "response")

df_test$pred_Label <- ifelse(test = df_test$pred_risk > 0.5, yes = 1, no = 0)

df_train$pred_risk <- predict(object = model_step, 
                     newdata = df_train, 
                     type = "response")

df_train$pred_Label <- ifelse(test = df_train$pred_risk > 0.5, yes = 1, no = 0)

Insight :

  • Logistic regression equation :

\[\log\left(\frac{p}{1-p}\right) = b_0 + b_1 x\]

  • We use logistic regression because it is interpretable and can process both numerical and categorical variables as predictors.
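
As a quick check of the equation above (an illustrative addition, assuming model_step and df_test are still in memory), the "link" prediction is exactly this log-of-odds, and applying the logistic function plogis() recovers the probabilities returned by type = "response":

# log-of-odds (linear predictor) vs probability scale
log_odds <- predict(model_step, newdata = df_test, type = "link")
probs    <- predict(model_step, newdata = df_test, type = "response")
all.equal(plogis(log_odds), probs)  # TRUE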

KNN

df_predKNN <- knn(train = train_xs, 
                 test = test_xs, 
                 cl = train_y, 
                 k = 15)
head(df_predKNN)
#> [1] 0 1 0 0 0 1
#> Levels: 0 1

Insight :

  • Picking Optimum k to avoid a tie when majority voting:
    • k must be odd if the number of target classes is even
    • k must be even if the number of target classes is odd

Decision Tree

# predict classes on the test data
pred_diab_test <- predict(object = dtree_model , 
                          newdata = df_test , 
                          type = "response")

Random Forest

df_rf_model <- readRDS("df_forest.RDS")
varImp(df_rf_model)
#> rf variable importance
#> 
#>   only 20 most important variables shown (out of 22)
#> 
#>           Overall
#> thal2    100.0000
#> oldpeak   46.2266
#> chol      34.0979
#> age       32.8648
#> thalach   31.1735
#> trestbps  25.3897
#> ca1       11.6868
#> cp3       10.1069
#> cp2        8.8004
#> sex1       8.2533
#> exang1     5.1028
#> slope2     4.4544
#> restecg1   2.1813
#> slope1     2.1627
#> ca2        2.0740
#> cp1        1.5632
#> fbs1       1.4097
#> ca3        1.3435
#> thal3      1.0615
#> thal1      0.9118
#RAW
pred_df_rf <- predict(object = df_rf_model,
                   newdata = df_test,
                   type ="raw")

#Prob
pred_df_rf_prob <- predict(object = df_rf_model,
                   newdata = df_test,
                   type ="prob")

6. Evaluation

💣 Business Case :

  • Machine learning engineers have been asked by doctors and hospital operational management to focus on the cases where a patient actually has heart disease but is predicted to be healthy; such misclassifications can be fatal for treatment, given that on average thousands of patients come in for cardiac re-examination.

Each model must have an error. We want to get a model with the smallest possible prediction error.

Logistic Regression

df_test %>% 
  select(target,pred_Label)

After making predictions using the model, there are still wrong predictions. In classification, we evaluate the model based on the confusion matrix:

table(predicted = df_test$pred_Label, actual = df_test$target)
#>          actual
#> predicted  0  1
#>         0 82 10
#>         1 22 91
  • Class determination:

    • positive class: the class that becomes a concern/observation
    • negative class: class that is not a concern/observation
  • Sample case:

    • Machine learning for detecting covid patients:
      • positive class: detected covid
      • negative class: detected healthy
  • Contents of the Confusion Matrix:

    • True Positive (TP): predicted positive and true (positive prediction; actual positive)
    • True Negative (TN): predicted negative and true (negative prediction; negative actual)
    • False Positive (FP): predicted positive but wrong (predictive positive; actual negative)
    • False Negative (FN): predicted negative but wrong (negative prediction; positive actual)

TRUE/FALSE:

  • TRUE -> the prediction is correct: the predicted result = the actual value
  • FALSE -> the prediction is wrong: the predicted result != the actual value

Indicate Heart Disease / Not Indicate Heart Disease

Positive class: Indicate Heart Disease, Negative class: Not Indicate Heart Disease

  • TP: predicted Indicate Heart Disease, and indeed got Heart Disease
  • TN: predicted Not Indicate Heart Disease, and indeed Not Indicate Heart Disease
  • FP: predicted Indicate Heart Disease, originally healthy -> panic
  • FN: predicted to be healthy (Not Indicate Heart Disease), originally Indicate Heart Disease -> quite fatal
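
To make these definitions concrete, the main metrics can be computed by hand from the table(predicted, actual) output above (a small check only; the same values appear in the confusionMatrix() output below):

# counts read from the confusion matrix above (positive class = 1)
TP <- 91; TN <- 82; FP <- 22; FN <- 10

(TP + TN) / (TP + TN + FP + FN)  # accuracy, about 0.844
TP / (TP + FN)                   # recall / sensitivity, about 0.901
TN / (TN + FP)                   # specificity, about 0.788
TP / (TP + FP)                   # precision / pos pred value, about 0.805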

Data Test

# confusion matrix
library(caret)

confusionMatrix(data = as.factor(df_test$pred_Label), 
                reference = as.factor(df_test$target), 
                positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 82 10
#>          1 22 91
#>                                               
#>                Accuracy : 0.8439              
#>                  95% CI : (0.7868, 0.8907)    
#>     No Information Rate : 0.5073              
#>     P-Value [Acc > NIR] : < 0.0000000000000002
#>                                               
#>                   Kappa : 0.6883              
#>                                               
#>  Mcnemar's Test P-Value : 0.05183             
#>                                               
#>             Sensitivity : 0.9010              
#>             Specificity : 0.7885              
#>          Pos Pred Value : 0.8053              
#>          Neg Pred Value : 0.8913              
#>              Prevalence : 0.4927              
#>          Detection Rate : 0.4439              
#>    Detection Prevalence : 0.5512              
#>       Balanced Accuracy : 0.8447              
#>                                               
#>        'Positive' Class : 1                   
#> 
  • Positive class : indicates heart disease
  • FN : actually indicates heart disease, but predicted as no heart disease
  • FP : actually no heart disease, but predicted as heart disease
  • Metric of interest = Recall

The logistic regression is 84.39% accurate, but there are still wrong predictions:

  • 10 patients who actually have heart disease were predicted healthy by the logistic regression.

Data Train

# confusion matrix
library(caret)

confusionMatrix(data = as.factor(df_train$pred_Label), 
                reference = as.factor(df_train$target), 
                positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 339  26
#>          1  56 399
#>                                                
#>                Accuracy : 0.9                  
#>                  95% CI : (0.8774, 0.9197)     
#>     No Information Rate : 0.5183               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.7992               
#>                                                
#>  Mcnemar's Test P-Value : 0.001362             
#>                                                
#>             Sensitivity : 0.9388               
#>             Specificity : 0.8582               
#>          Pos Pred Value : 0.8769               
#>          Neg Pred Value : 0.9288               
#>              Prevalence : 0.5183               
#>          Detection Rate : 0.4866               
#>    Detection Prevalence : 0.5549               
#>       Balanced Accuracy : 0.8985               
#>                                                
#>        'Positive' Class : 1                    
#> 

ROC

Accuracy alone does not fully show how well the model separates the two classes. To overcome this limitation, ROC and AUC are available as evaluation tools alongside the Confusion Matrix.

ROC is a curve that describes the relationship between the True Positive Rate (Sensitivity or Recall) and the False Positive Rate (1-Specificity) at each threshold. A good model should ideally have a high True Positive Rate and a low False Positive Rate. Note: Specificity is a True Negative Rate.

AUC shows the area under the ROC curve. The closer to 1, the better the model performance in separating positive and negative classes. To get the AUC value, use the parameter measure = "auc" in the performance() function and then take the value y.values.

# ROC
log_roc <- data.frame(prediction=round(df_test_prob,4),
                      trueclass= as.numeric(df_test$target == 1))

log_roc <- prediction(log_roc$prediction, log_roc$trueclass) 

# ROC curve
plot(performance(log_roc, "tpr", "fpr"),
     main = "ROC")
abline(a = 0, b = 1)

AUC

#Check AUC
log_value <- performance(prediction.obj = log_roc, 
                         measure = "auc")
log_value@y.values
#> [[1]]
#> [1] 0.9119383

Insight :

  • The AUC of 91.19% means the model performs well in separating the positive and negative classes.

KNN Method

confusionMatrix(data = df_predKNN, 
                reference = as.factor(test_y), 
                positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 71 25
#>          1 33 76
#>                                           
#>                Accuracy : 0.7171          
#>                  95% CI : (0.6501, 0.7776)
#>     No Information Rate : 0.5073          
#>     P-Value [Acc > NIR] : 0.0000000007776 
#>                                           
#>                   Kappa : 0.4347          
#>                                           
#>  Mcnemar's Test P-Value : 0.358           
#>                                           
#>             Sensitivity : 0.7525          
#>             Specificity : 0.6827          
#>          Pos Pred Value : 0.6972          
#>          Neg Pred Value : 0.7396          
#>              Prevalence : 0.4927          
#>          Detection Rate : 0.3707          
#>    Detection Prevalence : 0.5317          
#>       Balanced Accuracy : 0.7176          
#>                                           
#>        'Positive' Class : 1               
#> 
  • Positive class : indicates heart disease
  • FN : actually indicates heart disease, but predicted as no heart disease
  • FP : actually no heart disease, but predicted as heart disease
  • Metric of interest = Recall

The KNN model is 71.71% accurate and there are still many wrong predictions. Recalling the characteristics of KNN, most of our predictors are categorical and only the numeric ones could be used here; KNN is good for numerical predictors (because it classifies by distance).

KNN Improvement

df = read.csv("LBB DEAN/heart.csv")
df <- df %>%
  mutate_at(vars(sex,cp,fbs,restecg,exang,slope,ca,thal), as.factor)
glimpse(df)
#> Rows: 1,025
#> Columns: 14
#> $ age      <int> 52, 53, 70, 61, 62, 58, 58, 55, 46, 54, 71, 43, 34, 51, 52, 3…
#> $ sex      <fct> 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1…
#> $ cp       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 0, 1, 2, 2…
#> $ trestbps <int> 125, 140, 145, 148, 138, 100, 114, 160, 120, 122, 112, 132, 1…
#> $ chol     <int> 212, 203, 174, 203, 294, 248, 318, 289, 249, 286, 149, 341, 2…
#> $ fbs      <fct> 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0…
#> $ restecg  <fct> 1, 0, 1, 1, 1, 0, 2, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0…
#> $ thalach  <int> 168, 155, 125, 161, 106, 122, 140, 145, 144, 116, 125, 136, 1…
#> $ exang    <fct> 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0…
#> $ oldpeak  <dbl> 1.0, 3.1, 2.6, 0.0, 1.9, 1.0, 4.4, 0.8, 0.8, 3.2, 1.6, 3.0, 0…
#> $ slope    <fct> 2, 0, 0, 2, 1, 1, 0, 1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 2, 1…
#> $ ca       <fct> 2, 0, 0, 1, 3, 0, 3, 1, 0, 2, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0…
#> $ thal     <fct> 3, 3, 3, 3, 2, 2, 1, 3, 3, 2, 2, 3, 2, 3, 0, 2, 2, 3, 2, 2, 2…
#> $ target   <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0…
RNGkind(sample.kind = "Rounding") 
set.seed(417)

# index sampling
index <- sample(x = nrow(df), size = nrow(df)*0.8)
# sample(x = nrow(df), size = nrow(df)*0.2)

# splitting
df_train <- df[index,]
df_test <- df[-index,]
# predictor variables in `train`
train_x <- df_train %>% select_if(is.numeric)

# predictor variables in `test`
test_x <- df_test %>% select_if(is.numeric)

# target variable in `train`
train_y <- df_train[,"target"]

# target variable in `test`
test_y <- df_test[,"target"]

# scale train_x data
train_xs <- scale(train_x)

# scale test_x data
test_xs <- scale(x = test_x,
                center = attr(train_xs,"scaled:center"),
                scale = attr(train_xs,"scaled:scale"))
# find optimum k
round(sqrt(nrow(test_xs)))
#> [1] 14
df_predKNN <- knn(train = train_xs, 
                 test = test_xs, 
                 cl = train_y, 
                 k = 15)
head(df_predKNN)
#> [1] 0 1 0 0 0 1
#> Levels: 0 1
knn_cm <- confusionMatrix(data = df_predKNN, 
                reference = as.factor(test_y), 
                positive = "1")

knn_cm
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 104   1
#>          1   0 100
#>                                              
#>                Accuracy : 0.9951             
#>                  95% CI : (0.9731, 0.9999)   
#>     No Information Rate : 0.5073             
#>     P-Value [Acc > NIR] : <0.0000000000000002
#>                                              
#>                   Kappa : 0.9902             
#>                                              
#>  Mcnemar's Test P-Value : 1                  
#>                                              
#>             Sensitivity : 0.9901             
#>             Specificity : 1.0000             
#>          Pos Pred Value : 1.0000             
#>          Neg Pred Value : 0.9905             
#>              Prevalence : 0.4927             
#>          Detection Rate : 0.4878             
#>    Detection Prevalence : 0.4878             
#>       Balanced Accuracy : 0.9950             
#>                                              
#>        'Positive' Class : 1                  
#> 

Insight :

  • We successfully improved the model accuracy from 71.71% to 99.51%.
  • We focused on separating the factor and non-factor columns, following KNN best practice.
  • Even at 99.51% accuracy we should still pay attention to the patient who actually has heart disease but was predicted as not having heart disease, because our focus is on the positive class = 1 (“indicates heart disease”).
  • ROC/AUC cannot be applied to this KNN output because the knn() prediction does not provide probabilities or scores; instead, KNN returns the class label of the majority of the nearest neighbors as the prediction for new data.

Decision Tree

Data Test

# predict classes on the test data
pred_diab_test <- predict(object = dtree_model , 
                          newdata = df_test , 
                          type = "response")

pred_diab_prob <- predict(object = dtree_model , 
                          newdata = df_test , 
                          type = "prob")


# Confusion Matrix: data test
confusionMatrix(data = as.factor(pred_diab_test), 
                reference = as.factor(df_test$target), 
                positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 73  9
#>          1 31 92
#>                                                
#>                Accuracy : 0.8049               
#>                  95% CI : (0.7439, 0.8568)     
#>     No Information Rate : 0.5073               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 0.6109               
#>                                                
#>  Mcnemar's Test P-Value : 0.0008989            
#>                                                
#>             Sensitivity : 0.9109               
#>             Specificity : 0.7019               
#>          Pos Pred Value : 0.7480               
#>          Neg Pred Value : 0.8902               
#>              Prevalence : 0.4927               
#>          Detection Rate : 0.4488               
#>    Detection Prevalence : 0.6000               
#>       Balanced Accuracy : 0.8064               
#>                                                
#>        'Positive' Class : 1                    
#> 

Data Train

# predict classes on the train data
pred_diab_train <- predict(object = dtree_model , 
                          newdata = df_train , 
                          type = "response")

# Confusion Matrix: data train
confusionMatrix(data = as.factor(pred_diab_train), 
                reference = as.factor(df_train$target), 
                positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 330  44
#>          1  65 381
#>                                               
#>                Accuracy : 0.8671              
#>                  95% CI : (0.8419, 0.8896)    
#>     No Information Rate : 0.5183              
#>     P-Value [Acc > NIR] : < 0.0000000000000002
#>                                               
#>                   Kappa : 0.7333              
#>                                               
#>  Mcnemar's Test P-Value : 0.05541             
#>                                               
#>             Sensitivity : 0.8965              
#>             Specificity : 0.8354              
#>          Pos Pred Value : 0.8543              
#>          Neg Pred Value : 0.8824              
#>              Prevalence : 0.5183              
#>          Detection Rate : 0.4646              
#>    Detection Prevalence : 0.5439              
#>       Balanced Accuracy : 0.8660              
#>                                               
#>        'Positive' Class : 1                   
#> 

ROC

# ROC
dtree_roc <- data.frame(prediction=round(pred_diab_prob[,2],4),
                      trueclass= as.numeric(df_test$target == 1))

dtree_roc <- prediction(dtree_roc$prediction, dtree_roc$trueclass) 

# ROC curve
plot(performance(dtree_roc, "tpr", "fpr"),
     main = "ROC")
abline(a = 0, b = 1)

AUC

#Check AUC
dt_value <- performance(prediction.obj = dtree_roc, 
                         measure = "auc")
dt_value@y.values
#> [[1]]
#> [1] 0.9107959

Insight :

  • The AUC of 91.07% means the decision tree performs well in separating the positive and negative classes.

Random Forest

Data Test

confusionMatrix(data = as.factor(pred_df_rf),
                reference = as.factor(df_test$target),
                positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 104   0
#>          1   0 101
#>                                                
#>                Accuracy : 1                    
#>                  95% CI : (0.9822, 1)          
#>     No Information Rate : 0.5073               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 1                    
#>                                                
#>  Mcnemar's Test P-Value : NA                   
#>                                                
#>             Sensitivity : 1.0000               
#>             Specificity : 1.0000               
#>          Pos Pred Value : 1.0000               
#>          Neg Pred Value : 1.0000               
#>              Prevalence : 0.4927               
#>          Detection Rate : 0.4927               
#>    Detection Prevalence : 0.4927               
#>       Balanced Accuracy : 1.0000               
#>                                                
#>        'Positive' Class : 1                    
#> 

Data Train

#RAW
pred_df_rf_train <- predict(object = df_rf_model,
                   newdata = df_train,
                   type ="raw")

confusionMatrix(data = as.factor(pred_df_rf_train),
                reference = as.factor(df_train$target),
                positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 395   0
#>          1   0 425
#>                                                
#>                Accuracy : 1                    
#>                  95% CI : (0.9955, 1)          
#>     No Information Rate : 0.5183               
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 1                    
#>                                                
#>  Mcnemar's Test P-Value : NA                   
#>                                                
#>             Sensitivity : 1.0000               
#>             Specificity : 1.0000               
#>          Pos Pred Value : 1.0000               
#>          Neg Pred Value : 1.0000               
#>              Prevalence : 0.5183               
#>          Detection Rate : 0.5183               
#>    Detection Prevalence : 0.5183               
#>       Balanced Accuracy : 1.0000               
#>                                                
#>        'Positive' Class : 1                    
#> 

Insight :

  • The Random Forest classification accuracy on both the train and test data is 100%, so the tuned model performs optimally and does not appear to be overfitting.

ROC

# ROC
rf_roc <- data.frame(prediction=round(pred_df_rf_prob[,2],4),
                      trueclass= as.numeric(df_test$target == 1))

rf_roc <- prediction(rf_roc$prediction, rf_roc$trueclass) 

# ROC curve
plot(performance(rf_roc, "tpr", "fpr"),
     main = "ROC")
abline(a = 0, b = 1)

AUC

#Check AUC
rf_value <- performance(prediction.obj = rf_roc, 
                         measure = "auc")
rf_value@y.values
#> [[1]]
#> [1] 1

Insight :

  • An AUC of 100% means the model separates the positive and negative classes perfectly.
  • Our ROC curve, with its inverted-L shape, shows that the model distinguishes the positive and negative classes well, and the AUC is already 100%.

7. Conclusion

cbind(Random_Forest_AUC = rf_value@y.values, Decision_Tree_AUC = dt_value@y.values, Logistic_Reg_AUC = log_value@y.values, KNN = knn_cm$overall["Accuracy"])
#>          Random_Forest_AUC Decision_Tree_AUC Logistic_Reg_AUC KNN     
#> Accuracy 1                 0.9107959         0.9119383        0.995122
  • As the table above shows, Random Forest is the best model with the highest score, followed by KNN with 99.51% accuracy.
  • We choose Random Forest because it tends to perform better than KNN and can handle both binary and multi-class classification.
  • The log-of-odds coefficients produced by logistic regression cannot be interpreted directly, but they can be converted to odds and/or probabilities so that they can be interpreted.
  • Logistic regression uses a linear model of the log-odds to predict probabilities, which can then be used for classification.
  • Our goals above have been achieved: we obtained the best machine learning model, Random Forest, and we can read the predicted heart-disease status for every patient in the test data (df_test).
  • Relying on accuracy like this is only appropriate when the target classes are balanced. If the target class is imbalanced and the threshold is shifted, either Recall or Precision will usually drop; to make sure a threshold shift stays optimal, we generally check the model’s quality first using the AUC value.
  • Because KNN works best with numerical predictors (it classifies by distance), the data set used for KNN should be restricted to the numeric columns rather than including factor or categorical columns.