1 Objective

In this notebook I will build prediction models to determine whether a patient (or you) has a healthy or an unhealthy heart condition, using a Logistic Regression model and a KNN model.

The prediction is meant to help both the medical examiner and the patient decide whether or not to proceed with further medical procedures, rather than to serve as an absolute answer.

2 Setup and Data Import

options(scipen = 999)
library(dplyr)
library(gtools)
library(ggplot2)
library(class)
library(tidyr)
library(rsample)
library(MASS)
library(caret)

2.1 Input Data and Inspection

# note: the first column is imported as "ï..age" because the CSV contains a
# byte-order mark; the name is kept as-is so the code matches the output below
heart <- read.csv('heart.csv')
head(heart)

Data Dictionary :

  1. age : age in years

  2. sex : 1 = male, 0 = female

  3. cp : chest pain type (4 values)

  4. trestbps : resting blood pressure

  5. chol : serum cholesterol in mg/dl

  6. fbs : fasting blood sugar > 120 mg/dl : 1 = yes, 0 = no

  7. restecg : resting electrocardiographic results (values 0, 1, 2)

  8. thalach : maximum heart rate achieved

  9. exang : exercise induced angina : 1 = yes, 0 = no

  10. oldpeak : ST depression induced by exercise relative to rest

  11. slope : the slope of the peak exercise ST segment

  12. ca : number of major vessels (0-3) colored by fluoroscopy

  13. thal : 3 = normal; 6 = fixed defect; 7 = reversible defect

  14. target : 1 = Not Healthy, 0 = Healthy
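
To cross-check the dictionary against the imported data, a quick count of distinct values per column shows which variables are categorical; a minimal base-R sketch:

# number of distinct values per column; low counts flag the categorical variables
sapply(heart, function(x) length(unique(x)))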

summary(heart)
#>      ï..age           sex               cp           trestbps    
#>  Min.   :29.00   Min.   :0.0000   Min.   :0.000   Min.   : 94.0  
#>  1st Qu.:47.50   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:120.0  
#>  Median :55.00   Median :1.0000   Median :1.000   Median :130.0  
#>  Mean   :54.37   Mean   :0.6832   Mean   :0.967   Mean   :131.6  
#>  3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:140.0  
#>  Max.   :77.00   Max.   :1.0000   Max.   :3.000   Max.   :200.0  
#>       chol            fbs            restecg          thalach     
#>  Min.   :126.0   Min.   :0.0000   Min.   :0.0000   Min.   : 71.0  
#>  1st Qu.:211.0   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:133.5  
#>  Median :240.0   Median :0.0000   Median :1.0000   Median :153.0  
#>  Mean   :246.3   Mean   :0.1485   Mean   :0.5281   Mean   :149.6  
#>  3rd Qu.:274.5   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:166.0  
#>  Max.   :564.0   Max.   :1.0000   Max.   :2.0000   Max.   :202.0  
#>      exang           oldpeak         slope             ca        
#>  Min.   :0.0000   Min.   :0.00   Min.   :0.000   Min.   :0.0000  
#>  1st Qu.:0.0000   1st Qu.:0.00   1st Qu.:1.000   1st Qu.:0.0000  
#>  Median :0.0000   Median :0.80   Median :1.000   Median :0.0000  
#>  Mean   :0.3267   Mean   :1.04   Mean   :1.399   Mean   :0.7294  
#>  3rd Qu.:1.0000   3rd Qu.:1.60   3rd Qu.:2.000   3rd Qu.:1.0000  
#>  Max.   :1.0000   Max.   :6.20   Max.   :2.000   Max.   :4.0000  
#>       thal           target      
#>  Min.   :0.000   Min.   :0.0000  
#>  1st Qu.:2.000   1st Qu.:0.0000  
#>  Median :2.000   Median :1.0000  
#>  Mean   :2.314   Mean   :0.5446  
#>  3rd Qu.:3.000   3rd Qu.:1.0000  
#>  Max.   :3.000   Max.   :1.0000

3 EDA & Data Wrangling

3.1 Target Proportion

Check target class proportion.

prop.table(table(heart$target))
#> 
#>         0         1 
#> 0.4554455 0.5445545

Our classes are fairly balanced (roughly 46% vs 54%), so our models will be able to learn from both classes.
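
As a quick visual check of the balance, a minimal sketch using the dplyr and ggplot2 packages loaded in the setup:

# bar chart of the target class proportions
heart %>%
  count(target) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot(aes(x = factor(target), y = prop)) +
  geom_col() +
  labs(x = "target", y = "proportion")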

3.2 Assigning the Right Data Types

Convert the integer columns to factors and give the binary variables descriptive labels instead of zeros and ones, so the output is easier to interpret.

heart <- heart %>% 
  mutate_if(is.integer, as.factor) %>% 
  mutate(sex = factor(sex, levels = c(0,1), labels = c("Female", "Male")),
         fbs = factor(fbs, levels = c(0,1), labels = c("No", "Yes")),
         exang = factor(exang, levels = c(0,1), labels = c("No", "Yes")),
         target = factor(target, levels = c(0,1), labels = c("Healthy", "Not Healthy")),
         # mutate_if above turned these numeric columns into factors, so they are
         # converted back through as.character to recover their original values
         ï..age = as.integer(as.character(ï..age)),
         trestbps = as.integer(as.character(trestbps)),
         chol = as.integer(as.character(chol)),
         thalach = as.integer(as.character(thalach)))

3.3 Check Missing Values

Check for missing values.

anyNA(heart)
#> [1] FALSE

No missing values found, so we can move on to the next step.
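
If anyNA() had returned TRUE, a per-column count would show where the gaps are; a minimal base-R sketch:

# number of missing values in each column
colSums(is.na(heart))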

3.4 Train Test Data Split

Split the data into training and testing sets, stratified by the target class so both sets keep the same class balance.

set.seed(420)

index <- initial_split(heart, 0.7, strata = 'target')
heart.train <- training(index)
heart.test <- testing(index)

3.5 Check Target Proportion

Check the proportions again on both sets to confirm that the stratified split preserved the class balance.

prop.table(table(heart.train$target))
#> 
#>     Healthy Not Healthy 
#>   0.4553991   0.5446009
prop.table(table(heart.test$target))
#> 
#>     Healthy Not Healthy 
#>   0.4555556   0.5444444

4 Creating Models

4.1 Logistic Regression

The easiest way to determine which variables to use as predictors without business knowledge is to fit a model on all available variables, and then eliminate variables one by one using a stepwise procedure.

init_all <- glm(target ~ ., family = 'binomial', data = heart.train)

4.1.1 Feature Selection

Backward Stepwise Method

backstep_model <- step(init_all, direction = 'backward', trace = 0)
summary(backstep_model)
#> 
#> Call:
#> glm(formula = target ~ ï..age + sex + cp + trestbps + chol + 
#>     fbs + thalach + exang + slope + ca + thal, family = "binomial", 
#>     data = heart.train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.8734  -0.3131   0.1023   0.4336   3.3986  
#> 
#> Coefficients:
#>                Estimate  Std. Error z value   Pr(>|z|)    
#> (Intercept)   13.624107 1455.398180   0.009    0.99253    
#> ï..age         0.055260    0.033267   1.661    0.09669 .  
#> sexMale       -1.894728    0.631525  -3.000    0.00270 ** 
#> cp1            0.626882    0.699937   0.896    0.37045    
#> cp2            1.979165    0.659954   2.999    0.00271 ** 
#> cp3            1.528225    0.816119   1.873    0.06113 .  
#> trestbps      -0.066326    0.025622  -2.589    0.00963 ** 
#> chol          -0.015633    0.007114  -2.198    0.02798 *  
#> fbsYes         1.276485    0.778762   1.639    0.10119    
#> thalach        0.032078    0.015280   2.099    0.03579 *  
#> exangYes      -0.885005    0.575772  -1.537    0.12427    
#> slope1        -1.252268    0.971136  -1.289    0.19723    
#> slope2         0.455515    0.986081   0.462    0.64412    
#> ca1           -2.175795    0.622176  -3.497    0.00047 ***
#> ca2           -4.739092    1.029772  -4.602 0.00000418 ***
#> ca3           -2.091254    0.998433  -2.095    0.03621 *  
#> ca4            1.459926    1.700220   0.859    0.39052    
#> thal1        -11.883109 1455.398489  -0.008    0.99349    
#> thal2        -10.806283 1455.398015  -0.007    0.99408    
#> thal3        -12.343589 1455.397972  -0.008    0.99323    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 293.58  on 212  degrees of freedom
#> Residual deviance: 127.65  on 193  degrees of freedom
#> AIC: 167.65
#> 
#> Number of Fisher Scoring iterations: 14

Apparently the thal variable causes (quasi-)complete separation, which can be recognized by the unusually large coefficients and standard errors on the log-odds scale, so we remove it and refit a reduced model on the remaining strong predictors (sex, cp, trestbps, exang, oldpeak, ca and slope).
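
One quick way to see the separation (a minimal sketch; the coefficient names thal1-thal3 are taken from the summary output above) is to cross-tabulate the suspect factor against the target and look at its coefficient rows:

# near-empty cells for one class in this cross-tab are the tell-tale sign
table(heart.train$thal, heart.train$target)

# the thal terms show huge estimates and standard errors on the log-odds scale
summary(backstep_model)$coefficients[c("thal1", "thal2", "thal3"), ]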

model_heart <- glm(target ~ sex + cp + trestbps + exang + oldpeak + ca + slope,
                   family = 'binomial', data = heart.train)
summary(model_heart)
#> 
#> Call:
#> glm(formula = target ~ sex + cp + trestbps + exang + oldpeak + 
#>     ca + slope, family = "binomial", data = heart.train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.8501  -0.3986   0.1696   0.5160   2.6324  
#> 
#> Coefficients:
#>             Estimate Std. Error z value  Pr(>|z|)    
#> (Intercept)  4.56710    1.39230   3.280  0.001037 ** 
#> sexMale     -1.72180    0.52842  -3.258  0.001120 ** 
#> cp1          1.03857    0.64351   1.614  0.106543    
#> cp2          2.03498    0.57740   3.524  0.000424 ***
#> cp3          2.07917    0.82697   2.514  0.011930 *  
#> trestbps    -0.04457    0.02199  -2.027  0.042628 *  
#> exangYes    -1.18674    0.52438  -2.263  0.023629 *  
#> oldpeak     -0.57757    0.26333  -2.193  0.028285 *  
#> ca1         -1.94683    0.53520  -3.638  0.000275 ***
#> ca2         -3.43424    0.83659  -4.105 0.0000404 ***
#> ca3         -1.70245    0.85405  -1.993  0.046219 *  
#> ca4          0.55490    1.62975   0.340  0.733494    
#> slope1      -2.09305    1.03141  -2.029  0.042427 *  
#> slope2      -0.47960    1.12540  -0.426  0.669992    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 293.58  on 212  degrees of freedom
#> Residual deviance: 144.37  on 199  degrees of freedom
#> AIC: 172.37
#> 
#> Number of Fisher Scoring iterations: 6

Now that we have a model, let's evaluate it.

4.1.2 Logistic Model Evaluation

Run the prediction on the test set.

# model prediction: predicted probability of the positive class ("Not Healthy")
heart_pred <- predict(model_heart, newdata = heart.test, type = 'response')

By default the threshold is 0.5, but I lower it to 0.4 to prioritize recall.

heart_pred_class <- as.factor(ifelse(heart_pred > 0.4, 'Not Healthy', 'Healthy'))

4.1.3 Model Evaluation

confusionMatrix(heart_pred_class, heart.test$target, positive = "Not Healthy")
#> Confusion Matrix and Statistics
#> 
#>              Reference
#> Prediction    Healthy Not Healthy
#>   Healthy          30           5
#>   Not Healthy      11          44
#>                                           
#>                Accuracy : 0.8222          
#>                  95% CI : (0.7274, 0.8948)
#>     No Information Rate : 0.5444          
#>     P-Value [Acc > NIR] : 0.0000000284    
#>                                           
#>                   Kappa : 0.6373          
#>                                           
#>  Mcnemar's Test P-Value : 0.2113          
#>                                           
#>             Sensitivity : 0.8980          
#>             Specificity : 0.7317          
#>          Pos Pred Value : 0.8000          
#>          Neg Pred Value : 0.8571          
#>              Prevalence : 0.5444          
#>          Detection Rate : 0.4889          
#>    Detection Prevalence : 0.6111          
#>       Balanced Accuracy : 0.8148          
#>                                           
#>        'Positive' Class : Not Healthy     
#> 

Because we do not want to misdiagnose an unhealthy person as healthy, I reduced the threshold to 0.4. It could be lowered further, but that would result in more false positives (diagnosing healthy people as unhealthy). This might sound ‘bad’, but if you think about it, in the healthcare industry you most definitely do not want to misdiagnose your patient.
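
To see how this trade-off moves with the cut-off, a small sweep over candidate thresholds can be run on the test predictions; a minimal sketch reusing heart_pred and heart.test from above (the candidate thresholds are illustrative):

# recall (sensitivity) and specificity for a few candidate thresholds
thresholds <- c(0.3, 0.4, 0.5, 0.6)
sweep_result <- sapply(thresholds, function(t) {
  pred <- factor(ifelse(heart_pred > t, 'Not Healthy', 'Healthy'),
                 levels = levels(heart.test$target))
  cm <- confusionMatrix(pred, heart.test$target, positive = "Not Healthy")
  c(recall = cm$byClass[["Sensitivity"]], specificity = cm$byClass[["Specificity"]])
})
colnames(sweep_result) <- thresholds
sweep_result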

4.1.4 Logistic Model Interpretation

exp(model_heart$coefficients)
#> (Intercept)     sexMale         cp1         cp2         cp3    trestbps 
#> 96.26421814  0.17874380  2.82518196  7.65212794  7.99781582  0.95640458 
#>    exangYes     oldpeak         ca1         ca2         ca3         ca4 
#>  0.30521567  0.56126227  0.14272622  0.03224996  0.18223599  1.74176524 
#>      slope1      slope2 
#>  0.12330988  0.61903111
  1. The odds of a male individual having an unhealthy heart are about 0.18 times the odds for a female individual, i.e. roughly 82% lower. The same reading applies to the other categorical variables relative to their base level (e.g. cp1/cp2/cp3 compared to cp0), each with its own odds ratio.
  2. An increase of 1 unit in ‘trestbps’ (resting blood pressure) multiplies the odds of an unhealthy heart by about 0.96, i.e. roughly a 4% decrease; the same reading applies to the other numeric variables with their own respective odds ratios. A small sketch converting these odds ratios into percentage changes follows below.
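
A minimal sketch that turns each odds ratio into a percentage change in the odds of being 'Not Healthy' (positive = higher odds, negative = lower):

# percentage change in odds per unit increase / per level, rounded to one decimal
round((exp(model_heart$coefficients) - 1) * 100, 1)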

4.2 K Nearest Neighbour

The KNN model does not work well with categorical variables, and unfortunately most of our variables are categorical. To make the data work properly with KNN, we have to transform each categorical variable into multiple dummy (one-hot) variables.

4.2.1 Data Wrangling

heart2 <- dummyVars(" ~target+sex+cp+fbs+exang+oldpeak+slope+ca+thal", data = heart)
heart2 <- data.frame(predict(heart2, newdata = heart))
str(heart2)
#> 'data.frame':    303 obs. of  25 variables:
#>  $ target.Healthy    : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ target.Not.Healthy: num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ sex.Female        : num  0 0 1 0 1 0 1 0 0 0 ...
#>  $ sex.Male          : num  1 1 0 1 0 1 0 1 1 1 ...
#>  $ cp.0              : num  0 0 0 0 1 1 0 0 0 0 ...
#>  $ cp.1              : num  0 0 1 1 0 0 1 1 0 0 ...
#>  $ cp.2              : num  0 1 0 0 0 0 0 0 1 1 ...
#>  $ cp.3              : num  1 0 0 0 0 0 0 0 0 0 ...
#>  $ fbs.No            : num  0 1 1 1 1 1 1 1 0 1 ...
#>  $ fbs.Yes           : num  1 0 0 0 0 0 0 0 1 0 ...
#>  $ exang.No          : num  1 1 1 1 0 1 1 1 1 1 ...
#>  $ exang.Yes         : num  0 0 0 0 1 0 0 0 0 0 ...
#>  $ oldpeak           : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
#>  $ slope.0           : num  1 1 0 0 0 0 0 0 0 0 ...
#>  $ slope.1           : num  0 0 0 0 0 1 1 0 0 0 ...
#>  $ slope.2           : num  0 0 1 1 1 0 0 1 1 1 ...
#>  $ ca.0              : num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ ca.1              : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ ca.2              : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ ca.3              : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ ca.4              : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ thal.0            : num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ thal.1            : num  1 0 0 0 0 1 0 0 0 0 ...
#>  $ thal.2            : num  0 1 1 1 1 0 1 0 0 1 ...
#>  $ thal.3            : num  0 0 0 0 0 0 0 1 1 0 ...

Remove the redundant negative-class dummy columns of the binary variables, and convert the target to a factor.

heart2$sex.Female <- NULL
heart2$fbs.No <- NULL       # the fbs dummy columns are named fbs.No / fbs.Yes
heart2$exang.No <- NULL
heart2$target.Healthy <- NULL
heart2$target.Not.Healthy <- as.factor(heart2$target.Not.Healthy)

Train Test Split

set.seed(420)

index <- initial_split(heart2, 0.7)
heart2.train <- training(index)
heart2.test <- testing(index)

Separating Predictors and Target

# predictors, training set
train_x <- heart2.train %>% 
           select_if(is.numeric)

# target, training set
train_y <- heart2.train %>% 
           dplyr::select(target.Not.Healthy)

# predictors, test set
test_x <- heart2.test %>% 
          select_if(is.numeric)

# target, test set
test_y <- heart2.test %>% 
          dplyr::select(target.Not.Healthy)

4.2.2 Eliminating Variables

We can eliminate features with near-zero variance, since they carry little to no information.

# identify near-zero-variance columns on the training set, then drop the same
# columns from both sets so train and test keep identical features
n0vartrainx <- nearZeroVar(train_x)
train_x <- train_x[, -n0vartrainx]
test_x <- test_x[, -n0vartrainx]

4.2.3 Normalizing

Our data need to be normalized so every variable is on the same scale; the test set is scaled using the centers and scales computed from the training set.

train_x <- scale(train_x) 
  
test_x <- scale(test_x,
                center = attr(train_x, "scaled:center"),
                scale =  attr(train_x,"scaled:scale"))

4.2.4 KNN Modeling

heartknn <- knn(train = train_x, test = test_x, cl = train_y$target.Not.Healthy, k = 18)

The value of k is chosen as roughly the square root of the number of observations.
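
A minimal sketch of that rule of thumb (whether to take the root of the full dataset or only the training set is a judgment call, and an odd k is often preferred to avoid ties in binary classification):

# square root of the number of observations as a starting point for k
round(sqrt(nrow(heart2)))   # full data
round(sqrt(nrow(train_x)))  # training data only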

4.2.5 KNN Evaluation

confusionMatrix(data = heartknn ,reference = heart2.test$target.Not.Healthy, positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 35  5
#>          1  8 42
#>                                           
#>                Accuracy : 0.8556          
#>                  95% CI : (0.7657, 0.9208)
#>     No Information Rate : 0.5222          
#>     P-Value [Acc > NIR] : 0.00000000002547
#>                                           
#>                   Kappa : 0.7097          
#>                                           
#>  Mcnemar's Test P-Value : 0.5791          
#>                                           
#>             Sensitivity : 0.8936          
#>             Specificity : 0.8140          
#>          Pos Pred Value : 0.8400          
#>          Neg Pred Value : 0.8750          
#>              Prevalence : 0.5222          
#>          Detection Rate : 0.4667          
#>    Detection Prevalence : 0.5556          
#>       Balanced Accuracy : 0.8538          
#>                                           
#>        'Positive' Class : 1               
#> 

5 Model Comparison

confusionMatrix(heart_pred_class, heart.test$target, positive = "Not Healthy")
#> Confusion Matrix and Statistics
#> 
#>              Reference
#> Prediction    Healthy Not Healthy
#>   Healthy          30           5
#>   Not Healthy      11          44
#>                                           
#>                Accuracy : 0.8222          
#>                  95% CI : (0.7274, 0.8948)
#>     No Information Rate : 0.5444          
#>     P-Value [Acc > NIR] : 0.0000000284    
#>                                           
#>                   Kappa : 0.6373          
#>                                           
#>  Mcnemar's Test P-Value : 0.2113          
#>                                           
#>             Sensitivity : 0.8980          
#>             Specificity : 0.7317          
#>          Pos Pred Value : 0.8000          
#>          Neg Pred Value : 0.8571          
#>              Prevalence : 0.5444          
#>          Detection Rate : 0.4889          
#>    Detection Prevalence : 0.6111          
#>       Balanced Accuracy : 0.8148          
#>                                           
#>        'Positive' Class : Not Healthy     
#> 
confusionMatrix(data = heartknn ,reference = heart2.test$target.Not.Healthy, positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 35  5
#>          1  8 42
#>                                           
#>                Accuracy : 0.8556          
#>                  95% CI : (0.7657, 0.9208)
#>     No Information Rate : 0.5222          
#>     P-Value [Acc > NIR] : 0.00000000002547
#>                                           
#>                   Kappa : 0.7097          
#>                                           
#>  Mcnemar's Test P-Value : 0.5791          
#>                                           
#>             Sensitivity : 0.8936          
#>             Specificity : 0.8140          
#>          Pos Pred Value : 0.8400          
#>          Neg Pred Value : 0.8750          
#>              Prevalence : 0.5222          
#>          Detection Rate : 0.4667          
#>    Detection Prevalence : 0.5556          
#>       Balanced Accuracy : 0.8538          
#>                                           
#>        'Positive' Class : 1               
#> 

6 Conclusion

Ideally, we would use accuracy (overall performance on both classes) as the indicator of how good our model is. Realistically, we want to prioritize one of the classes, and here I decided to prioritize recall (minimizing false negatives). Putting myself in a medical practitioner's shoes, I do not want a patient who is actually Not Healthy to be diagnosed as healthy; that could end in a lawsuit, which is a financial loss.

Our KNN model is slightly better on every metric than the Logistic Regression model. If the goal is purely prediction, use the KNN model, because it has better metrics and performance; however, it cannot be interpreted, so we cannot tell which variables affect the target the most or the least. If understanding the data is the goal, use the Logistic Regression model, where every coefficient can be interpreted in terms of the probability and odds of the target.
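
A minimal sketch that puts the headline metrics side by side; it stores the two confusionMatrix() calls from above in logit_cm and knn_cm (illustrative names):

# store the two confusion matrices, then line up the headline metrics
logit_cm <- confusionMatrix(heart_pred_class, heart.test$target, positive = "Not Healthy")
knn_cm   <- confusionMatrix(heartknn, heart2.test$target.Not.Healthy, positive = "1")

rbind(
  Logistic = c(Accuracy    = logit_cm$overall[["Accuracy"]],
               Recall      = logit_cm$byClass[["Sensitivity"]],
               Specificity = logit_cm$byClass[["Specificity"]],
               Precision   = logit_cm$byClass[["Pos Pred Value"]]),
  KNN      = c(Accuracy    = knn_cm$overall[["Accuracy"]],
               Recall      = knn_cm$byClass[["Sensitivity"]],
               Specificity = knn_cm$byClass[["Specificity"]],
               Precision   = knn_cm$byClass[["Pos Pred Value"]])
)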