Introduction

Business Problem

The health of the heart is arguably one of the most important aspects of the human body. In this project, we try to find the trends and factors that influence heart health.

Objectives

We will try Logistic Regression and k-Nearest Neighbors, two supervised learning algorithms, to predict the presence of heart disease in our observations.

Data

Context

This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to this date. The “goal” field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4.

Content

  1. age
  2. sex
  3. chest pain type (4 values)
  4. resting blood pressure
  5. serum cholesterol in mg/dl
  6. fasting blood sugar > 120 mg/dl
  7. resting electrocardiographic results (values 0,1,2)
  8. maximum heart rate achieved
  9. exercise induced angina
  10. oldpeak = ST depression induced by exercise relative to rest
  11. the slope of the peak exercise ST segment
  12. number of major vessels (0-3) colored by fluoroscopy
  13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
  14. target: presence of heart disease (the predicted attribute)

The names and social security numbers of the patients were recently removed from the database and replaced with dummy values. One file has been "processed": the one containing the Cleveland database. All four unprocessed files also exist in this directory.

To see Test Costs (donated by Peter Turney), please see the folder “Costs”

Acknowledgements

Creators:

  1. Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
  2. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
  3. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
  4. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

Donor: David W. Aha (aha ‘@’ ics.uci.edu) (714) 856-8779

Inspiration

Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0).
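
In the dataset used below the goal has already been collapsed to a 0/1 target, but starting from the raw 0-4 field the binarization is a one-liner; here is a minimal sketch on a toy vector (in the raw UCI files the field is conventionally named num, an assumption on our part):

goal <- c(0, 2, 1, 0, 4, 3)      # toy stand-in for the raw 0-4 "goal"/num field
target <- as.integer(goal > 0)   # 0 stays absence (0); 1-4 all collapse to presence (1)
target
## [1] 0 1 1 0 1 1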

Methodology

Data Preparation

Load the libraries that we will use in this project

library(dplyr) # data wrangling
library(magrittr) # useful piping operator
library(ggplot2) # plotting
library(tidyr) # tidying data
library(DataExplorer) # Exploratory Data Analysis
library(caret) # short for Classification And Regression Training
library(class) # for K-Nearest Neighbors Classification
set.seed(11) # for reproducibility of results

Set plotting theme

theme_set(ggthemes::theme_hc())

Load the dataset

df <- read.csv("heart.csv")

Take a look at the data

knitr::knit_print(skimr::skim(df))
Data summary
Name df
Number of rows 303
Number of columns 14
_______________________
Column type frequency:
numeric 14
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1 54.37 9.08 29 47.5 55.0 61.0 77.0 ▁▆▇▇▁
sex 0 1 0.68 0.47 0 0.0 1.0 1.0 1.0 ▃▁▁▁▇
cp 0 1 0.97 1.03 0 0.0 1.0 2.0 3.0 ▇▃▁▅▁
trestbps 0 1 131.62 17.54 94 120.0 130.0 140.0 200.0 ▃▇▅▁▁
chol 0 1 246.26 51.83 126 211.0 240.0 274.5 564.0 ▃▇▂▁▁
fbs 0 1 0.15 0.36 0 0.0 0.0 0.0 1.0 ▇▁▁▁▂
restecg 0 1 0.53 0.53 0 0.0 1.0 1.0 2.0 ▇▁▇▁▁
thalach 0 1 149.65 22.91 71 133.5 153.0 166.0 202.0 ▁▂▅▇▂
exang 0 1 0.33 0.47 0 0.0 0.0 1.0 1.0 ▇▁▁▁▃
oldpeak 0 1 1.04 1.16 0 0.0 0.8 1.6 6.2 ▇▂▁▁▁
slope 0 1 1.40 0.62 0 1.0 1.0 2.0 2.0 ▁▁▇▁▇
ca 0 1 0.73 1.02 0 0.0 0.0 1.0 4.0 ▇▃▂▁▁
thal 0 1 2.31 0.61 0 2.0 2.0 3.0 3.0 ▁▁▁▇▆
target 0 1 0.54 0.50 0 0.0 1.0 1.0 1.0 ▇▁▁▁▇
  • Dimensions: 303 × 14. We have a small number of observations, which could limit the predictive power of our model; gathering more data could improve the results.

  • Completion rate: 100%. The data is complete, with no missing values in any variable, but we should correct the variable types, since several factors are stored as numeric, and assign descriptive values to their levels.

df %<>% mutate(sex = if_else(sex == 1, "MALE", "FEMALE"),
               cp = if_else(cp == 1, "ATYPICAL ANGINA",
                            if_else(cp == 2, "NON-ANGINAL PAIN", "ASYMPTOMATIC")),
               fbs = if_else(fbs == 1, ">120", "<=120"),
               restecg = if_else(restecg == 0, "NORMAL",
                                 if_else(restecg == 1, "ABNORMALITY", "PROBABLE OR DEFINITE")),
               exang = if_else(exang == 1, "YES" ,"NO"),
               slope = as.factor(slope),
               ca = as.factor(ca),
               thal = as.factor(thal),
               target = if_else(target == 1, "YES", "NO")) %>%
  mutate_if(is.character, as.factor)

We gain a better understanding of the distribution of our numeric variables by using boxplots and histograms

ggplot(gather(df[,sapply(df, is.numeric)], cols, value), 
       aes(x = value, fill = cols)) + 
  geom_boxplot() + 
  facet_wrap(cols~., scales = "free") +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        legend.position =  "none") 

  • age doesn't have outliers; the median is centered around 55 years old
  • chol has outliers around 400 and one extreme outlier with a value over 500
  • oldpeak has outliers with values over 4
  • thalach has outliers with values under 90
  • trestbps has the most outliers, around 180
ggplot(gather(df[,sapply(df, is.numeric)], cols, value), 
       aes(x = value)) + 
  geom_histogram(aes(fill = cols)) + 
  geom_density(aes(y = stat(count))) +
  facet_wrap(cols~., scales = "free") +
  theme(legend.position =  "none",
        axis.title.y = element_text(angle = 90))

  • age has two modes, around 45 and 55 years old, and is fairly normally distributed
  • chol has a single mode around 250 and is slightly right skewed
  • oldpeak has its mode at 0 and is heavily right skewed
  • thalach has a mode around 160 and is left skewed
  • trestbps has several modes around 135 and is slightly right skewed

Let's check the proportions of our target variable

round(prop.table(table(df$target)),2)
## 
##   NO  YES 
## 0.46 0.54

Our target classes are reasonably well balanced.

We split our dataframe by attribute type (categorical vs numeric, discrete vs continuous) to ease our Exploratory Data Analysis

split <- split_columns(df)
ggplot(gather(split$discrete %>%
                select(-c(cp,restecg,target)), cols, value), 
       aes(x = value)) + 
  geom_bar(aes(fill = value)) + 
  facet_wrap(cols~., scales = "free") +
  theme(legend.position =  "none",
        axis.title.y = element_text(angle = 90)) +
  ggtitle("Discrete Variables Proportions")

ggplot(gather(split$discrete %>% 
                select(cp), cols, value), 
       aes(x = value)) + 
  geom_bar(aes(fill = value)) + 
  theme(legend.position =  "none",
        axis.title.y = element_text(angle = 90)) +
  ggtitle("Chest Pain")

ggplot(gather(split$discrete %>% 
                select(restecg), cols, value), 
       aes(x = value)) + 
  geom_bar(aes(fill = value)) + 
  theme(legend.position =  "none",
        axis.title.y = element_text(angle = 90)) +
  ggtitle("Resting Electrocardiographic Results")

AppliedPredictiveModeling::transparentTheme(trans = .5)
featurePlot(x = split$continuous, 
            y = df$target,
            plot = "density", 
            ## Pass in options to xyplot() to 
            ## make it prettier
            scales = list(x = list(relation = "free"), 
                          y = list(relation = "free")), 
            adjust = 1.5, 
            pch = "|", 
            layout = c(3, 2),
            auto.key = list(columns = 2))

The difference between people with heart disease and people with a healthy heart is quite clear in thalach, oldpeak, and age; these factors could be good predictors for our classification.

featurePlot(split$continuous,
            df$target,
            plot = "box",
            scales = list(y = list(relation = "free"),
                          x = list(rot = 90)),
            auto.key = list(columns = 2))

We can see from the boxplots that the majority of people with heart disease have these characteristics:
  • higher thalach
  • lower oldpeak
  • lower age
  • relatively lower trestbps
  • roughly the same level of chol

AppliedPredictiveModeling::transparentTheme(trans = 0.4)
featurePlot(x = split$continuous, 
            y = df$target, 
            plot = "pairs",
            ## Add a key at the top
            auto.key = list(columns = 2)) 

Our scatter plot matrix shows that the oldpeak variable could be a good feature for our model in combination with the other continuous variables. The YES and NO heart-disease classes are hard to separate using the continuous variables alone, so we hypothesize that logistic regression might not be the ideal classification method for our case.

Modeling

In this project, we will use two classification techniques: logistic regression and K-nearest neighbors.
  • Logistic regression models the log-odds of the positive class as a linear function of the predictors and uses the resulting decision boundary to split the observations into two classes.
  • K-nearest neighbors classifies an observation by looking at its k nearest training points and taking a majority vote.
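
As a rough, illustrative sketch (not part of the analysis itself): logistic regression pushes the linear combination of predictors through the logistic (sigmoid) link to obtain a probability, while K-nearest neighbors simply takes a majority vote among the k closest training points.

sigmoid <- function(z) 1 / (1 + exp(-z))  # the logit link used by glm(..., family = binomial("logit"))
sigmoid(c(-3, 0, 3))                      # maps any real value into (0, 1): ~0.05, 0.50, 0.95

majority_vote <- function(labels) names(which.max(table(labels)))  # what class::knn() does per test point
majority_vote(c("YES", "NO", "YES"))      # "YES"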

Train - Test Split

inTraining <- createDataPartition(df$target, p = .75, list = FALSE)

train <- df[ inTraining,]
test  <- df[-inTraining,]

train_x <- train %>% select(-target)
train_y <- train %>% select(target)

test_x <- test %>% select(-target)
test_y <- test %>% select(target)

Logistic Regression

Baseline model

logmod <- glm(target~.,train, family = binomial("logit"))
summary(logmod)
## 
## Call:
## glm(formula = target ~ ., family = binomial("logit"), data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7819  -0.3902   0.1533   0.3760   2.8944  
## 
## Coefficients:
##                                 Estimate   Std. Error z value  Pr(>|z|)    
## (Intercept)                  -16.7242707 2399.5473365  -0.007   0.99444    
## age                            0.0227048    0.0292500   0.776   0.43761    
## sexMALE                       -0.5575973    0.6372390  -0.875   0.38156    
## cpATYPICAL ANGINA              0.6715472    0.8195231   0.819   0.41254    
## cpNON-ANGINAL PAIN             0.7641526    0.5272676   1.449   0.14726    
## trestbps                      -0.0205815    0.0127333  -1.616   0.10602    
## chol                          -0.0006164    0.0045646  -0.135   0.89258    
## fbs>120                        0.8183040    0.6257587   1.308   0.19098    
## restecgNORMAL                 -0.1463589    0.4626668  -0.316   0.75175    
## restecgPROBABLE OR DEFINITE  -14.7197368 1245.3277449  -0.012   0.99057    
## thalach                        0.0283954    0.0144956   1.959   0.05013 .  
## exangYES                      -1.1390310    0.5087331  -2.239   0.02516 *  
## oldpeak                       -0.4161185    0.2582631  -1.611   0.10713    
## slope1                        -1.3975074    0.9840938  -1.420   0.15558    
## slope2                        -0.2177829    1.1153621  -0.195   0.84519    
## ca1                           -1.8776663    0.5980489  -3.140   0.00169 ** 
## ca2                           -3.5421866    0.8976941  -3.946 0.0000795 ***
## ca3                           -1.9083253    1.0347502  -1.844   0.06515 .  
## ca4                           -0.9261473    2.2097925  -0.419   0.67514    
## thal1                         17.5303348 2399.5449392   0.007   0.99417    
## thal2                         17.5731307 2399.5448813   0.007   0.99416    
## thal3                         15.9678269 2399.5448490   0.007   0.99469    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 314.32  on 227  degrees of freedom
## Residual deviance: 139.32  on 206  degrees of freedom
## AIC: 183.32
## 
## Number of Fisher Scoring iterations: 15

We included all variables in the baseline model. Judging by Pr(>|z|) in the summary() output, the most significant predictors here are exang and ca, with thalach borderline; this gives us a first hint for feature selection. We will look at the confusion matrix to see whether we should modify the model to improve it.

The thal variable can almost perfectly separate parts of the dataset, as the huge coefficients and standard errors indicate: a sign of (quasi-)complete separation. We therefore exclude thal from our logistic regression.
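
A quick way to see this (a diagnostic sketch, assuming the recoded train data frame from above) is to cross-tabulate thal against the target: a level whose observations fall almost entirely into one class, combined with the huge coefficients and standard errors above, points to quasi-complete separation.

table(train$thal, train$target)  # near-empty cells for a thal level indicate separation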

logmod_clean <- glm(target~.-thal,train, family = binomial("logit"))
summary(logmod_clean)
## 
## Call:
## glm(formula = target ~ . - thal, family = binomial("logit"), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5986  -0.4487   0.1586   0.4444   2.8224  
## 
## Coefficients:
##                                Estimate  Std. Error z value Pr(>|z|)    
## (Intercept)                    1.690338    3.103619   0.545  0.58600    
## age                            0.016521    0.028024   0.590  0.55551    
## sexMALE                       -1.361275    0.524416  -2.596  0.00944 ** 
## cpATYPICAL ANGINA              1.124713    0.798171   1.409  0.15880    
## cpNON-ANGINAL PAIN             1.041716    0.497093   2.096  0.03612 *  
## trestbps                      -0.019432    0.012640  -1.537  0.12423    
## chol                          -0.003220    0.004422  -0.728  0.46654    
## fbs>120                        0.815081    0.590758   1.380  0.16767    
## restecgNORMAL                 -0.015178    0.434601  -0.035  0.97214    
## restecgPROBABLE OR DEFINITE  -13.938153 1371.840702  -0.010  0.99189    
## thalach                        0.025514    0.012867   1.983  0.04738 *  
## exangYES                      -1.337280    0.489743  -2.731  0.00632 ** 
## oldpeak                       -0.417201    0.232822  -1.792  0.07314 .  
## slope1                        -1.371795    0.874551  -1.569  0.11675    
## slope2                        -0.064491    0.994560  -0.065  0.94830    
## ca1                           -1.902762    0.585480  -3.250  0.00115 ** 
## ca2                           -3.210743    0.857169  -3.746  0.00018 ***
## ca3                           -2.019771    0.990750  -2.039  0.04149 *  
## ca4                           -0.838456    1.845854  -0.454  0.64966    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 314.32  on 227  degrees of freedom
## Residual deviance: 151.41  on 209  degrees of freedom
## AIC: 189.41
## 
## Number of Fisher Scoring iterations: 15

Now we test our baseline logistic regression model (without thal) on the test data.

logmod_logit <- predict(logmod_clean, newdata = test, type = "response")

We convert the probabilities into classes using the default threshold of 0.5: any value at or above 0.5 is classified as positive, and anything below as negative.

logmod_pred <- as.factor(ifelse(logmod_logit >= 0.5, "YES", "NO"))
confusionMatrix(logmod_pred, test$target, positive = "YES")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction NO YES
##        NO  25   5
##        YES  9  36
##                                          
##                Accuracy : 0.8133         
##                  95% CI : (0.7067, 0.894)
##     No Information Rate : 0.5467         
##     P-Value [Acc > NIR] : 0.000001183    
##                                          
##                   Kappa : 0.6196         
##                                          
##  Mcnemar's Test P-Value : 0.4227         
##                                          
##             Sensitivity : 0.8780         
##             Specificity : 0.7353         
##          Pos Pred Value : 0.8000         
##          Neg Pred Value : 0.8333         
##              Prevalence : 0.5467         
##          Detection Rate : 0.4800         
##    Detection Prevalence : 0.6000         
##       Balanced Accuracy : 0.8067         
##                                          
##        'Positive' Class : YES            
## 

Our model's Sensitivity should be as high as possible. Why? Because we built a model that predicts the presence of heart disease, so a false negative could be very dangerous for the patient. Sensitivity measures the proportion of positives that are correctly identified (e.g., the percentage of sick people who are correctly identified as having the illness).
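
For reference, Sensitivity can be recomputed by hand from the confusion matrix counts above (true positives = 36, false negatives = 5):

TP <- 36; FN <- 5
TP / (TP + FN)  # 0.878, matching the Sensitivity reported by confusionMatrix()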

test %>% 
  mutate(yhat = logmod_pred) %>%
  ggplot(aes(target, yhat, col = paste(target,yhat))) +
  geom_jitter() +
  geom_hline(aes(yintercept = 1.5)) +
  geom_vline(aes(xintercept = 1.5)) +
  labs(x = "Actual", 
       y = "Prediction", 
       title = "Confussion Matrix", 
       subtitle = "Logistic Regression - Baseline") +
  theme(legend.position = "none")

Stepwise model

We can perform backward stepwise regression, which starts from the full set of independent variables as predictors, simply by passing our logistic regression model to the step() function (to force pure backward elimination we could set direction = "backward" explicitly). At each step, backward elimination drops the predictor whose removal most reduces the Akaike Information Criterion (AIC), and stops when no removal lowers the AIC any further.
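
As a sanity check on what step() is minimizing (a small sketch using the logmod_clean model fitted above): AIC = 2k - 2 * log-likelihood, where k is the number of estimated parameters, and for a binomial GLM with a 0/1 response the residual deviance equals -2 * log-likelihood.

k <- attr(logLik(logmod_clean), "df")          # number of estimated parameters
2 * k - 2 * as.numeric(logLik(logmod_clean))   # reproduces the AIC of 189.41 reported above
AIC(logmod_clean)                              # same value via the built-in helper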

summary(step(logmod_clean))
## Start:  AIC=189.41
## target ~ (age + sex + cp + trestbps + chol + fbs + restecg + 
##     thalach + exang + oldpeak + slope + ca + thal) - thal
## 
##            Df Deviance    AIC
## - restecg   2   151.81 185.81
## - age       1   151.76 187.76
## - chol      1   151.93 187.93
## - fbs       1   153.35 189.35
## <none>          151.41 189.41
## - trestbps  1   153.84 189.84
## - cp        2   156.83 190.83
## - oldpeak   1   154.84 190.84
## - thalach   1   155.57 191.57
## - slope     2   160.59 194.59
## - sex       1   158.62 194.62
## - exang     1   159.03 195.03
## - ca        4   177.02 207.02
## 
## Step:  AIC=185.81
## target ~ age + sex + cp + trestbps + chol + fbs + thalach + exang + 
##     oldpeak + slope + ca
## 
##            Df Deviance    AIC
## - age       1   152.18 184.18
## - chol      1   152.40 184.40
## <none>          151.81 185.81
## - fbs       1   153.85 185.85
## - trestbps  1   154.27 186.27
## - cp        2   157.37 187.37
## - oldpeak   1   155.46 187.46
## - thalach   1   156.01 188.01
## - sex       1   158.97 190.97
## - slope     2   161.00 191.00
## - exang     1   159.50 191.50
## - ca        4   177.86 203.86
## 
## Step:  AIC=184.18
## target ~ sex + cp + trestbps + chol + fbs + thalach + exang + 
##     oldpeak + slope + ca
## 
##            Df Deviance    AIC
## - chol      1   152.69 182.69
## <none>          152.18 184.18
## - fbs       1   154.25 184.25
## - trestbps  1   154.28 184.28
## - cp        2   157.89 185.89
## - thalach   1   156.01 186.01
## - oldpeak   1   156.13 186.13
## - slope     2   161.22 189.22
## - sex       1   159.61 189.61
## - exang     1   159.86 189.86
## - ca        4   179.86 203.86
## 
## Step:  AIC=182.69
## target ~ sex + cp + trestbps + fbs + thalach + exang + oldpeak + 
##     slope + ca
## 
##            Df Deviance    AIC
## - fbs       1   154.63 182.63
## <none>          152.69 182.69
## - trestbps  1   154.91 182.91
## - thalach   1   156.44 184.44
## - cp        2   158.53 184.53
## - oldpeak   1   156.86 184.86
## - sex       1   159.68 187.68
## - slope     2   161.92 187.92
## - exang     1   160.31 188.31
## - ca        4   180.53 202.53
## 
## Step:  AIC=182.63
## target ~ sex + cp + trestbps + thalach + exang + oldpeak + slope + 
##     ca
## 
##            Df Deviance    AIC
## - trestbps  1   156.43 182.43
## <none>          154.63 182.63
## - thalach   1   158.42 184.42
## - cp        2   161.27 185.27
## - oldpeak   1   159.56 185.56
## - sex       1   160.75 186.75
## - exang     1   161.30 187.30
## - slope     2   163.87 187.87
## - ca        4   181.04 201.04
## 
## Step:  AIC=182.43
## target ~ sex + cp + thalach + exang + oldpeak + slope + ca
## 
##           Df Deviance    AIC
## <none>         156.43 182.43
## - thalach  1   159.85 183.85
## - cp       2   163.53 185.53
## - sex      1   162.29 186.29
## - oldpeak  1   162.38 186.38
## - slope    2   165.63 187.63
## - exang    1   163.87 187.87
## - ca       4   181.71 199.71
## 
## Call:
## glm(formula = target ~ sex + cp + thalach + exang + oldpeak + 
##     slope + ca, family = binomial("logit"), data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5680  -0.4690   0.1958   0.4605   2.6970  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -0.14396    2.04826  -0.070  0.94397    
## sexMALE            -1.11187    0.46958  -2.368  0.01789 *  
## cpATYPICAL ANGINA   1.16496    0.76905   1.515  0.12982    
## cpNON-ANGINAL PAIN  1.18832    0.48430   2.454  0.01414 *  
## thalach             0.02140    0.01175   1.822  0.06842 .  
## exangYES           -1.28378    0.47639  -2.695  0.00704 ** 
## oldpeak            -0.52784    0.22700  -2.325  0.02006 *  
## slope1             -1.45761    0.83855  -1.738  0.08217 .  
## slope2             -0.21372    0.94022  -0.227  0.82019    
## ca1                -1.75476    0.54515  -3.219  0.00129 ** 
## ca2                -2.78264    0.77505  -3.590  0.00033 ***
## ca3                -1.75968    0.87034  -2.022  0.04319 *  
## ca4                -0.83067    2.04143  -0.407  0.68408    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 314.32  on 227  degrees of freedom
## Residual deviance: 156.43  on 215  degrees of freedom
## AIC: 182.43
## 
## Number of Fisher Scoring iterations: 6
logstep <- glm(formula = target ~ sex + cp + thalach + exang + oldpeak + 
    slope + ca, family = binomial("logit"), data = train)
summary(logstep)
## 
## Call:
## glm(formula = target ~ sex + cp + thalach + exang + oldpeak + 
##     slope + ca, family = binomial("logit"), data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5680  -0.4690   0.1958   0.4605   2.6970  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -0.14396    2.04826  -0.070  0.94397    
## sexMALE            -1.11187    0.46958  -2.368  0.01789 *  
## cpATYPICAL ANGINA   1.16496    0.76905   1.515  0.12982    
## cpNON-ANGINAL PAIN  1.18832    0.48430   2.454  0.01414 *  
## thalach             0.02140    0.01175   1.822  0.06842 .  
## exangYES           -1.28378    0.47639  -2.695  0.00704 ** 
## oldpeak            -0.52784    0.22700  -2.325  0.02006 *  
## slope1             -1.45761    0.83855  -1.738  0.08217 .  
## slope2             -0.21372    0.94022  -0.227  0.82019    
## ca1                -1.75476    0.54515  -3.219  0.00129 ** 
## ca2                -2.78264    0.77505  -3.590  0.00033 ***
## ca3                -1.75968    0.87034  -2.022  0.04319 *  
## ca4                -0.83067    2.04143  -0.407  0.68408    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 314.32  on 227  degrees of freedom
## Residual deviance: 156.43  on 215  degrees of freedom
## AIC: 182.43
## 
## Number of Fisher Scoring iterations: 6
logstep_logit <- predict(logstep, newdata = test, type = "response")

Once again, we convert the probabilities into classes using the default threshold of 0.5: values at or above 0.5 are classified as positive, and anything below as negative.

logstep_pred <- as.factor(ifelse(logstep_logit >= 0.5, "YES", "NO"))
confusionMatrix(logstep_pred, test$target, positive = "YES")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction NO YES
##        NO  23   4
##        YES 11  37
##                                           
##                Accuracy : 0.8             
##                  95% CI : (0.6917, 0.8835)
##     No Information Rate : 0.5467          
##     P-Value [Acc > NIR] : 0.000004117     
##                                           
##                   Kappa : 0.5893          
##                                           
##  Mcnemar's Test P-Value : 0.1213          
##                                           
##             Sensitivity : 0.9024          
##             Specificity : 0.6765          
##          Pos Pred Value : 0.7708          
##          Neg Pred Value : 0.8519          
##              Prevalence : 0.5467          
##          Detection Rate : 0.4933          
##    Detection Prevalence : 0.6400          
##       Balanced Accuracy : 0.7895          
##                                           
##        'Positive' Class : YES             
## 

Can you see the difference between our baseline and stepwise logistic regression models? Just by trimming down the independent variables, we decreased the accuracy of our predictions by 0.0133, but the model's Sensitivity increased by 0.0244.

logplot <- test %>% 
  mutate(yhat = logstep_pred) %>%
  ggplot(aes(target, yhat)) 

logplot +
  geom_jitter(aes(col = paste(target,yhat))) +
  geom_hline(aes(yintercept = 1.5)) +
  geom_vline(aes(xintercept = 1.5)) +
  labs(x = "Actual", 
       y = "Prediction", 
       title = "Confussion Matrix", 
       subtitle = "Logistic Regression - Stepwise") +
  theme(legend.position = "none")
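
Since we argued above that Sensitivity matters most here, one more knob worth mentioning is the classification threshold itself: lowering it trades Specificity for Sensitivity. A hedged sketch (the 0.3 cutoff is an arbitrary illustration; results not shown):

logstep_pred_low <- as.factor(ifelse(logstep_logit >= 0.3, "YES", "NO"))  # lower cutoff flags more positives
confusionMatrix(logstep_pred_low, test$target, positive = "YES")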

Even better: from our logistic regression model, we can exponentiate the coefficients and interpret them as odds ratios.

round(exp(coef(logstep)),2)
##        (Intercept)            sexMALE  cpATYPICAL ANGINA cpNON-ANGINAL PAIN 
##               0.87               0.33               3.21               3.28 
##            thalach           exangYES            oldpeak             slope1 
##               1.02               0.28               0.59               0.23 
##             slope2                ca1                ca2                ca3 
##               0.81               0.17               0.06               0.17 
##                ca4 
##               0.44

For example, compared with asymptomatic chest pain (the reference level), non-anginal pain and atypical angina increase the odds of having heart disease by factors of 3.28 and 3.21 respectively, so chest pain type could be one of the strongest indicators of the presence of heart disease.
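
As a worked check (a sketch using the fitted logstep model from above), exponentiating a single coefficient reproduces its odds ratio, and Wald confidence intervals can be put on the same scale:

exp(coef(logstep)["cpNON-ANGINAL PAIN"])               # ~3.28, the odds ratio quoted above
exp(confint.default(logstep)["cpNON-ANGINAL PAIN", ])  # Wald 95% CI on the odds-ratio scale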

K-Nearest Neighbors

To run the K-NN algorithm, we first dummify our categorical variables to turn them into numbers, and then normalize the dataset so that all variables are on the same scale, which prevents features with large ranges from dominating the distance calculation. We must remember to exclude the target variable from the dummification, since we want to use it as the label for our classification.

Dummify the categorical variables

dummified <- df %>%
  select(-target) %>%
  dummify() 
normalized <- as.data.frame(lapply(dummified,scale)) %>%
  mutate(target = df$target) 

Then we perform the train-test split again on our normalized data

norminTraining <- createDataPartition(normalized$target, p = .75, list = FALSE)

norm_train <- normalized[ norminTraining,-30]
norm_test  <- normalized[-norminTraining,-30]

norm_train_label <- normalized[ norminTraining,30]
norm_test_label <- normalized[-norminTraining,30]

crt_norm_train <- normalized[norminTraining,]
crt_norm_test <- normalized[-norminTraining,]

For the K-nearest neighbors classification, we will use the 'class' package, which provides the knn() function.

First of all, we calculate the 'k' for our K-NN. As a common rule of thumb, we take the square root of the number of training observations (nrow).

round(sqrt(nrow(norm_train)))
## [1] 15

The rounded square root of the number of training observations is 15, so we will use k = 15 for our K-NN model.

knnmod <- knn(norm_train, norm_test, norm_train_label, 15)
confusionMatrix(knnmod, norm_test_label, positive = "YES")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction NO YES
##        NO  27   4
##        YES  7  37
##                                           
##                Accuracy : 0.8533          
##                  95% CI : (0.7527, 0.9244)
##     No Information Rate : 0.5467          
##     P-Value [Acc > NIR] : 0.00000001664   
##                                           
##                   Kappa : 0.7018          
##                                           
##  Mcnemar's Test P-Value : 0.5465          
##                                           
##             Sensitivity : 0.9024          
##             Specificity : 0.7941          
##          Pos Pred Value : 0.8409          
##          Neg Pred Value : 0.8710          
##              Prevalence : 0.5467          
##          Detection Rate : 0.4933          
##    Detection Prevalence : 0.5867          
##       Balanced Accuracy : 0.8483          
##                                           
##        'Positive' Class : YES             
## 

Our K-NN model has higher Accuracy than both logistic regression models, the same Sensitivity as the stepwise model, and higher Specificity; Specificity measures the proportion of negatives that are correctly identified (e.g., the percentage of healthy people who are correctly identified as not having the illness). We could still improve the model in several ways, such as balancing the data by target variable (remember that we originally have 46% NO and 54% YES; a balancing sketch follows the plot below), performing k-fold cross-validation, or running a grid search.

knnplot <- crt_norm_test %>%   # use the split that knnmod was actually predicted on
  mutate(yhat = knnmod) %>%
  ggplot(aes(target, yhat)) 

knnplot +
  geom_jitter(aes(col = paste(target,yhat))) +
  geom_hline(aes(yintercept = 1.5)) +
  geom_vline(aes(xintercept = 1.5)) +
  labs(x = "Actual", 
       y = "Prediction", 
       title = "Confussion Matrix", 
       subtitle = "K-Nearest Neighbors") +
  theme(legend.position = "none")
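
The class balancing mentioned above could be done with caret's upSample() before refitting; a minimal sketch (not run as part of this analysis, and the balanced_train name is purely illustrative):

balanced_train <- caret::upSample(x = crt_norm_train %>% select(-target),
                                  y = crt_norm_train$target,
                                  yname = "target")
table(balanced_train$target)  # equal counts of NO and YES after oversampling the minority class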

Model Tuning

This parameter tuning is computationally intensive: caret runs a loop over every resample, and each iteration works on its own copy of the data. To reduce the computation time, we configure caret to run these loops in parallel on 4 of our available cores by registering the doMC backend.

doMC::registerDoMC(4)

To start our model optimization, we use repeated k-fold cross-validation to obtain a more reliable, less biased estimate of model performance. In this example we run 10-fold cross-validation repeated 10 times, let caret pick the 'k' of the K-NN instead of fixing it ourselves, and use Sensitivity as the metric for choosing the model.

Repeated 10 Fold Cross-Validation

fitControl <- trainControl(## 10-fold CV
                           method = "repeatedcv",
                           number = 10,
                           ## repeated ten times
                           repeats = 10,
                           classProbs = TRUE,
                           summaryFunction = multiClassSummary)

knncv <- train(target~., 
               data = crt_norm_train, 
               method = "knn", 
               trControl = fitControl, 
               metric = "Sensitivity")
print(knncv)
## k-Nearest Neighbors 
## 
## 228 samples
##  29 predictor
##   2 classes: 'NO', 'YES' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 205, 206, 205, 206, 206, 205, ... 
## Resampling results across tuning parameters:
## 
##   k  logLoss    AUC        prAUC      Accuracy   Kappa      F1       
##   5  2.0111532  0.8310084  0.5280528  0.7832428  0.5609995  0.7544578
##   7  1.2263070  0.8571069  0.6107227  0.8056489  0.6046119  0.7730981
##   9  0.9139489  0.8765979  0.6472838  0.8223024  0.6388998  0.7938093
##   Sensitivity  Specificity  Pos_Pred_Value  Neg_Pred_Value  Precision
##   0.7469091    0.8132051    0.7772298       0.8024504       0.7772298
##   0.7421818    0.8588462    0.8231618       0.8064805       0.8231618
##   0.7681818    0.8677564    0.8375162       0.8255204       0.8375162
##   Recall     Detection_Rate  Balanced_Accuracy
##   0.7469091  0.3408300       0.7800571        
##   0.7421818  0.3385573       0.8005140        
##   0.7681818  0.3504018       0.8179691        
## 
## Sensitivity was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
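
caret tuned k over its small default grid (5, 7, 9 above); the grid search mentioned earlier can be made explicit through tuneGrid. A minimal sketch reusing the same fitControl (the range of k values is an arbitrary choice):

knn_grid <- expand.grid(k = seq(3, 31, by = 2))  # odd values of k avoid ties in the two-class majority vote
knncv_grid <- train(target ~ .,
                    data = crt_norm_train,
                    method = "knn",
                    trControl = fitControl,
                    tuneGrid = knn_grid,
                    metric = "Sensitivity")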

Model Evaluation

Our K-NN model gives better results than our logistic regression model, but keep in mind that it was trained on a relatively small dataset and could be improved by feeding it more data.

As a precaution against the risk of biased training on a small dataset, we trained our K-NN model with 10-fold cross-validation repeated 10 times and tuned it over a grid of 'k' values. As a result, our final K-NN model has slightly lower Accuracy, Sensitivity, and Specificity than our original model, but with lower bias.

Conclusion

From our logistic regression model, we find that chest pain can be a symptom of heart disease; when someone feels symptomatic chest pain, we recommend that they consult a doctor for further checks and treatment.

Reference