to classify whether a patient has heart disease or not based on the provided variables. For this classification we will build two models: Logistic Regression and K-Nearest Neighbor (KNN).
Load the required packages:
library(dplyr)   # data wrangling
library(rsample) # train/test splitting
library(class)   # knn()
library(caret)   # confusionMatrix() and resampling helpers
As a first step, let's load the data.
heart <- read.csv("data_input/heart.csv")
head(heart)
Check the structure of the data
str(heart)
## 'data.frame': 303 obs. of 14 variables:
## $ ï..age : int 63 37 41 56 57 57 56 44 52 57 ...
## $ sex : int 1 1 0 1 0 1 0 1 1 1 ...
## $ cp : int 3 2 1 1 0 0 1 1 2 2 ...
## $ trestbps: int 145 130 130 120 120 140 140 120 172 150 ...
## $ chol : int 233 250 204 236 354 192 294 263 199 168 ...
## $ fbs : int 1 0 0 0 0 0 0 0 1 0 ...
## $ restecg : int 0 1 0 1 1 1 0 1 1 1 ...
## $ thalach : int 150 187 172 178 163 148 153 173 162 174 ...
## $ exang : int 0 0 0 0 1 0 0 0 0 0 ...
## $ oldpeak : num 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
## $ slope : int 0 0 2 2 2 1 1 2 2 2 ...
## $ ca : int 0 0 0 0 0 0 0 0 0 0 ...
## $ thal : int 1 2 2 2 2 1 2 3 3 2 ...
## $ target : int 1 1 1 1 1 1 1 1 1 1 ...
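Note the odd first column name ï..age: read.csv() absorbed the file's UTF-8 byte-order mark into the column name. One way to avoid this is to declare the encoding when reading; we leave the name as-is here so it matches the outputs below.
# optional fix, not applied in this analysis:
heart <- read.csv("data_input/heart.csv", fileEncoding = "UTF-8-BOM")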
Here is the data dictionary:
- age -> age in years
- sex -> (1 = male; 0 = female)
- cp -> chest pain type
- trestbps -> resting blood pressure (in mm Hg on admission to the hospital)
- chol -> serum cholesterol in mg/dl
- fbs -> fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
- restecg -> resting electrocardiographic results
- thalach -> maximum heart rate achieved
- exang -> exercise induced angina (1 = yes; 0 = no)
- oldpeak -> ST depression induced by exercise relative to rest
- slope -> the slope of the peak exercise ST segment
- ca -> number of major vessels (0-3) colored by fluoroscopy
- thal -> 1 = fixed defect; 2 = normal; 3 = reversible defect
- target -> 1 = has heart disease; 0 = no heart disease
Several variables are categorical and need to be converted to factors: sex, cp, fbs, exang, thal, and target.
heart1 <- heart %>%
mutate(sex = as.factor(sex),
cp = as.factor(cp),
fbs = as.factor(fbs),
exang = as.factor(exang),
thal = as.factor(thal),
target = as.factor(target))
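As a side note, the same conversion can be written more compactly with across() (requires dplyr >= 1.0.0):
# equivalent to the mutate() call above
heart1 <- heart %>%
  mutate(across(c(sex, cp, fbs, exang, thal, target), as.factor))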
Check for missing values:
anyNA(heart1)
## [1] FALSE
"FALSE" means there are no missing values.
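Had anyNA() returned TRUE, a per-column count would show where the gaps are:
colSums(is.na(heart1))  # number of NAs per column (all zero here)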
summary(heart1)
## ï..age sex cp trestbps chol fbs
## Min. :29.00 0: 96 0:143 Min. : 94.0 Min. :126.0 0:258
## 1st Qu.:47.50 1:207 1: 50 1st Qu.:120.0 1st Qu.:211.0 1: 45
## Median :55.00 2: 87 Median :130.0 Median :240.0
## Mean :54.37 3: 23 Mean :131.6 Mean :246.3
## 3rd Qu.:61.00 3rd Qu.:140.0 3rd Qu.:274.5
## Max. :77.00 Max. :200.0 Max. :564.0
## restecg thalach exang oldpeak slope
## Min. :0.0000 Min. : 71.0 0:204 Min. :0.00 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:133.5 1: 99 1st Qu.:0.00 1st Qu.:1.000
## Median :1.0000 Median :153.0 Median :0.80 Median :1.000
## Mean :0.5281 Mean :149.6 Mean :1.04 Mean :1.399
## 3rd Qu.:1.0000 3rd Qu.:166.0 3rd Qu.:1.60 3rd Qu.:2.000
## Max. :2.0000 Max. :202.0 Max. :6.20 Max. :2.000
## ca thal target
## Min. :0.0000 0: 2 0:138
## 1st Qu.:0.0000 1: 18 1:165
## Median :0.0000 2:166
## Mean :0.7294 3:117
## 3rd Qu.:1.0000
## Max. :4.0000
The summary shows two 0 values in the thal variable; the data dictionary lists only levels 1-3, so let's inspect those rows.
heart1 %>%
filter(thal==0)
Comparing these two observations with the summary, nothing else about them looks anomalous, but thal = 0 stands for a missing value in this variable, so we drop both rows.
heart1 <- heart1 %>%
filter(thal!=0)
table(heart1$target)
##
## 0 1
## 137 164
The target classes are reasonably balanced (137 vs. 164), so we don't need to take any resampling action.
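Had the classes been badly skewed, caret offers simple rebalancing helpers; a sketch of what that could look like (not needed, and not run, for this data):
# downSample() randomly drops majority-class rows until the classes match
balanced <- downSample(x = heart1 %>% select(-target),
                       y = heart1$target,
                       yname = "target")
table(balanced$target)
Next, we split the data into 75% training and 25% testing sets, stratified by target to preserve the class proportions.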
set.seed(124)
splitted <- initial_split(heart1, prop = 0.75, strata = target)
heart_train <- training(splitted)
heart_test <- testing(splitted)
Check the target proportion in the training data and compare it to the original:
prop.table(table(heart_train$target))
##
## 0 1
## 0.4557522 0.5442478
prop.table(table(heart1$target))
##
## 0 1
## 0.4551495 0.5448505
Now fit a logistic regression model using all predictors:
heart_model <- glm(target ~ ., data = heart_train, family = "binomial")
summary(heart_model)
##
## Call:
## glm(formula = target ~ ., family = "binomial", data = heart_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7031 -0.3483 0.1887 0.5644 2.5136
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.989770 3.313846 -0.299 0.765186
## ï..age 0.018539 0.028546 0.649 0.516048
## sex1 -1.008176 0.594303 -1.696 0.089810 .
## cp1 1.397972 0.669245 2.089 0.036718 *
## cp2 2.023641 0.568660 3.559 0.000373 ***
## cp3 2.172573 0.757348 2.869 0.004122 **
## trestbps -0.016059 0.011989 -1.339 0.180428
## chol -0.001780 0.004441 -0.401 0.688580
## fbs1 -0.209264 0.639025 -0.327 0.743310
## restecg 0.652053 0.415747 1.568 0.116790
## thalach 0.020439 0.011962 1.709 0.087523 .
## exang1 -0.658821 0.475766 -1.385 0.166127
## oldpeak -0.374505 0.257358 -1.455 0.145617
## slope 0.640606 0.426476 1.502 0.133073
## ca -0.955366 0.254445 -3.755 0.000174 ***
## thal2 -0.143108 0.815993 -0.175 0.860782
## thal3 -1.456762 0.796775 -1.828 0.067501 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 311.53 on 225 degrees of freedom
## Residual deviance: 154.86 on 209 degrees of freedom
## AIC: 188.86
##
## Number of Fisher Scoring iterations: 6
Several predictors do not contribute significantly to the model. Rather than removing them by hand, we can feed the model into backward stepwise selection, which drops a predictor only if removing it improves (lowers) the AIC.
step(heart_model, direction = "backward", trace = 0)
##
## Call: glm(formula = target ~ sex + cp + restecg + thalach + exang +
## oldpeak + slope + ca + thal, family = "binomial", data = heart_train)
##
## Coefficients:
## (Intercept) sex1 cp1 cp2 cp3 restecg
## -1.86110 -0.90781 1.38345 1.92019 1.97448 0.64641
## thalach exang1 oldpeak slope ca thal2
## 0.01584 -0.72678 -0.42578 0.61315 -0.89164 -0.06270
## thal3
## -1.39826
##
## Degrees of Freedom: 225 Total (i.e. Null); 213 Residual
## Null Deviance: 311.5
## Residual Deviance: 157.3 AIC: 183.3
None of the coefficients is abnormally large, which would have indicated perfect separation, so we can save the reduced model.
heart_model2 <- glm(formula = target ~ sex + cp + restecg + thalach + exang +
oldpeak + slope + ca + thal, family = "binomial", data = heart_train)
summary(heart_model2)
##
## Call:
## glm(formula = target ~ sex + cp + restecg + thalach + exang +
## oldpeak + slope + ca + thal, family = "binomial", data = heart_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6086 -0.3760 0.2009 0.5462 2.4617
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.86110 1.76889 -1.052 0.292740
## sex1 -0.90781 0.52445 -1.731 0.083458 .
## cp1 1.38345 0.65875 2.100 0.035717 *
## cp2 1.92019 0.52882 3.631 0.000282 ***
## cp3 1.97448 0.72202 2.735 0.006244 **
## restecg 0.64641 0.39528 1.635 0.101983
## thalach 0.01584 0.01061 1.493 0.135561
## exang1 -0.72678 0.46232 -1.572 0.115946
## oldpeak -0.42578 0.25307 -1.682 0.092481 .
## slope 0.61315 0.42000 1.460 0.144321
## ca -0.89164 0.24301 -3.669 0.000243 ***
## thal2 -0.06270 0.79523 -0.079 0.937156
## thal3 -1.39826 0.76859 -1.819 0.068872 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 311.53 on 225 degrees of freedom
## Residual deviance: 157.35 on 213 degrees of freedom
## AIC: 183.35
##
## Number of Fisher Scoring iterations: 6
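As a quick check, the reduced model indeed has the lower AIC, which is the criterion step() minimizes:
AIC(heart_model, heart_model2)
# heart_model : 188.86 (full model)
# heart_model2: 183.35 (after backward elimination)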
Use the model to predict the probability of heart disease for each test observation:
heart_test$pred.target <- predict(heart_model2, heart_test, type = "response")
For now, let's use the default threshold of 0.5 to turn probabilities into labels.
heart_test$pred.label <- ifelse(heart_test$pred.target>0.5,"1","0")
heart_test <- heart_test %>%
mutate(pred.label= as.factor(pred.label))
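The 0.5 cutoff is only a convention. A minimal sketch of how other cutoffs could be compared on the test set (exploratory only; in practice the threshold should be validated on separate data):
cutoffs <- c(0.3, 0.4, 0.5, 0.6, 0.7)
sapply(cutoffs, function(ct) {
  pred <- factor(ifelse(heart_test$pred.target > ct, "1", "0"),
                 levels = levels(heart_test$target))
  mean(pred == heart_test$target)  # accuracy at this cutoff
})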
Now for the KNN model. Because KNN is distance-based and scale() only accepts numeric input, we first convert the factor predictors back to numeric.
heart_train1 <- heart_train %>%
mutate(sex = as.numeric(sex),
cp = as.numeric(cp),
fbs = as.numeric(fbs),
exang = as.numeric(exang),
thal = as.numeric(thal),
target = as.numeric(target))
heart_test1 <- heart_test %>%
mutate(sex = as.numeric(sex),
cp = as.numeric(cp),
fbs = as.numeric(fbs),
exang = as.numeric(exang),
thal = as.numeric(thal),
target = as.numeric(target)) %>%
select(-pred.label,-pred.target)
# scale the training predictors, then scale the test predictors with the
# TRAINING means and standard deviations, so no test information leaks in
heart_train_p <- scale(x = heart_train1[,-14])
heart_test_p <- scale(x = heart_test1[,-14],
                      center = attr(heart_train_p, "scaled:center"),
                      scale = attr(heart_train_p, "scaled:scale"))
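As a sanity check, the scaled training predictors should now have mean 0 and standard deviation 1 (the test set will deviate slightly, since it was scaled with the training parameters):
round(colMeans(heart_train_p), 3)      # all ~0
round(apply(heart_train_p, 2, sd), 3)  # all ~1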
#data label(target)
heart_train_label <- heart_train1[,14]
heart_test_label <- heart_test1[,14]
#find optimum k
round(sqrt(nrow(heart_train1)),0)
## [1] 15
The number of classes is 2, so an odd k avoids tied votes; sqrt(226) ≈ 15 is already odd, so we use k = 15.
heart_pred <- knn(train = heart_train_p,
test = heart_test_p,
cl = heart_train_label,
k=15)
#checking result of prediction
head(heart_pred)
## [1] 2 2 2 1 2 2
## Levels: 1 2
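sqrt(n) is only a rule of thumb. A minimal sketch of how a few odd k values could be compared on the test set (exploratory only; proper tuning would cross-validate on the training data):
ks <- seq(3, 21, by = 2)  # odd values avoid tied votes
acc <- sapply(ks, function(k) {
  pred <- knn(train = heart_train_p, test = heart_test_p,
              cl = heart_train_label, k = k)
  mean(as.character(pred) == as.character(heart_test_label))
})
data.frame(k = ks, accuracy = acc)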
First, evaluate the logistic regression predictions on the test set:
confusionMatrix(data = heart_test$pred.label,
                reference = heart_test$target,
                positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 28 3
## 1 6 38
##
## Accuracy : 0.88
## 95% CI : (0.7844, 0.9436)
## No Information Rate : 0.5467
## P-Value [Acc > NIR] : 5.906e-10
##
## Kappa : 0.7561
##
## Mcnemar's Test P-Value : 0.505
##
## Sensitivity : 0.9268
## Specificity : 0.8235
## Pos Pred Value : 0.8636
## Neg Pred Value : 0.9032
## Prevalence : 0.5467
## Detection Rate : 0.5067
## Detection Prevalence : 0.5867
## Balanced Accuracy : 0.8752
##
## 'Positive' Class : 1
##
The logistic regression results are good: Accuracy (88%), Sensitivity (93%), and Pos Pred Value (86%) are all above 80%.
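These figures come straight from the confusion matrix; recomputing them by hand makes the definitions explicit:
# rebuild the confusion matrix and derive the metrics manually
cm <- table(pred = heart_test$pred.label, actual = heart_test$target)
TP <- cm["1", "1"]; TN <- cm["0", "0"]
FP <- cm["1", "0"]; FN <- cm["0", "1"]
c(accuracy    = (TP + TN) / sum(cm),  # (38 + 28) / 75 = 0.88
  sensitivity = TP / (TP + FN),       # 38 / 41  = 0.9268
  pos_pred    = TP / (TP + FP))       # 38 / 44  = 0.8636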
#class checking
class(heart_test_label)
## [1] "numeric"
class(heart_pred)
## [1] "factor"
The confusionMatrix() function requires both arguments to be factors, so we convert heart_test_label first.
heart_test_label <- as.factor(heart_test_label)
confusionMatrix(data = heart_pred,
reference = heart_test_label,
positive = "2")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2
## 1 26 3
## 2 8 38
##
## Accuracy : 0.8533
## 95% CI : (0.7527, 0.9244)
## No Information Rate : 0.5467
## P-Value [Acc > NIR] : 1.664e-08
##
## Kappa : 0.7003
##
## Mcnemar's Test P-Value : 0.2278
##
## Sensitivity : 0.9268
## Specificity : 0.7647
## Pos Pred Value : 0.8261
## Neg Pred Value : 0.8966
## Prevalence : 0.5467
## Detection Rate : 0.5067
## Detection Prevalence : 0.6133
## Balanced Accuracy : 0.8458
##
## 'Positive' Class : 2
##
The KNN results are also good: Accuracy (85%), Sensitivity (93%), and Pos Pred Value (83%) are all above 80%.
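For a side-by-side view, the headline metrics from the two confusion matrices above:
# values taken from the confusionMatrix outputs
data.frame(model          = c("Logistic Regression", "KNN"),
           accuracy       = c(0.880, 0.853),
           sensitivity    = c(0.927, 0.927),
           pos_pred_value = c(0.864, 0.826))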
Both models predict the data well. Still, the logistic regression model is slightly better than KNN based on accuracy. For this data, accuracy is the metric to prioritize, because we want to predict as precisely as possible whether a patient has heart disease.