1 Objective

The main objective of this project is to predict whether a given person has heart disease, based on several contributing factors such as age, cholesterol level, and type of chest pain.

The algorithms we use for this problem are:

  • Logistic Regression
  • K-Nearest Neighbour (K-NN)

2 Library Setup

library(dplyr)    # data wrangling (mutate, glimpse)
library(gtools)   # inv.logit() for converting log-odds to probabilities
library(gmodels)  # CrossTable() and other model utilities
library(ggplot2)  # plotting
library(class)    # knn() classifier
library(tidyr)    # data tidying helpers
library(caret)    # data splitting helpers and confusionMatrix()
library(ROCR)     # ROC curve (prediction, performance)

3 Data Preparation

3.1 Import Data

First, we need to import the dataset into R. The data used in this project contains information about patients with heart disease, described by several factors. You can download it from Kaggle.

heart <- read.csv("heart.csv")

str(heart)
## 'data.frame':    303 obs. of  14 variables:
##  $ age     : int  63 37 41 56 57 57 56 44 52 57 ...
##  $ sex     : int  1 1 0 1 0 1 0 1 1 1 ...
##  $ cp      : int  3 2 1 1 0 0 1 1 2 2 ...
##  $ trestbps: int  145 130 130 120 120 140 140 120 172 150 ...
##  $ chol    : int  233 250 204 236 354 192 294 263 199 168 ...
##  $ fbs     : int  1 0 0 0 0 0 0 0 1 0 ...
##  $ restecg : int  0 1 0 1 1 1 0 1 1 1 ...
##  $ thalach : int  150 187 172 178 163 148 153 173 162 174 ...
##  $ exang   : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ oldpeak : num  2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ...
##  $ slope   : int  0 0 2 2 2 1 1 2 2 2 ...
##  $ ca      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ thal    : int  1 2 2 2 2 1 2 3 3 2 ...
##  $ target  : int  1 1 1 1 1 1 1 1 1 1 ...

3.2 Variable descriptions

Here are brief explanations of the variables:

  • age : Age of the individual in years
  • sex : Sex of the individual (1 = male; 0 = female)
  • cp : Chest pain type (0 = typical angina; 1 = atypical angina; 2 = non-anginal pain; 3 = asymptomatic)
  • trestbps : Resting blood pressure (in mm Hg on admission to the hospital)
  • chol : Serum cholesterol in mg/dl
  • fbs : Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
  • restecg : Resting electrocardiographic results (0 = normal; 1 = ST-T wave abnormality; 2 = left ventricular hypertrophy)
  • thalach : Maximum heart rate achieved
  • exang : Exercise-induced angina (1 = yes; 0 = no)
  • oldpeak : ST depression induced by exercise relative to rest
  • slope : The slope of the peak exercise ST segment (0 = upsloping; 1 = flat; 2 = downsloping)
  • ca : Number of major vessels (0-4) colored by fluoroscopy
  • thal : Thalassemia, an inherited blood disorder that affects the body’s ability to produce hemoglobin and red blood cells (1 = normal; 2 = fixed defect; 3 = reversible defect)
  • target : The predicted attribute: diagnosis of heart disease (angiographic disease status); 0 = less than 50% diameter narrowing, 1 = more than 50% diameter narrowing

4 Exploratory Data Analysis

Exploratory Data Analysis is the phase where we explore the data variables to understand the condition of the data before modeling.

Check the distribution of all variables:

summary(heart)
##       age             sex               cp           trestbps    
##  Min.   :29.00   Min.   :0.0000   Min.   :0.000   Min.   : 94.0  
##  1st Qu.:47.50   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:120.0  
##  Median :55.00   Median :1.0000   Median :1.000   Median :130.0  
##  Mean   :54.37   Mean   :0.6832   Mean   :0.967   Mean   :131.6  
##  3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:140.0  
##  Max.   :77.00   Max.   :1.0000   Max.   :3.000   Max.   :200.0  
##       chol            fbs            restecg          thalach     
##  Min.   :126.0   Min.   :0.0000   Min.   :0.0000   Min.   : 71.0  
##  1st Qu.:211.0   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:133.5  
##  Median :240.0   Median :0.0000   Median :1.0000   Median :153.0  
##  Mean   :246.3   Mean   :0.1485   Mean   :0.5281   Mean   :149.6  
##  3rd Qu.:274.5   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:166.0  
##  Max.   :564.0   Max.   :1.0000   Max.   :2.0000   Max.   :202.0  
##      exang           oldpeak         slope             ca        
##  Min.   :0.0000   Min.   :0.00   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.80   Median :1.000   Median :0.0000  
##  Mean   :0.3267   Mean   :1.04   Mean   :1.399   Mean   :0.7294  
##  3rd Qu.:1.0000   3rd Qu.:1.60   3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :6.20   Max.   :2.000   Max.   :4.0000  
##       thal           target      
##  Min.   :0.000   Min.   :0.0000  
##  1st Qu.:2.000   1st Qu.:0.0000  
##  Median :2.000   Median :1.0000  
##  Mean   :2.314   Mean   :0.5446  
##  3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :3.000   Max.   :1.0000

Check whether there are any missing values:

colSums(is.na(heart))
##      age      sex       cp trestbps     chol      fbs  restecg  thalach 
##        0        0        0        0        0        0        0        0 
##    exang  oldpeak    slope       ca     thal   target 
##        0        0        0        0        0        0

Observing the summary and str output above, we can note the following points:

  • sex cannot be a continuous variable, since per the description it is either male or female. Hence we convert sex from integer to factor, and label it to avoid further confusion.
  • cp cannot be a continuous variable, since it encodes the type of chest pain. We therefore convert cp to a factor and label it for convenience.
  • fbs cannot be a continuous variable or integer, since it only indicates whether fasting blood sugar exceeds 120 mg/dl. Therefore we convert it to a factor and label it.
  • restecg should be a factor, since it encodes the type of ECG result; it cannot stay an integer, so we convert it to a factor and label it.
  • exang should be a factor as per the dataset description: exercise-induced angina either occurs or it does not (yes/no). Therefore we convert the variable to a factor and label it.
  • slope cannot be an integer, since it encodes the type of ST-segment slope observed in the ECG. Therefore we convert the variable to a factor and label it.
  • ca cannot be an integer as per the dataset description, so we convert the variable to a factor.
  • thal cannot be an integer, since it encodes the type of thalassemia, which is categorical. Therefore we convert the variable to a factor.
  • target is the predicted variable and tells us whether the individual has heart disease or not. Therefore we convert the variable to a factor and label it for convenience.

Based on these observations, we implement the changes:

heart <- heart %>% 
  mutate(sex = factor(sex, levels = c(0,1), labels = c("Female", "Male")),
         cp = factor(cp, levels = c(0,1,2,3), labels = c("typical","atypical","non-anginal","asymptomatic")),
         fbs =factor(fbs, levels = c(0,1), labels = c("False", "True")),
         restecg =factor(restecg, levels = c(0,1,2), labels = c("normal","stt","hypertrophy")),
         exang = factor(exang, levels = c(0,1), labels = c("No", "Yes")),
         slope = factor(slope, levels = c(0,1,2), labels = c("upsloping","flat","downsloping")),
         ca = factor(ca),
         thal = factor(thal),
         target = factor(target, levels = c(0,1), 
                        labels = c("Health", "Not Health")))

glimpse(heart)
## Rows: 303
## Columns: 14
## $ age      <int> 63, 37, 41, 56, 57, 57, 56, 44, 52, 57, 54, 48, 49, 64, 58, 5~
## $ sex      <fct> Male, Male, Female, Male, Female, Male, Female, Male, Male, M~
## $ cp       <fct> asymptomatic, non-anginal, atypical, atypical, typical, typic~
## $ trestbps <int> 145, 130, 130, 120, 120, 140, 140, 120, 172, 150, 140, 130, 1~
## $ chol     <int> 233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275, 2~
## $ fbs      <fct> True, False, False, False, False, False, False, False, True, ~
## $ restecg  <fct> normal, stt, normal, stt, stt, stt, normal, stt, stt, stt, st~
## $ thalach  <int> 150, 187, 172, 178, 163, 148, 153, 173, 162, 174, 160, 139, 1~
## $ exang    <fct> No, No, No, No, Yes, No, No, No, No, No, No, No, No, Yes, No,~
## $ oldpeak  <dbl> 2.3, 3.5, 1.4, 0.8, 0.6, 0.4, 1.3, 0.0, 0.5, 1.6, 1.2, 0.2, 0~
## $ slope    <fct> upsloping, upsloping, downsloping, downsloping, downsloping, ~
## $ ca       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0~
## $ thal     <fct> 1, 2, 2, 2, 2, 1, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3~
## $ target   <fct> Not Health, Not Health, Not Health, Not Health, Not Health, N~

And these are the proportions of the target classes:

prop.table(table(heart$target))
## 
##     Health Not Health 
##  0.4554455  0.5445545

5 Modeling

5.1 Splitting Train-Test

Before we train our models, we must split the data into train and test sets. We will use the train data to fit the models and the test data to validate them on unseen data. The data will be split with a ratio of 75/25 (75% of the data used for training, 25% for testing).

set.seed(100) 
# randomly sample 75% of the row indices for the training set
index <- sample(nrow(heart), 0.75 * nrow(heart))
train <- heart[index,]
test <- heart[-index,]
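
As a side note, sample() above draws rows completely at random, so the class proportions in train and test can drift slightly from the overall 45/55 split. If a stratified split is preferred, caret's createDataPartition() can be used instead; a minimal sketch (not used in the rest of this project):

set.seed(100)
# createDataPartition() samples within each target class to preserve the class balance
idx_strat <- createDataPartition(heart$target, p = 0.75, list = FALSE)
train_strat <- heart[idx_strat, ]
test_strat  <- heart[-idx_strat, ]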

5.2 Logistic Regression

This is the equation of a logistic regression model. The left-hand side is called the log-odds or logit. On the right-hand side, B0 is the model intercept and B1 is the coefficient of the feature X.

\[\log\left(\frac{p(X)}{1-p(X)}\right)= B_0 + B_1 X\]
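
As a quick numerical check of this formula (an aside, not part of the original workflow), base R's plogis() is the inverse of the logit, mapping log-odds back to probabilities, while qlogis() is the logit itself:

# log-odds of 0 correspond to a probability of 0.5
plogis(0)
## [1] 0.5
# qlogis() and plogis() are inverses of each other
qlogis(plogis(2.5))
## [1] 2.5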

We generate the model on the training data and then validate it with the testing data.

model_lgr <- glm(target ~ ., data = train, family = "binomial")
# family = "binomial" means the outcome has only two classes

summary(model_lgr)
## 
## Call:
## glm(formula = target ~ ., family = "binomial", data = train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -3.03957  -0.26644   0.08235   0.42215   2.97504  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         0.864308   4.269846   0.202 0.839587    
## age                 0.036153   0.031524   1.147 0.251452    
## sexMale            -2.082095   0.707450  -2.943 0.003249 ** 
## cpatypical          1.357814   0.730943   1.858 0.063223 .  
## cpnon-anginal       1.871870   0.616335   3.037 0.002389 ** 
## cpasymptomatic      3.473455   1.021746   3.400 0.000675 ***
## trestbps           -0.027971   0.015272  -1.832 0.067024 .  
## chol               -0.006059   0.006112  -0.991 0.321497    
## fbsTrue             0.427703   0.695171   0.615 0.538390    
## restecgstt          0.958900   0.493394   1.943 0.051958 .  
## restecghypertrophy -1.049684   3.174112  -0.331 0.740870    
## thalach             0.009176   0.013970   0.657 0.511258    
## exangYes           -0.704661   0.542541  -1.299 0.194007    
## oldpeak            -0.282308   0.306674  -0.921 0.357286    
## slopeflat          -0.069356   1.187088  -0.058 0.953410    
## slopedownsloping    1.573649   1.290392   1.220 0.222650    
## ca1                -2.618266   0.658078  -3.979 6.93e-05 ***
## ca2                -4.114748   1.041030  -3.953 7.73e-05 ***
## ca3                -2.171650   1.016714  -2.136 0.032683 *  
## ca4                 1.056279   2.046429   0.516 0.605745    
## thal1               2.867969   2.767630   1.036 0.300083    
## thal2               2.822062   2.645828   1.067 0.286149    
## thal3               1.190640   2.643434   0.450 0.652412    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 313.10  on 226  degrees of freedom
## Residual deviance: 125.41  on 204  degrees of freedom
## AIC: 171.41
## 
## Number of Fisher Scoring iterations: 6

Then we can predict using the model_lgr we created before. We will use the default value of 0.5 as the classification threshold / cut-off.

test$pred <- predict(model_lgr, test, type = "response")
# probabilities below the 0.5 cut-off are labelled "Health", the rest "Not Health"
test$pred <- ifelse(test$pred < 0.5, "Health", "Not Health")
rmarkdown::paged_table(data.frame(test$pred))
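
As an optional sanity check before the formal evaluation in section 6 (an aside, not part of the original workflow), the gmodels package loaded earlier provides CrossTable() for a quick tabulation of predictions against the actual labels:

# cross-tabulate predicted vs. actual classes, without the chi-square contribution cells
CrossTable(test$pred, test$target, prop.chisq = FALSE)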

Interpretation

The logistic regression model is an interpretable model: we can see how each variable contributes to the probability of the outcome. This can be done by creating a data frame of the model coefficients and transforming the values with inv.logit(), which converts log-odds into probabilities.

# Return the probability value

model_logit <- model_lgr$coefficients %>%
  inv.logit() %>%
  data.frame()

rmarkdown::paged_table(model_logit)

So, from the data frame above we can read off information such as:

  • The probability of a male being diagnosed with heart disease is about 11% (reproduced in the snippet below).
  • Patients with chest pain of type cp = 3 (labelled asymptomatic here) have about a 97% probability of being diagnosed with heart disease.
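
For example, the value for sexMale can be reproduced directly from the model coefficient (a small illustrative check; the exact number depends on the fitted model):

# inv.logit() from gtools converts a single log-odds coefficient into a probability
# roughly 0.11, given the sexMale estimate of about -2.08 in the summary above
inv.logit(coef(model_lgr)["sexMale"])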

5.3 K-Nearest Neighbor

K-NN classifies an observation by looking at its nearest “neighbours”. In other words, K-NN looks at the class of the data point(s) with the shortest distance to the data we want to classify. The distance is measured with the Euclidean distance, which penalizes neighbours that are farther away.

\[d(x,y)=\sqrt{\sum_i(x_i - y_i)^2}\]
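
As a tiny worked example of this distance (purely illustrative), the points (1, 2) and (4, 6) are sqrt(3^2 + 4^2) = 5 apart:

point_a <- c(1, 2)
point_b <- c(4, 6)
sqrt(sum((point_a - point_b)^2))
## [1] 5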

Scaling

As the overview of our data frame showed, some variables have very different ranges. Before we create the K-NN model we need to scale the numeric predictors in the training data, so that no predictor dominates simply because its scale is larger than the others.

# identify the numeric columns; only these are used as K-NN predictors
numerictrain <- sapply(X = train, FUN = is.numeric)
numerictest <- sapply(X = test, FUN = is.numeric)

# scale the training predictors
train_x <- scale(train[,numerictrain])
train_y <- train$target

# scale the test predictors with the centers and scales learned from the training data
test_x <- scale(test[,numerictest], 
                center = attr(train_x, "scaled:center"),
                scale = attr(train_x, "scaled:scale"))
test_y <- test$target

Then let's pick a starting k. A common heuristic is k ≈ sqrt(n), the square root of the number of training observations:

nrow(train_x)
## [1] 227
sqrt(nrow(train_x))
## [1] 15.06652

Because the number of target classes is even, an odd k is usually preferred to avoid ties in the majority vote. Here we start with k = 16, close to sqrt(227), and revisit this choice in the model improvement section. Now we can make predictions with the K-NN model:

knn <- knn(train = train_x,
    test = test_x,
    cl = train_y,
    k = 16)

rmarkdown::paged_table(data.frame(knn))

6 Evaluation

We will evaluate both the logistic regression and the K-NN model with a confusion matrix. A confusion matrix is a table that shows four categories: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN).

knitr::include_graphics("cf.PNG")

The performance metrics will be Accuracy, Sensitivity/Recall, Specificity, and Precision (Saito and Rehmsmeier, 2015). Accuracy measures how much of the data is correctly predicted. Sensitivity measures how many of the actual positive outcomes are correctly predicted. Specificity measures how many of the actual negative outcomes are correctly predicted. Precision measures how many of our positive predictions are correct.

\[Accuracy=\frac{TP+TN}{TP+TN+FP+FN}\]

\[Sensitivity=\frac{TP}{TP+FN}\]

\[Specificity=\frac{TN}{TN+FP}\]

\[Precision=\frac{TP}{TP+FP}\]

6.1 Logistic Regression

log_cm <- confusionMatrix(factor(test$pred),test$target,positive = "Not Health")
log_cm
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   Health Not Health
##   Health         24          4
##   Not Health     10         38
##                                           
##                Accuracy : 0.8158          
##                  95% CI : (0.7103, 0.8955)
##     No Information Rate : 0.5526          
##     P-Value [Acc > NIR] : 1.294e-06       
##                                           
##                   Kappa : 0.6211          
##                                           
##  Mcnemar's Test P-Value : 0.1814          
##                                           
##             Sensitivity : 0.9048          
##             Specificity : 0.7059          
##          Pos Pred Value : 0.7917          
##          Neg Pred Value : 0.8571          
##              Prevalence : 0.5526          
##          Detection Rate : 0.5000          
##    Detection Prevalence : 0.6316          
##       Balanced Accuracy : 0.8053          
##                                           
##        'Positive' Class : Not Health      
## 
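
To connect this output with the formulas in the previous section, the four metrics can be recomputed by hand from the counts in the matrix above (with “Not Health” as the positive class, so TP = 38, FN = 4, FP = 10, TN = 24):

TP <- 38; TN <- 24; FP <- 10; FN <- 4
(TP + TN) / (TP + TN + FP + FN)  # accuracy
## [1] 0.8157895
TP / (TP + FN)                   # sensitivity / recall
## [1] 0.9047619
TN / (TN + FP)                   # specificity
## [1] 0.7058824
TP / (TP + FP)                   # precision / positive predictive value
## [1] 0.7916667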

6.2 K-NN

knn_cm <- confusionMatrix(knn, test$target, positive = "Not Health")
knn_cm
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   Health Not Health
##   Health         21          8
##   Not Health     13         34
##                                           
##                Accuracy : 0.7237          
##                  95% CI : (0.6091, 0.8201)
##     No Information Rate : 0.5526          
##     P-Value [Acc > NIR] : 0.001639        
##                                           
##                   Kappa : 0.4332          
##                                           
##  Mcnemar's Test P-Value : 0.382733        
##                                           
##             Sensitivity : 0.8095          
##             Specificity : 0.6176          
##          Pos Pred Value : 0.7234          
##          Neg Pred Value : 0.7241          
##              Prevalence : 0.5526          
##          Detection Rate : 0.4474          
##    Detection Prevalence : 0.6184          
##       Balanced Accuracy : 0.7136          
##                                           
##        'Positive' Class : Not Health      
## 

6.3 Comparison

eval_log <- data.frame(Accuracy = log_cm$overall[1],
                       Recall = log_cm$byClass[1],
                       Specificity = log_cm$byClass[2],
                       Precision = log_cm$byClass[3])


eval_knn <- data.frame(Accuracy = knn_cm$overall[1],
                       Recall = knn_cm$byClass[1],
                       Specificity = knn_cm$byClass[2],
                       Precision = knn_cm$byClass[3])

eval_total <- rbind(round(eval_log,2), round(eval_knn,2))
rownames(eval_total) <- c("Logistic Regression", "K-NN")

rmarkdown::paged_table(data.frame(eval_total))

7 Model Improvement

7.1 Logistic regression

To get a better result from the logistic regression model, there are several options; one is to change the classification threshold used to determine the prediction. For this we use the ROC curve (receiver operating characteristic curve), a graph showing the performance of a classification model at all classification thresholds, which allows us to pick a proper cut-off.

train$pred <- fitted(model_lgr)
# fitted() returns predicted scores only for the data the model was trained on
pred <- prediction(train$pred, train$target)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize = T, print.cutoffs.at = seq(0.1, by = 0.1))
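
The ROC curve can also be summarised in a single number, the area under the curve (AUC); ROCR computes it from the same prediction object (a small addition, not part of the original analysis):

# AUC of the training-set ROC curve; values closer to 1 indicate better separation
auc <- performance(pred, "auc")
auc@y.values[[1]]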

Using the ROC curve, we can observe that a cut-off of 0.6 gives a better balance of sensitivity and specificity, so we change our cut-off to 0.6.

test$pred<-predict(model_lgr,test,type = "response")
test$pred<-ifelse(test$pred < 0.6,"Health","Not Health")

And let's calculate the confusion matrix again:

log_cm <- confusionMatrix(factor(test$pred),test$target,positive = "Not Health")
log_cm
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   Health Not Health
##   Health         25          5
##   Not Health      9         37
##                                           
##                Accuracy : 0.8158          
##                  95% CI : (0.7103, 0.8955)
##     No Information Rate : 0.5526          
##     P-Value [Acc > NIR] : 1.294e-06       
##                                           
##                   Kappa : 0.6232          
##                                           
##  Mcnemar's Test P-Value : 0.4227          
##                                           
##             Sensitivity : 0.8810          
##             Specificity : 0.7353          
##          Pos Pred Value : 0.8043          
##          Neg Pred Value : 0.8333          
##              Prevalence : 0.5526          
##          Detection Rate : 0.4868          
##    Detection Prevalence : 0.6053          
##       Balanced Accuracy : 0.8081          
##                                           
##        'Positive' Class : Not Health      
## 

As a result, precision and specificity have increased compared to before, while accuracy stays the same and sensitivity drops slightly.

7.2 K-NN

To improve the K-NN model, we will try changing k from 16 to 18 and see whether there is any difference.

knn <- knn(train = train_x,
    test = test_x,
    cl = train_y,
    k = 18)

knn_cm <- confusionMatrix(knn, test$target, positive = "Not Health")
knn_cm
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   Health Not Health
##   Health         21          4
##   Not Health     13         38
##                                          
##                Accuracy : 0.7763         
##                  95% CI : (0.6662, 0.864)
##     No Information Rate : 0.5526         
##     P-Value [Acc > NIR] : 4.229e-05      
##                                          
##                   Kappa : 0.5359         
##                                          
##  Mcnemar's Test P-Value : 0.05235        
##                                          
##             Sensitivity : 0.9048         
##             Specificity : 0.6176         
##          Pos Pred Value : 0.7451         
##          Neg Pred Value : 0.8400         
##              Prevalence : 0.5526         
##          Detection Rate : 0.5000         
##    Detection Prevalence : 0.6711         
##       Balanced Accuracy : 0.7612         
##                                          
##        'Positive' Class : Not Health     
## 

And we can see an improvement over the previous results, with higher accuracy and precision than before.

eval_log <- data.frame(Accuracy = log_cm$overall[1],
                       Recall = log_cm$byClass[1],
                       Specificity = log_cm$byClass[2],
                       Precision = log_cm$byClass[3])


eval_knn <- data.frame(Accuracy = knn_cm$overall[1],
                       Recall = knn_cm$byClass[1],
                       Specificity = knn_cm$byClass[2],
                       Precision = knn_cm$byClass[3])

eval_total <- rbind(round(eval_log,2), round(eval_knn,2))
rownames(eval_total) <- c("Logistic Regression", "K-NN")

rmarkdown::paged_table(data.frame(eval_total))
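
Instead of trying k values one at a time, k can also be tuned with cross-validation. Below is a sketch using caret::train() on the scaled numeric predictors (an alternative route, not the procedure used above; the selected k may well differ from 16 or 18):

set.seed(100)
# 5-fold cross-validation over odd k values from 3 to 25
knn_tuned <- train(x = data.frame(train_x), y = train_y,
                   method = "knn",
                   trControl = trainControl(method = "cv", number = 5),
                   tuneGrid = data.frame(k = seq(3, 25, 2)))
knn_tuned$bestTune  # k with the best cross-validated accuracy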

8 Conclusion

Both models' performance improved after the tuning process. From the comparison table above, we can see that Logistic Regression gets higher results for Accuracy, Specificity and Precision than the K-NN model, while K-NN has a slightly higher Recall. Given what we want to achieve, precision is the metric to focus on, since we do not want the model to be wrong about which patients really are Not Health. So we choose the Logistic Regression model: its ability to correctly predict Not Health patients is better, with a precision of about 80%, compared to roughly 75% for the K-NN model.