Heart Disease Prediction using Logistic Regression and K-Nearest Neighbors
Introduction
Objective
The main objective of this project is to predict whether a given person is at risk of having heart disease or not. This prediction is made by analyzing several contributing factors, such as age, cholesterol level, and the type of chest pain experienced by the individual.
In order to achieve this objective, we employ the following algorithms:
- Logistic Regression: This algorithm is a statistical method that is commonly used for binary classification tasks, such as determining the presence or absence of heart disease. It calculates the probability of an individual having heart disease based on the input factors and then classifies them accordingly.
- K-Nearest Neighbors (K-NN): K-Nearest Neighbors is a machine learning algorithm used for classification tasks. In this context, it works by identifying the k-nearest individuals in the dataset who share similar characteristics with the person being evaluated. The algorithm then predicts the presence of heart disease based on the majority class of these nearest neighbors.
Data Source
The dataset utilized in this project comprises comprehensive information on patients evaluated for heart disease, covering a wide array of clinical factors. It is readily available for download on Kaggle.
Here are brief explanations of the variables:
- age: age of the individual in years
- sex: gender of the individual (1 = male; 0 = female)
- cp: chest pain type (1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic)
- trestbps: resting blood pressure (in mm Hg on admission to the hospital)
- chol: serum cholesterol in mg/dl
- fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
- restecg: resting electrocardiographic results (0 = normal; 1 = ST-T wave abnormality; 2 = left ventricular hypertrophy)
- thalach: maximum heart rate achieved
- exang: exercise-induced angina (1 = yes; 0 = no)
- oldpeak: ST depression induced by exercise relative to rest
- slope: the slope of the peak exercise ST segment (1 = upsloping; 2 = flat; 3 = downsloping)
- ca: number of major vessels (0-4) colored by fluoroscopy
- thal: thalassemia, an inherited blood disorder that affects the body's ability to produce hemoglobin and red blood cells (1 = normal; 2 = fixed defect; 3 = reversible defect)
- target: the predicted attribute, the diagnosis of heart disease (angiographic disease status); 0 = < 50% diameter narrowing, 1 = > 50% diameter narrowing
Data Preparation
The Data Preparation process consists of two primary activities: loading the data and data wrangling. First, we use the readr library to import the raw dataset. Then, during data wrangling, we perform cleaning, transformation, and structuring tasks to ensure the data is in an optimal and reliable state for further analysis and model development.
Load Data
First, we load the data by using the readr library to import the raw dataset.
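A minimal sketch of this step, assuming the file is a CSV named heart.csv (the actual file name is not shown in the original):

library(readr)

# Import the raw dataset; the file name "heart.csv" is an assumption
heart <- read_csv("heart.csv")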
Data Wrangling
Data Wrangling is a pivotal phase in our project, where we transform and structure the data to prepare it for analysis and modeling. We begin with the glimpse function from the dplyr library, which quickly summarizes the dataset's structure: the number of observations and variables, and the data type of each variable.
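A minimal sketch of the call that produces the summary below:

library(dplyr)

# Inspect the structure of the dataset: rows, columns, and column types
glimpse(heart)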
## Rows: 1,025
## Columns: 14
## $ age <int> 52, 53, 70, 61, 62, 58, 58, 55, 46, 54, 71, 43, 34, 51, 52, 3…
## $ sex <int> 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1…
## $ cp <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 0, 1, 2, 2…
## $ trestbps <int> 125, 140, 145, 148, 138, 100, 114, 160, 120, 122, 112, 132, 1…
## $ chol <int> 212, 203, 174, 203, 294, 248, 318, 289, 249, 286, 149, 341, 2…
## $ fbs <int> 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0…
## $ restecg <int> 1, 0, 1, 1, 1, 0, 2, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0…
## $ thalach <int> 168, 155, 125, 161, 106, 122, 140, 145, 144, 116, 125, 136, 1…
## $ exang <int> 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0…
## $ oldpeak <dbl> 1.0, 3.1, 2.6, 0.0, 1.9, 1.0, 4.4, 0.8, 0.8, 3.2, 1.6, 3.0, 0…
## $ slope <int> 2, 0, 0, 2, 1, 1, 0, 1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 2, 1…
## $ ca <int> 2, 0, 0, 1, 3, 0, 3, 1, 0, 2, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0…
## $ thal <int> 3, 3, 3, 3, 2, 2, 1, 3, 3, 2, 2, 3, 2, 3, 0, 2, 2, 3, 2, 2, 2…
## $ target <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0…
In the data examination process, we have identified several variables
that need to be transformed into factor data types. These variables
include sex, cp, fbs,
restecg, exang, slope,
ca, thal, and target. Factor
data types allow us to more efficiently manage these categorical
variables in our analysis and modeling, and also facilitate the creation
of informative visualizations.
heart <- heart %>%
  mutate(sex = as.factor(sex),
         cp = as.factor(cp),
         fbs = as.factor(fbs),
         restecg = as.factor(restecg),
         exang = as.factor(exang),
         slope = as.factor(slope),
         ca = as.factor(ca),
         thal = as.factor(thal),
         target = as.factor(target))
glimpse(heart)
## Rows: 1,025
## Columns: 14
## $ age <int> 52, 53, 70, 61, 62, 58, 58, 55, 46, 54, 71, 43, 34, 51, 52, 3…
## $ sex <fct> 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1…
## $ cp <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 0, 1, 2, 2…
## $ trestbps <int> 125, 140, 145, 148, 138, 100, 114, 160, 120, 122, 112, 132, 1…
## $ chol <int> 212, 203, 174, 203, 294, 248, 318, 289, 249, 286, 149, 341, 2…
## $ fbs <fct> 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0…
## $ restecg <fct> 1, 0, 1, 1, 1, 0, 2, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0…
## $ thalach <int> 168, 155, 125, 161, 106, 122, 140, 145, 144, 116, 125, 136, 1…
## $ exang <fct> 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0…
## $ oldpeak <dbl> 1.0, 3.1, 2.6, 0.0, 1.9, 1.0, 4.4, 0.8, 0.8, 3.2, 1.6, 3.0, 0…
## $ slope <fct> 2, 0, 0, 2, 1, 1, 0, 1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 2, 1…
## $ ca <fct> 2, 0, 0, 1, 3, 0, 3, 1, 0, 2, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0…
## $ thal <fct> 3, 3, 3, 3, 2, 2, 1, 3, 3, 2, 2, 3, 2, 3, 0, 2, 2, 3, 2, 2, 2…
## $ target <fct> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0…
Exploratory Data Analysis
The Exploratory Data Analysis (EDA) section of our project plays a pivotal role in unveiling crucial insights from the dataset.
Target Class Proportions
We assess the distribution of the target variable to understand the prevalence of heart disease.
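The proportions below can be produced with a call along these lines (the original chunk is not shown, so this is an assumption):

# Proportion of each target class (0 = no disease, 1 = disease)
prop.table(table(heart$target))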
##
## 0 1
## 0.4868293 0.5131707
The distribution of the target class in the dataset is balanced. With approximately 48.68% of instances falling into category 0 and around 51.32% into category 1, there is no significant imbalance between the two classes. A balanced target distribution is a favorable characteristic for our analysis and modeling, as it ensures that both outcomes are adequately represented, contributing to the robustness and reliability of the predictive models.
Missing Values
Next, we examine the dataset for missing data points to ensure data completeness and integrity.
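A sketch of checks that produce the output below (assumed, as the original chunk is not shown):

# Any missing values anywhere in the data? Then count NAs per column.
anyNA(heart)
colSums(is.na(heart))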
## [1] FALSE
## age sex cp trestbps chol fbs restecg thalach
## 0 0 0 0 0 0 0 0
## exang oldpeak slope ca thal target
## 0 0 0 0 0 0
The dataset has no missing values, signifying a high level of data completeness and integrity. This absence of gaps streamlines the analysis and modeling processes and gives us confidence in the reliability of the results.
Correlation
We delve into the relationships between variables to uncover potential patterns and associations.
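A sketch of how such a correlation plot can be drawn with ggcorr() from the GGally package (an assumption; the original chunk is not shown):

library(GGally)

# Pairwise correlations; non-numeric columns are silently dropped
ggcorr(heart, label = TRUE)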
From the correlation check, only some correlations between variables can be displayed, because the ggcorr function (from the GGally package) ignores non-numeric columns. Among the displayed correlations, there is a negative correlation of -0.3 between oldpeak (ST depression induced by exercise) and thalach (maximum heart rate achieved): as oldpeak increases, thalach tends to decrease, and vice versa. The magnitude of -0.3 indicates that this negative relationship is weak.
Cross Validation
Before proceeding with model training, the dataset is divided into two distinct sets: the training data and the test data. The training data, constituting 75% of the dataset, is utilized for model training, allowing the model to learn from the data’s patterns and relationships. The remaining 25% of the data is reserved for the test data, which serves as a benchmark to assess the model’s ability to make accurate predictions on new, unseen data.
RNGkind(sample.kind = "Rounding")
set.seed(231)
index <- sample(x = nrow(heart), size = nrow(heart)*0.75)
heart_train <- heart[index,]
heart_test <- heart[-index,]
##
## 0 1
## 0.5091146 0.4908854
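These proportions presumably tabulate the target classes of the training set; a sketch of such a call (the original chunk is not shown):

# Class balance of the training split
prop.table(table(heart_train$target))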
With these relatively balanced proportions, the class distribution in the training data remains fairly balanced: both class 0 and class 1 maintain a comparable representation.
Model Building
Model building is a critical phase of this project, where we construct predictive models to gain valuable insights into heart disease diagnosis. This section is subdivided into three core components: Scaling, Logistic Regression, and K-Nearest Neighbor.
Scaling
Scaling plays a pivotal role in ensuring the accuracy and effectiveness of the k-Nearest Neighbors (kNN) algorithm in this project. Specifically, z-score standardization is employed to prepare the dataset for kNN classification.
We use the summary() function to obtain summary statistics for each variable in the dataset, including the minimum, 1st quartile, median (2nd quartile), mean, 3rd quartile, and maximum values. This helps us understand the range of values in the dataset.
## age sex cp trestbps chol fbs restecg
## Min. :29.00 0:312 0:497 Min. : 94.0 Min. :126 0:872 0:497
## 1st Qu.:48.00 1:713 1:167 1st Qu.:120.0 1st Qu.:211 1:153 1:513
## Median :56.00 2:284 Median :130.0 Median :240 2: 15
## Mean :54.43 3: 77 Mean :131.6 Mean :246
## 3rd Qu.:61.00 3rd Qu.:140.0 3rd Qu.:275
## Max. :77.00 Max. :200.0 Max. :564
## thalach exang oldpeak slope ca thal target
## Min. : 71.0 0:680 Min. :0.000 0: 74 0:578 0: 7 0:499
## 1st Qu.:132.0 1:345 1st Qu.:0.000 1:482 1:226 1: 64 1:526
## Median :152.0 Median :0.800 2:469 2:134 2:544
## Mean :149.1 Mean :1.072 3: 69 3:410
## 3rd Qu.:166.0 3rd Qu.:1.800 4: 18
## Max. :202.0 Max. :6.200
The analysis shows that the dataset has a wide range of values for various variables, indicating that scaling is necessary to ensure that variables have a similar influence in the modeling process.
Before applying scaling, it's crucial to separate the data into two distinct parts: the predictor variables, which will be used for modeling, and the target variable, which we aim to predict. This separation ensures that the scaling process does not affect the target variable, preserving the integrity of the predictive relationship during kNN modeling.
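A minimal sketch of this step, assuming only the numeric predictors are scaled and that the test set reuses the training set's center and scale; the object names and column selection here are illustrative, not the original code:

# Z-score standardization of the numeric predictors (assumed column choice)
num_cols <- c("age", "trestbps", "chol", "thalach", "oldpeak")
heart_train_x <- scale(heart_train[, num_cols])
heart_test_x <- scale(heart_test[, num_cols],
                      center = attr(heart_train_x, "scaled:center"),
                      scale = attr(heart_train_x, "scaled:scale"))

# Keep the target variable separate so scaling never touches it
heart_train_y <- heart_train$target
heart_test_y <- heart_test$target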
Logistic Regression
Build Model
Logistic Regression using backward feature selection is a model-building process that starts with all available predictor variables and iteratively removes the least significant variables until a subset with the most influential variables remains. This technique helps simplify the model, potentially improving its performance and interpretability while retaining essential predictive factors.
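A sketch of this process using step() from base R, which is consistent with the variables retained in the summary below (the model object name model_lr is an assumption):

# Start from the full model, then drop the least significant predictors
model_lr_full <- glm(target ~ ., data = heart_train, family = "binomial")
model_lr <- step(model_lr_full, direction = "backward", trace = FALSE)
summary(model_lr)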
Model Summary
##
## Call:
## glm(formula = target ~ age + sex + cp + trestbps + chol + thalach +
## exang + oldpeak + slope + ca + thal, family = "binomial",
## data = heart_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.133489 2.486789 0.054 0.957191
## age 0.029109 0.015923 1.828 0.067534 .
## sex1 -2.220298 0.363135 -6.114 9.70e-10 ***
## cp1 1.122160 0.370357 3.030 0.002446 **
## cp2 2.027204 0.324136 6.254 4.00e-10 ***
## cp3 2.443180 0.446890 5.467 4.58e-08 ***
## trestbps -0.026347 0.007310 -3.604 0.000313 ***
## chol -0.006279 0.002576 -2.437 0.014799 *
## thalach 0.027332 0.007788 3.509 0.000449 ***
## exang1 -0.830875 0.282849 -2.938 0.003308 **
## oldpeak -0.426937 0.144101 -2.963 0.003049 **
## slope1 -0.978950 0.540180 -1.812 0.069945 .
## slope2 0.346692 0.584982 0.593 0.553412
## ca1 -2.252339 0.318190 -7.079 1.46e-12 ***
## ca2 -3.630050 0.514034 -7.062 1.64e-12 ***
## ca3 -1.967510 0.573945 -3.428 0.000608 ***
## ca4 1.574593 0.978595 1.609 0.107609
## thal1 3.026894 1.911558 1.583 0.113314
## thal2 2.141184 1.862130 1.150 0.250203
## thal3 1.131188 1.866392 0.606 0.544460
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1064.42 on 767 degrees of freedom
## Residual deviance: 465.72 on 748 degrees of freedom
## AIC: 505.72
##
## Number of Fisher Scoring iterations: 6
Model Interpretation
The model results indicate the relationship between the target variable (presence or absence of heart disease) and several predictor variables. The coefficients associated with each predictor variable represent their impact on the log-odds of the target variable.
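Because raw coefficients are on the log-odds scale, converting them to odds ratios can make them easier to read; a sketch (model_lr is the assumed model object):

# exp() turns log-odds coefficients into odds ratios
exp(coef(model_lr))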
Notable findings from the model include:
- age has a positive coefficient, suggesting that as a person's age increases, the likelihood of having heart disease also increases.
- sex1 (male) has a significant negative coefficient, implying that being male is associated with a lower likelihood of heart disease in this model.
- The cp (chest pain type) dummies are significant predictors with positive coefficients, indicating a positive association with heart disease.
- trestbps (resting blood pressure) and chol (serum cholesterol) have negative coefficients, suggesting that higher values of these variables are associated with a lower likelihood of heart disease in this model.
- exang1 (presence of exercise-induced angina) has a negative coefficient, indicating a reduced likelihood of heart disease when angina occurs during exercise.
- The ca dummies (number of major vessels colored by fluoroscopy) are significant predictors; higher values of ca are associated with a lower likelihood of heart disease.
K-Nearest Neighbor
Find Optimum k
## [1] 27.71281
The choice of k can significantly impact the model's performance. A common starting point is \(\sqrt{n}\), where \(n\) is the number of training observations; here \(\sqrt{768} \approx 27.71\), as shown above. Since the dataset contains binary target classes (0 and 1), it is advisable to opt for an odd value of k to avoid potential ties in the voting process. Therefore, we set k to 27, the nearest odd integer.
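A sketch of the kNN fit with knn() from the class package, reusing the scaled objects sketched in the Scaling section (the prediction column name is an assumption):

library(class)

# Predict each test label from its 27 nearest scaled training neighbors
heart_test$pred_label_knn <- knn(train = heart_train_x,
                                 test = heart_test_x,
                                 cl = heart_train_y,
                                 k = 27)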
Evaluation
Logistic Regression
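The predicted labels evaluated below are obtained by thresholding the model's predicted probabilities at 0.5; a sketch, assuming the model_lr object from above (confusionMatrix() comes from the caret package):

library(caret)

# Predicted probability of class 1, then a 0.5 cutoff for the label
heart_test$pred_prob_lr <- predict(model_lr, newdata = heart_test, type = "response")
heart_test$pred_label_lr <- ifelse(heart_test$pred_prob_lr > 0.5, "1", "0")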
confusionMatrix(data = as.factor(heart_test$pred_label_lr),
                reference = heart_test$target,
                positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 91 11
## 1 17 138
##
## Accuracy : 0.8911
## 95% CI : (0.8464, 0.9264)
## No Information Rate : 0.5798
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7747
##
## Mcnemar's Test P-Value : 0.3447
##
## Sensitivity : 0.9262
## Specificity : 0.8426
## Pos Pred Value : 0.8903
## Neg Pred Value : 0.8922
## Prevalence : 0.5798
## Detection Rate : 0.5370
## Detection Prevalence : 0.6031
## Balanced Accuracy : 0.8844
##
## 'Positive' Class : 1
##
K-Nearest Neighbor
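The matrix below comes from the analogous call for the kNN predictions (a sketch; pred_label_knn follows the naming assumed earlier):

confusionMatrix(data = heart_test$pred_label_knn,
                reference = heart_test$target,
                positive = "1")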
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 69 22
## 1 39 127
##
## Accuracy : 0.7626
## 95% CI : (0.7058, 0.8133)
## No Information Rate : 0.5798
## P-Value [Acc > NIR] : 6.304e-10
##
## Kappa : 0.5021
##
## Mcnemar's Test P-Value : 0.0405
##
## Sensitivity : 0.8523
## Specificity : 0.6389
## Pos Pred Value : 0.7651
## Neg Pred Value : 0.7582
## Prevalence : 0.5798
## Detection Rate : 0.4942
## Detection Prevalence : 0.6459
## Balanced Accuracy : 0.7456
##
## 'Positive' Class : 1
##
Comparison
| Metric | Logistic Regression | K-Nearest Neighbor |
|---|---|---|
| Accuracy | 89.11% | 76.26% |
| Sensitivity | 92.62% | 85.23% |
| Specificity | 84.26% | 63.89% |
In this classification project, two models, Logistic Regression and K-Nearest Neighbor (kNN), were evaluated based on several performance metrics. The project’s primary objective is to minimize False Negatives (FN). Given this priority, the recall metric, also known as Sensitivity, becomes crucial. Based on this analysis, the Logistic Regression model exhibits higher sensitivity (recall) compared to the K-Nearest Neighbor model. This suggests that, for this specific task, Logistic Regression may be the more suitable choice due to its superior ability to correctly identify positive cases and minimize false negatives.
Conclusion
A thorough examination of the comparison table reveals that the Logistic Regression model achieved higher scores than the K-Nearest Neighbor (K-NN) model across the crucial metrics: Recall (Sensitivity), Accuracy, Specificity, and Precision (positive predictive value). Which metric matters most depends on the specific objective at hand; in this context, we place significant emphasis on Recall. This is particularly important because we aim to minimize instances where the model incorrectly predicts patients as healthy (target 0) when they in fact have heart disease. Consequently, the Logistic Regression model emerges as the preferred choice, exhibiting superior ability to identify individuals with heart disease: its Recall of 92.62% surpasses the performance of the K-NN model.