Heart Disease Prediction using Logistic Regression and K-Nearest Neighbors
Introduction
Objective
The main objective of this project is to predict whether a given person is at risk of having heart disease or not. This prediction is made by analyzing several contributing factors, such as age, cholesterol level, and the type of chest pain experienced by the individual.
In order to achieve this objective, we employ the following algorithms:
- Logistic Regression: This algorithm is a statistical method that is commonly used for binary classification tasks, such as determining the presence or absence of heart disease. It calculates the probability of an individual having heart disease based on the input factors and then classifies them accordingly.
- K-Nearest Neighbors (K-NN): K-Nearest Neighbors is a machine learning algorithm used for classification tasks. In this context, it works by identifying the k-nearest individuals in the dataset who share similar characteristics with the person being evaluated. The algorithm then predicts the presence of heart disease based on the majority class of these nearest neighbors.
Data Source
The dataset utilized in this project comprises comprehensive information on patients evaluated for heart disease, covering a wide array of clinical factors. It is readily available for download on Kaggle.
Here are brief explanations of the variables:
- age: age of the individual in years
- sex: gender of the individual (1 = male; 0 = female)
- cp: chest pain type (1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic)
- trestbps: resting blood pressure (in mm Hg on admission to the hospital)
- chol: serum cholesterol in mg/dl
- fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
- restecg: resting electrocardiographic results (0 = normal; 1 = ST-T wave abnormality; 2 = left ventricular hypertrophy)
- thalach: maximum heart rate achieved
- exang: exercise-induced angina (1 = yes; 0 = no)
- oldpeak: ST depression induced by exercise relative to rest
- slope: the slope of the peak exercise ST segment (1 = upsloping; 2 = flat; 3 = downsloping)
- ca: number of major vessels (0-4) colored by fluoroscopy
- thal: thalassemia, an inherited blood disorder that affects the body's ability to produce hemoglobin and red blood cells (1 = normal; 2 = fixed defect; 3 = reversible defect)
- target: the predicted attribute, the diagnosis of heart disease (angiographic disease status); 0 = < 50% diameter narrowing, 1 = > 50% diameter narrowing
Data Preparation
The Data Preparation process consists of two primary activities: loading the data and data wrangling. First, we use the readr library to import the raw dataset. Then, during data wrangling, we perform cleaning, transformation, and structuring tasks to ensure the data is in an optimal and reliable state for further analysis and model development.
Load Data
First, we load the data by using the readr library to import the raw dataset.
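A minimal sketch of this step, assuming the file is a CSV named heart.csv (the actual file name is not shown in the original):

library(readr)

# Import the raw dataset; the file name "heart.csv" is an assumption
heart <- read_csv("heart.csv")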
Data Wrangling
Data Wrangling is a pivotal phase in our project, where we transform and structure the data to prepare it for analysis and modeling. We begin with the glimpse function from the dplyr library, which quickly summarizes the dataset's structure: the number of observations and variables, and the data type of each variable.
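A minimal sketch of the call that produces the summary below:

library(dplyr)

# Inspect the structure of the dataset: rows, columns, and column types
glimpse(heart)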
## Rows: 1,025
## Columns: 14
## $ age <int> 52, 53, 70, 61, 62, 58, 58, 55, 46, 54, 71, 43, 34, 51, 52, 3…
## $ sex <int> 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1…
## $ cp <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 0, 1, 2, 2…
## $ trestbps <int> 125, 140, 145, 148, 138, 100, 114, 160, 120, 122, 112, 132, 1…
## $ chol <int> 212, 203, 174, 203, 294, 248, 318, 289, 249, 286, 149, 341, 2…
## $ fbs <int> 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0…
## $ restecg <int> 1, 0, 1, 1, 1, 0, 2, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0…
## $ thalach <int> 168, 155, 125, 161, 106, 122, 140, 145, 144, 116, 125, 136, 1…
## $ exang <int> 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0…
## $ oldpeak <dbl> 1.0, 3.1, 2.6, 0.0, 1.9, 1.0, 4.4, 0.8, 0.8, 3.2, 1.6, 3.0, 0…
## $ slope <int> 2, 0, 0, 2, 1, 1, 0, 1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 2, 1…
## $ ca <int> 2, 0, 0, 1, 3, 0, 3, 1, 0, 2, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0…
## $ thal <int> 3, 3, 3, 3, 2, 2, 1, 3, 3, 2, 2, 3, 2, 3, 0, 2, 2, 3, 2, 2, 2…
## $ target <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0…
In the data examination process, we have identified several variables
that need to be transformed into factor data types. These variables
include sex, cp, fbs,
restecg, exang, slope,
ca, thal, and target. Factor
data types allow us to more efficiently manage these categorical
variables in our analysis and modeling, and also facilitate the creation
of informative visualizations.
heart <- heart %>%
  mutate(sex = as.factor(sex),
         cp = as.factor(cp),
         fbs = as.factor(fbs),
         restecg = as.factor(restecg),
         exang = as.factor(exang),
         slope = as.factor(slope),
         ca = as.factor(ca),
         thal = as.factor(thal),
         target = as.factor(target))
glimpse(heart)
## Rows: 1,025
## Columns: 14
## $ age <int> 52, 53, 70, 61, 62, 58, 58, 55, 46, 54, 71, 43, 34, 51, 52, 3…
## $ sex <fct> 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1…
## $ cp <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 0, 1, 2, 2…
## $ trestbps <int> 125, 140, 145, 148, 138, 100, 114, 160, 120, 122, 112, 132, 1…
## $ chol <int> 212, 203, 174, 203, 294, 248, 318, 289, 249, 286, 149, 341, 2…
## $ fbs <fct> 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0…
## $ restecg <fct> 1, 0, 1, 1, 1, 0, 2, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0…
## $ thalach <int> 168, 155, 125, 161, 106, 122, 140, 145, 144, 116, 125, 136, 1…
## $ exang <fct> 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0…
## $ oldpeak <dbl> 1.0, 3.1, 2.6, 0.0, 1.9, 1.0, 4.4, 0.8, 0.8, 3.2, 1.6, 3.0, 0…
## $ slope <fct> 2, 0, 0, 2, 1, 1, 0, 1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 2, 1…
## $ ca <fct> 2, 0, 0, 1, 3, 0, 3, 1, 0, 2, 0, 0, 0, 3, 0, 0, 1, 1, 0, 0, 0…
## $ thal <fct> 3, 3, 3, 3, 2, 2, 1, 3, 3, 2, 2, 3, 2, 3, 0, 2, 2, 3, 2, 2, 2…
## $ target <fct> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0…
Exploratory Data Analysis
The Exploratory Data Analysis (EDA) section of our project plays a pivotal role in unveiling crucial insights from the dataset.
Target Class Proportions
We assess the distribution of the target variable to understand the prevalence of heart disease.
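The proportions below can be produced with a call along these lines (the original chunk is not shown, so this is an assumption):

# Proportion of each target class (0 = no disease, 1 = disease)
prop.table(table(heart$target))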
##
## 0 1
## 0.4868293 0.5131707
The distribution of the target class in the dataset is balanced. With approximately 48.68% of instances falling into category 0 and around 51.32% into category 1, there is no significant imbalance between the two classes. A balanced target distribution is a favorable characteristic for our analysis and modeling, as it ensures that both outcomes are adequately represented, contributing to the robustness and reliability of the predictive models.
Missing Values
Next, we examine the dataset for missing data points to ensure data completeness and integrity.
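A sketch of checks that produce the output below (assumed, as the original chunk is not shown):

# Any missing values anywhere in the data? Then count NAs per column.
anyNA(heart)
colSums(is.na(heart))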
## [1] FALSE
## age sex cp trestbps chol fbs restecg thalach
## 0 0 0 0 0 0 0 0
## exang oldpeak slope ca thal target
## 0 0 0 0 0 0
The dataset has no missing values, signifying a high level of data completeness and integrity. This absence of gaps streamlines the analysis and modeling processes and gives us confidence in the reliability of the results.
Correlation
We delve into the relationships between variables to uncover potential patterns and associations.
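A sketch of how such a correlation plot can be drawn with ggcorr() from the GGally package (an assumption; the original chunk is not shown):

library(GGally)

# Pairwise correlations; non-numeric columns are silently dropped
ggcorr(heart, label = TRUE)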
From the correlation check, only some correlations between variables can be displayed, because the ggcorr function (from the GGally package) ignores non-numeric columns. Among the displayed correlations, there is a negative correlation of -0.3 between oldpeak (ST depression induced by exercise) and thalach (maximum heart rate achieved): as oldpeak increases, thalach tends to decrease, and vice versa. The magnitude of -0.3 indicates that this negative relationship is weak.
Cross Validation
Before proceeding with model training, the dataset is divided into two distinct sets: the training data and the test data. The training data, constituting 75% of the dataset, is utilized for model training, allowing the model to learn from the data’s patterns and relationships. The remaining 25% of the data is reserved for the test data, which serves as a benchmark to assess the model’s ability to make accurate predictions on new, unseen data.
RNGkind(sample.kind = "Rounding")
set.seed(231)
index <- sample(x = nrow(heart), size = nrow(heart)*0.75)
heart_train <- heart[index,]
heart_test <- heart[-index,]
##
## 0 1
## 0.5091146 0.4908854
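These proportions presumably tabulate the target classes of the training set; a sketch of such a call (the original chunk is not shown):

# Class balance of the training split
prop.table(table(heart_train$target))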
With these relatively balanced proportions, the class distribution in the training data remains fairly balanced: both class 0 and class 1 maintain a comparable representation.
Model Building
Model building is a critical phase of this project, where we construct predictive models to gain valuable insights into heart disease diagnosis. This section is subdivided into three core components: Scaling, Logistic Regression, and K-Nearest Neighbor.
Scaling
Scaling plays a pivotal role in ensuring the accuracy and effectiveness of the k-Nearest Neighbors (kNN) algorithm in this project. Specifically, z-score standardization is employed to prepare the dataset for kNN classification.
We use the summary() function to obtain summary statistics for each variable in the dataset, including the minimum, 1st quartile, median (2nd quartile), mean, 3rd quartile, and maximum values. This helps us understand the range of values in the dataset.
## age sex cp trestbps chol fbs restecg
## Min. :29.00 0:312 0:497 Min. : 94.0 Min. :126 0:872 0:497
## 1st Qu.:48.00 1:713 1:167 1st Qu.:120.0 1st Qu.:211 1:153 1:513
## Median :56.00 2:284 Median :130.0 Median :240 2: 15
## Mean :54.43 3: 77 Mean :131.6 Mean :246
## 3rd Qu.:61.00 3rd Qu.:140.0 3rd Qu.:275
## Max. :77.00 Max. :200.0 Max. :564
## thalach exang oldpeak slope ca thal target
## Min. : 71.0 0:680 Min. :0.000 0: 74 0:578 0: 7 0:499
## 1st Qu.:132.0 1:345 1st Qu.:0.000 1:482 1:226 1: 64 1:526
## Median :152.0 Median :0.800 2:469 2:134 2:544
## Mean :149.1 Mean :1.072 3: 69 3:410
## 3rd Qu.:166.0 3rd Qu.:1.800 4: 18
## Max. :202.0 Max. :6.200
The analysis shows that the dataset has a wide range of values for various variables, indicating that scaling is necessary to ensure that variables have a similar influence in the modeling process.
Before applying scaling, it's crucial to separate the data into two distinct parts: the predictor variables, which will be used for modeling, and the target variable, which we aim to predict. This separation ensures that the scaling process does not affect the target variable, preserving the integrity of the predictive relationship during kNN modeling.
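A minimal sketch of this step, assuming only the numeric predictors are scaled and that the test set reuses the training set's center and scale; the object names and column selection here are illustrative, not the original code:

# Z-score standardization of the numeric predictors (assumed column choice)
num_cols <- c("age", "trestbps", "chol", "thalach", "oldpeak")
heart_train_x <- scale(heart_train[, num_cols])
heart_test_x <- scale(heart_test[, num_cols],
                      center = attr(heart_train_x, "scaled:center"),
                      scale = attr(heart_train_x, "scaled:scale"))

# Keep the target variable separate so scaling never touches it
heart_train_y <- heart_train$target
heart_test_y <- heart_test$target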
Logistic Regression
Build Model
Logistic Regression using backward feature selection is a model-building process that starts with all available predictor variables and iteratively removes the least significant variables until a subset with the most influential variables remains. This technique helps simplify the model, potentially improving its performance and interpretability while retaining essential predictive factors.
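A sketch of this process using step() from base R, which is consistent with the variables retained in the summary below (the model object name model_lr is an assumption):

# Start from the full model, then drop the least significant predictors
model_lr_full <- glm(target ~ ., data = heart_train, family = "binomial")
model_lr <- step(model_lr_full, direction = "backward", trace = FALSE)
summary(model_lr)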
Model Summary
##
## Call:
## glm(formula = target ~ age + sex + cp + trestbps + chol + thalach +
## exang + oldpeak + slope + ca + thal, family = "binomial",
## data = heart_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.133489 2.486789 0.054 0.957191
## age 0.029109 0.015923 1.828 0.067534 .
## sex1 -2.220298 0.363135 -6.114 9.70e-10 ***
## cp1 1.122160 0.370357 3.030 0.002446 **
## cp2 2.027204 0.324136 6.254 4.00e-10 ***
## cp3 2.443180 0.446890 5.467 4.58e-08 ***
## trestbps -0.026347 0.007310 -3.604 0.000313 ***
## chol -0.006279 0.002576 -2.437 0.014799 *
## thalach 0.027332 0.007788 3.509 0.000449 ***
## exang1 -0.830875 0.282849 -2.938 0.003308 **
## oldpeak -0.426937 0.144101 -2.963 0.003049 **
## slope1 -0.978950 0.540180 -1.812 0.069945 .
## slope2 0.346692 0.584982 0.593 0.553412
## ca1 -2.252339 0.318190 -7.079 1.46e-12 ***
## ca2 -3.630050 0.514034 -7.062 1.64e-12 ***
## ca3 -1.967510 0.573945 -3.428 0.000608 ***
## ca4 1.574593 0.978595 1.609 0.107609
## thal1 3.026894 1.911558 1.583 0.113314
## thal2 2.141184 1.862130 1.150 0.250203
## thal3 1.131188 1.866392 0.606 0.544460
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1064.42 on 767 degrees of freedom
## Residual deviance: 465.72 on 748 degrees of freedom
## AIC: 505.72
##
## Number of Fisher Scoring iterations: 6
Model Interpretation
The model results indicate the relationship between the target variable (presence or absence of heart disease) and several predictor variables. The coefficients associated with each predictor variable represent their impact on the log-odds of the target variable.
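Because raw coefficients are on the log-odds scale, converting them to odds ratios can make them easier to read; a sketch (model_lr is the assumed model object):

# exp() turns log-odds coefficients into odds ratios
exp(coef(model_lr))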
Notable findings from the model include:
- age has a positive coefficient, suggesting that as a person's age increases, the likelihood of having heart disease also increases.
- sex1 (male) has a significant negative coefficient, implying that being male is associated with a lower likelihood of heart disease in this model.
- The cp (chest pain type) dummies are significant predictors with positive coefficients, indicating a positive association with heart disease.
- trestbps (resting blood pressure) and chol (serum cholesterol) have negative coefficients, suggesting that higher values of these variables are associated with a lower likelihood of heart disease in this model.
- exang1 (presence of exercise-induced angina) has a negative coefficient, indicating a reduced likelihood of heart disease when angina occurs during exercise.
- The ca dummies (number of major vessels colored by fluoroscopy) are significant predictors; higher values of ca are associated with a lower likelihood of heart disease.
K-Nearest Neighbor
Find Optimum k
## [1] 27.71281
The choice of k can significantly impact the model's performance. A common starting point is \(\sqrt{n}\), where \(n\) is the number of training observations; here \(\sqrt{768} \approx 27.71\), as shown above. Since the dataset contains binary target classes (0 and 1), it is advisable to opt for an odd value of k to avoid potential ties in the voting process. Therefore, we set k to 27, the nearest odd integer.
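A sketch of the kNN fit with knn() from the class package, reusing the scaled objects sketched in the Scaling section (the prediction column name is an assumption):

library(class)

# Predict each test label from its 27 nearest scaled training neighbors
heart_test$pred_label_knn <- knn(train = heart_train_x,
                                 test = heart_test_x,
                                 cl = heart_train_y,
                                 k = 27)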
Evaluation
Logistic Regression
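The predicted labels evaluated below are obtained by thresholding the model's predicted probabilities at 0.5; a sketch, assuming the model_lr object from above (confusionMatrix() comes from the caret package):

library(caret)

# Predicted probability of class 1, then a 0.5 cutoff for the label
heart_test$pred_prob_lr <- predict(model_lr, newdata = heart_test, type = "response")
heart_test$pred_label_lr <- ifelse(heart_test$pred_prob_lr > 0.5, "1", "0")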
confusionMatrix(data = as.factor(heart_test$pred_label_lr),
                reference = heart_test$target,
                positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 91 11
## 1 17 138
##
## Accuracy : 0.8911
## 95% CI : (0.8464, 0.9264)
## No Information Rate : 0.5798
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7747
##
## Mcnemar's Test P-Value : 0.3447
##
## Sensitivity : 0.9262
## Specificity : 0.8426
## Pos Pred Value : 0.8903
## Neg Pred Value : 0.8922
## Prevalence : 0.5798
## Detection Rate : 0.5370
## Detection Prevalence : 0.6031
## Balanced Accuracy : 0.8844
##
## 'Positive' Class : 1
##
K-Nearest Neighbor
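The matrix below comes from the analogous call for the kNN predictions (a sketch; pred_label_knn follows the naming assumed earlier):

confusionMatrix(data = heart_test$pred_label_knn,
                reference = heart_test$target,
                positive = "1")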
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 69 22
## 1 39 127
##
## Accuracy : 0.7626
## 95% CI : (0.7058, 0.8133)
## No Information Rate : 0.5798
## P-Value [Acc > NIR] : 6.304e-10
##
## Kappa : 0.5021
##
## Mcnemar's Test P-Value : 0.0405
##
## Sensitivity : 0.8523
## Specificity : 0.6389
## Pos Pred Value : 0.7651
## Neg Pred Value : 0.7582
## Prevalence : 0.5798
## Detection Rate : 0.4942
## Detection Prevalence : 0.6459
## Balanced Accuracy : 0.7456
##
## 'Positive' Class : 1
##
Comparison
| Metric | Logistic Regression | K-Nearest Neighbor |
|---|---|---|
| Accuracy | 89.11% | 76.26% |
| Sensitivity | 92.62% | 85.23% |
| Specificity | 84.26% | 63.89% |
In this classification project, two models, Logistic Regression and K-Nearest Neighbor (kNN), were evaluated based on several performance metrics. The project’s primary objective is to minimize False Negatives (FN). Given this priority, the recall metric, also known as Sensitivity, becomes crucial. Based on this analysis, the Logistic Regression model exhibits higher sensitivity (recall) compared to the K-Nearest Neighbor model. This suggests that, for this specific task, Logistic Regression may be the more suitable choice due to its superior ability to correctly identify positive cases and minimize false negatives.
Conclusion
A thorough examination of the comparison table reveals that the Logistic Regression model achieved higher scores than the K-Nearest Neighbor (K-NN) model across the crucial metrics: Recall (Sensitivity), Accuracy, Specificity, and Precision (positive predictive value). Which metric matters most depends on the specific objective at hand; in this context, we place significant emphasis on Recall. This is particularly important because we aim to minimize instances where the model incorrectly predicts patients as healthy (target 0) when they in fact have heart disease. Consequently, the Logistic Regression model emerges as the preferred choice, exhibiting superior ability to identify individuals with heart disease: its Recall of 92.62% surpasses the performance of the K-NN model.