Heart disease remains a leading cause of mortality worldwide. Predictive modeling can assist in early detection and intervention. In this analysis, we aim to build a predictive model to classify the presence of heart disease based on patient health metrics.
The dataset includes 303 observations with 14 variables:
Age: Age in years Sex: Sex (1 = male; 0 = female) Chest Pain Type: 1: Typical angina 2: Atypical angina 3: Non-anginal pain 4: Asymptomatic Resting Blood Pressure: Resting blood pressure (in mm Hg) Serum Cholesterol: Serum cholesterol in mg/dl Fasting Blood Sugar: (1 = true; 0 = false) Resting ECG: 0: Normal 1: ST-T wave abnormality 2: Left ventricular hypertrophy Max Heart Rate Achieved Exercise Induced Angina: (1 = yes; 0 = no) ST Depression: Induced by exercise relative to rest Slope of Peak Exercise ST Segment: 1: Upsloping 2: Flat 3: Downsloping Number of Major Vessels: Colored by fluoroscopy (0–3) Thalassemia: 3: Normal 6: Fixed defect 7: Reversible defect Target: Diagnosis of heart disease (0 = no disease; 1 = disease)
# Load necessary libraries
library(tidyverse)
library(caret)
library(e1071)
# Load the dataset directly from the UCI repository
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
column_names <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
"thalach", "exang", "oldpeak", "slope", "ca", "thal", "target")
heart_data <- read.csv(url, header = FALSE, col.names = column_names, na.strings = "?")
# View the first few rows of the dataset
head(heart_data)
NA
NA
NA
# Convert relevant columns to factors
heart_data <- heart_data %>%
mutate(sex = factor(sex, levels = c(0, 1), labels = c("Female", "Male")),
cp = factor(cp, levels = 1:4, labels = c("Typical Angina", "Atypical Angina", "Non-anginal Pain", "Asymptomatic")),
fbs = factor(fbs, levels = c(0, 1), labels = c("<= 120 mg/dl", "> 120 mg/dl")),
restecg = factor(restecg, levels = 0:2, labels = c("Normal", "ST-T Abnormality", "LV Hypertrophy")),
exang = factor(exang, levels = c(0, 1), labels = c("No", "Yes")),
slope = factor(slope, levels = 1:3, labels = c("Upsloping", "Flat", "Downsloping")),
thal = factor(thal, levels = c(3, 6, 7), labels = c("Normal", "Fixed Defect", "Reversible Defect")),
target = factor(ifelse(target == 0, "No Disease", "Disease")))
# Handle missing values by removing rows with NA
heart_data <- na.omit(heart_data)
# Split the data into training and testing sets
set.seed(123)
train_index <- createDataPartition(heart_data$target, p = 0.7, list = FALSE)
train_data <- heart_data[train_index, ]
test_data <- heart_data[-train_index, ]
# Summary statistics
summary(train_data)
age sex cp trestbps chol
Min. :29.00 Female: 70 Typical Angina :17 Min. : 94.0 Min. :126.0
1st Qu.:47.00 Male :138 Atypical Angina :33 1st Qu.:120.0 1st Qu.:211.8
Median :56.00 Non-anginal Pain:60 Median :130.0 Median :243.0
Mean :54.65 Asymptomatic :98 Mean :131.3 Mean :247.5
3rd Qu.:61.00 3rd Qu.:140.0 3rd Qu.:271.5
Max. :77.00 Max. :180.0 Max. :564.0
fbs restecg thalach exang oldpeak
<= 120 mg/dl:180 Normal : 98 Min. : 88.0 No :136 Min. :0.000
> 120 mg/dl : 28 ST-T Abnormality: 3 1st Qu.:137.8 Yes: 72 1st Qu.:0.000
LV Hypertrophy :107 Median :152.5 Median :0.800
Mean :149.4 Mean :1.044
3rd Qu.:164.0 3rd Qu.:1.600
Max. :202.0 Max. :6.200
slope ca thal target
Upsloping :98 Min. :0.0000 Normal :113 Disease : 96
Flat :97 1st Qu.:0.0000 Fixed Defect : 14 No Disease:112
Downsloping:13 Median :0.0000 Reversible Defect: 81
Mean :0.6442
3rd Qu.:1.0000
Max. :3.0000
# Visualize the distribution of age
ggplot(train_data, aes(x = age, fill = target)) +
geom_histogram(binwidth = 5, position = "dodge") +
labs(title = "Age Distribution by Heart Disease Status", x = "Age", y = "Count")
# Visualize the chest pain type distribution
ggplot(train_data, aes(x = cp, fill = target)) +
geom_bar(position = "dodge") +
labs(title = "Chest Pain Type Distribution by Heart Disease Status", x = "Chest Pain Type", y = "Count")
# Train a logistic regression model
model <- train(target ~ ., data = train_data, method = "glm", family = "binomial")
Warning: prediction from a rank-deficient fit may be misleadingWarning: prediction from a rank-deficient fit may be misleading
# View the model summary
summary(model)
Call:
NULL
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5843 -0.3051 0.1150 0.4574 2.9384
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.591027 3.621247 1.268 0.20487
age 0.034948 0.029348 1.191 0.23372
sexMale -2.068782 0.672680 -3.075 0.00210 **
`cpAtypical Angina` -1.021412 0.939300 -1.087 0.27685
`cpNon-anginal Pain` -0.449945 0.810843 -0.555 0.57896
cpAsymptomatic -2.319464 0.811207 -2.859 0.00425 **
trestbps -0.021835 0.013691 -1.595 0.11076
chol -0.003790 0.004928 -0.769 0.44186
`fbs> 120 mg/dl` 1.142714 0.838517 1.363 0.17295
`restecgST-T Abnormality` -1.096034 2.638086 -0.415 0.67780
`restecgLV Hypertrophy` -0.446918 0.482763 -0.926 0.35458
thalach 0.018864 0.014409 1.309 0.19045
exangYes -0.590307 0.542483 -1.088 0.27653
oldpeak -0.396839 0.293033 -1.354 0.17566
slopeFlat -1.249711 0.617806 -2.023 0.04309 *
slopeDownsloping -1.054895 1.236906 -0.853 0.39374
ca -1.523549 0.348264 -4.375 1.22e-05 ***
`thalFixed Defect` 0.830804 0.937151 0.887 0.37534
`thalReversible Defect` -1.048207 0.550608 -1.904 0.05695 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 287.12 on 207 degrees of freedom
Residual deviance: 130.64 on 189 degrees of freedom
AIC: 168.64
Number of Fisher Scoring iterations: 6
# Predict on the test set
predictions <- predict(model, newdata = test_data)
# Confusion matrix
conf_matrix <- confusionMatrix(predictions, test_data$target)
print(conf_matrix)
Confusion Matrix and Statistics
Reference
Prediction Disease No Disease
Disease 33 6
No Disease 8 42
Accuracy : 0.8427
95% CI : (0.7502, 0.9112)
No Information Rate : 0.5393
P-Value [Acc > NIR] : 1.452e-09
Kappa : 0.6823
Mcnemar's Test P-Value : 0.7893
Sensitivity : 0.8049
Specificity : 0.8750
Pos Pred Value : 0.8462
Neg Pred Value : 0.8400
Prevalence : 0.4607
Detection Rate : 0.3708
Detection Prevalence : 0.4382
Balanced Accuracy : 0.8399
'Positive' Class : Disease
# ROC curve and AUC
probabilities <- predict(model, newdata = test_data, type = "prob")[,2]
roc_curve <- roc(test_data$target, probabilities)
Setting levels: control = Disease, case = No Disease
Setting direction: controls < cases
plot(roc_curve, main = "ROC Curve")
auc(roc_curve)
Area under the curve: 0.9101
Significant Predictors: * Sex (Male): Strongly associated with heart disease. * Chest Pain (Asymptomatic): Indicates a higher likelihood of heart disease. * Number of Major Vessels (ca): Higher values indicate a lower likelihood of disease. Model Metrics: Accuracy: 84.27% Sensitivity: 80.49% Specificity: 87.50% AUC: 0.9101, indicating strong model performance.
The logistic regression model demonstrated strong predictive performance with an accuracy of 84.27% and an AUC of 0.9101. Significant predictors like sex, chest pain type, and number of major vessels provide actionable insights for heart disease diagnosis.