Heart disease remains a leading cause of mortality worldwide. Early and accurate prediction of heart disease is crucial for timely intervention and improved patient outcomes. This report details a machine learning approach to predict the presence of heart disease using a comprehensive dataset. We will employ Logistic Regression, a powerful and interpretable statistical model, to build our predictive system.
The primary goal of this analysis is to develop a model that can predict heart disease with high accuracy, specifically aiming to meet or exceed certain performance benchmarks.
The dataset used in this analysis contains various medical and demographic attributes of patients, along with an indicator of heart disease presence. Each row represents a patient, and columns represent different features.
age: Age of the patient in years.
(Numerical - Continuous)sex: Biological sex of the patient.
(Categorical - Binary: Male, Female)cp (Chest Pain Type): The type of
chest pain experienced by the patient. (Categorical - Nominal: typical
angina, atypical angina, non-anginal pain, asymptomatic)trestbps (Resting Blood Pressure):
Resting blood pressure (in mm Hg) measured on admission to the hospital.
(Numerical - Continuous)chol (Cholesterol): Serum cholesterol
level in mg/dl. (Numerical - Continuous)fbs (Fasting Blood Sugar): Fasting
blood sugar level > 120 mg/dl. (Categorical - Binary: 1 if True, 0 if
False)restecg (Resting Electrocardiographic
Results): Results of the resting electrocardiogram.
(Categorical - Nominal: lv hypertrophy, normal, st-t abnormality)thalch (Maximum Heart Rate Achieved):
The maximum heart rate achieved during an exercise stress test.
(Numerical - Continuous)exang (Exercise Induced Angina):
Presence of exercise-induced angina. (Categorical - Binary: 1 if True, 0
if False)oldpeak (ST Depression Induced by Exercise
Relative to Rest): The depression of the ST segment induced by
exercise relative to resting state. (Numerical - Continuous)slope (Slope of the Peak Exercise ST
Segment): The slope of the peak exercise ST segment.
(Categorical - Nominal: downsloping, flat, upsloping)ca (Number of Major Vessels Colored by
Fluoroscopy): The number of major vessels (0-3) colored by
fluoroscopy. (Categorical - Ordinal: 0, 1, 2, 3)thal (Thallium Stress Test Result):
Result of the thallium stress test. (Categorical - Nominal: fixed
defect, normal, reversable defect)num: Original target variable
indicating heart disease presence (0 = no disease, >0 = disease).
This will be transformed into target.id: Patient identifier. (Removed
during preprocessing)dataset: Source of the dataset.
(Removed during preprocessing as it has no variance)This section outlines the steps taken to load the dataset and prepare it for modeling. This includes handling missing values, converting data types, and feature engineering to ensure the data is in a suitable format for Logistic Regression.
file_path <- "/Users/sahar/Downloads/heart.csv"
df <- read.csv(file_path, na.strings = "?")
df_pre_removal <- df
# Initial inspection (outputs to console)
cat("--- Initial Data Inspection ---\n")
## --- Initial Data Inspection ---
print(head(df, 10))
## id age sex dataset cp trestbps chol fbs restecg
## 1 1 63 Male Cleveland typical angina 145 233 TRUE lv hypertrophy
## 2 2 67 Male Cleveland asymptomatic 160 286 FALSE lv hypertrophy
## 3 3 67 Male Cleveland asymptomatic 120 229 FALSE lv hypertrophy
## 4 4 37 Male Cleveland non-anginal 130 250 FALSE normal
## 5 5 41 Female Cleveland atypical angina 130 204 FALSE lv hypertrophy
## 6 6 56 Male Cleveland atypical angina 120 236 FALSE normal
## 7 7 62 Female Cleveland asymptomatic 140 268 FALSE lv hypertrophy
## 8 8 57 Female Cleveland asymptomatic 120 354 FALSE normal
## 9 9 63 Male Cleveland asymptomatic 130 254 FALSE lv hypertrophy
## 10 10 53 Male Cleveland asymptomatic 140 203 TRUE lv hypertrophy
## thalch exang oldpeak slope ca thal num
## 1 150 FALSE 2.3 downsloping 0 fixed defect 0
## 2 108 TRUE 1.5 flat 3 normal 2
## 3 129 TRUE 2.6 flat 2 reversable defect 1
## 4 187 FALSE 3.5 downsloping 0 normal 0
## 5 172 FALSE 1.4 upsloping 0 normal 0
## 6 178 FALSE 0.8 upsloping 0 normal 0
## 7 160 FALSE 3.6 downsloping 2 normal 3
## 8 163 TRUE 0.6 upsloping 0 normal 0
## 9 147 FALSE 1.4 flat 1 reversable defect 2
## 10 155 TRUE 3.1 downsloping 0 reversable defect 1
print(str(df))
## 'data.frame': 920 obs. of 16 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ age : int 63 67 67 37 41 56 62 57 63 53 ...
## $ sex : chr "Male" "Male" "Male" "Male" ...
## $ dataset : chr "Cleveland" "Cleveland" "Cleveland" "Cleveland" ...
## $ cp : chr "typical angina" "asymptomatic" "asymptomatic" "non-anginal" ...
## $ trestbps: int 145 160 120 130 130 120 140 120 130 140 ...
## $ chol : int 233 286 229 250 204 236 268 354 254 203 ...
## $ fbs : logi TRUE FALSE FALSE FALSE FALSE FALSE ...
## $ restecg : chr "lv hypertrophy" "lv hypertrophy" "lv hypertrophy" "normal" ...
## $ thalch : int 150 108 129 187 172 178 160 163 147 155 ...
## $ exang : logi FALSE TRUE TRUE FALSE FALSE FALSE ...
## $ oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
## $ slope : chr "downsloping" "flat" "flat" "downsloping" ...
## $ ca : int 0 3 2 0 0 0 2 0 1 0 ...
## $ thal : chr "fixed defect" "normal" "reversable defect" "normal" ...
## $ num : int 0 2 1 0 0 0 3 0 2 1 ...
## NULL
print(summary(df))
## id age sex dataset
## Min. : 1.0 Min. :28.00 Length:920 Length:920
## 1st Qu.:230.8 1st Qu.:47.00 Class :character Class :character
## Median :460.5 Median :54.00 Mode :character Mode :character
## Mean :460.5 Mean :53.51
## 3rd Qu.:690.2 3rd Qu.:60.00
## Max. :920.0 Max. :77.00
##
## cp trestbps chol fbs
## Length:920 Min. : 0.0 Min. : 0.0 Mode :logical
## Class :character 1st Qu.:120.0 1st Qu.:175.0 FALSE:692
## Mode :character Median :130.0 Median :223.0 TRUE :138
## Mean :132.1 Mean :199.1 NA's :90
## 3rd Qu.:140.0 3rd Qu.:268.0
## Max. :200.0 Max. :603.0
## NA's :59 NA's :30
## restecg thalch exang oldpeak
## Length:920 Min. : 60.0 Mode :logical Min. :-2.6000
## Class :character 1st Qu.:120.0 FALSE:528 1st Qu.: 0.0000
## Mode :character Median :140.0 TRUE :337 Median : 0.5000
## Mean :137.5 NA's :55 Mean : 0.8788
## 3rd Qu.:157.0 3rd Qu.: 1.5000
## Max. :202.0 Max. : 6.2000
## NA's :55 NA's :62
## slope ca thal num
## Length:920 Min. :0.0000 Length:920 Min. :0.0000
## Class :character 1st Qu.:0.0000 Class :character 1st Qu.:0.0000
## Mode :character Median :0.0000 Mode :character Median :1.0000
## Mean :0.6764 Mean :0.9957
## 3rd Qu.:1.0000 3rd Qu.:2.0000
## Max. :3.0000 Max. :4.0000
## NA's :611
cat("\nInitial NA counts after loading with na.strings:\n")
##
## Initial NA counts after loading with na.strings:
print(colSums(is.na(df)))
## id age sex dataset cp trestbps chol fbs
## 0 0 0 0 0 59 30 90
## restecg thalch exang oldpeak slope ca thal num
## 0 55 55 62 0 611 0 0
# Drop 'id' and 'dataset'
df <- df %>%
select(-id, -dataset)
# Create 'target' variable from 'num' (0 for no disease, 1 for disease)
df <- df %>%
mutate(target = ifelse(num > 0, 1, 0)) %>%
select(-num) # Remove the original 'num' column
# Ensure target factor levels are valid R variable names for caret and other functions
df$target <- factor(df$target, levels = c(0, 1), labels = c("NoDisease", "Disease"))
# Convert logical 'fbs' and 'exang' to integer (0/1). 'NA's are preserved initially.
df <- df %>%
mutate(fbs = as.integer(fbs),
exang = as.integer(exang))
# Convert relevant character columns to factors, explicitly handling empty strings
categorical_cols_to_factor <- c('sex', 'cp', 'restecg', 'slope', 'thal')
for (col_name in categorical_cols_to_factor) {
df[[col_name]][df[[col_name]] == ""] <- NA # Convert empty strings to NA
df[[col_name]] <- as.factor(df[[col_name]]) # Convert to factor
}
# Impute numerical missing values with the median
numerical_cols_for_imputation <- c('age', 'trestbps', 'chol', 'thalch', 'oldpeak', 'ca', 'fbs', 'exang')
for (col_name in numerical_cols_for_imputation) {
if (any(is.na(df[[col_name]]))) {
df[[col_name]][is.na(df[[col_name]])] <- median(df[[col_name]], na.rm = TRUE)
}
}
# Impute categorical missing values with the mode
get_mode <- function(v) {
uniqv <- unique(v)
if (length(uniqv) == 0 || all(is.na(uniqv))) return(NA)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
for (col_name in categorical_cols_to_factor) {
if (any(is.na(df[[col_name]]))) {
mode_val <- get_mode(na.omit(df[[col_name]]))
if (!is.na(mode_val)) {
df[[col_name]][is.na(df[[col_name]])] <- mode_val
}
}
}
# Final data type adjustments
df <- df %>%
mutate(ca = as.integer(ca))
# Check structure again after preprocessing
cat("\n--- Data Structure After Preprocessing ---\n")
##
## --- Data Structure After Preprocessing ---
print(str(df))
## 'data.frame': 920 obs. of 14 variables:
## $ age : int 63 67 67 37 41 56 62 57 63 53 ...
## $ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 2 1 1 2 2 ...
## $ cp : Factor w/ 4 levels "asymptomatic",..: 4 1 1 3 2 2 1 1 1 1 ...
## $ trestbps: int 145 160 120 130 130 120 140 120 130 140 ...
## $ chol : num 233 286 229 250 204 236 268 354 254 203 ...
## $ fbs : num 1 0 0 0 0 0 0 0 0 1 ...
## $ restecg : Factor w/ 3 levels "lv hypertrophy",..: 1 1 1 2 1 2 1 2 1 1 ...
## $ thalch : int 150 108 129 187 172 178 160 163 147 155 ...
## $ exang : int 0 1 1 0 0 0 0 1 0 1 ...
## $ oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
## $ slope : Factor w/ 3 levels "downsloping",..: 1 2 2 1 3 3 1 3 2 1 ...
## $ ca : int 0 3 2 0 0 0 2 0 1 0 ...
## $ thal : Factor w/ 3 levels "fixed defect",..: 1 2 3 2 2 2 2 2 3 3 ...
## $ target : Factor w/ 2 levels "NoDisease","Disease": 1 2 2 1 1 1 2 1 2 2 ...
## NULL
cat("\nNA counts after preprocessing:\n")
##
## NA counts after preprocessing:
print(colSums(is.na(df)))
## age sex cp trestbps chol fbs restecg thalch
## 0 0 0 0 0 0 0 0
## exang oldpeak slope ca thal target
## 0 0 0 0 0 0
# One-hot encode categorical columns for modeling
categorical_cols_to_encode <- c('sex', 'cp', 'restecg', 'slope', 'thal')
df_encoded <- dummy_cols(df,
select_columns = categorical_cols_to_encode,
remove_first_dummy = TRUE,
remove_selected_columns = TRUE)
# Ensure target column name is consistent
df_encoded <- df_encoded %>%
rename(target = target)
# Separate features (X) and target (y)
X <- df_encoded %>% select(-target)
y <- df_encoded$target
# Split data into training and testing sets
set.seed(42)
train_index <- createDataPartition(y, p = 0.80, list = FALSE, times = 1)
X_train <- X[train_index, ]
X_test <- X[-train_index, ]
y_train <- y[train_index]
y_test <- y[-train_index]
# Scale numerical features (center and scale)
numerical_features_for_scaling <- X_train %>%
select(where(is.numeric)) %>%
names()
preprocessor <- preProcess(X_train[, numerical_features_for_scaling], method = c("center", "scale"))
X_train_scaled <- predict(preprocessor, X_train[, numerical_features_for_scaling])
X_test_scaled <- predict(preprocessor, X_test[, numerical_features_for_scaling])
# Combine scaled numerical features with one-hot encoded categorical features
X_train_final <- cbind(X_train_scaled, X_train %>% select(-all_of(numerical_features_for_scaling)))
X_test_final <- cbind(X_test_scaled, X_test %>% select(-all_of(numerical_features_for_scaling)))
# Ensure column names match after scaling and combining
colnames(X_train_final) <- make.names(colnames(X_train_final), unique = TRUE)
colnames(X_test_final) <- make.names(colnames(X_test_final), unique = TRUE)
cat("\n--- Training and Test Set Dimensions ---\n")
##
## --- Training and Test Set Dimensions ---
cat("Dimensions of X_train_final:", dim(X_train_final), "\n")
## Dimensions of X_train_final: 737 18
cat("Length of y_train:", length(y_train), "\n")
## Length of y_train: 737
cat("Dimensions of X_test_final:", dim(X_test_final), "\n")
## Dimensions of X_test_final: 183 18
cat("Length of y_test:", length(y_test), "\n")
## Length of y_test: 183
Exploratory Data Analysis (EDA) is a critical step to understand the underlying patterns, distributions, and relationships within the dataset. This section provides visual and statistical summaries of the data features.
Initial data inspection revealed several features with missing values (represented by ‘?’). Table @ref(tab:missing-values-table) summarizes these counts, and Figure @ref(fig:missing-values-hist) provides a visual representation. These missing values were subsequently imputed during the preprocessing phase.
| Feature | Missing_Count |
|---|---|
| ca | 611 |
| slope | 309 |
| fbs | 90 |
| oldpeak | 62 |
| trestbps | 59 |
| exang | 55 |
| thalch | 55 |
| chol | 30 |
# missing_data_summary defined above
ggplot(missing_data_summary, aes(x = reorder(Feature, -Missing_Count), y = Missing_Count, fill = Missing_Count)) +
geom_bar(stat = "identity") +
geom_text(aes(label = Missing_Count), vjust = -0.5, size = 3.5) +
labs(title = "Initial Missing Value Counts by Feature",
x = "Feature",
y = "Number of Missing Values") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_gradient(low = "lightblue", high = "darkred")
Figure 4.1: Initial Missing Value Counts by Feature
The ‘dataset’ column was removed during preprocessing because it contained only a single unique value (‘Cleveland’), providing no variability or predictive power for the model. Figure @ref(fig:dataset-distribution) illustrates this lack of variance.
# Plotting the 'dataset' column using the df_pre_removal (before dropping this column)
dataset_plot <- ggplot(df_pre_removal, aes(x = dataset)) +
geom_bar(fill = "steelblue", color = "black") +
labs(title = "Distribution of 'dataset' Column",
x = "Dataset Source",
y = "Count") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 12),
axis.title = element_text(size = 10),
axis.text = element_text(size = 8))
print(dataset_plot)
Figure 4.2: Distribution of Removed Feature (‘dataset’)
This section presents the summary statistics and individual histograms for each numerical feature, providing insights into their distributions after preprocessing.
Table @ref(tab:numerical-summary) provides a comprehensive overview of the central tendency, dispersion, and range for the numerical features after preprocessing and imputation.
| age | trestbps | chol | thalch | oldpeak | ca | fbs | exang | |
|---|---|---|---|---|---|---|---|---|
| Min. :28.00 | Min. : 0 | Min. : 0.0 | Min. : 60.0 | Min. :-2.6000 | Min. :0.0000 | Min. :0.00 | Min. :0.0000 | |
| 1st Qu.:47.00 | 1st Qu.:120 | 1st Qu.:177.8 | 1st Qu.:120.0 | 1st Qu.: 0.0000 | 1st Qu.:0.0000 | 1st Qu.:0.00 | 1st Qu.:0.0000 | |
| Median :54.00 | Median :130 | Median :223.0 | Median :140.0 | Median : 0.5000 | Median :0.0000 | Median :0.00 | Median :0.0000 | |
| Mean :53.51 | Mean :132 | Mean :199.9 | Mean :137.7 | Mean : 0.8533 | Mean :0.2272 | Mean :0.15 | Mean :0.3663 | |
| 3rd Qu.:60.00 | 3rd Qu.:140 | 3rd Qu.:267.0 | 3rd Qu.:156.0 | 3rd Qu.: 1.5000 | 3rd Qu.:0.0000 | 3rd Qu.:0.00 | 3rd Qu.:1.0000 | |
| Max. :77.00 | Max. :200 | Max. :603.0 | Max. :202.0 | Max. : 6.2000 | Max. :3.0000 | Max. :1.00 | Max. :1.0000 |
Figure @ref(fig:numerical-boxplots) illustrates the relationship
between each numerical feature and the presence of heart disease
(target variable) using boxplots. This helps identify
features that show different distributions based on the outcome,
suggesting their predictive power.
# Use the numerical_cols_for_imputation defined in preprocessing
boxplot_plots_list <- lapply(numerical_cols_for_imputation, function(col_name) {
ggplot(df, aes_string(x = "target", y = col_name, fill = "target")) + # Use 'target' directly as it's a factor
geom_boxplot(color = "black", alpha = 0.8) +
labs(title = paste(col_name, " by Heart Disease Presence"),
x = "Heart Disease (NoDisease vs. Disease)",
y = col_name,
fill = "Target") +
theme_minimal() +
scale_fill_manual(values = c("NoDisease" = "lightblue", "Disease" = "salmon")) + # Use new factor labels
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5, size = 10),
axis.title = element_text(size = 8),
axis.text = element_text(size = 7),
panel.grid.major = element_line(linetype = "dotted", color = "lightgray"),
panel.grid.minor = element_blank())
})
combined_boxplot_figure <- wrap_plots(boxplot_plots_list, ncol = 4, nrow = 2)
print(combined_boxplot_figure)
Figure 4.4: Boxplots of Numerical Features by Heart Disease Presence
Figure @ref(fig:categorical-proportions) provides stacked bar charts
showing the distribution of each categorical feature, broken down by the
presence or absence of heart disease (target). This helps
in understanding the relationship between each categorical variable and
the outcome.
categorical_cols_for_plot <- c('sex', 'cp', 'restecg', 'slope', 'thal', 'fbs', 'exang')
categorical_plots_list <- lapply(categorical_cols_for_plot, function(col_name) {
ggplot(df, aes_string(x = col_name, fill = "target")) + # Use 'target' directly
geom_bar(position = "fill", color = "black", alpha = 0.8) + # 'fill' makes it proportional
labs(title = paste("Proportion of Target by", col_name),
x = col_name,
y = "Proportion",
fill = "Heart Disease") + # Label is now "Heart Disease"
theme_minimal() +
scale_fill_manual(values = c("NoDisease" = "darkgreen", "Disease" = "firebrick")) + # Use new factor labels
theme(plot.title = element_text(hjust = 0.5, size = 10),
axis.text.x = element_text(angle = 45, hjust = 1), # Rotate labels for readability
axis.title = element_text(size = 8),
axis.text = element_text(size = 7),
legend.position = "bottom")
})
combined_categorical_figure <- wrap_plots(categorical_plots_list, ncol = 3, nrow = 3)
print(combined_categorical_figure)
Figure 4.5: Proportional Distribution of Categorical Features by Heart Disease Presence
Figure @ref(fig:correlation-matrix) presents a heatmap of the Pearson correlation coefficients between the numerical features. This helps identify multicollinearity (highly correlated predictors) that might affect model stability and interpretability.
numerical_data_for_corr <- df %>% select(all_of(numerical_cols_for_imputation))
# Calculate the correlation matrix
correlation_matrix <- cor(numerical_data_for_corr)
# Create the correlation heatmap
corrplot(correlation_matrix,
method = "circle",
type = "upper",
order = "hclust",
tl.col = "black",
tl.srt = 45,
addCoef.col = "black",
number.cex = 0.7,
col = colorRampPalette(c("blue", "white", "red"))(200),
title = "Correlation Matrix of Numerical Features",
mar = c(0,0,1,0))
Figure 4.6: Correlation Matrix of Numerical Features
Logistic Regression is a statistical model used for binary classification. It models the probability of a binary outcome using a logistic function.
The Logistic Regression model was trained using the glm
function in R, with family = "binomial" for binary
classification. The model learns the relationship between the features
and the likelihood of heart disease.
# Train the Logistic Regression model
model_lr <- glm(y_train ~ ., data = X_train_final, family = "binomial")
# Display model summary
summary(model_lr)
##
## Call:
## glm(formula = y_train ~ ., family = "binomial", data = X_train_final)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.37366 0.10879 3.435 0.000593 ***
## age 0.24007 0.12126 1.980 0.047721 *
## trestbps 0.08414 0.11093 0.759 0.448143
## chol -0.50986 0.11654 -4.375 1.21e-05 ***
## fbs 0.15596 0.10693 1.459 0.144684
## thalch -0.25559 0.12659 -2.019 0.043492 *
## exang 0.49546 0.12068 4.106 4.03e-05 ***
## oldpeak 0.48643 0.12879 3.777 0.000159 ***
## ca 0.41631 0.13813 3.014 0.002580 **
## sex_Male 0.54521 0.11327 4.814 1.48e-06 ***
## cp_atypical.angina -0.76009 0.12157 -6.252 4.05e-10 ***
## cp_non.anginal -0.57514 0.10465 -5.496 3.89e-08 ***
## cp_typical.angina -0.29670 0.09252 -3.207 0.001341 **
## restecg_normal 0.06477 0.13993 0.463 0.643448
## restecg_st.t.abnormality 0.12800 0.13860 0.924 0.355744
## slope_flat 0.36242 0.20988 1.727 0.084205 .
## slope_upsloping 0.03182 0.20795 0.153 0.878383
## thal_normal -0.24088 0.20260 -1.189 0.234467
## thal_reversable.defect 0.12732 0.21079 0.604 0.545828
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1013.21 on 736 degrees of freedom
## Residual deviance: 602.53 on 718 degrees of freedom
## AIC: 640.53
##
## Number of Fisher Scoring iterations: 5
The trained model’s performance was evaluated on the unseen test set using various metrics.
The overall accuracy and the detailed confusion matrix for the model’s predictions are presented below.
# predictions on the test set
predictions_prob <- predict(model_lr, newdata = X_test_final, type = "response")
predictions_class <- factor(ifelse(predictions_prob > 0.5, "Disease", "NoDisease"), levels = c("NoDisease", "Disease"))
# Generate confusion matrix
conf_matrix <- confusionMatrix(predictions_class, y_test, positive = "Disease")
# overall accuracy
cat("#### Overall Accuracy:\n")
cat(paste("The overall accuracy of the model is **", round(conf_matrix$overall['Accuracy'], 4) * 100, "%**.\n\n", sep = ""))
The overall accuracy of the model is 83.61%.
cat("This indicates that the model correctly predicted the presence or absence of heart disease for approximately ", round(conf_matrix$overall['Accuracy'], 2) * 100, "% of the patients in the test set.\n\n")
This indicates that the model correctly predicted the presence or absence of heart disease for approximately 84 % of the patients in the test set.
# confusion matrix
cat("#### Confusion Matrix:\n")
cat("The confusion matrix provides a detailed breakdown of correct and incorrect predictions:\n\n")
The confusion matrix provides a detailed breakdown of correct and incorrect predictions:
print(conf_matrix$table) # Print the raw table
Reference
Prediction NoDisease Disease NoDisease 68 16 Disease 14 85
# Convert confusion matrix to a markdown table for better display
conf_matrix_table <- data.frame(conf_matrix$table) %>%
pivot_wider(names_from = Prediction, values_from = Freq) %>%
rename(Actual = Reference)
# Rename columns and rows for clarity in the markdown table
names(conf_matrix_table) <- c("Actual", "Predicted No Disease (0)", "Predicted Disease (1)")
# Add descriptive labels for the cells
conf_matrix_annotated <- conf_matrix_table %>%
mutate(`Predicted No Disease (0)` = paste0(`Predicted No Disease (0)`, " (True Negative)"),
`Predicted Disease (1)` = paste0(`Predicted Disease (1)`, " (False Positive)"))
# Manually adjust for Actual 1 row
conf_matrix_annotated[2, 2] <- paste0(conf_matrix_table[2, 2], " (False Negative)")
conf_matrix_annotated[2, 3] <- paste0(conf_matrix_table[2, 3], " (True Positive)")
cat("\nTable 5.1: Confusion Matrix\n")
Table 5.1: Confusion Matrix
knitr::kable(conf_matrix_annotated)
| Actual | Predicted No Disease (0) | Predicted Disease (1) |
|---|---|---|
| NoDisease | 68 (True Negative) | 14 (False Positive) |
| Disease | 16 (False Negative) | 85 (True Positive) |
cat("\n* **True Negatives (TN):**", conf_matrix$table[1,1], "- Model correctly predicted no disease.\n")
cat("* **False Positives (FP):**", conf_matrix$table[1,2], "- Model incorrectly predicted disease (Type I error).\n")
cat("* **False Negatives (FN):**", conf_matrix$table[2,1], "- Model incorrectly predicted no disease (Type II error).\n")
cat("* **True Positives (TP):**", conf_matrix$table[2,2], "- Model correctly predicted disease.\n")
cat("\n#### Classification Metrics:\n")
cat("Derived from the confusion matrix, these metrics provide further insights into model performance:\n\n")
Derived from the confusion matrix, these metrics provide further insights into model performance:
metrics <- data.frame(
Metric = c("Sensitivity (Recall for Positive Class 'Disease')", "Specificity (Recall for Negative Class 'NoDisease')",
"Positive Predictive Value (Precision for Positive Class 'Disease')",
"Negative Predictive Value (Precision for Negative Class 'NoDisease')",
"F1-Score (for Positive Class 'Disease')", "Kappa"),
Value = c(round(conf_matrix$byClass['Sensitivity'], 4),
round(conf_matrix$byClass['Specificity'], 4),
round(conf_matrix$byClass['Pos Pred Value'], 4),
round(conf_matrix$byClass['Neg Pred Value'], 4),
round(conf_matrix$byClass['F1'], 4),
round(conf_matrix$overall['Kappa'], 4))
)
knitr::kable(metrics, caption = "Table 5.2: Key Classification Metrics")
| Metric | Value | |
|---|---|---|
| Sensitivity | Sensitivity (Recall for Positive Class ‘Disease’) | 0.8416 |
| Specificity | Specificity (Recall for Negative Class ‘NoDisease’) | 0.8293 |
| Pos Pred Value | Positive Predictive Value (Precision for Positive Class ‘Disease’) | 0.8586 |
| Neg Pred Value | Negative Predictive Value (Precision for Negative Class ‘NoDisease’) | 0.8095 |
| F1 | F1-Score (for Positive Class ‘Disease’) | 0.8500 |
| Kappa | Kappa | 0.6693 |
cat("\n* **Sensitivity (Recall for Positive Class 'Disease'):**", round(conf_matrix$byClass['Sensitivity'], 4) * 100, "% - Out of all actual heart disease cases, the model correctly identified ", round(conf_matrix$byClass['Sensitivity'], 4) * 100, "%. This indicates the model's ability to detect positive cases.\n")
cat("* **Specificity (Recall for Negative Class 'NoDisease'):**", round(conf_matrix$byClass['Specificity'], 4) * 100, "% - Out of all actual non-disease cases, the model correctly identified ", round(conf_matrix$byClass['Specificity'], 4) * 100, "%.\n")
cat("* **Positive Predictive Value (Precision for Positive Class 'Disease'):**", round(conf_matrix$byClass['Pos Pred Value'], 4) * 100, "% - Out of all cases where the model predicted heart disease, ", round(conf_matrix$byClass['Pos Pred Value'], 4) * 100, "% of those predictions were correct.\n")
cat("* **Negative Predictive Value (Precision for Negative Class 'NoDisease'):**", round(conf_matrix$byClass['Neg Pred Value'], 4) * 100, "% - Out of all cases where the model predicted no heart disease, ", round(conf_matrix$byClass['Neg Pred Value'], 4) * 100, "% were actually correct.\n")
cat("* **F1-Score (for Positive Class 'Disease'):**", round(conf_matrix$byClass['F1'], 4), " - The F1-score is the harmonic mean of precision and sensitivity, providing a balanced measure of the model's accuracy, especially useful in cases of imbalanced classes. A score of ", round(conf_matrix$byClass['F1'], 4), " suggests reasonable performance.\n")
The Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) are critical for evaluating the model’s ability to discriminate between positive and negative classes across various thresholds.
# Create ROC curve object
roc_obj_lr <- roc(y_test, predictions_prob)
# Plot ROC curve
plot(roc_obj_lr, main = "ROC Curve for Logistic Regression Model",
col = "darkorange", lwd = 2, print.auc = TRUE,
legacy.axes = TRUE) # Ensures standard axes for comparison
abline(a = 0, b = 1, col = "navy", lty = 2, lwd = 2) # Add a diagonal reference line
legend("bottomright", legend = paste("AUC =", round(auc(roc_obj_lr), 4)),
col = "darkorange", lwd = 2)
Figure 5.1: ROC Curve and AUC for Logistic Regression Model
# Print AUC value
cat(paste("The Area Under the ROC Curve (AUC) is", round(auc(roc_obj_lr), 4), ".\n", sep = ""))
## The Area Under the ROC Curve (AUC) is0.9197.
cat("An AUC of ", round(auc(roc_obj_lr), 4), " indicates that the model has strong discriminatory power, meaning it can effectively distinguish between patients with and without heart disease.\n")
## An AUC of 0.9197 indicates that the model has strong discriminatory power, meaning it can effectively distinguish between patients with and without heart disease.
The coefficients of the Logistic Regression model indicate the impact of each feature on the log-odds of having heart disease. A positive coefficient suggests an increased likelihood of heart disease, while a negative coefficient suggests a decreased likelihood, holding other factors constant.
coefficients_lr <- as.data.frame(coef(summary(model_lr)))
coefficients_lr$Feature <- rownames(coefficients_lr)
coefficients_lr <- coefficients_lr %>%
select(Feature, Estimate, `Std. Error`, `z value`, `Pr(>|z|)`) %>%
arrange(desc(Estimate))
knitr::kable(coefficients_lr, caption = "Table 5.3: Logistic Regression Model Coefficients")
| Feature | Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|---|
| sex_Male | sex_Male | 0.5452120 | 0.1132663 | 4.8135427 | 0.0000015 |
| exang | exang | 0.4954557 | 0.1206757 | 4.1056772 | 0.0000403 |
| oldpeak | oldpeak | 0.4864337 | 0.1287912 | 3.7769163 | 0.0001588 |
| ca | ca | 0.4163051 | 0.1381339 | 3.0137798 | 0.0025801 |
| (Intercept) | (Intercept) | 0.3736594 | 0.1087858 | 3.4348183 | 0.0005930 |
| slope_flat | slope_flat | 0.3624213 | 0.2098810 | 1.7267942 | 0.0842046 |
| age | age | 0.2400686 | 0.1212565 | 1.9798416 | 0.0477213 |
| fbs | fbs | 0.1559633 | 0.1069290 | 1.4585689 | 0.1446838 |
| restecg_st.t.abnormality | restecg_st.t.abnormality | 0.1279991 | 0.1386015 | 0.9235042 | 0.3557445 |
| thal_reversable.defect | thal_reversable.defect | 0.1273237 | 0.2107926 | 0.6040236 | 0.5458279 |
| trestbps | trestbps | 0.0841422 | 0.1109304 | 0.7585139 | 0.4481434 |
| restecg_normal | restecg_normal | 0.0647709 | 0.1399291 | 0.4628834 | 0.6434480 |
| slope_upsloping | slope_upsloping | 0.0318200 | 0.2079471 | 0.1530195 | 0.8783829 |
| thal_normal | thal_normal | -0.2408792 | 0.2026016 | -1.1889303 | 0.2344671 |
| thalch | thalch | -0.2555859 | 0.1265934 | -2.0189508 | 0.0434923 |
| cp_typical.angina | cp_typical.angina | -0.2967041 | 0.0925180 | -3.2069889 | 0.0013413 |
| chol | chol | -0.5098626 | 0.1165415 | -4.3749435 | 0.0000121 |
| cp_non.anginal | cp_non.anginal | -0.5751430 | 0.1046513 | -5.4958023 | 0.0000000 |
| cp_atypical.angina | cp_atypical.angina | -0.7600922 | 0.1215740 | -6.2520962 | 0.0000000 |
cat("\nNotes on Coefficient Interpretation:\n")
##
## Notes on Coefficient Interpretation:
cat("Positive Coefficients:Mean that an increase in this feature (holding other features constant) increases the log-odds of heart disease.\n")
## Positive Coefficients:Mean that an increase in this feature (holding other features constant) increases the log-odds of heart disease.
cat("Negative Coefficients: Mean that an increase in this feature (holding other features constant) decreases the log-odds of heart disease.\n")
## Negative Coefficients: Mean that an increase in this feature (holding other features constant) decreases the log-odds of heart disease.
cat("Magnitude of Coefficient:Indicates the strength of the relationship. Note that this applies to Scaled Features, meaning a 1-unit increase in a scaled feature corresponds to the change in log-odds.\n")
## Magnitude of Coefficient:Indicates the strength of the relationship. Note that this applies to Scaled Features, meaning a 1-unit increase in a scaled feature corresponds to the change in log-odds.
cat("P-value (Pr(>|z|)): A low p-value (typically < 0.05) suggests that the feature is statistically significant in predicting the outcome.\n")
## P-value (Pr(>|z|)): A low p-value (typically < 0.05) suggests that the feature is statistically significant in predicting the outcome.
The Logistic Regression model developed for predicting heart disease demonstrates strong performance, achieving an accuracy of 83.61% and an AUC of 0.9197. The high sensitivity and specificity values also confirm the model’s ability to correctly identify both positive and negative cases. The comprehensive preprocessing steps, including handling missing values and one-hot encoding, were crucial in preparing the data for effective modeling.