1 Introduction

Heart disease remains a leading cause of mortality worldwide. Early and accurate prediction of heart disease is crucial for timely intervention and improved patient outcomes. This report details a machine learning approach to predict the presence of heart disease using a comprehensive dataset. We will employ Logistic Regression, a powerful and interpretable statistical model, to build our predictive system.

The primary goal of this analysis is to develop a model that can predict heart disease with high accuracy, specifically aiming to meet or exceed certain performance benchmarks.

2 Dataset Description

The dataset used in this analysis contains various medical and demographic attributes of patients, along with an indicator of heart disease presence. Each row represents a patient, and columns represent different features.

2.1 Variable Definitions

  • age: Age of the patient in years. (Numerical - Continuous)
  • sex: Biological sex of the patient. (Categorical - Binary: Male, Female)
  • cp (Chest Pain Type): The type of chest pain experienced by the patient. (Categorical - Nominal: typical angina, atypical angina, non-anginal pain, asymptomatic)
  • trestbps (Resting Blood Pressure): Resting blood pressure (in mm Hg) measured on admission to the hospital. (Numerical - Continuous)
  • chol (Cholesterol): Serum cholesterol level in mg/dl. (Numerical - Continuous)
  • fbs (Fasting Blood Sugar): Fasting blood sugar level > 120 mg/dl. (Categorical - Binary: 1 if True, 0 if False)
  • restecg (Resting Electrocardiographic Results): Results of the resting electrocardiogram. (Categorical - Nominal: lv hypertrophy, normal, st-t abnormality)
  • thalch (Maximum Heart Rate Achieved): The maximum heart rate achieved during an exercise stress test. (Numerical - Continuous)
  • exang (Exercise Induced Angina): Presence of exercise-induced angina. (Categorical - Binary: 1 if True, 0 if False)
  • oldpeak (ST Depression Induced by Exercise Relative to Rest): The depression of the ST segment induced by exercise relative to resting state. (Numerical - Continuous)
  • slope (Slope of the Peak Exercise ST Segment): The slope of the peak exercise ST segment. (Categorical - Nominal: downsloping, flat, upsloping)
  • ca (Number of Major Vessels Colored by Fluoroscopy): The number of major vessels (0-3) colored by fluoroscopy. (Categorical - Ordinal: 0, 1, 2, 3)
  • thal (Thallium Stress Test Result): Result of the thallium stress test. (Categorical - Nominal: fixed defect, normal, reversable defect)
  • num: Original target variable indicating heart disease presence (0 = no disease, >0 = disease). This will be transformed into target.
  • id: Patient identifier. (Removed during preprocessing)
  • dataset: Source of the dataset. (Removed during preprocessing as it has no variance)

3 Data Loading and Preprocessing

This section outlines the steps taken to load the dataset and prepare it for modeling. This includes handling missing values, converting data types, and feature engineering to ensure the data is in a suitable format for Logistic Regression.

file_path <- "/Users/sahar/Downloads/heart.csv" 
df <- read.csv(file_path, na.strings = "?")

df_pre_removal <- df

# Initial inspection (outputs to console)
cat("--- Initial Data Inspection ---\n")
## --- Initial Data Inspection ---
print(head(df, 10))
##    id age    sex   dataset              cp trestbps chol   fbs        restecg
## 1   1  63   Male Cleveland  typical angina      145  233  TRUE lv hypertrophy
## 2   2  67   Male Cleveland    asymptomatic      160  286 FALSE lv hypertrophy
## 3   3  67   Male Cleveland    asymptomatic      120  229 FALSE lv hypertrophy
## 4   4  37   Male Cleveland     non-anginal      130  250 FALSE         normal
## 5   5  41 Female Cleveland atypical angina      130  204 FALSE lv hypertrophy
## 6   6  56   Male Cleveland atypical angina      120  236 FALSE         normal
## 7   7  62 Female Cleveland    asymptomatic      140  268 FALSE lv hypertrophy
## 8   8  57 Female Cleveland    asymptomatic      120  354 FALSE         normal
## 9   9  63   Male Cleveland    asymptomatic      130  254 FALSE lv hypertrophy
## 10 10  53   Male Cleveland    asymptomatic      140  203  TRUE lv hypertrophy
##    thalch exang oldpeak       slope ca              thal num
## 1     150 FALSE     2.3 downsloping  0      fixed defect   0
## 2     108  TRUE     1.5        flat  3            normal   2
## 3     129  TRUE     2.6        flat  2 reversable defect   1
## 4     187 FALSE     3.5 downsloping  0            normal   0
## 5     172 FALSE     1.4   upsloping  0            normal   0
## 6     178 FALSE     0.8   upsloping  0            normal   0
## 7     160 FALSE     3.6 downsloping  2            normal   3
## 8     163  TRUE     0.6   upsloping  0            normal   0
## 9     147 FALSE     1.4        flat  1 reversable defect   2
## 10    155  TRUE     3.1 downsloping  0 reversable defect   1
print(str(df))
## 'data.frame':    920 obs. of  16 variables:
##  $ id      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ age     : int  63 67 67 37 41 56 62 57 63 53 ...
##  $ sex     : chr  "Male" "Male" "Male" "Male" ...
##  $ dataset : chr  "Cleveland" "Cleveland" "Cleveland" "Cleveland" ...
##  $ cp      : chr  "typical angina" "asymptomatic" "asymptomatic" "non-anginal" ...
##  $ trestbps: int  145 160 120 130 130 120 140 120 130 140 ...
##  $ chol    : int  233 286 229 250 204 236 268 354 254 203 ...
##  $ fbs     : logi  TRUE FALSE FALSE FALSE FALSE FALSE ...
##  $ restecg : chr  "lv hypertrophy" "lv hypertrophy" "lv hypertrophy" "normal" ...
##  $ thalch  : int  150 108 129 187 172 178 160 163 147 155 ...
##  $ exang   : logi  FALSE TRUE TRUE FALSE FALSE FALSE ...
##  $ oldpeak : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
##  $ slope   : chr  "downsloping" "flat" "flat" "downsloping" ...
##  $ ca      : int  0 3 2 0 0 0 2 0 1 0 ...
##  $ thal    : chr  "fixed defect" "normal" "reversable defect" "normal" ...
##  $ num     : int  0 2 1 0 0 0 3 0 2 1 ...
## NULL
print(summary(df))
##        id             age            sex              dataset         
##  Min.   :  1.0   Min.   :28.00   Length:920         Length:920        
##  1st Qu.:230.8   1st Qu.:47.00   Class :character   Class :character  
##  Median :460.5   Median :54.00   Mode  :character   Mode  :character  
##  Mean   :460.5   Mean   :53.51                                        
##  3rd Qu.:690.2   3rd Qu.:60.00                                        
##  Max.   :920.0   Max.   :77.00                                        
##                                                                       
##       cp               trestbps          chol          fbs         
##  Length:920         Min.   :  0.0   Min.   :  0.0   Mode :logical  
##  Class :character   1st Qu.:120.0   1st Qu.:175.0   FALSE:692      
##  Mode  :character   Median :130.0   Median :223.0   TRUE :138      
##                     Mean   :132.1   Mean   :199.1   NA's :90       
##                     3rd Qu.:140.0   3rd Qu.:268.0                  
##                     Max.   :200.0   Max.   :603.0                  
##                     NA's   :59      NA's   :30                     
##    restecg              thalch        exang            oldpeak       
##  Length:920         Min.   : 60.0   Mode :logical   Min.   :-2.6000  
##  Class :character   1st Qu.:120.0   FALSE:528       1st Qu.: 0.0000  
##  Mode  :character   Median :140.0   TRUE :337       Median : 0.5000  
##                     Mean   :137.5   NA's :55        Mean   : 0.8788  
##                     3rd Qu.:157.0                   3rd Qu.: 1.5000  
##                     Max.   :202.0                   Max.   : 6.2000  
##                     NA's   :55                      NA's   :62       
##     slope                 ca             thal                num        
##  Length:920         Min.   :0.0000   Length:920         Min.   :0.0000  
##  Class :character   1st Qu.:0.0000   Class :character   1st Qu.:0.0000  
##  Mode  :character   Median :0.0000   Mode  :character   Median :1.0000  
##                     Mean   :0.6764                      Mean   :0.9957  
##                     3rd Qu.:1.0000                      3rd Qu.:2.0000  
##                     Max.   :3.0000                      Max.   :4.0000  
##                     NA's   :611
cat("\nInitial NA counts after loading with na.strings:\n")
## 
## Initial NA counts after loading with na.strings:
print(colSums(is.na(df)))
##       id      age      sex  dataset       cp trestbps     chol      fbs 
##        0        0        0        0        0       59       30       90 
##  restecg   thalch    exang  oldpeak    slope       ca     thal      num 
##        0       55       55       62        0      611        0        0
# Drop 'id' and 'dataset'
df <- df %>%
  select(-id, -dataset)

# Create 'target' variable from 'num' (0 for no disease, 1 for disease)
df <- df %>%
  mutate(target = ifelse(num > 0, 1, 0)) %>%
  select(-num) # Remove the original 'num' column

# Ensure target factor levels are valid R variable names for caret and other functions
df$target <- factor(df$target, levels = c(0, 1), labels = c("NoDisease", "Disease"))

# Convert logical 'fbs' and 'exang' to integer (0/1). 'NA's are preserved initially.
df <- df %>%
  mutate(fbs = as.integer(fbs),
         exang = as.integer(exang))

# Convert relevant character columns to factors, explicitly handling empty strings
categorical_cols_to_factor <- c('sex', 'cp', 'restecg', 'slope', 'thal')
for (col_name in categorical_cols_to_factor) {
  df[[col_name]][df[[col_name]] == ""] <- NA # Convert empty strings to NA
  df[[col_name]] <- as.factor(df[[col_name]]) # Convert to factor
}

# Impute numerical missing values with the median
numerical_cols_for_imputation <- c('age', 'trestbps', 'chol', 'thalch', 'oldpeak', 'ca', 'fbs', 'exang')
for (col_name in numerical_cols_for_imputation) {
  if (any(is.na(df[[col_name]]))) {
    df[[col_name]][is.na(df[[col_name]])] <- median(df[[col_name]], na.rm = TRUE)
  }
}

# Impute categorical missing values with the mode
get_mode <- function(v) {
  uniqv <- unique(v)
  if (length(uniqv) == 0 || all(is.na(uniqv))) return(NA)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
for (col_name in categorical_cols_to_factor) {
  if (any(is.na(df[[col_name]]))) {
    mode_val <- get_mode(na.omit(df[[col_name]]))
    if (!is.na(mode_val)) {
      df[[col_name]][is.na(df[[col_name]])] <- mode_val
    }
  }
}

# Final data type adjustments
df <- df %>%
  mutate(ca = as.integer(ca))

# Check structure again after preprocessing
cat("\n--- Data Structure After Preprocessing ---\n")
## 
## --- Data Structure After Preprocessing ---
print(str(df))
## 'data.frame':    920 obs. of  14 variables:
##  $ age     : int  63 67 67 37 41 56 62 57 63 53 ...
##  $ sex     : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 2 1 1 2 2 ...
##  $ cp      : Factor w/ 4 levels "asymptomatic",..: 4 1 1 3 2 2 1 1 1 1 ...
##  $ trestbps: int  145 160 120 130 130 120 140 120 130 140 ...
##  $ chol    : num  233 286 229 250 204 236 268 354 254 203 ...
##  $ fbs     : num  1 0 0 0 0 0 0 0 0 1 ...
##  $ restecg : Factor w/ 3 levels "lv hypertrophy",..: 1 1 1 2 1 2 1 2 1 1 ...
##  $ thalch  : int  150 108 129 187 172 178 160 163 147 155 ...
##  $ exang   : int  0 1 1 0 0 0 0 1 0 1 ...
##  $ oldpeak : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
##  $ slope   : Factor w/ 3 levels "downsloping",..: 1 2 2 1 3 3 1 3 2 1 ...
##  $ ca      : int  0 3 2 0 0 0 2 0 1 0 ...
##  $ thal    : Factor w/ 3 levels "fixed defect",..: 1 2 3 2 2 2 2 2 3 3 ...
##  $ target  : Factor w/ 2 levels "NoDisease","Disease": 1 2 2 1 1 1 2 1 2 2 ...
## NULL
cat("\nNA counts after preprocessing:\n")
## 
## NA counts after preprocessing:
print(colSums(is.na(df)))
##      age      sex       cp trestbps     chol      fbs  restecg   thalch 
##        0        0        0        0        0        0        0        0 
##    exang  oldpeak    slope       ca     thal   target 
##        0        0        0        0        0        0
# One-hot encode categorical columns for modeling
categorical_cols_to_encode <- c('sex', 'cp', 'restecg', 'slope', 'thal')
df_encoded <- dummy_cols(df,
                         select_columns = categorical_cols_to_encode,
                         remove_first_dummy = TRUE, 
                         remove_selected_columns = TRUE)

# Ensure target column name is consistent 
df_encoded <- df_encoded %>%
  rename(target = target)

# Separate features (X) and target (y)
X <- df_encoded %>% select(-target)
y <- df_encoded$target

# Split data into training and testing sets
set.seed(42)
train_index <- createDataPartition(y, p = 0.80, list = FALSE, times = 1)
X_train <- X[train_index, ]
X_test <- X[-train_index, ]
y_train <- y[train_index]
y_test <- y[-train_index]

# Scale numerical features (center and scale)
numerical_features_for_scaling <- X_train %>%
  select(where(is.numeric)) %>%
  names()

preprocessor <- preProcess(X_train[, numerical_features_for_scaling], method = c("center", "scale"))
X_train_scaled <- predict(preprocessor, X_train[, numerical_features_for_scaling])
X_test_scaled <- predict(preprocessor, X_test[, numerical_features_for_scaling])

# Combine scaled numerical features with one-hot encoded categorical features
X_train_final <- cbind(X_train_scaled, X_train %>% select(-all_of(numerical_features_for_scaling)))
X_test_final <- cbind(X_test_scaled, X_test %>% select(-all_of(numerical_features_for_scaling)))

# Ensure column names match after scaling and combining
colnames(X_train_final) <- make.names(colnames(X_train_final), unique = TRUE)
colnames(X_test_final) <- make.names(colnames(X_test_final), unique = TRUE)

cat("\n--- Training and Test Set Dimensions ---\n")
## 
## --- Training and Test Set Dimensions ---
cat("Dimensions of X_train_final:", dim(X_train_final), "\n")
## Dimensions of X_train_final: 737 18
cat("Length of y_train:", length(y_train), "\n")
## Length of y_train: 737
cat("Dimensions of X_test_final:", dim(X_test_final), "\n")
## Dimensions of X_test_final: 183 18
cat("Length of y_test:", length(y_test), "\n")
## Length of y_test: 183

4 Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is a critical step to understand the underlying patterns, distributions, and relationships within the dataset. This section provides visual and statistical summaries of the data features.

4.1 Initial Missing Value Assessment

Initial data inspection revealed several features with missing values (represented by ‘?’). Table @ref(tab:missing-values-table) summarizes these counts, and Figure @ref(fig:missing-values-hist) provides a visual representation. These missing values were subsequently imputed during the preprocessing phase.

Table 4.1: Initial Missing Value Counts by Feature (Prior to Preprocessing)
Feature Missing_Count
ca 611
slope 309
fbs 90
oldpeak 62
trestbps 59
exang 55
thalch 55
chol 30
# missing_data_summary defined above
ggplot(missing_data_summary, aes(x = reorder(Feature, -Missing_Count), y = Missing_Count, fill = Missing_Count)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = Missing_Count), vjust = -0.5, size = 3.5) +
  labs(title = "Initial Missing Value Counts by Feature",
       x = "Feature",
       y = "Number of Missing Values") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_gradient(low = "lightblue", high = "darkred")
Figure 4.1: Initial Missing Value Counts by Feature

Figure 4.1: Initial Missing Value Counts by Feature

4.2 Justification for Column Removal: ‘dataset’

The ‘dataset’ column was removed during preprocessing because it contained only a single unique value (‘Cleveland’), providing no variability or predictive power for the model. Figure @ref(fig:dataset-distribution) illustrates this lack of variance.

# Plotting the 'dataset' column using the df_pre_removal (before dropping this column)
dataset_plot <- ggplot(df_pre_removal, aes(x = dataset)) +
  geom_bar(fill = "steelblue", color = "black") +
  labs(title = "Distribution of 'dataset' Column",
       x = "Dataset Source",
       y = "Count") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 12),
        axis.title = element_text(size = 10),
        axis.text = element_text(size = 8))
print(dataset_plot)
Figure 4.2: Distribution of Removed Feature ('dataset')

Figure 4.2: Distribution of Removed Feature (‘dataset’)

4.3 Distribution of Numerical Features

This section presents the summary statistics and individual histograms for each numerical feature, providing insights into their distributions after preprocessing.

4.3.1 Summary Statistics for Numerical Features

Table @ref(tab:numerical-summary) provides a comprehensive overview of the central tendency, dispersion, and range for the numerical features after preprocessing and imputation.

Table 4.2: Summary Statistics for Numerical Features (Post-Preprocessing)
age trestbps chol thalch oldpeak ca fbs exang
Min. :28.00 Min. : 0 Min. : 0.0 Min. : 60.0 Min. :-2.6000 Min. :0.0000 Min. :0.00 Min. :0.0000
1st Qu.:47.00 1st Qu.:120 1st Qu.:177.8 1st Qu.:120.0 1st Qu.: 0.0000 1st Qu.:0.0000 1st Qu.:0.00 1st Qu.:0.0000
Median :54.00 Median :130 Median :223.0 Median :140.0 Median : 0.5000 Median :0.0000 Median :0.00 Median :0.0000
Mean :53.51 Mean :132 Mean :199.9 Mean :137.7 Mean : 0.8533 Mean :0.2272 Mean :0.15 Mean :0.3663
3rd Qu.:60.00 3rd Qu.:140 3rd Qu.:267.0 3rd Qu.:156.0 3rd Qu.: 1.5000 3rd Qu.:0.0000 3rd Qu.:0.00 3rd Qu.:1.0000
Max. :77.00 Max. :200 Max. :603.0 Max. :202.0 Max. : 6.2000 Max. :3.0000 Max. :1.00 Max. :1.0000

4.4 Boxplots of Numerical Features by Target

Figure @ref(fig:numerical-boxplots) illustrates the relationship between each numerical feature and the presence of heart disease (target variable) using boxplots. This helps identify features that show different distributions based on the outcome, suggesting their predictive power.

# Use the numerical_cols_for_imputation defined in preprocessing
boxplot_plots_list <- lapply(numerical_cols_for_imputation, function(col_name) {
  ggplot(df, aes_string(x = "target", y = col_name, fill = "target")) + # Use 'target' directly as it's a factor
    geom_boxplot(color = "black", alpha = 0.8) +
    labs(title = paste(col_name, " by Heart Disease Presence"),
         x = "Heart Disease (NoDisease vs. Disease)",
         y = col_name,
         fill = "Target") +
    theme_minimal() +
    scale_fill_manual(values = c("NoDisease" = "lightblue", "Disease" = "salmon")) + # Use new factor labels
    theme(legend.position = "none",
          plot.title = element_text(hjust = 0.5, size = 10),
          axis.title = element_text(size = 8),
          axis.text = element_text(size = 7),
          panel.grid.major = element_line(linetype = "dotted", color = "lightgray"),
          panel.grid.minor = element_blank())
})
combined_boxplot_figure <- wrap_plots(boxplot_plots_list, ncol = 4, nrow = 2)
print(combined_boxplot_figure)
Figure 4.4: Boxplots of Numerical Features by Heart Disease Presence

Figure 4.4: Boxplots of Numerical Features by Heart Disease Presence

4.5 Proportional Distribution of Categorical Features by Heart Disease Presence

Figure @ref(fig:categorical-proportions) provides stacked bar charts showing the distribution of each categorical feature, broken down by the presence or absence of heart disease (target). This helps in understanding the relationship between each categorical variable and the outcome.

categorical_cols_for_plot <- c('sex', 'cp', 'restecg', 'slope', 'thal', 'fbs', 'exang')

categorical_plots_list <- lapply(categorical_cols_for_plot, function(col_name) {
  ggplot(df, aes_string(x = col_name, fill = "target")) + # Use 'target' directly
    geom_bar(position = "fill", color = "black", alpha = 0.8) + # 'fill' makes it proportional
    labs(title = paste("Proportion of Target by", col_name),
         x = col_name,
         y = "Proportion",
         fill = "Heart Disease") + # Label is now "Heart Disease"
    theme_minimal() +
    scale_fill_manual(values = c("NoDisease" = "darkgreen", "Disease" = "firebrick")) + # Use new factor labels
    theme(plot.title = element_text(hjust = 0.5, size = 10),
          axis.text.x = element_text(angle = 45, hjust = 1), # Rotate labels for readability
          axis.title = element_text(size = 8),
          axis.text = element_text(size = 7),
          legend.position = "bottom")
})
combined_categorical_figure <- wrap_plots(categorical_plots_list, ncol = 3, nrow = 3)
print(combined_categorical_figure)
Figure 4.5: Proportional Distribution of Categorical Features by Heart Disease Presence

Figure 4.5: Proportional Distribution of Categorical Features by Heart Disease Presence

4.6 Correlation Matrix of Numerical Features

Figure @ref(fig:correlation-matrix) presents a heatmap of the Pearson correlation coefficients between the numerical features. This helps identify multicollinearity (highly correlated predictors) that might affect model stability and interpretability.

numerical_data_for_corr <- df %>% select(all_of(numerical_cols_for_imputation))

# Calculate the correlation matrix
correlation_matrix <- cor(numerical_data_for_corr)

# Create the correlation heatmap
corrplot(correlation_matrix,
         method = "circle",
         type = "upper",
         order = "hclust",
         tl.col = "black",
         tl.srt = 45,
         addCoef.col = "black",
         number.cex = 0.7,
         col = colorRampPalette(c("blue", "white", "red"))(200),
         title = "Correlation Matrix of Numerical Features",
         mar = c(0,0,1,0))
Figure 4.6: Correlation Matrix of Numerical Features

Figure 4.6: Correlation Matrix of Numerical Features

5 Model Training: Logistic Regression

Logistic Regression is a statistical model used for binary classification. It models the probability of a binary outcome using a logistic function.

5.1 Model Training

The Logistic Regression model was trained using the glm function in R, with family = "binomial" for binary classification. The model learns the relationship between the features and the likelihood of heart disease.

# Train the Logistic Regression model
model_lr <- glm(y_train ~ ., data = X_train_final, family = "binomial")

# Display model summary 
summary(model_lr)
## 
## Call:
## glm(formula = y_train ~ ., family = "binomial", data = X_train_final)
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               0.37366    0.10879   3.435 0.000593 ***
## age                       0.24007    0.12126   1.980 0.047721 *  
## trestbps                  0.08414    0.11093   0.759 0.448143    
## chol                     -0.50986    0.11654  -4.375 1.21e-05 ***
## fbs                       0.15596    0.10693   1.459 0.144684    
## thalch                   -0.25559    0.12659  -2.019 0.043492 *  
## exang                     0.49546    0.12068   4.106 4.03e-05 ***
## oldpeak                   0.48643    0.12879   3.777 0.000159 ***
## ca                        0.41631    0.13813   3.014 0.002580 ** 
## sex_Male                  0.54521    0.11327   4.814 1.48e-06 ***
## cp_atypical.angina       -0.76009    0.12157  -6.252 4.05e-10 ***
## cp_non.anginal           -0.57514    0.10465  -5.496 3.89e-08 ***
## cp_typical.angina        -0.29670    0.09252  -3.207 0.001341 ** 
## restecg_normal            0.06477    0.13993   0.463 0.643448    
## restecg_st.t.abnormality  0.12800    0.13860   0.924 0.355744    
## slope_flat                0.36242    0.20988   1.727 0.084205 .  
## slope_upsloping           0.03182    0.20795   0.153 0.878383    
## thal_normal              -0.24088    0.20260  -1.189 0.234467    
## thal_reversable.defect    0.12732    0.21079   0.604 0.545828    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1013.21  on 736  degrees of freedom
## Residual deviance:  602.53  on 718  degrees of freedom
## AIC: 640.53
## 
## Number of Fisher Scoring iterations: 5

5.2 5.2 Model Performance Evaluation

The trained model’s performance was evaluated on the unseen test set using various metrics.

5.2.1 Accuracy and Confusion Matrix

The overall accuracy and the detailed confusion matrix for the model’s predictions are presented below.

# predictions on the test set
predictions_prob <- predict(model_lr, newdata = X_test_final, type = "response")
predictions_class <- factor(ifelse(predictions_prob > 0.5, "Disease", "NoDisease"), levels = c("NoDisease", "Disease"))

# Generate confusion matrix
conf_matrix <- confusionMatrix(predictions_class, y_test, positive = "Disease")

# overall accuracy
cat("#### Overall Accuracy:\n")

5.2.2 Overall Accuracy:

cat(paste("The overall accuracy of the model is **", round(conf_matrix$overall['Accuracy'], 4) * 100, "%**.\n\n", sep = ""))

The overall accuracy of the model is 83.61%.

cat("This indicates that the model correctly predicted the presence or absence of heart disease for approximately ", round(conf_matrix$overall['Accuracy'], 2) * 100, "% of the patients in the test set.\n\n")

This indicates that the model correctly predicted the presence or absence of heart disease for approximately 84 % of the patients in the test set.

# confusion matrix
cat("#### Confusion Matrix:\n")

5.2.3 Confusion Matrix:

cat("The confusion matrix provides a detailed breakdown of correct and incorrect predictions:\n\n")

The confusion matrix provides a detailed breakdown of correct and incorrect predictions:

print(conf_matrix$table) # Print the raw table
       Reference

Prediction NoDisease Disease NoDisease 68 16 Disease 14 85

# Convert confusion matrix to a markdown table for better display
conf_matrix_table <- data.frame(conf_matrix$table) %>%
  pivot_wider(names_from = Prediction, values_from = Freq) %>%
  rename(Actual = Reference)

# Rename columns and rows for clarity in the markdown table
names(conf_matrix_table) <- c("Actual", "Predicted No Disease (0)", "Predicted Disease (1)")

# Add descriptive labels for the cells
conf_matrix_annotated <- conf_matrix_table %>%
  mutate(`Predicted No Disease (0)` = paste0(`Predicted No Disease (0)`, " (True Negative)"),
         `Predicted Disease (1)` = paste0(`Predicted Disease (1)`, " (False Positive)"))
# Manually adjust for Actual 1 row
conf_matrix_annotated[2, 2] <- paste0(conf_matrix_table[2, 2], " (False Negative)")
conf_matrix_annotated[2, 3] <- paste0(conf_matrix_table[2, 3], " (True Positive)")

cat("\nTable 5.1: Confusion Matrix\n")

Table 5.1: Confusion Matrix

knitr::kable(conf_matrix_annotated)
Actual Predicted No Disease (0) Predicted Disease (1)
NoDisease 68 (True Negative) 14 (False Positive)
Disease 16 (False Negative) 85 (True Positive)
cat("\n* **True Negatives (TN):**", conf_matrix$table[1,1], "- Model correctly predicted no disease.\n")
  • True Negatives (TN): 68 - Model correctly predicted no disease.
cat("* **False Positives (FP):**", conf_matrix$table[1,2], "- Model incorrectly predicted disease (Type I error).\n")
  • False Positives (FP): 16 - Model incorrectly predicted disease (Type I error).
cat("* **False Negatives (FN):**", conf_matrix$table[2,1], "- Model incorrectly predicted no disease (Type II error).\n")
  • False Negatives (FN): 14 - Model incorrectly predicted no disease (Type II error).
cat("* **True Positives (TP):**", conf_matrix$table[2,2], "- Model correctly predicted disease.\n")
  • True Positives (TP): 85 - Model correctly predicted disease.
cat("\n#### Classification Metrics:\n")

5.2.4 Classification Metrics:

cat("Derived from the confusion matrix, these metrics provide further insights into model performance:\n\n")

Derived from the confusion matrix, these metrics provide further insights into model performance:

metrics <- data.frame(
  Metric = c("Sensitivity (Recall for Positive Class 'Disease')", "Specificity (Recall for Negative Class 'NoDisease')",
             "Positive Predictive Value (Precision for Positive Class 'Disease')",
             "Negative Predictive Value (Precision for Negative Class 'NoDisease')",
             "F1-Score (for Positive Class 'Disease')", "Kappa"),
  Value = c(round(conf_matrix$byClass['Sensitivity'], 4),
            round(conf_matrix$byClass['Specificity'], 4),
            round(conf_matrix$byClass['Pos Pred Value'], 4),
            round(conf_matrix$byClass['Neg Pred Value'], 4),
            round(conf_matrix$byClass['F1'], 4),
            round(conf_matrix$overall['Kappa'], 4))
)

knitr::kable(metrics, caption = "Table 5.2: Key Classification Metrics")
Table 5.2: Key Classification Metrics
Metric Value
Sensitivity Sensitivity (Recall for Positive Class ‘Disease’) 0.8416
Specificity Specificity (Recall for Negative Class ‘NoDisease’) 0.8293
Pos Pred Value Positive Predictive Value (Precision for Positive Class ‘Disease’) 0.8586
Neg Pred Value Negative Predictive Value (Precision for Negative Class ‘NoDisease’) 0.8095
F1 F1-Score (for Positive Class ‘Disease’) 0.8500
Kappa Kappa 0.6693
cat("\n* **Sensitivity (Recall for Positive Class 'Disease'):**", round(conf_matrix$byClass['Sensitivity'], 4) * 100, "% - Out of all actual heart disease cases, the model correctly identified ", round(conf_matrix$byClass['Sensitivity'], 4) * 100, "%. This indicates the model's ability to detect positive cases.\n")
  • Sensitivity (Recall for Positive Class ‘Disease’): 84.16 % - Out of all actual heart disease cases, the model correctly identified 84.16 %. This indicates the model’s ability to detect positive cases.
cat("* **Specificity (Recall for Negative Class 'NoDisease'):**", round(conf_matrix$byClass['Specificity'], 4) * 100, "% - Out of all actual non-disease cases, the model correctly identified ", round(conf_matrix$byClass['Specificity'], 4) * 100, "%.\n")
  • Specificity (Recall for Negative Class ‘NoDisease’): 82.93 % - Out of all actual non-disease cases, the model correctly identified 82.93 %.
cat("* **Positive Predictive Value (Precision for Positive Class 'Disease'):**", round(conf_matrix$byClass['Pos Pred Value'], 4) * 100, "% - Out of all cases where the model predicted heart disease, ", round(conf_matrix$byClass['Pos Pred Value'], 4) * 100, "% of those predictions were correct.\n")
  • Positive Predictive Value (Precision for Positive Class ‘Disease’): 85.86 % - Out of all cases where the model predicted heart disease, 85.86 % of those predictions were correct.
cat("* **Negative Predictive Value (Precision for Negative Class 'NoDisease'):**", round(conf_matrix$byClass['Neg Pred Value'], 4) * 100, "% - Out of all cases where the model predicted no heart disease, ", round(conf_matrix$byClass['Neg Pred Value'], 4) * 100, "% were actually correct.\n")
  • Negative Predictive Value (Precision for Negative Class ‘NoDisease’): 80.95 % - Out of all cases where the model predicted no heart disease, 80.95 % were actually correct.
cat("* **F1-Score (for Positive Class 'Disease'):**", round(conf_matrix$byClass['F1'], 4), " - The F1-score is the harmonic mean of precision and sensitivity, providing a balanced measure of the model's accuracy, especially useful in cases of imbalanced classes. A score of ", round(conf_matrix$byClass['F1'], 4), " suggests reasonable performance.\n")
  • F1-Score (for Positive Class ‘Disease’): 0.85 - The F1-score is the harmonic mean of precision and sensitivity, providing a balanced measure of the model’s accuracy, especially useful in cases of imbalanced classes. A score of 0.85 suggests reasonable performance. #### ROC Curve and AUC

The Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) are critical for evaluating the model’s ability to discriminate between positive and negative classes across various thresholds.

# Create ROC curve object
roc_obj_lr <- roc(y_test, predictions_prob)

# Plot ROC curve
plot(roc_obj_lr, main = "ROC Curve for Logistic Regression Model",
     col = "darkorange", lwd = 2, print.auc = TRUE,
     legacy.axes = TRUE) # Ensures standard axes for comparison
abline(a = 0, b = 1, col = "navy", lty = 2, lwd = 2) # Add a diagonal reference line
legend("bottomright", legend = paste("AUC =", round(auc(roc_obj_lr), 4)),
       col = "darkorange", lwd = 2)
Figure 5.1: ROC Curve and AUC for Logistic Regression Model

Figure 5.1: ROC Curve and AUC for Logistic Regression Model

# Print AUC value

cat(paste("The Area Under the ROC Curve (AUC) is", round(auc(roc_obj_lr), 4), ".\n", sep = ""))
## The Area Under the ROC Curve (AUC) is0.9197.
cat("An AUC of ", round(auc(roc_obj_lr), 4), " indicates that the model has strong discriminatory power, meaning it can effectively distinguish between patients with and without heart disease.\n")
## An AUC of  0.9197  indicates that the model has strong discriminatory power, meaning it can effectively distinguish between patients with and without heart disease.

5.3 5.3 Interpretation of Logistic Regression Coefficients

The coefficients of the Logistic Regression model indicate the impact of each feature on the log-odds of having heart disease. A positive coefficient suggests an increased likelihood of heart disease, while a negative coefficient suggests a decreased likelihood, holding other factors constant.

coefficients_lr <- as.data.frame(coef(summary(model_lr)))
coefficients_lr$Feature <- rownames(coefficients_lr)
coefficients_lr <- coefficients_lr %>%
  select(Feature, Estimate, `Std. Error`, `z value`, `Pr(>|z|)`) %>%
  arrange(desc(Estimate))

knitr::kable(coefficients_lr, caption = "Table 5.3: Logistic Regression Model Coefficients")
Table 5.3: Logistic Regression Model Coefficients
Feature Estimate Std. Error z value Pr(>|z|)
sex_Male sex_Male 0.5452120 0.1132663 4.8135427 0.0000015
exang exang 0.4954557 0.1206757 4.1056772 0.0000403
oldpeak oldpeak 0.4864337 0.1287912 3.7769163 0.0001588
ca ca 0.4163051 0.1381339 3.0137798 0.0025801
(Intercept) (Intercept) 0.3736594 0.1087858 3.4348183 0.0005930
slope_flat slope_flat 0.3624213 0.2098810 1.7267942 0.0842046
age age 0.2400686 0.1212565 1.9798416 0.0477213
fbs fbs 0.1559633 0.1069290 1.4585689 0.1446838
restecg_st.t.abnormality restecg_st.t.abnormality 0.1279991 0.1386015 0.9235042 0.3557445
thal_reversable.defect thal_reversable.defect 0.1273237 0.2107926 0.6040236 0.5458279
trestbps trestbps 0.0841422 0.1109304 0.7585139 0.4481434
restecg_normal restecg_normal 0.0647709 0.1399291 0.4628834 0.6434480
slope_upsloping slope_upsloping 0.0318200 0.2079471 0.1530195 0.8783829
thal_normal thal_normal -0.2408792 0.2026016 -1.1889303 0.2344671
thalch thalch -0.2555859 0.1265934 -2.0189508 0.0434923
cp_typical.angina cp_typical.angina -0.2967041 0.0925180 -3.2069889 0.0013413
chol chol -0.5098626 0.1165415 -4.3749435 0.0000121
cp_non.anginal cp_non.anginal -0.5751430 0.1046513 -5.4958023 0.0000000
cp_atypical.angina cp_atypical.angina -0.7600922 0.1215740 -6.2520962 0.0000000
cat("\nNotes on Coefficient Interpretation:\n")
## 
## Notes on Coefficient Interpretation:
cat("Positive Coefficients:Mean that an increase in this feature (holding other features constant) increases the log-odds of heart disease.\n")
## Positive Coefficients:Mean that an increase in this feature (holding other features constant) increases the log-odds of heart disease.
cat("Negative Coefficients: Mean that an increase in this feature (holding other features constant) decreases the log-odds of heart disease.\n")
## Negative Coefficients: Mean that an increase in this feature (holding other features constant) decreases the log-odds of heart disease.
cat("Magnitude of Coefficient:Indicates the strength of the relationship. Note that this applies to Scaled Features, meaning a 1-unit increase in a scaled feature corresponds to the change in log-odds.\n")
## Magnitude of Coefficient:Indicates the strength of the relationship. Note that this applies to Scaled Features, meaning a 1-unit increase in a scaled feature corresponds to the change in log-odds.
cat("P-value (Pr(>|z|)): A low p-value (typically < 0.05) suggests that the feature is statistically significant in predicting the outcome.\n")
## P-value (Pr(>|z|)): A low p-value (typically < 0.05) suggests that the feature is statistically significant in predicting the outcome.

6 Conclusion and Future Work

6.1 Conclusion

The Logistic Regression model developed for predicting heart disease demonstrates strong performance, achieving an accuracy of 83.61% and an AUC of 0.9197. The high sensitivity and specificity values also confirm the model’s ability to correctly identify both positive and negative cases. The comprehensive preprocessing steps, including handling missing values and one-hot encoding, were crucial in preparing the data for effective modeling.