WQD7004 Group Project: Cardiovascular Risk Analysis: A Dual Approach to Heart Disease Prediction and Cholesterol Level Modeling

1 Introduction and Objectives

Cardiovascular diseases (CVDs) are the leading cause of death worldwide, accounting for an estimated 31% of total global deaths each year. Heart failure is a cardiovascular disease with high morbidity and mortality across the globe. Early and accurate prediction of heart failure risk can help clinicians formulate timely treatment plans, thereby reducing the risk of disease progression and healthcare costs. Traditional diagnosis of heart failure relies on clinicians’ experience and the evaluation of single indicators, which can be subjective and limited in accuracy. With the advancement of machine learning and data analytics technologies, predictive models based on multi-dimensional clinical features have become important tools for cardiovascular disease risk assessment.

The dataset used in this project is derived from the heart-failure-prediction dataset, which integrates samples from multiple clinical studies on cardiovascular diseases. It contains 11 clinical features and one diagnosis outcome (HeartDisease) for 918 observations, and is a classic public dataset for heart failure prediction modeling.

This study focuses on two core analytical objectives:

Regression Analysis: explore the linear relationship between patients’ age (Age) and maximum heart rate (MaxHR), and quantify the degree of age’s impact on maximum heart rate.
Classification Analysis: construct a logistic regression model based on patients’ multi-dimensional clinical features to predict the risk of heart failure (HeartDisease), and evaluate the predictive performance of the model.

2 Data Processing

To meet the requirements of subsequent analysis, data cleaning needs to be performed on the dataset. This mainly involves missing value imputation, outlier handling, and data type conversion.

2.1 Dataset Basic Information

We begin with the basic structure of the dataset. It contains 12 fields in total: 11 feature fields and 1 target label field. Each field is described below:

Attribute	Description
Age	age of the patient [years]
Sex	sex of the patient [M: Male, F: Female]
ChestPainType	chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
RestingBP	resting blood pressure [mm Hg]
Cholesterol	serum cholesterol [mg/dl]
FastingBS	fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
RestingECG	resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes’ criteria]
MaxHR	maximum heart rate achieved [Numeric value between 60 and 202]
ExerciseAngina	exercise-induced angina [Y: Yes, N: No]
Oldpeak	oldpeak = ST depression [numeric value]
ST_Slope	the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
HeartDisease	output class [1: heart disease, 0: Normal]

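# Load the packages used in this report
# (dplyr: %>%, filter, mutate; ggplot2: plots; corrplot: heatmap; caret: modeling)
library(dplyr); library(ggplot2); library(corrplot); library(caret)

# Read CSV data
df <- read.csv('Data.csv')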

# Data structure
str(df)

## 'data.frame':    918 obs. of  12 variables:
##  $ Age           : int  40 49 37 48 54 39 45 54 37 48 ...
##  $ Sex           : chr  "M" "F" "M" "F" ...
##  $ ChestPainType : chr  "ATA" "NAP" "ATA" "ASY" ...
##  $ RestingBP     : int  140 160 130 138 150 120 130 110 140 120 ...
##  $ Cholesterol   : int  289 180 283 214 195 339 237 208 207 284 ...
##  $ FastingBS     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ RestingECG    : chr  "Normal" "Normal" "ST" "Normal" ...
##  $ MaxHR         : int  172 156 98 108 122 170 170 142 130 120 ...
##  $ ExerciseAngina: chr  "N" "N" "N" "Y" ...
##  $ Oldpeak       : num  0 1 0 1.5 0 0 0 0 1.5 0 ...
##  $ ST_Slope      : chr  "Up" "Flat" "Up" "Flat" ...
##  $ HeartDisease  : int  0 1 0 1 0 0 0 0 1 0 ...

# Data Summary
summary(df)

##       Age            Sex            ChestPainType        RestingBP    
##  Min.   :28.00   Length:918         Length:918         Min.   :  0.0  
##  1st Qu.:47.00   Class :character   Class :character   1st Qu.:120.0  
##  Median :54.00   Mode  :character   Mode  :character   Median :130.0  
##  Mean   :53.51                                         Mean   :132.4  
##  3rd Qu.:60.00                                         3rd Qu.:140.0  
##  Max.   :77.00                                         Max.   :200.0  
##   Cholesterol      FastingBS       RestingECG            MaxHR      
##  Min.   :  0.0   Min.   :0.0000   Length:918         Min.   : 60.0  
##  1st Qu.:173.2   1st Qu.:0.0000   Class :character   1st Qu.:120.0  
##  Median :223.0   Median :0.0000   Mode  :character   Median :138.0  
##  Mean   :198.8   Mean   :0.2331                      Mean   :136.8  
##  3rd Qu.:267.0   3rd Qu.:0.0000                      3rd Qu.:156.0  
##  Max.   :603.0   Max.   :1.0000                      Max.   :202.0  
##  ExerciseAngina        Oldpeak          ST_Slope          HeartDisease   
##  Length:918         Min.   :-2.6000   Length:918         Min.   :0.0000  
##  Class :character   1st Qu.: 0.0000   Class :character   1st Qu.:0.0000  
##  Mode  :character   Median : 0.6000   Mode  :character   Median :1.0000  
##                     Mean   : 0.8874                      Mean   :0.5534  
##                     3rd Qu.: 1.5000                      3rd Qu.:1.0000  
##                     Max.   : 6.2000                      Max.   :1.0000

Analysis of the dataset summary reveals the following:

Age: Ranging from 28 to 77 years, with a mean of 53.51 years and a median of 54 years, indicating that the majority of patients are middle-aged and elderly.
RestingBP: The mean is 132.4 mm Hg and the median is 130 mm Hg; the minimum value of 0 is physiologically impossible and indicates an outlier.
Cholesterol: The mean is 198.8 mg/dl, with a minimum value of 0 (suspected missing values) and a maximum value of 603 mg/dl, showing a wide distribution range.
MaxHR: Ranging from 60 to 202 beats per minute, with a mean of 136.8 beats per minute. This field records the maximum heart rate achieved, so it should not be compared against the normal resting heart rate range.
HeartDisease: With a mean of 0.5534, suggesting that 55.34% of the samples are heart failure patients (coded as 1) and 44.66% are non-patients (coded as 0), reflecting a relatively balanced class distribution of the samples.

2.2 Data Preprocessing

Data cleaning is a core step to ensure the validity of the model. Based on the characteristics of the dataset, this study implements the following cleaning operations:

Outlier Handling: The physiologically reasonable range of RestingBP should be greater than 0. Rows with RestingBP = 0 need to be removed to ensure that the data is consistent with clinical common sense.

Missing Value Imputation: There are a large number of 0 values in the Cholesterol field. According to clinical knowledge, these 0 values are not true measured values but rather a notation for missing values. The median value is thus adopted for missing value imputation. Details are as follows:

Calculate the median of Cholesterol after excluding 0 values: chol_median <- median(df_clean$Cholesterol[df_clean$Cholesterol > 0], na.rm = TRUE);
Replace 0 values with the median: mutate(Cholesterol = ifelse(Cholesterol == 0, chol_median, Cholesterol)).

Data Type Conversion: Convert all categorical character variables into factor type. In particular, the target variable HeartDisease is converted into factor type for the training of classification models.

# 1. Filter out rows where RestingBP equals 0 to ensure data validity
df_clean <- df %>% filter(RestingBP > 0)

# 2. Impute missing values in Cholesterol (0 values indicate missing data, not actual measurements)
# Step 1: Calculate median of Cholesterol excluding 0 values (to avoid bias from missing values)
chol_median <- median(df_clean$Cholesterol[df_clean$Cholesterol > 0], na.rm = TRUE)
# Step 2: Replace 0 values with the calculated median
df_clean <- df_clean %>% mutate(Cholesterol = ifelse(Cholesterol == 0, chol_median, Cholesterol))

# 3. Convert categorical variables from character to factor type
df_clean <- df_clean %>%
  mutate(
    Sex = as.factor(Sex),
    ChestPainType = as.factor(ChestPainType),
    RestingECG = as.factor(RestingECG),
    ExerciseAngina = as.factor(ExerciseAngina),
    ST_Slope = as.factor(ST_Slope),
    HeartDisease = as.factor(HeartDisease)
  )

# Preview the processed data
head(df_clean)

# Validate cleaned dataset
cat("Data Dimensions (Rows × Columns):", dim(df_clean)[1], "×", dim(df_clean)[2])

## Data Dimensions (Rows × Columns): 917 × 12

# Descriptive Statistics of Cleaned Data
summary(df_clean)

##       Age        Sex     ChestPainType   RestingBP      Cholesterol   
##  Min.   :28.00   F:193   ASY:496       Min.   : 80.0   Min.   : 85.0  
##  1st Qu.:47.00   M:724   ATA:173       1st Qu.:120.0   1st Qu.:214.0  
##  Median :54.00           NAP:202       Median :130.0   Median :237.0  
##  Mean   :53.51           TA : 46       Mean   :132.5   Mean   :243.2  
##  3rd Qu.:60.00                         3rd Qu.:140.0   3rd Qu.:267.0  
##  Max.   :77.00                         Max.   :200.0   Max.   :603.0  
##    FastingBS       RestingECG      MaxHR       ExerciseAngina    Oldpeak       
##  Min.   :0.0000   LVH   :188   Min.   : 60.0   N:546          Min.   :-2.6000  
##  1st Qu.:0.0000   Normal:551   1st Qu.:120.0   Y:371          1st Qu.: 0.0000  
##  Median :0.0000   ST    :178   Median :138.0                  Median : 0.6000  
##  Mean   :0.2334                Mean   :136.8                  Mean   : 0.8867  
##  3rd Qu.:0.0000                3rd Qu.:156.0                  3rd Qu.: 1.5000  
##  Max.   :1.0000                Max.   :202.0                  Max.   : 6.2000  
##  ST_Slope   HeartDisease
##  Down: 63   0:410       
##  Flat:459   1:507       
##  Up  :395               
##                         
##                         
##

After cleaning, the dataset was reduced from 918 entries to 917 entries: the single row with a RestingBP of 0 was removed, and the minimum RestingBP is now 80 mm Hg. Following median imputation for zero values in the Cholesterol field, the mean rose from 198.8 to 243.2 mg/dl and the median from 223 to 237 mg/dl, while the minimum is now 85 mg/dl. All character fields, together with the HeartDisease target, were successfully converted to factor type.
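The imputation step can be illustrated on a small made-up vector. This is only a sketch: the values and the names `chol`, `chol_na`, and `was_missing` are hypothetical and are not taken from the project data. It recodes zeros as NA first, which makes the number of imputed values easy to count, and keeps an indicator so the imputed rows remain identifiable.

```r
# Hypothetical cholesterol readings; 0 stands in for "not measured"
chol <- c(289, 0, 180, 0, 283, 214)

# Recode zeros as NA so that missingness is explicit
chol_na <- replace(chol, chol == 0, NA)
n_missing <- sum(is.na(chol_na))

# Median of the observed (non-missing) values only
med <- median(chol_na, na.rm = TRUE)

# Keep a flag for imputed entries, then fill them with the median
was_missing <- is.na(chol_na)
chol_imputed <- ifelse(was_missing, med, chol_na)

print(n_missing)
print(med)
print(chol_imputed)
```

Counting the recoded values before filling them in shows how much of the column is affected, which matters here because median imputation compresses the variance of the imputed field.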

3 Exploratory Data Analysis (EDA)

3.1 Distribution of Heart Failure Cases

ggplot(df_clean, aes(x = HeartDisease, fill = HeartDisease)) +
  geom_bar(alpha = 0.8, width = 0.6) +
  scale_fill_manual(values = c("#4299e1", "#e53e3e")) +
  labs(
    title = "Distribution of Heart Disease Cases",
    x = "Heart Disease (0 = No, 1 = Yes)",
    y = "Number of Patients",
    fill = "Heart Disease"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"))

Patients with heart failure account for a larger proportion of the samples (approximately 55%), while those without heart failure make up around 45%. The overall class distribution is relatively balanced, with no significant class imbalance issue.
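The class proportions quoted above can be recomputed directly from the counts reported in the cleaned-data summary (410 and 507). The snippet below is a minimal sketch that hard-codes those two counts instead of reading the data; the variable names are illustrative.

```r
# Class counts taken from summary(df_clean): HeartDisease 0 = 410, 1 = 507
class_counts <- c("0" = 410, "1" = 507)

# Proportion of each class
class_props <- class_counts / sum(class_counts)
print(round(class_props, 4))

# Ratio of the larger class to the smaller class as a simple imbalance measure
imbalance_ratio <- max(class_counts) / min(class_counts)
print(round(imbalance_ratio, 2))
```

With the data loaded, the same figures would come from `prop.table(table(df_clean$HeartDisease))`.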

3.2 Age Distribution of Patients

ggplot(df_clean, aes(x = Age)) +
  geom_histogram(binwidth = 5, fill = "#38b2ac", color = "white", alpha = 0.8) +
  labs(
    title = "Age Distribution of Patients",
    x = "Age (Years)",
    y = "Frequency"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"))

As can be observed from the figure:

Age is concentrated in the 50–60 age group: this group has the largest number of patients, exceeding 200 individuals.
The overall distribution is roughly bell-shaped, with a slightly longer tail toward younger ages (the mean of 53.51 is marginally below the median of 54): patients under 30 years old are extremely rare; the number of patients gradually increases after 40 years old, peaks in the 50–60 age group, and then decreases significantly after 70 years old.
This pattern is consistent with cardiovascular risk increasing with age, although the histogram covers all patients and does not by itself separate those with and without heart disease.

3.3 Correlation Heatmap of Numerical Variables

# Extract numeric variables and calculate correlation matrix
num_vars <- df_clean %>% select(where(is.numeric))
cor_matrix <- cor(num_vars, use = "complete.obs")

# Plot correlation heatmap
corrplot(
  cor_matrix,
  method = "color",
  type = "upper",
  tl.col = "black",
  tl.srt = 45,
  col = colorRampPalette(c("#2d3748", "#4299e1", "#ffffff", "#e53e3e", "#9b2c2c"))(100),
  title = "Correlation Matrix of Numeric Variables",
  mar = c(0, 0, 1, 0)
)

As can be observed from the figure, Age shows the most pronounced negative correlation with MaxHR (Maximum Heart Rate) (indicated by the blue areas); at r ≈ -0.38 it is moderate in strength. The older the patients are, the lower their maximum heart rate tends to be, which aligns with the physiological principle that maximum heart rate decreases with age. Among other variables, the correlations are weak (represented by light-colored/white areas): for instance, the correlation between RestingBP (Resting Blood Pressure) and Age, as well as the correlations between Cholesterol and other variables, show no obvious strong associations.

The negative correlation between Age and MaxHR provides a data-driven basis for the subsequent regression analysis on predicting maximum heart rate based on age.

4 Regression Analysis: Relationship between Age and MaxHR

The focus of this section is to study how age affects maximum heart rate (MaxHR). Physiologically, age and maximum heart rate are generally negatively correlated, meaning that the older a person is, the lower their maximum heart rate. In the preliminary exploratory data analysis (EDA), the correlation matrix showed a distinct blue area between Age and MaxHR, which indicates a negative correlation. In this section, we quantify that relationship with a simple linear regression.

# Build a Linear Regression Model
lm_model <- lm(MaxHR ~ Age, data = df_clean)

# Output Detailed Results of the Model
summary(lm_model)

## 
## Call:
## lm(formula = MaxHR ~ Age, data = df_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -79.378 -15.904   0.748  18.117  58.717 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 191.98798    4.47893   42.87   <2e-16 ***
## Age          -1.03157    0.08243  -12.51   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.55 on 915 degrees of freedom
## Multiple R-squared:  0.1461, Adjusted R-squared:  0.1452 
## F-statistic: 156.6 on 1 and 915 DF,  p-value: < 2.2e-16

# Visualizing Regression Fitting Relationships
ggplot(df_clean, aes(x = Age, y = MaxHR)) +
  geom_point(color = "#4299e1", alpha = 0.6, size = 2) +
  geom_smooth(method = "lm", color = "#e53e3e", linewidth = 1.2) +
  labs(
    title = "Age vs Maximum Heart Rate (MaxHR)",
    subtitle = paste("Correlation Coefficient: ", round(cor(df_clean$Age, df_clean$MaxHR), 2)),
    x = "Age (Years)",
    y = "Max Heart Rate (bpm)"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"))

Intercept = 191.99: The maximum heart rate at a theoretical age of 0 (with only mathematical significance).
Age Coefficient = -1.03: For each additional year of age, the maximum heart rate is expected to decrease by an average of 1.03 beats per minute.
P-value < 2e-16: Much less than 0.05, indicating that the effect of age on heart rate is statistically extremely significant.
Correlation Coefficient = -0.38: It indicates a moderate negative correlation between the two.
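As a consistency check, the quantities in the regression output are tied together by standard identities for simple linear regression (one predictor). The following worked arithmetic uses only numbers from the summary above:

```latex
\begin{align*}
t_{\text{Age}} &= \frac{\hat{\beta}_{\text{Age}}}{\mathrm{SE}(\hat{\beta}_{\text{Age}})}
               = \frac{-1.03157}{0.08243} \approx -12.51, \\
F &= t_{\text{Age}}^{2} \approx (-12.514)^{2} \approx 156.6, \\
|r| &= \sqrt{R^{2}} = \sqrt{0.1461} \approx 0.38,
\qquad r \approx -0.38 \ \text{(taking the sign of the slope)}.
\end{align*}
```

The t value, the F-statistic, and the correlation coefficient shown in the plot subtitle are therefore three views of the same relationship, not independent pieces of evidence.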

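In the regression analysis, we established a simple linear regression model with Age as the independent variable and MaxHR as the dependent variable. The model results indicate that Age has a significant negative association with MaxHR (p < 0.001). Based on the regression coefficient, each additional year of age corresponds to a maximum heart rate that is lower by approximately 1.03 bpm on average. However, the R-squared of 0.1461 shows that age alone explains only about 14.6% of the variation in MaxHR, and the residual standard error of 23.55 bpm indicates considerable spread around the fitted line. The finding therefore offers a population-level reference for clinical assessments of cardiovascular load capacity across different age groups, not a precise prediction for an individual patient.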
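To make the fitted line concrete, the sketch below plugs two example ages into the estimated equation, using the coefficients and residual standard error printed in the model summary. The helper `predict_maxhr` and the chosen ages are illustrative assumptions; with the fitted object available, `predict(lm_model, newdata = ...)` would be the usual route.

```r
# Coefficients and residual standard error copied from summary(lm_model)
intercept <- 191.98798
slope_age <- -1.03157
resid_se  <- 23.55

# Hypothetical helper: fitted MaxHR for a given age
predict_maxhr <- function(age) intercept + slope_age * age

ages <- c(40, 60)
fitted_maxhr <- predict_maxhr(ages)

# Rough +/- 2 residual-SE band, only to convey the scale of the scatter
# (this is not an exact prediction interval)
rough_band <- data.frame(
  age   = ages,
  fit   = fitted_maxhr,
  lower = fitted_maxhr - 2 * resid_se,
  upper = fitted_maxhr + 2 * resid_se
)
print(rough_band)
```

A 20-year age difference shifts the fitted value by about 20.6 bpm, which is smaller than the residual standard error, so individual patients of the same age can differ by more than the age effect itself.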

5 Classification Analysis: Heart Failure Prediction

5.1 Dataset Splitting (Training Set/Test Set)

set.seed(123)  # Fix the random seed to ensure the reproducibility of results
train_index <- createDataPartition(
  df_clean$HeartDisease, 
  p = 0.8, 
  list = FALSE, 
  times = 1
)

train_data <- df_clean[train_index, ]
test_data <- df_clean[-train_index, ]

cat("Training Set Dimension: ", dim(train_data)[1], "×", dim(train_data)[2], "\n")

## Training Set Dimension:  734 × 12

cat("Test Set Dimensions: ", dim(test_data)[1], "×", dim(test_data)[2], "\n")

## Test Set Dimensions:  183 × 12

Prior to constructing the predictive model, this study first performed a stratified random split on the cleaned dataset. To ensure the reproducibility of the experiment, a fixed random seed (Seed: 123) was set, guaranteeing consistent data partitioning in subsequent runs. We utilized the createDataPartition function to split the cleaned data into a Training Set and a Test Set at a ratio of 80:20, sampling within each level of the target variable HeartDisease so that both subsets preserve its class distribution.

According to the code output, the specific details of the dataset split are as follows:

Training Set Size: Contains samples from 734 patients, used for fitting the logistic regression model and learning its parameters.
Test Set Size: Contains samples from 183 patients, serving as independent data to validate the predictive accuracy of the final model.

Each subset retains all 12 columns (11 clinical feature variables plus the HeartDisease target), ensuring the integrity of the analytical dimensions.
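The idea of a stratified split can be sketched in base R without caret. The function `stratified_split` below is a hypothetical helper written for illustration; it is not caret's implementation and will not select the same rows as createDataPartition. It samples 80% of the indices within each class, rounding up, and is applied to a label vector with the same class counts as the cleaned data (410 and 507).

```r
# Hypothetical helper: sample a fraction p of indices within each class
stratified_split <- function(y, p = 0.8, seed = 123) {
  set.seed(seed)
  idx_by_class <- split(seq_along(y), y)
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    # sample.int avoids the sample() pitfall when a class has one element
    idx[sample.int(length(idx), size = ceiling(p * length(idx)))]
  }), use.names = FALSE)
  sort(train_idx)
}

# Labels with the same class counts as df_clean$HeartDisease
y <- factor(c(rep(0, 410), rep(1, 507)))

train_idx <- stratified_split(y)
n_train <- length(train_idx)
n_test  <- length(y) - n_train
print(c(train = n_train, test = n_test))

# Class proportions are nearly identical in both subsets
print(round(prop.table(table(y[train_idx])), 3))
print(round(prop.table(table(y[-train_idx])), 3))
```

Under these assumptions the subset sizes come out as 734 and 183, the same sizes reported above, because 80% of 410 is 328 and 80% of 507 rounds up to 406.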

5.2 Build Logistic Regression Model

# Train the Logistic Regression Model
logit_model <- train(
  HeartDisease ~ ., 
  data = train_data, 
  method = "glm", 
  family = "binomial",
  trControl = trainControl(method = "cv", number = 5)  # 5-fold Cross-Validation
)

# Output basic information of the Model
print(logit_model)

## Generalized Linear Model 
## 
## 734 samples
##  11 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 586, 588, 587, 587, 588 
## Resampling results:
## 
##   Accuracy  Kappa    
##   0.848827  0.6932526

# Details of Model Coefficients
summary(logit_model$finalModel)

## 
## Call:
## NULL
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -1.805e+00  1.651e+00  -1.093 0.274275    
## Age               1.760e-02  1.472e-02   1.195 0.231925    
## SexM              1.694e+00  3.087e-01   5.488 4.07e-08 ***
## ChestPainTypeATA -1.887e+00  3.613e-01  -5.222 1.77e-07 ***
## ChestPainTypeNAP -1.578e+00  2.849e-01  -5.538 3.06e-08 ***
## ChestPainTypeTA  -1.588e+00  4.775e-01  -3.325 0.000884 ***
## RestingBP        -2.045e-05  6.848e-03  -0.003 0.997617    
## Cholesterol       1.836e-03  2.179e-03   0.843 0.399500    
## FastingBS         1.174e+00  2.974e-01   3.947 7.91e-05 ***
## RestingECGNormal -3.702e-02  2.983e-01  -0.124 0.901239    
## RestingECGST      1.814e-01  3.773e-01   0.481 0.630707    
## MaxHR            -7.093e-03  5.441e-03  -1.304 0.192364    
## ExerciseAnginaY   7.469e-01  2.622e-01   2.849 0.004392 ** 
## Oldpeak           3.913e-01  1.284e-01   3.047 0.002311 ** 
## ST_SlopeFlat      1.323e+00  4.612e-01   2.868 0.004131 ** 
## ST_SlopeUp       -1.081e+00  4.818e-01  -2.244 0.024857 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1009.24  on 733  degrees of freedom
## Residual deviance:  498.01  on 718  degrees of freedom
## AIC: 530.01
## 
## Number of Fisher Scoring iterations: 5

Following the data split, this study employed the Logistic Regression algorithm to construct a heart failure prediction model. We used the train function from the caret package with 5-fold Cross-Validation. This method divides the training data into five subsets and, in turn, fits the model on four of them and validates it on the remaining one, which yields an estimate of how the model performs on unseen data. Because a plain logistic regression has no tuning parameters, cross-validation here serves to estimate performance and does not change the model; the final model is refit on the full training set.
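The mechanics of 5-fold cross-validation for a logistic regression can be written out by hand. The sketch below uses simulated stand-in data (a single predictor `x` and a binary outcome `y`, both invented for illustration) and base R's `glm`; it mirrors the procedure described above but is not the code caret runs internally.

```r
# Simulated stand-in data (not the heart dataset)
set.seed(123)
n <- 200
x <- rnorm(n)
y <- factor(rbinom(n, size = 1, prob = plogis(1.5 * x)))
toy <- data.frame(x = x, y = y)

# Randomly assign each row to one of k folds
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other k - 1 folds, score on the held-out fold
fold_accuracy <- sapply(seq_len(k), function(i) {
  fit  <- glm(y ~ x, data = toy[fold_id != i, ], family = binomial)
  prob <- predict(fit, newdata = toy[fold_id == i, ], type = "response")
  pred <- ifelse(prob > 0.5, "1", "0")
  mean(pred == as.character(toy$y[fold_id == i]))
})

# Mean held-out accuracy, analogous to the "Accuracy" value caret reports
cv_accuracy <- mean(fold_accuracy)
print(round(fold_accuracy, 3))
print(round(cv_accuracy, 3))

# The model that is finally kept is refit on all rows
final_fit <- glm(y ~ x, data = toy, family = binomial)
```

Each accuracy value is computed on rows the corresponding fit never saw, which is why the averaged figure is a held-out estimate and not a training accuracy.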

Based on the coefficients output of the logistic regression model, several clinical features demonstrated significant statistical importance:

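Sex (SexM): Being male is associated with significantly higher odds of heart failure (p < 0.001).
Chest Pain Type: Asymptomatic chest pain (ASY) is the reference level. Relative to ASY, atypical angina (ATA), non-anginal pain (NAP), and typical angina (TA) all have significantly negative coefficients (p < 0.001), so ASY carries the highest risk weight.
Fasting Blood Sugar (FastingBS): Elevated fasting blood sugar significantly increases the predicted probability of the disease (p < 0.001).
Exercise-Induced Angina (ExerciseAnginaY): This indicator significantly increases the predicted probability of the disease (p < 0.01).
Oldpeak: Larger values are associated with a significantly higher predicted probability of the disease (p < 0.01).
ST Slope: Relative to the reference level (Down), a flat slope (ST_SlopeFlat) acts as a strong positive predictor for heart failure (p < 0.01), while an upward slope (ST_SlopeUp) is a negative predictor (p < 0.05).
Age, RestingBP, Cholesterol, RestingECG, and MaxHR are not statistically significant once the other features are included in the model.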
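Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below does this for a few coefficients and standard errors copied from the model summary; the vectors `est` and `se` are typed in by hand, and the Wald interval (estimate ± 1.96 standard errors) is a common approximation, not something the report itself computed.

```r
# Estimates and standard errors copied from summary(logit_model$finalModel)
est <- c(SexM = 1.694, ChestPainTypeATA = -1.887, FastingBS = 1.174,
         ExerciseAnginaY = 0.7469, Oldpeak = 0.3913,
         ST_SlopeFlat = 1.323, ST_SlopeUp = -1.081)
se  <- c(SexM = 0.3087, ChestPainTypeATA = 0.3613, FastingBS = 0.2974,
         ExerciseAnginaY = 0.2622, Oldpeak = 0.1284,
         ST_SlopeFlat = 0.4612, ST_SlopeUp = 0.4818)

# Odds ratios with approximate 95% Wald confidence limits
odds_ratio <- data.frame(
  OR    = exp(est),
  lower = exp(est - 1.96 * se),
  upper = exp(est + 1.96 * se)
)
print(round(odds_ratio, 2))
```

For example, the SexM coefficient of 1.694 corresponds to an odds ratio of about 5.4, meaning the model estimates the odds of heart disease for male patients at roughly five times those of female patients with otherwise identical features. Each ratio is relative to the reference level of its variable.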

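The model’s Akaike Information Criterion (AIC) value is 530.01. AIC is only meaningful relative to competing models fitted to the same data, so it serves here as a baseline for future comparisons. The residual deviance (498.01) is far below the null deviance (1009.24), indicating that the predictors substantially improve the fit over an intercept-only model. Additionally, the mean accuracy on the held-out folds during cross-validation reached approximately 84.88%, indicating that the model captures the main clinical patterns within the data.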
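The deviance and AIC figures in the model summary are linked by simple arithmetic, which the sketch below reproduces from the printed values. It assumes ungrouped binary data, for which the residual deviance equals minus twice the log-likelihood; the likelihood-ratio test and the McFadden pseudo R-squared are additions for illustration and were not part of the original output.

```r
# Values copied from summary(logit_model$finalModel)
null_dev  <- 1009.24   # intercept-only model, 733 df
resid_dev <- 498.01    # fitted model, 718 df
n_obs     <- 734
n_coef    <- 16        # intercept + 15 slope terms in the coefficient table

# Residual degrees of freedom and AIC = deviance + 2 * number of parameters
resid_df <- n_obs - n_coef
aic      <- resid_dev + 2 * n_coef

# Likelihood-ratio test of the fitted model against the intercept-only model
lr_stat <- null_dev - resid_dev
lr_df   <- n_coef - 1
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# McFadden pseudo R-squared (one of several pseudo R-squared definitions)
mcfadden_r2 <- 1 - resid_dev / null_dev

print(c(resid_df = resid_df, AIC = aic))
print(c(LR = lr_stat, df = lr_df, p = lr_p))
print(round(mcfadden_r2, 3))
```

The recomputed AIC and residual degrees of freedom match the summary (530.01 and 718), and the drop in deviance of about 511 on 15 degrees of freedom confirms that the predictors jointly add a great deal over the intercept-only model.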

5.3 Model Evaluation

# Test Set Prediction
predictions <- predict(logit_model, newdata = test_data)

# Confusion Matrix (including metrics such as Sensitivity, Specificity, PPV, NPV, etc.)
conf_matrix <- confusionMatrix(
  predictions, 
  test_data$HeartDisease,
  positive = "1"  # Define the positive class as "Having Heart Failure"
)

# Output the Confusion Matrix
print(conf_matrix)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 72 10
##          1 10 91
##                                          
##                Accuracy : 0.8907         
##                  95% CI : (0.8363, 0.932)
##     No Information Rate : 0.5519         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.779          
##                                          
##  Mcnemar's Test P-Value : 1              
##                                          
##             Sensitivity : 0.9010         
##             Specificity : 0.8780         
##          Pos Pred Value : 0.9010         
##          Neg Pred Value : 0.8780         
##              Prevalence : 0.5519         
##          Detection Rate : 0.4973         
##    Detection Prevalence : 0.5519         
##       Balanced Accuracy : 0.8895         
##                                          
##        'Positive' Class : 1              
##

# Extract and Highlight Accuracy Rate
accuracy <- conf_matrix$overall["Accuracy"]

cat("Test Set Accuracy: ", round(accuracy * 100, 2), "%\n")

## Test Set Accuracy:  89.07 %

In the model evaluation phase, we validated the trained logistic regression model using an independent test set. The results show that the model achieved a predictive accuracy of 89.07% on the test set (95% CI: 0.836, 0.932), well above the no-information rate of 55.19%. The test accuracy is somewhat higher than the cross-validated estimate (84.88%), which lies within this confidence interval; with only 183 test samples, the point estimate should be read with that uncertainty in mind.
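The confidence interval and the comparison against the no-information rate can be reproduced from the counts alone. The sketch below assumes an exact binomial test via base R's `binom.test`, which is the method caret's documentation describes for the accuracy interval; the counts (163 correct out of 183, with 101 actual positives) are read off the confusion matrix.

```r
# Counts read off the confusion matrix
n_correct <- 72 + 91          # correctly classified test cases
n_total   <- 72 + 10 + 10 + 91
n_pos     <- 10 + 91          # actual positives (majority class)

accuracy <- n_correct / n_total

# Exact (Clopper-Pearson) 95% confidence interval for the accuracy
acc_ci <- binom.test(n_correct, n_total)$conf.int

# One-sided test: is accuracy greater than the no-information rate?
nir   <- n_pos / n_total
p_nir <- binom.test(n_correct, n_total, p = nir,
                    alternative = "greater")$p.value

print(round(c(accuracy = accuracy, lower = acc_ci[1], upper = acc_ci[2]), 4))
print(round(nir, 4))
print(p_nir)
```

The no-information rate is the accuracy obtained by always predicting the majority class, so it is the natural baseline a classifier has to beat.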

Detailed analysis of the confusion matrix reveals that the model maintains an excellent balance in classification:

Sensitivity = 90.10%: Meaning the model can correctly capture the vast majority of true positive cases, significantly reducing the possibility of missed clinical diagnoses.
Specificity = 87.80%: Reflecting the model’s ability to identify healthy individuals and effectively controlling the rate of false positives.
Kappa Coefficient = 0.779: Further confirms a high level of agreement between the model’s predictions and actual clinical diagnoses.
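Each of these metrics follows directly from the four cells of the confusion matrix. The sketch below recomputes them by hand from the counts in the output above, with class "1" as the positive class; it also derives the F1 score, which the printed output does not include.

```r
# Confusion matrix from the output above (rows = prediction, columns = reference)
cm <- matrix(c(72, 10, 10, 91), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

tp <- cm["1", "1"]; tn <- cm["0", "0"]
fp <- cm["1", "0"]; fn <- cm["0", "1"]
n  <- sum(cm)

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)   # recall for the positive class
specificity <- tn / (tn + fp)
ppv         <- tp / (tp + fp)   # precision
npv         <- tn / (tn + fn)
f1          <- 2 * ppv * sensitivity / (ppv + sensitivity)

# Cohen's kappa: observed agreement corrected for chance agreement
p_observed <- accuracy
p_expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa      <- (p_observed - p_expected) / (1 - p_expected)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, ppv = ppv, npv = npv,
              f1 = f1, kappa = kappa), 4))
```

Because the number of false positives equals the number of false negatives (10 each), sensitivity coincides with the positive predictive value and specificity with the negative predictive value in this particular test set.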

The logistic regression model demonstrates robust performance in handling multi-dimensional clinical features. Its high Positive Predictive Value (90.10%) and Negative Predictive Value (87.80%) suggest that the model can serve as an auxiliary tool for clinicians during preliminary cardiovascular screenings, providing objective data support for the early and accurate prediction of heart failure risks. Both predictive values depend on disease prevalence, which is about 55% in this sample, so they may differ in populations where the disease is less common.

6 Analysis Conclusion

6.1 Key Findings from Data and Regression

Through a data preprocessing workflow that included the removal of a physiologically impossible blood pressure record and median imputation for missing cholesterol values, this study established a cleaned dataset of 917 clinical observations. The exploratory analysis reveals a patient demographic primarily composed of middle-aged and elderly individuals, with a mean age of 53.51 years. The subsequent regression analysis identifies a statistically significant inverse relationship between age and maximum heart rate (MaxHR). With a correlation coefficient of -0.38 and a high degree of statistical significance (p < 0.001), the model estimates a decline of approximately 1.03 bpm for every additional year of age, although age alone explains only about 14.6% of the variation in MaxHR. These findings provide a data-driven reference for assessing cardiovascular load capacity across different age cohorts in clinical settings.

6.2 Predictive Modeling Performance

In the classification phase, a logistic regression model was developed, with 5-fold cross-validation used to estimate its performance on unseen data (mean accuracy 84.88%). The model demonstrated strong diagnostic performance, achieving a predictive accuracy of 89.07% on an independent test set. Detailed performance metrics, including a sensitivity of 90.10% and a specificity of 87.80%, indicate that the model maintains a good balance between capturing true positive cases and controlling false alarms. Furthermore, the analysis identified key clinical predictors, specifically being male, asymptomatic chest pain, elevated fasting blood sugar, exercise-induced angina, and a flat ST slope, as being significantly associated with higher heart failure risk. The high Kappa coefficient (0.779) further supports the strong agreement between model predictions and actual clinical diagnoses.

6.3 Clinical Implications and Future Work

The results of this dual-approach analysis underline the potential of machine learning as a pivotal auxiliary tool in cardiovascular risk assessment. By providing high positive and negative predictive values, the model offers objective data support that can assist clinicians in early-stage screenings and more accurate heart failure risk stratification. While the current logistic regression framework is highly effective, future research should explore more complex non-linear algorithms, such as Random Forest or Support Vector Machines, to further refine predictive precision. Additionally, integrating a broader spectrum of clinical features could enhance the model’s comprehensive understanding of disease progression.