Cardiovascular diseases (CVDs) are the leading cause of death worldwide, accounting for an estimated 31% of total global deaths each year. Heart failure is a cardiovascular disease with high morbidity and mortality across the globe. Early and accurate prediction of heart failure risk can help clinicians formulate timely treatment plans, thereby reducing both the risk of disease progression and healthcare costs. Traditional diagnosis of heart failure relies on clinicians’ experience and the evaluation of single indicators, which is subjective and limited in accuracy. With advances in machine learning and data analytics, predictive models based on multi-dimensional clinical features have become important tools for cardiovascular disease risk assessment.
The dataset used in this project is derived from the heart-failure-prediction dataset, which integrates samples from multiple clinical studies on cardiovascular diseases. It contains 918 observations of 12 variables (11 clinical features and one heart disease diagnosis outcome) and is a widely used public dataset for heart failure prediction modeling.
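This study focuses on two core analytical objectives: (1) quantifying how maximum heart rate (MaxHR) changes with age using linear regression, and (2) predicting heart disease status from the clinical features using logistic regression.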
To meet the requirements of subsequent analysis, data cleaning needs to be performed on the dataset. This mainly involves missing value imputation, outlier handling, and data type conversion.
First and foremost, we need to gain an understanding of the basic information about the dataset. The dataset contains a total of 12 fields, 11 of which are feature fields and 1 is a target label field. Below is the descriptive information for each field:
| Attribute | Description |
|---|---|
| Age | age of the patient [years] |
| Sex | sex of the patient [M: Male, F: Female] |
| ChestPainType | chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic] |
| RestingBP | resting blood pressure [mm Hg] |
| Cholesterol | serum cholesterol [mg/dl] |
| FastingBS | fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise] |
| RestingECG | resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes’ criteria] |
| MaxHR | maximum heart rate achieved [Numeric value between 60 and 202] |
| ExerciseAngina | exercise-induced angina [Y: Yes, N: No] |
| Oldpeak | oldpeak = ST [Numeric value measured in depression] |
| ST_Slope | the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping] |
| HeartDisease | output class [1: heart disease, 0: Normal] |
## 'data.frame': 918 obs. of 12 variables:
## $ Age : int 40 49 37 48 54 39 45 54 37 48 ...
## $ Sex : chr "M" "F" "M" "F" ...
## $ ChestPainType : chr "ATA" "NAP" "ATA" "ASY" ...
## $ RestingBP : int 140 160 130 138 150 120 130 110 140 120 ...
## $ Cholesterol : int 289 180 283 214 195 339 237 208 207 284 ...
## $ FastingBS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ RestingECG : chr "Normal" "Normal" "ST" "Normal" ...
## $ MaxHR : int 172 156 98 108 122 170 170 142 130 120 ...
## $ ExerciseAngina: chr "N" "N" "N" "Y" ...
## $ Oldpeak : num 0 1 0 1.5 0 0 0 0 1.5 0 ...
## $ ST_Slope : chr "Up" "Flat" "Up" "Flat" ...
## $ HeartDisease : int 0 1 0 1 0 0 0 0 1 0 ...
## Age Sex ChestPainType RestingBP
## Min. :28.00 Length:918 Length:918 Min. : 0.0
## 1st Qu.:47.00 Class :character Class :character 1st Qu.:120.0
## Median :54.00 Mode :character Mode :character Median :130.0
## Mean :53.51 Mean :132.4
## 3rd Qu.:60.00 3rd Qu.:140.0
## Max. :77.00 Max. :200.0
## Cholesterol FastingBS RestingECG MaxHR
## Min. : 0.0 Min. :0.0000 Length:918 Min. : 60.0
## 1st Qu.:173.2 1st Qu.:0.0000 Class :character 1st Qu.:120.0
## Median :223.0 Median :0.0000 Mode :character Median :138.0
## Mean :198.8 Mean :0.2331 Mean :136.8
## 3rd Qu.:267.0 3rd Qu.:0.0000 3rd Qu.:156.0
## Max. :603.0 Max. :1.0000 Max. :202.0
## ExerciseAngina Oldpeak ST_Slope HeartDisease
## Length:918 Min. :-2.6000 Length:918 Min. :0.0000
## Class :character 1st Qu.: 0.0000 Class :character 1st Qu.:0.0000
## Mode :character Median : 0.6000 Mode :character Median :1.0000
## Mean : 0.8874 Mean :0.5534
## 3rd Qu.: 1.5000 3rd Qu.:1.0000
## Max. : 6.2000 Max. :1.0000
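Analysis of the dataset summary reveals the following: RestingBP and Cholesterol both have a minimum of 0, which is not physiologically plausible; the mean of Cholesterol (198.8) lies well below its median (223.0), consistent with many zero entries; and five fields (Sex, ChestPainType, RestingECG, ExerciseAngina, ST_Slope) are stored as character type.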
Data cleaning is a core step to ensure the validity of the model. Based on the characteristics of the dataset, this study implements the following cleaning operations:
Outlier Handling: A resting blood pressure of 0 mm Hg is not physiologically possible, so rows with RestingBP = 0 are removed to keep the data consistent with clinical common sense.
Missing Value Imputation: The Cholesterol field contains a large number of 0 values. Clinically, these are not true measurements but placeholders for missing values, so they are replaced with the median of the non-zero Cholesterol values.
Data Type Conversion: All categorical character variables are converted to factor type. The target variable HeartDisease is also converted to a factor so that it can be used to train classification models.
library(dplyr) # provides %>%, filter(), and mutate()
# 1. Filter out rows where RestingBP equals 0 to ensure data validity
df_clean <- df %>% filter(RestingBP > 0)
# 2. Impute missing values in Cholesterol (0 values indicate missing data, not actual measurements)
# Step 1: Calculate median of Cholesterol excluding 0 values (to avoid bias from missing values)
chol_median <- median(df_clean$Cholesterol[df_clean$Cholesterol > 0], na.rm = TRUE)
# Step 2: Replace 0 values with the calculated median
df_clean <- df_clean %>% mutate(Cholesterol = ifelse(Cholesterol == 0, chol_median, Cholesterol))
# 3. Convert categorical variables from character to factor type
df_clean <- df_clean %>%
  mutate(
    Sex = as.factor(Sex),
    ChestPainType = as.factor(ChestPainType),
    RestingECG = as.factor(RestingECG),
    ExerciseAngina = as.factor(ExerciseAngina),
    ST_Slope = as.factor(ST_Slope),
    HeartDisease = as.factor(HeartDisease)
  )
# Preview the processed data
head(df_clean)
# Validate cleaned dataset
cat("Data Dimensions (Rows × Columns):", dim(df_clean)[1], "×", dim(df_clean)[2])
## Data Dimensions (Rows × Columns): 917 × 12
## Age Sex ChestPainType RestingBP Cholesterol
## Min. :28.00 F:193 ASY:496 Min. : 80.0 Min. : 85.0
## 1st Qu.:47.00 M:724 ATA:173 1st Qu.:120.0 1st Qu.:214.0
## Median :54.00 NAP:202 Median :130.0 Median :237.0
## Mean :53.51 TA : 46 Mean :132.5 Mean :243.2
## 3rd Qu.:60.00 3rd Qu.:140.0 3rd Qu.:267.0
## Max. :77.00 Max. :200.0 Max. :603.0
## FastingBS RestingECG MaxHR ExerciseAngina Oldpeak
## Min. :0.0000 LVH :188 Min. : 60.0 N:546 Min. :-2.6000
## 1st Qu.:0.0000 Normal:551 1st Qu.:120.0 Y:371 1st Qu.: 0.0000
## Median :0.0000 ST :178 Median :138.0 Median : 0.6000
## Mean :0.2334 Mean :136.8 Mean : 0.8867
## 3rd Qu.:0.0000 3rd Qu.:156.0 3rd Qu.: 1.5000
## Max. :1.0000 Max. :202.0 Max. : 6.2000
## ST_Slope HeartDisease
## Down: 63 0:410
## Flat:459 1:507
## Up :395
##
##
##
After cleaning, the dataset was reduced from 918 to 917 entries: the single row with RestingBP = 0 was removed. Following median imputation of the zero values in the Cholesterol field, the mean rose from 198.8 to 243.2 and the median from 223.0 to 237.0. All character fields, together with the target HeartDisease, were successfully converted to factor type.
library(ggplot2) # plotting functions used in this and the following figures
ggplot(df_clean, aes(x = HeartDisease, fill = HeartDisease)) +
  geom_bar(alpha = 0.8, width = 0.6) +
  scale_fill_manual(values = c("#4299e1", "#e53e3e")) +
  labs(
    title = "Distribution of Heart Disease Cases",
    x = "Heart Disease (0 = No, 1 = Yes)",
    y = "Number of Patients",
    fill = "Heart Disease"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"))
Patients with heart failure account for a larger proportion of the samples (approximately 55%, or 507 of 917), while those without heart failure make up around 45% (410 of 917). The overall class distribution is relatively balanced, with no significant class imbalance issue.
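The class proportions quoted above can be reproduced directly from the counts in the cleaned-data summary (410 without and 507 with heart disease). The snippet below is a minimal base-R sketch; the names `class_counts`, `class_pct`, and `imbalance_ratio` are introduced here for illustration and are not part of the original analysis.

```r
# Class counts taken from the summary of the cleaned data
class_counts <- c("0" = 410, "1" = 507)

# Convert counts to percentages of the 917 cleaned observations
class_pct <- round(100 * class_counts / sum(class_counts), 1)
print(class_pct)

# Ratio of the majority class to the minority class
imbalance_ratio <- max(class_counts) / min(class_counts)
print(round(imbalance_ratio, 2))
```

A majority-to-minority ratio this close to 1 is consistent with the statement that the classes are relatively balanced.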
ggplot(df_clean, aes(x = Age)) +
  geom_histogram(binwidth = 5, fill = "#38b2ac", color = "white", alpha = 0.8) +
  labs(
    title = "Age Distribution of Patients",
    x = "Age (Years)",
    y = "Frequency"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"))
As can be observed from the figure, patients are concentrated in the middle-aged and older range, consistent with the summary statistics (ages 28 to 77, median 54, interquartile range 47 to 60).
library(corrplot) # provides corrplot()
# Extract numeric variables and calculate correlation matrix
num_vars <- df_clean %>% select_if(is.numeric)
cor_matrix <- cor(num_vars, use = "complete.obs")
# Plot correlation heatmap
corrplot(
  cor_matrix,
  method = "color",
  type = "upper",
  tl.col = "black",
  tl.srt = 45,
  col = colorRampPalette(c("#2d3748", "#4299e1", "#ffffff", "#e53e3e", "#9b2c2c"))(100),
  title = "Correlation Matrix of Numeric Variables",
  mar = c(0, 0, 1, 0)
)
As can be observed from the figure, Age shows a moderate negative correlation with MaxHR (maximum heart rate), indicated by the blue cell; the correlation coefficient is about -0.38. The older the patients are, the lower their maximum heart rate tends to be, which aligns with the physiological principle that maximum heart rate decreases with age. The correlations among the other variables are weaker (light-colored or white cells): for instance, the correlation between RestingBP (resting blood pressure) and Age, and the correlations between Cholesterol and the other variables, show no obvious strong association.
The negative correlation between Age and MaxHR provides a data-driven basis for the subsequent regression analysis on predicting maximum heart rate based on age.
The focus of this section is to study how age affects maximum heart rate (MaxHR). Physiologically, age and maximum heart rate are generally negatively correlated, meaning that the older a person is, the lower their maximum heart rate. In the preliminary exploratory data analysis (EDA), the correlation matrix showed a distinct blue area between Age and MaxHR, which indicates a negative correlation of moderate strength. In this section, we explore this relationship further.
# Build a Linear Regression Model
lm_model <- lm(MaxHR ~ Age, data = df_clean)
# Output Detailed Results of the Model
summary(lm_model)
##
## Call:
## lm(formula = MaxHR ~ Age, data = df_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -79.378 -15.904 0.748 18.117 58.717
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 191.98798 4.47893 42.87 <2e-16 ***
## Age -1.03157 0.08243 -12.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.55 on 915 degrees of freedom
## Multiple R-squared: 0.1461, Adjusted R-squared: 0.1452
## F-statistic: 156.6 on 1 and 915 DF, p-value: < 2.2e-16
# Visualizing Regression Fitting Relationships
ggplot(df_clean, aes(x = Age, y = MaxHR)) +
  geom_point(color = "#4299e1", alpha = 0.6, size = 2) +
  geom_smooth(method = "lm", color = "#e53e3e", linewidth = 1.2) +
  labs(
    title = "Age vs Maximum Heart Rate (MaxHR)",
    subtitle = paste("Correlation Coefficient: ", round(cor(df_clean$Age, df_clean$MaxHR), 2)),
    x = "Age (Years)",
    y = "Max Heart Rate (bpm)"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"))
Intercept = 191.99: The fitted maximum heart rate at a theoretical age of 0. This lies far outside the observed age range (28 to 77 years) and has only mathematical significance.
Age Coefficient = -1.03: For each additional year of age, the maximum heart rate is expected to decrease by an average of 1.03 beats per minute.
P-value < 2e-16: Far below 0.05, indicating that the association between age and maximum heart rate is highly statistically significant.
Correlation Coefficient = -0.38: It indicates a moderate negative correlation between the two.
In the regression analysis, we established a simple linear regression model with Age as the independent variable and MaxHR as the dependent variable. The model results indicate that Age has a significant negative association with MaxHR (p < 0.001). Based on the regression coefficient, for every one-year increase in age, a patient’s maximum heart rate decreases by approximately 1.03 bpm on average. However, the R-squared of 0.146 shows that age alone explains only about 15% of the variance in MaxHR, so the model describes an average trend rather than an accurate prediction for individual patients. This finding provides a data-driven basis for clinical assessments of cardiovascular load capacity across different age groups.
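To make the fitted line concrete, the following base-R sketch plugs two example ages into the reported equation and recovers the magnitude of the correlation from the reported R-squared. The coefficients are copied from the model summary; the helper `predict_maxhr` and the example ages are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the reported linear model summary
intercept <- 191.98798
slope_age <- -1.03157

# Hypothetical helper: fitted MaxHR (bpm) for a given age in years
predict_maxhr <- function(age) intercept + slope_age * age

ages <- c(40, 60)
fitted_maxhr <- predict_maxhr(ages)
print(round(fitted_maxhr, 1))

# In simple linear regression, R-squared equals the squared correlation,
# so |r| can be recovered from the reported R-squared of 0.1461
r_magnitude <- sqrt(0.1461)
print(round(r_magnitude, 2))
```

The 20-year gap between the two example ages corresponds to a fitted difference of about 20.6 bpm. Because the residual standard error is 23.55 bpm, individual patients can deviate from these fitted values by a similar amount.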
library(caret) # provides createDataPartition(), train(), and confusionMatrix()
set.seed(123) # Fix the random seed to ensure the reproducibility of results
train_index <- createDataPartition(
  df_clean$HeartDisease,
  p = 0.8,
  list = FALSE,
  times = 1
)
train_data <- df_clean[train_index, ]
test_data <- df_clean[-train_index, ]
cat("Training Set Dimension: ", dim(train_data)[1], "×", dim(train_data)[2], "\n")
## Training Set Dimension: 734 × 12
## Test Set Dimensions: 183 × 12
Prior to constructing the predictive model, this study first performed a stratified random split of the cleaned dataset. To ensure reproducibility, a fixed random seed (123) was set so that the partition is the same in subsequent runs. We used the createDataPartition function to split the cleaned data into a training set and a test set at a ratio of 80:20 while preserving the distribution of the target variable HeartDisease.
According to the code output, the specific details of the dataset split are as follows:
Training Set Size: Contains samples from 734 patients, used for fitting the logistic regression model and learning its parameters.
Test Set Size: Contains samples from 183 patients, serving as held-out data to validate the predictive accuracy of the final model.
Each subset retains all 12 variables (11 clinical features plus the HeartDisease target), ensuring the integrity of the analytical dimensions.
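The idea behind a split that respects the target distribution can be sketched in base R by sampling within each class separately. This is an illustration only, not caret's implementation: the helper `stratified_index` is hypothetical, the labels are rebuilt from the class counts rather than read from the data, and rounding each class size up is an assumption chosen because it reproduces the reported 734 and 183 rows.

```r
set.seed(123)

# Labels rebuilt from the class counts in the cleaned data (410 vs 507);
# only the labels are needed to illustrate the split sizes
labels <- factor(rep(c("0", "1"), times = c(410, 507)))

# Hypothetical helper: sample a fraction p of row indices within each class
stratified_index <- function(y, p) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = ceiling(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_idx <- stratified_index(labels, p = 0.8)
test_idx  <- setdiff(seq_along(labels), train_idx)

# Subset sizes and class proportions in each subset
print(length(train_idx))
print(length(test_idx))
print(round(prop.table(table(labels[train_idx])), 3))
print(round(prop.table(table(labels[test_idx])), 3))
```

Because each class is sampled at the same rate, the share of positive cases stays close to 55% in both subsets, which keeps test-set metrics comparable with the training data.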
# Train the Logistic Regression Model
logit_model <- train(
  HeartDisease ~ .,
  data = train_data,
  method = "glm",
  family = "binomial",
  trControl = trainControl(method = "cv", number = 5) # 5-fold Cross-Validation
)
# Output basic information of the Model
print(logit_model)
## Generalized Linear Model
##
## 734 samples
## 11 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 586, 588, 587, 587, 588
## Resampling results:
##
## Accuracy Kappa
## 0.848827 0.6932526
##
## Call:
## NULL
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.805e+00 1.651e+00 -1.093 0.274275
## Age 1.760e-02 1.472e-02 1.195 0.231925
## SexM 1.694e+00 3.087e-01 5.488 4.07e-08 ***
## ChestPainTypeATA -1.887e+00 3.613e-01 -5.222 1.77e-07 ***
## ChestPainTypeNAP -1.578e+00 2.849e-01 -5.538 3.06e-08 ***
## ChestPainTypeTA -1.588e+00 4.775e-01 -3.325 0.000884 ***
## RestingBP -2.045e-05 6.848e-03 -0.003 0.997617
## Cholesterol 1.836e-03 2.179e-03 0.843 0.399500
## FastingBS 1.174e+00 2.974e-01 3.947 7.91e-05 ***
## RestingECGNormal -3.702e-02 2.983e-01 -0.124 0.901239
## RestingECGST 1.814e-01 3.773e-01 0.481 0.630707
## MaxHR -7.093e-03 5.441e-03 -1.304 0.192364
## ExerciseAnginaY 7.469e-01 2.622e-01 2.849 0.004392 **
## Oldpeak 3.913e-01 1.284e-01 3.047 0.002311 **
## ST_SlopeFlat 1.323e+00 4.612e-01 2.868 0.004131 **
## ST_SlopeUp -1.081e+00 4.818e-01 -2.244 0.024857 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1009.24 on 733 degrees of freedom
## Residual deviance: 498.01 on 718 degrees of freedom
## AIC: 530.01
##
## Number of Fisher Scoring iterations: 5
Following the data split, this study employed logistic regression to construct a heart failure prediction model. To estimate how well the model generalizes and to guard against an overly optimistic assessment, we used the train function from the caret package with 5-fold cross-validation. This method divides the training data into five subsets and, in turn, fits the model on four of them and validates it on the remaining one, giving an estimate of performance on unseen data.
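The mechanics of k-fold cross-validation can be sketched in a few lines of base R. The example below runs on synthetic stand-in data (the data frame `toy` and its variables are hypothetical) and uses plain random folds, so it shows the procedure only and does not reproduce caret's resampling or the reported accuracy.

```r
set.seed(123)

# Synthetic stand-in data (hypothetical): one predictor, binary outcome
n <- 500
x <- rnorm(n)
y <- factor(rbinom(n, size = 1, prob = plogis(1.5 * x)))
toy <- data.frame(x = x, y = y)

# Randomly assign each row to one of k folds of equal size
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))

fold_accuracy <- sapply(seq_len(k), function(i) {
  # Fit on the k - 1 training folds, then score the held-out fold
  fit <- glm(y ~ x, data = toy[fold_id != i, ], family = binomial)
  held_out <- toy[fold_id == i, ]
  prob <- predict(fit, newdata = held_out, type = "response")
  pred <- ifelse(prob > 0.5, "1", "0")
  mean(pred == as.character(held_out$y))
})

# The cross-validated accuracy is the mean over the held-out folds
print(round(fold_accuracy, 3))
print(round(mean(fold_accuracy), 3))
```

Every observation is scored exactly once by a model that never saw it during fitting, which is what makes the averaged accuracy an estimate of out-of-sample performance.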
Based on the coefficient output of the logistic regression model, several clinical features are statistically significant:
Sex (SexM): Being male is significantly associated with an increased risk of heart failure (p < 0.001).
Chest Pain Type: Asymptomatic chest pain (ASY) is the reference level and carries the highest risk; atypical angina (ATA), non-anginal pain (NAP), and typical angina (TA) all have significantly negative coefficients relative to it.
Exercise-Induced Angina (ExerciseAnginaY): This indicator significantly increases the predicted probability of the disease (p < 0.01).
ST Slope: Relative to the reference level (Down), a flat slope (ST_SlopeFlat) is a strong positive predictor of heart failure, while an upsloping segment (ST_SlopeUp) is associated with lower risk.
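Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to interpret. The sketch below uses a selection of coefficients copied from the model summary; the vector names `log_odds` and `odds_ratio` are introduced here for illustration.

```r
# Log-odds coefficients copied from the reported model summary
log_odds <- c(
  SexM             =  1.694,
  ChestPainTypeATA = -1.887,
  FastingBS        =  1.174,
  ExerciseAnginaY  =  0.7469,
  Oldpeak          =  0.3913,
  ST_SlopeFlat     =  1.323
)

# Exponentiating a logistic coefficient gives an odds ratio
odds_ratio <- exp(log_odds)
print(round(odds_ratio, 2))
```

For example, the SexM coefficient of 1.694 corresponds to an odds ratio of about 5.4: holding the other predictors fixed, the estimated odds of heart disease for male patients are roughly five times those for female patients. Odds ratios below 1, such as the one for ChestPainTypeATA, indicate lower odds than the reference level.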
The model’s Akaike Information Criterion (AIC) is 530.01. AIC is mainly useful for comparing candidate models fitted to the same data, so this value serves as a baseline for later comparisons rather than as an absolute measure of fit. The average accuracy on the held-out folds during cross-validation was approximately 84.88%, indicating that the model captures the main clinical patterns in the data.
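The reported AIC can be checked by hand from the other figures in the model summary. For a binomial model fitted to individual 0/1 outcomes, the residual deviance equals minus twice the log-likelihood, so AIC is the residual deviance plus twice the number of estimated coefficients. The variable names below are introduced for illustration.

```r
# Values copied from the reported model summary
residual_deviance <- 498.01
null_deviance     <- 1009.24
n_coefficients    <- 16   # intercept plus 15 predictor terms

# AIC = residual deviance + 2 * (number of estimated coefficients)
aic_check <- residual_deviance + 2 * n_coefficients
print(aic_check)

# Share of the null deviance removed by the predictors
deviance_explained <- 1 - residual_deviance / null_deviance
print(round(deviance_explained, 3))
```

The predictors remove roughly half of the null deviance, which gives an absolute sense of fit that the AIC value alone does not provide.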
# Test Set Prediction
predictions <- predict(logit_model, newdata = test_data)
# Confusion Matrix (including metrics such as Precision, Recall, F1, etc.)
conf_matrix <- confusionMatrix(
  predictions,
  test_data$HeartDisease,
  positive = "1" # Define the positive class as "Having Heart Failure"
)
# Output the Confusion Matrix
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 72 10
## 1 10 91
##
## Accuracy : 0.8907
## 95% CI : (0.8363, 0.932)
## No Information Rate : 0.5519
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.779
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9010
## Specificity : 0.8780
## Pos Pred Value : 0.9010
## Neg Pred Value : 0.8780
## Prevalence : 0.5519
## Detection Rate : 0.4973
## Detection Prevalence : 0.5519
## Balanced Accuracy : 0.8895
##
## 'Positive' Class : 1
##
# Extract and Highlight Accuracy Rate
accuracy <- conf_matrix$overall["Accuracy"]
cat("Test Set Accuracy: ", round(accuracy * 100, 2), "%\n")
## Test Set Accuracy: 89.07 %
In the model evaluation phase, we validated the trained logistic regression model on the held-out test set. The model achieved an accuracy of 89.07% (95% CI: 0.836 to 0.932), well above the no-information rate of 55.19%. This indicates that the model can effectively identify potential heart failure risk, although the interval is fairly wide because the test set contains only 183 patients.
Detailed analysis of the confusion matrix shows that the model is well balanced across the two classes:
Sensitivity = 90.10%: The model correctly identifies 91 of the 101 patients with heart disease, which limits missed diagnoses to 10 cases.
Specificity = 87.80%: The model correctly identifies 72 of the 82 patients without heart disease, which limits false positives to 10 cases.
Kappa Coefficient = 0.779: Indicates substantial agreement between the model’s predictions and the actual clinical diagnoses beyond what chance alone would produce.
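Each of these metrics follows directly from the four cells of the confusion matrix. The base-R sketch below recomputes them from the reported counts; the variable names are introduced here for illustration.

```r
# Cell counts from the reported confusion matrix (positive class = "1")
tp <- 91; fn <- 10   # actual positives: predicted 1 / predicted 0
fp <- 10; tn <- 72   # actual negatives: predicted 1 / predicted 0
n  <- tp + fn + fp + tn

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
ppv         <- tp / (tp + fp)
npv         <- tn / (tn + fn)

# Cohen's kappa: observed agreement corrected for chance agreement
p_chance    <- ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n^2
cohen_kappa <- (accuracy - p_chance) / (1 - p_chance)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, ppv = ppv, npv = npv,
              kappa = cohen_kappa), 4))
```

Sensitivity equals the positive predictive value here, and specificity equals the negative predictive value, only because the numbers of false positives and false negatives happen to be the same (10 each); in general the two pairs differ.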
The logistic regression model performs well on multi-dimensional clinical features. Its Positive Predictive Value (90.10%) and Negative Predictive Value (87.80%) suggest that the model can serve as an auxiliary tool for clinicians during preliminary cardiovascular screenings, providing objective data support for the early prediction of heart failure risk. Both values depend on disease prevalence, which is about 55% in this sample.
Through a data preprocessing workflow that included the removal of a physiologically implausible blood pressure record and median imputation for missing cholesterol values, this study established a cleaned dataset of 917 clinical observations. The exploratory analysis reveals a patient demographic primarily composed of middle-aged and elderly individuals, with a mean age of 53.51 years. The subsequent regression analysis identifies a statistically significant inverse relationship between age and maximum heart rate (MaxHR). With a correlation coefficient of -0.38 and a high degree of statistical significance (p < 0.001), the model quantifies an average decline of approximately 1.03 bpm for every year of aging. These findings provide a data-driven foundation for assessing cardiovascular load capacity across different age cohorts in clinical settings.
In the classification phase, a logistic regression model was developed, with 5-fold cross-validation used to estimate its generalizability. The model showed strong diagnostic performance, achieving a predictive accuracy of 89.07% on a held-out test set. Detailed performance metrics, including a sensitivity of 90.10% and a specificity of 87.80%, indicate that the model maintains a good balance between capturing true positive cases and controlling false alarms. Furthermore, the analysis identified key clinical predictors, specifically being male, asymptomatic chest pain, exercise-induced angina, and a flat ST slope, as having the most substantial impact on heart failure risk. The Kappa coefficient (0.779) further supports the strong agreement between model predictions and actual clinical diagnoses.
The results of this dual-approach analysis underline the potential of machine learning as a pivotal auxiliary tool in cardiovascular risk assessment. By providing high positive and negative predictive values, the model offers objective data support that can assist clinicians in early-stage screenings and more accurate heart failure risk stratification. While the current logistic regression framework is highly effective, future research should explore more complex non-linear algorithms, such as Random Forest or Support Vector Machines, to further refine predictive precision. Additionally, integrating a broader spectrum of clinical features could enhance the model’s comprehensive understanding of disease progression.