The goal of this analysis is to develop a machine learning model that accurately predicts the survival of patients after heart failure based on clinical data. This will be useful in helping healthcare providers make informed decisions about treatment plans, allocate resources more effectively, and provide personalized care for individuals based on their likelihood of survival. By identifying key clinical factors that influence outcomes, the model can assist in early intervention, improve patient management, and enhance overall patient care and outcomes.
The problem to be solved is predicting whether a patient will survive or not after being diagnosed with heart failure, based on various clinical factors.
The dataset used in this analysis consists of medical records from 299 heart failure patients, including 105 women and 194 men, aged between 40 and 95 years. The data was collected from the Faisalabad Institute of Cardiology and the Allied Hospital in Faisalabad, Punjab, Pakistan, during April–December 2015. It includes both continuous and binary numeric variables, covering various clinical factors relevant to heart failure outcomes. The dataset has no missing values, and is publicly available from the UCI Machine Learning Repository.
Heart Failure Clinical Records [Dataset]. (2020). UCI Machine Learning Repository. https://doi.org/10.24432/C5Z89R.
| x | |
|---|---|
| age | 1.106 |
| anaemia | 1.087 |
| creatinine_phosphokinase | 1.066 |
| diabetes | 1.064 |
| ejection_fraction | 1.068 |
| high_blood_pressure | 1.068 |
| platelets | 1.046 |
| serum_creatinine | 1.081 |
| serum_sodium | 1.102 |
| sex | 1.338 |
| smoking | 1.285 |
| time | 1.138 |
Outliers Detection
I used boxplots to detect outliers in the continuous variables. These graphs revealed a few potential outliers in variables like creatinine_phosphokinase, serum_creatine, and platelets, where certain values significantly deviated from the rest of the data.
Distributions
To assess the distributions of both continuous and binary variables, I created histograms (within a Pairwise Plot) and bar plots. The histograms suggested that several variables, such as creatinine_phosphokinase and serum_creatine had skewed distributions, which may require transformation for better model performance. The binary variables were plotted as bar charts, revealing an imbalance in the target variable, DEATH_EVENT, as well as sex and smoking.
Correlations and Collinearity
I also evaluated the correlation between all of the variables using a Spearman correlation matrix. The results showed moderate correlations between certain variables, such as DEATH_EVENT and time, and sex and smoking. To assess these relationships further, I computed the Variance Inflation Factor (VIF) for each feature, which indicated that multicollinearity is not a significant concern, as all VIF values were below 1.5 and well within an acceptable range.
Missing Values
The dataset is known to have no missing values, which is ideal for model training and ensures that we do not need to impute any missing data. However, if there were to be missing values, I would consider using the mean for imputing continuous variables and the mode for categorical data.
Imputation
While the dataset has no missing data, a comparison of imputed versus full data could be useful in understanding the impact of missing data imputation on model performance. If data were randomly removed and then imputed, the performance of the model could differ due to the introduction of bias or errors from the imputation process. For example, imputing with the mean or mode may reduce variance in the data, leading to underfitting, or removing outliers could eliminate valuable information that helps the model differentiate between normal and extreme cases.
Outliers
I have opted not to remove outliers from the dataset I will use for model development. This decision stems from the fact that the data comes from a clinical context, and outliers could potentially represent extreme but medically relevant cases, such as rare conditions or severe health episodes that significantly affect the outcome. Removing these could skew the results and reduce the generalizability of the model. No cases will be removed from the analysis.
| x |
|---|
| age |
| creatinine_phosphokinase |
| ejection_fraction |
| platelets |
| serum_creatinine |
| serum_sodium |
| time |
Distribution
To assess normality, I performed the Shapiro-Wilk test on each continuous feature and identified those that did not follow a normal distribution. Based on these findings, I applied appropriate transformations to make the distributions more symmetric and Gaussian-like where needed. Although my selected machine learning algorithms do not require data to follow a Gaussian distribution, I still wanted to explore whether normalizing the features would help with model convergence, particularly for algorithms that are sensitive to feature scaling, such as Support Vector Machines (SVM).
Scaling
For normalization, I used Min-Max scaling, which rescales the data to a range between 0 and 1, ensuring that all features are on the same scale. This is useful when features have different units or scales, which could otherwise dominate the model’s learning process. Additionally, the transformed features will help reduce issues arising from extreme values or skewed distributions, allowing the algorithms to perform more effectively. Overall, the goal is to strike a balance between preserving the integrity of the data while optimizing it for model training.
I applied principal component analysis (PCA) to the dataset to explore the underlying structure of the data and potentially reduce dimensionality. However, the PCA results reveal that the first few principal components do not explain much of the variance in the data, as indicated by the low eigenvalues and variance explained by each component. This suggests that there may be multiple directions of variation in the dataset, or potentially noise, which makes it difficult to capture meaningful patterns in the reduced dimensions. Since there are only 12 features in the dataset, I have decided not to remove any features at this stage, as each feature could potentially contribute valuable information to the model, especially in a healthcare context where all variables may have clinical significance. It does not seem that new derived features will benefit the model.
| .id | label | variable | value |
|---|---|---|---|
| DEATH_EVENT | DEATH_EVENT | No | 203 (67.89%) |
| DEATH_EVENT | DEATH_EVENT | Yes | 96 (32.11%) |
| .id | label | variable | value |
|---|---|---|---|
| DEATH_EVENT | DEATH_EVENT | No | 144 (68.57%) |
| DEATH_EVENT | DEATH_EVENT | Yes | 66 (31.43%) |
| .id | label | variable | value |
|---|---|---|---|
| DEATH_EVENT | DEATH_EVENT | No | 59 (66.29%) |
| DEATH_EVENT | DEATH_EVENT | Yes | 30 (33.71%) |
In this analysis, the dataset is split into training and test sets using a 70/30 ratio. The createDataPartition() function from the caret package ensures that the distribution of the binary target variable, DEATH_EVENT, is preserved in both the training and test datasets. By setting the seed for reproducibility, we ensure that the data split can be consistently reproduced across different runs. The splits are not identical in proportion due to the relatively small size of the dataset.
Logistic Regression is a fundamental statistical model used for binary classification tasks. It works by modeling the probability of a binary outcome using a logistic function, where the input features are combined linearly, and the output is a value between 0 and 1. This model is appropriate for my dataset because it can easily handle numerical predictors and can provide probabilities for each class.
Random Forest is an ensemble learning method that builds multiple decision trees and combines their outputs to make a final prediction. Each tree is trained on a random subset of the data with random feature selection, which reduces overfitting and increases robustness. This method is appropriate for my dataset because it can handle complex, non-linear relationships between features and the target variable, without the need for explicit feature engineering. It is also less sensitive to outliers compared to other models. Additionally, Random Forests can rank feature importance, helping to identify which clinical factors are most predictive of patient survival.
SVM is a powerful classification algorithm that works by finding the hyperplane that best separates different classes in the feature space. SVM tries to maximize the margin between the closest points (support vectors) of each class, which helps improve generalization. This algorithm is appropriate for my dataset because it works well with high-dimensional data and can handle both linear and non-linear relationships.
The basic logistic regression model was trained using the original dataset without any advanced tuning. Predictions were made on the holdout test set, and performance metrics were computed. This model may underperform because it doesn’t leverage techniques like hyperparameter tuning, bagging, or boosting, which can help reduce overfitting and improve generalization.
For this model, I used hyperparameter tuning to optimize the logistic regression model using the glmnet method. This model was trained with a grid search for the hyperparameters alpha and lambda. Tuning allows the model to find the best balance between bias and variance, improving its predictive capability.
| Hyperparameter | Value |
|---|---|
| Alpha | 0.25 |
| Lambda | 0.10 |
In the bagging model, I used bootstrapping to create multiple versions of the logistic regression model, each trained on different random samples of the data. Bagging reduces variance by averaging the predictions from multiple models, which helps improve the robustness of the model and avoid overfitting.
The boosted logistic regression model was created using XGBoost, an implementation of gradient boosting. Boosting aims to reduce bias by focusing on the errors made by the previous models and iteratively correcting them.
| Model | Accuracy | Precision | Recall | F1_Score | TPR | TNR | Holdout_AUC | CV_AUC |
|---|---|---|---|---|---|---|---|---|
| Logistic Regression (Original) | 0.831 | 0.867 | 0.881 | 0.874 | 0.881 | 0.733 | 0.859 | 0.863 |
| Logistic Regression (Tuned) | 0.753 | 0.761 | 0.915 | 0.831 | 0.915 | 0.433 | 0.869 | 0.897 |
| Logistic Regression (Bagging) | 0.753 | 0.761 | 0.915 | 0.831 | 0.915 | 0.433 | 0.869 | 0.880 |
| Logistic Regression (Boosting) | 0.809 | 0.828 | 0.898 | 0.862 | 0.898 | 0.633 | 0.766 | 0.898 |
A comparison of the four logistic regression models reveals that the Tuned Logistic Regression model is the best overall, achieving the highest CV AUC of 0.897, indicating strong generalization across the training set. It is important to note that for all models in this analysis, the “Positive” case is when the DEATH_EVENT is “No”. While it sacrifices some accuracy (0.753) and True Negative Rate (TNR) (0.433) in favor of a significantly improved recall (0.915), it delivers the highest F1 score (0.831), making it the most balanced and effective at identifying positive cases. The Original Logistic Regression model offers solid performance with a high accuracy (r round(accuracy, 3)) but falls short in TNR and AUC. The Bagging model mirrors the tuned model in recall (0.915) and F1 score (0.831) but has slightly lower CV AUC (0.88), while the Boosted Logistic Regression model, though performing well in terms of accuracy (0.809) and precision (0.828), has the lowest Holdout AUC (0.766), indicating it may not generalize as effectively as the tuned model.
The initial Random Forest model was trained using the original dataset, applying 500 trees and allowing the model to learn from multiple decision trees, each trained on a different subset of the data. This form of ensemble learning reduces overfitting by averaging predictions from multiple models, making the overall predictions more robust and generalized.
For the tuned Random Forest model, hyperparameter tuning was performed to optimize the mtry parameter, which determines the number of features considered for each split in the decision trees. The best mtry value was identified through cross-validation.
| Hyperparameter | Value |
|---|---|
| Best mtry | 2 |
While bagging (Bootstrap Aggregating) is a core component of Random Forest, it is not necessary to apply bagging explicitly in this context. Random Forest inherently uses bagging, where each tree in the forest is trained on a random subset of the data. As a result, applying additional bagging methods would be redundant.
Boosting involves sequentially training models where each new model attempts to correct the errors made by the previous model. However, boosting is not typically applied to Random Forest because it is a parallel model. Unlike boosting algorithms like XGBoost, Random Forest models work in parallel (each tree is built independently), and boosting methods (which rely on the sequential training of models) are not directly compatible with Random Forests. Additionally, the combination of bagging and boosting methods in one model (like trying to “boost” a Random Forest) is unnecessary and could lead to inefficiencies.
| Model | Accuracy | Precision | Recall | F1_Score | TPR | TNR | Holdout_AUC | CV_AUC |
|---|---|---|---|---|---|---|---|---|
| Original Random Forest | 0.820 | 0.864 | 0.864 | 0.864 | 0.864 | 0.733 | 0.905 | 0.898 |
| Tuned Random Forest | 0.854 | 0.871 | 0.915 | 0.893 | 0.915 | 0.733 | 0.901 | 0.905 |
| Feature | Importance | |
|---|---|---|
| time | time | 29.644 |
| serum_creatinine | serum_creatinine | 11.183 |
| age | age | 9.809 |
| ejection_fraction | ejection_fraction | 8.828 |
| platelets | platelets | 8.123 |
| creatinine_phosphokinase | creatinine_phosphokinase | 7.340 |
| serum_sodium | serum_sodium | 6.276 |
| high_blood_pressure | high_blood_pressure | 1.727 |
| diabetes | diabetes | 1.416 |
| anaemia | anaemia | 1.385 |
| sex | sex | 1.340 |
| smoking | smoking | 1.136 |
The comparison between the Original and Tuned Random Forest models reveals a significant improvement in performance after hyperparameter tuning. The Tuned Random Forest model, with an optimized mtry value of 2, achieves a higher accuracy (0.854) and F1 score (0.893) compared to the Original model, indicating better overall balance between precision and recall. Notably, the Tuned model excels in recall (0.915) with a significant increase in sensitivity (TPR) to 0.915, making it more effective at identifying positive cases. Despite these gains, the Original Random Forest model still performs well, with an accuracy of 0.82 and a respectable TNR (specificity) of 0.733. In terms of generalization, both models perform well, with the Tuned Random Forest model achieving the highest CV AUC of 0.905, indicating better cross-validated performance. The Original Random Forest model has an AUC of 0.905, reflecting solid performance but not quite matching the Tuned model’s enhanced generalization capability.
This is the baseline SVM model. It uses the linear kernel by default, which assumes that the data can be separated by a straight line (or hyperplane in higher dimensions). The model is trained on the training dataset to predict whether the DEATH_EVENT is “Yes” or “No” based on the features. The cost parameter (C) is set to 1, which strikes a balance between achieving a low error on training data and maintaining the model’s ability to generalize to new data.
The tuned SVM model improves upon the original by optimizing its hyperparameters, specifically the cost parameter (C). In this case, we use the Radial basis function (RBF) kernel (svmRadial), which is a non-linear kernel that allows the model to capture more complex decision boundaries between the classes.
| Hyperparameter | Value |
|---|---|
| Best C | 2.000 |
| Best Sigma | 0.031 |
Bagging (Bootstrap Aggregating) is an ensemble learning technique where multiple models are trained on different subsets of the training data and then combined (usually by averaging or voting) to improve the model’s performance. The idea is that by training multiple models on random subsets of the data (with replacement), the model reduces variance and becomes more robust.
In the case of bagging with SVM, we apply the tuned SVM model from the previous step, but we train multiple versions of the model on bootstrapped subsets of the training data. Bagging helps to reduce overfitting by decreasing the model’s sensitivity to small fluctuations in the training data. The final prediction is made by aggregating the predictions from all the individual SVM models.
Boosting is another ensemble learning technique that combines the predictions of several base models (in this case, SVM models) in a sequential manner. The key idea behind boosting is that each new model focuses on the mistakes (misclassifications) made by the previous models. Boosting corrects errors iteratively, resulting in a more accurate and robust model.
In this case, the boosting algorithm used is XGBoost, a highly efficient and powerful gradient boosting method that combines multiple weak learners (SVMs in this case) into a strong learner. The XGBoost model applies a series of SVMs sequentially, where each new SVM model aims to correct the errors of the previous one.
| Model | Accuracy | Precision | Recall | F1_Score | TPR | TNR | Holdout_AUC | CV_AUC |
|---|---|---|---|---|---|---|---|---|
| Original SVM | 0.798 | 0.815 | 0.898 | 0.855 | 0.898 | 0.600 | 0.749 | 0.854 |
| Tuned SVM | 0.798 | 0.836 | 0.864 | 0.850 | 0.864 | 0.667 | 0.845 | 0.644 |
| Bagged SVM | 0.809 | 0.850 | 0.864 | 0.857 | 0.864 | 0.700 | 0.845 | 0.878 |
| Boosted SVM | 0.809 | 0.862 | 0.847 | 0.855 | 0.847 | 0.733 | 0.790 | 0.910 |
The comparison of the four SVM models reveals notable differences in performance. The Original SVM model, with an accuracy of 0.798 and a high recall of 0.898 (indicating a strong True Positive Rate, TPR), was effective at identifying positive cases, but it had lower specificity of 0.6, meaning it struggled with correctly identifying negative cases. Its CV AUC of 0.854 suggests solid generalization. The Tuned SVM showed slight improvements in precision (0.836), but had a slightly lower recall of 0.864 and a CV AUC of 0.644, indicating potential overfitting during tuning. The Bagged SVM model outperformed the others with an accuracy of 0.809, high precision of 0.85, recall of 0.864, and specificity of 0.7, pointing to a more balanced performance across all metrics. It also had a CV AUC of 0.878, reflecting better cross-validation performance. The Boosted SVM model achieved similar accuracy and precision (0.809 and 0.862) but excelled in specificity (0.733), showing it was more effective at correctly identifying negative cases. With the highest CV AUC of 0.91, it demonstrated the best generalization, making it the best-performing model overall despite a slightly lower holdout AUC of 0.79.
In this ensemble model, three of the best-performing models — Boosted Logistic Regression (XGBoost), Tuned Random Forest, and Boosted SVM — were combined to improve predictive performance by leveraging the strengths of different algorithms. The ensemble approach averages the predicted probabilities of the positive class (“Yes”) from each model and then makes the final prediction by applying a threshold of 0.5 to this average.
This ensemble combines the predictions of the three best-performing models, and averages them to create a final prediction. This method leverages the diverse strengths of the three models, helping to mitigate the weaknesses of any single model and improving overall prediction accuracy.
The Bagged Trees Model uses bagging with decision trees to combine multiple trees trained on different subsets of the data. Bagging helps reduce variance and prevent overfitting, as it averages the predictions of several decision trees, each trained on a different bootstrap sample (randomly sampled data with replacement). In this case, the model uses bagging as the meta-model to combine the predictions from the individual models (XGBoost, Random Forest, and Boosted SVM). This model is effective because bagging reduces the overfitting tendency of decision trees and improves model generalization by averaging out errors from individual models.
| Model | Accuracy | Precision | Recall | F1_Score | TPR | TNR | Holdout_AUC | CV_AUC |
|---|---|---|---|---|---|---|---|---|
| Manual Averaging Model | 0.831 | 0.867 | 0.881 | 0.874 | 0.881 | 0.733 | 0.894 | 0.900 |
| Bagged Trees Meta-Model | 0.731 | 0.857 | 0.706 | 0.774 | 0.706 | 0.778 | 0.859 | 0.822 |
# Define new data point
new_data_point <- data.frame(
age = 65,
anaemia = 0,
creatinine_phosphokinase = 400,
diabetes = 1,
ejection_fraction = 30,
high_blood_pressure = 1,
platelets = 250000,
serum_creatinine = 1.2,
serum_sodium = 135,
sex = 1,
smoking = 0,
time = 10
)
# Display the predictions
cat("Final Predicted Class:", final_prediction_meta, "\n")
## Final Predicted Class: 2
print(final_prob_meta)
## No Yes
## 1 0.32 0.68
The Averaging Ensemble Model, which combines the predictions from XGBoost, Tuned Random Forest, and Boosted SVM, achieved an accuracy of 0.831 and demonstrated a high recall of 0.881, indicating its strong ability to correctly identify positive cases. The model also had an impressive F1 score of 0.874, reflecting a balanced performance between precision and recall. The Holdout AUC for the averaging model was 0.894, while the CV AUC was slightly higher at 0.9, highlighting the model’s strong generalization. In comparison, the Bagged Trees Meta-Model, which uses bagging with decision trees to combine model predictions, showed a lower accuracy of 0.731 and recall of 0.706. However, it had a higher precision of 0.857 and an F1 score of 0.774. The Holdout AUC for the Bagged Trees Meta-Model was 0.859 and the CV AUC was 0.822, which demonstrated solid performance, though slightly lower than the Averaging Model. The results suggest that the Averaging Ensemble Model outperforms the Bagged Trees Meta-Model in overall accuracy and recall, while the Meta-Model shows more stability in terms of precision.
| Model | Accuracy | Precision | Recall | F1_Score | TPR | TNR | AUC_Holdout | AUC_CV |
|---|---|---|---|---|---|---|---|---|
| Random Forest (Original Data) | 0.820 | 0.864 | 0.864 | 0.864 | 0.864 | 0.733 | 0.905 | 0.898 |
| Random Forest (Imputed Data) | 0.216 | 0.400 | 0.091 | 0.148 | 0.091 | 0.591 | 0.840 | 0.811 |
| Model | Accuracy | Precision | Recall | F1_Score | TPR | TNR | AUC_Holdout | AUC_CV |
|---|---|---|---|---|---|---|---|---|
| SVM (Original Data) | 0.798 | 0.815 | 0.898 | 0.855 | 0.898 | 0.600 | 0.749 | 0.854 |
| SVM (Normalized Data) | 0.807 | 0.831 | 0.900 | 0.864 | 0.900 | 0.607 | 0.754 | 0.856 |
In the comparison of SVM models on original versus normalized data, the model trained on normalized data (which involved techniques like Gaussian normalization and min-max scaling) outperformed the original data model in several key performance metrics. The accuracy of the normalized SVM model was 0.807, a notable improvement over the original model’s 0.798. Additionally, the precision increased to 0.831 from 0.815, and the recall saw a significant boost to 0.9 compared to 0.898. This improvement in recall indicates that the normalized data model was better at identifying positive cases. The F1 score also improved, rising to 0.864 from 0.855, reflecting better balance between precision and recall. Furthermore, the Holdout AUC for the normalized data model was 0.754, slightly better than the original model’s 0.749. However, the TNR (True Negative Rate) decreased from 0.6 to 0.607, suggesting a trade-off where the model becomes more sensitive but less specific. These results suggest that normalization, which standardizes the features, can significantly enhance the model’s ability to capture the underlying patterns in the data, particularly for SVM.
When comparing Random Forest models on the original versus imputed data, the performance diverged sharply. The accuracy of the Random Forest model on the original data was 0.82, whereas on the imputed data, it dropped drastically to 0.216. Similarly, the precision and recall were also lower for the imputed model, indicating that imputing missing data (via median for continuous variables and mode for categorical variables) introduced significant noise, leading to poorer performance. The F1 score of 0.148 for the imputed data was substantially lower than the original model’s score of 0.864, and the AUC dropped from 0.905 to 0.84.
After considering all of the models I developed, the Averaging Ensemble Model stands out as the best overall choice for predicting patient survival. This ensemble method effectively combines the strengths of XGBoost, Tuned Random Forest, and Boosted SVM, achieving a high balance between precision and recall, strong generalization (as evidenced by its CV AUC of 0.9), and a robust F1 score of 0.874. Although the individual models showed strong performance, particularly the Tuned Random Forest and Boosted SVM, the ensemble approach harnesses the complementary strengths of these models to enhance prediction stability and accuracy, making it the optimal choice for this task.