This phase involves thoroughly understanding and preparing the dataset for analysis. We begin by inspecting the structure, data types, and potential issues such as missing values or duplicates. Each variable is evaluated for relevance, and categorical variables are appropriately converted to factors. Outliers and skewed distributions are visualized to inform any required transformations. The dataset is then cleaned and normalized using a preprocessing recipe, followed by a train-test split to enable unbiased model training and evaluation.
## Rows: 299
## Columns: 13
## $ age <dbl> 75, 55, 65, 50, 65, 90, 75, 60, 65, 80, 75, 6…
## $ anaemia <dbl> 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, …
## $ creatinine_phosphokinase <dbl> 582, 7861, 146, 111, 160, 47, 246, 315, 157, …
## $ diabetes <dbl> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, …
## $ ejection_fraction <dbl> 20, 38, 20, 20, 20, 40, 15, 60, 65, 35, 38, 2…
## $ high_blood_pressure <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, …
## $ platelets <dbl> 265000, 263358, 162000, 210000, 327000, 20400…
## $ serum_creatinine <dbl> 1.90, 1.10, 1.30, 1.90, 2.70, 2.10, 1.20, 1.1…
## $ serum_sodium <dbl> 130, 136, 129, 137, 116, 132, 137, 131, 138, …
## $ sex <dbl> 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, …
## $ smoking <dbl> 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, …
## $ time <dbl> 4, 6, 7, 7, 8, 8, 10, 10, 10, 10, 10, 10, 11,…
## $ death_event <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, …
| Name | df |
|---|---|
| Number of rows | 299 |
| Number of columns | 13 |
| Column type frequency: numeric | 13 |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 1 | 60.83 | 11.89 | 40.0 | 51.0 | 60.0 | 70.0 | 95.0 | ▆▇▇▂▁ |
| anaemia | 0 | 1 | 0.43 | 0.50 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ▇▁▁▁▆ |
| creatinine_phosphokinase | 0 | 1 | 581.84 | 970.29 | 23.0 | 116.5 | 250.0 | 582.0 | 7861.0 | ▇▁▁▁▁ |
| diabetes | 0 | 1 | 0.42 | 0.49 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ▇▁▁▁▆ |
| ejection_fraction | 0 | 1 | 38.08 | 11.83 | 14.0 | 30.0 | 38.0 | 45.0 | 80.0 | ▃▇▂▂▁ |
| high_blood_pressure | 0 | 1 | 0.35 | 0.48 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ▇▁▁▁▅ |
| platelets | 0 | 1 | 263358.03 | 97804.24 | 25100.0 | 212500.0 | 262000.0 | 303500.0 | 850000.0 | ▂▇▂▁▁ |
| serum_creatinine | 0 | 1 | 1.39 | 1.03 | 0.5 | 0.9 | 1.1 | 1.4 | 9.4 | ▇▁▁▁▁ |
| serum_sodium | 0 | 1 | 136.63 | 4.41 | 113.0 | 134.0 | 137.0 | 140.0 | 148.0 | ▁▁▃▇▁ |
| sex | 0 | 1 | 0.65 | 0.48 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | ▅▁▁▁▇ |
| smoking | 0 | 1 | 0.32 | 0.47 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ▇▁▁▁▃ |
| time | 0 | 1 | 130.26 | 77.61 | 4.0 | 73.0 | 115.0 | 203.0 | 285.0 | ▆▇▃▆▃ |
| death_event | 0 | 1 | 0.32 | 0.47 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ▇▁▁▁▃ |
## age anaemia creatinine_phosphokinase
## 0 0 0
## diabetes ejection_fraction high_blood_pressure
## 0 0 0
## platelets serum_creatinine serum_sodium
## 0 0 0
## sex smoking time
## 0 0 0
## death_event
## 0
## [1] 0
## age anaemia creatinine_phosphokinase diabetes
## Min. :40.00 Min. :0.0000 Min. : 23.0 Min. :0.0000
## 1st Qu.:51.00 1st Qu.:0.0000 1st Qu.: 116.5 1st Qu.:0.0000
## Median :60.00 Median :0.0000 Median : 250.0 Median :0.0000
## Mean :60.83 Mean :0.4314 Mean : 581.8 Mean :0.4181
## 3rd Qu.:70.00 3rd Qu.:1.0000 3rd Qu.: 582.0 3rd Qu.:1.0000
## Max. :95.00 Max. :1.0000 Max. :7861.0 Max. :1.0000
## ejection_fraction high_blood_pressure platelets serum_creatinine
## Min. :14.00 Min. :0.0000 Min. : 25100 Min. :0.500
## 1st Qu.:30.00 1st Qu.:0.0000 1st Qu.:212500 1st Qu.:0.900
## Median :38.00 Median :0.0000 Median :262000 Median :1.100
## Mean :38.08 Mean :0.3512 Mean :263358 Mean :1.394
## 3rd Qu.:45.00 3rd Qu.:1.0000 3rd Qu.:303500 3rd Qu.:1.400
## Max. :80.00 Max. :1.0000 Max. :850000 Max. :9.400
## serum_sodium sex smoking time
## Min. :113.0 Min. :0.0000 Min. :0.0000 Min. : 4.0
## 1st Qu.:134.0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 73.0
## Median :137.0 Median :1.0000 Median :0.0000 Median :115.0
## Mean :136.6 Mean :0.6488 Mean :0.3211 Mean :130.3
## 3rd Qu.:140.0 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:203.0
## Max. :148.0 Max. :1.0000 Max. :1.0000 Max. :285.0
## death_event
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.3211
## 3rd Qu.:1.0000
## Max. :1.0000
The dataset consists of 299 observations and 13 variables, all of which are numeric, with several acting as binary categorical indicators (e.g., anaemia, diabetes, sex). No missing values or duplicated records were found, which simplifies the preprocessing phase.
The target variable, death_event, is binary and
indicates whether a patient died during the follow-up period. The mean
of 0.321 suggests that approximately 32.1% of patients experienced a
death event.
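The checks behind these statements can be sketched in base R. The data frame `toy` below is a hypothetical stand-in built from the first five values of three columns in the glimpse output, not the full dataset, so the event rate it prints differs from the 32.1% reported for all 299 patients.

```r
# Hypothetical stand-in: first five values of three columns from the glimpse output
toy <- data.frame(
  age         = c(75, 55, 65, 50, 65),
  anaemia     = c(0, 0, 0, 1, 1),
  death_event = c(1, 1, 1, 1, 1)
)

# Missing values per column (all zeros means there is nothing to impute)
n_missing <- colSums(is.na(toy))

# Number of rows that exactly repeat an earlier row
n_duplicated <- sum(duplicated(toy))

# For a 0/1 outcome, the mean equals the proportion of events
event_rate <- mean(toy$death_event)

print(n_missing)
print(n_duplicated)
print(event_rate)
```

The last line is the reason the mean of death_event can be read directly as a death rate.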
Key observations from the summary statistics include:
- anaemia, diabetes, smoking, high_blood_pressure, and sex are coded as 0 and 1, and will later be treated as factors.

This overview provides a foundational understanding of the dataset and reveals which variables may require transformation, encoding, or scaling prior to modeling.
The correlation matrix above provides insights into the linear relationships between numeric variables in the dataset. Notably:
- time is negatively correlated with death_event at -0.53, indicating that patients with longer follow-up were less likely to have died.
- serum_creatinine and death_event have a positive correlation of 0.29, suggesting that higher creatinine levels (an indicator of kidney dysfunction) are associated with higher mortality.
- ejection_fraction is negatively correlated with death_event, which aligns with medical expectations: lower ejection fractions are linked to increased death risk.
- Other variables such as age (0.25), serum_sodium (-0.20), and high_blood_pressure (0.06) have weaker correlations with the outcome but may still contribute in multivariate models.

These relationships help guide feature selection and provide early intuition on variable importance in predicting patient outcomes.
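One way to obtain correlations with the outcome is sketched below. The data are simulated under assumed effect sizes (the coefficients in `p` are hypothetical), so the printed values will not match the report; only the mechanics of `cor()` carry over.

```r
set.seed(42)

# Simulated stand-in data (hypothetical, not the real dataset)
n    <- 200
time <- runif(n, 4, 285)                  # follow-up days
scr  <- rlnorm(n, meanlog = 0.2, sdlog = 0.4)  # right-skewed, like serum creatinine

# Assumed relationship: death is likelier with short follow-up and high creatinine
p           <- plogis(1 - 0.02 * time + 0.8 * scr)
death_event <- rbinom(n, 1, p)

sim <- data.frame(time, scr, death_event)

# Pearson correlation of every column with the outcome,
# ordered by absolute strength
cors        <- cor(sim)[, "death_event"]
cors_sorted <- cors[order(abs(cors), decreasing = TRUE)]

print(round(cors_sorted, 2))
```

Correlating a numeric predictor with a 0/1 outcome in this way yields the point-biserial correlation, which captures linear association only.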
## Rows: 238
## Columns: 13
## $ age <dbl> -0.99487877, 0.35024581, 0.77059724, -1.3311599…
## $ cpk <dbl> -0.517206774, -0.546064189, -0.522359884, 0.000…
## $ ef <dbl> -0.683035135, -1.105516528, -0.260553742, 3.541…
## $ platelets <dbl> 1.673158e+00, 1.292579e-01, -4.126409e-01, 7.52…
## $ scr <dbl> -0.38074023, -0.09074788, 1.26254973, -0.206744…
## $ ss <dbl> 0.31152159, 0.08489153, 0.31152159, 0.08489153,…
## $ time <dbl> -1.5237013, -1.4721643, -0.9825633, -0.8666051,…
## $ death <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ anaemia_X1 <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0,…
## $ diabetes_X1 <dbl> 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,…
## $ high_blood_pressure_X1 <dbl> 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1,…
## $ sex_X1 <dbl> 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1,…
## $ smoking_X1 <dbl> 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0,…
## age cpk ef platelets
## Min. :-1.75151 Min. :-0.575952 Min. :-2.034976 Min. :-2.43607
## 1st Qu.:-0.74267 1st Qu.:-0.477785 1st Qu.:-0.683035 1st Qu.:-0.53278
## Median :-0.07011 Median :-0.316235 Median :-0.007065 Median :-0.00183
## Mean : 0.02892 Mean : 0.032812 Mean : 0.012107 Mean : 0.01624
## 3rd Qu.: 0.77060 3rd Qu.: 0.000165 3rd Qu.: 0.584409 3rd Qu.: 0.41299
## Max. : 2.87235 Max. : 7.502063 Max. : 3.541779 Max. : 5.99812
## scr ss time death
## Min. :-0.864061 Min. :-5.354230 Min. :-1.62678 0:162
## 1st Qu.:-0.477404 1st Qu.:-0.594999 1st Qu.:-0.69589 1: 76
## Median :-0.284076 Median : 0.084892 Median :-0.13220
## Mean : 0.019888 Mean :-0.004618 Mean : 0.05202
## 3rd Qu.: 0.005916 3rd Qu.: 0.764782 3rd Qu.: 0.98550
## Max. : 7.739045 Max. : 2.577822 Max. : 1.92927
## anaemia_X1 diabetes_X1 high_blood_pressure_X1 sex_X1
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :1.0000
## Mean :0.4118 Mean :0.4538 Mean :0.3655 Mean :0.6555
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## smoking_X1
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.3067
## 3rd Qu.:1.0000
## Max. :1.0000
To prepare the dataset for modeling, several key preprocessing steps were applied:
Factor Conversion & Renaming: Binary variables
such as anaemia, diabetes,
high_blood_pressure, sex,
smoking, and death_event were converted to
factors to ensure proper treatment in modeling. Additionally, variable
names were simplified for readability (e.g.,
ejection_fraction was renamed to ef).
Recipe-Based Transformation: Using the
recipes package, a preprocessing pipeline was implemented.
This included normalization of the numeric predictors and dummy encoding of the categorical predictors.
Train/Test Split: The dataset was split into
training (80%) and testing (20%) sets, stratified by the target variable
(death). The training set contains 238 observations and the
test set 61.
Resulting Dataset Summary: The processed
train_data contains 13 columns, including normalized
numeric features (age, cpk, ef,
platelets, scr, ss,
time) and binary dummy variables for categorical features
(anaemia_X1, diabetes_X1,
high_blood_pressure_X1, sex_X1,
smoking_X1). Notably:
- Dummy variables such as sex_X1 and diabetes_X1 have binary values (0 or 1), suitable for use in distance-based models like KNN.
- The class distribution of death is preserved after the split (approximately 68% survival, 32% death), so the training set remains representative.

These preprocessing steps ensure the dataset is clean, standardized, and ready for predictive modeling.
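The steps above can be approximated in base R as follows. This is a minimal sketch on simulated data, not the recipes pipeline used in the report: the data frame `df`, the 68/32 class counts, and the choice to estimate the mean and standard deviation on the training rows only are all assumptions of the sketch.

```r
set.seed(123)

# Simulated stand-in data (hypothetical, not the real dataset)
n  <- 100
df <- data.frame(
  age   = round(runif(n, 40, 95)),
  sex   = rbinom(n, 1, 0.65),
  death = rep(c(0, 1), times = c(68, 32))
)

# Treat 0/1 indicators as factors
df$sex   <- factor(df$sex)
df$death <- factor(df$death)

# Stratified 80/20 split: draw 80% of the row indices within each outcome class
train_idx <- unlist(lapply(split(seq_len(n), df$death), function(idx) {
  sample(idx, size = round(0.8 * length(idx)))
}))
train <- df[train_idx, ]
test  <- df[-train_idx, ]

# Normalize using statistics estimated on the training set only,
# then apply the same shift and scale to the test set
mu    <- mean(train$age)
sigma <- sd(train$age)
train$age <- (train$age - mu) / sigma
test$age  <- (test$age - mu) / sigma

# Class proportions should be close in both partitions
print(prop.table(table(train$death)))
print(prop.table(table(test$death)))
```

Sampling within each class is what keeps the survival-to-death ratio nearly identical in the two partitions.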
Exploratory Data Analysis (EDA) was conducted to better understand
variable distributions, detect outliers, and explore relationships
between predictors and the target variable (death). This
step supports hypothesis generation and guides preprocessing and
modeling choices.
The boxplots (with log-scale Y axis) reveal extreme right skewness in cpk, scr, and platelets, with several high-value outliers. These features may benefit from transformation or scaling prior to modeling.
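The effect of a log transformation on a right-skewed variable can be illustrated with the first nine creatinine_phosphokinase values shown in the glimpse output. The `skewness` helper is a simple moment-based estimate written for this sketch, and nine values are far too few to characterize the full distribution.

```r
# First nine creatinine_phosphokinase values from the glimpse output
cpk <- c(582, 7861, 146, 111, 160, 47, 246, 315, 157)

# Simple moment-based skewness estimate (helper defined for this sketch)
skewness <- function(x) mean(((x - mean(x)) / sd(x))^3)

raw_skew <- skewness(cpk)
log_skew <- skewness(log(cpk))

# The log scale pulls the single extreme value (7861) toward the rest
print(c(raw = raw_skew, log = log_skew))
```

A lower value on the log scale means the extreme observation dominates less, which is also why the boxplots are easier to read with a logarithmic axis.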
The age distribution is approximately normal but slightly skewed right. Most patients fall between ages 50 and 70, aligning with common demographics for heart failure risk.
The time variable is spread fairly evenly across its range of 4 to 285 days, and roughly a quarter of patients were followed for more than 200 days. This gives the models a wide range of follow-up durations to learn from.
Patients who died had a noticeably lower median ejection fraction compared to survivors, highlighting reduced heart function as a significant factor in mortality.
Patients who died tended to have higher serum creatinine levels, indicating poorer kidney function, which is known to correlate with worse heart failure outcomes.
While the median platelet counts between groups appear similar, the range is broader among deceased patients, showing more variability that may be associated with clinical instability.
Across binary health indicators like anaemia,
diabetes, and high_blood_pressure, the
proportion of deaths remains relatively consistent. No strong
categorical driver stands out visually.
The scatter plot of age against serum creatinine shows no clear linear trend, but some older patients with high creatinine levels fall into the death category, suggesting that compounding risk factors may exist.
Deceased patients often had lower ejection fractions and were followed for shorter durations, consistent with worse prognosis and earlier adverse events.
The pair plot reveals weak to moderate correlations among features.
ef and scr have some class-separating power.
The visual clustering shows that combined features might improve
predictiveness.
The exploratory analysis uncovered several important insights that will guide our modeling strategy. Variables such as ejection fraction, serum creatinine, and follow-up time show noticeable differences between death outcomes, indicating their potential predictive power. While categorical variables like anaemia and diabetes showed weaker associations individually, they may still contribute within multivariate models. Additionally, visual inspection confirmed the presence of skewed distributions and outliers in variables like creatinine and platelets, justifying our earlier preprocessing steps. Overall, the EDA supports the development of robust predictive models and highlights key features likely to influence patient survival.
Following the exploratory and preprocessing phases, this section focuses on building and evaluating predictive models to estimate the likelihood of death among heart failure patients. The objective is to assess the predictive power of clinical variables using various machine learning algorithms. We will apply and compare four models: Logistic Regression, K-Nearest Neighbors (KNN), Naive Bayes, and Linear Regression (a regression model used here as a baseline classifier). Each model will be trained on the preprocessed training set and evaluated by its accuracy on the 61-observation test set. This comparative analysis will help identify the most effective model for predicting patient outcomes in a clinical decision-making context.
##
## Call:
## lm(formula = death ~ ., data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.76595 -0.25978 -0.01087 0.22603 0.92758
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.404236 0.058252 6.939 4.15e-11 ***
## age 0.060538 0.023756 2.548 0.0115 *
## cpk 0.032132 0.022852 1.406 0.1611
## ef -0.119492 0.024122 -4.954 1.43e-06 ***
## platelets -0.015722 0.023208 -0.677 0.4988
## scr 0.092215 0.022791 4.046 7.16e-05 ***
## ss -0.032070 0.023908 -1.341 0.1812
## time -0.227650 0.024449 -9.311 < 2e-16 ***
## anaemia_X1 -0.056082 0.048407 -1.159 0.2479
## diabetes_X1 0.011898 0.047870 0.249 0.8039
## high_blood_pressure_X1 -0.040922 0.049391 -0.829 0.4082
## sex_X1 -0.067353 0.055773 -1.208 0.2285
## smoking_X1 0.002115 0.055237 0.038 0.9695
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3544 on 225 degrees of freedom
## Multiple R-squared: 0.4536, Adjusted R-squared: 0.4245
## F-statistic: 15.57 on 12 and 225 DF, p-value: < 2.2e-16
## Actual
## Predicted 0 1
## 0 34 7
## 1 7 13
## Accuracy: 0.7705
## Area under the curve: 0.8183
The linear regression model, used here as a baseline classifier, achieved an accuracy of 77.0% on the test dataset (47 of 61 patients classified correctly). Its AUC (Area Under the Curve) of 0.818 indicates decent discriminative ability, even though linear regression is not inherently designed for classification tasks.
From the model summary, we observe that several predictors
significantly contribute to the model: age,
ejection fraction (ef),
serum creatinine (scr), and time. Notably,
higher ef and longer time are associated with
lower probability of death, while higher scr and
age increase the likelihood of death — aligning with
medical intuition.
While the model performed reasonably well, it is limited by its assumptions: its fitted values are not constrained to the range from 0 to 1, so they cannot be read directly as class probabilities. It is a useful benchmark but not the most suitable model for binary classification.
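Turning a linear model's output into a class label, and a confusion matrix into an accuracy figure, can be sketched as follows. The fitted values and the 0.5 cutoff are assumptions for illustration (the report does not state its threshold); the confusion matrix is the one printed above.

```r
# Hypothetical fitted values from a linear model: they can fall outside [0, 1]
fitted_vals <- c(-0.10, 0.30, 0.60, 1.20)

# Assumed cutoff of 0.5 to convert fitted values into class labels
pred_class <- as.integer(fitted_vals >= 0.5)

# Confusion matrix reported for linear regression
# (rows = predicted class, columns = actual class; filled column by column)
cm <- matrix(c(34, 7, 7, 13), nrow = 2,
             dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1")))

# Accuracy = correctly classified cases (the diagonal) / all cases
accuracy <- sum(diag(cm)) / sum(cm)

print(pred_class)
print(cm)
print(round(accuracy, 4))
```

The diagonal holds 34 + 13 = 47 correct predictions out of 61, which reproduces the reported accuracy of 0.7705.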
##
## Call:
## glm(formula = death ~ ., family = binomial, data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.719910 0.518634 -1.388 0.165110
## age 0.502041 0.208042 2.413 0.015814 *
## cpk 0.195708 0.208284 0.940 0.347413
## ef -0.977994 0.229636 -4.259 2.05e-05 ***
## platelets -0.129127 0.215785 -0.598 0.549570
## scr 0.813332 0.215366 3.777 0.000159 ***
## ss -0.330033 0.204540 -1.614 0.106628
## time -1.835981 0.285229 -6.437 1.22e-10 ***
## anaemia_X1 -0.523338 0.435012 -1.203 0.228960
## diabetes_X1 0.093390 0.413350 0.226 0.821252
## high_blood_pressure_X1 -0.209523 0.416781 -0.503 0.615163
## sex_X1 -0.589627 0.495427 -1.190 0.233991
## smoking_X1 0.001479 0.477978 0.003 0.997531
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 298.15 on 237 degrees of freedom
## Residual deviance: 161.86 on 225 degrees of freedom
## AIC: 187.86
##
## Number of Fisher Scoring iterations: 6
## Actual
## Predicted 0 1
## 0 35 7
## 1 6 13
## Logistic Regression Accuracy: 0.7869
## Area under the curve: 0.8207
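The logistic regression model slightly outperformed the linear regression baseline, with an accuracy of 78.7% against 77.0%. Its AUC is also marginally higher (0.821 against 0.818). Both models identified 13 of the 20 deaths in the test set; the gain comes from one additional correctly classified survivor (35 against 34).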
Key predictors in this model included age,
ejection fraction (ef),
serum creatinine (scr), and time, all of which
were statistically significant. Lower ejection fraction and shorter
follow-up times were linked to higher death probabilities, while
increased age and serum creatinine were also associated with higher
risk.
These results reinforce known medical insights and support logistic regression as a solid candidate for modeling binary outcomes in clinical datasets.
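Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to interpret. The sketch below uses the four significant coefficients copied from the glm summary above; because the numeric predictors were normalized, each ratio refers to a one-standard-deviation increase.

```r
# Significant coefficients copied from the glm summary above (log-odds scale)
coefs <- c(age = 0.502041, ef = -0.977994, scr = 0.813332, time = -1.835981)

# exp() converts a log-odds coefficient into an odds ratio:
# values above 1 raise the odds of death, values below 1 lower them
odds_ratios <- exp(coefs)

print(round(odds_ratios, 2))
```

Ratios above 1 for age and scr and below 1 for ef and time match the direction of the effects described in the text.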
## k-Nearest Neighbors
##
## 238 samples
## 12 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 214, 214, 214, 214, 214, 215, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.7770870 0.4300627
## 7 0.7647536 0.3907864
## 9 0.7520580 0.3455697
## 11 0.7737971 0.3946275
## 13 0.7611014 0.3562933
## 15 0.7654493 0.3617844
## 17 0.7609203 0.3525280
## 19 0.7569348 0.3356419
## 21 0.7569348 0.3356419
## 23 0.7569348 0.3356419
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 39 13
## 1 2 7
##
## Accuracy : 0.7541
## 95% CI : (0.6271, 0.8554)
## No Information Rate : 0.6721
## P-Value [Acc > NIR] : 0.108034
##
## Kappa : 0.3506
##
## Mcnemar's Test P-Value : 0.009823
##
## Sensitivity : 0.9512
## Specificity : 0.3500
## Pos Pred Value : 0.7500
## Neg Pred Value : 0.7778
## Prevalence : 0.6721
## Detection Rate : 0.6393
## Detection Prevalence : 0.8525
## Balanced Accuracy : 0.6506
##
## 'Positive' Class : 0
##
## KNN Accuracy: 0.7541
The KNN model achieved a test accuracy of 75.4%, slightly below logistic regression. Cross-validation selected k = 5 (77.7% accuracy), with generally lower accuracy at larger values of k, although k = 11 came close at 77.4%.
The confusion matrix highlights KNN’s strength in identifying survivors (class 0, which the output treats as the 'Positive' class): 39 of 41 were classified correctly, a sensitivity of 95.1%. Its specificity of 35.0% means that only 7 of the 20 deaths were detected, so the model is conservative in predicting death and favors the majority class.
While interpretable and non-parametric, KNN’s performance depends heavily on data scaling and parameter tuning. In this case, it performed reasonably well, though not the best.
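The sensitivity, specificity, and balanced accuracy in the output can be recomputed from the KNN confusion matrix. The sketch follows the convention of the printed output, where class 0 is the positive class, so sensitivity measures how well survivors are recognized and specificity how well deaths are.

```r
# KNN confusion matrix from the output above
# (rows = predicted class, columns = reference class; filled column by column)
cm <- matrix(c(39, 2, 13, 7), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

# With class 0 as the positive class:
sensitivity <- cm["0", "0"] / sum(cm[, "0"])  # survivors correctly identified
specificity <- cm["1", "1"] / sum(cm[, "1"])  # deaths correctly identified

# Balanced accuracy is the mean of the two class-wise rates
balanced_accuracy <- (sensitivity + specificity) / 2

print(round(c(sensitivity = sensitivity,
              specificity = specificity,
              balanced_accuracy = balanced_accuracy), 4))
```

Reading the metrics this way makes it clear that the high sensitivity describes survivors, while the 35% specificity is the share of deaths the model caught.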
## Naive Bayes
##
## 238 samples
## 12 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 214, 214, 214, 214, 214, 215, ...
## Resampling results across tuning parameters:
##
## usekernel Accuracy Kappa
## FALSE 0.7607536 0.3857472
## TRUE 0.8319783 0.5761426
##
## Tuning parameter 'laplace' was held constant at a value of 0
## Tuning
## parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were laplace = 0, usekernel = TRUE
## and adjust = 1.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 39 10
## 1 2 10
##
## Accuracy : 0.8033
## 95% CI : (0.6816, 0.894)
## No Information Rate : 0.6721
## P-Value [Acc > NIR] : 0.01733
##
## Kappa : 0.5027
##
## Mcnemar's Test P-Value : 0.04331
##
## Sensitivity : 0.9512
## Specificity : 0.5000
## Pos Pred Value : 0.7959
## Neg Pred Value : 0.8333
## Prevalence : 0.6721
## Detection Rate : 0.6393
## Detection Prevalence : 0.8033
## Balanced Accuracy : 0.7256
##
## 'Positive' Class : 0
##
## Naive Bayes Accuracy: 0.8033
The Naive Bayes model achieved the highest test accuracy of
80.3% among all models tested. Performance was
enhanced when using kernel density estimation, as shown by the tuning
results: cross-validated accuracy rose from 76.1% to 83.2% when
usekernel = TRUE.
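Why a kernel density estimate can help is sketched below by comparing it with a single normal curve on a skewed feature. This illustrates the idea only and is not the fitting code of the model: it uses the first nine raw creatinine_phosphokinase values from the glimpse output, ignores class labels, and the evaluation point `x0` is an arbitrary choice.

```r
# First nine creatinine_phosphokinase values from the glimpse output
cpk <- c(582, 7861, 146, 111, 160, 47, 246, 315, 157)

# Gaussian assumption: one normal curve with the sample mean and sd.
# The outlier at 7861 inflates both, flattening the curve.
gauss_density <- function(x) dnorm(x, mean = mean(cpk), sd = sd(cpk))

# Kernel density estimate: follows the data locally instead
kde            <- density(cpk)
kernel_density <- approxfun(kde$x, kde$y)

# Compare both estimates where most observations actually lie
x0 <- 150
print(c(gaussian = gauss_density(x0), kernel = kernel_density(x0)))
```

On skewed features such as these, the kernel estimate places far more density where the bulk of the observations lie, which is a plausible reason for the accuracy gain seen in tuning.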
The confusion matrix shows an improvement in specificity compared to KNN (50.0% against 35.0%), while sensitivity for the positive class (class 0, survivors) stays at 95.1%. The balanced accuracy of 72.6% is correspondingly higher than KNN’s 65.1%, although half of the 20 deaths in the test set were still missed.
Overall, Naive Bayes proved to be the most effective classifier in this context, offering both simplicity and robust predictive capability.
## Model Accuracy
## 1 Linear Regression 0.7704918
## 2 Logistic Regression 0.7868852
## 3 KNN 0.7540984
## 4 Naive Bayes 0.8032787
Among the four models evaluated, Naive Bayes emerged as the top performer with an accuracy of 80.3%, benefiting from kernel density estimation which improved its flexibility. Logistic Regression followed closely with 78.7%, offering strong interpretability and alignment with clinical knowledge.
Linear Regression, used as a baseline, performed decently at 77.0% accuracy despite not being a classification model. Meanwhile, K-Nearest Neighbors achieved 75.4%, showing that while intuitive and easy to implement, it struggled with generalization in this dataset.
Overall, Naive Bayes provided the best trade-off between simplicity and predictive power, making it a promising choice for risk prediction in medical datasets with structured features.
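Accuracy alone hides how many deaths each model detected. The sketch below rebuilds the comparison from the four test-set confusion matrices printed earlier and adds the share of actual deaths that each model predicted correctly; `make_cm` is a helper defined for this sketch.

```r
# Helper for this sketch: 2x2 matrix with rows = predicted, columns = actual,
# filled column by column
make_cm <- function(counts) {
  matrix(counts, nrow = 2,
         dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1")))
}

# Test-set confusion matrices copied from the outputs above
cms <- list(
  "Linear Regression"   = make_cm(c(34, 7, 7, 13)),
  "Logistic Regression" = make_cm(c(35, 6, 7, 13)),
  "KNN"                 = make_cm(c(39, 2, 13, 7)),
  "Naive Bayes"         = make_cm(c(39, 2, 10, 10))
)

comparison <- data.frame(
  model        = names(cms),
  accuracy     = sapply(cms, function(cm) sum(diag(cm)) / sum(cm)),
  death_recall = sapply(cms, function(cm) cm["1", "1"] / sum(cm[, "1"])),
  row.names    = NULL
)

# Sort from most to least accurate
print(comparison[order(-comparison$accuracy), ])
```

Naive Bayes ranks first on accuracy, while the two regression models detect the largest share of deaths (13 of 20), a trade-off worth weighing when missed deaths are the costlier error.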
This phase of the analysis demonstrated the practical application of multiple machine learning models for predicting mortality in heart failure patients. By comparing Logistic Regression, KNN, Naive Bayes, and Linear Regression, we identified Naive Bayes as the most accurate method on the test set, although Logistic Regression detected more of the deaths (13 of 20, against 10 for Naive Bayes). These findings support the integration of data-driven tools into clinical decision-making, enabling early identification of high-risk patients and potentially improving treatment outcomes.
This project presented a comprehensive analysis pipeline for predicting mortality among patients with heart failure using real clinical data. Beginning with data critique and preprocessing, we ensured our dataset was clean, standardized, and ready for model development. Through exploratory data analysis (EDA), we uncovered significant patterns, such as the impact of variables like ejection fraction, serum creatinine, and age on patient outcomes. Correlation analysis and visualizations further guided our understanding of which features held predictive value.
During the predictive modeling phase, four approaches—Linear Regression, Logistic Regression, K-Nearest Neighbors (KNN), and Naive Bayes—were evaluated. Each method brought unique advantages: Linear Regression served as a baseline, Logistic Regression aligned closely with clinical reasoning, KNN provided a non-parametric perspective, and Naive Bayes demonstrated the strongest predictive power, especially when kernel density was applied.
Naive Bayes ultimately achieved the highest accuracy, suggesting its strong potential for real-world application in early intervention strategies. However, logistic regression remains a strong alternative due to its transparency and ease of interpretation.
This analysis not only highlights the utility of machine learning in healthcare but also emphasizes the importance of preprocessing, model tuning, and interpretability in building reliable predictive systems. By leveraging data analytics, medical professionals can be better equipped to identify at-risk individuals and improve patient care outcomes through informed decision-making.