Heart Failure Analysis and Prediction

DATA CRITIQUE

This phase involves thoroughly understanding and preparing the dataset for analysis. We begin by inspecting the structure, data types, and potential issues such as missing values or duplicates. Each variable is evaluated for relevance, and categorical variables are appropriately converted to factors. Outliers and skewed distributions are visualized to inform any required transformations. The dataset is then cleaned and normalized using a preprocessing recipe, followed by a train-test split to enable unbiased model training and evaluation.

Data Overview

## Rows: 299
## Columns: 13
## $ age                      <dbl> 75, 55, 65, 50, 65, 90, 75, 60, 65, 80, 75, 6…
## $ anaemia                  <dbl> 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, …
## $ creatinine_phosphokinase <dbl> 582, 7861, 146, 111, 160, 47, 246, 315, 157, …
## $ diabetes                 <dbl> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, …
## $ ejection_fraction        <dbl> 20, 38, 20, 20, 20, 40, 15, 60, 65, 35, 38, 2…
## $ high_blood_pressure      <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, …
## $ platelets                <dbl> 265000, 263358, 162000, 210000, 327000, 20400…
## $ serum_creatinine         <dbl> 1.90, 1.10, 1.30, 1.90, 2.70, 2.10, 1.20, 1.1…
## $ serum_sodium             <dbl> 130, 136, 129, 137, 116, 132, 137, 131, 138, …
## $ sex                      <dbl> 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, …
## $ smoking                  <dbl> 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, …
## $ time                     <dbl> 4, 6, 7, 7, 8, 8, 10, 10, 10, 10, 10, 10, 11,…
## $ death_event              <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, …

Data summary
Name	df
Number of rows	299
Number of columns	13
_______________________
Column type frequency:
numeric	13
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
age	1	60.83	11.89	40.0	51.0	60.0	70.0	95.0	▆▇▇▂▁
anaemia	1	0.43	0.50	0.0	0.0	0.0	1.0	1.0	▇▁▁▁▆
creatinine_phosphokinase	1	581.84	970.29	23.0	116.5	250.0	582.0	7861.0	▇▁▁▁▁
diabetes	1	0.42	0.49	0.0	0.0	0.0	1.0	1.0	▇▁▁▁▆
ejection_fraction	1	38.08	11.83	14.0	30.0	38.0	45.0	80.0	▃▇▂▂▁
high_blood_pressure	1	0.35	0.48	0.0	0.0	0.0	1.0	1.0	▇▁▁▁▅
platelets	1	263358.03	97804.24	25100.0	212500.0	262000.0	303500.0	850000.0	▂▇▂▁▁
serum_creatinine	1	1.39	1.03	0.5	0.9	1.1	1.4	9.4	▇▁▁▁▁
serum_sodium	1	136.63	4.41	113.0	134.0	137.0	140.0	148.0	▁▁▃▇▁
sex	1	0.65	0.48	0.0	0.0	1.0	1.0	1.0	▅▁▁▁▇
smoking	1	0.32	0.47	0.0	0.0	0.0	1.0	1.0	▇▁▁▁▃
time	1	130.26	77.61	4.0	73.0	115.0	203.0	285.0	▆▇▃▆▃
death_event	1	0.32	0.47	0.0	0.0	0.0	1.0	1.0	▇▁▁▁▃

##                      age                  anaemia creatinine_phosphokinase 
##                        0                        0                        0 
##                 diabetes        ejection_fraction      high_blood_pressure 
##                        0                        0                        0 
##                platelets         serum_creatinine             serum_sodium 
##                        0                        0                        0 
##                      sex                  smoking                     time 
##                        0                        0                        0 
##              death_event 
##                        0

## [1] 0

##       age           anaemia       creatinine_phosphokinase    diabetes     
##  Min.   :40.00   Min.   :0.0000   Min.   :  23.0           Min.   :0.0000  
##  1st Qu.:51.00   1st Qu.:0.0000   1st Qu.: 116.5           1st Qu.:0.0000  
##  Median :60.00   Median :0.0000   Median : 250.0           Median :0.0000  
##  Mean   :60.83   Mean   :0.4314   Mean   : 581.8           Mean   :0.4181  
##  3rd Qu.:70.00   3rd Qu.:1.0000   3rd Qu.: 582.0           3rd Qu.:1.0000  
##  Max.   :95.00   Max.   :1.0000   Max.   :7861.0           Max.   :1.0000  
##  ejection_fraction high_blood_pressure   platelets      serum_creatinine
##  Min.   :14.00     Min.   :0.0000      Min.   : 25100   Min.   :0.500   
##  1st Qu.:30.00     1st Qu.:0.0000      1st Qu.:212500   1st Qu.:0.900   
##  Median :38.00     Median :0.0000      Median :262000   Median :1.100   
##  Mean   :38.08     Mean   :0.3512      Mean   :263358   Mean   :1.394   
##  3rd Qu.:45.00     3rd Qu.:1.0000      3rd Qu.:303500   3rd Qu.:1.400   
##  Max.   :80.00     Max.   :1.0000      Max.   :850000   Max.   :9.400   
##   serum_sodium        sex            smoking            time      
##  Min.   :113.0   Min.   :0.0000   Min.   :0.0000   Min.   :  4.0  
##  1st Qu.:134.0   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 73.0  
##  Median :137.0   Median :1.0000   Median :0.0000   Median :115.0  
##  Mean   :136.6   Mean   :0.6488   Mean   :0.3211   Mean   :130.3  
##  3rd Qu.:140.0   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:203.0  
##  Max.   :148.0   Max.   :1.0000   Max.   :1.0000   Max.   :285.0  
##   death_event    
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.3211  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

The dataset consists of 299 observations and 13 variables, all of which are numeric, with several acting as binary categorical indicators (e.g., anaemia, diabetes, sex). No missing values or duplicated records were found, which simplifies the preprocessing phase.
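The missing-value and duplicate checks reported above can be reproduced with two base R calls. The following is a minimal sketch on a made-up data frame (`toy` is hypothetical and is not the study data), built so that it contains one missing value and one repeated row:

```r
# Toy data frame standing in for the heart failure data (values are made up)
toy <- data.frame(
  age         = c(75, 55, 65, 55, NA),
  anaemia     = c(0, 0, 1, 0, 1),
  death_event = c(1, 0, 1, 0, 0)
)

# Missing values per column (the report shows 0 for every column)
na_per_column <- colSums(is.na(toy))

# Number of rows that exactly repeat an earlier row (the report shows 0)
n_duplicates <- sum(duplicated(toy))

print(na_per_column)
print(n_duplicates)
```

On the real data, both results being zero is what justifies skipping imputation and de-duplication.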

The target variable, death_event, is binary and indicates whether a patient died during the follow-up period. The mean of 0.321 suggests that approximately 32.1% of patients experienced a death event.

Key observations from the summary statistics include:

Age ranges from 40 to 95 years, with a mean of 60.8, showing the dataset primarily contains older adults.
Ejection fraction values (a measure of heart efficiency) range from 14% to 80%, with a median of 38%, which indicates impaired cardiac function in many patients.
Creatinine phosphokinase (CPK) is strongly right-skewed: its maximum of 7861 is more than 30 times its median of 250, and its mean (581.8) is more than double the median, pointing to high-value outliers.
Serum creatinine (median 1.1, maximum 9.4) and platelets (median 262,000, maximum 850,000) also have long upper tails, further supporting the need for normalization and outlier handling.
Binary indicators such as anaemia, diabetes, smoking, high_blood_pressure, and sex are coded as 0 and 1, and will later be treated as factors.

This overview provides a foundational understanding of the dataset and reveals which variables may require transformation, encoding, or scaling prior to modeling.
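One way to quantify the skew noted for CPK is to compare the mean with the median and to compute a moment-based skewness before and after a log transform. The sketch below uses simulated log-normal values (`cpk_sim` is an assumption, not the study data), and the `skewness` helper is defined here for illustration rather than taken from a package:

```r
# Simulated right-skewed values standing in for CPK (not the study data)
set.seed(1)
cpk_sim <- rlnorm(299, meanlog = 5.5, sdlog = 1)

# Moment-based sample skewness: near 0 for symmetric data, > 0 for a long right tail
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(cpk_sim)
skew_log <- skewness(log(cpk_sim))

# In right-skewed data the mean sits above the median
print(c(mean = mean(cpk_sim), median = median(cpk_sim)))
print(c(raw = skew_raw, log = skew_log))
```

A skewness that drops toward zero after taking logs suggests that a log transform is a reasonable candidate for such a variable.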

Correlation

The correlation matrix above provides insights into the linear relationships between numeric variables in the dataset. Notably:

Time shows the strongest correlation with death_event at -0.53: patients who died tended to have shorter follow-up periods, most likely because a death ends the follow-up.
Serum creatinine and death_event have a positive correlation of 0.29, suggesting that higher creatinine levels (an indicator of kidney dysfunction) are associated with higher mortality.
Ejection fraction, a measure of heart performance, has a negative correlation of -0.27 with death_event, which aligns with medical expectations: lower ejection fractions are linked to increased death risk.
Other variables such as age (0.25), serum_sodium (-0.20), and high_blood_pressure (0.06) have weaker correlations with the outcome but may still contribute in multivariate models.
Overall, there is no evidence of strong multicollinearity among predictors, which is favorable for modeling.

These relationships help guide feature selection and provide early intuition on variable importance in predicting patient outcomes.
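The correlations with the outcome can be read from one column of the correlation matrix. The sketch below is an illustration on simulated data (the variables `time`, `noise`, and the coefficients used to generate `death_event` are assumptions chosen only to produce a negative association):

```r
set.seed(42)
n <- 299
time_sim  <- runif(n, 4, 285)
# Simulated outcome: death is made more likely at short follow-up times
death_sim <- rbinom(n, 1, plogis(1.5 - 0.02 * time_sim))
noise_sim <- rnorm(n)  # a predictor unrelated to the outcome
sim <- data.frame(time = time_sim, noise = noise_sim, death_event = death_sim)

cor_matrix <- cor(sim)

# Correlation of each predictor with the outcome, strongest first
cor_with_outcome <- cor_matrix[c("time", "noise"), "death_event"]
print(cor_with_outcome[order(abs(cor_with_outcome), decreasing = TRUE)])
```

With a 0/1 outcome, the Pearson correlation computed here is the point-biserial correlation, so it measures how far the group means of a predictor differ between the two outcomes.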

Data Preprocessing

## Rows: 238
## Columns: 13
## $ age                    <dbl> -0.99487877, 0.35024581, 0.77059724, -1.3311599…
## $ cpk                    <dbl> -0.517206774, -0.546064189, -0.522359884, 0.000…
## $ ef                     <dbl> -0.683035135, -1.105516528, -0.260553742, 3.541…
## $ platelets              <dbl> 1.673158e+00, 1.292579e-01, -4.126409e-01, 7.52…
## $ scr                    <dbl> -0.38074023, -0.09074788, 1.26254973, -0.206744…
## $ ss                     <dbl> 0.31152159, 0.08489153, 0.31152159, 0.08489153,…
## $ time                   <dbl> -1.5237013, -1.4721643, -0.9825633, -0.8666051,…
## $ death                  <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ anaemia_X1             <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0,…
## $ diabetes_X1            <dbl> 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,…
## $ high_blood_pressure_X1 <dbl> 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1,…
## $ sex_X1                 <dbl> 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1,…
## $ smoking_X1             <dbl> 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0,…

##       age                cpk                  ef              platelets       
##  Min.   :-1.75151   Min.   :-0.575952   Min.   :-2.034976   Min.   :-2.43607  
##  1st Qu.:-0.74267   1st Qu.:-0.477785   1st Qu.:-0.683035   1st Qu.:-0.53278  
##  Median :-0.07011   Median :-0.316235   Median :-0.007065   Median :-0.00183  
##  Mean   : 0.02892   Mean   : 0.032812   Mean   : 0.012107   Mean   : 0.01624  
##  3rd Qu.: 0.77060   3rd Qu.: 0.000165   3rd Qu.: 0.584409   3rd Qu.: 0.41299  
##  Max.   : 2.87235   Max.   : 7.502063   Max.   : 3.541779   Max.   : 5.99812  
##       scr                  ss                 time          death  
##  Min.   :-0.864061   Min.   :-5.354230   Min.   :-1.62678   0:162  
##  1st Qu.:-0.477404   1st Qu.:-0.594999   1st Qu.:-0.69589   1: 76  
##  Median :-0.284076   Median : 0.084892   Median :-0.13220          
##  Mean   : 0.019888   Mean   :-0.004618   Mean   : 0.05202          
##  3rd Qu.: 0.005916   3rd Qu.: 0.764782   3rd Qu.: 0.98550          
##  Max.   : 7.739045   Max.   : 2.577822   Max.   : 1.92927          
##    anaemia_X1      diabetes_X1     high_blood_pressure_X1     sex_X1      
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000         Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000         1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.0000         Median :1.0000  
##  Mean   :0.4118   Mean   :0.4538   Mean   :0.3655         Mean   :0.6555  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000         3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000         Max.   :1.0000  
##    smoking_X1    
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.3067  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

To prepare the dataset for modeling, several key preprocessing steps were applied:

Factor Conversion & Renaming: Binary variables such as anaemia, diabetes, high_blood_pressure, sex, smoking, and death_event were converted to factors to ensure proper treatment in modeling. Additionally, variable names were simplified for readability (e.g., ejection_fraction was renamed to ef).

Recipe-Based Transformation: Using the recipes package, a preprocessing pipeline was implemented. This included:

Removing zero-variance predictors
Normalizing all numeric predictors to have mean 0 and standard deviation 1
Dummy encoding of factor predictors (one 0/1 indicator column per binary factor, e.g., anaemia_X1)
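The three recipe steps listed above could be written roughly as follows. This is a sketch under assumptions, not the author's code: the data frame `df_sim` is simulated, the column `const` exists only to give the zero-variance filter something to remove, and the exact selectors used in the original pipeline are not shown in the report.

```r
library(recipes)

# Simulated stand-in for the renamed data (not the study data)
set.seed(123)
n <- 100
df_sim <- data.frame(
  age     = rnorm(n, 60, 12),
  ef      = rnorm(n, 38, 12),
  const   = 1,  # zero-variance column, included only to show step_zv()
  anaemia = factor(rbinom(n, 1, 0.4), levels = c(0, 1)),
  death   = factor(rbinom(n, 1, 0.3), levels = c(0, 1))
)

rec <- recipe(death ~ ., data = df_sim)
rec <- step_zv(rec, all_predictors())                 # drop zero-variance predictors
rec <- step_normalize(rec, all_numeric_predictors())  # center to mean 0, scale to sd 1
rec <- step_dummy(rec, all_nominal_predictors())      # factor -> 0/1 indicator column

# Estimate the steps on df_sim, then return the processed version of the same data
processed <- bake(prep(rec), new_data = NULL)
print(names(processed))
```

For a two-level factor with levels 0 and 1, `step_dummy()` produces a single indicator column named with an `_X1` suffix, which matches the column names in the processed data shown above.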

Train/Test Split: The dataset was split into training (80%) and testing (20%) sets, stratified by the target variable (death). The training set contains 238 observations and the test set 61.
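A stratified split samples the training rows separately within each outcome class so that both sets keep the same class mix. The base R sketch below illustrates the idea; it is not the splitting function used in the report. The class counts (203 survivors, 96 deaths) are derived from the training summary and the test confusion matrices, and rounding down within each class is an assumption that happens to reproduce the reported 162 and 76 training rows.

```r
set.seed(2024)
# Outcome vector with the class counts implied by the report; row order is made up
death <- factor(rep(c(0, 1), times = c(203, 96)))

# Sample 80% of the row indices within each class, rounding down
train_idx <- unlist(lapply(split(seq_along(death), death), function(idx) {
  sample(idx, size = floor(0.8 * length(idx)))
}))

train_death <- death[train_idx]
test_death  <- death[-train_idx]

print(table(train_death))
print(round(prop.table(table(train_death)), 3))
print(round(prop.table(table(test_death)), 3))
```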

Resulting Dataset Summary: The processed train_data contains 13 columns, including normalized numeric features (age, cpk, ef, platelets, scr, ss, time) and binary dummy variables for categorical features (anaemia_X1, diabetes_X1, high_blood_pressure_X1, sex_X1, smoking_X1). Notably:

The mean of all normalized numeric variables is close to 0, confirming successful standardization.
Categorical variables such as sex_X1 and diabetes_X1 have binary values (0 or 1), suitable for use in distance-based models like KNN.
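The class distribution of death is preserved after the stratified split (approximately 68% survival and 32% death in the training set, matching the full dataset), so the training data remain representative. The classes themselves are imbalanced, with survivors outnumbering deaths roughly two to one.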

These preprocessing steps ensure the dataset is clean, standardized, and ready for predictive modeling.

EXPLORATORY DATA ANALYSIS (EDA)

Exploratory Data Analysis (EDA) was conducted to better understand variable distributions, detect outliers, and explore relationships between predictors and the target variable (death). This step supports hypothesis generation and guides preprocessing and modeling choices.

The boxplots (with log-scale Y axis) reveal strong right skewness in cpk and scr, and several high-value outliers in cpk, scr, and platelets. These features may benefit from transformation or scaling prior to modeling.
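The points drawn individually on a boxplot are those lying more than 1.5 times the interquartile range beyond the box, and `boxplot.stats()` returns them directly. The sketch below counts such points before and after a log transform on simulated values (`scr_sim` is an assumption, not the study data):

```r
set.seed(7)
scr_sim <- rlnorm(299, meanlog = 0.1, sdlog = 0.5)  # simulated, right-skewed

# Values flagged by the 1.5 * IQR rule, i.e. the points boxplot() draws individually
out_raw <- boxplot.stats(scr_sim)$out
out_log <- boxplot.stats(log(scr_sim))$out

print(c(raw = length(out_raw), log = length(out_log)))
```

If most flagged points disappear on the log scale, they are better described as the tail of a skewed distribution than as data errors.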

The age distribution is approximately normal but slightly skewed right. Most patients fall between ages 50 and 70, aligning with common demographics for heart failure risk.

The time variable shows a fairly even spread from 4 to 285 days, with roughly a quarter of patients followed for more than 200 days (third quartile: 203 days). This gives a good range of follow-up durations for modeling survival.

Patients who died had a noticeably lower median ejection fraction compared to survivors, highlighting reduced heart function as a significant factor in mortality.

Patients who died tended to have higher serum creatinine levels, indicating poorer kidney function, which is known to correlate with worse heart failure outcomes.
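The group comparisons for ejection fraction and serum creatinine can be summarized numerically with per-group medians. The sketch below uses simulated values in which deaths are deliberately given lower ef and higher scr; the means and spreads are assumptions and are not estimates from the study data.

```r
set.seed(11)
death <- factor(rep(c(0, 1), times = c(203, 96)))
# Simulated values: deaths are given lower ef and higher scr on average
ef  <- ifelse(death == 1, rnorm(299, 33, 10), rnorm(299, 40, 10))
scr <- ifelse(death == 1, rlnorm(299, 0.5, 0.5), rlnorm(299, 0.1, 0.4))
sim <- data.frame(death, ef, scr)

# Median of each variable within each outcome group
medians <- aggregate(cbind(ef, scr) ~ death, data = sim, FUN = median)
print(medians)
```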

While the median platelet counts between groups appear similar, the range is broader among deceased patients, showing more variability that may be associated with clinical instability.

Across binary health indicators like anaemia, diabetes, and high_blood_pressure, the proportion of deaths remains relatively consistent. No strong categorical driver stands out visually.
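The proportion of deaths within each level of a binary indicator is a row-wise proportion of the cross-tabulation. A minimal sketch on simulated, independent variables (so the two rows should come out similar, as described above for the real indicators):

```r
set.seed(21)
n <- 299
# Simulated indicator and outcome, generated independently of each other
sim <- data.frame(
  anaemia = factor(rbinom(n, 1, 0.43)),
  death   = factor(rbinom(n, 1, 0.32))
)

counts <- table(anaemia = sim$anaemia, death = sim$death)

# margin = 1 divides each row by its total, so each row sums to 1
death_share <- prop.table(counts, margin = 1)
print(round(death_share, 3))
```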

Between age and serum creatinine there is no clear linear trend, but some older patients with high creatinine levels fall into the death category, suggesting that compounding risk factors may exist.

Deceased patients often had lower ejection fractions and were followed for shorter durations, consistent with worse prognosis and earlier adverse events.

Pair Plot of Key Variables

The pair plot reveals weak to moderate correlations among features. ef and scr have some class-separating power. The visual clustering shows that combined features might improve predictiveness.

The exploratory analysis uncovered several important insights that will guide our modeling strategy. Variables such as ejection fraction, serum creatinine, and follow-up time show noticeable differences between death outcomes, indicating their potential predictive power. While categorical variables like anaemia and diabetes showed weaker associations individually, they may still contribute within multivariate models. Additionally, visual inspection confirmed the presence of skewed distributions and outliers in variables like creatinine and platelets, justifying our earlier preprocessing steps. Overall, the EDA supports the development of robust predictive models and highlights key features likely to influence patient survival.

PREDICTIVE DATA ANALYSIS AND MODELLING

Following the exploratory and preprocessing phases, this section focuses on building and evaluating predictive models to estimate the likelihood of death among heart failure patients. The objective is to assess the predictive power of clinical variables using various machine learning algorithms. We apply and compare four models: three classifiers (Logistic Regression, K-Nearest Neighbors (KNN), and Naive Bayes) and, as a baseline, Linear Regression with its numeric predictions converted to class labels. Each model is trained on the preprocessed training set and evaluated using accuracy on the test set. This comparative analysis will help identify the most effective model for predicting patient outcomes in a clinical decision-making context.

Linear Regression (Baseline)

## 
## Call:
## lm(formula = death ~ ., data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.76595 -0.25978 -0.01087  0.22603  0.92758 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             0.404236   0.058252   6.939 4.15e-11 ***
## age                     0.060538   0.023756   2.548   0.0115 *  
## cpk                     0.032132   0.022852   1.406   0.1611    
## ef                     -0.119492   0.024122  -4.954 1.43e-06 ***
## platelets              -0.015722   0.023208  -0.677   0.4988    
## scr                     0.092215   0.022791   4.046 7.16e-05 ***
## ss                     -0.032070   0.023908  -1.341   0.1812    
## time                   -0.227650   0.024449  -9.311  < 2e-16 ***
## anaemia_X1             -0.056082   0.048407  -1.159   0.2479    
## diabetes_X1             0.011898   0.047870   0.249   0.8039    
## high_blood_pressure_X1 -0.040922   0.049391  -0.829   0.4082    
## sex_X1                 -0.067353   0.055773  -1.208   0.2285    
## smoking_X1              0.002115   0.055237   0.038   0.9695    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3544 on 225 degrees of freedom
## Multiple R-squared:  0.4536, Adjusted R-squared:  0.4245 
## F-statistic: 15.57 on 12 and 225 DF,  p-value: < 2.2e-16

##          Actual
## Predicted  0  1
##         0 34  7
##         1  7 13

## Accuracy: 0.7705

## Area under the curve: 0.8183

The linear regression model, used here as a baseline classifier, achieved an accuracy of 77.0% on the test dataset (47 of 61 patients classified correctly). The ROC curve shows good separation, with an AUC (Area Under the Curve) of 0.818, indicating decent discriminative ability even though linear regression is not designed for classification tasks.

From the model summary, we observe that several predictors significantly contribute to the model: age, ejection fraction (ef), serum creatinine (scr), and time. Notably, higher ef and longer time are associated with lower probability of death, while higher scr and age increase the likelihood of death — aligning with medical intuition.

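While the model performed reasonably well, it has limitations as a classifier: its assumptions (such as normally distributed residuals with constant variance) do not hold for a binary outcome, and its fitted values are not constrained to the [0, 1] range, so they cannot be read directly as probabilities. This makes it a useful benchmark but not the most suitable choice for binary classification.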
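To make the baseline concrete, the sketch below fits least squares to a 0/1 outcome, turns the numeric predictions into classes with a 0.5 cutoff, and computes the AUC from the rank-sum identity. Everything here is an assumption for illustration: the data are simulated, the 0.5 cutoff is a common default rather than a value stated in the report, and `auc_rank` is a helper defined here, not a package function.

```r
set.seed(99)
n <- 300
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.8 + 1.5 * x))  # simulated 0/1 outcome
train <- data.frame(x = x[1:240],   y = y[1:240])
test  <- data.frame(x = x[241:300], y = y[241:300])

# Least squares on the 0/1 outcome (a linear probability model)
fit_lm <- lm(y ~ x, data = train)
score  <- predict(fit_lm, newdata = test)
print(range(score))  # predictions are not confined to [0, 1]

# Classify with a 0.5 cutoff and compute accuracy
pred     <- as.integer(score > 0.5)
accuracy <- mean(pred == test$y)

# AUC via the rank-sum (Mann-Whitney) identity: the probability that a randomly
# chosen positive case receives a higher score than a randomly chosen negative case
auc_rank <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
auc <- auc_rank(score, test$y)
print(c(accuracy = accuracy, auc = auc))
```

Because the AUC depends only on the ranking of the scores, it is unaffected by predictions falling outside [0, 1], which is why a linear model can still score well on it.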

Logistic Regression

## 
## Call:
## glm(formula = death ~ ., family = binomial, data = train_data)
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -0.719910   0.518634  -1.388 0.165110    
## age                     0.502041   0.208042   2.413 0.015814 *  
## cpk                     0.195708   0.208284   0.940 0.347413    
## ef                     -0.977994   0.229636  -4.259 2.05e-05 ***
## platelets              -0.129127   0.215785  -0.598 0.549570    
## scr                     0.813332   0.215366   3.777 0.000159 ***
## ss                     -0.330033   0.204540  -1.614 0.106628    
## time                   -1.835981   0.285229  -6.437 1.22e-10 ***
## anaemia_X1             -0.523338   0.435012  -1.203 0.228960    
## diabetes_X1             0.093390   0.413350   0.226 0.821252    
## high_blood_pressure_X1 -0.209523   0.416781  -0.503 0.615163    
## sex_X1                 -0.589627   0.495427  -1.190 0.233991    
## smoking_X1              0.001479   0.477978   0.003 0.997531    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 298.15  on 237  degrees of freedom
## Residual deviance: 161.86  on 225  degrees of freedom
## AIC: 187.86
## 
## Number of Fisher Scoring iterations: 6

##          Actual
## Predicted  0  1
##         0 35  7
##         1  6 13

## Logistic Regression Accuracy: 0.7869

## Area under the curve: 0.8207

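The logistic regression model slightly outperformed the linear regression baseline, with an accuracy of 78.7% (48 of 61 correct, one more than the baseline) and an AUC of 0.821 versus 0.818. The gain comes from one fewer false positive; both models detected the same 13 of the 20 deaths in the test set, so sensitivity to death events did not change.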

Key predictors in this model included age, ejection fraction (ef), serum creatinine (scr), and time, all of which were statistically significant. Lower ejection fraction and shorter follow-up times were linked to higher death probabilities, while increased age and serum creatinine were also associated with higher risk.

These results reinforce known medical insights and support logistic regression as a solid candidate for modeling binary outcomes in clinical datasets.
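Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios. The sketch below applies this to the four significant coefficients copied from the summary above; since the numeric predictors were standardized, each ratio refers to a one standard deviation increase. Reading the intercept as a baseline probability assumes all standardized predictors at 0 and all indicator columns at 0.

```r
# Coefficients copied from the logistic regression summary above
coefs <- c(age = 0.502041, ef = -0.977994, scr = 0.813332, time = -1.835981)

# exp(coefficient) = multiplicative change in the odds of death
# per one standard deviation increase in the predictor
odds_ratios <- exp(coefs)
print(round(odds_ratios, 2))

# Inverse logit of the intercept: predicted probability of death when every
# standardized predictor is 0 and every indicator column is 0
baseline_prob <- plogis(-0.719910)
print(round(baseline_prob, 3))
```

An odds ratio below 1 (ef, time) means the odds of death fall as the predictor rises, and a ratio above 1 (age, scr) means they rise.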

K-Nearest Neighbors (KNN)

## k-Nearest Neighbors 
## 
## 238 samples
##  12 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 214, 214, 214, 214, 214, 215, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    5  0.7770870  0.4300627
##    7  0.7647536  0.3907864
##    9  0.7520580  0.3455697
##   11  0.7737971  0.3946275
##   13  0.7611014  0.3562933
##   15  0.7654493  0.3617844
##   17  0.7609203  0.3525280
##   19  0.7569348  0.3356419
##   21  0.7569348  0.3356419
##   23  0.7569348  0.3356419
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 39 13
##          1  2  7
##                                           
##                Accuracy : 0.7541          
##                  95% CI : (0.6271, 0.8554)
##     No Information Rate : 0.6721          
##     P-Value [Acc > NIR] : 0.108034        
##                                           
##                   Kappa : 0.3506          
##                                           
##  Mcnemar's Test P-Value : 0.009823        
##                                           
##             Sensitivity : 0.9512          
##             Specificity : 0.3500          
##          Pos Pred Value : 0.7500          
##          Neg Pred Value : 0.7778          
##              Prevalence : 0.6721          
##          Detection Rate : 0.6393          
##    Detection Prevalence : 0.8525          
##       Balanced Accuracy : 0.6506          
##                                           
##        'Positive' Class : 0               
##

## KNN Accuracy: 0.7541

The KNN model achieved a test accuracy of 75.4%, slightly below logistic regression. Cross-validation selected k = 5 (accuracy 0.777); accuracy at larger k was lower but not monotonically so, with k = 11 nearly matching it at 0.774.

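The confusion matrix highlights KNN’s strength in identifying survivors (class 0) and its difficulty detecting death events. Because class 0 is treated as the positive class in this output, the reported sensitivity of 95.1% is the share of survivors correctly identified (39 of 41), while the specificity of 35.0% is the share of deaths detected (7 of 20). The model is therefore conservative in predicting death and favors the majority class, and its accuracy is not significantly above the no-information rate of 67.2% (p = 0.108).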

While interpretable and non-parametric, KNN’s performance depends heavily on data scaling and parameter tuning. In this case, it performed reasonably well, though not the best.
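The dependence on scaling follows from how KNN works: each prediction is a majority vote among the k training rows closest in Euclidean distance, so a variable with a large numeric range would dominate the distance if left unscaled. The sketch below is a from-scratch illustration, not the implementation used to produce the results above; `knn_predict` and the two-cluster data are hypothetical.

```r
knn_predict <- function(train_x, train_y, test_x, k = 5) {
  train_x <- as.matrix(train_x)
  test_x  <- as.matrix(test_x)
  apply(test_x, 1, function(row) {
    # Euclidean distance from this test row to every training row
    d <- sqrt(rowSums(sweep(train_x, 2, row)^2))
    nearest <- order(d)[seq_len(k)]
    # Majority vote among the labels of the k nearest training rows
    votes <- table(train_y[nearest])
    names(votes)[which.max(votes)]
  })
}

# Two simulated, well-separated clusters on a common scale
set.seed(5)
train_x <- rbind(matrix(rnorm(40, mean = -2), ncol = 2),
                 matrix(rnorm(40, mean =  2), ncol = 2))
train_y <- factor(rep(c(0, 1), each = 20))
test_x  <- rbind(c(-2, -2), c(2, 2))

pred <- knn_predict(train_x, train_y, test_x, k = 5)
print(pred)
```

Using an odd k with two classes avoids tied votes.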

Naive Bayes

## Naive Bayes 
## 
## 238 samples
##  12 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 214, 214, 214, 214, 214, 215, ... 
## Resampling results across tuning parameters:
## 
##   usekernel  Accuracy   Kappa    
##   FALSE      0.7607536  0.3857472
##    TRUE      0.8319783  0.5761426
## 
## Tuning parameter 'laplace' was held constant at a value of 0
## Tuning
##  parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were laplace = 0, usekernel = TRUE
##  and adjust = 1.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 39 10
##          1  2 10
##                                          
##                Accuracy : 0.8033         
##                  95% CI : (0.6816, 0.894)
##     No Information Rate : 0.6721         
##     P-Value [Acc > NIR] : 0.01733        
##                                          
##                   Kappa : 0.5027         
##                                          
##  Mcnemar's Test P-Value : 0.04331        
##                                          
##             Sensitivity : 0.9512         
##             Specificity : 0.5000         
##          Pos Pred Value : 0.7959         
##          Neg Pred Value : 0.8333         
##              Prevalence : 0.6721         
##          Detection Rate : 0.6393         
##    Detection Prevalence : 0.8033         
##       Balanced Accuracy : 0.7256         
##                                          
##        'Positive' Class : 0              
##

## Naive Bayes Accuracy: 0.8033

The Naive Bayes model achieved the highest test accuracy among all models, at 80.3%. Kernel density estimation improved performance: as the tuning results show, cross-validated accuracy rose from 0.761 with usekernel = FALSE to 0.832 with usekernel = TRUE.

The confusion matrix shows the same survivor detection as KNN (sensitivity of 95.1% for class 0, the positive class in this output) with better detection of deaths: specificity rose from 35.0% to 50.0% (10 of 20 deaths detected), lifting balanced accuracy to 72.6%. Half of the death events in the test set were still missed, so the model remains biased toward predicting survival.

Overall, Naive Bayes proved to be the most effective classifier in this context, offering both simplicity and robust predictive capability.
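The effect of the kernel option can be understood from how Naive Bayes scores a case: it multiplies a class prior by one density per feature, and the kernel variant estimates each density from the data instead of assuming a Gaussian shape. The sketch below is a simplified from-scratch illustration for numeric features only, not the implementation behind the results above; `nb_kernel_fit`, `nb_kernel_predict`, and the simulated clusters are hypothetical.

```r
nb_kernel_fit <- function(x, y) {
  x <- as.matrix(x)
  y <- factor(y)
  list(
    levels = levels(y),
    prior  = as.numeric(table(y)) / length(y),
    # One kernel density estimate per class and per feature
    dens   = lapply(levels(y), function(cl) {
      lapply(seq_len(ncol(x)), function(j) density(x[y == cl, j]))
    })
  )
}

nb_kernel_predict <- function(model, newx) {
  newx <- as.matrix(newx)
  log_post <- sapply(seq_along(model$levels), function(c_idx) {
    log_lik <- sapply(seq_len(ncol(newx)), function(j) {
      d <- model$dens[[c_idx]][[j]]
      # Interpolate the estimated density at the new points; floor it to avoid log(0)
      dens_at <- approx(d$x, d$y, xout = newx[, j], rule = 2)$y
      log(pmax(dens_at, 1e-10))
    })
    # Log prior plus the sum of per-feature log densities (the "naive" independence step)
    log(model$prior[c_idx]) + rowSums(matrix(log_lik, nrow = nrow(newx)))
  })
  log_post <- matrix(log_post, nrow = nrow(newx))
  # Pick the class with the highest log posterior for each row
  model$levels[max.col(log_post, ties.method = "first")]
}

# Two simulated, well-separated clusters
set.seed(3)
x <- rbind(matrix(rnorm(100, mean = -1.5), ncol = 2),
           matrix(rnorm(100, mean =  1.5), ncol = 2))
y <- rep(c(0, 1), each = 50)

model <- nb_kernel_fit(x, y)
pred  <- nb_kernel_predict(model, rbind(c(-1.5, -1.5), c(1.5, 1.5)))
print(pred)
```

A data-driven density can follow skewed features such as cpk or scr more closely than a Gaussian, which is one plausible reason the kernel option helped here.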

Model Comparison

##                 Model  Accuracy
## 1   Linear Regression 0.7704918
## 2 Logistic Regression 0.7868852
## 3                 KNN 0.7540984
## 4         Naive Bayes 0.8032787

Among the four models evaluated, Naive Bayes emerged as the top performer with an accuracy of 80.3%, benefiting from kernel density estimation which improved its flexibility. Logistic Regression followed closely with 78.7%, offering strong interpretability and alignment with clinical knowledge.

Linear Regression, used as a baseline, performed decently at 77.0% accuracy despite not being a classification model. Meanwhile, K-Nearest Neighbors achieved 75.4%; while intuitive and easy to implement, it detected only 7 of the 20 deaths in the test set.

Overall, Naive Bayes provided the best trade-off between simplicity and predictive power, making it a promising choice for risk prediction in medical datasets with structured features.
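Accuracy alone hides how each model treats the two classes, so it can help to recompute per-class recall from the four test-set confusion matrices reported above. In the sketch below the counts are copied from those outputs; the names `survivor_recall` and `death_recall` are labels chosen here for clarity.

```r
# Test-set confusion matrices copied from the outputs above
# (rows = predicted class, columns = actual class, classes ordered 0 then 1;
#  matrix() fills column by column)
cms <- list(
  "Linear Regression"   = matrix(c(34, 7, 7, 13),  nrow = 2),
  "Logistic Regression" = matrix(c(35, 6, 7, 13),  nrow = 2),
  "KNN"                 = matrix(c(39, 2, 13, 7),  nrow = 2),
  "Naive Bayes"         = matrix(c(39, 2, 10, 10), nrow = 2)
)

metrics <- t(sapply(cms, function(m) {
  c(
    accuracy        = sum(diag(m)) / sum(m),
    survivor_recall = m[1, 1] / sum(m[, 1]),  # share of actual survivors predicted 0
    death_recall    = m[2, 2] / sum(m[, 2])   # share of actual deaths predicted 1
  )
}))
print(round(metrics, 3))
```

The accuracies reproduce the comparison table. The recall columns show that Naive Bayes and KNN identify the most survivors (39 of 41), while Linear and Logistic Regression detect the most deaths (13 of 20).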

This phase of the analysis demonstrated the practical application of multiple machine learning models for predicting mortality in heart failure patients. By comparing Logistic Regression, KNN, Naive Bayes, and Linear Regression, we identified Naive Bayes as the most effective method in terms of test accuracy, while Logistic Regression and the Linear Regression baseline detected more of the death events (13 of 20, versus 10 for Naive Bayes and 7 for KNN). These findings support the integration of data-driven tools into clinical decision-making, enabling early identification of high-risk patients and potentially improving treatment outcomes.

CONCLUSION

This project presented a comprehensive analysis pipeline for predicting mortality among patients with heart failure using real clinical data. Beginning with data critique and preprocessing, we ensured our dataset was clean, standardized, and ready for model development. Through exploratory data analysis (EDA), we uncovered significant patterns, such as the impact of variables like ejection fraction, serum creatinine, and age on patient outcomes. Correlation analysis and visualizations further guided our understanding of which features held predictive value.

During the predictive modeling phase, four approaches—Linear Regression, Logistic Regression, K-Nearest Neighbors (KNN), and Naive Bayes—were evaluated. Each method brought unique advantages: Linear Regression served as a baseline, Logistic Regression aligned closely with clinical reasoning, KNN provided a non-parametric perspective, and Naive Bayes demonstrated the strongest predictive power, especially when kernel density was applied.

Naive Bayes ultimately achieved the highest test accuracy (80.3%, 95% CI 68.2% to 89.4%), suggesting potential for real-world application in early intervention strategies, although its margin over logistic regression is a single test case (49 versus 48 of 61 correct). Logistic regression therefore remains a strong alternative due to its transparency and ease of interpretation.

This analysis not only highlights the utility of machine learning in healthcare but also emphasizes the importance of preprocessing, model tuning, and interpretability in building reliable predictive systems. By leveraging data analytics, medical professionals can be better equipped to identify at-risk individuals and improve patient care outcomes through informed decision-making.