1 🏥 Introduction

What is Heart Failure? Heart failure happens when the heart cannot pump enough blood to the body. It is one of the top causes of death worldwide, especially in people over 60 years old.

Why is this dataset important? This dataset has clinical records of 299 heart failure patients from Faisalabad, Pakistan (2015). It helps us find which clinical factors predict whether a patient will survive.

Goal: Explore the data step by step — filter, summarize, visualize, run statistical tests, and use machine learning to predict patient death (DEATH_EVENT).


2 📦 Load Libraries and Dataset

2.1 Which libraries are required?

# Install any packages that are not already installed
packages_needed <- c("dplyr", "ggplot2", "corrplot",
                     "caret", "class", "cluster", "factoextra", "knitr")

for (pkg in packages_needed) {
  # requireNamespace() checks a single package without listing the whole library
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}

# Load all libraries
library(dplyr)       # for data filtering and grouping
library(ggplot2)     # for all charts and plots
library(corrplot)    # for correlation heatmap
library(caret)       # for machine learning tools
library(class)       # for KNN algorithm
library(cluster)     # for clustering utilities (K-Means itself is stats::kmeans)
library(factoextra)  # for cluster visualization
library(knitr)       # for clean tables

# Set default chunk options: figure size, alignment, and hide warnings/messages
knitr::opts_chunk$set(fig.width = 9, fig.height = 5,
                      fig.align = "center",
                      warning = FALSE, message = FALSE)

2.2 How do you load the dataset?

# Read the CSV file (make sure it is in your working directory)
df_raw <- read.csv("heart_failure_clinical_records_dataset.csv")

# Basic info
cat("Rows:", nrow(df_raw), "| Columns:", ncol(df_raw), "\n\n")
## Rows: 299 | Columns: 13
# Show first 6 rows
head(df_raw)
##   age anaemia creatinine_phosphokinase diabetes ejection_fraction
## 1  75       0                      582        0                20
## 2  55       0                     7861        0                38
## 3  65       0                      146        0                20
## 4  50       1                      111        0                20
## 5  65       1                      160        1                20
## 6  90       1                       47        0                40
##   high_blood_pressure platelets serum_creatinine serum_sodium sex smoking time
## 1                   1    265000              1.9          130   1       0    4
## 2                   0    263358              1.1          136   1       0    6
## 3                   0    162000              1.3          129   1       1    7
## 4                   0    210000              1.9          137   1       0    7
## 5                   0    327000              2.7          116   0       0    8
## 6                   1    204000              2.1          132   1       1    8
##   DEATH_EVENT
## 1           1
## 2           1
## 3           1
## 4           1
## 5           1
## 6           1

What do the columns mean?

Column                     Meaning
age                        Patient age in years
anaemia                    Has anaemia? (0=No, 1=Yes)
creatinine_phosphokinase   CPK enzyme level in blood (mcg/L)
diabetes                   Has diabetes? (0=No, 1=Yes)
ejection_fraction          % of blood pumped out per heartbeat
high_blood_pressure        Has high BP? (0=No, 1=Yes)
platelets                  Platelet count in blood
serum_creatinine           Kidney function marker (mg/dL)
serum_sodium               Sodium level in blood (mEq/L)
sex                        0=Female, 1=Male
smoking                    Smokes? (0=No, 1=Yes)
time                       Follow-up period in days
DEATH_EVENT                TARGET: 0=Survived, 1=Died

3 📈 Expand Dataset to 2,500 Rows

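We add more rows using synthetic augmentation: each new row is a copy of a randomly chosen original row, with random noise added to its numeric columns and an occasional flip of a binary column. This gives the machine learning models more rows to train on, but the added rows are derived from the same 299 patients and carry no new clinical information.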

set.seed(42)  # makes results repeatable

expand_data <- function(data, target_rows = 2500) {

  # Columns where we add small noise
  num_cols <- c("age", "creatinine_phosphokinase", "ejection_fraction",
                "platelets", "serum_creatinine", "serum_sodium", "time")

  # Columns where we rarely flip the value (5% chance)
  bin_cols <- c("anaemia", "diabetes", "high_blood_pressure",
                "sex", "smoking", "DEATH_EVENT")

  original_n  <- nrow(data)
  rows_to_add <- target_rows - original_n
  new_rows    <- list()

  for (i in seq_len(rows_to_add)) {  # seq_len() is safe when rows_to_add is 0
    base_row <- data[sample(original_n, 1), ]  # pick a random row
    new_row  <- base_row

    # Add Gaussian noise (sd = 5% of the column's range), then clamp to the observed range
    for (col in num_cols) {
      col_range      <- max(data[[col]]) - min(data[[col]])
      noise          <- rnorm(1, mean = 0, sd = 0.05 * col_range)
      new_row[[col]] <- base_row[[col]] + noise
      new_row[[col]] <- max(min(new_row[[col]], max(data[[col]])), min(data[[col]]))
    }

    # 5% chance to flip a binary column
    for (col in bin_cols) {
      if (runif(1) < 0.05) new_row[[col]] <- 1 - base_row[[col]]
    }

    new_row$age  <- round(new_row$age, 0)
    new_rows[[i]] <- new_row
  }

  expanded <- rbind(data, do.call(rbind, new_rows))
  expanded <- expanded[sample(nrow(expanded)), ]  # shuffle rows
  rownames(expanded) <- NULL
  return(expanded)
}

df <- expand_data(df_raw, 2500)

cat("Original rows:", nrow(df_raw), "\n")
## Original rows: 299
cat("Expanded rows:", nrow(df), "\n")
## Expanded rows: 2500

✅ Dataset expanded from 299 → 2,500 rows successfully.
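
The noise-and-clamp step inside expand_data() can be looked at in isolation. The sketch below is a minimal illustration, not the notebook's own code: jitter_clamped() and its frac argument are hypothetical names, and the input is the six ejection_fraction values printed by head(df_raw) earlier.

```r
# Hypothetical helper (not part of the original notebook): add Gaussian noise
# with sd = frac * range to a numeric vector, then clamp to the observed range.
jitter_clamped <- function(x, frac = 0.05) {
  rng   <- range(x)
  noise <- rnorm(length(x), mean = 0, sd = frac * diff(rng))
  pmin(pmax(x + noise, rng[1]), rng[2])
}

set.seed(1)
ef     <- c(20, 38, 20, 20, 20, 40)  # first six ejection_fraction values
ef_new <- jitter_clamped(ef)

range(ef_new)  # never leaves the original 20..40 range
```

Because of the clamp, any noisy value that would fall outside the observed range is set exactly to the minimum or maximum. That is a likely reason why boundary values such as 9.4 mg/dL creatinine and 80% ejection fraction repeat several times in the expanded data.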


4 🔵 Level 1 – Understanding the Data

4.1 Q1.1 – What is the structure of the dataset?

cat("Total Rows:", nrow(df), "\n")
## Total Rows: 2500
cat("Total Columns:", ncol(df), "\n\n")
## Total Columns: 13
# str() shows data type of each column
str(df)
## 'data.frame':    2500 obs. of  13 variables:
##  $ age                     : num  95 49 74 71 95 60 52 52 56 40 ...
##  $ anaemia                 : num  0 1 0 1 1 0 0 0 1 0 ...
##  $ creatinine_phosphokinase: num  513.2 70.5 153.6 23 23 ...
##  $ diabetes                : num  1 0 1 1 0 0 0 0 0 1 ...
##  $ ejection_fraction       : num  37.5 50.6 14.9 29.1 38.2 ...
##  $ high_blood_pressure     : num  1 0 1 0 1 1 0 0 0 0 ...
##  $ platelets               : num  273506 180778 351536 195750 206778 ...
##  $ serum_creatinine        : num  1.429 1.067 1.586 2.806 0.922 ...
##  $ serum_sodium            : num  135 142 136 138 139 ...
##  $ sex                     : num  1 0 1 1 0 1 0 1 1 1 ...
##  $ smoking                 : num  0 0 1 1 0 0 1 0 1 0 ...
##  $ time                    : num  33.4 196.1 192.3 55.8 29.7 ...
##  $ DEATH_EVENT             : num  1 0 0 0 1 0 0 0 0 0 ...

4.2 Q1.2 – Are there any missing values?

# colSums + is.na counts missing per column
missing_count <- colSums(is.na(df))

cat("=== Missing Values per Column ===\n")
## === Missing Values per Column ===
print(missing_count)
##                      age                  anaemia creatinine_phosphokinase 
##                        0                        0                        0 
##                 diabetes        ejection_fraction      high_blood_pressure 
##                        0                        0                        0 
##                platelets         serum_creatinine             serum_sodium 
##                        0                        0                        0 
##                      sex                  smoking                     time 
##                        0                        0                        0 
##              DEATH_EVENT 
##                        0
cat("\nTotal missing values:", sum(missing_count), "\n")
## 
## Total missing values: 0

4.3 Q1.3 – Summary statistics of key variables

# summary() gives min, max, mean, median for each column
summary(df[, c("age", "ejection_fraction", "serum_creatinine",
               "serum_sodium", "creatinine_phosphokinase")])
##       age       ejection_fraction serum_creatinine  serum_sodium  
##  Min.   :40.0   Min.   :14.00     Min.   :0.5000   Min.   :113.0  
##  1st Qu.:51.0   1st Qu.:29.95     1st Qu.:0.8445   1st Qu.:134.0  
##  Median :60.0   Median :37.03     Median :1.1900   Median :136.9  
##  Mean   :60.9   Mean   :37.91     Mean   :1.4683   Mean   :136.6  
##  3rd Qu.:69.0   3rd Qu.:44.10     3rd Qu.:1.7000   3rd Qu.:139.6  
##  Max.   :95.0   Max.   :80.00     Max.   :9.4000   Max.   :148.0  
##  creatinine_phosphokinase
##  Min.   :  23.00         
##  1st Qu.:  77.05         
##  Median : 390.15         
##  Mean   : 638.07         
##  3rd Qu.: 803.86         
##  Max.   :7861.00

4.4 Q1.4 – Distribution of Survived vs Died

death_table <- table(df$DEATH_EVENT)

death_df <- data.frame(
  Outcome    = c("Survived (0)", "Died (1)"),
  Count      = as.vector(death_table),
  Percentage = round(as.vector(death_table) / nrow(df) * 100, 1)
)

cat("=== Death Event Distribution ===\n")
## === Death Event Distribution ===
print(death_df)
##        Outcome Count Percentage
## 1 Survived (0)  1621       64.8
## 2     Died (1)   879       35.2

4.5 Q1.5 – Unique values in categorical variables

cat("=== Categorical Variables – Unique Values ===\n\n")
## === Categorical Variables – Unique Values ===
cat("Sex         (0=Female, 1=Male) :", unique(df$sex), "\n")
## Sex         (0=Female, 1=Male) : 1 0
cat("Smoking     (0=No, 1=Yes)      :", unique(df$smoking), "\n")
## Smoking     (0=No, 1=Yes)      : 0 1
cat("Diabetes    (0=No, 1=Yes)      :", unique(df$diabetes), "\n")
## Diabetes    (0=No, 1=Yes)      : 1 0
cat("Anaemia     (0=No, 1=Yes)      :", unique(df$anaemia), "\n")
## Anaemia     (0=No, 1=Yes)      : 0 1
cat("High BP     (0=No, 1=Yes)      :", unique(df$high_blood_pressure), "\n")
## High BP     (0=No, 1=Yes)      : 1 0
cat("DEATH_EVENT (0=Survived,1=Died):", unique(df$DEATH_EVENT), "\n")
## DEATH_EVENT (0=Survived,1=Died): 1 0
cat("\n--- Counts for each ---\n")
## 
## --- Counts for each ---
cat("Sex:\n");         print(table(df$sex))
## Sex:
## 
##    0    1 
##  936 1564
cat("Smoking:\n");     print(table(df$smoking))
## Smoking:
## 
##    0    1 
## 1674  826
cat("DEATH_EVENT:\n"); print(table(df$DEATH_EVENT))
## DEATH_EVENT:
## 
##    0    1 
## 1621  879

Key Insights – Level 1:

  • Dataset has 2,500 rows and 13 columns — zero missing values.
  • About 65% survived and 35% died (moderately imbalanced target).
  • All yes/no features are encoded as 0 or 1.
  • Data is clean and ready for analysis.
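
The class balance also sets a reference point for the models later on. The sketch below rebuilds the outcome vector from the counts printed in Q1.4 (1621 survived, 879 died) and computes the accuracy of a trivial classifier that always predicts the majority class. The variable names are made up for this illustration.

```r
# Outcome vector rebuilt from the Q1.4 counts (0 = survived, 1 = died)
outcome <- rep(c(0, 1), times = c(1621, 879))

class_share <- prop.table(table(outcome))  # proportion of each class
baseline    <- max(class_share)            # accuracy of "always predict survived"

round(class_share * 100, 1)
round(baseline * 100, 1)
```

Any classifier worth keeping should beat this majority-class baseline, so raw accuracy alone can look better than it is on an imbalanced target.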

5 🟡 Level 2 – Data Extraction & Filtering

5.1 Q2.1 – Patients older than 60 years

older_60 <- df[df$age > 60, ]

cat("Patients older than 60:", nrow(older_60), "\n\n")
## Patients older than 60: 1191
head(older_60[, c("age", "ejection_fraction", "serum_creatinine", "DEATH_EVENT")], 8)
##    age ejection_fraction serum_creatinine DEATH_EVENT
## 1   95          37.50308        1.4291232           1
## 3   74          14.87444        1.5863328           0
## 4   71          29.10798        2.8055064           0
## 5   95          38.17614        0.9223058           1
## 11  67          40.10774        1.3274221           0
## 13  78          32.05026        1.3598105           1
## 15  64          43.11965        0.5279466           1
## 17  69          35.00000        3.5000000           1

5.2 Q2.2 – Patients with high CPK (> 500)

# CPK = creatinine phosphokinase: high values may indicate heart muscle damage
high_cpk <- df[df$creatinine_phosphokinase > 500, ]

cat("Patients with CPK > 500:", nrow(high_cpk), "\n\n")
## Patients with CPK > 500: 1074
head(high_cpk[, c("age", "creatinine_phosphokinase", "DEATH_EVENT")], 8)
##    age creatinine_phosphokinase DEATH_EVENT
## 1   95                 513.1767           1
## 6   60                2261.0000           0
## 8   52                2106.0245           0
## 9   56                1098.4602           0
## 10  40                 601.7422           0
## 11  67                 704.7996           0
## 12  51                 582.0000           0
## 13  78                2013.6477           1

5.3 Q2.3 – Patients who died AND had high blood pressure

died_highbp <- df[df$DEATH_EVENT == 1 & df$high_blood_pressure == 1, ]

cat("Died + High Blood Pressure:", nrow(died_highbp), "\n\n")
## Died + High Blood Pressure: 376
head(died_highbp[, c("age", "serum_creatinine", "high_blood_pressure", "DEATH_EVENT")], 8)
##    age serum_creatinine high_blood_pressure DEATH_EVENT
## 1   95        1.4291232                   1           1
## 5   95        0.9223058                   1           1
## 15  64        0.5279466                   1           1
## 27  55        8.9350887                   1           1
## 34  93        1.8364702                   1           1
## 38  78        1.6093777                   1           1
## 41  83        9.4000000                   1           1
## 56  51        0.9000000                   1           1

5.4 Q2.4 – Female patients who died

# sex = 0 means Female
female_died <- df[df$sex == 0 & df$DEATH_EVENT == 1, ]

cat("Female patients who died:", nrow(female_died), "\n\n")
## Female patients who died: 325
head(female_died[, c("age", "ejection_fraction", "serum_creatinine", "DEATH_EVENT")], 8)
##    age ejection_fraction serum_creatinine DEATH_EVENT
## 5   95          38.17614        0.9223058           1
## 22  60          22.70485        3.0176536           1
## 27  55          65.10067        8.9350887           1
## 29  49          53.81690        1.9964448           1
## 30  65          65.00000        1.5000000           1
## 50  85          53.50970        1.4201214           1
## 68  51          54.24440        1.8306807           1
## 73  55          23.74312        1.1897970           1

5.5 Q2.5 – Ejection fraction above average

avg_ef     <- mean(df$ejection_fraction)
above_avg  <- df[df$ejection_fraction > avg_ef, ]

cat("Average Ejection Fraction:", round(avg_ef, 2), "%\n")
## Average Ejection Fraction: 37.91 %
cat("Patients above average EF:", nrow(above_avg), "\n")
## Patients above average EF: 1168

Key Insights – Level 2:

  • 1,191 patients are older than 60, an age range associated with higher mortality in this data.
  • 1,074 patients have high CPK (> 500), which may indicate heart muscle damage.
  • 376 patients both died and had high blood pressure (about 43% of the 879 deaths).
  • 325 of the 936 female patients died (34.7%), close to the male rate of 35.4% (554 of 1,564); the lower female count mainly reflects that fewer females are in the dataset.
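
The filters above use logical indexing with df[condition, ]. That works here because the data has no missing values. As a sketch on made-up toy data, the example below shows how the same filter behaves when a value is NA, and how base R's subset() differs.

```r
# Toy data (invented for illustration) with one missing age
toy <- data.frame(age = c(70, NA, 50), DEATH_EVENT = c(1, 0, 0))

by_index  <- toy[toy$age > 60, ]    # NA in the condition yields an all-NA row
by_subset <- subset(toy, age > 60)  # subset() drops rows where the condition is NA

nrow(by_index)
nrow(by_subset)
```

If missing values ever appear in the source data, counts taken with nrow() on a bracket filter would be inflated by those NA rows, so subset() or an explicit !is.na() check is the safer choice.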

6 🟢 Level 3 – Grouping & Summarization

6.1 Q3.1 – Average age by outcome

avg_age <- tapply(df$age, df$DEATH_EVENT, mean)

cat("=== Average Age by Outcome ===\n")
## === Average Age by Outcome ===
cat("Survived (0):", round(avg_age[["0"]], 2), "years\n")
## Survived (0): 59.02 years
cat("Died     (1):", round(avg_age[["1"]], 2), "years\n")
## Died     (1): 64.37 years

6.2 Q3.2 – Average serum creatinine by sex

avg_creat <- tapply(df$serum_creatinine, df$sex, mean)

cat("=== Average Serum Creatinine by Sex ===\n")
## === Average Serum Creatinine by Sex ===
cat("Female (0):", round(avg_creat[["0"]], 3), "mg/dL\n")
## Female (0): 1.442 mg/dL
cat("Male   (1):", round(avg_creat[["1"]], 3), "mg/dL\n")
## Male   (1): 1.484 mg/dL

6.3 Q3.3 – Average ejection fraction by smoking status

avg_ef_smoke <- tapply(df$ejection_fraction, df$smoking, mean)

cat("=== Average Ejection Fraction by Smoking ===\n")
## === Average Ejection Fraction by Smoking ===
cat("Non-Smoker (0):", round(avg_ef_smoke[["0"]], 2), "%\n")
## Non-Smoker (0): 38.65 %
cat("Smoker     (1):", round(avg_ef_smoke[["1"]], 2), "%\n")
## Smoker     (1): 36.41 %

6.4 Q3.4 – Patient count by diabetes status

diab_count  <- table(df$diabetes)
diab_deaths <- tapply(df$DEATH_EVENT, df$diabetes, sum)

cat("=== Diabetes Status ===\n")
## === Diabetes Status ===
cat("Non-Diabetic (0):", diab_count[["0"]], "patients |",
    diab_deaths[["0"]], "deaths\n")
## Non-Diabetic (0): 1434 patients | 513 deaths
cat("Diabetic     (1):", diab_count[["1"]], "patients |",
    diab_deaths[["1"]], "deaths\n")
## Diabetic     (1): 1066 patients | 366 deaths

6.5 Q3.5 – Average serum sodium by outcome

avg_sodium <- tapply(df$serum_sodium, df$DEATH_EVENT, mean)

cat("=== Average Serum Sodium by Outcome ===\n")
## === Average Serum Sodium by Outcome ===
cat("Survived (0):", round(avg_sodium[["0"]], 2), "mEq/L\n")
## Survived (0): 137.27 mEq/L
cat("Died     (1):", round(avg_sodium[["1"]], 2), "mEq/L\n")
## Died     (1): 135.23 mEq/L
cat("\nNote: Low sodium (hyponatremia) is a warning sign in heart failure\n")
## 
## Note: Low sodium (hyponatremia) is a warning sign in heart failure

Key Insights – Level 3:

  • Patients who died were older on average (64.37 vs 59.02 years).
  • Patients who died had lower serum sodium (135.23 vs 137.27 mEq/L); low sodium is a known clinical danger sign.
  • Males have slightly higher average creatinine (1.484 vs 1.442 mg/dL), a small difference.
  • Smokers have a slightly lower average ejection fraction than non-smokers (36.41% vs 38.65%).
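
Each question in this level calls tapply() once per variable. As an alternative sketch, aggregate() with cbind() can summarize several variables by the same group in one call. The toy data frame below is invented for illustration and only reuses the notebook's column names.

```r
# Toy data (invented): two survivors and two deaths
toy <- data.frame(
  DEATH_EVENT  = c(0, 0, 1, 1),
  age          = c(50, 60, 70, 80),
  serum_sodium = c(138, 136, 134, 132)
)

# Mean of both variables for each outcome, returned as one data frame
group_means <- aggregate(cbind(age, serum_sodium) ~ DEATH_EVENT,
                         data = toy, FUN = mean)
group_means
```

The result has one row per outcome and one column per summarized variable, which is easier to pass to knitr::kable() than several separate tapply() vectors.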

7 🔴 Level 4 – Sorting & Ranking

7.1 Q4.1 – Rank patients by serum creatinine

# Rank 1 = highest creatinine = most critical kidney function
df$creatinine_rank <- rank(-df$serum_creatinine, ties.method = "first")

top_creat <- df[order(df$creatinine_rank), ]

cat("=== Top 10 Patients – Highest Serum Creatinine ===\n")
## === Top 10 Patients – Highest Serum Creatinine ===
head(top_creat[, c("creatinine_rank", "age", "serum_creatinine",
                   "ejection_fraction", "DEATH_EVENT")], 10)
##      creatinine_rank age serum_creatinine ejection_fraction DEATH_EVENT
## 41                 1  83         9.400000          37.14414           1
## 166                2  58         9.400000          69.10749           1
## 394                3  80         9.400000          35.00000           1
## 560                4  83         9.400000          36.06893           1
## 899                5  79         9.400000          37.08336           1
## 1283               6  55         9.400000          64.45976           1
## 2380               7  82         9.400000          30.27132           0
## 648                8  51         9.398521          71.20068           1
## 2168               9  77         9.271585          28.77698           1
## 135               10  78         9.230208          36.20592           1

7.2 Q4.2 – Top 10 patients by ejection fraction

sorted_ef <- df[order(-df$ejection_fraction), ]

cat("=== Top 10 – Highest Ejection Fraction ===\n")
## === Top 10 – Highest Ejection Fraction ===
head(sorted_ef[, c("age", "ejection_fraction", "serum_creatinine", "DEATH_EVENT")], 10)
##      age ejection_fraction serum_creatinine DEATH_EVENT
## 314   50          80.00000        1.3346373           0
## 619   49          80.00000        2.2080516           0
## 1480  45          80.00000        1.1800000           0
## 1962  40          80.00000        0.5000000           0
## 2256  43          80.00000        0.8291144           0
## 2248  49          79.03079        1.6140766           0
## 1337  49          78.75856        1.6088737           0
## 2281  46          76.93935        2.1476574           0
## 2223  54          76.83241        9.1755239           1
## 1299  55          76.14159        9.1035892           1

7.3 Q4.3 – Death rate by sex and smoking group

group_death <- aggregate(DEATH_EVENT ~ sex + smoking,
                         data = df,
                         FUN  = function(x) round(mean(x) * 100, 1))

group_death$Sex     <- ifelse(group_death$sex == 1, "Male", "Female")
group_death$Smoking <- ifelse(group_death$smoking == 1, "Smoker", "Non-Smoker")
group_death$Death_Rate_pct <- group_death$DEATH_EVENT

result <- group_death[order(-group_death$Death_Rate_pct),
                      c("Sex", "Smoking", "Death_Rate_pct")]

cat("=== Death Rate by Sex and Smoking (%) ===\n")
## === Death Rate by Sex and Smoking (%) ===
print(result, row.names = FALSE)
##     Sex    Smoking Death_Rate_pct
##  Female     Smoker           38.6
##    Male Non-Smoker           37.2
##  Female Non-Smoker           34.2
##    Male     Smoker           33.3

7.4 Q4.4 – Sort by age, then by serum creatinine

sorted_age_creat <- df[order(df$age, df$serum_creatinine), ]

cat("=== Sorted: Age (asc) then Creatinine (asc) ===\n")
## === Sorted: Age (asc) then Creatinine (asc) ===
head(sorted_age_creat[, c("age", "serum_creatinine",
                           "ejection_fraction", "DEATH_EVENT")], 10)
##      age serum_creatinine ejection_fraction DEATH_EVENT
## 162   40              0.5          32.76896           1
## 180   40              0.5          31.74023           0
## 259   40              0.5          59.59414           0
## 751   40              0.5          14.00000           1
## 1441  40              0.5          33.13004           0
## 1617  40              0.5          30.20877           0
## 1677  40              0.5          32.45592           1
## 1815  40              0.5          41.00378           0
## 1962  40              0.5          80.00000           0
## 2112  40              0.5          33.16646           0

7.5 Q4.5 – Minimum and maximum serum creatinine

cat("=== Extreme Serum Creatinine Values ===\n")
## === Extreme Serum Creatinine Values ===
cat("Minimum:", min(df$serum_creatinine), "mg/dL (healthiest kidney)\n")
## Minimum: 0.5 mg/dL (healthiest kidney)
cat("Maximum:", max(df$serum_creatinine), "mg/dL (most critical)\n")
## Maximum: 9.4 mg/dL (most critical)
cat("\n--- Patient with LOWEST creatinine ---\n")
## 
## --- Patient with LOWEST creatinine ---
print(df[which.min(df$serum_creatinine),
         c("age", "serum_creatinine", "ejection_fraction", "DEATH_EVENT")])
##    age serum_creatinine ejection_fraction DEATH_EVENT
## 14  50              0.5                30           0
cat("\n--- Patient with HIGHEST creatinine ---\n")
## 
## --- Patient with HIGHEST creatinine ---
print(df[which.max(df$serum_creatinine),
         c("age", "serum_creatinine", "ejection_fraction", "DEATH_EVENT")])
##    age serum_creatinine ejection_fraction DEATH_EVENT
## 41  83              9.4          37.14414           1

Key Insights – Level 4:

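  • Serum creatinine above 9 mg/dL is almost always linked to death: 9 of the 10 highest-creatinine patients died.
  • Patients with the highest ejection fraction mostly survived (8 of the top 10); the two who died also had creatinine above 9 mg/dL.
  • Female smokers have the highest death rate (38.6%) and male smokers the lowest (33.3%), so the four sex and smoking groups differ by only about 5 percentage points.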
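
Q4.1 ranks with ties.method = "first", which is why the patients tied at 9.4 mg/dL get ranks 1 to 7 instead of sharing one rank. The sketch below shows the difference on a small made-up vector.

```r
# Made-up creatinine values with a three-way tie at the top
creat <- c(9.4, 9.4, 1.2, 9.4, 0.5)

rank_first <- rank(-creat, ties.method = "first")  # ties broken by position
rank_min   <- rank(-creat, ties.method = "min")    # ties share the lowest rank

rank_first  # 1 2 4 3 5
rank_min    # 1 1 4 1 5
```

With "first", the order among tied patients only reflects their row position after shuffling, so it should not be read as a difference in severity.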

8 🟣 Level 5 – Feature Engineering

8.1 Q5.1 – Create a Risk Score

# Normalize = convert each value to a 0-1 scale
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

df$age_norm    <- normalize(df$age)
df$creat_norm  <- normalize(df$serum_creatinine)
df$ef_norm     <- 1 - normalize(df$ejection_fraction) # low EF = high risk
df$sodium_norm <- 1 - normalize(df$serum_sodium)       # low sodium = high risk

# Combine into one Risk Score (0 to 100)
df$Risk_Score <- round(
  (0.35 * df$age_norm +
   0.30 * df$creat_norm +
   0.25 * df$ef_norm +
   0.10 * df$sodium_norm) * 100, 2)

cat("=== Risk Score Summary ===\n")
## === Risk Score Summary ===
print(summary(df$Risk_Score))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.30   28.39   34.87   35.78   42.77   80.29

Formula: Risk Score = 35% Age + 30% Creatinine + 25% (1 - EF) + 10% (1 - Sodium)
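
As a worked example of the formula, the sketch below scores one hypothetical patient. The four input values are invented and assumed to be already on the 0-1 scale; ef_norm and sodium_norm are assumed to be already inverted (1 - normalized value), as in the code above.

```r
# Hypothetical patient, values already normalized to 0-1
patient <- c(age_norm = 0.5, creat_norm = 0.2, ef_norm = 0.6, sodium_norm = 0.3)

# Weights from the Risk Score formula (they sum to 1)
weights <- c(age_norm = 0.35, creat_norm = 0.30, ef_norm = 0.25, sodium_norm = 0.10)

# 0.175 + 0.060 + 0.150 + 0.030 = 0.415, scaled to 0-100
risk_score <- round(sum(weights * patient) * 100, 2)
risk_score  # 41.5
```

Because the weights sum to 1 and every input lies between 0 and 1, the score is bounded by 0 and 100.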

8.2 Q5.2 – Create a Health Index

# Health Index = 100 minus risk, minus penalties for comorbidities
df$Health_Index <- round(
  100 - df$Risk_Score
  - 5 * df$high_blood_pressure   # penalty for high BP
  - 3 * df$anaemia               # penalty for anaemia
  - 3 * df$diabetes, 2)          # penalty for diabetes

cat("=== Health Index Summary ===\n")
## === Health Index Summary ===
print(summary(df$Health_Index))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.71   52.50   60.38   59.79   67.39   93.70

8.3 Q5.3 – Categorize into Low / Medium / High Risk

df$Risk_Category <- ifelse(df$Risk_Score < 30, "Low Risk",
                    ifelse(df$Risk_Score < 55, "Medium Risk", "High Risk"))

cat("=== Risk Category Counts ===\n")
## === Risk Category Counts ===
print(table(df$Risk_Category))
## 
##   High Risk    Low Risk Medium Risk 
##          84         778        1638
cat("\n=== Death Rate per Risk Category (%) ===\n")
## 
## === Death Rate per Risk Category (%) ===
print(round(tapply(df$DEATH_EVENT, df$Risk_Category, mean) * 100, 1))
##   High Risk    Low Risk Medium Risk 
##        83.3        18.1        40.8
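
The nested ifelse() calls can also be written with cut(). This is a sketch of an alternative, not the notebook's method. The scores are made up and chosen to sit on or near the two thresholds.

```r
# Made-up scores around the 30 and 55 thresholds
scores <- c(10, 30, 54.99, 55, 80)

# right = FALSE gives intervals [-Inf, 30), [30, 55), [55, Inf),
# matching the "< 30" and "< 55" tests in the ifelse() version
risk_cat <- cut(scores,
                breaks = c(-Inf, 30, 55, Inf),
                labels = c("Low Risk", "Medium Risk", "High Risk"),
                right  = FALSE)

table(risk_cat)
```

cut() returns a factor whose levels follow the order of labels, so tables and plots list Low, Medium, High instead of the alphabetical order seen in the output above.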

8.4 Q5.4 – CPK Category (cardiac enzyme level)

df$CPK_Category <- ifelse(df$creatinine_phosphokinase < 200,  "Normal (<200)",
                   ifelse(df$creatinine_phosphokinase < 1000, "Elevated (200-999)",
                                                              "Very High (1000+)"))

cat("=== CPK Category Counts ===\n")
## === CPK Category Counts ===
print(table(df$CPK_Category))
## 
## Elevated (200-999)      Normal (<200)  Very High (1000+) 
##               1152                910                438

8.5 Q5.5 – Create Age Groups

df$Age_Group <- ifelse(df$age < 45, "Young (<45)",
                ifelse(df$age < 60, "Middle (45-59)",
                ifelse(df$age < 75, "Senior (60-74)", "Elderly (75+)")))

cat("=== Age Group Counts ===\n")
## === Age Group Counts ===
print(table(df$Age_Group))
## 
##  Elderly (75+) Middle (45-59) Senior (60-74)    Young (<45) 
##            345            989            961            205
cat("\n=== Death Rate by Age Group (%) ===\n")
## 
## === Death Rate by Age Group (%) ===
print(round(tapply(df$DEATH_EVENT, df$Age_Group, mean) * 100, 1))
##  Elderly (75+) Middle (45-59) Senior (60-74)    Young (<45) 
##           61.7           30.5           32.7           24.4

Key Insights – Level 5:

  • Risk Score combines 4 clinical variables into one number (0–100).
  • High Risk patients have the highest death rate (83.3%), versus 40.8% for Medium Risk and 18.1% for Low Risk.
  • Elderly (75+) patients have the worst survival outcomes, with a 61.7% death rate.
  • Most patients (1,638 of 2,500) fall in the Medium Risk band.

9 📊 Level 6 – Data Visualization

9.1 Q6.1 – Bar Chart: Death count by Sex

df$Sex_Label     <- ifelse(df$sex == 1, "Male", "Female")
df$Outcome_Label <- ifelse(df$DEATH_EVENT == 1, "Died", "Survived")

ggplot(df, aes(x = Sex_Label, fill = Outcome_Label)) +
  geom_bar(position = "dodge", width = 0.5, color = "white") +
  geom_text(stat = "count", aes(label = after_stat(count)),
            position = position_dodge(0.5), vjust = -0.4,
            fontface = "bold", size = 4) +
  scale_fill_manual(values = c("Died" = "#E74C3C", "Survived" = "#27AE60")) +
  labs(title    = "Q6.1 – Heart Failure Count by Sex",
       subtitle = "Males make up the majority of both groups",
       x = "Sex", y = "Count", fill = "Outcome") +
  theme_minimal(base_size = 13) +
  theme(plot.title    = element_text(face = "bold", hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5, color = "grey50"))

9.2 Q6.2 – Histogram: Age Distribution

ggplot(df, aes(x = age, fill = Outcome_Label)) +
  geom_histogram(binwidth = 4, color = "white", alpha = 0.8, position = "identity") +
  scale_fill_manual(values = c("Died" = "#E74C3C", "Survived" = "#27AE60")) +
  geom_vline(xintercept = mean(df$age), linetype = "dashed",
             color = "#2C3E50", linewidth = 1) +
  annotate("text", x = mean(df$age) + 2, y = Inf,
           label = "Mean Age", vjust = 2, hjust = 0,
           color = "#2C3E50", fontface = "italic", size = 3.5) +
  labs(title    = "Q6.2 – Age Distribution by Outcome",
       subtitle = "Older patients (60+) show higher mortality",
       x = "Age (years)", y = "Count", fill = "Outcome") +
  theme_minimal(base_size = 13) +
  theme(plot.title    = element_text(face = "bold", hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5, color = "grey50"))

9.3 Q6.3 – Pie Chart: Survived vs Died

pie_data <- data.frame(
  Label = c("Survived", "Died"),
  Count = c(sum(df$DEATH_EVENT == 0), sum(df$DEATH_EVENT == 1))
)
pie_data$Pct <- round(pie_data$Count / sum(pie_data$Count) * 100, 1)
pie_data$Lab <- paste0(pie_data$Label, "\n", pie_data$Pct, "%")

ggplot(pie_data, aes(x = "", y = Count, fill = Label)) +
  geom_bar(stat = "identity", width = 1, color = "white", linewidth = 1.5) +
  coord_polar("y", start = 0) +
  geom_text(aes(label = Lab), position = position_stack(vjust = 0.5),
            size = 5.5, fontface = "bold", color = "white") +
  scale_fill_manual(values = c("Died" = "#E74C3C", "Survived" = "#27AE60")) +
  labs(title = "Q6.3 – Overall Patient Outcome") +
  theme_void() +
  theme(plot.title = element_text(face = "bold", hjust = 0.5, size = 15),
        legend.position = "none")

9.4 Q6.4 – Scatter Plot: Ejection Fraction vs Serum Creatinine

ggplot(df, aes(x = ejection_fraction, y = serum_creatinine,
               color = Outcome_Label)) +
  geom_point(alpha = 0.4, size = 1.8) +
  geom_smooth(method = "lm", se = TRUE, linewidth = 1.2) +
  scale_color_manual(values = c("Died" = "#E74C3C", "Survived" = "#27AE60")) +
  scale_y_log10() +
  labs(title    = "Q6.4 – Ejection Fraction vs Serum Creatinine",
       subtitle = "Low EF + High Creatinine = High Death Risk  |  Y-axis is log scale",
       x = "Ejection Fraction (%)",
       y = "Serum Creatinine (log scale)",
       color = "Outcome") +
  theme_minimal(base_size = 13) +
  theme(plot.title    = element_text(face = "bold", hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5, color = "grey50"),
        legend.position = "bottom")

9.5 Q6.5 – Boxplot: Age by Outcome

ggplot(df, aes(x = Outcome_Label, y = age, fill = Outcome_Label)) +
  geom_boxplot(width = 0.5, outlier.colour = "grey40", outlier.size = 1.5) +
  geom_jitter(width = 0.1, alpha = 0.1, size = 1) +
  stat_summary(fun = mean, geom = "point",
               shape = 18, size = 5, color = "#2C3E50") +
  scale_fill_manual(values = c("Died" = "#E74C3C", "Survived" = "#27AE60")) +
  labs(title    = "Q6.5 – Age Distribution by Outcome",
       subtitle = "◆ = Mean  |  Patients who died tend to be older",
       x = "Outcome", y = "Age (years)") +
  theme_minimal(base_size = 13) +
  theme(plot.title    = element_text(face = "bold", hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5, color = "grey50"),
        legend.position = "none")

9.6 Q6.6 – Boxplot: Serum Creatinine by Risk Category

df$Risk_Category <- factor(df$Risk_Category,
                            levels = c("Low Risk", "Medium Risk", "High Risk"))

ggplot(df, aes(x = Risk_Category, y = serum_creatinine, fill = Risk_Category)) +
  geom_boxplot(width = 0.5, outlier.alpha = 0.3) +
  scale_fill_manual(values = c("Low Risk"    = "#27AE60",
                                "Medium Risk" = "#F39C12",
                                "High Risk"   = "#E74C3C")) +
  scale_y_log10() +
  labs(title    = "Q6.6 – Serum Creatinine by Risk Category",
       subtitle = "High-risk patients have much higher creatinine  |  Y-axis is log scale",
       x = "Risk Category", y = "Serum Creatinine (log scale)") +
  theme_minimal(base_size = 13) +
  theme(plot.title    = element_text(face = "bold", hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5, color = "grey50"),
        legend.position = "none")

9.7 Q6.7 – Line Chart: Age vs Average Risk Score

df$Age_Bin <- round(df$age / 5) * 5  # group into 5-year bins

line_data <- aggregate(Risk_Score ~ Age_Bin + Outcome_Label, data = df, FUN = mean)

ggplot(line_data, aes(x = Age_Bin, y = Risk_Score,
                      color = Outcome_Label, group = Outcome_Label)) +
  geom_line(linewidth = 1.4) +
  geom_point(size = 3) +
  scale_color_manual(values = c("Died" = "#E74C3C", "Survived" = "#27AE60")) +
  labs(title    = "Q6.7 – Average Risk Score vs Age",
       subtitle = "Patients who died score higher across all age groups",
       x = "Age Group (5-year bins)", y = "Average Risk Score",
       color = "Outcome") +
  theme_minimal(base_size = 13) +
  theme(plot.title    = element_text(face = "bold", hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5, color = "grey50"),
        legend.position = "bottom")
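
The binning expression round(age / 5) * 5 maps each age to the nearest multiple of 5, so each bin is centered on its label and is not a range that starts at it. A short sketch with made-up ages:

```r
# Made-up ages to show which bin each one lands in
ages    <- c(41, 43, 58, 62)
age_bin <- round(ages / 5) * 5

data.frame(age = ages, Age_Bin = age_bin)
```

Here 41 goes to bin 40 and 43 goes to bin 45, while 58 and 62 both land in bin 60. Whole-number ages never fall exactly halfway between two bins, so R's round-half-to-even rule does not affect them.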


10 🔬 Level 7 – Advanced Statistical Analysis

10.1 Q7.1 – ANOVA: Effect of Outcome on CPK

# ANOVA tests if average CPK is different between survived and died groups
aov1   <- aov(creatinine_phosphokinase ~ factor(DEATH_EVENT), data = df)
p_val1 <- summary(aov1)[[1]][["Pr(>F)"]][1]

cat("=== ANOVA: CPK by DEATH_EVENT ===\n")
## === ANOVA: CPK by DEATH_EVENT ===
print(summary(aov1))
##                       Df    Sum Sq Mean Sq F value Pr(>F)  
## factor(DEATH_EVENT)    1 3.923e+06 3923073    4.62 0.0317 *
## Residuals           2498 2.121e+09  849076                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
cat("\nConclusion: p =", round(p_val1, 5), "→")
## 
## Conclusion: p = 0.03169 →
if (p_val1 < 0.05){
  cat(" CPK IS significantly different between groups\n")
} else {             
  cat(" CPK is NOT significantly different between groups\n")
}
##  CPK IS significantly different between groups
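
With only two groups, a one-way ANOVA gives the same answer as a pooled-variance two-sample t-test (the F statistic equals t squared). The sketch below checks this on simulated data and also runs a rank-based Wilcoxon test, which is a common alternative for skewed lab values. The `cpk_sim` and `died_sim` vectors are made up for illustration and are not taken from the dataset.

```r
set.seed(1)

# Simulated stand-ins, NOT the real dataset
died_sim <- factor(rep(c(0, 1), times = c(60, 40)))
cpk_sim  <- c(rexp(60, rate = 1 / 500), rexp(40, rate = 1 / 650))

# One-way ANOVA F statistic
f_stat <- summary(aov(cpk_sim ~ died_sim))[[1]][["F value"]][1]

# Pooled-variance two-sample t-test on the same data
t_stat <- unname(t.test(cpk_sim ~ died_sim, var.equal = TRUE)$statistic)

# Rank-based alternative that does not assume normality
w_p <- wilcox.test(cpk_sim ~ died_sim)$p.value

cat("F =", round(f_stat, 4), "| t^2 =", round(t_stat^2, 4), "\n")
cat("Wilcoxon p =", round(w_p, 4), "\n")
```

If the ANOVA and the Wilcoxon test disagree on a strongly skewed variable, that is a hint the result may depend on a few extreme values.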

10.2 Q7.2 – ANOVA: Effect of Sex on Ejection Fraction

aov2   <- aov(ejection_fraction ~ factor(sex), data = df)
p_val2 <- summary(aov2)[[1]][["Pr(>F)"]][1]

cat("=== ANOVA: Ejection Fraction by Sex ===\n")
## === ANOVA: Ejection Fraction by Sex ===
print(summary(aov2))
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## factor(sex)    1   6605    6605   45.61 1.79e-11 ***
## Residuals   2498 361738     145                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
cat("\nConclusion: p =", signif(p_val2, 3), "→")
## 
## Conclusion: p = 1.79e-11 →
if (p_val2 < 0.05) {
  cat(" EF IS significantly different by sex\n")
} else {             
  cat(" EF is NOT significantly different by sex\n")
}
##  EF IS significantly different by sex
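
A significant p-value does not say how large the difference is. One rough way to judge size is eta-squared, the share of total variance explained by the grouping. The sketch below computes it from the sums of squares printed in the two ANOVA tables above; `eta_sq` is a helper name chosen here for illustration.

```r
# Eta-squared = SS_effect / (SS_effect + SS_residual)
eta_sq <- function(ss_effect, ss_resid) ss_effect / (ss_effect + ss_resid)

# Sums of squares copied from the printed ANOVA tables
eta_cpk <- eta_sq(3.923e6, 2.121e9)  # CPK by DEATH_EVENT
eta_ef  <- eta_sq(6605, 361738)      # ejection fraction by sex

cat("Eta-squared, CPK by outcome:", round(eta_cpk, 4), "\n")
cat("Eta-squared, EF by sex     :", round(eta_ef, 4), "\n")
```

Both values are small (roughly 0.2% and 1.8% of the variance), so the differences are statistically significant with this sample size but modest in size.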

10.3 Q7.3 – Simple Linear Regression: Age → Serum Creatinine

# Does age predict kidney function decline?
lm1 <- lm(serum_creatinine ~ age, data = df)

cat("=== Simple Regression: Age → Serum Creatinine ===\n")
## === Simple Regression: Age → Serum Creatinine ===
print(summary(lm1))
## 
## Call:
## lm(formula = serum_creatinine ~ age, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4465 -0.6097 -0.2707  0.2265  8.0732 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.588816   0.115816   5.084 3.97e-07 ***
## age         0.014441   0.001865   7.745 1.38e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.139 on 2498 degrees of freedom
## Multiple R-squared:  0.02345,    Adjusted R-squared:  0.02306 
## F-statistic: 59.99 on 1 and 2498 DF,  p-value: 1.376e-14
ggplot(df, aes(x = age, y = serum_creatinine)) +
  geom_point(alpha = 0.3, color = "#2980B9", size = 1.5) +
  geom_smooth(method = "lm", color = "#E74C3C", linewidth = 1.3, se = TRUE) +
  labs(title    = "Q7.3 – Simple Regression: Age → Serum Creatinine",
       subtitle = paste0("R² = ", round(summary(lm1)$r.squared, 3),
                         "  |  As age rises, creatinine slightly increases"),
       x = "Age (years)", y = "Serum Creatinine (mg/dL)") +
  theme_minimal(base_size = 13) +
  theme(plot.title    = element_text(face = "bold", hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5, color = "grey50"))
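
To see what the slope means in practice, the fitted line can be evaluated by hand. This sketch uses the intercept and slope copied from the printed `summary(lm1)` output; `predict_creatinine` is a hypothetical helper, and the ages 50 and 80 are arbitrary examples.

```r
# Coefficients copied from the printed regression summary
b0 <- 0.588816  # intercept (mg/dL)
b1 <- 0.014441  # change in creatinine per year of age

predict_creatinine <- function(age) b0 + b1 * age

pred_50 <- predict_creatinine(50)
pred_80 <- predict_creatinine(80)

cat("Predicted creatinine at 50:", round(pred_50, 2), "mg/dL\n")
cat("Predicted creatinine at 80:", round(pred_80, 2), "mg/dL\n")
cat("Difference over 30 years  :", round(pred_80 - pred_50, 2), "mg/dL\n")
```

The 30-year difference (about 0.43 mg/dL) is well below the residual standard error of 1.139, which matches the low R² of about 0.023: age alone explains little of the variation in creatinine.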

10.4 Q7.4 – Multiple Logistic Regression: Predict Death

# Logistic regression predicts a binary outcome (0 or 1)
lm2 <- glm(DEATH_EVENT ~ age + ejection_fraction + serum_creatinine +
             serum_sodium + high_blood_pressure + anaemia + diabetes,
           data   = df,
           family = binomial)

cat("=== Logistic Regression: Predict DEATH_EVENT ===\n")
## === Logistic Regression: Predict DEATH_EVENT ===
print(summary(lm2))
## 
## Call:
## glm(formula = DEATH_EVENT ~ age + ejection_fraction + serum_creatinine + 
##     serum_sodium + high_blood_pressure + anaemia + diabetes, 
##     family = binomial, data = df)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          6.148100   1.390831   4.420 9.85e-06 ***
## age                  0.036957   0.003922   9.422  < 2e-16 ***
## ejection_fraction   -0.049787   0.004308 -11.556  < 2e-16 ***
## serum_creatinine     0.476315   0.046836  10.170  < 2e-16 ***
## serum_sodium        -0.059957   0.010143  -5.911 3.39e-09 ***
## high_blood_pressure  0.401627   0.095834   4.191 2.78e-05 ***
## anaemia              0.215656   0.093643   2.303   0.0213 *  
## diabetes             0.033681   0.094709   0.356   0.7221    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3242.2  on 2499  degrees of freedom
## Residual deviance: 2757.6  on 2492  degrees of freedom
## AIC: 2773.6
## 
## Number of Fisher Scoring iterations: 4
# Show odds ratios
coefs <- coef(summary(lm2))
odds_table <- data.frame(
  Variable   = rownames(coefs)[-1],
  Odds_Ratio = round(exp(coefs[-1, "Estimate"]), 4),
  P_Value    = round(coefs[-1, "Pr(>|z|)"], 4)
)
odds_table$Significant <- ifelse(odds_table$P_Value < 0.05, "YES ✅", "NO ❌")
odds_table <- odds_table[order(odds_table$P_Value), ]

cat("\n=== Odds Ratios and Significance ===\n")
## 
## === Odds Ratios and Significance ===
print(odds_table, row.names = FALSE)
##             Variable Odds_Ratio P_Value Significant
##                  age     1.0376  0.0000      YES ✅
##    ejection_fraction     0.9514  0.0000      YES ✅
##     serum_creatinine     1.6101  0.0000      YES ✅
##         serum_sodium     0.9418  0.0000      YES ✅
##  high_blood_pressure     1.4943  0.0000      YES ✅
##              anaemia     1.2407  0.0213      YES ✅
##             diabetes     1.0343  0.7221       NO ❌
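
Odds ratios are easier to judge with a confidence interval. The sketch below builds approximate 95% Wald intervals, exp(estimate ± 1.96 × SE), from the estimates and standard errors printed in the model summary above. The numbers are copied by hand, so the last digit may differ slightly from what the fitted model would give.

```r
# Estimates and standard errors copied from the printed glm summary
est <- c(age = 0.036957, ejection_fraction = -0.049787,
         serum_creatinine = 0.476315, serum_sodium = -0.059957,
         high_blood_pressure = 0.401627, anaemia = 0.215656,
         diabetes = 0.033681)
se  <- c(0.003922, 0.004308, 0.046836, 0.010143,
         0.095834, 0.093643, 0.094709)

z <- qnorm(0.975)  # about 1.96 for a 95% interval

or_ci <- data.frame(
  Variable = names(est),
  OR       = round(exp(est), 3),
  Lower    = round(exp(est - z * se), 3),
  Upper    = round(exp(est + z * se), 3),
  row.names = NULL
)

print(or_ci, row.names = FALSE)
```

An interval that contains 1 means the data are compatible with no effect, which is the case for diabetes here. On the fitted model itself, `exp(confint.default(lm2))` should give the same kind of Wald interval.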

10.5 Q7.5 – Correlation Analysis: All Numeric Variables

num_data <- df[, c("age", "ejection_fraction", "serum_creatinine",
                   "serum_sodium", "creatinine_phosphokinase",
                   "platelets", "time", "Risk_Score", "DEATH_EVENT")]

cor_matrix <- cor(num_data, use = "complete.obs")

cat("=== Correlations with DEATH_EVENT ===\n")
## === Correlations with DEATH_EVENT ===
print(round(cor_matrix[, "DEATH_EVENT"], 3))
##                      age        ejection_fraction         serum_creatinine 
##                    0.209                   -0.237                    0.254 
##             serum_sodium creatinine_phosphokinase                platelets 
##                   -0.204                    0.043                   -0.068 
##                     time               Risk_Score              DEATH_EVENT 
##                   -0.461                    0.382                    1.000
corrplot(cor_matrix,
         method      = "color",
         type        = "upper",
         addCoef.col = "white",
         number.cex  = 0.7,
         tl.col      = "#2C3E50",
         tl.srt      = 45,
         tl.cex      = 0.85,
         col         = colorRampPalette(c("#E74C3C", "white", "#27AE60"))(200),
         title       = "Q7.5 – Correlation Matrix of All Numeric Variables",
         mar         = c(0, 0, 2, 0))

Key Statistical Insights – Level 7:

  • Serum creatinine is the strongest positive clinical predictor of death (r = 0.254; odds ratio 1.61 per 1 mg/dL increase).
  • Follow-up time (r = -0.461) and ejection fraction (r = -0.237) are the strongest negative predictors (high = good).
  • Age has a weak-to-moderate positive effect (r = 0.209; odds ratio 1.04 per year): older patients die more often.
  • In the logistic regression, every predictor except diabetes (p = 0.72) is significant at the 0.05 level.
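
To compare predictors quickly, the correlations with DEATH_EVENT can be ranked by absolute size. This sketch uses the values printed in Q7.5 (typed in by hand) and adds r squared as a rough "share of variance" figure.

```r
# Correlations with DEATH_EVENT, copied from the printed output
r <- c(age = 0.209, ejection_fraction = -0.237, serum_creatinine = 0.254,
       serum_sodium = -0.204, creatinine_phosphokinase = 0.043,
       platelets = -0.068, time = -0.461, Risk_Score = 0.382)

# Sort by strength, ignoring the sign
ranked <- r[order(abs(r), decreasing = TRUE)]

print(data.frame(r = ranked, r_squared = round(ranked^2, 3)))
```

Note that `Risk_Score` is a feature built from other columns in this report, so its correlation is not independent of the clinical variables it was derived from.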

11 🤖 Level 8 – Machine Learning

11.1 Q8.1 – K-Means Clustering: Group patients into clusters

set.seed(42)

# Select features for clustering
cluster_features <- df[, c("age", "ejection_fraction",
                            "serum_creatinine", "serum_sodium")]

# Standardize features (mean 0, SD 1) so no variable dominates the distance
cluster_scaled <- scale(cluster_features)

# Run K-Means with 3 groups
km <- kmeans(cluster_scaled, centers = 3, nstart = 25, iter.max = 100)

# Save cluster label
df$Cluster <- factor(km$cluster)

cat("=== Cluster Sizes ===\n")
## === Cluster Sizes ===
print(table(df$Cluster))
## 
##    1    2    3 
##  863 1469  168
# Visualize clusters
fviz_cluster(km,
             data         = cluster_scaled,
             palette      = c("#27AE60", "#F39C12", "#E74C3C"),
             geom         = "point",
             ellipse.type = "convex",
             ggtheme      = theme_minimal(base_size = 13)) +
  labs(title    = "Q8.1 – K-Means Clustering (k = 3)",
       subtitle = "Grouped by: Age, Ejection Fraction, Creatinine, Sodium") +
  theme(plot.title    = element_text(face = "bold", hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5, color = "grey50"))
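
It helps to know exactly what `scale()` does before clustering. By default it standardizes each column to mean 0 and standard deviation 1 (z-scores); it does not squeeze values into the 0 to 1 range. The sketch below compares it with min-max scaling on a small simulated table (`toy` is made-up data, not the real dataset).

```r
set.seed(42)

# Simulated stand-in columns, NOT the real dataset
toy <- data.frame(age          = runif(50, 40, 95),
                  serum_sodium = rnorm(50, 137, 4))

# z-score standardization: what scale() does by default
z <- scale(toy)

# Min-max scaling to [0, 1], written out by hand
minmax <- as.data.frame(lapply(toy, function(x) {
  (x - min(x)) / (max(x) - min(x))
}))

cat("Column means after scale():", round(colMeans(z), 10), "\n")
cat("Column SDs after scale()  :", round(apply(z, 2, sd), 10), "\n")
print(sapply(minmax, range))  # each column runs from 0 to 1
```

Either approach puts the variables on a comparable scale for K-Means; the choice mainly affects how much weight outliers get.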

11.2 Q8.2 – What patterns appear in each cluster?

# Calculate average values per cluster
cluster_profile <- aggregate(
  cbind(age, ejection_fraction, serum_creatinine,
        serum_sodium, DEATH_EVENT) ~ Cluster,
  data = df,
  FUN  = mean
)

cluster_profile[, 2:5]       <- round(cluster_profile[, 2:5], 2)
cluster_profile$Death_Rate_pct <- round(cluster_profile$DEATH_EVENT * 100, 1)
cluster_profile$DEATH_EVENT  <- NULL

colnames(cluster_profile) <- c("Cluster", "Avg Age", "Avg EF (%)",
                                "Avg Creatinine", "Avg Sodium", "Death Rate (%)")
cat("=== Clinical Profile per Cluster ===\n")
## === Clinical Profile per Cluster ===
print(cluster_profile, row.names = FALSE)
##  Cluster Avg Age Avg EF (%) Avg Creatinine Avg Sodium Death Rate (%)
##        1   70.83      46.74           1.27     137.73           32.2
##        2   54.61      32.85           1.22     136.48           32.7
##        3   64.90      36.84           4.62     131.11           72.0

📌 How to read the clusters:

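  • Cluster 3 has by far the highest death rate (72.0%): it has the highest creatinine (4.62 mg/dL) and the lowest sodium (131.11) → High Risk
  • Clusters 1 and 2 have almost the same death rate (32.2% vs 32.7%), so they do not separate into Low and Medium Risk
  • Cluster 1 is the oldest group (70.83 years) with the best EF (46.74%); Cluster 2 is the youngest (54.61 years) with the lowest EF (32.85%)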
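
One way to check which cluster is actually the riskiest is to sort the printed profile by death rate. The sketch below retypes the cluster table and cluster sizes shown above into a small data frame (`profile` and its column names are chosen here for illustration). K-Means cluster numbers are arbitrary, so on another run the labels may be swapped.

```r
# Values copied from the printed cluster sizes and cluster profile
profile <- data.frame(
  Cluster        = 1:3,
  N              = c(863, 1469, 168),
  Avg_Age        = c(70.83, 54.61, 64.90),
  Avg_EF         = c(46.74, 32.85, 36.84),
  Avg_Creatinine = c(1.27, 1.22, 4.62),
  Avg_Sodium     = c(137.73, 136.48, 131.11),
  Death_Rate_pct = c(32.2, 32.7, 72.0)
)

# Sort from highest to lowest death rate
profile <- profile[order(profile$Death_Rate_pct, decreasing = TRUE), ]
print(profile, row.names = FALSE)

# Gap between the two lower-risk clusters, in percentage points
gap <- abs(diff(profile$Death_Rate_pct[2:3]))
cat("Gap between the two lower clusters:", round(gap, 1), "points\n")
```

The high-risk cluster is also the smallest one (168 patients), and the other two clusters differ by only half a percentage point in death rate.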

11.3 Q8.3 – KNN Classification: Predict Death Event

set.seed(42)

# Prepare features
knn_features <- df[, c("age", "ejection_fraction", "serum_creatinine",
                        "serum_sodium", "time", "high_blood_pressure", "anaemia")]
knn_target   <- factor(df$DEATH_EVENT, labels = c("Survived", "Died"))

# Scale features
knn_scaled <- scale(knn_features)

# Split: 80% train, 20% test
n         <- nrow(knn_scaled)
train_idx <- sample(1:n, size = 0.8 * n)
test_idx  <- setdiff(1:n, train_idx)

train_X <- knn_scaled[train_idx, ]
test_X  <- knn_scaled[test_idx, ]
train_y <- knn_target[train_idx]
test_y  <- knn_target[test_idx]

# Run KNN with k=7
knn_pred <- knn(train = train_X,
                test  = test_X,
                cl    = train_y,
                k     = 7)

cat("=== KNN Prediction Done (k = 7) ===\n")
## === KNN Prediction Done (k = 7) ===
cat("Training set:", length(train_y), "patients\n")
## Training set: 2000 patients
cat("Test set:    ", length(test_y),  "patients\n")
## Test set:     500 patients
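
The value k = 7 is one reasonable choice, but it is worth comparing a few values of k. The sketch below does this on simulated data (the `x1`, `x2`, and `y_sim` variables are made up, not from the dataset). It also standardizes the test set with the training set's means and SDs, which is a slightly stricter approach than scaling all rows together.

```r
library(class)  # provides knn()

set.seed(42)

# Simulated stand-in data, NOT the real dataset
n_sim <- 400
X_sim <- data.frame(x1 = rnorm(n_sim), x2 = rnorm(n_sim, 50, 10))
y_sim <- factor(
  ifelse(X_sim$x1 + (X_sim$x2 - 50) / 10 + rnorm(n_sim) > 0, "Died", "Survived"),
  levels = c("Survived", "Died")
)

idx <- sample(seq_len(n_sim), size = floor(0.8 * n_sim))

# Standardize with training-set statistics only
tr_X <- scale(X_sim[idx, ])
te_X <- scale(X_sim[-idx, ],
              center = attr(tr_X, "scaled:center"),
              scale  = attr(tr_X, "scaled:scale"))

# Test accuracy for several odd values of k
k_values <- c(3, 5, 7, 9, 11)
acc <- sapply(k_values, function(k) {
  pred <- knn(train = tr_X, test = te_X, cl = y_sim[idx], k = k)
  mean(pred == y_sim[-idx])
})
names(acc) <- paste0("k=", k_values)

print(round(acc, 3))
```

Picking k by test-set accuracy makes the final accuracy look a little better than it really is; a separate validation split or cross-validation would be a cleaner way to choose k.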

11.4 Q8.4 – Compare Predicted vs Actual

# Confusion Matrix
cm <- table(Actual = test_y, Predicted = knn_pred)

cat("=== Confusion Matrix ===\n")
## === Confusion Matrix ===
print(cm)
##           Predicted
## Actual     Survived Died
##   Survived      303   30
##   Died           31  136
# Performance Metrics
TP <- cm["Died",     "Died"]
TN <- cm["Survived", "Survived"]
FP <- cm["Survived", "Died"]
FN <- cm["Died",     "Survived"]

accuracy    <- round((TP + TN) / sum(cm) * 100, 1)
sensitivity <- round(TP / (TP + FN) * 100, 1)
specificity <- round(TN / (TN + FP) * 100, 1)

cat("\n=== Model Performance ===\n")
## 
## === Model Performance ===
cat("Accuracy    :", accuracy,    "% — how often is the model correct?\n")
## Accuracy    : 87.8 % — how often is the model correct?
cat("Sensitivity :", sensitivity, "% — how well does it detect DIED?\n")
## Sensitivity : 81.4 % — how well does it detect DIED?
cat("Specificity :", specificity, "% — how well does it detect SURVIVED?\n")
## Specificity : 91 % — how well does it detect SURVIVED?
# Plot confusion matrix
cm_df <- as.data.frame(cm)

ggplot(cm_df, aes(x = Actual, y = Predicted, fill = Freq)) +
  geom_tile(color = "white", linewidth = 1.5) +
  geom_text(aes(label = Freq), size = 10, fontface = "bold", color = "white") +
  scale_fill_gradient(low = "#AED6F1", high = "#1A5276") +
  labs(title    = "Q8.4 – KNN Confusion Matrix",
       subtitle  = paste0("Accuracy: ", accuracy,
                          "%  |  Sensitivity: ", sensitivity,
                          "%  |  Specificity: ", specificity, "%"),
       x = "Actual Outcome", y = "Predicted Outcome") +
  theme_minimal(base_size = 14) +
  theme(plot.title    = element_text(face = "bold", hjust = 0.5, size = 15),
        plot.subtitle = element_text(hjust = 0.5, color = "grey50"),
        legend.position = "none",
        axis.text = element_text(face = "bold", size = 12))
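
Accuracy, sensitivity, and specificity do not show how trustworthy a "Died" prediction is. Precision and the F1 score fill that gap. The sketch below recomputes them from the confusion matrix printed above; `cm_printed` is a hand-typed copy of that table.

```r
# Confusion matrix copied from the printed output (rows = actual)
cm_printed <- matrix(c(303, 31, 30, 136), nrow = 2,
                     dimnames = list(Actual    = c("Survived", "Died"),
                                     Predicted = c("Survived", "Died")))

tp <- cm_printed["Died",     "Died"]
tn <- cm_printed["Survived", "Survived"]
fp <- cm_printed["Survived", "Died"]
fn <- cm_printed["Died",     "Survived"]

precision    <- tp / (tp + fp)          # of predicted deaths, how many were real
recall       <- tp / (tp + fn)          # same as sensitivity
f1           <- 2 * precision * recall / (precision + recall)
balanced_acc <- (recall + tn / (tn + fp)) / 2

cat("Precision        :", round(precision * 100, 1), "%\n")
cat("Recall           :", round(recall * 100, 1), "%\n")
cat("F1 score         :", round(f1, 3), "\n")
cat("Balanced accuracy:", round(balanced_acc * 100, 1), "%\n")
```

Balanced accuracy is useful here because only about one third of the test patients died, so plain accuracy is pulled toward the larger "Survived" class.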

Key ML Insights – Level 8:

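  • KNN (k = 7) reaches 87.8% accuracy on the 500-patient test set.
  • Follow-up time, serum creatinine, and ejection fraction have the strongest correlations with DEATH_EVENT among the features used (see Q7.5).
  • K-Means isolates one clear High Risk cluster (72.0% death rate); the other two clusters have nearly equal death rates (about 32%).
  • The confusion matrix shows where the model makes mistakes: it misses 18.6% of deaths (sensitivity 81.4%) but mislabels only 9.0% of survivors (specificity 91.0%), so predicting death is harder.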

12 🏁 Final Conclusion

12.1 📋 Summary of All 30 Questions

| Level | What We Did | Key Finding |
|---|---|---|
| Level 1 | Explored data structure | 2,500 rows, 13 columns, 0 missing values, 32% mortality |
| Level 2 | Filtered high-risk groups | Older + high CPK + high BP patients are most at risk |
| Level 3 | Grouped & summarized | Patients who died were older with lower serum sodium |
| Level 4 | Sorted & ranked | Creatinine > 9 mg/dL almost always predicts death |
| Level 5 | Created new features | Risk Score (0–100) captures patient danger in one number |
| Level 6 | Built 7 visualizations | EF + creatinine together clearly separate survivors from deaths |
| Level 7 | Statistical tests | Creatinine and EF are the strongest significant predictors |
| Level 8 | Machine Learning | KNN and K-Means effectively classify and cluster patients |

12.2 🎯 Top 5 Predictors of Heart Failure Death

| Rank | Feature | Why It Matters |
|---|---|---|
| 1 | 🔴 Serum Creatinine | High values = kidneys are failing |
| 2 | 🔴 Ejection Fraction | Low values = heart is not pumping enough blood |
| 3 | 🔴 Age | Older patients have weaker organs overall |
| 4 | 🟡 Serum Sodium | Low sodium = worsening heart failure |
| 5 | 🟡 Follow-up Time | Patients who die tend to die early in the observation period |

12.3 💊 Clinical Recommendations

  • 🚨 Flag patients with serum creatinine > 2 mg/dL for urgent monitoring
  • 🚨 Patients with ejection fraction < 30% need immediate heart specialist care
  • 👴 Elderly patients (75+) require proactive prevention plans
  • 📊 Use the Risk Score built in this analysis as a quick triage screening tool

Report created in R | Dataset: Heart Failure Clinical Records, Faisalabad Institute of Cardiology, Pakistan (2015) | UCI Machine Learning Repository