14-04-2026What is Heart Failure? Heart failure happens when the heart cannot pump enough blood to the body. It is one of the top causes of death worldwide, especially in people over 60 years old.
Why is this dataset important? This dataset has clinical records of 299 heart failure patients from Faisalabad, Pakistan (2015). It helps us find which clinical factors predict whether a patient will survive.
Goal: Explore the data step by step — filter, summarize, visualize, run statistical tests, and use machine learning to predict patient death (DEATH_EVENT).
# Install any packages that are not already installed
packages_needed <- c("dplyr", "ggplot2", "corrplot",
"caret", "class", "cluster", "factoextra", "knitr")
for (pkg in packages_needed) {
if (!pkg %in% installed.packages()[, "Package"]) {
install.packages(pkg)
}
}
# Load all libraries
library(dplyr) # for data filtering and grouping
library(ggplot2) # for all charts and plots
library(corrplot) # for correlation heatmap
library(caret) # for machine learning tools
library(class) # for KNN algorithm
library(cluster) # for K-Means clustering
library(factoextra) # for cluster visualization
library(knitr) # for clean tables
# Set default figure size for all plots
knitr::opts_chunk$set(fig.width = 9, fig.height = 5,
fig.align = "center",
warning = FALSE, message = FALSE)# Read the CSV file (make sure it is in your working directory)
df_raw <- read.csv("heart_failure_clinical_records_dataset.csv")
# Basic info
cat("Rows:", nrow(df_raw), "| Columns:", ncol(df_raw), "\n\n")## Rows: 299 | Columns: 13
## age anaemia creatinine_phosphokinase diabetes ejection_fraction
## 1 75 0 582 0 20
## 2 55 0 7861 0 38
## 3 65 0 146 0 20
## 4 50 1 111 0 20
## 5 65 1 160 1 20
## 6 90 1 47 0 40
## high_blood_pressure platelets serum_creatinine serum_sodium sex smoking time
## 1 1 265000 1.9 130 1 0 4
## 2 0 263358 1.1 136 1 0 6
## 3 0 162000 1.3 129 1 1 7
## 4 0 210000 1.9 137 1 0 7
## 5 0 327000 2.7 116 0 0 8
## 6 1 204000 2.1 132 1 1 8
## DEATH_EVENT
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
What do the columns mean?
| Column | Meaning |
|---|---|
| age | Patient age in years |
| anaemia | Has anaemia? (0=No, 1=Yes) |
| creatinine_phosphokinase | CPK enzyme level in blood (mcg/L) |
| diabetes | Has diabetes? (0=No, 1=Yes) |
| ejection_fraction | % of blood pumped out per heartbeat |
| high_blood_pressure | Has high BP? (0=No, 1=Yes) |
| platelets | Platelet count in blood |
| serum_creatinine | Kidney function marker (mg/dL) |
| serum_sodium | Sodium level in blood (mEq/L) |
| sex | 0=Female, 1=Male |
| smoking | Smokes? (0=No, 1=Yes) |
| time | Days of follow-up after hospital |
| DEATH_EVENT | TARGET: 0=Survived, 1=Died |
We add more rows using synthetic augmentation — small random changes to existing rows. This gives machine learning models more data to learn from.
set.seed(42) # makes results repeatable
expand_data <- function(data, target_rows = 2500) {
# Columns where we add small noise
num_cols <- c("age", "creatinine_phosphokinase", "ejection_fraction",
"platelets", "serum_creatinine", "serum_sodium", "time")
# Columns where we rarely flip the value (5% chance)
bin_cols <- c("anaemia", "diabetes", "high_blood_pressure",
"sex", "smoking", "DEATH_EVENT")
original_n <- nrow(data)
rows_to_add <- target_rows - original_n
new_rows <- list()
for (i in 1:rows_to_add) {
base_row <- data[sample(original_n, 1), ] # pick a random row
new_row <- base_row
# Add tiny random noise to numeric columns
for (col in num_cols) {
col_range <- max(data[[col]]) - min(data[[col]])
noise <- rnorm(1, mean = 0, sd = 0.05 * col_range)
new_row[[col]] <- base_row[[col]] + noise
new_row[[col]] <- max(min(new_row[[col]], max(data[[col]])), min(data[[col]]))
}
# 5% chance to flip a binary column
for (col in bin_cols) {
if (runif(1) < 0.05) new_row[[col]] <- 1 - base_row[[col]]
}
new_row$age <- round(new_row$age, 0)
new_rows[[i]] <- new_row
}
expanded <- rbind(data, do.call(rbind, new_rows))
expanded <- expanded[sample(nrow(expanded)), ] # shuffle rows
rownames(expanded) <- NULL
return(expanded)
}
df <- expand_data(df_raw, 2500)
cat("Original rows:", nrow(df_raw), "\n")## Original rows: 299
## Expanded rows: 2500
✅ Dataset expanded from 299 → 2,500 rows successfully.
## Total Rows: 2500
## Total Columns: 13
## 'data.frame': 2500 obs. of 13 variables:
## $ age : num 95 49 74 71 95 60 52 52 56 40 ...
## $ anaemia : num 0 1 0 1 1 0 0 0 1 0 ...
## $ creatinine_phosphokinase: num 513.2 70.5 153.6 23 23 ...
## $ diabetes : num 1 0 1 1 0 0 0 0 0 1 ...
## $ ejection_fraction : num 37.5 50.6 14.9 29.1 38.2 ...
## $ high_blood_pressure : num 1 0 1 0 1 1 0 0 0 0 ...
## $ platelets : num 273506 180778 351536 195750 206778 ...
## $ serum_creatinine : num 1.429 1.067 1.586 2.806 0.922 ...
## $ serum_sodium : num 135 142 136 138 139 ...
## $ sex : num 1 0 1 1 0 1 0 1 1 1 ...
## $ smoking : num 0 0 1 1 0 0 1 0 1 0 ...
## $ time : num 33.4 196.1 192.3 55.8 29.7 ...
## $ DEATH_EVENT : num 1 0 0 0 1 0 0 0 0 0 ...
# colSums + is.na counts missing per column
missing_count <- colSums(is.na(df))
cat("=== Missing Values per Column ===\n")## === Missing Values per Column ===
## age anaemia creatinine_phosphokinase
## 0 0 0
## diabetes ejection_fraction high_blood_pressure
## 0 0 0
## platelets serum_creatinine serum_sodium
## 0 0 0
## sex smoking time
## 0 0 0
## DEATH_EVENT
## 0
##
## Total missing values: 0
# summary() gives min, max, mean, median for each column
summary(df[, c("age", "ejection_fraction", "serum_creatinine",
"serum_sodium", "creatinine_phosphokinase")])## age ejection_fraction serum_creatinine serum_sodium
## Min. :40.0 Min. :14.00 Min. :0.5000 Min. :113.0
## 1st Qu.:51.0 1st Qu.:29.95 1st Qu.:0.8445 1st Qu.:134.0
## Median :60.0 Median :37.03 Median :1.1900 Median :136.9
## Mean :60.9 Mean :37.91 Mean :1.4683 Mean :136.6
## 3rd Qu.:69.0 3rd Qu.:44.10 3rd Qu.:1.7000 3rd Qu.:139.6
## Max. :95.0 Max. :80.00 Max. :9.4000 Max. :148.0
## creatinine_phosphokinase
## Min. : 23.00
## 1st Qu.: 77.05
## Median : 390.15
## Mean : 638.07
## 3rd Qu.: 803.86
## Max. :7861.00
death_table <- table(df$DEATH_EVENT)
death_df <- data.frame(
Outcome = c("Survived (0)", "Died (1)"),
Count = as.vector(death_table),
Percentage = round(as.vector(death_table) / nrow(df) * 100, 1)
)
cat("=== Death Event Distribution ===\n")## === Death Event Distribution ===
## Outcome Count Percentage
## 1 Survived (0) 1621 64.8
## 2 Died (1) 879 35.2
## === Categorical Variables – Unique Values ===
## Sex (0=Female, 1=Male) : 1 0
## Smoking (0=No, 1=Yes) : 0 1
## Diabetes (0=No, 1=Yes) : 1 0
## Anaemia (0=No, 1=Yes) : 0 1
## High BP (0=No, 1=Yes) : 1 0
## DEATH_EVENT (0=Survived,1=Died): 1 0
##
## --- Counts for each ---
## Sex:
##
## 0 1
## 936 1564
## Smoking:
##
## 0 1
## 1674 826
## DEATH_EVENT:
##
## 0 1
## 1621 879
✅ Key Insights – Level 1:
## Patients older than 60: 1191
## age ejection_fraction serum_creatinine DEATH_EVENT
## 1 95 37.50308 1.4291232 1
## 3 74 14.87444 1.5863328 0
## 4 71 29.10798 2.8055064 0
## 5 95 38.17614 0.9223058 1
## 11 67 40.10774 1.3274221 0
## 13 78 32.05026 1.3598105 1
## 15 64 43.11965 0.5279466 1
## 17 69 35.00000 3.5000000 1
# CPK = creatinine phosphokinase: high values may indicate heart muscle damage
high_cpk <- df[df$creatinine_phosphokinase > 500, ]
cat("Patients with CPK > 500:", nrow(high_cpk), "\n\n")## Patients with CPK > 500: 1074
## age creatinine_phosphokinase DEATH_EVENT
## 1 95 513.1767 1
## 6 60 2261.0000 0
## 8 52 2106.0245 0
## 9 56 1098.4602 0
## 10 40 601.7422 0
## 11 67 704.7996 0
## 12 51 582.0000 0
## 13 78 2013.6477 1
died_highbp <- df[df$DEATH_EVENT == 1 & df$high_blood_pressure == 1, ]
cat("Died + High Blood Pressure:", nrow(died_highbp), "\n\n")## Died + High Blood Pressure: 376
## age serum_creatinine high_blood_pressure DEATH_EVENT
## 1 95 1.4291232 1 1
## 5 95 0.9223058 1 1
## 15 64 0.5279466 1 1
## 27 55 8.9350887 1 1
## 34 93 1.8364702 1 1
## 38 78 1.6093777 1 1
## 41 83 9.4000000 1 1
## 56 51 0.9000000 1 1
# sex = 0 means Female
female_died <- df[df$sex == 0 & df$DEATH_EVENT == 1, ]
cat("Female patients who died:", nrow(female_died), "\n\n")## Female patients who died: 325
## age ejection_fraction serum_creatinine DEATH_EVENT
## 5 95 38.17614 0.9223058 1
## 22 60 22.70485 3.0176536 1
## 27 55 65.10067 8.9350887 1
## 29 49 53.81690 1.9964448 1
## 30 65 65.00000 1.5000000 1
## 50 85 53.50970 1.4201214 1
## 68 51 54.24440 1.8306807 1
## 73 55 23.74312 1.1897970 1
avg_ef <- mean(df$ejection_fraction)
above_avg <- df[df$ejection_fraction > avg_ef, ]
cat("Average Ejection Fraction:", round(avg_ef, 2), "%\n")## Average Ejection Fraction: 37.91 %
## Patients above average EF: 1168
✅ Key Insights – Level 2:
## === Average Age by Outcome ===
## Survived (0): 59.02 years
## Died (1): 64.37 years
avg_creat <- tapply(df$serum_creatinine, df$sex, mean)
cat("=== Average Serum Creatinine by Sex ===\n")## === Average Serum Creatinine by Sex ===
## Female (0): 1.442 mg/dL
## Male (1): 1.484 mg/dL
avg_ef_smoke <- tapply(df$ejection_fraction, df$smoking, mean)
cat("=== Average Ejection Fraction by Smoking ===\n")## === Average Ejection Fraction by Smoking ===
## Non-Smoker (0): 38.65 %
## Smoker (1): 36.41 %
diab_count <- table(df$diabetes)
diab_deaths <- tapply(df$DEATH_EVENT, df$diabetes, sum)
cat("=== Diabetes Status ===\n")## === Diabetes Status ===
## Non-Diabetic (0): 1434 patients | 513 deaths
## Diabetic (1): 1066 patients | 366 deaths
avg_sodium <- tapply(df$serum_sodium, df$DEATH_EVENT, mean)
cat("=== Average Serum Sodium by Outcome ===\n")## === Average Serum Sodium by Outcome ===
## Survived (0): 137.27 mEq/L
## Died (1): 135.23 mEq/L
##
## Note: Low sodium (hyponatremia) is a warning sign in heart failure
✅ Key Insights – Level 3:
# Rank 1 = highest creatinine = most critical kidney function
df$creatinine_rank <- rank(-df$serum_creatinine, ties.method = "first")
top_creat <- df[order(df$creatinine_rank), ]
cat("=== Top 10 Patients – Highest Serum Creatinine ===\n")## === Top 10 Patients – Highest Serum Creatinine ===
head(top_creat[, c("creatinine_rank", "age", "serum_creatinine",
"ejection_fraction", "DEATH_EVENT")], 10)## creatinine_rank age serum_creatinine ejection_fraction DEATH_EVENT
## 41 1 83 9.400000 37.14414 1
## 166 2 58 9.400000 69.10749 1
## 394 3 80 9.400000 35.00000 1
## 560 4 83 9.400000 36.06893 1
## 899 5 79 9.400000 37.08336 1
## 1283 6 55 9.400000 64.45976 1
## 2380 7 82 9.400000 30.27132 0
## 648 8 51 9.398521 71.20068 1
## 2168 9 77 9.271585 28.77698 1
## 135 10 78 9.230208 36.20592 1
## === Top 10 – Highest Ejection Fraction ===
## age ejection_fraction serum_creatinine DEATH_EVENT
## 314 50 80.00000 1.3346373 0
## 619 49 80.00000 2.2080516 0
## 1480 45 80.00000 1.1800000 0
## 1962 40 80.00000 0.5000000 0
## 2256 43 80.00000 0.8291144 0
## 2248 49 79.03079 1.6140766 0
## 1337 49 78.75856 1.6088737 0
## 2281 46 76.93935 2.1476574 0
## 2223 54 76.83241 9.1755239 1
## 1299 55 76.14159 9.1035892 1
group_death <- aggregate(DEATH_EVENT ~ sex + smoking,
data = df,
FUN = function(x) round(mean(x) * 100, 1))
group_death$Sex <- ifelse(group_death$sex == 1, "Male", "Female")
group_death$Smoking <- ifelse(group_death$smoking == 1, "Smoker", "Non-Smoker")
group_death$Death_Rate_pct <- group_death$DEATH_EVENT
result <- group_death[order(-group_death$Death_Rate_pct),
c("Sex", "Smoking", "Death_Rate_pct")]
cat("=== Death Rate by Sex and Smoking (%) ===\n")## === Death Rate by Sex and Smoking (%) ===
## Sex Smoking Death_Rate_pct
## Female Smoker 38.6
## Male Non-Smoker 37.2
## Female Non-Smoker 34.2
## Male Smoker 33.3
sorted_age_creat <- df[order(df$age, df$serum_creatinine), ]
cat("=== Sorted: Age (asc) then Creatinine (asc) ===\n")## === Sorted: Age (asc) then Creatinine (asc) ===
## age serum_creatinine ejection_fraction DEATH_EVENT
## 162 40 0.5 32.76896 1
## 180 40 0.5 31.74023 0
## 259 40 0.5 59.59414 0
## 751 40 0.5 14.00000 1
## 1441 40 0.5 33.13004 0
## 1617 40 0.5 30.20877 0
## 1677 40 0.5 32.45592 1
## 1815 40 0.5 41.00378 0
## 1962 40 0.5 80.00000 0
## 2112 40 0.5 33.16646 0
## === Extreme Serum Creatinine Values ===
## Minimum: 0.5 mg/dL (healthiest kidney)
## Maximum: 9.4 mg/dL (most critical)
##
## --- Patient with LOWEST creatinine ---
print(df[which.min(df$serum_creatinine),
c("age", "serum_creatinine", "ejection_fraction", "DEATH_EVENT")])## age serum_creatinine ejection_fraction DEATH_EVENT
## 14 50 0.5 30 0
##
## --- Patient with HIGHEST creatinine ---
print(df[which.max(df$serum_creatinine),
c("age", "serum_creatinine", "ejection_fraction", "DEATH_EVENT")])## age serum_creatinine ejection_fraction DEATH_EVENT
## 41 83 9.4 37.14414 1
✅ Key Insights – Level 4:
# Normalize = convert each value to a 0-1 scale
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
df$age_norm <- normalize(df$age)
df$creat_norm <- normalize(df$serum_creatinine)
df$ef_norm <- 1 - normalize(df$ejection_fraction) # low EF = high risk
df$sodium_norm <- 1 - normalize(df$serum_sodium) # low sodium = high risk
# Combine into one Risk Score (0 to 100)
df$Risk_Score <- round(
(0.35 * df$age_norm +
0.30 * df$creat_norm +
0.25 * df$ef_norm +
0.10 * df$sodium_norm) * 100, 2)
cat("=== Risk Score Summary ===\n")## === Risk Score Summary ===
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.30 28.39 34.87 35.78 42.77 80.29
Formula:
Risk Score = 35% Age + 30% Creatinine + 25% (1 - EF) + 10% (1 - Sodium)
# Health Index = 100 minus risk, minus penalties for comorbidities
df$Health_Index <- round(
100 - df$Risk_Score
- 5 * df$high_blood_pressure # penalty for high BP
- 3 * df$anaemia # penalty for anaemia
- 3 * df$diabetes, 2) # penalty for diabetes
cat("=== Health Index Summary ===\n")## === Health Index Summary ===
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.71 52.50 60.38 59.79 67.39 93.70
df$Risk_Category <- ifelse(df$Risk_Score < 30, "Low Risk",
ifelse(df$Risk_Score < 55, "Medium Risk", "High Risk"))
cat("=== Risk Category Counts ===\n")## === Risk Category Counts ===
##
## High Risk Low Risk Medium Risk
## 84 778 1638
##
## === Death Rate per Risk Category (%) ===
## High Risk Low Risk Medium Risk
## 83.3 18.1 40.8
df$CPK_Category <- ifelse(df$creatinine_phosphokinase < 200, "Normal (<200)",
ifelse(df$creatinine_phosphokinase < 1000, "Elevated (200-999)",
"Very High (1000+)"))
cat("=== CPK Category Counts ===\n")## === CPK Category Counts ===
##
## Elevated (200-999) Normal (<200) Very High (1000+)
## 1152 910 438
df$Age_Group <- ifelse(df$age < 45, "Young (<45)",
ifelse(df$age < 60, "Middle (45-59)",
ifelse(df$age < 75, "Senior (60-74)", "Elderly (75+)")))
cat("=== Age Group Counts ===\n")## === Age Group Counts ===
##
## Elderly (75+) Middle (45-59) Senior (60-74) Young (<45)
## 345 989 961 205
##
## === Death Rate by Age Group (%) ===
## Elderly (75+) Middle (45-59) Senior (60-74) Young (<45)
## 61.7 30.5 32.7 24.4
✅ Key Insights – Level 5:
df$Sex_Label <- ifelse(df$sex == 1, "Male", "Female")
df$Outcome_Label <- ifelse(df$DEATH_EVENT == 1, "Died", "Survived")
ggplot(df, aes(x = Sex_Label, fill = Outcome_Label)) +
geom_bar(position = "dodge", width = 0.5, color = "white") +
geom_text(stat = "count", aes(label = ..count..),
position = position_dodge(0.5), vjust = -0.4,
fontface = "bold", size = 4) +
scale_fill_manual(values = c("Died" = "#E74C3C", "Survived" = "#27AE60")) +
labs(title = "Q6.1 – Heart Failure Count by Sex",
subtitle = "Males make up the majority of both groups",
x = "Sex", y = "Count", fill = "Outcome") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold", hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5, color = "grey50"))ggplot(df, aes(x = age, fill = Outcome_Label)) +
geom_histogram(binwidth = 4, color = "white", alpha = 0.8, position = "identity") +
scale_fill_manual(values = c("Died" = "#E74C3C", "Survived" = "#27AE60")) +
geom_vline(xintercept = mean(df$age), linetype = "dashed",
color = "#2C3E50", linewidth = 1) +
annotate("text", x = mean(df$age) + 2, y = Inf,
label = "Mean Age", vjust = 2, hjust = 0,
color = "#2C3E50", fontface = "italic", size = 3.5) +
labs(title = "Q6.2 – Age Distribution by Outcome",
subtitle = "Older patients (60+) show higher mortality",
x = "Age (years)", y = "Count", fill = "Outcome") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold", hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5, color = "grey50"))pie_data <- data.frame(
Label = c("Survived", "Died"),
Count = c(sum(df$DEATH_EVENT == 0), sum(df$DEATH_EVENT == 1))
)
pie_data$Pct <- round(pie_data$Count / sum(pie_data$Count) * 100, 1)
pie_data$Lab <- paste0(pie_data$Label, "\n", pie_data$Pct, "%")
ggplot(pie_data, aes(x = "", y = Count, fill = Label)) +
geom_bar(stat = "identity", width = 1, color = "white", linewidth = 1.5) +
coord_polar("y", start = 0) +
geom_text(aes(label = Lab), position = position_stack(vjust = 0.5),
size = 5.5, fontface = "bold", color = "white") +
scale_fill_manual(values = c("Died" = "#E74C3C", "Survived" = "#27AE60")) +
labs(title = "Q6.3 – Overall Patient Outcome") +
theme_void() +
theme(plot.title = element_text(face = "bold", hjust = 0.5, size = 15),
legend.position = "none")ggplot(df, aes(x = ejection_fraction, y = serum_creatinine,
color = Outcome_Label)) +
geom_point(alpha = 0.4, size = 1.8) +
geom_smooth(method = "lm", se = TRUE, linewidth = 1.2) +
scale_color_manual(values = c("Died" = "#E74C3C", "Survived" = "#27AE60")) +
scale_y_log10() +
labs(title = "Q6.4 – Ejection Fraction vs Serum Creatinine",
subtitle = "Low EF + High Creatinine = High Death Risk | Y-axis is log scale",
x = "Ejection Fraction (%)",
y = "Serum Creatinine (log scale)",
color = "Outcome") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold", hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5, color = "grey50"),
legend.position = "bottom")ggplot(df, aes(x = Outcome_Label, y = age, fill = Outcome_Label)) +
geom_boxplot(width = 0.5, outlier.colour = "grey40", outlier.size = 1.5) +
geom_jitter(width = 0.1, alpha = 0.1, size = 1) +
stat_summary(fun = mean, geom = "point",
shape = 18, size = 5, color = "#2C3E50") +
scale_fill_manual(values = c("Died" = "#E74C3C", "Survived" = "#27AE60")) +
labs(title = "Q6.5 – Age Distribution by Outcome",
subtitle = "◆ = Mean | Patients who died tend to be older",
x = "Outcome", y = "Age (years)") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold", hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5, color = "grey50"),
legend.position = "none")df$Risk_Category <- factor(df$Risk_Category,
levels = c("Low Risk", "Medium Risk", "High Risk"))
ggplot(df, aes(x = Risk_Category, y = serum_creatinine, fill = Risk_Category)) +
geom_boxplot(width = 0.5, outlier.alpha = 0.3) +
scale_fill_manual(values = c("Low Risk" = "#27AE60",
"Medium Risk" = "#F39C12",
"High Risk" = "#E74C3C")) +
scale_y_log10() +
labs(title = "Q6.6 – Serum Creatinine by Risk Category",
subtitle = "High-risk patients have much higher creatinine | Y-axis is log scale",
x = "Risk Category", y = "Serum Creatinine (log scale)") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold", hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5, color = "grey50"),
legend.position = "none")df$Age_Bin <- round(df$age / 5) * 5 # group into 5-year bins
line_data <- aggregate(Risk_Score ~ Age_Bin + Outcome_Label, data = df, FUN = mean)
ggplot(line_data, aes(x = Age_Bin, y = Risk_Score,
color = Outcome_Label, group = Outcome_Label)) +
geom_line(linewidth = 1.4) +
geom_point(size = 3) +
scale_color_manual(values = c("Died" = "#E74C3C", "Survived" = "#27AE60")) +
labs(title = "Q6.7 – Average Risk Score vs Age",
subtitle = "Patients who died score higher across all age groups",
x = "Age Group (5-year bins)", y = "Average Risk Score",
color = "Outcome") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold", hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5, color = "grey50"),
legend.position = "bottom")# ANOVA tests if average CPK is different between survived and died groups
aov1 <- aov(creatinine_phosphokinase ~ factor(DEATH_EVENT), data = df)
p_val1 <- summary(aov1)[[1]][["Pr(>F)"]][1]
cat("=== ANOVA: CPK by DEATH_EVENT ===\n")## === ANOVA: CPK by DEATH_EVENT ===
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(DEATH_EVENT) 1 3.923e+06 3923073 4.62 0.0317 *
## Residuals 2498 2.121e+09 849076
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Conclusion: p = 0.03169 →
if (p_val1 < 0.05){
cat(" CPK IS significantly different between groups\n")
} else {
cat(" CPK is NOT significantly different between groups\n")
}## CPK IS significantly different between groups
aov2 <- aov(ejection_fraction ~ factor(sex), data = df)
p_val2 <- summary(aov2)[[1]][["Pr(>F)"]][1]
cat("=== ANOVA: Ejection Fraction by Sex ===\n")## === ANOVA: Ejection Fraction by Sex ===
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(sex) 1 6605 6605 45.61 1.79e-11 ***
## Residuals 2498 361738 145
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Conclusion: p = 0 →
if (p_val2 < 0.05) {
cat(" EF IS significantly different by sex\n")
} else {
cat(" EF is NOT significantly different by sex\n")
}## EF IS significantly different by sex
# Does age predict kidney function decline?
lm1 <- lm(serum_creatinine ~ age, data = df)
cat("=== Simple Regression: Age → Serum Creatinine ===\n")## === Simple Regression: Age → Serum Creatinine ===
##
## Call:
## lm(formula = serum_creatinine ~ age, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.4465 -0.6097 -0.2707 0.2265 8.0732
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.588816 0.115816 5.084 3.97e-07 ***
## age 0.014441 0.001865 7.745 1.38e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.139 on 2498 degrees of freedom
## Multiple R-squared: 0.02345, Adjusted R-squared: 0.02306
## F-statistic: 59.99 on 1 and 2498 DF, p-value: 1.376e-14
ggplot(df, aes(x = age, y = serum_creatinine)) +
geom_point(alpha = 0.3, color = "#2980B9", size = 1.5) +
geom_smooth(method = "lm", color = "#E74C3C", linewidth = 1.3, se = TRUE) +
labs(title = "Q7.3 – Simple Regression: Age → Serum Creatinine",
subtitle = paste0("R² = ", round(summary(lm1)$r.squared, 3),
" | As age rises, creatinine slightly increases"),
x = "Age (years)", y = "Serum Creatinine (mg/dL)") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold", hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5, color = "grey50"))# Logistic regression predicts a binary outcome (0 or 1)
lm2 <- glm(DEATH_EVENT ~ age + ejection_fraction + serum_creatinine +
serum_sodium + high_blood_pressure + anaemia + diabetes,
data = df,
family = binomial)
cat("=== Logistic Regression: Predict DEATH_EVENT ===\n")## === Logistic Regression: Predict DEATH_EVENT ===
##
## Call:
## glm(formula = DEATH_EVENT ~ age + ejection_fraction + serum_creatinine +
## serum_sodium + high_blood_pressure + anaemia + diabetes,
## family = binomial, data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.148100 1.390831 4.420 9.85e-06 ***
## age 0.036957 0.003922 9.422 < 2e-16 ***
## ejection_fraction -0.049787 0.004308 -11.556 < 2e-16 ***
## serum_creatinine 0.476315 0.046836 10.170 < 2e-16 ***
## serum_sodium -0.059957 0.010143 -5.911 3.39e-09 ***
## high_blood_pressure 0.401627 0.095834 4.191 2.78e-05 ***
## anaemia 0.215656 0.093643 2.303 0.0213 *
## diabetes 0.033681 0.094709 0.356 0.7221
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3242.2 on 2499 degrees of freedom
## Residual deviance: 2757.6 on 2492 degrees of freedom
## AIC: 2773.6
##
## Number of Fisher Scoring iterations: 4
# Show odds ratios
coefs <- coef(summary(lm2))
odds_table <- data.frame(
Variable = rownames(coefs)[-1],
Odds_Ratio = round(exp(coefs[-1, "Estimate"]), 4),
P_Value = round(coefs[-1, "Pr(>|z|)"], 4)
)
odds_table$Significant <- ifelse(odds_table$P_Value < 0.05, "YES ✅", "NO ❌")
odds_table <- odds_table[order(odds_table$P_Value), ]
cat("\n=== Odds Ratios and Significance ===\n")##
## === Odds Ratios and Significance ===
## Variable Odds_Ratio P_Value Significant
## age 1.0376 0.0000 YES ✅
## ejection_fraction 0.9514 0.0000 YES ✅
## serum_creatinine 1.6101 0.0000 YES ✅
## serum_sodium 0.9418 0.0000 YES ✅
## high_blood_pressure 1.4943 0.0000 YES ✅
## anaemia 1.2407 0.0213 YES ✅
## diabetes 1.0343 0.7221 NO ❌
num_data <- df[, c("age", "ejection_fraction", "serum_creatinine",
"serum_sodium", "creatinine_phosphokinase",
"platelets", "time", "Risk_Score", "DEATH_EVENT")]
cor_matrix <- cor(num_data, use = "complete.obs")
cat("=== Correlations with DEATH_EVENT ===\n")## === Correlations with DEATH_EVENT ===
## age ejection_fraction serum_creatinine
## 0.209 -0.237 0.254
## serum_sodium creatinine_phosphokinase platelets
## -0.204 0.043 -0.068
## time Risk_Score DEATH_EVENT
## -0.461 0.382 1.000
corrplot(cor_matrix,
method = "color",
type = "upper",
addCoef.col = "white",
number.cex = 0.7,
tl.col = "#2C3E50",
tl.srt = 45,
tl.cex = 0.85,
col = colorRampPalette(c("#E74C3C", "white", "#27AE60"))(200),
title = "Q7.5 – Correlation Matrix of All Numeric Variables",
mar = c(0, 0, 2, 0))✅ Key Statistical Insights – Level 7:
set.seed(42)
# Select features for clustering
cluster_features <- df[, c("age", "ejection_fraction",
"serum_creatinine", "serum_sodium")]
# Scale features so all are on the same 0-1 scale
cluster_scaled <- scale(cluster_features)
# Run K-Means with 3 groups
km <- kmeans(cluster_scaled, centers = 3, nstart = 25, iter.max = 100)
# Save cluster label
df$Cluster <- factor(km$cluster)
cat("=== Cluster Sizes ===\n")## === Cluster Sizes ===
##
## 1 2 3
## 863 1469 168
# Visualize clusters
fviz_cluster(km,
data = cluster_scaled,
palette = c("#27AE60", "#F39C12", "#E74C3C"),
geom = "point",
ellipse.type = "convex",
ggtheme = theme_minimal(base_size = 13)) +
labs(title = "Q8.1 – K-Means Clustering (k = 3)",
subtitle = "Grouped by: Age, Ejection Fraction, Creatinine, Sodium") +
theme(plot.title = element_text(face = "bold", hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5, color = "grey50"))# Calculate average values per cluster
cluster_profile <- aggregate(
cbind(age, ejection_fraction, serum_creatinine,
serum_sodium, DEATH_EVENT) ~ Cluster,
data = df,
FUN = mean
)
cluster_profile[, 2:5] <- round(cluster_profile[, 2:5], 2)
cluster_profile$Death_Rate_pct <- round(cluster_profile$DEATH_EVENT * 100, 1)
cluster_profile$DEATH_EVENT <- NULL
colnames(cluster_profile) <- c("Cluster", "Avg Age", "Avg EF (%)",
"Avg Creatinine", "Avg Sodium", "Death Rate (%)")
cat("=== Clinical Profile per Cluster ===\n")## === Clinical Profile per Cluster ===
## Cluster Avg Age Avg EF (%) Avg Creatinine Avg Sodium Death Rate (%)
## 1 70.83 46.74 1.27 137.73 32.2
## 2 54.61 32.85 1.22 136.48 32.7
## 3 64.90 36.84 4.62 131.11 72.0
📌 How to read the clusters:
set.seed(42)
# Prepare features
knn_features <- df[, c("age", "ejection_fraction", "serum_creatinine",
"serum_sodium", "time", "high_blood_pressure", "anaemia")]
knn_target <- factor(df$DEATH_EVENT, labels = c("Survived", "Died"))
# Scale features
knn_scaled <- scale(knn_features)
# Split: 80% train, 20% test
n <- nrow(knn_scaled)
train_idx <- sample(1:n, size = 0.8 * n)
test_idx <- setdiff(1:n, train_idx)
train_X <- knn_scaled[train_idx, ]
test_X <- knn_scaled[test_idx, ]
train_y <- knn_target[train_idx]
test_y <- knn_target[test_idx]
# Run KNN with k=7
knn_pred <- knn(train = train_X,
test = test_X,
cl = train_y,
k = 7)
cat("=== KNN Prediction Done (k = 7) ===\n")## === KNN Prediction Done (k = 7) ===
## Training set: 2000 patients
## Test set: 500 patients
# Confusion Matrix
cm <- table(Actual = test_y, Predicted = knn_pred)
cat("=== Confusion Matrix ===\n")## === Confusion Matrix ===
## Predicted
## Actual Survived Died
## Survived 303 30
## Died 31 136
# Performance Metrics
TP <- cm["Died", "Died"]
TN <- cm["Survived", "Survived"]
FP <- cm["Survived", "Died"]
FN <- cm["Died", "Survived"]
accuracy <- round((TP + TN) / sum(cm) * 100, 1)
sensitivity <- round(TP / (TP + FN) * 100, 1)
specificity <- round(TN / (TN + FP) * 100, 1)
cat("\n=== Model Performance ===\n")##
## === Model Performance ===
## Accuracy : 87.8 % — how often is the model correct?
## Sensitivity : 81.4 % — how well does it detect DIED?
## Specificity : 91 % — how well does it detect SURVIVED?
# Plot confusion matrix
cm_df <- as.data.frame(cm)
ggplot(cm_df, aes(x = Actual, y = Predicted, fill = Freq)) +
geom_tile(color = "white", linewidth = 1.5) +
geom_text(aes(label = Freq), size = 10, fontface = "bold", color = "white") +
scale_fill_gradient(low = "#AED6F1", high = "#1A5276") +
labs(title = "Q8.4 – KNN Confusion Matrix",
subtitle = paste0("Accuracy: ", accuracy,
"% | Sensitivity: ", sensitivity,
"% | Specificity: ", specificity, "%"),
x = "Actual Outcome", y = "Predicted Outcome") +
theme_minimal(base_size = 14) +
theme(plot.title = element_text(face = "bold", hjust = 0.5, size = 15),
plot.subtitle = element_text(hjust = 0.5, color = "grey50"),
legend.position = "none",
axis.text = element_text(face = "bold", size = 12))✅ Key ML Insights – Level 8:
| Level | What We Did | Key Finding |
|---|---|---|
| Level 1 | Explored data structure | 2,500 rows, 13 columns, 0 missing values, 32% mortality |
| Level 2 | Filtered high-risk groups | Older + high CPK + high BP patients are most at risk |
| Level 3 | Grouped & summarized | Patients who died were older with lower serum sodium |
| Level 4 | Sorted & ranked | Creatinine > 9 mg/dL almost always predicts death |
| Level 5 | Created new features | Risk Score (0–100) captures patient danger in one number |
| Level 6 | Built 7 visualizations | EF + creatinine together clearly separate survivors from deaths |
| Level 7 | Statistical tests | Creatinine and EF are the strongest significant predictors |
| Level 8 | Machine Learning | KNN and K-Means effectively classify and cluster patients |
| Rank | Feature | Why It Matters |
|---|---|---|
| 1 🔴 | Serum Creatinine | High values = kidneys are failing |
| 2 🔴 | Ejection Fraction | Low values = heart is not pumping enough blood |
| 3 🔴 | Age | Older patients have weaker organs overall |
| 4 🟡 | Serum Sodium | Low sodium = worsening heart failure |
| 5 🟡 | Follow-up Time | Patients who die tend to die early in the observation period |
Report created in R | Dataset: Heart Failure Clinical Records, Faisalabad Institute of Cardiology, Pakistan (2015) UCI Machine Learning Repository