Cardiovascular diseases are a leading cause of mortality worldwide, making early risk assessment critical. For this project, we chose the Cardiovascular Disease Dataset from Kaggle to analyze the interplay between lifestyle choices and health outcomes. Our goal is to use R not only to predict the presence of heart disease but also to model key health indicators such as blood pressure. By analyzing 70,000 patient records, we aim to demonstrate how data science can turn raw medical measurements into actionable health insights.
Cardiovascular diseases (CVDs) remain the leading cause of mortality globally. While clinical examinations provide vital data, the complex interplay between lifestyle choices (smoking, alcohol), physical attributes (BMI, age), and physiological markers (cholesterol, glucose) makes it difficult for healthcare providers to predict risk accurately without computational aid.
Furthermore, raw medical data often contains significant noise, such as physiologically impossible blood pressure readings and inconsistent units, which hinders effective decision-making. There is therefore a critical need for robust data pipelines that clean this information and for models that can both predict the presence of disease and forecast vital health metrics such as systolic blood pressure.
Objective 1 (Data Processing): To perform extensive data cleaning and feature engineering, including converting age from days to years, handling extreme outliers in the blood pressure readings (ap_hi and ap_lo), and calculating Body Mass Index (BMI) from height and weight (see the sketch after this list).
Objective 2 (Classification): To develop and evaluate a classification model (e.g., Logistic Regression or Random Forest) to accurately predict the presence or absence of cardiovascular disease (cardio) based on patient profiles.
Objective 3 (Regression): To investigate the relationship between lifestyle factors and physiological traits by building a regression model to predict Systolic Blood Pressure (ap_hi) as a continuous variable.
Objective 4 (Insight): To identify the most significant risk factors (e.g., cholesterol vs. physical activity) that contribute to heart disease through exploratory data visualization.
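As a preview of Objective 1, the core transformations can be sketched in a few lines of dplyr. This is a minimal sketch only, assuming the raw data frame `data` loaded in the next section; the full, validated pipeline appears below.

# Minimal sketch of the Objective 1 transformations (full pipeline below)
library(dplyr)
data_preview <- data %>%
  filter(ap_hi >= 60, ap_hi <= 250,          # plausible systolic range
         ap_lo >= 40, ap_lo <= 150,          # plausible diastolic range
         ap_hi > ap_lo) %>%                  # systolic must exceed diastolic
  mutate(age_years = round(age / 365),       # age is recorded in days
         bmi = weight / (height / 100)^2)    # BMI = kg / m^2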
The project follows a standard data science workflow:
Environment Setup: Loading necessary libraries for analysis.
Data Preparation: Cleaning data, handling outliers in blood pressure, and removing duplicates.
Feature Engineering: Creating new variables like BMI and converting age into years.
Modelling: Training and comparing Logistic Regression, Decision Trees, and Random Forest models.
# 1. List of packages
packages <- c("ggplot2", "randomForest", "pROC", "rpart", "rpart.plot",
"tidyr", "dplyr", "corrplot", "Metrics", "caret", "lava", "parallelly")
# 2. Install missing packages silently
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) {
  install.packages(new_packages,
                   dependencies = TRUE,
                   repos = "https://cran.rstudio.com/",
                   type = "binary")  # binary installs assume Windows/macOS
}
# 3. Load all libraries
# (results='hide' in the chunk header suppresses the TRUE load messages)
lapply(packages, require, character.only = TRUE)

# Define column names
col_names <- c("id", "age", "gender", "height", "weight", "ap_hi", "ap_lo",
"cholesterol", "gluc", "smoke", "alco", "active", "cardio")
# Use the absolute path with forward slashes
file_path <- "D:/Masters of Data Science (Artificial Intelligence) -- (UM)/Programming for Data Science/Group Project/cardio_train.csv"
# Read the file
data <- read.csv(file_path,
sep = ";",
header = FALSE,
skip = 1,
col.names = col_names)
# Verify the data loaded correctly
head(data)

##   id   age gender height weight ap_hi ap_lo cholesterol gluc smoke alco active
## 1  0 18393      2    168     62   110    80           1    1     0    0      1
## 2  1 20228      1    156     85   140    90           3    1     0    0      1
## 3  2 18857      1    165     64   130    70           3    1     0    0      0
## 4  3 17623      2    169     82   150   100           1    1     0    0      1
## 5  4 17474      1    156     56   100    60           1    1     0    0      0
## 6  8 21914      1    151     67   120    80           2    2     0    0      0
##   cardio
## 1      0
## 2      1
## 3      1
## 4      1
## 5      0
## 6      0
# Check for missing values and duplicated rows
colSums(is.na(data))

##     id    age gender height weight  ap_hi
##      0      0      0      0      0      0
##  ap_lo cholesterol   gluc  smoke   alco active
##      0           0      0      0      0      0
## cardio
##      0

sum(duplicated(data))

## [1] 0
# Filter realistic Blood Pressure ranges
data_clean <- data[data$ap_hi >= 60 & data$ap_hi <= 250 &
data$ap_lo >= 40 & data$ap_lo <= 150, ]
# Ensure systolic is higher than diastolic (ap_hi > ap_lo)
data_clean <- data_clean[data_clean$ap_hi > data_clean$ap_lo, ]
# Cleaning Summary Metrics
total_original <- nrow(data)
total_cleaned <- nrow(data_clean)
rows_removed <- total_original - total_cleaned
percent_retained <- (total_cleaned / total_original) * 100
cat("--- DATA CLEANING SUMMARY ---",
"\nOriginal Rows: ", total_original,
"\nCleaned Rows: ", total_cleaned,
"\nData Retained: ", round(percent_retained, 2), "%\n")## --- DATA CLEANING SUMMARY ---
## Original Rows: 70000
## Cleaned Rows: 68667
## Data Retained: 98.1 %
# Visualization: Before vs After Cleaning
counts <- c(total_original, total_cleaned)
barplot(counts, names.arg = c("Original", "Cleaned"),
col = c("gray", "steelblue"),
main = "Dataset Volume Before vs. After Cleaning",
ylim = c(0, 80000))
text(x = 1.2, y = total_cleaned + 5000, labels = total_cleaned, cex = 1.2, font = 2)

# Calculate BMI & Convert Age (from days to years)
data_clean$bmi <- data_clean$weight / (data_clean$height / 100)^2
data_clean$age_years <- round(data_clean$age / 365)
prop.table(table(data_clean$cardio))

##
##         0         1
## 0.5053082 0.4946918
# Create a simple bar plot of the class balance
counts <- table(data_clean$cardio)
names(counts) <- c("Healthy (0)", "Cardio (1)")
barplot(counts,
col = c("#45B39D", "#EC7063"), # Teal for Healthy, Coral for Cardio
main = "Distribution of Cardiovascular Disease",
ylab = "Number of Patients",
ylim = c(0, 40000))
# Adding the percentages on top
text(x = 0.7, y = 20000, labels = "50.5%", col = "white", font = 2, cex = 1.5)
text(x = 1.9, y = 20000, labels = "49.5%", col = "white", font = 2, cex = 1.5)

# EDA 1 - The "Risk Over Time" Density Plot
ggplot(data_clean, aes(x = age_years, fill = as.factor(cardio))) +
geom_density(alpha = 0.5) +
scale_fill_manual(values = c("#45B39D", "#EC7063"),
labels = c("Healthy", "Cardio"),
name = "Diagnosis") +
theme_minimal() +
labs(title = "Risk Probability Density by Age",
subtitle = "Visualizing the 'Crossover' point where risk increases significantly",
x = "Age (Years)",
y = "Density")# EDA 2 - BMI Violin Plots (Distribution vs. Outliers)
ggplot(data_clean, aes(x = as.factor(cardio), y = bmi, fill = as.factor(cardio))) +
geom_violin(trim = FALSE, alpha = 0.7) +
geom_boxplot(width = 0.1, fill = "white") +
scale_x_discrete(labels = c("Healthy", "Cardio")) +
scale_fill_manual(values = c("#45B39D", "#EC7063"), guide = "none") +
theme_minimal() +
ylim(15, 50) + # Focus on realistic BMI range
labs(title = "Body Mass Index (BMI) Distribution",
subtitle = "Comparing healthy vs. diagnosed populations",
x = "Patient Status",
y = "BMI")# EDA 3 - The Feature Correlation Matrix
## Use only numeric columns forthe corr matric
numeric_data <- data_clean[, sapply(data_clean, is.numeric)]
cor_matrix <- cor(numeric_data)
corrplot(cor_matrix,
method = "color",
type = "upper",
tl.col = "black",
tl.srt = 45,
addCoef.col = "black", # Show correlation numbers
number.cex = 0.7, # Adjust number size
         col = colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))(200))

# EDA 4 Part 1 - Systolic BP by Cardio Status
ggplot(data_clean, aes(x = as.factor(cardio), y = ap_hi, fill = as.factor(cardio))) +
geom_boxplot() +
scale_fill_manual(values = c("#45B39D", "#EC7063"), labels = c("Healthy", "Cardio"), name = "Status") +
labs(title = "Systolic Blood Pressure vs. Heart Disease",
x = "Diagnosis", y = "Systolic BP (ap_hi)") +
  theme_minimal()

# EDA 4 Part 2 - Age by Cardio Status
ggplot(data_clean, aes(x = as.factor(cardio), y = age_years, fill = as.factor(cardio))) +
geom_boxplot() +
scale_fill_manual(values = c("#45B39D", "#EC7063"), labels = c("Healthy", "Cardio"), name = "Status") +
labs(title = "Age Distribution vs. Heart Disease",
x = "Diagnosis", y = "Age (Years)") +
  theme_minimal()

# EDA 5 - Group summary statistics by diagnosis
eda_summary <- data_clean %>%
group_by(cardio) %>%
summarise(
Avg_Age = mean(age_years),
Avg_BMI = mean(bmi),
Avg_Systolic = mean(ap_hi),
Avg_Diastolic = mean(ap_lo)
)
print(eda_summary)

## # A tibble: 2 × 5
## cardio Avg_Age Avg_BMI Avg_Systolic Avg_Diastolic
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 0 51.7 26.5 120. 78.1
## 2 1 55.0 28.5 134. 84.5
# 1. Convert categorical integers into Factors
categorical_cols <- c("cholesterol", "gluc", "smoke", "alco", "active", "gender")
data_clean[categorical_cols] <- lapply(data_clean[categorical_cols], as.factor)
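This conversion matters because modelling functions dummy-encode factor levels rather than treating the codes 1-3 as a numeric scale. A quick base-R illustration (not part of the original pipeline):

# Each non-baseline level becomes its own indicator column
head(model.matrix(~ cholesterol, data = data_clean))
# Columns: (Intercept), cholesterol2, cholesterol3 -- level 1 is the baseline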
# 2. Drop the ID column and the redundant raw age (replaced by age_years)
final_data <- data_clean[, !(names(data_clean) %in% c("id", "age"))]
# 3. Split data 70/30
set.seed(123)
sample_size <- floor(0.70 * nrow(final_data))
train_indices <- sample(seq_len(nrow(final_data)), size = sample_size)
train_set <- final_data[train_indices, ]
test_set <- final_data[-train_indices, ]
# 4. Ensure cardio is a factor
train_set$cardio <- as.factor(train_set$cardio)
test_set$cardio <- as.factor(test_set$cardio)
# 5. Model Training
dt_model <- rpart(cardio ~ ., data = train_set, method = "class")
rf_model <- randomForest(cardio ~ ., data = train_set, ntree = 100)
log_model <- glm(cardio ~ ., data = train_set, family = "binomial")
# 6. Predictions
dt_pred <- predict(dt_model, test_set, type = "class")
rf_pred <- predict(rf_model, test_set)
log_prob <- predict(log_model, test_set, type = "response")
log_pred <- as.factor(ifelse(log_prob > 0.5, 1, 0))
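The 0.5 cutoff is a sensible default here because the classes are nearly balanced (about 50.5% vs. 49.5%). If tuning were ever needed, pROC can suggest a data-driven cutoff; a one-line sketch using the objects defined above:

# Optional: cutoff maximising Youden's J (sensitivity + specificity - 1)
best_cut <- coords(roc(test_set$cardio, log_prob), "best", ret = "threshold")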
# 7. Metrics Function
get_metrics <- function(predictions, actual) {
  # Confusion matrix: rows = predicted class, columns = actual class (levels "0", "1")
  cm <- table(Predicted = predictions, Actual = actual)
  accuracy <- sum(diag(cm)) / sum(cm)
  sensitivity <- cm[2, 2] / (cm[2, 2] + cm[1, 2])  # TP / (TP + FN)
  specificity <- cm[1, 1] / (cm[1, 1] + cm[2, 1])  # TN / (TN + FP)
  return(c(Accuracy = accuracy, Sensitivity = sensitivity, Specificity = specificity))
}
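As a quick sanity check of the confusion-matrix indexing, here is a toy example (illustrative vectors, not project data):

# Toy check: 5 predictions vs. 5 actual labels
toy_pred   <- factor(c(0, 1, 1, 0, 1), levels = c(0, 1))
toy_actual <- factor(c(0, 1, 0, 0, 1), levels = c(0, 1))
get_metrics(toy_pred, toy_actual)
# Expected: Accuracy = 0.8, Sensitivity = 1.0, Specificity ≈ 0.67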
# 8. Create Comparison Table
comparison_table <- data.frame(
Model = c("Logistic Regression", "Decision Tree", "Random Forest"),
rbind(get_metrics(log_pred, test_set$cardio),
get_metrics(dt_pred, test_set$cardio),
get_metrics(rf_pred, test_set$cardio))
)
# Reshape the comparison table to long format for plotting
comp_long <- tidyr::pivot_longer(comparison_table,
cols = c(Accuracy, Sensitivity, Specificity),
names_to = "Metric",
values_to = "Value")
# 9. Plot Model Comparison
ggplot(comp_long, aes(x = Metric, y = Value, fill = Model)) +
geom_bar(stat = "identity", position = "dodge") +
theme_minimal() +
ylim(0, 1) +
labs(title = "Model Performance Comparison", y = "Score")# 10. Feature Importance (Random Forest)
imp <- as.data.frame(rf_model$importance)
imp$Variable <- rownames(imp)
colnames(imp)[1] <- "GiniImportance"
ggplot(imp, aes(x = reorder(Variable, GiniImportance), y = GiniImportance)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
theme_minimal() +
labs(title = "Feature Importance Ranking", x = "Features", y = "Importance")# 11. ROC Curves
# FIX: Using 'log_prob' (singular) to match the definition above
roc_log <- roc(test_set$cardio, log_prob)
roc_dt <- roc(test_set$cardio, predict(dt_model, test_set, type = "prob")[,2])
roc_rf <- roc(test_set$cardio, predict(rf_model, test_set, type = "prob")[,2])
plot(roc_rf, col="blue", lwd=3, main="ROC Curve Comparison")
plot(roc_log, col="green", lwd=3, add=TRUE)
plot(roc_dt, col="red", lwd=3, add=TRUE)
legend("bottomright", legend=c("RF", "LogReg", "DT"), col=c("blue", "green", "red"), lwd=3)# =========================================================
# =========================================================
# REGRESSION MODEL
# =========================================================
# =========================================================
# STEP 1: SETUP (Do this once)
# =========================================================
data_clean$cholesterol <- as.factor(data_clean$cholesterol)
data_clean$smoke <- as.factor(data_clean$smoke)
data_clean$alco <- as.factor(data_clean$alco)
data_clean$gluc <- as.factor(data_clean$gluc)
# =========================================================
# STEP 2: THE TOURNAMENT (Comparing 3 Models)
# =========================================================
cat("\n--- STARTING MODEL TOURNAMENT ---\n")##
## --- STARTING MODEL TOURNAMENT ---
# 1. Split Data into Train (80%) and Test (20%)
# This ensures we test on "fresh" data for a fair fight
set.seed(123)
sample_index <- sample(1:nrow(data_clean), 0.8 * nrow(data_clean))
train_data <- data_clean[sample_index, ]
test_data <- data_clean[-sample_index, ]
# 2. Train the 3 Contenders
# A. Linear Regression
model_lm <- lm(ap_hi ~ age_years + bmi + cholesterol + gluc + smoke + alco, data = train_data)
# B. Decision Tree
model_tree <- rpart(ap_hi ~ age_years + bmi + cholesterol + gluc + smoke + alco,
data = train_data, method = "anova")
# C. Random Forest (limit trees to 50 for speed)
model_rf <- randomForest(ap_hi ~ age_years + bmi + cholesterol + gluc + smoke + alco,
data = train_data, ntree = 50, importance = TRUE)
# 3. Calculate Errors (RMSE) on Test Data
pred_lm <- predict(model_lm, test_data)
rmse_lm <- rmse(test_data$ap_hi, pred_lm)
pred_tree <- predict(model_tree, test_data)
rmse_tree <- rmse(test_data$ap_hi, pred_tree)
pred_rf <- predict(model_rf, test_data)
rmse_rf <- rmse(test_data$ap_hi, pred_rf)
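For reference, Metrics::rmse(actual, predicted) computes sqrt(mean((actual - predicted)^2)); an equivalent manual check on the linear model (illustrative):

# RMSE by hand -- should match rmse_lm exactly
manual_rmse_lm <- sqrt(mean((test_data$ap_hi - pred_lm)^2))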
# 4. Declare the Winner
results <- data.frame(
Model = c("Linear Regression", "Decision Tree", "Random Forest"),
RMSE = c(rmse_lm, rmse_tree, rmse_rf)
)
print(results)

##               Model     RMSE
## 1 Linear Regression 15.64675
## 2     Decision Tree 15.95346
## 3     Random Forest 15.64399

# Pick the model with the lowest test RMSE
winner_name <- results$Model[which.min(results$RMSE)]
cat("\n🏆 THE WINNER IS:", winner_name, "\n")

##
## 🏆 THE WINNER IS: Random Forest
# =========================================================
# STEP 3: DEEP DIVE ANALYSIS (On the Winner Only)
# =========================================================
cat("\n--- PERFORMING DETAILED ANALYSIS ON THE WINNER ---\n")##
## --- PERFORMING DETAILED ANALYSIS ON THE WINNER ---
if (winner_name == "Linear Regression") {
# --- LINEAR REGRESSION ANALYSIS ---
print(summary(model_lm))
# Plot Coefficients
coef_df <- as.data.frame(coef(model_lm))
colnames(coef_df) <- "Estimate"
coef_df$Variable <- rownames(coef_df)
print(
ggplot(coef_df[-1,], aes(x = reorder(Variable, Estimate), y = Estimate)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(title = "Linear Model: Coefficients", subtitle = "How much each variable adds to BP")
)
} else if (winner_name == "Decision Tree") {
# --- DECISION TREE ANALYSIS ---
printcp(model_tree) # Print complexity
# Plot the Tree Logic
rpart.plot(model_tree, main = "Decision Tree Logic",
type = 3, clip.right.labs = FALSE, branch = .3, under = TRUE)
} else {
# --- RANDOM FOREST ANALYSIS ---
print(model_rf)
# 1. Variable Importance Plot (Which variable matters most?)
# This replaces "Coefficients" for Random Forest
varImpPlot(model_rf, main = "Random Forest: What drives Blood Pressure?")
# 2. Extract Importance for a nicer Bar Chart
imp_df <- as.data.frame(importance(model_rf))
imp_df$Variable <- rownames(imp_df)
print(
ggplot(imp_df, aes(x = reorder(Variable, `%IncMSE`), y = `%IncMSE`)) +
geom_bar(stat = "identity", fill = "darkgreen") +
coord_flip() +
labs(title = "Random Forest: Variable Importance",
subtitle = "Higher % = More important for predicting BP",
y = "% Increase in Error (if variable is removed)",
x = "Variable") +
theme_minimal()
)
}

##
## Call:
## randomForest(formula = ap_hi ~ age_years + bmi + cholesterol + gluc + smoke + alco, data = train_data, ntree = 50, importance = TRUE)
## Type of random forest: regression
## Number of trees: 50
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 245.3706
## % Var explained: 11.94
# =========================================================
# STEP 4: ACTUAL VS PREDICTED (For the Winner)
# =========================================================
# Regardless of who won, we want to see the prediction accuracy plot
if (winner_name == "Linear Regression") {
final_pred <- pred_lm
} else if (winner_name == "Decision Tree") {
final_pred <- pred_tree
} else {
final_pred <- pred_rf
}
# Create the plot data
plot_data <- data.frame(Actual = test_data$ap_hi, Predicted = final_pred)
ggplot(plot_data, aes(x = Predicted, y = Actual)) +
geom_point(alpha = 0.1, color = "darkblue") +
geom_abline(intercept = 0, slope = 1, color = "red", size = 1) + # The "Perfect" line
labs(title = paste("Performance of Best Model:", winner_name),
subtitle = "Red Line = Perfect Prediction. Dots = Patients.",
x = "Predicted BP", y = "Actual BP") +
  theme_minimal()

# The Age Effect Plot
ggplot(data_clean, aes(x = age_years, y = ap_hi)) +
# We use 'stat_summary' to show the average BP for each age (cleaner than raw dots)
stat_summary(fun = mean, geom = "point", color = "darkblue", size = 2) +
geom_smooth(method = "lm", color = "red", size = 1.5) +
labs(title = "The Effect of Age",
subtitle = "The red line shows the 'Slope' for Age specifically",
x = "Age (Years)",
y = "Average Systolic BP") +
  theme_minimal()

# The BMI Effect Plot
ggplot(data_clean, aes(x = bmi, y = ap_hi)) +
# Limiting BMI to 50 so outliers don't squish the graph
geom_point(alpha = 0.05, color = "darkgreen") +
geom_smooth(method = "lm", color = "red", size = 1.5) +
labs(title = "The Effect of BMI",
subtitle = "The red line shows the 'Slope' for BMI specifically",
x = "BMI",
y = "Systolic BP") +
xlim(15, 50) + # Focus on the main group of people
theme_minimal()