Cardiovascular diseases are a leading cause of mortality worldwide, making early risk assessment critical. For this project, we chose the Cardiovascular Disease Dataset from Kaggle to analyze the interplay between lifestyle choices and health outcomes. Our goal is to use R not only to predict the presence of heart disease but also to model key health indicators such as blood pressure. By analyzing 70,000 patient records, we aim to demonstrate how data science can turn raw medical measurements into actionable health insights.
Cardiovascular diseases (CVDs) remain the leading cause of mortality globally. While clinical examinations provide vital data, the complex interplay between lifestyle choices (smoking, alcohol), physical attributes (BMI, age), and physiological markers (cholesterol, glucose) makes it difficult for healthcare providers to predict risk accurately without computational aid.
Furthermore, raw medical data often contains significant noise, such as physiologically impossible blood pressure readings and inconsistent units, which hinders effective decision-making. There is therefore a critical need for robust data pipelines that clean this information and for models that can both predict the presence of disease and forecast vital health metrics such as systolic blood pressure.
Objective 1 (Data Processing): To perform extensive data cleaning and feature engineering, including converting age from days to years, handling extreme outliers in the blood pressure readings (ap_hi and ap_lo), and calculating Body Mass Index (BMI) from height and weight (see the sketch after this list).
Objective 2 (Classification): To develop and evaluate a classification model (e.g., Logistic Regression or Random Forest) to accurately predict the presence or absence of cardiovascular disease (cardio) based on patient profiles.
Objective 3 (Regression): To investigate the relationship between lifestyle factors and physiological traits by building a regression model to predict Systolic Blood Pressure (ap_hi) as a continuous variable.
Objective 4 (Insight): To identify the most significant risk factors (e.g., cholesterol vs. physical activity) that contribute to heart disease through exploratory data visualization.
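As a preview of Objective 1, the core transformations can be sketched in a few lines of dplyr. This is a minimal sketch only, assuming the raw data frame `data` loaded in the next section; the full, validated pipeline appears below.

# Minimal sketch of the Objective 1 transformations (full pipeline below)
library(dplyr)
data_preview <- data %>%
  filter(ap_hi >= 60, ap_hi <= 250,          # plausible systolic range
         ap_lo >= 40, ap_lo <= 150,          # plausible diastolic range
         ap_hi > ap_lo) %>%                  # systolic must exceed diastolic
  mutate(age_years = round(age / 365),       # age is recorded in days
         bmi = weight / (height / 100)^2)    # BMI = kg / m^2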
The project follows a standard data science workflow:
Environment Setup: Loading necessary libraries for analysis.
Data Preparation: Cleaning data, handling outliers in blood pressure, and removing duplicates.
Feature Engineering: Creating new variables like BMI and converting age into years.
Modelling: Training and comparing Logistic Regression, Decision Trees, and Random Forest models.
# 1. List of packages
packages <- c("ggplot2", "randomForest", "pROC", "rpart", "rpart.plot",
"tidyr", "dplyr", "corrplot", "Metrics", "caret", "lava", "parallelly")
# 2. Install missing packages silently
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) {
  install.packages(new_packages,
                   dependencies = TRUE,
                   repos = "https://cran.rstudio.com/",
                   type = "binary")  # binary installs assume Windows/macOS
}
# 3. Load all libraries
# (results='hide' in the chunk header suppresses the TRUE load messages)
lapply(packages, require, character.only = TRUE)

# Define column names
col_names <- c("id", "age", "gender", "height", "weight", "ap_hi", "ap_lo",
"cholesterol", "gluc", "smoke", "alco", "active", "cardio")
# Use the absolute path with forward slashes
file_path <- "D:/Masters of Data Science (Artificial Intelligence) -- (UM)/Programming for Data Science/Group Project/cardio_train.csv"
# Read the file
data <- read.csv(file_path,
sep = ";",
header = FALSE,
skip = 1,
col.names = col_names)
# Verify the data loaded correctly
head(data)

##   id   age gender height weight ap_hi ap_lo cholesterol gluc smoke alco active
## 1  0 18393      2    168     62   110    80           1    1     0    0      1
## 2  1 20228      1    156     85   140    90           3    1     0    0      1
## 3  2 18857      1    165     64   130    70           3    1     0    0      0
## 4  3 17623      2    169     82   150   100           1    1     0    0      1
## 5  4 17474      1    156     56   100    60           1    1     0    0      0
## 6  8 21914      1    151     67   120    80           2    2     0    0      0
##   cardio
## 1      0
## 2      1
## 3      1
## 4      1
## 5      0
## 6      0
# Check for missing values and duplicated rows
colSums(is.na(data))

##     id    age gender height weight  ap_hi
##      0      0      0      0      0      0
##  ap_lo cholesterol   gluc  smoke   alco active
##      0           0      0      0      0      0
## cardio
##      0

sum(duplicated(data))

## [1] 0
# Filter realistic Blood Pressure ranges
data_clean <- data[data$ap_hi >= 60 & data$ap_hi <= 250 &
data$ap_lo >= 40 & data$ap_lo <= 150, ]
# Ensure systolic is higher than diastolic (ap_hi > ap_lo)
data_clean <- data_clean[data_clean$ap_hi > data_clean$ap_lo, ]
# Cleaning Summary Metrics
total_original <- nrow(data)
total_cleaned <- nrow(data_clean)
rows_removed <- total_original - total_cleaned
percent_retained <- (total_cleaned / total_original) * 100
cat("--- DATA CLEANING SUMMARY ---",
"\nOriginal Rows: ", total_original,
"\nCleaned Rows: ", total_cleaned,
"\nData Retained: ", round(percent_retained, 2), "%\n")## --- DATA CLEANING SUMMARY ---
## Original Rows: 70000
## Cleaned Rows: 68667
## Data Retained: 98.1 %
# Visualization: Before vs After Cleaning
counts <- c(total_original, total_cleaned)
barplot(counts, names.arg = c("Original", "Cleaned"),
col = c("gray", "steelblue"),
main = "Dataset Volume Before vs. After Cleaning",
ylim = c(0, 80000))
text(x = 1.2, y = total_cleaned + 5000, labels = total_cleaned, cex = 1.2, font = 2)

# Calculate BMI & Convert Age (from days to years)
data_clean$bmi <- data_clean$weight / (data_clean$height / 100)^2
data_clean$age_years <- round(data_clean$age / 365)
prop.table(table(data_clean$cardio))

##
##         0         1
## 0.5053082 0.4946918
# Create a simple bar plot of the class balance
counts <- table(data_clean$cardio)
names(counts) <- c("Healthy (0)", "Cardio (1)")
barplot(counts,
col = c("#45B39D", "#EC7063"), # Teal for Healthy, Coral for Cardio
main = "Distribution of Cardiovascular Disease",
ylab = "Number of Patients",
ylim = c(0, 40000))
# Adding the percentages on top
text(x = 0.7, y = 20000, labels = "50.5%", col = "white", font = 2, cex = 1.5)
text(x = 1.9, y = 20000, labels = "49.5%", col = "white", font = 2, cex = 1.5)

# EDA 1 - The "Risk Over Time" Density Plot
ggplot(data_clean, aes(x = age_years, fill = as.factor(cardio))) +
geom_density(alpha = 0.5) +
scale_fill_manual(values = c("#45B39D", "#EC7063"),
labels = c("Healthy", "Cardio"),
name = "Diagnosis") +
theme_minimal() +
labs(title = "Risk Probability Density by Age",
subtitle = "Visualizing the 'Crossover' point where risk increases significantly",
x = "Age (Years)",
y = "Density")# EDA 2 - BMI Violin Plots (Distribution vs. Outliers)
ggplot(data_clean, aes(x = as.factor(cardio), y = bmi, fill = as.factor(cardio))) +
geom_violin(trim = FALSE, alpha = 0.7) +
geom_boxplot(width = 0.1, fill = "white") +
scale_x_discrete(labels = c("Healthy", "Cardio")) +
scale_fill_manual(values = c("#45B39D", "#EC7063"), guide = "none") +
theme_minimal() +
ylim(15, 50) + # Focus on realistic BMI range
labs(title = "Body Mass Index (BMI) Distribution",
subtitle = "Comparing healthy vs. diagnosed populations",
x = "Patient Status",
y = "BMI")# EDA 3 - The Feature Correlation Matrix
## Use only numeric columns forthe corr matric
numeric_data <- data_clean[, sapply(data_clean, is.numeric)]
cor_matrix <- cor(numeric_data)
corrplot(cor_matrix,
method = "color",
type = "upper",
tl.col = "black",
tl.srt = 45,
addCoef.col = "black", # Show correlation numbers
number.cex = 0.7, # Adjust number size
         col = colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))(200))

# EDA 4 Part 1 - Systolic BP by Cardio Status
ggplot(data_clean, aes(x = as.factor(cardio), y = ap_hi, fill = as.factor(cardio))) +
geom_boxplot() +
scale_fill_manual(values = c("#45B39D", "#EC7063"), labels = c("Healthy", "Cardio"), name = "Status") +
labs(title = "Systolic Blood Pressure vs. Heart Disease",
x = "Diagnosis", y = "Systolic BP (ap_hi)") +
  theme_minimal()

# EDA 4 Part 2 - Age by Cardio Status
ggplot(data_clean, aes(x = as.factor(cardio), y = age_years, fill = as.factor(cardio))) +
geom_boxplot() +
scale_fill_manual(values = c("#45B39D", "#EC7063"), labels = c("Healthy", "Cardio"), name = "Status") +
labs(title = "Age Distribution vs. Heart Disease",
x = "Diagnosis", y = "Age (Years)") +
  theme_minimal()

# EDA 5 - Group summary statistics by diagnosis
eda_summary <- data_clean %>%
group_by(cardio) %>%
summarise(
Avg_Age = mean(age_years),
Avg_BMI = mean(bmi),
Avg_Systolic = mean(ap_hi),
Avg_Diastolic = mean(ap_lo)
)
print(eda_summary)

## # A tibble: 2 × 5
## cardio Avg_Age Avg_BMI Avg_Systolic Avg_Diastolic
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 0 51.7 26.5 120. 78.1
## 2 1 55.0 28.5 134. 84.5
# 1. Convert categorical integers into Factors
categorical_cols <- c("cholesterol", "gluc", "smoke", "alco", "active", "gender")
data_clean[categorical_cols] <- lapply(data_clean[categorical_cols], as.factor)
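This conversion matters because modelling functions dummy-encode factor levels rather than treating the codes 1-3 as a numeric scale. A quick base-R illustration (not part of the original pipeline):

# Each non-baseline level becomes its own indicator column
head(model.matrix(~ cholesterol, data = data_clean))
# Columns: (Intercept), cholesterol2, cholesterol3 -- level 1 is the baseline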
# 2. Drop the ID column and the redundant raw age (replaced by age_years)
final_data <- data_clean[, !(names(data_clean) %in% c("id", "age"))]
# 3. Split data 70/30
set.seed(123)
sample_size <- floor(0.70 * nrow(final_data))
train_indices <- sample(seq_len(nrow(final_data)), size = sample_size)
train_set <- final_data[train_indices, ]
test_set <- final_data[-train_indices, ]
# 4. Ensure cardio is a factor
train_set$cardio <- as.factor(train_set$cardio)
test_set$cardio <- as.factor(test_set$cardio)
# 5. Model Training
dt_model <- rpart(cardio ~ ., data = train_set, method = "class")
rf_model <- randomForest(cardio ~ ., data = train_set, ntree = 100)
log_model <- glm(cardio ~ ., data = train_set, family = "binomial")
# 6. Predictions
dt_pred <- predict(dt_model, test_set, type = "class")
rf_pred <- predict(rf_model, test_set)
log_prob <- predict(log_model, test_set, type = "response")
log_pred <- as.factor(ifelse(log_prob > 0.5, 1, 0))
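The 0.5 cutoff is a sensible default here because the classes are nearly balanced (about 50.5% vs. 49.5%). If tuning were ever needed, pROC can suggest a data-driven cutoff; a one-line sketch using the objects defined above:

# Optional: cutoff maximising Youden's J (sensitivity + specificity - 1)
best_cut <- coords(roc(test_set$cardio, log_prob), "best", ret = "threshold")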
# 7. Metrics Function
get_metrics <- function(predictions, actual) {
  # Confusion matrix: rows = predicted class, columns = actual class (levels "0", "1")
  cm <- table(Predicted = predictions, Actual = actual)
  accuracy <- sum(diag(cm)) / sum(cm)
  sensitivity <- cm[2, 2] / (cm[2, 2] + cm[1, 2])  # TP / (TP + FN)
  specificity <- cm[1, 1] / (cm[1, 1] + cm[2, 1])  # TN / (TN + FP)
  return(c(Accuracy = accuracy, Sensitivity = sensitivity, Specificity = specificity))
}
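As a quick sanity check of the confusion-matrix indexing, here is a toy example (illustrative vectors, not project data):

# Toy check: 5 predictions vs. 5 actual labels
toy_pred   <- factor(c(0, 1, 1, 0, 1), levels = c(0, 1))
toy_actual <- factor(c(0, 1, 0, 0, 1), levels = c(0, 1))
get_metrics(toy_pred, toy_actual)
# Expected: Accuracy = 0.8, Sensitivity = 1.0, Specificity ≈ 0.67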
# 8. Create Comparison Table
comparison_table <- data.frame(
Model = c("Logistic Regression", "Decision Tree", "Random Forest"),
rbind(get_metrics(log_pred, test_set$cardio),
get_metrics(dt_pred, test_set$cardio),
get_metrics(rf_pred, test_set$cardio))
)
# Reshape the comparison table to long format for plotting
comp_long <- tidyr::pivot_longer(comparison_table,
cols = c(Accuracy, Sensitivity, Specificity),
names_to = "Metric",
values_to = "Value")
# 9. Plot Model Comparison
ggplot(comp_long, aes(x = Metric, y = Value, fill = Model)) +
geom_bar(stat = "identity", position = "dodge") +
theme_minimal() +
ylim(0, 1) +
labs(title = "Model Performance Comparison", y = "Score")# 10. Feature Importance (Random Forest)
imp <- as.data.frame(rf_model$importance)
imp$Variable <- rownames(imp)
colnames(imp)[1] <- "GiniImportance"
ggplot(imp, aes(x = reorder(Variable, GiniImportance), y = GiniImportance)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
theme_minimal() +
labs(title = "Feature Importance Ranking", x = "Features", y = "Importance")# 11. ROC Curves
# FIX: Using 'log_prob' (singular) to match the definition above
roc_log <- roc(test_set$cardio, log_prob)
roc_dt <- roc(test_set$cardio, predict(dt_model, test_set, type = "prob")[,2])
roc_rf <- roc(test_set$cardio, predict(rf_model, test_set, type = "prob")[,2])
plot(roc_rf, col="blue", lwd=3, main="ROC Curve Comparison")
plot(roc_log, col="green", lwd=3, add=TRUE)
plot(roc_dt, col="red", lwd=3, add=TRUE)
legend("bottomright", legend=c("RF", "LogReg", "DT"), col=c("blue", "green", "red"), lwd=3)# =========================================================
# =========================================================
# REGRESSION MODEL
# =========================================================
# =========================================================
# STEP 1: SETUP (Do this once)
# =========================================================
data_clean$cholesterol <- as.factor(data_clean$cholesterol)
data_clean$smoke <- as.factor(data_clean$smoke)
data_clean$alco <- as.factor(data_clean$alco)
data_clean$gluc <- as.factor(data_clean$gluc)
# =========================================================
# STEP 2: THE TOURNAMENT (Comparing 3 Models)
# =========================================================
cat("\n--- STARTING MODEL TOURNAMENT ---\n")##
## --- STARTING MODEL TOURNAMENT ---
# 1. Split Data into Train (80%) and Test (20%)
# This ensures we test on "fresh" data for a fair fight
set.seed(123)
sample_index <- sample(1:nrow(data_clean), 0.8 * nrow(data_clean))
train_data <- data_clean[sample_index, ]
test_data <- data_clean[-sample_index, ]
# 2. Train the 3 Contenders
# A. Linear Regression
model_lm <- lm(ap_hi ~ age_years + bmi + cholesterol + gluc + smoke + alco, data = train_data)
# B. Decision Tree
model_tree <- rpart(ap_hi ~ age_years + bmi + cholesterol + gluc + smoke + alco,
data = train_data, method = "anova")
# C. Random Forest (limit trees to 50 for speed)
model_rf <- randomForest(ap_hi ~ age_years + bmi + cholesterol + gluc + smoke + alco,
data = train_data, ntree = 50, importance = TRUE)
# 3. Calculate Errors (RMSE) on Test Data
pred_lm <- predict(model_lm, test_data)
rmse_lm <- rmse(test_data$ap_hi, pred_lm)
pred_tree <- predict(model_tree, test_data)
rmse_tree <- rmse(test_data$ap_hi, pred_tree)
pred_rf <- predict(model_rf, test_data)
rmse_rf <- rmse(test_data$ap_hi, pred_rf)
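For reference, Metrics::rmse(actual, predicted) computes sqrt(mean((actual - predicted)^2)); an equivalent manual check on the linear model (illustrative):

# RMSE by hand -- should match rmse_lm exactly
manual_rmse_lm <- sqrt(mean((test_data$ap_hi - pred_lm)^2))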
# 4. Declare the Winner
results <- data.frame(
Model = c("Linear Regression", "Decision Tree", "Random Forest"),
RMSE = c(rmse_lm, rmse_tree, rmse_rf)
)
print(results)

##               Model     RMSE
## 1 Linear Regression 15.64675
## 2     Decision Tree 15.95346
## 3     Random Forest 15.64399

# Pick the model with the lowest test RMSE
winner_name <- results$Model[which.min(results$RMSE)]
cat("\n🏆 THE WINNER IS:", winner_name, "\n")

##
## 🏆 THE WINNER IS: Random Forest
# =========================================================
# STEP 3: DEEP DIVE ANALYSIS (On the Winner Only)
# =========================================================
cat("\n--- PERFORMING DETAILED ANALYSIS ON THE WINNER ---\n")##
## --- PERFORMING DETAILED ANALYSIS ON THE WINNER ---
if (winner_name == "Linear Regression") {
# --- LINEAR REGRESSION ANALYSIS ---
print(summary(model_lm))
# Plot Coefficients
coef_df <- as.data.frame(coef(model_lm))
colnames(coef_df) <- "Estimate"
coef_df$Variable <- rownames(coef_df)
print(
ggplot(coef_df[-1,], aes(x = reorder(Variable, Estimate), y = Estimate)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(title = "Linear Model: Coefficients", subtitle = "How much each variable adds to BP")
)
} else if (winner_name == "Decision Tree") {
# --- DECISION TREE ANALYSIS ---
printcp(model_tree) # Print complexity
# Plot the Tree Logic
rpart.plot(model_tree, main = "Decision Tree Logic",
type = 3, clip.right.labs = FALSE, branch = .3, under = TRUE)
} else {
# --- RANDOM FOREST ANALYSIS ---
print(model_rf)
# 1. Variable Importance Plot (Which variable matters most?)
# This replaces "Coefficients" for Random Forest
varImpPlot(model_rf, main = "Random Forest: What drives Blood Pressure?")
# 2. Extract Importance for a nicer Bar Chart
imp_df <- as.data.frame(importance(model_rf))
imp_df$Variable <- rownames(imp_df)
print(
ggplot(imp_df, aes(x = reorder(Variable, `%IncMSE`), y = `%IncMSE`)) +
geom_bar(stat = "identity", fill = "darkgreen") +
coord_flip() +
labs(title = "Random Forest: Variable Importance",
subtitle = "Higher % = More important for predicting BP",
y = "% Increase in Error (if variable is removed)",
x = "Variable") +
theme_minimal()
)
}

##
## Call:
## randomForest(formula = ap_hi ~ age_years + bmi + cholesterol + gluc + smoke + alco, data = train_data, ntree = 50, importance = TRUE)
## Type of random forest: regression
## Number of trees: 50
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 245.3706
## % Var explained: 11.94
# =========================================================
# STEP 4: ACTUAL VS PREDICTED (For the Winner)
# =========================================================
# Regardless of who won, we want to see the prediction accuracy plot
if (winner_name == "Linear Regression") {
final_pred <- pred_lm
} else if (winner_name == "Decision Tree") {
final_pred <- pred_tree
} else {
final_pred <- pred_rf
}
# Create the plot data
plot_data <- data.frame(Actual = test_data$ap_hi, Predicted = final_pred)
ggplot(plot_data, aes(x = Predicted, y = Actual)) +
geom_point(alpha = 0.1, color = "darkblue") +
geom_abline(intercept = 0, slope = 1, color = "red", size = 1) + # The "Perfect" line
labs(title = paste("Performance of Best Model:", winner_name),
subtitle = "Red Line = Perfect Prediction. Dots = Patients.",
x = "Predicted BP", y = "Actual BP") +
  theme_minimal()

# The Age Effect Plot
ggplot(data_clean, aes(x = age_years, y = ap_hi)) +
# We use 'stat_summary' to show the average BP for each age (cleaner than raw dots)
stat_summary(fun = mean, geom = "point", color = "darkblue", size = 2) +
geom_smooth(method = "lm", color = "red", size = 1.5) +
labs(title = "The Effect of Age",
subtitle = "The red line shows the 'Slope' for Age specifically",
x = "Age (Years)",
y = "Average Systolic BP") +
  theme_minimal()

# The BMI Effect Plot
ggplot(data_clean, aes(x = bmi, y = ap_hi)) +
# Limiting BMI to 50 so outliers don't squish the graph
geom_point(alpha = 0.05, color = "darkgreen") +
geom_smooth(method = "lm", color = "red", size = 1.5) +
labs(title = "The Effect of BMI",
subtitle = "The red line shows the 'Slope' for BMI specifically",
x = "BMI",
y = "Systolic BP") +
xlim(15, 50) + # Focus on the main group of people
theme_minimal()