| Name | Matric No. |
|---|---|
| LAM JUN YAN | 24200414 |
| OWEN CHAN WEI WEN | 25072668 |
| 00000000 | |
| 00000000 | |
| 00000000 |
Data Science is the systematic study of data that involves developing methods for recording, storing, and analysing data to effectively extract useful information. As specified in the project guidelines, this report turns raw data into understanding, insight, and knowledge.
We utilise the RStudio, which provides the best tools for data scientists to ingest and organise data. We also load the dplyr package for manipulation and ggplot2 for visualization.
We load the dataset using the read.csv() function, which creates a data frame—a tabular object where each column can contain different modes of dataW
df <- read.csv("cardio_data_processed.csv", stringsAsFactors = FALSE)
Before analysis, we must determine the nature of the data. EDA helps us discover patterns, spot anomalies, and check assumptions
# Determine dimension (Rows and Columns)
dim(df) # [15, 16]
## [1] 68205 17
# Information dense summary of structure
glimpse(df) # [15, 17]
## Rows: 68,205
## Columns: 17
## $ id <int> 0, 1, 2, 3, 4, 8, 9, 12, 13, 14, 15, 16, 18, 21, 2…
## $ age <int> 18393, 20228, 18857, 17623, 17474, 21914, 22113, 2…
## $ gender <int> 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 1,…
## $ height <int> 168, 156, 165, 169, 156, 151, 157, 178, 158, 164, …
## $ weight <dbl> 62, 85, 64, 82, 56, 67, 93, 95, 71, 68, 80, 60, 60…
## $ ap_hi <int> 110, 140, 130, 150, 100, 120, 130, 130, 110, 110, …
## $ ap_lo <int> 80, 90, 70, 100, 60, 80, 80, 90, 70, 60, 80, 80, 8…
## $ cholesterol <int> 1, 3, 3, 1, 1, 2, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ gluc <int> 1, 1, 1, 1, 1, 2, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ smoke <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ alco <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ active <int> 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0,…
## $ cardio <int> 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ age_years <int> 50, 55, 51, 48, 47, 60, 60, 61, 48, 54, 61, 51, 40…
## $ bmi <dbl> 21.96712, 34.92768, 23.50781, 28.71048, 23.01118, …
## $ bp_category <chr> "Hypertension Stage 1", "Hypertension Stage 2", "H…
## $ bp_category_encoded <chr> "Hypertension Stage 1", "Hypertension Stage 2", "H…
# Statistical summary of each field
summary(df) # [13, 18]
## id age gender height
## Min. : 0 Min. :10798 Min. :1.000 Min. : 55.0
## 1st Qu.:24991 1st Qu.:17656 1st Qu.:1.000 1st Qu.:159.0
## Median :50008 Median :19700 Median :1.000 Median :165.0
## Mean :49972 Mean :19463 Mean :1.349 Mean :164.4
## 3rd Qu.:74878 3rd Qu.:21323 3rd Qu.:2.000 3rd Qu.:170.0
## Max. :99999 Max. :23713 Max. :2.000 Max. :250.0
## weight ap_hi ap_lo cholesterol
## Min. : 11.0 Min. : 90.0 Min. : 60.00 Min. :1.000
## 1st Qu.: 65.0 1st Qu.:120.0 1st Qu.: 80.00 1st Qu.:1.000
## Median : 72.0 Median :120.0 Median : 80.00 Median :1.000
## Mean : 74.1 Mean :126.4 Mean : 81.26 Mean :1.363
## 3rd Qu.: 82.0 3rd Qu.:140.0 3rd Qu.: 90.00 3rd Qu.:1.000
## Max. :200.0 Max. :180.0 Max. :120.00 Max. :3.000
## gluc smoke alco active
## Min. :1.000 Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:1.0000
## Median :1.000 Median :0.00000 Median :0.00000 Median :1.0000
## Mean :1.225 Mean :0.08766 Mean :0.05313 Mean :0.8035
## 3rd Qu.:1.000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :3.000 Max. :1.00000 Max. :1.00000 Max. :1.0000
## cardio age_years bmi bp_category
## Min. :0.0000 Min. :29.00 Min. : 3.472 Length:68205
## 1st Qu.:0.0000 1st Qu.:48.00 1st Qu.: 23.875 Class :character
## Median :0.0000 Median :53.00 Median : 26.346 Mode :character
## Mean :0.4937 Mean :52.82 Mean : 27.511
## 3rd Qu.:1.0000 3rd Qu.:58.00 3rd Qu.: 30.116
## Max. :1.0000 Max. :64.00 Max. :298.667
## bp_category_encoded
## Length:68205
## Class :character
## Mode :character
##
##
##
# Check if there is any NA value
colSums(is.na(df))
## id age gender height
## 0 0 0 0
## weight ap_hi ap_lo cholesterol
## 0 0 0 0
## gluc smoke alco active
## 0 0 0 0
## cardio age_years bmi bp_category
## 0 0 0 0
## bp_category_encoded
## 0
# Check if there is any empty string value
colSums(df=="")
## id age gender height
## 0 0 0 0
## weight ap_hi ap_lo cholesterol
## 0 0 0 0
## gluc smoke alco active
## 0 0 0 0
## cardio age_years bmi bp_category
## 0 0 0 0
## bp_category_encoded
## 0
# Check if there is any zero value
colSums(df==0)
## id age gender height
## 1 0 0 0
## weight ap_hi ap_lo cholesterol
## 0 0 0 0
## gluc smoke alco active
## 0 62226 64581 13399
## cardio age_years bmi bp_category
## 34533 0 0 0
## bp_category_encoded
## 0
Found that the duplicates is meaningless here as:
1. There’s no real patient name provided.. It’s possible will have the
exact same data in real-world.
We check for data that is technically “present” but logically impossible.
Rule: Systolic (ap_hi) must always be higher than Diastolic (ap_lo).
# Find impossible rows
impossible_bp <- df %>% filter(ap_lo > ap_hi)
cat("Rows with Impossible BP (Diastolic > Systolic):", nrow(impossible_bp), "\n")
## Rows with Impossible BP (Diastolic > Systolic): 3
# Visualize the error
ggplot(df, aes(x = ap_hi, y = ap_lo)) +
geom_point(alpha = 0.3, color = "red") +
geom_abline(slope = 1, intercept = 0, color = "black", linetype = "dashed") +
labs(title = "BP Logic Check",
subtitle = "Points above the dashed line are impossible (Diastolic > Systolic)",
x = "Systolic", y = "Diastolic") +
theme_minimal()
Rule: Check if the bmi column matches the formula: \(Weight / (Height(m)^2)\).
# Verify BMI calculation
df_check <- df %>%
mutate(
height_m = height / 100,
theoretical_bmi = weight / (height_m^2),
diff = abs(bmi - theoretical_bmi)
)
# Count how many are significantly different (>1 unit off)
bad_bmi_rows <- df_check %>% filter(diff > 1)
cat("Rows with mismatched BMI calculations:", nrow(bad_bmi_rows), "\n")
## Rows with mismatched BMI calculations: 0
# Visualize
ggplot(df_check, aes(x = theoretical_bmi, y = bmi)) +
geom_point(alpha = 0.1, color = "blue") +
geom_abline(color = "red") +
labs(title = "BMI Data Integrity",
subtitle = "If points drift from red line, the provided 'bmi' column is wrong",
x = "Calculated BMI (Weight/Height^2)", y = "Provided BMI Column") +
theme_minimal()
Rule: Does bp_category map consistently to bp_category_encoded?
# Cross-tabulation
# A perfect result should only have numbers on the diagonal
table(df$bp_category, df$bp_category_encoded)
##
## Elevated Hypertension Stage 1 Hypertension Stage 2
## Elevated 3101 0 0
## Hypertension Stage 1 0 39750 0
## Hypertension Stage 2 0 0 15937
## Normal 0 0 0
##
## Normal
## Elevated 0
## Hypertension Stage 1 0
## Hypertension Stage 2 0
## Normal 9417
We visualize extreme values to understand what needs cleaning.
ggplot(df, aes(y = height)) +
geom_boxplot(fill = "#F1C40F") +
labs(title = "Raw Data: Height Distribution",
subtitle = "Visualizing extreme outliers before cleaning",
y = "Height (cm)") +
theme_minimal()
# Filter rows where Height is < 50 OR Height > 200
height_outliers <- df %>%
filter(height < 100 | height > 200)
# Sort them by height to see the extremes clearly
height_outliers <- height_outliers %>%
arrange(height)
# Print the count and the actual data
cat("Number of extreme height outliers:", nrow(height_outliers), "\n")
## Number of extreme height outliers: 28
print(height_outliers)
## id age gender height weight ap_hi ap_lo cholesterol gluc smoke alco
## 1 32456 23386 1 55 81.0 130 90 1 1 0 0
## 2 95141 18830 1 57 61.0 130 90 1 1 0 0
## 3 91523 18426 1 59 57.6 125 67 1 1 0 0
## 4 41661 19088 1 60 69.0 110 70 1 1 0 0
## 5 39462 20978 1 64 61.0 130 70 1 1 0 0
## 6 48009 19709 2 65 72.0 130 80 1 1 0 0
## 7 63545 19120 1 65 60.0 120 80 1 1 0 0
## 8 20459 22005 1 67 57.0 120 90 1 1 0 0
## 9 72476 14499 2 67 60.0 110 80 1 1 1 1
## 10 76116 20541 2 67 80.0 120 80 1 1 0 0
## 11 32207 14538 1 68 65.0 100 60 1 1 0 0
## 12 66161 21006 2 68 71.0 120 80 1 1 0 0
## 13 21686 15812 1 70 68.0 120 80 1 1 0 0
## 14 73386 15432 2 70 69.0 120 80 1 1 0 0
## 15 18928 22456 2 71 68.0 120 80 3 1 0 0
## 16 45832 15374 1 72 74.0 150 90 1 1 0 0
## 17 23859 19680 2 74 98.0 140 90 1 1 0 0
## 18 18218 19594 1 75 168.0 120 80 1 1 1 0
## 19 67631 23297 1 75 75.0 120 80 1 1 0 0
## 20 309 21800 2 76 55.0 120 80 1 1 0 0
## 21 39156 15292 1 80 178.0 140 90 3 3 0 0
## 22 34186 19074 1 81 156.0 140 90 1 1 0 0
## 23 41075 19747 1 91 55.0 140 90 1 1 0 0
## 24 79917 21171 1 96 59.0 90 60 1 1 0 0
## 25 11662 17646 2 97 170.0 160 100 1 1 1 0
## 26 93223 18467 1 99 60.0 90 60 1 1 0 0
## 27 30894 19054 2 207 78.0 100 70 1 1 0 1
## 28 9223 21220 1 250 86.0 140 100 3 1 0 0
## active cardio age_years bmi bp_category bp_category_encoded
## 1 1 1 64 267.76860 Hypertension Stage 1 Hypertension Stage 1
## 2 1 1 51 187.75008 Hypertension Stage 1 Hypertension Stage 1
## 3 0 0 50 165.46969 Elevated Elevated
## 4 0 0 52 191.66667 Normal Normal
## 5 1 0 57 148.92578 Hypertension Stage 1 Hypertension Stage 1
## 6 0 0 53 170.41420 Hypertension Stage 1 Hypertension Stage 1
## 7 1 0 52 142.01183 Hypertension Stage 1 Hypertension Stage 1
## 8 1 1 60 126.97705 Hypertension Stage 2 Hypertension Stage 2
## 9 1 0 39 133.66006 Hypertension Stage 1 Hypertension Stage 1
## 10 0 1 56 178.21341 Hypertension Stage 1 Hypertension Stage 1
## 11 0 0 39 140.57093 Normal Normal
## 12 1 0 57 153.54671 Hypertension Stage 1 Hypertension Stage 1
## 13 0 0 43 138.77551 Hypertension Stage 1 Hypertension Stage 1
## 14 0 0 42 140.81633 Hypertension Stage 1 Hypertension Stage 1
## 15 1 0 61 134.89387 Hypertension Stage 1 Hypertension Stage 1
## 16 1 1 42 142.74691 Hypertension Stage 2 Hypertension Stage 2
## 17 1 1 53 178.96275 Hypertension Stage 2 Hypertension Stage 2
## 18 1 1 53 298.66667 Hypertension Stage 1 Hypertension Stage 1
## 19 1 0 63 133.33333 Hypertension Stage 1 Hypertension Stage 1
## 20 1 0 59 95.22161 Hypertension Stage 1 Hypertension Stage 1
## 21 1 1 41 278.12500 Hypertension Stage 2 Hypertension Stage 2
## 22 1 0 52 237.76863 Hypertension Stage 2 Hypertension Stage 2
## 23 1 1 54 66.41710 Hypertension Stage 2 Hypertension Stage 2
## 24 1 1 58 64.01910 Normal Normal
## 25 1 1 48 180.67807 Hypertension Stage 2 Hypertension Stage 2
## 26 1 0 50 61.21824 Normal Normal
## 27 1 0 52 18.20346 Normal Normal
## 28 1 1 58 13.76000 Hypertension Stage 2 Hypertension Stage 2
ggplot(df, aes(y = weight)) +
geom_boxplot(fill = "#E67E22") +
labs(title = "Raw Data: Weight Distribution", y = "Weight (kg)") +
theme_minimal()
ggplot(df, aes(y = ap_hi)) +
geom_boxplot(fill = "#E74C3C") +
labs(title = "Raw Data: Systolic BP Distribution", y = "Systolic BP (mmHg)") +
theme_minimal()
Checking if we have an equal number of sick vs healthy patients.
# Calculate counts and percentages
balance_df <- df %>%
group_by(cardio) %>%
summarise(Count = n()) %>%
mutate(Percentage = round(Count / sum(Count) * 100, 1))
print(balance_df)
## # A tibble: 2 × 3
## cardio Count Percentage
## <int> <int> <dbl>
## 1 0 34533 50.6
## 2 1 33672 49.4
# Visual check
ggplot(balance_df, aes(x = as.factor(cardio), y = Count, fill = as.factor(cardio))) +
geom_col() +
geom_text(aes(label = paste0(Percentage, "%")), vjust = -0.5) +
scale_fill_manual(values = c("#2ECC71", "#E74C3C"), labels = c("Healthy", "Disease")) +
labs(title = "Class Balance (Target Variable)", x = "Cardio", y = "Count") +
theme_minimal()
# 1. Define Valid Ranges (Medical Logic)
# Height: 100cm (3'3") to 250cm (8'2")
# Weight: 30kg (66lbs) to 200kg (440lbs)
# BMI: 10 to 60 (Standard medical range for survival)
# BP: Systolic must be > Diastolic
# 2. Apply Filters
df_clean <- df %>%
# Filter Height (Removes the 55cm errors and the 250cm outliers)
filter(height >= 100 & height <= 250) %>%
# Filter Weight (Removes extremely low/high weights)
filter(weight >= 30 & weight <= 200) %>%
# Filter BMI (Removes the resulting 200+ BMI errors)
filter(bmi > 10 & bmi < 60) %>%
# Filter Blood Pressure (Removes negative or impossible readings)
filter(ap_hi > 60 & ap_hi < 240) %>%
filter(ap_lo > 30 & ap_lo < 150) %>%
filter(ap_hi > ap_lo) # Systolic must be higher than Diastolic
# 3. Check how many rows were removed
rows_removed <- nrow(df) - nrow(df_clean)
cat("Rows Removed due to errors:", rows_removed, "\n")
## Rows Removed due to errors: 72
cat("Remaining Clean Data:", nrow(df_clean), "\n")
## Remaining Clean Data: 68133
# Calculate counts and percentages
balance_df2 <- df_clean %>%
group_by(cardio) %>%
summarise(Count = n()) %>%
mutate(Percentage = round(Count / sum(Count) * 100, 1))
print(balance_df)
## # A tibble: 2 × 3
## cardio Count Percentage
## <int> <int> <dbl>
## 1 0 34533 50.6
## 2 1 33672 49.4
# Visual check
ggplot(balance_df2, aes(x = as.factor(cardio), y = Count, fill = as.factor(cardio))) +
geom_col() +
geom_text(aes(label = paste0(Percentage, "%")), vjust = -0.5) +
scale_fill_manual(values = c("#2ECC71", "#E74C3C"), labels = c("Healthy", "Disease")) +
labs(title = "Class Balance (Target Variable)", x = "Cardio", y = "Count") +
theme_minimal()
This section investigates the variables and the relationships between them
We examine each field individually to understand its distribution and location
# Select numeric columns
num_vars <- df_clean %>%
select(age_years, height, weight, bmi, ap_hi, ap_lo)
# Calculate summary statistics
num_summary <- num_vars %>%
pivot_longer(everything(), names_to = "Variable", values_to = "Value") %>%
group_by(Variable) %>%
summarise(
Mean = round(mean(Value), 2),
Median = median(Value),
Std_Dev = round(sd(Value), 2),
Min = min(Value),
Max = max(Value)
)
# Display nice table using standard knitr
knitr::kable(num_summary, caption = "Summary Statistics for Numerical Features")
| Variable | Mean | Median | Std_Dev | Min | Max |
|---|---|---|---|---|---|
| age_years | 52.82 | 53.00000 | 6.77 | 29.00000 | 64.00000 |
| ap_hi | 126.44 | 120.00000 | 15.95 | 90.00000 | 180.00000 |
| ap_lo | 81.26 | 80.00000 | 9.14 | 60.00000 | 120.00000 |
| bmi | 27.44 | 26.31464 | 5.19 | 10.72664 | 59.52381 |
| height | 164.42 | 165.00000 | 7.93 | 109.00000 | 250.00000 |
| weight | 74.07 | 72.00000 | 14.17 | 30.00000 | 200.00000 |
Visualizing the shape of the data. We look for Bell Curves (Normal Distribution) or skewness.
# Reshape data for faceting
num_long <- df_clean %>%
select(age_years, height, weight, bmi, ap_hi, ap_lo) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")
# Plot all histograms at once
ggplot(num_long, aes(x = Value)) +
geom_histogram(fill = "#3498DB", color = "white", bins = 30) +
facet_wrap(~Variable, scales = "free", ncol = 2) +
labs(title = "Distributions of Numerical Variables", y = "Count", x = "Value") +
theme_minimal() +
theme(strip.text = element_text(size = 12, face = "bold"))
We examine the frequency of categorical features (Risk Factors & Demographics).
# Select categorical columns
cat_vars <- df_clean %>%
select(gender, cholesterol, gluc, smoke, alco, active, cardio) %>%
mutate(across(everything(), as.factor)) # Ensure they are factors
# Create a long format summary
cat_summary <- cat_vars %>%
pivot_longer(everything(), names_to = "Variable", values_to = "Category") %>%
group_by(Variable, Category) %>%
summarise(Count = n(), .groups = "drop") %>%
mutate(Percentage = 0) # Placeholder
# Calculate percentages within each variable
for(var in unique(cat_summary$Variable)) {
total <- sum(cat_summary$Count[cat_summary$Variable == var])
cat_summary$Percentage[cat_summary$Variable == var] <-
round(cat_summary$Count[cat_summary$Variable == var] / total * 100, 1)
}
# Display table
knitr::kable(head(cat_summary, 20), caption = "Categorical Frequencies (Breakdown)")
| Variable | Category | Count | Percentage |
|---|---|---|---|
| active | 1 | 54746 | 80.4 |
| active | 0 | 13387 | 19.6 |
| alco | 1 | 3617 | 5.3 |
| alco | 0 | 64516 | 94.7 |
| cardio | 1 | 33638 | 49.4 |
| cardio | 0 | 34495 | 50.6 |
| cholesterol | 1 | 51160 | 75.1 |
| cholesterol | 2 | 9184 | 13.5 |
| cholesterol | 3 | 7789 | 11.4 |
| gender | 1 | 44379 | 65.1 |
| gender | 2 | 23754 | 34.9 |
| gluc | 1 | 57960 | 85.1 |
| gluc | 2 | 4997 | 7.3 |
| gluc | 3 | 5176 | 7.6 |
| smoke | 1 | 5972 | 8.8 |
| smoke | 0 | 62161 | 91.2 |
# Plot all categorical bars at once
ggplot(cat_summary, aes(x = Category, y = Percentage, fill = Variable)) +
geom_col(show.legend = FALSE) +
geom_text(aes(label = paste0(Percentage, "%")), vjust = -0.5, size = 4) +
facet_wrap(~Variable, scales = "free_x", ncol = 3) +
scale_y_continuous(limits = c(0, 100)) + # Add space for labels
labs(title = "Breakdown of Categorical Features", y = "Percentage (%)") +
theme_minimal() +
theme(strip.text = element_text(size = 12, face = "bold"))
Hypothesis: Do men engage in risky behaviors (Smoking/Alcohol) significantly more than women?
library(tidyr)
# Summary Table logic using df_clean
habit_summary <- df_clean %>%
group_by(gender) %>%
summarise(
# Convert factor to numeric (1=0%, 2=100%) for binary variables
Smoker_Rate = mean(as.numeric(as.character(smoke))) * 100,
Alcohol_Rate = mean(as.numeric(as.character(alco))) * 100,
Active_Rate = mean(as.numeric(as.character(active))) * 100,
Cardio_Risk = mean(as.numeric(as.character(cardio))) * 100,
# FOR CHOLESTEROL & GLUCOSE: Calculate % of people with Level 3 (High)
Choles_High_Pct = mean(cholesterol == "3") * 100,
Glucose_High_Pct = mean(gluc == "3") * 100
)
# 1. Visualizing Smoking by Gender
# FIX: Use df_clean and wrap smoke in as.factor()
ggplot(df_clean, aes(x = gender, fill = as.factor(smoke))) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
labs(title = "Smoking Habits by Gender",
subtitle = "Men (Gender 2) show significantly higher smoking rates",
y = "Percentage", x = "Gender (1=Women, 2=Men)", fill = "Smoking Status") +
scale_fill_manual(values = c("gray", "#E74C3C"), labels = c("Non-Smoker", "Smoker")) +
theme_minimal()
# 2. Visualizing Alcohol by Gender
# FIX: Use df_clean and wrap alco in as.factor()
ggplot(df_clean, aes(x = gender, fill = as.factor(alco))) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
labs(title = "Alcohol Habits by Gender",
subtitle = "Men (Gender 2) show significantly higher alcohol drinking rates",
y = "Percentage", x = "Gender (1=Women, 2=Men)", fill = "Alcohol Status") +
scale_fill_manual(values = c("gray", "#E74C3C"), labels = c("Non-Drinker", "Drinker")) +
theme_minimal()
# 3. Reshape and Plot Summary
habit_long <- habit_summary %>%
pivot_longer(cols = -gender,
names_to = "Metric",
values_to = "Percentage")
# FIX: Wrap gender in as.factor()
ggplot(habit_long, aes(x = Metric, y = Percentage, fill = as.factor(gender))) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_manual(values = c("#FF9999", "#66B2FF"), labels = c("Women", "Men")) +
labs(title = "Health Risks & Habits by Gender",
subtitle = "Comparison of High Risk Factors",
y = "Percentage (%)", x = "", fill = "Gender") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Hypothesis: Can being active help lower Blood Pressure, even for Obese patients (BMI > 30)?
# Filter for Obese patients only
df_obese <- df_clean %>% filter(bmi > 30)
ggplot(df_obese, aes(x = active, y = ap_hi, fill = active)) +
geom_boxplot(alpha = 0.7) +
labs(title = "Impact of Activity on Blood Pressure (Obese Patients)",
subtitle = "Active patients show slightly lower median BP and fewer extreme outliers",
x = "Active Status (0=No, 1=Yes)", y = "Systolic BP") +
theme_minimal()
A deep dive into the class balance of our target variable to ensure the dataset is not biased.
# Prepare data for Pie Chart
cardio_data <- df_clean %>%
group_by(cardio) %>%
summarise(Count = n()) %>%
mutate(
Percentage = Count / sum(Count),
Label = paste0(round(Percentage * 100, 1), "%")
)
# Plot Pie Chart
ggplot(cardio_data, aes(x = "", y = Percentage, fill = as.factor(cardio))) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y", start = 0) +
geom_text(aes(label = Label), position = position_stack(vjust = 0.5), color = "white", size = 6) +
scale_fill_manual(values = c("#2ECC71", "#E74C3C"), labels = c("Healthy (0)", "Disease (1)")) +
labs(title = "Target Variable Balance (Heart Disease)",
fill = "Diagnosis") +
theme_void()
# Histogram to show the distribution of Age
# ggplot(df, aes(x = age/365.25)) +
# geom_histogram(binwidth = 1, fill = "steelblue", color = "white") +
# labs(title = "Univariate Analysis: Distribution of Age", x = "Age (Years)")
Here we analyze how a single risk factor correlates with Heart Disease (cardio).
We use Boxplots to see if patients with heart disease have higher median Age, BMI, or Blood Pressure.
# 1. Age vs Cardio
ggplot(df_clean, aes(x = cardio, y = age_years, fill = cardio)) +
geom_boxplot() +
scale_fill_manual(values = c("#2ECC71", "#E74C3C"), labels = c("Healthy", "Disease")) +
labs(title = "Age vs. Heart Disease",
subtitle = "Patients with heart disease are significantly older (median ~55 vs ~50)",
y = "Age (Years)", x = "Status") +
theme_minimal()
# 2. BMI vs Cardio
ggplot(df_clean, aes(x = cardio, y = bmi, fill = cardio)) +
geom_boxplot() +
scale_fill_manual(values = c("#2ECC71", "#E74C3C")) +
geom_hline(yintercept = 30, linetype = "dashed", color = "black") +
labs(title = "BMI vs. Heart Disease",
subtitle = "Disease group has higher median BMI. Dashed line = Obesity Threshold (30)",
y = "BMI", x = "Status") +
theme_minimal()
# 3. Systolic BP vs Cardio
ggplot(df_clean, aes(x = cardio, y = ap_hi, fill = cardio)) +
geom_boxplot() +
scale_fill_manual(values = c("#2ECC71", "#E74C3C")) +
labs(title = "Systolic BP vs. Heart Disease",
subtitle = "High Blood Pressure is the most distinct separator between Healthy and Sick",
y = "Systolic BP (mmHg)", x = "Status") +
theme_minimal()
We look at infection rates across different groups (Gender, Cholesterol, Glucose).
# Function to create percentage plots
plot_categorical_risk <- function(data, column, title_text) {
data %>%
count(!!sym(column), cardio) %>%
group_by(!!sym(column)) %>%
mutate(Percent = n / sum(n)) %>%
# FIX IS HERE: Use as.factor(cardio) inside aes()
ggplot(aes(x = !!sym(column), y = Percent, fill = as.factor(cardio))) +
geom_col(position = "fill") +
scale_y_continuous(labels = scales::percent) +
scale_fill_manual(values = c("#2ECC71", "#E74C3C"), labels = c("Healthy", "Disease")) +
labs(title = title_text, y = "Percentage", x = column, fill = "Status") +
theme_minimal()
}
# 1. Cholesterol Risk
plot_categorical_risk(df_clean, "cholesterol", "Heart Disease Risk by Cholesterol Level (1=Normal, 3=High)")
# 2. Glucose Risk
plot_categorical_risk(df_clean, "gluc", "Heart Disease Risk by Glucose Level (1=Normal, 3=High)")
# 3. Smoking Risk
plot_categorical_risk(df_clean, "smoke", "Heart Disease Risk by Smoking Status (0=Non, 1=Smoker)")
# Box plots visually show numerical distribution and averages across categories
ggplot(df, aes(x = factor(cardio), y = bmi, fill = factor(cardio))) +
geom_boxplot() +
labs(title = "Bivariate Analysis: BMI by Cardio Status", x = "0 = No, 1 = Yes") #
We map interactions between three or more fields, often using colour-coding or faceting
# We divide Age and BP into "bins" (squares) and calculate the % of sick people in each square.
# This removes overplotting and shows the "Risk Gradient" clearly.
ggplot(df_clean, aes(x = age_years, y = ap_hi, z = as.numeric(as.character(cardio)))) +
stat_summary_2d(fun = mean, bins = 30) +
scale_fill_gradient(low = "#2ECC71", high = "#C0392B", labels = scales::percent) +
facet_wrap(~cholesterol) +
labs(title = "Risk Heatmap: Age vs BP by Cholesterol",
subtitle = "Red areas indicate high probability of heart disease",
x = "Age (Years)", y = "Systolic BP", fill = "Risk %") +
theme_minimal()
# Correlation Matrix (Numeric only)
num_vars <- df %>% select(age_years, height, weight, bmi, ap_hi, ap_lo)
corr_matrix <- cor(num_vars)
corrplot(corr_matrix, method = "color", type = "upper",
addCoef.col = "black", tl.col = "black", diag = FALSE)
Found that the features that is more than (>0.5) indicates that the
features could be redundant to each other, one of them will be
dropped.
Observations:
Does having BOTH high cholesterol and high glucose compound the risk?
# Calculate Probability of Disease for each subgroup
metabolic_risk <- df_clean %>%
group_by(cholesterol, gluc) %>%
summarise(Probability = mean(as.numeric(as.character(cardio))), .groups = 'drop')
# Heatmap Visualization
ggplot(metabolic_risk, aes(x = cholesterol, y = gluc, fill = Probability)) +
geom_tile(color = "white", size = 1) +
geom_text(aes(label = round(Probability, 2)), size = 5, color = "black") +
scale_fill_gradient(low = "#2ECC71", high = "#C0392B") +
labs(title = "Interaction Effect: Cholesterol & Glucose",
subtitle = "Risk doubles from 0.43 (Normal/Normal) to 0.82 (High/High)",
x = "Cholesterol Level", y = "Glucose Level", fill = "Risk Prob") +
theme_minimal()
Visualizing the decision boundary. We plot Age vs BP and color the points by Disease status.
# We use a sample of 2000 points to avoid overplotting and make the chart readable
set.seed(123)
df_sample <- df_clean %>% sample_n(2000)
# FIX: Added as.factor(cardio) to the color aesthetic
ggplot(df_sample, aes(x = age_years, y = ap_hi, color = as.factor(cardio))) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE, size = 1.5) +
scale_color_manual(values = c("#2ECC71", "#E74C3C"), labels = c("Healthy", "Disease")) +
labs(title = "Age vs. BP Separation",
subtitle = "Disease (Red) concentrates at higher Age AND higher BP",
x = "Age (Years)", y = "Systolic BP", color = "Status") +
theme_minimal()
This is the statistical “Gold Standard” for multivariate analysis. It tells us which factor is the strongest predictor when all other factors are held constant.
# Fit Logistic Regression
log_model <- glm(cardio ~ age_years + bmi + cholesterol + smoke + active + ap_hi,
data = df_clean, family = "binomial")
# Extract Odds Ratios
odds_ratios <- exp(coef(log_model))
conf_intervals <- exp(confint(log_model))
# Create Dataframe for Plotting
or_plot_data <- data.frame(
Variable = names(odds_ratios)[-1], # Remove Intercept
OR = odds_ratios[-1],
Lower = conf_intervals[-1, 1],
Upper = conf_intervals[-1, 2]
)
# Forest Plot of Odds Ratios
ggplot(or_plot_data, aes(x = reorder(Variable, OR), y = OR)) +
geom_point(size = 4, color = "darkblue") +
geom_errorbar(aes(ymin = Lower, ymax = Upper), width = 0.2, color = "darkblue") +
geom_hline(yintercept = 1, linetype = "dashed", color = "red") +
coord_flip() +
labs(title = "Multivariate Risk Factors (Odds Ratios)",
subtitle = "Factors > 1 increase risk. Cholesterol & BP are the strongest drivers.",
y = "Odds Ratio (Risk Multiplier)", x = "Variable") +
theme_minimal()
We use the Decision Tree model to calculate a single “Importance Score” for each variable. This aggregates categorical levels (e.g., merging “High Cholesterol” and “Very High Cholesterol”) into one score to see which feature matters most overall.
# 1. Train a quick Decision Tree (if not already trained)
library(rpart)
tree_model <- rpart(cardio ~ ., data = df_clean, method = "class")
# 2. Extract Importance
# rpart calculates importance by summing the improvement in accuracy for every split using that variable
importances <- data.frame(
Feature = names(tree_model$variable.importance),
Importance = tree_model$variable.importance
)
# 3. Visualize
ggplot(importances, aes(x = reorder(Feature, Importance), y = Importance)) +
geom_col(fill = "#E67E22") +
coord_flip() +
labs(title = "Overall Feature Importance (Decision Tree)",
subtitle = "This ranks variables by their total contribution to the model's accuracy.",
x = "Feature", y = "Importance Score") +
theme_minimal()
80% of a data science project is spent sourcing, cleaning, and preparing data. We identify inaccurate or missing parts (NA or ““) and modify them)
# Check for missing values
any(is.na(df))
## [1] FALSE
df_clean <- df %>%
# Handle outliers in Blood Pressure
filter(ap_hi > 50 & ap_hi < 250) %>% # [15, 28]
# Transform age and calculate BMI
mutate(age_years = floor(age / 365.25),
bmi = weight / ((height/100)^2)) %>% # [15, 29]
# Drop rows with NAs
na.omit()
df <- df %>% distinct()
## Convert categories to factors
cols_factor <- c("gender", "cholesterol", "gluc", "smoke", "alco", "active", "cardio")
df[cols_factor] <- lapply(df[cols_factor], as.factor)
## Filter outliers
df_clean <- df %>%
filter(height > 100 & height < 250) %>%
filter(weight > 30 & weight < 200) %>%
filter(ap_hi > 60 & ap_hi < 240) %>%
filter(ap_lo > 30 & ap_lo < 150) %>%
filter(bmi > 10 & bmi < 60)
cat("Original Rows:", nrow(df), "\n")
## Original Rows: 68205
cat("Cleaned Rows:", nrow(df_clean), "\n")
## Cleaned Rows: 68133
Based on the correlation analysis, we identified high redundancy between:
# Dropping redundant columns to reduce multicollinearity
df_final <- df_clean %>% select(-age, -weight, -height)
# Verify remaining columns
colnames(df_final)
## [1] "id" "gender" "ap_hi"
## [4] "ap_lo" "cholesterol" "gluc"
## [7] "smoke" "alco" "active"
## [10] "cardio" "age_years" "bmi"
## [13] "bp_category" "bp_category_encoded"
To answer our objective of finding the best algorithm to predict
cardiovascular disease (cardio), we will train and compare
three distinct types of models: 1. Decision Tree: For
visual interpretability. 2. Logistic Regression: For
statistical relationship analysis. 3. Random Forest:
For high predictive accuracy.
# Load necessary libraries for ML
if(!require(randomForest)) install.packages("randomForest")
if(!require(pROC)) install.packages("pROC")
if(!require(caret)) install.packages("caret")
if(!require(caTools)) install.packages("caTools")
library(randomForest)
library(pROC)
library(caret)
library(caTools)
# Split Data (70% Train, 30% Test)
set.seed(123)
split <- sample.split(df_final$cardio, SplitRatio = 0.7)
train_set <- subset(df_final, split == TRUE)
test_set <- subset(df_final, split == FALSE)
cat("Training Samples:", nrow(train_set), "\n")
## Training Samples: 47693
cat("Testing Samples:", nrow(test_set), "\n")
## Testing Samples: 20440
# Train Decision Tree
tree_model <- rpart(cardio ~ ., data = train_set, method = "class", cp = 0.001)
# Visualize Rules
rpart.plot(tree_model, main = "Decision Tree Rules", extra = 106)
# Train Logistic Regression
log_model <- glm(cardio ~ ., data = train_set, family = "binomial")
# Summary of significant factors
summary(log_model)
##
## Call:
## glm(formula = cardio ~ ., family = "binomial", data = train_set)
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.084e+01 2.274e-01 -47.647 < 2e-16
## id 2.338e-08 3.630e-07 0.064 0.948645
## gender2 3.645e-02 2.365e-02 1.541 0.123338
## ap_hi 5.468e-02 1.278e-03 42.789 < 2e-16
## ap_lo 3.036e-03 2.482e-03 1.223 0.221151
## cholesterol2 3.549e-01 3.294e-02 10.774 < 2e-16
## cholesterol3 1.110e+00 4.300e-02 25.810 < 2e-16
## gluc2 6.171e-02 4.346e-02 1.420 0.155633
## gluc3 -3.513e-01 4.804e-02 -7.313 2.61e-13
## smoke1 -1.250e-01 4.221e-02 -2.962 0.003056
## alco1 -1.972e-01 5.099e-02 -3.867 0.000110
## active1 -2.368e-01 2.624e-02 -9.023 < 2e-16
## age_years 5.109e-02 1.626e-03 31.412 < 2e-16
## bmi 2.700e-02 2.189e-03 12.336 < 2e-16
## bp_categoryHypertension Stage 1 2.180e-01 5.617e-02 3.880 0.000104
## bp_categoryHypertension Stage 2 6.185e-01 7.687e-02 8.045 8.60e-16
## bp_categoryNormal 3.551e-01 5.868e-02 6.051 1.44e-09
## bp_category_encodedHypertension Stage 1 NA NA NA NA
## bp_category_encodedHypertension Stage 2 NA NA NA NA
## bp_category_encodedNormal NA NA NA NA
##
## (Intercept) ***
## id
## gender2
## ap_hi ***
## ap_lo
## cholesterol2 ***
## cholesterol3 ***
## gluc2
## gluc3 ***
## smoke1 **
## alco1 ***
## active1 ***
## age_years ***
## bmi ***
## bp_categoryHypertension Stage 1 ***
## bp_categoryHypertension Stage 2 ***
## bp_categoryNormal ***
## bp_category_encodedHypertension Stage 1
## bp_category_encodedHypertension Stage 2
## bp_category_encodedNormal
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 66109 on 47692 degrees of freedom
## Residual deviance: 53378 on 47676 degrees of freedom
## AIC: 53412
##
## Number of Fisher Scoring iterations: 4
# Train Random Forest (Restricted to 100 trees for speed)
rf_model <- randomForest(cardio ~ ., data = train_set, ntree = 100, importance = TRUE)
print(rf_model)
##
## Call:
## randomForest(formula = cardio ~ ., data = train_set, ntree = 100, importance = TRUE)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 27.25%
## Confusion matrix:
## 0 1 class.error
## 0 18695 5451 0.2257517
## 1 7546 16001 0.3204655
We evaluate the models based on Accuracy (overall correctness), Precision (avoiding false alarms), and AUC (reliability).
# --- PREDICTIONS ---
# 1. Decision Tree Predictions
dt_prob <- predict(tree_model, test_set, type = "prob")[,2]
dt_class <- ifelse(dt_prob > 0.5, 1, 0)
# 2. Logistic Regression Predictions
log_prob <- predict(log_model, test_set, type = "response")
log_class <- ifelse(log_prob > 0.5, 1, 0)
# 3. Random Forest Predictions
rf_prob <- predict(rf_model, test_set, type = "prob")[,2]
rf_class <- ifelse(rf_prob > 0.5, 1, 0)
# --- METRICS CALCULATION FUNCTION ---
calc_metrics <- function(actual, predicted, probs, model_name) {
cm <- table(predicted, actual)
accuracy <- sum(diag(cm)) / sum(cm)
precision <- cm[2,2] / (cm[2,2] + cm[2,1]) # TP / (TP + FP)
recall <- cm[2,2] / (cm[2,2] + cm[1,2]) # TP / (TP + FN)
roc_obj <- roc(actual, probs)
auc_score <- auc(roc_obj)
return(c(Model = model_name,
Accuracy = round(accuracy, 3),
Precision = round(precision, 3),
Recall = round(recall, 3),
AUC = round(auc_score, 3)))
}
# Generate Comparison Table
metrics_df <- rbind(
calc_metrics(test_set$cardio, dt_class, dt_prob, "Decision Tree"),
calc_metrics(test_set$cardio, log_class, log_prob, "Logistic Regression"),
calc_metrics(test_set$cardio, rf_class, rf_prob, "Random Forest")
)
# Convert to dataframe for display
metrics_df <- as.data.frame(metrics_df)
knitr::kable(metrics_df, caption = "Machine Learning Model Comparison")
| Model | Accuracy | Precision | Recall | AUC |
|---|---|---|---|---|
| Decision Tree | 0.728 | 0.734 | 0.703 | 0.759 |
| Logistic Regression | 0.73 | 0.765 | 0.652 | 0.796 |
| Random Forest | 0.733 | 0.753 | 0.683 | 0.788 |
# Calculate ROC objects
roc_dt <- roc(test_set$cardio, dt_prob)
roc_log <- roc(test_set$cardio, log_prob)
roc_rf <- roc(test_set$cardio, rf_prob)
# Plot
plot(roc_rf, col = "green", main = "ROC Curve Comparison")
plot(roc_log, col = "blue", add = TRUE)
plot(roc_dt, col = "red", add = TRUE)
legend("bottomright", legend = c("Random Forest", "Logistic Reg", "Decision Tree"),
col = c("green", "blue", "red"), lwd = 2)
To answer “Which specific variables are the most significant predictors?”, we extract importance scores from the Random Forest model.
# Extract Importance
importance_data <- data.frame(Feature = rownames(importance(rf_model)),
Importance = importance(rf_model)[, "MeanDecreaseGini"])
# Get Top 5
top_5_features <- importance_data %>%
arrange(desc(Importance)) %>%
head(5)
# Visualize
ggplot(top_5_features, aes(x = reorder(Feature, Importance), y = Importance)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(title = "Top 5 Significant Predictors of Heart Disease",
subtitle = "Based on Random Forest Gini Importance",
x = "Feature", y = "Importance Score") +
theme_minimal()
Based on the analysis above, here are the direct answers to the project objectives: