To access the dataset, you can visit the Heart Disease site.

1. Dataset Story/Description and Objectives

Age: age of the patient (years)

Sex: sex of the patient (M: Male, F: Female)

ChestPainType: chest pain type (TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic)

RestingBP: resting blood pressure (mm Hg)

Cholesterol: serum cholesterol (mg/dl)

FastingBS: fasting blood sugar (1: if FastingBS > 120 mg/dl, 0: otherwise)

RestingECG: resting electrocardiogram results (Normal: normal, ST: ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: probable or definite left ventricular hypertrophy by Estes' criteria)

MaxHR: maximum heart rate achieved (numeric value between 60 and 202)

ExerciseAngina: exercise-induced angina (Y: Yes, N: No)

Oldpeak: oldpeak = ST (numeric value measured in depression)

ST_Slope: slope of the peak exercise ST segment (Up: upsloping, Flat: flat, Down: downsloping)

HeartDisease: output class (1: heart disease, 0: normal)

OBJECTIVE: To apply multivariate analysis methods after examining the data.


2. Loading the Libraries

library(readr)
library(tidyverse)
library(tidyr)
library(dendextend)
library(knitr)
library(gridExtra)
library(GGally)
library(ggplot2)
library(VIM)
library(corrplot)
library(car)
library(ResourceSelection)
library(glmulti)
library(tree)
library(randomForest)
library(ISLR)
library(class)
library(pROC)
library(gtools)
library(superml)
library(caret)
library(Boruta)
library(stringr)
library(here)
library(skimr)
library(janitor)
library(lubridate)
library(scales)
library(ggcorrplot)
library(ggrepel)
library(forcats)
library(corrgram)
library(tidymodels)
library(baguette)
library(discrim)
library(bonsai)
library(kableExtra)
library(broom)
library(dplyr)
library(Hmisc)
library(psych)
library(factoextra)
library(DescTools)
library(haven)
library(effectsize)
library(rstatix)
library(ggpubr)
library(biotools)
library(PerformanceAnalytics)
library(heplots)
library(gplots)

3. Loading the Dataset

df <- read.csv('/home/ilke/Downloads/heart (2).csv')

4. Data Preprocessing and Exploratory Data Analysis

4.1 Setting the Variable Types

kable(head(df), format = "html") %>%
  kable_styling()
Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR ExerciseAngina Oldpeak ST_Slope HeartDisease
40 M ATA 140 289 0 Normal 172 N 0.0 Up 0
49 F NAP 160 180 0 Normal 156 N 1.0 Flat 1
37 M ATA 130 283 0 ST 98 N 0.0 Up 0
48 F ASY 138 214 0 Normal 108 Y 1.5 Flat 1
54 M NAP 150 195 0 Normal 122 N 0.0 Up 0
39 M NAP 120 339 0 Normal 170 N 0.0 Up 0
colnames(df)
##  [1] "Age"            "Sex"            "ChestPainType"  "RestingBP"     
##  [5] "Cholesterol"    "FastingBS"      "RestingECG"     "MaxHR"         
##  [9] "ExerciseAngina" "Oldpeak"        "ST_Slope"       "HeartDisease"
str(df)
## 'data.frame':    918 obs. of  12 variables:
##  $ Age           : int  40 49 37 48 54 39 45 54 37 48 ...
##  $ Sex           : chr  "M" "F" "M" "F" ...
##  $ ChestPainType : chr  "ATA" "NAP" "ATA" "ASY" ...
##  $ RestingBP     : int  140 160 130 138 150 120 130 110 140 120 ...
##  $ Cholesterol   : int  289 180 283 214 195 339 237 208 207 284 ...
##  $ FastingBS     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ RestingECG    : chr  "Normal" "Normal" "ST" "Normal" ...
##  $ MaxHR         : int  172 156 98 108 122 170 170 142 130 120 ...
##  $ ExerciseAngina: chr  "N" "N" "N" "Y" ...
##  $ Oldpeak       : num  0 1 0 1.5 0 0 0 0 1.5 0 ...
##  $ ST_Slope      : chr  "Up" "Flat" "Up" "Flat" ...
##  $ HeartDisease  : int  0 1 0 1 0 0 0 0 1 0 ...
df <- df %>% mutate(
  Sex = factor(Sex),
  ChestPainType = factor(ChestPainType),
  FastingBS = factor(FastingBS),
  RestingECG = factor(RestingECG),
  ExerciseAngina = factor(ExerciseAngina),
  ST_Slope  = factor(ST_Slope),
  HeartDisease = factor(HeartDisease)
)
str(df)
## 'data.frame':    918 obs. of  12 variables:
##  $ Age           : int  40 49 37 48 54 39 45 54 37 48 ...
##  $ Sex           : Factor w/ 2 levels "F","M": 2 1 2 1 2 2 1 2 2 1 ...
##  $ ChestPainType : Factor w/ 4 levels "ASY","ATA","NAP",..: 2 3 2 1 3 3 2 2 1 2 ...
##  $ RestingBP     : int  140 160 130 138 150 120 130 110 140 120 ...
##  $ Cholesterol   : int  289 180 283 214 195 339 237 208 207 284 ...
##  $ FastingBS     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ RestingECG    : Factor w/ 3 levels "LVH","Normal",..: 2 2 3 2 2 2 2 2 2 2 ...
##  $ MaxHR         : int  172 156 98 108 122 170 170 142 130 120 ...
##  $ ExerciseAngina: Factor w/ 2 levels "N","Y": 1 1 1 2 1 1 1 1 2 1 ...
##  $ Oldpeak       : num  0 1 0 1.5 0 0 0 0 1.5 0 ...
##  $ ST_Slope      : Factor w/ 3 levels "Down","Flat",..: 3 2 3 2 3 3 3 3 2 3 ...
##  $ HeartDisease  : Factor w/ 2 levels "0","1": 1 2 1 2 1 1 1 1 2 1 ...
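
The conversion above names every categorical column by hand. As a rough sketch of an alternative, character columns can be picked by type, while integer-coded categoricals still have to be listed explicitly. The toy data frame and the helper names `chr_cols`, `coded_cols` and `to_factor` are illustrative assumptions, not part of the original analysis:

```r
# Toy data frame mimicking the column types of the heart data (values are made up).
toy <- data.frame(
  Age = c(40L, 49L, 37L),
  Sex = c("M", "F", "M"),
  FastingBS = c(0L, 0L, 1L),
  HeartDisease = c(0L, 1L, 0L),
  stringsAsFactors = FALSE
)

# Character columns are categorical by construction, so select them by type.
chr_cols <- names(toy)[sapply(toy, is.character)]
# Integer-coded categoricals cannot be detected by type and are named explicitly.
coded_cols <- c("FastingBS", "HeartDisease")

to_factor <- c(chr_cols, coded_cols)
toy[to_factor] <- lapply(toy[to_factor], factor)

str(toy)
```

Truly numeric columns such as `Age` keep their type, since they are neither character nor listed in `coded_cols`.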

4.2 Handling Missing Data and Outliers

The data contain no missing values. To look for outliers, box plots of the numeric variables were examined.
sum(is.na(df))
## [1] 0
colSums(is.na(df))
##            Age            Sex  ChestPainType      RestingBP    Cholesterol 
##              0              0              0              0              0 
##      FastingBS     RestingECG          MaxHR ExerciseAngina        Oldpeak 
##              0              0              0              0              0 
##       ST_Slope   HeartDisease 
##              0              0
library(Hmisc)
Hmisc::describe(df) 
## df 
## 
##  12  Variables      918  Observations
## --------------------------------------------------------------------------------
## Age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      918        0       50    0.999    53.51    10.71       37       40 
##      .25      .50      .75      .90      .95 
##       47       54       60       65       68 
## 
## lowest : 28 29 30 31 32, highest: 73 74 75 76 77
## --------------------------------------------------------------------------------
## Sex 
##        n  missing distinct 
##      918        0        2 
##                     
## Value         F    M
## Frequency   193  725
## Proportion 0.21 0.79
## --------------------------------------------------------------------------------
## ChestPainType 
##        n  missing distinct 
##      918        0        4 
##                                   
## Value        ASY   ATA   NAP    TA
## Frequency    496   173   203    46
## Proportion 0.540 0.188 0.221 0.050
## --------------------------------------------------------------------------------
## RestingBP 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      918        0       67    0.993    132.4    20.09      106      110 
##      .25      .50      .75      .90      .95 
##      120      130      140      160      160 
## 
## lowest :   0  80  92  94  95, highest: 180 185 190 192 200
## --------------------------------------------------------------------------------
## Cholesterol 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      918        0      222    0.993    198.8      116      0.0      0.0 
##      .25      .50      .75      .90      .95 
##    173.2    223.0    267.0    305.0    331.3 
## 
## lowest :   0  85 100 110 113, highest: 491 518 529 564 603
## --------------------------------------------------------------------------------
## FastingBS 
##        n  missing distinct 
##      918        0        2 
##                       
## Value          0     1
## Frequency    704   214
## Proportion 0.767 0.233
## --------------------------------------------------------------------------------
## RestingECG 
##        n  missing distinct 
##      918        0        3 
##                                
## Value         LVH Normal     ST
## Frequency     188    552    178
## Proportion  0.205  0.601  0.194
## --------------------------------------------------------------------------------
## MaxHR 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      918        0      119        1    136.8    29.03       96      103 
##      .25      .50      .75      .90      .95 
##      120      138      156      170      178 
## 
## lowest :  60  63  67  69  70, highest: 190 192 194 195 202
## --------------------------------------------------------------------------------
## ExerciseAngina 
##        n  missing distinct 
##      918        0        2 
##                       
## Value          N     Y
## Frequency    547   371
## Proportion 0.596 0.404
## --------------------------------------------------------------------------------
## Oldpeak 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      918        0       53    0.934   0.8874    1.126      0.0      0.0 
##      .25      .50      .75      .90      .95 
##      0.0      0.6      1.5      2.3      3.0 
## 
## lowest : -2.6 -2   -1.5 -1.1 -1  , highest: 4.2  4.4  5    5.6  6.2 
## --------------------------------------------------------------------------------
## ST_Slope 
##        n  missing distinct 
##      918        0        3 
##                             
## Value       Down  Flat    Up
## Frequency     63   460   395
## Proportion 0.069 0.501 0.430
## --------------------------------------------------------------------------------
## HeartDisease 
##        n  missing distinct 
##      918        0        2 
##                       
## Value          0     1
## Frequency    410   508
## Proportion 0.447 0.553
## --------------------------------------------------------------------------------
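
`is.na()` only catches values stored as `NA`. The `describe()` output above lists 0 as the lowest value of both `RestingBP` and `Cholesterol`, which is not physiologically plausible for a living patient, so these zeros may be placeholders for unrecorded measurements. That reading is an assumption; nothing in this report confirms it. A minimal sketch on made-up data for counting such zeros and, if desired, recoding them to `NA`:

```r
# Made-up values; in the report the same idea would apply to df.
toy <- data.frame(
  RestingBP   = c(140, 0, 130, 120),
  Cholesterol = c(289, 0, 0, 214)
)

suspect_cols <- c("RestingBP", "Cholesterol")

# Count exact zeros per column.
zero_counts <- colSums(toy[suspect_cols] == 0)
print(zero_counts)

# Optional: treat zeros as missing (assumption: 0 means "not recorded").
toy_na <- toy
toy_na[suspect_cols] <- lapply(toy_na[suspect_cols],
                               function(x) replace(x, x == 0, NA))
colSums(is.na(toy_na))
```

Whether to recode, impute or drop such rows is a separate modelling decision; the sketch only makes the hidden zeros visible.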
par(mfrow=c(2, 3)) 

# Age Boxplot
boxplot(df$Age, main="Age", ylab="Age", col="skyblue")

# RestingBP Boxplot
boxplot(df$RestingBP, main="RestingBP", ylab="RestingBP", col="lightgreen")

# Cholesterol Boxplot
boxplot(df$Cholesterol, main="Cholesterol", ylab="Cholesterol", col="lightcoral")

# MaxHR Boxplot
boxplot(df$MaxHR, main="MaxHR", ylab="MaxHR", col="lightgoldenrodyellow")

# Oldpeak Boxplot
boxplot(df$Oldpeak, main="Oldpeak", ylab="Oldpeak", col="lightsteelblue")

df$Age %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   28.00   47.00   54.00   53.51   60.00   77.00
ggplot(data = df, aes(x = Age)) +
  geom_histogram(color = "darkblue", fill = "lightblue") +
  labs(title = "Age Histogram Plot", x = "Age", y = "Count") +
  theme_minimal()

df$Cholesterol %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   173.2   223.0   198.8   267.0   603.0
ggplot(data = df, aes(x = Cholesterol)) +
  geom_histogram(color = "darkblue", fill = "lightblue") +
  labs(title = "Serum Cholesterol Histogram Plot", x = "Serum Cholesterol", y = "Count") +
  theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Outlier rates were computed per variable, using Q1 - 1.5 * IQR as the lower bound and Q3 + 1.5 * IQR as the upper bound. Since no variable has more than 20% outliers (Cholesterol comes closest, at about 19.9%), the outliers were simply removed.
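
As a quick illustration of the rule on made-up numbers (not taken from the heart data), the bounds and the flagged share can be worked out by hand:

```r
x <- c(1, 2, 3, 4, 100)           # made-up values

q1 <- unname(quantile(x, 0.25))   # 2
q3 <- unname(quantile(x, 0.75))   # 4
iqr <- q3 - q1                    # 2

lower <- q1 - 1.5 * iqr           # -1
upper <- q3 + 1.5 * iqr           #  7

is_outlier <- x < lower | x > upper
mean(is_outlier)                  # share of flagged values: 0.2
```

Only the value 100 falls outside the interval from -1 to 7, so one of five observations is flagged. Naming the spread `iqr` rather than `IQR` avoids masking `stats::IQR()` inside a function.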
outlier_ratio <- function(data, variable) {
  Q1 <- quantile(data[[variable]], 0.25)
  Q3 <- quantile(data[[variable]], 0.75)
  IQR <- Q3 - Q1
  
  alt_sinir <- Q1 - 1.5 * IQR
  ust_sinir <- Q3 + 1.5 * IQR
  
  aykiri <- sum(data[[variable]] < alt_sinir | data[[variable]] > ust_sinir)
  oran <- aykiri / length(data[[variable]])
  
  return(oran)
}


degiskenler <- c("Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak")
aykiri_oranlar <- sapply(degiskenler, function(x) outlier_ratio(df, x))


aykiri_oranlar
##         Age   RestingBP Cholesterol       MaxHR     Oldpeak 
## 0.000000000 0.030501089 0.199346405 0.002178649 0.017429194
veri <- data.frame(degiskenler, aykiri_oranlar)


grafik <- ggplot(data = veri, aes(x = degiskenler, y = aykiri_oranlar)) +
  geom_bar(stat = "identity", fill = "skyblue", width = 0.5) +
  labs(title = "Outlier Rates", x = "Variables", y = "Outlier Rate") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))


print(grafik)

outlier_detection <- function(data, variable) {
  Q1 <- quantile(data[[variable]], 0.25)
  Q3 <- quantile(data[[variable]], 0.75)
  IQR <- Q3 - Q1
  
  alt_sinir <- Q1 - 1.5 * IQR
  ust_sinir <- Q3 + 1.5 * IQR
  
  aykiri <- data[[variable]] < alt_sinir | data[[variable]] > ust_sinir
  return(aykiri)
}


age_outliers <- outlier_detection(df, "Age")
RestingBP_outliers <- outlier_detection(df, "RestingBP")
Cholesterol_outliers <- outlier_detection(df, "Cholesterol")
MaxHR_outliers <- outlier_detection(df, "MaxHR")
Oldpeak_outliers <- outlier_detection(df, "Oldpeak")


clean_df <- df[!age_outliers & !RestingBP_outliers & !Cholesterol_outliers & !MaxHR_outliers & !Oldpeak_outliers, ]


cat("Dimensions of the dataset without outliers:", dim(clean_df))
## Dimensions of the dataset without outliers: 702 12
write.csv(clean_df, file = "/home/ilke/Downloads/clean_heart.csv", row.names = FALSE)   # save the cleaned data
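
Because a row is dropped as soon as any one variable flags it, the share of rows removed can exceed every per-variable rate: here 918 rows shrink to 702, a loss of about 23.5%, although no single variable reaches 20%. A minimal sketch on made-up data of how the flags combine; `flag_outliers` is a hypothetical helper, not the function used in the report:

```r
# Made-up data: each column has exactly one extreme value, in a different row.
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 6, 7, 100),
  b = c(50, 2, 3, 4, 5, 6, 7, 8)
)

# Hypothetical helper applying the 1.5 * IQR rule to one numeric vector.
flag_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  spread <- q[[2]] - q[[1]]
  x < q[[1]] - 1.5 * spread | x > q[[2]] + 1.5 * spread
}

flags <- sapply(toy, flag_outliers)   # logical matrix, one column per variable
per_variable <- colMeans(flags)       # share flagged within each variable
any_flag <- rowSums(flags) > 0        # rows flagged by at least one variable

per_variable
mean(any_flag)                        # share of rows that would be dropped
toy_clean <- toy[!any_flag, ]
```

In this toy case each variable flags one row in eight, but two rows in eight are dropped, because the flags fall on different rows.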
Box plots of the numeric variables after removing the outliers
par(mfrow=c(2, 3)) 

# Age Boxplot
boxplot(clean_df$Age, main="Age", ylab="Age", col="skyblue")

# RestingBP Boxplot
boxplot(clean_df$RestingBP, main="RestingBP", ylab="RestingBP", col="lightgreen")

# Cholesterol Boxplot
boxplot(clean_df$Cholesterol, main="Cholesterol", ylab="Cholesterol", col="lightcoral")

# MaxHR Boxplot
boxplot(clean_df$MaxHR, main="MaxHR", ylab="MaxHR", col="lightgoldenrodyellow")

# Oldpeak Boxplot
boxplot(clean_df$Oldpeak, main="Oldpeak", ylab="Oldpeak", col="lightsteelblue")

4.3 Overview of the Data

head(df)
##   Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR
## 1  40   M           ATA       140         289         0     Normal   172
## 2  49   F           NAP       160         180         0     Normal   156
## 3  37   M           ATA       130         283         0         ST    98
## 4  48   F           ASY       138         214         0     Normal   108
## 5  54   M           NAP       150         195         0     Normal   122
## 6  39   M           NAP       120         339         0     Normal   170
##   ExerciseAngina Oldpeak ST_Slope HeartDisease
## 1              N     0.0       Up            0
## 2              N     1.0     Flat            1
## 3              N     0.0       Up            0
## 4              Y     1.5     Flat            1
## 5              N     0.0       Up            0
## 6              N     0.0       Up            0
kable(df %>%
group_by(HeartDisease) %>%
  summarise(N = n())) %>%
  kable_styling(full_width = F)
HeartDisease N
0 410
1 508
factor_columns <- sapply(df, is.factor) 
factor_levels <- lapply(df[factor_columns], table)

factor_summary_combined <- do.call(rbind, Map(as.data.frame, factor_levels))

kable(factor_summary_combined, caption = "Level counts of the categorical variables") %>%
  kable_styling(full_width = FALSE)
Level counts of the categorical variables
Var1 Freq
Sex.1 F 193
Sex.2 M 725
ChestPainType.1 ASY 496
ChestPainType.2 ATA 173
ChestPainType.3 NAP 203
ChestPainType.4 TA 46
FastingBS.1 0 704
FastingBS.2 1 214
RestingECG.1 LVH 188
RestingECG.2 Normal 552
RestingECG.3 ST 178
ExerciseAngina.1 N 547
ExerciseAngina.2 Y 371
ST_Slope.1 Down 63
ST_Slope.2 Flat 460
ST_Slope.3 Up 395
HeartDisease.1 0 410
HeartDisease.2 1 508
barplot_var1 <- ggplot(clean_df, aes(x = as.factor(Sex), fill = as.factor(Sex))) +
  geom_bar(position = "dodge") +
  labs(title = "Sex Distribution", x = "Sex", y = "Frequency", fill = "Sex") +
  theme_minimal()

barplot_var2 <- ggplot(clean_df, aes(x = as.factor(ChestPainType), fill = as.factor(ChestPainType))) +
  geom_bar(position = "dodge") +
  labs(title = "Chest Pain Type Distribution", x = "Chest Pain Type", y = "Frequency", fill = "Chest Pain Type") +
  theme_minimal()

barplot_var3 <- ggplot(clean_df, aes(x = as.factor(FastingBS), fill = as.factor(FastingBS))) +
  geom_bar(position = "dodge") +
  labs(title = "Fasting Blood Sugar Distribution", x = "Fasting BS", y = "Frequency", fill = "Fasting BS") +
  theme_minimal()

barplot_var4 <- ggplot(clean_df, aes(x = as.factor(RestingECG), fill = as.factor(RestingECG))) +
  geom_bar(position = "dodge") +
  labs(title = "Resting ECG Distribution", x = "Resting ECG", y = "Frequency", fill = "Resting ECG") +
  theme_minimal()

library(gridExtra)
# grid.arrange() draws the combined figure itself, so no print() call is needed.
grid.arrange(barplot_var1, barplot_var2, barplot_var3, barplot_var4, ncol = 2)
barplot_st_slope <- ggplot(clean_df, aes(x = as.factor(ST_Slope), fill = as.factor(ST_Slope))) +
  geom_bar(position = "dodge") +
  labs(title = "ST Slope Distribution", x = "ST Slope", y = "Frequency", fill = "ST Slope") +
  theme_minimal()

barplot_heart_disease <- ggplot(clean_df, aes(x = as.factor(HeartDisease), fill = as.factor(HeartDisease))) +
  geom_bar(position = "dodge") +
  labs(title = "Heart Disease Distribution", x = "Heart Disease", y = "Frequency", fill = "Heart Disease") +
  theme_minimal()

barplot_exercise_angina <- ggplot(clean_df, aes(x = as.factor(ExerciseAngina), fill = as.factor(ExerciseAngina))) +
  geom_bar(position = "dodge") +
  labs(title = "Exercise Angina Distribution", x = "Exercise Angina", y = "Frequency", fill = "Exercise Angina") +
  theme_minimal()

# grid.arrange() draws the combined figure itself, so no print() call is needed.
grid.arrange(barplot_st_slope, barplot_heart_disease, barplot_exercise_angina, ncol = 3)
colnames(df)
##  [1] "Age"            "Sex"            "ChestPainType"  "RestingBP"     
##  [5] "Cholesterol"    "FastingBS"      "RestingECG"     "MaxHR"         
##  [9] "ExerciseAngina" "Oldpeak"        "ST_Slope"       "HeartDisease"
a <- df %>%
  group_by(HeartDisease) %>%
  summarise(across(-c(Sex, ChestPainType, FastingBS, RestingECG, ExerciseAngina, ST_Slope), list(mean = mean, sd = sd)))
kable(a, caption = "Means and standard deviations by target variable", format = "html") %>%
  kable_styling()
Means and standard deviations by target variable
HeartDisease Age_mean Age_sd RestingBP_mean RestingBP_sd Cholesterol_mean Cholesterol_sd MaxHR_mean MaxHR_sd Oldpeak_mean Oldpeak_sd
0 50.55122 9.444915 130.1805 16.49958 227.1220 74.63466 148.1512 23.28807 0.4080488 0.6997091
1 55.89961 8.727056 134.1850 19.82868 175.9409 126.39140 127.6555 23.38692 1.2742126 1.1518720
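
The `across()` call above excludes the categorical columns by name. A sketch of the same kind of summary in base R that picks the numeric columns by type instead, run on made-up data; the names `toy`, `num_cols`, `group_means` and `group_sds` are illustrative assumptions:

```r
# Made-up data with one grouping factor, one other factor and one numeric column.
toy <- data.frame(
  HeartDisease = factor(c(0, 0, 1, 1)),
  Sex = factor(c("M", "F", "M", "F")),
  Age = c(40, 50, 60, 70)
)

# Select numeric columns by type rather than listing the factors to exclude.
num_cols <- names(toy)[sapply(toy, is.numeric)]

# One aggregate() call per statistic, grouped by the target variable.
group_means <- aggregate(toy[num_cols],
                         by = list(HeartDisease = toy$HeartDisease),
                         FUN = mean)
group_sds <- aggregate(toy[num_cols],
                       by = list(HeartDisease = toy$HeartDisease),
                       FUN = sd)

group_means
group_sds
```

Selecting by type means the summary would not need editing if another categorical column were added later.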
kable(summary(df), caption = "Data Summary") %>%
  kable_styling(full_width = FALSE)
Data Summary
Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR ExerciseAngina Oldpeak ST_Slope HeartDisease
Min. :28.00 F:193 ASY:496 Min. : 0.0 Min. : 0.0 0:704 LVH :188 Min. : 60.0 N:547 Min. :-2.6000 Down: 63 0:410
1st Qu.:47.00 M:725 ATA:173 1st Qu.:120.0 1st Qu.:173.2 1:214 Normal:552 1st Qu.:120.0 Y:371 1st Qu.: 0.0000 Flat:460 1:508
Median :54.00 NA NAP:203 Median :130.0 Median :223.0 NA ST :178 Median :138.0 NA Median : 0.6000 Up :395 NA
Mean :53.51 NA TA : 46 Mean :132.4 Mean :198.8 NA NA Mean :136.8 NA Mean : 0.8874 NA NA
3rd Qu.:60.00 NA NA 3rd Qu.:140.0 3rd Qu.:267.0 NA NA 3rd Qu.:156.0 NA 3rd Qu.: 1.5000 NA NA
Max. :77.00 NA NA Max. :200.0 Max. :603.0 NA NA Max. :202.0 NA Max. : 6.2000 NA NA
clean_df$Cholesterol %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    85.0   206.0   235.0   239.7   272.0   404.0
ggplot(data = clean_df, aes(x = Cholesterol)) +
  geom_histogram(color = "darkblue", fill = "lightblue") +
  labs(title = "Serum Cholesterol Histogram Plot", x = "Serum Cholesterol", y = "Count") +
  theme_minimal()

ggplot(clean_df, aes(x = RestingBP, y = Cholesterol, color = as.factor(HeartDisease))) +
  geom_point(alpha = 0.7) +
  labs(title = "Scatterplot of Resting BP and Cholesterol by Heart Disease Status", x = "Resting BP", y = "Cholesterol", color = "Heart Disease") +
  theme_minimal()

ggplot(clean_df, aes(x = Age, y = RestingBP, color = as.factor(HeartDisease), size = MaxHR)) +
  geom_point(alpha = 0.7) +
  labs(title = "Three-variable Scatterplot", x = "Age", y = "Resting BP", color = "Heart Disease", size = "Max HR") +
  theme_minimal()

ggpairs(clean_df[, c("Age", "RestingBP", "Cholesterol", "MaxHR", "HeartDisease")], 
        aes(color = as.factor(HeartDisease)),
        lower = list(continuous = "points"),  
        upper = list(continuous = "blank"),  
        diag = list(continuous = "barDiag"))  

ggpairs(df[, c("Age", "RestingBP", "Cholesterol", "MaxHR", "HeartDisease", "Sex")], 
        aes(color = Sex),
        lower = list(continuous = "points"),  
        upper = list(continuous = "blank"),  
        diag = list(continuous = "barDiag"))