## Intro

This project focuses on exploring factors that increase heart-attack risk and building models to predict it using real-world health data. The analysis involves comprehensive data cleaning, visualization, and model building to identify key predictors of heart disease. Various machine learning techniques such as Logistic Regression, K-Nearest Neighbors (KNN), and Clustering were applied and compared based on performance metrics and interpretability. The report provides insights into how attributes like age, cholesterol, blood pressure, chest pain type, and exercise patterns contribute to cardiovascular risk. The goal is to support early detection and clinical decision-making through data-driven analysis.

## Objective

1. To understand which demographic and clinical features affect heart-attack risk.

2. To identify patterns in the data.

3. To build predictive models that can help identify high-risk patients early.

## Dataset

Our dataset contains 918 patient records with 12 columns, including:

- Demographic: Age, Sex
- Categorical clinical: Chest Pain Type, Exercise Angina, ST_Slope
- Continuous clinical: Resting Blood Pressure, Cholesterol, Max Heart Rate, Oldpeak

The target variable is HeartDisease (1 = heart disease, 0 = no heart disease).

Analysis Phases

Phase 1: Descriptive / Exploratory Analysis

| Q_No | Question | Type | Visualization_or_Test |
|---|---|---|---|
| 1 | What is the distribution of heart-attack risk (target variable) in the dataset? | Descriptive / Exploratory | Bar chart / Pie chart |
| 2 | How do demographic features (age, sex) vary between individuals with and without heart attack? | Descriptive / Exploratory | Boxplot for age, Bar chart for sex |
| 3 | Which clinical/laboratory features (cholesterol, resting BP, max HR, etc.) differ between the two groups? | Descriptive / Exploratory | Boxplot / Violin plot |
| 4 | What are the correlations among continuous risk factors? Are there clusters of related variables? | Descriptive / Exploratory | Correlation heatmap, Pairplot |
| 5 | Are there meaningful categorical variables (chest pain type, resting ECG, fasting BS) whose proportions differ by heart-attack status? | Descriptive / Exploratory | Stacked bar chart / Mosaic plot |

Phase 2: Inferential / Statistical Testing

| Q_No | Question | Type | Visualization_or_Test |
|---|---|---|---|
| 6 | Using an ANOVA or t-test (as appropriate), do mean values of key continuous variables differ significantly between heart-attack vs non-heart-attack groups? | Inferential / Statistical | t-test / ANOVA |
| 7 | For categorical variables, use chi-square tests (or similar) to determine if distributions differ between groups. | Inferential / Statistical | Chi-square test / Fisher’s exact test |

Phase 3: Predictive Modelling

| Q_No | Question | Type | Visualization_or_Test |
|---|---|---|---|
| 8 | Build logistic regression to predict heart-attack likelihood. Interpret coefficients (odds ratios) and evaluate performance (AUC, calibration). | Predictive / Modeling | Logistic regression, ROC curve, AUC, Calibration plot |
| 9 | Build k-nearest neighbours (KNN) classifier. Compare performance to logistic regression. | Predictive / Modeling | KNN, Accuracy, AUC, Confusion matrix |
| 10 | Use k-means clustering on predictor variables (without target) to identify patient clusters; examine cluster membership relative to heart-attack risk. | Predictive / Modeling | K-means clustering, cluster proportions by heart attack |

Phase 4: Model Comparison

| Q_No | Question | Type | Visualization_or_Test |
|---|---|---|---|
| 11 | Compare logistic regression, KNN, and clusters on predictive performance and interpretability. | Comparative | Accuracy, AUC, Sensitivity, Specificity table; interpretability notes |

Phase 5: Interpretation & Insights

| Q_No | Question | Type | Visualization_or_Test |
|---|---|---|---|
| 12 | What are the main risk factors, demographic patterns, and clinical indicators associated with heart-attack risk? | Interpretive | Key findings |
| 13 | Which predictive model is most suitable for clinical screening? | Interpretive | Recommendation with rationale |

Load packages & data

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(knitr)
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
library(ROCR)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(cluster)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## 
## Attaching package: 'arules'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
library(arulesViz)
library(broom)

# Read File
path <- '/Users/naincysingh/Library/Mobile Documents/com~apple~Numbers/Documents/heart.csv'
heart <- read.csv(path, stringsAsFactors = FALSE)

# View data
glimpse(heart)
## Rows: 918
## Columns: 12
## $ Age            <int> 40, 49, 37, 48, 54, 39, 45, 54, 37, 48, 37, 58, 39, 49,…
## $ Sex            <chr> "M", "F", "M", "F", "M", "M", "F", "M", "M", "F", "F", …
## $ ChestPainType  <chr> "ATA", "NAP", "ATA", "ASY", "NAP", "NAP", "ATA", "ATA",…
## $ RestingBP      <int> 140, 160, 130, 138, 150, 120, 130, 110, 140, 120, 130, …
## $ Cholesterol    <int> 289, 180, 283, 214, 195, 339, 237, 208, 207, 284, 211, …
## $ FastingBS      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ RestingECG     <chr> "Normal", "Normal", "ST", "Normal", "Normal", "Normal",…
## $ MaxHR          <int> 172, 156, 98, 108, 122, 170, 170, 142, 130, 120, 142, 9…
## $ ExerciseAngina <chr> "N", "N", "N", "Y", "N", "N", "N", "N", "Y", "N", "N", …
## $ Oldpeak        <dbl> 0.0, 1.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0, …
## $ ST_Slope       <chr> "Up", "Flat", "Up", "Flat", "Up", "Up", "Up", "Up", "Fl…
## $ HeartDisease   <int> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1…
colnames(heart)
##  [1] "Age"            "Sex"            "ChestPainType"  "RestingBP"     
##  [5] "Cholesterol"    "FastingBS"      "RestingECG"     "MaxHR"         
##  [9] "ExerciseAngina" "Oldpeak"        "ST_Slope"       "HeartDisease"

Data cleaning & preprocessing

# Convert plausible categorical columns
heart <- heart %>%
  mutate(
    Sex = as.factor(Sex),
    ChestPainType = as.factor(ChestPainType),
    FastingBS = as.factor(FastingBS),
    RestingECG = as.factor(RestingECG),
    ExerciseAngina = as.factor(ExerciseAngina),
    ST_Slope = as.factor(ST_Slope),
    HeartDisease = as.factor(HeartDisease)
  )

# View structure
str(heart)
## 'data.frame':    918 obs. of  12 variables:
##  $ Age           : int  40 49 37 48 54 39 45 54 37 48 ...
##  $ Sex           : Factor w/ 2 levels "F","M": 2 1 2 1 2 2 1 2 2 1 ...
##  $ ChestPainType : Factor w/ 4 levels "ASY","ATA","NAP",..: 2 3 2 1 3 3 2 2 1 2 ...
##  $ RestingBP     : int  140 160 130 138 150 120 130 110 140 120 ...
##  $ Cholesterol   : int  289 180 283 214 195 339 237 208 207 284 ...
##  $ FastingBS     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ RestingECG    : Factor w/ 3 levels "LVH","Normal",..: 2 2 3 2 2 2 2 2 2 2 ...
##  $ MaxHR         : int  172 156 98 108 122 170 170 142 130 120 ...
##  $ ExerciseAngina: Factor w/ 2 levels "N","Y": 1 1 1 2 1 1 1 1 2 1 ...
##  $ Oldpeak       : num  0 1 0 1.5 0 0 0 0 1.5 0 ...
##  $ ST_Slope      : Factor w/ 3 levels "Down","Flat",..: 3 2 3 2 3 3 3 3 2 3 ...
##  $ HeartDisease  : Factor w/ 2 levels "0","1": 1 2 1 2 1 1 1 1 2 1 ...
#Check for missing values
colSums(is.na(heart))
##            Age            Sex  ChestPainType      RestingBP    Cholesterol 
##              0              0              0              0              0 
##      FastingBS     RestingECG          MaxHR ExerciseAngina        Oldpeak 
##              0              0              0              0              0 
##       ST_Slope   HeartDisease 
##              0              0
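A note on the missing-value check: `is.na()` only finds values stored as `NA`. Clinical datasets sometimes record an unmeasured value as 0, which is not physiologically plausible for cholesterol or resting blood pressure. The sketch below counts exact zeros in chosen columns. `count_zeros` is a hypothetical helper, the toy data frame is invented for illustration, and treating 0 as "missing" is an assumption that the report does not confirm for this dataset.

```r
# Hypothetical helper: count exact zeros in selected numeric columns.
count_zeros <- function(df, cols) {
  vapply(cols, function(col) sum(df[[col]] == 0, na.rm = TRUE), integer(1))
}

# Toy data for illustration only (not taken from heart.csv).
toy <- data.frame(
  RestingBP   = c(120, 0, 135, 140),
  Cholesterol = c(210, 0, 0, 250)
)

zero_counts <- count_zeros(toy, c("RestingBP", "Cholesterol"))
print(zero_counts)
```

On the report's data the call would be `count_zeros(heart, c("RestingBP", "Cholesterol"))`. If any zeros turn up, recoding them to `NA` before testing and modelling is one option worth considering.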

Descriptive / Exploratory Analysis

1. What is the distribution of heart-attack risk (target variable) in the dataset?
#1.Distribution of Heart Attack Risk
ggplot(heart, aes(x = HeartDisease, fill = HeartDisease)) +
geom_bar() +
labs(title = "Distribution of Heart Attack Risk", x = "Heart Disease (0 = No, 1 = Yes)", y = "Count")
## Warning: <ggplot> %+% x was deprecated in ggplot2 4.0.0.
## ℹ Please use <ggplot> + x instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Conclusion: Patients with heart disease are the majority class, at roughly 55% of records (the confusion matrices later in the report give a no-information rate of 0.5527), so the two classes are only mildly imbalanced.
2. How do demographic features (age, sex) vary between individuals with and without a history of heart attack?
#2.Demographic Comparison (Age & Sex)
p1 <- ggplot(heart, aes(x = HeartDisease, y = Age, fill = HeartDisease)) +
geom_boxplot() + labs(title = "Age Distribution by Heart Disease")

p2 <- ggplot(heart, aes(x = Sex, fill = HeartDisease)) +
geom_bar(position = "fill") + labs(title = "Sex vs Heart Disease", y = "Proportion")

grid.arrange(p1, p2, ncol = 2)
Conclusion: Individuals with heart disease are typically older and more likely to be male than those without.
3. Which clinical/laboratory features (cholesterol, resting blood pressure, max heart rate, etc.) differ significantly between the two groups?
#3.Clinical Features Comparison
num_vars <- c("RestingBP", "Cholesterol", "MaxHR", "Oldpeak")
heart %>%
  select(all_of(num_vars), HeartDisease) %>%
  pivot_longer(cols = all_of(num_vars), names_to = "Feature", values_to = "Value") %>%
  ggplot(aes(x = HeartDisease, y = Value, fill = HeartDisease)) +
  geom_boxplot() +
  facet_wrap(~Feature, scales = "free") +
  labs(title = "Clinical Feature Distributions")
Conclusion: The boxplots indicate that Cholesterol, MaxHR, and Oldpeak differ noticeably in distribution between the group with heart disease (1) and the group without (0), while the RestingBP distributions are similar.
4. What are the correlations among continuous risk factors, and are there clusters of related variables?
#4.Correlation Among Continuous Variables
continuous_vars <- heart %>% select(Age, RestingBP, Cholesterol, MaxHR, Oldpeak)
corr_matrix <- cor(continuous_vars, use = "complete.obs")
ggcorr(continuous_vars, label = TRUE) +
labs(title = "Correlation Matrix of Continuous Variables")
Conclusion: There are several weak to moderate correlations among the continuous risk factors. The most notable are the negative correlation between Age and MaxHR and the positive correlations of Age with RestingBP and Oldpeak.
5. Are there meaningful categories (e.g., chest pain type, resting ECG, exercise angina, ST slope) whose proportions differ meaningfully by heart-attack status?
#5.Categorical Variables vs Heart Attack
cat_vars <- c("ChestPainType", "RestingECG", "ExerciseAngina", "ST_Slope")
# aes_string() is deprecated; the .data pronoun selects a column by name instead
for (var in cat_vars) {
  print(
    ggplot(heart, aes(x = .data[[var]], fill = HeartDisease)) +
      geom_bar(position = "fill") +
      labs(title = paste(var, "vs Heart Disease"), x = var, y = "Proportion")
  )
}

Inferential / Statistical Testing

6. Using an ANOVA or t-test (as appropriate), do mean values of key continuous variables differ significantly between heart-attack vs non-heart-attack groups?
#1.T-test or ANOVA for Continuous Variables
vars <- c("Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak")
for (v in vars) {
  cat("\n", v, ":\n")
  print(t.test(heart[[v]] ~ heart$HeartDisease))
}
## 
##  Age :
## 
##  Welch Two Sample t-test
## 
## data:  heart[[v]] by heart$HeartDisease
## t = -8.8225, df = 843.69, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -6.538260 -4.158513
## sample estimates:
## mean in group 0 mean in group 1 
##        50.55122        55.89961 
## 
## 
##  RestingBP :
## 
##  Welch Two Sample t-test
## 
## data:  heart[[v]] by heart$HeartDisease
## t = -3.3395, df = 915.14, p-value = 0.0008732
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -6.357955 -1.651148
## sample estimates:
## mean in group 0 mean in group 1 
##        130.1805        134.1850 
## 
## 
##  Cholesterol :
## 
##  Welch Two Sample t-test
## 
## data:  heart[[v]] by heart$HeartDisease
## t = 7.6269, df = 844.36, p-value = 6.481e-14
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  38.00953 64.35249
## sample estimates:
## mean in group 0 mean in group 1 
##        227.1220        175.9409 
## 
## 
##  MaxHR :
## 
##  Welch Two Sample t-test
## 
## data:  heart[[v]] by heart$HeartDisease
## t = 13.231, df = 877.04, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  17.45551 23.53591
## sample estimates:
## mean in group 0 mean in group 1 
##        148.1512        127.6555 
## 
## 
##  Oldpeak :
## 
##  Welch Two Sample t-test
## 
## data:  heart[[v]] by heart$HeartDisease
## t = -14.04, df = 855.03, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.9872502 -0.7450774
## sample estimates:
## mean in group 0 mean in group 1 
##       0.4080488       1.2742126
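Because five t-tests are run on the same data, it can be worth checking that the conclusions survive a multiple-comparison correction. The sketch below applies a Bonferroni adjustment to the p-values reported in the output. Entering `< 2.2e-16` as `2.2e-16` is an assumption (an upper bound), and the choice of Bonferroni over other methods is illustrative.

```r
# p-values copied from the Welch t-test output;
# "< 2.2e-16" is entered as its upper bound 2.2e-16 (an assumption).
p_raw <- c(Age = 2.2e-16, RestingBP = 0.0008732, Cholesterol = 6.481e-14,
           MaxHR = 2.2e-16, Oldpeak = 2.2e-16)

# Bonferroni multiplies each p-value by the number of tests (capped at 1)
p_adj <- p.adjust(p_raw, method = "bonferroni")
significant <- p_adj < 0.05

print(signif(p_adj, 3))
print(significant)
```

All five variables stay below 0.05 after adjustment, including RestingBP, whose group means differ by only about 4 mmHg.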
7. For categorical variables, use chi-square tests (or similar) to determine if distributions differ between groups.
#2.Chi-square Tests for Categorical Variables
for (v in cat_vars) {
  tbl <- table(heart[[v]], heart$HeartDisease)
  cat("\nChi-square test for", v, ":\n")
  print(chisq.test(tbl))
}
## 
## Chi-square test for ChestPainType :
## 
##  Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = 268.07, df = 3, p-value < 2.2e-16
## 
## 
## Chi-square test for RestingECG :
## 
##  Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = 10.931, df = 2, p-value = 0.004229
## 
## 
## Chi-square test for ExerciseAngina :
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  tbl
## X-squared = 222.26, df = 1, p-value < 2.2e-16
## 
## 
## Chi-square test for ST_Slope :
## 
##  Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = 355.92, df = 2, p-value < 2.2e-16
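With 918 records, even a weak association yields a small p-value, so an effect size helps rank the categorical variables. The sketch below computes Cramér's V from the X-squared values reported in the output. `cramers_v` is a hypothetical helper. The ExerciseAngina statistic is Yates-corrected, so its V is a slight underestimate.

```r
# Cramer's V from a chi-square statistic: sqrt(X2 / (n * (min(r, c) - 1))).
cramers_v <- function(x2, n, n_rows, n_cols) {
  sqrt(x2 / (n * (min(n_rows, n_cols) - 1)))
}

n <- 918  # rows reported by glimpse(heart)

# X-squared values and table dimensions taken from the chi-square output;
# every table has 2 columns (HeartDisease = 0 / 1).
v <- c(
  ChestPainType  = cramers_v(268.07, n, 4, 2),
  RestingECG     = cramers_v(10.931, n, 3, 2),
  ExerciseAngina = cramers_v(222.26, n, 2, 2),  # Yates-corrected statistic
  ST_Slope       = cramers_v(355.92, n, 3, 2)
)
print(round(v, 2))
```

On this measure ST_Slope and ChestPainType show the strongest association with heart disease, while RestingECG, although statistically significant, is weak.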

Predictive Modelling

8. Build a logistic regression model to predict the likelihood of a heart-attack event. Interpret coefficients (odds ratios) and evaluate performance (AUC, calibration).
#1.Logistic Regression
set.seed(123)
trainIndex <- createDataPartition(heart$HeartDisease, p = 0.7, list = FALSE)
train <- heart[trainIndex,]
test <- heart[-trainIndex,]

log_model <- glm(HeartDisease ~ ., data = train, family = binomial)
summary(log_model)
## 
## Call:
## glm(formula = HeartDisease ~ ., family = binomial, data = train)
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -0.842221   1.675548  -0.503 0.615208    
## Age               0.015273   0.015495   0.986 0.324290    
## SexM              1.507029   0.323270   4.662 3.13e-06 ***
## ChestPainTypeATA -1.806232   0.385128  -4.690 2.73e-06 ***
## ChestPainTypeNAP -1.922537   0.312419  -6.154 7.57e-10 ***
## ChestPainTypeTA  -1.904039   0.546820  -3.482 0.000498 ***
## RestingBP         0.004104   0.006789   0.605 0.545510    
## Cholesterol      -0.004185   0.001235  -3.388 0.000703 ***
## FastingBS1        1.183915   0.321640   3.681 0.000232 ***
## RestingECGNormal -0.389396   0.315018  -1.236 0.216419    
## RestingECGST     -0.191001   0.394204  -0.485 0.628014    
## MaxHR            -0.004912   0.006014  -0.817 0.414069    
## ExerciseAnginaY   0.648619   0.281296   2.306 0.021121 *  
## Oldpeak           0.286305   0.140361   2.040 0.041373 *  
## ST_SlopeFlat      1.596272   0.500953   3.186 0.001440 ** 
## ST_SlopeUp       -0.867916   0.525467  -1.652 0.098594 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 883.97  on 642  degrees of freedom
## Residual deviance: 438.99  on 627  degrees of freedom
## AIC: 470.99
## 
## Number of Fisher Scoring iterations: 5
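The question asks for odds ratios, but the summary reports coefficients on the log-odds scale. Exponentiating a coefficient gives the odds ratio, and exponentiating the Wald limits gives an approximate 95% interval. The sketch below does this for a handful of the estimates and standard errors copied from the summary. The selection of terms is illustrative, not exhaustive.

```r
# Estimates and standard errors copied from the summary(log_model) output.
est <- c(SexM = 1.507029, ChestPainTypeATA = -1.806232, FastingBS1 = 1.183915,
         Oldpeak = 0.286305, ST_SlopeFlat = 1.596272)
se  <- c(SexM = 0.323270, ChestPainTypeATA = 0.385128, FastingBS1 = 0.321640,
         Oldpeak = 0.140361, ST_SlopeFlat = 0.500953)

z <- qnorm(0.975)  # about 1.96 for a 95% Wald interval

odds_ratios <- data.frame(
  OR    = exp(est),
  lower = exp(est - z * se),
  upper = exp(est + z * se)
)
print(round(odds_ratios, 2))
```

With the fitted model in memory, `exp(coef(log_model))` and `exp(confint.default(log_model))` give the same quantities for every term. For example, the SexM odds ratio of about 4.5 means that, holding the other predictors fixed, the odds of heart disease for men are roughly 4.5 times those for women in this sample.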
pred_prob <- predict(log_model, test, type = "response")
pred_class <- ifelse(pred_prob > 0.5, 1, 0)

confusionMatrix(as.factor(pred_class), test$HeartDisease)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 107  16
##          1  16 136
##                                          
##                Accuracy : 0.8836         
##                  95% CI : (0.8397, 0.919)
##     No Information Rate : 0.5527         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.7647         
##                                          
##  Mcnemar's Test P-Value : 1              
##                                          
##             Sensitivity : 0.8699         
##             Specificity : 0.8947         
##          Pos Pred Value : 0.8699         
##          Neg Pred Value : 0.8947         
##              Prevalence : 0.4473         
##          Detection Rate : 0.3891         
##    Detection Prevalence : 0.4473         
##       Balanced Accuracy : 0.8823         
##                                          
##        'Positive' Class : 0              
## 
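Note that the output lists `'Positive' Class : 0`, so the reported sensitivity (0.8699) is the rate at which patients *without* heart disease are correctly identified. For screening, the class of interest is 1. In caret this can be requested with the `positive = "1"` argument of `confusionMatrix()`. The sketch below recomputes the metrics directly from the counts in the matrix; `class_metrics` is a hypothetical helper.

```r
# Counts copied from the logistic regression confusion matrix
# (rows = prediction, columns = reference; matrix() fills column by column).
cm <- matrix(c(107, 16, 16, 136), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

# Hypothetical helper: accuracy, sensitivity and specificity
# for a chosen positive class in a 2x2 confusion matrix.
class_metrics <- function(cm, positive = "1") {
  negative <- setdiff(colnames(cm), positive)
  tp <- cm[positive, positive]
  fn <- cm[negative, positive]
  fp <- cm[positive, negative]
  tn <- cm[negative, negative]
  c(accuracy    = (tp + tn) / sum(cm),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}

metrics_logit <- class_metrics(cm)
print(round(metrics_logit, 4))
```

With heart disease as the positive class, sensitivity and specificity swap relative to the printed output: about 0.895 of diseased patients are detected and about 0.870 of healthy patients are correctly cleared.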
9. Build a k-nearest neighbours (KNN) classifier for the same outcome. Compare its performance to logistic regression.
#2.K-Nearest Neighbours (KNN)
set.seed(123)
ctrl <- trainControl(method = "cv", number = 5)
knn_model <- train(HeartDisease ~ ., data = train, method = "knn", trControl = ctrl, tuneLength = 10)
knn_model
## k-Nearest Neighbors 
## 
## 643 samples
##  11 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 515, 514, 514, 515, 514 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    5  0.6842781  0.3602564
##    7  0.7029918  0.3981753
##    9  0.7107679  0.4119314
##   11  0.7168968  0.4255808
##   13  0.7029191  0.3961536
##   15  0.7059835  0.4034233
##   17  0.7013324  0.3945442
##   19  0.7029070  0.3980291
##   21  0.7059714  0.4042277
##   23  0.7121972  0.4160053
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 11.
pred_knn <- predict(knn_model, test)
confusionMatrix(pred_knn, test$HeartDisease)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  81  45
##          1  42 107
##                                           
##                Accuracy : 0.6836          
##                  95% CI : (0.6251, 0.7382)
##     No Information Rate : 0.5527          
##     P-Value [Acc > NIR] : 6.249e-06       
##                                           
##                   Kappa : 0.3616          
##                                           
##  Mcnemar's Test P-Value : 0.8302          
##                                           
##             Sensitivity : 0.6585          
##             Specificity : 0.7039          
##          Pos Pred Value : 0.6429          
##          Neg Pred Value : 0.7181          
##              Prevalence : 0.4473          
##          Detection Rate : 0.2945          
##    Detection Prevalence : 0.4582          
##       Balanced Accuracy : 0.6812          
##                                           
##        'Positive' Class : 0               
## 
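The caret output reports "No pre-processing", so KNN distances here are computed on raw units and are likely dominated by wide-ranging variables such as Cholesterol and RestingBP. A common remedy is to standardise predictors using the training-set mean and standard deviation; in caret this is done by passing `preProcess = c("center", "scale")` to `train()`. The sketch below shows the mechanics with `class::knn` on toy data. The data, the variable names, and `k = 3` are all assumptions made for illustration, and no claim is made about how much this would improve the KNN result above.

```r
library(class)  # ships with R as a recommended package

# Toy data for illustration only: `b` separates the classes,
# `a` is uninformative but on a much larger scale.
train_x <- data.frame(a = c(100, 200, 300, 400, 100, 200, 300, 400),
                      b = c(0, 0, 0, 0, 1, 1, 1, 1))
train_y <- factor(c(0, 0, 0, 0, 1, 1, 1, 1))
test_x  <- data.frame(a = c(150, 350), b = c(0, 1))

# Standardise with the training means and SDs, then reuse them for the test set
train_s <- scale(train_x)
test_s  <- scale(test_x,
                 center = attr(train_s, "scaled:center"),
                 scale  = attr(train_s, "scaled:scale"))

pred_scaled <- knn(train_s, test_s, cl = train_y, k = 3)
print(pred_scaled)
```

Scaling the test set with the training statistics, rather than its own, keeps information from the test data out of the model.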
10. Use k-means clustering on the predictor variables (without using the target) to identify clusters of patients; then examine how cluster membership relates to heart-attack risk.
#3.Clustering (K-Means)
heart_num <- heart %>% select(Age, RestingBP, Cholesterol, MaxHR, Oldpeak) %>% scale()
fviz_nbclust(heart_num, kmeans, method = "wss")

set.seed(123)
km <- kmeans(heart_num, centers = 3, nstart = 25)
fviz_cluster(km, data = heart_num)

heart$Cluster <- as.factor(km$cluster)
ggplot(heart, aes(x = Cluster, fill = HeartDisease)) +
geom_bar(position = "fill") +
labs(title = "Cluster Membership vs Heart Disease")
Conclusion: The most prominent elbow in the plot appears at k = 3. There are slight bends at k = 2 and k = 4, but the within-cluster sum of squares falls most sharply between k = 1 and k = 3, after which the rate of decrease lessens considerably. Based on the elbow method, k = 3 was therefore chosen as the number of clusters. Note that the clustering uses only the five scaled continuous variables.
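The stacked bar chart shows how clusters relate to heart disease visually; the same relationship can be tabulated as the share of heart-disease cases within each cluster. The sketch below is one way to do this in base R. `disease_rate_by_cluster` is a hypothetical helper and the vectors are toy values, not results from the report.

```r
# Hypothetical helper: share of heart-disease cases (level "1") per cluster.
disease_rate_by_cluster <- function(cluster, outcome) {
  tab <- table(Cluster = cluster, HeartDisease = outcome)
  prop.table(tab, margin = 1)[, "1"]  # row-wise proportions, disease column
}

# Toy vectors for illustration only (not taken from heart.csv).
cluster <- factor(c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3))
outcome <- factor(c(1, 1, 1, 0, 0, 0, 1, 0, 0, 0))

rates <- disease_rate_by_cluster(cluster, outcome)
print(rates)
```

On the report's data the call would be `disease_rate_by_cluster(heart$Cluster, heart$HeartDisease)`. Clusters whose rate sits far from the overall prevalence are the ones that carry information about risk.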

Model Comparison & Interpretation

11. Compare the models (logistic regression, KNN, clusters) on performance metrics and interpretability.
# Fit logistic regression

model_log <- glm(HeartDisease ~ ., data=train, family=binomial)

# Predict probabilities on test set

pred_prob <- predict(model_log, newdata=test, type="response")

# ROC curve
pred <- prediction(pred_prob, test$HeartDisease)
perf <- performance(pred, "tpr", "fpr")
plot(perf, main="ROC Curve: Logistic Regression", col="blue")
abline(a=0, b=1, lty=2, col="red")

# AUC
auc <- performance(pred, "auc")@y.values[[1]]
cat("AUC (Logistic Regression):", round(auc, 3), "\n")
## AUC (Logistic Regression): 0.946
# Accuracy
pred_class <- ifelse(pred_prob > 0.5, 1, 0)
acc <- mean(pred_class == test$HeartDisease)
cat("Accuracy (Logistic Regression):", round(acc, 3))
## Accuracy (Logistic Regression): 0.884
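Only the logistic model's AUC is computed here. To compare KNN on the same footing, its predicted class-1 probabilities would need the same treatment. As a sketch of what the AUC measures, the helper below computes it in base R from the rank (Mann-Whitney) formulation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. `rank_auc` is a hypothetical helper and the scores are toy values.

```r
# Hypothetical helper: AUC via the rank (Mann-Whitney) formulation.
# `scores` are predicted probabilities of the positive class,
# `labels` are the true classes, `positive` names the positive level.
rank_auc <- function(scores, labels, positive = "1") {
  is_pos <- as.character(labels) == positive
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  r <- rank(scores)  # average ranks handle tied scores
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy scores for illustration only.
toy_auc <- rank_auc(c(0.1, 0.4, 0.35, 0.8), factor(c(0, 0, 1, 1)))
print(toy_auc)
```

Applied as `rank_auc(pred_prob, test$HeartDisease)`, this is expected to agree with the ROCR value up to rounding, and the same call on KNN class-1 probabilities would give a directly comparable figure.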

Interpretation

Key Findings:

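All five continuous variables tested (Age, RestingBP, Cholesterol, MaxHR, Oldpeak) differ significantly between groups (Welch t-tests, all p < 0.001), with the largest test statistics for Oldpeak and MaxHR.

Chest pain type, exercise angina, and ST_Slope are strongly associated with heart disease (chi-square, p < 2.2e-16); RestingECG shows a weaker association (p = 0.004).

Logistic Regression gives interpretable results with a test accuracy of 0.884 and an AUC of 0.946.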

Overall Conclusion:

The Heart Attack Risk Analysis project successfully explored, analyzed, and modeled clinical and demographic data to identify key factors influencing heart disease risk. Through data cleaning, exploratory analysis, statistical testing, and predictive modeling, the following insights were derived:

Demographic Impact: Individuals who suffered from heart disease are generally older and predominantly male.

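Clinical Indicators: All five continuous variables tested differed significantly between healthy and affected groups. The affected group had lower Max Heart Rate (mean 127.7 vs 148.2) and higher Oldpeak (1.27 vs 0.41). Mean Cholesterol was lower in the affected group (175.9 vs 227.1), a counterintuitive result that may reflect data-quality issues and should be checked before it is interpreted clinically.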

Categorical Associations: Variables such as Chest Pain Type, Exercise Angina, and ST_Slope demonstrated strong categorical relationships with heart disease occurrence.

Model Insights:

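Logistic Regression provided interpretable results with a test accuracy of 0.884 and an AUC of 0.946, making it a reasonable candidate for clinical screening applications.

K-Nearest Neighbors (KNN) performed markedly worse (test accuracy 0.684 at k = 11) and is less interpretable. The model was fitted without pre-processing, so unscaled predictors may account for part of the gap.

Clustering analysis (K-Means) on the five continuous variables grouped patients into three clusters chosen by the elbow method, suggesting some natural structure among risk profiles.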

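In summary, the logistic model identifies sex, chest pain type, fasting blood sugar, ST slope, exercise-induced angina, and Oldpeak as the strongest predictors of heart disease. Age is higher on average in the affected group but is not significant once the other variables are accounted for (p = 0.32). Logistic Regression emerges as the most practical and explainable model for early risk detection and clinical deployment.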

This project demonstrates how data-driven modeling and visualization can support preventive healthcare, enabling medical professionals to identify at-risk individuals and promote early intervention strategies.