## Introduction
This project focuses on exploring factors that increase heart-attack risk and building models to predict it using real-world health data. The analysis involves comprehensive data cleaning, visualization, and model building to identify key predictors of heart disease. Various machine learning techniques such as Logistic Regression, K-Nearest Neighbors (KNN), and Clustering were applied and compared based on performance metrics and interpretability. The report provides insights into how attributes like age, cholesterol, blood pressure, chest pain type, and exercise patterns contribute to cardiovascular risk. The goal is to support early detection and clinical decision-making through data-driven analysis.
## Objectives
1. To understand which demographic and clinical features affect heart-attack risk.
2. To identify patterns in the data.
3. To build predictive models that can help identify high-risk patients early.
## Dataset
The dataset contains 918 patient records and 12 columns.
The predictors are Age, Sex, ChestPainType, RestingBP, Cholesterol, FastingBS, RestingECG, MaxHR, ExerciseAngina, Oldpeak, and ST_Slope. The target variable is HeartDisease (1 = heart disease, 0 = no heart disease).
| Q_No | Question | Type | Visualization_or_Test |
|---|---|---|---|
| 1 | What is the distribution of heart-attack risk (target variable) in the dataset? | Descriptive / Exploratory | Bar chart / Pie chart |
| 2 | How do demographic features (age, sex) vary between individuals with and without heart attack? | Descriptive / Exploratory | Boxplot for age, Bar chart for sex |
| 3 | Which clinical/laboratory features (cholesterol, resting BP, max HR, etc.) differ between the two groups? | Descriptive / Exploratory | Boxplot / Violin plot |
| 4 | What are the correlations among continuous risk factors? Are there clusters of related variables? | Descriptive / Exploratory | Correlation heatmap, Pairplot |
| 5 | Are there meaningful categorical variables (chest pain type, resting ECG, fasting BS) whose proportions differ by heart-attack status? | Descriptive / Exploratory | Stacked bar chart / Mosaic plot |
| 6 | Using an ANOVA or t-test (as appropriate), do mean values of key continuous variables differ significantly between heart-attack vs non-heart-attack groups? | Inferential / Statistical | t-test / ANOVA |
| 7 | For categorical variables, use chi-square tests (or similar) to determine if distributions differ between groups. | Inferential / Statistical | Chi-square test / Fisher’s exact test |
| 8 | Build logistic regression to predict heart-attack likelihood. Interpret coefficients (odds ratios) and evaluate performance (AUC, calibration). | Predictive / Modeling | Logistic regression, ROC curve, AUC, Calibration plot |
| 9 | Build k-nearest neighbours (KNN) classifier. Compare performance to logistic regression. | Predictive / Modeling | KNN, Accuracy, AUC, Confusion matrix |
| 10 | Use k-means clustering on predictor variables (without target) to identify patient clusters; examine cluster membership relative to heart-attack risk. | Predictive / Modeling | K-means clustering, cluster proportions by heart attack |
| 11 | Compare logistic regression, KNN, and clusters on predictive performance and interpretability. | Comparative | Accuracy, AUC, Sensitivity, Specificity table; interpretability notes |
| 12 | What are the main risk factors, demographic patterns, and clinical indicators associated with heart-attack risk? | Interpretive | Key findings |
| 13 | Which predictive model is most suitable for clinical screening? | Interpretive | Recommendation with rationale |
library(tidyverse)
library(ggplot2)
library(knitr)
library(caret)
library(pROC)
library(gridExtra)
library(ROCR)
library(GGally)
library(cluster)
library(factoextra)
library(arules)
library(arulesViz)
library(broom)
# Read File
path <- '/Users/naincysingh/Library/Mobile Documents/com~apple~Numbers/Documents/heart.csv'
heart <- read.csv(path, stringsAsFactors = FALSE)
# View data
glimpse(heart)
## Rows: 918
## Columns: 12
## $ Age <int> 40, 49, 37, 48, 54, 39, 45, 54, 37, 48, 37, 58, 39, 49,…
## $ Sex <chr> "M", "F", "M", "F", "M", "M", "F", "M", "M", "F", "F", …
## $ ChestPainType <chr> "ATA", "NAP", "ATA", "ASY", "NAP", "NAP", "ATA", "ATA",…
## $ RestingBP <int> 140, 160, 130, 138, 150, 120, 130, 110, 140, 120, 130, …
## $ Cholesterol <int> 289, 180, 283, 214, 195, 339, 237, 208, 207, 284, 211, …
## $ FastingBS <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ RestingECG <chr> "Normal", "Normal", "ST", "Normal", "Normal", "Normal",…
## $ MaxHR <int> 172, 156, 98, 108, 122, 170, 170, 142, 130, 120, 142, 9…
## $ ExerciseAngina <chr> "N", "N", "N", "Y", "N", "N", "N", "N", "Y", "N", "N", …
## $ Oldpeak <dbl> 0.0, 1.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0, …
## $ ST_Slope <chr> "Up", "Flat", "Up", "Flat", "Up", "Up", "Up", "Up", "Fl…
## $ HeartDisease <int> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1…
colnames(heart)
## [1] "Age" "Sex" "ChestPainType" "RestingBP"
## [5] "Cholesterol" "FastingBS" "RestingECG" "MaxHR"
## [9] "ExerciseAngina" "Oldpeak" "ST_Slope" "HeartDisease"
# Convert categorical columns to factors
heart <- heart %>%
  mutate(
    Sex = as.factor(Sex),
    ChestPainType = as.factor(ChestPainType),
    FastingBS = as.factor(FastingBS),
    RestingECG = as.factor(RestingECG),
    ExerciseAngina = as.factor(ExerciseAngina),
    ST_Slope = as.factor(ST_Slope),
    HeartDisease = as.factor(HeartDisease)
  )
# View structure
str(heart)
## 'data.frame': 918 obs. of 12 variables:
## $ Age : int 40 49 37 48 54 39 45 54 37 48 ...
## $ Sex : Factor w/ 2 levels "F","M": 2 1 2 1 2 2 1 2 2 1 ...
## $ ChestPainType : Factor w/ 4 levels "ASY","ATA","NAP",..: 2 3 2 1 3 3 2 2 1 2 ...
## $ RestingBP : int 140 160 130 138 150 120 130 110 140 120 ...
## $ Cholesterol : int 289 180 283 214 195 339 237 208 207 284 ...
## $ FastingBS : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ RestingECG : Factor w/ 3 levels "LVH","Normal",..: 2 2 3 2 2 2 2 2 2 2 ...
## $ MaxHR : int 172 156 98 108 122 170 170 142 130 120 ...
## $ ExerciseAngina: Factor w/ 2 levels "N","Y": 1 1 1 2 1 1 1 1 2 1 ...
## $ Oldpeak : num 0 1 0 1.5 0 0 0 0 1.5 0 ...
## $ ST_Slope : Factor w/ 3 levels "Down","Flat",..: 3 2 3 2 3 3 3 3 2 3 ...
## $ HeartDisease : Factor w/ 2 levels "0","1": 1 2 1 2 1 1 1 1 2 1 ...
# Check for missing values
colSums(is.na(heart))
## Age Sex ChestPainType RestingBP Cholesterol
## 0 0 0 0 0
## FastingBS RestingECG MaxHR ExerciseAngina Oldpeak
## 0 0 0 0 0
## ST_Slope HeartDisease
## 0 0
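`is.na()` only detects values coded as `NA`. In clinical tables, missing measurements are sometimes stored as 0, which the check above would not flag. The following is a minimal sketch, assuming that a zero in RestingBP or Cholesterol would be physiologically implausible. The helper `count_zeros` and the `toy` data frame are hypothetical and only illustrate the check; they are not part of the original analysis.

```r
# Count exact zeros in the given columns of a data frame.
count_zeros <- function(df, cols) {
  vapply(cols, function(col) sum(df[[col]] == 0, na.rm = TRUE), integer(1))
}

# Made-up toy rows for illustration, not taken from heart.csv.
toy <- data.frame(
  RestingBP   = c(140, 0, 130),
  Cholesterol = c(289, 0, 0)
)
zero_counts <- count_zeros(toy, c("RestingBP", "Cholesterol"))
print(zero_counts)

# On the report's data the same call would be:
# count_zeros(heart, c("RestingBP", "Cholesterol"))
```

If zeros do turn up, they could be recoded to `NA` before testing or modeling so that they do not distort group means.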
# 1. Distribution of Heart Attack Risk
ggplot(heart, aes(x = HeartDisease, fill = HeartDisease)) +
  geom_bar() +
  labs(title = "Distribution of Heart Attack Risk", x = "Heart Disease (0 = No, 1 = Yes)", y = "Count")
# 2. Demographic Comparison (Age & Sex)
p1 <- ggplot(heart, aes(x = HeartDisease, y = Age, fill = HeartDisease)) +
  geom_boxplot() +
  labs(title = "Age Distribution by Heart Disease")
p2 <- ggplot(heart, aes(x = Sex, fill = HeartDisease)) +
  geom_bar(position = "fill") +
  labs(title = "Sex vs Heart Disease", y = "Proportion")
grid.arrange(p1, p2, ncol = 2)
# 3. Clinical Features Comparison
num_vars <- c("RestingBP", "Cholesterol", "MaxHR", "Oldpeak")
heart %>%
  select(all_of(num_vars), HeartDisease) %>%
  pivot_longer(cols = all_of(num_vars), names_to = "Feature", values_to = "Value") %>%
  ggplot(aes(x = HeartDisease, y = Value, fill = HeartDisease)) +
  geom_boxplot() +
  facet_wrap(~Feature, scales = "free") +
  labs(title = "Clinical Feature Distributions")
# 4. Correlation Among Continuous Variables
continuous_vars <- heart %>% select(Age, RestingBP, Cholesterol, MaxHR, Oldpeak)
corr_matrix <- cor(continuous_vars, use = "complete.obs")
ggcorr(continuous_vars, label = TRUE) +
  labs(title = "Correlation Matrix of Continuous Variables")
# 5. Categorical Variables vs Heart Attack
cat_vars <- c("ChestPainType", "RestingECG", "ExerciseAngina", "ST_Slope")
for (var in cat_vars) {
  print(
    # .data[[var]] replaces the deprecated aes_string()
    ggplot(heart, aes(x = .data[[var]], fill = HeartDisease)) +
      geom_bar(position = "fill") +
      labs(title = paste(var, "vs Heart Disease"), x = var, y = "Proportion")
  )
}
# 1. T-test or ANOVA for Continuous Variables
vars <- c("Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak")
for (v in vars) {
  cat("\n", v, ":\n")
  print(t.test(heart[[v]] ~ heart$HeartDisease))
}
##
## Age :
##
## Welch Two Sample t-test
##
## data: heart[[v]] by heart$HeartDisease
## t = -8.8225, df = 843.69, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -6.538260 -4.158513
## sample estimates:
## mean in group 0 mean in group 1
## 50.55122 55.89961
##
##
## RestingBP :
##
## Welch Two Sample t-test
##
## data: heart[[v]] by heart$HeartDisease
## t = -3.3395, df = 915.14, p-value = 0.0008732
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -6.357955 -1.651148
## sample estimates:
## mean in group 0 mean in group 1
## 130.1805 134.1850
##
##
## Cholesterol :
##
## Welch Two Sample t-test
##
## data: heart[[v]] by heart$HeartDisease
## t = 7.6269, df = 844.36, p-value = 6.481e-14
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## 38.00953 64.35249
## sample estimates:
## mean in group 0 mean in group 1
## 227.1220 175.9409
##
##
## MaxHR :
##
## Welch Two Sample t-test
##
## data: heart[[v]] by heart$HeartDisease
## t = 13.231, df = 877.04, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## 17.45551 23.53591
## sample estimates:
## mean in group 0 mean in group 1
## 148.1512 127.6555
##
##
## Oldpeak :
##
## Welch Two Sample t-test
##
## data: heart[[v]] by heart$HeartDisease
## t = -14.04, df = 855.03, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -0.9872502 -0.7450774
## sample estimates:
## mean in group 0 mean in group 1
## 0.4080488 1.2742126
# 2. Chi-square Tests for Categorical Variables
for (v in cat_vars) {
  tbl <- table(heart[[v]], heart$HeartDisease)
  cat("\nChi-square test for", v, ":\n")
  print(chisq.test(tbl))
}
##
## Chi-square test for ChestPainType :
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 268.07, df = 3, p-value < 2.2e-16
##
##
## Chi-square test for RestingECG :
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 10.931, df = 2, p-value = 0.004229
##
##
## Chi-square test for ExerciseAngina :
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tbl
## X-squared = 222.26, df = 1, p-value < 2.2e-16
##
##
## Chi-square test for ST_Slope :
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 355.92, df = 2, p-value < 2.2e-16
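With 918 patients, even a weak association can produce a very small p-value, so an effect size helps rank the categorical variables. The sketch below computes Cramér's V directly from the chi-square statistics printed above. It is an illustrative addition, not part of the original analysis, and the ExerciseAngina value is slightly understated because that statistic includes Yates' continuity correction.

```r
# Chi-square statistics copied from the test output above.
chi_sq <- c(
  ChestPainType  = 268.07,
  RestingECG     = 10.931,
  ExerciseAngina = 222.26,  # includes Yates' continuity correction
  ST_Slope       = 355.92
)
n <- 918  # number of patients (rows in heart)
k <- 1    # min(rows, cols) - 1; each table has two HeartDisease columns

# Cramer's V = sqrt(chi^2 / (n * k)); it ranges from 0 to 1.
cramers_v <- sqrt(chi_sq / (n * k))
print(round(sort(cramers_v, decreasing = TRUE), 3))
```

On this scale ST_Slope shows the strongest association and RestingECG the weakest, even though all four tests are significant at the 0.01 level.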
# 1. Logistic Regression
set.seed(123)
trainIndex <- createDataPartition(heart$HeartDisease, p = 0.7, list = FALSE)
train <- heart[trainIndex,]
test <- heart[-trainIndex,]
log_model <- glm(HeartDisease ~ ., data = train, family = binomial)
summary(log_model)
##
## Call:
## glm(formula = HeartDisease ~ ., family = binomial, data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.842221 1.675548 -0.503 0.615208
## Age 0.015273 0.015495 0.986 0.324290
## SexM 1.507029 0.323270 4.662 3.13e-06 ***
## ChestPainTypeATA -1.806232 0.385128 -4.690 2.73e-06 ***
## ChestPainTypeNAP -1.922537 0.312419 -6.154 7.57e-10 ***
## ChestPainTypeTA -1.904039 0.546820 -3.482 0.000498 ***
## RestingBP 0.004104 0.006789 0.605 0.545510
## Cholesterol -0.004185 0.001235 -3.388 0.000703 ***
## FastingBS1 1.183915 0.321640 3.681 0.000232 ***
## RestingECGNormal -0.389396 0.315018 -1.236 0.216419
## RestingECGST -0.191001 0.394204 -0.485 0.628014
## MaxHR -0.004912 0.006014 -0.817 0.414069
## ExerciseAnginaY 0.648619 0.281296 2.306 0.021121 *
## Oldpeak 0.286305 0.140361 2.040 0.041373 *
## ST_SlopeFlat 1.596272 0.500953 3.186 0.001440 **
## ST_SlopeUp -0.867916 0.525467 -1.652 0.098594 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 883.97 on 642 degrees of freedom
## Residual deviance: 438.99 on 627 degrees of freedom
## AIC: 470.99
##
## Number of Fisher Scoring iterations: 5
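The coefficients above are on the log-odds scale. Exponentiating them gives odds ratios, which are easier to read clinically. The sketch below does this for a few terms, using estimates and standard errors copied from the summary output. The intervals are Wald approximations (estimate ± 1.96 × SE), an assumption made here for simplicity; profile-likelihood intervals would differ slightly.

```r
# Estimates and standard errors copied from summary(log_model) above.
coefs <- data.frame(
  term     = c("SexM", "ChestPainTypeATA", "FastingBS1",
               "ExerciseAnginaY", "Oldpeak", "ST_SlopeFlat"),
  estimate = c(1.507029, -1.806232, 1.183915, 0.648619, 0.286305, 1.596272),
  std_err  = c(0.323270, 0.385128, 0.321640, 0.281296, 0.140361, 0.500953)
)

# Odds ratio = exp(coefficient); Wald 95% interval = exp(estimate +/- 1.96 * SE).
coefs$odds_ratio <- exp(coefs$estimate)
coefs$ci_low     <- exp(coefs$estimate - 1.96 * coefs$std_err)
coefs$ci_high    <- exp(coefs$estimate + 1.96 * coefs$std_err)
print(coefs[, c("term", "odds_ratio", "ci_low", "ci_high")], digits = 3)

# With the fitted object available, odds ratios for every term:
# exp(coef(log_model))
```

For example, the SexM odds ratio is about 4.5: holding the other predictors fixed, the model estimates the odds of heart disease for men at roughly 4.5 times those for women.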
pred_prob <- predict(log_model, test, type = "response")
pred_class <- ifelse(pred_prob > 0.5, 1, 0)
confusionMatrix(as.factor(pred_class), test$HeartDisease)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 107 16
## 1 16 136
##
## Accuracy : 0.8836
## 95% CI : (0.8397, 0.919)
## No Information Rate : 0.5527
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7647
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8699
## Specificity : 0.8947
## Pos Pred Value : 0.8699
## Neg Pred Value : 0.8947
## Prevalence : 0.4473
## Detection Rate : 0.3891
## Detection Prevalence : 0.4473
## Balanced Accuracy : 0.8823
##
## 'Positive' Class : 0
##
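The matrix above treats class 0 (no heart disease) as the positive class, so the printed sensitivity is the rate at which healthy patients are correctly identified. For screening, the quantity of interest is usually the rate at which diseased patients are caught. The sketch below recomputes both rates with class 1 as positive, using the counts from the printed matrix.

```r
# Counts copied from the confusion matrix above (rows = prediction, cols = reference).
cm <- matrix(
  c(107, 16, 16, 136), nrow = 2,
  dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1"))
)

# Treat class "1" (heart disease) as the positive class.
sens_disease <- cm["1", "1"] / sum(cm[, "1"])  # diseased patients correctly flagged
spec_disease <- cm["0", "0"] / sum(cm[, "0"])  # healthy patients correctly cleared
cat("Sensitivity (class 1):", round(sens_disease, 4), "\n")
cat("Specificity (class 1):", round(spec_disease, 4), "\n")

# caret can report this directly through its `positive` argument:
# confusionMatrix(as.factor(pred_class), test$HeartDisease, positive = "1")
```

The two values are the printed sensitivity and specificity swapped; accuracy and kappa are unaffected by the choice of positive class.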
# 2. K-Nearest Neighbours (KNN)
set.seed(123)
ctrl <- trainControl(method = "cv", number = 5)
knn_model <- train(HeartDisease ~ ., data = train, method = "knn", trControl = ctrl, tuneLength = 10)
knn_model
## k-Nearest Neighbors
##
## 643 samples
## 11 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 515, 514, 514, 515, 514
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.6842781 0.3602564
## 7 0.7029918 0.3981753
## 9 0.7107679 0.4119314
## 11 0.7168968 0.4255808
## 13 0.7029191 0.3961536
## 15 0.7059835 0.4034233
## 17 0.7013324 0.3945442
## 19 0.7029070 0.3980291
## 21 0.7059714 0.4042277
## 23 0.7121972 0.4160053
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 11.
pred_knn <- predict(knn_model, test)
confusionMatrix(pred_knn, test$HeartDisease)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 81 45
## 1 42 107
##
## Accuracy : 0.6836
## 95% CI : (0.6251, 0.7382)
## No Information Rate : 0.5527
## P-Value [Acc > NIR] : 6.249e-06
##
## Kappa : 0.3616
##
## Mcnemar's Test P-Value : 0.8302
##
## Sensitivity : 0.6585
## Specificity : 0.7039
## Pos Pred Value : 0.6429
## Neg Pred Value : 0.7181
## Prevalence : 0.4473
## Detection Rate : 0.2945
## Detection Prevalence : 0.4582
## Balanced Accuracy : 0.6812
##
## 'Positive' Class : 0
##
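The caret output reports "No pre-processing", so KNN distances are computed on raw units. The sketch below uses the first two patients shown in the `glimpse()` output to illustrate how a variable with a wide range can dominate a Euclidean distance. It covers only the continuous columns and is an illustration, not a diagnosis of the model; whether scaling improves accuracy on this data was not tested in this report.

```r
# Continuous values for the first two patients, copied from the glimpse() output.
p1 <- c(Age = 40, RestingBP = 140, Cholesterol = 289, MaxHR = 172, Oldpeak = 0.0)
p2 <- c(Age = 49, RestingBP = 160, Cholesterol = 180, MaxHR = 156, Oldpeak = 1.0)

# Share of the squared Euclidean distance contributed by each variable.
sq_diff <- (p1 - p2)^2
share <- sq_diff / sum(sq_diff)
print(round(share, 4))

# One way to put predictors on a common scale inside caret (not run here):
# train(HeartDisease ~ ., data = train, method = "knn",
#       preProcess = c("center", "scale"), trControl = ctrl, tuneLength = 10)
```

For this pair, Cholesterol accounts for most of the squared distance while Oldpeak contributes almost nothing, even though Oldpeak differs strongly between groups in the t-tests.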
# 3. Clustering (K-Means)
heart_num <- heart %>%
  select(Age, RestingBP, Cholesterol, MaxHR, Oldpeak) %>%
  scale()
fviz_nbclust(heart_num, kmeans, method = "wss")
set.seed(123)
km <- kmeans(heart_num, centers = 3, nstart = 25)
fviz_cluster(km, data = heart_num)
heart$Cluster <- as.factor(km$cluster)
ggplot(heart, aes(x = Cluster, fill = HeartDisease)) +
  geom_bar(position = "fill") +
  labs(title = "Cluster Membership vs Heart Disease")
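The stacked bar chart shows cluster composition visually; a row-wise proportion table gives the same information as numbers that can be quoted in the findings. The sketch below uses made-up labels purely to show the call pattern. The vectors `cluster` and `disease` are hypothetical and are not taken from heart.csv.

```r
# Made-up labels for illustration only; not taken from heart.csv.
cluster <- factor(c(1, 1, 1, 2, 2, 3, 3, 3))
disease <- factor(c(0, 0, 1, 1, 1, 0, 1, 1))

# Row-wise proportions: each cluster's row sums to 1.
props <- prop.table(table(Cluster = cluster, HeartDisease = disease), margin = 1)
print(round(props, 2))

# On the report's data the same call would be:
# prop.table(table(heart$Cluster, heart$HeartDisease), margin = 1)
```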
# Reuse the logistic regression fitted above (log_model) rather than refitting it
# Predict probabilities on the test set
pred_prob <- predict(log_model, newdata = test, type = "response")
# ROC curve
pred <- prediction(pred_prob, test$HeartDisease)
perf <- performance(pred, "tpr", "fpr")
plot(perf, main="ROC Curve: Logistic Regression", col="blue")
abline(a=0, b=1, lty=2, col="red")
# AUC
auc <- performance(pred, "auc")@y.values[[1]]
cat("AUC (Logistic Regression):", round(auc, 3), "\n")
## AUC (Logistic Regression): 0.946
# Accuracy
pred_class <- ifelse(pred_prob > 0.5, 1, 0)
acc <- mean(pred_class == test$HeartDisease)
cat("Accuracy (Logistic Regression):", round(acc, 3))
## Accuracy (Logistic Regression): 0.884
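All five continuous variables (Age, RestingBP, Cholesterol, MaxHR, and Oldpeak) differ significantly between groups (Welch t-tests, all p < 0.001).
ChestPainType, ExerciseAngina, and ST_Slope are strongly associated with heart disease (chi-square tests, p < 2.2e-16); RestingECG shows a weaker association (p = 0.004).
Logistic Regression gives interpretable results, with a test accuracy of 0.884 and an AUC of 0.946.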
The Heart Attack Risk Analysis project successfully explored, analyzed, and modeled clinical and demographic data to identify key factors influencing heart disease risk. Through data cleaning, exploratory analysis, statistical testing, and predictive modeling, the following insights were derived:
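Demographic Impact: Patients with heart disease are older on average (mean age 55.9 versus 50.6 years), and male sex is associated with higher odds of heart disease in the logistic model (coefficient 1.51, p < 0.001).
Clinical Indicators: Oldpeak and Max Heart Rate differ significantly between groups: patients with heart disease have a higher mean Oldpeak (1.27 versus 0.41) and a lower mean Max Heart Rate (127.7 versus 148.2). Mean Cholesterol is also significantly different but lower in the heart disease group (175.9 versus 227.1), a counter-intuitive direction that should be interpreted with caution.
Categorical Associations: Chest Pain Type, Exercise Angina, and ST_Slope are strongly associated with heart disease occurrence (chi-square tests, p < 2.2e-16 for each).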
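Model Insights:
Logistic Regression provided interpretable results with a test accuracy of 0.884 and an AUC of 0.946, making it suitable for clinical screening applications.
K-Nearest Neighbors (KNN) performed markedly worse (test accuracy 0.684 with k = 11, versus 0.884 for Logistic Regression), possibly because the predictors were not scaled before distances were computed, and it is also less interpretable.
Clustering analysis (K-Means with k = 3) partitioned patients into three groups based on the scaled continuous variables; the share of heart disease in each group was then compared.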
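In summary, the study highlights chest pain type, ST slope, sex, fasting blood sugar, exercise-induced angina, and Oldpeak as the strongest predictors of heart disease in the logistic model. Age differs between groups in the univariate test but is not significant once the other predictors are included (p = 0.32). Logistic Regression emerges as the most practical and explainable model for early risk detection and clinical deployment.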
This project demonstrates how data-driven modeling and visualization can support preventive healthcare, enabling medical professionals to identify at-risk individuals and promote early intervention strategies.