Analysis of Cardiovascular Disease Risk Factors Using Machine Learning
Author
Andreina Arias
Abstract:
Cardiovascular disease (CVD) remains a leading cause of mortality worldwide, highlighting the urgent need for effective early detection and prevention strategies. This project applies data science and machine learning techniques to analyze and predict the presence of cardiovascular disease using the UCI Heart Disease dataset. The dataset includes key patient attributes such as age, sex, chest pain type, blood pressure, cholesterol levels, and other clinical measurements, providing a comprehensive basis for predictive modeling. The study involves data preprocessing, exploratory data analysis, and feature engineering to better understand the relationships between variables and CVD outcomes. Three machine learning models (logistic regression, a decision tree, and a random forest ensemble) are developed and evaluated using accuracy, precision, recall, F1-score, and ROC-AUC. Correlation analysis is also conducted to identify the most influential risk factors contributing to CVD.
The results aim to deliver an accurate and interpretable predictive model capable of identifying high-risk individuals. The analysis also provides insights into key determinants of cardiovascular disease, supporting data-driven decision making in clinical settings. This project demonstrates the potential of machine learning to enhance early diagnosis and contribute to more effective prevention and management of cardiovascular disease.
Introduction:
Background and Problem
Cardiovascular disease continues to be one of the leading causes of mortality worldwide. The World Health Organization (WHO) states that cardiovascular diseases are responsible for millions of deaths annually, with many cases being preventable through early detection and lifestyle changes. Data science and machine learning can be powerful tools for identifying patterns in medical data that may not be easily detected through traditional statistical methods. Although clinical screening tools are available, many patients remain undiagnosed or are diagnosed too late. There is a need for accurate predictive models and a better understanding of key risk factors. This project therefore aims to analyze cardiovascular disease features and develop predictive models to identify individuals who are at high risk.
Objective
The primary objective of this project is to develop and evaluate machine learning models to predict the risk of cardiovascular disease based on demographic, lifestyle, and clinical risk factors.
Research Questions
· Which demographic, behavioral, and clinical variables contribute most to cardiovascular disease risk?
· How do different algorithms compare on standard metrics (e.g., accuracy, precision, recall)?
· Can an interpretable model suitable for clinical insight be produced?
Methodology
This project uses a publicly available version of the UCI Heart Disease dataset obtained from Kaggle. The variables in the dataset are age, sex, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar, resting ECG results, maximum heart rate achieved, exercise-induced angina, ST depression, ST slope, and presence of heart disease. I handled missing data, encoded categorical features, and split the data into training and test sets. I performed a correlation analysis and used logistic regression, decision tree, and random forest as my models. The models were evaluated by comparing their accuracy, precision, recall, F1-score, and ROC-AUC.
Expected Contribution
· Identify the most significant clinical and demographic risk factors.
· Compare performance of machine learning models for cardiovascular disease risk.
· Offer interpretable predictions that can support preventive care.
Machine learning (ML) has significantly improved the prediction and diagnosis of CVD. Numerous studies have explored different algorithms, datasets, and feature selection techniques to enhance predictive accuracy and clinical applicability.
Pal et al. (2022) investigated multiple machine learning classifiers for cardiovascular disease prediction and demonstrated that ensemble methods and hybrid approaches often outperform traditional statistical models in terms of accuracy and reliability. Their study highlights the importance of selecting appropriate algorithms and preprocessing techniques when dealing with clinical datasets. Similarly, Mim et al. (2025) reported that ensemble techniques outperform individual models due to their ability to capture complex, nonlinear relationships among features. These findings suggest that model selection plays a critical role in optimizing predictive performance.
Several systematic reviews provide a broader understanding of the field. Ahsan and Siddique (2022) analyzed over 400 research papers and found that machine learning models can effectively detect heart disease using clinical and ECG data. However, the authors emphasized challenges such as imbalanced datasets and lack of interpretability, which limit real-world clinical adoption. Likewise, Liu et al. (2025) emphasized that while electronic health record (EHR) based models enable large-scale risk prediction, issues related to data quality, heterogeneity, and generalizability remain barriers. Together, these studies indicate that although machine learning models are highly capable, their clinical applicability is still constrained by data-related challenges.
More recent studies highlight the evolution of advanced techniques. Banerjee and Paçal (2025) reported that although ML models achieve high predictive accuracy, issues such as data quality and model transparency hinder their integration into clinical practice. Haq et al. (2026) further highlighted the effectiveness of deep learning in imaging-based diagnosis, while also pointing out its increased computational demands and the “black-box” nature of these models. This creates a trade-off between performance and explainability, which remains a central issue in the field.
Meta-analytic evidence provides additional support for the effectiveness of ML approaches. Krittanawong et al. (2020) showed that algorithms such as random forests and support vector machines (SVM) achieve strong predictive performance across large and diverse patient populations. However, the study also suggested variability in outcomes depending on dataset composition and feature engineering, reinforcing the importance of methodological consistency.
Experimental studies offer more granular insights into algorithm performance. Ingole et al. (2024) found that SVMs achieved high accuracy in heart disease prediction, demonstrating their robustness on high-dimensional clinical data. In contrast, Osei-Nkwantabisa and Ntumy (2024) reported that k-nearest neighbors (KNN) outperformed other models on the UCI Heart Disease dataset. These contrasting findings highlight a key concern: model performance is highly context dependent, varying with dataset characteristics, preprocessing methods, and feature selection techniques. This lack of consistency makes it difficult to identify a universally optimal model.
Beyond traditional ML approaches, alternative methods have also been explored. El Massari et al. (2024) demonstrated that ontology-based models can enhance both prediction accuracy and interpretability by incorporating domain knowledge. However, such approaches are less commonly used and require further validation in real-world clinical settings.
Overall, the literature demonstrates that ML techniques are highly effective for cardiovascular disease prediction, particularly when applied to structured datasets such as the UCI Heart Disease dataset. However, several critical challenges persist, including data imbalance, lack of interpretability, variability in model performance, and limited real-world clinical deployment. Many studies focus primarily on maximizing accuracy without adequately addressing explainability or practical implementation.
To address these limitations, this study focuses on identifying the most influential risk factors, systematically comparing algorithm performance, and developing interpretable models suitable for clinical use. Specifically, it investigates the relative importance of demographic, behavioral, and clinical variables in CVD prediction, evaluates the performance of multiple ML algorithms using standard metrics, and develops an interpretable model that balances predictive accuracy with clinical relevance.
Methodology:
This study used the UCI Heart Disease dataset (Detrano et al., 1988), accessed through a publicly available version on Kaggle (SONY, n.d.). The dataset contains clinical and demographic information collected from multiple institutions, including the Hungarian Institute of Cardiology, University Hospital Zurich, University Hospital Basel, the V.A. Medical Center in Long Beach, and the Cleveland Clinic Foundation. It includes variables such as age, sex, chest pain type, cholesterol levels, resting blood pressure, and other clinical indicators relevant to cardiovascular disease diagnosis.
The dataset consisted of 920 observations (patient records) and 16 variables (demographic and clinical variables).
Data Preprocessing
Initial data exploration was conducted to understand the structure and distribution of variables.
Figure 1: Dataset
The column information:
id: unique id for each patient
age: age of the patient in years
sex: Male/Female
dataset: place of study
cp: chest pain type [typical angina, atypical angina, non-anginal, asymptomatic]
trestbps: resting blood pressure (in mm Hg on admission to the hospital)
chol: serum cholesterol in mg/dl
fbs: whether fasting blood sugar > 120 mg/dl
restecg: resting electrocardiographic results [normal, stt abnormality, lv hypertrophy]
thalch: maximum heart rate achieved
exang: exercise-induced angina (True/False)
oldpeak: ST depression induced by exercise relative to rest
slope: the slope of the peak exercise ST segment
ca: number of major vessels (0-3) colored by fluoroscopy
thal: [normal; fixed defect; reversible defect]
num: the predicted attribute. 0 indicates no heart disease (absence of disease); values 1-4 indicate increasing severity (1: mild, 2: moderate, 3: severe, 4: very severe heart disease).
Missing values were identified across several columns. To address this issue, a simple imputation strategy was applied:
· Numeric variables were imputed using the median value
· Categorical variables were imputed using the mode (most frequent value)
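This approach keeps every record available for model training and leaves the center of each variable's distribution unchanged, although median and mode imputation can understate variability in columns with many missing values.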
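The imputation step described above can be sketched in base R. This is a minimal illustration, not the code used in the study: the toy data frame and the helper names `stat_mode` and `impute_simple` are hypothetical, and only the column names are taken from the dataset.

```r
# Hypothetical toy data standing in for the heart disease data frame;
# column names follow the dataset, values are made up for illustration.
df <- data.frame(
  trestbps = c(120, NA, 140, 130, NA),
  chol     = c(200, 240, NA, 180, 220),
  fbs      = c("TRUE", "FALSE", NA, "FALSE", "FALSE"),
  stringsAsFactors = FALSE
)

# Most frequent non-missing value (hypothetical helper; table() drops NA by default)
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

# Median for numeric columns, mode for all other columns (hypothetical helper)
impute_simple <- function(data) {
  for (col in names(data)) {
    missing <- is.na(data[[col]])
    if (!any(missing)) next
    if (is.numeric(data[[col]])) {
      data[[col]][missing] <- median(data[[col]], na.rm = TRUE)
    } else {
      data[[col]][missing] <- stat_mode(data[[col]])
    }
  }
  data
}

df_imputed <- impute_simple(df)
colSums(is.na(df_imputed))  # every column should now report zero missing values
```

In this toy example the two missing trestbps values become 130, the missing chol value becomes 210, and the missing fbs value becomes "FALSE".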
The target variable (num), which originally represented multiple stages of heart disease severity (0-4), was transformed into a binary classification variable where:
· 0 = No heart disease
· 1 = Presence of heart disease (any severity level)
Additionally, categorical variables such as sex, chest pain type, fasting blood sugar, electrocardiographic results, exercise-induced angina, and slope were converted into factor variables to ensure proper handling by the machine learning algorithms.
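A minimal sketch of the target recoding and factor conversion is shown here, assuming a data frame with the dataset's `num`, `sex`, and `cp` columns. The rows and the new column name `target` are hypothetical and chosen only for illustration.

```r
# Hypothetical toy rows; `num` uses the 0-4 severity coding described above
df <- data.frame(
  num = c(0, 1, 2, 0, 4, 3),
  sex = c("Male", "Female", "Male", "Male", "Female", "Male"),
  cp  = c("asymptomatic", "non-anginal", "typical angina",
          "atypical angina", "asymptomatic", "non-anginal"),
  stringsAsFactors = FALSE
)

# Collapse severity levels 1-4 into a single "disease present" class
df$target <- factor(ifelse(df$num > 0, 1, 0), levels = c(0, 1))

# Convert categorical predictors to factors
categorical_cols <- c("sex", "cp")
df[categorical_cols] <- lapply(df[categorical_cols], factor)

table(df$target)  # class counts after recoding
str(df)           # confirms which columns are now factors
```

Keeping `target` as a factor with explicit levels makes the positive class unambiguous for classification functions that treat the second level as the event of interest.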
Data Test and Train Split
The dataset was divided into training (80%) and testing (20%) subsets using stratified sampling to preserve the class distribution in both subsets. A fixed random seed was set to ensure reproducibility of results.
Model validation followed the hold-out approach: models were fitted on the training subset, and performance was evaluated on the unseen test subset to assess generalization capability.
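The stratified 80/20 split can be illustrated with base R alone. This is a sketch under stated assumptions: the seed value, the simulated labels, and the helper `stratified_split` are hypothetical. The appendix loads caret, whose `createDataPartition()` provides a comparable stratified split, but the report does not state which function was used.

```r
set.seed(123)  # assumed seed value; the report does not state which seed was used

# Hypothetical labels with a 60/40 class balance
y <- factor(rep(c(0, 1), times = c(60, 40)))

# Sample a fixed fraction of row indices within each class (hypothetical helper)
stratified_split <- function(labels, train_frac = 0.8) {
  idx_by_class <- split(seq_along(labels), labels)
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(train_frac * length(idx)))]
  }), use.names = FALSE)
  sort(train_idx)
}

train_idx <- stratified_split(y)
test_idx  <- setdiff(seq_along(y), train_idx)

# Class proportions should match in both subsets
prop.table(table(y[train_idx]))
prop.table(table(y[test_idx]))
```

Sampling within each class, rather than across all rows at once, is what keeps the share of positive cases the same in the training and test subsets.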
Exploratory Data Analysis
A correlation analysis was performed on numerical variables to identify relationships among features. A correlation matrix was visualized to assess potential multicollinearity and to better understand which variables may influence CVD risk.
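A correlation matrix and a simple multicollinearity screen might look like the sketch below. The data are simulated, so the coefficients do not reproduce the study's values; the 0.7 cutoff is the one quoted in the Results section, and the variable names are borrowed from the dataset.

```r
set.seed(1)  # illustrative seed

# Simulated numeric columns; names follow the dataset, values are not real
n   <- 200
age <- rnorm(n, mean = 54, sd = 9)
num_df <- data.frame(
  age      = age,
  trestbps = 100 + 0.6 * age + rnorm(n, sd = 15),
  thalch   = 200 - 1.0 * age + rnorm(n, sd = 20)
)

# Pearson correlations; pairwise deletion tolerates missing values
cor_mat <- cor(num_df, use = "pairwise.complete.obs")
round(cor_mat, 2)

# List variable pairs whose absolute correlation exceeds 0.7
high <- which(abs(cor_mat) > 0.7 & upper.tri(cor_mat), arr.ind = TRUE)
data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
```

An empty result from the final step corresponds to the situation reported in the Results, where no pair of predictors is strongly correlated.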
Model Development
To address the research questions, three machine learning models were implemented:
1. Logistic Regression: A baseline statistical model was developed using a generalized linear model with a binomial distribution.
2. Decision Tree: A decision tree classifier was constructed to capture nonlinear relationships and interactions between variables. The model structure was visualized to enhance interpretability and provide insight into decision rules.
3. Random Forest: An ensemble learning method was implemented using multiple decision trees to improve predictive performance and reduce overfitting. The model aggregates predictions from 100 trees to produce more robust results.
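The three model types listed above could be fitted along the following lines. This is a hedged sketch on simulated data, not the study's code: the formula, the two predictors, and the seed are assumptions, and the tree and forest steps run only if the rpart and randomForest packages are installed. Predictions are made on the training data purely to show the calls; the study evaluates on a held-out test set.

```r
set.seed(42)  # illustrative seed

# Simulated stand-in data: two predictors named after dataset columns
n <- 300
train <- data.frame(
  oldpeak = rexp(n, rate = 1),
  thalch  = rnorm(n, mean = 140, sd = 25)
)
logit <- -1 + 1.2 * train$oldpeak - 0.03 * (train$thalch - 140)
train$target <- factor(rbinom(n, 1, plogis(logit)), levels = c(0, 1))

# 1. Logistic regression: generalized linear model with a binomial family
fit_glm <- glm(target ~ oldpeak + thalch, data = train, family = binomial)
p_glm   <- predict(fit_glm, newdata = train, type = "response")

# 2. Decision tree (skipped if the rpart package is not installed)
if (requireNamespace("rpart", quietly = TRUE)) {
  fit_tree <- rpart::rpart(target ~ oldpeak + thalch, data = train, method = "class")
  p_tree   <- predict(fit_tree, newdata = train, type = "prob")[, "1"]
}

# 3. Random forest with 100 trees (skipped if randomForest is not installed)
if (requireNamespace("randomForest", quietly = TRUE)) {
  fit_rf <- randomForest::randomForest(target ~ oldpeak + thalch,
                                       data = train, ntree = 100)
  p_rf   <- predict(fit_rf, newdata = train, type = "prob")[, "1"]
}

coef(fit_glm)  # log-odds coefficients of the baseline model
```

Each model returns a predicted probability of the positive class, which allows the same threshold-based metrics and ROC analysis to be applied to all three.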
Model Evaluation
Model performance was evaluated on the test dataset using several standard classification metrics:
· Accuracy: Overall correctness of predictions.
· Precision: Proportion of true positive predictions among all positive predictions.
· Recall (sensitivity): Ability to correctly identify positive cases.
· F1 Score: Harmonic mean of precision and recall.
· ROC-AUC (Receiver Operating Characteristic, Area Under the Curve): Measures the model’s ability to distinguish between classes.
A confusion matrix was generated for each model to assess classification performance. Additionally, ROC curves were plotted to visually compare the performance of the models, particularly for decision tree and random forest classifiers.
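The metrics listed above can be computed directly from confusion matrix counts, as in this small worked example. The labels, probabilities, and the 0.5 threshold are hypothetical, and `auc_rank` is an illustrative helper based on the rank (Mann-Whitney) formulation; the appendix loads pROC, which offers `roc()` and `auc()` for the same purpose.

```r
# Hypothetical test-set labels (1 = disease) and predicted probabilities
actual <- c(1, 1, 1, 1, 0, 0, 0, 0, 1, 0)
prob   <- c(0.9, 0.8, 0.4, 0.7, 0.3, 0.6, 0.2, 0.1, 0.55, 0.35)

# Classify at an assumed 0.5 threshold
pred <- as.integer(prob >= 0.5)

# Confusion matrix counts
tp <- sum(pred == 1 & actual == 1)
fp <- sum(pred == 1 & actual == 0)
fn <- sum(pred == 0 & actual == 1)
tn <- sum(pred == 0 & actual == 0)

accuracy  <- (tp + tn) / length(actual)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

# ROC-AUC via the rank (Mann-Whitney) formulation (hypothetical helper)
auc_rank <- function(actual, prob) {
  r     <- rank(prob)  # average ranks handle tied probabilities
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
auc <- auc_rank(actual, prob)

round(c(accuracy = accuracy, precision = precision,
        recall = recall, f1 = f1, auc = auc), 3)
```

Here accuracy, precision, recall, and F1 all equal 0.8, while the AUC is 0.92. The gap shows how a model can rank cases well across thresholds without that advantage appearing at one fixed cutoff.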
Research Alignment
This methodology directly addresses the study’s research questions by:
· Identifying important predictors through model training and feature relationships.
· Incorporating interpretable models (logistic regression and decision tree) alongside a high performance ensemble model (random forest).
Results:
In addition to model evaluation, a correlation analysis was conducted to examine relationships among numerical variables, and the correlation matrix revealed several notable patterns. Age showed a moderate positive correlation with resting blood pressure and a weaker relationship with cholesterol levels, suggesting that cardiovascular risk factors tend to increase with age. Maximum heart rate (thalch) exhibited a negative correlation with age, indicating that older individuals generally achieve lower heart rates during exercise.
Figure 2: Correlation Matrix of Numerical Variables
Furthermore, oldpeak (ST depression) showed a negative relationship with maximum heart rate and a positive association with other risk-related variables, suggesting its relevance as an indicator of cardiac stress. The number of major vessels (ca) also demonstrated mild correlations with several variables, supporting its role as an important clinical feature.
Overall, the correlations were weak to moderate, and no strong correlations (|r| > 0.7) were observed, indicating low multicollinearity among predictors. This suggests that the variables contribute unique information to the models, supporting their inclusion in the ML analysis.
The performance of the three ML models (logistic regression, decision tree, and random forest) was evaluated using accuracy, precision, recall, F1-score, and ROC-AUC. The results are presented in Table 1.
Table 1: Results
Figure 3: ROC Plot
The results suggest that the decision tree model provided the most balanced classification performance across the evaluation metrics, while the random forest model demonstrated the strongest overall discriminative ability. Logistic regression showed lower predictive performance than the tree-based models but remained useful because of its interpretability. The findings indicate that nonlinear machine learning methods may capture the complex relationships among cardiovascular risk factors better than traditional linear models. Overall, the models were able to distinguish between patients with and without cardiovascular disease with relatively strong predictive capability.
The feature analysis indicated that variables such as chest pain type, maximum heart rate, ST depression (oldpeak), number of major vessels (ca), and exercise-induced angina were among the strongest predictors of cardiovascular disease. Age and resting blood pressure also showed meaningful relationships with disease presence, although their effects were less pronounced than those of the clinical stress-test variables. These findings are consistent with previous cardiovascular disease research identifying both physiological and exercise-related measurements as key indicators of cardiac risk.
Conclusion:
This study investigated cardiovascular disease risk prediction using multiple machine learning algorithms and examined the contribution of clinical and demographic variables. Features such as chest pain type, exercise-induced angina, ST depression, maximum heart rate achieved, and the number of major vessels appeared to play an important role in predicting disease presence. These variables are clinically relevant because they reflect cardiac function and physiological stress responses commonly associated with cardiovascular complications.
The results of this study demonstrated that the decision tree model achieved the highest overall classification performance, outperforming both logistic regression and random forest in accuracy, precision, recall, and F1-score. This suggests that simpler, interpretable models can be effective when the dataset contains well-defined patterns. The random forest had the highest ROC-AUC, indicating superior discriminative ability across classification thresholds. This highlights its strength in capturing complex relationships, even when it does not achieve the highest accuracy at a fixed threshold.
The logistic regression model had the lowest accuracy but still demonstrated reasonable performance and remains valuable due to its interpretability and simplicity.
In relation to the research questions:
· Key variables contributing to the cardiovascular disease risk were effectively captured by all models, supported by correlation analysis and model performance.
· The comparison of algorithms showed that decision trees performed best overall, while random forest provided stronger probabilistic discrimination.
· An interpretable model was successfully developed, with the decision tree offering both high accuracy and clear decision rules suitable for clinical insight.
In conclusion, this study highlights that model selection should consider both predictive performance and interpretability, particularly in healthcare applications. Future work should explore hybrid approaches that combine the strengths of interpretable and high-performance models.
Appendix:
Code for this study: https://rpubs.com/Andreina-A/1423457
Bibliography
Ahsan, M. M., & Siddique, Z. (2022). Machine learning-based heart disease diagnosis: A systematic literature review. Retrieved from Artificial Intelligence in Medicine: https://doi.org/10.1016/j.artmed.2022.102289
Banerjee, T., & Paçal, İ. (2025). A systematic review of machine learning in heart disease prediction. Retrieved from Turkish Journal of Biology: https://doi.org/10.55730/1300-0152.2766
Detrano, R., Janosi, A., Steinbrunn, W., Pfisterer, M., Schmid, J., Sandhu, S., Guppy, K., Lee, S., & Froelicher, V. (1988). Heart disease dataset [Data set]. Retrieved from UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Heart+Disease
El Massari, Y., Bensaid, A., & Ouhbi, S. (2024). The impact of ontology on the prediction of cardiovascular disease compared to machine learning algorithms. Retrieved from arXiv: https://arxiv.org/abs/2405.20414
Haq, I., Liang, H., Zeng, K., Wang, T., Uddin, I., Lin, J., Kang, Y., & Huang, B. (2026). Deep learning advancements for cardiovascular diseases (CVDs) diagnosis: Imaging modalities, challenges, and future perspectives. Retrieved from Biomedical Signal Processing and Control: https://doi.org/10.1016/j.bspc.2026.109899
Ingole, V., Patil, S., Deshmukh, A., & Kulkarni, P. (2024). Advancements in heart disease prediction: A machine learning approach for early detection and risk assessment. Retrieved from arXiv: https://arxiv.org/abs/2410.14738
Krittanawong, C., Johnson, K. W., Rosenson, R. S., Wang, Z., Aydar, M., & Halperin, J. L. (2020). Machine learning prediction in cardiovascular diseases: A meta-analysis. Retrieved from Scientific Reports: https://doi.org/10.1038/s41598-020-72685-1
Liu, T., Krentz, A. J., Huo, Z., & Ćurčin, V. (2025). Opportunities and challenges of cardiovascular disease risk prediction for primary prevention using machine learning and electronic health records: A systematic review. Retrieved from Reviews in Cardiovascular Medicine: https://doi.org/10.31083/RCM37443
Mim, F. N., Rahman, M. S., Islam, M. R., & Hossain, M. A. (2025). Machine learning approaches for cardiovascular disease prediction: A comparative study. Retrieved from International Journal of Data Science and Analytics: https://doi.org/10.1007/s44174-025-00564-2
Osei-Nkwantabisa, G., & Ntumy, E. (2024). Classification and prediction of heart diseases using machine learning algorithms. Retrieved from arXiv: https://arxiv.org/abs/2409.03697
Pal, M., Parija, S., Panda, G., Dhama, K., & Mohapatra, R. K. (2022). Risk prediction of cardiovascular disease using machine learning classifiers. Retrieved from PMC PubMed Central: https://pmc.ncbi.nlm.nih.gov/articles/PMC9206502/
SONY, R. (n.d.). Heart disease data [Data set]. Retrieved from Kaggle: https://www.kaggle.com/datasets/redwankarimsony/heart-disease-data
# Load libraries
library(tidyverse)
library(caret)
library(randomForest)
library(pROC)
Columns: id (unique id for each patient); age (age of the patient in years); dataset (place of study); sex (Male/Female); cp (chest pain type: typical angina, atypical angina, non-anginal, asymptomatic); trestbps (resting blood pressure in mm Hg on admission to the hospital); chol (serum cholesterol in mg/dl); fbs (whether fasting blood sugar > 120 mg/dl); restecg (resting electrocardiographic results: normal, stt abnormality, lv hypertrophy); thalch (maximum heart rate achieved); exang (exercise-induced angina, True/False); oldpeak (ST depression induced by exercise relative to rest); slope (the slope of the peak exercise ST segment); ca (number of major vessels, 0-3, colored by fluoroscopy); thal (normal, fixed defect, reversible defect); num (the predicted attribute: 0 = no heart disease; the exact meaning of values 1 through 4 can depend on the specific dataset, but they generally indicate the extent or severity of the disease: 1 = mild, 2 = moderate, 3 = severe, 4 = very severe heart disease).
Citation Request: The authors of the databases have requested that any publications resulting from the use of the data include the names of the principal investigator responsible for the data collection at each institution. They would be:
Hungarian Institute of Cardiology, Budapest: Andras Janosi, M.D.
University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.
# Check for missing values
colSums(is.na(df))
      id      age      sex  dataset       cp trestbps     chol      fbs
       0        0        0        0        0       59       30       90
 restecg   thalch    exang  oldpeak    slope       ca     thal      num
       0       55       55       62        0      611        0        0
Seven columns have missing data. I used a simple imputation in which the median replaces missing numeric values and the mode replaces missing categorical values.
      id      age      sex  dataset       cp trestbps     chol      fbs
       0        0        0        0        0        0        0        0
 restecg   thalch    exang  oldpeak    slope       ca     thal      num
       0        0        0        0        0        0        0        0
I converted the target variable “num” into a binary variable: instead of using all heart disease stages, no heart disease is coded as 0 and all other stages are coded as 1, indicating heart disease.