Analysis of Cardiovascular Disease Risk Factors Using Machine Learning

Author

Andreina Arias

Abstract:

Cardiovascular disease (CVD) remains a leading cause of mortality worldwide, highlighting the urgent need for effective early detection and prevention strategies. This project applies data science and machine learning techniques to analyze and predict the presence of cardiovascular disease using the UCI Heart Disease dataset. The dataset includes key patient attributes such as age, sex, chest pain type, blood pressure, cholesterol levels, and other clinical measurements, providing a comprehensive basis for predictive modeling. The study involves data preprocessing, exploratory data analysis, and feature engineering to better understand the relationships between variables and CVD outcomes. Three machine learning models (logistic regression, a decision tree, and a random forest ensemble) were developed and evaluated using accuracy, precision, recall, F1-score, and ROC-AUC. Correlation analysis was also conducted to identify the most influential risk factors contributing to CVD.

The results aim to deliver an accurate and interpretable predictive model capable of identifying high-risk individuals. The analysis also provides insight into key determinants of cardiovascular disease, supporting data-driven decision-making in clinical settings. This project demonstrates the potential of machine learning to enhance early diagnosis and contribute to more effective prevention and management of cardiovascular disease.

Introduction:

Background and Problem

Cardiovascular disease continues to be one of the leading causes of mortality worldwide. According to the World Health Organization (WHO), cardiovascular diseases are responsible for millions of deaths annually, and many cases are preventable through early detection and lifestyle changes. Data science and machine learning can be powerful tools for identifying patterns in medical data that may not be easily detected through traditional statistical methods. Although clinical screening tools are available, many patients remain undiagnosed or are diagnosed too late, so there is a need for accurate predictive models and a better understanding of key risk factors. This project aims to analyze cardiovascular disease features and develop predictive models to identify individuals who are at high risk.

Objective

The primary objective of this project is to develop and evaluate machine learning models to predict the risk of cardiovascular disease based on demographic, lifestyle, and clinical risk factors.

Research Questions

· Which demographic, behavioral, and clinical variables contribute most to cardiovascular disease risk?

· How do different algorithms compare on standard evaluation metrics (e.g., accuracy, precision, recall)?

· Can an interpretable model suitable for clinical insight be produced?

Methodology

This project uses a publicly available version of the UCI Heart Disease dataset from Kaggle. The variables in the dataset include age, sex, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar, resting ECG results, maximum heart rate achieved, exercise-induced angina, ST depression, ST slope, and presence of heart disease. I handled missing data, encoded categorical features, and split the data into training and test sets. I then performed a correlation analysis and trained logistic regression, decision tree, and random forest models. The models were evaluated by comparing their accuracy, precision, recall, F1-score, and ROC-AUC.

Expected Contribution

· Identify most significant clinical and demographic risk factors.

· Compare performance of machine learning models for cardiovascular disease risk.

· Offer interpretable predictions that can support preventive care.

Keywords:

Cardiovascular, Machine Learning, Healthcare, Disease prevention.

Literature Review:

Machine learning (ML) has significantly improved the prediction and diagnosis of CVD. Numerous studies have explored different algorithms, datasets, and feature selection techniques to enhance predictive accuracy and clinical applicability.

Pal et al. (2022) investigated multiple machine learning classifiers for cardiovascular disease prediction and demonstrated that ensemble methods and hybrid approaches often outperform traditional statistical models in terms of accuracy and reliability. Their study highlights the importance of selecting appropriate algorithms and preprocessing techniques when dealing with clinical datasets. Similarly, Mim et al. (2025) reported that ensemble techniques outperform individual models due to their ability to capture complex, nonlinear relationships among features. These findings suggest that model selection plays a critical role in optimizing predictive performance.

Several systematic reviews provide a broader view of the field. Ahsan and Siddique (2022) analyzed over 400 research papers and found that machine learning models can effectively detect heart disease using clinical and ECG data. However, the authors emphasized challenges such as imbalanced datasets and a lack of interpretability, which limit real-world clinical adoption. Likewise, Liu et al. (2025) emphasized that while electronic health record (EHR) based models enable large-scale risk prediction, issues related to data quality, heterogeneity, and generalizability remain barriers. Together, these studies indicate that although machine learning models are highly capable, their clinical applicability is still constrained by data-related challenges.

More recent studies trace the evolution of advanced techniques. Banerjee and Paçal (2025) reported that although ML models achieve high predictive accuracy, issues such as data quality and model transparency hinder their integration into clinical practice. Haq et al. (2026) further highlighted the effectiveness of deep learning in imaging-based diagnosis, while also pointing out increased computational demands and the “black-box” nature of these models. This creates a trade-off between performance and explainability, which remains a central issue in the field.

Meta-analytic evidence provided additional support for the effectiveness of ML approaches. Krittanawong et al. (2020) showed that algorithms such as random forests and support vector machines (SVM) achieve strong predictive performance across large and diverse patient populations. However, the study also suggested variability in outcomes depending on dataset composition and feature engineering, reinforcing the importance of methodological consistency.

Experimental studies offered more granular insight into algorithm performance. Ingole et al. (2024) found that SVMs achieved high accuracy in heart disease prediction, demonstrating their robustness on high-dimensional clinical data. In contrast, Osei-Nkwantabisa and Ntumy (2024) reported that k-nearest neighbors (KNN) outperformed other models on the UCI Heart Disease dataset. These contrasting findings highlight a key concern: model performance is highly context dependent, varying with dataset characteristics, preprocessing methods, and feature selection techniques. This lack of consistency makes it difficult to identify a universally optimal model.

Beyond traditional ML approaches, alternative methods have also been explored. El Massari et al. (2024) demonstrated that ontology-based models can enhance both prediction accuracy and interpretability by incorporating domain knowledge. However, such approaches are less commonly used and require further validation in real-world clinical settings.

Overall, the literature demonstrates that ML techniques are highly effective for cardiovascular disease prediction, particularly when applied to structured datasets such as the UCI Heart Disease dataset. However, several critical challenges persist, including data imbalance, lack of interpretability, variability in model performance, and limited real-world clinical deployment. Many studies focus primarily on maximizing accuracy without adequately addressing explainability or practical implementation.

To address these limitations, this study investigates the relative importance of demographic, behavioral, and clinical variables in CVD prediction, evaluates the performance of multiple ML algorithms using standard metrics, and develops an interpretable model that balances predictive accuracy with clinical relevance.

Methodology:

This study used the UCI Heart Disease dataset (Detrano et al., 1988), accessed through a publicly available version on Kaggle (SONY, n.d.). The dataset contains clinical and demographic information collected from multiple institutions: the Hungarian Institute of Cardiology, University Hospital Zurich, University Hospital Basel, the V.A. Medical Center in Long Beach, and the Cleveland Clinic Foundation. It includes variables such as age, sex, chest pain type, cholesterol levels, resting blood pressure, and other clinical indicators relevant to cardiovascular disease diagnosis.

The dataset consisted of 920 observations (patient records) and 16 variables (demographic and clinical variables).

Data Preprocessing

Initial data exploration was conducted to understand the structure and distribution of variables.

Figure 1: Dataset

The column information:

id (unique id for each patient)

age (age of the patient in years)

dataset (place of study)

sex (Male/Female)

cp (chest pain type: typical angina, atypical angina, non-anginal, asymptomatic)

trestbps (resting blood pressure in mm Hg on admission to the hospital)

chol (serum cholesterol in mg/dl)

fbs (whether fasting blood sugar > 120 mg/dl; True/False)

restecg (resting electrocardiographic results: normal, stt abnormality, lv hypertrophy)

thalch (maximum heart rate achieved)

exang (exercise-induced angina; True/False)

oldpeak (ST depression induced by exercise relative to rest)

slope (slope of the peak exercise ST segment)

ca (number of major vessels, 0-3, colored by fluoroscopy)

thal (normal, fixed defect, reversible defect)

num (the predicted attribute: 0 = no heart disease; values 1-4 indicate increasing severity, with 1 = mild, 2 = moderate, 3 = severe, and 4 = very severe heart disease)

Missing values were identified across several columns. To address this issue a simple imputation strategy was applied:

· Numeric Variables were imputed using the median values

· Categorical variables were imputed using the mode (most frequent value)

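This approach keeps all 920 records available for model training. It is a simplification: for variables with many missing values, such as ca (611 of 920 values missing), filling in a single value compresses the variable's distribution, so results involving those variables should be read with caution.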
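The imputation rule described above can be sketched as a small function. This is a minimal illustration in base R on made-up data; the helper name `impute_simple` and the toy values are assumptions for the example, not part of the study's code.

```r
# Hypothetical helper: median for numeric columns, most frequent value otherwise
impute_simple <- function(data) {
  for (col in names(data)) {
    missing <- is.na(data[[col]])
    if (!any(missing)) next
    if (is.numeric(data[[col]])) {
      data[[col]][missing] <- median(data[[col]], na.rm = TRUE)
    } else {
      # table() ignores NA by default, so the mode is taken over observed values
      counts <- table(data[[col]])
      data[[col]][missing] <- names(counts)[which.max(counts)]
    }
  }
  data
}

# Toy data (illustrative values only)
toy <- data.frame(
  chol = c(200, NA, 240, 260),
  cp   = c("asymptomatic", "asymptomatic", NA, "non-anginal"),
  stringsAsFactors = FALSE
)

imputed <- impute_simple(toy)
print(imputed)
```

In a stricter workflow the medians and modes would be computed on the training subset only and then applied to the test subset, which avoids information from test records influencing the fill-in values.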

The target variable (num), which originally represented multiple stages of heart disease severity (0-4), was transformed into a binary classification variable where:

· 0 = No heart disease

· 1 = Presence of heart disease (any severity level)

Additionally, categorical variables such as sex, chest pain type, fasting blood sugar, resting electrocardiographic results, exercise-induced angina, and slope were converted into factor variables to ensure proper handling by machine learning algorithms.

Data Test and Train Split

Model validation used a hold-out approach. The dataset was divided into training (80%) and testing (20%) subsets using stratified sampling to preserve the class distribution in both subsets. A fixed random seed was set to ensure reproducibility of results. Model performance was evaluated on the unseen test dataset to assess generalization capability.
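To make the idea of a stratified hold-out split concrete, the sketch below samples 80% of the rows within each outcome class using base R only. The study's own code uses `createDataPartition()` from caret; this version is an illustrative alternative on made-up data, and the seed value is arbitrary.

```r
set.seed(21)  # fixed seed so the split is reproducible

# Toy outcome with a 40/60 class distribution (illustrative data only)
toy <- data.frame(id = 1:100, num = factor(rep(c(0, 1), times = c(40, 60))))

# Sample 80% of the row indices within each class, then combine them
train_idx <- unlist(lapply(split(seq_len(nrow(toy)), toy$num), function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}))

train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]

# Class proportions should match in both subsets
print(prop.table(table(train_toy$num)))
print(prop.table(table(test_toy$num)))
```

Because sampling happens inside each class, both subsets keep the original class proportions, which a purely random split does not guarantee on small datasets.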

Exploratory Data Analysis

A correlation analysis was performed on numerical variables to identify relationships among features. A correlation matrix was visualized to assess potential multicollinearity and to better understand which variables may influence CVD risk.
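A multicollinearity check of this kind can be sketched as follows: compute the Pearson correlation matrix and list any variable pairs whose absolute correlation exceeds a cutoff. The data here are simulated, the variable relationships are invented for illustration, and the 0.7 cutoff is the one used later in the Results.

```r
set.seed(1)
n <- 200

# Simulated numeric predictors (made-up relationships, not the study data)
age <- rnorm(n, mean = 54, sd = 9)
toy <- data.frame(
  age      = age,
  trestbps = 100 + 0.6 * age + rnorm(n, sd = 15),
  thalch   = 200 - 1.0 * age + rnorm(n, sd = 20)
)

cor_matrix <- cor(toy, use = "pairwise.complete.obs")

# List variable pairs whose absolute correlation exceeds the cutoff
cutoff <- 0.7
high <- which(abs(cor_matrix) > cutoff & upper.tri(cor_matrix), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_matrix)[high[, "row"]],
  var2 = colnames(cor_matrix)[high[, "col"]],
  r    = cor_matrix[high]
)

print(round(cor_matrix, 2))
print(high_pairs)  # an empty data frame means no pair exceeded the cutoff
```

Restricting the search to the upper triangle avoids reporting each pair twice and skips the diagonal.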

Model Development

To address the research questions, three machine learning models were implemented:

1. Logistic Regression: A baseline statistical model was developed using a generalized linear model with a binomial distribution.

2. Decision Tree: A decision tree classifier was constructed to capture nonlinear relationships and interactions between variables. The model structure was visualized to enhance interpretability and provide insight into decision rules.

3. Random Forest: An ensemble learning method was implemented using multiple decision trees to improve predictive performance and reduce overfitting. The model aggregates predictions from 100 trees to produce more robust results.
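As an illustration of why the logistic regression baseline is considered interpretable, the sketch below fits a binomial GLM and converts its coefficients into odds ratios. The data are simulated and the effect sizes are invented; only the variable names are borrowed from the dataset.

```r
set.seed(2)
n <- 300

# Simulated predictors (illustrative only)
toy <- data.frame(
  oldpeak = runif(n, 0, 4),
  exang   = factor(sample(c("FALSE", "TRUE"), n, replace = TRUE))
)

# Simulated outcome: risk rises with oldpeak and with exang == TRUE (made-up effects)
lin <- -1.5 + 0.8 * toy$oldpeak + 1.0 * (toy$exang == "TRUE")
toy$num <- factor(rbinom(n, size = 1, prob = plogis(lin)))

# Generalized linear model with a binomial distribution (logit link by default)
fit <- glm(num ~ oldpeak + exang, data = toy, family = binomial)

# Exponentiated coefficients are odds ratios per unit change in each predictor
odds_ratios <- exp(coef(fit))
print(round(odds_ratios, 2))
```

An odds ratio above 1 means the odds of the positive class increase as the predictor increases, holding the other predictors fixed, which gives each variable a direct reading.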

Model Evaluation

Model performance was evaluated on the test dataset using several standard classification metrics:

· Accuracy: Overall correctness of predictions.

· Precision: Proportion of true positive predictions among all positive predictions.

· Recall (sensitivity): Ability to correctly identify positive cases.

· F1 Score: Harmonic mean of precision and recall.

· ROC-AUC (Receiver Operating Characteristic, Area Under the Curve): Measures the model’s ability to distinguish between classes.

A confusion matrix was generated for each model to assess classification performance. Additionally, ROC curves were plotted to visually compare the three classifiers.
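The metrics listed above can be computed directly from the confusion-matrix counts. The sketch below does this in base R on a tiny made-up example; the helper name `binary_metrics` is hypothetical, and the AUC is obtained from the rank-sum (Mann-Whitney) identity rather than from the pROC package used in the study.

```r
# Hypothetical helper: metrics for a binary outcome coded 0/1, with 1 as the positive class
binary_metrics <- function(truth, pred, prob) {
  tp <- sum(pred == 1 & truth == 1)
  fp <- sum(pred == 1 & truth == 0)
  fn <- sum(pred == 0 & truth == 1)
  tn <- sum(pred == 0 & truth == 0)

  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)

  # AUC via the rank-sum identity; tied scores receive average ranks
  r     <- rank(prob)
  n_pos <- sum(truth == 1)
  n_neg <- sum(truth == 0)
  auc   <- (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

  c(Accuracy  = (tp + tn) / length(truth),
    Precision = precision,
    Recall    = recall,
    F1_Score  = 2 * precision * recall / (precision + recall),
    ROC_AUC   = auc)
}

# Toy labels and predicted probabilities (illustrative only)
truth <- c(1, 1, 1, 0, 0, 0, 1, 0)
prob  <- c(0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.7, 0.1)
pred  <- as.integer(prob > 0.5)  # classify at a 0.5 threshold

metrics <- binary_metrics(truth, pred, prob)
print(metrics)
# Accuracy, Precision, Recall and F1_Score are 0.75; ROC_AUC is 0.9375
```

Accuracy, precision, recall, and F1 depend on the chosen threshold, whereas ROC-AUC depends only on how the probabilities rank the cases, which is why a model can lead on one and trail on the other.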

Research Alignment

This methodology directly addresses the study’s research questions by:

· Identifying important predictors through model training and feature relationships.

· Comparing multiple algorithms using standardized evaluation metrics.

· Incorporating interpretable models (logistic regression and decision tree) alongside a high performance ensemble model (random forest).

Results:

In addition to model evaluation, a correlation analysis was conducted to examine relationships among numerical variables, and the correlation matrix revealed several notable patterns. Age showed a moderate positive correlation with resting blood pressure and a weaker relationship with cholesterol levels, suggesting that cardiovascular risk factors tend to increase with age. Maximum heart rate (thalch) exhibited a negative correlation with age, indicating that older individuals generally achieve lower heart rates during exercise.

Figure 2: Correlation Matrix of Numerical Variables

Furthermore, oldpeak (ST depression) showed a negative relationship with maximum heart rate and a positive association with other risk-related variables, suggesting its relevance as an indicator of cardiac stress. The number of major vessels (ca) also demonstrated mild correlations with several variables, supporting its role as an important clinical feature.

Overall, the correlations were weak to moderate, and no strong correlations (|r| > 0.7) were observed, indicating low multicollinearity among predictors. This suggests that the variables contribute unique information to the models, supporting their inclusion in the ML analysis.

The performance of the three ML models (logistic regression, decision tree, and random forest) was evaluated using accuracy, precision, recall, F1-score, and ROC-AUC. The results are presented in Table 1.

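Table 1: Results (test set)

Model                 Accuracy   Precision   Recall   F1-score   ROC-AUC
Logistic Regression   0.798      0.820       0.812    0.816      0.905
Decision Tree         0.874      0.890       0.881    0.886      0.902
Random Forest         0.858      0.887       0.851    0.869      0.944

Figure 3: ROC Plot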

The results suggest that the decision tree model provided the strongest classification performance at a fixed classification threshold, with the highest accuracy (0.874), precision (0.890), recall (0.881), and F1-score (0.886), while the random forest model demonstrated the strongest overall discriminative ability (ROC-AUC 0.944). Logistic regression had the lowest accuracy (0.798), although its ROC-AUC (0.905) was comparable to that of the decision tree (0.902), and it remained useful because of its interpretability. The findings indicate that nonlinear machine learning methods may better capture the complex relationships among cardiovascular risk factors than traditional linear models. Overall, the models were able to distinguish between patients with and without cardiovascular disease with relatively strong predictive capability.

The feature analysis indicated that variables such as chest pain type, maximum heart rate, ST depression (oldpeak), number of major vessels (ca), and exercise-induced angina were among the strongest predictors of cardiovascular disease. Age and resting blood pressure also showed a meaningful relationship with disease presence, although their effects were less pronounced than those of the variables derived from clinical stress tests. These findings are consistent with previous cardiovascular disease research identifying both physiological and exercise-related measurements as key indicators of cardiac risk.
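One model-agnostic way to rank predictors is permutation importance: shuffle one variable at a time and measure how much predictive accuracy drops. The report does not state which importance measure was used, so the sketch below is an assumed illustration on simulated data with a logistic model, not a reproduction of the study's analysis.

```r
set.seed(3)
n <- 400

# Simulated data: oldpeak drives the outcome, chol is unrelated (made-up effects)
toy <- data.frame(
  oldpeak = runif(n, 0, 4),
  chol    = rnorm(n, mean = 240, sd = 40)
)
toy$num <- rbinom(n, size = 1, prob = plogis(-1.5 + 1.0 * toy$oldpeak))

fit <- glm(num ~ oldpeak + chol, data = toy, family = binomial)

# Accuracy of the fitted model at a 0.5 threshold on a given data frame
accuracy <- function(data) {
  mean((predict(fit, newdata = data, type = "response") > 0.5) == data$num)
}

baseline <- accuracy(toy)

# Permutation importance: shuffle one column, record the mean drop in accuracy
importance <- sapply(c("oldpeak", "chol"), function(v) {
  drops <- replicate(50, {
    shuffled <- toy
    shuffled[[v]] <- sample(shuffled[[v]])
    baseline - accuracy(shuffled)
  })
  mean(drops)
})

print(sort(importance, decreasing = TRUE))
```

For brevity the drop is measured on the same data used for fitting; measuring it on a held-out test set gives a less optimistic picture.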

Conclusion:

This study investigated cardiovascular disease risk prediction using multiple machine learning algorithms and examined the contribution of clinical and demographic variables. Features such as chest pain type, exercise-induced angina, ST depression, maximum heart rate achieved, and the number of major vessels appeared to play an important role in predicting disease presence. These variables are clinically relevant because they reflect cardiac function and physiological stress responses commonly associated with cardiovascular complications.

The results of this study demonstrated that the decision tree model achieved the highest overall classification performance, outperforming both logistic regression and random forest in accuracy, precision, recall, and F1-score. This suggests that simpler, interpretable models can be effective when the dataset contains well-defined patterns. The random forest had the highest ROC-AUC, indicating superior discriminative ability across classification thresholds. This highlights its strength in capturing complex relationships, even when it does not achieve the highest accuracy at a fixed threshold.

The logistic regression model had the lowest accuracy but still demonstrated reasonable performance and remains valuable due to its interpretability and simplicity.

In relation to the research questions:

· Key variables contributing to the cardiovascular disease risk were effectively captured by all models, supported by correlation analysis and model performance.

· The comparison of algorithms showed that the decision tree performed best overall, while the random forest provided stronger probabilistic discrimination.

· An interpretable model was successfully developed with the decision tree offering both high accuracy and clear decision rules suitable for clinical insight.

In conclusion, this study highlighted that model selection should consider both predictive performance and interpretability, particularly in healthcare applications. Future work should explore hybrid approaches that combine the strengths of interpretable and high-performance models.

Appendix:

Code for this study: https://rpubs.com/Andreina-A/1423457

Bibliography

Ahsan, M. M., & Siddique, Z. (2022). Machine learning-based heart disease diagnosis: A systematic literature review. Retrieved from Artificial Intelligence in Medicine: https://doi.org/10.1016/j.artmed.2022.102289

Banerjee, T., & Paçal, İ. (2025). A systematic review of machine learning in heart disease prediction. Retrieved from Turkish Journal of Biology: https://doi.org/10.55730/1300-0152.2766

Detrano, R., Janosi, A., Steinbrunn, W., Pfisterer, M., Schmid, J., Sandhu, S., Guppy, K., Lee, S., & Froelicher, V. (1988). Heart disease dataset [Data set]. Retrieved from UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Heart+Disease

El Massari, Y., Bensaid, A., & Ouhbi, S. (2024). The impact of ontology on the prediction of cardiovascular disease compared to machine learning algorithms. Retrieved from arXiv: https://arxiv.org/abs/2405.20414

Haq, I., Liang, H., Zeng, K., Wang, T., Uddin, I., Lin, J., Kang, Y., & Huang, B. (2026). Deep learning advancements for cardiovascular diseases (CVDs) diagnosis: Imaging modalities, challenges, and future perspectives. Retrieved from Biomedical Signal Processing and Control: https://doi.org/10.1016/j.bspc.2026.109899

Ingole, V., Patil, S., Deshmukh, A., & Kulkarni, P. (2024). Advancements in heart disease prediction: A machine learning approach for early detection and risk assessment. Retrieved from arXiv: https://arxiv.org/abs/2410.14738

Krittanawong, C., Johnson, K. W., Rosenson, R. S., Wang, Z., Aydar, M., & Halperin, J. L. (2020). Machine learning prediction in cardiovascular diseases: A meta-analysis. Retrieved from Scientific Reports: https://doi.org/10.1038/s41598-020-72685-1

Liu, T., Krentz, A. J., Huo, Z., & Ćurčin, V. (2025). Opportunities and challenges of cardiovascular disease risk prediction for primary prevention using machine learning and electronic health records: A systematic review. Retrieved from Reviews in Cardiovascular Medicine: https://doi.org/10.31083/RCM37443

Mim, F. N., Rahman, M. S., Islam, M. R., & Hossain, M. A. (2025). Machine learning approaches for cardiovascular disease prediction: A comparative study. Retrieved from International Journal of Data Science and Analytics: https://doi.org/10.1007/s44174-025-00564-2

Osei-Nkwantabisa, G., & Ntumy, E. (2024). Classification and prediction of heart diseases using machine learning algorithms. Retrieved from arXiv: https://arxiv.org/abs/2409.03697

Pal, M., Parija, S., Panda, G., Dhama, K., & Mohapatra, R. K. (2022). Risk prediction of cardiovascular disease using machine learning classifiers. Retrieved from PMC PubMed Central: https://pmc.ncbi.nlm.nih.gov/articles/PMC9206502/

SONY, R. (n.d.). Heart disease data [Data set]. Retrieved from Kaggle: https://www.kaggle.com/datasets/redwankarimsony/heart-disease-data

#Load Libraries
# Package versions reported in the original run: tidyverse 2.0.0, corrplot 0.95, randomForest 4.7-1.2
library(tidyverse)
library(caret)
library(corrplot)
library(rpart)
library(rpart.plot)
library(randomForest)
library(pROC)

Loaded data from https://www.kaggle.com/datasets/redwankarimsony/heart-disease-data

df<-read.csv("https://raw.githubusercontent.com/Andreina-A/Data698/refs/heads/main/heart_disease_uci.csv")

#view structure
str(df)

'data.frame':   920 obs. of  16 variables:
 $ id      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ age     : int  63 67 67 37 41 56 62 57 63 53 ...
 $ sex     : chr  "Male" "Male" "Male" "Male" ...
 $ dataset : chr  "Cleveland" "Cleveland" "Cleveland" "Cleveland" ...
 $ cp      : chr  "typical angina" "asymptomatic" "asymptomatic" "non-anginal" ...
 $ trestbps: int  145 160 120 130 130 120 140 120 130 140 ...
 $ chol    : int  233 286 229 250 204 236 268 354 254 203 ...
 $ fbs     : logi  TRUE FALSE FALSE FALSE FALSE FALSE ...
 $ restecg : chr  "lv hypertrophy" "lv hypertrophy" "lv hypertrophy" "normal" ...
 $ thalch  : int  150 108 129 187 172 178 160 163 147 155 ...
 $ exang   : logi  FALSE TRUE TRUE FALSE FALSE FALSE ...
 $ oldpeak : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
 $ slope   : chr  "downsloping" "flat" "flat" "downsloping" ...
 $ ca      : int  0 3 2 0 0 0 2 0 1 0 ...
 $ thal    : chr  "fixed defect" "normal" "reversable defect" "normal" ...
 $ num     : int  0 2 1 0 0 0 3 0 2 1 ...


Citation Request: The authors of the databases have requested that any publications resulting from the use of the data include the names of the principal investigator responsible for the data collection at each institution. They would be:

· Hungarian Institute of Cardiology, Budapest: Andras Janosi, M.D.

· University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.

· University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.

· V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

#check for missing values
colSums(is.na(df))

      id      age      sex  dataset       cp trestbps     chol      fbs 
       0        0        0        0        0       59       30       90 
 restecg   thalch    exang  oldpeak    slope       ca     thal      num 
       0       55       55       62        0      611        0        0

Seven columns have missing data. I used simple imputation: the median for numeric columns and the mode for categorical columns.
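One caveat when counting missing values this way: with its default settings, `read.csv()` turns blank fields into NA for numeric and logical columns but keeps them as empty strings in character columns, so `is.na()` does not count them. Whether the Kaggle file contains such blanks is not shown in the output above, so the sketch below is a check worth running rather than a confirmed issue. It uses a small in-memory CSV with made-up values.

```r
# Small in-memory CSV with a blank field in a character column (illustrative data)
csv_text <- "age,slope,chol\n63,downsloping,233\n67,,286\n41,flat,\n"

default_read <- read.csv(text = csv_text)
strict_read  <- read.csv(text = csv_text, na.strings = c("", "NA"))

print(colSums(is.na(default_read)))  # blank slope is kept as "" and is not counted
print(colSums(is.na(strict_read)))   # blank slope is now NA and is counted
```

Passing `na.strings = c("", "NA")` when loading the data makes the missing-value counts cover character columns as well.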

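# Impute each column: median for numeric columns, mode (most frequent value) otherwise
for(col in names(df)){
  if(is.numeric(df[[col]])){
    df[[col]][is.na(df[[col]])]<-median(df[[col]], na.rm=TRUE)
  }else{
    # table() drops NA, so the most frequent observed value is used
    mode_val<-names(sort(table(df[[col]]),decreasing=TRUE))[1]
    df[[col]][is.na(df[[col]])]<-mode_val
  }
}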

#check for missing values again
colSums(is.na(df))

      id      age      sex  dataset       cp trestbps     chol      fbs 
       0        0        0        0        0        0        0        0 
 restecg   thalch    exang  oldpeak    slope       ca     thal      num 
       0        0        0        0        0        0        0        0

I converted the target variable “num” into a binary variable: instead of keeping all disease stages, no heart disease is coded as 0 and every other stage is coded as 1, indicating heart disease.

#converted target column into binary
df$num <- ifelse(df$num == 0, 0, 1)
df$num <- as.factor(df$num)

#converted categorical columns to factors
df$sex <- as.factor(df$sex)
df$cp <- as.factor(df$cp)
df$fbs <- as.factor(df$fbs)
df$restecg <- as.factor(df$restecg)
df$exang <- as.factor(df$exang)
df$slope <- as.factor(df$slope)

#Train/Test split
set.seed(21)
trainIndex<- createDataPartition(df$num, p = 0.8, list = FALSE)

train_df <- df[trainIndex, ]
test_df  <- df[-trainIndex, ]

Correlation Analysis

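# Keep numeric predictors only; id is a row identifier, so it is left out of the correlation matrix
numeric_data<-df|>
  select(where(is.numeric), -id)
cor_matrix<-cor(numeric_data)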

corrplot(cor_matrix, method="color", type="upper", tl.cex=0.8)

Logistic Regression Model

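# num ~ . uses every other column as a predictor, including id and dataset
log_model<-glm(num~., data=train_df,family = "binomial")

# Predicted probability of heart disease (class 1) on the test set
log_probs<-predict(log_model, test_df,type="response")
# Classify at a 0.5 threshold; fixed levels keep both classes even if one is never predicted
log_pred<-factor(ifelse(log_probs>0.5,1,0), levels=c(0,1))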

Decision Tree Model

DT_model<-rpart(num~., data=train_df,method = "class")
rpart.plot(DT_model)

DT_pred<- predict(DT_model, test_df, type="class")
DT_probs<-predict(DT_model, test_df, type = "prob")[,2]

Random Forest model

rf_model<-randomForest(num~., data=train_df,ntree =100)

rf_pred<- predict(rf_model, test_df)
rf_probs<-predict(rf_model, test_df, type="prob")[,2]

Evaluation

#Evaluation Function
evaluate_model <- function(true, pred, probs) {
  cm <- confusionMatrix(pred, true, positive = "1")
  
  accuracy  <- unname(cm$overall["Accuracy"])
  precision <- unname(cm$byClass["Precision"])
  recall    <- unname(cm$byClass["Recall"])
  f1        <- unname(cm$byClass["F1"])
  
  true_numeric <- as.numeric(true) - 1
  
  roc_obj <- roc(true_numeric, probs)
  roc_auc <- as.numeric(auc(roc_obj))
  
  return(c(
    Accuracy = accuracy,
    Precision = precision,
    Recall = recall,
    F1_Score = f1,
    ROC_AUC = roc_auc
  ))
}

#Evaluate models
log_results<-evaluate_model(test_df$num, log_pred,log_probs)
DT_results<-evaluate_model(test_df$num, DT_pred,DT_probs)
rf_results<-evaluate_model(test_df$num, rf_pred,rf_probs)

#Result
results<-rbind(
  Logistic_Regression = log_results,
  Decision_Tree       = DT_results,
  Random_Forest       = rf_results
)
print(results)

                     Accuracy Precision    Recall  F1_Score   ROC_AUC
Logistic_Regression 0.7978142 0.8200000 0.8118812 0.8159204 0.9054576
Decision_Tree       0.8743169 0.8900000 0.8811881 0.8855721 0.9018957
Random_Forest       0.8579235 0.8865979 0.8514851 0.8686869 0.9437938

#ROC curves for the three models
plot(roc(test_df$num, DT_probs), col="blue")
plot(roc(test_df$num, rf_probs), col="red", add=TRUE)
plot(roc(test_df$num, log_probs), col= "green", add= TRUE)
legend("bottomright", legend=c("Decision Tree", "Random Forest", "Logistic Regression"),
       col=c("blue","red", "green"), lwd=2)