Assignment 3: Comparing Decision Tree, Rondom Forrest, AdaBoosting to SVM Models

Ahm Hamza 04.13.2025

Overview

Perform an analysis of the dataset used in previous homework using the SVM algorithm. Compare the results with the results from previous homework. Answer questions, such as: Which algorithm is recommended to get more accurate results? Is it better for classification or regression scenarios? Do you agree with the recommendations? Why?

Article Review

Summary of the provided articles

[1] This article presents a comparative analysis of various machine learning algorithms—including Support Vector Machines (SVM), Random Forests, Neural Networks, and XGBoost applied to COVID-19 datasets. Notably, SVM outperformed the other models in accurately predicting positive cases across two different datasets. The study also emphasizes the significance of addressing class imbalance, recommending the use of undersampling and oversampling techniques—an approach that directly aligns with the preprocessing steps used in our project. These insights reinforce the effectiveness of SVM, particularly in scenarios where accurate classification is critical and datasets are imbalanced, much like the binary classification task in the bank marketing dataset.

[2] This study demonstrates the effectiveness of SVM in classifying COVID-19 severity with approximately 87% accuracy, highlighting SVM’s reliability in healthcare diagnostics for complex data, a finding consistent with the strong performance observed in our tuned radial SVM model for marketing predictions. Another study on a 10% positive COVID-19 patient dataset (n=5644) emphasizes the challenges of class imbalance, where methods like Balanced Random Forest (RUS) and RUSBagging showed superior AUPRC and AUROC, respectively, for imbalanced data. Both studies used standard evaluation metrics (accuracy, precision, recall, F1-measure, AUC-ROC, AUPRC), and the latter underscores the importance of addressing class imbalance and incorporating features like age to improve predictive accuracy, a factor some studies overlooked.

Additional 3 Required Articles

Title “A comparative analysis of K-Nearest Neighbor, Genetic, Support Vector Machine, Decision Tree, and Long Short Term Memory algorithms in machine learning” link

Summary The article presents a comprehensive comparative analysis of five machine learning algorithms: K-Nearest Neighbor (KNN), Genetic Algorithm (GA), Support Vector Machine (SVM), Decision Tree (DT), and Long Short-Term Memory (LSTM). It highlights their functionalities, advantages, and disadvantages, as well as their performance in real-time applications. SVM consistently demonstrated superior performance compared to KNN and DT in various applications, particularly in breast cancer detection and student performance prediction.In the credit card fraud detection comparison, SVM outperformed both KNN and Naive Bayes in terms of accuracy. SVM achieved high accuracy levels (up to 96%) in cases where there was a clear margin of separation between classes, making it effective for high-dimensional data. The article recommends using SVM for applications requiring high accuracy and reliability, especially in domains like healthcare and finance. This aligns well with projects focused on predictive analytics where performance and accuracy are critical.

Title: “Data mining methods in the prediction of Dementia: A real-data comparison of the accuracy, sensitivity and specificity of linear discriminant analysis, logistic regression, neural networks, support vector machines, classification trees and random forests” link

Summary

This study evaluates several classifiers for predicting the progression of Mild Cognitive Impairment to dementia, comparing traditional methods like LDA and logistic regression with modern ones such as SVM, Random Forest, and Neural Networks. SVM achieved the highest overall accuracy and perfect specificity but showed low sensitivity, limiting its ability to detect positive dementia cases. In contrast, Random Forest provided a better balance between accuracy and sensitivity.

The article recommends using models like Random Forest or LDA for tasks where sensitivity is important. This finding relates to our project, as accurate classification alone is not enough. While SVM performed well on our bank dataset, it is important to evaluate whether its trade-off between accuracy and sensitivity is appropriate for our specific use case.

Title: “A comparative study of forecasting corporate credit ratings using neural networks, support vector machines, and decision trees” link

Summary

This article evaluates the effectiveness of various machine learning models, including SVM, Neural Networks, and Decision Trees, in predicting corporate credit ratings. While SVM showed strong classification accuracy, it was less effective in predicting rating changes. In contrast, Bagged Decision Trees and Random Forests outperformed both SVM and Neural Networks, achieving up to 84.21% accuracy.

The study highlights that Decision Trees require less parameter tuning and offer greater interpretability, making them more accessible for practical use.This is relevant to the assignment as they emphasize the strengths of ensemble tree-based models in high-stakes classification tasks. However, bank dataset, tuned SVM models demonstrated higher AUC and F1-Score, suggesting that the best model choice depends on the specific data and objective.

Cross Comparison and Insights

Across all five articles, SVM is consistently recognized for its high classification accuracy, particularly in high-dimensional datasets like healthcare diagnostics, credit scoring, and fraud detection. The two provided studies on COVID-19 prediction both emphasize SVM’s effectiveness when paired with techniques to address class imbalance, which aligns with our project’s success using oversampling and tuned radial kernels. However, the three academic articles I sourced present a more nuanced view.

For instance, while SVM outperformed Decision Trees (DT) in applications like breast cancer detection and student performance prediction, other domains such as dementia prediction and corporate credit rating favored ensemble tree models like Random Forest or Bagged DTs. These models offered better sensitivity, easier interpretability, and robustness to noisy or imbalanced data, echoing results in my previous assignments where Random Forests provided solid, balanced metrics.

SVMs perform best when the margin between classes is clearly separable, and when computational resources allow for careful tuning. In contrast, tree-based methods excel in real-world settings where interpretability, sensitivity, or minimal tuning are priorities. In our bank dataset, SVM’s high precision and AUC were valuable, but interpretability and balance offered by tree models are also significant in institutional applications.

Data and library loading

# load necessary packages
rq_packages <- c("GGally", "naniar", "gridExtra", "scales", "ggplot2",
                 "dplyr", "tidyr", "corrplot", "ggcorrplot", "caret",
                 "naivebayes", "pROC", "car", "knitr", "rpart", "randomForest", 
                 "rpart.plot", "ROSE", "adabag", "reshape2", "ada", 
                 "smotefamily", "e1071", "kernlab", "doParallel")
# install and load packages
for (pkg in rq_packages) {
  if (!require(pkg, character.only = TRUE)) {
    install.packages(pkg)
    library(pkg, character.only = TRUE)
  }
}
#load the dataset
df <- read.csv("https://raw.githubusercontent.com/hamza9713/assignment_data_repo/refs/heads/main/bank-additional-full.csv", sep=";")

Preprocessing Same As Assignment 2

#rename columns for clarity
df <- df %>%
  rename(
    age = age,
    job = job,
    marital_status = marital,
    education = education,
    credit_default = default,
    mortgage = housing,
    personal_loan = loan,
    contact_method = contact,
    contact_month = month,
    contact_day = day_of_week,
    contact_duration = duration,
    campaign_contacts = campaign,
    days_since_last_contact = pdays,
    previous_contacts = previous,
    previous_outcome = poutcome,
    employment_rate = emp.var.rate,
    consumer_price_index = cons.price.idx,
    consumer_confidence_index = cons.conf.idx,
    euribor_rate = euribor3m,
    employees_count = nr.employed,
    subscription_status = y
  )

#data inspection
# str(df)
# summary(df)

##########################################################################
### data cleaning & preprocessing
##########################################################################
# convert "unknown" to NA and omit NAs
df[df == "unknown"] <- NA 
df <- na.omit(df)

# standardize numeric variables (z-score normalization)
numeric_vars <- df %>%
  select_if(is.numeric) %>%
  colnames()

df[numeric_vars] <- scale(df[numeric_vars])

#list of categorical variables
categorical_vars <- c(
  "job", "marital_status", "education", "credit_default",
  "mortgage", "personal_loan", "contact_method", "contact_month",
  "contact_day", "previous_outcome", "subscription_status"
)

#convert categorical variables to factors
df[categorical_vars] <- lapply(df[categorical_vars], factor)

# set seed for reproducibility
set.seed(123)

# balance and split the dataset (assume df is your original dataset)
# here, we are generating a balanced dataset using oversampling
df_balanced <- ovun.sample(subscription_status ~ ., data = df, method = "over", 
                           N = max(table(df$subscription_status)) * 2)$data

# train-test split
train_index <- createDataPartition(df_balanced$subscription_status, p = 0.7, list = FALSE)
train_data <- df_balanced[train_index, ]
test_data <- df_balanced[-train_index, ]

# inspect the training data
str(train_data)

## 'data.frame':    37282 obs. of  21 variables:
##  $ age                      : num  1.6422 0.0939 1.6422 -1.4545 -1.3577 ...
##  $ job                      : Factor w/ 11 levels "admin.","blue-collar",..: 4 1 8 10 8 8 2 2 2 2 ...
##  $ marital_status           : Factor w/ 3 levels "divorced","married",..: 2 2 2 3 3 3 3 2 2 2 ...
##  $ education                : Factor w/ 7 levels "basic.4y","basic.6y",..: 1 2 4 6 4 4 4 2 2 3 ...
##  $ credit_default           : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mortgage                 : Factor w/ 2 levels "no","yes": 1 1 1 2 2 2 1 2 2 2 ...
##  $ personal_loan            : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 2 ...
##  $ contact_method           : Factor w/ 2 levels "cellular","telephone": 2 2 2 2 2 2 2 2 2 2 ...
##  $ contact_month            : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ contact_day              : Factor w/ 5 levels "fri","mon","thu",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ contact_duration         : num  0.00579 -0.41451 0.18156 0.46049 -0.80043 ...
##  $ campaign_contacts        : num  -0.559 -0.559 -0.559 -0.559 -0.559 ...
##  $ days_since_last_contact  : num  0.212 0.212 0.212 0.212 0.212 ...
##  $ previous_contacts        : num  -0.372 -0.372 -0.372 -0.372 -0.372 ...
##  $ previous_outcome         : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ employment_rate          : num  0.727 0.727 0.727 0.727 0.727 ...
##  $ consumer_price_index     : num  0.804 0.804 0.804 0.804 0.804 ...
##  $ consumer_confidence_index: num  0.877 0.877 0.877 0.877 0.877 ...
##  $ euribor_rate             : num  0.786 0.786 0.786 0.786 0.786 ...
##  $ employees_count          : num  0.402 0.402 0.402 0.402 0.402 ...
##  $ subscription_status      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

summary(train_data)

##       age                   job         marital_status 
##  Min.   :-2.1319   admin.     :11102   divorced: 4158  
##  1st Qu.:-0.7771   technician : 6497   married :20675  
##  Median :-0.2932   blue-collar: 5862   single  :12449  
##  Mean   : 0.0566   services   : 3095                   
##  3rd Qu.: 0.6745   management : 2757                   
##  Max.   : 5.4163   retired    : 2350                   
##                    (Other)    : 5619                   
##                education     credit_default mortgage    personal_loan
##  basic.4y           : 3077   no :37281      no :16838   no :31527    
##  basic.6y           : 1439   yes:    1      yes:20444   yes: 5755    
##  basic.9y           : 4547                                           
##  high.school        : 9477                                           
##  illiterate         :   29                                           
##  professional.course: 5198                                           
##  university.degree  :13515                                           
##    contact_method  contact_month  contact_day contact_duration   
##  cellular :27748   may    :9633   fri:6778    Min.   :-0.991479  
##  telephone: 9534   jul    :5603   mon:7313    1st Qu.:-0.441260  
##                    aug    :5433   thu:8151    Median :-0.005671  
##                    jun    :4536   tue:7518    Mean   : 0.453812  
##                    nov    :3993   wed:7522    3rd Qu.: 0.934286  
##                    apr    :3345               Max.   :17.800008  
##                    (Other):4739                                  
##  campaign_contacts  days_since_last_contact previous_contacts
##  Min.   :-0.55933   Min.   :-4.7491         Min.   :-0.3716  
##  1st Qu.:-0.55933   1st Qu.: 0.2119         1st Qu.:-0.3716  
##  Median :-0.19170   Median : 0.2119         Median :-0.3716  
##  Mean   :-0.07452   Mean   :-0.3713         Mean   : 0.2539  
##  3rd Qu.: 0.17593   3rd Qu.: 0.2119         3rd Qu.:-0.3716  
##  Max.   :14.88100   Max.   : 0.2119         Max.   :13.0181  
##                                                              
##     previous_outcome employment_rate   consumer_price_index
##  failure    : 4522   Min.   :-2.0669   Min.   :-2.2589     
##  nonexistent:28707   1st Qu.:-1.0733   1st Qu.:-1.0768     
##  success    : 4053   Median :-0.6387   Median :-0.1355     
##                      Mean   :-0.3389   Mean   :-0.1395     
##                      3rd Qu.: 0.9138   3rd Qu.: 0.8041     
##                      Max.   : 0.9138   Max.   : 2.1246     
##                                                            
##  consumer_confidence_index  euribor_rate     employees_count  
##  Min.   :-2.12930          Min.   :-1.5901   Min.   :-2.6240  
##  1st Qu.:-1.16881          1st Qu.:-1.3566   1st Qu.:-1.1258  
##  Median :-0.25009          Median :-1.1338   Median :-0.8211  
##  Mean   : 0.06331          Mean   :-0.3503   Mean   :-0.4052  
##  3rd Qu.: 0.87744          3rd Qu.: 0.8424   3rd Qu.: 0.8953  
##  Max.   : 2.86105          Max.   : 0.8919   Max.   : 0.8953  
##                                                               
##  subscription_status
##  no :18641          
##  yes:18641          
##                     
##                     
##                     
##                     
##

# Function for F1 score calculation
get_f1_score <- function(cm) {
  precision <- cm$byClass["Precision"]
  recall <- cm$byClass["Recall"]
  f1 <- 2 * (precision * recall) / (precision + recall)
  return(f1)
}

Load Pre-trained Models For Faster Knit

# Load the saved models - adjust paths as needed
model_path <- "/Users/ahmhamza/"

# Load models from RDS files
svm_linear <- readRDS(paste0(model_path, "svm_linear_model.rds"))
svm_linear_caret <- readRDS(paste0(model_path, "svm_linear_caret_model.rds"))
svm_radial <- readRDS(paste0(model_path, "svm_radial_model.rds"))
svm_radial_caret <- readRDS(paste0(model_path, "svm_radial_caret_model.rds"))
svm_poly <- readRDS(paste0(model_path, "svm_poly_model.rds"))

Base Linear SVM Model

The linear kernel assumes that the data is separable by linearity, meaning a hyperplane that cleanly divides the “yes” and “no” subscription classes. To adress class-imbalance I used oversampling, ensuring the model isn’t biased toward the majority class. Linear SVMs are efficient and less prone to overfitting, especially with high-dimensional datasets like this. The use of probabilities and ROC/AUC also gives insights into how well the classifier ranks predictions, not just how often it’s correct.

Hypothesis: A linear decision boundary will be enough if the classescan be separated with a straight hyperplane, especially with standardized features.

Model Evaluation - Linear SVM

# Predictions and evaluation for linear SVM
svm_linear_pred <- predict(svm_linear, test_data)
svm_linear_prob <- predict(svm_linear, test_data, probability = TRUE)
svm_linear_cm <- confusionMatrix(svm_linear_pred, test_data$subscription_status, positive = "yes")
svm_linear_auc <- roc(test_data$subscription_status, as.numeric(attr(svm_linear_prob, "probabilities")[, "yes"]))

cat("Linear SVM Accuracy:", svm_linear_cm$overall["Accuracy"], "\n")

## Linear SVM Accuracy: 0.8691788

cat("Linear SVM AUC:", auc(svm_linear_auc), "\n")

## Linear SVM AUC: 0.9330476

cat("Linear SVM F1-Score:", get_f1_score(svm_linear_cm), "\n\n")

## Linear SVM F1-Score: 0.8735785

print(svm_linear_cm)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  6665  767
##        yes 1323 7221
##                                           
##                Accuracy : 0.8692          
##                  95% CI : (0.8639, 0.8744)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7384          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9040          
##             Specificity : 0.8344          
##          Pos Pred Value : 0.8452          
##          Neg Pred Value : 0.8968          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4520          
##    Detection Prevalence : 0.5348          
##       Balanced Accuracy : 0.8692          
##                                           
##        'Positive' Class : yes             
##

Prediction & Probability

The linear SVM demonstrates strong effectiveness for binary classification on the dataset, achieving approximately 86.92% accuracy, a 0.933 AUC, and a 0.8736 F1-score, indicating a robust ability to distinguish subscribers from non-subscribers with a well-suited linear decision boundary that captures key data relationships without overfitting. This high AUC and balanced accuracy, with 90.4% sensitivity and 83.4% specificity, suggest a favorable bias-variance tradeoff, exhibiting low bias by learning meaningful patterns and moderate to low variance by avoiding overfitting, outperforming Decision Tree models that showed higher bias and limited feature utility.

SVM Linear Model with Tuning

SVM linear model with hyperparameter tuning to optimize the model’s performance. Also by removing features that show almost no variation and eliminating potentially noisy that could harm model performance or slow down training without contributing predictive power.

Hypothesis: Fine-tuning the cost parameter C via cross-validation will improve better performance tan base linear model.

Model Evaluation - Linear SVM with Caret

# Predict and evaluate
linear_pred <- predict(svm_linear_caret, test_data)
linear_prob <- predict(svm_linear_caret, test_data, type = "prob")
linear_cm <- confusionMatrix(linear_pred, test_data$subscription_status, positive = "yes")
linear_roc <- roc(test_data$subscription_status, linear_prob$yes)

cat("Caret Linear SVM Accuracy:", linear_cm$overall["Accuracy"], "\n")

## Caret Linear SVM Accuracy: 0.8637331

cat("Caret Linear SVM AUC:", auc(linear_roc), "\n")

## Caret Linear SVM AUC: 0.9327739

cat("Caret Linear SVM F1-Score:", get_f1_score(linear_cm), "\n\n")

## Caret Linear SVM F1-Score: 0.8656753

print(linear_cm)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  6784  973
##        yes 1204 7015
##                                          
##                Accuracy : 0.8637         
##                  95% CI : (0.8583, 0.869)
##     No Information Rate : 0.5            
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.7275         
##                                          
##  Mcnemar's Test P-Value : 8.246e-07      
##                                          
##             Sensitivity : 0.8782         
##             Specificity : 0.8493         
##          Pos Pred Value : 0.8535         
##          Neg Pred Value : 0.8746         
##              Prevalence : 0.5000         
##          Detection Rate : 0.4391         
##    Detection Prevalence : 0.5145         
##       Balanced Accuracy : 0.8637         
##                                          
##        'Positive' Class : yes            
##

Prediction & Probability

Hyperparameter tuning on the Linear SVM yielded a slightly lower accuracy (86.37%), AUC (0.9328), and F1-score (0.8657) compared to the base model, thus not strongly supporting the hypothesis of significant improvement through tuning for this classification task. However, the tuned model demonstrated an improved balance between sensitivity (87.82%) and specificity (84.93%), which is critical for equitable detection in potentially imbalanced datasets by mitigating false positives at the cost of a slight increase in bias, evidenced by a modest reduction in sensitivity from the untuned model’s 90.4%. The substantial kappa statistic of 0.7275 reinforces the tuned model’s reliability. The minimal performance gain from tuning suggests that the default linear kernel already operates near its optimal capacity for this dataset, indicating that prioritizing model simplicity and computational efficiency might be a more pragmatic approach for future model selection.

SVM Model with Radical Basic Function

The radial kernel projects the data into a higher-dimensional space where a linear separator might exist. This is especially useful when the decision boundary is curved or wavy as might be the case in this dataset with complex customer behaviors. The model should capture more subtle patterns, especially after oversampling. Probability estimation and AUC here help understand how confidently the model distinguishes the two classes.

Hypothesis: RBF can better handle complex boundaries in the data, leading to improved classification of harder-to-separate instances

Model Evaluation - Radial SVM

# Predictions and evaluation for radial SVM
svm_radial_pred <- predict(svm_radial, test_data)
svm_radial_prob <- predict(svm_radial, test_data, probability = TRUE)
svm_radial_cm <- confusionMatrix(svm_radial_pred, test_data$subscription_status, positive = "yes")
svm_radial_auc <- roc(test_data$subscription_status, as.numeric(attr(svm_radial_prob, "probabilities")[, "yes"]))

cat("Radial SVM Accuracy:", svm_radial_cm$overall["Accuracy"], "\n")

## Radial SVM Accuracy: 0.8788182

cat("Radial SVM AUC:", auc(svm_radial_auc), "\n")

## Radial SVM AUC: 0.9389735

cat("Radial SVM F-Score:", get_f1_score(svm_radial_cm), "\n")

## Radial SVM F-Score: 0.8845282

print(svm_radial_cm)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  6625  573
##        yes 1363 7415
##                                           
##                Accuracy : 0.8788          
##                  95% CI : (0.8737, 0.8838)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7576          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9283          
##             Specificity : 0.8294          
##          Pos Pred Value : 0.8447          
##          Neg Pred Value : 0.9204          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4641          
##    Detection Prevalence : 0.5494          
##       Balanced Accuracy : 0.8788          
##                                           
##        'Positive' Class : yes             
##

Prediction & Probability

The SVM with an RBF kernel strongly supports the hypothesis that nonlinear kernels can better model complex decision boundaries, achieving superior performance with 87.88% accuracy, a high AUC of 0.9390, and an F1-score of 0.8845, outperforming linear SVM variants. Its high sensitivity (92.83%) highlights a strong ability to detect positive cases, crucial where false negatives are costly. While specificity (82.94%) is slightly lower than linear models, the balanced accuracy remains robust at 87.88%, and the higher Kappa score (0.7576) indicates stronger agreement. A significant McNemar’s Test p-value (<2.2e-16) further validates the model’s statistical significance. This improved performance confirms the necessity of nonlinear boundaries for accurately classifying complex instances within the dataset’s intricate feature interactions.

SVM Radical Basic Function Model with Tuning

The sigma parameter controls how far the influence of a single training example reaches. A small sigma leads to a complex, wiggly decision boundary (low bias, high variance), while a large sigma results in a smoother boundary (higher bias, lower variance). The tested involves a small grid (sigma = 0.01, C = 0.1 or 1) under cross-validation to strike a balance. Using caret also streamlines preprocessing and ensures fair model evaluation.

Hypothesis: If the tuning leads improve performance, it would indicate that the dataset benefits from a more flexible, non-linear decision boundary and regularization strength classification

Model Evaluation - Radial SVM with Caret

# Predict on the test data
radial_pred <- predict(svm_radial_caret, test_data)
radial_prob <- predict(svm_radial_caret, test_data, type = "prob")
radial_cm <- confusionMatrix(radial_pred, test_data$subscription_status, positive = "yes")
radial_roc <- roc(test_data$subscription_status, radial_prob$yes)

cat("Caret Radial SVM Accuracy:", radial_cm$overall["Accuracy"], "\n")

## Caret Radial SVM Accuracy: 0.8847646

cat("Caret Radial SVM AUC:", auc(radial_roc), "\n")

## Caret Radial SVM AUC: 0.9421786

cat("Caret Radial SVM F1-Score:", get_f1_score(radial_cm), "\n\n")

## Caret Radial SVM F1-Score: 0.888431

print(radial_cm)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  6805  658
##        yes 1183 7330
##                                           
##                Accuracy : 0.8848          
##                  95% CI : (0.8797, 0.8897)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7695          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9176          
##             Specificity : 0.8519          
##          Pos Pred Value : 0.8610          
##          Neg Pred Value : 0.9118          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4588          
##    Detection Prevalence : 0.5329          
##       Balanced Accuracy : 0.8848          
##                                           
##        'Positive' Class : yes             
##

Prediction & Probability

Hyperparameter tuning significantly improve the Radial Basis Function (RBF) SVM’s classification performance, supporting the hypothesis that tuning optimizes the model for the dataset’s complexity. Achieving the highest accuracy (88.48%) and AUC (0.9422) among all tested models, the tuned radial SVM demonstrates superior predictive power and class discrimination. Its strong F1-score (0.8884) indicates a balanced precision and recall. Notably, it improved both sensitivity (91.76%) and specificity (85.19%) compared to previous models, resulting in a balanced accuracy of 88.48% and a high Kappa score (0.7695), further validated by a significant McNemar’s test p-value (< 2.2e-16). These improvements confirm that the nonlinear RBF kernel, combined with optimized cost (C) and gamma, effectively captures the dataset’s intricate structure by balancing margin maximization and error minimization, leading to better generalization through regularization and adaptation to complex decision surfaces.

SVM wiht Polynomial Kernel

Polynomial kernels are particularly useful when the relationship between features follows a polynomial pattern. After waiting for 4 hours of slow convergence smaller sample of 2000 observations to reduce computation, suggesting polynomial SVMs are heavier. The degree-2 polynomial kernel models not just the original features, but also their pairwise interactions. This is helpful when feature interactions drive outcomes. Using a small grid (C = 1, 10; scale = 0.1) with cross-validation ensures the model doesn’t overfit.

Hypothesis: A degree-2 polynomial kernel is hypothesized to effectively model feature interactions, proving beneficial when relationships are nonlinear but less intricate than those requiring an RBF kernel.

Model Evaluation - Polynomial SVM

# Make predictions on test data
svm_poly_pred <- predict(svm_poly, newdata = test_data)

# Ensure factors match for evaluation
svm_poly_pred <- factor(svm_poly_pred, levels = c("no", "yes"))
true_labels <- factor(test_data$subscription_status, levels = c("no", "yes"))

# Confusion matrix
svm_poly_cm <- confusionMatrix(svm_poly_pred, true_labels, positive = "yes")

# Print evaluation
print(svm_poly_cm)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  6720 1540
##        yes 1268 6448
##                                           
##                Accuracy : 0.8242          
##                  95% CI : (0.8182, 0.8301)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6485          
##                                           
##  Mcnemar's Test P-Value : 3.152e-07       
##                                           
##             Sensitivity : 0.8072          
##             Specificity : 0.8413          
##          Pos Pred Value : 0.8357          
##          Neg Pred Value : 0.8136          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4036          
##    Detection Prevalence : 0.4830          
##       Balanced Accuracy : 0.8242          
##                                           
##        'Positive' Class : yes             
##

Prediction & Probability

The polynomial kernel (degree 2) achieved a respectable accuracy of 82.4%. Its balanced accuracy of 82.4% and a Kappa score of 0.6485 indicate good performance beyond chance, without strong bias towards either class, further supported by a sensitivity of 80.7% and specificity of 84.1%. However, a significant McNemar’s Test (p < 0.001) suggests a potential difference in error distribution between classes, warranting consideration for fairness. Overall, the degree-2 polynomial kernel empirically validates the hypothesis, offering a strong compromise between simpler linear models and potentially more complex radial kernels by capturing moderate nonlinear patterns with reasonable computational efficiency and interpretability, making it a solid choice when fairness and transparency are important.

SVM and Decision Tree Ensemble Comparison

# Create a dataframe to compare models
model_names <- c("Linear SVM", "Linear SVM (Caret)", "Radial SVM", 
                 "Radial SVM (Caret)", "Polynomial SVM")

accuracies <- c(
  svm_linear_cm$overall["Accuracy"],
  linear_cm$overall["Accuracy"],
  svm_radial_cm$overall["Accuracy"],
  radial_cm$overall["Accuracy"],
  svm_poly_cm$overall["Accuracy"]
)

f1_scores <- c(
  get_f1_score(svm_linear_cm),
  get_f1_score(linear_cm),
  get_f1_score(svm_radial_cm),
  get_f1_score(radial_cm),
  get_f1_score(svm_poly_cm)
)

aucs <- c(
  auc(svm_linear_auc),
  auc(linear_roc),
  auc(svm_radial_auc),
  auc(radial_roc),
  NA  # We don't have AUC for polynomial model
)

comparison_df <- data.frame(
  Model = model_names,
  Accuracy = round(accuracies, 4),
  F1_Score = round(f1_scores, 4),
  AUC = round(aucs, 4)
)

# Print comparison table
knitr::kable(comparison_df, caption = "Model Performance Comparison")

Model Performance Comparison
Model	Accuracy	F1_Score	AUC
Linear SVM	0.8692	0.8736	0.9330
Linear SVM (Caret)	0.8637	0.8657	0.9328
Radial SVM	0.8788	0.8845	0.9390
Radial SVM (Caret)	0.8848	0.8884	0.9422
Polynomial SVM	0.8242	0.8212	NA

Visualization of Model Performance

# Create a bar plot to compare model performance

# Prepare data for plotting
plot_data <- melt(comparison_df, id.vars = "Model", 
                 variable.name = "Metric", value.name = "Value")

# Create the plot
ggplot(plot_data, aes(x = Model, y = Value, fill = Metric)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  theme_minimal() +
  labs(title = "SVM Model Performance Comparison",
       x = "Model", y = "Score") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_brewer(palette = "Set1")

SVM and Decision Tree Essemble Comarison

################################################################
# results compilation and comparison
# previous models (decision trees, rf, adaboost)
dt_oversample_metrics <- c(Model = "Decision Tree (Oversampling)", Accuracy = 0.5740, AUC = 0.588, F1 = NA)
dt_undersample_metrics <- c(Model = "Decision Tree (Undersampling)", Accuracy = 0.5696, AUC = 0.584, F1 = NA)
rf_all_metrics <- c(Model = "Random Forest (All Features)", Accuracy = 0.9024, AUC = 0.9408, F1 = NA)
rf_top5_metrics <- c(Model = "Random Forest (Top 5 Features)", Accuracy = 0.8986, AUC = 0.9318, F1 = NA)
ada_oversample_metrics <- c(Model = "AdaBoost (Oversampling)", Accuracy = 0.8646, AUC = 0.93027, F1 = NA)
ada_demo_metrics <- c(Model = "AdaBoost (Demographics Only)", Accuracy = 0.5920, AUC = 0.62968, F1 = NA)

# svm models
svm_linear_metrics <- c(Model = "SVM (Linear)", Accuracy = 0.8692, AUC = 0.9330, F1 = 0.8735785)
svm_rbf_metrics <- c(Model = "SVM (Radial)", Accuracy = 0.8796, AUC = 0.9385, F1 = NA)
svm_poly_metrics <- c(Model = "SVM (Polynomial)", Accuracy = 0.8242, AUC = NA, F1 = NA)
svm_caret_linear_metrics <- c(Model = "SVM (Caret Linear)", Accuracy = 0.8637, AUC = 0.9328, F1 = 0.8656753)
svm_caret_radial_metrics <- c(Model = "SVM (Caret Radial)", Accuracy = 0.8849, AUC = 0.9422, F1 = 0.8885927)

# bind all model metric vectors
all_models <- rbind(
  dt_oversample_metrics,
  dt_undersample_metrics,
  rf_all_metrics,
  rf_top5_metrics,
  ada_oversample_metrics,
  ada_demo_metrics,
  svm_linear_metrics,
  svm_rbf_metrics,
  svm_poly_metrics,
  svm_caret_linear_metrics,
  svm_caret_radial_metrics
)

# convert to a data frame
comparison_df <- as.data.frame(all_models)

# convert numeric columns correctly
comparison_df$Accuracy <- as.numeric(comparison_df$Accuracy)
comparison_df$AUC <- as.numeric(comparison_df$AUC)
comparison_df$F1 <- as.numeric(comparison_df$F1)

# convert to long format for the second plot
comparison_long <- pivot_longer(
  comparison_df,
  cols = c(Accuracy, AUC, F1),
  names_to = "variable",
  values_to = "value"
)



# print the final comparison table
print(comparison_df)

##                                                   Model Accuracy     AUC
## dt_oversample_metrics      Decision Tree (Oversampling)   0.5740 0.58800
## dt_undersample_metrics    Decision Tree (Undersampling)   0.5696 0.58400
## rf_all_metrics             Random Forest (All Features)   0.9024 0.94080
## rf_top5_metrics          Random Forest (Top 5 Features)   0.8986 0.93180
## ada_oversample_metrics          AdaBoost (Oversampling)   0.8646 0.93027
## ada_demo_metrics           AdaBoost (Demographics Only)   0.5920 0.62968
## svm_linear_metrics                         SVM (Linear)   0.8692 0.93300
## svm_rbf_metrics                            SVM (Radial)   0.8796 0.93850
## svm_poly_metrics                       SVM (Polynomial)   0.8242      NA
## svm_caret_linear_metrics             SVM (Caret Linear)   0.8637 0.93280
## svm_caret_radial_metrics             SVM (Caret Radial)   0.8849 0.94220
##                                 F1
## dt_oversample_metrics           NA
## dt_undersample_metrics          NA
## rf_all_metrics                  NA
## rf_top5_metrics                 NA
## ada_oversample_metrics          NA
## ada_demo_metrics                NA
## svm_linear_metrics       0.8735785
## svm_rbf_metrics                 NA
## svm_poly_metrics                NA
## svm_caret_linear_metrics 0.8656753
## svm_caret_radial_metrics 0.8885927

# reorder model levels for better plot order
comparison_df$Model <- factor(comparison_df$Model, levels = comparison_df$Model)

# accuracy plot
ggplot(comparison_df, aes(x = Model, y = Accuracy, fill = Model)) +
  geom_col() +
  coord_flip() +
  labs(title = "Model Accuracy Comparison", x = "Model", y = "Accuracy") +
  theme_minimal() +
  theme(legend.position = "none") +
  geom_text(aes(label = round(Accuracy, 3)), hjust = -0.1, size = 3.5) +
  ylim(0, 1)

#f1 Score where available
#f1 Score where available
ggplot(comparison_long, aes(x = variable, y = value, fill = Model)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  geom_text(aes(label = ifelse(!is.na(value), round(value, 2), "")), 
            position = position_dodge(0.9), vjust = -0.25) +
  labs(title = "Performance Metrics with Labels", x = "Metric", y = "Value") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 40, hjust = 0.4))

Model Comparison: SVMs vs. Decision Trees and Ensemble Methods

SVMs outperformed Decision Trees by a significant margin across all evaluation metrics. Both oversampled and undersampled Decision Trees yielded low accuracies (~57%) and AUCs (~0.58), indicating that simple, axis-parallel decision boundaries are insufficient to capture the complexity of the dataset. In contrast, all SVM variants showed strong results, with accuracies ranging from 82.4% (polynomial) to 88.5% (caret radial), and AUC values between 0.93 and 0.94. The highest-performing SVM model, SVM with a radial kernel using the caret package, achieved an accuracy of 88.5% and an AUC of 0.94—demonstrating the benefit of using non-linear kernel functions to model intricate decision boundaries.

When comparing ensemble methods, Random Forest slightly outperformed all SVMs in terms of accuracy, reaching 90.2% with all features and 89.9% with just the top five features, while maintaining high AUC scores (0.94 and 0.93 respectively). This suggests that Random Forest’s ensemble structure is highly effective for this task, providing both predictive power and robustness even with fewer input variables. AdaBoost with oversampling also showed strong results (accuracy = 86.5%, AUC = 0.93), comparable to linear SVM models. However, AdaBoost’s performance dropped significantly when trained on demographic variables alone (accuracy = 59.2%, AUC = 0.63), highlighting the critical importance of contextual and behavioral features in this dataset. Overall, while Random Forest offers the best performance, SVMs—particularly with linear kernels present a compelling tradeoff between interpretability, performance, and computational efficiency.

Assignment Codes

# load necessary packages
rq_packages <- c("GGally", "naniar", "gridExtra", "scales", "ggplot2",
                 "dplyr", "tidyr", "corrplot", "ggcorrplot", "caret",
                 "naivebayes", "pROC", "car", "knitr", "rpart", "randomForest", 
                 "rpart.plot", "ROSE", "adabag", "reshape2", "ada", 
                 "smotefamily", "e1071", "kernlab", "doParallel")
# install and load packages
for (pkg in rq_packages) {
  if (!require(pkg, character.only = TRUE)) {
    install.packages(pkg)
    library(pkg, character.only = TRUE)
  }
}

#load the dataset
df <- read.csv("https://raw.githubusercontent.com/hamza9713/assignment_data_repo/refs/heads/main/bank-additional-full.csv", sep=";")

#rename columns for clarity
df <- df %>%
  rename(
    age = age,
    job = job,
    marital_status = marital,
    education = education,
    credit_default = default,
    mortgage = housing,
    personal_loan = loan,
    contact_method = contact,
    contact_month = month,
    contact_day = day_of_week,
    contact_duration = duration,
    campaign_contacts = campaign,
    days_since_last_contact = pdays,
    previous_contacts = previous,
    previous_outcome = poutcome,
    employment_rate = emp.var.rate,
    consumer_price_index = cons.price.idx,
    consumer_confidence_index = cons.conf.idx,
    euribor_rate = euribor3m,
    employees_count = nr.employed,
    subscription_status = y
  )

#data inspection
str(df)
summary(df)

##########################################################################
### data cleaning & preprocessing
##########################################################################
# convert "unknown" to NA and omit NAs
df[df == "unknown"] <- NA 
df <- na.omit(df)

# standardize numeric variables (z-score normalization)
numeric_vars <- df %>%
  select_if(is.numeric) %>%
  colnames()

df[numeric_vars] <- scale(df[numeric_vars])

#list of categorical variables
categorical_vars <- c(
  "job", "marital_status", "education", "credit_default",
  "mortgage", "personal_loan", "contact_method", "contact_month",
  "contact_day", "previous_outcome", "subscription_status"
)

#convert categorical variables to factors
df[categorical_vars] <- lapply(df[categorical_vars], factor)

# set seed for reproducibility
set.seed(123)

# balance and split the dataset (assume df is your original dataset)
# here, we are generating a balanced dataset using oversampling
df_balanced <- ovun.sample(subscription_status ~ ., data = df, method = "over", 
                           N = max(table(df$subscription_status)) * 2)$data

# train-test split
train_index <- createDataPartition(df_balanced$subscription_status, p = 0.7, list = FALSE)
train_data <- df_balanced[train_index, ]
test_data <- df_balanced[-train_index, ]

# inspect the training data
str(train_data)
summary(train_data)



############# All the models  ####################

# svm with linear kernel
svm_linear <- svm(subscription_status ~ ., data = train_data, kernel = "linear", probability = TRUE)

# predictions and evaluation for linear svm
svm_linear_pred <- predict(svm_linear, test_data)
svm_linear_prob <- predict(svm_linear, test_data, probability = TRUE)
svm_linear_cm <- confusionMatrix(svm_linear_pred, test_data$subscription_status, positive = "yes")
svm_linear_auc <- roc(test_data$subscription_status, as.numeric(attr(svm_linear_prob, "probabilities")[, "yes"]))

# function for F1 score
get_f1_score <- function(cm) {
  precision <- cm$byClass["Precision"]
  recall <- cm$byClass["Recall"]
  f1 <- 2 * (precision * recall) / (precision + recall)
  return(f1)
}

cat("Linear SVM Accuracy:", svm_linear_cm$overall["Accuracy"], "\n")
cat("Linear SVM AUC:", auc(svm_linear_auc), "\n")
cat("Linear SVM F1-Score:", get_f1_score(svm_linear_cm), "\n\n")
print(svm_linear_cm)




###########svm linear with tuning 
set.seed(123)
# balance and split the dataset using oversampling
df_balanced <- ovun.sample(subscription_status ~ ., data = df, method = "over", 
                           N = max(table(df$subscription_status)) * 2)$data

# train-test split
train_index <- createDataPartition(df_balanced$subscription_status, p = 0.7, list = FALSE)
train_data <- df_balanced[train_index, ]
test_data <- df_balanced[-train_index, ]

# remove near-zero variance predictors
nzv <- nearZeroVar(train_data)
if (length(nzv) > 0) {
  train_data <- train_data[, -nzv]
  test_data <- test_data[, -nzv]
}

# setup parallel processing
cl <- makePSOCKcluster(detectCores() - 1)
registerDoParallel(cl)

# train control with light cross-validation
train_control <- trainControl(method = "cv", number = 3, classProbs = TRUE, 
                              summaryFunction = twoClassSummary, savePredictions = TRUE)

# svm linear with caret + tuning
svm_linear_caret <- train(subscription_status ~ ., 
                          data = train_data, 
                          method = "svmLinear",
                          trControl = train_control,
                          preProcess = c("center", "scale"),
                          metric = "ROC",
                          tuneGrid = expand.grid(C = c(0.01, 0.1, 1)))  # small grid

# predict and evaluate
linear_pred <- predict(svm_linear_caret, test_data)
linear_prob <- predict(svm_linear_caret, test_data, type = "prob")
linear_cm <- confusionMatrix(linear_pred, test_data$subscription_status, positive = "yes")
linear_roc <- roc(test_data$subscription_status, linear_prob$yes)

#output linear tuning 
cat("Caret Linear SVM Accuracy:", linear_cm$overall["Accuracy"], "\n")
cat("Caret Linear SVM AUC:", auc(linear_roc), "\n")
cat("Caret Linear SVM F1-Score:", get_f1_score(linear_cm), "\n\n")
print(linear_cm)




# svm with radial kernel
svm_radial <- svm(subscription_status ~ ., data = train_data, kernel = "radial", probability = TRUE)

# predictions and evaluation for radial svm
svm_radial_pred <- predict(svm_radial, test_data)
svm_radial_prob <- predict(svm_radial, test_data, probability = TRUE)
svm_radial_cm <- confusionMatrix(svm_radial_pred, test_data$subscription_status, positive = "yes")
svm_radial_auc <- roc(test_data$subscription_status, as.numeric(attr(svm_radial_prob, "probabilities")[, "yes"]))

# output results for radial svm
cat("Radial SVM Accuracy:", svm_radial_cm$overall["Accuracy"], "\n")
cat("Radial SVM AUC:", auc(svm_radial_auc), "\n")
cat("Radial SVM F-Score:", get_f1_score(svm_radial_cm), "\n")
print(svm_radial_cm)


#svm radial with caret + tuning
svm_radial_caret <- train(subscription_status ~ ., 
                          data = train_data, 
                          method = "svmRadial",
                          trControl = train_control,
                          preProcess = c("center", "scale"),
                          metric = "ROC",
                          tuneGrid = expand.grid(sigma = 0.01, C = c(0.1, 1)))  # simple grid

# predict on the test 
radial_pred <- predict(svm_radial_caret, test_data)
radial_prob <- predict(svm_radial_caret, test_data, type = "prob")

radial_cm <- confusionMatrix(radial_pred, test_data$subscription_status, positive = "yes")
radial_roc <- roc(test_data$subscription_status, radial_prob$yes)

#output radical w tuning 
# Output results for radial SVM with tuning
cat("Caret Radial SVM Accuracy:", radial_cm$overall["Accuracy"], "\n")
cat("Caret Radial SVM AUC:", auc(radial_roc), "\n")  # Ensure auc is from pROC
cat("Caret Radial SVM F1-Score:", get_f1_score(radial_cm), "\n\n")
print(radial_cm)

# polynomial kernel 
set.seed(123)
# parallel processing
cl <- makePSOCKcluster(parallel::detectCores() - 1)
registerDoParallel(cl)

# smaller dataset sample
sample_index <- sample(nrow(train_data), 2000)
sample_train <- train_data[sample_index, ]

# fast trainControl
train_control <- trainControl(method = "cv", number = 5)

# smaller tuning grid
tune_grid <- expand.grid(degree = 2, scale = 0.1, C = c(1, 10))

# train SVM with polynomial kernel and preprocessing 
svm_poly <- train(subscription_status ~ ., 
                  data = sample_train, 
                  method = "svmPoly",
                  trControl = train_control,
                  preProcess = c("center", "scale", "zv"),
                  metric = "Accuracy",
                  tuneGrid = tune_grid)


# make predictions on test data
svm_poly_pred <- predict(svm_poly, newdata = test_data)

# ensure factors match for evaluation
svm_poly_pred <- factor(svm_poly_pred, levels = c("no", "yes"))
true_labels <- factor(test_data$subscription_status, levels = c("no", "yes"))
# confusion matrix
svm_poly_cm <- confusionMatrix(svm_poly_pred, true_labels, positive = "yes")
# print evaluation
print(svm_poly_cm)

#saving all the model for reuse 
saveRDS(svm_linear, file = "svm_linear_model.rds")
saveRDS(svm_linear_caret, file = "svm_linear_caret_model.rds")
saveRDS(svm_radial, file = "svm_radial_model.rds")
saveRDS(svm_radial_caret, file = "svm_radial_caret_model.rds")
saveRDS(svm_poly, file = "svm_poly_model.rds")





# results compilation and comparison
# previous models (decision trees, rf, adaboost)
dt_oversample_metrics <- c(Model = "Decision Tree (Oversampling)", Accuracy = 0.5740, AUC = 0.588, F1 = NA)
dt_undersample_metrics <- c(Model = "Decision Tree (Undersampling)", Accuracy = 0.5696, AUC = 0.584, F1 = NA)
rf_all_metrics <- c(Model = "Random Forest (All Features)", Accuracy = 0.9024, AUC = 0.9408, F1 = NA)
rf_top5_metrics <- c(Model = "Random Forest (Top 5 Features)", Accuracy = 0.8986, AUC = 0.9318, F1 = NA)
ada_oversample_metrics <- c(Model = "AdaBoost (Oversampling)", Accuracy = 0.8646, AUC = 0.93027, F1 = NA)
ada_demo_metrics <- c(Model = "AdaBoost (Demographics Only)", Accuracy = 0.5920, AUC = 0.62968, F1 = NA)

# svm models
svm_linear_metrics <- c(Model = "SVM (Linear)", Accuracy = 0.8692, AUC = 0.9330, F1 = 0.8735785)
svm_rbf_metrics <- c(Model = "SVM (Radial)", Accuracy = 0.8796, AUC = 0.9385, F1 = NA)
svm_poly_metrics <- c(Model = "SVM (Polynomial)", Accuracy = 0.8242, AUC = NA, F1 = NA)
svm_caret_linear_metrics <- c(Model = "SVM (Caret Linear)", Accuracy = 0.8637, AUC = 0.9328, F1 = 0.8656753)
svm_caret_radial_metrics <- c(Model = "SVM (Caret Radial)", Accuracy = 0.8849, AUC = 0.9422, F1 = 0.8885927)

# bind all model metric vectors
all_models <- rbind(
  dt_oversample_metrics,
  dt_undersample_metrics,
  rf_all_metrics,
  rf_top5_metrics,
  ada_oversample_metrics,
  ada_demo_metrics,
  svm_linear_metrics,
  svm_rbf_metrics,
  svm_poly_metrics,
  svm_caret_linear_metrics,
  svm_caret_radial_metrics
)

# convert to a data frame
comparison_df <- as.data.frame(all_models)

# convert numeric columns correctly
comparison_df$Accuracy <- as.numeric(comparison_df$Accuracy)
comparison_df$AUC <- as.numeric(comparison_df$AUC)
comparison_df$F1 <- as.numeric(comparison_df$F1)

# print the final comparison table
print(comparison_df)

# reorder model levels for better plot order
comparison_df$Model <- factor(comparison_df$Model, levels = comparison_df$Model)

# accuracy plot
ggplot(comparison_df, aes(x = Model, y = Accuracy, fill = Model)) +
  geom_col() +
  coord_flip() +
  labs(title = "Model Accuracy Comparison", x = "Model", y = "Accuracy") +
  theme_minimal() +
  theme(legend.position = "none") +
  geom_text(aes(label = round(Accuracy, 3)), hjust = -0.1, size = 3.5) +
  ylim(0, 1)

#f1 Score where available
ggplot(comparison_long, aes(x = variable, y = value, fill = Model)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  geom_text(aes(label = ifelse(!is.na(value), round(value, 2), "")), 
            position = position_dodge(0.9), vjust = -0.25) +
  labs(title = "Performance Metrics with Labels", x = "Metric", y = "Value") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))