Ahm Hamza 04.13.2025
Ahm Hamza 04.13.2025
Perform an analysis of the dataset used in previous homework using the SVM algorithm. Compare the results with the results from previous homework. Answer questions, such as: Which algorithm is recommended to get more accurate results? Is it better for classification or regression scenarios? Do you agree with the recommendations? Why?
Summary of the provided articles
[1] This article presents a comparative analysis of various machine learning algorithms—including Support Vector Machines (SVM), Random Forests, Neural Networks, and XGBoost applied to COVID-19 datasets. Notably, SVM outperformed the other models in accurately predicting positive cases across two different datasets. The study also emphasizes the significance of addressing class imbalance, recommending the use of undersampling and oversampling techniques—an approach that directly aligns with the preprocessing steps used in our project. These insights reinforce the effectiveness of SVM, particularly in scenarios where accurate classification is critical and datasets are imbalanced, much like the binary classification task in the bank marketing dataset.
[2] This study demonstrates the effectiveness of SVM in classifying COVID-19 severity with approximately 87% accuracy, highlighting SVM’s reliability in healthcare diagnostics for complex data, a finding consistent with the strong performance observed in our tuned radial SVM model for marketing predictions. Another study on a 10% positive COVID-19 patient dataset (n=5644) emphasizes the challenges of class imbalance, where methods like Balanced Random Forest (RUS) and RUSBagging showed superior AUPRC and AUROC, respectively, for imbalanced data. Both studies used standard evaluation metrics (accuracy, precision, recall, F1-measure, AUC-ROC, AUPRC), and the latter underscores the importance of addressing class imbalance and incorporating features like age to improve predictive accuracy, a factor some studies overlooked.
Title “A comparative analysis of K-Nearest Neighbor, Genetic, Support Vector Machine, Decision Tree, and Long Short Term Memory algorithms in machine learning” link
Summary The article presents a comprehensive comparative analysis of five machine learning algorithms: K-Nearest Neighbor (KNN), Genetic Algorithm (GA), Support Vector Machine (SVM), Decision Tree (DT), and Long Short-Term Memory (LSTM). It highlights their functionalities, advantages, and disadvantages, as well as their performance in real-time applications. SVM consistently demonstrated superior performance compared to KNN and DT in various applications, particularly in breast cancer detection and student performance prediction.In the credit card fraud detection comparison, SVM outperformed both KNN and Naive Bayes in terms of accuracy. SVM achieved high accuracy levels (up to 96%) in cases where there was a clear margin of separation between classes, making it effective for high-dimensional data. The article recommends using SVM for applications requiring high accuracy and reliability, especially in domains like healthcare and finance. This aligns well with projects focused on predictive analytics where performance and accuracy are critical.
Title: “Data mining methods in the prediction of Dementia: A real-data comparison of the accuracy, sensitivity and specificity of linear discriminant analysis, logistic regression, neural networks, support vector machines, classification trees and random forests” link
Summary
This study evaluates several classifiers for predicting the progression of Mild Cognitive Impairment to dementia, comparing traditional methods like LDA and logistic regression with modern ones such as SVM, Random Forest, and Neural Networks. SVM achieved the highest overall accuracy and perfect specificity but showed low sensitivity, limiting its ability to detect positive dementia cases. In contrast, Random Forest provided a better balance between accuracy and sensitivity.
The article recommends using models like Random Forest or LDA for tasks where sensitivity is important. This finding relates to our project, as accurate classification alone is not enough. While SVM performed well on our bank dataset, it is important to evaluate whether its trade-off between accuracy and sensitivity is appropriate for our specific use case.
Title: “A comparative study of forecasting corporate credit ratings using neural networks, support vector machines, and decision trees” link
Summary
This article evaluates the effectiveness of various machine learning models, including SVM, Neural Networks, and Decision Trees, in predicting corporate credit ratings. While SVM showed strong classification accuracy, it was less effective in predicting rating changes. In contrast, Bagged Decision Trees and Random Forests outperformed both SVM and Neural Networks, achieving up to 84.21% accuracy.
The study highlights that Decision Trees require less parameter tuning and offer greater interpretability, making them more accessible for practical use.This is relevant to the assignment as they emphasize the strengths of ensemble tree-based models in high-stakes classification tasks. However, bank dataset, tuned SVM models demonstrated higher AUC and F1-Score, suggesting that the best model choice depends on the specific data and objective.
Across all five articles, SVM is consistently recognized for its high classification accuracy, particularly in high-dimensional datasets like healthcare diagnostics, credit scoring, and fraud detection. The two provided studies on COVID-19 prediction both emphasize SVM’s effectiveness when paired with techniques to address class imbalance, which aligns with our project’s success using oversampling and tuned radial kernels. However, the three academic articles I sourced present a more nuanced view.
For instance, while SVM outperformed Decision Trees (DT) in applications like breast cancer detection and student performance prediction, other domains such as dementia prediction and corporate credit rating favored ensemble tree models like Random Forest or Bagged DTs. These models offered better sensitivity, easier interpretability, and robustness to noisy or imbalanced data, echoing results in my previous assignments where Random Forests provided solid, balanced metrics.
SVMs perform best when the margin between classes is clearly separable, and when computational resources allow for careful tuning. In contrast, tree-based methods excel in real-world settings where interpretability, sensitivity, or minimal tuning are priorities. In our bank dataset, SVM’s high precision and AUC were valuable, but interpretability and balance offered by tree models are also significant in institutional applications.
# load necessary packages
rq_packages <- c("GGally", "naniar", "gridExtra", "scales", "ggplot2",
"dplyr", "tidyr", "corrplot", "ggcorrplot", "caret",
"naivebayes", "pROC", "car", "knitr", "rpart", "randomForest",
"rpart.plot", "ROSE", "adabag", "reshape2", "ada",
"smotefamily", "e1071", "kernlab", "doParallel")
# install and load packages
for (pkg in rq_packages) {
if (!require(pkg, character.only = TRUE)) {
install.packages(pkg)
library(pkg, character.only = TRUE)
}
}
#load the dataset
df <- read.csv("https://raw.githubusercontent.com/hamza9713/assignment_data_repo/refs/heads/main/bank-additional-full.csv", sep=";")
#rename columns for clarity
df <- df %>%
rename(
age = age,
job = job,
marital_status = marital,
education = education,
credit_default = default,
mortgage = housing,
personal_loan = loan,
contact_method = contact,
contact_month = month,
contact_day = day_of_week,
contact_duration = duration,
campaign_contacts = campaign,
days_since_last_contact = pdays,
previous_contacts = previous,
previous_outcome = poutcome,
employment_rate = emp.var.rate,
consumer_price_index = cons.price.idx,
consumer_confidence_index = cons.conf.idx,
euribor_rate = euribor3m,
employees_count = nr.employed,
subscription_status = y
)
#data inspection
# str(df)
# summary(df)
##########################################################################
### data cleaning & preprocessing
##########################################################################
# convert "unknown" to NA and omit NAs
df[df == "unknown"] <- NA
df <- na.omit(df)
# standardize numeric variables (z-score normalization)
numeric_vars <- df %>%
select_if(is.numeric) %>%
colnames()
df[numeric_vars] <- scale(df[numeric_vars])
#list of categorical variables
categorical_vars <- c(
"job", "marital_status", "education", "credit_default",
"mortgage", "personal_loan", "contact_method", "contact_month",
"contact_day", "previous_outcome", "subscription_status"
)
#convert categorical variables to factors
df[categorical_vars] <- lapply(df[categorical_vars], factor)
# set seed for reproducibility
set.seed(123)
# balance and split the dataset (assume df is your original dataset)
# here, we are generating a balanced dataset using oversampling
df_balanced <- ovun.sample(subscription_status ~ ., data = df, method = "over",
N = max(table(df$subscription_status)) * 2)$data
# train-test split
train_index <- createDataPartition(df_balanced$subscription_status, p = 0.7, list = FALSE)
train_data <- df_balanced[train_index, ]
test_data <- df_balanced[-train_index, ]
# inspect the training data
str(train_data)
## 'data.frame': 37282 obs. of 21 variables:
## $ age : num 1.6422 0.0939 1.6422 -1.4545 -1.3577 ...
## $ job : Factor w/ 11 levels "admin.","blue-collar",..: 4 1 8 10 8 8 2 2 2 2 ...
## $ marital_status : Factor w/ 3 levels "divorced","married",..: 2 2 2 3 3 3 3 2 2 2 ...
## $ education : Factor w/ 7 levels "basic.4y","basic.6y",..: 1 2 4 6 4 4 4 2 2 3 ...
## $ credit_default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ mortgage : Factor w/ 2 levels "no","yes": 1 1 1 2 2 2 1 2 2 2 ...
## $ personal_loan : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 2 ...
## $ contact_method : Factor w/ 2 levels "cellular","telephone": 2 2 2 2 2 2 2 2 2 2 ...
## $ contact_month : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ contact_day : Factor w/ 5 levels "fri","mon","thu",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ contact_duration : num 0.00579 -0.41451 0.18156 0.46049 -0.80043 ...
## $ campaign_contacts : num -0.559 -0.559 -0.559 -0.559 -0.559 ...
## $ days_since_last_contact : num 0.212 0.212 0.212 0.212 0.212 ...
## $ previous_contacts : num -0.372 -0.372 -0.372 -0.372 -0.372 ...
## $ previous_outcome : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ employment_rate : num 0.727 0.727 0.727 0.727 0.727 ...
## $ consumer_price_index : num 0.804 0.804 0.804 0.804 0.804 ...
## $ consumer_confidence_index: num 0.877 0.877 0.877 0.877 0.877 ...
## $ euribor_rate : num 0.786 0.786 0.786 0.786 0.786 ...
## $ employees_count : num 0.402 0.402 0.402 0.402 0.402 ...
## $ subscription_status : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
summary(train_data)
## age job marital_status
## Min. :-2.1319 admin. :11102 divorced: 4158
## 1st Qu.:-0.7771 technician : 6497 married :20675
## Median :-0.2932 blue-collar: 5862 single :12449
## Mean : 0.0566 services : 3095
## 3rd Qu.: 0.6745 management : 2757
## Max. : 5.4163 retired : 2350
## (Other) : 5619
## education credit_default mortgage personal_loan
## basic.4y : 3077 no :37281 no :16838 no :31527
## basic.6y : 1439 yes: 1 yes:20444 yes: 5755
## basic.9y : 4547
## high.school : 9477
## illiterate : 29
## professional.course: 5198
## university.degree :13515
## contact_method contact_month contact_day contact_duration
## cellular :27748 may :9633 fri:6778 Min. :-0.991479
## telephone: 9534 jul :5603 mon:7313 1st Qu.:-0.441260
## aug :5433 thu:8151 Median :-0.005671
## jun :4536 tue:7518 Mean : 0.453812
## nov :3993 wed:7522 3rd Qu.: 0.934286
## apr :3345 Max. :17.800008
## (Other):4739
## campaign_contacts days_since_last_contact previous_contacts
## Min. :-0.55933 Min. :-4.7491 Min. :-0.3716
## 1st Qu.:-0.55933 1st Qu.: 0.2119 1st Qu.:-0.3716
## Median :-0.19170 Median : 0.2119 Median :-0.3716
## Mean :-0.07452 Mean :-0.3713 Mean : 0.2539
## 3rd Qu.: 0.17593 3rd Qu.: 0.2119 3rd Qu.:-0.3716
## Max. :14.88100 Max. : 0.2119 Max. :13.0181
##
## previous_outcome employment_rate consumer_price_index
## failure : 4522 Min. :-2.0669 Min. :-2.2589
## nonexistent:28707 1st Qu.:-1.0733 1st Qu.:-1.0768
## success : 4053 Median :-0.6387 Median :-0.1355
## Mean :-0.3389 Mean :-0.1395
## 3rd Qu.: 0.9138 3rd Qu.: 0.8041
## Max. : 0.9138 Max. : 2.1246
##
## consumer_confidence_index euribor_rate employees_count
## Min. :-2.12930 Min. :-1.5901 Min. :-2.6240
## 1st Qu.:-1.16881 1st Qu.:-1.3566 1st Qu.:-1.1258
## Median :-0.25009 Median :-1.1338 Median :-0.8211
## Mean : 0.06331 Mean :-0.3503 Mean :-0.4052
## 3rd Qu.: 0.87744 3rd Qu.: 0.8424 3rd Qu.: 0.8953
## Max. : 2.86105 Max. : 0.8919 Max. : 0.8953
##
## subscription_status
## no :18641
## yes:18641
##
##
##
##
##
# Function for F1 score calculation
get_f1_score <- function(cm) {
precision <- cm$byClass["Precision"]
recall <- cm$byClass["Recall"]
f1 <- 2 * (precision * recall) / (precision + recall)
return(f1)
}
# Load the saved models - adjust paths as needed
model_path <- "/Users/ahmhamza/"
# Load models from RDS files
svm_linear <- readRDS(paste0(model_path, "svm_linear_model.rds"))
svm_linear_caret <- readRDS(paste0(model_path, "svm_linear_caret_model.rds"))
svm_radial <- readRDS(paste0(model_path, "svm_radial_model.rds"))
svm_radial_caret <- readRDS(paste0(model_path, "svm_radial_caret_model.rds"))
svm_poly <- readRDS(paste0(model_path, "svm_poly_model.rds"))
The linear kernel assumes that the data is separable by linearity, meaning a hyperplane that cleanly divides the “yes” and “no” subscription classes. To adress class-imbalance I used oversampling, ensuring the model isn’t biased toward the majority class. Linear SVMs are efficient and less prone to overfitting, especially with high-dimensional datasets like this. The use of probabilities and ROC/AUC also gives insights into how well the classifier ranks predictions, not just how often it’s correct.
Hypothesis: A linear decision boundary will be enough if the classescan be separated with a straight hyperplane, especially with standardized features.
# Predictions and evaluation for linear SVM
svm_linear_pred <- predict(svm_linear, test_data)
svm_linear_prob <- predict(svm_linear, test_data, probability = TRUE)
svm_linear_cm <- confusionMatrix(svm_linear_pred, test_data$subscription_status, positive = "yes")
svm_linear_auc <- roc(test_data$subscription_status, as.numeric(attr(svm_linear_prob, "probabilities")[, "yes"]))
cat("Linear SVM Accuracy:", svm_linear_cm$overall["Accuracy"], "\n")
## Linear SVM Accuracy: 0.8691788
cat("Linear SVM AUC:", auc(svm_linear_auc), "\n")
## Linear SVM AUC: 0.9330476
cat("Linear SVM F1-Score:", get_f1_score(svm_linear_cm), "\n\n")
## Linear SVM F1-Score: 0.8735785
print(svm_linear_cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6665 767
## yes 1323 7221
##
## Accuracy : 0.8692
## 95% CI : (0.8639, 0.8744)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7384
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9040
## Specificity : 0.8344
## Pos Pred Value : 0.8452
## Neg Pred Value : 0.8968
## Prevalence : 0.5000
## Detection Rate : 0.4520
## Detection Prevalence : 0.5348
## Balanced Accuracy : 0.8692
##
## 'Positive' Class : yes
##
Prediction & Probability
The linear SVM demonstrates strong effectiveness for binary classification on the dataset, achieving approximately 86.92% accuracy, a 0.933 AUC, and a 0.8736 F1-score, indicating a robust ability to distinguish subscribers from non-subscribers with a well-suited linear decision boundary that captures key data relationships without overfitting. This high AUC and balanced accuracy, with 90.4% sensitivity and 83.4% specificity, suggest a favorable bias-variance tradeoff, exhibiting low bias by learning meaningful patterns and moderate to low variance by avoiding overfitting, outperforming Decision Tree models that showed higher bias and limited feature utility.
SVM linear model with hyperparameter tuning to optimize the model’s performance. Also by removing features that show almost no variation and eliminating potentially noisy that could harm model performance or slow down training without contributing predictive power.
Hypothesis: Fine-tuning the cost parameter C via cross-validation will improve better performance tan base linear model.
# Predict and evaluate
linear_pred <- predict(svm_linear_caret, test_data)
linear_prob <- predict(svm_linear_caret, test_data, type = "prob")
linear_cm <- confusionMatrix(linear_pred, test_data$subscription_status, positive = "yes")
linear_roc <- roc(test_data$subscription_status, linear_prob$yes)
cat("Caret Linear SVM Accuracy:", linear_cm$overall["Accuracy"], "\n")
## Caret Linear SVM Accuracy: 0.8637331
cat("Caret Linear SVM AUC:", auc(linear_roc), "\n")
## Caret Linear SVM AUC: 0.9327739
cat("Caret Linear SVM F1-Score:", get_f1_score(linear_cm), "\n\n")
## Caret Linear SVM F1-Score: 0.8656753
print(linear_cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6784 973
## yes 1204 7015
##
## Accuracy : 0.8637
## 95% CI : (0.8583, 0.869)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7275
##
## Mcnemar's Test P-Value : 8.246e-07
##
## Sensitivity : 0.8782
## Specificity : 0.8493
## Pos Pred Value : 0.8535
## Neg Pred Value : 0.8746
## Prevalence : 0.5000
## Detection Rate : 0.4391
## Detection Prevalence : 0.5145
## Balanced Accuracy : 0.8637
##
## 'Positive' Class : yes
##
Prediction & Probability
Hyperparameter tuning on the Linear SVM yielded a slightly lower accuracy (86.37%), AUC (0.9328), and F1-score (0.8657) compared to the base model, thus not strongly supporting the hypothesis of significant improvement through tuning for this classification task. However, the tuned model demonstrated an improved balance between sensitivity (87.82%) and specificity (84.93%), which is critical for equitable detection in potentially imbalanced datasets by mitigating false positives at the cost of a slight increase in bias, evidenced by a modest reduction in sensitivity from the untuned model’s 90.4%. The substantial kappa statistic of 0.7275 reinforces the tuned model’s reliability. The minimal performance gain from tuning suggests that the default linear kernel already operates near its optimal capacity for this dataset, indicating that prioritizing model simplicity and computational efficiency might be a more pragmatic approach for future model selection.
The radial kernel projects the data into a higher-dimensional space where a linear separator might exist. This is especially useful when the decision boundary is curved or wavy as might be the case in this dataset with complex customer behaviors. The model should capture more subtle patterns, especially after oversampling. Probability estimation and AUC here help understand how confidently the model distinguishes the two classes.
Hypothesis: RBF can better handle complex boundaries in the data, leading to improved classification of harder-to-separate instances
# Predictions and evaluation for radial SVM
svm_radial_pred <- predict(svm_radial, test_data)
svm_radial_prob <- predict(svm_radial, test_data, probability = TRUE)
svm_radial_cm <- confusionMatrix(svm_radial_pred, test_data$subscription_status, positive = "yes")
svm_radial_auc <- roc(test_data$subscription_status, as.numeric(attr(svm_radial_prob, "probabilities")[, "yes"]))
cat("Radial SVM Accuracy:", svm_radial_cm$overall["Accuracy"], "\n")
## Radial SVM Accuracy: 0.8788182
cat("Radial SVM AUC:", auc(svm_radial_auc), "\n")
## Radial SVM AUC: 0.9389735
cat("Radial SVM F-Score:", get_f1_score(svm_radial_cm), "\n")
## Radial SVM F-Score: 0.8845282
print(svm_radial_cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6625 573
## yes 1363 7415
##
## Accuracy : 0.8788
## 95% CI : (0.8737, 0.8838)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7576
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9283
## Specificity : 0.8294
## Pos Pred Value : 0.8447
## Neg Pred Value : 0.9204
## Prevalence : 0.5000
## Detection Rate : 0.4641
## Detection Prevalence : 0.5494
## Balanced Accuracy : 0.8788
##
## 'Positive' Class : yes
##
Prediction & Probability
The SVM with an RBF kernel strongly supports the hypothesis that nonlinear kernels can better model complex decision boundaries, achieving superior performance with 87.88% accuracy, a high AUC of 0.9390, and an F1-score of 0.8845, outperforming linear SVM variants. Its high sensitivity (92.83%) highlights a strong ability to detect positive cases, crucial where false negatives are costly. While specificity (82.94%) is slightly lower than linear models, the balanced accuracy remains robust at 87.88%, and the higher Kappa score (0.7576) indicates stronger agreement. A significant McNemar’s Test p-value (<2.2e-16) further validates the model’s statistical significance. This improved performance confirms the necessity of nonlinear boundaries for accurately classifying complex instances within the dataset’s intricate feature interactions.
The sigma parameter controls how far the influence of a single training example reaches. A small sigma leads to a complex, wiggly decision boundary (low bias, high variance), while a large sigma results in a smoother boundary (higher bias, lower variance). The tested involves a small grid (sigma = 0.01, C = 0.1 or 1) under cross-validation to strike a balance. Using caret also streamlines preprocessing and ensures fair model evaluation.
Hypothesis: If the tuning leads improve performance, it would indicate that the dataset benefits from a more flexible, non-linear decision boundary and regularization strength classification
# Predict on the test data
radial_pred <- predict(svm_radial_caret, test_data)
radial_prob <- predict(svm_radial_caret, test_data, type = "prob")
radial_cm <- confusionMatrix(radial_pred, test_data$subscription_status, positive = "yes")
radial_roc <- roc(test_data$subscription_status, radial_prob$yes)
cat("Caret Radial SVM Accuracy:", radial_cm$overall["Accuracy"], "\n")
## Caret Radial SVM Accuracy: 0.8847646
cat("Caret Radial SVM AUC:", auc(radial_roc), "\n")
## Caret Radial SVM AUC: 0.9421786
cat("Caret Radial SVM F1-Score:", get_f1_score(radial_cm), "\n\n")
## Caret Radial SVM F1-Score: 0.888431
print(radial_cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6805 658
## yes 1183 7330
##
## Accuracy : 0.8848
## 95% CI : (0.8797, 0.8897)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7695
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9176
## Specificity : 0.8519
## Pos Pred Value : 0.8610
## Neg Pred Value : 0.9118
## Prevalence : 0.5000
## Detection Rate : 0.4588
## Detection Prevalence : 0.5329
## Balanced Accuracy : 0.8848
##
## 'Positive' Class : yes
##
Prediction & Probability
Hyperparameter tuning significantly improve the Radial Basis Function (RBF) SVM’s classification performance, supporting the hypothesis that tuning optimizes the model for the dataset’s complexity. Achieving the highest accuracy (88.48%) and AUC (0.9422) among all tested models, the tuned radial SVM demonstrates superior predictive power and class discrimination. Its strong F1-score (0.8884) indicates a balanced precision and recall. Notably, it improved both sensitivity (91.76%) and specificity (85.19%) compared to previous models, resulting in a balanced accuracy of 88.48% and a high Kappa score (0.7695), further validated by a significant McNemar’s test p-value (< 2.2e-16). These improvements confirm that the nonlinear RBF kernel, combined with optimized cost (C) and gamma, effectively captures the dataset’s intricate structure by balancing margin maximization and error minimization, leading to better generalization through regularization and adaptation to complex decision surfaces.
Polynomial kernels are particularly useful when the relationship between features follows a polynomial pattern. After waiting for 4 hours of slow convergence smaller sample of 2000 observations to reduce computation, suggesting polynomial SVMs are heavier. The degree-2 polynomial kernel models not just the original features, but also their pairwise interactions. This is helpful when feature interactions drive outcomes. Using a small grid (C = 1, 10; scale = 0.1) with cross-validation ensures the model doesn’t overfit.
Hypothesis: A degree-2 polynomial kernel is hypothesized to effectively model feature interactions, proving beneficial when relationships are nonlinear but less intricate than those requiring an RBF kernel.
# Make predictions on test data
svm_poly_pred <- predict(svm_poly, newdata = test_data)
# Ensure factors match for evaluation
svm_poly_pred <- factor(svm_poly_pred, levels = c("no", "yes"))
true_labels <- factor(test_data$subscription_status, levels = c("no", "yes"))
# Confusion matrix
svm_poly_cm <- confusionMatrix(svm_poly_pred, true_labels, positive = "yes")
# Print evaluation
print(svm_poly_cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6720 1540
## yes 1268 6448
##
## Accuracy : 0.8242
## 95% CI : (0.8182, 0.8301)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6485
##
## Mcnemar's Test P-Value : 3.152e-07
##
## Sensitivity : 0.8072
## Specificity : 0.8413
## Pos Pred Value : 0.8357
## Neg Pred Value : 0.8136
## Prevalence : 0.5000
## Detection Rate : 0.4036
## Detection Prevalence : 0.4830
## Balanced Accuracy : 0.8242
##
## 'Positive' Class : yes
##
Prediction & Probability
The polynomial kernel (degree 2) achieved a respectable accuracy of 82.4%. Its balanced accuracy of 82.4% and a Kappa score of 0.6485 indicate good performance beyond chance, without strong bias towards either class, further supported by a sensitivity of 80.7% and specificity of 84.1%. However, a significant McNemar’s Test (p < 0.001) suggests a potential difference in error distribution between classes, warranting consideration for fairness. Overall, the degree-2 polynomial kernel empirically validates the hypothesis, offering a strong compromise between simpler linear models and potentially more complex radial kernels by capturing moderate nonlinear patterns with reasonable computational efficiency and interpretability, making it a solid choice when fairness and transparency are important.
# Create a dataframe to compare models
model_names <- c("Linear SVM", "Linear SVM (Caret)", "Radial SVM",
"Radial SVM (Caret)", "Polynomial SVM")
accuracies <- c(
svm_linear_cm$overall["Accuracy"],
linear_cm$overall["Accuracy"],
svm_radial_cm$overall["Accuracy"],
radial_cm$overall["Accuracy"],
svm_poly_cm$overall["Accuracy"]
)
f1_scores <- c(
get_f1_score(svm_linear_cm),
get_f1_score(linear_cm),
get_f1_score(svm_radial_cm),
get_f1_score(radial_cm),
get_f1_score(svm_poly_cm)
)
aucs <- c(
auc(svm_linear_auc),
auc(linear_roc),
auc(svm_radial_auc),
auc(radial_roc),
NA # We don't have AUC for polynomial model
)
comparison_df <- data.frame(
Model = model_names,
Accuracy = round(accuracies, 4),
F1_Score = round(f1_scores, 4),
AUC = round(aucs, 4)
)
# Print comparison table
knitr::kable(comparison_df, caption = "Model Performance Comparison")
Model | Accuracy | F1_Score | AUC |
---|---|---|---|
Linear SVM | 0.8692 | 0.8736 | 0.9330 |
Linear SVM (Caret) | 0.8637 | 0.8657 | 0.9328 |
Radial SVM | 0.8788 | 0.8845 | 0.9390 |
Radial SVM (Caret) | 0.8848 | 0.8884 | 0.9422 |
Polynomial SVM | 0.8242 | 0.8212 | NA |
# Create a bar plot to compare model performance
# Prepare data for plotting
plot_data <- melt(comparison_df, id.vars = "Model",
variable.name = "Metric", value.name = "Value")
# Create the plot
ggplot(plot_data, aes(x = Model, y = Value, fill = Metric)) +
geom_bar(stat = "identity", position = position_dodge()) +
theme_minimal() +
labs(title = "SVM Model Performance Comparison",
x = "Model", y = "Score") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_brewer(palette = "Set1")
################################################################
# results compilation and comparison
# previous models (decision trees, rf, adaboost)
dt_oversample_metrics <- c(Model = "Decision Tree (Oversampling)", Accuracy = 0.5740, AUC = 0.588, F1 = NA)
dt_undersample_metrics <- c(Model = "Decision Tree (Undersampling)", Accuracy = 0.5696, AUC = 0.584, F1 = NA)
rf_all_metrics <- c(Model = "Random Forest (All Features)", Accuracy = 0.9024, AUC = 0.9408, F1 = NA)
rf_top5_metrics <- c(Model = "Random Forest (Top 5 Features)", Accuracy = 0.8986, AUC = 0.9318, F1 = NA)
ada_oversample_metrics <- c(Model = "AdaBoost (Oversampling)", Accuracy = 0.8646, AUC = 0.93027, F1 = NA)
ada_demo_metrics <- c(Model = "AdaBoost (Demographics Only)", Accuracy = 0.5920, AUC = 0.62968, F1 = NA)
# svm models
svm_linear_metrics <- c(Model = "SVM (Linear)", Accuracy = 0.8692, AUC = 0.9330, F1 = 0.8735785)
svm_rbf_metrics <- c(Model = "SVM (Radial)", Accuracy = 0.8796, AUC = 0.9385, F1 = NA)
svm_poly_metrics <- c(Model = "SVM (Polynomial)", Accuracy = 0.8242, AUC = NA, F1 = NA)
svm_caret_linear_metrics <- c(Model = "SVM (Caret Linear)", Accuracy = 0.8637, AUC = 0.9328, F1 = 0.8656753)
svm_caret_radial_metrics <- c(Model = "SVM (Caret Radial)", Accuracy = 0.8849, AUC = 0.9422, F1 = 0.8885927)
# bind all model metric vectors
all_models <- rbind(
dt_oversample_metrics,
dt_undersample_metrics,
rf_all_metrics,
rf_top5_metrics,
ada_oversample_metrics,
ada_demo_metrics,
svm_linear_metrics,
svm_rbf_metrics,
svm_poly_metrics,
svm_caret_linear_metrics,
svm_caret_radial_metrics
)
# convert to a data frame
comparison_df <- as.data.frame(all_models)
# convert numeric columns correctly
comparison_df$Accuracy <- as.numeric(comparison_df$Accuracy)
comparison_df$AUC <- as.numeric(comparison_df$AUC)
comparison_df$F1 <- as.numeric(comparison_df$F1)
# convert to long format for the second plot
comparison_long <- pivot_longer(
comparison_df,
cols = c(Accuracy, AUC, F1),
names_to = "variable",
values_to = "value"
)
# print the final comparison table
print(comparison_df)
## Model Accuracy AUC
## dt_oversample_metrics Decision Tree (Oversampling) 0.5740 0.58800
## dt_undersample_metrics Decision Tree (Undersampling) 0.5696 0.58400
## rf_all_metrics Random Forest (All Features) 0.9024 0.94080
## rf_top5_metrics Random Forest (Top 5 Features) 0.8986 0.93180
## ada_oversample_metrics AdaBoost (Oversampling) 0.8646 0.93027
## ada_demo_metrics AdaBoost (Demographics Only) 0.5920 0.62968
## svm_linear_metrics SVM (Linear) 0.8692 0.93300
## svm_rbf_metrics SVM (Radial) 0.8796 0.93850
## svm_poly_metrics SVM (Polynomial) 0.8242 NA
## svm_caret_linear_metrics SVM (Caret Linear) 0.8637 0.93280
## svm_caret_radial_metrics SVM (Caret Radial) 0.8849 0.94220
## F1
## dt_oversample_metrics NA
## dt_undersample_metrics NA
## rf_all_metrics NA
## rf_top5_metrics NA
## ada_oversample_metrics NA
## ada_demo_metrics NA
## svm_linear_metrics 0.8735785
## svm_rbf_metrics NA
## svm_poly_metrics NA
## svm_caret_linear_metrics 0.8656753
## svm_caret_radial_metrics 0.8885927
# reorder model levels for better plot order
comparison_df$Model <- factor(comparison_df$Model, levels = comparison_df$Model)
# accuracy plot
ggplot(comparison_df, aes(x = Model, y = Accuracy, fill = Model)) +
geom_col() +
coord_flip() +
labs(title = "Model Accuracy Comparison", x = "Model", y = "Accuracy") +
theme_minimal() +
theme(legend.position = "none") +
geom_text(aes(label = round(Accuracy, 3)), hjust = -0.1, size = 3.5) +
ylim(0, 1)
#f1 Score where available
#f1 Score where available
ggplot(comparison_long, aes(x = variable, y = value, fill = Model)) +
geom_bar(stat = "identity", position = position_dodge()) +
geom_text(aes(label = ifelse(!is.na(value), round(value, 2), "")),
position = position_dodge(0.9), vjust = -0.25) +
labs(title = "Performance Metrics with Labels", x = "Metric", y = "Value") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 40, hjust = 0.4))
SVMs outperformed Decision Trees by a significant margin across all evaluation metrics. Both oversampled and undersampled Decision Trees yielded low accuracies (~57%) and AUCs (~0.58), indicating that simple, axis-parallel decision boundaries are insufficient to capture the complexity of the dataset. In contrast, all SVM variants showed strong results, with accuracies ranging from 82.4% (polynomial) to 88.5% (caret radial), and AUC values between 0.93 and 0.94. The highest-performing SVM model, SVM with a radial kernel using the caret package, achieved an accuracy of 88.5% and an AUC of 0.94—demonstrating the benefit of using non-linear kernel functions to model intricate decision boundaries.
When comparing ensemble methods, Random Forest slightly outperformed all SVMs in terms of accuracy, reaching 90.2% with all features and 89.9% with just the top five features, while maintaining high AUC scores (0.94 and 0.93 respectively). This suggests that Random Forest’s ensemble structure is highly effective for this task, providing both predictive power and robustness even with fewer input variables. AdaBoost with oversampling also showed strong results (accuracy = 86.5%, AUC = 0.93), comparable to linear SVM models. However, AdaBoost’s performance dropped significantly when trained on demographic variables alone (accuracy = 59.2%, AUC = 0.63), highlighting the critical importance of contextual and behavioral features in this dataset. Overall, while Random Forest offers the best performance, SVMs—particularly with linear kernels present a compelling tradeoff between interpretability, performance, and computational efficiency.
# load necessary packages
rq_packages <- c("GGally", "naniar", "gridExtra", "scales", "ggplot2",
"dplyr", "tidyr", "corrplot", "ggcorrplot", "caret",
"naivebayes", "pROC", "car", "knitr", "rpart", "randomForest",
"rpart.plot", "ROSE", "adabag", "reshape2", "ada",
"smotefamily", "e1071", "kernlab", "doParallel")
# install and load packages
for (pkg in rq_packages) {
if (!require(pkg, character.only = TRUE)) {
install.packages(pkg)
library(pkg, character.only = TRUE)
}
}
#load the dataset
df <- read.csv("https://raw.githubusercontent.com/hamza9713/assignment_data_repo/refs/heads/main/bank-additional-full.csv", sep=";")
#rename columns for clarity
df <- df %>%
rename(
age = age,
job = job,
marital_status = marital,
education = education,
credit_default = default,
mortgage = housing,
personal_loan = loan,
contact_method = contact,
contact_month = month,
contact_day = day_of_week,
contact_duration = duration,
campaign_contacts = campaign,
days_since_last_contact = pdays,
previous_contacts = previous,
previous_outcome = poutcome,
employment_rate = emp.var.rate,
consumer_price_index = cons.price.idx,
consumer_confidence_index = cons.conf.idx,
euribor_rate = euribor3m,
employees_count = nr.employed,
subscription_status = y
)
#data inspection
str(df)
summary(df)
##########################################################################
### data cleaning & preprocessing
##########################################################################
# convert "unknown" to NA and omit NAs
df[df == "unknown"] <- NA
df <- na.omit(df)
# standardize numeric variables (z-score normalization)
numeric_vars <- df %>%
select_if(is.numeric) %>%
colnames()
df[numeric_vars] <- scale(df[numeric_vars])
#list of categorical variables
categorical_vars <- c(
"job", "marital_status", "education", "credit_default",
"mortgage", "personal_loan", "contact_method", "contact_month",
"contact_day", "previous_outcome", "subscription_status"
)
#convert categorical variables to factors
df[categorical_vars] <- lapply(df[categorical_vars], factor)
# set seed for reproducibility
set.seed(123)
# balance and split the dataset (assume df is your original dataset)
# here, we are generating a balanced dataset using oversampling
df_balanced <- ovun.sample(subscription_status ~ ., data = df, method = "over",
N = max(table(df$subscription_status)) * 2)$data
# train-test split
train_index <- createDataPartition(df_balanced$subscription_status, p = 0.7, list = FALSE)
train_data <- df_balanced[train_index, ]
test_data <- df_balanced[-train_index, ]
# inspect the training data
str(train_data)
summary(train_data)
############# All the models ####################
# svm with linear kernel
svm_linear <- svm(subscription_status ~ ., data = train_data, kernel = "linear", probability = TRUE)
# predictions and evaluation for linear svm
svm_linear_pred <- predict(svm_linear, test_data)
svm_linear_prob <- predict(svm_linear, test_data, probability = TRUE)
svm_linear_cm <- confusionMatrix(svm_linear_pred, test_data$subscription_status, positive = "yes")
svm_linear_auc <- roc(test_data$subscription_status, as.numeric(attr(svm_linear_prob, "probabilities")[, "yes"]))
# function for F1 score
get_f1_score <- function(cm) {
precision <- cm$byClass["Precision"]
recall <- cm$byClass["Recall"]
f1 <- 2 * (precision * recall) / (precision + recall)
return(f1)
}
cat("Linear SVM Accuracy:", svm_linear_cm$overall["Accuracy"], "\n")
cat("Linear SVM AUC:", auc(svm_linear_auc), "\n")
cat("Linear SVM F1-Score:", get_f1_score(svm_linear_cm), "\n\n")
print(svm_linear_cm)
###########svm linear with tuning
set.seed(123)
# balance and split the dataset using oversampling
df_balanced <- ovun.sample(subscription_status ~ ., data = df, method = "over",
N = max(table(df$subscription_status)) * 2)$data
# train-test split
train_index <- createDataPartition(df_balanced$subscription_status, p = 0.7, list = FALSE)
train_data <- df_balanced[train_index, ]
test_data <- df_balanced[-train_index, ]
# remove near-zero variance predictors
nzv <- nearZeroVar(train_data)
if (length(nzv) > 0) {
train_data <- train_data[, -nzv]
test_data <- test_data[, -nzv]
}
# setup parallel processing
cl <- makePSOCKcluster(detectCores() - 1)
registerDoParallel(cl)
# train control with light cross-validation
train_control <- trainControl(method = "cv", number = 3, classProbs = TRUE,
summaryFunction = twoClassSummary, savePredictions = TRUE)
# svm linear with caret + tuning
svm_linear_caret <- train(subscription_status ~ .,
data = train_data,
method = "svmLinear",
trControl = train_control,
preProcess = c("center", "scale"),
metric = "ROC",
tuneGrid = expand.grid(C = c(0.01, 0.1, 1))) # small grid
# predict and evaluate
linear_pred <- predict(svm_linear_caret, test_data)
linear_prob <- predict(svm_linear_caret, test_data, type = "prob")
linear_cm <- confusionMatrix(linear_pred, test_data$subscription_status, positive = "yes")
linear_roc <- roc(test_data$subscription_status, linear_prob$yes)
#output linear tuning
cat("Caret Linear SVM Accuracy:", linear_cm$overall["Accuracy"], "\n")
cat("Caret Linear SVM AUC:", auc(linear_roc), "\n")
cat("Caret Linear SVM F1-Score:", get_f1_score(linear_cm), "\n\n")
print(linear_cm)
# svm with radial kernel
svm_radial <- svm(subscription_status ~ ., data = train_data, kernel = "radial", probability = TRUE)
# predictions and evaluation for radial svm
svm_radial_pred <- predict(svm_radial, test_data)
svm_radial_prob <- predict(svm_radial, test_data, probability = TRUE)
svm_radial_cm <- confusionMatrix(svm_radial_pred, test_data$subscription_status, positive = "yes")
svm_radial_auc <- roc(test_data$subscription_status, as.numeric(attr(svm_radial_prob, "probabilities")[, "yes"]))
# output results for radial svm
cat("Radial SVM Accuracy:", svm_radial_cm$overall["Accuracy"], "\n")
cat("Radial SVM AUC:", auc(svm_radial_auc), "\n")
cat("Radial SVM F-Score:", get_f1_score(svm_radial_cm), "\n")
print(svm_radial_cm)
#svm radial with caret + tuning
svm_radial_caret <- train(subscription_status ~ .,
data = train_data,
method = "svmRadial",
trControl = train_control,
preProcess = c("center", "scale"),
metric = "ROC",
tuneGrid = expand.grid(sigma = 0.01, C = c(0.1, 1))) # simple grid
# predict on the test
radial_pred <- predict(svm_radial_caret, test_data)
radial_prob <- predict(svm_radial_caret, test_data, type = "prob")
radial_cm <- confusionMatrix(radial_pred, test_data$subscription_status, positive = "yes")
radial_roc <- roc(test_data$subscription_status, radial_prob$yes)
#output radical w tuning
# Output results for radial SVM with tuning
cat("Caret Radial SVM Accuracy:", radial_cm$overall["Accuracy"], "\n")
cat("Caret Radial SVM AUC:", auc(radial_roc), "\n") # Ensure auc is from pROC
cat("Caret Radial SVM F1-Score:", get_f1_score(radial_cm), "\n\n")
print(radial_cm)
# polynomial kernel
set.seed(123)
# parallel processing
cl <- makePSOCKcluster(parallel::detectCores() - 1)
registerDoParallel(cl)
# smaller dataset sample
sample_index <- sample(nrow(train_data), 2000)
sample_train <- train_data[sample_index, ]
# fast trainControl
train_control <- trainControl(method = "cv", number = 5)
# smaller tuning grid
tune_grid <- expand.grid(degree = 2, scale = 0.1, C = c(1, 10))
# train SVM with polynomial kernel and preprocessing
svm_poly <- train(subscription_status ~ .,
data = sample_train,
method = "svmPoly",
trControl = train_control,
preProcess = c("center", "scale", "zv"),
metric = "Accuracy",
tuneGrid = tune_grid)
# make predictions on test data
svm_poly_pred <- predict(svm_poly, newdata = test_data)
# ensure factors match for evaluation
svm_poly_pred <- factor(svm_poly_pred, levels = c("no", "yes"))
true_labels <- factor(test_data$subscription_status, levels = c("no", "yes"))
# confusion matrix
svm_poly_cm <- confusionMatrix(svm_poly_pred, true_labels, positive = "yes")
# print evaluation
print(svm_poly_cm)
#saving all the model for reuse
saveRDS(svm_linear, file = "svm_linear_model.rds")
saveRDS(svm_linear_caret, file = "svm_linear_caret_model.rds")
saveRDS(svm_radial, file = "svm_radial_model.rds")
saveRDS(svm_radial_caret, file = "svm_radial_caret_model.rds")
saveRDS(svm_poly, file = "svm_poly_model.rds")
# results compilation and comparison
# previous models (decision trees, rf, adaboost)
dt_oversample_metrics <- c(Model = "Decision Tree (Oversampling)", Accuracy = 0.5740, AUC = 0.588, F1 = NA)
dt_undersample_metrics <- c(Model = "Decision Tree (Undersampling)", Accuracy = 0.5696, AUC = 0.584, F1 = NA)
rf_all_metrics <- c(Model = "Random Forest (All Features)", Accuracy = 0.9024, AUC = 0.9408, F1 = NA)
rf_top5_metrics <- c(Model = "Random Forest (Top 5 Features)", Accuracy = 0.8986, AUC = 0.9318, F1 = NA)
ada_oversample_metrics <- c(Model = "AdaBoost (Oversampling)", Accuracy = 0.8646, AUC = 0.93027, F1 = NA)
ada_demo_metrics <- c(Model = "AdaBoost (Demographics Only)", Accuracy = 0.5920, AUC = 0.62968, F1 = NA)
# svm models
svm_linear_metrics <- c(Model = "SVM (Linear)", Accuracy = 0.8692, AUC = 0.9330, F1 = 0.8735785)
svm_rbf_metrics <- c(Model = "SVM (Radial)", Accuracy = 0.8796, AUC = 0.9385, F1 = NA)
svm_poly_metrics <- c(Model = "SVM (Polynomial)", Accuracy = 0.8242, AUC = NA, F1 = NA)
svm_caret_linear_metrics <- c(Model = "SVM (Caret Linear)", Accuracy = 0.8637, AUC = 0.9328, F1 = 0.8656753)
svm_caret_radial_metrics <- c(Model = "SVM (Caret Radial)", Accuracy = 0.8849, AUC = 0.9422, F1 = 0.8885927)
# bind all model metric vectors
all_models <- rbind(
dt_oversample_metrics,
dt_undersample_metrics,
rf_all_metrics,
rf_top5_metrics,
ada_oversample_metrics,
ada_demo_metrics,
svm_linear_metrics,
svm_rbf_metrics,
svm_poly_metrics,
svm_caret_linear_metrics,
svm_caret_radial_metrics
)
# convert to a data frame
comparison_df <- as.data.frame(all_models)
# convert numeric columns correctly
comparison_df$Accuracy <- as.numeric(comparison_df$Accuracy)
comparison_df$AUC <- as.numeric(comparison_df$AUC)
comparison_df$F1 <- as.numeric(comparison_df$F1)
# print the final comparison table
print(comparison_df)
# reorder model levels for better plot order
comparison_df$Model <- factor(comparison_df$Model, levels = comparison_df$Model)
# accuracy plot
ggplot(comparison_df, aes(x = Model, y = Accuracy, fill = Model)) +
geom_col() +
coord_flip() +
labs(title = "Model Accuracy Comparison", x = "Model", y = "Accuracy") +
theme_minimal() +
theme(legend.position = "none") +
geom_text(aes(label = round(Accuracy, 3)), hjust = -0.1, size = 3.5) +
ylim(0, 1)
#f1 Score where available
ggplot(comparison_long, aes(x = variable, y = value, fill = Model)) +
geom_bar(stat = "identity", position = position_dodge()) +
geom_text(aes(label = ifelse(!is.na(value), round(value, 2), "")),
position = position_dodge(0.9), vjust = -0.25) +
labs(title = "Performance Metrics with Labels", x = "Metric", y = "Value") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))