library(tidyverse)
library(e1071)
library(knitr)
library(corrplot)
df <- read.csv("bank-full.csv", sep = ";")
df$y <- as.factor(df$y)
# Split the data into training (70%) and test (30%) sets
set.seed(123)
trainIndex <- sample(seq_len(nrow(df)), size = 0.7 * nrow(df))
trainData <- df[trainIndex, ]
testData <- df[-trainIndex, ]
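Since the discussion below leans on the dataset being imbalanced, a one-line check of the class distribution in the training set is worth running (a minimal sketch; the exact proportions depend on the split):
# Check the class balance of the target variable in the training set
prop.table(table(trainData$y))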
Based on the previous assignments, it is evident that the Portuguese bank dataset is imbalanced and non-linear. The radial (RBF) kernel is well suited to capturing the non-linear relationships that are likely present in this dataset.
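For reference, the RBF kernel scores the similarity of two observations $x$ and $x'$ as

$$K(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^2\right),$$

so gamma controls how quickly similarity decays with distance; larger values of gamma yield a more flexible decision boundary.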
# Train an SVM model using a radial basis function (RBF) kernel.
svm_model <- svm(y ~ ., data = trainData, kernel = "radial", cost = 1, gamma = 0.1)
# Make predictions on the test set
svm_predictions <- predict(svm_model, newdata = testData)
# Evaluate the model performance with a confusion matrix
svm_confusion <- table(Predicted = svm_predictions, Actual = testData$y)
print(svm_confusion)
## Actual
## Predicted no yes
## no 11718 1018
## yes 280 548
# Calculate and print the test accuracy
svm_accuracy <- sum(diag(svm_confusion)) / sum(svm_confusion)
print(paste("Test Accuracy:", round(svm_accuracy * 100, 2), "%"))
## [1] "Test Accuracy: 90.43 %"
The test accuracy for the radial kernel looks solid, indicating that the model is effective at capturing the underlying patterns in the data. Next, let's tune the cost parameter: adjusting cost softens or tightens the decision boundary, which could reduce misclassification and further boost the test accuracy.
# Define a range of cost values to test
c_values <- c(0.01, 0.1, 1, 10, 100)
# Initialize a list to store results
c_results <- list()
# Loop through each cost value and train an SVM model with the radial kernel
for (c_value in c_values) {
  # Train the model with the current cost value
  model <- svm(y ~ ., data = trainData, kernel = "radial", cost = c_value, gamma = 0.1)
  # Make predictions on the test set
  predictions <- predict(model, newdata = testData)
  # Evaluate the performance using a confusion matrix and calculate accuracy
  confusion <- table(Predicted = predictions, Actual = testData$y)
  accuracy <- sum(diag(confusion)) / sum(confusion)
  # Store the accuracy keyed by the cost value
  c_results[[as.character(c_value)]] <- accuracy
}
# Print the results
for (c_value in names(c_results)) {
  print(paste("Cost =", c_value, "Test Accuracy:", round(c_results[[c_value]] * 100, 2), "%"))
}
## [1] "Cost = 0.01 Test Accuracy: 88.45 %"
## [1] "Cost = 0.1 Test Accuracy: 89.86 %"
## [1] "Cost = 1 Test Accuracy: 90.43 %"
## [1] "Cost = 10 Test Accuracy: 90.39 %"
## [1] "Cost = 100 Test Accuracy: 89.13 %"
The tuning results show that the test accuracy peaked at 90.43% when the cost parameter was set to 1. Accuracy dropped on either side: a very small cost (e.g. 0.01) over-regularizes the margin and underfits, while a very large cost (e.g. 100) fits the training data too tightly and overfits. This suggests cost = 1 is the best choice for this dataset.
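A more systematic alternative to the manual loop is e1071's built-in tune.svm(), which grid-searches cost and gamma with cross-validation on the training data (a sketch only; this can be slow on a dataset of this size, and the grid values here are illustrative assumptions):
# Cross-validated grid search over cost and gamma (can be slow on a training set this large)
tuned <- tune.svm(y ~ ., data = trainData, kernel = "radial",
                  cost = c(0.1, 1, 10), gamma = c(0.01, 0.1),
                  tunecontrol = tune.control(cross = 5))
summary(tuned)
# The best model is refit automatically and available as tuned$best.model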
Next, we evaluate the other kernel types to see whether any of them produces a higher test accuracy.
# Define a list of kernels to evaluate
kernels <- c("linear", "polynomial", "sigmoid")
# Initialize a list to store results
results <- list()
# Loop through each kernel and train an SVM model
for (kernel_type in kernels) {
  # Train an SVM with the current kernel
  model <- svm(y ~ ., data = trainData, kernel = kernel_type, cost = 1, gamma = 0.1)
  # Make predictions on the test set
  predictions <- predict(model, newdata = testData)
  # Evaluate performance using a confusion matrix and calculate accuracy
  confusion <- table(Predicted = predictions, Actual = testData$y)
  accuracy <- sum(diag(confusion)) / sum(confusion)
  # Store the accuracy keyed by the kernel type
  results[[kernel_type]] <- accuracy
}
# Print the results for each kernel
for (kernel_type in names(results)) {
  print(paste(kernel_type, "Kernel Test Accuracy:", round(results[[kernel_type]] * 100, 2), "%"))
}
## [1] "linear Kernel Test Accuracy: 89.4 %"
## [1] "polynomial Kernel Test Accuracy: 90.18 %"
## [1] "sigmoid Kernel Test Accuracy: 83.58 %"
The test accuracy results show that the polynomial kernel performed well, achieving a test accuracy of 90.18%. However, this was slightly lower than the radial kernel’s test accuracy of 90.43%, indicating that the radial kernel is more effective at capturing the patterns in the data. Intuitively, this makes sense, as both the radial and polynomial kernels are well-suited to modeling complex, non-linear relationships in the dataset. In contrast, the linear and sigmoid kernels are less effective at handling such non-linear patterns, which likely accounts for their lower performance.
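Since the earlier assignments flagged the class imbalance, another experiment worth trying (not run here; the weight of 5 on the minority class is an illustrative assumption) is to reweight the classes so the SVM penalizes misclassified "yes" cases more heavily:
# Radial SVM with class weights to counter the imbalance (weights are illustrative)
svm_weighted <- svm(y ~ ., data = trainData, kernel = "radial", cost = 1, gamma = 0.1,
                    class.weights = c(no = 1, yes = 5))
weighted_predictions <- predict(svm_weighted, newdata = testData)
table(Predicted = weighted_predictions, Actual = testData$y)
This typically trades a little overall accuracy for better recall on the minority class.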
Correlation plot:
# Correlation matrix for the numeric variables (df is already loaded above)
num_cols <- sapply(df, is.numeric)
cor_matrix <- cor(df[, num_cols])
corrplot(cor_matrix, method = "color", tl.col = "black", addCoef.col = "black", number.cex = 0.8)
Results from the previous assignment and this assignment:
# Create a data frame of all the test accuracy results from both assignments
results_table <- data.frame(
Algorithm = c("Decision Tree", "Decision Tree", "Random Forest", "Random Forest", "XGBoost", "XGBoost", "SVM", "SVM", "SVM", "SVM"),
Experiment = c("Exp 1", "Exp 2 (minsplit = 10)", "Exp 3 (ntree = 500)", "Exp 4 (ntree = 200)", "Exp 5 (default)", "Exp 6 (CV)", "Radial Kernel (C = 1)", "Linear Kernel", "Polynomial Kernel", "Sigmoid Kernel"),
Training_Accuracy = c(89.42, 89.42, 99.64, 99.59, 95.53, 93.33, NA, NA, NA, NA),
Testing_Accuracy = c(89.45, 89.45, 90.08, 90.23, 90.40, 90.39, 90.43, 89.4, 90.18, 83.58)
)
kable(results_table, caption = "Comparison of Model Performance Across Assignments")
| Algorithm | Experiment | Training_Accuracy | Testing_Accuracy |
|---|---|---|---|
| Decision Tree | Exp 1 | 89.42 | 89.45 |
| Decision Tree | Exp 2 (minsplit = 10) | 89.42 | 89.45 |
| Random Forest | Exp 3 (ntree = 500) | 99.64 | 90.08 |
| Random Forest | Exp 4 (ntree = 200) | 99.59 | 90.23 |
| XGBoost | Exp 5 (default) | 95.53 | 90.40 |
| XGBoost | Exp 6 (CV) | 93.33 | 90.39 |
| SVM | Radial Kernel (C = 1) | NA | 90.43 |
| SVM | Linear Kernel | NA | 89.40 |
| SVM | Polynomial Kernel | NA | 90.18 |
| SVM | Sigmoid Kernel | NA | 83.58 |
Out of all the algorithms, the SVM model with the radial kernel achieved the highest test accuracy at 90.43%, slightly surpassing XGBoost at 90.40% and Random Forest at 90.23%. The correlation plot showed only weak linear correlations among most of the numeric predictors, which is consistent with the structure in this data being largely non-linear; the radial kernel is particularly effective at capturing such non-linear patterns, which explains why its test accuracy was the highest. However, the slight improvement in test accuracy is not enough for me to select the SVM over XGBoost. XGBoost was significantly faster to train, making it a better choice for a dataset of this size; in fact, I chose not to report training accuracy for the SVM models because scoring the full training set took too long to compute. I can see the benefits of SVM, since parameters like cost and gamma can be tuned to counter overfitting and underfitting, but the trade-off in computation time is not justified by the slight gain in test accuracy compared to XGBoost.
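The runtime difference is easy to quantify with system.time(); a minimal sketch (the xgboost call and one-hot encoding below are assumptions mirroring the setup from the previous assignment, not the exact code used there):
library(xgboost)
# Time the radial SVM fit
svm_time <- system.time(
  svm(y ~ ., data = trainData, kernel = "radial", cost = 1, gamma = 0.1)
)
# Time an XGBoost fit on a one-hot encoded design matrix
X_train <- model.matrix(y ~ . - 1, data = trainData)
xgb_time <- system.time(
  xgboost(data = X_train, label = as.numeric(trainData$y) - 1,
          nrounds = 100, objective = "binary:logistic", verbose = 0)
)
print(svm_time)
print(xgb_time)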
The other algorithms, Decision Tree and Random Forest, also demonstrated strong test accuracies of 89.45% and 90.23%, respectively. However, the Decision Tree model was heavily influenced by the duration variable, which made it too simplistic and limited its ability to capture the more complex, non-linear relationships among the other predictors. Random Forest improved on this by aggregating many trees and reducing the dominance of any single predictor, but its training accuracy of 99.59% against a test accuracy of 90.23% suggests overfitting. Overall, considering both performance and computational efficiency, XGBoost is the best option for this dataset.