Instructions

Perform an analysis of the dataset(s) used in Homework #2 using the SVM algorithm. Compare the results with the results from previous homework.

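# Load packages used below
library(caret)        # createDataPartition(), confusionMatrix()
library(e1071)        # svm(), tune.svm()
library(pROC)         # roc(), auc()
library(formattable)  # formattable(), formatter(), color_tile()

# Load dataset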
dataBank <- read.csv("C:/Users/vitug/OneDrive/Desktop/CUNY Masters/DATA_622/bank data.csv", stringsAsFactors = TRUE)
head(dataBank, 10)
##     X age          job  marital education default balance housing loan contact
## 1   1  58   management  married  tertiary      no    2143     yes   no unknown
## 2   2  44   technician   single secondary      no      29     yes   no unknown
## 3   3  33 entrepreneur  married secondary      no       2     yes  yes unknown
## 4   4  47  blue-collar  married   unknown      no    1506     yes   no unknown
## 5   5  33      unknown   single   unknown      no       1      no   no unknown
## 6   6  35   management  married  tertiary      no     231     yes   no unknown
## 7   7  28   management   single  tertiary      no     447     yes  yes unknown
## 8   8  42 entrepreneur divorced  tertiary     yes       2     yes   no unknown
## 9   9  58      retired  married   primary      no     121     yes   no unknown
## 10 10  43   technician   single secondary      no     593     yes   no unknown
##    day month campaign previous term   age_group credit_risk Subscription
## 1    5   may        1        0   no      Senior Medium Risk           no
## 2    5   may        1        0   no Middle-aged Medium Risk           no
## 3    5   may        1        0   no Middle-aged   High Risk           no
## 4    5   may        1        0   no Middle-aged Medium Risk           no
## 5    5   may        1        0   no Middle-aged Medium Risk           no
## 6    5   may        1        0   no Middle-aged Medium Risk           no
## 7    5   may        1        0   no Middle-aged   High Risk           no
## 8    5   may        1        0   no Middle-aged Medium Risk           no
## 9    5   may        1        0   no      Senior Medium Risk           no
## 10   5   may        1        0   no Middle-aged Medium Risk           no

Because the dataset has 45,211 observations, which is relatively large for SVMs, I will subset the data to reduce computation time, select the most important features, and then split the data 70/30 into training and testing sets.

# Create a balanced subset of data
set.seed(123)
# Get indices of each class
term_yes <- which(dataBank$term == "yes")
term_no <- which(dataBank$term == "no")

# Sample size calculation - use up to 5000 observations per class (capped by the size of the "yes" class)
sample_size <- min(5000, length(term_yes))
sampled_yes <- sample(term_yes, sample_size)
sampled_no <- sample(term_no, sample_size)

# Create balanced subset
indices <- c(sampled_yes, sampled_no)
bank_subset <- dataBank[indices, ]

# Feature selection, use only most important features
important_features <- c("age", "balance", "campaign", "previous", "job", "contact", "month")
bank_subset <- bank_subset[, c(important_features, "term")]

# Split data
set.seed(123)
trainIndex <- createDataPartition(bank_subset$term, p = 0.7, list = FALSE)
train_data <- bank_subset[trainIndex, ]
test_data <- bank_subset[-trainIndex, ]

Homework #3

Sections

1. Perform an analysis of the dataset used in Homework #2 using the SVM algorithm.

With the processed data I will create three SVM models:

models

Basic model

# Base SVM model with linear kernel
svm_linear <- svm(term ~ ., data = train_data, 
                 kernel = "linear", 
                 cost = 1,
                 probability = TRUE,
                 scale = TRUE)

# Predictions
svm_pred <- predict(svm_linear, test_data)
svm_prob <- predict(svm_linear, test_data, probability = TRUE)

# Performance evaluation
conf_matrix <- confusionMatrix(svm_pred, test_data$term, positive = "yes")

# Extract metrics
accuracy <- conf_matrix$overall["Accuracy"]
sensitivity <- conf_matrix$byClass["Sensitivity"]
specificity <- conf_matrix$byClass["Specificity"]
f1 <- conf_matrix$byClass["F1"]

# Calculate AUC
svm_roc <- roc(as.numeric(test_data$term) - 1, 
               as.numeric(attr(svm_prob, "probabilities")[,"yes"]))
auc <- auc(svm_roc)

# Display results
results <- data.frame(
  Algorithm = "SVM (Linear)",
  Accuracy = accuracy,
  F1_Score = f1,
  Sensitivity = sensitivity,
  Specificity = specificity,
  AUC = as.numeric(auc)
)
print(results)
##             Algorithm Accuracy  F1_Score Sensitivity Specificity       AUC
## Accuracy SVM (Linear)    0.639 0.6179894       0.584       0.694 0.7059298
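
The metrics above can be cross-checked by hand from confusion-matrix counts. The sketch below assumes a 3,000-row test set with 1,500 observations per class (30% of the 10,000-row balanced subset); the counts are back-calculated from the reported sensitivity and specificity, not taken from the original confusion matrix output.

```r
# Counts implied by the reported sensitivity (0.584) and specificity (0.694),
# assuming 1,500 "yes" and 1,500 "no" rows in the test set (an assumption).
n_yes <- 1500
n_no  <- 1500
tp <- round(0.584 * n_yes)  # true positives
fn <- n_yes - tp            # false negatives
tn <- round(0.694 * n_no)   # true negatives
fp <- n_no - tn             # false positives

accuracy  <- (tp + tn) / (n_yes + n_no)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)  # same quantity as sensitivity
f1        <- 2 * precision * recall / (precision + recall)

print(round(c(Accuracy = accuracy, F1 = f1), 7))
```

Under those assumptions, both values agree with the Accuracy (0.639) and F1_Score (0.6179894) columns printed above.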

Radial basis function (RBF) model

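# SVM model with radial kernel
# Draw up to 2000 rows to keep the grid search fast.
# Note: this samples from bank_subset, which also contains the test rows;
# sampling from train_data instead would keep the test set out of tuning.
tune_subset <- bank_subset[sample(nrow(bank_subset), min(2000, nrow(bank_subset))), ]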

# Tune SVM parameters
set.seed(123)
tune_result <- tune.svm(term ~ ., data = tune_subset, 
                       kernel = "radial", 
                       gamma = 10^(-5:-1),
                       cost = 10^(0:2))

# Get best parameters
best_gamma <- tune_result$best.parameters$gamma
best_cost <- tune_result$best.parameters$cost

# Train optimized SVM
svm_radial <- svm(term ~ ., data = train_data, 
                 kernel = "radial", 
                 gamma = best_gamma,
                 cost = best_cost,
                 probability = TRUE)

# Evaluate optimized model
svm_opt_pred <- predict(svm_radial, test_data)
svm_opt_prob <- predict(svm_radial, test_data, probability = TRUE)

# Performance metrics
conf_matrix_opt <- confusionMatrix(svm_opt_pred, test_data$term, positive = "yes")

# Extract metrics
accuracy_opt <- conf_matrix_opt$overall["Accuracy"]
sensitivity_opt <- conf_matrix_opt$byClass["Sensitivity"]
specificity_opt <- conf_matrix_opt$byClass["Specificity"]
f1_opt <- conf_matrix_opt$byClass["F1"]

# Calculate AUC
svm_opt_roc <- roc(as.numeric(test_data$term) - 1, 
                  as.numeric(attr(svm_opt_prob, "probabilities")[,"yes"]))
auc_opt <- auc(svm_opt_roc)

# Display results
results_opt <- data.frame(
  Algorithm = "SVM (Radial - Tuned)",
  Accuracy = accuracy_opt,
  F1_Score = f1_opt,
  Sensitivity = sensitivity_opt,
  Specificity = specificity_opt,
  AUC = as.numeric(auc_opt)
)
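
The grid passed to tune.svm above covers five gamma values and three cost values. A quick sketch of how large that search is, assuming tune.svm runs with its default 10-fold cross-validation (the code above does not set a tune.control, so this default is an assumption about the run):

```r
# Parameter grid matching the gamma and cost ranges passed to tune.svm above
grid <- expand.grid(gamma = 10^(-5:-1), cost = 10^(0:2))
n_combinations <- nrow(grid)

# Assumption: default 10-fold cross-validation, one fit per fold per combination
n_folds <- 10
n_fits  <- n_combinations * n_folds

cat("Parameter combinations:", n_combinations, "\n")
cat("Model fits during tuning:", n_fits, "\n")
```

This is why the tuning step runs on a 2,000-row subset rather than the full training set: every extra grid value multiplies the number of fits.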

SVM model with polynomial kernel

# SVM with Polynomial Kernel Implementation 
set.seed(123)
# Get indices of each class
term_yes <- which(dataBank$term == "yes")
term_no <- which(dataBank$term == "no")

# Sample size calculation, using a smaller sample for faster processing
sample_size <- min(3000, length(term_yes))
sampled_yes <- sample(term_yes, sample_size)
sampled_no <- sample(term_no, sample_size)

# Create balanced subset
indices <- c(sampled_yes, sampled_no)
bank_subset <- dataBank[indices, ]

# Feature selection - use only important features
important_features <- c("age", "balance", "campaign", "previous", "job", "contact", "month")
bank_subset <- bank_subset[, c(important_features, "term")]

# Split data
set.seed(123)
trainIndex <- createDataPartition(bank_subset$term, p = 0.7, list = FALSE)
train_data <- bank_subset[trainIndex, ]
test_data <- bank_subset[-trainIndex, ]

# I'll use degree=2 which is common for polynomial kernels
cat("Training polynomial SVM model with fixed parameters...\n")
## Training polynomial SVM model with fixed parameters...
# Try-catch block to handle potential errors
tryCatch({
  # Train SVM with polynomial kernel using fixed parameters
  svm_poly <- svm(term ~ ., 
                 data = train_data, 
                 kernel = "polynomial", 
                 degree = 2,
                 coef0 = 1,
                 cost = 1,
                 probability = TRUE,
                 scale = TRUE)
  
  cat("Model trained successfully.\n")
  
  # Verify the model exists
  if(!exists("svm_poly")) {
    stop("Model training failed silently.")
  }
  
  # Save model to ensure it's available
  saveRDS(svm_poly, "svm_poly_model.rds")
  cat("Model saved to file.\n")
  
  # Evaluate model
  cat("Generating predictions...\n")
  svm_poly_pred <- predict(svm_poly, test_data)
  
  # Calculate probability predictions if needed
  cat("Calculating probability predictions...\n")
  svm_poly_prob <- predict(svm_poly, test_data, probability = TRUE)
  
  # Calculate metrics
  cat("Calculating performance metrics...\n")
  conf_matrix_poly <- confusionMatrix(svm_poly_pred, test_data$term, positive = "yes")
  
  # Extract performance metrics
  accuracy_poly <- conf_matrix_poly$overall["Accuracy"]
  sensitivity_poly <- conf_matrix_poly$byClass["Sensitivity"]
  specificity_poly <- conf_matrix_poly$byClass["Specificity"]
  f1_poly <- conf_matrix_poly$byClass["F1"]
  
  # Calculate AUC
  prob_yes <- attr(svm_poly_prob, "probabilities")[,"yes"]
  svm_poly_roc <- roc(as.numeric(test_data$term) - 1, as.numeric(prob_yes))
  auc_poly <- auc(svm_poly_roc)
  
  # Display results
  results_poly <- data.frame(
    Algorithm = "SVM (Polynomial)",
    Accuracy = accuracy_poly,
    F1_Score = f1_poly,
    Sensitivity = sensitivity_poly,
    Specificity = specificity_poly,
    AUC = as.numeric(auc_poly)
  )
  
  print(results_poly)
  
}, error = function(e) {
  cat("Error occurred during model training or evaluation:\n")
  print(e)
  
  # Alternative approach - use a simpler kernel if polynomial fails
  cat("\nTrying alternative approach with linear kernel...\n")
  
  # Train SVM with linear kernel as fallback
  svm_linear <- svm(term ~ ., 
                   data = train_data, 
                   kernel = "linear", 
                   cost = 1,
                   probability = TRUE,
                   scale = TRUE)
  
  # Evaluate linear model
  svm_linear_pred <- predict(svm_linear, test_data)
  svm_linear_prob <- predict(svm_linear, test_data, probability = TRUE)
  
  # Calculate metrics
  conf_matrix_linear <- confusionMatrix(svm_linear_pred, test_data$term, positive = "yes")
  
  # Extract performance metrics
  accuracy_linear <- conf_matrix_linear$overall["Accuracy"]
  sensitivity_linear <- conf_matrix_linear$byClass["Sensitivity"]
  specificity_linear <- conf_matrix_linear$byClass["Specificity"]
  f1_linear <- conf_matrix_linear$byClass["F1"]
  
  # Calculate AUC
  prob_yes_linear <- attr(svm_linear_prob, "probabilities")[,"yes"]
  svm_linear_roc <- roc(as.numeric(test_data$term) - 1, as.numeric(prob_yes_linear))
  auc_linear <- auc(svm_linear_roc)
  
  # Display results
  results_linear <- data.frame(
    Algorithm = "SVM (Linear - Fallback)",
    Accuracy = accuracy_linear,
    F1_Score = f1_linear,
    Sensitivity = sensitivity_linear,
    Specificity = specificity_linear,
    AUC = as.numeric(auc_linear)
  )
  
  print(results_linear)
})
## Model trained successfully.
## Model saved to file.
## Generating predictions...
## Calculating probability predictions...
## Calculating performance metrics...
##                 Algorithm  Accuracy  F1_Score Sensitivity Specificity       AUC
## Accuracy SVM (Polynomial) 0.6905556 0.6733138   0.6377778   0.7433333 0.7564827
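
The polynomial kernel selected above computes (gamma * <u, v> + coef0)^degree for a pair of feature vectors, which is the form used by libsvm and therefore by e1071. A minimal sketch with the degree = 2 and coef0 = 1 from the model above; the gamma value and the two input vectors are hypothetical and chosen only to keep the arithmetic easy to follow.

```r
# Polynomial kernel: (gamma * <u, v> + coef0)^degree
poly_kernel <- function(u, v, gamma, coef0 = 1, degree = 2) {
  (gamma * sum(u * v) + coef0)^degree
}

# Hypothetical inputs, for illustration only
u <- c(1, 2)
v <- c(3, 4)

# <u, v> = 1*3 + 2*4 = 11, so the kernel value is (0.5 * 11 + 1)^2
k <- poly_kernel(u, v, gamma = 0.5)
print(k)
```

With degree = 2 the kernel implicitly includes pairwise products of the features, which lets the model capture interactions that the linear kernel cannot.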

Table with all models’ results

# Combine results
all_results <- rbind(results, results_opt, results_poly)
print(all_results)
##                      Algorithm  Accuracy  F1_Score Sensitivity Specificity
## Accuracy          SVM (Linear) 0.6390000 0.6179894   0.5840000   0.6940000
## Accuracy1 SVM (Radial - Tuned) 0.6786667 0.6466276   0.5880000   0.7693333
## Accuracy2     SVM (Polynomial) 0.6905556 0.6733138   0.6377778   0.7433333
##                 AUC
## Accuracy  0.7059298
## Accuracy1 0.7401458
## Accuracy2 0.7564827
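
I created three models with the Support Vector Machine (SVM) algorithm: linear, radial, and polynomial. The table above shows their results. The polynomial model has the best F1 score (0.6733) and AUC (0.7565), as well as the highest accuracy (0.6906) and sensitivity (0.6378); only the radial model beats it on specificity (0.7693). Overall, the polynomial SVM gives the most accurate results. Note that it was trained and evaluated on a smaller balanced subset (3,000 observations per class instead of 5,000), so the comparison with the other two models is not strictly like-for-like.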

2. Compare the results with the results from previous homework.

# Create a complete data frame with all results
results_df <- data.frame(
  Algorithm = c(
    "Random Forest (Baseline)", "Random Forest (mtry Tuning)", "Random Forest (ntree Tuning)",
    "SVM (Linear)", "SVM (Radial - Tuned)", "SVM (Polynomial)",
    "Decision Tree (Baseline)", "Decision Tree (CP Tuning)", "Decision Tree (Feature Selection)",
    "AdaBoost (Baseline)", "AdaBoost (mfinal Tuning)", "AdaBoost (Class Weighting)"
  ),
  Accuracy = c(
    0.886816, 0.886153, 0.887996, 0.885142, 0.888754, 0.886984,
    0.883056, 0.882244, 0.882244, 0.882982, 0.884014, 0.883719
  ),
  F1_Score = c(
    0.281703, 0.321020, 0.291181, 0.301468, 0.325683, 0.316895,
    NA, 0.198695, 0.198695, 0.288660, 0.280092, 0.276274
  ),
  Sensitivity = c(
    0.189786, 0.230139, 0.196721, 0.205741, 0.227584, 0.219507,
    0.000000, 0.124842, 0.124842, 0.203026, 0.192938, 0.189786
  ),
  Specificity = c(
    0.979125, 0.973029, 0.979542, 0.976385, 0.978341, 0.977256,
    1.000000, 0.982548, 0.982548, 0.973029, 0.975534, 0.975618
  ),
  AUC = c(
    0.774884, 0.773993, 0.776913, 0.769812, 0.778241, 0.774639,
    0.500000, 0.651225, 0.651225, NA, NA, NA
  )
)

# Sort by accuracy (descending)
results_df_sorted <- results_df[order(results_df$Accuracy, decreasing = TRUE), ]
# Define the color scale for the Accuracy column
accuracy_color <- formatter("span", 
                         style = function(x) {
                           min_val <- min(results_df_sorted$Accuracy, na.rm=TRUE)
                           max_val <- max(results_df_sorted$Accuracy, na.rm=TRUE)
                           normalized <- (x - min_val)/(max_val - min_val)
                           
                           style(
                             "display" = "block", 
                             "padding" = "0 4px", 
                             "border-radius" = "4px",
                             "background-color" = rgb(1 - normalized*0.8,
                                                   0.8 + normalized*0.2,
                                                   0.8)
                           )
                         })

# Create the formattable
formattable(results_df_sorted, list(
  Algorithm = formatter("span", style = function(x) {
    style(color = "black", 
          font.weight = ifelse(grepl("SVM", x), "bold", "normal"))
  }),
  Accuracy = accuracy_color,
  F1_Score = color_tile("#FAFAFA", "#C5E5FF"),
  Sensitivity = color_tile("#FAFAFA", "#C5E5FF"),
  Specificity = color_tile("#FAFAFA", "#C5E5FF"),
  AUC = color_tile("#FAFAFA", "#C5E5FF")
))
   Algorithm                         Accuracy F1_Score Sensitivity Specificity      AUC
5  SVM (Radial - Tuned)              0.888754 0.325683    0.227584    0.978341 0.778241
3  Random Forest (ntree Tuning)      0.887996 0.291181    0.196721    0.979542 0.776913
6  SVM (Polynomial)                  0.886984 0.316895    0.219507    0.977256 0.774639
1  Random Forest (Baseline)          0.886816 0.281703    0.189786    0.979125 0.774884
2  Random Forest (mtry Tuning)       0.886153 0.321020    0.230139    0.973029 0.773993
4  SVM (Linear)                      0.885142 0.301468    0.205741    0.976385 0.769812
11 AdaBoost (mfinal Tuning)          0.884014 0.280092    0.192938    0.975534       NA
12 AdaBoost (Class Weighting)        0.883719 0.276274    0.189786    0.975618       NA
7  Decision Tree (Baseline)          0.883056       NA    0.000000    1.000000 0.500000
10 AdaBoost (Baseline)               0.882982 0.288660    0.203026    0.973029       NA
8  Decision Tree (CP Tuning)         0.882244 0.198695    0.124842    0.982548 0.651225
9  Decision Tree (Feature Selection) 0.882244 0.198695    0.124842    0.982548 0.651225

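The table above shows that the SVM (Radial - Tuned) model has the best results, with the highest accuracy (0.8888), F1 score (0.3257), and AUC (0.7782), followed by the Random Forest (ntree Tuning) model with an accuracy of 0.8880 and an AUC of 0.7769. The SVM values in this table differ from those in the Section 1 table, where accuracy ranged from 0.64 to 0.69 on the balanced subsets, so the two tables should not be compared directly.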
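
Accuracy alone is hard to interpret in this table: the Decision Tree (Baseline) row has a sensitivity of 0, meaning it never predicts "yes", yet it still reaches 0.883 accuracy. A small sketch of one alternative summary, balanced accuracy, computed from the table's own columns. Balanced accuracy is not part of the original analysis and is introduced here only for illustration.

```r
# Sensitivity and specificity copied from three rows of the table above
metrics <- data.frame(
  Algorithm   = c("SVM (Radial - Tuned)",
                  "Random Forest (ntree Tuning)",
                  "Decision Tree (Baseline)"),
  Accuracy    = c(0.888754, 0.887996, 0.883056),
  Sensitivity = c(0.227584, 0.196721, 0.000000),
  Specificity = c(0.978341, 0.979542, 1.000000)
)

# Balanced accuracy: the mean of sensitivity and specificity
metrics$Balanced_Accuracy <- (metrics$Sensitivity + metrics$Specificity) / 2

print(metrics)
```

On this measure the three models separate more clearly (about 0.603, 0.588, and 0.500) than their nearly identical accuracy values suggest.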

4. Is it better for classification or regression scenarios?

Support Vector Machines (SVMs) can be used for both classification and regression scenarios. However, they are generally considered more effective for classification tasks. Regression tasks use a separate formulation, Support Vector Regression (SVR), which is better suited to predicting continuous values.
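
Both uses are available through the same e1071 function used in this report. A minimal sketch on R's built-in iris and mtcars datasets (not the bank data; chosen here only for illustration): svm() fits a classifier when the response is a factor and an SVR model when it is numeric.

```r
library(e1071)

# Classification: a factor response makes svm() default to C-classification
clf <- svm(Species ~ ., data = iris, kernel = "radial")
class_pred <- predict(clf, iris)

# Regression: a numeric response makes svm() default to eps-regression (SVR)
reg <- svm(mpg ~ wt + hp, data = mtcars, kernel = "radial")
reg_pred <- predict(reg, mtcars)

cat("Classification predictions are factors:", is.factor(class_pred), "\n")
cat("Regression predictions are numeric:", is.numeric(reg_pred), "\n")
```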

5. Do you agree with the recommendations? Why?

My recommendation is to use the Random Forest models: they are much faster and more reliable to train, and their metric values are only slightly lower than those of the SVM models.