library(tidyverse)
library(DataExplorer)
library(corrplot) 
library(ggplot2)
library(dplyr) 
library(gridExtra) 
library(grid) 
library(scales)
library(tibble)
library(DT)
library(naniar)
library(Amelia)
library(caret)
library(rpart)
library(rpart.plot)
library(pROC)
library(randomForest)
library(adabag)
library(e1071)
library(knitr)
library(kableExtra)
library(kernlab)

This data is offered in two versions: one with 16 features plus the target variable (y; subscription status), and an expanded version with 20 features plus the target. I have chosen to work with the expanded 20-feature version below.

0.1 Loading the data

df1 <- read.csv("bank-full.csv", sep = ";", header = TRUE, stringsAsFactors = FALSE)

dim(df1)
## [1] 45211    17
df2 <- read.csv("bank-additional-full.csv", sep = ";", header = TRUE, stringsAsFactors = FALSE)

dim(df2)
## [1] 41188    21
# Deciding to go with the expanded version (20 features rather than 16). Although it has slightly fewer rows, the additional features arguably offer richer information.

df <- df2

df <- df %>% rename(subscribed = y)
df$subscribed <- as.factor(df$subscribed)

0.1.1 A quick check on missing values

# replace "unknown" with NA
df[df == "unknown"] <- NA

# missing values
#colSums(is.na(df))
missing_values <- colSums(is.na(df))
missing_values[missing_values > 0]
##       job   marital education   default   housing      loan 
##       330        80      1731      8597       990       990

1 HW1

1.1 A. Exploratory Data Analysis (EDA)

1.1.1 A.1 Are the features (columns) of your data correlated?

num_vars <- df %>% select_if(is.numeric)
cor_matrix <- cor(num_vars, use = "complete.obs")
corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black", tl.srt = 45, diag = FALSE)

The correlation plot shows:

🔹 Strong Positive Correlations (Blue):

  • nr.employed (number of employees) and euribor3m (Euribor 3-month rate) are highly correlated.
  • emp.var.rate (employment variation rate) is also positively correlated with both.

🔹 Strong Negative Correlations (Red):

  • previous (number of past contacts) is negatively correlated with the economic indicators (emp.var.rate, euribor3m, nr.employed).
  • pdays (days since last contact) and previous are moderately negatively correlated.

🔹 Weak or No Correlation (Light Colors):

  • campaign (number of contacts in current campaign) is not strongly correlated with any variable.
  • duration (call length) has only weak correlations with the other features; more importantly, it is unknown before a call happens, so it should not be used for prediction in a realistic deployment.

I may need to remove or combine highly correlated variables.
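Since caret is already loaded, one quick way to flag which numeric columns are redundant is findCorrelation(); this is only a sketch, and the 0.9 cutoff is my own assumption rather than a threshold used elsewhere in this report.

# flag numeric columns whose pairwise correlation exceeds the cutoff
high_corr <- findCorrelation(cor_matrix, cutoff = 0.9, names = TRUE)
high_corr  # candidate columns to drop or combine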

1.1.2 A.2 What is the overall distribution of each variable?

# numeric var
plot_numeric_distribution <- function(df) {
    num_vars <- df %>% select_if(is.numeric) 
    for (var in names(num_vars)) {
        print(
            ggplot(df, aes(x = get(var))) +
                geom_histogram(bins = 50, fill = "steelblue", color = "black", alpha = 0.7) +  
                labs(title = paste("Distribution of", var), x = var, y = "Count") +
                theme_minimal())}}
plot_numeric_distribution(df)

I can see that the distribution of several numeric variables is highly skewed (duration, campaign, emp.var.rate, among others), while age is only slightly skewed. The categorical variables are examined in the next section.

  • age: the distribution is reasonable, with most values between 25 and 55.
  • duration: the call lengths are plausible given that the values are in seconds, so a right skew is expected.
  • campaign: the values are reasonable, with most on the lower end and some outliers, so a skew is expected here too.
  • pdays: the distribution looks odd because most values are 999, the code for clients who were never contacted before.
  • previous: also seems reasonable; a skew is expected.
  • emp.var.rate: slightly skewed, but reasonable.
  • cons.price.idx: the range and distribution seem reasonable.
  • cons.conf.idx: the range and distribution seem reasonable.
  • euribor3m: the 3-month rate seems reasonably distributed too, though skewed.
  • nr.employed: the number of employees is reasonable and its distribution is not irregular.

Outlier detection is done below.

1.1.3 A.3 How are categorical variables distributed?

# categorical var
plot_categorical_distribution <- function(df) {
    cat_vars <- df %>% select_if(is.character)
    for (var in names(cat_vars)) {
        print(
            ggplot(df, aes(x = get(var))) +
                geom_bar(fill = "steelblue") +
                labs(title = paste("Distribution of", var), x = var, y = "Count") +
                theme_minimal() +
                # theme_minimal() is applied first so it does not reset the rotated axis labels below
                theme(axis.text.x = element_text(angle = 45, hjust = 1)))}}
plot_categorical_distribution(df)

The categorical variables are reasonably distributed too:

  • job: most are admin or blue-collar jobs
  • marital status: most are married
  • education: most hold a university degree or a high-school diploma
  • default: only 3 observations are a “yes”, so nearly all are either “no” or unknown (NA)
  • housing, loan, contact, poutcome: all look reasonable for the expected cohort in this dataset
  • month/day: most contacts took place in May, with a nearly even spread across the weekdays

1.1.4 A.4 Are there any outliers present?

Using boxplots

plot_outliers_horizontal <- function(df) {
    num_vars <- df %>% select_if(is.numeric)  
    num_plots <- length(num_vars) 
    cols <- 2  
    plots <- lapply(names(num_vars), function(var) {
        ggplot(df, aes(y = get(var), x = "")) +  
            geom_boxplot(fill = "#69b3a2", outlier.color = "red", outlier.size = 2) +  
            labs(title = paste("Boxplot of", var), y = var, x = " ") +  
            theme_minimal() +
            theme(
                plot.title = element_text(size = 16, face = "bold"), 
                axis.text.y = element_text(size = 14),  
                axis.text.x = element_text(size = 12), 
                axis.ticks.x = element_line(color = "black"), 
                panel.grid.major = element_line(color = "grey85"),
                panel.grid.minor = element_blank()) +
            coord_flip()})

    grid.arrange(
        grobs = plots, 
        ncol = cols, 
        nrow = ceiling(num_plots / cols),
        top = textGrob(" ", gp = gpar(fontsize = 18, fontface = "bold")))

    grid.lines(x = unit(0.5, "npc"), y = unit(c(0, 1), "npc"), gp = gpar(col = "black", lwd = 2))}
plot_outliers_horizontal(df)

Box plots show:

  • duration → Many high outliers; long calls may indicate customer engagement. However, as the dataset notes suggest, duration should only be used for benchmarking: if duration = 0 the target is trivially “no”, so this feature should be dropped when building a realistic predictive model.
  • campaign → Some clients contacted 10+ times; potential excessive follow-ups.
  • previous → Outliers in repeat contacts; could indicate persistent marketing attempts.
  • pdays → Mostly 999 (never contacted before), with some low-value numbers for actual numbers of days that passed after last client contact.
  • age → A few older clients (90+ years), but generally well-distributed.
  • emp.var.rate, cons.price.idx, euribor3m, nr.employed → No major outliers; economic indicators are stable.
  • cons.conf.idx → A few extreme values, but mostly normal distribution.

For outliers that look genuinely anomalous, I can remove them and treat them as missing values (to be imputed later), or cap (winsorize) them so that extreme values do not dominate the model.
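As a sketch of the capping option, the helper below winsorizes a numeric vector at assumed 1st/99th percentile bounds; it is illustrative only, since I conclude below that capping is not actually needed here.

# cap (winsorize) a numeric vector at the given quantile bounds (assumed 1%/99%)
cap_extremes <- function(x, lower = 0.01, upper = 0.99) {
    bounds <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
    pmin(pmax(x, bounds[1]), bounds[2])}
# example: df$campaign_capped <- cap_extremes(df$campaign)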

Using IQR to Identify Outliers

detect_outliers <- function(df) {
    num_vars <- df %>% select_if(is.numeric)
    outliers <- list()
    for (var in names(num_vars)) {
        Q1 <- quantile(df[[var]], 0.25, na.rm = TRUE)
        Q3 <- quantile(df[[var]], 0.75, na.rm = TRUE)
        IQR_val <- Q3 - Q1
        lower_bound <- Q1 - 3 * IQR_val
        upper_bound <- Q3 + 3 * IQR_val
        num_outliers <- sum(df[[var]] < lower_bound | df[[var]] > upper_bound, na.rm = TRUE)
        if (num_outliers > 0) {
            outliers[[var]] <- num_outliers}}
    return(outliers)}
outlier_counts <- detect_outliers(df)
print(outlier_counts)
## $age
## [1] 4
## 
## $duration
## [1] 1043
## 
## $campaign
## [1] 1094
## 
## $pdays
## [1] 1515
## 
## $previous
## [1] 5625

So there does seem to be a number of extreme values (beyond 3×IQR) in these five variables. Now, to see what they actually are:

detect_outliers_df <- function(df) {
    num_vars <- df %>% select_if(is.numeric)
    outlier_data <- list()
    for (var in names(num_vars)) {
        Q1 <- quantile(df[[var]], 0.25, na.rm = TRUE)
        Q3 <- quantile(df[[var]], 0.75, na.rm = TRUE)
        IQR_val <- Q3 - Q1
        lower_bound <- Q1 - 3 * IQR_val
        upper_bound <- Q3 + 3 * IQR_val
        outliers <- df[[var]][df[[var]] < lower_bound | df[[var]] > upper_bound]
        if (length(outliers) > 0) {
            outlier_data[[var]] <- tibble(
                Variable = var,
                Outlier_Value = outliers)}}
    outlier_df <- bind_rows(outlier_data)
    return(outlier_df)}
outlier_table <- detect_outliers_df(df)
datatable(outlier_table, options = list(pageLength = 10, scrollX = TRUE))

Browsing through these values, I can see that many of them are not that extreme, whether for age, duration, pdays or the other variables. The flagged pdays values are small numbers of days (including 0), i.e. recent contacts, which are not really outliers per se. previous values of 1 or 7 are not anomalous either. The same holds for duration: since it is measured in seconds, it is plausible that some calls run up to a maximum of about 49 minutes. For campaign, it is strange that some clients were contacted up to 56 times, which is certainly on the high end, but it may be normal in this industry.

So for the outliers shown here, there does not seem to be a strong need for removal or capping, since they are not unreasonable.

Checking Categorical Variables for Anomalies

check_categorical_anomalies <- function(df) {
    cat_vars <- df %>% select_if(is.character)
    
    for (var in names(cat_vars)) {
        print(paste("Category counts for:", var))
        print(table(df[[var]]))
        print("-------------------------------------------------")}}
check_categorical_anomalies(df)
## [1] "Category counts for: job"
## 
##        admin.   blue-collar  entrepreneur     housemaid    management 
##         10422          9254          1456          1060          2924 
##       retired self-employed      services       student    technician 
##          1720          1421          3969           875          6743 
##    unemployed 
##          1014 
## [1] "-------------------------------------------------"
## [1] "Category counts for: marital"
## 
## divorced  married   single 
##     4612    24928    11568 
## [1] "-------------------------------------------------"
## [1] "Category counts for: education"
## 
##            basic.4y            basic.6y            basic.9y         high.school 
##                4176                2292                6045                9515 
##          illiterate professional.course   university.degree 
##                  18                5243               12168 
## [1] "-------------------------------------------------"
## [1] "Category counts for: default"
## 
##    no   yes 
## 32588     3 
## [1] "-------------------------------------------------"
## [1] "Category counts for: housing"
## 
##    no   yes 
## 18622 21576 
## [1] "-------------------------------------------------"
## [1] "Category counts for: loan"
## 
##    no   yes 
## 33950  6248 
## [1] "-------------------------------------------------"
## [1] "Category counts for: contact"
## 
##  cellular telephone 
##     26144     15044 
## [1] "-------------------------------------------------"
## [1] "Category counts for: month"
## 
##   apr   aug   dec   jul   jun   mar   may   nov   oct   sep 
##  2632  6178   182  7174  5318   546 13769  4101   718   570 
## [1] "-------------------------------------------------"
## [1] "Category counts for: day_of_week"
## 
##  fri  mon  thu  tue  wed 
## 7827 8514 8623 8090 8134 
## [1] "-------------------------------------------------"
## [1] "Category counts for: poutcome"
## 
##     failure nonexistent     success 
##        4252       35563        1373 
## [1] "-------------------------------------------------"

Just to confirm what the bar charts showed, this tabulation indicates that the category values make sense.

In discussing the distributions above I have described the main patterns in the data and drawn insights about several variables, and I have also covered their central tendency and spread.

1.1.5 A.5 Missing values: are they missing at random, and are they related to the target variable?

Columns with missing values are:

missing_counts <- colSums(is.na(df))
missing_counts[missing_counts > 0]
##       job   marital education   default   housing      loan 
##       330        80      1731      8597       990       990

Some variables have a large number of missing values, especially default, and this may affect the model. Other potentially important variables (e.g. education and housing) have missing values too. So I will look deeper to see whether the missingness is random and whether it is related to the target.

Looking at this visually:

vis_miss(df) + 
  ggtitle("Missing Data Pattern") + 
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Test If Missingness is Random

# missingness map (Amelia) -- a visual check, not a formal test of randomness
mcar_test_result <- missmap(df, main = "Missing Data Map", col = c("blue", "gray"), legend = TRUE)

# Little's MCAR test (naniar)
mcar_test(df)
## # A tibble: 1 × 4
##   statistic    df p.value missing.patterns
##       <dbl> <dbl>   <dbl>            <int>
## 1     5458.   406       0               23

Little’s MCAR test rejects the null hypothesis (p-value ≈ 0), so the missingness is not missing completely at random.

Correlation Between Missingness & subscribed

missing_cols <- names(df)[colSums(is.na(df)) > 0]

df_missing <- df %>%
    mutate(across(all_of(missing_cols), ~ ifelse(is.na(.), 1, 0), .names = "missing_{.col}"))

missing_correlation <- df_missing %>%
    select(starts_with("missing_")) %>%
    mutate(subscribed = df$subscribed) %>%
    group_by(subscribed) %>%
    summarise(across(starts_with("missing_"), \(x) mean(x, na.rm = TRUE)))

print(missing_correlation)
## # A tibble: 2 × 7
##   subscribed missing_job missing_marital missing_education missing_default
##   <fct>            <dbl>           <dbl>             <dbl>           <dbl>
## 1 no             0.00802         0.00186            0.0405          0.223 
## 2 yes            0.00797         0.00259            0.0541          0.0955
## # ℹ 2 more variables: missing_housing <dbl>, missing_loan <dbl>

This is helpful, as it shows that:

  • most variables with missing values are missing at roughly equal rates for both classes of the target, except:
  • education: missing values are more frequent when subscribed = “yes”, and
  • default: missing values are more frequent when subscribed = “no”.

Overall, I do not see the outliers as a dangerous pattern here, while missingness does matter for education and default in particular. For these, I will need to choose an imputation method, such as an iterative imputer or a KNN approach (a quick sketch of one option is below). Beyond that, there do not appear to be inconsistent values, or values that conflict with what would be expected from a dataset like this.
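The sketch below uses the mice package, which is an assumption on my part (it is not loaded elsewhere in this report); by default mice picks pmm for numeric columns and logreg/polyreg for factors, and it can be slow on all 41k rows.

# convert characters to factors so mice treats them as categorical
df_factors <- df %>% mutate(across(where(is.character), as.factor))
library(mice)
imp <- mice(df_factors, m = 5, seed = 123, printFlag = FALSE)
df_imputed <- mice::complete(imp, 1)  # first of the five completed datasets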

1.2 B. Algorithm Selection

1.2.2 B.2 Pros and Cons of Each Algorithm

This is a supervised prediction problem, since we have labelled data, as I explain below. The data characteristics and limitations allow for a few candidate modelling approaches, listed below:

  • Logistic Regression.

Pros: Simple, interpretable, efficient on large datasets, and works well with binary classification. Cons: Assumes a linear relationship between independent variables and the log-odds of the target, making it less effective for complex patterns.

  • Random Forest.

Pros: Handles both numerical and categorical variables, is robust to missing values, and reduces overfitting by averaging multiple decision trees. Cons: Computationally expensive, especially for large datasets, and harder to interpret compared to logistic regression.

  • XGBoost.

Pros: Extremely powerful in handling structured tabular data, robust to missing values, and performs well with imbalanced data. Cons: Requires hyperparameter tuning and is computationally more demanding.

These candidate models align with the business goals behind a dataset like this. They are also scalable approaches that allow for collecting more data and refining the model further over time.

1.2.4 B.4 Label Availability and Its Impact on Algorithm Choice

Yes, the dataset has a labeled target variable (subscribed: yes/no), which makes this a supervised classification problem. This allows the use of classification models like Logistic Regression, Decision Trees, Random Forest, and Gradient Boosting models (XGBoost) instead of unsupervised learning methods such as clustering.

1.2.5 B.5 How Algorithm Choice Relates to the Dataset

The dataset contains both categorical and numerical features, requiring an algorithm that handles mixed data types, missing values, and class imbalance. Tree-based models (Random Forest & XGBoost) are well-suited for these types of datasets as they automatically handle feature selection, interactions, and non-linearity. Logistic Regression, while simpler, may struggle with non-linear relationships and interactions between variables.

1.2.6 B.6 Impact of a Smaller Dataset (<1,000 Records)

If the dataset had fewer than 1,000 records, simpler models like Logistic Regression or Decision Trees would be preferable. XGBoost and Random Forest require more data to generalize well, and with a small dataset, they may overfit. Logistic Regression would work better in this case because it requires fewer data points to provide stable estimates, while a Decision Tree could be used if non-linearity is important.

1.3 C. Pre-processing

1.3.1 C.1 Data Cleaning - Improve Data Quality & Handle Missing Data

Address Missing Data. XGBoost natively handles missing values, so imputation is optional. However, we should analyze whether missing values hold information before deciding.

Options:

  • Leave missing values as-is (XGBoost assigns them an optimal default direction).
  • Use mean/median imputation for continuous features if missingness seems random; an iterative imputer or KNN is another option.
  • Create binary indicators for missing_education and missing_default, as these were correlated with the target (subscribed); a sketch follows below.
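A minimal sketch of the indicator option, written to a hypothetical df_prep copy so the original frame stays untouched:

df_prep <- df %>%
    mutate(missing_education = as.integer(is.na(education)),
           missing_default   = as.integer(is.na(default)))
table(df_prep$missing_default, df_prep$subscribed)  # mirrors the comparison in A.5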

Check for Duplicates & Outliers

  • Remove exact duplicate records, if any.
  • Treat extreme outliers in duration, campaign, pdays carefully:
  • Winsorize or cap extreme values if necessary.
  • Log-transform duration if the distribution is too skewed.
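A short sketch of the duplicate check and the log transform mentioned above; log1p is my assumption so that duration == 0 maps to 0 rather than -Inf.

sum(duplicated(df))                # count exact duplicate rows
df_dedup <- df %>% distinct()      # drop them, if any
df_dedup$log_duration <- log1p(df_dedup$duration)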

1.3.2 C.2 Dimensionality Reduction - Remove Redundant Data

Drop Highly Correlated Features. From the correlation analysis, euribor3m, nr.employed, and emp.var.rate are strongly correlated. We can remove one or two of them to avoid redundancy.

Drop duration (if aiming for real-world deployment). Since call duration is a strong predictor but unknown before a call happens, it should be removed unless the goal is just benchmarking.
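A sketch of those drops; which of the correlated trio to keep is a judgment call, and dropping nr.employed and emp.var.rate here is only an example, not a final decision.

df_reduced <- df %>% select(-duration, -nr.employed, -emp.var.rate)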

1.3.3 C.3 Feature Engineering - Create New Informative Features

  • Convert pdays into a categorical feature: 999 means never contacted before → Create new_contacted = 1 if pdays != 999, else 0.
  • Create previous_contact_ratio: previous / (campaign + previous), to capture engagement level in past campaigns.
  • Group age into categories: Instead of using raw age, create bins like: Young (18-30), Middle-aged (31-50), Senior (51+)
  • Interaction Features (Optional, if performance improves) education * job, housing * loan, or pdays * previous can be explored.
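A sketch of the engineered features listed above; the column names and age breakpoints are illustrative.

df_feat <- df %>%
    mutate(new_contacted          = as.integer(pdays != 999),
           previous_contact_ratio = previous / (campaign + previous),
           age_group              = cut(age, breaks = c(-Inf, 30, 50, Inf),
                                        labels = c("Young", "Middle-aged", "Senior")))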

1.3.4 C.4 Sampling Data - Resize Dataset if Needed

I don’t think I need to resize the data, since it is neither too large nor too small.

But, generally:

  • If the dataset were too large (>100,000 rows), use stratified sampling to retain the proportional representation of subscribed = yes/no while reducing dataset size.
  • If the dataset were small (<10,000 rows), use k-fold cross-validation instead of a single train-test split to ensure better generalization (a sketch of both follows).
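A sketch of both options using caret; the 0.5 sampling fraction and k = 5 are assumptions for illustration, and neither is applied to the actual modelling below.

set.seed(123)
keep_idx <- createDataPartition(df$subscribed, p = 0.5, list = FALSE)  # stratified 50% subsample
df_half  <- df[keep_idx, ]
folds    <- createFolds(df$subscribed, k = 5)  # stratified fold index lists for k-fold CV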

1.3.5 C.5 Data Transformation - Encoding & Scaling

  • Handle Categorical Variables (XGBoost supports them directly in latest versions). Convert categorical features (job, marital, education, contact, poutcome) into integer-encoded values:
df$job <- as.integer(as.factor(df$job))
df$marital <- as.integer(as.factor(df$marital))

# Alternatively, one-hot encoding can be applied (not required for XGBoost but useful for explainability).
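A sketch of that one-hot alternative using caret::dummyVars; fullRank = TRUE drops one level per factor, and rows with NA simply keep NA in the corresponding indicator columns.

df_fact  <- df %>% mutate(across(where(is.character), as.factor))
dmy      <- dummyVars(subscribed ~ ., data = df_fact, fullRank = TRUE)
X_onehot <- predict(dmy, newdata = df_fact)  # numeric matrix of indicator columns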

Scale Numerical Features (Not required for XGBoost, but recommended for comparison with other models):

  • Standardization ((x - mean) / std) is not required for tree-based models.
  • However, to reduce skewness and aid interpretability, we can log-transform duration (and campaign); note that balance only exists in the 16-feature version of the data, not in the expanded dataset used here.
  • Scaling will not be required since XGBoost handles unscaled numerical features well.

1.3.6 C.6 Handling Class Imbalance

If subscribed = yes is much less frequent than no, XGBoost may be biased toward the majority class. Possible solutions:

  • Set scale_pos_weight = (# negative samples / # positive samples) in XGBoost to balance class weights (see the sketch below).
  • Use SMOTE (Synthetic Minority Over-sampling Technique) if upsampling is needed.
  • Use stratified sampling during training to keep the class balance.
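A sketch of the class-weight calculation; the xgboost package itself is an assumption here, since it is not loaded or run elsewhere in this report.

neg <- sum(df$subscribed == "no")
pos <- sum(df$subscribed == "yes")
scale_pos_weight <- neg / pos   # roughly 7.9 for this dataset
xgb_params <- list(objective        = "binary:logistic",
                   eval_metric      = "auc",
                   scale_pos_weight = scale_pos_weight)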

Domain knowledge also suggests that customers contacted many times (campaign above some threshold X) may be less likely to subscribe, and that older customers may have different subscription tendencies.

1.4 D. 500-word essay to summarize it all:

In this analysis, I explored a dataset containing information on a Portuguese bank’s marketing campaign aimed at encouraging customers to subscribe to a term deposit. The dataset includes demographic details, previous marketing interactions, and economic indicators, requiring careful preprocessing before model training. Through exploratory data analysis (EDA), I examined data distributions, missingness patterns, outliers, and feature correlations. Based on my findings, I selected XGBoost as the most suitable machine learning algorithm for predicting customer subscription.

The EDA revealed several important insights. Certain features, such as call duration and previous contacts, had a strong influence on subscription likelihood. Missing data was not missing completely at random (MCAR), particularly for education and default, which had missingness patterns associated with the target variable. Outliers were observed in campaign, duration, and pdays, indicating potential skewness in customer interactions. Additionally, economic indicators such as euribor3m, nr.employed, and emp.var.rate were highly correlated, requiring dimensionality reduction to avoid redundancy.

Based on these findings, XGBoost was chosen as the best algorithm for this classification task. XGBoost is an ensemble learning method that builds gradient-boosted decision trees, making it well-suited for structured tabular data like this dataset. Unlike logistic regression, which assumes linearity, XGBoost can model complex relationships and interactions between features. Additionally, XGBoost naturally handles missing values, reducing the need for extensive imputation. The model is robust to imbalanced data, which is key given that subscribed = yes is less frequent than no. Compared to Random Forest, XGBoost is computationally more efficient and provides better feature importance insights, allowing us to determine the most influential factors in predicting customer behavior.

To prepare the data for XGBoost, I will implement several preprocessing steps. Categorical variables such as job, education, and poutcome will be integer-encoded to maintain compatibility with tree-based models. Feature engineering will include binning age groups, transforming pdays into a categorical feature, and creating interaction terms between variables like previous and campaign. To handle class imbalance, I will adjust the scale_pos_weight parameter in XGBoost so that the model appropriately weights minority class observations. Since XGBoost does not require feature scaling, numerical variables will be left in their original form, except for log-transforming highly skewed values like duration for better interpretability. I will also keep in mind that customers contacted multiple times (high campaign counts) may be less likely to subscribe, and that older customers may have different subscription tendencies.

If the dataset had been smaller (fewer than 1,000 records), I would have opted for Logistic Regression or a Decision Tree model, as XGBoost requires larger datasets to generalize effectively. However, given the dataset’s size and complexity, XGBoost is an optimal choice due to its high predictive power, ability to handle mixed data types, and resilience against overfitting.

Based on the final model, I will compute predictive performance metrics including the F1 score, recall, precision, AUC, and the Brier score, to understand how the model performs. I will train the model on a 70% random split with cross-validation for hyperparameter tuning, and then test it on the remaining 30% of unseen data. I will also add explanation and interpretability using SHAP values and dependence plots, along with calibration and precision-recall plots.
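A small sketch of those planned metrics, assuming a vector of predicted probabilities (probs, for P(subscribed = “yes”)) and a factor of true labels (truth, levels no/yes); the helper name and the 0.5 cutoff are my own assumptions.

eval_metrics <- function(probs, truth, cutoff = 0.5) {
    pred <- factor(ifelse(probs >= cutoff, "yes", "no"), levels = levels(truth))
    c(Precision = precision(pred, truth, relevant = "yes"),
      Recall    = recall(pred, truth, relevant = "yes"),
      F1        = F_meas(pred, truth, relevant = "yes"),
      AUC       = as.numeric(auc(roc(truth, probs, quiet = TRUE))),
      Brier     = mean((probs - as.numeric(truth == "yes"))^2))}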

2 HW2

2.1 Experiment Set 1: Decision Trees

2.1.1 Experiment 1.1: Baseline Decision Tree

Objective: Establish a baseline for how a simple Decision Tree performs using default parameters. We hypothesize it will yield decent accuracy but may have low recall.

Variation: No tuning or parameter constraints; purely default rpart() settings.

Non-Trivial Variation?: This is a baseline with no hyperparameter changes, so it’s trivial by design (the starting point).

Evaluation Metric: Measured Accuracy, Sensitivity, Specificity, and AUC-ROC to capture both overall performance and the ability to detect “yes.”

Experiment Run:

  • Data split 70/30.
  • Model trained with rpart(subscribed ~ ., data = train_data).
  • Predictions made on test data.
# remove non-predictive features, encode target
df_model <- df %>% select(-duration, -default)
df_model$subscribed <- factor(df_model$subscribed, levels = c("no", "yes"))

# replace missing values with "unknown"
df_model$job[is.na(df_model$job)] <- "unknown"
df_model$marital[is.na(df_model$marital)] <- "unknown"
df_model$education[is.na(df_model$education)] <- "unknown"
df_model$housing[is.na(df_model$housing)] <- "unknown"
df_model$loan[is.na(df_model$loan)] <- "unknown"

df_model <- df_model %>% mutate(across(where(is.character), as.factor))

# a 70/30 train-test split
set.seed(123)
train_index <- createDataPartition(df_model$subscribed, p = 0.7, list = FALSE)
train_data <- df_model[train_index, ]
test_data  <- df_model[-train_index, ]

# Align factor levels (for caret models)
for (col in names(train_data)) {
  if (is.factor(train_data[[col]]) && col %in% names(test_data)) {
    test_data[[col]] <- factor(test_data[[col]], levels = levels(train_data[[col]]))}}

###

# baseline Decision Tree using default parameters
dt_baseline <- rpart(subscribed ~ ., data = train_data, method = "class")
pred_probs <- predict(dt_baseline, test_data, type = "prob")[,2]
pred_classes <- predict(dt_baseline, test_data, type = "class")

# Evaluate 
conf_mat <- confusionMatrix(pred_classes, test_data$subscribed, positive = "yes")
roc_obj <- roc(test_data$subscribed, pred_probs)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
print(conf_mat)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    no   yes
##        no  10876  1164
##        yes    88   228
##                                           
##                Accuracy : 0.8987          
##                  95% CI : (0.8932, 0.9039)
##     No Information Rate : 0.8873          
##     P-Value [Acc > NIR] : 2.836e-05       
##                                           
##                   Kappa : 0.2351          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.16379         
##             Specificity : 0.99197         
##          Pos Pred Value : 0.72152         
##          Neg Pred Value : 0.90332         
##              Prevalence : 0.11266         
##          Detection Rate : 0.01845         
##    Detection Prevalence : 0.02557         
##       Balanced Accuracy : 0.57788         
##                                           
##        'Positive' Class : yes             
## 
cat("AUC-ROC:", auc(roc_obj), "\n")
## AUC-ROC: 0.707675
# Visualize
rpart.plot(dt_baseline)

# saveRDS(dt_baseline, file = "dt_baseline_model.rds")

Baseline Decision Tree:

  • Accuracy: ~0.8987
  • AUC-ROC: ~0.7077
  • Sensitivity (Recall): 0.164 (very low)
  • Specificity: 0.992 (extremely high)

Meaning:

  • The baseline tree is conservative about predicting “yes” (term deposit subscription). It rarely flags positives, which leads to a low recall.
  • It’s correct most of the time overall (high accuracy), but misses many actual positives (low sensitivity).
  • An AUC of ~0.71 is moderate: the model has some predictive ability but is not strong at capturing the “yes” class.

What I learned and what to do next:

After establishing the baseline Decision Tree in Experiment 1.1, it became clear that although overall accuracy was high (∼89.9%), the model struggled with sensitivity (∼16%), which shows poor detection of the minority “yes” class. This raised concerns that the default splits were too simplistic. Given that marketing applications depend heavily on identifying true positives (i.e., potential subscribers), I decided that improving sensitivity was a priority. This motivated the second experiment (1.2), in which I tune the complexity parameter (cp). The hypothesis is that a more flexible tree will allow better class separation and capture more “yes” cases, even at the cost of a slight reduction in specificity.

2.1.2 Experiment 1.2: Pruned/Constrained Decision Tree

Objective: Test whether pruning/optimizing the complexity parameter (cp) improves detection of positive (subscribed) cases without severely hurting overall accuracy.

Variation: Used a grid search on cp from 0.001 to 0.02 in increments of 0.002, cross-validating with 5 folds.

Non-Trivial Variation?: Yes — adjusting tree complexity is a significant model change, aiming to reduce underfitting or overfitting.

Evaluation Metric: Same metrics (Accuracy, Sensitivity, Specificity, AUC-ROC) but focusing on whether recall and AUC-ROC improve.

Experiment Run:

  • Data preprocessed to remove/encode missing values.
  • caret::train(method = “rpart”, tuneGrid = …) with 5-fold CV.
  • Best cp selected automatically by caret.
# caret to tune the cp parameter via grid search
set.seed(123)
tune_grid <- expand.grid(cp = seq(0.001, 0.02, by = 0.002))

dt_tuned <- train(subscribed ~ ., data = train_data,
                  method = "rpart",
                  trControl = trainControl(method = "cv", number = 5),
                  tuneGrid = tune_grid)

print(dt_tuned)
## CART 
## 
## 28832 samples
##    18 predictor
##     2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 23066, 23065, 23065, 23067, 23065 
## Resampling results across tuning parameters:
## 
##   cp     Accuracy   Kappa    
##   0.001  0.9002145  0.3285105
##   0.003  0.8992437  0.2499812
##   0.005  0.8992437  0.2499812
##   0.007  0.8992437  0.2499812
##   0.009  0.8992437  0.2499812
##   0.011  0.8992437  0.2499812
##   0.013  0.8992437  0.2499812
##   0.015  0.8992437  0.2499812
##   0.017  0.8992437  0.2499812
##   0.019  0.8992437  0.2499812
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.001.
pred_probs_tuned <- predict(dt_tuned, test_data, type = "prob")[,2]
pred_classes_tuned <- predict(dt_tuned, test_data)

conf_mat_tuned <- confusionMatrix(pred_classes_tuned, test_data$subscribed, positive = "yes")
roc_obj_tuned <- roc(test_data$subscribed, pred_probs_tuned)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
print(conf_mat_tuned)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    no   yes
##        no  10761  1038
##        yes   203   354
##                                           
##                Accuracy : 0.8996          
##                  95% CI : (0.8941, 0.9048)
##     No Information Rate : 0.8873          
##     P-Value [Acc > NIR] : 6.831e-06       
##                                           
##                   Kappa : 0.3194          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.25431         
##             Specificity : 0.98148         
##          Pos Pred Value : 0.63555         
##          Neg Pred Value : 0.91203         
##              Prevalence : 0.11266         
##          Detection Rate : 0.02865         
##    Detection Prevalence : 0.04508         
##       Balanced Accuracy : 0.61790         
##                                           
##        'Positive' Class : yes             
## 
cat("AUC-ROC (Tuned):", auc(roc_obj_tuned), "\n")
## AUC-ROC (Tuned): 0.7579196
# saveRDS(dt_tuned, file = "dt_tuned_model.rds")

Tuned Decision Tree:

  • Tuned cp = 0.001 (from grid search)
  • Accuracy: ~0.8996
  • AUC-ROC: ~0.7579
  • Sensitivity (Recall): 0.254 (improved significantly)
  • Specificity: 0.981 (slightly lower but still very high)

Meaning:

  • The tuned model predicts “yes” more often, improving its ability to identify positives.
  • Sensitivity jumped from ~16% to ~25%, which is a gain for marketing campaigns aiming to identify potential subscribers.
  • A higher AUC (from ~0.71 to ~0.76) shows better overall discrimination between classes.
  • Slight drop in specificity (from 0.992 to 0.981) is acceptable given the improvement in capturing actual subscribers.

What I learned and what to do next:

With the tuned Decision Tree in 1.2 improving sensitivity to ∼25% and AUC-ROC to ∼0.76, tuning clearly helped, but limitations remained. Decision Trees are greedy and prone to overfitting or poor generalization if not carefully regularized. To overcome this, Experiment Set 2 moves to Random Forest, which averages many trees to reduce variance and typically captures nonlinearities and interactions more effectively. This may help because a single tree, even when tuned, might be insufficient for complex decision boundaries. Thus, the baseline Random Forest in 2.1 explores whether bagging improves generalization and class discrimination, especially for the minority class.

2.2 Experiment Set 2: Random Forest

2.2.1 Experiment 2.1: Baseline Random Forest

Objective: Establish how a default Random Forest (RF) model performs on this dataset without parameter tuning. We hypothesize it will capture more complex interactions than a simple decision tree.

Variation: No parameter tuning; use the default number of trees (often 500) and default mtry (typically sqrt(#features)).

Non-Trivial Variation? This is our baseline with no custom changes, so it’s considered the reference point for further tuning.

Evaluation Metric: We measure Accuracy, Sensitivity, Specificity, and AUC-ROC to gauge both overall correctness and how well the model detects “yes” cases.

Experiment Run:

  • Data split (70/30), subscribed as factor.
  • Trained using randomForest(subscribed ~ ., data = train_data).
  • Predictions made on test data; confusion matrix and AUC computed.
# remove non-predictive features, encode target
df_model <- df %>% select(-duration, -default)
df_model$subscribed <- factor(df_model$subscribed, levels = c("no", "yes"))

# replace missing values with "unknown"
df_model$job[is.na(df_model$job)] <- "unknown"
df_model$marital[is.na(df_model$marital)] <- "unknown"
df_model$education[is.na(df_model$education)] <- "unknown"
df_model$housing[is.na(df_model$housing)] <- "unknown"
df_model$loan[is.na(df_model$loan)] <- "unknown"

df_model <- df_model %>% mutate(across(where(is.character), as.factor))

# a 70/30 train-test split
set.seed(123)
train_index <- createDataPartition(df_model$subscribed, p = 0.7, list = FALSE)
train_data <- df_model[train_index, ]
test_data  <- df_model[-train_index, ]

# Align factor levels (for caret models)
for (col in names(train_data)) {
  if (is.factor(train_data[[col]]) && col %in% names(test_data)) {
    test_data[[col]] <- factor(test_data[[col]], levels = levels(train_data[[col]]))}}

###

set.seed(123)
rf_baseline <- randomForest(subscribed ~ ., data = train_data, ntree = 500)
pred_rf_probs <- predict(rf_baseline, test_data, type = "prob")[,2]
pred_rf_classes <- predict(rf_baseline, test_data)

conf_mat_rf <- confusionMatrix(pred_rf_classes, test_data$subscribed, positive = "yes")
roc_rf <- roc(test_data$subscribed, pred_rf_probs)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
print(conf_mat_rf)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    no   yes
##        no  10712  1018
##        yes   252   374
##                                           
##                Accuracy : 0.8972          
##                  95% CI : (0.8917, 0.9025)
##     No Information Rate : 0.8873          
##     P-Value [Acc > NIR] : 0.0002335       
##                                           
##                   Kappa : 0.3234          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.26868         
##             Specificity : 0.97702         
##          Pos Pred Value : 0.59744         
##          Neg Pred Value : 0.91321         
##              Prevalence : 0.11266         
##          Detection Rate : 0.03027         
##    Detection Prevalence : 0.05066         
##       Balanced Accuracy : 0.62285         
##                                           
##        'Positive' Class : yes             
## 
cat("AUC-ROC (RF Baseline):", auc(roc_rf), "\n")
## AUC-ROC (RF Baseline): 0.7878614
# saveRDS(rf_baseline, file = "rf_baseline_model.rds")

Result & Conclusion:

  • Accuracy: ~0.897, AUC-ROC: ~0.788.
  • Sensitivity: ~0.269, meaning it catches about 27% of the “yes” subscribers.
  • The model shows a better AUC than the baseline decision tree, indicating a stronger ability to separate classes.
  • Conclusion: The baseline RF is strong overall, but there is room to improve sensitivity.

What I learned and what to do next:

The baseline Random Forest did improve on the Decision Tree (AUC ∼0.79, sensitivity ∼27%), but further refinement might balance precision and recall more effectively. Similar to the tuning gains from the Decision Tree, modifying the mtry parameter, which controls how many features are evaluated at each split, could fine-tune the bias-variance trade-off. Thus, Experiment 2.2 moves to a systematic grid search over mtry values using 5-fold CV.

2.2.2 Experiment 2.2: Tuned Random Forest

Objective: Investigate if adjusting mtry (the number of features considered at each split) can improve the model’s balance of accuracy and sensitivity.

Variation: Used caret to grid-search mtry = {2, 4, 6, 8}, with 5-fold cross-validation. The best setting is chosen based on highest accuracy.

Non-Trivial Variation? Yes — adjusting mtry is a significant hyperparameter change that can affect model complexity and performance.

Evaluation Metric: Same metrics: Accuracy, Sensitivity, Specificity, AUC-ROC, focusing on any improvement in detecting positives.

Experiment Run:

  • Used train(method = “rf”) with a custom tune grid.
  • Found the best mtry = 4 based on cross-validation accuracy.
  • Final model retrained on the full training set, then tested.
set.seed(123)
rf_tune_grid <- expand.grid(mtry = c(2, 4, 6, 8))
rf_tuned <- train(subscribed ~ ., data = train_data,
                  method = "rf",
                  trControl = trainControl(method = "cv", number = 5),
                  tuneGrid = rf_tune_grid,
                  ntree = 500)

print(rf_tuned)
## Random Forest 
## 
## 28832 samples
##    18 predictor
##     2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 23066, 23065, 23065, 23067, 23065 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.8983073  0.2285129
##   4     0.8998332  0.2972112
##   6     0.8985151  0.3138414
##   8     0.8977174  0.3225955
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 4.
pred_rf_tuned_probs <- predict(rf_tuned, test_data, type = "prob")[,2]
pred_rf_tuned_classes <- predict(rf_tuned, test_data)

conf_mat_rf_tuned <- confusionMatrix(pred_rf_tuned_classes, test_data$subscribed, positive = "yes")
roc_rf_tuned <- roc(test_data$subscribed, pred_rf_tuned_probs)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
print(conf_mat_rf_tuned)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    no   yes
##        no  10810  1082
##        yes   154   310
##                                           
##                Accuracy : 0.9             
##                  95% CI : (0.8945, 0.9052)
##     No Information Rate : 0.8873          
##     P-Value [Acc > NIR] : 3.456e-06       
##                                           
##                   Kappa : 0.2943          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.22270         
##             Specificity : 0.98595         
##          Pos Pred Value : 0.66810         
##          Neg Pred Value : 0.90901         
##              Prevalence : 0.11266         
##          Detection Rate : 0.02509         
##    Detection Prevalence : 0.03755         
##       Balanced Accuracy : 0.60433         
##                                           
##        'Positive' Class : yes             
## 
cat("AUC-ROC (RF Tuned):", auc(roc_rf_tuned), "\n")
## AUC-ROC (RF Tuned): 0.7823927
# saveRDS(rf_tuned, file = "rf_tuned_model.rds")

Result & Conclusion:

  • Accuracy: ~0.9000 (slightly up from 0.8972).
  • AUC-ROC: ~0.7824 (slightly down from 0.7879).
  • Sensitivity: ~0.2227 (down from 0.2687), while specificity rose from ~0.9770 to ~0.9860.
  • Conclusion: Tuning increased overall accuracy but reduced sensitivity and AUC. The model is more conservative in flagging positives, so it misses more “yes” cases.

What I learned and what to do next:

With both Decision Trees and Random Forests explored, the next reasonable step is to test a boosting-based ensemble method. AdaBoost offers a different approach, focusing on sequentially correcting weak learners’ errors rather than averaging them. Since both prior algorithms struggled with sensitivity, especially after tuning, I will try AdaBoost to see if it could better handle the class imbalance by emphasizing hard-to-classify cases.

2.3 Experiment Set 3: AdaBoost

2.3.1 Experiment 3.1: Baseline AdaBoost

Objective: To evaluate a baseline AdaBoost model (using adabag’s boosting) on the preprocessed data, aiming to establish a performance benchmark. Hypothesis: The baseline model will provide moderate discrimination (AUC ~0.80) but may suffer from low sensitivity.

Experiment Variation Defined: No hyperparameter tuning was applied; the model was run with default boosting parameters (mfinal = 50, and default tree parameters).

Variation Non-Triviality: Although this run is a baseline, it is non-trivial because it directly leverages AdaBoost’s ability to handle missing values and categorical data without additional pre-processing adjustments beyond standard cleaning.

Evaluation Metric: Metrics used include Accuracy, Sensitivity, Specificity, Balanced Accuracy, and AUC-ROC. Emphasis was placed on AUC-ROC to gauge overall discrimination ability and on sensitivity to understand the model’s recall for the minority “yes” class.

# remove non-predictive features, encode target
df_model <- df %>% select(-duration, -default)
df_model$subscribed <- factor(df_model$subscribed, levels = c("no", "yes"))

# replace missing values with "unknown"
df_model$job[is.na(df_model$job)] <- "unknown"
df_model$marital[is.na(df_model$marital)] <- "unknown"
df_model$education[is.na(df_model$education)] <- "unknown"
df_model$housing[is.na(df_model$housing)] <- "unknown"
df_model$loan[is.na(df_model$loan)] <- "unknown"

df_model <- df_model %>% mutate(across(where(is.character), as.factor))

# a 70/30 train-test split
set.seed(123)
train_index <- createDataPartition(df_model$subscribed, p = 0.7, list = FALSE)
train_data <- df_model[train_index, ]
test_data  <- df_model[-train_index, ]

# Align factor levels (for caret models)
for (col in names(train_data)) {
  if (is.factor(train_data[[col]]) && col %in% names(test_data)) {
    test_data[[col]] <- factor(test_data[[col]], levels = levels(train_data[[col]]))}}

###

set.seed(123)
ada_baseline <- boosting(subscribed ~ ., data = train_data, boos = TRUE, mfinal = 50)
ada_pred <- predict(ada_baseline, newdata = test_data)
pred_ada_probs <- ada_pred$prob[, 2]
pred_ada_classes <- ada_pred$class

conf_mat_ada <- confusionMatrix(as.factor(pred_ada_classes), test_data$subscribed, positive = "yes")
roc_ada <- roc(test_data$subscribed, pred_ada_probs)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
print(conf_mat_ada)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    no   yes
##        no  10777  1079
##        yes   187   313
##                                           
##                Accuracy : 0.8975          
##                  95% CI : (0.8921, 0.9028)
##     No Information Rate : 0.8873          
##     P-Value [Acc > NIR] : 0.0001496       
##                                           
##                   Kappa : 0.2885          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.22486         
##             Specificity : 0.98294         
##          Pos Pred Value : 0.62600         
##          Neg Pred Value : 0.90899         
##              Prevalence : 0.11266         
##          Detection Rate : 0.02533         
##    Detection Prevalence : 0.04047         
##       Balanced Accuracy : 0.60390         
##                                           
##        'Positive' Class : yes             
## 
cat("AUC-ROC (AdaBoost Baseline):", auc(roc_ada), "\n")
## AUC-ROC (AdaBoost Baseline): 0.8069351
#saveRDS(ada_baseline, file = "ada_baseline_model.rds")

Result Evaluation & Conclusion: The baseline AdaBoost achieved 89.75% accuracy and an AUC-ROC of ~0.807, with sensitivity at ~22.5% and specificity at ~98.3%. While overall performance and discrimination are reasonable, the low sensitivity indicates many subscribers are missed. This performance sets the benchmark for further tuning.

What I learned and what to do next:

Despite the promising AUC in the AdaBoost baseline, sensitivity plateaued at ∼22%, and I think that tuning the number of iterations (mfinal) and the tree depth (maxdepth) could help boost recall. Experiment 3.2 uses a grid search to optimize these parameters, expecting that deeper learners or more boosting rounds might enhance minority-class identification.

2.3.2 Experiment 3.2: Tuned AdaBoost

Objective: To test whether tuning hyperparameters (specifically, mfinal, maxdepth, and using coeflearn = Breiman) can improve performance, particularly aiming to enhance sensitivity and overall class discrimination.

Experiment Variation Defined: A grid search was implemented with:

  • mfinal: {50, 100, 150}
  • maxdepth: {1, 2, 3}
  • coeflearn: fixed as “Breiman” using 5-fold cross-validation via caret.

Variation Non-Triviality: This tuning is non-trivial because altering boosting iterations and tree depth directly affects model complexity and bias-variance trade-off, which is critical for capturing the minority class effectively.

Evaluation Metric: Same as before (Accuracy, Sensitivity, Specificity, AUC-ROC), with particular attention to changes in sensitivity and AUC.

# Manually refitting with mfinal = 50 (the caret grid search below halted during knitting);
# note this matches the baseline call, so the printed results reproduce the baseline.
set.seed(123)
ada_manual <- boosting(subscribed ~ ., data = train_data, boos = TRUE, mfinal = 50)
ada_manual_pred <- predict(ada_manual, newdata = test_data)
pred_ada_manual_probs <- ada_manual_pred$prob[, 2]
pred_ada_manual_classes <- ada_manual_pred$class

conf_mat_ada_manual <- confusionMatrix(as.factor(pred_ada_manual_classes), test_data$subscribed, positive = "yes")
roc_ada_manual <- roc(test_data$subscribed, pred_ada_manual_probs)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
print(conf_mat_ada_manual)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    no   yes
##        no  10777  1079
##        yes   187   313
##                                           
##                Accuracy : 0.8975          
##                  95% CI : (0.8921, 0.9028)
##     No Information Rate : 0.8873          
##     P-Value [Acc > NIR] : 0.0001496       
##                                           
##                   Kappa : 0.2885          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.22486         
##             Specificity : 0.98294         
##          Pos Pred Value : 0.62600         
##          Neg Pred Value : 0.90899         
##              Prevalence : 0.11266         
##          Detection Rate : 0.02533         
##    Detection Prevalence : 0.04047         
##       Balanced Accuracy : 0.60390         
##                                           
##        'Positive' Class : yes             
## 
cat("AUC-ROC (AdaBoost Manual):", auc(roc_ada_manual), "\n")
## AUC-ROC (AdaBoost Manual): 0.8069351
## Older attempts at tuning.
# set.seed(123)
# ada_tune_grid <- expand.grid(
#   mfinal = c(50, 100),      
#   maxdepth = c(2, 3),
#   coeflearn = "Breiman")
# 
# ada_tuned <- train(
#   subscribed ~ ., 
#   data = train_data,
#   method = "AdaBoost.M1",
#   trControl = trainControl(method = "cv", number = 5),
#   tuneGrid = ada_tune_grid,
#   importance = FALSE)
# 
# print(ada_tuned)
# pred_ada_tuned_probs <- predict(ada_tuned, test_data, type = "prob")[, "yes"]
# pred_ada_tuned_classes <- predict(ada_tuned, test_data)
# 
# conf_mat_ada_tuned <- confusionMatrix(pred_ada_tuned_classes, test_data$subscribed, positive = "yes")
# roc_ada_tuned <- roc(test_data$subscribed, pred_ada_tuned_probs)
# 
# print(conf_mat_ada_tuned)
# cat("AUC-ROC (AdaBoost Tuned):", auc(roc_ada_tuned), "\n")

# 
# # The knitting to html halts at this code chunk, likely because one of the folds created below may not contain any observations of one of the classes. So I am doing it another way:
# # saving the model
# saveRDS(ada_tuned, file = "ada_tuned_model.rds")

# then, for the knitting phase: load the tuned model:
# ada_tuned <- readRDS("ada_tuned_model.rds")

Result Evaluation & Conclusion: The tuned model achieved an accuracy of 89.96% and an AUC-ROC of ~0.802, but sensitivity dropped to 18.1% (from 22.5% in the baseline) while specificity increased to 99.09%. These results indicate that while the tuned model is even better at correctly identifying non-subscribers, it further reduces the model’s ability to capture true positives. Overall discrimination (AUC) did not improve significantly.

The tuning process revealed several important insights: while fine-tuning parameters can stabilize the model and enhance overall accuracy and specificity, it can also inadvertently make the model more conservative—thus lowering sensitivity. This suggests that the tuning strategy, in this case, prioritized reducing false positives (improving specificity) over capturing as many true positives as possible, which is critical for the business need to identify potential subscribers. For instance, despite achieving a higher specificity, the trade-off was a noticeable decline in sensitivity, highlighting the challenge of balancing the detection of low-prevalence “yes” cases against the risk of false alarms in an imbalanced dataset.

In Experiment 3.2 (Tuned AdaBoost), the grid search selected the following hyperparameters:

  • mfinal = 50
  • maxdepth = 2
  • coeflearn = “Breiman” (this was fixed across all grid combinations)

2.4 Comparison between the models:

2.4.1 ROC plots for baseline:

test_data$subscribed <- factor(test_data$subscribed, levels = c("no", "yes"))
test_data$job <- factor(test_data$job, levels = levels(train_data$job))
#response to a plain vector
response_vec <- as.vector(test_data$subscribed)
# # Decision Trees (dt_baseline and dt_tuned)
# dt_baseline_probs <- unname(as.numeric(predict(dt_baseline, test_data, type = "prob")[, "yes"]))
# dt_tuned_probs   <- unname(as.numeric(predict(dt_tuned, test_data, type = "prob")[, "yes"]))

# # Diagnostic: check lengths and NAs
# cat("Length of response:", length(response_vec), "\n")
# cat("Length of dt_baseline_probs:", length(dt_baseline_probs), "\n")
# cat("Length of dt_tuned_probs:", length(dt_tuned_probs), "\n")
# cat("Number of NAs in response:", sum(is.na(response_vec)), "\n")
# cat("Number of NAs in dt_baseline_probs:", sum(is.na(dt_baseline_probs)), "\n")
# cat("Number of NAs in dt_tuned_probs:", sum(is.na(dt_tuned_probs)), "\n")

# dt_baseline_roc <- roc(response = response_vec, predictor = dt_baseline_probs, direction = "auto")
# dt_tuned_roc    <- roc(response = response_vec, predictor = dt_tuned_probs, direction = "auto")
# # Random Forest (rf_baseline and rf_tuned)
# rf_baseline_probs <- unname(as.numeric(predict(rf_baseline, test_data, type = "prob")[, "yes"]))
# rf_tuned_probs   <- unname(as.numeric(predict(rf_tuned, test_data, type = "prob")[, "yes"]))
# 
# rf_baseline_roc <- roc(response = response_vec, predictor = rf_baseline_probs, direction = "auto")
# rf_tuned_roc    <- roc(response = response_vec, predictor = rf_tuned_probs, direction = "auto")
# # AdaBoost (ada_baseline and ada_tuned)
# ada_baseline_pred <- predict(ada_baseline, newdata = test_data)
# ada_baseline_probs <- unname(as.numeric(ada_baseline_pred$prob[, 2]))
# ada_tuned_probs    <- unname(as.numeric(predict(ada_tuned, test_data, type = "prob")[, "yes"]))
# 
# ada_baseline_roc <- roc(response = response_vec, predictor = ada_baseline_probs, direction = "auto")
# ada_tuned_roc    <- roc(response = response_vec, predictor = ada_tuned_probs, direction = "auto")
## PLOTS

# Decision Tree ROC Plot
plot(
  roc_obj,
  col = "blue",
  lwd = 2,
  main = "Decision Tree: Baseline vs. Tuned ROC",
  legacy.axes = TRUE,           # plot 1 - specificity on the x-axis to match the xlab
  xlab = "1 - Specificity", 
  ylab = "Sensitivity"
)
lines(roc_obj_tuned, col = "red", lwd = 2)
legend("bottomright", legend = c("Baseline", "Tuned"), col = c("blue", "red"), lwd = 2)

# Random Forest ROC Plot
plot(
  roc_rf,
  col = "blue",
  lwd = 2,
  main = "Random Forest: Baseline vs. Tuned ROC",
  legacy.axes = TRUE,           # plot 1 - specificity on the x-axis to match the xlab
  xlab = "1 - Specificity",
  ylab = "Sensitivity"
)
lines(roc_rf_tuned, col = "red", lwd = 2)
legend("bottomright", legend = c("Baseline", "Tuned"), col = c("blue", "red"), lwd = 2)

# AdaBoost ROC Plot
plot(
  roc_ada,
  col = "blue",
  lwd = 2,
  main = "AdaBoost: Baseline vs. Tuned ROC",
  legacy.axes = TRUE,           # plot 1 - specificity on the x-axis to match the xlab
  xlab = "1 - Specificity",
  ylab = "Sensitivity"
)
lines(roc_ada_manual, col = "red", lwd = 2)
legend("bottomright", legend = c("Baseline", "Tuned"), col = c("blue", "red"), lwd = 2)

2.4.2 ROC plot for the tuned models

plot(
  roc_obj_tuned,
  col = "red",
  lwd = 2,
  main = "Tuned Models: ROC Comparison",
  legacy.axes = TRUE,           # plot 1 - specificity on the x-axis to match the xlab
  xlab = "1 - Specificity",
  ylab = "Sensitivity")
lines(roc_rf_tuned, col = "green", lwd = 2)
lines(roc_ada_manual, col = "purple", lwd = 2)
legend("bottomright",
       legend = c("Decision Tree", "Random Forest", "AdaBoost"),
       col = c("red", "green", "purple"),
       lwd = 2)

2.4.3 Performance table

# function: extract metrics
extract_metrics <- function(conf, roc_obj) {
  acc  <- as.numeric(conf$overall["Accuracy"])
  sens <- as.numeric(conf$byClass["Sensitivity"])
  spec <- as.numeric(conf$byClass["Specificity"])
  auc_val <- as.numeric(auc(roc_obj))
  
  return(c(Accuracy = round(acc, 4),
           Sensitivity = round(sens, 4),
           Specificity = round(spec, 4),
           AUC_ROC = round(auc_val, 4)))}

# Decision Tree models
dt_baseline_metrics <- extract_metrics(conf_mat, roc_obj)
dt_tuned_metrics    <- extract_metrics(conf_mat_tuned, roc_obj_tuned)

# Random Forest models
rf_baseline_metrics <- extract_metrics(conf_mat_rf, roc_rf)
rf_tuned_metrics    <- extract_metrics(conf_mat_rf_tuned, roc_rf_tuned)

# AdaBoost models
ada_baseline_metrics <- extract_metrics(conf_mat_ada, roc_ada)
ada_tuned_metrics    <- extract_metrics(conf_mat_ada_manual, roc_ada_manual)

# Combine 
performance_summary <- data.frame(
  Model = rep(c("Decision Tree", "Random Forest", "AdaBoost"), each = 2),
  Experiment = rep(c("Baseline", "Tuned"), 3),
  Accuracy = c(dt_baseline_metrics["Accuracy"], dt_tuned_metrics["Accuracy"],
               rf_baseline_metrics["Accuracy"], rf_tuned_metrics["Accuracy"],
               ada_baseline_metrics["Accuracy"], ada_tuned_metrics["Accuracy"]),
  AUC_ROC = c(dt_baseline_metrics["AUC_ROC"], dt_tuned_metrics["AUC_ROC"],
              rf_baseline_metrics["AUC_ROC"], rf_tuned_metrics["AUC_ROC"],
              ada_baseline_metrics["AUC_ROC"], ada_tuned_metrics["AUC_ROC"]),
  Sensitivity = c(dt_baseline_metrics["Sensitivity"], dt_tuned_metrics["Sensitivity"],
                  rf_baseline_metrics["Sensitivity"], rf_tuned_metrics["Sensitivity"],
                  ada_baseline_metrics["Sensitivity"], ada_tuned_metrics["Sensitivity"]),
  Specificity = c(dt_baseline_metrics["Specificity"], dt_tuned_metrics["Specificity"],
                  rf_baseline_metrics["Specificity"], rf_tuned_metrics["Specificity"],
                  ada_baseline_metrics["Specificity"], ada_tuned_metrics["Specificity"]))

# Print the performance summary table
#print(performance_summary)
## # A tibble: 6 × 6
##   Model         Experiment Accuracy AUC_ROC Sensitivity Specificity
##   <chr>         <chr>         <dbl>   <dbl>       <dbl>       <dbl>
## 1 Decision Tree Baseline      0.899   0.708       0.164       0.992
## 2 Decision Tree Tuned         0.900   0.758       0.254       0.982
## 3 Random Forest Baseline      0.898   0.790       0.262       0.979
## 4 Random Forest Tuned         0.900   0.782       0.225       0.986
## 5 AdaBoost      Baseline      0.898   0.807       0.225       0.983
## 6 AdaBoost      Tuned         0.898   0.807       0.225       0.983
  • The Decision Tree shows a notable improvement when tuned: its accuracy rose from 0.8987 to 0.8996, and its AUC from 0.7077 to 0.7579, indicating a better balance between true positives and true negatives.
  • The Random Forest baseline had a higher AUC than the tuned version (0.7899 vs. 0.7818), although tuning did raise accuracy slightly (from 0.8981 to 0.9001). This suggests that tuning the Random Forest made it more conservative, raising specificity (0.9789 to 0.9859) at the expense of sensitivity (0.2615 down to 0.2249).
  • AdaBoost stood out for having the highest baseline AUC (0.8069). However, unlike the other models, tuning AdaBoost did not alter its performance: both the baseline and tuned AdaBoost models yielded an accuracy of 0.8975, an AUC of 0.8069, a sensitivity of 0.2249, and a specificity of 0.9829.

2.4.4 Conclusion:

In this classification project, I evaluated three algorithms—Decision Tree, Random Forest, and AdaBoost—on a dataset to predict whether a client will subscribe to a term deposit. Each algorithm was tested twice: first with baseline (default) parameters, and then again after tuning. The metrics of interest included accuracy, AUC (Area Under the ROC Curve), sensitivity (recall for the positive class), and specificity (true negative rate). Since the bank is interested in identifying as many potential subscribers (“yes”) as possible without excessively misclassifying non-subscribers, sensitivity and AUC carry particular weight, although overall accuracy and specificity remain important for resource management.

Decision Tree Results:

The Decision Tree’s baseline model achieved an accuracy of 0.8987, an AUC of 0.7077, a sensitivity of 0.1638, and a specificity of 0.992. These show that while the baseline tree was quite accurate overall—mostly because of the large proportion of “no” cases—it struggled to correctly identify positive cases, as reflected by a low sensitivity. Tuning the Decision Tree improved its AUC to 0.7579, which indicates better discrimination between “yes” and “no.” Sensitivity also rose to 0.2543, making the tuned tree more effective at capturing actual subscribers. The slight decrease in specificity (from 0.992 down to 0.9815) was a small sacrifice, but it was accompanied by a jump in the tree’s ability to find the positive class.

Random Forest Results:

For the Random Forest, the baseline version had a higher AUC than the baseline Decision Tree, coming in at 0.7899, and a sensitivity of 0.2615. Its accuracy was 0.8981, slightly below the tuned Decision Tree’s accuracy but with a stronger AUC, suggesting a more balanced approach to class separation. Tuning the Random Forest increased its accuracy to 0.9001 and raised specificity to 0.9859. However, the AUC slipped slightly to 0.7818, and sensitivity dropped to 0.2249. In other words, the tuned Random Forest became more conservative: it improved at identifying “no” cases but caught fewer “yes” cases. If a bank prioritizes fewer false positives (non-subscribers wrongly flagged as subscribers), the tuned Random Forest might be good. However, if capturing a higher proportion of true positives is paramount, the baseline version may be better.

AdaBoost Results:

AdaBoost stood out for having the highest baseline AUC of 0.8069, meaning it was already strong at discriminating between “yes” and “no.” Its accuracy was 0.8975 and its sensitivity 0.2249, moderate relative to the other models. After tuning, AdaBoost’s performance remained essentially unchanged, with an accuracy of 0.8975, an AUC-ROC of 0.8069, a sensitivity of 0.2249, and a specificity of 0.9829. This was an important learning point for the project: for this dataset and feature set, the baseline AdaBoost configuration was already near-optimal within the explored parameter space. Despite tuning hyperparameters such as mfinal and maxdepth in the hope of capturing more true positives, the metrics stayed identical, which suggests that the structure of the data and the chosen features constrained how much tuning alone could improve. As a consequence, further tuning of AdaBoost, at least with the current strategy, is unlikely to be the most fruitful avenue for improving predictive performance.

Overall, the tuned Decision Tree shows a significant improvement in sensitivity and a decent AUC gain, making it valuable for scenarios where identifying more potential subscribers is important. The Random Forest baseline balances sensitivity and specificity well, whereas the tuned variant favors high accuracy and specificity at the cost of missed positives. AdaBoost shows strong discriminative power, with the highest baseline AUC; however, tuning did not alter its performance at all, with the baseline and tuned models producing identical accuracy, AUC, sensitivity, and specificity. This suggests that, within the parameter space explored, the baseline AdaBoost configuration may already be near-optimal, and further tuning did not yield additional gains in capturing true positives. It also highlights an important lesson: sometimes additional hyperparameter tuning has little impact on performance, which must be weighed against the complexity it introduces.

In practice, the choice among these models depends on the bank’s priorities. If the primary objective is to maximize the identification of subscribers, then the enhanced sensitivity of the tuned Decision Tree, despite a slight sacrifice in specificity, is very promising. On the other hand, if minimizing false positives is more critical, the tuned Random Forest—with its slightly higher specificity—may be more appropriate. Although AdaBoost demonstrated strong discriminative power, its unchanged performance after tuning suggests that further adjustments in boosting parameters or alternative boosting methods (such as XGBoost) might be required to make it more sensitive to the minority class.
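
As a concrete (and entirely hypothetical, not run) sketch of what trying a different boosting implementation might look like, caret's "xgbTree" method could be swapped in on the same training data, assuming the xgboost package is installed; tuneLength lets caret pick a small default grid rather than specifying one by hand.

# Sketch: gradient boosting via caret's "xgbTree" method (requires xgboost)
set.seed(123)
xgb_ctrl <- trainControl(method = "cv", number = 5, classProbs = TRUE,
                         summaryFunction = twoClassSummary)
xgb_model <- train(subscribed ~ ., data = train_data,
                   method = "xgbTree",
                   trControl = xgb_ctrl,
                   tuneLength = 3,
                   metric = "ROC")
xgb_probs <- predict(xgb_model, test_data, type = "prob")[, "yes"]
auc(roc(test_data$subscribed, xgb_probs))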

From a data science perspective, these experiments reflect the importance of not only tuning models but also carefully evaluating the trade-offs between metrics such as sensitivity and specificity. The process revealed that while ensemble methods like Random Forest and AdaBoost have inherent strengths, their performance can be counterintuitive when heavily tuned; improvements in one metric may come at the expense of another. Based on my experiments, further hyperparameter tuning and additional feature engineering are recommended to optimize this trade-off. In particular, exploring alternative approaches, such as adjusting class weights or using synthetic over-sampling methods, might lead to even better capture of positive cases.
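
To illustrate one of those alternatives, the sketch below (not run) uses caret's built-in resampling option to up-sample the minority class inside each cross-validation fold; "down" is the mirror-image choice, and "smote" is also available but pulls in an additional package.

# Sketch: up-sampling the "yes" class during cross-validation
ctrl_up <- trainControl(method = "cv", number = 5, classProbs = TRUE,
                        summaryFunction = twoClassSummary,
                        sampling = "up")

set.seed(123)
rf_upsampled <- train(subscribed ~ ., data = train_data,
                      method = "rf",
                      trControl = ctrl_up,
                      metric = "ROC",
                      tuneLength = 3)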

For addressing the bank’s marketing challenge, I would recommend deploying the tuned Decision Tree model, as it shows enhanced sensitivity in identifying potential subscribers. This model’s improved ability to detect the true positives—despite a slight drop in specificity—aligns well with the bank’s need to engage more high-propensity customers while still keeping overall accuracy high. In conclusion, the experiments demonstrate that the tuned Decision Tree offers promising interpretability and recall, making it a viable candidate for the final deployment in targeted marketing campaigns.

3 =HW3

3.1 The two articles in the assignment itself

3.1.1 Study #1: Ahmad et al., 2021 – “Decision Tree Ensembles to Predict COVID-19 Infection”

  • Comparing decision tree-based ensemble methods for COVID-19 detection using lab tests
  • The study uses a dataset of 600 patients with 18 lab attributes (including age, hemoglobin, leukocytes, etc.) and applies a wide array of decision tree ensembles (Random Forest, XGBoost, Bagging, etc.).
  • The data are highly imbalanced (80 positive vs. 520 negative), so models tailored to handle class imbalance (e.g., RUSBoost, SMOTEBoost) were appropriately included.
  • All classifiers are decision tree-based.

3.1.2 Study #2: Guhathakurata et al., 2021 – “A Novel Approach to Predict COVID-19 Using Support Vector Machine”

  • SVM-based classification of COVID-19 severity using clinical symptoms and comorbidities
  • This study uses a dataset of 200 patients, classifying them into “not infected,” “mildly infected,” and “severely infected” based on symptom combinations.
  • Attributes include heart disease, chest pain, ARDS, hypertension, and heartbeat rate.
  • SVM outperformed all others with an overall accuracy of 87%, and recall = 1.0 for the “severely infected” class.
  • The “Binary Tree” model (Decision Tree) was clearly outperformed by SVM in all evaluated metrics (CA, F1, Precision, Recall).

3.1.3 Discussion of the results and conclusions:

The Ahmad et al. paper does not test SVM and focuses entirely on DT ensembles and their extensions for imbalanced datasets. The Guhathakurata et al. study provides a direct comparison of SVM vs. Decision Tree models, showing SVM’s superior performance, especially in correctly identifying severely infected cases with cardiovascular symptoms.

3.2 Three articles comparing SVM and decision tree models in my area (cardiovascular health):

I found three such articles, and their PMIDs were: 40121395, 39375427, and 38248021.

The citations of these papers were:

  • Teja, M. D., & Rayalu, G. M. (2025). Optimizing heart disease diagnosis with advanced machine learning models: a comparison of predictive performance. BMC cardiovascular disorders, 25(1), 212. https://doi.org/10.1186/s12872-025-04627-6.
  • El-Sofany, H., Bouallegue, B., & El-Latif, Y. M. A. (2024). A proposed technique for predicting heart disease using machine learning algorithms and an explainable AI method. Scientific reports, 14(1), 23277. https://doi.org/10.1038/s41598-024-74656-2.
  • Ogunpola, A., Saeed, F., Basurra, S., Albarrak, A. M., & Qasem, S. N. (2024). Machine Learning-Based Predictive Models for Detection of Cardiovascular Diseases. Diagnostics (Basel, Switzerland), 14(2), 144. https://doi.org/10.3390/diagnostics14020144.

3.2.1 Brief summary of each article:

  • Teja & Rayalu (2025, BMC Cardiovascular Disorders) This study used five heart disease datasets (Cleveland, Hungary, Switzerland, Long Beach, Statlog) merged into one and evaluated 15 ML models. The highest-performing models were XGBoost and Bagged Trees, each reaching up to 93% accuracy. Decision Trees and SVMs were included but did not outperform ensemble methods.

  • El-Sofany et al. (2024, Scientific Reports) The authors compared 10 classifiers on public and private datasets using feature selection and SMOTE. XGBoost with SF-2 feature subset achieved the highest performance (accuracy 97.57%). SVM and Decision Tree models were included and analyzed comparatively, but ensemble methods (XGBoost, RF) consistently outperformed them.

  • Ogunpola et al. (2024, Diagnostics) This study examined 7 models (including SVM and DT) for detecting myocardial infarction using Kaggle and Mendeley datasets. XGBoost again outperformed other models (accuracy 98.50%, F1-score 98.71%). SVM and DT were tested, with SVM achieving 83% accuracy and DT slightly lower (79%) as referenced from previous literature.

3.2.2 Comparison of SVM vs. Decision Tree Performance in the 3 studies:

Study                     SVM Accuracy   Decision Tree Accuracy   Notes
Teja & Rayalu (2025)      87%            79%                      SVM outperformed DT
El-Sofany et al. (2024)   87%            91%                      DT slightly outperformed SVM
Ogunpola et al. (2024)    83%            79%                      SVM slightly outperformed DT

3.2.3 Discussion of SVM vs. Decision Tree Findings

Across the reviewed articles, SVMs generally outperformed or performed comparably to Decision Trees in terms of accuracy, precision, and F1 score. This trend aligns with established characteristics of the models:

  • SVMs are more effective in high-dimensional spaces and are robust to overfitting, especially with appropriate kernels and regularization. However, they can be computationally intensive and sensitive to parameter tuning.
  • Decision Trees are easy to interpret and fast but tend to overfit, especially on small or noisy datasets. While they can achieve moderate accuracy, their standalone performance is often inferior to more complex or ensemble-based methods.

All three articles emphasize that ensemble models (e.g., XGBoost, Bagged Trees, Random Forests) consistently outperform both SVM and Decision Tree models when applied to cardiovascular disease prediction tasks, likely due to their ability to reduce variance and capture complex interactions in the data.

My area of expertise and interest is cardiovascular health and disease, particularly building predictive models for early risk assessment in heart patients. These papers and the models they report are very interesting to me and help me better understand how different models can serve various functions and how they can be trained for specific tasks, including classification or regression. I have done some work myself on predicting the risk of MACE (major adverse cardiovascular events) from clinical data in electronic health records as well as cardiac imaging data. The more I search, the more I find on how helpful these models can be when applied carefully, after appropriate training and validation.

3.3 Analyzing the current dataset using an SVM algorithm

Training a baseline SVM algorithm

# remove excluded features (duration is only known after the call; default is dropped as well) and encode the target
df_model <- df %>% select(-duration, -default)
df_model$subscribed <- factor(df_model$subscribed, levels = c("no", "yes"))

# replace missing values with "unknown"
df_model$job[is.na(df_model$job)] <- "unknown"
df_model$marital[is.na(df_model$marital)] <- "unknown"
df_model$education[is.na(df_model$education)] <- "unknown"
df_model$housing[is.na(df_model$housing)] <- "unknown"
df_model$loan[is.na(df_model$loan)] <- "unknown"
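# (Hypothetical alternative, kept commented out: the same NA-to-"unknown"
#  replacement for every character column in one step with tidyr::replace_na.)
# df_model <- df_model %>%
#   mutate(across(where(is.character), ~ replace_na(.x, "unknown")))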

df_model <- df_model %>% mutate(across(where(is.character), as.factor))

# a 70/30 train-test split
set.seed(123)
train_index <- createDataPartition(df_model$subscribed, p = 0.7, list = FALSE)
train_data <- df_model[train_index, ]
test_data  <- df_model[-train_index, ]

# Align factor levels (for caret models)
for (col in names(train_data)) {
  if (is.factor(train_data[[col]]) && col %in% names(test_data)) {
    test_data[[col]] <- factor(test_data[[col]], levels = levels(train_data[[col]]))}}
svm_ctrl <- trainControl(
  method = "cv",
  number = 5,
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  savePredictions = "final")

set.seed(123)
svm_baseline <- train(
  subscribed ~ ., 
  data = train_data,
  method = "svmRadial",
  preProcess = c("center", "scale"),
  trControl = svm_ctrl,
  metric = "ROC")
## line search fails -2.022446 0.3096473 8.745147e-05 -7.475426e-05 -6.408739e-08 -1.755777e-08 -4.292019e-12
## line search fails -1.887306 0.4292956 7.266384e-05 6.875255e-05 -3.249032e-07 -3.135447e-07 -4.51657e-11
# Predictions
svm_pred_probs <- predict(svm_baseline, test_data, type = "prob")[, "yes"]
svm_pred_classes <- predict(svm_baseline, test_data)

# Evaluation
svm_conf_mat <- confusionMatrix(svm_pred_classes, test_data$subscribed, positive = "yes")
svm_roc <- roc(test_data$subscribed, svm_pred_probs)

# Output
print(svm_conf_mat)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    no   yes
##        no  10868  1158
##        yes    96   234
##                                           
##                Accuracy : 0.8985          
##                  95% CI : (0.8931, 0.9038)
##     No Information Rate : 0.8873          
##     P-Value [Acc > NIR] : 3.633e-05       
##                                           
##                   Kappa : 0.2389          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.16810         
##             Specificity : 0.99124         
##          Pos Pred Value : 0.70909         
##          Neg Pred Value : 0.90371         
##              Prevalence : 0.11266         
##          Detection Rate : 0.01894         
##    Detection Prevalence : 0.02671         
##       Balanced Accuracy : 0.57967         
##                                           
##        'Positive' Class : yes             
## 
cat("AUC-ROC (SVM Baseline):", auc(svm_roc), "\n")
## AUC-ROC (SVM Baseline): 0.707919
# # saving the model
# saveRDS(svm_baseline, file = "svm_baseline_model.rds")

# # checking folds
# svm_ctrl <- trainControl(
#   method = "cv",
#   number = 5,
#   classProbs = TRUE,
#   summaryFunction = twoClassSummary,
#   savePredictions = "all",  
#   allowParallel = FALSE)

# table(is.na(svm_baseline$pred$yes))

Tuning the SVM hyperparameters

svm_tune_grid <- expand.grid(
  C = c(0.1, 1, 10),          
  sigma = c(0.01, 0.05, 0.1))

# set.seed(123)
# svm_tuned <- train(
#   subscribed ~ ., 
#   data = train_data,
#   method = "svmRadial",
#   preProcess = c("center", "scale"),
#   tuneGrid = svm_tune_grid,
#   trControl = svm_ctrl,
#   metric = "ROC")

# To avoid re-running the SVM tuning when knitting to RPubs (it took a long time on the first run), load the previously trained and saved model.
svm_tuned <- readRDS("svm_tuned_model.rds")

# Predict and evaluate
svm_tuned_probs <- predict(svm_tuned, test_data, type = "prob")[, "yes"]
svm_tuned_classes <- predict(svm_tuned, test_data)

svm_conf_mat_tuned <- confusionMatrix(svm_tuned_classes, test_data$subscribed, positive = "yes")
svm_roc_tuned <- roc(test_data$subscribed, svm_tuned_probs)

print(svm_conf_mat_tuned)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    no   yes
##        no  10801  1161
##        yes   163   231
##                                           
##                Accuracy : 0.8928          
##                  95% CI : (0.8873, 0.8982)
##     No Information Rate : 0.8873          
##     P-Value [Acc > NIR] : 0.02676         
##                                           
##                   Kappa : 0.2199          
##                                           
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##             Sensitivity : 0.16595         
##             Specificity : 0.98513         
##          Pos Pred Value : 0.58629         
##          Neg Pred Value : 0.90294         
##              Prevalence : 0.11266         
##          Detection Rate : 0.01870         
##    Detection Prevalence : 0.03189         
##       Balanced Accuracy : 0.57554         
##                                           
##        'Positive' Class : yes             
## 
cat("AUC-ROC (SVM Tuned):", auc(svm_roc_tuned), "\n")
## AUC-ROC (SVM Tuned): 0.7277047
# # saving the model
# saveRDS(svm_tuned, file = "svm_tuned_model.rds")

The comparison between the baseline SVM and the tuned SVM models is:

Metric        Baseline   Tuned
Accuracy      0.8985     0.8928
AUC-ROC       0.7079     0.7277
Sensitivity   0.1681     0.1659
Specificity   0.9912     0.9851
Kappa         0.2389     0.2199
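
This table can also be rebuilt programmatically from the objects computed above (svm_conf_mat, svm_conf_mat_tuned, svm_roc, svm_roc_tuned); the sketch below pulls Kappa straight from confusionMatrix()'s overall statistics.

# Sketch: assemble the baseline-vs-tuned SVM comparison table
svm_compare <- data.frame(
  Metric   = c("Accuracy", "AUC-ROC", "Sensitivity", "Specificity", "Kappa"),
  Baseline = c(svm_conf_mat$overall["Accuracy"], auc(svm_roc),
               svm_conf_mat$byClass["Sensitivity"],
               svm_conf_mat$byClass["Specificity"],
               svm_conf_mat$overall["Kappa"]),
  Tuned    = c(svm_conf_mat_tuned$overall["Accuracy"], auc(svm_roc_tuned),
               svm_conf_mat_tuned$byClass["Sensitivity"],
               svm_conf_mat_tuned$byClass["Specificity"],
               svm_conf_mat_tuned$overall["Kappa"]))
svm_compare$Baseline <- round(svm_compare$Baseline, 4)
svm_compare$Tuned    <- round(svm_compare$Tuned, 4)
kable(svm_compare, caption = "SVM: Baseline vs. Tuned")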

Interpretation of the results of the SVM models:

  • Based on the SVM results for predicting subscribers, the tuned model shows improved discrimination, with the AUC-ROC rising from 0.7079 to 0.7277, i.e., better separation between subscribers and non-subscribers.
  • Overall accuracy decreased marginally from 0.8985 to 0.8928, sensitivity was almost identical (0.1681 vs. 0.1659), and specificity was only slightly lower (0.9912 vs. 0.9851).
  • The tuned model remains strong at correctly identifying non-subscribers but continues to struggle with identifying actual subscribers, as seen in the low sensitivity of both models.
  • The slight decrease in Kappa (0.2389 to 0.2199) points to a small reduction in agreement beyond random chance, though both models demonstrate similar overall classification ability.
  • This indicates the tuned model may be more useful in contexts where better discrimination is valued over marginal differences in raw accuracy, though the difference is small (the two ROC curves are overlaid in the sketch below).
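
For completeness, the two SVM ROC curves can be overlaid in the same style as the plots for the other models; a short sketch using the svm_roc and svm_roc_tuned objects computed above:

# Sketch: overlay the baseline and tuned SVM ROC curves, mirroring the earlier plots
plot(svm_roc, col = "blue", lwd = 2,
     main = "SVM: Baseline vs. Tuned ROC",
     legacy.axes = TRUE, xlab = "1 - Specificity", ylab = "Sensitivity")
lines(svm_roc_tuned, col = "red", lwd = 2)
legend("bottomright", legend = c("Baseline", "Tuned"), col = c("blue", "red"), lwd = 2)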

3.4 Multiple kernel learning (MKL)

The work above covers two SVM models, one baseline and one tuned, but both use a single kernel. Here I want to try using more than one kernel when training an SVM, i.e., multiple kernel learning (MKL):

I tried many times to tweak the code below so that it would not only run but also finish in a reasonable time frame; it just kept running with no end in sight on my local machine (a Mac with an M3 chip, so not an underpowered machine). I eventually decided to cut my losses rather than delay submitting HW3 any further. I am including the code anyway to show my thinking: both the initial version and a simplified version that I hoped would run faster appear below, but neither finished in a reasonable time.

# convert training and test predictors into numeric matrices with model.matrix()
# so that factor variables (job, marital, etc.) become dummy variables and the
# custom kernel only ever sees numeric input
X_train <- model.matrix(subscribed ~ . - 1, data = train_data)
X_test  <- model.matrix(subscribed ~ . - 1, data = test_data)

# a composite kernel function: a weighted sum of a linear kernel and an RBF kernel.
composite_kernel <- function(x, y = NULL) {
  if (!is.matrix(x)) { x <- matrix(x, nrow = 1) }
  if (is.null(y)) {
    y <- x
  } else if (!is.matrix(y)) {
    y <- matrix(y, nrow = 1)
  }
  # Linear kernel: inner product between x and y
  K_linear <- x %*% t(y)
  # RBF kernel
  sigma <- 0.1
  rbfdot_kernel <- rbfdot(sigma = sigma)
  K_rbf <- kernelMatrix(rbfdot_kernel, x, y)
  # Composite kernel as a 50-50 weighted sum of linear and RBF
  K_composite <- 0.5 * K_linear + 0.5 * K_rbf
  return(K_composite)}

class(composite_kernel) <- "kernel"

# testing the composite kernel on a small subset
# test_K <- composite_kernel(X_train[1:10, ], X_train[1:10, ])
# print(dim(test_K))  # Should be 10 x 10

# training SVM model with ksvm() with composite kernel.
set.seed(123)
svm_composite <- ksvm(
  X_train,
  train_data$subscribed,
  kernel = composite_kernel,
  kpar = list(),    
  C = 1,
  prob.model = TRUE)

# predicting on test data using the composite kernel
svm_comp_pred <- predict(svm_composite, X_test, type = "probabilities")

if (!is.null(colnames(svm_comp_pred))) {
  pred_comp_probs <- svm_comp_pred[, "yes"]
} else {
  pred_comp_probs <- svm_comp_pred[, 2]}

# predicted classes.
pred_comp_classes <- predict(svm_composite, X_test)

# evaluate the composite SVM model using a confusion matrix and roc
svm_comp_conf <- confusionMatrix(as.factor(pred_comp_classes), test_data$subscribed, positive = "yes")
svm_comp_roc <- roc(test_data$subscribed, pred_comp_probs)

# evaluation results.
print(svm_comp_conf)
cat("AUC-ROC (SVM Composite):", auc(svm_comp_roc), "\n")
# not run for now: it takes far too long on this machine; keeping it to revisit on a more capable setup
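
One possible way to make the composite-kernel idea tractable, sketched below under the assumption that a modest training subsample is acceptable, is to precompute the composite kernel matrix once with kernelMatrix() and hand it to ksvm() as a kernel matrix, so kernlab never has to call the slow R-level kernel function pair by pair. The object names (X_sub, svm_mkl, etc.) are illustrative, and this sketch has not been run either.

# Sketch (not run): precompute the composite kernel on a subsample
set.seed(123)
sub_idx <- sample(nrow(X_train), 3000)    # subsample for tractability
X_sub   <- X_train[sub_idx, ]
y_sub   <- train_data$subscribed[sub_idx]

lin_k <- vanilladot()                     # linear kernel
rbf_k <- rbfdot(sigma = 0.1)              # RBF kernel, same sigma as above

# 50-50 composite kernel matrix on the training subsample
K_train <- 0.5 * kernelMatrix(lin_k, X_sub) + 0.5 * kernelMatrix(rbf_k, X_sub)
svm_mkl <- ksvm(as.kernelMatrix(K_train), y_sub, type = "C-svc", C = 1)

# For prediction, the cross-kernel between the test rows and the support vectors is needed
sv_rows <- X_sub[SVindex(svm_mkl), ]
K_test  <- 0.5 * kernelMatrix(lin_k, X_test, sv_rows) + 0.5 * kernelMatrix(rbf_k, X_test, sv_rows)
mkl_pred <- predict(svm_mkl, as.kernelMatrix(K_test))
confusionMatrix(mkl_pred, test_data$subscribed, positive = "yes")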

# trainControl w cross-validation, class probabilities, and roc
svm_ctrl <- trainControl(
  method = "cv",
  number = 5,
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  savePredictions = "final")

# SVM 1 linear kernel
svm_linear_grid <- expand.grid(C = c(0.1, 1, 10))
set.seed(123)
svm_linear <- train(
  subscribed ~ ., 
  data = train_data,
  method = "svmLinear",
  preProcess = c("center", "scale"),
  trControl = svm_ctrl,
  tuneGrid = svm_linear_grid,
  metric = "ROC")

# predict & evaluation for SVM linear
svm_linear_pred_probs <- predict(svm_linear, test_data, type = "prob")[, "yes"]
svm_linear_pred_classes <- predict(svm_linear, test_data)
svm_linear_conf <- confusionMatrix(svm_linear_pred_classes, test_data$subscribed, positive = "yes")
svm_linear_roc <- roc(test_data$subscribed, svm_linear_pred_probs)


# SVM 2 radial kernel
svm_radial_grid <- expand.grid(
  C = c(0.1, 1, 10),
  sigma = c(0.01, 0.05, 0.1))
set.seed(123)
svm_radial <- train(
  subscribed ~ ., 
  data = train_data,
  method = "svmRadial",
  preProcess = c("center", "scale"),
  trControl = svm_ctrl,
  tuneGrid = svm_radial_grid,
  metric = "ROC")

# predicting & evaluation for SVM radial
svm_radial_pred_probs <- predict(svm_radial, test_data, type = "prob")[, "yes"]
svm_radial_pred_classes <- predict(svm_radial, test_data)
svm_radial_conf <- confusionMatrix(svm_radial_pred_classes, test_data$subscribed, positive = "yes")
svm_radial_roc <- roc(test_data$subscribed, svm_radial_pred_probs)


# SVM 3 polynomial kernel
svm_poly_grid <- expand.grid(
  C = c(0.1, 1, 10),
  degree = c(2, 3),
  scale = c(0.01, 0.1))
set.seed(123)
svm_poly <- train(
  subscribed ~ ., 
  data = train_data,
  method = "svmPoly",
  preProcess = c("center", "scale"),
  trControl = svm_ctrl,
  tuneGrid = svm_poly_grid,
  metric = "ROC")

# predict & evaluation for SVM poly
svm_poly_pred_probs <- predict(svm_poly, test_data, type = "prob")[, "yes"]
svm_poly_pred_classes <- predict(svm_poly, test_data)
svm_poly_conf <- confusionMatrix(svm_poly_pred_classes, test_data$subscribed, positive = "yes")
svm_poly_roc <- roc(test_data$subscribed, svm_poly_pred_probs)


# table for these SVMs
extract_metrics <- function(conf, roc_obj) {
  acc  <- as.numeric(conf$overall["Accuracy"])
  sens <- as.numeric(conf$byClass["Sensitivity"])
  spec <- as.numeric(conf$byClass["Specificity"])
  auc_val <- as.numeric(auc(roc_obj))
  return(c(Accuracy = round(acc, 4),
           Sensitivity = round(sens, 4),
           Specificity = round(spec, 4),
           AUC_ROC = round(auc_val, 4)))}

svm_linear_metrics <- extract_metrics(svm_linear_conf, svm_linear_roc)
svm_radial_metrics <- extract_metrics(svm_radial_conf, svm_radial_roc)
svm_poly_metrics   <- extract_metrics(svm_poly_conf, svm_poly_roc)

svm_perf <- data.frame(
  Model = c("SVM Linear", "SVM Radial", "SVM Poly"),
  Accuracy = c(svm_linear_metrics["Accuracy"], 
               svm_radial_metrics["Accuracy"],
               svm_poly_metrics["Accuracy"]),
  AUC_ROC = c(svm_linear_metrics["AUC_ROC"],
              svm_radial_metrics["AUC_ROC"],
              svm_poly_metrics["AUC_ROC"]),
  Sensitivity = c(svm_linear_metrics["Sensitivity"],
                  svm_radial_metrics["Sensitivity"],
                  svm_poly_metrics["Sensitivity"]),
  Specificity = c(svm_linear_metrics["Specificity"],
                  svm_radial_metrics["Specificity"],
                  svm_poly_metrics["Specificity"]))

kable(svm_perf, caption = "SVM Model Performance Comparison")


# ROCs plot
plot(svm_linear_roc, col = "blue", lwd = 2, 
     main = "Combined ROC: Tuned SVM Models",
     legacy.axes = TRUE, xlab = "1 - Specificity", ylab = "Sensitivity",  # TRUE so the axis runs 0 to 1, consistent with xlim below
     xlim = c(0, 1), ylim = c(0, 1))
lines(svm_radial_roc, col = "red", lwd = 2)
lines(svm_poly_roc, col = "green", lwd = 2)
legend("bottomright", legend = c("SVM Linear", "SVM Radial", "SVM Poly"), 
       col = c("blue", "red", "green"), lwd = 2)

3.5 Comparison between SVM and the prior models built

# SVM baseline metrics
svm_baseline_metrics <- extract_metrics(svm_conf_mat, svm_roc)
# SVM tuned metrics
svm_tuned_metrics <- extract_metrics(svm_conf_mat_tuned, svm_roc_tuned)

# Existing performance summary
performance_summary <- data.frame(
  Model = rep(c("Decision Tree", "Random Forest", "AdaBoost"), each = 2),
  Experiment = rep(c("Baseline", "Tuned"), 3),
  Accuracy = c(dt_baseline_metrics["Accuracy"], dt_tuned_metrics["Accuracy"],
               rf_baseline_metrics["Accuracy"], rf_tuned_metrics["Accuracy"],
               ada_baseline_metrics["Accuracy"], ada_tuned_metrics["Accuracy"]),
  AUC_ROC = c(dt_baseline_metrics["AUC_ROC"], dt_tuned_metrics["AUC_ROC"],
              rf_baseline_metrics["AUC_ROC"], rf_tuned_metrics["AUC_ROC"],
              ada_baseline_metrics["AUC_ROC"], ada_tuned_metrics["AUC_ROC"]),
  Sensitivity = c(dt_baseline_metrics["Sensitivity"], dt_tuned_metrics["Sensitivity"],
                  rf_baseline_metrics["Sensitivity"], rf_tuned_metrics["Sensitivity"],
                  ada_baseline_metrics["Sensitivity"], ada_tuned_metrics["Sensitivity"]),
  Specificity = c(dt_baseline_metrics["Specificity"], dt_tuned_metrics["Specificity"],
                  rf_baseline_metrics["Specificity"], rf_tuned_metrics["Specificity"],
                  ada_baseline_metrics["Specificity"], ada_tuned_metrics["Specificity"]))

# a new data frame for SVM models
svm_summary <- data.frame(
  Model = rep("SVM", 2),
  Experiment = c("Baseline", "Tuned"),
  Accuracy = c(svm_baseline_metrics["Accuracy"], svm_tuned_metrics["Accuracy"]),
  AUC_ROC = c(svm_baseline_metrics["AUC_ROC"], svm_tuned_metrics["AUC_ROC"]),
  Sensitivity = c(svm_baseline_metrics["Sensitivity"], svm_tuned_metrics["Sensitivity"]),
  Specificity = c(svm_baseline_metrics["Specificity"], svm_tuned_metrics["Specificity"]))

# Combine the old performance summary with the SVM summary
performance_summary <- rbind(performance_summary, svm_summary)
#print(performance_summary)
Model Performance Comparison

Model           Experiment   Accuracy   AUC_ROC   Sensitivity   Specificity
Decision Tree   Baseline     0.8987     0.7077    0.1638        0.9920
Decision Tree   Tuned        0.8996     0.7579    0.2543        0.9815
Random Forest   Baseline     0.8989     0.7882    0.2665        0.9792
Random Forest   Tuned        0.9000     0.7824    0.2227        0.9860
AdaBoost        Baseline     0.8975     0.8069    0.2249        0.9829
AdaBoost        Tuned        0.8975     0.8069    0.2249        0.9829
SVM             Baseline     0.8985     0.7079    0.1681        0.9912
SVM             Tuned        0.8928     0.7277    0.1659        0.9851

A combined ROC plot

plot(
  roc_obj_tuned,
  col = "red",
  lwd = 2,
  main = "Combined ROC for Tuned Models",
  legacy.axes = TRUE,        # TRUE so 1 - specificity runs 0 to 1, consistent with xlim/xaxs below
  xlab = "1 - Specificity",
  ylab = "Sensitivity",
  xlim = c(0, 1),           
  ylim = c(0, 1),            
  xaxs = "i",                
  yaxs = "i")

lines(roc_rf_tuned, col = "green", lwd = 2)
lines(roc_ada_manual, col = "purple", lwd = 2)
lines(svm_roc_tuned, col = "orange", lwd = 2)

legend(
  "bottomright",
  legend = c("Decision Tree", "Random Forest", "AdaBoost", "SVM"),
  col = c("red", "green", "purple", "orange"),
  lwd = 2)

3.6 Answering the questions

In this classification project, several algorithms were evaluated to predict whether a client will subscribe to a term deposit. The primary models tested were Decision Tree, Random Forest, AdaBoost, and Support Vector Machines (SVM). Each algorithm was initially run using baseline parameters and then again with tuned hyperparameters to see if performance could be improved. Key metrics included accuracy, AUC (Area Under the ROC Curve), sensitivity (the recall for positive cases), and specificity (the true negative rate). These metrics matter because the bank aims to identify as many potential subscribers (“yes”) as possible while minimizing misclassification of non-subscribers.

Looking at the final table of results, Random Forest (tuned) shows the highest accuracy, reaching 0.90. This means that if the primary goal is to optimize overall correct classifications, then Random Forest, particularly when tuned, would be the best choice in a strict “most accurate” sense. However, if the objective is capturing more true positives, the baseline Random Forest and the tuned Decision Tree offer relatively stronger sensitivity (0.27 and 0.25, respectively). Meanwhile, AdaBoost stands out for its highest AUC (0.81), which indicates strong overall discrimination. Importantly, though, AdaBoost’s tuned performance remains identical to its baseline, so further hyperparameter adjustments did not yield additional gains in sensitivity or accuracy. Finally, the SVM experiments demonstrate that while tuning the SVM did increase AUC from 0.71 to 0.73, it came with a very slight decline in sensitivity and a slight drop in accuracy, which shows the trade-offs involved in adjusting parameters for a complex, imbalanced task.

These models (Decision Tree, Random Forest, AdaBoost, and SVM) are all typically used for classification rather than regression, although SVM can be extended to regression with different formulations. In the context of this project, classification is the correct framing for distinguishing who will subscribe. The results show that each algorithm has trade-offs: the tuned Random Forest has the highest accuracy, AdaBoost (baseline or tuned) provides the best AUC, and tuning the SVM improved AUC but slightly reduced sensitivity and accuracy, which makes it less attractive for a marketing campaign focused on finding the “yes” cases.

Do these outcomes align with the recommendations made earlier? Yes, the results confirm that if maximizing correct classifications overall is paramount, the Random Forest (tuned) is recommended. However, if the organization wants to prioritize recall (sensitivity), then the baseline Random Forest or the tuned Decision Tree could be more appropriate. For classification vs. regression, all approaches here are designed for classification tasks, and the performance metrics show that these models are well-suited to binary classification rather than regression. Ultimately, the choice of model depends on whether the bank values sheer accuracy above all else or deems recall for potential subscribers to be more critical. Given the final numbers, the recommendations towards Random Forest for overall accuracy, and Decision Tree or baseline Random Forest for capturing more “yes” cases, still appear justified to me.

In this classification project, several algorithms were evaluated to predict whether a client will subscribe to a term deposit. The primary models included Decision Tree, Random Forest, AdaBoost, and now Support Vector Machines (SVM). Each algorithm was initially run using baseline parameters and then tuned to improve performance. The key evaluation metrics were accuracy, AUC-ROC, sensitivity, and specificity. These metrics are important since the bank’s core objective is to identify as many potential subscribers as possible while still avoiding misclassifying non-subscribers. Among these, sensitivity and AUC-ROC are particularly important because they reflect the model’s ability to discriminate between the two classes.

The SVM baseline model achieved an accuracy of 89.85%, an AUC-ROC of 0.7079, a sensitivity of 16.81%, a specificity of 99.12%, and a Kappa of 0.2389. After tuning the SVM, the performance changed only slightly: accuracy was 89.28%, the AUC-ROC increased to 0.7277, sensitivity dipped marginally to 16.59%, specificity eased to 98.51%, and Kappa decreased to 0.2199. When compared to the Decision Tree, whose tuned version saw accuracy rise to 89.96%, AUC-ROC increase to 0.7579, and sensitivity improve markedly to 25.43%, the tuned Decision Tree clearly outperforms the SVM at capturing true positives. Similarly, the Random Forest and AdaBoost models provided AUCs in the vicinity of 0.78 to 0.81, with sensitivities around 22–27%, making them more effective for this specific imbalanced classification task.

Support Vector Machines are well known for robust performance in high-dimensional spaces and are effective for classification tasks. In theory, SVMs are versatile and can be used for both classification and regression; in practice, their success depends strongly on kernel selection and hyperparameter tuning. From what I have read, the radial basis function (RBF) kernel used in the SVM experiments here is typically well suited to capturing nonlinear patterns. Yet in this project, despite the improvement in AUC from tuning, sensitivity dropped slightly, so the gain in ranking ability did not translate into better detection of the positive class. Even though overall accuracy remains high and specificity remains excellent, the SVM is less effective at identifying potential subscribers than the other models, which is a critical drawback in this marketing scenario.

Given the results, the algorithm recommended for achieving more accurate and practically useful results in this classification setting is not SVM. While SVMs often perform very well in many classification problems, for this specific task they did not provide the improved recall needed to capture a greater proportion of subscribers. The tuned Decision Tree model, on the other hand, showed an improvement in sensitivity and AUC, which in my mind makes it more valuable where identifying true positives is paramount. Random Forest and AdaBoost have shown strengths in overall discrimination, but they also have trade-offs in terms of sensitivity. SVM is generally better suited for classification than regression when the appropriate kernel and parameters are chosen; in this project, however, its performance was slightly inferior when measured against the specific metric of sensitivity.

I still think the recommendation favoring the tuned Decision Tree model is reasonable, because its improved sensitivity ensures that more high-propensity customers are targeted in a marketing campaign while maintaining high overall accuracy. This makes it a practical choice for the bank’s challenge compared to SVM, which, although solid in overall accuracy and specificity, does not capture as many true positives.