DATA 622 : Experimentation & Model Training

Author: Rupendra Shrestha | Oct 2025

Instructions

Experimentation & Model Training

This assignment consists of conducting at least two (2) experiments for different algorithms: Decision Trees, Random Forest and Adaboost. That is, at least six (6) experiments in total (3 algorithms x 2 experiments each). For each experiment you will define what you are trying to achieve (before each run), conduct the experiment, and at the end you will review how your experiment went. These experiments will allow you to compare algorithms and choose the optimal model.

Using the dataset and EDA from the previous assignment, perform the following:

Algorithm Selection

You will perform experiments using the following algorithms: Decision Trees, Random Forest, and AdaBoost.

Experiment

For each of the algorithms (above), perform at least two (2) experiments. In a typical experiment you should:

  • Define the objective of the experiment (hypothesis)
  • Decide what will change, and what will stay the same
  • Select the evaluation metric (what you want to measure)
  • Perform the experiment
  • Document the experiment so you can compare results (track progress)

Variations

There are many things you can vary between experiments; here are some examples:

  • Data sampling (feature selection)
  • Data augmentation, e.g., regularization, normalization, scaling
  • Hyperparameter optimization (you decide: random search, grid search, etc.)
  • Decision Tree breadth & depth (this is an example of a hyperparameter)
  • Evaluation metrics, e.g., Accuracy, precision, recall, F1-score, AUC-ROC
  • Cross-validation strategy, e.g., holdout, k-fold, leave-one-out
  • Number of trees (for ensemble models)
  • Train-test split: using different data splits to assess model generalization ability
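As a small illustration of how two of these variations translate into code, the sketch below sets up alternative cross-validation strategies and a pruning grid with caret; the object names and grid values are placeholders, not part of the experiments reported later.

library(caret)

# Variation: cross-validation strategy
cv_10fold  <- trainControl(method = "cv", number = 10)            # k-fold
cv_holdout <- trainControl(method = "LGOCV", p = 0.8, number = 1) # single holdout split
cv_loocv   <- trainControl(method = "LOOCV")                      # leave-one-out

# Variation: hyperparameter grid (tree pruning via the complexity parameter cp)
cp_grid <- expand.grid(cp = c(0.001, 0.01, 0.1))

# The same train() call is then reused, swapping only the piece under study, e.g.
# train(y ~ ., data = train_data, method = "rpart", trControl = cv_10fold, tuneGrid = cp_grid)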

Introduction

Experimentation and model training are at the core of machine learning practice. The goal is to find the combination of features, algorithms, and parameters that yields the best performance on a given task. Experimentation involves systematically adjusting inputs, tuning hyperparameters, and testing different model configurations to understand how each decision affects results. This process helps identify not just which model performs best, but also why it performs that way.

Model training is where these experiments are put into action. Using historical data, algorithms learn patterns and relationships that can be applied to new, unseen data. Each experiment provides insight into how modeling choices, such as data preprocessing, algorithm selection, and parameter tuning, impact predictive accuracy and generalization.

In practice, data scientists may run dozens or even hundreds of experiments to refine their models. The key is to change one or a few factors at a time, evaluate results with consistent metrics, and interpret the outcomes to guide the next iteration. Through this structured process of testing, evaluation, and refinement, experimentation becomes a powerful tool for improving model performance and building more reliable predictive systems.

Abstract

In this project, I applied machine learning techniques to analyze a real-world dataset from a Portuguese bank’s marketing campaign. The goal was to build classification models that predict whether a client will subscribe to a term deposit based on their personal and socio-economic attributes, as well as macroeconomic indicators. To do this, I compared three different supervised learning algorithms — Decision Tree, Random Forest, and AdaBoost — using both default and tuned hyperparameters. The models were evaluated using five performance metrics: Accuracy, Precision, Recall, F1 Score, and AUC. Since the dataset is imbalanced, special attention was given to Recall, which reflects the model’s ability to identify actual subscribers. The analysis concludes by recommending the most appropriate model aligned with the business objective of maximizing customer acquisition.

Data Set

A Portuguese bank conducted a marketing campaign (phone calls) aimed at getting clients to subscribe to a term deposit. The records of their efforts are available in the form of a dataset. The objective here is to apply machine learning techniques to analyze the dataset and identify the most effective tactics that will help the bank persuade more customers to subscribe in its next campaign. Download the Bank Marketing Dataset from: https://archive.ics.uci.edu/dataset/222/bank+marketing
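A minimal download-and-read sketch is shown below for convenience; the direct zip URL and the archive layout (an outer zip containing bank.zip, which holds the semicolon-separated bank.csv) are assumptions about the UCI site and may change, so the dataset page linked above remains the authoritative source.

# Hedged sketch: fetch and extract the dataset (URL and zip contents assumed)
# download.file("https://archive.ics.uci.edu/static/public/222/bank+marketing.zip",
#               destfile = "bank_marketing.zip", mode = "wb")
# unzip("bank_marketing.zip")   # assumed to contain bank.zip
# unzip("bank.zip")             # assumed to contain bank.csv
# bank_data <- read.csv("bank.csv", sep = ";")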

# Libraries used throughout the EDA below
library(knitr)
library(tidyverse)   # dplyr, tidyr, ggplot2, forcats, purrr
library(psych)       # describe()
library(viridis)     # scale_fill_viridis() / scale_color_viridis()
library(ggcorrplot)  # ggcorrplot()
library(rcompanion)  # cramerV()

# Read the semicolon-separated CSV file
bank_data <- read.csv("bank.csv", sep = ";")

# Preview the first few rows of the dataset
kable(head(bank_data, 10), caption = "Preview of the Bank Dataset")
Preview of the Bank Dataset
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
30 unemployed married primary no 1787 no no cellular 19 oct 79 1 -1 0 unknown no
33 services married secondary no 4789 yes yes cellular 11 may 220 1 339 4 failure no
35 management single tertiary no 1350 yes no cellular 16 apr 185 1 330 1 failure no
30 management married tertiary no 1476 yes yes unknown 3 jun 199 4 -1 0 unknown no
59 blue-collar married secondary no 0 yes no unknown 5 may 226 1 -1 0 unknown no
35 management single tertiary no 747 no no cellular 23 feb 141 2 176 3 failure no
36 self-employed married tertiary no 307 yes no cellular 14 may 341 1 330 2 other no
39 technician married secondary no 147 yes no cellular 6 may 151 2 -1 0 unknown no
41 entrepreneur married tertiary no 221 yes no unknown 14 may 57 2 -1 0 unknown no
43 services married primary no -88 yes yes cellular 17 apr 313 1 147 2 failure no

This is an open-source dataset from a Portuguese bank's marketing campaign; it records phone calls to customers and whether they subscribed to a term deposit. My objective is to apply machine learning techniques to analyze this data and identify the most effective strategies that can help the bank increase the subscription rate in future campaigns. The data can be downloaded at https://archive.ics.uci.edu/dataset/222/bank+marketing

#Count Duplicates
sum(duplicated(bank_data))
[1] 0
#Remove duplicates
bank_data <- bank_data[!duplicated(bank_data), ]
sum(duplicated(bank_data)) 
[1] 0
#Check for N/A
colSums(is.na(bank_data)) 
      age       job   marital education   default   balance   housing      loan 
        0         0         0         0         0         0         0         0 
  contact       day     month  duration  campaign     pdays  previous  poutcome 
        0         0         0         0         0         0         0         0 
        y 
        0 
# Handle missing values
# Remove rows with any NA
bank_data <- bank_data %>% drop_na()

str(bank_data)
'data.frame':   4521 obs. of  17 variables:
 $ age      : int  30 33 35 30 59 35 36 39 41 43 ...
 $ job      : chr  "unemployed" "services" "management" "management" ...
 $ marital  : chr  "married" "married" "single" "married" ...
 $ education: chr  "primary" "secondary" "tertiary" "tertiary" ...
 $ default  : chr  "no" "no" "no" "no" ...
 $ balance  : int  1787 4789 1350 1476 0 747 307 147 221 -88 ...
 $ housing  : chr  "no" "yes" "yes" "yes" ...
 $ loan     : chr  "no" "yes" "no" "yes" ...
 $ contact  : chr  "cellular" "cellular" "cellular" "unknown" ...
 $ day      : int  19 11 16 3 5 23 14 6 14 17 ...
 $ month    : chr  "oct" "may" "apr" "jun" ...
 $ duration : int  79 220 185 199 226 141 341 151 57 313 ...
 $ campaign : int  1 1 1 4 1 2 1 2 2 1 ...
 $ pdays    : int  -1 339 330 -1 -1 176 330 -1 -1 147 ...
 $ previous : int  0 4 1 0 0 3 2 0 0 2 ...
 $ poutcome : chr  "unknown" "failure" "failure" "unknown" ...
 $ y        : chr  "no" "no" "no" "no" ...

Exploratory Data Analysis

Review the structure and content of the data and answer questions such as:

1.1) Are the features (columns) of your data correlated?

cat("There are", ncol(bank_data), "columns and", nrow(bank_data), "rows.",
      "Y is the target variable which denotes whether a client subscribes to a term deposit.",
      "There are", ncol(bank_data) - 1, "features as listed below (not including the target variable):")
There are 17 columns and 4521 rows. Y is the target variable which denotes whether a client subscribes to a term deposit. There are 16 features as listed below (not including the target variable):
colnames(bank_data)
 [1] "age"       "job"       "marital"   "education" "default"   "balance"  
 [7] "housing"   "loan"      "contact"   "day"       "month"     "duration" 
[13] "campaign"  "pdays"     "previous"  "poutcome"  "y"        

1.2) What is the overall distribution of each variable?

Exploratory Data Analysis (EDA) is the first step in understanding a dataset. It helps uncover the main characteristics, detect patterns, and identify potential anomalies before applying advanced modeling. A key part of EDA is examining the overall distribution of each variable.

a] NUMERICAL FEATURE DISTRIBUTION: Distributions show whether the data is normally distributed, skewed, or contains outliers. Tools like histograms, density plots, and boxplots provide clear visual summaries.

b] CATEGORICAL FEATURE DISTRIBUTION: Frequency counts and bar plots reveal how data is grouped across different categories, helping to identify imbalances or dominant classes.

Numerical Feature Distribution

Pdays – Highly right-skewed, with median = -1 and mean ≈ 40. Since 75% of values are -1, most clients were not previously contacted. The large gap between mean and quartiles indicates outliers.

Previous – Right-skewed with median = 0 and mean ≈ 0.6. Most clients were not contacted before, and the higher mean suggests some outliers.

Campaign – Right-skewed. Median = 2, mean ≈ 2.8, with 50% of data between 1 and 3. Outliers present due to a higher mean than median.

Day – Approximately normal, with mean and median ≈ 16. Middle 50% of values fall between 8 and 21.

Age – Slight right skew. Half of clients are between 33 and 48 years old.

Duration – Right-skewed. Most calls were short, though some lasted much longer, indicating outliers.

Balance – Strong right skew. Median = 444, mean ≈ 1423. Most clients have low balances, but a few very high values pull the mean upward.

# Separating numerical features
bank_data_numerical<-bank_data %>%
  select(where(is.numeric))   

# Using describe () for summary stats
describe(bank_data_numerical)
         vars    n    mean      sd median trimmed    mad   min   max range skew
age         1 4521   41.17   10.58     39   40.48  10.38    19    87    68 0.70
balance     2 4521 1422.66 3009.64    444  802.41 658.27 -3313 71188 74501 6.59
day         3 4521   15.92    8.25     16   15.80  10.38     1    31    30 0.09
duration    4 4521  263.96  259.86    185  216.44 143.81     4  3025  3021 2.77
campaign    5 4521    2.79    3.11      2    2.14   1.48     1    50    49 4.74
pdays       6 4521   39.77  100.12     -1   11.56   0.00    -1   871   872 2.72
previous    7 4521    0.54    1.69      0    0.12   0.00     0    25    25 5.87
         kurtosis    se
age          0.35  0.16
balance     88.25 44.76
day         -1.04  0.12
duration    12.51  3.86
campaign    37.11  0.05
pdays        7.94  1.49
previous    51.91  0.03
# Using summary() for interquartile calculations
summary(bank_data_numerical)
      age           balance           day           duration   
 Min.   :19.00   Min.   :-3313   Min.   : 1.00   Min.   :   4  
 1st Qu.:33.00   1st Qu.:   69   1st Qu.: 9.00   1st Qu.: 104  
 Median :39.00   Median :  444   Median :16.00   Median : 185  
 Mean   :41.17   Mean   : 1423   Mean   :15.92   Mean   : 264  
 3rd Qu.:49.00   3rd Qu.: 1480   3rd Qu.:21.00   3rd Qu.: 329  
 Max.   :87.00   Max.   :71188   Max.   :31.00   Max.   :3025  
    campaign          pdays           previous      
 Min.   : 1.000   Min.   : -1.00   Min.   : 0.0000  
 1st Qu.: 1.000   1st Qu.: -1.00   1st Qu.: 0.0000  
 Median : 2.000   Median : -1.00   Median : 0.0000  
 Mean   : 2.794   Mean   : 39.77   Mean   : 0.5426  
 3rd Qu.: 3.000   3rd Qu.: -1.00   3rd Qu.: 0.0000  
 Max.   :50.000   Max.   :871.00   Max.   :25.0000  
# Convert to long format for plotting
bank_data_long <- bank_data_numerical %>%
  select(where(is.numeric)) %>%   
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  mutate(value = round(as.numeric(value), 0))  

# Plot histograms
p <- bank_data_long %>%
  mutate(variable = fct_reorder(variable, value)) %>%  
  ggplot(aes(x = value, color = variable, fill = variable)) +
  geom_histogram(alpha = 0.6, binwidth = 5) +  
  scale_fill_viridis(discrete = TRUE) +   
  scale_color_viridis(discrete = TRUE) + 
  theme_minimal() +   
  theme(
    legend.position = "none", 
    panel.spacing = unit(0.1, "lines"),  
    strip.text.x = element_text(size = 8)
  ) +
  xlab("Value") +
  ylab("Frequency") +  
  ggtitle("Distribution of Numerical Features ")+
  facet_wrap(~variable, scales = "free")  

# Print
print(p)

# Separating categorical features
bank_data_categorical<-bank_data %>%
  select(where(is.character))   

# Using describe() for summary stats
describe(bank_data_categorical)
           vars    n mean   sd median trimmed  mad min max range  skew kurtosis
job*          1 4521 5.41 3.26      5    5.33 4.45   1  12    11  0.24    -1.26
marital*      2 4521 2.15 0.60      2    2.18 0.00   1   3     2 -0.07    -0.35
education*    3 4521 2.23 0.75      2    2.24 0.00   1   4     3  0.19    -0.28
default*      4 4521 1.02 0.13      1    1.00 0.00   1   2     1  7.51    54.48
housing*      5 4521 1.57 0.50      2    1.58 0.00   1   2     1 -0.27    -1.93
loan*         6 4521 1.15 0.36      1    1.07 0.00   1   2     1  1.93     1.72
contact*      7 4521 1.65 0.90      1    1.57 0.00   1   3     2  0.74    -1.36
month*        8 4521 6.54 3.00      7    6.71 2.97   1  12    11 -0.50    -0.98
poutcome*     9 4521 3.56 0.99      4    3.82 0.00   1   4     3 -1.96     2.10
y*           10 4521 1.12 0.32      1    1.02 0.00   1   2     1  2.41     3.80
             se
job*       0.05
marital*   0.01
education* 0.01
default*   0.00
housing*   0.01
loan*      0.01
contact*   0.01
month*     0.04
poutcome*  0.01
y*         0.00
# Transform to long
bank_data_long <- bank_data_categorical %>%
  pivot_longer(everything(), names_to = "variable", values_to = "category")  

# Plot bar plots
p <- bank_data_long %>%
  mutate(variable = fct_reorder(variable, category)) %>% 
  ggplot(aes(x = category, color = variable, fill = variable)) +
  geom_bar(alpha = 0.6) +  # geom_bar() does not take a binwidth argument
  scale_fill_viridis(discrete = TRUE) +   
  scale_color_viridis(discrete = TRUE) + 
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +  
  theme_minimal() +   
  theme(
    legend.position = "none",  
    panel.spacing = unit(0.5, "lines"),  
    strip.text.x = element_text(size = 8),  
    axis.text.x = element_text(angle = 45, hjust = 1.0, size = 7.5),  
    axis.text.y = element_text(size = 10),  
    plot.margin = ggplot2::margin(10, 10, 10, 10, unit = "pt"), 
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5) 
  ) +
  xlab("Categories") +
  ylab("Frequency") +  
  ggtitle("Categorical Feature Distributions") +  
  facet_wrap(~variable, scales = "free_x", ncol = 5)



# Print 
print(p)

Conclusion

The categorical distributions show that most clients are married, work in blue-collar, management, or technician jobs, and have secondary education. Most do not have personal loans or credit defaults, and cell phones were the main contact method. Past campaign outcomes are mostly unknown, and the target variable Y is heavily imbalanced, with most clients not subscribing.
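Because the imbalance in y drives several modeling decisions later, it is worth quantifying it directly; a quick check (counts and proportions; in this file roughly nine in ten clients did not subscribe) is:

# Class balance of the target variable
table(bank_data$y)
round(prop.table(table(bank_data$y)), 3)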

1.3) Are there any outliers present?

Outliers were investigated via three methods:

  • Summary Statistics & Histograms

The distributions of the features and target variable in the previous step point to outliers in the following numerical variables:

Pdays: Median and quartiles = -1, while the mean (~40) suggests outliers.

Previous: Quartiles and median = 0, but the mean (~0.54) indicates that some values must be very high.

Campaign: Mean (2.8) > median (2), which shows potential for outliers.

Duration: Mostly short calls, but a few significantly longer calls are outliers.

Balance: Extreme difference between mean (~1423) and median (444) with a wide IQR, reflecting very large balances for some customers.

  • Scatterplots

Visual inspection confirmed the presence of extreme values across most numerical features, with the exception of day, which showed a more stable distribution.

  • 1.5 x IQR Rule

Using the standard 1.5 x IQR rule, outliers were identified for nearly all numerical features, with the exception of day. The counts of detected outliers (see Table 1 below) are:

Age: 38

Balance: 506

Campaign: 318

Duration: 330

Pdays: 816

Previous: 816

# scatterplots for outliers
numeric_data <- bank_data %>%
  select(where(is.numeric))

# Pivot the data to long format for plotting
numeric_data_long <- numeric_data %>%
  mutate(id = row_number()) %>%  
  pivot_longer(cols = -id, names_to = "variable", values_to = "value")

# Create scatterplots for each numeric variable
ggplot(numeric_data_long, aes(x = id, y = value)) +
  geom_point(alpha = 0.6, color = "darkblue") +  
  facet_wrap(~variable, scales = "free", ncol = 5) + 
  theme_minimal() +
  theme(
    strip.text.x = element_text(size = 10), 
    axis.text.x = element_text(angle = 45, hjust = 1)  
  ) +
  labs(
    x = "ID (Row Number)", 
    y = "Value", 
    title = "Scatterplots of Numerical Variables"
  )

Conclusion: Outliers are present in most numerical features, especially balance and duration. While these extreme values may reflect real-world customer behavior rather than errors, they should be carefully considered in further modeling steps (e.g., using robust scaling or transformation).
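As a hedged illustration of the transformations mentioned above, the sketch below applies a shifted log transform to the most skewed features and caps balance at its upper 1.5 x IQR fence; these derived columns are for exploration only and are not used in the experiments that follow.

# Illustrative only: tame the long right tails of duration and balance
bank_data_trans <- bank_data %>%
  mutate(
    log_duration   = log1p(duration),                              # log(1 + x) for right-skewed values
    log_balance    = log1p(balance - min(balance, na.rm = TRUE)),  # shift so the minimum maps to zero
    balance_capped = pmin(balance,                                 # winsorize at the upper IQR fence
                          quantile(balance, 0.75, na.rm = TRUE) + 1.5 * IQR(balance, na.rm = TRUE))
  )

summary(bank_data_trans[, c("log_duration", "log_balance", "balance_capped")])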

# IQR Calculation
calculate_iqr_outliers <- function(df, col_name) {
  Q1 <- quantile(df[[col_name]], 0.25, na.rm = TRUE)  # 25th 
  Q3 <- quantile(df[[col_name]], 0.75, na.rm = TRUE)  # 75th 
  IQR <- Q3 - Q1  # Interquartile range
  
  # Calculate outlier threshold
  lower_bound <- Q1 - 1.5 * IQR
  upper_bound <- Q3 + 1.5 * IQR
  
  # Identify outliers 
  df %>%
    filter(df[[col_name]] < lower_bound | df[[col_name]] > upper_bound) %>%  # Keep only outliers
    mutate(variable = col_name) %>%  
    select(variable) 
}

# Apply the function to all numeric variables and count outliers
outlier_counts <- bank_data %>%
  select(where(is.numeric)) %>%  
  names() %>%  
  map_df(~calculate_iqr_outliers(bank_data, .x)) %>%  
  group_by(variable) %>%  
  summarise(outlier_count = n())  

# Print the outlier counts for each feature
cat("\n**Table 1: Count of Outliers in Each Numeric Variable**\n")

**Table 1: Count of Outliers in Each Numeric Variable**
kable(outlier_counts, caption = "Count of Outliers in Each Numeric Variable")
Count of Outliers in Each Numeric Variable
variable outlier_count
age 38
balance 506
campaign 318
duration 330
pdays 816
previous 816

The correlation matrix suggests that most numerical features do not have strong linear relationships. Only a few show mild or weak associations:

  • Mild Relationship

Pdays and Previous (r = 0.45): Indicates that clients previously contacted are more likely to be contacted again soon.

  • Weak Relationships

Day and Duration (r = 0.16): A weak positive trend, suggesting calls tend to last slightly longer on later days.

Age and Balance (r = 0.10): A weak relationship where older clients may have marginally higher balances.

All other correlations among numerical features are close to zero, showing little to no linear association.

Categorical Variables

Correlations among categorical variables are generally weak, though a few mild associations are observed:

  • Mild Relationships

Job and Education – correlation of 0.46

Housing and Month – correlation of 0.50

Contact and Month – correlation of 0.51

  • Weak Relationships

All other correlations are weak.

# --- Correlation Matrix for Numeric Variables ---
numeric_data <- bank_data %>% select(where(is.numeric))

# Correlation matrix
correlation_matrix <- cor(numeric_data, use = "pairwise.complete.obs")

# corr matrix
ggcorrplot(correlation_matrix, 
           method = "circle", 
           type = "lower", 
           lab = TRUE, 
           title = "Correlation Matrix of Numeric Variables")

# Example 1: Combine housing and personal loan
bank_data$has_any_loan <- ifelse(bank_data$housing == "yes" | bank_data$loan == "yes", 1, 0)
table(bank_data$has_any_loan)

   0    1 
1677 2844 
# Example 2: Create seasonal feature from month
bank_data$season <- ifelse(bank_data$month %in% c("dec","jan","feb"), "Winter",
                     ifelse(bank_data$month %in% c("mar","apr","may"), "Spring",
                     ifelse(bank_data$month %in% c("jun","jul","aug"), "Summer","Fall")))
table(bank_data$season)

  Fall Spring Summer Winter 
   521   1740   1870    390 

Conclusion:

In general, the numerical variables have weak or no relationships with one another, except for the notable one between pdays and previous, which reflects repeated follow-ups with the same clients. Similarly, the categorical variables show mostly weak relationships, though there are mild associations between job and education, housing and month, and contact and month.

From a business-domain perspective, these insights are the foundation upon which new composite features are built. Combining pdays and previous into a repeated-contact feature would more accurately represent follow-up activity, whereas has_any_loan and season signal client financial stress and campaign timing. Weakly correlated variables may still provide complementary predictive power when engineered with careful foresight, making them valuable inputs to the models.
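A minimal sketch of the repeated-contact idea mentioned above is shown below; it assumes, as noted earlier, that pdays == -1 encodes "not previously contacted", and the derived flag is kept on a copy of the data rather than fed into the models that follow.

# Illustrative only: flag clients with any prior contact from an earlier campaign
bank_data_fe <- bank_data
bank_data_fe$repeated_contact <- ifelse(bank_data_fe$pdays > -1 | bank_data_fe$previous > 0, 1, 0)
table(bank_data_fe$repeated_contact)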

# Select categorical variables
categorical_cols <- names(bank_data)[sapply(bank_data, is.character)]

# Function for Cramers
cramers_v_matrix <- function(df, cat_vars) {
  n <- length(cat_vars)
  result <- matrix(0, n, n, dimnames = list(cat_vars, cat_vars))
  
  for (i in 1:n) {
    for (j in i:n) {
      if (i == j) {
        result[i, j] <- 1  
      } else {
        result[i, j] <- cramerV(df[[cat_vars[i]]], df[[cat_vars[j]]])
        result[j, i] <- result[i, j]  
      }
    }
  }
  return(result)
}

cramers_matrix <- cramers_v_matrix(bank_data, categorical_cols)
cramers_df <- as.data.frame(cramers_matrix)
ggcorrplot(cramers_matrix, 
           method = "circle", 
           type = "lower",     
           lab = TRUE, 
           title = "Cramers V Correlation Between Categorical Variables")

Conclusion:

The Cramér’s V correlation matrix shows that most categorical variables have weak associations with each other. A few mild relationships are observed, such as between job and education and between contact method and month, but overall the categorical features are largely independent.

1.5) How are categorical variables distributed?

Boxplots are used to examine the relationships between numeric and categorical variables. They allow us to visually assess differences in distributions, detect outliers, and identify patterns across categories.

# Convert categorical variables to factors 
bank_data <- bank_data %>%
  mutate(across(where(is.character), as.factor))

# Combine all numeric-categorical pairs into one long dataframe
numeric_vars <- names(select(bank_data, where(is.numeric)))
categorical_vars <- names(select(bank_data, where(is.factor)))

# Long FOrmat
bank_data_long <- bank_data %>%
  pivot_longer(cols = all_of(numeric_vars), names_to = "Numeric_Variable", values_to = "Value")

# Boxplots 
ggplot(bank_data_long, aes(x = .data[[categorical_vars[1]]], y = Value, fill = .data[[categorical_vars[1]]])) +
  geom_boxplot(alpha = 0.7) +  
  facet_wrap(~Numeric_Variable, scales = "free") +  
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),  # Rotate x-axis labels
    legend.position = "none"  # Remove legend if not needed
  ) +
  labs(
    x = "Categorical Variable",
    y = "Value",
    title = "Boxplots of Numeric Variables by Categorical Groups"
  )

Conclusion

The boxplots reveal that numeric variables vary across categorical groups, with some features showing clear differences in median values or spread. Outliers are also apparent in several numeric features, highlighting the influence of extreme values within certain categories.

The spread of the numerical variables is described by the standard deviation, variance, IQR, and range (max - min).

The spread of the categorical variables is captured via the frequency and percentage of each category.

# Display numeric summary


# Separate into numeric and categorical variables
numeric_vars <- bank_data %>%
  select(where(is.numeric))

categorical_vars <- bank_data %>%
  select(where(is.character) | where(is.factor))

# Function to calculate summary statistics for numeric variables 
numeric_summary <- numeric_vars %>%
  summarise_all(list(
    Mean = mean,
    Median = median,
    SD = sd,
    Variance = var,
    IQR = IQR,
    Range = ~max(., na.rm = TRUE) - min(., na.rm = TRUE)
  ), na.rm = TRUE) %>%
  pivot_longer(cols = everything(), names_to = c("Variable", "Metric"), names_pattern = "(.*)_([^_]+)$") %>%  # split on the last underscore so names such as has_any_loan stay intact
  pivot_wider(names_from = Metric, values_from = value)

# Function to compute mode separately 
compute_mode <- function(x) {
  tab <- table(x)
  names(tab)[which.max(tab)]  
}

mode_summary <- numeric_vars %>%
  summarise_all(list(Mode = compute_mode)) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Mode") %>%
  mutate(Variable = sub("_Mode$", "", Variable))  # drop the "_Mode" suffix so the join key matches

# Merge numeric summary with mode 
numeric_summary <- left_join(numeric_summary, mode_summary, by = "Variable")

# Function to find mode for categorical variables
categorical_summary <- categorical_vars %>%
  summarise_all(list(
    Mode = compute_mode
  )) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Mode")

# Display numeric summary
print(numeric_summary)  
# A tibble: 8 × 9
  Variable Mean      any       Median    SD        Variance  IQR    Range  Mode 
  <chr>    <list>    <list>    <list>    <list>    <list>    <list> <list> <chr>
1 age      <dbl [1]> <NULL>    <dbl [1]> <dbl [1]> <dbl [1]> <dbl>  <dbl>  <NA> 
2 balance  <dbl [1]> <NULL>    <dbl [1]> <dbl [1]> <dbl [1]> <dbl>  <dbl>  <NA> 
3 day      <dbl [1]> <NULL>    <dbl [1]> <dbl [1]> <dbl [1]> <dbl>  <dbl>  <NA> 
4 duration <dbl [1]> <NULL>    <dbl [1]> <dbl [1]> <dbl [1]> <dbl>  <dbl>  <NA> 
5 campaign <dbl [1]> <NULL>    <dbl [1]> <dbl [1]> <dbl [1]> <dbl>  <dbl>  <NA> 
6 pdays    <dbl [1]> <NULL>    <dbl [1]> <dbl [1]> <dbl [1]> <dbl>  <dbl>  <NA> 
7 previous <dbl [1]> <NULL>    <dbl [1]> <dbl [1]> <dbl [1]> <dbl>  <dbl>  <NA> 
8 has      <NULL>    <dbl [6]> <NULL>    <NULL>    <NULL>    <NULL> <NULL> <NA> 
# Display categorical summary
print(categorical_summary)  
# A tibble: 11 × 2
   Variable       Mode      
   <chr>          <chr>     
 1 job_Mode       management
 2 marital_Mode   married   
 3 education_Mode secondary 
 4 default_Mode   no        
 5 housing_Mode   yes       
 6 loan_Mode      no        
 7 contact_Mode   cellular  
 8 month_Mode     may       
 9 poutcome_Mode  unknown   
10 y_Mode         no        
11 season_Mode    Summer    
# Function to compute category counts and correct percentages
compute_category_counts <- function(df) {
  df %>%
    pivot_longer(cols = everything(), names_to = "Variable", values_to = "Category") %>%
    group_by(Variable, Category) %>%
    summarise(Count = n(), .groups = "drop") %>%
    group_by(Variable) %>%  
    mutate(Percentage = round((Count / sum(Count)) * 100, 2)) %>%  
    arrange(Variable, desc(Count)) 
}

# Apply function to categorical variables
categorical_vars <- bank_data %>%
  select(where(is.character) | where(is.factor))  

category_counts <- compute_category_counts(categorical_vars)

# Print 
print(category_counts)
# A tibble: 50 × 4
# Groups:   Variable [11]
   Variable  Category  Count Percentage
   <chr>     <fct>     <int>      <dbl>
 1 contact   cellular   2896      64.1 
 2 contact   unknown    1324      29.3 
 3 contact   telephone   301       6.66
 4 default   no         4445      98.3 
 5 default   yes          76       1.68
 6 education secondary  2306      51.0 
 7 education tertiary   1350      29.9 
 8 education primary     678      15   
 9 education unknown     187       4.14
10 housing   yes        2559      56.6 
# ℹ 40 more rows

There are no missing values.

# Missing values 
missing_summary <- bank_data %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing_Count") %>%
  arrange(desc(Missing_Count))  

# Print 
print(missing_summary)
# A tibble: 19 × 2
   Variable     Missing_Count
   <chr>                <int>
 1 age                      0
 2 job                      0
 3 marital                  0
 4 education                0
 5 default                  0
 6 balance                  0
 7 housing                  0
 8 loan                     0
 9 contact                  0
10 day                      0
11 month                    0
12 duration                 0
13 campaign                 0
14 pdays                    0
15 previous                 0
16 poutcome                 0
17 y                        0
18 has_any_loan             0
19 season                   0

Duplicate or Inconsistent Values

To ensure data quality, I have checked for duplicate rows and inconsistent values.

# Check for duplicate rows
sum(duplicated(bank_data))
[1] 0
# Check for inconsistent categorical values
bank_data_categorical <- bank_data %>% select(where(is.factor))
sapply(bank_data_categorical, unique)
$job
 [1] unemployed    services      management    blue-collar   self-employed
 [6] technician    entrepreneur  admin.        student       housemaid    
[11] retired       unknown      
12 Levels: admin. blue-collar entrepreneur housemaid management ... unknown

$marital
[1] married  single   divorced
Levels: divorced married single

$education
[1] primary   secondary tertiary  unknown  
Levels: primary secondary tertiary unknown

$default
[1] no  yes
Levels: no yes

$housing
[1] no  yes
Levels: no yes

$loan
[1] no  yes
Levels: no yes

$contact
[1] cellular  unknown   telephone
Levels: cellular telephone unknown

$month
 [1] oct may apr jun feb aug jan jul nov sep mar dec
Levels: apr aug dec feb jan jul jun mar may nov oct sep

$poutcome
[1] unknown failure other   success
Levels: failure other success unknown

$y
[1] no  yes
Levels: no yes

$season
[1] Fall   Spring Summer Winter
Levels: Fall Spring Summer Winter

Conclusion

No duplicates or inconsistent values were detected, so no further cleaning is required.

The majority of the clients are married with secondary education, which matches typical customer demographics. The right-skewed distributions of duration and balance are also expected, as marketing campaigns tend to concentrate on particular client segments. The large imbalance of the target variable y is likewise consistent, because only a small share of the contacted clients subscribe to a term deposit.

Algorithm Selection

I will explore three models: Decision Tree, Random Forest, and Adaboost. Each model has unique strengths that align with the characteristics of the dataset and the project objective.

Decision Tree

A Decision Tree is a simple yet powerful algorithm that is easy to interpret. It is well-suited for identifying key decision-making factors, making it useful for explaining why certain customers are more likely to subscribe to a term deposit. However, Decision Trees are prone to overfitting, especially in datasets with noise or numerous features. Careful tuning (e.g., maximum depth, minimum samples per split) can improve performance.

Random Forest

Random Forest is an ensemble of Decision Trees, designed to reduce overfitting by averaging the predictions of many trees. This makes it more robust and stable, especially with noisy data or datasets containing both categorical and numerical features. Given the complexity of our dataset and the potential outliers, Random Forest is a strong candidate for achieving reliable predictions.

Adaboost

AdaBoost is another ensemble method that builds a series of weak learners (e.g., shallow Decision Trees), focusing more on hard-to-classify instances. Since our dataset is highly imbalanced (~88.7% “no”, ~11.3% “yes”), AdaBoost’s ability to improve recall for the minority class can enhance the model’s ability to identify potential subscribers effectively.

Experiment

Experiment 1: Decision Tree

  1. Objective: Establish a baseline Decision Tree model using default hyperparameters to assess its natural performance as a reference for future experiments.

  2. What will change: Since this is a baseline model, no hyperparameters will be adjusted. The focus is to evaluate the Decision Tree’s performance without tuning.

  3. Evaluation Metric: We’ll use Accuracy, Precision, Recall, and F1-score to assess performance. Given the dataset’s imbalance, Recall and F1-score will be prioritized. Additionally, AUC will be calculated to evaluate the model’s overall discrimination ability.

  4. Cross-Validation Strategy: We will apply 10-fold cross-validation to improve reliability. The data will be divided into 10 parts, with 9 used for training and 1 for testing. This process repeats 10 times, ensuring all data points are tested once. Averaging results reduces the risk of overfitting and provides a robust evaluation

  5. Code Implementation:

To systematically track and compare all 6 experiments, we create a results data frame that logs each experiment’s metrics.

Let’s import the relevant libraries.
library(rpart)  
library(rpart.plot)  
library(ROCR)    
library(pROC)    
library(doParallel)
library(caret)
library(randomForest)
library(ada)
library(dplyr)
library(knitr)
library(ggplot2)
library(reshape2)
library(ROSE)
# Initialize Results Data Frame
results <- data.frame(
  Experiment = character(),
  Accuracy = numeric(),
  Precision = numeric(),
  Recall = numeric(),
  F1_Score = numeric(),
  AUC = numeric(),
  stringsAsFactors = FALSE
)

#Display updated results
results
[1] Experiment Accuracy   Precision  Recall     F1_Score   AUC       
<0 rows> (or 0-length row.names)
# Apply ROSE for SMOTE-like behavior
set.seed(123)
train_data_smote <- ROSE(y ~ ., data = bank_data, seed = 123)$data

# Verify class distribution after SMOTE
table(train_data_smote$y)

  no  yes 
2279 2242 
prop.table(table(train_data_smote$y))

      no      yes 
0.504092 0.495908 
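One variation of this resampling step (a sketch only, not the pipeline used in the experiments below) is to create the train/test split first and balance only the training partition, so the held-out rows never influence the synthetic sampling:

# Sketch: split first, then apply ROSE to the training partition only
set.seed(123)
idx <- createDataPartition(bank_data$y, p = 0.8, list = FALSE)
train_raw <- bank_data[idx, ]
test_raw  <- bank_data[-idx, ]
train_balanced <- ROSE(y ~ ., data = train_raw, seed = 123)$data
prop.table(table(train_balanced$y))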
set.seed(456)
train_control <- trainControl(
  method = "cv",         
  number = 10,
  classProbs = TRUE,
  summaryFunction = twoClassSummary
)

# Stratified sampling (80% train, 20% test)
trainIndex <- createDataPartition(bank_data$y, p = 0.8, list = FALSE)

# Split dataset into training and test sets
train_data <- bank_data[trainIndex, ]
test_data <- bank_data[-trainIndex, ]

# Train Decision Tree Model
dt_model <- train(
  y ~ ., 
  data = train_data_smote,   
  method = "rpart",          
  trControl = train_control,
  metric = "Recall"        
)

# Predict on Test Data
dt_pred <- predict(dt_model, test_data)

# Confusion Matrix
conf_matrix <- confusionMatrix(dt_pred, test_data$y, positive = "yes")

# Calculate Metrics
accuracy <- conf_matrix$overall['Accuracy']
precision <- conf_matrix$byClass['Precision']
recall <- conf_matrix$byClass['Recall']
f1_score <- 2 * ((precision * recall) / (precision + recall))

# Calculate AUC
dt_probs <- predict(dt_model, test_data, type = "prob")[, "yes"]
dt_auc <- auc(test_data$y, dt_probs)

# Display Results
cat(sprintf("\nDecision Tree (Default) - Accuracy: %.4f, Precision: %.4f, Recall: %.4f, F1-score: %.4f, AUC: %.4f\n", 
            accuracy, precision, recall, f1_score, dt_auc))

Decision Tree (Default) - Accuracy: 0.8252, Precision: 0.3579, Recall: 0.6538, F1-score: 0.4626, AUC: 0.7547
# Log Results
results <- rbind(results, data.frame(
  Experiment = "Decision Tree (Default)",
  Accuracy = accuracy,
  Precision = precision,
  Recall = recall,
  F1_Score = f1_score,
  AUC = dt_auc
))
rownames(results) <- NULL
knitr::kable(results, caption = "Decision Tree (Default)", digits = 3)
Decision Tree (Default)
Experiment Accuracy Precision Recall F1_Score AUC
Decision Tree (Default) 0.825 0.358 0.654 0.463 0.7546514
  6. Analysis - Decision Tree (Experiment 1): The baseline Decision Tree showed moderate accuracy (82.5%) with strong recall (65.4%), but its low precision (35.8%) highlights a high false positive rate, indicating room for improvement.
Decision Tree Model
#Decision Tree Model
#Extract rpart object
dt_model1 <- rpart(y ~ ., data = train_data, method = "class")
rpart.plot(dt_model1, main="Default Decision Tree Model")

The baseline Decision Tree achieved an accuracy of 82.5% and a recall of 65.4%, successfully identifying many actual subscribers. However, its precision was low (35.8%), indicating many false positives. The model provides clear insights into key decision factors, but the imbalance between recall and precision suggests that tuning or ensemble methods will be needed to improve overall performance in future experiments.
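To make those decision factors explicit, the rules and variable importance of the fitted rpart object can be printed; a short sketch (rpart.rules() comes from the rpart.plot package loaded above):

# Decision rules of the baseline tree, one row per leaf
rpart.rules(dt_model1)

# Relative importance of each predictor in the fitted tree
dt_model1$variable.importance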

Experiment 2: Decision Tree

Decision Tree - Experiment 2 (Tuned)

For this experiment, we’ll focus on improving the baseline Decision Tree model by tuning hyperparameters to enhance model performance

  1. Objective: Improve the Decision Tree’s performance by tuning hyperparameters to enhance Recall, Precision, and F1 Score, while ensuring better generalization and reducing overfitting.

  2. What will change: We will tune the following hyperparameters to reduce overfitting and improve generalization:

  • ‘cp’ (complexity parameter): Controls cost-complexity pruning. We will test values ‘{0.01, 0.02, 0.03}’ to find the optimal level of tree pruning.

  • ‘minsplit’: Minimum number of samples required to attempt a split. We set this to ‘20’ to avoid overly granular splits that could lead to overfitting.

  • ‘maxdepth’: While not directly exposed in ‘caret::rpart’, we control tree depth implicitly through ‘cp’ and ‘minsplit’.

These choices are based on the fact that our baseline model overfit (high Recall, low Precision), so we’re tuning toward improved F1 Score and Precision without overly sacrificing Recall.

  3. Evaluation Metric: We'll continue to evaluate performance using Accuracy, Precision, Recall, F1 Score, and AUC. Recall and F1 Score will remain the focus given the dataset's imbalance.

  4. Cross-Validation Strategy: We will apply 10-fold cross-validation for consistency with Experiment 1. This approach ensures reliable performance evaluation by averaging results across multiple data splits.

  5. Code Implementation:

# Define train control for 10-fold cross-validation
set.seed(789)
train_control <- trainControl(
  method = "cv",
  number = 10,
  classProbs = TRUE,
  summaryFunction = twoClassSummary
)

# Tuning Grid - Only `cp` is tunable for rpart
tune_grid <- expand.grid(
  cp = c(0.01, 0.02, 0.03)
)

# Train the tuned Decision Tree model with added `maxdepth` and `minsplit`
dt_model_tuned <- train(
  y ~ ., 
  data = train_data_smote,
  method = "rpart",              
  trControl = train_control,
  tuneGrid = tune_grid,
  control = rpart.control(minsplit = 20, maxdepth = 5),  
  metric = "Recall"                 
)

# Predict on test data
dt_pred_tuned <- predict(dt_model_tuned, test_data)

# Confusion Matrix
conf_matrix_tuned <- confusionMatrix(dt_pred_tuned, test_data$y, positive = "yes")

# Calculate Metrics
accuracy_tuned <- conf_matrix_tuned$overall['Accuracy']
precision_tuned <- conf_matrix_tuned$byClass['Precision']
recall_tuned <- conf_matrix_tuned$byClass['Recall']
f1_score_tuned <- 2 * ((precision_tuned * recall_tuned) / (precision_tuned + recall_tuned))

# Calculate AUC
dt_probs_tuned <- predict(dt_model_tuned, test_data, type = "prob")[, "yes"]
dt_auc_tuned <- auc(test_data$y, dt_probs_tuned)

# Display Results
cat(sprintf("\nDecision Tree (Tuned) - Accuracy: %.4f, Precision: %.4f, Recall: %.4f, F1-score: %.4f, AUC: %.4f\n",
            accuracy_tuned, precision_tuned, recall_tuned, f1_score_tuned, dt_auc_tuned))

Decision Tree (Tuned) - Accuracy: 0.8252, Precision: 0.3579, Recall: 0.6538, F1-score: 0.4626, AUC: 0.7547
# Add Results
results <- rbind(results, data.frame(
  Experiment = "Decision Tree (Tuned)",
  Accuracy = accuracy_tuned,
  Precision = precision_tuned,
  Recall = recall_tuned,
  F1_Score = f1_score_tuned,
  AUC = dt_auc_tuned
))
rownames(results) <- NULL
knitr::kable(results, caption = "Decision Tree (Tuned)", digits = 3)
Decision Tree (Tuned)
Experiment Accuracy Precision Recall F1_Score AUC
Decision Tree (Default) 0.825 0.358 0.654 0.463 0.7546514
Decision Tree (Tuned) 0.825 0.358 0.654 0.463 0.7546514
  6. Analysis: The tuned Decision Tree produced exactly the same test-set metrics as the baseline (Accuracy 0.825, Precision 0.358, Recall 0.654, F1 0.463, AUC 0.755), suggesting that the cp values explored did not change the tree selected by cross-validation. A wider tuning grid or explicit class weighting would be needed to shift the balance between Precision and Recall.
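As a follow-up idea (a sketch only, not run as part of this report), a wider cp grid could be searched to check whether a different tree is ever selected:

# Sketch: a finer/wider pruning grid for the same cross-validated setup
wide_grid <- expand.grid(cp = c(0.0005, 0.001, 0.005, 0.01, 0.02))
dt_model_wide <- train(y ~ ., data = train_data_smote, method = "rpart",
                       trControl = train_control, tuneGrid = wide_grid, metric = "ROC")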

Experiment 1: Random Forest

  1. Objective: Establish a baseline Random Forest model using default hyperparameters to assess its natural performance as a reference for future experiments. This will allow us to compare its performance against the Decision Tree models.

  2. What Will Change: Since this is a baseline model, we will use the default hyperparameters in the randomForest package:

ntree = 50: the number of trees in the forest (set below the randomForest default of 500 to keep training time manageable).

mtry = sqrt(number of features): the default number of randomly selected features considered at each split.

nodesize = 1: the minimum size of terminal nodes (the classification default; larger values produce smaller, more conservative trees).

This setup will provide a strong baseline to compare improvements in future experiments.

  3. Evaluation Metric: We will evaluate model performance using Accuracy, Precision, Recall (priority), F1 Score, and AUC (to assess the model's overall discrimination ability). Given the dataset's imbalance, Recall and F1 Score will remain the primary focus.

  4. Cross-Validation Strategy: We'll apply 10-fold cross-validation (consistent with the Decision Tree experiments) to ensure reliable performance evaluation and mitigate overfitting.

  5. Code Implementation: Let's implement the baseline Random Forest model.

# Define train control for 10-fold cross-validation
set.seed(101)
train_control <- trainControl(
  method = "cv",
  number = 10,
  classProbs = TRUE,
  summaryFunction = twoClassSummary
)

# Garbage Collection to free up memory
gc()
          used  (Mb) gc trigger  (Mb) limit (Mb) max used  (Mb)
Ncells 3121219 166.7    5213738 278.5         NA  5213738 278.5
Vcells 7783574  59.4   14786712 112.9      16384 14785937 112.9
# Train the Random Forest baseline model with parallel processing
rf_model <- train(
  y ~ ., 
  data = train_data_smote,
  method = "rf",             
  trControl = train_control,
  metric = "Recall",
  ntree = 50,                
  importance = TRUE        
)

# Predict on test data
rf_pred <- predict(rf_model, test_data)

# Confusion Matrix
conf_matrix_rf <- confusionMatrix(rf_pred, test_data$y, positive = "yes")

# Calculate metrics
rf_accuracy <- conf_matrix_rf$overall['Accuracy']
rf_precision <- conf_matrix_rf$byClass['Precision']
rf_recall <- conf_matrix_rf$byClass['Recall']
rf_f1 <- 2 * ((rf_precision * rf_recall) / (rf_precision + rf_recall))

# Calculate AUC
rf_probs <- predict(rf_model, test_data, type = "prob")[, "yes"]
rf_auc <- auc(test_data$y, rf_probs)

# Display Results
cat(sprintf("\nRandom Forest (Default) - Accuracy: %.4f, Precision: %.4f, Recall: %.4f, F1-score: %.4f, AUC: %.4f\n",
            rf_accuracy, rf_precision, rf_recall, rf_f1, rf_auc))

Random Forest (Default) - Accuracy: 0.8861, Precision: 0.5030, Recall: 0.7981, F1-score: 0.6171, AUC: 0.9316
# Add results 
results <- rbind(results, data.frame(
  Experiment = "Random Forest (Default)",
  Accuracy = rf_accuracy,
  Precision = rf_precision,
  Recall = rf_recall,
  F1_Score = rf_f1,
  AUC = rf_auc
))

#View the updated results
knitr::kable(results, caption = "Model Performance Summary", digits = 3)
Model Performance Summary
Experiment Accuracy Precision Recall F1_Score AUC
Decision Tree (Default) 0.825 0.358 0.654 0.463 0.7546514
Decision Tree (Tuned) 0.825 0.358 0.654 0.463 0.7546514
Random Forest (Default) 0.886 0.503 0.798 0.617 0.9316046
  6. Analysis - Random Forest (Experiment 1): The baseline Random Forest clearly outperformed both Decision Tree models, improving Accuracy (0.886), Precision (0.503), F1 Score (0.617), and AUC (0.932) while raising Recall to 0.798.

Experiment 2: Random Forest

  1. Objective: Improve the performance of the Random Forest model by tuning key hyperparameters to enhance Recall, Precision, and F1 Score while reducing overfitting.

  2. What Will Change: In this experiment, we’ll modify key hyperparameters to improve model generalization and reduce overfitting. Specifically:

‘mtry’ (number of features tried per split): We tested values ‘{3, 5, 7}’. A higher ‘mtry’ allows the model to consider more features per split, which may improve accuracy when strong predictors are present. Lower ‘mtry’ values can help reduce overfitting.

‘ntree’ (number of trees): We set this to ‘100’, which is generally sufficient to stabilize predictions while avoiding excessive computational cost.

‘nodesize’ (minimum size of terminal nodes): ‘caret::train()’ does not tune this directly for Random Forest, but it is passed through to ‘randomForest()’, so we set it to ‘5’ (up from the default of 1). Larger node sizes reduce variance but may underfit. The tuning grid itself focuses on ‘mtry’ for simplicity and interpretability.

  3. Evaluation Metric: We'll continue evaluating performance using Accuracy, Precision, Recall (priority), F1 Score, and AUC. Since Recall and F1 Score are critical for identifying potential subscribers, they will be our primary focus.

  4. Cross-Validation Strategy: We'll continue using 10-fold cross-validation for consistency and robust evaluation.

  5. Code Implementation:

# Define train control for 10-fold cross-validation
set.seed(202) 
train_control <- trainControl(
  method = "cv",
  number = 10,
  classProbs = TRUE,
  summaryFunction = twoClassSummary
)

# TuneGrid - Only `mtry` for Random Forest
tune_grid <- expand.grid(
  mtry = c(3, 5, 7)  
)

# Garbage Collection to free up memory
#gc()
knitr::kable(gc(), caption = " ", digits = 3)
used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells 3121547 166.8 5213738 278.5 NA 5213738 278.5
Vcells 7784302 59.4 14786712 112.9 16384 14785937 112.9
# Train the tuned Random Forest model
rf_model_tuned <- train(
  y ~ ., 
  data = train_data_smote,
  method = "rf",
  trControl = train_control,
  metric = "Recall",
  tuneGrid = tune_grid,
  ntree = 100,
  nodesize = 5,
  importance = TRUE 
)

# Predict on test data
set.seed(202) 
rf_pred_tuned <- predict(rf_model_tuned, test_data)

# Confusion Matrix
conf_matrix_rf_tuned <- confusionMatrix(rf_pred_tuned, test_data$y, positive = "yes")

# Calculate metrics
rf_accuracy_tuned <- conf_matrix_rf_tuned$overall['Accuracy']
rf_precision_tuned <- conf_matrix_rf_tuned$byClass['Precision']
rf_recall_tuned <- conf_matrix_rf_tuned$byClass['Recall']
rf_f1_tuned <- 2 * ((rf_precision_tuned * rf_recall_tuned) / 
                     (rf_precision_tuned + rf_recall_tuned))

# Calculate AUC
rf_probs_tuned <- predict(rf_model_tuned, test_data, type = "prob")[, "yes"]
rf_auc_tuned <- auc(test_data$y, rf_probs_tuned)

# Display Results
cat(sprintf("\nRandom Forest (Tuned) - Accuracy: %.4f, Precision: %.4f, Recall: %.4f, F1-score: %.4f, AUC: %.4f\n",
            rf_accuracy_tuned, rf_precision_tuned, rf_recall_tuned, rf_f1_tuned, rf_auc_tuned))

Random Forest (Tuned) - Accuracy: 0.8816, Precision: 0.4918, Recall: 0.8654, F1-score: 0.6272, AUC: 0.9379
# Add Results to Results Table
results <- rbind(results, data.frame(
  Experiment = "Random Forest (Tuned)",
  Accuracy = rf_accuracy_tuned,
  Precision = rf_precision_tuned,
  Recall = rf_recall_tuned,
  F1_Score = rf_f1_tuned,
  AUC = rf_auc_tuned
))
#rownames(results) <- NULL
#View the updated results
knitr::kable(results, caption = "Model Performance Summary", digits = 3)
Model Performance Summary
Experiment Accuracy Precision Recall F1_Score AUC
Decision Tree (Default) 0.825 0.358 0.654 0.463 0.7546514
Decision Tree (Tuned) 0.825 0.358 0.654 0.463 0.7546514
Random Forest (Default) 0.886 0.503 0.798 0.617 0.9316046
Random Forest (Tuned) 0.882 0.492 0.865 0.627 0.9378546
  6. Analysis - Random Forest (Experiment 2): The tuned Random Forest traded a small drop in Accuracy (0.882 vs 0.886) and Precision (0.492 vs 0.503) for a substantial gain in Recall (0.865 vs 0.798), which also lifted the F1 Score (0.627) and AUC (0.938). This aligns with the experiment's objective of capturing more of the true subscribers.
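Since importance = TRUE was set when fitting, the contribution of each predictor can also be inspected; a short sketch using caret's varImp() on the tuned model:

# Variable importance from the tuned Random Forest
rf_importance <- varImp(rf_model_tuned)
print(rf_importance)
plot(rf_importance, top = 10, main = "Top 10 Predictors - Random Forest (Tuned)")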

Experiment 1: Adaboost

For this experiment, I’ll build a baseline Adaboost model using default hyperparameters to establish a reference for comparison.

  1. Objective: Establish a baseline Adaboost model to assess its natural performance as a reference for future tuning and evaluate its ability to improve Recall and Precision in the imbalanced dataset.

  2. What will Change: Since this is a baseline model, no hyperparameters will be tuned in this experiment. Our focus will be on evaluating Adaboost’s default behavior.

  3. Evaluation Metric: We’ll evaluate the model using Accuracy (overall correctness), precision (reducing false positives), Recall (priority metric for identifying actual subscribers),F1 Score (balance between Precision and Recall), AUC (discrimination ability).

  4. Cross-Validation Strategy: Cross-validation is not applied in this baseline experiment due to the use of the base ada() function and the goal of establishing default behavior quickly. Instead, performance is evaluated on a separate held-out test set to provide a realistic baseline for comparison with future tuned models.

  5. Code Implementation:

# Train AdaBoost model with 25 iterations (default)
set.seed(303)  
ada_model <- ada(
  y ~ ., 
  data = train_data_smote,  
  iter = 25
)

# Predict on test data
ada_probs <- predict(ada_model, test_data, type = "prob")[, 2]
ada_preds <- predict(ada_model, test_data, type = "class")

# Confusion Matrix
conf_matrix_ada <- confusionMatrix(ada_preds, test_data$y, positive = "yes")

# Calculate metrics
ada_accuracy <- conf_matrix_ada$overall['Accuracy']
ada_precision <- conf_matrix_ada$byClass['Pos Pred Value']
ada_recall <- conf_matrix_ada$byClass['Sensitivity']
ada_f1 <- 2 * (ada_precision * ada_recall) / (ada_precision + ada_recall)
ada_auc <- roc(test_data$y, ada_probs)$auc

# Display Results
cat(sprintf("\nAdaBoost (Default) - Accuracy: %.4f, Precision: %.4f, Recall: %.4f, F1-score: %.4f, AUC: %.4f\n",
            ada_accuracy, ada_precision, ada_recall, ada_f1, ada_auc))

AdaBoost (Default) - Accuracy: 0.8440, Precision: 0.4089, Recall: 0.7981, F1-score: 0.5407, AUC: 0.9042
# Add Results to Results Table
results <- rbind(results, data.frame(
  Experiment = "AdaBoost",
  Accuracy = ada_accuracy,
  Precision = ada_precision,
  Recall = ada_recall,
  F1_Score = ada_f1,
  AUC = ada_auc
))
#rownames(results) <- NULL
knitr::kable(results, caption = "Model Performance Summary", digits = 3)
Model Performance Summary
Experiment Accuracy Precision Recall F1_Score AUC
Decision Tree (Default) 0.825 0.358 0.654 0.463 0.7546514
Decision Tree (Tuned) 0.825 0.358 0.654 0.463 0.7546514
Random Forest (Default) 0.886 0.503 0.798 0.617 0.9316046
Random Forest (Tuned) 0.882 0.492 0.865 0.627 0.9378546
AdaBoost 0.844 0.409 0.798 0.541 0.9042007
  6. Analysis - AdaBoost (Experiment 1 - Default): The baseline AdaBoost matched the default Random Forest's Recall (0.798) but trailed both Random Forest models on Accuracy, Precision, F1 Score, and AUC.

Experiment 2: Adaboost

  1. Objective: Improve the performance of the AdaBoost model by tuning key hyperparameters to enhance Recall, Precision, and F1 Score while maintaining a balanced model that reduces overfitting.

  2. What Will Change: In this experiment, we will tune key hyperparameters to improve performance by increasing learning rounds and controlling overfitting:

‘iter’ (number of iterations): Set to ‘50’ (up from 25 in the baseline) to allow more boosting rounds. This enables the model to better correct previous errors and improve overall learning.

‘nu’ (learning rate): Set to ‘0.05’, a lower value that reduces the impact of each weak learner. This helps improve model stability and prevents overfitting by slowing the learning process.

‘type’ (classification type): Set to “discrete” to apply the standard AdaBoost algorithm suitable for binary classification tasks like ours. This setting focuses on improving classification margin and is well-suited to imbalanced data.

  3. Evaluation Metric: The model will be evaluated using Accuracy, Precision, Recall (priority), F1 Score, and AUC. Since identifying true positives (Recall) is crucial, Recall and F1 Score will remain the primary focus.

  4. Cross-Validation Strategy: Due to performance constraints observed during earlier attempts, we opted to run the tuned AdaBoost model without cross-validation. This allowed us to focus on hyperparameter optimization without long training delays, while still evaluating model performance on a held-out test set.
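For completeness, the sketch below shows what a cross-validated version of this experiment could look like; it assumes caret's "ada" method and its iter/maxdepth/nu tuning grid, is considerably slower, and is not the run reported below.

# Sketch only: 5-fold CV over a small AdaBoost grid via caret
# ada_cv <- train(
#   y ~ ., data = train_data_smote, method = "ada",
#   trControl = trainControl(method = "cv", number = 5,
#                            classProbs = TRUE, summaryFunction = twoClassSummary),
#   tuneGrid = expand.grid(iter = c(50, 100), maxdepth = c(1, 2), nu = c(0.05, 0.1)),
#   metric = "ROC"
# )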

# Train AdaBoost model with tuned hyperparameters
set.seed(404) 
ada_model_tuned <- ada(
  y ~ ., 
  data = train_data_smote,  
  iter = 50,        # Increased iterations for improved learning
  nu = 0.05,         # Smaller learning rate to control overfitting
  type = "discrete"  # Ensures standard AdaBoost for classification
)

# Predict on test data
ada_probs_tuned <- predict(ada_model_tuned, test_data, type = "prob")[, 2]
ada_preds_tuned <- predict(ada_model_tuned, test_data, type = "class")

# Confusion Matrix
conf_matrix_ada_tuned <- confusionMatrix(ada_preds_tuned, test_data$y, positive = "yes")

# Calculate metrics
ada_accuracy_tuned <- conf_matrix_ada_tuned$overall['Accuracy']
ada_precision_tuned <- conf_matrix_ada_tuned$byClass['Pos Pred Value']
ada_recall_tuned <- conf_matrix_ada_tuned$byClass['Sensitivity']
ada_f1_tuned <- 2 * (ada_precision_tuned * ada_recall_tuned) / (ada_precision_tuned + ada_recall_tuned)
ada_auc_tuned <- roc(test_data$y, ada_probs_tuned)$auc

# Display Results
cat(sprintf("\nAdaBoost (Tuned) - Accuracy: %.4f, Precision: %.4f, Recall: %.4f, F1-score: %.4f, AUC: %.4f\n",
            ada_accuracy_tuned, ada_precision_tuned, ada_recall_tuned, ada_f1_tuned, ada_auc_tuned))

AdaBoost (Tuned) - Accuracy: 0.8540, Precision: 0.4263, Recall: 0.7788, F1-score: 0.5510, AUC: 0.9059
# Add Results to Results Table
results <- rbind(results, data.frame(
  Experiment = "AdaBoost (Tuned)",
  Accuracy = ada_accuracy_tuned,
  Precision = ada_precision_tuned,
  Recall = ada_recall_tuned,
  F1_Score = ada_f1_tuned,
  AUC = ada_auc_tuned
))
rownames(results) <- NULL

results_all <- unique(results)
kable(results_all, caption = "Summary of Experiment Results")
Summary of Experiment Results
Experiment Accuracy Precision Recall F1_Score AUC
Decision Tree (Default) 0.8252212 0.3578947 0.6538462 0.4625850 0.7546514
Decision Tree (Tuned) 0.8252212 0.3578947 0.6538462 0.4625850 0.7546514
Random Forest (Default) 0.8860619 0.5030303 0.7980769 0.6171004 0.9316046
Random Forest (Tuned) 0.8816372 0.4918033 0.8653846 0.6271777 0.9378546
AdaBoost 0.8440265 0.4088670 0.7980769 0.5407166 0.9042007
AdaBoost (Tuned) 0.8539823 0.4263158 0.7788462 0.5510204 0.9059014
  5. Analysis - AdaBoost (Experiment 2 - Tuned): The tuned AdaBoost model showed slight improvements in Accuracy, Precision, and F1 Score compared to the baseline AdaBoost, at the cost of a small drop in Recall, with AUC essentially unchanged.

Comparison of Model Performance Metrics

After completing all six experiments, comparing model performance is essential to identify which approach delivers the best balance between accuracy, recall, precision, and overall predictive power. This section consolidates the key evaluation metrics from each experiment to provide a clear view of how tuning and algorithm selection affected performance.

Objectives

To compare all models (Decision Tree, Random Forest, and AdaBoost) across common evaluation metrics.

To determine which model best handles the dataset’s class imbalance.

To analyze trade-offs between recall and precision and identify the most effective model for predicting potential subscribers.

To use these comparisons as evidence for selecting the final model for deployment or further optimization.

Model Performance Summary

This table highlights how accuracy, recall, precision, F1-score, and AUC vary across experiments. Precision and F1-score are especially important here because they reflect how well each model balances false positives and true positives in this imbalanced dataset.

knitr::kable(
results[, c("Experiment", "Accuracy", "Precision", "Recall", "F1_Score", "AUC")],
caption = "Table: Model Performance Summary (Precision & F1 Focus)",
digits = 3,
align = "lccccc"
)
Table: Model Performance Summary (Precision & F1 Focus)
Experiment Accuracy Precision Recall F1_Score AUC
Decision Tree (Default) 0.825 0.358 0.654 0.463 0.7546514
Decision Tree (Tuned) 0.825 0.358 0.654 0.463 0.7546514
Random Forest (Default) 0.886 0.503 0.798 0.617 0.9316046
Random Forest (Tuned) 0.882 0.492 0.865 0.627 0.9378546
AdaBoost 0.844 0.409 0.798 0.541 0.9042007
AdaBoost (Tuned) 0.854 0.426 0.779 0.551 0.9059014
library(ggplot2)
library(reshape2)

results_melt <- melt(results, id.vars = "Experiment")

ggplot(results_melt, aes(x = variable, y = Experiment, fill = value)) +
geom_tile(color = "white") +
geom_text(aes(label = round(value, 3)), color = "black", size = 3) +
scale_fill_gradient(low = "lightblue", high = "steelblue") +
theme_minimal() +
labs(
title = "Heatmap: Comparison of Model Performance Metrics",
x = "Performance Metric",
y = "Experiment",
fill = "Score"
) +
theme(
plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1)
)

results_long <- melt(results_all, id.vars = "Experiment")
recall_order <- results_all %>%
  arrange(desc(Recall)) %>%
  pull(Experiment)
results_long$Experiment <- factor(results_long$Experiment, levels = recall_order)

# Plot 
ggplot(results_long, aes(x = Experiment, y = value, fill = variable)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.8))+
  labs(
    title = "Comparison of Model Performance Metrics",
    x = "Experiment",
    y = "Score",
    fill = "Metric"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
    plot.title = element_text(size = 14, face = "bold")
  )

Conclusion:

Among all models, the Random Forest (Tuned) achieved the highest Recall (0.865), which is the most important metric for our goal of identifying as many potential subscribers as possible, while also posting the best F1 Score (0.627) and AUC (0.938). Its slightly lower Precision relative to the default Random Forest is an acceptable trade-off, as the business cost of missing a potential customer is higher than contacting a non-interested one.

The Decision Tree models remain attractive for their interpretability, but the default and tuned versions produced identical test-set metrics, with noticeably lower Recall (0.654) and Precision (0.358) than the ensemble methods. AdaBoost improved on the Decision Tree across the board yet did not match the Random Forest on any metric.

Given the business objective of maximizing customer acquisition through higher Recall, the Random Forest (Tuned) model is the most appropriate choice, with the default Random Forest as a close runner-up when faster training or simpler tuning is preferred.