Loading Libraries

library(readr)
library(stringr)
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
library(corrplot)
## corrplot 0.92 loaded
library(PerformanceAnalytics)
## Warning: package 'PerformanceAnalytics' was built under R version 4.3.3
## Loading required package: xts
## Warning: package 'xts' was built under R version 4.3.3
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## ######################### Warning from 'xts' package ##########################
## #                                                                             #
## # The dplyr lag() function breaks how base R's lag() function is supposed to  #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or       #
## # source() into this session won't work correctly.                            #
## #                                                                             #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop           #
## # dplyr from breaking base R's lag() function.                                #
## #                                                                             #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning.  #
## #                                                                             #
## ###############################################################################
## 
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## 
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
## 
##     legend
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggthemes)
library(purrr)
library(tidyr)
library(readr)

Importing the Data

In the following code chunks, I start by importing the bank marketing dataset and cleaning up its column names by removing unwanted characters and spaces. Then, I convert the appropriate columns to factors for categorical data and ensure that numeric columns are properly formatted, setting the stage for the subsequent exploratory analysis.

Read the CSV File

I begin by importing the bank marketing dataset from “bank-additional.csv”, specifying the use of semicolons as delimiters and letting R automatically assign data types to each column. Next, I use glimpse() to inspect the structure of the dataset—reviewing column names, data types, and a preview of the values—to ensure everything has been read in correctly.

bank_data <- read_delim("bank-additional.csv", delim = ";", col_types = cols())
glimpse(bank_data)
## Rows: 4,119
## Columns: 21
## $ age            <dbl> 30, 39, 25, 38, 47, 32, 32, 41, 31, 35, 25, 36, 36, 47,…
## $ job            <chr> "blue-collar", "services", "services", "services", "adm…
## $ marital        <chr> "married", "single", "married", "married", "married", "…
## $ education      <chr> "basic.9y", "high.school", "high.school", "basic.9y", "…
## $ default        <chr> "no", "no", "no", "no", "no", "no", "no", "unknown", "n…
## $ housing        <chr> "yes", "no", "yes", "unknown", "yes", "no", "yes", "yes…
## $ loan           <chr> "no", "no", "no", "unknown", "no", "no", "no", "no", "n…
## $ contact        <chr> "cellular", "telephone", "telephone", "telephone", "cel…
## $ month          <chr> "may", "may", "jun", "jun", "nov", "sep", "sep", "nov",…
## $ day_of_week    <chr> "fri", "fri", "wed", "fri", "mon", "thu", "mon", "mon",…
## $ duration       <dbl> 487, 346, 227, 17, 58, 128, 290, 44, 68, 170, 301, 148,…
## $ campaign       <dbl> 2, 4, 1, 3, 1, 3, 4, 2, 1, 1, 1, 1, 2, 2, 2, 2, 6, 4, 2…
## $ pdays          <dbl> 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, …
## $ previous       <dbl> 0, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ poutcome       <chr> "nonexistent", "nonexistent", "nonexistent", "nonexiste…
## $ emp.var.rate   <dbl> -1.8, 1.1, 1.4, 1.4, -0.1, -1.1, -1.1, -0.1, -0.1, 1.1,…
## $ cons.price.idx <dbl> 92.893, 93.994, 94.465, 94.465, 93.200, 94.199, 94.199,…
## $ cons.conf.idx  <dbl> -46.2, -36.4, -41.8, -41.8, -42.0, -37.5, -37.5, -42.0,…
## $ euribor3m      <dbl> 1.313, 4.855, 4.962, 4.959, 4.191, 0.884, 0.879, 4.191,…
## $ nr.employed    <dbl> 5099.1, 5191.0, 5228.1, 5228.1, 5195.8, 4963.6, 4963.6,…
## $ y              <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "…

Cleaning Column Names

Next, I create a custom function called clean_column_names() to sanitize the dataset’s column names:

  • Periods (.) are replaced with underscores (_) to standardize the delimiters.
  • Hyphens (-) are replaced with underscores to further enhance consistency.
  • Spaces are removed from the column names to prevent issues during later data manipulation.

Finally, I apply clean_column_names() to the dataset, ensuring that all column names are clean and uniform for further analysis.

clean_column_names <- function(dataset) {
  colnames(dataset) <- str_replace_all(colnames(dataset), "\\.", "_")  # Replace '.' with '_'
  colnames(dataset) <- str_replace_all(colnames(dataset), "-", "_")    # Replace '-' with '_'
  colnames(dataset) <- str_replace_all(colnames(dataset), " ", "")     # Remove spaces
  return(dataset)
}

bank_data <- clean_column_names(bank_data)

Convert Categorical Variables into Factors

  • A vector categorical_cols is created, listing the names of columns that hold categorical data.
  • The mutate(across(...)) function from dplyr is used to convert each of these columns to the factor data type.
categorical_cols <- c("job", "marital", "education", "default", "housing", 
                 "loan", "contact", "month", "poutcome", "y")

bank_data <- bank_data %>%
  mutate(across(all_of(categorical_cols), as.factor))

Convert Numerical Columns into Numeric Format

  • Similarly, I create a vector numeric_cols to identify the columns that should be numeric.
  • These columns are converted to numeric type using mutate(across(..., as.numeric)), ensuring proper arithmetic operations and statistical analysis later.
numeric_cols <- c("age", "duration", "campaign", "pdays", "previous")

bank_data <- bank_data %>%
  mutate(across(all_of(numeric_cols), as.numeric))

Inspect the Cleaned Data

We can see that our dataset contains 4,119 observations and 21 variables. It includes both categorical features (e.g., job, marital, education, default, housing, loan, contact, month, day_of_week, poutcome, and y) and numerical features (e.g., age, duration, campaign, pdays, previous, emp_var_rate, cons_price_idx, cons_conf_idx, euribor3m, and nr_employed). My next step is to use the summary() and table() functions to get a quick statistical overview of each variable. This helps identify potential data issues, outliers, and the general distribution of the dataset.
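
For reference, a minimal sketch of that overview (its output is not reproduced here) could look like the following, using the cleaned column names and the categorical_cols vector defined above:

# Quick statistical overview of the cleaned dataset
summary(bank_data)                           # Numeric summaries and factor counts for every column
table(bank_data$y)                           # Distribution of the target variable
lapply(bank_data[categorical_cols], table)   # Frequency tables for each categorical predictor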

Pre-processing for Experimentation

Missing Value Handling

Since the unknown values in housing and loan are relatively low (~2.55%), imputing them with the mode (most frequent value) ensures minimal distortion to the data distribution. On the other hand, default has a much higher proportion of unknowns (~19.50%), and replacing those with the most frequent category could introduce significant bias. Therefore, “unknown” in the default column is retained as a separate category, allowing the model to learn from any patterns associated with this response. This strategy preserves data integrity while reducing the risk of inappropriate imputation.

The code below handles missing values in the categorical variables default, housing, and loan in the bank_data dataset:

  1. Calculates and prints the percentage of “unknown” values (treated as missing) before any transformation.
  2. Replaces “unknown” with “missing” in the housing and loan columns (leaving default unchanged).
  3. Recalculates missing percentages after the replacement to validate the transformation.
  4. Imputes missing values in housing and loan using the mode (most frequent category).
  5. Recalculates and prints missing percentages after imputation to confirm that the values have been successfully handled.
# Load necessary libraries
library(dplyr)
library(tidyr)

# Function to calculate missing percentages
calculate_missing_percentage <- function(data, cols) {
  missing_summary <- data %>%
    summarise(across(all_of(cols), ~ mean(. == "unknown") * 100)) %>%
    pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing_Percentage")
  
  return(missing_summary)
}

# Calculate missing percentages before transformation
missing_before <- calculate_missing_percentage(bank_data, c("default", "housing", "loan"))
print(missing_before)
## # A tibble: 3 × 2
##   Variable Missing_Percentage
##   <chr>                 <dbl>
## 1 default               19.5 
## 2 housing                2.55
## 3 loan                   2.55
# Replace "unknown" with "missing" in categorical variables (only for housing and loan)
bank_data_imputed <- bank_data %>%
  mutate(across(c(housing, loan), ~ ifelse(. == "unknown", "missing", .)))

# Calculate missing percentages after replacing "unknown" with "missing"
missing_after_replace <- calculate_missing_percentage(bank_data_imputed, c("housing", "loan"))

# Impute Missing Values for housing and loan with Mode (Most Frequent Value)
for (col in c("housing", "loan")) {
  most_frequent <- names(which.max(table(bank_data_imputed[[col]])))
  bank_data_imputed[[col]][bank_data_imputed[[col]] == "missing"] <- most_frequent
}

# Calculate missing percentages after imputation
missing_after_imputation <- calculate_missing_percentage(bank_data_imputed, c("housing", "loan"))
print(missing_after_imputation)
## # A tibble: 2 × 2
##   Variable Missing_Percentage
##   <chr>                 <dbl>
## 1 housing                   0
## 2 loan                      0
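
One caveat worth noting: ifelse() applied to a factor column drops the factor labels and works with the underlying codes, which appears to be why housing and loan later show levels such as “1” and “3” instead of “no” and “yes” (see the glimpse() output further below). Tree-based models are unaffected, since the recoded levels stay internally consistent, but if the original labels were wanted downstream, a label-preserving sketch of the same replace-and-impute step (not the version actually used above) could look like this:

# Sketch: impute "unknown" with the mode while keeping the original "yes"/"no" labels
bank_data_labeled <- bank_data %>%
  mutate(across(c(housing, loan), ~ {
    x <- as.character(.x)                                              # work on labels, not factor codes
    x[x == "unknown"] <- names(which.max(table(x[x != "unknown"])))    # mode imputation
    factor(x)
  }))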

Check Feature Correlation (Relationships between different variables)

library(stats)
library(corrplot)

bank_data_imputed %>%
  keep(is.numeric) %>%  # Select only numeric variables for correlation analysis
  cor() %>%             # Compute correlation matrix
  corrplot()            # Plot correlation heatmap

The correlation plot visually illustrates the strength and direction of linear relationships between numerical features in the dataset. Here are key observations:

Strong Positive Correlations

  • euribor3m, emp_var_rate, and nr_employed:
    These three variables are highly positively correlated with each other (coefficients between roughly 0.90 and 0.97, as quantified below). This suggests that they contain redundant information and are likely driven by similar underlying economic conditions. Including all three in a model can cause multicollinearity, which is problematic especially for linear models.

Moderate to Weak Correlations

  • pdays and previous:
    These variables are moderately correlated, but the relationship is driven largely by the sentinel coding of pdays (999 = never previously contacted, which coincides with previous = 0), so the raw coefficient should be interpreted with care.

  • campaign and previous:
    Only a slight relationship; the correlation is weak, so any redundancy between these two variables is minimal.

Notable Redundancy Alert

  • The redundancy among emp_var_rate, euribor3m, and nr_employed justifies removing one or more of them during feature selection. They are flagged for removal below using correlation thresholds.

Low Correlations with age, campaign, cons_conf_idx

  • These variables appear to be more independent, which makes them informative in model building as they contribute unique signals.

Actionable Insight

  • Keep variables that are less correlated with others.
  • Drop or combine features that are highly correlated, especially in models sensitive to multicollinearity (e.g., logistic regression).
  • For tree-based models, strong correlations don’t necessarily harm performance but can increase complexity and computation time unnecessarily.

Dimensionality Reduction

Dimensionality reduction based on correlation does not hurt the performance of Decision Trees, Random Forest, or AdaBoost. While it is not strictly necessary for these models, it may slightly improve training efficiency and generalization, particularly for AdaBoost and Random Forest. We therefore carry out dimensionality reduction based on correlation.

# Load necessary libraries
library(caret)
## Warning: package 'caret' was built under R version 4.3.3
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(dplyr)

# Compute correlation matrix for numeric features in bank_data_imputed
cor_matrix <- cor(bank_data_imputed %>% dplyr::select(where(is.numeric)), use = "pairwise.complete.obs")

# Convert correlation matrix to a data frame for readability
cor_df <- as.data.frame(as.table(cor_matrix))

# Filter to show only correlations with |r| > 0.8 (excluding self-correlations)
significant_correlations <- cor_df %>%
  filter(Var1 != Var2, abs(Freq) > 0.8) %>%
  arrange(desc(abs(Freq)))

# Print only significant correlations
print("Significant Correlations (>|0.8|):")
## [1] "Significant Correlations (>|0.8|):"
print(significant_correlations)
##           Var1         Var2      Freq
## 1    euribor3m emp_var_rate 0.9703080
## 2 emp_var_rate    euribor3m 0.9703080
## 3  nr_employed    euribor3m 0.9425893
## 4    euribor3m  nr_employed 0.9425893
## 5  nr_employed emp_var_rate 0.8971732
## 6 emp_var_rate  nr_employed 0.8971732
# Find highly correlated variables (correlation > 0.85) for removal
highly_correlated_indices <- findCorrelation(cor_matrix, cutoff = 0.85)

# Get variable names of highly correlated features
highly_correlated_vars <- names(bank_data_imputed %>% dplyr::select(where(is.numeric)))[highly_correlated_indices]

# Explicitly add 'day_of_week' for removal due to lack of predictive power
removal_candidates <- c(highly_correlated_vars, "day_of_week")

# Ensure only existing variables are removed
removed_vars <- intersect(names(bank_data_imputed), removal_candidates)

# Remove highly correlated and non-informative variables from dataset
bank_data_reduced <- bank_data_imputed %>% dplyr::select(-all_of(removed_vars))

# Print the variables that were actually removed
print("Variables Removed from the Dataset:")
## [1] "Variables Removed from the Dataset:"
print(removed_vars)
## [1] "day_of_week"  "emp_var_rate" "euribor3m"

Feature Removal Summary:

The correlation analysis highlights several variables with strong pairwise relationships (|r| > 0.85), which could introduce multicollinearity and distort model interpretation. Specifically:

  • euribor3m and emp_var_rate exhibit an extremely high correlation (0.970), indicating a near-linear relationship between the Euro Interbank Offered Rate and the employment variation rate.
  • nr_employed is also strongly correlated with euribor3m (0.943) and emp_var_rate (0.897), showing a tight link between the number of employees and key economic indicators.

To reduce redundancy and mitigate multicollinearity, the variables euribor3m and emp_var_rate were removed from the dataset.

In addition, day_of_week was explicitly excluded due to its lack of predictive contribution. Its distribution shows minimal variation in the target variable (y) across different days, suggesting it provides little to no added value for the model and was therefore dropped to streamline the feature set.
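
A quick way to sanity-check that claim (a sketch run against bank_data_imputed, which still contains day_of_week) is to tabulate subscription rates by weekday; if the “yes” percentages are similar across days, the variable adds little discriminative signal:

# Percentage of "no"/"yes" responses within each weekday (rows sum to 100)
round(prop.table(table(bank_data_imputed$day_of_week, bank_data_imputed$y), margin = 1) * 100, 1)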

This feature selection approach helps improve model efficiency and interpretability by retaining only the most informative, non-redundant variables.

Feature Engineering

To enhance the dataset for predictive modeling while preserving the information carried by each predictor, the following transformations were carried out:

Handling Skewed Numeric Variables

Box-Cox Transformation for Duration

The Box-Cox transformation applied to the duration variable significantly improves its distribution, shifting it from a highly right-skewed shape to a more symmetric, bell-shaped form. This transformation is not required for tree-based models such as Decision Trees, Random Forest, and AdaBoost, as these models are invariant to skewness and do not assume normally distributed features.

That said, applying the Box-Cox transformation does not negatively affect the performance of tree-based models—it simply has no meaningful impact. Therefore, it is included here as part of a comprehensive preprocessing workflow, ensuring consistency and completeness in data treatment, especially if comparisons with linear models or other non-tree algorithms are considered later in the analysis.
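
For reference, for positive x the Box-Cox transform is (x^λ − 1)/λ when λ ≠ 0, and log(x) when λ = 0. The small sketch below (with an illustrative λ; the chunk that follows estimates λ from the data via BoxCox.lambda()) checks that forecast::BoxCox() matches this formula for positive inputs:

x <- c(17, 58, 227, 487) + 1                               # a few example durations, shifted as in the chunk below
lambda <- 0.3                                              # hypothetical lambda, for illustration only
manual <- (x^lambda - 1) / lambda
all.equal(manual, as.numeric(forecast::BoxCox(x, lambda))) # should be TRUE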

# Load necessary libraries
library(dplyr)
library(forecast)
## Warning: package 'forecast' was built under R version 4.3.3
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(e1071)
## Warning: package 'e1071' was built under R version 4.3.3
## 
## Attaching package: 'e1071'
## The following objects are masked from 'package:PerformanceAnalytics':
## 
##     kurtosis, skewness
library(tidyr)
library(ggplot2)

# Function to calculate skewness
calculate_skewness <- function(data, cols) {
  data %>%
    summarise(across(all_of(cols), ~ skewness(.x, na.rm = TRUE))) %>%
    pivot_longer(cols = everything(), names_to = "Variable", values_to = "Skewness")
}

# Create a copy to preserve original
bank_data_boxcox <- bank_data_reduced

# Select duration variable
numeric_var_duration <- "duration"

# Calculate skewness before transformation
skewness_before_duration <- calculate_skewness(bank_data_boxcox, numeric_var_duration)

# Apply Box-Cox Transformation for duration
x_adj_duration <- bank_data_boxcox$duration + 1  # Adjust for zero values
lambda_duration <- BoxCox.lambda(x_adj_duration, method = "loglik")
bank_data_boxcox$duration_boxcox <- BoxCox(x_adj_duration, lambda_duration)

# Compute Skewness After Transformation
skewness_after_duration <- calculate_skewness(bank_data_boxcox, "duration_boxcox")

# Merge Skewness Before and After
skewness_df_duration <- left_join(
  skewness_before_duration,
  skewness_after_duration,
  by = "Variable",
  suffix = c("_Before", "_After")
)

# Visualize Distribution Before and After Transformation
bank_data_boxcox %>%
  dplyr::select(all_of(c("duration", "duration_boxcox"))) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 50, fill = "black", alpha = 0.8) +
  facet_wrap(~ Variable, scales = "free") +
  theme_minimal() +
  ggtitle("Distributions Before and After Box-Cox Transformation for Duration")

  1. Preserving month as a Factor Instead of Seasonal Binning
    • Why? Retaining month allows models to capture variations in client behavior across different months.
bank_data_boxcox$month <- as.factor(bank_data_boxcox$month)
  2. Handling Low-Frequency Categories in job
    • Why? Grouping low-frequency job categories using fct_lump() is not required for tree-based models, which can naturally handle high-cardinality categorical variables. However, it can be beneficial if some job levels are extremely rare and introduce noise or overfitting, especially in AdaBoost, which is more sensitive to such instability.
if (!requireNamespace("forcats", quietly = TRUE)) {
  install.packages("forcats")
}
library(forcats)

bank_data_boxcox$job <- fct_lump(bank_data_boxcox$job, prop = 0.02)
  3. Transforming pdays Into a More Useful Feature
    • New feature: contacted_before (1 if previously contacted, 0 if never contacted).
    • Why? Since pdays = 999 means never contacted before, this transformation simplifies model learning.
bank_data_boxcox <- bank_data_boxcox %>%
 mutate(contacted_before = ifelse(pdays == 999, 0, 1)) %>%
 dplyr::select(-pdays)  # Remove `pdays`
  4. Adding Interaction Features to Capture Relationships
    • previous_contacts_ratio: Measures prior engagement by normalizing the number of previous contacts relative to campaign attempts.
    • loan_housing_combo: Encodes the combination of loan and housing into a single categorical variable.
    • Why? Creating interaction features like previous_contacts_ratio and loan_housing_combo is not required for tree-based models, as they can often discover such interactions on their own through hierarchical splits. However, explicitly adding these features can still be beneficial when the interaction is meaningful and non-obvious, potentially improving model performance or interpretability.
bank_data_boxcox <- bank_data_boxcox %>%
 mutate(
   previous_contacts_ratio = as.numeric(as.character(previous)) / (as.numeric(as.character(campaign)) + 1),
   loan_housing_combo = paste0(loan, "_", housing)
 )
  5. Converting Categorical Variables to Factors Before One-Hot Encoding
    • Why? Converting categorical variables to factors before one-hot encoding is necessary in R. It ensures that encoding functions (like those in fastDummies, caret, or recipes) correctly recognize and process these variables as categorical, avoiding unexpected errors or incorrect output.
categorical_vars <- c("contact", "default", "education", "housing", "job", 
                     "loan", "marital", "month", "loan_housing_combo", "poutcome")

bank_data_boxcox <- bank_data_boxcox %>%
 mutate(across(all_of(categorical_vars), as.factor))
  6. Encoding the Target Variable (y)
    • Why? Converting the target variable y to a binary factor (e.g., 0 = no, 1 = yes) is necessary for classification models, including Decision Trees, Random Forest, and AdaBoost. It ensures that the model treats the task as a classification problem rather than regression, and it maintains consistency across modeling functions in R.
bank_data_boxcox <- bank_data_boxcox %>%
 mutate(y = factor(ifelse(y == "yes", 1, 0)))
  7. Rearranging Columns: Moving y to the Last Column
    • Why? Placing the target variable last helps with readability and prevents unintended modifications.
bank_data_boxcox <- bank_data_boxcox %>%
  dplyr::select(-y, everything(), y)
  8. Cleaning Column Names
    • Why? Replacing - and . ensures compatibility with certain ML libraries that do not support special characters.
colnames(bank_data_boxcox) <- gsub("-", "_", colnames(bank_data_boxcox))
colnames(bank_data_boxcox) <- gsub("\\.", "_", colnames(bank_data_boxcox))
  9. Ensuring y Remains a Factor
    • Why? Ensuring y remains a factor avoids issues during classification.
bank_data_boxcox <- bank_data_boxcox %>%
  mutate(y = factor(y, levels = c(0, 1)))


glimpse(bank_data_boxcox)
## Rows: 4,119
## Columns: 21
## $ age                     <dbl> 30, 39, 25, 38, 47, 32, 32, 41, 31, 35, 25, 36…
## $ job                     <fct> blue-collar, services, services, services, adm…
## $ marital                 <fct> married, single, married, married, married, si…
## $ education               <fct> basic.9y, high.school, high.school, basic.9y, …
## $ default                 <fct> no, no, no, no, no, no, no, unknown, no, unkno…
## $ housing                 <fct> 3, 1, 3, 3, 3, 1, 3, 3, 1, 1, 3, 1, 1, 3, 1, 1…
## $ loan                    <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ contact                 <fct> cellular, telephone, telephone, telephone, cel…
## $ month                   <fct> may, may, jun, jun, nov, sep, sep, nov, nov, m…
## $ duration                <dbl> 487, 346, 227, 17, 58, 128, 290, 44, 68, 170, …
## $ campaign                <dbl> 2, 4, 1, 3, 1, 3, 4, 2, 1, 1, 1, 1, 2, 2, 2, 2…
## $ previous                <dbl> 0, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…
## $ poutcome                <fct> nonexistent, nonexistent, nonexistent, nonexis…
## $ cons_price_idx          <dbl> 92.893, 93.994, 94.465, 94.465, 93.200, 94.199…
## $ cons_conf_idx           <dbl> -46.2, -36.4, -41.8, -41.8, -42.0, -37.5, -37.…
## $ nr_employed             <dbl> 5099.1, 5191.0, 5228.1, 5228.1, 5195.8, 4963.6…
## $ duration_boxcox         <dbl> 10.205533, 9.364243, 8.385498, 3.618223, 5.622…
## $ contacted_before        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ previous_contacts_ratio <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.5, 0…
## $ loan_housing_combo      <fct> 1_3, 1_1, 1_3, 1_3, 1_3, 1_1, 1_3, 1_3, 1_1, 1…
## $ y                       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

Sampling Data

We are using stratified random sampling because it ensures that the proportion of each class in the target variable is maintained in both the training and test datasets. This is especially important for classification problems with imbalanced classes, as it leads to more reliable model training and evaluation by preserving the underlying class distribution.

# Load caret package
library(caret)

# Stratified sampling with 75% training data
set.seed(1234)
trainIndex <- createDataPartition(bank_data_boxcox$y, p = 0.75, list = FALSE)

# Split data based on stratified sampling
train_data <- bank_data_boxcox[trainIndex, ]
test_data <- bank_data_boxcox[-trainIndex, ]
test_data_ada <- test_data


# Verify class distribution remains consistent
round(prop.table(table(train_data$y)) * 100, 2)
## 
##     0     1 
## 89.03 10.97
round(prop.table(table(test_data$y)) * 100, 2)
## 
##     0     1 
## 89.12 10.88

Handling Imbalanced Data: Upsampling (Safe Given the Data Size)

We used stratified random sampling to split the data while preserving class distribution, and then applied upsampling to balance the training set by increasing the minority class, ensuring fair and effective model training.

library(caret)

# Convert target to factor (if not already)
train_data$y <- as.factor(train_data$y)

# Apply upsampling
set.seed(123)
up_train <- upSample(x = train_data[, -which(names(train_data) == "y")],
                     y = train_data$y)


# Rename target back to 'y' (caret names it 'Class')
names(up_train)[ncol(up_train)] <- "y"

# Check the new class distribution after upsampling
up_train_ada <- up_train
table(up_train$y)
## 
##    0    1 
## 2751 2751
table(test_data$y)
## 
##   0   1 
## 917 112
glimpse(train_data)
## Rows: 3,090
## Columns: 21
## $ age                     <dbl> 39, 25, 38, 47, 32, 32, 31, 35, 25, 36, 36, 29…
## $ job                     <fct> services, services, services, admin., services…
## $ marital                 <fct> single, married, married, married, single, sin…
## $ education               <fct> high.school, high.school, basic.9y, university…
## $ default                 <fct> no, no, no, no, no, no, no, unknown, unknown, …
## $ housing                 <fct> 1, 3, 3, 3, 1, 3, 1, 1, 3, 1, 1, 1, 1, 1, 3, 3…
## $ loan                    <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3…
## $ contact                 <fct> telephone, telephone, telephone, cellular, cel…
## $ month                   <fct> may, jun, jun, nov, sep, sep, nov, may, jul, j…
## $ duration                <dbl> 346, 227, 17, 58, 128, 290, 68, 170, 301, 148,…
## $ campaign                <dbl> 4, 1, 3, 1, 3, 4, 1, 1, 1, 1, 2, 2, 2, 6, 2, 3…
## $ previous                <dbl> 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ poutcome                <fct> nonexistent, nonexistent, nonexistent, nonexis…
## $ cons_price_idx          <dbl> 93.994, 94.465, 94.465, 93.200, 94.199, 94.199…
## $ cons_conf_idx           <dbl> -36.4, -41.8, -41.8, -42.0, -37.5, -37.5, -42.…
## $ nr_employed             <dbl> 5191.0, 5228.1, 5228.1, 5195.8, 4963.6, 4963.6…
## $ duration_boxcox         <dbl> 9.364243, 8.385498, 3.618223, 5.622899, 7.1529…
## $ contacted_before        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ previous_contacts_ratio <dbl> 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.…
## $ loan_housing_combo      <fct> 1_1, 1_3, 1_3, 1_3, 1_1, 1_3, 1_1, 1_1, 1_3, 1…
## $ y                       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

Experimentation

Decision Tree Experiments (R with rpart)

Load Required Packages

# Load required packages (install any that are missing beforehand)
packages <- c("caret", "rpart", "randomForest", "adabag", "pROC", "e1071")
lapply(packages, require, character.only = TRUE)
## Loading required package: rpart
## Loading required package: randomForest
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
## Loading required package: adabag
## Warning: package 'adabag' was built under R version 4.3.3
## Loading required package: foreach
## 
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
## 
##     accumulate, when
## Loading required package: doParallel
## Loading required package: iterators
## Loading required package: parallel
## Loading required package: pROC
## Warning: package 'pROC' was built under R version 4.3.3
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
## [[1]]
## [1] TRUE
## 
## [[2]]
## [1] TRUE
## 
## [[3]]
## [1] TRUE
## 
## [[4]]
## [1] TRUE
## 
## [[5]]
## [1] TRUE
## 
## [[6]]
## [1] TRUE

Using the Gini Index (Default Splitting Criterion)

Evaluation Function for Classification Models

To evaluate the performance of each classification model on the original (imbalanced) test set, we define a reusable function. This function computes key metrics:

  • Accuracy: Overall correct predictions.
  • F1-Score: Harmonic mean of precision and recall for the positive class (“yes”).
  • AUC (Area Under the ROC Curve): Measures the model’s ability to distinguish between classes.
  • Confusion Matrix: Displays the breakdown of predicted vs. actual classes.

This function accepts a trained model and test dataset, ensures consistent class labels, and returns a standardized list of performance results. It will be used across all experiments for consistent evaluation.

# Ensure consistent class labels
up_train$y <- factor(up_train$y, levels = c(0, 1), labels = c("no", "yes"))
test_data$y <- factor(test_data$y, levels = c(0, 1), labels = c("no", "yes"))

train_data$y <- factor(train_data$y, levels = c(0, 1), labels = c("no", "yes"))

evaluate_tree_model <- function(model, test_data) {
  pred_class <- predict(model, test_data, type = "class")
  pred_prob <- predict(model, test_data, type = "prob")[, "yes"]

  cm <- confusionMatrix(pred_class, test_data$y, positive = "yes")
  roc_obj <- roc(test_data$y, pred_prob)
  auc_val <- auc(roc_obj)

  list(
    Accuracy = cm$overall["Accuracy"],
    F1 = cm$byClass["F1"],
    AUC = auc_val,
    Matrix = cm
  )
}

Pre-Upsampling Baseline Model

Experiment 1: Pre-Upsampling Baseline Decision Tree (Default Settings)

Objective

Evaluate how a Decision Tree trained on imbalanced data performs without any class balancing or tuning. This helps establish a baseline for comparison.

Changes vs Controls

  • Changes: Use rpart() on the original imbalanced dataset
  • Controls: Same test set, same evaluation function

Metrics

Accuracy, F1-score, AUC

# Train baseline Decision Tree on imbalanced data
tree_model_imbalanced <- rpart(y ~ ., data = train_data, method = "class")

# Evaluate on original (imbalanced) test set
result_imbalanced <- evaluate_tree_model(tree_model_imbalanced, test_data)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
# Output results
print(result_imbalanced$Matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  886  62
##        yes  31  50
##                                           
##                Accuracy : 0.9096          
##                  95% CI : (0.8904, 0.9264)
##     No Information Rate : 0.8912          
##     P-Value [Acc > NIR] : 0.029614        
##                                           
##                   Kappa : 0.4697          
##                                           
##  Mcnemar's Test P-Value : 0.001865        
##                                           
##             Sensitivity : 0.44643         
##             Specificity : 0.96619         
##          Pos Pred Value : 0.61728         
##          Neg Pred Value : 0.93460         
##              Prevalence : 0.10884         
##          Detection Rate : 0.04859         
##    Detection Prevalence : 0.07872         
##       Balanced Accuracy : 0.70631         
##                                           
##        'Positive' Class : yes             
## 
cat("Accuracy:", result_imbalanced$Accuracy, "\n")
## Accuracy: 0.909621
cat("F1-Score:", result_imbalanced$F1, "\n")
## F1-Score: 0.5181347
cat("AUC:", result_imbalanced$AUC, "\n")
## AUC: 0.9014352
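
As a quick sanity check (not part of the original workflow), the reported F1 can be recomputed by hand from the confusion matrix above, where the positive class is “yes”:

# From the confusion matrix: TP = 50, FP = 31, FN = 62
precision <- 50 / (50 + 31)                     # 0.617, matches Pos Pred Value
recall    <- 50 / (50 + 62)                     # 0.446, matches Sensitivity
2 * precision * recall / (precision + recall)   # ~0.518, matches the reported F1-Score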

Experiment 2: Baseline Decision Tree (Upsampled, Full Depth)

Objective

Evaluate the performance of a fully grown Decision Tree trained on upsampled data. This provides a stronger baseline to compare against more regularized or tuned models.

Changes vs Controls

  • Changes: Training data is upsampled for class balance; tree grows without restriction (full depth)
  • Controls: Same features, same test set, same evaluation metrics

Metrics

Accuracy, F1-score, AUC

tree_model_baseline <- rpart(y ~ ., data = up_train, method = "class")
result_baseline <- evaluate_tree_model(tree_model_baseline, test_data)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
print(result_baseline$Matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  735   5
##        yes 182 107
##                                           
##                Accuracy : 0.8183          
##                  95% CI : (0.7933, 0.8414)
##     No Information Rate : 0.8912          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.4469          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9554          
##             Specificity : 0.8015          
##          Pos Pred Value : 0.3702          
##          Neg Pred Value : 0.9932          
##              Prevalence : 0.1088          
##          Detection Rate : 0.1040          
##    Detection Prevalence : 0.2809          
##       Balanced Accuracy : 0.8784          
##                                           
##        'Positive' Class : yes             
## 
cat("Accuracy:", result_baseline$Accuracy, "\n")
## Accuracy: 0.8182702
cat("F1-Score:", result_baseline$F1, "\n")
## F1-Score: 0.5336658
cat("AUC:", result_baseline$AUC, "\n")
## AUC: 0.8913139

Experiment 3: Shallow Decision Tree (maxdepth = 4)

Objective

Assess the impact of limiting tree depth on model performance. The hypothesis is that a shallower tree might generalize better by reducing overfitting.

Changes vs Controls

  • Changes: Set maxdepth = 4 to constrain tree complexity
  • Controls: Same upsampled training data and test set, same features and evaluation process

Metrics

Accuracy, F1-score, AUC

# Decision Tree with maxdepth = 4
tree_model_1 <- rpart(y ~ ., data = up_train, method = "class",
                      control = rpart.control(maxdepth = 4))

result_1 <- evaluate_tree_model(tree_model_1, test_data)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
# Print metrics (optional)
print(result_1$Matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  735   5
##        yes 182 107
##                                           
##                Accuracy : 0.8183          
##                  95% CI : (0.7933, 0.8414)
##     No Information Rate : 0.8912          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.4469          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9554          
##             Specificity : 0.8015          
##          Pos Pred Value : 0.3702          
##          Neg Pred Value : 0.9932          
##              Prevalence : 0.1088          
##          Detection Rate : 0.1040          
##    Detection Prevalence : 0.2809          
##       Balanced Accuracy : 0.8784          
##                                           
##        'Positive' Class : yes             
## 
cat("Accuracy:", result_1$Accuracy, "\n")
## Accuracy: 0.8182702
cat("F1-Score:", result_1$F1, "\n")
## F1-Score: 0.5336658
cat("AUC:", result_1$AUC, "\n")
## AUC: 0.8913139

Experiment 4: Tune maxdepth via Cross-Validation

Objective

Test how limiting the tree depth using cross-validation improves generalization. The hypothesis is that optimizing maxdepth can balance bias and variance better than using a fully grown tree.

Changes vs Controls

  • Changes: Grid search over maxdepth = 4, 6, 8 using 5-fold cross-validation
  • Controls: Same upsampled training data and evaluation metrics

Metrics

Accuracy, F1-score, AUC

# Tune maxdepth (CV)
tune_grid <- expand.grid(maxdepth = c(4, 6, 8))

tree_cv_maxdepth <- train(
  y ~ ., data = up_train,
  method = "rpart2",
  trControl = trainControl(method = "cv", number = 5, classProbs = TRUE, summaryFunction = twoClassSummary),
  metric = "ROC",
  tuneGrid = tune_grid
)

# Predict and evaluate
pred_class_maxdepth <- predict(tree_cv_maxdepth, test_data)
pred_prob_maxdepth <- predict(tree_cv_maxdepth, test_data, type = "prob")[, "yes"]

cm_maxdepth <- confusionMatrix(pred_class_maxdepth, test_data$y, positive = "yes")
roc_maxdepth <- pROC::roc(test_data$y, pred_prob_maxdepth)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
auc_maxdepth <- pROC::auc(roc_maxdepth)

# Print
print(cm_maxdepth)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  738   6
##        yes 179 106
##                                           
##                Accuracy : 0.8202          
##                  95% CI : (0.7954, 0.8432)
##     No Information Rate : 0.8912          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.4477          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9464          
##             Specificity : 0.8048          
##          Pos Pred Value : 0.3719          
##          Neg Pred Value : 0.9919          
##              Prevalence : 0.1088          
##          Detection Rate : 0.1030          
##    Detection Prevalence : 0.2770          
##       Balanced Accuracy : 0.8756          
##                                           
##        'Positive' Class : yes             
## 
cat("Accuracy:", cm_maxdepth$overall["Accuracy"], "\n")
## Accuracy: 0.8202138
cat("F1-Score:", cm_maxdepth$byClass["F1"], "\n")
## F1-Score: 0.534005
cat("AUC:", auc_maxdepth, "\n")
## AUC: 0.8885535

Experiment 5: Tune cp (Complexity Parameter) via Cross-Validation

Objective

Evaluate how pruning the tree using the complexity parameter (cp) affects performance. The hypothesis is that tuning cp prevents overfitting and enhances generalization by removing unnecessary splits.

Changes vs Controls

  • Changes: Grid search over cp = 0.01, 0.005, 0.001 using 5-fold cross-validation
  • Controls: Upsampled training data, fixed feature set, same evaluation metrics

Metrics

Accuracy, F1-score, AUC

# Load caret again if needed
library(caret)

# Grid for cp
tune_grid_cp <- expand.grid(cp = c(0.01, 0.005, 0.001))

# Train model
tree_cv_cp <- train(
  y ~ ., data = up_train,
  method = "rpart",
  trControl = trainControl(
    method = "cv",
    number = 5,
    classProbs = TRUE,
    summaryFunction = twoClassSummary
  ),
  metric = "ROC",
  tuneGrid = tune_grid_cp
)

# Predict on test set
pred_class_cp <- predict(tree_cv_cp, test_data)
pred_prob_cp <- predict(tree_cv_cp, test_data, type = "prob")[, "yes"]

# Evaluate
cm_cp <- confusionMatrix(pred_class_cp, test_data$y, positive = "yes")
roc_cp <- pROC::roc(test_data$y, pred_prob_cp)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
auc_cp <- pROC::auc(roc_cp)

# Output
print(cm_cp)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  806  19
##        yes 111  93
##                                           
##                Accuracy : 0.8737          
##                  95% CI : (0.8518, 0.8934)
##     No Information Rate : 0.8912          
##     P-Value [Acc > NIR] : 0.9658          
##                                           
##                   Kappa : 0.5213          
##                                           
##  Mcnemar's Test P-Value : 1.449e-15       
##                                           
##             Sensitivity : 0.83036         
##             Specificity : 0.87895         
##          Pos Pred Value : 0.45588         
##          Neg Pred Value : 0.97697         
##              Prevalence : 0.10884         
##          Detection Rate : 0.09038         
##    Detection Prevalence : 0.19825         
##       Balanced Accuracy : 0.85466         
##                                           
##        'Positive' Class : yes             
## 
cat("Accuracy:", cm_cp$overall["Accuracy"], "\n")
## Accuracy: 0.8736638
cat("F1-Score:", cm_cp$byClass["F1"], "\n")
## F1-Score: 0.5886076
cat("AUC:", auc_cp, "\n")
## AUC: 0.8753749

Experiment 6: Pre-Upsampling Baseline (Entropy)

Objective

Evaluate how a Decision Tree using Entropy (Information Gain) performs on the imbalanced dataset. The goal is to compare this splitting criterion against Gini in the same imbalanced setting.

Changes vs Controls

  • Changes: Set split = "information" (Entropy) in rpart()
  • Controls: Use original imbalanced training data; no tuning applied

Metrics

Accuracy, F1-score, AUC

# Train baseline Decision Tree on imbalanced data with entropy
tree_model_imbalanced_entropy <- rpart(
  y ~ ., data = train_data,
  method = "class",
  parms = list(split = "information")
)

# Evaluate on test set
result_imbalanced_entropy <- evaluate_tree_model(tree_model_imbalanced_entropy, test_data)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
# Output results
print(result_imbalanced_entropy$Matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  883  62
##        yes  34  50
##                                           
##                Accuracy : 0.9067          
##                  95% CI : (0.8873, 0.9238)
##     No Information Rate : 0.8912          
##     P-Value [Acc > NIR] : 0.058038        
##                                           
##                   Kappa : 0.4598          
##                                           
##  Mcnemar's Test P-Value : 0.005857        
##                                           
##             Sensitivity : 0.44643         
##             Specificity : 0.96292         
##          Pos Pred Value : 0.59524         
##          Neg Pred Value : 0.93439         
##              Prevalence : 0.10884         
##          Detection Rate : 0.04859         
##    Detection Prevalence : 0.08163         
##       Balanced Accuracy : 0.70468         
##                                           
##        'Positive' Class : yes             
## 
cat("Accuracy:", result_imbalanced_entropy$Accuracy, "\n")
## Accuracy: 0.9067055
cat("F1-Score:", result_imbalanced_entropy$F1, "\n")
## F1-Score: 0.5102041
cat("AUC:", result_imbalanced_entropy$AUC, "\n")
## AUC: 0.9102859

Experiment 7: Baseline (Upsampled, Entropy)

Objective

Evaluate how a Decision Tree using Entropy performs when trained on upsampled data. This explores the combined benefit of balancing the dataset and using information gain.

Changes vs Controls

  • Changes: Used split = "information" in rpart()
  • Controls: Same upsampled training set and test set used previously

Metrics

Accuracy, F1-score, AUC

# Train default tree on upsampled data with entropy
tree_model_baseline_entropy <- rpart(
  y ~ ., data = up_train,
  method = "class",
  parms = list(split = "information")
)

result_baseline_entropy <- evaluate_tree_model(tree_model_baseline_entropy, test_data)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
# Output results
print(result_baseline_entropy$Matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  750   9
##        yes 167 103
##                                           
##                Accuracy : 0.829           
##                  95% CI : (0.8045, 0.8515)
##     No Information Rate : 0.8912          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.4555          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9196          
##             Specificity : 0.8179          
##          Pos Pred Value : 0.3815          
##          Neg Pred Value : 0.9881          
##              Prevalence : 0.1088          
##          Detection Rate : 0.1001          
##    Detection Prevalence : 0.2624          
##       Balanced Accuracy : 0.8688          
##                                           
##        'Positive' Class : yes             
## 
cat("Accuracy:", result_baseline_entropy$Accuracy, "\n")
## Accuracy: 0.8289602
cat("F1-Score:", result_baseline_entropy$F1, "\n")
## F1-Score: 0.539267
cat("AUC:", result_baseline_entropy$AUC, "\n")
## AUC: 0.9123549

Experiment 8: Tune cp Only (CV) with Entropy

Objective

Assess whether hyperparameter tuning (cp) combined with the Entropy splitting criterion improves predictive performance.

Changes vs Controls

  • Changes: Used split = "information" and a grid search on cp using 5-fold cross-validation
  • Controls: Upsampled training data and test set remain the same

Variation

  • Hyperparameter optimization (cp)
  • Cross-validation (k = 5)
  • Entropy as splitting criterion

Metrics

Accuracy, F1-score, AUC

library(caret)

# Grid for cp with entropy
tune_grid_entropy <- expand.grid(cp = c(0.01, 0.005, 0.001))

tree_cv_entropy <- train(
  y ~ ., data = up_train,
  method = "rpart",
  trControl = trainControl(method = "cv", number = 5, classProbs = TRUE, summaryFunction = twoClassSummary),
  metric = "ROC",
  tuneGrid = tune_grid_entropy,
  parms = list(split = "information")  # Set entropy
)

# Predict on test set
pred_class_entropy <- predict(tree_cv_entropy, test_data)
pred_prob_entropy <- predict(tree_cv_entropy, test_data, type = "prob")[, "yes"]

# Evaluate
cm_entropy <- confusionMatrix(pred_class_entropy, test_data$y, positive = "yes")
roc_obj_entropy <- pROC::roc(test_data$y, pred_prob_entropy)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
auc_val_entropy <- pROC::auc(roc_obj_entropy)

# Output results
print(cm_entropy)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  802  21
##        yes 115  91
##                                           
##                Accuracy : 0.8678          
##                  95% CI : (0.8456, 0.8879)
##     No Information Rate : 0.8912          
##     P-Value [Acc > NIR] : 0.9916          
##                                           
##                   Kappa : 0.5021          
##                                           
##  Mcnemar's Test P-Value : 1.528e-15       
##                                           
##             Sensitivity : 0.81250         
##             Specificity : 0.87459         
##          Pos Pred Value : 0.44175         
##          Neg Pred Value : 0.97448         
##              Prevalence : 0.10884         
##          Detection Rate : 0.08844         
##    Detection Prevalence : 0.20019         
##       Balanced Accuracy : 0.84355         
##                                           
##        'Positive' Class : yes             
## 
cat("Accuracy:", cm_entropy$overall["Accuracy"], "\n")
## Accuracy: 0.8678328
cat("F1-Score:", cm_entropy$byClass["F1"], "\n")
## F1-Score: 0.572327
cat("AUC:", auc_val_entropy, "\n")
## AUC: 0.8750243

Comparison of Decision Tree Experiments (Gini vs Entropy)

This code compiles all Decision Tree experiment results—across both Gini and Entropy criteria—into a single comparison data frame. It captures each model’s configuration and performance (Accuracy, F1, AUC) along with experiment notes, enabling easy side-by-side evaluation.

# Create combined comparison results data frame
dt_results <- data.frame(
  Model = c(
    "Pre-Upsampling Baseline (Gini)",
    "Baseline (Upsampled, Gini)",
    "Experiment 1 (Shallow, Gini)",
    "Tune maxdepth (CV, Gini)",
    "Tune cp (CV, Gini)",
    "Pre-Upsampling Baseline (Entropy)",
    "Baseline (Upsampled, Entropy)",
    "Tune cp (CV, Entropy)"
  ),
  maxdepth = c(
    "∞", "∞", "4", tree_cv_maxdepth$bestTune$maxdepth, NA,
    "∞", "∞", NA
  ),
  Criterion = c(
    "Gini", "Gini", "Gini", "Gini", "Gini",
    "Entropy", "Entropy", "Entropy"
  ),
  Accuracy = c(
    result_imbalanced$Accuracy,
    result_baseline$Accuracy,
    result_1$Accuracy,
    cm_maxdepth$overall["Accuracy"],
    cm_cp$overall["Accuracy"],
    result_imbalanced_entropy$Accuracy,
    result_baseline_entropy$Accuracy,
    cm_entropy$overall["Accuracy"]
  ),
  F1 = c(
    result_imbalanced$F1,
    result_baseline$F1,
    result_1$F1,
    cm_maxdepth$byClass["F1"],
    cm_cp$byClass["F1"],
    result_imbalanced_entropy$F1,
    result_baseline_entropy$F1,
    cm_entropy$byClass["F1"]
  ),
  AUC = c(
    result_imbalanced$AUC,
    result_baseline$AUC,
    result_1$AUC,
    auc_maxdepth,
    auc_cp,
    result_imbalanced_entropy$AUC,
    result_baseline_entropy$AUC,
    auc_val_entropy
  ),
  Notes = c(
    "Trained on imbalanced data",
    "Trained on upsampled data",
    "Limited maxdepth to reduce overfitting",
    "Best maxdepth via CV",
    "Best cp via CV",
    "Trained on imbalanced data with entropy",
    "Trained on upsampled data with entropy",
    "Best cp via CV (entropy split)"
  )
)

# Print the combined comparison table
print(dt_results)
##                               Model maxdepth Criterion  Accuracy        F1
## 1    Pre-Upsampling Baseline (Gini)        ∞      Gini 0.9096210 0.5181347
## 2        Baseline (Upsampled, Gini)        ∞      Gini 0.8182702 0.5336658
## 3      Experiment 1 (Shallow, Gini)        4      Gini 0.8182702 0.5336658
## 4          Tune maxdepth (CV, Gini)        6      Gini 0.8202138 0.5340050
## 5                Tune cp (CV, Gini)     <NA>      Gini 0.8736638 0.5886076
## 6 Pre-Upsampling Baseline (Entropy)        ∞   Entropy 0.9067055 0.5102041
## 7     Baseline (Upsampled, Entropy)        ∞   Entropy 0.8289602 0.5392670
## 8             Tune cp (CV, Entropy)     <NA>   Entropy 0.8678328 0.5723270
##         AUC                                   Notes
## 1 0.9014352              Trained on imbalanced data
## 2 0.8913139               Trained on upsampled data
## 3 0.8913139  Limited maxdepth to reduce overfitting
## 4 0.8885535                    Best maxdepth via CV
## 5 0.8753749                          Best cp via CV
## 6 0.9102859 Trained on imbalanced data with entropy
## 7 0.9123549  Trained on upsampled data with entropy
## 8 0.8750243          Best cp via CV (entropy split)
# Ensure AUC is numeric in case it's stored as a custom class
dt_results$AUC <- as.numeric(dt_results$AUC)

# Load libraries
library(tidyr)
library(dplyr)
library(ggplot2)

# Reshape the data to long format
dt_long <- dt_results %>%
  pivot_longer(cols = c("Accuracy", "F1", "AUC"), names_to = "Metric", values_to = "Value")

# Plot the metrics
ggplot(dt_long, aes(x = Model, y = Value, color = Metric, group = Metric)) +
  geom_line(size = 1) +
  geom_point(size = 2) +
  labs(title = "Decision Tree Model Comparison (Accuracy, F1, AUC)",
       x = "Model", y = "Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Combined Insights: Gini vs. Entropy Decision Trees

  • Pre-Upsampling Baseline (Gini & Entropy):
    Both models achieve very high accuracy (91.0% for Gini, 90.7% for Entropy) and strong AUC scores (0.901 and 0.910, respectively), but they struggle with low F1-scores (~0.518 and 0.510), revealing poor performance in identifying the minority class. These models were trained on imbalanced data and lean heavily toward the majority class, leading to poor recall and limited real-world utility despite strong accuracy.

  • Baseline (Upsampled, Gini & Entropy):
    After upsampling, both models show improved F1-scores (0.534 for Gini, 0.539 for Entropy) and AUCs of 0.891 and 0.912, respectively, with markedly better recall on the minority class. Accuracy drops to roughly 82-83%, but the gain in balanced performance—especially in identifying “yes” responses—makes these models far more suitable than their imbalanced counterparts.

  • Shallow Tree (Gini, maxdepth = 4):
    This model limits tree depth to reduce overfitting and yields exactly the same accuracy, F1, and AUC as the upsampled Gini baseline (81.8% / 0.534 / 0.891), which suggests the default tree never grew beyond depth 4 in the first place. It keeps the model simple without sacrificing performance, though it is still not the top performer.

  • Tune maxdepth (CV, Gini):
    Cross-validated tuning of tree depth (best maxdepth = 6) produces essentially the same results as the shallow tree (Accuracy = 82.0%, F1 = 0.534, AUC = 0.889) and still underperforms the cp-tuned version. It is a reasonable mid-tier Gini model.

  • Tune cp (CV, Gini):
    This Gini-based model is the top performer within its group, achieving the highest accuracy (87.4%) and F1-score (0.589) among the balanced models, with a solid AUC (0.875). It demonstrates the best balance of precision and recall for Gini splits and is a strong candidate for deployment.

  • Tune cp (CV, Entropy):
    The strongest Entropy-based configuration on accuracy and F1 (Accuracy = 86.8%, F1 = 0.572), although its AUC (0.875) is lower than that of the upsampled Entropy baseline (0.912). It offers well-balanced performance, clearly outperforms the imbalanced entropy baseline at identifying subscribers, and rivals the best Gini-based model.

Final Verdict:

  • If you seek maximum overall performance, the Tune cp (CV, Gini) model stands out with the best combination of accuracy and F1, along with a competitive AUC.
  • For Entropy-based trees, the Tune cp (CV, Entropy) model delivers similar performance, making it a strong alternative.
  • Upsampling consistently improves model generalizability across both Gini and Entropy criteria.
  • Tuning hyperparameters (especially cp) proves more impactful than adjusting maxdepth alone.

Best Choice for Real-World Deployment:
Tune cp (CV, Gini) or Tune cp (CV, Entropy) — depending on the preferred split criterion. Both show strong, balanced, and reliable results.

Random Forest Experiments (with randomForest and caret)

We will conduct four Random Forest experiments. Each varies hyperparameters, evaluation setup, or data preparation, while consistently tracking performance via Accuracy, F1, and AUC on the same test set.

Evaluation Function (same format as DT)

evaluate_rf_model <- function(model, test_data) {
  pred_prob <- predict(model, test_data, type = "prob")[, "yes"]
  pred_class <- predict(model, test_data, type = "response")

  cm <- confusionMatrix(pred_class, test_data$y, positive = "yes")
  roc_obj <- roc(test_data$y, pred_prob)
  auc_val <- auc(roc_obj)

  list(
    Accuracy = cm$overall["Accuracy"],
    F1 = cm$byClass["F1"],
    AUC = auc_val,
    Matrix = cm
  )
}

Experiment 1: Baseline Random Forest (Default Settings)

Objective

Evaluate out-of-the-box performance of Random Forest without any tuning. Expect reasonable accuracy due to RF’s ensemble nature.

Changes vs Controls

  • Change: Use randomForest() with default mtry, ntree = 500
  • Control: Same upsampled training data and test set

Metrics

Accuracy, F1-score, AUC

rf_baseline <- randomForest(
  y ~ ., data = up_train,
  ntree = 500,
  importance = TRUE
)

result_rf_baseline <- evaluate_rf_model(rf_baseline, test_data)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
print(result_rf_baseline$Matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  865  38
##        yes  52  74
##                                           
##                Accuracy : 0.9125          
##                  95% CI : (0.8936, 0.9291)
##     No Information Rate : 0.8912          
##     P-Value [Acc > NIR] : 0.01374         
##                                           
##                   Kappa : 0.5726          
##                                           
##  Mcnemar's Test P-Value : 0.17059         
##                                           
##             Sensitivity : 0.66071         
##             Specificity : 0.94329         
##          Pos Pred Value : 0.58730         
##          Neg Pred Value : 0.95792         
##              Prevalence : 0.10884         
##          Detection Rate : 0.07191         
##    Detection Prevalence : 0.12245         
##       Balanced Accuracy : 0.80200         
##                                           
##        'Positive' Class : yes             
## 
cat("Accuracy:", result_rf_baseline$Accuracy, "\n")
## Accuracy: 0.9125364
cat("F1-Score:", result_rf_baseline$F1, "\n")
## F1-Score: 0.6218487
cat("AUC:", result_rf_baseline$AUC, "\n")
## AUC: 0.9330016

Experiment 2: Tune mtry via Cross-Validation

Objective

Optimize the number of variables considered at each split (mtry) via cross-validation. Hypothesis: an mtry below or above the default may strike a better bias-variance balance; see the sketch below.
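
For context, a small sketch (assuming up_train is available) of how the grid relates to the randomForest default, which for classification is mtry = floor(sqrt(p)):

# Default mtry for classification, with p = number of predictors (excluding y)
p <- ncol(up_train) - 1
floor(sqrt(p))        # 4 for this training set, matching the baseline model's mtry
c(2, 4, 6, 8)         # grid searched below: below, at, and above the default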

Changes vs Controls

  • Change: caret::train() grid search over mtry = c(2, 4, 6, 8)
  • Control: Keep upsampled data, default ntree = 500

Metrics

ROC, F1-score, AUC

# Load required libraries
library(caret)
library(doParallel)
library(pROC)

# Ensure y is a binary factor with levels: "no" and "yes"
up_train$y <- factor(up_train$y, levels = c("no", "yes"))
test_data$y <- factor(test_data$y, levels = c("no", "yes"))

# Setup parallel backend
cores <- parallel::detectCores() - 1  # leave one core free
cl <- makeCluster(cores)
registerDoParallel(cl)

# Define tuning grid
mtry_grid <- expand.grid(mtry = c(2, 4, 6, 8))

# Train using caret with CV and parallel processing
rf_cv_mtry <- train(
  y ~ ., data = up_train,
  method = "rf",
  trControl = trainControl(
    method = "cv", 
    number = 5, 
    classProbs = TRUE,
    summaryFunction = twoClassSummary,
    allowParallel = TRUE  # <- enables parallelism
  ),
  metric = "ROC",
  tuneGrid = mtry_grid
)

# Stop parallel backend after training
stopCluster(cl)
registerDoSEQ()

# Predict on test set
pred_class_mtry <- predict(rf_cv_mtry, test_data)
pred_prob_mtry <- predict(rf_cv_mtry, test_data, type = "prob")[, "yes"]

# Evaluate results
cm_mtry <- confusionMatrix(pred_class_mtry, test_data$y, positive = "yes")
roc_mtry <- roc(test_data$y, pred_prob_mtry)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
auc_mtry <- auc(roc_mtry)

# Output performance metrics
print(cm_mtry)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  863  41
##        yes  54  71
##                                           
##                Accuracy : 0.9077          
##                  95% CI : (0.8883, 0.9247)
##     No Information Rate : 0.8912          
##     P-Value [Acc > NIR] : 0.04685         
##                                           
##                   Kappa : 0.5472          
##                                           
##  Mcnemar's Test P-Value : 0.21826         
##                                           
##             Sensitivity : 0.6339          
##             Specificity : 0.9411          
##          Pos Pred Value : 0.5680          
##          Neg Pred Value : 0.9546          
##              Prevalence : 0.1088          
##          Detection Rate : 0.0690          
##    Detection Prevalence : 0.1215          
##       Balanced Accuracy : 0.7875          
##                                           
##        'Positive' Class : yes             
## 
cat("Accuracy:", cm_mtry$overall["Accuracy"], "\n")
## Accuracy: 0.9076774
cat("F1-Score:", cm_mtry$byClass["F1"], "\n")
## F1-Score: 0.5991561
cat("AUC:", auc_mtry, "\n")
## AUC: 0.9357815
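
Beyond the test-set metrics, the fitted caret object can be queried to see which mtry the cross-validation preferred. A minimal sketch (not run above):

rf_cv_mtry$bestTune   # mtry selected by CV (mtry = 8 in this run, per the table below)
rf_cv_mtry$results    # resampled ROC / Sens / Spec for each candidate mtry
plot(rf_cv_mtry)      # ROC profile across the mtry grid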

Experiment 4: Feature Selection + Random Forest

Objective

Test whether performance improves when removing low-importance features.

Changes vs Controls

  • Change: Use randomForest::importance() to select top predictors before training RF
  • Control: Same training/test structure

Metrics

Accuracy, F1, AUC

# Train RF and extract importance
imp_model <- randomForest(y ~ ., data = up_train, ntree = 300)
important_vars <- names(sort(importance(imp_model)[,1], decreasing = TRUE))[1:10]

# Train RF on selected features only
rf_fs <- randomForest(
  formula(paste("y ~", paste(important_vars, collapse = "+"))),
  data = up_train,
  ntree = 300
)

# Evaluate
result_fs <- evaluate_rf_model(rf_fs, test_data)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
print(result_fs$Matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  854  38
##        yes  63  74
##                                          
##                Accuracy : 0.9018         
##                  95% CI : (0.882, 0.9193)
##     No Information Rate : 0.8912         
##     P-Value [Acc > NIR] : 0.14624        
##                                          
##                   Kappa : 0.5392         
##                                          
##  Mcnemar's Test P-Value : 0.01694        
##                                          
##             Sensitivity : 0.66071        
##             Specificity : 0.93130        
##          Pos Pred Value : 0.54015        
##          Neg Pred Value : 0.95740        
##              Prevalence : 0.10884        
##          Detection Rate : 0.07191        
##    Detection Prevalence : 0.13314        
##       Balanced Accuracy : 0.79601        
##                                          
##        'Positive' Class : yes            
## 
cat("Accuracy:", result_fs$Accuracy, "\n")
## Accuracy: 0.9018465
cat("F1-Score:", result_fs$F1, "\n")
## F1-Score: 0.5943775
cat("AUC:", result_fs$AUC, "\n")
## AUC: 0.9289998
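
To see which predictors the selection kept, the importance scores stored in imp_model can be plotted. A minimal sketch (not run above):

# Mean decrease in Gini for the ten variables passed to rf_fs
varImpPlot(imp_model, n.var = 10,
           main = "Top 10 predictors by mean decrease in Gini")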

Final Comparison Table for RF models

Combine the results of all four experiments (the ntree sweep, stored in rf_results_combo, contributes three rows):

rf_results <- rbind(
  data.frame(Model = "Random Forest (Baseline)", ntree = 500, mtry = rf_baseline$mtry, Accuracy = result_rf_baseline$Accuracy, F1 = result_rf_baseline$F1, AUC = result_rf_baseline$AUC),
  data.frame(Model = "RF Tuned mtry (CV)", ntree = 500, mtry = rf_cv_mtry$bestTune$mtry, Accuracy = cm_mtry$overall["Accuracy"], F1 = cm_mtry$byClass["F1"], AUC = auc_mtry),
  rf_results_combo,
  data.frame(Model = "RF Top 10 Features", ntree = 300, mtry = rf_baseline$mtry, Accuracy = result_fs$Accuracy, F1 = result_fs$F1, AUC = result_fs$AUC)
)

print(rf_results)
##                              Model ntree mtry  Accuracy        F1       AUC
## Accuracy  Random Forest (Baseline)   500    4 0.9125364 0.6218487 0.9330016
## Accuracy3       RF Tuned mtry (CV)   500    8 0.9076774 0.5991561 0.9357815
## Accuracy4          RF: ntree = 100   100    8 0.9028183 0.5762712 0.9274858
## Accuracy1          RF: ntree = 300   300    8 0.9076774 0.5991561 0.9315168
## Accuracy2          RF: ntree = 500   500    8 0.9096210 0.5974026 0.9298615
## Accuracy5       RF Top 10 Features   300    4 0.9018465 0.5943775 0.9289998
# Ensure AUC is numeric
rf_results$AUC <- as.numeric(rf_results$AUC)

# Load required libraries
library(tidyr)
library(dplyr)
library(ggplot2)

# Reshape the data
rf_long <- rf_results %>%
  pivot_longer(cols = c("Accuracy", "F1", "AUC"), names_to = "Metric", values_to = "Value")

# Plot
ggplot(rf_long, aes(x = Model, y = Value, color = Metric, group = Metric)) +
  geom_line(size = 1) +
  geom_point(size = 2) +
  labs(title = "Random Forest Model Comparison (Accuracy, F1, AUC)",
       x = "Model", y = "Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

Random Forest Model Comparison (Accuracy, F1, AUC)

Key Insights:

  • Best Overall Accuracy was achieved by the Random Forest Baseline model (ntree = 500, mtry = 4), with an accuracy of 91.25%.
  • Top F1 Score also belongs to the baseline model (0.6218), indicating the best balance between precision and recall.
  • Highest AUC came from RF Tuned mtry (CV) (0.9358), suggesting it slightly edges out the baseline in class separation performance.

The “RF Top 10 Features” model scored below the baseline on every metric, suggesting that the removed features still carried useful signal for the model.

The models with adjusted ntree values (100, 300, 500) and higher mtry values (8) performed reasonably well, with consistent accuracy (~90–91%) and AUC (>0.927), but F1 scores remained lower than the baseline.

Summary:

  • The baseline model remains the most well-rounded in terms of Accuracy, F1, and AUC.
  • Tuning mtry can slightly improve AUC but may reduce F1, showing a tradeoff depending on whether precision/recall or class separation is more important.
  • Feature reduction is not effective in this case and may harm performance.

AdaBoost Experiments (using adabag)

We now replicate the Random Forest experimentation strategy using the AdaBoost algorithm. Each AdaBoost experiment varies hyperparameters (e.g., mfinal) or applies feature selection, while keeping the evaluation framework consistent (Accuracy, F1, AUC on a shared test set).

Experiment 1: Baseline AdaBoost (Default Settings)

Objective

Evaluate AdaBoost’s default performance without tuning.

Changes vs Controls

  • Change: Use boosting() with mfinal = 50
  • Control: Upsampled training data, same test set

Metrics

Accuracy, F1, AUC

# Load Required Libraries 
library(adabag)
library(pROC)
library(caret)
library(dplyr)

# Preprocess Target Variable 
# Copy original data
up_train_baseline <- up_train_ada
test_data_baseline <- test_data_ada

# Convert 1/0 to factor levels: "no", "yes"
up_train_baseline$y <- factor(ifelse(up_train_baseline$y == 1, "yes", "no"), levels = c("no", "yes"))
test_data_baseline$y <- factor(ifelse(test_data_baseline$y == 1, "yes", "no"), levels = c("no", "yes"))

# Convert character columns to factors 
up_train_baseline <- up_train_baseline %>% mutate_if(is.character, as.factor)
test_data_baseline <- test_data_baseline %>% mutate_if(is.character, as.factor)

# Remove constant features (but keep y) 
non_constant_cols <- sapply(up_train_baseline[, names(up_train_baseline) != "y"], function(col) {
  if (is.factor(col)) length(unique(col)) > 1 else TRUE
})
up_train_baseline <- cbind(y = up_train_baseline$y, up_train_baseline[, names(non_constant_cols)[non_constant_cols]])

# Align test data to match predictors 
predictors <- setdiff(names(up_train_baseline), "y")
true_labels <- test_data_baseline$y
test_data_baseline <- test_data_baseline[, predictors, drop = FALSE]

# Match factor levels in test data 
for (col in predictors) {
  if (is.factor(up_train_baseline[[col]]) && is.factor(test_data_baseline[[col]])) {
    test_data_baseline[[col]] <- factor(test_data_baseline[[col]], levels = levels(up_train_baseline[[col]]))
  }
}

# Train AdaBoost Model (only difference) 
ada_baseline <- boosting(y ~ ., data = up_train_baseline, mfinal = 50, boos = TRUE)

# Predict on Test Data 
test_data_baseline <- as.data.frame(test_data_baseline)
pred <- predict.boosting(ada_baseline, newdata = test_data_baseline)

# Assign column names if missing
if (is.null(colnames(pred$prob))) {
  colnames(pred$prob) <- c("no", "yes")
}

# Evaluate Performance 
pred_class <- factor(pred$class, levels = c("no", "yes"))
pred_prob <- pred$prob[, "yes"]

cm <- confusionMatrix(pred_class, true_labels, positive = "yes")
roc_obj <- roc(true_labels, pred_prob)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
auc_val <- auc(roc_obj)

# Output 
print(cm)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  836  33
##        yes  81  79
##                                           
##                Accuracy : 0.8892          
##                  95% CI : (0.8684, 0.9077)
##     No Information Rate : 0.8912          
##     P-Value [Acc > NIR] : 0.6035          
##                                           
##                   Kappa : 0.5193          
##                                           
##  Mcnemar's Test P-Value : 1.073e-05       
##                                           
##             Sensitivity : 0.70536         
##             Specificity : 0.91167         
##          Pos Pred Value : 0.49375         
##          Neg Pred Value : 0.96203         
##              Prevalence : 0.10884         
##          Detection Rate : 0.07677         
##    Detection Prevalence : 0.15549         
##       Balanced Accuracy : 0.80851         
##                                           
##        'Positive' Class : yes             
## 
cat("Accuracy:", cm$overall["Accuracy"], "\n")
## Accuracy: 0.8892128
cat("F1 Score:", cm$byClass["F1"], "\n")
## F1 Score: 0.5808824
cat("AUC:", auc_val, "\n")
## AUC: 0.9254362
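
A minimal sketch (not run above) of how the fitted adabag object can be inspected; boosting() returns per-iteration tree weights and an aggregated variable-importance vector, which the feature-selection experiment below reuses:

head(ada_baseline$weights)                         # weight of each boosted tree
sort(ada_baseline$importance, decreasing = TRUE)   # relative importance of each predictor
importanceplot(ada_baseline)                       # bar plot of the same importances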

Experiment 3: Feature Selection + AdaBoost

Objective

Test whether removing low-importance features improves AdaBoost performance.

Changes vs Controls

  • Change: Use the top 10 variables ranked by the baseline model’s $importance scores before training
  • Control: Same upsampled data and test set

Metrics

Accuracy, F1, AUC

# Load Required Libraries 
library(adabag)
library(pROC)
library(caret)
library(dplyr)

# Preprocess Target Variable 
# Copy original data
up_train_fs <- up_train_ada
test_data_fs <- test_data_ada

# Convert 1/0 to factor levels: "no", "yes"
up_train_fs$y <- factor(ifelse(up_train_fs$y == 1, "yes", "no"), levels = c("no", "yes"))
test_data_fs$y <- factor(ifelse(test_data_fs$y == 1, "yes", "no"), levels = c("no", "yes"))

# Convert character columns to factors 
up_train_fs <- up_train_fs %>% mutate_if(is.character, as.factor)
test_data_fs <- test_data_fs %>% mutate_if(is.character, as.factor)

# Remove constant features (but keep y) 
non_constant_cols <- sapply(up_train_fs[, names(up_train_fs) != "y"], function(col) {
  if (is.factor(col)) length(unique(col)) > 1 else TRUE
})
up_train_fs <- cbind(y = up_train_fs$y, up_train_fs[, names(non_constant_cols)[non_constant_cols]])

# Identify Top 10 Important Variables from ada_baseline 
top_vars <- names(sort(ada_baseline$importance, decreasing = TRUE))[1:10]
formula_fs <- as.formula(paste("y ~", paste(top_vars, collapse = " + ")))

# Align test data with selected predictors-
true_labels <- test_data_fs$y
test_data_fs <- test_data_fs[, top_vars, drop = FALSE]

# Match factor levels
for (col in top_vars) {
  if (is.factor(up_train_fs[[col]]) && is.factor(test_data_fs[[col]])) {
    test_data_fs[[col]] <- factor(test_data_fs[[col]], levels = levels(up_train_fs[[col]]))
  }
}

# Train AdaBoost Model on Selected Features
# (best_mfinal carries over from the earlier mfinal-tuning experiment)
ada_fs <- boosting(formula_fs, data = up_train_fs, mfinal = best_mfinal)

# Predict on Test Data 
test_data_fs <- as.data.frame(test_data_fs)
pred <- predict.boosting(ada_fs, newdata = test_data_fs)

# Assign column names if missing
if (is.null(colnames(pred$prob))) {
  colnames(pred$prob) <- c("no", "yes")
}

# Evaluate Performance 
pred_class <- factor(pred$class, levels = c("no", "yes"))
pred_prob <- pred$prob[, "yes"]

cm <- confusionMatrix(pred_class, true_labels, positive = "yes")
roc_obj <- roc(true_labels, pred_prob)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
auc_val <- auc(roc_obj)

# Output 
print(cm)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  808  13
##        yes 109  99
##                                           
##                Accuracy : 0.8814          
##                  95% CI : (0.8601, 0.9006)
##     No Information Rate : 0.8912          
##     P-Value [Acc > NIR] : 0.8531          
##                                           
##                   Kappa : 0.5559          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.88393         
##             Specificity : 0.88113         
##          Pos Pred Value : 0.47596         
##          Neg Pred Value : 0.98417         
##              Prevalence : 0.10884         
##          Detection Rate : 0.09621         
##    Detection Prevalence : 0.20214         
##       Balanced Accuracy : 0.88253         
##                                           
##        'Positive' Class : yes             
## 
cat("Accuracy:", cm$overall["Accuracy"], "\n")
## Accuracy: 0.8814383
cat("F1 Score:", cm$byClass["F1"], "\n")
## F1 Score: 0.61875
cat("AUC:", auc_val, "\n")
## AUC: 0.9359227

Final Comparison Table

Objective

Aggregate and compare all AdaBoost experiments in one view.

Metrics

Accuracy, F1, AUC

# Snapshot the evaluation objects from each experiment. These assignments need
# to run immediately after the corresponding experiment, before cm and auc_val
# are overwritten; run together here, both snapshots capture Experiment 3's
# results, which is why the baseline row below mirrors the Top 10 Features row.

# After Experiment 1
cm_baseline <- cm
auc_baseline <- auc_val

# After Experiment 3
cm_fs <- cm
auc_fs <- auc_val

ada_results <- rbind(
  data.frame(Model = "AdaBoost (Baseline)", mfinal = 50,
             Accuracy = cm_baseline$overall["Accuracy"],
             F1 = cm_baseline$byClass["F1"],
             AUC = auc_baseline),
  ada_results_mfinal,
  data.frame(Model = "AdaBoost Top 10 Features", mfinal = best_mfinal,
             Accuracy = cm_fs$overall["Accuracy"],
             F1 = cm_fs$byClass["F1"],
             AUC = auc_fs)
)

# Ensure AUC is Numeric 
ada_results$AUC <- as.numeric(ada_results$AUC)

# Print Table 
print(ada_results)
##                              Model mfinal  Accuracy        F1       AUC
## Accuracy       AdaBoost (Baseline)     50 0.8814383 0.6187500 0.9359227
## Accuracy4    AdaBoost: mfinal = 10     10 0.8726919 0.5969231 0.9311322
## Accuracy1    AdaBoost: mfinal = 30     30 0.8736638 0.5547945 0.9230702
## Accuracy2    AdaBoost: mfinal = 50     50 0.8814383 0.5447761 0.9208989
## Accuracy3   AdaBoost: mfinal = 100    100 0.8862974 0.5263158 0.9228852
## Accuracy5 AdaBoost Top 10 Features     10 0.8814383 0.6187500 0.9359227

AdaBoost Line Plot

Objective

Visualize differences in performance metrics across all experiments.

# Load Required Libraries 
library(tidyr)
library(ggplot2)
library(dplyr)

# Reshape Data for Plotting
ada_long <- ada_results %>%
  pivot_longer(cols = c("Accuracy", "F1", "AUC"), names_to = "Metric", values_to = "Value")

# Line Plot 
ggplot(ada_long, aes(x = Model, y = Value, color = Metric, group = Metric)) +
  geom_line(size = 1) +
  geom_point(size = 2) +
  labs(title = "AdaBoost Model Comparison (Accuracy, F1, AUC)",
       x = "Model", y = "Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

AdaBoost Model Comparison Summary

Note: because the snapshot chunk above was executed after Experiment 3, the “AdaBoost (Baseline)” row in the comparison table repeats the Top 10 Features results; the baseline’s own evaluation in Experiment 1 reported Accuracy = 0.8892, F1 = 0.5809, AUC = 0.9254.

Best Model:
AdaBoost Top 10 Features (mfinal = 10)
- Delivers the highest AUC (0.9359) and the highest F1-score (0.6188) among the AdaBoost runs.
- Achieves this with only ten predictors, which also helps training time and interpretability.

Strong Alternative:
AdaBoost Baseline (mfinal = 50)
- In its own run it posts slightly higher accuracy (0.8892) and a solid AUC (0.9254), but a noticeably lower F1 (0.5809).

Least Performing Models:
AdaBoost with mfinal = 50 or 100 (in the mfinal sweep)
- F1 falls to 0.5448 and 0.5263 respectively, suggesting that additional boosting rounds mainly reinforce the majority class and weaken the precision/recall balance.

Final Verdict:
Prefer the Top 10 Features model for the best combination of F1, AUC, and efficiency; the baseline remains a reasonable fallback when all predictors must be retained. Avoid large mfinal values, which erode the precision/recall balance.

# Combine All Model Results (DT + RF + AdaBoost) 

# Decision Tree Results
dt_all_results <- data.frame(
  Model = c(
    "DT: Pre-Upsampling (Gini)",
    "DT: Upsampled (Gini)",
    "DT: Shallow Tree (Gini)",
    "DT: Tune maxdepth (CV, Gini)",
    "DT: Tune cp (CV, Gini)",
    "DT: Pre-Upsampling (Entropy)",
    "DT: Upsampled (Entropy)",
    "DT: Tune cp (CV, Entropy)"
  ),
  Accuracy = c(
    result_imbalanced$Accuracy,
    result_baseline$Accuracy,
    result_1$Accuracy,
    cm_maxdepth$overall["Accuracy"],
    cm_cp$overall["Accuracy"],
    result_imbalanced_entropy$Accuracy,
    result_baseline_entropy$Accuracy,
    cm_entropy$overall["Accuracy"]
  ),
  F1 = c(
    result_imbalanced$F1,
    result_baseline$F1,
    result_1$F1,
    cm_maxdepth$byClass["F1"],
    cm_cp$byClass["F1"],
    result_imbalanced_entropy$F1,
    result_baseline_entropy$F1,
    cm_entropy$byClass["F1"]
  ),
  AUC = as.numeric(c(
    result_imbalanced$AUC,
    result_baseline$AUC,
    result_1$AUC,
    auc_maxdepth,
    auc_cp,
    result_imbalanced_entropy$AUC,
    result_baseline_entropy$AUC,
    auc_val_entropy
  ))
)

# Random Forest Results
rf_all_results <- data.frame(
  Model = rf_results$Model,
  Accuracy = rf_results$Accuracy,
  F1 = rf_results$F1,
  AUC = rf_results$AUC
)

# AdaBoost Results
ada_all_results <- data.frame(
  Model = ada_results$Model,
  Accuracy = ada_results$Accuracy,
  F1 = ada_results$F1,
  AUC = ada_results$AUC
)

# Combine All Into One 
all_model_results <- rbind(dt_all_results, rf_all_results, ada_all_results)

# Sort Combined Model Results by F1, then AUC, then Accuracy 
all_model_results_sorted <- all_model_results %>%
  arrange(desc(F1), desc(AUC), desc(Accuracy))

# View Sorted Table 
print(all_model_results_sorted)
##                           Model  Accuracy        F1       AUC
## 1      Random Forest (Baseline) 0.9125364 0.6218487 0.9330016
## 2           AdaBoost (Baseline) 0.8814383 0.6187500 0.9359227
## 3      AdaBoost Top 10 Features 0.8814383 0.6187500 0.9359227
## 4            RF Tuned mtry (CV) 0.9076774 0.5991561 0.9357815
## 5               RF: ntree = 300 0.9076774 0.5991561 0.9315168
## 6               RF: ntree = 500 0.9096210 0.5974026 0.9298615
## 7         AdaBoost: mfinal = 10 0.8726919 0.5969231 0.9311322
## 8            RF Top 10 Features 0.9018465 0.5943775 0.9289998
## 9        DT: Tune cp (CV, Gini) 0.8736638 0.5886076 0.8753749
## 10              RF: ntree = 100 0.9028183 0.5762712 0.9274858
## 11    DT: Tune cp (CV, Entropy) 0.8678328 0.5723270 0.8750243
## 12        AdaBoost: mfinal = 30 0.8736638 0.5547945 0.9230702
## 13        AdaBoost: mfinal = 50 0.8814383 0.5447761 0.9208989
## 14      DT: Upsampled (Entropy) 0.8289602 0.5392670 0.9123549
## 15 DT: Tune maxdepth (CV, Gini) 0.8202138 0.5340050 0.8885535
## 16         DT: Upsampled (Gini) 0.8182702 0.5336658 0.8913139
## 17      DT: Shallow Tree (Gini) 0.8182702 0.5336658 0.8913139
## 18       AdaBoost: mfinal = 100 0.8862974 0.5263158 0.9228852
## 19    DT: Pre-Upsampling (Gini) 0.9096210 0.5181347 0.9014352
## 20 DT: Pre-Upsampling (Entropy) 0.9067055 0.5102041 0.9102859
# View Combined Table 
print(all_model_results)
##                           Model  Accuracy        F1       AUC
## 1     DT: Pre-Upsampling (Gini) 0.9096210 0.5181347 0.9014352
## 2          DT: Upsampled (Gini) 0.8182702 0.5336658 0.8913139
## 3       DT: Shallow Tree (Gini) 0.8182702 0.5336658 0.8913139
## 4  DT: Tune maxdepth (CV, Gini) 0.8202138 0.5340050 0.8885535
## 5        DT: Tune cp (CV, Gini) 0.8736638 0.5886076 0.8753749
## 6  DT: Pre-Upsampling (Entropy) 0.9067055 0.5102041 0.9102859
## 7       DT: Upsampled (Entropy) 0.8289602 0.5392670 0.9123549
## 8     DT: Tune cp (CV, Entropy) 0.8678328 0.5723270 0.8750243
## 9      Random Forest (Baseline) 0.9125364 0.6218487 0.9330016
## 10           RF Tuned mtry (CV) 0.9076774 0.5991561 0.9357815
## 11              RF: ntree = 100 0.9028183 0.5762712 0.9274858
## 12              RF: ntree = 300 0.9076774 0.5991561 0.9315168
## 13              RF: ntree = 500 0.9096210 0.5974026 0.9298615
## 14           RF Top 10 Features 0.9018465 0.5943775 0.9289998
## 15          AdaBoost (Baseline) 0.8814383 0.6187500 0.9359227
## 16        AdaBoost: mfinal = 10 0.8726919 0.5969231 0.9311322
## 17        AdaBoost: mfinal = 30 0.8736638 0.5547945 0.9230702
## 18        AdaBoost: mfinal = 50 0.8814383 0.5447761 0.9208989
## 19       AdaBoost: mfinal = 100 0.8862974 0.5263158 0.9228852
## 20     AdaBoost Top 10 Features 0.8814383 0.6187500 0.9359227

Compare all three families of models—Decision Trees, Random Forest, and AdaBoost—across their experiments.

We observe the following:

Best Performing Model Overall:

Random Forest (Baseline)
- Accuracy: 0.9125
- F1 Score: 0.6218
- AUC: 0.9330

This model offers the most balanced performance across all three metrics. It handles class imbalance well (thanks to the upsampled training data), and the ensemble of 500 trees generalizes strongly. Its F1 is the highest of any model in the comparison, and its AUC (0.9330) sits just below the best values (≈0.936), making it a safe and strong performer.

Runner-Up:

AdaBoost Top 10 Features (mfinal = 10)
- Accuracy: 0.8814
- F1 Score: 0.6188
- AUC: 0.9359 (highest overall AUC)

Although it gives up some accuracy relative to Random Forest, this AdaBoost model shines in AUC, reflecting excellent class separation with only ten predictors. It is the better choice when maximizing ROC-AUC matters more than marginal gains in accuracy. (The “AdaBoost (Baseline)” row in the combined table shows the same numbers only because of the snapshot order noted earlier; the baseline’s own run gave 0.8892 / 0.5809 / 0.9254.)

Best Decision Tree Configuration:

Tune cp (CV, Gini)
- Accuracy: 0.8737
- F1 Score: 0.5886
- AUC: 0.8754

This tuned Decision Tree improves generalization through cross-validated pruning. It posts the best F1 of any DT variant and narrows the gap with the ensembles, showing that a single tree can remain competitive with tuning, although its AUC still trails the upsampled trees and every ensemble model.

Least Performing Models (Avoid for This Task):

  • AdaBoost with large mfinal (50 or 100 in the sweep)
    • F1 scores of 0.545 and 0.526, indicating a poor balance between precision and recall despite decent accuracy
  • Decision Trees without upsampling (Gini and Entropy)
    • Lowest F1 values overall (0.518 and 0.510); the shallow and maxdepth-constrained Gini trees also stall near F1 ≈ 0.534 due to restricted depth

These models suffer from unaddressed class imbalance, restricted complexity (underfitting), or excess boosting rounds, all of which weaken minority-class detection and generalization.

General Takeaways:

  • Random Forest models are consistently the strongest overall across all evaluation metrics.
  • AdaBoost excels at AUC, which is useful when ranking prospects or overall class separation matters more than raw accuracy.
  • Decision Trees improve dramatically with tuning, especially pruning via cp, but still lag behind ensemble methods.

When simplicity and interpretability are priorities, a tuned Decision Tree may suffice. But when performance matters, go with Random Forest (Baseline) or the AdaBoost Top 10 Features model, depending on whether F1 or AUC is the metric of interest.
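
As a closing illustration, a minimal sketch (file name is hypothetical; test_data stands in for a new batch of prospects) of how the recommended Random Forest could be persisted and reused for scoring:

# Persist the chosen model and reload it in a scoring session
saveRDS(rf_baseline, "rf_baseline_bank_marketing.rds")
scoring_model <- readRDS("rf_baseline_bank_marketing.rds")

# New data must have the same predictor columns and factor levels as training
p_yes <- predict(scoring_model, newdata = test_data, type = "prob")[, "yes"]
head(p_yes)   # predicted probability of subscribing ("yes")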