Dataset: A Portuguese bank conducted a marketing campaign (phone calls) to predict if a client will subscribe to a term deposit. The records of their efforts are available in the form of a dataset.
Objective and Approach: In this project, I applied machine learning techniques to analyze a real-world dataset from a Portuguese bank’s marketing campaign. The goal was to build classification models that predict whether a client will subscribe to a term deposit based on their personal and socio-economic attributes, as well as macroeconomic indicators. To do this, I compared three different supervised learning algorithms — Decision Tree, Random Forest, and AdaBoost — using both default and tuned hyperparameters. The models were evaluated using five performance metrics: Accuracy, Precision, Recall, F1 Score, and AUC. Since the dataset is imbalanced, special attention was given to Recall, which reflects the model’s ability to identify actual subscribers. The analysis concludes by recommending the most appropriate model aligned with the business objective of maximizing customer acquisition.
bank_data <- read.csv("bank-additional-full.csv", sep = ";")
library(tidyverse)
library(ggplot2)
Data structure
str(bank_data)
## 'data.frame': 41188 obs. of 21 variables:
## $ age : int 56 57 37 40 56 45 59 41 24 25 ...
## $ job : chr "housemaid" "services" "services" "admin." ...
## $ marital : chr "married" "married" "married" "married" ...
## $ education : chr "basic.4y" "high.school" "high.school" "basic.6y" ...
## $ default : chr "no" "unknown" "no" "no" ...
## $ housing : chr "no" "no" "yes" "no" ...
## $ loan : chr "no" "no" "no" "no" ...
## $ contact : chr "telephone" "telephone" "telephone" "telephone" ...
## $ month : chr "may" "may" "may" "may" ...
## $ day_of_week : chr "mon" "mon" "mon" "mon" ...
## $ duration : int 261 149 226 151 307 198 139 217 380 50 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
## $ emp.var.rate : num 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
## $ cons.price.idx: num 94 94 94 94 94 ...
## $ cons.conf.idx : num -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
## $ euribor3m : num 4.86 4.86 4.86 4.86 4.86 ...
## $ nr.employed : num 5191 5191 5191 5191 5191 ...
## $ y : chr "no" "no" "no" "no" ...
The dataset contains 41,188 rows (instances) and 21 columns (features + target variable y).
Count Duplicates
sum(duplicated(bank_data))
## [1] 12
Since 12 duplicates out of ~41,000 records amount to a negligible fraction (~0.03%), removing them won’t meaningfully affect the dataset. Let’s remove them.
Remove duplicates
bank_data <- bank_data[!duplicated(bank_data), ]
sum(duplicated(bank_data))
## [1] 0
Check for N/A
colSums(is.na(bank_data))
## age job marital education default
## 0 0 0 0 0
## housing loan contact month day_of_week
## 0 0 0 0 0
## duration campaign pdays previous poutcome
## 0 0 0 0 0
## emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
## 0 0 0 0 0
## y
## 0
There are no missing values (NA).
Check Imbalance: Check the distribution of the target variable (y)
table(bank_data$y)
##
## no yes
## 36537 4639
The dataset is highly imbalanced, with 36,537 “no” responses (88.7%) and 4,639 “yes” responses (11.3%). Since the “yes” class is underrepresented, this could affect model performance, and we may need to apply techniques like class weighting or oversampling (SMOTE) in pre-processing to balance the data.
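These proportions can be confirmed directly with a one-line check on the cleaned data:
# Relative frequency of each class in the target variable
round(prop.table(table(bank_data$y)), 3)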
Count ‘unknown’ in each categorical column
unknown_counts <- bank_data %>%
summarise(across(where(is.character), ~ sum(. == "unknown")))
print(unknown_counts)
## job marital education default housing loan contact month day_of_week poutcome
## 1 330 80 1730 8596 990 990 0 0 0 0
## y
## 1 0
From our analysis, we observe that some categorical variables contain “unknown” values. Our strategy for handling them is as follows: for default, housing, and loan, replace “unknown” with the most frequent category (mode); for job, marital, and education, keep “unknown” as a valid category, since it may carry information of its own.
These modifications will be addressed in the Pre-processing step.
Check Correlation Between Features: To analyze how different numerical variables relate to each other, let’s create a correlation matrix.
library(ggcorrplot)
numeric_data <- bank_data %>% select_if(is.numeric)# Select only numeric columns
cor_matrix <- cor(numeric_data) # Compute correlation matrix
ggcorrplot(cor_matrix, method = "square", type = "lower", lab = TRUE) # Correlation heatmap
Corrplot Analysis: The correlation matrix reveals strong relationships between several numerical features. Notably, employment variation rate (emp.var.rate) and the number of employees (nr.employed) have an extremely high positive correlation of 0.97, suggesting redundancy. Similarly, euribor3m is highly correlated with both nr.employed (0.95) and emp.var.rate (0.91), indicating that these economic indicators move together and may not all be necessary for modeling. There is also a moderate negative correlation of -0.59 between pdays and previous, which might suggest an inverse relationship between the number of days since the last contact and the frequency of previous contacts. Since highly correlated variables can cause multicollinearity issues in modeling, we may consider removing or combining some of them during preprocessing.
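One practical way to act on this during preprocessing is caret’s findCorrelation(), which flags columns whose absolute pairwise correlation exceeds a chosen cutoff. The sketch below uses a cutoff of 0.9 (our choice, not something prescribed by the data) and only lists candidates rather than dropping them:
library(caret)
# Flag numeric features with absolute pairwise correlation above 0.9
high_corr <- findCorrelation(cor_matrix, cutoff = 0.9, names = TRUE)
high_corr # candidate columns to drop or combine during preprocessing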
Feature Distributions:
# Reshape data
bank_long <- bank_data %>%
pivot_longer(cols = where(is.numeric), names_to = "Feature", values_to = "Value")
# Plot
ggplot(bank_long, aes(x = Value)) +
geom_histogram(fill = "steelblue", color = "black", bins = 30) +
facet_wrap(~Feature, scales = "free") +
theme_minimal() +
labs(title = "Distribution of Numeric Features", x = "Value", y = "Count")
Analysis: Based on the histogram analysis, the numerical features show varying distributions. Age is right-skewed, with most clients between 25 and 60 years old. Duration and campaign are also highly skewed, with a large concentration of lower values and a few extreme cases. Pdays has a bimodal distribution, where most values are either very low or at 999, indicating a special category. Previous contacts are mostly zero, showing that many clients had no prior interactions. Economic indicators like employment variation rate (emp.var.rate) and euribor3m show multiple peaks, reflecting fluctuations in economic conditions. The distribution of consumer confidence index (cons.conf.idx) and consumer price index (cons.price.idx) appears more uniform. Overall, many variables are skewed, and some contain potential outliers that need further investigation.
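To put numbers on the skewness described above, a per-feature skewness statistic can be computed. The sketch below assumes the e1071 package is available; any skewness implementation would work:
library(e1071)
# Skewness of each numeric feature: values near 0 are roughly symmetric,
# large positive values indicate a long right tail
sapply(bank_data %>% select(where(is.numeric)), skewness, na.rm = TRUE)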
Identify Outliers Using Boxplots:
ggplot(bank_long, aes(x = Value, y = Feature)) + # Flip x and y
geom_boxplot(fill = "lightblue", outlier.color = "red") +
facet_wrap(~Feature, scales = "free", ncol = 2) +
theme_minimal() +
labs(title = "Boxplots of Numeric Features", x = "Value", y = "Feature") + # Adjust labels
theme(axis.text.x = element_text(size = 10), # Show x-axis labels
axis.text.y = element_text(size = 10),
strip.text = element_text(size = 12, face = "bold"))
Analysis: Duration, Campaign, and Pdays contain extreme outliers, with several observations far above the upper whiskers. Previous and Emp.Var.Rate also show potential outliers, but with fewer extreme points. Nr.Employed and Euribor3m have comparatively few extreme values. The presence of these outliers suggests that some clients had very long call durations, many contacts within the campaign, or a long gap (pdays) since their last contact. The boxplot also confirms that the age distribution is right-skewed, with a few elderly customers appearing as outliers; these records may still be valid, but we need to consider whether they could affect model performance later.
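To quantify these observations, the conventional 1.5 × IQR rule can be applied per feature; a small sketch (the threshold is the usual default, not tuned for this data):
# Count observations outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for each numeric feature
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr, na.rm = TRUE)
}
sapply(bank_data %>% select(where(is.numeric)), count_outliers)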
Analyzing Categorical Variable Distribution:
categorical_data <- bank_data %>%
select(where(is.character)) %>% # Select categorical variables
pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value")
glimpse(categorical_data) # Ensure Feature and Value exist
## Rows: 452,936
## Columns: 2
## $ Feature <chr> "job", "marital", "education", "default", "housing", "loan", "…
## $ Value <chr> "housemaid", "married", "basic.4y", "no", "no", "no", "telepho…
Plotting all categorical variables in a single figure made the categories hard to read, so we plot variables with five or fewer categories and variables with more than five categories separately.
small_categorical_data <- categorical_data %>%
filter(Feature %in% c("marital", "default", "housing", "loan", "contact", "poutcome", "y"))
# Plot small categorical variables
ggplot(small_categorical_data, aes(y = reorder(Value, table(Value)[Value]), fill = Feature)) +
geom_bar() +
facet_wrap(~Feature, scales = "free", ncol = 2) +
theme_minimal() +
labs(title = "Distribution of Categorical Variables (≤5 Categories)", y = "Category", x = "Count") +
theme(axis.text.y = element_text(size = 9),
axis.text.x = element_text(size = 10),
strip.text = element_text(size = 12, face = "bold"),
legend.position = "none",
panel.spacing.x = unit(2, "lines"))
large_categorical_data <- categorical_data %>%
filter(Feature %in% c("job", "education", "month", "day_of_week"))
# Plot large categorical variables with improved x-axis width
ggplot(large_categorical_data, aes(y = reorder(Value, table(Value)[Value]), fill = Feature)) +
geom_bar() +
facet_wrap(~Feature, scales = "free", ncol = 2) +
theme_minimal() +
labs(title = "Distribution of Categorical Variables (>5 Categories)", y = "Category", x = "Count") +
theme(axis.text.y = element_text(size = 10),
axis.text.x = element_text(size = 10),
strip.text = element_text(size = 12, face = "bold"),
legend.position = "none",
panel.spacing.x = unit(2, "lines")) # Add space between facet columns
Model Selection: We will explore three models: Decision Tree, Random Forest, and AdaBoost. Each model has unique strengths that align with the characteristics of the dataset and the project objective.
Decision Tree is a simple yet powerful algorithm that is easy to interpret. It is well-suited for identifying key decision-making factors, making it useful for explaining why certain customers are more likely to subscribe to a term deposit. However, Decision Trees are prone to overfitting, especially in datasets with noise or numerous features. Careful tuning (e.g., maximum depth, minimum samples per split) can improve performance.
Random Forest is an ensemble of Decision Trees, designed to reduce overfitting by averaging multiple tree predictions. This makes it more robust and stable, especially with noisy data or datasets containing both categorical and numerical features. Given the complexity of our dataset and potential outliers, Random Forest is a strong candidate for achieving reliable predictions.
Adaboost is another ensemble method that builds a series of weak learners (e.g., shallow Decision Trees), focusing more on hard-to-classify instances. Since our dataset is highly imbalanced (~88.7% “no”, ~11.3% “yes”), Adaboost’s ability to improve recall for the minority class can enhance the model’s ability to identify potential subscribers effectively.
Replace “unknown” with Mode (Most Frequent Value): For housing, loan, and default, we replace “unknown” with the most frequent category: we identify the mode of each variable and substitute it for the “unknown” values.
# Step 1: Data Cleaning
most_common_housing <- names(sort(table(bank_data$housing), decreasing = TRUE))[1]
most_common_loan <- names(sort(table(bank_data$loan), decreasing = TRUE))[1]
most_common_default <- names(sort(table(bank_data$default), decreasing = TRUE))[1]
bank_data$housing[bank_data$housing == "unknown"] <- most_common_housing
bank_data$loan[bank_data$loan == "unknown"] <- most_common_loan
bank_data$default[bank_data$default == "unknown"] <- most_common_default
Keep “unknown” as a Category for job, marital, and education: Since “unknown” in these variables may carry meaning, we keep it as a valid category and convert the variables to factors (preserving “unknown” as a level).
bank_data$job <- factor(bank_data$job)
bank_data$marital <- factor(bank_data$marital)
bank_data$education <- factor(bank_data$education)
Ensure All Variables Are Correctly Formatted: We will verify that categorical variables are factors and numeric variables are properly formatted.
# Drop duration since it's data leakage
bank_data <- bank_data %>%
select(-duration)
# Define categorical variables
categorical_cols <- c("job", "marital", "education", "default",
"housing", "loan", "contact", "month",
"day_of_week", "poutcome", "y")
# Convert categorical variables to factors
bank_data[categorical_cols] <- lapply(bank_data[categorical_cols], factor)
# Define numeric variables
numeric_cols <- c("age", "campaign", "pdays", "previous",
"emp.var.rate", "cons.price.idx",
"cons.conf.idx", "euribor3m", "nr.employed")
# Convert numeric variables to numeric
bank_data[numeric_cols] <- lapply(bank_data[numeric_cols], as.numeric)
# Summary to confirm
summary(bank_data)
## age job marital
## Min. :17.00 admin. :10419 divorced: 4611
## 1st Qu.:32.00 blue-collar: 9253 married :24921
## Median :38.00 technician : 6739 single :11564
## Mean :40.02 services : 3967 unknown : 80
## 3rd Qu.:47.00 management : 2924
## Max. :98.00 retired : 1718
## (Other) : 6156
## education default housing loan
## university.degree :12164 no :41173 no :18615 no :34928
## high.school : 9512 yes: 3 yes:22561 yes: 6248
## basic.9y : 6045
## professional.course: 5240
## basic.4y : 4176
## basic.6y : 2291
## (Other) : 1748
## contact month day_of_week campaign pdays
## cellular :26135 may :13767 fri:7826 Min. : 1.000 Min. : 0.0
## telephone:15041 jul : 7169 mon:8512 1st Qu.: 1.000 1st Qu.:999.0
## aug : 6176 thu:8618 Median : 2.000 Median :999.0
## jun : 5318 tue:8086 Mean : 2.568 Mean :962.5
## nov : 4100 wed:8134 3rd Qu.: 3.000 3rd Qu.:999.0
## apr : 2631 Max. :56.000 Max. :999.0
## (Other): 2015
## previous poutcome emp.var.rate cons.price.idx
## Min. :0.000 failure : 4252 Min. :-3.40000 Min. :92.20
## 1st Qu.:0.000 nonexistent:35551 1st Qu.:-1.80000 1st Qu.:93.08
## Median :0.000 success : 1373 Median : 1.10000 Median :93.75
## Mean :0.173 Mean : 0.08192 Mean :93.58
## 3rd Qu.:0.000 3rd Qu.: 1.40000 3rd Qu.:93.99
## Max. :7.000 Max. : 1.40000 Max. :94.77
##
## cons.conf.idx euribor3m nr.employed y
## Min. :-50.8 Min. :0.634 Min. :4964 no :36537
## 1st Qu.:-42.7 1st Qu.:1.344 1st Qu.:5099 yes: 4639
## Median :-41.8 Median :4.857 Median :5191
## Mean :-40.5 Mean :3.621 Mean :5167
## 3rd Qu.:-36.4 3rd Qu.:4.961 3rd Qu.:5228
## Max. :-26.9 Max. :5.045 Max. :5228
##
Feature Engineering: Feature engineering involves creating new variables that could improve model performance. Looking at the boxplots of the numeric features, campaign and previous have severe outliers, causing high skewness. We will use Winsorization, which caps extreme values at the 99th percentile, reducing their impact without removing valuable data; this should prevent model instability while preserving most of the data distribution.
# Define Winsorization function
winsorize <- function(x, lower_quantile = 0.01, upper_quantile = 0.99) {
lower_bound <- quantile(x, lower_quantile, na.rm = TRUE)
upper_bound <- quantile(x, upper_quantile, na.rm = TRUE)
x[x < lower_bound] <- lower_bound
x[x > upper_bound] <- upper_bound
return(x)
}
# Apply Winsorization to selected numeric columns with extreme outliers
bank_data <- bank_data %>%
mutate(
campaign = winsorize(campaign),
previous = winsorize(previous) # Removed `pdays`
)
# Boxplot to confirm results
library(ggplot2)
bank_long <- bank_data %>% pivot_longer(cols = where(is.numeric), names_to = "Feature", values_to = "Value")
ggplot(bank_long, aes(y = Value, x = Feature)) +
geom_boxplot(fill = "lightblue", outlier.color = "red") +
facet_wrap(~Feature, scales = "free", ncol = 2) +
coord_flip() +
theme_minimal() +
labs(title = "Boxplots of Numeric Features (After Winsorization, Duration Removed)",
x = "Feature", y = "Value") +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_text(size = 10),
strip.text = element_text(size = 12, face = "bold"))
The Winsorization step has effectively reduced the extreme outliers in campaign and previous, improving the data’s suitability for Decision Trees, Random Forest, and Adaboost.
Let’s check pdays:
table(bank_data$pdays)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12
## 15 26 61 439 118 46 412 60 18 64 52 28 58
## 13 14 15 16 17 18 19 20 21 22 25 26 27
## 36 20 24 11 8 7 3 1 2 3 1 1 1
## 999
## 39661
So, 999 is not a regular numerical value. It is a special category indicating that the customer was never contacted before. Treating it as a number doesn’t make sense, so we convert it into a categorical variable instead.
# Convert pdays into a categorical feature
bank_data <- bank_data %>%
mutate(
pdays_cat = case_when(
pdays == 999 ~ "Never Contacted",
pdays <= 7 ~ "Contacted Recently (0-7 days)",
pdays <= 30 ~ "Contacted Last Month (8-30 days)",
TRUE ~ "Contacted Earlier (30+ days)"
)
)
# Convert to factor for modeling
bank_data$pdays_cat <- as.factor(bank_data$pdays_cat)
# Drop original `pdays` column
bank_data <- bank_data %>%
select(-pdays)
table(bank_data$pdays_cat)
##
## Contacted Last Month (8-30 days) Contacted Recently (0-7 days)
## 338 1177
## Never Contacted
## 39661
Stratified Sampling of the Data
We split the data into training and test sets using stratified sampling, ensuring that both classes are properly represented in each set.
library(caret)
set.seed(123) # Ensure reproducibility
# Stratified sampling (80% train, 20% test)
trainIndex <- createDataPartition(bank_data$y, p = 0.8, list = FALSE)
# Split dataset into training and test sets
train_data <- bank_data[trainIndex, ]
test_data <- bank_data[-trainIndex, ]
# Check class distribution
table(train_data$y) / nrow(train_data)
##
## no yes
## 0.8873171 0.1126829
table(test_data$y) / nrow(test_data)
##
## no yes
## 0.887418 0.112582
The class distribution in the training and test sets is consistent with the original dataset (~88.7% “no”, ~11.3% “yes”).
Apply ROSE (SMOTE-like Oversampling) to Balance the Classes:
library(ROSE)
# Apply ROSE for SMOTE-like behavior
set.seed(123)
train_data_smote <- ROSE(y ~ ., data = train_data, seed = 123)$data
# Verify class distribution after SMOTE
table(train_data_smote$y)
##
## no yes
## 16705 16237
prop.table(table(train_data_smote$y))
##
## no yes
## 0.5071034 0.4928966
1. Objective: Establish a baseline Decision Tree model using default hyperparameters to assess its natural performance as a reference for future experiments.
2. What will change: Since this is a baseline model, no hyperparameters will be adjusted. The focus is to evaluate the Decision Tree’s performance without tuning.
3. Evaluation Metric: We’ll use Accuracy, Precision, Recall, and F1-score to assess performance. Given the dataset’s imbalance, Recall and F1-score will be prioritized. Additionally, AUC will be calculated to evaluate the model’s overall discrimination ability.
4. Cross-Validation Strategy: We will apply 10-fold cross-validation to improve reliability. The data will be divided into 10 parts, with 9 used for training and 1 for testing. This process repeats 10 times, ensuring all data points are tested once. Averaging results reduces the risk of overfitting and provides a robust evaluation.
5. Code Implementation:
To systematically track and compare all 6 experiments, we create a results data frame that logs each experiment’s metrics.
# Initialize Results Data Frame
results <- data.frame(
Experiment = character(),
Accuracy = numeric(),
Precision = numeric(),
Recall = numeric(),
F1_Score = numeric(),
AUC = numeric(),
stringsAsFactors = FALSE
)
Let’s import the relevant libraries.
library(rpart)
library(rpart.plot)
library(ROCR)
library(pROC)
library(doParallel)
library(caret)
library(randomForest)
library(ada)
library(dplyr)
library(knitr)
library(ggplot2)
library(reshape2)
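Since the same five metrics are recomputed after every experiment below, the formulas (Precision = TP/(TP+FP), Recall = TP/(TP+FN), F1 = 2·Precision·Recall/(Precision+Recall)) can be collected in one helper. This is an illustrative sketch — the function name evaluate_model is ours, and the experiments that follow compute the metrics inline instead:
# Illustrative helper: compute Accuracy, Precision, Recall, F1, and AUC
# from class predictions, class probabilities, and the true labels
evaluate_model <- function(pred, probs, truth, positive = "yes") {
  cm <- confusionMatrix(pred, truth, positive = positive)
  precision <- unname(cm$byClass["Precision"])
  recall <- unname(cm$byClass["Recall"])
  data.frame(
    Accuracy = unname(cm$overall["Accuracy"]),
    Precision = precision,
    Recall = recall,
    F1_Score = 2 * precision * recall / (precision + recall),
    AUC = as.numeric(auc(truth, probs))
  )
}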
Decision Tree (Default)
# Enable Parallel Processing
# cl <- makeCluster(detectCores() - 1)
# registerDoParallel(cl)
# on.exit(stopCluster(cl))
# Train Control for 10-Fold CV
set.seed(456)
train_control <- trainControl(
method = "cv",
number = 10,
classProbs = TRUE,
summaryFunction = twoClassSummary
)
# Train Decision Tree Model
dt_model <- train(
y ~ .,
data = train_data_smote,
method = "rpart",
trControl = train_control,
metric = "Recall" # Note: twoClassSummary reports ROC/Sens/Spec, so caret warns that "Recall" is unavailable and optimizes ROC instead
)
# Predict on Test Data
dt_pred <- predict(dt_model, test_data)
# Confusion Matrix
conf_matrix <- confusionMatrix(dt_pred, test_data$y, positive = "yes")
# Calculate Metrics
accuracy <- conf_matrix$overall['Accuracy']
precision <- conf_matrix$byClass['Precision']
recall <- conf_matrix$byClass['Recall']
f1_score <- 2 * ((precision * recall) / (precision + recall))
# Calculate AUC
dt_probs <- predict(dt_model, test_data, type = "prob")[, "yes"]
dt_auc <- auc(test_data$y, dt_probs)
# Display Results
cat(sprintf("\nDecision Tree (Default) - Accuracy: %.4f, Precision: %.4f, Recall: %.4f, F1-score: %.4f, AUC: %.4f\n",
accuracy, precision, recall, f1_score, dt_auc))
##
## Decision Tree (Default) - Accuracy: 0.7167, Precision: 0.2369, Recall: 0.6828, F1-score: 0.3518, AUC: 0.7019
# Log Results
results <- rbind(results, data.frame(
Experiment = "Decision Tree (Default)",
Accuracy = accuracy,
Precision = precision,
Recall = recall,
F1_Score = f1_score,
AUC = dt_auc
))
rownames(results) <- NULL
6. Analysis- Decision Tree (Experiment 1): The baseline Decision Tree showed moderate accuracy (71.7%) with strong recall (68.3%), but its low precision (23.7%) highlights a high false positive rate, indicating room for improvement.
For this experiment, we’ll focus on improving the baseline Decision Tree model by tuning hyperparameters to enhance model performance.
1. Objective: Improve the Decision Tree’s performance by tuning hyperparameters to enhance Recall, Precision, and F1 Score, while ensuring better generalization and reducing overfitting.
2. What will change: We will tune the following hyperparameters to reduce overfitting and improve generalization:
‘cp’ (complexity parameter): Controls cost-complexity pruning. We will test values ‘{0.01, 0.02, 0.03}’ to find the optimal level of tree pruning.
‘minsplit’: Minimum number of samples required to attempt a split. We set this to ‘20’ to avoid overly granular splits that could lead to overfitting.
‘maxdepth’: Not part of caret’s tuning grid for ‘rpart’, but we cap the tree depth at ‘5’ by passing ‘rpart.control(minsplit = 20, maxdepth = 5)’ to the model.
These choices reflect the baseline model’s behavior (high Recall but low Precision), so we tune toward improved F1 Score and Precision without overly sacrificing Recall.
3. Evaluation Metric: We’ll continue to evaluate performance using Accuracy, Precision, Recall, F1 Score, and AUC. Recall and F1 Score will remain the focus given the dataset’s imbalance.
4. Cross-Validation Strategy: We will apply 10-fold cross-validation for consistency with Experiment 1. This approach ensures reliable performance evaluation by averaging results across multiple data splits.
5. Code Implementation:
# Enable Parallel Processing
# cl <- makeCluster(detectCores() - 1)
# registerDoParallel(cl)
# on.exit(stopCluster(cl))
# Define train control for 10-fold cross-validation
set.seed(789)
train_control <- trainControl(
method = "cv",
number = 10,
classProbs = TRUE,
summaryFunction = twoClassSummary
)
# Tuning Grid - Only `cp` is tunable for rpart
tune_grid <- expand.grid(
cp = c(0.01, 0.02, 0.03)
)
# Train the tuned Decision Tree model with added `maxdepth` and `minsplit`
dt_model_tuned <- train(
y ~ .,
data = train_data_smote,
method = "rpart",
trControl = train_control,
tuneGrid = tune_grid,
control = rpart.control(minsplit = 20, maxdepth = 5),
metric = "Recall"
)
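# Optional check (a sketch, output not shown here): inspect the cp value caret
# selected and the cost-complexity table of the final pruned tree
# dt_model_tuned$bestTune
# printcp(dt_model_tuned$finalModel)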
# Predict on test data
dt_pred_tuned <- predict(dt_model_tuned, test_data)
# Confusion Matrix
conf_matrix_tuned <- confusionMatrix(dt_pred_tuned, test_data$y, positive = "yes")
# Calculate Metrics
accuracy_tuned <- conf_matrix_tuned$overall['Accuracy']
precision_tuned <- conf_matrix_tuned$byClass['Precision']
recall_tuned <- conf_matrix_tuned$byClass['Recall']
f1_score_tuned <- 2 * ((precision_tuned * recall_tuned) / (precision_tuned + recall_tuned))
# Calculate AUC
dt_probs_tuned <- predict(dt_model_tuned, test_data, type = "prob")[, "yes"]
dt_auc_tuned <- auc(test_data$y, dt_probs_tuned)
# Display Results
cat(sprintf("\nDecision Tree (Tuned) - Accuracy: %.4f, Precision: %.4f, Recall: %.4f, F1-score: %.4f, AUC: %.4f\n",
accuracy_tuned, precision_tuned, recall_tuned, f1_score_tuned, dt_auc_tuned))
##
## Decision Tree (Tuned) - Accuracy: 0.8354, Precision: 0.3585, Recall: 0.5847, F1-score: 0.4444, AUC: 0.7374
# Add Results
results <- rbind(results, data.frame(
Experiment = "Decision Tree (Tuned)",
Accuracy = accuracy_tuned,
Precision = precision_tuned,
Recall = recall_tuned,
F1_Score = f1_score_tuned,
AUC = dt_auc_tuned
))
rownames(results) <- NULL
6. Analysis - Decision Tree (Experiment 2): The tuned Decision Tree model improved Accuracy, Precision, F1 Score, and AUC over the baseline, indicating better overall performance and improved positive-class identification. While Recall dropped, the improved F1 Score suggests a better balance between Precision and Recall.
1. Objective: Establish a baseline Random Forest model using default hyperparameters to assess its natural performance as a reference for future experiments. This will allow us to compare its performance against the Decision Tree models.
2. What Will Change: Since this is a baseline model, we keep the configuration close to the defaults of the randomForest package:
ntree = 50; the number of trees in the forest (kept small here for training speed; the randomForest package default is 500).
mtry = sqrt(number of features); the default number of randomly selected features considered at each split for classification.
nodesize = 1; the default minimum size of terminal nodes for classification.
This setup will provide a strong baseline to compare improvements in future experiments.
3. Evaluation Metric: We will evaluate model performance using Accuracy, Precision, Recall (priority), F1 Score, and AUC (to assess the model’s overall discrimination ability). Given the dataset’s imbalance, Recall and F1 Score will remain the primary focus.
4. Cross-Validation Strategy: We’ll apply 10-fold cross-validation (consistent with Decision Tree experiments) to ensure reliable performance evaluation and mitigate overfitting.
5. Code Implementation: Let’s implement the baseline Random Forest model.
# Enable parallel processing
# cl <- makeCluster(detectCores() - 1)
# registerDoParallel(cl)
# on.exit(stopCluster(cl))
# Define train control for 10-fold cross-validation
set.seed(101)
train_control <- trainControl(
method = "cv",
number = 10,
classProbs = TRUE,
summaryFunction = twoClassSummary
)
# Garbage Collection to free up memory
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 2832542 151.3 4250262 227.0 4250262 227.0
## Vcells 25330441 193.3 45795568 349.4 44915116 342.7
# Train the Random Forest baseline model with parallel processing
rf_model <- train(
y ~ .,
data = train_data_smote,
method = "rf",
trControl = train_control,
metric = "Recall",
ntree = 50,
importance = TRUE
)
# Predict on test data
rf_pred <- predict(rf_model, test_data)
# Confusion Matrix
conf_matrix_rf <- confusionMatrix(rf_pred, test_data$y, positive = "yes")
# Calculate metrics
rf_accuracy <- conf_matrix_rf$overall['Accuracy']
rf_precision <- conf_matrix_rf$byClass['Precision']
rf_recall <- conf_matrix_rf$byClass['Recall']
rf_f1 <- 2 * ((rf_precision * rf_recall) / (rf_precision + rf_recall))
# Calculate AUC
rf_probs <- predict(rf_model, test_data, type = "prob")[, "yes"]
rf_auc <- auc(test_data$y, rf_probs)
# Display Results
cat(sprintf("\nRandom Forest (Default) - Accuracy: %.4f, Precision: %.4f, Recall: %.4f, F1-score: %.4f, AUC: %.4f\n",
rf_accuracy, rf_precision, rf_recall, rf_f1, rf_auc))
##
## Random Forest (Default) - Accuracy: 0.8871, Precision: 0.4982, Recall: 0.4401, F1-score: 0.4674, AUC: 0.7648
# Add results
results <- rbind(results, data.frame(
Experiment = "Random Forest (Default)",
Accuracy = rf_accuracy,
Precision = rf_precision,
Recall = rf_recall,
F1_Score = rf_f1,
AUC = rf_auc
))
rownames(results) <- NULL
6. Analysis- Random Forest - Experiment 1: The Random Forest baseline model improved Precision and AUC compared to the baseline Decision Tree but experienced a slight drop in Recall.
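Because the baseline forest was trained with importance = TRUE, we can also inspect which predictors drive its decisions; a quick sketch (output omitted here):
# Variable importance from the caret-wrapped random forest
varImp(rf_model)
# Equivalent base plot from the underlying randomForest object:
# varImpPlot(rf_model$finalModel)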
1. Objective: Improve the performance of the Random Forest model by tuning key hyperparameters to enhance Recall, Precision, and F1 Score while reducing overfitting.
2. What Will Change: In this experiment, we’ll modify key hyperparameters to improve model generalization and reduce overfitting. Specifically:
‘mtry’ (number of features tried per split): We tested values ‘{3, 5, 7}’. A higher ‘mtry’ allows the model to consider more features per split, which may improve accuracy when strong predictors are present. Lower ‘mtry’ values can help reduce overfitting.
‘ntree’ (number of trees): We set this to ‘100’, which is generally sufficient to stabilize predictions while avoiding excessive computational cost.
‘nodesize’ (minimum size of terminal nodes): not part of caret’s tuning grid for Random Forest, but it is passed through to randomForest; we set it to ‘5’ (the classification default is ‘1’). Larger node sizes reduce variance but may underfit. Beyond that, tuning focuses on ‘mtry’ for simplicity and interpretability.
3. Evaluation Metric: We’ll continue evaluating performance using: Accuracy, Precision, Recall (Priority), F1 Score, AUC. Since Recall and F1 Score are critical for identifying potential subscribers, they will be our primary focus.
4. Cross-Validation Strategy: We’ll continue using 10-fold cross-validation for consistency and robust evaluation.
5. Code Implementation:
# Enable parallel processing with improved core usage
# cl <- makeCluster(detectCores() - 1)
# registerDoParallel(cl)
# on.exit(stopCluster(cl))
# Define train control for 10-fold cross-validation
set.seed(202)
train_control <- trainControl(
method = "cv",
number = 10,
classProbs = TRUE,
summaryFunction = twoClassSummary
)
# TuneGrid - Only `mtry` for Random Forest
tune_grid <- expand.grid(
mtry = c(3, 5, 7)
)
# Garbage Collection to free up memory
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 2841660 151.8 4250262 227.0 4250262 227.0
## Vcells 26946748 205.6 88014604 671.5 88014604 671.5
# Train the tuned Random Forest model
rf_model_tuned <- train(
y ~ .,
data = train_data_smote,
method = "rf",
trControl = train_control,
metric = "Recall",
tuneGrid = tune_grid,
ntree = 100,
nodesize = 5,
importance = TRUE
)
# Predict on test data
set.seed(202)
rf_pred_tuned <- predict(rf_model_tuned, test_data)
# Confusion Matrix
conf_matrix_rf_tuned <- confusionMatrix(rf_pred_tuned, test_data$y, positive = "yes")
# Calculate metrics
rf_accuracy_tuned <- conf_matrix_rf_tuned$overall['Accuracy']
rf_precision_tuned <- conf_matrix_rf_tuned$byClass['Precision']
rf_recall_tuned <- conf_matrix_rf_tuned$byClass['Recall']
rf_f1_tuned <- 2 * ((rf_precision_tuned * rf_recall_tuned) /
(rf_precision_tuned + rf_recall_tuned))
# Calculate AUC
rf_probs_tuned <- predict(rf_model_tuned, test_data, type = "prob")[, "yes"]
rf_auc_tuned <- auc(test_data$y, rf_probs_tuned)
# Display Results
cat(sprintf("\nRandom Forest (Tuned) - Accuracy: %.4f, Precision: %.4f, Recall: %.4f, F1-score: %.4f, AUC: %.4f\n",
rf_accuracy_tuned, rf_precision_tuned, rf_recall_tuned, rf_f1_tuned, rf_auc_tuned))
##
## Random Forest (Tuned) - Accuracy: 0.8839, Precision: 0.4844, Recall: 0.4844, F1-score: 0.4844, AUC: 0.7735
# Add Results to Results Table
results <- rbind(results, data.frame(
Experiment = "Random Forest (Tuned)",
Accuracy = rf_accuracy_tuned,
Precision = rf_precision_tuned,
Recall = rf_recall_tuned,
F1_Score = rf_f1_tuned,
AUC = rf_auc_tuned
))
rownames(results) <- NULL
6. Analysis - Random Forest - Experiment 2: The tuned Random Forest model improved Recall, F1 Score, and AUC compared to the default Random Forest, at the cost of slightly lower Accuracy and Precision. The improved AUC suggests better overall discrimination between positive and negative classes, aligning with the experiment’s objective to enhance model performance and reduce overfitting.
For this experiment, we’ll build a baseline Adaboost model using default hyperparameters to establish a reference for comparison.
1. Objective: Establish a baseline Adaboost model to assess its natural performance as a reference for future tuning and evaluate its ability to improve Recall and Precision in the imbalanced dataset.
2. What will Change: Since this is a baseline model, no hyperparameters will be tuned in this experiment. Our focus will be on evaluating Adaboost’s default behavior.
3. Evaluation Metric: We’ll evaluate the model using Accuracy (overall correctness), Precision (reducing false positives), Recall (the priority metric for identifying actual subscribers), F1 Score (balance between Precision and Recall), and AUC (discrimination ability).
4. Cross-Validation Strategy: Cross-validation is not applied in this baseline experiment due to the use of the base ada() function and the goal of establishing default behavior quickly. Instead, performance is evaluated on a separate held-out test set to provide a realistic baseline for comparison with future tuned models.
5. Code Implementation:
# Train AdaBoost model with 25 iterations (default)
set.seed(303)
ada_model <- ada(
y ~ .,
data = train_data_smote,
iter = 25
)
# Predict on test data
ada_probs <- predict(ada_model, test_data, type = "prob")[, 2]
ada_preds <- predict(ada_model, test_data, type = "class")
# Confusion Matrix
conf_matrix_ada <- confusionMatrix(ada_preds, test_data$y, positive = "yes")
# Calculate metrics
ada_accuracy <- conf_matrix_ada$overall['Accuracy']
ada_precision <- conf_matrix_ada$byClass['Pos Pred Value']
ada_recall <- conf_matrix_ada$byClass['Sensitivity']
ada_f1 <- 2 * (ada_precision * ada_recall) / (ada_precision + ada_recall)
ada_auc <- roc(test_data$y, ada_probs)$auc
# Display Results
cat(sprintf("\nAdaBoost (Default) - Accuracy: %.4f, Precision: %.4f, Recall: %.4f, F1-score: %.4f, AUC: %.4f\n",
ada_accuracy, ada_precision, ada_recall, ada_f1, ada_auc))
##
## AdaBoost (Default) - Accuracy: 0.8738, Precision: 0.4483, Recall: 0.5243, F1-score: 0.4833, AUC: 0.7670
# Add Results to Results Table
results <- rbind(results, data.frame(
Experiment = "AdaBoost (Default)",
Accuracy = ada_accuracy,
Precision = ada_precision,
Recall = ada_recall,
F1_Score = ada_f1,
AUC = ada_auc
))
rownames(results) <- NULL
6. Analysis - AdaBoost (Experiment 1 - Default): The AdaBoost model achieved higher Recall than either Random Forest model and a comparable F1 Score, though its Accuracy and Precision were slightly lower and its Recall remained below that of both Decision Tree models.
1. Objective: Improve the performance of the AdaBoost model by tuning key hyperparameters to enhance Recall, Precision, and F1 Score while maintaining a balanced model that reduces overfitting.
2. What Will Change: In this experiment, we will tune key hyperparameters to improve performance by increasing learning rounds and controlling overfitting:
‘iter’ (number of iterations): increased to ‘50’ (from the ‘25’ used in the baseline) to allow more boosting rounds, enabling the model to better correct previous errors and improve overall learning.
‘nu’ (learning rate): set to ‘0.05’, a lower value that shrinks the contribution of each weak learner. This helps improve model stability and prevents overfitting by slowing the learning process.
‘type’ (classification type): Set to “discrete” to apply the standard AdaBoost algorithm suitable for binary classification tasks like ours. This setting focuses on improving classification margin and is well-suited to imbalanced data.
3. Evaluation Metric: The model will be evaluated using Accuracy, Precision, Recall (priority), F1 Score, and AUC. Since identifying true positives (Recall) is crucial, Recall and F1 Score will remain the primary focus.
4. Cross-Validation Strategy: Due to performance constraints observed during earlier attempts, we opted to run the tuned AdaBoost model without cross-validation. This allowed us to focus on hyperparameter optimization without long training delays, while still evaluating model performance on a held-out test set.
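For completeness, the same 10-fold setup could in principle wrap AdaBoost through caret (method = "ada", whose tuning grid covers iter, maxdepth, and nu); the sketch below is left unevaluated because of the training time noted above:
# Not run (computationally expensive): cross-validated AdaBoost via caret
# ada_grid <- expand.grid(iter = c(50, 100), maxdepth = c(1, 2), nu = c(0.05, 0.1))
# ada_cv <- train(y ~ ., data = train_data_smote, method = "ada",
#                 trControl = train_control, tuneGrid = ada_grid, metric = "ROC")
5. Code Implementation: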
# Train AdaBoost model with tuned hyperparameters
set.seed(404)
ada_model_tuned <- ada(
y ~ .,
data = train_data_smote,
iter = 50, # Increased iterations for improved learning
nu = 0.05, # Smaller learning rate to control overfitting
type = "discrete" # Ensures standard AdaBoost for classification
)
# Predict on test data
ada_probs_tuned <- predict(ada_model_tuned, test_data, type = "prob")[, 2]
ada_preds_tuned <- predict(ada_model_tuned, test_data, type = "class")
# Confusion Matrix
conf_matrix_ada_tuned <- confusionMatrix(ada_preds_tuned, test_data$y, positive = "yes")
# Calculate metrics
ada_accuracy_tuned <- conf_matrix_ada_tuned$overall['Accuracy']
ada_precision_tuned <- conf_matrix_ada_tuned$byClass['Pos Pred Value']
ada_recall_tuned <- conf_matrix_ada_tuned$byClass['Sensitivity']
ada_f1_tuned <- 2 * (ada_precision_tuned * ada_recall_tuned) / (ada_precision_tuned + ada_recall_tuned)
ada_auc_tuned <- roc(test_data$y, ada_probs_tuned)$auc
# Display Results
cat(sprintf("\nAdaBoost (Tuned) - Accuracy: %.4f, Precision: %.4f, Recall: %.4f, F1-score: %.4f, AUC: %.4f\n",
ada_accuracy_tuned, ada_precision_tuned, ada_recall_tuned, ada_f1_tuned, ada_auc_tuned))
##
## AdaBoost (Tuned) - Accuracy: 0.8722, Precision: 0.4437, Recall: 0.5318, F1-score: 0.4838, AUC: 0.7657
# Add Results to Results Table
results <- rbind(results, data.frame(
Experiment = "AdaBoost (Tuned)",
Accuracy = ada_accuracy_tuned,
Precision = ada_precision_tuned,
Recall = ada_recall_tuned,
F1_Score = ada_f1_tuned,
AUC = ada_auc_tuned
))
rownames(results) <- NULL
6. Analysis - AdaBoost (Experiment 2 - Tuned): The tuned AdaBoost model showed a slight improvement in Recall and F1 Score compared to the baseline AdaBoost, with minimal changes in Accuracy, Precision, and AUC.
Results and Visualization:
results_all <- unique(results)
kable(results_all, caption = "Summary of Experiment Results")
| Experiment | Accuracy | Precision | Recall | F1_Score | AUC |
|---|---|---|---|---|---|
| Decision Tree (Default) | 0.7166626 | 0.2369012 | 0.6828479 | 0.3517644 | 0.7019002 |
| Decision Tree (Tuned) | 0.8354384 | 0.3584656 | 0.5846818 | 0.4444444 | 0.7373882 |
| Random Forest (Default) | 0.8870537 | 0.4981685 | 0.4401294 | 0.4673540 | 0.7648443 |
| Random Forest (Tuned) | 0.8838960 | 0.4843581 | 0.4843581 | 0.4843581 | 0.7735345 |
| AdaBoost (Default) | 0.8738159 | 0.4483395 | 0.5242718 | 0.4833416 | 0.7670148 |
| AdaBoost (Tuned) | 0.8722371 | 0.4437444 | 0.5318231 | 0.4838077 | 0.7657449 |
results_long <- melt(results_all, id.vars = "Experiment")
recall_order <- results_all %>%
arrange(desc(Recall)) %>%
pull(Experiment)
results_long$Experiment <- factor(results_long$Experiment, levels = recall_order)
# Plot
ggplot(results_long, aes(x = Experiment, y = value, fill = variable)) +
geom_bar(stat = "identity", position = position_dodge(width = 0.8))+
labs(
title = "Comparison of Model Performance Metrics",
x = "Experiment",
y = "Score",
fill = "Metric"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
plot.title = element_text(size = 14, face = "bold")
)
Conclusion:
Among all models, the Decision Tree (Default) achieved the highest Recall (0.6828), which is the most important metric for our goal of identifying as many potential subscribers as possible. Although it had lower Precision and F1 Score, this trade-off is acceptable, as the business cost of missing a potential customer is higher than contacting a non-interested one.
The tuned Decision Tree model, while more balanced in terms of Precision and F1 Score, showed a decrease in Recall due to pruning and the restriction on tree depth, which made it more conservative in predicting positives. This reduction in model variance comes at the cost of reduced sensitivity, which is not ideal for our objective.
While the Random Forest (Tuned) model delivered stronger balance across all metrics, the Decision Tree (Default) remains the most aligned with our business goal of maximizing customer acquisition through higher Recall.