Dataset: A Portuguese bank conducted a marketing campaign (phone calls) to predict if a client will subscribe to a term deposit. The records of their efforts are available in the form of a dataset.
Objective and Approach: In this project, I applied machine learning techniques to analyze a real-world dataset from a Portuguese bank’s marketing campaign. The goal was to build classification models that predict whether a client will subscribe to a term deposit based on their personal and socio-economic attributes, as well as macroeconomic indicators. To do this, I compared three different supervised learning algorithms — Decision Tree, Random Forest, and AdaBoost — using both default and tuned hyperparameters. The models were evaluated using five performance metrics: Accuracy, Precision, Recall, F1 Score, and AUC. Since the dataset is imbalanced, special attention was given to Recall, which reflects the model’s ability to identify actual subscribers. The analysis concludes by recommending the most appropriate model aligned with the business objective of maximizing customer acquisition.
bank_data <- read.csv("bank-additional-full.csv", sep = ";")
library(tidyverse)
library(ggplot2)
Data structure
str(bank_data)
## 'data.frame': 41188 obs. of 21 variables:
## $ age : int 56 57 37 40 56 45 59 41 24 25 ...
## $ job : chr "housemaid" "services" "services" "admin." ...
## $ marital : chr "married" "married" "married" "married" ...
## $ education : chr "basic.4y" "high.school" "high.school" "basic.6y" ...
## $ default : chr "no" "unknown" "no" "no" ...
## $ housing : chr "no" "no" "yes" "no" ...
## $ loan : chr "no" "no" "no" "no" ...
## $ contact : chr "telephone" "telephone" "telephone" "telephone" ...
## $ month : chr "may" "may" "may" "may" ...
## $ day_of_week : chr "mon" "mon" "mon" "mon" ...
## $ duration : int 261 149 226 151 307 198 139 217 380 50 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
## $ emp.var.rate : num 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
## $ cons.price.idx: num 94 94 94 94 94 ...
## $ cons.conf.idx : num -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
## $ euribor3m : num 4.86 4.86 4.86 4.86 4.86 ...
## $ nr.employed : num 5191 5191 5191 5191 5191 ...
## $ y : chr "no" "no" "no" "no" ...
The dataset contains 41,188 rows (instances) and 21 columns (features + target variable y).
Count Duplicates
sum(duplicated(bank_data))
## [1] 12
Since 12 duplicates out of ~41,000 records amount to a negligible fraction (~0.03%), removing them won’t meaningfully affect the dataset. Let’s remove them.
Remove duplicates
bank_data <- bank_data[!duplicated(bank_data), ]
sum(duplicated(bank_data))
## [1] 0
Check for N/A
colSums(is.na(bank_data))
## age job marital education default
## 0 0 0 0 0
## housing loan contact month day_of_week
## 0 0 0 0 0
## duration campaign pdays previous poutcome
## 0 0 0 0 0
## emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
## 0 0 0 0 0
## y
## 0
There are no missing values (NA).
Check Imbalance: Check the distribution of the target variable (y)
table(bank_data$y)
##
## no yes
## 36537 4639
The dataset is highly imbalanced, with 36,537 “no” responses (88.7%) and 4,639 “yes” responses (11.3%). Since the “yes” class is underrepresented, this could affect model performance, and we may need to apply techniques like class weighting or oversampling (SMOTE) in pre-processing to balance the data.
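These proportions can be confirmed directly with a one-line check on the cleaned data:
# Relative frequency of each class in the target variable
round(prop.table(table(bank_data$y)), 3)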
Count ‘unknown’ in each categorical column
unknown_counts <- bank_data %>%
summarise(across(where(is.character), ~ sum(. == "unknown")))
print(unknown_counts)
## job marital education default housing loan contact month day_of_week poutcome
## 1 330 80 1730 8596 990 990 0 0 0 0
## y
## 1 0
From our analysis, we observe that some categorical variables contain “unknown” values. Our strategy for handling them is as follows: for default, housing, and loan, replace “unknown” with the most frequent category (mode); for job, marital, and education, keep “unknown” as a valid category, since it may carry information of its own.
These modifications will be addressed in the Pre-processing step.
Check Correlation Between Features: To analyze how different numerical variables relate to each other, let’s create a correlation matrix.
library(ggcorrplot)
numeric_data <- bank_data %>% select_if(is.numeric)# Select only numeric columns
cor_matrix <- cor(numeric_data) # Compute correlation matrix
ggcorrplot(cor_matrix, method = "square", type = "lower", lab = TRUE) # Correlation heatmap
Corrplot Analysis: The correlation matrix reveals strong relationships between several numerical features. Notably, employment variation rate (emp.var.rate) and the number of employees (nr.employed) have an extremely high positive correlation of 0.97, suggesting redundancy. Similarly, euribor3m is highly correlated with both nr.employed (0.95) and emp.var.rate (0.91), indicating that these economic indicators move together and may not all be necessary for modeling. There is also a moderate negative correlation of -0.59 between pdays and previous, which might suggest an inverse relationship between the number of days since the last contact and the frequency of previous contacts. Since highly correlated variables can cause multicollinearity issues in modeling, we may consider removing or combining some of them during preprocessing.
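One practical way to act on this during preprocessing is caret’s findCorrelation(), which flags columns whose absolute pairwise correlation exceeds a chosen cutoff. The sketch below uses a cutoff of 0.9 (our choice, not something prescribed by the data) and only lists candidates rather than dropping them:
library(caret)
# Flag numeric features with absolute pairwise correlation above 0.9
high_corr <- findCorrelation(cor_matrix, cutoff = 0.9, names = TRUE)
high_corr # candidate columns to drop or combine during preprocessing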
Feature Distributions:
# Reshape data
bank_long <- bank_data %>%
pivot_longer(cols = where(is.numeric), names_to = "Feature", values_to = "Value")
# Plot
ggplot(bank_long, aes(x = Value)) +
geom_histogram(fill = "steelblue", color = "black", bins = 30) +
facet_wrap(~Feature, scales = "free") +
theme_minimal() +
labs(title = "Distribution of Numeric Features", x = "Value", y = "Count")
Analysis: Based on the histogram analysis, the numerical features show varying distributions. Age is right-skewed, with most clients between 25 and 60 years old. Duration and campaign are also highly skewed, with a large concentration of lower values and a few extreme cases. Pdays has a bimodal distribution, where most values are either very low or at 999, indicating a special category. Previous contacts are mostly zero, showing that many clients had no prior interactions. Economic indicators like employment variation rate (emp.var.rate) and euribor3m show multiple peaks, reflecting fluctuations in economic conditions. The distribution of consumer confidence index (cons.conf.idx) and consumer price index (cons.price.idx) appears more uniform. Overall, many variables are skewed, and some contain potential outliers that need further investigation.
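To put numbers on the skewness described above, a per-feature skewness statistic can be computed. The sketch below assumes the e1071 package is available; any skewness implementation would work:
library(e1071)
# Skewness of each numeric feature: values near 0 are roughly symmetric,
# large positive values indicate a long right tail
sapply(bank_data %>% select(where(is.numeric)), skewness, na.rm = TRUE)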
Identify Outliers Using Boxplots:
ggplot(bank_long, aes(x = Value, y = Feature)) + # Flip x and y
geom_boxplot(fill = "lightblue", outlier.color = "red") +
facet_wrap(~Feature, scales = "free", ncol = 2) +
theme_minimal() +
labs(title = "Boxplots of Numeric Features", x = "Value", y = "Feature") + # Adjust labels
theme(axis.text.x = element_text(size = 10), # Show x-axis labels
axis.text.y = element_text(size = 10),
strip.text = element_text(size = 12, face = "bold"))
Analysis: Duration, Campaign, and Pdays contain extreme outliers, with several observations far above the upper whiskers. Previous and Emp.Var.Rate also show potential outliers, but with fewer extreme points. Nr.Employed and Euribor3m have comparatively few extreme values. The presence of these outliers suggests that some clients had very long call durations, many contacts within the campaign, or a long gap (pdays) since their last contact. The boxplot also confirms that the age distribution is right-skewed, with a few elderly customers appearing as outliers; these records may still be valid, but we need to consider whether they could affect model performance later.
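To quantify these observations, the conventional 1.5 × IQR rule can be applied per feature; a small sketch (the threshold is the usual default, not tuned for this data):
# Count observations outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for each numeric feature
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr, na.rm = TRUE)
}
sapply(bank_data %>% select(where(is.numeric)), count_outliers)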
Analyzing Categorical Variable Distribution:
categorical_data <- bank_data %>%
select(where(is.character)) %>% # Select categorical variables
pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value")
glimpse(categorical_data) # Ensure Feature and Value exist
## Rows: 452,936
## Columns: 2
## $ Feature <chr> "job", "marital", "education", "default", "housing", "loan", "…
## $ Value <chr> "housemaid", "married", "basic.4y", "no", "no", "no", "telepho…
Plotting all categorical variables in a single figure made the categories hard to read, so we plot variables with five or fewer categories and variables with more than five categories separately.
small_categorical_data <- categorical_data %>%
filter(Feature %in% c("marital", "default", "housing", "loan", "contact", "poutcome", "y"))
# Plot small categorical variables
ggplot(small_categorical_data, aes(y = reorder(Value, table(Value)[Value]), fill = Feature)) +
geom_bar() +
facet_wrap(~Feature, scales = "free", ncol = 2) +
theme_minimal() +
labs(title = "Distribution of Categorical Variables (≤5 Categories)", y = "Category", x = "Count") +
theme(axis.text.y = element_text(size = 9),
axis.text.x = element_text(size = 10),
strip.text = element_text(size = 12, face = "bold"),
legend.position = "none",
panel.spacing.x = unit(2, "lines"))
large_categorical_data <- categorical_data %>%
filter(Feature %in% c("job", "education", "month", "day_of_week"))
# Plot large categorical variables with improved x-axis width
ggplot(large_categorical_data, aes(y = reorder(Value, table(Value)[Value]), fill = Feature)) +
geom_bar() +
facet_wrap(~Feature, scales = "free", ncol = 2) +
theme_minimal() +
labs(title = "Distribution of Categorical Variables (>5 Categories)", y = "Category", x = "Count") +
theme(axis.text.y = element_text(size = 10),
axis.text.x = element_text(size = 10),
strip.text = element_text(size = 12, face = "bold"),
legend.position = "none",
panel.spacing.x = unit(2, "lines")) # Add space between facet columns
Model Selection: We will explore three models: Decision Tree, Random Forest, and AdaBoost. Each model has unique strengths that align with the characteristics of the dataset and the project objective.
Decision Tree is a simple yet powerful algorithm that is easy to interpret. It is well-suited for identifying key decision-making factors, making it useful for explaining why certain customers are more likely to subscribe to a term deposit. However, Decision Trees are prone to overfitting, especially in datasets with noise or numerous features. Careful tuning (e.g., maximum depth, minimum samples per split) can improve performance.
Random Forest is an ensemble of Decision Trees, designed to reduce overfitting by averaging multiple tree predictions. This makes it more robust and stable, especially with noisy data or datasets containing both categorical and numerical features. Given the complexity of our dataset and potential outliers, Random Forest is a strong candidate for achieving reliable predictions.
Adaboost is another ensemble method that builds a series of weak learners (e.g., shallow Decision Trees), focusing more on hard-to-classify instances. Since our dataset is highly imbalanced (~88.7% “no”, ~11.3% “yes”), Adaboost’s ability to improve recall for the minority class can enhance the model’s ability to identify potential subscribers effectively.
Replace “unknown” with Mode (Most Frequent Value): For housing, loan, and default, we replace “unknown” with the most frequent category: we identify the mode of each variable and substitute it for the “unknown” values.
# Step 1: Data Cleaning
most_common_housing <- names(sort(table(bank_data$housing), decreasing = TRUE))[1]
most_common_loan <- names(sort(table(bank_data$loan), decreasing = TRUE))[1]
most_common_default <- names(sort(table(bank_data$default), decreasing = TRUE))[1]
bank_data$housing[bank_data$housing == "unknown"] <- most_common_housing
bank_data$loan[bank_data$loan == "unknown"] <- most_common_loan
bank_data$default[bank_data$default == "unknown"] <- most_common_default
Keep “unknown” as a Category for job, marital, and education: Since “unknown” in these variables may carry meaning, we keep it as a valid category and convert the variables to factors (preserving “unknown” as a level).
bank_data$job <- factor(bank_data$job)
bank_data$marital <- factor(bank_data$marital)
bank_data$education <- factor(bank_data$education)
Ensure All Variables Are Correctly Formatted: We will verify that categorical variables are factors and numeric variables are properly formatted.
# Drop duration since it's data leakage
bank_data <- bank_data %>%
select(-duration)
# Define categorical variables
categorical_cols <- c("job", "marital", "education", "default",
"housing", "loan", "contact", "month",
"day_of_week", "poutcome", "y")
# Convert categorical variables to factors
bank_data[categorical_cols] <- lapply(bank_data[categorical_cols], factor)
# Define numeric variables
numeric_cols <- c("age", "campaign", "pdays", "previous",
"emp.var.rate", "cons.price.idx",
"cons.conf.idx", "euribor3m", "nr.employed")
# Convert numeric variables to numeric
bank_data[numeric_cols] <- lapply(bank_data[numeric_cols], as.numeric)
# Summary to confirm
summary(bank_data)
## age job marital
## Min. :17.00 admin. :10419 divorced: 4611
## 1st Qu.:32.00 blue-collar: 9253 married :24921
## Median :38.00 technician : 6739 single :11564
## Mean :40.02 services : 3967 unknown : 80
## 3rd Qu.:47.00 management : 2924
## Max. :98.00 retired : 1718
## (Other) : 6156
## education default housing loan
## university.degree :12164 no :41173 no :18615 no :34928
## high.school : 9512 yes: 3 yes:22561 yes: 6248
## basic.9y : 6045
## professional.course: 5240
## basic.4y : 4176
## basic.6y : 2291
## (Other) : 1748
## contact month day_of_week campaign pdays
## cellular :26135 may :13767 fri:7826 Min. : 1.000 Min. : 0.0
## telephone:15041 jul : 7169 mon:8512 1st Qu.: 1.000 1st Qu.:999.0
## aug : 6176 thu:8618 Median : 2.000 Median :999.0
## jun : 5318 tue:8086 Mean : 2.568 Mean :962.5
## nov : 4100 wed:8134 3rd Qu.: 3.000 3rd Qu.:999.0
## apr : 2631 Max. :56.000 Max. :999.0
## (Other): 2015
## previous poutcome emp.var.rate cons.price.idx
## Min. :0.000 failure : 4252 Min. :-3.40000 Min. :92.20
## 1st Qu.:0.000 nonexistent:35551 1st Qu.:-1.80000 1st Qu.:93.08
## Median :0.000 success : 1373 Median : 1.10000 Median :93.75
## Mean :0.173 Mean : 0.08192 Mean :93.58
## 3rd Qu.:0.000 3rd Qu.: 1.40000 3rd Qu.:93.99
## Max. :7.000 Max. : 1.40000 Max. :94.77
##
## cons.conf.idx euribor3m nr.employed y
## Min. :-50.8 Min. :0.634 Min. :4964 no :36537
## 1st Qu.:-42.7 1st Qu.:1.344 1st Qu.:5099 yes: 4639
## Median :-41.8 Median :4.857 Median :5191
## Mean :-40.5 Mean :3.621 Mean :5167
## 3rd Qu.:-36.4 3rd Qu.:4.961 3rd Qu.:5228
## Max. :-26.9 Max. :5.045 Max. :5228
##
Feature Engineering: Feature engineering involves creating new variables that could improve model performance. Looking at the boxplots of the numeric features, campaign and previous have severe outliers, causing high skewness. We will use Winsorization, which caps extreme values at the 99th percentile, reducing their impact without removing valuable data; this should prevent model instability while preserving most of the data distribution.
# Define Winsorization function
winsorize <- function(x, lower_quantile = 0.01, upper_quantile = 0.99) {
lower_bound <- quantile(x, lower_quantile, na.rm = TRUE)
upper_bound <- quantile(x, upper_quantile, na.rm = TRUE)
x[x < lower_bound] <- lower_bound
x[x > upper_bound] <- upper_bound
return(x)
}
# Apply Winsorization to selected numeric columns with extreme outliers
bank_data <- bank_data %>%
mutate(
campaign = winsorize(campaign),
previous = winsorize(previous) # Removed `pdays`
)
# Boxplot to confirm results
library(ggplot2)
bank_long <- bank_data %>% pivot_longer(cols = where(is.numeric), names_to = "Feature", values_to = "Value")
ggplot(bank_long, aes(y = Value, x = Feature)) +
geom_boxplot(fill = "lightblue", outlier.color = "red") +
facet_wrap(~Feature, scales = "free", ncol = 2) +
coord_flip() +
theme_minimal() +
labs(title = "Boxplots of Numeric Features (After Winsorization, Duration Removed)",
x = "Feature", y = "Value") +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_text(size = 10),
strip.text = element_text(size = 12, face = "bold"))
The Winsorization step has effectively reduced the extreme outliers in campaign and previous, improving the data’s suitability for Decision Trees, Random Forest, and Adaboost.
Let’s check pdays:
table(bank_data$pdays)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12
## 15 26 61 439 118 46 412 60 18 64 52 28 58
## 13 14 15 16 17 18 19 20 21 22 25 26 27
## 36 20 24 11 8 7 3 1 2 3 1 1 1
## 999
## 39661
So, 999 is not a regular numerical value. It is a special category indicating that the customer was never contacted before. Treating it as a number doesn’t make sense, so we convert it into a categorical variable instead.
# Convert pdays into a categorical feature
bank_data <- bank_data %>%
mutate(
pdays_cat = case_when(
pdays == 999 ~ "Never Contacted",
pdays <= 7 ~ "Contacted Recently (0-7 days)",
pdays <= 30 ~ "Contacted Last Month (8-30 days)",
TRUE ~ "Contacted Earlier (30+ days)"
)
)
# Convert to factor for modeling
bank_data$pdays_cat <- as.factor(bank_data$pdays_cat)
# Drop original `pdays` column
bank_data <- bank_data %>%
select(-pdays)
table(bank_data$pdays_cat)
##
## Contacted Last Month (8-30 days) Contacted Recently (0-7 days)
## 338 1177
## Never Contacted
## 39661
Stratified Sampling of the Data
We split the data into training and test sets using stratified sampling, ensuring that both classes are properly represented in each set.
library(caret)
set.seed(123) # Ensure reproducibility
# Stratified sampling (80% train, 20% test)
trainIndex <- createDataPartition(bank_data$y, p = 0.8, list = FALSE)
# Split dataset into training and test sets
train_data <- bank_data[trainIndex, ]
test_data <- bank_data[-trainIndex, ]
# Check class distribution
table(train_data$y) / nrow(train_data)
##
## no yes
## 0.8873171 0.1126829
table(test_data$y) / nrow(test_data)
##
## no yes
## 0.887418 0.112582
The class distribution in the training and test sets is consistent with the original dataset (~88.7% “no”, ~11.3% “yes”).
Apply ROSE (SMOTE-like Oversampling) to Balance the Classes:
library(ROSE)
# Apply ROSE for SMOTE-like behavior
set.seed(123)
train_data_smote <- ROSE(y ~ ., data = train_data, seed = 123)$data
# Verify class distribution after SMOTE
table(train_data_smote$y)
##
## no yes
## 16705 16237
prop.table(table(train_data_smote$y))
##
## no yes
## 0.5071034 0.4928966
1. Objective: Establish a baseline Decision Tree model using default hyperparameters to assess its natural performance as a reference for future experiments.
2. What will change: Since this is a baseline model, no hyperparameters will be adjusted. The focus is to evaluate the Decision Tree’s performance without tuning.
3. Evaluation Metric: We’ll use Accuracy, Precision, Recall, and F1-score to assess performance. Given the dataset’s imbalance, Recall and F1-score will be prioritized. Additionally, AUC will be calculated to evaluate the model’s overall discrimination ability.
4. Cross-Validation Strategy: We will apply 10-fold cross-validation to improve reliability. The data will be divided into 10 parts, with 9 used for training and 1 for testing. This process repeats 10 times, ensuring all data points are tested once. Averaging results reduces the risk of overfitting and provides a robust evaluation.
5. Code Implementation:
To systematically track and compare all 6 experiments, we create a results data frame that logs each experiment’s metrics.
# Initialize Results Data Frame
results <- data.frame(
Experiment = character(),
Accuracy = numeric(),
Precision = numeric(),
Recall = numeric(),
F1_Score = numeric(),
AUC = numeric(),
stringsAsFactors = FALSE
)
Let’s import the relevant libraries.
library(rpart)
library(rpart.plot)
library(ROCR)
library(pROC)
library(doParallel)
library(caret)
library(randomForest)
library(ada)
library(dplyr)
library(knitr)
library(ggplot2)
library(reshape2)
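Since the same five metrics are recomputed after every experiment below, the formulas (Precision = TP/(TP+FP), Recall = TP/(TP+FN), F1 = 2·Precision·Recall/(Precision+Recall)) can be collected in one helper. This is an illustrative sketch — the function name evaluate_model is ours, and the experiments that follow compute the metrics inline instead:
# Illustrative helper: compute Accuracy, Precision, Recall, F1, and AUC
# from class predictions, class probabilities, and the true labels
evaluate_model <- function(pred, probs, truth, positive = "yes") {
  cm <- confusionMatrix(pred, truth, positive = positive)
  precision <- unname(cm$byClass["Precision"])
  recall <- unname(cm$byClass["Recall"])
  data.frame(
    Accuracy = unname(cm$overall["Accuracy"]),
    Precision = precision,
    Recall = recall,
    F1_Score = 2 * precision * recall / (precision + recall),
    AUC = as.numeric(auc(truth, probs))
  )
}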
Decision Tree (Default)
# Enable Parallel Processing
# cl <- makeCluster(detectCores() - 1)
# registerDoParallel(cl)
# on.exit(stopCluster(cl))
# Train Control for 10-Fold CV
set.seed(456)
train_control <- trainControl(
method = "cv",
number = 10,
classProbs = TRUE,
summaryFunction = twoClassSummary
)
# Train Decision Tree Model
dt_model <- train(
y ~ .,
data = train_data_smote,
method = "rpart",
trControl = train_control,
metric = "Recall" # Note: twoClassSummary reports ROC/Sens/Spec, so caret warns that "Recall" is unavailable and optimizes ROC instead
)
# Predict on Test Data
dt_pred <- predict(dt_model, test_data)
# Confusion Matrix
conf_matrix <- confusionMatrix(dt_pred, test_data$y, positive = "yes")
# Calculate Metrics
accuracy <- conf_matrix$overall['Accuracy']
precision <- conf_matrix$byClass['Precision']
recall <- conf_matrix$byClass['Recall']
f1_score <- 2 * ((precision * recall) / (precision + recall))
# Calculate AUC
dt_probs <- predict(dt_model, test_data, type = "prob")[, "yes"]
dt_auc <- auc(test_data$y, dt_probs)
# Display Results
cat(sprintf("\nDecision Tree (Default) - Accuracy: %.4f, Precision: %.4f, Recall: %.4f, F1-score: %.4f, AUC: %.4f\n",
accuracy, precision, recall, f1_score, dt_auc))
##
## Decision Tree (Default) - Accuracy: 0.7167, Precision: 0.2369, Recall: 0.6828, F1-score: 0.3518, AUC: 0.7019
# Log Results
results <- rbind(results, data.frame(
Experiment = "Decision Tree (Default)",
Accuracy = accuracy,
Precision = precision,
Recall = recall,
F1_Score = f1_score,
AUC = dt_auc
))
rownames(results) <- NULL
6. Analysis- Decision Tree (Experiment 1): The baseline Decision Tree showed moderate accuracy (71.7%) with strong recall (68.3%), but its low precision (23.7%) highlights a high false positive rate, indicating room for improvement.
For this experiment, we’ll focus on improving the baseline Decision Tree model by tuning hyperparameters to enhance model performance.
1. Objective: Improve the Decision Tree’s performance by tuning hyperparameters to enhance Recall, Precision, and F1 Score, while ensuring better generalization and reducing overfitting.
2. What will change: We will tune the following hyperparameters to reduce overfitting and improve generalization:
‘cp’ (complexity parameter): Controls cost-complexity pruning. We will test values ‘{0.01, 0.02, 0.03}’ to find the optimal level of tree pruning.
‘minsplit’: Minimum number of samples required to attempt a split. We set this to ‘20’ to avoid overly granular splits that could lead to overfitting.
‘maxdepth’: Not part of caret’s tuning grid for ‘rpart’, but we cap the tree depth at ‘5’ by passing ‘rpart.control(minsplit = 20, maxdepth = 5)’ to the model.
These choices reflect the baseline model’s behavior (high Recall but low Precision), so we tune toward improved F1 Score and Precision without overly sacrificing Recall.
3. Evaluation Metric: We’ll continue to evaluate performance using Accuracy, Precision, Recall, F1 Score, and AUC. Recall and F1 Score will remain the focus given the dataset’s imbalance.
4. Cross-Validation Strategy: We will apply 10-fold cross-validation for consistency with Experiment 1. This approach ensures reliable performance evaluation by averaging results across multiple data splits.
5. Code Implementation:
# Enable Parallel Processing
# cl <- makeCluster(detectCores() - 1)
# registerDoParallel(cl)
# on.exit(stopCluster(cl))
# Define train control for 10-fold cross-validation
set.seed(789)
train_control <- trainControl(
method = "cv",
number = 10,
classProbs = TRUE,
summaryFunction = twoClassSummary
)
# Tuning Grid - Only `cp` is tunable for rpart
tune_grid <- expand.grid(
cp = c(0.01, 0.02, 0.03)
)
# Train the tuned Decision Tree model with added `maxdepth` and `minsplit`
dt_model_tuned <- train(
y ~ .,
data = train_data_smote,
method = "rpart",
trControl = train_control,
tuneGrid = tune_grid,
control = rpart.control(minsplit = 20, maxdepth = 5),
metric = "Recall"
)
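# Optional check (a sketch, output not shown here): inspect the cp value caret
# selected and the cost-complexity table of the final pruned tree
# dt_model_tuned$bestTune
# printcp(dt_model_tuned$finalModel)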
# Predict on test data
dt_pred_tuned <- predict(dt_model_tuned, test_data)
# Confusion Matrix
conf_matrix_tuned <- confusionMatrix(dt_pred_tuned, test_data$y, positive = "yes")
# Calculate Metrics
accuracy_tuned <- conf_matrix_tuned$overall['Accuracy']
precision_tuned <- conf_matrix_tuned$byClass['Precision']
recall_tuned <- conf_matrix_tuned$byClass['Recall']
f1_score_tuned <- 2 * ((precision_tuned * recall_tuned) / (precision_tuned + recall_tuned))
# Calculate AUC
dt_probs_tuned <- predict(dt_model_tuned, test_data, type = "prob")[, "yes"]
dt_auc_tuned <- auc(test_data$y, dt_probs_tuned)
# Display Results
cat(sprintf("\nDecision Tree (Tuned) - Accuracy: %.4f, Precision: %.4f, Recall: %.4f, F1-score: %.4f, AUC: %.4f\n",
accuracy_tuned, precision_tuned, recall_tuned, f1_score_tuned, dt_auc_tuned))
##
## Decision Tree (Tuned) - Accuracy: 0.8354, Precision: 0.3585, Recall: 0.5847, F1-score: 0.4444, AUC: 0.7374
# Add Results
results <- rbind(results, data.frame(
Experiment = "Decision Tree (Tuned)",
Accuracy = accuracy_tuned,
Precision = precision_tuned,
Recall = recall_tuned,
F1_Score = f1_score_tuned,
AUC = dt_auc_tuned
))
rownames(results) <- NULL
6. Analysis - Decision Tree (Experiment 2): The tuned Decision Tree model improved Accuracy, Precision, F1 Score, and AUC over the baseline, indicating better overall performance and improved positive-class identification. While Recall dropped, the improved F1 Score suggests a better balance between Precision and Recall.
1. Objective: Establish a baseline Random Forest model using default hyperparameters to assess its natural performance as a reference for future experiments. This will allow us to compare its performance against the Decision Tree models.
2. What Will Change: Since this is a baseline model, we keep the configuration close to the defaults of the randomForest package:
ntree = 50; the number of trees in the forest (kept small here for training speed; the randomForest package default is 500).
mtry = sqrt(number of features); the default number of randomly selected features considered at each split for classification.
nodesize = 1; the default minimum size of terminal nodes for classification.
This setup will provide a strong baseline to compare improvements in future experiments.
3. Evaluation Metric: We will evaluate model performance using Accuracy, Precision, Recall (priority), F1 Score, and AUC (to assess the model’s overall discrimination ability). Given the dataset’s imbalance, Recall and F1 Score will remain the primary focus.
4. Cross-Validation Strategy: We’ll apply 10-fold cross-validation (consistent with Decision Tree experiments) to ensure reliable performance evaluation and mitigate overfitting.
5. Code Implementation: Let’s implement the baseline Random Forest model.
# Enable parallel processing
# cl <- makeCluster(detectCores() - 1)
# registerDoParallel(cl)
# on.exit(stopCluster(cl))
# Define train control for 10-fold cross-validation
set.seed(101)
train_control <- trainControl(
method = "cv",
number = 10,
classProbs = TRUE,
summaryFunction = twoClassSummary
)
# Garbage Collection to free up memory
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 2832542 151.3 4250262 227.0 4250262 227.0
## Vcells 25330441 193.3 45795568 349.4 44915116 342.7
# Train the Random Forest baseline model with parallel processing
rf_model <- train(
y ~ .,
data = train_data_smote,
method = "rf",
trControl = train_control,
metric = "Recall",
ntree = 50,
importance = TRUE
)
# Predict on test data
rf_pred <- predict(rf_model, test_data)
# Confusion Matrix
conf_matrix_rf <- confusionMatrix(rf_pred, test_data$y, positive = "yes")
# Calculate metrics
rf_accuracy <- conf_matrix_rf$overall['Accuracy']
rf_precision <- conf_matrix_rf$byClass['Precision']
rf_recall <- conf_matrix_rf$byClass['Recall']
rf_f1 <- 2 * ((rf_precision * rf_recall) / (rf_precision + rf_recall))
# Calculate AUC
rf_probs <- predict(rf_model, test_data, type = "prob")[, "yes"]
rf_auc <- auc(test_data$y, rf_probs)
# Display Results
cat(sprintf("\nRandom Forest (Default) - Accuracy: %.4f, Precision: %.4f, Recall: %.4f, F1-score: %.4f, AUC: %.4f\n",
rf_accuracy, rf_precision, rf_recall, rf_f1, rf_auc))
##
## Random Forest (Default) - Accuracy: 0.8871, Precision: 0.4982, Recall: 0.4401, F1-score: 0.4674, AUC: 0.7648
# Add results
results <- rbind(results, data.frame(
Experiment = "Random Forest (Default)",
Accuracy = rf_accuracy,
Precision = rf_precision,
Recall = rf_recall,
F1_Score = rf_f1,
AUC = rf_auc
))
rownames(results) <- NULL
6. Analysis- Random Forest - Experiment 1: The Random Forest baseline model improved Precision and AUC compared to the baseline Decision Tree but experienced a slight drop in Recall.
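Because the baseline forest was trained with importance = TRUE, we can also inspect which predictors drive its decisions; a quick sketch (output omitted here):
# Variable importance from the caret-wrapped random forest
varImp(rf_model)
# Equivalent base plot from the underlying randomForest object:
# varImpPlot(rf_model$finalModel)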
1. Objective: Improve the performance of the Random Forest model by tuning key hyperparameters to enhance Recall, Precision, and F1 Score while reducing overfitting.
2. What Will Change: In this experiment, we’ll modify key hyperparameters to improve model generalization and reduce overfitting. Specifically:
‘mtry’ (number of features tried per split): We tested values ‘{3, 5, 7}’. A higher ‘mtry’ allows the model to consider more features per split, which may improve accuracy when strong predictors are present. Lower ‘mtry’ values can help reduce overfitting.
‘ntree’ (number of trees): We set this to ‘100’, which is generally sufficient to stabilize predictions while avoiding excessive computational cost.
‘nodesize’ (minimum size of terminal nodes): not part of caret’s tuning grid for Random Forest, but it is passed through to randomForest; we set it to ‘5’ (the classification default is ‘1’). Larger node sizes reduce variance but may underfit. Beyond that, tuning focuses on ‘mtry’ for simplicity and interpretability.
3. Evaluation Metric: We’ll continue evaluating performance using: Accuracy, Precision, Recall (Priority), F1 Score, AUC. Since Recall and F1 Score are critical for identifying potential subscribers, they will be our primary focus.
4. Cross-Validation Strategy: We’ll continue using 10-fold cross-validation for consistency and robust evaluation.
5. Code Implementation:
# Enable parallel processing with improved core usage
# cl <- makeCluster(detectCores() - 1)
# registerDoParallel(cl)
# on.exit(stopCluster(cl))
# Define train control for 10-fold cross-validation
set.seed(202)
train_control <- trainControl(
method = "cv",
number = 10,
classProbs = TRUE,
summaryFunction = twoClassSummary
)
# TuneGrid - Only `mtry` for Random Forest
tune_grid <- expand.grid(
mtry = c(3, 5, 7)
)
# Garbage Collection to free up memory
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 2841660 151.8 4250262 227.0 4250262 227.0
## Vcells 26946748 205.6 88014604 671.5 88014604 671.5
# Train the tuned Random Forest model
rf_model_tuned <- train(
y ~ .,
data = train_data_smote,
method = "rf",
trControl = train_control,
metric = "Recall",
tuneGrid = tune_grid,
ntree = 100,
nodesize = 5,
importance = TRUE
)
# Predict on test data
set.seed(202)
rf_pred_tuned <- predict(rf_model_tuned, test_data)
# Confusion Matrix
conf_matrix_rf_tuned <- confusionMatrix(rf_pred_tuned, test_data$y, positive = "yes")
# Calculate metrics
rf_accuracy_tuned <- conf_matrix_rf_tuned$overall['Accuracy']
rf_precision_tuned <- conf_matrix_rf_tuned$byClass['Precision']
rf_recall_tuned <- conf_matrix_rf_tuned$byClass['Recall']
rf_f1_tuned <- 2 * ((rf_precision_tuned * rf_recall_tuned) /
(rf_precision_tuned + rf_recall_tuned))
# Calculate AUC
rf_probs_tuned <- predict(rf_model_tuned, test_data, type = "prob")[, "yes"]
rf_auc_tuned <- auc(test_data$y, rf_probs_tuned)
# Display Results
cat(sprintf("\nRandom Forest (Tuned) - Accuracy: %.4f, Precision: %.4f, Recall: %.4f, F1-score: %.4f, AUC: %.4f\n",
rf_accuracy_tuned, rf_precision_tuned, rf_recall_tuned, rf_f1_tuned, rf_auc_tuned))
##
## Random Forest (Tuned) - Accuracy: 0.8839, Precision: 0.4844, Recall: 0.4844, F1-score: 0.4844, AUC: 0.7735
# Add Results to Results Table
results <- rbind(results, data.frame(
Experiment = "Random Forest (Tuned)",
Accuracy = rf_accuracy_tuned,
Precision = rf_precision_tuned,
Recall = rf_recall_tuned,
F1_Score = rf_f1_tuned,
AUC = rf_auc_tuned
))
rownames(results) <- NULL
6. Analysis - Random Forest - Experiment 2: The tuned Random Forest model improved Recall, F1 Score, and AUC compared to the default Random Forest, at the cost of slightly lower Accuracy and Precision. The improved AUC suggests better overall discrimination between positive and negative classes, aligning with the experiment’s objective to enhance model performance and reduce overfitting.
For this experiment, we’ll build a baseline Adaboost model using default hyperparameters to establish a reference for comparison.
1. Objective: Establish a baseline Adaboost model to assess its natural performance as a reference for future tuning and evaluate its ability to improve Recall and Precision in the imbalanced dataset.
2. What will Change: Since this is a baseline model, no hyperparameters will be tuned in this experiment. Our focus will be on evaluating Adaboost’s default behavior.
3. Evaluation Metric: We’ll evaluate the model using Accuracy (overall correctness), Precision (reducing false positives), Recall (the priority metric for identifying actual subscribers), F1 Score (balance between Precision and Recall), and AUC (discrimination ability).
4. Cross-Validation Strategy: Cross-validation is not applied in this baseline experiment due to the use of the base ada() function and the goal of establishing default behavior quickly. Instead, performance is evaluated on a separate held-out test set to provide a realistic baseline for comparison with future tuned models.
5. Code Implementation:
# Train AdaBoost model with 25 iterations (default)
set.seed(303)
ada_model <- ada(
y ~ .,
data = train_data_smote,
iter = 25
)
# Predict on test data
ada_probs <- predict(ada_model, test_data, type = "prob")[, 2]
ada_preds <- predict(ada_model, test_data, type = "class")
# Confusion Matrix
conf_matrix_ada <- confusionMatrix(ada_preds, test_data$y, positive = "yes")
# Calculate metrics
ada_accuracy <- conf_matrix_ada$overall['Accuracy']
ada_precision <- conf_matrix_ada$byClass['Pos Pred Value']
ada_recall <- conf_matrix_ada$byClass['Sensitivity']
ada_f1 <- 2 * (ada_precision * ada_recall) / (ada_precision + ada_recall)
ada_auc <- roc(test_data$y, ada_probs)$auc
# Display Results
cat(sprintf("\nAdaBoost (Default) - Accuracy: %.4f, Precision: %.4f, Recall: %.4f, F1-score: %.4f, AUC: %.4f\n",
ada_accuracy, ada_precision, ada_recall, ada_f1, ada_auc))
##
## AdaBoost (Default) - Accuracy: 0.8738, Precision: 0.4483, Recall: 0.5243, F1-score: 0.4833, AUC: 0.7670
# Add Results to Results Table
results <- rbind(results, data.frame(
Experiment = "AdaBoost (Default)",
Accuracy = ada_accuracy,
Precision = ada_precision,
Recall = ada_recall,
F1_Score = ada_f1,
AUC = ada_auc
))
rownames(results) <- NULL
6. Analysis - AdaBoost (Experiment 1 - Default): The AdaBoost model achieved higher Recall than either Random Forest model and a comparable F1 Score, though its Accuracy and Precision were slightly lower and its Recall remained below that of both Decision Tree models.
1. Objective: Improve the performance of the AdaBoost model by tuning key hyperparameters to enhance Recall, Precision, and F1 Score while maintaining a balanced model that reduces overfitting.
2. What Will Change: In this experiment, we will tune key hyperparameters to improve performance by increasing learning rounds and controlling overfitting:
‘iter’ (number of iterations): increased to ‘50’ (from the ‘25’ used in the baseline) to allow more boosting rounds, enabling the model to better correct previous errors and improve overall learning.
‘nu’ (learning rate): set to ‘0.05’, a lower value that shrinks the contribution of each weak learner. This helps improve model stability and prevents overfitting by slowing the learning process.
‘type’ (classification type): Set to “discrete” to apply the standard AdaBoost algorithm suitable for binary classification tasks like ours. This setting focuses on improving classification margin and is well-suited to imbalanced data.
3. Evaluation Metric: The model will be evaluated using Accuracy, Precision, Recall (priority), F1 Score, and AUC. Since identifying true positives (Recall) is crucial, Recall and F1 Score will remain the primary focus.
4. Cross-Validation Strategy: Due to performance constraints observed during earlier attempts, we opted to run the tuned AdaBoost model without cross-validation. This allowed us to focus on hyperparameter optimization without long training delays, while still evaluating model performance on a held-out test set.
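For completeness, the same 10-fold setup could in principle wrap AdaBoost through caret (method = "ada", whose tuning grid covers iter, maxdepth, and nu); the sketch below is left unevaluated because of the training time noted above:
# Not run (computationally expensive): cross-validated AdaBoost via caret
# ada_grid <- expand.grid(iter = c(50, 100), maxdepth = c(1, 2), nu = c(0.05, 0.1))
# ada_cv <- train(y ~ ., data = train_data_smote, method = "ada",
#                 trControl = train_control, tuneGrid = ada_grid, metric = "ROC")
5. Code Implementation: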
# Train AdaBoost model with tuned hyperparameters
set.seed(404)
ada_model_tuned <- ada(
y ~ .,
data = train_data_smote,
iter = 50, # Increased iterations for improved learning
nu = 0.05, # Smaller learning rate to control overfitting
type = "discrete" # Ensures standard AdaBoost for classification
)
# Predict on test data
ada_probs_tuned <- predict(ada_model_tuned, test_data, type = "prob")[, 2]
ada_preds_tuned <- predict(ada_model_tuned, test_data, type = "class")
# Confusion Matrix
conf_matrix_ada_tuned <- confusionMatrix(ada_preds_tuned, test_data$y, positive = "yes")
# Calculate metrics
ada_accuracy_tuned <- conf_matrix_ada_tuned$overall['Accuracy']
ada_precision_tuned <- conf_matrix_ada_tuned$byClass['Pos Pred Value']
ada_recall_tuned <- conf_matrix_ada_tuned$byClass['Sensitivity']
ada_f1_tuned <- 2 * (ada_precision_tuned * ada_recall_tuned) / (ada_precision_tuned + ada_recall_tuned)
ada_auc_tuned <- roc(test_data$y, ada_probs_tuned)$auc
# Display Results
cat(sprintf("\nAdaBoost (Tuned) - Accuracy: %.4f, Precision: %.4f, Recall: %.4f, F1-score: %.4f, AUC: %.4f\n",
ada_accuracy_tuned, ada_precision_tuned, ada_recall_tuned, ada_f1_tuned, ada_auc_tuned))
##
## AdaBoost (Tuned) - Accuracy: 0.8722, Precision: 0.4437, Recall: 0.5318, F1-score: 0.4838, AUC: 0.7657
# Add Results to Results Table
results <- rbind(results, data.frame(
Experiment = "AdaBoost (Tuned)",
Accuracy = ada_accuracy_tuned,
Precision = ada_precision_tuned,
Recall = ada_recall_tuned,
F1_Score = ada_f1_tuned,
AUC = ada_auc_tuned
))
rownames(results) <- NULL
6. Analysis - AdaBoost (Experiment 2 - Tuned): The tuned AdaBoost model showed a slight improvement in Recall and F1 Score compared to the baseline AdaBoost, with minimal changes in Accuracy, Precision, and AUC.
Results and Visualization:
results_all <- unique(results)
kable(results_all, caption = "Summary of Experiment Results")
| Experiment | Accuracy | Precision | Recall | F1_Score | AUC |
|---|---|---|---|---|---|
| Decision Tree (Default) | 0.7166626 | 0.2369012 | 0.6828479 | 0.3517644 | 0.7019002 |
| Decision Tree (Tuned) | 0.8354384 | 0.3584656 | 0.5846818 | 0.4444444 | 0.7373882 |
| Random Forest (Default) | 0.8870537 | 0.4981685 | 0.4401294 | 0.4673540 | 0.7648443 |
| Random Forest (Tuned) | 0.8838960 | 0.4843581 | 0.4843581 | 0.4843581 | 0.7735345 |
| AdaBoost (Default) | 0.8738159 | 0.4483395 | 0.5242718 | 0.4833416 | 0.7670148 |
| AdaBoost (Tuned) | 0.8722371 | 0.4437444 | 0.5318231 | 0.4838077 | 0.7657449 |
results_long <- melt(results_all, id.vars = "Experiment")
recall_order <- results_all %>%
arrange(desc(Recall)) %>%
pull(Experiment)
results_long$Experiment <- factor(results_long$Experiment, levels = recall_order)
# Plot
ggplot(results_long, aes(x = Experiment, y = value, fill = variable)) +
geom_bar(stat = "identity", position = position_dodge(width = 0.8))+
labs(
title = "Comparison of Model Performance Metrics",
x = "Experiment",
y = "Score",
fill = "Metric"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
plot.title = element_text(size = 14, face = "bold")
)
Conclusion:
Among all models, the Decision Tree (Default) achieved the highest Recall (0.6828), which is the most important metric for our goal of identifying as many potential subscribers as possible. Although it had lower Precision and F1 Score, this trade-off is acceptable, as the business cost of missing a potential customer is higher than contacting a non-interested one.
The tuned Decision Tree model, while more balanced in terms of Precision and F1 Score, showed a decrease in Recall due to pruning and the restriction on tree depth, which made it more conservative in predicting positives. This reduction in model variance comes at the cost of reduced sensitivity, which is not ideal for our objective.
While the Random Forest (Tuned) model delivered stronger balance across all metrics, the Decision Tree (Default) remains the most aligned with our business goal of maximizing customer acquisition through higher Recall.