library(tidyverse)
library(DataExplorer)
library(corrplot)
library(ggplot2)
library(dplyr)
library(gridExtra)
library(grid)
library(scales)
library(tibble)
library(DT)
library(naniar)
library(Amelia)
library(caret)
library(rpart)
library(rpart.plot)
library(pROC)
library(randomForest)
library(adabag)
This dataset is offered in two versions: one with 16 features plus the target variable (y; subscription status) and an expanded version with 20 features plus the target. I have chosen to use the expanded, 20-feature version below.
df1 <- read.csv("bank-full.csv", sep = ";", header = TRUE, stringsAsFactors = FALSE)
dim(df1)
## [1] 45211 17
df2 <- read.csv("bank-additional-full.csv", sep = ";", header = TRUE, stringsAsFactors = FALSE)
dim(df2)
## [1] 41188 21
# Going with the expanded version (20 features rather than 16): although it has slightly fewer rows, the additional features arguably provide richer information overall.
df <- df2
df <- df %>% rename(subscribed = y)
df$subscribed <- as.factor(df$subscribed)
# replace "unknown" with NA
df[df == "unknown"] <- NA
# missing values
#colSums(is.na(df))
missing_values <- colSums(is.na(df))
missing_values[missing_values > 0]
## job marital education default housing loan
## 330 80 1731 8597 990 990
# numeric var
plot_numeric_distribution <- function(df) {
num_vars <- df %>% select_if(is.numeric)
for (var in names(num_vars)) {
print(
ggplot(df, aes(x = get(var))) +
geom_histogram(bins = 50, fill = "steelblue", color = "black", alpha = 0.7) +
labs(title = paste("Distribution of", var), x = var, y = "Count") +
theme_minimal())}}
plot_numeric_distribution(df)
The distributions of several numeric variables are highly skewed (duration, campaign, and emp.var.rate, among others), while age is only slightly skewed. The categorical variables' distributions are shown next, and outlier detection follows below.
# categorical var
plot_categorical_distribution <- function(df) {
cat_vars <- df %>% select_if(is.character)
for (var in names(cat_vars)) {
print(
ggplot(df, aes(x = get(var))) +
geom_bar(fill = "steelblue") +
labs(title = paste("Distribution of", var), x = var, y = "Count") +
      theme_minimal() +  # applied before theme() so it does not override the rotated axis labels
      theme(axis.text.x = element_text(angle = 45, hjust = 1)))}}
plot_categorical_distribution(df)
The categorical variables look reasonably distributed as well.
Using boxplots
plot_outliers_horizontal <- function(df) {
num_vars <- df %>% select_if(is.numeric)
num_plots <- length(num_vars)
cols <- 2
plots <- lapply(names(num_vars), function(var) {
ggplot(df, aes(y = get(var), x = "")) +
geom_boxplot(fill = "#69b3a2", outlier.color = "red", outlier.size = 2) +
labs(title = paste("Boxplot of", var), y = var, x = " ") +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold"),
axis.text.y = element_text(size = 14),
axis.text.x = element_text(size = 12),
axis.ticks.x = element_line(color = "black"),
panel.grid.major = element_line(color = "grey85"),
panel.grid.minor = element_blank()) +
coord_flip()})
grid.arrange(
grobs = plots,
ncol = cols,
nrow = ceiling(num_plots / cols),
top = textGrob(" ", gp = gpar(fontsize = 18, fontface = "bold")))
grid.lines(x = unit(0.5, "npc"), y = unit(c(0, 1), "npc"), gp = gpar(col = "black", lwd = 2))}
plot_outliers_horizontal(df)
The box plots show certain outliers that appear anomalous. I could remove them and treat them as missing values (to be imputed later), or apply capping so that extreme values do not unduly influence the model.
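For illustration, here is a minimal sketch of the capping option: cap_outliers is a hypothetical helper (not used in the rest of this report) that winsorizes every numeric column at its 1st and 99th percentiles.
# hypothetical helper: cap (winsorize) numeric columns at chosen percentile bounds
cap_outliers <- function(data, lower = 0.01, upper = 0.99) {
  data %>%
    mutate(across(where(is.numeric), function(x) {
      bounds <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
      pmin(pmax(x, bounds[1]), bounds[2])  # clip values to the percentile bounds
    }))
}
# example usage (not applied to df below)
# df_capped <- cap_outliers(df)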
Using IQR to Identify Outliers
detect_outliers <- function(df) {
num_vars <- df %>% select_if(is.numeric)
outliers <- list()
for (var in names(num_vars)) {
Q1 <- quantile(df[[var]], 0.25, na.rm = TRUE)
Q3 <- quantile(df[[var]], 0.75, na.rm = TRUE)
IQR_val <- Q3 - Q1
lower_bound <- Q1 - 3 * IQR_val
upper_bound <- Q3 + 3 * IQR_val
num_outliers <- sum(df[[var]] < lower_bound | df[[var]] > upper_bound, na.rm = TRUE)
if (num_outliers > 0) {
outliers[[var]] <- num_outliers}}
return(outliers)}
outlier_counts <- detect_outliers(df)
print(outlier_counts)
## $age
## [1] 4
##
## $duration
## [1] 1043
##
## $campaign
## [1] 1094
##
## $pdays
## [1] 1515
##
## $previous
## [1] 5625
So there are a number of outlier values in these five variables. Now, to see what they actually are:
detect_outliers_df <- function(df) {
num_vars <- df %>% select_if(is.numeric)
outlier_data <- list()
for (var in names(num_vars)) {
Q1 <- quantile(df[[var]], 0.25, na.rm = TRUE)
Q3 <- quantile(df[[var]], 0.75, na.rm = TRUE)
IQR_val <- Q3 - Q1
lower_bound <- Q1 - 3 * IQR_val
upper_bound <- Q3 + 3 * IQR_val
outliers <- df[[var]][df[[var]] < lower_bound | df[[var]] > upper_bound]
if (length(outliers) > 0) {
outlier_data[[var]] <- tibble(
Variable = var,
Outlier_Value = outliers)}}
outlier_df <- bind_rows(outlier_data)
return(outlier_df)}
outlier_table <- detect_outliers_df(df)
datatable(outlier_table, options = list(pageLength = 10, scrollX = TRUE))
Browsing through these values, many of them are not particularly extreme for age, duration, pdays, or the other variables. Most of the flagged pdays values are small values such as 0, which are not really outliers per se. previous values of 7 or 1 are not exactly anomalous either. The same goes for duration: since it is measured in seconds, it is plausible that at least some calls run up to a maximum of about 49 minutes. For campaign, it is strange that some clients were contacted up to 56 times, but that may be normal in this industry, though it is certainly on the higher end.
So for the outliers shown here, there does not seem to be a strong need for removal or capping, since they are not unreasonable.
Checking Categorical Variables for Anomalies
check_categorical_anomalies <- function(df) {
cat_vars <- df %>% select_if(is.character)
for (var in names(cat_vars)) {
print(paste("Category counts for:", var))
print(table(df[[var]]))
print("-------------------------------------------------")}}
check_categorical_anomalies(df)
## [1] "Category counts for: job"
##
## admin. blue-collar entrepreneur housemaid management
## 10422 9254 1456 1060 2924
## retired self-employed services student technician
## 1720 1421 3969 875 6743
## unemployed
## 1014
## [1] "-------------------------------------------------"
## [1] "Category counts for: marital"
##
## divorced married single
## 4612 24928 11568
## [1] "-------------------------------------------------"
## [1] "Category counts for: education"
##
## basic.4y basic.6y basic.9y high.school
## 4176 2292 6045 9515
## illiterate professional.course university.degree
## 18 5243 12168
## [1] "-------------------------------------------------"
## [1] "Category counts for: default"
##
## no yes
## 32588 3
## [1] "-------------------------------------------------"
## [1] "Category counts for: housing"
##
## no yes
## 18622 21576
## [1] "-------------------------------------------------"
## [1] "Category counts for: loan"
##
## no yes
## 33950 6248
## [1] "-------------------------------------------------"
## [1] "Category counts for: contact"
##
## cellular telephone
## 26144 15044
## [1] "-------------------------------------------------"
## [1] "Category counts for: month"
##
## apr aug dec jul jun mar may nov oct sep
## 2632 6178 182 7174 5318 546 13769 4101 718 570
## [1] "-------------------------------------------------"
## [1] "Category counts for: day_of_week"
##
## fri mon thu tue wed
## 7827 8514 8623 8090 8134
## [1] "-------------------------------------------------"
## [1] "Category counts for: poutcome"
##
## failure nonexistent success
## 4252 35563 1373
## [1] "-------------------------------------------------"
Looking at this tabulation, in addition to the bar charts, confirms that the category values make sense.
I have described patterns in the data while discussing the distributions above and drawn insights about several variables, including their central tendency and spread.
Columns with missing values are:
missing_counts <- colSums(is.na(df))
missing_counts[missing_counts > 0]
## job marital education default housing loan
## 330 80 1731 8597 990 990
Some variables have a large number of missing values, especially default, which may affect the model. Other potentially important variables (e.g., education and housing) may also be affected. So I am going to look deeper at this to see whether the missingness is random and whether it is related to the target.
Looking at this visually:
vis_miss(df) +
ggtitle("Missing Data Pattern") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Test If Missingness is Random
# visualize the missingness pattern (Amelia's missmap), then run Little's MCAR test from naniar
missmap(df, main = "Missing Data Map", col = c("blue", "gray"), legend = TRUE)
mcar_test(df)
## # A tibble: 1 × 4
## statistic df p.value missing.patterns
## <dbl> <dbl> <dbl> <int>
## 1 5458. 406 0 23
Little's MCAR test returns a p-value of essentially zero, so we reject the hypothesis that the data are missing completely at random (MCAR).
Correlation Between Missingness & subscribed
missing_cols <- names(df)[colSums(is.na(df)) > 0]
df_missing <- df %>%
mutate(across(all_of(missing_cols), ~ ifelse(is.na(.), 1, 0), .names = "missing_{.col}"))
missing_correlation <- df_missing %>%
select(starts_with("missing_")) %>%
mutate(subscribed = df$subscribed) %>%
group_by(subscribed) %>%
summarise(across(starts_with("missing_"), \(x) mean(x, na.rm = TRUE)))
print(missing_correlation)
## # A tibble: 2 × 7
## subscribed missing_job missing_marital missing_education missing_default
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 no 0.00802 0.00186 0.0405 0.223
## 2 yes 0.00797 0.00259 0.0541 0.0955
## # ℹ 2 more variables: missing_housing <dbl>, missing_loan <dbl>
This is helpful, as it shows that missingness in default (and, to a lesser extent, education) differs noticeably between subscribers and non-subscribers.
Overall, I do not see outliers as a dangerous pattern here, and missingness matters for education and default in particular. For these, I will need to choose an imputation method, such as an iterative imputer or a KNN approach. There do not appear to be inconsistent values, or values that do not align with what would be expected from a dataset like this.
For this dataset, the most suitable algorithms for predicting whether a customer will subscribe to a term deposit (subscribed) include Logistic Regression, Random Forest, and XGBoost. Logistic Regression is useful as a baseline due to its interpretability and efficiency, while Random Forest and XGBoost are more powerful ensemble methods that can capture complex interactions and non-linear relationships within the dataset. That said, banking is not my field, so while I have tried to look at this from a domain perspective, that perspective may not be perfect.
This is a supervised prediction problem, since we have labelled data, as I explain below. The data's characteristics and limitations allow for a few possible modelling approaches, listed below:
Logistic Regression. Pros: Simple, interpretable, efficient on large datasets, and works well with binary classification. Cons: Assumes a linear relationship between the independent variables and the log-odds of the target, making it less effective for complex patterns.
Random Forest. Pros: Handles both numerical and categorical variables, is robust to missing values, and reduces overfitting by averaging multiple decision trees. Cons: Computationally expensive, especially for large datasets, and harder to interpret than logistic regression.
XGBoost. Pros: Extremely powerful on structured tabular data, robust to missing values, and performs well with imbalanced data. Cons: Requires hyperparameter tuning and is computationally more demanding.
These suggested models do align with the business characteristics and goals from a dataset like this. These are also scalable approaches that allow for continuing to collect more data and refine the model further.
Among these, XGBoost is the best recommendation I think due to its high accuracy, ability to handle missing data, and effectiveness in tabular datasets like this one. Random Forest is a good alternative if interpretability is needed, while Logistic Regression can be used as a baseline model to compare performance.
Yes, the dataset has a labeled target variable (subscribed: yes/no), which makes this a supervised classification problem. This allows the use of classification models like Logistic Regression, Decision Trees, Random Forest, and Gradient Boosting models (XGBoost) instead of unsupervised learning methods such as clustering.
The dataset contains both categorical and numerical features, requiring an algorithm that handles mixed data types, missing values, and class imbalance. Tree-based models (Random Forest & XGBoost) are well-suited for these types of datasets as they automatically handle feature selection, interactions, and non-linearity. Logistic Regression, while simpler, may struggle with non-linear relationships and interactions between variables.
If the dataset had fewer than 1,000 records, simpler models like Logistic Regression or Decision Trees would be preferable. XGBoost and Random Forest require more data to generalize well, and with a small dataset, they may overfit. Logistic Regression would work better in this case because it requires fewer data points to provide stable estimates, while a Decision Tree could be used if non-linearity is important.
Address Missing Data. XGBoost natively handles missing values, so imputation is optional. However, we should analyze whether missing values hold information before deciding.
Options: - Leave missing values as-is (XGBoost assigns them to the optimal split direction). - Use mean/median imputation for continuous features if missingness seems random; an iterative imputer or KNN imputation are other options. - Create binary indicators for missing education and missing default, since their missingness was associated with the target (subscribed).
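As a small sketch of the indicator option (the df_ind data frame and indicator names here are illustrative and are not carried forward into the experiments below):
# add 0/1 indicators for the two missingness patterns that relate to the target
df_ind <- df %>%
  mutate(missing_education = as.integer(is.na(education)),
         missing_default = as.integer(is.na(default)))
# quick check of how the default-missingness indicator splits by the target
table(df_ind$missing_default, df_ind$subscribed)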
Check for Duplicates & Outliers
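A quick sketch of the duplicate check (dropping duplicates, if any, is left commented out):
# count exact duplicate rows
sum(duplicated(df))
# if any are found, they could be dropped:
# df <- distinct(df)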
Drop Highly Correlated Features. From the correlation analysis, euribor3m, nr.employed, and emp.var.rate are strongly correlated. We can remove one or two of them to avoid redundancy.
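A sketch of how this redundancy check could be run with the already-loaded corrplot and caret packages; the 0.9 cutoff is my choice, not a tuned value.
num_vars <- df %>% select(where(is.numeric))
num_corr <- cor(num_vars, use = "pairwise.complete.obs")
corrplot(num_corr, method = "color", type = "upper", tl.cex = 0.7)
# caret's findCorrelation suggests which columns to drop at the chosen cutoff
names(num_vars)[findCorrelation(num_corr, cutoff = 0.9)]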
Drop duration (if aiming for real-world deployment). Since call duration is a strong predictor but unknown before a call happens, it should be removed unless the goal is just benchmarking.
I don't think I need to resize the data, since it is neither too large nor too small. But, generally, categorical features can be encoded as integers:
df$job <- as.integer(as.factor(df$job))
df$marital <- as.integer(as.factor(df$marital))
# Alternatively, one-hot encoding can be applied (not required for XGBoost but useful for explainability).
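A sketch of the one-hot alternative using caret's dummyVars, applied to a separate copy (df_onehot) so the integer-encoded df above is left unchanged:
# one-hot encode the categorical predictors; the target is excluded before encoding
dmy <- dummyVars(~ ., data = df %>% select(-subscribed))
df_onehot <- data.frame(predict(dmy, newdata = df %>% select(-subscribed)))
dim(df_onehot)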
Scale Numerical Features (Not required for XGBoost, but recommended for comparison with other models):
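A sketch of centering and scaling the numeric columns with caret's preProcess (note that by this point job and marital are integer-encoded and would be scaled as well; this is only relevant for models such as logistic regression):
# estimate centering/scaling parameters on the numeric columns and apply them
pp <- preProcess(df %>% select(where(is.numeric)), method = c("center", "scale"))
df_scaled_num <- predict(pp, df %>% select(where(is.numeric)))
summary(df_scaled_num$age)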
Since subscribed = yes is much less frequent than no, XGBoost may be biased toward the majority class. Solutions: - Set scale_pos_weight = (# negative samples / # positive samples) in XGBoost to balance class weights. - Use SMOTE (Synthetic Minority Over-sampling Technique) if upsampling is needed. - Use stratified sampling during training to ensure balance.
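A minimal sketch of the scale_pos_weight option, assuming the xgboost package is installed (it is not used elsewhere in this report); the specific hyperparameters (max_depth, eta, nrounds) are placeholders, not tuned values.
library(xgboost)
# reuse the idea from the experiments below: replace NAs in character columns with "unknown"
df_xgb <- df %>%
  mutate(across(where(is.character), ~ replace(., is.na(.), "unknown")))
# build a design matrix that keeps rows with any remaining NAs (xgboost treats NA as missing natively)
mf <- model.frame(subscribed ~ ., data = df_xgb, na.action = na.pass)
X <- model.matrix(subscribed ~ ., data = mf)[, -1]  # drop the intercept column
y <- as.numeric(df_xgb$subscribed == "yes")
spw <- sum(y == 0) / sum(y == 1)  # negative/positive ratio
dtrain <- xgb.DMatrix(data = X, label = y)
params <- list(objective = "binary:logistic", eval_metric = "auc",
               scale_pos_weight = spw, max_depth = 4, eta = 0.1)
xgb_fit <- xgb.train(params = params, data = dtrain, nrounds = 200, verbose = 0)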
It is also worth keeping in mind that customers contacted many times (campaign above some threshold) may be less likely to subscribe, and that older customers may have different subscription tendencies.
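To illustrate, here is a small sketch of engineered features along these lines; the age cut points and the heavy-contact threshold of four calls are my assumptions for illustration, not tuned values.
df_fe <- df %>%
  mutate(
    age_group = cut(age, breaks = c(0, 30, 45, 60, Inf),
                    labels = c("<=30", "31-45", "46-60", "60+")),
    heavily_contacted = as.integer(campaign > 4),    # assumed threshold
    previously_contacted = as.integer(pdays != 999)  # pdays = 999 means never contacted before
  )
table(df_fe$age_group, df_fe$subscribed)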
In this analysis, I explored a dataset containing information on a Portuguese bank’s marketing campaign aimed at encouraging customers to subscribe to a term deposit. The dataset includes demographic details, previous marketing interactions, and economic indicators, requiring careful preprocessing before model training. Through exploratory data analysis (EDA), I examined data distributions, missingness patterns, outliers, and feature correlations. Based on my findings, I selected XGBoost as the most suitable machine learning algorithm for predicting customer subscription.
The EDA revealed several important insights. Certain features, such as call duration and previous contacts, had a strong influence on subscription likelihood. Missing data was not missing completely at random (MCAR), particularly for education and default, which had missingness patterns associated with the target variable. Outliers were observed in campaign, duration, and pdays, indicating potential skewness in customer interactions. Additionally, economic indicators such as euribor3m, nr.employed, and emp.var.rate were highly correlated, requiring dimensionality reduction to avoid redundancy.
Based on these findings, XGBoost was chosen as the best algorithm for this classification task. XGBoost is an ensemble learning method that builds gradient-boosted decision trees, making it well-suited for structured tabular data like this dataset. Unlike logistic regression, which assumes linearity, XGBoost can model complex relationships and interactions between features. Additionally, XGBoost naturally handles missing values, reducing the need for extensive imputation. The model is robust to imbalanced data, which is key given that subscribed = yes is less frequent than no. Compared to Random Forest, XGBoost is computationally more efficient and provides better feature importance insights, allowing us to determine the most influential factors in predicting customer behavior.
To prepare the data for XGBoost, I will implement several preprocessing steps. Categorical variables such as job, education, and poutcome will be integer-encoded to maintain compatibility with tree-based models. Feature engineering will include binning age groups, transforming pdays into a categorical feature, and creating interaction terms between variables like previous and campaign. To handle class imbalance, I will adjust the scale_pos_weight parameter in XGBoost so that the model appropriately weights minority-class observations. Since XGBoost does not require feature scaling, numerical variables will be left in their original form, except for log-transforming highly skewed values such as duration and campaign for better interpretability. I will also keep in mind that customers contacted many times may be less likely to subscribe and that older customers may have different subscription tendencies.
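As a small sketch of the log transformation mentioned above (log1p handles zero values safely; duration would only be kept when benchmarking, as discussed earlier):
df_log <- df %>%
  mutate(log_duration = log1p(duration),
         log_campaign = log1p(campaign))
summary(df_log$log_duration)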
If the dataset had been smaller (fewer than 1,000 records), I would have opted for Logistic Regression or a Decision Tree model, as XGBoost requires larger datasets to generalize effectively. However, given the dataset’s size and complexity, XGBoost is an optimal choice due to its high predictive power, ability to handle mixed data types, and resilience against overfitting.
Based on the final model, I will compute predictive performance metrics including the F1 score, recall, precision, AUC, and the Brier score, which will help show how the model performs. I will train the model on a 70% random split with cross-validation for hyperparameter tuning, and then test it on the remaining 30% of unseen data. I will also add explanation and interpretability using SHAP values and dependence plots, along with calibration plots and precision-recall curves.
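As a sketch of how these metrics could be computed from a vector of predicted probabilities and the true labels, evaluate_probs below is a hypothetical helper (the 0.5 threshold is an assumption and could itself be tuned):
# hypothetical helper: classification metrics from predicted probabilities and true labels
evaluate_probs <- function(pred_probs, truth, threshold = 0.5) {
  pred_class <- factor(ifelse(pred_probs >= threshold, "yes", "no"), levels = c("no", "yes"))
  cm <- confusionMatrix(pred_class, truth, positive = "yes")
  list(
    Precision = unname(cm$byClass["Precision"]),
    Recall = unname(cm$byClass["Recall"]),
    F1 = unname(cm$byClass["F1"]),
    AUC = as.numeric(auc(roc(truth, pred_probs, levels = c("no", "yes"), direction = "<"))),
    Brier = mean((pred_probs - as.numeric(truth == "yes"))^2)  # mean squared error of the probabilities
  )
}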
Objective: Establish a baseline for how a simple Decision Tree performs using default parameters. We hypothesize it will yield decent accuracy but may have low recall.
Variation: No tuning or parameter constraints; purely default rpart() settings.
Non-Trivial Variation?: This is a baseline with no hyperparameter changes, so it’s trivial by design (the starting point).
Evaluation Metric: Measured Accuracy, Sensitivity, Specificity, and AUC-ROC to capture both overall performance and the ability to detect “yes.”
Experiment Run:
# remove non-predictive features, encode target
df_model <- df %>% select(-duration, -default)
df_model$subscribed <- as.factor(df_model$subscribed)
# replace missing values with "unknown"
# Example for your training data:
df_model$job[is.na(df_model$job)] <- "unknown"
df_model$marital[is.na(df_model$marital)] <- "unknown"
df_model$education[is.na(df_model$education)] <- "unknown"
df_model$housing[is.na(df_model$housing)] <- "unknown"
df_model$loan[is.na(df_model$loan)] <- "unknown"
# a 70/30 train-test split
set.seed(123)
train_index <- createDataPartition(df_model$subscribed, p = 0.7, list = FALSE)
train_data <- df_model[train_index, ]
test_data <- df_model[-train_index, ]
###
# baseline Decision Tree using default parameters
dt_baseline <- rpart(subscribed ~ ., data = train_data, method = "class")
pred_probs <- predict(dt_baseline, test_data, type = "prob")[,2]
pred_classes <- predict(dt_baseline, test_data, type = "class")
# Evaluate
conf_mat <- confusionMatrix(pred_classes, test_data$subscribed, positive = "yes")
roc_obj <- roc(test_data$subscribed, pred_probs)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
print(conf_mat)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 10876 1164
## yes 88 228
##
## Accuracy : 0.8987
## 95% CI : (0.8932, 0.9039)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 2.836e-05
##
## Kappa : 0.2351
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.16379
## Specificity : 0.99197
## Pos Pred Value : 0.72152
## Neg Pred Value : 0.90332
## Prevalence : 0.11266
## Detection Rate : 0.01845
## Detection Prevalence : 0.02557
## Balanced Accuracy : 0.57788
##
## 'Positive' Class : yes
##
cat("AUC-ROC:", auc(roc_obj), "\n")
## AUC-ROC: 0.707675
# Visualize
rpart.plot(dt_baseline)
# saveRDS(dt_baseline, file = "dt_baseline_model.rds")
Baseline Decision Tree: accuracy 0.8987, sensitivity 0.1638, specificity 0.9920, AUC-ROC 0.7077.
Meaning: the default tree is accurate overall, largely because of the dominant "no" class, but it detects very few actual subscribers, so it is a weak starting point for the business goal of finding "yes" cases.
Objective: Test whether pruning/optimizing the complexity parameter (cp) improves detection of positive (subscribed) cases without severely hurting overall accuracy.
Variation: Used a grid search on cp from 0.001 to 0.02 in increments of 0.002, cross-validating with 5 folds.
Non-Trivial Variation?: Yes — adjusting tree complexity is a significant model change, aiming to reduce underfitting or overfitting.
Evaluation Metric: Same metrics (Accuracy, Sensitivity, Specificity, AUC-ROC) but focusing on whether recall and AUC-ROC improve.
Experiment Run:
# caret to tune the cp parameter via grid search
set.seed(123)
tune_grid <- expand.grid(cp = seq(0.001, 0.02, by = 0.002))
dt_tuned <- train(subscribed ~ ., data = train_data,
method = "rpart",
trControl = trainControl(method = "cv", number = 5),
tuneGrid = tune_grid)
print(dt_tuned)
## CART
##
## 28832 samples
## 18 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 23066, 23065, 23065, 23067, 23065
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.001 0.9002145 0.3285105
## 0.003 0.8992437 0.2499812
## 0.005 0.8992437 0.2499812
## 0.007 0.8992437 0.2499812
## 0.009 0.8992437 0.2499812
## 0.011 0.8992437 0.2499812
## 0.013 0.8992437 0.2499812
## 0.015 0.8992437 0.2499812
## 0.017 0.8992437 0.2499812
## 0.019 0.8992437 0.2499812
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.001.
pred_probs_tuned <- predict(dt_tuned, test_data, type = "prob")[,2]
pred_classes_tuned <- predict(dt_tuned, test_data)
conf_mat_tuned <- confusionMatrix(pred_classes_tuned, test_data$subscribed, positive = "yes")
roc_obj_tuned <- roc(test_data$subscribed, pred_probs_tuned)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
print(conf_mat_tuned)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 10761 1038
## yes 203 354
##
## Accuracy : 0.8996
## 95% CI : (0.8941, 0.9048)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 6.831e-06
##
## Kappa : 0.3194
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.25431
## Specificity : 0.98148
## Pos Pred Value : 0.63555
## Neg Pred Value : 0.91203
## Prevalence : 0.11266
## Detection Rate : 0.02865
## Detection Prevalence : 0.04508
## Balanced Accuracy : 0.61790
##
## 'Positive' Class : yes
##
cat("AUC-ROC (Tuned):", auc(roc_obj_tuned), "\n")
## AUC-ROC (Tuned): 0.7579196
# saveRDS(dt_tuned, file = "dt_tuned_model.rds")
Tuned Decision Tree: accuracy 0.8996, sensitivity 0.2543, specificity 0.9815, AUC-ROC 0.7579, with cp = 0.001 selected.
Meaning: tuning the complexity parameter improves both AUC and sensitivity over the baseline tree at a small cost in specificity, so the tuned tree is noticeably better at finding subscribers.
Objective: Establish how a default Random Forest (RF) model performs on this dataset without parameter tuning. We hypothesize it will capture more complex interactions than a simple decision tree.
Variation: No parameter tuning; use the default number of trees (often 500) and default mtry (typically sqrt(#features)).
Non-Trivial Variation? This is our baseline with no custom changes, so it’s considered the reference point for further tuning.
Evaluation Metric: We measure Accuracy, Sensitivity, Specificity, and AUC-ROC to gauge both overall correctness and how well the model detects “yes” cases.
Experiment Run:
# remove non-predictive features, encode target
df_model <- df %>% select(-duration, -default)
df_model$subscribed <- as.factor(df_model$subscribed)
# replace missing values with "unknown"
# Example for your training data:
df_model$job[is.na(df_model$job)] <- "unknown"
df_model$marital[is.na(df_model$marital)] <- "unknown"
df_model$education[is.na(df_model$education)] <- "unknown"
df_model$housing[is.na(df_model$housing)] <- "unknown"
df_model$loan[is.na(df_model$loan)] <- "unknown"
# a 70/30 train-test split
set.seed(123)
train_index <- createDataPartition(df_model$subscribed, p = 0.7, list = FALSE)
train_data <- df_model[train_index, ]
test_data <- df_model[-train_index, ]
###
set.seed(123)
rf_baseline <- randomForest(subscribed ~ ., data = train_data, ntree = 500)
pred_rf_probs <- predict(rf_baseline, test_data, type = "prob")[,2]
pred_rf_classes <- predict(rf_baseline, test_data)
conf_mat_rf <- confusionMatrix(pred_rf_classes, test_data$subscribed, positive = "yes")
roc_rf <- roc(test_data$subscribed, pred_rf_probs)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
print(conf_mat_rf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 10736 1021
## yes 228 371
##
## Accuracy : 0.8989
## 95% CI : (0.8935, 0.9042)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 1.943e-05
##
## Kappa : 0.3271
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.26652
## Specificity : 0.97920
## Pos Pred Value : 0.61937
## Neg Pred Value : 0.91316
## Prevalence : 0.11266
## Detection Rate : 0.03003
## Detection Prevalence : 0.04848
## Balanced Accuracy : 0.62286
##
## 'Positive' Class : yes
##
cat("AUC-ROC (RF Baseline):", auc(roc_rf), "\n")
## AUC-ROC (RF Baseline): 0.7882052
# saveRDS(rf_baseline, file = "rf_baseline_model.rds")
Result & Conclusion: the baseline Random Forest reaches an accuracy of 0.8989 with sensitivity 0.2665, specificity 0.9792, and AUC-ROC 0.7882, improving on both decision-tree runs in AUC and recall and serving as the reference point for tuning mtry.
Objective: Investigate if adjusting mtry (the number of features considered at each split) can improve the model’s balance of accuracy and sensitivity.
Variation: Used caret to grid-search mtry = {2, 4, 6, 8}, with 5-fold cross-validation. The best setting is chosen based on highest accuracy.
Non-Trivial Variation? Yes — adjusting mtry is a significant hyperparameter change that can affect model complexity and performance.
Evaluation Metric: Same metrics: Accuracy, Sensitivity, Specificity, AUC-ROC, focusing on any improvement in detecting positives.
Experiment Run:
set.seed(123)
rf_tune_grid <- expand.grid(mtry = c(2, 4, 6, 8))
rf_tuned <- train(subscribed ~ ., data = train_data,
method = "rf",
trControl = trainControl(method = "cv", number = 5),
tuneGrid = rf_tune_grid,
ntree = 500)
print(rf_tuned)
## Random Forest
##
## 28832 samples
## 18 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 23066, 23065, 23065, 23067, 23065
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8983073 0.2285129
## 4 0.8998332 0.2972112
## 6 0.8985151 0.3138414
## 8 0.8977174 0.3225955
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 4.
pred_rf_tuned_probs <- predict(rf_tuned, test_data, type = "prob")[,2]
pred_rf_tuned_classes <- predict(rf_tuned, test_data)
conf_mat_rf_tuned <- confusionMatrix(pred_rf_tuned_classes, test_data$subscribed, positive = "yes")
roc_rf_tuned <- roc(test_data$subscribed, pred_rf_tuned_probs)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
print(conf_mat_rf_tuned)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 10810 1082
## yes 154 310
##
## Accuracy : 0.9
## 95% CI : (0.8945, 0.9052)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 3.456e-06
##
## Kappa : 0.2943
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.22270
## Specificity : 0.98595
## Pos Pred Value : 0.66810
## Neg Pred Value : 0.90901
## Prevalence : 0.11266
## Detection Rate : 0.02509
## Detection Prevalence : 0.03755
## Balanced Accuracy : 0.60433
##
## 'Positive' Class : yes
##
cat("AUC-ROC (RF Tuned):", auc(roc_rf_tuned), "\n")
## AUC-ROC (RF Tuned): 0.7823927
# saveRDS(rf_tuned, file = "rf_tuned_model.rds")
Result & Conclusion: the tuned Random Forest (mtry = 4) reaches accuracy 0.9000 and specificity 0.9859, but sensitivity falls to 0.2227 and AUC-ROC to 0.7824, so the tuning trades recall of subscribers for slightly higher overall accuracy.
Objective: To evaluate a baseline AdaBoost model (using adabag’s boosting) on the preprocessed data, aiming to establish a performance benchmark. Hypothesis: The baseline model will provide moderate discrimination (AUC ~0.80) but may suffer from low sensitivity.
Experiment Variation Defined: No hyperparameter tuning was applied; the model was run with default boosting parameters (mfinal = 50, and default tree parameters).
Variation Non-Triviality: Although this run is a baseline, it is non-trivial because it directly leverages AdaBoost’s ability to handle missing values and categorical data without additional pre-processing adjustments beyond standard cleaning.
Evaluation Metric: Metrics used include Accuracy, Sensitivity, Specificity, Balanced Accuracy, and AUC-ROC. Emphasis was placed on AUC-ROC to gauge overall discrimination ability and on sensitivity to understand the model’s recall for the minority “yes” class.
# remove non-predictive features, encode target
df_model <- df %>% select(-duration, -default)
df_model$subscribed <- as.factor(df_model$subscribed)
# replace missing values with "unknown"
# Example for your training data:
df_model$job[is.na(df_model$job)] <- "unknown"
df_model$marital[is.na(df_model$marital)] <- "unknown"
df_model$education[is.na(df_model$education)] <- "unknown"
df_model$housing[is.na(df_model$housing)] <- "unknown"
df_model$loan[is.na(df_model$loan)] <- "unknown"
# a 70/30 train-test split
set.seed(123)
train_index <- createDataPartition(df_model$subscribed, p = 0.7, list = FALSE)
train_data <- df_model[train_index, ]
test_data <- df_model[-train_index, ]
###
set.seed(123)
ada_baseline <- boosting(subscribed ~ ., data = train_data, boos = TRUE, mfinal = 50)
ada_pred <- predict(ada_baseline, newdata = test_data)
pred_ada_probs <- ada_pred$prob[, 2]
pred_ada_classes <- ada_pred$class
conf_mat_ada <- confusionMatrix(as.factor(pred_ada_classes), test_data$subscribed, positive = "yes")
roc_ada <- roc(test_data$subscribed, pred_ada_probs)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
print(conf_mat_ada)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 10777 1079
## yes 187 313
##
## Accuracy : 0.8975
## 95% CI : (0.8921, 0.9028)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 0.0001496
##
## Kappa : 0.2885
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.22486
## Specificity : 0.98294
## Pos Pred Value : 0.62600
## Neg Pred Value : 0.90899
## Prevalence : 0.11266
## Detection Rate : 0.02533
## Detection Prevalence : 0.04047
## Balanced Accuracy : 0.60390
##
## 'Positive' Class : yes
##
cat("AUC-ROC (AdaBoost Baseline):", auc(roc_ada), "\n")
## AUC-ROC (AdaBoost Baseline): 0.8069351
#saveRDS(ada_baseline, file = "ada_baseline_model.rds")
Result Evaluation & Conclusion: The baseline AdaBoost achieved 89.75% accuracy and an AUC-ROC of ~0.807, with sensitivity at ~22.5% and specificity at ~98.3%. While overall performance and discrimination are reasonable, the low sensitivity indicates many subscribers are missed. This performance sets the benchmark for further tuning.
Objective: To test whether tuning hyperparameters (specifically, mfinal, maxdepth, and using coeflearn = Breiman) can improve performance, particularly aiming to enhance sensitivity and overall class discrimination.
Experiment Variation Defined: A grid search over mfinal in {50, 100, 150} and maxdepth in {1, 2, 3}, with coeflearn = "Breiman", using 5-fold cross-validation (see the commented-out caret code below).
Variation Non-Triviality: This tuning is non-trivial because altering boosting iterations and tree depth directly affects model complexity and bias-variance trade-off, which is critical for capturing the minority class effectively.
Evaluation Metric: Same as before (Accuracy, Sensitivity, Specificity, AUC-ROC), with particular attention to changes in sensitivity and AUC.
# set.seed(123)
# ada_tune_grid <- expand.grid(mfinal = c(50, 100, 150), maxdepth = c(1, 2, 3), coeflearn = c("Breiman"))
# ada_tuned <- train(subscribed ~ ., data = train_data,
# method = "AdaBoost.M1",
# trControl = trainControl(method = "cv", number = 5),
# tuneGrid = ada_tune_grid)
#
# print(ada_tuned)
# pred_ada_tuned_probs <- predict(ada_tuned, test_data, type = "prob")[, "yes"]
# pred_ada_tuned_classes <- predict(ada_tuned, test_data)
#
# conf_mat_ada_tuned <- confusionMatrix(pred_ada_tuned_classes, test_data$subscribed, positive = "yes")
# roc_ada_tuned <- roc(test_data$subscribed, pred_ada_tuned_probs)
#
# print(conf_mat_ada_tuned)
# cat("AUC-ROC (AdaBoost Tuned):", auc(roc_ada_tuned), "\n")
#
# # The knitting to HTML halts at this code chunk, likely because one of the cross-validation folds created here may not contain both classes. So I am going to do it another way:
# # saving the model
# saveRDS(ada_tuned, file = "ada_tuned_model.rds")
# then, for the knitting phase: load the tuned model:
ada_tuned <- readRDS("ada_tuned_model.rds")
Result Evaluation & Conclusion: The tuned model achieved an accuracy of 89.96% and an AUC-ROC of ~0.802, but sensitivity dropped to 18.1% (from 22.5% in the baseline) while specificity increased to 99.09%. These results indicate that while the tuned model is even better at correctly identifying non-subscribers, it further reduces the model’s ability to capture true positives. Overall discrimination (AUC) did not improve significantly.
test_data$subscribed <- factor(test_data$subscribed, levels = c("no", "yes"))
# align factor levels of job with the training data (only meaningful if job is stored as a factor)
if (is.factor(train_data$job)) test_data$job <- factor(test_data$job, levels = levels(train_data$job))
#response to a plain vector
response_vec <- as.vector(test_data$subscribed)
# Decision Trees (dt_baseline and dt_tuned)
dt_baseline_probs <- unname(as.numeric(predict(dt_baseline, test_data, type = "prob")[, "yes"]))
dt_tuned_probs <- unname(as.numeric(predict(dt_tuned, test_data, type = "prob")[, "yes"]))
# # Diagnostic: check lengths and NAs
# cat("Length of response:", length(response_vec), "\n")
# cat("Length of dt_baseline_probs:", length(dt_baseline_probs), "\n")
# cat("Length of dt_tuned_probs:", length(dt_tuned_probs), "\n")
# cat("Number of NAs in response:", sum(is.na(response_vec)), "\n")
# cat("Number of NAs in dt_baseline_probs:", sum(is.na(dt_baseline_probs)), "\n")
# cat("Number of NAs in dt_tuned_probs:", sum(is.na(dt_tuned_probs)), "\n")
dt_baseline_roc <- roc(response = response_vec, predictor = dt_baseline_probs, direction = "auto")
dt_tuned_roc <- roc(response = response_vec, predictor = dt_tuned_probs, direction = "auto")
# Random Forest (rf_baseline and rf_tuned)
rf_baseline_probs <- unname(as.numeric(predict(rf_baseline, test_data, type = "prob")[, "yes"]))
rf_tuned_probs <- unname(as.numeric(predict(rf_tuned, test_data, type = "prob")[, "yes"]))
rf_baseline_roc <- roc(response = response_vec, predictor = rf_baseline_probs, direction = "auto")
rf_tuned_roc <- roc(response = response_vec, predictor = rf_tuned_probs, direction = "auto")
# AdaBoost (ada_baseline and ada_tuned)
ada_baseline_pred <- predict(ada_baseline, newdata = test_data)
ada_baseline_probs <- unname(as.numeric(ada_baseline_pred$prob[, 2]))
ada_tuned_probs <- unname(as.numeric(predict(ada_tuned, test_data, type = "prob")[, "yes"]))
ada_baseline_roc <- roc(response = response_vec, predictor = ada_baseline_probs, direction = "auto")
ada_tuned_roc <- roc(response = response_vec, predictor = ada_tuned_probs, direction = "auto")
## PLOTS
# Decision Tree ROC Plot
plot(
dt_baseline_roc,
col = "blue",
lwd = 2,
main = "Decision Tree: Baseline vs. Tuned ROC",
legacy.axes = FALSE,
xlab = "1 - Specificity",
ylab = "Sensitivity"
)
lines(dt_tuned_roc, col = "red", lwd = 2)
legend("bottomright", legend = c("Baseline", "Tuned"), col = c("blue", "red"), lwd = 2)
# Random Forest ROC Plot
plot(
rf_baseline_roc,
col = "blue",
lwd = 2,
main = "Random Forest: Baseline vs. Tuned ROC",
legacy.axes = FALSE,
xlab = "1 - Specificity",
ylab = "Sensitivity"
)
lines(rf_tuned_roc, col = "red", lwd = 2)
legend("bottomright", legend = c("Baseline", "Tuned"), col = c("blue", "red"), lwd = 2)
# AdaBoost ROC Plot
plot(
ada_baseline_roc,
col = "blue",
lwd = 2,
main = "AdaBoost: Baseline vs. Tuned ROC",
legacy.axes = FALSE,
xlab = "1 - Specificity",
ylab = "Sensitivity"
)
lines(ada_tuned_roc, col = "red", lwd = 2)
legend("bottomright", legend = c("Baseline", "Tuned"), col = c("blue", "red"), lwd = 2)
plot(
dt_tuned_roc,
col = "red",
lwd = 2,
main = "Tuned Models: ROC Comparison",
legacy.axes = FALSE,
xlab = "1 - Specificity",
ylab = "Sensitivity"
)
lines(rf_tuned_roc, col = "green", lwd = 2)
lines(ada_tuned_roc, col = "purple", lwd = 2)
legend("bottomright",
legend = c("Decision Tree", "Random Forest", "AdaBoost"),
col = c("red", "green", "purple"),
lwd = 2
)
# function: extract metrics
extract_metrics <- function(conf, roc_obj) {
acc <- as.numeric(conf$overall["Accuracy"])
sens <- as.numeric(conf$byClass["Sensitivity"])
spec <- as.numeric(conf$byClass["Specificity"])
auc_val <- as.numeric(auc(roc_obj))
return(c(Accuracy = round(acc, 4),
Sensitivity = round(sens, 4),
Specificity = round(spec, 4),
AUC_ROC = round(auc_val, 4)))
}
# Decision Tree models
dt_baseline_metrics <- extract_metrics(conf_mat, roc_obj)
dt_tuned_metrics <- extract_metrics(conf_mat_tuned, roc_obj_tuned)
# Random Forest models
rf_baseline_metrics <- extract_metrics(conf_mat_rf, roc_rf)
rf_tuned_metrics <- extract_metrics(conf_mat_rf_tuned, roc_rf_tuned)
# AdaBoost models
ada_baseline_metrics <- extract_metrics(conf_mat_ada, roc_ada)
# the tuned AdaBoost confusion matrix was only created in the commented-out tuning chunk,
# so recompute it here from the loaded model before extracting metrics
conf_mat_ada_tuned <- confusionMatrix(predict(ada_tuned, test_data), test_data$subscribed, positive = "yes")
ada_tuned_metrics <- extract_metrics(conf_mat_ada_tuned, ada_tuned_roc)
# Combine
performance_summary <- data.frame(
Model = rep(c("Decision Tree", "Random Forest", "AdaBoost"), each = 2),
Experiment = rep(c("Baseline", "Tuned"), 3),
Accuracy = c(dt_baseline_metrics["Accuracy"], dt_tuned_metrics["Accuracy"],
rf_baseline_metrics["Accuracy"], rf_tuned_metrics["Accuracy"],
ada_baseline_metrics["Accuracy"], ada_tuned_metrics["Accuracy"]),
AUC_ROC = c(dt_baseline_metrics["AUC_ROC"], dt_tuned_metrics["AUC_ROC"],
rf_baseline_metrics["AUC_ROC"], rf_tuned_metrics["AUC_ROC"],
ada_baseline_metrics["AUC_ROC"], ada_tuned_metrics["AUC_ROC"]),
Sensitivity = c(dt_baseline_metrics["Sensitivity"], dt_tuned_metrics["Sensitivity"],
rf_baseline_metrics["Sensitivity"], rf_tuned_metrics["Sensitivity"],
ada_baseline_metrics["Sensitivity"], ada_tuned_metrics["Sensitivity"]),
Specificity = c(dt_baseline_metrics["Specificity"], dt_tuned_metrics["Specificity"],
rf_baseline_metrics["Specificity"], rf_tuned_metrics["Specificity"],
ada_baseline_metrics["Specificity"], ada_tuned_metrics["Specificity"])
)
# Print the performance summary table
print(performance_summary)
## # A tibble: 6 × 6
## Model Experiment Accuracy AUC_ROC Sensitivity Specificity
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Decision Tree Baseline 0.899 0.708 0.164 0.992
## 2 Decision Tree Tuned 0.900 0.758 0.254 0.982
## 3 Random Forest Baseline 0.898 0.790 0.262 0.979
## 4 Random Forest Tuned 0.900 0.782 0.225 0.986
## 5 AdaBoost Baseline 0.898 0.807 0.225 0.983
## 6 AdaBoost Tuned 0.900 0.802 0.181 0.991
In this classification project, I evaluated three algorithms—Decision Tree, Random Forest, and AdaBoost—on a dataset to predict whether a client will subscribe to a term deposit. Each algorithm was tested twice: first with baseline (default) parameters, and then again after tuning. The metrics of interest included accuracy, AUC (Area Under the ROC Curve), sensitivity (recall for the positive class), and specificity (true negative rate). Since the bank is interested in identifying as many potential subscribers (“yes”) as possible without excessively misclassifying non-subscribers, sensitivity and AUC carry particular weight, although overall accuracy and specificity remain important for resource management.
Decision Tree Results:
The Decision Tree’s baseline model achieved an accuracy of 0.8987, an AUC of 0.7077, a sensitivity of 0.1638, and a specificity of 0.992. These show that while the baseline tree was quite accurate overall—mostly because of the large proportion of “no” cases—it struggled to correctly identify positive cases, as reflected by a low sensitivity. Tuning the Decision Tree improved its AUC to 0.7579, which indicates better discrimination between “yes” and “no.” Sensitivity also rose to 0.2543, making the tuned tree more effective at capturing actual subscribers. The slight decrease in specificity (from 0.992 down to 0.9815) was a small sacrifice, but it was accompanied by a jump in the tree’s ability to find the positive class.
Random Forest Results:
For the Random Forest, the baseline version had a higher AUC than the baseline Decision Tree, coming in at 0.7899, and a sensitivity of 0.2615. Its accuracy was 0.8981, slightly below the tuned Decision Tree’s accuracy but with a stronger AUC, suggesting a more balanced approach to class separation. Tuning the Random Forest increased its accuracy to 0.9001 and raised specificity to 0.9859. However, the AUC slipped slightly to 0.7818, and sensitivity dropped to 0.2249. In other words, the tuned Random Forest became more conservative: it improved at identifying “no” cases but caught fewer “yes” cases. If a bank prioritizes fewer false positives (non-subscribers wrongly flagged as subscribers), the tuned Random Forest might be good. However, if capturing a higher proportion of true positives is paramount, the baseline version may be better.
AdaBoost Results:
AdaBoost stood out for having the highest baseline AUC of 0.8069, meaning it was already strong at discriminating between “yes” and “no.” Its accuracy was 0.8975, and sensitivity was 0.2249—moderate in relation to the other models. After tuning, AdaBoost’s accuracy increased to 0.8996, and specificity rose from 0.9829 to 0.9909, but AUC dipped slightly to 0.8019, and sensitivity fell to 0.181. This indicates that while the tuned AdaBoost became better at correctly classifying “no,” it missed more actual subscribers.
Overall, the tuned Decision Tree shows a marked improvement in sensitivity and a decent AUC gain, making it valuable for scenarios where identifying more potential subscribers is important. The Random Forest’s baseline model balances sensitivity and specificity well, whereas the tuned variant focuses more on high accuracy and specificity at the cost of missed positives. AdaBoost shows strong discriminative power, with the highest baseline AUC, though tuning tended to favor accuracy and specificity over recall. In practice, the choice depends on whether the bank emphasizes catching more true positives or avoiding too many false alarms.
From a data science perspective, further hyperparameter tuning and additional feature engineering are recommended to optimize the trade-off between sensitivity and specificity. My experiments show that while ensemble methods like Random Forest and AdaBoost perform well, the tuned Decision Tree offers promising interpretability and improved recall.
For addressing the bank’s marketing challenge, I would recommend deploying the tuned Decision Tree model as it shows enhanced sensitivity in identifying potential subscribers. This approach is likely to improve the targeting efficiency of marketing campaigns, so that more high-propensity customers are engaged while maintaining a high overall accuracy.