library(tidyverse)
library(DataExplorer)
library(corrplot)
library(ggplot2)
library(dplyr)
library(gridExtra)
library(grid)
library(scales)
library(tibble)
library(DT)
library(naniar)
library(Amelia)
library(caret)
library(rpart)
library(rpart.plot)
library(pROC)
library(randomForest)
library(adabag)
This dataset is offered in two versions: one with 16 features plus the target variable (y; subscription status) and an expanded version with 20 features plus the target. I have chosen to use the expanded, 20-feature version below.
df1 <- read.csv("bank-full.csv", sep = ";", header = TRUE, stringsAsFactors = FALSE)
dim(df1)
## [1] 45211 17
df2 <- read.csv("bank-additional-full.csv", sep = ";", header = TRUE, stringsAsFactors = FALSE)
dim(df2)
## [1] 41188 21
# Going with the expanded version (20 features rather than 16): although it has slightly fewer rows, the additional features arguably provide richer information overall.
df <- df2
df <- df %>% rename(subscribed = y)
df$subscribed <- as.factor(df$subscribed)
# replace "unknown" with NA
df[df == "unknown"] <- NA
# missing values
#colSums(is.na(df))
missing_values <- colSums(is.na(df))
missing_values[missing_values > 0]
## job marital education default housing loan
## 330 80 1731 8597 990 990
# numeric var
plot_numeric_distribution <- function(df) {
num_vars <- df %>% select_if(is.numeric)
for (var in names(num_vars)) {
print(
ggplot(df, aes(x = get(var))) +
geom_histogram(bins = 50, fill = "steelblue", color = "black", alpha = 0.7) +
labs(title = paste("Distribution of", var), x = var, y = "Count") +
theme_minimal())}}
plot_numeric_distribution(df)
The distributions of several numeric variables are highly skewed (duration, campaign, and emp.var.rate, among others), while age is only slightly skewed. The categorical variables' distributions are shown next, and outlier detection follows below.
# categorical var
plot_categorical_distribution <- function(df) {
cat_vars <- df %>% select_if(is.character)
for (var in names(cat_vars)) {
print(
ggplot(df, aes(x = get(var))) +
geom_bar(fill = "steelblue") +
labs(title = paste("Distribution of", var), x = var, y = "Count") +
      theme_minimal() +  # applied before theme() so it does not override the rotated axis labels
      theme(axis.text.x = element_text(angle = 45, hjust = 1)))}}
plot_categorical_distribution(df)
The categorical variables look reasonably distributed as well.
Using boxplots
plot_outliers_horizontal <- function(df) {
num_vars <- df %>% select_if(is.numeric)
num_plots <- length(num_vars)
cols <- 2
plots <- lapply(names(num_vars), function(var) {
ggplot(df, aes(y = get(var), x = "")) +
geom_boxplot(fill = "#69b3a2", outlier.color = "red", outlier.size = 2) +
labs(title = paste("Boxplot of", var), y = var, x = " ") +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold"),
axis.text.y = element_text(size = 14),
axis.text.x = element_text(size = 12),
axis.ticks.x = element_line(color = "black"),
panel.grid.major = element_line(color = "grey85"),
panel.grid.minor = element_blank()) +
coord_flip()})
grid.arrange(
grobs = plots,
ncol = cols,
nrow = ceiling(num_plots / cols),
top = textGrob(" ", gp = gpar(fontsize = 18, fontface = "bold")))
grid.lines(x = unit(0.5, "npc"), y = unit(c(0, 1), "npc"), gp = gpar(col = "black", lwd = 2))}
plot_outliers_horizontal(df)
The box plots show certain outliers that appear anomalous. I could remove them and treat them as missing values (to be imputed later), or apply capping so that extreme values do not unduly influence the model.
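For illustration, here is a minimal sketch of the capping option: cap_outliers is a hypothetical helper (not used in the rest of this report) that winsorizes every numeric column at its 1st and 99th percentiles.
# hypothetical helper: cap (winsorize) numeric columns at chosen percentile bounds
cap_outliers <- function(data, lower = 0.01, upper = 0.99) {
  data %>%
    mutate(across(where(is.numeric), function(x) {
      bounds <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
      pmin(pmax(x, bounds[1]), bounds[2])  # clip values to the percentile bounds
    }))
}
# example usage (not applied to df below)
# df_capped <- cap_outliers(df)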
Using IQR to Identify Outliers
detect_outliers <- function(df) {
num_vars <- df %>% select_if(is.numeric)
outliers <- list()
for (var in names(num_vars)) {
Q1 <- quantile(df[[var]], 0.25, na.rm = TRUE)
Q3 <- quantile(df[[var]], 0.75, na.rm = TRUE)
IQR_val <- Q3 - Q1
lower_bound <- Q1 - 3 * IQR_val
upper_bound <- Q3 + 3 * IQR_val
num_outliers <- sum(df[[var]] < lower_bound | df[[var]] > upper_bound, na.rm = TRUE)
if (num_outliers > 0) {
outliers[[var]] <- num_outliers}}
return(outliers)}
outlier_counts <- detect_outliers(df)
print(outlier_counts)
## $age
## [1] 4
##
## $duration
## [1] 1043
##
## $campaign
## [1] 1094
##
## $pdays
## [1] 1515
##
## $previous
## [1] 5625
So there are a number of outlier values in these five variables. Now, to see what they actually are:
detect_outliers_df <- function(df) {
num_vars <- df %>% select_if(is.numeric)
outlier_data <- list()
for (var in names(num_vars)) {
Q1 <- quantile(df[[var]], 0.25, na.rm = TRUE)
Q3 <- quantile(df[[var]], 0.75, na.rm = TRUE)
IQR_val <- Q3 - Q1
lower_bound <- Q1 - 3 * IQR_val
upper_bound <- Q3 + 3 * IQR_val
outliers <- df[[var]][df[[var]] < lower_bound | df[[var]] > upper_bound]
if (length(outliers) > 0) {
outlier_data[[var]] <- tibble(
Variable = var,
Outlier_Value = outliers)}}
outlier_df <- bind_rows(outlier_data)
return(outlier_df)}
outlier_table <- detect_outliers_df(df)
datatable(outlier_table, options = list(pageLength = 10, scrollX = TRUE))
Browsing through these values, many of them are not particularly extreme for age, duration, pdays, or the other variables. Most of the flagged pdays values are small values such as 0, which are not really outliers per se. previous values of 7 or 1 are not exactly anomalous either. The same goes for duration: since it is measured in seconds, it is plausible that at least some calls run up to a maximum of about 49 minutes. For campaign, it is strange that some clients were contacted up to 56 times, but that may be normal in this industry, though it is certainly on the higher end.
So for the outliers shown here, there does not seem to be a strong need for removal or capping, since they are not unreasonable.
Checking Categorical Variables for Anomalies
check_categorical_anomalies <- function(df) {
cat_vars <- df %>% select_if(is.character)
for (var in names(cat_vars)) {
print(paste("Category counts for:", var))
print(table(df[[var]]))
print("-------------------------------------------------")}}
check_categorical_anomalies(df)
## [1] "Category counts for: job"
##
## admin. blue-collar entrepreneur housemaid management
## 10422 9254 1456 1060 2924
## retired self-employed services student technician
## 1720 1421 3969 875 6743
## unemployed
## 1014
## [1] "-------------------------------------------------"
## [1] "Category counts for: marital"
##
## divorced married single
## 4612 24928 11568
## [1] "-------------------------------------------------"
## [1] "Category counts for: education"
##
## basic.4y basic.6y basic.9y high.school
## 4176 2292 6045 9515
## illiterate professional.course university.degree
## 18 5243 12168
## [1] "-------------------------------------------------"
## [1] "Category counts for: default"
##
## no yes
## 32588 3
## [1] "-------------------------------------------------"
## [1] "Category counts for: housing"
##
## no yes
## 18622 21576
## [1] "-------------------------------------------------"
## [1] "Category counts for: loan"
##
## no yes
## 33950 6248
## [1] "-------------------------------------------------"
## [1] "Category counts for: contact"
##
## cellular telephone
## 26144 15044
## [1] "-------------------------------------------------"
## [1] "Category counts for: month"
##
## apr aug dec jul jun mar may nov oct sep
## 2632 6178 182 7174 5318 546 13769 4101 718 570
## [1] "-------------------------------------------------"
## [1] "Category counts for: day_of_week"
##
## fri mon thu tue wed
## 7827 8514 8623 8090 8134
## [1] "-------------------------------------------------"
## [1] "Category counts for: poutcome"
##
## failure nonexistent success
## 4252 35563 1373
## [1] "-------------------------------------------------"
Looking at this tabulation, in addition to the bar charts, confirms that the category values make sense.
I have described patterns in the data while discussing the distributions above and drawn insights about several variables, including their central tendency and spread.
Columns with missing values are:
missing_counts <- colSums(is.na(df))
missing_counts[missing_counts > 0]
## job marital education default housing loan
## 330 80 1731 8597 990 990
Some variables have a large number of missing values, especially default, which may affect the model. Other potentially important variables (e.g., education and housing) may also be affected. So I am going to look deeper at this to see whether the missingness is random and whether it is related to the target.
Looking at this visually:
vis_miss(df) +
ggtitle("Missing Data Pattern") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Test If Missingness is Random
# visualize the missingness pattern (Amelia's missmap), then run Little's MCAR test from naniar
missmap(df, main = "Missing Data Map", col = c("blue", "gray"), legend = TRUE)
mcar_test(df)
## # A tibble: 1 × 4
## statistic df p.value missing.patterns
## <dbl> <dbl> <dbl> <int>
## 1 5458. 406 0 23
Little's MCAR test returns a p-value of essentially zero, so we reject the hypothesis that the data are missing completely at random (MCAR).
Correlation Between Missingness & subscribed
missing_cols <- names(df)[colSums(is.na(df)) > 0]
df_missing <- df %>%
mutate(across(all_of(missing_cols), ~ ifelse(is.na(.), 1, 0), .names = "missing_{.col}"))
missing_correlation <- df_missing %>%
select(starts_with("missing_")) %>%
mutate(subscribed = df$subscribed) %>%
group_by(subscribed) %>%
summarise(across(starts_with("missing_"), \(x) mean(x, na.rm = TRUE)))
print(missing_correlation)
## # A tibble: 2 × 7
## subscribed missing_job missing_marital missing_education missing_default
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 no 0.00802 0.00186 0.0405 0.223
## 2 yes 0.00797 0.00259 0.0541 0.0955
## # ℹ 2 more variables: missing_housing <dbl>, missing_loan <dbl>
This is helpful, as it shows that missingness in default (and, to a lesser extent, education) differs noticeably between subscribers and non-subscribers.
Overall, I do not see outliers as a dangerous pattern here, and missingness matters for education and default in particular. For these, I will need to choose an imputation method, such as an iterative imputer or a KNN approach. There do not appear to be inconsistent values, or values that do not align with what would be expected from a dataset like this.
For this dataset, the most suitable algorithms for predicting whether a customer will subscribe to a term deposit (subscribed) include Logistic Regression, Random Forest, and XGBoost. Logistic Regression is useful as a baseline due to its interpretability and efficiency, while Random Forest and XGBoost are more powerful ensemble methods that can capture complex interactions and non-linear relationships within the dataset. That said, banking is not my field, so while I have tried to look at this from a domain perspective, that perspective may not be perfect.
This is a supervised prediction problem, since we have labelled data, as I explain below. The data's characteristics and limitations allow for a few possible modelling approaches, listed below:
Logistic Regression. Pros: Simple, interpretable, efficient on large datasets, and works well with binary classification. Cons: Assumes a linear relationship between the independent variables and the log-odds of the target, making it less effective for complex patterns.
Random Forest. Pros: Handles both numerical and categorical variables, is robust to missing values, and reduces overfitting by averaging multiple decision trees. Cons: Computationally expensive, especially for large datasets, and harder to interpret than logistic regression.
XGBoost. Pros: Extremely powerful on structured tabular data, robust to missing values, and performs well with imbalanced data. Cons: Requires hyperparameter tuning and is computationally more demanding.
These suggested models do align with the business characteristics and goals from a dataset like this. These are also scalable approaches that allow for continuing to collect more data and refine the model further.
Among these, XGBoost is the best recommendation I think due to its high accuracy, ability to handle missing data, and effectiveness in tabular datasets like this one. Random Forest is a good alternative if interpretability is needed, while Logistic Regression can be used as a baseline model to compare performance.
Yes, the dataset has a labeled target variable (subscribed: yes/no), which makes this a supervised classification problem. This allows the use of classification models like Logistic Regression, Decision Trees, Random Forest, and Gradient Boosting models (XGBoost) instead of unsupervised learning methods such as clustering.
The dataset contains both categorical and numerical features, requiring an algorithm that handles mixed data types, missing values, and class imbalance. Tree-based models (Random Forest & XGBoost) are well-suited for these types of datasets as they automatically handle feature selection, interactions, and non-linearity. Logistic Regression, while simpler, may struggle with non-linear relationships and interactions between variables.
If the dataset had fewer than 1,000 records, simpler models like Logistic Regression or Decision Trees would be preferable. XGBoost and Random Forest require more data to generalize well, and with a small dataset, they may overfit. Logistic Regression would work better in this case because it requires fewer data points to provide stable estimates, while a Decision Tree could be used if non-linearity is important.
Address Missing Data. XGBoost natively handles missing values, so imputation is optional. However, we should analyze whether missing values hold information before deciding.
Options: - Leave missing values as-is (XGBoost assigns them to the optimal split direction). - Use mean/median imputation for continuous features if missingness seems random; an iterative imputer or KNN imputation are other options. - Create binary indicators for missing education and missing default, since their missingness was associated with the target (subscribed).
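As a small sketch of the indicator option (the df_ind data frame and indicator names here are illustrative and are not carried forward into the experiments below):
# add 0/1 indicators for the two missingness patterns that relate to the target
df_ind <- df %>%
  mutate(missing_education = as.integer(is.na(education)),
         missing_default = as.integer(is.na(default)))
# quick check of how the default-missingness indicator splits by the target
table(df_ind$missing_default, df_ind$subscribed)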
Check for Duplicates & Outliers
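A quick sketch of the duplicate check (dropping duplicates, if any, is left commented out):
# count exact duplicate rows
sum(duplicated(df))
# if any are found, they could be dropped:
# df <- distinct(df)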
Drop Highly Correlated Features. From the correlation analysis, euribor3m, nr.employed, and emp.var.rate are strongly correlated. We can remove one or two of them to avoid redundancy.
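A sketch of how this redundancy check could be run with the already-loaded corrplot and caret packages; the 0.9 cutoff is my choice, not a tuned value.
num_vars <- df %>% select(where(is.numeric))
num_corr <- cor(num_vars, use = "pairwise.complete.obs")
corrplot(num_corr, method = "color", type = "upper", tl.cex = 0.7)
# caret's findCorrelation suggests which columns to drop at the chosen cutoff
names(num_vars)[findCorrelation(num_corr, cutoff = 0.9)]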
Drop duration (if aiming for real-world deployment). Since call duration is a strong predictor but unknown before a call happens, it should be removed unless the goal is just benchmarking.
I don't think I need to resize the data, since it is neither too large nor too small. But, generally, categorical features can be encoded as integers:
df$job <- as.integer(as.factor(df$job))
df$marital <- as.integer(as.factor(df$marital))
# Alternatively, one-hot encoding can be applied (not required for XGBoost but useful for explainability).
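A sketch of the one-hot alternative using caret's dummyVars, applied to a separate copy (df_onehot) so the integer-encoded df above is left unchanged:
# one-hot encode the categorical predictors; the target is excluded before encoding
dmy <- dummyVars(~ ., data = df %>% select(-subscribed))
df_onehot <- data.frame(predict(dmy, newdata = df %>% select(-subscribed)))
dim(df_onehot)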
Scale Numerical Features (Not required for XGBoost, but recommended for comparison with other models):
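A sketch of centering and scaling the numeric columns with caret's preProcess (note that by this point job and marital are integer-encoded and would be scaled as well; this is only relevant for models such as logistic regression):
# estimate centering/scaling parameters on the numeric columns and apply them
pp <- preProcess(df %>% select(where(is.numeric)), method = c("center", "scale"))
df_scaled_num <- predict(pp, df %>% select(where(is.numeric)))
summary(df_scaled_num$age)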
Since subscribed = yes is much less frequent than no, XGBoost may be biased toward the majority class. Solutions: - Set scale_pos_weight = (# negative samples / # positive samples) in XGBoost to balance class weights. - Use SMOTE (Synthetic Minority Over-sampling Technique) if upsampling is needed. - Use stratified sampling during training to ensure balance.
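A minimal sketch of the scale_pos_weight option, assuming the xgboost package is installed (it is not used elsewhere in this report); the specific hyperparameters (max_depth, eta, nrounds) are placeholders, not tuned values.
library(xgboost)
# reuse the idea from the experiments below: replace NAs in character columns with "unknown"
df_xgb <- df %>%
  mutate(across(where(is.character), ~ replace(., is.na(.), "unknown")))
# build a design matrix that keeps rows with any remaining NAs (xgboost treats NA as missing natively)
mf <- model.frame(subscribed ~ ., data = df_xgb, na.action = na.pass)
X <- model.matrix(subscribed ~ ., data = mf)[, -1]  # drop the intercept column
y <- as.numeric(df_xgb$subscribed == "yes")
spw <- sum(y == 0) / sum(y == 1)  # negative/positive ratio
dtrain <- xgb.DMatrix(data = X, label = y)
params <- list(objective = "binary:logistic", eval_metric = "auc",
               scale_pos_weight = spw, max_depth = 4, eta = 0.1)
xgb_fit <- xgb.train(params = params, data = dtrain, nrounds = 200, verbose = 0)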
It is also worth keeping in mind that customers contacted many times (campaign above some threshold) may be less likely to subscribe, and that older customers may have different subscription tendencies.
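To illustrate, here is a small sketch of engineered features along these lines; the age cut points and the heavy-contact threshold of four calls are my assumptions for illustration, not tuned values.
df_fe <- df %>%
  mutate(
    age_group = cut(age, breaks = c(0, 30, 45, 60, Inf),
                    labels = c("<=30", "31-45", "46-60", "60+")),
    heavily_contacted = as.integer(campaign > 4),    # assumed threshold
    previously_contacted = as.integer(pdays != 999)  # pdays = 999 means never contacted before
  )
table(df_fe$age_group, df_fe$subscribed)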
In this analysis, I explored a dataset containing information on a Portuguese bank’s marketing campaign aimed at encouraging customers to subscribe to a term deposit. The dataset includes demographic details, previous marketing interactions, and economic indicators, requiring careful preprocessing before model training. Through exploratory data analysis (EDA), I examined data distributions, missingness patterns, outliers, and feature correlations. Based on my findings, I selected XGBoost as the most suitable machine learning algorithm for predicting customer subscription.
The EDA revealed several important insights. Certain features, such as call duration and previous contacts, had a strong influence on subscription likelihood. Missing data was not missing completely at random (MCAR), particularly for education and default, which had missingness patterns associated with the target variable. Outliers were observed in campaign, duration, and pdays, indicating potential skewness in customer interactions. Additionally, economic indicators such as euribor3m, nr.employed, and emp.var.rate were highly correlated, requiring dimensionality reduction to avoid redundancy.
Based on these findings, XGBoost was chosen as the best algorithm for this classification task. XGBoost is an ensemble learning method that builds gradient-boosted decision trees, making it well-suited for structured tabular data like this dataset. Unlike logistic regression, which assumes linearity, XGBoost can model complex relationships and interactions between features. Additionally, XGBoost naturally handles missing values, reducing the need for extensive imputation. The model is robust to imbalanced data, which is key given that subscribed = yes is less frequent than no. Compared to Random Forest, XGBoost is computationally more efficient and provides better feature importance insights, allowing us to determine the most influential factors in predicting customer behavior.
To prepare the data for XGBoost, I will implement several preprocessing steps. Categorical variables such as job, education, and poutcome will be integer-encoded to maintain compatibility with tree-based models. Feature engineering will include binning age groups, transforming pdays into a categorical feature, and creating interaction terms between variables like previous and campaign. To handle class imbalance, I will adjust the scale_pos_weight parameter in XGBoost so that the model appropriately weights minority-class observations. Since XGBoost does not require feature scaling, numerical variables will be left in their original form, except for log-transforming highly skewed values such as duration and campaign for better interpretability. I will also keep in mind that customers contacted many times may be less likely to subscribe and that older customers may have different subscription tendencies.
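As a small sketch of the log transformation mentioned above (log1p handles zero values safely; duration would only be kept when benchmarking, as discussed earlier):
df_log <- df %>%
  mutate(log_duration = log1p(duration),
         log_campaign = log1p(campaign))
summary(df_log$log_duration)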
If the dataset had been smaller (fewer than 1,000 records), I would have opted for Logistic Regression or a Decision Tree model, as XGBoost requires larger datasets to generalize effectively. However, given the dataset’s size and complexity, XGBoost is an optimal choice due to its high predictive power, ability to handle mixed data types, and resilience against overfitting.
Based on the final model, I will compute predictive performance metrics including the F1 score, recall, precision, AUC, and the Brier score, which will help show how the model performs. I will train the model on a 70% random split with cross-validation for hyperparameter tuning, and then test it on the remaining 30% of unseen data. I will also add explanation and interpretability using SHAP values and dependence plots, along with calibration plots and precision-recall curves.
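As a sketch of how these metrics could be computed from a vector of predicted probabilities and the true labels, evaluate_probs below is a hypothetical helper (the 0.5 threshold is an assumption and could itself be tuned):
# hypothetical helper: classification metrics from predicted probabilities and true labels
evaluate_probs <- function(pred_probs, truth, threshold = 0.5) {
  pred_class <- factor(ifelse(pred_probs >= threshold, "yes", "no"), levels = c("no", "yes"))
  cm <- confusionMatrix(pred_class, truth, positive = "yes")
  list(
    Precision = unname(cm$byClass["Precision"]),
    Recall = unname(cm$byClass["Recall"]),
    F1 = unname(cm$byClass["F1"]),
    AUC = as.numeric(auc(roc(truth, pred_probs, levels = c("no", "yes"), direction = "<"))),
    Brier = mean((pred_probs - as.numeric(truth == "yes"))^2)  # mean squared error of the probabilities
  )
}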
Objective: Establish a baseline for how a simple Decision Tree performs using default parameters. We hypothesize it will yield decent accuracy but may have low recall.
Variation: No tuning or parameter constraints; purely default rpart() settings.
Non-Trivial Variation?: This is a baseline with no hyperparameter changes, so it’s trivial by design (the starting point).
Evaluation Metric: Measured Accuracy, Sensitivity, Specificity, and AUC-ROC to capture both overall performance and the ability to detect “yes.”
Experiment Run:
# remove non-predictive features, encode target
df_model <- df %>% select(-duration, -default)
df_model$subscribed <- as.factor(df_model$subscribed)
# replace missing values with "unknown"
# Example for your training data:
df_model$job[is.na(df_model$job)] <- "unknown"
df_model$marital[is.na(df_model$marital)] <- "unknown"
df_model$education[is.na(df_model$education)] <- "unknown"
df_model$housing[is.na(df_model$housing)] <- "unknown"
df_model$loan[is.na(df_model$loan)] <- "unknown"
# a 70/30 train-test split
set.seed(123)
train_index <- createDataPartition(df_model$subscribed, p = 0.7, list = FALSE)
train_data <- df_model[train_index, ]
test_data <- df_model[-train_index, ]
###
# baseline Decision Tree using default parameters
dt_baseline <- rpart(subscribed ~ ., data = train_data, method = "class")
pred_probs <- predict(dt_baseline, test_data, type = "prob")[,2]
pred_classes <- predict(dt_baseline, test_data, type = "class")
# Evaluate
conf_mat <- confusionMatrix(pred_classes, test_data$subscribed, positive = "yes")
roc_obj <- roc(test_data$subscribed, pred_probs)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
print(conf_mat)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 10876 1164
## yes 88 228
##
## Accuracy : 0.8987
## 95% CI : (0.8932, 0.9039)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 2.836e-05
##
## Kappa : 0.2351
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.16379
## Specificity : 0.99197
## Pos Pred Value : 0.72152
## Neg Pred Value : 0.90332
## Prevalence : 0.11266
## Detection Rate : 0.01845
## Detection Prevalence : 0.02557
## Balanced Accuracy : 0.57788
##
## 'Positive' Class : yes
##
cat("AUC-ROC:", auc(roc_obj), "\n")
## AUC-ROC: 0.707675
# Visualize
rpart.plot(dt_baseline)
# saveRDS(dt_baseline, file = "dt_baseline_model.rds")
Baseline Decision Tree: accuracy 0.8987, sensitivity 0.1638, specificity 0.9920, AUC-ROC 0.7077.
Meaning: the default tree is accurate overall, largely because of the dominant "no" class, but it detects very few actual subscribers, so it is a weak starting point for the business goal of finding "yes" cases.
Objective: Test whether pruning/optimizing the complexity parameter (cp) improves detection of positive (subscribed) cases without severely hurting overall accuracy.
Variation: Used a grid search on cp from 0.001 to 0.02 in increments of 0.002, cross-validating with 5 folds.
Non-Trivial Variation?: Yes — adjusting tree complexity is a significant model change, aiming to reduce underfitting or overfitting.
Evaluation Metric: Same metrics (Accuracy, Sensitivity, Specificity, AUC-ROC) but focusing on whether recall and AUC-ROC improve.
Experiment Run:
# caret to tune the cp parameter via grid search
set.seed(123)
tune_grid <- expand.grid(cp = seq(0.001, 0.02, by = 0.002))
dt_tuned <- train(subscribed ~ ., data = train_data,
method = "rpart",
trControl = trainControl(method = "cv", number = 5),
tuneGrid = tune_grid)
print(dt_tuned)
## CART
##
## 28832 samples
## 18 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 23066, 23065, 23065, 23067, 23065
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.001 0.9002145 0.3285105
## 0.003 0.8992437 0.2499812
## 0.005 0.8992437 0.2499812
## 0.007 0.8992437 0.2499812
## 0.009 0.8992437 0.2499812
## 0.011 0.8992437 0.2499812
## 0.013 0.8992437 0.2499812
## 0.015 0.8992437 0.2499812
## 0.017 0.8992437 0.2499812
## 0.019 0.8992437 0.2499812
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.001.
pred_probs_tuned <- predict(dt_tuned, test_data, type = "prob")[,2]
pred_classes_tuned <- predict(dt_tuned, test_data)
conf_mat_tuned <- confusionMatrix(pred_classes_tuned, test_data$subscribed, positive = "yes")
roc_obj_tuned <- roc(test_data$subscribed, pred_probs_tuned)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
print(conf_mat_tuned)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 10761 1038
## yes 203 354
##
## Accuracy : 0.8996
## 95% CI : (0.8941, 0.9048)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 6.831e-06
##
## Kappa : 0.3194
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.25431
## Specificity : 0.98148
## Pos Pred Value : 0.63555
## Neg Pred Value : 0.91203
## Prevalence : 0.11266
## Detection Rate : 0.02865
## Detection Prevalence : 0.04508
## Balanced Accuracy : 0.61790
##
## 'Positive' Class : yes
##
cat("AUC-ROC (Tuned):", auc(roc_obj_tuned), "\n")
## AUC-ROC (Tuned): 0.7579196
# saveRDS(dt_tuned, file = "dt_tuned_model.rds")
Tuned Decision Tree: accuracy 0.8996, sensitivity 0.2543, specificity 0.9815, AUC-ROC 0.7579, with cp = 0.001 selected.
Meaning: tuning the complexity parameter improves both AUC and sensitivity over the baseline tree at a small cost in specificity, so the tuned tree is noticeably better at finding subscribers.
Objective: Establish how a default Random Forest (RF) model performs on this dataset without parameter tuning. We hypothesize it will capture more complex interactions than a simple decision tree.
Variation: No parameter tuning; use the default number of trees (often 500) and default mtry (typically sqrt(#features)).
Non-Trivial Variation? This is our baseline with no custom changes, so it’s considered the reference point for further tuning.
Evaluation Metric: We measure Accuracy, Sensitivity, Specificity, and AUC-ROC to gauge both overall correctness and how well the model detects “yes” cases.
Experiment Run:
# remove non-predictive features, encode target
df_model <- df %>% select(-duration, -default)
df_model$subscribed <- as.factor(df_model$subscribed)
# replace missing values with "unknown"
# Example for your training data:
df_model$job[is.na(df_model$job)] <- "unknown"
df_model$marital[is.na(df_model$marital)] <- "unknown"
df_model$education[is.na(df_model$education)] <- "unknown"
df_model$housing[is.na(df_model$housing)] <- "unknown"
df_model$loan[is.na(df_model$loan)] <- "unknown"
# a 70/30 train-test split
set.seed(123)
train_index <- createDataPartition(df_model$subscribed, p = 0.7, list = FALSE)
train_data <- df_model[train_index, ]
test_data <- df_model[-train_index, ]
###
set.seed(123)
rf_baseline <- randomForest(subscribed ~ ., data = train_data, ntree = 500)
pred_rf_probs <- predict(rf_baseline, test_data, type = "prob")[,2]
pred_rf_classes <- predict(rf_baseline, test_data)
conf_mat_rf <- confusionMatrix(pred_rf_classes, test_data$subscribed, positive = "yes")
roc_rf <- roc(test_data$subscribed, pred_rf_probs)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
print(conf_mat_rf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 10736 1021
## yes 228 371
##
## Accuracy : 0.8989
## 95% CI : (0.8935, 0.9042)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 1.943e-05
##
## Kappa : 0.3271
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.26652
## Specificity : 0.97920
## Pos Pred Value : 0.61937
## Neg Pred Value : 0.91316
## Prevalence : 0.11266
## Detection Rate : 0.03003
## Detection Prevalence : 0.04848
## Balanced Accuracy : 0.62286
##
## 'Positive' Class : yes
##
cat("AUC-ROC (RF Baseline):", auc(roc_rf), "\n")
## AUC-ROC (RF Baseline): 0.7882052
# saveRDS(rf_baseline, file = "rf_baseline_model.rds")
Result & Conclusion: the baseline Random Forest reaches an accuracy of 0.8989 with sensitivity 0.2665, specificity 0.9792, and AUC-ROC 0.7882, improving on both decision-tree runs in AUC and recall and serving as the reference point for tuning mtry.
Objective: Investigate if adjusting mtry (the number of features considered at each split) can improve the model’s balance of accuracy and sensitivity.
Variation: Used caret to grid-search mtry = {2, 4, 6, 8}, with 5-fold cross-validation. The best setting is chosen based on highest accuracy.
Non-Trivial Variation? Yes — adjusting mtry is a significant hyperparameter change that can affect model complexity and performance.
Evaluation Metric: Same metrics: Accuracy, Sensitivity, Specificity, AUC-ROC, focusing on any improvement in detecting positives.
Experiment Run:
set.seed(123)
rf_tune_grid <- expand.grid(mtry = c(2, 4, 6, 8))
rf_tuned <- train(subscribed ~ ., data = train_data,
method = "rf",
trControl = trainControl(method = "cv", number = 5),
tuneGrid = rf_tune_grid,
ntree = 500)
print(rf_tuned)
## Random Forest
##
## 28832 samples
## 18 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 23066, 23065, 23065, 23067, 23065
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8983073 0.2285129
## 4 0.8998332 0.2972112
## 6 0.8985151 0.3138414
## 8 0.8977174 0.3225955
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 4.
pred_rf_tuned_probs <- predict(rf_tuned, test_data, type = "prob")[,2]
pred_rf_tuned_classes <- predict(rf_tuned, test_data)
conf_mat_rf_tuned <- confusionMatrix(pred_rf_tuned_classes, test_data$subscribed, positive = "yes")
roc_rf_tuned <- roc(test_data$subscribed, pred_rf_tuned_probs)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
print(conf_mat_rf_tuned)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 10810 1082
## yes 154 310
##
## Accuracy : 0.9
## 95% CI : (0.8945, 0.9052)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 3.456e-06
##
## Kappa : 0.2943
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.22270
## Specificity : 0.98595
## Pos Pred Value : 0.66810
## Neg Pred Value : 0.90901
## Prevalence : 0.11266
## Detection Rate : 0.02509
## Detection Prevalence : 0.03755
## Balanced Accuracy : 0.60433
##
## 'Positive' Class : yes
##
cat("AUC-ROC (RF Tuned):", auc(roc_rf_tuned), "\n")
## AUC-ROC (RF Tuned): 0.7823927
# saveRDS(rf_tuned, file = "rf_tuned_model.rds")
Result & Conclusion: the tuned Random Forest (mtry = 4) reaches accuracy 0.9000 and specificity 0.9859, but sensitivity falls to 0.2227 and AUC-ROC to 0.7824, so the tuning trades recall of subscribers for slightly higher overall accuracy.
Objective: To evaluate a baseline AdaBoost model (using adabag’s boosting) on the preprocessed data, aiming to establish a performance benchmark. Hypothesis: The baseline model will provide moderate discrimination (AUC ~0.80) but may suffer from low sensitivity.
Experiment Variation Defined: No hyperparameter tuning was applied; the model was run with default boosting parameters (mfinal = 50, and default tree parameters).
Variation Non-Triviality: Although this run is a baseline, it is non-trivial because it directly leverages AdaBoost’s ability to handle missing values and categorical data without additional pre-processing adjustments beyond standard cleaning.
Evaluation Metric: Metrics used include Accuracy, Sensitivity, Specificity, Balanced Accuracy, and AUC-ROC. Emphasis was placed on AUC-ROC to gauge overall discrimination ability and on sensitivity to understand the model’s recall for the minority “yes” class.
# remove non-predictive features, encode target
df_model <- df %>% select(-duration, -default)
df_model$subscribed <- as.factor(df_model$subscribed)
# replace missing values with "unknown"
# Example for your training data:
df_model$job[is.na(df_model$job)] <- "unknown"
df_model$marital[is.na(df_model$marital)] <- "unknown"
df_model$education[is.na(df_model$education)] <- "unknown"
df_model$housing[is.na(df_model$housing)] <- "unknown"
df_model$loan[is.na(df_model$loan)] <- "unknown"
# a 70/30 train-test split
set.seed(123)
train_index <- createDataPartition(df_model$subscribed, p = 0.7, list = FALSE)
train_data <- df_model[train_index, ]
test_data <- df_model[-train_index, ]
###
set.seed(123)
ada_baseline <- boosting(subscribed ~ ., data = train_data, boos = TRUE, mfinal = 50)
ada_pred <- predict(ada_baseline, newdata = test_data)
pred_ada_probs <- ada_pred$prob[, 2]
pred_ada_classes <- ada_pred$class
conf_mat_ada <- confusionMatrix(as.factor(pred_ada_classes), test_data$subscribed, positive = "yes")
roc_ada <- roc(test_data$subscribed, pred_ada_probs)
## Setting levels: control = no, case = yes
## Setting direction: controls < cases
print(conf_mat_ada)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 10777 1079
## yes 187 313
##
## Accuracy : 0.8975
## 95% CI : (0.8921, 0.9028)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 0.0001496
##
## Kappa : 0.2885
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.22486
## Specificity : 0.98294
## Pos Pred Value : 0.62600
## Neg Pred Value : 0.90899
## Prevalence : 0.11266
## Detection Rate : 0.02533
## Detection Prevalence : 0.04047
## Balanced Accuracy : 0.60390
##
## 'Positive' Class : yes
##
cat("AUC-ROC (AdaBoost Baseline):", auc(roc_ada), "\n")
## AUC-ROC (AdaBoost Baseline): 0.8069351
#saveRDS(ada_baseline, file = "ada_baseline_model.rds")
Result Evaluation & Conclusion: The baseline AdaBoost achieved 89.75% accuracy and an AUC-ROC of ~0.807, with sensitivity at ~22.5% and specificity at ~98.3%. While overall performance and discrimination are reasonable, the low sensitivity indicates many subscribers are missed. This performance sets the benchmark for further tuning.
Objective: To test whether tuning hyperparameters (specifically, mfinal, maxdepth, and using coeflearn = Breiman) can improve performance, particularly aiming to enhance sensitivity and overall class discrimination.
Experiment Variation Defined: A grid search over mfinal in {50, 100, 150} and maxdepth in {1, 2, 3}, with coeflearn = "Breiman", using 5-fold cross-validation (see the commented-out caret code below).
Variation Non-Triviality: This tuning is non-trivial because altering boosting iterations and tree depth directly affects model complexity and bias-variance trade-off, which is critical for capturing the minority class effectively.
Evaluation Metric: Same as before (Accuracy, Sensitivity, Specificity, AUC-ROC), with particular attention to changes in sensitivity and AUC.
# set.seed(123)
# ada_tune_grid <- expand.grid(mfinal = c(50, 100, 150), maxdepth = c(1, 2, 3), coeflearn = c("Breiman"))
# ada_tuned <- train(subscribed ~ ., data = train_data,
# method = "AdaBoost.M1",
# trControl = trainControl(method = "cv", number = 5),
# tuneGrid = ada_tune_grid)
#
# print(ada_tuned)
# pred_ada_tuned_probs <- predict(ada_tuned, test_data, type = "prob")[, "yes"]
# pred_ada_tuned_classes <- predict(ada_tuned, test_data)
#
# conf_mat_ada_tuned <- confusionMatrix(pred_ada_tuned_classes, test_data$subscribed, positive = "yes")
# roc_ada_tuned <- roc(test_data$subscribed, pred_ada_tuned_probs)
#
# print(conf_mat_ada_tuned)
# cat("AUC-ROC (AdaBoost Tuned):", auc(roc_ada_tuned), "\n")
#
# # The knitting to HTML halts at this code chunk, likely because one of the cross-validation folds created here may not contain both classes. So I am going to do it another way:
# # saving the model
# saveRDS(ada_tuned, file = "ada_tuned_model.rds")
# then, for the knitting phase: load the tuned model:
ada_tuned <- readRDS("ada_tuned_model.rds")
Result Evaluation & Conclusion: The tuned model achieved an accuracy of 89.96% and an AUC-ROC of ~0.802, but sensitivity dropped to 18.1% (from 22.5% in the baseline) while specificity increased to 99.09%. These results indicate that while the tuned model is even better at correctly identifying non-subscribers, it further reduces the model’s ability to capture true positives. Overall discrimination (AUC) did not improve significantly.
test_data$subscribed <- factor(test_data$subscribed, levels = c("no", "yes"))
# align factor levels of job with the training data (only meaningful if job is stored as a factor)
if (is.factor(train_data$job)) test_data$job <- factor(test_data$job, levels = levels(train_data$job))
#response to a plain vector
response_vec <- as.vector(test_data$subscribed)
# Decision Trees (dt_baseline and dt_tuned)
dt_baseline_probs <- unname(as.numeric(predict(dt_baseline, test_data, type = "prob")[, "yes"]))
dt_tuned_probs <- unname(as.numeric(predict(dt_tuned, test_data, type = "prob")[, "yes"]))
# # Diagnostic: check lengths and NAs
# cat("Length of response:", length(response_vec), "\n")
# cat("Length of dt_baseline_probs:", length(dt_baseline_probs), "\n")
# cat("Length of dt_tuned_probs:", length(dt_tuned_probs), "\n")
# cat("Number of NAs in response:", sum(is.na(response_vec)), "\n")
# cat("Number of NAs in dt_baseline_probs:", sum(is.na(dt_baseline_probs)), "\n")
# cat("Number of NAs in dt_tuned_probs:", sum(is.na(dt_tuned_probs)), "\n")
dt_baseline_roc <- roc(response = response_vec, predictor = dt_baseline_probs, direction = "auto")
dt_tuned_roc <- roc(response = response_vec, predictor = dt_tuned_probs, direction = "auto")
# Random Forest (rf_baseline and rf_tuned)
rf_baseline_probs <- unname(as.numeric(predict(rf_baseline, test_data, type = "prob")[, "yes"]))
rf_tuned_probs <- unname(as.numeric(predict(rf_tuned, test_data, type = "prob")[, "yes"]))
rf_baseline_roc <- roc(response = response_vec, predictor = rf_baseline_probs, direction = "auto")
rf_tuned_roc <- roc(response = response_vec, predictor = rf_tuned_probs, direction = "auto")
# AdaBoost (ada_baseline and ada_tuned)
ada_baseline_pred <- predict(ada_baseline, newdata = test_data)
ada_baseline_probs <- unname(as.numeric(ada_baseline_pred$prob[, 2]))
ada_tuned_probs <- unname(as.numeric(predict(ada_tuned, test_data, type = "prob")[, "yes"]))
ada_baseline_roc <- roc(response = response_vec, predictor = ada_baseline_probs, direction = "auto")
ada_tuned_roc <- roc(response = response_vec, predictor = ada_tuned_probs, direction = "auto")
## PLOTS
# Decision Tree ROC Plot
plot(
dt_baseline_roc,
col = "blue",
lwd = 2,
main = "Decision Tree: Baseline vs. Tuned ROC",
legacy.axes = FALSE,
xlab = "1 - Specificity",
ylab = "Sensitivity"
)
lines(dt_tuned_roc, col = "red", lwd = 2)
legend("bottomright", legend = c("Baseline", "Tuned"), col = c("blue", "red"), lwd = 2)
# Random Forest ROC Plot
plot(
rf_baseline_roc,
col = "blue",
lwd = 2,
main = "Random Forest: Baseline vs. Tuned ROC",
legacy.axes = FALSE,
xlab = "1 - Specificity",
ylab = "Sensitivity"
)
lines(rf_tuned_roc, col = "red", lwd = 2)
legend("bottomright", legend = c("Baseline", "Tuned"), col = c("blue", "red"), lwd = 2)
# AdaBoost ROC Plot
plot(
ada_baseline_roc,
col = "blue",
lwd = 2,
main = "AdaBoost: Baseline vs. Tuned ROC",
legacy.axes = FALSE,
xlab = "1 - Specificity",
ylab = "Sensitivity"
)
lines(ada_tuned_roc, col = "red", lwd = 2)
legend("bottomright", legend = c("Baseline", "Tuned"), col = c("blue", "red"), lwd = 2)
plot(
dt_tuned_roc,
col = "red",
lwd = 2,
main = "Tuned Models: ROC Comparison",
legacy.axes = FALSE,
xlab = "1 - Specificity",
ylab = "Sensitivity"
)
lines(rf_tuned_roc, col = "green", lwd = 2)
lines(ada_tuned_roc, col = "purple", lwd = 2)
legend("bottomright",
legend = c("Decision Tree", "Random Forest", "AdaBoost"),
col = c("red", "green", "purple"),
lwd = 2
)
# function: extract metrics
extract_metrics <- function(conf, roc_obj) {
acc <- as.numeric(conf$overall["Accuracy"])
sens <- as.numeric(conf$byClass["Sensitivity"])
spec <- as.numeric(conf$byClass["Specificity"])
auc_val <- as.numeric(auc(roc_obj))
return(c(Accuracy = round(acc, 4),
Sensitivity = round(sens, 4),
Specificity = round(spec, 4),
AUC_ROC = round(auc_val, 4)))
}
# Decision Tree models
dt_baseline_metrics <- extract_metrics(conf_mat, roc_obj)
dt_tuned_metrics <- extract_metrics(conf_mat_tuned, roc_obj_tuned)
# Random Forest models
rf_baseline_metrics <- extract_metrics(conf_mat_rf, roc_rf)
rf_tuned_metrics <- extract_metrics(conf_mat_rf_tuned, roc_rf_tuned)
# AdaBoost models
ada_baseline_metrics <- extract_metrics(conf_mat_ada, roc_ada)
# the tuned AdaBoost confusion matrix was only created in the commented-out tuning chunk,
# so recompute it here from the loaded model before extracting metrics
conf_mat_ada_tuned <- confusionMatrix(predict(ada_tuned, test_data), test_data$subscribed, positive = "yes")
ada_tuned_metrics <- extract_metrics(conf_mat_ada_tuned, ada_tuned_roc)
# Combine
performance_summary <- data.frame(
Model = rep(c("Decision Tree", "Random Forest", "AdaBoost"), each = 2),
Experiment = rep(c("Baseline", "Tuned"), 3),
Accuracy = c(dt_baseline_metrics["Accuracy"], dt_tuned_metrics["Accuracy"],
rf_baseline_metrics["Accuracy"], rf_tuned_metrics["Accuracy"],
ada_baseline_metrics["Accuracy"], ada_tuned_metrics["Accuracy"]),
AUC_ROC = c(dt_baseline_metrics["AUC_ROC"], dt_tuned_metrics["AUC_ROC"],
rf_baseline_metrics["AUC_ROC"], rf_tuned_metrics["AUC_ROC"],
ada_baseline_metrics["AUC_ROC"], ada_tuned_metrics["AUC_ROC"]),
Sensitivity = c(dt_baseline_metrics["Sensitivity"], dt_tuned_metrics["Sensitivity"],
rf_baseline_metrics["Sensitivity"], rf_tuned_metrics["Sensitivity"],
ada_baseline_metrics["Sensitivity"], ada_tuned_metrics["Sensitivity"]),
Specificity = c(dt_baseline_metrics["Specificity"], dt_tuned_metrics["Specificity"],
rf_baseline_metrics["Specificity"], rf_tuned_metrics["Specificity"],
ada_baseline_metrics["Specificity"], ada_tuned_metrics["Specificity"])
)
# Print the performance summary table
print(performance_summary)
## # A tibble: 6 × 6
## Model Experiment Accuracy AUC_ROC Sensitivity Specificity
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Decision Tree Baseline 0.899 0.708 0.164 0.992
## 2 Decision Tree Tuned 0.900 0.758 0.254 0.982
## 3 Random Forest Baseline 0.898 0.790 0.262 0.979
## 4 Random Forest Tuned 0.900 0.782 0.225 0.986
## 5 AdaBoost Baseline 0.898 0.807 0.225 0.983
## 6 AdaBoost Tuned 0.900 0.802 0.181 0.991
In this classification project, I evaluated three algorithms—Decision Tree, Random Forest, and AdaBoost—on a dataset to predict whether a client will subscribe to a term deposit. Each algorithm was tested twice: first with baseline (default) parameters, and then again after tuning. The metrics of interest included accuracy, AUC (Area Under the ROC Curve), sensitivity (recall for the positive class), and specificity (true negative rate). Since the bank is interested in identifying as many potential subscribers (“yes”) as possible without excessively misclassifying non-subscribers, sensitivity and AUC carry particular weight, although overall accuracy and specificity remain important for resource management.
Decision Tree Results:
The Decision Tree’s baseline model achieved an accuracy of 0.8987, an AUC of 0.7077, a sensitivity of 0.1638, and a specificity of 0.992. These show that while the baseline tree was quite accurate overall—mostly because of the large proportion of “no” cases—it struggled to correctly identify positive cases, as reflected by a low sensitivity. Tuning the Decision Tree improved its AUC to 0.7579, which indicates better discrimination between “yes” and “no.” Sensitivity also rose to 0.2543, making the tuned tree more effective at capturing actual subscribers. The slight decrease in specificity (from 0.992 down to 0.9815) was a small sacrifice, but it was accompanied by a jump in the tree’s ability to find the positive class.
Random Forest Results:
For the Random Forest, the baseline version had a higher AUC than the baseline Decision Tree, coming in at 0.7899, and a sensitivity of 0.2615. Its accuracy was 0.8981, slightly below the tuned Decision Tree’s accuracy but with a stronger AUC, suggesting a more balanced approach to class separation. Tuning the Random Forest increased its accuracy to 0.9001 and raised specificity to 0.9859. However, the AUC slipped slightly to 0.7818, and sensitivity dropped to 0.2249. In other words, the tuned Random Forest became more conservative: it improved at identifying “no” cases but caught fewer “yes” cases. If a bank prioritizes fewer false positives (non-subscribers wrongly flagged as subscribers), the tuned Random Forest might be good. However, if capturing a higher proportion of true positives is paramount, the baseline version may be better.
AdaBoost Results:
AdaBoost stood out for having the highest baseline AUC of 0.8069, meaning it was already strong at discriminating between “yes” and “no.” Its accuracy was 0.8975, and sensitivity was 0.2249—moderate in relation to the other models. After tuning, AdaBoost’s accuracy increased to 0.8996, and specificity rose from 0.9829 to 0.9909, but AUC dipped slightly to 0.8019, and sensitivity fell to 0.181. This indicates that while the tuned AdaBoost became better at correctly classifying “no,” it missed more actual subscribers.
Overall, the tuned Decision Tree shows a marked improvement in sensitivity and a decent AUC gain, making it valuable for scenarios where identifying more potential subscribers is important. The Random Forest’s baseline model balances sensitivity and specificity well, whereas the tuned variant focuses more on high accuracy and specificity at the cost of missed positives. AdaBoost shows strong discriminative power, with the highest baseline AUC, though tuning tended to favor accuracy and specificity over recall. In practice, the choice depends on whether the bank emphasizes catching more true positives or avoiding too many false alarms.
From a data science perspective, further hyperparameter tuning and additional feature engineering are recommended to optimize the trade-off between sensitivity and specificity. My experiments show that while ensemble methods like Random Forest and AdaBoost perform well, the tuned Decision Tree offers promising interpretability and improved recall.
For addressing the bank’s marketing challenge, I would recommend deploying the tuned Decision Tree model as it shows enhanced sensitivity in identifying potential subscribers. This approach is likely to improve the targeting efficiency of marketing campaigns, so that more high-propensity customers are engaged while maintaining a high overall accuracy.