Data622 - Assignment 2: Experimentation & Model Training

Author

Anthony Josue Roman

Introduction

Machine learning experimentation is the process of iteratively testing, observing, and adjusting combinations of model configurations and pre-processing steps to find the configuration that performs best on a specific task. In practice this means varying hyperparameters, recording how each change affects performance, and reasoning about the accuracy and reliability of the resulting model. Experimentation is also how we manage the bias–variance tradeoff: by observing how different amounts of model complexity shift bias and variance, we can see which configurations generalize rather than merely fit the training data.

In this assignment, I will conduct six experiments using Decision Tree, Random Forest, and AdaBoost models. Each experiment includes a brief description of its objective, the variation being tested, and the evaluation metrics used. The dataset and pre-processing carry over from Assignment 1. Performance is evaluated with Accuracy, Precision, Recall, F1 score, and AUC. The goal is to determine which model, together with its pre-processing and tuning choices, yields the best overall performance and generalizability.

Model Experimentation and Training

In this section, I will experiment with six models based on the Decision Tree, Random Forest, and AdaBoost algorithms. Each experiment varies specific hyperparameters to see how bias and variance change. The models are evaluated using Accuracy, Precision, Recall, F1, and AUC. Since the dataset is imbalanced, performance is judged primarily on Recall and F1 for the positive class (“yes”).
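
Because every experiment is scored the same way, the metrics could be wrapped in a small helper to keep comparisons consistent. The sketch below is illustrative only and is not used in the cells that follow; it assumes the true labels and each model's predictions from the Assignment 1 split are available, and the name summarize_model is my own.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def summarize_model(name, y_true, y_pred, pos_label="yes"):
    # Collect the metrics used throughout this assignment, focused on the positive class
    return {
        "Model": name,
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision (Yes)": precision_score(y_true, y_pred, pos_label=pos_label),
        "Recall (Yes)": recall_score(y_true, y_pred, pos_label=pos_label),
        "F1 (Yes)": f1_score(y_true, y_pred, pos_label=pos_label),
    }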

Experiment 1: Decision Tree - Baseline

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

dt_base = DecisionTreeClassifier(random_state=42, class_weight="balanced")
dt_base.fit(X_train_resampled, y_train_resampled)
y_pred_dt_base = dt_base.predict(X_test_processed)

print("Decision Tree Baseline Results")
print(classification_report(y_test, y_pred_dt_base, target_names=["no","yes"]))
Decision Tree Baseline Results
              precision    recall  f1-score   support

          no       0.91      0.90      0.90      7985
         yes       0.29      0.32      0.30      1058

    accuracy                           0.83      9043
   macro avg       0.60      0.61      0.60      9043
weighted avg       0.84      0.83      0.83      9043

In this first experiment, I constructed a baseline Decision Tree model using class weights to account for the imbalance between “yes” and “no” responses. The objective was straightforward: build a reference model that captures the general relationships between the input features and the target variable without any tuning. Decision Trees are easy to work with because they partition the data on information gain or Gini impurity, so it is easy to see how individual features drive the prediction. That said, I anticipated this model would overfit, since an unconstrained tree can grow very deep and learn the small variations in the training data. By examining its metrics, I could determine whether the model was learning general patterns or simply memorizing the training set, and use it as a reference point for the later experiments.
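
One quick way to test that memorization concern before looking at the test metrics is to compare training and test accuracy, since a large gap suggests the unpruned tree is overfitting. The check below is a minimal sketch using the objects already defined above (dt_base and the Assignment 1 splits):

from sklearn.metrics import accuracy_score

# A large gap between training and test accuracy points to overfitting
train_acc = accuracy_score(y_train_resampled, dt_base.predict(X_train_resampled))
test_acc = accuracy_score(y_test, dt_base.predict(X_test_processed))
print(f"Train accuracy: {train_acc:.4f} | Test accuracy: {test_acc:.4f}")
print(f"Tree depth: {dt_base.get_depth()}, leaves: {dt_base.get_n_leaves()}")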

from sklearn.metrics import ConfusionMatrixDisplay
from sklearn import tree
import matplotlib.pyplot as plt
import numpy as np

ConfusionMatrixDisplay.from_estimator(
  dt_base, X_test_processed, y_test,
  display_labels=["No", "Yes"],
  cmap="Blues",
  colorbar=False
)
plt.title("Decision Tree Confusion Matrix")
plt.show()

from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

dt_small = DecisionTreeClassifier(
    random_state=42,
    class_weight="balanced",
    max_depth=3,
    min_samples_leaf=1000,
    max_leaf_nodes=8
).fit(X_train_resampled, y_train_resampled)

plt.figure(figsize=(11, 5), dpi=120)
plot_tree(
    dt_small,
    feature_names=X_train_resampled.columns,
    class_names=["no","yes"],
    filled=True,
    rounded=True,
    impurity=False,
    proportion=True,
    fontsize=8
)
plt.margins(x=0.02, y=0.05)
plt.title("Decision Tree - Baseline")
plt.show()

This visualization illustrates the splits made by the Decision Tree classifier on key variables such as month, prior outcome, and the client's job category. The fill colors represent class probabilities, with darker shading indicating a greater likelihood that a client subscribes to a term deposit. This hierarchy makes it easy to interpret the flow of decisions and to see how the model separates likely subscribers from non-subscribers.

# ROC and Precision–Recall curves

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc, precision_recall_curve, average_precision_score

# Probabilities for positive class "yes"

y_proba_dt_base = dt_base.predict_proba(X_test_processed)[:, 1]

# ROC

fpr, tpr, _ = roc_curve(y_test.map({"no":0, "yes":1}), y_proba_dt_base)
roc_auc = auc(fpr, tpr)

# PR

precision, recall, _ = precision_recall_curve(y_test.map({"no":0, "yes":1}), y_proba_dt_base)
ap = average_precision_score(y_test.map({"no":0, "yes":1}), y_proba_dt_base)

fig, axes = plt.subplots(1, 2, figsize=(8,5))

# ROC plot

axes[0].plot(fpr, tpr, lw=2, label=f"AUC = {roc_auc:.2f}")
axes[0].plot([0,1],[0,1], linestyle="--", lw=1, color="gray")
axes[0].set_xlabel("False Positive Rate")
axes[0].set_ylabel("True Positive Rate")
axes[0].set_title("ROC Curve")
axes[0].legend(loc="lower right")

# PR plot

axes[1].plot(recall, precision, lw=2, label=f"AP = {ap:.2f}")
axes[1].set_xlabel("Recall (Positive label: 1)")
axes[1].set_ylabel("Precision (Positive label: 1)")
axes[1].set_title("Precision–Recall Curve")
axes[1].legend(loc="lower left")

plt.tight_layout()
plt.show()

from sklearn.metrics import accuracy_score

# Compute accuracy

accuracy_dt_base = accuracy_score(y_test, y_pred_dt_base)

print(f"Decision Tree (baseline) Accuracy: {accuracy_dt_base:.4f}")
Decision Tree (baseline) Accuracy: 0.8293

The baseline Decision Tree model yielded an accuracy of 0.8293, meaning the model made correct predictions approximately 83% of the time. While this figure seems strong, the dataset is heavily imbalanced, so accuracy alone can be misleading: a model biased toward predicting the majority class can score well without learning anything useful about subscribers.
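
To make that point concrete: with 7,985 “no” cases out of 9,043 in the test set, a rule that always predicts “no” already scores about 0.88 accuracy, higher than the baseline tree. A minimal sketch of that no-skill reference, assuming y_test from the Assignment 1 split:

# Accuracy of a no-skill rule that always predicts the majority class "no"
always_no_accuracy = (y_test == "no").mean()
print(f"Always-'no' accuracy: {always_no_accuracy:.4f}")  # roughly 7985 / 9043 ≈ 0.883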

From the ROC curve, we see an AUC of 0.61, which indicates that the model can only marginally differentiate between subscribers and non-subscribers. Given that 1.0 represents a perfect classifier and 0.5 indicates random guessing, we can posit that a score of 0.61 indicates limited predictive capacity.

The Precision–Recall (PR) curve reports an average precision of 0.17, which summarizes precision across all recall levels; in practical terms, when this model flags a client as a likely subscriber, it is right only a small fraction of the time. The steep drop in precision as recall increases shows that the model cannot identify more true positives without producing many false positives.

To summarize, while the baseline Decision Tree has respectable overall accuracy, it identifies the minority “yes” class poorly. This motivates the following experiments, which use regularization and ensembling to improve precision and recall on the positive class while keeping the model stable and avoiding overfitting.

Experiment 2: Decision Tree - Regularized

dt_reg = DecisionTreeClassifier(
  random_state=42, 
  class_weight="balanced", 
  max_depth=8, 
  min_samples_leaf=20
  )
dt_reg.fit(X_train_resampled, y_train_resampled)
y_pred_dt_reg = dt_reg.predict(X_test_processed)

print("Decision Tree Regularized Results")
print(classification_report(y_test, y_pred_dt_reg, target_names=["no","yes"]))
print("Accuracy:", round(accuracy_score(y_test, y_pred_dt_reg), 4))
Decision Tree Regularized Results
              precision    recall  f1-score   support

          no       0.91      0.96      0.93      7985
         yes       0.49      0.32      0.39      1058

    accuracy                           0.88      9043
   macro avg       0.70      0.64      0.66      9043
weighted avg       0.86      0.88      0.87      9043

Accuracy: 0.881

In the second experiment, I added regularization by limiting the maximum depth of the tree and requiring a minimum number of samples per leaf node. This was done to counteract overfitting while maintaining interpretability. A tree that is not as deep will often generalize better to the test set because it is focusing on the strong, broader patterns in the data rather than trying to fit the noise. The stated parameters of max_depth=8 and min_samples_leaf=20 help to create smoother decision boundaries and lessen the amount of influence that outliers can have. I anticipated the regularized model would produce somewhat lower training accuracy, but would have a better balance between recall and precision on test data. This change gives the regularized model a key point of comparison and demonstrates how structural pruning changes both bias and variance.
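
The values max_depth=8 and min_samples_leaf=20 were chosen by hand. A more systematic alternative would be a small grid search scored on F1 for the “yes” class; the sketch below is illustrative only (the candidate grid is my own, not the values actually tuned above):

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, f1_score
from sklearn.tree import DecisionTreeClassifier

# Score candidates by F1 on the positive ("yes") class, matching the evaluation focus
f1_yes = make_scorer(f1_score, pos_label="yes")

param_grid = {"max_depth": [4, 6, 8, 10], "min_samples_leaf": [10, 20, 50]}
search = GridSearchCV(
  DecisionTreeClassifier(random_state=42, class_weight="balanced"),
  param_grid, scoring=f1_yes, cv=5, n_jobs=-1
  )
search.fit(X_train_resampled, y_train_resampled)
print("Best parameters:", search.best_params_)
print(f"Best CV F1 (yes): {search.best_score_:.3f}")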

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

disp = ConfusionMatrixDisplay.from_predictions(
  y_test, y_pred_dt_reg, display_labels=["no","yes"], cmap="Blues", values_format="d"
  )
plt.title("Decision Tree (regularized): Confusion Matrix")
plt.tight_layout()
plt.show()

from sklearn.metrics import roc_curve, auc, RocCurveDisplay, PrecisionRecallDisplay

y_proba = dt_reg.predict_proba(X_test_processed)[:, 1]

# ROC curve

fpr, tpr, _ = roc_curve(y_test.map({'no': 0, 'yes': 1}), y_proba)
roc_auc = auc(fpr, tpr)

fig, ax = plt.subplots(1, 2, figsize=(8, 5), dpi=120)

# ROC Curve

RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc).plot(ax=ax[0])
ax[0].set_title("ROC Curve")

# Precision-Recall Curve

PrecisionRecallDisplay.from_predictions(
  y_test.map({'no': 0, 'yes': 1}),
  y_proba,
  ax=ax[1]
  )
ax[1].set_title("Precision–Recall Curve")

plt.tight_layout()
plt.show()

After applying regularization to avoid overfitting (max_depth=8, min_samples_leaf=20), I assessed the model based on the confusion matrix, the ROC curve, and the precision–recall curve.

The confusion matrix showed that the model correctly predicted 7,626 non-subscribers (true negatives) and 341 subscribers (true positives). It misclassified 717 subscribers as non-subscribers (false negatives) and 359 non-subscribers as subscribers (false positives). This is to be expected in an imbalanced dataset in which most clients did not subscribe: the model is far better at identifying non-subscribers than subscribers, so recall on the positive class remains low.

The ROC curve shows that the model discriminates reasonably well between the two classes (AUC = 0.76). In other words, the model ranks a randomly chosen subscriber above a randomly chosen non-subscriber about 76% of the time, which is a solid result for this dataset.

The precision–recall curve depicts performance on the minority class (subscribers) in greater detail. The average precision (AP) of 0.33 indicates that, while the model has some discriminating ability overall, precision falls off as recall rises: the more true positives it tries to capture, the more non-subscribers it misclassifies as subscribers. The accuracy of the regularized Decision Tree was 0.881, an increase over the baseline accuracy of 0.8293, which suggests that the depth and leaf-size constraints help the tree generalize better.
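
Because precision and recall trade off along this curve, one option that does not require retraining is to move the decision threshold on the predicted probabilities instead of using the default 0.5. The sketch below reuses y_proba from the regularized tree above; the 0.35 cutoff is purely illustrative:

import numpy as np
from sklearn.metrics import precision_score, recall_score

# Lowering the threshold trades precision for recall on the "yes" class
for threshold in [0.5, 0.35]:
  y_pred_thr = np.where(y_proba >= threshold, "yes", "no")
  p = precision_score(y_test, y_pred_thr, pos_label="yes")
  r = recall_score(y_test, y_pred_thr, pos_label="yes")
  print(f"threshold={threshold:.2f}  precision={p:.3f}  recall={r:.3f}")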

Overall, the regularized tree strikes a balance between overfitting and generalization. It achieves roughly the same recall as the baseline tree but substantially higher precision, giving a higher F1 score, a smoother decision boundary, and less overfitting. Despite the modest tradeoffs, it is the more practical choice for real-world marketing use, where understanding why clients engage and being able to generalize are both valuable.

Experiment 3: Random Forest - Baseline

from sklearn.ensemble import RandomForestClassifier

rf_base = RandomForestClassifier(
  n_estimators=300,
  class_weight="balanced",
  random_state=42,
  n_jobs=-1
  )
  
rf_base.fit(X_train_resampled, y_train_resampled)
y_pred_rf_base = rf_base.predict(X_test_processed)
y_proba_rf_base = rf_base.predict_proba(X_test_processed)[:, 1]

print("Random Forest Baseline Results")
print(classification_report(y_test, y_pred_rf_base, target_names=["no","yes"]))
print("Accuracy:", round(accuracy_score(y_test, y_pred_rf_base), 4))
Random Forest Baseline Results
              precision    recall  f1-score   support

          no       0.90      0.97      0.94      7985
         yes       0.53      0.22      0.31      1058

    accuracy                           0.89      9043
   macro avg       0.72      0.60      0.63      9043
weighted avg       0.86      0.89      0.86      9043

Accuracy: 0.8857

For this experiment, I created a baseline Random Forest model using 300 trees. I wanted to examine how ensemble learning improves robustness compared to a single Decision Tree. Random Forest builds many Decision Trees on bootstrapped samples and averages their votes, which reduces variance and improves generalization. Each tree also considers a random subset of features at every split, increasing diversity and reducing the risk of relying too heavily on specific predictors. Since the dataset is moderately large and imbalanced, the model uses class weighting so the minority class is considered fairly. I therefore expected the Random Forest to achieve higher recall and F1 than the single-tree model, since averaging over diverse trees should produce more stable predictions.
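
Bagging also gives Random Forest a built-in generalization check: each tree is trained on a bootstrap sample, so the rows it never saw (the out-of-bag rows) act as a free validation set. The sketch below is a minimal illustration with the same training objects; note that the out-of-bag score is computed on the resampled training data, so it will look more optimistic than results on the imbalanced test set.

from sklearn.ensemble import RandomForestClassifier

rf_oob = RandomForestClassifier(
  n_estimators=300,
  class_weight="balanced",
  oob_score=True,   # evaluate each tree on the rows left out of its bootstrap sample
  random_state=42,
  n_jobs=-1
  )
rf_oob.fit(X_train_resampled, y_train_resampled)
print(f"Out-of-bag accuracy: {rf_oob.oob_score_:.4f}")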

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_predictions(
  y_test, y_pred_rf_base,
  display_labels=["no","yes"],
  cmap="Blues",
  values_format="d"
  )
plt.title("Random Forest (baseline): Confusion Matrix")
plt.tight_layout()
plt.show()

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc, precision_recall_curve, average_precision_score

y_true_bin = y_test.map({"no": 0, "yes": 1})

# ROC

fpr, tpr, _ = roc_curve(y_true_bin, y_proba_rf_base)
roc_auc = auc(fpr, tpr)

# PR

precision, recall, _ = precision_recall_curve(y_true_bin, y_proba_rf_base)
ap = average_precision_score(y_true_bin, y_proba_rf_base)

fig, axes = plt.subplots(1, 2, figsize=(8, 5), dpi=120)

# ROC plot

axes[0].plot(fpr, tpr, lw=2, label=f"AUC = {roc_auc:.2f}")
axes[0].plot([0, 1], [0, 1], linestyle=":", color="gray", lw=1)
axes[0].set_xlabel("False Positive Rate")
axes[0].set_ylabel("True Positive Rate")
axes[0].set_title("ROC Curve")
axes[0].legend(loc="lower right")

# PR plot

axes[1].plot(recall, precision, lw=2, label=f"AP = {ap:.2f}")
axes[1].set_xlabel("Recall")
axes[1].set_ylabel("Precision")
axes[1].set_title("Precision–Recall Curve")
axes[1].legend(loc="lower left")

plt.tight_layout()
plt.show()

# Top feature importances — Random Forest (baseline)

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

imp = pd.Series(rf_base.feature_importances_, index=X_train_resampled.columns)
topk = 15
imp_top = imp.sort_values(ascending=False).head(topk)

plt.figure(figsize=(7, 5), dpi=120)
imp_top.iloc[::-1].plot(kind="barh")
plt.xlabel("Importance")
plt.title(f"Random Forest (baseline): Top {topk} Features")
plt.tight_layout()
plt.show()

The Random Forest model produced an accuracy of 0.8857, a modest improvement over the regularized Decision Tree and a clear improvement over the baseline tree. The confusion matrix shows that the model identifies “no” cases very accurately, but it captures fewer “yes” cases than the single trees did (recall of 0.22), leaving a substantial number of false negatives.

The ROC curve, with an AUC of 0.78, shows that the model separates positive and negative outcomes reasonably well. The Precision–Recall curve yielded an average precision (AP) of 0.37, indicating a moderately better precision–recall balance than the earlier models.

The feature importance plot highlights balance, age, day_of_week, and encoded categorical features such as 21, 22, and 25 as key predictors. This suggests that both financial factors and campaign-related attributes strongly influence the predicted outcome.

Overall, the Random Forest baseline delivers the best accuracy and AUC of the experiments so far, though its recall on the “yes” class is lower than that of the single trees. The results also show that combining many decision trees helps the model generalize while capturing more complex relationships between the predictors and the target outcome.

Experiment 4: Random Forest Tuned

rf_tuned = RandomForestClassifier(
  n_estimators=500,
  max_depth=12,
  max_features="sqrt",
  min_samples_leaf=10,
  class_weight="balanced",
  random_state=42,
  n_jobs=-1
  )
  
rf_tuned.fit(X_train_resampled, y_train_resampled)
y_pred_rf_tuned = rf_tuned.predict(X_test_processed)

print("Random Forest Tuned Results")
print(classification_report(y_test, y_pred_rf_tuned, target_names=["no","yes"]))
accuracy = accuracy_score(y_test, y_pred_rf_tuned)
print(f"Accuracy: {accuracy:.4f}")
Random Forest Tuned Results
              precision    recall  f1-score   support

          no       0.91      0.96      0.94      7985
         yes       0.51      0.29      0.37      1058

    accuracy                           0.88      9043
   macro avg       0.71      0.63      0.65      9043
weighted avg       0.86      0.88      0.87      9043

Accuracy: 0.8847

In this experiment, I tuned the Random Forest by increasing the number of trees to 500 and adding structural constraints: max_depth=12, max_features="sqrt", and min_samples_leaf=10. These settings are intended to strike a reasonable balance between bias and variance. Increasing the number of estimators stabilizes the ensemble, limiting depth keeps each individual tree from overfitting, and using the square root of the number of features at each split reduces correlation between trees. I anticipated this tuned version would score best overall on F1, with improved recall and less noise in its predictions. This experiment demonstrates how controlled complexity, rather than simply more depth or more trees, often produces more reliable models.
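
To see how max_depth drives this bias–variance balance rather than accepting 12 at face value, a quick cross-validated sweep over a few depths is a useful check. The sketch below is illustrative only; the candidate depths and the 3-fold setup are my own choices, not the tuning actually performed above.

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, f1_score

f1_yes = make_scorer(f1_score, pos_label="yes")

# Cross-validated F1 for the "yes" class at a few candidate depths
for depth in [8, 12, 16, None]:
  rf = RandomForestClassifier(
    n_estimators=200, max_depth=depth, max_features="sqrt",
    min_samples_leaf=10, class_weight="balanced",
    random_state=42, n_jobs=-1
    )
  scores = cross_val_score(rf, X_train_resampled, y_train_resampled, scoring=f1_yes, cv=3, n_jobs=-1)
  print(f"max_depth={depth}: mean CV F1 (yes) = {scores.mean():.3f}")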

y_true_bin = y_test.map({"no": 0, "yes": 1})
y_proba_rf_tuned = rf_tuned.predict_proba(X_test_processed)[:, 1]

fpr, tpr, _ = roc_curve(y_true_bin, y_proba_rf_tuned)
roc_auc = auc(fpr, tpr)

prec, rec, _ = precision_recall_curve(y_true_bin, y_proba_rf_tuned)
ap = average_precision_score(y_true_bin, y_proba_rf_tuned)

fig, axes = plt.subplots(1, 2, figsize=(8, 5))

# ROC Curve
axes[0].plot(fpr, tpr, lw=2, label=f"AUC = {roc_auc:.2f}")
axes[0].plot([0, 1], [0, 1], "k--", lw=1)
axes[0].set_title("ROC Curve")
axes[0].set_xlabel("False Positive Rate")
axes[0].set_ylabel("True Positive Rate")
axes[0].legend(loc="lower right")

# PR Curve
axes[1].plot(rec, prec, lw=2, label=f"AP = {ap:.2f}")
axes[1].set_title("Precision–Recall Curve")
axes[1].set_xlabel("Recall")
axes[1].set_ylabel("Precision")
axes[1].legend(loc="lower left")

plt.tight_layout()
plt.show()

from sklearn.metrics import confusion_matrix
import seaborn as sns

cm = confusion_matrix(y_test, y_pred_rf_tuned, labels=["no", "yes"])

plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["no", "yes"], yticklabels=["no", "yes"])
plt.title("Random Forest (tuned): Confusion Matrix")
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.show()

importances = pd.Series(
    rf_tuned.feature_importances_, index=X_train_resampled.columns
).sort_values(ascending=False)[:15]

plt.figure(figsize=(7, 5))
importances.plot(kind="barh")
plt.gca().invert_yaxis()
plt.title("Random Forest (tuned): Top 15 Features")
plt.xlabel("Importance")
plt.show()

The tuned Random Forest model achieved 0.8847 accuracy, sustaining strong predictive performance close to the baseline while improving generalization. The ROC curve shows an AUC of 0.79, indicating good capacity to discriminate positive and negative outcomes. The Precision-Recall curve showed slightly better precision across recall levels than the baseline model, with an average precision (AP) of 0.39.

The confusion matrix shows the model accurately identified most “no” cases and recovered a larger share of “yes” true positives than the baseline forest, reducing false negatives. The feature importance bar chart indicates that encoded variables 21, 22, and 25 were most influential, with variables 51, 27, and balance also contributing meaningfully to the model's decisions.

Overall, the tuned Random Forest sustained roughly the same accuracy while achieving higher AUC and average precision, indicating that hyperparameter tuning improved the balance between fitting the data and generalizing to new clients. It was the most stable and well-rounded of the four models so far.

Experiment 5: AdaBoost - Stumps

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (
  classification_report, accuracy_score, ConfusionMatrixDisplay,
  roc_curve, auc, precision_recall_curve, average_precision_score
  )
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

ada_stump = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1, random_state=42),
    n_estimators=300,
    learning_rate=0.5,
    random_state=42
)
ada_stump.fit(X_train_resampled, y_train_resampled)
y_pred_ada_stump = ada_stump.predict(X_test_processed)

y_pred_ada = ada_stump.predict(X_test_processed)
y_proba_ada = ada_stump.predict_proba(X_test_processed)[:, 1]
y_true_bin = y_test.map({"no": 0, "yes": 1})

acc = accuracy_score(y_test, y_pred_ada)

print("AdaBoost (stumps) Classification Report")
print(classification_report(y_test, y_pred_ada, target_names=["no", "yes"]))
print(f"AdaBoost (stumps) Accuracy: {acc:.4f}\n")
AdaBoost (stumps) Classification Report
              precision    recall  f1-score   support

          no       0.90      0.97      0.94      7985
         yes       0.53      0.22      0.31      1058

    accuracy                           0.89      9043
   macro avg       0.72      0.59      0.62      9043
weighted avg       0.86      0.89      0.86      9043

AdaBoost (stumps) Accuracy: 0.8855

In the fifth experiment, I used AdaBoost, which builds an ensemble sequentially from simple decision stumps (trees with a maximum depth of 1), reweighting the training samples after each round so that later stumps focus on the cases earlier ones misclassified. The goal was to see how small, interpretable base learners perform when combined adaptively. I anticipated a boost in recall, since AdaBoost can chase down minority-class examples that earlier models missed, although stumps are weak learners and struggle with complex nonlinearities on their own. Overall, AdaBoost's adaptive weighting offers a useful way to manage bias and variance dynamically, and an ensemble of weak learners can sometimes outperform a model that relies on a single prediction mechanism.
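
The sequential behavior described above can be inspected directly: AdaBoostClassifier exposes staged_predict, which yields the ensemble's predictions after each boosting round, so performance can be tracked as stumps are added. A minimal sketch using the fitted ada_stump:

from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Test accuracy after each boosting round
staged_acc = [accuracy_score(y_test, y_pred_stage)
              for y_pred_stage in ada_stump.staged_predict(X_test_processed)]

plt.figure(figsize=(6, 4), dpi=120)
plt.plot(range(1, len(staged_acc) + 1), staged_acc, lw=1.5)
plt.xlabel("Number of boosting rounds")
plt.ylabel("Test accuracy")
plt.title("AdaBoost (stumps): Accuracy vs. Boosting Rounds")
plt.tight_layout()
plt.show()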

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, y_pred_ada, labels=["no", "yes"])
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["no", "yes"])

fig, ax = plt.subplots(figsize=(6, 4.5), dpi=150)
disp.plot(ax=ax, cmap="Blues", values_format="d")  # no colorbar kwarg
ax.set_title("AdaBoost (stumps): Confusion Matrix")
plt.tight_layout()
plt.show()

fpr, tpr, _ = roc_curve(y_true_bin, y_proba_ada)
roc_auc = auc(fpr, tpr)
prec, rec, _ = precision_recall_curve(y_true_bin, y_proba_ada)
ap = average_precision_score(y_true_bin, y_proba_ada)

fig, axes = plt.subplots(1, 2, figsize=(8, 5), dpi=150)

# ROC

axes[0].plot(fpr, tpr, lw=2, label=f"AUC = {roc_auc:.2f}")
axes[0].plot([0, 1], [0, 1], linestyle="--", color="gray", lw=1)
axes[0].set_title("ROC Curve")
axes[0].set_xlabel("False Positive Rate")
axes[0].set_ylabel("True Positive Rate")
axes[0].legend(loc="lower right")

# PR

axes[1].plot(rec, prec, lw=2, label=f"AP = {ap:.2f}")
axes[1].set_title("Precision–Recall Curve")
axes[1].set_xlabel("Recall")
axes[1].set_ylabel("Precision")
axes[1].legend(loc="lower left")
plt.tight_layout()
plt.show()

imp = pd.Series(ada_stump.feature_importances_, index=X_train_processed.columns)
top15 = imp.sort_values(ascending=False).head(15)[::-1]

fig, ax = plt.subplots(figsize=(7, 5), dpi=150)
top15.plot(kind="barh", ax=ax)
ax.set_title("AdaBoost (stumps): Top 15 Features")
ax.set_xlabel("Importance")
plt.tight_layout()
plt.show()

AdaBoost with shallow decision stumps reached an accuracy of 0.8855, essentially matching the tuned Random Forest and showing that iteratively boosting weak learners can deliver strong performance even with very simple base estimators. The ROC curve (AUC = 0.76) shows a reasonably stable separation of the positive and negative classes, while the precision–recall curve (AP = 0.35) reflects moderate performance on the minority “yes” cases. The confusion matrix shows that the model recognizes true negatives well but still produces many false negatives, so recall on the “yes” class remains low. The feature-importance ranking suggests that a small number of predictors (features 37, 13, 52, 51, and 16) dominate the ensemble's decision boundaries, indicating that AdaBoost's reweighting concentrates on these variables. Overall, this experiment shows that boosting adds predictive strength and stability without much risk of overfitting, though it did not resolve the recall problem on the harder class.

Experiment 6: AdaBoost - Depth 2 Learner

ada_depth2 = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=2, random_state=42),
    n_estimators=300,
    learning_rate=0.5,
    random_state=42
)
ada_depth2.fit(X_train_resampled, y_train_resampled)
y_pred_ada_depth2 = ada_depth2.predict(X_test_processed)
y_proba_ada_depth2 = ada_depth2.predict_proba(X_test_processed)[:, 1]
y_true_bin = y_test.map({"no": 0, "yes": 1})

acc = accuracy_score(y_test, y_pred_ada_depth2)

print("AdaBoost (Depth 2) Results")
print(classification_report(y_test, y_pred_ada_depth2, target_names=["no","yes"]))
print(f"AdaBoost (Depth 2) Accuracy: {acc:.4f}")
AdaBoost (Depth 2) Results
              precision    recall  f1-score   support

          no       0.90      0.98      0.94      7985
         yes       0.54      0.21      0.30      1058

    accuracy                           0.89      9043
   macro avg       0.72      0.59      0.62      9043
weighted avg       0.86      0.89      0.86      9043

AdaBoost (Depth 2) Accuracy: 0.8870

For the last experiment, I increased the capacity of the AdaBoost base learners from stumps to trees of depth 2, keeping 300 estimators and a learning rate of 0.5. The intention was to let each weak learner capture simple feature interactions while the boosting schedule keeps the ensemble from converging too aggressively. This experiment set out to test whether moderate base-learner complexity can improve generalization without inflating the risk of overfitting. I expected this model could improve precision and recall over the shallow-stump AdaBoost because it can model interactions between features while still benefiting from adaptive boosting. Comparing the two AdaBoost models also illustrates how base-learner capacity interacts with boosting strength, which informs tuning strategies for imbalanced classification problems.
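
Since the question here is how base-learner capacity interacts with boosting, a cross-validated comparison over a few depths makes that tradeoff explicit. The sketch below is illustrative only; depths 1–3 and the 3-fold setup are my own choices, and the cells that follow still use the depth-2 model defined above.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer, f1_score

f1_yes = make_scorer(f1_score, pos_label="yes")

# How does base-learner depth affect cross-validated F1 for the "yes" class?
for depth in [1, 2, 3]:
  ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=depth, random_state=42),
    n_estimators=300, learning_rate=0.5, random_state=42
    )
  scores = cross_val_score(ada, X_train_resampled, y_train_resampled, scoring=f1_yes, cv=3, n_jobs=-1)
  print(f"base max_depth={depth}: mean CV F1 (yes) = {scores.mean():.3f}")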

cm = confusion_matrix(y_test, y_pred_ada_depth2, labels=["no", "yes"])
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["no", "yes"])
fig, ax = plt.subplots(figsize=(6, 4.5), dpi=150)
disp.plot(ax=ax, cmap="Blues", values_format="d")
ax.set_title("AdaBoost (Depth 2): Confusion Matrix")
plt.tight_layout()
plt.show()

fpr, tpr, _ = roc_curve(y_true_bin, y_proba_ada_depth2)
roc_auc = auc(fpr, tpr)
prec, rec, _ = precision_recall_curve(y_true_bin, y_proba_ada_depth2)
ap = average_precision_score(y_true_bin, y_proba_ada_depth2)

fig, axes = plt.subplots(1, 2, figsize=(8, 5), dpi=150)
axes[0].plot(fpr, tpr, label=f"AUC = {roc_auc:.2f}")
axes[0].plot([0, 1], [0, 1], "k--")
axes[0].set_xlabel("False Positive Rate")
axes[0].set_ylabel("True Positive Rate")
axes[0].set_title("ROC Curve")
axes[0].legend()

axes[1].plot(rec, prec, label=f"AP = {ap:.2f}")
axes[1].set_xlabel("Recall")
axes[1].set_ylabel("Precision")
axes[1].set_title("Precision–Recall Curve")
axes[1].legend()
plt.tight_layout()
plt.show()

importances = pd.Series(ada_depth2.feature_importances_, index=X_train_resampled.columns)
topk = importances.sort_values(ascending=False).head(15)
plt.figure(figsize=(7, 5), dpi=150)
topk.sort_values().plot(kind="barh")
plt.title("AdaBoost (Depth 2): Top 15 Features")
plt.xlabel("Importance")
plt.tight_layout()
plt.show()

Permitting each weak learner to go two levels deep improved accuracy slightly over the shallow-stump model. The ROC curve indicates marginally stronger separation between classes, and the precision–recall curve shows a slightly better precision–recall balance. The confusion matrix, however, shows that true negatives remain high while recall on the “yes” class dips a little, so false negatives are not reduced. The feature-importance plot shows a wider range of moderately weighted predictors, suggesting that depth-2 learners capture simple interactions between features rather than relying on individual splits.

In general, AdaBoost with depth-2 learners provides a reasonable compromise between model complexity and generalization. It gains a little precision and achieves the best overall accuracy of any model examined so far, at the cost of a slight drop in recall for the positive class.

Combine and Compare Results

To compare the outcomes across all six experiments systematically, I combined the performance metrics into a single table. The table makes it easy to see which models performed best on accuracy, precision, recall, and F1 for the positive (“yes”) class. Because the data are imbalanced, I treat F1 for the “yes” class as the primary metric, since it weights precision and recall equally.

from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

models = {
  "Decision Tree (Baseline)": y_pred_dt_base,
  "Decision Tree (Regularized)": y_pred_dt_reg,
  "Random Forest (Baseline)": y_pred_rf_base,
  "Random Forest (Tuned)": y_pred_rf_tuned,
  "AdaBoost (Stumps)": y_pred_ada_stump,
  "AdaBoost (Depth 2)": y_pred_ada_depth2
  }

results = []

for name, y_pred in models.items():
  results.append({
    "Model": name,
    "Accuracy": accuracy_score(y_test, y_pred),
    "Precision (Yes)": precision_score(y_test, y_pred, pos_label="yes"),
    "Recall (Yes)": recall_score(y_test, y_pred, pos_label="yes"),
    "F1 (Yes)": f1_score(y_test, y_pred, pos_label="yes")
    })

results_df = pd.DataFrame(results).sort_values(by="F1 (Yes)", ascending=False)
results_df
  Model                        Accuracy  Precision (Yes)  Recall (Yes)  F1 (Yes)
1 Decision Tree (Regularized)  0.881013  0.487143         0.322306      0.387941
3 Random Forest (Tuned)        0.884662  0.512605         0.288280      0.369026
2 Random Forest (Baseline)     0.885657  0.526667         0.224008      0.314324
4 AdaBoost (Stumps)            0.885547  0.526559         0.215501      0.305835
5 AdaBoost (Depth 2)           0.886984  0.543689         0.211720      0.304762
0 Decision Tree (Baseline)     0.829260  0.289792         0.316635      0.302620

Reviewing the results of all six experiments, a consistent performance pattern emerges. The baseline Decision Tree was the least accurate and least precise model: its unconstrained depth led to overfitting and many false positives. Applying regularization made the Decision Tree far more balanced, raising precision sharply while keeping recall roughly the same, which produced the best F1 score for the “yes” class of any model.

Both Random Forest models delivered higher accuracy and precision than the single trees, although their recall on the “yes” class was lower, so the regularized Decision Tree still retained the highest F1. Between the two forests, the tuned version achieved the better recall–precision balance and F1; the larger number of estimators combined with controlled complexity helped stabilize the model and resist noise.

The AdaBoost experiments also produced strong headline numbers. The stump-based variant matched the forests on accuracy and precision but recovered relatively few positive cases. Increasing the base-learner depth in the second AdaBoost model nudged accuracy and precision up slightly, while recall and F1 dipped marginally.

Overall, the tuned Random Forest and the regularized Decision Tree gave the best balance on the minority class, while AdaBoost with depth-2 learners achieved the highest raw accuracy and precision. I would deploy the tuned Random Forest into production because it combines good generalization, interpretability through feature importances, and a competitive F1, with AdaBoost as a strong secondary choice when the business context prioritizes precision over recall.

Business Insights and Recommendations

After conducting the tests and considering the total of six models, I thought about how the results may be used to inform actionable business decisions. The Bank Marketing dataset represents customer interactions from campaigns that were done over the phone, and the goal was to predict whether a client would subscribe to a term deposit. By studying the results of the optimized Random Forest model and AdaBoost model, I was able to gather meaningful insights that took stock of which customer and campaign attributes most significantly contributed to a positive result.

From the analysis of feature importance in the Random Forest model, a number of themes presented themselves. Call-related features such as campaign and pdays had an outsized impact, indicating that both the number of calls and the timing of the prior interaction were relevant to customer engagement. Financial features such as balance and demographic factors such as age and job were also pertinent. Clients with stable jobs (e.g., management or technician) and established savings responded at a greater rate than clients who may be newer to saving (e.g., younger people or students). This pattern is consistent with the exploratory data analysis.

From a tactical viewpoint, this means marketing resources should be directed toward segments with a high likelihood of conversion that are not currently being engaged. Rather than contacting the same customers repeatedly, the results suggest focusing on the quality and timing of calls to customers who have shown some engagement in the past. In other words, an effective campaign would replace persistent, repetitive calling with well-timed, relevant follow-ups.

Following this, the tuned Random Forest model, with its strong F1 score, is the most deployable model: it captures complex nonlinear relationships, generalizes well to unseen data, is robust to outliers, and handles mixed categorical and numeric features. The AdaBoost model could serve as a secondary model when precision is the priority, for example when the bank wants to minimize wasted calls to clients who are unlikely to subscribe, accepting that some potential subscribers will be missed.

In effect, these results can support real-time scoring of prospective clients for targeted marketing. The model can rank clients by predicted probability of subscription, allowing the bank to focus outreach on the customers most likely to respond positively. Model outputs could also feed CRM systems to schedule calls and streamline processes.
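
As a rough sketch of that scoring workflow, the tuned Random Forest's predicted probabilities can be used to rank prospects for the next campaign. This is illustrative only: it scores the existing test set, and in practice the row positions would be joined back to client identifiers in the CRM.

import pandas as pd

# Rank clients by predicted probability of subscribing ("yes")
proba_yes = rf_tuned.predict_proba(X_test_processed)[:, 1]
call_list = (
  pd.DataFrame({"subscribe_probability": proba_yes})
    .sort_values("subscribe_probability", ascending=False)
    .head(100)   # top 100 prospects to prioritize
  )
print(call_list.head())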

Overall, I recommend deploying the tuned Random Forest model while continuing to monitor it and retrain on the most recent campaign data to preserve predictive power. By leveraging the timing of customer interactions along with demographic information, the bank can increase conversions and resource efficiency while delivering a more personalized client experience.

Conclusion

In this analysis, I showed how data-driven modeling can boost the performance of marketing campaigns by finding the patterns most related to customer conversion. After comparing models and tuning parameters, I found that the tuned Random Forest offered the best balance of accuracy, generalization, and interpretability, while AdaBoost delivered the highest raw accuracy and precision. The choice between them should weigh model performance against the business objective, whether that is maximizing efficiency, customer engagement, or another goal. In the future, I plan to refine these models with newer data and additional features, such as call timing and geographic attributes, so the outputs can serve as automated decision support for planning marketing campaigns.