Note: This essay serves as the Executive Summary accompanying the Exploratory Data Analysis (EDA) of the Bank Marketing Campaign dataset.
This dataset captures a Portuguese bank’s phone campaigns: 4,521 rows, 17 columns, a blend of numeric fields (age, balance, duration, campaign, pdays, previous) and categorical fields (job, marital, education, contact, month, poutcome). The target y indicates whether a client subscribed to a term deposit.
What matters most. The headline is imbalance and targeting. The target is skewed: 88.5 percent “no” (4,000 rows) to 11.5 percent “yes” (521 rows). If we do nothing, a model can “win” by predicting “no” every time: accuracy lands near 88.5 percent while catching zero subscribers. The real lift comes from campaign design variables (previous outcome, contact method, and timing), not simple demographics like age.
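To make the accuracy trap concrete, here is a minimal Python sketch that scores a do-nothing "always no" classifier using only the class counts reported in the imbalance table. The variable names are illustrative, not part of the original analysis.

```python
# Class counts taken from the imbalance table in this report.
n_no, n_yes = 4000, 521
n_total = n_no + n_yes

# A "model" that predicts "no" for every client.
true_negatives = n_no     # every actual "no" is predicted correctly
false_negatives = n_yes   # every actual "yes" is missed
true_positives = 0

accuracy = (true_positives + true_negatives) / n_total
recall_yes = true_positives / (true_positives + false_negatives)

print(f"accuracy   = {accuracy:.3f}")
print(f"recall yes = {recall_yes:.3f}")
```

High accuracy with zero recall on "yes" is exactly the failure mode to guard against when choosing an evaluation metric.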
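Central tendencies and spread. Age clusters in the late 30s to early 40s (median 39, mean 41.2); nothing wild there. Balance and duration are strongly right-skewed with long tails: mean balance is 1,423 against a median of 444, with a maximum of 71,188. A handful of wealthy or very long-call cases will dominate a loss function if left untreated. That means we should log-transform duration and either winsorize balance or apply a signed log, because balance includes negative values (minimum -3,313) where a plain log is undefined. Campaign count is concentrated around 1–3 calls; additional calls beyond that tend to show diminishing returns.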
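The transforms can be sketched in a few lines of Python. This is an illustration under assumptions: `signed_log1p` and `winsorize` are hypothetical helper names, the quantile rule is a simple floor-index lookup, and the inputs are only the first ten balance and duration values shown in this report's structure output.

```python
import math

def signed_log1p(x):
    """Log transform that tolerates zero and negative values."""
    return math.copysign(math.log1p(abs(x)), x)

def winsorize(values, lower_q=0.01, upper_q=0.99):
    """Clamp values to empirical quantiles (simple floor-index lookup)."""
    ordered = sorted(values)
    n = len(ordered)
    lo = ordered[int(lower_q * (n - 1))]
    hi = ordered[int(upper_q * (n - 1))]
    return [min(max(v, lo), hi) for v in values]

# First ten rows of balance and duration from the structure output.
balance = [1787, 4789, 1350, 1476, 0, 747, 307, 147, 221, -88]
duration = [79, 220, 185, 199, 226, 141, 341, 151, 57, 313]

log_duration = [math.log1p(d) for d in duration]
log_balance = [signed_log1p(b) for b in balance]
# Wide quantiles here only because the sample has ten values.
capped_balance = winsorize(balance, 0.10, 0.90)
```

On the full data, the 1st and 99th percentiles would be a more typical choice; the 10th and 90th are used above only so that the cap is visible on ten values.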
Categorical conversion rates (where the signal lives). When we compute conversion rates by category, we see sharp differences:
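Previous outcome (poutcome) is the strongest single indicator. A prior “success” converts at 64.3 percent, against 9.1 percent when the previous outcome is unknown. This is textbook: past positive engagement signals readiness.
Contact method matters: cellular and telephone contacts both convert at roughly 14 to 15 percent, while the unknown contact channel converts at 4.6 percent.
Month matters: conversion rates spike in a few low-volume months (October, December, and March, each above 40 percent on fewer than 100 contacts), while the highest-volume month, May, converts at only 6.7 percent. Timing is a lever, though the small-month rates should be read with caution.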
Correlations are not the whole story. Numeric features have weak linear correlation with the target; that is normal for marketing data. The heavy lifting comes from categoricals and interactions (for example, job × month, contact × previous outcome). That is exactly why linear correlation heatmaps underwhelm here.
Missingness and duplicates. There are no formal NA values, but several fields use “unknown”. Treat those as explicit levels or flagged missing. Duplicates are checked and removed if found (see Appendix code). This aligns with domain reality: not every client discloses job/education, and many have never been contacted before (pdays == -1).
Algorithm selection. This is a supervised binary classification problem. Use Logistic Regression as a baseline for interpretability and stakeholder communication (coefficients map to odds). Use Gradient Boosting (or Random Forest) for performance, because it captures nonlinearities and interactions that dominate marketing outcomes. If the dataset were under 1,000 rows, I would prefer simpler models (Logistic/Decision Tree) to control overfitting.
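As a small illustration of "coefficients map to odds," the Python sketch below derives odds ratios directly from the poutcome counts reported in this analysis. Treating "unknown" as the reference level is an assumption made here for illustration. For a logistic regression with poutcome as its only predictor, the fitted coefficient for each level would equal the log of the odds ratio shown.

```python
import math

# (yes, no) counts per previous-outcome level, from the poutcome table.
counts = {
    "success": (83, 46),
    "other": (38, 159),
    "failure": (63, 427),
    "unknown": (337, 3368),
}

def odds(yes, no):
    """Odds of subscribing: successes per failure."""
    return yes / no

# Assumption: "unknown" is used as the reference level.
reference = odds(*counts["unknown"])

odds_ratios = {level: odds(*yn) / reference for level, yn in counts.items()}
coefficients = {level: math.log(r) for level, r in odds_ratios.items()}

for level in counts:
    print(f"{level:8s} odds ratio = {odds_ratios[level]:6.2f}  "
          f"log-odds coefficient = {coefficients[level]:5.2f}")
```

An odds ratio well above 1 for "success" is the kind of statement stakeholders can act on without reading model internals.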
Preprocessing playbook. Address class imbalance with SMOTE or class weights. One-hot encode categoricals (job, education, contact, month, poutcome), optionally keeping “unknown” as its own level. Log-transform duration; winsorize balance or use a signed log (balance has negative values, so a plain log fails); standardize numerics for linear models. Engineer high-value flags such as contacted_before (pdays != -1, since -1 is the “never contacted” sentinel) and seasonal flags for high-conversion months. Evaluate with ROC-AUC and PR-AUC, not just accuracy.
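A minimal Python sketch of the encoding and flag steps follows. The helper names (`one_hot`, `encode`) and the `high_month` flag are hypothetical, and the choice of October, December, March, and September as "high" months is an assumption based on the month table in this report.

```python
def one_hot(value, levels, prefix):
    """Return 0/1 indicator columns, one per level."""
    return {f"{prefix}_{level}": int(value == level) for level in levels}

# Levels as they appear in the contact and poutcome tables.
CONTACT_LEVELS = ["cellular", "telephone", "unknown"]
POUTCOME_LEVELS = ["success", "failure", "other", "unknown"]
# Assumption: the four months with the highest observed conversion rates.
HIGH_MONTHS = {"oct", "dec", "mar", "sep"}

def encode(row):
    features = {}
    # "unknown" stays an explicit level rather than being dropped.
    features.update(one_hot(row["contact"], CONTACT_LEVELS, "contact"))
    features.update(one_hot(row["poutcome"], POUTCOME_LEVELS, "poutcome"))
    # pdays == -1 marks clients never contacted before, so test the sentinel.
    features["contacted_before"] = int(row["pdays"] != -1)
    features["high_month"] = int(row["month"] in HIGH_MONTHS)
    return features

# Second row of the structure output in this report.
row = {"contact": "cellular", "poutcome": "failure", "pdays": 339, "month": "may"}
encoded = encode(row)
print(encoded)
```

For a linear model, one level per variable would normally be dropped as the reference to avoid perfectly collinear columns; tree-based models do not need that step.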
Bottom line. To get more deposits, do not just dial more — dial smarter. Prioritize clients with prior positive outcomes, use the best contact channel, and time outreach in stronger months. Fix the imbalance, encode categoricals, transform skew, and use Logistic Regression for storytelling with Gradient Boosting for lift. That’s the play.
Structure and Shape
## [1] 4521 17
## 'data.frame': 4521 obs. of 17 variables:
## $ age : int 30 33 35 30 59 35 36 39 41 43 ...
## $ job : chr "unemployed" "services" "management" "management" ...
## $ marital : chr "married" "married" "single" "married" ...
## $ education: chr "primary" "secondary" "tertiary" "tertiary" ...
## $ default : chr "no" "no" "no" "no" ...
## $ balance : int 1787 4789 1350 1476 0 747 307 147 221 -88 ...
## $ housing : chr "no" "yes" "yes" "yes" ...
## $ loan : chr "no" "yes" "no" "yes" ...
## $ contact : chr "cellular" "cellular" "cellular" "unknown" ...
## $ day : int 19 11 16 3 5 23 14 6 14 17 ...
## $ month : chr "oct" "may" "apr" "jun" ...
## $ duration : int 79 220 185 199 226 141 341 151 57 313 ...
## $ campaign : int 1 1 1 4 1 2 1 2 2 1 ...
## $ pdays : int -1 339 330 -1 -1 176 330 -1 -1 147 ...
## $ previous : int 0 4 1 0 0 3 2 0 0 2 ...
## $ poutcome : chr "unknown" "failure" "failure" "unknown" ...
## $ y : chr "no" "no" "no" "no" ...
Bottom line. Mostly categorical features with a few skewed numerics. Target is y (yes/no).
| variable | Mean | Median | SD | Min | Max |
|---|---|---|---|---|---|
| age | 41.17 | 39 | 10.58 | 19 | 87 |
| balance | 1422.66 | 444 | 3009.64 | -3313 | 71188 |
| campaign | 2.79 | 2 | 3.11 | 1 | 50 |
| duration | 263.96 | 185 | 259.86 | 4 | 3025 |
| pdays | 39.77 | -1 | 100.12 | -1 | 871 |
| previous | 0.54 | 0 | 1.69 | 0 | 25 |
Why it matters. Balance and duration are heavily right-skewed. Outliers will hijack training unless we log/trim.
The Shape Of The Problem
Bottom line. Transform balance and duration; treat campaign’s heavy mass at 1–3 calls carefully.
Imbalance Check
| level | n | prop |
|---|---|---|
| no | 4000 | 0.885 |
| yes | 521 | 0.115 |
Bottom line. With about 89 percent “no,” accuracy alone is a trap. Optimize recall/precision on “yes.”
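One common way to counter the imbalance is the "balanced" class-weight heuristic, where each class gets weight n / (k × n_class). The Python sketch below applies it to the counts in the table above; it is an illustration of the heuristic, not necessarily the weighting used in the original analysis.

```python
# Class counts from the imbalance table.
counts = {"no": 4000, "yes": 521}
n_total = sum(counts.values())
n_classes = len(counts)

# Balanced heuristic: weight = n_total / (n_classes * n_class).
class_weight = {
    label: n_total / (n_classes * n) for label, n in counts.items()
}

for label, weight in class_weight.items():
    # After weighting, each class contributes the same total mass.
    print(f"{label:3s} weight = {weight:.3f}  weighted mass = {weight * counts[label]:.1f}")
```

With these weights, a missed "yes" costs several times more than a missed "no," which pushes a model away from the always-"no" shortcut.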
Bottom line. Weak linear correlations are expected; the real signal is in categoricals and interactions.
| job | yes | no | total | conversion_rate |
|---|---|---|---|---|
| retired | 54 | 176 | 230 | 0.235 |
| student | 19 | 65 | 84 | 0.226 |
| unknown | 7 | 31 | 38 | 0.184 |
| management | 131 | 838 | 969 | 0.135 |
| housemaid | 14 | 98 | 112 | 0.125 |
| admin. | 58 | 420 | 478 | 0.121 |
| self-employed | 20 | 163 | 183 | 0.109 |
| technician | 83 | 685 | 768 | 0.108 |
| contact | yes | no | total | conversion_rate |
|---|---|---|---|---|
| telephone | 44 | 257 | 301 | 0.146 |
| cellular | 416 | 2480 | 2896 | 0.144 |
| unknown | 61 | 1263 | 1324 | 0.046 |
| poutcome | yes | no | total | conversion_rate |
|---|---|---|---|---|
| success | 83 | 46 | 129 | 0.643 |
| other | 38 | 159 | 197 | 0.193 |
| failure | 63 | 427 | 490 | 0.129 |
| unknown | 337 | 3368 | 3705 | 0.091 |
| month | yes | no | total | conversion_rate |
|---|---|---|---|---|
| oct | 37 | 43 | 80 | 0.462 |
| dec | 9 | 11 | 20 | 0.450 |
| mar | 21 | 28 | 49 | 0.429 |
| sep | 17 | 35 | 52 | 0.327 |
| apr | 56 | 237 | 293 | 0.191 |
| feb | 38 | 184 | 222 | 0.171 |
| aug | 79 | 554 | 633 | 0.125 |
| jan | 16 | 132 | 148 | 0.108 |
| jun | 55 | 476 | 531 | 0.104 |
| nov | 39 | 350 | 389 | 0.100 |
| jul | 61 | 645 | 706 | 0.086 |
| may | 93 | 1305 | 1398 | 0.067 |
Bottom line. Prior positive outcome, contact method, and timing are the levers. That is your targeting strategy.
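The conversion tables above can be reproduced with a small group-and-count routine. The Python sketch below is a hypothetical helper (`conversion_table`), demonstrated on row-level records rebuilt from the contact table counts, since the raw data file is not included here.

```python
from collections import defaultdict

def conversion_table(rows, column, target="y", positive="yes"):
    """Return (level, yes, no, total, rate) tuples, highest rate first."""
    tally = defaultdict(lambda: [0, 0])  # level -> [yes count, no count]
    for row in rows:
        idx = 0 if row[target] == positive else 1
        tally[row[column]][idx] += 1
    table = [
        (level, yes, no, yes + no, round(yes / (yes + no), 3))
        for level, (yes, no) in tally.items()
    ]
    return sorted(table, key=lambda r: r[4], reverse=True)

# Rebuild row-level records from the contact table counts.
contact_counts = {
    "telephone": (44, 257),
    "cellular": (416, 2480),
    "unknown": (61, 1263),
}
rows = []
for level, (yes, no) in contact_counts.items():
    rows += [{"contact": level, "y": "yes"}] * yes
    rows += [{"contact": level, "y": "no"}] * no

table = conversion_table(rows, "contact")
for line in table:
    print(line)
```

Reporting the total alongside the rate matters: a 45 percent rate on 20 contacts is far less reliable than a 14 percent rate on nearly 2,900.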
| variable | pct_unknown |
|---|---|
| poutcome | 82.0 |
| contact | 29.3 |
| education | 4.1 |
| job | 0.8 |
| age | 0.0 |
| marital | 0.0 |
| default | 0.0 |
| balance | 0.0 |
| housing | 0.0 |
| loan | 0.0 |
| day | 0.0 |
| month | 0.0 |
| duration | 0.0 |
| campaign | 0.0 |
| pdays | 0.0 |
| previous | 0.0 |
| y | 0.0 |
Bottom line. Keep “unknown” as an explicit level or flag it. Do not silently drop it.
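A minimal Python sketch of both options: measuring the share of "unknown" per column and adding explicit missing flags while leaving the original level in place. The helper names are hypothetical, and the four sample rows are just the first four records from the structure output, so the percentages printed here are not the full-data figures in the table above.

```python
def pct_unknown(rows, columns, marker="unknown"):
    """Percentage of rows where each column equals the marker."""
    n = len(rows)
    return {
        col: round(100 * sum(1 for r in rows if r[col] == marker) / n, 1)
        for col in columns
    }

def add_unknown_flags(row, columns, marker="unknown"):
    """Copy the row and add a 0/1 <column>_unknown flag per column."""
    flagged = dict(row)
    for col in columns:
        flagged[f"{col}_unknown"] = int(row[col] == marker)
    return flagged

# First four records from the structure output (illustration only).
sample = [
    {"job": "unemployed", "contact": "cellular", "poutcome": "unknown"},
    {"job": "services", "contact": "cellular", "poutcome": "failure"},
    {"job": "management", "contact": "cellular", "poutcome": "failure"},
    {"job": "management", "contact": "unknown", "poutcome": "unknown"},
]
columns = ["job", "contact", "poutcome"]

shares = pct_unknown(sample, columns)
flagged = [add_unknown_flags(r, columns) for r in sample]
print(shares)
```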
Bottom line. Log duration; winsorize balance or use a signed log, since balance goes negative; be cautious beyond the 95th–99th percentiles.
| Metric | Value |
|---|---|
| Duplicate rows | 0 |
Bottom line. Remove duplicates if any appear; keep one distinct record.
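The duplicate check amounts to keeping the first occurrence of each fully identical record. A minimal Python sketch, with a hypothetical `drop_duplicates` helper and a toy input in which one record is deliberately repeated:

```python
def drop_duplicates(rows):
    """Keep the first occurrence of each fully identical record."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))  # order-independent fingerprint
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Toy input: the second record is repeated on purpose.
records = [
    {"age": 30, "job": "unemployed", "balance": 1787},
    {"age": 33, "job": "services", "balance": 4789},
    {"age": 33, "job": "services", "balance": 4789},
]

deduped = drop_duplicates(records)
n_duplicates = len(records) - len(deduped)
print(f"duplicate rows: {n_duplicates}")
```

Matching on every column is the conservative choice here because the dataset has no client identifier; two different clients can legitimately share a few attribute values.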
Logistic Regression: interpretable coefficients for stakeholders; great baseline; needs one-hot encoding and scaling.
Gradient Boosting / Random Forest: stronger predictive performance, handles nonlinearities and interactions; less transparent; use SHAP for explainability.
Small data (< 1000 rows): prefer simpler models (Logistic/Decision Tree) to avoid overfitting.
Recommendation: baseline with Logistic Regression for storytelling; deploy Gradient Boosting for lift. Evaluate with ROC-AUC and PR-AUC, not plain accuracy.
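For reference, both metrics can be computed from scratch. The Python sketch below uses the pairwise-ranking definition of ROC-AUC and average precision as a common summary of the precision-recall curve; the labels and scores are made up for illustration and do not come from any fitted model.

```python
def roc_auc(labels, scores):
    """Chance that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of precision at each positive, scanning from the highest score down."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Made-up scores for eight clients, two of whom subscribed (label 1).
labels = [0, 0, 1, 0, 0, 0, 1, 0]
scores = [0.10, 0.30, 0.80, 0.20, 0.60, 0.05, 0.40, 0.15]

auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
print(f"ROC-AUC = {auc:.3f}, average precision = {ap:.3f}")
```

The useful reference points differ: an uninformative scorer sits near 0.5 on ROC-AUC, but its average precision sits near the positive rate (about 0.115 for this dataset), which is why PR-AUC is the more demanding metric under imbalance.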
Imbalance: SMOTE or class weights; report ROC-AUC and PR-AUC.
Categoricals: one-hot encode job, education, contact, month, poutcome; keep “unknown” explicit or add missing flags.
Numerics: log-transform duration; winsorize balance or use a signed log (balance has negative values); standardize for linear models.
Features: contacted_before (pdays != -1); seasonal flags for high-conversion months; optional interactions.
Sanity: drop duplicates; trim extreme values where appropriate.
If the goal is more deposits, do not just dial more — dial smarter. Prioritize people with prior positive outcomes, use the right contact method, and time the calls when conversion is historically stronger. Fix the imbalance, encode categoricals, transform skewed numerics, and pair Logistic Regression (explainability) with Gradient Boosting (performance). That is the plan I would ship.
## age balance duration campaign
## Min. :19.00 Min. :-3313 Min. : 4 Min. : 1.000
## 1st Qu.:33.00 1st Qu.: 69 1st Qu.: 104 1st Qu.: 1.000
## Median :39.00 Median : 444 Median : 185 Median : 2.000
## Mean :41.17 Mean : 1423 Mean : 264 Mean : 2.794
## 3rd Qu.:49.00 3rd Qu.: 1480 3rd Qu.: 329 3rd Qu.: 3.000
## Max. :87.00 Max. :71188 Max. :3025 Max. :50.000
## pdays previous
## Min. : -1.00 Min. : 0.0000
## 1st Qu.: -1.00 1st Qu.: 0.0000
## Median : -1.00 Median : 0.0000
## Mean : 39.77 Mean : 0.5426
## 3rd Qu.: -1.00 3rd Qu.: 0.0000
## Max. :871.00 Max. :25.0000
## [1] 0