Note: This essay serves as the Executive Summary accompanying the Exploratory Data Analysis (EDA) of the Bank Marketing Campaign dataset.

1 Executive Summary & Findings

This dataset captures a Portuguese bank’s phone campaigns: 4,521 rows, 17 columns, a blend of numeric fields (age, balance, duration, campaign, pdays, previous) and categorical fields (job, marital, education, contact, month, poutcome). The target y indicates whether a client subscribed to a term deposit.

What matters most. The headline is imbalance and targeting. The target is skewed: about 88.5 percent “no” to 11.5 percent “yes.” Left untreated, a model can “win” by predicting “no” every time; accuracy will look great and be useless. The real lift comes from campaign design variables (previous outcome, contact method, and timing), not simple demographics like age.

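Central tendencies and spread. Age clusters in the late 30s to early 40s (median 39, mean 41.2); nothing wild there. Balance and duration are strongly right-skewed with long tails: the mean balance (1,423) is more than three times the median (444). A handful of wealthy or very long-call cases will dominate a loss function if left untreated, so we should log-transform duration and winsorize or log-transform balance. Because balance includes zero and negative values, a plain log is undefined there; a signed log or winsorizing is the safer choice. Campaign count is concentrated around 1–3 calls, and additional calls beyond that tend to show diminishing returns.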

Categorical conversion rates (where the signal lives). When we compute conversion rates by category, we see sharp differences:

Previous outcome (poutcome) is the strongest single indicator. A prior “success” predicts a dramatically higher current conversion rate. This is textbook: past positive engagement signals readiness.

Contact method matters: cellular and telephone contacts convert at roughly 14 to 15 percent, versus under 5 percent when the contact channel is unknown.

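Month matters: conversion rates spike in a few low-volume months (October, December, March, and September), while the highest-volume month, May, converts at under 7 percent. Timing is a lever, though the spike months rest on small samples (20 to 80 calls each).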

Correlations are not the whole story. Numeric features have weak linear correlation with the target; that is normal for marketing data. The heavy lifting comes from categoricals and interactions (for example, job × month, contact × previous outcome). That is exactly why linear correlation heatmaps underwhelm here.

Missingness and duplicates. There are no formal NA values, but several fields use “unknown”; treat those as explicit levels or flagged missing. This matches domain reality: not every client discloses job or education, and many have never been contacted before (pdays == -1). Duplicates are checked and removed if found (see Appendix code).

Algorithm selection. This is a supervised binary classification problem. Use Logistic Regression as a baseline for interpretability and stakeholder communication (coefficients map to odds). Use Gradient Boosting (or Random Forest) for performance, because it captures nonlinearities and interactions that dominate marketing outcomes. If the dataset were under 1,000 rows, I would prefer simpler models (Logistic/Decision Tree) to control overfitting.

Preprocessing playbook. Address class imbalance with SMOTE or class weights. One-hot encode categoricals (job, education, contact, month, poutcome), optionally keeping “unknown” as its own level. Log-transform duration; log or winsorize balance; standardize numerics for linear models. Engineer high-value flags such as contacted_before (pdays > 0) and seasonal targets (high-conversion months). Evaluate with ROC-AUC and PR-AUC, not just accuracy.

Bottom line. To get more deposits, do not just dial more — dial smarter. Prioritize clients with prior positive outcomes, use the best contact channel, and time outreach in stronger months. Fix the imbalance, encode categoricals, transform skew, and use Logistic Regression for storytelling with Gradient Boosting for lift. That’s the play.

2 Exploratory Data Analysis

Structure and Shape

## [1] 4521   17
## 'data.frame':    4521 obs. of  17 variables:
##  $ age      : int  30 33 35 30 59 35 36 39 41 43 ...
##  $ job      : chr  "unemployed" "services" "management" "management" ...
##  $ marital  : chr  "married" "married" "single" "married" ...
##  $ education: chr  "primary" "secondary" "tertiary" "tertiary" ...
##  $ default  : chr  "no" "no" "no" "no" ...
##  $ balance  : int  1787 4789 1350 1476 0 747 307 147 221 -88 ...
##  $ housing  : chr  "no" "yes" "yes" "yes" ...
##  $ loan     : chr  "no" "yes" "no" "yes" ...
##  $ contact  : chr  "cellular" "cellular" "cellular" "unknown" ...
##  $ day      : int  19 11 16 3 5 23 14 6 14 17 ...
##  $ month    : chr  "oct" "may" "apr" "jun" ...
##  $ duration : int  79 220 185 199 226 141 341 151 57 313 ...
##  $ campaign : int  1 1 1 4 1 2 1 2 2 1 ...
##  $ pdays    : int  -1 339 330 -1 -1 176 330 -1 -1 147 ...
##  $ previous : int  0 4 1 0 0 3 2 0 0 2 ...
##  $ poutcome : chr  "unknown" "failure" "failure" "unknown" ...
##  $ y        : chr  "no" "no" "no" "no" ...

Bottom line. Mostly categorical features with a few skewed numerics. Target is y (yes/no).

2.1 Central Tendency & Spread

Summary Statistics of Key Numeric Features
feature    Mean      Median   SD        Min     Max
age        41.17     39       10.58     19      87
balance    1422.66   444      3009.64   -3313   71188
campaign   2.79      2        3.11      1       50
duration   263.96    185      259.86    4       3025
pdays      39.77     -1       100.12    -1      871
previous   0.54      0        1.69      0       25

Why it matters. Balance and duration are heavily right-skewed. Outliers will hijack training unless we log/trim.

2.2 Distributions

The Shape Of The Problem

Bottom line. Transform balance and duration; treat campaign’s heavy mass at 1–3 calls carefully.

2.3 Target Balance

Imbalance Check

Target Class Balance
level n prop
no 4000 0.885
yes 521 0.115

Bottom line. With about 88.5 percent “no,” accuracy alone is a trap. Optimize recall/precision on “yes.”
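The trap can be made concrete with the class counts from the Target Class Balance table. The sketch below assumes only those counts and a hypothetical classifier that always predicts "no"; it is an illustration, not a model fitted in this analysis.

```r
# Class counts taken from the Target Class Balance table.
n_no  <- 4000
n_yes <- 521

# Rebuild the target and a hypothetical classifier that always says "no".
actual    <- factor(c(rep("no", n_no), rep("yes", n_yes)), levels = c("no", "yes"))
predicted <- factor(rep("no", n_no + n_yes), levels = c("no", "yes"))

# Confusion matrix: rows are actual classes, columns are predictions.
cm <- table(actual = actual, predicted = predicted)

accuracy   <- sum(diag(cm)) / sum(cm)               # share of correct calls
recall_yes <- cm["yes", "yes"] / sum(cm["yes", ])   # share of subscribers found

print(round(c(accuracy = accuracy, recall_yes = recall_yes), 3))
```

Accuracy lands near 0.885 while recall on "yes" is zero, which is why the report leans on recall, precision, and PR-AUC instead.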

2.4 Correlations: Numeric Only

Bottom line. Weak linear correlations are expected; the real signal is in categoricals and interactions.
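One way to see why a linear correlation can miss a strong signal: encode month as a number from 1 to 12 (a hypothetical encoding used only for this illustration) and correlate it with a 0/1 target. The counts are copied from the Conversion by Month table in section 2.5; the rest is a sketch.

```r
# Yes/no counts per month, copied from the Conversion by Month table.
month_counts <- data.frame(
  month = c("jan", "feb", "mar", "apr", "may", "jun",
            "jul", "aug", "sep", "oct", "nov", "dec"),
  no    = c(132, 184, 28, 237, 1305, 476, 645, 554, 35, 43, 350, 11),
  yes   = c(16, 38, 21, 56, 93, 55, 61, 79, 17, 37, 39, 9)
)
month_counts$rate <- month_counts$yes / (month_counts$no + month_counts$yes)

# Expand the counts to one row per call: all "no" rows first, then all "yes" rows.
month_num <- rep(rep(1:12, times = 2), times = c(month_counts$no, month_counts$yes))
y01       <- rep(c(0, 1), times = c(sum(month_counts$no), sum(month_counts$yes)))

# Pearson correlation between the numeric month encoding and the target.
r <- cor(month_num, y01)

print(round(r, 3))
print(round(range(month_counts$rate), 3))
```

The correlation comes out close to zero even though per-month conversion rates span a wide range, so a flat heatmap does not mean a feature is uninformative.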

2.5 Categorical Conversion

Where The Signal Lives

Top Job Categories by Conversion Rate
job             yes   no    total   conversion_rate
retired         54    176   230     0.235
student         19    65    84      0.226
unknown         7     31    38      0.184
management      131   838   969     0.135
housemaid       14    98    112     0.125
admin.          58    420   478     0.121
self-employed   20    163   183     0.109
technician      83    685   768     0.108

Contact Method (Conversion Rate)
contact     yes   no     total   conversion_rate
telephone   44    257    301     0.146
cellular    416   2480   2896    0.144
unknown     61    1263   1324    0.046

Previous Outcome (Conversion Rate)
poutcome   yes   no     total   conversion_rate
success    83    46     129     0.643
other      38    159    197     0.193
failure    63    427    490     0.129
unknown    337   3368   3705    0.091

Conversion by Month
month   no     yes   total   conversion_rate
oct     43     37    80      0.462
dec     11     9     20      0.450
mar     28     21    49      0.429
sep     35     17    52      0.327
apr     237    56    293     0.191
feb     184    38    222     0.171
aug     554    79    633     0.125
jan     132    16    148     0.108
jun     476    55    531     0.104
nov     350    39    389     0.100
jul     645    61    706     0.086
may     1305   93    1398    0.067

Bottom line. Prior positive outcome, contact method, and timing are the levers. That is your targeting strategy.
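The conversion tables in this section can be reproduced with a grouped mean. The sketch below rebuilds a stand-in data frame from the Previous Outcome counts (the name `bank` is a placeholder, not necessarily the object used in the original analysis) and computes the rate per level in base R.

```r
# Counts copied from the Previous Outcome table.
counts <- data.frame(
  poutcome = c("success", "other", "failure", "unknown"),
  yes      = c(83, 38, 63, 337),
  no       = c(46, 159, 427, 3368)
)

# Expand to one row per client so the code mirrors a data frame with
# columns `poutcome` and `y`, as in the dataset.
bank <- data.frame(
  poutcome = rep(rep(counts$poutcome, times = 2), times = c(counts$yes, counts$no)),
  y        = rep(c("yes", "no"), times = c(sum(counts$yes), sum(counts$no)))
)

# Conversion rate per level: the share of rows with y == "yes".
rate  <- tapply(bank$y == "yes", bank$poutcome, mean)
total <- tapply(bank$y, bank$poutcome, length)

conv <- data.frame(
  poutcome        = names(rate),
  total           = as.vector(total),
  conversion_rate = round(as.vector(rate), 3)
)
conv <- conv[order(-conv$conversion_rate), ]
print(conv, row.names = FALSE)
```

Swapping `poutcome` for `job`, `contact`, or `month` on the real data frame would give the other tables in the same way.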

2.6 Missingness: The “Unknown” Game

Share of ‘unknown’ by Variable (%)
variable pct_unknown
poutcome 82.0
contact 29.3
education 4.1
job 0.8
age 0.0
marital 0.0
default 0.0
balance 0.0
housing 0.0
loan 0.0
day 0.0
month 0.0
duration 0.0
campaign 0.0
pdays 0.0
previous 0.0
y 0.0

Bottom line. Keep “unknown” as an explicit level or flag it. Do not silently drop it.
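A table like the one above can be produced with one pass over the columns. The data frame below is a toy stand-in with made-up rows (only the column names follow the dataset), so the percentages it prints are not the report's figures.

```r
# Toy stand-in for the bank data; values are illustrative, not the real rows.
toy <- data.frame(
  job       = c("management", "unknown", "services", "technician", "retired"),
  education = c("tertiary", "secondary", "unknown", "unknown", "primary"),
  contact   = c("cellular", "unknown", "unknown", "unknown", "telephone"),
  age       = c(30, 33, 35, 30, 59),
  stringsAsFactors = FALSE
)

# Percentage of rows equal to "unknown" in each column (0 for numeric columns).
pct_unknown <- sapply(toy, function(col) 100 * mean(col == "unknown"))
pct_unknown <- sort(round(pct_unknown, 1), decreasing = TRUE)
print(pct_unknown)

# Option: add an explicit missing flag instead of dropping rows.
toy$education_unknown <- as.integer(toy$education == "unknown")
```

Keeping the flag alongside the original column lets a model learn whether non-disclosure itself carries signal.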

3 Outliers: Trim Or Transform?

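Bottom line. Log-transform duration; log-transform or winsorize balance, using a signed log or a shift because balance contains zero and negative values (minimum -3,313). Treat values beyond the 95th–99th percentiles with care rather than dropping them outright.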
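A minimal sketch of these transforms in base R follows. The helper names `signed_log1p` and `winsorize` and the 1st/99th percentile cutoffs are assumptions made for illustration; the sample values are a handful of figures from the outputs above, not the full columns.

```r
# A few values from the outputs above, including the reported minima and maxima.
balance  <- c(-3313, -88, 0, 147, 444, 1787, 4789, 71188)
duration <- c(4, 57, 79, 185, 226, 341, 3025)

# Duration is strictly positive here, so log1p is safe.
duration_log <- log1p(duration)

# Plain log() is undefined for zero or negative balances. One option is a
# signed log: keep the sign, compress the magnitude.
signed_log1p <- function(x) sign(x) * log1p(abs(x))
balance_slog <- signed_log1p(balance)

# Winsorizing: clamp values outside the chosen quantiles to those quantiles.
winsorize <- function(x, lower = 0.01, upper = 0.99) {
  q <- quantile(x, probs = c(lower, upper), na.rm = TRUE, names = FALSE)
  pmin(pmax(x, q[1]), q[2])
}
balance_wins <- winsorize(balance)

print(round(balance_slog, 2))
print(round(balance_wins))
```

The signed log keeps every row and preserves ordering, while winsorizing keeps the original units; which one suits a given model is a judgment call.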

3.1 Duplicates & Consistency Check

Data Quality Check
Metric Value
Duplicate rows 0

Bottom line. Remove duplicates if any appear; keep one distinct record.
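The duplicate check itself is short in base R. The toy data frame below contains one deliberate repeat so the removal step has something to do; it is not drawn from the real data, which reported zero duplicates.

```r
# Toy data frame with one exact duplicate (rows 1 and 3); illustrative only.
toy <- data.frame(
  age     = c(30, 33, 30, 59),
  job     = c("unemployed", "services", "unemployed", "retired"),
  balance = c(1787, 4789, 1787, 0)
)

# duplicated() marks every repeat of an earlier row, so the sum is the
# number of rows that would be dropped.
n_dupes <- sum(duplicated(toy))

# Keep the first occurrence of each distinct record.
toy_clean <- toy[!duplicated(toy), ]

print(n_dupes)
print(nrow(toy_clean))
```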

4 Modeling Strategy: What Fits The Business

Logistic Regression: interpretable coefficients for stakeholders; a strong baseline; needs one-hot encoding and scaling.

Gradient Boosting / Random Forest: stronger predictive performance; handles nonlinearities and interactions; less transparent, so use SHAP for explainability.

Small data (< 1,000 rows): prefer simpler models (Logistic/Decision Tree) to avoid overfitting. At 4,521 rows, this dataset is above that threshold.

Recommendation: baseline with Logistic Regression for storytelling; deploy Gradient Boosting for lift. Evaluate with ROC-AUC and PR-AUC, not plain accuracy.
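To show what the baseline and its metric look like in practice, here is a deliberately small sketch: a logistic regression on a single predictor, rebuilt from the Previous Outcome counts in section 2.5, with ROC-AUC computed from the rank-sum identity in base R. It is not the model this report would deploy, and it evaluates on the training rows only.

```r
# Counts copied from the Previous Outcome table in section 2.5.
counts <- data.frame(
  poutcome = c("success", "other", "failure", "unknown"),
  yes      = c(83, 38, 63, 337),
  no       = c(46, 159, 427, 3368)
)

# Expand to one row per client; "unknown" is set as the reference level.
bank <- data.frame(
  poutcome = factor(
    rep(rep(counts$poutcome, times = 2), times = c(counts$yes, counts$no)),
    levels = c("unknown", "failure", "other", "success")
  ),
  y = rep(c(1, 0), times = c(sum(counts$yes), sum(counts$no)))
)

# Baseline logistic regression with a single categorical predictor.
fit <- glm(y ~ poutcome, data = bank, family = binomial())

# Exponentiated coefficients are odds ratios relative to "unknown".
odds_ratios <- exp(coef(fit))[-1]

# ROC-AUC via the Mann-Whitney rank-sum identity; ties get average ranks.
p   <- fitted(fit)
n1  <- sum(bank$y == 1)
n0  <- sum(bank$y == 0)
auc <- (sum(rank(p)[bank$y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

print(round(odds_ratios, 2))
print(round(auc, 3))
```

The odds ratio for a prior "success" comes out at roughly 18 relative to "unknown", the kind of statement stakeholders can act on. The AUC of about 0.61 from this one feature alone leaves plenty of room for the other features and a boosted model.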

4.1 Preprocessing Plan: Execution Playbook

Imbalance: SMOTE or class weights; report ROC-AUC and PR-AUC.

Categoricals: one-hot encode job, education, contact, month, poutcome; keep “unknown” explicit or add missing flags.

Numerics: log-transform duration; log or winsorize balance; standardize for linear models.

Features: contacted_before (pdays > 0); seasonal flags for high-conversion months; optional interactions.

Sanity: drop duplicates; trim extreme values where appropriate.
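Several of these steps can be sketched in base R on a toy data frame. The rows below are made up, only the column names follow the dataset, and class weights stand in for SMOTE, which needs an add-on package.

```r
# Toy rows; column names follow the dataset, values are illustrative.
toy <- data.frame(
  contact  = c("cellular", "unknown", "telephone", "cellular", "unknown", "cellular"),
  poutcome = c("unknown", "failure", "success", "unknown", "unknown", "other"),
  pdays    = c(-1, 339, 176, -1, -1, 147),
  duration = c(79, 220, 185, 199, 226, 141),
  y        = c("no", "no", "yes", "no", "no", "yes"),
  stringsAsFactors = TRUE
)

# Flag clients with an earlier contact; pdays == -1 marks "never contacted".
toy$contacted_before <- as.integer(toy$pdays > 0)

# Log-transform, then standardize duration for linear models.
toy$duration_z <- as.numeric(scale(log1p(toy$duration)))

# One-hot encode categoricals. "- 1" drops the intercept, so the first factor
# keeps all of its levels; later factors drop one reference level to avoid
# collinearity. "unknown" stays an explicit level.
X <- model.matrix(~ contact + poutcome - 1, data = toy)

# Inverse-frequency class weights: each class gets equal total weight.
class_freq <- table(toy$y)
w <- as.numeric(nrow(toy) / (length(class_freq) * class_freq[as.character(toy$y)]))

print(colnames(X))
print(w)
```

The weight vector `w` can be passed to the `weights` argument of `glm()`; tree-based libraries usually expose a comparable option.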

5 Conclusion

If the goal is more deposits, do not just dial more — dial smarter. Prioritize people with prior positive outcomes, use the right contact method, and time the calls when conversion is historically stronger. Fix the imbalance, encode categoricals, transform skewed numerics, and pair Logistic Regression (explainability) with Gradient Boosting (performance). That is the plan I would ship.

6 Appendix: R Full Code

##       age           balance         duration       campaign     
##  Min.   :19.00   Min.   :-3313   Min.   :   4   Min.   : 1.000  
##  1st Qu.:33.00   1st Qu.:   69   1st Qu.: 104   1st Qu.: 1.000  
##  Median :39.00   Median :  444   Median : 185   Median : 2.000  
##  Mean   :41.17   Mean   : 1423   Mean   : 264   Mean   : 2.794  
##  3rd Qu.:49.00   3rd Qu.: 1480   3rd Qu.: 329   3rd Qu.: 3.000  
##  Max.   :87.00   Max.   :71188   Max.   :3025   Max.   :50.000  
##      pdays           previous      
##  Min.   : -1.00   Min.   : 0.0000  
##  1st Qu.: -1.00   1st Qu.: 0.0000  
##  Median : -1.00   Median : 0.0000  
##  Mean   : 39.77   Mean   : 0.5426  
##  3rd Qu.: -1.00   3rd Qu.: 0.0000  
##  Max.   :871.00   Max.   :25.0000

## [1] 0
