1 Executive Summary & Findings

This dataset captures a Portuguese bank’s phone campaigns: 4,521 rows and 17 columns. Seven fields are numeric (age, balance, day, duration, campaign, pdays, previous) and nine are categorical (job, marital, education, default, housing, loan, contact, month, poutcome). The target y indicates whether a client subscribed to a term deposit.

What matters most. The headline is imbalance and targeting. The target is skewed: about 88.5 percent “no” to 11.5 percent “yes.” If we do nothing, a model can “win” by predicting “no” every time and still score roughly 88.5 percent accuracy, which looks great and is useless. The real lift comes from campaign design variables (previous outcome, contact method, and timing), not simple demographics like age.

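Central tendencies and spread. Age centers on the late 30s to early 40s (median 39, mean 41.2); nothing wild there. Balance and duration are strongly right-skewed with long tails: balance has a median of 444 against a maximum of 71,188, and duration has a median of 185 against a maximum of 3,025. A handful of wealthy or very long-call cases will dominate a loss function if left untreated. That means we should log-transform duration and either winsorize balance or apply a signed log, because balance includes negative values (minimum -3,313) where a plain log is undefined. Campaign count is concentrated around 1–3 calls (median 2, third quartile 3); returns tend to diminish beyond that.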

Categorical conversion rates (where the signal lives). When we compute conversion rates by category, we see sharp differences:

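Previous outcome (poutcome) is the strongest single indicator. A prior “success” converts at 64.3 percent (83 of 129), against 9.1 percent when the previous outcome is “unknown.” This is textbook: past positive engagement signals readiness.

Contact method matters: cellular (14.4 percent) and telephone (14.6 percent) contacts convert at roughly three times the rate of the “unknown” channel (4.6 percent).

Month matters: conversion rates spike in October (46.2 percent), December (45.0 percent), and March (42.9 percent), although each of those months has only 20 to 80 contacts. The highest-volume month, May (1,398 contacts), converts at just 6.7 percent. Timing is a lever.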

Correlations are not the whole story. Numeric features have weak linear correlation with the target; that is normal for marketing data. The heavy lifting comes from categoricals and interactions (for example, job × month, contact × previous outcome). That is exactly why linear correlation heatmaps underwhelm here.

Missingness and duplicates. There are no formal NA values, but several fields use “unknown”. Treat those as explicit levels or flagged missing. This aligns with domain reality: not every client discloses job/education, and many have never been contacted before (pdays == -1). A duplicate-row check found none (see Appendix); any duplicates that show up in future extracts should be removed.

Algorithm selection. This is a supervised binary classification problem. Use Logistic Regression as a baseline for interpretability and stakeholder communication (exponentiated coefficients read as odds ratios). Use Gradient Boosting (or Random Forest) for performance, because it captures the nonlinearities and interactions that dominate marketing outcomes. If the dataset were under 1,000 rows, I would prefer simpler models (Logistic/Decision Tree) to control overfitting; at 4,521 rows, a boosted model is a reasonable choice.

Preprocessing playbook. Address class imbalance with SMOTE (applied to the training data only) or class weights. One-hot encode categoricals (job, education, contact, month, poutcome), optionally keeping “unknown” as its own level. Log-transform duration; winsorize balance or apply a signed log, because balance includes negative values; standardize numerics for linear models. Engineer high-value flags such as contacted_before (pdays != -1) and seasonal flags for high-conversion months. Evaluate with ROC-AUC and PR-AUC, not just accuracy.

Bottom line. To get more deposits, do not just dial more — dial smarter. Prioritize clients with prior positive outcomes, use the best contact channel, and time outreach in stronger months. Fix the imbalance, encode categoricals, transform skew, and use Logistic Regression for storytelling with Gradient Boosting for lift. That’s the play.

2 Exploratory Data Analysis

Structure and Shape

## [1] 4521   17

## 'data.frame':    4521 obs. of  17 variables:
##  $ age      : int  30 33 35 30 59 35 36 39 41 43 ...
##  $ job      : chr  "unemployed" "services" "management" "management" ...
##  $ marital  : chr  "married" "married" "single" "married" ...
##  $ education: chr  "primary" "secondary" "tertiary" "tertiary" ...
##  $ default  : chr  "no" "no" "no" "no" ...
##  $ balance  : int  1787 4789 1350 1476 0 747 307 147 221 -88 ...
##  $ housing  : chr  "no" "yes" "yes" "yes" ...
##  $ loan     : chr  "no" "yes" "no" "yes" ...
##  $ contact  : chr  "cellular" "cellular" "cellular" "unknown" ...
##  $ day      : int  19 11 16 3 5 23 14 6 14 17 ...
##  $ month    : chr  "oct" "may" "apr" "jun" ...
##  $ duration : int  79 220 185 199 226 141 341 151 57 313 ...
##  $ campaign : int  1 1 1 4 1 2 1 2 2 1 ...
##  $ pdays    : int  -1 339 330 -1 -1 176 330 -1 -1 147 ...
##  $ previous : int  0 4 1 0 0 3 2 0 0 2 ...
##  $ poutcome : chr  "unknown" "failure" "failure" "unknown" ...
##  $ y        : chr  "no" "no" "no" "no" ...

Bottom line. Mostly categorical features (10 of the 17 columns are character, including the target) with a few skewed numerics. Target is y (yes/no).

2.1 Central Tendency & Spread

Summary Statistics of Key Numeric Features
feature	Mean	Median	SD	Min	Max
age	41.17	39	10.58	19	87
balance	1422.66	444	3009.64	-3313	71188
campaign	2.79	2	3.11	1	50
duration	263.96	185	259.86	4	3025
pdays	39.77	-1	100.12	-1	871
previous	0.54	0	1.69	0	25

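Why it matters. Balance and duration are heavily right-skewed: both means sit well above the medians (1,423 vs 444 for balance; 264 vs 185 for duration). Outliers will hijack training unless we transform or trim them.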

2.2 Distributions

The Shape Of The Problem

Bottom line. Transform balance and duration; treat campaign’s heavy mass at 1–3 calls carefully.

2.3 Target Balance

Imbalance Check

Target Class Balance
level	n	prop
no	4000	0.885
yes	521	0.115

Bottom line. With 88.5 percent “no,” accuracy alone is a trap: always predicting “no” scores 88.5 percent while finding zero subscribers. Optimize precision and recall on “yes.”
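To put numbers on the trap, here is a minimal base-R sketch that scores an always-“no” classifier against the class counts in the Target Class Balance table. The helper name classification_metrics is illustrative and not part of the original analysis.

```r
# Class counts taken from the Target Class Balance table.
n_no  <- 4000
n_yes <- 521

# Hypothetical helper: basic metrics from confusion-matrix counts,
# treating "yes" as the positive class.
classification_metrics <- function(tp, fp, fn, tn) {
  # Precision and recall are undefined when their denominators are zero.
  precision <- if (tp + fp == 0) NA_real_ else tp / (tp + fp)
  recall    <- if (tp + fn == 0) NA_real_ else tp / (tp + fn)
  c(
    accuracy  = (tp + tn) / (tp + fp + fn + tn),
    precision = precision,
    recall    = recall
  )
}

# A model that always predicts "no": every "yes" becomes a false negative.
baseline <- classification_metrics(tp = 0, fp = 0, fn = n_yes, tn = n_no)
print(round(baseline, 3))
```

Accuracy comes out high while recall on “yes” is zero and precision is undefined, which is why the report leans on precision, recall, and the AUC metrics instead.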

2.4 Correlations (Numeric Only)

Bottom line. Weak linear correlations are expected; the real signal is in categoricals and interactions.

2.5 Categorical Conversion

Where The Signal Lives

Top Job Categories by Conversion Rate
job	yes	no	total	conversion_rate
retired	54	176	230	0.235
student	19	65	84	0.226
unknown	7	31	38	0.184
management	131	838	969	0.135
housemaid	14	98	112	0.125
admin.	58	420	478	0.121
self-employed	20	163	183	0.109
technician	83	685	768	0.108

Contact Method (Conversion Rate)
contact	yes	no	total	conversion_rate
telephone	44	257	301	0.146
cellular	416	2480	2896	0.144
unknown	61	1263	1324	0.046
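As a rough illustration of how a logistic baseline would read this table, the sketch below fits a binomial glm() directly on the aggregated counts above and exponentiates the coefficients into odds ratios relative to the “unknown” channel. This is a single-predictor sketch with assumed variable names (contact_counts, fit), not the full model.

```r
# Aggregated counts copied from the Contact Method table.
contact_counts <- data.frame(
  contact = c("telephone", "cellular", "unknown"),
  yes     = c(44, 416, 61),
  no      = c(257, 2480, 1263)
)

# Use "unknown" as the reference level so the other channels are compared to it.
contact_counts$contact <- relevel(factor(contact_counts$contact), ref = "unknown")

# Logistic regression on grouped data: the response is cbind(successes, failures).
fit <- glm(cbind(yes, no) ~ contact, family = binomial, data = contact_counts)

# Exponentiated coefficients (minus the intercept) are odds ratios vs "unknown".
odds_ratios <- exp(coef(fit))[-1]
print(round(odds_ratios, 2))
```

Because the model has one parameter per channel, the odds ratios reproduce what can be computed by hand from the table, for example (416/2480) / (61/1263) for cellular.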

Previous Outcome (Conversion Rate)
poutcome	yes	no	total	conversion_rate
success	83	46	129	0.643
other	38	159	197	0.193
failure	63	427	490	0.129
unknown	337	3368	3705	0.091
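One way to express these rates for targeting is lift: each level’s conversion rate divided by the overall rate. The sketch below recomputes the table from its counts in base R; the lift column is an addition for illustration, not something the original analysis reports.

```r
# Counts copied from the Previous Outcome table.
poutcome <- data.frame(
  level = c("success", "other", "failure", "unknown"),
  yes   = c(83, 38, 63, 337),
  no    = c(46, 159, 427, 3368)
)

poutcome$total           <- poutcome$yes + poutcome$no
poutcome$conversion_rate <- poutcome$yes / poutcome$total

# Overall conversion rate across all 4,521 clients.
overall_rate <- sum(poutcome$yes) / sum(poutcome$total)

# Lift > 1 means the segment converts better than the average client.
poutcome$lift <- poutcome$conversion_rate / overall_rate

print(round(overall_rate, 3))
print(poutcome)
```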

Conversion by Month
month	yes	no	total	conversion_rate
oct	37	43	80	0.462
dec	9	11	20	0.450
mar	21	28	49	0.429
sep	17	35	52	0.327
apr	56	237	293	0.191
feb	38	184	222	0.171
aug	79	554	633	0.125
jan	16	132	148	0.108
jun	55	476	531	0.104
nov	39	350	389	0.100
jul	61	645	706	0.086
may	93	1305	1398	0.067

Bottom line. Prior positive outcome, contact method, and timing are the levers. That is your targeting strategy.
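The top months in the table rest on small counts, so it helps to attach uncertainty before building a calendar around them. The sketch below uses base R’s binom.test() to get exact (Clopper-Pearson) 95 percent intervals for a few months; the choice of interval method is an assumption, not something the original analysis specifies.

```r
# Counts copied from the Conversion by Month table (three top months plus May).
months <- data.frame(
  month = c("oct", "dec", "mar", "may"),
  yes   = c(37, 9, 21, 93),
  total = c(80, 20, 49, 1398)
)

# Exact binomial 95% confidence interval for each month's conversion rate.
ci <- t(mapply(
  function(x, n) binom.test(x, n)$conf.int,
  months$yes, months$total
))

months$rate  <- months$yes / months$total
months$lower <- ci[, 1]
months$upper <- ci[, 2]

print(months)
```

The intervals for the low-volume months are wide, but they still sit well above May’s, so the timing signal looks real even if the exact rates are uncertain.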

2.6 Missingness: The “Unknown” Game

Share of ‘unknown’ by Variable (%)
variable	pct_unknown
poutcome	82.0
contact	29.3
education	4.1
job	0.8
age	0.0
marital	0.0
default	0.0
balance	0.0
housing	0.0
loan	0.0
day	0.0
month	0.0
duration	0.0
campaign	0.0
pdays	0.0
previous	0.0
y	0.0

Bottom line. Keep “unknown” as an explicit level or flag it. Do not silently drop it.
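A minimal sketch of both options, using the first four rows shown in the str() output as a toy data frame (the full data is not reproduced here). The names toy and contact_unknown are illustrative.

```r
# Toy rows copied from the str() preview; not the full dataset.
toy <- data.frame(
  job       = c("unemployed", "services", "management", "management"),
  education = c("primary", "secondary", "tertiary", "tertiary"),
  contact   = c("cellular", "cellular", "cellular", "unknown"),
  poutcome  = c("unknown", "failure", "failure", "unknown"),
  stringsAsFactors = FALSE
)

# Share of "unknown" per column, in percent.
pct_unknown <- sapply(toy, function(col) 100 * mean(col == "unknown"))
print(pct_unknown)

# Option 1: add an explicit missing-value flag.
toy$contact_unknown <- as.integer(toy$contact == "unknown")

# Option 2: keep "unknown" as its own factor level (it is retained by default).
toy$contact <- factor(toy$contact)
print(levels(toy$contact))
```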

3 Outliers: Trim Or Transform?

Bottom line. Log duration; winsorize balance or use a signed log (balance has negative values, so a plain log is undefined); be cautious about values beyond the 95th–99th percentiles.
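The transforms can be sketched in a few lines of base R. The helpers signed_log1p and winsorize are hypothetical names, log1p is used in place of a plain log as a convenience, and the values are the first ten balance and duration entries from the str() preview rather than the full columns.

```r
# Signed log: keeps the sign, compresses the magnitude, and is defined at 0
# and for negative balances (an assumption about how to "log" balance).
signed_log1p <- function(x) sign(x) * log1p(abs(x))

# Winsorize: clamp values to the chosen lower/upper quantiles.
winsorize <- function(x, lower = 0.01, upper = 0.99) {
  bounds <- quantile(x, probs = c(lower, upper), na.rm = TRUE, names = FALSE)
  pmin(pmax(x, bounds[1]), bounds[2])
}

# First ten values from the str() preview; not the full dataset.
balance  <- c(1787, 4789, 1350, 1476, 0, 747, 307, 147, 221, -88)
duration <- c(79, 220, 185, 199, 226, 141, 341, 151, 57, 313)

log_duration <- log1p(duration)           # duration is strictly positive
balance_slog <- signed_log1p(balance)     # safe for the negative balance
balance_wins <- winsorize(balance, 0.05, 0.95)

print(round(balance_slog, 2))
print(balance_wins)
```

On real data the quantile bounds would be estimated on the training split and then reused on validation and test data.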

3.1 Duplicates & Consistency Check

Data Quality Check
Metric	Value
Duplicate rows	0

Bottom line. No duplicate rows were found. If any appear in future extracts, keep one distinct record and remove the rest.
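A check of this kind can be sketched with base R’s duplicated(). The toy data frame below reuses three rows from the str() preview and deliberately repeats one of them so there is something to find; it is not the original check.

```r
# Three rows from the str() preview, with the first row repeated on purpose.
toy <- data.frame(
  age     = c(30, 33, 35),
  job     = c("unemployed", "services", "management"),
  balance = c(1787, 4789, 1350)
)
toy <- rbind(toy, toy[1, ])

# duplicated() marks every repeat of an earlier row across all columns.
n_duplicates <- sum(duplicated(toy))
print(n_duplicates)

# Keep the first occurrence of each distinct record.
deduped <- toy[!duplicated(toy), ]
print(nrow(deduped))
```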

4 Modeling Strategy: What Fits The Business

Logistic Regression: interpretable coefficients for stakeholders; great baseline; needs one-hot encoding and scaling.

Gradient Boosting / Random Forest: stronger predictive performance; handles nonlinearities and interactions; less transparent; use SHAP for explainability.

Small data (< 1,000 rows): prefer simpler models (Logistic/Decision Tree) to avoid overfitting.

Recommendation: baseline with Logistic Regression for storytelling; deploy Gradient Boosting for lift. Evaluate with ROC-AUC and PR-AUC, not plain accuracy.
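For reference, both metrics can be computed from predicted scores in a few lines of base R. This is a sketch with hypothetical helper names: roc_auc uses the rank (Mann-Whitney) formulation, and average_precision is a common estimate of PR-AUC that ignores tied scores. The labels and scores are made-up toy values, not model output.

```r
# ROC-AUC via ranks: probability that a random positive outscores a random negative.
roc_auc <- function(labels, scores) {
  pos   <- labels == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  (sum(rank(scores)[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Average precision: mean of precision@k taken at each true positive,
# after sorting by descending score (ties are not handled specially).
average_precision <- function(labels, scores) {
  hits <- labels[order(scores, decreasing = TRUE)]
  precision_at_k <- cumsum(hits) / seq_along(hits)
  sum(precision_at_k[hits == 1]) / sum(hits)
}

# Toy example: 1 = "yes", 0 = "no".
labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)

print(roc_auc(labels, scores))
print(average_precision(labels, scores))
```

With a rare positive class like this one, average precision tends to be the more demanding of the two, because it is anchored to the “yes” rate rather than to the abundant “no” cases.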

4.1 Preprocessing Plan: Execution Playbook

Imbalance: SMOTE or class weights; report ROC-AUC and PR-AUC.

Categoricals: one-hot encode job, education, contact, month, poutcome; keep “unknown” explicit or add missing flags.

Numerics: log-transform duration; winsorize balance or apply a signed log (balance has negative values); standardize for linear models.

Features: contacted_before (pdays != -1); seasonal flags for high-conversion months; optional interactions.

Sanity: drop duplicates; trim extreme values where appropriate.
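The encoding, feature, and weighting steps above can be sketched in base R as follows. The toy rows come from the str() preview, the high-season months are taken from the month table, the “balanced” weighting formula is one common convention rather than the report’s stated choice, and all object names are illustrative. The flag uses the -1 sentinel, which matches pdays > 0 on these rows.

```r
# Toy rows copied from the str() preview; not the full dataset.
toy <- data.frame(
  contact = c("cellular", "cellular", "cellular", "unknown"),
  month   = c("oct", "may", "apr", "jun"),
  pdays   = c(-1, 339, 330, -1),
  stringsAsFactors = FALSE
)

# Feature flags: prior contact (pdays of -1 means never contacted) and
# whether the call fell in one of the top-converting months.
toy$contacted_before <- as.integer(toy$pdays != -1)
toy$high_season      <- as.integer(toy$month %in% c("oct", "dec", "mar"))

# Dummy (treatment) coding: one column per level minus a reference level,
# which is the usual form for logistic regression. "unknown" stays a level.
X <- model.matrix(~ contact + month, data = toy)
print(colnames(X))

# "Balanced" class weights from the target counts, so that each class
# contributes equally in total to a weighted loss.
n_no  <- 4000
n_yes <- 521
n     <- n_no + n_yes
class_weights <- c(no = n / (2 * n_no), yes = n / (2 * n_yes))
print(round(class_weights, 3))
```

In a weighted glm() fit, these would be mapped to a per-row weights vector based on each row’s y value.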

5 Conclusion

If the goal is more deposits, do not just dial more — dial smarter. Prioritize people with prior positive outcomes, use the right contact method, and time the calls when conversion is historically stronger. Fix the imbalance, encode categoricals, transform skewed numerics, and pair Logistic Regression (explainability) with Gradient Boosting (performance). That is the plan I would ship.

6 Appendix: R Full Code

##       age           balance         duration       campaign     
##  Min.   :19.00   Min.   :-3313   Min.   :   4   Min.   : 1.000  
##  1st Qu.:33.00   1st Qu.:   69   1st Qu.: 104   1st Qu.: 1.000  
##  Median :39.00   Median :  444   Median : 185   Median : 2.000  
##  Mean   :41.17   Mean   : 1423   Mean   : 264   Mean   : 2.794  
##  3rd Qu.:49.00   3rd Qu.: 1480   3rd Qu.: 329   3rd Qu.: 3.000  
##  Max.   :87.00   Max.   :71188   Max.   :3025   Max.   :50.000  
##      pdays           previous      
##  Min.   : -1.00   Min.   : 0.0000  
##  1st Qu.: -1.00   1st Qu.: 0.0000  
##  Median : -1.00   Median : 0.0000  
##  Mean   : 39.77   Mean   : 0.5426  
##  3rd Qu.: -1.00   3rd Qu.: 0.0000  
##  Max.   :871.00   Max.   :25.0000

## [1] 0


Bank Marketing Campaign EDA

What really drives term deposit sign-ups?

Sheriann McLarty

October 06, 2025