The Bank Marketing dataset from the UCI Machine Learning Repository has data from the telemarketing campaigns of a Portuguese bank. It includes client demographics, business data, and marketing contact data. The variable y indicates whether or not a client subscribed to a term deposit (yes or no).
This summary starts with an exploratory description of the dataset: its structure, distributions, correlations, and possible problems such as outliers or class imbalance. It then compares candidate algorithms and closes with the pre-processing steps needed before machine learning.
For this assignment, the dataset is acquired from the UCI Machine Learning Repository. Because I am using Python, I load it with the ucimlrepo package.
Exploratory Data Analysis
To begin, I fetch the Bank Marketing dataset directly from the UCI Machine Learning Repository using the ucimlrepo package.
This ensures reproducibility and keeps features (X) and the target (y) clearly separated.
from ucimlrepo import fetch_ucirepo
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Fetch the Bank Marketing dataset (UCI id 222)
bank = fetch_ucirepo(id=222)
X = bank.data.features
y = bank.data.targets

# Combine features and target into one frame for exploration
df = pd.concat([X, y], axis=1)
print("Shape:", df.shape)
df.head()
Shape: (45211, 17)
   age           job  marital  education  default  balance  housing  loan  contact  day_of_week  month  duration  campaign  pdays  previous  poutcome   y
0   58    management  married   tertiary       no     2143      yes    no      NaN            5    may       261         1     -1         0       NaN  no
1   44    technician   single  secondary       no       29      yes    no      NaN            5    may       151         1     -1         0       NaN  no
2   33  entrepreneur  married  secondary       no        2      yes   yes      NaN            5    may        76         1     -1         0       NaN  no
3   47   blue-collar  married        NaN       no     1506      yes    no      NaN            5    may        92         1     -1         0       NaN  no
4   33           NaN   single        NaN       no        1       no    no      NaN            5    may       198         1     -1         0       NaN  no
The dataset contains 45,211 rows and 17 columns.
It has a mix of numerical variables (e.g., age, balance, duration) and categorical variables (e.g., job, marital, education).
The target variable is binary (y = yes/no), confirming this is a supervised classification problem.
1. Correlation of Features
plt.figure(figsize=(12,8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Matrix of Numerical Features")
plt.show()
Low correlations overall: Most numerical features show weak correlations (close to 0).
Age and Balance: Slight positive correlation (~0.10). Older clients tend to have slightly higher balances.
Campaign and Day of Week: Mild correlation (~0.16). Contact frequency may depend somewhat on the day of the month on which the contact was made (the day_of_week column holds values from 1 to 31).
Pdays and Previous: Strongest correlation (~0.45). This makes sense since both relate to prior marketing contacts.
Duration: Almost no correlation with other variables. In practice duration is highly predictive of the target outcome (y).
Overall, multicollinearity is low. Most features are fairly independent, which is good for model training. The correlation between pdays and previous suggests redundancy that may require dimensionality reduction or careful feature selection.
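To read the strongest pairs off the correlation matrix without scanning the heatmap by eye, a small helper along these lines may be useful. This is a minimal sketch: the function name correlated_pairs, the 0.4 threshold, and the toy values are assumptions made for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd


def correlated_pairs(frame: pd.DataFrame, threshold: float = 0.4) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Toy data standing in for df (values are illustrative, not taken from the dataset)
toy = pd.DataFrame({
    "pdays": [-1, -1, 90, 180, 365, -1],
    "previous": [0, 0, 2, 3, 6, 0],
    "age": [30, 52, 41, 28, 47, 39],
})
print(correlated_pairs(toy))
```

On the real frame, calling correlated_pairs(df) should surface the pdays and previous pair discussed above, assuming the heatmap values are as reported.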
2. Distribution of Variables
df.hist(figsize=(12,10), bins=30)
plt.suptitle("Distributions of Numerical Variables")
plt.show()
Age: Roughly bell-shaped with most clients between ages 30–50, tapering off after 60. Few clients are very young (under 20) or very old (over 90).
Balance: Highly right-skewed. Most clients have modest or negative balances, but a few have very large positive balances (>50,000).
Day of Week: Fairly uniform across its 1 to 31 range (despite the name, the column records the day of the month), with slightly higher contact frequencies on certain days.
Duration: Strong right skew. Most calls are under 500 seconds, but a long tail includes calls lasting several thousand seconds.
Campaign (number of contacts): Heavily skewed. Most clients received 1–3 calls, but some received 50+.
Pdays (days since last contact): Large spike at -1 (meaning “not previously contacted”). Those who were contacted before show widely varying intervals, sometimes >800 days.
Previous: Most clients had 0 previous contacts, though some had dozens, with outliers exceeding 200.
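The right skew described above can be quantified with a skewness statistic, and a log transform is one common way to reduce it. The sketch below uses made-up duration values, not the real column. A plain log1p would not apply directly to balance, because that column contains negative values.

```python
import numpy as np
import pandas as pd

# Illustrative call durations in seconds with one very long call
duration = pd.Series(
    [30, 60, 90, 120, 180, 240, 300, 600, 1500, 4918], name="duration"
)

raw_skew = duration.skew()            # sample skewness of the raw values
log_skew = np.log1p(duration).skew()  # skewness after a log(1 + x) transform

print(f"raw skew: {raw_skew:.2f}, log1p skew: {log_skew:.2f}")
```

On the real data the same two lines could be applied to df["duration"] or df["campaign"] to check whether the transform is worthwhile.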
3. Outliers
num_cols = df.select_dtypes(include="number").columns
for col in num_cols:
    plt.figure(figsize=(6,3))
    sns.boxplot(x=df[col])
    plt.title(f"Boxplot: {col}")
    plt.show()
Age: Most clients are between 30–60 years old, but a few outliers exist above age 80.
Balance: Extremely skewed with many outliers. Some clients have balances over 100,000, while many are clustered near zero or even negative.
Duration: Call durations show many long outliers (up to ~5000 seconds), suggesting some unusually lengthy calls.
Campaign: Most clients received 1–3 contacts, but outliers received more than 50 calls, which is not typical.
Pdays (days since last contact): Distribution is heavily skewed. -1 indicates “never contacted before.” Many large outliers (500+ days).
Previous: Most clients had zero previous contacts, but a few were contacted over 200 times, which is extreme.
Overall, the dataset contains many outliers, especially in financial (balance) and campaign-related variables. These will need to be handled carefully through winsorizing, transformation, or by keeping them if they represent real business cases.
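As a rough illustration of the options mentioned above, the following sketch counts outliers with the boxplot (1.5 × IQR) rule and winsorizes a column by clipping it to chosen quantiles. The helper names, the 1%/99% cut-offs, and the sample balances are assumptions for demonstration only.

```python
import pandas as pd


def iqr_outlier_count(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] (the usual boxplot rule)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    outside = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return int(outside.sum())


def winsorize(series: pd.Series, lower_q: float = 0.01, upper_q: float = 0.99) -> pd.Series:
    """Clip values to the given lower and upper quantiles."""
    return series.clip(lower=series.quantile(lower_q), upper=series.quantile(upper_q))


# Illustrative balances with one extreme value
balance = pd.Series([-200, 0, 50, 120, 300, 450, 900, 1400, 2100, 102127])
print("IQR outliers:", iqr_outlier_count(balance))
print("max before/after:", balance.max(), winsorize(balance).max())
```

Whether to clip at all remains a modeling decision: tree-based models are usually less affected by extreme values than linear models.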
4. Relationships Between Variables
plt.figure(figsize=(6,4))
sns.boxplot(x=df["y"], y=df["duration"])
plt.title("Call Duration by Subscription Outcome")
plt.show()

plt.figure(figsize=(6,4))
sns.boxplot(x=df["y"], y=df["campaign"])
plt.title("Campaign Contacts by Subscription Outcome")
plt.show()
Call Duration vs. Subscription: Clients who subscribed (yes) generally had longer call durations compared to those who did not (no). This indicates that call length is strongly associated with campaign success. Outliers show some very long calls, but the overall trend suggests longer conversations may help conversions.
Campaign Contacts vs. Subscription: Both subscribed and non-subscribed groups show most clients received only a small number of contacts. However, increasing the number of contacts beyond a few attempts does not significantly improve subscription rates and instead introduces outliers (some clients received over 50 calls). This suggests diminishing returns for repeated calls.
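Together, these results suggest that the quality of the interaction (longer calls) is more strongly associated with subscription than the quantity of contacts. Because duration is only known once a call has ended, this is an association rather than something that can be set in advance.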
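The diminishing-returns reading of the campaign boxplot could be checked more directly by computing the subscription rate within bins of contact counts. The sketch below does this on a small invented frame; the bin edges and labels are arbitrary choices for illustration.

```python
import pandas as pd

# Toy stand-in for df[["campaign", "y"]]; values are illustrative only
toy = pd.DataFrame({
    "campaign": [1, 1, 2, 2, 3, 4, 6, 8, 12, 30],
    "y": ["yes", "no", "yes", "no", "no", "no", "no", "yes", "no", "no"],
})

# Right-closed bins: (0, 1], (1, 3], (3, 5], (5, 10], (10, inf)
bins = [0, 1, 3, 5, 10, float("inf")]
labels = ["1", "2-3", "4-5", "6-10", "11+"]
toy["campaign_bin"] = pd.cut(toy["campaign"], bins=bins, labels=labels)

# Share of "yes" and number of clients in each bin
rate_by_bin = (
    toy.assign(subscribed=toy["y"].eq("yes"))
       .groupby("campaign_bin", observed=False)["subscribed"]
       .agg(["mean", "size"])
)
print(rate_by_bin)
```

Reporting the bin size next to the rate helps avoid over-reading rates from bins that contain only a handful of clients.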
5. Distribution of Categorical Variables
cat_cols = df.select_dtypes(include="object").columns
for col in cat_cols:
    plt.figure(figsize=(6,4))
    sns.countplot(y=col, data=df, order=df[col].value_counts().index)
    plt.title(f"Distribution of {col}")
    plt.tight_layout()
    plt.show()
Job: The largest groups are blue-collar, management, and technician. Smaller groups like students, housemaids, and unemployed are underrepresented, but these may still hold important patterns.
Marital: The majority of clients are married, followed by single and divorced.
Education: Most clients have secondary education, with tertiary next and primary being least common.
Default: Very few clients have credit defaults. Most are marked no.
Housing Loan: A majority of clients have housing loans, though a significant portion does not.
Personal Loan: Far fewer clients have personal loans compared to those without.
Contact Method: Most contacts are made via cellular phones, with fewer via landline telephone.
Month: Campaign activity peaks in May, with significant activity in July and August. Other months show smaller proportions.
Previous Outcome (poutcome): Among clients with a recorded previous outcome, failure is the most common value, followed by other and then success. Most rows have no recorded value for this column.
Target Variable (y): The outcome is heavily imbalanced. The majority did not subscribe, while only about 11–12% subscribed.
These distributions highlight where the dataset is imbalanced (for example y, default, contact) and where categories are dominant (for example married, blue-collar). This will influence preprocessing and algorithm selection.
The cross-tabulation shows the proportion of clients who subscribed (yes) by job category.
Students have the highest subscription rate at ~28.7%, followed by retired clients at ~22.8%.
Groups like unemployed (~15.5%) and management (~13.8%) show moderate subscription rates.
Administrative roles have lower subscription rates (~12.2%), despite being one of the largest job categories overall.
This suggests that demographic characteristics strongly influence marketing outcomes.
Younger (students) and older (retired) clients tend to be more responsive, while mid-career groups like admin and management are less likely to subscribe.
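A cross-tabulation of this kind can be produced with pandas.crosstab, normalizing by row so that each job category sums to 1. The code below is a sketch on a tiny invented frame, so its rates do not match the percentages quoted above.

```python
import pandas as pd

# Toy stand-in for df[["job", "y"]]; values are illustrative only
toy = pd.DataFrame({
    "job": ["student", "student", "student",
            "retired", "retired",
            "admin.", "admin.", "admin.", "admin."],
    "y":   ["yes", "yes", "no",
            "yes", "no",
            "no", "no", "no", "yes"],
})

# normalize="index" turns counts into within-job proportions
job_rates = (
    pd.crosstab(toy["job"], toy["y"], normalize="index")["yes"]
      .sort_values(ascending=False)
)
print(job_rates)
```

Replacing toy with df would give the per-job subscription rates for the full dataset.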
7. Central Tendency and Spread
df.describe()
                age        balance    day_of_week       duration       campaign          pdays       previous
count  45211.000000   45211.000000   45211.000000   45211.000000   45211.000000   45211.000000   45211.000000
mean      40.936210    1362.272058      15.806419     258.163080       2.763841      40.197828       0.580323
std       10.618762    3044.765829       8.322476     257.527812       3.098021     100.128746       2.303441
min       18.000000   -8019.000000       1.000000       0.000000       1.000000      -1.000000       0.000000
25%       33.000000      72.000000       8.000000     103.000000       1.000000      -1.000000       0.000000
50%       39.000000     448.000000      16.000000     180.000000       2.000000      -1.000000       0.000000
75%       48.000000    1428.000000      21.000000     319.000000       3.000000      -1.000000       0.000000
max       95.000000  102127.000000      31.000000    4918.000000      63.000000     871.000000     275.000000
Age: The mean is about 41 years with a standard deviation of ~11, showing moderate variability. The age range spans from 18 to 95.
Balance: The mean is ~1362, but the standard deviation is over 3000, and the minimum is negative (-8019). The median (448) is much lower than the mean, confirming a strong right-skew caused by extreme outliers.
Duration: The average call length is ~258 seconds, with a wide spread (std ~257). The maximum of ~4918 seconds indicates a few very long calls that may bias the distribution.
Campaign: Most clients were contacted very few times (median = 2), though some were contacted as many as 63 times. This long tail reflects a small number of clients who were called repeatedly.
pdays: Many entries have -1, which indicates “not previously contacted.” This variable requires careful handling, as it mixes valid values with a special code.
Previous: The median is 0, meaning most clients had no prior contacts. A few clients had up to 275 previous contacts, representing extreme cases.
Overall, the descriptive statistics confirm skewness and the presence of extreme outliers in financial and campaign-related variables. Median values often provide a more reliable summary than means in this dataset.
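One way to handle the special -1 code in pdays is to split it into a yes/no flag plus a days column that is missing when there was no earlier contact. This is a sketch under that assumption; the new column names are hypothetical and the values are invented.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df[["pdays"]]; -1 means "not previously contacted"
toy = pd.DataFrame({"pdays": [-1, -1, 92, 871, -1, 5]})

# Flag whether the client was contacted in an earlier campaign
toy["previously_contacted"] = (toy["pdays"] != -1).astype(int)

# Keep real day counts and replace the sentinel with NaN
toy["days_since_contact"] = toy["pdays"].where(toy["pdays"] != -1, np.nan)

# Summary statistics now ignore the sentinel
print(toy["days_since_contact"].median())
```

With the sentinel removed, the mean and median of days_since_contact describe only clients who were actually contacted before.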
The dataset does contain missing values in several categorical features:
job: 288 missing
education: 1,857 missing
contact: 13,020 missing
poutcome: 36,959 missing
The variables most affected are contact and poutcome, which are campaign-related attributes.
Since the majority of values in poutcome are missing, this feature may have limited predictive value and could be excluded or grouped into a category such as "unknown".
For features like job and education, missing values can be treated as "unknown" categories so they remain usable during modeling.
Numerical variables (age, balance, duration, etc.) show no missing data, which simplifies preprocessing.
Overall, while missing data is present, it is concentrated in categorical features and can be handled by either encoding "unknown" as a valid category or by dropping features with excessive missingness (like poutcome).
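The missing-value counts and the "unknown" encoding described above could be produced along these lines. The frame below is a small invented example, and listing the categorical columns by hand is a choice made for clarity.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; values are illustrative only
toy = pd.DataFrame({
    "job": ["management", np.nan, "technician", "blue-collar"],
    "education": ["tertiary", "secondary", np.nan, np.nan],
    "poutcome": [np.nan, np.nan, np.nan, "failure"],
    "age": [58, 44, 33, 47],
})

# Count and share of missing values per column
missing_count = toy.isna().sum()
missing_share = toy.isna().mean()
print(pd.DataFrame({"missing": missing_count, "share": missing_share}))

# Treat missing categories as an explicit "unknown" level
cat_cols = ["job", "education", "poutcome"]
clean = toy.copy()
clean[cat_cols] = clean[cat_cols].fillna("unknown")
```

The share column makes it easy to apply a rule such as reviewing any feature that is missing in more than half of the rows.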
Algorithm Selection
Based on the exploratory data analysis, the Bank Marketing dataset is a supervised classification problem. The target variable y indicates whether a client subscribed to a term deposit (yes or no). The outcome is imbalanced, with most clients labeled as no. This impacts algorithm choice since models must handle both classification and imbalance.
Candidate Algorithms
1. Logistic Regression
Pros:
- Simple and interpretable.
- Provides probability estimates for class membership.
- Works well when relationships between predictors and outcome are approximately linear.
- Computationally efficient for large datasets.
Cons:
- Sensitive to outliers and multicollinearity.
- Struggles with complex, non-linear relationships.
- Requires preprocessing such as scaling and dummy variables for categorical features.
2. Decision Trees / Random Forest
Pros:
- Can model complex, non-linear interactions.
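- Handles both categorical and numerical variables with little preparation (scikit-learn's implementation still requires categorical features to be encoded as numbers).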
- Robust to outliers.
- Random Forest provides better generalization and reduces overfitting compared to a single decision tree.
Cons:
- Less interpretable than logistic regression.
- Random Forest can be computationally expensive with very large datasets.
- May still be biased toward the majority class in imbalanced datasets unless class weighting or resampling is applied.
Recommended Algorithm
For this dataset, I recommend Random Forest. The EDA revealed:
- Several categorical variables, some with many levels (e.g., job, month).
- Strong non-linear patterns (e.g., call duration).
- Presence of outliers and skewed distributions.
Random Forest can handle these complexities better than logistic regression, while also reducing overfitting through ensembling. Logistic regression could still serve as a baseline due to its interpretability.
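The comparison between a logistic regression baseline and a Random Forest could be set up roughly as follows. This sketch uses synthetic data from make_classification with about 12% positives in place of the encoded bank data, so the scores it prints say nothing about the real dataset. The hyperparameters are illustrative defaults, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded features: roughly 12% positive class
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.88, 0.12], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

models = {
    # Baseline: scaled inputs, class weights to counter the imbalance
    "logistic_regression": make_pipeline(
        StandardScaler(),
        LogisticRegression(class_weight="balanced", max_iter=1000),
    ),
    # Candidate: tree ensemble, also with balanced class weights
    "random_forest": RandomForestClassifier(
        n_estimators=200, class_weight="balanced", random_state=0
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = balanced_accuracy_score(y_test, model.predict(X_test))

print(scores)
```

Balanced accuracy is used here because plain accuracy would reward a model that always predicts the majority class.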
Impact of Labels
Since the target variable y is clearly labeled, the dataset is suitable for supervised classification. The imbalance in labels (yes ~12%, no ~88%) means that algorithms need either class weighting, SMOTE, or resampling strategies to avoid predicting mostly no.
Dataset Size Consideration
With more than 40,000 records, Random Forest is a strong choice. If the dataset had fewer than 1,000 records, I would recommend logistic regression because:
- Smaller datasets benefit from simpler models with fewer parameters.
- Logistic regression is less prone to overfitting on small data.
- Interpretability remains valuable when sample size is limited.
Pre-processing
Before applying machine learning models to the Bank Marketing dataset, several pre-processing steps are needed to improve data quality and model performance. These steps align with the issues revealed during exploratory data analysis.
Data Cleaning
Handle missing values in job, education, and contact by encoding them as "unknown" so information is not lost.
Consider dropping or grouping poutcome, which has more than 80% missing values and may therefore have limited predictive power.
Ensure numerical features such as balance, duration, and pdays are treated consistently, especially when extreme outliers exist.
Correlation analysis shows that pdays and previous are moderately correlated. One option is to combine them into a single indicator of prior marketing activity.
Principal Component Analysis (PCA) could be explored if computational efficiency becomes an issue, but with 17 variables, feature selection is likely more practical than full PCA.
Simplify categorical variables before encoding them. For example, group rare jobs (housemaid, unemployed) into an "other" category to reduce noise.
Create threshold features such as long_call = duration > 300, since longer calls are strongly associated with positive outcomes.
Temporal features such as month could be transformed into seasonal groups (e.g., spring vs. summer) to capture patterns in campaign timing.
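# Flag calls longer than 300 seconds
df['long_call'] = (df['duration'] > 300).astype(int)

# Group small job categories into "other". Missing jobs are NaN in the raw
# data, so they are filled with "unknown" before the membership test.
# Note: students had the highest subscription rate in the EDA, so merging
# them into "other" may hide useful signal.
rare_jobs = ['housemaid', 'unemployed', 'student', 'unknown']
df['job_simplified'] = df['job'].fillna('unknown').apply(
    lambda x: x if x not in rare_jobs else 'other'
)

# Map month abbreviations to seasons (jan and feb added to winter so that
# they are not labeled "unknown")
spring = ['mar', 'apr', 'may']
summer = ['jun', 'jul', 'aug']
fall = ['sep', 'oct', 'nov']
winter = ['dec', 'jan', 'feb']

def season_from_month(m):
    if m in spring:
        return 'spring'
    elif m in summer:
        return 'summer'
    elif m in fall:
        return 'fall'
    elif m in winter:
        return 'winter'
    else:
        return 'unknown'

df['season'] = df['month'].apply(season_from_month)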
Sampling Data
With more than 40,000 records, downsampling is not required for efficiency.
Sampling techniques may still be useful for balancing the dataset (see below).
If cross-validation is used, stratified sampling should ensure both yes and no classes are represented proportionally in training and test folds.
Processed training shape: (36168, 63)
Processed test shape: (9043, 63)
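The row counts above appear consistent with an 80/20 train/test split of the 45,211 records. A stratified split and stratified cross-validation folds could be sketched as follows. The toy labels, the random seed, and the five-fold choice are assumptions, and the 63 encoded columns are not reproduced here.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Toy labels with 12% positives (1 = "yes"); the single feature is a placeholder
y_demo = np.array([1] * 120 + [0] * 880)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the class ratio the same in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)
print("train positive rate:", y_train.mean())
print("test positive rate:", y_test.mean())

# Stratified folds keep the ratio roughly constant in each validation fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_rates = [
    y_train[val_idx].mean() for _, val_idx in skf.split(X_train, y_train)
]
print("validation positive rate per fold:", fold_rates)
```

Any resampling for class balance would normally be applied to the training folds only, so that the test data keeps the original class ratio.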
Imbalanced Data
The dataset is imbalanced with only about 12% of clients subscribing.
To address this imbalance, use one or more of the following:
Class weights to penalize misclassification of the minority class.
Oversampling methods such as SMOTE to generate synthetic positive cases.
Undersampling of the majority class to balance proportions.
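Two of the options above, class weights and random undersampling, can be illustrated with pandas alone. The sketch below uses an invented 88/12 frame. The weight formula is the common "balanced" heuristic, n_samples / (n_classes × class_count). SMOTE is not shown because it relies on the separate imbalanced-learn package.

```python
import pandas as pd

# Toy frame with the same 88/12 class ratio as the target; x is a placeholder
toy = pd.DataFrame({"y": ["no"] * 88 + ["yes"] * 12, "x": range(100)})

counts = toy["y"].value_counts()

# "Balanced" class weights: rarer classes receive larger weights
class_weight = (len(toy) / (len(counts) * counts)).to_dict()
print(class_weight)

# Random undersampling: keep as many rows per class as the minority class has
minority_n = int(counts.min())
undersampled = toy.groupby("y").sample(n=minority_n, random_state=0)
print(undersampled["y"].value_counts())
```

Class weights keep every row, whereas undersampling discards most of the majority class, which matters more when the dataset is small.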
Overall, pre-processing ensures the dataset is clean, balanced, and structured for both interpretable models such as logistic regression and more complex models such as Random Forest.