Data622 - Assignment 1: Exploratory Data Analysis

Author

Anthony Josue Roman

Introduction

The Bank Marketing dataset from the UCI Machine Learning Repository has data from the telemarketing campaigns of a Portuguese bank. It includes client demographics, business data, and marketing contact data. The variable y indicates whether or not a client subscribed to a term deposit (yes or no).

This report begins with an exploratory description of the dataset: its structure, distributions, correlations, and potential problems such as outliers and class imbalance. It then discusses algorithm selection and outlines the pre-processing steps the dataset needs before machine learning.

For this assignment, the dataset is acquired from the UCI Machine Learning Repository. Since I am using Python, I load it with the ucimlrepo package.

Exploratory Data Analysis

To begin, I fetch the Bank Marketing dataset directly from the UCI Machine Learning Repository using the ucimlrepo package.
This ensures reproducibility and keeps features (X) and the target (y) clearly separated.

from ucimlrepo import fetch_ucirepo
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

bank = fetch_ucirepo(id=222)

X = bank.data.features
y = bank.data.targets

df = pd.concat([X, y], axis=1)

print("Shape:", df.shape)
df.head()

Shape: (45211, 17)

	age	job	marital	education	default	balance	housing	loan	contact	day_of_week	month	duration	campaign	pdays	previous	poutcome	y
0	58	management	married	tertiary	no	2143	yes	no	NaN	5	may	261	1	-1	0	NaN	no
1	44	technician	single	secondary	no	29	yes	no	NaN	5	may	151	1	-1	0	NaN	no
2	33	entrepreneur	married	secondary	no	2	yes	yes	NaN	5	may	76	1	-1	0	NaN	no
3	47	blue-collar	married	NaN	no	1506	yes	no	NaN	5	may	92	1	-1	0	NaN	no
4	33	NaN	single	NaN	no	1	no	no	NaN	5	may	198	1	-1	0	NaN	no

The dataset contains 45,211 rows and 17 columns (16 features plus the target y).
It has a mix of numerical variables (e.g., age, balance, duration) and categorical variables (e.g., job, marital, education).
The target variable is binary (y = yes/no), confirming this is a supervised classification problem.
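The structural overview above can be produced programmatically. The following is a minimal sketch on a small hand-made frame; `toy` and `summarize_structure` are hypothetical names used only for illustration, and on the real data the function would be called on `df`.

```python
import pandas as pd

# Hypothetical stand-in for df: a few rows shaped like the bank data
toy = pd.DataFrame({
    "age": [58, 44, 33, 47, 33],
    "job": ["management", "technician", "entrepreneur", "blue-collar", None],
    "balance": [2143, 29, 2, 1506, 1],
    "y": ["no", "no", "no", "no", "yes"],
})

def summarize_structure(frame, target="y"):
    """Split feature columns by type and report the target's class shares."""
    features = frame.drop(columns=[target])
    numeric = features.select_dtypes(include="number").columns.tolist()
    categorical = features.select_dtypes(exclude="number").columns.tolist()
    return {
        "n_rows": len(frame),
        "numeric": numeric,
        "categorical": categorical,
        "target_share": frame[target].value_counts(normalize=True).to_dict(),
    }

summary = summarize_structure(toy)
print(summary)
```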

1. Correlation of Features

plt.figure(figsize=(12,8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Matrix of Numerical Features")
plt.show()

Low correlations overall: Most numerical features show weak correlations (close to 0).
Age and Balance: Slight positive correlation (~0.10). Older clients tend to have slightly higher balances.
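Campaign and Day of Week: Mild correlation (~0.16). Despite its name, day_of_week ranges from 1 to 31 in this dataset (see the summary statistics in Section 7), so it records the day of the month. The number of contacts may therefore depend somewhat on when in the month the call was made.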
Pdays and Previous: Strongest correlation (~0.45). This makes sense since both relate to prior marketing contacts.
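Duration: Almost no correlation with the other numerical variables. The target y is categorical and therefore absent from this matrix; its relationship with duration is examined in Section 4, where duration turns out to be strongly associated with the outcome.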

Overall, multicollinearity is low. Most features are fairly independent, which is good for model training. The correlation between pdays and previous suggests redundancy that may require dimensionality reduction or careful feature selection.
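Reading redundant pairs off a heatmap gets harder as the number of features grows. As a rough sketch, the pairs above a chosen threshold can be listed directly. The helper `correlated_pairs`, the threshold of 0.4, and the synthetic frame are assumptions made for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.4):
    """Return (col_a, col_b, r) for numeric column pairs with |r| >= threshold."""
    corr = frame.corr(numeric_only=True)
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skip the diagonal
            r = float(corr.iloc[i, j])
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(r, 2)))
    # Strongest relationships first
    return sorted(pairs, key=lambda p: -abs(p[2]))

# Synthetic data in which pdays depends on previous, while age is unrelated
rng = np.random.default_rng(0)
previous = rng.poisson(1.0, size=500)
toy = pd.DataFrame({
    "previous": previous,
    "pdays": previous * 30 + rng.normal(0, 10, size=500),
    "age": rng.integers(18, 90, size=500),
})

pairs = correlated_pairs(toy)
print(pairs)
```

On the real data, `correlated_pairs(df)` would be expected to surface the pdays/previous pair discussed above.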

2. Distribution of Variables

df.hist(figsize=(12,10), bins=30)
plt.suptitle("Distributions of Numerical Variables")
plt.show()

Age: Roughly bell-shaped with most clients between ages 30–50, tapering off after 60. Few very young (under 20) or very old (over 90).
Balance: Highly right-skewed. Most clients have modest or negative balances, but a few have very large positive balances (>50,000).
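Day of Week: Despite the column name, values run from 1 to 31, so this is the day of the month. The distribution is fairly uniform, with slightly higher contact frequencies on certain days.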
Duration: Strong right skew. Most calls are under 500 seconds, but a long tail includes calls lasting several thousand seconds.
Campaign (number of contacts): Heavily skewed. Most clients received 1–3 calls, but some received 50+.
Pdays (days since last contact): Large spike at -1 (meaning “not previously contacted”). Those who were contacted before show widely varying intervals, sometimes >800 days.
Previous: Most clients had 0 previous contacts, though some had dozens, with outliers exceeding 200.

3. Outliers

num_cols = df.select_dtypes(include="number").columns
for col in num_cols:
    plt.figure(figsize=(6,3))
    sns.boxplot(x=df[col])
    plt.title(f"Boxplot: {col}")
    plt.show()

Age: The middle half of clients is between 33 and 48 years old. With the usual 1.5 × IQR rule, the boxplot flags ages above roughly 70 as outliers, up to the maximum of 95.
Balance: Extremely skewed with many outliers. Some clients have balances over 100,000, while many are clustered near zero or even negative.
Duration: Call durations show many long outliers (up to ~5000 seconds), suggesting some unusually lengthy calls.
Campaign: Most clients received 1–3 contacts, but outliers received more than 50 calls, which is not typical.
Pdays (days since last contact): Distribution is heavily skewed. -1 indicates “never contacted before.” Many large outliers (500+ days).
Previous: Most clients had zero previous contacts, but a few were contacted over 200 times, which is extreme.

Overall, the dataset contains many outliers, especially in financial (balance) and campaign-related variables. These will need to be handled carefully through winsorizing, transformation, or by keeping them if they represent real business cases.
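The two options mentioned above, flagging outliers and winsorizing them, can be sketched as follows. The helper names, the 1.5 × IQR multiplier, and the 5%/95% caps are illustrative assumptions, and the `balance` values are made up to mimic the skew seen in the boxplot.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """True where a value lies more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

def winsorize(series, lower=0.01, upper=0.99):
    """Cap values at the given lower and upper quantiles instead of dropping them."""
    lo, hi = series.quantile(lower), series.quantile(upper)
    return series.clip(lower=lo, upper=hi)

# Made-up balances with one extreme value
balance = pd.Series([-200, 0, 50, 120, 300, 450, 800, 1400, 2500, 102127])

mask = iqr_outlier_mask(balance)
capped = winsorize(balance, lower=0.05, upper=0.95)

print("Outliers flagged:", int(mask.sum()))
print("Max before/after capping:", balance.max(), capped.max())
```

Capping keeps every row, which matters here because extreme balances or call counts may be real business cases rather than errors.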

4. Relationships Between Variables

plt.figure(figsize=(6,4))
sns.boxplot(x=df["y"], y=df["duration"])
plt.title("Call Duration by Subscription Outcome")
plt.show()

plt.figure(figsize=(6,4))
sns.boxplot(x=df["y"], y=df["campaign"])
plt.title("Campaign Contacts by Subscription Outcome")
plt.show()

Call Duration vs. Subscription: Clients who subscribed (yes) generally had longer calls than those who did not (no), so call length is strongly associated with campaign success. The direction of the effect is unclear, though: interested clients may simply stay on the line longer, and duration is only known once the call has ended. Outliers show some very long calls.
Campaign Contacts vs. Subscription: In both groups, most clients received only a small number of contacts. The boxplots give no sign that contacting a client many times goes along with more subscriptions, and the extreme values (some clients received over 50 calls) are rare. This is consistent with diminishing returns for repeated calls.

Together, these results suggest that the quality of the interaction (longer calls) matters more than the quantity of contacts for subscription outcomes.
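Boxplots compare distributions but do not show subscription rates directly. A small sketch of how the two claims could be checked numerically is given below. The data are made up, and the bucket edges and the name `contact_bucket` are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up rows: ten clients with their contact count, call length, and outcome
toy = pd.DataFrame({
    "campaign": [1, 1, 2, 2, 3, 4, 6, 8, 12, 20],
    "duration": [600, 90, 400, 120, 150, 80, 200, 60, 100, 70],
    "y": ["yes", "no", "yes", "no", "no", "no", "no", "no", "no", "no"],
})

# Median call length per outcome
median_duration = toy.groupby("y")["duration"].median()

# Subscription rate per bucket of contact counts: 1, 2-3, 4-5, 6 or more
bins = [0, 1, 3, 5, float("inf")]
labels = ["1", "2-3", "4-5", "6+"]
toy["contact_bucket"] = pd.cut(toy["campaign"], bins=bins, labels=labels)
rate_by_bucket = (
    toy["y"].eq("yes")
    .groupby(toy["contact_bucket"], observed=False)
    .mean()
)

print(median_duration)
print(rate_by_bucket)
```

Applied to `df`, a falling rate across buckets would support the diminishing-returns reading; a flat or rising rate would contradict it.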

5. Distribution of Categorical Variables

cat_cols = df.select_dtypes(include="object").columns
for col in cat_cols:
    plt.figure(figsize=(6,4))
    sns.countplot(y=col, data=df, order=df[col].value_counts().index)
    plt.title(f"Distribution of {col}")
    plt.tight_layout()
    plt.show()

Job: The largest groups are blue-collar, management, and technician. Smaller groups like students, housemaids, and unemployed are underrepresented, but these may still hold important patterns.
Marital: The majority of clients are married, followed by single and divorced.
Education: Most clients have secondary education, with tertiary next and primary being least common.
Default: Very few clients have credit defaults. Most are marked no.
Housing Loan: A majority of clients have housing loans, though a significant portion does not.
Personal Loan: Far fewer clients have personal loans compared to those without.
Contact Method: Most contacts are made via cellular phones, with fewer via landline telephone.
Month: Campaign activity peaks in May, with significant activity in July and August. Other months show smaller proportions.
Previous Outcome (poutcome): Among the recorded values, the majority are failure outcomes, followed by other and fewer successes. Most rows have no recorded value at all (see Section 8).
Target Variable (y): The outcome is heavily imbalanced. The majority did not subscribe, while only about 11–12% subscribed.

These distributions highlight where the dataset is imbalanced (for example y, default, contact) and where categories are dominant (for example married, blue-collar). This will influence preprocessing and algorithm selection.

6. Patterns and Trends

pd.crosstab(df["job"], df["y"], normalize="index").sort_values(
    by="yes", ascending=False
).head()

y	no	yes
job
student	0.713220	0.286780
retired	0.772085	0.227915
unemployed	0.844973	0.155027
management	0.862444	0.137556
admin.	0.877973	0.122027

The cross-tabulation shows the proportion of clients who subscribed (yes) by job category. Because of .head(), only the five categories with the highest subscription rates are listed.
Students have the highest subscription rate at ~28.7%, followed by retired clients at ~22.8%.
Unemployed clients (~15.5%) and management (~13.8%) show moderate subscription rates.
Administrative roles come fifth at ~12.2%, only slightly above the overall rate of roughly 11.7%, despite being one of the largest job categories overall.

This suggests that demographic characteristics are associated with marketing outcomes.
Students and retired clients, who are typically the youngest and oldest groups, tend to be more responsive, while mid-career groups such as admin and management subscribe at rates close to the overall average.
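Rates alone hide how many clients stand behind each category, which matters for small groups such as students. The sketch below reports the group size and the lift over the overall rate next to each rate. The function name `rate_with_support` and the toy data are assumptions for illustration.

```python
import pandas as pd

def rate_with_support(frame, by, target="y", positive="yes"):
    """Subscription rate per category with group size and lift over the overall rate."""
    is_positive = frame[target].eq(positive)
    table = is_positive.groupby(frame[by]).agg(n="size", rate="mean")
    table["lift"] = table["rate"] / is_positive.mean()
    return table.sort_values("rate", ascending=False)

# Made-up data: 4 students, 10 admin. clients, 6 retired clients
toy = pd.DataFrame({
    "job": ["student"] * 4 + ["admin."] * 10 + ["retired"] * 6,
    "y": ["yes", "yes", "no", "no"]
         + ["yes"] + ["no"] * 9
         + ["yes", "yes", "no", "no", "no", "no"],
})

table = rate_with_support(toy, by="job")
print(table)
```

Note that groupby skips rows whose job is missing, just as the crosstab above does.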

7. Central Tendency and Spread

df.describe()

	age	balance	day_of_week	duration	campaign	pdays	previous
count	45211.000000	45211.000000	45211.000000	45211.000000	45211.000000	45211.000000	45211.000000
mean	40.936210	1362.272058	15.806419	258.163080	2.763841	40.197828	0.580323
std	10.618762	3044.765829	8.322476	257.527812	3.098021	100.128746	2.303441
min	18.000000	-8019.000000	1.000000	0.000000	1.000000	-1.000000	0.000000
25%	33.000000	72.000000	8.000000	103.000000	1.000000	-1.000000	0.000000
50%	39.000000	448.000000	16.000000	180.000000	2.000000	-1.000000	0.000000
75%	48.000000	1428.000000	21.000000	319.000000	3.000000	-1.000000	0.000000
max	95.000000	102127.000000	31.000000	4918.000000	63.000000	871.000000	275.000000

Age: The mean is about 41 years with a standard deviation of ~11, showing moderate variability. The age range spans from 18 to 95.
Balance: The mean is ~1362, but the standard deviation is over 3000, and the minimum is negative (-8019). The median (448) is much lower than the mean, confirming a strong right-skew caused by extreme outliers.
Duration: The average call length is ~258 seconds, with a wide spread (std ~257). The maximum of ~4918 seconds indicates a few very long calls that may bias the distribution.
Campaign: Most clients were contacted very few times (median = 2), though some were contacted as many as 63 times, producing a long right tail.
pdays: Many entries have -1, which indicates “not previously contacted.” This variable requires careful handling, as it mixes valid values with a special code.
Previous: The median is 0, meaning most clients had no prior contacts. A few clients had up to 275 previous contacts, representing extreme cases.

Overall, the descriptive statistics confirm skewness and the presence of extreme outliers in financial and campaign-related variables. Median values often provide a more reliable summary than means in this dataset.
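Skewness can also be quantified rather than judged by eye. The sketch below compares the mean and median and shows how a log transform changes the skewness of a right-skewed variable. The `duration` values are made up, and `np.log1p` is one possible transform, suited only to non-negative variables (so not to balance, which has negative values).

```python
import numpy as np
import pandas as pd

# Made-up call durations in seconds with a long right tail
duration = pd.Series([5, 30, 60, 103, 180, 200, 319, 500, 1200, 4918])

raw_skew = duration.skew()              # sample skewness of the raw values
log_skew = np.log1p(duration).skew()    # skewness after log(1 + x)

print("mean:", duration.mean(), "median:", duration.median())
print("skew raw:", round(raw_skew, 2), "skew after log1p:", round(log_skew, 2))
```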

8. Missing Values

df.isnull().sum()

age                0
job              288
marital            0
education       1857
default            0
balance            0
housing            0
loan               0
contact        13020
day_of_week        0
month              0
duration           0
campaign           0
pdays              0
previous           0
poutcome       36959
y                  0
dtype: int64

The dataset does contain missing values in several categorical features:
- job: 288 missing
- education: 1,857 missing
- contact: 13,020 missing
- poutcome: 36,959 missing
The variables most affected are contact and poutcome, which are campaign-related attributes.
Since the majority of values in poutcome are missing, this feature may have limited predictive value and could be excluded or grouped into a category such as "unknown".
For features like job and education, missing values can be treated as "unknown" categories so they remain usable during modeling.
Numerical variables (age, balance, duration, etc.) show no missing data, which simplifies preprocessing.

Overall, while missing data is present, it is concentrated in categorical features and can be handled by either encoding "unknown" as a valid category or by dropping features with excessive missingness (like poutcome).
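The decision rule described above (drop features that are mostly missing, label the rest as "unknown") can be sketched as follows. The 50% cutoff and the toy frame are assumptions for illustration; the original analysis does not fix a threshold.

```python
import pandas as pd

# Made-up frame: poutcome is mostly missing, job only occasionally
toy = pd.DataFrame({
    "job": ["management", None, "technician", "student"],
    "poutcome": [None, None, None, "failure"],
    "age": [58, 33, 44, 25],
})

# Share of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)

# Hypothetical cutoff: drop columns with more than half of their values missing
high_missing = missing_share[missing_share > 0.5].index.tolist()

# Keep the remaining columns and make missingness an explicit category
filled = toy.drop(columns=high_missing).fillna({"job": "unknown"})

print(missing_share)
print(filled)
```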

Algorithm Selection

Based on the exploratory data analysis, the Bank Marketing dataset is a supervised classification problem. The target variable y indicates whether a client subscribed to a term deposit (yes or no). The outcome is imbalanced, with most clients labeled as no. This impacts algorithm choice since models must handle both classification and imbalance.

Candidate Algorithms

1. Logistic Regression

Pros:
- Simple and interpretable.
- Provides probability estimates for class membership.
- Works well when relationships between predictors and outcome are approximately linear.
- Computationally efficient for large datasets.

Cons:
- Sensitive to outliers and multicollinearity.
- Struggles with complex, non-linear relationships.
- Requires preprocessing such as scaling and dummy variables for categorical features.

2. Decision Trees / Random Forest

Pros:
- Can model complex, non-linear interactions.
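- Handles both categorical and numerical variables in principle, although scikit-learn's implementation still requires categorical features to be encoded as numbers.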
- Robust to outliers.
- Random Forest provides better generalization and reduces overfitting compared to a single decision tree.

Cons:
- Less interpretable than logistic regression.
- Random Forest can be computationally expensive with very large datasets.
- May still be biased toward the majority class in imbalanced datasets unless class weighting or resampling is applied.

Recommended Algorithm

For this dataset, I recommend Random Forest. The EDA revealed:
- Several categorical variables, some with many levels (e.g., job).
- Strong non-linear patterns (e.g., call duration).
- Presence of outliers and skewed distributions.

Random Forest can handle these complexities better than logistic regression, while also reducing overfitting through ensembling. Logistic regression could still serve as a baseline due to its interpretability.
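A side-by-side comparison of the two candidates might look like the sketch below. It runs on synthetic data from `make_classification` with roughly the same class imbalance, not on the bank data, so the scores say nothing about the real problem. The hyperparameters and the choice of balanced accuracy as the metric are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded bank data: about 88% negatives, 12% positives
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=4,
    weights=[0.88, 0.12], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

models = {
    # Baseline: scaled inputs, class weights to counter the imbalance
    "logistic_regression": make_pipeline(
        StandardScaler(),
        LogisticRegression(max_iter=1000, class_weight="balanced"),
    ),
    # Candidate: no scaling needed for trees
    "random_forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=0
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = balanced_accuracy_score(y_te, model.predict(X_te))

print(scores)
```

Balanced accuracy is used here because plain accuracy would reward a model that always predicts the majority class.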

Impact of Labels

Since the target variable y is clearly labeled, the dataset is suitable for supervised classification. The imbalance in labels (yes ~12%, no ~88%) means that algorithms need either class weighting, SMOTE, or resampling strategies to avoid predicting mostly no.

Dataset Size Consideration

With more than 40,000 records, Random Forest is a strong choice. If the dataset had fewer than 1,000 records, I would recommend logistic regression because:
- Smaller datasets benefit from simpler models with fewer parameters.
- Logistic regression is less prone to overfitting on small data.
- Interpretability remains valuable when sample size is limited.

Pre-processing

Before applying machine learning models to the Bank Marketing dataset, several pre-processing steps are needed to improve data quality and model performance. These steps align with the issues revealed during exploratory data analysis.

Data Cleaning

Handle missing values in job, education, and contact by encoding them as "unknown" so information is not lost.
Consider dropping or grouping poutcome, which has more than 80% missing values and therefore likely limited predictive power. The code below drops it.
Ensure numerical features such as balance, duration, and pdays are treated consistently, especially when extreme outliers exist.

df['job'] = df['job'].fillna("unknown")
df['education'] = df['education'].fillna("unknown")
df['contact'] = df['contact'].fillna("unknown")

df = df.drop(columns=['poutcome'])

Dimensionality Reduction

Correlation analysis shows that pdays and previous are moderately correlated. One option is to combine them into a single indicator of prior marketing activity.
Principal Component Analysis (PCA) could be explored if computational efficiency becomes an issue, but with 17 variables, feature selection is likely more practical than full PCA.

df[['pdays', 'previous']].corr()

df['prior_contact'] = (df['previous'] > 0).astype(int)
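Because pdays uses -1 as a code for "not previously contacted", treating it as an ordinary number mixes a flag with a duration. One possible way to separate the two is sketched below. The column names `was_contacted` and `pdays_clean` are hypothetical, and the rows are made up.

```python
import pandas as pd

# Made-up rows: -1 marks clients with no earlier contact
toy = pd.DataFrame({
    "pdays": [-1, -1, 151, 92, -1],
    "previous": [0, 0, 3, 1, 0],
})

# Flag carries the "ever contacted" information
toy["was_contacted"] = (toy["pdays"] != -1).astype(int)

# Real day counts are kept; the sentinel becomes NaN so it no longer distorts statistics
toy["pdays_clean"] = toy["pdays"].where(toy["pdays"] != -1)

print(toy)
print("Median days since last contact:", toy["pdays_clean"].median())
```

The NaN values in `pdays_clean` would still need imputation, or a model that tolerates missing values, before training.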

Feature Engineering

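Simplify categorical variables with many small levels. For example, group rare or unclear jobs (housemaid, unemployed, student, unknown) into an "other" category to reduce noise. Note that students have the highest subscription rate in the cross-tabulation above, so merging them may discard useful signal.
Create threshold indicators such as long_call = duration > 300, since longer calls are strongly associated with positive outcomes.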
Temporal features such as month could be transformed into seasonal groups (e.g., spring vs. summer) to capture patterns in campaign timing.

df['long_call'] = (df['duration'] > 300).astype(int)

rare_jobs = ['housemaid', 'unemployed', 'student', 'unknown']
df['job_simplified'] = df['job'].apply(lambda x: x if x not in rare_jobs else 'other')

spring = ['mar', 'apr', 'may']
summer = ['jun', 'jul', 'aug']
fall = ['sep', 'oct', 'nov']
winter = ['dec', 'jan', 'feb']

def season_from_month(m):
    if m in spring: return 'spring'
    elif m in summer: return 'summer'
    elif m in fall: return 'fall'
    elif m in winter: return 'winter'
    else: return 'unknown'

df['season'] = df['month'].apply(season_from_month)

Sampling Data

With more than 40,000 records, downsampling is not required for efficiency.
Sampling techniques may still be useful for balancing the dataset (see below).
If cross-validation is used, stratified sampling should ensure both yes and no classes are represented proportionally in training and test folds.

from sklearn.model_selection import train_test_split

X = df.drop(columns=['y'])
y = df['y']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Training set:", X_train.shape)
print("Test set:", X_test.shape)

Training set: (36168, 19)
Test set: (9043, 19)
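For the cross-validation case mentioned above, stratification can be verified fold by fold. This is a small sketch with a made-up label vector of 12% positives; the number of folds and the variable names are assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up labels with the same imbalance as the bank data: 12 yes, 88 no
y_toy = np.array(["yes"] * 12 + ["no"] * 88)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)

# Share of "yes" in each held-out fold
fold_rates = [
    float(np.mean(y_toy[test_idx] == "yes"))
    for _, test_idx in skf.split(X_toy, y_toy)
]

print(fold_rates)
```

Each fold should show a positive share close to the overall 12%, which a plain unstratified split does not guarantee.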

Data Transformation

Apply one-hot encoding to categorical variables (job, marital, education, etc.) for models like logistic regression.
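Standardize or normalize numerical features such as balance, duration, and campaign to remove scale effects. Because these features are highly skewed, a log or power transformation before scaling (for non-negative features such as duration and campaign) can further reduce the influence of outliers.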
Tree-based models like Random Forest do not require scaling, but logistic regression and kNN benefit from it.

from sklearn.preprocessing import OneHotEncoder

categorical_cols = X_train.select_dtypes(include='object').columns
numerical_cols = X_train.select_dtypes(exclude='object').columns

encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)

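# Fit on the training split only; keep readable column names such as job_student
X_train_encoded = pd.DataFrame(
    encoder.fit_transform(X_train[categorical_cols]),
    columns=encoder.get_feature_names_out(categorical_cols),
    index=X_train.index
)

# Reuse the fitted encoder so the test split gets identical columns
X_test_encoded = pd.DataFrame(
    encoder.transform(X_test[categorical_cols]),
    columns=encoder.get_feature_names_out(categorical_cols),
    index=X_test.index
)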

X_train_processed = pd.concat(
    [X_train[numerical_cols].reset_index(drop=True),
     X_train_encoded.reset_index(drop=True)],
    axis=1
)

X_test_processed = pd.concat(
    [X_test[numerical_cols].reset_index(drop=True),
     X_test_encoded.reset_index(drop=True)],
    axis=1
)

X_train_processed.columns = X_train_processed.columns.astype(str)
X_test_processed.columns = X_test_processed.columns.astype(str)

print("Processed training shape:", X_train_processed.shape)
print("Processed test shape:", X_test_processed.shape)

Processed training shape: (36168, 63)
Processed test shape: (9043, 63)

Imbalanced Data

The dataset is imbalanced with only about 12% of clients subscribing.
To address this imbalance, use one or more of the following:
- Class weights to penalize misclassification of the minority class.
- Oversampling methods such as SMOTE to generate synthetic positive cases, applied to the training split only so that the test set keeps the real class distribution.
- Undersampling of the majority class to balance proportions.

Overall, pre-processing ensures the dataset is clean, balanced, and structured for both interpretable models such as logistic regression and more complex models such as Random Forest.

from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_processed, y_train)

print("Before SMOTE:", y_train.value_counts(normalize=True))
print("After SMOTE:", y_train_resampled.value_counts(normalize=True))

Before SMOTE: y
no     0.883018
yes    0.116982
Name: proportion, dtype: float64
After SMOTE: y
no     0.5
yes    0.5
Name: proportion, dtype: float64