Data622 - Assignment 1: Exploratory Data Analysis

Author

Anthony Josue Roman

Introduction

The Bank Marketing dataset from the UCI Machine Learning Repository has data from the telemarketing campaigns of a Portuguese bank. It includes client demographics, business data, and marketing contact data. The variable y indicates whether or not a client subscribed to a term deposit (yes or no).

This report starts with an exploratory description of the dataset: its structure, distributions, correlations, and possible problems such as outliers or class imbalance. It then discusses algorithm selection and, finally, the pre-processing steps needed to prepare the dataset for machine learning.

For this assignment, the dataset is acquired from the UCI Machine Learning Repository. Since I am using Python, I load it with the ucimlrepo package.

Exploratory Data Analysis

To begin, I fetch the Bank Marketing dataset directly from the UCI Machine Learning Repository using the ucimlrepo package.
This ensures reproducibility and keeps features (X) and the target (y) clearly separated.

from ucimlrepo import fetch_ucirepo
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

bank = fetch_ucirepo(id=222)

X = bank.data.features
y = bank.data.targets

df = pd.concat([X, y], axis=1)

print("Shape:", df.shape)
df.head()
Shape: (45211, 17)
   age           job  marital  education  default  balance  housing  loan  contact  day_of_week  month  duration  campaign  pdays  previous  poutcome   y
0   58    management  married   tertiary       no     2143      yes    no      NaN            5    may       261         1     -1         0       NaN  no
1   44    technician   single  secondary       no       29      yes    no      NaN            5    may       151         1     -1         0       NaN  no
2   33  entrepreneur  married  secondary       no        2      yes   yes      NaN            5    may        76         1     -1         0       NaN  no
3   47   blue-collar  married        NaN       no     1506      yes    no      NaN            5    may        92         1     -1         0       NaN  no
4   33           NaN   single        NaN       no        1       no    no      NaN            5    may       198         1     -1         0       NaN  no
  • The dataset contains 45,211 rows and 17 columns (16 features plus the target).
  • It has a mix of numerical variables (e.g., age, balance, duration) and categorical variables (e.g., job, marital, education).
  • The target variable is binary (y = yes/no), confirming this is a supervised classification problem.

1. Correlation of Features

plt.figure(figsize=(12,8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Matrix of Numerical Features")
plt.show()

  • Low correlations overall: Most numerical features show weak correlations (close to 0).
  • Age and Balance: Slight positive correlation (~0.10). Older clients tend to have slightly higher balances.
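  • Campaign and Day of Week: Mild correlation (~0.16). Despite its name, day_of_week takes values from 1 to 31, so it records the day of the month; contact frequency may depend somewhat on when in the month the call was made.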
  • Pdays and Previous: Strongest correlation (~0.45). This makes sense since both relate to prior marketing contacts.
  • Duration: Almost no correlation with other variables. In practice duration is highly predictive of the target outcome (y).

Overall, multicollinearity is low. Most features are fairly independent, which is good for model training. The correlation between pdays and previous suggests redundancy that may require dimensionality reduction or careful feature selection.
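
To make the redundancy check above concrete, the strongest pairwise correlations can be ranked instead of being read off the heatmap. The following is a minimal sketch on toy data; the helper name top_correlations and the toy columns are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd


def top_correlations(frame: pd.DataFrame, n: int = 3) -> pd.Series:
    """Return the n largest absolute pairwise correlations among numeric columns."""
    corr = frame.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair appears once
    # and the diagonal (self-correlation) is dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)


# Toy data standing in for df (hypothetical values, not the bank data).
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],  # perfectly correlated with a
    "c": [5, 3, 4, 1, 2],
})
print(top_correlations(toy, n=2))
```

On the real data, calling top_correlations(df) would be expected to list the pdays and previous pair first, in line with the heatmap reading above.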

2. Distribution of Variables

df.hist(figsize=(12,10), bins=30)
plt.suptitle("Distributions of Numerical Variables")
plt.show()

  • Age: Roughly bell-shaped with most clients between ages 30–50, tapering off after 60. Few very young (under 20) or very old (over 90).
  • Balance: Highly right-skewed. Most clients have modest or negative balances, but a few have very large positive balances (>50,000).
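  • Day of Week: Fairly uniform, with slightly higher contact frequencies on certain days. Note that the values run from 1 to 31, so this column holds the day of the month rather than a weekday.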
  • Duration: Strong right skew. Most calls are under 500 seconds, but a long tail includes calls lasting several thousand seconds.
  • Campaign (number of contacts): Heavily skewed. Most clients received 1–3 calls, but some received 50+.
  • Pdays (days since last contact): Large spike at -1 (meaning “not previously contacted”). Those who were contacted before show widely varying intervals, sometimes >800 days.
  • Previous: Most clients had 0 previous contacts, though some had dozens, with outliers exceeding 200.
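
The skew described above can be quantified, and a log transform is one common way to reduce it. The sketch below uses a synthetic right-skewed sample as a stand-in for a column such as duration; the sample and its parameters are assumptions made for illustration only. Note that np.log1p is only defined for values greater than -1, so it would not apply directly to balance, which contains negative values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical right-skewed sample (log-normal), not the real duration column.
sample = pd.Series(rng.lognormal(mean=5.0, sigma=1.0, size=5000), name="duration_like")

raw_skew = sample.skew()             # strongly positive for a long right tail
log_skew = np.log1p(sample).skew()   # close to zero after the log transform

print(f"skewness before log1p: {raw_skew:.2f}")
print(f"skewness after log1p:  {log_skew:.2f}")
```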

3. Outliers

num_cols = df.select_dtypes(include="number").columns
for col in num_cols:
    plt.figure(figsize=(6,3))
    sns.boxplot(x=df[col])
    plt.title(f"Boxplot: {col}")
    plt.show()

  • Age: Most clients are between 30–60 years old, but a few outliers exist above age 80.
  • Balance: Extremely skewed with many outliers. Some clients have balances over 100,000, while many are clustered near zero or even negative.
  • Duration: Call durations show many long outliers (up to ~5000 seconds), suggesting some unusually lengthy calls.
  • Campaign: Most clients received 1–3 contacts, but outliers received more than 50 calls, which is not typical.
  • Pdays (days since last contact): Distribution is heavily skewed. -1 indicates “never contacted before.” Many large outliers (500+ days).
  • Previous: Most clients had zero previous contacts, but a few were contacted over 200 times, which is extreme.

Overall, the dataset contains many outliers, especially in financial (balance) and campaign-related variables. These will need to be handled carefully through winsorizing, transformation, or by keeping them if they represent real business cases.
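
As a rough illustration of the options mentioned above, the sketch below flags outliers with the usual 1.5 × IQR rule (the same rule a boxplot uses for its whiskers) and winsorizes by clipping to chosen percentiles. The function names, the percentile cut-offs, and the toy values are assumptions for illustration, not settings taken from the analysis.

```python
import pandas as pd


def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


def winsorize(s: pd.Series, lower: float = 0.01, upper: float = 0.99) -> pd.Series:
    """Clip values to the given lower and upper quantiles."""
    lo, hi = s.quantile([lower, upper])
    return s.clip(lower=lo, upper=hi)


# Toy values resembling a contact-count column with one extreme case.
toy = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 63], name="campaign_like")

mask = iqr_outlier_mask(toy)
clipped = winsorize(toy, lower=0.0, upper=0.9)

print("outliers flagged:", int(mask.sum()))
print("max before / after clipping:", toy.max(), "/", clipped.max())
```

Whether to clip at all remains a judgment call: extreme balances or contact counts may be genuine business cases rather than errors.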

4. Relationships Between Variables

plt.figure(figsize=(6,4))
sns.boxplot(x=df["y"], y=df["duration"])
plt.title("Call Duration by Subscription Outcome")
plt.show()

plt.figure(figsize=(6,4))
sns.boxplot(x=df["y"], y=df["campaign"])
plt.title("Campaign Contacts by Subscription Outcome")
plt.show()

  • Call Duration vs. Subscription: Clients who subscribed (yes) generally had longer call durations compared to those who did not (no). This indicates that call length is strongly associated with campaign success. Outliers show some very long calls, but the overall trend suggests longer conversations may help conversions.

  • Campaign Contacts vs. Subscription: In both groups, most clients received only a small number of contacts. Contacts beyond the first few attempts do not appear to raise subscription rates; the high counts show up mainly as outliers (some clients received over 50 calls). This suggests diminishing returns from repeated calls.

Together, these results suggest that the quality of an interaction (longer calls) matters more than the quantity of contacts for subscription outcomes, although call length may partly reflect a client's existing interest rather than cause it.
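
The boxplot comparison above can be backed by simple group summaries: the median call duration per outcome and the subscription rate per contact-count bucket. This is a minimal sketch on made-up rows; the bucket edges and labels are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical rows, not taken from the bank data.
toy = pd.DataFrame({
    "campaign": [1, 1, 2, 2, 3, 5, 8, 12],
    "duration": [400, 90, 350, 120, 80, 60, 500, 70],
    "y": ["yes", "no", "yes", "no", "no", "no", "yes", "no"],
})

# Median call length for each outcome.
median_duration = toy.groupby("y")["duration"].median()

# Share of subscribers within each contact-count bucket.
subscribed = toy["y"].eq("yes")
bucket = pd.cut(
    toy["campaign"],
    bins=[0, 1, 3, 10, float("inf")],
    labels=["1", "2-3", "4-10", "11+"],
)
rate_by_bucket = subscribed.groupby(bucket, observed=True).mean()

print(median_duration)
print(rate_by_bucket)
```

Applied to df, a flat or falling rate across buckets would support the diminishing-returns reading; a boxplot alone cannot show this directly.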

5. Distribution of Categorical Variables

cat_cols = df.select_dtypes(include="object").columns
for col in cat_cols:
    plt.figure(figsize=(6,4))
    sns.countplot(y=col, data=df, order=df[col].value_counts().index)
    plt.title(f"Distribution of {col}")
    plt.tight_layout()
    plt.show()

  • Job: The largest groups are blue-collar, management, and technician. Smaller groups like students, housemaids, and unemployed are underrepresented, but these may still hold important patterns.
  • Marital: The majority of clients are married, followed by single and divorced.
  • Education: Most clients have secondary education, with tertiary next and primary being least common.
  • Default: Very few clients have credit defaults. Most are marked no.
  • Housing Loan: A majority of clients have housing loans, though a significant portion does not.
  • Personal Loan: Far fewer clients have personal loans compared to those without.
  • Contact Method: Among the recorded values, most contacts are made via cellular phones, with fewer via landline telephone.
  • Month: Campaign activity peaks in May, with significant activity in July and August. Other months show smaller proportions.
  • Previous Outcome (poutcome): Among the recorded values, failure is the most common outcome, followed by other, with success the least common.
  • Target Variable (y): The outcome is heavily imbalanced. The majority did not subscribe, while only about 11–12% subscribed.

These distributions highlight where the dataset is imbalanced (for example y, default, contact) and where categories are dominant (for example married, blue-collar). This will influence preprocessing and algorithm selection.

6. Central Tendency and Spread

df.describe()
                age        balance   day_of_week      duration      campaign         pdays      previous
count  45211.000000   45211.000000  45211.000000  45211.000000  45211.000000  45211.000000  45211.000000
mean      40.936210    1362.272058     15.806419    258.163080      2.763841     40.197828      0.580323
std       10.618762    3044.765829      8.322476    257.527812      3.098021    100.128746      2.303441
min       18.000000   -8019.000000      1.000000      0.000000      1.000000     -1.000000      0.000000
25%       33.000000      72.000000      8.000000    103.000000      1.000000     -1.000000      0.000000
50%       39.000000     448.000000     16.000000    180.000000      2.000000     -1.000000      0.000000
75%       48.000000    1428.000000     21.000000    319.000000      3.000000     -1.000000      0.000000
max       95.000000  102127.000000     31.000000   4918.000000     63.000000    871.000000    275.000000
  • Age: The mean is about 41 years with a standard deviation of ~11, showing moderate variability. The age range spans from 18 to 95.
  • Balance: The mean is ~1362, but the standard deviation is over 3000, and the minimum is negative (-8019). The median (448) is much lower than the mean, confirming a strong right-skew caused by extreme outliers.
  • Duration: The average call length is ~258 seconds, with a wide spread (std ~257). The maximum of ~4918 seconds indicates a few very long calls that may bias the distribution.
  • Campaign: Most clients were contacted very few times (median = 2), though some were contacted as many as 63 times, giving the distribution a long right tail.
  • pdays: Many entries have -1, which indicates “not previously contacted.” This variable requires careful handling, as it mixes valid values with a special code.
  • Previous: The median is 0, meaning most clients had no prior contacts. A few clients had up to 275 previous contacts, representing extreme cases.

Overall, the descriptive statistics confirm skewness and the presence of extreme outliers in financial and campaign-related variables. Median values often provide a more reliable summary than means in this dataset.

7. Missing Values

df.isnull().sum()
age                0
job              288
marital            0
education       1857
default            0
balance            0
housing            0
loan               0
contact        13020
day_of_week        0
month              0
duration           0
campaign           0
pdays              0
previous           0
poutcome       36959
y                  0
dtype: int64
  • The dataset does contain missing values in several categorical features:
    • job: 288 missing
    • education: 1,857 missing
    • contact: 13,020 missing
    • poutcome: 36,959 missing
  • The variables most affected are contact and poutcome, which are campaign-related attributes.
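  • Since the majority of values in poutcome are missing (about 82%), this feature could be excluded or its missing values grouped into a category such as "unknown". The missingness is probably not random: pdays is -1 for at least three quarters of clients, so for most rows there was no earlier campaign whose outcome could be recorded.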
  • For features like job and education, missing values can be treated as "unknown" categories so they remain usable during modeling.
  • Numerical variables (age, balance, duration, etc.) show no missing data, which simplifies preprocessing.

Overall, while missing data is present, it is concentrated in categorical features and can be handled by either encoding "unknown" as a valid category or by dropping features with excessive missingness (like poutcome).
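
The points above can be checked with two short computations: the share of missing values per column, and how often a missing poutcome coincides with the pdays = -1 code. The sketch below runs on a few made-up rows; on the real data the same expressions would be applied to df.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, not taken from the bank data.
toy = pd.DataFrame({
    "job": ["admin.", np.nan, "technician", "services"],
    "pdays": [-1, -1, 90, -1],
    "poutcome": [np.nan, np.nan, "failure", np.nan],
})

# Percentage of missing values in each column.
missing_pct = toy.isna().mean().mul(100).round(1)

# Fraction of rows where "poutcome is missing" agrees with "never contacted".
never_contacted = toy["pdays"].eq(-1)
overlap = (toy["poutcome"].isna() == never_contacted).mean()

print(missing_pct)
print(f"agreement between missing poutcome and pdays == -1: {overlap:.0%}")
```

A high agreement on the real data would support treating the missing poutcome values as a meaningful "no previous campaign" category rather than as lost information.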

Algorithm Selection

Based on the exploratory data analysis, the Bank Marketing dataset is a supervised classification problem. The target variable y indicates whether a client subscribed to a term deposit (yes or no). The outcome is imbalanced, with most clients labeled as no. This impacts algorithm choice since models must handle both classification and imbalance.

Candidate Algorithms

1. Logistic Regression

Pros:
- Simple and interpretable.
- Provides probability estimates for class membership.
- Works well when relationships between predictors and outcome are approximately linear.
- Computationally efficient for large datasets.

Cons:
- Sensitive to outliers and multicollinearity.
- Struggles with complex, non-linear relationships.
- Requires preprocessing such as scaling and dummy variables for categorical features.

2. Decision Trees / Random Forest

Pros:
- Can model complex, non-linear interactions.
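- Handles both categorical and numerical variables in principle, although scikit-learn's tree implementations still require categorical features to be encoded numerically.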
- Robust to outliers.
- Random Forest provides better generalization and reduces overfitting compared to a single decision tree.

Cons:
- Less interpretable than logistic regression.
- Random Forest can be computationally expensive with very large datasets.
- May still be biased toward the majority class in imbalanced datasets unless class weighting or resampling is applied.

Impact of Labels

Since the target variable y is clearly labeled, the dataset is suitable for supervised classification. The imbalance in labels (yes ~12%, no ~88%) means that algorithms need class weighting, oversampling (for example SMOTE), or undersampling to avoid predicting mostly no.

Dataset Size Consideration

With more than 40,000 records, Random Forest is a strong choice. If the dataset had fewer than 1,000 records, I would recommend logistic regression because:
- Smaller datasets benefit from simpler models with fewer parameters.
- Logistic regression is less prone to overfitting on small data.
- Interpretability remains valuable when sample size is limited.
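
One way to compare the two candidates is stratified cross-validation with class weighting. The sketch below uses a synthetic dataset with roughly 12% positives as a stand-in for the bank data; the sample size, hyperparameters, and the choice of ROC AUC as the metric are assumptions for illustration, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (about 88% / 12%), not the bank data.
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.88, 0.12], random_state=0,
)

models = {
    # Logistic regression benefits from scaling, so it gets a scaler in its pipeline.
    "logistic_regression": make_pipeline(
        StandardScaler(),
        LogisticRegression(class_weight="balanced", max_iter=1000),
    ),
    # Tree ensembles do not need scaling.
    "random_forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=0
    ),
}

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {
    name: cross_val_score(model, X_syn, y_syn, cv=cv, scoring="roc_auc").mean()
    for name, model in models.items()
}

for name, score in results.items():
    print(f"{name}: mean ROC AUC = {score:.3f}")
```

With imbalanced labels, accuracy is a poor yardstick because always predicting no already scores about 88%; a ranking metric such as ROC AUC or average precision is usually more informative.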

Pre-processing

Before applying machine learning models to the Bank Marketing dataset, several pre-processing steps are needed to improve data quality and model performance. These steps align with the issues revealed during exploratory data analysis.

Data Cleaning

  • Handle missing values in job, education, and contact by encoding them as "unknown" so information is not lost.
  • Consider dropping or grouping poutcome, which has more than 80% missing values and may therefore add little predictive power.
  • Ensure numerical features such as balance, duration, and pdays are treated consistently, especially where extreme outliers exist or where a special code is used (the -1 in pdays).
df['job'] = df['job'].fillna("unknown")
df['education'] = df['education'].fillna("unknown")
df['contact'] = df['contact'].fillna("unknown")

df = df.drop(columns=['poutcome'])

Dimensionality Reduction

  • Correlation analysis shows that pdays and previous are moderately correlated. One option is to combine them into a single indicator of prior marketing activity.
  • Principal Component Analysis (PCA) could be explored if computational efficiency becomes an issue, but with only 16 input features, feature selection is likely more practical than full PCA.
df[['pdays', 'previous']].corr()

df['prior_contact'] = (df['previous'] > 0).astype(int)
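
Because pdays mixes real day counts with the -1 code, one option is to split it into a flag and a cleaned numeric column. The sketch below shows this on toy rows; the column name pdays_clean is a hypothetical choice, and deriving the flag from pdays instead of previous is an alternative to the line above, not the approach used in this report.

```python
import pandas as pd

# Hypothetical rows, not taken from the bank data.
toy = pd.DataFrame({"pdays": [-1, 90, -1, 365], "previous": [0, 2, 0, 5]})

# 1 if the client was contacted in an earlier campaign, else 0.
toy["prior_contact"] = toy["pdays"].ne(-1).astype(int)

# Replace the -1 code with NaN so it no longer distorts means or scaling.
toy["pdays_clean"] = toy["pdays"].where(toy["pdays"] >= 0)

print(toy)
print("mean days since last contact (contacted clients only):",
      toy["pdays_clean"].mean())
```

The resulting NaN values would still need imputing, or the column dropping, before fitting models that cannot handle missing inputs, such as logistic regression.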

Feature Engineering

  • Simplify categorical variables. For example, group rare or uninformative jobs (housemaid, unemployed, student, unknown) into an "other" category to reduce noise.
  • Create threshold indicators such as long_call = duration > 300, since longer calls are strongly associated with positive outcomes. Note that duration is only known once a call has ended, so features derived from it are not available when predicting the outcome before a call is made.
  • Temporal features such as month can be grouped into seasons (spring, summer, fall, winter) to capture patterns in campaign timing.
df['long_call'] = (df['duration'] > 300).astype(int)

rare_jobs = ['housemaid', 'unemployed', 'student', 'unknown']
df['job_simplified'] = df['job'].apply(lambda x: x if x not in rare_jobs else 'other')

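# Month abbreviations grouped by season.
spring = ['mar', 'apr', 'may']
summer = ['jun', 'jul', 'aug']
fall = ['sep', 'oct', 'nov']
# Note: 'jan' and 'feb' are not listed here, so season_from_month() labels them 'unknown'.
winter = ['dec']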

def season_from_month(m):
    if m in spring: return 'spring'
    elif m in summer: return 'summer'
    elif m in fall: return 'fall'
    elif m in winter: return 'winter'
    else: return 'unknown'

df['season'] = df['month'].apply(season_from_month)

Sampling Data

  • With more than 40,000 records, downsampling is not required for efficiency.
  • Sampling techniques may still be useful for balancing the dataset (see below).
  • If cross-validation is used, stratified sampling should ensure both yes and no classes are represented proportionally in training and test folds.
from sklearn.model_selection import train_test_split

X = df.drop(columns=['y'])
y = df['y']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Training set:", X_train.shape)
print("Test set:", X_test.shape)
Training set: (36168, 19)
Test set: (9043, 19)
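
The effect of stratify=y can be verified by comparing class shares in the two splits. The sketch below does this on a synthetic target with 12% positives; the toy frame and its single feature are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic target with 12% "yes" (not the bank data).
y_toy = pd.Series(["no"] * 880 + ["yes"] * 120, name="y")
X_toy = pd.DataFrame({"x": range(len(y_toy))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, both splits keep (almost exactly) the original class ratio.
train_share = y_tr.eq("yes").mean()
test_share = y_te.eq("yes").mean()
print(f"yes share in train: {train_share:.3f}, in test: {test_share:.3f}")
```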

Data Transformation

  • Apply one-hot encoding to categorical variables (job, marital, education, etc.) for models like logistic regression.
  • Apply a log or similar transform to highly skewed numerical features such as duration and campaign, and standardize numerical features so that differences in scale do not dominate. Standardization alone does not remove the influence of outliers.
  • Tree-based models like Random Forest do not require scaling, but logistic regression and kNN benefit from it.
from sklearn.preprocessing import OneHotEncoder

categorical_cols = X_train.select_dtypes(include='object').columns
numerical_cols = X_train.select_dtypes(exclude='object').columns

encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)

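# Fit the encoder on the training data only, and name each encoded
# column after its source feature and category (e.g. marital_single).
X_train_encoded = pd.DataFrame(
    encoder.fit_transform(X_train[categorical_cols]),
    columns=encoder.get_feature_names_out(categorical_cols),
    index=X_train.index
)

# Reuse the fitted encoder on the test data to avoid leakage.
X_test_encoded = pd.DataFrame(
    encoder.transform(X_test[categorical_cols]),
    columns=encoder.get_feature_names_out(categorical_cols),
    index=X_test.index
)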

X_train_processed = pd.concat(
    [X_train[numerical_cols].reset_index(drop=True),
     X_train_encoded.reset_index(drop=True)],
    axis=1
)

X_test_processed = pd.concat(
    [X_test[numerical_cols].reset_index(drop=True),
     X_test_encoded.reset_index(drop=True)],
    axis=1
)

X_train_processed.columns = X_train_processed.columns.astype(str)
X_test_processed.columns = X_test_processed.columns.astype(str)

print("Processed training shape:", X_train_processed.shape)
print("Processed test shape:", X_test_processed.shape)
Processed training shape: (36168, 63)
Processed test shape: (9043, 63)

Imbalanced Data

  • The dataset is imbalanced with only about 12% of clients subscribing.
  • To address this imbalance, use one or more of the following:
    • Class weights to penalize misclassification of the minority class.
    • Oversampling methods such as SMOTE to generate synthetic positive cases.
    • Undersampling of the majority class to balance proportions.

Overall, pre-processing ensures the dataset is clean, balanced, and structured for both interpretable models such as logistic regression and more complex models such as Random Forest.

from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_processed, y_train)

print("Before SMOTE:", y_train.value_counts(normalize=True))
print("After SMOTE:", y_train_resampled.value_counts(normalize=True))
Before SMOTE: y
no     0.883018
yes    0.116982
Name: proportion, dtype: float64
After SMOTE: y
no     0.5
yes    0.5
Name: proportion, dtype: float64
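
As an alternative to generating synthetic rows, the class-weight option listed above can be computed directly. The sketch below uses scikit-learn's compute_class_weight on a toy label array with the same rough 88/12 split; the array itself is an assumption for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels with 88% "no" and 12% "yes" (not the bank data).
y_toy = np.array(["no"] * 88 + ["yes"] * 12)
classes = np.array(["no", "yes"])

# "balanced" weights are n_samples / (n_classes * class_count).
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)
weight_by_class = {str(c): float(w) for c, w in zip(classes, weights)}

print(weight_by_class)
```

Passing class_weight="balanced" to estimators such as LogisticRegression or RandomForestClassifier applies the same weighting internally, without changing the number of training rows.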