The data comes from direct marketing campaigns conducted by a Portuguese banking institution. The campaigns were run by phone, and the same client was often contacted more than once to determine whether they would subscribe to a bank term deposit. The dataset used for this analysis is bank-full.csv, which contains 45,211 observations and 17 columns: 16 input features, both numerical and categorical, plus the binary target y. It is an older version of the data with fewer inputs than bank-additional-full.csv. The primary objective of this analysis is to explore the structure of the data, identify patterns, detect potential issues such as outliers or missing values, and examine correlations between variables to understand the factors that influence customer subscription decisions.
import pandas as pd
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from ucimlrepo import fetch_ucirepo
# fetch dataset
bank_marketing = fetch_ucirepo(id=222)
# data (as pandas dataframes)
X = bank_marketing.data.features
y = bank_marketing.data.targets
# metadata
print(bank_marketing.metadata)
## {'uci_id': 222, 'name': 'Bank Marketing', 'repository_url': 'https://archive.ics.uci.edu/dataset/222/bank+marketing', 'data_url': 'https://archive.ics.uci.edu/static/public/222/data.csv', 'abstract': 'The data is related with direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict if the client will subscribe a term deposit (variable y).', 'area': 'Business', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 45211, 'num_features': 16, 'feature_types': ['Categorical', 'Integer'], 'demographics': ['Age', 'Occupation', 'Marital Status', 'Education Level'], 'target_col': ['y'], 'index_col': None, 'has_missing_values': 'yes', 'missing_values_symbol': 'NaN', 'year_of_dataset_creation': 2014, 'last_updated': 'Fri Aug 18 2023', 'dataset_doi': '10.24432/C5K306', 'creators': ['S. Moro', 'P. Rita', 'P. Cortez'], 'intro_paper': {'ID': 277, 'type': 'NATIVE', 'title': 'A data-driven approach to predict the success of bank telemarketing', 'authors': 'Sérgio Moro, P. Cortez, P. Rita', 'venue': 'Decision Support Systems', 'year': 2014, 'journal': None, 'DOI': '10.1016/j.dss.2014.03.001', 'URL': 'https://www.semanticscholar.org/paper/cab86052882d126d43f72108c6cb41b295cc8a9e', 'sha': None, 'corpus': None, 'arxiv': None, 'mag': None, 'acl': None, 'pmid': None, 'pmcid': None}, 'additional_info': {'summary': "The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be ('yes') or not ('no') subscribed. 
\n\nThere are four datasets: \n1) bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010), very close to the data analyzed in [Moro et al., 2014]\n2) bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs.\n3) bank-full.csv with all examples and 17 inputs, ordered by date (older version of this dataset with less inputs). \n4) bank.csv with 10% of the examples and 17 inputs, randomly selected from 3 (older version of this dataset with less inputs). \nThe smallest datasets are provided to test more computationally demanding machine learning algorithms (e.g., SVM). \n\nThe classification goal is to predict if the client will subscribe (yes/no) a term deposit (variable y).", 'purpose': None, 'funded_by': None, 'instances_represent': None, 'recommended_data_splits': None, 'sensitive_data': None, 'preprocessing_description': None, 'variable_info': 'Input variables:\n # bank client data:\n 1 - age (numeric)\n 2 - job : type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student",\n "blue-collar","self-employed","retired","technician","services") \n 3 - marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)\n 4 - education (categorical: "unknown","secondary","primary","tertiary")\n 5 - default: has credit in default? (binary: "yes","no")\n 6 - balance: average yearly balance, in euros (numeric) \n 7 - housing: has housing loan? (binary: "yes","no")\n 8 - loan: has personal loan? 
(binary: "yes","no")\n # related with the last contact of the current campaign:\n 9 - contact: contact communication type (categorical: "unknown","telephone","cellular") \n 10 - day: last contact day of the month (numeric)\n 11 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")\n 12 - duration: last contact duration, in seconds (numeric)\n # other attributes:\n 13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)\n 14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)\n 15 - previous: number of contacts performed before this campaign and for this client (numeric)\n 16 - poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")\n\n Output variable (desired target):\n 17 - y - has the client subscribed a term deposit? (binary: "yes","no")\n', 'citation': None}}
# variable information
print(bank_marketing.variables)
## name role ... units missing_values
## 0 age Feature ... None no
## 1 job Feature ... None no
## 2 marital Feature ... None no
## 3 education Feature ... None no
## 4 default Feature ... None no
## 5 balance Feature ... euros no
## 6 housing Feature ... None no
## 7 loan Feature ... None no
## 8 contact Feature ... None yes
## 9 day_of_week Feature ... None no
## 10 month Feature ... None no
## 11 duration Feature ... None no
## 12 campaign Feature ... None no
## 13 pdays Feature ... None yes
## 14 previous Feature ... None no
## 15 poutcome Feature ... None yes
## 16 y Target ... None no
##
## [17 rows x 7 columns]
The Exploratory Data Analysis (EDA) revealed several key insights. Call duration is the feature most strongly associated with the target variable (y), making it a crucial factor in determining customer responses; because the correlation heatmap is computed from X alone, this relationship has to be checked separately by correlating the features with a numerically encoded y. Other numerical features, such as age and balance, displayed weak correlations with most variables, indicating they might not be strong predictors. Most numerical features exhibit a right-skewed distribution, with balance, duration, and pdays showing extreme outliers. These outliers indicate that some clients maintain significantly higher balances or had exceptionally long call durations. The categorical analysis highlighted that the most common occupations among clients were blue-collar, management, and technician roles. Additionally, the majority of clients were married, and a significant proportion had secondary education. A notable observation was the presence of “unknown” values in variables such as contact and poutcome; in the copy returned by fetch_ucirepo these appear as missing values (NaN) rather than as text.
import pandas as pd
import numpy as np
# Compute summary statistics for the numerical columns of X
summary_stats = X.describe()
summary_stats
## age balance ... pdays previous
## count 45211.000000 45211.000000 ... 45211.000000 45211.000000
## mean 40.936210 1362.272058 ... 40.197828 0.580323
## std 10.618762 3044.765829 ... 100.128746 2.303441
## min 18.000000 -8019.000000 ... -1.000000 0.000000
## 25% 33.000000 72.000000 ... -1.000000 0.000000
## 50% 39.000000 448.000000 ... -1.000000 0.000000
## 75% 48.000000 1428.000000 ... -1.000000 0.000000
## max 95.000000 102127.000000 ... 871.000000 275.000000
##
## [8 rows x 7 columns]
print(X.shape)
## (45211, 16)
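The extent of the missing entries can be checked directly with a per-column count. The following is a minimal sketch, assuming a small hypothetical frame `toy` that stands in for `X`; its values are invented for illustration and are not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for X; the values are invented for illustration only.
toy = pd.DataFrame({
    "age": [58, 44, 33, 47],
    "job": ["management", "technician", np.nan, "blue-collar"],
    "contact": [np.nan, np.nan, "cellular", np.nan],
    "poutcome": [np.nan, "failure", np.nan, np.nan],
})

# Count missing entries per column and express them as a share of all rows.
missing = toy.isna().sum()
missing_share = (missing / len(toy)).round(2)
report = pd.DataFrame({"missing": missing, "share": missing_share})

# Show only the columns that actually contain gaps.
print(report[report["missing"] > 0])
```

Running the same two lines on `X` (that is, `X.isna().sum()`) would show which of the 16 features need imputation or a separate "unknown" category.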
The correlation heatmap visualizes the pairwise linear relationships between the numerical features in X, with values ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation). Because the target y is not part of X, the heatmap shows how features relate to each other, not how they relate to subscription; any role of call duration in customer responses needs a separate check against the encoded target. Most pairs of features are only weakly correlated: age and balance, for example, have minimal linear association with each other.
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(10, 6))
sns.heatmap(X.corr(numeric_only=True), annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap")
plt.show()
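Since the heatmap covers only the features, the link between each feature and the target has to be computed separately. One possible approach, sketched here on synthetic data, is to map the labels to 0/1 and call `DataFrame.corrwith`. The frame `toy_X`, the series `toy_y`, and the way the labels are generated are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic features; they only mimic the shape of the real columns.
duration = rng.exponential(scale=250, size=n)
age = rng.integers(18, 90, size=n)

# Hypothetical labelling rule: longer calls are more likely to end in "yes".
p_yes = 1 / (1 + np.exp(-(duration - 400) / 100))
labels = np.where(rng.random(n) < p_yes, "yes", "no")

toy_X = pd.DataFrame({"age": age, "duration": duration})
toy_y = pd.Series(labels, name="y")

# Encode the target numerically, then correlate every feature with it.
y_num = toy_y.map({"yes": 1, "no": 0})
corr_with_target = toy_X.corrwith(y_num).sort_values(ascending=False)
print(corr_with_target)
```

With a binary target this is the point-biserial correlation, so it captures only linear association; it is a first screening step rather than a measure of predictive value.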
# Plot distribution of numerical variables
X.hist(figsize=(12, 10), bins=30, edgecolor='black', layout=(3, 3))
## array([[<Axes: title={'center': 'age'}>,
## <Axes: title={'center': 'balance'}>,
## <Axes: title={'center': 'day_of_week'}>],
## [<Axes: title={'center': 'duration'}>,
## <Axes: title={'center': 'campaign'}>,
## <Axes: title={'center': 'pdays'}>],
## [<Axes: title={'center': 'previous'}>, <Axes: >, <Axes: >]],
## dtype=object)
# Set title for the entire figure
plt.suptitle("Distribution of Numerical Variables", fontsize=16)
# Show the plot
plt.show()
The distribution plot shows the spread of numerical variables in the dataset, revealing that most features exhibit right-skewed distributions with a concentration of lower values. The balance variable has extreme outliers, indicating a few clients with significantly high balances, while duration and pdays also display long tails. This suggests that a small subset of customers had longer call durations or a much higher delay since their last contact.
# Box plots to detect outliers
plt.figure(figsize=(12, 8))
sns.boxplot(data=X.select_dtypes(include=['int64']), orient="h", palette="Set2")
plt.title("Boxplot of Numerical Variables (Outlier Detection)")
plt.show()
# Plot count plots for categorical variables
categorical_columns = X.select_dtypes(include=['object']).columns
fig, axes = plt.subplots(nrows=5, ncols=2, figsize=(12, 15))
fig.suptitle("Distribution of Categorical Variables", fontsize=16)
for ax, col in zip(axes.flatten(), categorical_columns):
    sns.countplot(y=X[col], ax=ax, order=X[col].value_counts().index, palette="viridis")
    ax.set_title(col)
plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()
# Pairplot to visualize relationships between selected numerical variables
selected_features = ["age", "balance", "duration", "campaign", "previous", "pdays"]
sns.pairplot(X[selected_features], diag_kind="kde", corner=True)
plt.suptitle("Pairplot of Selected Numerical Variables", fontsize=16, y=1.02)
plt.show()
The boxplot highlights outliers in the numerical variables, which are represented as points beyond the whiskers of each box. The balance variable has the most extreme outliers, indicating that some clients have exceptionally high account balances compared to the majority. Duration and pdays also show significant outliers, suggesting that a few calls lasted much longer than average, and some clients had extremely long gaps since their last contact.
Further analysis of relationships between variables revealed some interesting patterns. For instance, longer call durations tended to occur for clients contacted fewer times during the campaign, while pdays (days since last contact) had clear clustering due to a default value of -1, indicating no prior contact with the client. Similarly, age and balance showed no strong relationship, suggesting that financial standing does not significantly vary by age group. Moreover, most clients with higher balance levels were contacted fewer times, which suggests that repeated campaign attempts were more often directed at clients with lower balances. Regarding missing data, the original CSV encodes gaps as the text value “unknown”, while the copy loaded through fetch_ucirepo reports them as missing values (NaN) in categorical features such as job, education, contact, and poutcome; these gaps might require imputation or separate treatment.
The pairplot visualizes relationships between selected numerical variables, showing scatter plots for bivariate relationships and density plots for individual distributions. The balance and campaign variables exhibit right-skewed distributions with extreme values, while pdays and previous show distinct clustering, likely due to their default values for clients with no previous contact (-1 in pdays and 0 in previous). These patterns suggest potential non-linear relationships and the presence of outliers, which may need further investigation for predictive modeling.
def remove_outliers_xy(X, y):
    """Drop rows that fall outside the 1.5 * IQR fences of any numerical column."""
    df_combined = X.copy()
    df_combined["target"] = y  # Combine features and target so rows stay aligned
    # Identify numerical columns (the string-valued target is not among them)
    numerical_columns = df_combined.select_dtypes(include=[np.number]).columns
    # Apply the IQR rule column by column; each filter runs on the already-reduced frame
    for col in numerical_columns:
        Q1 = df_combined[col].quantile(0.25)
        Q3 = df_combined[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        # Note: a column whose quartiles coincide (pdays and previous in the summary above)
        # has IQR = 0, so only rows equal to that single value are kept
        df_combined = df_combined[(df_combined[col] >= lower_bound) & (df_combined[col] <= upper_bound)]
    # Separate the cleaned features and target
    X_cleaned = df_combined.drop(columns=["target"])
    y_cleaned = df_combined["target"]
    return X_cleaned, y_cleaned
# Apply the function
X_cleaned, y_cleaned = remove_outliers_xy(X, y)
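One side effect of the IQR rule is worth illustrating. The summary statistics above report -1 for the 25%, 50%, and 75% quantiles of pdays, so its IQR is 0 and both fences collapse onto -1. The sketch below reproduces this on a hypothetical mini-sample; the series `pdays_demo` is invented for the example.

```python
import pandas as pd

# Hypothetical mini-sample: most clients were never contacted before (pdays == -1).
pdays_demo = pd.Series([-1] * 10 + [92, 871])

q1 = pdays_demo.quantile(0.25)
q3 = pdays_demo.quantile(0.75)
iqr = q3 - q1

# With identical quartiles the fences coincide with the quartile value itself.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
print("fences:", lower, upper)

# Every previously contacted client is therefore treated as an outlier.
kept = pdays_demo[(pdays_demo >= lower) & (pdays_demo <= upper)]
print(len(pdays_demo), "rows before,", len(kept), "rows after")
```

In other words, filtering pdays and previous this way removes all clients with a contact history, which likely accounts for a large part of the drop from 45,211 rows to the 28,069 rows used for modeling. Excluding these two columns from the filter would be one way to avoid that, if the intent is only to trim extreme balances and call durations.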
For this dataset, which poses an imbalanced binary classification problem, Random Forest and XGBoost are two well-suited machine learning algorithms. Random Forest is an ensemble method that builds many decision trees on bootstrap samples and combines their predictions, which makes it resistant to overfitting and efficient on large datasets. However, it tends to favor the majority class unless class weighting or resampling techniques are applied. XGBoost (Extreme Gradient Boosting) is a boosting algorithm that adds trees sequentially, each one correcting the errors of the ensemble built so far, which makes it highly effective for structured data. It does not correct for imbalance automatically, but its scale_pos_weight parameter increases the weight of positive examples during training, which can improve recall for the minority class. It is also computationally more expensive and requires hyperparameter tuning to avoid overfitting.
Since our dataset is highly imbalanced (26,504 “No” vs. 1,565 “Yes” cases), our choice of algorithm was influenced by the need to handle class imbalance effectively. While Random Forest can be adjusted with class weights, XGBoost provides built-in imbalance handling through scale_pos_weight, making it a strong candidate. If the dataset had fewer than 1,000 records, simpler models like Logistic Regression or Support Vector Machines (SVM) might be preferred, as they generalize well on small datasets and require less computation. However, given the large dataset size, XGBoost is recommended for its ability to detect subtle patterns and improve minority class recall, making it ideal for predicting customer subscription behavior.
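To make the size of the imbalance concrete, the class counts quoted above translate into a positive share and a candidate scale_pos_weight as follows. This is a small worked calculation, not the exact value used in training: the modeling code computes the ratio on the training split only, so its result differs slightly.

```python
# Class counts reported for the cleaned dataset.
n_negative = 26_504  # "no"
n_positive = 1_565   # "yes"

# Share of positive cases among all observations.
positive_share = n_positive / (n_negative + n_positive)

# Ratio of negatives to positives, the usual starting value for scale_pos_weight.
scale_pos_weight = n_negative / n_positive

print(f"positive share:   {positive_share:.3f}")   # 0.056
print(f"scale_pos_weight: {scale_pos_weight:.2f}")  # 16.94
```

A weight of roughly 17 means each subscribed client counts about as much as 17 non-subscribers in the loss, which pushes the model toward higher recall at the expense of precision.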
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Check data types
print(X_cleaned.dtypes)
## age int64
## job object
## marital object
## education object
## default object
## balance int64
## housing object
## loan object
## contact object
## day_of_week int64
## month object
## duration int64
## campaign int64
## pdays int64
## previous int64
## poutcome object
## dtype: object
X_cleaned.head()
## age job marital education ... campaign pdays previous poutcome
## 0 58 management married tertiary ... 1 -1 0 NaN
## 1 44 technician single secondary ... 1 -1 0 NaN
## 2 33 entrepreneur married secondary ... 1 -1 0 NaN
## 3 47 blue-collar married NaN ... 1 -1 0 NaN
## 4 33 NaN single NaN ... 1 -1 0 NaN
##
## [5 rows x 16 columns]
# Check unique values
print(y_cleaned.isnull().sum()) # Count missing values in y
## 0
print(y_cleaned.unique()) # Check unique values
## ['no' 'yes']
# If there are any NA value fill them up with mode
#y_cleaned.fillna(y_cleaned.mode()[0], inplace=True)
y_cleaned = y_cleaned.map({'yes': 1, 'no': 0})
from sklearn.preprocessing import OneHotEncoder
# Convert categorical columns using One-Hot Encoding
X_cleaned = pd.get_dummies(X_cleaned, drop_first=True) # Convert categorical variables
y_cleaned.value_counts()
## target
## 0 26504
## 1 1565
## Name: count, dtype: int64
# Ensure the target variable is correctly formatted (convert categorical values to numeric if necessary)
y_cleaned = y_cleaned.squeeze() # Convert to a Series if it's stored as a DataFrame
#y_cleaned = y_cleaned.map({'yes': 1, 'no': 0}) # Convert labels to 1 (yes) and 0 (no)
# Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X_cleaned, y_cleaned, test_size=0.2, random_state=42)
# Initialize and train the Random Forest model
from sklearn.model_selection import cross_val_score
rf_model = RandomForestClassifier(n_estimators=100, random_state=42,class_weight="balanced" )
rf_model.fit(X_train, y_train)
RandomForestClassifier(class_weight='balanced', random_state=42)
# Predict on the test set
y_pred_rf = rf_model.predict(X_test)
cv_scores = cross_val_score(rf_model, X_cleaned, y_cleaned, cv=5)
# Evaluate Random Forest model
print("Random Forest Model Performance:")
## Random Forest Model Performance:
print("Accuracy:", accuracy_score(y_test, y_pred_rf))
## Accuracy: 0.9453152832205202
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_rf))
## Confusion Matrix:
## [[5280 17]
## [ 290 27]]
print("Classification Report:\n", classification_report(y_test, y_pred_rf))
## Classification Report:
## precision recall f1-score support
##
## 0 0.95 1.00 0.97 5297
## 1 0.61 0.09 0.15 317
##
## accuracy 0.95 5614
## macro avg 0.78 0.54 0.56 5614
## weighted avg 0.93 0.95 0.93 5614
print("Cross-validation accuracy:", cv_scores.mean())
## Cross-validation accuracy: 0.8664008071750076
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Split dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_cleaned, y_cleaned, test_size=0.2, random_state=42, stratify=y_cleaned)
# Initialize and train XGBoost model with imbalance handling
xgb_model = XGBClassifier(
    n_estimators=200,  # Number of boosting rounds
    max_depth=6,  # Maximum depth of each tree
    learning_rate=0.1,  # Step size shrinkage to prevent overfitting
    scale_pos_weight=len(y_train[y_train == 0]) / len(y_train[y_train == 1]),  # Ratio of negatives to positives
    # use_label_encoder is deprecated in recent XGBoost releases and is omitted here
    eval_metric="logloss",
    random_state=42
)
# Train the model
xgb_model.fit(X_train, y_train)
XGBClassifier(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=None, device=None, early_stopping_rounds=None, enable_categorical=False, eval_metric='logloss', feature_types=None, gamma=None, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=0.1, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=6, max_leaves=None, min_child_weight=None, missing=nan, monotone_constraints=None, multi_strategy=None, n_estimators=200, n_jobs=None, num_parallel_tree=None, random_state=42, ...)
# Predict on test data
y_pred_xgb = xgb_model.predict(X_test)
# Evaluate model performance
accuracy_xgb = accuracy_score(y_test, y_pred_xgb)
conf_matrix_xgb = confusion_matrix(y_test, y_pred_xgb)
class_report_xgb = classification_report(y_test, y_pred_xgb)
# Display results
accuracy_xgb, conf_matrix_xgb, class_report_xgb
## (0.8920555753473459, array([[4803, 498],
## [ 108, 205]]), ' precision recall f1-score support\n\n 0 0.98 0.91 0.94 5301\n 1 0.29 0.65 0.40 313\n\n accuracy 0.89 5614\n macro avg 0.63 0.78 0.67 5614\nweighted avg 0.94 0.89 0.91 5614\n')
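The two confusion matrices printed above can be compared on the same footing by deriving the key metrics from their four cells. The helper `summarize` below is a hypothetical name introduced for this sketch; the counts are copied from the outputs above.

```python
def summarize(tn, fp, fn, tp):
    """Derive headline metrics from the four cells of a binary confusion matrix."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        # Accuracy of a trivial model that predicts "no" for every client.
        "always_no_accuracy": (tn + fp) / total,
    }

# Cell counts taken from the confusion matrices printed above.
rf = summarize(tn=5280, fp=17, fn=290, tp=27)
xgb = summarize(tn=4803, fp=498, fn=108, tp=205)

for name, metrics in (("Random Forest", rf), ("XGBoost", xgb)):
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

The Random Forest reaches 0.945 accuracy but only 0.085 recall, barely above the 0.944 obtained by always predicting "no". XGBoost has lower accuracy (0.892) but recovers 0.655 of the subscribers, at a precision of 0.292.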
While high accuracy may seem ideal, on an imbalanced dataset it is a poor indicator of true predictive power. In this case, the XGBoost model reached about 89% accuracy on the test set, which is lower than the roughly 94% that a model predicting “no” for every client would achieve (5,301 of the 5,614 test cases are “no”). The more informative figures are those for the “yes” class: a recall of 0.65 and a precision of 0.29, compared with a recall of 0.09 for the Random Forest. XGBoost therefore finds far more subscribers, at the cost of many false positives. Overfitting remains a risk: with a large number of trees (n_estimators=200) and fairly deep trees (max_depth=6), a boosted model can learn noise or irrelevant patterns in the training data rather than generalizing to unseen data, and comparing training and test scores is needed to rule this out. Data leakage, where the test set is unintentionally influenced by training data, would also inflate the scores. A model that overfits will perform exceptionally well on training data but fail on new, unseen data, making it unreliable for actual business applications.
If a model ever does report near-perfect accuracy, that is more often a sign of overfitting or leakage than of genuine skill; only in rare cases does it mean that the dataset follows very clear and predictable patterns. In either case, it is crucial to validate the model using cross-validation techniques to ensure that the performance holds across different subsets of data. Even if a dataset is highly structured, real-world uncertainties such as data drift, unseen patterns, and noise make it nearly impossible for a model to consistently achieve perfect accuracy. To improve model generalization, techniques like early stopping, regularization (L1/L2), reducing tree depth, and adjusting boosting rounds can be used to prevent overfitting while maintaining strong predictive power. The goal of a good machine learning model is not just high accuracy, but the ability to make reliable and generalizable predictions on future data.
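The cross-validation check described above could look roughly like the following sketch. It runs on synthetic data from `make_classification` rather than on the bank data, and it uses a Random Forest as a stand-in estimator; `X_demo`, `y_demo`, and all parameter values are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced data (about 6% positives) standing in for X_cleaned / y_cleaned.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, n_informative=4,
    weights=[0.94, 0.06], random_state=42,
)

# Stratified, shuffled folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(
    n_estimators=50, max_depth=5, class_weight="balanced", random_state=42
)

# Report recall next to accuracy so minority-class performance stays visible.
scores = cross_validate(model, X_demo, y_demo, cv=cv, scoring=["accuracy", "recall"])
print("accuracy per fold:", scores["test_accuracy"].round(3))
print("recall per fold:  ", scores["test_recall"].round(3))
```

Large differences between folds, or between these scores and the single train/test split, would point to an unstable or overfitted model. Shuffling matters here because the rows are ordered by date, so unshuffled folds would each cover a different period.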
To ensure high-quality data for modeling, data cleaning is essential to address inconsistencies and missing values. Categorical variables such as contact and poutcome contain "unknown" values, which the loaded data reports as missing (NaN). These should be either encoded as a separate category or grouped into an "Other" category to avoid misleading the model. Additionally, outliers identified in balance, duration, and pdays can skew the model’s learning process, so they should be treated by clipping extreme values or applying log transformations to reduce the skew of their distributions. To improve model efficiency and generalization, dimensionality reduction can help: correlation analysis to remove highly correlated numerical features, and feature selection methods like Recursive Feature Elimination (RFE) to drop low-importance variables.
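Clipping and log transformation can be sketched as follows. The balance values are the minimum, quartiles, and maximum from the summary statistics above; the duration values and the 1st/99th percentile cut-offs are assumptions made for the example.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    # Minimum, quartiles, and maximum of balance from the summary statistics.
    "balance": [-8019, 72, 448, 1428, 102127],
    # Hypothetical call durations in seconds.
    "duration": [5, 103, 180, 319, 4918],
})

# Clip balance to its own 1st and 99th percentiles (the cut-offs are an assumption).
low, high = toy["balance"].quantile([0.01, 0.99])
toy["balance_clipped"] = toy["balance"].clip(lower=low, upper=high)

# log1p compresses the long right tail of a non-negative variable such as duration.
toy["duration_log"] = np.log1p(toy["duration"])

print(toy)
```

Unlike row removal, both transformations keep every client in the data. A plain logarithm is not applicable to balance because it takes negative values, which is why clipping is used for that column here.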
Beyond data cleaning, feature engineering plays a crucial role in improving model performance. New interaction features, such as duration per campaign attempt, can capture key customer behaviors, while grouping low-frequency categories in job, education, and contact type can improve generalization. Additionally, binning numerical data into categories like “Young,” “Middle-aged,” and “Senior” can help capture non-linear relationships in variables like age and balance. Since the dataset is highly imbalanced (26,504 “No” vs. 1,565 “Yes” cases), sampling techniques should be applied. Oversampling the minority class using SMOTE ensures the model gets enough positive samples, while undersampling the majority class can be used to reduce computational complexity. These steps collectively enhance data quality, prevent bias, and improve model accuracy in predicting customer subscription behavior.
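The interaction feature and the age bins mentioned above might be built as in this sketch. The frame `toy`, the column name `duration_per_contact`, and the bin edges (30 and 60 years) are assumptions introduced for illustration.

```python
import pandas as pd

# Hypothetical rows; campaign counts include the last contact, so they are at least 1.
toy = pd.DataFrame({
    "age": [22, 35, 47, 63, 71],
    "duration": [120, 300, 90, 600, 45],
    "campaign": [1, 3, 2, 4, 1],
})

# Interaction feature: seconds of the last call per contact made in this campaign.
toy["duration_per_contact"] = toy["duration"] / toy["campaign"]

# Bin age into three groups; the edges 30 and 60 are illustrative choices.
toy["age_group"] = pd.cut(
    toy["age"],
    bins=[17, 30, 60, 100],
    labels=["Young", "Middle-aged", "Senior"],
)

print(toy)
```

Tree-based models such as Random Forest and XGBoost can already split on raw age, so the binned column is likely to matter more for interpretability than for accuracy.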
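from imblearn.over_sampling import SMOTE
# Option A: oversample the minority class, using the training split only
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)
# Option B: keep the original training split and reweight the positive class instead
scale_pos_weight = len(y_train[y_train == 0]) / len(y_train[y_train == 1])
xgb_model = XGBClassifier(scale_pos_weight=scale_pos_weight)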
In conclusion, the Exploratory Data Analysis (EDA) provided critical insights into the dataset, highlighting the strong class imbalance and the presence of outliers in key numerical features such as balance, duration, and pdays. Given the nature of the problem, a binary classification task predicting customer subscription, we selected Random Forest and XGBoost as well-suited machine learning models. While Random Forest is robust, XGBoost recovered far more of the minority class when the imbalance was handled through scale_pos_weight. However, the initial model results showed that accuracy alone is misleading here: the Random Forest reached about 95% accuracy while identifying only 9% of subscribers, and XGBoost traded accuracy (about 89%) for much higher recall (65%) at low precision (29%). This suggests the need for further pre-processing adjustments to improve minority-class performance and model generalization.
To properly train the model and ensure accurate, real-world predictions, several pre-processing steps are required. Data cleaning involves addressing "unknown" values in categorical variables and treating extreme outliers using clipping or transformation techniques. Dimensionality reduction through correlation analysis and feature selection helps to remove redundant features and speed up training. Feature engineering should be applied to create new interaction variables (e.g., duration per campaign attempt) and to bin numerical variables to enhance model interpretability. Since the dataset is highly imbalanced (26,504 “No” vs. 1,565 “Yes” cases), the imbalance should be addressed with SMOTE oversampling of the minority class in the training set or with class weighting in XGBoost; applying both at full strength risks over-correcting toward the positive class. Data transformation techniques, such as one-hot encoding for categorical variables, must also be applied to ensure the model processes all features correctly. By implementing these pre-processing strategies, we can develop a balanced, optimized, and generalizable model capable of predicting customer subscriptions while avoiding overfitting.