DATA 622 Assignment 1: Exploratory Data Analysis

Exploratory analysis and essay

The data is related to direct marketing campaigns conducted by a Portuguese banking institution. These campaigns were executed via phone calls, with multiple contacts often made to the same client to assess whether they would subscribe to a bank term deposit. The dataset used for this analysis is bank-full.csv, which contains 45,211 observations and 17 variables (16 input features plus the target y), including both numerical and categorical variables. This dataset is an older version with fewer inputs than the bank-additional-full.csv dataset. The primary objective of this analysis is to explore the structure of the data, identify patterns, detect potential issues such as outliers or missing values, and examine correlations between variables to understand the factors influencing customer subscription decisions.

import pandas as pd
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
bank_marketing = fetch_ucirepo(id=222) 
  
# data (as pandas dataframes) 
X = bank_marketing.data.features 
y = bank_marketing.data.targets 
  
# metadata 
print(bank_marketing.metadata)

## {'uci_id': 222, 'name': 'Bank Marketing', 'repository_url': 'https://archive.ics.uci.edu/dataset/222/bank+marketing', 'data_url': 'https://archive.ics.uci.edu/static/public/222/data.csv', 'abstract': 'The data is related with direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict if the client will subscribe a term deposit (variable y).', 'area': 'Business', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 45211, 'num_features': 16, 'feature_types': ['Categorical', 'Integer'], 'demographics': ['Age', 'Occupation', 'Marital Status', 'Education Level'], 'target_col': ['y'], 'index_col': None, 'has_missing_values': 'yes', 'missing_values_symbol': 'NaN', 'year_of_dataset_creation': 2014, 'last_updated': 'Fri Aug 18 2023', 'dataset_doi': '10.24432/C5K306', 'creators': ['S. Moro', 'P. Rita', 'P. Cortez'], 'intro_paper': {'ID': 277, 'type': 'NATIVE', 'title': 'A data-driven approach to predict the success of bank telemarketing', 'authors': 'Sérgio Moro, P. Cortez, P. Rita', 'venue': 'Decision Support Systems', 'year': 2014, 'journal': None, 'DOI': '10.1016/j.dss.2014.03.001', 'URL': 'https://www.semanticscholar.org/paper/cab86052882d126d43f72108c6cb41b295cc8a9e', 'sha': None, 'corpus': None, 'arxiv': None, 'mag': None, 'acl': None, 'pmid': None, 'pmcid': None}, 'additional_info': {'summary': "The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be ('yes') or not ('no') subscribed. 
\n\nThere are four datasets: \n1) bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010), very close to the data analyzed in [Moro et al., 2014]\n2) bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs.\n3) bank-full.csv with all examples and 17 inputs, ordered by date (older version of this dataset with less inputs). \n4) bank.csv with 10% of the examples and 17 inputs, randomly selected from 3 (older version of this dataset with less inputs). \nThe smallest datasets are provided to test more computationally demanding machine learning algorithms (e.g., SVM). \n\nThe classification goal is to predict if the client will subscribe (yes/no) a term deposit (variable y).", 'purpose': None, 'funded_by': None, 'instances_represent': None, 'recommended_data_splits': None, 'sensitive_data': None, 'preprocessing_description': None, 'variable_info': 'Input variables:\n   # bank client data:\n   1 - age (numeric)\n   2 - job : type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student",\n                                       "blue-collar","self-employed","retired","technician","services") \n   3 - marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)\n   4 - education (categorical: "unknown","secondary","primary","tertiary")\n   5 - default: has credit in default? (binary: "yes","no")\n   6 - balance: average yearly balance, in euros (numeric) \n   7 - housing: has housing loan? (binary: "yes","no")\n   8 - loan: has personal loan? 
(binary: "yes","no")\n   # related with the last contact of the current campaign:\n   9 - contact: contact communication type (categorical: "unknown","telephone","cellular") \n  10 - day: last contact day of the month (numeric)\n  11 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")\n  12 - duration: last contact duration, in seconds (numeric)\n   # other attributes:\n  13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)\n  14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)\n  15 - previous: number of contacts performed before this campaign and for this client (numeric)\n  16 - poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")\n\n  Output variable (desired target):\n  17 - y - has the client subscribed a term deposit? (binary: "yes","no")\n', 'citation': None}}

  
# variable information 
print(bank_marketing.variables)

##            name     role  ...  units missing_values
## 0           age  Feature  ...   None             no
## 1           job  Feature  ...   None             no
## 2       marital  Feature  ...   None             no
## 3     education  Feature  ...   None             no
## 4       default  Feature  ...   None             no
## 5       balance  Feature  ...  euros             no
## 6       housing  Feature  ...   None             no
## 7          loan  Feature  ...   None             no
## 8       contact  Feature  ...   None            yes
## 9   day_of_week  Feature  ...   None             no
## 10        month  Feature  ...   None             no
## 11     duration  Feature  ...   None             no
## 12     campaign  Feature  ...   None             no
## 13        pdays  Feature  ...   None            yes
## 14     previous  Feature  ...   None             no
## 15     poutcome  Feature  ...   None            yes
## 16            y   Target  ...   None             no
## 
## [17 rows x 7 columns]

The Exploratory Data Analysis (EDA) revealed several key insights. Call duration is the numerical feature most strongly associated with the target variable (y), making it a crucial factor in determining customer responses; because the correlation heatmap is computed on the input features only, this relationship has to be checked separately from the heatmap. Other numerical features, such as age and balance, displayed weak correlations with most variables, indicating they might not be strong linear predictors. Most numerical features exhibit a right-skewed distribution, with balance, duration, and pdays showing extreme outliers. These outliers suggest that some clients maintain significantly higher balances or had exceptionally long call durations. The categorical analysis highlighted that the most common occupations among clients were blue-collar, management, and technician roles. Additionally, the majority of clients were married, and a significant proportion had secondary education. A notable observation was the presence of “unknown” values in variables such as contact and poutcome, which represent missing data encoded as text in the original CSV and appear as NaN in the version loaded here.

import pandas as pd
import numpy as np
# X was already loaded above via fetch_ucirepo, so no CSV read is needed here
# Compute summary statistics for the numerical features
summary_stats = X.describe()
summary_stats

##                 age        balance  ...         pdays      previous
## count  45211.000000   45211.000000  ...  45211.000000  45211.000000
## mean      40.936210    1362.272058  ...     40.197828      0.580323
## std       10.618762    3044.765829  ...    100.128746      2.303441
## min       18.000000   -8019.000000  ...     -1.000000      0.000000
## 25%       33.000000      72.000000  ...     -1.000000      0.000000
## 50%       39.000000     448.000000  ...     -1.000000      0.000000
## 75%       48.000000    1428.000000  ...     -1.000000      0.000000
## max       95.000000  102127.000000  ...    871.000000    275.000000
## 
## [8 rows x 7 columns]

print(X.shape)

## (45211, 16)

The correlation heatmap visualizes the pairwise linear relationships between the numerical features, with values ranging from -1 to 1: 1 indicates a perfect positive correlation, -1 a perfect negative correlation, and 0 no linear relationship. The duration variable shows the strongest correlation with other features, particularly with campaign and pdays, while most other variables exhibit weak correlations. Because the heatmap is computed on X alone, it does not by itself show how strongly duration relates to customer responses; it does show that features like age and balance have minimal linear association with each other.

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.heatmap(X.corr(numeric_only=True), annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap")
plt.show()
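The heatmap above is built from X only, so it cannot show how each feature relates to y. A minimal sketch of one way to check that relationship follows; the helper name numeric_corr_with_target and the toy data are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

def numeric_corr_with_target(X: pd.DataFrame, y: pd.Series) -> pd.Series:
    """Pearson correlation of each numeric column in X with a yes/no target."""
    # Encode the target as 1 (yes) / 0 (no) so a correlation can be computed
    y_num = y.map({"yes": 1, "no": 0})
    corr = X.select_dtypes(include="number").corrwith(y_num)
    # Sort by absolute strength so the most associated feature comes first
    return corr.sort_values(key=abs, ascending=False)

# Tiny made-up frame for illustration only (not the bank data)
toy_X = pd.DataFrame({
    "duration": [100, 200, 300, 400],
    "age": [30, 50, 40, 20],
    "job": ["admin.", "student", "retired", "services"],  # ignored: not numeric
})
toy_y = pd.Series(["no", "no", "yes", "yes"])

target_corr = numeric_corr_with_target(toy_X, toy_y)
print(target_corr)
```

With the objects loaded earlier, the call would presumably be numeric_corr_with_target(X, y["y"]), since the targets are returned as a one-column DataFrame.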

# Plot distribution of numerical variables
X.hist(figsize=(12, 10), bins=30, edgecolor='black', layout=(3, 3))

## array([[<Axes: title={'center': 'age'}>,
##         <Axes: title={'center': 'balance'}>,
##         <Axes: title={'center': 'day_of_week'}>],
##        [<Axes: title={'center': 'duration'}>,
##         <Axes: title={'center': 'campaign'}>,
##         <Axes: title={'center': 'pdays'}>],
##        [<Axes: title={'center': 'previous'}>, <Axes: >, <Axes: >]],
##       dtype=object)

# Set title for the entire figure
plt.suptitle("Distribution of Numerical Variables", fontsize=16)

# Show the plot
plt.show()

The distribution plot shows the spread of numerical variables in the dataset, revealing that most features exhibit right-skewed distributions with a concentration of lower values. The balance variable has extreme outliers, indicating a few clients with significantly high balances, while duration and pdays also display long tails. This suggests that a small subset of customers had longer call durations or a much higher delay since their last contact.

# Box plots to detect outliers
plt.figure(figsize=(12, 8))
sns.boxplot(data=X.select_dtypes(include=['int64']), orient="h", palette="Set2")
plt.title("Boxplot of Numerical Variables (Outlier Detection)")
plt.show()

# Plot count plots for categorical variables
categorical_columns = X.select_dtypes(include=['object']).columns

fig, axes = plt.subplots(nrows=5, ncols=2, figsize=(12, 15))
fig.suptitle("Distribution of Categorical Variables", fontsize=16)

for ax, col in zip(axes.flatten(), categorical_columns):
    sns.countplot(y=X[col], ax=ax, order=X[col].value_counts().index, palette="viridis")
    ax.set_title(col)

plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()

# Pairplot to visualize relationships between selected numerical variables
selected_features = ["age", "balance", "duration", "campaign", "previous", "pdays"]
sns.pairplot(X[selected_features], diag_kind="kde", corner=True)

plt.suptitle("Pairplot of Selected Numerical Variables", fontsize=16, y=1.02)
plt.show()

The boxplot highlights outliers in the numerical variables, which are represented as points beyond the whiskers of each box. The balance variable has the most extreme outliers, indicating that some clients have exceptionally high account balances compared to the majority. Duration and pdays also show significant outliers, suggesting that a few calls lasted much longer than average, and some clients had extremely long gaps since their last contact.

Further analysis of relationships between variables revealed some interesting patterns. For instance, longer call durations tended to occur in fewer campaigns, while pdays (days since last contact) had clear clustering due to a default value of -1, indicating no prior contact with the client. Similarly, age and balance showed no strong relationship, suggesting that financial standing does not significantly vary by age group. Moreover, most clients with higher balance levels were contacted fewer times, reinforcing the idea that multiple campaign attempts were often directed at clients with lower balances. The original CSV contains no blank cells; however, its categorical features use an “unknown” label, which appears as NaN in the data loaded through ucimlrepo (see the X_cleaned.head() output further down). These data gaps might require imputation or separate treatment.
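To quantify those gaps, a small helper along the following lines could be used. This is a sketch on made-up data: the helper name missing_summary is hypothetical, and it assumes the gaps are stored as NaN (as the X_cleaned.head() output further down suggests) rather than as the text “unknown”.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, largest first."""
    n_missing = df.isna().sum()
    out = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (100 * n_missing / len(df)).round(2),
    })
    # Keep only the columns that actually have gaps
    return out[out["n_missing"] > 0].sort_values("n_missing", ascending=False)

# Made-up rows for illustration only (not the bank data)
toy = pd.DataFrame({
    "contact": ["cellular", np.nan, np.nan, "telephone"],
    "poutcome": [np.nan, np.nan, np.nan, "success"],
    "age": [58, 44, 33, 47],
})

summary = missing_summary(toy)
print(summary)

# One possible treatment: keep the gaps as their own explicit category
filled = toy.fillna({"contact": "unknown", "poutcome": "unknown"})
```

Treating the gaps as a separate category keeps every row and lets a model learn whether the absence of information is itself predictive.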

The pairplot visualizes relationships between selected numerical variables, showing scatter plots for bivariate relationships and density plots for individual distributions. The balance and campaign variables exhibit right-skewed distributions with extreme values, while pdays and previous show distinct clustering, likely due to default values (-1 for no previous contact). These patterns suggest potential non-linear relationships and the presence of outliers, which may need further investigation for predictive modeling.

def remove_outliers_xy(X, y):
    df_combined = X.copy()
    df_combined["target"] = y  # Combine features and target for consistency

    # Identify numerical columns
    numerical_columns = df_combined.select_dtypes(include=[np.number]).columns

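    # Apply IQR method to remove outliers, one column at a time
    # Caution: pdays and previous have Q1 == Q3 (see the summary statistics above),
    # so their IQR is 0 and every client with any prior contact is dropped
    for col in numerical_columns:
        Q1 = df_combined[col].quantile(0.25)
        Q3 = df_combined[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        df_combined = df_combined[(df_combined[col] >= lower_bound) & (df_combined[col] <= upper_bound)]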

    # Separate the cleaned features and target
    X_cleaned = df_combined.drop(columns=["target"])
    y_cleaned = df_combined["target"]

    return X_cleaned, y_cleaned

# Apply the function
X_cleaned, y_cleaned = remove_outliers_xy(X, y)
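Before dropping rows, it can help to see how many each column's IQR rule would flag. The sketch below is an illustrative helper (iqr_outlier_counts is a hypothetical name) run on made-up values. Unlike the function above, it evaluates every column against the full frame rather than filtering sequentially.

```python
import pandas as pd

def iqr_outlier_counts(df: pd.DataFrame) -> pd.DataFrame:
    """Per numeric column: the IQR and the number of rows outside 1.5 * IQR."""
    rows = []
    for col in df.select_dtypes(include="number").columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        outside = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
        rows.append({"column": col, "iqr": iqr, "n_outside": int(outside.sum())})
    return pd.DataFrame(rows).set_index("column")

# Made-up values for illustration: most pdays entries are the -1 placeholder
toy = pd.DataFrame({
    "pdays": [-1, -1, -1, -1, -1, -1, -1, 90, 180, -1],
    "age": [30, 32, 35, 38, 40, 41, 43, 45, 50, 52],
})

report = iqr_outlier_counts(toy)
print(report)
```

A column whose IQR is 0, as pdays is here, flags every value that differs from the dominant one, which is a sign that the IQR rule is not a good fit for that column.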

For this dataset, which involves an imbalanced binary classification problem, Random Forest and XGBoost are two well-suited machine learning algorithms. Random Forest is a powerful ensemble learning method that builds multiple decision trees and averages their predictions, making it relatively robust to overfitting and capable of handling large datasets efficiently. However, it may struggle with class imbalance, as it tends to favor the majority class unless class weighting or resampling techniques are applied. On the other hand, XGBoost (Extreme Gradient Boosting) is an advanced boosting algorithm that sequentially adds trees, each correcting the errors of the previous ones, making it highly effective for structured data. XGBoost does not correct for imbalance automatically, but its scale_pos_weight parameter up-weights the minority class during training, which can improve recall for that class. However, it is computationally more expensive and requires hyperparameter tuning to avoid overfitting.

Since our dataset is highly imbalanced after outlier removal (26,504 “No” vs. 1,565 “Yes” cases), our choice of algorithm was influenced by the need to handle class imbalance effectively. While Random Forest can be adjusted with class weights, XGBoost provides built-in imbalance handling through scale_pos_weight, making it a strong candidate. If the dataset had fewer than 1,000 records, simpler models like Logistic Regression or Support Vector Machines (SVM) might be preferred, as they generalize well on small datasets and require less computation. However, given the large dataset size, XGBoost is recommended for its ability to detect subtle patterns and improve minority class recall, making it ideal for predicting customer subscription behavior.
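As a quick worked example of what scale_pos_weight amounts to, the arithmetic below uses the class counts quoted above. It is only a sketch: the modeling code later computes the ratio on the training split, which should give nearly the same value when the split is stratified.

```python
# Class counts quoted in the text (after outlier removal)
n_negative = 26_504  # "no"
n_positive = 1_565   # "yes"

# Common heuristic: weight positives by the negative-to-positive ratio
scale_pos_weight = n_negative / n_positive
positive_share = n_positive / (n_negative + n_positive)

print(f"scale_pos_weight is about {scale_pos_weight:.2f}")
print(f"positive share is about {positive_share:.1%}")
```

In other words, each subscriber counts roughly seventeen times as much as a non-subscriber in the loss, which compensates for subscribers making up under 6% of the cleaned data.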

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Check data types
print(X_cleaned.dtypes)

## age             int64
## job            object
## marital        object
## education      object
## default        object
## balance         int64
## housing        object
## loan           object
## contact        object
## day_of_week     int64
## month          object
## duration        int64
## campaign        int64
## pdays           int64
## previous        int64
## poutcome       object
## dtype: object

X_cleaned.head()

##    age           job  marital  education  ... campaign  pdays previous poutcome
## 0   58    management  married   tertiary  ...        1     -1        0      NaN
## 1   44    technician   single  secondary  ...        1     -1        0      NaN
## 2   33  entrepreneur  married  secondary  ...        1     -1        0      NaN
## 3   47   blue-collar  married        NaN  ...        1     -1        0      NaN
## 4   33           NaN   single        NaN  ...        1     -1        0      NaN
## 
## [5 rows x 16 columns]

# Check unique values
print(y_cleaned.isnull().sum())  # Count missing values in y

## 0

print(y_cleaned.unique())  # Check unique values

## ['no' 'yes']

# If there are any NA value fill them up with mode 
#y_cleaned.fillna(y_cleaned.mode()[0], inplace=True)

y_cleaned = y_cleaned.map({'yes': 1, 'no': 0})

# Convert categorical columns to indicator (one-hot) columns with pandas
X_cleaned = pd.get_dummies(X_cleaned, drop_first=True)

y_cleaned.value_counts()

## target
## 0    26504
## 1     1565
## Name: count, dtype: int64



# y_cleaned was already mapped to 1 (yes) / 0 (no) above
y_cleaned = y_cleaned.squeeze()  # Convert to a Series if it's stored as a one-column DataFrame

# Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X_cleaned, y_cleaned, test_size=0.2, random_state=42)

# Initialize and train the Random Forest model
from sklearn.model_selection import cross_val_score
rf_model = RandomForestClassifier(n_estimators=100, random_state=42, class_weight="balanced")
rf_model.fit(X_train, y_train)

RandomForestClassifier(class_weight='balanced', random_state=42)


# Predict on the test set
y_pred_rf = rf_model.predict(X_test)
# 5-fold cross-validated accuracy on the full cleaned dataset
cv_scores = cross_val_score(rf_model, X_cleaned, y_cleaned, cv=5)
# Evaluate Random Forest model
print("Random Forest Model Performance:")

## Random Forest Model Performance:

print("Accuracy:", accuracy_score(y_test, y_pred_rf))

## Accuracy: 0.9453152832205202

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_rf))

## Confusion Matrix:
##  [[5280   17]
##  [ 290   27]]

print("Classification Report:\n", classification_report(y_test, y_pred_rf))

## Classification Report:
##                precision    recall  f1-score   support
## 
##            0       0.95      1.00      0.97      5297
##            1       0.61      0.09      0.15       317
## 
##     accuracy                           0.95      5614
##    macro avg       0.78      0.54      0.56      5614
## weighted avg       0.93      0.95      0.93      5614

print("Cross-validation accuracy:", cv_scores.mean())

## Cross-validation accuracy: 0.8664008071750076

from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Split dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_cleaned, y_cleaned, test_size=0.2, random_state=42, stratify=y_cleaned)

# Initialize and train XGBoost model with imbalance handling
xgb_model = XGBClassifier(
    n_estimators=200,  # Number of boosting rounds
    max_depth=6,  # Depth of each tree
    learning_rate=0.1,  # Step size shrinkage to prevent overfitting
    scale_pos_weight=len(y_train[y_train == 0]) / len(y_train[y_train == 1]),  # Handling class imbalance
    eval_metric="logloss",
    random_state=42
)

# Train the model
xgb_model.fit(X_train, y_train)

XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, gamma=None, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.1, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=6,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=200,
              n_jobs=None, num_parallel_tree=None, random_state=42, ...)

# Predict on test data
y_pred_xgb = xgb_model.predict(X_test)

# Evaluate model performance
accuracy_xgb = accuracy_score(y_test, y_pred_xgb)
conf_matrix_xgb = confusion_matrix(y_test, y_pred_xgb)
class_report_xgb = classification_report(y_test, y_pred_xgb)

# Display results
print("Accuracy:", accuracy_xgb)

## Accuracy: 0.8920555753473459

print("Confusion Matrix:\n", conf_matrix_xgb)

## Confusion Matrix:
##  [[4803  498]
##  [ 108  205]]

print("Classification Report:\n", class_report_xgb)

## Classification Report:
##                precision    recall  f1-score   support
## 
##            0       0.98      0.91      0.94      5301
##            1       0.29      0.65      0.40       313
## 
##     accuracy                           0.89      5614
##    macro avg       0.63      0.78      0.67      5614
## weighted avg       0.94      0.89      0.91      5614

On the stratified test split, the XGBoost model reached 89.2% accuracy, below the Random Forest's 94.5%, but it recovered far more subscribers: recall for the “yes” class rose from 0.09 to 0.65, while precision fell from 0.61 to 0.29 (the two models were evaluated on slightly different splits, so the comparison is approximate). Accuracy alone is misleading here, because a model that always predicts “no” would already score about 94% on this test set. For the same reason, a perfect or near-perfect score should not be taken at face value: 100% accuracy is more often a sign of overfitting or data leakage than of true predictive power. Overfitting occurs when a model learns the training data too well, including noise and irrelevant patterns, and it becomes more likely with a large number of trees (n_estimators=200) and deep trees (max_depth=6). Data leakage occurs when the test set is unintentionally influenced by the training data, leading to scores that are unrealistic in real-world scenarios. A model that overfits will perform exceptionally well on training data but fail on new, unseen data, making it unreliable for actual business applications.
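The confusion matrices reported above can be turned into directly comparable numbers. The sketch below recomputes accuracy, the always-predict-the-majority baseline, and positive-class precision and recall from those matrices. The helper name summarize is illustrative, and the layout assumed is scikit-learn's [[TN, FP], [FN, TP]].

```python
import numpy as np

def summarize(cm):
    """Metrics from a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    tn, fp, fn, tp = np.asarray(cm).ravel()
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        # Accuracy of a model that always predicts the larger class
        "majority_baseline": max(tn + fp, fn + tp) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Confusion matrices copied from the outputs above
rf = summarize([[5280, 17], [290, 27]])
xgb = summarize([[4803, 498], [108, 205]])

for name, metrics in (("Random Forest", rf), ("XGBoost", xgb)):
    print(name, {k: round(float(v), 3) for k, v in metrics.items()})
```

Seen this way, the Random Forest's accuracy is barely above the majority baseline, while XGBoost trades accuracy for a much higher share of subscribers found.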

However, in rare cases where a model genuinely achieves 100% accuracy, it suggests that the dataset follows very clear and predictable patterns, meaning the model is able to make flawless predictions due to the nature of the data. In such scenarios, it is crucial to validate the model using cross-validation techniques to ensure that the performance holds across different subsets of data. Even if a dataset is highly structured, real-world uncertainties such as data drift, unseen patterns, and noise make it nearly impossible for a model to consistently achieve perfect accuracy. To improve model generalization, techniques like early stopping, regularization (L1/L2), reducing tree depth, and adjusting boosting rounds can be used to prevent overfitting while maintaining strong predictive power. The goal of a good machine learning model is not just high accuracy, but the ability to make reliable and generalizable predictions on future data.
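One way to carry out the cross-validation check described above is stratified k-fold scoring with imbalance-aware metrics instead of accuracy. The sketch below uses synthetic data from make_classification and a small Random Forest as a stand-in model; both are assumptions made to keep the example self-contained, not the setup used in this report.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced data (about 10% positives), for illustration only
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=42
)

model = RandomForestClassifier(
    n_estimators=25, class_weight="balanced", random_state=42
)

# Stratification keeps the class ratio roughly equal in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(
    model, X_demo, y_demo, cv=cv,
    scoring=["recall", "f1", "average_precision"],
)

for name in ("recall", "f1", "average_precision"):
    fold_scores = scores[f"test_{name}"]
    print(f"{name}: mean={fold_scores.mean():.3f}, std={fold_scores.std():.3f}")
```

A large spread across folds, or a mean far below the single test-split score, would suggest that the reported performance does not generalize.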

To ensure high-quality data for modeling, data cleaning is essential to address inconsistencies and potential missing values. Although the original CSV contains no blank cells, categorical variables such as contact and poutcome contain "unknown" values (loaded here as NaN) that indicate missing information. These should be either encoded separately or grouped into an "Other" category to avoid misleading the model. Additionally, outliers identified in balance, duration, and pdays can skew the model’s learning process, so they should be treated by clipping extreme values or applying log transformations to reduce their skew. To improve model efficiency and generalization, dimensionality reduction can help: correlation analysis can identify highly correlated numerical features to remove, and feature selection methods like Recursive Feature Elimination (RFE) can drop low-importance variables.
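The clipping and log-transform options mentioned above can be sketched as follows. The helper names and the 1st/99th percentile cut-offs are illustrative assumptions. Because balance can be negative, a sign-preserving variant of the log transform is used here in place of a plain logarithm.

```python
import numpy as np
import pandas as pd

def clip_to_quantiles(s: pd.Series, lower: float = 0.01, upper: float = 0.99) -> pd.Series:
    """Cap values below/above the given quantiles instead of dropping rows."""
    lo, hi = s.quantile([lower, upper])
    return s.clip(lower=lo, upper=hi)

def signed_log1p(s: pd.Series) -> pd.Series:
    """log(1 + |x|) with the original sign restored, so negatives are allowed."""
    return np.sign(s) * np.log1p(np.abs(s))

# Values taken from the balance summary statistics above, used as a toy series
balance = pd.Series([-8019, 0, 72, 448, 1428, 102127])

clipped = clip_to_quantiles(balance)
logged = signed_log1p(balance)
print(pd.DataFrame({"balance": balance, "clipped": clipped, "signed_log1p": logged}))
```

Both approaches keep every row, unlike IQR-based removal, so no class of clients is silently excluded from the training data.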

Beyond data cleaning, feature engineering plays a crucial role in improving model performance. New interaction features, such as duration per campaign attempt, can capture key customer behaviors, while grouping low-frequency categories in job, education, and contact type can improve generalization. Additionally, binning numerical data into categories like “Young,” “Middle-aged,” and “Senior” can help capture non-linear relationships in variables like age and balance. Since the dataset is highly imbalanced (26,504 “No” vs. 1,565 “Yes” cases), sampling techniques should be applied. Oversampling the minority class using SMOTE ensures the model gets enough positive samples, while undersampling the majority class can be used to reduce computational complexity. These steps collectively enhance data quality, prevent bias, and improve model accuracy in predicting customer subscription behavior.
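The feature-engineering ideas above could look roughly like this in pandas. It is a sketch on made-up rows: the function name add_features, the age cut points (30 and 60), and the rarity threshold are all assumptions chosen for illustration.

```python
import pandas as pd

def add_features(df: pd.DataFrame, min_share: float = 0.05) -> pd.DataFrame:
    """Add an interaction feature, an age bin, and a grouped job column."""
    out = df.copy()
    # campaign includes the last contact, so it is at least 1 (no division by zero)
    out["duration_per_contact"] = out["duration"] / out["campaign"]
    # Assumed cut points: under 30, 30 to 59, 60 and over
    out["age_group"] = pd.cut(
        out["age"], bins=[0, 30, 60, 120],
        labels=["Young", "Middle-aged", "Senior"], right=False,
    )
    # Replace job categories rarer than min_share with a single "other" level
    shares = out["job"].value_counts(normalize=True)
    rare = shares[shares < min_share].index
    out["job_grouped"] = out["job"].where(~out["job"].isin(rare), "other")
    return out

# Made-up rows for illustration only (not the bank data)
toy = pd.DataFrame({
    "age": [25, 45, 70, 33],
    "duration": [100, 300, 200, 50],
    "campaign": [1, 3, 2, 1],
    "job": ["management", "management", "student", "management"],
})

engineered = add_features(toy, min_share=0.3)
print(engineered)
```

Any thresholds like these would need to be chosen from the training data only, so that the test set does not influence the engineered features.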

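from imblearn.over_sampling import SMOTE

# Option 1: oversample the minority class (fit on the training split only, never the test split)
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)

# Option 2: keep the original training data and reweight the positive class instead
# Use one option or the other; combining both would overcorrect for the imbalance
scale_pos_weight = len(y_train[y_train == 0]) / len(y_train[y_train == 1])
xgb_model = XGBClassifier(scale_pos_weight=scale_pos_weight)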

Warner Alexis

2025-03-02
