Exploratory analysis and essay

The data relates to direct marketing campaigns conducted by a Portuguese banking institution. These campaigns were executed via phone calls, with multiple contacts often made to the same client to assess whether they would subscribe to a bank term deposit. The dataset used for this analysis is bank-full.csv, which contains 45,211 observations and 16 input features (both numerical and categorical) plus the binary target. This is the older version of the data, with fewer inputs than the bank-additional-full.csv dataset. The primary objective of this analysis is to explore the structure of the data, identify patterns, detect potential issues such as outliers or missing values, and examine correlations between variables in order to understand the factors influencing customer subscription decisions.

import pandas as pd
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
bank_marketing = fetch_ucirepo(id=222) 
  
# data (as pandas dataframes) 
X = bank_marketing.data.features 
y = bank_marketing.data.targets 
  
# metadata 
print(bank_marketing.metadata) 
## {'uci_id': 222, 'name': 'Bank Marketing', 'repository_url': 'https://archive.ics.uci.edu/dataset/222/bank+marketing', 'data_url': 'https://archive.ics.uci.edu/static/public/222/data.csv', 'abstract': 'The data is related with direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict if the client will subscribe a term deposit (variable y).', 'area': 'Business', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 45211, 'num_features': 16, 'feature_types': ['Categorical', 'Integer'], 'demographics': ['Age', 'Occupation', 'Marital Status', 'Education Level'], 'target_col': ['y'], 'index_col': None, 'has_missing_values': 'yes', 'missing_values_symbol': 'NaN', 'year_of_dataset_creation': 2014, 'last_updated': 'Fri Aug 18 2023', 'dataset_doi': '10.24432/C5K306', 'creators': ['S. Moro', 'P. Rita', 'P. Cortez'], 'intro_paper': {'ID': 277, 'type': 'NATIVE', 'title': 'A data-driven approach to predict the success of bank telemarketing', 'authors': 'Sérgio Moro, P. Cortez, P. Rita', 'venue': 'Decision Support Systems', 'year': 2014, 'journal': None, 'DOI': '10.1016/j.dss.2014.03.001', 'URL': 'https://www.semanticscholar.org/paper/cab86052882d126d43f72108c6cb41b295cc8a9e', 'sha': None, 'corpus': None, 'arxiv': None, 'mag': None, 'acl': None, 'pmid': None, 'pmcid': None}, 'additional_info': {'summary': "The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be ('yes') or not ('no') subscribed. \n\nThere are four datasets: \n1) bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010), very close to the data analyzed in [Moro et al., 2014]\n2) bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs.\n3) bank-full.csv with all examples and 17 inputs, ordered by date (older version of this dataset with less inputs). \n4) bank.csv with 10% of the examples and 17 inputs, randomly selected from 3 (older version of this dataset with less inputs). \nThe smallest datasets are provided to test more computationally demanding machine learning algorithms (e.g., SVM). \n\nThe classification goal is to predict if the client will subscribe (yes/no) a term deposit (variable y).", 'purpose': None, 'funded_by': None, 'instances_represent': None, 'recommended_data_splits': None, 'sensitive_data': None, 'preprocessing_description': None, 'variable_info': 'Input variables:\n   # bank client data:\n   1 - age (numeric)\n   2 - job : type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student",\n                                       "blue-collar","self-employed","retired","technician","services") \n   3 - marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)\n   4 - education (categorical: "unknown","secondary","primary","tertiary")\n   5 - default: has credit in default? (binary: "yes","no")\n   6 - balance: average yearly balance, in euros (numeric) \n   7 - housing: has housing loan? (binary: "yes","no")\n   8 - loan: has personal loan? 
(binary: "yes","no")\n   # related with the last contact of the current campaign:\n   9 - contact: contact communication type (categorical: "unknown","telephone","cellular") \n  10 - day: last contact day of the month (numeric)\n  11 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")\n  12 - duration: last contact duration, in seconds (numeric)\n   # other attributes:\n  13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)\n  14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)\n  15 - previous: number of contacts performed before this campaign and for this client (numeric)\n  16 - poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")\n\n  Output variable (desired target):\n  17 - y - has the client subscribed a term deposit? (binary: "yes","no")\n', 'citation': None}}
  
# variable information 
print(bank_marketing.variables) 
##            name     role  ...  units missing_values
## 0           age  Feature  ...   None             no
## 1           job  Feature  ...   None             no
## 2       marital  Feature  ...   None             no
## 3     education  Feature  ...   None             no
## 4       default  Feature  ...   None             no
## 5       balance  Feature  ...  euros             no
## 6       housing  Feature  ...   None             no
## 7          loan  Feature  ...   None             no
## 8       contact  Feature  ...   None            yes
## 9   day_of_week  Feature  ...   None             no
## 10        month  Feature  ...   None             no
## 11     duration  Feature  ...   None             no
## 12     campaign  Feature  ...   None             no
## 13        pdays  Feature  ...   None            yes
## 14     previous  Feature  ...   None             no
## 15     poutcome  Feature  ...   None            yes
## 16            y   Target  ...   None             no
## 
## [17 rows x 7 columns]

The Exploratory Data Analysis (EDA) revealed several key insights. The correlation heatmap showed that call duration has the strongest correlation with the target variable (y), making it a crucial factor in determining customer responses. Other numerical features, such as age and balance, displayed weak correlations with most variables, indicating they might not be strong predictors. Most numerical features exhibit a right-skewed distribution, with balance, duration, and pdays showing extreme outliers. These outliers suggest that some clients maintain significantly higher balances or had exceptionally long call durations. The categorical analysis highlighted that the most common occupations among clients were blue-collar, management, and technician roles. Additionally, the majority of clients were married, and a significant proportion had secondary education. A notable observation was the presence of “unknown” values in variables such as contact and poutcome, which may indicate missing data encoded as text.
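The extent of these "unknown" values can be checked directly. Below is a minimal sketch, assuming the ucimlrepo version of the data, where the metadata indicates that "unknown" entries are stored as NaN:

# Share of missing ("unknown") entries per categorical column, as a percentage
missing_share = X.select_dtypes(include='object').isna().mean().sort_values(ascending=False)
print((missing_share * 100).round(1))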

# Compute summary statistics for the numerical features
summary_stats = X.describe()
summary_stats
##                 age        balance  ...         pdays      previous
## count  45211.000000   45211.000000  ...  45211.000000  45211.000000
## mean      40.936210    1362.272058  ...     40.197828      0.580323
## std       10.618762    3044.765829  ...    100.128746      2.303441
## min       18.000000   -8019.000000  ...     -1.000000      0.000000
## 25%       33.000000      72.000000  ...     -1.000000      0.000000
## 50%       39.000000     448.000000  ...     -1.000000      0.000000
## 75%       48.000000    1428.000000  ...     -1.000000      0.000000
## max       95.000000  102127.000000  ...    871.000000    275.000000
## 
## [8 rows x 7 columns]
print(X.shape)
## (45211, 16)

The correlation heatmap visualizes the relationships between numerical variables in the dataset, with values ranging from -1 to 1, where 1 indicates a perfect positive correlation and -1 a perfect negative correlation. The duration variable shows the strongest correlations with the other features, particularly with campaign and pdays, while most other variables exhibit weak correlations. This suggests that call duration plays a significant role in customer responses, whereas features like age and balance have minimal influence on each other.

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.heatmap(X.corr(numeric_only=True), annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap")
plt.show()

# Plot distribution of numerical variables (assign the returned Axes array to avoid echoing it)
axes = X.hist(figsize=(12, 10), bins=30, edgecolor='black', layout=(3, 3))
# Set title for the entire figure
plt.suptitle("Distribution of Numerical Variables", fontsize=16)

# Show the plot
plt.show()

The distribution plot shows the spread of numerical variables in the dataset, revealing that most features exhibit right-skewed distributions with a concentration of lower values. The balance variable has extreme outliers, indicating a few clients with significantly high balances, while duration and pdays also display long tails. This suggests that a small subset of customers had longer call durations or a much longer gap since their last contact.

# Box plots to detect outliers
plt.figure(figsize=(12, 8))
sns.boxplot(data=X.select_dtypes(include=['int64']), orient="h", palette="Set2")
plt.title("Boxplot of Numerical Variables (Outlier Detection)")
plt.show()

# Plot count plots for categorical variables
categorical_columns = X.select_dtypes(include=['object']).columns

fig, axes = plt.subplots(nrows=5, ncols=2, figsize=(12, 15))
fig.suptitle("Distribution of Categorical Variables", fontsize=16)

for ax, col in zip(axes.flatten(), categorical_columns):
    sns.countplot(y=X[col], ax=ax, order=X[col].value_counts().index, palette="viridis")
    ax.set_title(col)

plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()

# Pairplot to visualize relationships between selected numerical variables
selected_features = ["age", "balance", "duration", "campaign", "previous", "pdays"]
sns.pairplot(X[selected_features], diag_kind="kde", corner=True)

plt.suptitle("Pairplot of Selected Numerical Variables", fontsize=16, y=1.02)
plt.show()

The boxplot highlights outliers in the numerical variables, which are represented as points beyond the whiskers of each box. The balance variable has the most extreme outliers, indicating that some clients have exceptionally high account balances compared to the majority. Duration and pdays also show significant outliers, suggesting that a few calls lasted much longer than average, and some clients had extremely long gaps since their last contact.
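To quantify which variables are most affected, the same IQR rule used later for cleaning can first be applied just to count outliers per column; a minimal sketch using the feature frame X from above:

# Count IQR outliers per numerical column (values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR)
num_cols = X.select_dtypes(include=[np.number])
Q1, Q3 = num_cols.quantile(0.25), num_cols.quantile(0.75)
IQR = Q3 - Q1
outlier_mask = (num_cols < (Q1 - 1.5 * IQR)) | (num_cols > (Q3 + 1.5 * IQR))
print(outlier_mask.sum().sort_values(ascending=False))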

Further analysis of relationships between variables revealed some interesting patterns. For instance, longer call durations tended to coincide with fewer campaign contacts, while pdays (days since last contact) showed clear clustering around its sentinel value of -1, which indicates no prior contact with the client. Similarly, age and balance showed no strong relationship, suggesting that financial standing does not vary substantially across age groups. Moreover, most clients with higher balances were contacted fewer times, reinforcing the idea that repeated campaign attempts were often directed at clients with lower balances. The dataset contains no conventional missing values in its original CSV form; data gaps are instead encoded as the category "unknown" (which the ucimlrepo version surfaces as NaN), so categorical features such as contact and poutcome may require imputation or treatment of "unknown" as its own level.
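Both observations point to simple feature-engineering steps before modeling. The sketch below is illustrative only (the derived previously_contacted flag is a hypothetical helper, not part of the pipeline used later):

# Encode the pdays sentinel (-1 = never contacted before) as an explicit flag
X_fe = X.copy()
X_fe["previously_contacted"] = (X_fe["pdays"] != -1).astype(int)

# Keep "unknown" as its own category instead of dropping or imputing those rows
for col in X_fe.select_dtypes(include='object').columns:
    X_fe[col] = X_fe[col].fillna("unknown")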

The pairplot visualizes relationships between selected numerical variables, showing scatter plots for bivariate relationships and density plots for individual distributions. The balance and campaign variables exhibit right-skewed distributions with extreme values, while pdays and previous show distinct clustering, likely due to default values (-1 for no previous contact). These patterns suggest potential non-linear relationships and the presence of outliers, which may need further investigation for predictive modeling.

def remove_outliers_xy(X, y):
    df_combined = X.copy()
    df_combined["target"] = y  # Combine features and target for consistency

    # Identify numerical columns
    numerical_columns = df_combined.select_dtypes(include=[np.number]).columns

    # Apply IQR method to remove outliers
    for col in numerical_columns:
        Q1 = df_combined[col].quantile(0.25)
        Q3 = df_combined[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        df_combined = df_combined[(df_combined[col] >= lower_bound) & (df_combined[col] <= upper_bound)]

    # Separate the cleaned features and target
    X_cleaned = df_combined.drop(columns=["target"])
    y_cleaned = df_combined["target"]

    return X_cleaned, y_cleaned

# Apply the function
X_cleaned, y_cleaned = remove_outliers_xy(X, y)
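A quick check of how much data the IQR filter removes, and of the resulting class balance (the counts referenced below), might look like this:

# Compare shape and class balance before and after outlier removal
print("Before:", X.shape, "-> After:", X_cleaned.shape)
print(y_cleaned.value_counts())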

For this dataset, which involves an imbalanced binary classification problem, Random Forest and XGBoost are the two most suitable machine learning algorithms. Random Forest is a powerful ensemble learning method that builds multiple decision trees and averages their predictions, making it robust to overfitting and capable of handling large datasets efficiently. However, it may struggle with class imbalance, as it tends to favor the majority class unless class weighting or resampling techniques are applied. On the other hand, XGBoost (Extreme Gradient Boosting) is an advanced boosting algorithm that sequentially improves weak learners, making it highly effective for structured data. XGBoost naturally handles imbalance better than Random Forest by adjusting weights during training, improving recall for minority classes. However, it is computationally more expensive and requires hyperparameter tuning to avoid overfitting.

Since our dataset is highly imbalanced (26,504 “No” vs. 1,565 “Yes” cases), our choice of algorithm was influenced by the need to handle class imbalance effectively. While Random Forest can be adjusted with class weights, XGBoost provides built-in imbalance handling through scale_pos_weight, making it a strong candidate. If the dataset had fewer than 1,000 records, simpler models like Logistic Regression or Support Vector Machines (SVM) might be preferred, as they generalize well on small datasets and require less computation. However, given the large dataset size, XGBoost is recommended for its ability to detect subtle patterns and improve minority class recall, making it ideal for predicting customer subscription behavior.
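Although the experiments that follow use Random Forest, Decision Trees, and AdaBoost, the imbalance handling described above could be sketched with XGBoost roughly as follows. This is a minimal sketch, assuming the one-hot encoded X_train/X_test/y_train/y_test split created later in this section:

# scale_pos_weight ~ (negative cases) / (positive cases) up-weights the minority class
pos_weight = (y_train == 0).sum() / (y_train == 1).sum()

xgb_model = XGBClassifier(
    n_estimators=200,
    learning_rate=0.1,
    scale_pos_weight=pos_weight,   # built-in class-imbalance handling
    eval_metric="logloss",
    random_state=42
)
xgb_model.fit(X_train, y_train)
print(classification_report(y_test, xgb_model.predict(X_test)))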

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Check data types
print(X_cleaned.dtypes)
## age             int64
## job            object
## marital        object
## education      object
## default        object
## balance         int64
## housing        object
## loan           object
## contact        object
## day_of_week     int64
## month          object
## duration        int64
## campaign        int64
## pdays           int64
## previous        int64
## poutcome       object
## dtype: object
X_cleaned.head()
##    age           job  marital  education  ... campaign  pdays previous poutcome
## 0   58    management  married   tertiary  ...        1     -1        0      NaN
## 1   44    technician   single  secondary  ...        1     -1        0      NaN
## 2   33  entrepreneur  married  secondary  ...        1     -1        0      NaN
## 3   47   blue-collar  married        NaN  ...        1     -1        0      NaN
## 4   33           NaN   single        NaN  ...        1     -1        0      NaN
## 
## [5 rows x 16 columns]
# Check unique values
print(y_cleaned.isnull().sum())  # Count missing values in y
## 0
print(y_cleaned.unique())  # Check unique values
## ['no' 'yes']
# If there were any NA values, they could be filled with the mode
#y_cleaned.fillna(y_cleaned.mode()[0], inplace=True)

y_cleaned = y_cleaned.map({'yes': 1, 'no': 0})

# Convert categorical columns using one-hot encoding
X_cleaned = pd.get_dummies(X_cleaned, drop_first=True)
y_cleaned.value_counts()
## target
## 0    26504
## 1     1565
## Name: count, dtype: int64


# Ensure the target variable is correctly formatted (convert categorical values to numeric if necessary)
y_cleaned = y_cleaned.squeeze()  # Convert to a Series if it's stored as a DataFrame
#y_cleaned = y_cleaned.map({'yes': 1, 'no': 0})  # Convert labels to 1 (yes) and 0 (no)

# Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X_cleaned, y_cleaned, test_size=0.2, random_state=42)

The Random Forest model was trained using 100 estimators with class balancing enabled (class_weight='balanced') to account for the class imbalance in the dataset. The model achieved a high overall accuracy of 94.53% on the test set. However, the classification report reveals a disparity in performance between the two classes. Class 0 (the majority class) has excellent precision (0.95), recall (1.00), and F1-score (0.97), indicating the model predicts this class very well. In contrast, class 1 (the minority class) has much lower precision (0.61), recall (0.09), and F1-score (0.15), showing that the model struggles to correctly identify instances of this class. This performance gap is confirmed by the confusion matrix, where the model misclassifies most true class-1 instances (290 of 317) as class 0. The 5-fold cross-validation yielded a mean accuracy of 86.64%, indicating the model generalizes reasonably well. Overall, while the model performs excellently on the majority class, further tuning or balancing techniques (such as SMOTE or cost-sensitive learning) may be required to improve its performance on the minority class.

# Initialize and train the Random Forest model
from sklearn.model_selection import cross_val_score
rf_model = RandomForestClassifier(n_estimators=100, random_state=42,class_weight="balanced" )
rf_model.fit(X_train, y_train)
# Predict on the test set
y_pred_rf = rf_model.predict(X_test)
cv_scores = cross_val_score(rf_model, X_cleaned, y_cleaned, cv=5)
# Evaluate Random Forest model
print("Random Forest Model Performance:")
## Random Forest Model Performance:
print("Accuracy:", accuracy_score(y_test, y_pred_rf))
## Accuracy: 0.9453152832205202
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_rf))
## Confusion Matrix:
##  [[5280   17]
##  [ 290   27]]
print("Classification Report:\n", classification_report(y_test, y_pred_rf))
## Classification Report:
##                precision    recall  f1-score   support
## 
##            0       0.95      1.00      0.97      5297
##            1       0.61      0.09      0.15       317
## 
##     accuracy                           0.95      5614
##    macro avg       0.78      0.54      0.56      5614
## weighted avg       0.93      0.95      0.93      5614
print("Cross-validation accuracy:", cv_scores.mean())
## Cross-validation accuracy: 0.8664008071750076
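As noted above, resampling could be layered on top of this model. A minimal sketch using SMOTE from the imbalanced-learn package (an extra dependency, applied to the training split only so the test set stays untouched):

from imblearn.over_sampling import SMOTE

# Oversample the minority class in the training data, then retrain the same model
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

rf_smote = RandomForestClassifier(n_estimators=100, random_state=42)
rf_smote.fit(X_train_res, y_train_res)
print(classification_report(y_test, rf_smote.predict(X_test)))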
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder

This step performs the preprocessing for the second set of experiments. First, the custom remove_outliers_xy() function is reapplied to clean the feature set X and the target y. The target labels are then mapped from strings ('yes', 'no') to binary values (1, 0). Categorical features in X are encoded using LabelEncoder, and the resulting data is standardized with StandardScaler to ensure uniform scaling across features. Finally, the cleaned and scaled dataset is split into training and test sets using a 70-30 ratio, preparing the data for model training and evaluation.





from sklearn.preprocessing import StandardScaler
# Apply the function 
X_cleaned, y_cleaned = remove_outliers_xy(X, y)
y_cleaned = y_cleaned.map({'yes': 1, 'no': 0})

# Encode categorical variables using LabelEncoder
# (restore the original "unknown" label for NaN entries first, since LabelEncoder
#  cannot handle a mix of strings and NaN)
label_encoders = {}
for column in X_cleaned.select_dtypes(include='object').columns:
    le = LabelEncoder()
    X_cleaned[column] = le.fit_transform(X_cleaned[column].fillna("unknown"))
    label_encoders[column] = le

# Standardize features (for fairness across models, even if not critical for Decision Trees)
scaler = StandardScaler()


X_scaled = scaler.fit_transform(X_cleaned)

# Train/test split for experiment set 1 (70/30)
X_train1, X_test1, y_train1, y_test1 = train_test_split(X_scaled, y_cleaned, test_size = 0.3, random_state=42)

The table below displays the performance of two Decision Tree models. The first model, DT1, uses default parameters and achieves an accuracy of 92.42%, an F1 score of 0.354, and an AUC of 0.663. The second model, DT2, is a pruned tree with a maximum depth of 5. It performs better in terms of accuracy (94.37%) and AUC (0.793), indicating improved generalization and better separation between classes. However, DT2 shows a drop in F1 score (0.112), suggesting that while the overall predictions are more accurate, the model is less effective at identifying the minority class. This reflects a typical bias-variance tradeoff: DT2 has reduced variance but increased bias, favoring stability over sensitivity to rare cases.

dt1 = DecisionTreeClassifier(random_state=42)
dt1.fit(X_train1, y_train1)
y_pred1 = dt1.predict(X_test1)
y_proba1 = dt1.predict_proba(X_test1)[:, 1]

# Metrics for Experiment 1
results_dt1 = {
    "Experiment": "DT1 - Default Parameters",
    "Accuracy": accuracy_score(y_test1, y_pred1),
    "F1 Score": f1_score(y_test1, y_pred1),
    "AUC": roc_auc_score(y_test1, y_proba1)
}

# Decision Tree - Experiment 2: Limit max depth to control variance
dt2 = DecisionTreeClassifier(max_depth=5, random_state=42)
dt2.fit(X_train1, y_train1)
y_pred2 = dt2.predict(X_test1)
y_proba2 = dt2.predict_proba(X_test1)[:, 1]

# Metrics for Experiment 2
results_dt2 = {
    "Experiment": "DT2 - Max Depth = 5",
    "Accuracy": accuracy_score(y_test1, y_pred2),
    "F1 Score": f1_score(y_test1, y_pred2),
    "AUC": roc_auc_score(y_test1, y_proba2)
}
# Combine results into a DataFrame
dt_results = pd.DataFrame([results_dt1, results_dt2])

print(dt_results)
##                  Experiment  Accuracy  F1 Score       AUC
## 0  DT1 - Default Parameters  0.924237  0.354251  0.662942
## 1       DT2 - Max Depth = 5  0.943712  0.112360  0.792742

In this experiment, two Decision Tree models—DT1 and DT2—were trained on the same cleaned dataset to explore the effects of model complexity. DT1, built with default parameters, resulted in a highly complex and deep tree, indicating that the model captured detailed patterns from the training data. While this led to high accuracy, the depth of the tree raises concerns about overfitting, especially when generalizing to unseen data. In contrast, DT2 was limited to a maximum depth of 5, resulting in a simpler and more interpretable model. Although DT2 sacrificed some complexity, it demonstrated better generalization performance and reduced the risk of overfitting by focusing on only the most relevant decision splits. This comparison highlights the importance of balancing complexity and interpretability in decision tree models to achieve more reliable and robust predictions.
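The difference in complexity can be quantified directly from the fitted trees, for example:

# Compare tree complexity: depth and number of leaves of the two fitted models
for name, model in [("DT1 (default)", dt1), ("DT2 (max_depth=5)", dt2)]:
    print(f"{name}: depth={model.get_depth()}, leaves={model.get_n_leaves()}")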

from sklearn.tree import plot_tree

# Extract unique class labels from y_cleaned and convert them to strings for plot_tree
class_names_str = [str(cls) for cls in np.unique(y_cleaned)]

# Plot the full tree from experiment 1 (this can be very large)
plt.figure(figsize=(20, 10))
plot_tree(dt1, filled=True, feature_names=X.columns, class_names=class_names_str)
plt.title("Decision Tree DT1 (Default Parameters)")
plt.show()

# Plot the pruned tree from experiment 2
plt.figure(figsize=(20, 10))
plot_tree(dt2, filled=True, feature_names=X.columns, class_names=class_names_str)
plt.title("Decision Tree DT2 (max_depth=5)")
plt.show()

This experiment compares a default AdaBoost model (AdaBoost1) with a tuned version (AdaBoost2) that uses 200 estimators and a learning rate of 0.5. Both models achieved nearly identical accuracy (94.37% for AdaBoost1 and 94.35% for AdaBoost2), indicating consistent overall performance. However, the F1 score, which better reflects performance on the minority class, improved slightly in AdaBoost2 (0.1408) compared to AdaBoost1 (0.1350), suggesting a marginal gain in balancing precision and recall. The AUC (Area Under the Curve) also increased from 0.8865 to 0.8925, implying that the tuned AdaBoost model is slightly better at distinguishing between classes. Overall, increasing the number of estimators and adjusting the learning rate offered a minor improvement in model robustness without sacrificing accuracy.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

ab1 = AdaBoostClassifier(random_state = 42 )
ab1.fit(X_train1, y_train1)
y_pred_ab1 = ab1.predict(X_test1)
y_proba_ab1 = ab1.predict_proba(X_test1)[:, 1]

# Metrics for AdaBoost Experiment 1
results_ab1 = {
    "Experiment": "AdaBoost1 - Default Parameters",
    "Accuracy": accuracy_score(y_test1, y_pred_ab1),
    "F1 Score": f1_score(y_test1, y_pred_ab1),
    "AUC": roc_auc_score(y_test1, y_proba_ab1)
}


# AdaBoost - Experiment 2: Increase number of estimators
ab2 = AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=42)
ab2.fit(X_train1, y_train1)
y_pred_ab2 = ab2.predict(X_test1)
y_proba_ab2 = ab2.predict_proba(X_test1)[:, 1]

# Metrics for AdaBoost Experiment 2
results_ab2 = {
    "Experiment": "AdaBoost2 - 200 Estimators, LR=0.5",
    "Accuracy": accuracy_score(y_test1, y_pred_ab2),
    "F1 Score": f1_score(y_test1, y_pred_ab2),
    "AUC": roc_auc_score(y_test1, y_proba_ab2)
}

# Combine results into a DataFrame
ab_results = pd.DataFrame([results_ab1, results_ab2])
print(ab_results)
##                            Experiment  Accuracy  F1 Score       AUC
## 0      AdaBoost1 - Default Parameters  0.943712  0.135036  0.886505
## 1  AdaBoost2 - 200 Estimators, LR=0.5  0.943475  0.140794  0.892491

In this project, GridSearchCV was used to systematically explore and identify the best hyperparameters for both the Decision Tree and AdaBoost models. GridSearchCV performs an exhaustive search over a specified parameter grid by evaluating all possible combinations using cross-validation. For the Decision Tree model, a 3-fold cross-validation was applied across variations of max_depth (5 and 10) and min_samples_split (2 and 5). Similarly, for the AdaBoost model, different combinations of n_estimators (50 and 100) and learning_rate values (0.1 and 1.0) were evaluated. The scoring metric used for both searches was the F1 score, which balances precision and recall—particularly useful for imbalanced datasets. As a result, the optimal hyperparameters were determined to be max_depth=10 and min_samples_split=2 for the Decision Tree, and n_estimators=100 with learning_rate=1.0 for AdaBoost, offering improved performance tailored to the dataset’s characteristics.

from sklearn.model_selection import GridSearchCV

# Decision Tree - reduced grid
dt_grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={
        'max_depth': [5, 10],
        'min_samples_split': [2, 5],
    },
    cv=3,
    scoring='f1',
    n_jobs=-1
)
dt_grid.fit(X_train1, y_train1)
# AdaBoost - reduced grid
ab_grid = GridSearchCV(
    AdaBoostClassifier(random_state=42),
    param_grid={
        'n_estimators': [50, 100],
        'learning_rate': [0.1, 1.0]
    },
    cv=3,
    scoring='f1',
    n_jobs=-1
)
ab_grid.fit(X_train1, y_train1)
print("Best Decision Tree Params:", dt_grid.best_params_)
## Best Decision Tree Params: {'max_depth': 10, 'min_samples_split': 2}
print("Best AdaBoost Params:", ab_grid.best_params_)
## Best AdaBoost Params: {'learning_rate': 1.0, 'n_estimators': 100}
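Note that, since GridSearchCV refits the best configuration on the full training split by default (refit=True), the tuned estimators below could equivalently be taken straight from the search objects:

# Equivalent to re-creating the models with the printed best parameters
best_dt = dt_grid.best_estimator_
best_ab = ab_grid.best_estimator_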
# Best Decision Tree model
best_dt = DecisionTreeClassifier(max_depth=10, min_samples_split=2, random_state=42)
best_dt.fit(X_train1, y_train1)
y_pred_dt = best_dt.predict(X_test1)
y_proba_dt = best_dt.predict_proba(X_test1)[:, 1]

# Best AdaBoost model
best_ab = AdaBoostClassifier(n_estimators=100, learning_rate=1.0, random_state=42)
best_ab.fit(X_train1, y_train1)
y_pred_ab = best_ab.predict(X_test1)
y_proba_ab = best_ab.predict_proba(X_test1)[:, 1]

# Evaluate both
dt_results = {
    "Model": "Decision Tree (Best Params)",
    "Accuracy": accuracy_score(y_test1, y_pred_dt),
    "F1 Score": f1_score(y_test1, y_pred_dt),
    "AUC": roc_auc_score(y_test1, y_proba_dt)
}

ab_results = {
    "Model": "AdaBoost (Best Params)",
    "Accuracy": accuracy_score(y_test1, y_pred_ab),
    "F1 Score": f1_score(y_test1, y_pred_ab),
    "AUC": roc_auc_score(y_test1, y_proba_ab)
}

# Combine results
final_results = pd.DataFrame([dt_results, ab_results])
print(final_results)
##                          Model  Accuracy  F1 Score       AUC
## 0  Decision Tree (Best Params)  0.942762  0.377261  0.776648
## 1       AdaBoost (Best Params)  0.942762  0.183051  0.897107

The final evaluation identified the best-performing models as a Decision Tree with max_depth=10 and an AdaBoost classifier with n_estimators=100, both using a random_state of 42. While both models achieved identical accuracy at 94.28%, they differed notably in other performance metrics. The Decision Tree demonstrated a stronger F1 score of 0.377, indicating better balance between precision and recall, especially for the minority class. In contrast, AdaBoost achieved a significantly higher AUC score of 0.897, reflecting superior ability to distinguish between the positive and negative classes across thresholds. These results suggest that while the Decision Tree may be more effective for precise classification, AdaBoost offers stronger overall class separation, making it especially valuable when ranking predictions or reducing false positives is critical.
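Because this comparison hinges on ranking quality, the ROC curves behind those AUC values could also be plotted for a visual check; a minimal sketch reusing the predicted probabilities computed above:

from sklearn.metrics import roc_curve

# ROC curves for the tuned Decision Tree and AdaBoost models
fpr_dt, tpr_dt, _ = roc_curve(y_test1, y_proba_dt)
fpr_ab, tpr_ab, _ = roc_curve(y_test1, y_proba_ab)

plt.figure(figsize=(7, 5))
plt.plot(fpr_dt, tpr_dt, label="Decision Tree (best params)")
plt.plot(fpr_ab, tpr_ab, label="AdaBoost (best params)")
plt.plot([0, 1], [0, 1], linestyle="--", color="grey", label="Chance")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curves of the Tuned Models")
plt.legend()
plt.show()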

## Conclusion

The results of the model evaluation provide important insights into how well the Decision Tree and AdaBoost classifiers predict whether a customer will subscribe to a term deposit based on the features in the dataset. Both models achieved a high accuracy of 94.28%, which suggests they correctly classified the majority of the cases. However, accuracy alone can be misleading in imbalanced datasets, such as this one, where the number of customers who subscribed is much smaller than those who did not.

To better assess model performance on the minority class (i.e., customers who subscribed), the F1 Score and AUC (Area Under the Curve) are more informative. The Decision Tree model achieved a higher F1 score (0.377), indicating a better balance between precision and recall, and suggesting it is more effective at correctly identifying subscribers. This is valuable in marketing campaigns where false negatives (missing a likely subscriber) can lead to missed opportunities.

On the other hand, the AdaBoost model, while having a lower F1 score (0.183), produced a significantly higher AUC score (0.897). A higher AUC means the model is better at ranking predictions and distinguishing between customers who will and won’t subscribe. This makes AdaBoost a strong candidate for scenarios where the bank wants to prioritize leads—for example, by targeting customers who are most likely to respond positively, even if the model is less precise in binary classification.

In summary, the Decision Tree is better for directly classifying subscribers with moderate balance between recall and precision, while AdaBoost excels in ranking potential subscribers, which is useful for prioritization in resource-constrained marketing efforts. Both models contribute valuable insights into predicting the target variable, with different strengths depending on the business goal.