Bank Marketing Classifier

Author

Darwhin Gomez

Bank Marketing (with social/economic context)

The goal of this exercise is to explore a telemarketing campaign dataset for a Portuguese bank and make a prediction using a machine learning classifier algorithm to determine whether a client will subscribe to a term deposit. In this iteration, we will conduct exploratory data analysis on the bank-additional-full.csv file, examining data types, distributions, correlations, and dependencies/independencies. These explorations will enable us to make an informed recommendation on the best algorithm to tackle the business problem, as well as guide the next steps in the machine learning process, such as preprocessing.

Data

The data for this exercise comes from the paper:
Moro, S., Cortez, P., & Rita, P. (2014). A data-driven approach to predict the success of bank telemarketing. Decis. Support Syst., 62, 22-31.
Original data can be found here:
https://archive.ics.uci.edu/dataset/222/bank+marketing
The file used is the updated bank-additional-full.csv.

Data Types dictionary by Moro et al.

1 - age (numeric)

2 - job: type of job (categorical: “admin.”,“blue-collar”,“entrepreneur”,“housemaid”,“management”,“retired”,“self-employed”,“services”,“student”,“technician”,“unemployed”,“unknown”)

3 - marital: marital status (categorical: “divorced”,“married”,“single”,“unknown”; note: “divorced” means divorced or widowed)

4 - education (categorical: “basic.4y”,“basic.6y”,“basic.9y”,“high.school”,“illiterate”,“professional.course”,“university.degree”,“unknown”)

5 - default: has credit in default? (categorical: “no”,“yes”,“unknown”)

6 - housing: has housing loan? (categorical: “no”,“yes”,“unknown”)

7 - loan: has personal loan? (categorical: “no”,“yes”,“unknown”)

8 - contact: contact communication type (categorical: “cellular”,“telephone”)

9 - month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)

10 - day_of_week: last contact day of the week (categorical: “mon”,“tue”,“wed”,“thu”,“fri”)

11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=“no”). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)

14 - previous: number of contacts performed before this campaign and for this client (numeric)

15 - poutcome: outcome of the previous marketing campaign (categorical: “failure”,“nonexistent”,“success”)

16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)

17 - cons.price.idx: consumer price index - monthly indicator (numeric)

18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)

19 - euribor3m: euribor 3 month rate - daily indicator (numeric)

20 - nr.employed: number of employees - quarterly indicator (numeric)

21 - y: has the client subscribed a term deposit? (binary: “yes”,“no”)

Loading the Data

Code
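# packages used throughout this analysis
library(tidyverse)   # readr, dplyr, tidyr, ggplot2
library(skimr)       # skim()
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
library(corrplot)    # corrplot()
# reading the data, specifying the data type for each column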
bank_full <- read_delim(
  "bank-additional-full.csv",
  delim = ";",
  col_types = cols(
    age = col_integer(),
    job = col_character(),
    marital = col_character(),
    education = col_character(),
    default = col_character(),
    housing = col_character(),
    loan = col_character(),
    contact = col_character(),
    month = col_character(),
    day_of_week = col_character(),
    duration = col_double(),
    campaign = col_double(),
    pdays = col_double(),
    previous = col_double(),
    poutcome = col_character(),
    emp.var.rate = col_double(),
    cons.price.idx = col_double(),
    cons.conf.idx = col_double(),
    euribor3m = col_double(),
    nr.employed = col_double(),
    y = col_character()
  ),
  locale = locale(grouping_mark = ",")
)
# converting to dataframe
bank_full <- as.data.frame(bank_full)
# converting character variables to factors
bank_full <- bank_full %>%
  mutate(across(where(is.character), as.factor))
# checking data structure
str(bank_full)
'data.frame':   41188 obs. of  21 variables:
 $ age           : int  56 57 37 40 56 45 59 41 24 25 ...
 $ job           : Factor w/ 12 levels "admin.","blue-collar",..: 4 8 8 1 8 8 1 2 10 8 ...
 $ marital       : Factor w/ 4 levels "divorced","married",..: 2 2 2 2 2 2 2 2 3 3 ...
 $ education     : Factor w/ 8 levels "basic.4y","basic.6y",..: 1 4 4 2 4 3 6 8 6 4 ...
 $ default       : Factor w/ 3 levels "no","unknown",..: 1 2 1 1 1 2 1 2 1 1 ...
 $ housing       : Factor w/ 3 levels "no","unknown",..: 1 1 3 1 1 1 1 1 3 3 ...
 $ loan          : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 3 1 1 1 1 1 ...
 $ contact       : Factor w/ 2 levels "cellular","telephone": 2 2 2 2 2 2 2 2 2 2 ...
 $ month         : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ day_of_week   : Factor w/ 5 levels "fri","mon","thu",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ duration      : num  261 149 226 151 307 198 139 217 380 50 ...
 $ campaign      : num  1 1 1 1 1 1 1 1 1 1 ...
 $ pdays         : num  999 999 999 999 999 999 999 999 999 999 ...
 $ previous      : num  0 0 0 0 0 0 0 0 0 0 ...
 $ poutcome      : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ emp.var.rate  : num  1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
 $ cons.price.idx: num  94 94 94 94 94 ...
 $ cons.conf.idx : num  -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
 $ euribor3m     : num  4.86 4.86 4.86 4.86 4.86 ...
 $ nr.employed   : num  5191 5191 5191 5191 5191 ...
 $ y             : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Exploratory Data Analysis

Counts, Missing

Code
# Check for missing values
colSums(is.na(bank_full))
           age            job        marital      education        default 
             0              0              0              0              0 
       housing           loan        contact          month    day_of_week 
             0              0              0              0              0 
      duration       campaign          pdays       previous       poutcome 
             0              0              0              0              0 
  emp.var.rate cons.price.idx  cons.conf.idx      euribor3m    nr.employed 
             0              0              0              0              0 
             y 
             0 

There are no NA values in the dataset, but the data dictionary tells us that missing information is encoded rather than left blank: pdays uses 999 to mean the client was not previously contacted, and missing categorical values are labeled “unknown”.
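To illustrate why the 999 code matters, here is a minimal sketch on a small made-up vector (pdays_toy is hypothetical, not taken from the file). It shows how the sentinel inflates the mean and how recoding it to NA gives a summary of real contact gaps only.

```r
# Hypothetical pdays values; 999 is the "never contacted" code from the data dictionary
pdays_toy <- c(999, 999, 3, 999, 6, 999, 12, 999)

# Share of clients carrying the sentinel code
never_contacted <- pdays_toy == 999
sentinel_share <- mean(never_contacted)

# Recode the sentinel as NA so it no longer behaves like a real number of days
pdays_clean <- ifelse(never_contacted, NA, pdays_toy)

raw_mean <- mean(pdays_toy)                      # dominated by the 999 code
clean_mean <- mean(pdays_clean, na.rm = TRUE)    # mean of real gaps only

print(c(sentinel_share = sentinel_share, raw_mean = raw_mean, clean_mean = clean_mean))
```

The same recoding could be applied to bank_full$pdays before summarizing it, which is one reason the mean of 962.5 reported for pdays should not be read as a typical number of days.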

Code
# summary for each column
summary(bank_full)
      age                 job            marital     
 Min.   :17.00   admin.     :10422   divorced: 4612  
 1st Qu.:32.00   blue-collar: 9254   married :24928  
 Median :38.00   technician : 6743   single  :11568  
 Mean   :40.02   services   : 3969   unknown :   80  
 3rd Qu.:47.00   management : 2924                   
 Max.   :98.00   retired    : 1720                   
                 (Other)    : 6156                   
               education        default         housing           loan      
 university.degree  :12168   no     :32588   no     :18622   no     :33950  
 high.school        : 9515   unknown: 8597   unknown:  990   unknown:  990  
 basic.9y           : 6045   yes    :    3   yes    :21576   yes    : 6248  
 professional.course: 5243                                                  
 basic.4y           : 4176                                                  
 basic.6y           : 2292                                                  
 (Other)            : 1749                                                  
      contact          month       day_of_week    duration     
 cellular :26144   may    :13769   fri:7827    Min.   :   0.0  
 telephone:15044   jul    : 7174   mon:8514    1st Qu.: 102.0  
                   aug    : 6178   thu:8623    Median : 180.0  
                   jun    : 5318   tue:8090    Mean   : 258.3  
                   nov    : 4101   wed:8134    3rd Qu.: 319.0  
                   apr    : 2632               Max.   :4918.0  
                   (Other): 2016                               
    campaign          pdays          previous            poutcome    
 Min.   : 1.000   Min.   :  0.0   Min.   :0.000   failure    : 4252  
 1st Qu.: 1.000   1st Qu.:999.0   1st Qu.:0.000   nonexistent:35563  
 Median : 2.000   Median :999.0   Median :0.000   success    : 1373  
 Mean   : 2.568   Mean   :962.5   Mean   :0.173                      
 3rd Qu.: 3.000   3rd Qu.:999.0   3rd Qu.:0.000                      
 Max.   :56.000   Max.   :999.0   Max.   :7.000                      
                                                                     
  emp.var.rate      cons.price.idx  cons.conf.idx     euribor3m    
 Min.   :-3.40000   Min.   :92.20   Min.   :-50.8   Min.   :0.634  
 1st Qu.:-1.80000   1st Qu.:93.08   1st Qu.:-42.7   1st Qu.:1.344  
 Median : 1.10000   Median :93.75   Median :-41.8   Median :4.857  
 Mean   : 0.08189   Mean   :93.58   Mean   :-40.5   Mean   :3.621  
 3rd Qu.: 1.40000   3rd Qu.:93.99   3rd Qu.:-36.4   3rd Qu.:4.961  
 Max.   : 1.40000   Max.   :94.77   Max.   :-26.9   Max.   :5.045  
                                                                   
  nr.employed     y        
 Min.   :4964   no :36548  
 1st Qu.:5099   yes: 4640  
 Median :5191              
 Mean   :5167              
 3rd Qu.:5228              
 Max.   :5228              
                           
Code
# quick glance at counts for factors and central tendencies / distributions for numerical columns
skim_sum <- skim(bank_full)
skim_sum
Data summary
Name bank_full
Number of rows 41188
Number of columns 21
_______________________
Column type frequency:
factor 11
numeric 10
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
job 0 1 FALSE 12 adm: 10422, blu: 9254, tec: 6743, ser: 3969
marital 0 1 FALSE 4 mar: 24928, sin: 11568, div: 4612, unk: 80
education 0 1 FALSE 8 uni: 12168, hig: 9515, bas: 6045, pro: 5243
default 0 1 FALSE 3 no: 32588, unk: 8597, yes: 3
housing 0 1 FALSE 3 yes: 21576, no: 18622, unk: 990
loan 0 1 FALSE 3 no: 33950, yes: 6248, unk: 990
contact 0 1 FALSE 2 cel: 26144, tel: 15044
month 0 1 FALSE 10 may: 13769, jul: 7174, aug: 6178, jun: 5318
day_of_week 0 1 FALSE 5 thu: 8623, mon: 8514, wed: 8134, tue: 8090
poutcome 0 1 FALSE 3 non: 35563, fai: 4252, suc: 1373
y 0 1 FALSE 2 no: 36548, yes: 4640

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1 40.02 10.42 17.00 32.00 38.00 47.00 98.00 ▅▇▃▁▁
duration 0 1 258.29 259.28 0.00 102.00 180.00 319.00 4918.00 ▇▁▁▁▁
campaign 0 1 2.57 2.77 1.00 1.00 2.00 3.00 56.00 ▇▁▁▁▁
pdays 0 1 962.48 186.91 0.00 999.00 999.00 999.00 999.00 ▁▁▁▁▇
previous 0 1 0.17 0.49 0.00 0.00 0.00 0.00 7.00 ▇▁▁▁▁
emp.var.rate 0 1 0.08 1.57 -3.40 -1.80 1.10 1.40 1.40 ▁▃▁▁▇
cons.price.idx 0 1 93.58 0.58 92.20 93.08 93.75 93.99 94.77 ▁▆▃▇▂
cons.conf.idx 0 1 -40.50 4.63 -50.80 -42.70 -41.80 -36.40 -26.90 ▅▇▁▇▁
euribor3m 0 1 3.62 1.73 0.63 1.34 4.86 4.96 5.04 ▅▁▁▁▇
nr.employed 0 1 5167.04 72.25 4963.60 5099.10 5191.00 5228.10 5228.10 ▁▁▃▁▇

The skimr package provides a quick overview of the dataset, including summary statistics for numerical columns and the number of unique levels for categorical columns.

Categorical exploration

Code
# storing the names of the categorical variables for exploration
categorical_vars <- names(bank_full)[sapply(bank_full, is.factor)]
categorical_vars <- categorical_vars[categorical_vars != "y"] # excludes the target y
# cdf holds the categorical data
cdf <- bank_full[, categorical_vars]

Missing values in Categorical Columns

poutcome has no “unknown” level; it uses “nonexistent” instead, so that value is counted separately, outside the loop.

Code
# loop through each column in cdf and print the count of "unknown" values for each feature
for (col in names(cdf)) {
  unknown_count <- sum(cdf[[col]] == "unknown")
  print(paste("Column:", col, "- Count of 'unknown':", unknown_count))
}
[1] "Column: job - Count of 'unknown': 330"
[1] "Column: marital - Count of 'unknown': 80"
[1] "Column: education - Count of 'unknown': 1731"
[1] "Column: default - Count of 'unknown': 8597"
[1] "Column: housing - Count of 'unknown': 990"
[1] "Column: loan - Count of 'unknown': 990"
[1] "Column: contact - Count of 'unknown': 0"
[1] "Column: month - Count of 'unknown': 0"
[1] "Column: day_of_week - Count of 'unknown': 0"
[1] "Column: poutcome - Count of 'unknown': 0"
Code
# same as above, but for the poutcome value "nonexistent"
print(paste("Column: poutcome", "- Count of 'nonexistent':", sum(cdf[["poutcome"]] == "nonexistent")))
[1] "Column: poutcome - Count of 'nonexistent': 35563"
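Raw counts are easier to compare as shares of each column. The sketch below is a vectorized alternative to the loop, shown on a small made-up data frame (toy_cdf and its values are hypothetical); on the real data the same sapply() call could be applied to cdf.

```r
# Hypothetical stand-in for cdf with a few categorical columns
toy_cdf <- data.frame(
  job     = factor(c("admin.", "unknown", "services", "technician")),
  default = factor(c("no", "unknown", "unknown", "no")),
  contact = factor(c("cellular", "telephone", "cellular", "cellular"))
)

# Proportion of "unknown" entries per column
unknown_share <- sapply(toy_cdf, function(x) mean(x == "unknown"))

# Percentages, largest first
unknown_pct <- sort(round(100 * unknown_share, 1), decreasing = TRUE)
print(unknown_pct)
```

Expressed this way, the 8597 unknowns in default stand out as roughly a fifth of all rows, whereas the 80 unknowns in marital are negligible.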
Code
# Frequency tables
for (col in names(cdf)) {
  print(paste("Frequency table for", col))
  print(table(cdf[[col]]))
  print("==============================================================")
  
}
[1] "Frequency table for job"

       admin.   blue-collar  entrepreneur     housemaid    management 
        10422          9254          1456          1060          2924 
      retired self-employed      services       student    technician 
         1720          1421          3969           875          6743 
   unemployed       unknown 
         1014           330 
[1] "=============================================================="
[1] "Frequency table for marital"

divorced  married   single  unknown 
    4612    24928    11568       80 
[1] "=============================================================="
[1] "Frequency table for education"

           basic.4y            basic.6y            basic.9y         high.school 
               4176                2292                6045                9515 
         illiterate professional.course   university.degree             unknown 
                 18                5243               12168                1731 
[1] "=============================================================="
[1] "Frequency table for default"

     no unknown     yes 
  32588    8597       3 
[1] "=============================================================="
[1] "Frequency table for housing"

     no unknown     yes 
  18622     990   21576 
[1] "=============================================================="
[1] "Frequency table for loan"

     no unknown     yes 
  33950     990    6248 
[1] "=============================================================="
[1] "Frequency table for contact"

 cellular telephone 
    26144     15044 
[1] "=============================================================="
[1] "Frequency table for month"

  apr   aug   dec   jul   jun   mar   may   nov   oct   sep 
 2632  6178   182  7174  5318   546 13769  4101   718   570 
[1] "=============================================================="
[1] "Frequency table for day_of_week"

 fri  mon  thu  tue  wed 
7827 8514 8623 8090 8134 
[1] "=============================================================="
[1] "Frequency table for poutcome"

    failure nonexistent     success 
       4252       35563        1373 
[1] "=============================================================="

Bar Plots

Code
for (col in names(cdf)) {
  p <- ggplot(cdf, aes(x = .data[[col]])) +
    geom_bar(fill = "skyblue", color = "black") + # Customize bar appearance
    labs(title = paste("Barplot of", col), x = col, y = "Count") +
    theme_minimal() + # Apply minimal theme
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels

  print(p) # Print the plot
}

Response Variable

Code
# Bar plot of categorical variable 'y'
ggplot(bank_full, aes(x = y)) +
  geom_bar(fill = "green", color = "black") +  # Bar color customization
  labs(title = "Distribution of Target Variable (y)", x = "y", y = "Count") +
  theme_minimal() +  # Minimalistic theme
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Code
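# Count of "yes" responses and total number of clients, taken from the data
y_yes <- sum(bank_full$y == "yes")
total <- nrow(bank_full)

# Compute success rate
success_rate <- y_yes / total

# Print the success rate, rounded to two decimals
print(paste("Success Rate:", round(success_rate, 2)))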
[1] "Success Rate: 0.11"
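An 11 percent success rate also fixes the accuracy of a trivial model that always predicts "no". The short sketch below works this out from the class counts reported in the summary above; the variable names are illustrative.

```r
# Class counts as reported in the summary() output for y
counts <- c(no = 36548, yes = 4640)

# Share of each class
class_share <- counts / sum(counts)
print(round(class_share, 3))

# Accuracy of a model that predicts the majority class for every client
baseline_accuracy <- max(class_share)
print(round(baseline_accuracy, 3))
```

Any classifier has to beat this baseline of about 0.887 to be useful, and even then it may be finding few of the "yes" clients, which is why accuracy is not a good primary metric here.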

Numerical Exploration

Code
# using a faceted histogram for each numerical feature
# Select only numeric columns from the dataset
ndf <- bank_full %>%
  select(where(is.numeric))

# Reshape data into long format for ggplot
ndf_long <- ndf %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

# Plot histograms for numeric variables
ggplot(ndf_long, aes(x = Value)) +
  geom_histogram(fill = "maroon", color = "black", bins = 30) +
  facet_wrap(~ Variable, scales = "free", ncol=3) +
  theme_minimal() +
  ggtitle("Distribution of Numeric Variables") +
  labs(x = "Value", y = "Count")

Code
# Box Plots
ggplot(ndf_long, aes(x = Value)) +
  geom_boxplot(fill = "maroon", color = "black") +
  facet_wrap(~ Variable, scales = "free", ncol = 3) +
  theme_minimal() +
  ggtitle("Boxplot of Numeric Variables") +
  labs(x = "Value", y = NULL) # a horizontal boxplot has no count axis

The following table summarizes the skewness of each numerical feature, incorporating some notes from Moro et al.’s data dictionary. I will use it to guide my recommendations for preprocessing and model selection.

Code
skew_data <- data.frame(
  Variable = c("age", "duration", "campaign", "pdays", "previous", "emp.var.rate", 
               "cons.price.idx", "cons.conf.idx", "euribor3m", "nr.employed"),
  Skewness = c(0.78, 3.26, 4.76, -4.92, 3.83, -0.72, -0.23, 0.30, -0.71, -1.04),
  Action = c("Discretize", "Remove", "Discretize", "Discretize", "Discretize", 
             "No Action", "No Action", "No Action", "No Action", "No Action")
)



skew_data %>%
  kable(format = "html", caption = "Skewness Analysis and Recommended Actions") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "responsive"), full_width = FALSE)
Skewness Analysis and Recommended Actions
Variable         Skewness   Action
age                  0.78   Discretize
duration             3.26   Remove
campaign             4.76   Discretize
pdays               -4.92   Discretize
previous             3.83   Discretize
emp.var.rate        -0.72   No Action
cons.price.idx      -0.23   No Action
cons.conf.idx        0.30   No Action
euribor3m           -0.71   No Action
nr.employed         -1.04   No Action
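The skewness values in the table are entered by hand, and the report does not show how they were obtained. As a sketch of one common definition, the moment-based coefficient g1 = m3 / m2^(3/2), the helper below (skewness_moment is a hypothetical name) computes it in base R. Values from a packaged estimator may differ slightly depending on the bias correction it applies.

```r
# Hypothetical helper: moment-based sample skewness, g1 = m3 / m2^(3/2)
skewness_moment <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^(3 / 2)
}

# Two small made-up vectors to show the sign convention
symmetric  <- c(1, 2, 3, 4, 5)    # no skew
right_tail <- c(1, 1, 1, 1, 10)   # long right tail, positive skew

skew_symmetric <- skewness_moment(symmetric)
skew_right <- skewness_moment(right_tail)
print(c(symmetric = skew_symmetric, right_tail = skew_right))
```

On the real data, a call such as sapply(ndf, skewness_moment) would produce one value per numeric column. A positive value indicates a long right tail (as with campaign), and a negative value a long left tail (as with pdays, because of the 999 code).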
Code
# correlation matrix of the numeric columns (ndf was created in the histogram step)
corr_matrix <- cor(ndf, use = "complete.obs")

# Plot the correlation heatmap
corrplot(corr_matrix, method = "color", type = "upper", 
         tl.col = "black", tl.srt = 45, addCoef.col = "black",
         col = colorRampPalette(c("blue", "white", "maroon"))(100))

  1. Strong Positive Correlations (Multicollinearity Concern)

    • euribor3m and nr.employed (r = 0.95) → Highly correlated.

    • emp.var.rate and euribor3m (r = 0.97) → Highly correlated.

    • emp.var.rate and nr.employed (r = 0.91) → Highly correlated.

    • emp.var.rate and cons.price.idx (r = 0.78) → Strongly correlated.

    • Recommendation: Consider removing one of these highly correlated features to reduce multicollinearity.

The socioeconomic data introduced into this dataset clearly adds collinearity, which needs to be addressed during preprocessing, either through the recommendations above or other methods not yet assessed.
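Rather than reading pairs off the heatmap, the highly correlated pairs can be listed programmatically. The sketch below does this on a small made-up data frame (toy, with columns a, b, and c, is hypothetical), and the 0.9 cutoff is an illustrative choice rather than a rule from the source.

```r
# Hypothetical numeric data: b is almost a linear function of a, c is unrelated
toy <- data.frame(
  a = 1:10,
  b = 2 * (1:10) + rep(c(0.5, -0.5), 5),
  c = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3)
)

cm <- cor(toy)

# Keep each pair once (upper triangle) and only where |r| exceeds the cutoff
cutoff <- 0.9
idx <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
print(high_pairs)
```

Applied to corr_matrix, the same steps would list the euribor3m, emp.var.rate, and nr.employed pairs noted above, which makes it easier to decide which feature to drop.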

Model Recommendations

Multinomial Naïve Bayes

The first candidate is a Multinomial Naïve Bayes classifier with one-hot encoded categorical features and discretized numerical features. This model is well-suited for several reasons:

  • No Assumptions of Linearity or Normality: As none of the numerical features are normally distributed, Naïve Bayes avoids the constraints of models that require these assumptions.

  • Effective Handling of Multi-Category Discrete Features: Discretizing the numerical features creates multi-category discrete features, which Multinomial Naïve Bayes is designed to handle effectively. One-hot encoding handles the remaining categorical features.

  • Robustness to Irrelevant Features: Naïve Bayes is less sensitive to the impact of unimportant features, which is advantageous in datasets with potential noise.  

  • Suitability for Datasets with Many Categorical Features: The dataset contains numerous categorical features, including those with ‘unknown’ levels. Multinomial Naïve Bayes can effectively manage these features without significant performance degradation.

  • Class Imbalance: Only about 11 percent of clients subscribed, so accuracy alone would be misleading. We will need evaluation metrics that are not based on accuracy, and we may need to resample the data. Separately, discretizing the numerical features will make the model less sensitive to outliers.
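To make the point about accuracy concrete, the following sketch computes accuracy, precision, recall, and F1 from a confusion matrix. The actual and predicted labels are made up for illustration (20 clients, 4 of them subscribers) and do not come from any fitted model.

```r
# Hypothetical labels: 4 true "yes" clients out of 20
actual <- factor(c(rep("yes", 4), rep("no", 16)), levels = c("no", "yes"))
pred   <- factor(c(rep("yes", 3), "no", rep("yes", 2), rep("no", 14)),
                 levels = c("no", "yes"))

# Confusion matrix: rows are predictions, columns are true labels
cm <- table(Predicted = pred, Actual = actual)

tp <- cm["yes", "yes"]   # subscribers correctly flagged
fp <- cm["yes", "no"]    # non-subscribers flagged by mistake
fn <- cm["no", "yes"]    # subscribers missed

accuracy  <- sum(diag(cm)) / sum(cm)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

print(round(c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1), 3))
```

In this toy case accuracy is 0.85 while precision is only 0.6, which shows how a high accuracy can hide weak performance on the minority class.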

Logistic Regression

Alternatively, a Logistic Regression model could be considered, as we are attempting a binary classification task. Logistic Regression could work effectively because:

  • Binary Classification Suitability: It is specifically designed for predicting binary outcomes, aligning perfectly with our goal of predicting term deposit subscriptions (yes/no).

  • Probability Estimates: It provides probability estimates, which can be valuable for prioritizing clients based on their likelihood of subscription.  

  • Interpretability: It offers interpretable coefficients, allowing us to understand the impact of each feature on the probability of subscription.

  • Handling of One-Hot Encoded Features: It can effectively handle the binary features created by one-hot encoding our categorical variables.

  • Regularization Capabilities: It offers regularization (L1 or L2) to prevent overfitting, which is crucial given the increased dimensionality from one-hot encoding.

However, Logistic Regression also presents challenges:

  • Linearity Assumption: It assumes a linear relationship between features and the log-odds of the outcome, which may not hold true for all features.  

  • Sensitivity to Outliers: It is sensitive to outliers, requiring careful outlier handling during preprocessing.

  • Multicollinearity: Removing emp.var.rate, as recommended below, addresses three of the four strongest correlations, but the remaining collinearity (for example, euribor3m and nr.employed, r = 0.95) could still destabilize the coefficient estimates.

  • Class Imbalance: Just like Naïve Bayes, it requires careful handling of the 11% success rate through resampling or cost-sensitive learning.

  • Scaling: Numerical features must be scaled/standardized to ensure they contribute equally to the model.
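The interpretability and probability points can be sketched with base R's glm(). The data below are simulated, and the column names rate and contacted are hypothetical stand-ins for a standardized economic indicator and a prior-contact flag; nothing here is fitted to the bank data.

```r
set.seed(123)

# Simulated data: outcome depends negatively on rate and positively on contacted
n <- 500
rate <- rnorm(n)                       # hypothetical standardized numeric predictor
contacted <- rbinom(n, 1, 0.2)         # hypothetical binary predictor
y <- rbinom(n, 1, plogis(-2 - 0.8 * rate + 1.5 * contacted))
sim <- data.frame(y, rate, contacted)

# Logistic regression: models the log-odds of y as a linear function of the predictors
fit <- glm(y ~ rate + contacted, data = sim, family = binomial)

# Exponentiated coefficients are odds ratios (above 1 raises the odds, below 1 lowers them)
odds_ratios <- exp(coef(fit))
print(round(odds_ratios, 2))

# Predicted probabilities can be used to rank clients by likelihood of subscribing
probs <- predict(fit, type = "response")
print(round(head(sort(probs, decreasing = TRUE)), 3))
```

On the real data the formula would use the preprocessed features, and the ranking by predicted probability is what would let the bank prioritize calls.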

K-Nearest Neighbors (KNN)

For Smaller Datasets

If this dataset were significantly smaller, K-Nearest Neighbors (KNN) could be a viable option. KNN is well-suited for smaller datasets because:

  • Simplicity: It’s conceptually simple and easy to implement.

  • Non-Parametric: It makes no assumptions about the underlying data distribution.

  • Adaptability: It can adapt to complex decision boundaries.

However, KNN also has significant challenges for this dataset:

  • Computational Cost: For larger datasets, KNN’s computational cost increases substantially.

  • Sensitivity to Scaling: Numerical features must be carefully scaled to prevent features with larger ranges from dominating distance calculations.

  • Curse of Dimensionality: With many one-hot encoded categorical features, the curse of dimensionality could severely impact performance, as distance calculations become less meaningful in high-dimensional spaces.  

  • Memory Intensive: KNN stores all training data in memory.  

  • Outliers: KNN is sensitive to outliers.
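The scaling issue is easy to demonstrate with Euclidean distances. In the sketch below, the points and the column names age and nr_employed are made up, and nearest() is a hypothetical helper that returns the closest neighbor of the row named "query".

```r
# Hypothetical points: nr_employed has a much larger numeric range than age
pts <- data.frame(
  age         = c(30, 31, 60, 45, 50),
  nr_employed = c(5000, 5050, 5010, 5200, 4900),
  row.names   = c("query", "A", "B", "C", "D")
)

# Hypothetical helper: name of the point closest to "query" (Euclidean distance)
nearest <- function(m) {
  d <- as.matrix(dist(m))["query", ]
  d <- d[names(d) != "query"]
  names(which.min(d))
}

nn_raw <- nearest(pts)            # distances dominated by nr_employed
nn_scaled <- nearest(scale(pts))  # each column standardized to mean 0, sd 1

print(c(raw = nn_raw, scaled = nn_scaled))
```

Before scaling, the nearest neighbor is B, even though B is 30 years older than the query, because the small gap in nr_employed dominates. After standardizing, the nearest neighbor becomes A, which is close on both features.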

Final Recommendation

Given the dataset’s characteristics, including non-normally distributed numerical features, numerous categorical variables, and an 11% success rate in the target variable, I recommend implementing a Multinomial Naïve Bayes (MNB) classifier. MNB is well-suited for this scenario due to its ability to handle multi-category discrete features, its robustness to irrelevant variables, and its suitability for datasets with many categorical features.
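As a rough illustration of how such a classifier scores a client, here is a hand-rolled sketch in base R. It implements categorical Naïve Bayes with Laplace smoothing, where each feature is one discrete variable. This is closely related to, but not identical to, the multinomial variant applied to one-hot columns. The training rows and the function names fit_nb and predict_nb are hypothetical.

```r
# Hypothetical training data (not taken from bank_full)
train <- data.frame(
  contact  = factor(c("cellular", "cellular", "telephone", "telephone", "cellular", "telephone")),
  poutcome = factor(c("success", "nonexistent", "nonexistent", "failure", "success", "nonexistent")),
  y        = factor(c("yes", "no", "no", "no", "yes", "no"))
)

# Fit: class priors plus Laplace-smoothed P(level | class) for every feature
fit_nb <- function(data, target, alpha = 1) {
  features <- setdiff(names(data), target)
  p <- prop.table(table(data[[target]]))
  priors <- setNames(as.numeric(p), names(p))
  cond <- lapply(features, function(f) {
    counts <- table(data[[f]], data[[target]]) + alpha
    sweep(counts, 2, colSums(counts), "/")   # each class column sums to 1
  })
  names(cond) <- features
  list(priors = priors, cond = cond, features = features)
}

# Predict: posterior class probabilities for one observation (named character vector)
predict_nb <- function(model, newobs) {
  log_post <- log(model$priors)
  for (f in model$features) {
    log_post <- log_post + log(model$cond[[f]][newobs[[f]], names(model$priors)])
  }
  post <- exp(log_post - max(log_post))      # subtract the max for numerical stability
  post / sum(post)
}

model <- fit_nb(train, "y")
post <- predict_nb(model, c(contact = "cellular", poutcome = "success"))
print(round(post, 3))
```

The smoothing term alpha keeps a level that never occurs with a class (such as "telephone" with "yes" in this toy data) from forcing the posterior to zero. In practice a packaged implementation would be preferable to this sketch.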

Key Preprocessing and Model Training Steps and Considerations:

  • Randomize the Dataset: Shuffle the rows before splitting so that the training and test sets are both representative of the full data.

  • Remove emp.var.rate: Mitigate multicollinearity and redundancy.

  • Remove duration: Call duration is only known after the call ends, at which point the outcome is also known, so Moro et al. recommend excluding it from any realistic predictive model.

  • Feature Engineering: The pdays and previous variables appear to capture overlapping information about prior contact history. I recommend consolidating them into a single binary variable indicating whether the client has been previously contacted.

  • Discretize Continuous Numerical Features: Transform numerical features into discrete categories using appropriate binning methods to enhance model compatibility.

  • One-Hot Encode Categorical Features: Convert all categorical variables with multiple levels into a numerical format suitable for the model.

  • Address Class Imbalance: Apply resampling techniques (e.g., SMOTE, oversampling, undersampling) or cost-sensitive learning to adjust for the 11% success rate in the target variable.

  • Select Appropriate Evaluation Metrics: Use precision, recall, F1-score, and AUC-ROC, which are more reliable than accuracy for imbalanced datasets.

  • Retain “Unknown” Values: Preserve data integrity, as Multinomial Naïve Bayes (MNB) can effectively handle these categories.
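Several of the steps above can be sketched together in base R. The example uses a small made-up data frame (toy) with a subset of the real column names. The age bin edges and the definition of prev_contacted as previous > 0 are illustrative assumptions, not choices made in this report.

```r
set.seed(42)

# Hypothetical rows using a subset of the dataset's column names
toy <- data.frame(
  age          = c(25, 38, 47, 61, 33, 52),
  duration     = c(120, 300, 80, 640, 210, 95),
  pdays        = c(999, 999, 6, 999, 3, 999),
  previous     = c(0, 0, 1, 0, 2, 0),
  emp.var.rate = c(1.1, 1.1, -1.8, 1.4, -1.8, 1.4),
  contact      = factor(c("cellular", "telephone", "cellular", "cellular", "telephone", "cellular")),
  y            = factor(c("no", "no", "yes", "no", "yes", "no"))
)

# 1. Shuffle the rows
prep <- toy[sample(nrow(toy)), ]

# 2. Consolidate pdays/previous into one binary flag (assumed definition: previous > 0)
prep$prev_contacted <- as.integer(prep$previous > 0)

# 3. Discretize age with illustrative bin edges
prep$age_bin <- cut(prep$age, breaks = c(-Inf, 30, 45, 60, Inf),
                    labels = c("<=30", "31-45", "46-60", ">60"))

# 4. Drop duration, emp.var.rate, and the columns replaced above
prep <- prep[, !(names(prep) %in% c("duration", "emp.var.rate", "pdays", "previous", "age"))]

# 5. One-hot encode every factor, keeping all levels (no reference level dropped)
feats <- prep[, names(prep) != "y"]
X <- model.matrix(~ . - 1, data = feats,
                  contrasts.arg = lapply(Filter(is.factor, feats), contrasts, contrasts = FALSE))
print(colnames(X))
```

Resampling for class imbalance is left out of this sketch. If used, it should be applied to the training split only, so that the test set keeps the original 11 percent success rate.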

Following the above suggestions, we should be able to implement a robust, simple, and effective classifier using Multinomial Naïve Bayes (MNB). While more powerful models, such as Random Forest or Neural Networks, could be considered, we have yet to discuss these in class and, therefore, they are not included in these recommendations. However, I would like to reserve the possibility of incorporating them in future studies of this dataset, allowing for a comparative analysis of different models, keeping in mind that the preprocessing decisions differ for each model type. Ultimately, this would enable a confident, data-driven recommendation to the bank based on the performance of the models.