The goal of this exercise is to explore a telemarketing campaign dataset from a Portuguese bank and to build a machine learning classifier that predicts whether a client will subscribe to a term deposit. In this iteration, we conduct exploratory data analysis on the bank-additional-full.csv file, examining data types, distributions, correlations, and dependencies/independencies. These explorations will enable us to make an informed recommendation on the best algorithm for the business problem, and will guide the next steps in the machine learning process, such as preprocessing.
Data
The data for this exercise comes from the paper:
Moro, S., Cortez, P., & Rita, P. (2014). A data-driven approach to predict the success of bank telemarketing. Decision Support Systems, 62, 22-31.
Original data can be found here: https://archive.ics.uci.edu/dataset/222/bank+marketing
The file used is the updated bank-additional-full.csv.
Data Types dictionary by Moro et al.
1 - age (numeric)
2 - job: type of job (categorical: “admin.”,“blue-collar”,“entrepreneur”,“housemaid”,“management”,“retired”,“self-employed”,“services”,“student”,“technician”,“unemployed”,“unknown”)
3 - marital: marital status (categorical: “divorced”,“married”,“single”,“unknown”; note: “divorced” means divorced or widowed)
4 - education (categorical: “basic.4y”,“basic.6y”,“basic.9y”,“high.school”,“illiterate”,“professional.course”,“university.degree”,“unknown”)
5 - default: has credit in default? (categorical: “no”,“yes”,“unknown”)
6 - housing: has housing loan? (categorical: “no”,“yes”,“unknown”)
7 - loan: has personal loan? (categorical: “no”,“yes”,“unknown”)
8 - contact: contact communication type (categorical: “cellular”,“telephone”)
9 - month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
10 - day_of_week: last contact day of the week (categorical: “mon”,“tue”,“wed”,“thu”,“fri”)
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=“no”). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: “failure”,“nonexistent”,“success”)
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)
21 - y: has the client subscribed to a term deposit? (binary: “yes”,“no”)
There are no missing values (NAs) in the dataset, but we know from the data dictionary that some missingness is encoded in the values themselves: for example, the number of days since last contact (pdays) uses 999 to indicate no previous contact. We also know that missing categorical data has been labeled “unknown”.
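As a quick sanity check of both codings, we can count them directly. This is a minimal sketch, assuming the file has already been read into a data frame named bank_full (as in the code below):
Code
# no true NAs are expected; missingness is encoded in the values themselves
sum(is.na(bank_full))          # expected: 0
# pdays == 999 marks clients who were not previously contacted
sum(bank_full$pdays == 999)
# total count of "unknown" labels across all columns
sum(bank_full == "unknown")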
Code
# summary for each column
summary(bank_full)
age job marital
Min. :17.00 admin. :10422 divorced: 4612
1st Qu.:32.00 blue-collar: 9254 married :24928
Median :38.00 technician : 6743 single :11568
Mean :40.02 services : 3969 unknown : 80
3rd Qu.:47.00 management : 2924
Max. :98.00 retired : 1720
(Other) : 6156
education default housing loan
university.degree :12168 no :32588 no :18622 no :33950
high.school : 9515 unknown: 8597 unknown: 990 unknown: 990
basic.9y : 6045 yes : 3 yes :21576 yes : 6248
professional.course: 5243
basic.4y : 4176
basic.6y : 2292
(Other) : 1749
contact month day_of_week duration
cellular :26144 may :13769 fri:7827 Min. : 0.0
telephone:15044 jul : 7174 mon:8514 1st Qu.: 102.0
aug : 6178 thu:8623 Median : 180.0
jun : 5318 tue:8090 Mean : 258.3
nov : 4101 wed:8134 3rd Qu.: 319.0
apr : 2632 Max. :4918.0
(Other): 2016
campaign pdays previous poutcome
Min. : 1.000 Min. : 0.0 Min. :0.000 failure : 4252
1st Qu.: 1.000 1st Qu.:999.0 1st Qu.:0.000 nonexistent:35563
Median : 2.000 Median :999.0 Median :0.000 success : 1373
Mean : 2.568 Mean :962.5 Mean :0.173
3rd Qu.: 3.000 3rd Qu.:999.0 3rd Qu.:0.000
Max. :56.000 Max. :999.0 Max. :7.000
emp.var.rate cons.price.idx cons.conf.idx euribor3m
Min. :-3.40000 Min. :92.20 Min. :-50.8 Min. :0.634
1st Qu.:-1.80000 1st Qu.:93.08 1st Qu.:-42.7 1st Qu.:1.344
Median : 1.10000 Median :93.75 Median :-41.8 Median :4.857
Mean : 0.08189 Mean :93.58 Mean :-40.5 Mean :3.621
3rd Qu.: 1.40000 3rd Qu.:93.99 3rd Qu.:-36.4 3rd Qu.:4.961
Max. : 1.40000 Max. :94.77 Max. :-26.9 Max. :5.045
nr.employed y
Min. :4964 no :36548
1st Qu.:5099 yes: 4640
Median :5191
Mean :5167
3rd Qu.:5228
Max. :5228
Code
# quick glance at counts for factors and central tendencies / distributions for numeric columns
skim_sum <- skim(bank_full)
skim_sum
Data summary

Name                      bank_full
Number of rows            41188
Number of columns         21
Column type frequency:
  factor                  11
  numeric                 10
Group variables           None
Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
job                    0              1  FALSE          12  adm: 10422, blu: 9254, tec: 6743, ser: 3969
marital                0              1  FALSE           4  mar: 24928, sin: 11568, div: 4612, unk: 80
education              0              1  FALSE           8  uni: 12168, hig: 9515, bas: 6045, pro: 5243
default                0              1  FALSE           3  no: 32588, unk: 8597, yes: 3
housing                0              1  FALSE           3  yes: 21576, no: 18622, unk: 990
loan                   0              1  FALSE           3  no: 33950, yes: 6248, unk: 990
contact                0              1  FALSE           2  cel: 26144, tel: 15044
month                  0              1  FALSE          10  may: 13769, jul: 7174, aug: 6178, jun: 5318
day_of_week            0              1  FALSE           5  thu: 8623, mon: 8514, wed: 8134, tue: 8090
poutcome               0              1  FALSE           3  non: 35563, fai: 4252, suc: 1373
y                      0              1  FALSE           2  no: 36548, yes: 4640
Variable type: numeric

skim_variable   n_missing  complete_rate     mean      sd       p0      p25      p50      p75     p100  hist
age                     0              1    40.02   10.42    17.00    32.00    38.00    47.00    98.00  ▅▇▃▁▁
duration                0              1   258.29  259.28     0.00   102.00   180.00   319.00  4918.00  ▇▁▁▁▁
campaign                0              1     2.57    2.77     1.00     1.00     2.00     3.00    56.00  ▇▁▁▁▁
pdays                   0              1   962.48  186.91     0.00   999.00   999.00   999.00   999.00  ▁▁▁▁▇
previous                0              1     0.17    0.49     0.00     0.00     0.00     0.00     7.00  ▇▁▁▁▁
emp.var.rate            0              1     0.08    1.57    -3.40    -1.80     1.10     1.40     1.40  ▁▃▁▁▇
cons.price.idx          0              1    93.58    0.58    92.20    93.08    93.75    93.99    94.77  ▁▆▃▇▂
cons.conf.idx           0              1   -40.50    4.63   -50.80   -42.70   -41.80   -36.40   -26.90  ▅▇▁▇▁
euribor3m               0              1     3.62    1.73     0.63     1.34     4.86     4.96     5.04  ▅▁▁▁▇
nr.employed             0              1  5167.04   72.25  4963.60  5099.10  5191.00  5228.10  5228.10  ▁▁▃▁▇
The skimr package provides a quick overview of the dataset, including summary statistics for numerical columns and the number of unique levels for categorical columns.
Categorical exploration
Code
# store the names of the categorical variables for exploration
categorical_vars <- names(bank_full)[sapply(bank_full, class) == "factor"]
categorical_vars <- categorical_vars[categorical_vars != "y"]  # exclude the target y
# cdf holds the categorical data
cdf <- bank_full[, categorical_vars]
Missing values in Categorical Columns
Note that poutcome has no “unknown” level; a missing previous outcome is instead coded as “nonexistent”, so it is counted separately, outside the loop.
Code
# loop through each column in cdf and print the count of "unknown" values for each feature
for (col in names(cdf)) {
  unknown_count <- sum(cdf[[col]] == "unknown")
  print(paste("Column:", col, "- Count of 'unknown':", unknown_count))
}
[1] "Column: job - Count of 'unknown': 330"
[1] "Column: marital - Count of 'unknown': 80"
[1] "Column: education - Count of 'unknown': 1731"
[1] "Column: default - Count of 'unknown': 8597"
[1] "Column: housing - Count of 'unknown': 990"
[1] "Column: loan - Count of 'unknown': 990"
[1] "Column: contact - Count of 'unknown': 0"
[1] "Column: month - Count of 'unknown': 0"
[1] "Column: day_of_week - Count of 'unknown': 0"
[1] "Column: poutcome - Count of 'unknown': 0"
Code
# same as above, but for the specific poutcome value of "nonexistent"
print(paste("Column: poutcome", "- Count of 'nonexistent':", sum(cdf[["poutcome"]] == "nonexistent")))
[1] "Column: poutcome - Count of 'nonexistent': 35563"
Code
# frequency tables for each categorical feature
for (col in names(cdf)) {
  print(paste("Frequency table for", col))
  print(table(cdf[[col]]))
  print("==============================================================")
}
[1] "Frequency table for job"
admin. blue-collar entrepreneur housemaid management
10422 9254 1456 1060 2924
retired self-employed services student technician
1720 1421 3969 875 6743
unemployed unknown
1014 330
[1] "=============================================================="
[1] "Frequency table for marital"
divorced married single unknown
4612 24928 11568 80
[1] "=============================================================="
[1] "Frequency table for education"
basic.4y basic.6y basic.9y high.school
4176 2292 6045 9515
illiterate professional.course university.degree unknown
18 5243 12168 1731
[1] "=============================================================="
[1] "Frequency table for default"
no unknown yes
32588 8597 3
[1] "=============================================================="
[1] "Frequency table for housing"
no unknown yes
18622 990 21576
[1] "=============================================================="
[1] "Frequency table for loan"
no unknown yes
33950 990 6248
[1] "=============================================================="
[1] "Frequency table for contact"
cellular telephone
26144 15044
[1] "=============================================================="
[1] "Frequency table for month"
apr aug dec jul jun mar may nov oct sep
2632 6178 182 7174 5318 546 13769 4101 718 570
[1] "=============================================================="
[1] "Frequency table for day_of_week"
fri mon thu tue wed
7827 8514 8623 8090 8134
[1] "=============================================================="
[1] "Frequency table for poutcome"
failure nonexistent success
4252 35563 1373
[1] "=============================================================="
Bar Plots
Code
# bar plot for each categorical feature
for (col in names(cdf)) {
  p <- ggplot(cdf, aes(x = .data[[col]])) +
    geom_bar(fill = "skyblue", color = "black") +  # customize bar appearance
    labs(title = paste("Barplot of", col), x = col, y = "Count") +
    theme_minimal() +  # apply minimal theme
    theme(axis.text.x = element_text(angle = 45, hjust = 1))  # rotate x-axis labels
  print(p)  # print the plot
}
Response Variable
Code
# bar plot of the target variable 'y'
ggplot(bank_full, aes(x = y)) +
  geom_bar(fill = "green", color = "black") +  # bar color customization
  labs(title = "Distribution of Target Variable (y)", x = "y", y = "Count") +
  theme_minimal() +  # minimalistic theme
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Numerical exploration
Code
# use a faceted histogram for each numeric feature
# select only numeric columns from the dataset
ndf <- bank_full %>% select(where(is.numeric))
# reshape the data into long format for ggplot
ndf_long <- ndf %>% pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")
# plot histograms for the numeric variables
ggplot(ndf_long, aes(x = Value)) +
  geom_histogram(fill = "maroon", color = "black", bins = 30) +
  facet_wrap(~ Variable, scales = "free", ncol = 3) +
  theme_minimal() +
  ggtitle("Distribution of Numeric Variables") +
  labs(x = "Value", y = "Count")
Code
# box plots of the numeric variables
ggplot(ndf_long, aes(x = Value)) +
  geom_boxplot(fill = "maroon", color = "black") +
  facet_wrap(~ Variable, scales = "free", ncol = 3) +
  theme_minimal() +
  ggtitle("Boxplot of Numeric Variables") +
  labs(x = "Value", y = "Count")
The following table provides a quick overview, incorporating some notes from Moro et al.’s data dictionary. I can use this to guide my recommendations for preprocessing and model selection.
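For reference, the skewness column below can be computed directly rather than typed in by hand. A minimal sketch, assuming the e1071 package for its skewness() function and reusing ndf, the numeric columns selected earlier:
Code
# sample skewness of each numeric column, rounded to two decimals
library(e1071)
round(sapply(ndf, skewness), 2)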
Code
skew_data <- data.frame(
  Variable = c("age", "duration", "campaign", "pdays", "previous",
               "emp.var.rate", "cons.price.idx", "cons.conf.idx", "euribor3m", "nr.employed"),
  Skewness = c(0.78, 3.26, 4.76, -4.92, 3.83, -0.72, -0.23, 0.30, -0.71, -1.04),
  Action = c("Discretize", "Remove", "Discretize", "Discretize", "Discretize",
             "No Action", "No Action", "No Action", "No Action", "No Action")
)
skew_data %>%
  kable(format = "html", caption = "Skewness Analysis and Recommended Actions") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "responsive"), full_width = FALSE)
Skewness Analysis and Recommended Actions

Variable        Skewness  Action
age                 0.78  Discretize
duration            3.26  Remove
campaign            4.76  Discretize
pdays              -4.92  Discretize
previous            3.83  Discretize
emp.var.rate       -0.72  No Action
cons.price.idx     -0.23  No Action
cons.conf.idx       0.30  No Action
euribor3m          -0.71  No Action
nr.employed        -1.04  No Action
Code
# compute the correlation matrix for the numeric columns (ndf, selected earlier)
corr_matrix <- cor(ndf, use = "complete.obs")
# plot the correlation heatmap
corrplot(corr_matrix, method = "color", type = "upper",
         tl.col = "black", tl.srt = 45, addCoef.col = "black",
         col = colorRampPalette(c("blue", "white", "maroon"))(100))
euribor3m and nr.employed (r = 0.95) → Highly correlated.
emp.var.rate and euribor3m (r = 0.97) → Highly correlated.
emp.var.rate and nr.employed (r = 0.91) → Highly correlated.
emp.var.rate and cons.price.idx (r = 0.78) → Strongly correlated.
Recommendation: Consider removing one or more of these highly correlated features to reduce multicollinearity; emp.var.rate is the strongest candidate, as it is highly correlated with all three other indicators.
The socioeconomic data introduced into this dataset clearly adds collinearity, which needs to be addressed during preprocessing, either through the recommendations above or other methods not yet assessed.
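One such method is caret’s findCorrelation(), which scans a correlation matrix and suggests columns to drop. A minimal sketch, assuming the caret package; the 0.80 cutoff is an illustrative choice, not a tuned value:
Code
library(caret)
# names of numeric columns recommended for removal at |r| > 0.80
high_corr <- findCorrelation(corr_matrix, cutoff = 0.80, names = TRUE)
high_corr  # emp.var.rate and/or euribor3m are likely candidates here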
Model Recommendations
Multinomial Naïve Bayes
Our primary recommendation is a Multinomial Naïve Bayes classifier with one-hot encoded categorical features and discretized numerical features (a minimal fitting sketch follows this list). This model is well-suited for several reasons:
No Assumptions of Linearity or Normality: As none of the numerical features are normally distributed, Naïve Bayes avoids the constraints of models that require these assumptions.
Effective Handling of Multi-Category Discrete Features: Discretizing the numerical features creates multi-category discrete features, which Multinomial Naïve Bayes is designed to handle effectively. One-hot encoding handles the remaining categorical features.
Robustness to Irrelevant Features: Naïve Bayes is less sensitive to the impact of unimportant features, which is advantageous in datasets with potential noise.
Suitability for Datasets with Many Categorical Features: The dataset contains numerous categorical features, including several with ‘unknown’ values. Multinomial Naïve Bayes can manage these features without significant performance degradation.
Handling Class Imbalance: It is important to note that the target variable has only about an 11% success rate. Because of this, we will need evaluation metrics that are not accuracy-based, and we may need to resample the data. Additionally, discretizing the numerical features will make the model less sensitive to outliers.
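To make the proposed setup concrete, here is a minimal fitting sketch. It assumes the naivebayes package, which provides multinomial_naive_bayes(); the chosen features and bin edges are illustrative placeholders for the full preprocessing described in the final recommendation:
Code
library(naivebayes)
# illustrative bins for one numeric feature (cut points are not tuned)
bank_mnb <- bank_full
bank_mnb$age_bin <- cut(bank_mnb$age, breaks = c(16, 30, 45, 60, 100))
# one-hot encode a subset of categorical features plus the binned age
x <- model.matrix(y ~ job + marital + contact + poutcome + age_bin - 1, data = bank_mnb)
# Laplace smoothing guards against zero counts in sparse categories
mnb_fit <- multinomial_naive_bayes(x = x, y = bank_mnb$y, laplace = 1)
head(predict(mnb_fit, newdata = x, type = "prob"))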
Logistic Regression
Alternatively, a Logistic Regression model could be considered, as this is a binary classification task (a minimal fitting sketch follows the lists below). Logistic Regression could work effectively because:
Binary Classification Suitability: It is specifically designed for predicting binary outcomes, aligning perfectly with our goal of predicting term deposit subscriptions (yes/no).
Probability Estimates: It provides probability estimates, which can be valuable for prioritizing clients based on their likelihood of subscription.
Interpretability: It offers interpretable coefficients, allowing us to understand the impact of each feature on the probability of subscription.
Handling of One-Hot Encoded Features: It can effectively handle the binary features created by one-hot encoding our categorical variables.
Regularization Capabilities: It offers regularization (L1 or L2) to prevent overfitting, which is crucial given the increased dimensionality from one-hot encoding.
However, Logistic Regression also presents challenges:
Linearity Assumption: It assumes a linear relationship between features and the log-odds of the outcome, which may not hold true for all features.
Sensitivity to Outliers: It is sensitive to outliers, requiring careful outlier handling during preprocessing.
Multicollinearity: While we’ve addressed multicollinearity by removing emp.var.rate, any remaining collinearity could impact the model’s performance.
Class Imbalance: Just like Naïve Bayes, it requires careful handling of the 11% success rate through resampling or cost-sensitive learning.
Scaling: Numerical features must be scaled/standardized to ensure they contribute equally to the model.
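A minimal sketch reflecting several of these points (scaled numeric features, L2 regularization, and simple inverse-frequency class weights for the imbalance); it assumes the glmnet package, and the weighting scheme is an illustrative choice:
Code
library(glmnet)
# standardize the numeric columns
num_cols <- sapply(bank_full, is.numeric)
bank_lr <- bank_full
bank_lr[num_cols] <- scale(bank_lr[num_cols])
# one-hot encode the factors; drop duration and emp.var.rate per the earlier recommendations
x <- model.matrix(y ~ . - duration - emp.var.rate - 1, data = bank_lr)
y <- bank_lr$y
# inverse-frequency observation weights to counter the ~11% positive rate
w <- ifelse(y == "yes", sum(y == "no") / sum(y == "yes"), 1)
# L2-regularized (ridge, alpha = 0) logistic regression with cross-validated lambda
lr_fit <- cv.glmnet(x, y, family = "binomial", alpha = 0, weights = w)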
K-Nearest Neighbors (KNN)
*For Smaller Datasets
If this dataset were significantly smaller, K-Nearest Neighbors (KNN) could be a viable option. KNN is well-suited for smaller datasets because:
Simplicity: It’s conceptually simple and easy to implement.
Non-Parametric: It makes no assumptions about the underlying data distribution.
Adaptability: It can adapt to complex decision boundaries.
However, KNN also has significant challenges for this dataset:
Computational Cost: For larger datasets, KNN’s computational cost increases substantially.
Sensitivity to Scaling: Numerical features must be carefully scaled to prevent features with larger ranges from dominating distance calculations.
Curse of Dimensionality: With many one-hot encoded categorical features, the curse of dimensionality could severely impact performance, as distance calculations become less meaningful in high-dimensional spaces.
Memory Intensive: KNN stores all training data in memory.
Outliers: KNN is sensitive to outliers.
Final Recommendation
Given the dataset’s characteristics, including non-normally distributed numerical features, numerous categorical variables, and an 11% success rate in the target variable, I recommend implementing a Multinomial Naïve Bayes (MNB) classifier. MNB is well-suited for this scenario due to its ability to handle multi-category discrete features, its robustness to irrelevant variables, and its suitability for datasets with many categorical features.
Key Preprocessing and Model Training Steps and Considerations:
Randomize the Dataset: Shuffle the rows before splitting so that the training and test sets are representative of the whole dataset.
Remove emp.var.rate: Mitigate multicollinearity and redundancy.
Remove duration: It directly determines y and is not known before a call is made; Moro et al. recommend discarding it for a realistic predictive model.
Feature Engineering: The pdays and previous variables appear to capture overlapping information about prior contact history. I recommend consolidating them into a single binary variable indicating whether the client has been previously contacted (see the sketch after this list).
Discretize Continuous Numerical Features: Transform numerical features into discrete categories using appropriate binning methods to enhance model compatibility.
One-Hot Encode Categorical Features: Convert all categorical variables with multiple levels into a numerical format suitable for the model.
Address Class Imbalance: Apply resampling techniques (e.g., SMOTE, oversampling, undersampling) or cost-sensitive learning to adjust for the 11% success rate in the target variable.
Select Appropriate Evaluation Metrics: Use precision, recall, F1-score, and AUC-ROC, which are more reliable than accuracy for imbalanced datasets.
Retain “Unknown” Values: Keep “unknown” as its own category rather than imputing or dropping it; Multinomial Naïve Bayes (MNB) handles it as just another level.
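A minimal sketch of several of these steps (the prior-contact flag built from pdays and previous, removal of duration and emp.var.rate, and example discretization); it assumes dplyr, and the bin edges are illustrative, not tuned:
Code
library(dplyr)
bank_prep <- bank_full %>%
  mutate(
    # consolidate pdays and previous into one prior-contact indicator
    prev_contact = factor(ifelse(pdays == 999 & previous == 0, "no", "yes")),
    # discretize skewed numeric features (example cut points only)
    age_bin      = cut(age, breaks = c(16, 30, 45, 60, 100)),
    campaign_bin = cut(campaign, breaks = c(0, 1, 2, 3, 5, 60))
  ) %>%
  select(-duration, -emp.var.rate, -pdays, -previous, -age, -campaign)
summary(bank_prep$prev_contact)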
Following the above suggestions, we should be able to implement a robust, simple, and effective classifier using Multinomial Naïve Bayes (MNB). While more powerful models, such as Random Forest or neural networks, could be considered, we have not yet discussed these in class, so they are not included in these recommendations. However, I would like to reserve the possibility of incorporating them in future studies of this dataset, allowing for a comparative analysis of different models, with the understanding that preprocessing decisions differ for each model type. Ultimately, this would enable a confident, data-driven recommendation to the bank based on the models' performance outcomes.