bank_marketing <- read.csv("bank-additional-full.csv", sep = ";", stringsAsFactors = TRUE)
The Bank Marketing dataset originates from a study conducted by Sérgio Moro, Paulo Cortez, and Paulo Rita (2014), aiming to predict the success of bank telemarketing campaigns for term deposit subscriptions. This dataset, derived from the UCI Bank Marketing dataset, includes additional social and economic attributes obtained from Banco de Portugal, which enhance its predictive power.
The dataset consists of 41,188 client interactions collected between
May 2008 and November 2010. Each record contains bank client data,
details of the last marketing contact, and macroeconomic indicators,
culminating in a binary classification problem where the objective is to
predict whether a client subscribes to a term deposit
(y = yes/no).
Given the imbalanced nature of the target variable, a structured pre-processing approach is necessary. This document outlines the steps taken to clean, transform, and optimize the dataset for weighted Logistic Regression. The preprocessing pipeline includes handling missing data, reducing redundancy, encoding categorical variables, transforming skewed features, and addressing class imbalance to ensure model stability and interpretability.
bank_marketing |>
select(where(is.numeric)) |>
skim() |>
select(skim_variable, n_missing, complete_rate, numeric.mean, numeric.sd,
numeric.p0, numeric.p25, numeric.p50, numeric.p75, numeric.p100) |>
rename(
Variable = skim_variable,
`Missing Values` = n_missing,
`Completeness (%)` = complete_rate,
`Mean` = numeric.mean,
`Standard Deviation` = numeric.sd,
`Min` = numeric.p0,
`25th Percentile` = numeric.p25,
`Median (50th Pct)` = numeric.p50,
`75th Percentile` = numeric.p75,
`Max` = numeric.p100
) |>
kable(caption = "Summary Statistics for Numerical Variables") |>
kable_styling() |>
kable_classic()
| Variable | Missing Values | Completeness (%) | Mean | Standard Deviation | Min | 25th Percentile | Median (50th Pct) | 75th Percentile | Max |
|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 1 | 40.0240604 | 10.4212500 | 17.000 | 32.000 | 38.000 | 47.000 | 98.000 |
| duration | 0 | 1 | 258.2850102 | 259.2792488 | 0.000 | 102.000 | 180.000 | 319.000 | 4918.000 |
| campaign | 0 | 1 | 2.5675925 | 2.7700135 | 1.000 | 1.000 | 2.000 | 3.000 | 56.000 |
| pdays | 0 | 1 | 962.4754540 | 186.9109073 | 0.000 | 999.000 | 999.000 | 999.000 | 999.000 |
| previous | 0 | 1 | 0.1729630 | 0.4949011 | 0.000 | 0.000 | 0.000 | 0.000 | 7.000 |
| emp.var.rate | 0 | 1 | 0.0818855 | 1.5709597 | -3.400 | -1.800 | 1.100 | 1.400 | 1.400 |
| cons.price.idx | 0 | 1 | 93.5756644 | 0.5788400 | 92.201 | 93.075 | 93.749 | 93.994 | 94.767 |
| cons.conf.idx | 0 | 1 | -40.5026003 | 4.6281979 | -50.800 | -42.700 | -41.800 | -36.400 | -26.900 |
| euribor3m | 0 | 1 | 3.6212908 | 1.7344474 | 0.634 | 1.344 | 4.857 | 4.961 | 5.045 |
| nr.employed | 0 | 1 | 5167.0359109 | 72.2515277 | 4963.600 | 5099.100 | 5191.000 | 5228.100 | 5228.100 |
The numerical summary of the Bank Marketing Dataset indicates that all variables are fully complete (0 missing values). Age has a mean of 40 years (range: 17–98), with most individuals between 32 and 47. Duration, which measures call length, has a median of 180 seconds but a high standard deviation (259.3 sec) and a maximum of 4918 seconds, suggesting a long tail. Campaign, representing the number of contacts per client, has a median of 2 but reaches up to 56 contacts, indicating some clients were targeted repeatedly.
The pdays variable is dominated by the value 999, likely a placeholder for no prior contact, while previous, tracking past interactions, has a median of 0, meaning most clients were contacted for the first time. Economic variables show varying trends: the employment variation rate ranges from -3.4 to 1.4, and the consumer confidence index has an average of -40.5, reflecting economic uncertainty.
The euribor3m interest rate, which influences borrowing costs, has a median of 4.86%, fluctuating between 0.63% and 5.05%. The number of employed individuals is relatively stable, with a mean of 5167 and low variation, suggesting consistency in the labor market during the data collection period.
missing_data <- bank_marketing |> mutate(across(where(is.factor), ~ fct_recode(.x, NULL = "unknown")))
missing_data |>
select(!where(is.numeric)) |>
skim() |>
select(skim_variable, n_missing, complete_rate, factor.n_unique, factor.top_counts) |>
rename(
Variable = skim_variable,
`Missing Values` = n_missing,
`Completeness (%)` = complete_rate,
`Unique Categories` = factor.n_unique,
`Top Categories (Counts)` = factor.top_counts
) |>
kable(caption = "Summary Statistics for Categorical Variables") |>
kable_styling() |>
kable_classic()
| Variable | Missing Values | Completeness (%) | Unique Categories | Top Categories (Counts) |
|---|---|---|---|---|
| job | 330 | 0.9919880 | 11 | adm: 10422, blu: 9254, tec: 6743, ser: 3969 |
| marital | 80 | 0.9980577 | 3 | mar: 24928, sin: 11568, div: 4612 |
| education | 1731 | 0.9579732 | 7 | uni: 12168, hig: 9515, bas: 6045, pro: 5243 |
| default | 8597 | 0.7912742 | 2 | no: 32588, yes: 3 |
| housing | 990 | 0.9759639 | 2 | yes: 21576, no: 18622 |
| loan | 990 | 0.9759639 | 2 | no: 33950, yes: 6248 |
| contact | 0 | 1.0000000 | 2 | cel: 26144, tel: 15044 |
| month | 0 | 1.0000000 | 10 | may: 13769, jul: 7174, aug: 6178, jun: 5318 |
| day_of_week | 0 | 1.0000000 | 5 | thu: 8623, mon: 8514, wed: 8134, tue: 8090 |
| poutcome | 0 | 1.0000000 | 3 | non: 35563, fai: 4252, suc: 1373 |
| y | 0 | 1.0000000 | 2 | no: 36548, yes: 4640 |
grid.arrange(missing1, missing2, ncol = 2)
The Bank Marketing Dataset has a low overall missing-data rate of 1.5% of all cells. However, 21% of observations contain at least one missing value, driven mostly by default (20.9% missing). Education (4.2% missing), housing (2.4% missing), loan (2.4% missing), job (0.8% missing), and marital status (0.19% missing) account for the rest and may affect subsequent modeling. All missing values occur in the categorical variables, which make up 52.4% of the columns; the continuous variables (47.6% of the columns) are fully observed. No column is fully missing, and most features are complete.
Examining categorical variables reveals specific missingness patterns. ‘Job’ has 330 missing values (99.2% completeness), with admin, blue-collar, and technician being the most common categories. ‘Marital status’ is nearly complete, at 99.8%, while ‘education’ has 1,731 missing values (95.8% completeness), with university degree the most common level.
Financial variables show larger gaps. ‘Default’ status has the highest missing rate, with 8,597 missing values and 79.1% completeness; among the recorded values, ‘no’ dominates (32,588 versus 3 ‘yes’). ‘Housing’ and ‘loan’ each have 990 missing values (97.6% completeness); most clients have a housing loan and no personal loan. The remaining categorical variables (contact, month, day_of_week, and poutcome) are fully observed. The target variable (y) is imbalanced, with 36,548 non-subscriptions versus 4,640 subscriptions, indicating a low campaign success rate.
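The gap between the overall missing rate and the share of affected rows can be illustrated with a small sketch. The toy data frame below is invented for illustration (it is not the bank data), and base R is used in place of the tidyverse helpers so the snippet runs on its own.

```r
# Toy data: "unknown" plays the same role as in the bank dataset.
toy <- data.frame(
  job     = c("admin.", "unknown", "technician", "services"),
  default = c("no", "unknown", "unknown", "no"),
  age     = c(30, 41, 52, 38),
  stringsAsFactors = TRUE
)

# Recode "unknown" to NA in every factor column and drop the unused level.
toy_na <- toy
is_fct <- vapply(toy_na, is.factor, logical(1))
toy_na[is_fct] <- lapply(toy_na[is_fct], function(f) {
  f[f == "unknown"] <- NA
  droplevels(f)
})

cell_rate <- mean(is.na(toy_na))           # share of all cells that are missing
row_rate  <- mean(!complete.cases(toy_na)) # share of rows with at least one NA
col_rate  <- colMeans(is.na(toy_na))       # missing share per column

print(c(cell_rate = cell_rate, row_rate = row_rate))
print(col_rate)
```

Because a single incomplete column marks the whole row as incomplete, the row-level rate can be far higher than the cell-level rate, which is the pattern reported for default.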
age
plot_grid(
bank_marketing |> ggplot(aes(age)) + geom_histogram() ,
bank_marketing |> ggplot(aes(age)) + geom_boxplot(outlier.colour="red", outlier.shape=8,outlier.size=4) ,
ncol = 2, align = "v"
)
The histogram shows a right-skewed distribution, with most clients between 25 and 60 years old. The boxplot flags clients older than roughly 70 as outliers, but these appear to be naturally occurring ages rather than errors. The peaks at 30, 35, and 50 suggest a higher concentration of clients in these age groups, though this may simply reflect population demographics rather than targeted marketing.
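As a worked check of where the boxplot starts marking outliers, the sketch below applies the usual 1.5 × IQR rule to the age quartiles from the numerical summary table (32 and 47). The helper name `tukey_fences` is hypothetical and introduced only for this illustration.

```r
# Outlier limits under the 1.5 * IQR rule used for boxplot whiskers.
tukey_fences <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - k * iqr, upper = q3 + k * iqr)
}

# Quartiles for age taken from the summary table: Q1 = 32, Q3 = 47.
age_fences <- tukey_fences(q1 = 32, q3 = 47)
print(age_fences)
```

With an IQR of 15, the upper limit is 47 + 22.5 = 69.5, so ages above that point are drawn as outlier markers.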
campaign
plot_grid(
bank_marketing |> ggplot(aes(campaign)) + geom_histogram() + theme_minimal(),
bank_marketing |> ggplot(aes(campaign)) + geom_boxplot(outlier.colour="red", outlier.shape=8, outlier.size=4),
ncol = 2, align = "v"
)
The histogram confirms heavy right-skewness, with most clients receiving 1 or 2 contacts and a long tail extending beyond 40 contacts. The boxplot shows a large number of outliers, indicating that some customers were contacted excessively. Since most clients receive fewer than 5 calls, the extreme cases could suggest persistent but possibly ineffective marketing strategies.
cons.conf.idx
plot_grid(
bank_marketing |> ggplot(aes(cons.conf.idx)) + geom_histogram() ,
bank_marketing |> ggplot(aes(cons.conf.idx)) + geom_boxplot(outlier.colour="red", outlier.shape=8, outlier.size=4),
ncol = 2, align = "v"
)
The histogram displays discrete peaks around -50, -45, -40, and -35, suggesting that the index is reported at fixed intervals rather than varying continuously. The boxplot flags a few extreme values at the upper end of the range (the maximum is -26.9). Because these are the least negative readings, they likely correspond to specific periods of relatively stronger consumer confidence rather than random noise.
cons.price.idx
plot_grid(
bank_marketing |> ggplot(aes(cons.price.idx)) + geom_histogram(),
bank_marketing |> ggplot(aes(cons.price.idx)) + geom_boxplot(outlier.colour="red", outlier.shape=8, outlier.size=4),
ncol = 2, align = "v"
)
The histogram for cons.price.idx reveals clusters of
values around 93 and 94, rather than a continuous distribution. The
boxplot shows no extreme outliers, suggesting that the variable follows
well-defined economic reporting patterns. While numerical, it may be
more effective when treated as an economic regime indicator rather than
a standard continuous feature.
duration
plot_grid(
bank_marketing |> ggplot(aes(duration)) + geom_histogram() ,
bank_marketing |> ggplot(aes(duration)) + geom_boxplot(outlier.colour="red", outlier.shape=8, outlier.size=4) ,
ncol = 2, align = "v"
)
duration is heavily right-skewed, with a large number of
calls lasting less than 200 seconds and a long tail extending beyond
2,000 seconds. Longer call durations often indicate higher engagement
and are strongly correlated with term deposit subscriptions. However,
since duration is only known after the call, it must be
handled carefully in predictive modeling to avoid data leakage.
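One common way to tame this kind of right skew is a log transform. The sketch below uses `log1p()` so that zero-length calls stay defined, and applies it to the five-number summary of duration from the table above. It illustrates the effect only and is an assumption, not necessarily the transformation used in the final pipeline.

```r
# Five-number summary of duration (seconds) from the numerical summary table.
duration_summary <- c(min = 0, q1 = 102, median = 180, q3 = 319, max = 4918)

# log1p(x) = log(1 + x), so a duration of 0 maps to 0 instead of -Inf.
log_duration <- log1p(duration_summary)

# Compare how far the maximum sits from the median, relative to the minimum.
tail_ratio <- function(s) {
  unname((s["max"] - s["median"]) / (s["median"] - s["min"]))
}
raw_ratio <- tail_ratio(duration_summary)
log_ratio <- tail_ratio(log_duration)

print(round(log_duration, 2))
print(c(raw = raw_ratio, log = log_ratio))
```

The upper tail shrinks sharply relative to the lower half after the transform, which is the usual motivation for applying it before fitting a linear model.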
emp.var.rate
plot_grid(
bank_marketing |> ggplot(aes(emp.var.rate)) + geom_histogram() ,
bank_marketing |> ggplot(aes(emp.var.rate)) + geom_boxplot(outlier.colour="red", outlier.shape=8, outlier.size=4) ,
ncol = 2, align = "v"
)
The histogram does not resemble a continuous distribution but instead has distinct peaks at -3, -2, 0, and 1, suggesting that employment variation follows structured economic periods rather than random fluctuations. The boxplot shows no significant outliers, reinforcing that it is best interpreted as an economic trend indicator rather than a freely varying numerical variable.
euribor3m
plot_grid(
bank_marketing |> ggplot(aes(euribor3m)) + geom_histogram() ,
bank_marketing |> ggplot(aes(euribor3m)) + geom_boxplot(outlier.colour="red", outlier.shape=8, outlier.size=4) ,
ncol = 2, align = "v"
)
The histogram shows distinct peaks around 1, 2, and 5, confirming that interest rates are adjusted in structured policy shifts rather than gradually. The boxplot confirms no major outliers, supporting the interpretation that this variable behaves as an economic benchmark rather than a continuously changing variable.
nr.employed
plot_grid(
bank_marketing |> ggplot(aes(nr.employed)) + geom_histogram() ,
bank_marketing |> ggplot(aes(nr.employed)) + geom_boxplot(outlier.colour="red", outlier.shape=8, outlier.size=4) ,
ncol = 2, align = "v"
)
The nr.employed histogram shows a discrete, clustered
pattern, with most values concentrated around 5100 and 5200. This
suggests that employment levels are reported in predefined benchmarks
rather than fluctuating continuously.
previous
plot_grid(
bank_marketing |> ggplot(aes(previous)) + geom_histogram() ,
bank_marketing |> ggplot(aes(previous)) + geom_boxplot(outlier.colour="red", outlier.shape=8, outlier.size=4) ,
ncol = 2, align = "v"
)
The previous variable is highly right-skewed, with most
clients having 0 or 1 previous contact. A small fraction of clients
received up to 7 prior contacts, indicating persistent marketing efforts
for certain customers. Since repeated contacts might signal both
persistence and customer disinterest, this variable may require
interaction effects or capping to prevent outliers from distorting the
model.
pdays
# Recode the 999 placeholder ("never contacted") to -1; -1L matches the integer type of pdays
bank_marketing <- bank_marketing |> mutate(pdays = if_else(pdays == 999L, -1L, pdays))
pdaysDist <- bank_marketing |> filter(pdays != -1)
plot_grid(
pdaysDist |> ggplot(aes(pdays)) + geom_histogram(binwidth = bins_cal),
pdaysDist |> ggplot(aes(pdays)) + geom_boxplot(outlier.colour="red", outlier.shape=8, outlier.size=4) ,
ncol = 2, align = "v"
)
The variable pdays represents the number of days since a
client was last contacted in a previous campaign. In the raw data, the
special value 999 indicates that the client was never contacted before;
the code above recodes it to -1. Since pdays
exhibits category-like behavior, we first separate
"Never Contacted" clients from the distribution to better
analyze the spread of past interactions. After removal, the remaining
values show a clustered pattern rather than a smooth distribution,
suggesting that past contacts happened in distinct time periods.
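The category-like behavior of pdays can be made explicit by turning it into a factor. The sketch below is one possible encoding, not the author's confirmed implementation: the helper `bin_pdays`, its `never_code` argument, and the bin edges are assumptions, and a value of 0 days is folded into the first bin.

```r
# Bin pdays into contact-recency categories, keeping the placeholder separate.
# never_code is the placeholder for "never contacted" (999 in the raw data,
# -1 after the recoding step).
bin_pdays <- function(pdays, never_code = 999) {
  contacted <- pdays != never_code
  bins <- cut(
    ifelse(contacted, pdays, NA_real_),
    breaks = c(0, 7, 30, Inf),              # [0, 7], (7, 30], (30, Inf)
    labels = c("0-7 days", "8-30 days", "31+ days"),
    include.lowest = TRUE
  )
  out <- as.character(bins)
  out[!contacted] <- "Never Contacted"
  factor(out, levels = c("Never Contacted", "0-7 days", "8-30 days", "31+ days"))
}

example <- bin_pdays(c(999, 0, 3, 7, 8, 30, 31))
print(table(example))
```

Treating the placeholder as its own level avoids feeding an arbitrary number such as 999 or -1 into a linear model as if it were a real count of days.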
The following section provides an analysis of categorical variables in the dataset, focusing on class distributions and potential implications for modeling.
bank_marketing |> select(!where(is.numeric)) |>
gather() |>
ggplot() +
geom_bar(aes(x = value)) + coord_flip() +
facet_wrap(~key, scales = 'free', ncol = 3)+
ggtitle("Distribution of classes")
Contact
The majority of clients were contacted via cellular phones, while a
smaller portion was reached through a telephone line. This suggests that
mobile communication was the primary method used for outreach.
Day of Week
Contact attempts were distributed fairly evenly across weekdays, with no
extreme variations between days. This suggests that the marketing
efforts did not favor a particular day of the week.
Default
Most clients with a recorded status do not have credit in default; only
3 clients do. There is also a large number of “unknown” values (8,597),
indicating missing or unrecorded data.
Education
The most common education levels are university degree and high school,
while fewer clients have basic education or professional courses. The
“unknown” category is present but relatively small.
Housing
A large portion of clients have a housing loan, while a significant
number do not. A notable “unknown” category is present, meaning that
housing loan information was not available for some clients.
Job
The dataset includes a variety of job categories, with admin.,
blue-collar, and technician roles being the most common. Less frequent
job categories include housemaid, student, and self-employed. A portion
of records have “unknown” values.
Loan
Most clients do not have a personal loan, while a smaller proportion
does. “Unknown” values are also present, meaning some loan statuses were
not recorded.
Marital
The largest group consists of married clients, followed by single and
divorced individuals. A small number of records (80) contain “unknown”
values.
Month
Most contacts occurred in May, while other months have significantly
fewer interactions. This suggests that the marketing campaign was most
active during this period.
Poutcome
The majority of clients fall into the “nonexistent” category, meaning
they were not previously contacted in earlier campaigns. Among those who
were contacted before, “failure” is more common than “success,”
indicating that past campaigns had a lower conversion rate.
Y (Target Variable)
The dataset is highly imbalanced, with most clients not subscribing to
the term deposit. The “yes” category is significantly smaller, meaning
the marketing efforts had a low overall success rate.
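The size of the imbalance can be quantified directly from the class counts reported in the categorical summary table (36,548 "no" and 4,640 "yes"). The short sketch below computes the class shares and the majority-to-minority ratio.

```r
# Class counts for the target variable y, taken from the summary table.
y_counts <- c(no = 36548, yes = 4640)

y_share <- y_counts / sum(y_counts)                        # proportion of each class
imbalance_ratio <- unname(y_counts["no"] / y_counts["yes"]) # "no" records per "yes"

print(round(y_share, 3))
print(round(imbalance_ratio, 2))
```

Roughly 11% of clients subscribed, so a model that always predicts "no" would already reach about 89% accuracy, which is why accuracy alone is a poor yardstick here.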
Understanding how numerical variables interact with each other is crucial for identifying potential predictors and refining our model. The scatterplot and correlation matrix help reveal trends, clusters, and outliers, guiding decisions on feature selection and transformation. Below are the key takeaways from this analysis.
ggpairs(bank_marketing |> keep(is.numeric))
corr_matrix <- bank_marketing |>
keep(is.numeric) |> cor()
corrplot(corr_matrix)
# corr_matrix
duration
The duration variable is highly right-skewed, meaning
most calls are short, but a small number extend for significantly longer
durations. It has little correlation with other numerical variables,
suggesting it behaves independently. Since longer calls are often linked
to a higher likelihood of subscription, duration is likely
an important predictor. However, because it is only recorded after the
call occurs, it must be handled carefully in modeling to avoid data
leakage.
pdays and previous
The relationship between pdays (days since last contact)
and previous (number of previous contacts) shows a strong
negative correlation (-0.588). This is driven largely by the 999
placeholder: clients with no previous contacts carry pdays = 999, while
clients with one or more previous contacts have much smaller values. The
correlation therefore mainly separates never-contacted clients from
previously contacted ones rather than describing how recency varies with
the number of contacts. The scatterplots reveal clustering, indicating
that pdays does not behave like a standard continuous
variable but more like a categorical variable with distinct groupings.
Additionally, many values of previous are zero, confirming
that most clients were not contacted in prior campaigns.
euribor3m, emp.var.rate, nr.employed
Economic indicators such as euribor3m,
emp.var.rate, and nr.employed are highly
correlated, meaning they likely provide overlapping information.
Specifically, euribor3m and emp.var.rate have
a strong correlation (0.972), while euribor3m and
nr.employed also exhibit a strong relationship (0.945).
Since interest rates influence deposit rates, euribor3m is
likely the most directly relevant predictor, and one of the redundant
variables could be removed.
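To make the redundancy check concrete, the sketch below lists every pair of numeric columns whose absolute correlation exceeds a cutoff. It runs on simulated data, and the helper `high_cor_pairs` is a hypothetical base R function written for this illustration; it is not the caret routine used elsewhere in this document.

```r
set.seed(42)
n <- 200
shared <- rnorm(n)

# Simulated stand-ins: two columns share a common signal, one is independent.
toy <- data.frame(
  rate_a    = shared + rnorm(n, sd = 0.1),
  rate_b    = shared + rnorm(n, sd = 0.1),
  unrelated = rnorm(n)
)

# Return each pair of columns with |r| above the cutoff (upper triangle only,
# so every pair is reported once).
high_cor_pairs <- function(df, cutoff = 0.75) {
  cm  <- cor(df)
  idx <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
}

flagged <- high_cor_pairs(toy, cutoff = 0.75)
print(flagged)
```

Listing the pairs, rather than only the columns to drop, leaves the choice of which variable to keep to domain judgment, such as the preference for euribor3m described above.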
cons.price.idx and cons.conf.idx
The consumer price index (cons.price.idx) and consumer
confidence index (cons.conf.idx) have a moderate
correlation (0.775) and show distinct clustering in the scatterplots.
This suggests that consumer sentiment and inflation data were recorded
in defined economic periods rather than changing continuously. These
variables may help capture economic trends that influence whether
clients are more likely to commit to term deposits.
campaign
The campaign variable has a weak correlation with
pdays (-0.048) and duration (-0.072),
indicating it does not strongly interact with other numerical features.
Its scatterplot reveals a high concentration of values at 1 and 2,
meaning most clients received very few calls, while a small subset
received an unusually high number of contacts. These extreme values
suggest that capping or binning the variable may improve model
performance by reducing noise.
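Capping can be sketched as follows. The helper `cap_upper`, the 95th-percentile cutoff, and the toy contact counts are all assumptions made for illustration; the appropriate cutoff for the real data would need to be chosen from its own distribution.

```r
# Cap values above a chosen upper quantile (a simple form of winsorizing).
cap_upper <- function(x, prob = 0.99) {
  limit <- quantile(x, probs = prob, names = FALSE, na.rm = TRUE)
  pmin(x, limit)
}

# Toy contact counts: mostly 1 to 3 calls, plus a few extreme values.
campaign_toy <- c(rep(1, 60), rep(2, 25), rep(3, 10), 5, 8, 12, 30, 56)
campaign_capped <- cap_upper(campaign_toy, prob = 0.95)

print(summary(campaign_toy))
print(summary(campaign_capped))
```

Values at or below the cutoff are left unchanged, so the bulk of the distribution is preserved while the extreme contact counts lose their leverage.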
The exploratory data analysis (EDA) of the Bank Marketing Dataset revealed two critical characteristics: a significant class imbalance and notable correlations among numerical features. Both bear directly on the choice of classification algorithm. This section therefore evaluates two distinct approaches, Naïve Bayes and Weighted Logistic Regression, examining how each handles the observed class imbalance and feature correlations when predicting term deposit subscriptions.
When choosing a classification algorithm for the Bank Marketing Dataset, both Naïve Bayes and Logistic Regression are options, but they have key differences. Logistic Regression models the probability of a client subscribing to a term deposit using a sigmoid function, making it suitable for binary classification. It assumes a linear relationship between features and the log-odds of the outcome. Naïve Bayes, on the other hand, is a probabilistic classifier that uses Bayes’ Theorem, assuming that features are independent given the class. This assumption is often not true in real-world datasets, especially when variables are correlated.
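The link between the sigmoid function and the log-odds can be shown in a few lines. The coefficients below are hypothetical values chosen for illustration, not estimates from the bank data.

```r
# Sigmoid maps a linear predictor to a probability; log-odds is its inverse.
sigmoid  <- function(z) 1 / (1 + exp(-z))
log_odds <- function(p) log(p / (1 - p))

# Hypothetical model: log-odds = intercept + beta_x * x
intercept <- -2.0
beta_x    <- 0.8
x   <- c(-1, 0, 1, 2)
eta <- intercept + beta_x * x   # linear predictor (log-odds scale)
p   <- sigmoid(eta)             # predicted subscription probability

print(data.frame(x = x, log_odds = eta, probability = round(p, 3)))
```

Each unit increase in x adds beta_x to the log-odds, which is the linearity assumption mentioned above; the probabilities themselves change non-linearly.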
Naïve Bayes is simple and efficient, performing well with small datasets and categorical features. It doesn’t require complex feature engineering. However, its independence assumption is a major weakness in this dataset. Variables like euribor3m, emp.var.rate, and nr.employed are strongly correlated, which violates this assumption. This can lead to inaccurate probability estimates and lower predictive performance. Naïve Bayes also offers no built-in way to up-weight the minority class: its class priors mirror the observed class frequencies, so predictions tend to favor the majority class.
Weighted Logistic Regression is more interpretable and better suited for this dataset because it can handle class imbalance. By assigning higher importance to the minority class (subscriptions), it avoids favoring the majority class. Unlike Naïve Bayes, it doesn’t assume feature independence, making it more robust with correlated features. The decision threshold can also be adjusted to improve classification. While Logistic Regression assumes a linear decision boundary, its ability to handle imbalanced data and its interpretability make it a better choice for business decisions.
Given the dataset’s structure, Weighted Logistic Regression is the recommended algorithm. It addresses class imbalance, doesn’t rely on feature independence, and provides clear insights into how factors influence subscriptions. Without class weighting, standard Logistic Regression would face the same imbalance issues as Naïve Bayes. Naïve Bayes might be preferable with datasets smaller than 500 records or with completely categorical and independent features. However, with correlated numerical and categorical variables, Weighted Logistic Regression offers a more reliable and interpretable approach for predicting customer behavior.
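A minimal sketch of class weighting with `glm()` is shown below, using simulated data with roughly one positive case in ten. The inverse-frequency weighting scheme and the use of `quasibinomial` (which yields the same coefficient estimates as `binomial` while avoiding the non-integer weights warning) are assumptions of this illustration, not the confirmed setup of the final model.

```r
set.seed(1)
n <- 2000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-2.5 + 1.2 * x))   # simulated imbalanced target
toy <- data.frame(x = x, y = y)

# Inverse-frequency weights: each class contributes half of the total weight.
p_yes <- mean(toy$y == 1)
toy$w <- ifelse(toy$y == 1, 0.5 / p_yes, 0.5 / (1 - p_yes))

fit_plain    <- glm(y ~ x, data = toy, family = binomial)
fit_weighted <- glm(y ~ x, data = toy, family = quasibinomial, weights = w)

# Share of cases classified as positive at a 0.5 threshold.
share_plain    <- mean(predict(fit_plain, type = "response") > 0.5)
share_weighted <- mean(predict(fit_weighted, type = "response") > 0.5)

print(rbind(plain = coef(fit_plain), weighted = coef(fit_weighted)))
print(c(plain = share_plain, weighted = share_weighted))
```

Weighting mainly shifts the intercept upward, so more cases cross the 0.5 threshold. The predicted probabilities from the weighted model are no longer calibrated to the true base rate, which should be kept in mind when interpreting them.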
With exploratory data analysis (EDA) completed and weighted Logistic Regression selected as the model of choice, appropriate pre-processing steps are crucial to enhance data quality, optimize feature selection, and address class imbalance. The dataset comprises both numerical and categorical variables, necessitating cleaning, transformation, and encoding to improve predictive accuracy and interpretability.
Several categorical variables contain missing values. The following strategies are applied based on the proportion of missingness:
Certain variables contain extreme values that could distort the model and require adjustment:
highCorrelation <- findCorrelation(corr_matrix,cutoff = 0.75)
# noVar <- nearZeroVar(pdaysDist)
noVar <- nearZeroVar(bank_marketing)
columns_to <- names(bank_marketing)[noVar]
When treating pdays as a continuous variable, removing
999 and analyzing only past interactions makes sense. However, for
interpretability and consistency, pdays is binned into
contact time intervals such as 1–7 days, 8–30 days, and 31+ days while
keeping “Never Contacted” as a separate category.
To reduce redundancy and improve model efficiency, some variables are removed:
Additional transformations are applied to improve model interpretability:
To standardize features and improve interpretability, the following transformations are applied:
pdays is binned
for more meaningful classification of contact time, while 999 is
retained as a separate category labeled “Never Contacted.”
The dataset exhibits class imbalance in the target variable
(y), requiring adjustments:
Higher weights are assigned to the minority class (y = yes), optimizing performance based on Precision-Recall
AUC rather than accuracy.
This exploratory analysis of the Bank Marketing Dataset highlights key factors influencing term deposit subscriptions, including call duration, past contact history, and economic conditions. Challenges such as class imbalance, high correlation among economic indicators, and missing values necessitate targeted preprocessing to enhance model performance. Addressing these issues through feature selection, categorical encoding, and variable transformations ensures a more stable and interpretable predictive model.
Given the dataset’s characteristics, Weighted Logistic Regression is the most suitable approach due to its ability to handle imbalanced data while maintaining interpretability. Refining feature selection and optimizing preprocessing steps will enhance predictive accuracy, supporting more effective marketing strategies and data-driven decision-making. Future improvements could explore alternative classification models and deeper behavioral insights to refine customer targeting.