Bank Marketing Data Exploratory Data Analysis (EDA)

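# Packages used below: tidyverse, skimr, kableExtra, gridExtra, cowplot, GGally, corrplot, caret
bank_marketing <- read.csv("bank-additional-full.csv", sep = ";", stringsAsFactors = TRUE)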

1. Introduction

The Bank Marketing dataset originates from a study conducted by Sérgio Moro, Paulo Cortez, and Paulo Rita (2014), aiming to predict the success of bank telemarketing campaigns for term deposit subscriptions. This dataset, derived from the UCI Bank Marketing dataset, includes additional social and economic attributes obtained from Banco de Portugal, which enhance its predictive power.

The dataset consists of 41,188 client interactions collected between May 2008 and November 2010. Each record contains bank client data, details of the last marketing contact, and macroeconomic indicators, culminating in a binary classification problem where the objective is to predict whether a client subscribes to a term deposit (y = yes/no).

Given the imbalanced nature of the target variable, a structured pre-processing approach is necessary. This document outlines the steps taken to clean, transform, and optimize the dataset for weighted Logistic Regression. The preprocessing pipeline includes handling missing data, reducing redundancy, encoding categorical variables, transforming skewed features, and addressing class imbalance to ensure model stability and interpretability.

2. Exploratory Data Analysis

bank_marketing |> 
  select(where(is.numeric)) |> 
  skim() |> 
  select(skim_variable, n_missing, complete_rate, numeric.mean, numeric.sd, 
         numeric.p0, numeric.p25, numeric.p50, numeric.p75, numeric.p100) |> 
  rename(
    Variable = skim_variable,
    `Missing Values` = n_missing,
    `Completeness (%)` = complete_rate,
    `Mean` = numeric.mean,
    `Standard Deviation` = numeric.sd,
    `Min` = numeric.p0,
    `25th Percentile` = numeric.p25,
    `Median (50th Pct)` = numeric.p50,
    `75th Percentile` = numeric.p75,
    `Max` = numeric.p100
  ) |> 
  kable(caption = "Summary Statistics for Numerical Variables") |> 
  kable_styling() |> 
  kable_classic()

Summary Statistics for Numerical Variables
Variable	Completeness (%)	Mean	Standard Deviation	Min	25th Percentile	Median (50th Pct)	75th Percentile	Max
age	1	40.0240604	10.4212500	17.000	32.000	38.000	47.000	98.000
duration	1	258.2850102	259.2792488	0.000	102.000	180.000	319.000	4918.000
campaign	1	2.5675925	2.7700135	1.000	1.000	2.000	3.000	56.000
pdays	1	962.4754540	186.9109073	0.000	999.000	999.000	999.000	999.000
previous	1	0.1729630	0.4949011	0.000	0.000	0.000	0.000	7.000
emp.var.rate	1	0.0818855	1.5709597	-3.400	-1.800	1.100	1.400	1.400
cons.price.idx	1	93.5756644	0.5788400	92.201	93.075	93.749	93.994	94.767
cons.conf.idx	1	-40.5026003	4.6281979	-50.800	-42.700	-41.800	-36.400	-26.900
euribor3m	1	3.6212908	1.7344474	0.634	1.344	4.857	4.961	5.045
nr.employed	1	5167.0359109	72.2515277	4963.600	5099.100	5191.000	5228.100	5228.100

The numerical summary of the Bank Marketing Dataset indicates that all variables are fully complete (0 missing values). Age has a mean of 40 years (range: 17–98), with the middle 50% of clients between 32 and 47. Duration, which measures call length, has a median of 180 seconds but a high standard deviation (259.3 seconds) and a maximum of 4,918 seconds, indicating a long right tail. Campaign, representing the number of contacts per client, has a median of 2 but reaches up to 56 contacts, indicating some clients were targeted repeatedly.

The pdays variable is dominated by the value 999, likely a placeholder for no prior contact, while previous, tracking past interactions, has a median of 0, meaning most clients were contacted for the first time. Economic variables show varying trends: the employment variation rate ranges from -3.4 to 1.4, and the consumer confidence index has an average of -40.5, reflecting economic uncertainty.

The euribor3m interest rate, which influences borrowing costs, has a median of 4.86%, fluctuating between 0.63% and 5.05%. The number of employed individuals is relatively stable, with a mean of 5167 and low variation, suggesting consistency in the labor market during the data collection period.

missing_data <- bank_marketing |>
  mutate(across(where(is.factor), ~ fct_recode(.x, NULL = "unknown")))  # "unknown" becomes NA


missing_data |> 
  select(!where(is.numeric)) |> 
  skim() |> 
  select(skim_variable, n_missing, complete_rate, factor.n_unique, factor.top_counts) |> 
  rename(
    Variable = skim_variable,
    `Missing Values` = n_missing,
    `Completeness (%)` = complete_rate,
    `Unique Categories` = factor.n_unique,
    `Top Categories (Counts)` = factor.top_counts
  ) |> 
  kable(caption = "Summary Statistics for Categorical Variables") |> 
  kable_styling() |>  
  kable_classic()

Summary Statistics for Categorical Variables
Variable	Missing Values	Completeness (%)	Unique Categories	Top Categories (Counts)
job	330	0.9919880	11	adm: 10422, blu: 9254, tec: 6743, ser: 3969
marital	80	0.9980577	3	mar: 24928, sin: 11568, div: 4612
education	1731	0.9579732	7	uni: 12168, hig: 9515, bas: 6045, pro: 5243
default	8597	0.7912742	2	no: 32588, yes: 3
housing	990	0.9759639	2	yes: 21576, no: 18622
loan	990	0.9759639	2	no: 33950, yes: 6248
contact	0	1.0000000	2	cel: 26144, tel: 15044
month	0	1.0000000	10	may: 13769, jul: 7174, aug: 6178, jun: 5318
day_of_week	0	1.0000000	5	thu: 8623, mon: 8514, wed: 8134, tue: 8090
poutcome	0	1.0000000	3	non: 35563, fai: 4252, suc: 1373
y	0	1.0000000	2	no: 36548, yes: 4640

# missing1 and missing2 are plot objects created in a chunk not shown here
grid.arrange(missing1, missing2, ncol = 2)

The Bank Marketing Dataset has a low overall missing-data rate of 1.5% of all cells. However, about 21% of observations contain at least one missing value, driven largely by default (20.9% missing). The other affected variables are education (4.2% missing), housing (2.4%), loan (2.4%), job (0.8%), and marital status (0.19%), which may affect subsequent modeling. All missing values occur in categorical variables; the numerical variables are complete. The columns split into 52.4% categorical and 47.6% continuous, no column is fully missing, and most features are complete.

Examining categorical variables reveals specific missingness patterns. ‘Job’ has 330 missing values (99.2% completeness), with admin, blue-collar, and technician being the most common categories. ‘Marital status’ is nearly complete at 99.8%, while ‘education’ has 1,731 missing values (95.8% completeness); among known values, university degree is the most common level.

Financial variables show larger gaps. ‘Default’ status has the highest missing rate, with 8,597 missing values and 79.1% completeness; the known values are almost all ‘no’ (32,588 versus 3 ‘yes’). ‘Housing’ and ‘loan’ each have 990 missing values, resulting in 97.6% completeness; most clients with a known status have a housing loan and no personal loan. The remaining categorical variables (contact, month, day_of_week, and poutcome) are fully observed. The target variable (y) is imbalanced, with 36,548 non-subscriptions versus 4,640 subscriptions, a success rate of about 11.3%.
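The completeness figures above can be reproduced in a few lines. The following is a minimal base R sketch on a small made-up data frame (`toy` is an illustrative stand-in, not the real file): it treats the "unknown" level as missing, then reports the percentage missing per column, the share of incomplete rows, and the class balance of the target.

```r
# Toy data standing in for the real dataset (values are made up).
toy <- data.frame(
  job     = factor(c("admin.", "unknown", "technician", "admin.")),
  default = factor(c("no", "unknown", "unknown", "no")),
  y       = factor(c("no", "no", "yes", "no"))
)

# Treat the "unknown" level as missing in every factor column.
toy_na <- toy
toy_na[] <- lapply(toy_na, function(col) {
  col[col == "unknown"] <- NA
  droplevels(col)
})

# Percentage of missing values per column.
pct_missing <- round(100 * colMeans(is.na(toy_na)), 1)

# Share of rows with at least one missing value.
pct_incomplete_rows <- 100 * mean(!complete.cases(toy_na))

# Class balance of the target, in percent.
class_share <- round(100 * prop.table(table(toy_na$y)), 1)

pct_missing
pct_incomplete_rows
class_share
```

Applied to the full dataset, the same three quantities correspond to the per-variable missing rates, the share of observations with any missing value, and the subscription rate discussed above.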

2.1 Numerical Variable Distributions

1. Age (`age`)

plot_grid(
  bank_marketing |> ggplot(aes(age)) + geom_histogram(),
  bank_marketing |> ggplot(aes(age)) + geom_boxplot(outlier.colour = "red", outlier.shape = 8, outlier.size = 4),
  ncol = 2, align = "v"
)

The histogram shows a right-skewed distribution, with most clients between 25 and 60 years old. The boxplot flags ages beyond roughly 70 years as outliers, but these are plausible ages rather than data errors. The peaks at 30, 35, and 50 suggest a higher concentration of clients in these age groups, though this may simply reflect population demographics rather than targeted marketing.
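For reference, the point at which a boxplot starts marking outliers can be worked out from the quartiles. The sketch below assumes the conventional rule of 1.5 times the interquartile range (the default in `geom_boxplot()`) and uses the age quartiles from the summary table.

```r
# Quartiles of age taken from the summary table above.
q1 <- 32
q3 <- 47
iqr <- q3 - q1

# Conventional boxplot fences: 1.5 * IQR beyond each quartile.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

c(lower = lower_fence, upper = upper_fence)
```

Ages above the upper fence are drawn as individual outlier points; no client falls below the lower fence because the minimum age is 17.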

2. Number of Contacts in Current Campaign (`campaign`)

plot_grid(
  bank_marketing |> ggplot(aes(campaign)) + geom_histogram() + theme_minimal(),
  bank_marketing |> ggplot(aes(campaign)) + geom_boxplot(outlier.colour = "red", outlier.shape = 8, outlier.size = 4),
  ncol = 2, align = "v"
)

The histogram confirms heavy right-skewness, with most clients receiving 1 or 2 contacts and a long tail extending beyond 40 contacts. The boxplot shows a large number of outliers, indicating that some customers were contacted excessively. Since most clients receive fewer than 5 calls, the extreme cases could suggest persistent but possibly ineffective marketing strategies.

3. Consumer Confidence Index (`cons.conf.idx`)

plot_grid(
  bank_marketing |> ggplot(aes(cons.conf.idx)) + geom_histogram(),
  bank_marketing |> ggplot(aes(cons.conf.idx)) + geom_boxplot(outlier.colour = "red", outlier.shape = 8, outlier.size = 4),
  ncol = 2, align = "v"
)

The histogram displays discrete peaks around -50, -45, -40, and -35. The index is a monthly indicator, so every call made in the same month shares the same value and the variable takes only a limited set of levels. The boxplot flags a few values at the top of the range (maximum -26.9); these are the least pessimistic readings in the collection period rather than random noise.

4. Consumer Price Index (`cons.price.idx`)

plot_grid(
  bank_marketing |> ggplot(aes(cons.price.idx)) + geom_histogram(),
  bank_marketing |> ggplot(aes(cons.price.idx)) + geom_boxplot(outlier.colour = "red", outlier.shape = 8, outlier.size = 4),
  ncol = 2, align = "v"
)

The histogram for cons.price.idx reveals clusters of values around 93 and 94, rather than a continuous distribution. The boxplot shows no extreme outliers, suggesting that the variable follows well-defined economic reporting patterns. While numerical, it may be more effective when treated as an economic regime indicator rather than a standard continuous feature.

5. Call Duration (`duration`)

plot_grid(
  bank_marketing |> ggplot(aes(duration)) + geom_histogram(),
  bank_marketing |> ggplot(aes(duration)) + geom_boxplot(outlier.colour = "red", outlier.shape = 8, outlier.size = 4),
  ncol = 2, align = "v"
)

The duration variable is heavily right-skewed, with a large number of calls lasting less than 200 seconds and a long tail extending beyond 2,000 seconds (maximum 4,918). Longer call durations often indicate higher engagement and are strongly correlated with term deposit subscriptions. However, since duration is only known after the call, it must be handled carefully in predictive modeling to avoid data leakage.

6. Employment Variation Rate (`emp.var.rate`)

plot_grid(
  bank_marketing |> ggplot(aes(emp.var.rate)) + geom_histogram(),
  bank_marketing |> ggplot(aes(emp.var.rate)) + geom_boxplot(outlier.colour = "red", outlier.shape = 8, outlier.size = 4),
  ncol = 2, align = "v"
)

The histogram does not resemble a continuous distribution but instead has distinct peaks near -3, -2, 0, and 1. This is consistent with a quarterly indicator that takes only a handful of values over the collection period. The boxplot shows no significant outliers, so the variable is best interpreted as an economic trend indicator rather than a freely varying numerical variable.

7. Euribor 3-Month Rate (`euribor3m`)

plot_grid(
  bank_marketing |> ggplot(aes(euribor3m)) + geom_histogram(),
  bank_marketing |> ggplot(aes(euribor3m)) + geom_boxplot(outlier.colour = "red", outlier.shape = 8, outlier.size = 4),
  ncol = 2, align = "v"
)

The histogram shows distinct peaks around 1, 2, and 5, suggesting that the calls fall into a few distinct interest-rate periods rather than sampling rates evenly across the range. The boxplot shows no major outliers, supporting the interpretation that this variable acts as an indicator of the economic period in which a call took place.

8. Number of Employees (`nr.employed`)

plot_grid(
  bank_marketing |> ggplot(aes(nr.employed)) + geom_histogram(),
  bank_marketing |> ggplot(aes(nr.employed)) + geom_boxplot(outlier.colour = "red", outlier.shape = 8, outlier.size = 4),
  ncol = 2, align = "v"
)

The nr.employed histogram shows a discrete, clustered pattern, with most values concentrated around 5100 and 5200. This is expected for a quarterly indicator: it changes only a few times during the collection period, so all calls in the same quarter share one value.

9. Number of Previous Contacts (`previous`)

plot_grid(
  bank_marketing |> ggplot(aes(previous)) + geom_histogram(),
  bank_marketing |> ggplot(aes(previous)) + geom_boxplot(outlier.colour = "red", outlier.shape = 8, outlier.size = 4),
  ncol = 2, align = "v"
)

The previous variable is highly right-skewed, with most clients having 0 or 1 previous contact. A small fraction of clients received up to 7 prior contacts, indicating persistent marketing efforts for certain customers. Since repeated contacts might signal both persistence and customer disinterest, this variable may require interaction effects or capping to prevent outliers from distorting the model.

10. Days Since Last Contact (`pdays`)

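# Recode the 999 placeholder (no previous contact) to -1
bank_marketing <- bank_marketing |> mutate(pdays = if_else(pdays == 999, -1L, pdays))

# Keep only clients with a recorded previous contact
pdaysDist <- bank_marketing |> filter(pdays != -1)

# bins_cal (the histogram bin width) is defined in a chunk not shown here
plot_grid(
  pdaysDist |> ggplot(aes(pdays)) + geom_histogram(binwidth = bins_cal),
  pdaysDist |> ggplot(aes(pdays)) + geom_boxplot(outlier.colour = "red", outlier.shape = 8, outlier.size = 4),
  ncol = 2, align = "v"
)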

The variable pdays represents the number of days since a client was last contacted in a previous campaign. The raw data uses the placeholder 999 for clients who were never contacted before; the code above recodes it to -1. Since pdays exhibits category-like behavior, we first separate "Never Contacted" clients from the distribution to better analyze the spread of past interactions. After removal, the remaining values show a clustered pattern rather than a smooth distribution, suggesting that past contacts happened in distinct time periods.

2.2 Categorical Variable Distributions

The following section provides an analysis of categorical variables in the dataset, focusing on class distributions and potential implications for modeling.

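# Stack all categorical columns into key/value pairs, then draw one bar chart per variable
bank_marketing |>
  select(!where(is.numeric)) |>
  pivot_longer(everything(), names_to = "key", values_to = "value",
               values_transform = list(value = as.character)) |>
  ggplot() +
  geom_bar(aes(x = value)) +
  coord_flip() +
  facet_wrap(~key, scales = "free", ncol = 3) +
  ggtitle("Distribution of classes")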

Contact
The majority of clients were contacted via cellular phones (26,144), while a smaller portion was reached through a landline telephone (15,044). Mobile communication was the primary outreach channel.
Day of Week
Contact attempts were distributed fairly evenly across weekdays, with no extreme variations between days. The marketing efforts did not favor a particular day of the week.
Default
Almost all clients with a known status have no credit in default; only 3 are recorded as “yes”. There is also a large number of “unknown” values (8,597), indicating missing or unrecorded data.
Education
The most common education levels are university degree and high school, while fewer clients have basic education or professional courses. The “unknown” category is present but relatively small (1,731).
Housing
A slight majority of clients have a housing loan (21,576), while a large share do not (18,622). A small “unknown” category (990) is present, meaning that housing loan information was not available for some clients.
Job
The dataset includes a variety of job categories, with admin, blue-collar, and technician roles being the most common. Less frequent job categories include housemaid, student, and self-employed. A small portion of records (330) have “unknown” values.
Loan
Most clients do not have a personal loan (33,950), while a smaller proportion does (6,248). “Unknown” values are also present (990), meaning some loan statuses were not recorded.
Marital
The largest group consists of married clients, followed by single and divorced individuals. Only 80 records contain “unknown” values.
Month
Most contacts occurred in May (13,769), followed by July, August, and June; the remaining months have far fewer interactions. This suggests that the marketing campaign was most active in late spring and summer.
Poutcome
The majority of clients fall into the “nonexistent” category, meaning they were not previously contacted in earlier campaigns. Among those who were contacted before, “failure” is more common than “success,” indicating that past campaigns had a lower conversion rate.
Y (Target Variable)
The dataset is highly imbalanced, with most clients not subscribing to the term deposit. The “yes” category covers 4,640 of 41,188 records (about 11.3%), meaning the marketing efforts had a low overall success rate.

2.3 Key Insights from the Correlation and Scatterplot Matrix

Understanding how numerical variables interact with each other is crucial for identifying potential predictors and refining our model. The scatterplot and correlation matrix help reveal trends, clusters, and outliers, guiding decisions on feature selection and transformation. Below are the key takeaways from this analysis.

ggpairs(bank_marketing |> keep(is.numeric))

corr_matrix <- bank_marketing |> 
  keep(is.numeric) |> cor()

corrplot(corr_matrix)

# corr_matrix

1. Impact of Call Duration on Subscriptions (`duration`)

The duration variable is highly right-skewed, meaning most calls are short, but a small number extend for significantly longer durations. It has little correlation with other numerical variables, suggesting it behaves independently. Since longer calls are often linked to a higher likelihood of subscription, duration is likely an important predictor. However, because it is only recorded after the call occurs, it must be handled carefully in modeling to avoid data leakage.

2. Influence of Previous Contact History (`pdays` and `previous`)

The relationship between pdays (days since last contact) and previous (number of previous contacts) shows a strong negative correlation (-0.588). This coefficient largely reflects the placeholder coding rather than recency: clients who were never contacted before have previous = 0 together with the placeholder value for pdays, so a large placeholder such as 999 pairs zero previous contacts with the highest pdays value. The scatterplots reveal clustering, indicating that pdays does not behave like a standard continuous variable but more like a categorical variable with distinct groupings. Additionally, many values of previous are zero, confirming that most clients were not contacted in prior campaigns.

3. Correlation Among Economic Indicators (`euribor3m`, `emp.var.rate`, `nr.employed`)

Economic indicators such as euribor3m, emp.var.rate, and nr.employed are highly correlated, meaning they likely provide overlapping information. Specifically, euribor3m and emp.var.rate have a strong correlation (0.972), while euribor3m and nr.employed also exhibit a strong relationship (0.945). Since interest rates influence deposit rates, euribor3m is likely the most directly relevant predictor, and one of the redundant variables could be removed.
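One way to list redundant pairs directly from a correlation matrix is sketched below in base R. The data are simulated (`rate_a`, `rate_b`, and `other` are hypothetical columns, not the real indicators), and the 0.75 cutoff is the same value passed to `findCorrelation()` in the pre-processing section.

```r
set.seed(1)
n <- 200
base_rate <- rnorm(n)

# Simulated data: rate_b is nearly a copy of rate_a, other is unrelated.
toy <- data.frame(
  rate_a = base_rate,
  rate_b = base_rate + rnorm(n, sd = 0.1),
  other  = rnorm(n)
)

corr <- cor(toy)

# Keep each pair once (upper triangle) and flag |r| above the cutoff.
cutoff <- 0.75
idx <- which(abs(corr) > cutoff & upper.tri(corr), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(corr)[idx[, "row"]],
  var2 = colnames(corr)[idx[, "col"]],
  r    = corr[idx]
)

high_pairs
```

Each row of `high_pairs` names two variables that carry overlapping information; which member of a pair to drop remains a modeling decision.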

4. Consumer Sentiment and Economic Conditions (`cons.price.idx` and `cons.conf.idx`)

The consumer price index (cons.price.idx) and consumer confidence index (cons.conf.idx) show distinct clustering in the scatterplots. The 0.775 coefficient in the correlation matrix links cons.price.idx to emp.var.rate rather than to cons.conf.idx, whose correlation with cons.price.idx is weak. The clustering suggests that consumer sentiment and inflation data were recorded in defined economic periods rather than changing continuously. These variables may help capture economic trends that influence whether clients are more likely to commit to term deposits.

5. Frequency of Marketing Contacts (`campaign`)

The campaign variable has a weak correlation with pdays (-0.048) and duration (-0.072), indicating it does not strongly interact with other numerical features. Its scatterplot reveals a high concentration of values at 1 and 2, meaning most clients received very few calls, while a small subset received an unusually high number of contacts. These extreme values suggest that capping or binning the variable may improve model performance by reducing noise.
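Capping can be illustrated with a short base R sketch. The contact counts are made up, and the 90th percentile used as the cap is an arbitrary choice for illustration, not a value taken from the analysis.

```r
# Made-up contact counts with one extreme value.
campaign <- c(1, 1, 2, 2, 3, 1, 2, 56)

# Cap at a chosen upper quantile (the 90% level here is an arbitrary choice).
cap <- quantile(campaign, probs = 0.90, names = FALSE)
campaign_capped <- pmin(campaign, cap)

campaign_capped
```

Values at or below the cap are left unchanged, so the bulk of the distribution is preserved while the extreme tail is pulled in.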

3. Algorithm Selection

The exploratory data analysis (EDA) of the Bank Marketing Dataset revealed two critical characteristics: a significant class imbalance and strong correlations among several numerical features. Both findings constrain the choice of classification algorithm. This section compares Naïve Bayes and Weighted Logistic Regression, examining how each handles class imbalance and correlated features when predicting term deposit subscriptions.

When choosing a classification algorithm for the Bank Marketing Dataset, both Naïve Bayes and Logistic Regression are options, but they have key differences. Logistic Regression models the probability of a client subscribing to a term deposit using a sigmoid function, making it suitable for binary classification. It assumes a linear relationship between features and the log-odds of the outcome. Naïve Bayes, on the other hand, is a probabilistic classifier that uses Bayes’ Theorem, assuming that features are independent given the class. This assumption is often not true in real-world datasets, especially when variables are correlated.
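In symbols, the two models can be written as follows. This is a standard textbook formulation added for reference; the notation (feature vector x with p components, coefficients β) is introduced here and does not appear elsewhere in the document.

```latex
\begin{align}
  % Logistic Regression: sigmoid of a linear predictor
  P(y = \text{yes} \mid x) &= \sigma(\beta_0 + \beta^\top x),
  \qquad \sigma(z) = \frac{1}{1 + e^{-z}} \\
  % Equivalent statement: the log-odds are linear in the features
  \log \frac{P(y = \text{yes} \mid x)}{1 - P(y = \text{yes} \mid x)} &= \beta_0 + \beta^\top x \\
  % Naive Bayes: class prior times independent per-feature likelihoods
  P(y \mid x_1, \dots, x_p) &\propto P(y) \prod_{j=1}^{p} P(x_j \mid y)
\end{align}
```

The product in the last line is where the independence assumption enters: each feature contributes its own factor, so strongly correlated features are effectively counted more than once.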

Naïve Bayes is simple and efficient, performing well with small datasets and categorical features. It doesn’t require complex feature engineering. However, its independence assumption is a major weakness in this dataset. Variables like euribor3m, emp.var.rate, and nr.employed are strongly correlated, which violates this assumption. This can lead to inaccurate probability estimates and lower predictive performance. Naïve Bayes also offers no built-in remedy for class imbalance: its class priors mirror the observed class frequencies, so predictions lean toward the majority class unless the priors are adjusted.

Weighted Logistic Regression is more interpretable and better suited for this dataset because it can handle class imbalance. By assigning higher importance to the minority class (subscriptions), it avoids favoring the majority class. Unlike Naïve Bayes, it doesn’t assume feature independence, making it more robust with correlated features. The decision threshold can also be adjusted to improve classification. While Logistic Regression assumes a linear decision boundary, its ability to handle imbalanced data and its interpretability make it a better choice for business decisions.
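A minimal sketch of class weighting with base R's `glm()` follows. The data are simulated, and the inverse-frequency weighting scheme is one common choice assumed here for illustration; the document does not specify which weights are used.

```r
set.seed(42)
n <- 2000
x <- rnorm(n)

# Simulated imbalanced outcome with a minority of positives.
p <- plogis(-2.5 + 1.2 * x)
y <- rbinom(n, size = 1, prob = p)
toy <- data.frame(x = x, y = y)

# Inverse-frequency weights so both classes carry equal total weight.
class_counts <- table(toy$y)
w <- ifelse(toy$y == 1,
            n / (2 * class_counts[["1"]]),
            n / (2 * class_counts[["0"]]))

# quasibinomial gives the same coefficients as binomial but avoids the
# "non-integer #successes" warning triggered by fractional weights.
fit_plain    <- glm(y ~ x, data = toy, family = binomial())
fit_weighted <- glm(y ~ x, data = toy, family = quasibinomial(), weights = w)

coef(fit_plain)
coef(fit_weighted)
```

The weighted fit mainly shifts the intercept upward, which raises predicted probabilities for the minority class. Those probabilities are no longer calibrated to the true subscription rate, which matters when a decision threshold is chosen.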

Given the dataset’s structure, Weighted Logistic Regression is the recommended algorithm. It addresses class imbalance, doesn’t rely on feature independence, and provides clear insights into how factors influence subscriptions. Without class weighting, standard Logistic Regression would face the same imbalance issues as Naïve Bayes. Naïve Bayes might be preferable with datasets smaller than 500 records or with completely categorical and independent features. However, with correlated numerical and categorical variables, Weighted Logistic Regression offers a more reliable and interpretable approach for predicting customer behavior.

4. Pre-processing Strategy for the Bank Marketing Dataset

With exploratory data analysis (EDA) completed and weighted Logistic Regression selected as the model of choice, appropriate pre-processing steps are crucial to enhance data quality, optimize feature selection, and address class imbalance. The dataset comprises both numerical and categorical variables, necessitating cleaning, transformation, and encoding to improve predictive accuracy and interpretability.

4.1. Handling Missing Data

Several categorical variables contain missing values. The following strategies are applied based on the proportion of missingness:

Job (0.8%): Impute with “admin,” the most frequent category, to minimize distortion.
Marital Status (0.19%): Impute with “married,” the most common response.
Education (4.2%): Assign to a new category, “Unknown Education,” instead of using mode imputation, as the missing values are more spread out.
Default (20.9%): Assign to a new category, “Unknown Default,” to prevent excessive bias from imputation.
Housing Status (2.4%): Impute with “yes,” as it is the most frequent response.
Loan Status (2.4%): Impute with “no,” the dominant response.
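The two strategies in the list above (mode imputation and an explicit "unknown" category) can be sketched in base R as follows. The helper names `impute_mode` and `add_unknown_level` are hypothetical, and the factors are made-up examples.

```r
# Toy factors with missing values (made-up data).
marital   <- factor(c("married", "single", NA, "married", "divorced"))
education <- factor(c("university.degree", NA, "high.school", NA, "basic.9y"))

# Mode imputation: replace NA with the most frequent observed level.
impute_mode <- function(f) {
  mode_level <- names(which.max(table(f)))
  f[is.na(f)] <- mode_level
  f
}

# Explicit category: add a new level and assign it to the NA entries.
add_unknown_level <- function(f, label) {
  f <- factor(f, levels = c(levels(f), label))
  f[is.na(f)] <- label
  f
}

marital_imputed   <- impute_mode(marital)
education_flagged <- add_unknown_level(education, "Unknown Education")

table(marital_imputed)
table(education_flagged)
```

Mode imputation keeps the set of levels unchanged, while the explicit category lets the model estimate a separate effect for records with missing information.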

4.2. Outlier Treatment

Certain variables contain extreme values that could distort the model and require adjustment:

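# Indices of columns to drop so that no remaining pair has |r| above 0.75 (caret)
highCorrelation <- findCorrelation(corr_matrix, cutoff = 0.75)

# Columns with zero or near-zero variance (caret)
# noVar <- nearZeroVar(pdaysDist)
noVar <- nearZeroVar(bank_marketing)
columns_to <- names(bank_marketing)[noVar]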

Duration: Right-skewed with extreme values. Excluded from predictive modeling because it is known only after the outcome is determined. Including it would introduce data leakage, leading to overly optimistic model performance that would not generalize to new data. However, it is log-transformed for exploratory analysis to understand engagement levels.
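Campaign and Previous Contacts: Right-skewed distributions that call for a power transformation to stabilize variance. Box-Cox requires strictly positive inputs, so it applies directly to campaign (minimum 1), while previous contains zeros and needs a shift (for example previous + 1) or a Yeo-Johnson transformation.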
Pdays:
- The value 999, representing clients never contacted before, is converted into a separate category labeled “Never Contacted” for categorical encoding.
- When 999 is included, the variable has near-zero variance, meaning it provides little useful information. However, when 999 is removed, the remaining values show meaningful variability.
- If treating pdays as a continuous variable, removing 999 and analyzing only past interactions makes sense. However, for interpretability and consistency, pdays is binned into contact time intervals such as 1–7 days, 8–30 days, and 31+ days while keeping “Never Contacted” as a separate category.
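Macroeconomic Variables: A power transformation is applied if necessary to correct skewness while preserving meaningful economic trends. Box-Cox is limited to the strictly positive indicators; emp.var.rate and cons.conf.idx take negative values and would need a Yeo-Johnson transformation or a shift first.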
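The pdays binning described in the list above might look like the following base R sketch. The values are made up, both placeholder codes (999 and -1) are treated as "Never Contacted", and the lowest bin starts at 0 rather than 1, which is an assumption made here because the summary table shows a minimum pdays of 0.

```r
# Made-up pdays values; 999 (or -1 after recoding) means "never contacted".
pdays <- c(999, 3, 0, 12, 27, 999, 6, 45)

# Bin real day counts; placeholder codes become NA for now.
pdays_bin <- cut(
  ifelse(pdays %in% c(999, -1), NA, pdays),
  breaks = c(-Inf, 7, 30, Inf),
  labels = c("0-7 days", "8-30 days", "31+ days")
)

# Turn the NA entries into their own category and fix the level order.
pdays_bin <- factor(
  ifelse(is.na(pdays_bin), "Never Contacted", as.character(pdays_bin)),
  levels = c("Never Contacted", "0-7 days", "8-30 days", "31+ days")
)

table(pdays_bin)
```

Placing "Never Contacted" first makes it the reference level when the factor is later expanded into indicator columns.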

4.3. Feature Selection

To reduce redundancy and improve model efficiency, some variables are removed:

Euribor 3-month rate and employment variation rate: Highly correlated with each other (0.972), with number of employees (0.945 for euribor3m), and with consumer price index, making them redundant.
Duration: Removed to prevent data leakage, as it depends on the outcome of the call.
Low-variance categorical features: Identified and removed if they provide minimal predictive value.

4.4. Feature Engineering

Additional transformations are applied to improve model interpretability:

Age: Converted into three categories:
- Adults (up to 39 years old): Represents younger working professionals; the youngest clients in the data are 17.
- Middle-aged adults (40–59 years old): Likely to have stable incomes.
- Senior adults (60+ years old): Includes retirees and older clients.
Duration: Converted into a binary “high engagement” feature, where calls lasting above the median duration are flagged as high engagement. This transformation is for analysis only and not used for modeling due to data leakage concerns.
Total Contacts: A new feature combining campaign contacts and previous contacts to summarize the total number of interactions per client.
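The age grouping and the combined contact count can be sketched in base R as below. The data frame is made up, and the first age bin is left open at the bottom (an assumption) so that clients younger than 20 are not left unassigned.

```r
toy <- data.frame(
  age      = c(17, 25, 39, 40, 59, 60, 98),
  campaign = c(1, 2, 1, 5, 3, 1, 2),
  previous = c(0, 0, 1, 0, 2, 0, 7)
)

# Age groups with right-closed intervals: <=39, 40-59, 60+.
toy$age_group <- cut(
  toy$age,
  breaks = c(-Inf, 39, 59, Inf),
  labels = c("Adult", "Middle-aged adult", "Senior adult")
)

# Total interactions across the current and earlier campaigns.
toy$total_contacts <- toy$campaign + toy$previous

toy
```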

4.5. Scaling and Encoding

To standardize features and improve interpretability, the following transformations are applied:

Box-Cox transformation: Applied to campaign and, after a shift that makes all values strictly positive, to previous contacts to stabilize variance.
Categorical binning: Applied to pdays for more meaningful classification of contact time, while 999 is retained as a separate category labeled “Never Contacted.”
One-hot encoding: Used for categorical variables to prevent ordinal misinterpretation.
Binary encoding: Applied to high-cardinality categorical variables like job and month to reduce dimensionality.
Variance filtering: Low-variance features are removed if they contribute little information to the model.
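One-hot encoding for a regression model can be sketched with base R's `model.matrix()`. The data frame is a made-up example with two of the dataset's categorical variables.

```r
toy <- data.frame(
  marital = factor(c("married", "single", "divorced", "married")),
  contact = factor(c("cellular", "telephone", "cellular", "cellular"))
)

# model.matrix() expands factors into 0/1 indicator columns. With an
# intercept, one reference level per factor is dropped, which avoids
# perfectly collinear columns in a regression.
encoded <- model.matrix(~ marital + contact, data = toy)

colnames(encoded)
encoded
```

Dropping a reference level matters for Logistic Regression in particular: a full set of indicators plus an intercept would make the design matrix rank-deficient.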

4.6. Addressing Class Imbalance

The dataset exhibits class imbalance in the target variable (y), requiring adjustments:

Weighted Logistic Regression: Assigns higher penalties to misclassified minority-class instances, ensuring a better balance.
Threshold tuning: Lowering the decision threshold from 0.5 to approximately 0.3 improves recall for positive cases (y = yes) at the cost of precision. The threshold is chosen from the precision-recall trade-off, and models are compared using Precision-Recall AUC rather than accuracy.
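The effect of moving the threshold can be seen on a small example. The sketch below uses made-up predicted probabilities and labels (not model output) and a hypothetical helper `pr_at()` that computes precision and recall for the positive class.

```r
# Made-up predicted probabilities and true labels (1 = subscribed).
prob  <- c(0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.70, 0.15, 0.32, 0.90)
truth <- c(0,    0,    0,    1,    0,    1,    1,    0,    1,    1)

# Precision and recall of the positive class at a given threshold.
pr_at <- function(threshold) {
  pred <- as.integer(prob >= threshold)
  tp <- sum(pred == 1 & truth == 1)
  fp <- sum(pred == 1 & truth == 0)
  fn <- sum(pred == 0 & truth == 1)
  c(threshold = threshold,
    precision = tp / (tp + fp),
    recall    = tp / (tp + fn))
}

rbind(pr_at(0.5), pr_at(0.3))
```

In this toy example, lowering the threshold from 0.5 to 0.3 recovers every positive case while admitting one false positive, which is the trade-off the threshold choice has to balance.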

5. Conclusion

This exploratory analysis of the Bank Marketing Dataset highlights key factors influencing term deposit subscriptions, including call duration, past contact history, and economic conditions. Challenges such as class imbalance, high correlation among economic indicators, and missing values necessitate targeted preprocessing to enhance model performance. Addressing these issues through feature selection, categorical encoding, and variable transformations ensures a more stable and interpretable predictive model.

Given the dataset’s characteristics, Weighted Logistic Regression is the most suitable approach due to its ability to handle imbalanced data while maintaining interpretability. Refining feature selection and optimizing preprocessing steps will enhance predictive accuracy, supporting more effective marketing strategies and data-driven decision-making. Future improvements could explore alternative classification models and deeper behavioral insights to refine customer targeting.

Lewris Mota

2025-02-15

