Dataset: A Portuguese bank conducted a marketing campaign (phone calls) to predict if a client will subscribe to a term deposit. The records of their efforts are available in the form of a dataset.
Objective: The objective is to apply machine learning techniques to analyze the dataset and identify the most effective tactics that will help the bank persuade more customers to subscribe to a term deposit in its next campaign. The Bank Marketing repository contains four different files. I compared the two larger ones and chose bank-additional-full.csv over bank-full.csv because it includes five additional macroeconomic indicators, such as the employment variation rate, consumer confidence index, and 3-month Euribor rate. These indicators may serve as better predictors than the individual account balance, which is present in bank-full but not in the bank-additional datasets. Let’s read the data.
bank_data <- read.csv("bank-additional-full.csv", sep = ";")
Let’s check the structure of the data.
# Check the first and last 6 rows
head(bank_data)
## age job marital education default housing loan contact month
## 1 56 housemaid married basic.4y no no no telephone may
## 2 57 services married high.school unknown no no telephone may
## 3 37 services married high.school no yes no telephone may
## 4 40 admin. married basic.6y no no no telephone may
## 5 56 services married high.school no no yes telephone may
## 6 45 services married basic.9y unknown no no telephone may
## day_of_week duration campaign pdays previous poutcome emp.var.rate
## 1 mon 261 1 999 0 nonexistent 1.1
## 2 mon 149 1 999 0 nonexistent 1.1
## 3 mon 226 1 999 0 nonexistent 1.1
## 4 mon 151 1 999 0 nonexistent 1.1
## 5 mon 307 1 999 0 nonexistent 1.1
## 6 mon 198 1 999 0 nonexistent 1.1
## cons.price.idx cons.conf.idx euribor3m nr.employed y
## 1 93.994 -36.4 4.857 5191 no
## 2 93.994 -36.4 4.857 5191 no
## 3 93.994 -36.4 4.857 5191 no
## 4 93.994 -36.4 4.857 5191 no
## 5 93.994 -36.4 4.857 5191 no
## 6 93.994 -36.4 4.857 5191 no
tail(bank_data)
## age job marital education default housing loan contact
## 41183 29 unemployed single basic.4y no yes no cellular
## 41184 73 retired married professional.course no yes no cellular
## 41185 46 blue-collar married professional.course no no no cellular
## 41186 56 retired married university.degree no yes no cellular
## 41187 44 technician married professional.course no no no cellular
## 41188 74 retired married professional.course no yes no cellular
## month day_of_week duration campaign pdays previous poutcome
## 41183 nov fri 112 1 9 1 success
## 41184 nov fri 334 1 999 0 nonexistent
## 41185 nov fri 383 1 999 0 nonexistent
## 41186 nov fri 189 2 999 0 nonexistent
## 41187 nov fri 442 1 999 0 nonexistent
## 41188 nov fri 239 3 999 1 failure
## emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y
## 41183 -1.1 94.767 -50.8 1.028 4963.6 no
## 41184 -1.1 94.767 -50.8 1.028 4963.6 yes
## 41185 -1.1 94.767 -50.8 1.028 4963.6 no
## 41186 -1.1 94.767 -50.8 1.028 4963.6 no
## 41187 -1.1 94.767 -50.8 1.028 4963.6 yes
## 41188 -1.1 94.767 -50.8 1.028 4963.6 no
library(tidyverse)
glimpse(bank_data) # View dataset structure
## Rows: 41,188
## Columns: 21
## $ age <int> 56, 57, 37, 40, 56, 45, 59, 41, 24, 25, 41, 25, 29, 57,…
## $ job <chr> "housemaid", "services", "services", "admin.", "service…
## $ marital <chr> "married", "married", "married", "married", "married", …
## $ education <chr> "basic.4y", "high.school", "high.school", "basic.6y", "…
## $ default <chr> "no", "unknown", "no", "no", "no", "unknown", "no", "un…
## $ housing <chr> "no", "no", "yes", "no", "no", "no", "no", "no", "yes",…
## $ loan <chr> "no", "no", "no", "no", "yes", "no", "no", "no", "no", …
## $ contact <chr> "telephone", "telephone", "telephone", "telephone", "te…
## $ month <chr> "may", "may", "may", "may", "may", "may", "may", "may",…
## $ day_of_week <chr> "mon", "mon", "mon", "mon", "mon", "mon", "mon", "mon",…
## $ duration <int> 261, 149, 226, 151, 307, 198, 139, 217, 380, 50, 55, 22…
## $ campaign <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ pdays <int> 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, …
## $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ poutcome <chr> "nonexistent", "nonexistent", "nonexistent", "nonexiste…
## $ emp.var.rate <dbl> 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, …
## $ cons.price.idx <dbl> 93.994, 93.994, 93.994, 93.994, 93.994, 93.994, 93.994,…
## $ cons.conf.idx <dbl> -36.4, -36.4, -36.4, -36.4, -36.4, -36.4, -36.4, -36.4,…
## $ euribor3m <dbl> 4.857, 4.857, 4.857, 4.857, 4.857, 4.857, 4.857, 4.857,…
## $ nr.employed <dbl> 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5…
## $ y <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "…
The dataset contains 41,188 rows (instances) and 21 columns (features + target variable y).
sum(duplicated(bank_data)) # Count the number of duplicate rows
## [1] 12
Since 12 duplicates out of ~40,000 records is a very small percentage (~0.03%), removing them won’t significantly impact the dataset. Let’s remove them.
bank_data <- bank_data[!duplicated(bank_data), ]# Remove duplicate rows
sum(duplicated(bank_data))
## [1] 0
colSums(is.na(bank_data))
## age job marital education default
## 0 0 0 0 0
## housing loan contact month day_of_week
## 0 0 0 0 0
## duration campaign pdays previous poutcome
## 0 0 0 0 0
## emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
## 0 0 0 0 0
## y
## 0
There are no missing values (NA).
table(bank_data$y) # Check the distribution of the target variable (y)
##
## no yes
## 36537 4639
The dataset is highly imbalanced, with 36,537 “no” responses (88.7%) and 4,639 “yes” responses (11.3%). Since the “yes” class is underrepresented, this could affect model performance, and we may need to apply techniques such as class weighting or oversampling (SMOTE) during pre-processing to balance the data.
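As a quick sanity check, the exact class proportions can be computed directly (a small sketch using base R):
round(prop.table(table(bank_data$y)), 3) # Proportion of "no" vs "yes" responses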
unknown_counts <- bank_data %>%
summarise(across(where(is.character), ~ sum(. == "unknown"))) # Count "unknown" in each categorical column
print(unknown_counts)
## job marital education default housing loan contact month day_of_week poutcome
## 1 330 80 1730 8596 990 990 0 0 0 0
## y
## 1 0
From our analysis, we observe that some categorical variables contain “unknown” values. Our strategy for handling them is as follows: replace “unknown” in housing, loan, and default with the most frequent category (mode), and keep “unknown” as a valid category in job, marital, and education, where it may carry meaning.
These modifications will be addressed in the Pre-processing step.
Review the structure and content of the data and answer questions such as:
Are the features (columns) of your data correlated?
Check Correlation Between Features: To analyze how different numerical variables relate to each other, let’s create a correlation matrix.
library(ggcorrplot)
numeric_data <- bank_data %>% select_if(is.numeric)# Select only numeric columns
cor_matrix <- cor(numeric_data) # Compute correlation matrix
ggcorrplot(cor_matrix, method = "square", type = "lower", lab = TRUE) # Correlation heatmap
Corrplot Analysis: The correlation matrix reveals strong relationships between several numerical features. Notably, the employment variation rate (emp.var.rate) and euribor3m have an extremely high positive correlation of 0.97, suggesting redundancy. Similarly, nr.employed is highly correlated with both euribor3m (0.95) and emp.var.rate (0.91), indicating that these economic indicators move together and may not all be necessary for modeling. There is also a moderate negative correlation of -0.59 between pdays and previous, which might suggest an inverse relationship between the number of days since the last contact and the frequency of previous contacts. Since highly correlated variables can cause multicollinearity issues in modeling, we may consider removing or combining some of them during preprocessing.
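One way to flag these redundant features programmatically is caret’s findCorrelation(), which suggests columns to drop above a chosen cutoff (a sketch; the 0.9 cutoff is an assumption, and caret is only loaded later in this analysis):
library(caret)
# Names of numeric columns whose pairwise correlation exceeds 0.9
findCorrelation(cor_matrix, cutoff = 0.9, names = TRUE)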
What is the overall distribution of each variable?
Feature Distributions:
library(ggplot2)
# Reshape data for visualization
bank_long <- bank_data %>%
pivot_longer(cols = where(is.numeric), names_to = "Feature", values_to = "Value")
# Histograms for numeric features
ggplot(bank_long, aes(x = Value)) +
geom_histogram(fill = "steelblue", color = "black", bins = 30) +
facet_wrap(~Feature, scales = "free") +
theme_minimal() +
labs(title = "Distribution of Numeric Features", x = "Value", y = "Count")
Histogram Analysis: Based on the histogram analysis, the numerical features show varying distributions. Age is right-skewed, with most clients between 25 and 60 years old. Duration and campaign are also highly skewed, with a large concentration of lower values and a few extreme cases. Pdays has a bimodal distribution, where most values are either very low or at 999, indicating a special category. Previous contacts are mostly zero, showing that many clients had no prior interactions. Economic indicators like employment variation rate (emp.var.rate) and euribor3m show multiple peaks, reflecting fluctuations in economic conditions. The distribution of consumer confidence index (cons.conf.idx) and consumer price index (cons.price.idx) appears more uniform. Overall, many variables are skewed, and some contain potential outliers that need further investigation.
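To put numbers on the skewness visible in the histograms, the moments package (used again later for selected features) can compute a skewness coefficient for every numeric column (a quick sketch):
library(moments)
# Positive values indicate right skew, negative values left skew
sapply(bank_data %>% select(where(is.numeric)), skewness)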
Are there any outliers present?
Identify Outliers Using Boxplots:
ggplot(bank_long, aes(x = Value, y = Feature)) + # Flip x and y
geom_boxplot(fill = "lightblue", outlier.color = "red") +
facet_wrap(~Feature, scales = "free", ncol = 2) +
theme_minimal() +
labs(title = "Boxplots of Numeric Features", x = "Value", y = "Feature") + # Adjust labels
theme(axis.text.x = element_text(size = 10), # Show x-axis labels
axis.text.y = element_text(size = 10),
strip.text = element_text(size = 12, face = "bold"))
Boxplot Analysis: Duration, campaign, and pdays contain extreme outliers, with several observations far above the upper whiskers. Previous and emp.var.rate also show potential outliers, but with fewer extreme points. Nr.employed and euribor3m have fewer extreme values compared to the other features. The presence of these outliers suggests that some clients had very long call durations, many contacts in the campaign, or a long gap (pdays) since their last contact. The boxplot also confirms that the age distribution is right-skewed, with a few elderly customers appearing as outliers; these records may still be valid, but we need to consider whether they could affect model performance later.
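To count how many observations the boxplots flag, we can apply the standard 1.5 × IQR rule to each numeric feature (a sketch for inspection only; nothing is filtered here):
# Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] per numeric feature
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr, na.rm = TRUE)
}
sapply(bank_data %>% select(where(is.numeric)), count_outliers)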
What are the relationships between different variables?
Relationships Between Different Variables: The correlation matrix helps reveal linear relationships between numeric variables. Based on our correlation heatmap, euribor3m, emp.var.rate, and nr.employed are highly correlated with each other, indicating a strong relationship between economic indicators. The previous campaign outcome (poutcome) and previous contacts (previous, pdays) show some relationship, suggesting that past interactions influence current marketing efforts. However, variables like age and campaign contacts do not show strong correlations with other numeric variables.
How are categorical variables distributed?
Analyzing Categorical Variable Distribution: Now that we’ve successfully identified numerical outliers, let’s analyze the distribution of categorical variables.
categorical_data <- bank_data %>%
select(where(is.character)) %>% # Select categorical variables
pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value")
glimpse(categorical_data) # Ensure Feature and Value exist
## Rows: 452,936
## Columns: 2
## $ Feature <chr> "job", "marital", "education", "default", "housing", "loan", "…
## $ Value <chr> "housemaid", "married", "basic.4y", "no", "no", "no", "telepho…
I tried to plot all the categorical variables in one figure, but it was very hard to read each category. So, I will filter and plot the categorical variables with fewer than six categories and those with more than five categories separately.
small_categorical_data <- categorical_data %>%
filter(Feature %in% c("marital", "default", "housing", "loan", "contact", "poutcome", "y"))
# Plot small categorical variables
ggplot(small_categorical_data, aes(y = reorder(Value, table(Value)[Value]), fill = Feature)) +
geom_bar() +
facet_wrap(~Feature, scales = "free", ncol = 2) +
theme_minimal() +
labs(title = "Distribution of Categorical Variables (≤5 Categories)", y = "Category", x = "Count") +
theme(axis.text.y = element_text(size = 9),
axis.text.x = element_text(size = 10),
strip.text = element_text(size = 12, face = "bold"),
legend.position = "none",
panel.spacing.x = unit(2, "lines"))
large_categorical_data <- categorical_data %>%
filter(Feature %in% c("job", "education", "month", "day_of_week"))
# Plot large categorical variables with improved x-axis width
ggplot(large_categorical_data, aes(y = reorder(Value, table(Value)[Value]), fill = Feature)) +
geom_bar() +
facet_wrap(~Feature, scales = "free", ncol = 2) +
theme_minimal() +
labs(title = "Distribution of Categorical Variables (>5 Categories)", y = "Category", x = "Count") +
theme(axis.text.y = element_text(size = 10),
axis.text.x = element_text(size = 10),
strip.text = element_text(size = 12, face = "bold"),
legend.position = "none",
panel.spacing.x = unit(2, "lines")) # Add space between facet columns
How are categorical variables distributed?
The categorical variables show clear distributions across different features. The majority of customers were contacted via cellular phones rather than landlines, and most had housing loans while fewer had personal loans. Marital status is predominantly “married,” and university degrees are the most common education level. The campaigns were largely conducted in May, and most previous campaign outcomes were either nonexistent or failures. The job distribution is diverse, with administrative, blue-collar, and technician roles being the most common. The target variable (y) is highly imbalanced, with most customers not subscribing to the term deposit.
Do any patterns or trends emerge in the data?
Several patterns and trends emerge from the data. The peak contact period for marketing campaigns appears to be in May, which suggests a seasonality effect. Additionally, past campaign success seems to be a rare occurrence, indicating that customer retention may be a challenge. The presence of “unknown” values in multiple variables such as education, job, and loan status suggests incomplete information, which could influence predictive modeling.
What is the central tendency and spread of each variable? The central tendency and spread of variables vary across different attributes. Age shows a right-skewed distribution with most customers being middle-aged, while campaign-related variables such as the number of contacts tend to have a small range, with most people being contacted only a few times. The duration of the last contact exhibits a wide range with extreme values, indicating a need for careful handling.
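For the concrete numbers behind these observations, a quick summary of a few key numeric features gives the median, quartiles, and range (a small sketch):
# Central tendency and spread of selected numeric features
summary(bank_data %>% select(age, duration, campaign))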
Are there any missing values and how significant are they?
Missing values in the dataset are mostly labeled as “unknown” rather than actual NA values. These appear frequently in job, education, default, and loan-related fields. While their significance depends on the modeling approach, the presence of a substantial number of unknowns suggests that either data collection was incomplete or some customers did not disclose their information. These missing values should be examined further to determine whether they need imputation, removal, or a separate category in modeling.
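To gauge how significant the “unknown” values are, we can express them as a share of each column rather than a raw count (a small sketch mirroring the earlier summary):
# Percentage of "unknown" entries per categorical column
bank_data %>%
  summarise(across(where(is.character), ~ round(100 * mean(. == "unknown"), 2)))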
Now that you have completed the EDA, what algorithms would suit the business purpose for this dataset? Answer questions such as:
Select two or more machine learning algorithms presented so far that could be used to train a model (no need to train models - I am only looking for your recommendations).
Algorithm Selection for the Bank Marketing Dataset: Since our goal is to predict whether a customer will subscribe to a term deposit (y: yes/no), this is a binary classification problem. Based on the data characteristics observed in the Exploratory Data Analysis (EDA) and the machine learning algorithms presented so far in our class, two suitable algorithms for this dataset are Elastic Net Logistic Regression (L1 + L2) and Naive Bayes. These models were chosen because they balance interpretability, efficiency, and handling of categorical data, making them well suited for understanding customer behavior and predicting term deposit subscriptions.
What are the pros and cons of each algorithm you selected?
Pros of Elastic Net Logistic Regression (L1 + L2):
Balances feature selection (Lasso) and stability (Ridge), preventing overfitting while keeping important predictors.
Helps with multicollinearity, making it suitable for datasets with correlated features.
More flexible than using Lasso or Ridge alone.
Adjustable for large datasets, making it computationally efficient.
Cons of Elastic Net Logistic Regression:
Requires hyperparameter tuning to balance feature selection and shrinkage.
Sensitive to outliers, requiring pre-processing techniques such as winsorization or transformation.
Assumes a linear relationship between predictors and the target (y), which may not fully capture non-linear patterns.
Pros of Naive Bayes:
Works well with categorical data and can directly handle variables such as job, education, and month without requiring extensive one-hot encoding.
Fast and scalable, which makes it efficient for large datasets.
Less sensitive to class imbalance compared to Logistic Regression.
Cons of Naive Bayes:
Assumes independence between features, which may not hold true for this dataset (e.g., job and education are related).
Generally less accurate than more complex models.
May struggle with outliers in numerical variables.
Which algorithm would you recommend, and why?
Given our dataset and business goal (helping the bank identify effective strategies for customer subscriptions), Elastic Net Logistic Regression (L1 + L2) is the best primary choice because it provides the best of both worlds—feature selection from Lasso and multicollinearity handling from Ridge. This ensures that we can identify the most important customer characteristics while maintaining model stability.
However, Elastic Net assumes a linear relationship between predictors and the target. Since real-world customer behaviors are often non-linear, we need a second model that can handle categorical data efficiently and offer an alternative classification method. Naive Bayes is the best secondary choice because it naturally handles categorical variables and is less sensitive to class imbalance, providing an efficient alternative approach to predicting customer subscriptions.
By combining Elastic Net Logistic Regression for interpretability and Naive Bayes for handling categorical data efficiently, we can achieve both actionable business insights and strong predictive power.
Are there labels in your data? Did that impact your choice of algorithm?
Yes, the dataset has a clearly labeled target variable (y), which is binary (yes/no). Since this is a supervised learning classification problem, we needed classification models instead of unsupervised learning approaches like clustering.
Additionally, the imbalance in y (~88.7% “no”, ~11.3% “yes”) impacted our choice. We avoided models like k-NN, which struggles with imbalanced datasets, and instead chose Naive Bayes, which is less sensitive to imbalance, and Elastic Net Logistic Regression, which allows for class weighting or resampling techniques to improve fairness in classification.
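If we choose class weighting for Elastic Net Logistic Regression, one simple scheme is inverse class frequency, which gives the rare “yes” cases proportionally more influence (a hedged sketch; the exact scheme and the idea of passing these via glmnet’s weights argument are assumptions, not steps taken here):
# Inverse-frequency observation weights: minority "yes" cases get larger weight
class_freq <- prop.table(table(bank_data$y))
obs_weights <- ifelse(bank_data$y == "yes", 1 / class_freq["yes"], 1 / class_freq["no"])
# These could later be supplied through the `weights` argument of glmnet()/cv.glmnet()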
How does your choice of algorithm relate to the dataset?
Our dataset contains a mix of numerical and categorical variables, some of which are highly correlated, and the target classes are imbalanced.
Elastic Net Logistic Regression is a good fit because it helps handle multicollinearity, selects important features, and allows for class weighting to address imbalance.
Naive Bayes works well with categorical variables, does not require complex pre-processing, and can still perform well despite class imbalance.
These two models complement each other: Elastic Net provides interpretability, while Naive Bayes handles categorical data efficiently. Together, they help us understand which factors influence customer decisions and improve future marketing strategies.
Would your choice of algorithm change if there were fewer than 1,000 data records, and why?
If the dataset had fewer than 1,000 records, I would likely replace Elastic Net Logistic Regression with standard Logistic Regression or k-NN, while keeping Naive Bayes because of its efficiency on small datasets.
Now that you have done an EDA and selected an algorithm, what pre-processing (if any) would you require for the following?
Answer: Since we already identified key data issues in EDA, let’s now systematically clean and prepare the dataset for Elastic Net Logistic Regression and Naive Bayes.
Data Cleaning – Improve Data Quality & Address Missing Data:
What we need to do: replace “unknown” values in housing, loan, and default with the most frequent category (mode); keep “unknown” as a valid category for job, marital, and education; and make sure every variable has the correct type.
Next Step: Apply these fixes in R.
Replace “unknown” with the Mode (Most Frequent Value): For housing, loan, and default, we identify the most frequent value (mode) of each variable and replace “unknown” with that category.
most_common_housing <- names(sort(table(bank_data$housing), decreasing = TRUE))[1]
most_common_loan <- names(sort(table(bank_data$loan), decreasing = TRUE))[1]
most_common_default <- names(sort(table(bank_data$default), decreasing = TRUE))[1]
bank_data$housing[bank_data$housing == "unknown"] <- most_common_housing
bank_data$loan[bank_data$loan == "unknown"] <- most_common_loan
bank_data$default[bank_data$default == "unknown"] <- most_common_default
Keep “unknown” as a Category for job, marital, and education: Since “unknown” in these variables may carry meaning, we will keep it as a valid category by converting them to factors (which preserves “unknown” as a level).
bank_data$job <- factor(bank_data$job)
bank_data$marital <- factor(bank_data$marital)
bank_data$education <- factor(bank_data$education)
Ensure All Variables Are Correctly Formatted: We will verify that categorical variables are factors and numeric variables are properly formatted.
# Convert categorical variables to factors
categorical_cols <- c("job", "marital", "education", "default", "housing", "loan", "contact", "month", "day_of_week", "poutcome", "y")
bank_data[categorical_cols] <- lapply(bank_data[categorical_cols], factor)
# Convert numeric variables to numeric
numeric_cols <- c("age", "campaign", "pdays", "previous", "emp.var.rate", "cons.price.idx", "cons.conf.idx", "euribor3m", "nr.employed")
bank_data[numeric_cols] <- lapply(bank_data[numeric_cols], as.numeric)
summary(bank_data)
## age job marital
## Min. :17.00 admin. :10419 divorced: 4611
## 1st Qu.:32.00 blue-collar: 9253 married :24921
## Median :38.00 technician : 6739 single :11564
## Mean :40.02 services : 3967 unknown : 80
## 3rd Qu.:47.00 management : 2924
## Max. :98.00 retired : 1718
## (Other) : 6156
## education default housing loan
## university.degree :12164 no :41173 no :18615 no :34928
## high.school : 9512 yes: 3 yes:22561 yes: 6248
## basic.9y : 6045
## professional.course: 5240
## basic.4y : 4176
## basic.6y : 2291
## (Other) : 1748
## contact month day_of_week duration
## cellular :26135 may :13767 fri:7826 Min. : 0.0
## telephone:15041 jul : 7169 mon:8512 1st Qu.: 102.0
## aug : 6176 thu:8618 Median : 180.0
## jun : 5318 tue:8086 Mean : 258.3
## nov : 4100 wed:8134 3rd Qu.: 319.0
## apr : 2631 Max. :4918.0
## (Other): 2015
## campaign pdays previous poutcome
## Min. : 1.000 Min. : 0.0 Min. :0.000 failure : 4252
## 1st Qu.: 1.000 1st Qu.:999.0 1st Qu.:0.000 nonexistent:35551
## Median : 2.000 Median :999.0 Median :0.000 success : 1373
## Mean : 2.568 Mean :962.5 Mean :0.173
## 3rd Qu.: 3.000 3rd Qu.:999.0 3rd Qu.:0.000
## Max. :56.000 Max. :999.0 Max. :7.000
##
## emp.var.rate cons.price.idx cons.conf.idx euribor3m
## Min. :-3.40000 Min. :92.20 Min. :-50.8 Min. :0.634
## 1st Qu.:-1.80000 1st Qu.:93.08 1st Qu.:-42.7 1st Qu.:1.344
## Median : 1.10000 Median :93.75 Median :-41.8 Median :4.857
## Mean : 0.08192 Mean :93.58 Mean :-40.5 Mean :3.621
## 3rd Qu.: 1.40000 3rd Qu.:93.99 3rd Qu.:-36.4 3rd Qu.:4.961
## Max. : 1.40000 Max. :94.77 Max. :-26.9 Max. :5.045
##
## nr.employed y
## Min. :4964 no :36537
## 1st Qu.:5099 yes: 4639
## Median :5191
## Mean :5167
## 3rd Qu.:5228
## Max. :5228
##
The above summary shows that our data cleaning steps were applied correctly: “unknown” values were handled as planned, categorical variables are factors (job, marital, education, etc.), numeric variables are correctly formatted (age, campaign, pdays, etc.), and there are no missing values (NA).
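As an extra sanity check (a small sketch), we can confirm that no “unknown” values remain in the three imputed columns:
# Should return 0 for housing, loan, and default after mode imputation
sapply(bank_data[c("housing", "loan", "default")], function(x) sum(x == "unknown"))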
Dimensionality Reduction - remove correlated/redundant data
What we need to do: drop duration, which leaks the outcome (it is only known after the call has taken place), and identify highly correlated numeric features.
Next Step: Decide if we should remove correlated features manually or let Elastic Net handle it.
Looking at the boxplot of numeric features, we can see several extreme outliers, especially in duration, campaign, pdays, and previous. We can let Elastic Net handle age, emp.var.rate, cons.price.idx, euribor3m, and nr.employed directly: aside from age, these are economic indicators, so their extreme values are not really “outliers” but actual recorded conditions. No manual intervention is needed for them.
library(dplyr)
bank_data <- bank_data %>%
select(-duration) # Remove duration since it's data leakage
library(corrr)
numeric_data <- bank_data %>% select(where(is.numeric))
cor_matrix <- cor(numeric_data)
# Find highly correlated features (above 0.9)
high_corr <- as.data.frame(as.table(cor_matrix)) %>%
filter(Var1 != Var2, abs(Freq) > 0.9) %>%
arrange(desc(abs(Freq)))
print(high_corr)
## Var1 Var2 Freq
## 1 euribor3m emp.var.rate 0.9722438
## 2 emp.var.rate euribor3m 0.9722438
## 3 nr.employed euribor3m 0.9451459
## 4 euribor3m nr.employed 0.9451459
## 5 nr.employed emp.var.rate 0.9069495
## 6 emp.var.rate nr.employed 0.9069495
From the above we can see that three features are highly correlated. Elastic Net can handle this, but Naive Bayes does not automatically handle multicollinearity: since it assumes independence between features, highly correlated variables could negatively affect its performance. We can remove nr.employed, since it is closely tied to emp.var.rate and euribor3m.
bank_data_nb <- bank_data %>% select(-nr.employed)
This version of the dataset is only for Naïve Bayes.
Feature Engineering - use of business knowledge to create new features
Feature engineering involves creating new variables that could improve model performance.
What we need to do: cap extreme outliers (Winsorization), transform heavily skewed features, and recode the special pdays value of 999 into a categorical feature.
If we look at our boxplot for the numeric features, we can see that campaign, pdays, and previous have severe outliers, causing high skewness. We will use Winsorization, which caps extreme values at the 1st and 99th percentiles; this reduces their impact without removing valuable data and should prevent model instability while preserving most of the data distribution.
# Define Winsorization function
winsorize <- function(x, lower_quantile = 0.01, upper_quantile = 0.99) {
lower_bound <- quantile(x, lower_quantile, na.rm = TRUE)
upper_bound <- quantile(x, upper_quantile, na.rm = TRUE)
x[x < lower_bound] <- lower_bound
x[x > upper_bound] <- upper_bound
return(x)
}
# Apply Winsorization
bank_data <- bank_data %>%
mutate(
campaign = winsorize(campaign),
pdays = winsorize(pdays),
previous = winsorize(previous)
)
# Check updated boxplots to confirm capping
library(ggplot2)
bank_long <- bank_data %>% pivot_longer(cols = where(is.numeric), names_to = "Feature", values_to = "Value")
ggplot(bank_long, aes(y = Value, x = Feature)) +
geom_boxplot(fill = "lightblue", outlier.color = "red") +
facet_wrap(~Feature, scales = "free", ncol = 2) +
coord_flip() +
theme_minimal() +
labs(title = "Boxplots of Numeric Features (After Winsorization, Duration Removed)", x = "Feature", y = "Value") +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_text(size = 10),
strip.text = element_text(size = 12, face = "bold"))
The boxplots clearly show the effects of Winsorization and removing the duration feature. The extreme outliers have been reduced, making the data more suitable for modeling.
Check Skewness of Numeric Features
library(moments) # skewness calculation
skewed_features <- bank_data %>%
select(campaign, pdays, previous) # Select the numeric features
skewness_values <- apply(skewed_features, 2, skewness) # Compute skewness for each feature
skewness_values
## campaign pdays previous
## 2.703432 -4.921200 2.725518
Campaign and previous are highly right-skewed, while pdays is highly left-skewed. We will apply a log transformation: for the right-skewed features (campaign and previous), we use log(x + 1) to avoid log(0).
bank_data <- bank_data %>%
mutate(
campaign_log = log1p(campaign), # log(x + 1)
previous_log = log1p(previous))
# Remove original columns
bank_data <- bank_data %>%
select(-campaign, -previous)
# Check skewness after transformation
transformed_skewness <- bank_data %>%
select(campaign_log, previous_log) %>%
summarise_all(skewness)
print(transformed_skewness)
## campaign_log previous_log
## 1 1.162546 2.358403
Let’s check pdays:
table(bank_data$pdays)
##
## 3 4 5 6 7 8 9 10 11 12 13 14 15
## 541 118 46 412 60 18 64 52 28 58 36 20 24
## 16 17 18 19 20 21 22 25 26 27 999
## 11 8 7 3 1 2 3 1 1 1 39661
So, 999 is not a regular numerical value; it is a special category indicating that the customer was never contacted before. Treating it as a number doesn’t make sense, so we convert it into a categorical variable instead.
library(dplyr)
# Convert pdays into a categorical feature
bank_data <- bank_data %>%
mutate(
pdays_cat = case_when(
pdays == 999 ~ "Never Contacted",
pdays <= 7 ~ "Contacted Recently (0-7 days)",
pdays <= 30 ~ "Contacted Last Month (8-30 days)",
TRUE ~ "Contacted Earlier (30+ days)"
)
)
# Convert to factor for modeling
bank_data$pdays_cat <- as.factor(bank_data$pdays_cat)
# Drop original `pdays` column
bank_data <- bank_data %>%
select(-pdays)
table(bank_data$pdays_cat)
##
## Contacted Last Month (8-30 days) Contacted Recently (0-7 days)
## 338 1177
## Never Contacted
## 39661
Reference: https://www.datacamp.com/tutorial/winsorized-mean
Sampling Data – Resizing the Dataset for Training
What we need to do: split the data into training and test sets while preserving the class balance.
Next Step: We first create training and test sets and then apply encoding separately for Logistic Regression. Since our target variable y is imbalanced, we’ll use stratified sampling to ensure both classes are properly represented in the train/test sets.
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
set.seed(123) # Ensure reproducibility
# Stratified sampling (80% train, 20% test)
trainIndex <- createDataPartition(bank_data$y, p = 0.8, list = FALSE)
# Split dataset into training and test sets
train_data <- bank_data[trainIndex, ]
test_data <- bank_data[-trainIndex, ]
# Check class distribution
table(train_data$y) / nrow(train_data)
##
## no yes
## 0.8873171 0.1126829
table(test_data$y) / nrow(test_data)
##
## no yes
## 0.887418 0.112582
The class distribution in the training and test sets is consistent with the original dataset (~88.7% “no”, ~11.3% “yes”).
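The Naive Bayes copy of the data (bank_data_nb, created earlier without nr.employed) can be split with the same stratified indices so both models are evaluated on the same rows (a sketch, assuming bank_data_nb has been kept in the same row order as bank_data):
# Split the Naive Bayes dataset (factors kept as-is) using the same partition
train_data_nb <- bank_data_nb[trainIndex, ]
test_data_nb  <- bank_data_nb[-trainIndex, ]
table(train_data_nb$y) / nrow(train_data_nb) # Should mirror the ~88.7/11.3 split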
Data Transformation – Encoding & Scaling Features
What we need to do: one-hot encode categorical variables for Logistic Regression, keep them as factors for Naive Bayes, and consider scaling the numeric features.
Next Step: Encoding Categorical Variables (For Logistic Regression): Since Logistic Regression requires numerical inputs, we need to one-hot encode categorical variables. However, we will keep categorical variables as factors for Naive Bayes.
library(caret)
# One-hot encode categorical variables
dummies <- dummyVars(~ ., data = train_data, fullRank = TRUE) # Create dummy variables
train_data_encoded <- predict(dummies, newdata = train_data) %>% as.data.frame()
test_data_encoded <- predict(dummies, newdata = test_data) %>% as.data.frame()
# Check the structure of the encoded dataset
str(train_data_encoded)
## 'data.frame': 32942 obs. of 51 variables:
## $ age : num 56 57 37 40 56 45 59 41 24 25 ...
## $ job.blue-collar : num 0 0 0 0 0 0 0 1 0 0 ...
## $ job.entrepreneur : num 0 0 0 0 0 0 0 0 0 0 ...
## $ job.housemaid : num 1 0 0 0 0 0 0 0 0 0 ...
## $ job.management : num 0 0 0 0 0 0 0 0 0 0 ...
## $ job.retired : num 0 0 0 0 0 0 0 0 0 0 ...
## $ job.self-employed : num 0 0 0 0 0 0 0 0 0 0 ...
## $ job.services : num 0 1 1 0 1 1 0 0 0 1 ...
## $ job.student : num 0 0 0 0 0 0 0 0 0 0 ...
## $ job.technician : num 0 0 0 0 0 0 0 0 1 0 ...
## $ job.unemployed : num 0 0 0 0 0 0 0 0 0 0 ...
## $ job.unknown : num 0 0 0 0 0 0 0 0 0 0 ...
## $ marital.married : num 1 1 1 1 1 1 1 1 0 0 ...
## $ marital.single : num 0 0 0 0 0 0 0 0 1 1 ...
## $ marital.unknown : num 0 0 0 0 0 0 0 0 0 0 ...
## $ education.basic.6y : num 0 0 0 1 0 0 0 0 0 0 ...
## $ education.basic.9y : num 0 0 0 0 0 1 0 0 0 0 ...
## $ education.high.school : num 0 1 1 0 1 0 0 0 0 1 ...
## $ education.illiterate : num 0 0 0 0 0 0 0 0 0 0 ...
## $ education.professional.course : num 0 0 0 0 0 0 1 0 1 0 ...
## $ education.university.degree : num 0 0 0 0 0 0 0 0 0 0 ...
## $ education.unknown : num 0 0 0 0 0 0 0 1 0 0 ...
## $ default.yes : num 0 0 0 0 0 0 0 0 0 0 ...
## $ housing.yes : num 0 0 1 0 0 0 0 0 1 1 ...
## $ loan.yes : num 0 0 0 0 1 0 0 0 0 0 ...
## $ contact.telephone : num 1 1 1 1 1 1 1 1 1 1 ...
## $ month.aug : num 0 0 0 0 0 0 0 0 0 0 ...
## $ month.dec : num 0 0 0 0 0 0 0 0 0 0 ...
## $ month.jul : num 0 0 0 0 0 0 0 0 0 0 ...
## $ month.jun : num 0 0 0 0 0 0 0 0 0 0 ...
## $ month.mar : num 0 0 0 0 0 0 0 0 0 0 ...
## $ month.may : num 1 1 1 1 1 1 1 1 1 1 ...
## $ month.nov : num 0 0 0 0 0 0 0 0 0 0 ...
## $ month.oct : num 0 0 0 0 0 0 0 0 0 0 ...
## $ month.sep : num 0 0 0 0 0 0 0 0 0 0 ...
## $ day_of_week.mon : num 1 1 1 1 1 1 1 1 1 1 ...
## $ day_of_week.thu : num 0 0 0 0 0 0 0 0 0 0 ...
## $ day_of_week.tue : num 0 0 0 0 0 0 0 0 0 0 ...
## $ day_of_week.wed : num 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome.nonexistent : num 1 1 1 1 1 1 1 1 1 1 ...
## $ poutcome.success : num 0 0 0 0 0 0 0 0 0 0 ...
## $ emp.var.rate : num 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
## $ cons.price.idx : num 94 94 94 94 94 ...
## $ cons.conf.idx : num -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
## $ euribor3m : num 4.86 4.86 4.86 4.86 4.86 ...
## $ nr.employed : num 5191 5191 5191 5191 5191 ...
## $ y.yes : num 0 0 0 0 0 0 0 0 0 0 ...
## $ campaign_log : num 0.693 0.693 0.693 0.693 0.693 ...
## $ previous_log : num 0 0 0 0 0 0 0 0 0 0 ...
## $ pdays_cat.Contacted Recently (0-7 days): num 0 0 0 0 0 0 0 0 0 0 ...
## $ pdays_cat.Never Contacted : num 1 1 1 1 1 1 1 1 1 1 ...
Before encoding, the categorical variables like job, education, month, day_of_week, etc., were stored as single columns with multiple categories. After encoding, each category within a categorical variable was converted into a separate binary column (0 or 1).
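The section heading also mentions scaling. Because Elastic Net penalizes coefficient magnitudes, the numeric features should sit on comparable scales; below is a hedged sketch using caret’s preProcess, fitted on the training set only (note that glmnet can also standardize internally, so this step is optional):
# Center and scale numeric columns using training-set statistics only
num_cols <- c("age", "emp.var.rate", "cons.price.idx", "cons.conf.idx",
              "euribor3m", "nr.employed", "campaign_log", "previous_log")
pp <- preProcess(train_data_encoded[num_cols], method = c("center", "scale"))
train_data_encoded[num_cols] <- predict(pp, train_data_encoded[num_cols])
test_data_encoded[num_cols]  <- predict(pp, test_data_encoded[num_cols])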
Imbalanced Data – Addressing Class Imbalance (~88.7% No, ~11.3% Yes)
What we need to do:
For Elastic Net Logistic Regression: Apply SMOTE oversampling because: We need to improve recall for “yes” (subscribed) customers. SMOTE creates synthetic cases, ensuring our model learns from enough positive examples. We have only ~11% “yes” cases—without SMOTE, Logistic Regression might struggle to identify patterns in the minority class.
For Naive Bayes: No action needed, since it handles class imbalance well.
Next Step: Apply SMOTE for Logistic Regression
library(themis)
## Loading required package: recipes
##
## Attaching package: 'recipes'
## The following object is masked from 'package:stringr':
##
## fixed
## The following object is masked from 'package:stats':
##
## step
library(recipes)
set.seed(123)
train_data_encoded$y.yes <- as.factor(train_data_encoded$y.yes)# Convert target variable (y.yes) to a factor
smote_recipe <- recipe(y.yes ~ ., data = train_data_encoded) %>% # Create a recipe for SMOTE
step_smote(y.yes, over_ratio = 1) # Creates a balanced dataset
smote_prep <- prep(smote_recipe, training = train_data_encoded) # Prep and apply SMOTE
train_data_smote <- bake(smote_prep, new_data = NULL)
# Check class distribution after SMOTE
table(train_data_smote$y.yes) / nrow(train_data_smote)
##
## 0 1
## 0.5 0.5
The dataset is now balanced (50% “no”, 50% “yes”) after applying SMOTE.