For my analysis I chose to use bank-additional-full.csv because it contains all available examples (41,188 rows), has 20 input variables and is ordered by date, which might help in identifying time-based trends if needed. Using a reduced dataset (such as bank-full.csv, which was my second option) might miss key patterns.
Loading the libraries
library(ggcorrplot)
## Warning: package 'ggcorrplot' was built under R version 4.3.3
## Warning: package 'ggplot2' was built under R version 4.3.3
library(tidyverse)
library(DataExplorer)
## Warning: package 'DataExplorer' was built under R version 4.3.3
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.3
library(ROSE)
## Warning: package 'ROSE' was built under R version 4.3.3
library(ggplot2)
library(patchwork)
## Warning: package 'patchwork' was built under R version 4.3.3
library(summarytools)
## Warning: package 'summarytools' was built under R version 4.3.3
library(scales)
## Warning: package 'scales' was built under R version 4.3.3
library(moments)
library(dplyr)
library(MASS)
library(car)
## Warning: package 'car' was built under R version 4.3.3
## Warning: package 'carData' was built under R version 4.3.3
library(readr)
Load the dataset
# I uploaded on github for reproducibility purposes
url <- "https://raw.githubusercontent.com/NikoletaEm/datasps/refs/heads/main/bank-additional-full.csv"
bank <- read.csv(url, sep = ";")
head(bank)
## age job marital education default housing loan contact month
## 1 56 housemaid married basic.4y no no no telephone may
## 2 57 services married high.school unknown no no telephone may
## 3 37 services married high.school no yes no telephone may
## 4 40 admin. married basic.6y no no no telephone may
## 5 56 services married high.school no no yes telephone may
## 6 45 services married basic.9y unknown no no telephone may
## day_of_week duration campaign pdays previous poutcome emp.var.rate
## 1 mon 261 1 999 0 nonexistent 1.1
## 2 mon 149 1 999 0 nonexistent 1.1
## 3 mon 226 1 999 0 nonexistent 1.1
## 4 mon 151 1 999 0 nonexistent 1.1
## 5 mon 307 1 999 0 nonexistent 1.1
## 6 mon 198 1 999 0 nonexistent 1.1
## cons.price.idx cons.conf.idx euribor3m nr.employed y
## 1 93.994 -36.4 4.857 5191 no
## 2 93.994 -36.4 4.857 5191 no
## 3 93.994 -36.4 4.857 5191 no
## 4 93.994 -36.4 4.857 5191 no
## 5 93.994 -36.4 4.857 5191 no
## 6 93.994 -36.4 4.857 5191 no
Initial Observations
str(bank)
## 'data.frame': 41188 obs. of 21 variables:
## $ age : int 56 57 37 40 56 45 59 41 24 25 ...
## $ job : chr "housemaid" "services" "services" "admin." ...
## $ marital : chr "married" "married" "married" "married" ...
## $ education : chr "basic.4y" "high.school" "high.school" "basic.6y" ...
## $ default : chr "no" "unknown" "no" "no" ...
## $ housing : chr "no" "no" "yes" "no" ...
## $ loan : chr "no" "no" "no" "no" ...
## $ contact : chr "telephone" "telephone" "telephone" "telephone" ...
## $ month : chr "may" "may" "may" "may" ...
## $ day_of_week : chr "mon" "mon" "mon" "mon" ...
## $ duration : int 261 149 226 151 307 198 139 217 380 50 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
## $ emp.var.rate : num 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
## $ cons.price.idx: num 94 94 94 94 94 ...
## $ cons.conf.idx : num -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
## $ euribor3m : num 4.86 4.86 4.86 4.86 4.86 ...
## $ nr.employed : num 5191 5191 5191 5191 5191 ...
## $ y : chr "no" "no" "no" "no" ...
Age (numeric) – Age of the client.
Job (categorical) – Type of job (e.g., admin., blue-collar, management, student, etc.).
Marital (categorical) – Marital status (divorced, married, single, unknown).
Education (categorical) – Highest education level attained (basic.4y, high school, university degree, etc.).
Default (categorical) – Whether the client has credit in default (yes, no, unknown).
Housing (categorical) – Whether the client has a housing loan (yes, no, unknown).
Loan (categorical) – Whether the client has a personal loan (yes, no, unknown).
Contact (categorical) – Communication type (cellular or telephone).
Month (categorical) – Last contact month (e.g., Jan, Feb, Mar, …, Dec).
Day of Week (categorical) – Last contact day (Monday to Friday).
Duration (numeric) – Last contact duration in seconds (Important: This should be excluded for predictive modeling as it leaks information about the outcome).
Campaign (numeric) – Number of contacts performed during this campaign (including last contact).
Pdays (numeric) – Number of days since the client was last contacted (999 means never contacted before).
Previous (numeric) – Number of times the client was contacted before this campaign.
Poutcome (categorical) – Outcome of the previous marketing campaign (failure, nonexistent, success).
Employment Variation Rate (emp.var.rate) (numeric) – Quarterly employment variation rate.
Consumer Price Index (cons.price.idx) (numeric) – Monthly consumer price index.
Consumer Confidence Index (cons.conf.idx) (numeric) – Monthly consumer confidence index.
Euribor 3-month Rate (euribor3m) (numeric) – Daily Euribor 3-month interest rate.
Number of Employees (nr.employed) (numeric) – Quarterly number of employees.
y (binary) – Whether the client subscribed to a term deposit (yes/no).
The dataset consists of 41,188 observations and 21 variables, with a mix of numerical and categorical features. The numerical variables are age, duration, campaign, pdays, previous, emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m and nr.employed; the categorical variables are job, marital, education, default, housing, loan, contact, month, day_of_week, poutcome and y. Our target variable is y, which indicates whether a client subscribed to the term deposit and is binary with two levels: “no” (0) and “yes” (1). It’s important to note that some categorical variables (e.g., default, housing, loan) have “unknown” as a category, which might represent missing values.
Checking for missing values
colSums(is.na(bank))
## age job marital education default
## 0 0 0 0 0
## housing loan contact month day_of_week
## 0 0 0 0 0
## duration campaign pdays previous poutcome
## 0 0 0 0 0
## emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
## 0 0 0 0 0
## y
## 0
No missing values (NAs) were detected!
summary(bank)
## age job marital education
## Min. :17.00 Length:41188 Length:41188 Length:41188
## 1st Qu.:32.00 Class :character Class :character Class :character
## Median :38.00 Mode :character Mode :character Mode :character
## Mean :40.02
## 3rd Qu.:47.00
## Max. :98.00
## default housing loan contact
## Length:41188 Length:41188 Length:41188 Length:41188
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## month day_of_week duration campaign
## Length:41188 Length:41188 Min. : 0.0 Min. : 1.000
## Class :character Class :character 1st Qu.: 102.0 1st Qu.: 1.000
## Mode :character Mode :character Median : 180.0 Median : 2.000
## Mean : 258.3 Mean : 2.568
## 3rd Qu.: 319.0 3rd Qu.: 3.000
## Max. :4918.0 Max. :56.000
## pdays previous poutcome emp.var.rate
## Min. : 0.0 Min. :0.000 Length:41188 Min. :-3.40000
## 1st Qu.:999.0 1st Qu.:0.000 Class :character 1st Qu.:-1.80000
## Median :999.0 Median :0.000 Mode :character Median : 1.10000
## Mean :962.5 Mean :0.173 Mean : 0.08189
## 3rd Qu.:999.0 3rd Qu.:0.000 3rd Qu.: 1.40000
## Max. :999.0 Max. :7.000 Max. : 1.40000
## cons.price.idx cons.conf.idx euribor3m nr.employed
## Min. :92.20 Min. :-50.8 Min. :0.634 Min. :4964
## 1st Qu.:93.08 1st Qu.:-42.7 1st Qu.:1.344 1st Qu.:5099
## Median :93.75 Median :-41.8 Median :4.857 Median :5191
## Mean :93.58 Mean :-40.5 Mean :3.621 Mean :5167
## 3rd Qu.:93.99 3rd Qu.:-36.4 3rd Qu.:4.961 3rd Qu.:5228
## Max. :94.77 Max. :-26.9 Max. :5.045 Max. :5228
## y
## Length:41188
## Class :character
## Mode :character
##
##
##
Looking at the summary of our data we can notice quite a few “unknown” values. More specifically there are 8,597 in default (~20% of the data), 990 in housing (~2.4%), 990 in loan (~2.4%) and 80 in marital (<1%). For those values there are three approaches we can follow: if “unknown” is meaningful (e.g., customers refusing to disclose information), we keep it as a category; if “unknown” is missing completely at random, imputation or removal might be better; and for columns with a high share of “unknown” (like default at ~20%), we should consider dropping the column if it is not critical.
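Because summary() only reports Length/Class/Mode for character columns, the “unknown” counts quoted above can be verified with a quick tabulation (a minimal sketch over the columns just discussed):
# Count "unknown" entries in the categorical columns discussed above
sapply(bank[, c("default", "housing", "loan", "marital")],
       function(x) sum(x == "unknown"))
Separately, pdays uses the value 999 as a flag for clients who were never contacted before; let’s count how many rows carry it.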
sum(bank$pdays == 999)
## [1] 39673
sum(bank$pdays != 999)
## [1] 1515
This suggests that the vast majority of clients in our dataset had no prior contact. We need to investigate this a bit further.
table_pdays_y <- table(bank$pdays == 999, bank$y)
chisq.test(table_pdays_y)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table_pdays_y
## X-squared = 4341.7, df = 1, p-value < 2.2e-16
Since the p-value is far below 0.05, we reject the null hypothesis that pdays and the target variable (y) are independent. This means that pdays does influence the response variable, so we can’t simply drop it. A good idea would be to transform it into a binary variable (previously_contacted), where “yes” means pdays ≠ 999 (the client was contacted before) and “no” means pdays = 999 (first-time contact).
Central tendency (mean, median) and spread (range, IQR and SD) of each variable
numeric_data <- bank %>% dplyr::select(where(is.numeric))
summary_stats <- data.frame(
Mean = sapply(numeric_data, mean, na.rm = TRUE),
Median = sapply(numeric_data, median, na.rm = TRUE),
Range = sapply(numeric_data, function(x) diff(range(x, na.rm = TRUE))),
IQR = sapply(numeric_data, function(x) IQR(x, na.rm = TRUE)),
SD = sapply(numeric_data, sd, na.rm = TRUE)
)
print(summary_stats)
## Mean Median Range IQR SD
## age 40.0240604 38.000 81.000 15.000 10.4212500
## duration 258.2850102 180.000 4918.000 217.000 259.2792488
## campaign 2.5675925 2.000 55.000 2.000 2.7700135
## pdays 962.4754540 999.000 999.000 0.000 186.9109073
## previous 0.1729630 0.000 7.000 0.000 0.4949011
## emp.var.rate 0.0818855 1.100 4.800 3.200 1.5709597
## cons.price.idx 93.5756644 93.749 2.566 0.919 0.5788400
## cons.conf.idx -40.5026003 -41.800 23.900 6.300 4.6281979
## euribor3m 3.6212908 4.857 4.411 3.617 1.7344474
## nr.employed 5167.0359109 5191.000 264.500 129.000 72.2515277
Correlation analysis
cor_matrix <- cor(numeric_data, use = "pairwise.complete.obs")
ggcorrplot(cor_matrix, lab = TRUE, hc.order = TRUE, type = "lower")
We have some highly correlated features: emp.var.rate and nr.employed have a very strong positive correlation (0.91), euribor3m is highly correlated with both emp.var.rate (0.97) and nr.employed (0.95), and cons.price.idx has a strong correlation with emp.var.rate (0.78) and nr.employed (0.69). On the other hand, we also have negatively correlated features: previous has a strong negative correlation with euribor3m (-0.42) and emp.var.rate (-0.42), and a negative correlation with nr.employed (-0.5). The strong correlation between the employment-related variables (emp.var.rate, nr.employed, euribor3m) suggests that they capture similar economic trends. Creating an interaction feature like age * campaign could also reveal whether younger or older clients are more likely to be influenced by repeated calls.
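As an illustration of that last idea, a minimal sketch of such an interaction term (the column name age_campaign_interaction and the separate bank_interaction data frame are my own, purely illustrative):
# Hypothetical interaction feature: age multiplied by the number of contacts.
# In a later model, its coefficient would indicate whether the effect of
# repeated calls differs with client age.
bank_interaction <- bank %>%
  mutate(age_campaign_interaction = age * campaign)
summary(bank_interaction$age_campaign_interaction)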
Feature Distribution
numeric_columns <- sapply(bank, is.numeric)
par(mfrow = c(2, 3))
for (col in names(bank)[numeric_columns]) {
hist(bank[[col]],
main = paste("Histogram of", col),
xlab = col,
col = "maroon",
border = "black")
}
par(mfrow = c(1, 1))
The age variable follows a distribution reasonably close to normality, while duration and campaign are highly skewed, indicating that most calls were short and few customers were contacted multiple times. Pdays and previous show large spikes, suggesting many clients were either never contacted before or contacted after a long gap. Economic indicators like emp.var.rate, euribor3m, cons.price.idx, and nr.employed exhibit multimodal distributions, likely reflecting different economic conditions over time.
To better understand our data’s distribution I will compute the actual values of skewness and kurtosis.
# To remember
# Skewness close to 0 → Data is approximately symmetrical
# Positive skewness (> 0) -> Right-skewed
# Negative skewness (< 0) -> Left-skewed
# Kurtosis close to 3 -> Data follows a normal distribution
# Kurtosis > 3 -> Heavy-tailed aka more outliers
# Kurtosis < 3 -> Light-tailed aka fewer extreme values
skewness_values <- apply(numeric_data, 2, skewness)
kurtosis_values <- apply(numeric_data, 2, kurtosis)
skew_kurt_summary <- data.frame(
Skewness = skewness_values,
Kurtosis = kurtosis_values
)
print(skew_kurt_summary)
## Skewness Kurtosis
## age 0.7846682 3.791070
## duration 3.2630224 23.245334
## campaign 4.7623333 39.975160
## pdays -4.9220107 25.226619
## previous 3.8319027 23.106230
## emp.var.rate -0.7240692 1.937352
## cons.price.idx -0.2308792 2.170146
## cons.conf.idx 0.3031688 2.641340
## euribor3m -0.7091621 1.593222
## nr.employed -1.0442244 2.996094
Skewness results: age (0.78) and cons.conf.idx (0.30) are both slightly skewed, and cons.price.idx (-0.23) is close to a normal distribution. One insight is that we have slightly more younger customers than older ones. We have three strongly right-skewed variables, duration (3.26), campaign (4.76) and previous (3.83), and four left-skewed variables, pdays (-4.92), nr.employed (-1.04), emp.var.rate (-0.72) and euribor3m (-0.71). Kurtosis results: high kurtosis was observed in four variables, duration (23.24), campaign (39.97), pdays (25.23) and previous (23.11). That means they have extreme outliers, indicating rare but extreme cases such as very long calls. Low kurtosis was observed in emp.var.rate (1.93), cons.price.idx (2.17), cons.conf.idx (2.64) and euribor3m (1.59); these variables have fewer extreme values. As a next step, the highly skewed variables (duration, campaign, pdays, previous) might need a log transformation.
Outliers
We will view the boxplots to verify the results stated above.
par(mfrow = c(2, 3))
for (col in names(bank)[numeric_columns]) {
boxplot(bank[[col]],
main = paste("Boxplot of", col),
col = "purple",
border = "black",
horizontal = TRUE)
}
par(mfrow = c(1, 1))
The boxplots reveal the presence of outliers in multiple variables, especially in duration, campaign, pdays, and previous, which could potentially affect our model’s performance. To handle these outliers, I’d consider a log transformation to reduce skewness, as mentioned before, or removing extreme cases. The appropriate approach depends on whether these outliers carry meaningful information or result from data entry errors. Outliers in duration indicate lengthy calls, possibly from engaged customers who are more likely to subscribe, which is valuable information rather than noise. Also, outliers in campaign (number of contacts) could reflect persistent marketing efforts, while extreme values in pdays (days since last contact) might indicate a long gap between interactions.
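To get a rough sense of how many points the boxplots flag, here is a small sketch counting values outside the usual 1.5 × IQR fences (the helper name iqr_outlier_count is mine):
# Count observations beyond the 1.5 * IQR whiskers used by boxplot()
iqr_outlier_count <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * IQR(x, na.rm = TRUE)
  sum(x < q[1] - fence | x > q[2] + fence, na.rm = TRUE)
}
sapply(bank[, c("duration", "campaign", "pdays", "previous")], iqr_outlier_count)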
Categorical Variables
# Identify categorical columns excluding numerical ones
categorical_columns <- names(bank)[!sapply(bank, is.numeric)]
par(mfrow = c(2, 2))
# Loop through categorical variables and create bar plots
for (col in categorical_columns) {
barplot(table(bank[[col]]),
main = paste("Distribution of", col),
col = "steelblue",
border = "black",
las = 2,
cex.names = 0.8)
}
par(mfrow = c(1, 1))
Patterns/Trends
ggplot(bank, aes(x = y, y = duration)) +
geom_boxplot(fill = "lightgreen") +
labs(title = "Call Duration vs Subscription")
The plot above shows that clients who subscribed generally had longer call durations than those who did not. The median call duration for subscribers is noticeably higher, suggesting that longer conversations tend to increase the chances of subscription.
ggplot(bank, aes(x = y, y = campaign)) +
geom_boxplot(fill = "blue") +
labs(title = "Number of Contacts vs Subscription")
Some insights from the plot above: a few individuals were contacted more than 40 times and still did not subscribe, indicating that excessive follow-ups may not be effective, and most subscriptions happened within a low number of calls, suggesting that persistent calls beyond a certain threshold are not very useful. That tells us the bank should prioritize call quality over quantity, ensuring that calls are engaging and persuasive rather than simply frequent.
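To quantify the “low number of calls” observation, a quick exploratory summary (a sketch; the bin edges are an arbitrary choice of mine):
# Subscription rate by number of contacts in the current campaign
bank %>%
  mutate(campaign_bin = cut(campaign, breaks = c(0, 1, 2, 3, 5, 10, Inf))) %>%
  group_by(campaign_bin) %>%
  summarise(n = n(), subscription_rate = mean(y == "yes"))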
ggplot(bank, aes(x = campaign, y = duration, color = y)) +
geom_point(alpha = 0.5) +
labs(title = "Call Duration vs Number of Contacts by Subscription Outcome")
It is apparent from the scatter plot above that most successful subscriptions (“yes”, shown in light blue) occur at relatively low numbers of campaign contacts but longer call durations, while the large cluster of non-subscriptions (“no”, in red) occurs at low call durations, even when multiple calls were made. In conclusion, the scatter plot suggests that long call duration is a stronger predictor of success than the number of contacts.
Seasonality
monthly_counts <- bank %>%
count(month, y) %>%
group_by(month) %>%
mutate(percentage = n / sum(n) * 100)
# Plot
ggplot(monthly_counts, aes(x = month, y = percentage, group = y, color = y)) +
geom_line(size = 1) +
geom_point(size = 2) +
labs(title = "Subscription Rate by Month", x = "Month", y = "Percentage (%)") +
theme_minimal(base_size = 14) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The plot above makes it very apparent that the likelihood of a client subscribing is highest in March, December and October, whereas the probability of a non-subscription (‘no’) is highest in July, May and August. Companies can leverage the high-conversion months (March, December, October) by running intensive marketing campaigns, improve performance in the low-conversion months (July, May, August) by adjusting messaging or customer engagement strategies, and also investigate the causes of seasonal variation by asking: are external factors (e.g., holidays, financial cycles) affecting client decisions?
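For reference, the same rates can be pulled out of the monthly_counts table computed above and ranked:
# Months ranked by subscription ("yes") percentage
monthly_counts %>%
  filter(y == "yes") %>%
  arrange(desc(percentage))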
create_plot <- function(data, x_var, title) {
ggplot(data, aes_string(x = x_var, fill = "y")) +
geom_bar(position = "fill", width = 0.7) +
geom_text(stat = "count", aes(label = scales::percent(..count.. / tapply(..count.., ..x.., sum)[..x..], accuracy = 0.1)),
position = position_fill(vjust = 0.5), size = 2.9, color = "black",fontface = "bold" ) +
labs(title = title, x = x_var, y = "Proportion") +
theme_minimal(base_size = 14) + # Increase text size
theme(axis.text.x = element_text(angle = 45, hjust = 1))
}
# Plot each variable individually
plot1 <- create_plot(bank, "job", "Subscription Rate by Job Type")
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
print(plot1)
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
plot2 <- create_plot(bank, "marital", "Subscription Rate by Marital Status")
print(plot2)
plot3 <- create_plot(bank, "education", "Subscription Rate by Education Level")
print(plot3)
plot4 <- create_plot(bank, "contact", "Subscription Rate by Contact Type")
print(plot4)
plot5 <- create_plot(bank, "day_of_week", "Subscription Rate by Day of the Week")
print(plot5)
Key insights: Retired individuals and students have the highest subscription rates (25.2% and 31.4% respectively), indicating they are more receptive to the offering. On the other hand, the blue-collar and entrepreneur job types show the lowest subscription rates (6.9% and 8.5% respectively), suggesting these groups may be less engaged or less financially inclined to subscribe. The highest subscription rates occur on Thursdays (12.1%) and Tuesdays (11.8%), while Mondays and Fridays, the start and end of the week, see the lowest engagement. Marketing campaigns should therefore be optimized for midweek outreach, as potential customers seem more receptive on these days. Cellular contacts have a higher conversion rate (14.7%) than telephone (5.2%), indicating that mobile communication is more effective; traditional telephone calls may yield lower returns and could be de-prioritized in favour of mobile-based outreach such as SMS and email marketing. Among marital statuses, the small “unknown” group has the highest subscription rate (15%), followed by single customers (14%); single customers may be more open to subscribing, possibly because they have fewer financial commitments (like mortgages or dependents), and marketing strategies could target young professionals with customized financial products such as savings plans or investment options. Higher education and professional training correlate with higher subscription rates, likely because these clients better understand financial products, and the high subscription rate among illiterate customers suggests they may be more responsive to direct marketing, such as in-person banking or phone calls, rather than online promotions.
Duplicate/Inconsistent Values
If duplicates exist, we will remove them to avoid bias.
sum(duplicated(bank))
## [1] 12
We have 12 duplicate rows.
# We will view them and then remove them in preprocessing
bank[duplicated(bank), ]
## age job marital education default housing loan
## 1267 39 blue-collar married basic.6y no no no
## 12262 36 retired married unknown no no no
## 14235 27 technician single professional.course no no no
## 16957 47 technician divorced high.school no yes no
## 18466 32 technician single professional.course no yes no
## 20217 55 services married high.school unknown no no
## 20535 41 technician married professional.course no yes no
## 25218 39 admin. married university.degree no no no
## 28478 24 services single high.school no yes no
## 32517 35 admin. married university.degree no yes no
## 36952 45 admin. married university.degree no no no
## 38282 71 retired single university.degree no no no
## contact month day_of_week duration campaign pdays previous poutcome
## 1267 telephone may thu 124 1 999 0 nonexistent
## 12262 telephone jul thu 88 1 999 0 nonexistent
## 14235 cellular jul mon 331 2 999 0 nonexistent
## 16957 cellular jul thu 43 3 999 0 nonexistent
## 18466 cellular jul thu 128 1 999 0 nonexistent
## 20217 cellular aug mon 33 1 999 0 nonexistent
## 20535 cellular aug tue 127 1 999 0 nonexistent
## 25218 cellular nov tue 123 2 999 0 nonexistent
## 28478 cellular apr tue 114 1 999 0 nonexistent
## 32517 cellular may fri 348 4 999 0 nonexistent
## 36952 cellular jul thu 252 1 999 0 nonexistent
## 38282 telephone oct tue 120 1 999 0 nonexistent
## emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y
## 1267 1.1 93.994 -36.4 4.855 5191.0 no
## 12262 1.4 93.918 -42.7 4.966 5228.1 no
## 14235 1.4 93.918 -42.7 4.962 5228.1 no
## 16957 1.4 93.918 -42.7 4.962 5228.1 no
## 18466 1.4 93.918 -42.7 4.968 5228.1 no
## 20217 1.4 93.444 -36.1 4.965 5228.1 no
## 20535 1.4 93.444 -36.1 4.966 5228.1 no
## 25218 -0.1 93.200 -42.0 4.153 5195.8 no
## 28478 -1.8 93.075 -47.1 1.423 5099.1 no
## 32517 -1.8 92.893 -46.2 1.313 5099.1 no
## 36952 -2.9 92.469 -33.6 1.072 5076.2 yes
## 38282 -3.4 92.431 -26.9 0.742 5017.5 no
Since the goal is to predict whether a client will subscribe to a term deposit, which is a binary classification problem, the best models from the ones presented so far are: Logistic Regression, Linear Discriminant Analysis (LDA), k-Nearest Neighbors (KNN) (for smaller datasets!). Logistic Regression is valued for its simplicity, interpretability and computational efficiency, making it ideal for large datasets like ours. It provides probability estimates and handles both categorical and numerical data effectively. However, it assumes linear relationships between predictors and log-odds outcomes, struggles with complex non-linear patterns, and can be compromised by multicollinearity. Linear Discriminant Analysis (LDA) performs well with normally distributed data and offers dimensionality reduction that can improve performance. Like logistic regression, it’s computationally efficient. Its main limitations include assuming normality of predictor variables which may not always be true and reduced effectiveness when relationships between variables and targets are non-linear. K-Nearest Neighbors (KNN) stands out as a non-parametric approach that doesn’t assume any specific distribution, allowing it to capture complex patterns in data. It’s particularly effective with smaller datasets. However, KNN becomes computationally expensive with large datasets (like our 41K records), requires careful selection of the K parameter to avoid overfitting or underfitting, and tends to perform poorly with imbalanced data.
I recommend Logistic Regression for this customer subscription prediction task. It’s the best choice because it handles large datasets efficiently (important for our 41,188 records), provides clear interpretability so business stakeholders can understand which factors influence customer decisions and generates probability estimates that help prioritize potential subscribers. While Linear Discriminant Analysis could be a good alternative if our data is normally distributed and KNN should be avoided due to its high computational cost with large datasets, Logistic Regression offers the best balance of performance, speed and business value for our specific needs.
Looking at the dataset structure, there is a clear target variable y with binary values “no” and “yes” indicating whether clients subscribed or not. This confirms that we have labeled data, which directly impacted the algorithm selection. Since this is a supervised learning classification problem, Logistic Regression and LDA were appropriate choices, as they learn from these labeled examples to predict outcomes for new data. If the dataset lacked this y column with subscription outcomes, we would have been forced to use unsupervised methods like clustering algorithms instead, which would group customers based on similarities but wouldn’t directly predict subscription likelihood without additional interpretation.
Looking at our EDA findings, the dataset contains both numerical variables (age, duration, economic indicators) and categorical variables (job, marital status, education), which both Logistic Regression and LDA can handle effectively. However, several variables like duration and campaign exhibit skewness in their distributions, which Logistic Regression can tolerate better than LDA, as LDA assumes normally distributed predictors. Our correlation analysis revealed highly correlated economic variables, particularly between emp.var.rate and euribor3m. This multicollinearity could potentially affect the stability and interpretability of Logistic Regression coefficients; however, we can mitigate this issue through careful feature selection or regularization techniques. Additionally, the class imbalance we observed in the target variable y (far fewer “yes” responses, 4,639, than “no”, 36,537) is better handled by Logistic Regression, which can be weighted to account for imbalanced classes. The flexibility of Logistic Regression to handle these dataset characteristics reinforces our algorithm selection, though we should remain mindful of its sensitivity to multicollinearity when interpreting results.
The choice of algorithm would change if there were fewer than 1,000 records. With a smaller dataset, K-Nearest Neighbors (KNN) would become a viable option since its computational cost would be significantly lower. Linear Discriminant Analysis (LDA) would still be useful, but a small sample size could affect the stability of covariance estimates.
Logistic Regression’s interpretability helps the bank understand key factors influencing subscriptions, allowing for better-targeted marketing strategies. Its probability estimates enable prioritization of high-potential leads, ensuring efficient use of resources. Since it scales well with large datasets, the bank can quickly update predictions and refine outreach efforts. This aligns with business goals by maximizing conversions while keeping costs low and focusing on the most promising customer segments.
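As a minimal sketch of the weighting idea mentioned above (run on the raw columns purely for illustration; the predictor subset, the weight scheme and the object names are my own choices, and in practice the cleaned and encoded features from the following sections would be used instead):
# Baseline weighted logistic regression to counter the "yes"/"no" imbalance
bank_glm <- bank %>% mutate(y_bin = as.integer(y == "yes"))
# Up-weight the minority class by the no/yes ratio
w <- ifelse(bank_glm$y_bin == 1,
            sum(bank_glm$y_bin == 0) / sum(bank_glm$y_bin == 1), 1)
fit <- glm(y_bin ~ age + job + contact + campaign + poutcome,
           data = bank_glm, family = binomial, weights = w)
# (glm warns about non-integer #successes with fractional weights; harmless for a sketch)
summary(fit)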
Data Cleaning
We will first handle the “unknown” values found in the EDA. For the variable default, with ~20% of its data being unknown, I think the best approach is to keep “unknown” as its own category and let the model decide whether it is meaningful; since 20% of the data is affected, dropping these rows would result in significant data loss. For the variables housing and loan, since only ~2.4% of the data is unknown, dropping these rows is an option, but it’s better to impute; because they are binary, I’ll fill the unknown values with the most common category (the mode).
bank$housing[bank$housing == "unknown"] <- names(which.max(table(bank$housing)))
bank$loan[bank$loan == "unknown"] <- names(which.max(table(bank$loan)))
For the variable marital, since the unknown values make up less than 1% of the data, they won’t significantly impact the model, so I’ll keep “unknown” as a separate category.
For handling outliers/noise we will use winsorization, a technique that replaces outliers with less extreme values to reduce their impact on the analysis. Duration and campaign are capped at the 5th and 95th percentiles, while previous, whose non-zero values are rare, is capped at the 1st and 99th percentiles. For the variables nr.employed, emp.var.rate and euribor3m I chose to just standardize them, since their skewness is relatively mild; standardizing these variables should be adequate for our logistic regression model. For the variables campaign and duration I used a log transformation, which worked well. For previous I had to apply several nested log transformations; its skewness is still relatively high, but that may be acceptable depending on the model’s performance.
winsorize <- function(x) {
lower_bound <- quantile(x, 0.05, na.rm = TRUE)
upper_bound <- quantile(x, 0.95, na.rm = TRUE)
x[x < lower_bound] <- lower_bound
x[x > upper_bound] <- upper_bound
return(x)
}
winsorize_strong <- function(x) {
lower_bound <- quantile(x, 0.01, na.rm = TRUE)
upper_bound <- quantile(x, 0.99, na.rm = TRUE)
x[x < lower_bound] <- lower_bound
x[x > upper_bound] <- upper_bound
return(x)
}
# Apply Winsorization
duration_winsorized = winsorize(bank$duration)
campaign_winsorized = winsorize(bank$campaign)
previous_winsorized = winsorize_strong(bank$previous)
bank <- bank %>%
mutate(
# Apply transformations directly to the winsorized variables
duration_transformed = log1p(duration_winsorized),
campaign_transformed = log1p(campaign_winsorized),
previous_transformed = log1p(log1p(log1p(log1p(log1p(previous_winsorized))))),
# Standardize other variables
emp_var_rate_standardized = scale(emp.var.rate),
euribor3m_standardized = scale(euribor3m),
nr_employed_standardized = scale(nr.employed)
)
# Check skewness after transformations
skewness_after <- data.frame(
Variable = c("duration_transformed", "campaign_transformed", "previous_transformed",
"nr_employed_standardized", "emp_var_rate_standardized", "euribor3m_standardized"),
Skewness = c(
skewness(bank$duration_transformed, na.rm = TRUE),
skewness(bank$campaign_transformed, na.rm = TRUE),
skewness(bank$previous_transformed, na.rm = TRUE),
skewness(bank$nr_employed_standardized, na.rm = TRUE),
skewness(bank$emp_var_rate_standardized, na.rm = TRUE),
skewness(bank$euribor3m_standardized, na.rm = TRUE)
)
)
print(skewness_after)
## Variable Skewness
## 1 duration_transformed -0.09216888
## 2 campaign_transformed 0.78735043
## 3 previous_transformed 2.15047130
## 4 nr_employed_standardized -1.04422438
## 5 emp_var_rate_standardized -0.72406918
## 6 euribor3m_standardized -0.70916213
# Duration Before & After
boxplot_duration_before <- ggplot(bank, aes(y = duration)) +
geom_boxplot(fill = "lightblue", outlier.color = "red") +
ggtitle("Duration (Before Winsorization)")
boxplot_duration_after <- ggplot(bank, aes(y = duration_winsorized)) +
geom_boxplot(fill = "lightgreen", outlier.color = "red") +
ggtitle("Duration (After Winsorization)")
# Campaign Before & After
boxplot_campaign_before <- ggplot(bank, aes(y = campaign)) +
geom_boxplot(fill = "lightblue", outlier.color = "red") +
ggtitle("Campaign (Before Winsorization)")
boxplot_campaign_after <- ggplot(bank, aes(y = campaign_winsorized)) +
geom_boxplot(fill = "lightgreen", outlier.color = "red") +
ggtitle("Campaign (After Winsorization)")
# Previous Before & After
boxplot_previous_before <- ggplot(bank, aes(y = previous)) +
geom_boxplot(fill = "lightblue", outlier.color = "red") +
ggtitle("Previous (Before Winsorization)")
boxplot_previous_after <- ggplot(bank, aes(y = previous_winsorized)) +
geom_boxplot(fill = "lightgreen", outlier.color = "red") +
ggtitle("Previous (After Winsorization)")
# Combine the boxplots side by side
(boxplot_duration_before | boxplot_duration_after) /
(boxplot_campaign_before | boxplot_campaign_after) /
(boxplot_previous_before | boxplot_previous_after)
We identified some duplicate rows in the EDA; we will remove them now.
bank <- bank[!duplicated(bank), ] # remove all duplicate rows except the first occurrence
We will do a test to verify that they are removed!
sum(duplicated(bank))
## [1] 0
Dimensionality Reduction
Since we’re using logistic regression, it’s important to avoid multicollinearity, which can lead to unstable model results. The variables emp.var.rate, nr.employed and euribor3m are highly correlated, so we can combine them into a single feature, economic_indicators, which represents their average. This helps preserve the information while reducing redundancy and multicollinearity. Another thing I considered is dropping the previous variable, since it is still quite skewed and behaves very similarly to the campaign variable.
bank <- bank %>%
mutate(
economic_indicators = (emp_var_rate_standardized + nr_employed_standardized + euribor3m_standardized) / 3
) %>%
dplyr::select(-emp.var.rate, -nr.employed, -euribor3m,
-emp_var_rate_standardized, -nr_employed_standardized, -euribor3m_standardized)
head(bank)
## age job marital education default housing loan contact month
## 1 56 housemaid married basic.4y no no no telephone may
## 2 57 services married high.school unknown no no telephone may
## 3 37 services married high.school no yes no telephone may
## 4 40 admin. married basic.6y no no no telephone may
## 5 56 services married high.school no no yes telephone may
## 6 45 services married basic.9y unknown no no telephone may
## day_of_week duration campaign pdays previous poutcome cons.price.idx
## 1 mon 261 1 999 0 nonexistent 93.994
## 2 mon 149 1 999 0 nonexistent 93.994
## 3 mon 226 1 999 0 nonexistent 93.994
## 4 mon 151 1 999 0 nonexistent 93.994
## 5 mon 307 1 999 0 nonexistent 93.994
## 6 mon 198 1 999 0 nonexistent 93.994
## cons.conf.idx y duration_transformed campaign_transformed
## 1 -36.4 no 5.568345 0.6931472
## 2 -36.4 no 5.010635 0.6931472
## 3 -36.4 no 5.424950 0.6931472
## 4 -36.4 no 5.023881 0.6931472
## 5 -36.4 no 5.730100 0.6931472
## 6 -36.4 no 5.293305 0.6931472
## previous_transformed economic_indicators
## 1 0 0.5640705
## 2 0 0.5640705
## 3 0 0.5640705
## 4 0 0.5640705
## 5 0 0.5640705
## 6 0 0.5640705
Feature Engineering
I’ll transform pdays into a binary variable (previously_contacted)
bank$previously_contacted <- ifelse(bank$pdays == 999, 0, 1) # 999 means never contacted
bank <- bank %>%
dplyr::select(-pdays)
table(bank$previously_contacted, bank$y)
##
## no yes
## 0 35989 3672
## 1 548 967
prop.table(table(bank$previously_contacted, bank$y), margin = 1) * 100
##
## no yes
## 0 90.741535 9.258465
## 1 36.171617 63.828383
Key insights from the code above: about 36,000 of the clients who were never contacted before (~91% of that group) did not subscribe, BUT among those previously contacted, 967 (~64%) subscribed, which is significantly higher than the ~9% rate for first-time contacts. This suggests that relationship-building and follow-ups significantly improve conversion rates.
We can introduce a new variable, call_frequency, to track how intensively each customer has been contacted in this campaign relative to past contacts. The idea is that repeated calls may influence the likelihood of subscribing: our findings show that while previous contact increases subscription chances, too many calls can have the opposite effect. Call_frequency can help find the right number of outreach attempts, the “golden ratio”, for effective outreach.
bank <- bank %>%
  mutate(call_frequency = campaign / (previous + 1)) # +1 avoids division by zero for clients never contacted before
Another variable we could create instead of economic_indicators is an economic stability score for each client’s contact period, combining the Consumer Price Index (cons.price.idx), the Employment Variation Rate (emp.var.rate) and the Euribor Rate (euribor3m), as they each capture a different element of economic stability: the CPI captures inflation and the cost of living, emp.var.rate gives an idea of labour market stability, and euribor3m is the average interest rate at which European banks are prepared to lend, indicating overall economic conditions. Why is this important? Because economic stability strongly influences customers’ ability to subscribe to products or services; higher economic stability could lead to more people subscribing.
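A hypothetical sketch of one such feature, built as a simple average of the standardized indicators (the name economic_stability, the equal weights and the re-read of the raw file are all my assumptions; in the pipeline above this would have to be computed before the dimensionality-reduction step, which dropped emp.var.rate and euribor3m):
# Sketch only: the raw indicators were dropped earlier, so re-read an untouched copy
bank_raw <- read.csv(url, sep = ";")
bank_raw <- bank_raw %>%
  mutate(economic_stability = (scale(cons.price.idx) +
                               scale(emp.var.rate) +
                               scale(euribor3m)) / 3)
summary(bank_raw$economic_stability)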
We have class imbalance that needs to be handled. I’ll address it by randomly undersampling the majority (“no”) class down to the size of the minority (“yes”) class.
table(bank$y)
##
## no yes
## 36537 4639
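For reference, since the ROSE package was loaded at the start, the same 50/50 balance could be obtained with its ovun.sample() function; the call below is shown but not run here, because the manual undersampling that follows is the path actually taken (and I have not verified how ovun.sample handles the matrix-valued economic_indicators column):
# Alternative via ROSE (for reference only, not run):
# bank_balanced <- ovun.sample(y ~ ., data = bank, method = "under",
#                              N = 2 * sum(bank$y == "yes"), seed = 42)$data
# table(bank_balanced$y)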
bank_yes <- bank[bank$y == "yes", ]
bank_no <- bank[bank$y == "no", ]
n_minority <- nrow(bank_yes)
bank_no_sampled <- bank_no[sample(nrow(bank_no), n_minority, replace = FALSE), ]
bank <- rbind(bank_yes, bank_no_sampled)
bank <- bank[sample(nrow(bank)), ]
table(bank$y)
##
## no yes
## 4639 4639
Data Transformation
Categorical Variables
I’ll separate the categorical variables into two groups: nominal (no inherent order) and ordinal (ordered categories). For the nominal variables, which include job (12 levels, e.g., “admin.”, “blue-collar”), default (3 levels: “no”, “unknown”, “yes”), housing (now 2 levels after imputation: “no”, “yes”), loan (now 2 levels: “no”, “yes”), contact (2 levels: “cellular”, “telephone”), month (10 levels, e.g., “apr”, “aug”), day_of_week (5 levels, e.g., “fri”, “mon”), poutcome (3 levels: “failure”, “nonexistent”, “success”) and y (2 levels: “no”, “yes”, the target variable), I’ll use one-hot encoding. For the variables I treat as ordinal, marital (4 levels: “divorced”, “married”, “single”, “unknown”) and education (8 levels, e.g., “basic.4y”, “basic.6y”, “high.school”, “university.degree”, with a natural order: basic < high school < professional course < university degree), I’ll use label encoding; for example, marital could be encoded as married = 1, single = 2, divorced = 3, unknown = 4.
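A minimal sketch of both encodings (the assumed attainment order for education, where “unknown” is placed, and the new column names are my choices):
# Ordinal (label) encoding for education, using an assumed order of attainment
edu_levels <- c("illiterate", "basic.4y", "basic.6y", "basic.9y", "high.school",
                "professional.course", "university.degree", "unknown")
bank$education_encoded <- as.integer(factor(bank$education, levels = edu_levels))
# One-hot encoding of a few nominal variables via model.matrix (intercept dropped)
nominal_dummies <- model.matrix(~ job + contact + poutcome - 1, data = bank)
dim(nominal_dummies)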
Standardization
Continuous variables such as age, duration (transformed), campaign (transformed) and previous should be considered for standardization based on their scale and distribution; age and duration in particular should be standardized to ensure comparability.
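A small sketch of that step (the _scaled column names are my own):
# Standardize the continuous predictors mentioned above (z-scores)
bank <- bank %>%
  mutate(age_scaled = scale(age),
         duration_scaled = scale(duration_transformed),
         campaign_scaled = scale(campaign_transformed))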
Feature Selection
For feature selection I’d choose the Recursive Feature Elimination (RFE) technique, which selects the most important features by systematically removing the least significant ones. It starts with all features, trains a model, ranks features by importance, then removes the weakest ones and retrains. This process repeats until it finds the optimal feature set (testing subsets of 5, 10, and 15 features) through cross-validation. RFE helps build a simpler, more effective model by keeping only the truly important predictors of customer subscription behavior.
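A sketch of how this could be run with caret’s rfe() (caret is not loaded above, so this assumes it is installed; lrFuncs wraps a logistic-regression fit, and in practice the fully encoded, standardized feature set from the previous steps would be supplied rather than the raw columns):
# RFE with 5-fold cross-validation, testing subsets of 5, 10 and 15 features
library(caret)
set.seed(123)
rfe_ctrl <- rfeControl(functions = lrFuncs, method = "cv", number = 5)
rfe_fit <- rfe(x = dplyr::select(bank, -y),
               y = factor(bank$y),
               sizes = c(5, 10, 15),
               rfeControl = rfe_ctrl)
rfe_fit$optVariables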