The assignment is to perform EDA on a dataset “Bank Marketing” UCI dataset , (detailed description at: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing).
The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be (‘yes’) or not (‘no’) subscribed.
There are two datasets: 1) bank-full.csv with all examples, ordered by date (from May 2008 to November 2010).
2) bank.csv with 10% of the examples (4521), randomly selected from bank-full.csv.
The classification goal is to predict if the client will subscribe a term deposit (variable y). We will use the bank-full.csv
Input predictors:
# bank client data: 1 - age (numeric)
2 - job : type of job (categorical: “admin.”,“unknown”,“unemployed”,“management”,“housemaid”,“entrepreneur”,“student”, “blue-collar”,“self-employed”,“retired”,“technician”,“services”)
3 - marital : marital status (categorical: “married”,“divorced”,“single”; note: “divorced” means divorced or widowed)
4 - education (categorical: “unknown”,“secondary”,“primary”,“tertiary”)
5 - default: has credit in default? (binary: “yes”,“no”)
6 - balance: average yearly balance, in euros (numeric)
7 - housing: has housing loan? (binary: “yes”,“no”)
8 - loan: has personal loan? (binary: “yes”,“no”)
# related with the last contact of the current campaign:
9 - contact: contact communication type (categorical: “unknown”,“telephone”,“cellular”)
10 - day: last contact day of the month (numeric)
11 - month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
12 - duration: last contact duration, in seconds (numeric)
# other attributes:
13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
15 - previous: number of contacts performed before this campaign and for this client (numeric)
16 - poutcome: outcome of the previous marketing campaign (categorical: “unknown”,“other”,“failure”,“success”)
Output variable (desired target):
17 - y - has the client subscribed a term deposit? (binary: “yes”,“no”)
Missing values : none
#Import needed libraries
library(ggplot2)
library(readr) # to uses read_csv function
library(dplyr) # to use Filter, mutate, arrange function etc
library(tidyr) # to use pivot_longer function
library(e1071) # For skewness function
library(corrplot)
library(ROSE)
library(smotefamily)
library(caret)
The data analysis shows there are 45211 observations and 17 variables. I find some of variables not in the correct type. we need to convert it into correct data type.
bank_raw <- read.csv("https://raw.githubusercontent.com/datanerddhanya/DATA622/refs/heads/main/bank-full.csv")
head(bank_raw)
## age job marital education default balance housing loan contact day
## 1 58 management married tertiary no 2143 yes no unknown 5
## 2 44 technician single secondary no 29 yes no unknown 5
## 3 33 entrepreneur married secondary no 2 yes yes unknown 5
## 4 47 blue-collar married unknown no 1506 yes no unknown 5
## 5 33 unknown single unknown no 1 no no unknown 5
## 6 35 management married tertiary no 231 yes no unknown 5
## month duration campaign pdays previous poutcome y
## 1 may 261 1 -1 0 unknown no
## 2 may 151 1 -1 0 unknown no
## 3 may 76 1 -1 0 unknown no
## 4 may 92 1 -1 0 unknown no
## 5 may 198 1 -1 0 unknown no
## 6 may 139 1 -1 0 unknown no
dim(bank_raw)
## [1] 45211 17
str(bank_raw)
## 'data.frame': 45211 obs. of 17 variables:
## $ age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ job : chr "management" "technician" "entrepreneur" "blue-collar" ...
## $ marital : chr "married" "single" "married" "married" ...
## $ education: chr "tertiary" "secondary" "secondary" "unknown" ...
## $ default : chr "no" "no" "no" "no" ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ housing : chr "yes" "yes" "yes" "yes" ...
## $ loan : chr "no" "no" "yes" "no" ...
## $ contact : chr "unknown" "unknown" "unknown" "unknown" ...
## $ day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ month : chr "may" "may" "may" "may" ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr "unknown" "unknown" "unknown" "unknown" ...
## $ y : chr "no" "no" "no" "no" ...
bank_transform <- bank_raw
bank_transform$job <- as.factor(bank_raw$job)
bank_transform$marital <- as.factor(bank_raw$marital)
bank_transform$education <- as.factor(bank_raw$education)
bank_transform$default <- as.factor(bank_raw$default)
bank_transform$balance <- as.integer(bank_raw$balance)
bank_transform$housing <- as.factor(bank_raw$housing)
bank_transform$loan <- as.factor(bank_raw$loan)
bank_transform$contact <- as.factor(bank_raw$contact)
bank_transform$month <- as.factor(bank_raw$month)
bank_transform$pdays <- as.integer(bank_raw$pdays)
bank_transform$poutcome <- as.factor(bank_raw$poutcome)
bank_transform$y <- as.factor(bank_raw$y)
str(bank_transform)
## 'data.frame': 45211 obs. of 17 variables:
## $ age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
## $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
## $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ housing : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
## $ contact : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
When checking the day predictor, it is not mapped to any year. hence we will not be able to infer in which year 2008 to 2010 it belongs.Hence its better to remove the day predictor.We can still keep month to see which month has the most subscription.
bank_transform <- subset(bank_transform, select = -c(day))
There are no missing values. However Job,Education,Contact,poutcome has “unknown” values. It is difficult to impute the unknown values as it may affect the results of the outcome. Hence, I am exploring the data including the unknowns.
print("Missing values in the dataset")
## [1] "Missing values in the dataset"
colSums(is.na(bank_transform))
## age job marital education default balance housing loan
## 0 0 0 0 0 0 0 0
## contact month duration campaign pdays previous poutcome y
## 0 0 0 0 0 0 0 0
print("Unknown values in the dataset")
## [1] "Unknown values in the dataset"
bank_transform[bank_transform == "unknown"] <- NA
colSums(is.na(bank_transform))
## age job marital education default balance housing loan
## 0 288 0 1857 0 0 0 0
## contact month duration campaign pdays previous poutcome y
## 13020 0 0 0 0 0 36959 0
No duplicates found.
# Check for duplicates
duplicates <- sum(duplicated(bank_transform))
cat(paste("\nNumber of duplicate rows:", duplicates, "\n"))
##
## Number of duplicate rows: 0
To perform analysis, it is needed to have the target predictor in numeric format as well.
bank_transform$y_numeric <- ifelse(bank_transform$y == "yes", 1, 0)
bank_transform$y_numeric <- as.integer(bank_transform$y_numeric)
bank_transform |>
select(where(is.numeric))|>
gather()|>
ggplot(aes(value)) +
geom_histogram(bins = 5) +
facet_wrap(~key, scales = 'free') +
ggtitle("Histograms of Numerical Predictors")
bank_transform|>
select(where(is.numeric))|>
gather()|>
ggplot(aes(value)) +
geom_boxplot() +
facet_wrap(~key, scales = 'free') +
ggtitle("Boxplots of Numerical Predictors")
# Class distribution
target_counts <- table(bank_transform$y)
cat("\nTarget variable distribution:\n")
##
## Target variable distribution:
print(target_counts)
##
## no yes
## 39922 5289
cat(paste("Subscription rate:", round(target_counts["yes"] / sum(target_counts) * 100, 2), "%"))
## Subscription rate: 11.7 %
numeric_vars <- bank_transform %>% select_if(is.numeric)
for(col in names(numeric_vars)) {
q1 <- quantile(numeric_vars[[col]], 0.25)
q3 <- quantile(numeric_vars[[col]], 0.75)
iqr <- q3 - q1
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr
outliers <- sum(numeric_vars[[col]] < lower | numeric_vars[[col]] > upper)
if(outliers > 0) {
cat(" -", col, ":", outliers, "potential outliers (",
round(outliers/nrow(bank_transform)*100, 2), "%)\n")
}
}
## - age : 487 potential outliers ( 1.08 %)
## - balance : 4729 potential outliers ( 10.46 %)
## - duration : 3235 potential outliers ( 7.16 %)
## - campaign : 3064 potential outliers ( 6.78 %)
## - pdays : 8257 potential outliers ( 18.26 %)
## - previous : 8257 potential outliers ( 18.26 %)
## - y_numeric : 5289 potential outliers ( 11.7 %)
Rule of thumb: 0.5 < |skewness| < 1 indicates moderate skewness, |skewness| > 1 indicates substantial skewness.
Based on the skewness values, all numerical predictors have substantial skewness, except for age which has moderate skewness.previous has the highest value of skewness, which is understanable as its the result of the previous campaign.
skew_values <- bank_transform |>
select(where(is.numeric)) |>
sapply(skewness, na.rm = TRUE)
# Print skewness values
print(skew_values)
## age balance duration campaign pdays previous y_numeric
## 0.6847725 8.3597536 3.1441095 4.8983251 2.6155419 41.8436778 2.3833223
bank_transform|>
select(is.numeric)|>
cor()
## Warning: Use of bare predicate functions was deprecated in tidyselect 1.1.0.
## ℹ Please use wrap predicates in `where()` instead.
## # Was:
## data %>% select(is.numeric)
##
## # Now:
## data %>% select(where(is.numeric))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## age balance duration campaign pdays
## age 1.000000000 0.097782739 -0.004648428 0.004760312 -0.023758014
## balance 0.097782739 1.000000000 0.021560380 -0.014578279 0.003435322
## duration -0.004648428 0.021560380 1.000000000 -0.084569503 -0.001564770
## campaign 0.004760312 -0.014578279 -0.084569503 1.000000000 -0.088627668
## pdays -0.023758014 0.003435322 -0.001564770 -0.088627668 1.000000000
## previous 0.001288319 0.016673637 0.001203057 -0.032855290 0.454819635
## y_numeric 0.025155017 0.052838410 0.394521016 -0.073172006 0.103621494
## previous y_numeric
## age 0.001288319 0.02515502
## balance 0.016673637 0.05283841
## duration 0.001203057 0.39452102
## campaign -0.032855290 -0.07317201
## pdays 0.454819635 0.10362149
## previous 1.000000000 0.09323577
## y_numeric 0.093235773 1.00000000
#Age Distribution On Term Deposit Subscription
ggplot(bank_transform, aes(x = age, fill = y)) +
geom_density(alpha = 0.5) +
labs(title = "Age Distribution by Term Deposit Subscription") +
scale_x_continuous(breaks = seq(0, 100, 5))
#Balance Distribution On Term Deposit Subscription
ggplot(bank_transform, aes(x = balance, fill = y)) +
geom_density(alpha = 0.5) +
labs(title = "Balance Distribution by Term Deposit Subscription") +
scale_x_continuous(breaks = seq(min(bank_transform$balance), max(bank_transform$balance), by = 10000)) +
scale_y_continuous(labels = scales::comma)
#Campaign Distribution On Term Deposit Subscription
ggplot(bank_transform, aes(x=campaign)) +
geom_histogram(bins=30, fill="purple", color="black", alpha=0.7) +
labs(title="Distribution of Campaign Contacts", x="Number of Contacts", y="Count") +
theme_minimal()
#Contact Distribution On Term Deposit Subscription
boxplot(duration ~ y,
data = bank_transform,
outline = F,
main = "Contact Duration Distribution On Term Deposit Subscription",
xlab = "Term Deposit Subscription",
ylab = "Duration",
col = c("indianred", "skyblue4"))
#Past days contacted Distribution On Term Deposit Subscription
boxplot(pdays ~ y,
data = bank_transform,
outline = F,
main = "Past days contacted On Term Deposit Subscription",
xlab = "Term Deposit Subscription",
ylab = "Passed days since the client was last contacted",
col = c("indianred", "skyblue4"))
boxplot(previous ~ y,
data = bank_transform,
outline = F,
main = "Previous contacts Distribution On Term Deposit Subscription",
xlab = "Term Deposit Subscription",
ylab = "Previous contacts performed before this campaign",
col = c("indianred", "skyblue4"))
#Previous contacts Distribution On Term Deposit Subscription
# outlier detection showing a value of 275
ggplot(bank_transform, aes(x = y, y = previous)) +
geom_boxplot(aes(fill = y)) +
labs(title = "Previous contacts Distribution by Term deposit Subscription")+
scale_y_continuous(breaks = seq(min(bank_transform$pdays), max(bank_transform$pdays), by = 10))
# plotting by removing the outlier
bank_transform |>
filter(previous < 200) |>
ggplot( aes(x = y, y = previous)) +
geom_boxplot(aes(fill = y)) +
labs(title = "Previous contacts Distribution by Term deposit Subscription")+
scale_y_continuous(breaks = seq(min(bank_transform$pdays), max(bank_transform$pdays), by = 10))
A moderate correlation with the target variable is observed in feature: default, Housing loan , loan ,contact, month and previous campaign outcome.
Customer Demographic Insights:
Job type impact: When plotting the job vs target variable by subscription rate i observe that student, retired, unemployed are shown to have the highest subscription rate.This might indicate these groups have more interest in passive income opportunities or are more available during contact hours.
Education impact: When plotting the education vs target variable by subscription rate i observe that tertiary education are shown to have the highest subscription rate.
Marital status influence: When plotting the marital status vs target variable by subscription rate i observe that clients who are single are shown to have the highest subscription rate.
Financial Status Patterns:
Clients who have no default are interested in in term deposit subscription. This greater interest indicates risk-averse financial behavior may correlate with term deposit interest
Clients who have no housing loan or no other loan are interested in in term deposit subscription than the clients who have housing loan.This aligns with financial capacity theory - they may have more disposable income to allocate.
Contact Strategy Insights:
Clients who have been contacted by Cellular are interested in in term deposit subscription than the clients who have been contacted by Telephone.This provides clear direction for contact strategy optimization
Subscription rate is highest in March followed by December.This seasonal pattern can inform capagin timing.
When plotting the previous campaign outcome(poutcome) vs target variable by subscription rate i observe that is the previous outcome was success, it has the highest subscription rate.Probably the bank has earned loyal customers.However it does not harm contacting the failed outcome clients as they also show a subscription rate of 12%.
# Job vs target
job_subscription <- bank_transform %>%
group_by(job) %>%
summarize(subscription_rate = mean(y == "yes") * 100) %>%
arrange(desc(subscription_rate))
ggplot(job_subscription, aes(x = reorder(job, subscription_rate), y = subscription_rate)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(title = "Subscription Rate by Job Type",
x = "Job", y = "Subscription Rate (%)")
# marital status
mar_subscription <- bank_transform %>%
group_by(marital) %>%
summarize(subscription_rate = mean(y == "yes") * 100) %>%
arrange(desc(subscription_rate))
ggplot(mar_subscription, aes(x = reorder(marital, subscription_rate), y = subscription_rate)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(title = "Subscription Rate by Marital Status",
x = "Marital Status ", y = "Subscription Rate (%)")
# Education vs target
edu_subscription <- bank_transform %>%
group_by(education) %>%
summarize(subscription_rate = mean(y == "yes") * 100) %>%
arrange(desc(subscription_rate))
ggplot(edu_subscription, aes(x = reorder(education, subscription_rate), y = subscription_rate)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(title = "Subscription Rate by Job Type",
x = "Education", y = "Subscription Rate (%)")
# Default
ggplot(bank_transform, aes(x = default, fill = y)) +
geom_bar(position = "dodge") +
labs(title = "Default by Term Deposit Subscription",
x = "Default")
# Housing
ggplot(bank_transform, aes(x = housing, fill = y)) +
geom_bar(position = "dodge") +
labs(title = "Housing loan by Term Deposit Subscription",
x = "Housing")
# Loan
ggplot(bank_transform, aes(x = loan, fill = y)) +
geom_bar(position = "dodge") +
labs(title = "Loan by Term Deposit Subscription",
x = "Loan")
# Contact
ggplot(bank_transform, aes(x = contact, fill = y)) +
geom_bar(position = "dodge") +
labs(title = "Contact by Term Deposit Subscription",
x = "Contact")
# Month Vs Target
month_order <- c("jan", "feb", "mar", "apr", "may", "jun",
"jul", "aug", "sep", "oct", "nov", "dec")
month_subscription <- bank_transform %>%
group_by(month) %>%
summarize(subscription_rate = mean(y == "yes") * 100)
month_subscription$month <- factor(month_subscription$month, levels = month_order)
month_subscription <- month_subscription[order(month_subscription$month), ]
ggplot(month_subscription, aes(x = month, y = subscription_rate)) +
geom_bar(stat = "identity", fill = "steelblue") +
labs(title = "Subscription Rate by Month",
x = "Month", y = "Subscription Rate (%)")
#Previous outcome vs target
pout_subscription <- bank_transform %>%
group_by(poutcome) %>%
summarize(subscription_rate = mean(y == "yes") * 100) %>%
arrange(desc(subscription_rate))
ggplot(pout_subscription, aes(x = reorder(poutcome, subscription_rate), y = subscription_rate)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(title = "Subscription Rate by Job Type",
x = "Previous campaign outcome", y = "Subscription Rate (%)")
categorical_vars <- bank_transform %>% select_if(is.factor)
cat("\nRelationship between target variable (y) and categorical variables:\n")
##
## Relationship between target variable (y) and categorical variables:
for(col in names(categorical_vars)) {
if(col != "y") {
cat("\nContingency table for y vs", col, ":\n")
cont_table <- table(categorical_vars$y, categorical_vars[[col]])
print(cont_table)
cat("Chi-square test result:\n")
print(chisq.test(cont_table))
}
}
##
## Contingency table for y vs job :
##
## admin. blue-collar entrepreneur housemaid management retired
## no 4540 9024 1364 1131 8157 1748
## yes 631 708 123 109 1301 516
##
## self-employed services student technician unemployed unknown
## no 1392 3785 669 6757 1101 0
## yes 187 369 269 840 202 0
## Chi-square test result:
## Warning in chisq.test(cont_table): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: cont_table
## X-squared = NaN, df = 11, p-value = NA
##
##
## Contingency table for y vs marital :
##
## divorced married single
## no 4585 24459 10878
## yes 622 2755 1912
## Chi-square test result:
##
## Pearson's Chi-squared test
##
## data: cont_table
## X-squared = 196.5, df = 2, p-value < 2.2e-16
##
##
## Contingency table for y vs education :
##
## primary secondary tertiary unknown
## no 6260 20752 11305 0
## yes 591 2450 1996 0
## Chi-square test result:
## Warning in chisq.test(cont_table): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: cont_table
## X-squared = NaN, df = 3, p-value = NA
##
##
## Contingency table for y vs default :
##
## no yes
## no 39159 763
## yes 5237 52
## Chi-square test result:
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: cont_table
## X-squared = 22.202, df = 1, p-value = 2.454e-06
##
##
## Contingency table for y vs housing :
##
## no yes
## no 16727 23195
## yes 3354 1935
## Chi-square test result:
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: cont_table
## X-squared = 874.82, df = 1, p-value < 2.2e-16
##
##
## Contingency table for y vs loan :
##
## no yes
## no 33162 6760
## yes 4805 484
## Chi-square test result:
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: cont_table
## X-squared = 209.62, df = 1, p-value < 2.2e-16
##
##
## Contingency table for y vs contact :
##
## cellular telephone unknown
## no 24916 2516 0
## yes 4369 390 0
## Chi-square test result:
## Warning in chisq.test(cont_table): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: cont_table
## X-squared = NaN, df = 2, p-value = NA
##
##
## Contingency table for y vs month :
##
## apr aug dec feb jan jul jun mar may nov oct sep
## no 2355 5559 114 2208 1261 6268 4795 229 12841 3567 415 310
## yes 577 688 100 441 142 627 546 248 925 403 323 269
## Chi-square test result:
##
## Pearson's Chi-squared test
##
## data: cont_table
## X-squared = 3061.8, df = 11, p-value < 2.2e-16
##
##
## Contingency table for y vs poutcome :
##
## failure other success unknown
## no 4283 1533 533 0
## yes 618 307 978 0
## Chi-square test result:
## Warning in chisq.test(cont_table): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: cont_table
## X-squared = NaN, df = 3, p-value = NA
p_values <- sapply(categorical_vars, function(x) chisq.test(table(x, bank_transform$y))$p.value)
## Warning in chisq.test(table(x, bank_transform$y)): Chi-squared approximation
## may be incorrect
## Warning in chisq.test(table(x, bank_transform$y)): Chi-squared approximation
## may be incorrect
## Warning in chisq.test(table(x, bank_transform$y)): Chi-squared approximation
## may be incorrect
## Warning in chisq.test(table(x, bank_transform$y)): Chi-squared approximation
## may be incorrect
significant_vars <- names(p_values[p_values < 0.05]) # Keep only significant ones
print(significant_vars)
## [1] NA "marital" NA "default" "housing" "loan" NA
## [8] "month" NA "y"
Based on the EDA findings, as its a supervised learning usecasse with classification technique,here are my algorithm recommendations for predicting term deposit subscriptions:
Pros:
Cons:
Pros:
Cons:
Pros:
Cons:
Pros:
Cons:
Pros:
Cons:
Recommendation:
I would recommend starting with Logistic Regression for the following reasons:
Business Interpretability: Given the banking context, stakeholders likely need to understand why customers subscribe - logistic regression provides clear coefficients that explain the impact of each variable.
Dataset Characteristics: The dataset shows clear relationships between certain variables (duration, job type, education) and the target, which logistic regression can effectively capture.
Implementation Simplicity: It’s faster to implement, tune, and deploy, allowing quicker time-to-value for the business. Regulatory Considerations: Banking often requires transparent decision-making models that can be easily explained to regulators.
Additional Considerations:
Labels: Yes, the dataset has a clear binary label (“y” - whether the client subscribed to a term deposit). This makes this a supervised learning classification problem, which guided my algorithm choices.
Relationship to Dataset: Logistic regression aligns well with the observed patterns in the EDA, particularly the linear relationship between duration and subscription likelihood, and the categorical influence of job, education, and marital status.
Validation Strategy: Given the class imbalance observed (more “no” than “yes” responses), stratified cross-validation would be important regardless of algorithm choice.
As a secondary approach, I would suggest LDA to compare performance, especially if the assumptions of normally distributed features are reasonably met after transformation.
There are no missing values. There are no duplicates as shown in above code.However there are unknown values which are converted to NA.
# Count missing values per column
colSums(is.na(bank_transform ))
## age job marital education default balance housing loan
## 0 288 0 1857 0 0 0 0
## contact month duration campaign pdays previous poutcome y
## 13020 0 0 0 0 0 36959 0
## y_numeric
## 0
# Replace missing numerical values with median
bank_final <- bank_transform %>%
mutate(across(where(is.numeric), ~ ifelse(is.na(.), median(., na.rm = TRUE), .)))
# Replace missing categorical values with the most common value (mode)
bank_final <- bank_final %>%
mutate(across(where(is.factor), ~ ifelse(is.na(.), as.character(names(sort(table(.), decreasing = TRUE)[1])), .)))
i am removing the categorical variables which do show a correlation.
bank_final <- subset(bank_transform, select = -c(job, marital,education))
# Create Age Groups
bank_final <- bank_final %>%
mutate(age_group = case_when(
age <= 30 ~ "18-30",
age > 30 & age <= 40 ~ "30-40",
age > 40 & age <= 50 ~ "41-50",
age > 50 & age <= 60 ~ "51-60",
age > 60 ~ "60+"
))
# Categorize Balance Levels
bank_final$balance_group <- cut(bank_final$balance,
breaks = quantile(bank_final$balance, probs = seq(0, 1, 0.2)),
labels = c("Very Low", "Low", "Medium", "High", "Very High"))
# Categorize Contact Duration
bank_final <- bank_final %>%
mutate(duration_category = case_when(
duration < 100 ~ "Short",
duration >= 100 & duration <= 300 ~ "Medium",
duration > 300 ~ "Long"
))
# Convert categorical variables to factors
bank_final <- bank_final %>% mutate(across(where(is.character), as.factor))
# Check imbalance
table(bank_final$y)
##
## no yes
## 39922 5289
# Oversampling using ROSE -Random Over-Sampling Examples
bank_final_balanced <- ROSE(y ~ ., data = bank_final, seed = 123)$data
table(bank_final_balanced$y)
##
## no yes
## 4135 4047
# Alternative: SMOTE - Synthetic Minority Over-sampling Technique
# I tried SMOTE,but as it needs all values without any NAs, i was not able to use it.
bank_final_balanced <- subset(bank_final_balanced , select = -c(age, duration, balance))
# Normalize numerical variables
num_vars <- bank_final_balanced %>% select(where(is.numeric)) %>% names()
bank_final_balanced [num_vars] <- scale(bank_final_balanced [num_vars])
# One-hot encoding categorical variables (for Logistic Regression)
bank_final_balanced <- dummyVars(y ~ ., data = bank_final_balanced ) %>% predict(bank_final_balanced ) %>% as.data.frame()
# Convert target variable to factor for classification
bank_final_balanced $y <- as.factor(bank_final_balanced $y)
The EDA on “Bank Marketing” dataset shows there are 45211 observations and 17 variables. The data cleanup activity included changing the variables to correct data type, removing day variable, observing the unknown values, checking for duplicates.
Numerical Variables:
Based on the box plot,I see the spread of each variable. I observe that there are many points outside the whiskers for each and every element, indicating many outliers present especially for previous, pdays and balance .
Based on the histogram plot, I observe that all numerical predictors have substantial skewness, except for age which has moderate skewness.
The target variable distribution shows that the subscription rate is 11.7%.
None of the numerical variables appear to be strongly correlated to each other. This suggests that individual numerical variables alone don’t strongly predict subscription behavior.
The variable pdays and previous are moderately correlated with a coefficient of 0.45.
Closest ones to consider is correlated to the target variable is the duration.
Based on the age distribution, clients who are in the age category 60+ and 20-30 are accepting the term deposit subscription.
Based on the box plot of Average Yearly Balance, I observe that the clients having a balance of below 12000 euros get term deposit subscription which makes sense as the balance is low.
Based on the box plot of the contact duration ,I observe that Clients who are interested in term deposit subscription tend to have longer contact duration around 400 seconds.This makes sense as longer conversation times may indicate more interested customers.
Based on the box plot of Past days contacted Distribution on Term Deposit Subscription, the client is contacted 1-100 times, with an upper whisker up to 250.
Based on the box plot of Previous contacts Distribution on Term Deposit Subscription, the client is previous contacted 1 time, with an upper whisker up to 2.
Categorical Variables:
A moderate correlation with the target variable is observed in feature: default, Housing loan , loan ,contact, month and previous campaign outcome.
Customer Demographic Insights:
Job type impact: When plotting the job vs target variable by subscription rate I observe that student, retired, unemployed are shown to have the highest subscription rate.This might indicate these groups have more interest in passive income opportunities or are more available during contact hours.
Education impact: When plotting the education vs target variable by subscription rate I observe that tertiary education is shown to have the highest subscription rate.
Marital status influence: When plotting the marital status vs target variable by subscription rate I observe that clients who are single are shown to have the highest subscription rate.
Financial Status Patterns:
Clients who have no default are interested in in term deposit subscription. This greater interest indicates risk-averse financial behavior may correlate with term deposit interest
Clients who have no housing loan or no other loan are interested in in term deposit subscription than the clients who have housing loan.This aligns with financial capacity theory - they may have more disposable income to allocate.
Contact Strategy Insights:
Clients who have been contacted by Cellular are interested in in term deposit subscription than the clients who have been contacted by Telephone.This provides clear direction for contact strategy optimization
Subscription rate is highest in March followed by December.This seasonal pattern can inform capagin timing.
When plotting the previous campaign outcome(poutcome) vs target variable by subscription rate i observe that is the previous outcome was success, it has the highest subscription rate.Probably the bank has earned loyal customers.However it does not harm contacting the failed outcome clients as they also show a subscription rate of 12%.
The Chi-square test for each categorical variable shows a p value = 0 . Despite low correlation coefficients, the statistical significance (p-value = 0) for categorical variables indicates that these relationships aren’t occurring by chance. The significant variables which can be considered are:default, housing loan, loan, month.
ML model:
Based on the EDA, I would recommend starting with Logistic Regression . I would suggest LDA to compare performance.
Pre Processing:
Based on above two models, i need to Replace missing numerical values with median, Replace missing categorical values with the most common value (mode). I need to remove the categorical variables job, marital,education which do show a correlation. i need to feature engineer to add Age Groups, Balance groups and duration category. finally perform One-hot encoding categorical variables (for Logistic Regression) and Convert target variable to factor for classification.