This assignment focuses on Exploratory Data Analysis (EDA), a crucial step in data science that helps identify data gaps, imbalances, and provides insights into data quality. This analysis will focus on campaign effectiveness, customer demographics, and seasonality trends to understand how the bank can improve future marketing strategies.
A Portuguese bank conducted a marketing campaign (phone calls) to
predict if a client will subscribe to a term deposit. The objective here
is to analyze the dataset and figure out the most effective tactics that
will help the bank persuade more customers to engage with future
marketing efforts.
Download the Bank Marketing Dataset from: Bank
Marketing Dataset
## 'data.frame': 45211 obs. of 17 variables:
## $ age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ job : chr "management" "technician" "entrepreneur" "blue-collar" ...
## $ marital : chr "married" "single" "married" "married" ...
## $ education : chr "tertiary" "secondary" "secondary" "unknown" ...
## $ default : chr "no" "no" "no" "no" ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ housing : chr "yes" "yes" "yes" "yes" ...
## $ personal_loan: chr "no" "no" "yes" "no" ...
## $ contact : chr "unknown" "unknown" "unknown" "unknown" ...
## $ day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ month : chr "may" "may" "may" "may" ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr "unknown" "unknown" "unknown" "unknown" ...
## $ term_dep_sub : chr "no" "no" "no" "no" ...
## age job marital education
## Min. :18.00 Length:45211 Length:45211 Length:45211
## 1st Qu.:33.00 Class :character Class :character Class :character
## Median :39.00 Mode :character Mode :character Mode :character
## Mean :40.94
## 3rd Qu.:48.00
## Max. :95.00
## default balance housing personal_loan
## Length:45211 Min. : -8019 Length:45211 Length:45211
## Class :character 1st Qu.: 72 Class :character Class :character
## Mode :character Median : 448 Mode :character Mode :character
## Mean : 1362
## 3rd Qu.: 1428
## Max. :102127
## contact day month duration
## Length:45211 Min. : 1.00 Length:45211 Min. : 0.0
## Class :character 1st Qu.: 8.00 Class :character 1st Qu.: 103.0
## Mode :character Median :16.00 Mode :character Median : 180.0
## Mean :15.81 Mean : 258.2
## 3rd Qu.:21.00 3rd Qu.: 319.0
## Max. :31.00 Max. :4918.0
## campaign pdays previous poutcome
## Min. : 1.000 Min. : -1.0 Min. : 0.0000 Length:45211
## 1st Qu.: 1.000 1st Qu.: -1.0 1st Qu.: 0.0000 Class :character
## Median : 2.000 Median : -1.0 Median : 0.0000 Mode :character
## Mean : 2.764 Mean : 40.2 Mean : 0.5803
## 3rd Qu.: 3.000 3rd Qu.: -1.0 3rd Qu.: 0.0000
## Max. :63.000 Max. :871.0 Max. :275.0000
## term_dep_sub
## Length:45211
## Class :character
## Mode :character
##
##
##
## age job marital education default balance housing personal_loan
## 1 58 management married tertiary no 2143 yes no
## 2 44 technician single secondary no 29 yes no
## 3 33 entrepreneur married secondary no 2 yes yes
## 4 47 blue-collar married unknown no 1506 yes no
## 5 33 unknown single unknown no 1 no no
## 6 35 management married tertiary no 231 yes no
## contact day month duration campaign pdays previous poutcome term_dep_sub
## 1 unknown 5 may 261 1 -1 0 unknown no
## 2 unknown 5 may 151 1 -1 0 unknown no
## 3 unknown 5 may 76 1 -1 0 unknown no
## 4 unknown 5 may 92 1 -1 0 unknown no
## 5 unknown 5 may 198 1 -1 0 unknown no
## 6 unknown 5 may 139 1 -1 0 unknown no
This dataset has over 45,000 records and 17 features — a good-sized dataset for analysis. The features cover customer demographics, call timing, previous contact history, and the final outcome (whether they subscribed). It’s the kind of dataset that feels realistic — a bit messy, but full of useful signals.
The correlation matrix shows weak relationships across most numerical features. One standout is duration — the longer the call, the more likely the customer subscribed. That makes intuitive sense. A longer call probably means the customer was genuinely interested, not just being polite.
numerical_cols %>%
pivot_longer(cols = everything()) %>%
ggplot(aes(value)) +
geom_histogram(bins = 30, fill = "skyblue", color = "black") +
facet_wrap(~name, scales = "free") +
theme_minimal()
balance
and duration
both show heavy skew. This could create
problems for certain models.categorical_cols <- bank_data %>% select_if(is.character)
categorical_cols %>%
pivot_longer(cols = everything()) %>%
ggplot(aes(value)) +
geom_bar(fill = "lightgreen") +
facet_wrap(~name, scales = "free") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
numerical_cols %>%
pivot_longer(cols = everything()) %>%
ggplot(aes(x = name, y = value)) +
geom_boxplot(fill = "lightblue") +
coord_flip()
Balance and duration both have some extreme outliers. This isn’t unusual in banking, where a handful of customers have very large balances, but these outliers could distort model performance.
ggplot(bank_data, aes(x = age, fill = term_dep_sub)) +
geom_histogram(binwidth = 5, position = "dodge") +
theme_minimal()
Older customers are slightly more likely to subscribe. That makes sense — older people may be more financially conservative and interested in guaranteed returns like term deposits.
ggplot(bank_data, aes(x = duration, fill = term_dep_sub)) +
geom_histogram(bins = 30, position = "dodge")
The longer the call, the higher the subscription rate. This supports the idea that meaningful conversations lead to conversions.
ggplot(bank_data, aes(x = education, fill = term_dep_sub)) +
geom_bar(position = "fill") +
theme_minimal() +
labs(y = "Proportion", title = "Subscription Rate by Education")
his chart really stood out to me, education clearly plays a big role in whether someone signs up for a term deposit. It makes sense when you think about financial literacy. What surprised me, though, was that even customers with ‘unknown’ education had a decent subscription rate. That got me wondering if the bank’s data collection process might be leaving out some important context about these customers, maybe they have financial knowledge from work experience or informal learning.
ggplot(bank_data, aes(x = contact, fill = term_dep_sub)) +
geom_bar(position = "fill") +
theme_minimal() +
labs(y = "Proportion", title = "Subscription Rate by Contact Method")
Cellular calls work better than landlines or unknown contact methods. This makes sense — most people are quicker to pick up their cell, and they might feel more comfortable having a real conversation that way. If the bank wants better results, they need to lean into mobile outreach.
#Barplots forcategoricalvariables
categorical_cols <-bank_data %>%select_if(is.character)
categorical_cols %>%
gather(key= "Variable",value= "Category") %>%
ggplot(aes(x= Category))+
geom_bar(fill= "lightblue", color= "black") +
facet_wrap(~Variable, scales= "free") +
theme(axis.text.x=element_text(angle= 45, hjust= 1))
#Function to get top 3categories for each categorical column
top3_categories <-function(column) {
counts<-sort(table(column), decreasing= TRUE)
head(counts, 3)
}
#Apply to all categorical columns
categorical_cols <-bank_data %>%select_if(is.character)
top3_list <-lapply(categorical_cols,top3_categories)
#Format in to a nice data frame
top3_df <-purrr::map_df(names(top3_list), function(var){
data.frame(
Variable= var,
Category= names(top3_list[[var]]),
Count= as.numeric(top3_list[[var]])
)
})
#Display top 2
top3_df
## Variable Category Count
## 1 job blue-collar 9732
## 2 job management 9458
## 3 job technician 7597
## 4 marital married 27214
## 5 marital single 12790
## 6 marital divorced 5207
## 7 education secondary 23202
## 8 education tertiary 13301
## 9 education primary 6851
## 10 default no 44396
## 11 default yes 815
## 12 housing yes 25130
## 13 housing no 20081
## 14 personal_loan no 37967
## 15 personal_loan yes 7244
## 16 contact cellular 29285
## 17 contact unknown 13020
## 18 contact telephone 2906
## 19 month may 13766
## 20 month jul 6895
## 21 month aug 6247
## 22 poutcome unknown 36959
## 23 poutcome failure 4901
## 24 poutcome other 1840
## 25 term_dep_sub no 39922
## 26 term_dep_sub yes 5289
When I looked at the categories, the imbalance really jumped out especially how many calls happened in May. It was way more than any other month. It made me pause and wonder: Was the bank running some kind of seasonal promotion, or maybe they were trying to hit a quarterly target?
Customers were more likely to subscribe after longer calls, One trend really stood out to me, longer calls led to more subscriptions, which makes sense because those probably felt like real conversations, not just quick sales pitches. But what surprised me was that calling the same person too many times actually backfired, I guess people get annoyed if you keep bugging them. I also noticed that higher education levels made people more likely to sign up, and students and retirees converted better than blue-collar workers. It all kind of fits when you think about financial awareness and free time. And no shock here — cell phones worked better than landlines. I mean, who even answers their landline anymore?
numeric_summary <- bank_data %>%
select_if(is.numeric) %>%
summarise(across(everything(), list(
Mean = ~round(mean(., na.rm = TRUE)),
Median = ~round(median(., na.rm = TRUE)),
SD = ~round(sd(., na.rm = TRUE)),
Min = ~round(min(., na.rm = TRUE)),
Q1 = ~round(quantile(., 0.25, na.rm = TRUE)),
Q3 = ~round(quantile(., 0.75, na.rm = TRUE)),
Max = ~round(max(., na.rm = TRUE))
)))
numeric_summary_clean <- numeric_summary %>%
pivot_longer(cols = everything(), names_to = c("Variable", "Statistic"), names_sep = "_") %>%
pivot_wider(names_from = "Variable", values_from = "value")
kable(numeric_summary_clean, caption = "Central Tendency and Spread of Numerical Variables")
Statistic | age | balance | day | duration | campaign | pdays | previous |
---|---|---|---|---|---|---|---|
Mean | 41 | 1362 | 16 | 258 | 3 | 40 | 1 |
Median | 39 | 448 | 16 | 180 | 2 | -1 | 0 |
SD | 11 | 3045 | 8 | 258 | 3 | 100 | 2 |
Min | 18 | -8019 | 1 | 0 | 1 | -1 | 0 |
Q1 | 33 | 72 | 8 | 103 | 1 | -1 | 0 |
Q3 | 48 | 1428 | 21 | 319 | 3 | -1 | 0 |
Max | 95 | 102127 | 31 | 4918 | 63 | 871 | 275 |
## age job marital education default
## 0 0 0 0 0
## balance housing personal_loan contact day
## 0 0 0 0 0
## month duration campaign pdays previous
## 0 0 0 0 0
## poutcome term_dep_sub
## 0 0
The goal is to predict whether a customer subscribes to a term
deposit (term_dep_sub
). This is a binary
classification problem because the target variable has only two
classes: “yes” and “no”.
Two algorithms suitable for this task are:
bank_data$term_dep_sub <- factor(bank_data$term_dep_sub, levels = c("no", "yes"))
tree_model <- rpart(term_dep_sub ~ ., data = bank_data, method = "class")
rpart.plot(tree_model)
dummies <- dummyVars(term_dep_sub ~ ., data = bank_data)
bank_data_encoded <- predict(dummies, newdata = bank_data) %>% as.data.frame()
bank_data_encoded$term_dep_sub <- ifelse(bank_data$term_dep_sub == "yes", 1, 0)
train_matrix <- as.matrix(bank_data_encoded %>% select(-term_dep_sub))
train_labels <- bank_data_encoded$term_dep_sub
xgb_model <- xgboost(data = train_matrix, label = train_labels, objective = "binary:logistic", nrounds = 100, verbose = 0)
importance_matrix <- xgb.importance(model = xgb_model)
xgb.plot.importance(importance_matrix)
library(knitr)
algo_table <- data.frame(
Algorithm = c("Decision Tree", "XGBoost"),
Pros = c(
"Easy to interpret (visual tree); Captures non-linear relationships; Handles categorical and numeric data directly",
"High predictive accuracy; Handles non-linear relationships; Works well on large datasets; Robust to irrelevant features"
),
Cons = c(
"Prone to overfitting; Sensitive to small data changes",
"Requires careful tuning; Needs one-hot encoding; Less interpretable"
)
)
kable(algo_table, caption = "Comparison of Decision Tree and XGBoost", align = "l")
Algorithm | Pros | Cons |
---|---|---|
Decision Tree | Easy to interpret (visual tree); Captures non-linear relationships; Handles categorical and numeric data directly | Prone to overfitting; Sensitive to small data changes |
XGBoost | High predictive accuracy; Handles non-linear relationships; Works well on large datasets; Robust to irrelevant features | Requires careful tuning; Needs one-hot encoding; Less interpretable |
XGBoost is recommended because: - It captures complex non-linear patterns - Performs automatic regularization - Scales well for large datasets (45,000+ records) - Provides a useful feature importance plot
Yes, term_dep_sub
is the label column. This made both
Decision Trees and XGBoost suitable since it’s a supervised
classification problem.
algo_impact_table <- data.frame(
`Dataset Characteristic` = c(
"Binary Target (yes/no)",
"Mixed Features",
"Large Dataset",
"Class Imbalance",
"Need for Interpretability"
),
`Impact on Algorithm Choice` = c(
"Both algorithms are suitable",
"Decision Tree handles natively; XGBoost requires encoding",
"Both scale well",
"XGBoost supports weighting; Decision Tree allows stratified sampling",
"Decision Tree is more interpretable; XGBoost provides importance plots"
)
)
kable(algo_impact_table, caption = "How Dataset Characteristics Impact Algorithm Choice", align = "l")
Dataset.Characteristic | Impact.on.Algorithm.Choice |
---|---|
Binary Target (yes/no) | Both algorithms are suitable |
Mixed Features | Decision Tree handles natively; XGBoost requires encoding |
Large Dataset | Both scale well |
Class Imbalance | XGBoost supports weighting; Decision Tree allows stratified sampling |
Need for Interpretability | Decision Tree is more interpretable; XGBoost provides importance plots |
library(knitr)
data_size_table <- data.frame(
`Data Size` = c(
"Large (45k+)",
"Medium (1k - 10k)",
"Small (<1k)",
"Very Small (<500)"
),
`Recommended Algorithm` = c(
"XGBoost",
"XGBoost or Tree",
"Decision Tree",
"Decision Tree"
),
`Reason` = c(
"Captures complexity",
"Balance needed",
"Simple, interpretable",
"Easy to fit without overfitting"
)
)
kable(data_size_table, caption = "Recommended Algorithm Based on Data Size", align = "l")
Data.Size | Recommended.Algorithm | Reason |
---|---|---|
Large (45k+) | XGBoost | Captures complexity |
Medium (1k - 10k) | XGBoost or Tree | Balance needed |
Small (<1k) | Decision Tree | Simple, interpretable |
Very Small (<500) | Decision Tree | Easy to fit without overfitting |
poutcome
, if
necessary.season
based on month
engagement_per_contact
bank_data <- bank_data %>%
mutate(
season = case_when(
month %in% c("dec", "jan", "feb") ~ "Winter",
month %in% c("mar", "apr", "may") ~ "Spring",
month %in% c("jun", "jul", "aug") ~ "Summer",
month %in% c("sep", "oct", "nov") ~ "Fall"
),
engagement_per_contact = ifelse(campaign > 0, duration / campaign, duration)
)
Dataset is imbalanced (~12% subscribed).
Apply SMOTE or manual oversampling to correct it.
minority <- bank_data %>% filter(term_dep_sub == "yes")
majority <- bank_data %>% filter(term_dep_sub == "no")
set.seed(123)
oversampled_minority <- minority[sample(nrow(minority), size = nrow(majority), replace = TRUE),]
bank_data_balanced <- bind_rows(majority, oversampled_minority)
table(bank_data_balanced$term_dep_sub)
##
## no yes
## 39922 39922
model_table <- data.frame(
Model = c("Decision Tree", "XGBoost"),
Needs_Encoding = c("No", "Yes (one-hot)"),
Needs_Scaling = c("No", "No")
)
kable(model_table, caption = "Model Pre-processing Needs")
Model | Needs_Encoding | Needs_Scaling |
---|---|---|
Decision Tree | No | No |
XGBoost | Yes (one-hot) | No |
library(dplyr)
library(caret)
library(stringr)
# Step 1: Create Revised Age Group
bank_data_balanced <- bank_data_balanced %>%
mutate(age_group = cut(
age,
breaks = c(0, 25, 35, 40, 45, 50, 55, 60, 65, Inf),
labels = c("Under 25", "25-34", "35-39", "40-44", "45-49", "50-54", "55-59", "60-64", "65+"),
right = FALSE
))
# Step 2: One-hot Encode job, marital, and education
dummies <- dummyVars(~ job + marital + education, data = bank_data_balanced)
encoded <- predict(dummies, newdata = bank_data_balanced) %>% as.data.frame()
# Combine encoded columns back into main data
bank_data_balanced <- bind_cols(bank_data_balanced, encoded)
# Step 3: Rename Columns After Encoding
colnames(bank_data_balanced) <- colnames(bank_data_balanced) %>%
str_replace_all("jobadmin\\.", "Admin") %>%
str_replace_all("jobblue-collar", "Blue-collar") %>%
str_replace_all("jobentrepreneur", "Entrepreneur") %>%
str_replace_all("jobhousemaid", "Housemaid") %>%
str_replace_all("jobmanagement", "Management") %>%
str_replace_all("jobretired", "Retired") %>%
str_replace_all("jobself-employed", "Self-employed") %>%
str_replace_all("jobservices", "Services") %>%
str_replace_all("jobstudent", "Student") %>%
str_replace_all("jobtechnician", "Technician") %>%
str_replace_all("jobunemployed", "Unemployed") %>%
str_replace_all("jobunknown", "Unknown_job") %>%
str_replace_all("maritaldivorced", "Divorced") %>%
str_replace_all("maritalmarried", "Married") %>%
str_replace_all("maritalsingle", "Single") %>%
str_replace_all("educationprimary", "Primary_education") %>%
str_replace_all("educationsecondary", "Secondary_education") %>%
str_replace_all("educationtertiary", "Tertiary_education") %>%
str_replace_all("educationunknown", "Unknown_education")
# Step 4: Identify Columns for Summarization
job_columns <- c("Admin", "Blue-collar", "Entrepreneur", "Housemaid",
"Management", "Retired", "Self-employed", "Services",
"Student", "Technician", "Unemployed", "Unknown_job")
marital_columns <- c("Divorced", "Married", "Single")
education_columns <- c("Primary_education", "Secondary_education",
"Tertiary_education", "Unknown_education")
# Step 5: Summarize by Age Group
summary_by_age <- bank_data_balanced %>%
group_by(age_group) %>%
summarise(across(all_of(c(job_columns, marital_columns, education_columns)), sum, na.rm = TRUE))
# Step 6: Display result
print(summary_by_age)
## # A tibble: 9 × 20
## age_group Admin `Blue-collar` Entrepreneur Housemaid Management Retired
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Under 25 182 301 19 12 141 1
## 2 25-34 3397 4640 515 277 6313 16
## 3 35-39 1671 2745 460 277 3645 17
## 4 40-44 1102 2131 422 195 2206 64
## 5 45-49 1220 2079 330 297 2077 83
## 6 50-54 802 1361 296 292 1570 341
## 7 55-59 638 873 189 315 1408 1344
## 8 60-64 217 212 43 146 491 1168
## 9 65+ 31 6 24 98 125 2648
## # ℹ 13 more variables: `Self-employed` <dbl>, Services <dbl>, Student <dbl>,
## # Technician <dbl>, Unemployed <dbl>, Unknown_job <dbl>, Divorced <dbl>,
## # Married <dbl>, Single <dbl>, Primary_education <dbl>,
## # Secondary_education <dbl>, Tertiary_education <dbl>,
## # Unknown_education <dbl>
scale_pos_weight
parameter to handle class imbalance.library(dplyr)
library(caret)
# Ensure all categorical variables are factors (no encoding needed for rpart)
bank_data_balanced <- bank_data_balanced %>%
mutate(
term_dep_sub = factor(term_dep_sub, levels = c("no", "yes")),
job = factor(job),
marital = factor(marital),
education = factor(education),
default = factor(default),
housing = factor(housing),
personal_loan = factor(personal_loan),
contact = factor(contact),
month = factor(month),
poutcome = factor(poutcome)
)
# Check class imbalance
print(table(bank_data_balanced$term_dep_sub))
##
## no yes
## 39922 39922
# Define stratified cross-validation
train_control <- trainControl(
method = "cv",
number = 10,
classProbs = TRUE,
summaryFunction = twoClassSummary,
sampling = "down"
)
# Train Decision Tree - Explicit formula listing (no ~ . shortcut)
tree_model <- train(
term_dep_sub ~ age + job + marital + education + default + balance + housing +
personal_loan + contact + day + month + duration + campaign +
previous + poutcome,
data = bank_data_balanced,
method = "rpart",
trControl = train_control,
metric = "ROC"
)
# Print results to show performance
print(tree_model)
## CART
##
## 79844 samples
## 15 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 71859, 71858, 71860, 71859, 71860, 71860, ...
## Addtional sampling using down-sampling
##
## Resampling results across tuning parameters:
##
## cp ROC Sens Spec
## 0.03779871 0.8073687 0.7100093 0.8324731
## 0.04432393 0.7583717 0.6626919 0.8193727
## 0.43975753 0.6299850 0.7655812 0.4943888
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.03779871.