Introduction

This assignment focuses on Exploratory Data Analysis (EDA), a crucial step in data science that helps identify data gaps, imbalances, and provides insights into data quality. This analysis will focus on campaign effectiveness, customer demographics, and seasonality trends to understand how the bank can improve future marketing strategies.

Dataset Overview

A Portuguese bank conducted a marketing campaign (phone calls) to predict if a client will subscribe to a term deposit. The objective here is to analyze the dataset and figure out the most effective tactics that will help the bank persuade more customers to engage with future marketing efforts.
Download the Bank Marketing Dataset from: Bank Marketing Dataset

Loading the Data

bank_data <- read.csv("C:/Users/Admin/Desktop/Bank_full.csv")

str(bank_data)
## 'data.frame':    45211 obs. of  17 variables:
##  $ age          : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job          : chr  "management" "technician" "entrepreneur" "blue-collar" ...
##  $ marital      : chr  "married" "single" "married" "married" ...
##  $ education    : chr  "tertiary" "secondary" "secondary" "unknown" ...
##  $ default      : chr  "no" "no" "no" "no" ...
##  $ balance      : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing      : chr  "yes" "yes" "yes" "yes" ...
##  $ personal_loan: chr  "no" "no" "yes" "no" ...
##  $ contact      : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ day          : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month        : chr  "may" "may" "may" "may" ...
##  $ duration     : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays        : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome     : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ term_dep_sub : chr  "no" "no" "no" "no" ...
summary(bank_data)
##       age            job              marital           education        
##  Min.   :18.00   Length:45211       Length:45211       Length:45211      
##  1st Qu.:33.00   Class :character   Class :character   Class :character  
##  Median :39.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :40.94                                                           
##  3rd Qu.:48.00                                                           
##  Max.   :95.00                                                           
##    default             balance         housing          personal_loan     
##  Length:45211       Min.   : -8019   Length:45211       Length:45211      
##  Class :character   1st Qu.:    72   Class :character   Class :character  
##  Mode  :character   Median :   448   Mode  :character   Mode  :character  
##                     Mean   :  1362                                        
##                     3rd Qu.:  1428                                        
##                     Max.   :102127                                        
##    contact               day           month              duration     
##  Length:45211       Min.   : 1.00   Length:45211       Min.   :   0.0  
##  Class :character   1st Qu.: 8.00   Class :character   1st Qu.: 103.0  
##  Mode  :character   Median :16.00   Mode  :character   Median : 180.0  
##                     Mean   :15.81                      Mean   : 258.2  
##                     3rd Qu.:21.00                      3rd Qu.: 319.0  
##                     Max.   :31.00                      Max.   :4918.0  
##     campaign          pdays          previous          poutcome        
##  Min.   : 1.000   Min.   : -1.0   Min.   :  0.0000   Length:45211      
##  1st Qu.: 1.000   1st Qu.: -1.0   1st Qu.:  0.0000   Class :character  
##  Median : 2.000   Median : -1.0   Median :  0.0000   Mode  :character  
##  Mean   : 2.764   Mean   : 40.2   Mean   :  0.5803                     
##  3rd Qu.: 3.000   3rd Qu.: -1.0   3rd Qu.:  0.0000                     
##  Max.   :63.000   Max.   :871.0   Max.   :275.0000                     
##  term_dep_sub      
##  Length:45211      
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
head(bank_data)
##   age          job marital education default balance housing personal_loan
## 1  58   management married  tertiary      no    2143     yes            no
## 2  44   technician  single secondary      no      29     yes            no
## 3  33 entrepreneur married secondary      no       2     yes           yes
## 4  47  blue-collar married   unknown      no    1506     yes            no
## 5  33      unknown  single   unknown      no       1      no            no
## 6  35   management married  tertiary      no     231     yes            no
##   contact day month duration campaign pdays previous poutcome term_dep_sub
## 1 unknown   5   may      261        1    -1        0  unknown           no
## 2 unknown   5   may      151        1    -1        0  unknown           no
## 3 unknown   5   may       76        1    -1        0  unknown           no
## 4 unknown   5   may       92        1    -1        0  unknown           no
## 5 unknown   5   may      198        1    -1        0  unknown           no
## 6 unknown   5   may      139        1    -1        0  unknown           no

Observations

This dataset has over 45,000 records and 17 features — a good-sized dataset for analysis. The features cover customer demographics, call timing, previous contact history, and the final outcome (whether they subscribed). It’s the kind of dataset that feels realistic — a bit messy, but full of useful signals.

Exploratory Data Analysis (EDA)

Are Features Correlated?

numerical_cols <- bank_data %>% select_if(is.numeric)
cor_matrix <- cor(numerical_cols)

corrplot(cor_matrix, method = "color")

Observations

The correlation matrix shows weak relationships across most numerical features. One standout is duration — the longer the call, the more likely the customer subscribed. That makes intuitive sense. A longer call probably means the customer was genuinely interested, not just being polite.

What Do the Distributions Look Like?

Numerical Variables

numerical_cols %>%
  pivot_longer(cols = everything()) %>%
  ggplot(aes(value)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  facet_wrap(~name, scales = "free") +
  theme_minimal()

Observations

  • Age is fairly well spread, but balance and duration both show heavy skew. This could create problems for certain models.
  • Long tails in balance could mean the bank is targeting a few wealthy customers heavily — or it could indicate data entry errors.

Categorical Variables

categorical_cols <- bank_data %>% select_if(is.character)

categorical_cols %>%
  pivot_longer(cols = everything()) %>%
  ggplot(aes(value)) +
  geom_bar(fill = "lightgreen") +
  facet_wrap(~name, scales = "free") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Observations

  • Most customers are married, blue-collar workers with secondary education.
  • Most calls happened on cell phones, which reflects how society shifted away from landlines.

Are there any outliers present?

numerical_cols %>%
  pivot_longer(cols = everything()) %>%
  ggplot(aes(x = name, y = value)) +
  geom_boxplot(fill = "lightblue") +
  coord_flip()

Observations

Balance and duration both have some extreme outliers. This isn’t unusual in banking, where a handful of customers have very large balances, but these outliers could distort model performance.

Relationships Between Variables

Age vs Subscription

ggplot(bank_data, aes(x = age, fill = term_dep_sub)) +
  geom_histogram(binwidth = 5, position = "dodge") +
  theme_minimal()

Observations

Older customers are slightly more likely to subscribe. That makes sense — older people may be more financially conservative and interested in guaranteed returns like term deposits.

Duration vs Subscription

ggplot(bank_data, aes(x = duration, fill = term_dep_sub)) +
  geom_histogram(bins = 30, position = "dodge")

Observations

The longer the call, the higher the subscription rate. This supports the idea that meaningful conversations lead to conversions.

Education vs Subscription

ggplot(bank_data, aes(x = education, fill = term_dep_sub)) +
 geom_bar(position = "fill") +
 theme_minimal() +
 labs(y = "Proportion", title = "Subscription Rate by Education")

Observations

his chart really stood out to me, education clearly plays a big role in whether someone signs up for a term deposit. It makes sense when you think about financial literacy. What surprised me, though, was that even customers with ‘unknown’ education had a decent subscription rate. That got me wondering if the bank’s data collection process might be leaving out some important context about these customers, maybe they have financial knowledge from work experience or informal learning.

Contact Method vs Subscription

 ggplot(bank_data, aes(x = contact, fill = term_dep_sub)) +
 geom_bar(position = "fill") +
 theme_minimal() +
 labs(y = "Proportion", title = "Subscription Rate by Contact Method")

Observations

Cellular calls work better than landlines or unknown contact methods. This makes sense — most people are quicker to pick up their cell, and they might feel more comfortable having a real conversation that way. If the bank wants better results, they need to lean into mobile outreach.

How are categorical variables distributed?

 #Barplots forcategoricalvariables
 categorical_cols <-bank_data %>%select_if(is.character)
 categorical_cols %>%
 gather(key= "Variable",value= "Category") %>%
 ggplot(aes(x= Category))+
 geom_bar(fill= "lightblue", color= "black") +
 facet_wrap(~Variable, scales= "free") +
 theme(axis.text.x=element_text(angle= 45, hjust= 1))

#Function to get top 3categories for each categorical column
 top3_categories <-function(column) {
 counts<-sort(table(column), decreasing= TRUE)
 head(counts, 3)
 }

 #Apply to all categorical columns
 categorical_cols <-bank_data %>%select_if(is.character)
 top3_list <-lapply(categorical_cols,top3_categories)
 
 #Format in to a nice data frame
 top3_df <-purrr::map_df(names(top3_list), function(var){
 data.frame(
 Variable= var,
 Category= names(top3_list[[var]]),
 Count= as.numeric(top3_list[[var]])
 )
 })
 
#Display top 2
top3_df
##         Variable    Category Count
## 1            job blue-collar  9732
## 2            job  management  9458
## 3            job  technician  7597
## 4        marital     married 27214
## 5        marital      single 12790
## 6        marital    divorced  5207
## 7      education   secondary 23202
## 8      education    tertiary 13301
## 9      education     primary  6851
## 10       default          no 44396
## 11       default         yes   815
## 12       housing         yes 25130
## 13       housing          no 20081
## 14 personal_loan          no 37967
## 15 personal_loan         yes  7244
## 16       contact    cellular 29285
## 17       contact     unknown 13020
## 18       contact   telephone  2906
## 19         month         may 13766
## 20         month         jul  6895
## 21         month         aug  6247
## 22      poutcome     unknown 36959
## 23      poutcome     failure  4901
## 24      poutcome       other  1840
## 25  term_dep_sub          no 39922
## 26  term_dep_sub         yes  5289

Observations

When I looked at the categories, the imbalance really jumped out especially how many calls happened in May. It was way more than any other month. It made me pause and wonder: Was the bank running some kind of seasonal promotion, or maybe they were trying to hit a quarterly target?

What is the central tendency and spread of each variable?

numeric_summary <- bank_data %>%
  select_if(is.numeric) %>%
  summarise(across(everything(), list(
    Mean = ~round(mean(., na.rm = TRUE)),
    Median = ~round(median(., na.rm = TRUE)),
    SD = ~round(sd(., na.rm = TRUE)),
    Min = ~round(min(., na.rm = TRUE)),
    Q1 = ~round(quantile(., 0.25, na.rm = TRUE)),
    Q3 = ~round(quantile(., 0.75, na.rm = TRUE)),
    Max = ~round(max(., na.rm = TRUE))
  )))
  
numeric_summary_clean <- numeric_summary %>%
  pivot_longer(cols = everything(), names_to = c("Variable", "Statistic"), names_sep = "_") %>%
  pivot_wider(names_from = "Variable", values_from = "value")

kable(numeric_summary_clean, caption = "Central Tendency and Spread of Numerical Variables")
Central Tendency and Spread of Numerical Variables
Statistic age balance day duration campaign pdays previous
Mean 41 1362 16 258 3 40 1
Median 39 448 16 180 2 -1 0
SD 11 3045 8 258 3 100 2
Min 18 -8019 1 0 1 -1 0
Q1 33 72 8 103 1 -1 0
Q3 48 1428 21 319 3 -1 0
Max 95 102127 31 4918 63 871 275

Are there any missing values and how significant are they?

colSums(is.na(bank_data))
##           age           job       marital     education       default 
##             0             0             0             0             0 
##       balance       housing personal_loan       contact           day 
##             0             0             0             0             0 
##         month      duration      campaign         pdays      previous 
##             0             0             0             0             0 
##      poutcome  term_dep_sub 
##             0             0

Algorithm Selection

Select two or more machine learning algorithms

The goal is to predict whether a customer subscribes to a term deposit (term_dep_sub). This is a binary classification problem because the target variable has only two classes: “yes” and “no”.

Two algorithms suitable for this task are:

Algorithm 1: Decision Tree (Baseline)

bank_data$term_dep_sub <- factor(bank_data$term_dep_sub, levels = c("no", "yes"))

tree_model <- rpart(term_dep_sub ~ ., data = bank_data, method = "class")

rpart.plot(tree_model)

Pros and Cons of Each Algorithm

library(knitr)

algo_table <- data.frame(
  Algorithm = c("Decision Tree", "XGBoost"),
  Pros = c(
    "Easy to interpret (visual tree); Captures non-linear relationships; Handles categorical and numeric data directly",
    "High predictive accuracy; Handles non-linear relationships; Works well on large datasets; Robust to irrelevant features"
  ),
  Cons = c(
    "Prone to overfitting; Sensitive to small data changes",
    "Requires careful tuning; Needs one-hot encoding; Less interpretable"
  )
)

kable(algo_table, caption = "Comparison of Decision Tree and XGBoost", align = "l")
Comparison of Decision Tree and XGBoost
Algorithm Pros Cons
Decision Tree Easy to interpret (visual tree); Captures non-linear relationships; Handles categorical and numeric data directly Prone to overfitting; Sensitive to small data changes
XGBoost High predictive accuracy; Handles non-linear relationships; Works well on large datasets; Robust to irrelevant features Requires careful tuning; Needs one-hot encoding; Less interpretable

Are there labels in your data?

Yes, term_dep_sub is the label column. This made both Decision Trees and XGBoost suitable since it’s a supervised classification problem.

How does algorithm choice relate to the dataset?

algo_impact_table <- data.frame(
  `Dataset Characteristic` = c(
    "Binary Target (yes/no)",
    "Mixed Features",
    "Large Dataset",
    "Class Imbalance",
    "Need for Interpretability"
  ),
  `Impact on Algorithm Choice` = c(
    "Both algorithms are suitable",
    "Decision Tree handles natively; XGBoost requires encoding",
    "Both scale well",
    "XGBoost supports weighting; Decision Tree allows stratified sampling",
    "Decision Tree is more interpretable; XGBoost provides importance plots"
  )
)

kable(algo_impact_table, caption = "How Dataset Characteristics Impact Algorithm Choice", align = "l")
How Dataset Characteristics Impact Algorithm Choice
Dataset.Characteristic Impact.on.Algorithm.Choice
Binary Target (yes/no) Both algorithms are suitable
Mixed Features Decision Tree handles natively; XGBoost requires encoding
Large Dataset Both scale well
Class Imbalance XGBoost supports weighting; Decision Tree allows stratified sampling
Need for Interpretability Decision Tree is more interpretable; XGBoost provides importance plots

Would algorithm choice change with fewer records?

library(knitr)

data_size_table <- data.frame(
  `Data Size` = c(
    "Large (45k+)",
    "Medium (1k - 10k)",
    "Small (<1k)",
    "Very Small (<500)"
  ),
  `Recommended Algorithm` = c(
    "XGBoost",
    "XGBoost or Tree",
    "Decision Tree",
    "Decision Tree"
  ),
  `Reason` = c(
    "Captures complexity",
    "Balance needed",
    "Simple, interpretable",
    "Easy to fit without overfitting"
  )
)

kable(data_size_table, caption = "Recommended Algorithm Based on Data Size", align = "l")
Recommended Algorithm Based on Data Size
Data.Size Recommended.Algorithm Reason
Large (45k+) XGBoost Captures complexity
Medium (1k - 10k) XGBoost or Tree Balance needed
Small (<1k) Decision Tree Simple, interpretable
Very Small (<500) Decision Tree Easy to fit without overfitting

Pre-processing

1. Data Cleaning

bank_data <- bank_data %>%
  mutate(
    balance = pmin(balance, quantile(balance, 0.99)),
    duration = pmin(duration, quantile(duration, 0.99))
  )

2. Dimensionality Reduction

  • Review feature importance from XGBoost.
  • Drop low-importance features like poutcome, if necessary.
importance_matrix <- xgb.importance(model = xgb_model)
xgb.plot.importance(importance_matrix)

3. Feature Engineering

  • Add season based on month
  • Add engagement_per_contact
bank_data <- bank_data %>%
  mutate(
    season = case_when(
      month %in% c("dec", "jan", "feb") ~ "Winter",
      month %in% c("mar", "apr", "may") ~ "Spring",
      month %in% c("jun", "jul", "aug") ~ "Summer",
      month %in% c("sep", "oct", "nov") ~ "Fall"
    ),
    engagement_per_contact = ifelse(campaign > 0, duration / campaign, duration)
  )

4. Sampling Data

Dataset is imbalanced (~12% subscribed).
Apply SMOTE or manual oversampling to correct it.

minority <- bank_data %>% filter(term_dep_sub == "yes")
majority <- bank_data %>% filter(term_dep_sub == "no")

set.seed(123)
oversampled_minority <- minority[sample(nrow(minority), size = nrow(majority), replace = TRUE),]

bank_data_balanced <- bind_rows(majority, oversampled_minority)
table(bank_data_balanced$term_dep_sub)
## 
##    no   yes 
## 39922 39922

5. Data Transformation - regularization, normalization, handling categorical variables

Model Needs

model_table <- data.frame(
  Model = c("Decision Tree", "XGBoost"),
  Needs_Encoding = c("No", "Yes (one-hot)"),
  Needs_Scaling = c("No", "No")
)

kable(model_table, caption = "Model Pre-processing Needs")
Model Pre-processing Needs
Model Needs_Encoding Needs_Scaling
Decision Tree No No
XGBoost Yes (one-hot) No

Actions

  • For Decision Tree, no transformation is necessary.
  • For XGBoost, categorical variables need one-hot encoding.
library(dplyr)
library(caret)
library(stringr)

# Step 1: Create Revised Age Group
bank_data_balanced <- bank_data_balanced %>%
  mutate(age_group = cut(
    age,
    breaks = c(0, 25, 35, 40, 45, 50, 55, 60, 65, Inf),
    labels = c("Under 25", "25-34", "35-39", "40-44", "45-49", "50-54", "55-59", "60-64", "65+"),
    right = FALSE
  ))

# Step 2: One-hot Encode job, marital, and education
dummies <- dummyVars(~ job + marital + education, data = bank_data_balanced)
encoded <- predict(dummies, newdata = bank_data_balanced) %>% as.data.frame()

# Combine encoded columns back into main data
bank_data_balanced <- bind_cols(bank_data_balanced, encoded)

# Step 3: Rename Columns After Encoding
colnames(bank_data_balanced) <- colnames(bank_data_balanced) %>%
  str_replace_all("jobadmin\\.", "Admin") %>%
  str_replace_all("jobblue-collar", "Blue-collar") %>%
  str_replace_all("jobentrepreneur", "Entrepreneur") %>%
  str_replace_all("jobhousemaid", "Housemaid") %>%
  str_replace_all("jobmanagement", "Management") %>%
  str_replace_all("jobretired", "Retired") %>%
  str_replace_all("jobself-employed", "Self-employed") %>%
  str_replace_all("jobservices", "Services") %>%
  str_replace_all("jobstudent", "Student") %>%
  str_replace_all("jobtechnician", "Technician") %>%
  str_replace_all("jobunemployed", "Unemployed") %>%
  str_replace_all("jobunknown", "Unknown_job") %>%
  str_replace_all("maritaldivorced", "Divorced") %>%
  str_replace_all("maritalmarried", "Married") %>%
  str_replace_all("maritalsingle", "Single") %>%
  str_replace_all("educationprimary", "Primary_education") %>%
  str_replace_all("educationsecondary", "Secondary_education") %>%
  str_replace_all("educationtertiary", "Tertiary_education") %>%
  str_replace_all("educationunknown", "Unknown_education")

# Step 4: Identify Columns for Summarization
job_columns <- c("Admin", "Blue-collar", "Entrepreneur", "Housemaid",
                 "Management", "Retired", "Self-employed", "Services",
                 "Student", "Technician", "Unemployed", "Unknown_job")

marital_columns <- c("Divorced", "Married", "Single")

education_columns <- c("Primary_education", "Secondary_education", 
                       "Tertiary_education", "Unknown_education")

# Step 5: Summarize by Age Group
summary_by_age <- bank_data_balanced %>%
  group_by(age_group) %>%
  summarise(across(all_of(c(job_columns, marital_columns, education_columns)), sum, na.rm = TRUE))

# Step 6: Display result
print(summary_by_age)
## # A tibble: 9 × 20
##   age_group Admin `Blue-collar` Entrepreneur Housemaid Management Retired
##   <fct>     <dbl>         <dbl>        <dbl>     <dbl>      <dbl>   <dbl>
## 1 Under 25    182           301           19        12        141       1
## 2 25-34      3397          4640          515       277       6313      16
## 3 35-39      1671          2745          460       277       3645      17
## 4 40-44      1102          2131          422       195       2206      64
## 5 45-49      1220          2079          330       297       2077      83
## 6 50-54       802          1361          296       292       1570     341
## 7 55-59       638           873          189       315       1408    1344
## 8 60-64       217           212           43       146        491    1168
## 9 65+          31             6           24        98        125    2648
## # ℹ 13 more variables: `Self-employed` <dbl>, Services <dbl>, Student <dbl>,
## #   Technician <dbl>, Unemployed <dbl>, Unknown_job <dbl>, Divorced <dbl>,
## #   Married <dbl>, Single <dbl>, Primary_education <dbl>,
## #   Secondary_education <dbl>, Tertiary_education <dbl>,
## #   Unknown_education <dbl>

6. Imbalanced Data - reducing the imbalance between classes

Findings from EDA

  • Significant imbalance (about 88% “no”, 12% “yes”).

Actions

  • In Decision Tree, ensure cross-validation is stratified so both classes appear in each fold.
  • In XGBoost, apply the scale_pos_weight parameter to handle class imbalance.
library(dplyr)
library(caret)

# Ensure all categorical variables are factors (no encoding needed for rpart)
bank_data_balanced <- bank_data_balanced %>%
  mutate(
    term_dep_sub = factor(term_dep_sub, levels = c("no", "yes")),
    job = factor(job),
    marital = factor(marital),
    education = factor(education),
    default = factor(default),
    housing = factor(housing),
    personal_loan = factor(personal_loan),
    contact = factor(contact),
    month = factor(month),
    poutcome = factor(poutcome)
  )

# Check class imbalance
print(table(bank_data_balanced$term_dep_sub))
## 
##    no   yes 
## 39922 39922
# Define stratified cross-validation
train_control <- trainControl(
  method = "cv",
  number = 10,
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  sampling = "down"
)

# Train Decision Tree - Explicit formula listing (no ~ . shortcut)
tree_model <- train(
  term_dep_sub ~ age + job + marital + education + default + balance + housing +
    personal_loan + contact + day + month + duration + campaign +
    previous + poutcome,
  data = bank_data_balanced,
  method = "rpart",
  trControl = train_control,
  metric = "ROC"
)

# Print results to show performance
print(tree_model)
## CART 
## 
## 79844 samples
##    15 predictor
##     2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 71859, 71858, 71860, 71859, 71860, 71860, ... 
## Addtional sampling using down-sampling
## 
## Resampling results across tuning parameters:
## 
##   cp          ROC        Sens       Spec     
##   0.03779871  0.8073687  0.7100093  0.8324731
##   0.04432393  0.7583717  0.6626919  0.8193727
##   0.43975753  0.6299850  0.7655812  0.4943888
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.03779871.