Introduction

This report provides a detailed exploratory data analysis (EDA) of the Portuguese Bank Marketing dataset. The goal is to uncover patterns, trends, and relationships that explain which factors affect term deposit subscriptions (y), and to provide actionable insights for marketing strategies. The dataset includes 45,211 observations with 17 variables.

Exploratory Data Analysis (EDA)

Data Import

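# Packages used in this report
library(tidyverse)  # read_delim(), glimpse(), ggplot(), %>%, map()
library(skimr)      # skim()
library(corrplot)   # corrplot()

# Download the zip file (into a temporary file)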
temp_file <- tempfile(fileext = ".zip")
download.file("https://archive.ics.uci.edu/static/public/222/bank+marketing.zip", 
              destfile = temp_file, mode = "wb")

# List contents of the ZIP (to see filenames)
zip_contents <- unzip(temp_file, list = TRUE)

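# Extract the first entry. In this download that entry appears to be a nested
# archive rather than a bare CSV: the message below shows readr picking
# 'bank-full.csv' out of a multi-file zip.
csv_name <- zip_contents$Name[1]
unzip(temp_file, files = csv_name, exdir = tempdir())
csv_path <- file.path(tempdir(), csv_name)
bank_data <- read_delim(csv_path, delim = ";")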

## Multiple files in zip: reading 'bank-full.csv'
## Rows: 45211 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (10): job, marital, education, default, housing, loan, contact, month, p...
## dbl  (7): age, balance, day, duration, campaign, pdays, previous
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(bank_data)

## Rows: 45,211
## Columns: 17
## $ age       <dbl> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, …
## $ job       <chr> "management", "technician", "entrepreneur", "blue-collar", "…
## $ marital   <chr> "married", "single", "married", "married", "single", "marrie…
## $ education <chr> "tertiary", "secondary", "secondary", "unknown", "unknown", …
## $ default   <chr> "no", "no", "no", "no", "no", "no", "no", "yes", "no", "no",…
## $ balance   <dbl> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
## $ housing   <chr> "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes"…
## $ loan      <chr> "no", "no", "yes", "no", "no", "no", "yes", "no", "no", "no"…
## $ contact   <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
## $ day       <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ month     <chr> "may", "may", "may", "may", "may", "may", "may", "may", "may…
## $ duration  <dbl> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
## $ campaign  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ pdays     <dbl> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
## $ previous  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ poutcome  <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
## $ y         <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …

Data Overview

summary(bank_data)

##       age            job              marital           education        
##  Min.   :18.00   Length:45211       Length:45211       Length:45211      
##  1st Qu.:33.00   Class :character   Class :character   Class :character  
##  Median :39.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :40.94                                                           
##  3rd Qu.:48.00                                                           
##  Max.   :95.00                                                           
##    default             balance         housing              loan          
##  Length:45211       Min.   : -8019   Length:45211       Length:45211      
##  Class :character   1st Qu.:    72   Class :character   Class :character  
##  Mode  :character   Median :   448   Mode  :character   Mode  :character  
##                     Mean   :  1362                                        
##                     3rd Qu.:  1428                                        
##                     Max.   :102127                                        
##    contact               day           month              duration     
##  Length:45211       Min.   : 1.00   Length:45211       Min.   :   0.0  
##  Class :character   1st Qu.: 8.00   Class :character   1st Qu.: 103.0  
##  Mode  :character   Median :16.00   Mode  :character   Median : 180.0  
##                     Mean   :15.81                      Mean   : 258.2  
##                     3rd Qu.:21.00                      3rd Qu.: 319.0  
##                     Max.   :31.00                      Max.   :4918.0  
##     campaign          pdays          previous          poutcome        
##  Min.   : 1.000   Min.   : -1.0   Min.   :  0.0000   Length:45211      
##  1st Qu.: 1.000   1st Qu.: -1.0   1st Qu.:  0.0000   Class :character  
##  Median : 2.000   Median : -1.0   Median :  0.0000   Mode  :character  
##  Mean   : 2.764   Mean   : 40.2   Mean   :  0.5803                     
##  3rd Qu.: 3.000   3rd Qu.: -1.0   3rd Qu.:  0.0000                     
##  Max.   :63.000   Max.   :871.0   Max.   :275.0000                     
##       y            
##  Length:45211      
##  Class :character  
##  Mode  :character  
##                    
##                    
##

skim(bank_data)

Data summary
Name	bank_data
Number of rows	45211
Number of columns	17
_______________________
Column type frequency:
character	10
numeric	7
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
job	1	6	13	12
marital	1	6	8	3
education	1	7	9	4
default	1	2	3	2
housing	1	2	3	2
loan	1	2	3	2
contact	1	7	9	3
month	1	3	3	12
poutcome	1	5	7	4
y	1	2	3	2

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
age	1	40.94	10.62	18	33	39	48	95	▅▇▃▁▁
balance	1	1362.27	3044.77	-8019	72	448	1428	102127	▇▁▁▁▁
day	1	15.81	8.32	1	8	16	21	31	▇▆▇▆▆
duration	1	258.16	257.53	0	103	180	319	4918	▇▁▁▁▁
campaign	1	2.76	3.10	1	1	2	3	63	▇▁▁▁▁
pdays	1	40.20	100.13	-1	-1	-1	-1	871	▇▁▁▁▁
previous	1	0.58	2.30	0	0	0	0	275	▇▁▁▁▁

Key Observations

Imbalanced target variable: ~12% yes, 88% no
Numeric features:
- age: Median 39, mean 40.9
- balance: Mean 1362.3, with large outliers
- duration: Mean 258 sec, skewed

Categorical features:
- Jobs: management, technician, retired → higher subscription rates
- Housing and loan: minor differences in subscription
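The job-level observation above can be checked by computing the share of "yes" within each category. The following is a minimal sketch in base R on a small invented data frame (`toy` and its values are hypothetical, used only so the snippet runs on its own); with the real data, `bank_data` would take its place.

```r
# Toy stand-in for bank_data; the values are invented for illustration only
toy <- data.frame(
  job = c("management", "management", "retired", "retired",
          "blue-collar", "blue-collar", "blue-collar", "blue-collar"),
  y   = c("yes", "no", "yes", "yes", "no", "no", "no", "yes")
)

# Share of "yes" within each job category, sorted from highest to lowest
rate_by_job <- sort(tapply(toy$y == "yes", toy$job, mean), decreasing = TRUE)
print(rate_by_job)
```

Comparing rates rather than raw counts matters here because the job categories differ widely in size.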

Visualization

Categorical Variables Distribution

bank_data %>% 
  select_if(is.character) %>% 
  map(table)

## $job
## 
##        admin.   blue-collar  entrepreneur     housemaid    management 
##          5171          9732          1487          1240          9458 
##       retired self-employed      services       student    technician 
##          2264          1579          4154           938          7597 
##    unemployed       unknown 
##          1303           288 
## 
## $marital
## 
## divorced  married   single 
##     5207    27214    12790 
## 
## $education
## 
##   primary secondary  tertiary   unknown 
##      6851     23202     13301      1857 
## 
## $default
## 
##    no   yes 
## 44396   815 
## 
## $housing
## 
##    no   yes 
## 20081 25130 
## 
## $loan
## 
##    no   yes 
## 37967  7244 
## 
## $contact
## 
##  cellular telephone   unknown 
##     29285      2906     13020 
## 
## $month
## 
##   apr   aug   dec   feb   jan   jul   jun   mar   may   nov   oct   sep 
##  2932  6247   214  2649  1403  6895  5341   477 13766  3970   738   579 
## 
## $poutcome
## 
## failure   other success unknown 
##    4901    1840    1511   36959 
## 
## $y
## 
##    no   yes 
## 39922  5289

Distribution of Numeric Variables

# Age distribution
hist(bank_data$age, main="Age Distribution", col="skyblue")

# Balance distribution (with outliers)
boxplot(bank_data$balance, main="Balance", col="lightgreen")

# Duration of calls
hist(bank_data$duration, main="Call Duration", col="orange")

Correlations Plot

num_vars <- bank_data %>% select_if(is.numeric)
corr_matrix <- cor(num_vars)
corrplot(corr_matrix, method="color", type="upper", tl.cex=0.8)

Relationships Between Variables: Job vs. Subscription

ggplot(bank_data, aes(x=job, fill=y)) +
  geom_bar(position="fill") +
  coord_flip() +
  labs(y="Proportion", title="Term Deposit Subscription by Job")

# Balance by subscription outcome
boxplot(balance ~ y, data = bank_data, main = "Balance by Subscription")

Algorithm Selection

Algorithms Selected:

Logistic Regression: suits the binary target and yields interpretable coefficients.
Decision Tree/Random Forest: captures non-linear relationships and handles categorical data.

Justification:

Logistic regression reveals which characteristics increase the likelihood of a customer subscribing.
Tree-based approaches are well suited to categorical splits, outliers, and complex patterns.
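To illustrate what "interpretable coefficients" means in practice, here is a minimal sketch of a logistic regression fitted with base R's `glm()`. The data are simulated (the `toy` data frame, its effect sizes, and the seed are assumptions made for illustration, not results from the bank data), so only the mechanics carry over.

```r
set.seed(42)  # reproducible toy data

# Simulated stand-in for the bank data: 'duration' raises the odds of "yes"
n <- 500
toy <- data.frame(
  duration = rexp(n, rate = 1 / 250),
  housing  = factor(sample(c("no", "yes"), n, replace = TRUE))
)
p <- plogis(-3 + 0.006 * toy$duration)
toy$y <- factor(ifelse(runif(n) < p, "yes", "no"), levels = c("no", "yes"))

# Logistic regression: models the log-odds of the second factor level ("yes")
fit <- glm(y ~ duration + housing, data = toy, family = binomial)

# Exponentiated coefficients are odds ratios (> 1 means higher odds of "yes")
odds_ratios <- exp(coef(fit))
print(round(odds_ratios, 3))
```

An odds ratio above 1 for a predictor indicates that higher values go with higher odds of subscribing, holding the other predictors fixed.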

Pre-processing

Steps Taken:

Converted categorical variables to factors.
Applied an optional log transformation to balance to reduce skew.
Quantified the target variable imbalance, to be addressed via oversampling or weighting.
Checked for missing or unknown values.

bank_data <- bank_data %>% 
  mutate(across(where(is.character), as.factor))
table(bank_data$y)

## 
##    no   yes 
## 39922  5289

prop.table(table(bank_data$y))

## 
##        no       yes 
## 0.8830152 0.1169848
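One way the weighting option could be set up is with inverse-frequency class weights derived from the counts above. This is a sketch under that assumption, not necessarily the scheme a final model would use; `class_w`, `labels`, and `row_w` are illustrative names.

```r
# Class counts as reported by table(bank_data$y) above
counts <- c(no = 39922, yes = 5289)

# Inverse-frequency weights: n / (k * n_class), so each class contributes
# the same total weight
class_w <- sum(counts) / (length(counts) * counts)
print(round(class_w, 3))

# Per-row weights are then looked up by label, shown here for a toy label vector
labels <- c("no", "yes", "no")
row_w <- class_w[labels]
```

Per-row weights built this way could be supplied through a model's `weights` argument, such as the one `glm()` accepts.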

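# balance has negative values (minimum -8019), so log1p() alone yields NaNs;
# a signed log transform keeps the sign and compresses the magnitude
bank_data$balance_log <- sign(bank_data$balance) * log1p(abs(bank_data$balance))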
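Because balance includes negative values (the summary shows a minimum of -8019), `log1p()` is undefined for anything below -1. The sketch below compares plain `log1p()` with a signed variant on a handful of values; `signed_log1p` is a hypothetical helper name, and the signed transform is one possible workaround rather than the only one.

```r
# Example balances: minimum, median, and maximum from the summary above,
# plus -1 and 0 as edge cases
balance <- c(-8019, -1, 0, 448, 102127)

# log1p(x) = log(1 + x) is undefined for x < -1, hence NaN (with a warning)
plain <- suppressWarnings(log1p(balance))

# Signed variant: transform the magnitude, then restore the sign
signed_log1p <- function(x) sign(x) * log1p(abs(x))
signed <- signed_log1p(balance)

print(is.nan(plain))
print(round(signed, 2))
```

The signed version stays finite for every input and preserves ordering, at the cost of no longer being a plain logarithm of the balance.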

Business Insights & Recommendations

Insights:

High-potential job categories include management, technician, and retired.
Higher balances and longer call durations are associated with higher subscription rates.
Previous campaign outcomes (poutcome) point to better targeting opportunities.

Recommendations:

Create predictive models (logistic regression, decision trees) for customer targeting.
Prioritize high-conversion segments in marketing.
Monitor campaigns and adjust strategy based on model findings.

Conclusion

EDA revealed characteristics that affect subscription, including job, balance, call duration, and previous campaign outcomes.
Logistic regression and decision trees are suitable for modeling.
Pre-processing addresses numerical skew, categorical factors, and class imbalance.
Marketing strategy should focus on high-conversion segments to maximize returns.

References

https://archive.ics.uci.edu/dataset/222/bank+marketing

EDA of the Portuguese Bank Marketing Dataset

Woodelyne Durosier

2025-09-27