Introduction

This report presents an exploratory data analysis (EDA) of the Portuguese Bank Marketing dataset. The goal is to uncover patterns, trends, and relationships that explain term deposit subscriptions (the target variable y) and to provide actionable insights for marketing strategy. The dataset contains 45,211 observations of 17 variables.

Exploratory Data Analysis (EDA)

Data Import

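# Packages the code in this report relies on (the original load step is not shown)
library(tidyverse)  # readr, dplyr, purrr, ggplot2
library(skimr)      # skim()
library(corrplot)   # corrplot()

# Download the ZIP archive to a temporary file
temp_file <- tempfile(fileext = ".zip")
download.file("https://archive.ics.uci.edu/static/public/222/bank+marketing.zip",
              destfile = temp_file, mode = "wb")

# List the archive contents to see the file names
zip_contents <- unzip(temp_file, list = TRUE)

# Extract the first entry. The readr message below ("Multiple files in zip")
# indicates that this entry is itself a ZIP archive, which read_delim() opens
# directly and from which it reads bank-full.csv.
csv_name <- zip_contents$Name[1]
unzip(temp_file, files = csv_name, exdir = tempdir())
csv_path <- file.path(tempdir(), csv_name)
bank_data <- read_delim(csv_path, delim = ";")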
## Multiple files in zip: reading 'bank-full.csv'
## Rows: 45211 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (10): job, marital, education, default, housing, loan, contact, month, p...
## dbl  (7): age, balance, day, duration, campaign, pdays, previous
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(bank_data)
## Rows: 45,211
## Columns: 17
## $ age       <dbl> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, …
## $ job       <chr> "management", "technician", "entrepreneur", "blue-collar", "…
## $ marital   <chr> "married", "single", "married", "married", "single", "marrie…
## $ education <chr> "tertiary", "secondary", "secondary", "unknown", "unknown", …
## $ default   <chr> "no", "no", "no", "no", "no", "no", "no", "yes", "no", "no",…
## $ balance   <dbl> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
## $ housing   <chr> "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes"…
## $ loan      <chr> "no", "no", "yes", "no", "no", "no", "yes", "no", "no", "no"…
## $ contact   <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
## $ day       <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ month     <chr> "may", "may", "may", "may", "may", "may", "may", "may", "may…
## $ duration  <dbl> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
## $ campaign  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ pdays     <dbl> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
## $ previous  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ poutcome  <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
## $ y         <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …

Data Overview

summary(bank_data)
##       age            job              marital           education        
##  Min.   :18.00   Length:45211       Length:45211       Length:45211      
##  1st Qu.:33.00   Class :character   Class :character   Class :character  
##  Median :39.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :40.94                                                           
##  3rd Qu.:48.00                                                           
##  Max.   :95.00                                                           
##    default             balance         housing              loan          
##  Length:45211       Min.   : -8019   Length:45211       Length:45211      
##  Class :character   1st Qu.:    72   Class :character   Class :character  
##  Mode  :character   Median :   448   Mode  :character   Mode  :character  
##                     Mean   :  1362                                        
##                     3rd Qu.:  1428                                        
##                     Max.   :102127                                        
##    contact               day           month              duration     
##  Length:45211       Min.   : 1.00   Length:45211       Min.   :   0.0  
##  Class :character   1st Qu.: 8.00   Class :character   1st Qu.: 103.0  
##  Mode  :character   Median :16.00   Mode  :character   Median : 180.0  
##                     Mean   :15.81                      Mean   : 258.2  
##                     3rd Qu.:21.00                      3rd Qu.: 319.0  
##                     Max.   :31.00                      Max.   :4918.0  
##     campaign          pdays          previous          poutcome        
##  Min.   : 1.000   Min.   : -1.0   Min.   :  0.0000   Length:45211      
##  1st Qu.: 1.000   1st Qu.: -1.0   1st Qu.:  0.0000   Class :character  
##  Median : 2.000   Median : -1.0   Median :  0.0000   Mode  :character  
##  Mean   : 2.764   Mean   : 40.2   Mean   :  0.5803                     
##  3rd Qu.: 3.000   3rd Qu.: -1.0   3rd Qu.:  0.0000                     
##  Max.   :63.000   Max.   :871.0   Max.   :275.0000                     
##       y            
##  Length:45211      
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
skim(bank_data)
Data summary
Name                    bank_data
Number of rows          45211
Number of columns       17
_______________________
Column type frequency:
  character             10
  numeric               7
________________________
Group variables         None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
job                    0              1    6   13      0        12           0
marital                0              1    6    8      0         3           0
education              0              1    7    9      0         4           0
default                0              1    2    3      0         2           0
housing                0              1    2    3      0         2           0
loan                   0              1    2    3      0         2           0
contact                0              1    7    9      0         3           0
month                  0              1    3    3      0        12           0
poutcome               0              1    5    7      0         4           0
y                      0              1    2    3      0         2           0

Variable type: numeric

skim_variable  n_missing  complete_rate     mean       sd     p0  p25  p50   p75    p100  hist
age                    0              1    40.94    10.62     18   33   39    48      95  ▅▇▃▁▁
balance                0              1  1362.27  3044.77  -8019   72  448  1428  102127  ▇▁▁▁▁
day                    0              1    15.81     8.32      1    8   16    21      31  ▇▆▇▆▆
duration               0              1   258.16   257.53      0  103  180   319    4918  ▇▁▁▁▁
campaign               0              1     2.76     3.10      1    1    2     3      63  ▇▁▁▁▁
pdays                  0              1    40.20   100.13     -1   -1   -1    -1     871  ▇▁▁▁▁
previous               0              1     0.58     2.30      0    0    0     0     275  ▇▁▁▁▁

Key Observations

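Categorical features:

  • All ten character columns are complete (n_missing = 0), but missing information is encoded as an "unknown" level rather than as NA.
  • job and month each have 12 distinct values; the other categorical variables have between 2 and 4.

Numeric features:

  • balance, duration, campaign, pdays, and previous are heavily right-skewed: each maximum lies far above the third quartile (for example, balance has a third quartile of 1,428 and a maximum of 102,127).
  • balance includes negative values (minimum -8,019).
  • pdays equals -1 and previous equals 0 from the minimum through the third quartile, so at least 75% of rows share those values.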

Visualization

Categorical Variables Distribution

bank_data %>%
  select(where(is.character)) %>%
  map(table)
## $job
## 
##        admin.   blue-collar  entrepreneur     housemaid    management 
##          5171          9732          1487          1240          9458 
##       retired self-employed      services       student    technician 
##          2264          1579          4154           938          7597 
##    unemployed       unknown 
##          1303           288 
## 
## $marital
## 
## divorced  married   single 
##     5207    27214    12790 
## 
## $education
## 
##   primary secondary  tertiary   unknown 
##      6851     23202     13301      1857 
## 
## $default
## 
##    no   yes 
## 44396   815 
## 
## $housing
## 
##    no   yes 
## 20081 25130 
## 
## $loan
## 
##    no   yes 
## 37967  7244 
## 
## $contact
## 
##  cellular telephone   unknown 
##     29285      2906     13020 
## 
## $month
## 
##   apr   aug   dec   feb   jan   jul   jun   mar   may   nov   oct   sep 
##  2932  6247   214  2649  1403  6895  5341   477 13766  3970   738   579 
## 
## $poutcome
## 
## failure   other success unknown 
##    4901    1840    1511   36959 
## 
## $y
## 
##    no   yes 
## 39922  5289
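
The tables above show an "unknown" level in job, education, contact, and poutcome, even though skim() reports no missing values. A minimal sketch of counting that placeholder per column follows. The toy data frame and the helper name count_unknown are illustrative assumptions, not part of the original analysis.

```r
# Toy stand-in for bank_data (assumption: categorical columns are character or factor)
toy <- data.frame(
  job       = c("admin.", "unknown", "technician", "unknown"),
  education = c("primary", "secondary", "unknown", "tertiary"),
  age       = c(30, 41, 52, 28),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count rows equal to "unknown" in each non-numeric column
count_unknown <- function(df) {
  is_cat <- vapply(df, function(col) is.character(col) || is.factor(col), logical(1))
  vapply(df[is_cat], function(col) sum(col == "unknown", na.rm = TRUE), integer(1))
}

unknown_counts <- count_unknown(toy)
print(unknown_counts)
```

Applied to the real data, the same helper would be called as count_unknown(bank_data); whether to keep "unknown" as its own level or recode it to NA is a modeling decision this sketch does not settle.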

Distribution of Numeric Variables

# Age distribution
hist(bank_data$age, main = "Age Distribution", xlab = "Age (years)", col = "skyblue")

# Balance distribution (the boxplot makes the outliers visible)
boxplot(bank_data$balance, main = "Balance", ylab = "Balance", col = "lightgreen")

# Call duration distribution
hist(bank_data$duration, main = "Call Duration", xlab = "Duration", col = "orange")

Correlation Plot

# Pairwise Pearson correlations between the numeric variables
num_vars <- bank_data %>% select(where(is.numeric))
corr_matrix <- cor(num_vars)
corrplot(corr_matrix, method = "color", type = "upper", tl.cex = 0.8)
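
The summary earlier in this report shows pdays equal to -1 at the median and at the third quartile; in the UCI documentation for this dataset, -1 marks clients who were not previously contacted. Because cor() treats that sentinel as an ordinary number, one option is to split pdays into a flag and an NA-coded day count before correlating. The sketch below uses invented toy values, and the column names was_contacted and pdays_clean are hypothetical.

```r
# Toy values only; not taken from the real dataset
toy <- data.frame(
  pdays    = c(-1, -1, 90, -1, 180, 30),
  previous = c(0, 0, 2, 0, 1, 4)
)

# Flag clients with a previous contact, and blank out the -1 sentinel
toy$was_contacted <- toy$pdays != -1
toy$pdays_clean   <- ifelse(toy$was_contacted, toy$pdays, NA)

# Correlate only over rows where both values are present
cor_clean <- cor(toy$pdays_clean, toy$previous, use = "complete.obs")
print(cor_clean)
```

This keeps the "never contacted" information in the flag while the day count reflects only clients who were actually contacted.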

Relationships Between Variables

# Job vs. subscription: share of "yes" and "no" within each job
ggplot(bank_data, aes(x = job, fill = y)) +
  geom_bar(position = "fill") +
  coord_flip() +
  labs(x = "Job", y = "Proportion", title = "Term Deposit Subscription by Job")

# Balance vs. subscription
boxplot(balance ~ y, data = bank_data, main = "Balance by Subscription",
        xlab = "Subscribed (y)", ylab = "Balance")
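
The job chart above compares subscription shares visually; the same comparison can be tabulated so that the rates can be sorted and read off directly. This is a minimal sketch on an invented toy data frame (the counts are made up and say nothing about the real rates).

```r
# Toy data: job labels follow the dataset, counts are invented
toy <- data.frame(
  job = c("student", "student", "retired", "retired", "retired",
          "blue-collar", "blue-collar", "blue-collar", "blue-collar"),
  y   = c("yes", "no", "yes", "yes", "no", "no", "no", "no", "yes")
)

# Row-wise proportions: each job's share of "no" and "yes"
rate_by_job <- prop.table(table(toy$job, toy$y), margin = 1)

# Sort jobs by their "yes" share, highest first
yes_rate <- sort(rate_by_job[, "yes"], decreasing = TRUE)
print(round(yes_rate, 3))
```

On the real data, replacing toy with bank_data would give the per-job subscription rate that the stacked bar chart displays.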

Algorithm Selection

Algorithms Selected:

  • Logistic regression
  • Decision tree
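
The conclusion of this report names logistic regression and decision trees as suitable models, but no modeling code is shown. The following is a minimal sketch of fitting both on synthetic data. The toy data frame, its invented coefficients, the two predictors, and the use of rpart for the tree are all assumptions made for illustration.

```r
library(rpart)  # recommended package, ships with standard R installations

set.seed(42)
n <- 500
toy <- data.frame(
  duration = rexp(n, rate = 1 / 250),
  housing  = factor(sample(c("no", "yes"), n, replace = TRUE))
)

# Invented relationship, for illustration only: longer calls raise the odds of "yes"
p_yes <- plogis(-3 + 0.006 * toy$duration - 0.5 * (toy$housing == "yes"))
toy$y <- factor(ifelse(runif(n) < p_yes, "yes", "no"), levels = c("no", "yes"))

# Logistic regression: models the log-odds of y == "yes"
logit_fit <- glm(y ~ duration + housing, data = toy, family = binomial)

# Decision tree: recursive partitioning on the same predictors
tree_fit <- rpart(y ~ duration + housing, data = toy, method = "class")

# Predicted probability of "yes" from each model
logit_prob <- predict(logit_fit, type = "response")
tree_prob  <- predict(tree_fit, type = "prob")[, "yes"]
print(summary(logit_prob))
print(summary(tree_prob))
```

A real comparison would fit both models on a training split of bank_data and evaluate them on held-out rows; that step is omitted here.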

Justification:

Pre-processing

Steps Taken:

  • Converted categorical variables to factors.
  • Optional log transformation of balance to reduce skew.
  • Addressed target variable imbalance via oversampling or weighting.
  • Checked missing or unknown values.
bank_data <- bank_data %>% 
  mutate(across(where(is.character), as.factor))
table(bank_data$y)
## 
##    no   yes 
## 39922  5289
prop.table(table(bank_data$y))
## 
##        no       yes 
## 0.8830152 0.1169848
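# balance contains negative values (minimum -8019), so log1p() alone returns NaN
# for any balance below -1. A signed log transform keeps the sign and compresses
# the magnitude instead.
bank_data$balance_log <- sign(bank_data$balance) * log1p(abs(bank_data$balance))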
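
The step list mentions oversampling or weighting for the imbalanced target (about 88% "no" and 12% "yes"), but the corresponding code is not shown. The sketch below illustrates both options in base R on a toy data frame with a similar split. The column names x and w and the object toy_balanced are hypothetical.

```r
set.seed(1)
toy <- data.frame(
  x = rnorm(100),
  y = factor(rep(c("no", "yes"), times = c(88, 12)), levels = c("no", "yes"))
)

# Option 1: inverse-frequency weights, so each class carries equal total weight
class_freq <- table(toy$y)
toy$w <- as.numeric(nrow(toy) / (length(class_freq) * class_freq[as.character(toy$y)]))

# Option 2: random oversampling of the minority class up to the majority count
minority_rows <- which(toy$y == "yes")
n_extra <- sum(toy$y == "no") - length(minority_rows)
extra <- sample(minority_rows, size = n_extra, replace = TRUE)
toy_balanced <- toy[c(seq_len(nrow(toy)), extra), ]

print(tapply(toy$w, toy$y, sum))
print(table(toy_balanced$y))
```

The weights could be passed to a model through its weights argument, for example in glm(). Oversampling is normally applied to the training split only, so that duplicated rows do not leak into the evaluation data.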

Business Insights & Recommendations

Insights:

Recommendations:

Conclusion

  • EDA revealed characteristics associated with subscription, including job, balance, call duration, and the outcome of previous campaigns.

  • Logistic regression and decision trees are suitable for modeling.

  • Pre-processing addresses numeric skew, the encoding of categorical variables as factors, and class imbalance.

  • Marketing strategy should focus on high-conversion segments to maximize returns.