This report presents an exploratory data analysis (EDA) of the Portuguese Bank Marketing dataset. The goal is to uncover patterns, trends, and relationships that help explain term deposit subscriptions (y) and to provide actionable insights for marketing strategy. The dataset contains 45,211 observations of 17 variables.
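# Packages used throughout the report
library(tidyverse) # readr, dplyr, purrr, ggplot2
library(skimr)     # skim()
library(corrplot)  # corrplot()
# Download the zip file (into a temporary file)
temp_file <- tempfile(fileext = ".zip")
download.file("https://archive.ics.uci.edu/static/public/222/bank+marketing.zip",
              destfile = temp_file, mode = "wb")
# List contents of the ZIP (to see filenames)
zip_contents <- unzip(temp_file, list = TRUE)
# Extract the first entry. Judging by the readr message below, this entry is
# itself a ZIP archive, which read_delim() opens directly (reading 'bank-full.csv').
archive_name <- zip_contents$Name[1] # pick first or appropriate
unzip(temp_file, files = archive_name, exdir = tempdir())
archive_path <- file.path(tempdir(), archive_name)
bank_data <- read_delim(archive_path, delim = ";")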
## Multiple files in zip: reading 'bank-full.csv'
## Rows: 45211 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (10): job, marital, education, default, housing, loan, contact, month, p...
## dbl (7): age, balance, day, duration, campaign, pdays, previous
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(bank_data)
## Rows: 45,211
## Columns: 17
## $ age <dbl> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, …
## $ job <chr> "management", "technician", "entrepreneur", "blue-collar", "…
## $ marital <chr> "married", "single", "married", "married", "single", "marrie…
## $ education <chr> "tertiary", "secondary", "secondary", "unknown", "unknown", …
## $ default <chr> "no", "no", "no", "no", "no", "no", "no", "yes", "no", "no",…
## $ balance <dbl> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
## $ housing <chr> "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes"…
## $ loan <chr> "no", "no", "yes", "no", "no", "no", "yes", "no", "no", "no"…
## $ contact <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
## $ day <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
## $ month <chr> "may", "may", "may", "may", "may", "may", "may", "may", "may…
## $ duration <dbl> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
## $ campaign <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ pdays <dbl> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
## $ previous <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ poutcome <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
## $ y <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
summary(bank_data)
## age job marital education
## Min. :18.00 Length:45211 Length:45211 Length:45211
## 1st Qu.:33.00 Class :character Class :character Class :character
## Median :39.00 Mode :character Mode :character Mode :character
## Mean :40.94
## 3rd Qu.:48.00
## Max. :95.00
## default balance housing loan
## Length:45211 Min. : -8019 Length:45211 Length:45211
## Class :character 1st Qu.: 72 Class :character Class :character
## Mode :character Median : 448 Mode :character Mode :character
## Mean : 1362
## 3rd Qu.: 1428
## Max. :102127
## contact day month duration
## Length:45211 Min. : 1.00 Length:45211 Min. : 0.0
## Class :character 1st Qu.: 8.00 Class :character 1st Qu.: 103.0
## Mode :character Median :16.00 Mode :character Median : 180.0
## Mean :15.81 Mean : 258.2
## 3rd Qu.:21.00 3rd Qu.: 319.0
## Max. :31.00 Max. :4918.0
## campaign pdays previous poutcome
## Min. : 1.000 Min. : -1.0 Min. : 0.0000 Length:45211
## 1st Qu.: 1.000 1st Qu.: -1.0 1st Qu.: 0.0000 Class :character
## Median : 2.000 Median : -1.0 Median : 0.0000 Mode :character
## Mean : 2.764 Mean : 40.2 Mean : 0.5803
## 3rd Qu.: 3.000 3rd Qu.: -1.0 3rd Qu.: 0.0000
## Max. :63.000 Max. :871.0 Max. :275.0000
## y
## Length:45211
## Class :character
## Mode :character
##
##
##
skim(bank_data)
Data summary
| | |
|---|---|
| Name | bank_data |
| Number of rows | 45211 |
| Number of columns | 17 |
| _______________________ | |
| Column type frequency: | |
| character | 10 |
| numeric | 7 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| job | 0 | 1 | 6 | 13 | 0 | 12 | 0 |
| marital | 0 | 1 | 6 | 8 | 0 | 3 | 0 |
| education | 0 | 1 | 7 | 9 | 0 | 4 | 0 |
| default | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| housing | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| loan | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| contact | 0 | 1 | 7 | 9 | 0 | 3 | 0 |
| month | 0 | 1 | 3 | 3 | 0 | 12 | 0 |
| poutcome | 0 | 1 | 5 | 7 | 0 | 4 | 0 |
| y | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 1 | 40.94 | 10.62 | 18 | 33 | 39 | 48 | 95 | ▅▇▃▁▁ |
| balance | 0 | 1 | 1362.27 | 3044.77 | -8019 | 72 | 448 | 1428 | 102127 | ▇▁▁▁▁ |
| day | 0 | 1 | 15.81 | 8.32 | 1 | 8 | 16 | 21 | 31 | ▇▆▇▆▆ |
| duration | 0 | 1 | 258.16 | 257.53 | 0 | 103 | 180 | 319 | 4918 | ▇▁▁▁▁ |
| campaign | 0 | 1 | 2.76 | 3.10 | 1 | 1 | 2 | 3 | 63 | ▇▁▁▁▁ |
| pdays | 0 | 1 | 40.20 | 100.13 | -1 | -1 | -1 | -1 | 871 | ▇▁▁▁▁ |
| previous | 0 | 1 | 0.58 | 2.30 | 0 | 0 | 0 | 0 | 275 | ▇▁▁▁▁ |
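Imbalanced target variable: about 11.7% yes, 88.3% no (5,289 of 45,211 clients subscribed)
Numeric features:
age: Median 39, mean 40.9, range 18 to 95
balance: Mean 1362.3 against a median of 448, with large outliers (minimum -8,019, maximum 102,127)
duration: Mean 258 s against a median of 180 s, strongly right-skewed (maximum 4,918 s)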
Jobs: management, technician, retired → higher subscription rates
Housing and loan: minor differences in subscription
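The claims about jobs, housing, and loans can be checked with a per-category subscription rate. The sketch below is one way to compute it; `subscription_rate()` is a hypothetical helper introduced here for illustration, and it assumes a data frame with a `y` column holding "yes"/"no" values, as in `bank_data`.

```r
library(dplyr)

# Hypothetical helper: share of "yes" outcomes within each level of one
# categorical column, sorted from the highest rate to the lowest.
subscription_rate <- function(data, group_col) {
  data %>%
    group_by(across(all_of(group_col))) %>%
    summarise(n = n(), rate = mean(y == "yes"), .groups = "drop") %>%
    arrange(desc(rate))
}

# Runs only when the report's data frame has already been loaded
if (exists("bank_data")) {
  print(subscription_rate(bank_data, "job"))
  print(subscription_rate(bank_data, "housing"))
  print(subscription_rate(bank_data, "loan"))
}
```

Reporting the group size `n` next to each rate helps avoid over-reading rates from small categories.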
# Frequency table for every character column
bank_data %>%
  select(where(is.character)) %>%
  map(table)
## $job
##
## admin. blue-collar entrepreneur housemaid management
## 5171 9732 1487 1240 9458
## retired self-employed services student technician
## 2264 1579 4154 938 7597
## unemployed unknown
## 1303 288
##
## $marital
##
## divorced married single
## 5207 27214 12790
##
## $education
##
## primary secondary tertiary unknown
## 6851 23202 13301 1857
##
## $default
##
## no yes
## 44396 815
##
## $housing
##
## no yes
## 20081 25130
##
## $loan
##
## no yes
## 37967 7244
##
## $contact
##
## cellular telephone unknown
## 29285 2906 13020
##
## $month
##
## apr aug dec feb jan jul jun mar may nov oct sep
## 2932 6247 214 2649 1403 6895 5341 477 13766 3970 738 579
##
## $poutcome
##
## failure other success unknown
## 4901 1840 1511 36959
##
## $y
##
## no yes
## 39922 5289
# Age distribution
hist(bank_data$age, main="Age Distribution", col="skyblue")
# Balance distribution (with outliers)
boxplot(bank_data$balance, main="Balance", col="lightgreen")
# Duration of calls
hist(bank_data$duration, main="Call Duration", col="orange")
# Correlation matrix of the numeric columns
num_vars <- bank_data %>% select(where(is.numeric))
corr_matrix <- cor(num_vars)
corrplot(corr_matrix, method="color", type="upper", tl.cex=0.8)
# Relationships between variables: job vs. subscription
ggplot(bank_data, aes(x = job, fill = y)) +
  geom_bar(position = "fill") +
  coord_flip() +
  labs(y = "Proportion", title = "Term Deposit Subscription by Job")
# Balance by subscription outcome
boxplot(balance ~ y, data = bank_data, main = "Balance by Subscription")
## Algorithm selection
Steps Taken:
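# Convert every character column (including the target y) to a factor
bank_data <- bank_data %>%
  mutate(across(where(is.character), as.factor))
# Class counts and proportions of the target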
table(bank_data$y)
##
## no yes
## 39922 5289
prop.table(table(bank_data$y))
##
## no yes
## 0.8830152 0.1169848
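With roughly 12% positives, a purely random train/test split can shift the class share between the two sets. The following is a minimal sketch of a stratified split in base R; `stratified_train_idx()`, the 80% training fraction, and the seed are assumptions made for illustration and are not part of the original analysis.

```r
# Hypothetical helper: row indices for a training set that draws the same
# fraction of rows from each outcome class.
stratified_train_idx <- function(y, train_frac = 0.8, seed = 42) {
  set.seed(seed)
  idx_by_class <- split(seq_along(y), y)
  unlist(lapply(idx_by_class, function(idx) {
    # Index into idx explicitly to avoid sample()'s single-number shortcut
    idx[sample.int(length(idx), size = floor(train_frac * length(idx)))]
  }), use.names = FALSE)
}

# Runs only when the report's data frame has already been loaded
if (exists("bank_data")) {
  train_idx  <- stratified_train_idx(bank_data$y)
  train_data <- bank_data[train_idx, ]
  test_data  <- bank_data[-train_idx, ]
  print(prop.table(table(train_data$y)))
  print(prop.table(table(test_data$y)))
}
```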
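# balance has negative values (minimum -8019), and log1p() returns NaN below -1,
# so apply the log to the magnitude and restore the sign afterwards
bank_data$balance_log <- sign(bank_data$balance) * log1p(abs(bank_data$balance))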
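Because `balance` goes as low as -8,019, a plain `log1p()` is undefined for part of the column. The sketch below compares it with a sign-preserving variant on the minimum, quartiles, and maximum reported in the summary above; `signed_log1p()` is a hypothetical helper name, and the sign-preserving transform is one possible choice rather than the only one.

```r
# Hypothetical helper: a log transform that keeps the sign of its input,
# so negative balances stay defined.
signed_log1p <- function(x) sign(x) * log1p(abs(x))

# Minimum, quartiles, and maximum of balance, taken from the summary output
balance_examples <- c(-8019, 72, 448, 1428, 102127)

plain  <- suppressWarnings(log1p(balance_examples)) # NaN for the negative value
signed <- signed_log1p(balance_examples)

print(data.frame(balance = balance_examples, plain = plain, signed = signed))
```

For non-negative balances the two versions agree, so the change only affects rows that were previously lost to NaN.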
EDA revealed characteristics associated with subscription, including job, balance, call duration, and the outcome of previous campaigns.
Logistic regression and decision trees are suitable for modeling.
Pre-processing ensures that numeric skew, categorical factors, and class imbalance are accounted for.
Marketing strategy should focus on high-conversion segments to maximize returns.
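As a starting point for the two model families named above, the sketch below fits a logistic regression with `glm()` and a classification tree with `rpart()`. The helper `fit_baselines()` and the choice of predictors are assumptions made for illustration; the report itself does not specify a model formula.

```r
library(rpart) # recommended package shipped with standard R installations

# Hypothetical helper: fit two baseline classifiers on a data frame whose
# outcome column y is a factor with levels "no" and "yes".
fit_baselines <- function(data, predictors) {
  form <- reformulate(predictors, response = "y")
  list(
    # With a factor response, glm() models the probability of the second level
    logit = glm(form, data = data, family = binomial()),
    tree  = rpart(form, data = data, method = "class")
  )
}

# Runs only when the report's data frame has already been loaded
if (exists("bank_data")) {
  models <- fit_baselines(bank_data, c("job", "balance", "duration", "poutcome"))
  print(summary(models$logit))
  print(models$tree)
}
```

Given the class imbalance, accuracy alone can be misleading for either model, so per-class measures such as recall on the "yes" class are worth inspecting as well.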