ECON 465 — Stage 1: Bank Term Deposit Classification

Author

Ozan Tekin

library(tidyverse)
library(tidymodels)
set.seed(465)

1 Dataset Description and Source

This project investigates whether a client subscribes to a bank term deposit following a telemarketing call.

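Source: Bank Marketing Dataset, originally collected by a Portuguese retail bank (May 2008 – November 2010). Downloaded from Kaggle; the file is derived from the Bank Marketing data in the UCI Machine Learning Repository. The full UCI file has 45,211 rows, so this 11,162-row version appears to be a subset rather than an exact mirror.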

https://www.kaggle.com/datasets/janiobachmann/bank-marketing-dataset

Original reference: Moro, S., Rita, P., & Cortez, P. (2014). A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, 62, 22–31.

1.1 Why this dataset is economically relevant

Term deposits are an important source of stable funding for retail banks, but acquiring them through outbound telemarketing is costly. Every phone call consumes employee time and the conversion rate is typically low. Predicting which clients are most likely to subscribe lets the bank target high-probability clients first, reducing the cost of acquiring deposit funding. The question relates to household savings behaviour and to banking microeconomics (the cost-of-funds problem).

2 Economic Question

Can client demographic, financial, and prior-campaign characteristics predict whether a client will subscribe to a bank term deposit?

This is a binary classification problem (subscribe = Yes or No), similar in structure to the loan default problem from the Week 8 lab.

3 Data Import

bank <- read_csv("data/bbank.csv")

glimpse(bank)
Rows: 11,162
Columns: 17
$ age       <dbl> 59, 56, 41, 55, 54, 42, 56, 60, 37, 28, 38, 30, 29, 46, 31, …
$ job       <chr> "admin.", "admin.", "technician", "services", "admin.", "man…
$ marital   <chr> "married", "married", "married", "married", "married", "sing…
$ education <chr> "secondary", "secondary", "secondary", "secondary", "tertiar…
$ default   <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
$ balance   <dbl> 2343, 45, 1270, 2476, 184, 0, 830, 545, 1, 5090, 100, 309, 1…
$ housing   <chr> "yes", "no", "yes", "yes", "no", "yes", "yes", "yes", "yes",…
$ loan      <chr> "no", "no", "no", "no", "no", "yes", "yes", "no", "no", "no"…
$ contact   <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
$ day       <dbl> 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, …
$ month     <chr> "may", "may", "may", "may", "may", "may", "may", "may", "may…
$ duration  <dbl> 1042, 1467, 1389, 579, 673, 562, 1201, 1030, 608, 1297, 786,…
$ campaign  <dbl> 1, 1, 1, 1, 2, 2, 1, 1, 1, 3, 1, 2, 4, 2, 2, 1, 3, 1, 2, 1, …
$ pdays     <dbl> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
$ previous  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ poutcome  <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
$ deposit   <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes…
cat("Number of observations:", nrow(bank), "\n")
Number of observations: 11162 
cat("Number of variables:", ncol(bank), "\n")
Number of variables: 17 

The dataset has 11,162 observations and 17 variables, well above the project minimum of 500 observations and 5 variables.

The target variable is deposit (yes / no). The remaining 16 variables describe client demographics (age, job, marital, education), financial status (default, balance, housing, loan), and contact history (contact, day, month, duration, campaign, pdays, previous, poutcome).

4 Data Cleaning

4.1 Convert variables to factors

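Classification models in tidymodels expect the outcome to be a factor, and storing categorical predictors as factors keeps their levels explicit. I convert the target and all other character variables to factors, fixing the level order of the yes/no variables as no, yes.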

bank <- bank |>
  mutate(
    deposit = factor(deposit, levels = c("no", "yes")),
    job = factor(job),
    marital = factor(marital),
    education = factor(education),
    default = factor(default, levels = c("no", "yes")),
    housing = factor(housing, levels = c("no", "yes")),
    loan = factor(loan, levels = c("no", "yes")),
    contact = factor(contact),
    month = factor(month),
    poutcome = factor(poutcome)
  )

4.2 Check for missing values

sum(is.na(bank))
[1] 0

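There are no NA values in the dataset. This does not mean every field is observed: the glimpse() output above shows that contact and poutcome use the category "unknown", and pdays uses -1 (in the UCI documentation, -1 means the client was not previously contacted). Missing information is therefore stored as ordinary values rather than as NA.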
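Because sum(is.na(bank)) only counts NA values, a separate check is needed for placeholder codes such as "unknown" and -1. The sketch below shows one way to count them. It runs on a small made-up table (toy) rather than the real data, so the counts it prints are illustrative only; the same pipeline could be pointed at bank.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for `bank`; these values are made up for illustration only.
toy <- tibble(
  job      = c("admin.", "unknown", "technician", "services"),
  contact  = c("unknown", "cellular", "unknown", "telephone"),
  poutcome = c("unknown", "unknown", "success", "unknown"),
  pdays    = c(-1, -1, 92, -1)
)

# Count "unknown" entries in every character or factor column.
unknown_counts <- toy |>
  summarize(across(
    where(\(x) is.character(x) || is.factor(x)),
    \(x) sum(x == "unknown")
  )) |>
  pivot_longer(everything(), names_to = "variable", values_to = "n_unknown")

unknown_counts

# Count the -1 placeholder in pdays.
n_never_contacted <- sum(toy$pdays == -1)
n_never_contacted
```

Whether "unknown" should be kept as its own category or treated as missing is a modelling choice for Stage 2; this check only makes the extent of the issue visible.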

4.3 How many people subscribed?

table(bank$deposit)

  no  yes 
5873 5289 

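5,289 of the 11,162 clients subscribed to a term deposit, which is about 47%. Unlike the Default dataset from the lab (which was very imbalanced at 3.3%), this dataset is roughly balanced. This is an advantage for modelling: accuracy is a more meaningful metric here, and class imbalance needs less special handling than it did in the lab. One caveat: a 47% subscription rate is far higher than the low conversion rates typical of telemarketing, so this file is probably not a random sample of calls and the rate should not be read as the bank's real-world conversion rate.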
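To make the claim that accuracy is meaningful here concrete, it helps to know what accuracy a model must beat. The short sketch below recomputes the class shares from the counts printed by table(bank$deposit) and derives the accuracy of a naive rule that always predicts the majority class. The object names are my own.

```r
# Class counts copied from the table(bank$deposit) output above.
counts <- c(no = 5873, yes = 5289)

# Share of each class.
shares <- counts / sum(counts)
round(shares, 3)

# Accuracy of a naive rule that always predicts the majority class ("no").
baseline_accuracy <- max(shares)
round(baseline_accuracy, 3)
```

The baseline is about 52.6%, so any Stage 2 classifier should be judged against that figure rather than against 50%.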

4.4 Summary statistics by deposit status

Following the lab format, I compare numeric variables across the two outcome groups.

bank |>
  group_by(deposit) |>
  summarize(
    avg_age = mean(age),
    avg_balance = mean(balance),
    avg_duration = mean(duration),
    avg_campaign = mean(campaign)
  )
# A tibble: 2 × 5
  deposit avg_age avg_balance avg_duration avg_campaign
  <fct>     <dbl>       <dbl>        <dbl>        <dbl>
1 no         40.8       1280.         223.         2.84
2 yes        41.7       1804.         537.         2.14

Initial observation: Clients who subscribed have higher average balances (about 1,804 vs 1,280 EUR), had much longer last-contact durations (about 537 vs 223 seconds), and were contacted fewer times during the campaign (2.14 vs 2.84 contacts on average). The duration result is intuitive (longer calls correlate with successful sales), but as the original UCI documentation notes, duration is only known after the call ends and so cannot be used for true ex-ante prediction. I will return to this point in Stage 2.

5 Probability Distribution Analysis

The target variable deposit is binary, so it does not have a continuous distribution to plot. As the lab covered, a binary outcome is modelled as a Bernoulli random variable: each client either subscribes (1) or does not (0), with probability \(p\) of subscribing. We saw the same setup with default in the Week 8 lab.

For the distributional analysis required by Stage 1, I therefore focus on a key continuous predictor, balance (the client's average yearly account balance in euros), which plays the same role here that balance played in the lab’s Default dataset.

5.1 Summary statistics for balance

bank |>
  summarize(
    mean = mean(balance),
    median = median(balance),
    sd = sd(balance),
    q1 = quantile(balance, 0.25),
    q3 = quantile(balance, 0.75),
    min = min(balance),
    max = max(balance)
  )
# A tibble: 1 × 7
   mean median    sd    q1    q3   min   max
  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1529.    550 3225.   122  1708 -6847 81204

The mean (about 1,529) is almost three times the median (550), which points to a right-skewed distribution. The minimum is negative, presumably because some clients are overdrawn.

5.2 Histogram of balance

ggplot(bank, aes(x = balance)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Distribution of Account Balance",
    x = "Balance (EUR)",
    y = "Count"
  )

The distribution of balance is strongly right-skewed. Most clients have small balances clustered near zero, while a long right tail extends to a maximum of about 81,000 EUR. This shape is not consistent with a normal distribution.

5.3 Log transformation

Because balance contains zero and negative values, \(\log(x)\) is undefined for them. I apply the log transformation only to the strictly positive subset of clients. This is a common, simple approach when an economic variable includes some zeros or negatives, but it drops those clients, so the histogram below describes positive balances only.

bank_pos <- bank |>
  filter(balance > 0) |>
  mutate(log_balance = log(balance))

ggplot(bank_pos, aes(x = log_balance)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Distribution of Log Balance (positive balances only)",
    x = "Log Balance",
    y = "Count"
  )

After the log transformation, the distribution becomes much more symmetric and closer to a normal distribution.
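Filtering on balance > 0 discards every client with a zero or negative balance. As a hedged alternative, not something the lab or this report uses, the inverse hyperbolic sine asinh() is defined for all real numbers and behaves like a log for large values. The sketch below runs on a few made-up balances (toy) to show how many rows the filter would drop and what the transform returns.

```r
library(dplyr)

# Toy balances (made up), including a zero and an overdrawn account.
toy <- tibble(balance = c(-250, 0, 45, 550, 1708, 81204))

# How many rows would the filter `balance > 0` discard?
n_dropped <- sum(toy$balance <= 0)
n_dropped

# asinh(x) = log(x + sqrt(x^2 + 1)) is defined for every real x,
# keeps the sign of x, and is close to log(2 * x) for large positive x.
toy <- toy |>
  mutate(ihs_balance = asinh(balance))

toy
```

Running the first step on bank itself would show how much of the sample the positive-only histogram leaves out.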

5.4 Proposed theoretical distribution

Because the raw balance variable is strongly right-skewed but its positive values become approximately normal after a log transformation, positive balances may follow a log-normal distribution. This makes economic sense: apart from overdrafts, account balances are bounded below near zero, and they tend to scale multiplicatively rather than additively (a 10% raise produces a larger absolute change for a wealthier client than for a poorer one). Wealth and income variables are commonly modelled as log-normal in economics. The fit cannot be exact for the full variable, because a log-normal distribution puts no mass on zero or negative values.
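A histogram gives only a rough impression of normality. One possible follow-up check is a normal Q-Q plot of log balance, together with simple estimates of the log-normal parameters. The sketch below uses simulated data (sim) with arbitrary parameters in place of bank_pos, so its output says nothing about the real balances; it only shows the mechanics.

```r
library(ggplot2)

set.seed(465)

# Simulated stand-in for positive balances; the parameters are arbitrary.
sim <- data.frame(balance = rlnorm(2000, meanlog = 6.5, sdlog = 1.5))
sim$log_balance <- log(sim$balance)

# Estimate the log-normal parameters from the mean and sd of log balance.
meanlog_hat <- mean(sim$log_balance)
sdlog_hat   <- sd(sim$log_balance)
c(meanlog = meanlog_hat, sdlog = sdlog_hat)

# Normal Q-Q plot of log balance: points close to the reference line
# are consistent with a log-normal distribution for balance.
qq_plot <- ggplot(sim, aes(sample = log_balance)) +
  stat_qq() +
  stat_qq_line() +
  labs(
    title = "Normal Q-Q Plot of Log Balance (simulated data)",
    x = "Theoretical normal quantiles",
    y = "Sample quantiles of log balance"
  )

qq_plot
```

With the real data, systematic curvature away from the line in either tail would suggest that the log-normal is only an approximation.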

The target variable deposit follows a Bernoulli distribution with parameter \(p \approx 0.47\), which we will model as a function of the predictors using logistic regression in Stage 2 (just as the lab did with default).
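Written out, the Bernoulli setup and its link to logistic regression look as follows. The coding \(Y = 1\) for a subscriber and the symbols \(\beta_0, \dots, \beta_k\) and \(x_{i1}, \dots, x_{ik}\) are notation I introduce here; the predictors are not fixed until Stage 2.

\[
P(Y = y) = p^{y}(1 - p)^{1 - y}, \quad y \in \{0, 1\}, \qquad E[Y] = p, \qquad \mathrm{Var}(Y) = p(1 - p)
\]

With \(p \approx 0.47\), the variance is about \(0.47 \times 0.53 \approx 0.25\), close to the maximum a Bernoulli variable can have. Logistic regression lets \(p\) differ across clients by modelling the log-odds as a linear function of the predictors:

\[
\log\!\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik}
\]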

6 Summary

  • Dataset: Bank Marketing (Portugal), 11,162 observations and 17 variables.
  • Target: deposit (binary, roughly balanced at 47% Yes / 53% No).
  • Cleaning: converted character variables to factors, no missing values to drop.
  • Distributions: balance is right-skewed, and its positive values may follow a log-normal distribution. The target follows a Bernoulli distribution.
  • Next step (Stage 2): 80/20 train–test split; build a logistic regression model and at least one other classifier (decision tree or k-NN); evaluate using a confusion matrix, accuracy, precision, and recall; and choose an appropriate threshold for the bank’s economic decision.
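As a rough preview of the Stage 2 plan, the sketch below walks through the split, fit, and evaluation steps with tidymodels. It is a minimal outline under stated assumptions, not the final analysis: the data are simulated (toy), the two predictors and the 0.4 threshold are placeholders, and the real model will use the bank variables.

```r
library(tidymodels)

set.seed(465)
n <- 1000

# Simulated stand-in for `bank`; predictors and coefficients are made up.
toy <- tibble(
  age         = round(runif(n, 18, 90)),
  log_balance = rnorm(n, mean = 6.5, sd = 1.5),
  deposit     = factor(
    if_else(runif(n) < plogis(-3 + 0.5 * log_balance), "yes", "no"),
    levels = c("no", "yes")
  )
)

# 80/20 split, stratified so both parts keep a similar yes/no mix.
toy_split <- initial_split(toy, prop = 0.8, strata = deposit)
train <- training(toy_split)
test  <- testing(toy_split)

# Logistic regression fitted on the training set only.
fit_logit <- logistic_reg() |>
  set_engine("glm") |>
  fit(deposit ~ age + log_balance, data = train)

# Add predicted classes and probabilities to the test set.
preds <- augment(fit_logit, new_data = test)

conf_mat(preds, truth = deposit, estimate = .pred_class)
accuracy(preds, truth = deposit, estimate = .pred_class)

# "yes" is the second factor level, so it must be named as the event;
# by default yardstick treats the first level ("no") as the event.
precision(preds, truth = deposit, estimate = .pred_class, event_level = "second")
recall(preds, truth = deposit, estimate = .pred_class, event_level = "second")

# Hypothetical alternative threshold (0.4 instead of the default 0.5).
threshold <- 0.4
preds <- preds |>
  mutate(pred_custom = factor(
    if_else(.pred_yes >= threshold, "yes", "no"),
    levels = c("no", "yes")
  ))
recall(preds, truth = deposit, estimate = pred_custom, event_level = "second")
```

Lowering the threshold can only keep or increase the number of clients flagged as likely subscribers, so recall cannot fall, though precision may. That trade-off is what the bank's cost per call should decide.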