project_econ465

Authors

İsmet Erdal Tunç

Ozan Tekin

Stage 3: Complete Analysis Pipeline

Bank Term Deposit Subscription Prediction

Packages

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.1     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.3     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.5.0 ──
✔ broom        1.0.12     ✔ rsample      1.3.2 
✔ dials        1.4.3      ✔ tailor       0.1.0 
✔ infer        1.1.0      ✔ tune         2.1.0 
✔ modeldata    1.5.1      ✔ workflows    1.3.0 
✔ parsnip      1.5.0      ✔ workflowsets 1.1.1 
✔ recipes      1.3.2      ✔ yardstick    1.4.0 
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()

Dataset Description and Source

This project investigates whether a client subscribes to a bank term deposit following a telemarketing call. The outcome variable is binary, indicating whether the client subscribed (“yes”) or not (“no”).

The dataset includes demographic, financial, and campaign-related characteristics that may help predict subscription decisions.

Source: Bank Marketing Dataset, originally collected by a Portuguese retail bank between May 2008 and November 2010. The file used here was downloaded from Kaggle, which hosts a version of the UCI Machine Learning Repository dataset.

https://www.kaggle.com/datasets/janiobachmann/bank-marketing-dataset

Economic Question

Can client demographic, financial, and prior-campaign characteristics predict whether a client will subscribe to a bank term deposit?

Economic Logic for Predictor Selection

This analysis focuses on variables that are expected to have strong predictive power for term deposit subscription decisions. Financial variables such as balance, housing loan status, and personal loan status may reflect the client’s financial condition and saving behavior.

Campaign-related variables such as duration, previous, and poutcome are also expected to be important because they may indicate client interest and responsiveness. In particular, call duration (duration) is expected to be one of the strongest predictors.

Some variables, such as contact type, day, and month, were excluded to simplify the analysis and reduce model complexity.

Data Import

bank <- read_csv2("bank.csv")
ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
Rows: 11162 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr (10): job, marital, education, default, housing, loan, contact, month, p...
dbl  (7): age, balance, day, duration, campaign, pdays, previous

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(bank)
Rows: 11,162
Columns: 17
$ age       <dbl> 59, 56, 41, 55, 54, 42, 56, 60, 37, 28, 38, 30, 29, 46, 31, …
$ job       <chr> "admin.", "admin.", "technician", "services", "admin.", "man…
$ marital   <chr> "married", "married", "married", "married", "married", "sing…
$ education <chr> "secondary", "secondary", "secondary", "secondary", "tertiar…
$ default   <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
$ balance   <dbl> 2343, 45, 1270, 2476, 184, 0, 830, 545, 1, 5090, 100, 309, 1…
$ housing   <chr> "yes", "no", "yes", "yes", "no", "yes", "yes", "yes", "yes",…
$ loan      <chr> "no", "no", "no", "no", "no", "yes", "yes", "no", "no", "no"…
$ contact   <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
$ day       <dbl> 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, …
$ month     <chr> "may", "may", "may", "may", "may", "may", "may", "may", "may…
$ duration  <dbl> 1042, 1467, 1389, 579, 673, 562, 1201, 1030, 608, 1297, 786,…
$ campaign  <dbl> 1, 1, 1, 1, 2, 2, 1, 1, 1, 3, 1, 2, 4, 2, 2, 1, 3, 1, 2, 1, …
$ pdays     <dbl> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
$ previous  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ poutcome  <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn…
$ deposit   <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes…

Data Cleaning

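bank <- bank |>
  # Remove rows with missing values; "unknown" entries are ordinary strings, not NA
  drop_na() |>
  mutate(
    # Outcome: "no" is the first level, "yes" the second
    deposit = factor(deposit, levels = c("no", "yes")),
    # Multi-category predictors keep the default alphabetical level order
    job = factor(job),
    marital = factor(marital),
    education = factor(education),
    # Yes/no predictors use "no" as the reference level
    default = factor(default, levels = c("no", "yes")),
    housing = factor(housing, levels = c("no", "yes")),
    loan = factor(loan, levels = c("no", "yes")),
    contact = factor(contact),
    month = factor(month),
    poutcome = factor(poutcome)
  )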
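
Because this dataset codes missing categorical information as the string "unknown" (visible in the contact and poutcome columns above), drop_na() leaves those rows in place. A minimal sketch of how the share of "unknown" entries per column could be checked is shown below. The data frame `bank_toy` is hypothetical and only stands in for the raw bank data before the factor conversion.

```r
library(dplyr)

# Hypothetical toy data standing in for the raw bank data
bank_toy <- data.frame(
  contact  = c("unknown", "cellular", "unknown", "telephone"),
  poutcome = c("unknown", "unknown", "success", "unknown"),
  balance  = c(2343, 45, 1270, 2476)
)

# Share of "unknown" entries in each character column
unknown_share <- bank_toy |>
  summarize(across(where(is.character), \(x) mean(x == "unknown")))

unknown_share
```

Columns with a large "unknown" share, as poutcome appears to have here, are then modeled with "unknown" as its own category rather than as missing data.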

Summary Statistics by Deposit Status

bank |>
  group_by(deposit) |>
  summarize(
    avg_age = mean(age),
    avg_balance = mean(balance),
    avg_duration = mean(duration),
    avg_campaign = mean(campaign)
  )
# A tibble: 2 × 5
  deposit avg_age avg_balance avg_duration avg_campaign
  <fct>     <dbl>       <dbl>        <dbl>        <dbl>
1 no         40.8       1280.         223.         2.84
2 yes        41.7       1804.         537.         2.14

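Clients who subscribed to the term deposit have higher average balances (about 1,804 versus 1,280) and much longer average call durations (about 537 versus 223). They were also contacted slightly fewer times during the campaign (2.14 versus 2.84 contacts on average). This suggests that financial capacity and customer engagement may play an important role in subscription decisions.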

Distribution of Deposit Subscription

ggplot(bank, aes(x = deposit)) +
  geom_bar() +
  theme_minimal() +
  labs(
    title = "Distribution of Deposit Subscription",
    x = "Deposit Subscription",
    y = "Count"
  )

The outcome variable is binary, with clients either subscribing (“yes”) or not subscribing (“no”) to the term deposit. This type of outcome can be modeled using a Bernoulli distribution.

The distribution appears relatively balanced, although there are slightly more non-subscribers than subscribers in the dataset.
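
For reference, the Bernoulli model mentioned above can be written as follows. The notation is introduced here for illustration: $Y_i = 1$ if client $i$ subscribes and $Y_i = 0$ otherwise, and $p_i$ is that client's subscription probability.

```latex
P(Y_i = y) = p_i^{\,y}\,(1 - p_i)^{1 - y}, \quad y \in \{0, 1\},
\qquad E[Y_i] = p_i, \qquad \operatorname{Var}(Y_i) = p_i\,(1 - p_i)
```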

Probability Analysis

prop.table(table(bank$deposit))

       no       yes 
0.5261602 0.4738398 

The probability analysis shows that approximately 47.4% of clients subscribed to a term deposit, while 52.6% did not subscribe.

This suggests that the outcome variable is relatively balanced between the two categories. A balanced outcome is helpful for classification modeling because accuracy is then an informative metric: a rule that always predicted “no” would be correct only about 52.6% of the time.

From an economic perspective, the results indicate that term deposit subscription is not a rare event in this dataset. Therefore, client characteristics and campaign-related factors may provide useful information for predicting subscription decisions.
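
One way to put the class shares to work is as a benchmark: a rule that always predicts the majority class is correct for exactly the majority share of clients. The sketch below computes that baseline from class counts. The counts are not printed in the report; they are the values implied by the proportions above and the 11,162 rows, and the object names are illustrative.

```r
# Class counts implied by the reported proportions and 11,162 rows
class_counts <- c(no = 5873, yes = 5289)

# Class shares, as prop.table(table(bank$deposit)) would report them
class_share <- class_counts / sum(class_counts)

# Accuracy of a rule that always predicts the majority class
baseline_accuracy <- max(class_share)

round(class_share, 3)
round(baseline_accuracy, 3)
```

Any model's accuracy can be compared against this figure, roughly 0.526 on the full sample. The share in the test set may differ slightly because the split is random.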

Data Splitting

set.seed(465)

bank_split <- initial_split(bank, prop = 0.8)

bank_train <- training(bank_split)
bank_test <- testing(bank_split)

nrow(bank_train)
[1] 8929
nrow(bank_test)
[1] 2233

The dataset was split into training (80%) and test (20%) sets using `initial_split()`. The training set contains 8,929 observations, while the test set contains 2,233 observations.
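
The split above is a simple random split. If keeping the yes/no shares nearly equal in both sets matters, rsample's `strata` argument can be used. This is an alternative and not what the report ran, so switching to it would change the reported numbers. A minimal sketch on simulated data (`toy` is hypothetical):

```r
library(rsample)

set.seed(465)

# Hypothetical data with roughly the same class balance as the bank data
toy <- data.frame(
  deposit = factor(rep(c("no", "yes"), times = c(526, 474)),
                   levels = c("no", "yes")),
  balance = rnorm(1000, mean = 1500, sd = 3000)
)

# Stratified 80/20 split: rows are sampled within each deposit class
toy_split <- initial_split(toy, prop = 0.8, strata = deposit)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Class shares should be nearly identical in the two sets
prop.table(table(toy_train$deposit))
prop.table(table(toy_test$deposit))
```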

Model 1

Model 1 uses balance, housing loan status, and personal loan status as predictors of term deposit subscription. These variables may reflect the client’s financial condition and saving behavior.

model1_class <- logistic_reg() |>
  set_engine("glm")

model1_class_fit <- model1_class |>
  fit(
    deposit ~ balance + housing + loan,
    data = bank_train
  )

model1_class_predictions <- predict(
  model1_class_fit,
  bank_test,
  type = "class"
) |>
  bind_cols(bank_test)

metric_set(accuracy, precision, recall)(
  model1_class_predictions,
  truth = deposit,
  estimate = .pred_class
)
# A tibble: 3 × 3
  .metric   .estimator .estimate
  <chr>     <chr>          <dbl>
1 accuracy  binary         0.603
2 precision binary         0.627
3 recall    binary         0.626

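Model 1 achieves only modest predictive performance. Its test accuracy of 0.603 is not far above the roughly 0.526 share of non-subscribers in the data, which suggests that financial variables alone explain little of subscription behavior. Because `event_level` is not set, yardstick computes precision and recall for the first factor level, “no”.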

Model 2

Model 2 includes campaign-related variables such as duration, previous campaign outcomes, and previous contacts in addition to financial variables. These variables are expected to improve predictive performance because they may better capture client interest and responsiveness.

model2_class <- logistic_reg() |>
  set_engine("glm")

model2_class_fit <- model2_class |>
  fit(
    deposit ~ balance + housing + loan + duration + previous + poutcome,
    data = bank_train
  )

model2_class_predictions <- predict(
  model2_class_fit,
  bank_test,
  type = "class"
) |>
  bind_cols(bank_test)

metric_set(accuracy, precision, recall)(
  model2_class_predictions,
  truth = deposit,
  estimate = .pred_class
)
# A tibble: 3 × 3
  .metric   .estimator .estimate
  <chr>     <chr>          <dbl>
1 accuracy  binary         0.789
2 precision binary         0.770
3 recall    binary         0.861

Model 2 performs substantially better than Model 1. Higher accuracy, precision, and recall values suggest that campaign-related variables such as duration, previous contacts, and previous campaign outcomes provide important predictive information about deposit subscription decisions.

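Note that yardstick treats the first factor level as the event by default, and deposit was defined with the levels “no” and then “yes”. The precision and recall above therefore describe the “no” class: the recall of 0.861 means that the model correctly identifies about 86% of the clients who did not subscribe. Metrics for subscribers would require setting `event_level = "second"`.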
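
yardstick's class metrics score the first factor level as the event of interest unless `event_level` is set. With `levels = c("no", "yes")`, as in the data cleaning step, this determines which class precision and recall describe. The sketch below uses a small hypothetical set of predictions (`toy_preds`) to show both settings. It is illustrative only and not a re-run of the report's models.

```r
library(yardstick)

lv <- c("no", "yes")

# Hypothetical predictions: 4 true "no" clients and 6 true "yes" clients
toy_preds <- data.frame(
  deposit = factor(
    c("no", "no", "no", "no", "yes", "yes", "yes", "yes", "yes", "yes"),
    levels = lv
  ),
  .pred_class = factor(
    c("no", "no", "no", "yes", "yes", "yes", "yes", "no", "no", "no"),
    levels = lv
  )
)

# Default: the first level ("no") is the event, so this is recall for non-subscribers
recall_no <- recall(toy_preds, truth = deposit, estimate = .pred_class)

# event_level = "second" makes "yes" the event, giving recall for subscribers
recall_yes <- recall(toy_preds, truth = deposit, estimate = .pred_class,
                     event_level = "second")

recall_no$.estimate   # 3 of 4 true "no" recovered: 0.75
recall_yes$.estimate  # 3 of 6 true "yes" recovered: 0.5
```

The same predictions give very different recall values depending on which class is treated as the event, so the setting should match the class the analysis cares about.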

Model Comparison

comparison_table_class <- tibble(
  Model = c("Model 1", "Model 2"),
  Accuracy = c(0.6032244, 0.7890730),
  Precision = c(0.6270042, 0.7695783),
  Recall = c(0.6259478, 0.8609941)
)

comparison_table_class
# A tibble: 2 × 4
  Model   Accuracy Precision Recall
  <chr>      <dbl>     <dbl>  <dbl>
1 Model 1    0.603     0.627  0.626
2 Model 2    0.789     0.770  0.861

Model 2 performs better overall because it achieves higher accuracy, precision, and recall values, indicating stronger predictive performance.

Although Model 2 is more complex, the additional campaign-related variables substantially improve the model’s ability to predict deposit subscription decisions.
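
The values in the comparison table above were typed in from the earlier output. As a hedged alternative, the same layout could be built directly from yardstick results so that it updates whenever the models change. The sketch uses two small hypothetical prediction sets (`preds_1`, `preds_2`) in place of model1_class_predictions and model2_class_predictions.

```r
library(dplyr)
library(tidyr)
library(yardstick)

lv <- c("no", "yes")
truth <- factor(c("no", "no", "no", "no", "yes", "yes", "yes", "yes"), levels = lv)

# Hypothetical stand-ins for the two models' test-set predictions
preds_1 <- data.frame(
  deposit = truth,
  .pred_class = factor(c("no", "yes", "no", "yes", "yes", "no", "yes", "no"), levels = lv)
)
preds_2 <- data.frame(
  deposit = truth,
  .pred_class = factor(c("no", "no", "no", "yes", "yes", "yes", "yes", "no"), levels = lv)
)

class_metrics <- metric_set(accuracy, precision, recall)

# Stack the metric tibbles, label them by model, and spread metrics into columns
comparison_table <- bind_rows(
  "Model 1" = class_metrics(preds_1, truth = deposit, estimate = .pred_class),
  "Model 2" = class_metrics(preds_2, truth = deposit, estimate = .pred_class),
  .id = "Model"
) |>
  select(Model, .metric, .estimate) |>
  pivot_wider(names_from = .metric, values_from = .estimate)

comparison_table
```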

Cross Validation

set.seed(465)

bank_folds <- vfold_cv(bank_train, v = 5)

cv_results_class <- fit_resamples(
  model2_class,
  deposit ~ balance + housing + loan + duration + previous + poutcome,
  resamples = bank_folds,
  metrics = metric_set(accuracy, precision, recall)
)

collect_metrics(cv_results_class)
# A tibble: 3 × 6
  .metric   .estimator  mean     n std_err .config        
  <chr>     <chr>      <dbl> <int>   <dbl> <chr>          
1 accuracy  binary     0.786     5 0.00318 pre0_mod0_post0
2 precision binary     0.769     5 0.00487 pre0_mod0_post0
3 recall    binary     0.846     5 0.00463 pre0_mod0_post0

The 5-fold cross-validation results are close to the test set results. 
Model 2 achieved a test accuracy of 0.789 and a cross-validation accuracy of 0.786. 
This small difference suggests that the model generalizes well to unseen data and does not show strong evidence of overfitting.
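
One rough way to judge whether the test and cross-validation figures agree is to compare their gap with the fold-to-fold standard error reported by collect_metrics(). The sketch below uses the accuracy values from the output above. The two-standard-error band is a rule of thumb chosen here for illustration, not a formal test, and with only five folds it is approximate.

```r
# Values copied from the report's output
cv_mean  <- 0.786    # cross-validated accuracy
cv_se    <- 0.00318  # standard error across the 5 folds
test_acc <- 0.789    # test-set accuracy for Model 2

# Rough band of two standard errors around the cross-validated mean
band <- c(lower = cv_mean - 2 * cv_se, upper = cv_mean + 2 * cv_se)

band
test_acc >= band[["lower"]] && test_acc <= band[["upper"]]  # TRUE
```

The test accuracy falls inside the band of roughly 0.780 to 0.792, which is consistent with the two estimates measuring the same underlying performance.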

In addition, precision and recall were included because accuracy alone may be misleading in classification problems. 
Because the default event level is the first factor level (“no”), the relatively high recall means that the model is successful at identifying customers who did not subscribe to a term deposit.

AI Interaction Log

During Stage 2, I used ChatGPT to better understand how to implement logistic regression and cross-validation using tidymodels in R. The AI suggested functions such as logistic_reg(), vfold_cv(), and fit_resamples() for building and evaluating classification models.

I adapted the suggested code to fit my own dataset, predictor selection, and modeling objectives rather than using it directly. The AI assistance helped me better understand the role of classification metrics such as accuracy, precision, and recall, as well as the importance of evaluating model performance on unseen data.

One limitation of the AI-generated suggestions was that some code examples and metric specifications required adjustments before they could be applied correctly to my project. Therefore, all code, outputs, and interpretations were manually reviewed and validated before being included in the report.

Overall, this interaction improved my understanding of classification modeling, model stability, and the importance of critically evaluating AI-generated recommendations.

Classification Conclusion

For the classification dataset, Model 2 achieved substantially higher accuracy, precision, and recall values than Model 1.

Campaign-related variables such as duration and previous campaign outcomes provided important predictive information about deposit subscription decisions.

The cross-validation results were very similar to the test set performance, suggesting that the model generalizes well and is relatively stable.

Results

Model 2 performs better than Model 1 across all evaluation metrics. Model 1 achieved an accuracy of 0.603, precision of 0.627, and recall of 0.626. In comparison, Model 2 achieved an accuracy of 0.789, precision of 0.770, and recall of 0.861.

This indicates that adding campaign-related variables to the financial variables improved the model’s predictive performance. Because precision and recall are computed for the first factor level (“no”) by default, the high recall value of Model 2 suggests that the model is relatively successful at identifying clients who did not subscribe to a term deposit.

The 5-fold cross-validation results are also close to the test set results. The cross-validation accuracy is 0.786, precision is 0.769, and recall is 0.846. Since these values are very similar to the test set performance, Model 2 appears to generalize well and does not show strong evidence of overfitting.

Economic Interpretation

tidy(model2_class_fit)
# A tibble: 9 × 5
  term              estimate  std.error statistic  p.value
  <chr>                <dbl>      <dbl>     <dbl>    <dbl>
1 (Intercept)     -0.859     0.0985        -8.73  2.66e-18
2 balance          0.0000408 0.00000917     4.45  8.42e- 6
3 housingyes      -1.14      0.0567       -20.1   9.60e-90
4 loanyes         -0.710     0.0870        -8.16  3.39e-16
5 duration         0.00489   0.000125      39.1   0       
6 previous         0.00865   0.0156         0.556 5.78e- 1
7 poutcomeother    0.139     0.140          0.992 3.21e- 1
8 poutcomesuccess  2.24      0.147         15.3   1.04e-52
9 poutcomeunknown -0.760     0.0935        -8.13  4.36e-16

The results suggest that both financial characteristics and campaign-related factors are associated with the probability of subscribing to a term deposit.
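
Written out, the fitted model corresponds to the following log-odds equation, with coefficients taken from the table above. Here $p$ denotes the fitted probability that deposit is "yes" (glm models the second factor level as the success), and the indicator notation $\mathbb{1}[\cdot]$ is introduced for illustration.

```latex
\begin{aligned}
\log\frac{p}{1-p} ={}& -0.859 + 0.0000408\,\text{balance}
  - 1.14\,\mathbb{1}[\text{housing}=\text{yes}]
  - 0.710\,\mathbb{1}[\text{loan}=\text{yes}] \\
& + 0.00489\,\text{duration} + 0.00865\,\text{previous}
  + 0.139\,\mathbb{1}[\text{poutcome}=\text{other}] \\
& + 2.24\,\mathbb{1}[\text{poutcome}=\text{success}]
  - 0.760\,\mathbb{1}[\text{poutcome}=\text{unknown}]
\end{aligned}
```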

The coefficient for balance is positive, indicating that clients with higher account balances are more likely to subscribe. This is economically reasonable because clients with greater financial resources may be more willing to allocate funds to savings products.

The coefficient for housing loan status is negative, suggesting that clients with housing loans are less likely to subscribe. Similarly, clients with personal loans are also less likely to subscribe, which may reflect tighter financial constraints.

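Call duration has a strong positive coefficient and is one of the most important predictors in the model. Longer conversations may indicate greater customer interest and engagement, increasing the likelihood of subscription. However, duration is known only once a call has ended, so it helps explain outcomes but cannot be used to decide in advance which clients to call.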

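Previous campaign success is also strongly positive. Relative to the reference category (a failed previous campaign), clients for whom the previous campaign succeeded are significantly more likely to subscribe again, suggesting that past customer behavior contains valuable information for future marketing efforts. By contrast, the number of previous contacts (previous) is not statistically significant once poutcome is included (p ≈ 0.58).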
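
Logit coefficients are often easier to read as odds ratios, obtained by exponentiating them. The sketch below does this for a few coefficients copied from the table above. The rescalings (+1,000 in balance, +60 in duration) are choices made here for illustration; the 60-unit step corresponds to one minute only under the assumption that duration is recorded in seconds, as in the UCI documentation, and the currency unit of balance is not stated in the report.

```r
# Coefficients copied from tidy(model2_class_fit)
coefs <- c(
  balance         = 0.0000408,
  housingyes      = -1.14,
  loanyes         = -0.710,
  duration        = 0.00489,
  poutcomesuccess = 2.24
)

# Odds ratio for a one-unit change in each predictor
odds_ratios <- exp(coefs)

# Rescaled effects: +1,000 in balance and +60 in duration
or_balance_1000 <- exp(1000 * coefs[["balance"]])
or_duration_60  <- exp(60 * coefs[["duration"]])

round(odds_ratios[c("housingyes", "loanyes", "poutcomesuccess")], 2)
round(c(balance_1000 = or_balance_1000, duration_60 = or_duration_60), 2)
```

Under these estimates, having a housing loan multiplies the odds of subscribing by about 0.32, a previous campaign success (relative to a previous failure) by about 9.4, and an extra 60 units of call duration by about 1.34, holding the other predictors fixed.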

From a business perspective, these findings suggest that banks could improve marketing efficiency by targeting financially stronger customers and clients with a history of successful campaign responses. Future research could investigate additional demographic or behavioral variables to further improve predictive performance.

Limitations & Reproducibility

Limitations

This analysis has several limitations. First, the dataset does not include all factors that may influence a client’s decision to subscribe to a term deposit. Variables such as income, wealth, risk preferences, and broader economic conditions are not available.

Second, the analysis relies on a single dataset collected from one banking context. As a result, the findings may not generalize perfectly to other countries, time periods, or financial institutions.

Reproducibility

Several steps were taken to ensure reproducibility. All analyses were conducted using R and Quarto, and all code is included in this document. Relative file paths were used for data import, allowing the project to be reproduced on different computers without changing the code.

In addition, set.seed(465) was used before data splitting and cross-validation procedures to ensure consistent results across runs. The analysis also relied on documented R packages, including tidyverse and tidymodels.
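
A small sketch of why the seed matters: resetting it before a random draw reproduces the draw exactly. The sampled row indices here are illustrative only; the report's own split is produced by initial_split().

```r
n_rows <- 11162

# Same seed, same draw
set.seed(465)
draw_a <- sample(n_rows, size = 5)

set.seed(465)
draw_b <- sample(n_rows, size = 5)

identical(draw_a, draw_b)  # TRUE
```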

AI Use Log

During this project, ChatGPT was used to assist with understanding R code, model evaluation, cross-validation procedures, and the interpretation of predictive modeling results.

Example Interaction

Prompt given to ChatGPT:

“How can I perform 5-fold cross-validation for my logistic regression model using tidymodels in R and evaluate the model using accuracy, precision, and recall?”

How the output was used:

The suggested workflow was used as a starting point for implementing cross-validation in the classification analysis. The recommendations helped identify the appropriate functions, including vfold_cv(), fit_resamples(), and metric_set().

How the output was verified and modified:

The AI-generated code was not used directly. The code was adapted to fit the specific variables and model structure used in this project. Additional modifications were made to include precision and recall metrics after reviewing instructor feedback. All outputs were checked manually and compared with test-set results to ensure consistency and accuracy.

Reflection

AI tools were useful for improving understanding of predictive modeling techniques and debugging code. However, AI-generated suggestions occasionally required modification before they could be applied correctly. Therefore, all code, results, and interpretations were independently reviewed before inclusion in the final report.

Final Reflection

Improvement with More Time or Better Data

If more time and data were available, I would explore additional machine learning methods such as random forests or gradient boosting models. These approaches may capture more complex relationships between client characteristics and subscription decisions and potentially improve predictive performance.

Future Research Question

This analysis raises an interesting economic question for future research: Which customer characteristics are most important for long-term customer loyalty and repeated purchases of financial products? Investigating this question could help financial institutions improve customer retention and marketing strategies.

Conclusion

This project examined whether client financial and campaign-related characteristics can predict term deposit subscription decisions.

The results showed that Model 2 substantially outperformed Model 1, achieving higher accuracy, precision, and recall. Variables such as account balance, loan status, call duration, and previous campaign outcomes were important predictors of subscription behavior.

The cross-validation results were very similar to the test-set results, suggesting that the model generalizes well to unseen data and does not exhibit severe overfitting.

Overall, the findings demonstrate that predictive modeling can provide valuable insights into customer behavior and support more effective marketing and decision-making in the banking sector.