2026-02-15

Introduction: Optimizing Financial Outreach

The banking sector operates in a highly competitive environment where the efficiency of direct marketing campaigns can significantly impact operational costs and customer retention.

This project analyzes a comprehensive dataset from a Portuguese banking institution, consisting of 11,162 observations. Each observation represents a phone-based marketing contact intended to sell a term deposit.

In the current landscape, “cold-calling” often results in high rejection rates, leading to wasted resources.

The goal of this work is to move from a “brute-force” marketing approach to a data-driven strategy.

By analyzing historical client responses, we can build a profile of the ideal subscriber, allowing the institution to target individuals with the highest probability of conversion.

Research Question

The central objective of this research is to identify the socio-economic and behavioral indicators that most significantly predict a client’s decision to subscribe to a term deposit.

To guide this analysis, I have defined the following primary research questions:

  1. Demographic Influence: To what extent do a client’s professional status and education level influence their propensity to subscribe?

  2. Financial Health: Does existing debt—specifically housing and personal loans—act as a statistically significant deterrent to new term deposit commitments?

  3. Predictive Accuracy: Can a logistic regression model effectively categorize potential subscribers with a high enough accuracy to justify a specialized marketing shift?

Data Verification and Programmatic Cleaning

Before proceeding with the analysis, it is critical to ensure the integrity of the Bank Marketing dataset. As per the project requirements, I will perform a programmatic audit of the data to identify missing values, out-of-range entries, and hidden placeholders without relying on visual inspection of the raw file.

1. Structure and Data Type Consistency

First, I verify that the 11,162 observations have been read correctly and that the variables are assigned the appropriate data types. For instance, balance and age must be numeric to perform statistical operations, while our target variable deposit should be treated as a factor.

# Check internal structure
bank_data <- read.csv("bank.csv")
str(bank_data)
## 'data.frame':    11162 obs. of  17 variables:
##  $ age      : int  59 56 41 55 54 42 56 60 37 28 ...
##  $ job      : chr  "admin." "admin." "technician" "services" ...
##  $ marital  : chr  "married" "married" "married" "married" ...
##  $ education: chr  "secondary" "secondary" "secondary" "secondary" ...
##  $ default  : chr  "no" "no" "no" "no" ...
##  $ balance  : int  2343 45 1270 2476 184 0 830 545 1 5090 ...
##  $ housing  : chr  "yes" "no" "yes" "yes" ...
##  $ loan     : chr  "no" "no" "no" "no" ...
##  $ contact  : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ day      : int  5 5 5 5 5 5 6 6 6 6 ...
##  $ month    : chr  "may" "may" "may" "may" ...
##  $ duration : int  1042 1467 1389 579 673 562 1201 1030 608 1297 ...
##  $ campaign : int  1 1 1 1 2 2 1 1 1 3 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ deposit  : chr  "yes" "yes" "yes" "yes" ...
knitr::kable(head(bank_data))
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome deposit
59 admin. married secondary no 2343 yes no unknown 5 may 1042 1 -1 0 unknown yes
56 admin. married secondary no 45 no no unknown 5 may 1467 1 -1 0 unknown yes
41 technician married secondary no 1270 yes no unknown 5 may 1389 1 -1 0 unknown yes
55 services married secondary no 2476 yes no unknown 5 may 579 1 -1 0 unknown yes
54 admin. married tertiary no 184 no no unknown 5 may 673 2 -1 0 unknown yes
42 management single tertiary no 0 yes yes unknown 5 may 562 2 -1 0 unknown yes
# Confirm that no 'Age' values are logically impossible (e.g., < 18 or > 96)
age_range <- range(bank_data$age)
print(paste("Age range detected:", age_range[1], "to", age_range[2]))
## [1] "Age range detected: 18 to 95"
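
The str() output shows that the categorical columns, including the target deposit, were read as character vectors rather than factors. The following is a minimal sketch of one way to convert them; the helper name as_factor_columns and the toy data frame are illustrative assumptions, not part of the original analysis.

```r
library(dplyr)

# Hypothetical helper: convert every character column to a factor,
# leaving numeric columns such as age and balance untouched.
as_factor_columns <- function(data) {
  data %>% mutate(across(where(is.character), as.factor))
}

# Toy rows for illustration only (not taken from bank.csv)
toy <- data.frame(
  age = c(59L, 41L),
  job = c("admin.", "technician"),
  deposit = c("yes", "no"),
  stringsAsFactors = FALSE
)

str(as_factor_columns(toy))
```

On the real data this would be called as as_factor_columns(bank_data). Note that the placeholder check in the next section selects columns with is.character, so a conversion like this would need to run after that check, or the check would need to select factor columns instead.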

2. Identifying Missing Values and “Unknown” Placeholders

A standard check for NA values is performed; however, this specific dataset is known to use the string “unknown” for missing categorical data. I will calculate the frequency of these placeholders to determine if any column is too sparse to be useful.

library(tidyverse)
# Standard NA check
na_count <- colSums(is.na(bank_data))

# Placeholder check for categorical 'unknowns'
# We focus on categorical columns to see where data is missing
unknown_summary <- bank_data %>%
  summarise(across(where(is.character), ~ sum(. == "unknown"))) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Unknown_Count")

list(Total_NAs = na_count, Placeholder_Summary = unknown_summary)
## $Total_NAs
##       age       job   marital education   default   balance   housing      loan 
##         0         0         0         0         0         0         0         0 
##   contact       day     month  duration  campaign     pdays  previous  poutcome 
##         0         0         0         0         0         0         0         0 
##   deposit 
##         0 
## 
## $Placeholder_Summary
## # A tibble: 10 × 2
##    Variable  Unknown_Count
##    <chr>             <int>
##  1 job                  70
##  2 marital               0
##  3 education           497
##  4 default               0
##  5 housing               0
##  6 loan                  0
##  7 contact            2346
##  8 month                 0
##  9 poutcome           8326
## 10 deposit               0

3. Cleaning Decisions

The verification reveals that while there are zero explicit NA values, the poutcome (previous outcome) variable contains over 8,000 “unknown” entries.

Because this accounts for over 70% of the data, I will exclude poutcome from my final regression model to avoid introducing significant bias.

However, variables like job and education have very few unknowns, and those rows will be retained to preserve the sample size.
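
To make the sparsity argument concrete, the sketch below turns the unknown counts from the placeholder summary into percentages of the 11,162 rows. The counts are copied from the output above; the variable names are my own.

```r
# Unknown counts copied from the placeholder summary printed above
n_obs <- 11162
unknown_counts <- c(job = 70, education = 497, contact = 2346, poutcome = 8326)

# Share of rows affected, as a percentage
unknown_pct <- round(100 * unknown_counts / n_obs, 1)
unknown_pct
```

This puts poutcome at roughly three quarters of all rows, while job and education are each below 5%. The contact variable falls in between at about one fifth of rows, so whether to keep it is a separate judgment call.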

In the Bank Marketing dataset, the job variable is one of the most diverse. By analyzing the subscription rate across different professions, we can start to answer our research question about which demographics are most likely to convert.

1. Preparing the Data for Visualization

Before visualizing the relationship between employment and subscription success, I must transform the raw counts into proportions. Since the number of ‘blue-collar’ workers contacted is significantly higher than ‘students,’ comparing raw totals would be misleading. I will calculate the percentage of ‘yes’ responses within each job category to ensure an apples-to-apples comparison.
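
The code chunk for this step is not shown in the rendered output, so the following is a sketch of how such a summary might be computed with dplyr. The column names total_contacts and success_rate follow the printed table; the helper name job_success_rates and the toy rows are assumptions for illustration.

```r
library(dplyr)

# Hypothetical helper: contacts and percentage of "yes" deposits per job
job_success_rates <- function(data) {
  data %>%
    group_by(job) %>%
    summarise(
      total_contacts = n(),
      success_rate = 100 * mean(deposit == "yes"),
      .groups = "drop"
    ) %>%
    arrange(desc(success_rate))
}

# Toy rows for illustration only (not taken from bank.csv)
toy <- data.frame(
  job = c("student", "student", "services", "services"),
  deposit = c("yes", "yes", "yes", "no")
)
job_success_rates(toy)
```

Calling job_success_rates(bank_data) on the full dataset should give a table of the same form as the one that follows.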

## # A tibble: 12 × 3
##    job           total_contacts success_rate
##    <chr>                  <int>        <dbl>
##  1 student                  360         74.7
##  2 retired                  778         66.3
##  3 unemployed               357         56.6
##  4 management              2566         50.7
##  5 unknown                   70         48.6
##  6 admin.                  1334         47.3
##  7 self-employed            405         46.2
##  8 technician              1823         46.1
##  9 services                 923         40.0
## 10 housemaid                274         39.8
## 11 entrepreneur             328         37.5
## 12 blue-collar             1944         36.4

2. Creating the Research Graphic

The following visualization ranks job categories by their conversion success. By using a reordered bar chart, we can immediately identify which professional groups are outliers. This helps us move beyond simple data summaries toward actionable business intelligence.
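
The plot itself is not reproduced in this rendering. As a sketch of the reordered bar chart described here, the code below rebuilds the chart from the success rates in the summary table; the axis labels and the object name job_plot are my own choices, and the original styling may differ.

```r
library(ggplot2)

# Success rates copied from the summary table printed above
job_summary <- data.frame(
  job = c("student", "retired", "unemployed", "management", "unknown",
          "admin.", "self-employed", "technician", "services",
          "housemaid", "entrepreneur", "blue-collar"),
  success_rate = c(74.7, 66.3, 56.6, 50.7, 48.6, 47.3,
                   46.2, 46.1, 40.0, 39.8, 37.5, 36.4)
)

# reorder() sorts the bars by success rate; coord_flip() keeps labels readable
job_plot <- ggplot(job_summary,
                   aes(x = reorder(job, success_rate), y = success_rate)) +
  geom_col() +
  coord_flip() +
  labs(x = "Job category", y = "Subscription rate (%)")

job_plot
```

Ordering by the plotted value rather than alphabetically is what makes the highest and lowest converting groups stand out at the two ends of the chart.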

3. Interpreting the Results

The visualization reveals a striking trend: ‘blue-collar’ workers were among the most frequently contacted groups (1,944 contacts), yet their success rate is the lowest at 36.4%, and ‘entrepreneur,’ ‘housemaid,’ and ‘services’ contacts also convert at 40% or less. In contrast, ‘students’ and ‘retired’ individuals show the highest conversion rates, at 74.7% and 66.3% respectively.

This suggests that the bank’s current marketing strategy may be over-targeting low-yield demographics while under-utilizing high-yield groups.

This finding warrants a deeper statistical investigation into whether age or account balance is the true underlying driver behind these high-performing categories.

1. Statistical Hypothesis Testing (Chi-Square)

While the bar chart suggests that certain professions like ‘retired’ individuals and ‘students’ are more likely to subscribe, I must determine if this association is statistically significant.

I will perform a Chi-Square Test of Independence (\(\chi^2\)) to test the null hypothesis (\(H_0\)) that a client’s job type and their decision to subscribe are independent of one another.

A p-value of less than \(0.05\) will allow us to reject the null hypothesis in favor of the alternative (\(H_a\)), indicating that job type and subscription behavior are associated.

# Create a contingency table (cross-tabulation)
job_tab <- table(bank_data$job, bank_data$deposit)

# Perform the Chi-Square Test
chi_test <- chisq.test(job_tab)

# View the results
chi_test
## 
##  Pearson's Chi-squared test
## 
## data:  job_tab
## X-squared = 378.08, df = 11, p-value < 2.2e-16

The Chi-Square test yielded a p-value below \(2.2 \times 10^{-16}\), providing overwhelming evidence to reject the null hypothesis.

This confirms that the observed variations in subscription rates across job categories are statistically significant and not the result of random sampling error. However, a Chi-Square test only tells us that a relationship exists; it does not tell us which specific jobs are the most influential when other factors, like account balance, are considered.
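
With more than 11,000 observations, even a weak association can produce a tiny p-value, so it can help to look at an effect size alongside the test. The sketch below computes Cramér's V from the statistic reported above. This measure is not part of the original analysis and is added here only as an illustration.

```r
# Values copied from the chi-square output above
chi_sq <- 378.08
n_obs  <- 11162
n_rows <- 12   # job categories
n_cols <- 2    # deposit: yes / no

# Cramér's V: effect size for a contingency table, between 0 and 1
cramers_v <- sqrt(chi_sq / (n_obs * (min(n_rows, n_cols) - 1)))
round(cramers_v, 3)
```

Values near 0 indicate a weak association and values near 1 a strong one, which gives a sense of how much job type alone can explain.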

2. Moving Toward Multivariate Analysis

To add depth to this project, I will now transition from a bivariate analysis (Job vs. Subscription) to a Logistic Regression model.

This allows us to see the ‘unique’ effect of each job while controlling for other variables like ‘age’ and ‘balance.’ This is crucial because, for example, ‘retired’ individuals might only have high subscription rates because they are older or have higher balances, not necessarily because of their previous employment status.
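
The fitting code is not shown in the rendered output. The sketch below follows the formula and family reported in the model call (deposit_num ~ job + age + balance, binomial); the 0/1 encoding of deposit_num, the helper name fit_deposit_model, and the simulated rows are assumptions.

```r
# Hypothetical helper reproducing the model call shown in the output
fit_deposit_model <- function(data) {
  # Encode the target as 0/1; the name deposit_num follows the model call
  data$deposit_num <- ifelse(data$deposit == "yes", 1, 0)
  glm(deposit_num ~ job + age + balance, family = "binomial", data = data)
}

# Simulated rows for illustration only (not taken from bank.csv)
set.seed(1)
toy <- data.frame(
  job = sample(c("admin.", "retired", "student"), 200, replace = TRUE),
  age = sample(18:95, 200, replace = TRUE),
  balance = round(rnorm(200, mean = 1500, sd = 1000)),
  deposit = sample(c("yes", "no"), 200, replace = TRUE)
)
toy_model <- fit_deposit_model(toy)
summary(toy_model)
```

On the real data the call would be summary(fit_deposit_model(bank_data)). Because job is categorical, R uses its first level as the baseline, which is why ‘admin.’ has no row of its own in the coefficient table that follows.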

## 
## Call:
## glm(formula = deposit_num ~ job + age + balance, family = "binomial", 
##     data = bank_data)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -9.551e-02  9.721e-02  -0.983 0.325839    
## jobblue-collar   -4.510e-01  7.244e-02  -6.225 4.82e-10 ***
## jobentrepreneur  -4.194e-01  1.274e-01  -3.293 0.000992 ***
## jobhousemaid     -3.019e-01  1.363e-01  -2.215 0.026783 *  
## jobmanagement     1.075e-01  6.786e-02   1.584 0.113152    
## jobretired        7.836e-01  1.078e-01   7.269 3.61e-13 ***
## jobself-employed -7.655e-02  1.143e-01  -0.670 0.502911    
## jobservices      -2.962e-01  8.690e-02  -3.408 0.000654 ***
## jobstudent        1.155e+00  1.360e-01   8.491  < 2e-16 ***
## jobtechnician    -6.889e-02  7.243e-02  -0.951 0.341533    
## jobunemployed     3.699e-01  1.203e-01   3.076 0.002097 ** 
## jobunknown        2.607e-02  2.467e-01   0.106 0.915846    
## age              -1.878e-03  2.044e-03  -0.919 0.358142    
## balance           5.206e-05  7.255e-06   7.175 7.22e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 15443  on 11161  degrees of freedom
## Residual deviance: 14999  on 11148  degrees of freedom
## AIC: 15027
## 
## Number of Fisher Scoring iterations: 4
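
The estimates above are on the log-odds scale, which is hard to read directly. As a sketch, exponentiating a few of them gives odds ratios; the job effects are relative to the baseline category ‘admin.’ (the one job level without its own row), and the balance coefficient is rescaled to a change of 1,000 units for readability, which is my own choice.

```r
# Selected estimates copied from the coefficient table above (log-odds scale)
log_odds <- c(
  `jobblue-collar` = -0.4510,
  jobretired       =  0.7836,
  jobstudent       =  1.155,
  balance_per_1000 =  5.206e-05 * 1000
)

# exp() converts log-odds coefficients to odds ratios
odds_ratios <- round(exp(log_odds), 2)
odds_ratios
```

An odds ratio above 1 means higher odds of subscribing than the baseline with age and balance held fixed, and a value below 1 means lower odds.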

1. Model Evaluation: The Confusion Matrix

While the logistic regression coefficients tell us which variables are significant, they do not tell us how well the model predicts actual behavior.

To evaluate the model’s predictive power, I generated a Confusion Matrix. This allows us to see not just the overall accuracy, but specifically where the model fails—such as ‘False Positives’ (predicting a subscription when none occurred) and ‘False Negatives’ (missing a potential subscriber).
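
The chunk that builds the matrix is not shown, so here is a sketch of one common way to do it: convert predicted probabilities to labels at a cutoff and cross-tabulate them against the actual outcomes. The 0.5 cutoff and the helper name confusion_matrix are assumptions; the row and column labels mirror the printed output.

```r
# Hypothetical helper: cross-tabulate predicted labels against actual 0/1 outcomes.
# The 0.5 cutoff is an assumption; the original threshold is not stated.
confusion_matrix <- function(prob, actual, threshold = 0.5) {
  predicted <- factor(ifelse(prob > threshold, "yes", "no"),
                      levels = c("no", "yes"))
  table(Predicted = predicted, Actual = factor(actual, levels = c(0, 1)))
}

# Toy probabilities and outcomes for illustration only
prob   <- c(0.9, 0.2, 0.6, 0.4, 0.7)
actual <- c(1,   0,   0,   1,   1)

cm <- confusion_matrix(prob, actual)
cm

# Overall accuracy: correct predictions on the diagonal over all cases
accuracy <- sum(diag(cm)) / sum(cm)
accuracy
```

For a fitted glm, the probabilities would come from predict(model, type = "response") and the actual values from the 0/1 target column.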

##          Actual
## Predicted    0    1
##       no  4811 3505
##       yes 1062 1784
## [1] "Overall Accuracy: 0.5908"

2. Interpreting the Performance Metrics

The model achieved an overall accuracy of approximately 59%, only a modest improvement over the roughly 53% that would result from always predicting ‘no’. The classes here are fairly balanced (about 47% of clients subscribed), so the weakness is not class imbalance but the limited predictive power of the chosen variables: the model correctly rejects about 82% of non-subscribers (specificity) but captures only about 34% of actual subscribers (sensitivity).
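
The performance figures can be derived directly from the four cells of the confusion matrix. The sketch below uses the counts printed above; the variable names and the majority-class baseline are my own additions for comparison.

```r
# Counts copied from the confusion matrix above
tn <- 4811  # predicted no,  actual 0
fn <- 3505  # predicted no,  actual 1
fp <- 1062  # predicted yes, actual 0
tp <- 1784  # predicted yes, actual 1
n  <- tn + fn + fp + tp

metrics <- c(
  accuracy    = (tp + tn) / n,
  sensitivity = tp / (tp + fn),            # share of actual subscribers caught
  specificity = tn / (tn + fp),            # share of non-subscribers correctly rejected
  baseline    = max(tp + fn, tn + fp) / n  # always predicting the larger class
)
round(metrics, 3)
```

Comparing accuracy with the baseline row shows how much the model adds over a rule that ignores the predictors entirely.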

In a real-world banking scenario, this suggests that while our demographic variables are significant, we may need additional features, such as call duration (recorded in this dataset as duration) or economic indicators, to truly sharpen our predictive accuracy for successful conversions.

3. Conclusion

Through programmatic data verification and statistical modeling of 11,162 observations, this research identified job category and account balance as significant predictors of term deposit subscriptions.

Specifically, the Chi-Square test confirmed that employment status is not independent of conversion rates (\(p < 0.001\)), with students and retirees outperforming blue-collar workers by roughly 2 to 1.

Furthermore, the logistic regression model showed that a higher account balance significantly increases the odds of a ‘yes’ response, while age has no significant effect once job and balance are controlled for. Housing and personal loans were not included in the fitted model, so the research question about existing debt remains open.

To maximize ROI, the bank should shift its resource allocation: instead of high-volume calling to blue-collar demographics where conversion is low (\(<40\%\)), the marketing strategy should focus on retired individuals and students, for whom the model indicates significantly higher odds of success.

Actionable Insights

Based on these results, the institution should pivot from its current high-volume “broadcast” marketing approach to a segmented strategy.

I recommend prioritizing the retired and student cohorts. By reallocating resources from low-yield blue-collar contacts toward these high-conversion segments, the bank can reasonably expect to increase its term deposit acquisition rate while reducing the total number of required calls.

Reference

1. The Dataset Citation

The Bank Marketing dataset is a seminal dataset in the machine learning community, originally contributed by Moro et al. (2014).

Formal Citation:

Moro, S., Cortez, P., & Rita, P. (2014). A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62, 22-31. DOI: 10.1016/j.dss.2014.03.001.

Source Link:

UCI Machine Learning Repository: Bank Marketing Dataset

2. Software & Technical Citations

Because this analysis relies on R and the tidyverse for data manipulation and visualization, the software is cited below. The exact citation for any R package can be obtained by typing citation("packagename") into the R console.

R Core Team:

R Core Team (2023). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL: https://www.R-project.org/.

The Tidyverse (Data Manipulation & Plots):

Wickham, H., et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. DOI: 10.21105/joss.01686.

RMarkdown (Reporting):