Problem Definition:

Financial institutions often face challenges in identifying which clients are most likely to accept a term deposit offer. Inefficient targeting leads to wasted marketing resources and lower campaign success rates. This project aims to use machine learning methods to predict customer subscription behavior, helping the bank focus on the most promising prospects.

Goals:

Explore demographic, economic, and campaign factors influencing term deposit subscriptions.

Compare two classification methods (Logistic Regression and Naive Bayes).

Evaluate model performance and identify key predictors for actionable business insights.

Introduction

This project uses the UCI Bank Marketing Dataset (Moro, Rita, & Cortez, 2014), which contains data on 41,188 marketing contacts made by a Portuguese banking institution. The objective is to predict whether a customer will subscribe to a term deposit based on demographic, social, and economic information, along with details from previous marketing campaigns.

The dataset includes features such as:

Client attributes: age, job, marital status, education, housing, and personal loans.

Campaign-related variables: communication type, duration, number of contacts, and previous outcomes.

Economic indicators: employment variation rate, consumer price index, and Euribor 3-month rate.

The target variable is binary (y = yes or no), making this a classification problem. Two models were applied and compared: Logistic Regression and Naive Bayes.

Both models were trained on 70% of the data and tested on 30%, using metrics such as accuracy, sensitivity, specificity, and ROC curve for evaluation.

Citation: Moro, S., Rita, P., & Cortez, P. (2014). Bank Marketing [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5K306.

Methods used to analyze data

Two supervised learning algorithms were applied and compared:

1. Logistic Regression

Logistic Regression predicts the probability that a client will subscribe to a term deposit by using a sigmoid function, which converts linear outputs into values between 0 and 1.

It assumes a linear relationship between the predictor variables and the log-odds of the target outcome.

This method is easy to interpret, is well suited to binary outcomes such as “yes” or “no,” and provides probability estimates that help assess the confidence of each prediction.
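The sigmoid mapping described above can be sketched in a few lines of base R. This is only an illustration: the helper names `sigmoid` and `log_odds` and the example scores are assumptions, not part of the original analysis.

```r
# Sigmoid (logistic) function: maps any real-valued linear score into (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

# Log-odds (logit): the inverse of the sigmoid
log_odds <- function(p) log(p / (1 - p))

z <- c(-4, -1, 0, 1, 4)   # hypothetical linear predictor values
p <- sigmoid(z)           # corresponding probabilities
round(p, 3)

# Round trip: applying the logit to the probabilities recovers the scores
all.equal(log_odds(p), z)
```

A score of 0 maps to a probability of exactly 0.5, which is why a 0.5 probability cutoff corresponds to a log-odds cutoff of 0.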

2. Naive Bayes Classifier

The Naive Bayes algorithm is based on Bayes’ theorem, which calculates the probability of a class given the input features.

It assumes that all predictors are independent of one another, an assumption that keeps the model simple and computationally efficient.

This method is fast, works well with high-dimensional categorical data, and provides probability-based classifications, making it easy to interpret and apply in real-world decision-making.
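As a rough illustration of how Bayes' theorem and the independence assumption combine, the sketch below computes a posterior by hand. The priors approximate the class balance of this dataset (about 88.7% “no” and 11.3% “yes”); the conditional probabilities are invented for illustration and are not estimates from the data.

```r
# Class priors, approximating the class balance of this dataset
prior <- c(no = 0.887, yes = 0.113)

# Hypothetical conditional probabilities P(feature value | class),
# invented for illustration only
p_cellular <- c(no = 0.60, yes = 0.85)   # contact type is "cellular"
p_retired  <- c(no = 0.03, yes = 0.10)   # job is "retired"

# Naive independence assumption: multiply the per-feature likelihoods
unnormalized <- prior * p_cellular * p_retired

# Bayes' theorem: normalize so the posterior probabilities sum to 1
posterior <- unnormalized / sum(unnormalized)
round(posterior, 3)
```

Under these assumed numbers, the posterior probability of “yes” rises well above the prior, because both feature values are more common among subscribers.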

Data Preparation:

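library(tidyverse)  # dplyr and ggplot2 for data handling and plots
library(caret)      # dummyVars(), confusionMatrix()
library(e1071)      # naiveBayes()
library(class)      # loaded in the original script; not used below
library(caTools)    # sample.split()
library(ROCR)       # prediction(), performance()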

Load data

data <- read.csv("bank.csv", sep = ";")

# Check dimensions and structure
dim(data)
## [1] 41188    21
str(data)
## 'data.frame':    41188 obs. of  21 variables:
##  $ age           : int  56 57 37 40 56 45 59 41 24 25 ...
##  $ job           : chr  "housemaid" "services" "services" "admin." ...
##  $ marital       : chr  "married" "married" "married" "married" ...
##  $ education     : chr  "basic.4y" "high.school" "high.school" "basic.6y" ...
##  $ default       : chr  "no" "unknown" "no" "no" ...
##  $ housing       : chr  "no" "no" "yes" "no" ...
##  $ loan          : chr  "no" "no" "no" "no" ...
##  $ contact       : chr  "telephone" "telephone" "telephone" "telephone" ...
##  $ month         : chr  "may" "may" "may" "may" ...
##  $ day_of_week   : chr  "mon" "mon" "mon" "mon" ...
##  $ duration      : int  261 149 226 151 307 198 139 217 380 50 ...
##  $ campaign      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays         : int  999 999 999 999 999 999 999 999 999 999 ...
##  $ previous      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome      : chr  "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
##  $ emp.var.rate  : num  1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
##  $ cons.price.idx: num  94 94 94 94 94 ...
##  $ cons.conf.idx : num  -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
##  $ euribor3m     : num  4.86 4.86 4.86 4.86 4.86 ...
##  $ nr.employed   : num  5191 5191 5191 5191 5191 ...
##  $ y             : chr  "no" "no" "no" "no" ...

This section imports the libraries needed for data processing, visualization, and modeling. The dataset contains 41,188 rows and 21 variables describing customers, their financial situation, and details of marketing calls. The target variable y indicates whether the client subscribed to a term deposit.

# Class counts and proportions
table(data$y)
## 
##    no   yes 
## 36548  4640
prop.table(table(data$y))
## 
##        no       yes 
## 0.8873458 0.1126542

About 88.7% of the clients said “no” and 11.3% said “yes.” This means the dataset is imbalanced, so models could easily achieve high accuracy by predicting “no.” That’s why we’ll also examine recall, precision, and AUC, not just accuracy.
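To make the point about accuracy concrete, the sketch below computes the accuracy of a trivial classifier that always predicts “no,” using the class counts reported above. The variable names are illustrative.

```r
# Class counts reported for this dataset
counts <- c(no = 36548, yes = 4640)

# Accuracy of a trivial classifier that always predicts the majority class
baseline_accuracy <- max(counts) / sum(counts)
round(baseline_accuracy, 4)

# Recall on the "yes" class for that same classifier: it never predicts "yes"
baseline_recall_yes <- 0 / counts[["yes"]]
baseline_recall_yes
```

Any model has to beat this baseline of roughly 88.7% accuracy to add value on accuracy alone, while the baseline finds none of the actual subscribers.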

Create dummy variables for categorical features

data$y <- factor(data$y, levels = c("no", "yes"))
dummies <- dummyVars(y ~ ., data = data)
data_transformed <- data.frame(predict(dummies, newdata = data))
data_transformed$y <- data$y

Split into training and testing sets

set.seed(123)
split <- sample.split(data_transformed$y, SplitRatio = 0.7)
train <- subset(data_transformed, split == TRUE)
test  <- subset(data_transformed, split == FALSE)

Categorical variables such as job, education, and month must be converted to numeric form so models can interpret them. We use dummy (one-hot) encoding to create binary variables for each category. The dataset is then split into training (70%) and testing (30%) subsets to ensure fair model evaluation.
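The encoding step can be illustrated on a toy data frame. This sketch uses base R's `model.matrix()` rather than caret's `dummyVars()` so that it has no package dependencies; the toy values are taken from the job column shown earlier, and the object names are assumptions.

```r
# Toy data frame with one categorical column
toy <- data.frame(job = factor(c("housemaid", "services", "services", "admin.")))

# "- 1" drops the intercept so every level gets its own 0/1 column
one_hot <- model.matrix(~ job - 1, data = toy)
one_hot

# Each row contains exactly one 1, marking that row's category
rowSums(one_hot)
```

Because each row's indicator columns always sum to 1, one column per variable is redundant in a model with an intercept. Some workflows drop one level per variable for that reason.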

Exploratory Data Analysis

# Class balance
prop.table(table(data$y))
## 
##        no       yes 
## 0.8873458 0.1126542

Result: Roughly 9 in 10 clients did not subscribe (88.7% “no” versus 11.3% “yes”). Because of this imbalance, accuracy alone may not fully reflect model performance; sensitivity and specificity must also be considered when interpreting the metrics later.

Subscription Rate by Job

# Visualize by job
ggplot(data, aes(x = job, fill = y)) +
  geom_bar(position = "fill") +
  labs(title = "Subscription Rate by Job", y = "Proportion", 
       fill = "Subscription Outcome") +
  coord_flip()

Students, retired clients, and unemployed individuals appear slightly more likely to subscribe, possibly because they have more time to engage with marketing calls or are focused on long-term savings.

Blue-collar and services workers have the lowest proportions of “yes” responses, suggesting they may be less responsive to term deposit offers.

Understanding job categories helps banks tailor marketing strategies to customer occupations.

Subscription Rate by Education

# Visualize by education
ggplot(data, aes(x = education, fill = y)) +
  geom_bar(position = "fill") +
  labs(title = "Subscription Rate by Education Level", y = "Proportion", 
       fill = "Subscription Outcome") +
  coord_flip()

ggplot(data, aes(x = education)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Count of Clients by Education Level", x = "Education Level", y = "Number of Clients") +
  theme_minimal() +
  coord_flip()

Although the “illiterate” category appears to have the highest subscription rate, that proportion isn’t statistically reliable due to the very small group size.

When focusing on larger, more representative groups, clients with university degrees and professional courses show higher subscription rates than those with only basic education. This pattern suggests that higher education, which often correlates with financial literacy and stable income, is associated with a greater likelihood of subscribing to term deposits.

Subscription Rate by Age Group

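# Age grouping (right = FALSE makes each interval include its lower bound,
# so "Under 30" covers ages below 30 and "60+" covers ages 60 and above)
data$age_group <- cut(data$age,
                      breaks = c(0, 30, 40, 50, 60, Inf),
                      labels = c("Under 30", "30-40", "40-50", "50-60", "60+"),
                      right = FALSE)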

ggplot(data, aes(x = age_group, fill = y)) +
  geom_bar(position = "fill") +
  labs(title = "Subscription Rate by Age Group", y = "Proportion", 
       fill = "Subscription Outcome")

The analysis shows that clients aged 60 and above have the highest subscription rates, indicating that older individuals are more likely to invest in term deposits, possibly due to greater financial stability and a focus on long-term savings.

In contrast, the 40–50 age group shows the lowest subscription rate, which may reflect this group’s competing financial priorities, such as mortgages or family expenses. Younger clients (under 30) demonstrate moderate interest but remain less likely to subscribe compared to older groups.

These results suggest that age is associated with subscription behavior in this sample, with older clients being the most receptive to term deposit offers.

Model 1: Logistic Regression

model_logit <- glm(y ~ ., data = train, family = "binomial",
                   control = glm.control(maxit = 50))

# Predict
prob_logit <- predict(model_logit, newdata = test, type = "response")
# Fix the factor levels so they match test$y even if one class is never predicted
pred_logit <- factor(ifelse(prob_logit > 0.5, "yes", "no"), levels = c("no", "yes"))

# Evaluate
confusionMatrix(pred_logit, test$y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    no   yes
##        no  10669   805
##        yes   295   587
##                                           
##                Accuracy : 0.911           
##                  95% CI : (0.9058, 0.9159)
##     No Information Rate : 0.8873          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4699          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9731          
##             Specificity : 0.4217          
##          Pos Pred Value : 0.9298          
##          Neg Pred Value : 0.6655          
##              Prevalence : 0.8873          
##          Detection Rate : 0.8635          
##    Detection Prevalence : 0.9286          
##       Balanced Accuracy : 0.6974          
##                                           
##        'Positive' Class : no              
## 

Logistic Regression estimates the probability that a client will subscribe to a term deposit (the “yes” outcome) based on all available features. It models the log-odds of the target variable using a linear combination of predictors and converts those log-odds into probabilities using the sigmoid function.

A threshold of 0.5 is then applied: if the predicted probability is above 0.5, the model predicts “yes,” otherwise “no.”

Results Summary Interpretation

Accuracy (0.911): The model correctly classifies about 91% of all test cases.

Sensitivity (no = 0.97): Because caret treats “no” as the positive class by default, sensitivity here refers to “no” responses: 97% of the actual “no”s were correctly identified.

Specificity (yes = 0.42): The model identifies “yes” cases poorly; only 42% of actual subscribers were correctly predicted.

Balanced accuracy (approximately 0.70): The average of sensitivity and specificity. This is the value reported in the output above; the AUC is not part of the confusionMatrix output and must be computed separately from the ROC curve.

Overall, Logistic Regression performs well for the dominant class (“no”) but misses more than half of the actual subscribers (805 of 1,392) at the 0.5 threshold. This pattern is common when the “yes” class is rare. The model is still valuable because it provides interpretable coefficients showing how each feature influences the probability of subscribing.
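The effect of the cutoff on subscriber detection can be sketched with toy numbers. The probabilities and labels below are invented for illustration and do not come from the fitted model; the helper name `rates` is an assumption.

```r
# Hypothetical predicted probabilities and true labels
prob   <- c(0.05, 0.10, 0.20, 0.38, 0.35, 0.45, 0.55, 0.70, 0.15, 0.40)
actual <- c("no", "no", "no", "no", "yes", "yes", "yes", "yes", "no", "yes")

# Recall on the "yes" class and number of false positives at a given cutoff
rates <- function(threshold) {
  predicted <- ifelse(prob > threshold, "yes", "no")
  c(recall_yes      = sum(predicted == "yes" & actual == "yes") / sum(actual == "yes"),
    false_positives = sum(predicted == "yes" & actual == "no"))
}

rates(0.5)   # default cutoff
rates(0.3)   # lower cutoff flags more clients as likely subscribers
```

In this toy case, lowering the cutoff recovers every subscriber at the cost of one extra false positive. On real data the trade-off would need to be checked on a validation set.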

ROC Curve

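# ROC curve ("yes" is the second factor level, so ROCR treats it as the positive class)
pred_obj <- prediction(prob_logit, test$y)
perf <- performance(pred_obj, "tpr", "fpr")
plot(perf, main = "ROC Curve - Logistic Regression", col = "blue")
abline(a = 0, b = 1, lty = 2)  # diagonal = random classifier

# Area under the ROC curve
auc_logit <- performance(pred_obj, "auc")@y.values[[1]]
auc_logit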

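The ROC curve plots the true positive rate against the false positive rate across all classification thresholds. The closer the curve is to the top-left corner, the better the model. Note that the value of approximately 0.70 quoted for this model is the balanced accuracy at the 0.5 threshold, taken from the confusion matrix, not the area under this curve; the AUC itself can be obtained from the same ROCR prediction object with performance(pred_obj, "auc").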
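For intuition about what the area under the ROC curve measures, the following base R sketch computes it with the rank-based (Mann-Whitney) formula on toy data. The scores, labels, and the function name `auc_rank` are illustrative assumptions.

```r
# Hypothetical scores and labels
score  <- c(0.10, 0.40, 0.35, 0.80, 0.65, 0.20)
actual <- c("no", "no", "yes", "yes", "yes", "no")

# Rank-based AUC: the probability that a randomly chosen "yes" case
# receives a higher score than a randomly chosen "no" case
auc_rank <- function(score, actual, positive = "yes") {
  r     <- rank(score)                 # average ranks handle ties
  n_pos <- sum(actual == positive)
  n_neg <- sum(actual != positive)
  (sum(r[actual == positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc_rank(score, actual)
```

Here 8 of the 9 possible (“yes”, “no”) pairs are ordered correctly, so the result is 8/9. Unlike sensitivity or specificity, this value does not depend on any single cutoff.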

Model 2: Naive Bayes

model_nb <- naiveBayes(y ~ ., data = train)
pred_nb <- predict(model_nb, newdata = test)
confusionMatrix(pred_nb, test$y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  9839  631
##        yes 1125  761
##                                          
##                Accuracy : 0.8579         
##                  95% CI : (0.8516, 0.864)
##     No Information Rate : 0.8873         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.3845         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.8974         
##             Specificity : 0.5467         
##          Pos Pred Value : 0.9397         
##          Neg Pred Value : 0.4035         
##              Prevalence : 0.8873         
##          Detection Rate : 0.7963         
##    Detection Prevalence : 0.8474         
##       Balanced Accuracy : 0.7220         
##                                          
##        'Positive' Class : no             
## 

The Naive Bayes model is based on Bayes’ Theorem, which calculates the probability of a class (in this case, “yes” or “no”) given the observed feature values. It assumes that all features are conditionally independent, which simplifies computation and works surprisingly well even when this assumption isn’t perfectly true.

Results Summary Interpretation

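Accuracy (0.858): Correctly classifies about 86% of test cases, which is below the 88.7% obtained by always predicting “no” (the No Information Rate in the output above).

Sensitivity (no = 0.90): Good at identifying “no” responses.

Specificity (yes = 0.55): Better than Logistic Regression at identifying “yes” responses (subscribers).

Balanced accuracy (approximately 0.72): Slightly higher than Logistic Regression (0.70), reflecting a better balance between the two classes at the default threshold. As with Logistic Regression, this figure comes from the confusion matrix and is not an AUC.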

Naive Bayes performs a bit worse in total accuracy but better recognizes actual subscribers, making it useful for applications where capturing the minority “yes” class is more important than overall accuracy.

Feature Importance (Logistic Regression)

coeff <- summary(model_logit)$coefficients
coeff <- as.data.frame(coeff)
coeff$Variable <- rownames(coeff)
# Drop missing estimates and the intercept, which is not a predictor
coeff <- coeff %>% filter(!is.na(Estimate), Variable != "(Intercept)")

# Remove outlier coefficients and sort by importance
coeff <- coeff %>%
  filter(abs(Estimate) < 10) %>%
  mutate(Importance = abs(Estimate)) %>%
  arrange(desc(Importance)) %>%
  head(15)

# Clean variable names for readability
coeff$Variable <- gsub("\\.", " ", coeff$Variable)
coeff$Variable <- gsub("job", "Job: ", coeff$Variable)
coeff$Variable <- gsub("month", "Month: ", coeff$Variable)
coeff$Variable <- gsub("education", "Education: ", coeff$Variable)
coeff$Variable <- gsub("contact", "Contact: ", coeff$Variable)
coeff$Variable <- gsub("poutcome", "Previous Outcome: ", coeff$Variable)
coeff$Variable <- gsub("day_of_week", "Day: ", coeff$Variable)
coeff$Variable <- gsub("emp.var.rate", "Employment Variation Rate", coeff$Variable)
coeff$Variable <- gsub("cons.price.idx", "Consumer Price Index", coeff$Variable)
coeff$Variable <- gsub("euribor3m", "Euribor 3-Month Rate", coeff$Variable)

ggplot(coeff, aes(x = reorder(Variable, Importance), y = Importance)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Top 15 Most Influential Predictors in Logistic Regression",
    x = "Variable (Feature Name)",
    y = "Absolute Coefficient (Importance)"
  ) +
  theme_minimal()

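This plot ranks predictors by the absolute value of their coefficients; a larger absolute coefficient indicates a stronger effect on the log-odds of “yes” per unit change in that predictor. Because the predictors were not standardized, coefficient magnitudes depend on each variable’s units, so the ranking is a rough guide rather than a strict measure of importance. Top predictors include call duration, contact type (cellular), and month of campaign. These insights can guide real marketing strategies.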
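The dependence of coefficient size on measurement units can be demonstrated with simulated data. Everything in this sketch (the sample size, the simulated duration variable, and the true coefficients) is an assumption made for illustration and is unrelated to the fitted model above.

```r
set.seed(1)                                   # arbitrary seed for reproducibility
n <- 500
x_seconds <- runif(n, 0, 600)                 # simulated call duration in seconds
y <- rbinom(n, 1, plogis(-3 + 0.01 * x_seconds))

x_minutes <- x_seconds / 60                   # same information, different unit

coef_seconds <- coef(glm(y ~ x_seconds, family = binomial))[["x_seconds"]]
coef_minutes <- coef(glm(y ~ x_minutes, family = binomial))[["x_minutes"]]

# The two models make the same predictions, yet the coefficient in minutes
# is about 60 times the coefficient in seconds
coef_minutes / coef_seconds
```

One common way to make magnitudes comparable is to standardize numeric predictors (for example with `scale()`) before fitting, so that each coefficient refers to a one-standard-deviation change.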

Confusion Matrix Visualization

cm <- confusionMatrix(pred_logit, test$y)
cm_table <- as.data.frame(cm$table)

ggplot(cm_table, aes(Prediction, Reference, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = Freq), color = "white", size = 5) +
  scale_fill_gradient(low = "steelblue", high = "darkred") +
  labs(title = "Confusion Matrix - Logistic Regression")

The model correctly classified most “no” responses, showing high sensitivity.

Some “yes” cases were incorrectly predicted as “no” (false negatives), likely due to class imbalance.

Overall accuracy: 91%, meaning 9 out of 10 predictions were correct.

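The confusion matrix summarizes predictions as follows, treating “yes” as the positive class (note that the caret output above treats “no” as the positive class by default):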

True Positives (TP): predicted “yes” and actually “yes”

True Negatives (TN): predicted “no” and actually “no”

False Positives (FP): predicted “yes” but actually “no”

False Negatives (FN): predicted “no” but actually “yes”
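Using these definitions, the headline metrics can be recomputed by hand from the Logistic Regression confusion matrix reported earlier. The counts are copied from that output; the variable names are illustrative.

```r
# Counts from the Logistic Regression confusion matrix, with "yes" as positive
TP <- 587     # predicted "yes", actually "yes"
TN <- 10669   # predicted "no",  actually "no"
FP <- 295     # predicted "yes", actually "no"
FN <- 805     # predicted "no",  actually "yes"

accuracy  <- (TP + TN) / (TP + TN + FP + FN)
precision <- TP / (TP + FP)   # share of predicted subscribers who really subscribed
recall    <- TP / (TP + FN)   # share of actual subscribers the model found
f1        <- 2 * precision * recall / (precision + recall)

round(c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1), 3)
```

The recall computed this way equals the “Specificity” of 0.4217 in the caret output, because that output treats “no” as the positive class.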

# Create performance comparison
acc_logit <- confusionMatrix(pred_logit, test$y)$overall["Accuracy"]
acc_nb <- confusionMatrix(pred_nb, test$y)$overall["Accuracy"]

data.frame(
  Model = c("Logistic Regression", "Naive Bayes"),
  Accuracy = c(round(acc_logit, 3), round(acc_nb, 3))
)

Logistic Regression achieves higher accuracy and interpretability.

Naive Bayes better identifies the minority class (yes), providing more balanced results.

library(corrplot)

# Keep only numeric columns (select_if() is superseded by select(where(...)))
numeric_data <- data %>% select(where(is.numeric))
corr_matrix <- cor(numeric_data, use = "complete.obs")

corrplot(corr_matrix, method = "color", type = "upper",
         tl.col = "black", tl.srt = 45,
         title = "Correlation Matrix of Numeric Features", mar=c(0,0,1,0))

Returning to the feature-importance plot from the Logistic Regression model: it highlights the predictors that most strongly influence the likelihood of a client subscribing to a term deposit. A larger absolute coefficient means the feature has a stronger impact on the probability of a “yes” outcome.

The most influential variables are call duration, contact type (cellular), Euribor 3-month rate, and month of the campaign. Longer call durations are associated with a much higher chance of subscription, indicating that clients who stay engaged longer on the phone are more likely to agree to a term deposit. Because duration is only known once a call has ended, it helps explain outcomes but cannot be used to decide whom to call. Cellular contact is also associated with higher success rates than traditional phone lines, possibly reflecting easier communication and broader reach.

The Euribor 3-month rate, which represents the short-term interest rate at which European banks lend to each other, captures overall economic conditions. When Euribor is low, savings products like term deposits become more attractive, as investors look for safer, higher-yield options. Conversely, higher Euribor levels typically signal stronger economies where customers may pursue alternative investments.

Finally, the month of contact affects outcomes: marketing efforts in certain months (often near the end of fiscal or promotional cycles) tend to produce better responses.

Together, these predictors suggest that both customer interaction quality (call duration and contact method) and economic context (Euribor rate) play key roles in determining whether a client subscribes to a term deposit.

library(caret)

logit_metrics <- confusionMatrix(pred_logit, test$y, positive = "yes")
nb_metrics <- confusionMatrix(pred_nb, test$y, positive = "yes")

metrics <- data.frame(
  Model = c("Logistic Regression", "Naive Bayes"),
  Accuracy = c(logit_metrics$overall["Accuracy"], nb_metrics$overall["Accuracy"]),
  Precision = c(logit_metrics$byClass["Pos Pred Value"], nb_metrics$byClass["Pos Pred Value"]),
  Recall = c(logit_metrics$byClass["Sensitivity"], nb_metrics$byClass["Sensitivity"]),
  F1_Score = c(2 * (logit_metrics$byClass["Pos Pred Value"] * logit_metrics$byClass["Sensitivity"]) /
                 (logit_metrics$byClass["Pos Pred Value"] + logit_metrics$byClass["Sensitivity"]),
               2 * (nb_metrics$byClass["Pos Pred Value"] * nb_metrics$byClass["Sensitivity"]) /
                 (nb_metrics$byClass["Pos Pred Value"] + nb_metrics$byClass["Sensitivity"]))
)
metrics

# Values for plotting, copied from the metrics table above
performance_df <- data.frame(
  Model = c("Logistic Regression", "Naive Bayes"),
  Accuracy = c(0.911, 0.858),
  F1_Score = c(0.516, 0.464)
)

ggplot(performance_df, aes(x = Model, y = Accuracy, fill = Model)) +
  geom_col(width = 0.6) +
  geom_text(aes(label = round(Accuracy, 3)), vjust = -0.5) +
  ylim(0, 1) +
  labs(title = "Model Comparison: Accuracy",
       y = "Accuracy", x = "") +
  theme_minimal() +
  theme(legend.position = "none")

This summary plot compares the accuracy of both models. Logistic Regression achieves higher overall accuracy (0.911 versus 0.858) and a higher F1 score (0.516 versus 0.464) on this dataset, although Naive Bayes recovers more of the actual subscribers. Naive Bayes’ simpler structure and faster computation also make it a practical choice for real-time applications.

Summary

This project used the UCI Bank Marketing dataset to predict whether clients would subscribe to a term deposit, applying two supervised learning models: Logistic Regression and Naive Bayes.

Model Performance:

Logistic Regression achieved an overall accuracy of 91%, compared with 86% for Naive Bayes. Logistic Regression had higher precision on the “yes” class (0.67 versus 0.40) and offers better interpretability, while Naive Bayes had higher recall (0.55 versus 0.42), meaning it identified a larger proportion of actual subscribers.

Key Predictors:

The most influential variables included call duration, contact type (cellular), and economic factors such as the Euribor 3-month rate and employment variation rate. These predictors suggest that both personal engagement and macroeconomic conditions significantly influence customer subscription decisions.

Insights:

The dataset is highly imbalanced (only 11% of customers subscribed), which explains the models’ tendency to predict “no.” Techniques such as resampling, threshold tuning, or cost-sensitive learning could further improve the detection of true subscribers.
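As one example of the resampling idea mentioned above, the sketch below randomly undersamples the majority class in a toy data frame using base R only. The toy data, seed, and object names are assumptions; this was not part of the original analysis, and caret also provides a `downSample()` helper for the same purpose.

```r
set.seed(42)   # arbitrary seed for reproducibility

# Toy data with roughly the same imbalance as the bank dataset
toy <- data.frame(
  id = 1:100,
  y  = factor(c(rep("no", 89), rep("yes", 11)), levels = c("no", "yes"))
)

# Keep all minority rows and an equally sized random sample of majority rows
n_minority <- min(table(toy$y))
idx_no     <- sample(which(toy$y == "no"), n_minority)
idx_yes    <- which(toy$y == "yes")
balanced   <- toy[c(idx_no, idx_yes), ]

table(balanced$y)
```

Resampling of this kind should be applied to the training set only, so that the test set keeps the real class balance and the reported metrics stay realistic.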

Business Impact:

The findings provide actionable guidance for marketing strategy. Banks can prioritize outreach to customer groups and campaign periods associated with higher subscription probabilities. Logistic Regression offers clear interpretability for strategic decisions, while Naive Bayes provides a faster, scalable alternative for large-scale marketing automation.