Introduction:

To what extent can client characteristics and marketing contact features predict subscription to a term deposit? In order to answer this question, I will use a dataset provided by UC Irvine’s Machine Learning Repository, which contains data from a direct, phone-call based, marketing campaign conducted by a Portuguese banking institution. The dataset itself has ~41,000 rows with 21 columns. This analysis will focus on finding key predictors among the data provided that can be included in a logistic regression model in order to predict whether a client will subcribe to a term deposit. The predictors I will be focusing on are duration (the duration of the call), pdays (number of days that passed after the client was last contacted), previous (number of previous contacts), campaign (number of contacts during the current campaign), contact (contact communication type), and month (last contact month of the year). The dataset can be found here.

Data Analysis

I will begin my data analysis by performing the necessary EDA and cleaning. I will start by analyzing the NAs. In this dataset, there are no NAs, however there are cells that have “unknown” present. This “unknown” is where the NAs should be, so I will replace all cells that include “unknown” to NAs. From there, I will remove the rows with NA present from the dataset because regression models cannot have NAs present. After that, I will change many of the categorical columns to factors. From there, I am ready to build my regression model.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(car)
## Warning: package 'car' was built under R version 4.5.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.5.2
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
library(pROC)
## Warning: package 'pROC' was built under R version 4.5.2
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
bank <- read.csv("bank-additional-full.csv", sep = ";")
str(bank)
## 'data.frame':    41188 obs. of  21 variables:
##  $ age           : int  56 57 37 40 56 45 59 41 24 25 ...
##  $ job           : chr  "housemaid" "services" "services" "admin." ...
##  $ marital       : chr  "married" "married" "married" "married" ...
##  $ education     : chr  "basic.4y" "high.school" "high.school" "basic.6y" ...
##  $ default       : chr  "no" "unknown" "no" "no" ...
##  $ housing       : chr  "no" "no" "yes" "no" ...
##  $ loan          : chr  "no" "no" "no" "no" ...
##  $ contact       : chr  "telephone" "telephone" "telephone" "telephone" ...
##  $ month         : chr  "may" "may" "may" "may" ...
##  $ day_of_week   : chr  "mon" "mon" "mon" "mon" ...
##  $ duration      : int  261 149 226 151 307 198 139 217 380 50 ...
##  $ campaign      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays         : int  999 999 999 999 999 999 999 999 999 999 ...
##  $ previous      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome      : chr  "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
##  $ emp.var.rate  : num  1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
##  $ cons.price.idx: num  94 94 94 94 94 ...
##  $ cons.conf.idx : num  -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
##  $ euribor3m     : num  4.86 4.86 4.86 4.86 4.86 ...
##  $ nr.employed   : num  5191 5191 5191 5191 5191 ...
##  $ y             : chr  "no" "no" "no" "no" ...
summary(bank)
##       age            job              marital           education        
##  Min.   :17.00   Length:41188       Length:41188       Length:41188      
##  1st Qu.:32.00   Class :character   Class :character   Class :character  
##  Median :38.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :40.02                                                           
##  3rd Qu.:47.00                                                           
##  Max.   :98.00                                                           
##    default            housing              loan             contact         
##  Length:41188       Length:41188       Length:41188       Length:41188      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     month           day_of_week           duration         campaign     
##  Length:41188       Length:41188       Min.   :   0.0   Min.   : 1.000  
##  Class :character   Class :character   1st Qu.: 102.0   1st Qu.: 1.000  
##  Mode  :character   Mode  :character   Median : 180.0   Median : 2.000  
##                                        Mean   : 258.3   Mean   : 2.568  
##                                        3rd Qu.: 319.0   3rd Qu.: 3.000  
##                                        Max.   :4918.0   Max.   :56.000  
##      pdays          previous       poutcome          emp.var.rate     
##  Min.   :  0.0   Min.   :0.000   Length:41188       Min.   :-3.40000  
##  1st Qu.:999.0   1st Qu.:0.000   Class :character   1st Qu.:-1.80000  
##  Median :999.0   Median :0.000   Mode  :character   Median : 1.10000  
##  Mean   :962.5   Mean   :0.173                      Mean   : 0.08189  
##  3rd Qu.:999.0   3rd Qu.:0.000                      3rd Qu.: 1.40000  
##  Max.   :999.0   Max.   :7.000                      Max.   : 1.40000  
##  cons.price.idx  cons.conf.idx     euribor3m      nr.employed  
##  Min.   :92.20   Min.   :-50.8   Min.   :0.634   Min.   :4964  
##  1st Qu.:93.08   1st Qu.:-42.7   1st Qu.:1.344   1st Qu.:5099  
##  Median :93.75   Median :-41.8   Median :4.857   Median :5191  
##  Mean   :93.58   Mean   :-40.5   Mean   :3.621   Mean   :5167  
##  3rd Qu.:93.99   3rd Qu.:-36.4   3rd Qu.:4.961   3rd Qu.:5228  
##  Max.   :94.77   Max.   :-26.9   Max.   :5.045   Max.   :5228  
##       y            
##  Length:41188      
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
colSums(is.na(bank))
##            age            job        marital      education        default 
##              0              0              0              0              0 
##        housing           loan        contact          month    day_of_week 
##              0              0              0              0              0 
##       duration       campaign          pdays       previous       poutcome 
##              0              0              0              0              0 
##   emp.var.rate cons.price.idx  cons.conf.idx      euribor3m    nr.employed 
##              0              0              0              0              0 
##              y 
##              0
#code taken from dplyr website
bank_clean <- bank %>%
  mutate(across(where(is.character), ~na_if(., "unknown")))

#checking NAs for my predictors
colSums(is.na(select(bank_clean, duration, pdays, previous, campaign, contact, month)))
## duration    pdays previous campaign  contact    month 
##        0        0        0        0        0        0
#removing rows with NAs
bank_clean <- bank_clean %>%
  filter(rowSums(is.na(.)) == 0)

#changing data values to factors where applicable
bank_clean <- bank_clean %>%
  mutate(
    y = factor(y, levels = c("no", "yes")),
    job = factor(job),
    marital = factor(marital),
    education = factor(education),
    default = factor(default),
    housing = factor(housing),
    loan = factor(loan),
    contact = factor(contact),
    month = factor(month),
    day_of_week = factor(day_of_week),
    poutcome = factor(poutcome)
  )

str(bank_clean)
## 'data.frame':    30488 obs. of  21 variables:
##  $ age           : int  56 37 40 56 59 24 25 25 29 57 ...
##  $ job           : Factor w/ 11 levels "admin.","blue-collar",..: 4 8 1 8 1 10 8 8 2 4 ...
##  $ marital       : Factor w/ 3 levels "divorced","married",..: 2 2 2 2 2 3 3 3 3 1 ...
##  $ education     : Factor w/ 7 levels "basic.4y","basic.6y",..: 1 4 2 4 6 6 4 4 4 1 ...
##  $ default       : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ housing       : Factor w/ 2 levels "no","yes": 1 2 1 1 1 2 2 2 1 2 ...
##  $ loan          : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 2 1 ...
##  $ contact       : Factor w/ 2 levels "cellular","telephone": 2 2 2 2 2 2 2 2 2 2 ...
##  $ month         : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ day_of_week   : Factor w/ 5 levels "fri","mon","thu",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ duration      : int  261 226 151 307 139 380 50 222 137 293 ...
##  $ campaign      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays         : int  999 999 999 999 999 999 999 999 999 999 ...
##  $ previous      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome      : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ emp.var.rate  : num  1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
##  $ cons.price.idx: num  94 94 94 94 94 ...
##  $ cons.conf.idx : num  -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
##  $ euribor3m     : num  4.86 4.86 4.86 4.86 4.86 ...
##  $ nr.employed   : num  5191 5191 5191 5191 5191 ...
##  $ y             : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...


Logistic Regression

#creating the logistic regression model
model <- glm(
  y ~ duration + pdays + previous + campaign + contact + month,
  data = bank_clean,
  family = binomial
)

summary(model)
## 
## Call:
## glm(formula = y ~ duration + pdays + previous + campaign + contact + 
##     month, family = binomial, data = bank_clean)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.607e-01  1.167e-01  -2.234   0.0255 *  
## duration          4.068e-03  7.579e-05  53.678  < 2e-16 ***
## pdays            -2.349e-03  9.186e-05 -25.569  < 2e-16 ***
## previous          1.044e-01  4.080e-02   2.559   0.0105 *  
## campaign         -8.403e-02  1.268e-02  -6.629 3.37e-11 ***
## contacttelephone -1.141e+00  6.515e-02 -17.509  < 2e-16 ***
## monthaug         -6.801e-01  8.149e-02  -8.346  < 2e-16 ***
## monthdec          9.482e-01  2.046e-01   4.634 3.59e-06 ***
## monthjul         -9.881e-01  8.352e-02 -11.831  < 2e-16 ***
## monthjun          1.524e-01  9.390e-02   1.623   0.1045    
## monthmar          1.751e+00  1.197e-01  14.628  < 2e-16 ***
## monthmay         -9.551e-01  7.856e-02 -12.157  < 2e-16 ***
## monthnov         -9.695e-01  8.929e-02 -10.858  < 2e-16 ***
## monthoct          1.121e+00  1.155e-01   9.706  < 2e-16 ***
## monthsep          8.273e-01  1.291e-01   6.410 1.45e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 23160  on 30487  degrees of freedom
## Residual deviance: 15698  on 30473  degrees of freedom
## AIC: 15728
## 
## Number of Fisher Scoring iterations: 6

As we can see above, almost all of our predictors are significant with their p-values being below 0.05. Duration is the strongest predictor, with every additional second of call duration increasing the odds of subscribing by 0.41%. While this might not seem like a lot, considering that duration tends to span hundreds of seconds, this is an incredibly impactful predictor. Pdays (or number of days that have passed since last contact) has an odds ratio of ~0.977, telling us that for each additional day since the last marketing contact, the odds of subscribing decreases by 2.3%. The previous variable (number of previous contacts) has a smaller impact, but still a statistically significant one. It has an odds ratio of ~1.0105, which tells us that each additional number of previous contacts increases the odds of subscribing by 1.05%. The campaign variable, which reflects how many times a client has been contacted in the current campaign, shows us that each additional campaign reduces the odds of subscribing by ~8.1%, likely due to customer fatigue. The contact type also seems to play a large role. By contacting clients on the telephone rather than cellular, the odds of subscribing to a term deposit go down by 68%. This is a large effect, and shows us that customers reached by cell phones are much more likely to subscribe. Lastly, it seems that the time of year in which customers are contacted plays a large role too. For example, customers reached in August have about a 50% lower chance of subscribing than in April (the reference month), while customers reached in September have more than double the odds of subscribing than in April. This reflects seasonal behavior in financial decision-making.

#calculating r-square
r_square <- 1 - (model$deviance/model$null.deviance)

r_square
## [1] 0.3222137

Our R^2 score is ~0.3222, meaning that this model explains about 32.22% of the variation in the outcome (subscribing to the term deposit).

#calculating p-value
1 - pchisq((model$null.deviance - model$deviance), df=(length(model$coefficients)-1))
## [1] 0

Our p-value is showing 0, which means the model is very significant.

Model Assumption and Diagnostics

#code taken from logistic regression assignment
predicted.data <- data.frame(
  probability.of.y = model$fitted.values,
  y = bank_clean$y
)

predicted.data <- predicted.data[order(predicted.data$probability.of.y,
                                       decreasing = FALSE), ]
predicted.data$rank <- 1:nrow(predicted.data)

head(predicted.data, 6)
##       probability.of.y  y rank
## 8645      0.0002523638 no    1
## 2558      0.0004090162 no    2
## 2597      0.0004401368 no    3
## 3333      0.0004832882 no    4
## 12290     0.0005043308 no    5
## 3499      0.0006517756 no    6
#code taken from logistic regression assignment
ggplot(data=predicted.data, aes(x=rank, y=probability.of.y)) +
  geom_point(aes(color=y), alpha=1, shape=4, stroke=2) +
  xlab("Index") +
  ylab("Predicted probability of subscribing to a term deposit")


This graph shows the predicted probability of a client subscribing to a term deposit, sorted from lowest to highest probability. We see that most clients have low predicted probabilities, however we see a smaller set of clients showing highly increasing probabilities, indicating the individuals that the model deem highly likely to subscribe. We see that the clients that actually subscribed (colored blue in the graph) cluster near the high-probability region of the graph, indicating that the model appears to be effective at predicting who will subscribe to a term deposit.

#code taken from logistic regression assignment
bank_clean$y_num <- ifelse(bank_clean$y == "yes", 1, 0)

predicted.probs <- model$fitted.values

predicted.classes <- ifelse(predicted.probs > 0.5, 1, 0)

confusion <- table(
  Predicted = factor(predicted.classes, levels = c(0, 1)),
  Actual = factor(bank_clean$y_num, levels = c(0, 1))
)

head(confusion, 6)
##          Actual
## Predicted     0     1
##         0 25950  2518
##         1   679  1341
#code taken from logistic regression assignment
TN <- 25953
FP <- 676
FN <- 2502
TP <- 1357

accuracy <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
precision <- TP / (TP + FP)
f1_score <- 2 * (precision * sensitivity) / (precision + sensitivity)

cat("Accuracy:    ", round(accuracy, 4), "\n")
## Accuracy:     0.8958
cat("Sensitivity: ", round(sensitivity, 4), "\n")
## Sensitivity:  0.3516
cat("Specificity: ", round(specificity, 4), "\n")
## Specificity:  0.9746
cat("Precision:   ", round(precision, 4), "\n")
## Precision:    0.6675
cat("F1 Score:    ", round(f1_score, 4), "\n")
## F1 Score:     0.4606

What the confusion matrix and performance metrics show us is that this model is very good at predicting non-subscribers, with a specificity score of ~0.975. However the model is only moderately effective at correctly predicting subscribers, with a sensitivity score 0.35. This reflects a strong class imbalance, which is due to the fact that most people do not subscribe, so the model learns to predict non-subscribers very accurately but has a difficult time identifying the smaller, subscriber group.

#code taken from logistic regression assignment
roc_obj <- roc(
  response = bank_clean$y,
  predictor = model$fitted.values,
  levels = c("no", "yes"),
  direction = "<"
)

auc_val <- auc(roc_obj)
auc_val
## Area under the curve: 0.9006
plot.roc(roc_obj,
         print.auc = TRUE,
         legacy.axes = TRUE,
         xlab = "False Positive Rate (1 - Specificity)",
         ylab = "True Positive Rate (Sensitivity)")


The ROC curve shows us that the model does a good job at separating likely subscribers from non-subscribers. The AUC of 0.901 means that if you randomly pick one subscriber and one non-subscriber, the model will assign a higher predicted probability to the subscriber 91% of the time. This means that the model has strong predictive power and performs well.

Conclusion

This logistic regression analysis identified many important predictors of whether or not a customer will subscribe to a term deposit. Call duration was the strongest predictor, with longer calls being associated with higher odds of subscription. Recent contact, fewer campaign contacts, and using cellular instead of telephone communication also significant increased the odds of subscription. We also see a strong seasonal effect, with certain months being associated with much higher subscription odds. These results provide valuable knowledge regarding marketing strategy. Engaging conversations will increase the call duration and therefore increase the odds of subscription. The negative effect of repeated campaign contacts indicate strong diminishing returns, showing that over-contacting reduces the effectiveness. The timing of campaigns could also be revamped to run these campaigns during months that were shown to be more effective at securing subscribers.

The model itself shows its strength in discriminative performance, with an AUC of ~0.91. However, the pseudo R^2 shows that a substantial amount of variation remains unexplained. Additional factors that were not included in this model should be investigated in order to improve that. The data set itself is also highly imbalanced, with far more non-subscribers than subscribers. This could bias the predictions toward non-subscribers and reduce the sensitivity. Future work could explore additional predictors such as income, past banking behavior, and credit scores as well as adding interaction terms (for example, between contact method and call duration) to detect any synergistic effects.

References

Convert values to Na - na_if. - na_if • dplyr. (n.d.). https://dplyr.tidyverse.org/reference/na_if.html
Moro, S., Rita, P., & Cortez, P. (n.d.). Bank marketing. UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/222/bank+marketing