To what extent can client characteristics and marketing
contact features predict subscription to a term deposit? In
order to answer this question, I will use a dataset from the UC
Irvine Machine Learning Repository containing data from a direct,
phone-based marketing campaign conducted by a Portuguese banking
institution. The dataset has 41,188 rows and 21 columns. This
analysis focuses on finding key predictors in the data that can be
included in a logistic regression model to predict whether a client
will subscribe to a term deposit. The predictors I will focus on are
duration (the duration of the last call, in seconds), pdays (the number
of days that passed since the client was last contacted in a previous
campaign, with 999 meaning the client was not previously contacted),
previous (the number of previous contacts), campaign (the number of
contacts during the current campaign), contact (the contact
communication type), and month (the last contact month of the year).
The dataset is available at
https://archive.ics.uci.edu/dataset/222/bank+marketing.
I will begin the analysis by performing the necessary EDA and
cleaning, starting with missing values. This dataset contains no
explicit NAs; instead, missing values are recorded as the string
"unknown". I will therefore replace every "unknown" value with NA and
then remove the rows containing NAs, since observations with missing
values cannot be used to fit the regression model. After that, I will
convert the categorical columns to factors. From there, I am ready to
build my regression model.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(car)
## Warning: package 'car' was built under R version 4.5.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.5.2
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(pROC)
## Warning: package 'pROC' was built under R version 4.5.2
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
bank <- read.csv("bank-additional-full.csv", sep = ";")
str(bank)
## 'data.frame': 41188 obs. of 21 variables:
## $ age : int 56 57 37 40 56 45 59 41 24 25 ...
## $ job : chr "housemaid" "services" "services" "admin." ...
## $ marital : chr "married" "married" "married" "married" ...
## $ education : chr "basic.4y" "high.school" "high.school" "basic.6y" ...
## $ default : chr "no" "unknown" "no" "no" ...
## $ housing : chr "no" "no" "yes" "no" ...
## $ loan : chr "no" "no" "no" "no" ...
## $ contact : chr "telephone" "telephone" "telephone" "telephone" ...
## $ month : chr "may" "may" "may" "may" ...
## $ day_of_week : chr "mon" "mon" "mon" "mon" ...
## $ duration : int 261 149 226 151 307 198 139 217 380 50 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
## $ emp.var.rate : num 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
## $ cons.price.idx: num 94 94 94 94 94 ...
## $ cons.conf.idx : num -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
## $ euribor3m : num 4.86 4.86 4.86 4.86 4.86 ...
## $ nr.employed : num 5191 5191 5191 5191 5191 ...
## $ y : chr "no" "no" "no" "no" ...
summary(bank)
## age job marital education
## Min. :17.00 Length:41188 Length:41188 Length:41188
## 1st Qu.:32.00 Class :character Class :character Class :character
## Median :38.00 Mode :character Mode :character Mode :character
## Mean :40.02
## 3rd Qu.:47.00
## Max. :98.00
## default housing loan contact
## Length:41188 Length:41188 Length:41188 Length:41188
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## month day_of_week duration campaign
## Length:41188 Length:41188 Min. : 0.0 Min. : 1.000
## Class :character Class :character 1st Qu.: 102.0 1st Qu.: 1.000
## Mode :character Mode :character Median : 180.0 Median : 2.000
## Mean : 258.3 Mean : 2.568
## 3rd Qu.: 319.0 3rd Qu.: 3.000
## Max. :4918.0 Max. :56.000
## pdays previous poutcome emp.var.rate
## Min. : 0.0 Min. :0.000 Length:41188 Min. :-3.40000
## 1st Qu.:999.0 1st Qu.:0.000 Class :character 1st Qu.:-1.80000
## Median :999.0 Median :0.000 Mode :character Median : 1.10000
## Mean :962.5 Mean :0.173 Mean : 0.08189
## 3rd Qu.:999.0 3rd Qu.:0.000 3rd Qu.: 1.40000
## Max. :999.0 Max. :7.000 Max. : 1.40000
## cons.price.idx cons.conf.idx euribor3m nr.employed
## Min. :92.20 Min. :-50.8 Min. :0.634 Min. :4964
## 1st Qu.:93.08 1st Qu.:-42.7 1st Qu.:1.344 1st Qu.:5099
## Median :93.75 Median :-41.8 Median :4.857 Median :5191
## Mean :93.58 Mean :-40.5 Mean :3.621 Mean :5167
## 3rd Qu.:93.99 3rd Qu.:-36.4 3rd Qu.:4.961 3rd Qu.:5228
## Max. :94.77 Max. :-26.9 Max. :5.045 Max. :5228
## y
## Length:41188
## Class :character
## Mode :character
##
##
##
colSums(is.na(bank))
## age job marital education default
## 0 0 0 0 0
## housing loan contact month day_of_week
## 0 0 0 0 0
## duration campaign pdays previous poutcome
## 0 0 0 0 0
## emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
## 0 0 0 0 0
## y
## 0
#code taken from dplyr website
bank_clean <- bank %>%
  mutate(across(where(is.character), ~na_if(., "unknown")))
#checking NAs for my predictors
colSums(is.na(select(bank_clean, duration, pdays, previous, campaign, contact, month)))
## duration pdays previous campaign contact month
## 0 0 0 0 0 0
#removing rows with NAs
bank_clean <- bank_clean %>%
  filter(rowSums(is.na(.)) == 0)
#changing data values to factors where applicable
bank_clean <- bank_clean %>%
  mutate(
    y = factor(y, levels = c("no", "yes")),
    job = factor(job),
    marital = factor(marital),
    education = factor(education),
    default = factor(default),
    housing = factor(housing),
    loan = factor(loan),
    contact = factor(contact),
    month = factor(month),
    day_of_week = factor(day_of_week),
    poutcome = factor(poutcome)
  )
str(bank_clean)
## 'data.frame': 30488 obs. of 21 variables:
## $ age : int 56 37 40 56 59 24 25 25 29 57 ...
## $ job : Factor w/ 11 levels "admin.","blue-collar",..: 4 8 1 8 1 10 8 8 2 4 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 2 2 2 2 3 3 3 3 1 ...
## $ education : Factor w/ 7 levels "basic.4y","basic.6y",..: 1 4 2 4 6 6 4 4 4 1 ...
## $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ housing : Factor w/ 2 levels "no","yes": 1 2 1 1 1 2 2 2 1 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 2 1 ...
## $ contact : Factor w/ 2 levels "cellular","telephone": 2 2 2 2 2 2 2 2 2 2 ...
## $ month : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ day_of_week : Factor w/ 5 levels "fri","mon","thu",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ duration : int 261 226 151 307 139 380 50 222 137 293 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ emp.var.rate : num 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
## $ cons.price.idx: num 94 94 94 94 94 ...
## $ cons.conf.idx : num -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
## $ euribor3m : num 4.86 4.86 4.86 4.86 4.86 ...
## $ nr.employed : num 5191 5191 5191 5191 5191 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
#creating the logistic regression model
model <- glm(
  y ~ duration + pdays + previous + campaign + contact + month,
  data = bank_clean,
  family = binomial
)
summary(model)
##
## Call:
## glm(formula = y ~ duration + pdays + previous + campaign + contact +
## month, family = binomial, data = bank_clean)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.607e-01 1.167e-01 -2.234 0.0255 *
## duration 4.068e-03 7.579e-05 53.678 < 2e-16 ***
## pdays -2.349e-03 9.186e-05 -25.569 < 2e-16 ***
## previous 1.044e-01 4.080e-02 2.559 0.0105 *
## campaign -8.403e-02 1.268e-02 -6.629 3.37e-11 ***
## contacttelephone -1.141e+00 6.515e-02 -17.509 < 2e-16 ***
## monthaug -6.801e-01 8.149e-02 -8.346 < 2e-16 ***
## monthdec 9.482e-01 2.046e-01 4.634 3.59e-06 ***
## monthjul -9.881e-01 8.352e-02 -11.831 < 2e-16 ***
## monthjun 1.524e-01 9.390e-02 1.623 0.1045
## monthmar 1.751e+00 1.197e-01 14.628 < 2e-16 ***
## monthmay -9.551e-01 7.856e-02 -12.157 < 2e-16 ***
## monthnov -9.695e-01 8.929e-02 -10.858 < 2e-16 ***
## monthoct 1.121e+00 1.155e-01 9.706 < 2e-16 ***
## monthsep 8.273e-01 1.291e-01 6.410 1.45e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 23160 on 30487 degrees of freedom
## Residual deviance: 15698 on 30473 degrees of freedom
## AIC: 15728
##
## Number of Fisher Scoring iterations: 6
As we can see above, almost all of our predictors are significant,
with p-values below 0.05. Duration is the strongest predictor: each
additional second of call duration multiplies the odds of subscribing
by exp(0.004068) ≈ 1.0041, an increase of about 0.41% per second.
While this might not seem like much, calls routinely last hundreds of
seconds, so the cumulative effect is substantial. Pdays (the number of
days that passed since the last contact) has an odds ratio of
exp(-0.002349) ≈ 0.9977, telling us that each additional day since the
last marketing contact lowers the odds of subscribing by about 0.23%
(note that pdays is coded as 999 for clients who were never previously
contacted). The previous variable (number of previous contacts) has an
odds ratio of exp(0.1044) ≈ 1.11, so each additional prior contact is
associated with roughly 11% higher odds of subscribing. The campaign
variable, which reflects how many times a client has been contacted in
the current campaign, has an odds ratio of exp(-0.084) ≈ 0.919: each
additional contact within the campaign reduces the odds of subscribing
by about 8.1%, likely due to customer fatigue. Contact type also plays
a large role. Clients contacted by telephone rather than by cellular
phone have about 68% lower odds of subscribing (exp(-1.141) ≈ 0.32),
showing that customers reached on cell phones are much more likely to
subscribe. Lastly, the time of year in which customers are contacted
matters as well. For example, relative to April (the reference month),
customers reached in August have roughly 49% lower odds of subscribing,
while customers reached in September have more than double the odds.
This reflects seasonal behavior in financial decision-making.
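Rather than converting each coefficient by hand, the odds ratios can be
computed directly from the fitted model. The following is a minimal
sketch using the model object fit above; exp() of the coefficients
gives the multiplicative effect on the odds, and confint.default()
provides approximate Wald-type confidence intervals.
#converting the log-odds coefficients to odds ratios
round(exp(coef(model)), 4)
#approximate 95% confidence intervals on the odds-ratio scale (Wald intervals)
round(exp(confint.default(model)), 4)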
#calculating McFadden's pseudo R-squared: 1 - (residual deviance / null deviance)
r_square <- 1 - (model$deviance/model$null.deviance)
r_square
## [1] 0.3222137
This value, 1 minus the ratio of the residual deviance to the null
deviance, is McFadden's pseudo-R^2. A value of ~0.3222 means the model
reduces the null deviance by about 32%; while a pseudo-R^2 is not
literally the share of variance explained, it indicates that the model
captures a meaningful portion of the variation in the outcome
(subscribing to the term deposit).
#likelihood-ratio test: p-value for the fitted model vs. the null (intercept-only) model
1 - pchisq((model$null.deviance - model$deviance), df=(length(model$coefficients)-1))
## [1] 0
The p-value prints as 0 because the likelihood-ratio statistic (null
deviance minus residual deviance, here about 7462 on 14 degrees of
freedom) is so large that the p-value underflows to zero; in other
words, the model fits significantly better than the intercept-only
(null) model.
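For reference, a minimal sketch that computes the same likelihood-ratio
test on the log scale (using the standard lower.tail and log.p
arguments of pchisq), which returns a finite, very negative log p-value
instead of underflowing to 0:
#log of the likelihood-ratio test p-value; confirms the p-value is far below
#machine precision rather than literally zero
pchisq(model$null.deviance - model$deviance,
       df = length(model$coefficients) - 1,
       lower.tail = FALSE, log.p = TRUE)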
#code taken from logistic regression assignment
predicted.data <- data.frame(
  probability.of.y = model$fitted.values,
  y = bank_clean$y
)
predicted.data <- predicted.data[order(predicted.data$probability.of.y,
                                       decreasing = FALSE), ]
predicted.data$rank <- 1:nrow(predicted.data)
head(predicted.data, 6)
## probability.of.y y rank
## 8645 0.0002523638 no 1
## 2558 0.0004090162 no 2
## 2597 0.0004401368 no 3
## 3333 0.0004832882 no 4
## 12290 0.0005043308 no 5
## 3499 0.0006517756 no 6
#code taken from logistic regression assignment
ggplot(data = predicted.data, aes(x = rank, y = probability.of.y)) +
  geom_point(aes(color = y), alpha = 1, shape = 4, stroke = 2) +
  xlab("Index") +
  ylab("Predicted probability of subscribing to a term deposit")
This graph shows the predicted probability of a client subscribing
to a term deposit, sorted from lowest to highest probability. Most
clients have low predicted probabilities, but a smaller set of clients
on the right side shows sharply increasing probabilities; these are the
individuals the model deems highly likely to subscribe. The clients who
actually subscribed (colored blue in the graph) cluster near the
high-probability region, indicating that the model appears to be
effective at identifying who will subscribe to a term deposit.
#code taken from logistic regression assignment
bank_clean$y_num <- ifelse(bank_clean$y == "yes", 1, 0)
predicted.probs <- model$fitted.values
predicted.classes <- ifelse(predicted.probs > 0.5, 1, 0)
confusion <- table(
  Predicted = factor(predicted.classes, levels = c(0, 1)),
  Actual = factor(bank_clean$y_num, levels = c(0, 1))
)
head(confusion, 6)
## Actual
## Predicted 0 1
## 0 25950 2518
## 1 679 1341
#code taken from logistic regression assignment
#extracting the counts from the confusion matrix above
TN <- confusion["0", "0"]
FP <- confusion["1", "0"]
FN <- confusion["0", "1"]
TP <- confusion["1", "1"]
accuracy <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
precision <- TP / (TP + FP)
f1_score <- 2 * (precision * sensitivity) / (precision + sensitivity)
cat("Accuracy: ", round(accuracy, 4), "\n")
## Accuracy: 0.8951
cat("Sensitivity: ", round(sensitivity, 4), "\n")
## Sensitivity: 0.3475
cat("Specificity: ", round(specificity, 4), "\n")
## Specificity: 0.9745
cat("Precision: ", round(precision, 4), "\n")
## Precision: 0.6639
cat("F1 Score: ", round(f1_score, 4), "\n")
## F1 Score: 0.4562
The confusion matrix and performance metrics show that this model is
very good at predicting non-subscribers, with a specificity of ~0.97,
but only moderately effective at correctly predicting subscribers, with
a sensitivity of ~0.35. This reflects the strong class imbalance in the
data: most clients do not subscribe, so at the default 0.5 threshold
the model predicts non-subscribers very accurately but has a difficult
time identifying the smaller subscriber group.
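One way to trade some specificity for sensitivity, without refitting
the model, is to lower the classification threshold. The sketch below
uses an arbitrary, illustrative cutoff of 0.2 rather than a tuned
value:
#recomputing the confusion matrix at a lower, illustrative threshold of 0.2;
#lower thresholds flag more clients as likely subscribers, raising sensitivity
#at the cost of more false positives
predicted.classes.low <- ifelse(predicted.probs > 0.2, 1, 0)
table(
  Predicted = factor(predicted.classes.low, levels = c(0, 1)),
  Actual = factor(bank_clean$y_num, levels = c(0, 1))
)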
#code taken from logistic regression assignment
roc_obj <- roc(
  response = bank_clean$y,
  predictor = model$fitted.values,
  levels = c("no", "yes"),
  direction = "<"
)
auc_val <- auc(roc_obj)
auc_val
## Area under the curve: 0.9006
plot.roc(roc_obj,
         print.auc = TRUE,
         legacy.axes = TRUE,
         xlab = "False Positive Rate (1 - Specificity)",
         ylab = "True Positive Rate (Sensitivity)")
The ROC curve shows that the model does a good job of separating
likely subscribers from non-subscribers. The AUC of ~0.90 means that if
you randomly pick one subscriber and one non-subscriber, the model will
assign a higher predicted probability to the subscriber about 90% of
the time. This indicates strong discriminative power.
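That pairwise interpretation can be checked empirically by sampling
random subscriber/non-subscriber pairs and counting how often the
subscriber receives the higher predicted probability; the proportion
should land close to the AUC. A small sketch (the seed and number of
pairs are arbitrary choices for illustration):
#empirical check of the AUC's probabilistic interpretation
set.seed(123)
probs_yes <- predicted.probs[bank_clean$y == "yes"]
probs_no <- predicted.probs[bank_clean$y == "no"]
pair_yes <- sample(probs_yes, 100000, replace = TRUE)
pair_no <- sample(probs_no, 100000, replace = TRUE)
#ties count as half, matching the standard AUC definition
mean((pair_yes > pair_no) + 0.5 * (pair_yes == pair_no))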
This logistic regression analysis identified several important predictors of whether a client will subscribe to a term deposit. Call duration was the strongest predictor, with longer calls associated with higher odds of subscription. More recent contact, fewer contacts within the current campaign, and cellular rather than telephone communication also significantly increased the odds of subscription. There is also a strong seasonal effect, with certain months associated with much higher subscription odds. These results offer useful guidance for marketing strategy: engaging conversations lengthen calls and are associated with higher subscription odds; the negative effect of repeated campaign contacts indicates strong diminishing returns, suggesting that over-contacting reduces effectiveness; and campaign timing could be shifted toward the months shown to be more effective at securing subscribers.
The model's main strength is its discriminative performance, with an
AUC of ~0.90. However, the pseudo-R^2 shows that a substantial amount
of variation remains unexplained, so additional factors not included in
this model should be investigated. The dataset is also highly
imbalanced, with far more non-subscribers than subscribers, which
biases predictions toward non-subscribers and reduces sensitivity.
Future work could explore additional predictors such as income, past
banking behavior, and credit scores, as well as interaction terms (for
example, between contact method and call duration) to detect any
synergistic effects; a sketch of such a model appears below.
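As a sketch of what such an extension might look like (a hypothetical
model, not fitted in this analysis), an interaction between contact
method and call duration can be added with the * operator in the model
formula, which expands to both main effects plus their product term:
#hypothetical extended model with a contact-by-duration interaction;
#the interaction coefficient tests whether the effect of call duration
#differs between cellular and telephone contacts
model_interaction <- glm(
  y ~ duration * contact + pdays + previous + campaign + month,
  data = bank_clean,
  family = binomial
)
summary(model_interaction)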
na_if: Convert values to NA. (n.d.). dplyr reference. https://dplyr.tidyverse.org/reference/na_if.html
Moro, S., Rita, P., & Cortez, P. (n.d.). Bank marketing.
UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/222/bank+marketing