This study analyzes a Portuguese bank’s direct marketing campaigns to predict whether clients will subscribe to a term deposit using a logistic regression model. By leveraging customer demographics, past campaign outcomes, and economic factors, the findings highlight that prior successful contacts, employment status, and economic context significantly influence subscription likelihood. Targeted marketing informed by these insights can increase campaign effectiveness, and removing the call-duration variable, which is only known after a call has taken place, yields a model that better reflects the information available when selecting clients to contact.
This binary classification problem investigates data from direct marketing campaigns of a Portuguese banking institution. The goal is to develop a predictive model that determines whether a client will subscribe to a term deposit (variable y, values: yes/no).
The model evaluation considered Logistic Regression, SVM, LDA, and Decision Trees, with a Naive Bayes model fit for an ROC/AUC comparison. Logistic Regression offered the best combination of sensitivity and interpretability, making it the chosen model for this analysis.
In similar research (Moro et al., 2014), marketing campaigns were analyzed using 22 features with models including Logistic Regression, Decision Trees, Neural Networks, and SVMs. Metrics such as AUC and ALIFT were used to evaluate performance. Neural Networks outperformed with an AUC of 0.80 and ALIFT of 0.67.
This supports the conclusion that machine learning can significantly boost marketing performance by accurately identifying high-potential clients.
While Linear Discriminant Analysis (LDA) was considered, the presence of numerous categorical variables made Logistic Regression a more suitable model due to its handling of binary outcomes and interpretability.
The dataset was segmented into four major variable categories:

- Demographic: Age, job type, marital status, education.
- Financial: Default status, housing loan, personal loan.
- Marketing and Contact: Contact method, call duration, prior contacts, campaign outcomes.
- Economic Indicators: Employment variation rate, consumer confidence, euribor rate.
Initial cleaning involved:

- Removing the duration variable due to its post-call nature and predictive leakage.
- Recoding pdays into a binary categorical variable: contacted.
- Encoding categorical variables as factors.
- Removing highly correlated variables identified via a correlation matrix.
## # A tibble: 4,119 × 21
## age job marital education default housing loan contact month day_of_week
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 30 blue… married basic.9y no yes no cellul… may fri
## 2 39 serv… single high.sch… no no no teleph… may fri
## 3 25 serv… married high.sch… no yes no teleph… jun wed
## 4 38 serv… married basic.9y no unknown unkn… teleph… jun fri
## 5 47 admi… married universi… no yes no cellul… nov mon
## 6 32 serv… single universi… no no no cellul… sep thu
## 7 32 admi… single universi… no yes no cellul… sep mon
## 8 41 entr… married universi… unknown yes no cellul… nov mon
## 9 31 serv… divorc… professi… no no no cellul… nov tue
## 10 35 blue… married basic.9y unknown no no teleph… may thu
## # ℹ 4,109 more rows
## # ℹ 11 more variables: duration <dbl>, campaign <dbl>, pdays <dbl>,
## # previous <dbl>, poutcome <chr>, emp.var.rate <dbl>, cons.price.idx <dbl>,
## # cons.conf.idx <dbl>, euribor3m <dbl>, nr.employed <dbl>, y <chr>
# Drop the duration column (column 11), known only after the call has ended
bank <- bank[-c(11)]
# Encode the categorical variables as factors
factor_vars <- c("job", "education", "marital", "default", "housing", "loan",
                 "contact", "month", "day_of_week", "poutcome", "y")
bank[factor_vars] <- lapply(bank[factor_vars], as.factor)
# Recode pdays (999 = never previously contacted) into a binary factor
bank$contacted <- ifelse(bank$pdays == 999, "No", "Yes")
bank$contacted <- as.factor(bank$contacted)
# Make "no" the reference level of the outcome
bank$y <- factor(bank$y, levels = c("no", "yes"))
summary(bank)
## age job marital education
## Min. :18.00 admin. :1012 divorced: 446 university.degree :1264
## 1st Qu.:32.00 blue-collar: 884 married :2509 high.school : 921
## Median :38.00 technician : 691 single :1153 basic.9y : 574
## Mean :40.11 services : 393 unknown : 11 professional.course: 535
## 3rd Qu.:47.00 management : 324 basic.4y : 429
## Max. :88.00 retired : 166 basic.6y : 228
## (Other) : 649 (Other) : 168
## default housing loan contact month
## no :3315 no :1839 no :3349 cellular :2652 may :1378
## unknown: 803 unknown: 105 unknown: 105 telephone:1467 jul : 711
## yes : 1 yes :2175 yes : 665 aug : 636
## jun : 530
## nov : 446
## apr : 215
## (Other): 203
## day_of_week campaign pdays previous
## fri:768 Min. : 1.000 Min. : 0.0 Min. :0.0000
## mon:855 1st Qu.: 1.000 1st Qu.:999.0 1st Qu.:0.0000
## thu:860 Median : 2.000 Median :999.0 Median :0.0000
## tue:841 Mean : 2.537 Mean :960.4 Mean :0.1903
## wed:795 3rd Qu.: 3.000 3rd Qu.:999.0 3rd Qu.:0.0000
## Max. :35.000 Max. :999.0 Max. :6.0000
##
## poutcome emp.var.rate cons.price.idx cons.conf.idx
## failure : 454 Min. :-3.40000 Min. :92.20 Min. :-50.8
## nonexistent:3523 1st Qu.:-1.80000 1st Qu.:93.08 1st Qu.:-42.7
## success : 142 Median : 1.10000 Median :93.75 Median :-41.8
## Mean : 0.08497 Mean :93.58 Mean :-40.5
## 3rd Qu.: 1.40000 3rd Qu.:93.99 3rd Qu.:-36.4
## Max. : 1.40000 Max. :94.77 Max. :-26.9
##
## euribor3m nr.employed y contacted
## Min. :0.635 Min. :4964 no :3668 No :3959
## 1st Qu.:1.334 1st Qu.:5099 yes: 451 Yes: 160
## Median :4.857 Median :5191
## Mean :3.621 Mean :5166
## 3rd Qu.:4.961 3rd Qu.:5228
## Max. :5.045 Max. :5228
##
factor_vars_2 <- c("job", "education", "marital", "default", "housing", "loan",
"contact", "month", "day_of_week", "poutcome", "contacted", "y")
# Enforce consistent levels across full dataset
bank[factor_vars_2] <- lapply(bank[factor_vars_2], function(x) factor(x))
# Train/test split
set.seed(123)
train_index <- createDataPartition(bank$y, p = 0.8, list = FALSE)
bank_train <- bank[train_index, ]
bank_test <- bank[-train_index, ]
# Align factor levels in test to match train
for (col in factor_vars) {
  bank_train[[col]] <- factor(bank_train[[col]])
  bank_test[[col]] <- factor(bank_test[[col]], levels = levels(bank_train[[col]]))
}
# Drop unused levels in training data to avoid tune.svm issues
bank_train <- droplevels(bank_train)
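Because only about 11% of clients subscribe, it is worth confirming that the stratified split preserves this imbalance; a quick check (sketch) is:

# Proportion of subscribers in the training and test sets (roughly 0.11 in each)
prop.table(table(bank_train$y))
prop.table(table(bank_test$y))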
The dataset contains 4,119 observations and 21 variables. It was sourced from a Portuguese bank’s direct marketing campaign records, hosted on the UCI Machine Learning Repository.
# Drop redundant / highly correlated variables flagged from the correlation matrix
bank <- dplyr::select(bank, -c(loan, pdays, emp.var.rate, euribor3m))
# Correlation matrix of the remaining numeric predictors
bank_num <- dplyr::select_if(bank, is.numeric)
M <- cor(bank_num)
corrplot(M, method = "number")
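As a hedged alternative to reading the plot by eye, the cutoff-based screening could also be automated with caret::findCorrelation (the 0.9 cutoff below is illustrative, not the value used in this report):

# Indices of numeric columns whose pairwise correlation exceeds the cutoff
high_corr <- caret::findCorrelation(M, cutoff = 0.9)
colnames(bank_num)[high_corr]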
This model estimates the probability of a binary outcome (subscription: yes/no) using a linear combination of predictors. It’s ideal for interpretability, especially with categorical variables, and was selected based on sensitivity and AIC optimization.
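The full model corresponds to the call shown in the output below; the fitting step is:

# Fit the full logistic regression model on the training data
log.model <- glm(y ~ ., data = bank_train, family = "binomial")
summary(log.model)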
##
## Call:
## glm(formula = y ~ ., family = "binomial", data = bank_train)
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -82.889102 133.694466 -0.620 0.535265
## age 0.015766 0.007680 2.053 0.040095 *
## jobblue-collar -0.371613 0.253074 -1.468 0.141996
## jobentrepreneur -0.542077 0.434705 -1.247 0.212398
## jobhousemaid -0.334421 0.473819 -0.706 0.480313
## jobmanagement -0.532371 0.286635 -1.857 0.063266 .
## jobretired -0.401286 0.340223 -1.179 0.238207
## jobself-employed -0.632975 0.383285 -1.651 0.098647 .
## jobservices -0.309526 0.276704 -1.119 0.263304
## jobstudent -0.112725 0.400147 -0.282 0.778166
## jobtechnician -0.044615 0.212385 -0.210 0.833616
## jobunemployed 0.109456 0.370847 0.295 0.767879
## jobunknown -2.042565 1.133241 -1.802 0.071481 .
## maritalmarried 0.284361 0.230844 1.232 0.218011
## maritalsingle 0.262964 0.262770 1.001 0.316953
## maritalunknown 1.125503 1.139332 0.988 0.323220
## educationbasic.6y 0.223645 0.393150 0.569 0.569455
## educationbasic.9y 0.351046 0.303417 1.157 0.247283
## educationhigh.school 0.098308 0.298537 0.329 0.741931
## educationilliterate -10.805328 324.744111 -0.033 0.973457
## educationprofessional.course 0.356855 0.317250 1.125 0.260657
## educationuniversity.degree 0.236896 0.296871 0.798 0.424883
## educationunknown 0.159747 0.396114 0.403 0.686738
## defaultunknown -0.096918 0.201885 -0.480 0.631181
## housingunknown -0.380851 0.492710 -0.773 0.439539
## housingyes -0.022309 0.132175 -0.169 0.865969
## loanunknown NA NA NA NA
## loanyes -0.095422 0.181553 -0.526 0.599175
## contacttelephone -0.971477 0.262057 -3.707 0.000210 ***
## monthaug 0.078314 0.396536 0.197 0.843441
## monthdec 1.029095 0.660271 1.559 0.119092
## monthjul -0.077268 0.336487 -0.230 0.818378
## monthjun -0.121103 0.411405 -0.294 0.768480
## monthmar 1.791621 0.497979 3.598 0.000321 ***
## monthmay -0.227987 0.282235 -0.808 0.419211
## monthnov -0.852609 0.402426 -2.119 0.034118 *
## monthoct -0.185880 0.494767 -0.376 0.707146
## monthsep -0.124183 0.573611 -0.216 0.828603
## day_of_weekmon -0.008324 0.204123 -0.041 0.967470
## day_of_weekthu 0.013751 0.206561 0.067 0.946922
## day_of_weektue -0.027235 0.212226 -0.128 0.897887
## day_of_weekwed 0.128113 0.213417 0.600 0.548311
## campaign -0.066050 0.037120 -1.779 0.075180 .
## pdays -0.096441 0.061072 -1.579 0.114307
## previous 0.239344 0.192041 1.246 0.212648
## poutcomenonexistent 0.857172 0.318043 2.695 0.007036 **
## poutcomesuccess 1.024203 0.704225 1.454 0.145844
## emp.var.rate -1.291724 0.438249 -2.947 0.003204 **
## cons.price.idx 1.762051 0.776592 2.269 0.023271 *
## cons.conf.idx 0.044087 0.025683 1.717 0.086056 .
## euribor3m 0.320252 0.401553 0.798 0.425142
## nr.employed 0.002230 0.009572 0.233 0.815758
## contactedYes -94.995130 60.408859 -1.573 0.115826
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2277.7 on 3295 degrees of freedom
## Residual deviance: 1748.7 on 3244 degrees of freedom
## AIC: 1852.7
##
## Number of Fisher Scoring iterations: 11
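Because interpretability is a stated reason for choosing this model, the coefficients can also be read on the odds scale; a short sketch:

# Exponentiated coefficients give multiplicative effects on the odds of subscribing
round(exp(coef(log.model)), 3)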
# Predicted subscription probabilities for the held-out test set
predprob <- predict(log.model, newdata = bank_test, type = "response")
# A low 0.08 cutoff is used to favor sensitivity on the imbalanced outcome
predclass_log <- ifelse(predprob >= 0.08, "yes", "no")
predclass_log <- as.factor(predclass_log)
caret::confusionMatrix(predclass_log, bank_test$y, positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 561 34
## yes 171 56
##
## Accuracy : 0.7506
## 95% CI : (0.7196, 0.7799)
## No Information Rate : 0.8905
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.233
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.62222
## Specificity : 0.76639
## Pos Pred Value : 0.24670
## Neg Pred Value : 0.94286
## Prevalence : 0.10949
## Detection Rate : 0.06813
## Detection Prevalence : 0.27616
## Balanced Accuracy : 0.69431
##
## 'Positive' Class : yes
##
set.seed(1)
# Radial-kernel SVM: tune cost over 0.1 to 1 (gamma is effectively fixed at 0.1)
svm_model <- y ~ .
tuned <- tune.svm(svm_model, data = bank_train,
                  gamma = seq(0.1, 0.1, by = 0.01),
                  cost = seq(0.1, 1, by = 0.1))
mysvm <- svm(svm_model, data = bank_train,
             gamma = tuned$best.parameters$gamma,
             cost = tuned$best.parameters$cost)
summary(mysvm)
##
## Call:
## svm(formula = svm_model, data = bank_train, gamma = tuned$best.parameters$gamma,
## cost = tuned$best.parameters$cost)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 0.5
##
## Number of Support Vectors: 1256
##
## ( 900 356 )
##
##
## Number of Classes: 2
##
## Levels:
## no yes
# Align factor levels in the test set with those seen in training; rows with
# unseen levels become NA and are dropped before prediction
for (col in names(bank_test)) {
  if (is.factor(bank_test[[col]])) {
    bank_test[[col]] <- factor(bank_test[[col]], levels = levels(bank_train[[col]]))
  }
}
bank_test <- na.omit(bank_test)
# Predict on the aligned test set
svmpredict <- predict(mysvm, newdata = bank_test)
# Evaluate against the observed outcomes
caret::confusionMatrix(as.factor(svmpredict), as.factor(bank_test$y), positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 726 75
## yes 6 15
##
## Accuracy : 0.9015
## 95% CI : (0.879, 0.921)
## No Information Rate : 0.8905
## P-Value [Acc > NIR] : 0.1715
##
## Kappa : 0.2387
##
## Mcnemar's Test P-Value : 4.171e-14
##
## Sensitivity : 0.16667
## Specificity : 0.99180
## Pos Pred Value : 0.71429
## Neg Pred Value : 0.90637
## Prevalence : 0.10949
## Detection Rate : 0.01825
## Detection Prevalence : 0.02555
## Balanced Accuracy : 0.57923
##
## 'Positive' Class : yes
##
bank_lda <- lda(y ~ ., data = bank_train)
predclass_lda <- predict(bank_lda, newdata = bank_test)
caret::confusionMatrix(as.factor(predclass_lda$class), as.factor(bank_test$y), positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 703 57
## yes 29 33
##
## Accuracy : 0.8954
## 95% CI : (0.8724, 0.9155)
## No Information Rate : 0.8905
## P-Value [Acc > NIR] : 0.352519
##
## Kappa : 0.3787
##
## Mcnemar's Test P-Value : 0.003597
##
## Sensitivity : 0.36667
## Specificity : 0.96038
## Pos Pred Value : 0.53226
## Neg Pred Value : 0.92500
## Prevalence : 0.10949
## Detection Rate : 0.04015
## Detection Prevalence : 0.07543
## Balanced Accuracy : 0.66352
##
## 'Positive' Class : yes
##
# Stepwise selection (both directions) minimizing AIC
glm.aic <- step(log.model, scope = list(upper = log.model),
                direction = "both", test = "Chisq", trace = FALSE)
summary(glm.aic)
##
## Call:
## glm(formula = y ~ age + contact + month + campaign + pdays +
## poutcome + emp.var.rate + cons.price.idx + cons.conf.idx +
## contacted, family = "binomial", data = bank_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -25.648087 59.634360 -0.430 0.667131
## age 0.008208 0.005365 1.530 0.126029
## contacttelephone -0.858787 0.235846 -3.641 0.000271 ***
## monthaug 0.030699 0.342187 0.090 0.928513
## monthdec 0.954409 0.594020 1.607 0.108121
## monthjul 0.045506 0.321934 0.141 0.887593
## monthjun 0.159195 0.294370 0.541 0.588647
## monthmar 1.712806 0.416567 4.112 3.93e-05 ***
## monthmay -0.280285 0.256705 -1.092 0.274896
## monthnov -0.614479 0.322361 -1.906 0.056626 .
## monthoct -0.061462 0.401818 -0.153 0.878431
## monthsep -0.167482 0.427201 -0.392 0.695024
## campaign -0.069795 0.037401 -1.866 0.062025 .
## pdays -0.096534 0.060180 -1.604 0.108692
## poutcomenonexistent 0.607594 0.205761 2.953 0.003148 **
## poutcomesuccess 0.786779 0.649155 1.212 0.225511
## emp.var.rate -0.761595 0.071645 -10.630 < 2e-16 ***
## cons.price.idx 1.296849 0.174574 7.429 1.10e-13 ***
## cons.conf.idx 0.051337 0.016630 3.087 0.002021 **
## contactedYes -94.800201 59.540007 -1.592 0.111338
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2277.7 on 3295 degrees of freedom
## Residual deviance: 1773.9 on 3276 degrees of freedom
## AIC: 1813.9
##
## Number of Fisher Scoring iterations: 6
# ROCR prediction object for the AIC-selected model (training-set probabilities)
pred.aic <- prediction(predict(glm.aic, bank_train, type = "response"), bank_train$y)

# Plot sensitivity against the classification cutoff
plot(unlist(performance(pred.aic, "sens")@x.values),
     unlist(performance(pred.aic, "sens")@y.values),
     type = "l", lwd = 2, ylab = "", xlab = "Cutoff")
mtext("Sensitivity", side = 2)
mtext("Sensitivity vs. Specificity Plot for AIC Model", side = 3)

# Overlay specificity on the same axes
par(new = TRUE)
plot(unlist(performance(pred.aic, "spec")@x.values),
     unlist(performance(pred.aic, "spec")@y.values),
     type = "l", lwd = 2, col = "red", ylab = "", xlab = "Cutoff")
axis(4, at = seq(0, 1, 0.2))
mtext("Specificity", side = 4, col = "red")

# Cutoff where sensitivity and specificity intersect
min.diff <- which.min(abs(unlist(performance(pred.aic, "sens")@y.values) -
                          unlist(performance(pred.aic, "spec")@y.values)))
min.x <- unlist(performance(pred.aic, "sens")@x.values)[min.diff]
min.y <- unlist(performance(pred.aic, "spec")@y.values)[min.diff]
optimal <- min.x
abline(h = min.y, lty = 3)
abline(v = min.x, lty = 3)
text(min.x, 0, paste("optimal threshold=", round(optimal, 2)), pos = 4)
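The optimal cutoff identified above can then be applied to the held-out test set, mirroring the earlier evaluation of the full model (a sketch; optimal is the threshold found above):

# Apply the balanced cutoff from the AIC-selected model to the test set
predprob_aic <- predict(glm.aic, newdata = bank_test, type = "response")
predclass_aic <- factor(ifelse(predprob_aic >= optimal, "yes", "no"),
                        levels = c("no", "yes"))
caret::confusionMatrix(predclass_aic, bank_test$y, positive = "yes")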
# Stepwise selection with the BIC penalty (k = log(n))
glm.bic <- step(log.model, scope = list(upper = log.model),
                direction = "both", test = "Chisq", trace = FALSE,
                k = log(nrow(bank_train)))
summary(glm.bic)
##
## Call:
## glm(formula = y ~ contact + pdays + emp.var.rate + cons.price.idx +
## cons.conf.idx, family = "binomial", data = bank_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.147e+02 1.347e+01 -8.512 < 2e-16 ***
## contacttelephone -8.028e-01 1.796e-01 -4.470 7.84e-06 ***
## pdays -1.367e-03 2.188e-04 -6.248 4.15e-10 ***
## emp.var.rate -7.399e-01 5.537e-02 -13.363 < 2e-16 ***
## cons.price.idx 1.242e+00 1.450e-01 8.569 < 2e-16 ***
## cons.conf.idx 5.871e-02 1.180e-02 4.976 6.49e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2277.7 on 3295 degrees of freedom
## Residual deviance: 1835.5 on 3290 degrees of freedom
## AIC: 1847.5
##
## Number of Fisher Scoring iterations: 5
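The two stepwise models can be compared directly on their information criteria; as a quick check (sketch), the AIC-selected model has the lower AIC (1813.9 versus 1847.5), while the BIC-selected model keeps far fewer predictors:

# Side-by-side information criteria for the stepwise-selected models
AIC(glm.aic, glm.bic)
BIC(glm.aic, glm.bic)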
Decision trees segment the dataset into branches based on decision rules. They are easy to interpret but can be unstable and prone to overfitting without pruning. They help reveal key features like job type or month in this case.
set.seed(1)
# Shallow classification tree: depth capped at 5, no complexity-parameter penalty
bank_dt <- rpart(formula = y ~ .,
                 data = bank_train,
                 method = "class",
                 control = rpart.control(cp = 0, maxdepth = 5,
                                         minsplit = 5, minbucket = 10))
rpart.plot(bank_dt, type = 1, sub = NULL)
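Since unpruned trees tend to overfit, the complexity-parameter table can guide pruning; a hedged sketch of that step:

# Prune at the cp value with the lowest cross-validated error
printcp(bank_dt)
best_cp <- bank_dt$cptable[which.min(bank_dt$cptable[, "xerror"]), "CP"]
bank_dt_pruned <- prune(bank_dt, cp = best_cp)
rpart.plot(bank_dt_pruned, type = 1, sub = NULL)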
# Naive Bayes model for the ROC/AUC comparison
model_bank <- naiveBayes(y ~ ., bank_train, laplace = 1)
pred_naive <- predict(object = model_bank, newdata = bank_test, type = "class")
# Class-membership probabilities from the Naive Bayes model
prob_naive <- predict(object = model_bank, newdata = bank_test, type = "raw")
# Pair the predicted probability of "yes" with the observed label
roc_bank <- data.frame(prob = prob_naive[, 2],
                       label = as.numeric(bank_test$y == "yes"))
# Build the ROCR prediction object
prediction_roc_bank <- prediction(predictions = roc_bank$prob,
                                  labels = roc_bank$label)
# Plot the ROC curve
plot(ROCR::performance(prediction.obj = prediction_roc_bank,
                       measure = "tpr",
                       x.measure = "fpr"),
     main = "ROC Naive Bayes", col = "#519259")
abline(a = 0, b = 1)
# An AUC between 0.5 and 1 means the model separates the classes better than chance
auc_naive <- ROCR::performance(prediction.obj = prediction_roc_bank, measure = "auc")
auc_naive@y.values[[1]]
## [1] 0.7640407
The Logistic Regression model showed the highest sensitivity (~62.2%) with moderate accuracy (~75%). SVM and LDA achieved higher overall accuracy but much lower sensitivity, while the Naive Bayes model reached an AUC of about 0.76. The Decision Tree offered an interpretable structure but was less robust. Stepwise AIC selection produced a more parsimonious model (AIC 1813.9 versus 1852.7 for the full model), and the sensitivity/specificity curves identify a cutoff at which the two are balanced.
This case study aimed to identify which clients are most likely to subscribe to a term deposit, using four modeling approaches. Among these, Logistic Regression was chosen for its balance of performance and interpretability. It yielded moderate accuracy with the highest sensitivity (~62.2%), making it the most effective for identifying actual subscribers.
Overall, Logistic Regression was preferred for its transparency, its effective handling of categorical predictors, and its solid performance on this imbalanced classification task, and it is recommended for targeting clients likely to subscribe. Future work could explore ensemble techniques such as Random Forest or boosting, or address the class imbalance directly with resampling, to improve performance further.