1 Executive Summary

This study analyzes a Portuguese bank’s direct marketing campaigns to predict whether clients will subscribe to a term deposit using a logistic regression model. Drawing on customer demographics, past campaign outcomes, and economic factors, the findings highlight that prior successful contacts, employment status, and economic context significantly influence subscription likelihood. Targeted marketing informed by these insights can increase campaign effectiveness, and removing the call-duration variable, which is known only after a call ends, yields a model that predicts from information actually available before contact.


2 Introduction

This binary classification problem investigates data from direct marketing campaigns of a Portuguese banking institution. The goal is to develop a predictive model that determines whether a client will subscribe to a term deposit (variable y, values: yes/no).

The model evaluation considered Logistic Regression, SVM, LDA, and Decision Trees. Logistic Regression offered the best combination of sensitivity and interpretability, making it the chosen model for this analysis.


3 Literature Review

In similar research, Moro et al. (2014) analyzed bank marketing campaigns using 22 features and models including Logistic Regression, Decision Trees, Neural Networks, and SVMs, with metrics such as AUC and ALIFT used to evaluate performance. Neural Networks performed best, with an AUC of 0.80 and an ALIFT of 0.67.

This supports the conclusion that machine learning can significantly boost marketing performance by accurately identifying high-potential clients.

While Linear Discriminant Analysis (LDA) was considered, its distributional assumptions sit poorly with the many categorical predictors, so Logistic Regression, which models binary outcomes directly and remains interpretable, was the more suitable choice.


4 Methods

The dataset was segmented into four major variable categories:

- Demographic: age, job type, marital status, education.
- Financial: default status, housing loan, personal loan.
- Marketing and Contact: contact method, call duration, prior contacts, campaign outcomes.
- Economic Indicators: employment variation rate, consumer confidence, euribor rate.
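For reference, these groups map onto the dataset’s column names (as printed in the tibble in Section 4.0.1); the vectors below are a convenience sketch, not part of the original code:

demographic_vars <- c("age", "job", "marital", "education")
financial_vars   <- c("default", "housing", "loan")
marketing_vars   <- c("contact", "month", "day_of_week", "duration",
                      "campaign", "pdays", "previous", "poutcome")
economic_vars    <- c("emp.var.rate", "cons.price.idx", "cons.conf.idx",
                      "euribor3m", "nr.employed")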

4.0.1 Data Preprocessing

Initial cleaning involved:

- Removing the duration variable due to its post-call nature and predictive leakage.
- Recoding pdays into a binary categorical variable, contacted.
- Encoding categorical variables as factors.
- Removing highly correlated variables identified via a correlation matrix.
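The code in this report relies on the following packages; a typical preamble, assuming they are installed:

# Packages used throughout this report
library(readr)       # read_delim
library(dplyr)       # select, select_if
library(caret)       # createDataPartition, confusionMatrix, upSample
library(e1071)       # svm, tune.svm, naiveBayes
library(MASS)        # lda
library(rpart)       # decision trees
library(rpart.plot)  # tree plotting
library(ROCR)        # prediction, performance
library(corrplot)    # correlation plots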

bank <- read_delim("bank_additional.csv", delim = ";", escape_double = FALSE, trim_ws = TRUE)
bank
## # A tibble: 4,119 × 21
##      age job   marital education default housing loan  contact month day_of_week
##    <dbl> <chr> <chr>   <chr>     <chr>   <chr>   <chr> <chr>   <chr> <chr>      
##  1    30 blue… married basic.9y  no      yes     no    cellul… may   fri        
##  2    39 serv… single  high.sch… no      no      no    teleph… may   fri        
##  3    25 serv… married high.sch… no      yes     no    teleph… jun   wed        
##  4    38 serv… married basic.9y  no      unknown unkn… teleph… jun   fri        
##  5    47 admi… married universi… no      yes     no    cellul… nov   mon        
##  6    32 serv… single  universi… no      no      no    cellul… sep   thu        
##  7    32 admi… single  universi… no      yes     no    cellul… sep   mon        
##  8    41 entr… married universi… unknown yes     no    cellul… nov   mon        
##  9    31 serv… divorc… professi… no      no      no    cellul… nov   tue        
## 10    35 blue… married basic.9y  unknown no      no    teleph… may   thu        
## # ℹ 4,109 more rows
## # ℹ 11 more variables: duration <dbl>, campaign <dbl>, pdays <dbl>,
## #   previous <dbl>, poutcome <chr>, emp.var.rate <dbl>, cons.price.idx <dbl>,
## #   cons.conf.idx <dbl>, euribor3m <dbl>, nr.employed <dbl>, y <chr>
bank <- bank[-c(11)]  # drop column 11 (duration): post-call information leaks the outcome
factor_vars <- c("job", "education", "marital", "default", "housing", "loan", 
                 "contact", "month", "day_of_week", "poutcome", "y")
bank[factor_vars] <- lapply(bank[factor_vars], as.factor)
# pdays == 999 encodes "never previously contacted"; recode as a binary factor
bank$contacted <- ifelse(bank$pdays == 999, "No", "Yes")
bank$contacted <- as.factor(bank$contacted)
bank$y <- factor(bank$y, levels = c("no", "yes"))  # "yes" is the positive class
summary(bank)
##       age                 job           marital                   education   
##  Min.   :18.00   admin.     :1012   divorced: 446   university.degree  :1264  
##  1st Qu.:32.00   blue-collar: 884   married :2509   high.school        : 921  
##  Median :38.00   technician : 691   single  :1153   basic.9y           : 574  
##  Mean   :40.11   services   : 393   unknown :  11   professional.course: 535  
##  3rd Qu.:47.00   management : 324                   basic.4y           : 429  
##  Max.   :88.00   retired    : 166                   basic.6y           : 228  
##                  (Other)    : 649                   (Other)            : 168  
##     default        housing          loan           contact         month     
##  no     :3315   no     :1839   no     :3349   cellular :2652   may    :1378  
##  unknown: 803   unknown: 105   unknown: 105   telephone:1467   jul    : 711  
##  yes    :   1   yes    :2175   yes    : 665                    aug    : 636  
##                                                                jun    : 530  
##                                                                nov    : 446  
##                                                                apr    : 215  
##                                                                (Other): 203  
##  day_of_week    campaign          pdays          previous     
##  fri:768     Min.   : 1.000   Min.   :  0.0   Min.   :0.0000  
##  mon:855     1st Qu.: 1.000   1st Qu.:999.0   1st Qu.:0.0000  
##  thu:860     Median : 2.000   Median :999.0   Median :0.0000  
##  tue:841     Mean   : 2.537   Mean   :960.4   Mean   :0.1903  
##  wed:795     3rd Qu.: 3.000   3rd Qu.:999.0   3rd Qu.:0.0000  
##              Max.   :35.000   Max.   :999.0   Max.   :6.0000  
##                                                               
##         poutcome     emp.var.rate      cons.price.idx  cons.conf.idx  
##  failure    : 454   Min.   :-3.40000   Min.   :92.20   Min.   :-50.8  
##  nonexistent:3523   1st Qu.:-1.80000   1st Qu.:93.08   1st Qu.:-42.7  
##  success    : 142   Median : 1.10000   Median :93.75   Median :-41.8  
##                     Mean   : 0.08497   Mean   :93.58   Mean   :-40.5  
##                     3rd Qu.: 1.40000   3rd Qu.:93.99   3rd Qu.:-36.4  
##                     Max.   : 1.40000   Max.   :94.77   Max.   :-26.9  
##                                                                       
##    euribor3m      nr.employed     y        contacted 
##  Min.   :0.635   Min.   :4964   no :3668   No :3959  
##  1st Qu.:1.334   1st Qu.:5099   yes: 451   Yes: 160  
##  Median :4.857   Median :5191                        
##  Mean   :3.621   Mean   :5166                        
##  3rd Qu.:4.961   3rd Qu.:5228                        
##  Max.   :5.045   Max.   :5228                        
## 
factor_vars_2 <- c("job", "education", "marital", "default", "housing", "loan",
                   "contact", "month", "day_of_week", "poutcome", "contacted", "y")

# Enforce consistent levels across full dataset
bank[factor_vars_2] <- lapply(bank[factor_vars_2], function(x) factor(x))

# Train/test split
set.seed(123)
train_index <- createDataPartition(bank$y, p = 0.8, list = FALSE)
bank_train <- bank[train_index, ]
bank_test <- bank[-train_index, ]

# Align factor levels in test to match train
for (col in factor_vars_2) {  # full list, so the derived "contacted" is aligned too
  bank_train[[col]] <- factor(bank_train[[col]])
  bank_test[[col]] <- factor(bank_test[[col]], levels = levels(bank_train[[col]]))
}

# Drop unused levels in training data to avoid tune.svm issues
bank_train <- droplevels(bank_train)

5 Data

The dataset contains 4,119 observations and 21 variables. It was sourced from a Portuguese bank’s direct marketing campaign records, hosted on the UCI Machine Learning Repository.

# Drop redundant or highly correlated predictors flagged by the correlation matrix
bank <- dplyr::select(bank, -c(loan, pdays, emp.var.rate, euribor3m))
bank_num <- dplyr::select_if(bank, is.numeric)
M <- cor(bank_num)
corrplot(M, method = "number")  # correlations among the remaining numeric variables


6 Results

6.1 Logistic Regression

This model estimates the probability of a binary outcome (subscription: yes/no) from a linear combination of predictors passed through the logistic link. It is well suited to interpretable analysis, especially with categorical predictors, and was selected here on the basis of sensitivity and AIC optimization.

set.seed(1)
log.model<- glm(y ~ ., data = bank_train, family = "binomial")
summary(log.model)
## 
## Call:
## glm(formula = y ~ ., family = "binomial", data = bank_train)
## 
## Coefficients: (1 not defined because of singularities)
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -82.889102 133.694466  -0.620 0.535265    
## age                            0.015766   0.007680   2.053 0.040095 *  
## jobblue-collar                -0.371613   0.253074  -1.468 0.141996    
## jobentrepreneur               -0.542077   0.434705  -1.247 0.212398    
## jobhousemaid                  -0.334421   0.473819  -0.706 0.480313    
## jobmanagement                 -0.532371   0.286635  -1.857 0.063266 .  
## jobretired                    -0.401286   0.340223  -1.179 0.238207    
## jobself-employed              -0.632975   0.383285  -1.651 0.098647 .  
## jobservices                   -0.309526   0.276704  -1.119 0.263304    
## jobstudent                    -0.112725   0.400147  -0.282 0.778166    
## jobtechnician                 -0.044615   0.212385  -0.210 0.833616    
## jobunemployed                  0.109456   0.370847   0.295 0.767879    
## jobunknown                    -2.042565   1.133241  -1.802 0.071481 .  
## maritalmarried                 0.284361   0.230844   1.232 0.218011    
## maritalsingle                  0.262964   0.262770   1.001 0.316953    
## maritalunknown                 1.125503   1.139332   0.988 0.323220    
## educationbasic.6y              0.223645   0.393150   0.569 0.569455    
## educationbasic.9y              0.351046   0.303417   1.157 0.247283    
## educationhigh.school           0.098308   0.298537   0.329 0.741931    
## educationilliterate          -10.805328 324.744111  -0.033 0.973457    
## educationprofessional.course   0.356855   0.317250   1.125 0.260657    
## educationuniversity.degree     0.236896   0.296871   0.798 0.424883    
## educationunknown               0.159747   0.396114   0.403 0.686738    
## defaultunknown                -0.096918   0.201885  -0.480 0.631181    
## housingunknown                -0.380851   0.492710  -0.773 0.439539    
## housingyes                    -0.022309   0.132175  -0.169 0.865969    
## loanunknown                          NA         NA      NA       NA    
## loanyes                       -0.095422   0.181553  -0.526 0.599175    
## contacttelephone              -0.971477   0.262057  -3.707 0.000210 ***
## monthaug                       0.078314   0.396536   0.197 0.843441    
## monthdec                       1.029095   0.660271   1.559 0.119092    
## monthjul                      -0.077268   0.336487  -0.230 0.818378    
## monthjun                      -0.121103   0.411405  -0.294 0.768480    
## monthmar                       1.791621   0.497979   3.598 0.000321 ***
## monthmay                      -0.227987   0.282235  -0.808 0.419211    
## monthnov                      -0.852609   0.402426  -2.119 0.034118 *  
## monthoct                      -0.185880   0.494767  -0.376 0.707146    
## monthsep                      -0.124183   0.573611  -0.216 0.828603    
## day_of_weekmon                -0.008324   0.204123  -0.041 0.967470    
## day_of_weekthu                 0.013751   0.206561   0.067 0.946922    
## day_of_weektue                -0.027235   0.212226  -0.128 0.897887    
## day_of_weekwed                 0.128113   0.213417   0.600 0.548311    
## campaign                      -0.066050   0.037120  -1.779 0.075180 .  
## pdays                         -0.096441   0.061072  -1.579 0.114307    
## previous                       0.239344   0.192041   1.246 0.212648    
## poutcomenonexistent            0.857172   0.318043   2.695 0.007036 ** 
## poutcomesuccess                1.024203   0.704225   1.454 0.145844    
## emp.var.rate                  -1.291724   0.438249  -2.947 0.003204 ** 
## cons.price.idx                 1.762051   0.776592   2.269 0.023271 *  
## cons.conf.idx                  0.044087   0.025683   1.717 0.086056 .  
## euribor3m                      0.320252   0.401553   0.798 0.425142    
## nr.employed                    0.002230   0.009572   0.233 0.815758    
## contactedYes                 -94.995130  60.408859  -1.573 0.115826    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2277.7  on 3295  degrees of freedom
## Residual deviance: 1748.7  on 3244  degrees of freedom
## AIC: 1852.7
## 
## Number of Fisher Scoring iterations: 11
predprob = predict.glm(log.model, newdata = bank_test, type = "response")
# Classify with a lowered cutoff (0.08, near the ~11% "yes" prevalence); the
# default 0.5 would sacrifice nearly all sensitivity on this imbalanced data.
predclass_log = ifelse(predprob >= 0.08, "yes", "no")
predclass_log <- as.factor(predclass_log)
caret::confusionMatrix(as.factor(predclass_log), as.factor(bank_test$y), positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  561  34
##        yes 171  56
##                                           
##                Accuracy : 0.7506          
##                  95% CI : (0.7196, 0.7799)
##     No Information Rate : 0.8905          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.233           
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.62222         
##             Specificity : 0.76639         
##          Pos Pred Value : 0.24670         
##          Neg Pred Value : 0.94286         
##              Prevalence : 0.10949         
##          Detection Rate : 0.06813         
##    Detection Prevalence : 0.27616         
##       Balanced Accuracy : 0.69431         
##                                           
##        'Positive' Class : yes             
## 
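To make the description at the top of this section concrete: the fitted model’s predicted probability is the inverse logit of its linear predictor. A minimal check using objects already defined above (illustrative, not part of the original analysis):

eta <- predict(log.model, newdata = bank_test, type = "link")  # linear predictor X %*% beta
p_manual <- 1 / (1 + exp(-eta))                                # inverse logit
head(cbind(p_manual, predprob))                                # matches type = "response"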

6.2 SVM

A support vector machine with a radial kernel was tuned with tune.svm. Note that the gamma grid below, seq(0.1, 0.1, by = 0.01), contains the single value 0.1, so gamma is effectively fixed and only cost is tuned.

set.seed(1)
svm_model <- y ~ .
tuned <- tune.svm(svm_model, data = bank_train,
                  gamma = seq(0.1, 0.1, by = 0.01),  # single value: gamma fixed at 0.1
                  cost  = seq(0.1, 1, by = 0.1))

mysvm <- svm(svm_model, data = bank_train, 
             gamma = tuned$best.parameters$gamma, 
             cost = tuned$best.parameters$cost)
summary(mysvm)
## 
## Call:
## svm(formula = svm_model, data = bank_train, gamma = tuned$best.parameters$gamma, 
##     cost = tuned$best.parameters$cost)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  0.5 
## 
## Number of Support Vectors:  1256
## 
##  ( 900 356 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  no yes
svmpredict = predict(mysvm, bank_test, type = 'response')
length(svmpredict)
## [1] 822
length(bank_test$y)
## [1] 823
# One test row carries a factor level unseen in training, so predict() returned
# 822 values for 823 rows. Re-align factor levels to the training data and drop
# the resulting NA row before comparing predictions to labels.
for (col in names(bank_test)) {
  if (is.factor(bank_test[[col]])) {
    bank_test[[col]] <- factor(bank_test[[col]], levels = levels(bank_train[[col]]))
  }
}
bank_test <- na.omit(bank_test)

# Then predict
svmpredict <- predict(mysvm, newdata = bank_test)

# Now confusion matrix
caret::confusionMatrix(as.factor(svmpredict), as.factor(bank_test$y), positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  726  75
##        yes   6  15
##                                         
##                Accuracy : 0.9015        
##                  95% CI : (0.879, 0.921)
##     No Information Rate : 0.8905        
##     P-Value [Acc > NIR] : 0.1715        
##                                         
##                   Kappa : 0.2387        
##                                         
##  Mcnemar's Test P-Value : 4.171e-14     
##                                         
##             Sensitivity : 0.16667       
##             Specificity : 0.99180       
##          Pos Pred Value : 0.71429       
##          Neg Pred Value : 0.90637       
##              Prevalence : 0.10949       
##          Detection Rate : 0.01825       
##    Detection Prevalence : 0.02555       
##       Balanced Accuracy : 0.57923       
##                                         
##        'Positive' Class : yes           
## 
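The SVM’s high accuracy but very low sensitivity (16.7%) largely reflects the class imbalance. One possible remedy, sketched below but not part of the original tuning, is the class.weights argument of e1071’s svm(), which penalizes errors on the minority class more heavily:

# Sketch: cost-sensitive SVM. The weight of 8 for "yes" is illustrative,
# roughly the inverse prevalence of the positive class.
mysvm_w <- svm(svm_model, data = bank_train,
               gamma = tuned$best.parameters$gamma,
               cost  = tuned$best.parameters$cost,
               class.weights = c(no = 1, yes = 8))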

6.3 LDA

Linear discriminant analysis models the predictors within each class as Gaussian with a shared covariance matrix and assigns the class with the highest posterior probability; it is fit here on the same training split for comparison.

bank_lda <- lda(y ~ ., data = bank_train)
predclass_lda <- predict(bank_lda, newdata = bank_test)
caret::confusionMatrix(as.factor(predclass_lda$class), as.factor(bank_test$y), positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  703  57
##        yes  29  33
##                                           
##                Accuracy : 0.8954          
##                  95% CI : (0.8724, 0.9155)
##     No Information Rate : 0.8905          
##     P-Value [Acc > NIR] : 0.352519        
##                                           
##                   Kappa : 0.3787          
##                                           
##  Mcnemar's Test P-Value : 0.003597        
##                                           
##             Sensitivity : 0.36667         
##             Specificity : 0.96038         
##          Pos Pred Value : 0.53226         
##          Neg Pred Value : 0.92500         
##              Prevalence : 0.10949         
##          Detection Rate : 0.04015         
##    Detection Prevalence : 0.07543         
##       Balanced Accuracy : 0.66352         
##                                           
##        'Positive' Class : yes             
## 

6.4 Stepwise Selection: AIC and BIC Models

Stepwise selection in both directions was applied to the full logistic model, first under the AIC penalty and then under the stricter BIC penalty (k = log(n)), to obtain more parsimonious specifications.

# Stepwise selection (both directions) under the AIC penalty
glm.aic <- step(log.model, scope = list(upper = log.model),
                direction = "both", test = "Chisq", trace = F)
summary(glm.aic)
## 
## Call:
## glm(formula = y ~ age + contact + month + campaign + pdays + 
##     poutcome + emp.var.rate + cons.price.idx + cons.conf.idx + 
##     contacted, family = "binomial", data = bank_train)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -25.648087  59.634360  -0.430 0.667131    
## age                   0.008208   0.005365   1.530 0.126029    
## contacttelephone     -0.858787   0.235846  -3.641 0.000271 ***
## monthaug              0.030699   0.342187   0.090 0.928513    
## monthdec              0.954409   0.594020   1.607 0.108121    
## monthjul              0.045506   0.321934   0.141 0.887593    
## monthjun              0.159195   0.294370   0.541 0.588647    
## monthmar              1.712806   0.416567   4.112 3.93e-05 ***
## monthmay             -0.280285   0.256705  -1.092 0.274896    
## monthnov             -0.614479   0.322361  -1.906 0.056626 .  
## monthoct             -0.061462   0.401818  -0.153 0.878431    
## monthsep             -0.167482   0.427201  -0.392 0.695024    
## campaign             -0.069795   0.037401  -1.866 0.062025 .  
## pdays                -0.096534   0.060180  -1.604 0.108692    
## poutcomenonexistent   0.607594   0.205761   2.953 0.003148 ** 
## poutcomesuccess       0.786779   0.649155   1.212 0.225511    
## emp.var.rate         -0.761595   0.071645 -10.630  < 2e-16 ***
## cons.price.idx        1.296849   0.174574   7.429 1.10e-13 ***
## cons.conf.idx         0.051337   0.016630   3.087 0.002021 ** 
## contactedYes        -94.800201  59.540007  -1.592 0.111338    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2277.7  on 3295  degrees of freedom
## Residual deviance: 1773.9  on 3276  degrees of freedom
## AIC: 1813.9
## 
## Number of Fisher Scoring iterations: 6
pred.aic <- prediction(predict(glm.aic, bank_train, type = "response"), bank_train$y)

#AIC sensitivity and specificity
plot(unlist(performance(pred.aic, "sens")@x.values),
     unlist(performance(pred.aic, "sens")@y.values),
     type = "l", lwd = 2, ylab = "", xlab = "Cutoff")
mtext("Sensitivity", side = 2)
mtext("Sensitivity vs. Specificity Plot for AIC Model", side = 3)

# AIC second specificity in same plot
par(new = TRUE)
plot(unlist(performance(pred.aic, "spec")@x.values),
     unlist(performance(pred.aic, "spec")@y.values),
     type = "l", lwd = 2, col = "red", ylab = "", xlab = "Cutoff")
axis(4, at = seq(0, 1, 0.2))
mtext("Specificity", side = 4, col = "red")
# Locate the cutoff where the sensitivity and specificity curves cross
min.diff <- which.min(abs(unlist(performance(pred.aic, "sens")@y.values) -
                          unlist(performance(pred.aic, "spec")@y.values)))
min.x <- unlist(performance(pred.aic, "sens")@x.values)[min.diff]
min.y <- unlist(performance(pred.aic, "spec")@y.values)[min.diff]
optimal <- min.x
abline(h = min.y, lty = 3)
abline(v = min.x, lty = 3)
text(min.x, 0, paste("optimal threshold=", round(optimal, 2)), pos = 4)
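With the balanced cutoff identified above, one could re-score the held-out test set; a sketch, not part of the original analysis:

# Sketch: apply the balanced cutoff from the plot to the test set.
predprob_aic  <- predict(glm.aic, newdata = bank_test, type = "response")
predclass_aic <- factor(ifelse(predprob_aic >= optimal, "yes", "no"),
                        levels = c("no", "yes"))
caret::confusionMatrix(predclass_aic, bank_test$y, positive = "yes")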

# Stepwise selection under the BIC penalty (k = log(n))
glm.bic <- step(log.model, scope = list(upper = log.model),
                direction = "both", test = "Chisq", trace = F,
                k = log(nrow(bank_train)))
summary(glm.bic)
## 
## Call:
## glm(formula = y ~ contact + pdays + emp.var.rate + cons.price.idx + 
##     cons.conf.idx, family = "binomial", data = bank_train)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -1.147e+02  1.347e+01  -8.512  < 2e-16 ***
## contacttelephone -8.028e-01  1.796e-01  -4.470 7.84e-06 ***
## pdays            -1.367e-03  2.188e-04  -6.248 4.15e-10 ***
## emp.var.rate     -7.399e-01  5.537e-02 -13.363  < 2e-16 ***
## cons.price.idx    1.242e+00  1.450e-01   8.569  < 2e-16 ***
## cons.conf.idx     5.871e-02  1.180e-02   4.976 6.49e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2277.7  on 3295  degrees of freedom
## Residual deviance: 1835.5  on 3290  degrees of freedom
## AIC: 1847.5
## 
## Number of Fisher Scoring iterations: 5

6.5 Decision Tree

Decision trees segment the dataset into branches based on decision rules. They are easy to interpret but can be unstable and prone to overfitting without pruning; here they help reveal influential features such as job type and month.

set.seed(1)
bank_dt <- rpart(formula = y ~., 
                  data = bank_train, 
                  method = "class",
                  control = rpart.control(cp = 0, maxdepth =5, 
                                          minsplit = 5, minbucket = 10))
rpart.plot(bank_dt, type=1, sub = NULL)

plotcp(bank_dt)
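Since unpruned trees overfit, a standard follow-up, consistent with the plotcp output above, is to prune at the complexity parameter that minimizes cross-validated error; a sketch:

# Sketch: prune at the cp minimizing cross-validated error (cf. plotcp above).
best_cp <- bank_dt$cptable[which.min(bank_dt$cptable[, "xerror"]), "CP"]
bank_dt_pruned <- rpart::prune(bank_dt, cp = best_cp)
rpart.plot(bank_dt_pruned, type = 1, sub = NULL)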

6.6 Naive Bayes

As an additional benchmark, a naive Bayes classifier is fit to produce class probabilities for an ROC curve and an AUC estimate.

#AUC and ROC model and predictions
model_bank <- naiveBayes(y ~ ., data = bank_train, laplace = 1)
pred_naive <- predict(object = model_bank, newdata=bank_test, type="class")

# Create a probability prediction from `model_bank`
prob_naive<- predict(object = model_bank, newdata=bank_test, type="raw")

# Create a prob and label from prob_naive. 
roc_bank <- data.frame(prob=prob_naive[,2],
                       label=as.numeric(bank_test$y=="yes"))

# Create an object prediction 
prediction_roc_bank <- prediction(predictions = roc_bank$prob, 
                                  labels = roc_bank$label) 
# Create an ROC plot 
plot(ROCR::performance(prediction.obj = prediction_roc_bank, 
                       measure = "tpr", 
                       x.measure = "fpr"),main = "ROC Naive Bayes", col="#519259")
abline(a = 0, b = 1)

# AUC between 0.5 and 1, so the model can distinguish positive from negative classes
auc_naive  <- ROCR::performance(prediction.obj=prediction_roc_bank, measure = "auc")
auc_naive@y.values[[1]]
## [1] 0.7640407
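For a like-for-like comparison, the same ROCR recipe can be applied to the logistic model’s test-set probabilities; a sketch, not computed in the original analysis:

# Sketch: AUC for the logistic model on the cleaned test set.
predprob_test <- predict(log.model, newdata = bank_test, type = "response")
pred_roc_log  <- prediction(predprob_test, as.numeric(bank_test$y == "yes"))
ROCR::performance(pred_roc_log, measure = "auc")@y.values[[1]]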

7 Discussion

The Logistic Regression model showed the highest sensitivity (62.2%) at the 0.08 cutoff, with accuracy of about 75%. SVM (accuracy 90.2%, sensitivity 16.7%) and LDA (accuracy 89.5%, sensitivity 36.7%) achieved high accuracy largely by favoring the majority class, missing most actual subscribers. The Decision Tree exposed an interpretable rule structure but was less robust, the naive Bayes benchmark reached an AUC of 0.764, and stepwise AIC selection produced a more parsimonious model (AIC 1813.9 vs. 1852.7) whose sensitivity/specificity curves identified a balanced classification threshold.



8 Conclusions

This case study aimed to identify which clients are most likely to subscribe to a term deposit, using four modeling approaches. Among these, Logistic Regression was chosen for its balance of performance and interpretability: it yielded moderate accuracy with the highest sensitivity (~62.2%), making it the most effective at identifying actual subscribers.

Overall, Logistic Regression is recommended for targeting likely subscribers thanks to its transparency, effective handling of categorical predictors, and solid performance on this imbalanced classification task. Future work could explore ensemble techniques such as Random Forest or boosting, or address the class imbalance directly with resampling.
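As a concrete pointer for that future work, caret ships a simple resampling helper; a sketch, not run in this study:

# Sketch: oversample the minority class before refitting (caret::upSample).
up_train <- caret::upSample(x = bank_train[, setdiff(names(bank_train), "y")],
                            y = bank_train$y, yname = "y")
table(up_train$y)  # classes are now balanced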


9 References

  1. Moro, S., Laureano, R., & Cortez, P. (2014). Expert Systems with Applications, 41(11), 4720–4731.
  2. UCI Machine Learning Repository: Bank Marketing Data Set. https://archive.ics.uci.edu/ml/datasets/Bank+Marketing