1 Background

Telemarketing is one of the ways banks advertise their products, but it is intrusive and annoying to the customers who receive the unwanted calls. I have never been a telemarketer, but I have received many telemarketing calls from unknown numbers, and as soon as I realize the caller is a telemarketer I end the call after a brief exchange, whether or not the person on the other end has finished describing their offer. My impression is that telemarketers (of credit without collateral, for instance) often target people who can pay for the product but have no need for it, while conversely the people who need it usually cannot pay. A study using data from a bank in Portugal aimed to predict which customers would accept the term deposit product offered by that bank’s telemarketers.
After the initial EDA and the additional insights from the two earlier posts, we are finally ready to do the actual prediction of customers’ acceptance or rejection of the bank’s term deposit offers made via telemarketing.

2 Data Description

The dataset and its description can be found at this link.

There are four datasets in the link above:

  • bank.csv
  • bank-full.csv
  • bank-additional.csv
  • bank-additional-full.csv

For the initial EDA and additional insights we used bank.csv; for the actual prediction, we will use bank-full.csv.

The descriptions of the columns, adjusted to match the levels actually present in bank-full.csv, are:

  1. age (numeric)
  2. job : type of job (categorical: ‘admin.’,‘blue-collar’,‘entrepreneur’,‘housemaid’,‘management’,‘retired’,‘self-employed’,‘services’,‘student’,‘technician’,‘unemployed’,‘unknown’)
  3. marital : marital status (categorical: ‘divorced’,‘married’,‘single’; note: ‘divorced’ means divorced or widowed)
  4. education (categorical: ‘primary’,‘secondary’,‘tertiary’,‘unknown’)
  5. default: has credit in default? (binary: ‘no’,‘yes’)
  6. balance: not in the posted dataset description, but interpreted as each customer’s bank balance at the time of the current campaign (numeric)
  7. housing: has housing loan? (binary: ‘no’,‘yes’)
  8. loan: has personal loan? (binary: ‘no’,‘yes’)
  9. contact: contact communication type (categorical: ‘cellular’,‘telephone’,‘unknown’)
  10. day: last contact day of the month (numeric)
  11. month: last contact month of year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’)
  12. duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=‘no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
  13. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  14. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; -1 means client was not previously contacted)
  15. previous: number of contacts performed before this campaign and for this client (numeric)
  16. poutcome: outcome of the previous marketing campaign (categorical: ‘failure’,‘other’,‘success’,‘unknown’)
  17. y: has the client subscribed a term deposit? (binary: ‘yes’,‘no’)

3 Data Preprocessing

# load the libraries used throughout this post
library(dplyr)  # data wrangling and glimpse()
library(GGally) # ggcorr()
library(caret)  # confusionMatrix()
library(car)    # vif()
library(class)  # knn()

# read in bank-full.csv
bank_full <- read.csv("data_input/bank-full.csv", sep = ";", stringsAsFactors = TRUE)

# inspect the first six rows of bank_full 
head(bank_full)
# inspect the data types of bank_full
glimpse(bank_full)
#> Rows: 45,211
#> Columns: 17
#> $ age       <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, …
#> $ job       <fct> management, technician, entrepreneur, blue-collar, unknown, …
#> $ marital   <fct> married, single, married, married, single, married, single, …
#> $ education <fct> tertiary, secondary, secondary, unknown, unknown, tertiary, …
#> $ default   <fct> no, no, no, no, no, no, no, yes, no, no, no, no, no, no, no,…
#> $ balance   <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
#> $ housing   <fct> yes, yes, yes, yes, no, yes, yes, yes, yes, yes, yes, yes, y…
#> $ loan      <fct> no, no, yes, no, no, no, yes, no, no, no, no, no, no, no, no…
#> $ contact   <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow…
#> $ day       <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
#> $ month     <fct> may, may, may, may, may, may, may, may, may, may, may, may, …
#> $ duration  <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
#> $ campaign  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ pdays     <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
#> $ previous  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ poutcome  <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow…
#> $ y         <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, …

The data has 45,211 observations and 17 columns. Because there are now roughly ten times more rows than in bank.csv, we are going to view the summary and several boxplots to check whether our EDA findings on bank.csv hold for bank-full.csv.

#check if there is any NA
anyNA(bank_full)
#> [1] FALSE
# summary of bank_full
summary(bank_full)
#>       age                 job           marital          education    
#>  Min.   :18.00   blue-collar:9732   divorced: 5207   primary  : 6851  
#>  1st Qu.:33.00   management :9458   married :27214   secondary:23202  
#>  Median :39.00   technician :7597   single  :12790   tertiary :13301  
#>  Mean   :40.94   admin.     :5171                    unknown  : 1857  
#>  3rd Qu.:48.00   services   :4154                                     
#>  Max.   :95.00   retired    :2264                                     
#>                  (Other)    :6835                                     
#>  default        balance       housing      loan            contact     
#>  no :44396   Min.   : -8019   no :20081   no :37967   cellular :29285  
#>  yes:  815   1st Qu.:    72   yes:25130   yes: 7244   telephone: 2906  
#>              Median :   448                           unknown  :13020  
#>              Mean   :  1362                                            
#>              3rd Qu.:  1428                                            
#>              Max.   :102127                                            
#>                                                                        
#>       day            month          duration         campaign     
#>  Min.   : 1.00   may    :13766   Min.   :   0.0   Min.   : 1.000  
#>  1st Qu.: 8.00   jul    : 6895   1st Qu.: 103.0   1st Qu.: 1.000  
#>  Median :16.00   aug    : 6247   Median : 180.0   Median : 2.000  
#>  Mean   :15.81   jun    : 5341   Mean   : 258.2   Mean   : 2.764  
#>  3rd Qu.:21.00   nov    : 3970   3rd Qu.: 319.0   3rd Qu.: 3.000  
#>  Max.   :31.00   apr    : 2932   Max.   :4918.0   Max.   :63.000  
#>                  (Other): 6060                                    
#>      pdays          previous           poutcome       y        
#>  Min.   : -1.0   Min.   :  0.0000   failure: 4901   no :39922  
#>  1st Qu.: -1.0   1st Qu.:  0.0000   other  : 1840   yes: 5289  
#>  Median : -1.0   Median :  0.0000   success: 1511              
#>  Mean   : 40.2   Mean   :  0.5803   unknown:36959              
#>  3rd Qu.: -1.0   3rd Qu.:  0.0000                              
#>  Max.   :871.0   Max.   :275.0000                              
#> 

From the summary, it can be seen that the target variable is highly imbalanced. Let us quantify the imbalance using prop.table().

prop.table(table(bank_full$y))
#> 
#>        no       yes 
#> 0.8830152 0.1169848

About 88.3% of customers rejected the offer and only 11.7% accepted it.
Now, if we compare the means and medians of the numeric columns, balance, pdays, campaign, and duration are very likely to have many outliers, while previous is likely to have fewer.
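To eyeball these outliers, here is a minimal sketch of how such boxplots could be drawn, assuming the tidyr and ggplot2 packages are available:

library(ggplot2)
library(tidyr)

# one boxplot per numeric column; free scales since the ranges differ wildly
bank_full %>%
  select_if(is.numeric) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(y = value)) +
  geom_boxplot() +
  facet_wrap(~variable, scales = "free_y")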
Most categorical columns display all of their levels with counts in the summary, except two: job and month. Let us check the counts for these two columns.

table(bank_full$job)
#> 
#>        admin.   blue-collar  entrepreneur     housemaid    management 
#>          5171          9732          1487          1240          9458 
#>       retired self-employed      services       student    technician 
#>          2264          1579          4154           938          7597 
#>    unemployed       unknown 
#>          1303           288
table(bank_full$month)
#> 
#>   apr   aug   dec   feb   jan   jul   jun   mar   may   nov   oct   sep 
#>  2932  6247   214  2649  1403  6895  5341   477 13766  3970   738   579

Based on the summary and the count tables above, no category level has just 3 or fewer observations, so we do not need to filter out sparse levels and can move on to the train-test split.
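For reference only, here is a hedged sketch of how sparse levels could be dropped if we ever needed to (the threshold of 3 and the name bank_trimmed are hypothetical):

# hypothetical: drop job levels with 3 or fewer observations
bank_trimmed <- bank_full %>%
  group_by(job) %>%
  filter(n() > 3) %>%
  ungroup() %>%
  droplevels()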

4 EDA For Prediction

Before we proceed to cross-validation or the train-test split, it is good practice to check for correlations between the numeric columns of bank_full.

ggcorr(bank_full, label=TRUE)

The only moderately strong correlation is between previous and pdays; all the other pairs of columns are only weakly correlated.
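If exact numbers are wanted alongside the plot, the correlation matrix can also be printed directly; a quick sketch:

# numeric correlation matrix, rounded for readability
bank_full %>%
  select_if(is.numeric) %>%
  cor() %>%
  round(2)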

5 Cross-Validation Before Prediction

For prediction purposes, the duration of a call is not known beforehand, so we will remove this predictor before splitting.

bank_full1 <- bank_full %>% select(-duration)
library(rsample)
set.seed(123)
init <- initial_split(data = bank_full1, prop = 0.8, strata = y)
train_bank <- training(init)
test_bank <- testing(init)

6 Prediction Using Logistic Regression

The first model we are going to use to predict whether customers will accept or reject the telemarketing offer is logistic regression. This model is simpler and generally less powerful than the next model we are going to try, k-nearest neighbors (kNN), but it is also more interpretable.

model_logreg <- glm(y ~., data= train_bank, family = "binomial")
summary(model_logreg)
#> 
#> Call:
#> glm(formula = y ~ ., family = "binomial", data = train_bank)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.3235  -0.4819  -0.3793  -0.2435   3.5757  
#> 
#> Coefficients:
#>                        Estimate   Std. Error z value             Pr(>|z|)    
#> (Intercept)        -1.252715940  0.183500028  -6.827   0.0000000000086836 ***
#> age                 0.000226180  0.002199845   0.103             0.918109    
#> jobblue-collar     -0.154905842  0.072791657  -2.128             0.033331 *  
#> jobentrepreneur    -0.162872161  0.124125559  -1.312             0.189467    
#> jobhousemaid       -0.256082163  0.132224933  -1.937             0.052780 .  
#> jobmanagement      -0.038654212  0.073553625  -0.526             0.599219    
#> jobretired          0.407362719  0.097379874   4.183   0.0000287392377694 ***
#> jobself-employed   -0.162305976  0.113530334  -1.430             0.152824    
#> jobservices        -0.060723417  0.083068127  -0.731             0.464775    
#> jobstudent          0.271615929  0.111018694   2.447             0.014422 *  
#> jobtechnician      -0.090513505  0.069490288  -1.303             0.192734    
#> jobunemployed       0.121109647  0.108356454   1.118             0.263697    
#> jobunknown         -0.161202121  0.233077486  -0.692             0.489173    
#> maritalmarried     -0.159798235  0.058601192  -2.727             0.006394 ** 
#> maritalsingle       0.144166535  0.066806826   2.158             0.030931 *  
#> educationsecondary  0.132479032  0.063391922   2.090             0.036632 *  
#> educationtertiary   0.317410942  0.074049413   4.286   0.0000181530801310 ***
#> educationunknown    0.180781138  0.103823192   1.741             0.081641 .  
#> defaultyes         -0.112750773  0.163534416  -0.689             0.490533    
#> balance             0.000015866  0.000005007   3.169             0.001529 ** 
#> housingyes         -0.500745025  0.043115630 -11.614 < 0.0000000000000002 ***
#> loanyes            -0.452128779  0.060558669  -7.466   0.0000000000000827 ***
#> contacttelephone   -0.244521313  0.073049592  -3.347             0.000816 ***
#> contactunknown     -1.370121616  0.071176408 -19.250 < 0.0000000000000002 ***
#> day                 0.003288564  0.002479243   1.326             0.184694    
#> monthaug           -0.812319254  0.078188450 -10.389 < 0.0000000000000002 ***
#> monthdec            0.761792389  0.179955241   4.233   0.0000230356056245 ***
#> monthfeb           -0.397947745  0.089899260  -4.427   0.0000095731578611 ***
#> monthjan           -1.140465739  0.121884380  -9.357 < 0.0000000000000002 ***
#> monthjul           -0.660700780  0.076060998  -8.686 < 0.0000000000000002 ***
#> monthjun            0.205605533  0.093299108   2.204             0.027544 *  
#> monthmar            1.055639338  0.124771628   8.461 < 0.0000000000000002 ***
#> monthmay           -0.445097925  0.071903955  -6.190   0.0000000006009821 ***
#> monthnov           -0.862126512  0.084010373 -10.262 < 0.0000000000000002 ***
#> monthoct            0.686586812  0.109652456   6.261   0.0000000003813373 ***
#> monthsep            0.638271761  0.122486179   5.211   0.0000001878561988 ***
#> campaign           -0.081849377  0.009408080  -8.700 < 0.0000000000000002 ***
#> pdays              -0.000114825  0.000310349  -0.370             0.711392    
#> previous            0.025999418  0.009695453   2.682             0.007327 ** 
#> poutcomeother       0.192934216  0.090256129   2.138             0.032547 *  
#> poutcomesuccess     2.199595832  0.084013625  26.181 < 0.0000000000000002 ***
#> poutcomeunknown     0.051268131  0.096910267   0.529             0.596787    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 26108  on 36169  degrees of freedom
#> Residual deviance: 21711  on 36128  degrees of freedom
#> AIC: 21795
#> 
#> Number of Fisher Scoring iterations: 6

Logistic regression produces probabilities, which need to be converted to class labels with an ifelse statement. Here we will convert probabilities of 0.5 and above to “yes” and probabilities below 0.5 to “no”. The labels then need to be converted to factors so that the predicted labels can be compared with the true labels to measure the performance of the models.

bank_pred_logreg1 <- predict(model_logreg, newdata = test_bank ,type = "response")
predictions1 <- as.factor(ifelse(bank_pred_logreg1<0.5, "no", "yes"))
head(predictions1)
#>  2  5  7 10 16 17 
#> no no no no no no 
#> Levels: no yes

The metrics we are going to use to evaluate our model are accuracy, recall/sensitivity, and precision/positive predictive value, computed via the confusionMatrix() function from the caret package. Of the three, recall will be our primary metric because it captures the false negatives: customers who are predicted to reject the offer but would actually accept it. This matters in the bank’s business context because a false negative means not contacting a customer who would have accepted.

confusionMatrix(predictions1, reference = test_bank$y, positive = "yes")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   no  yes
#>        no  7873  874
#>        yes  111  183
#>                                                
#>                Accuracy : 0.8911               
#>                  95% CI : (0.8844, 0.8974)     
#>     No Information Rate : 0.8831               
#>     P-Value [Acc > NIR] : 0.009147             
#>                                                
#>                   Kappa : 0.2318               
#>                                                
#>  Mcnemar's Test P-Value : < 0.00000000000000022
#>                                                
#>             Sensitivity : 0.17313              
#>             Specificity : 0.98610              
#>          Pos Pred Value : 0.62245              
#>          Neg Pred Value : 0.90008              
#>              Prevalence : 0.11691              
#>          Detection Rate : 0.02024              
#>    Detection Prevalence : 0.03252              
#>       Balanced Accuracy : 0.57961              
#>                                                
#>        'Positive' Class : yes                  
#> 
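To make these metrics concrete, here is how they fall out of the confusion matrix above by hand (the TP/FP/FN/TN names are ours, not caret’s):

# counts from the confusion matrix, with "yes" as the positive class
TP <- 183; FP <- 111; FN <- 874; TN <- 7873

(TP + TN) / (TP + TN + FP + FN)  # accuracy  ~ 0.8911
TP / (TP + FN)                   # recall    ~ 0.1731 (sensitivity)
TP / (TP + FP)                   # precision ~ 0.6224 (pos pred value)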

Now, what happens if feature selection is applied to the logistic regression model using step()?

model_logreg2 <- step(model_logreg, direction = "backward")
#> Start:  AIC=21794.54
#> y ~ age + job + marital + education + default + balance + housing + 
#>     loan + contact + day + month + campaign + pdays + previous + 
#>     poutcome
#> 
#>             Df Deviance   AIC
#> - age        1    21711 21793
#> - pdays      1    21711 21793
#> - default    1    21711 21793
#> - day        1    21712 21794
#> <none>            21710 21794
#> - previous   1    21717 21799
#> - balance    1    21720 21802
#> - education  3    21731 21809
#> - job       11    21772 21834
#> - marital    2    21757 21837
#> - loan       1    21771 21853
#> - campaign   1    21804 21886
#> - housing    1    21847 21929
#> - contact    2    22117 22197
#> - month     11    22430 22492
#> - poutcome   3    22666 22744
#> 
#> Step:  AIC=21792.56
#> y ~ job + marital + education + default + balance + housing + 
#>     loan + contact + day + month + campaign + pdays + previous + 
#>     poutcome
#> 
#>             Df Deviance   AIC
#> - pdays      1    21711 21791
#> - default    1    21711 21791
#> - day        1    21712 21792
#> <none>            21711 21793
#> - previous   1    21717 21797
#> - balance    1    21720 21800
#> - education  3    21731 21807
#> - job       11    21779 21839
#> - marital    2    21763 21841
#> - loan       1    21771 21851
#> - campaign   1    21804 21884
#> - housing    1    21848 21928
#> - contact    2    22118 22196
#> - month     11    22430 22490
#> - poutcome   3    22667 22743
#> 
#> Step:  AIC=21790.69
#> y ~ job + marital + education + default + balance + housing + 
#>     loan + contact + day + month + campaign + previous + poutcome
#> 
#>             Df Deviance   AIC
#> - default    1    21711 21789
#> - day        1    21712 21790
#> <none>            21711 21791
#> - previous   1    21717 21795
#> - balance    1    21720 21798
#> - education  3    21732 21806
#> - job       11    21780 21838
#> - marital    2    21763 21839
#> - loan       1    21771 21849
#> - campaign   1    21804 21882
#> - housing    1    21850 21928
#> - contact    2    22118 22194
#> - month     11    22430 22488
#> - poutcome   3    22687 22761
#> 
#> Step:  AIC=21789.19
#> y ~ job + marital + education + balance + housing + loan + contact + 
#>     day + month + campaign + previous + poutcome
#> 
#>             Df Deviance   AIC
#> - day        1    21713 21789
#> <none>            21711 21789
#> - previous   1    21718 21794
#> - balance    1    21721 21797
#> - education  3    21732 21804
#> - job       11    21780 21836
#> - marital    2    21763 21837
#> - loan       1    21773 21849
#> - campaign   1    21805 21881
#> - housing    1    21850 21926
#> - contact    2    22119 22193
#> - month     11    22432 22488
#> - poutcome   3    22688 22760
#> 
#> Step:  AIC=21788.97
#> y ~ job + marital + education + balance + housing + loan + contact + 
#>     month + campaign + previous + poutcome
#> 
#>             Df Deviance   AIC
#> <none>            21713 21789
#> - previous   1    21720 21794
#> - balance    1    21723 21797
#> - education  3    21734 21804
#> - job       11    21782 21836
#> - marital    2    21765 21837
#> - loan       1    21775 21849
#> - campaign   1    21805 21879
#> - housing    1    21855 21929
#> - contact    2    22119 22191
#> - month     11    22441 22495
#> - poutcome   3    22690 22760
summary(model_logreg2)
#> 
#> Call:
#> glm(formula = y ~ job + marital + education + balance + housing + 
#>     loan + contact + month + campaign + previous + poutcome, 
#>     family = "binomial", data = train_bank)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.3321  -0.4822  -0.3793  -0.2435   3.5617  
#> 
#> Coefficients:
#>                        Estimate   Std. Error z value             Pr(>|z|)    
#> (Intercept)        -1.212605816  0.122915860  -9.865 < 0.0000000000000002 ***
#> jobblue-collar     -0.157422432  0.072733927  -2.164             0.030437 *  
#> jobentrepreneur    -0.166085332  0.124021276  -1.339             0.180516    
#> jobhousemaid       -0.253782372  0.131830341  -1.925             0.054221 .  
#> jobmanagement      -0.039196761  0.073457063  -0.534             0.593618    
#> jobretired          0.412312500  0.087766123   4.698   0.0000026290822583 ***
#> jobself-employed   -0.163407912  0.113508376  -1.440             0.149977    
#> jobservices        -0.062011356  0.083035566  -0.747             0.455181    
#> jobstudent          0.270093127  0.109144939   2.475             0.013338 *  
#> jobtechnician      -0.089211667  0.069464494  -1.284             0.199045    
#> jobunemployed       0.119899628  0.108339610   1.107             0.268423    
#> jobunknown         -0.166871219  0.232756571  -0.717             0.473415    
#> maritalmarried     -0.159760674  0.058342289  -2.738             0.006175 ** 
#> maritalsingle       0.142244506  0.062630595   2.271             0.023137 *  
#> educationsecondary  0.131570940  0.063061839   2.086             0.036944 *  
#> educationtertiary   0.318193780  0.073448605   4.332   0.0000147629167635 ***
#> educationunknown    0.182617195  0.103786372   1.760             0.078484 .  
#> balance             0.000016083  0.000004984   3.227             0.001251 ** 
#> housingyes         -0.506098885  0.042713650 -11.849 < 0.0000000000000002 ***
#> loanyes            -0.456887065  0.060437498  -7.560   0.0000000000000404 ***
#> contacttelephone   -0.243252762  0.072257881  -3.366             0.000761 ***
#> contactunknown     -1.359578525  0.070641886 -19.246 < 0.0000000000000002 ***
#> monthaug           -0.823983149  0.077549419 -10.625 < 0.0000000000000002 ***
#> monthdec            0.749444250  0.179646624   4.172   0.0000302243702798 ***
#> monthfeb           -0.435061451  0.084926066  -5.123   0.0000003009902168 ***
#> monthjan           -1.110311687  0.119901978  -9.260 < 0.0000000000000002 ***
#> monthjul           -0.662762703  0.075981047  -8.723 < 0.0000000000000002 ***
#> monthjun            0.175151688  0.090212867   1.942             0.052193 .  
#> monthmar            1.042384864  0.124204017   8.393 < 0.0000000000000002 ***
#> monthmay           -0.461806774  0.070903267  -6.513   0.0000000000735691 ***
#> monthnov           -0.860182283  0.083495924 -10.302 < 0.0000000000000002 ***
#> monthoct            0.686669555  0.109472415   6.273   0.0000000003552165 ***
#> monthsep            0.615557068  0.121177522   5.080   0.0000003778408024 ***
#> campaign           -0.080042312  0.009283064  -8.622 < 0.0000000000000002 ***
#> previous            0.025908585  0.009669390   2.679             0.007374 ** 
#> poutcomeother       0.195206198  0.090110470   2.166             0.030288 *  
#> poutcomesuccess     2.207565085  0.081658485  27.034 < 0.0000000000000002 ***
#> poutcomeunknown     0.078930092  0.064219744   1.229             0.219048    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 26108  on 36169  degrees of freedom
#> Residual deviance: 21713  on 36132  degrees of freedom
#> AIC: 21789
#> 
#> Number of Fisher Scoring iterations: 6
bank_pred_logreg2 <- predict(model_logreg2, newdata = test_bank ,type = "response")
predictions2 <- as.factor(ifelse(bank_pred_logreg2 < 0.5, "no", "yes"))
head(predictions2)
#>  2  5  7 10 16 17 
#> no no no no no no 
#> Levels: no yes
confusionMatrix(predictions2, reference = test_bank$y, positive = "yes")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   no  yes
#>        no  7875  873
#>        yes  109  184
#>                                                
#>                Accuracy : 0.8914               
#>                  95% CI : (0.8848, 0.8977)     
#>     No Information Rate : 0.8831               
#>     P-Value [Acc > NIR] : 0.006949             
#>                                                
#>                   Kappa : 0.2337               
#>                                                
#>  Mcnemar's Test P-Value : < 0.00000000000000022
#>                                                
#>             Sensitivity : 0.17408              
#>             Specificity : 0.98635              
#>          Pos Pred Value : 0.62799              
#>          Neg Pred Value : 0.90021              
#>              Prevalence : 0.11691              
#>          Detection Rate : 0.02035              
#>    Detection Prevalence : 0.03241              
#>       Balanced Accuracy : 0.58021              
#>                                                
#>        'Positive' Class : yes                  
#> 

Before feature selection:

  • Accuracy : 89.11%
  • Recall : 17.31%
  • Precision : 62.25%

After feature selection:

  • Accuracy : 89.14%
  • Recall : 17.41%
  • Precision : 62.8%

After feature selection the performance of the model is essentially unchanged, with only marginal differences across the three metrics. We will therefore use the full model, before feature selection, for prediction and interpretation. Before we can interpret the model we need to check whether the assumptions underlying logistic regression are fulfilled, to ensure the validity of the model.

The first assumption is the absence of multicollinearity: the predictors should not be strongly correlated with one another. We can check this using vif() from the car package.

vif(model_logreg)
#>               GVIF Df GVIF^(1/(2*Df))
#> age       2.133225  1        1.460556
#> job       4.072676 11        1.065913
#> marital   1.444888  2        1.096374
#> education 2.273831  3        1.146726
#> default   1.014232  1        1.007091
#> balance   1.044496  1        1.022006
#> housing   1.406153  1        1.185813
#> loan      1.055708  1        1.027477
#> contact   1.853741  2        1.166843
#> day       1.346200  1        1.160259
#> month     3.663943 11        1.060801
#> campaign  1.102741  1        1.050115
#> pdays     3.783283  1        1.945066
#> previous  1.634883  1        1.278625
#> poutcome  4.730988  3        1.295663

This assumption is fulfilled because none of the GVIF values is above 10.

The second assumption is independence of observations: there should be no dependence between observations, which can occur when the observations come from repeated measurements of the same subjects. We treat this assumption as fulfilled because each row is a separate client and we used random sampling to split the data into train and test sets.

The third assumption is linearity of the predictors and the log-odds: the model assumes a linear relationship between each numeric predictor and the log of the odds. We can take another look at the model summary.

summary(model_logreg)
#> 
#> Call:
#> glm(formula = y ~ ., family = "binomial", data = train_bank)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.3235  -0.4819  -0.3793  -0.2435   3.5757  
#> 
#> Coefficients:
#>                        Estimate   Std. Error z value             Pr(>|z|)    
#> (Intercept)        -1.252715940  0.183500028  -6.827   0.0000000000086836 ***
#> age                 0.000226180  0.002199845   0.103             0.918109    
#> jobblue-collar     -0.154905842  0.072791657  -2.128             0.033331 *  
#> jobentrepreneur    -0.162872161  0.124125559  -1.312             0.189467    
#> jobhousemaid       -0.256082163  0.132224933  -1.937             0.052780 .  
#> jobmanagement      -0.038654212  0.073553625  -0.526             0.599219    
#> jobretired          0.407362719  0.097379874   4.183   0.0000287392377694 ***
#> jobself-employed   -0.162305976  0.113530334  -1.430             0.152824    
#> jobservices        -0.060723417  0.083068127  -0.731             0.464775    
#> jobstudent          0.271615929  0.111018694   2.447             0.014422 *  
#> jobtechnician      -0.090513505  0.069490288  -1.303             0.192734    
#> jobunemployed       0.121109647  0.108356454   1.118             0.263697    
#> jobunknown         -0.161202121  0.233077486  -0.692             0.489173    
#> maritalmarried     -0.159798235  0.058601192  -2.727             0.006394 ** 
#> maritalsingle       0.144166535  0.066806826   2.158             0.030931 *  
#> educationsecondary  0.132479032  0.063391922   2.090             0.036632 *  
#> educationtertiary   0.317410942  0.074049413   4.286   0.0000181530801310 ***
#> educationunknown    0.180781138  0.103823192   1.741             0.081641 .  
#> defaultyes         -0.112750773  0.163534416  -0.689             0.490533    
#> balance             0.000015866  0.000005007   3.169             0.001529 ** 
#> housingyes         -0.500745025  0.043115630 -11.614 < 0.0000000000000002 ***
#> loanyes            -0.452128779  0.060558669  -7.466   0.0000000000000827 ***
#> contacttelephone   -0.244521313  0.073049592  -3.347             0.000816 ***
#> contactunknown     -1.370121616  0.071176408 -19.250 < 0.0000000000000002 ***
#> day                 0.003288564  0.002479243   1.326             0.184694    
#> monthaug           -0.812319254  0.078188450 -10.389 < 0.0000000000000002 ***
#> monthdec            0.761792389  0.179955241   4.233   0.0000230356056245 ***
#> monthfeb           -0.397947745  0.089899260  -4.427   0.0000095731578611 ***
#> monthjan           -1.140465739  0.121884380  -9.357 < 0.0000000000000002 ***
#> monthjul           -0.660700780  0.076060998  -8.686 < 0.0000000000000002 ***
#> monthjun            0.205605533  0.093299108   2.204             0.027544 *  
#> monthmar            1.055639338  0.124771628   8.461 < 0.0000000000000002 ***
#> monthmay           -0.445097925  0.071903955  -6.190   0.0000000006009821 ***
#> monthnov           -0.862126512  0.084010373 -10.262 < 0.0000000000000002 ***
#> monthoct            0.686586812  0.109652456   6.261   0.0000000003813373 ***
#> monthsep            0.638271761  0.122486179   5.211   0.0000001878561988 ***
#> campaign           -0.081849377  0.009408080  -8.700 < 0.0000000000000002 ***
#> pdays              -0.000114825  0.000310349  -0.370             0.711392    
#> previous            0.025999418  0.009695453   2.682             0.007327 ** 
#> poutcomeother       0.192934216  0.090256129   2.138             0.032547 *  
#> poutcomesuccess     2.199595832  0.084013625  26.181 < 0.0000000000000002 ***
#> poutcomeunknown     0.051268131  0.096910267   0.529             0.596787    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 26108  on 36169  degrees of freedom
#> Residual deviance: 21711  on 36128  degrees of freedom
#> AIC: 21795
#> 
#> Number of Fisher Scoring iterations: 6

Each numeric variable enters the model linearly on the log-odds scale, so we treat this assumption as fulfilled for our purposes.
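For a more direct, if informal, check, one could plot each numeric predictor against the model’s fitted log-odds and look for a roughly linear trend; a sketch, assuming tidyr and ggplot2 are available:

library(ggplot2)
library(tidyr)

# fitted log-odds for each training observation
logit_fitted <- predict(model_logreg, type = "link")

train_bank %>%
  select_if(is.numeric) %>%
  mutate(logit = logit_fitted) %>%
  sample_n(5000) %>%  # sample to keep the plot fast
  pivot_longer(-logit, names_to = "predictor", values_to = "value") %>%
  ggplot(aes(x = value, y = logit)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "loess") +
  facet_wrap(~predictor, scales = "free_x")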

With the assumptions reasonably satisfied, we can interpret the model through its odds ratios:

exp(model_logreg$coefficients) %>% data.frame() 

Looking at the odds table above, the standout effect is poutcomesuccess: exp(2.1996) ≈ 9, so a customer whose previous campaign ended in success has roughly nine times the odds of saying yes to the current telemarketing offer compared to one whose previous campaign failed.
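To surface the largest effects at a glance, the odds ratios can also be sorted; a small convenience on top of the code above:

# sort odds ratios from largest to smallest and show the top few
exp(coef(model_logreg)) %>% sort(decreasing = TRUE) %>% head()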

7 Prediction Using K-Nearest Neighbors (kNN)

Next we are going to predict the outcome of the telemarketing effort using the kNN model. This model is generally more robust and often performs better than logistic regression, but it is harder to interpret. kNN also cannot work directly with categorical predictors, so we need to filter them out: kNN classifies an observation by a measure of distance to its neighbors, and categorical variables have no natural distance beyond 0/1 dummy coding.

# predictor variables in `train`
train_x <- train_bank %>% select_if(is.numeric) 
head(train_x)
# predictor variables in `test`
test_x <- test_bank %>% select_if(is.numeric) 
head(test_x)
# target variable in `train`
train_y <- train_bank$y
head(train_y)
#> [1] no no no no no no
#> Levels: no yes
# target variable in `test`
test_y <- test_bank$y
head(test_y)
#> [1] no no no no no no
#> Levels: no yes

Next, the numeric variables need to be scaled using z-score scaling so that every feature carries equal weight in the distance calculation. Note that the test set is scaled using the training set’s centers and scales, so that no information from the test set leaks into the preprocessing.

# scale train_x data
train_x <- scale(train_x)

# scale test_x data
test_x <- scale(test_x,
                center = attr(train_x, "scaled:center"),
                scale = attr(train_x, "scaled:scale"))
head(train_x)
#>           age     balance       day   campaign      pdays   previous
#> 1  1.60278370  0.25322992 -1.302524 -0.5694578 -0.4104198 -0.2992611
#> 3 -0.75210265 -0.44568450 -1.302524 -0.5694578 -0.4104198 -0.2992611
#> 4  0.56663371  0.04528575 -1.302524 -0.5694578 -0.4104198 -0.2992611
#> 6 -0.56371175 -0.37092905 -1.302524 -0.5694578 -0.4104198 -0.2992611
#> 8  0.09565643 -0.44568450 -1.302524 -0.5694578 -0.4104198 -0.2992611
#> 9  1.60278370 -0.40683778 -1.302524 -0.5694578 -0.4104198 -0.2992611
head(test_x)
#>           age    balance       day   campaign      pdays   previous
#> 2   0.2840473 -0.4368705 -1.302524 -0.5694578 -0.4104198 -0.2992611
#> 5  -0.7521027 -0.4460109 -1.302524 -0.5694578 -0.4104198 -0.2992611
#> 7  -1.2230799 -0.3004174 -1.302524 -0.5694578 -0.4104198 -0.2992611
#> 10  0.1898519 -0.2527567 -1.302524 -0.5694578 -0.4104198 -0.2992611
#> 16  0.9434155 -0.3715819 -1.302524 -0.5694578 -0.4104198 -0.2992611
#> 17  0.3782428 -0.4420936 -1.302524 -0.5694578 -0.4104198 -0.2992611

Finally, we need to choose k, the number of nearest neighbors that vote on each prediction. For our first iteration, let us use an odd k close to the square root of the number of training observations; an odd k avoids ties in the majority vote.

k <- round(sqrt(nrow(train_x)))+1 
k
#> [1] 191
bank_pred_knn <- knn(train = train_x, test = test_x, cl = train_y, k= k)

We will use the same metrics as the logistic regression in order to enable comparison between the two models.

confusionMatrix(bank_pred_knn, reference = test_y, positive = "yes")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   no  yes
#>        no  7979 1050
#>        yes    5    7
#>                                              
#>                Accuracy : 0.8833             
#>                  95% CI : (0.8765, 0.8899)   
#>     No Information Rate : 0.8831             
#>     P-Value [Acc > NIR] : 0.4821             
#>                                              
#>                   Kappa : 0.0105             
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.0066225          
#>             Specificity : 0.9993737          
#>          Pos Pred Value : 0.5833333          
#>          Neg Pred Value : 0.8837081          
#>              Prevalence : 0.1169118          
#>          Detection Rate : 0.0007743          
#>    Detection Prevalence : 0.0013273          
#>       Balanced Accuracy : 0.5029981          
#>                                              
#>        'Positive' Class : yes                
#> 

The metrics are:

  • Accuracy : 88.33%
  • Recall : 0.66%
  • Precision : 58.33%

Our kNN model is actually worse than the logistic regression model across all three metrics. Before we draw any conclusions about the two models, let’s try two different k values, one higher at 300 and one lower at 100.

bank_pred_knn2 <- knn(train = train_x, test = test_x, cl = train_y, k= 300)
confusionMatrix(bank_pred_knn2, reference = test_y, positive = "yes")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   no  yes
#>        no  7984 1056
#>        yes    0    1
#>                                              
#>                Accuracy : 0.8832             
#>                  95% CI : (0.8764, 0.8898)   
#>     No Information Rate : 0.8831             
#>     P-Value [Acc > NIR] : 0.4951             
#>                                              
#>                   Kappa : 0.0017             
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.0009461          
#>             Specificity : 1.0000000          
#>          Pos Pred Value : 1.0000000          
#>          Neg Pred Value : 0.8831858          
#>              Prevalence : 0.1169118          
#>          Detection Rate : 0.0001106          
#>    Detection Prevalence : 0.0001106          
#>       Balanced Accuracy : 0.5004730          
#>                                              
#>        'Positive' Class : yes                
#> 
bank_pred_knn3 <- knn(train = train_x, test = test_x, cl = train_y, k= 100)
confusionMatrix(bank_pred_knn3, reference = test_y, positive = "yes")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   no  yes
#>        no  7956 1037
#>        yes   28   20
#>                                              
#>                Accuracy : 0.8822             
#>                  95% CI : (0.8754, 0.8888)   
#>     No Information Rate : 0.8831             
#>     P-Value [Acc > NIR] : 0.6111             
#>                                              
#>                   Kappa : 0.0263             
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.018921           
#>             Specificity : 0.996493           
#>          Pos Pred Value : 0.416667           
#>          Neg Pred Value : 0.884688           
#>              Prevalence : 0.116912           
#>          Detection Rate : 0.002212           
#>    Detection Prevalence : 0.005309           
#>       Balanced Accuracy : 0.507707           
#>                                              
#>        'Positive' Class : yes                
#> 

For bank_pred_knn2 the metrics are:

  • Accuracy : 88.32%
  • Recall : 0.09%
  • Precision : 100%

For bank_pred_knn3 the metrics are:

  • Accuracy : 88.22%
  • Recall : 1.89%
  • Precision : 41.67%

So a smaller k improves recall somewhat, at the cost of precision and a slight dip in accuracy, but every kNN configuration remains well below the logistic regression model. We can therefore conclude that, for our case of predicting customers’ responses to term deposit offers via a telemarketing campaign, logistic regression performs better than kNN, with the full and feature-selected logistic models performing essentially the same.

8 Conclusion

We have used two machine learning models to predict which customers will accept the term deposit offer made through the bank’s telemarketing campaign: logistic regression and k-nearest neighbors. Of the two, logistic regression performs better, and the full model (all available columns except duration and the target) performs essentially as well as the feature-selected one. Even the best model only reaches a recall of about 17.3%. This is still too low, and the number of false negatives is still too high, which means the logistic regression model needs to be tuned further or replaced with an even better model. The logistic regression was, however, able to identify that a customer who accepted the offer in a previous campaign has roughly nine times the odds of accepting again compared to one whose previous campaign failed.
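As one inexpensive next step, sketched here rather than run, the 0.5 cutoff used earlier could be lowered to trade precision for recall with the same logistic model:

# lower the classification threshold (0.3 is an arbitrary example value)
predictions_lower <- as.factor(ifelse(bank_pred_logreg1 < 0.3, "no", "yes"))
confusionMatrix(predictions_lower, reference = test_bank$y, positive = "yes")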