Background

Introduction

This dataset comes from a telemarketing campaign run by a banking institution in Portugal. In this exercise, my objective (besides determining whether a customer will subscribe or not) is to compare the performance of a few classification models. The models I'm trying are Logistic Regression, Decision Tree, Random Forest, Random Forest with Ranger, and XGBoost.

Data Import

Two datasets are provided: one with the social and economic context variables, and one without. We'll use the one with more predictor variables.

Data Variables Explanation

Our input variables are divided into sections based on their attributes.

Bank Client Data

1 - age
2 - job
3 - marital : marital status
4 - education
5 - default: has credit in default?
6 - housing: has housing loan?
7 - loan: has personal loan?

Related with the Last Contact of the Current Campaign

8 - contact: contact communication type
9 - month: last contact month of year
10 - day_of_week: last contact day of the week
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=‘no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

Other Attributes

12 - campaign: number of contacts performed during this campaign and for this client
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (999 means the client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client
15 - poutcome: outcome of the previous marketing campaign

Social and Economic Context Attributes

16 - emp.var.rate: employment variation rate - quarterly indicator
17 - cons.price.idx: consumer price index - monthly indicator
18 - cons.conf.idx: consumer confidence index - monthly indicator
19 - euribor3m: euribor 3 month rate - daily indicator
20 - nr.employed: number of employees - quarterly indicator

Target Variable

21 - y - has the client subscribed to a term deposit? (binary: ‘yes’,‘no’)

Target Variable Proportion

## 
##        no       yes 
## 0.8873458 0.1126542

Our target is quite imbalanced, so we will try to balance our training data with SMOTE.

Data Wrangling

We're dropping the duration variable, since the description notes that it highly affects the output target and should be discarded if the intention is to have a realistic predictive model.
Because duration is only known after the call ends, it can act as a near-perfectly separating variable and would hinder the learning process if included.
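
A minimal sketch of this step, assuming the data frame is named bank_add and dplyr is loaded:

library(dplyr)

# drop duration: it's only known after the call, so it leaks the outcome
bank_add <- bank_add %>%
  select(-duration)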

Cross Validation
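
The counts below describe the training split. A sketch of how it could be produced, assuming caret's createDataPartition with an 80/20 split (32,951 of the 41,188 rows, which matches the counts below); the seed is hypothetical:

library(caret)

set.seed(100)  # hypothetical seed; the original isn't shown
idx <- createDataPartition(bank_add$y, p = 0.8, list = FALSE)
bank_add_train <- bank_add[idx, ]
bank_add_test  <- bank_add[-idx, ]

table(bank_add_train$y)
prop.table(table(bank_add_train$y))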

## 
##    no   yes 
## 29257  3694
## 
##        no       yes 
## 0.8878941 0.1121059

Balancing the Data

We'll use SMOTE to rebalance the training data from roughly 89/11 to a more acceptable proportion. We're trying two variations: one with a 50/50 ratio and one with a 60/40 ratio.
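
For reference, here is a DMwR::SMOTE call whose arithmetic reproduces the 60/40 counts shown below; the exact call, seed, and k aren't shown, so treat this as a sketch:

library(DMwR)  # archived on CRAN, but installable from the archive

set.seed(100)  # hypothetical seed
# perc.over = 300 quadruples the 3,694 minority cases to 14,776;
# perc.under = 200 keeps two majority cases per synthetic case, i.e. 22,164
bank_add_train_bal <- SMOTE(y ~ ., data = bank_add_train,
                            perc.over = 300, perc.under = 200)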

## 
##    no   yes 
## 22164 14776
## 
##  no yes 
## 0.6 0.4

Logistic Regression

## 
## Call:  glm(formula = y ~ job + marital + education + default + housing + 
##     loan + month + day_of_week + campaign + pdays + poutcome + 
##     emp.var.rate + cons.price.idx + cons.conf.idx + euribor3m + 
##     nr.employed + age_log, family = "binomial", data = bank_add_train_bal)
## 
## Coefficients:
##                  (Intercept)                jobblue-collar  
##                   -27.724693                     -0.044227  
##              jobentrepreneur                  jobhousemaid  
##                    -0.011992                      0.092992  
##                jobmanagement                    jobretired  
##                     0.005428                      0.245235  
##             jobself-employed                   jobservices  
##                     0.178374                      0.019914  
##                   jobstudent                 jobtechnician  
##                     0.580556                      0.049140  
##                jobunemployed                    jobunknown  
##                     0.068199                      0.509203  
##               maritalmarried                 maritalsingle  
##                    -0.463148                     -0.167572  
##               maritalunknown             educationbasic.6y  
##                     0.129994                      0.409697  
##            educationbasic.9y          educationhigh.school  
##                     0.043945                      0.042286  
##          educationilliterate  educationprofessional.course  
##                    -0.121657                      0.110212  
##   educationuniversity.degree              educationunknown  
##                     0.109920                      0.234846  
##               defaultunknown                    defaultyes  
##                     0.706461                     -8.958572  
##               housingunknown                    housingyes  
##                     0.549389                     -0.168088  
##                  loanunknown                       loanyes  
##                     1.073310                      0.890196  
##                     monthaug                      monthdec  
##                     0.142034                      0.687731  
##                     monthjul                      monthjun  
##                     0.204754                      0.007234  
##                     monthmar                      monthmay  
##                     1.212656                     -0.615434  
##                     monthnov                      monthoct  
##                    -0.456716                     -0.157620  
##                     monthsep                day_of_weekmon  
##                    -0.465392                     -0.139545  
##               day_of_weekthu                day_of_weektue  
##                    -0.002313                      0.001691  
##               day_of_weekwed                      campaign  
##                     0.095933                     -0.056223  
##                        pdays           poutcomenonexistent  
##                    -0.001216                     -0.313641  
##              poutcomesuccess                  emp.var.rate  
##                     0.145713                     -0.796084  
##               cons.price.idx                 cons.conf.idx  
##                     0.695070                      0.019691  
##                    euribor3m                   nr.employed  
##                     0.389413                     -0.007055  
##                      age_log  
##                    -0.169505  
## 
## Degrees of Freedom: 36939 Total (i.e. Null);  36889 Residual
## Null Deviance:       49720 
## Residual Deviance: 37000     AIC: 37100

Here's our logistic regression model. The variables were chosen using stepwise selection.
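
A sketch of the selection step, assuming a full binomial fit refined with R's step(); the direction and object names are assumptions:

model_full <- glm(y ~ ., data = bank_add_train_bal, family = "binomial")
model_step <- step(model_full, direction = "backward", trace = FALSE)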

## 
## Call:
## glm(formula = y ~ job + marital + education + default + housing + 
##     loan + contact + month + day_of_week + campaign + pdays + 
##     previous + poutcome + emp.var.rate + cons.price.idx + cons.conf.idx + 
##     euribor3m + nr.employed, family = "binomial", data = bank_add_train_bal)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9607  -0.7378  -0.5192   0.7630   2.5080  
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -2.477e+01  1.674e+01  -1.479 0.139109    
## jobblue-collar               -3.915e-02  4.375e-02  -0.895 0.370856    
## jobentrepreneur              -1.760e-02  7.518e-02  -0.234 0.814844    
## jobhousemaid                  8.136e-02  8.436e-02   0.965 0.334784    
## jobmanagement                -4.496e-03  5.717e-02  -0.079 0.937322    
## jobretired                    1.901e-01  6.463e-02   2.941 0.003271 ** 
## jobself-employed              1.782e-01  7.199e-02   2.475 0.013327 *  
## jobservices                   2.386e-02  5.124e-02   0.466 0.641457    
## jobstudent                    6.165e-01  7.718e-02   7.988 1.37e-15 ***
## jobtechnician                 5.213e-02  4.485e-02   1.162 0.245084    
## jobunemployed                 7.203e-02  8.202e-02   0.878 0.379883    
## jobunknown                    5.007e-01  1.378e-01   3.633 0.000280 ***
## maritalmarried               -4.527e-01  4.034e-02 -11.224  < 2e-16 ***
## maritalsingle                -1.329e-01  4.338e-02  -3.064 0.002187 ** 
## maritalunknown                1.416e-01  2.904e-01   0.488 0.625798    
## educationbasic.6y             4.263e-01  6.585e-02   6.475 9.51e-11 ***
## educationbasic.9y             6.043e-02  5.465e-02   1.106 0.268842    
## educationhigh.school          6.286e-02  5.288e-02   1.189 0.234616    
## educationilliterate          -1.409e-01  7.199e-01  -0.196 0.844827    
## educationprofessional.course  1.274e-01  5.968e-02   2.135 0.032779 *  
## educationuniversity.degree    1.297e-01  5.229e-02   2.479 0.013165 *  
## educationunknown              2.404e-01  7.144e-02   3.365 0.000766 ***
## defaultunknown                6.974e-01  3.162e-02  22.059  < 2e-16 ***
## defaultyes                   -8.942e+00  6.856e+01  -0.130 0.896230    
## housingunknown                5.475e-01  8.082e-02   6.774 1.26e-11 ***
## housingyes                   -1.685e-01  2.654e-02  -6.351 2.14e-10 ***
## loanunknown                   1.075e+00  7.923e-02  13.568  < 2e-16 ***
## loanyes                       8.896e-01  3.123e-02  28.486  < 2e-16 ***
## contacttelephone              3.242e-02  3.677e-02   0.882 0.377906    
## monthaug                      1.511e-01  7.583e-02   1.993 0.046278 *  
## monthdec                      6.807e-01  1.719e-01   3.961 7.47e-05 ***
## monthjul                      2.222e-01  6.032e-02   3.683 0.000230 ***
## monthjun                      1.589e-02  6.875e-02   0.231 0.817178    
## monthmar                      1.210e+00  1.066e-01  11.352  < 2e-16 ***
## monthmay                     -6.143e-01  5.060e-02 -12.141  < 2e-16 ***
## monthnov                     -4.545e-01  7.104e-02  -6.398 1.57e-10 ***
## monthoct                     -1.534e-01  1.019e-01  -1.505 0.132361    
## monthsep                     -4.591e-01  1.044e-01  -4.396 1.10e-05 ***
## day_of_weekmon               -1.414e-01  4.097e-02  -3.451 0.000559 ***
## day_of_weekthu                1.634e-04  4.044e-02   0.004 0.996777    
## day_of_weektue                1.899e-03  4.130e-02   0.046 0.963323    
## day_of_weekwed                9.754e-02  4.096e-02   2.381 0.017251 *  
## campaign                     -5.665e-02  6.560e-03  -8.636  < 2e-16 ***
## pdays                        -1.197e-03  1.280e-04  -9.352  < 2e-16 ***
## previous                      1.548e-02  4.525e-02   0.342 0.732347    
## poutcomenonexistent          -3.034e-01  5.028e-02  -6.033 1.61e-09 ***
## poutcomesuccess               1.532e-01  1.181e-01   1.298 0.194322    
## emp.var.rate                 -7.851e-01  7.098e-02 -11.062  < 2e-16 ***
## cons.price.idx                6.652e-01  1.079e-01   6.164 7.11e-10 ***
## cons.conf.idx                 1.783e-02  5.646e-03   3.158 0.001590 ** 
## euribor3m                     3.914e-01  8.127e-02   4.816 1.47e-06 ***
## nr.employed                  -7.237e-03  1.515e-03  -4.778 1.77e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 49722  on 36939  degrees of freedom
## Residual deviance: 37007  on 36888  degrees of freedom
## AIC: 37111
## 
## Number of Fisher Scoring iterations: 9

Logistic Regression Confusion Matrix
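
A sketch of how this matrix could be produced, assuming caret's confusionMatrix and a 0.5 probability cutoff (object names are hypothetical):

pred_prob  <- predict(model_step, newdata = bank_add_test, type = "response")
pred_class <- factor(ifelse(pred_prob > 0.5, "yes", "no"), levels = c("no", "yes"))
confusionMatrix(pred_class, bank_add_test$y, positive = "yes")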

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  6930  589
##        yes  361  357
##                                           
##                Accuracy : 0.8847          
##                  95% CI : (0.8776, 0.8915)
##     No Information Rate : 0.8852          
##     P-Value [Acc > NIR] : 0.5635          
##                                           
##                   Kappa : 0.3663          
##                                           
##  Mcnemar's Test P-Value : 1.773e-13       
##                                           
##             Sensitivity : 0.37738         
##             Specificity : 0.95049         
##          Pos Pred Value : 0.49721         
##          Neg Pred Value : 0.92167         
##              Prevalence : 0.11485         
##          Detection Rate : 0.04334         
##    Detection Prevalence : 0.08717         
##       Balanced Accuracy : 0.66393         
##                                           
##        'Positive' Class : yes             
## 

Decision Tree

After a few trials, this is the best decision tree model that we have.
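
The tree itself isn't reproduced here, but a sketch of the fit, assuming rpart (the complexity parameter is a hypothetical tuning value):

library(rpart)

model_dt <- rpart(y ~ ., data = bank_add_train_bal, method = "class",
                  control = rpart.control(cp = 0.001))  # hypothetical cp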

Decision Tree Confusion Matrix

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  6668  475
##        yes  623  471
##                                          
##                Accuracy : 0.8667         
##                  95% CI : (0.8592, 0.874)
##     No Information Rate : 0.8852         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.3862         
##                                          
##  Mcnemar's Test P-Value : 9.154e-06      
##                                          
##             Sensitivity : 0.49789        
##             Specificity : 0.91455        
##          Pos Pred Value : 0.43053        
##          Neg Pred Value : 0.93350        
##              Prevalence : 0.11485        
##          Detection Rate : 0.05718        
##    Detection Prevalence : 0.13282        
##       Balanced Accuracy : 0.70622        
##                                          
##        'Positive' Class : yes            
## 

Random Forest Model

Since we've already split the data and balanced it with SMOTE, all that's left to do is remove the near-zero-variance variables and train our random forest model.
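
A sketch of the near-zero-variance filter, using caret's nearZeroVar:

nzv <- nearZeroVar(bank_add_train_bal)
if (length(nzv) > 0) {
  bank_add_train_bal <- bank_add_train_bal[, -nzv]
}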

Model Training

We're using k-fold cross-validation to reduce bias and overfitting.
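
A sketch of the resampling setup; the fold count is an assumption, and passing repeats alongside method = "cv" is what triggers the warning below:

ctrl <- trainControl(method = "cv", number = 5, repeats = 3)  # repeats is ignored for plain "cv"

set.seed(100)  # hypothetical seed
model_rf <- train(y ~ ., data = bank_add_train_bal,
                  method = "rf", trControl = ctrl)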

## Warning: `repeats` has no meaning for this resampling method.

Exporting our model, just in case we need it in the future.
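
A sketch, with a hypothetical file name; the variable importance listing below comes from varImp():

saveRDS(model_rf, "model_rf.RDS")

varImp(model_rf)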

## rf variable importance
## 
##   only 20 most important variables shown (out of 51)
## 
##                            Overall
## nr.employed                100.000
## previous                    58.946
## cons.conf.idx               57.939
## euribor3m                   55.567
## cons.price.idx              52.258
## age_log                     39.730
## emp.var.rate                19.824
## campaign                    19.624
## housingyes                   8.156
## monthmay                     5.938
## loanyes                      5.768
## maritalmarried               5.254
## contacttelephone             4.670
## educationuniversity.degree   4.665
## jobtechnician                4.480
## poutcomesuccess              4.391
## educationhigh.school         4.365
## maritalsingle                4.348
## defaultunknown               4.280
## day_of_weekmon               4.026

RF Model Confusion Matrix

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  6742  456
##        yes  549  490
##                                          
##                Accuracy : 0.878          
##                  95% CI : (0.8707, 0.885)
##     No Information Rate : 0.8852         
##     P-Value [Acc > NIR] : 0.979439       
##                                          
##                   Kappa : 0.4245         
##                                          
##  Mcnemar's Test P-Value : 0.003707       
##                                          
##             Sensitivity : 0.51797        
##             Specificity : 0.92470        
##          Pos Pred Value : 0.47161        
##          Neg Pred Value : 0.93665        
##              Prevalence : 0.11485        
##          Detection Rate : 0.05949        
##    Detection Prevalence : 0.12614        
##       Balanced Accuracy : 0.72134        
##                                          
##        'Positive' Class : yes            
## 

Random Forest with Ranger

Below are a few different configurations of random forest trained with ranger, which speeds up model training by using all of our CPU threads.
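
A sketch of how one of these fits might look through caret, with ranger's multithreading (hyperparameters and object names are assumptions):

model_ranger <- train(y ~ ., data = bank_add_train,  # v1 uses the imbalanced train set
                      method = "ranger",
                      trControl = ctrl,              # ctrl from the trainControl sketch above
                      num.threads = parallel::detectCores())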

Version 1

Ranger model version 1 - training data without SMOTE (imbalanced)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  7235  783
##        yes   56  163
##                                           
##                Accuracy : 0.8981          
##                  95% CI : (0.8914, 0.9046)
##     No Information Rate : 0.8852          
##     P-Value [Acc > NIR] : 9.234e-05       
##                                           
##                   Kappa : 0.2473          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.17230         
##             Specificity : 0.99232         
##          Pos Pred Value : 0.74429         
##          Neg Pred Value : 0.90234         
##              Prevalence : 0.11485         
##          Detection Rate : 0.01979         
##    Detection Prevalence : 0.02659         
##       Balanced Accuracy : 0.58231         
##                                           
##        'Positive' Class : yes             
## 

Version 2

Ranger model v2 - training data with uneven SMOTE at a 60/40 ratio

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  6389  371
##        yes  902  575
##                                           
##                Accuracy : 0.8455          
##                  95% CI : (0.8375, 0.8532)
##     No Information Rate : 0.8852          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3891          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.60782         
##             Specificity : 0.87629         
##          Pos Pred Value : 0.38930         
##          Neg Pred Value : 0.94512         
##              Prevalence : 0.11485         
##          Detection Rate : 0.06981         
##    Detection Prevalence : 0.17931         
##       Balanced Accuracy : 0.74205         
##                                           
##        'Positive' Class : yes             
## 

Version 3

Ranger model v3 - training data with even SMOTE at a 50/50 ratio

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  6804  494
##        yes  487  452
##                                           
##                Accuracy : 0.8809          
##                  95% CI : (0.8737, 0.8878)
##     No Information Rate : 0.8852          
##     P-Value [Acc > NIR] : 0.8896          
##                                           
##                   Kappa : 0.4123          
##                                           
##  Mcnemar's Test P-Value : 0.8481          
##                                           
##             Sensitivity : 0.47780         
##             Specificity : 0.93321         
##          Pos Pred Value : 0.48136         
##          Neg Pred Value : 0.93231         
##              Prevalence : 0.11485         
##          Detection Rate : 0.05487         
##    Detection Prevalence : 0.11400         
##       Balanced Accuracy : 0.70550         
##                                           
##        'Positive' Class : yes             
## 

Version 4

Ranger model v4 - training data with even SMOTE at a 50/50 ratio and a custom tuning grid
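
A sketch of a tuning grid for caret's "ranger" method, which tunes mtry, splitrule, and min.node.size (grid values and the 50/50 data frame name are hypothetical):

tgrid <- expand.grid(mtry = c(2, 5, 10),
                     splitrule = c("gini", "extratrees"),
                     min.node.size = c(1, 5, 10))

model_ranger_v4 <- train(y ~ ., data = bank_add_train_bal_50,  # hypothetical 50/50 frame
                         method = "ranger", trControl = ctrl, tuneGrid = tgrid,
                         num.threads = parallel::detectCores())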

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  6802  479
##        yes  489  467
##                                           
##                Accuracy : 0.8825          
##                  95% CI : (0.8753, 0.8894)
##     No Information Rate : 0.8852          
##     P-Value [Acc > NIR] : 0.7821          
##                                           
##                   Kappa : 0.4246          
##                                           
##  Mcnemar's Test P-Value : 0.7724          
##                                           
##             Sensitivity : 0.4937          
##             Specificity : 0.9329          
##          Pos Pred Value : 0.4885          
##          Neg Pred Value : 0.9342          
##              Prevalence : 0.1148          
##          Detection Rate : 0.0567          
##    Detection Prevalence : 0.1161          
##       Balanced Accuracy : 0.7133          
##                                           
##        'Positive' Class : yes             
## 
AUC scores for the Ranger models (v1 through v4, in order):

## [1] 0.811119
## [1] 0.7863936
## [1] 0.7921616
## [1] 0.7920103

XGBoost

Preparing the Matrix
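
A sketch of the matrix preparation and the cross-validation run whose output appears below; object names and nfold are assumptions. Note that the target must be excluded from the predictor matrix, or the model can simply read off the label:

library(xgboost)
library(Matrix)

# one-hot encode predictors; "- 1" drops the intercept column
train_mat <- sparse.model.matrix(y ~ . - 1, data = bank_add_train_bal)
test_mat  <- sparse.model.matrix(y ~ . - 1, data = bank_add_test)

dtrain <- xgb.DMatrix(data = train_mat, label = as.numeric(bank_add_train_bal$y) - 1)
dtest  <- xgb.DMatrix(data = test_mat,  label = as.numeric(bank_add_test$y) - 1)

params <- list(objective = "binary:logistic", eval_metric = "error")
cv <- xgb.cv(params = params, data = dtrain, nrounds = 100, nfold = 5,
             early_stopping_rounds = 20, print_every_n = 10)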

## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:plotly':
## 
##     slice
## The following object is masked from 'package:dplyr':
## 
##     slice
## [1]  train-error:0.000000+0.000000   test-error:0.000000+0.000000 
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
## 
## [11] train-error:0.000000+0.000000   test-error:0.000000+0.000000 
## [21] train-error:0.000000+0.000000   test-error:0.000000+0.000000 
## Stopping. Best iteration:
## [1]  train-error:0.000000+0.000000   test-error:0.000000+0.000000

XGBoost Model Training
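
A sketch of the training call; the original apparently used the deprecated argument names print.every.n and early.stop.round, hence the warnings below. Hyperparameters and the validation split are assumptions:

watchlist <- list(val = dtest, train = dtrain)

model_xgb <- xgb.train(params = params, data = dtrain, nrounds = 100,
                       watchlist = watchlist,
                       print_every_n = 10,
                       early_stopping_rounds = 10)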

## Warning: 'print.every.n' is deprecated.
## Use 'print_every_n' instead.
## See help("Deprecated") and help("xgboost-deprecated").
## Warning: 'early.stop.round' is deprecated.
## Use 'early_stopping_rounds' instead.
## See help("Deprecated") and help("xgboost-deprecated").
## [1]  val-error:0.000000  train-error:0.000000 
## Multiple eval metrics are present. Will use train_error for early stopping.
## Will train until train_error hasn't improved in 10 rounds.
## 
## [11] val-error:0.000000  train-error:0.000000 
## Stopping. Best iteration:
## [1]  val-error:0.000000  train-error:0.000000

XGBoost Confusion Matrix

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7291    0
##          1    0  946
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9996, 1)
##     No Information Rate : 0.8852     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.1148     
##          Detection Rate : 0.1148     
##    Detection Prevalence : 0.1148     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : 1          
## 

## [1] 1

A perfect confusion matrix and a score of 1 on held-out data are suspicious: results like these usually indicate target leakage (for example, the label slipping into the one-hot-encoded feature matrix), so the XGBoost numbers should not be taken at face value.

Conclusion

The original research team that worked on this dataset achieved an AUC score of 0.8. I'm not sure AUC should be the benchmark here, though, since the proportion of our target variable is quite imbalanced. The reason can be seen in our model comparison table below.

Here we can see that Ranger v1 has the highest AUC at 0.81 and a very high precision of 0.74, but a poor recall of 0.17. It was trained on the imbalanced training data, and the results reflect that.


In my opinion, if we had to deploy a model, I'd choose either the Random Forest or Ranger v3 for their balance of recall and precision.

Model Prediction Plot Density Comparison

While working on this exercise, I stumbled across an interesting discovery. Looking at the density plot comparison, you can see how the prediction behavior differs from model to model.

Logistic Regression