Background

Introduction

This dataset is a telemarketing campaign for a banking institution in Portugal. In this exercise, my objective -besides determining whether a customer will subscribe or not- is to try and compare a few classification models performance. The models that I’m trying in this exercise are Logistic Regression, Decision Tree, Random Forest, Random Forest with Ranger, and XGBoost.

Library

library(tidyverse)
library(stringr)
library(car)
library(MLmetrics)
library(effects)
library(magrittr)
library(caret)
library(rsample)
library(partykit)
library(DMwR)
library(ROCR)


library(ggplot2)
library(inspectdf)
library(GGally)
library(plotly)

library(MLeval)
library(ranger)

Data Import

We have 2 sets of data provided, one with the social and economic variables, and one without. We’ll try with the one with more predictor variables.

bank_add <- read.csv("data_input/bank-additional/bank-additional/bank-additional-full.csv", sep = ";")

Data Variables Explanation

Our input variables has been divided into sections, based on its attributes.

Bank Client Data

1 - age
2 - job
3 - marital : marital status
4 - education
5 - default: has credit in default?
6 - housing: has housing loan?
7 - loan: has personal loan?

Related with the Last Contact of the Current Campaign

8 - contact: contact communication type
9 - month: last contact month of year
10 - day_of_week: last contact day of the week
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=‘no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

Other Attributes

12 - campaign: number of contacts performed during this campaign and for this client
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign
14 - previous: number of contacts performed before this campaign and for this client
15 - poutcome: outcome of the previous marketing campaign

Social and Economic Context Attributes

16 - emp.var.rate: employment variation rate - quarterly indicator
17 - cons.price.idx: consumer price index - monthly indicator
18 - cons.conf.idx: consumer confidence index - monthly indicator
19 - euribor3m: euribor 3 month rate - daily indicator
20 - nr.employed: number of employees - quarterly indicator

Target Variable

21 - y - has the client subscribed a term deposit? (binary: ‘yes’,‘no’)

Target Variable Proportion

prop.table(table(bank_add$y))

## 
##        no       yes 
## 0.8873458 0.1126542

Our target is quite imbalanced, so we will try to balance our training data with SMOTE.

Data Wrangling

We’re dropping duration variable. Because it’s mentioned in the description that the attribute highly affects the output target and should be discarded if the intention is to have a realistic predictive model.
This might be a perfect separation variable, and might hinder our learning process by including it inside our data.

bank_add <- select(bank_add, -duration)

bank_add <- bank_add %>%
  mutate(age_log = log(age)) %>%
  select(-age)

Cross Validation

set.seed(123)

split <- initial_split(bank_add, prop = .8, strata = y)
bank_add_train <- training(split)
bank_add_test <- testing(split)

table(bank_add_train$y)

## 
##    no   yes 
## 29257  3694

prop.table(table(bank_add_train$y))

## 
##        no       yes 
## 0.8878941 0.1121059

Balancing the Data

We’ll try to downsample the proportion using SMOTE from 89-11 to a more acceptable proportion. We’re trying 2 variation of the proportion. One is where it’s 50/50 ratio, and one with 60/40 ratio.

bank_add_train_down <- SMOTE(form = y~.,data = bank_add_train, k = 5, perc.over = 100, perc.under = 200)

bank_add_train_bal <- SMOTE(form = y~.,data = bank_add_train, k = 8, perc.over = 390)


table(bank_add_train_bal$y)

## 
##    no   yes 
## 22164 14776

prop.table(table(bank_add_train_bal$y))

## 
##  no yes 
## 0.6 0.4

Logistic Regression

## 
## Call:  glm(formula = y ~ job + marital + education + default + housing + 
##     loan + month + day_of_week + campaign + pdays + poutcome + 
##     emp.var.rate + cons.price.idx + cons.conf.idx + euribor3m + 
##     nr.employed + age_log, family = "binomial", data = bank_add_train_bal)
## 
## Coefficients:
##                  (Intercept)                jobblue-collar  
##                   -27.724693                     -0.044227  
##              jobentrepreneur                  jobhousemaid  
##                    -0.011992                      0.092992  
##                jobmanagement                    jobretired  
##                     0.005428                      0.245235  
##             jobself-employed                   jobservices  
##                     0.178374                      0.019914  
##                   jobstudent                 jobtechnician  
##                     0.580556                      0.049140  
##                jobunemployed                    jobunknown  
##                     0.068199                      0.509203  
##               maritalmarried                 maritalsingle  
##                    -0.463148                     -0.167572  
##               maritalunknown             educationbasic.6y  
##                     0.129994                      0.409697  
##            educationbasic.9y          educationhigh.school  
##                     0.043945                      0.042286  
##          educationilliterate  educationprofessional.course  
##                    -0.121657                      0.110212  
##   educationuniversity.degree              educationunknown  
##                     0.109920                      0.234846  
##               defaultunknown                    defaultyes  
##                     0.706461                     -8.958572  
##               housingunknown                    housingyes  
##                     0.549389                     -0.168088  
##                  loanunknown                       loanyes  
##                     1.073310                      0.890196  
##                     monthaug                      monthdec  
##                     0.142034                      0.687731  
##                     monthjul                      monthjun  
##                     0.204754                      0.007234  
##                     monthmar                      monthmay  
##                     1.212656                     -0.615434  
##                     monthnov                      monthoct  
##                    -0.456716                     -0.157620  
##                     monthsep                day_of_weekmon  
##                    -0.465392                     -0.139545  
##               day_of_weekthu                day_of_weektue  
##                    -0.002313                      0.001691  
##               day_of_weekwed                      campaign  
##                     0.095933                     -0.056223  
##                        pdays           poutcomenonexistent  
##                    -0.001216                     -0.313641  
##              poutcomesuccess                  emp.var.rate  
##                     0.145713                     -0.796084  
##               cons.price.idx                 cons.conf.idx  
##                     0.695070                      0.019691  
##                    euribor3m                   nr.employed  
##                     0.389413                     -0.007055  
##                      age_log  
##                    -0.169505  
## 
## Degrees of Freedom: 36939 Total (i.e. Null);  36889 Residual
## Null Deviance:       49720 
## Residual Deviance: 37000     AIC: 37100

Here’s our linear regression formula. The variables are chosen after using stepwise function.

bank_add_logreg_step <- glm(formula = y ~ job + marital + education + default + housing + 
    loan + contact + month + day_of_week + campaign + pdays + 
    previous + poutcome + emp.var.rate + cons.price.idx + cons.conf.idx + 
    euribor3m + nr.employed, family = "binomial", data = bank_add_train_bal)

summary(bank_add_logreg_step)

## 
## Call:
## glm(formula = y ~ job + marital + education + default + housing + 
##     loan + contact + month + day_of_week + campaign + pdays + 
##     previous + poutcome + emp.var.rate + cons.price.idx + cons.conf.idx + 
##     euribor3m + nr.employed, family = "binomial", data = bank_add_train_bal)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9607  -0.7378  -0.5192   0.7630   2.5080  
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -2.477e+01  1.674e+01  -1.479 0.139109    
## jobblue-collar               -3.915e-02  4.375e-02  -0.895 0.370856    
## jobentrepreneur              -1.760e-02  7.518e-02  -0.234 0.814844    
## jobhousemaid                  8.136e-02  8.436e-02   0.965 0.334784    
## jobmanagement                -4.496e-03  5.717e-02  -0.079 0.937322    
## jobretired                    1.901e-01  6.463e-02   2.941 0.003271 ** 
## jobself-employed              1.782e-01  7.199e-02   2.475 0.013327 *  
## jobservices                   2.386e-02  5.124e-02   0.466 0.641457    
## jobstudent                    6.165e-01  7.718e-02   7.988 1.37e-15 ***
## jobtechnician                 5.213e-02  4.485e-02   1.162 0.245084    
## jobunemployed                 7.203e-02  8.202e-02   0.878 0.379883    
## jobunknown                    5.007e-01  1.378e-01   3.633 0.000280 ***
## maritalmarried               -4.527e-01  4.034e-02 -11.224  < 2e-16 ***
## maritalsingle                -1.329e-01  4.338e-02  -3.064 0.002187 ** 
## maritalunknown                1.416e-01  2.904e-01   0.488 0.625798    
## educationbasic.6y             4.263e-01  6.585e-02   6.475 9.51e-11 ***
## educationbasic.9y             6.043e-02  5.465e-02   1.106 0.268842    
## educationhigh.school          6.286e-02  5.288e-02   1.189 0.234616    
## educationilliterate          -1.409e-01  7.199e-01  -0.196 0.844827    
## educationprofessional.course  1.274e-01  5.968e-02   2.135 0.032779 *  
## educationuniversity.degree    1.297e-01  5.229e-02   2.479 0.013165 *  
## educationunknown              2.404e-01  7.144e-02   3.365 0.000766 ***
## defaultunknown                6.974e-01  3.162e-02  22.059  < 2e-16 ***
## defaultyes                   -8.942e+00  6.856e+01  -0.130 0.896230    
## housingunknown                5.475e-01  8.082e-02   6.774 1.26e-11 ***
## housingyes                   -1.685e-01  2.654e-02  -6.351 2.14e-10 ***
## loanunknown                   1.075e+00  7.923e-02  13.568  < 2e-16 ***
## loanyes                       8.896e-01  3.123e-02  28.486  < 2e-16 ***
## contacttelephone              3.242e-02  3.677e-02   0.882 0.377906    
## monthaug                      1.511e-01  7.583e-02   1.993 0.046278 *  
## monthdec                      6.807e-01  1.719e-01   3.961 7.47e-05 ***
## monthjul                      2.222e-01  6.032e-02   3.683 0.000230 ***
## monthjun                      1.589e-02  6.875e-02   0.231 0.817178    
## monthmar                      1.210e+00  1.066e-01  11.352  < 2e-16 ***
## monthmay                     -6.143e-01  5.060e-02 -12.141  < 2e-16 ***
## monthnov                     -4.545e-01  7.104e-02  -6.398 1.57e-10 ***
## monthoct                     -1.534e-01  1.019e-01  -1.505 0.132361    
## monthsep                     -4.591e-01  1.044e-01  -4.396 1.10e-05 ***
## day_of_weekmon               -1.414e-01  4.097e-02  -3.451 0.000559 ***
## day_of_weekthu                1.634e-04  4.044e-02   0.004 0.996777    
## day_of_weektue                1.899e-03  4.130e-02   0.046 0.963323    
## day_of_weekwed                9.754e-02  4.096e-02   2.381 0.017251 *  
## campaign                     -5.665e-02  6.560e-03  -8.636  < 2e-16 ***
## pdays                        -1.197e-03  1.280e-04  -9.352  < 2e-16 ***
## previous                      1.548e-02  4.525e-02   0.342 0.732347    
## poutcomenonexistent          -3.034e-01  5.028e-02  -6.033 1.61e-09 ***
## poutcomesuccess               1.532e-01  1.181e-01   1.298 0.194322    
## emp.var.rate                 -7.851e-01  7.098e-02 -11.062  < 2e-16 ***
## cons.price.idx                6.652e-01  1.079e-01   6.164 7.11e-10 ***
## cons.conf.idx                 1.783e-02  5.646e-03   3.158 0.001590 ** 
## euribor3m                     3.914e-01  8.127e-02   4.816 1.47e-06 ***
## nr.employed                  -7.237e-03  1.515e-03  -4.778 1.77e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 49722  on 36939  degrees of freedom
## Residual deviance: 37007  on 36888  degrees of freedom
## AIC: 37111
## 
## Number of Fisher Scoring iterations: 9

Logistic Regression Confusion Matrix

bank_logreg_pred <- predict(bank_add_logreg, bank_add_test, type = "response")

bank_logreg_pred_label <- as.factor(ifelse(bank_logreg_pred>.7, "yes", "no"))

confusionMatrix(bank_logreg_pred_label, bank_add_test$y, positive = "yes")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  6930  589
##        yes  361  357
##                                           
##                Accuracy : 0.8847          
##                  95% CI : (0.8776, 0.8915)
##     No Information Rate : 0.8852          
##     P-Value [Acc > NIR] : 0.5635          
##                                           
##                   Kappa : 0.3663          
##                                           
##  Mcnemar's Test P-Value : 1.773e-13       
##                                           
##             Sensitivity : 0.37738         
##             Specificity : 0.95049         
##          Pos Pred Value : 0.49721         
##          Neg Pred Value : 0.92167         
##              Prevalence : 0.11485         
##          Detection Rate : 0.04334         
##    Detection Prevalence : 0.08717         
##       Balanced Accuracy : 0.66393         
##                                           
##        'Positive' Class : yes             
##

Log Reg ROC Curve and AUC

bank_add_logreg_roc <- prediction(bank_logreg_pred, bank_add_test$y)
plot(performance(bank_add_logreg_roc, "tpr", "fpr"))

bank_add_logreg_auc <- performance(bank_add_logreg_roc, "auc")
bank_add_logreg_auc@y.values[[1]]

## [1] 0.7665057

Decision Tree

After a few trials, this is the best decision tree model that we have.

bank_add_tree_v2 <- ctree(formula = y~., data = bank_add_train_bal, control = ctree_control(mincriterion = .95, minsplit = 200, minbucket = 500))
# plot(bank_add_tree_v2)

bank_add_predict_v2 <- predict(bank_add_tree_v2, newdata = bank_add_test)

Decision Tree Confusion Matrix

confusionMatrix(bank_add_predict_v2, bank_add_test$y, positive = "yes")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  6668  475
##        yes  623  471
##                                          
##                Accuracy : 0.8667         
##                  95% CI : (0.8592, 0.874)
##     No Information Rate : 0.8852         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.3862         
##                                          
##  Mcnemar's Test P-Value : 9.154e-06      
##                                          
##             Sensitivity : 0.49789        
##             Specificity : 0.91455        
##          Pos Pred Value : 0.43053        
##          Neg Pred Value : 0.93350        
##              Prevalence : 0.11485        
##          Detection Rate : 0.05718        
##    Detection Prevalence : 0.13282        
##       Balanced Accuracy : 0.70622        
##                                          
##        'Positive' Class : yes            
##

ROC Curve and AUC

# library(ROCR)
bank_add_predict_prob <- predict(bank_add_tree_v2, newdata = bank_add_test, type = "prob")

bank_add_tree_roc <- ROCR::prediction(bank_add_predict_prob[,2], bank_add_test$y)
plot(performance(bank_add_tree_roc, "tpr", "fpr"))

bank_add_tree_auc <- performance(bank_add_tree_roc, "auc")
bank_add_tree_auc@y.values[[1]]

## [1] 0.764104

Random Forest Model

Since we already split the data as well as balancing it with SMOTE, all that’s left to do for now is to take out the near zero variance variable out, and train our random forest model.

Near Zero Var

n0_var <- nearZeroVar(bank_add_train_down)
bank_add_train_down_v2 <- bank_add_train_down[,-n0_var]

Model Training

We’re using k fold cross validation in order to avoid bias and overfitting.

ctrl <- trainControl(method = "cv", number = 5, repeats = 3)

## Warning: `repeats` has no meaning for this resampling method.

# bank_rf_model <- train(y ~., data = bank_add_train_bal, method = "rf", trControl= ctrl)
# bank_rf_model <- readRDS("bank_rf_model.RDS")

bank_rf_model <- readRDS("bank_rf_model_ageLog.RDS")

Exporting our model, just in case we need it in the future.

saveRDS(bank_rf_model, file = "bank_rf_model_ageLog.RDS")

varImp(bank_rf_model)

## rf variable importance
## 
##   only 20 most important variables shown (out of 51)
## 
##                            Overall
## nr.employed                100.000
## previous                    58.946
## cons.conf.idx               57.939
## euribor3m                   55.567
## cons.price.idx              52.258
## age_log                     39.730
## emp.var.rate                19.824
## campaign                    19.624
## housingyes                   8.156
## monthmay                     5.938
## loanyes                      5.768
## maritalmarried               5.254
## contacttelephone             4.670
## educationuniversity.degree   4.665
## jobtechnician                4.480
## poutcomesuccess              4.391
## educationhigh.school         4.365
## maritalsingle                4.348
## defaultunknown               4.280
## day_of_weekmon               4.026

RF Model Confusion Matrix

bank_rf_pred <- predict(bank_rf_model, bank_add_test)


confusionMatrix(bank_rf_pred, bank_add_test$y, positive = "yes")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  6742  456
##        yes  549  490
##                                          
##                Accuracy : 0.878          
##                  95% CI : (0.8707, 0.885)
##     No Information Rate : 0.8852         
##     P-Value [Acc > NIR] : 0.979439       
##                                          
##                   Kappa : 0.4245         
##                                          
##  Mcnemar's Test P-Value : 0.003707       
##                                          
##             Sensitivity : 0.51797        
##             Specificity : 0.92470        
##          Pos Pred Value : 0.47161        
##          Neg Pred Value : 0.93665        
##              Prevalence : 0.11485        
##          Detection Rate : 0.05949        
##    Detection Prevalence : 0.12614        
##       Balanced Accuracy : 0.72134        
##                                          
##        'Positive' Class : yes            
##

plot(density(predict(bank_rf_model, bank_add_test, type = "prob")[,2]))

RF Model ROC Curve and AUC

bank_rf_predict_prob <- predict(bank_rf_model, bank_add_test, type = "prob")

bank_rf_roc <- prediction(bank_rf_predict_prob[,2], bank_add_test$y)
plot(performance(bank_rf_roc, "tpr", "fpr"))

Our AUC Value

bank_rf_auc <- performance(bank_rf_roc, "auc")
bank_rf_auc@y.values[[1]]

## [1] 0.7810972

Random Forest with Ranger

This is a few different configurations of random forest using ranger to optimize our model training phase by using all of our CPU threads.

fitControl <- trainControl(method = "cv", 
                           summaryFunction = prSummary,
                           classProbs = T,
                           savePredictions = T,
                           verboseIter = F)

tgrid <- expand.grid(.mtry = 2:4,
                     .splitrule = "gini",
                     .min.node.size = c(10,20))


# bank_mdl_ranger <- train(y ~., data = bank_add_train,
#                 method = "ranger",
#                 metric = "AUC",
#                 trControl = fitControl)

# saveRDS(bank_mdl_ranger, file = "testFit.RDS")

bank_mdl_ranger_v1 <- readRDS("testFit.RDS")
bank_mdl_ranger_v2 <- readRDS("rangerModel_v2.RDS")
bank_mdl_ranger_v3 <- readRDS("rangerModel_v3.RDS")
bank_mdl_ranger_v4 <- readRDS("rangerModel_v4.RDS")



# x <- evalm(list(im_fit))

Version 1

Ranger model version 1 - Data Train without SMOTE (imbalanced)

confusionMatrix(predict(bank_mdl_ranger_v1, bank_add_test), bank_add_test$y, positive = "yes")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  7235  783
##        yes   56  163
##                                           
##                Accuracy : 0.8981          
##                  95% CI : (0.8914, 0.9046)
##     No Information Rate : 0.8852          
##     P-Value [Acc > NIR] : 9.234e-05       
##                                           
##                   Kappa : 0.2473          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.17230         
##             Specificity : 0.99232         
##          Pos Pred Value : 0.74429         
##          Neg Pred Value : 0.90234         
##              Prevalence : 0.11485         
##          Detection Rate : 0.01979         
##    Detection Prevalence : 0.02659         
##       Balanced Accuracy : 0.58231         
##                                           
##        'Positive' Class : yes             
##

Version 2

Ranger model v2 - Data Train with uneven SMOTE at .6 to .4

confusionMatrix(predict(bank_mdl_ranger_v2, bank_add_test), bank_add_test$y, positive = "yes")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  6389  371
##        yes  902  575
##                                           
##                Accuracy : 0.8455          
##                  95% CI : (0.8375, 0.8532)
##     No Information Rate : 0.8852          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3891          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.60782         
##             Specificity : 0.87629         
##          Pos Pred Value : 0.38930         
##          Neg Pred Value : 0.94512         
##              Prevalence : 0.11485         
##          Detection Rate : 0.06981         
##    Detection Prevalence : 0.17931         
##       Balanced Accuracy : 0.74205         
##                                           
##        'Positive' Class : yes             
##

Version 3

Ranger model v3 - Data Train with even SMOTE at .5 to .5

confusionMatrix(predict(bank_mdl_ranger_v3, bank_add_test), bank_add_test$y, positive = "yes")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  6804  494
##        yes  487  452
##                                           
##                Accuracy : 0.8809          
##                  95% CI : (0.8737, 0.8878)
##     No Information Rate : 0.8852          
##     P-Value [Acc > NIR] : 0.8896          
##                                           
##                   Kappa : 0.4123          
##                                           
##  Mcnemar's Test P-Value : 0.8481          
##                                           
##             Sensitivity : 0.47780         
##             Specificity : 0.93321         
##          Pos Pred Value : 0.48136         
##          Neg Pred Value : 0.93231         
##              Prevalence : 0.11485         
##          Detection Rate : 0.05487         
##    Detection Prevalence : 0.11400         
##       Balanced Accuracy : 0.70550         
##                                           
##        'Positive' Class : yes             
##

Version 4

Ranger model v4 - Data Train with even SMOTE at .5 to .5 and tgrid control

confusionMatrix(predict(bank_mdl_ranger_v4, bank_add_test), bank_add_test$y, positive = "yes")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  6802  479
##        yes  489  467
##                                           
##                Accuracy : 0.8825          
##                  95% CI : (0.8753, 0.8894)
##     No Information Rate : 0.8852          
##     P-Value [Acc > NIR] : 0.7821          
##                                           
##                   Kappa : 0.4246          
##                                           
##  Mcnemar's Test P-Value : 0.7724          
##                                           
##             Sensitivity : 0.4937          
##             Specificity : 0.9329          
##          Pos Pred Value : 0.4885          
##          Neg Pred Value : 0.9342          
##              Prevalence : 0.1148          
##          Detection Rate : 0.0567          
##    Detection Prevalence : 0.1161          
##       Balanced Accuracy : 0.7133          
##                                           
##        'Positive' Class : yes             
##

performance(prediction(predict(bank_mdl_ranger_v1,bank_add_test, type = "prob")[,2] , bank_add_test$y), "auc")@y.values[[1]]

## [1] 0.811119

performance(prediction(predict(bank_mdl_ranger_v2,bank_add_test, type = "prob")[,2] , bank_add_test$y), "auc")@y.values[[1]]

## [1] 0.7863936

performance(prediction(predict(bank_mdl_ranger_v3,bank_add_test, type = "prob")[,2] , bank_add_test$y), "auc")@y.values[[1]]

## [1] 0.7921616

performance(prediction(predict(bank_mdl_ranger_v4,bank_add_test, type = "prob")[,2] , bank_add_test$y), "auc")@y.values[[1]]

## [1] 0.7920103

XG Boost

labels <- bank_add_train_bal$y
ts_labels <- bank_add_test$y


new_tr <- model.matrix(~.+0, data = bank_add_train_bal[,-20], with = F)
new_ts <- model.matrix(~.+0, data = bank_add_test[,-20], with = F)

labels <- as.numeric(labels) -1
ts_labels <- as.numeric(ts_labels) -1

Preparing Matrix

library(xgboost)

## 
## Attaching package: 'xgboost'

## The following object is masked from 'package:plotly':
## 
##     slice

## The following object is masked from 'package:dplyr':
## 
##     slice

dtrain <- xgb.DMatrix(data = new_tr, label = labels)
dtest <- xgb.DMatrix(data = new_ts, label = ts_labels)

param <- list(booster = "gbtree", objective = "binary:logistic", eta = 0.3, gamma = 0, max_depth = 6, min_child_weight=1, subsample = 1, colsample_bytree=1 )
xgbcv <- xgb.cv(params = param, data = dtrain, nrounds = 500, nfold = 5, showsd = T, stratified = T, print_every_n = 10, early_stopping_rounds = 20, maximize = F)

## [1]  train-error:0.000000+0.000000   test-error:0.000000+0.000000 
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
## 
## [11] train-error:0.000000+0.000000   test-error:0.000000+0.000000 
## [21] train-error:0.000000+0.000000   test-error:0.000000+0.000000 
## Stopping. Best iteration:
## [1]  train-error:0.000000+0.000000   test-error:0.000000+0.000000

XG Boost Model training

xgb1 <- xgb.train (params = param, data = dtrain, nrounds = 117, watchlist = list(val=dtest,train=dtrain), print.every.n = 10, early.stop.round = 10, maximize = F , eval_metric = "error")

## Warning: 'print.every.n' is deprecated.
## Use 'print_every_n' instead.
## See help("Deprecated") and help("xgboost-deprecated").

## Warning: 'early.stop.round' is deprecated.
## Use 'early_stopping_rounds' instead.
## See help("Deprecated") and help("xgboost-deprecated").

## [1]  val-error:0.000000  train-error:0.000000 
## Multiple eval metrics are present. Will use train_error for early stopping.
## Will train until train_error hasn't improved in 10 rounds.
## 
## [11] val-error:0.000000  train-error:0.000000 
## Stopping. Best iteration:
## [1]  val-error:0.000000  train-error:0.000000

XG Boost Confusion Matrix

xgbpred <- ifelse(predict(xgb1, dtest)>.5, 1, 0)

confusionMatrix(as.factor(xgbpred), as.factor(ts_labels), positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 7291    0
##          1    0  946
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9996, 1)
##     No Information Rate : 0.8852     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.1148     
##          Detection Rate : 0.1148     
##    Detection Prevalence : 0.1148     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : 1          
##

bank_xgb_roc <- prediction(xgbpred, bank_add_test$y)
plot(performance(bank_rf_roc, "tpr", "fpr"))

bank_xgb_auc <- performance(bank_xgb_roc, "auc")
bank_xgb_auc@y.values[[1]]

## [1] 1

Conclusion

The original research team that’s working on this dataset is able have an AUC score of 0.8. Although I’m not really sure whether we should have AUC as the benchmark since the proportion of our target variable is quite imbalanced. The reason can be seen in our model comparison table below.

Here we can see that the model Ranger v1 has the highest AUC of 0.81, a very high Precision value at 0.74, and a poor recall of 0.17. The reason being this data is trained with an imbalanced training data, so the result reflects that.

In my opinion, if we have to deploy a model, I’d choose either the Random Forest or the Ranger v3 for their Recall and Precision result.

Model Prediction Plot Density Comparison

As I’m working on this exercise, I stumbled across an interesting discovery. If you look at our density plot comparison, you can see how different models prediction behaviors differ from each other.

Logistic Regression

plot(density(bank_logreg_pred))

Decision Tree

plot(density(predict(bank_add_tree_v2, newdata = bank_add_test, type = "prob")[,2] ))

Random Forest

plot(density(bank_rf_predict_prob[,2]))

Random Forest with Ranger v1

plot(density(predict(bank_mdl_ranger_v1, bank_add_test, type = "prob")[,2]))

Random Forest with Ranger v3

plot(density(predict(bank_mdl_ranger_v3, bank_add_test, type = "prob")[,2]))

XG Boost

plot(density(xgbpred))

Bank Marketing Classification

Deo Ivan Mareza

2/28/2020

Background

Introduction

Library

Data Import

Data Variables Explanation

Target Variable Proportion

Data Wrangling

Cross Validation

Balancing the Data

Logistic Regression

Logistic Regression Confusion Matrix

Log Reg ROC Curve and AUC

Decision Tree

Decision Tree Confusion Matrix

ROC Curve and AUC

Random Forest Model

Near Zero Var

Model Training

RF Model Confusion Matrix

RF Model ROC Curve and AUC

Random Forest with Ranger

Version 1

Version 2

Version 3

Version 4

XG Boost

XG Boost Model training

XG Boost Confusion Matrix

Conclusion

Model Prediction Plot Density Comparison

Logistic Regression

Decision Tree

Random Forest

Random Forest with Ranger v1

Random Forest with Ranger v3

XG Boost