This dataset is a telemarketing campaign for a banking institution in Portugal. In this exercise, my objective -besides determining whether a customer will subscribe or not- is to try and compare a few classification models performance. The models that I’m trying in this exercise are Logistic Regression, Decision Tree, Random Forest, Random Forest with Ranger, and XGBoost.
We have 2 sets of data provided, one with the social and economic variables, and one without. We’ll try with the one with more predictor variables.
Our input variables has been divided into sections, based on its attributes.
Bank Client Data
1 - age
2 - job
3 - marital : marital status
4 - education
5 - default: has credit in default?
6 - housing: has housing loan?
7 - loan: has personal loan?
Related with the Last Contact of the Current Campaign
8 - contact: contact communication type
9 - month: last contact month of year
10 - day_of_week: last contact day of the week
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=‘no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
Other Attributes
12 - campaign: number of contacts performed during this campaign and for this client
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign
14 - previous: number of contacts performed before this campaign and for this client
15 - poutcome: outcome of the previous marketing campaign
Social and Economic Context Attributes
16 - emp.var.rate: employment variation rate - quarterly indicator
17 - cons.price.idx: consumer price index - monthly indicator
18 - cons.conf.idx: consumer confidence index - monthly indicator
19 - euribor3m: euribor 3 month rate - daily indicator
20 - nr.employed: number of employees - quarterly indicator
Target Variable
21 - y - has the client subscribed a term deposit? (binary: ‘yes’,‘no’)
##
## no yes
## 0.8873458 0.1126542
Our target is quite imbalanced, so we will try to balance our training data with SMOTE.
We’re dropping duration variable. Because it’s mentioned in the description that the attribute highly affects the output target and should be discarded if the intention is to have a realistic predictive model.
This might be a perfect separation variable, and might hinder our learning process by including it inside our data.
set.seed(123)
split <- initial_split(bank_add, prop = .8, strata = y)
bank_add_train <- training(split)
bank_add_test <- testing(split)
table(bank_add_train$y)##
## no yes
## 29257 3694
##
## no yes
## 0.8878941 0.1121059
We’ll try to downsample the proportion using SMOTE from 89-11 to a more acceptable proportion. We’re trying 2 variation of the proportion. One is where it’s 50/50 ratio, and one with 60/40 ratio.
bank_add_train_down <- SMOTE(form = y~.,data = bank_add_train, k = 5, perc.over = 100, perc.under = 200)
bank_add_train_bal <- SMOTE(form = y~.,data = bank_add_train, k = 8, perc.over = 390)
table(bank_add_train_bal$y)##
## no yes
## 22164 14776
##
## no yes
## 0.6 0.4
##
## Call: glm(formula = y ~ job + marital + education + default + housing +
## loan + month + day_of_week + campaign + pdays + poutcome +
## emp.var.rate + cons.price.idx + cons.conf.idx + euribor3m +
## nr.employed + age_log, family = "binomial", data = bank_add_train_bal)
##
## Coefficients:
## (Intercept) jobblue-collar
## -27.724693 -0.044227
## jobentrepreneur jobhousemaid
## -0.011992 0.092992
## jobmanagement jobretired
## 0.005428 0.245235
## jobself-employed jobservices
## 0.178374 0.019914
## jobstudent jobtechnician
## 0.580556 0.049140
## jobunemployed jobunknown
## 0.068199 0.509203
## maritalmarried maritalsingle
## -0.463148 -0.167572
## maritalunknown educationbasic.6y
## 0.129994 0.409697
## educationbasic.9y educationhigh.school
## 0.043945 0.042286
## educationilliterate educationprofessional.course
## -0.121657 0.110212
## educationuniversity.degree educationunknown
## 0.109920 0.234846
## defaultunknown defaultyes
## 0.706461 -8.958572
## housingunknown housingyes
## 0.549389 -0.168088
## loanunknown loanyes
## 1.073310 0.890196
## monthaug monthdec
## 0.142034 0.687731
## monthjul monthjun
## 0.204754 0.007234
## monthmar monthmay
## 1.212656 -0.615434
## monthnov monthoct
## -0.456716 -0.157620
## monthsep day_of_weekmon
## -0.465392 -0.139545
## day_of_weekthu day_of_weektue
## -0.002313 0.001691
## day_of_weekwed campaign
## 0.095933 -0.056223
## pdays poutcomenonexistent
## -0.001216 -0.313641
## poutcomesuccess emp.var.rate
## 0.145713 -0.796084
## cons.price.idx cons.conf.idx
## 0.695070 0.019691
## euribor3m nr.employed
## 0.389413 -0.007055
## age_log
## -0.169505
##
## Degrees of Freedom: 36939 Total (i.e. Null); 36889 Residual
## Null Deviance: 49720
## Residual Deviance: 37000 AIC: 37100
Here’s our linear regression formula. The variables are chosen after using stepwise function.
bank_add_logreg_step <- glm(formula = y ~ job + marital + education + default + housing +
loan + contact + month + day_of_week + campaign + pdays +
previous + poutcome + emp.var.rate + cons.price.idx + cons.conf.idx +
euribor3m + nr.employed, family = "binomial", data = bank_add_train_bal)
summary(bank_add_logreg_step)##
## Call:
## glm(formula = y ~ job + marital + education + default + housing +
## loan + contact + month + day_of_week + campaign + pdays +
## previous + poutcome + emp.var.rate + cons.price.idx + cons.conf.idx +
## euribor3m + nr.employed, family = "binomial", data = bank_add_train_bal)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9607 -0.7378 -0.5192 0.7630 2.5080
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.477e+01 1.674e+01 -1.479 0.139109
## jobblue-collar -3.915e-02 4.375e-02 -0.895 0.370856
## jobentrepreneur -1.760e-02 7.518e-02 -0.234 0.814844
## jobhousemaid 8.136e-02 8.436e-02 0.965 0.334784
## jobmanagement -4.496e-03 5.717e-02 -0.079 0.937322
## jobretired 1.901e-01 6.463e-02 2.941 0.003271 **
## jobself-employed 1.782e-01 7.199e-02 2.475 0.013327 *
## jobservices 2.386e-02 5.124e-02 0.466 0.641457
## jobstudent 6.165e-01 7.718e-02 7.988 1.37e-15 ***
## jobtechnician 5.213e-02 4.485e-02 1.162 0.245084
## jobunemployed 7.203e-02 8.202e-02 0.878 0.379883
## jobunknown 5.007e-01 1.378e-01 3.633 0.000280 ***
## maritalmarried -4.527e-01 4.034e-02 -11.224 < 2e-16 ***
## maritalsingle -1.329e-01 4.338e-02 -3.064 0.002187 **
## maritalunknown 1.416e-01 2.904e-01 0.488 0.625798
## educationbasic.6y 4.263e-01 6.585e-02 6.475 9.51e-11 ***
## educationbasic.9y 6.043e-02 5.465e-02 1.106 0.268842
## educationhigh.school 6.286e-02 5.288e-02 1.189 0.234616
## educationilliterate -1.409e-01 7.199e-01 -0.196 0.844827
## educationprofessional.course 1.274e-01 5.968e-02 2.135 0.032779 *
## educationuniversity.degree 1.297e-01 5.229e-02 2.479 0.013165 *
## educationunknown 2.404e-01 7.144e-02 3.365 0.000766 ***
## defaultunknown 6.974e-01 3.162e-02 22.059 < 2e-16 ***
## defaultyes -8.942e+00 6.856e+01 -0.130 0.896230
## housingunknown 5.475e-01 8.082e-02 6.774 1.26e-11 ***
## housingyes -1.685e-01 2.654e-02 -6.351 2.14e-10 ***
## loanunknown 1.075e+00 7.923e-02 13.568 < 2e-16 ***
## loanyes 8.896e-01 3.123e-02 28.486 < 2e-16 ***
## contacttelephone 3.242e-02 3.677e-02 0.882 0.377906
## monthaug 1.511e-01 7.583e-02 1.993 0.046278 *
## monthdec 6.807e-01 1.719e-01 3.961 7.47e-05 ***
## monthjul 2.222e-01 6.032e-02 3.683 0.000230 ***
## monthjun 1.589e-02 6.875e-02 0.231 0.817178
## monthmar 1.210e+00 1.066e-01 11.352 < 2e-16 ***
## monthmay -6.143e-01 5.060e-02 -12.141 < 2e-16 ***
## monthnov -4.545e-01 7.104e-02 -6.398 1.57e-10 ***
## monthoct -1.534e-01 1.019e-01 -1.505 0.132361
## monthsep -4.591e-01 1.044e-01 -4.396 1.10e-05 ***
## day_of_weekmon -1.414e-01 4.097e-02 -3.451 0.000559 ***
## day_of_weekthu 1.634e-04 4.044e-02 0.004 0.996777
## day_of_weektue 1.899e-03 4.130e-02 0.046 0.963323
## day_of_weekwed 9.754e-02 4.096e-02 2.381 0.017251 *
## campaign -5.665e-02 6.560e-03 -8.636 < 2e-16 ***
## pdays -1.197e-03 1.280e-04 -9.352 < 2e-16 ***
## previous 1.548e-02 4.525e-02 0.342 0.732347
## poutcomenonexistent -3.034e-01 5.028e-02 -6.033 1.61e-09 ***
## poutcomesuccess 1.532e-01 1.181e-01 1.298 0.194322
## emp.var.rate -7.851e-01 7.098e-02 -11.062 < 2e-16 ***
## cons.price.idx 6.652e-01 1.079e-01 6.164 7.11e-10 ***
## cons.conf.idx 1.783e-02 5.646e-03 3.158 0.001590 **
## euribor3m 3.914e-01 8.127e-02 4.816 1.47e-06 ***
## nr.employed -7.237e-03 1.515e-03 -4.778 1.77e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 49722 on 36939 degrees of freedom
## Residual deviance: 37007 on 36888 degrees of freedom
## AIC: 37111
##
## Number of Fisher Scoring iterations: 9
bank_logreg_pred <- predict(bank_add_logreg, bank_add_test, type = "response")
bank_logreg_pred_label <- as.factor(ifelse(bank_logreg_pred>.7, "yes", "no"))
confusionMatrix(bank_logreg_pred_label, bank_add_test$y, positive = "yes")## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6930 589
## yes 361 357
##
## Accuracy : 0.8847
## 95% CI : (0.8776, 0.8915)
## No Information Rate : 0.8852
## P-Value [Acc > NIR] : 0.5635
##
## Kappa : 0.3663
##
## Mcnemar's Test P-Value : 1.773e-13
##
## Sensitivity : 0.37738
## Specificity : 0.95049
## Pos Pred Value : 0.49721
## Neg Pred Value : 0.92167
## Prevalence : 0.11485
## Detection Rate : 0.04334
## Detection Prevalence : 0.08717
## Balanced Accuracy : 0.66393
##
## 'Positive' Class : yes
##
After a few trials, this is the best decision tree model that we have.
bank_add_tree_v2 <- ctree(formula = y~., data = bank_add_train_bal, control = ctree_control(mincriterion = .95, minsplit = 200, minbucket = 500))
# plot(bank_add_tree_v2)
bank_add_predict_v2 <- predict(bank_add_tree_v2, newdata = bank_add_test)## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6668 475
## yes 623 471
##
## Accuracy : 0.8667
## 95% CI : (0.8592, 0.874)
## No Information Rate : 0.8852
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3862
##
## Mcnemar's Test P-Value : 9.154e-06
##
## Sensitivity : 0.49789
## Specificity : 0.91455
## Pos Pred Value : 0.43053
## Neg Pred Value : 0.93350
## Prevalence : 0.11485
## Detection Rate : 0.05718
## Detection Prevalence : 0.13282
## Balanced Accuracy : 0.70622
##
## 'Positive' Class : yes
##
Since we already split the data as well as balancing it with SMOTE, all that’s left to do for now is to take out the near zero variance variable out, and train our random forest model.
We’re using k fold cross validation in order to avoid bias and overfitting.
## Warning: `repeats` has no meaning for this resampling method.
# bank_rf_model <- train(y ~., data = bank_add_train_bal, method = "rf", trControl= ctrl)
# bank_rf_model <- readRDS("bank_rf_model.RDS")
bank_rf_model <- readRDS("bank_rf_model_ageLog.RDS")Exporting our model, just in case we need it in the future.
## rf variable importance
##
## only 20 most important variables shown (out of 51)
##
## Overall
## nr.employed 100.000
## previous 58.946
## cons.conf.idx 57.939
## euribor3m 55.567
## cons.price.idx 52.258
## age_log 39.730
## emp.var.rate 19.824
## campaign 19.624
## housingyes 8.156
## monthmay 5.938
## loanyes 5.768
## maritalmarried 5.254
## contacttelephone 4.670
## educationuniversity.degree 4.665
## jobtechnician 4.480
## poutcomesuccess 4.391
## educationhigh.school 4.365
## maritalsingle 4.348
## defaultunknown 4.280
## day_of_weekmon 4.026
bank_rf_pred <- predict(bank_rf_model, bank_add_test)
confusionMatrix(bank_rf_pred, bank_add_test$y, positive = "yes")## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6742 456
## yes 549 490
##
## Accuracy : 0.878
## 95% CI : (0.8707, 0.885)
## No Information Rate : 0.8852
## P-Value [Acc > NIR] : 0.979439
##
## Kappa : 0.4245
##
## Mcnemar's Test P-Value : 0.003707
##
## Sensitivity : 0.51797
## Specificity : 0.92470
## Pos Pred Value : 0.47161
## Neg Pred Value : 0.93665
## Prevalence : 0.11485
## Detection Rate : 0.05949
## Detection Prevalence : 0.12614
## Balanced Accuracy : 0.72134
##
## 'Positive' Class : yes
##
bank_rf_predict_prob <- predict(bank_rf_model, bank_add_test, type = "prob")
bank_rf_roc <- prediction(bank_rf_predict_prob[,2], bank_add_test$y)
plot(performance(bank_rf_roc, "tpr", "fpr"))Our AUC Value
## [1] 0.7810972
This is a few different configurations of random forest using ranger to optimize our model training phase by using all of our CPU threads.
fitControl <- trainControl(method = "cv",
summaryFunction = prSummary,
classProbs = T,
savePredictions = T,
verboseIter = F)
tgrid <- expand.grid(.mtry = 2:4,
.splitrule = "gini",
.min.node.size = c(10,20))
# bank_mdl_ranger <- train(y ~., data = bank_add_train,
# method = "ranger",
# metric = "AUC",
# trControl = fitControl)
# saveRDS(bank_mdl_ranger, file = "testFit.RDS")
bank_mdl_ranger_v1 <- readRDS("testFit.RDS")
bank_mdl_ranger_v2 <- readRDS("rangerModel_v2.RDS")
bank_mdl_ranger_v3 <- readRDS("rangerModel_v3.RDS")
bank_mdl_ranger_v4 <- readRDS("rangerModel_v4.RDS")
# x <- evalm(list(im_fit))Ranger model version 1 - Data Train without SMOTE (imbalanced)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7235 783
## yes 56 163
##
## Accuracy : 0.8981
## 95% CI : (0.8914, 0.9046)
## No Information Rate : 0.8852
## P-Value [Acc > NIR] : 9.234e-05
##
## Kappa : 0.2473
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.17230
## Specificity : 0.99232
## Pos Pred Value : 0.74429
## Neg Pred Value : 0.90234
## Prevalence : 0.11485
## Detection Rate : 0.01979
## Detection Prevalence : 0.02659
## Balanced Accuracy : 0.58231
##
## 'Positive' Class : yes
##
Ranger model v2 - Data Train with uneven SMOTE at .6 to .4
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6389 371
## yes 902 575
##
## Accuracy : 0.8455
## 95% CI : (0.8375, 0.8532)
## No Information Rate : 0.8852
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3891
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.60782
## Specificity : 0.87629
## Pos Pred Value : 0.38930
## Neg Pred Value : 0.94512
## Prevalence : 0.11485
## Detection Rate : 0.06981
## Detection Prevalence : 0.17931
## Balanced Accuracy : 0.74205
##
## 'Positive' Class : yes
##
Ranger model v3 - Data Train with even SMOTE at .5 to .5
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6804 494
## yes 487 452
##
## Accuracy : 0.8809
## 95% CI : (0.8737, 0.8878)
## No Information Rate : 0.8852
## P-Value [Acc > NIR] : 0.8896
##
## Kappa : 0.4123
##
## Mcnemar's Test P-Value : 0.8481
##
## Sensitivity : 0.47780
## Specificity : 0.93321
## Pos Pred Value : 0.48136
## Neg Pred Value : 0.93231
## Prevalence : 0.11485
## Detection Rate : 0.05487
## Detection Prevalence : 0.11400
## Balanced Accuracy : 0.70550
##
## 'Positive' Class : yes
##
Ranger model v4 - Data Train with even SMOTE at .5 to .5 and tgrid control
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6802 479
## yes 489 467
##
## Accuracy : 0.8825
## 95% CI : (0.8753, 0.8894)
## No Information Rate : 0.8852
## P-Value [Acc > NIR] : 0.7821
##
## Kappa : 0.4246
##
## Mcnemar's Test P-Value : 0.7724
##
## Sensitivity : 0.4937
## Specificity : 0.9329
## Pos Pred Value : 0.4885
## Neg Pred Value : 0.9342
## Prevalence : 0.1148
## Detection Rate : 0.0567
## Detection Prevalence : 0.1161
## Balanced Accuracy : 0.7133
##
## 'Positive' Class : yes
##
## [1] 0.811119
## [1] 0.7863936
## [1] 0.7921616
## [1] 0.7920103
labels <- bank_add_train_bal$y
ts_labels <- bank_add_test$y
new_tr <- model.matrix(~.+0, data = bank_add_train_bal[,-20], with = F)
new_ts <- model.matrix(~.+0, data = bank_add_test[,-20], with = F)
labels <- as.numeric(labels) -1
ts_labels <- as.numeric(ts_labels) -1Preparing Matrix
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:plotly':
##
## slice
## The following object is masked from 'package:dplyr':
##
## slice
dtrain <- xgb.DMatrix(data = new_tr, label = labels)
dtest <- xgb.DMatrix(data = new_ts, label = ts_labels)param <- list(booster = "gbtree", objective = "binary:logistic", eta = 0.3, gamma = 0, max_depth = 6, min_child_weight=1, subsample = 1, colsample_bytree=1 )
xgbcv <- xgb.cv(params = param, data = dtrain, nrounds = 500, nfold = 5, showsd = T, stratified = T, print_every_n = 10, early_stopping_rounds = 20, maximize = F)## [1] train-error:0.000000+0.000000 test-error:0.000000+0.000000
## Multiple eval metrics are present. Will use test_error for early stopping.
## Will train until test_error hasn't improved in 20 rounds.
##
## [11] train-error:0.000000+0.000000 test-error:0.000000+0.000000
## [21] train-error:0.000000+0.000000 test-error:0.000000+0.000000
## Stopping. Best iteration:
## [1] train-error:0.000000+0.000000 test-error:0.000000+0.000000
## Warning: 'print.every.n' is deprecated.
## Use 'print_every_n' instead.
## See help("Deprecated") and help("xgboost-deprecated").
## Warning: 'early.stop.round' is deprecated.
## Use 'early_stopping_rounds' instead.
## See help("Deprecated") and help("xgboost-deprecated").
## [1] val-error:0.000000 train-error:0.000000
## Multiple eval metrics are present. Will use train_error for early stopping.
## Will train until train_error hasn't improved in 10 rounds.
##
## [11] val-error:0.000000 train-error:0.000000
## Stopping. Best iteration:
## [1] val-error:0.000000 train-error:0.000000
xgbpred <- ifelse(predict(xgb1, dtest)>.5, 1, 0)
confusionMatrix(as.factor(xgbpred), as.factor(ts_labels), positive = "1")## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7291 0
## 1 0 946
##
## Accuracy : 1
## 95% CI : (0.9996, 1)
## No Information Rate : 0.8852
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.1148
## Detection Rate : 0.1148
## Detection Prevalence : 0.1148
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : 1
##
## [1] 1
The original research team that’s working on this dataset is able have an AUC score of 0.8. Although I’m not really sure whether we should have AUC as the benchmark since the proportion of our target variable is quite imbalanced. The reason can be seen in our model comparison table below.
Here we can see that the model Ranger v1 has the highest AUC of 0.81, a very high Precision value at 0.74, and a poor recall of 0.17. The reason being this data is trained with an imbalanced training data, so the result reflects that.
In my opinion, if we have to deploy a model, I’d choose either the Random Forest or the Ranger v3 for their Recall and Precision result.
As I’m working on this exercise, I stumbled across an interesting discovery. If you look at our density plot comparison, you can see how different models prediction behaviors differ from each other.