1 Objective

The objective of this project is to predict the likelihood that a customer will purchase the product again, given the customer's profile and how and when the customer was contacted regarding the promotions.

2 Setup and Data Import

# libraries
library(caret)
library(dplyr)
library(partykit)
library(DMwR)
library(rsample)
library(class)
library(e1071)

Import the two datasets and bind them into one, keeping only distinct rows.

bank <- read.csv('bank-additional-full.csv', sep = ';')
bank2 <- read.csv('bank-additional.csv', sep = ';')
bankc <- distinct(rbind(bank, bank2))
head(bankc)
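
A quick check of how many duplicate rows distinct() removed when the two files were combined:

# rows dropped as duplicates after binding the two files
nrow(bank) + nrow(bank2) - nrow(bankc)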

3 Data Wrangling

Convert all character columns into factors.

bankc <- bankc %>% mutate_if(is.character, as.factor)
summary(bankc)
#>       age                 job            marital     
#>  Min.   :17.00   admin.     :10419   divorced: 4611  
#>  1st Qu.:32.00   blue-collar: 9253   married :24921  
#>  Median :38.00   technician : 6739   single  :11564  
#>  Mean   :40.02   services   : 3967   unknown :   80  
#>  3rd Qu.:47.00   management : 2924                   
#>  Max.   :98.00   retired    : 1718                   
#>                  (Other)    : 6156                   
#>                education        default         housing           loan      
#>  university.degree  :12164   no     :32577   no     :18615   no     :33938  
#>  high.school        : 9512   unknown: 8596   unknown:  990   unknown:  990  
#>  basic.9y           : 6045   yes    :    3   yes    :21571   yes    : 6248  
#>  professional.course: 5240                                                  
#>  basic.4y           : 4176                                                  
#>  basic.6y           : 2291                                                  
#>  (Other)            : 1748                                                  
#>       contact          month       day_of_week    duration     
#>  cellular :26135   may    :13767   fri:7826    Min.   :   0.0  
#>  telephone:15041   jul    : 7169   mon:8512    1st Qu.: 102.0  
#>                    aug    : 6176   thu:8618    Median : 180.0  
#>                    jun    : 5318   tue:8086    Mean   : 258.3  
#>                    nov    : 4100   wed:8134    3rd Qu.: 319.0  
#>                    apr    : 2631               Max.   :4918.0  
#>                    (Other): 2015                               
#>     campaign          pdays          previous            poutcome    
#>  Min.   : 1.000   Min.   :  0.0   Min.   :0.000   failure    : 4252  
#>  1st Qu.: 1.000   1st Qu.:999.0   1st Qu.:0.000   nonexistent:35551  
#>  Median : 2.000   Median :999.0   Median :0.000   success    : 1373  
#>  Mean   : 2.568   Mean   :962.5   Mean   :0.173                      
#>  3rd Qu.: 3.000   3rd Qu.:999.0   3rd Qu.:0.000                      
#>  Max.   :56.000   Max.   :999.0   Max.   :7.000                      
#>                                                                      
#>   emp.var.rate      cons.price.idx  cons.conf.idx     euribor3m    
#>  Min.   :-3.40000   Min.   :92.20   Min.   :-50.8   Min.   :0.634  
#>  1st Qu.:-1.80000   1st Qu.:93.08   1st Qu.:-42.7   1st Qu.:1.344  
#>  Median : 1.10000   Median :93.75   Median :-41.8   Median :4.857  
#>  Mean   : 0.08192   Mean   :93.58   Mean   :-40.5   Mean   :3.621  
#>  3rd Qu.: 1.40000   3rd Qu.:93.99   3rd Qu.:-36.4   3rd Qu.:4.961  
#>  Max.   : 1.40000   Max.   :94.77   Max.   :-26.9   Max.   :5.045  
#>                                                                    
#>   nr.employed     y        
#>  Min.   :4964   no :36537  
#>  1st Qu.:5099   yes: 4639  
#>  Median :5191              
#>  Mean   :5167              
#>  3rd Qu.:5228              
#>  Max.   :5228              
#> 

Check target proportion.

prop.table(table(bankc$y))
#> 
#>        no       yes 
#> 0.8873373 0.1126627

The target variable is extremely imbalanced. We can either remove observations from the majority class (downsampling) or generate synthetic data to balance the classes; we will do this after splitting the dataset.

Check for missing values.

anyNA(bankc)
#> [1] FALSE
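
anyNA() only detects true NA values; in this dataset missingness is instead encoded as the unknown level of several factors (as seen in the summary above). A quick way to count those per column:

# number of 'unknown' entries in each column
sapply(bankc, function(x) sum(x == 'unknown'))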

3.1 Train Test Splitting

set.seed(69420)

index <- initial_split(bankc, 0.75, strata = 'y')
bank.train <- training(index)
bank.test <- testing(index)
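
As a quick sanity check, we can confirm that the stratified split kept the original class proportions in both partitions:

# both partitions should show roughly the same 0.89 / 0.11 class balance as the full data
prop.table(table(bank.train$y))
prop.table(table(bank.test$y))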

3.2 Down Sampling Train Set

bank.downtrain <- downSample(bank.train, bank.train$y)
bank.downtrain <- bank.downtrain %>% mutate_if(is.character, as.factor)
prop.table(table(bank.downtrain$Class))
#> 
#>  no yes 
#> 0.5 0.5
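
Note that downSample() keeps every column of the data frame passed as x and appends the labels as a new Class column, so the target ends up duplicated here (as y and as Class). A minimal cleanup sketch if we wanted to train on this set:

# keep only one copy of the target, using `Class` as the label
bank.downtrain <- bank.downtrain %>% select(-y)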

3.3 Balancing Target Variable Using SMOTE() Function

Just like downsampling, the goal remains the same: to balance the target proportions. But rather than only removing majority-class observations, SMOTE generates new synthetic data based on the existing minority cases and also undersamples the majority class.

bank.train <- SMOTE(y~. , data=bank.train)
prop.table(table(bank.train$y))
#> 
#>        no       yes 
#> 0.5714286 0.4285714
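
The resulting 4:3 ratio comes from SMOTE()'s default parameters (perc.over = 200, perc.under = 200): each minority case spawns two synthetic cases, tripling the minority class, while the majority class is undersampled to two cases per synthetic case, i.e. four times the original minority count. A sketch of how an exact 50/50 split could be obtained instead (not run here, since the rest of the analysis uses the defaults):

# perc.over = 100 -> one synthetic case per minority case (minority class doubles);
# perc.under = 200 -> two majority cases kept per synthetic case -> 50/50 split
# bank.train <- SMOTE(y ~ ., data = bank.train, perc.over = 100, perc.under = 200)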

4 Creating Model

4.1 Naive Bayes Model

Naive Bayes is a machine learning model that leverages Bayes' theorem for classification. It is called 'naive' because the model assumes that all features are independent of one another and weigh the same in its prediction.

model_naive1 <- naiveBayes(bank.train %>% select(-y),
                           bank.train$y)

4.1.1 Prediction

pred_class1 <- predict(model_naive1, bank.test %>% select(-y))

4.1.2 Evaluation

confusionMatrix(pred_class1, bank.test$y, positive = 'yes')
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   no  yes
#>        no  7566  299
#>        yes 1568  860
#>                                              
#>                Accuracy : 0.8186             
#>                  95% CI : (0.811, 0.826)     
#>     No Information Rate : 0.8874             
#>     P-Value [Acc > NIR] : 1                  
#>                                              
#>                   Kappa : 0.3859             
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.74202            
#>             Specificity : 0.82833            
#>          Pos Pred Value : 0.35420            
#>          Neg Pred Value : 0.96198            
#>              Prevalence : 0.11260            
#>          Detection Rate : 0.08355            
#>    Detection Prevalence : 0.23589            
#>       Balanced Accuracy : 0.78518            
#>                                              
#>        'Positive' Class : yes                
#> 

4.1.3 Naive Bayes ROCAUC Evaluation

Create an object of predicted probabilities

y_pred_prob <- predict(object = model_naive1, 
                           newdata = bank.test, 
                           type = "raw")

Create Data Frame of predicted probability vs actual

data_roc <- data.frame(pred_prob = y_pred_prob[, 'yes'],
           actual = ifelse(bank.test$y == 'yes', 1, 0))
head(data_roc)

Plotting ROC

library(ROCR)

bank_roc_naive <- prediction(predictions = data_roc$pred_prob,
                        labels = data_roc$actual)

# ROC curve
plot(performance(bank_roc_naive, "tpr", "fpr"))

bank_auc <- performance(bank_roc_naive, measure = "auc")
bank_auc@y.values
#> [[1]]
#> [1] 0.8707561

4.1.4 Interpreting Naive Bayes Model

model_naive1
#> 
#> Naive Bayes Classifier for Discrete Predictors
#> 
#> Call:
#> naiveBayes.default(x = bank.train %>% select(-y), y = bank.train$y)
#> 
#> A-priori probabilities:
#> bank.train$y
#>        no       yes 
#> 0.5714286 0.4285714 
#> 
#> Conditional probabilities:
#>             age
#> bank.train$y     [,1]      [,2]
#>          no  39.93556  9.881943
#>          yes 41.17564 12.105453
#> 
#>             job
#> bank.train$y      admin. blue-collar entrepreneur   housemaid  management
#>          no  0.251077586 0.236997126  0.039295977 0.026077586 0.073132184
#>          yes 0.262260536 0.156992337  0.025862069 0.019540230 0.062643678
#>             job
#> bank.train$y     retired self-employed    services     student  technician
#>          no  0.034698276   0.034913793 0.096336207 0.016307471 0.159339080
#>          yes 0.090804598   0.032567050 0.083141762 0.071839080 0.145785441
#>             job
#> bank.train$y  unemployed     unknown
#>          no  0.023132184 0.008692529
#>          yes 0.036590038 0.011973180
#> 
#>             marital
#> bank.train$y    divorced     married      single     unknown
#>          no  0.115732759 0.606465517 0.275000000 0.002801724
#>          yes 0.137835249 0.492432950 0.367624521 0.002107280
#> 
#>             education
#> bank.train$y     basic.4y     basic.6y     basic.9y  high.school   illiterate
#>          no  0.1035201149 0.0599856322 0.1496408046 0.2303879310 0.0005028736
#>          yes 0.1035440613 0.0416666667 0.1255747126 0.2204980843 0.0008620690
#>             education
#> bank.train$y professional.course university.degree      unknown
#>          no         0.1252873563      0.2909482759 0.0397270115
#>          yes        0.1282567050      0.3176245211 0.0619731801
#> 
#>             default
#> bank.train$y        no   unknown       yes
#>          no  0.7802443 0.2197557 0.0000000
#>          yes 0.7887931 0.2112069 0.0000000
#> 
#>             housing
#> bank.train$y         no    unknown        yes
#>          no  0.45237069 0.02579023 0.52183908
#>          yes 0.43611111 0.07021073 0.49367816
#> 
#>             loan
#> bank.train$y         no    unknown        yes
#>          no  0.82356322 0.02579023 0.15064655
#>          yes 0.67519157 0.06599617 0.25881226
#> 
#>             contact
#> bank.train$y  cellular telephone
#>          no  0.6063937 0.3936063
#>          yes 0.7202107 0.2797893
#> 
#>             month
#> bank.train$y         apr         aug         dec         jul         jun
#>          no  0.052155172 0.150933908 0.001939655 0.173204023 0.131034483
#>          yes 0.112835249 0.117432950 0.025191571 0.140900383 0.141666667
#>             month
#> bank.train$y         mar         may         nov         oct         sep
#>          no  0.006393678 0.361709770 0.104669540 0.009267241 0.008692529
#>          yes 0.057950192 0.205651341 0.087260536 0.058524904 0.052586207
#> 
#>             day_of_week
#> bank.train$y       fri       mon       thu       tue       wed
#>          no  0.1898707 0.2099138 0.2038793 0.2038793 0.1924569
#>          yes 0.1772989 0.2022989 0.2188697 0.2040230 0.1975096
#> 
#>             duration
#> bank.train$y     [,1]     [,2]
#>          no  218.9117 202.2535
#>          yes 571.6600 367.0978
#> 
#>             campaign
#> bank.train$y     [,1]     [,2]
#>          no  2.664296 2.890663
#>          yes 2.076719 1.540320
#> 
#>             pdays
#> bank.train$y     [,1]     [,2]
#>          no  983.9510 121.3076
#>          yes 815.2704 370.2243
#> 
#>             previous
#> bank.train$y      [,1]      [,2]
#>          no  0.1289511 0.4046867
#>          yes 0.5496135 0.7895448
#> 
#>             poutcome
#> bank.train$y    failure nonexistent    success
#>          no  0.09683908  0.88994253 0.01321839
#>          yes 0.22404215  0.62538314 0.15057471
#> 
#>             emp.var.rate
#> bank.train$y      [,1]     [,2]
#>          no   0.256602 1.473560
#>          yes -1.186749 1.605262
#> 
#>             cons.price.idx
#> bank.train$y     [,1]      [,2]
#>          no  93.60411 0.5552965
#>          yes 93.39178 0.6374455
#> 
#>             cons.conf.idx
#> bank.train$y      [,1]     [,2]
#>          no  -40.57040 4.346403
#>          yes -39.86443 5.568549
#> 
#>             euribor3m
#> bank.train$y     [,1]     [,2]
#>          no  3.822509 1.629807
#>          yes 2.154097 1.744165
#> 
#>             nr.employed
#> bank.train$y     [,1]     [,2]
#>          no  5176.542 63.99564
#>          yes 5095.886 86.60270

We can interpret each feature through its conditional probability given the target class, for example:

  • The probability that a customer is married, given that they made a purchase during the campaign (y = yes), is about 0.49.

To score a new observation, multiply the class prior by the conditional probability of each predictor value, do the same for the other class, then divide each product by the total over both classes to obtain the posterior probabilities.
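
To make that arithmetic concrete, here is a minimal sketch (illustration only; new_obs is just the first test row, chosen arbitrarily) that recomputes the posterior for one observation directly from the tables stored in model_naive1 and compares it with predict():

# Bayes' rule: P(class | x) is proportional to P(class) * prod_i P(x_i | class)
new_obs <- bank.test[1, ]

post <- sapply(c('no', 'yes'), function(cl) {
  p <- model_naive1$apriori[cl] / sum(model_naive1$apriori)   # class prior
  for (v in names(model_naive1$tables)) {
    tab <- model_naive1$tables[[v]]
    if (is.numeric(new_obs[[v]])) {
      # numeric predictor: Gaussian density using the stored mean and sd
      p <- p * dnorm(new_obs[[v]], mean = tab[cl, 1], sd = tab[cl, 2])
    } else {
      # categorical predictor: conditional probability from the table
      # (a level with zero probability in both classes would make this degenerate;
      #  predict() guards against that with a small threshold)
      p <- p * tab[cl, as.character(new_obs[[v]])]
    }
  }
  p
})

post / sum(post)                              # manual posterior
predict(model_naive1, new_obs, type = "raw")  # should closely agree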

4.2 Decision Tree Model

A decision tree is a tree-based model that creates rules to decide the output. The model is easy to interpret, but it is prone to overfitting, so in most cases we will find ourselves using a Random Forest model instead.

Create the model object

bank_tree <- ctree(formula = y ~ ., data = bank.train, control = ctree_control(minbucket = 50))
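
Since interpretability is the main appeal of a decision tree, we can inspect the fitted rules directly; a minimal sketch (the compact type = "simple" layout is easier to read than the default for a tree this size):

# draw a compact version of the fitted tree; print(bank_tree) would list the rules as text
plot(bank_tree, type = "simple")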

Because our train set no longer contains any observations of the yes level in the default column, we remove the observations with that level from the test set.

summary(bank.train$default)
#>      no unknown     yes 
#>   19096    5264       0
summary(bank.test$default)
#>      no unknown     yes 
#>    8102    2190       1
bank.test2 <- bank.test %>% filter(default != 'yes') 

4.2.1 Prediction

bank_pred_test <- predict(bank_tree, bank.test2, type = 'response')

4.2.2 Evaluation

confusionMatrix(bank_pred_test, bank.test2$y, 'yes')
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   no  yes
#>        no  8048  194
#>        yes 1085  965
#>                                              
#>                Accuracy : 0.8757             
#>                  95% CI : (0.8692, 0.882)    
#>     No Information Rate : 0.8874             
#>     P-Value [Acc > NIR] : 0.9999             
#>                                              
#>                   Kappa : 0.5345             
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.83261            
#>             Specificity : 0.88120            
#>          Pos Pred Value : 0.47073            
#>          Neg Pred Value : 0.97646            
#>              Prevalence : 0.11261            
#>          Detection Rate : 0.09376            
#>    Detection Prevalence : 0.19918            
#>       Balanced Accuracy : 0.85691            
#>                                              
#>        'Positive' Class : yes                
#> 

4.2.3 Decision Tree ROCAUC Evaluation

Create an object of predicted probabilities

y_pred_prob_tree <- predict(object = bank_tree, 
                           newdata = bank.test2, 
                           type = "prob")

Create Data Frame of prediction vs actual

data_roc_tree <- data.frame(pred_prob = y_pred_prob_tree[, 'yes'],
           actual = ifelse(bank.test2$y == 'yes', 1, 0))
head(data_roc_tree)

Plotting ROC

bank_roc_tree <- prediction(predictions = data_roc_tree$pred_prob,
                        labels = data_roc_tree$actual)

plot(performance(bank_roc_tree, "tpr", "fpr"))

bank_auc_tree <- performance(bank_roc_tree, measure = "auc")
bank_auc_tree@y.values
#> [[1]]
#> [1] 0.9304156

4.3 Random Forest Model

Because a single decision tree is prone to overfitting, we can use a Random Forest model instead. Random Forest is an ensemble method that trains many different tree models and then aggregates their results.

Remove columns with near-zero variance.

n0var <- nearZeroVar(bank)
bankt <- bank[,-n0var]
bankt <- bankt %>% mutate_if(is.character, as.factor)
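
To see exactly which column(s) were flagged, we can look them up by the indices returned by nearZeroVar():

# name(s) of the near-zero-variance column(s) removed above
colnames(bank)[n0var]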

Create the model and save it for future use. The training code is commented out because we load the saved model below.

# set.seed(420)
# 
# ctrl <- trainControl(method = "repeatedcv",
#                      number = 5,
#                      repeats = 3)
# 
# bank_forest2 <- train(y ~ .,
#                       data = bankt,
#                       method = "rf",
#                       trControl = ctrl,
#                       tuneLength = 5) # try 5 different mtry values
# bank_forest2
# 
# saveRDS(bank_forest2, 'bank_forest2.RDS')

Import model

bank_forest2 <- readRDS('bank_forest2.RDS')
bank_forest2
#> Random Forest 
#> 
#> 41188 samples
#>    19 predictor
#>     2 classes: 'no', 'yes' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold, repeated 3 times) 
#> Summary of sample sizes: 32951, 32950, 32950, 32950, 32951, 32950, ... 
#> Resampling results across tuning parameters:
#> 
#>   mtry  Accuracy   Kappa    
#>    2    0.9011039  0.2479975
#>   14    0.9150076  0.5399231
#>   27    0.9141498  0.5423452
#>   39    0.9141174  0.5429429
#>   52    0.9130573  0.5382130
#> 
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 14.

4.3.1 Prediction

bank_forest_pred <- predict(bank_forest2, bank.test)

4.3.2 Evaluation

confusionMatrix(bank_forest_pred, bank.test$y)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   no  yes
#>        no  9134    1
#>        yes    0 1158
#>                                              
#>                Accuracy : 0.9999             
#>                  95% CI : (0.9995, 1)        
#>     No Information Rate : 0.8874             
#>     P-Value [Acc > NIR] : <0.0000000000000002
#>                                              
#>                   Kappa : 0.9995             
#>                                              
#>  Mcnemar's Test P-Value : 1                  
#>                                              
#>             Sensitivity : 1.0000             
#>             Specificity : 0.9991             
#>          Pos Pred Value : 0.9999             
#>          Neg Pred Value : 1.0000             
#>              Prevalence : 0.8874             
#>          Detection Rate : 0.8874             
#>    Detection Prevalence : 0.8875             
#>       Balanced Accuracy : 0.9996             
#>                                              
#>        'Positive' Class : no                 
#> 

Inspect the final model (mtry = 14) and its out-of-bag error.

bank_forest2$finalModel
#> 
#> Call:
#>  randomForest(x = x, y = y, mtry = param$mtry) 
#>                Type of random forest: classification
#>                      Number of trees: 500
#> No. of variables tried at each split: 14
#> 
#>         OOB estimate of  error rate: 8.58%
#> Confusion matrix:
#>        no  yes class.error
#> no  35165 1383  0.03784065
#> yes  2149 2491  0.46314655

4.3.3 Random Forest ROCAUC Evaluation

Create an object of predicted probabilities (note that these are computed on the train set)

y_pred_prob_rf <- predict(object = bank_forest2, 
                           newdata = bank.train, 
                           type = "prob")

Create Data Frame of prediction vs actual

data_roc_rf <- data.frame(pred_prob = y_pred_prob_rf[, 'yes'],
           actual = ifelse(bank.train$y == 'yes', 1, 0))
head(data_roc_rf)

Plotting ROC

bank_roc_rf <- prediction(predictions = data_roc_rf$pred_prob,
                        labels = data_roc_rf$actual)

# ROC curve
plot(performance(bank_roc_rf, "tpr", "fpr"))

bank_auc_rf <- performance(bank_roc_rf, measure = "auc")
bank_auc_rf@y.values
#> [[1]]
#> [1] 0.9966882

4.3.4 Interpreting Random Forest Model

varImp(bank_forest2)
#> rf variable importance
#> 
#>   only 20 most important variables shown (out of 52)
#> 
#>                            Overall
#> duration                   100.000
#> euribor3m                   34.250
#> age                         25.871
#> nr.employed                 23.301
#> campaign                    11.467
#> poutcomesuccess             10.283
#> cons.conf.idx                7.561
#> cons.price.idx               6.032
#> emp.var.rate                 5.816
#> housingyes                   5.005
#> previous                     3.920
#> educationuniversity.degree   3.838
#> maritalmarried               3.655
#> loanyes                      3.540
#> day_of_weekthu               3.396
#> day_of_weekwed               3.337
#> maritalsingle                3.336
#> educationhigh.school         3.330
#> day_of_weekmon               3.300
#> day_of_weektue               3.219

Random Forest's main weakness is its limited interpretability: we can only rank the variables from most to least important. In this model, duration is the most important variable, followed by the rest of the table; the scores are scaled so that the most important variable receives a value of 100.
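
For a quicker read than the table, the scaled importances can also be plotted using caret's plot method for varImp objects:

# dotplot of the 20 most important variables
plot(varImp(bank_forest2), top = 20)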

5 Conclusion

Out of all the models we created, Random Forest turned out to be the best according to every metric. However, if the goal is to understand our data, Random Forest is a poor choice because it cannot be interpreted well beyond the importance of each predictor. Instead, we can use a Decision Tree: it still maintains decent performance and can be interpreted as a rule set, or plotted to visualize those rules (although interpretation gets harder as the tree grows deeper, and the plot becomes hard to read). Lastly, we have the Naive Bayes model; while it has the lowest performance, it still achieves decent accuracy and recall, and it can be interpreted very well by using the Naive Bayes formula to compute its probabilities.

To summarize:

Naive Bayes:

  • Can be interpreted very well through its conditional probabilities given each feature.

  • Lowest performance compared to the other models.

Decision Tree:

  • Can be interpreted as a rule set (though not easily once the rules become too numerous).

  • Fast to compute, with decent performance.

  • Tends to overfit.

Random Forest:

  • Aggregates the results of many decision trees, so it tends not to overfit and is less biased.

  • Highest performance.

  • Takes a long time to compute.

  • No interpretation beyond variable importance.