1 Import Library

## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Loading required package: NLP
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
## Warning: package 'partykit' was built under R version 4.1.1
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine

2 Data Read

Bank dataset was obtained from UCI Machine Learning Repository, containing the client data, contact list and some other attributes from Protugese Bank.

bank <- read.csv("bank-full.csv", stringsAsFactors = T, sep = ";")
str(bank)
## 'data.frame':    45211 obs. of  17 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
##  $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
##  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
##  $ loan     : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
##  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Column Description:

Bank client data: - age (numeric) - job : type of job (categorical:“admin.”,“unknown”,“unemployed”,“management”,“housemaid”,“entrepreneur”,“student” ,“blue-collar”,“self-employed”,“retired”,“technician”,“services”) - marital : marital status (categorical: “married”,“divorced”,“single”; note: “divorced” means divorced or widowed) - education (categorical: “unknown”,“secondary”,“primary”,“tertiary”) - default: has credit in default? (binary: “yes”,“no”) - balance: average yearly balance, in euros (numeric) - housing: has housing loan? (binary: “yes”,“no”) - loan: has personal loan? (binary: “yes”,“no”) Related with the last contact of the current campaign: - contact: contact communication type (categorical: “unknown”,“telephone”,“cellular”) - day: last contact day of the month (numeric) - month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”) - duration: last contact duration, in seconds (numeric) Other attributes: - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact) - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted) - previous: number of contacts performed before this campaign and for this client (numeric) - poutcome: outcome of the previous marketing campaign (categorical: “unknown”,“other”,“failure”,“success”)

Output variable (desired target): - y - has the client subscribed a term deposit? (binary: “yes”,“no”)

3 Data Exploration

The objective of this analysis, we are going to build a predictive model to classify whether a bank clients is going to subscribe or reject the subscription. To start this off, we explore deeper into what data are we investigating.

inspect_cat(bank) %>% 
  show_plot()

From the plot, we know that there is an unbalance number of our target variable (y). The table below shows the percentage of people that agrees to subscribe and reject to subscribe.

inspect_num(bank)
## # A tibble: 7 x 10
##   col_name   min    q1 median     mean    q3    max      sd pcnt_na hist        
##   <chr>    <int> <dbl>  <int>    <dbl> <dbl>  <int>   <dbl>   <dbl> <named list>
## 1 age         18    33     39   40.9      48     95   10.6        0 <tibble [18~
## 2 balance  -8019    72    448 1362.     1428 102127 3045.         0 <tibble [25~
## 3 day          1     8     16   15.8      21     31    8.32       0 <tibble [18~
## 4 duration     0   103    180  258.      319   4918  258.         0 <tibble [27~
## 5 campaign     1     1      2    2.76      3     63    3.10       0 <tibble [15~
## 6 pdays       -1    -1     -1   40.2      -1    871  100.         0 <tibble [21~
## 7 previous     0     0      0    0.580     0    275    2.30       0 <tibble [30~
inspect_num(bank[,-c(15)]) %>% show_plot()

GGally::ggcorr(data = bank, hjust = 1, layout.exp = 2, label = T, label_size = 3)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## Warning in GGally::ggcorr(data = bank, hjust = 1, layout.exp = 2, label = T, :
## data in column(s) 'job', 'marital', 'education', 'default', 'housing', 'loan',
## 'contact', 'month', 'poutcome', 'y' are not numeric and were ignored

The plot above shows the correlations between each numeric variables. It shows that there are no variables that are significantly correlate between one another. This means that we can use Bayesian Model to use as classification method.

4 Cross Validation

We differ our data into seperate train and test data. the train data is what we are going to use to train our model, and the test data is going to be the validation of our train model.

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(299)
index <- sample(nrow(bank), nrow(bank)*.80)

bank_train <- bank[index,]
bank_test <- bank[-index,]
prop.table(table(bank$y))
## 
##        no       yes 
## 0.8830152 0.1169848

For the sake of making a good model we will downsample to even out the proportion. We chose to down sample because the data already contains 45,000 and deleting some data only had a minor effect rather than we use duplicated data.

bank_train <- downSample(x = bank_train %>% select(-y), y = bank_train$y, yname = "y")
prop.table(table(bank_train$y))
## 
##  no yes 
## 0.5 0.5

5 Naive Bayes

nb_bank <- naiveBayes(y ~ ., data = bank_train)
nb_pred.test <- predict(object = nb_bank, newdata = bank_test)
nb_pred.train <- predict(object = nb_bank, newdata = bank_train)

Train data Model Evaluation

confusionMatrix(data = nb_pred.train, reference = bank_train$y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  3453 1057
##        yes  802 3198
##                                           
##                Accuracy : 0.7816          
##                  95% CI : (0.7726, 0.7903)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5631          
##                                           
##  Mcnemar's Test P-Value : 3.837e-09       
##                                           
##             Sensitivity : 0.8115          
##             Specificity : 0.7516          
##          Pos Pred Value : 0.7656          
##          Neg Pred Value : 0.7995          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4058          
##    Detection Prevalence : 0.5300          
##       Balanced Accuracy : 0.7816          
##                                           
##        'Positive' Class : no              
## 

Test data Model Evaluation

confusionMatrix(data = nb_pred.test, reference = bank_test$y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  6510  288
##        yes 1499  746
##                                          
##                Accuracy : 0.8024         
##                  95% CI : (0.794, 0.8106)
##     No Information Rate : 0.8857         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.3538         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.8128         
##             Specificity : 0.7215         
##          Pos Pred Value : 0.9576         
##          Neg Pred Value : 0.3323         
##              Prevalence : 0.8857         
##          Detection Rate : 0.7199         
##    Detection Prevalence : 0.7517         
##       Balanced Accuracy : 0.7672         
##                                          
##        'Positive' Class : no             
## 

The confusion matrix suggests

6 Decision Tree

6.1 Pre-pruning

dt_bank <- ctree(y ~ ., data = bank_train)
plot(dt_bank, type = "simple")

dt_pred.train <- predict(object = dt_bank, newdata = bank_train, type = "response")
confusionMatrix(data = dt_pred.train, reference = bank_train$y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  3515  441
##        yes  740 3814
##                                           
##                Accuracy : 0.8612          
##                  95% CI : (0.8537, 0.8685)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7224          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8261          
##             Specificity : 0.8964          
##          Pos Pred Value : 0.8885          
##          Neg Pred Value : 0.8375          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4130          
##    Detection Prevalence : 0.4649          
##       Balanced Accuracy : 0.8612          
##                                           
##        'Positive' Class : no              
## 
dt_pred.test <- predict(object = dt_bank, newdata = bank_test)
confusionMatrix(data = dt_pred.test, reference = bank_test$y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  6412  118
##        yes 1597  916
##                                           
##                Accuracy : 0.8104          
##                  95% CI : (0.8021, 0.8184)
##     No Information Rate : 0.8857          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.423           
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8006          
##             Specificity : 0.8859          
##          Pos Pred Value : 0.9819          
##          Neg Pred Value : 0.3645          
##              Prevalence : 0.8857          
##          Detection Rate : 0.7091          
##    Detection Prevalence : 0.7221          
##       Balanced Accuracy : 0.8432          
##                                           
##        'Positive' Class : no              
## 

The accuracy of the prediction on train data is significantly higher than on our test data. this suggest that the model we made are over-fit.

Our business judgement suggests that we have to minimize the possibility of false positive, where the model suggests that customer are likely to subscribe but evidently is not. This is because to treat a customer that have a possibility to subscribe the bank needs to give more attention and increases cost. So the best strategy is to avoid false Yes prediction.

6.2 Post Pruning

dtprune_bank <- ctree(y ~ ., data = bank_train, control = ctree_control(mincriterion = .01,
                                                                   minsplit = 150,
                                                                   minbucket = 0))
plot(dtprune_bank, type = "simple")

dtprune_pred.test <- predict(dtprune_bank, bank_test)
dtprune_pred.train <- predict(dtprune_bank, bank_train)
confusionMatrix(dtprune_pred.train, bank_train$y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  3611  525
##        yes  644 3730
##                                           
##                Accuracy : 0.8626          
##                  95% CI : (0.8551, 0.8699)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7253          
##                                           
##  Mcnemar's Test P-Value : 0.000558        
##                                           
##             Sensitivity : 0.8486          
##             Specificity : 0.8766          
##          Pos Pred Value : 0.8731          
##          Neg Pred Value : 0.8528          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4243          
##    Detection Prevalence : 0.4860          
##       Balanced Accuracy : 0.8626          
##                                           
##        'Positive' Class : no              
## 
confusionMatrix(dtprune_pred.test, bank_test$y)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  6643  139
##        yes 1366  895
##                                           
##                Accuracy : 0.8336          
##                  95% CI : (0.8257, 0.8412)
##     No Information Rate : 0.8857          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.4582          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8294          
##             Specificity : 0.8656          
##          Pos Pred Value : 0.9795          
##          Neg Pred Value : 0.3958          
##              Prevalence : 0.8857          
##          Detection Rate : 0.7346          
##    Detection Prevalence : 0.7500          
##       Balanced Accuracy : 0.8475          
##                                           
##        'Positive' Class : no              
## 

After pruning, the model accuracy to our test data is slightly lower than our train data, this suggest that the model we made just fits the real case.

dt_pred.prob <- predict(dtprune_bank, newdata = bank_test, type = "prob")[,2]
head(dt_pred.prob)
##           1           3           8          12          15          21 
## 0.000000000 0.000000000 0.198113208 0.003636364 0.003636364 0.000000000
length(dt_pred.prob)
## [1] 9043
length(bank_test$y)
## [1] 9043

The area under curve represents the accuracy of our model. the more area is under the curve, the better the model can distinguish between positive or negative outcome.

dt_roc <- prediction(dt_pred.prob, bank_test$y)
dt_roc_vec <- performance(dt_roc, "tpr", "fpr")
plot(dt_roc_vec)
abline(0,1,lty = 2)

dt_auc <- performance(dt_roc, measure = "auc")@y.values
dt_auc
## [[1]]
## [1] 0.9070958

7 Random Forest

Finally, the last model that we are going to perform is the random forest model. We are not going to use the original bank data for the sake of time reduction. As a substitute we can use the bank.csv file as it contains only 10% of the original bank_full.csv data.

library(randomForest)
dim(bank)
## [1] 45211    17
bank_small <- read.csv("bank.csv", stringsAsFactors = T, sep = ";")
n0_var <- nearZeroVar(bank)
bank_clean <- bank_small[,-n0_var]
bank_clean %>% inspect_cat() %>%  show_plot()

Before we do the random forest classification, we need to seperate the train and test data. for this instance we are going to use K-fold Cross Validation. In K-Fold we are going to spread the data into seperate equal parts, where each repetition contain (x)% train data and (100-x)% test data. By doing this, every row has at least one time of being a test data or to put it simply the model is going to be more robust.

To do the Random forest model we are going to use bagging (Bootstrap and Aggregation) where the data is randomly selected with replacement so that there is a duplicated data and Out of bag data, this is crucial to determine the error of the model later. Then we are going to use rf() to make random forest model where each bootstrap sample are going to be used to make decision tree, then we classify them to find the best model suited.

ctrl <- trainControl(method = "cv", number = 5)
rf_bank <- train(y ~ ., data = bank_clean, method = "rf", trControl = ctrl)
saveRDS(rf_bank, "rf_bank.RDS")
rf_bank <- readRDS("rf_bank.RDS")
rf_bank$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x))) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 40
## 
##         OOB estimate of  error rate: 10.4%
## Confusion matrix:
##       no yes class.error
## no  3849 151   0.0377500
## yes  319 202   0.6122841

Out Of Bag (OOB) data is a sequence of data that are not being used in the process of making the Random Forest model, hence the name Out Of Bag (OOB). It estimates the error rate of the random forest model, that value is 10.4% which is still relatively high. In other words, the accuracy of the model is 90.59% approximately equal value that we get from the confusion matrix.

rf_pred_test <- predict(rf_bank, bank_test)
cm.rf_pred_test <- confusionMatrix(data = rf_pred_test, reference = bank_test$y)
cm.rf_pred_test
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  7735  575
##        yes  274  459
##                                           
##                Accuracy : 0.9061          
##                  95% CI : (0.8999, 0.9121)
##     No Information Rate : 0.8857          
##     P-Value [Acc > NIR] : 1.876e-10       
##                                           
##                   Kappa : 0.4692          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9658          
##             Specificity : 0.4439          
##          Pos Pred Value : 0.9308          
##          Neg Pred Value : 0.6262          
##              Prevalence : 0.8857          
##          Detection Rate : 0.8554          
##    Detection Prevalence : 0.9189          
##       Balanced Accuracy : 0.7048          
##                                           
##        'Positive' Class : no              
## 
rf_pred.prob <- predict(rf_bank, newdata = bank_test, type = "prob")[,2]
head(rf_pred.prob)
## [1] 0.028 0.000 0.034 0.002 0.000 0.016
rf_roc <- prediction(rf_pred.prob, bank_test$y)
rf_roc_vec <- performance(rf_roc, "tpr", "fpr")
plot(rf_roc_vec)
abline(0,1,lty = 2)

rf_auc <- performance(rf_roc, measure = "auc")@y.values
rf_auc
## [[1]]
## [1] 0.9131718

7.1 Interpretation

The graph below shows the importance of each predictor variable and the comparison of it’s value with the others. varImp suggests how often each variable were used in the making of the decision tree. duration is the most used variable with 100% usage.

varImp(rf_bank)
## rf variable importance
## 
##   only 20 most important variables shown (out of 40)
## 
##                    Overall
## duration           100.000
## balance             35.985
## age                 33.717
## day                 32.022
## poutcomesuccess     29.297
## campaign            12.484
## previous             8.545
## monthoct             7.137
## educationtertiary    4.987
## contactunknown       4.890
## maritalmarried       4.764
## monthjun             4.414
## housingyes           4.208
## jobtechnician        4.137
## monthaug             4.069
## monthmay             3.814
## monthmar             3.439
## monthfeb             3.324
## educationsecondary   3.079
## monthnov             2.884
plot(varImp(rf_bank))