1 Case

At a bank in Portugal wants to predict what kind of prospective customer will buy a product from the bank when called by the bank, the purpose of this prediction is to help the telemarketing team to find out which customer will buy the product for yes or no based on campaign data. the marketing.

2 Library

library(dplyr)
library(caret)
library(inspectdf)
library(gridExtra)
library(GGally)
library(rsample)
library(e1071)
library(ROCR)

3 Read Data

The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be (‘yes’) or not (‘no’) subscribed.

customer <- read.csv("bank-full.csv")
head(customer)

Column description :

age : Age

job : type of job

marital : marital status (category)

education : Categorical of education

default : has credit in default

balance : Saldo

housing : Categorical of has housing loan

loan : Personal loan

contact : contact communication type

day : last contact of day

month : last contact month of year

duration : last contact duration, in seconds (numeric)

campaign : number of contacts performed during this campaign and for this client

pdays : number of days that passed by after the client was last contacted from a previous campaign

previous : number of contacts performed before this campaign and for this client

poutcome : outcome of the previous marketing campaign

y : the client subscribed a term deposit

4 Data Wrangling

glimpse(customer)
#> Rows: 45,211
#> Columns: 17
#> $ age       <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, ~
#> $ job       <chr> "management", "technician", "entrepreneur", "blue-collar", "~
#> $ marital   <chr> "married", "single", "married", "married", "single", "marrie~
#> $ education <chr> "tertiary", "secondary", "secondary", "unknown", "unknown", ~
#> $ default   <chr> "no", "no", "no", "no", "no", "no", "no", "yes", "no", "no",~
#> $ balance   <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71~
#> $ housing   <chr> "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes"~
#> $ loan      <chr> "no", "no", "yes", "no", "no", "no", "yes", "no", "no", "no"~
#> $ contact   <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn~
#> $ day       <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, ~
#> $ month     <chr> "may", "may", "may", "may", "may", "may", "may", "may", "may~
#> $ duration  <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,~
#> $ campaign  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
#> $ pdays     <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, ~
#> $ previous  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
#> $ poutcome  <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unkn~
#> $ y         <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", ~
cust <- customer %>% 
  mutate_if(is.character, as.factor)

inspect missing value

colSums(is.na(cust))
#>       age       job   marital education   default   balance   housing      loan 
#>         0         0         0         0         0         0         0         0 
#>   contact       day     month  duration  campaign     pdays  previous  poutcome 
#>         0         0         0         0         0         0         0         0 
#>         y 
#>         0

In this study we will focus on numerical predictor data, so we will delete the job, martial, education, contact, income data columns

cust_clean <- cust %>% 
  select(-c("job","marital","education","contact","poutcome"))

On data that is binary option category, we will make it class 0 and 1,and we enter it into the processing data

transform <- model.matrix(y ~ ., data = cust_clean) %>% 
  as.data.frame() %>% 
  select(c("defaultyes","housingyes","loanyes"))
cust_clean$default <- transform$defaultyes
cust_clean$housing <- transform$housingyes
cust_clean$loan <- transform$loanyes
head(cust_clean)

5 Exploratory Data Analysis

ggcorr(cust_clean ,hjust = 1, layout.exp = 2, label = T, label_size = 2.9)

From the data above, all predictors that have a strong enough relationship are only pdays andprevious, so it is still possible to use the Naive Bayes model.

#Proportion of target variables before training data
prop.table(table(cust_clean$y))
#> 
#>        no       yes 
#> 0.8830152 0.1169848

Cross Validation

RNGkind(sample.kind = "Rounding")
set.seed(100)
index <- initial_split(data = cust_clean, prop = 0.8, strata ="y")

cust_train <- training(index)
cust_test <- testing(index)

Proporsi data

#Proportion of target variables after training data
prop.table(table(cust_train$y))
#> 
#>       no      yes 
#> 0.882997 0.117003

After training data, the proportion of data still has the same proportion as the data before training, because the proportion of data is not balanced, we will use ‘downSample’ for data balancing

set.seed(100)

down_cust <- downSample(x = cust_train %>% select(-y),
                     y = cust_train$y,
                     yname = "y")
                     
prop.table(table(down_cust$y))
#> 
#>  no yes 
#> 0.5 0.5

6 Modeling

6.1 Naive Bayes

cust_naive <- naiveBayes(x = down_cust %>% select(-y),
                         y = down_cust$y,
                         laplace = 1)

We can also interpret one of the predictors to find out the proportion of the class

prop.table(table(down_cust$y, down_cust$month)) %>% 
  as.data.frame() %>% 
  rename(Class = Var1, Month = Var2)

From the results above, it can be concluded that the class proportion of customers in the month apr for no = 3% and for yes = 5%

6.2 Naive Bayes Model Evaluation

#model fitting
cust_pred <- predict(cust_naive, newdata = cust_test, type = "class")#for the class prediction

Evaluate the model with confusion matrix

confusionMatrix(cust_pred, cust_test$y, positive = "yes")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   no  yes
#>        no  6054  224
#>        yes 1930  833
#>                                              
#>                Accuracy : 0.7618             
#>                  95% CI : (0.7528, 0.7705)   
#>     No Information Rate : 0.8831             
#>     P-Value [Acc > NIR] : 1                  
#>                                              
#>                   Kappa : 0.3213             
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.78808            
#>             Specificity : 0.75827            
#>          Pos Pred Value : 0.30148            
#>          Neg Pred Value : 0.96432            
#>              Prevalence : 0.11691            
#>          Detection Rate : 0.09214            
#>    Detection Prevalence : 0.30561            
#>       Balanced Accuracy : 0.77317            
#>                                              
#>        'Positive' Class : yes                
#> 

From the evaluation results above, the Naive Bayes model gets 76% accuracy, 30% precision, recall, 78% in the positive class Yes

ROC

Bentuk Probability dari hasil prediksi

cust_prob <- predict(cust_naive, newdata = cust_test, type = "raw")#for the probability
round(head(cust_prob),4)
#>          no    yes
#> [1,] 1.0000 0.0000
#> [2,] 0.7801 0.2199
#> [3,] 0.7478 0.2522
#> [4,] 0.7830 0.2170
#> [5,] 0.7225 0.2775
#> [6,] 0.9871 0.0129

Objek ROC

#Objek Prediction
cust_roc <- prediction(predictions = cust_prob[,2], # = kelas positif
                       labels = as.numeric(cust_test$y == "yes")) # label kelas positif
        
#Objec performance dari prediction
perf <- performance(prediction.obj = cust_roc,
                    measure = "tpr",
                    x.measure = "fpr")
#plot
plot(perf, main = "ROC")
abline(0,1, lty =2)

6.3 Area Under Curve (AUC)

auc <- performance(prediction.obj = cust_roc,
                   measure = "auc")
auc@y.values
#> [[1]]
#> [1] 0.8302735

The ROC is a probability curve representing the degree or measure of separation. This tells how much the model is able to differentiate between classes. The closer the curve reaches to the top left of the plot (true positive high and false positive low), the better our model will be. The higher the AUC score, the better the model will separate the target classes

7 Decision Tree

7.1 Model Fitting

In the fitting model we will use mincriterion = 0.95, where the p-value must be below 0.05 for a node to create branches. Then minsplit = 500 or the minimum number of observations after splitting. And minbucket = 1200 as the minimum number of observations in the terminal node.

library(partykit)
cust_tree <- ctree(formula = y ~ ., data = down_cust,
                   control = ctree_control(mincriterion = 0.5,
                                           minsplit = 500,
                                           minbucket = 1200))
                                          
plot(cust_tree, type = "simple")

From the plot decision tree above, it can be seen that 1 is the Root Node, 2,5 and 6 are theInternal Node and 3,4,7,8,9 are theLeaf Nodes

predict.DT <- predict(cust_tree, newdata = cust_test)# For the class
predict.prob.DT <- predict(cust_tree, newdata = cust_test, type = "prob")# For the probability
confusionMatrix(predict.DT, cust_test$y, positive = "yes")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   no  yes
#>        no  6692  338
#>        yes 1292  719
#>                                              
#>                Accuracy : 0.8197             
#>                  95% CI : (0.8116, 0.8276)   
#>     No Information Rate : 0.8831             
#>     P-Value [Acc > NIR] : 1                  
#>                                              
#>                   Kappa : 0.3725             
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.68023            
#>             Specificity : 0.83818            
#>          Pos Pred Value : 0.35753            
#>          Neg Pred Value : 0.95192            
#>              Prevalence : 0.11691            
#>          Detection Rate : 0.07953            
#>    Detection Prevalence : 0.22243            
#>       Balanced Accuracy : 0.75920            
#>                                              
#>        'Positive' Class : yes                
#> 

From the results of the decision tree model confusion matrix, it can be seen that there is an increase in the accuracy value to 81%. Furthermore, the results of the confusion matrix will be evaluated using ROC andAUC to find out how well the model classifies the two classes.

#Objek Prediction
dt_roc <- prediction(predictions = predict.prob.DT[,2], # = kelas positif
                       labels = as.numeric(cust_test$y == "yes")) # label kelas positif
        
#Objec performance dari prediction
dt_perf <- performance(prediction.obj = dt_roc,
                    measure = "tpr",
                    x.measure = "fpr")
#plot
plot(dt_perf, main = "ROC")
abline(0,1, lty =2)

dt_auc <- performance(prediction.obj = dt_roc,
                   measure = "auc")
dt_auc@y.values
#> [[1]]
#> [1] 0.8138376

8 Random Forest

One of the weaknesses of random forest is that the modeling takes a long time, so the model will be saved in the form of an RDS file with the function saveRDS () so that the model can be used immediately without having to train before.

8.1 Data Prepocessing

#discarding columns with low or close to 0 variance, so that they are not eligible to be used as predictors

n0_var <- nearZeroVar(down_cust)
down_cust <- down_cust[, -n0_var]
dim(down_cust)
#> [1] 8464   10
#model building
# set.seed(2018)
# ctrl <- trainControl(method="repeatedcv", number=4, repeats=3) # k-fold cross validation
# forest <- train(y ~ ., data=down_cust, method="rf", trControl = ctrl)
# saveRDS(forest, "model.random.forest.rds")
model_forest <- readRDS("model.random.forest.rds")
model_forest
#> Random Forest 
#> 
#> 8464 samples
#>    9 predictor
#>    2 classes: 'no', 'yes' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (4 fold, repeated 3 times) 
#> Summary of sample sizes: 6348, 6348, 6348, 6348, 6348, 6348, ... 
#> Resampling results across tuning parameters:
#> 
#>   mtry  Accuracy   Kappa    
#>    2    0.8390044  0.6780088
#>   10    0.8474716  0.6949433
#>   19    0.8438091  0.6876181
#> 
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 10.

From the model summary,the optimum number of variables considered for splitting at each tree node mtry is 10 because the highest accuracy. We can also inspect the importance of each variable that was used in our random forest using varImp().

Var_imp <- varImp(model_forest)
plot(Var_imp)

From the result, the importance variable can be interpreted into a plot, where the importance variable above is the duration variable.

When using random forest - we are not required to split our dataset into train and test sets because random forest already has out-of-bag estimates (OOB) which act as a reliable estimate of the accuracy on unseen examples. Although, it is also possible to hold out a regular train-test cross-validation. For example, the OOB we achieved (in the summary below) was generated from our wine_train dataset.

plot(model_forest$finalModel)
legend("topright", colnames(model_forest$finalModel$err.rate),col=1:6,cex=0.8,fill=1:6)

model_forest$finalModel
#> 
#> Call:
#>  randomForest(x = x, y = y, mtry = param$mtry) 
#>                Type of random forest: classification
#>                      Number of trees: 500
#> No. of variables tried at each split: 10
#> 
#>         OOB estimate of  error rate: 14.97%
#> Confusion matrix:
#>       no  yes class.error
#> no  3472  760   0.1795841
#> yes  507 3725   0.1198015
predict.forest <- predict(model_forest, cust_test, type = "raw")# for the class prediction
predict.prob.forest <- predict(model_forest, cust_test, type = "prob")# for the probability
cm_forest <- confusionMatrix(predict.forest, cust_test$y, positive = "yes")
cm_forest
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   no  yes
#>        no  6457  132
#>        yes 1527  925
#>                                              
#>                Accuracy : 0.8165             
#>                  95% CI : (0.8084, 0.8244)   
#>     No Information Rate : 0.8831             
#>     P-Value [Acc > NIR] : 1                  
#>                                              
#>                   Kappa : 0.4349             
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.8751             
#>             Specificity : 0.8087             
#>          Pos Pred Value : 0.3772             
#>          Neg Pred Value : 0.9800             
#>              Prevalence : 0.1169             
#>          Detection Rate : 0.1023             
#>    Detection Prevalence : 0.2712             
#>       Balanced Accuracy : 0.8419             
#>                                              
#>        'Positive' Class : yes                
#> 
#Objek Prediction
forest_roc <- prediction(predictions = predict.prob.forest[,2], # = kelas positif
                       labels = as.numeric(cust_test$y == "yes")) # label kelas positif
        
#Objec performance dari prediction
forest_perf <- performance(prediction.obj = forest_roc,
                    measure = "tpr",
                    x.measure = "fpr")
#plot
plot(forest_perf, main = "ROC")
abline(0,1, lty =2)

forest_auc <- performance(prediction.obj = forest_roc,
                   measure = "auc")
forest_auc@y.values
#> [[1]]
#> [1] 0.9041757

Based on the results of confusion matrix from random forest, the algorithm gives a good result with an accuracy of 81% and a recall of 87% in the target class positive = yes. The value of the AUC also has a high yield of 90% compared to the two previous models.

the prediction results of the three models in the test data against the positiveyes class

cust_test$predict.naive <- cust_pred
cust_test$predict.DT <- predict.DT
cust_test$predict.forest <- predict.forest
cust_test %>% 
  select(y,predict.naive, predict.DT, predict.forest) %>% 
  filter(y == "yes") %>% 
  head(5)
df1 <- data.frame(model = c("Naive Bayes","Decision Tree","Random Forest"),
                    accuracy = c(0.7618,0.8197,0.8165),
                    sensitivity = c(0.78808,0.68023,0.8751),
                    specificity = c(0.75827,0.83818,0.8087),
                    precision = c(0.30148,0.35753,0.3772),
                    AUC = c(0.8302735,0.8138376,0.9041757)
               )
df1

9 Conclusion

Based on the results of the three models, the data model from random forest has better results, in addition to high accuracy and also the value of AUC and sensitivity or recall, because in this case the target class is the customer buying the product yes of course if recalled. This high model proves that the model has a small number in False Positive or a prediction of no but actually ‘yes’.