Introduction

Although there are many ways to promote a product to potential customers, one of the most effective is telemarketing, as it is affordable and does not require a large budget.

One of the primary users of this method is banking institutions. Usually, each bank has its own group of telemarketers who promote its banking products. Still, not everyone can do this job perfectly: there is a risk of damaging the company's reputation, since a poorly targeted call can drive a potential customer away. To reduce this risk, it is useful to identify which targets can be approached with this method with a high chance of success.

As data scientists, we need to create a model that is able to predict whether a person will buy a product, based on several pieces of information about them.

Attaching Necessary Libraries

library(tidyverse)    # data wrangling and plotting (loads dplyr and ggplot2)
library(inspectdf)    # quick summaries of column distributions
library(GGally)       # pairwise plot helpers
library(plotly)       # interactive plots
library(e1071)        # assorted statistics and machine learning helpers
library(rsample)      # data splitting utilities
## Warning: package 'rsample' was built under R version 4.0.3
library(randomForest) # random forest implementation
library(DMwR)         # sampling methods such as SMOTE
## Warning: package 'DMwR' was built under R version 4.0.3
library(ROCR)         # ROC curves and AUC
library(caret)        # model training, upsampling, and evaluation
library(partykit)     # ctree decision trees

Data Import

We will use the telemarketing dataset from the UCI Machine Learning Repository. This data relates to direct marketing campaigns of a Portuguese banking institution. We will save the data into an object called bank.

bank <- read.csv("bank/bank.csv", sep=";")
head(bank)
##   age         job marital education default balance housing loan  contact day
## 1  30  unemployed married   primary      no    1787      no   no cellular  19
## 2  33    services married secondary      no    4789     yes  yes cellular  11
## 3  35  management  single  tertiary      no    1350     yes   no cellular  16
## 4  30  management married  tertiary      no    1476     yes  yes  unknown   3
## 5  59 blue-collar married secondary      no       0     yes   no  unknown   5
## 6  35  management  single  tertiary      no     747      no   no cellular  23
##   month duration campaign pdays previous poutcome  y
## 1   oct       79        1    -1        0  unknown no
## 2   may      220        1   339        4  failure no
## 3   apr      185        1   330        1  failure no
## 4   jun      199        4    -1        0  unknown no
## 5   may      226        1    -1        0  unknown no
## 6   feb      141        2   176        3  failure no
glimpse(bank)
## Rows: 4,521
## Columns: 17
## $ age       <int> 30, 33, 35, 30, 59, 35, 36, 39, 41, 43, 39, 43, 36, 20, 3...
## $ job       <chr> "unemployed", "services", "management", "management", "bl...
## $ marital   <chr> "married", "married", "single", "married", "married", "si...
## $ education <chr> "primary", "secondary", "tertiary", "tertiary", "secondar...
## $ default   <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no...
## $ balance   <int> 1787, 4789, 1350, 1476, 0, 747, 307, 147, 221, -88, 9374,...
## $ housing   <chr> "no", "yes", "yes", "yes", "yes", "no", "yes", "yes", "ye...
## $ loan      <chr> "no", "yes", "no", "yes", "no", "no", "no", "no", "no", "...
## $ contact   <chr> "cellular", "cellular", "cellular", "unknown", "unknown",...
## $ day       <int> 19, 11, 16, 3, 5, 23, 14, 6, 14, 17, 20, 17, 13, 30, 29, ...
## $ month     <chr> "oct", "may", "apr", "jun", "may", "feb", "may", "may", "...
## $ duration  <int> 79, 220, 185, 199, 226, 141, 341, 151, 57, 313, 273, 113,...
## $ campaign  <int> 1, 1, 1, 4, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 5, 1, 1, ...
## $ pdays     <int> -1, 339, 330, -1, -1, 176, 330, -1, -1, 147, -1, -1, -1, ...
## $ previous  <int> 0, 4, 1, 0, 0, 3, 2, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 2, 0, ...
## $ poutcome  <chr> "unknown", "failure", "failure", "unknown", "unknown", "f...
## $ y         <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no...

From our bank data, we can see that some of the variables have an incorrect data type. In the Data Wrangling process we will transform them into more suitable types.

Data Wrangling

Change Data Type

We will convert our character columns into factors to ease our analysis, and rename the target column y to subscribe so its meaning is clearer.

bank <- bank %>%
  mutate_if(is.character, as.factor) %>%
  rename(subscribe = y)

head(bank, 5)
##   age         job marital education default balance housing loan  contact day
## 1  30  unemployed married   primary      no    1787      no   no cellular  19
## 2  33    services married secondary      no    4789     yes  yes cellular  11
## 3  35  management  single  tertiary      no    1350     yes   no cellular  16
## 4  30  management married  tertiary      no    1476     yes  yes  unknown   3
## 5  59 blue-collar married secondary      no       0     yes   no  unknown   5
##   month duration campaign pdays previous poutcome subscribe
## 1   oct       79        1    -1        0  unknown        no
## 2   may      220        1   339        4  failure        no
## 3   apr      185        1   330        1  failure        no
## 4   jun      199        4    -1        0  unknown        no
## 5   may      226        1    -1        0  unknown        no

Check Missing Values

After changing the data types, we will check whether our data has any missing values.

colSums(is.na(bank))
##       age       job   marital education   default   balance   housing      loan 
##         0         0         0         0         0         0         0         0 
##   contact       day     month  duration  campaign     pdays  previous  poutcome 
##         0         0         0         0         0         0         0         0 
## subscribe 
##         0

From the result, we conclude that neither our predictors nor our target variable has any missing values.

Data Distribution

numcols <- unlist(lapply(bank, is.numeric))

show_plot(inspect_num(bank[,numcols]))

marital_agg <- bank %>%
  group_by(marital) %>%
  summarise(total = n()) %>%
  arrange(-total) 
## `summarise()` ungrouping output (override with `.groups` argument)
marital_agg %>%
  ggplot(aes(x = reorder(marital, total), y = total)) +
  geom_col(aes(fill = marital)) +
  labs(x = "marital")

job_agg <- bank %>%
  group_by(job) %>%
  summarise(total = n()) %>%
  arrange(-total) 
## `summarise()` ungrouping output (override with `.groups` argument)
job_agg %>%
  ggplot(aes(x = total, y = reorder(job, total))) +
  geom_col(aes(fill = job)) +
  labs(y = "job")

ed_agg <- bank %>%
  group_by(education) %>%
  summarise(total = n()) %>%
  arrange(-total) 
## `summarise()` ungrouping output (override with `.groups` argument)
ed_agg %>%
  ggplot(aes(x = total, y = reorder(education, total))) +
  geom_col(aes(fill = education)) +
  labs(y = "education")

From these plots, we can see that:

  1. Most respondents are between roughly 25 and 50 years old.
  2. The account balance distribution ranges from slightly below zero to no more than 20,000.
  3. Most respondents are married (2,797 observations), while divorced is the smallest group with only 528.
  4. Respondents come from 12 different job backgrounds, with management, blue-collar, and technician as the top three.
  5. The most common education level among respondents is secondary education.

Data Pre-Processing

Next, we will check the class proportions of our target variable with prop.table().

These are the proportions of each target class:

prop.table(table(bank$subscribe))
## 
##      no     yes 
## 0.88476 0.11524

and these are the actual counts of each target class:

table(bank$subscribe)
## 
##   no  yes 
## 4000  521

From these results, we can see that the class proportions are imbalanced. This will affect our modelling, as the model will tend to favour the majority class, which may mislead our prediction results.

Thus, we need to apply a method to balance our data. We will use the upsampling method, which increases the number of minority-class observations by duplicating randomly sampled rows from that class.
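
As a minimal illustration of the idea (a toy sketch in base R, not part of the analysis pipeline):

set.seed(1)
# Toy sketch of upsampling: resample the minority class with replacement
# until both classes have the same count
majority <- rep("no", 6)
minority <- rep("yes", 2)
balanced <- c(majority, sample(minority, length(majority), replace = TRUE))
table(balanced)  # 6 "no" and 6 "yes"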

We will upsample our data only after splitting it into train and test sets, so that duplicated rows cannot leak into the test set.

Cross Validation

The next step of our analysis is conducting the cross validation split. We will split our data into train and test sets: the train data to fit our model, and the test data to validate how the model handles unseen data.

set.seed(100)
index <- sample(nrow(bank), nrow(bank)*0.8)

# Data Train
bank_train <- bank[index,]

# Data Test
bank_test <- bank[-index,]

Next, we will check the class proportions of our train data.

prop.table(table(bank_train$subscribe))
## 
##        no       yes 
## 0.8846792 0.1153208

We see from the result that our train data is still imbalanced, so we will apply the upsampling method to the train data only. Because only the train data is duplicated, the test set stays untouched and our evaluation remains honest.

UpSample Method

bank_train <- upSample(x = bank_train %>%
                         select(-subscribe),
                       y = as.factor(bank_train$subscribe),
                       yname = "subscribe")

Then, we will recheck the class proportions of our train data.

prop.table(table(bank_train$subscribe))
## 
##  no yes 
## 0.5 0.5

We see that the two classes now have equal proportions, so we are ready to train our model on the train data.

Modelling

Decision Tree

Decision Tree is one of the simplest tree-based models, yet it performs robustly. Its output is interpretable and adapts well to most scenarios.

The model produces its output in the form of tree-like branches of decisions based on the patterns in our data.

As our data is already split into train and test sets, we can immediately create the Decision Tree model with bank_train as the input.

set.seed(100)
dtree_model <- ctree(subscribe ~ ., bank_train)
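
If we want to inspect the fitted branches visually, partykit provides a plot method; a quick sketch (note that on upsampled data the full tree can be very large):

plot(dtree_model, type = "simple")  # compact rendering of the fitted tree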

Next, we will use our model to make predictions on the bank_test dataset. We will use confusionMatrix() to evaluate the predictions.

bank_dtree_pred <- predict(dtree_model, bank_test)
confusionMatrix(bank_dtree_pred, bank_test$subscribe, positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  646  26
##        yes 155  78
##                                           
##                Accuracy : 0.8             
##                  95% CI : (0.7724, 0.8256)
##     No Information Rate : 0.8851          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3614          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.75000         
##             Specificity : 0.80649         
##          Pos Pred Value : 0.33476         
##          Neg Pred Value : 0.96131         
##              Prevalence : 0.11492         
##          Detection Rate : 0.08619         
##    Detection Prevalence : 0.25746         
##       Balanced Accuracy : 0.77825         
##                                           
##        'Positive' Class : yes             
## 

Based on the matrix above, we can see that our model has an accuracy of around 80%, which is quite good, but it does not excel on the other metrics: Sensitivity (Recall) and Precision (Pos Pred Value) only reach 75% and 33.48%.
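
As a sanity check, these metrics can be recomputed by hand from the confusion matrix counts printed above (TP = 78, FN = 26, FP = 155, TN = 646):

# Recompute the headline metrics from the confusion matrix counts
TP <- 78; FN <- 26; FP <- 155; TN <- 646
recall    <- TP / (TP + FN)                    # sensitivity: 0.7500
precision <- TP / (TP + FP)                    # pos pred value: 0.3348
accuracy  <- (TP + TN) / (TP + TN + FP + FN)   # 0.8000
c(recall = recall, precision = precision, accuracy = accuracy)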

We will try to improve these metrics by tuning our decision tree model. To do this, we will create the same decision tree model but add a control parameter. We will set mincriterion, minsplit, and minbucket to 0.09, 30, and 60 respectively:

  - mincriterion : the value of 1 - p-value that the split test must exceed for a node to be split; a lower value lets the tree split more readily
  - minsplit : the minimum number of observations a node must contain for a split to be attempted
  - minbucket : the minimum number of observations in a terminal node; if it cannot be fulfilled, the branch will not be created

set.seed(100)
dtree_model_tuning <- ctree(subscribe ~ ., bank_train,
                            control = ctree_control(mincriterion = 0.09,
                                                    minsplit = 30,
                                                    minbucket = 60))

Next, we will use our tuned model to make predictions and compare it with the previous model using confusionMatrix().

bank_dtree_pred_tuning <- predict(dtree_model_tuning, bank_test)
confusionMatrix(bank_dtree_pred_tuning, bank_test$subscribe, positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  674  24
##        yes 127  80
##                                           
##                Accuracy : 0.8331          
##                  95% CI : (0.8072, 0.8569)
##     No Information Rate : 0.8851          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.4268          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.7692          
##             Specificity : 0.8414          
##          Pos Pred Value : 0.3865          
##          Neg Pred Value : 0.9656          
##              Prevalence : 0.1149          
##          Detection Rate : 0.0884          
##    Detection Prevalence : 0.2287          
##       Balanced Accuracy : 0.8053          
##                                           
##        'Positive' Class : yes             
## 

Based on the matrix above, we were able to increase Recall and Precision to 76.92% and 38.65%. To check for overfitting, we will also evaluate the tuned model on the train data.

bank_dtree_train_pred_tuning <- predict(dtree_model_tuning, bank_train)
confusionMatrix(bank_dtree_train_pred_tuning, bank_train$subscribe, positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  2691  475
##        yes  508 2724
##                                           
##                Accuracy : 0.8464          
##                  95% CI : (0.8373, 0.8551)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6927          
##                                           
##  Mcnemar's Test P-Value : 0.3074          
##                                           
##             Sensitivity : 0.8515          
##             Specificity : 0.8412          
##          Pos Pred Value : 0.8428          
##          Neg Pred Value : 0.8500          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4258          
##    Detection Prevalence : 0.5052          
##       Balanced Accuracy : 0.8464          
##                                           
##        'Positive' Class : yes             
## 

This matrix shows how our tuned model performs when predicting on the train dataset: it achieves good values of Recall, Specificity, and Precision at around 85%, 84%, and 84% respectively.

Although we have already tuned our model, the predictions can clearly still be improved. To improve them, we will try another tree-based model: Random Forest.

Random Forest

Random Forest is a modelling method that builds many decision trees, each grown on a different random sample of the data so that the trees are largely independent of one another. The model collects the prediction of every tree and takes a majority vote as the final result. We will create our Random Forest model from the train data.
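
To make the voting step concrete, here is a toy sketch with hypothetical votes (not taken from our model):

# Toy sketch of majority voting: five hypothetical trees vote on one customer
tree_votes <- c("yes", "no", "yes", "yes", "no")
names(which.max(table(tree_votes)))  # majority vote: "yes"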

Below is the code for creating the model; we have commented it out because it takes a long time to run. We create a random forest model with bank_train as the input data and set the control parameters number to 5 and repeats to 4:

  - number : the number of folds (k) into which the data is split for k-fold cross validation
  - repeats : the number of times the k-fold cross validation is repeated

#set.seed(123)
#control <- trainControl(method = "repeatedcv", number = 5, repeats = 4)

# Create Model

#bank_model <- train(subscribe ~ ., data = bank_train, method = "rf", trControl = control)

# Save model

#saveRDS(bank_model, "bank_forest.RDS")
#bank_model

After creating the Random Forest model, we save it with saveRDS() and load it back into rf_model to inspect the results. Saving the model matters because training a random forest with repeated cross validation takes a long time; loading the saved object lets us reuse the model without retraining it.

rf_model <- readRDS("bank_forest.RDS")
rf_model
## Random Forest 
## 
## 6380 samples
##   16 predictor
##    2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 4 times) 
## Summary of sample sizes: 5104, 5104, 5104, 5104, 5104, 5104, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8851097  0.7702194
##   22    0.9681034  0.9362069
##   42    0.9631270  0.9262539
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 22.
rf_model$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 22
## 
##         OOB estimate of  error rate: 2.49%
## Confusion matrix:
##       no  yes class.error
## no  3031  159  0.04984326
## yes    0 3190  0.00000000

Out of Bag Error: 2.49%. This means the model is expected to misclassify around 2.49% of unseen data. Note that because the train data was upsampled, duplicates of a minority-class row can fall both inside and outside a tree's bootstrap sample, so this estimate may be optimistic (the class error of 0 for yes hints at this).
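
The error trace behind this number is stored in the fitted object; a short sketch, assuming rf_model$finalModel is the randomForest classification object printed above:

# err.rate holds the OOB error after each additional tree;
# the last row corresponds to the full 500-tree forest
oob_trace <- rf_model$finalModel$err.rate
tail(oob_trace[, "OOB"], 1)  # should be close to 0.0249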

bank_rf_pred <- predict(rf_model, newdata = bank_test)
confusionMatrix(bank_rf_pred, bank_test$subscribe, positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  794  10
##        yes   7  94
##                                          
##                Accuracy : 0.9812         
##                  95% CI : (0.9701, 0.989)
##     No Information Rate : 0.8851         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9065         
##                                          
##  Mcnemar's Test P-Value : 0.6276         
##                                          
##             Sensitivity : 0.9038         
##             Specificity : 0.9913         
##          Pos Pred Value : 0.9307         
##          Neg Pred Value : 0.9876         
##              Prevalence : 0.1149         
##          Detection Rate : 0.1039         
##    Detection Prevalence : 0.1116         
##       Balanced Accuracy : 0.9476         
##                                          
##        'Positive' Class : yes            
## 

The summary shows that our Random Forest model has an accuracy of around 98.12% on the test data. Other metrics such as Recall and Precision also score well, at 90.38% and 93.07% respectively.
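
Since we loaded ROCR at the start, we can also summarise the model with the area under the ROC curve. A brief sketch, assuming predict() on a caret model returns class probabilities when given type = "prob":

# Score the test set with class probabilities and compute the AUC
rf_prob <- predict(rf_model, newdata = bank_test, type = "prob")[, "yes"]
rocr_pred <- prediction(rf_prob, bank_test$subscribe)  # "yes" is the positive class
performance(rocr_pred, "auc")@y.values[[1]]            # area under the ROC curve

Next, we check how the model performs on the train data, to compare with the test results above.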

bank_rf_train_pred <- predict(rf_model, newdata = bank_train)
confusionMatrix(bank_rf_train_pred, bank_train$subscribe, positive = "yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  3160  299
##        yes   39 2900
##                                           
##                Accuracy : 0.9472          
##                  95% CI : (0.9414, 0.9525)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8943          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9065          
##             Specificity : 0.9878          
##          Pos Pred Value : 0.9867          
##          Neg Pred Value : 0.9136          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4533          
##    Detection Prevalence : 0.4594          
##       Balanced Accuracy : 0.9472          
##                                           
##        'Positive' Class : yes             
## 

This summary shows the metrics our model achieves when predicting on the train dataset: 94.72% Accuracy, 90.65% Recall, and 98.67% Precision.

As these metrics differ only slightly between the train and test datasets, we can conclude that our model is not overfit.

Conclusion

Having compared two different tree-based models, we choose Random Forest, as it delivers well-rounded performance across all of our metrics. For this use case, a good Recall value matters most, since we want the model to correctly identify the positive class: the customers who are likely to subscribe.