Intro

Background

Telemarketing is a method of direct marketing in which a salesperson solicits prospective customers to buy products or services, either over the phone or through a subsequent face-to-face or web-conferencing appointment. Telemarketing can also include recorded sales pitches programmed to be played over the phone via automatic dialing.

Banks are among the organisations that use telemarketing to sell their products and services. Telemarketing is popular with banks because banking products and services are sometimes too complicated for customers to understand on their own; they are easier to grasp when explained directly. One advantage of person-to-person telemarketing is that target customers can immediately ask questions whenever something is unclear.

Nowadays, telemarketing has become negatively associated with various scams and frauds, such as pyramid schemes, and with deceptively overpriced products and services. Fraudulent telemarketing companies are frequently referred to as “telemarketing boiler rooms” or simply “boiler rooms”. Telemarketing is often criticized as an unethical business practice due to the perception of high-pressure sales techniques during unsolicited calls. Telemarketers marketing telephone companies may participate in telephone slamming, the practice of switching a customer’s telephone service without their knowledge or authorization.

As financial institutions, banks care deeply about their reputation and branding, and the downsides of telemarketing can damage that reputation. We therefore want to find out which targets will not buy a product or service when the bank offers it via telemarketing. Knowing this helps protect the bank's reputation by not disturbing people we already expect will not buy.

Analysis Method

In this case we will use machine learning to learn patterns and predict a classification label. We will build several predictive models using training and testing data: Naive Bayes Classifier, Decision Tree, and Random Forest.

We will compare the predictions and the performance of each model. These three models are categorized as supervised learning. Supervised learning is popular for pattern prediction: patterns are learned from training data after an ETL (Extract, Transform, Load) step extracts the feature information. Based on these features, new observations are compared against the classification patterns the model learned from labeled data to produce the final prediction.

Data Preparation

Import Library

# Data wrangling Library
library(tidyverse)
library(dplyr) 

# Visualize data
library(ggplot2)
library(inspectdf)
library(GGally)
library(plotly)

# Naive Bayes 
library(e1071)

# Splitting Data
library(rsample)

# Random Forest
library(randomForest)

# Smote for unbalanced data
library(DMwR)

# ROCR
library(ROCR)

# Confusion Matrix
library(caret)

# Decision Tree
library(partykit)

Import Function

source("matrix_result.R")
source("metrics.R")

Read Data

The telemarketing dataset was obtained from the UCI Machine Learning Repository. The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed (‘yes’) or not (‘no’).

telemark <- read_csv2("data/bank-full.csv")
glimpse(telemark)
## Observations: 45,211
## Variables: 17
## $ age       <dbl> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57,…
## $ job       <chr> "management", "technician", "entrepreneur", "blue-collar", …
## $ marital   <chr> "married", "single", "married", "married", "single", "marri…
## $ education <chr> "tertiary", "secondary", "secondary", "unknown", "unknown",…
## $ default   <chr> "no", "no", "no", "no", "no", "no", "no", "yes", "no", "no"…
## $ balance   <dbl> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 7…
## $ housing   <chr> "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes…
## $ loan      <chr> "no", "no", "yes", "no", "no", "no", "yes", "no", "no", "no…
## $ contact   <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unk…
## $ day       <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,…
## $ month     <chr> "may", "may", "may", "may", "may", "may", "may", "may", "ma…
## $ duration  <dbl> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517…
## $ campaign  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ pdays     <dbl> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,…
## $ previous  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ poutcome  <chr> "unknown", "unknown", "unknown", "unknown", "unknown", "unk…
## $ y         <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no",…

Column Description:
1. age: age (numeric)
2. job : type of job (categorical: “admin.”,“unknown”,“unemployed”,“management”,“housemaid”,“entrepreneur”,“student”, “blue-collar”,“self-employed”,“retired”,“technician”,“services”)
3. marital : marital status (categorical: “married”,“divorced”,“single”; note: “divorced” means divorced or widowed)
4. education : education (categorical: “unknown”,“secondary”,“primary”,“tertiary”)
5. default: has credit in default? (binary: “yes”,“no”)
6. balance: average yearly balance, in euros (numeric)
7. housing: has housing loan? (binary: “yes”,“no”)
8. loan: has personal loan? (binary: “yes”,“no”)
9. contact: contact communication type (categorical: “unknown”,“telephone”,“cellular”)
10. day: last contact day of the month (numeric)
11. month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
12. duration: last contact duration, in seconds (numeric)
13. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
14. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
15. previous: number of contacts performed before this campaign and for this client (numeric)
16. poutcome: outcome of the previous marketing campaign (categorical: “unknown”,“other”,“failure”,“success”)
17. y: has the client subscribed a term deposit? (binary: “yes”,“no”)

Data Wrangling

Missing values (NA) are a common problem in datasets, and there are several ways to handle them; several references note that there is no single good way to deal with missing data. So before we move on to the next step, we should check for missing values.

table(is.na(telemark))
## 
##  FALSE 
## 768587

The data has no missing values, so nothing needs to be done about them. Based on the column description, some of the imported variables have incorrect data types, so we change the data types to match the column description.

telemark <- telemark %>% 
  mutate(job = as.factor(job),
         marital = as.factor(marital),
         education = as.factor(education),
         default = as.factor(default),
         housing = as.factor(housing),
         loan = as.factor(loan),
         contact = as.factor(contact),
         month = as.factor(month),
         poutcome = as.factor(poutcome),
         subscribe = as.factor(y)) %>% 
  select(-c(y))

Some of the variables are numeric; we can visualize their histograms to see their distributions.

numericCols <- unlist(lapply(telemark, is.numeric))
show_plot(inspect_num(telemark[,numericCols]))

Our target variable was originally named “y”; we renamed it to “subscribe”, which makes it easier to see that it indicates whether a contact subscribes to a product or service after a telemarketing call.

levels(telemark$subscribe)
## [1] "no"  "yes"

Our target variable consists of 2 levels: “yes” means the client agreed to subscribe to or buy the product, and “no” means the client rejected the telemarketing offer. Let's take a look at the overall data structure.

summary(telemark)
##       age                 job           marital          education    
##  Min.   :18.00   blue-collar:9732   divorced: 5207   primary  : 6851  
##  1st Qu.:33.00   management :9458   married :27214   secondary:23202  
##  Median :39.00   technician :7597   single  :12790   tertiary :13301  
##  Mean   :40.94   admin.     :5171                    unknown  : 1857  
##  3rd Qu.:48.00   services   :4154                                     
##  Max.   :95.00   retired    :2264                                     
##                  (Other)    :6835                                     
##  default        balance       housing      loan            contact     
##  no :44396   Min.   : -8019   no :20081   no :37967   cellular :29285  
##  yes:  815   1st Qu.:    72   yes:25130   yes: 7244   telephone: 2906  
##              Median :   448                           unknown  :13020  
##              Mean   :  1362                                            
##              3rd Qu.:  1428                                            
##              Max.   :102127                                            
##                                                                        
##       day            month          duration         campaign     
##  Min.   : 1.00   may    :13766   Min.   :   0.0   Min.   : 1.000  
##  1st Qu.: 8.00   jul    : 6895   1st Qu.: 103.0   1st Qu.: 1.000  
##  Median :16.00   aug    : 6247   Median : 180.0   Median : 2.000  
##  Mean   :15.81   jun    : 5341   Mean   : 258.2   Mean   : 2.764  
##  3rd Qu.:21.00   nov    : 3970   3rd Qu.: 319.0   3rd Qu.: 3.000  
##  Max.   :31.00   apr    : 2932   Max.   :4918.0   Max.   :63.000  
##                  (Other): 6060                                    
##      pdays          previous           poutcome     subscribe  
##  Min.   : -1.0   Min.   :  0.0000   failure: 4901   no :39922  
##  1st Qu.: -1.0   1st Qu.:  0.0000   other  : 1840   yes: 5289  
##  Median : -1.0   Median :  0.0000   success: 1511              
##  Mean   : 40.2   Mean   :  0.5803   unknown:36959              
##  3rd Qu.: -1.0   3rd Qu.:  0.0000                              
##  Max.   :871.0   Max.   :275.0000                              
## 

Exploratory Data Analysis

When conducting supervised classification with machine learning algorithms such as Random Forests, one recommended practice is to work with a balanced classification dataset. Let's check the proportion of our target variable, subscribe.

prop.table(table(telemark$subscribe))
## 
##        no       yes 
## 0.8830152 0.1169848

We found that our target variable subscribe is imbalanced, meaning the number of observations is not the same for all the classes in the dataset. To avoid a loss of variance, we will balance the class proportions by oversampling the minority class.
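
For reference, the simplest form of upsampling is random duplication of minority-class rows; a minimal sketch using caret's upSample() (shown only for illustration; the analysis below uses SMOTE instead):

# Random upsampling sketch (illustration only; SMOTE is used later instead)
up <- upSample(x = subset(telemark, select = -c(subscribe)),
               y = telemark$subscribe, yname = "subscribe")
prop.table(table(up$subscribe))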

Let's take a look at the correlations between predictor variables.

show_plot(inspect_cor(subset(telemark, select = -c(subscribe))))

ggcorr(telemark, label = T)
## Warning in ggcorr(telemark, label = T): data in column(s) 'job', 'marital',
## 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome',
## 'subscribe' are not numeric and were ignored

The two plots above show that some predictor variables are correlated with other predictors: previous with pdays, campaign with day, and balance with age. This alone does not make us remove these variables from the dataset, but it warns us that models which assume independent predictors, like Naive Bayes, may not work properly.

Cross Validation

Cross-validation (CV) is a statistical method for evaluating the performance of models or algorithms, in which the data is separated into two subsets: data for the learning process and data for validation/evaluation. In this case we will split the data with a proportion of 80% for training and use the remaining 20% as test data.

set.seed(1)
split <- initial_split(data = telemark, prop = 0.8, strata = subscribe)
telemark_train <- training(split)
telemark_test <- testing(split)

Checking the proportion of the target variable in the training data:

prop.table(table(telemark_train$subscribe))
## 
##        no       yes 
## 0.8832426 0.1167574

The target variable is still imbalanced, so we will balance it using the SMOTE method. SMOTE is an oversampling technique that synthesizes a new minority instance between a minority instance and one of its k nearest neighbors. Other techniques adopt this concept with other criteria in order to generate balanced datasets for class-imbalance problems.
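
For intuition, the core of SMOTE is interpolation: a synthetic point is placed somewhere on the segment between a minority observation and one of its nearest minority-class neighbors. A toy sketch of that step for numeric features (illustration only, not the DMwR implementation):

# Toy sketch of SMOTE-style interpolation (numeric features only)
smote_point <- function(x, neighbour) {
  gap <- runif(length(x))  # random weight in [0, 1] per attribute
  x + gap * (neighbour - x)
}
set.seed(1)
smote_point(c(age = 30, balance = 500), c(age = 40, balance = 900))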

telemark_train_upsample <- SMOTE(subscribe ~ ., as.data.frame(telemark_train), perc.over = 100, perc.under = 200)
prop.table(table(telemark_train_upsample$subscribe))
## 
##  no yes 
## 0.5 0.5

Now that we have balanced proportions using SMOTE, we can move on to predictive modelling.

Modelling

We will build predictive classification models using different algorithms and methods, and compare the results and performance of three models: Naive Bayes, Decision Tree, and Random Forest.

Naive Bayes

Based on our EDA, some of our variables (features) are correlated with other features, and others are continuous variables. These two points make this dataset not particularly well suited to Naive Bayes. Even so, we will still try Naive Bayes to see the result and compare it with the other models. There are certain characteristics of Naive Bayes that should be considered:

  • Assumes that all features of the dataset are equally important and independent. This allows Naive Bayes to perform fast computation (the algorithm is quite simple).
  • Prone to bias due to data scarcity. In some cases, our data may have a distribution where scarce observations lead to probabilities approximating 0 or 1, which introduces a heavy bias into our model and could lead to poor performance on unseen data.
  • More appropriate for data with categorical predictors. This is because Naive Bayes is sensitive to data scarcity, and a continuous variable might contain really scarce or even only one observation for a certain value.
  • Apply the Laplace estimator/smoothing for the data-scarcity problem. The Laplace estimator adds a small number (usually 1) to each of the counts in the frequency table, which ensures that each class-feature combination has a non-zero probability of occurring (see the toy example after this list).
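
To see why smoothing matters, consider a toy frequency table where one class-feature combination never occurs (made-up counts):

# Toy illustration of the Laplace estimator (made-up counts within class "yes").
# Without smoothing, "telephone" gets probability 0, which would zero out the
# whole posterior for "yes" whenever contact == "telephone".
counts <- c(cellular = 15, telephone = 0, unknown = 5)
prop.table(counts)      # unsmoothed: telephone has probability 0
prop.table(counts + 1)  # Laplace-smoothed: every level is non-zero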

Now we build our model; in this case we also apply the Laplace estimator.

model_naive <- naiveBayes(subscribe ~ ., data = telemark_train_upsample, laplace = 1)
naive_prediction <- predict(model_naive, telemark_test)
naive_prediction_raw <- as.data.frame(predict(model_naive, telemark_test, type = "raw"))

naive_prediction_raw <- naive_prediction_raw %>%
  mutate(no = round(no,4),
         yes = round(yes,4))
naive_matrix <- confusionMatrix(naive_prediction, telemark_test$subscribe, positive = "yes")
table <- as.table(naive_matrix)
table <- as.data.frame(table)

table %>% ggplot(aes(x = Prediction, y = Reference, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = Freq), fontface = "bold", color = "white") +
  theme_minimal() +
  theme(legend.position = "none")

naive_matrix <- matrix_result(naive_matrix, "Naive Bayes")
naive_matrix
##         Model  Accuracy Sensitivity Specificity Pos Pred Value
## 1 Naive Bayes 0.7614466   0.8095685    0.755015      0.3063543

This table reports:

  • Accuracy: the ability to correctly predict both classes out of all observations.
  • Precision (Pos Pred Value): the ability to correctly predict the positive class out of all predicted-positive observations (false positives are low).
  • Recall (Sensitivity): the ability to correctly predict the positive class out of all actual-positive observations (false negatives are low).
  • Specificity: the ability to correctly predict the negative class out of all actual-negative observations.
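
As a quick numeric illustration, each of these metrics is a simple ratio of confusion-matrix counts; the counts below are made up for the example.

# Toy confusion-matrix counts (hypothetical, not from this model)
TP <- 40; FN <- 10; FP <- 20; TN <- 130
(accuracy    <- (TP + TN) / (TP + TN + FP + FN))  # 0.85
(precision   <- TP / (TP + FP))                   # 0.667
(recall      <- TP / (TP + FN))                   # 0.8
(specificity <- TN / (TN + FP))                   # 0.867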

# ROC
naive_roc <- data.frame(prediction = naive_prediction_raw[,2],
                        trueclass = as.numeric(telemark_test$subscribe=="yes"))
head(naive_roc)
##   prediction trueclass
## 1     0.1773         0
## 2     0.4301         0
## 3     0.1509         0
## 4     0.1431         0
## 5     0.0786         0
## 6     0.0647         0
naive_roc_pred <- prediction(naive_roc$prediction, naive_roc$trueclass) 

# ROC curve
plot(performance(naive_roc_pred, "tpr", "fpr"),
     main = "ROC")
abline(a = 0, b = 1)

The ROC is a curve that shows the performance of the classification model at all thresholds. It plots:

  • Recall / sensitivity (true positive rate) on the y axis.
  • 1 - specificity (false positive rate) on the x axis.
# AUC
auc <- performance(naive_roc_pred, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.8455998

AUC is the area under the ROC curve. The AUC value indicates how well the model distinguishes the two classes, on an interval from 0 to 1:

  • An AUC value close to 1 means the classification model distinguishes the two classes well.
  • An AUC value close to 0.5 means the classification model cannot distinguish the two classes.

Based on the ROC and AUC results, our ROC curve shows good separation with an AUC score of 0.846, which means we still have a chance to improve this Naive Bayes model. The model's Accuracy, Sensitivity, and Specificity are quite OK, but its positive predictive value is still small, around 30%.

In our case we plan not to call people the model predicts will not buy the product or service offered through telemarketing. The risk is skipping someone who actually would buy, so we focus on the Sensitivity parameter (correctly catching the “yes” class), and our Naive Bayes already gives a good result there.

Based on our AUC score we can still improve the model, and one method of tuning is changing the classification threshold.
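
The metrics() helper used in the loop below comes from the sourced metrics.R and is not printed in this document. Presumably it recomputes recall, accuracy, precision, and specificity at a given probability cutoff; a minimal sketch of such a function (an assumption, not the original code):

# Assumed sketch of metrics() from metrics.R -- not the original source.
# Returns c(recall, accuracy, precision, specificity) at a probability cutoff.
metrics <- function(cutoff, prob, ref, postarget, negtarget) {
  pred <- factor(ifelse(prob >= cutoff, postarget, negtarget),
                 levels = levels(ref))
  cm <- caret::confusionMatrix(pred, ref, positive = postarget)
  c(cm$byClass[["Recall"]],
    cm$overall[["Accuracy"]],
    cm$byClass[["Precision"]],
    cm$byClass[["Specificity"]])
}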

# model tuning - metrics function
co <- seq(0.01,0.99,length=100)
result <- matrix(0,100,4)

# apply function metrics
for(i in 1:100){
  result[i,] = metrics(cutoff = co[i], 
                     prob = naive_prediction_raw$yes, 
                     ref = as.factor(ifelse(telemark_test$subscribe == "yes", 1, 0)), 
                     postarget = "1", 
                     negtarget = "0")
}

# visualize
ggplotly(tibble("Recall" = result[,1],
           "Accuracy" = result[,2],
           "Precision" = result[,3],
           "Specificity" = result[,4],
                   "Cutoff" = co) %>% 
  gather(key = "Metrics", value = "value", 1:4) %>% 
  ggplot(aes(x = Cutoff, y = value, col = Metrics)) +
  geom_line(lwd = 1.5) +
  scale_color_manual(values = c("darkred","darkgreen","orange", "blue")) +
  scale_y_continuous(breaks = seq(0,1,0.1), limits = c(0,1)) +
  scale_x_continuous(breaks = seq(0,1,0.1)) +
  labs(title = "Tradeoff Model Perfomance") +
  theme_minimal() +
  theme(legend.position = "top",
        panel.grid.minor.y = element_blank(),
        panel.grid.minor.x = element_blank()))

From the plot above, the break-even point is around 0.55. We will set the cutoff around 0.52 because we are focusing on the Recall parameter while still keeping the other metrics reasonably good.

#Tuning Threshold
naive_prediction_tuning <- naive_prediction_raw %>%
  mutate(label = as.factor(ifelse(yes >= 0.52, "yes", "no"))) %>% 
  select(label)
# NOTE: the reference here is the *untuned* predictions (naive_prediction), not the
# true test labels (telemark_test$subscribe), so the scores below only measure
# agreement between the two thresholds rather than real performance
naive_matrix_tuning <- confusionMatrix(naive_prediction_tuning$label, naive_prediction, positive = "yes")
naive_matrix_tuning <- matrix_result(naive_matrix_tuning, "Naive Bayes Tuning")
naive_matrix_tuning
##                Model  Accuracy Sensitivity Specificity Pos Pred Value
## 1 Naive Bayes Tuning 0.9903782   0.9691161           1              1

Decision Tree

The decision tree model is a tree-based model with the major benefit of being interpretable. A decision tree is an algorithm that builds a set of rules, visualized as a diagram that resembles a tree. There are certain characteristics of the decision tree model:

  • Performs well on both numerical and categorical variables.
  • All predictors are assumed to interact.
  • Quite robust to the problem of multicollinearity. A decision tree will choose a variable that has the highest information gain in one split, whereas a method such as logistic regression would have used both.
  • Robust and insensitive to outliers. Splitting will happen at a condition where it maximizes the homogeneity within resulting groups. Outliers will have little influence on the splitting process.

Let's build the decision tree model.

model_dtree <- ctree(subscribe ~ ., telemark_train_upsample)
width(model_dtree)
## [1] 139
depth(model_dtree)
## [1] 14

Our decision tree model creates a complicated tree from the training dataset: a width of 139 means 139 terminal (leaf) nodes, and a depth of 14 means the tree is 14 levels deep. After training, we can use it to predict on the test data.

dtree_prediction <- predict(model_dtree, telemark_test)
dtree_matrix <- confusionMatrix(dtree_prediction, telemark_test$subscribe, positive = "yes")
dtree_matrix <- matrix_result(dtree_matrix, "Decision Tree")
dtree_matrix
##           Model  Accuracy Sensitivity Specificity Pos Pred Value
## 1 Decision Tree 0.8310108   0.8414634   0.8296138      0.3976064

Our decision tree's Accuracy, Sensitivity, and Specificity are quite OK, but the positive predictive value is still small, around 39%. We can still improve the model by tuning it.

model_dtree_tuning <- ctree(subscribe ~ ., telemark_train_upsample,
                            control = ctree_control(mincriterion = 0.1, minsplit = 100, minbucket = 60))

dtree_prediction_tuning <- predict(model_dtree_tuning, telemark_test)
# NOTE: positive = "yes" is omitted here, so caret treats "no" (the first factor
# level) as the positive class; these metrics are therefore not directly
# comparable with the other models' rows
dtree_matrix_tuning <- confusionMatrix(dtree_prediction_tuning, telemark_test$subscribe)
dtree_matrix_tuning <- matrix_result(dtree_matrix_tuning, "Decision Tree Tuning")
dtree_matrix_tuning
##                  Model  Accuracy Sensitivity Specificity Pos Pred Value
## 1 Decision Tree Tuning 0.8462729   0.8528084   0.7973734       0.969222

After tuning, the reported Pos Pred Value jumps from 39% to 96%. Note, however, that the confusion matrix above was computed without positive = "yes", so its metrics treat "no" as the positive class; the jump largely reflects that difference rather than a genuine improvement, so this row should be read with caution.

Random Forest

Random forests are based on a simple idea: ‘the wisdom of the crowd’. Aggregating the results of multiple predictors gives a better prediction than the best individual predictor. A group of predictors is called an ensemble. Thus, this technique is called Ensemble Learning.

To improve our model, we can train a group of Decision Tree classifiers, each on a different random subset of the training set. To make a prediction, we obtain the predictions of all the individual trees, then predict the class that gets the most votes. This technique is called Random Forest.
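
As a toy illustration of majority voting (hypothetical tree outputs, not from the model below):

# Toy illustration of majority voting across an ensemble (hypothetical votes)
tree_votes <- data.frame(tree1 = c("yes", "no",  "no"),
                         tree2 = c("yes", "yes", "no"),
                         tree3 = c("no",  "yes", "no"))
apply(tree_votes, 1, function(v) names(which.max(table(v))))  # "yes" "yes" "no"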

One weakness of Random Forest models is that they can easily overfit the training data. To reduce the chance of an overfitted model we can use K-Fold Cross-Validation, a resampling procedure used to evaluate machine learning models on a limited data sample.

The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may be used in place of k in the reference to the model, such as k=10 becoming 10-fold cross-validation.

Cross-validation is primarily used in applied machine learning to estimate the skill of a machine learning model on unseen data. That is, to use a limited sample in order to estimate how the model is expected to perform in general when used to make predictions on data not used during the training of the model.

It is a popular method because it is simple to understand and because it generally results in a less biased or less optimistic estimate of the model skill than other methods, such as a simple train/test split.

Let's implement it for our random forest model.

ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
# model_rforest <- train(subscribe ~ ., data = telemark_train_upsample, method = "rf", trControl = ctrl, ntree = 100)
# saveRDS(model_rforest, file = "model_rforest.RDS")
model_rforest <- readRDS("model_rforest.RDS")
model_rforest
## Random Forest 
## 
## 16892 samples
##    16 predictor
##     2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 13514, 13513, 13514, 13514, 13513, 13514, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8291893  0.6583782
##   22    0.8986502  0.7973005
##   42    0.8962035  0.7924070
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 22.

From the model summary, the optimal number of variables considered for splitting at each tree node (mtry) is 22. Next we will use varImp() to see which features/variables are important in our random forest model.

varImp(model_rforest)
## rf variable importance
## 
##   only 20 most important variables shown (out of 42)
## 
##                 Overall
## duration        100.000
## pdays            28.472
## previous         18.610
## balance          15.756
## age              15.198
## day              14.154
## campaign          6.751
## poutcomesuccess   6.019
## housingyes        4.565
## contactunknown    4.252
## poutcomeunknown   4.216
## monthmar          2.642
## monthjul          2.596
## monthoct          2.579
## monthmay          2.436
## monthjun          2.404
## monthaug          1.922
## maritalmarried    1.803
## monthnov          1.704
## jobblue-collar    1.432

We find that duration is the most important variable in our random forest model, and a few others are noticeably more important than the rest: pdays, previous, balance, age, and day. In practice, a random forest already has out-of-bag (OOB) estimates that can be used as a reliable estimate of its true accuracy on unseen examples. We can check the OOB error rate of our random forest model.

model_rforest$finalModel
## 
## Call:
##  randomForest(x = x, y = y, ntree = 100, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 22
## 
##         OOB estimate of  error rate: 9.9%
## Confusion matrix:
##       no  yes class.error
## no  7523  923  0.10928250
## yes  749 7697  0.08868103
plot(model_rforest$finalModel)
legend("topright", colnames(model_rforest$finalModel$err.rate),col=1:6,cex=0.8,fill=1:6)

Now we can apply our random forest model to the telemarketing test dataset:

rforest_predict <- predict(model_rforest, telemark_test)
rforest_predict_raw <- predict(model_rforest, telemark_test, type = "prob")
rforest_matrix <- confusionMatrix(rforest_predict, telemark_test$subscribe, positive = "yes")
table <- as.table(rforest_matrix)
table <- as.data.frame(table)

table %>% ggplot(aes(x = Prediction, y = Reference, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = Freq), fontface = "bold", color = "white") +
  theme_minimal() +
  theme(legend.position = "none")

rforest_matrix <- matrix_result(rforest_matrix, "Random Forest")
rforest_matrix
##           Model Accuracy Sensitivity Specificity Pos Pred Value
## 1 Random Forest 0.858549   0.8405253   0.8609579      0.4468828
# ROC
forest_roc <- data.frame(prediction = rforest_predict_raw[,2],
                        trueclass = as.numeric(telemark_test$subscribe=="yes"))
head(forest_roc)
##   prediction trueclass
## 1       0.05         0
## 2       0.31         0
## 3       0.00         0
## 4       0.00         0
## 5       0.00         0
## 6       0.02         0
forest_rocz_prediction <- prediction(forest_roc$prediction, forest_roc$trueclass) 

# ROC curve
plot(performance(forest_rocz_prediction, "tpr", "fpr"),
     main = "ROC")
abline(a = 0, b = 1)

# AUC
auc_ROCR_n <- performance(forest_rocz_prediction, measure = "auc")
auc_ROCR_n <- auc_ROCR_n@y.values[[1]]
auc_ROCR_n
## [1] 0.9214208

Based on the ROC and AUC results, our ROC curve shows good separation with an AUC score of 0.9214, which means our random forest model separates the two classes very well. Its Accuracy, Sensitivity, and Specificity are good, but the positive predictive value is still small, around 44%.

We will try to tune the threshold to reach higher scores on each metric, especially sensitivity, which is our target.

co <- seq(0.01,0.99,length=100)
result <- matrix(0,100,4)

# apply function metrics
for(i in 1:100){
  result[i,] = metrics(cutoff = co[i], 
                     prob = rforest_predict_raw$yes, 
                     ref = as.factor(ifelse(telemark_test$subscribe == "yes", 1, 0)), 
                     postarget = "1", 
                     negtarget = "0")
}

# visualize
ggplotly(tibble("Recall" = result[,1],
           "Accuracy" = result[,2],
           "Precision" = result[,3],
           "Specificity" = result[,4],
                   "Cutoff" = co) %>% 
  gather(key = "Metrics", value = "value", 1:4) %>% 
  ggplot(aes(x = Cutoff, y = value, col = Metrics)) +
  geom_line(lwd = 1.5) +
  scale_color_manual(values = c("darkred","darkgreen","orange", "blue")) +
  scale_y_continuous(breaks = seq(0,1,0.1), limits = c(0,1)) +
  scale_x_continuous(breaks = seq(0,1,0.1)) +
  labs(title = "Tradeoff Model Perfomance") +
  theme_minimal() +
  theme(legend.position = "top",
        panel.grid.minor.y = element_blank(),
        panel.grid.minor.x = element_blank()))
#Tuning Threshold
rforest_predict_tuning <- rforest_predict_raw %>%
  mutate(label = as.factor(ifelse(yes >= 0.42, "yes", "no"))) %>% 
  select(label)
rforest_matrix_tuning <- confusionMatrix(rforest_predict_tuning$label, telemark_test$subscribe, positive = "yes")
rforest_matrix_tuning <- matrix_result(rforest_matrix_tuning, "Random Forest Tuning")
rforest_matrix_tuning
##                  Model  Accuracy Sensitivity Specificity Pos Pred Value
## 1 Random Forest Tuning 0.8336651   0.8939962   0.8256018        0.40657

Conclusion

result <- rbind(naive_matrix, naive_matrix_tuning, dtree_matrix, dtree_matrix_tuning, rforest_matrix, rforest_matrix_tuning)
result
##                  Model  Accuracy Sensitivity Specificity Pos Pred Value
## 1          Naive Bayes 0.7614466   0.8095685   0.7550150      0.3063543
## 2   Naive Bayes Tuning 0.9903782   0.9691161   1.0000000      1.0000000
## 3        Decision Tree 0.8310108   0.8414634   0.8296138      0.3976064
## 4 Decision Tree Tuning 0.8462729   0.8528084   0.7973734      0.9692220
## 5        Random Forest 0.8585490   0.8405253   0.8609579      0.4468828
## 6 Random Forest Tuning 0.8336651   0.8939962   0.8256018      0.4065700

Based on the table above, Naive Bayes Tuning shows 100% specificity and 100% Pos Pred Value. These scores are not real performance: as noted earlier, that confusion matrix was computed against the untuned model's predictions instead of the true test labels, so we take it out of the comparison.

result <- result[-2,]
result
##                  Model  Accuracy Sensitivity Specificity Pos Pred Value
## 1          Naive Bayes 0.7614466   0.8095685   0.7550150      0.3063543
## 3        Decision Tree 0.8310108   0.8414634   0.8296138      0.3976064
## 4 Decision Tree Tuning 0.8462729   0.8528084   0.7973734      0.9692220
## 5        Random Forest 0.8585490   0.8405253   0.8609579      0.4468828
## 6 Random Forest Tuning 0.8336651   0.8939962   0.8256018      0.4065700

Overall, among the predictive models we built, Random Forest gave the highest performance with about 85% accuracy while also maintaining higher sensitivity and specificity than the others. In our case we plan not to call people the model predicts will not buy the product or service, so we focus on the Sensitivity parameter. Let's find out which model has the highest sensitivity.

result %>% arrange(desc(Sensitivity))
##                  Model  Accuracy Sensitivity Specificity Pos Pred Value
## 1 Random Forest Tuning 0.8336651   0.8939962   0.8256018      0.4065700
## 2 Decision Tree Tuning 0.8462729   0.8528084   0.7973734      0.9692220
## 3        Decision Tree 0.8310108   0.8414634   0.8296138      0.3976064
## 4        Random Forest 0.8585490   0.8405253   0.8609579      0.4468828
## 5          Naive Bayes 0.7614466   0.8095685   0.7550150      0.3063543

In conclusion, the random forest with threshold tuning gives us the best sensitivity, which is what we need to predict which targets really do not want to buy the bank's product, so the bank can maintain its reputation by not disturbing those users.

Source Code

This analysis was made for educational purposes, and the data and source code are publicly accessible.

Files can be accessed and downloaded on GitHub: alfandash github

The result of this R Markdown document can be accessed on RPubs: alfandash rpubs