Introduction

Banks have many ways to promote and advertise their products, but such advertisements and promotions are often ignored. One approach considered effective is telemarketing. According to Biz Fluent, one advantage of telemarketing is that it makes it easy to acquire customers and to maintain a relationship with them.

When customers have questions about a product or service, you can explain it in detail over the phone.

Beyond that, telemarketing lets you reach customers easily, because you only need their phone numbers and do not have to visit them one by one.

This strategy is also considered cost-efficient, so it can reduce the company's budget.

Data Background

Attribute Information:

bank client data:

1 age (numeric)

2 job: type of job (categorical: ‘admin.’,‘blue-collar’,‘entrepreneur’,‘housemaid’,‘management’,‘retired’,‘self-employed’,‘services’,‘student’,‘technician’,‘unemployed’,‘unknown’)

3 marital: marital status (categorical: ‘divorced’,‘married’,‘single’; note: ‘divorced’ means divorced or widowed)

4 education (categorical: ‘primary’,‘secondary’,‘tertiary’,‘unknown’)

5 default: has credit in default? (binary: ‘no’,‘yes’)

6 balance: average yearly balance, in euros (numeric)

7 housing: has housing loan? (binary: ‘no’,‘yes’)

8 loan: has personal loan? (binary: ‘no’,‘yes’)

related with the last contact of the current campaign:

9 contact: contact communication type (categorical: ‘cellular’,‘telephone’,‘unknown’)

10 day: last contact day of the month (numeric)

11 month: last contact month of year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’)

12 duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=‘no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

other attributes:

13 campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

14 pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; -1 means client was not previously contacted)

15 previous: number of contacts performed before this campaign and for this client (numeric)

16 poutcome: outcome of the previous marketing campaign (categorical: ‘failure’,‘other’,‘success’,‘unknown’)

Output variable (desired target):

17 y: has the client subscribed to a term deposit? (binary: ‘yes’,‘no’)
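Following the note on duration, a realistic model would drop that column before training. A minimal base R sketch of that step, using a small invented data frame (`toy_bank` and its values are illustrative and are not rows from bank.csv):

```r
# Toy stand-in for the bank data; all values are invented for illustration.
toy_bank <- data.frame(
  age      = c(30, 45, 52),
  duration = c(79, 220, 0),
  campaign = c(1, 2, 1),
  y        = factor(c("no", "yes", "no"))
)

# Drop `duration`, which is only known after the call has ended.
realistic <- toy_bank[, setdiff(names(toy_bank), "duration"), drop = FALSE]
names(realistic)
```

The analysis below keeps duration as a predictor, so its results are best read as a benchmark in the sense of that note.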

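Import Data

# stringsAsFactors = TRUE keeps the categorical columns as factors (the default changed to FALSE in R 4.0)
bank <- read.csv("data_input/bank.csv", sep = ";", stringsAsFactors = TRUE)
head(bank)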

Data Wrangling

str(bank)

## 'data.frame':    4521 obs. of  17 variables:
##  $ age      : int  30 33 35 30 59 35 36 39 41 43 ...
##  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
##  $ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ...
##  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ balance  : int  1787 4789 1350 1476 0 747 307 147 221 -88 ...
##  $ housing  : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 2 2 2 2 ...
##  $ loan     : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 2 ...
##  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 3 1 ...
##  $ day      : int  19 11 16 3 5 23 14 6 14 17 ...
##  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 9 9 9 1 ...
##  $ duration : int  79 220 185 199 226 141 341 151 57 313 ...
##  $ campaign : int  1 1 1 4 1 2 1 2 2 1 ...
##  $ pdays    : int  -1 339 330 -1 -1 176 330 -1 -1 147 ...
##  $ previous : int  0 4 1 0 0 3 2 0 0 2 ...
##  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
##  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Check Missing Value

colSums(is.na(bank))

##       age       job   marital education   default   balance   housing      loan 
##         0         0         0         0         0         0         0         0 
##   contact       day     month  duration  campaign     pdays  previous  poutcome 
##         0         0         0         0         0         0         0         0 
##         y 
##         0
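Note that is.na() only counts true NA values, while the attribute list encodes missing information as the level ‘unknown’ for some categorical columns. A hedged sketch of how those entries could be counted, shown on an invented data frame rather than on the real data:

```r
# Invented example data; the real columns may contain different values.
toy_bank <- data.frame(
  job     = factor(c("admin.", "unknown", "student", "unknown")),
  contact = factor(c("cellular", "unknown", "telephone", "cellular")),
  age     = c(30, 41, 25, 58)
)

# Count entries equal to "unknown" in each column.
unknown_counts <- sapply(toy_bank, function(col) sum(as.character(col) == "unknown"))
unknown_counts
```

Running the same sapply() call on bank would show how much information is missing in this sense, even though colSums(is.na(bank)) reports zero.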

Exploration

# Check the distribution of the numeric columns
# (inspect_num() and show_plot() come from the inspectdf package)
numcols <- sapply(bank, is.numeric)

show_plot(inspect_num(bank[, numcols]))

# Facet by column name so the panels are built from the plotted data itself
bank_cust1 <- ggplot(data = bank, mapping = aes(x = marital)) +
             geom_bar(mapping = aes(fill = job)) + theme_linedraw() +
             ggtitle("Distribution of Customers by Personal Loan") +
             xlab("Marital") + ylab("Number of Customers") + facet_wrap(~ loan)
ggplotly(bank_cust1)

bd1 <- ggplot(bank[(!is.na(bank$loan) & !is.na(bank$age)),], aes(x = age, fill = loan)) +
       geom_density(alpha=0.5, aes(fill=factor(loan))) + labs(title="Loan density and Age") +
       scale_x_continuous(breaks = scales::pretty_breaks(n = 10)) + theme_grey()
bd1

# Facet by column name so the panels are built from the plotted data itself
bank_cust2 <- ggplot(data = bank, mapping = aes(x = marital)) +
             geom_bar(mapping = aes(fill = job)) + theme_linedraw() +
             ggtitle("Distribution of Customers by Housing Loan") +
             xlab("Marital") + ylab("Number of Customers") + facet_wrap(~ housing)
ggplotly(bank_cust2)

bd2 <- ggplot(bank[(!is.na(bank$housing) & !is.na(bank$age)),], aes(x = age, fill = housing)) +
       geom_density(alpha=0.5, aes(fill=factor(housing))) + labs(title="Housing loan density and Age") +
       scale_x_continuous(breaks = scales::pretty_breaks(n = 10)) + theme_grey()
bd2

# This plot is faceted by credit default, so the title is corrected to match
bank_cust3 <- ggplot(data = bank, mapping = aes(x = marital)) +
             geom_bar(mapping = aes(fill = job)) + theme_linedraw() +
             ggtitle("Distribution of Customers by Credit Default") +
             xlab("Marital") + ylab("Number of Customers") + facet_wrap(~ default)
ggplotly(bank_cust3)

bd3 <- ggplot(bank[(!is.na(bank$default) & !is.na(bank$age)),], aes(x = age, fill = default)) +
       geom_density(alpha=0.5, aes(fill=factor(default))) + labs(title="default density and Age") +
       scale_x_continuous(breaks = scales::pretty_breaks(n = 10)) + theme_grey()
bd3

# This plot is faceted by the target y, so the title is corrected to match
bank_cust4 <- ggplot(data = bank, mapping = aes(x = marital)) +
             geom_bar(mapping = aes(fill = job)) + theme_linedraw() +
             ggtitle("Distribution of Customers by Term Deposit Subscription") +
             xlab("Marital") + ylab("Number of Customers") + facet_wrap(~ y)
ggplotly(bank_cust4)

bd4 <- ggplot(bank[(!is.na(bank$y) & !is.na(bank$age)),], aes(x = age, fill = y)) +
       geom_density(alpha=0.5, aes(fill=factor(y))) + labs(title="Deposit density and Age") +
       scale_x_continuous(breaks = scales::pretty_breaks(n = 10)) + theme_grey()
bd4

Data Preprocessing

table(bank$y)

## 
##   no  yes 
## 4000  521

#Check Proportion Data
prop.table(table(bank$y))

## 
##      no     yes 
## 0.88476 0.11524
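These proportions also set a baseline: a trivial model that always predicts the majority class is already right most of the time. A small sketch of that calculation, using only the class counts printed above:

```r
# Class counts copied from the table(bank$y) output above.
counts <- c(no = 4000, yes = 521)

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy <- max(counts) / sum(counts)
round(baseline_accuracy, 5)
```

Any model evaluated later has to beat roughly 88% accuracy to be more useful than this trivial rule.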

Cross Validation

Below, the data is split into a training set and a test set, with 80% of the rows used for training and 20% for testing.

set.seed(100)
index <- sample(nrow(bank), nrow(bank)*0.8)

# Data Train
bank_train <- bank[index,]

# Data Test
bank_test <- bank[-index,]

#Check Proportion Table 
prop.table(table(bank_train$y))

## 
##        no       yes 
## 0.8846792 0.1153208
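As a sanity check on the split, the sketch below reproduces only the row counts, using the 4521 observations reported by str(bank). It uses floor() to make the training size explicit, since 4521 * 0.8 = 3616.8 is not a whole number of rows; this is an illustration and does not read the real data.

```r
set.seed(100)
n <- 4521                        # number of rows reported by str(bank)

# Make the training size an explicit whole number of rows.
n_train <- floor(n * 0.8)
index <- sample(n, n_train)

# Every row that was not drawn for training goes to the test set.
test_rows <- setdiff(seq_len(n), index)

c(train = length(index), test = length(test_rows))
```

The two index sets are disjoint by construction, which is the property a holdout split needs.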

UpSampling Method

Because the data is imbalanced and not very large, the training data is resampled with upsampling so that the class proportions become balanced.

bank_train <- upSample(x = bank_train %>% select(-y),
                         y = as.factor(bank_train$y),
                         yname = "subscribe")

prop.table(table(bank_train$subscribe))

## 
##  no yes 
## 0.5 0.5
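To make the idea concrete, here is a minimal base R sketch of upsampling on an invented data frame. It illustrates the technique (drawing minority rows with replacement until the classes are equal in size) and is not the internal implementation of caret's upSample():

```r
set.seed(1)

# Invented imbalanced data: 8 "no" rows and 2 "yes" rows.
toy <- data.frame(
  age = c(25, 31, 38, 42, 47, 53, 58, 61, 29, 44),
  y   = factor(c(rep("no", 8), rep("yes", 2)))
)

# Size of the largest class; every class is grown to this size.
target_n <- max(table(toy$y))
rows_by_class <- split(seq_len(nrow(toy)), toy$y)

# Keep the original rows and add random duplicates of the smaller class.
up_rows <- unlist(lapply(rows_by_class, function(rows) {
  extra <- rows[sample.int(length(rows), target_n - length(rows), replace = TRUE)]
  c(rows, extra)
}))
toy_up <- toy[up_rows, ]

prop.table(table(toy_up$y))
```

Because the extra rows are duplicates, upsampling should be applied to the training set only, as is done above with bank_train.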

Modelling

The models used are Random Forest and Decision Tree. Random Forest makes predictions by building many decision trees. Each decision tree has its own characteristics and is independent of the others. The Random Forest model obtains a prediction from each decision tree and then takes a vote across all of those predictions.
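The voting step can be sketched in a few lines of base R. The predictions below are invented for five hypothetical trees and four customers; a real forest (500 trees in this report) works on the same principle:

```r
# Invented class predictions from 5 trees (rows) for 4 customers (columns).
tree_votes <- rbind(
  tree1 = c("no",  "yes", "no", "yes"),
  tree2 = c("no",  "yes", "no", "no"),
  tree3 = c("yes", "yes", "no", "no"),
  tree4 = c("no",  "no",  "no", "yes"),
  tree5 = c("no",  "yes", "no", "yes")
)

# For each customer, pick the class predicted by the most trees.
majority_vote <- apply(tree_votes, 2, function(votes) {
  names(which.max(table(votes)))
})
majority_vote
```

An odd number of trees avoids ties in this two-class toy example.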

Training is configured with trainControl and takes advantage of a strength of random forest, automatic feature selection: predictors are chosen automatically and at random while each decision tree is built. The resampling settings are method: repeatedcv, number: 5, and repeats: 4.

# set.seed(123)
# control <- trainControl(method = "repeatedcv", number = 5, repeats = 4)

# # build the model
# bank_model <- train(subscribe ~ ., data = bank_train, method = "rf", trControl = control)
# # save the model
# saveRDS(bank_model, "bank_forest.RDS")
# bank_model

After building the random forest model, we save it to bank_forest.RDS and load it back for prediction. It is important to save the random forest model so that we do not have to retrain it every time, because a weakness of random forest is its long computation time.

Random Forest

# Read rds data
rf<- readRDS("bank_forest.RDS")
rf

## Random Forest 
## 
## 6380 samples
##   16 predictor
##    2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 4 times) 
## Summary of sample sizes: 5104, 5104, 5104, 5104, 5104, 5104, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8851097  0.7702194
##   22    0.9681034  0.9362069
##   42    0.9631270  0.9262539
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 22.
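The resampling summary above (5 folds, repeated 4 times, training subsets of 5104 rows) can be checked with a short base R sketch. It uses plain random folds on row indices only; caret builds its folds with its own helper functions, so treat this as an approximation of the bookkeeping rather than caret's procedure:

```r
set.seed(123)
n <- 6380        # samples reported in the caret output
k <- 5           # number = 5
repeats <- 4     # repeats = 4

# For each repeat, shuffle a balanced vector of fold labels 1..k.
fold_ids <- lapply(seq_len(repeats), function(r) {
  sample(rep(seq_len(k), length.out = n))
})

# Training-set size for every (repeat, fold) pair: rows outside the held-out fold.
train_sizes <- unlist(lapply(fold_ids, function(ids) {
  sapply(seq_len(k), function(f) sum(ids != f))
}))

length(train_sizes)   # number of resamples
unique(train_sizes)   # size of each training subset
```

With three candidate mtry values, each resample is fitted three times, which is one reason the saved model is reloaded instead of retrained.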

rf$finalModel

## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 22
## 
##         OOB estimate of  error rate: 2.49%
## Confusion matrix:
##       no  yes class.error
## no  3031  159  0.04984326
## yes    0 3190  0.00000000

Out-of-bag error: 2.49%. In other words, the model is estimated to misclassify about 2.49% of unseen data. Next, we apply the model to the test data created earlier and check whether it achieves high accuracy, specificity, precision, and sensitivity.
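The out-of-bag figures can be recomputed from the confusion matrix printed by rf$finalModel. The sketch below copies those counts by hand, with rows as the actual class and columns as the predicted class:

```r
# OOB confusion matrix copied from the rf$finalModel output above.
oob <- matrix(c(3031,  159,
                   0, 3190),
              nrow = 2, byrow = TRUE,
              dimnames = list(actual = c("no", "yes"), predicted = c("no", "yes")))

# Overall OOB error: misclassified rows divided by all rows.
oob_error <- 1 - sum(diag(oob)) / sum(oob)

# Per-class error: off-diagonal share of each row.
class_error <- 1 - diag(oob) / rowSums(oob)

round(100 * oob_error, 2)
round(class_error, 4)
```

All 159 out-of-bag errors are “no” rows predicted as “yes”.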

bank_rf_pred <- predict(rf, newdata = bank_test)
confusionMatrix(bank_rf_pred, bank_test$y, positive = "no")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  794  10
##        yes   7  94
##                                          
##                Accuracy : 0.9812         
##                  95% CI : (0.9701, 0.989)
##     No Information Rate : 0.8851         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9065         
##                                          
##  Mcnemar's Test P-Value : 0.6276         
##                                          
##             Sensitivity : 0.9913         
##             Specificity : 0.9038         
##          Pos Pred Value : 0.9876         
##          Neg Pred Value : 0.9307         
##              Prevalence : 0.8851         
##          Detection Rate : 0.8773         
##    Detection Prevalence : 0.8884         
##       Balanced Accuracy : 0.9476         
##                                          
##        'Positive' Class : no             
##

With “no” as the positive class, the Random Forest model reaches an accuracy of 98%, a sensitivity of 99%, a specificity of 90%, and a precision of 99%. Overall the model performs very well, but we still need to compare it with the decision tree model.
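For readers who want to trace where these statistics come from, the sketch below recomputes them from the four cells of the confusion matrix above, with “no” as the positive class. The variable names are illustrative:

```r
# Test-set counts copied from the confusion matrix above (positive class = "no").
tp <- 794   # predicted "no",  actually "no"
fp <- 10    # predicted "no",  actually "yes"
fn <- 7     # predicted "yes", actually "no"
tn <- 94    # predicted "yes", actually "yes"

metrics <- c(
  accuracy    = (tp + tn) / (tp + fp + fn + tn),
  sensitivity = tp / (tp + fn),   # recall for "no"
  specificity = tn / (tn + fp),   # recall for "yes"
  precision   = tp / (tp + fp)    # Pos Pred Value
)
round(metrics, 4)
```

Swapping the positive class swaps sensitivity with specificity and Pos Pred Value with Neg Pred Value, so the positive class should always be stated next to these numbers.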

Decision tree

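# bank_train was already balanced by the earlier upSample() call,
# so this second call leaves the class proportions at 50/50
bank_train_up <- upSample(x = bank_train %>% select(-subscribe),
                         y = as.factor(bank_train$subscribe),
                         yname = "subscribe")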

set.seed(128)
bank_model_dtree <- ctree(subscribe ~ ., bank_train_up)

bank_pred_dtree <- predict(bank_model_dtree, bank_test)
confusionMatrix(bank_pred_dtree, bank_test$y, positive = "yes")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  646  25
##        yes 155  79
##                                           
##                Accuracy : 0.8011          
##                  95% CI : (0.7736, 0.8266)
##     No Information Rate : 0.8851          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3667          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.75962         
##             Specificity : 0.80649         
##          Pos Pred Value : 0.33761         
##          Neg Pred Value : 0.96274         
##              Prevalence : 0.11492         
##          Detection Rate : 0.08729         
##    Detection Prevalence : 0.25856         
##       Balanced Accuracy : 0.78305         
##                                           
##        'Positive' Class : yes             
##

The accuracy, recall, and specificity are already reasonably good, but precision is still too low at about 34%. The model can still be improved by tuning it.

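# Tune the tree: a lower mincriterion allows more splits, while minsplit and minbucket limit node sizes
set.seed(128)
bank_dtree_tuning <- ctree(subscribe ~ ., bank_train_up,
                            control = ctree_control(mincriterion = 0.1, minsplit = 100, minbucket = 60))

# `positive` is an argument of confusionMatrix(), not predict(), so it is dropped here
dtree_prediction_tuning <- predict(bank_dtree_tuning, bank_test)

# With no `positive` argument, confusionMatrix() treats the first level ("no") as the positive class
confusionMatrix(dtree_prediction_tuning, bank_test$y)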

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  674  24
##        yes 127  80
##                                           
##                Accuracy : 0.8331          
##                  95% CI : (0.8072, 0.8569)
##     No Information Rate : 0.8851          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.4268          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8414          
##             Specificity : 0.7692          
##          Pos Pred Value : 0.9656          
##          Neg Pred Value : 0.3865          
##              Prevalence : 0.8851          
##          Detection Rate : 0.7448          
##    Detection Prevalence : 0.7713          
##       Balanced Accuracy : 0.8053          
##                                           
##        'Positive' Class : no              
##

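After tuning, accuracy rises from 80% to 83%. Note that this confusion matrix uses “no” as the positive class, so its Pos Pred Value of 0.9656 is the precision for “no” and cannot be compared directly with the earlier 34%. The precision for “yes” is the Neg Pred Value, 0.3865, so tuning lifts it only from about 34% to about 39%.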
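Because the two decision tree confusion matrices above were printed with different positive classes, one way to compare them fairly is to recompute the “yes” metrics from the raw counts. A small sketch, with the counts copied by hand and `yes_metrics` as an illustrative helper name:

```r
# Metrics for the "yes" class from a 2x2 table with rows = prediction, columns = reference.
yes_metrics <- function(cm) {
  c(precision = cm["yes", "yes"] / sum(cm["yes", ]),
    recall    = cm["yes", "yes"] / sum(cm[, "yes"]))
}

labels <- list(prediction = c("no", "yes"), reference = c("no", "yes"))

# Counts copied from the two decision tree confusion matrices above.
before <- matrix(c(646, 25, 155, 79), nrow = 2, byrow = TRUE, dimnames = labels)
after  <- matrix(c(674, 24, 127, 80), nrow = 2, byrow = TRUE, dimnames = labels)

comparison <- rbind(before = yes_metrics(before), after = yes_metrics(after))
round(comparison, 4)
```

Both rows of the comparison now describe the same class, which makes the effect of tuning directly readable.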

Conclusion

Of the two models above, Random Forest has very good overall metrics. In the bank telemarketing case we focus on the “no” class: we do not want to call customers who are predicted not to buy the product or service on offer, because unnecessary or intrusive contact could damage the bank's reputation. With “no” as the positive class, the Random Forest model reaches an accuracy of 98%, a sensitivity of 99%, a specificity of 90%, and a precision of 99% on the test data.

Random Forest and Decision Trees with Bank Telemarketing dataset

Sandy Putra Utama

2021-03-01