Banks have many ways to promote and advertise their products, but those advertisements and promotions are often ignored. One approach considered effective is telemarketing. According to Biz Fluent, one advantage of telemarketing is that it makes it easy to acquire customers and maintain relationships with them.
When customers have questions about a product or service, you can explain it in detail over the phone.
Telemarketing also lets you reach customers easily, because you only need their phone number and do not have to visit them one by one.
The strategy is also considered cost-effective, so it can reduce the company's budget.
Attribute Information:
bank client data:
1 age (numeric)
2 job : type of job (categorical: ‘admin.’,‘blue-collar’,‘entrepreneur’,‘housemaid’,‘management’,‘retired’,‘self-employed’,‘services’,‘student’,‘technician’,‘unemployed’,‘unknown’)
3 marital : marital status (categorical: ‘divorced’,‘married’,‘single’,‘unknown’; note: ‘divorced’ means divorced or widowed)
4 education (categorical: ‘basic.4y’,‘basic.6y’,‘basic.9y’,‘high.school’,‘illiterate’,‘professional.course’,‘university.degree’,‘unknown’)
5 default: has credit in default? (categorical: ‘no’,‘yes’,‘unknown’)
6 housing: has housing loan? (categorical: ‘no’,‘yes’,‘unknown’)
7 loan: has personal loan? (categorical: ‘no’,‘yes’,‘unknown’)
related with the last contact of the current campaign:
8 contact: contact communication type (categorical: ‘cellular’,‘telephone’)
9 month: last contact month of year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’)
10 day: last contact day of the week (categorical: ‘mon’,‘tue’,‘wed’,‘thu’,‘fri’)
11 duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=‘no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
other attributes:
12 campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 previous: number of contacts performed before this campaign and for this client (numeric)
15 poutcome: outcome of the previous marketing campaign (categorical: ‘failure’,‘nonexistent’,‘success’)
Output variable (desired target):
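16 y: has the client subscribed to a term deposit? (binary: ‘yes’,‘no’)
Note: the data frame loaded below is coded slightly differently from this description. It has 17 variables including an extra balance column, records day as the day of the month, and uses -1 in pdays for clients who were not previously contacted.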
## 'data.frame': 4521 obs. of 17 variables:
## $ age : int 30 33 35 30 59 35 36 39 41 43 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
## $ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ...
## $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ balance : int 1787 4789 1350 1476 0 747 307 147 221 -88 ...
## $ housing : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 2 2 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 2 ...
## $ contact : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 3 1 ...
## $ day : int 19 11 16 3 5 23 14 6 14 17 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 9 9 9 1 ...
## $ duration : int 79 220 185 199 226 141 341 151 57 313 ...
## $ campaign : int 1 1 1 4 1 2 1 2 2 1 ...
## $ pdays : int -1 339 330 -1 -1 176 330 -1 -1 147 ...
## $ previous : int 0 4 1 0 0 3 2 0 0 2 ...
## $ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## age job marital education default balance housing loan
## 0 0 0 0 0 0 0 0
## contact day month duration campaign pdays previous poutcome
## 0 0 0 0 0 0 0 0
## y
## 0
# Check data distribution
# Packages used by the plotting code in this section
library(inspectdf)  # inspect_num(), show_plot()
library(ggplot2)
library(plotly)     # ggplotly()
numcols <- unlist(lapply(bank, is.numeric))
show_plot(inspect_num(bank[, numcols]))
# Customers by marital status and job, faceted by personal loan
bank_cust1 <- ggplot(data = bank, mapping = aes(x = marital)) +
geom_bar(mapping = aes(fill = job)) + theme_linedraw() +
ggtitle("Distribution of Customers by personal loan") +
xlab("Marital") + ylab("Number of Customers") + facet_wrap(~ loan)
ggplotly(bank_cust1)
## Warning: `group_by_()` is deprecated as of dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
bd1 <- ggplot(bank[(!is.na(bank$loan) & !is.na(bank$age)),], aes(x = age, fill = loan)) +
geom_density(alpha=0.5, aes(fill=factor(loan))) + labs(title="Loan density and Age") +
scale_x_continuous(breaks = scales::pretty_breaks(n = 10)) + theme_grey()
bd1
# Customers by marital status and job, faceted by housing loan
bank_cust2 <- ggplot(data = bank, mapping = aes(x = marital)) +
geom_bar(mapping = aes(fill = job)) + theme_linedraw() +
ggtitle("Distribution of Customers by housing loan") +
xlab("Marital") + ylab("Number of Customers") + facet_wrap(~ housing)
ggplotly(bank_cust2)
bd2 <- ggplot(bank[(!is.na(bank$housing) & !is.na(bank$age)),], aes(x = age, fill = housing)) +
geom_density(alpha=0.5, aes(fill=factor(housing))) + labs(title="Housing loan density and Age") +
scale_x_continuous(breaks = scales::pretty_breaks(n = 10)) + theme_grey()
bd2
# Customers by marital status and job, faceted by credit default
bank_cust3 <- ggplot(data = bank, mapping = aes(x = marital)) +
geom_bar(mapping = aes(fill = job)) + theme_linedraw() +
ggtitle("Distribution of Customers by credit default") +
xlab("Marital") + ylab("Number of Customers") + facet_wrap(~ default)
ggplotly(bank_cust3)
bd3 <- ggplot(bank[(!is.na(bank$default) & !is.na(bank$age)),], aes(x = age, fill = default)) +
geom_density(alpha=0.5, aes(fill=factor(default))) + labs(title="default density and Age") +
scale_x_continuous(breaks = scales::pretty_breaks(n = 10)) + theme_grey()
bd3
# Customers by marital status and job, faceted by term deposit subscription (y)
bank_cust4 <- ggplot(data = bank, mapping = aes(x = marital)) +
geom_bar(mapping = aes(fill = job)) + theme_linedraw() +
ggtitle("Distribution of Customers by term deposit subscription") +
xlab("Marital") + ylab("Number of Customers") + facet_wrap(~ y)
ggplotly(bank_cust4)
bd4 <- ggplot(bank[(!is.na(bank$y) & !is.na(bank$age)),], aes(x = age, fill = y)) +
geom_density(alpha=0.5, aes(fill=factor(y))) + labs(title="Deposit density and Age") +
scale_x_continuous(breaks = scales::pretty_breaks(n = 10)) + theme_grey()
bd4
##
## no yes
## 4000 521
##
## no yes
## 0.88476 0.11524
Below, the data is split into a training set and a test set, with 80% of the rows used for training and 20% for testing.
set.seed(100)
index <- sample(nrow(bank), nrow(bank)*0.8)
# Data Train
bank_train <- bank[index,]
# Data Test
bank_test <- bank[-index,]
##
## no yes
## 0.8846792 0.1153208
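The split above draws rows at random, so the class proportions in the training set only approximately match the full data. As a hedged alternative, the sketch below shows a stratified 80/20 split in base R on a small synthetic data frame. The data frame `toy` and all of its values are hypothetical and are not part of the original analysis.

```r
# Synthetic data with roughly the same class imbalance as the bank data
# (hypothetical values, for illustration only)
set.seed(100)
toy <- data.frame(
  age = sample(18:80, 1000, replace = TRUE),
  y   = factor(c(rep("no", 885), rep("yes", 115)))
)

# Sample 80% of the row indices within each class separately
train_idx <- unlist(lapply(split(seq_len(nrow(toy)), toy$y), function(rows) {
  sample(rows, size = floor(0.8 * length(rows)))
}))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Both subsets now keep the original class proportions
prop.table(table(toy_train$y))
prop.table(table(toy_test$y))
```

Because the sampling happens inside each class, the training and test sets end up with the same share of "yes" rows, which a plain `sample()` over all rows does not guarantee.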
Because the data is imbalanced and fairly small, we resample the training data with upsampling so that the class proportions become balanced.
# upSample() comes from caret; %>% and select() come from dplyr
bank_train <- upSample(x = bank_train %>% select(-y),
y = as.factor(bank_train$y),
yname = "subscribe")
##
## no yes
## 0.5 0.5
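To make the idea behind upsampling concrete, here is a minimal base R sketch that resamples the minority class with replacement until both classes have the same size. It mimics the idea only and is not caret's implementation; `toy_train` and its values are hypothetical.

```r
# Hypothetical imbalanced training set
set.seed(100)
toy_train <- data.frame(
  x = rnorm(100),
  y = factor(c(rep("no", 88), rep("yes", 12)))
)

# Size of the largest class; every class is resampled up to this size
target_n <- max(table(toy_train$y))

balanced_idx <- unlist(lapply(split(seq_len(nrow(toy_train)), toy_train$y), function(rows) {
  # Keep all original rows, then draw the missing number of rows with replacement
  extra <- sample(rows, size = target_n - length(rows), replace = TRUE)
  c(rows, extra)
}))

toy_balanced <- toy_train[balanced_idx, ]
table(toy_balanced$y)
```

Only the training data is resampled. The test set keeps its original imbalance so that the evaluation reflects the class mix the model would meet in practice.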
The models used are Random Forest and Decision Tree. Random Forest makes predictions by building many decision trees. Each tree has its own characteristics and is built independently of the others. The Random Forest model obtains a prediction from each tree and then takes a vote across those predictions.
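The voting step can be illustrated with a few lines of base R. This is a toy sketch with three hypothetical trees and five customers; it shows majority voting only and does not reproduce how the randomForest package stores or combines its trees.

```r
# Each column holds one hypothetical tree's predictions for five customers
tree_votes <- data.frame(
  tree1 = c("no", "yes", "no",  "no", "yes"),
  tree2 = c("no", "yes", "yes", "no", "no"),
  tree3 = c("no", "no",  "yes", "no", "yes"),
  stringsAsFactors = FALSE
)

# Majority vote per row: the most frequent label wins
majority_vote <- apply(tree_votes, 1, function(votes) {
  names(which.max(table(votes)))
})
majority_vote
```

With an odd number of trees and two classes there are no ties, so every customer gets exactly one predicted label.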
Training is configured with trainControl and benefits from random forest's automatic feature selection: predictors are chosen automatically and at random while each decision tree is built.
- method: repeatedcv
- number: 5
- repeats: 4
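As a rough illustration of what 5-fold cross-validation repeated 4 times implies, the sketch below assigns folds in base R and counts the resulting model fits and training-set sizes. It is not caret's internal code (caret stratifies its folds by class), and the row count of 6380 is taken from the model summary in this report.

```r
# Illustrative only: plain random fold assignment, not caret's stratified folds
n_rows  <- 6380  # rows in the upsampled training set, from the model summary
k_folds <- 5
repeats <- 4

set.seed(123)
resamples <- lapply(seq_len(repeats), function(r) {
  # Shuffle fold labels 1..k so every row lands in exactly one fold
  sample(rep(seq_len(k_folds), length.out = n_rows))
})

# Number of models fitted for each candidate mtry value
n_fits <- k_folds * repeats

# Training rows per fit: everything except the held-out fold
train_sizes <- sapply(resamples, function(fold) n_rows - table(fold))
train_sizes
```

Each fit trains on four fifths of the rows and validates on the remaining fifth, which is why repeated cross-validation with a random forest is slow.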
# set.seed(123)
# control <- trainControl(method = "repeatedcv", number = 5, repeats = 4)
# # pembuatan model
# bank_model <- train(subscribe ~ ., data = bank_train, method = "rf", trControl = control)
# # simpan model
# saveRDS(bank_model, "bank_forest.RDS")
# bank_model
After building the random forest model, we save it to bank_forest.RDS and load it again for prediction. Saving the model matters because it avoids retraining: a weakness of random forest is its long computation time.
# Load the saved model (file name taken from the saveRDS() call above)
rf <- readRDS("bank_forest.RDS")
## Random Forest
##
## 6380 samples
## 16 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 4 times)
## Summary of sample sizes: 5104, 5104, 5104, 5104, 5104, 5104, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8851097 0.7702194
## 22 0.9681034 0.9362069
## 42 0.9631270 0.9262539
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 22.
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 22
##
## OOB estimate of error rate: 2.49%
## Confusion matrix:
## no yes class.error
## no 3031 159 0.04984326
## yes 0 3190 0.00000000
Out-of-bag error: 2.49%. In other words, the model is estimated to misclassify about 2.49% of unseen data. Next, we apply the model to the test data and check whether it achieves high accuracy, specificity, precision, and sensitivity.
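The 2.49% figure can be reproduced from the out-of-bag confusion matrix printed in the model output. The sketch below only checks the arithmetic; the matrix is typed in by hand and the names `oob`, `oob_error`, and `class_error` are illustrative.

```r
# Out-of-bag confusion matrix copied from the model output (rows = actual class)
oob <- matrix(c(3031,  159,
                   0, 3190),
              nrow = 2, byrow = TRUE,
              dimnames = list(actual = c("no", "yes"), predicted = c("no", "yes")))

# Misclassified cases are the off-diagonal cells
oob_error <- (sum(oob) - sum(diag(oob))) / sum(oob)
round(100 * oob_error, 2)

# Per-class error: share of each actual class that was misclassified
class_error <- 1 - diag(oob) / rowSums(oob)
class_error
```

All 159 errors are actual "no" cases predicted as "yes", which matches the class.error column of the output.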
bank_rf_pred <- predict(rf, newdata = bank_test)
confusionMatrix(bank_rf_pred, bank_test$y, positive = "no")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 794 10
## yes 7 94
##
## Accuracy : 0.9812
## 95% CI : (0.9701, 0.989)
## No Information Rate : 0.8851
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9065
##
## Mcnemar's Test P-Value : 0.6276
##
## Sensitivity : 0.9913
## Specificity : 0.9038
## Pos Pred Value : 0.9876
## Neg Pred Value : 0.9307
## Prevalence : 0.8851
## Detection Rate : 0.8773
## Detection Prevalence : 0.8884
## Balanced Accuracy : 0.9476
##
## 'Positive' Class : no
##
With "no" as the positive class, the Random Forest model reaches 98% accuracy, 99% sensitivity, 90% specificity, and 99% precision (Pos Pred Value). Overall the model performs very well, but we still need to compare it with a decision tree model.
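The reported metrics follow directly from the four cells of the random forest confusion matrix. As a quick check, the sketch below recomputes them by hand with "no" as the positive class; the variable names are illustrative.

```r
# Test-set counts from the random forest confusion matrix, "no" = positive class
tp <- 794  # predicted no,  actually no
fp <- 10   # predicted no,  actually yes
fn <- 7    # predicted yes, actually no
tn <- 94   # predicted yes, actually yes

accuracy    <- (tp + tn) / (tp + fp + fn + tn)
sensitivity <- tp / (tp + fn)  # recall for the positive class
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)  # reported by caret as Pos Pred Value

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 4)
```

Which class is treated as positive decides which cell counts as a true positive, so sensitivity and specificity swap roles when the positive class changes.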
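# bank_train is already balanced, so this upSample() call adds no rows
bank_train_up <- upSample(x = bank_train %>% select(-subscribe),
y = as.factor(bank_train$subscribe),
yname = "subscribe")
# Baseline tree. This fitting call is not shown in the original and is assumed here
# (ctree() is provided by partykit and by the older party package)
bank_model_dtree <- ctree(subscribe ~ ., bank_train_up)
bank_pred_dtree <- predict(bank_model_dtree, bank_test)
confusionMatrix(bank_pred_dtree, bank_test$y, positive = "yes")
## Confusion Matrix and Statistics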
##
## Reference
## Prediction no yes
## no 646 25
## yes 155 79
##
## Accuracy : 0.8011
## 95% CI : (0.7736, 0.8266)
## No Information Rate : 0.8851
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3667
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.75962
## Specificity : 0.80649
## Pos Pred Value : 0.33761
## Neg Pred Value : 0.96274
## Prevalence : 0.11492
## Detection Rate : 0.08729
## Detection Prevalence : 0.25856
## Balanced Accuracy : 0.78305
##
## 'Positive' Class : yes
##
The accuracy, recall, and specificity are reasonably good, but precision is still too low at about 34%. The model can still be improved by tuning.
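# Tuning the model
set.seed(128)
bank_dtree_tuning <- ctree(subscribe ~ ., bank_train_up,
control = ctree_control(mincriterion = 0.1, minsplit = 100, minbucket = 60))
# predict() has no `positive` argument; the positive class is set in confusionMatrix()
dtree_prediction_tuning <- predict(bank_dtree_tuning, bank_test)
# Call not shown in the original; the output below uses "no" as the positive class
confusionMatrix(dtree_prediction_tuning, bank_test$y, positive = "no")
## Confusion Matrix and Statistics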
##
## Reference
## Prediction no yes
## no 674 24
## yes 127 80
##
## Accuracy : 0.8331
## 95% CI : (0.8072, 0.8569)
## No Information Rate : 0.8851
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4268
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8414
## Specificity : 0.7692
## Pos Pred Value : 0.9656
## Neg Pred Value : 0.3865
## Prevalence : 0.8851
## Detection Rate : 0.7448
## Detection Prevalence : 0.7713
## Balanced Accuracy : 0.8053
##
## 'Positive' Class : no
##
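After tuning, accuracy rises from 80% to 83%. The Pos Pred Value of 96% reported here is not directly comparable with the earlier 34%, because this confusion matrix uses "no" as the positive class while the previous one used "yes". Measured for the "yes" class, precision corresponds to the Neg Pred Value in this output, about 39%, which is a modest improvement.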
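To compare the two decision trees like with like, precision for the "yes" class can be recomputed from both confusion matrices. The sketch below does this by hand; the helper `precision_yes` and the vector names are hypothetical and exist only for this check.

```r
# Counts from the "Prediction = yes" row of each decision tree confusion matrix
before <- c(pred_yes_ref_yes = 79, pred_yes_ref_no = 155)  # untuned tree
after  <- c(pred_yes_ref_yes = 80, pred_yes_ref_no = 127)  # tuned tree

# Precision for "yes": of the customers predicted to subscribe, the share who did
precision_yes <- function(cm) {
  unname(cm["pred_yes_ref_yes"] / sum(cm))
}

round(c(before = precision_yes(before), after = precision_yes(after)), 4)
```

The tuned value equals the Neg Pred Value in the second output, because that output treats "no" as the positive class.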
Of the two models above, Random Forest has very good overall metrics. In the bank telemarketing case we focus on the "no" class: we do not want to call customers who are predicted not to buy the product or service on offer, because unnecessary or intrusive calls could damage the bank's reputation. With "no" as the positive class, the Random Forest model reaches 98% accuracy, 99% sensitivity, 90% specificity, and 99% precision.