Deposito berjangka adalah produk bank sejenis tabungan, dimana uang yang disetorkan dalam deposito berjangka tidak boleh ditarik nasabah dan baru bisa dicairkan sesuai dengan tanggal jatuh temponya. Dalam data survei pelanggan ini dibedakan antara pelanggan yang menggunakan deposito berjangka dengan yang pelanggan tidak menggunakan berdasarkan hasil hasil dari campaign yang sudah diberikan. Proses klasifikasinya mencoba menggunakan metode Classification and Regression Tree (CART) untuk melihat hasil yang didapat dari penerapan metode CART dalam penerapannya. Metode CART adalah sebuah metode eksplorasi yang mengubah data yang sangat besar menjadi pohon keputusan yang merepresentasikan suatu aturan.
Membaca data survei dengan file format csv.
bank <- read.csv("D:/dataset/dataset/bank customer survey/bank_customer_survey.csv", header = T, sep=",")
Memanggil data dari 6 baris keatas.
head(bank,6)
## age job marital education default balance housing loan contact
## 1 58 management married tertiary no 2143 yes no unknown
## 2 44 technician single secondary no 29 yes no unknown
## 3 33 entrepreneur married secondary no 2 yes yes unknown
## 4 47 blue married unknown no 1506 yes no unknown
## 5 33 unknown single unknown no 1 no no unknown
## 6 35 management married tertiary no 231 yes no unknown
## day month duration campaign pdays previous poutcome y
## 1 5 may 261 1 -1 0 unknown 0
## 2 5 may 151 1 -1 0 unknown 0
## 3 5 may 76 1 -1 0 unknown 0
## 4 5 may 92 1 -1 0 unknown 0
## 5 5 may 198 1 -1 0 unknown 0
## 6 5 may 139 1 -1 0 unknown 0
Mengubah data yang “unknown” menjadi NA
bank[bank=="unknown"] <- NA
Deskripsi Statistik Data
summary(bank)
## age job marital education
## Min. :18.00 blue :9732 divorced: 5207 primary : 6851
## 1st Qu.:33.00 management:9458 married :27214 secondary:23202
## Median :39.00 technician:7597 single :12790 tertiary :13301
## Mean :40.94 admin :5171 unknown : 0
## 3rd Qu.:48.00 services :4154 NA's : 1857
## Max. :95.00 (Other) :8811
## NA's : 288
## default balance housing loan contact
## no :44396 Min. : -8019 no :20081 no :37967 cellular :29285
## yes: 815 1st Qu.: 72 yes:25130 yes: 7244 telephone: 2906
## Median : 448 unknown : 0
## Mean : 1362 NA's :13020
## 3rd Qu.: 1428
## Max. :102127
##
## day month duration campaign
## Min. : 1.00 may :13766 Min. : 0.0 Min. : 1.000
## 1st Qu.: 8.00 jul : 6895 1st Qu.: 103.0 1st Qu.: 1.000
## Median :16.00 aug : 6247 Median : 180.0 Median : 2.000
## Mean :15.81 jun : 5341 Mean : 258.2 Mean : 2.764
## 3rd Qu.:21.00 nov : 3970 3rd Qu.: 319.0 3rd Qu.: 3.000
## Max. :31.00 apr : 2932 Max. :4918.0 Max. :63.000
## (Other): 6060
## pdays previous poutcome y
## Min. : -1.0 Min. : 0.0000 failure: 4901 Min. :0.000
## 1st Qu.: -1.0 1st Qu.: 0.0000 other : 1840 1st Qu.:0.000
## Median : -1.0 Median : 0.0000 success: 1511 Median :0.000
## Mean : 40.2 Mean : 0.5803 unknown: 0 Mean :0.117
## 3rd Qu.: -1.0 3rd Qu.: 0.0000 NA's :36959 3rd Qu.:0.000
## Max. :871.0 Max. :275.0000 Max. :1.000
##
Dataset ini berisi dari dua set pelanggan, pelanggan yang merupakan bagian dari kampanye sebelumnya dan pelanggan yang dihubungi untuk pertama kalinya. Akan dibuat 2 data yang berbeda
old_Cust_bank<-subset(bank, bank$poutcome != "nonexistent")
Melihat struktur data
str(old_Cust_bank)
## 'data.frame': 8252 obs. of 17 variables:
## $ age : int 33 42 33 36 36 56 44 26 51 34 ...
## $ job : Factor w/ 12 levels "admin","blue",..: 1 1 8 5 5 10 2 10 1 5 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 2 2 2 3 3 2 ...
## $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 3 3 2 2 3 2 3 ...
## $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ balance : int 882 -247 3444 2415 0 589 1324 172 3132 1770 ...
## $ housing : Factor w/ 2 levels "no","yes": 1 2 2 2 2 2 2 1 1 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 2 1 1 ...
## $ contact : Factor w/ 3 levels "cellular","telephone",..: 2 2 2 2 2 NA 2 2 2 NA ...
## $ day : int 21 21 21 22 23 23 25 4 5 6 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 11 11 11 11 11 11 11 10 10 10 ...
## $ duration : int 39 519 144 73 140 518 119 21 449 26 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int 151 166 91 86 143 147 89 140 176 101 ...
## $ previous : int 3 1 4 4 3 2 2 4 1 11 ...
## $ poutcome : Factor w/ 4 levels "failure","other",..: 1 2 1 2 1 3 2 2 1 2 ...
## $ y : int 0 1 1 0 1 1 0 0 0 0 ...
Hasil yang didapat terdapat 8252 data pelanggan pada hasil kampanye sebelumnya
new_Cust_bank<-subset(bank, bank$poutcome == "nonexistent")
Hasil yang didapat terdapat 0 data pelanggan pada hasil kampanye yang baru dihubungi. Sehingga data yang digunakan pada data old_Cust_bank. Selanjutnya adalah analisis nilai yang hilang
library(VIM)
## Loading required package: colorspace
## Loading required package: grid
## Loading required package: data.table
## Warning: package 'data.table' was built under R version 3.5.3
## VIM is ready to use.
## Since version 4.0.0 the GUI is in its own package VIMGUI.
##
## Please use the package to use the new (and old) GUI.
## Suggestions and bug-reports can be submitted at: https://github.com/alexkowa/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
## sleep
aggr_plot <- aggr(old_Cust_bank, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(df), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
## Warning in plot.aggr(res, ...): not enough horizontal space to display
## frequencies
##
## Variables sorted by number of missings:
## Variable Count
## education 0.039142026
## contact 0.008482792
## job 0.003999031
## age 0.000000000
## marital 0.000000000
## default 0.000000000
## balance 0.000000000
## housing 0.000000000
## loan 0.000000000
## day 0.000000000
## month 0.000000000
## duration 0.000000000
## campaign 0.000000000
## pdays 0.000000000
## previous 0.000000000
## poutcome 0.000000000
## y 0.000000000
Memeriksa target distribusi variabel untuk melihat ada tidaknya ketidakseimbangan data
counts <- table(old_Cust_bank$loan)
barplot(counts,col=c("darkblue","red"),legend = rownames(counts), main = "Pinjaman")
Jadi untuk dataset yang memliki nilai yang hilang maka akan digunakan dengan metode imputasi dengan menggunakan package mice.
library(mice)
## Loading required package: lattice
##
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
##
## cbind, rbind
old_imp<-mice(old_Cust_bank)
##
## iter imp variable
## 1 1 job education contact
## 1 2 job education contact
## 1 3 job education contact
## 1 4 job education contact
## 1 5 job education contact
## 2 1 job education contact
## 2 2 job education contact
## 2 3 job education contact
## 2 4 job education contact
## 2 5 job education contact
## 3 1 job education contact
## 3 2 job education contact
## 3 3 job education contact
## 3 4 job education contact
## 3 5 job education contact
## 4 1 job education contact
## 4 2 job education contact
## 4 3 job education contact
## 4 4 job education contact
## 4 5 job education contact
## 5 1 job education contact
## 5 2 job education contact
## 5 3 job education contact
## 5 4 job education contact
## 5 5 job education contact
## Warning: Number of logged events: 75
old_imp_df<-complete(old_imp)
Mengecek data apabila ada nilai yang hilang.
aggr_plot <- aggr(old_imp_df, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(df), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## age 0
## job 0
## marital 0
## education 0
## default 0
## balance 0
## housing 0
## loan 0
## contact 0
## day 0
## month 0
## duration 0
## campaign 0
## pdays 0
## previous 0
## poutcome 0
## y 0
Mengubah data karakter ke tipe biner
old_imp_df$default<-ifelse(old_imp_df$default =="yes",1,0)
old_imp_df$housing<-ifelse(old_imp_df$housing =="yes",1,0)
old_imp_df$loan<-ifelse(old_imp_df$loan =="yes",1,0)
Melihat hasil dari perubahan dengan melihat struktur data
str(old_imp_df)
## 'data.frame': 8252 obs. of 17 variables:
## $ age : int 33 42 33 36 36 56 44 26 51 34 ...
## $ job : Factor w/ 12 levels "admin","blue",..: 1 1 8 5 5 10 2 10 1 5 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 2 2 2 3 3 2 ...
## $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 3 3 2 2 3 2 3 ...
## $ default : num 0 0 0 0 0 0 0 0 0 0 ...
## $ balance : int 882 -247 3444 2415 0 589 1324 172 3132 1770 ...
## $ housing : num 0 1 1 1 1 1 1 0 0 1 ...
## $ loan : num 0 1 0 0 0 0 0 1 0 0 ...
## $ contact : Factor w/ 3 levels "cellular","telephone",..: 2 2 2 2 2 2 2 2 2 1 ...
## $ day : int 21 21 21 22 23 23 25 4 5 6 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 11 11 11 11 11 11 11 10 10 10 ...
## $ duration : int 39 519 144 73 140 518 119 21 449 26 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int 151 166 91 86 143 147 89 140 176 101 ...
## $ previous : int 3 1 4 4 3 2 2 4 1 11 ...
## $ poutcome : Factor w/ 4 levels "failure","other",..: 1 2 1 2 1 3 2 2 1 2 ...
## $ y : int 0 1 1 0 1 1 0 0 0 0 ...
Sekarang kita mengatur ulang data dan membentuk ketipe faktor dan numerik.
old_imp_df$job <- as.numeric(old_imp_df$job)
old_imp_df$marital <- as.numeric(old_imp_df$marital)
old_imp_df$education <- as.numeric(old_imp_df$education)
old_imp_df$default <- as.numeric(old_imp_df$default)
old_imp_df$housing <- as.numeric(old_imp_df$housing)
old_imp_df$loan <- as.numeric(old_imp_df$loan)
old_imp_df$contact <- as.numeric(old_imp_df$contact)
old_imp_df$month <- as.numeric(old_imp_df$month)
old_imp_df$age <- as.numeric(old_imp_df$age)
old_imp_df$duration <- as.numeric(old_imp_df$duration)
old_imp_df$campaign <- as.numeric(old_imp_df$campaign)
old_imp_df$pdays <- as.numeric(old_imp_df$pdays)
old_imp_df$previous <- as.numeric(old_imp_df$previous)
old_imp_df$poutcome <- as.numeric(old_imp_df$poutcome)
old_imp_df$y <- as.factor(old_imp_df$y)
Melihat hasil dari perubahan dengan melihat struktur data
str(old_imp_df)
## 'data.frame': 8252 obs. of 17 variables:
## $ age : num 33 42 33 36 36 56 44 26 51 34 ...
## $ job : num 1 1 8 5 5 10 2 10 1 5 ...
## $ marital : num 2 3 2 2 2 2 2 3 3 2 ...
## $ education: num 3 2 2 3 3 2 2 3 2 3 ...
## $ default : num 0 0 0 0 0 0 0 0 0 0 ...
## $ balance : int 882 -247 3444 2415 0 589 1324 172 3132 1770 ...
## $ housing : num 0 1 1 1 1 1 1 0 0 1 ...
## $ loan : num 0 1 0 0 0 0 0 1 0 0 ...
## $ contact : num 2 2 2 2 2 2 2 2 2 1 ...
## $ day : int 21 21 21 22 23 23 25 4 5 6 ...
## $ month : num 11 11 11 11 11 11 11 10 10 10 ...
## $ duration : num 39 519 144 73 140 518 119 21 449 26 ...
## $ campaign : num 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : num 151 166 91 86 143 147 89 140 176 101 ...
## $ previous : num 3 1 4 4 3 2 2 4 1 11 ...
## $ poutcome : num 1 2 1 2 1 3 2 2 1 2 ...
## $ y : Factor w/ 2 levels "0","1": 1 2 2 1 2 2 1 1 1 1 ...
Membuat data train dan test
data <- old_imp_df
split <- sample(nrow(data), nrow(data)*0.8)
train <- data[split,]
test <- data[-split,]
Melihat hasil dari perubahan.
head(train)
## age job marital education default balance housing loan contact day
## 1253 57 5 1 3 0 2549 0 0 1 30
## 1593 58 6 3 2 0 1640 0 1 1 4
## 307 53 7 2 2 0 1066 1 0 1 18
## 2014 53 2 2 1 0 1409 1 0 1 2
## 396 35 5 1 3 0 771 1 1 1 19
## 401 30 2 1 2 0 4591 1 0 1 19
## month duration campaign pdays previous poutcome y
## 1253 5 456 3 197 2 1 0
## 1593 4 72 6 274 4 1 0
## 307 10 110 1 186 3 1 0
## 2014 1 464 1 287 1 1 0
## 396 10 110 1 93 5 1 0
## 401 10 89 1 175 1 2 0
Mengecek apakah ada nilai yang kosong.
(sum(is.na(train))/(nrow(train)*ncol(train)))*100
## [1] 0
Membangun model CART
library(rpart)
library(rpart.plot)
model.cart <- rpart(formula=y~age+job+marital+education+default+housing+
loan+contact+day+month+campaign+previous+poutcome+pdays,data=train,method = "class")
#Pohon Keputusan
prp(model.cart, type=2,extra=104,fallen.leaves = T,main="Pohon Keputusan")
Prediksi data testing.
prediksi.cart <- predict(model.cart, test,type = "class")
test$hasil <- prediksi.cart
head(test)
## age job marital education default balance housing loan contact day
## 1 33 1 2 3 0 882 0 0 2 21
## 4 36 5 2 3 0 2415 1 0 2 22
## 5 36 5 2 3 0 0 1 0 2 23
## 10 34 5 2 3 0 1770 1 0 1 6
## 11 33 11 1 2 0 1005 1 0 2 10
## 22 40 5 3 2 0 1623 1 0 1 17
## month duration campaign pdays previous poutcome y hasil
## 1 11 39 1 151 3 1 0 0
## 4 11 73 1 86 4 2 0 0
## 5 11 140 1 143 3 1 1 0
## 10 10 26 1 101 11 2 0 0
## 11 10 175 1 174 2 1 0 0
## 22 10 161 1 167 2 1 0 0
Membuat Confusion Matrix
library(caret)
## Warning: package 'caret' was built under R version 3.5.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.5.3
confusionMatrix(as.factor(test$y),as.factor(prediksi.cart))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1153 100
## 1 198 200
##
## Accuracy : 0.8195
## 95% CI : (0.8001, 0.8378)
## No Information Rate : 0.8183
## P-Value [Acc > NIR] : 0.4645
##
## Kappa : 0.4615
##
## Mcnemar's Test P-Value : 1.92e-08
##
## Sensitivity : 0.8534
## Specificity : 0.6667
## Pos Pred Value : 0.9202
## Neg Pred Value : 0.5025
## Prevalence : 0.8183
## Detection Rate : 0.6984
## Detection Prevalence : 0.7589
## Balanced Accuracy : 0.7601
##
## 'Positive' Class : 0
##
Membuat Kurva Roc dari hasil testing
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
pr <- prediction(as.numeric(prediksi.cart), as.numeric(test$y))
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
Melihat nilai AUC dari hasil testing
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.7113521
Kesimpulan : Dari penerapan metode CART dalam klasifikasi pelanggan yang berlangganan deposito berjangka memiliki akurasi BAIK dengan nilai akurasi 0.8147 berdasarkan hasil dari confussion matrix. Sedangkan untuk nilai AUC memiliki nilai 0.7052437 yang berarti Cukup Baik.