Faktor Yang Memengaruhi Seseorang Berlangganan Deposito Berjangka
Penerapan Regresi Logistik Biner
setwd(“D:\Kuliah S2 IPB\Bahan Kuliah\Semester 3 SSD 2020\Pemodelan Klasifikasi\Praktikum\R”)
Packages
Packages yang digunakan adalah
library(readxl)
library(DT)
library(tidyverse)
library(dplyr)
library(ggpubr)
library(broom)
library(caret)
library(DataExplorer)
library(grid)
library(InformationValue)
library(ISLR)
library(pscl)
library(ROSE)
library(ggplot2)
library(scales)
library(knitr)
library(RColorBrewer)Data
Data yang digunakan adalah dataset
bank-additional.csvyang diperoleh dari “Bank Marketing” UCI dataset. Data ini terkait dengan kampanye pemasaran langsung dari lembaga perbankan Portugal, yang tersedia pada: UCI Machine Learning Repository: Bank Marketing Data Set
Tugas
Bangunlah model terbaik dengan regresi logistik, terkait mau tidak nya pelanggan untuk berlangganan deposito berjangka (y = has the client subscribed a term deposit? (binary: ‘yes’,‘no’))
Gunakan proses validasi 80:20, dan identifikasi peubah yang mempengaruhi y secara signifikan
Input Data
Proses yang pertama dilakukan sebelum menyusun model regresi logistik adalah melakukan input data yang telah di download pada UCI Machine Learning Repository: Bank Marketing Data Set
bank <- read.csv("bank-additional.csv",sep=";",header=TRUE)
str(bank)## 'data.frame': 4119 obs. of 21 variables:
## $ age : int 30 39 25 38 47 32 32 41 31 35 ...
## $ job : chr "blue-collar" "services" "services" "services" ...
## $ marital : chr "married" "single" "married" "married" ...
## $ education : chr "basic.9y" "high.school" "high.school" "basic.9y" ...
## $ default : chr "no" "no" "no" "no" ...
## $ housing : chr "yes" "no" "yes" "unknown" ...
## $ loan : chr "no" "no" "no" "unknown" ...
## $ contact : chr "cellular" "telephone" "telephone" "telephone" ...
## $ month : chr "may" "may" "jun" "jun" ...
## $ day_of_week : chr "fri" "fri" "wed" "fri" ...
## $ duration : int 487 346 227 17 58 128 290 44 68 170 ...
## $ campaign : int 2 4 1 3 1 3 4 2 1 1 ...
## $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : int 0 0 0 0 0 2 0 0 1 0 ...
## $ poutcome : chr "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
## $ emp.var.rate : num -1.8 1.1 1.4 1.4 -0.1 -1.1 -1.1 -0.1 -0.1 1.1 ...
## $ cons.price.idx: num 92.9 94 94.5 94.5 93.2 94.2 94.2 93.2 93.2 94 ...
## $ cons.conf.idx : num -46.2 -36.4 -41.8 -41.8 -42 -37.5 -37.5 -42 -42 -36.4 ...
## $ euribor3m : num 1.3 4.9 5 5 4.2 0.9 0.9 4.2 4.2 4.9 ...
## $ nr.employed : num 5099 5191 5228 5228 5196 ...
## $ y : chr "no" "no" "no" "no" ...
dataset bank-additional.csv terdiri dari 21 kolom/variabel/peubah dan 4119 baris/observasi.
Ada 20 kandidat calon peubah bebas yang dapat digunakan dalam menyusun model regresi logistik untuk mengetahui faktor-faktor yang memengaruhi pelanggan untuk berlangganan deposito berjangka (y - has the client subscribed a term deposit? (binary: “yes”,“no”)). 20 kandidat peubah bebas tersebut adalah:
bank client data:
1 - age (numeric)
2 - job : type of job (categorical: “admin.”,“blue-collar”,“entrepreneur”,“housemaid”,“management”,“retired”,“self-employed”,“services”,“student”,“technician”,“unemployed”,“unknown”)
3 - marital : marital status (categorical: “divorced”,“married”,“single”,“unknown”; note: “divorced” means divorced or widowed)
4 - education (categorical: “basic.4y”,“basic.6y”,“basic.9y”,“high.school”,“illiterate”,“professional.course”,“university.degree”,“unknown”)
5 - default: has credit in default? (categorical: “no”,“yes”,“unknown”)
6 - housing: has housing loan? (categorical: “no”,“yes”,“unknown”)
7 - loan: has personal loan? (categorical: “no”,“yes”,“unknown”)
related with the last contact of the current campaign
8 - contact: contact communication type (categorical: “cellular”,“telephone”)
9 - month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
10 - day_of_week: last contact day of the week (categorical: “mon”,“tue”,“wed”,“thu”,“fri”)
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=“no”). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: “failure”,“nonexistent”,“success”)
social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)
Dataset ini selanjutnya disimpan dengan nama bank.
Pre-processing Data
Cek Tipe Data
Dari data bank yang telah di input di atas,terlihat bahwa beberapa peubah yang seharusnya memiliki tipe kategori masih dibaca sebagai character dan numerik sehingga perlu ditransformasi terlebih dahulu menjadi factor.
#mengubah peubah categoric as.factor
bank$job <- as.factor(bank$job)
bank$marital <- as.factor(bank$marital)
bank$education <- as.factor(bank$education)
bank$default <- as.factor(bank$default)
bank$housing <- as.factor(bank$housing)
bank$loan <- as.factor(bank$loan)
bank$contact <- as.factor(bank$contact)
bank$month <- as.factor(bank$month)
bank$day_of_week <- as.factor(bank$day_of_week)
bank$poutcome <- as.factor(bank$poutcome)
bank$y <- as.factor(bank$y)str(bank)## 'data.frame': 4119 obs. of 21 variables:
## $ age : int 30 39 25 38 47 32 32 41 31 35 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 2 8 8 8 1 8 1 3 8 2 ...
## $ marital : Factor w/ 4 levels "divorced","married",..: 2 3 2 2 2 3 3 2 1 2 ...
## $ education : Factor w/ 8 levels "basic.4y","basic.6y",..: 3 4 4 3 7 7 7 7 6 3 ...
## $ default : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 1 1 1 2 1 2 ...
## $ housing : Factor w/ 3 levels "no","unknown",..: 3 1 3 2 3 1 3 3 1 1 ...
## $ loan : Factor w/ 3 levels "no","unknown",..: 1 1 1 2 1 1 1 1 1 1 ...
## $ contact : Factor w/ 2 levels "cellular","telephone": 1 2 2 2 1 1 1 1 1 2 ...
## $ month : Factor w/ 10 levels "apr","aug","dec",..: 7 7 5 5 8 10 10 8 8 7 ...
## $ day_of_week : Factor w/ 5 levels "fri","mon","thu",..: 1 1 5 1 2 3 2 2 4 3 ...
## $ duration : int 487 346 227 17 58 128 290 44 68 170 ...
## $ campaign : int 2 4 1 3 1 3 4 2 1 1 ...
## $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : int 0 0 0 0 0 2 0 0 1 0 ...
## $ poutcome : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ emp.var.rate : num -1.8 1.1 1.4 1.4 -0.1 -1.1 -1.1 -0.1 -0.1 1.1 ...
## $ cons.price.idx: num 92.9 94 94.5 94.5 93.2 94.2 94.2 93.2 93.2 94 ...
## $ cons.conf.idx : num -46.2 -36.4 -41.8 -41.8 -42 -37.5 -37.5 -42 -42 -36.4 ...
## $ euribor3m : num 1.3 4.9 5 5 4.2 0.9 0.9 4.2 4.2 4.9 ...
## $ nr.employed : num 5099 5191 5228 5228 5196 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
Dari output di atas,terlihat bahwa beberapa peubah yang memiliki tipe kategori sudah dibaca sebagai factor.
Cek Missing Data atau Jawaban “unknown”
missing(bank)## [1] FALSE
terlihat bahwa tidak ada data yang missing pada data bank.
summary(bank)## age job marital education
## Min. :18.00 admin. :1012 divorced: 446 university.degree :1264
## 1st Qu.:32.00 blue-collar: 884 married :2509 high.school : 921
## Median :38.00 technician : 691 single :1153 basic.9y : 574
## Mean :40.11 services : 393 unknown : 11 professional.course: 535
## 3rd Qu.:47.00 management : 324 basic.4y : 429
## Max. :88.00 retired : 166 basic.6y : 228
## (Other) : 649 (Other) : 168
## default housing loan contact month
## no :3315 no :1839 no :3349 cellular :2652 may :1378
## unknown: 803 unknown: 105 unknown: 105 telephone:1467 jul : 711
## yes : 1 yes :2175 yes : 665 aug : 636
## jun : 530
## nov : 446
## apr : 215
## (Other): 203
## day_of_week duration campaign pdays previous
## fri:768 Min. : 0.0 Min. : 1.000 Min. : 0.0 Min. :0.0000
## mon:855 1st Qu.: 103.0 1st Qu.: 1.000 1st Qu.:999.0 1st Qu.:0.0000
## thu:860 Median : 181.0 Median : 2.000 Median :999.0 Median :0.0000
## tue:841 Mean : 256.8 Mean : 2.537 Mean :960.4 Mean :0.1903
## wed:795 3rd Qu.: 317.0 3rd Qu.: 3.000 3rd Qu.:999.0 3rd Qu.:0.0000
## Max. :3643.0 Max. :35.000 Max. :999.0 Max. :6.0000
##
## poutcome emp.var.rate cons.price.idx cons.conf.idx
## failure : 454 Min. :-3.40000 Min. :92.20 Min. :-50.8
## nonexistent:3523 1st Qu.:-1.80000 1st Qu.:93.10 1st Qu.:-42.7
## success : 142 Median : 1.10000 Median :93.70 Median :-41.8
## Mean : 0.08497 Mean :93.58 Mean :-40.5
## 3rd Qu.: 1.40000 3rd Qu.:94.00 3rd Qu.:-36.4
## Max. : 1.40000 Max. :94.80 Max. :-26.9
##
## euribor3m nr.employed y
## Min. :0.600 Min. :4964 no :3668
## 1st Qu.:1.300 1st Qu.:5099 yes: 451
## Median :4.900 Median :5191
## Mean :3.645 Mean :5166
## 3rd Qu.:5.000 3rd Qu.:5228
## Max. :5.000 Max. :5228
##
Dari output di atas terlihat bahwa peubah marital, default, housing, dan loan memiliki kategori “unknown” atau “tidak tahu” sehingga observasi yang memiliki jawaban “unknown” akan dikeluarkan dari data untuk penyusunan model. Untuk peubah job, education, dan month yang memiliki kategori yang cukup banyak sehingga tidak bisa terlihat jelas apakah memiliki jawaban yang “unknown” atau tidak, maka akan dilihat rinciannya sebagai berikut:
table(bank$job)##
## admin. blue-collar entrepreneur housemaid management
## 1012 884 148 110 324
## retired self-employed services student technician
## 166 159 393 82 691
## unemployed unknown
## 111 39
Dari output di atas terlihat bahwa peubah job memiliki 39 observasi yang memilih kategori “unknown” atau “tidak tahu”.
table(bank$education)##
## basic.4y basic.6y basic.9y high.school
## 429 228 574 921
## illiterate professional.course university.degree unknown
## 1 535 1264 167
Dari output di atas terlihat bahwa peubah education memiliki 167 observasi yang memilih kategori “unknown” atau “tidak tahu” dan ada 1 observasi yang memilih ‘illiterate’ sehingga selanjutnya observasi ini akan dikeluarkan dari model.
table(bank$month)##
## apr aug dec jul jun mar may nov oct sep
## 215 636 22 711 530 48 1378 446 69 64
Dari output di atas terlihat bahwa peubah month tidak memiliki observasi yang memilih kategori “unknown” atau “tidak tahu”.
Data Bank yang mengandung observasi yang menjawab “unknown” atau “tidak tahu” pada peubah job, marital, education, default, housing, dan loan, akan dikeluarkan dari data yang akan digunakan untuk penyusunan model. Selain itu, 1 observasi yang memilih kategori education ‘illiterate’ juga dikeluarkan dari dataset. Selanjutnya data tersebut akan disimpan dengan nama bank.ok. Proses filtering data tersebut seperti pada syntax berikut:
bank.ok <- bank %>% filter(job!="unknown",marital!="unknown",education!="unknown",education!="illiterate",default!="unknown",housing!="unknown",loan!="unknown")
summary(bank.ok)## age job marital education
## Min. :20.00 admin. :854 divorced: 347 university.degree :1107
## 1st Qu.:31.00 technician :573 married :1791 high.school : 728
## Median :37.00 blue-collar :554 single : 951 professional.course: 454
## Mean :39.18 services :276 unknown : 0 basic.9y : 407
## 3rd Qu.:46.00 management :265 basic.4y : 243
## Max. :88.00 self-employed:126 basic.6y : 150
## (Other) :441 (Other) : 0
## default housing loan contact month
## no :3088 no :1401 no :2582 cellular :2107 may :981
## unknown: 0 unknown: 0 unknown: 0 telephone: 982 jul :514
## yes : 1 yes :1688 yes : 507 aug :494
## nov :387
## jun :365
## apr :169
## (Other):179
## day_of_week duration campaign pdays previous
## fri:580 Min. : 0.0 Min. : 1.000 Min. : 0.0 Min. :0.0000
## mon:642 1st Qu.: 104.0 1st Qu.: 1.000 1st Qu.:999.0 1st Qu.:0.0000
## thu:630 Median : 181.0 Median : 2.000 Median :999.0 Median :0.0000
## tue:613 Mean : 259.3 Mean : 2.509 Mean :953.3 Mean :0.2082
## wed:624 3rd Qu.: 315.0 3rd Qu.: 3.000 3rd Qu.:999.0 3rd Qu.:0.0000
## Max. :3643.0 Max. :35.000 Max. :999.0 Max. :6.0000
##
## poutcome emp.var.rate cons.price.idx cons.conf.idx
## failure : 360 Min. :-3.40000 Min. :92.20 Min. :-50.80
## nonexistent:2601 1st Qu.:-1.80000 1st Qu.:93.10 1st Qu.:-42.70
## success : 128 Median : 1.10000 Median :93.40 Median :-41.80
## Mean :-0.04587 Mean :93.53 Mean :-40.62
## 3rd Qu.: 1.40000 3rd Qu.:94.00 3rd Qu.:-36.40
## Max. : 1.40000 Max. :94.80 Max. :-26.90
##
## euribor3m nr.employed y
## Min. :0.600 Min. :4964 no :2719
## 1st Qu.:1.300 1st Qu.:5099 yes: 370
## Median :4.900 Median :5191
## Mean :3.504 Mean :5161
## 3rd Qu.:5.000 3rd Qu.:5228
## Max. :5.000 Max. :5228
##
dim(bank.ok)## [1] 3089 21
bank.ok memiliki observasi sebanyak 3089 dengan 21 peubah. Jumlah observasi yang dikeluarkan dari dataset bank sebanyak 1030 observasi (observasi yang memiliki jawaban “unknown” pada salah satu atau lebih peubah bebas kategorik atau illiterate pada peubah education).
4119-3089## [1] 1030
Selanjutnya, data bank.ok akan digunakan proses eksplorasi data sebelum penyusunan model regresi linier.
Eksplorasi Data
Peubah Kategorik
table(bank.ok$job)/nrow(bank.ok)*100##
## admin. blue-collar entrepreneur housemaid management
## 27.646488 17.934607 3.463904 2.427970 8.578828
## retired self-employed services student technician
## 3.820006 4.078990 8.934930 1.780511 18.549692
## unemployed unknown
## 2.784073 0.000000
A1 <- ggplot(bank.ok %>%
count(job) %>%
mutate(percentage = (n / nrow(bank.ok))*100), aes(y=reorder(job, -percentage), x = percentage)) + geom_bar(stat = "identity",fill = "#3F51B5") +
labs(y = "Job",
x = "Percentage (%)") + geom_label(aes(label = round(percentage,2), hjust = 0.2))+
theme(axis.text.x = element_blank(),
panel.grid.major.x = element_blank() )+
xlim(0, 30)+
theme(axis.title.y = element_text(margin = margin(r = 20)))
B1 <- ggplot(bank.ok %>%
count(marital) %>%
mutate(percentage = (n / nrow(bank.ok))*100), aes(y=reorder(marital, -percentage), x = percentage)) + geom_bar(stat = "identity",fill = "#3F51B5") +
labs(y = "Marital",
x = "Percentage (%)") + geom_label(aes(label = round(percentage,2), hjust = 0.2))+
theme(axis.text.x = element_blank(),
panel.grid.major.x = element_blank())+
theme(axis.title.y = element_text(margin = margin(r = 20)))+
xlim(0, 60)
C1 <- ggplot(bank.ok %>%
count(education) %>%
mutate(percentage = (n / nrow(bank.ok))*100), aes(y=reorder(education, -percentage), x = percentage)) + geom_bar(stat = "identity",fill = "#3F51B5") +
labs(y = "Education",
x = "Percentage (%)") + geom_label(aes(label = round(percentage,2), hjust = 0.2))+
theme(axis.text.x = element_blank(),
panel.grid.major.x = element_blank())+
theme(axis.title.y = element_text(margin = margin(r = 20)))+
xlim(0, 40)
D1 <- ggplot(bank.ok %>%
count(default) %>%
mutate(percentage = (n / nrow(bank.ok))*100), aes(y=reorder(default, -percentage), x = percentage)) + geom_bar(stat = "identity",fill = "#3F51B5") +
labs(y = "Default",
x = "Percentage (%)") + geom_label(aes(label = round(percentage,2), hjust = 0.6))+
theme(axis.text.x = element_blank(),
panel.grid.major.x = element_blank())+
theme(axis.title.y = element_text(margin = margin(r = 20)))+
xlim(0, 100)
E1 <- ggplot(bank.ok %>%
count(default) %>%
mutate(percentage = (n / nrow(bank.ok))*100), aes(y=reorder(default, -percentage), x = percentage)) + geom_bar(stat = "identity",fill = "#3F51B5") +
labs(y = "Default",
x = "Percentage (%)") + geom_label(aes(label = round(percentage,2), hjust = 0.2))+
theme(axis.text.x = element_blank(),
panel.grid.major.x = element_blank())+
theme(axis.title.y = element_text(margin = margin(r = 20)))+
xlim(0, 100)
F1 <- ggplot(bank.ok %>%
count(housing) %>%
mutate(percentage = (n / nrow(bank.ok))*100), aes(y=reorder(housing, -percentage), x = percentage)) + geom_bar(stat = "identity",fill = "#3F51B5") +
labs(y = "Housing",
x = "Percentage (%)") + geom_label(aes(label = round(percentage,2), hjust = 0.2))+
theme(axis.text.x = element_blank(),
panel.grid.major.x = element_blank())+
theme(axis.title.y = element_text(margin = margin(r = 20)))+
xlim(0, 60)
G1 <- ggplot(bank.ok %>%
count(loan) %>%
mutate(percentage = (n / nrow(bank.ok))*100), aes(y=reorder(loan, -percentage), x = percentage)) + geom_bar(stat = "identity",fill = "#3F51B5") +
labs(y = "Loan",
x = "Percentage (%)") + geom_label(aes(label = round(percentage,2), hjust = 0.2))+
theme(axis.text.x = element_blank(),
panel.grid.major.x = element_blank())+
theme(axis.title.y = element_text(margin = margin(r = 20)))+
xlim(0, 100)
H1 <- ggplot(bank.ok %>%
count(contact) %>%
mutate(percentage = (n / nrow(bank.ok))*100), aes(y=reorder(contact, -percentage), x = percentage)) + geom_bar(stat = "identity",fill = "#3F51B5") +
labs(y = "Contact",
x = "Percentage (%)") + geom_label(aes(label = round(percentage,2), hjust = 0.2))+
theme(axis.text.x = element_blank(),
panel.grid.major.x = element_blank())+
theme(axis.title.y = element_text(margin = margin(r = 20)))+
xlim(0, 100)
I1 <- ggplot(bank.ok %>%
count(month) %>%
mutate(percentage = (n / nrow(bank.ok))*100), aes(y=reorder(month, -percentage), x = percentage)) + geom_bar(stat = "identity",fill = "#3F51B5") +
labs(y = "Month",
x = "Percentage (%)") + geom_label(aes(label = round(percentage,2), hjust = 0.2))+
theme(axis.text.x = element_blank(),
panel.grid.major.x = element_blank())+
theme(axis.title.y = element_text(margin = margin(r = 20)))+
xlim(0, 40)
J1 <- ggplot(bank.ok %>%
count(day_of_week) %>%
mutate(percentage = (n / nrow(bank.ok))*100), aes(y=reorder(day_of_week, -percentage), x = percentage)) + geom_bar(stat = "identity",fill = "#3F51B5") +
labs(y = "Day_of_week",
x = "Percentage (%)") + geom_label(aes(label = round(percentage,2), hjust = 0.2))+
theme(axis.text.x = element_blank(),
panel.grid.major.x = element_blank())+
theme(axis.title.y = element_text(margin = margin(r = 20)))+
xlim(0, 30)
K1 <- ggplot(bank.ok %>%
count(poutcome) %>%
mutate(percentage = (n / nrow(bank.ok))*100), aes(y=reorder(poutcome, -percentage), x = percentage)) + geom_bar(stat = "identity",fill = "#3F51B5") +
labs(y = "Poutcome",
x = "Percentage (%)") + geom_label(aes(label = round(percentage,2), hjust = 0.2))+
theme(axis.text.x = element_blank(),
panel.grid.major.x = element_blank())+
theme(axis.title.y = element_text(margin = margin(r = 20)))+
xlim(0, 100)
L1 <- ggplot(bank.ok %>%
count(y) %>%
mutate(percentage = (n / nrow(bank.ok))*100), aes(y=reorder(y, -percentage), x = percentage)) + geom_bar(stat = "identity",fill = "#3F51B5") +
labs(y = "Y",
x = "Percentage (%)") + geom_label(aes(label = round(percentage,2), hjust = 0.2))+
theme(axis.text.x = element_blank(),
panel.grid.major.x = element_blank())+
theme(axis.title.y = element_text(margin = margin(r = 20)))+
xlim(0, 100)
A1B1C1D1F1G1H1I1J1K1L1Dari output di atas terlihat bahwa pelanggan yang menjadi sampel dalam penelitian ini mayoritas tidak mau berlangganan deposito berjangka, pekerjaannya dalam bidang administrasi, status perkawinannya menikah, pendidikannya university degree, tidak memiliki pinajaman perumahan, tidak memiliki pinjaman pribadi, kontak terakhir pada kampanye pemasaran langsung dari bank melalui cellular pada bulan Mei dan hari Senin, serta hasil dari kampanye pemasaran sebelumnya “tidak ada/nonexistent”. Untuk peubah default (apakah kredit default?) semuanya menjawab tidak. Sehingga peubah ini akan dieliminasi dalam penyusunan model regresi logistik.
Selanjutnya akan dilakukan eksplorasi data dalam bentuk diagram batang untuk melihat hubungan secara visual antara setiap peubah bebas dengan peubah penjelas (y).
bank.ok %>%
count(y, job) %>%
mutate(percent = (n / sum(n))*100)## y job n percent
## 1 no admin. 736 23.8264811
## 2 no blue-collar 507 16.4130787
## 3 no entrepreneur 100 3.2372936
## 4 no housemaid 66 2.1366138
## 5 no management 241 7.8018776
## 6 no retired 89 2.8811913
## 7 no self-employed 113 3.6581418
## 8 no services 252 8.1579799
## 9 no student 41 1.3272904
## 10 no technician 504 16.3159599
## 11 no unemployed 70 2.2661055
## 12 yes admin. 118 3.8200065
## 13 yes blue-collar 47 1.5215280
## 14 yes entrepreneur 7 0.2266106
## 15 yes housemaid 9 0.2913564
## 16 yes management 24 0.7769505
## 17 yes retired 29 0.9388152
## 18 yes self-employed 13 0.4208482
## 19 yes services 24 0.7769505
## 20 yes student 14 0.4532211
## 21 yes technician 69 2.2337326
## 22 yes unemployed 16 0.5179670
# bar plot, with each bar representing 100%
A <- ggplot(bank.ok,
aes(y = job,
fill = y)) + geom_bar(position = "fill", names ="y") +
scale_fill_manual(values = c("#8C9EFF","#3F51B5"))+
labs(y = "job",
x = "proportion")+
theme(axis.title.y = element_text(margin = margin(r = 20)))## Warning: Ignoring unknown parameters: names
B <- ggplot(bank.ok,
aes(y = marital,
fill = y)) + geom_bar(position = "fill", names ="y") +
scale_fill_manual(values = c("#8C9EFF","#3F51B5"))+
labs(y = "marital",
x = "proportion")## Warning: Ignoring unknown parameters: names
C <- ggplot(bank.ok,
aes(y = education,
fill = y)) + geom_bar(position = "fill", names ="y") +
scale_fill_manual(values = c("#8C9EFF","#3F51B5"))+
labs(y = "education",
x = "proportion")## Warning: Ignoring unknown parameters: names
D <- ggplot(bank.ok,
aes(y = housing,
fill = y)) + geom_bar(position = "fill", names ="y") +
scale_fill_manual(values = c("#8C9EFF","#3F51B5"))+
labs(y = "housing",
x = "proportion")## Warning: Ignoring unknown parameters: names
E <- ggplot(bank.ok,
aes(y = loan,
fill = y)) + geom_bar(position = "fill", names ="y") +
scale_fill_manual(values = c("#8C9EFF","#3F51B5"))+
labs(y = "loan",
x = "proportion")## Warning: Ignoring unknown parameters: names
G <- ggplot(bank.ok,
aes(y = contact,
fill = y)) + geom_bar(position = "fill", names ="y") +
scale_fill_manual(values = c("#8C9EFF","#3F51B5"))+
labs(y = "contact",
x = "proportion")## Warning: Ignoring unknown parameters: names
H <- ggplot(bank.ok,
aes(y = month,
fill = y)) + geom_bar(position = "fill", names ="y") +
scale_fill_manual(values = c("#8C9EFF","#3F51B5"))+
labs(y = "month",
x = "proportion")## Warning: Ignoring unknown parameters: names
I <- ggplot(bank.ok,
aes(y = day_of_week,
fill = y)) + geom_bar(position = "fill", names ="y") +
scale_fill_manual(values = c("#8C9EFF","#3F51B5"))+
labs(y = "day_of_week",
x = "proportion")## Warning: Ignoring unknown parameters: names
J <- ggplot(bank.ok,
aes(y = poutcome,
fill = y)) + geom_bar(position = "fill", names ="y") +
scale_fill_manual(values = c("#8C9EFF","#3F51B5"))+
labs(y = "poutcome",
x = "proportion")## Warning: Ignoring unknown parameters: names
ABCDEGHIJDari output di atas terlihat bahwa pelanggan yang mau berlangganan deposito berjangka mayoritas memiliki status sebagai student, masih single, pendidikannya university degree, memiliki pinjaman perumahan, tidak memiliki pinjaman pribadi, kontak terakhir pada kampanye pemasaran langsung dari bank melalui cellular pada bulan Maret dan hari Senin atau Kamis, serta hasil dari kampanye pemasaran sebelumnya sukses.
Peubah Numerik
boxplot(bank.ok[,1],horizontal = TRUE,
xlab = "Age")boxplot(bank.ok[,11],horizontal = TRUE,
xlab = "Duration")boxplot(bank.ok[,12],horizontal = TRUE,
xlab = "Campaign")boxplot(bank.ok[,13],horizontal = TRUE,
xlab = "pdays")boxplot(bank.ok[,14],horizontal = TRUE,
xlab = "previous")boxplot(bank.ok[,16],horizontal = TRUE,
xlab = "emp.var.rate")boxplot(bank.ok[,17],horizontal = TRUE,
xlab = "cons.price.idx")boxplot(bank.ok[,18],horizontal = TRUE,
xlab = "emp.conf.rate")boxplot(bank.ok[,19],horizontal = TRUE,
xlab = "euribor3m")boxplot(bank.ok[,20],horizontal = TRUE,
xlab = "nr.employed")Dari boxplot di atas, terlihat bahwa terdapat beberapa pencilan untuk peubah umur (age), duration (last contact duration), campaign (number of contacts performed during this campaign and for this client), pdays (number of days that passed by after the client was last contacted from a previous campaign,999 means client was not previously contacted), previous (number of contacts performed before this campaign and for this client),dan cons.conf.idx (consumer confidence index - monthly indicator). Namun, tidak ditemukan pencilan pada peubah emp.var.rate (employment variation rate - quarterly indicator), cons.price.idx (consumer price index-monthly indicator), euribor3m (euribor 3 month rate - daily indicator), dan nr.employed (number of employees - quarterly indicator).
Dikutip dari UCI Machine Learning Repository: Bank Marketing Data Set:
“duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=”no“). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.”
“durasi: durasi kontak terakhir, dalam detik (numerik). Catatan penting: atribut ini sangat mempengaruhi target keluaran (misalnya, jika durasi=0 maka y=”tidak“). Namun, durasinya tidak diketahui sebelum panggilan dilakukan. Juga, setelah akhir panggilan y jelas diketahui. Dengan demikian, input ini hanya boleh dimasukkan untuk tujuan benchmark dan harus dibuang jika tujuannya adalah untuk memiliki model prediksi yang realistis.”
Karena variabel durasi ini hanya sebagai suatu benchmark dan harus dibuang jika tujuannya untuk mendapatkan model yang baik, maka selanjutnya peubah durasi tidak akan digunakan sebagai peubah bebas pada penyusunan model regresi logistik.
Selain itu, peubah pdays (number of days that passed by after the client was last contacted from a previous campaign,999 means client was not previously contacted). Peubah ini mayoritas memiliki jawaban 999 yang berarti pelanggan belum pernah di kontak sebelumnya dan berkaitan dengan peubah previous (number of contacts performed before this campaign and for this client). Maka, peubah pdays ini juga dikeluarkan pada penyusunan model regresi logistik.
table(bank.ok$pdays)##
## 0 1 2 3 4 5 6 7 9 10 11 12 13 15 16 18
## 2 3 4 46 13 4 40 6 2 8 1 5 2 2 1 1
## 19 21 999
## 1 1 2947
Selanjutnya dilakukan eksplorasi data dalam bentuk boxplot untuk melihat hubungan secara visual antara setiap peubah bebas dengan peubah penjelas (y).
# bar plot, with each bar representing 100%
K <- ggplot(bank.ok, aes(x=age, y= y)) +
geom_boxplot()+
labs(x = "age",
y = "y")
M <- ggplot(bank.ok, aes(x=campaign, y= y)) +
geom_boxplot()+
labs(x = "campaign",
y = "y")
O <- ggplot(bank.ok, aes(x=previous, y= y)) +
geom_boxplot()+
labs(x = "previous",
y = "y")
P <- ggplot(bank.ok, aes(x=emp.var.rate, y= y)) +
geom_boxplot()+
labs(x = "emp.var.rate",
y = "y")
Q <- ggplot(bank.ok, aes(x=cons.price.idx, y= y)) +
geom_boxplot()+
labs(x = "cons.price.idx",
y = "y")
R <- ggplot(bank.ok, aes(x=cons.conf.idx, y= y)) +
geom_boxplot()+
labs(x = "cons.conf.idx",
y = "y")
S <- ggplot(bank.ok, aes(x=euribor3m, y= y)) +
geom_boxplot()+
labs(x = "euribor3m",
y = "y")
U <- ggplot(bank.ok, aes(x=nr.employed, y= y)) +
geom_boxplot()+
labs(x = "nr.employed",
y = "y")
KMOPQRSUDari output di atas terlihat bahwa peubah yang memiliki nilai median lebih besar pada pelanggan yang mau berlangganan deposito berjangka dibandingkan dengan yang tidak mau berlangganan deposito berjangka adalah pada peubah umur dan cons.conf.idx.Pada peubah previous dan campaign, mediannya sama. Sedangkan pada peubah lainnya nilai median pada pelanggan yang mau berlangganan deposito berjangka lebih kecil dibandingkan dengan yang tidak mau berlangganan deposito berjangka.
Uji Chi-Square
Dari eksplorasi data yang telah dilakukan, kandidat peubah bebas yang dapat digunakan dalam menyusun model regresi logistik untuk mengetahui faktor-faktor yang memengaruhi pelanggan untuk berlangganan deposito berjangka (y - has the client subscribed a term deposit? (binary: “yes”,“no”)) adalah age, job, marital,education, housing, loan, contact, month, day_of_week, campaign, previous, poutcome, emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m, dan nr.employed
y.age <- chisq.test(table(bank.ok$y, bank.ok$age))## Warning in chisq.test(table(bank.ok$y, bank.ok$age)): Chi-squared approximation
## may be incorrect
y.job <- chisq.test(table(bank.ok$y, bank.ok$job,exclude = 'unknown'))
y.marital <- chisq.test(table(bank.ok$y, bank.ok$marital,exclude = 'unknown'))
y.education <- chisq.test(table(bank.ok$y, bank.ok$education,exclude = c('unknown','illiterate')))
y.housing <- chisq.test(table(bank.ok$y, bank.ok$housing,exclude = 'unknown'))
y.loan <- chisq.test(table(bank.ok$y, bank.ok$loan,exclude = 'unknown'))
y.contact <- chisq.test(table(bank.ok$y, bank.ok$contact,exclude = 'unknown'))
y.month <- chisq.test(table(bank.ok$y, bank.ok$month))## Warning in chisq.test(table(bank.ok$y, bank.ok$month)): Chi-squared
## approximation may be incorrect
y.day_of_week <- chisq.test(table(bank.ok$y, bank.ok$day_of_week))
y.campaign <- chisq.test(table(bank.ok$y, bank.ok$campaign))## Warning in chisq.test(table(bank.ok$y, bank.ok$campaign)): Chi-squared
## approximation may be incorrect
y.previous <- chisq.test(table(bank.ok$y, bank.ok$previous))## Warning in chisq.test(table(bank.ok$y, bank.ok$previous)): Chi-squared
## approximation may be incorrect
y.poutcome <- chisq.test(table(bank.ok$y, bank.ok$poutcome))
y.emp.var.rate <- chisq.test(table(bank.ok$y, bank.ok$emp.var.rate))## Warning in chisq.test(table(bank.ok$y, bank.ok$emp.var.rate)): Chi-squared
## approximation may be incorrect
y.cons.price.idx <- chisq.test(table(bank.ok$y, bank.ok$cons.price.idx))## Warning in chisq.test(table(bank.ok$y, bank.ok$cons.price.idx)): Chi-squared
## approximation may be incorrect
y.cons.conf.idx <- chisq.test(table(bank.ok$y, bank.ok$cons.conf.idx))## Warning in chisq.test(table(bank.ok$y, bank.ok$cons.conf.idx)): Chi-squared
## approximation may be incorrect
y.euribor3m <- chisq.test(table(bank.ok$y, bank.ok$euribor3m))## Warning in chisq.test(table(bank.ok$y, bank.ok$euribor3m)): Chi-squared
## approximation may be incorrect
y.nr.employed <- chisq.test(table(bank.ok$y, bank.ok$nr.employed))## Warning in chisq.test(table(bank.ok$y, bank.ok$nr.employed)): Chi-squared
## approximation may be incorrect
pvalue <- round(c(y.age$p.value,y.job$p.value,y.marital$p.value,y.education$p.value,y.housing$p.value, y.loan$p.value,y.contact$p.value,y.month$p.value,y.day_of_week$p.value,y.campaign$p.value, y.previous$p.value,y.poutcome$p.value,y.emp.var.rate$p.value,y.cons.price.idx$p.value,y.cons.conf.idx$p.value,y.euribor3m$p.value,y.nr.employed$p.value),4)
labelpvalue <- c("y.age$p.value","y.job$p.value","y.marital$p.value","y.education$p.value","y.housing$p.value", "y.loan$p.value","y.contact$p.value","y.month$p.value","y.day_of_week$p.value","y.campaign$p.value", "y.previous$p.value","y.poutcome$p.value","y.emp.var.rate$p.value","y.cons.price.idx$p.value","y.cons.conf.idx$p.value","y.euribor3m$p.value","y.nr.employed$p.value")
pvalue.chisquare <- data.frame(labelpvalue,pvalue)
arrange(pvalue.chisquare, pvalue)## labelpvalue pvalue
## 1 y.age$p.value 0.0000
## 2 y.job$p.value 0.0000
## 3 y.contact$p.value 0.0000
## 4 y.month$p.value 0.0000
## 5 y.previous$p.value 0.0000
## 6 y.poutcome$p.value 0.0000
## 7 y.emp.var.rate$p.value 0.0000
## 8 y.cons.price.idx$p.value 0.0000
## 9 y.cons.conf.idx$p.value 0.0000
## 10 y.euribor3m$p.value 0.0000
## 11 y.nr.employed$p.value 0.0000
## 12 y.education$p.value 0.0644
## 13 y.marital$p.value 0.0722
## 14 y.campaign$p.value 0.4275
## 15 y.loan$p.value 0.5270
## 16 y.day_of_week$p.value 0.5550
## 17 y.housing$p.value 0.6313
Berdasarkan hasil Uji Chi_Square di atas, peubah yang terbukti memiliki asosiasi dengan peubah y (mau atau tidak berlangganan deposito berjangka) dengan pvalue < alpha=0.05 adalah peubah age , job, contact,month,previous,poutcome, emp.var.rate,cons.price.id, cons.conf.idx, euribor3m, dan nr.employed. Sedangkan peubah lainnya (education,marital,campaign,loan,day_of_week,housing) tidak signifikan (pvalue > alpha=0.05) berarti tidak ada cukup bukti untuk menyatakan ada asosiasi dengan peubah y (mau atau tidak berlangganan deposito berjangka) sehingga peubah tersebut tidak akan digunakan dalam menyusun model regresi logistik.
Menentukan Kategori Referensi Peubah Kategorik
Sebelum menyusun model regresi logistik, ditentukan dahulu kategori referensi untuk peubah kategorik sebagai berikut:
# Referensi
bank.ok$y <- relevel(bank.ok$y,ref="no")
bank.ok$job <- relevel(bank.ok$job,ref="unemployed")
bank.ok$contact <- relevel(bank.ok$contact,ref="cellular")
bank.ok$month <- relevel(bank.ok$month,ref="may")
bank.ok$poutcome <- relevel(bank.ok$poutcome,ref="success")Membagi Data sebelum Menyusun Model Regresi Logistik
Dalam rangka membentuk model dan melakukan pengujian model yang telah dibuat, dilakukan pengambilan sampel dari data bank.ok. Pengambilan sampel ini dibagi menjadi 2 bagian data sehingga akan diperoleh dua bagian data,yaitu data train sebanyak 80% untuk membentuk model, dan data test sebanyak 20% untuk menguji model yang telah dibuat.
set.seed(1501202071) #NIM saya tanpa G
sample <- sample(nrow(bank.ok),floor(nrow(bank.ok)*0.8))
training.bank.ok <- bank.ok[sample,]
dim(training.bank.ok)## [1] 2471 21
testing.bank.ok <- bank.ok[-sample,]
dim(testing.bank.ok)## [1] 618 21
Terdapat 2471 observasi yang menjadi training data dan 618 observasi yang menjadi testing data.
Eksplorasi Data Training
summary(training.bank.ok)## age job marital education
## Min. :20.00 admin. :680 divorced: 282 university.degree :867
## 1st Qu.:32.00 technician :454 married :1420 high.school :587
## Median :37.00 blue-collar:447 single : 769 professional.course:378
## Mean :39.23 services :220 unknown : 0 basic.9y :326
## 3rd Qu.:46.00 management :211 basic.4y :190
## Max. :88.00 retired :106 basic.6y :123
## (Other) :353 (Other) : 0
## default housing loan contact month
## no :2470 no :1114 no :2074 cellular :1689 may :789
## unknown: 0 unknown: 0 unknown: 0 telephone: 782 jul :418
## yes : 1 yes :1357 yes : 397 aug :384
## nov :317
## jun :288
## apr :131
## (Other):144
## day_of_week duration campaign pdays previous
## fri:461 Min. : 4.0 Min. : 1.000 Min. : 0 Min. :0.0000
## mon:523 1st Qu.: 105.0 1st Qu.: 1.000 1st Qu.:999 1st Qu.:0.0000
## thu:500 Median : 182.0 Median : 2.000 Median :999 Median :0.0000
## tue:481 Mean : 261.1 Mean : 2.536 Mean :954 Mean :0.2133
## wed:506 3rd Qu.: 320.0 3rd Qu.: 3.000 3rd Qu.:999 3rd Qu.:0.0000
## Max. :3643.0 Max. :29.000 Max. :999 Max. :6.0000
##
## poutcome emp.var.rate cons.price.idx cons.conf.idx
## success : 101 Min. :-3.40000 Min. :92.20 Min. :-50.80
## failure : 295 1st Qu.:-1.80000 1st Qu.:93.10 1st Qu.:-42.70
## nonexistent:2075 Median : 1.10000 Median :93.40 Median :-41.80
## Mean :-0.03626 Mean :93.53 Mean :-40.68
## 3rd Qu.: 1.40000 3rd Qu.:94.00 3rd Qu.:-36.40
## Max. : 1.40000 Max. :94.80 Max. :-26.90
##
## euribor3m nr.employed y
## Min. :0.600 Min. :4964 no :2163
## 1st Qu.:1.300 1st Qu.:5099 yes: 308
## Median :4.900 Median :5191
## Mean :3.515 Mean :5162
## 3rd Qu.:5.000 3rd Qu.:5228
## Max. :5.000 Max. :5228
##
Dari output di atas, terlihat bahwa terdapat 308 observasi yang masuk sebagai kategori pelanggan yang mau berlangganan deposito berjangka sedangkan yang tidak mau berlangganan ada sebanyak 2163. Persentase pelanggan yang terkategori sebagai yes (mau berlangganan deposito berjangka) hanya sekitar 12.46% sedangkan sisanya terkategori sebagai no (tidak mau berlangganan deposito berjangka). Hal ini menunjukkan bahwa data tersebut tidak seimbang sehingga selanjutnya akan dicobakan juga melakukan proses resampling dengan menggunakan undersampling dan oversampling. Hasil model regresi logistik dengan menggunakan teknik resampling akan dibandingkan dengan tanpa menggunakan teknik resampling dengan melihat nilai AIC dan akurasinya.
round((308/(2163+308))*100,2)## [1] 12.46
Resampling
Ketidakseimbangan data terjadi pada saat suatu kelas atau kategori tertentu memiliki data yang lebih banyak dibandingkan dengan kategori lainnya (Jian et al., 2016). Ketidakseimbangan data ini perlu ditangani karena memengaruhi akurasi pada proses klasifikasi data (Thanathamathee et al., 2013). Teknik resampling merupakan salah satu solusi untuk menangani data yang tidak seimbang. Terdapat proses random undersampling dan oversampling. Random undersampling merupakan proses sampling yang dilakukan dengan mengeliminasi sebagian data pada kelas mayoritas secara acak (Prusa et al, 2015). Random oversampling merupakan proses sampling yang dilakukan dengan menambahkan jumlah data pada kelas minoritas secara acak.
Random Under Sampling
Teknik resampling diterapkan pada data training (data latih) untuk pembentukan model regresi regresi logistik biner. Proses undersampling dilakukan dengan mengeliminasi secara acak kategori mayoritas, yaitu kategori no (tidak mau berlangganan deposito berjangka) menjadi dari total 2163 menjadi 308 (sama dengan jumlah pada kategori yes, yaitu mau berlangganan deposito berjangka). Total dataset baru hasil undersampling sebanyak 616.
#under sampling
set.seed(1501202071) #NIM saya tanpa G
data_balanced_under <- ovun.sample(y ~ ., data = training.bank.ok, method = "under",N = 616)$data
table(data_balanced_under$y) #note: N adalah banyak data totalnya setelah undersampling (308+308=616)##
## no yes
## 308 308
Random Over Sampling
Teknik resampling diterapkan pada data training (data latih) untuk pembentukan model regresi regresi logistik biner. Proses oversampling dilakukan dengan menaikkan secara acak kategori minoritas, yaitu kategori yes, yaitu mau berlangganan deposito berjangka) dari total 308 menjadi 2163 (sama dengan jumlah pada kategori no (tidak mau berlangganan deposito berjangka) . Total dataset baru hasil oversampling sebanyak 4326.
#over sampling
set.seed(1501202071) #NIM saya tanpa G
data_balanced_over <- ovun.sample(y ~ ., data = training.bank.ok, method = "over",N = 4326)$data
table(data_balanced_over$y) #note: N adalah banyak data totalnya setelah oversampling (2163+2163=4326)##
## no yes
## 2163 2163
Regresi Logistik Biner
Berdasarkan hasil Uji Chi_Square yang telah dilakukan sebelumnya, peubah yang terbukti memiliki asosiasi dengan peubah y (mau atau tidak berlangganan deposito berjangka) pada taraf alpha 5% adalah peubah age, job, contact,month,previous,poutcome, emp.var.rate,cons.price.idx, cons.conf.idx, euribor3m, dan nr.employed. 11 peubah ini yang akan digunakan dalam penyusunan model regresi logistik biner.
Tanpa Teknik Resampling
reglog <- glm(y~
age+job+contact+month+previous+poutcome+emp.var.rate+cons.price.idx+cons.conf.idx+ euribor3m+nr.employed,data=training.bank.ok, family=binomial("link"=logit))
summary(reglog)##
## Call:
## glm(formula = y ~ age + job + contact + month + previous + poutcome +
## emp.var.rate + cons.price.idx + cons.conf.idx + euribor3m +
## nr.employed, family = binomial(link = logit), data = training.bank.ok)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0007 -0.4041 -0.3294 -0.2631 2.8261
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.372e+02 1.259e+02 -1.090 0.27573
## age 9.085e-03 7.659e-03 1.186 0.23557
## jobadmin. -9.806e-02 4.075e-01 -0.241 0.80981
## jobblue-collar -3.145e-01 4.293e-01 -0.733 0.46373
## jobentrepreneur -9.450e-01 6.593e-01 -1.433 0.15174
## jobhousemaid -3.566e-01 6.090e-01 -0.585 0.55822
## jobmanagement -6.663e-01 4.764e-01 -1.399 0.16190
## jobretired -3.031e-01 5.091e-01 -0.595 0.55166
## jobself-employed -4.674e-01 5.340e-01 -0.875 0.38149
## jobservices -5.216e-01 4.701e-01 -1.110 0.26713
## jobstudent -9.756e-03 5.704e-01 -0.017 0.98635
## jobtechnician -6.607e-03 4.210e-01 -0.016 0.98748
## contacttelephone -1.198e+00 2.874e-01 -4.170 3.05e-05 ***
## monthapr 2.654e-01 3.030e-01 0.876 0.38104
## monthaug 3.186e-01 3.393e-01 0.939 0.34775
## monthdec 5.747e-01 6.342e-01 0.906 0.36485
## monthjul -8.026e-02 3.338e-01 -0.240 0.80999
## monthjun 1.751e-01 4.892e-01 0.358 0.72039
## monthmar 2.006e+00 4.373e-01 4.586 4.52e-06 ***
## monthnov -2.956e-01 3.458e-01 -0.855 0.39273
## monthoct 3.100e-01 4.658e-01 0.665 0.50575
## monthsep 1.671e-01 5.294e-01 0.316 0.75230
## previous 1.764e-01 1.846e-01 0.956 0.33924
## poutcomefailure -1.728e+00 2.998e-01 -5.765 8.17e-09 ***
## poutcomenonexistent -1.100e+00 3.845e-01 -2.860 0.00423 **
## emp.var.rate -1.160e+00 4.590e-01 -2.528 0.01148 *
## cons.price.idx 1.497e+00 8.222e-01 1.820 0.06873 .
## cons.conf.idx 2.620e-02 2.766e-02 0.947 0.34350
## euribor3m 4.026e-01 4.395e-01 0.916 0.35971
## nr.employed -8.222e-04 1.039e-02 -0.079 0.93692
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1858.6 on 2470 degrees of freedom
## Residual deviance: 1427.1 on 2441 degrees of freedom
## AIC: 1487.1
##
## Number of Fisher Scoring iterations: 6
reglog.step <- step(reglog)## Start: AIC=1487.07
## y ~ age + job + contact + month + previous + poutcome + emp.var.rate +
## cons.price.idx + cons.conf.idx + euribor3m + nr.employed
##
## Df Deviance AIC
## - job 10 1437.1 1477.1
## - nr.employed 1 1427.1 1485.1
## - euribor3m 1 1427.9 1485.9
## - cons.conf.idx 1 1428.0 1486.0
## - previous 1 1428.0 1486.0
## - age 1 1428.5 1486.5
## <none> 1427.1 1487.1
## - cons.price.idx 1 1430.3 1488.3
## - emp.var.rate 1 1433.3 1491.3
## - month 9 1459.0 1501.0
## - contact 1 1447.5 1505.5
## - poutcome 2 1462.8 1518.8
##
## Step: AIC=1477.09
## y ~ age + contact + month + previous + poutcome + emp.var.rate +
## cons.price.idx + cons.conf.idx + euribor3m + nr.employed
##
## Df Deviance AIC
## - nr.employed 1 1437.2 1475.2
## - previous 1 1437.7 1475.7
## - cons.conf.idx 1 1437.8 1475.8
## - age 1 1438.0 1476.0
## - euribor3m 1 1438.2 1476.2
## <none> 1437.1 1477.1
## - cons.price.idx 1 1439.7 1477.7
## - emp.var.rate 1 1442.6 1480.6
## - month 9 1472.0 1494.0
## - contact 1 1456.4 1494.4
## - poutcome 2 1473.6 1509.6
##
## Step: AIC=1475.19
## y ~ age + contact + month + previous + poutcome + emp.var.rate +
## cons.price.idx + cons.conf.idx + euribor3m
##
## Df Deviance AIC
## - previous 1 1437.8 1473.8
## - age 1 1438.1 1474.1
## - euribor3m 1 1438.8 1474.8
## <none> 1437.2 1475.2
## - cons.conf.idx 1 1439.3 1475.3
## - emp.var.rate 1 1447.1 1483.1
## - month 9 1472.4 1492.4
## - cons.price.idx 1 1458.0 1494.0
## - contact 1 1458.3 1494.3
## - poutcome 2 1473.9 1507.9
##
## Step: AIC=1473.81
## y ~ age + contact + month + poutcome + emp.var.rate + cons.price.idx +
## cons.conf.idx + euribor3m
##
## Df Deviance AIC
## - age 1 1438.8 1472.8
## - euribor3m 1 1439.4 1473.4
## <none> 1437.8 1473.8
## - cons.conf.idx 1 1440.1 1474.1
## - emp.var.rate 1 1447.9 1481.9
## - month 9 1473.7 1491.7
## - contact 1 1459.7 1493.7
## - cons.price.idx 1 1460.3 1494.3
## - poutcome 2 1477.3 1509.3
##
## Step: AIC=1472.76
## y ~ contact + month + poutcome + emp.var.rate + cons.price.idx +
## cons.conf.idx + euribor3m
##
## Df Deviance AIC
## - euribor3m 1 1440.4 1472.4
## <none> 1438.8 1472.8
## - cons.conf.idx 1 1441.1 1473.1
## - emp.var.rate 1 1449.0 1481.0
## - month 9 1475.2 1491.2
## - contact 1 1460.4 1492.4
## - cons.price.idx 1 1461.8 1493.8
## - poutcome 2 1478.2 1508.2
##
## Step: AIC=1472.42
## y ~ contact + month + poutcome + emp.var.rate + cons.price.idx +
## cons.conf.idx
##
## Df Deviance AIC
## <none> 1440.4 1472.4
## - cons.conf.idx 1 1445.1 1475.1
## - month 9 1475.2 1489.2
## - contact 1 1460.5 1490.5
## - poutcome 2 1479.6 1507.6
## - cons.price.idx 1 1488.7 1518.7
## - emp.var.rate 1 1531.9 1561.9
reglog.step$anova## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 NA NA 2441 1427.071 1487.071
## 2 - job 10 10.0177174 2451 1437.088 1477.088
## 3 - nr.employed 1 0.1000574 2452 1437.188 1475.188
## 4 - previous 1 0.6180707 2453 1437.806 1473.806
## 5 - age 1 0.9518552 2454 1438.758 1472.758
## 6 - euribor3m 1 1.6589185 2455 1440.417 1472.417
reglog.step.ok <- glm(y ~ contact + month + poutcome + emp.var.rate + cons.price.idx + cons.conf.idx,data=training.bank.ok,family=binomial("link"=logit))
summary(reglog.step.ok)##
## Call:
## glm(formula = y ~ contact + month + poutcome + emp.var.rate +
## cons.price.idx + cons.conf.idx, family = binomial(link = logit),
## data = training.bank.ok)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8666 -0.4206 -0.3403 -0.2539 2.8115
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -117.36247 16.54358 -7.094 1.30e-12 ***
## contacttelephone -1.11282 0.26529 -4.195 2.73e-05 ***
## monthapr 0.30784 0.27726 1.110 0.2669
## monthaug 0.38135 0.31261 1.220 0.2225
## monthdec 0.65817 0.60500 1.088 0.2766
## monthjul 0.08672 0.30149 0.288 0.7736
## monthjun 0.46666 0.26296 1.775 0.0760 .
## monthmar 2.11117 0.38030 5.551 2.84e-08 ***
## monthnov -0.01703 0.26452 -0.064 0.9487
## monthoct 0.55009 0.39355 1.398 0.1622
## monthsep 0.24669 0.41944 0.588 0.5564
## poutcomefailure -1.76207 0.29236 -6.027 1.67e-09 ***
## poutcomenonexistent -1.35415 0.26578 -5.095 3.49e-07 ***
## emp.var.rate -0.73076 0.07400 -9.875 < 2e-16 ***
## cons.price.idx 1.26267 0.17986 7.020 2.21e-12 ***
## cons.conf.idx 0.04056 0.01871 2.168 0.0301 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1858.6 on 2470 degrees of freedom
## Residual deviance: 1440.4 on 2455 degrees of freedom
## AIC: 1472.4
##
## Number of Fisher Scoring iterations: 5
Interpretasi
# odds = exponensial parameter model
interpretasi <- data.frame(koefisien=summary(reglog.step.ok)$coefficients[,c(4,1)],exp.koefisien=exp(summary(reglog.step.ok)$coefficients[,1]))
InterpretasiPeubahNyata <- interpretasi %>% filter(koefisien.Pr...z..< 0.05)
InterpretasiPeubahNyata## koefisien.Pr...z.. koefisien.Estimate exp.koefisien
## (Intercept) 1.301571e-12 -117.36246845 1.071834e-51
## contacttelephone 2.731397e-05 -1.11281822 3.286315e-01
## monthmar 2.835945e-08 2.11116983 8.257896e+00
## poutcomefailure 1.669190e-09 -1.76207327 1.716885e-01
## poutcomenonexistent 3.487552e-07 -1.35414929 2.581668e-01
## emp.var.rate 5.365213e-23 -0.73075892 4.815434e-01
## cons.price.idx 2.214118e-12 1.26267142 3.534852e+00
## cons.conf.idx 3.014686e-02 0.04055963 1.041393e+00
Evalusi Model
TANPA STEPWISE:
testing.bank.ok$ypred.ns <- predict(reglog, testing.bank.ok, type="response")
testing.bank.ok$ypred.ns <- ifelse(testing.bank.ok$ypred.ns > 0.50, "yes", "no")
testing.bank.ok$ypred.ns <- as.factor(testing.bank.ok$ypred.ns)
(conf.mat<-caret::confusionMatrix(testing.bank.ok$ypred.ns, testing.bank.ok$y, positive="yes"))## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 544 47
## yes 12 15
##
## Accuracy : 0.9045
## 95% CI : (0.8786, 0.9265)
## No Information Rate : 0.8997
## P-Value [Acc > NIR] : 0.375
##
## Kappa : 0.2941
##
## Mcnemar's Test P-Value : 9.581e-06
##
## Sensitivity : 0.24194
## Specificity : 0.97842
## Pos Pred Value : 0.55556
## Neg Pred Value : 0.92047
## Prevalence : 0.10032
## Detection Rate : 0.02427
## Detection Prevalence : 0.04369
## Balanced Accuracy : 0.61018
##
## 'Positive' Class : yes
##
Menggunakan data testing.
Confusion Matrix
Dalam confusion matrix, sensitivitas (atau True Positive Rate) adalah persentase pengamatan (aktual) yang diprediksi dengan benar oleh model, sedangkan spesifisitas adalah persentase dari 0 (aktual) yang diprediksi dengan benar.
Hasil confusion matrix menunjukkan bahwa model regresi logistik yang diperoleh memiliki persentase pengamatan yes (aktual) yang diprediksi dengan benar oleh model sebesar 27.419 persen sedangkan persentase dari no (aktual) yang diprediksi dengan benar oleh model adalah sebesar 98.381 persen dengan accuracy sebesar 91.26 persen.Jika dilihat dari nilai akurasinya, model ini dinilai baik.
testing.bank.ok$ypred <- predict(reglog.step.ok, testing.bank.ok, type="response")
testing.bank.ok$ypred <- ifelse(testing.bank.ok$ypred > 0.50, "yes", "no")
testing.bank.ok$ypred <- as.factor(testing.bank.ok$ypred)
(conf.mat<-caret::confusionMatrix(testing.bank.ok$ypred, testing.bank.ok$y, positive="yes"))## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 547 45
## yes 9 17
##
## Accuracy : 0.9126
## 95% CI : (0.8875, 0.9337)
## No Information Rate : 0.8997
## P-Value [Acc > NIR] : 0.1575
##
## Kappa : 0.3477
##
## Mcnemar's Test P-Value : 1.908e-06
##
## Sensitivity : 0.27419
## Specificity : 0.98381
## Pos Pred Value : 0.65385
## Neg Pred Value : 0.92399
## Prevalence : 0.10032
## Detection Rate : 0.02751
## Detection Prevalence : 0.04207
## Balanced Accuracy : 0.62900
##
## 'Positive' Class : yes
##
broom::tidy(conf.mat)## # A tibble: 14 x 6
## term class estimate conf.low conf.high p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 accuracy <NA> 0.913 0.888 0.934 0.157
## 2 kappa <NA> 0.348 NA NA NA
## 3 mcnemar <NA> NA NA NA 0.00000191
## 4 sensitivity yes 0.274 NA NA NA
## 5 specificity yes 0.984 NA NA NA
## 6 pos_pred_value yes 0.654 NA NA NA
## 7 neg_pred_value yes 0.924 NA NA NA
## 8 precision yes 0.654 NA NA NA
## 9 recall yes 0.274 NA NA NA
## 10 f1 yes 0.386 NA NA NA
## 11 prevalence yes 0.100 NA NA NA
## 12 detection_rate yes 0.0275 NA NA NA
## 13 detection_prevalence yes 0.0421 NA NA NA
## 14 balanced_accuracy yes 0.629 NA NA NA
Receiver Operating Characteristics (ROC) Curve
Receiver Operating Characteristics (ROC) curve melacak persentase true positive saat cut-off peluang prediksi diturunkan dari 1 menjadi 0. Model yang baik akan memperlihat kurva yang lebih curam, artinya True Positive Rate meningkat lebih cepat dibandingkan dengan False Positive Rate ketika cut-off menurun. Dengan kata lain, semakin besar luas area di bawah kurva ROC maka kemampuan prediksi yang dihasilkan oleh model semakin baik.
Dari model regresi logistik yang diperoleh, persentase true positive saat cut-off peluang prediksi diturunkan dari yes menjadi no sebesar 53.3 persen. Besarnya luas area di bawah kurva ROC ini mengindikasikan bahwa model yang diperoleh memiliki kemampuan prediksi yang cukup baik.
plotROC(testing.bank.ok=="yes", testing.bank.ok$ypred=="yes")Dengan Teknik Resampling
Random Under Sampling
reglog.under <- glm(y~
age+job+contact+month+previous+poutcome+emp.var.rate+cons.price.idx+cons.conf.idx+ euribor3m+nr.employed,data=data_balanced_under, family=binomial("link"=logit))
summary(reglog.under)##
## Call:
## glm(formula = y ~ age + job + contact + month + previous + poutcome +
## emp.var.rate + cons.price.idx + cons.conf.idx + euribor3m +
## nr.employed, family = binomial(link = logit), data = data_balanced_under)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4255 -0.8062 -0.1450 0.6923 2.0804
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.310e+02 1.758e+02 -1.314 0.18879
## age 1.631e-02 1.183e-02 1.379 0.16795
## jobadmin. -4.272e-01 7.658e-01 -0.558 0.57696
## jobblue-collar -7.362e-01 7.831e-01 -0.940 0.34716
## jobentrepreneur -1.270e+00 9.755e-01 -1.302 0.19291
## jobhousemaid -4.584e-01 1.061e+00 -0.432 0.66569
## jobmanagement -1.415e+00 8.331e-01 -1.699 0.08940 .
## jobretired -6.595e-01 9.265e-01 -0.712 0.47656
## jobself-employed -6.312e-01 9.271e-01 -0.681 0.49595
## jobservices -1.028e+00 8.263e-01 -1.244 0.21365
## jobstudent -1.738e-01 1.157e+00 -0.150 0.88056
## jobtechnician -2.678e-01 7.802e-01 -0.343 0.73139
## contacttelephone -1.049e+00 4.278e-01 -2.453 0.01418 *
## monthapr 4.079e-01 4.741e-01 0.860 0.38959
## monthaug 5.649e-01 5.866e-01 0.963 0.33559
## monthdec 1.003e+00 1.196e+00 0.838 0.40177
## monthjul 2.133e-01 4.637e-01 0.460 0.64560
## monthjun -3.752e-01 7.113e-01 -0.528 0.59783
## monthmar 2.996e+00 1.129e+00 2.653 0.00799 **
## monthnov -1.515e-01 4.795e-01 -0.316 0.75206
## monthoct 1.240e+00 8.955e-01 1.385 0.16618
## monthsep 4.753e-01 8.897e-01 0.534 0.59323
## previous 7.301e-01 5.101e-01 1.431 0.15232
## poutcomefailure -1.997e+00 6.857e-01 -2.912 0.00359 **
## poutcomenonexistent -7.967e-01 9.001e-01 -0.885 0.37612
## emp.var.rate -1.457e+00 6.973e-01 -2.089 0.03670 *
## cons.price.idx 2.061e+00 1.172e+00 1.759 0.07854 .
## cons.conf.idx 1.687e-02 5.329e-02 0.317 0.75151
## euribor3m 1.715e-01 6.905e-01 0.248 0.80382
## nr.employed 7.512e-03 1.456e-02 0.516 0.60580
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 853.96 on 615 degrees of freedom
## Residual deviance: 607.86 on 586 degrees of freedom
## AIC: 667.86
##
## Number of Fisher Scoring iterations: 6
reglog.step.under <- step(reglog.under)## Start: AIC=667.86
## y ~ age + job + contact + month + previous + poutcome + emp.var.rate +
## cons.price.idx + cons.conf.idx + euribor3m + nr.employed
##
## Df Deviance AIC
## - job 10 619.84 659.84
## - euribor3m 1 607.92 665.92
## - cons.conf.idx 1 607.96 665.96
## - nr.employed 1 608.12 666.12
## - age 1 609.77 667.77
## <none> 607.86 667.86
## - previous 1 610.61 668.61
## - month 9 626.64 668.64
## - cons.price.idx 1 611.00 669.00
## - emp.var.rate 1 612.31 670.31
## - contact 1 614.26 672.26
## - poutcome 2 621.11 677.11
##
## Step: AIC=659.84
## y ~ age + contact + month + previous + poutcome + emp.var.rate +
## cons.price.idx + cons.conf.idx + euribor3m + nr.employed
##
## Df Deviance AIC
## - cons.conf.idx 1 619.85 657.85
## - nr.employed 1 619.86 657.86
## - euribor3m 1 620.12 658.12
## - age 1 621.40 659.40
## - previous 1 621.66 659.66
## <none> 619.84 659.84
## - cons.price.idx 1 622.07 660.07
## - emp.var.rate 1 623.62 661.62
## - contact 1 625.88 663.88
## - month 9 642.41 664.41
## - poutcome 2 633.85 669.85
##
## Step: AIC=657.85
## y ~ age + contact + month + previous + poutcome + emp.var.rate +
## cons.price.idx + euribor3m + nr.employed
##
## Df Deviance AIC
## - nr.employed 1 619.86 655.86
## - euribor3m 1 620.65 656.65
## - age 1 621.43 657.43
## - previous 1 621.67 657.67
## <none> 619.85 657.85
## - cons.price.idx 1 622.53 658.53
## - emp.var.rate 1 623.63 659.63
## - month 9 642.41 662.41
## - contact 1 626.68 662.68
## - poutcome 2 634.04 668.04
##
## Step: AIC=655.86
## y ~ age + contact + month + previous + poutcome + emp.var.rate +
## cons.price.idx + euribor3m
##
## Df Deviance AIC
## - euribor3m 1 621.04 655.04
## - age 1 621.45 655.45
## - previous 1 621.70 655.70
## <none> 619.86 655.86
## - emp.var.rate 1 626.23 660.23
## - contact 1 626.71 660.71
## - month 9 644.94 662.94
## - cons.price.idx 1 629.05 663.05
## - poutcome 2 634.05 666.05
##
## Step: AIC=655.04
## y ~ age + contact + month + previous + poutcome + emp.var.rate +
## cons.price.idx
##
## Df Deviance AIC
## - age 1 622.76 654.76
## - previous 1 623.01 655.01
## <none> 621.04 655.04
## - contact 1 626.73 658.73
## - month 9 645.58 661.58
## - cons.price.idx 1 633.75 665.75
## - poutcome 2 635.91 665.91
## - emp.var.rate 1 663.20 695.20
##
## Step: AIC=654.76
## y ~ contact + month + previous + poutcome + emp.var.rate + cons.price.idx
##
## Df Deviance AIC
## <none> 622.76 654.76
## - previous 1 624.77 654.77
## - contact 1 628.23 658.23
## - month 9 647.48 661.48
## - poutcome 2 637.27 665.27
## - cons.price.idx 1 635.54 665.54
## - emp.var.rate 1 666.38 696.38
reglog.step.under$anova## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 NA NA 586 607.8592 667.8592
## 2 - job 10 11.984416544 596 619.8436 659.8436
## 3 - cons.conf.idx 1 0.005974104 597 619.8496 657.8496
## 4 - nr.employed 1 0.014236906 598 619.8638 655.8638
## 5 - euribor3m 1 1.173879853 599 621.0377 655.0377
## 6 - age 1 1.723507083 600 622.7612 654.7612
reglog.step.ok.under <- glm(y ~ contact + month + poutcome + emp.var.rate + cons.price.idx + euribor3m,data=data_balanced_under,family=binomial("link"=logit))
summary(reglog.step.ok.under)##
## Call:
## glm(formula = y ~ contact + month + poutcome + emp.var.rate +
## cons.price.idx + euribor3m, family = binomial(link = logit),
## data = data_balanced_under)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6416 -0.7606 -0.1931 0.7535 1.9974
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -157.8334 51.2712 -3.078 0.00208 **
## contacttelephone -0.9682 0.3783 -2.559 0.01049 *
## monthapr 0.5522 0.4249 1.300 0.19367
## monthaug 0.8139 0.3887 2.094 0.03626 *
## monthdec 1.0268 1.1382 0.902 0.36697
## monthjul 0.3726 0.4300 0.866 0.38628
## monthjun -0.1108 0.4715 -0.235 0.81417
## monthmar 2.8913 1.0514 2.750 0.00596 **
## monthnov -0.2272 0.4497 -0.505 0.61335
## monthoct 1.3870 0.8765 1.582 0.11355
## monthsep 0.5270 0.7265 0.726 0.46814
## poutcomefailure -2.1388 0.6679 -3.202 0.00136 **
## poutcomenonexistent -1.8805 0.6416 -2.931 0.00338 **
## emp.var.rate -1.3746 0.5136 -2.677 0.00744 **
## cons.price.idx 1.6853 0.5378 3.134 0.00173 **
## euribor3m 0.4639 0.3871 1.198 0.23074
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 853.96 on 615 degrees of freedom
## Residual deviance: 623.32 on 600 degrees of freedom
## AIC: 655.32
##
## Number of Fisher Scoring iterations: 5
Interpretasi
# odds = exponensial parameter model
interpretasi.under <- data.frame(koefisien=summary(reglog.step.ok.under)$coefficients[,c(4,1)],exp.koefisien=exp(summary(reglog.step.ok.under)$coefficients[,1]))
InterpretasiPeubahNyata.under <- interpretasi %>% filter(koefisien.Pr...z..< 0.05)
InterpretasiPeubahNyata.under## koefisien.Pr...z.. koefisien.Estimate exp.koefisien
## (Intercept) 1.301571e-12 -117.36246845 1.071834e-51
## contacttelephone 2.731397e-05 -1.11281822 3.286315e-01
## monthmar 2.835945e-08 2.11116983 8.257896e+00
## poutcomefailure 1.669190e-09 -1.76207327 1.716885e-01
## poutcomenonexistent 3.487552e-07 -1.35414929 2.581668e-01
## emp.var.rate 5.365213e-23 -0.73075892 4.815434e-01
## cons.price.idx 2.214118e-12 1.26267142 3.534852e+00
## cons.conf.idx 3.014686e-02 0.04055963 1.041393e+00
Evalusi Model
Menggunakan data testing.
TANPA STEPWISE:
testing.bank.ok$ypred.ns.under <- predict(reglog.under, testing.bank.ok, type="response")
testing.bank.ok$ypred.ns.under <- ifelse(testing.bank.ok$ypred.ns.under > 0.50, "yes", "no")
testing.bank.ok$ypred.ns.under <- as.factor(testing.bank.ok$ypred.ns.under)
(conf.mat<-caret::confusionMatrix(testing.bank.ok$ypred.ns.under, testing.bank.ok$y, positive="yes"))## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 446 17
## yes 110 45
##
## Accuracy : 0.7945
## 95% CI : (0.7605, 0.8257)
## No Information Rate : 0.8997
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3168
##
## Mcnemar's Test P-Value : 3.25e-16
##
## Sensitivity : 0.72581
## Specificity : 0.80216
## Pos Pred Value : 0.29032
## Neg Pred Value : 0.96328
## Prevalence : 0.10032
## Detection Rate : 0.07282
## Detection Prevalence : 0.25081
## Balanced Accuracy : 0.76398
##
## 'Positive' Class : yes
##
Confusion Matrix
Dalam confusion matrix, sensitivitas (atau True Positive Rate) adalah persentase pengamatan (aktual) yang diprediksi dengan benar oleh model, sedangkan spesifisitas adalah persentase dari 0 (aktual) yang diprediksi dengan benar.
Hasil confusion matrix menunjukkan bahwa model regresi logistik yang diperoleh memiliki persentase pengamatan yes (aktual) yang diprediksi dengan benar oleh model sebesar 70.97 persen sedangkan persentase dari no (aktual) yang diprediksi dengan benar oleh model adalah sebesar 83.99 persen dengan accuracy sebesar 82.69 persen.Jika dilihat dari nilai akurasinya, model ini dinilai baik.
testing.bank.ok$ypred.under <- predict(reglog.step.ok.under, testing.bank.ok, type="response")
testing.bank.ok$ypred.under <- ifelse(testing.bank.ok$ypred.under > 0.50, "yes", "no")
testing.bank.ok$ypred.under <- as.factor(testing.bank.ok$ypred.under)
(conf.mat<-caret::confusionMatrix(testing.bank.ok$ypred.under, testing.bank.ok$y, positive="yes"))## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 467 18
## yes 89 44
##
## Accuracy : 0.8269
## 95% CI : (0.7947, 0.8559)
## No Information Rate : 0.8997
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3643
##
## Mcnemar's Test P-Value : 1.313e-11
##
## Sensitivity : 0.7097
## Specificity : 0.8399
## Pos Pred Value : 0.3308
## Neg Pred Value : 0.9629
## Prevalence : 0.1003
## Detection Rate : 0.0712
## Detection Prevalence : 0.2152
## Balanced Accuracy : 0.7748
##
## 'Positive' Class : yes
##
broom::tidy(conf.mat)## # A tibble: 14 x 6
## term class estimate conf.low conf.high p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 accuracy <NA> 0.827 0.795 0.856 1.00e+ 0
## 2 kappa <NA> 0.364 NA NA NA
## 3 mcnemar <NA> NA NA NA 1.31e-11
## 4 sensitivity yes 0.710 NA NA NA
## 5 specificity yes 0.840 NA NA NA
## 6 pos_pred_value yes 0.331 NA NA NA
## 7 neg_pred_value yes 0.963 NA NA NA
## 8 precision yes 0.331 NA NA NA
## 9 recall yes 0.710 NA NA NA
## 10 f1 yes 0.451 NA NA NA
## 11 prevalence yes 0.100 NA NA NA
## 12 detection_rate yes 0.0712 NA NA NA
## 13 detection_prevalence yes 0.215 NA NA NA
## 14 balanced_accuracy yes 0.775 NA NA NA
Receiver Operating Characteristics (ROC) Curve
Receiver Operating Characteristics (ROC) curve melacak persentase true positive saat cut-off peluang prediksi diturunkan dari 1 menjadi 0. Model yang baik akan memperlihat kurva yang lebih curam, artinya True Positive Rate meningkat lebih cepat dibandingkan dengan False Positive Rate ketika cut-off menurun. Dengan kata lain, semakin besar luas area di bawah kurva ROC maka kemampuan prediksi yang dihasilkan oleh model semakin baik.
Dari model regresi logistik yang diperoleh, persentase true positive saat cut-off peluang prediksi diturunkan dari yes menjadi no sebesar 57.80 persen. Besarnya luas area di bawah kurva ROC ini mengindikasikan bahwa model yang diperoleh memiliki kemampuan prediksi yang cukup baik.
plotROC(testing.bank.ok=="yes", testing.bank.ok$ypred.under=="yes")Random Over Sampling
reglog.over <- glm(y~
age+job+contact+month+previous+poutcome+emp.var.rate+cons.price.idx+cons.conf.idx+ euribor3m+nr.employed,data=data_balanced_over, family=binomial("link"=logit))
summary(reglog.over)##
## Call:
## glm(formula = y ~ age + job + contact + month + previous + poutcome +
## emp.var.rate + cons.price.idx + cons.conf.idx + euribor3m +
## nr.employed, family = binomial(link = logit), data = data_balanced_over)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.8440 -0.1087 0.7627 2.0713
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.529e+02 6.801e+01 -2.248 0.024574 *
## age 8.868e-03 4.119e-03 2.153 0.031326 *
## jobadmin. 5.182e-01 2.507e-01 2.067 0.038743 *
## jobblue-collar 2.755e-01 2.588e-01 1.065 0.287029
## jobentrepreneur -4.978e-01 3.557e-01 -1.400 0.161638
## jobhousemaid 1.761e-01 3.460e-01 0.509 0.610739
## jobmanagement 2.869e-02 2.757e-01 0.104 0.917106
## jobretired 5.404e-01 3.052e-01 1.771 0.076632 .
## jobself-employed -1.015e-01 3.109e-01 -0.326 0.744131
## jobservices -5.972e-02 2.748e-01 -0.217 0.827935
## jobstudent 2.832e-01 3.509e-01 0.807 0.419650
## jobtechnician 6.130e-01 2.569e-01 2.386 0.017029 *
## contacttelephone -1.115e+00 1.569e-01 -7.104 1.21e-12 ***
## monthapr 2.093e-01 1.652e-01 1.267 0.205276
## monthaug 6.892e-01 2.010e-01 3.429 0.000607 ***
## monthdec 1.106e+00 4.144e-01 2.669 0.007615 **
## monthjul 1.641e-01 1.691e-01 0.971 0.331656
## monthjun 1.413e-01 2.758e-01 0.512 0.608407
## monthmar 2.145e+00 3.090e-01 6.940 3.93e-12 ***
## monthnov -1.903e-01 1.763e-01 -1.080 0.280312
## monthoct 5.461e-01 2.804e-01 1.948 0.051471 .
## monthsep 4.344e-01 3.141e-01 1.383 0.166692
## previous 2.378e-01 1.318e-01 1.804 0.071289 .
## poutcomefailure -1.542e+00 2.181e-01 -7.070 1.55e-12 ***
## poutcomenonexistent -1.015e+00 2.774e-01 -3.659 0.000253 ***
## emp.var.rate -1.492e+00 2.523e-01 -5.914 3.35e-09 ***
## cons.price.idx 1.658e+00 4.464e-01 3.714 0.000204 ***
## cons.conf.idx 7.462e-03 1.813e-02 0.411 0.680718
## euribor3m 6.513e-01 2.515e-01 2.590 0.009600 **
## nr.employed -8.226e-04 5.658e-03 -0.145 0.884391
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5997.1 on 4325 degrees of freedom
## Residual deviance: 4502.4 on 4296 degrees of freedom
## AIC: 4562.4
##
## Number of Fisher Scoring iterations: 5
reglog.step.over <- step(reglog.over)## Start: AIC=4562.43
## y ~ age + job + contact + month + previous + poutcome + emp.var.rate +
## cons.price.idx + cons.conf.idx + euribor3m + nr.employed
##
## Df Deviance AIC
## - nr.employed 1 4502.4 4560.4
## - cons.conf.idx 1 4502.6 4560.6
## <none> 4502.4 4562.4
## - previous 1 4505.9 4563.9
## - age 1 4507.1 4565.1
## - euribor3m 1 4509.2 4567.2
## - cons.price.idx 1 4516.2 4574.2
## - job 10 4554.5 4594.5
## - emp.var.rate 1 4537.4 4595.4
## - contact 1 4555.8 4613.8
## - poutcome 2 4562.6 4618.6
## - month 9 4612.5 4654.5
##
## Step: AIC=4560.45
## y ~ age + job + contact + month + previous + poutcome + emp.var.rate +
## cons.price.idx + cons.conf.idx + euribor3m
##
## Df Deviance AIC
## - cons.conf.idx 1 4503.0 4559.0
## <none> 4502.4 4560.4
## - previous 1 4506.0 4562.0
## - age 1 4507.1 4563.1
## - euribor3m 1 4517.6 4573.6
## - job 10 4554.7 4592.7
## - emp.var.rate 1 4553.2 4609.2
## - poutcome 2 4562.7 4616.7
## - contact 1 4561.2 4617.2
## - cons.price.idx 1 4579.6 4635.6
## - month 9 4617.2 4657.2
##
## Step: AIC=4559
## y ~ age + job + contact + month + previous + poutcome + emp.var.rate +
## cons.price.idx + euribor3m
##
## Df Deviance AIC
## <none> 4503.0 4559.0
## - previous 1 4506.7 4560.7
## - age 1 4507.6 4561.6
## - euribor3m 1 4525.0 4579.0
## - job 10 4556.2 4592.2
## - poutcome 2 4565.6 4617.6
## - contact 1 4564.0 4618.0
## - emp.var.rate 1 4573.0 4627.0
## - cons.price.idx 1 4588.5 4642.5
## - month 9 4632.8 4670.8
reglog.step.over$anova## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 NA NA 4296 4502.429 4562.429
## 2 - nr.employed 1 0.02115926 4297 4502.450 4560.450
## 3 - cons.conf.idx 1 0.55052307 4298 4503.000 4559.000
reglog.step.ok.over <- glm(y ~ age + job + contact + month + poutcome + emp.var.rate + cons.price.idx + euribor3m,data=data_balanced_over,family=binomial("link"=logit))
summary(reglog.step.ok.over)##
## Call:
## glm(formula = y ~ age + job + contact + month + poutcome + emp.var.rate +
## cons.price.idx + euribor3m, family = binomial(link = logit),
## data = data_balanced_over)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.69556 -0.84664 -0.09079 0.76496 2.07724
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.722e+02 1.867e+01 -9.227 < 2e-16 ***
## age 8.777e-03 4.121e-03 2.130 0.03321 *
## jobadmin. 5.390e-01 2.504e-01 2.153 0.03132 *
## jobblue-collar 2.957e-01 2.582e-01 1.145 0.25213
## jobentrepreneur -4.786e-01 3.558e-01 -1.345 0.17865
## jobhousemaid 1.965e-01 3.457e-01 0.568 0.56980
## jobmanagement 8.123e-02 2.731e-01 0.297 0.76611
## jobretired 5.695e-01 3.045e-01 1.870 0.06149 .
## jobself-employed -8.840e-02 3.107e-01 -0.284 0.77604
## jobservices -3.927e-02 2.748e-01 -0.143 0.88639
## jobstudent 2.510e-01 3.512e-01 0.715 0.47474
## jobtechnician 6.449e-01 2.565e-01 2.514 0.01193 *
## contacttelephone -1.094e+00 1.414e-01 -7.737 1.01e-14 ***
## monthapr 1.559e-01 1.514e-01 1.030 0.30314
## monthaug 8.402e-01 1.474e-01 5.700 1.20e-08 ***
## monthdec 1.144e+00 4.027e-01 2.840 0.00451 **
## monthjul 1.866e-01 1.598e-01 1.168 0.24293
## monthjun 7.058e-02 1.803e-01 0.391 0.69552
## monthmar 2.160e+00 2.811e-01 7.684 1.55e-14 ***
## monthnov -1.986e-01 1.708e-01 -1.163 0.24483
## monthoct 5.648e-01 2.711e-01 2.083 0.03721 *
## monthsep 5.654e-01 2.556e-01 2.212 0.02696 *
## poutcomefailure -1.614e+00 2.140e-01 -7.543 4.58e-14 ***
## poutcomenonexistent -1.382e+00 2.017e-01 -6.855 7.13e-12 ***
## emp.var.rate -1.621e+00 1.923e-01 -8.429 < 2e-16 ***
## cons.price.idx 1.818e+00 1.956e-01 9.298 < 2e-16 ***
## euribor3m 6.905e-01 1.450e-01 4.763 1.90e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5997.1 on 4325 degrees of freedom
## Residual deviance: 4506.7 on 4299 degrees of freedom
## AIC: 4560.7
##
## Number of Fisher Scoring iterations: 5
Interpretasi
# odds = exponensial parameter model
interpretasi.over <- data.frame(koefisien=summary(reglog.step.ok.over)$coefficients[,c(4,1)],exp.koefisien=exp(summary(reglog.step.ok.over)$coefficients[,1]))
InterpretasiPeubahNyata.over <- interpretasi %>% filter(koefisien.Pr...z..< 0.05)
InterpretasiPeubahNyata.over## koefisien.Pr...z.. koefisien.Estimate exp.koefisien
## (Intercept) 1.301571e-12 -117.36246845 1.071834e-51
## contacttelephone 2.731397e-05 -1.11281822 3.286315e-01
## monthmar 2.835945e-08 2.11116983 8.257896e+00
## poutcomefailure 1.669190e-09 -1.76207327 1.716885e-01
## poutcomenonexistent 3.487552e-07 -1.35414929 2.581668e-01
## emp.var.rate 5.365213e-23 -0.73075892 4.815434e-01
## cons.price.idx 2.214118e-12 1.26267142 3.534852e+00
## cons.conf.idx 3.014686e-02 0.04055963 1.041393e+00
Evalusi Model
Menggunakan data testing.
testing.bank.ok$ypred.ns.over <- predict(reglog.over, testing.bank.ok, type="response")
testing.bank.ok$ypred.ns.over <- ifelse(testing.bank.ok$ypred.ns.over > 0.50, "yes", "no")
testing.bank.ok$ypred.ns.over <- as.factor(testing.bank.ok$ypred.ns.over)
(conf.mat<-caret::confusionMatrix(testing.bank.ok$ypred.ns.over, testing.bank.ok$y, positive="yes"))## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 466 18
## yes 90 44
##
## Accuracy : 0.8252
## 95% CI : (0.793, 0.8544)
## No Information Rate : 0.8997
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3614
##
## Mcnemar's Test P-Value : 8.375e-12
##
## Sensitivity : 0.7097
## Specificity : 0.8381
## Pos Pred Value : 0.3284
## Neg Pred Value : 0.9628
## Prevalence : 0.1003
## Detection Rate : 0.0712
## Detection Prevalence : 0.2168
## Balanced Accuracy : 0.7739
##
## 'Positive' Class : yes
##
Confusion Matrix
Dalam confusion matrix, sensitivitas (atau True Positive Rate) adalah persentase pengamatan (aktual) yang diprediksi dengan benar oleh model, sedangkan spesifisitas adalah persentase dari 0 (aktual) yang diprediksi dengan benar.
Hasil confusion matrix menunjukkan bahwa model regresi logistik yang diperoleh memiliki persentase pengamatan yes (aktual) yang diprediksi dengan benar oleh model sebesar 70.97 persen sedangkan persentase dari no (aktual) yang diprediksi dengan benar oleh model adalah sebesar 81.12 persen dengan accuracy sebesar 80.10 persen.Jika dilihat dari nilai akurasinya, model ini dinilai baik.
testing.bank.ok$ypred.over <- predict(reglog.step.ok.over, testing.bank.ok, type="response")
testing.bank.ok$ypred.over <- ifelse(testing.bank.ok$ypred.over > 0.50, "yes", "no")
testing.bank.ok$ypred.over <- as.factor(testing.bank.ok$ypred.over)
(conf.mat.over<-caret::confusionMatrix(testing.bank.ok$ypred.over, testing.bank.ok$y, positive="yes"))## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 464 18
## yes 92 44
##
## Accuracy : 0.822
## 95% CI : (0.7895, 0.8514)
## No Information Rate : 0.8997
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3556
##
## Mcnemar's Test P-Value : 3.396e-12
##
## Sensitivity : 0.7097
## Specificity : 0.8345
## Pos Pred Value : 0.3235
## Neg Pred Value : 0.9627
## Prevalence : 0.1003
## Detection Rate : 0.0712
## Detection Prevalence : 0.2201
## Balanced Accuracy : 0.7721
##
## 'Positive' Class : yes
##
broom::tidy(conf.mat.over)## # A tibble: 14 x 6
## term class estimate conf.low conf.high p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 accuracy <NA> 0.822 0.790 0.851 1.00e+ 0
## 2 kappa <NA> 0.356 NA NA NA
## 3 mcnemar <NA> NA NA NA 3.40e-12
## 4 sensitivity yes 0.710 NA NA NA
## 5 specificity yes 0.835 NA NA NA
## 6 pos_pred_value yes 0.324 NA NA NA
## 7 neg_pred_value yes 0.963 NA NA NA
## 8 precision yes 0.324 NA NA NA
## 9 recall yes 0.710 NA NA NA
## 10 f1 yes 0.444 NA NA NA
## 11 prevalence yes 0.100 NA NA NA
## 12 detection_rate yes 0.0712 NA NA NA
## 13 detection_prevalence yes 0.220 NA NA NA
## 14 balanced_accuracy yes 0.772 NA NA NA
Receiver Operating Characteristics (ROC) Curve
Receiver Operating Characteristics (ROC) curve melacak persentase true positive saat cut-off peluang prediksi diturunkan dari 1 menjadi 0. Model yang baik akan memperlihat kurva yang lebih curam, artinya True Positive Rate meningkat lebih cepat dibandingkan dengan False Positive Rate ketika cut-off menurun. Dengan kata lain, semakin besar luas area di bawah kurva ROC maka kemampuan prediksi yang dihasilkan oleh model semakin baik.
Dari model regresi logistik yang diperoleh, persentase true positive saat cut-off peluang prediksi diturunkan dari yes menjadi no sebesar 60.62 persen. Besarnya luas area di bawah kurva ROC ini mengindikasikan bahwa model yang diperoleh memiliki kemampuan prediksi yang cukup baik.
plotROC(testing.bank.ok=="yes", testing.bank.ok$ypred.over=="yes")Daftar Pustaka
Jian C, Gao J, Ao Y. (2016). A New Sampling Method for Classifying Imbalanced Data Based on Support Vector Machine Ensemble. Journal Neurocomputing. 193: 115-122.doi: 10.1016/j.neucom.2016.02.006
Thanathamathee P, Lursinap C. (2013). Handling Imbalanced Datasets with Synthetic Boundary Data Generation Using Bootstrap Resampling and AdaBoost Techniques. Pattern Recognition Letters.Elsevier BV 34(12): 1339-1347.doi: 10.1016/j.patrec.2013.04.019
Prusa J, Khoshgoftaar TM, Dittman DJ, Napolitano A. (2015). Using Random Undersampling to Alleviate Class Imbalance on Tweet Sentiment Data. IEEE Xplore. doi: 10.1109/IRI.2015.39
G1501202071, Mahasiswa Pascasarjana Statistika dan Sains Data, IPB University, reniamelia@apps.ipb.ac.id↩︎