Faktor Yang Memengaruhi Seseorang Berlangganan Deposito Berjangka

Penerapan Regresi Logistik Biner

setwd(“D:\Kuliah S2 IPB\Bahan Kuliah\Semester 3 SSD 2020\Pemodelan Klasifikasi\Praktikum\R”)

Packages

Packages yang digunakan adalah

library(readxl)
library(DT)
library(tidyverse)
library(dplyr)
library(ggpubr)
library(broom)
library(caret)
library(DataExplorer)
library(grid)
library(InformationValue)
library(ISLR)
library(pscl)
library(ROSE)
library(ggplot2)
library(scales)
library(knitr)
library(RColorBrewer)

Data

Data yang digunakan adalah dataset bank-additional.csv yang diperoleh dari “Bank Marketing” UCI dataset. Data ini terkait dengan kampanye pemasaran langsung dari lembaga perbankan Portugal, yang tersedia pada: UCI Machine Learning Repository: Bank Marketing Data Set

Tugas

Bangunlah model terbaik dengan regresi logistik, terkait mau tidak nya pelanggan untuk berlangganan deposito berjangka (y = has the client subscribed a term deposit? (binary: ‘yes’,‘no’))

Gunakan proses validasi 80:20, dan identifikasi peubah yang mempengaruhi y secara signifikan

Input Data

Proses yang pertama dilakukan sebelum menyusun model regresi logistik adalah melakukan input data yang telah di download pada UCI Machine Learning Repository: Bank Marketing Data Set

bank <- read.csv("bank-additional.csv",sep=";",header=TRUE)
str(bank)

## 'data.frame':    4119 obs. of  21 variables:
##  $ age           : int  30 39 25 38 47 32 32 41 31 35 ...
##  $ job           : chr  "blue-collar" "services" "services" "services" ...
##  $ marital       : chr  "married" "single" "married" "married" ...
##  $ education     : chr  "basic.9y" "high.school" "high.school" "basic.9y" ...
##  $ default       : chr  "no" "no" "no" "no" ...
##  $ housing       : chr  "yes" "no" "yes" "unknown" ...
##  $ loan          : chr  "no" "no" "no" "unknown" ...
##  $ contact       : chr  "cellular" "telephone" "telephone" "telephone" ...
##  $ month         : chr  "may" "may" "jun" "jun" ...
##  $ day_of_week   : chr  "fri" "fri" "wed" "fri" ...
##  $ duration      : int  487 346 227 17 58 128 290 44 68 170 ...
##  $ campaign      : int  2 4 1 3 1 3 4 2 1 1 ...
##  $ pdays         : int  999 999 999 999 999 999 999 999 999 999 ...
##  $ previous      : int  0 0 0 0 0 2 0 0 1 0 ...
##  $ poutcome      : chr  "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
##  $ emp.var.rate  : num  -1.8 1.1 1.4 1.4 -0.1 -1.1 -1.1 -0.1 -0.1 1.1 ...
##  $ cons.price.idx: num  92.9 94 94.5 94.5 93.2 94.2 94.2 93.2 93.2 94 ...
##  $ cons.conf.idx : num  -46.2 -36.4 -41.8 -41.8 -42 -37.5 -37.5 -42 -42 -36.4 ...
##  $ euribor3m     : num  1.3 4.9 5 5 4.2 0.9 0.9 4.2 4.2 4.9 ...
##  $ nr.employed   : num  5099 5191 5228 5228 5196 ...
##  $ y             : chr  "no" "no" "no" "no" ...

dataset bank-additional.csv terdiri dari 21 kolom/variabel/peubah dan 4119 baris/observasi.

Ada 20 kandidat calon peubah bebas yang dapat digunakan dalam menyusun model regresi logistik untuk mengetahui faktor-faktor yang memengaruhi pelanggan untuk berlangganan deposito berjangka (y - has the client subscribed a term deposit? (binary: “yes”,“no”)). 20 kandidat peubah bebas tersebut adalah:

bank client data:

1 - age (numeric)

2 - job : type of job (categorical: “admin.”,“blue-collar”,“entrepreneur”,“housemaid”,“management”,“retired”,“self-employed”,“services”,“student”,“technician”,“unemployed”,“unknown”)

3 - marital : marital status (categorical: “divorced”,“married”,“single”,“unknown”; note: “divorced” means divorced or widowed)

4 - education (categorical: “basic.4y”,“basic.6y”,“basic.9y”,“high.school”,“illiterate”,“professional.course”,“university.degree”,“unknown”)

5 - default: has credit in default? (categorical: “no”,“yes”,“unknown”)

6 - housing: has housing loan? (categorical: “no”,“yes”,“unknown”)

7 - loan: has personal loan? (categorical: “no”,“yes”,“unknown”)

related with the last contact of the current campaign

8 - contact: contact communication type (categorical: “cellular”,“telephone”)

9 - month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)

10 - day_of_week: last contact day of the week (categorical: “mon”,“tue”,“wed”,“thu”,“fri”)

11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=“no”). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

other attributes:

12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)

14 - previous: number of contacts performed before this campaign and for this client (numeric)

15 - poutcome: outcome of the previous marketing campaign (categorical: “failure”,“nonexistent”,“success”)

social and economic context attributes

16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)

17 - cons.price.idx: consumer price index - monthly indicator (numeric)

18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)

19 - euribor3m: euribor 3 month rate - daily indicator (numeric)

20 - nr.employed: number of employees - quarterly indicator (numeric)

Dataset ini selanjutnya disimpan dengan nama bank.

Pre-processing Data

Cek Tipe Data

Dari data bank yang telah di input di atas,terlihat bahwa beberapa peubah yang seharusnya memiliki tipe kategori masih dibaca sebagai character dan numerik sehingga perlu ditransformasi terlebih dahulu menjadi factor.

#mengubah peubah categoric as.factor

bank$job <- as.factor(bank$job)
bank$marital <- as.factor(bank$marital)
bank$education <- as.factor(bank$education)
bank$default <- as.factor(bank$default)
bank$housing <- as.factor(bank$housing)
bank$loan <- as.factor(bank$loan)
bank$contact <- as.factor(bank$contact)
bank$month <- as.factor(bank$month)
bank$day_of_week <- as.factor(bank$day_of_week)
bank$poutcome <- as.factor(bank$poutcome)
bank$y <- as.factor(bank$y)

str(bank)

## 'data.frame':    4119 obs. of  21 variables:
##  $ age           : int  30 39 25 38 47 32 32 41 31 35 ...
##  $ job           : Factor w/ 12 levels "admin.","blue-collar",..: 2 8 8 8 1 8 1 3 8 2 ...
##  $ marital       : Factor w/ 4 levels "divorced","married",..: 2 3 2 2 2 3 3 2 1 2 ...
##  $ education     : Factor w/ 8 levels "basic.4y","basic.6y",..: 3 4 4 3 7 7 7 7 6 3 ...
##  $ default       : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 1 1 1 2 1 2 ...
##  $ housing       : Factor w/ 3 levels "no","unknown",..: 3 1 3 2 3 1 3 3 1 1 ...
##  $ loan          : Factor w/ 3 levels "no","unknown",..: 1 1 1 2 1 1 1 1 1 1 ...
##  $ contact       : Factor w/ 2 levels "cellular","telephone": 1 2 2 2 1 1 1 1 1 2 ...
##  $ month         : Factor w/ 10 levels "apr","aug","dec",..: 7 7 5 5 8 10 10 8 8 7 ...
##  $ day_of_week   : Factor w/ 5 levels "fri","mon","thu",..: 1 1 5 1 2 3 2 2 4 3 ...
##  $ duration      : int  487 346 227 17 58 128 290 44 68 170 ...
##  $ campaign      : int  2 4 1 3 1 3 4 2 1 1 ...
##  $ pdays         : int  999 999 999 999 999 999 999 999 999 999 ...
##  $ previous      : int  0 0 0 0 0 2 0 0 1 0 ...
##  $ poutcome      : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ emp.var.rate  : num  -1.8 1.1 1.4 1.4 -0.1 -1.1 -1.1 -0.1 -0.1 1.1 ...
##  $ cons.price.idx: num  92.9 94 94.5 94.5 93.2 94.2 94.2 93.2 93.2 94 ...
##  $ cons.conf.idx : num  -46.2 -36.4 -41.8 -41.8 -42 -37.5 -37.5 -42 -42 -36.4 ...
##  $ euribor3m     : num  1.3 4.9 5 5 4.2 0.9 0.9 4.2 4.2 4.9 ...
##  $ nr.employed   : num  5099 5191 5228 5228 5196 ...
##  $ y             : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Dari output di atas,terlihat bahwa beberapa peubah yang memiliki tipe kategori sudah dibaca sebagai factor.

Cek Missing Data atau Jawaban “unknown”

missing(bank)

## [1] FALSE

terlihat bahwa tidak ada data yang missing pada data bank.

summary(bank)

##       age                 job           marital                   education   
##  Min.   :18.00   admin.     :1012   divorced: 446   university.degree  :1264  
##  1st Qu.:32.00   blue-collar: 884   married :2509   high.school        : 921  
##  Median :38.00   technician : 691   single  :1153   basic.9y           : 574  
##  Mean   :40.11   services   : 393   unknown :  11   professional.course: 535  
##  3rd Qu.:47.00   management : 324                   basic.4y           : 429  
##  Max.   :88.00   retired    : 166                   basic.6y           : 228  
##                  (Other)    : 649                   (Other)            : 168  
##     default        housing          loan           contact         month     
##  no     :3315   no     :1839   no     :3349   cellular :2652   may    :1378  
##  unknown: 803   unknown: 105   unknown: 105   telephone:1467   jul    : 711  
##  yes    :   1   yes    :2175   yes    : 665                    aug    : 636  
##                                                                jun    : 530  
##                                                                nov    : 446  
##                                                                apr    : 215  
##                                                                (Other): 203  
##  day_of_week    duration         campaign          pdays          previous     
##  fri:768     Min.   :   0.0   Min.   : 1.000   Min.   :  0.0   Min.   :0.0000  
##  mon:855     1st Qu.: 103.0   1st Qu.: 1.000   1st Qu.:999.0   1st Qu.:0.0000  
##  thu:860     Median : 181.0   Median : 2.000   Median :999.0   Median :0.0000  
##  tue:841     Mean   : 256.8   Mean   : 2.537   Mean   :960.4   Mean   :0.1903  
##  wed:795     3rd Qu.: 317.0   3rd Qu.: 3.000   3rd Qu.:999.0   3rd Qu.:0.0000  
##              Max.   :3643.0   Max.   :35.000   Max.   :999.0   Max.   :6.0000  
##                                                                                
##         poutcome     emp.var.rate      cons.price.idx  cons.conf.idx  
##  failure    : 454   Min.   :-3.40000   Min.   :92.20   Min.   :-50.8  
##  nonexistent:3523   1st Qu.:-1.80000   1st Qu.:93.10   1st Qu.:-42.7  
##  success    : 142   Median : 1.10000   Median :93.70   Median :-41.8  
##                     Mean   : 0.08497   Mean   :93.58   Mean   :-40.5  
##                     3rd Qu.: 1.40000   3rd Qu.:94.00   3rd Qu.:-36.4  
##                     Max.   : 1.40000   Max.   :94.80   Max.   :-26.9  
##                                                                       
##    euribor3m      nr.employed     y       
##  Min.   :0.600   Min.   :4964   no :3668  
##  1st Qu.:1.300   1st Qu.:5099   yes: 451  
##  Median :4.900   Median :5191             
##  Mean   :3.645   Mean   :5166             
##  3rd Qu.:5.000   3rd Qu.:5228             
##  Max.   :5.000   Max.   :5228             
##

Dari output di atas terlihat bahwa peubah marital, default, housing, dan loan memiliki kategori “unknown” atau “tidak tahu” sehingga observasi yang memiliki jawaban “unknown” akan dikeluarkan dari data untuk penyusunan model. Untuk peubah job, education, dan month yang memiliki kategori yang cukup banyak sehingga tidak bisa terlihat jelas apakah memiliki jawaban yang “unknown” atau tidak, maka akan dilihat rinciannya sebagai berikut:

table(bank$job)

## 
##        admin.   blue-collar  entrepreneur     housemaid    management 
##          1012           884           148           110           324 
##       retired self-employed      services       student    technician 
##           166           159           393            82           691 
##    unemployed       unknown 
##           111            39

Dari output di atas terlihat bahwa peubah job memiliki 39 observasi yang memilih kategori “unknown” atau “tidak tahu”.

table(bank$education)

## 
##            basic.4y            basic.6y            basic.9y         high.school 
##                 429                 228                 574                 921 
##          illiterate professional.course   university.degree             unknown 
##                   1                 535                1264                 167

Dari output di atas terlihat bahwa peubah education memiliki 167 observasi yang memilih kategori “unknown” atau “tidak tahu” dan ada 1 observasi yang memilih ‘illiterate’ sehingga selanjutnya observasi ini akan dikeluarkan dari model.

table(bank$month)

## 
##  apr  aug  dec  jul  jun  mar  may  nov  oct  sep 
##  215  636   22  711  530   48 1378  446   69   64

Dari output di atas terlihat bahwa peubah month tidak memiliki observasi yang memilih kategori “unknown” atau “tidak tahu”.

Data Bank yang mengandung observasi yang menjawab “unknown” atau “tidak tahu” pada peubah job, marital, education, default, housing, dan loan, akan dikeluarkan dari data yang akan digunakan untuk penyusunan model. Selain itu, 1 observasi yang memilih kategori education ‘illiterate’ juga dikeluarkan dari dataset. Selanjutnya data tersebut akan disimpan dengan nama bank.ok. Proses filtering data tersebut seperti pada syntax berikut:

bank.ok <- bank %>% filter(job!="unknown",marital!="unknown",education!="unknown",education!="illiterate",default!="unknown",housing!="unknown",loan!="unknown")
summary(bank.ok)

##       age                   job          marital                   education   
##  Min.   :20.00   admin.       :854   divorced: 347   university.degree  :1107  
##  1st Qu.:31.00   technician   :573   married :1791   high.school        : 728  
##  Median :37.00   blue-collar  :554   single  : 951   professional.course: 454  
##  Mean   :39.18   services     :276   unknown :   0   basic.9y           : 407  
##  3rd Qu.:46.00   management   :265                   basic.4y           : 243  
##  Max.   :88.00   self-employed:126                   basic.6y           : 150  
##                  (Other)      :441                   (Other)            :   0  
##     default        housing          loan           contact         month    
##  no     :3088   no     :1401   no     :2582   cellular :2107   may    :981  
##  unknown:   0   unknown:   0   unknown:   0   telephone: 982   jul    :514  
##  yes    :   1   yes    :1688   yes    : 507                    aug    :494  
##                                                                nov    :387  
##                                                                jun    :365  
##                                                                apr    :169  
##                                                                (Other):179  
##  day_of_week    duration         campaign          pdays          previous     
##  fri:580     Min.   :   0.0   Min.   : 1.000   Min.   :  0.0   Min.   :0.0000  
##  mon:642     1st Qu.: 104.0   1st Qu.: 1.000   1st Qu.:999.0   1st Qu.:0.0000  
##  thu:630     Median : 181.0   Median : 2.000   Median :999.0   Median :0.0000  
##  tue:613     Mean   : 259.3   Mean   : 2.509   Mean   :953.3   Mean   :0.2082  
##  wed:624     3rd Qu.: 315.0   3rd Qu.: 3.000   3rd Qu.:999.0   3rd Qu.:0.0000  
##              Max.   :3643.0   Max.   :35.000   Max.   :999.0   Max.   :6.0000  
##                                                                                
##         poutcome     emp.var.rate      cons.price.idx  cons.conf.idx   
##  failure    : 360   Min.   :-3.40000   Min.   :92.20   Min.   :-50.80  
##  nonexistent:2601   1st Qu.:-1.80000   1st Qu.:93.10   1st Qu.:-42.70  
##  success    : 128   Median : 1.10000   Median :93.40   Median :-41.80  
##                     Mean   :-0.04587   Mean   :93.53   Mean   :-40.62  
##                     3rd Qu.: 1.40000   3rd Qu.:94.00   3rd Qu.:-36.40  
##                     Max.   : 1.40000   Max.   :94.80   Max.   :-26.90  
##                                                                        
##    euribor3m      nr.employed     y       
##  Min.   :0.600   Min.   :4964   no :2719  
##  1st Qu.:1.300   1st Qu.:5099   yes: 370  
##  Median :4.900   Median :5191             
##  Mean   :3.504   Mean   :5161             
##  3rd Qu.:5.000   3rd Qu.:5228             
##  Max.   :5.000   Max.   :5228             
##

dim(bank.ok)

## [1] 3089   21

bank.ok memiliki observasi sebanyak 3089 dengan 21 peubah. Jumlah observasi yang dikeluarkan dari dataset bank sebanyak 1030 observasi (observasi yang memiliki jawaban “unknown” pada salah satu atau lebih peubah bebas kategorik atau illiterate pada peubah education).

4119-3089

## [1] 1030

Selanjutnya, data bank.ok akan digunakan proses eksplorasi data sebelum penyusunan model regresi linier.

Eksplorasi Data

Peubah Kategorik

table(bank.ok$job)/nrow(bank.ok)*100

## 
##        admin.   blue-collar  entrepreneur     housemaid    management 
##     27.646488     17.934607      3.463904      2.427970      8.578828 
##       retired self-employed      services       student    technician 
##      3.820006      4.078990      8.934930      1.780511     18.549692 
##    unemployed       unknown 
##      2.784073      0.000000

A1 <- ggplot(bank.ok %>% 
  count(job) %>% 
  mutate(percentage = (n / nrow(bank.ok))*100), aes(y=reorder(job, -percentage), x = percentage)) + geom_bar(stat = "identity",fill = "#3F51B5") + 
  labs(y = "Job",
       x = "Percentage (%)") + geom_label(aes(label = round(percentage,2), hjust = 0.2))+
  theme(axis.text.x = element_blank(),
    panel.grid.major.x = element_blank() )+
  xlim(0, 30)+
  theme(axis.title.y = element_text(margin = margin(r = 20)))

B1 <- ggplot(bank.ok %>% 
  count(marital) %>% 
  mutate(percentage = (n / nrow(bank.ok))*100), aes(y=reorder(marital, -percentage), x = percentage)) + geom_bar(stat = "identity",fill = "#3F51B5") + 
  labs(y = "Marital",
       x = "Percentage (%)") + geom_label(aes(label = round(percentage,2), hjust = 0.2))+
  theme(axis.text.x = element_blank(),
    panel.grid.major.x = element_blank())+
  theme(axis.title.y = element_text(margin = margin(r = 20)))+
  xlim(0, 60)


C1 <- ggplot(bank.ok %>% 
  count(education) %>% 
  mutate(percentage = (n / nrow(bank.ok))*100), aes(y=reorder(education, -percentage), x = percentage)) + geom_bar(stat = "identity",fill = "#3F51B5") + 
  labs(y = "Education",
       x = "Percentage (%)") + geom_label(aes(label = round(percentage,2), hjust = 0.2))+
  theme(axis.text.x = element_blank(),
    panel.grid.major.x = element_blank())+
  theme(axis.title.y = element_text(margin = margin(r = 20)))+
  xlim(0, 40)

D1 <- ggplot(bank.ok %>% 
  count(default) %>% 
  mutate(percentage = (n / nrow(bank.ok))*100), aes(y=reorder(default, -percentage), x = percentage)) + geom_bar(stat = "identity",fill = "#3F51B5") + 
  labs(y = "Default",
       x = "Percentage (%)") + geom_label(aes(label = round(percentage,2), hjust = 0.6))+
  theme(axis.text.x = element_blank(),
    panel.grid.major.x = element_blank())+
  theme(axis.title.y = element_text(margin = margin(r = 20)))+
  xlim(0, 100)

E1 <- ggplot(bank.ok %>% 
  count(default) %>% 
  mutate(percentage = (n / nrow(bank.ok))*100), aes(y=reorder(default, -percentage), x = percentage)) + geom_bar(stat = "identity",fill = "#3F51B5") + 
  labs(y = "Default",
       x = "Percentage (%)") + geom_label(aes(label = round(percentage,2), hjust = 0.2))+
  theme(axis.text.x = element_blank(),
    panel.grid.major.x = element_blank())+
  theme(axis.title.y = element_text(margin = margin(r = 20)))+
  xlim(0, 100)

F1 <- ggplot(bank.ok %>% 
  count(housing) %>% 
  mutate(percentage = (n / nrow(bank.ok))*100), aes(y=reorder(housing, -percentage), x = percentage)) + geom_bar(stat = "identity",fill = "#3F51B5") + 
  labs(y = "Housing",
       x = "Percentage (%)") + geom_label(aes(label = round(percentage,2), hjust = 0.2))+
  theme(axis.text.x = element_blank(),
    panel.grid.major.x = element_blank())+
  theme(axis.title.y = element_text(margin = margin(r = 20)))+
  xlim(0, 60)

G1 <- ggplot(bank.ok %>% 
  count(loan) %>% 
  mutate(percentage = (n / nrow(bank.ok))*100), aes(y=reorder(loan, -percentage), x = percentage)) + geom_bar(stat = "identity",fill = "#3F51B5") + 
  labs(y = "Loan",
       x = "Percentage (%)") + geom_label(aes(label = round(percentage,2), hjust = 0.2))+
  theme(axis.text.x = element_blank(),
    panel.grid.major.x = element_blank())+
  theme(axis.title.y = element_text(margin = margin(r = 20)))+
  xlim(0, 100)

H1 <- ggplot(bank.ok %>% 
  count(contact) %>% 
  mutate(percentage = (n / nrow(bank.ok))*100), aes(y=reorder(contact, -percentage), x = percentage)) + geom_bar(stat = "identity",fill = "#3F51B5") + 
  labs(y = "Contact",
       x = "Percentage (%)") + geom_label(aes(label = round(percentage,2), hjust = 0.2))+
  theme(axis.text.x = element_blank(),
    panel.grid.major.x = element_blank())+
  theme(axis.title.y = element_text(margin = margin(r = 20)))+
  xlim(0, 100)

I1 <- ggplot(bank.ok %>% 
  count(month) %>% 
  mutate(percentage = (n / nrow(bank.ok))*100), aes(y=reorder(month, -percentage), x = percentage)) + geom_bar(stat = "identity",fill = "#3F51B5") + 
  labs(y = "Month",
       x = "Percentage (%)") + geom_label(aes(label = round(percentage,2), hjust = 0.2))+
  theme(axis.text.x = element_blank(),
    panel.grid.major.x = element_blank())+
  theme(axis.title.y = element_text(margin = margin(r = 20)))+
  xlim(0, 40)

J1 <- ggplot(bank.ok %>% 
  count(day_of_week) %>% 
  mutate(percentage = (n / nrow(bank.ok))*100), aes(y=reorder(day_of_week, -percentage), x = percentage)) + geom_bar(stat = "identity",fill = "#3F51B5") + 
  labs(y = "Day_of_week",
       x = "Percentage (%)") + geom_label(aes(label = round(percentage,2), hjust = 0.2))+
  theme(axis.text.x = element_blank(),
    panel.grid.major.x = element_blank())+
  theme(axis.title.y = element_text(margin = margin(r = 20)))+
  xlim(0, 30)

K1 <- ggplot(bank.ok %>% 
  count(poutcome) %>% 
  mutate(percentage = (n / nrow(bank.ok))*100), aes(y=reorder(poutcome, -percentage), x = percentage)) + geom_bar(stat = "identity",fill = "#3F51B5") + 
  labs(y = "Poutcome",
       x = "Percentage (%)") + geom_label(aes(label = round(percentage,2), hjust = 0.2))+
  theme(axis.text.x = element_blank(),
    panel.grid.major.x = element_blank())+
  theme(axis.title.y = element_text(margin = margin(r = 20)))+
  xlim(0, 100)

L1 <- ggplot(bank.ok %>% 
  count(y) %>% 
  mutate(percentage = (n / nrow(bank.ok))*100), aes(y=reorder(y, -percentage), x = percentage)) + geom_bar(stat = "identity",fill = "#3F51B5") + 
  labs(y = "Y",
       x = "Percentage (%)") + geom_label(aes(label = round(percentage,2), hjust = 0.2))+
  theme(axis.text.x = element_blank(),
    panel.grid.major.x = element_blank())+
  theme(axis.title.y = element_text(margin = margin(r = 20)))+
  xlim(0, 100)

A1

B1

C1

D1

F1

G1

H1

I1

J1

K1

L1

Dari output di atas terlihat bahwa pelanggan yang menjadi sampel dalam penelitian ini mayoritas tidak mau berlangganan deposito berjangka, pekerjaannya dalam bidang administrasi, status perkawinannya menikah, pendidikannya university degree, tidak memiliki pinajaman perumahan, tidak memiliki pinjaman pribadi, kontak terakhir pada kampanye pemasaran langsung dari bank melalui cellular pada bulan Mei dan hari Senin, serta hasil dari kampanye pemasaran sebelumnya “tidak ada/nonexistent”. Untuk peubah default (apakah kredit default?) semuanya menjawab tidak. Sehingga peubah ini akan dieliminasi dalam penyusunan model regresi logistik.

Selanjutnya akan dilakukan eksplorasi data dalam bentuk diagram batang untuk melihat hubungan secara visual antara setiap peubah bebas dengan peubah penjelas (y).

bank.ok %>%   
  count(y, job) %>% 
  mutate(percent = (n / sum(n))*100)

##      y           job   n    percent
## 1   no        admin. 736 23.8264811
## 2   no   blue-collar 507 16.4130787
## 3   no  entrepreneur 100  3.2372936
## 4   no     housemaid  66  2.1366138
## 5   no    management 241  7.8018776
## 6   no       retired  89  2.8811913
## 7   no self-employed 113  3.6581418
## 8   no      services 252  8.1579799
## 9   no       student  41  1.3272904
## 10  no    technician 504 16.3159599
## 11  no    unemployed  70  2.2661055
## 12 yes        admin. 118  3.8200065
## 13 yes   blue-collar  47  1.5215280
## 14 yes  entrepreneur   7  0.2266106
## 15 yes     housemaid   9  0.2913564
## 16 yes    management  24  0.7769505
## 17 yes       retired  29  0.9388152
## 18 yes self-employed  13  0.4208482
## 19 yes      services  24  0.7769505
## 20 yes       student  14  0.4532211
## 21 yes    technician  69  2.2337326
## 22 yes    unemployed  16  0.5179670

# bar plot, with each bar representing 100%
A <- ggplot(bank.ok,
            aes(y = job,
                fill = y)) + geom_bar(position = "fill", names ="y") +
  scale_fill_manual(values = c("#8C9EFF","#3F51B5"))+
  labs(y = "job",
       x = "proportion")+
  theme(axis.title.y = element_text(margin = margin(r = 20)))

## Warning: Ignoring unknown parameters: names

B <- ggplot(bank.ok,
            aes(y = marital,
                fill = y)) + geom_bar(position = "fill", names ="y") +
  scale_fill_manual(values = c("#8C9EFF","#3F51B5"))+
  labs(y = "marital",
       x = "proportion")

## Warning: Ignoring unknown parameters: names

C <- ggplot(bank.ok,
            aes(y = education,
                fill = y)) + geom_bar(position = "fill", names ="y") +
  scale_fill_manual(values = c("#8C9EFF","#3F51B5"))+
  labs(y = "education",
       x = "proportion")

## Warning: Ignoring unknown parameters: names

D <- ggplot(bank.ok,
            aes(y = housing,
                fill = y)) + geom_bar(position = "fill", names ="y") +
  scale_fill_manual(values = c("#8C9EFF","#3F51B5"))+
  labs(y = "housing",
       x = "proportion")

## Warning: Ignoring unknown parameters: names

E <- ggplot(bank.ok,
            aes(y = loan,
                fill = y)) + geom_bar(position = "fill", names ="y") +
  scale_fill_manual(values = c("#8C9EFF","#3F51B5"))+
  labs(y = "loan",
       x = "proportion")

## Warning: Ignoring unknown parameters: names

G <- ggplot(bank.ok,
            aes(y = contact,
                fill = y)) + geom_bar(position = "fill", names ="y") +
  scale_fill_manual(values = c("#8C9EFF","#3F51B5"))+
  labs(y = "contact",
       x = "proportion")

## Warning: Ignoring unknown parameters: names

H <- ggplot(bank.ok,
            aes(y = month,
                fill = y)) + geom_bar(position = "fill", names ="y") +
  scale_fill_manual(values = c("#8C9EFF","#3F51B5"))+
  labs(y = "month",
       x = "proportion")

## Warning: Ignoring unknown parameters: names

I <- ggplot(bank.ok,
            aes(y = day_of_week,
                fill = y)) + geom_bar(position = "fill", names ="y") +
  scale_fill_manual(values = c("#8C9EFF","#3F51B5"))+
  labs(y = "day_of_week",
       x = "proportion")

## Warning: Ignoring unknown parameters: names

J <- ggplot(bank.ok,
            aes(y = poutcome,
                fill = y)) + geom_bar(position = "fill", names ="y") +
  scale_fill_manual(values = c("#8C9EFF","#3F51B5"))+
  labs(y = "poutcome",
       x = "proportion")

## Warning: Ignoring unknown parameters: names

Dari output di atas terlihat bahwa pelanggan yang mau berlangganan deposito berjangka mayoritas memiliki status sebagai student, masih single, pendidikannya university degree, memiliki pinjaman perumahan, tidak memiliki pinjaman pribadi, kontak terakhir pada kampanye pemasaran langsung dari bank melalui cellular pada bulan Maret dan hari Senin atau Kamis, serta hasil dari kampanye pemasaran sebelumnya sukses.

Peubah Numerik

boxplot(bank.ok[,1],horizontal = TRUE,  
        xlab = "Age")

boxplot(bank.ok[,11],horizontal = TRUE,  
        xlab = "Duration")

boxplot(bank.ok[,12],horizontal = TRUE,  
        xlab = "Campaign")

boxplot(bank.ok[,13],horizontal = TRUE,  
        xlab = "pdays")

boxplot(bank.ok[,14],horizontal = TRUE,  
        xlab = "previous")

boxplot(bank.ok[,16],horizontal = TRUE,  
        xlab = "emp.var.rate")

boxplot(bank.ok[,17],horizontal = TRUE,  
        xlab = "cons.price.idx")

boxplot(bank.ok[,18],horizontal = TRUE,  
        xlab = "emp.conf.rate")

boxplot(bank.ok[,19],horizontal = TRUE,  
        xlab = "euribor3m")

boxplot(bank.ok[,20],horizontal = TRUE,  
        xlab = "nr.employed")

Dari boxplot di atas, terlihat bahwa terdapat beberapa pencilan untuk peubah umur (age), duration (last contact duration), campaign (number of contacts performed during this campaign and for this client), pdays (number of days that passed by after the client was last contacted from a previous campaign,999 means client was not previously contacted), previous (number of contacts performed before this campaign and for this client),dan cons.conf.idx (consumer confidence index - monthly indicator). Namun, tidak ditemukan pencilan pada peubah emp.var.rate (employment variation rate - quarterly indicator), cons.price.idx (consumer price index-monthly indicator), euribor3m (euribor 3 month rate - daily indicator), dan nr.employed (number of employees - quarterly indicator).

Dikutip dari UCI Machine Learning Repository: Bank Marketing Data Set:

“duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=”no“). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.”

“durasi: durasi kontak terakhir, dalam detik (numerik). Catatan penting: atribut ini sangat mempengaruhi target keluaran (misalnya, jika durasi=0 maka y=”tidak“). Namun, durasinya tidak diketahui sebelum panggilan dilakukan. Juga, setelah akhir panggilan y jelas diketahui. Dengan demikian, input ini hanya boleh dimasukkan untuk tujuan benchmark dan harus dibuang jika tujuannya adalah untuk memiliki model prediksi yang realistis.”

Karena variabel durasi ini hanya sebagai suatu benchmark dan harus dibuang jika tujuannya untuk mendapatkan model yang baik, maka selanjutnya peubah durasi tidak akan digunakan sebagai peubah bebas pada penyusunan model regresi logistik.

Selain itu, peubah pdays (number of days that passed by after the client was last contacted from a previous campaign,999 means client was not previously contacted). Peubah ini mayoritas memiliki jawaban 999 yang berarti pelanggan belum pernah di kontak sebelumnya dan berkaitan dengan peubah previous (number of contacts performed before this campaign and for this client). Maka, peubah pdays ini juga dikeluarkan pada penyusunan model regresi logistik.

table(bank.ok$pdays)

## 
##    0    1    2    3    4    5    6    7    9   10   11   12   13   15   16   18 
##    2    3    4   46   13    4   40    6    2    8    1    5    2    2    1    1 
##   19   21  999 
##    1    1 2947

Selanjutnya dilakukan eksplorasi data dalam bentuk boxplot untuk melihat hubungan secara visual antara setiap peubah bebas dengan peubah penjelas (y).

# bar plot, with each bar representing 100%
K <- ggplot(bank.ok, aes(x=age, y= y)) +
        geom_boxplot()+
    labs(x = "age",
       y = "y")

M <- ggplot(bank.ok, aes(x=campaign, y= y)) +
        geom_boxplot()+
    labs(x = "campaign",
       y = "y")

O <- ggplot(bank.ok, aes(x=previous, y= y)) +
        geom_boxplot()+
    labs(x = "previous",
       y = "y")


P <- ggplot(bank.ok, aes(x=emp.var.rate, y= y)) +
        geom_boxplot()+
    labs(x = "emp.var.rate",
       y = "y")


Q <- ggplot(bank.ok, aes(x=cons.price.idx, y= y)) +
        geom_boxplot()+
    labs(x = "cons.price.idx",
       y = "y")


R <- ggplot(bank.ok, aes(x=cons.conf.idx, y= y)) +
        geom_boxplot()+
    labs(x = "cons.conf.idx",
       y = "y")


S <- ggplot(bank.ok, aes(x=euribor3m, y= y)) +
        geom_boxplot()+
    labs(x = "euribor3m",
       y = "y")

U <- ggplot(bank.ok, aes(x=nr.employed, y= y)) +
        geom_boxplot()+
    labs(x = "nr.employed",
       y = "y")

K

Dari output di atas terlihat bahwa peubah yang memiliki nilai median lebih besar pada pelanggan yang mau berlangganan deposito berjangka dibandingkan dengan yang tidak mau berlangganan deposito berjangka adalah pada peubah umur dan cons.conf.idx.Pada peubah previous dan campaign, mediannya sama. Sedangkan pada peubah lainnya nilai median pada pelanggan yang mau berlangganan deposito berjangka lebih kecil dibandingkan dengan yang tidak mau berlangganan deposito berjangka.

Uji Chi-Square

Dari eksplorasi data yang telah dilakukan, kandidat peubah bebas yang dapat digunakan dalam menyusun model regresi logistik untuk mengetahui faktor-faktor yang memengaruhi pelanggan untuk berlangganan deposito berjangka (y - has the client subscribed a term deposit? (binary: “yes”,“no”)) adalah age, job, marital,education, housing, loan, contact, month, day_of_week, campaign, previous, poutcome, emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m, dan nr.employed

y.age <- chisq.test(table(bank.ok$y, bank.ok$age))

## Warning in chisq.test(table(bank.ok$y, bank.ok$age)): Chi-squared approximation
## may be incorrect

y.job <- chisq.test(table(bank.ok$y, bank.ok$job,exclude = 'unknown'))
 
y.marital <- chisq.test(table(bank.ok$y, bank.ok$marital,exclude = 'unknown'))
 
y.education <- chisq.test(table(bank.ok$y, bank.ok$education,exclude = c('unknown','illiterate')))

y.housing <- chisq.test(table(bank.ok$y, bank.ok$housing,exclude = 'unknown'))
 
y.loan <- chisq.test(table(bank.ok$y, bank.ok$loan,exclude = 'unknown'))
 
y.contact <- chisq.test(table(bank.ok$y, bank.ok$contact,exclude = 'unknown'))
 
y.month <- chisq.test(table(bank.ok$y, bank.ok$month))

## Warning in chisq.test(table(bank.ok$y, bank.ok$month)): Chi-squared
## approximation may be incorrect

y.day_of_week <- chisq.test(table(bank.ok$y, bank.ok$day_of_week))
 
y.campaign <- chisq.test(table(bank.ok$y, bank.ok$campaign))

## Warning in chisq.test(table(bank.ok$y, bank.ok$campaign)): Chi-squared
## approximation may be incorrect

y.previous <- chisq.test(table(bank.ok$y, bank.ok$previous))

## Warning in chisq.test(table(bank.ok$y, bank.ok$previous)): Chi-squared
## approximation may be incorrect

y.poutcome <- chisq.test(table(bank.ok$y, bank.ok$poutcome))
 
y.emp.var.rate <- chisq.test(table(bank.ok$y, bank.ok$emp.var.rate))

## Warning in chisq.test(table(bank.ok$y, bank.ok$emp.var.rate)): Chi-squared
## approximation may be incorrect

y.cons.price.idx <- chisq.test(table(bank.ok$y, bank.ok$cons.price.idx))

## Warning in chisq.test(table(bank.ok$y, bank.ok$cons.price.idx)): Chi-squared
## approximation may be incorrect

y.cons.conf.idx <- chisq.test(table(bank.ok$y, bank.ok$cons.conf.idx))

## Warning in chisq.test(table(bank.ok$y, bank.ok$cons.conf.idx)): Chi-squared
## approximation may be incorrect

y.euribor3m <- chisq.test(table(bank.ok$y, bank.ok$euribor3m))

## Warning in chisq.test(table(bank.ok$y, bank.ok$euribor3m)): Chi-squared
## approximation may be incorrect

y.nr.employed <- chisq.test(table(bank.ok$y, bank.ok$nr.employed))

## Warning in chisq.test(table(bank.ok$y, bank.ok$nr.employed)): Chi-squared
## approximation may be incorrect

pvalue <- round(c(y.age$p.value,y.job$p.value,y.marital$p.value,y.education$p.value,y.housing$p.value, y.loan$p.value,y.contact$p.value,y.month$p.value,y.day_of_week$p.value,y.campaign$p.value, y.previous$p.value,y.poutcome$p.value,y.emp.var.rate$p.value,y.cons.price.idx$p.value,y.cons.conf.idx$p.value,y.euribor3m$p.value,y.nr.employed$p.value),4)

labelpvalue <- c("y.age$p.value","y.job$p.value","y.marital$p.value","y.education$p.value","y.housing$p.value", "y.loan$p.value","y.contact$p.value","y.month$p.value","y.day_of_week$p.value","y.campaign$p.value", "y.previous$p.value","y.poutcome$p.value","y.emp.var.rate$p.value","y.cons.price.idx$p.value","y.cons.conf.idx$p.value","y.euribor3m$p.value","y.nr.employed$p.value")

pvalue.chisquare <- data.frame(labelpvalue,pvalue)

arrange(pvalue.chisquare, pvalue)

##                 labelpvalue pvalue
## 1             y.age$p.value 0.0000
## 2             y.job$p.value 0.0000
## 3         y.contact$p.value 0.0000
## 4           y.month$p.value 0.0000
## 5        y.previous$p.value 0.0000
## 6        y.poutcome$p.value 0.0000
## 7    y.emp.var.rate$p.value 0.0000
## 8  y.cons.price.idx$p.value 0.0000
## 9   y.cons.conf.idx$p.value 0.0000
## 10      y.euribor3m$p.value 0.0000
## 11    y.nr.employed$p.value 0.0000
## 12      y.education$p.value 0.0644
## 13        y.marital$p.value 0.0722
## 14       y.campaign$p.value 0.4275
## 15           y.loan$p.value 0.5270
## 16    y.day_of_week$p.value 0.5550
## 17        y.housing$p.value 0.6313

Berdasarkan hasil Uji Chi_Square di atas, peubah yang terbukti memiliki asosiasi dengan peubah y (mau atau tidak berlangganan deposito berjangka) dengan pvalue < alpha=0.05 adalah peubah age , job, contact,month,previous,poutcome, emp.var.rate,cons.price.id, cons.conf.idx, euribor3m, dan nr.employed. Sedangkan peubah lainnya (education,marital,campaign,loan,day_of_week,housing) tidak signifikan (pvalue > alpha=0.05) berarti tidak ada cukup bukti untuk menyatakan ada asosiasi dengan peubah y (mau atau tidak berlangganan deposito berjangka) sehingga peubah tersebut tidak akan digunakan dalam menyusun model regresi logistik.

Menentukan Kategori Referensi Peubah Kategorik

Sebelum menyusun model regresi logistik, ditentukan dahulu kategori referensi untuk peubah kategorik sebagai berikut:

# Referensi
bank.ok$y <- relevel(bank.ok$y,ref="no")
bank.ok$job <- relevel(bank.ok$job,ref="unemployed")
bank.ok$contact <- relevel(bank.ok$contact,ref="cellular")
bank.ok$month <- relevel(bank.ok$month,ref="may")
bank.ok$poutcome <- relevel(bank.ok$poutcome,ref="success")

Membagi Data sebelum Menyusun Model Regresi Logistik

Dalam rangka membentuk model dan melakukan pengujian model yang telah dibuat, dilakukan pengambilan sampel dari data bank.ok. Pengambilan sampel ini dibagi menjadi 2 bagian data sehingga akan diperoleh dua bagian data,yaitu data train sebanyak 80% untuk membentuk model, dan data test sebanyak 20% untuk menguji model yang telah dibuat.

set.seed(1501202071) #NIM saya tanpa G
sample <- sample(nrow(bank.ok),floor(nrow(bank.ok)*0.8))
training.bank.ok <- bank.ok[sample,]
dim(training.bank.ok)

## [1] 2471   21

testing.bank.ok <- bank.ok[-sample,]
dim(testing.bank.ok)

## [1] 618  21

Terdapat 2471 observasi yang menjadi training data dan 618 observasi yang menjadi testing data.

Eksplorasi Data Training

summary(training.bank.ok)

##       age                 job          marital                   education  
##  Min.   :20.00   admin.     :680   divorced: 282   university.degree  :867  
##  1st Qu.:32.00   technician :454   married :1420   high.school        :587  
##  Median :37.00   blue-collar:447   single  : 769   professional.course:378  
##  Mean   :39.23   services   :220   unknown :   0   basic.9y           :326  
##  3rd Qu.:46.00   management :211                   basic.4y           :190  
##  Max.   :88.00   retired    :106                   basic.6y           :123  
##                  (Other)    :353                   (Other)            :  0  
##     default        housing          loan           contact         month    
##  no     :2470   no     :1114   no     :2074   cellular :1689   may    :789  
##  unknown:   0   unknown:   0   unknown:   0   telephone: 782   jul    :418  
##  yes    :   1   yes    :1357   yes    : 397                    aug    :384  
##                                                                nov    :317  
##                                                                jun    :288  
##                                                                apr    :131  
##                                                                (Other):144  
##  day_of_week    duration         campaign          pdays        previous     
##  fri:461     Min.   :   4.0   Min.   : 1.000   Min.   :  0   Min.   :0.0000  
##  mon:523     1st Qu.: 105.0   1st Qu.: 1.000   1st Qu.:999   1st Qu.:0.0000  
##  thu:500     Median : 182.0   Median : 2.000   Median :999   Median :0.0000  
##  tue:481     Mean   : 261.1   Mean   : 2.536   Mean   :954   Mean   :0.2133  
##  wed:506     3rd Qu.: 320.0   3rd Qu.: 3.000   3rd Qu.:999   3rd Qu.:0.0000  
##              Max.   :3643.0   Max.   :29.000   Max.   :999   Max.   :6.0000  
##                                                                              
##         poutcome     emp.var.rate      cons.price.idx  cons.conf.idx   
##  success    : 101   Min.   :-3.40000   Min.   :92.20   Min.   :-50.80  
##  failure    : 295   1st Qu.:-1.80000   1st Qu.:93.10   1st Qu.:-42.70  
##  nonexistent:2075   Median : 1.10000   Median :93.40   Median :-41.80  
##                     Mean   :-0.03626   Mean   :93.53   Mean   :-40.68  
##                     3rd Qu.: 1.40000   3rd Qu.:94.00   3rd Qu.:-36.40  
##                     Max.   : 1.40000   Max.   :94.80   Max.   :-26.90  
##                                                                        
##    euribor3m      nr.employed     y       
##  Min.   :0.600   Min.   :4964   no :2163  
##  1st Qu.:1.300   1st Qu.:5099   yes: 308  
##  Median :4.900   Median :5191             
##  Mean   :3.515   Mean   :5162             
##  3rd Qu.:5.000   3rd Qu.:5228             
##  Max.   :5.000   Max.   :5228             
##

Dari output di atas, terlihat bahwa terdapat 308 observasi yang masuk sebagai kategori pelanggan yang mau berlangganan deposito berjangka sedangkan yang tidak mau berlangganan ada sebanyak 2163. Persentase pelanggan yang terkategori sebagai yes (mau berlangganan deposito berjangka) hanya sekitar 12.46% sedangkan sisanya terkategori sebagai no (tidak mau berlangganan deposito berjangka). Hal ini menunjukkan bahwa data tersebut tidak seimbang sehingga selanjutnya akan dicobakan juga melakukan proses resampling dengan menggunakan undersampling dan oversampling. Hasil model regresi logistik dengan menggunakan teknik resampling akan dibandingkan dengan tanpa menggunakan teknik resampling dengan melihat nilai AIC dan akurasinya.

round((308/(2163+308))*100,2)

## [1] 12.46

Resampling

Ketidakseimbangan data terjadi pada saat suatu kelas atau kategori tertentu memiliki data yang lebih banyak dibandingkan dengan kategori lainnya (Jian et al., 2016). Ketidakseimbangan data ini perlu ditangani karena memengaruhi akurasi pada proses klasifikasi data (Thanathamathee et al., 2013). Teknik resampling merupakan salah satu solusi untuk menangani data yang tidak seimbang. Terdapat proses random undersampling dan oversampling. Random undersampling merupakan proses sampling yang dilakukan dengan mengeliminasi sebagian data pada kelas mayoritas secara acak (Prusa et al, 2015). Random oversampling merupakan proses sampling yang dilakukan dengan menambahkan jumlah data pada kelas minoritas secara acak.

Random Under Sampling

Teknik resampling diterapkan pada data training (data latih) untuk pembentukan model regresi regresi logistik biner. Proses undersampling dilakukan dengan mengeliminasi secara acak kategori mayoritas, yaitu kategori no (tidak mau berlangganan deposito berjangka) menjadi dari total 2163 menjadi 308 (sama dengan jumlah pada kategori yes, yaitu mau berlangganan deposito berjangka). Total dataset baru hasil undersampling sebanyak 616.

#under sampling
set.seed(1501202071) #NIM saya tanpa G
data_balanced_under <- ovun.sample(y ~ ., data = training.bank.ok, method = "under",N = 616)$data
table(data_balanced_under$y)  #note: N adalah banyak data totalnya setelah undersampling (308+308=616)

## 
##  no yes 
## 308 308

Random Over Sampling

Teknik resampling diterapkan pada data training (data latih) untuk pembentukan model regresi regresi logistik biner. Proses oversampling dilakukan dengan menaikkan secara acak kategori minoritas, yaitu kategori yes, yaitu mau berlangganan deposito berjangka) dari total 308 menjadi 2163 (sama dengan jumlah pada kategori no (tidak mau berlangganan deposito berjangka) . Total dataset baru hasil oversampling sebanyak 4326.

#over sampling
set.seed(1501202071) #NIM saya tanpa G
data_balanced_over <- ovun.sample(y ~ ., data = training.bank.ok, method = "over",N = 4326)$data
table(data_balanced_over$y)  #note: N adalah banyak data totalnya setelah oversampling (2163+2163=4326)

## 
##   no  yes 
## 2163 2163

Regresi Logistik Biner

Berdasarkan hasil Uji Chi_Square yang telah dilakukan sebelumnya, peubah yang terbukti memiliki asosiasi dengan peubah y (mau atau tidak berlangganan deposito berjangka) pada taraf alpha 5% adalah peubah age, job, contact,month,previous,poutcome, emp.var.rate,cons.price.idx, cons.conf.idx, euribor3m, dan nr.employed. 11 peubah ini yang akan digunakan dalam penyusunan model regresi logistik biner.

Tanpa Teknik Resampling

reglog <- glm(y~
age+job+contact+month+previous+poutcome+emp.var.rate+cons.price.idx+cons.conf.idx+ euribor3m+nr.employed,data=training.bank.ok, family=binomial("link"=logit))
summary(reglog)

## 
## Call:
## glm(formula = y ~ age + job + contact + month + previous + poutcome + 
##     emp.var.rate + cons.price.idx + cons.conf.idx + euribor3m + 
##     nr.employed, family = binomial(link = logit), data = training.bank.ok)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0007  -0.4041  -0.3294  -0.2631   2.8261  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -1.372e+02  1.259e+02  -1.090  0.27573    
## age                  9.085e-03  7.659e-03   1.186  0.23557    
## jobadmin.           -9.806e-02  4.075e-01  -0.241  0.80981    
## jobblue-collar      -3.145e-01  4.293e-01  -0.733  0.46373    
## jobentrepreneur     -9.450e-01  6.593e-01  -1.433  0.15174    
## jobhousemaid        -3.566e-01  6.090e-01  -0.585  0.55822    
## jobmanagement       -6.663e-01  4.764e-01  -1.399  0.16190    
## jobretired          -3.031e-01  5.091e-01  -0.595  0.55166    
## jobself-employed    -4.674e-01  5.340e-01  -0.875  0.38149    
## jobservices         -5.216e-01  4.701e-01  -1.110  0.26713    
## jobstudent          -9.756e-03  5.704e-01  -0.017  0.98635    
## jobtechnician       -6.607e-03  4.210e-01  -0.016  0.98748    
## contacttelephone    -1.198e+00  2.874e-01  -4.170 3.05e-05 ***
## monthapr             2.654e-01  3.030e-01   0.876  0.38104    
## monthaug             3.186e-01  3.393e-01   0.939  0.34775    
## monthdec             5.747e-01  6.342e-01   0.906  0.36485    
## monthjul            -8.026e-02  3.338e-01  -0.240  0.80999    
## monthjun             1.751e-01  4.892e-01   0.358  0.72039    
## monthmar             2.006e+00  4.373e-01   4.586 4.52e-06 ***
## monthnov            -2.956e-01  3.458e-01  -0.855  0.39273    
## monthoct             3.100e-01  4.658e-01   0.665  0.50575    
## monthsep             1.671e-01  5.294e-01   0.316  0.75230    
## previous             1.764e-01  1.846e-01   0.956  0.33924    
## poutcomefailure     -1.728e+00  2.998e-01  -5.765 8.17e-09 ***
## poutcomenonexistent -1.100e+00  3.845e-01  -2.860  0.00423 ** 
## emp.var.rate        -1.160e+00  4.590e-01  -2.528  0.01148 *  
## cons.price.idx       1.497e+00  8.222e-01   1.820  0.06873 .  
## cons.conf.idx        2.620e-02  2.766e-02   0.947  0.34350    
## euribor3m            4.026e-01  4.395e-01   0.916  0.35971    
## nr.employed         -8.222e-04  1.039e-02  -0.079  0.93692    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1858.6  on 2470  degrees of freedom
## Residual deviance: 1427.1  on 2441  degrees of freedom
## AIC: 1487.1
## 
## Number of Fisher Scoring iterations: 6

reglog.step <- step(reglog)

## Start:  AIC=1487.07
## y ~ age + job + contact + month + previous + poutcome + emp.var.rate + 
##     cons.price.idx + cons.conf.idx + euribor3m + nr.employed
## 
##                  Df Deviance    AIC
## - job            10   1437.1 1477.1
## - nr.employed     1   1427.1 1485.1
## - euribor3m       1   1427.9 1485.9
## - cons.conf.idx   1   1428.0 1486.0
## - previous        1   1428.0 1486.0
## - age             1   1428.5 1486.5
## <none>                1427.1 1487.1
## - cons.price.idx  1   1430.3 1488.3
## - emp.var.rate    1   1433.3 1491.3
## - month           9   1459.0 1501.0
## - contact         1   1447.5 1505.5
## - poutcome        2   1462.8 1518.8
## 
## Step:  AIC=1477.09
## y ~ age + contact + month + previous + poutcome + emp.var.rate + 
##     cons.price.idx + cons.conf.idx + euribor3m + nr.employed
## 
##                  Df Deviance    AIC
## - nr.employed     1   1437.2 1475.2
## - previous        1   1437.7 1475.7
## - cons.conf.idx   1   1437.8 1475.8
## - age             1   1438.0 1476.0
## - euribor3m       1   1438.2 1476.2
## <none>                1437.1 1477.1
## - cons.price.idx  1   1439.7 1477.7
## - emp.var.rate    1   1442.6 1480.6
## - month           9   1472.0 1494.0
## - contact         1   1456.4 1494.4
## - poutcome        2   1473.6 1509.6
## 
## Step:  AIC=1475.19
## y ~ age + contact + month + previous + poutcome + emp.var.rate + 
##     cons.price.idx + cons.conf.idx + euribor3m
## 
##                  Df Deviance    AIC
## - previous        1   1437.8 1473.8
## - age             1   1438.1 1474.1
## - euribor3m       1   1438.8 1474.8
## <none>                1437.2 1475.2
## - cons.conf.idx   1   1439.3 1475.3
## - emp.var.rate    1   1447.1 1483.1
## - month           9   1472.4 1492.4
## - cons.price.idx  1   1458.0 1494.0
## - contact         1   1458.3 1494.3
## - poutcome        2   1473.9 1507.9
## 
## Step:  AIC=1473.81
## y ~ age + contact + month + poutcome + emp.var.rate + cons.price.idx + 
##     cons.conf.idx + euribor3m
## 
##                  Df Deviance    AIC
## - age             1   1438.8 1472.8
## - euribor3m       1   1439.4 1473.4
## <none>                1437.8 1473.8
## - cons.conf.idx   1   1440.1 1474.1
## - emp.var.rate    1   1447.9 1481.9
## - month           9   1473.7 1491.7
## - contact         1   1459.7 1493.7
## - cons.price.idx  1   1460.3 1494.3
## - poutcome        2   1477.3 1509.3
## 
## Step:  AIC=1472.76
## y ~ contact + month + poutcome + emp.var.rate + cons.price.idx + 
##     cons.conf.idx + euribor3m
## 
##                  Df Deviance    AIC
## - euribor3m       1   1440.4 1472.4
## <none>                1438.8 1472.8
## - cons.conf.idx   1   1441.1 1473.1
## - emp.var.rate    1   1449.0 1481.0
## - month           9   1475.2 1491.2
## - contact         1   1460.4 1492.4
## - cons.price.idx  1   1461.8 1493.8
## - poutcome        2   1478.2 1508.2
## 
## Step:  AIC=1472.42
## y ~ contact + month + poutcome + emp.var.rate + cons.price.idx + 
##     cons.conf.idx
## 
##                  Df Deviance    AIC
## <none>                1440.4 1472.4
## - cons.conf.idx   1   1445.1 1475.1
## - month           9   1475.2 1489.2
## - contact         1   1460.5 1490.5
## - poutcome        2   1479.6 1507.6
## - cons.price.idx  1   1488.7 1518.7
## - emp.var.rate    1   1531.9 1561.9

reglog.step$anova

##            Step Df   Deviance Resid. Df Resid. Dev      AIC
## 1               NA         NA      2441   1427.071 1487.071
## 2         - job 10 10.0177174      2451   1437.088 1477.088
## 3 - nr.employed  1  0.1000574      2452   1437.188 1475.188
## 4    - previous  1  0.6180707      2453   1437.806 1473.806
## 5         - age  1  0.9518552      2454   1438.758 1472.758
## 6   - euribor3m  1  1.6589185      2455   1440.417 1472.417

reglog.step.ok <- glm(y ~ contact + month + poutcome + emp.var.rate + cons.price.idx + cons.conf.idx,data=training.bank.ok,family=binomial("link"=logit))
summary(reglog.step.ok)

## 
## Call:
## glm(formula = y ~ contact + month + poutcome + emp.var.rate + 
##     cons.price.idx + cons.conf.idx, family = binomial(link = logit), 
##     data = training.bank.ok)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8666  -0.4206  -0.3403  -0.2539   2.8115  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -117.36247   16.54358  -7.094 1.30e-12 ***
## contacttelephone      -1.11282    0.26529  -4.195 2.73e-05 ***
## monthapr               0.30784    0.27726   1.110   0.2669    
## monthaug               0.38135    0.31261   1.220   0.2225    
## monthdec               0.65817    0.60500   1.088   0.2766    
## monthjul               0.08672    0.30149   0.288   0.7736    
## monthjun               0.46666    0.26296   1.775   0.0760 .  
## monthmar               2.11117    0.38030   5.551 2.84e-08 ***
## monthnov              -0.01703    0.26452  -0.064   0.9487    
## monthoct               0.55009    0.39355   1.398   0.1622    
## monthsep               0.24669    0.41944   0.588   0.5564    
## poutcomefailure       -1.76207    0.29236  -6.027 1.67e-09 ***
## poutcomenonexistent   -1.35415    0.26578  -5.095 3.49e-07 ***
## emp.var.rate          -0.73076    0.07400  -9.875  < 2e-16 ***
## cons.price.idx         1.26267    0.17986   7.020 2.21e-12 ***
## cons.conf.idx          0.04056    0.01871   2.168   0.0301 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1858.6  on 2470  degrees of freedom
## Residual deviance: 1440.4  on 2455  degrees of freedom
## AIC: 1472.4
## 
## Number of Fisher Scoring iterations: 5

Interpretasi

# odds = exponensial parameter model
interpretasi <- data.frame(koefisien=summary(reglog.step.ok)$coefficients[,c(4,1)],exp.koefisien=exp(summary(reglog.step.ok)$coefficients[,1])) 
InterpretasiPeubahNyata <- interpretasi %>% filter(koefisien.Pr...z..< 0.05)
InterpretasiPeubahNyata

##                     koefisien.Pr...z.. koefisien.Estimate exp.koefisien
## (Intercept)               1.301571e-12      -117.36246845  1.071834e-51
## contacttelephone          2.731397e-05        -1.11281822  3.286315e-01
## monthmar                  2.835945e-08         2.11116983  8.257896e+00
## poutcomefailure           1.669190e-09        -1.76207327  1.716885e-01
## poutcomenonexistent       3.487552e-07        -1.35414929  2.581668e-01
## emp.var.rate              5.365213e-23        -0.73075892  4.815434e-01
## cons.price.idx            2.214118e-12         1.26267142  3.534852e+00
## cons.conf.idx             3.014686e-02         0.04055963  1.041393e+00

Evalusi Model

TANPA STEPWISE:

testing.bank.ok$ypred.ns <- predict(reglog, testing.bank.ok, type="response")
testing.bank.ok$ypred.ns <- ifelse(testing.bank.ok$ypred.ns > 0.50, "yes", "no")
testing.bank.ok$ypred.ns <- as.factor(testing.bank.ok$ypred.ns)
(conf.mat<-caret::confusionMatrix(testing.bank.ok$ypred.ns, testing.bank.ok$y, positive="yes"))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  544  47
##        yes  12  15
##                                           
##                Accuracy : 0.9045          
##                  95% CI : (0.8786, 0.9265)
##     No Information Rate : 0.8997          
##     P-Value [Acc > NIR] : 0.375           
##                                           
##                   Kappa : 0.2941          
##                                           
##  Mcnemar's Test P-Value : 9.581e-06       
##                                           
##             Sensitivity : 0.24194         
##             Specificity : 0.97842         
##          Pos Pred Value : 0.55556         
##          Neg Pred Value : 0.92047         
##              Prevalence : 0.10032         
##          Detection Rate : 0.02427         
##    Detection Prevalence : 0.04369         
##       Balanced Accuracy : 0.61018         
##                                           
##        'Positive' Class : yes             
##

Menggunakan data testing.

Confusion Matrix

Dalam confusion matrix, sensitivitas (atau True Positive Rate) adalah persentase pengamatan (aktual) yang diprediksi dengan benar oleh model, sedangkan spesifisitas adalah persentase dari 0 (aktual) yang diprediksi dengan benar.

Hasil confusion matrix menunjukkan bahwa model regresi logistik yang diperoleh memiliki persentase pengamatan yes (aktual) yang diprediksi dengan benar oleh model sebesar 27.419 persen sedangkan persentase dari no (aktual) yang diprediksi dengan benar oleh model adalah sebesar 98.381 persen dengan accuracy sebesar 91.26 persen.Jika dilihat dari nilai akurasinya, model ini dinilai baik.

testing.bank.ok$ypred <- predict(reglog.step.ok, testing.bank.ok, type="response")
testing.bank.ok$ypred <- ifelse(testing.bank.ok$ypred > 0.50, "yes", "no")
testing.bank.ok$ypred <- as.factor(testing.bank.ok$ypred)
(conf.mat<-caret::confusionMatrix(testing.bank.ok$ypred, testing.bank.ok$y, positive="yes"))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  547  45
##        yes   9  17
##                                           
##                Accuracy : 0.9126          
##                  95% CI : (0.8875, 0.9337)
##     No Information Rate : 0.8997          
##     P-Value [Acc > NIR] : 0.1575          
##                                           
##                   Kappa : 0.3477          
##                                           
##  Mcnemar's Test P-Value : 1.908e-06       
##                                           
##             Sensitivity : 0.27419         
##             Specificity : 0.98381         
##          Pos Pred Value : 0.65385         
##          Neg Pred Value : 0.92399         
##              Prevalence : 0.10032         
##          Detection Rate : 0.02751         
##    Detection Prevalence : 0.04207         
##       Balanced Accuracy : 0.62900         
##                                           
##        'Positive' Class : yes             
##

broom::tidy(conf.mat)

## # A tibble: 14 x 6
##    term                 class estimate conf.low conf.high     p.value
##    <chr>                <chr>    <dbl>    <dbl>     <dbl>       <dbl>
##  1 accuracy             <NA>    0.913     0.888     0.934  0.157     
##  2 kappa                <NA>    0.348    NA        NA     NA         
##  3 mcnemar              <NA>   NA        NA        NA      0.00000191
##  4 sensitivity          yes     0.274    NA        NA     NA         
##  5 specificity          yes     0.984    NA        NA     NA         
##  6 pos_pred_value       yes     0.654    NA        NA     NA         
##  7 neg_pred_value       yes     0.924    NA        NA     NA         
##  8 precision            yes     0.654    NA        NA     NA         
##  9 recall               yes     0.274    NA        NA     NA         
## 10 f1                   yes     0.386    NA        NA     NA         
## 11 prevalence           yes     0.100    NA        NA     NA         
## 12 detection_rate       yes     0.0275   NA        NA     NA         
## 13 detection_prevalence yes     0.0421   NA        NA     NA         
## 14 balanced_accuracy    yes     0.629    NA        NA     NA

Receiver Operating Characteristics (ROC) Curve

Receiver Operating Characteristics (ROC) curve melacak persentase true positive saat cut-off peluang prediksi diturunkan dari 1 menjadi 0. Model yang baik akan memperlihat kurva yang lebih curam, artinya True Positive Rate meningkat lebih cepat dibandingkan dengan False Positive Rate ketika cut-off menurun. Dengan kata lain, semakin besar luas area di bawah kurva ROC maka kemampuan prediksi yang dihasilkan oleh model semakin baik.

Dari model regresi logistik yang diperoleh, persentase true positive saat cut-off peluang prediksi diturunkan dari yes menjadi no sebesar 53.3 persen. Besarnya luas area di bawah kurva ROC ini mengindikasikan bahwa model yang diperoleh memiliki kemampuan prediksi yang cukup baik.

plotROC(testing.bank.ok=="yes", testing.bank.ok$ypred=="yes")

Dengan Teknik Resampling

Random Under Sampling

reglog.under <- glm(y~
age+job+contact+month+previous+poutcome+emp.var.rate+cons.price.idx+cons.conf.idx+ euribor3m+nr.employed,data=data_balanced_under, family=binomial("link"=logit))
summary(reglog.under)

## 
## Call:
## glm(formula = y ~ age + job + contact + month + previous + poutcome + 
##     emp.var.rate + cons.price.idx + cons.conf.idx + euribor3m + 
##     nr.employed, family = binomial(link = logit), data = data_balanced_under)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4255  -0.8062  -0.1450   0.6923   2.0804  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)   
## (Intercept)         -2.310e+02  1.758e+02  -1.314  0.18879   
## age                  1.631e-02  1.183e-02   1.379  0.16795   
## jobadmin.           -4.272e-01  7.658e-01  -0.558  0.57696   
## jobblue-collar      -7.362e-01  7.831e-01  -0.940  0.34716   
## jobentrepreneur     -1.270e+00  9.755e-01  -1.302  0.19291   
## jobhousemaid        -4.584e-01  1.061e+00  -0.432  0.66569   
## jobmanagement       -1.415e+00  8.331e-01  -1.699  0.08940 . 
## jobretired          -6.595e-01  9.265e-01  -0.712  0.47656   
## jobself-employed    -6.312e-01  9.271e-01  -0.681  0.49595   
## jobservices         -1.028e+00  8.263e-01  -1.244  0.21365   
## jobstudent          -1.738e-01  1.157e+00  -0.150  0.88056   
## jobtechnician       -2.678e-01  7.802e-01  -0.343  0.73139   
## contacttelephone    -1.049e+00  4.278e-01  -2.453  0.01418 * 
## monthapr             4.079e-01  4.741e-01   0.860  0.38959   
## monthaug             5.649e-01  5.866e-01   0.963  0.33559   
## monthdec             1.003e+00  1.196e+00   0.838  0.40177   
## monthjul             2.133e-01  4.637e-01   0.460  0.64560   
## monthjun            -3.752e-01  7.113e-01  -0.528  0.59783   
## monthmar             2.996e+00  1.129e+00   2.653  0.00799 **
## monthnov            -1.515e-01  4.795e-01  -0.316  0.75206   
## monthoct             1.240e+00  8.955e-01   1.385  0.16618   
## monthsep             4.753e-01  8.897e-01   0.534  0.59323   
## previous             7.301e-01  5.101e-01   1.431  0.15232   
## poutcomefailure     -1.997e+00  6.857e-01  -2.912  0.00359 **
## poutcomenonexistent -7.967e-01  9.001e-01  -0.885  0.37612   
## emp.var.rate        -1.457e+00  6.973e-01  -2.089  0.03670 * 
## cons.price.idx       2.061e+00  1.172e+00   1.759  0.07854 . 
## cons.conf.idx        1.687e-02  5.329e-02   0.317  0.75151   
## euribor3m            1.715e-01  6.905e-01   0.248  0.80382   
## nr.employed          7.512e-03  1.456e-02   0.516  0.60580   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 853.96  on 615  degrees of freedom
## Residual deviance: 607.86  on 586  degrees of freedom
## AIC: 667.86
## 
## Number of Fisher Scoring iterations: 6

reglog.step.under <- step(reglog.under)

## Start:  AIC=667.86
## y ~ age + job + contact + month + previous + poutcome + emp.var.rate + 
##     cons.price.idx + cons.conf.idx + euribor3m + nr.employed
## 
##                  Df Deviance    AIC
## - job            10   619.84 659.84
## - euribor3m       1   607.92 665.92
## - cons.conf.idx   1   607.96 665.96
## - nr.employed     1   608.12 666.12
## - age             1   609.77 667.77
## <none>                607.86 667.86
## - previous        1   610.61 668.61
## - month           9   626.64 668.64
## - cons.price.idx  1   611.00 669.00
## - emp.var.rate    1   612.31 670.31
## - contact         1   614.26 672.26
## - poutcome        2   621.11 677.11
## 
## Step:  AIC=659.84
## y ~ age + contact + month + previous + poutcome + emp.var.rate + 
##     cons.price.idx + cons.conf.idx + euribor3m + nr.employed
## 
##                  Df Deviance    AIC
## - cons.conf.idx   1   619.85 657.85
## - nr.employed     1   619.86 657.86
## - euribor3m       1   620.12 658.12
## - age             1   621.40 659.40
## - previous        1   621.66 659.66
## <none>                619.84 659.84
## - cons.price.idx  1   622.07 660.07
## - emp.var.rate    1   623.62 661.62
## - contact         1   625.88 663.88
## - month           9   642.41 664.41
## - poutcome        2   633.85 669.85
## 
## Step:  AIC=657.85
## y ~ age + contact + month + previous + poutcome + emp.var.rate + 
##     cons.price.idx + euribor3m + nr.employed
## 
##                  Df Deviance    AIC
## - nr.employed     1   619.86 655.86
## - euribor3m       1   620.65 656.65
## - age             1   621.43 657.43
## - previous        1   621.67 657.67
## <none>                619.85 657.85
## - cons.price.idx  1   622.53 658.53
## - emp.var.rate    1   623.63 659.63
## - month           9   642.41 662.41
## - contact         1   626.68 662.68
## - poutcome        2   634.04 668.04
## 
## Step:  AIC=655.86
## y ~ age + contact + month + previous + poutcome + emp.var.rate + 
##     cons.price.idx + euribor3m
## 
##                  Df Deviance    AIC
## - euribor3m       1   621.04 655.04
## - age             1   621.45 655.45
## - previous        1   621.70 655.70
## <none>                619.86 655.86
## - emp.var.rate    1   626.23 660.23
## - contact         1   626.71 660.71
## - month           9   644.94 662.94
## - cons.price.idx  1   629.05 663.05
## - poutcome        2   634.05 666.05
## 
## Step:  AIC=655.04
## y ~ age + contact + month + previous + poutcome + emp.var.rate + 
##     cons.price.idx
## 
##                  Df Deviance    AIC
## - age             1   622.76 654.76
## - previous        1   623.01 655.01
## <none>                621.04 655.04
## - contact         1   626.73 658.73
## - month           9   645.58 661.58
## - cons.price.idx  1   633.75 665.75
## - poutcome        2   635.91 665.91
## - emp.var.rate    1   663.20 695.20
## 
## Step:  AIC=654.76
## y ~ contact + month + previous + poutcome + emp.var.rate + cons.price.idx
## 
##                  Df Deviance    AIC
## <none>                622.76 654.76
## - previous        1   624.77 654.77
## - contact         1   628.23 658.23
## - month           9   647.48 661.48
## - poutcome        2   637.27 665.27
## - cons.price.idx  1   635.54 665.54
## - emp.var.rate    1   666.38 696.38

reglog.step.under$anova

##              Step Df     Deviance Resid. Df Resid. Dev      AIC
## 1                 NA           NA       586   607.8592 667.8592
## 2           - job 10 11.984416544       596   619.8436 659.8436
## 3 - cons.conf.idx  1  0.005974104       597   619.8496 657.8496
## 4   - nr.employed  1  0.014236906       598   619.8638 655.8638
## 5     - euribor3m  1  1.173879853       599   621.0377 655.0377
## 6           - age  1  1.723507083       600   622.7612 654.7612

reglog.step.ok.under <- glm(y ~ contact + month + poutcome + emp.var.rate + cons.price.idx + euribor3m,data=data_balanced_under,family=binomial("link"=logit))
summary(reglog.step.ok.under)

## 
## Call:
## glm(formula = y ~ contact + month + poutcome + emp.var.rate + 
##     cons.price.idx + euribor3m, family = binomial(link = logit), 
##     data = data_balanced_under)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6416  -0.7606  -0.1931   0.7535   1.9974  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)   
## (Intercept)         -157.8334    51.2712  -3.078  0.00208 **
## contacttelephone      -0.9682     0.3783  -2.559  0.01049 * 
## monthapr               0.5522     0.4249   1.300  0.19367   
## monthaug               0.8139     0.3887   2.094  0.03626 * 
## monthdec               1.0268     1.1382   0.902  0.36697   
## monthjul               0.3726     0.4300   0.866  0.38628   
## monthjun              -0.1108     0.4715  -0.235  0.81417   
## monthmar               2.8913     1.0514   2.750  0.00596 **
## monthnov              -0.2272     0.4497  -0.505  0.61335   
## monthoct               1.3870     0.8765   1.582  0.11355   
## monthsep               0.5270     0.7265   0.726  0.46814   
## poutcomefailure       -2.1388     0.6679  -3.202  0.00136 **
## poutcomenonexistent   -1.8805     0.6416  -2.931  0.00338 **
## emp.var.rate          -1.3746     0.5136  -2.677  0.00744 **
## cons.price.idx         1.6853     0.5378   3.134  0.00173 **
## euribor3m              0.4639     0.3871   1.198  0.23074   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 853.96  on 615  degrees of freedom
## Residual deviance: 623.32  on 600  degrees of freedom
## AIC: 655.32
## 
## Number of Fisher Scoring iterations: 5

Interpretasi

# odds = exponensial parameter model
interpretasi.under <- data.frame(koefisien=summary(reglog.step.ok.under)$coefficients[,c(4,1)],exp.koefisien=exp(summary(reglog.step.ok.under)$coefficients[,1])) 
InterpretasiPeubahNyata.under <- interpretasi %>% filter(koefisien.Pr...z..< 0.05)
InterpretasiPeubahNyata.under

##                     koefisien.Pr...z.. koefisien.Estimate exp.koefisien
## (Intercept)               1.301571e-12      -117.36246845  1.071834e-51
## contacttelephone          2.731397e-05        -1.11281822  3.286315e-01
## monthmar                  2.835945e-08         2.11116983  8.257896e+00
## poutcomefailure           1.669190e-09        -1.76207327  1.716885e-01
## poutcomenonexistent       3.487552e-07        -1.35414929  2.581668e-01
## emp.var.rate              5.365213e-23        -0.73075892  4.815434e-01
## cons.price.idx            2.214118e-12         1.26267142  3.534852e+00
## cons.conf.idx             3.014686e-02         0.04055963  1.041393e+00

Evalusi Model

Menggunakan data testing.

TANPA STEPWISE:

testing.bank.ok$ypred.ns.under <- predict(reglog.under, testing.bank.ok, type="response")
testing.bank.ok$ypred.ns.under <- ifelse(testing.bank.ok$ypred.ns.under > 0.50, "yes", "no")
testing.bank.ok$ypred.ns.under <- as.factor(testing.bank.ok$ypred.ns.under)
(conf.mat<-caret::confusionMatrix(testing.bank.ok$ypred.ns.under, testing.bank.ok$y, positive="yes"))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  446  17
##        yes 110  45
##                                           
##                Accuracy : 0.7945          
##                  95% CI : (0.7605, 0.8257)
##     No Information Rate : 0.8997          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3168          
##                                           
##  Mcnemar's Test P-Value : 3.25e-16        
##                                           
##             Sensitivity : 0.72581         
##             Specificity : 0.80216         
##          Pos Pred Value : 0.29032         
##          Neg Pred Value : 0.96328         
##              Prevalence : 0.10032         
##          Detection Rate : 0.07282         
##    Detection Prevalence : 0.25081         
##       Balanced Accuracy : 0.76398         
##                                           
##        'Positive' Class : yes             
##

Confusion Matrix

Dalam confusion matrix, sensitivitas (atau True Positive Rate) adalah persentase pengamatan (aktual) yang diprediksi dengan benar oleh model, sedangkan spesifisitas adalah persentase dari 0 (aktual) yang diprediksi dengan benar.

Hasil confusion matrix menunjukkan bahwa model regresi logistik yang diperoleh memiliki persentase pengamatan yes (aktual) yang diprediksi dengan benar oleh model sebesar 70.97 persen sedangkan persentase dari no (aktual) yang diprediksi dengan benar oleh model adalah sebesar 83.99 persen dengan accuracy sebesar 82.69 persen.Jika dilihat dari nilai akurasinya, model ini dinilai baik.

testing.bank.ok$ypred.under <- predict(reglog.step.ok.under, testing.bank.ok, type="response")
testing.bank.ok$ypred.under <- ifelse(testing.bank.ok$ypred.under > 0.50, "yes", "no")
testing.bank.ok$ypred.under <- as.factor(testing.bank.ok$ypred.under)
(conf.mat<-caret::confusionMatrix(testing.bank.ok$ypred.under, testing.bank.ok$y, positive="yes"))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  467  18
##        yes  89  44
##                                           
##                Accuracy : 0.8269          
##                  95% CI : (0.7947, 0.8559)
##     No Information Rate : 0.8997          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3643          
##                                           
##  Mcnemar's Test P-Value : 1.313e-11       
##                                           
##             Sensitivity : 0.7097          
##             Specificity : 0.8399          
##          Pos Pred Value : 0.3308          
##          Neg Pred Value : 0.9629          
##              Prevalence : 0.1003          
##          Detection Rate : 0.0712          
##    Detection Prevalence : 0.2152          
##       Balanced Accuracy : 0.7748          
##                                           
##        'Positive' Class : yes             
##

broom::tidy(conf.mat)

## # A tibble: 14 x 6
##    term                 class estimate conf.low conf.high   p.value
##    <chr>                <chr>    <dbl>    <dbl>     <dbl>     <dbl>
##  1 accuracy             <NA>    0.827     0.795     0.856  1.00e+ 0
##  2 kappa                <NA>    0.364    NA        NA     NA       
##  3 mcnemar              <NA>   NA        NA        NA      1.31e-11
##  4 sensitivity          yes     0.710    NA        NA     NA       
##  5 specificity          yes     0.840    NA        NA     NA       
##  6 pos_pred_value       yes     0.331    NA        NA     NA       
##  7 neg_pred_value       yes     0.963    NA        NA     NA       
##  8 precision            yes     0.331    NA        NA     NA       
##  9 recall               yes     0.710    NA        NA     NA       
## 10 f1                   yes     0.451    NA        NA     NA       
## 11 prevalence           yes     0.100    NA        NA     NA       
## 12 detection_rate       yes     0.0712   NA        NA     NA       
## 13 detection_prevalence yes     0.215    NA        NA     NA       
## 14 balanced_accuracy    yes     0.775    NA        NA     NA

Receiver Operating Characteristics (ROC) Curve

Receiver Operating Characteristics (ROC) curve melacak persentase true positive saat cut-off peluang prediksi diturunkan dari 1 menjadi 0. Model yang baik akan memperlihat kurva yang lebih curam, artinya True Positive Rate meningkat lebih cepat dibandingkan dengan False Positive Rate ketika cut-off menurun. Dengan kata lain, semakin besar luas area di bawah kurva ROC maka kemampuan prediksi yang dihasilkan oleh model semakin baik.

Dari model regresi logistik yang diperoleh, persentase true positive saat cut-off peluang prediksi diturunkan dari yes menjadi no sebesar 57.80 persen. Besarnya luas area di bawah kurva ROC ini mengindikasikan bahwa model yang diperoleh memiliki kemampuan prediksi yang cukup baik.

plotROC(testing.bank.ok=="yes", testing.bank.ok$ypred.under=="yes")

Random Over Sampling

reglog.over <- glm(y~
age+job+contact+month+previous+poutcome+emp.var.rate+cons.price.idx+cons.conf.idx+ euribor3m+nr.employed,data=data_balanced_over, family=binomial("link"=logit))
summary(reglog.over)

## 
## Call:
## glm(formula = y ~ age + job + contact + month + previous + poutcome + 
##     emp.var.rate + cons.price.idx + cons.conf.idx + euribor3m + 
##     nr.employed, family = binomial(link = logit), data = data_balanced_over)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8919  -0.8440  -0.1087   0.7627   2.0713  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -1.529e+02  6.801e+01  -2.248 0.024574 *  
## age                  8.868e-03  4.119e-03   2.153 0.031326 *  
## jobadmin.            5.182e-01  2.507e-01   2.067 0.038743 *  
## jobblue-collar       2.755e-01  2.588e-01   1.065 0.287029    
## jobentrepreneur     -4.978e-01  3.557e-01  -1.400 0.161638    
## jobhousemaid         1.761e-01  3.460e-01   0.509 0.610739    
## jobmanagement        2.869e-02  2.757e-01   0.104 0.917106    
## jobretired           5.404e-01  3.052e-01   1.771 0.076632 .  
## jobself-employed    -1.015e-01  3.109e-01  -0.326 0.744131    
## jobservices         -5.972e-02  2.748e-01  -0.217 0.827935    
## jobstudent           2.832e-01  3.509e-01   0.807 0.419650    
## jobtechnician        6.130e-01  2.569e-01   2.386 0.017029 *  
## contacttelephone    -1.115e+00  1.569e-01  -7.104 1.21e-12 ***
## monthapr             2.093e-01  1.652e-01   1.267 0.205276    
## monthaug             6.892e-01  2.010e-01   3.429 0.000607 ***
## monthdec             1.106e+00  4.144e-01   2.669 0.007615 ** 
## monthjul             1.641e-01  1.691e-01   0.971 0.331656    
## monthjun             1.413e-01  2.758e-01   0.512 0.608407    
## monthmar             2.145e+00  3.090e-01   6.940 3.93e-12 ***
## monthnov            -1.903e-01  1.763e-01  -1.080 0.280312    
## monthoct             5.461e-01  2.804e-01   1.948 0.051471 .  
## monthsep             4.344e-01  3.141e-01   1.383 0.166692    
## previous             2.378e-01  1.318e-01   1.804 0.071289 .  
## poutcomefailure     -1.542e+00  2.181e-01  -7.070 1.55e-12 ***
## poutcomenonexistent -1.015e+00  2.774e-01  -3.659 0.000253 ***
## emp.var.rate        -1.492e+00  2.523e-01  -5.914 3.35e-09 ***
## cons.price.idx       1.658e+00  4.464e-01   3.714 0.000204 ***
## cons.conf.idx        7.462e-03  1.813e-02   0.411 0.680718    
## euribor3m            6.513e-01  2.515e-01   2.590 0.009600 ** 
## nr.employed         -8.226e-04  5.658e-03  -0.145 0.884391    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5997.1  on 4325  degrees of freedom
## Residual deviance: 4502.4  on 4296  degrees of freedom
## AIC: 4562.4
## 
## Number of Fisher Scoring iterations: 5

reglog.step.over <- step(reglog.over)

## Start:  AIC=4562.43
## y ~ age + job + contact + month + previous + poutcome + emp.var.rate + 
##     cons.price.idx + cons.conf.idx + euribor3m + nr.employed
## 
##                  Df Deviance    AIC
## - nr.employed     1   4502.4 4560.4
## - cons.conf.idx   1   4502.6 4560.6
## <none>                4502.4 4562.4
## - previous        1   4505.9 4563.9
## - age             1   4507.1 4565.1
## - euribor3m       1   4509.2 4567.2
## - cons.price.idx  1   4516.2 4574.2
## - job            10   4554.5 4594.5
## - emp.var.rate    1   4537.4 4595.4
## - contact         1   4555.8 4613.8
## - poutcome        2   4562.6 4618.6
## - month           9   4612.5 4654.5
## 
## Step:  AIC=4560.45
## y ~ age + job + contact + month + previous + poutcome + emp.var.rate + 
##     cons.price.idx + cons.conf.idx + euribor3m
## 
##                  Df Deviance    AIC
## - cons.conf.idx   1   4503.0 4559.0
## <none>                4502.4 4560.4
## - previous        1   4506.0 4562.0
## - age             1   4507.1 4563.1
## - euribor3m       1   4517.6 4573.6
## - job            10   4554.7 4592.7
## - emp.var.rate    1   4553.2 4609.2
## - poutcome        2   4562.7 4616.7
## - contact         1   4561.2 4617.2
## - cons.price.idx  1   4579.6 4635.6
## - month           9   4617.2 4657.2
## 
## Step:  AIC=4559
## y ~ age + job + contact + month + previous + poutcome + emp.var.rate + 
##     cons.price.idx + euribor3m
## 
##                  Df Deviance    AIC
## <none>                4503.0 4559.0
## - previous        1   4506.7 4560.7
## - age             1   4507.6 4561.6
## - euribor3m       1   4525.0 4579.0
## - job            10   4556.2 4592.2
## - poutcome        2   4565.6 4617.6
## - contact         1   4564.0 4618.0
## - emp.var.rate    1   4573.0 4627.0
## - cons.price.idx  1   4588.5 4642.5
## - month           9   4632.8 4670.8

reglog.step.over$anova

##              Step Df   Deviance Resid. Df Resid. Dev      AIC
## 1                 NA         NA      4296   4502.429 4562.429
## 2   - nr.employed  1 0.02115926      4297   4502.450 4560.450
## 3 - cons.conf.idx  1 0.55052307      4298   4503.000 4559.000

reglog.step.ok.over <- glm(y ~ age + job + contact + month + poutcome + emp.var.rate + cons.price.idx + euribor3m,data=data_balanced_over,family=binomial("link"=logit))
summary(reglog.step.ok.over)

## 
## Call:
## glm(formula = y ~ age + job + contact + month + poutcome + emp.var.rate + 
##     cons.price.idx + euribor3m, family = binomial(link = logit), 
##     data = data_balanced_over)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.69556  -0.84664  -0.09079   0.76496   2.07724  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -1.722e+02  1.867e+01  -9.227  < 2e-16 ***
## age                  8.777e-03  4.121e-03   2.130  0.03321 *  
## jobadmin.            5.390e-01  2.504e-01   2.153  0.03132 *  
## jobblue-collar       2.957e-01  2.582e-01   1.145  0.25213    
## jobentrepreneur     -4.786e-01  3.558e-01  -1.345  0.17865    
## jobhousemaid         1.965e-01  3.457e-01   0.568  0.56980    
## jobmanagement        8.123e-02  2.731e-01   0.297  0.76611    
## jobretired           5.695e-01  3.045e-01   1.870  0.06149 .  
## jobself-employed    -8.840e-02  3.107e-01  -0.284  0.77604    
## jobservices         -3.927e-02  2.748e-01  -0.143  0.88639    
## jobstudent           2.510e-01  3.512e-01   0.715  0.47474    
## jobtechnician        6.449e-01  2.565e-01   2.514  0.01193 *  
## contacttelephone    -1.094e+00  1.414e-01  -7.737 1.01e-14 ***
## monthapr             1.559e-01  1.514e-01   1.030  0.30314    
## monthaug             8.402e-01  1.474e-01   5.700 1.20e-08 ***
## monthdec             1.144e+00  4.027e-01   2.840  0.00451 ** 
## monthjul             1.866e-01  1.598e-01   1.168  0.24293    
## monthjun             7.058e-02  1.803e-01   0.391  0.69552    
## monthmar             2.160e+00  2.811e-01   7.684 1.55e-14 ***
## monthnov            -1.986e-01  1.708e-01  -1.163  0.24483    
## monthoct             5.648e-01  2.711e-01   2.083  0.03721 *  
## monthsep             5.654e-01  2.556e-01   2.212  0.02696 *  
## poutcomefailure     -1.614e+00  2.140e-01  -7.543 4.58e-14 ***
## poutcomenonexistent -1.382e+00  2.017e-01  -6.855 7.13e-12 ***
## emp.var.rate        -1.621e+00  1.923e-01  -8.429  < 2e-16 ***
## cons.price.idx       1.818e+00  1.956e-01   9.298  < 2e-16 ***
## euribor3m            6.905e-01  1.450e-01   4.763 1.90e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5997.1  on 4325  degrees of freedom
## Residual deviance: 4506.7  on 4299  degrees of freedom
## AIC: 4560.7
## 
## Number of Fisher Scoring iterations: 5

Interpretasi

# odds = exponensial parameter model
interpretasi.over <- data.frame(koefisien=summary(reglog.step.ok.over)$coefficients[,c(4,1)],exp.koefisien=exp(summary(reglog.step.ok.over)$coefficients[,1])) 
InterpretasiPeubahNyata.over <- interpretasi %>% filter(koefisien.Pr...z..< 0.05)
InterpretasiPeubahNyata.over

##                     koefisien.Pr...z.. koefisien.Estimate exp.koefisien
## (Intercept)               1.301571e-12      -117.36246845  1.071834e-51
## contacttelephone          2.731397e-05        -1.11281822  3.286315e-01
## monthmar                  2.835945e-08         2.11116983  8.257896e+00
## poutcomefailure           1.669190e-09        -1.76207327  1.716885e-01
## poutcomenonexistent       3.487552e-07        -1.35414929  2.581668e-01
## emp.var.rate              5.365213e-23        -0.73075892  4.815434e-01
## cons.price.idx            2.214118e-12         1.26267142  3.534852e+00
## cons.conf.idx             3.014686e-02         0.04055963  1.041393e+00

Evalusi Model

Menggunakan data testing.

testing.bank.ok$ypred.ns.over <- predict(reglog.over, testing.bank.ok, type="response")
testing.bank.ok$ypred.ns.over <- ifelse(testing.bank.ok$ypred.ns.over > 0.50, "yes", "no")
testing.bank.ok$ypred.ns.over <- as.factor(testing.bank.ok$ypred.ns.over)
(conf.mat<-caret::confusionMatrix(testing.bank.ok$ypred.ns.over, testing.bank.ok$y, positive="yes"))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  466  18
##        yes  90  44
##                                          
##                Accuracy : 0.8252         
##                  95% CI : (0.793, 0.8544)
##     No Information Rate : 0.8997         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.3614         
##                                          
##  Mcnemar's Test P-Value : 8.375e-12      
##                                          
##             Sensitivity : 0.7097         
##             Specificity : 0.8381         
##          Pos Pred Value : 0.3284         
##          Neg Pred Value : 0.9628         
##              Prevalence : 0.1003         
##          Detection Rate : 0.0712         
##    Detection Prevalence : 0.2168         
##       Balanced Accuracy : 0.7739         
##                                          
##        'Positive' Class : yes            
##

Confusion Matrix

Dalam confusion matrix, sensitivitas (atau True Positive Rate) adalah persentase pengamatan (aktual) yang diprediksi dengan benar oleh model, sedangkan spesifisitas adalah persentase dari 0 (aktual) yang diprediksi dengan benar.

Hasil confusion matrix menunjukkan bahwa model regresi logistik yang diperoleh memiliki persentase pengamatan yes (aktual) yang diprediksi dengan benar oleh model sebesar 70.97 persen sedangkan persentase dari no (aktual) yang diprediksi dengan benar oleh model adalah sebesar 81.12 persen dengan accuracy sebesar 80.10 persen.Jika dilihat dari nilai akurasinya, model ini dinilai baik.

testing.bank.ok$ypred.over <- predict(reglog.step.ok.over, testing.bank.ok, type="response")
testing.bank.ok$ypred.over <- ifelse(testing.bank.ok$ypred.over > 0.50, "yes", "no")
testing.bank.ok$ypred.over <- as.factor(testing.bank.ok$ypred.over)
(conf.mat.over<-caret::confusionMatrix(testing.bank.ok$ypred.over, testing.bank.ok$y, positive="yes"))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  464  18
##        yes  92  44
##                                           
##                Accuracy : 0.822           
##                  95% CI : (0.7895, 0.8514)
##     No Information Rate : 0.8997          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3556          
##                                           
##  Mcnemar's Test P-Value : 3.396e-12       
##                                           
##             Sensitivity : 0.7097          
##             Specificity : 0.8345          
##          Pos Pred Value : 0.3235          
##          Neg Pred Value : 0.9627          
##              Prevalence : 0.1003          
##          Detection Rate : 0.0712          
##    Detection Prevalence : 0.2201          
##       Balanced Accuracy : 0.7721          
##                                           
##        'Positive' Class : yes             
##

broom::tidy(conf.mat.over)

## # A tibble: 14 x 6
##    term                 class estimate conf.low conf.high   p.value
##    <chr>                <chr>    <dbl>    <dbl>     <dbl>     <dbl>
##  1 accuracy             <NA>    0.822     0.790     0.851  1.00e+ 0
##  2 kappa                <NA>    0.356    NA        NA     NA       
##  3 mcnemar              <NA>   NA        NA        NA      3.40e-12
##  4 sensitivity          yes     0.710    NA        NA     NA       
##  5 specificity          yes     0.835    NA        NA     NA       
##  6 pos_pred_value       yes     0.324    NA        NA     NA       
##  7 neg_pred_value       yes     0.963    NA        NA     NA       
##  8 precision            yes     0.324    NA        NA     NA       
##  9 recall               yes     0.710    NA        NA     NA       
## 10 f1                   yes     0.444    NA        NA     NA       
## 11 prevalence           yes     0.100    NA        NA     NA       
## 12 detection_rate       yes     0.0712   NA        NA     NA       
## 13 detection_prevalence yes     0.220    NA        NA     NA       
## 14 balanced_accuracy    yes     0.772    NA        NA     NA

Receiver Operating Characteristics (ROC) Curve

Receiver Operating Characteristics (ROC) curve melacak persentase true positive saat cut-off peluang prediksi diturunkan dari 1 menjadi 0. Model yang baik akan memperlihat kurva yang lebih curam, artinya True Positive Rate meningkat lebih cepat dibandingkan dengan False Positive Rate ketika cut-off menurun. Dengan kata lain, semakin besar luas area di bawah kurva ROC maka kemampuan prediksi yang dihasilkan oleh model semakin baik.

Dari model regresi logistik yang diperoleh, persentase true positive saat cut-off peluang prediksi diturunkan dari yes menjadi no sebesar 60.62 persen. Besarnya luas area di bawah kurva ROC ini mengindikasikan bahwa model yang diperoleh memiliki kemampuan prediksi yang cukup baik.

plotROC(testing.bank.ok=="yes", testing.bank.ok$ypred.over=="yes")

Daftar Pustaka

Jian C, Gao J, Ao Y. (2016). A New Sampling Method for Classifying Imbalanced Data Based on Support Vector Machine Ensemble. Journal Neurocomputing. 193: 115-122.doi: 10.1016/j.neucom.2016.02.006

Thanathamathee P, Lursinap C. (2013). Handling Imbalanced Datasets with Synthetic Boundary Data Generation Using Bootstrap Resampling and AdaBoost Techniques. Pattern Recognition Letters.Elsevier BV 34(12): 1339-1347.doi: 10.1016/j.patrec.2013.04.019

Prusa J, Khoshgoftaar TM, Dittman DJ, Napolitano A. (2015). Using Random Undersampling to Alleviate Class Imbalance on Tweet Sentiment Data. IEEE Xplore. doi: 10.1109/IRI.2015.39

G1501202071, Mahasiswa Pascasarjana Statistika dan Sains Data, IPB University, reniamelia@apps.ipb.ac.id ↩︎