library(DT)
library(graphics)
library(tidyverse)
library(class)

1 Introduction

Logistic Regression and K-Nearest Neighbor is a well-known supervised machine learning technique that deals with prediction in classification. We try to get insights from The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution.

1.1 The Source of The Data

To access the data, check the followinG URL https://archive.ics.uci.edu/ml/datasets/bank+marketing

2 Reading The Data

bank <- read.csv("bank-full.csv", header = TRUE, sep = ";")
rmarkdown::paged_table(bank)
dim(bank)
#> [1] 45211    17
names(bank)
#>  [1] "age"       "job"       "marital"   "education" "default"   "balance"  
#>  [7] "housing"   "loan"      "contact"   "day"       "month"     "duration" 
#> [13] "campaign"  "pdays"     "previous"  "poutcome"  "y"

The dataset includes information as follow:

  • 1 - age (numeric)
  • 2 - job : type of job (categorical)
  • 3 - marital : marital status (categorical)
  • 4 - education (categorical)
  • 5 - default: has credit in default? (categorical)
  • 6 - housing: has housing loan? (categorical)
  • 7 - loan: has personal loan? (categorical)
  • 8 - contact: contact communication type (categorical)
  • 9 - month: last contact month of year (categorical)
  • 10 - day_of_week: last contact day of the week (categorical)
  • 11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=‘no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
  • 12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  • 13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
  • 14 - previous: number of contacts performed before this campaign and for this client (numeric)
  • 15 - poutcome: outcome of the previous marketing campaign (categorical)
  • 16 - balance: average yearly balance
  • 17 - y - has the client subscribed a term deposit? (binary: ‘yes’,‘no’)

3 Data Wrangling

3.1 Checking Missing Value

We will see whether the data have missing value or not. Missing values are represented by the symbol NA (not available) in R.

colSums(is.na(bank))
#>       age       job   marital education   default   balance   housing      loan 
#>         0         0         0         0         0         0         0         0 
#>   contact       day     month  duration  campaign     pdays  previous  poutcome 
#>         0         0         0         0         0         0         0         0 
#>         y 
#>         0
anyNA(bank)
#> [1] FALSE

3.2 Converting Variables Type

We see no missing value in the data. The next step is viewing the metadata to check the type of each variable.

str(bank)
#> 'data.frame':    45211 obs. of  17 variables:
#>  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
#>  $ job      : chr  "management" "technician" "entrepreneur" "blue-collar" ...
#>  $ marital  : chr  "married" "single" "married" "married" ...
#>  $ education: chr  "tertiary" "secondary" "secondary" "unknown" ...
#>  $ default  : chr  "no" "no" "no" "no" ...
#>  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
#>  $ housing  : chr  "yes" "yes" "yes" "yes" ...
#>  $ loan     : chr  "no" "no" "yes" "no" ...
#>  $ contact  : chr  "unknown" "unknown" "unknown" "unknown" ...
#>  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
#>  $ month    : chr  "may" "may" "may" "may" ...
#>  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
#>  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
#>  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ poutcome : chr  "unknown" "unknown" "unknown" "unknown" ...
#>  $ y        : chr  "no" "no" "no" "no" ...

From this result, we find some of the data type not in the correct type. we need to convert it into the correct type (data coercion)

bank$job <- as.factor(bank$job)
bank$marital <- as.factor(bank$marital)
bank$education <- as.factor(bank$education)
bank$default <- as.factor(bank$default)
bank$housing <- as.factor(bank$housing)
bank$loan <- as.factor(bank$loan)
bank$contact <- as.factor(bank$contact)
bank$month <- as.factor(bank$month)
bank$default <- as.factor(bank$default)
bank$poutcome <- as.factor(bank$poutcome)
bank$y <- as.factor(bank$y)

str(bank)
#> 'data.frame':    45211 obs. of  17 variables:
#>  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
#>  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
#>  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
#>  $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
#>  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
#>  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
#>  $ housing  : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
#>  $ loan     : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
#>  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
#>  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
#>  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
#>  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
#>  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
#>  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
#>  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Each column already changed into the desired data type.

nrow(bank)
#> [1] 45211
summary(bank)
#>       age                 job           marital          education    
#>  Min.   :18.00   blue-collar:9732   divorced: 5207   primary  : 6851  
#>  1st Qu.:33.00   management :9458   married :27214   secondary:23202  
#>  Median :39.00   technician :7597   single  :12790   tertiary :13301  
#>  Mean   :40.94   admin.     :5171                    unknown  : 1857  
#>  3rd Qu.:48.00   services   :4154                                     
#>  Max.   :95.00   retired    :2264                                     
#>                  (Other)    :6835                                     
#>  default        balance       housing      loan            contact     
#>  no :44396   Min.   : -8019   no :20081   no :37967   cellular :29285  
#>  yes:  815   1st Qu.:    72   yes:25130   yes: 7244   telephone: 2906  
#>              Median :   448                           unknown  :13020  
#>              Mean   :  1362                                            
#>              3rd Qu.:  1428                                            
#>              Max.   :102127                                            
#>                                                                        
#>       day            month          duration         campaign     
#>  Min.   : 1.00   may    :13766   Min.   :   0.0   Min.   : 1.000  
#>  1st Qu.: 8.00   jul    : 6895   1st Qu.: 103.0   1st Qu.: 1.000  
#>  Median :16.00   aug    : 6247   Median : 180.0   Median : 2.000  
#>  Mean   :15.81   jun    : 5341   Mean   : 258.2   Mean   : 2.764  
#>  3rd Qu.:21.00   nov    : 3970   3rd Qu.: 319.0   3rd Qu.: 3.000  
#>  Max.   :31.00   apr    : 2932   Max.   :4918.0   Max.   :63.000  
#>                  (Other): 6060                                    
#>      pdays          previous           poutcome       y        
#>  Min.   : -1.0   Min.   :  0.0000   failure: 4901   no :39922  
#>  1st Qu.: -1.0   1st Qu.:  0.0000   other  : 1840   yes: 5289  
#>  Median : -1.0   Median :  0.0000   success: 1511              
#>  Mean   : 40.2   Mean   :  0.5803   unknown:36959              
#>  3rd Qu.: -1.0   3rd Qu.:  0.0000                              
#>  Max.   :871.0   Max.   :275.0000                              
#> 

4 Pre-Processing Data

Before doing any modeling method, we need to check the proportion of each answer from the outcome variable (y)

prop.table(table(bank$y))
#> 
#>        no       yes 
#> 0.8830152 0.1169848
table(bank$y)
#> 
#>    no   yes 
#> 39922  5289

The Proportion of Yes vs No is not balanced. Hence, we need to reshape the data into equal proportion response. The method to do this is using downsample()

library(caret)
bank_new <- downSample(x = bank[, -17], y = bank[, 17], yname = "y")
prop.table(table(bank_new$y))
#> 
#>  no yes 
#> 0.5 0.5

The Proportion is balanced and we are ready for the next step.

4.1 Spliting Train-Test

We need to train the model in the train dataset so we can apply the model in the test dataset. First, we need to separate the dataset into two forms. One for the training and one for the testing.

set.seed(11)
intrain <- sample(nrow(bank_new), nrow(bank_new)*0.7)
bank_train <- bank_new[intrain,]
bank_test <- bank_new[-intrain,]
bank_new$y %>% 
  levels()
#> [1] "no"  "yes"
bank_test$y %>% 
  levels()
#> [1] "no"  "yes"
str(bank_train)
#> 'data.frame':    7404 obs. of  17 variables:
#>  $ age      : int  56 36 39 45 41 37 46 30 36 22 ...
#>  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 6 7 8 5 2 10 2 5 10 8 ...
#>  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 2 3 2 2 1 3 ...
#>  $ education: Factor w/ 4 levels "primary","secondary",..: 2 2 2 3 2 3 1 3 2 2 ...
#>  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ balance  : int  0 1173 695 523 393 197 0 1567 1 129 ...
#>  $ housing  : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 1 2 1 1 ...
#>  $ loan     : Factor w/ 2 levels "no","yes": 2 2 1 1 2 1 1 1 1 1 ...
#>  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 1 3 3 3 1 1 1 1 1 1 ...
#>  $ day      : int  7 20 28 5 10 15 6 12 8 16 ...
#>  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 2 7 9 9 6 4 2 2 2 11 ...
#>  $ duration : int  110 50 429 849 122 531 178 1133 1242 258 ...
#>  $ campaign : int  4 5 10 2 1 1 4 4 3 1 ...
#>  $ pdays    : int  -1 -1 -1 -1 -1 91 -1 -1 -1 -1 ...
#>  $ previous : int  0 0 0 0 0 2 0 0 0 0 ...
#>  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 3 4 4 4 4 ...
#>  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 2 2 2 ...

5 Building Model

5.1 Logistic Regression

model_camp <- glm(y~.,
                  data = bank_train, family = "binomial")
summary(model_camp)
#> 
#> Call:
#> glm(formula = y ~ ., family = "binomial", data = bank_train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -5.8297  -0.6088   0.0083   0.6208   2.9976  
#> 
#> Coefficients:
#>                      Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)        -8.343e-01  3.219e-01  -2.592 0.009554 ** 
#> age                 6.819e-03  3.904e-03   1.747 0.080658 .  
#> jobblue-collar     -4.689e-01  1.282e-01  -3.658 0.000254 ***
#> jobentrepreneur    -5.131e-01  2.200e-01  -2.333 0.019673 *  
#> jobhousemaid       -8.270e-01  2.255e-01  -3.667 0.000245 ***
#> jobmanagement      -3.054e-01  1.333e-01  -2.291 0.021985 *  
#> jobretired          2.857e-01  1.819e-01   1.570 0.116344    
#> jobself-employed   -4.580e-01  2.015e-01  -2.273 0.023049 *  
#> jobservices        -3.374e-01  1.468e-01  -2.298 0.021577 *  
#> jobstudent          3.194e-01  2.085e-01   1.532 0.125531    
#> jobtechnician      -1.789e-01  1.243e-01  -1.440 0.149886    
#> jobunemployed      -2.737e-01  2.000e-01  -1.369 0.171038    
#> jobunknown         -6.287e-01  4.094e-01  -1.536 0.124586    
#> maritalmarried     -1.183e-01  1.058e-01  -1.118 0.263404    
#> maritalsingle       3.369e-01  1.217e-01   2.770 0.005610 ** 
#> educationsecondary  1.586e-01  1.115e-01   1.422 0.154909    
#> educationtertiary   4.520e-01  1.324e-01   3.414 0.000641 ***
#> educationunknown    3.249e-01  1.825e-01   1.780 0.075023 .  
#> defaultyes         -1.732e-01  2.695e-01  -0.643 0.520450    
#> balance             1.867e-05  1.002e-05   1.862 0.062561 .  
#> housingyes         -8.113e-01  7.601e-02 -10.674  < 2e-16 ***
#> loanyes            -5.186e-01  1.038e-01  -4.996 5.85e-07 ***
#> contacttelephone   -9.657e-02  1.335e-01  -0.723 0.469385    
#> contactunknown     -1.626e+00  1.200e-01 -13.551  < 2e-16 ***
#> day                 4.834e-03  4.347e-03   1.112 0.266154    
#> monthaug           -1.003e+00  1.378e-01  -7.276 3.44e-13 ***
#> monthdec            8.687e-01  4.133e-01   2.102 0.035561 *  
#> monthfeb           -2.781e-01  1.545e-01  -1.800 0.071880 .  
#> monthjan           -1.552e+00  2.128e-01  -7.294 3.00e-13 ***
#> monthjul           -9.218e-01  1.367e-01  -6.743 1.55e-11 ***
#> monthjun            2.135e-01  1.624e-01   1.315 0.188645    
#> monthmar            1.513e+00  2.570e-01   5.888 3.91e-09 ***
#> monthmay           -7.005e-01  1.323e-01  -5.295 1.19e-07 ***
#> monthnov           -1.011e+00  1.474e-01  -6.863 6.73e-12 ***
#> monthoct            1.247e+00  2.342e-01   5.324 1.02e-07 ***
#> monthsep            8.635e-01  2.536e-01   3.405 0.000661 ***
#> duration            5.352e-03  1.508e-04  35.492  < 2e-16 ***
#> campaign           -7.429e-02  1.646e-02  -4.512 6.41e-06 ***
#> pdays              -2.738e-04  5.303e-04  -0.516 0.605602    
#> previous           -1.894e-03  6.844e-03  -0.277 0.782020    
#> poutcomeother       2.873e-01  1.595e-01   1.801 0.071726 .  
#> poutcomesuccess     2.393e+00  1.804e-01  13.266  < 2e-16 ***
#> poutcomeunknown    -3.397e-01  1.587e-01  -2.141 0.032271 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 10264  on 7403  degrees of freedom
#> Residual deviance:  6050  on 7361  degrees of freedom
#> AIC: 6136
#> 
#> Number of Fisher Scoring iterations: 6

We see here there are a lot variables who are not significant. Hence, we will use stepwise with “backward” method for model fitting

5.1.1 Model Fitting

library(MASS)
model2 <- stepAIC(model_camp, direction = "backward")
#> Start:  AIC=6135.98
#> y ~ age + job + marital + education + default + balance + housing + 
#>     loan + contact + day + month + duration + campaign + pdays + 
#>     previous + poutcome
#> 
#>             Df Deviance    AIC
#> - previous   1   6050.1 6134.1
#> - pdays      1   6050.2 6134.2
#> - default    1   6050.4 6134.4
#> - day        1   6051.2 6135.2
#> <none>           6050.0 6136.0
#> - age        1   6053.0 6137.0
#> - balance    1   6053.5 6137.5
#> - education  3   6064.2 6144.2
#> - campaign   1   6072.1 6156.1
#> - loan       1   6075.6 6159.6
#> - marital    2   6080.7 6162.7
#> - job       11   6104.4 6168.4
#> - housing    1   6166.4 6250.4
#> - contact    2   6251.6 6333.6
#> - poutcome   3   6403.4 6483.4
#> - month     11   6474.5 6538.5
#> - duration   1   8423.3 8507.3
#> 
#> Step:  AIC=6134.05
#> y ~ age + job + marital + education + default + balance + housing + 
#>     loan + contact + day + month + duration + campaign + pdays + 
#>     poutcome
#> 
#>             Df Deviance    AIC
#> - pdays      1   6050.3 6132.3
#> - default    1   6050.5 6132.5
#> - day        1   6051.3 6133.3
#> <none>           6050.1 6134.1
#> - age        1   6053.1 6135.1
#> - balance    1   6053.6 6135.6
#> - education  3   6064.2 6142.2
#> - campaign   1   6072.2 6154.2
#> - loan       1   6075.7 6157.7
#> - marital    2   6080.8 6160.8
#> - job       11   6104.4 6166.4
#> - housing    1   6166.7 6248.7
#> - contact    2   6251.7 6331.7
#> - poutcome   3   6408.0 6486.0
#> - month     11   6474.6 6536.6
#> - duration   1   8423.3 8505.3
#> 
#> Step:  AIC=6132.32
#> y ~ age + job + marital + education + default + balance + housing + 
#>     loan + contact + day + month + duration + campaign + poutcome
#> 
#>             Df Deviance    AIC
#> - default    1   6050.7 6130.7
#> - day        1   6051.6 6131.6
#> <none>           6050.3 6132.3
#> - age        1   6053.4 6133.4
#> - balance    1   6053.9 6133.9
#> - education  3   6064.6 6140.6
#> - campaign   1   6072.5 6152.5
#> - loan       1   6075.9 6155.9
#> - marital    2   6081.1 6159.1
#> - job       11   6104.9 6164.9
#> - housing    1   6168.4 6248.4
#> - contact    2   6251.7 6329.7
#> - month     11   6474.6 6534.6
#> - poutcome   3   6482.1 6558.1
#> - duration   1   8424.0 8504.0
#> 
#> Step:  AIC=6130.75
#> y ~ age + job + marital + education + balance + housing + loan + 
#>     contact + day + month + duration + campaign + poutcome
#> 
#>             Df Deviance    AIC
#> - day        1   6052.0 6130.0
#> <none>           6050.7 6130.7
#> - age        1   6053.8 6131.8
#> - balance    1   6054.5 6132.5
#> - education  3   6065.1 6139.1
#> - campaign   1   6073.0 6151.0
#> - loan       1   6076.8 6154.8
#> - marital    2   6081.4 6157.4
#> - job       11   6105.5 6163.5
#> - housing    1   6169.0 6247.0
#> - contact    2   6253.1 6329.1
#> - month     11   6475.8 6533.8
#> - poutcome   3   6483.5 6557.5
#> - duration   1   8424.8 8502.8
#> 
#> Step:  AIC=6130.02
#> y ~ age + job + marital + education + balance + housing + loan + 
#>     contact + month + duration + campaign + poutcome
#> 
#>             Df Deviance    AIC
#> <none>           6052.0 6130.0
#> - age        1   6055.1 6131.1
#> - balance    1   6055.8 6131.8
#> - education  3   6066.6 6138.6
#> - campaign   1   6073.1 6149.1
#> - loan       1   6078.4 6154.4
#> - marital    2   6082.7 6156.7
#> - job       11   6106.8 6162.8
#> - housing    1   6171.4 6247.4
#> - contact    2   6253.3 6327.3
#> - month     11   6478.4 6534.4
#> - poutcome   3   6483.7 6555.7
#> - duration   1   8424.8 8500.8
summary(model2)
#> 
#> Call:
#> glm(formula = y ~ age + job + marital + education + balance + 
#>     housing + loan + contact + month + duration + campaign + 
#>     poutcome, family = "binomial", data = bank_train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -5.8191  -0.6070   0.0084   0.6220   2.9643  
#> 
#> Coefficients:
#>                      Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)        -8.234e-01  2.847e-01  -2.892 0.003831 ** 
#> age                 6.804e-03  3.901e-03   1.744 0.081151 .  
#> jobblue-collar     -4.712e-01  1.282e-01  -3.677 0.000236 ***
#> jobentrepreneur    -5.213e-01  2.199e-01  -2.371 0.017747 *  
#> jobhousemaid       -8.229e-01  2.252e-01  -3.655 0.000258 ***
#> jobmanagement      -3.090e-01  1.333e-01  -2.319 0.020401 *  
#> jobretired          2.862e-01  1.818e-01   1.575 0.115345    
#> jobself-employed   -4.580e-01  2.015e-01  -2.272 0.023066 *  
#> jobservices        -3.389e-01  1.468e-01  -2.308 0.020998 *  
#> jobstudent          3.183e-01  2.083e-01   1.528 0.126529    
#> jobtechnician      -1.793e-01  1.242e-01  -1.443 0.148978    
#> jobunemployed      -2.737e-01  1.999e-01  -1.369 0.170935    
#> jobunknown         -6.414e-01  4.096e-01  -1.566 0.117354    
#> maritalmarried     -1.161e-01  1.056e-01  -1.099 0.271735    
#> maritalsingle       3.387e-01  1.216e-01   2.786 0.005341 ** 
#> educationsecondary  1.582e-01  1.114e-01   1.420 0.155575    
#> educationtertiary   4.560e-01  1.323e-01   3.447 0.000567 ***
#> educationunknown    3.313e-01  1.824e-01   1.817 0.069283 .  
#> balance             1.933e-05  1.003e-05   1.927 0.053959 .  
#> housingyes         -8.182e-01  7.572e-02 -10.805  < 2e-16 ***
#> loanyes            -5.257e-01  1.036e-01  -5.075 3.88e-07 ***
#> contacttelephone   -9.723e-02  1.334e-01  -0.729 0.466120    
#> contactunknown     -1.610e+00  1.188e-01 -13.552  < 2e-16 ***
#> monthaug           -1.022e+00  1.365e-01  -7.492 6.79e-14 ***
#> monthdec            8.411e-01  4.129e-01   2.037 0.041626 *  
#> monthfeb           -3.227e-01  1.479e-01  -2.182 0.029116 *  
#> monthjan           -1.514e+00  2.104e-01  -7.196 6.18e-13 ***
#> monthjul           -9.315e-01  1.364e-01  -6.827 8.66e-12 ***
#> monthjun            1.645e-01  1.559e-01   1.055 0.291387    
#> monthmar            1.490e+00  2.557e-01   5.829 5.57e-09 ***
#> monthmay           -7.208e-01  1.310e-01  -5.501 3.78e-08 ***
#> monthnov           -1.010e+00  1.464e-01  -6.899 5.23e-12 ***
#> monthoct            1.261e+00  2.342e-01   5.384 7.27e-08 ***
#> monthsep            8.320e-01  2.521e-01   3.301 0.000965 ***
#> duration            5.348e-03  1.507e-04  35.496  < 2e-16 ***
#> campaign           -7.141e-02  1.621e-02  -4.404 1.06e-05 ***
#> poutcomeother       2.895e-01  1.583e-01   1.829 0.067400 .  
#> poutcomesuccess     2.408e+00  1.773e-01  13.584  < 2e-16 ***
#> poutcomeunknown    -2.672e-01  9.751e-02  -2.740 0.006139 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 10264  on 7403  degrees of freedom
#> Residual deviance:  6052  on 7365  degrees of freedom
#> AIC: 6130
#> 
#> Number of Fisher Scoring iterations: 6

5.1.2 Prediction

bank_test$pred_camp <-predict(model2, type = "response", newdata = bank_test)


ggplot(bank_test, aes(x=pred_camp)) +
  geom_density(lwd=0.5) +
  labs(title = "Distribution of Probability Prediction Data") +
  theme_minimal()

bank_test$pred_camph <- factor(ifelse(bank_test$pred_camp > 0.5, "yes","no"))

`

log_conf <- confusionMatrix(bank_test$pred_camph, bank_test$y, positive = "yes")
log_conf
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   no  yes
#>        no  1370  293
#>        yes  233 1278
#>                                           
#>                Accuracy : 0.8343          
#>                  95% CI : (0.8209, 0.8471)
#>     No Information Rate : 0.505           
#>     P-Value [Acc > NIR] : <2e-16          
#>                                           
#>                   Kappa : 0.6684          
#>                                           
#>  Mcnemar's Test P-Value : 0.0101          
#>                                           
#>             Sensitivity : 0.8135          
#>             Specificity : 0.8546          
#>          Pos Pred Value : 0.8458          
#>          Neg Pred Value : 0.8238          
#>              Prevalence : 0.4950          
#>          Detection Rate : 0.4026          
#>    Detection Prevalence : 0.4761          
#>       Balanced Accuracy : 0.8341          
#>                                           
#>        'Positive' Class : yes             
#> 

Sensitivity and specificity are metrics commonly used to measures the performance of a binary classification.

  • Sensitivity (also called the true positive rate, the recall, or probability of detection in some fields) measures the proportion of positives that are correctly identified as such (cancer cell detection, email spam, insurance fraud etc)
  • Specificity (also called the true negative rate) measures the proportion of negatives that are correctly identified as such (e.g. the percentage of healthy people who are correctly identified as not having the condition, legitimate emails identified as such, legitimate insurance claims)
  • Precision: Proportion of correctly identified positives from all classified as such
  • Accuracy: Proportion of correctly identified cases from all cases

Here, I choose to use the sensitivity since we want to maximize the campaign to get more customer who says “yes” and to decrease its false-negative.

5.2 K-Nearest Neighbor

glimpse(bank_train)
#> Rows: 7,404
#> Columns: 17
#> $ age       <int> 56, 36, 39, 45, 41, 37, 46, 30, 36, 22, 37, 40, 23, 28, 3...
#> $ job       <fct> retired, self-employed, services, management, blue-collar...
#> $ marital   <fct> married, single, married, married, married, single, marri...
#> $ education <fct> secondary, secondary, secondary, tertiary, secondary, ter...
#> $ default   <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, n...
#> $ balance   <int> 0, 1173, 695, 523, 393, 197, 0, 1567, 1, 129, 565, 80, 0,...
#> $ housing   <fct> no, yes, yes, yes, yes, no, no, yes, no, no, no, yes, yes...
#> $ loan      <fct> yes, yes, no, no, yes, no, no, no, no, no, no, no, no, ye...
#> $ contact   <fct> cellular, unknown, unknown, unknown, cellular, cellular, ...
#> $ day       <int> 7, 20, 28, 5, 10, 15, 6, 12, 8, 16, 29, 30, 8, 18, 12, 20...
#> $ month     <fct> aug, jun, may, may, jul, feb, aug, aug, aug, oct, apr, ja...
#> $ duration  <int> 110, 50, 429, 849, 122, 531, 178, 1133, 1242, 258, 449, 1...
#> $ campaign  <int> 4, 5, 10, 2, 1, 1, 4, 4, 3, 1, 1, 3, 5, 2, 5, 3, 4, 1, 2,...
#> $ pdays     <int> -1, -1, -1, -1, -1, 91, -1, -1, -1, -1, 398, -1, 350, -1,...
#> $ previous  <int> 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 2, 0, 0, 0, 4, 10, 0,...
#> $ poutcome  <fct> unknown, unknown, unknown, unknown, unknown, success, unk...
#> $ y         <fct> no, no, no, no, no, yes, no, yes, yes, yes, yes, no, no, ...

Now let’s try to explore the classification model using the k-Nearest Neighbor algorithm. In the k-Nearest Neighbor algorithm, we need to perform one more step of data preprocessing. For both our train and test set, drop the categorical variable from each column except our y variable. Separate the predictor and target in-out train and test set.

# predictor variables in `train`
train_x <-bank_train %>% dplyr::select(age, balance, day, duration, campaign, pdays, previous)

# predictor variables in `test`
test_x <-bank_test %>% dplyr::select(age, balance, day, duration, campaign, pdays, previous)

# target variable in `train`
train_y <-bank_train[,17]

# target variable in `test`
test_y <-bank_test[,17]

Recall that the distance calculation for kNN is heavily dependent upon the measurement scale of the input features. If any variable with a different high range of value could potentially cause problems for our classifier, let’s apply normalization to rescale the elements to a standard range of benefits.

To normalize the features in train_x, please using scale() function. Meanwhile, in testing set data, please normalize each features using the attribute center and scale of train_x set data.

Please look up to the following code as an example to normalize test_x data:

# scale train_x data
train_x <- scale(train_x)

# scale test_x data
test_x <- scale(test_x,center = attr(train_x,"scaled:center"),scale = attr(train_x,"scaled:scale"))

After performing data normalizing, we need to find the right K to use for our K-NN model. There is one common practice for determining the number of K. That is to square root by number of rows

k<-sqrt(nrow(train_x))

Here, I will choose 87 since the number of row is even.

5.2.1 Building Model

model_knn <- knn(train = train_x, test = test_x, cl = train_y, k = 87)

5.2.2 Prediction

pred_knn_conf <- confusionMatrix(as.factor(model_knn), as.factor(test_y),"yes")
pred_knn_conf
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   no  yes
#>        no  1324  433
#>        yes  279 1138
#>                                           
#>                Accuracy : 0.7757          
#>                  95% CI : (0.7608, 0.7901)
#>     No Information Rate : 0.505           
#>     P-Value [Acc > NIR] : < 2.2e-16       
#>                                           
#>                   Kappa : 0.5509          
#>                                           
#>  Mcnemar's Test P-Value : 9.814e-09       
#>                                           
#>             Sensitivity : 0.7244          
#>             Specificity : 0.8260          
#>          Pos Pred Value : 0.8031          
#>          Neg Pred Value : 0.7536          
#>              Prevalence : 0.4950          
#>          Detection Rate : 0.3585          
#>    Detection Prevalence : 0.4464          
#>       Balanced Accuracy : 0.7752          
#>                                           
#>        'Positive' Class : yes             
#> 

Here, we see that the sensitivity is lower than in the logistic regression model. It happens since we have erased many predictors variables. To get better performance, we would better use all predictor variables by transforming into a dummy variable from the categorical variables we have.

5.3 Model Interpretation

exp(model2$coefficients) %>% 
  data.frame() 

To interpret this table, we see that the customer who has a management job will likely use “yes” for the campaign with more than 70% probability.

We have done two methods of classifying. From the model above, it is better to use logistic regression since it gives better sensitivity number. For better results, from the model_knn we have, we would better transform the categorical variable into the dummy variable.