INTRODUCTION

On this occasion, I will try to make predictions on bank clients who will be predicted to subscribe or not based on the categories of the 19 supporting variables. The algorithm that I will use is to use logistic regression and k-nearest neighbor which is included in supervised learning. The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls.

DATA PREPROCESSING

Packages Activation

library(dplyr)
library(gtools)
library(gmodels)
library(ggplot2)
library(class)
library(tidyr)

Import Data

bank <- read.csv2("bank-additional-full.csv", stringsAsFactors = T)

Data Manipulation

bank <- bank %>%
  select(-duration) %>%
  rename(subscribe = y) %>%
  mutate_at(c("emp.var.rate", "cons.price.idx", "cons.conf.idx", "euribor3m", "nr.employed"),
            as.numeric)
bank$subscribe <- ifelse(bank$subscribe == "yes", 1, 0)
bank$subscribe <- as.factor(bank$subscribe)
glimpse(bank)
#> Rows: 41,188
#> Columns: 20
#> $ age            <int> 56, 57, 37, 40, 56, 45, 59, 41, 24, 25, 41, 25, 29, 57,~
#> $ job            <fct> housemaid, services, services, admin., services, servic~
#> $ marital        <fct> married, married, married, married, married, married, m~
#> $ education      <fct> basic.4y, high.school, high.school, basic.6y, high.scho~
#> $ default        <fct> no, unknown, no, no, no, unknown, no, unknown, no, no, ~
#> $ housing        <fct> no, no, yes, no, no, no, no, no, yes, yes, no, yes, no,~
#> $ loan           <fct> no, no, no, no, yes, no, no, no, no, no, no, no, yes, n~
#> $ contact        <fct> telephone, telephone, telephone, telephone, telephone, ~
#> $ month          <fct> may, may, may, may, may, may, may, may, may, may, may, ~
#> $ day_of_week    <fct> mon, mon, mon, mon, mon, mon, mon, mon, mon, mon, mon, ~
#> $ campaign       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
#> $ pdays          <int> 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, ~
#> $ previous       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ poutcome       <fct> nonexistent, nonexistent, nonexistent, nonexistent, non~
#> $ emp.var.rate   <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9~
#> $ cons.price.idx <dbl> 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19,~
#> $ cons.conf.idx  <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,~
#> $ euribor3m      <dbl> 288, 288, 288, 288, 288, 288, 288, 288, 288, 288, 288, ~
#> $ nr.employed    <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9~
#> $ subscribe      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~

Missing Value Check

anyNA(bank)
#> [1] FALSE

Glossary Data

head(bank)
  1. age (numeric)
  2. job : type of job (categorical: ‘admin.’,‘blue-collar’,‘entrepreneur’,‘housemaid’,‘management’,‘retired’,‘self-employed’,‘services’,‘student’,‘technician’,‘unemployed’,‘unknown’)
  3. marital : marital status (categorical: ‘divorced’,‘married’,‘single’,‘unknown’; note: ‘divorced’ means divorced or widowed)
  4. education (categorical: ‘basic.4y’,‘basic.6y’,‘basic.9y’,‘high.school’,‘illiterate’,‘professional.course’,‘university.degree’,‘unknown’)
  5. default: has credit in default? (categorical: ‘no’,‘yes’,‘unknown’)
  6. housing: has housing loan? (categorical: ‘no’,‘yes’,‘unknown’)
  7. loan: has personal loan? (categorical: ‘no’,‘yes’,‘unknown’)
  8. contact: contact communication type (categorical: ‘cellular’,‘telephone’)
  9. month: last contact month of year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’)
  10. day_of_week: last contact day of the week (categorical: ‘mon’,‘tue’,‘wed’,‘thu’,‘fri’)
  11. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  12. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
  13. previous: number of contacts performed before this campaign and for this client (numeric)
  14. poutcome: outcome of the previous marketing campaign (categorical: ‘failure’,‘nonexistent’,‘success’)
  15. emp.var.rate: employment variation rate - quarterly indicator (numeric)
  16. cons.price.idx: consumer price index - monthly indicator (numeric)
  17. cons.conf.idx: consumer confidence index - monthly indicator (numeric)
  18. euribor3m: euribor 3 month rate - daily indicator (numeric)
  19. nr.employed: number of employees - quarterly indicator (numeric)
  20. subscribe - has the client subscribed a term deposit? (binary: ‘yes’,‘no’)

Data Splitting

library(rsample)
RNGkind(sample.kind = "Rounding")
set.seed(100)
intrain <- initial_split(data = bank, prop = 0.8, strata = "subscribe")
train <- training(intrain) 
test <- testing(intrain)

Balancing Data

prop.table(table(train$subscribe))
#> 
#>         0         1 
#> 0.8873445 0.1126555
prop.table(table(train$subscribe))
#> 
#>         0         1 
#> 0.8873445 0.1126555

The proportion of unsubscribed clients is 89% and the subscribed clients are 11%. This data is not balanced, so we need to do down-sampling.

# downsampling
RNGkind(sample.kind = "Rounding")
set.seed(100)

library(caret)
train <- downSample(x = train[, -20],
                         y = train$subscribe,
                         yname = "subscribe")

DATA MODELLING

Logistic Regression

model <- glm(formula = subscribe ~ ., family = "binomial", data = train)
summary(model)
#> 
#> Call:
#> glm(formula = subscribe ~ ., family = "binomial", data = train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.9323  -0.8640  -0.1406   0.8263   2.1814  
#> 
#> Coefficients: (1 not defined because of singularities)
#>                                Estimate Std. Error z value          Pr(>|z|)
#> (Intercept)                   4.1342825  0.5470125   7.558 0.000000000000041
#> age                          -0.0017967  0.0033274  -0.540          0.589226
#> jobblue-collar               -0.0704032  0.1036191  -0.679          0.496857
#> jobentrepreneur               0.0515184  0.1596435   0.323          0.746915
#> jobhousemaid                 -0.0317424  0.1948268  -0.163          0.870577
#> jobmanagement                 0.0016294  0.1191533   0.014          0.989090
#> jobretired                    0.2886229  0.1584951   1.821          0.068604
#> jobself-employed              0.1255154  0.1597050   0.786          0.431914
#> jobservices                  -0.0160963  0.1139789  -0.141          0.887695
#> jobstudent                    0.1310914  0.1801112   0.728          0.466714
#> jobtechnician                 0.0562148  0.0969316   0.580          0.561953
#> jobunemployed                 0.0173884  0.1837646   0.095          0.924614
#> jobunknown                    0.3184534  0.3521423   0.904          0.365820
#> maritalmarried                0.1195616  0.0918998   1.301          0.193259
#> maritalsingle                 0.2160563  0.1041243   2.075          0.037988
#> maritalunknown               -0.2103451  0.5691991  -0.370          0.711721
#> educationbasic.6y             0.0721148  0.1511415   0.477          0.633266
#> educationbasic.9y             0.0695536  0.1230743   0.565          0.571982
#> educationhigh.school          0.0963353  0.1253353   0.769          0.442118
#> educationilliterate           1.1844359  1.1949471   0.991          0.321586
#> educationprofessional.course  0.0848181  0.1391474   0.610          0.542156
#> educationuniversity.degree    0.1742095  0.1264624   1.378          0.168340
#> educationunknown              0.2466546  0.1685937   1.463          0.143464
#> defaultunknown               -0.2162364  0.0786670  -2.749          0.005982
#> housingunknown               -0.1670575  0.1778791  -0.939          0.347647
#> housingyes                    0.0032134  0.0560394   0.057          0.954273
#> loanunknown                          NA         NA      NA                NA
#> loanyes                      -0.0007388  0.0771414  -0.010          0.992359
#> contacttelephone             -0.4303114  0.1006260  -4.276 0.000018998762854
#> monthaug                     -0.3152258  0.1696353  -1.858          0.063133
#> monthdec                      0.8763744  0.4592968   1.908          0.056381
#> monthjul                      0.2721673  0.1366847   1.991          0.046458
#> monthjun                      0.4733285  0.1420589   3.332          0.000862
#> monthmar                      0.9605370  0.2284793   4.204 0.000026218891707
#> monthmay                     -0.7219572  0.1085949  -6.648 0.000000000029676
#> monthnov                     -0.6442862  0.1740025  -3.703          0.000213
#> monthoct                      0.2025423  0.2153873   0.940          0.347031
#> monthsep                     -0.4885715  0.2532523  -1.929          0.053707
#> day_of_weekmon               -0.1924819  0.0893685  -2.154          0.031256
#> day_of_weekthu               -0.0800636  0.0873920  -0.916          0.359592
#> day_of_weektue               -0.0655968  0.0897620  -0.731          0.464910
#> day_of_weekwed                0.0965588  0.0890494   1.084          0.278219
#> campaign                     -0.0428726  0.0127951  -3.351          0.000806
#> pdays                        -0.0010602  0.0003825  -2.772          0.005575
#> previous                     -0.1662750  0.1092295  -1.522          0.127946
#> poutcomenonexistent           0.4330829  0.1544500   2.804          0.005047
#> poutcomesuccess               0.8060621  0.3827393   2.106          0.035201
#> emp.var.rate                 -0.0457347  0.0228060  -2.005          0.044923
#> cons.price.idx               -0.0286395  0.0082965  -3.452          0.000556
#> cons.conf.idx                -0.0212858  0.0098754  -2.155          0.031128
#> euribor3m                    -0.0003363  0.0012704  -0.265          0.791246
#> nr.employed                  -0.2615683  0.0352732  -7.415 0.000000000000121
#>                                 
#> (Intercept)                  ***
#> age                             
#> jobblue-collar                  
#> jobentrepreneur                 
#> jobhousemaid                    
#> jobmanagement                   
#> jobretired                   .  
#> jobself-employed                
#> jobservices                     
#> jobstudent                      
#> jobtechnician                   
#> jobunemployed                   
#> jobunknown                      
#> maritalmarried                  
#> maritalsingle                *  
#> maritalunknown                  
#> educationbasic.6y               
#> educationbasic.9y               
#> educationhigh.school            
#> educationilliterate             
#> educationprofessional.course    
#> educationuniversity.degree      
#> educationunknown                
#> defaultunknown               ** 
#> housingunknown                  
#> housingyes                      
#> loanunknown                     
#> loanyes                         
#> contacttelephone             ***
#> monthaug                     .  
#> monthdec                     .  
#> monthjul                     *  
#> monthjun                     ***
#> monthmar                     ***
#> monthmay                     ***
#> monthnov                     ***
#> monthoct                        
#> monthsep                     .  
#> day_of_weekmon               *  
#> day_of_weekthu                  
#> day_of_weektue                  
#> day_of_weekwed                  
#> campaign                     ***
#> pdays                        ** 
#> previous                        
#> poutcomenonexistent          ** 
#> poutcomesuccess              *  
#> emp.var.rate                 *  
#> cons.price.idx               ***
#> cons.conf.idx                *  
#> euribor3m                       
#> nr.employed                  ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 10291.8  on 7423  degrees of freedom
#> Residual deviance:  7958.8  on 7373  degrees of freedom
#> AIC: 8060.8
#> 
#> Number of Fisher Scoring iterations: 5

Model Fitting

In the first modeling, there are many predictor variables that are not significant to the target variable, therefore we will try to do a model fitting using the stepwise method.

library(MASS)
model_fit <- stepAIC(model, direction = "backward", trace = 0)
summary(model_fit)
#> 
#> Call:
#> glm(formula = subscribe ~ marital + default + contact + month + 
#>     day_of_week + campaign + pdays + previous + poutcome + emp.var.rate + 
#>     cons.price.idx + cons.conf.idx + nr.employed, family = "binomial", 
#>     data = train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.9220  -0.8621  -0.1442   0.8440   2.1223  
#> 
#> Coefficients:
#>                       Estimate Std. Error z value             Pr(>|z|)    
#> (Intercept)          4.3082910  0.5019377   8.583 < 0.0000000000000002 ***
#> maritalmarried       0.1074050  0.0908629   1.182             0.237184    
#> maritalsingle        0.2374438  0.0970528   2.447             0.014423 *  
#> maritalunknown      -0.1532229  0.5705472  -0.269             0.788273    
#> defaultunknown      -0.2467302  0.0761079  -3.242             0.001188 ** 
#> contacttelephone    -0.4313096  0.0995392  -4.333    0.000014704951920 ***
#> monthaug            -0.3082251  0.1652161  -1.866             0.062099 .  
#> monthdec             0.8968909  0.4534675   1.978             0.047946 *  
#> monthjul             0.2840095  0.1347890   2.107             0.035112 *  
#> monthjun             0.4867265  0.1374133   3.542             0.000397 ***
#> monthmar             0.9706694  0.2279753   4.258    0.000020646421878 ***
#> monthmay            -0.7614103  0.1066033  -7.142    0.000000000000917 ***
#> monthnov            -0.6434139  0.1735204  -3.708             0.000209 ***
#> monthoct             0.1955270  0.2143067   0.912             0.361574    
#> monthsep            -0.5149734  0.2469587  -2.085             0.037046 *  
#> day_of_weekmon      -0.1983658  0.0890909  -2.227             0.025977 *  
#> day_of_weekthu      -0.0836150  0.0870942  -0.960             0.337029    
#> day_of_weektue      -0.0739560  0.0894027  -0.827             0.408111    
#> day_of_weekwed       0.0895782  0.0886935   1.010             0.312507    
#> campaign            -0.0410276  0.0127154  -3.227             0.001253 ** 
#> pdays               -0.0010746  0.0003817  -2.816             0.004870 ** 
#> previous            -0.1600083  0.1085405  -1.474             0.140433    
#> poutcomenonexistent  0.4438523  0.1538950   2.884             0.003925 ** 
#> poutcomesuccess      0.8018153  0.3820009   2.099             0.035818 *  
#> emp.var.rate        -0.0486142  0.0227525  -2.137             0.032626 *  
#> cons.price.idx      -0.0301588  0.0072232  -4.175    0.000029766837421 ***
#> cons.conf.idx       -0.0238581  0.0091434  -2.609             0.009072 ** 
#> nr.employed         -0.2731009  0.0193279 -14.130 < 0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 10292  on 7423  degrees of freedom
#> Residual deviance:  7975  on 7396  degrees of freedom
#> AIC: 8031
#> 
#> Number of Fisher Scoring iterations: 5

Predicting

# Predict
test$prob_subs <- predict(model_fit, type = "response", newdata = test)
ggplot(test, aes(x = prob_subs)) +
  geom_density(lwd = 0.5) +
  labs(title = "Distribution of Probability Prediction Data",
       x = "Probability of Subscribed Client") +
  theme_minimal()

In the graph above, it can be interpreted that the prediction results are more inclined towards 0 which means that the client does not subscribe.

Model Evaluating

test$pred_subs <- factor(ifelse(test$prob_subs > 0.5, 1, 0))
test[1:10, c("pred_subs", "subscribe")]
library(caret)
reg <- confusionMatrix(test$pred_subs, test$subscribe, positive = "1")
reg
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction    0    1
#>          0 6128  353
#>          1 1182  575
#>                                              
#>                Accuracy : 0.8137             
#>                  95% CI : (0.8051, 0.822)    
#>     No Information Rate : 0.8874             
#>     P-Value [Acc > NIR] : 1                  
#>                                              
#>                   Kappa : 0.3294             
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.6196             
#>             Specificity : 0.8383             
#>          Pos Pred Value : 0.3273             
#>          Neg Pred Value : 0.9455             
#>              Prevalence : 0.1126             
#>          Detection Rate : 0.0698             
#>    Detection Prevalence : 0.2133             
#>       Balanced Accuracy : 0.7290             
#>                                              
#>        'Positive' Class : 1                  
#> 

Based on the results of the confusionMatrix above, we can take information that the model’s ability to guess target y (subscribed and not sunscribed) is 81.4%. Meanwhile, based on the actual data of unsubscribed clients, the model was able to guess correctly at 83.8%. From the total actual data of clients who subscribed, the model was able to guess correctly by 62.0%. From the overall prediction results that the model was able to guess, the model was able to correctly guess the positive class by 32.7%.

Tunning Off

performa <- function(cutoff, prob, ref, postarget, negtarget) 
{
  predict <- factor(ifelse(prob >= cutoff, postarget, negtarget))
  conf <- caret::confusionMatrix(predict , ref, positive = postarget)
  acc <- conf$overall[1]
  rec <- conf$byClass[1]
  prec <- conf$byClass[3]
  spec <- conf$byClass[2]
  mat <- t(as.matrix(c(rec , acc , prec, spec))) 
  colnames(mat) <- c("recall", "accuracy", "precicion", "specificity")
  return(mat)
}

co <- seq(0.01,0.80,length = 100)
result <- matrix(0, 100, 4)

for(i in 1:100){
  result[i,] = performa(cutoff = co[i], 
                     prob = test$prob_subs, 
                     ref = test$subscribe, 
                     postarget = "1", 
                     negtarget = "0")
}

data_frame("Recall" = result[,1],
           "Accuracy" = result[,2],
           "Precision" = result[,3],
           "Specificity" = result[,4],
           "Cutoff" = co) %>% 
  gather(key = "performa", value = "value", 1:4) %>% 
  ggplot(aes(x = Cutoff, y = value, col = performa)) +
  geom_line(lwd = 1.5) +
  scale_color_manual(values = c("plum1","plum4","rosybrown1", "rosybrown3")) +
  scale_y_continuous(breaks = seq(0,1,0.1), limits = c(0,1)) +
  scale_x_continuous(breaks = seq(0,1,0.1)) +
  labs(title = "Tradeoff Model Perfomance") +
  theme_minimal() +
  theme(legend.position = "top",
        panel.grid.minor.y = element_blank(),
        panel.grid.minor.x = element_blank())

In this case, we may want to take as many positive classes as possible (subscribed clients), so we need to increase the recall/sensitivity value. Based on the tradeoff of the performance model above, we can see that with a cutoff of 0.38 we get rather high recall/sensitivity values, but rather low precision, accuracy and specificity values.

# Decreasing Treshold to Increase Recall Value
test$pred_subs <- factor(ifelse(test$prob_subs > 0.38, 1, 0))
reg <- confusionMatrix(test$pred_subs, test$subscribe, positive = "1")
reg
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction    0    1
#>          0 5157  258
#>          1 2153  670
#>                                              
#>                Accuracy : 0.7073             
#>                  95% CI : (0.6974, 0.7171)   
#>     No Information Rate : 0.8874             
#>     P-Value [Acc > NIR] : 1                  
#>                                              
#>                   Kappa : 0.226              
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.72198            
#>             Specificity : 0.70547            
#>          Pos Pred Value : 0.23734            
#>          Neg Pred Value : 0.95235            
#>              Prevalence : 0.11265            
#>          Detection Rate : 0.08133            
#>    Detection Prevalence : 0.34268            
#>       Balanced Accuracy : 0.71373            
#>                                              
#>        'Positive' Class : 1                  
#> 

Model Interpretation

exp(model_fit$coefficients) %>% 
  data.frame() 
  1. Client odds unknown credit in default = 0.78. This means that clients who are not known to have credit in default are 78% more likely to subscribe compared to clients who have and do not have credit in default.
  2. Odds of clients with contact communication using telephone = 0.65. This means that clients with contact communication using telephones are 65% more likely to subscribe compared to clients using cellular.

K-Nearest Neighbour

Pre-processing Data

# Data Balancing
prop.table(table(bank$subscribe))
#> 
#>         0         1 
#> 0.8873458 0.1126542
# downsampling
RNGkind(sample.kind = "Rounding")
set.seed(100)

bank <- downSample(x = bank[, -20],
                  y = bank$subscribe,
                  yname = "subscribe")
# Dummy Variables
dmy <- dummyVars(" ~subscribe + job + marital + education + default + housing + loan + contact + month + day_of_week + poutcome", data = bank)
dmy <- data.frame(predict(dmy, newdata = bank))
glimpse(dmy)
#> Rows: 9,280
#> Columns: 55
#> $ subscribe.0                   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
#> $ subscribe.1                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ job.admin.                    <dbl> 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0~
#> $ job.blue.collar               <dbl> 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ job.entrepreneur              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ job.housemaid                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ job.management                <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ job.retired                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ job.self.employed             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ job.services                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1~
#> $ job.student                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ job.technician                <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0~
#> $ job.unemployed                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ job.unknown                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0~
#> $ marital.divorced              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0~
#> $ marital.married               <dbl> 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0~
#> $ marital.single                <dbl> 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1~
#> $ marital.unknown               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ education.basic.4y            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ education.basic.6y            <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ education.basic.9y            <dbl> 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0~
#> $ education.high.school         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1~
#> $ education.illiterate          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ education.professional.course <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ education.university.degree   <dbl> 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0~
#> $ education.unknown             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0~
#> $ default.no                    <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1~
#> $ default.unknown               <dbl> 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0~
#> $ default.yes                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ housing.no                    <dbl> 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0~
#> $ housing.unknown               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0~
#> $ housing.yes                   <dbl> 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1~
#> $ loan.no                       <dbl> 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1~
#> $ loan.unknown                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0~
#> $ loan.yes                      <dbl> 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ contact.cellular              <dbl> 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1~
#> $ contact.telephone             <dbl> 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0~
#> $ month.apr                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ month.aug                     <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0~
#> $ month.dec                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ month.jul                     <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1~
#> $ month.jun                     <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0~
#> $ month.mar                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ month.may                     <dbl> 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0~
#> $ month.nov                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ month.oct                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ month.sep                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ day_of_week.fri               <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0~
#> $ day_of_week.mon               <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ day_of_week.thu               <dbl> 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1~
#> $ day_of_week.tue               <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0~
#> $ day_of_week.wed               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0~
#> $ poutcome.failure              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0~
#> $ poutcome.nonexistent          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1~
#> $ poutcome.success              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
# Delete variables that has only 2 categories
dmy$contact.cellular <- NULL
dmy$subscribe.0 <- NULL
# Data Splitting
set.seed(100)
intrain <- sample(nrow(bank), nrow(bank)*0.8)
dmy_train <- dmy[intrain,1:19]
dmy_test <- dmy[-intrain,1:19]

dmy_train_label <- dmy[intrain,20]
dmy_test_label <- dmy[-intrain,20]

Predict Values

# k optimum
round(sqrt(nrow(dmy_train)))
#> [1] 86
# Scaling data prediktor
dmy_train <- scale(dmy_train)
  
dmy_test <- scale(dmy_test,
                      center = attr(dmy_train, "scaled:center"),
                      scale = attr(dmy_train, "scaled:scale"))
#Predict
knn <- class::knn(train = dmy_train,
                  test = dmy_test, 
                  cl = dmy_train_label, 
                  k = 86)
knn
#>    [1] 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
#>   [38] 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0
#>   [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1
#>  [112] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0
#>  [149] 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#>  [186] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0
#>  [223] 0 0 1 1 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0
#>  [260] 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0
#>  [297] 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0
#>  [334] 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0
#>  [371] 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
#>  [408] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0
#>  [445] 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0
#>  [482] 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0
#>  [519] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1
#>  [556] 1 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 1 1 0 0 0
#>  [593] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
#>  [630] 0 1 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 1
#>  [667] 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#>  [704] 1 0 0 0 1 0 0 0 0 1 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0
#>  [741] 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
#>  [778] 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0
#>  [815] 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
#>  [852] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0
#>  [889] 0 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0
#>  [926] 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#>  [963] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1
#> [1000] 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1037] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0
#> [1074] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1111] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1148] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1
#> [1185] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1222] 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1259] 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
#> [1296] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1333] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1370] 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
#> [1407] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
#> [1444] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1481] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1518] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
#> [1555] 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1592] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1629] 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
#> [1666] 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
#> [1703] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0
#> [1740] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0
#> [1777] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1814] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1851] 0 0 0 0 0 0
#> Levels: 0 1

Model Evaluating

knn_eval <- confusionMatrix(as.factor(knn), as.factor(dmy_test_label),"1")
knn_eval
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction    0    1
#>          0 1570  120
#>          1   60  106
#>                                           
#>                Accuracy : 0.903           
#>                  95% CI : (0.8886, 0.9161)
#>     No Information Rate : 0.8782          
#>     P-Value [Acc > NIR] : 0.0004455       
#>                                           
#>                   Kappa : 0.488           
#>                                           
#>  Mcnemar's Test P-Value : 0.00001095      
#>                                           
#>             Sensitivity : 0.46903         
#>             Specificity : 0.96319         
#>          Pos Pred Value : 0.63855         
#>          Neg Pred Value : 0.92899         
#>              Prevalence : 0.12177         
#>          Detection Rate : 0.05711         
#>    Detection Prevalence : 0.08944         
#>       Balanced Accuracy : 0.71611         
#>                                           
#>        'Positive' Class : 1               
#> 

Based on the results of the confusionMatrix above, we can take information that the model’s ability to guess target y (subscribed and not sunscribed) is 90.3%. Meanwhile, based on the actual data of unsubscribed clients, the model was able to guess correctly at 96.3%. From the total actual data of clients who subscribed, the model was able to guess correctly by 46.9%. From the overall prediction results that the model was able to guess, the model was able to correctly guess the positive class by 63.9%.

Comparison of The Two Models

eval_reg <- data_frame(Accuracy = reg$overall[1],
                       Recall = reg$byClass[1],
                       Specificity = reg$byClass[2],
                       Precision = reg$byClass[3])

eval_knn <- data_frame(Accuracy = knn_eval$overall[1],
                       Recall = knn_eval$byClass[1],
                       Specificity = knn_eval$byClass[2],
                       Precision = knn_eval$byClass[3])
# Logistic Regression
eval_reg
# k-Nearest Neighbour
eval_knn

When viewed from the two methods, namely by using Logistics Regression and k-NN, the model’s ability to take as many positive classes as possible (subscribed clients) is better by using the Logistics Regression method because it has a recall value = 72.2% greater than using the k-NN method.

CONCLUSION

  1. If we want to increase the chances of a client subscribing, then the Logistics Regression method is better to use in predicting whether a client subscribes or not based on the recall value.
  2. If we want to save on the telemarketing budget, we need to be as selective/precise as possible in assigning positive classes. So the k-NN method is better to use in predicting whether a client subscribes or not based on the precision value.

REFERENCES

S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014