INTRODUCTION

On this occasion, I will try to make predictions on bank clients who will be predicted to subscribe or not based on the categories of the 19 supporting variables. The algorithm that I will use is to use logistic regression and k-nearest neighbor which is included in supervised learning. The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls.

DATA PREPROCESSING

Packages Activation

library(dplyr)
library(gtools)
library(gmodels)
library(ggplot2)
library(class)
library(tidyr)

Import Data

bank <- read.csv2("bank-additional-full.csv", stringsAsFactors = T)

Data Manipulation

bank <- bank %>%
  select(-duration) %>%
  rename(subscribe = y) %>%
  mutate_at(c("emp.var.rate", "cons.price.idx", "cons.conf.idx", "euribor3m", "nr.employed"),
            as.numeric)
bank$subscribe <- ifelse(bank$subscribe == "yes", 1, 0)
bank$subscribe <- as.factor(bank$subscribe)
glimpse(bank)

#> Rows: 41,188
#> Columns: 20
#> $ age            <int> 56, 57, 37, 40, 56, 45, 59, 41, 24, 25, 41, 25, 29, 57,~
#> $ job            <fct> housemaid, services, services, admin., services, servic~
#> $ marital        <fct> married, married, married, married, married, married, m~
#> $ education      <fct> basic.4y, high.school, high.school, basic.6y, high.scho~
#> $ default        <fct> no, unknown, no, no, no, unknown, no, unknown, no, no, ~
#> $ housing        <fct> no, no, yes, no, no, no, no, no, yes, yes, no, yes, no,~
#> $ loan           <fct> no, no, no, no, yes, no, no, no, no, no, no, no, yes, n~
#> $ contact        <fct> telephone, telephone, telephone, telephone, telephone, ~
#> $ month          <fct> may, may, may, may, may, may, may, may, may, may, may, ~
#> $ day_of_week    <fct> mon, mon, mon, mon, mon, mon, mon, mon, mon, mon, mon, ~
#> $ campaign       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
#> $ pdays          <int> 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, ~
#> $ previous       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ poutcome       <fct> nonexistent, nonexistent, nonexistent, nonexistent, non~
#> $ emp.var.rate   <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9~
#> $ cons.price.idx <dbl> 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19,~
#> $ cons.conf.idx  <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,~
#> $ euribor3m      <dbl> 288, 288, 288, 288, 288, 288, 288, 288, 288, 288, 288, ~
#> $ nr.employed    <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9~
#> $ subscribe      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~

Missing Value Check

anyNA(bank)

#> [1] FALSE

Glossary Data

head(bank)

age (numeric)
job : type of job (categorical: ‘admin.’,‘blue-collar’,‘entrepreneur’,‘housemaid’,‘management’,‘retired’,‘self-employed’,‘services’,‘student’,‘technician’,‘unemployed’,‘unknown’)
marital : marital status (categorical: ‘divorced’,‘married’,‘single’,‘unknown’; note: ‘divorced’ means divorced or widowed)
education (categorical: ‘basic.4y’,‘basic.6y’,‘basic.9y’,‘high.school’,‘illiterate’,‘professional.course’,‘university.degree’,‘unknown’)
default: has credit in default? (categorical: ‘no’,‘yes’,‘unknown’)
housing: has housing loan? (categorical: ‘no’,‘yes’,‘unknown’)
loan: has personal loan? (categorical: ‘no’,‘yes’,‘unknown’)
contact: contact communication type (categorical: ‘cellular’,‘telephone’)
month: last contact month of year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’)
day_of_week: last contact day of the week (categorical: ‘mon’,‘tue’,‘wed’,‘thu’,‘fri’)
campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
previous: number of contacts performed before this campaign and for this client (numeric)
poutcome: outcome of the previous marketing campaign (categorical: ‘failure’,‘nonexistent’,‘success’)
emp.var.rate: employment variation rate - quarterly indicator (numeric)
cons.price.idx: consumer price index - monthly indicator (numeric)
cons.conf.idx: consumer confidence index - monthly indicator (numeric)
euribor3m: euribor 3 month rate - daily indicator (numeric)
nr.employed: number of employees - quarterly indicator (numeric)
subscribe - has the client subscribed a term deposit? (binary: ‘yes’,‘no’)

Data Splitting

library(rsample)
RNGkind(sample.kind = "Rounding")
set.seed(100)
intrain <- initial_split(data = bank, prop = 0.8, strata = "subscribe")
train <- training(intrain) 
test <- testing(intrain)

Balancing Data

prop.table(table(train$subscribe))

#> 
#>         0         1 
#> 0.8873445 0.1126555

prop.table(table(train$subscribe))

#> 
#>         0         1 
#> 0.8873445 0.1126555

The proportion of unsubscribed clients is 89% and the subscribed clients are 11%. This data is not balanced, so we need to do down-sampling.

# downsampling
RNGkind(sample.kind = "Rounding")
set.seed(100)

library(caret)
train <- downSample(x = train[, -20],
                         y = train$subscribe,
                         yname = "subscribe")

DATA MODELLING

Logistic Regression

model <- glm(formula = subscribe ~ ., family = "binomial", data = train)
summary(model)

#> 
#> Call:
#> glm(formula = subscribe ~ ., family = "binomial", data = train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.9323  -0.8640  -0.1406   0.8263   2.1814  
#> 
#> Coefficients: (1 not defined because of singularities)
#>                                Estimate Std. Error z value          Pr(>|z|)
#> (Intercept)                   4.1342825  0.5470125   7.558 0.000000000000041
#> age                          -0.0017967  0.0033274  -0.540          0.589226
#> jobblue-collar               -0.0704032  0.1036191  -0.679          0.496857
#> jobentrepreneur               0.0515184  0.1596435   0.323          0.746915
#> jobhousemaid                 -0.0317424  0.1948268  -0.163          0.870577
#> jobmanagement                 0.0016294  0.1191533   0.014          0.989090
#> jobretired                    0.2886229  0.1584951   1.821          0.068604
#> jobself-employed              0.1255154  0.1597050   0.786          0.431914
#> jobservices                  -0.0160963  0.1139789  -0.141          0.887695
#> jobstudent                    0.1310914  0.1801112   0.728          0.466714
#> jobtechnician                 0.0562148  0.0969316   0.580          0.561953
#> jobunemployed                 0.0173884  0.1837646   0.095          0.924614
#> jobunknown                    0.3184534  0.3521423   0.904          0.365820
#> maritalmarried                0.1195616  0.0918998   1.301          0.193259
#> maritalsingle                 0.2160563  0.1041243   2.075          0.037988
#> maritalunknown               -0.2103451  0.5691991  -0.370          0.711721
#> educationbasic.6y             0.0721148  0.1511415   0.477          0.633266
#> educationbasic.9y             0.0695536  0.1230743   0.565          0.571982
#> educationhigh.school          0.0963353  0.1253353   0.769          0.442118
#> educationilliterate           1.1844359  1.1949471   0.991          0.321586
#> educationprofessional.course  0.0848181  0.1391474   0.610          0.542156
#> educationuniversity.degree    0.1742095  0.1264624   1.378          0.168340
#> educationunknown              0.2466546  0.1685937   1.463          0.143464
#> defaultunknown               -0.2162364  0.0786670  -2.749          0.005982
#> housingunknown               -0.1670575  0.1778791  -0.939          0.347647
#> housingyes                    0.0032134  0.0560394   0.057          0.954273
#> loanunknown                          NA         NA      NA                NA
#> loanyes                      -0.0007388  0.0771414  -0.010          0.992359
#> contacttelephone             -0.4303114  0.1006260  -4.276 0.000018998762854
#> monthaug                     -0.3152258  0.1696353  -1.858          0.063133
#> monthdec                      0.8763744  0.4592968   1.908          0.056381
#> monthjul                      0.2721673  0.1366847   1.991          0.046458
#> monthjun                      0.4733285  0.1420589   3.332          0.000862
#> monthmar                      0.9605370  0.2284793   4.204 0.000026218891707
#> monthmay                     -0.7219572  0.1085949  -6.648 0.000000000029676
#> monthnov                     -0.6442862  0.1740025  -3.703          0.000213
#> monthoct                      0.2025423  0.2153873   0.940          0.347031
#> monthsep                     -0.4885715  0.2532523  -1.929          0.053707
#> day_of_weekmon               -0.1924819  0.0893685  -2.154          0.031256
#> day_of_weekthu               -0.0800636  0.0873920  -0.916          0.359592
#> day_of_weektue               -0.0655968  0.0897620  -0.731          0.464910
#> day_of_weekwed                0.0965588  0.0890494   1.084          0.278219
#> campaign                     -0.0428726  0.0127951  -3.351          0.000806
#> pdays                        -0.0010602  0.0003825  -2.772          0.005575
#> previous                     -0.1662750  0.1092295  -1.522          0.127946
#> poutcomenonexistent           0.4330829  0.1544500   2.804          0.005047
#> poutcomesuccess               0.8060621  0.3827393   2.106          0.035201
#> emp.var.rate                 -0.0457347  0.0228060  -2.005          0.044923
#> cons.price.idx               -0.0286395  0.0082965  -3.452          0.000556
#> cons.conf.idx                -0.0212858  0.0098754  -2.155          0.031128
#> euribor3m                    -0.0003363  0.0012704  -0.265          0.791246
#> nr.employed                  -0.2615683  0.0352732  -7.415 0.000000000000121
#>                                 
#> (Intercept)                  ***
#> age                             
#> jobblue-collar                  
#> jobentrepreneur                 
#> jobhousemaid                    
#> jobmanagement                   
#> jobretired                   .  
#> jobself-employed                
#> jobservices                     
#> jobstudent                      
#> jobtechnician                   
#> jobunemployed                   
#> jobunknown                      
#> maritalmarried                  
#> maritalsingle                *  
#> maritalunknown                  
#> educationbasic.6y               
#> educationbasic.9y               
#> educationhigh.school            
#> educationilliterate             
#> educationprofessional.course    
#> educationuniversity.degree      
#> educationunknown                
#> defaultunknown               ** 
#> housingunknown                  
#> housingyes                      
#> loanunknown                     
#> loanyes                         
#> contacttelephone             ***
#> monthaug                     .  
#> monthdec                     .  
#> monthjul                     *  
#> monthjun                     ***
#> monthmar                     ***
#> monthmay                     ***
#> monthnov                     ***
#> monthoct                        
#> monthsep                     .  
#> day_of_weekmon               *  
#> day_of_weekthu                  
#> day_of_weektue                  
#> day_of_weekwed                  
#> campaign                     ***
#> pdays                        ** 
#> previous                        
#> poutcomenonexistent          ** 
#> poutcomesuccess              *  
#> emp.var.rate                 *  
#> cons.price.idx               ***
#> cons.conf.idx                *  
#> euribor3m                       
#> nr.employed                  ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 10291.8  on 7423  degrees of freedom
#> Residual deviance:  7958.8  on 7373  degrees of freedom
#> AIC: 8060.8
#> 
#> Number of Fisher Scoring iterations: 5

Model Fitting

In the first modeling, there are many predictor variables that are not significant to the target variable, therefore we will try to do a model fitting using the stepwise method.

library(MASS)
model_fit <- stepAIC(model, direction = "backward", trace = 0)
summary(model_fit)

#> 
#> Call:
#> glm(formula = subscribe ~ marital + default + contact + month + 
#>     day_of_week + campaign + pdays + previous + poutcome + emp.var.rate + 
#>     cons.price.idx + cons.conf.idx + nr.employed, family = "binomial", 
#>     data = train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.9220  -0.8621  -0.1442   0.8440   2.1223  
#> 
#> Coefficients:
#>                       Estimate Std. Error z value             Pr(>|z|)    
#> (Intercept)          4.3082910  0.5019377   8.583 < 0.0000000000000002 ***
#> maritalmarried       0.1074050  0.0908629   1.182             0.237184    
#> maritalsingle        0.2374438  0.0970528   2.447             0.014423 *  
#> maritalunknown      -0.1532229  0.5705472  -0.269             0.788273    
#> defaultunknown      -0.2467302  0.0761079  -3.242             0.001188 ** 
#> contacttelephone    -0.4313096  0.0995392  -4.333    0.000014704951920 ***
#> monthaug            -0.3082251  0.1652161  -1.866             0.062099 .  
#> monthdec             0.8968909  0.4534675   1.978             0.047946 *  
#> monthjul             0.2840095  0.1347890   2.107             0.035112 *  
#> monthjun             0.4867265  0.1374133   3.542             0.000397 ***
#> monthmar             0.9706694  0.2279753   4.258    0.000020646421878 ***
#> monthmay            -0.7614103  0.1066033  -7.142    0.000000000000917 ***
#> monthnov            -0.6434139  0.1735204  -3.708             0.000209 ***
#> monthoct             0.1955270  0.2143067   0.912             0.361574    
#> monthsep            -0.5149734  0.2469587  -2.085             0.037046 *  
#> day_of_weekmon      -0.1983658  0.0890909  -2.227             0.025977 *  
#> day_of_weekthu      -0.0836150  0.0870942  -0.960             0.337029    
#> day_of_weektue      -0.0739560  0.0894027  -0.827             0.408111    
#> day_of_weekwed       0.0895782  0.0886935   1.010             0.312507    
#> campaign            -0.0410276  0.0127154  -3.227             0.001253 ** 
#> pdays               -0.0010746  0.0003817  -2.816             0.004870 ** 
#> previous            -0.1600083  0.1085405  -1.474             0.140433    
#> poutcomenonexistent  0.4438523  0.1538950   2.884             0.003925 ** 
#> poutcomesuccess      0.8018153  0.3820009   2.099             0.035818 *  
#> emp.var.rate        -0.0486142  0.0227525  -2.137             0.032626 *  
#> cons.price.idx      -0.0301588  0.0072232  -4.175    0.000029766837421 ***
#> cons.conf.idx       -0.0238581  0.0091434  -2.609             0.009072 ** 
#> nr.employed         -0.2731009  0.0193279 -14.130 < 0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 10292  on 7423  degrees of freedom
#> Residual deviance:  7975  on 7396  degrees of freedom
#> AIC: 8031
#> 
#> Number of Fisher Scoring iterations: 5

Predicting

# Predict
test$prob_subs <- predict(model_fit, type = "response", newdata = test)
ggplot(test, aes(x = prob_subs)) +
  geom_density(lwd = 0.5) +
  labs(title = "Distribution of Probability Prediction Data",
       x = "Probability of Subscribed Client") +
  theme_minimal()

In the graph above, it can be interpreted that the prediction results are more inclined towards 0 which means that the client does not subscribe.

Model Evaluating

test$pred_subs <- factor(ifelse(test$prob_subs > 0.5, 1, 0))
test[1:10, c("pred_subs", "subscribe")]

library(caret)
reg <- confusionMatrix(test$pred_subs, test$subscribe, positive = "1")
reg

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction    0    1
#>          0 6128  353
#>          1 1182  575
#>                                              
#>                Accuracy : 0.8137             
#>                  95% CI : (0.8051, 0.822)    
#>     No Information Rate : 0.8874             
#>     P-Value [Acc > NIR] : 1                  
#>                                              
#>                   Kappa : 0.3294             
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.6196             
#>             Specificity : 0.8383             
#>          Pos Pred Value : 0.3273             
#>          Neg Pred Value : 0.9455             
#>              Prevalence : 0.1126             
#>          Detection Rate : 0.0698             
#>    Detection Prevalence : 0.2133             
#>       Balanced Accuracy : 0.7290             
#>                                              
#>        'Positive' Class : 1                  
#>

Based on the results of the confusionMatrix above, we can take information that the model’s ability to guess target y (subscribed and not sunscribed) is 81.4%. Meanwhile, based on the actual data of unsubscribed clients, the model was able to guess correctly at 83.8%. From the total actual data of clients who subscribed, the model was able to guess correctly by 62.0%. From the overall prediction results that the model was able to guess, the model was able to correctly guess the positive class by 32.7%.

Tunning Off

performa <- function(cutoff, prob, ref, postarget, negtarget) 
{
  predict <- factor(ifelse(prob >= cutoff, postarget, negtarget))
  conf <- caret::confusionMatrix(predict , ref, positive = postarget)
  acc <- conf$overall[1]
  rec <- conf$byClass[1]
  prec <- conf$byClass[3]
  spec <- conf$byClass[2]
  mat <- t(as.matrix(c(rec , acc , prec, spec))) 
  colnames(mat) <- c("recall", "accuracy", "precicion", "specificity")
  return(mat)
}

co <- seq(0.01,0.80,length = 100)
result <- matrix(0, 100, 4)

for(i in 1:100){
  result[i,] = performa(cutoff = co[i], 
                     prob = test$prob_subs, 
                     ref = test$subscribe, 
                     postarget = "1", 
                     negtarget = "0")
}

data_frame("Recall" = result[,1],
           "Accuracy" = result[,2],
           "Precision" = result[,3],
           "Specificity" = result[,4],
           "Cutoff" = co) %>% 
  gather(key = "performa", value = "value", 1:4) %>% 
  ggplot(aes(x = Cutoff, y = value, col = performa)) +
  geom_line(lwd = 1.5) +
  scale_color_manual(values = c("plum1","plum4","rosybrown1", "rosybrown3")) +
  scale_y_continuous(breaks = seq(0,1,0.1), limits = c(0,1)) +
  scale_x_continuous(breaks = seq(0,1,0.1)) +
  labs(title = "Tradeoff Model Perfomance") +
  theme_minimal() +
  theme(legend.position = "top",
        panel.grid.minor.y = element_blank(),
        panel.grid.minor.x = element_blank())

In this case, we may want to take as many positive classes as possible (subscribed clients), so we need to increase the recall/sensitivity value. Based on the tradeoff of the performance model above, we can see that with a cutoff of 0.38 we get rather high recall/sensitivity values, but rather low precision, accuracy and specificity values.

# Decreasing Treshold to Increase Recall Value
test$pred_subs <- factor(ifelse(test$prob_subs > 0.38, 1, 0))
reg <- confusionMatrix(test$pred_subs, test$subscribe, positive = "1")
reg

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction    0    1
#>          0 5157  258
#>          1 2153  670
#>                                              
#>                Accuracy : 0.7073             
#>                  95% CI : (0.6974, 0.7171)   
#>     No Information Rate : 0.8874             
#>     P-Value [Acc > NIR] : 1                  
#>                                              
#>                   Kappa : 0.226              
#>                                              
#>  Mcnemar's Test P-Value : <0.0000000000000002
#>                                              
#>             Sensitivity : 0.72198            
#>             Specificity : 0.70547            
#>          Pos Pred Value : 0.23734            
#>          Neg Pred Value : 0.95235            
#>              Prevalence : 0.11265            
#>          Detection Rate : 0.08133            
#>    Detection Prevalence : 0.34268            
#>       Balanced Accuracy : 0.71373            
#>                                              
#>        'Positive' Class : 1                  
#>

Model Interpretation

exp(model_fit$coefficients) %>% 
  data.frame()

Client odds unknown credit in default = 0.78. This means that clients who are not known to have credit in default are 78% more likely to subscribe compared to clients who have and do not have credit in default.
Odds of clients with contact communication using telephone = 0.65. This means that clients with contact communication using telephones are 65% more likely to subscribe compared to clients using cellular.

K-Nearest Neighbour

Pre-processing Data

# Data Balancing
prop.table(table(bank$subscribe))

#> 
#>         0         1 
#> 0.8873458 0.1126542

# downsampling
RNGkind(sample.kind = "Rounding")
set.seed(100)

bank <- downSample(x = bank[, -20],
                  y = bank$subscribe,
                  yname = "subscribe")

# Dummy Variables
dmy <- dummyVars(" ~subscribe + job + marital + education + default + housing + loan + contact + month + day_of_week + poutcome", data = bank)
dmy <- data.frame(predict(dmy, newdata = bank))
glimpse(dmy)

#> Rows: 9,280
#> Columns: 55
#> $ subscribe.0                   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
#> $ subscribe.1                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ job.admin.                    <dbl> 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0~
#> $ job.blue.collar               <dbl> 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ job.entrepreneur              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ job.housemaid                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ job.management                <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ job.retired                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ job.self.employed             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ job.services                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1~
#> $ job.student                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ job.technician                <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0~
#> $ job.unemployed                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ job.unknown                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0~
#> $ marital.divorced              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0~
#> $ marital.married               <dbl> 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0~
#> $ marital.single                <dbl> 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1~
#> $ marital.unknown               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ education.basic.4y            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ education.basic.6y            <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ education.basic.9y            <dbl> 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0~
#> $ education.high.school         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1~
#> $ education.illiterate          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ education.professional.course <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ education.university.degree   <dbl> 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0~
#> $ education.unknown             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0~
#> $ default.no                    <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1~
#> $ default.unknown               <dbl> 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0~
#> $ default.yes                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ housing.no                    <dbl> 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0~
#> $ housing.unknown               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0~
#> $ housing.yes                   <dbl> 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1~
#> $ loan.no                       <dbl> 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1~
#> $ loan.unknown                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0~
#> $ loan.yes                      <dbl> 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ contact.cellular              <dbl> 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1~
#> $ contact.telephone             <dbl> 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0~
#> $ month.apr                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ month.aug                     <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0~
#> $ month.dec                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ month.jul                     <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1~
#> $ month.jun                     <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0~
#> $ month.mar                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ month.may                     <dbl> 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0~
#> $ month.nov                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ month.oct                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ month.sep                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ day_of_week.fri               <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0~
#> $ day_of_week.mon               <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ day_of_week.thu               <dbl> 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1~
#> $ day_of_week.tue               <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0~
#> $ day_of_week.wed               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0~
#> $ poutcome.failure              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0~
#> $ poutcome.nonexistent          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1~
#> $ poutcome.success              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~

# Delete variables that has only 2 categories
dmy$contact.cellular <- NULL
dmy$subscribe.0 <- NULL

# Data Splitting
set.seed(100)
intrain <- sample(nrow(bank), nrow(bank)*0.8)
dmy_train <- dmy[intrain,1:19]
dmy_test <- dmy[-intrain,1:19]

dmy_train_label <- dmy[intrain,20]
dmy_test_label <- dmy[-intrain,20]

Predict Values

# k optimum
round(sqrt(nrow(dmy_train)))

#> [1] 86

# Scaling data prediktor
dmy_train <- scale(dmy_train)
  
dmy_test <- scale(dmy_test,
                      center = attr(dmy_train, "scaled:center"),
                      scale = attr(dmy_train, "scaled:scale"))

#Predict
knn <- class::knn(train = dmy_train,
                  test = dmy_test, 
                  cl = dmy_train_label, 
                  k = 86)
knn

#>    [1] 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
#>   [38] 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0
#>   [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1
#>  [112] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0
#>  [149] 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#>  [186] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0
#>  [223] 0 0 1 1 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0
#>  [260] 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0
#>  [297] 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0
#>  [334] 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0
#>  [371] 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
#>  [408] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0
#>  [445] 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0
#>  [482] 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0
#>  [519] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1
#>  [556] 1 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 1 1 0 0 0
#>  [593] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
#>  [630] 0 1 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 1
#>  [667] 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#>  [704] 1 0 0 0 1 0 0 0 0 1 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0
#>  [741] 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
#>  [778] 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0
#>  [815] 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
#>  [852] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0
#>  [889] 0 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0
#>  [926] 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#>  [963] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1
#> [1000] 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1037] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0
#> [1074] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1111] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1148] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1
#> [1185] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1222] 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1259] 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
#> [1296] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1333] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1370] 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
#> [1407] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
#> [1444] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1481] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1518] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
#> [1555] 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1592] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1629] 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
#> [1666] 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
#> [1703] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0
#> [1740] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0
#> [1777] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1814] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1851] 0 0 0 0 0 0
#> Levels: 0 1

Model Evaluating

knn_eval <- confusionMatrix(as.factor(knn), as.factor(dmy_test_label),"1")
knn_eval

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction    0    1
#>          0 1570  120
#>          1   60  106
#>                                           
#>                Accuracy : 0.903           
#>                  95% CI : (0.8886, 0.9161)
#>     No Information Rate : 0.8782          
#>     P-Value [Acc > NIR] : 0.0004455       
#>                                           
#>                   Kappa : 0.488           
#>                                           
#>  Mcnemar's Test P-Value : 0.00001095      
#>                                           
#>             Sensitivity : 0.46903         
#>             Specificity : 0.96319         
#>          Pos Pred Value : 0.63855         
#>          Neg Pred Value : 0.92899         
#>              Prevalence : 0.12177         
#>          Detection Rate : 0.05711         
#>    Detection Prevalence : 0.08944         
#>       Balanced Accuracy : 0.71611         
#>                                           
#>        'Positive' Class : 1               
#>

Based on the results of the confusionMatrix above, we can take information that the model’s ability to guess target y (subscribed and not sunscribed) is 90.3%. Meanwhile, based on the actual data of unsubscribed clients, the model was able to guess correctly at 96.3%. From the total actual data of clients who subscribed, the model was able to guess correctly by 46.9%. From the overall prediction results that the model was able to guess, the model was able to correctly guess the positive class by 63.9%.

Comparison of The Two Models

eval_reg <- data_frame(Accuracy = reg$overall[1],
                       Recall = reg$byClass[1],
                       Specificity = reg$byClass[2],
                       Precision = reg$byClass[3])

eval_knn <- data_frame(Accuracy = knn_eval$overall[1],
                       Recall = knn_eval$byClass[1],
                       Specificity = knn_eval$byClass[2],
                       Precision = knn_eval$byClass[3])

# Logistic Regression
eval_reg

# k-Nearest Neighbour
eval_knn

When viewed from the two methods, namely by using Logistics Regression and k-NN, the model’s ability to take as many positive classes as possible (subscribed clients) is better by using the Logistics Regression method because it has a recall value = 72.2% greater than using the k-NN method.

CONCLUSION

If we want to increase the chances of a client subscribing, then the Logistics Regression method is better to use in predicting whether a client subscribes or not based on the recall value.
If we want to save on the telemarketing budget, we need to be as selective/precise as possible in assigning positive classes. So the k-NN method is better to use in predicting whether a client subscribes or not based on the precision value.

REFERENCES

S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014

Prediction of Bank Telemarketing Success (Subscription Client)

Karina Sasmitha

12/22/2021

INTRODUCTION

DATA PREPROCESSING

Packages Activation

Import Data

Data Manipulation

Missing Value Check

Glossary Data

Data Splitting

Balancing Data

DATA MODELLING

Logistic Regression

Model Fitting

Predicting

Model Evaluating

Tunning Off

Model Interpretation

K-Nearest Neighbour

Pre-processing Data

Predict Values

Model Evaluating

Comparison of The Two Models

CONCLUSION

REFERENCES