In this post, I try to predict whether bank clients will subscribe to a term deposit based on 19 predictor variables. I use two supervised-learning algorithms: logistic regression and k-nearest neighbors (k-NN). The data comes from the direct marketing campaigns of a Portuguese banking institution; the campaigns were based on phone calls.
library(dplyr)
library(gtools)
library(gmodels)
library(ggplot2)
library(class)
library(tidyr)

bank <- read.csv2("bank-additional-full.csv", stringsAsFactors = T)

bank <- bank %>%
  select(-duration) %>%
  rename(subscribe = y) %>%
  mutate_at(c("emp.var.rate", "cons.price.idx", "cons.conf.idx", "euribor3m", "nr.employed"),
            as.numeric)
bank$subscribe <- ifelse(bank$subscribe == "yes", 1, 0)
bank$subscribe <- as.factor(bank$subscribe)
glimpse(bank)
#> Rows: 41,188
#> Columns: 20
#> $ age <int> 56, 57, 37, 40, 56, 45, 59, 41, 24, 25, 41, 25, 29, 57,~
#> $ job <fct> housemaid, services, services, admin., services, servic~
#> $ marital <fct> married, married, married, married, married, married, m~
#> $ education <fct> basic.4y, high.school, high.school, basic.6y, high.scho~
#> $ default <fct> no, unknown, no, no, no, unknown, no, unknown, no, no, ~
#> $ housing <fct> no, no, yes, no, no, no, no, no, yes, yes, no, yes, no,~
#> $ loan <fct> no, no, no, no, yes, no, no, no, no, no, no, no, yes, n~
#> $ contact <fct> telephone, telephone, telephone, telephone, telephone, ~
#> $ month <fct> may, may, may, may, may, may, may, may, may, may, may, ~
#> $ day_of_week <fct> mon, mon, mon, mon, mon, mon, mon, mon, mon, mon, mon, ~
#> $ campaign <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
#> $ pdays <int> 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, ~
#> $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ poutcome <fct> nonexistent, nonexistent, nonexistent, nonexistent, non~
#> $ emp.var.rate <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9~
#> $ cons.price.idx <dbl> 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19, 19,~
#> $ cons.conf.idx <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,~
#> $ euribor3m <dbl> 288, 288, 288, 288, 288, 288, 288, 288, 288, 288, 288, ~
#> $ nr.employed <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9~
#> $ subscribe <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
anyNA(bank)
#> [1] FALSE

head(bank)

library(rsample)
RNGkind(sample.kind = "Rounding")
set.seed(100)
intrain <- initial_split(data = bank, prop = 0.8, strata = "subscribe")
train <- training(intrain)
test <- testing(intrain)

prop.table(table(train$subscribe))
#>
#> 0 1
#> 0.8873445 0.1126555
About 89% of clients did not subscribe and only 11% did. The classes are imbalanced, so we down-sample the majority class before modeling.
# downsampling
RNGkind(sample.kind = "Rounding")
set.seed(100)
library(caret)
train <- downSample(x = train[, -20],
                    y = train$subscribe,
                    yname = "subscribe")
model <- glm(formula = subscribe ~ ., family = "binomial", data = train)

summary(model)
#>
#> Call:
#> glm(formula = subscribe ~ ., family = "binomial", data = train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.9323 -0.8640 -0.1406 0.8263 2.1814
#>
#> Coefficients: (1 not defined because of singularities)
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 4.1342825 0.5470125 7.558 0.000000000000041 ***
#> age -0.0017967 0.0033274 -0.540 0.589226
#> jobblue-collar -0.0704032 0.1036191 -0.679 0.496857
#> jobentrepreneur 0.0515184 0.1596435 0.323 0.746915
#> jobhousemaid -0.0317424 0.1948268 -0.163 0.870577
#> jobmanagement 0.0016294 0.1191533 0.014 0.989090
#> jobretired 0.2886229 0.1584951 1.821 0.068604 .
#> jobself-employed 0.1255154 0.1597050 0.786 0.431914
#> jobservices -0.0160963 0.1139789 -0.141 0.887695
#> jobstudent 0.1310914 0.1801112 0.728 0.466714
#> jobtechnician 0.0562148 0.0969316 0.580 0.561953
#> jobunemployed 0.0173884 0.1837646 0.095 0.924614
#> jobunknown 0.3184534 0.3521423 0.904 0.365820
#> maritalmarried 0.1195616 0.0918998 1.301 0.193259
#> maritalsingle 0.2160563 0.1041243 2.075 0.037988 *
#> maritalunknown -0.2103451 0.5691991 -0.370 0.711721
#> educationbasic.6y 0.0721148 0.1511415 0.477 0.633266
#> educationbasic.9y 0.0695536 0.1230743 0.565 0.571982
#> educationhigh.school 0.0963353 0.1253353 0.769 0.442118
#> educationilliterate 1.1844359 1.1949471 0.991 0.321586
#> educationprofessional.course 0.0848181 0.1391474 0.610 0.542156
#> educationuniversity.degree 0.1742095 0.1264624 1.378 0.168340
#> educationunknown 0.2466546 0.1685937 1.463 0.143464
#> defaultunknown -0.2162364 0.0786670 -2.749 0.005982 **
#> housingunknown -0.1670575 0.1778791 -0.939 0.347647
#> housingyes 0.0032134 0.0560394 0.057 0.954273
#> loanunknown NA NA NA NA
#> loanyes -0.0007388 0.0771414 -0.010 0.992359
#> contacttelephone -0.4303114 0.1006260 -4.276 0.000018998762854 ***
#> monthaug -0.3152258 0.1696353 -1.858 0.063133 .
#> monthdec 0.8763744 0.4592968 1.908 0.056381 .
#> monthjul 0.2721673 0.1366847 1.991 0.046458 *
#> monthjun 0.4733285 0.1420589 3.332 0.000862 ***
#> monthmar 0.9605370 0.2284793 4.204 0.000026218891707 ***
#> monthmay -0.7219572 0.1085949 -6.648 0.000000000029676 ***
#> monthnov -0.6442862 0.1740025 -3.703 0.000213 ***
#> monthoct 0.2025423 0.2153873 0.940 0.347031
#> monthsep -0.4885715 0.2532523 -1.929 0.053707 .
#> day_of_weekmon -0.1924819 0.0893685 -2.154 0.031256 *
#> day_of_weekthu -0.0800636 0.0873920 -0.916 0.359592
#> day_of_weektue -0.0655968 0.0897620 -0.731 0.464910
#> day_of_weekwed 0.0965588 0.0890494 1.084 0.278219
#> campaign -0.0428726 0.0127951 -3.351 0.000806 ***
#> pdays -0.0010602 0.0003825 -2.772 0.005575 **
#> previous -0.1662750 0.1092295 -1.522 0.127946
#> poutcomenonexistent 0.4330829 0.1544500 2.804 0.005047 **
#> poutcomesuccess 0.8060621 0.3827393 2.106 0.035201 *
#> emp.var.rate -0.0457347 0.0228060 -2.005 0.044923 *
#> cons.price.idx -0.0286395 0.0082965 -3.452 0.000556 ***
#> cons.conf.idx -0.0212858 0.0098754 -2.155 0.031128 *
#> euribor3m -0.0003363 0.0012704 -0.265 0.791246
#> nr.employed -0.2615683 0.0352732 -7.415 0.000000000000121 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 10291.8 on 7423 degrees of freedom
#> Residual deviance: 7958.8 on 7373 degrees of freedom
#> AIC: 8060.8
#>
#> Number of Fisher Scoring iterations: 5
In this first model, many of the predictors are not significant, so we refit it with backward stepwise selection, which iteratively drops the terms whose removal improves (lowers) the AIC.
library(MASS)
model_fit <- stepAIC(model, direction = "backward", trace = 0)
summary(model_fit)
#>
#> Call:
#> glm(formula = subscribe ~ marital + default + contact + month +
#> day_of_week + campaign + pdays + previous + poutcome + emp.var.rate +
#> cons.price.idx + cons.conf.idx + nr.employed, family = "binomial",
#> data = train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.9220 -0.8621 -0.1442 0.8440 2.1223
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 4.3082910 0.5019377 8.583 < 0.0000000000000002 ***
#> maritalmarried 0.1074050 0.0908629 1.182 0.237184
#> maritalsingle 0.2374438 0.0970528 2.447 0.014423 *
#> maritalunknown -0.1532229 0.5705472 -0.269 0.788273
#> defaultunknown -0.2467302 0.0761079 -3.242 0.001188 **
#> contacttelephone -0.4313096 0.0995392 -4.333 0.000014704951920 ***
#> monthaug -0.3082251 0.1652161 -1.866 0.062099 .
#> monthdec 0.8968909 0.4534675 1.978 0.047946 *
#> monthjul 0.2840095 0.1347890 2.107 0.035112 *
#> monthjun 0.4867265 0.1374133 3.542 0.000397 ***
#> monthmar 0.9706694 0.2279753 4.258 0.000020646421878 ***
#> monthmay -0.7614103 0.1066033 -7.142 0.000000000000917 ***
#> monthnov -0.6434139 0.1735204 -3.708 0.000209 ***
#> monthoct 0.1955270 0.2143067 0.912 0.361574
#> monthsep -0.5149734 0.2469587 -2.085 0.037046 *
#> day_of_weekmon -0.1983658 0.0890909 -2.227 0.025977 *
#> day_of_weekthu -0.0836150 0.0870942 -0.960 0.337029
#> day_of_weektue -0.0739560 0.0894027 -0.827 0.408111
#> day_of_weekwed 0.0895782 0.0886935 1.010 0.312507
#> campaign -0.0410276 0.0127154 -3.227 0.001253 **
#> pdays -0.0010746 0.0003817 -2.816 0.004870 **
#> previous -0.1600083 0.1085405 -1.474 0.140433
#> poutcomenonexistent 0.4438523 0.1538950 2.884 0.003925 **
#> poutcomesuccess 0.8018153 0.3820009 2.099 0.035818 *
#> emp.var.rate -0.0486142 0.0227525 -2.137 0.032626 *
#> cons.price.idx -0.0301588 0.0072232 -4.175 0.000029766837421 ***
#> cons.conf.idx -0.0238581 0.0091434 -2.609 0.009072 **
#> nr.employed -0.2731009 0.0193279 -14.130 < 0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 10292 on 7423 degrees of freedom
#> Residual deviance: 7975 on 7396 degrees of freedom
#> AIC: 8031
#>
#> Number of Fisher Scoring iterations: 5
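The stepwise model keeps 13 of the original predictors and fits better by AIC. As a quick check (the values come from the two summaries above):

# compare the two fits: full model AIC 8060.8 vs stepwise model AIC 8031
AIC(model, model_fit)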
# Predict
test$prob_subs <- predict(model_fit, type = "response", newdata = test)
ggplot(test, aes(x = prob_subs)) +
geom_density(lwd = 0.5) +
labs(title = "Distribution of Probability Prediction Data",
x = "Probability of Subscribed Client") +
  theme_minimal()

The density plot shows that the predicted probabilities pile up near 0: for most clients, the model predicts that they will not subscribe.
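To put a number on this skew, a small sketch (not part of the original) computes the share of test clients whose predicted probability falls below the default 0.5 cutoff:

# fraction of test clients predicted "not subscribe" at the 0.5 cutoff
mean(test$prob_subs < 0.5)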
test$pred_subs <- factor(ifelse(test$prob_subs > 0.5, 1, 0))
test[1:10, c("pred_subs", "subscribe")]library(caret)
reg <- confusionMatrix(test$pred_subs, test$subscribe, positive = "1")
reg
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 6128 353
#> 1 1182 575
#>
#> Accuracy : 0.8137
#> 95% CI : (0.8051, 0.822)
#> No Information Rate : 0.8874
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.3294
#>
#> Mcnemar's Test P-Value : <0.0000000000000002
#>
#> Sensitivity : 0.6196
#> Specificity : 0.8383
#> Pos Pred Value : 0.3273
#> Neg Pred Value : 0.9455
#> Prevalence : 0.1126
#> Detection Rate : 0.0698
#> Detection Prevalence : 0.2133
#> Balanced Accuracy : 0.7290
#>
#> 'Positive' Class : 1
#>
Based on the confusion matrix above, the model classifies 81.4% of all test clients correctly (accuracy). Among clients who actually did not subscribe, it is correct 83.8% of the time (specificity); among clients who actually subscribed, it is correct 62.0% of the time (sensitivity/recall); and of all clients it predicts as subscribers, only 32.7% actually subscribed (precision).
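These metrics follow directly from the four cells of the confusion matrix; as a quick check (TP/TN/FP/FN taken from the table above):

# recompute the headline metrics from the confusion-matrix counts
TP <- 575; TN <- 6128; FP <- 1182; FN <- 353
(TP + TN) / (TP + TN + FP + FN) # accuracy    ~ 0.8137
TP / (TP + FN)                  # recall      ~ 0.6196
TN / (TN + FP)                  # specificity ~ 0.8383
TP / (TP + FP)                  # precision   ~ 0.3273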
# compute recall, accuracy, precision, and specificity at a given cutoff
performa <- function(cutoff, prob, ref, postarget, negtarget) {
  predict <- factor(ifelse(prob >= cutoff, postarget, negtarget))
  conf <- caret::confusionMatrix(predict, ref, positive = postarget)
  acc <- conf$overall[1]
  rec <- conf$byClass[1]
  prec <- conf$byClass[3]
  spec <- conf$byClass[2]
  mat <- t(as.matrix(c(rec, acc, prec, spec)))
  colnames(mat) <- c("recall", "accuracy", "precision", "specificity")
  return(mat)
}
co <- seq(0.01, 0.80, length = 100)
result <- matrix(0, 100, 4)

# evaluate the four metrics at each of the 100 candidate cutoffs
for (i in 1:100) {
  result[i, ] <- performa(cutoff = co[i],
                          prob = test$prob_subs,
                          ref = test$subscribe,
                          postarget = "1",
                          negtarget = "0")
}
data_frame("Recall" = result[,1],
"Accuracy" = result[,2],
"Precision" = result[,3],
"Specificity" = result[,4],
"Cutoff" = co) %>%
gather(key = "performa", value = "value", 1:4) %>%
ggplot(aes(x = Cutoff, y = value, col = performa)) +
geom_line(lwd = 1.5) +
scale_color_manual(values = c("plum1","plum4","rosybrown1", "rosybrown3")) +
scale_y_continuous(breaks = seq(0,1,0.1), limits = c(0,1)) +
scale_x_continuous(breaks = seq(0,1,0.1)) +
labs(title = "Tradeoff Model Perfomance") +
theme_minimal() +
theme(legend.position = "top",
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank()) In this case, we may want to take as many positive classes as possible (subscribed clients), so we need to increase the recall/sensitivity value. Based on the tradeoff of the performance model above, we can see that with a cutoff of 0.38 we get rather high recall/sensitivity values, but rather low precision, accuracy and specificity values.
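The chosen cutoff can also be read from the result matrix programmatically. For example, one common rule (a sketch, not necessarily how the 0.38 above was chosen) takes the cutoff where the recall and specificity curves intersect:

# cutoff where recall (column 1) and specificity (column 4) are closest
co[which.min(abs(result[, 1] - result[, 4]))]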
# Decreasing the threshold to increase recall
test$pred_subs <- factor(ifelse(test$prob_subs > 0.38, 1, 0))
reg <- confusionMatrix(test$pred_subs, test$subscribe, positive = "1")
reg
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 5157 258
#> 1 2153 670
#>
#> Accuracy : 0.7073
#> 95% CI : (0.6974, 0.7171)
#> No Information Rate : 0.8874
#> P-Value [Acc > NIR] : 1
#>
#> Kappa : 0.226
#>
#> Mcnemar's Test P-Value : <0.0000000000000002
#>
#> Sensitivity : 0.72198
#> Specificity : 0.70547
#> Pos Pred Value : 0.23734
#> Neg Pred Value : 0.95235
#> Prevalence : 0.11265
#> Detection Rate : 0.08133
#> Detection Prevalence : 0.34268
#> Balanced Accuracy : 0.71373
#>
#> 'Positive' Class : 1
#>
exp(model_fit$coefficients) %>%
  data.frame()
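Exponentiating the coefficients turns log-odds into odds ratios. For example, taking the poutcomesuccess estimate from the summary above:

# odds ratio for poutcomesuccess: exp(0.8018153) ~ 2.23
exp(0.8018153)

so a client whose previous campaign ended in success has roughly 2.2 times the odds of subscribing, holding the other predictors fixed.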
# Data Balancing
prop.table(table(bank$subscribe))
#>
#> 0 1
#> 0.8873458 0.1126542
# downsampling
RNGkind(sample.kind = "Rounding")
set.seed(100)
bank <- downSample(x = bank[, -20],
                   y = bank$subscribe,
                   yname = "subscribe")

Since k-NN measures distances between numeric vectors, the categorical predictors must first be expanded into 0/1 dummy variables.

# Dummy Variables
dmy <- dummyVars(" ~subscribe + job + marital + education + default + housing + loan + contact + month + day_of_week + poutcome", data = bank)
dmy <- data.frame(predict(dmy, newdata = bank))
glimpse(dmy)
#> Rows: 9,280
#> Columns: 55
#> $ subscribe.0 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
#> $ subscribe.1 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ job.admin. <dbl> 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0~
#> $ job.blue.collar <dbl> 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ job.entrepreneur <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ job.housemaid <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ job.management <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ job.retired <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ job.self.employed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ job.services <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1~
#> $ job.student <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ job.technician <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0~
#> $ job.unemployed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ job.unknown <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0~
#> $ marital.divorced <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0~
#> $ marital.married <dbl> 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0~
#> $ marital.single <dbl> 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1~
#> $ marital.unknown <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ education.basic.4y <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ education.basic.6y <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ education.basic.9y <dbl> 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0~
#> $ education.high.school <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1~
#> $ education.illiterate <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ education.professional.course <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ education.university.degree <dbl> 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0~
#> $ education.unknown <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0~
#> $ default.no <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1~
#> $ default.unknown <dbl> 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0~
#> $ default.yes <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ housing.no <dbl> 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0~
#> $ housing.unknown <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0~
#> $ housing.yes <dbl> 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1~
#> $ loan.no <dbl> 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1~
#> $ loan.unknown <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0~
#> $ loan.yes <dbl> 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ contact.cellular <dbl> 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1~
#> $ contact.telephone <dbl> 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0~
#> $ month.apr <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ month.aug <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0~
#> $ month.dec <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ month.jul <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1~
#> $ month.jun <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0~
#> $ month.mar <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ month.may <dbl> 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0~
#> $ month.nov <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ month.oct <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ month.sep <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ day_of_week.fri <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0~
#> $ day_of_week.mon <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ day_of_week.thu <dbl> 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1~
#> $ day_of_week.tue <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0~
#> $ day_of_week.wed <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0~
#> $ poutcome.failure <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0~
#> $ poutcome.nonexistent <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1~
#> $ poutcome.success <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
# Drop one redundant dummy from each two-level variable
dmy$contact.cellular <- NULL
dmy$subscribe.0 <- NULL

# Data Splitting
set.seed(100)
intrain <- sample(nrow(dmy), nrow(dmy) * 0.8)
dmy_train <- dmy[intrain, -1]                  # predictors only
dmy_test <- dmy[-intrain, -1]
dmy_train_label <- dmy[intrain, "subscribe.1"] # target labels
dmy_test_label <- dmy[-intrain, "subscribe.1"]

# k optimum
round(sqrt(nrow(dmy_train)))
#> [1] 86
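A common heuristic sets k to roughly the square root of the number of training rows, here 86. Note that 86 is even, and with a binary target an even k can produce tied votes (which class::knn breaks at random); an optional variant (not used below) bumps k to the nearest odd number:

# optional: use an odd k so majority votes cannot tie
k_opt <- round(sqrt(nrow(dmy_train)))
if (k_opt %% 2 == 0) k_opt <- k_opt + 1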
# Scale the predictors; reuse the training centers and scales on the
# test set so that no information leaks from test into train
dmy_train <- scale(dmy_train)
dmy_test <- scale(dmy_test,
                  center = attr(dmy_train, "scaled:center"),
                  scale = attr(dmy_train, "scaled:scale"))

# Predict
knn <- class::knn(train = dmy_train,
                  test = dmy_test,
                  cl = dmy_train_label,
                  k = 86)
knn
#> [1] 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
#> [38] 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0
#> [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1
#> [112] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0
#> [149] 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [186] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0
#> [223] 0 0 1 1 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0
#> [260] 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0
#> [297] 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0
#> [334] 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0
#> [371] 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
#> [408] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0
#> [445] 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0
#> [482] 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0
#> [519] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1
#> [556] 1 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 1 1 0 0 0
#> [593] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
#> [630] 0 1 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 1
#> [667] 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [704] 1 0 0 0 1 0 0 0 0 1 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0
#> [741] 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
#> [778] 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0
#> [815] 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
#> [852] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0
#> [889] 0 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0
#> [926] 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [963] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1
#> [1000] 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1037] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0
#> [1074] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1111] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1148] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1
#> [1185] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1222] 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1259] 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
#> [1296] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1333] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1370] 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
#> [1407] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
#> [1444] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1481] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1518] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
#> [1555] 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1592] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1629] 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
#> [1666] 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
#> [1703] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0
#> [1740] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0
#> [1777] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1814] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> [1851] 0 0 0 0 0 0
#> Levels: 0 1
knn_eval <- confusionMatrix(as.factor(knn), as.factor(dmy_test_label), positive = "1")

knn_eval
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 1570 120
#> 1 60 106
#>
#> Accuracy : 0.903
#> 95% CI : (0.8886, 0.9161)
#> No Information Rate : 0.8782
#> P-Value [Acc > NIR] : 0.0004455
#>
#> Kappa : 0.488
#>
#> Mcnemar's Test P-Value : 0.00001095
#>
#> Sensitivity : 0.46903
#> Specificity : 0.96319
#> Pos Pred Value : 0.63855
#> Neg Pred Value : 0.92899
#> Prevalence : 0.12177
#> Detection Rate : 0.05711
#> Detection Prevalence : 0.08944
#> Balanced Accuracy : 0.71611
#>
#> 'Positive' Class : 1
#>
Based on the confusion matrix above, the k-NN model classifies 90.3% of the test observations correctly (accuracy). Among clients who actually did not subscribe, it is correct 96.3% of the time (specificity); among clients who actually subscribed, only 46.9% are identified (sensitivity/recall); and of all clients it predicts as subscribers, 63.9% actually subscribed (precision).
eval_reg <- tibble(Accuracy = reg$overall[1],
                   Recall = reg$byClass[1],
                   Specificity = reg$byClass[2],
                   Precision = reg$byClass[3])

eval_knn <- tibble(Accuracy = knn_eval$overall[1],
                   Recall = knn_eval$byClass[1],
                   Specificity = knn_eval$byClass[2],
                   Precision = knn_eval$byClass[3])

# Logistic Regression
eval_reg

# k-Nearest Neighbour
eval_knn

Comparing the two methods, Logistic Regression is the better choice when the goal is to capture as many subscribed clients (the positive class) as possible: its recall of 72.2% is well above the 46.9% achieved by k-NN.
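For a compact side-by-side view, the two metric tables can be stacked into one (a sketch using dplyr::bind_rows, not part of the original output):

# stack both evaluation tables with a model identifier column
bind_rows("Logistic Regression" = eval_reg, "k-NN" = eval_knn, .id = "model")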
S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014