Telemarketing is one of the ways banks advertise their products, but it is intrusive and annoying to the customers who receive the unwanted calls. I have never been a telemarketer, but I have received many telemarketing calls from unknown numbers. As soon as I realize the caller is a telemarketer, I end the call after a brief conversation, regardless of whether the person on the other end has had a chance to present their offer. I suspect that telemarketers (of unsecured credit, for example) often target people who can pay for their services but have no need for them, and conversely, that the people who need the services usually cannot pay. A study using data from a bank in Portugal aimed to predict which of the bank's customers would accept the term deposit product offered by its telemarketers.
After the initial EDA and the additional insights from here and here, we are finally ready to do the actual prediction of customers' acceptance or rejection of the term deposit offers made by the bank via telemarketing.
The dataset and its description can be found at this link.
There are four datasets in the link above:
For the initial EDA and additional insights we used bank.csv; for the actual prediction, we will use bank-full.csv.
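The code below assumes the following packages are attached; this is a minimal setup sketch (the original post may have loaded them in an earlier part of the series), with rsample loaded separately just before the train-test split.
# packages assumed throughout this post
library(dplyr)   # glimpse(), %>%, select(), select_if()
library(GGally)  # ggcorr()
library(caret)   # confusionMatrix()
library(class)   # knn()
library(car)     # vif()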
The descriptions of the columns are:
# read in bank-full.csv
bank_full <- read.csv("data_input/bank-full.csv",sep = ";", stringsAsFactors = TRUE)
# inspect the first six rows of bank_full
head(bank_full)
# inspect the data types of bank_full
glimpse(bank_full)
#> Rows: 45,211
#> Columns: 17
#> $ age <int> 58, 44, 33, 47, 33, 35, 28, 42, 58, 43, 41, 29, 53, 58, 57, …
#> $ job <fct> management, technician, entrepreneur, blue-collar, unknown, …
#> $ marital <fct> married, single, married, married, single, married, single, …
#> $ education <fct> tertiary, secondary, secondary, unknown, unknown, tertiary, …
#> $ default <fct> no, no, no, no, no, no, no, yes, no, no, no, no, no, no, no,…
#> $ balance <int> 2143, 29, 2, 1506, 1, 231, 447, 2, 121, 593, 270, 390, 6, 71…
#> $ housing <fct> yes, yes, yes, yes, no, yes, yes, yes, yes, yes, yes, yes, y…
#> $ loan <fct> no, no, yes, no, no, no, yes, no, no, no, no, no, no, no, no…
#> $ contact <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow…
#> $ day <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, …
#> $ month <fct> may, may, may, may, may, may, may, may, may, may, may, may, …
#> $ duration <int> 261, 151, 76, 92, 198, 139, 217, 380, 50, 55, 222, 137, 517,…
#> $ campaign <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ pdays <int> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, …
#> $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ poutcome <fct> unknown, unknown, unknown, unknown, unknown, unknown, unknow…
#> $ y <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, …
The data has 45,211 observations and 17 columns. Because this is roughly ten times more data than the previous bank.csv, we will look at the summary and several boxplots to check whether our EDA findings on bank.csv hold for bank-full.csv.
# check if there are any NAs
anyNA(bank_full)
#> [1] FALSE
# summary of bank_full
summary(bank_full)
#> age job marital education
#> Min. :18.00 blue-collar:9732 divorced: 5207 primary : 6851
#> 1st Qu.:33.00 management :9458 married :27214 secondary:23202
#> Median :39.00 technician :7597 single :12790 tertiary :13301
#> Mean :40.94 admin. :5171 unknown : 1857
#> 3rd Qu.:48.00 services :4154
#> Max. :95.00 retired :2264
#> (Other) :6835
#> default balance housing loan contact
#> no :44396 Min. : -8019 no :20081 no :37967 cellular :29285
#> yes: 815 1st Qu.: 72 yes:25130 yes: 7244 telephone: 2906
#> Median : 448 unknown :13020
#> Mean : 1362
#> 3rd Qu.: 1428
#> Max. :102127
#>
#> day month duration campaign
#> Min. : 1.00 may :13766 Min. : 0.0 Min. : 1.000
#> 1st Qu.: 8.00 jul : 6895 1st Qu.: 103.0 1st Qu.: 1.000
#> Median :16.00 aug : 6247 Median : 180.0 Median : 2.000
#> Mean :15.81 jun : 5341 Mean : 258.2 Mean : 2.764
#> 3rd Qu.:21.00 nov : 3970 3rd Qu.: 319.0 3rd Qu.: 3.000
#> Max. :31.00 apr : 2932 Max. :4918.0 Max. :63.000
#> (Other): 6060
#> pdays previous poutcome y
#> Min. : -1.0 Min. : 0.0000 failure: 4901 no :39922
#> 1st Qu.: -1.0 1st Qu.: 0.0000 other : 1840 yes: 5289
#> Median : -1.0 Median : 0.0000 success: 1511
#> Mean : 40.2 Mean : 0.5803 unknown:36959
#> 3rd Qu.: -1.0 3rd Qu.: 0.0000
#> Max. :871.0 Max. :275.0000
#>
From the summary, it can be seen that the target variable is highly imbalanced. Let us quantify the imbalance using prop.table().
prop.table(table(bank_full$y))
#>
#> no yes
#> 0.8830152 0.1169848
About 88.3% of the customers rejected the offer and only 11.7% accepted it.
Now, if we compare the means and medians of the numerical columns, we can see that balance, pdays, campaign, and duration are very likely to have many outliers, while previous is likely to have fewer.
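One way to eyeball those outliers is with boxplots; the sketch below uses base R graphics (the boxplots referred to earlier are not reproduced here).
# sketch: boxplots of the outlier-prone numeric columns
par(mfrow = c(1, 4))
for (col in c("balance", "pdays", "campaign", "duration")) {
  boxplot(bank_full[[col]], main = col)
}
par(mfrow = c(1, 1))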
Most categorical columns show all of their levels with the number of observations in each, except two: job and month. Let us check the counts of the unique values of these two columns.
table(bank_full$job)
#>
#> admin. blue-collar entrepreneur housemaid management
#> 5171 9732 1487 1240 9458
#> retired self-employed services student technician
#> 2264 1579 4154 938 7597
#> unemployed unknown
#> 1303 288
table(bank_full$month)
#>
#> apr aug dec feb jan jul jun mar may nov oct sep
#> 2932 6247 214 2649 1403 6895 5341 477 13766 3970 738 579
Based on the summary and the count tables above, there is no category level with just three or fewer observations, so we do not need to filter out sparse levels and we can go on to the train-test split. A quick programmatic double-check is sketched below.
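This one-liner (a sketch, not part of the original workflow) flags any factor column that contains a level with three or fewer observations:
# sketch: does any factor level have three or fewer observations?
sapply(Filter(is.factor, bank_full), function(x) any(table(x) <= 3))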
Before we go on to the train-test split, however, it is good to check the correlations between the numerical columns of bank_full.
ggcorr(bank_full, label=TRUE)
The only moderately strong correlation is between previous and pdays; all the other pairs of columns are weakly correlated.
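To put a number on that strongest pair, a one-line check:
# correlation between previous and pdays
cor(bank_full$previous, bank_full$pdays)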
For prediction purposes, the duration of a call is not known before the call is made, so we will remove this predictor before splitting.
bank_full1 <- bank_full %>% select(-duration)
library(rsample)
set.seed(123)
init <- initial_split(data = bank_full1, prop = 0.8, strata = y)
train_bank <- training(init)
test_bank <- testing(init)
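Since we stratified the split on y, a quick sanity check (a sketch) confirms that both sets keep roughly the original 88/12 class balance:
# sketch: class proportions should be similar in both splits
prop.table(table(train_bank$y))
prop.table(table(test_bank$y))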
The first model we are going to use to predict whether customers accept or reject the telemarketing offer is logistic regression. This model is simpler and less flexible than the next model we are going to try, k-nearest neighbors (kNN), but it is also more interpretable.
model_logreg <- glm(y ~., data= train_bank, family = "binomial")
summary(model_logreg)
#>
#> Call:
#> glm(formula = y ~ ., family = "binomial", data = train_bank)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.3235 -0.4819 -0.3793 -0.2435 3.5757
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -1.252715940 0.183500028 -6.827 0.0000000000086836 ***
#> age 0.000226180 0.002199845 0.103 0.918109
#> jobblue-collar -0.154905842 0.072791657 -2.128 0.033331 *
#> jobentrepreneur -0.162872161 0.124125559 -1.312 0.189467
#> jobhousemaid -0.256082163 0.132224933 -1.937 0.052780 .
#> jobmanagement -0.038654212 0.073553625 -0.526 0.599219
#> jobretired 0.407362719 0.097379874 4.183 0.0000287392377694 ***
#> jobself-employed -0.162305976 0.113530334 -1.430 0.152824
#> jobservices -0.060723417 0.083068127 -0.731 0.464775
#> jobstudent 0.271615929 0.111018694 2.447 0.014422 *
#> jobtechnician -0.090513505 0.069490288 -1.303 0.192734
#> jobunemployed 0.121109647 0.108356454 1.118 0.263697
#> jobunknown -0.161202121 0.233077486 -0.692 0.489173
#> maritalmarried -0.159798235 0.058601192 -2.727 0.006394 **
#> maritalsingle 0.144166535 0.066806826 2.158 0.030931 *
#> educationsecondary 0.132479032 0.063391922 2.090 0.036632 *
#> educationtertiary 0.317410942 0.074049413 4.286 0.0000181530801310 ***
#> educationunknown 0.180781138 0.103823192 1.741 0.081641 .
#> defaultyes -0.112750773 0.163534416 -0.689 0.490533
#> balance 0.000015866 0.000005007 3.169 0.001529 **
#> housingyes -0.500745025 0.043115630 -11.614 < 0.0000000000000002 ***
#> loanyes -0.452128779 0.060558669 -7.466 0.0000000000000827 ***
#> contacttelephone -0.244521313 0.073049592 -3.347 0.000816 ***
#> contactunknown -1.370121616 0.071176408 -19.250 < 0.0000000000000002 ***
#> day 0.003288564 0.002479243 1.326 0.184694
#> monthaug -0.812319254 0.078188450 -10.389 < 0.0000000000000002 ***
#> monthdec 0.761792389 0.179955241 4.233 0.0000230356056245 ***
#> monthfeb -0.397947745 0.089899260 -4.427 0.0000095731578611 ***
#> monthjan -1.140465739 0.121884380 -9.357 < 0.0000000000000002 ***
#> monthjul -0.660700780 0.076060998 -8.686 < 0.0000000000000002 ***
#> monthjun 0.205605533 0.093299108 2.204 0.027544 *
#> monthmar 1.055639338 0.124771628 8.461 < 0.0000000000000002 ***
#> monthmay -0.445097925 0.071903955 -6.190 0.0000000006009821 ***
#> monthnov -0.862126512 0.084010373 -10.262 < 0.0000000000000002 ***
#> monthoct 0.686586812 0.109652456 6.261 0.0000000003813373 ***
#> monthsep 0.638271761 0.122486179 5.211 0.0000001878561988 ***
#> campaign -0.081849377 0.009408080 -8.700 < 0.0000000000000002 ***
#> pdays -0.000114825 0.000310349 -0.370 0.711392
#> previous 0.025999418 0.009695453 2.682 0.007327 **
#> poutcomeother 0.192934216 0.090256129 2.138 0.032547 *
#> poutcomesuccess 2.199595832 0.084013625 26.181 < 0.0000000000000002 ***
#> poutcomeunknown 0.051268131 0.096910267 0.529 0.596787
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 26108 on 36169 degrees of freedom
#> Residual deviance: 21711 on 36128 degrees of freedom
#> AIC: 21795
#>
#> Number of Fisher Scoring iterations: 6
Logistic regression produces probabilities, which need to be converted to class labels using an ifelse statement. Here we will convert probabilities of 0.5 and above to “yes” and probabilities below 0.5 to “no”. The labels then need to be converted to factors so that the predicted labels can be compared with the true labels to measure the performance of the models.
bank_pred_logreg1 <- predict(model_logreg, newdata = test_bank ,type = "response")
predictions1 <- as.factor(ifelse(bank_pred_logreg1<0.5, "no", "yes"))
head(predictions1)
#> 2 5 7 10 16 17
#> no no no no no no
#> Levels: no yes
The metrics we are going to use to evaluate the performance of our model are accuracy, recall/sensitivity, and precision/pos pred value, obtained via the confusionMatrix() function from the caret package. Of the three, we will use recall as our primary metric because recall captures the false negatives, that is, customers who are predicted to reject the offer but would actually accept it. This makes sense in the context of the bank's business because a false negative means not contacting a customer who would have accepted the offer. A sketch of how these metrics fall out of the confusion matrix follows.
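For concreteness, here are recall and precision computed by hand from the same predictions (the variable names cm, TP, FN, and FP are ours, introduced for illustration):
# sketch: recall and precision from the raw confusion matrix
cm <- table(pred = predictions1, actual = test_bank$y)
TP <- cm["yes", "yes"]  # predicted yes, actually yes
FN <- cm["no", "yes"]   # predicted no, actually yes (missed customers)
FP <- cm["yes", "no"]   # predicted yes, actually no
TP / (TP + FN)          # recall/sensitivity
TP / (TP + FP)          # precision/pos pred value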
confusionMatrix(predictions1, reference = test_bank$y, positive = "yes")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction no yes
#> no 7873 874
#> yes 111 183
#>
#> Accuracy : 0.8911
#> 95% CI : (0.8844, 0.8974)
#> No Information Rate : 0.8831
#> P-Value [Acc > NIR] : 0.009147
#>
#> Kappa : 0.2318
#>
#> Mcnemar's Test P-Value : < 0.00000000000000022
#>
#> Sensitivity : 0.17313
#> Specificity : 0.98610
#> Pos Pred Value : 0.62245
#> Neg Pred Value : 0.90008
#> Prevalence : 0.11691
#> Detection Rate : 0.02024
#> Detection Prevalence : 0.03252
#> Balanced Accuracy : 0.57961
#>
#> 'Positive' Class : yes
#>
Now, what if feature selection is applied to the logistic regression model using step()?
model_logreg2 <- step(model_logreg, direction = "backward")
#> Start: AIC=21794.54
#> y ~ age + job + marital + education + default + balance + housing +
#> loan + contact + day + month + campaign + pdays + previous +
#> poutcome
#>
#> Df Deviance AIC
#> - age 1 21711 21793
#> - pdays 1 21711 21793
#> - default 1 21711 21793
#> - day 1 21712 21794
#> <none> 21710 21794
#> - previous 1 21717 21799
#> - balance 1 21720 21802
#> - education 3 21731 21809
#> - job 11 21772 21834
#> - marital 2 21757 21837
#> - loan 1 21771 21853
#> - campaign 1 21804 21886
#> - housing 1 21847 21929
#> - contact 2 22117 22197
#> - month 11 22430 22492
#> - poutcome 3 22666 22744
#>
#> Step: AIC=21792.56
#> y ~ job + marital + education + default + balance + housing +
#> loan + contact + day + month + campaign + pdays + previous +
#> poutcome
#>
#> Df Deviance AIC
#> - pdays 1 21711 21791
#> - default 1 21711 21791
#> - day 1 21712 21792
#> <none> 21711 21793
#> - previous 1 21717 21797
#> - balance 1 21720 21800
#> - education 3 21731 21807
#> - job 11 21779 21839
#> - marital 2 21763 21841
#> - loan 1 21771 21851
#> - campaign 1 21804 21884
#> - housing 1 21848 21928
#> - contact 2 22118 22196
#> - month 11 22430 22490
#> - poutcome 3 22667 22743
#>
#> Step: AIC=21790.69
#> y ~ job + marital + education + default + balance + housing +
#> loan + contact + day + month + campaign + previous + poutcome
#>
#> Df Deviance AIC
#> - default 1 21711 21789
#> - day 1 21712 21790
#> <none> 21711 21791
#> - previous 1 21717 21795
#> - balance 1 21720 21798
#> - education 3 21732 21806
#> - job 11 21780 21838
#> - marital 2 21763 21839
#> - loan 1 21771 21849
#> - campaign 1 21804 21882
#> - housing 1 21850 21928
#> - contact 2 22118 22194
#> - month 11 22430 22488
#> - poutcome 3 22687 22761
#>
#> Step: AIC=21789.19
#> y ~ job + marital + education + balance + housing + loan + contact +
#> day + month + campaign + previous + poutcome
#>
#> Df Deviance AIC
#> - day 1 21713 21789
#> <none> 21711 21789
#> - previous 1 21718 21794
#> - balance 1 21721 21797
#> - education 3 21732 21804
#> - job 11 21780 21836
#> - marital 2 21763 21837
#> - loan 1 21773 21849
#> - campaign 1 21805 21881
#> - housing 1 21850 21926
#> - contact 2 22119 22193
#> - month 11 22432 22488
#> - poutcome 3 22688 22760
#>
#> Step: AIC=21788.97
#> y ~ job + marital + education + balance + housing + loan + contact +
#> month + campaign + previous + poutcome
#>
#> Df Deviance AIC
#> <none> 21713 21789
#> - previous 1 21720 21794
#> - balance 1 21723 21797
#> - education 3 21734 21804
#> - job 11 21782 21836
#> - marital 2 21765 21837
#> - loan 1 21775 21849
#> - campaign 1 21805 21879
#> - housing 1 21855 21929
#> - contact 2 22119 22191
#> - month 11 22441 22495
#> - poutcome 3 22690 22760
summary(model_logreg2)
#>
#> Call:
#> glm(formula = y ~ job + marital + education + balance + housing +
#> loan + contact + month + campaign + previous + poutcome,
#> family = "binomial", data = train_bank)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.3321 -0.4822 -0.3793 -0.2435 3.5617
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -1.212605816 0.122915860 -9.865 < 0.0000000000000002 ***
#> jobblue-collar -0.157422432 0.072733927 -2.164 0.030437 *
#> jobentrepreneur -0.166085332 0.124021276 -1.339 0.180516
#> jobhousemaid -0.253782372 0.131830341 -1.925 0.054221 .
#> jobmanagement -0.039196761 0.073457063 -0.534 0.593618
#> jobretired 0.412312500 0.087766123 4.698 0.0000026290822583 ***
#> jobself-employed -0.163407912 0.113508376 -1.440 0.149977
#> jobservices -0.062011356 0.083035566 -0.747 0.455181
#> jobstudent 0.270093127 0.109144939 2.475 0.013338 *
#> jobtechnician -0.089211667 0.069464494 -1.284 0.199045
#> jobunemployed 0.119899628 0.108339610 1.107 0.268423
#> jobunknown -0.166871219 0.232756571 -0.717 0.473415
#> maritalmarried -0.159760674 0.058342289 -2.738 0.006175 **
#> maritalsingle 0.142244506 0.062630595 2.271 0.023137 *
#> educationsecondary 0.131570940 0.063061839 2.086 0.036944 *
#> educationtertiary 0.318193780 0.073448605 4.332 0.0000147629167635 ***
#> educationunknown 0.182617195 0.103786372 1.760 0.078484 .
#> balance 0.000016083 0.000004984 3.227 0.001251 **
#> housingyes -0.506098885 0.042713650 -11.849 < 0.0000000000000002 ***
#> loanyes -0.456887065 0.060437498 -7.560 0.0000000000000404 ***
#> contacttelephone -0.243252762 0.072257881 -3.366 0.000761 ***
#> contactunknown -1.359578525 0.070641886 -19.246 < 0.0000000000000002 ***
#> monthaug -0.823983149 0.077549419 -10.625 < 0.0000000000000002 ***
#> monthdec 0.749444250 0.179646624 4.172 0.0000302243702798 ***
#> monthfeb -0.435061451 0.084926066 -5.123 0.0000003009902168 ***
#> monthjan -1.110311687 0.119901978 -9.260 < 0.0000000000000002 ***
#> monthjul -0.662762703 0.075981047 -8.723 < 0.0000000000000002 ***
#> monthjun 0.175151688 0.090212867 1.942 0.052193 .
#> monthmar 1.042384864 0.124204017 8.393 < 0.0000000000000002 ***
#> monthmay -0.461806774 0.070903267 -6.513 0.0000000000735691 ***
#> monthnov -0.860182283 0.083495924 -10.302 < 0.0000000000000002 ***
#> monthoct 0.686669555 0.109472415 6.273 0.0000000003552165 ***
#> monthsep 0.615557068 0.121177522 5.080 0.0000003778408024 ***
#> campaign -0.080042312 0.009283064 -8.622 < 0.0000000000000002 ***
#> previous 0.025908585 0.009669390 2.679 0.007374 **
#> poutcomeother 0.195206198 0.090110470 2.166 0.030288 *
#> poutcomesuccess 2.207565085 0.081658485 27.034 < 0.0000000000000002 ***
#> poutcomeunknown 0.078930092 0.064219744 1.229 0.219048
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 26108 on 36169 degrees of freedom
#> Residual deviance: 21713 on 36132 degrees of freedom
#> AIC: 21789
#>
#> Number of Fisher Scoring iterations: 6
bank_pred_logreg2 <- predict(model_logreg2, newdata = test_bank ,type = "response")
predictions2 <- as.factor(ifelse(bank_pred_logreg2 < 0.5, "no", "yes"))
head(predictions2)
#> 2 5 7 10 16 17
#> no no no no no no
#> Levels: no yes
confusionMatrix(predictions2, reference = test_bank$y, positive = "yes")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction no yes
#> no 7875 873
#> yes 109 184
#>
#> Accuracy : 0.8914
#> 95% CI : (0.8848, 0.8977)
#> No Information Rate : 0.8831
#> P-Value [Acc > NIR] : 0.006949
#>
#> Kappa : 0.2337
#>
#> Mcnemar's Test P-Value : < 0.00000000000000022
#>
#> Sensitivity : 0.17408
#> Specificity : 0.98635
#> Pos Pred Value : 0.62799
#> Neg Pred Value : 0.90021
#> Prevalence : 0.11691
#> Detection Rate : 0.02035
#> Detection Prevalence : 0.03241
#> Balanced Accuracy : 0.58021
#>
#> 'Positive' Class : yes
#>
Before feature selection: accuracy 0.8911, recall 0.1731, precision 0.6225.
After feature selection: accuracy 0.8914, recall 0.1741, precision 0.6280.
After feature selection the three metrics changed very little, improving only marginally. Because the difference is negligible, we will keep the model before feature selection for prediction and interpretation. Before we can interpret the model, we need to check whether the assumptions underlying logistic regression are fulfilled, to ensure the validity of the model.
The first assumption is no multicollinearity: the predictors should not have strong correlations with one another. We can check this using vif() from the car package.
vif(model_logreg)
#> GVIF Df GVIF^(1/(2*Df))
#> age 2.133225 1 1.460556
#> job 4.072676 11 1.065913
#> marital 1.444888 2 1.096374
#> education 2.273831 3 1.146726
#> default 1.014232 1 1.007091
#> balance 1.044496 1 1.022006
#> housing 1.406153 1 1.185813
#> loan 1.055708 1 1.027477
#> contact 1.853741 2 1.166843
#> day 1.346200 1 1.160259
#> month 3.663943 11 1.060801
#> campaign 1.102741 1 1.050115
#> pdays 3.783283 1 1.945066
#> previous 1.634883 1 1.278625
#> poutcome 4.730988 3 1.295663
This assumption is fulfilled because none of the GVIF values is above 10.
The second assumption is independence of observations: there should be no dependence between observations, which can occur when the observations come from repeated measurements of the same subjects. This assumption is fulfilled because we used random sampling to split our data into the train and test sets.
The third assumption is linearity of the predictors and the log-odds: the model assumes a linear relationship between each numeric predictor and the log of the odds of the outcome. We can take another look at the model summary.
summary(model_logreg)
#>
#> Call:
#> glm(formula = y ~ ., family = "binomial", data = train_bank)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.3235 -0.4819 -0.3793 -0.2435 3.5757
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -1.252715940 0.183500028 -6.827 0.0000000000086836 ***
#> age 0.000226180 0.002199845 0.103 0.918109
#> jobblue-collar -0.154905842 0.072791657 -2.128 0.033331 *
#> jobentrepreneur -0.162872161 0.124125559 -1.312 0.189467
#> jobhousemaid -0.256082163 0.132224933 -1.937 0.052780 .
#> jobmanagement -0.038654212 0.073553625 -0.526 0.599219
#> jobretired 0.407362719 0.097379874 4.183 0.0000287392377694 ***
#> jobself-employed -0.162305976 0.113530334 -1.430 0.152824
#> jobservices -0.060723417 0.083068127 -0.731 0.464775
#> jobstudent 0.271615929 0.111018694 2.447 0.014422 *
#> jobtechnician -0.090513505 0.069490288 -1.303 0.192734
#> jobunemployed 0.121109647 0.108356454 1.118 0.263697
#> jobunknown -0.161202121 0.233077486 -0.692 0.489173
#> maritalmarried -0.159798235 0.058601192 -2.727 0.006394 **
#> maritalsingle 0.144166535 0.066806826 2.158 0.030931 *
#> educationsecondary 0.132479032 0.063391922 2.090 0.036632 *
#> educationtertiary 0.317410942 0.074049413 4.286 0.0000181530801310 ***
#> educationunknown 0.180781138 0.103823192 1.741 0.081641 .
#> defaultyes -0.112750773 0.163534416 -0.689 0.490533
#> balance 0.000015866 0.000005007 3.169 0.001529 **
#> housingyes -0.500745025 0.043115630 -11.614 < 0.0000000000000002 ***
#> loanyes -0.452128779 0.060558669 -7.466 0.0000000000000827 ***
#> contacttelephone -0.244521313 0.073049592 -3.347 0.000816 ***
#> contactunknown -1.370121616 0.071176408 -19.250 < 0.0000000000000002 ***
#> day 0.003288564 0.002479243 1.326 0.184694
#> monthaug -0.812319254 0.078188450 -10.389 < 0.0000000000000002 ***
#> monthdec 0.761792389 0.179955241 4.233 0.0000230356056245 ***
#> monthfeb -0.397947745 0.089899260 -4.427 0.0000095731578611 ***
#> monthjan -1.140465739 0.121884380 -9.357 < 0.0000000000000002 ***
#> monthjul -0.660700780 0.076060998 -8.686 < 0.0000000000000002 ***
#> monthjun 0.205605533 0.093299108 2.204 0.027544 *
#> monthmar 1.055639338 0.124771628 8.461 < 0.0000000000000002 ***
#> monthmay -0.445097925 0.071903955 -6.190 0.0000000006009821 ***
#> monthnov -0.862126512 0.084010373 -10.262 < 0.0000000000000002 ***
#> monthoct 0.686586812 0.109652456 6.261 0.0000000003813373 ***
#> monthsep 0.638271761 0.122486179 5.211 0.0000001878561988 ***
#> campaign -0.081849377 0.009408080 -8.700 < 0.0000000000000002 ***
#> pdays -0.000114825 0.000310349 -0.370 0.711392
#> previous 0.025999418 0.009695453 2.682 0.007327 **
#> poutcomeother 0.192934216 0.090256129 2.138 0.032547 *
#> poutcomesuccess 2.199595832 0.084013625 26.181 < 0.0000000000000002 ***
#> poutcomeunknown 0.051268131 0.096910267 0.529 0.596787
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 26108 on 36169 degrees of freedom
#> Residual deviance: 21711 on 36128 degrees of freedom
#> AIC: 21795
#>
#> Number of Fisher Scoring iterations: 6
This assumption is taken to be fulfilled here because the numeric variables can plausibly have a linear relationship with the log-odds.
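For a more direct, if crude, visual check (a sketch, not part of the original workflow; age is just one example predictor), one could plot a numeric predictor against the fitted log-odds:
# sketch: eyeball predictor-vs-log-odds linearity for one numeric predictor
probs <- predict(model_logreg, type = "response")
logit <- log(probs / (1 - probs))
plot(train_bank$age, logit, pch = ".", xlab = "age", ylab = "fitted log-odds")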
Therefore, we can interpret the model. Exponentiating the coefficients converts the log-odds into odds ratios:
exp(model_logreg$coefficients) %>% data.frame()
Looking at the odds table above, we can interpret the model: for example, customers whose previous campaign outcome was a success have roughly nine times the odds of saying yes to the current telemarketing offer, relative to customers whose previous campaign ended in failure (the reference level).
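The headline number comes straight from the poutcomesuccess coefficient in the model summary:
# odds ratio for poutcomesuccess (coefficient from the summary above)
exp(2.199595832)  # ≈ 9.02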
Next, we are going to predict the outcome of the telemarketing effort using the kNN model. This model is generally more flexible and can reach higher performance than logistic regression, but it is more difficult to interpret. kNN also cannot work with categorical predictors directly, so we need to filter them out. The reason is that kNN classifies an observation according to the classes of its nearest neighbours under a distance measure, and unordered categorical variables cannot be converted into meaningful distances beyond a crude 0/1 coding.
# predictor variables in `train`
train_x <- train_bank %>% select_if(is.numeric)
head(train_x)
# predictor variables in `test`
test_x <- test_bank %>% select_if(is.numeric)
head(test_x)
# target variable in `train`
train_y <- train_bank$y
head(train_y)
#> [1] no no no no no no
#> Levels: no yes
# target variable in `test`
test_y <- test_bank$y
head(test_y)
#> [1] no no no no no no
#> Levels: no yes
Next, the numerical variables need to be scaled using z-score scaling so that all features carry equal weight in the distance calculation.
# scale train_x data
train_x <- scale(train_x)
# scale test_x using the centers and scales learned from train_x
# (this avoids leaking test information into the preprocessing)
test_x <- scale(test_x,
center = attr(train_x, "scaled:center"),
scale = attr(train_x, "scaled:scale"))
head(train_x)
#> age balance day campaign pdays previous
#> 1 1.60278370 0.25322992 -1.302524 -0.5694578 -0.4104198 -0.2992611
#> 3 -0.75210265 -0.44568450 -1.302524 -0.5694578 -0.4104198 -0.2992611
#> 4 0.56663371 0.04528575 -1.302524 -0.5694578 -0.4104198 -0.2992611
#> 6 -0.56371175 -0.37092905 -1.302524 -0.5694578 -0.4104198 -0.2992611
#> 8 0.09565643 -0.44568450 -1.302524 -0.5694578 -0.4104198 -0.2992611
#> 9 1.60278370 -0.40683778 -1.302524 -0.5694578 -0.4104198 -0.2992611
head(test_x)
#> age balance day campaign pdays previous
#> 2 0.2840473 -0.4368705 -1.302524 -0.5694578 -0.4104198 -0.2992611
#> 5 -0.7521027 -0.4460109 -1.302524 -0.5694578 -0.4104198 -0.2992611
#> 7 -1.2230799 -0.3004174 -1.302524 -0.5694578 -0.4104198 -0.2992611
#> 10 0.1898519 -0.2527567 -1.302524 -0.5694578 -0.4104198 -0.2992611
#> 16 0.9434155 -0.3715819 -1.302524 -0.5694578 -0.4104198 -0.2992611
#> 17 0.3782428 -0.4420936 -1.302524 -0.5694578 -0.4104198 -0.2992611
Finally, we need to determine k, the number of nearest neighbours to consider. For our first iteration, let us use an odd number close to the square root of the number of training observations (an odd k avoids tied votes).
k <- round(sqrt(nrow(train_x)))+1
k
#> [1] 191
bank_pred_knn <- knn(train = train_x, test = test_x, cl = train_y, k= k)
We will use the same metrics as for the logistic regression models, to enable comparison between the two approaches.
confusionMatrix(bank_pred_knn, reference = test_y, positive = "yes")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction no yes
#> no 7979 1050
#> yes 5 7
#>
#> Accuracy : 0.8833
#> 95% CI : (0.8765, 0.8899)
#> No Information Rate : 0.8831
#> P-Value [Acc > NIR] : 0.4821
#>
#> Kappa : 0.0105
#>
#> Mcnemar's Test P-Value : <0.0000000000000002
#>
#> Sensitivity : 0.0066225
#> Specificity : 0.9993737
#> Pos Pred Value : 0.5833333
#> Neg Pred Value : 0.8837081
#> Prevalence : 0.1169118
#> Detection Rate : 0.0007743
#> Detection Prevalence : 0.0013273
#> Balanced Accuracy : 0.5029981
#>
#> 'Positive' Class : yes
#>
The metrics are: accuracy 0.8833, recall 0.0066, and precision 0.5833.
Our kNN model is actually worse across all three metrics than the logistic regression model. Before we draw any conclusions about the performance of the models, let us try two different k values, one higher at 300 and the other lower at 100.
bank_pred_knn2 <- knn(train = train_x, test = test_x, cl = train_y, k= 300)
confusionMatrix(bank_pred_knn2, reference = test_y, positive = "yes")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction no yes
#> no 7984 1056
#> yes 0 1
#>
#> Accuracy : 0.8832
#> 95% CI : (0.8764, 0.8898)
#> No Information Rate : 0.8831
#> P-Value [Acc > NIR] : 0.4951
#>
#> Kappa : 0.0017
#>
#> Mcnemar's Test P-Value : <0.0000000000000002
#>
#> Sensitivity : 0.0009461
#> Specificity : 1.0000000
#> Pos Pred Value : 1.0000000
#> Neg Pred Value : 0.8831858
#> Prevalence : 0.1169118
#> Detection Rate : 0.0001106
#> Detection Prevalence : 0.0001106
#> Balanced Accuracy : 0.5004730
#>
#> 'Positive' Class : yes
#>
bank_pred_knn3 <- knn(train = train_x, test = test_x, cl = train_y, k= 100)
confusionMatrix(bank_pred_knn3, reference = test_y, positive = "yes")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction no yes
#> no 7956 1037
#> yes 28 20
#>
#> Accuracy : 0.8822
#> 95% CI : (0.8754, 0.8888)
#> No Information Rate : 0.8831
#> P-Value [Acc > NIR] : 0.6111
#>
#> Kappa : 0.0263
#>
#> Mcnemar's Test P-Value : <0.0000000000000002
#>
#> Sensitivity : 0.018921
#> Specificity : 0.996493
#> Pos Pred Value : 0.416667
#> Neg Pred Value : 0.884688
#> Prevalence : 0.116912
#> Detection Rate : 0.002212
#> Detection Prevalence : 0.005309
#> Balanced Accuracy : 0.507707
#>
#> 'Positive' Class : yes
#>
For bank_pred_knn2 (k = 300) the metrics are: accuracy 0.8832, recall 0.0009, and precision 1.0000.
For bank_pred_knn3 (k = 100) the metrics are: accuracy 0.8822, recall 0.0189, and precision 0.4167.
So with a smaller k recall increases, at the cost of accuracy and precision, but all three metrics remain below those of the logistic regression model. Finally, we can conclude that for our case of predicting customers' responses to term deposit offers in a telemarketing campaign, logistic regression performs better than kNN, with the models before and after feature selection performing almost identically.
We have used two machine learning models to predict which customers will accept the term deposit offer in the bank's telemarketing campaign: logistic regression and k-nearest neighbors. Of the two, logistic regression performs better, and it performs essentially the same whether all available columns (except the target) are used as predictors or the stepwise-selected subset is used. The best model's recall is only about 17.3%. This is still too low, and the number of false negatives is still too high, which means the logistic regression model needs to be tuned further or replaced with an even better model. The logistic regression model was, however, able to identify that a customer whose previous campaign ended in success has roughly nine times the odds of accepting again compared to one whose previous campaign ended in failure.
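As one hedged illustration of further tuning (not explored in this analysis): because recall is our priority, simply lowering the 0.5 classification cutoff trades precision for recall without retraining anything. The 0.3 below is an arbitrary example value, not a recommendation.
# sketch: a lower cutoff flags more customers as "yes", raising recall
# at the cost of precision (0.3 is illustrative only)
predictions_lo <- as.factor(ifelse(bank_pred_logreg1 < 0.3, "no", "yes"))
confusionMatrix(predictions_lo, reference = test_bank$y, positive = "yes")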