Telemarketing is nothing new, but it remains one of the most important ways to gain potential customers. Some companies focus on training their marketers in the hope that such an investment will pay off. However, even the best salesperson in the world fails sometimes. It is usually not the salesperson's fault; rather, the problem lies in the gap in understanding the client. Clients who share certain characteristics are more likely to be interested in an offer, while other groups with a common background will shy away from it. A clear example: people who work in hazardous environments are more likely to sign up for insurance than those who work in an office.
Inspired by the original paper “A Data-Driven Approach to Predict the Success of Bank Telemarketing” by S. Moro, P. Cortez and P. Rita (2014), five machine learning models were built to predict whether the client subscribed to a term deposit. Unlike the paper, however, the dataset was preprocessed in a simple way, the formula was selected differently, and the models were not heavily modified.
The dataset is based on the “Bank Marketing” UCI dataset. Nevertheless, it is not identical to the dataset used by S. Moro, P. Cortez and P. Rita, which was not yet enriched by the Banco de Portugal. According to the dataset's author, the addition of five new social and economic features/attributes significantly improved the probability of correct prediction in comparison with the original.
Number of observations: 41,188. Number of features: 20, plus 1 target variable.
## Observations: 41,188
## Variables: 21
## $ age <int> 56, 57, 37, 40, 56, 45, 59, 41, 24, 25, 41, 25,...
## $ job <fct> housemaid, services, services, admin., services...
## $ marital <fct> married, married, married, married, married, ma...
## $ education <fct> basic.4y, high.school, high.school, basic.6y, h...
## $ default <fct> no, unknown, no, no, no, unknown, no, unknown, ...
## $ housing <fct> no, no, yes, no, no, no, no, no, yes, yes, no, ...
## $ loan <fct> no, no, no, no, yes, no, no, no, no, no, no, no...
## $ contact <fct> telephone, telephone, telephone, telephone, tel...
## $ month <fct> may, may, may, may, may, may, may, may, may, ma...
## $ day_of_week <fct> mon, mon, mon, mon, mon, mon, mon, mon, mon, mo...
## $ duration <int> 261, 149, 226, 151, 307, 198, 139, 217, 380, 50...
## $ campaign <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ pdays <int> 999, 999, 999, 999, 999, 999, 999, 999, 999, 99...
## $ previous <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ poutcome <fct> nonexistent, nonexistent, nonexistent, nonexist...
## $ emp.var.rate <dbl> 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1.1, 1....
## $ cons.price.idx <dbl> 93.994, 93.994, 93.994, 93.994, 93.994, 93.994,...
## $ cons.conf.idx <dbl> -36.4, -36.4, -36.4, -36.4, -36.4, -36.4, -36.4...
## $ euribor3m <dbl> 4.857, 4.857, 4.857, 4.857, 4.857, 4.857, 4.857...
## $ nr.employed <dbl> 5191, 5191, 5191, 5191, 5191, 5191, 5191, 5191,...
## $ y <fct> no, no, no, no, no, no, no, no, no, no, no, no,...
Information about the variables, cited from the source:
Input variables:
# bank client data:
1 - age (numeric)
2 - job : type of job (categorical: “admin.”,“blue-collar”,“entrepreneur”,“housemaid”,“management”,“retired”,“self-employed”,“services”,“student”,“technician”,“unemployed”,“unknown”)
3 - marital : marital status (categorical: “divorced”,“married”,“single”,“unknown”; note: “divorced” means divorced or widowed)
4 - education (categorical: “basic.4y”,“basic.6y”,“basic.9y”,“high.school”,“illiterate”,“professional.course”,“university.degree”,“unknown”)
5 - default: has credit in default? (categorical: “no”,“yes”,“unknown”)
6 - housing: has housing loan? (categorical: “no”,“yes”,“unknown”)
7 - loan: has personal loan? (categorical: “no”,“yes”,“unknown”)
# related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: “cellular”,“telephone”)
9 - month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
10 - day_of_week: last contact day of the week (categorical: “mon”,“tue”,“wed”,“thu”,“fri”)
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=“no”). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
# other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: “failure”,“nonexistent”,“success”)
# social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)
Output variable (desired target):
21 - y - has the client subscribed a term deposit? (binary: “yes”,“no”)
options(scipen=999)
summary(bank)
## age job marital
## Min. :17.00 admin. :10422 divorced: 4612
## 1st Qu.:32.00 blue-collar: 9254 married :24928
## Median :38.00 technician : 6743 single :11568
## Mean :40.02 services : 3969 unknown : 80
## 3rd Qu.:47.00 management : 2924
## Max. :98.00 retired : 1720
## (Other) : 6156
## education default housing
## university.degree :12168 no :32588 no :18622
## high.school : 9515 unknown: 8597 unknown: 990
## basic.9y : 6045 yes : 3 yes :21576
## professional.course: 5243
## basic.4y : 4176
## basic.6y : 2292
## (Other) : 1749
## loan contact month day_of_week
## no :33950 cellular :26144 may :13769 fri:7827
## unknown: 990 telephone:15044 jul : 7174 mon:8514
## yes : 6248 aug : 6178 thu:8623
## jun : 5318 tue:8090
## nov : 4101 wed:8134
## apr : 2632
## (Other): 2016
## duration campaign pdays previous
## Min. : 0.0 Min. : 1.000 Min. : 0.0 Min. :0.000
## 1st Qu.: 102.0 1st Qu.: 1.000 1st Qu.:999.0 1st Qu.:0.000
## Median : 180.0 Median : 2.000 Median :999.0 Median :0.000
## Mean : 258.3 Mean : 2.568 Mean :962.5 Mean :0.173
## 3rd Qu.: 319.0 3rd Qu.: 3.000 3rd Qu.:999.0 3rd Qu.:0.000
## Max. :4918.0 Max. :56.000 Max. :999.0 Max. :7.000
##
## poutcome emp.var.rate cons.price.idx cons.conf.idx
## failure : 4252 Min. :-3.40000 Min. :92.20 Min. :-50.8
## nonexistent:35563 1st Qu.:-1.80000 1st Qu.:93.08 1st Qu.:-42.7
## success : 1373 Median : 1.10000 Median :93.75 Median :-41.8
## Mean : 0.08189 Mean :93.58 Mean :-40.5
## 3rd Qu.: 1.40000 3rd Qu.:93.99 3rd Qu.:-36.4
## Max. : 1.40000 Max. :94.77 Max. :-26.9
##
## euribor3m nr.employed y
## Min. :0.634 Min. :4964 no :36548
## 1st Qu.:1.344 1st Qu.:5099 yes: 4640
## Median :4.857 Median :5191
## Mean :3.621 Mean :5167
## 3rd Qu.:4.961 3rd Qu.:5228
## Max. :5.045 Max. :5228
##
As noted by the dataset's author, “unknown” in the categorical columns denotes a missing value. Here is an example from the marital column:
table(bank$marital)
##
## divorced married single unknown
## 4612 24928 11568 80
bank[bank=='unknown'] <- NA
#missing value check
sum(is.na(bank))
## [1] 12718
The number of missing values is considerable and could have a negative impact.
md.pattern(bank) #library mice 3.3.0
## age contact month day_of_week duration campaign pdays previous
## 30488 1 1 1 1 1 1 1 1
## 7757 1 1 1 1 1 1 1 1
## 1102 1 1 1 1 1 1 1 1
## 456 1 1 1 1 1 1 1 1
## 739 1 1 1 1 1 1 1 1
## 207 1 1 1 1 1 1 1 1
## 22 1 1 1 1 1 1 1 1
## 16 1 1 1 1 1 1 1 1
## 116 1 1 1 1 1 1 1 1
## 76 1 1 1 1 1 1 1 1
## 54 1 1 1 1 1 1 1 1
## 70 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1
## 59 1 1 1 1 1 1 1 1
## 7 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1
## 4 1 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1
## 0 0 0 0 0 0 0 0
## poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m
## 30488 1 1 1 1 1
## 7757 1 1 1 1 1
## 1102 1 1 1 1 1
## 456 1 1 1 1 1
## 739 1 1 1 1 1
## 207 1 1 1 1 1
## 22 1 1 1 1 1
## 16 1 1 1 1 1
## 116 1 1 1 1 1
## 76 1 1 1 1 1
## 54 1 1 1 1 1
## 70 1 1 1 1 1
## 1 1 1 1 1 1
## 2 1 1 1 1 1
## 2 1 1 1 1 1
## 59 1 1 1 1 1
## 7 1 1 1 1 1
## 2 1 1 1 1 1
## 2 1 1 1 1 1
## 1 1 1 1 1 1
## 4 1 1 1 1 1
## 3 1 1 1 1 1
## 2 1 1 1 1 1
## 0 0 0 0 0
## nr.employed y marital job housing loan education default
## 30488 1 1 1 1 1 1 1 1 0
## 7757 1 1 1 1 1 1 1 0 1
## 1102 1 1 1 1 1 1 0 1 1
## 456 1 1 1 1 1 1 0 0 2
## 739 1 1 1 1 0 0 1 1 2
## 207 1 1 1 1 0 0 1 0 3
## 22 1 1 1 1 0 0 0 1 3
## 16 1 1 1 1 0 0 0 0 4
## 116 1 1 1 0 1 1 1 1 1
## 76 1 1 1 0 1 1 1 0 2
## 54 1 1 1 0 1 1 0 1 2
## 70 1 1 1 0 1 1 0 0 3
## 1 1 1 1 0 0 0 1 1 3
## 2 1 1 1 0 0 0 1 0 4
## 2 1 1 1 0 0 0 0 0 5
## 59 1 1 0 1 1 1 1 1 1
## 7 1 1 0 1 1 1 1 0 2
## 2 1 1 0 1 1 1 0 1 2
## 2 1 1 0 1 1 1 0 0 3
## 1 1 1 0 1 0 0 1 1 3
## 4 1 1 0 0 1 1 1 1 2
## 3 1 1 0 0 1 1 0 1 3
## 2 1 1 0 0 1 1 0 0 4
## 0 0 80 330 990 990 1731 8597 12718
#library mice 3.5.0: md.pattern(bank, rotate.names = TRUE)
aggr(bank,
numbers = TRUE,
prop = FALSE,
sortVars = TRUE,
cex.axis = .5,
gap = 2,
ylab = c("Number of misisngs", "Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## default 8597
## education 1731
## housing 990
## loan 990
## job 330
## marital 80
## age 0
## contact 0
## month 0
## day_of_week 0
## duration 0
## campaign 0
## pdays 0
## previous 0
## poutcome 0
## emp.var.rate 0
## cons.price.idx 0
## cons.conf.idx 0
## euribor3m 0
## nr.employed 0
## y 0
Even though it is hard to tell at a glance which columns contain missing values, the plot shows they are spread across several columns. Given their large number, simply removing the affected observations would likely unbalance the data and have a significant impact on the prediction.
A popular method for categorical values is mode imputation: assigning missing values to the most frequent level. Base R has no built-in function for this, but a GitHub gist provides a suitable solution (https://gist.github.com/jmarhee/8530768).
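For reference, a minimal sketch of what such a Mode() helper might look like (the actual Mode.R sourced below, taken from the gist, may differ):
# Hypothetical Mode() helper: returns the most frequent value of a vector;
# the Mode.R file sourced below (from the gist) may be implemented differently.
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
# Example: most frequent level of marital, ignoring the missing values
Mode(na.omit(bank$marital))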
source("Mode.R")
bank <- bank %>% mutate_if(is.factor,
funs(replace(.,is.na(.), Mode(na.omit(.))))) %>%
mutate_if(is.factor, factor)
sum(is.na(bank))
## [1] 0
A model with many features can overfit and perform poorly on test data; it also increases time and memory complexity unnecessarily. Best practice is to identify the least important features through tests and analysis, then remove them from the final formula.
The basic idea is to use the chi-squared test to rank the categorical features from most to least important.
cate_vars <- c('job','marital','education','default','housing','poutcome',
'loan','contact','month','day_of_week')
chi2_weights <- chi.squared(y ~ ., bank[,c("y", cate_vars)])
chi2_weights
## attr_importance
## job 0.152688727
## marital 0.054152670
## education 0.068309967
## default 0.003041013
## housing 0.011085158
## poutcome 0.320487960
## loan 0.004466117
## contact 0.144773056
## month 0.274394871
## day_of_week 0.025194658
In many cases, a column with zero or near-zero variance offers no important contribution to the model.
num_vars <- c('age','duration','campaign','pdays','previous','emp.var.rate',
'cons.price.idx','cons.conf.idx', 'euribor3m','nr.employed')
nearZeroVar(bank,
saveMetrics = TRUE)
## freqRatio percentUnique zeroVar nzv
## age 1.054713 0.189375546 FALSE FALSE
## job 1.161876 0.026706808 FALSE FALSE
## marital 2.161826 0.007283675 FALSE FALSE
## education 1.460746 0.016995241 FALSE FALSE
## default 13728.333333 0.004855783 FALSE TRUE
## housing 1.211793 0.004855783 FALSE FALSE
## loan 5.592190 0.004855783 FALSE FALSE
## contact 1.737836 0.004855783 FALSE FALSE
## month 1.919292 0.024278916 FALSE FALSE
## day_of_week 1.012802 0.012139458 FALSE FALSE
## duration 1.000000 3.748664660 FALSE FALSE
## campaign 1.669063 0.101971448 FALSE FALSE
## pdays 90.371298 0.065553074 FALSE TRUE
## previous 7.797194 0.019423133 FALSE FALSE
## poutcome 8.363829 0.007283675 FALSE FALSE
## emp.var.rate 1.767639 0.024278916 FALSE FALSE
## cons.price.idx 1.161257 0.063125182 FALSE FALSE
## cons.conf.idx 1.161257 0.063125182 FALSE FALSE
## euribor3m 1.097589 0.767213752 FALSE FALSE
## nr.employed 1.902273 0.026706808 FALSE FALSE
## y 7.876724 0.004855783 FALSE FALSE
From the table, default and pdays appear to have near-zero variance. Before they can be removed, however, the proportion of zero values should be checked.
prop.table(table(bank$default == 0))
##
## FALSE
## 1
As it turns out, default contains no zero values at all, so it is better left alone.
prop.table(table(bank$pdays == 0))
##
## FALSE TRUE
## 0.9996358163 0.0003641837
Once again, nothing will change.
For numeric variables, strong similarity between two columns means redundant information for the model. This can be detected by checking correlations.
(corrs <- cor(bank[, num_vars]))
## age duration campaign pdays
## age 1.0000000000 -0.000865705 0.00459358 -0.03436895
## duration -0.0008657050 1.000000000 -0.07169923 -0.04757702
## campaign 0.0045935805 -0.071699226 1.00000000 0.05258357
## pdays -0.0343689512 -0.047577015 0.05258357 1.00000000
## previous 0.0243647409 0.020640351 -0.07914147 -0.58751386
## emp.var.rate -0.0003706855 -0.027967884 0.15075381 0.27100417
## cons.price.idx 0.0008567150 0.005312268 0.12783591 0.07888911
## cons.conf.idx 0.1293716142 -0.008172873 -0.01373310 -0.09134235
## euribor3m 0.0107674295 -0.032896656 0.13513251 0.29689911
## nr.employed -0.0177251319 -0.044703223 0.14409489 0.37260474
## previous emp.var.rate cons.price.idx cons.conf.idx
## age 0.02436474 -0.0003706855 0.000856715 0.129371614
## duration 0.02064035 -0.0279678845 0.005312268 -0.008172873
## campaign -0.07914147 0.1507538056 0.127835912 -0.013733099
## pdays -0.58751386 0.2710041743 0.078889109 -0.091342354
## previous 1.00000000 -0.4204891094 -0.203129967 -0.050936351
## emp.var.rate -0.42048911 1.0000000000 0.775334171 0.196041268
## cons.price.idx -0.20312997 0.7753341708 1.000000000 0.058986182
## cons.conf.idx -0.05093635 0.1960412681 0.058986182 1.000000000
## euribor3m -0.45449365 0.9722446712 0.688230107 0.277686220
## nr.employed -0.50133293 0.9069701013 0.522033977 0.100513432
## euribor3m nr.employed
## age 0.01076743 -0.01772513
## duration -0.03289666 -0.04470322
## campaign 0.13513251 0.14409489
## pdays 0.29689911 0.37260474
## previous -0.45449365 -0.50133293
## emp.var.rate 0.97224467 0.90697010
## cons.price.idx 0.68823011 0.52203398
## cons.conf.idx 0.27768622 0.10051343
## euribor3m 1.00000000 0.94515443
## nr.employed 0.94515443 1.00000000
corrplot.mixed(corrs,
upper = "pie",
lower = "number")
findCorrelation(corrs,
cutoff = 0.9,
names = TRUE)
## [1] "euribor3m" "emp.var.rate"
Using a 90% correlation cutoff, emp.var.rate and euribor3m can be filtered out.
The dataset has 10 quantitative and 10 qualitative variables. To balance against the 8 quantitative variables kept in the formula, the 8 most important qualitative variables will be selected.
selected_cate<-cutoff.k(chi2_weights, k = 8)
selected_cate
## [1] "poutcome" "month" "job" "contact" "education"
## [6] "marital" "day_of_week" "housing"
The final formula is written as:
selected_num <- c('age','duration','campaign','pdays','previous',
'cons.price.idx','cons.conf.idx', 'nr.employed')
groupvars <- paste(paste(selected_cate,collapse = " + "),
paste(selected_num, collapse = " + "), sep = ' + ')
model_formula <- as.formula(paste('y',groupvars, sep = ' ~ '))
model_formula
## y ~ poutcome + month + job + contact + education + marital +
## day_of_week + housing + age + duration + campaign + pdays +
## previous + cons.price.idx + cons.conf.idx + nr.employed
set.seed(123)
which_train <- createDataPartition(bank$y, p = 0.8, list = FALSE)
bank_down <- bank[which_train,]
bank_test <- bank[-which_train,]
tabyl(bank_down$y)
## bank_down$y n percent
## no 29239 0.8873479
## yes 3712 0.1126521
tabyl(bank_test$y)
## bank_test$y n percent
## no 7309 0.8873376
## yes 928 0.1126624
createDataPartition() splits the data into a train set and a test set while preserving the distribution of the target variable, as the two tables above show.
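As a quick check, the class proportions can be compared side by side across the full data and both splits (an illustrative snippet using the objects created above):
# Compare the target distribution in the full data, the train set and the
# test set; stratified sampling should keep the proportions almost identical.
rbind(full  = prop.table(table(bank$y)),
      train = prop.table(table(bank_down$y)),
      test  = prop.table(table(bank_test$y)))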
Despite its name, logistic “regression” is used here as a classification model: the output is framed as a probability via the logistic function. The probability lies between 0 and 1 and can be thresholded by the user into classes such as ‘yes’/‘no’ or ‘true’/‘false’.
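To illustrate the idea, the logistic (sigmoid) function maps the linear predictor to a probability, which is then thresholded into classes (a small illustrative sketch, not part of the modelling pipeline):
# Logistic (sigmoid) function: maps any real-valued linear predictor eta
# into a probability between 0 and 1.
sigmoid <- function(eta) 1 / (1 + exp(-eta))

# Example: convert a few linear-predictor values into probabilities,
# then threshold at 0.5 to obtain class labels.
eta <- c(-2, 0, 1.5)
prob <- sigmoid(eta)
ifelse(prob > 0.5, "yes", "no")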
set.seed(123)
ctrl_cv5x3a <- trainControl(method = "repeatedcv",
number = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary,
repeats = 3)
logit_train <-
train(model_formula,
data = bank_down,
method = "glm",
metric = "ROC",
family = "binomial",
trControl = ctrl_cv5x3a)
# Summary
summary(logit_train)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.9225 -0.3073 -0.1936 -0.1374 3.1319
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 90.42120159 4.67683882 19.334
## poutcomenonexistent 0.44680653 0.10486497 4.261
## poutcomesuccess 0.93648913 0.23422194 3.998
## monthaug 0.09604061 0.11153558 0.861
## monthdec -0.09745703 0.21918386 -0.445
## monthjul 0.23498821 0.10441297 2.251
## monthjun 0.56180937 0.10035802 5.598
## monthmar 1.22705220 0.13187509 9.305
## monthmay -0.77534066 0.08257615 -9.389
## monthnov -0.29955387 0.10667950 -2.808
## monthoct 0.02920495 0.13708981 0.213
## monthsep -0.49268620 0.14710397 -3.349
## `jobblue-collar` -0.24529425 0.08653543 -2.835
## jobentrepreneur -0.22684554 0.14007322 -1.619
## jobhousemaid -0.00356211 0.16051496 -0.022
## jobmanagement -0.05197012 0.09489854 -0.548
## jobretired 0.31049585 0.11771733 2.638
## `jobself-employed` -0.15042989 0.13203698 -1.139
## jobservices -0.10468080 0.09324357 -1.123
## jobstudent 0.24755407 0.11997074 2.063
## jobtechnician 0.00284645 0.07806458 0.036
## jobunemployed -0.00220706 0.14118705 -0.016
## contacttelephone -0.36551968 0.07779272 -4.699
## educationbasic.6y 0.11866552 0.13338089 0.890
## educationbasic.9y 0.06162253 0.10629834 0.580
## educationhigh.school 0.11918026 0.10182273 1.170
## educationilliterate 1.18742473 0.77076037 1.541
## educationprofessional.course 0.14609463 0.11294514 1.294
## educationuniversity.degree 0.26930033 0.09902771 2.719
## maritalmarried 0.03475556 0.07615878 0.456
## maritalsingle 0.09857749 0.08690029 1.134
## day_of_weekmon -0.12767818 0.07283839 -1.753
## day_of_weekthu -0.00581714 0.07076536 -0.082
## day_of_weektue 0.08302712 0.07235523 1.147
## day_of_weekwed 0.15018716 0.07257820 2.069
## housingyes -0.03630511 0.04533513 -0.801
## age 0.00095468 0.00265923 0.359
## duration 0.00457544 0.00008158 56.086
## campaign -0.03598209 0.01285511 -2.799
## pdays -0.00094231 0.00024012 -3.924
## previous -0.03209143 0.06709048 -0.478
## cons.price.idx -0.24964898 0.04820186 -5.179
## cons.conf.idx 0.01268636 0.00576518 2.201
## nr.employed -0.01356947 0.00039554 -34.306
## Pr(>|z|)
## (Intercept) < 0.0000000000000002 ***
## poutcomenonexistent 0.0000203715 ***
## poutcomesuccess 0.0000637996 ***
## monthaug 0.38920
## monthdec 0.65658
## monthjul 0.02441 *
## monthjun 0.0000000217 ***
## monthmar < 0.0000000000000002 ***
## monthmay < 0.0000000000000002 ***
## monthnov 0.00499 **
## monthoct 0.83130
## monthsep 0.00081 ***
## `jobblue-collar` 0.00459 **
## jobentrepreneur 0.10534
## jobhousemaid 0.98229
## jobmanagement 0.58394
## jobretired 0.00835 **
## `jobself-employed` 0.25458
## jobservices 0.26158
## jobstudent 0.03907 *
## jobtechnician 0.97091
## jobunemployed 0.98753
## contacttelephone 0.0000026190 ***
## educationbasic.6y 0.37364
## educationbasic.9y 0.56211
## educationhigh.school 0.24181
## educationilliterate 0.12342
## educationprofessional.course 0.19584
## educationuniversity.degree 0.00654 **
## maritalmarried 0.64813
## maritalsingle 0.25664
## day_of_weekmon 0.07962 .
## day_of_weekthu 0.93449
## day_of_weektue 0.25118
## day_of_weekwed 0.03852 *
## housingyes 0.42324
## age 0.71959
## duration < 0.0000000000000002 ***
## campaign 0.00513 **
## pdays 0.0000869844 ***
## previous 0.63241
## cons.price.idx 0.0000002228 ***
## cons.conf.idx 0.02777 *
## nr.employed < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 23199 on 32950 degrees of freedom
## Residual deviance: 13916 on 32907 degrees of freedom
## AIC: 14004
##
## Number of Fisher Scoring iterations: 6
Fitted values
logit_train_fitted <- predict(logit_train,
bank_down,
type = "prob")
Predicted values
logit_train_forecasts <- predict(logit_train,
bank_test,
type = "prob")
confusion matrix train set
confusionMatrix(data = as.factor(ifelse(logit_train_fitted["yes"] > 0.5,
"yes",
"no")),
reference = bank_down$y,
positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 28464 2204
## yes 775 1508
##
## Accuracy : 0.9096
## 95% CI : (0.9064, 0.9127)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.4564
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.40625
## Specificity : 0.97349
## Pos Pred Value : 0.66053
## Neg Pred Value : 0.92813
## Prevalence : 0.11265
## Detection Rate : 0.04576
## Detection Prevalence : 0.06928
## Balanced Accuracy : 0.68987
##
## 'Positive' Class : yes
##
Accuracy: 0.9096 is a good result. However, Balanced Accuracy: 0.68987 is rather disappointing.
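Balanced Accuracy is simply the mean of Sensitivity and Specificity: (0.40625 + 0.97349) / 2 ≈ 0.690, so it is the low sensitivity on the minority “yes” class that drags it down despite the high overall accuracy.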
ROC train set
roc.area(ifelse(bank_down$y == "yes", 1, 0),
logit_train_fitted[,"yes"])
## $A
## [1] 0.9323093
##
## $n.total
## [1] 32951
##
## $n.events
## [1] 3712
##
## $n.noevents
## [1] 29239
##
## $p.value
## [1] 0
confusion matrix bank_test set
cMatrix<-confusionMatrix(data = as.factor(ifelse(logit_train_forecasts["yes"] > 0.5,
"yes",
"no")),
reference = bank_test$y,
positive = "yes")
cMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7116 520
## yes 193 408
##
## Accuracy : 0.9134
## 95% CI : (0.9072, 0.9194)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 0.000000000000004566
##
## Kappa : 0.4884
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.43966
## Specificity : 0.97359
## Pos Pred Value : 0.67887
## Neg Pred Value : 0.93190
## Prevalence : 0.11266
## Detection Rate : 0.04953
## Detection Prevalence : 0.07296
## Balanced Accuracy : 0.70662
##
## 'Positive' Class : yes
##
Accuracy: 0.9134 is good, and Balanced Accuracy: 0.70662 is better than on the train set.
ROC bank_test set
roc<-roc.area(ifelse(bank_test$y == "yes", 1, 0),
logit_train_forecasts[,"yes"])
roc
## $A
## [1] 0.9359121
##
## $n.total
## [1] 8237
##
## $n.events
## [1] 928
##
## $n.noevents
## [1] 7309
##
## $p.value
## [1] 0
ROC bank_test set plot
roc.plot(ifelse(bank_test$y == "yes", 1, 0),
logit_train_forecasts[,"yes"])
## Warning in roc.plot.default(ifelse(bank_test$y == "yes", 1, 0),
## logit_train_forecasts[, : Large amount of unique predictions used as
## thresholds. Consider specifying thresholds.
Linear discriminant analysis is a generalization of Fisher’s linear discriminant: it finds a linear combination of features and applies Bayes’ theorem for classification. It can be more stable than logistic regression.
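Under the hood, caret’s method = "lda" relies on MASS::lda(); a minimal direct call might look like the sketch below (illustrative only — the caret-based training that follows is what this report actually uses):
library(MASS)

# Fit LDA directly with the same formula and training data
lda_direct <- lda(model_formula, data = bank_down)

# predict() returns $class (predicted labels) and $posterior (class probabilities)
lda_pred <- predict(lda_direct, newdata = bank_test)
head(lda_pred$posterior)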
set.seed(123)
ctrl_cv5x3a <- trainControl(method = "repeatedcv",
number = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary,
repeats = 3)
lda_train <-
train(model_formula,
data = bank_down,
method = "lda",
metric = "ROC",
trControl = ctrl_cv5x3a)
# Summary
summary(lda_train)
## Length Class Mode
## prior 2 -none- numeric
## counts 2 -none- numeric
## means 86 -none- numeric
## scaling 43 -none- numeric
## lev 2 -none- character
## svd 1 -none- numeric
## N 1 -none- numeric
## call 3 -none- call
## xNames 43 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 2 -none- character
## param 0 -none- list
Fitted values
lda_train_fitted <- predict(lda_train,
bank_down,
type = "prob")
Predicted values
lda_train_forecasts <- predict(lda_train,
bank_test,
type = "prob")
confusion matrix train set
confusionMatrix(data = as.factor(ifelse(lda_train_fitted["yes"] > 0.5,
"yes",
"no")),
reference = bank_down$y,
positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 28083 1848
## yes 1156 1864
##
## Accuracy : 0.9088
## 95% CI : (0.9057, 0.9119)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.5036
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.50216
## Specificity : 0.96046
## Pos Pred Value : 0.61722
## Neg Pred Value : 0.93826
## Prevalence : 0.11265
## Detection Rate : 0.05657
## Detection Prevalence : 0.09165
## Balanced Accuracy : 0.73131
##
## 'Positive' Class : yes
##
Accuracy: 0.9088 and Balanced Accuracy: 0.73131 indicate a good result.
ROC train set
roc.area(ifelse(bank_down$y == "yes", 1, 0),
lda_train_fitted[,"yes"])
## $A
## [1] 0.9315323
##
## $n.total
## [1] 32951
##
## $n.events
## [1] 3712
##
## $n.noevents
## [1] 29239
##
## $p.value
## [1] 0
confusion matrix bank_test set
cMatrix<-confusionMatrix(data = as.factor(ifelse(lda_train_forecasts["yes"] > 0.5,
"yes",
"no")),
reference = bank_test$y,
positive = "yes")
cMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7008 441
## yes 301 487
##
## Accuracy : 0.9099
## 95% CI : (0.9035, 0.916)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 0.00000000001313
##
## Kappa : 0.5177
## Mcnemar's Test P-Value : 0.00000033457525
##
## Sensitivity : 0.52478
## Specificity : 0.95882
## Pos Pred Value : 0.61802
## Neg Pred Value : 0.94080
## Prevalence : 0.11266
## Detection Rate : 0.05912
## Detection Prevalence : 0.09567
## Balanced Accuracy : 0.74180
##
## 'Positive' Class : yes
##
Accuracy: 0.9099, while Balanced Accuracy improves to 0.74180.
ROC bank_test set
roc<-roc.area(ifelse(bank_test$y == "yes", 1, 0),
lda_train_forecasts[,"yes"])
roc
## $A
## [1] 0.9346318
##
## $n.total
## [1] 8237
##
## $n.events
## [1] 928
##
## $n.noevents
## [1] 7309
##
## $p.value
## [1] 0
ROC bank_test set plot
roc.plot(ifelse(bank_test$y == "yes", 1, 0),
lda_train_forecasts[,"yes"])
## Warning in roc.plot.default(ifelse(bank_test$y == "yes", 1, 0),
## lda_train_forecasts[, : Large amount of unique predictions used as
## thresholds. Consider specifying thresholds.
When the data contain a significant number of variables, it can be useful to find a reduced subset of variables so that performance stays optimal.
In such cases, imposing a penalty on the logistic regression pushes the coefficients of variables that contribute less toward zero. This method is widely known as regularization.
There are three well-known regularization penalties:
Lasso regression: penalizes the sum of absolute coefficient values (L1 penalty)
Ridge regression: penalizes the sum of squared coefficients (L2 penalty)
Elastic net: a combination of both the lasso and ridge penalties
In this report, the elastic net (labelled “Net Elastic” in the comparison table) was used, even though under some circumstances a single penalty may perform better.
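The mixing parameter alpha controls the blend (alpha = 1 is the pure lasso penalty, alpha = 0 pure ridge). Below is a hedged sketch of how a few values of alpha could be compared by cross-validated deviance; the report itself simply fixes alpha = 0.5 in the code that follows:
library(glmnet)

# Build the design matrix and 0/1 outcome (same transformation as below)
x <- model.matrix(y ~ ., bank_down)[, -1]
y <- ifelse(bank_down$y == "yes", 1, 0)

# Cross-validate a few candidate alphas and compare the minimum CV deviance
set.seed(123)
alphas <- c(0, 0.25, 0.5, 0.75, 1)
cv_err <- sapply(alphas, function(a) {
  min(cv.glmnet(x, y, alpha = a, family = "binomial")$cvm)
})
data.frame(alpha = alphas, min_cv_deviance = cv_err)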
Unlike the other models, the regularization functions accept only a matrix as input rather than a raw data frame.
# Dumy code categorical predictor variables
x <- model.matrix(y~., bank_down)[,-1]
# Convert the outcome (class) to a numerical variable
y <- ifelse(bank_down$y == "yes", 1, 0)
Find the best lambda using cross-validation
cv.net <- cv.glmnet(x, y, alpha = 0.5, family = "binomial")
Display regression coefficients
coef(cv.net)
## 48 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 41.3011799632
## age .
## jobblue-collar -0.1556873448
## jobentrepreneur .
## jobhousemaid .
## jobmanagement .
## jobretired 0.2136182389
## jobself-employed .
## jobservices .
## jobstudent 0.2027000279
## jobtechnician .
## jobunemployed .
## maritalmarried .
## maritalsingle 0.0239695860
## educationbasic.6y .
## educationbasic.9y .
## educationhigh.school .
## educationilliterate .
## educationprofessional.course .
## educationuniversity.degree 0.0962649229
## defaultyes .
## housingyes .
## loanyes .
## contacttelephone -0.1950296434
## monthaug .
## monthdec .
## monthjul .
## monthjun 0.1440108907
## monthmar 1.1151635008
## monthmay -0.7413631225
## monthnov -0.1916883269
## monthoct 0.0776097416
## monthsep -0.0592038072
## day_of_weekmon -0.0531327440
## day_of_weekthu .
## day_of_weektue .
## day_of_weekwed 0.0169268253
## duration 0.0040981504
## campaign -0.0002195744
## pdays -0.0007958577
## previous .
## poutcomenonexistent 0.1793600384
## poutcomesuccess 0.8433424805
## emp.var.rate -0.1846557153
## cons.price.idx .
## cons.conf.idx 0.0115821734
## euribor3m -0.0247689761
## nr.employed -0.0084785511
Shrinking the coefficients does not necessarily mean the number of variables is reduced; only some coefficients (shown as ‘.’) are driven exactly to zero.
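A quick way to see how many coefficients the penalty actually drives to zero (an illustrative snippet; coef() with no s argument uses lambda.1se, which is what is printed above):
# Count the non-zero coefficients at the default lambda ("lambda.1se",
# as printed above) and at lambda.min for comparison.
cf_1se <- as.vector(coef(cv.net))
cf_min <- as.vector(coef(cv.net, s = "lambda.min"))
c(nonzero_at_1se = sum(cf_1se != 0),
  nonzero_at_min = sum(cf_min != 0),
  total_coefficients = length(cf_1se))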
Fit the final model on the training data
net_train <- glmnet(x, y, alpha = 0.5, family = "binomial",
lambda = cv.net$lambda.min)
Make predictions on the test data
x_test <- model.matrix(y ~., bank_test)[,-1]
net_forecasts <- cv.net %>% predict(newx = x_test)
Confusion matrix bank_test set
cMatrix<-confusionMatrix(data = as.factor(ifelse(net_forecasts > 0.5,"yes", "no")),
reference = bank_test$y,
positive = "yes")
cMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7201 656
## yes 108 272
##
## Accuracy : 0.9072
## 95% CI : (0.9008, 0.9134)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 0.000000002435
##
## Kappa : 0.375
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.29310
## Specificity : 0.98522
## Pos Pred Value : 0.71579
## Neg Pred Value : 0.91651
## Prevalence : 0.11266
## Detection Rate : 0.03302
## Detection Prevalence : 0.04613
## Balanced Accuracy : 0.63916
##
## 'Positive' Class : yes
##
ROC bank_test set
roc<-roc.area(ifelse(bank_test$y == "yes", 1, 0),
ifelse(net_forecasts > 0.5, 1, 0))
roc
## $A
## [1] 0.6391636
##
## $n.total
## [1] 8237
##
## $n.events
## [1] 928
##
## $n.noevents
## [1] 7309
##
## $p.value
## [1] 0
ROC plot bank_test set
roc.plot(ifelse(bank_test$y == "yes", 1, 0),
net_forecasts)
Instead of using a custom formula, there is a method that lets the model add/remove features automatically until the best fit is reached.
There are three variants of this strategy: forward selection, backward elimination, and bidirectional (stepwise in both directions).
The downside of this method is that it takes a long time to train. To reduce the computation, only 10% of the training data is used, and backward stepwise selection serves as the example of the three.
set.seed(123)
which_small <- createDataPartition(bank_down$y,
p = 0.1,
list = FALSE)
bank_down_small <- bank_down[which_small,]
ctrl_nocv <- trainControl(method = "none")
logit_backward_train <-
train(y~.,
data = bank_down_small,
# stepwise method
method = "glmStepAIC",
# additional argument
direction = "backward",
trControl = ctrl_nocv)
# Summary
summary(logit_backward_train)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.7419 -0.2951 -0.1807 -0.1264 2.9967
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -175.8538795 20.4705367 -8.591 < 0.0000000000000002
## jobentrepreneur -2.2981321 0.7806478 -2.944 0.003241
## jobunemployed 0.7386721 0.4255651 1.736 0.082609
## maritalmarried 0.6584864 0.2887554 2.280 0.022582
## maritalsingle 0.7550437 0.2977289 2.536 0.011212
## contacttelephone -0.8756656 0.2394056 -3.658 0.000255
## monthaug 0.5696218 0.2582666 2.206 0.027415
## monthmar 2.2894933 0.3802869 6.020 0.00000000174
## monthmay -0.4862553 0.1948151 -2.496 0.012561
## day_of_weekmon -0.4459487 0.1895273 -2.353 0.018625
## duration 0.0047914 0.0002626 18.248 < 0.0000000000000002
## previous -0.1834359 0.1221841 -1.501 0.133276
## poutcomesuccess 1.7540073 0.3282482 5.344 0.00000009115
## emp.var.rate -1.0955977 0.0825343 -13.274 < 0.0000000000000002
## cons.price.idx 1.8558290 0.2197159 8.446 < 0.0000000000000002
## cons.conf.idx 0.0558012 0.0157176 3.550 0.000385
##
## (Intercept) ***
## jobentrepreneur **
## jobunemployed .
## maritalmarried *
## maritalsingle *
## contacttelephone ***
## monthaug *
## monthmar ***
## monthmay *
## day_of_weekmon *
## duration ***
## previous
## poutcomesuccess ***
## emp.var.rate ***
## cons.price.idx ***
## cons.conf.idx ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2323.4 on 3295 degrees of freedom
## Residual deviance: 1316.9 on 3280 degrees of freedom
## AIC: 1348.9
##
## Number of Fisher Scoring iterations: 7
Fitted values
logit_backward_train_fitted <- predict(logit_backward_train,
bank_down,
type = "prob")
Predicted values
logit_backward_train_forecasts <- predict(logit_backward_train,
bank_test,
type = "prob")
confusion matrix train set
confusionMatrix(data = as.factor(ifelse(logit_backward_train_fitted["yes"] > 0.5,
"yes",
"no")),
reference = bank_down$y,
positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 28330 2075
## yes 909 1637
##
## Accuracy : 0.9094
## 95% CI : (0.9063, 0.9125)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.4751
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.44100
## Specificity : 0.96891
## Pos Pred Value : 0.64297
## Neg Pred Value : 0.93175
## Prevalence : 0.11265
## Detection Rate : 0.04968
## Detection Prevalence : 0.07727
## Balanced Accuracy : 0.70496
##
## 'Positive' Class : yes
##
ROC train set
roc.area(ifelse(bank_down$y == "yes", 1, 0),
logit_backward_train_fitted[,"yes"])
## $A
## [1] 0.9243167
##
## $n.total
## [1] 32951
##
## $n.events
## [1] 3712
##
## $n.noevents
## [1] 29239
##
## $p.value
## [1] 0
confusion matrix bank_test set
cMatrix<-confusionMatrix(data = as.factor(ifelse(logit_backward_train_forecasts["yes"] > 0.5,
"yes",
"no")),
reference = bank_test$y,
positive = "yes")
cMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 7099 501
## yes 210 427
##
## Accuracy : 0.9137
## 95% CI : (0.9074, 0.9197)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 0.000000000000002517
##
## Kappa : 0.4998
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.46013
## Specificity : 0.97127
## Pos Pred Value : 0.67033
## Neg Pred Value : 0.93408
## Prevalence : 0.11266
## Detection Rate : 0.05184
## Detection Prevalence : 0.07733
## Balanced Accuracy : 0.71570
##
## 'Positive' Class : yes
##
ROC bank_test set
roc<-roc.area(ifelse(bank_test$y == "yes", 1, 0),
logit_backward_train_forecasts[,"yes"])
roc
## $A
## [1] 0.9249201
##
## $n.total
## [1] 8237
##
## $n.events
## [1] 928
##
## $n.noevents
## [1] 7309
##
## $p.value
## [1] 0
ROC bank_test set plot
roc.plot(ifelse(bank_test$y == "yes", 1, 0),
logit_backward_train_forecasts[,"yes"])
## Warning in roc.plot.default(ifelse(bank_test$y == "yes", 1, 0),
## logit_backward_train_forecasts[, : Large amount of unique predictions used
## as thresholds. Consider specifying thresholds.
On closer inspection, the target variable has a rather imbalanced response.
plot(bank$y)
Resampling the data can resolve this problem. This report employs simple downsampling, since the dataset is large and other methods require more complex modification. Afterwards, logistic regression is fitted once again to contrast the results with and without resampling.
bank_down <- downSample(# a matrix or data frame of predictor variables
x = bank,
# a factor variable with the class memberships
y = bank$y,
yname = 'y')
plot(bank_down$y)
set.seed(123)
ctrl_cv5x3a <- trainControl(method = "repeatedcv",
number = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary,
repeats = 3)
logit_train_down <-
train(model_formula,
data = bank_down,
method = "glm",
metric = "ROC",
family = "binomial",
trControl = ctrl_cv5x3a)
# Summary
summary(logit_train_down)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.7113 -0.3834 -0.0299 0.4696 2.8042
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 134.7905764 7.2883351 18.494
## poutcomenonexistent 0.5878466 0.1788857 3.286
## poutcomesuccess 1.1744008 0.4449877 2.639
## monthaug 0.0560733 0.1783100 0.314
## monthdec 0.6623160 0.5307674 1.248
## monthjul 0.1030927 0.1557615 0.662
## monthjun 0.2412178 0.1462258 1.650
## monthmar 1.0741070 0.2035046 5.278
## monthmay -1.2838507 0.1219229 -10.530
## monthnov -0.4482753 0.1588433 -2.822
## monthoct 0.8471415 0.2526704 3.353
## monthsep -0.6356747 0.2401612 -2.647
## `jobblue-collar` -0.2672402 0.1247763 -2.142
## jobentrepreneur -0.1778486 0.1905039 -0.934
## jobhousemaid -0.0725454 0.2398715 -0.302
## jobmanagement -0.3081139 0.1344372 -2.292
## jobretired 0.5974513 0.1807795 3.305
## `jobself-employed` 0.0071891 0.1924446 0.037
## jobservices -0.2369093 0.1364427 -1.736
## jobstudent 0.3540091 0.1924422 1.840
## jobtechnician -0.0298042 0.1159295 -0.257
## jobunemployed 0.1636437 0.2062688 0.793
## contacttelephone 0.1103048 0.1177006 0.937
## educationbasic.6y -0.2131415 0.1958554 -1.088
## educationbasic.9y -0.0887367 0.1524787 -0.582
## educationhigh.school 0.0675848 0.1502255 0.450
## educationilliterate 12.4887303 231.4142277 0.054
## educationprofessional.course 0.2302059 0.1644193 1.400
## educationuniversity.degree 0.3136710 0.1455642 2.155
## maritalmarried 0.0572722 0.1093055 0.524
## maritalsingle 0.1960455 0.1255927 1.561
## day_of_weekmon -0.2408250 0.1059709 -2.273
## day_of_weekthu -0.1380881 0.1072717 -1.287
## day_of_weektue -0.0662619 0.1080861 -0.613
## day_of_weekwed -0.0223141 0.1072929 -0.208
## housingyes 0.1116696 0.0666635 1.675
## age -0.0012453 0.0038664 -0.322
## duration 0.0070967 0.0001702 41.689
## campaign -0.0449017 0.0174160 -2.578
## pdays -0.0010449 0.0004482 -2.332
## previous 0.0282864 0.1302363 0.217
## cons.price.idx -0.5825783 0.0768454 -7.581
## cons.conf.idx -0.0073905 0.0094266 -0.784
## nr.employed -0.0160745 0.0006359 -25.277
## Pr(>|z|)
## (Intercept) < 0.0000000000000002 ***
## poutcomenonexistent 0.00102 **
## poutcomesuccess 0.00831 **
## monthaug 0.75316
## monthdec 0.21209
## monthjul 0.50806
## monthjun 0.09902 .
## monthmar 0.0000001305677283 ***
## monthmay < 0.0000000000000002 ***
## monthnov 0.00477 **
## monthoct 0.00080 ***
## monthsep 0.00812 **
## `jobblue-collar` 0.03221 *
## jobentrepreneur 0.35053
## jobhousemaid 0.76232
## jobmanagement 0.02191 *
## jobretired 0.00095 ***
## `jobself-employed` 0.97020
## jobservices 0.08251 .
## jobstudent 0.06583 .
## jobtechnician 0.79711
## jobunemployed 0.42757
## contacttelephone 0.34867
## educationbasic.6y 0.27648
## educationbasic.9y 0.56059
## educationhigh.school 0.65279
## educationilliterate 0.95696
## educationprofessional.course 0.16148
## educationuniversity.degree 0.03117 *
## maritalmarried 0.60030
## maritalsingle 0.11853
## day_of_weekmon 0.02305 *
## day_of_weekthu 0.19800
## day_of_weektue 0.53984
## day_of_weekwed 0.83525
## housingyes 0.09391 .
## age 0.74739
## duration < 0.0000000000000002 ***
## campaign 0.00993 **
## pdays 0.01972 *
## previous 0.82806
## cons.price.idx 0.0000000000000342 ***
## cons.conf.idx 0.43304
## nr.employed < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12864.8 on 9279 degrees of freedom
## Residual deviance: 6022.5 on 9236 degrees of freedom
## AIC: 6110.5
##
## Number of Fisher Scoring iterations: 12
Fitted values
logit_train_down_fitted <- predict(logit_train_down,
bank_down,
type = "prob")
Predicted values
logit_train_down_forecasts <- predict(logit_train_down,
bank_test,
type = "prob")
confusion matrix train set
confusionMatrix(data = as.factor(ifelse(logit_train_down_fitted["yes"] > 0.5,
"yes",
"no")),
reference = bank_down$y,
positive = "yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 3980 553
## yes 660 4087
##
## Accuracy : 0.8693
## 95% CI : (0.8623, 0.8761)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.7386
## Mcnemar's Test P-Value : 0.002338
##
## Sensitivity : 0.8808
## Specificity : 0.8578
## Pos Pred Value : 0.8610
## Neg Pred Value : 0.8780
## Prevalence : 0.5000
## Detection Rate : 0.4404
## Detection Prevalence : 0.5115
## Balanced Accuracy : 0.8693
##
## 'Positive' Class : yes
##
On the downsampled train set, Accuracy: 0.8693 and Balanced Accuracy: 0.8693 coincide; the overall accuracy is lower than before, but the balanced accuracy improves considerably.
ROC train set
roc.area(ifelse(bank_down$y == "yes", 1, 0),
logit_train_down_fitted[,"yes"])
## $A
## [1] 0.9385968
##
## $n.total
## [1] 9280
##
## $n.events
## [1] 4640
##
## $n.noevents
## [1] 4640
##
## $p.value
## [1] 0
confusion matrix bank_test set
cMatrix<-confusionMatrix(data = as.factor(ifelse(logit_train_down_forecasts["yes"] > 0.5,
"yes",
"no")),
reference = bank_test$y,
positive = "yes")
cMatrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 6256 104
## yes 1053 824
##
## Accuracy : 0.8595
## 95% CI : (0.8518, 0.867)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.5143
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.8879
## Specificity : 0.8559
## Pos Pred Value : 0.4390
## Neg Pred Value : 0.9836
## Prevalence : 0.1127
## Detection Rate : 0.1000
## Detection Prevalence : 0.2279
## Balanced Accuracy : 0.8719
##
## 'Positive' Class : yes
##
On the test set, Accuracy: 0.8595 is lower than for the other models, but Balanced Accuracy: 0.8719 is the highest so far.
ROC bank_test set
roc<-roc.area(ifelse(bank_test$y == "yes", 1, 0),
logit_train_down_forecasts[,"yes"])
roc
## $A
## [1] 0.9391079
##
## $n.total
## [1] 8237
##
## $n.events
## [1] 928
##
## $n.noevents
## [1] 7309
##
## $p.value
## [1] 0
ROC bank_test set plot
roc.plot(ifelse(bank_test$y == "yes", 1, 0),
logit_train_forecasts[,"yes"])
## Warning in roc.plot.default(ifelse(bank_test$y == "yes", 1, 0),
## logit_train_forecasts[, : Large amount of unique predictions used as
## thresholds. Consider specifying thresholds.
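The comparison below assumes a list result in which each element bundles a model's test-set confusionMatrix with its roc.area() output (so that element [[7]] holds the ROC list), collected while the models were evaluated above. A hedged sketch of how it might have been assembled — the object names ending in _logit, _lda, _net, _backward and _down are illustrative only:
# Hypothetical assembly of `result`: each element concatenates a caret
# confusionMatrix object with the matching roc.area() list, so that
# result[[i]][[7]][["A"]] is that model's AUC.
result <- list(
  c(cMatrix_logit,    list(roc_logit)),
  c(cMatrix_lda,      list(roc_lda)),
  c(cMatrix_net,      list(roc_net)),
  c(cMatrix_backward, list(roc_backward)),
  c(cMatrix_down,     list(roc_down))
)

# Empty frame to hold the comparison metrics filled in by the loop below
result_df <- data.frame(Accuracy    = numeric(length(result)),
                        Balance_acc = numeric(length(result)),
                        Roc         = numeric(length(result)),
                        Sensitivity = numeric(length(result)),
                        Specificity = numeric(length(result)))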
# Comparison test result ----
for (i in seq(1,length(result),1)){
result_df[i,1] <- result[[i]][["overall"]][["Accuracy"]]
result_df[i,2] <- result[[i]][["byClass"]][["Balanced Accuracy"]]
result_df[i,3] <- result[[i]][[7]][["A"]]
result_df[i,4] <- result[[i]][["byClass"]][["Sensitivity"]]
result_df[i,5] <- result[[i]][["byClass"]][["Specificity"]]
}
row.names(result_df)<- c('Logistic Regression','Linear Discriminant',
'Net Elastic', 'Logit Backward', 'Logistic Regression (downsampling)')
result_df
## Accuracy Balance_acc Roc
## Logistic Regression 0.9134394 0.7066247 0.9359121
## Linear Discriminant 0.9099187 0.7418012 0.9346318
## Net Elastic 0.9072478 0.6391636 0.6391636
## Logit Backward 0.9136822 0.7156988 0.9249201
## Logistic Regression (downsampling) 0.8595362 0.8719310 0.9391079
## Sensitivity Specificity
## Logistic Regression 0.4396552 0.9735942
## Linear Discriminant 0.5247845 0.9588179
## Net Elastic 0.2931034 0.9852237
## Logit Backward 0.4601293 0.9712683
## Logistic Regression (downsampling) 0.8879310 0.8559310
Five models have been deployed to predict whether a client will subscribe to a term deposit. Each has its strengths and weaknesses, so it is hard to say which one is best; the best choice depends on the situation.
Logistic regression does the job well, with acceptable accuracy. Linear discriminant analysis, however, does not outshine the other models on any criterion. The elastic net has the best specificity but the worst sensitivity: it rarely misclassifies non-potential customers, yet it misses many potential ones, so if losing a potential customer is costly it should be avoided. Even though the backward stepwise logit is slow to run, it achieves the highest accuracy; still, being only marginally higher than plain logistic regression hardly justifies the extra cost. Logistic regression on the downsampled data sees its sensitivity shoot up at the cost of lower scores on the other criteria, except balanced accuracy, where it scores highest; depending on the business context, it is therefore worth considering.
Other models, such as kNN and SVM, were also considered, but with a large dataset and limited computing power they require a significant amount of time while the results do not improve much. Nevertheless, given the good results of these simple models, further modifications, whether to the data or to the models, could improve the results further.