The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often more than one call to the same client was required in order to finalize whether or not they had subscribed to a term deposit. A term deposit is a cash investment to a financial institution for an agreed interest rate over a fixed time period. This project attempts to solve this problem using Bayesian Additive Regression Trees.
NOTE: This dataset is based on the “Bank Marketing” UCI dataset.
age: age of clientjob : type of jobmarital : marital statuseducation: highest educational achievementdefault: has credit in default?housing: has housing loan?loan: has personal loan?contact: contact communication typemonth: last contact month of yearday_of_week: last contact day of the weekduration: last contact duration, in secondscampaign: number of contacts performed during this campaign and for this clientpdays: number of days that passed by after the client was last contacted from a previous campaign (999 means client was not previously contacted)previous: number of contacts performed before this campaign and for this clientpoutcome: outcome of the previous marketing campaignemp.var.rate: employment variation rate - quarterly indicatorcons.price.idx: consumer price index - monthly indicatorcons.conf.idx: consumer confidence index - monthly indicatoreuribor3m: euribor 3 month rate - daily indicatornr.employed: number of employees - quarterly indicatory - has the client subscribed a term deposit?NOTE: Although this data is already well-cleaned, there are a few things that will be done and discussed.
datatable(bank, options = list(pageLength = 10, scrollX = TRUE))glimpse(bank)## Observations: 4,119
## Variables: 21
## $ age <int> 30, 39, 25, 38, 47, 32, 32, 41, 31, 35, 25, 36,...
## $ job <fctr> blue-collar, services, services, services, adm...
## $ marital <fctr> married, single, married, married, married, si...
## $ education <fctr> basic.9y, high.school, high.school, basic.9y, ...
## $ default <fctr> no, no, no, no, no, no, no, unknown, no, unkno...
## $ housing <fctr> yes, no, yes, unknown, yes, no, yes, yes, no, ...
## $ loan <fctr> no, no, no, unknown, no, no, no, no, no, no, n...
## $ contact <fctr> cellular, telephone, telephone, telephone, cel...
## $ month <fctr> may, may, jun, jun, nov, sep, sep, nov, nov, m...
## $ day_of_week <fctr> fri, fri, wed, fri, mon, thu, mon, mon, tue, t...
## $ duration <int> 487, 346, 227, 17, 58, 128, 290, 44, 68, 170, 3...
## $ campaign <int> 2, 4, 1, 3, 1, 3, 4, 2, 1, 1, 1, 1, 2, 2, 2, 2,...
## $ pdays <int> 999, 999, 999, 999, 999, 999, 999, 999, 999, 99...
## $ previous <int> 0, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ poutcome <fctr> nonexistent, nonexistent, nonexistent, nonexis...
## $ emp.var.rate <dbl> -1.8, 1.1, 1.4, 1.4, -0.1, -1.1, -1.1, -0.1, -0...
## $ cons.price.idx <dbl> 92.893, 93.994, 94.465, 94.465, 93.200, 94.199,...
## $ cons.conf.idx <dbl> -46.2, -36.4, -41.8, -41.8, -42.0, -37.5, -37.5...
## $ euribor3m <dbl> 1.313, 4.855, 4.962, 4.959, 4.191, 0.884, 0.879...
## $ nr.employed <dbl> 5099.1, 5191.0, 5228.1, 5228.1, 5195.8, 4963.6,...
## $ y <fctr> no, no, no, no, no, no, no, no, no, no, no, no...
summary(bank)## age job marital
## Min. :18.00 admin. :1012 divorced: 446
## 1st Qu.:32.00 blue-collar: 884 married :2509
## Median :38.00 technician : 691 single :1153
## Mean :40.11 services : 393 unknown : 11
## 3rd Qu.:47.00 management : 324
## Max. :88.00 retired : 166
## (Other) : 649
## education default housing loan
## university.degree :1264 no :3315 no :1839 no :3349
## high.school : 921 unknown: 803 unknown: 105 unknown: 105
## basic.9y : 574 yes : 1 yes :2175 yes : 665
## professional.course: 535
## basic.4y : 429
## basic.6y : 228
## (Other) : 168
## contact month day_of_week duration
## cellular :2652 may :1378 fri:768 Min. : 0.0
## telephone:1467 jul : 711 mon:855 1st Qu.: 103.0
## aug : 636 thu:860 Median : 181.0
## jun : 530 tue:841 Mean : 256.8
## nov : 446 wed:795 3rd Qu.: 317.0
## apr : 215 Max. :3643.0
## (Other): 203
## campaign pdays previous poutcome
## Min. : 1.000 Min. : 0.0 Min. :0.0000 failure : 454
## 1st Qu.: 1.000 1st Qu.:999.0 1st Qu.:0.0000 nonexistent:3523
## Median : 2.000 Median :999.0 Median :0.0000 success : 142
## Mean : 2.537 Mean :960.4 Mean :0.1903
## 3rd Qu.: 3.000 3rd Qu.:999.0 3rd Qu.:0.0000
## Max. :35.000 Max. :999.0 Max. :6.0000
##
## emp.var.rate cons.price.idx cons.conf.idx euribor3m
## Min. :-3.40000 Min. :92.20 Min. :-50.8 Min. :0.635
## 1st Qu.:-1.80000 1st Qu.:93.08 1st Qu.:-42.7 1st Qu.:1.334
## Median : 1.10000 Median :93.75 Median :-41.8 Median :4.857
## Mean : 0.08497 Mean :93.58 Mean :-40.5 Mean :3.621
## 3rd Qu.: 1.40000 3rd Qu.:93.99 3rd Qu.:-36.4 3rd Qu.:4.961
## Max. : 1.40000 Max. :94.77 Max. :-26.9 Max. :5.045
##
## nr.employed y
## Min. :4964 no :3668
## 1st Qu.:5099 yes: 451
## Median :5191
## Mean :5166
## 3rd Qu.:5228
## Max. :5228
##
The campaign,duration, previous, and age variables clearly has outliers. This can be diagnosed further but it is out of the scope of this analysis.
bank %>%
plot_ly(x = bank$age,
type = 'histogram',
color = bank$y) %>%
layout(xaxis = list(title="Age"),
yaxis = list(title="Count"))Tidy data ensures the dataset is a collection of values. The following are the Principles of Tidy Data:
Knowing this, these are symptons that identify messy data:
There are two paths in preparing data:
Preparing data for analysis considers the following:
There are several missing values in some categorical attributes, all coded with the “unknown” label. These missing values can be treated as a possible class label or by using deletion/imputation techniques. For this analysis, I will use them as class labels to help draw more insight and predict the outcome.
Preparing data for modeling considers the following:
training_set and test_set (also a validation set if preferred)library(caTools)
set.seed(123)
bank$duration <- NULL
split <- sample.split(bank[,'y'], SplitRatio = 0.8)
training_set <- subset(bank, split == TRUE)
y_train <- training_set[,'y']
test_set <- subset(bank, split == FALSE)
y_test <- test_set[,'y']
dm <- rbind(Train=dim(training_set),Test=dim(test_set))
colnames(dm) <- c("Observations", "Features")
kable(dm)| Observations | Features | |
|---|---|---|
| Train | 3295 | 20 |
| Test | 824 | 20 |
label_encoder <- function(df){
for (i in 1:ncol(df)) {
factr <- as.vector(sapply(df[i], is.factor))
if (factr) {
df[i] <- lapply(df[i], as.numeric)
}
}
return(df)
}
#training_set[,-20] <- scale(label_encoder(training_set[,-20]))
training_set[,-20] <- label_encoder(training_set[,-20])
datatable(training_set,
options = list(pageLength = 5,
scrollX = TRUE))#test_set[,-20] <- scale(label_encoder(test_set[,-20]))
test_set[,-20] <- label_encoder(test_set[,-20])
datatable(test_set,
options = list(pageLength = 5,
scrollX = TRUE))“BART” is a Bayesian approach to nonparametric function estimation using regression trees which rely on recursive binary partitioning of predictor space into a set of hyperrectangles in order to approximate some unknown function. It can be considered a sum-of-trees ensemble, with an estimation approach relying on a fully Bayesian probability model.
Please refer here for a better overview of BART and the bartMachine package.
library(bartMachine)
set_bart_machine_num_cores(2)## bartMachine now using 2 cores.
bart <- bartMachine(as.data.frame(training_set),y_train)## bartMachine initializing with 50 trees...
## bartMachine vars checked...
## bartMachine java init...
## bartMachine factors created...
## bartMachine before preprocess...
## bartMachine after preprocess... 22 total features...
## bartMachine sigsq estimated...
## bartMachine training data finalized...
## Now building bartMachine for classification ...Covariate importance prior ON.
## evaluating in sample data...done
bart## bartMachine v1.2.3 for classification
##
## training data n = 3295 and p = 21
## built in 8.1 secs on 2 cores, 50 trees, 250 burn-in and 1000 post. samples
##
## confusion matrix:
##
## predicted no predicted yes model errors
## actual no 2934 0 0
## actual yes 0 361 0
## use errors 0 0 0
pred <- bart_predict_for_test_data(bart_machine = bart,
Xtest = as.data.frame(test_set),
ytest = y_test)
kable(pred$confusion_matrix)| predicted no | predicted yes | model errors | |
|---|---|---|---|
| actual no | 734 | 0 | 0 |
| actual yes | 0 | 90 | 0 |
| use errors | 0 | 0 | 0 |