Preliminaries: setting up libraries and loading the dataset
## Libraries used
library(ggplot2)
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
library(tidyverse)
## Registered S3 method overwritten by 'rvest':
## method from
## read_xml.response xml2
## -- Attaching packages ----------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.1     v purrr   0.3.2
## v tibble  2.1.1     v dplyr   0.8.0.1
## v tidyr   0.8.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## Warning: package 'stringr' was built under R version 3.6.1
## -- Conflicts -------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(gmodels)
## Warning: package 'gmodels' was built under R version 3.6.1
library(ggmosaic)
## Warning: package 'ggmosaic' was built under R version 3.6.1
library(corrplot)
## corrplot 0.84 loaded
library(caret)
## Warning: package 'caret' was built under R version 3.6.1
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.6.1
library(C50)
## Warning: package 'C50' was built under R version 3.6.1
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.6.1
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
## Loading the dataset
bank_data <- read.csv("C:/Users/qasim/OneDrive/Desktop/York U/2 - Basic Methods of Data Analytics/Lab 2/bank-additional-full.csv", sep=";")
Abstract
We are using the "Bank Marketing" UCI dataset, related to direct marketing campaigns to the customers of a Portugal-based bank. The goal of the analysis is to use at least two typical classification algorithms, logistic regression and decision trees, to build predictive models of customers signing up for term deposits.
The dataset used is based on the "Bank Marketing" UCI dataset (detailed description at: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing), published by the Banco de Portugal and publicly available at: https://www.bportugal.pt/estatisticasweb. This dataset is almost identical to the one used in [Moro et al., 2014], the difference being that attributes that could be used as "personally identifiable information" have been removed due to privacy concerns.
The data is related to the direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no'). The classification goal is to predict whether, and which types of, clients buy a term deposit (variable y).
The objective of this analysis is to provide a reliable and feasible recommendation algorithm to predict client take-up based on client type and campaign type. The target value is the binary "yes" or "no" regarding the subscription of a term deposit by the client. Hence the task can be solved by classification using logistic regression or a decision tree. We plan to use both and compare F-scores at the end to judge which algorithm generated more accurate results.
The dimensions and variables aren't too obvious, so here is a breakdown by group (variable names as they appear in the data).
Variables related to bank client data: age, job, marital, education, default, housing, loan.
Variables related to the last contact of the current marketing campaign: contact, month, day_of_week, duration, plus the campaign-history attributes campaign, pdays, previous, and poutcome.
Social and economic context attributes: emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m, nr.employed.
Output variable (desired target): y - has the client subscribed a term deposit? (binary: 'yes', 'no')
Check number of rows and columns.
ncol(bank_data)
## [1] 21
nrow(bank_data)
## [1] 41188
Quickly preview the data structure with head(), str(), and summary().
head(bank_data,10)
## age job marital education default housing loan
## 1 56 housemaid married basic.4y no no no
## 2 57 services married high.school unknown no no
## 3 37 services married high.school no yes no
## 4 40 admin. married basic.6y no no no
## 5 56 services married high.school no no yes
## 6 45 services married basic.9y unknown no no
## 7 59 admin. married professional.course no no no
## 8 41 blue-collar married unknown unknown no no
## 9 24 technician single professional.course no yes no
## 10 25 services single high.school no yes no
## contact month day_of_week duration campaign pdays previous
## 1 telephone may mon 261 1 999 0
## 2 telephone may mon 149 1 999 0
## 3 telephone may mon 226 1 999 0
## 4 telephone may mon 151 1 999 0
## 5 telephone may mon 307 1 999 0
## 6 telephone may mon 198 1 999 0
## 7 telephone may mon 139 1 999 0
## 8 telephone may mon 217 1 999 0
## 9 telephone may mon 380 1 999 0
## 10 telephone may mon 50 1 999 0
## poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m
## 1 nonexistent 1.1 93.994 -36.4 4.857
## 2 nonexistent 1.1 93.994 -36.4 4.857
## 3 nonexistent 1.1 93.994 -36.4 4.857
## 4 nonexistent 1.1 93.994 -36.4 4.857
## 5 nonexistent 1.1 93.994 -36.4 4.857
## 6 nonexistent 1.1 93.994 -36.4 4.857
## 7 nonexistent 1.1 93.994 -36.4 4.857
## 8 nonexistent 1.1 93.994 -36.4 4.857
## 9 nonexistent 1.1 93.994 -36.4 4.857
## 10 nonexistent 1.1 93.994 -36.4 4.857
## nr.employed y
## 1 5191 no
## 2 5191 no
## 3 5191 no
## 4 5191 no
## 5 5191 no
## 6 5191 no
## 7 5191 no
## 8 5191 no
## 9 5191 no
## 10 5191 no
str(bank_data)
## 'data.frame': 41188 obs. of 21 variables:
## $ age : int 56 57 37 40 56 45 59 41 24 25 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 4 8 8 1 8 8 1 2 10 8 ...
## $ marital : Factor w/ 4 levels "divorced","married",..: 2 2 2 2 2 2 2 2 3 3 ...
## $ education : Factor w/ 8 levels "basic.4y","basic.6y",..: 1 4 4 2 4 3 6 8 6 4 ...
## $ default : Factor w/ 3 levels "no","unknown",..: 1 2 1 1 1 2 1 2 1 1 ...
## $ housing : Factor w/ 3 levels "no","unknown",..: 1 1 3 1 1 1 1 1 3 3 ...
## $ loan : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 3 1 1 1 1 1 ...
## $ contact : Factor w/ 2 levels "cellular","telephone": 2 2 2 2 2 2 2 2 2 2 ...
## $ month : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ day_of_week : Factor w/ 5 levels "fri","mon","thu",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ duration : int 261 149 226 151 307 198 139 217 380 50 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ emp.var.rate : num 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
## $ cons.price.idx: num 94 94 94 94 94 ...
## $ cons.conf.idx : num -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
## $ euribor3m : num 4.86 4.86 4.86 4.86 4.86 ...
## $ nr.employed : num 5191 5191 5191 5191 5191 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
summary(bank_data)
## age job marital
## Min. :17.00 admin. :10422 divorced: 4612
## 1st Qu.:32.00 blue-collar: 9254 married :24928
## Median :38.00 technician : 6743 single :11568
## Mean :40.02 services : 3969 unknown : 80
## 3rd Qu.:47.00 management : 2924
## Max. :98.00 retired : 1720
## (Other) : 6156
## education default housing
## university.degree :12168 no :32588 no :18622
## high.school : 9515 unknown: 8597 unknown: 990
## basic.9y : 6045 yes : 3 yes :21576
## professional.course: 5243
## basic.4y : 4176
## basic.6y : 2292
## (Other) : 1749
## loan contact month day_of_week
## no :33950 cellular :26144 may :13769 fri:7827
## unknown: 990 telephone:15044 jul : 7174 mon:8514
## yes : 6248 aug : 6178 thu:8623
## jun : 5318 tue:8090
## nov : 4101 wed:8134
## apr : 2632
## (Other): 2016
## duration campaign pdays previous
## Min. : 0.0 Min. : 1.000 Min. : 0.0 Min. :0.000
## 1st Qu.: 102.0 1st Qu.: 1.000 1st Qu.:999.0 1st Qu.:0.000
## Median : 180.0 Median : 2.000 Median :999.0 Median :0.000
## Mean : 258.3 Mean : 2.568 Mean :962.5 Mean :0.173
## 3rd Qu.: 319.0 3rd Qu.: 3.000 3rd Qu.:999.0 3rd Qu.:0.000
## Max. :4918.0 Max. :56.000 Max. :999.0 Max. :7.000
##
## poutcome emp.var.rate cons.price.idx cons.conf.idx
## failure : 4252 Min. :-3.40000 Min. :92.20 Min. :-50.8
## nonexistent:35563 1st Qu.:-1.80000 1st Qu.:93.08 1st Qu.:-42.7
## success : 1373 Median : 1.10000 Median :93.75 Median :-41.8
## Mean : 0.08189 Mean :93.58 Mean :-40.5
## 3rd Qu.: 1.40000 3rd Qu.:93.99 3rd Qu.:-36.4
## Max. : 1.40000 Max. :94.77 Max. :-26.9
##
## euribor3m nr.employed y
## Min. :0.634 Min. :4964 no :36548
## 1st Qu.:1.344 1st Qu.:5099 yes: 4640
## Median :4.857 Median :5191
## Mean :3.621 Mean :5167
## 3rd Qu.:4.961 3rd Qu.:5228
## Max. :5.045 Max. :5228
##
Check distribution of target variable.
summary(bank_data$y)
## no yes
## 36548 4640
CrossTable(bank_data$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 41188
##
##
## | no | yes |
## |-----------|-----------|
## | 36548 | 4640 |
## | 0.887 | 0.113 |
## |-----------|-----------|
##
##
##
##
Creating a binary dependent variable for potential regression models.
bank_data <- bank_data %>%
mutate(y_binary = ifelse(y == "no",0,1))
Check distribution of target variable.
hist(bank_data$y_binary)
summary(bank_data$y_binary)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1127 0.0000 1.0000
Getting an idea of missing values:
colSums(is.na(bank_data))
## age job marital education default
## 0 0 0 0 0
## housing loan contact month day_of_week
## 0 0 0 0 0
## duration campaign pdays previous poutcome
## 0 0 0 0 0
## emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
## 0 0 0 0 0
## y y_binary
## 0 0
colSums(bank_data == "")
## age job marital education default
## 0 0 0 0 0
## housing loan contact month day_of_week
## 0 0 0 0 0
## duration campaign pdays previous poutcome
## 0 0 0 0 0
## emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
## 0 0 0 0 0
## y y_binary
## 0 0
colSums(bank_data == "unknown")
## age job marital education default
## 0 330 80 1731 8597
## housing loan contact month day_of_week
## 990 990 0 0 0
## duration campaign pdays previous poutcome
## 0 0 0 0 0
## emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
## 0 0 0 0 0
## y y_binary
## 0 0
sum(bank_data == "unknown")
## [1] 12718
We see that missing values are encoded as "unknown".
Which variables suffer the most from these "missing values"?
bank_data %>%
summarise_all(list(~sum(. == "unknown"))) %>%
gather(key = "variable", value = "nr_unknown") %>%
arrange(-nr_unknown)
## variable nr_unknown
## 1 default 8597
## 2 education 1731
## 3 housing 990
## 4 loan 990
## 5 job 330
## 6 marital 80
## 7 age 0
## 8 contact 0
## 9 month 0
## 10 day_of_week 0
## 11 duration 0
## 12 campaign 0
## 13 pdays 0
## 14 previous 0
## 15 poutcome 0
## 16 emp.var.rate 0
## 17 cons.price.idx 0
## 18 cons.conf.idx 0
## 19 euribor3m 0
## 20 nr.employed 0
## 21 y 0
## 22 y_binary 0
Six features have at least one unknown value.
Before deciding how to manage those missing values, we'll study each variable and make a decision after visualisations.
We can't afford to delete 8,597 rows from our dataset; that's more than 20% of our observations.
## default theme for ggplot
theme_set(theme_bw())
## setting default parameters for mosaic plots
mosaic_theme = theme(axis.text.x = element_text(angle = 90,
hjust = 1,
vjust = 0.5),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
What kinds of people were contacted during this marketing campaign?
summary(bank_data$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.00 32.00 38.00 40.02 47.00 98.00
bank_data %>%
ggplot() +
aes(x = age) +
geom_bar() +
geom_vline(xintercept = c(30, 60),
col = "red",
linetype = "dashed") +
facet_grid(y ~ .,
scales = "free_y") +
scale_x_continuous(breaks = seq(0, 100, 5))
First of all, it seems the bank is not very interested in contacting the older population. Even so, past the 60-year threshold the relative frequency of y = 1 is higher. In other words, elderly people are more likely to subscribe to a term deposit.
We can also slice the age feature at 30 years to make three easily interpretable classes: up to 30, 31-60, and over 60.
The minimum and maximum observed values are 17 and 98, but we can expect new observations outside this range.
We're replacing the continuous variable "age" with this categorical variable. We might lose some information in this continuous-to-discrete transformation, but there wasn't any clear year-by-year pattern.
Cutting into classes will also make the algorithms easier to interpret later.
bank_data = bank_data %>%
mutate(age_discrete = if_else(age > 60, "high", if_else(age > 30, "mid", "low")))
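Equivalently, base R's cut() can produce the same three classes in one call. A small sketch (the age_discrete2 name is only for illustration), with right-closed breaks matching the if_else() logic above:
# same binning with cut(); intervals are (0,30], (30,60], (60,Inf]
bank_data$age_discrete2 <- cut(bank_data$age,
                               breaks = c(0, 30, 60, Inf),
                               labels = c("low", "mid", "high"))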
Cross-tab with our dependent variable
CrossTable(bank_data$age_discrete, bank_data$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 41188
##
##
## | bank_data$y
## bank_data$age_discrete | no | yes | Row Total |
## -----------------------|-----------|-----------|-----------|
## high | 496 | 414 | 910 |
## | 120.154 | 946.422 | |
## | 0.545 | 0.455 | 0.022 |
## | 0.014 | 0.089 | |
## | 0.012 | 0.010 | |
## -----------------------|-----------|-----------|-----------|
## low | 6259 | 1124 | 7383 |
## | 13.039 | 102.707 | |
## | 0.848 | 0.152 | 0.179 |
## | 0.171 | 0.242 | |
## | 0.152 | 0.027 | |
## -----------------------|-----------|-----------|-----------|
## mid | 29793 | 3102 | 32895 |
## | 12.488 | 98.367 | |
## | 0.906 | 0.094 | 0.799 |
## | 0.815 | 0.669 | |
## | 0.723 | 0.075 | |
## -----------------------|-----------|-----------|-----------|
## Column Total | 36548 | 4640 | 41188 |
## | 0.887 | 0.113 | |
## -----------------------|-----------|-----------|-----------|
##
##
45.5% of people over 60 subscribed to a term deposit, which is a lot in comparison with younger individuals: 15.2% of young adults (under 30) and only 9.4% of the remaining observations (aged 30 to 60) said yes.
What are the types of jobs represented in our data?
table(bank_data$job)
##
## admin. blue-collar entrepreneur housemaid management
## 10422 9254 1456 1060 2924
## retired self-employed services student technician
## 1720 1421 3969 875 6743
## unemployed unknown
## 1014 330
We have 330 unknown jobs.
Cross-tab with our dependent variable
CrossTable(bank_data$job, bank_data$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 41188
##
##
## | bank_data$y
## bank_data$job | no | yes | Row Total |
## --------------|-----------|-----------|-----------|
## admin. | 9070 | 1352 | 10422 |
## | 3.423 | 26.961 | |
## | 0.870 | 0.130 | 0.253 |
## | 0.248 | 0.291 | |
## | 0.220 | 0.033 | |
## --------------|-----------|-----------|-----------|
## blue-collar | 8616 | 638 | 9254 |
## | 19.926 | 156.951 | |
## | 0.931 | 0.069 | 0.225 |
## | 0.236 | 0.138 | |
## | 0.209 | 0.015 | |
## --------------|-----------|-----------|-----------|
## entrepreneur | 1332 | 124 | 1456 |
## | 1.240 | 9.767 | |
## | 0.915 | 0.085 | 0.035 |
## | 0.036 | 0.027 | |
## | 0.032 | 0.003 | |
## --------------|-----------|-----------|-----------|
## housemaid | 954 | 106 | 1060 |
## | 0.191 | 1.507 | |
## | 0.900 | 0.100 | 0.026 |
## | 0.026 | 0.023 | |
## | 0.023 | 0.003 | |
## --------------|-----------|-----------|-----------|
## management | 2596 | 328 | 2924 |
## | 0.001 | 0.006 | |
## | 0.888 | 0.112 | 0.071 |
## | 0.071 | 0.071 | |
## | 0.063 | 0.008 | |
## --------------|-----------|-----------|-----------|
## retired | 1286 | 434 | 1720 |
## | 37.814 | 297.849 | |
## | 0.748 | 0.252 | 0.042 |
## | 0.035 | 0.094 | |
## | 0.031 | 0.011 | |
## --------------|-----------|-----------|-----------|
## self-employed | 1272 | 149 | 1421 |
## | 0.097 | 0.767 | |
## | 0.895 | 0.105 | 0.035 |
## | 0.035 | 0.032 | |
## | 0.031 | 0.004 | |
## --------------|-----------|-----------|-----------|
## services | 3646 | 323 | 3969 |
## | 4.375 | 34.458 | |
## | 0.919 | 0.081 | 0.096 |
## | 0.100 | 0.070 | |
## | 0.089 | 0.008 | |
## --------------|-----------|-----------|-----------|
## student | 600 | 275 | 875 |
## | 40.090 | 315.775 | |
## | 0.686 | 0.314 | 0.021 |
## | 0.016 | 0.059 | |
## | 0.015 | 0.007 | |
## --------------|-----------|-----------|-----------|
## technician | 6013 | 730 | 6743 |
## | 0.147 | 1.156 | |
## | 0.892 | 0.108 | 0.164 |
## | 0.165 | 0.157 | |
## | 0.146 | 0.018 | |
## --------------|-----------|-----------|-----------|
## unemployed | 870 | 144 | 1014 |
## | 0.985 | 7.758 | |
## | 0.858 | 0.142 | 0.025 |
## | 0.024 | 0.031 | |
## | 0.021 | 0.003 | |
## --------------|-----------|-----------|-----------|
## unknown | 293 | 37 | 330 |
## | 0.000 | 0.001 | |
## | 0.888 | 0.112 | 0.008 |
## | 0.008 | 0.008 | |
## | 0.007 | 0.001 | |
## --------------|-----------|-----------|-----------|
## Column Total | 36548 | 4640 | 41188 |
## | 0.887 | 0.113 | |
## --------------|-----------|-----------|-----------|
##
##
bank_data %>%
ggplot() +
geom_mosaic(aes(x = product(y, job), fill = y)) +
mosaic_theme +
xlab("Job") +
ylab(NULL)
Response is higher among students (31.4%) and retired people (25.2%).
Other classes range between 6.9% (blue-collar) and 14.2% (unemployed).
We also see that we can ignore "unknown": at 11.2% it sits right at the overall average, so no big effect is seen here.
How is marital status affecting client behavior?
table(bank_data$marital)
##
## divorced married single unknown
## 4612 24928 11568 80
80 unknowns in this segment.
Cross-tab with our dependent variable:
CrossTable(bank_data$marital, bank_data$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 41188
##
##
## | bank_data$y
## bank_data$marital | no | yes | Row Total |
## ------------------|-----------|-----------|-----------|
## divorced | 4136 | 476 | 4612 |
## | 0.464 | 3.652 | |
## | 0.897 | 0.103 | 0.112 |
## | 0.113 | 0.103 | |
## | 0.100 | 0.012 | |
## ------------------|-----------|-----------|-----------|
## married | 22396 | 2532 | 24928 |
## | 3.450 | 27.174 | |
## | 0.898 | 0.102 | 0.605 |
## | 0.613 | 0.546 | |
## | 0.544 | 0.061 | |
## ------------------|-----------|-----------|-----------|
## single | 9948 | 1620 | 11568 |
## | 9.778 | 77.021 | |
## | 0.860 | 0.140 | 0.281 |
## | 0.272 | 0.349 | |
## | 0.242 | 0.039 | |
## ------------------|-----------|-----------|-----------|
## unknown | 68 | 12 | 80 |
## | 0.126 | 0.990 | |
## | 0.850 | 0.150 | 0.002 |
## | 0.002 | 0.003 | |
## | 0.002 | 0.000 | |
## ------------------|-----------|-----------|-----------|
## Column Total | 36548 | 4640 | 41188 |
## | 0.887 | 0.113 | |
## ------------------|-----------|-----------|-----------|
##
##
bank_data %>%
ggplot() +
geom_mosaic(aes(x = product(y, marital), fill = y)) +
mosaic_theme +
xlab("Marital status") +
ylab(NULL)
No big effect of marital status. Singles (14.0%) are slightly more likely to say "yes" than divorced (10.3%) or married customers (10.2%).
marriage_table <- table(bank_data$marital, bank_data$y)
marriage_tab <- as.data.frame(prop.table(marriage_table, 2))
colnames(marriage_tab) <- c("marital", "y", "perc")
ggplot(data = marriage_tab, aes(x = marital, y = perc, fill = y)) +
geom_bar(stat = 'identity', position = 'dodge', alpha = 2/3) +
xlab("Marital")+
ylab("Percent")
How educated are the clients, and how is that affecting their choice?
table(bank_data$education)
##
## basic.4y basic.6y basic.9y
## 4176 2292 6045
## high.school illiterate professional.course
## 9515 18 5243
## university.degree unknown
## 12168 1731
1731 unknowns here.
Cross-tab with our dependent variable:
CrossTable(bank_data$education, bank_data$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 41188
##
##
## | bank_data$y
## bank_data$education | no | yes | Row Total |
## --------------------|-----------|-----------|-----------|
## basic.4y | 3748 | 428 | 4176 |
## | 0.486 | 3.829 | |
## | 0.898 | 0.102 | 0.101 |
## | 0.103 | 0.092 | |
## | 0.091 | 0.010 | |
## --------------------|-----------|-----------|-----------|
## basic.6y | 2104 | 188 | 2292 |
## | 2.423 | 19.088 | |
## | 0.918 | 0.082 | 0.056 |
## | 0.058 | 0.041 | |
## | 0.051 | 0.005 | |
## --------------------|-----------|-----------|-----------|
## basic.9y | 5572 | 473 | 6045 |
## | 8.065 | 63.527 | |
## | 0.922 | 0.078 | 0.147 |
## | 0.152 | 0.102 | |
## | 0.135 | 0.011 | |
## --------------------|-----------|-----------|-----------|
## high.school | 8484 | 1031 | 9515 |
## | 0.198 | 1.561 | |
## | 0.892 | 0.108 | 0.231 |
## | 0.232 | 0.222 | |
## | 0.206 | 0.025 | |
## --------------------|-----------|-----------|-----------|
## illiterate | 14 | 4 | 18 |
## | 0.244 | 1.918 | |
## | 0.778 | 0.222 | 0.000 |
## | 0.000 | 0.001 | |
## | 0.000 | 0.000 | |
## --------------------|-----------|-----------|-----------|
## professional.course | 4648 | 595 | 5243 |
## | 0.004 | 0.032 | |
## | 0.887 | 0.113 | 0.127 |
## | 0.127 | 0.128 | |
## | 0.113 | 0.014 | |
## --------------------|-----------|-----------|-----------|
## university.degree | 10498 | 1670 | 12168 |
## | 8.292 | 65.317 | |
## | 0.863 | 0.137 | 0.295 |
## | 0.287 | 0.360 | |
## | 0.255 | 0.041 | |
## --------------------|-----------|-----------|-----------|
## unknown | 1480 | 251 | 1731 |
## | 2.041 | 16.079 | |
## | 0.855 | 0.145 | 0.042 |
## | 0.040 | 0.054 | |
## | 0.036 | 0.006 | |
## --------------------|-----------|-----------|-----------|
## Column Total | 36548 | 4640 | 41188 |
## | 0.887 | 0.113 | |
## --------------------|-----------|-----------|-----------|
##
##
bank_data %>%
ggplot() +
geom_mosaic(aes(x = product(y, education), fill = y)) +
mosaic_theme +
xlab("Education Level") +
ylab(NULL)
It appears that a positive correlation exists between the number of years of education and the odds of subscribing to a term deposit.
Among the 1,731 rows containing the "unknown" value, 251 subscribed to a term deposit; that's around 5% of the total group of subscribers.
Since we're facing a very unbalanced dependent variable, we cannot afford to discard those rows, especially as this category has one of the highest relative frequencies of "y = 1" (14.5%).
It might make sense to recode these as "university.degree" holders, the most similar category (13.7%); we do exactly that in the modeling-prep section below.
How many of our clients are in default?
table(bank_data$default)
##
## no unknown yes
## 32588 8597 3
8597 unknowns in default.
Cross-tab with our dependent variable
CrossTable(bank_data$default, bank_data$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 41188
##
##
## | bank_data$y
## bank_data$default | no | yes | Row Total |
## ------------------|-----------|-----------|-----------|
## no | 28391 | 4197 | 32588 |
## | 9.562 | 75.315 | |
## | 0.871 | 0.129 | 0.791 |
## | 0.777 | 0.905 | |
## | 0.689 | 0.102 | |
## ------------------|-----------|-----------|-----------|
## unknown | 8154 | 443 | 8597 |
## | 36.198 | 285.122 | |
## | 0.948 | 0.052 | 0.209 |
## | 0.223 | 0.095 | |
## | 0.198 | 0.011 | |
## ------------------|-----------|-----------|-----------|
## yes | 3 | 0 | 3 |
## | 0.043 | 0.338 | |
## | 1.000 | 0.000 | 0.000 |
## | 0.000 | 0.000 | |
## | 0.000 | 0.000 | |
## ------------------|-----------|-----------|-----------|
## Column Total | 36548 | 4640 | 41188 |
## | 0.887 | 0.113 | |
## ------------------|-----------|-----------|-----------|
##
##
This question, though potentially useful, gives us a feature that is not usable. Only 3 individuals replied "yes" to the question "Do you have credit in default?"; everyone else answered "no" (79.1%) or didn't reply at all (20.9%), which gives us essentially zero information.
We decide not to use this variable in the final models.
Do clients have a housing loan?
table(bank_data$housing)
##
## no unknown yes
## 18622 990 21576
990 unknowns about housing.
Cross-tab with our dependent variable:
CrossTable(bank_data$housing, bank_data$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 41188
##
##
## | bank_data$y
## bank_data$housing | no | yes | Row Total |
## ------------------|-----------|-----------|-----------|
## no | 16596 | 2026 | 18622 |
## | 0.312 | 2.461 | |
## | 0.891 | 0.109 | 0.452 |
## | 0.454 | 0.437 | |
## | 0.403 | 0.049 | |
## ------------------|-----------|-----------|-----------|
## unknown | 883 | 107 | 990 |
## | 0.023 | 0.184 | |
## | 0.892 | 0.108 | 0.024 |
## | 0.024 | 0.023 | |
## | 0.021 | 0.003 | |
## ------------------|-----------|-----------|-----------|
## yes | 19069 | 2507 | 21576 |
## | 0.305 | 2.400 | |
## | 0.884 | 0.116 | 0.524 |
## | 0.522 | 0.540 | |
## | 0.463 | 0.061 | |
## ------------------|-----------|-----------|-----------|
## Column Total | 36548 | 4640 | 41188 |
## | 0.887 | 0.113 | |
## ------------------|-----------|-----------|-----------|
##
##
bank_data %>%
ggplot() +
geom_mosaic(aes(x = product(y, housing), fill = y)) +
mosaic_theme +
xlab("Housing") +
ylab(NULL)
Not much observable variation between those who have housing loans (11.6%) and those who do not (10.9%); "unknown" sits at 10.8%.
Checking this mathematically with a chi-squared test:
chisq.test(bank_data$housing, bank_data$y)
##
## Pearson's Chi-squared test
##
## data: bank_data$housing and bank_data$y
## X-squared = 5.6845, df = 2, p-value = 0.05829
The p-value (0.058) suggests it may be OK to ignore this variable in the final analysis.
Do clients have a personal loan already? How does that affect their take-up?
table(bank_data$loan)
##
## no unknown yes
## 33950 990 6248
990 unknowns.
Cross-tab with our dependent variable:
CrossTable(bank_data$loan, bank_data$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 41188
##
##
## | bank_data$y
## bank_data$loan | no | yes | Row Total |
## ---------------|-----------|-----------|-----------|
## no | 30100 | 3850 | 33950 |
## | 0.021 | 0.169 | |
## | 0.887 | 0.113 | 0.824 |
## | 0.824 | 0.830 | |
## | 0.731 | 0.093 | |
## ---------------|-----------|-----------|-----------|
## unknown | 883 | 107 | 990 |
## | 0.023 | 0.184 | |
## | 0.892 | 0.108 | 0.024 |
## | 0.024 | 0.023 | |
## | 0.021 | 0.003 | |
## ---------------|-----------|-----------|-----------|
## yes | 5565 | 683 | 6248 |
## | 0.079 | 0.618 | |
## | 0.891 | 0.109 | 0.152 |
## | 0.152 | 0.147 | |
## | 0.135 | 0.017 | |
## ---------------|-----------|-----------|-----------|
## Column Total | 36548 | 4640 | 41188 |
## | 0.887 | 0.113 | |
## ---------------|-----------|-----------|-----------|
##
##
bank_data %>%
ggplot() +
geom_mosaic(aes(x = product(y, loan), fill = y)) +
mosaic_theme +
xlab("Loan") +
ylab(NULL)
loan_table <- table(bank_data$loan, bank_data$y)
loan_tab <- as.data.frame(prop.table(loan_table, 2))
colnames(loan_tab) <- c("loan", "y", "perc")
ggplot(data = loan_tab, aes(x = loan, y = perc, fill = y)) +
geom_bar(stat = 'identity', position = 'dodge', alpha = 2/3) +
xlab("loan")+
ylab("Percent")
Not much variation between 11.3% (for no) and 10.9% (for yes).
Checking this mathematically with a Chi-Squared Test:
chisq.test(bank_data$loan, bank_data$y)
##
## Pearson's Chi-squared test
##
## data: bank_data$loan and bank_data$y
## X-squared = 1.094, df = 2, p-value = 0.5787
The p-value (0.58) suggests that this association is not significant at all.
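The same test can be looped over every categorical candidate to rank associations with y at a glance. A hedged sketch (the variable list is our selection from the exploration above):
# p-value of each categorical predictor's association with the target
cat_vars <- c("job", "marital", "education", "contact",
              "month", "day_of_week", "poutcome")
sapply(cat_vars, function(v) chisq.test(bank_data[[v]], bank_data$y)$p.value)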
How were clients contacted and does it make a difference?
table(bank_data$contact)
##
## cellular telephone
## 26144 15044
Cross-tab with our dependent variable:
CrossTable(bank_data$contact, bank_data$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 41188
##
##
## | bank_data$y
## bank_data$contact | no | yes | Row Total |
## ------------------|-----------|-----------|-----------|
## cellular | 22291 | 3853 | 26144 |
## | 35.521 | 279.790 | |
## | 0.853 | 0.147 | 0.635 |
## | 0.610 | 0.830 | |
## | 0.541 | 0.094 | |
## ------------------|-----------|-----------|-----------|
## telephone | 14257 | 787 | 15044 |
## | 61.730 | 486.229 | |
## | 0.948 | 0.052 | 0.365 |
## | 0.390 | 0.170 | |
## | 0.346 | 0.019 | |
## ------------------|-----------|-----------|-----------|
## Column Total | 36548 | 4640 | 41188 |
## | 0.887 | 0.113 | |
## ------------------|-----------|-----------|-----------|
##
##
bank_data %>%
ggplot() +
geom_mosaic(aes(x = product(y, contact), fill = y)) +
mosaic_theme +
xlab("Contact") +
ylab(NULL)
This feature is really interesting; 14.7% of cellular responders subscribed to a term deposit while only 5.2% of telephone responders did.
Does month make a difference?
Cross-tab with our dependent variable:
CrossTable(bank_data$month, bank_data$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 41188
##
##
## | bank_data$y
## bank_data$month | no | yes | Row Total |
## ----------------|-----------|-----------|-----------|
## apr | 2093 | 539 | 2632 |
## | 25.178 | 198.321 | |
## | 0.795 | 0.205 | 0.064 |
## | 0.057 | 0.116 | |
## | 0.051 | 0.013 | |
## ----------------|-----------|-----------|-----------|
## aug | 5523 | 655 | 6178 |
## | 0.306 | 2.413 | |
## | 0.894 | 0.106 | 0.150 |
## | 0.151 | 0.141 | |
## | 0.134 | 0.016 | |
## ----------------|-----------|-----------|-----------|
## dec | 93 | 89 | 182 |
## | 29.052 | 228.836 | |
## | 0.511 | 0.489 | 0.004 |
## | 0.003 | 0.019 | |
## | 0.002 | 0.002 | |
## ----------------|-----------|-----------|-----------|
## jul | 6525 | 649 | 7174 |
## | 3.980 | 31.353 | |
## | 0.910 | 0.090 | 0.174 |
## | 0.179 | 0.140 | |
## | 0.158 | 0.016 | |
## ----------------|-----------|-----------|-----------|
## jun | 4759 | 559 | 5318 |
## | 0.341 | 2.683 | |
## | 0.895 | 0.105 | 0.129 |
## | 0.130 | 0.120 | |
## | 0.116 | 0.014 | |
## ----------------|-----------|-----------|-----------|
## mar | 270 | 276 | 546 |
## | 94.958 | 747.959 | |
## | 0.495 | 0.505 | 0.013 |
## | 0.007 | 0.059 | |
## | 0.007 | 0.007 | |
## ----------------|-----------|-----------|-----------|
## may | 12883 | 886 | 13769 |
## | 36.210 | 285.214 | |
## | 0.936 | 0.064 | 0.334 |
## | 0.352 | 0.191 | |
## | 0.313 | 0.022 | |
## ----------------|-----------|-----------|-----------|
## nov | 3685 | 416 | 4101 |
## | 0.581 | 4.579 | |
## | 0.899 | 0.101 | 0.100 |
## | 0.101 | 0.090 | |
## | 0.089 | 0.010 | |
## ----------------|-----------|-----------|-----------|
## oct | 403 | 315 | 718 |
## | 86.028 | 677.617 | |
## | 0.561 | 0.439 | 0.017 |
## | 0.011 | 0.068 | |
## | 0.010 | 0.008 | |
## ----------------|-----------|-----------|-----------|
## sep | 314 | 256 | 570 |
## | 72.723 | 572.818 | |
## | 0.551 | 0.449 | 0.014 |
## | 0.009 | 0.055 | |
## | 0.008 | 0.006 | |
## ----------------|-----------|-----------|-----------|
## Column Total | 36548 | 4640 | 41188 |
## | 0.887 | 0.113 | |
## ----------------|-----------|-----------|-----------|
##
##
bank_data %>%
ggplot() +
aes(x = month, y = ..count../nrow(bank_data), fill = y) +
geom_bar() +
ylab("relative frequency")
month_table <- table(bank_data$month, bank_data$y)
month_tab <- as.data.frame(prop.table(month_table, 2))
colnames(month_tab) <- c("month", "y", "perc")
ggplot(data = month_tab, aes(x = month, y = perc, fill = y)) +
geom_bar(stat = 'identity', position = 'dodge', alpha = 2/3) +
xlab("Month")+
ylab("Percent")
Most of the calls were made in May, but conversion is higher in March, September, October, and December.
We also notice that no contact was made during January and February.
The highest spike occurs in May, with 33.4% of total contacts, yet it has the worst ratio of subscribers to persons contacted (6.4%).
Every month with a very low frequency of contact (March, September, October and December) shows very good results (between 44% and 51% subscribers).
December aside, there are enough observations to conclude this isn't pure luck, so this feature will probably be very important in the models.
Does the day of the week matter?
Cross-tab with our dependent variable:
CrossTable(bank_data$day_of_week, bank_data$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 41188
##
##
## | bank_data$y
## bank_data$day_of_week | no | yes | Row Total |
## ----------------------|-----------|-----------|-----------|
## fri | 6981 | 846 | 7827 |
## | 0.184 | 1.449 | |
## | 0.892 | 0.108 | 0.190 |
## | 0.191 | 0.182 | |
## | 0.169 | 0.021 | |
## ----------------------|-----------|-----------|-----------|
## mon | 7667 | 847 | 8514 |
## | 1.664 | 13.111 | |
## | 0.901 | 0.099 | 0.207 |
## | 0.210 | 0.183 | |
## | 0.186 | 0.021 | |
## ----------------------|-----------|-----------|-----------|
## thu | 7578 | 1045 | 8623 |
## | 0.708 | 5.574 | |
## | 0.879 | 0.121 | 0.209 |
## | 0.207 | 0.225 | |
## | 0.184 | 0.025 | |
## ----------------------|-----------|-----------|-----------|
## tue | 7137 | 953 | 8090 |
## | 0.241 | 1.901 | |
## | 0.882 | 0.118 | 0.196 |
## | 0.195 | 0.205 | |
## | 0.173 | 0.023 | |
## ----------------------|-----------|-----------|-----------|
## wed | 7185 | 949 | 8134 |
## | 0.148 | 1.165 | |
## | 0.883 | 0.117 | 0.197 |
## | 0.197 | 0.205 | |
## | 0.174 | 0.023 | |
## ----------------------|-----------|-----------|-----------|
## Column Total | 36548 | 4640 | 41188 |
## | 0.887 | 0.113 | |
## ----------------------|-----------|-----------|-----------|
##
##
bank_data %>%
ggplot() +
aes(x = day_of_week, y = ..count../nrow(bank_data), fill = y) +
geom_bar() +
ylab("relative frequency")
Calls aren’t made during weekend days. If we assume that calls are evenly distributed between the different weekdays, Thursdays tend to show better results (12.1% of subscribers among calls made this day) unlike Mondays with only 9.9% of successful calls.
However, those differences are small, which makes this feature not that important.
It would’ve been interesting to see the attitude of responders from weekend calls.
Since the goal is to seek best candidates who will have the best odds to subscribe to a term deposit, the call duration can’t be known before. So we recommend this feature be removed.
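If we were to act on that recommendation immediately, the removal is a one-liner (shown commented out here, since the summary statistics just below still reference duration):
# hypothetical cleanup: drop the leaky feature before modeling
# bank_data <- bank_data %>% select(-duration)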
Still, some interesting facts:
mean(bank_data$duration[bank_data$y == "no"])
## [1] 220.8448
mean(bank_data$duration[bank_data$y == "yes"])
## [1] 553.1912
max(bank_data$duration[bank_data$y == "no"])
## [1] 4918
min(bank_data$duration[bank_data$y == "yes"])
## [1] 37
Plotting the frequency of contact with clients:
bank_data %>%
ggplot() +
aes(x = campaign) +
geom_bar() +
facet_grid(y ~ .,
scales = "free_y") +
scale_x_continuous(breaks = seq(0, 50, 5))
Calling the same person more than ten times during a single marketing campaign seems excessive. We'll treat those as outliers, even if marketing harassment is a real thing. In fact, the chart above shows that harassment doesn't work at all.
Trimming our data set and replotting:
bank_data <- bank_data %>%
filter(campaign <= 10)
bank_data %>%
ggplot() +
aes(x = campaign) +
geom_bar() +
facet_grid(y ~ .,
scales = "free_y") +
scale_x_continuous(breaks = seq(0, 10, 1))
Truncated Cross-tab with our dependent variable:
CrossTable(bank_data$campaign, bank_data$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 40319
##
##
## | bank_data$y
## bank_data$campaign | no | yes | Row Total |
## -------------------|-----------|-----------|-----------|
## 1 | 15342 | 2300 | 17642 |
## | 5.073 | 39.268 | |
## | 0.870 | 0.130 | 0.438 |
## | 0.430 | 0.499 | |
## | 0.381 | 0.057 | |
## -------------------|-----------|-----------|-----------|
## 2 | 9359 | 1211 | 10570 |
## | 0.000 | 0.002 | |
## | 0.885 | 0.115 | 0.262 |
## | 0.262 | 0.263 | |
## | 0.232 | 0.030 | |
## -------------------|-----------|-----------|-----------|
## 3 | 4767 | 574 | 5341 |
## | 0.291 | 2.250 | |
## | 0.893 | 0.107 | 0.132 |
## | 0.134 | 0.124 | |
## | 0.118 | 0.014 | |
## -------------------|-----------|-----------|-----------|
## 4 | 2402 | 249 | 2651 |
## | 1.256 | 9.724 | |
## | 0.906 | 0.094 | 0.066 |
## | 0.067 | 0.054 | |
## | 0.060 | 0.006 | |
## -------------------|-----------|-----------|-----------|
## 5 | 1479 | 120 | 1599 |
## | 2.798 | 21.658 | |
## | 0.925 | 0.075 | 0.040 |
## | 0.041 | 0.026 | |
## | 0.037 | 0.003 | |
## -------------------|-----------|-----------|-----------|
## 6 | 904 | 75 | 979 |
## | 1.580 | 12.229 | |
## | 0.923 | 0.077 | 0.024 |
## | 0.025 | 0.016 | |
## | 0.022 | 0.002 | |
## -------------------|-----------|-----------|-----------|
## 7 | 591 | 38 | 629 |
## | 2.071 | 16.031 | |
## | 0.940 | 0.060 | 0.016 |
## | 0.017 | 0.008 | |
## | 0.015 | 0.001 | |
## -------------------|-----------|-----------|-----------|
## 8 | 383 | 17 | 400 |
## | 2.336 | 18.080 | |
## | 0.958 | 0.042 | 0.010 |
## | 0.011 | 0.004 | |
## | 0.009 | 0.000 | |
## -------------------|-----------|-----------|-----------|
## 9 | 266 | 17 | 283 |
## | 0.944 | 7.304 | |
## | 0.940 | 0.060 | 0.007 |
## | 0.007 | 0.004 | |
## | 0.007 | 0.000 | |
## -------------------|-----------|-----------|-----------|
## 10 | 213 | 12 | 225 |
## | 0.948 | 7.337 | |
## | 0.947 | 0.053 | 0.006 |
## | 0.006 | 0.003 | |
## | 0.005 | 0.000 | |
## -------------------|-----------|-----------|-----------|
## Column Total | 35706 | 4613 | 40319 |
## | 0.886 | 0.114 | |
## -------------------|-----------|-----------|-----------|
##
##
bank_data %>%
ggplot() +
geom_mosaic(aes(x = product(y, campaign), fill = y)) +
mosaic_theme +
xlab("Campaign") +
ylab(NULL)
There is an observable pattern across the values of campaign: the subscription rate declines steadily as the number of contacts grows, from 13.0% after one contact to 5.3% after ten.
How long ago were clients last contacted, and does that make a difference?
table(bank_data$pdays)
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 15 26 61 439 118 46 412 60 18 63 52 28
## 12 13 14 15 16 17 18 19 20 21 22 25
## 58 36 19 24 11 8 7 3 1 2 3 1
## 26 27 999
## 1 1 38806
mean(bank_data$pdays[bank_data$y == "no"])
## [1] 983.8182
mean(bank_data$pdays[bank_data$y == "yes"])
## [1] 790.8242
max(bank_data$pdays[bank_data$y == "no"])
## [1] 999
min(bank_data$pdays[bank_data$y == "yes"])
## [1] 0
min(bank_data$pdays[bank_data$y == "no"])
## [1] 0
max(bank_data$pdays[bank_data$y == "yes"])
## [1] 999
The general idea of prior contact with clients seems more important than the exact number of days passed.
The value 999 means the client wasn't previously contacted, so let's create a dummy out of it.
Clients who weren't contacted in a previous campaign will be labeled "0" in the pdays_dummy variable.
bank_data <- bank_data %>%
mutate(pdays_dummy = if_else(pdays == 999, "0", "1"))
Cross-tab of dummy with our dependent variable:
CrossTable(bank_data$pdays_dummy, bank_data$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 40319
##
##
## | bank_data$y
## bank_data$pdays_dummy | no | yes | Row Total |
## ----------------------|-----------|-----------|-----------|
## 0 | 35160 | 3646 | 38806 |
## | 18.340 | 141.956 | |
## | 0.906 | 0.094 | 0.962 |
## | 0.985 | 0.790 | |
## | 0.872 | 0.090 | |
## ----------------------|-----------|-----------|-----------|
## 1 | 546 | 967 | 1513 |
## | 470.386 | 3640.929 | |
## | 0.361 | 0.639 | 0.038 |
## | 0.015 | 0.210 | |
## | 0.014 | 0.024 | |
## ----------------------|-----------|-----------|-----------|
## Column Total | 35706 | 4613 | 40319 |
## | 0.886 | 0.114 | |
## ----------------------|-----------|-----------|-----------|
##
##
bank_data %>%
ggplot() +
geom_mosaic(aes(x = product(y, pdays_dummy), fill = y)) +
mosaic_theme +
xlab("pdays") +
ylab(NULL)
Recontacting a client after a previous campaign seems to highly increase the odds of subscription.
What is the number of contacts performed before this campaign, for each client?
table(bank_data$previous)
##
## 0 1 2 3 4 5 6 7
## 34703 4554 752 216 70 18 5 1
Cross-tab with our dependent variable:
CrossTable(bank_data$previous, bank_data$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 40319
##
##
## | bank_data$y
## bank_data$previous | no | yes | Row Total |
## -------------------|-----------|-----------|-----------|
## 0 | 31589 | 3114 | 34703 |
## | 23.868 | 184.745 | |
## | 0.910 | 0.090 | 0.861 |
## | 0.885 | 0.675 | |
## | 0.783 | 0.077 | |
## -------------------|-----------|-----------|-----------|
## 1 | 3587 | 967 | 4554 |
## | 49.315 | 381.711 | |
## | 0.788 | 0.212 | 0.113 |
## | 0.100 | 0.210 | |
## | 0.089 | 0.024 | |
## -------------------|-----------|-----------|-----------|
## 2 | 402 | 350 | 752 |
## | 104.624 | 809.824 | |
## | 0.535 | 0.465 | 0.019 |
## | 0.011 | 0.076 | |
## | 0.010 | 0.009 | |
## -------------------|-----------|-----------|-----------|
## 3 | 88 | 128 | 216 |
## | 55.771 | 431.681 | |
## | 0.407 | 0.593 | 0.005 |
## | 0.002 | 0.028 | |
## | 0.002 | 0.003 | |
## -------------------|-----------|-----------|-----------|
## 4 | 32 | 38 | 70 |
## | 14.510 | 112.309 | |
## | 0.457 | 0.543 | 0.002 |
## | 0.001 | 0.008 | |
## | 0.001 | 0.001 | |
## -------------------|-----------|-----------|-----------|
## 5 | 5 | 13 | 18 |
## | 7.509 | 58.121 | |
## | 0.278 | 0.722 | 0.000 |
## | 0.000 | 0.003 | |
## | 0.000 | 0.000 | |
## -------------------|-----------|-----------|-----------|
## 6 | 2 | 3 | 5 |
## | 1.331 | 10.305 | |
## | 0.400 | 0.600 | 0.000 |
## | 0.000 | 0.001 | |
## | 0.000 | 0.000 | |
## -------------------|-----------|-----------|-----------|
## 7 | 1 | 0 | 1 |
## | 0.015 | 0.114 | |
## | 1.000 | 0.000 | 0.000 |
## | 0.000 | 0.000 | |
## | 0.000 | 0.000 | |
## -------------------|-----------|-----------|-----------|
## Column Total | 35706 | 4613 | 40319 |
## | 0.886 | 0.114 | |
## -------------------|-----------|-----------|-----------|
##
##
bank_data %>%
ggplot() +
geom_mosaic(aes(x = product(previous), fill = y)) +
mosaic_theme +
xlab("Previous") +
ylab(NULL)
Even one prior contact improves the probability of a "yes" (from 9.0% to 21.2%).
Note: analyzing this variable can be tricky from a prediction standpoint; we cannot have a 2nd contact without a 1st, or a 3rd without a 2nd.
But creating a dummy would be the same as pdays_dummy, so we instead choose to create bins for this variable.
bank_data <- bank_data %>%
mutate(previous_binned = if_else(previous >= 2, "2+", if_else(previous == 1, "1", "0")))
Cross-tab on binned variable with our dependent variable:
CrossTable(bank_data$previous_binned, bank_data$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 40319
##
##
## | bank_data$y
## bank_data$previous_binned | no | yes | Row Total |
## --------------------------|-----------|-----------|-----------|
## 0 | 31589 | 3114 | 34703 |
## | 23.868 | 184.745 | |
## | 0.910 | 0.090 | 0.861 |
## | 0.885 | 0.675 | |
## | 0.783 | 0.077 | |
## --------------------------|-----------|-----------|-----------|
## 1 | 3587 | 967 | 4554 |
## | 49.315 | 381.711 | |
## | 0.788 | 0.212 | 0.113 |
## | 0.100 | 0.210 | |
## | 0.089 | 0.024 | |
## --------------------------|-----------|-----------|-----------|
## 2+ | 530 | 532 | 1062 |
## | 179.167 | 1386.804 | |
## | 0.499 | 0.501 | 0.026 |
## | 0.015 | 0.115 | |
## | 0.013 | 0.013 | |
## --------------------------|-----------|-----------|-----------|
## Column Total | 35706 | 4613 | 40319 |
## | 0.886 | 0.114 | |
## --------------------------|-----------|-----------|-----------|
##
##
It seems 2+ prior contacts increase the probability to about 50% (50.1%).
As the analysis of the pdays_dummy variable has shown, recontacting someone will increase the odds.
Can we say that long-term persistence works, unlike short-term harassment?
table(bank_data$poutcome)
##
## failure nonexistent success
## 4243 34703 1373
Cross-tab with our dependent variable:
CrossTable(bank_data$poutcome, bank_data$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 40319
##
##
## | bank_data$y
## bank_data$poutcome | no | yes | Row Total |
## -------------------|-----------|-----------|-----------|
## failure | 3638 | 605 | 4243 |
## | 3.803 | 29.440 | |
## | 0.857 | 0.143 | 0.105 |
## | 0.102 | 0.131 | |
## | 0.090 | 0.015 | |
## -------------------|-----------|-----------|-----------|
## nonexistent | 31589 | 3114 | 34703 |
## | 23.868 | 184.745 | |
## | 0.910 | 0.090 | 0.861 |
## | 0.885 | 0.675 | |
## | 0.783 | 0.077 | |
## -------------------|-----------|-----------|-----------|
## success | 479 | 894 | 1373 |
## | 446.610 | 3456.897 | |
## | 0.349 | 0.651 | 0.034 |
## | 0.013 | 0.194 | |
## | 0.012 | 0.022 | |
## -------------------|-----------|-----------|-----------|
## Column Total | 35706 | 4613 | 40319 |
## | 0.886 | 0.114 | |
## -------------------|-----------|-----------|-----------|
##
##
bank_data %>%
ggplot() +
geom_mosaic(aes(x = product(poutcome), fill = y)) +
mosaic_theme +
xlab("Previous Outcome") +
ylab(NULL)
65.1% of people who subscribed to a term deposit after a previous contact accepted to do it again.
Even those who declined before are still more likely to accept (14.3%) than people who haven't been contacted before (9.0%).
So even if the previous campaign was a failure, recontacting people seems important.
These five continuous variables are social and economic indicators, so they're expected to be highly correlated. Let's compute the correlation matrix.
bank_data %>%
select(emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m, nr.employed) %>%
cor() %>%
corrplot(method = "number",
type = "upper",
tl.cex = 0.8,
tl.srt = 45,
tl.col = "black")
Three pairs show a correlation coefficient higher than 0.90. Let's figure out which variable(s) should be removed to lighten this correlation matrix.
emp.var.rate is the most redundant of the five, correlating above 0.90 with both euribor3m and nr.employed, and it isn't very meaningful on its own. We'll remove it to soften the correlations among our indicators.
euribor3m and nr.employed are also highly correlated (0.95), but we're keeping both features. This is probably a spurious correlation: the number of employees isn't directly reactive to the Euribor rate.
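As a quick check (a sketch only; nothing is overwritten), we can recompute the matrix without emp.var.rate; the euribor3m-nr.employed pair (0.95) should then be the only one left above 0.90:
# correlation matrix after excluding the most redundant indicator
bank_data %>%
  select(cons.price.idx, cons.conf.idx, euribor3m, nr.employed) %>%
  cor()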
So far, we’ve:
The data exploration ends here, let’s process to machine learning models.
Duplicating the data for safekeeping.
dup_bank_data <- bank_data
Removing and transforming "unknowns":
dup_bank_data <- dup_bank_data %>%
filter(job != "unknown")
dup_bank_data <- dup_bank_data %>%
filter(marital != "unknown")
dup_bank_data = dup_bank_data %>%
mutate(education = recode(education, "unknown" = "university.degree"))
Converting our variables to factors with ordered levels (ordinal variables)
dup_bank_data$contact <- factor(dup_bank_data$contact, order = TRUE, levels =c('telephone', 'cellular'))
dup_bank_data$education <- factor(dup_bank_data$education, order = TRUE, levels =c('illiterate','basic.4y', 'basic.6y','basic.9y', 'high.school','professional.course','university.degree'))
dup_bank_data$age_discrete <- factor(dup_bank_data$age_discrete, order = TRUE, levels =c('low', 'mid','high'))
dup_bank_data$job <- factor(dup_bank_data$job, order = TRUE, levels =c('blue-collar', 'services','entrepreneur', 'housemaid', 'self-employed','technician', 'management','admin.','unemployed', 'retired','student'))
dup_bank_data$marital <- factor(dup_bank_data$marital, order = TRUE, levels =c('married', 'divorced', 'single'))
dup_bank_data$month <- factor(dup_bank_data$month, order = TRUE, levels =c('mar', 'apr','may', 'jun','jul', 'aug', 'sep','oct', 'nov','dec'))
dup_bank_data$previous_binned <- factor(dup_bank_data$previous_binned, order = TRUE, levels =c('0', '1','2+'))
dup_bank_data$poutcome <- factor(dup_bank_data$poutcome, order = TRUE, levels =c('nonexistent', 'failure','success'))
Splitting the data into training and test datasets (80-20 split):
set.seed(1984)
training <- createDataPartition(dup_bank_data$y_binary, p = 0.8, list=FALSE)
train_data <- dup_bank_data[training,]
test_data <- dup_bank_data[-training,]
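A quick hedged check that the 80-20 split preserved the class imbalance (both partitions should show roughly 11% "yes"):
# target distribution in each partition
prop.table(table(train_data$y))
prop.table(table(test_data$y))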
In linear regression, we check adjusted R², the F statistic, MAE, and RMSE to evaluate model fit and accuracy. Logistic regression, however, employs a different set of metrics, since here we deal with probabilities and categorical values. One evaluation metric is the Akaike Information Criterion (AIC):
We can look at AIC as the counterpart of adjusted R² in multiple regression; it's an important indicator of model fit, following the rule "smaller is better". It is computed as AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L the maximized likelihood, so it penalizes an increasing number of coefficients: adding variables that don't improve the fit will increase AIC, which helps to avoid overfitting.
Looking at the AIC of one model in isolation isn't very informative; it is more useful for comparing models (model selection). So we build two or three logistic regression models and compare their AICs; the model with the lowest AIC is relatively better.
We will use this to evaluate multiple models.
In addition, we can also perform an ANOVA Chi-square test to check the overall effect of variables on the dependent variable. This will also help us choose the better model.
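For reference, once the candidate models below have been fit, base R can tabulate their AICs in one line (a sketch; the model names are the ones defined below):
# side-by-side AIC of the fitted candidates; lower is better
AIC(model, model_2, model_3)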
Running a Logit model:
model <- glm(y_binary ~ age_discrete + job + marital + education + contact + month + pdays_dummy + previous_binned + poutcome + cons.price.idx + cons.conf.idx + euribor3m + nr.employed,family=binomial(link='logit'),data=train_data)
summary(model)
##
## Call:
## glm(formula = y_binary ~ age_discrete + job + marital + education +
## contact + month + pdays_dummy + previous_binned + poutcome +
## cons.price.idx + cons.conf.idx + euribor3m + nr.employed,
## family = binomial(link = "logit"), data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2366 -0.3888 -0.3351 -0.2600 2.7452
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 96.324391 24.509174 3.930 8.49e-05 ***
## age_discrete.L 0.096720 0.093277 1.037 0.299773
## age_discrete.Q 0.217019 0.054479 3.984 6.79e-05 ***
## job.L 0.247700 0.092281 2.684 0.007270 **
## job.Q 0.045078 0.086339 0.522 0.601594
## job.C 0.093332 0.093516 0.998 0.318262
## job^4 -0.010552 0.092891 -0.114 0.909560
## job^5 -0.058395 0.099413 -0.587 0.556935
## job^6 -0.022088 0.089663 -0.246 0.805418
## job^7 -0.135317 0.104593 -1.294 0.195751
## job^8 0.019738 0.100293 0.197 0.843979
## job^9 -0.089681 0.099551 -0.901 0.367667
## job^10 -0.045361 0.085268 -0.532 0.594742
## marital.L 0.026581 0.035545 0.748 0.454572
## marital.Q -0.003398 0.054023 -0.063 0.949842
## education.L -0.635762 0.382902 -1.660 0.096839 .
## education.Q 0.680389 0.367675 1.851 0.064239 .
## education.C -0.416465 0.278719 -1.494 0.135121
## education^4 0.303541 0.172774 1.757 0.078941 .
## education^5 -0.216346 0.100229 -2.159 0.030888 *
## education^6 0.205799 0.069858 2.946 0.003220 **
## contact.L 0.389120 0.047967 8.112 4.97e-16 ***
## month.L -0.670112 0.147499 -4.543 5.54e-06 ***
## month.Q 0.499398 0.128248 3.894 9.86e-05 ***
## month.C -0.017095 0.135112 -0.127 0.899314
## month^4 0.648315 0.092718 6.992 2.70e-12 ***
## month^5 -0.339622 0.086926 -3.907 9.34e-05 ***
## month^6 -0.199858 0.085081 -2.349 0.018823 *
## month^7 0.586365 0.071002 8.258 < 2e-16 ***
## month^8 -0.223344 0.078503 -2.845 0.004440 **
## month^9 0.229071 0.070556 3.247 0.001168 **
## pdays_dummy1 1.141009 0.221608 5.149 2.62e-07 ***
## previous_binned.L -0.959180 0.155095 -6.184 6.23e-10 ***
## previous_binned.Q 0.524573 0.118350 4.432 9.32e-06 ***
## poutcome.L 1.121389 0.308288 3.637 0.000275 ***
## poutcome.Q NA NA NA NA
## cons.price.idx -0.248884 0.135328 -1.839 0.065898 .
## cons.conf.idx 0.012184 0.007800 1.562 0.118257
## euribor3m 0.212429 0.131534 1.615 0.106309
## nr.employed -0.014611 0.002517 -5.806 6.40e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 22824 on 31943 degrees of freedom
## Residual deviance: 17999 on 31905 degrees of freedom
## AIC: 18077
##
## Number of Fisher Scoring iterations: 6
We can see that older age, higher education, contact over cellphone, and previous contact all seem to be highly predictive and positively correlated with a 'yes'.
We also see that jobs do not seem to have a significant impact. Similarly, our socio-economic indicators do not turn out to be significant, except for nr.employed.
We run a few reduced models to see if we can improve our fit.
Running simplified models:
model_2 <- glm(y_binary ~ age_discrete + marital + education + contact + month + pdays_dummy + previous_binned + poutcome,family=binomial(link='logit'),data=train_data)
summary(model_2)
##
## Call:
## glm(formula = y_binary ~ age_discrete + marital + education +
## contact + month + pdays_dummy + previous_binned + poutcome,
## family = binomial(link = "logit"), data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6613 -0.4409 -0.3934 -0.2654 2.7330
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.08084 0.13756 -7.857 3.92e-15 ***
## age_discrete.L 0.65178 0.07406 8.800 < 2e-16 ***
## age_discrete.Q 0.65082 0.04472 14.555 < 2e-16 ***
## marital.L 0.10191 0.03334 3.057 0.002239 **
## marital.Q 0.03234 0.05238 0.617 0.537005
## education.L -0.55378 0.36645 -1.511 0.130739
## education.Q 0.84186 0.35306 2.384 0.017106 *
## education.C -0.49684 0.26762 -1.857 0.063380 .
## education^4 0.30277 0.16548 1.830 0.067300 .
## education^5 -0.18436 0.09566 -1.927 0.053948 .
## education^6 0.23451 0.06686 3.507 0.000453 ***
## contact.L 0.73630 0.04133 17.817 < 2e-16 ***
## month.L 0.01245 0.11977 0.104 0.917186
## month.Q 1.28020 0.12352 10.364 < 2e-16 ***
## month.C -0.74902 0.11727 -6.387 1.69e-10 ***
## month^4 0.46697 0.09156 5.100 3.39e-07 ***
## month^5 0.64652 0.08287 7.802 6.11e-15 ***
## month^6 1.17729 0.07407 15.894 < 2e-16 ***
## month^7 0.80761 0.06988 11.557 < 2e-16 ***
## month^8 -0.65996 0.07954 -8.297 < 2e-16 ***
## month^9 0.03842 0.06437 0.597 0.550610
## pdays_dummy1 1.49145 0.22772 6.549 5.77e-11 ***
## previous_binned.L -0.45252 0.15761 -2.871 0.004090 **
## previous_binned.Q 0.59443 0.12063 4.928 8.31e-07 ***
## poutcome.L 1.25060 0.31728 3.942 8.09e-05 ***
## poutcome.Q NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 22824 on 31943 degrees of freedom
## Residual deviance: 19037 on 31919 degrees of freedom
## AIC: 19087
##
## Number of Fisher Scoring iterations: 5
model_3 <- glm(y_binary ~ age_discrete + marital + education + contact + month + pdays_dummy + previous_binned + poutcome + euribor3m + nr.employed,family=binomial(link='logit'),data=train_data)
summary(model_3)
##
## Call:
## glm(formula = y_binary ~ age_discrete + marital + education +
## contact + month + pdays_dummy + previous_binned + poutcome +
## euribor3m + nr.employed, family = binomial(link = "logit"),
## data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1875 -0.3926 -0.3374 -0.2406 2.7628
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 53.0802742 4.4651459 11.888 < 2e-16 ***
## age_discrete.L 0.2163968 0.0733073 2.952 0.003158 **
## age_discrete.Q 0.3000977 0.0449700 6.673 2.50e-11 ***
## marital.L 0.0336598 0.0346374 0.972 0.331163
## marital.Q -0.0055407 0.0537092 -0.103 0.917835
## education.L -0.6021767 0.3754519 -1.604 0.108742
## education.Q 0.7513522 0.3617816 2.077 0.037819 *
## education.C -0.4519499 0.2741734 -1.648 0.099269 .
## education^4 0.2894968 0.1694195 1.709 0.087496 .
## education^5 -0.2080738 0.0976672 -2.130 0.033135 *
## education^6 0.2125854 0.0679289 3.130 0.001751 **
## contact.L 0.3458413 0.0450145 7.683 1.56e-14 ***
## month.L -0.2761345 0.1208047 -2.286 0.022266 *
## month.Q 0.5199370 0.1210153 4.296 1.74e-05 ***
## month.C -0.2200220 0.1174744 -1.873 0.061077 .
## month^4 0.7446388 0.0897319 8.298 < 2e-16 ***
## month^5 -0.2292005 0.0840959 -2.725 0.006421 **
## month^6 -0.1075966 0.0800427 -1.344 0.178871
## month^7 0.5416931 0.0689505 7.856 3.96e-15 ***
## month^8 -0.1699538 0.0778704 -2.183 0.029071 *
## month^9 0.2596190 0.0664003 3.910 9.23e-05 ***
## pdays_dummy1 1.1929479 0.2219852 5.374 7.70e-08 ***
## previous_binned.L -0.9735425 0.1550782 -6.278 3.44e-10 ***
## previous_binned.Q 0.4991260 0.1184196 4.215 2.50e-05 ***
## poutcome.L 1.0818211 0.3088849 3.502 0.000461 ***
## poutcome.Q NA NA NA NA
## euribor3m -0.0142966 0.0422466 -0.338 0.735056
## nr.employed -0.0106699 0.0008957 -11.913 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 22824 on 31943 degrees of freedom
## Residual deviance: 18043 on 31917 degrees of freedom
## AIC: 18097
##
## Number of Fisher Scoring iterations: 6
As you can see, we achieved a lower AIC value with our first, full model.
We can also compare the two models using an ANOVA (likelihood-ratio) test. The null hypothesis is that the reduced second model fits the data as well as the full first model; p < 0.05 rejects this null in favour of the full model, while p > 0.05 means we fail to reject it.
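Concretely, the test statistic is the drop in residual deviance between the nested models, referred to a chi-squared distribution with degrees of freedom equal to the number of parameters dropped. Using the residual deviances reported in the two summaries above:
$$\chi^2 = D_{\text{reduced}} - D_{\text{full}} \approx 19037 - 17999 = 1038, \qquad \mathrm{df} = 31919 - 31905 = 14$$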
anova(model,model_2,test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: y_binary ~ age_discrete + job + marital + education + contact +
## month + pdays_dummy + previous_binned + poutcome + cons.price.idx +
## cons.conf.idx + euribor3m + nr.employed
## Model 2: y_binary ~ age_discrete + marital + education + contact + month +
## pdays_dummy + previous_binned + poutcome
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 31905 17999
## 2 31919 19037 -14 -1038.2 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
With p < 0.05, the ANOVA test corroborates that the first model fits the data significantly better than our second model.
Precision attempts to answer the following question: what proportion of positive identifications was actually correct?
Recall attempts to answer the following question: what proportion of actual positives was identified correctly?
To fully evaluate the effectiveness of a model, you must examine both precision and recall. Unfortunately, the two are often in tension: improving precision typically reduces recall, and vice versa. Various metrics have been developed that rely on both; the F1 score is one such metric.
In statistical analysis of binary classification, the F1 score (also F-score or F-measure) is a measure of a test’s accuracy. It considers both the precision and the recall of the test to compute the score.
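In terms of true positives (TP), false positives (FP), and false negatives (FN):
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$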
Let’s look at the prediction of the first model on the test set (test_data):
pred.train <- predict(model,test_data)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type
## == : prediction from a rank-deficient fit may be misleading
pred.train <- ifelse(pred.train > 0.5,1,0)
# Accuracy: proportion of correct predictions
mean(pred.train == test_data$y_binary)
## [1] 0.8993238
t1 <- table(pred.train,test_data$y_binary)
# Precision and recall of the model
precision <- t1[1,1]/(sum(t1[1,]))
recall <- t1[1,1]/(sum(t1[,1]))
precision
## [1] 0.905855
recall
## [1] 0.9895745
F1 <- 2*precision*recall/(precision+recall)
F1
## [1] 0.9458659
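Two caveats on this calculation. First, predict() was called without type = “response”, so the predictions are on the log-odds (link) scale and the 0.5 cutoff corresponds to a predicted probability of roughly 0.62 rather than 0.5. Second, t1[1,1] counts correctly predicted ‘no’ outcomes, so the precision and recall above describe the majority ‘no’ class. A minimal sketch of the probability-scale, positive-class version (object names follow those above; the resulting numbers would differ from the ones reported):
prob.test <- predict(model, test_data, type = "response")  # predicted P(yes)
pred.pos <- ifelse(prob.test > 0.5, 1, 0)                  # cutoff on the probability scale
t1.pos <- table(pred.pos, test_data$y_binary)
precision_pos <- t1.pos[2,2]/sum(t1.pos[2,])  # correct 'yes' among predicted 'yes'
recall_pos <- t1.pos[2,2]/sum(t1.pos[,2])     # correct 'yes' among actual 'yes'
F1_pos <- 2*precision_pos*recall_pos/(precision_pos+recall_pos)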
Running the decision tree algorithm from the “rpart” library:
model_dt<- rpart(y ~ age_discrete + job + marital + education + contact + month + pdays_dummy + previous_binned + poutcome + cons.price.idx + cons.conf.idx + euribor3m + nr.employed, data = train_data, method="class")
rpart.plot(model_dt)
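As a complement to the plot, rpart stores a variable-importance vector on the fitted object; sorting it is a quick way to quantify which predictors drive the splits (a minimal sketch; the exact values depend on the fitted tree):
# Rank predictors by rpart's built-in importance measure
sort(model_dt$variable.importance, decreasing = TRUE)
# Complexity table with cross-validated error at each pruning level
printcp(model_dt)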
Although our model suggests that ‘nr.employed’ is an important classifier, it is a macroeconomic indicator: the bank cannot change it to suit its marketing campaigns.
Running a model without ‘nr.employed’:
model_dt<- rpart(y ~ age_discrete + job + marital + education + contact + month + pdays_dummy + previous_binned + poutcome + cons.price.idx + cons.conf.idx + euribor3m, data = train_data, method="class")
rpart.plot(model_dt)
Similarly removing ‘euribor3m’:
model_dt<- rpart(y ~ age_discrete + job + marital + education + contact + month + pdays_dummy + previous_binned + poutcome + cons.price.idx + cons.conf.idx, data = train_data, method="class")
rpart.plot(model_dt)
When removing all socio-economic indicators:
model_dt<- rpart(y ~ age_discrete + job + marital + education + contact + month + pdays_dummy + previous_binned + poutcome, data = train_data, method="class")
rpart.plot(model_dt)
We can see from all our previous decision tree models that previous contact with the customer, as indicated by ‘pdays_dummy’ and ‘poutcome’, has a very large effect on getting a yes (anywhere from a 65% chance to a 74% chance).
As we stated, the economic indicators are aspects that marketing cannot influence, but the splits in the various decision trees do give us a lot of economic information.
We see that the probability of a ‘yes’ increases when:
nr.employed is below 5088 (in the first quartile),
euribor3m is below 1.2 (in the first quartile), and
consumer confidence is above -47.0, and especially above -41.0 (again in the first quartile).
These cut-points can be checked against the data, as in the sketch below.
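A minimal check (assuming the three indicators are numeric columns of train_data):
# Empirical quartiles of the three socio-economic indicators
sapply(train_data[, c("nr.employed", "euribor3m", "cons.conf.idx")], quantile)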
This all indicates an economy moving towards recovery from its lowest point. Although we might not always have situations like this, it makes sense to try to get people to take up term deposits as the economy starts to improve. Take-up slows down as the indicators move towards the mean and median.
Let’s look at the prediction of this model on the test set (test_data):
pred.train.dt <- predict(model_dt,test_data,type = "class")
mean(pred.train.dt==test_data$y)
## [1] 0.8988229
t2<-table(pred.train.dt,test_data$y)
precision_dt <- t2[1,1]/(sum(t2[1,]))
recall_dt <- t2[1,1]/(sum(t2[,1]))
precision_dt
## [1] 0.9102531
recall_dt
## [1] 0.9830938
F1_dt <- 2*precision_dt*recall_dt/(precision_dt+recall_dt)
F1_dt
## [1] 0.9452723
Running the random forest algorithm from the “randomForest” library:
model_rf <- randomForest(y ~ age_discrete + job + marital + education + contact + month + pdays_dummy + previous_binned + poutcome + cons.price.idx + cons.conf.idx + euribor3m + nr.employed, data=train_data)
## Alternative specification without the socio-economic indicators (kept for reference):
## model_rf<-randomForest(y~ age_discrete + job + marital + education + contact + month + pdays_dummy + previous_binned + poutcome, data=train_data)
plot(model_rf)
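Beyond the out-of-bag error plot, randomForest records a Gini-based importance measure for each predictor; a quick look (a sketch, values depend on the fitted forest):
# Mean decrease in Gini impurity per predictor, as a plot and a table
varImpPlot(model_rf)
importance(model_rf)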
Let’s look at the prediction of this model on the test set (test_data):
pred.train.rf <- predict(model_rf,test_data)
mean(pred.train.rf==test_data$y)
## [1] 0.8993238
t3<-table(pred.train.rf,test_data$y)
precision_rf <- t3[1,1]/(sum(t3[1,]))
recall_rf <- t3[1,1]/(sum(t3[,1]))
precision_rf
## [1] 0.9140789
recall_rf
## [1] 0.9787264
F1_rf <- 2*precision_rf*recall_rf/(precision_rf+recall_rf)
F1_rf
## [1] 0.9452987
Running the C5.0 decision tree algorithm from the “C50” library:
model_c5 <- C5.0(x=test_data[, c(2,3,4,8,9,12,15,17,18,19,23,24,25)], y=test_data$y)
# Note: as written this fits the tree on test_data; fitting on train_data (as with
# the other models) would keep the evaluation below strictly out-of-sample.
## too resource intensive to plot.
## plot(model_c5)
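Since the full tree is too resource-intensive to plot, C50’s summary() method provides a text rendering instead (output length depends on the fitted tree):
# Text summary of the fitted C5.0 tree: rules, attribute usage,
# and the confusion matrix on the data it was fit to
summary(model_c5)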
Let’s look at the prediction of this model on the test set (test_data):
pred.train.c5 <- predict(model_c5,test_data)
mean(pred.train.c5==test_data$y)
## [1] 0.8990734
t4 <- table(pred.train.c5,test_data$y)
precision_c5 <- t4[1,1]/(sum(t4[1,]))
recall_c5 <- t4[1,1]/(sum(t4[,1]))
precision_c5
## [1] 0.9091027
recall_c5
## [1] 0.9849253
F1_c5 <- 2*precision_c5*recall_c5/(precision_c5+recall_c5)
F1_c5
## [1] 0.9454963
Averaging the accuracy (mean of correct predictions) and the F1 score for each algorithm, tabulated in the sketch below, we get:
0.923 for our Logit model,
0.922 for our Decision Tree model,
0.922 for our Random Forest model,
and 0.922 for our C5.0 Decision Tree model.
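A minimal sketch assembling that table from the objects computed above (accuracy is recomputed since it was not stored; pred.train.c5 is the C5.0 prediction vector):
# Accuracy and F1 per model, plus their average
scores <- data.frame(
  model = c("Logit", "Decision Tree", "Random Forest", "C5.0"),
  accuracy = c(mean(pred.train == test_data$y_binary),
               mean(pred.train.dt == test_data$y),
               mean(pred.train.rf == test_data$y),
               mean(pred.train.c5 == test_data$y)),
  F1 = c(F1, F1_dt, F1_rf, F1_c5))
scores$avg <- (scores$accuracy + scores$F1) / 2
scores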
All our models have high predictive power.
We can also conclude and suggest that the best approach for the bank is to contact clients, especially those it contacted during a previous campaign (usually within the 27-day window suggested by our dataset). Ideal candidates are highly educated and in higher age brackets. Contact on cellphone numbers should be preferred.
Our decision tree models suggest that take-up is better when the three-month European interbank rate (euribor3m) is below 1.2%.
This indicates that take-up is higher when bank rates are low, so banks should target their campaigns during these periods.
Other indications from our decision tree models suggest that the best time for such campaigns is whenever the economy starts an upswing.