Preliminaries: setting up libraries and loading the dataset
## Libraries used
library(ggplot2)
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
library(tidyverse)
## Registered S3 method overwritten by 'rvest':
## method from
## read_xml.response xml2
## -- Attaching packages ----------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.1     v purrr   0.3.2
## v tibble  2.1.1     v dplyr   0.8.0.1
## v tidyr   0.8.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## Warning: package 'stringr' was built under R version 3.6.1
## -- Conflicts -------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(gmodels)
## Warning: package 'gmodels' was built under R version 3.6.1
library(ggmosaic)
## Warning: package 'ggmosaic' was built under R version 3.6.1
library(corrplot)
## corrplot 0.84 loaded
library(caret)
## Warning: package 'caret' was built under R version 3.6.1
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.6.1
library(C50)
## Warning: package 'C50' was built under R version 3.6.1
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.6.1
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
## Loading the dataset
bank_data <- read.csv("C:/Users/qasim/OneDrive/Desktop/York U/2 - Basic Methods of Data Analytics/Lab 2/bank-additional-full.csv", sep=";")
Abstract
We are using the "Bank Marketing" UCI dataset, related to direct marketing campaigns to the customers of a Portugal-based bank. The goal of the analysis is to use at least two typical classification algorithms, logistic regression and decision trees, to build predictive models of customers signing up for term deposits.
The dataset used is based on the "Bank Marketing" UCI dataset (detailed description at: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing), published by the Banco de Portugal and publicly available at: https://www.bportugal.pt/estatisticasweb. This dataset is almost identical to the one used in [Moro et al., 2014], the difference being that attributes that could be used as "personally identifiable information" have been removed due to privacy concerns.
The data is related to the direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no'). The classification goal is to predict whether, and which types of, clients buy a term deposit (variable y).
The objective of this analysis is to provide a reliable and feasible recommendation algorithm to predict client take-up based on client type and campaign type. The target value is the binary "yes" or "no" regarding the subscription of a term deposit by the client. Hence the task can be solved by classification using logistic regression or a decision tree. We plan to use both and compare F-scores at the end to judge which algorithm generated more accurate results.
The dimensions and variables aren't too obvious, so here is a breakdown by group (variable names as they appear in the data).
Variables related to bank client data: age, job, marital, education, default, housing, loan.
Variables related to the last contact of the current marketing campaign: contact, month, day_of_week, duration, plus the campaign-history attributes campaign, pdays, previous, and poutcome.
Social and economic context attributes: emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m, nr.employed.
Output variable (desired target): y - has the client subscribed a term deposit? (binary: 'yes', 'no')
Check number of rows and columns.
ncol(bank_data)
## [1] 21
nrow(bank_data)
## [1] 41188
Quickly preview the data structure with head(), str(), and summary().
head(bank_data,10)
## age job marital education default housing loan
## 1 56 housemaid married basic.4y no no no
## 2 57 services married high.school unknown no no
## 3 37 services married high.school no yes no
## 4 40 admin. married basic.6y no no no
## 5 56 services married high.school no no yes
## 6 45 services married basic.9y unknown no no
## 7 59 admin. married professional.course no no no
## 8 41 blue-collar married unknown unknown no no
## 9 24 technician single professional.course no yes no
## 10 25 services single high.school no yes no
## contact month day_of_week duration campaign pdays previous
## 1 telephone may mon 261 1 999 0
## 2 telephone may mon 149 1 999 0
## 3 telephone may mon 226 1 999 0
## 4 telephone may mon 151 1 999 0
## 5 telephone may mon 307 1 999 0
## 6 telephone may mon 198 1 999 0
## 7 telephone may mon 139 1 999 0
## 8 telephone may mon 217 1 999 0
## 9 telephone may mon 380 1 999 0
## 10 telephone may mon 50 1 999 0
## poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m
## 1 nonexistent 1.1 93.994 -36.4 4.857
## 2 nonexistent 1.1 93.994 -36.4 4.857
## 3 nonexistent 1.1 93.994 -36.4 4.857
## 4 nonexistent 1.1 93.994 -36.4 4.857
## 5 nonexistent 1.1 93.994 -36.4 4.857
## 6 nonexistent 1.1 93.994 -36.4 4.857
## 7 nonexistent 1.1 93.994 -36.4 4.857
## 8 nonexistent 1.1 93.994 -36.4 4.857
## 9 nonexistent 1.1 93.994 -36.4 4.857
## 10 nonexistent 1.1 93.994 -36.4 4.857
## nr.employed y
## 1 5191 no
## 2 5191 no
## 3 5191 no
## 4 5191 no
## 5 5191 no
## 6 5191 no
## 7 5191 no
## 8 5191 no
## 9 5191 no
## 10 5191 no
str(bank_data)
## 'data.frame': 41188 obs. of 21 variables:
## $ age : int 56 57 37 40 56 45 59 41 24 25 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 4 8 8 1 8 8 1 2 10 8 ...
## $ marital : Factor w/ 4 levels "divorced","married",..: 2 2 2 2 2 2 2 2 3 3 ...
## $ education : Factor w/ 8 levels "basic.4y","basic.6y",..: 1 4 4 2 4 3 6 8 6 4 ...
## $ default : Factor w/ 3 levels "no","unknown",..: 1 2 1 1 1 2 1 2 1 1 ...
## $ housing : Factor w/ 3 levels "no","unknown",..: 1 1 3 1 1 1 1 1 3 3 ...
## $ loan : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 3 1 1 1 1 1 ...
## $ contact : Factor w/ 2 levels "cellular","telephone": 2 2 2 2 2 2 2 2 2 2 ...
## $ month : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ day_of_week : Factor w/ 5 levels "fri","mon","thu",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ duration : int 261 149 226 151 307 198 139 217 380 50 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ emp.var.rate : num 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
## $ cons.price.idx: num 94 94 94 94 94 ...
## $ cons.conf.idx : num -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
## $ euribor3m : num 4.86 4.86 4.86 4.86 4.86 ...
## $ nr.employed : num 5191 5191 5191 5191 5191 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
summary(bank_data)
## age job marital
## Min. :17.00 admin. :10422 divorced: 4612
## 1st Qu.:32.00 blue-collar: 9254 married :24928
## Median :38.00 technician : 6743 single :11568
## Mean :40.02 services : 3969 unknown : 80
## 3rd Qu.:47.00 management : 2924
## Max. :98.00 retired : 1720
## (Other) : 6156
## education default housing
## university.degree :12168 no :32588 no :18622
## high.school : 9515 unknown: 8597 unknown: 990
## basic.9y : 6045 yes : 3 yes :21576
## professional.course: 5243
## basic.4y : 4176
## basic.6y : 2292
## (Other) : 1749
## loan contact month day_of_week
## no :33950 cellular :26144 may :13769 fri:7827
## unknown: 990 telephone:15044 jul : 7174 mon:8514
## yes : 6248 aug : 6178 thu:8623
## jun : 5318 tue:8090
## nov : 4101 wed:8134
## apr : 2632
## (Other): 2016
## duration campaign pdays previous
## Min. : 0.0 Min. : 1.000 Min. : 0.0 Min. :0.000
## 1st Qu.: 102.0 1st Qu.: 1.000 1st Qu.:999.0 1st Qu.:0.000
## Median : 180.0 Median : 2.000 Median :999.0 Median :0.000
## Mean : 258.3 Mean : 2.568 Mean :962.5 Mean :0.173
## 3rd Qu.: 319.0 3rd Qu.: 3.000 3rd Qu.:999.0 3rd Qu.:0.000
## Max. :4918.0 Max. :56.000 Max. :999.0 Max. :7.000
##
## poutcome emp.var.rate cons.price.idx cons.conf.idx
## failure : 4252 Min. :-3.40000 Min. :92.20 Min. :-50.8
## nonexistent:35563 1st Qu.:-1.80000 1st Qu.:93.08 1st Qu.:-42.7
## success : 1373 Median : 1.10000 Median :93.75 Median :-41.8
## Mean : 0.08189 Mean :93.58 Mean :-40.5
## 3rd Qu.: 1.40000 3rd Qu.:93.99 3rd Qu.:-36.4
## Max. : 1.40000 Max. :94.77 Max. :-26.9
##
## euribor3m nr.employed y
## Min. :0.634 Min. :4964 no :36548
## 1st Qu.:1.344 1st Qu.:5099 yes: 4640
## Median :4.857 Median :5191
## Mean :3.621 Mean :5167
## 3rd Qu.:4.961 3rd Qu.:5228
## Max. :5.045 Max. :5228
##
Check distribution of target variable.
summary(bank_data$y)
## no yes
## 36548 4640
CrossTable(bank_data$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 41188
##
##
## | no | yes |
## |-----------|-----------|
## | 36548 | 4640 |
## | 0.887 | 0.113 |
## |-----------|-----------|
##
##
##
##
Creating a binary dependent variable for potential regression models.
bank_data <- bank_data %>%
mutate(y_binary = ifelse(y == "no",0,1))
Check distribution of target variable.
hist(bank_data$y_binary)
summary(bank_data$y_binary)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1127 0.0000 1.0000
Getting an idea of missing values:
colSums(is.na(bank_data))
## age job marital education default
## 0 0 0 0 0
## housing loan contact month day_of_week
## 0 0 0 0 0
## duration campaign pdays previous poutcome
## 0 0 0 0 0
## emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
## 0 0 0 0 0
## y y_binary
## 0 0
colSums(bank_data == "")
## age job marital education default
## 0 0 0 0 0
## housing loan contact month day_of_week
## 0 0 0 0 0
## duration campaign pdays previous poutcome
## 0 0 0 0 0
## emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
## 0 0 0 0 0
## y y_binary
## 0 0
colSums(bank_data == "unknown")
## age job marital education default
## 0 330 80 1731 8597
## housing loan contact month day_of_week
## 990 990 0 0 0
## duration campaign pdays previous poutcome
## 0 0 0 0 0
## emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
## 0 0 0 0 0
## y y_binary
## 0 0
sum(bank_data == "unknown")
## [1] 12718
We see that missing values are encoded as "unknown".
Which variables suffer the most from these "missing values"?
bank_data %>%
summarise_all(list(~sum(. == "unknown"))) %>%
gather(key = "variable", value = "nr_unknown") %>%
arrange(-nr_unknown)
## variable nr_unknown
## 1 default 8597
## 2 education 1731
## 3 housing 990
## 4 loan 990
## 5 job 330
## 6 marital 80
## 7 age 0
## 8 contact 0
## 9 month 0
## 10 day_of_week 0
## 11 duration 0
## 12 campaign 0
## 13 pdays 0
## 14 previous 0
## 15 poutcome 0
## 16 emp.var.rate 0
## 17 cons.price.idx 0
## 18 cons.conf.idx 0
## 19 euribor3m 0
## 20 nr.employed 0
## 21 y 0
## 22 y_binary 0
Six features have at least one unknown value.
Before deciding how to manage those missing values, we'll study each variable and make a decision after visualisations.
We can't afford to delete 8,597 rows from our dataset; that's more than 20% of our observations.
## default theme for ggplot
theme_set(theme_bw())
## setting default parameters for mosaic plots
mosaic_theme = theme(axis.text.x = element_text(angle = 90,
hjust = 1,
vjust = 0.5),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
What kinds of people were contacted during this marketing campaign?
summary(bank_data$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.00 32.00 38.00 40.02 47.00 98.00
bank_data %>%
ggplot() +
aes(x = age) +
geom_bar() +
geom_vline(xintercept = c(30, 60),
col = "red",
linetype = "dashed") +
facet_grid(y ~ .,
scales = "free_y") +
scale_x_continuous(breaks = seq(0, 100, 5))
First of all, it seems the bank is not very interested in contacting the older population. Even so, past the 60-year threshold the relative frequency of y = 1 is higher. In other words, elderly people are more likely to subscribe to a term deposit.
We can also slice the age feature at 30 years to make three easily interpretable classes: up to 30, 31-60, and over 60.
The minimum and maximum observed values are 17 and 98, but we can expect new observations outside this range.
We're replacing the continuous variable "age" with this categorical variable. We might lose some information in this continuous-to-discrete transformation, but there wasn't any clear year-by-year pattern.
Cutting into classes will also make the algorithms easier to interpret later.
bank_data = bank_data %>%
mutate(age_discrete = if_else(age > 60, "high", if_else(age > 30, "mid", "low")))
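Equivalently, base R's cut() can produce the same three classes in one call. A small sketch (the age_discrete2 name is only for illustration), with right-closed breaks matching the if_else() logic above:
# same binning with cut(); intervals are (0,30], (30,60], (60,Inf]
bank_data$age_discrete2 <- cut(bank_data$age,
                               breaks = c(0, 30, 60, Inf),
                               labels = c("low", "mid", "high"))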
Cross-tab with our dependent variable
CrossTable(bank_data$age_discrete, bank_data$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 41188
##
##
## | bank_data$y
## bank_data$age_discrete | no | yes | Row Total |
## -----------------------|-----------|-----------|-----------|
## high | 496 | 414 | 910 |
## | 120.154 | 946.422 | |
## | 0.545 | 0.455 | 0.022 |
## | 0.014 | 0.089 | |
## | 0.012 | 0.010 | |
## -----------------------|-----------|-----------|-----------|
## low | 6259 | 1124 | 7383 |
## | 13.039 | 102.707 | |
## | 0.848 | 0.152 | 0.179 |
## | 0.171 | 0.242 | |
## | 0.152 | 0.027 | |
## -----------------------|-----------|-----------|-----------|
## mid | 29793 | 3102 | 32895 |
## | 12.488 | 98.367 | |
## | 0.906 | 0.094 | 0.799 |
## | 0.815 | 0.669 | |
## | 0.723 | 0.075 | |
## -----------------------|-----------|-----------|-----------|
## Column Total | 36548 | 4640 | 41188 |
## | 0.887 | 0.113 | |
## -----------------------|-----------|-----------|-----------|
##
##
45.5% of people over 60 subscribed to a term deposit, which is a lot in comparison with younger individuals: 15.2% of young adults (under 30) and only 9.4% of the remaining observations (aged 30 to 60) said yes.
What are the types of jobs represented in our data?
table(bank_data$job)
##
## admin. blue-collar entrepreneur housemaid management
## 10422 9254 1456 1060 2924
## retired self-employed services student technician
## 1720 1421 3969 875 6743
## unemployed unknown
## 1014 330
We have 330 unknown jobs.
Cross-tab with our dependent variable
CrossTable(bank_data$job, bank_data$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 41188
##
##
## | bank_data$y
## bank_data$job | no | yes | Row Total |
## --------------|-----------|-----------|-----------|
## admin. | 9070 | 1352 | 10422 |
## | 3.423 | 26.961 | |
## | 0.870 | 0.130 | 0.253 |
## | 0.248 | 0.291 | |
## | 0.220 | 0.033 | |
## --------------|-----------|-----------|-----------|
## blue-collar | 8616 | 638 | 9254 |
## | 19.926 | 156.951 | |
## | 0.931 | 0.069 | 0.225 |
## | 0.236 | 0.138 | |
## | 0.209 | 0.015 | |
## --------------|-----------|-----------|-----------|
## entrepreneur | 1332 | 124 | 1456 |
## | 1.240 | 9.767 | |
## | 0.915 | 0.085 | 0.035 |
## | 0.036 | 0.027 | |
## | 0.032 | 0.003 | |
## --------------|-----------|-----------|-----------|
## housemaid | 954 | 106 | 1060 |
## | 0.191 | 1.507 | |
## | 0.900 | 0.100 | 0.026 |
## | 0.026 | 0.023 | |
## | 0.023 | 0.003 | |
## --------------|-----------|-----------|-----------|
## management | 2596 | 328 | 2924 |
## | 0.001 | 0.006 | |
## | 0.888 | 0.112 | 0.071 |
## | 0.071 | 0.071 | |
## | 0.063 | 0.008 | |
## --------------|-----------|-----------|-----------|
## retired | 1286 | 434 | 1720 |
## | 37.814 | 297.849 | |
## | 0.748 | 0.252 | 0.042 |
## | 0.035 | 0.094 | |
## | 0.031 | 0.011 | |
## --------------|-----------|-----------|-----------|
## self-employed | 1272 | 149 | 1421 |
## | 0.097 | 0.767 | |
## | 0.895 | 0.105 | 0.035 |
## | 0.035 | 0.032 | |
## | 0.031 | 0.004 | |
## --------------|-----------|-----------|-----------|
## services | 3646 | 323 | 3969 |
## | 4.375 | 34.458 | |
## | 0.919 | 0.081 | 0.096 |
## | 0.100 | 0.070 | |
## | 0.089 | 0.008 | |
## --------------|-----------|-----------|-----------|
## student | 600 | 275 | 875 |
## | 40.090 | 315.775 | |
## | 0.686 | 0.314 | 0.021 |
## | 0.016 | 0.059 | |
## | 0.015 | 0.007 | |
## --------------|-----------|-----------|-----------|
## technician | 6013 | 730 | 6743 |
## | 0.147 | 1.156 | |
## | 0.892 | 0.108 | 0.164 |
## | 0.165 | 0.157 | |
## | 0.146 | 0.018 | |
## --------------|-----------|-----------|-----------|
## unemployed | 870 | 144 | 1014 |
## | 0.985 | 7.758 | |
## | 0.858 | 0.142 | 0.025 |
## | 0.024 | 0.031 | |
## | 0.021 | 0.003 | |
## --------------|-----------|-----------|-----------|
## unknown | 293 | 37 | 330 |
## | 0.000 | 0.001 | |
## | 0.888 | 0.112 | 0.008 |
## | 0.008 | 0.008 | |
## | 0.007 | 0.001 | |
## --------------|-----------|-----------|-----------|
## Column Total | 36548 | 4640 | 41188 |
## | 0.887 | 0.113 | |
## --------------|-----------|-----------|-----------|
##
##
bank_data %>%
ggplot() +
geom_mosaic(aes(x = product(y, job), fill = y)) +
mosaic_theme +
xlab("Job") +
ylab(NULL)
Response is higher among students (31.4%) and retired people (25.2%).
Other classes range between 6.9% (blue-collar) and 14.2% (unemployed).
We also see that we can ignore "unknown": at 11.2% it sits right at the overall average, so no big effect is seen here.
How is marital status affecting client behavior?
table(bank_data$marital)
##
## divorced married single unknown
## 4612 24928 11568 80
80 unknowns in this segment.
Cross-tab with our dependent variable:
CrossTable(bank_data$marital, bank_data$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 41188
##
##
## | bank_data$y
## bank_data$marital | no | yes | Row Total |
## ------------------|-----------|-----------|-----------|
## divorced | 4136 | 476 | 4612 |
## | 0.464 | 3.652 | |
## | 0.897 | 0.103 | 0.112 |
## | 0.113 | 0.103 | |
## | 0.100 | 0.012 | |
## ------------------|-----------|-----------|-----------|
## married | 22396 | 2532 | 24928 |
## | 3.450 | 27.174 | |
## | 0.898 | 0.102 | 0.605 |
## | 0.613 | 0.546 | |
## | 0.544 | 0.061 | |
## ------------------|-----------|-----------|-----------|
## single | 9948 | 1620 | 11568 |
## | 9.778 | 77.021 | |
## | 0.860 | 0.140 | 0.281 |
## | 0.272 | 0.349 | |
## | 0.242 | 0.039 | |
## ------------------|-----------|-----------|-----------|
## unknown | 68 | 12 | 80 |
## | 0.126 | 0.990 | |
## | 0.850 | 0.150 | 0.002 |
## | 0.002 | 0.003 | |
## | 0.002 | 0.000 | |
## ------------------|-----------|-----------|-----------|
## Column Total | 36548 | 4640 | 41188 |
## | 0.887 | 0.113 | |
## ------------------|-----------|-----------|-----------|
##
##
bank_data %>%
ggplot() +
geom_mosaic(aes(x = product(y, marital), fill = y)) +
mosaic_theme +
xlab("Marital status") +
ylab(NULL)
No big effect of marital status. Singles (14.0%) are slightly more likely to say "yes" than divorced (10.3%) or married customers (10.2%).
marriage_table <- table(bank_data$marital, bank_data$y)
marriage_tab <- as.data.frame(prop.table(marriage_table, 2))
colnames(marriage_tab) <- c("marital", "y", "perc")
ggplot(data = marriage_tab, aes(x = marital, y = perc, fill = y)) +
geom_bar(stat = 'identity', position = 'dodge', alpha = 2/3) +
xlab("Marital")+
ylab("Percent")
How educated are the clients, and how is that affecting their choice?
table(bank_data$education)
##
## basic.4y basic.6y basic.9y
## 4176 2292 6045
## high.school illiterate professional.course
## 9515 18 5243
## university.degree unknown
## 12168 1731
1731 unknowns here.
Cross-tab with our dependent variable:
CrossTable(bank_data$education, bank_data$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 41188
##
##
## | bank_data$y
## bank_data$education | no | yes | Row Total |
## --------------------|-----------|-----------|-----------|
## basic.4y | 3748 | 428 | 4176 |
## | 0.486 | 3.829 | |
## | 0.898 | 0.102 | 0.101 |
## | 0.103 | 0.092 | |
## | 0.091 | 0.010 | |
## --------------------|-----------|-----------|-----------|
## basic.6y | 2104 | 188 | 2292 |
## | 2.423 | 19.088 | |
## | 0.918 | 0.082 | 0.056 |
## | 0.058 | 0.041 | |
## | 0.051 | 0.005 | |
## --------------------|-----------|-----------|-----------|
## basic.9y | 5572 | 473 | 6045 |
## | 8.065 | 63.527 | |
## | 0.922 | 0.078 | 0.147 |
## | 0.152 | 0.102 | |
## | 0.135 | 0.011 | |
## --------------------|-----------|-----------|-----------|
## high.school | 8484 | 1031 | 9515 |
## | 0.198 | 1.561 | |
## | 0.892 | 0.108 | 0.231 |
## | 0.232 | 0.222 | |
## | 0.206 | 0.025 | |
## --------------------|-----------|-----------|-----------|
## illiterate | 14 | 4 | 18 |
## | 0.244 | 1.918 | |
## | 0.778 | 0.222 | 0.000 |
## | 0.000 | 0.001 | |
## | 0.000 | 0.000 | |
## --------------------|-----------|-----------|-----------|
## professional.course | 4648 | 595 | 5243 |
## | 0.004 | 0.032 | |
## | 0.887 | 0.113 | 0.127 |
## | 0.127 | 0.128 | |
## | 0.113 | 0.014 | |
## --------------------|-----------|-----------|-----------|
## university.degree | 10498 | 1670 | 12168 |
## | 8.292 | 65.317 | |
## | 0.863 | 0.137 | 0.295 |
## | 0.287 | 0.360 | |
## | 0.255 | 0.041 | |
## --------------------|-----------|-----------|-----------|
## unknown | 1480 | 251 | 1731 |
## | 2.041 | 16.079 | |
## | 0.855 | 0.145 | 0.042 |
## | 0.040 | 0.054 | |
## | 0.036 | 0.006 | |
## --------------------|-----------|-----------|-----------|
## Column Total | 36548 | 4640 | 41188 |
## | 0.887 | 0.113 | |
## --------------------|-----------|-----------|-----------|
##
##
bank_data %>%
ggplot() +
geom_mosaic(aes(x = product(y, education), fill = y)) +
mosaic_theme +
xlab("Education Level") +
ylab(NULL)
It appears that a positive correlation exists between the number of years of education and the odds of subscribing to a term deposit.
Among the 1,731 rows containing the "unknown" value, 251 subscribed to a term deposit; that's around 5% of the total group of subscribers.
Since we're facing a very unbalanced dependent variable, we cannot afford to discard those rows, especially as this category has one of the highest relative frequencies of "y = 1" (14.5%).
It might make sense to recode these as "university.degree" holders, the most similar category (13.7%); we do exactly that in the modeling-prep section below.
How many of our clients are in default?
table(bank_data$default)
##
## no unknown yes
## 32588 8597 3
8597 unknowns in default.
Cross-tab with our dependent variable
CrossTable(bank_data$default, bank_data$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 41188
##
##
## | bank_data$y
## bank_data$default | no | yes | Row Total |
## ------------------|-----------|-----------|-----------|
## no | 28391 | 4197 | 32588 |
## | 9.562 | 75.315 | |
## | 0.871 | 0.129 | 0.791 |
## | 0.777 | 0.905 | |
## | 0.689 | 0.102 | |
## ------------------|-----------|-----------|-----------|
## unknown | 8154 | 443 | 8597 |
## | 36.198 | 285.122 | |
## | 0.948 | 0.052 | 0.209 |
## | 0.223 | 0.095 | |
## | 0.198 | 0.011 | |
## ------------------|-----------|-----------|-----------|
## yes | 3 | 0 | 3 |
## | 0.043 | 0.338 | |
## | 1.000 | 0.000 | 0.000 |
## | 0.000 | 0.000 | |
## | 0.000 | 0.000 | |
## ------------------|-----------|-----------|-----------|
## Column Total | 36548 | 4640 | 41188 |
## | 0.887 | 0.113 | |
## ------------------|-----------|-----------|-----------|
##
##
This question, though potentially useful, gives us a feature that is not usable. Only 3 individuals replied "yes" to the question "Do you have credit in default?"; everyone else answered "no" (79.1%) or didn't reply at all (20.9%), which gives us essentially zero information.
We decide not to use this variable in the final models.
Do clients have a housing loan?
table(bank_data$housing)
##
## no unknown yes
## 18622 990 21576
990 unknowns about housing.
Cross-tab with our dependent variable:
CrossTable(bank_data$housing, bank_data$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 41188
##
##
## | bank_data$y
## bank_data$housing | no | yes | Row Total |
## ------------------|-----------|-----------|-----------|
## no | 16596 | 2026 | 18622 |
## | 0.312 | 2.461 | |
## | 0.891 | 0.109 | 0.452 |
## | 0.454 | 0.437 | |
## | 0.403 | 0.049 | |
## ------------------|-----------|-----------|-----------|
## unknown | 883 | 107 | 990 |
## | 0.023 | 0.184 | |
## | 0.892 | 0.108 | 0.024 |
## | 0.024 | 0.023 | |
## | 0.021 | 0.003 | |
## ------------------|-----------|-----------|-----------|
## yes | 19069 | 2507 | 21576 |
## | 0.305 | 2.400 | |
## | 0.884 | 0.116 | 0.524 |
## | 0.522 | 0.540 | |
## | 0.463 | 0.061 | |
## ------------------|-----------|-----------|-----------|
## Column Total | 36548 | 4640 | 41188 |
## | 0.887 | 0.113 | |
## ------------------|-----------|-----------|-----------|
##
##
bank_data %>%
ggplot() +
geom_mosaic(aes(x = product(y, housing), fill = y)) +
mosaic_theme +
xlab("Housing") +
ylab(NULL)
Not much observable variation between those who have housing loans (11.6%) and those who do not (10.9%); "unknown" sits at 10.8%.
Checking this mathematically with a chi-squared test:
chisq.test(bank_data$housing, bank_data$y)
##
## Pearson's Chi-squared test
##
## data: bank_data$housing and bank_data$y
## X-squared = 5.6845, df = 2, p-value = 0.05829
The p-value (0.058) suggests it may be OK to ignore this variable in the final analysis.
Do clients have a personal loan already? How does that affect their take-up?
table(bank_data$loan)
##
## no unknown yes
## 33950 990 6248
990 unknowns.
Cross-tab with our dependent variable:
CrossTable(bank_data$loan, bank_data$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 41188
##
##
## | bank_data$y
## bank_data$loan | no | yes | Row Total |
## ---------------|-----------|-----------|-----------|
## no | 30100 | 3850 | 33950 |
## | 0.021 | 0.169 | |
## | 0.887 | 0.113 | 0.824 |
## | 0.824 | 0.830 | |
## | 0.731 | 0.093 | |
## ---------------|-----------|-----------|-----------|
## unknown | 883 | 107 | 990 |
## | 0.023 | 0.184 | |
## | 0.892 | 0.108 | 0.024 |
## | 0.024 | 0.023 | |
## | 0.021 | 0.003 | |
## ---------------|-----------|-----------|-----------|
## yes | 5565 | 683 | 6248 |
## | 0.079 | 0.618 | |
## | 0.891 | 0.109 | 0.152 |
## | 0.152 | 0.147 | |
## | 0.135 | 0.017 | |
## ---------------|-----------|-----------|-----------|
## Column Total | 36548 | 4640 | 41188 |
## | 0.887 | 0.113 | |
## ---------------|-----------|-----------|-----------|
##
##
bank_data %>%
ggplot() +
geom_mosaic(aes(x = product(y, loan), fill = y)) +
mosaic_theme +
xlab("Loan") +
ylab(NULL)
loan_table <- table(bank_data$loan, bank_data$y)
loan_tab <- as.data.frame(prop.table(loan_table, 2))
colnames(loan_tab) <- c("loan", "y", "perc")
ggplot(data = loan_tab, aes(x = loan, y = perc, fill = y)) +
geom_bar(stat = 'identity', position = 'dodge', alpha = 2/3) +
xlab("loan")+
ylab("Percent")
Not much variation between 11.3% (for no) and 10.9% (for yes).
Checking this mathematically with a Chi-Squared Test:
chisq.test(bank_data$loan, bank_data$y)
##
## Pearson's Chi-squared test
##
## data: bank_data$loan and bank_data$y
## X-squared = 1.094, df = 2, p-value = 0.5787
The p-value (0.58) suggests that this association is not significant at all.
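The same test can be looped over every categorical candidate to rank associations with y at a glance. A hedged sketch (the variable list is our selection from the exploration above):
# p-value of each categorical predictor's association with the target
cat_vars <- c("job", "marital", "education", "contact",
              "month", "day_of_week", "poutcome")
sapply(cat_vars, function(v) chisq.test(bank_data[[v]], bank_data$y)$p.value)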
How were clients contacted and does it make a difference?
table(bank_data$contact)
##
## cellular telephone
## 26144 15044
Cross-tab with our dependent variable:
CrossTable(bank_data$contact, bank_data$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 41188
##
##
## | bank_data$y
## bank_data$contact | no | yes | Row Total |
## ------------------|-----------|-----------|-----------|
## cellular | 22291 | 3853 | 26144 |
## | 35.521 | 279.790 | |
## | 0.853 | 0.147 | 0.635 |
## | 0.610 | 0.830 | |
## | 0.541 | 0.094 | |
## ------------------|-----------|-----------|-----------|
## telephone | 14257 | 787 | 15044 |
## | 61.730 | 486.229 | |
## | 0.948 | 0.052 | 0.365 |
## | 0.390 | 0.170 | |
## | 0.346 | 0.019 | |
## ------------------|-----------|-----------|-----------|
## Column Total | 36548 | 4640 | 41188 |
## | 0.887 | 0.113 | |
## ------------------|-----------|-----------|-----------|
##
##
bank_data %>%
ggplot() +
geom_mosaic(aes(x = product(y, contact), fill = y)) +
mosaic_theme +
xlab("Contact") +
ylab(NULL)
This feature is really interesting; 14.7% of cellular responders subscribed to a term deposit while only 5.2% of telephone responders did.
Does month make a difference?
Cross-tab with our dependent variable:
CrossTable(bank_data$month, bank_data$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 41188
##
##
## | bank_data$y
## bank_data$month | no | yes | Row Total |
## ----------------|-----------|-----------|-----------|
## apr | 2093 | 539 | 2632 |
## | 25.178 | 198.321 | |
## | 0.795 | 0.205 | 0.064 |
## | 0.057 | 0.116 | |
## | 0.051 | 0.013 | |
## ----------------|-----------|-----------|-----------|
## aug | 5523 | 655 | 6178 |
## | 0.306 | 2.413 | |
## | 0.894 | 0.106 | 0.150 |
## | 0.151 | 0.141 | |
## | 0.134 | 0.016 | |
## ----------------|-----------|-----------|-----------|
## dec | 93 | 89 | 182 |
## | 29.052 | 228.836 | |
## | 0.511 | 0.489 | 0.004 |
## | 0.003 | 0.019 | |
## | 0.002 | 0.002 | |
## ----------------|-----------|-----------|-----------|
## jul | 6525 | 649 | 7174 |
## | 3.980 | 31.353 | |
## | 0.910 | 0.090 | 0.174 |
## | 0.179 | 0.140 | |
## | 0.158 | 0.016 | |
## ----------------|-----------|-----------|-----------|
## jun | 4759 | 559 | 5318 |
## | 0.341 | 2.683 | |
## | 0.895 | 0.105 | 0.129 |
## | 0.130 | 0.120 | |
## | 0.116 | 0.014 | |
## ----------------|-----------|-----------|-----------|
## mar | 270 | 276 | 546 |
## | 94.958 | 747.959 | |
## | 0.495 | 0.505 | 0.013 |
## | 0.007 | 0.059 | |
## | 0.007 | 0.007 | |
## ----------------|-----------|-----------|-----------|
## may | 12883 | 886 | 13769 |
## | 36.210 | 285.214 | |
## | 0.936 | 0.064 | 0.334 |
## | 0.352 | 0.191 | |
## | 0.313 | 0.022 | |
## ----------------|-----------|-----------|-----------|
## nov | 3685 | 416 | 4101 |
## | 0.581 | 4.579 | |
## | 0.899 | 0.101 | 0.100 |
## | 0.101 | 0.090 | |
## | 0.089 | 0.010 | |
## ----------------|-----------|-----------|-----------|
## oct | 403 | 315 | 718 |
## | 86.028 | 677.617 | |
## | 0.561 | 0.439 | 0.017 |
## | 0.011 | 0.068 | |
## | 0.010 | 0.008 | |
## ----------------|-----------|-----------|-----------|
## sep | 314 | 256 | 570 |
## | 72.723 | 572.818 | |
## | 0.551 | 0.449 | 0.014 |
## | 0.009 | 0.055 | |
## | 0.008 | 0.006 | |
## ----------------|-----------|-----------|-----------|
## Column Total | 36548 | 4640 | 41188 |
## | 0.887 | 0.113 | |
## ----------------|-----------|-----------|-----------|
##
##
bank_data %>%
ggplot() +
aes(x = month, y = ..count../nrow(bank_data), fill = y) +
geom_bar() +
ylab("relative frequency")
month_table <- table(bank_data$month, bank_data$y)
month_tab <- as.data.frame(prop.table(month_table, 2))
colnames(month_tab) <- c("month", "y", "perc")
ggplot(data = month_tab, aes(x = month, y = perc, fill = y)) +
geom_bar(stat = 'identity', position = 'dodge', alpha = 2/3) +
xlab("Month")+
ylab("Percent")
Most of the calls were made in May, but conversion is higher in March, September, October, and December.
We also notice that no contact was made during January and February.
The highest spike occurs in May, with 33.4% of total contacts, yet it has the worst ratio of subscribers to persons contacted (6.4%).
Every month with a very low frequency of contact (March, September, October and December) shows very good results (between 44% and 51% subscribers).
December aside, there are enough observations to conclude this isn't pure luck, so this feature will probably be very important in the models.
Does the day of the week matter?
Cross-tab with our dependent variable:
CrossTable(bank_data$day_of_week, bank_data$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 41188
##
##
## | bank_data$y
## bank_data$day_of_week | no | yes | Row Total |
## ----------------------|-----------|-----------|-----------|
## fri | 6981 | 846 | 7827 |
## | 0.184 | 1.449 | |
## | 0.892 | 0.108 | 0.190 |
## | 0.191 | 0.182 | |
## | 0.169 | 0.021 | |
## ----------------------|-----------|-----------|-----------|
## mon | 7667 | 847 | 8514 |
## | 1.664 | 13.111 | |
## | 0.901 | 0.099 | 0.207 |
## | 0.210 | 0.183 | |
## | 0.186 | 0.021 | |
## ----------------------|-----------|-----------|-----------|
## thu | 7578 | 1045 | 8623 |
## | 0.708 | 5.574 | |
## | 0.879 | 0.121 | 0.209 |
## | 0.207 | 0.225 | |
## | 0.184 | 0.025 | |
## ----------------------|-----------|-----------|-----------|
## tue | 7137 | 953 | 8090 |
## | 0.241 | 1.901 | |
## | 0.882 | 0.118 | 0.196 |
## | 0.195 | 0.205 | |
## | 0.173 | 0.023 | |
## ----------------------|-----------|-----------|-----------|
## wed | 7185 | 949 | 8134 |
## | 0.148 | 1.165 | |
## | 0.883 | 0.117 | 0.197 |
## | 0.197 | 0.205 | |
## | 0.174 | 0.023 | |
## ----------------------|-----------|-----------|-----------|
## Column Total | 36548 | 4640 | 41188 |
## | 0.887 | 0.113 | |
## ----------------------|-----------|-----------|-----------|
##
##
bank_data %>%
ggplot() +
aes(x = day_of_week, y = ..count../nrow(bank_data), fill = y) +
geom_bar() +
ylab("relative frequency")
Calls aren’t made during weekend days. If we assume that calls are evenly distributed between the different weekdays, Thursdays tend to show better results (12.1% of subscribers among calls made this day) unlike Mondays with only 9.9% of successful calls.
However, those differences are small, which makes this feature not that important.
It would’ve been interesting to see the attitude of responders from weekend calls.
Since the goal is to seek best candidates who will have the best odds to subscribe to a term deposit, the call duration can’t be known before. So we recommend this feature be removed.
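If we were to act on that recommendation immediately, the removal is a one-liner (shown commented out here, since the summary statistics just below still reference duration):
# hypothetical cleanup: drop the leaky feature before modeling
# bank_data <- bank_data %>% select(-duration)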
Still, some interesting facts:
mean(bank_data$duration[bank_data$y == "no"])
## [1] 220.8448
mean(bank_data$duration[bank_data$y == "yes"])
## [1] 553.1912
max(bank_data$duration[bank_data$y == "no"])
## [1] 4918
min(bank_data$duration[bank_data$y == "yes"])
## [1] 37
Plotting the frequency of contact with clients:
bank_data %>%
ggplot() +
aes(x = campaign) +
geom_bar() +
facet_grid(y ~ .,
scales = "free_y") +
scale_x_continuous(breaks = seq(0, 50, 5))
Calling the same person more than ten times during a single marketing campaign seems excessive. We'll treat those as outliers, even if marketing harassment is a real thing. In fact, the chart above shows that harassment doesn't work at all.
Trimming our data set and replotting:
bank_data <- bank_data %>%
filter(campaign <= 10)
bank_data %>%
ggplot() +
aes(x = campaign) +
geom_bar() +
facet_grid(y ~ .,
scales = "free_y") +
scale_x_continuous(breaks = seq(0, 10, 1))
Truncated Cross-tab with our dependent variable:
CrossTable(bank_data$campaign, bank_data$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 40319
##
##
## | bank_data$y
## bank_data$campaign | no | yes | Row Total |
## -------------------|-----------|-----------|-----------|
## 1 | 15342 | 2300 | 17642 |
## | 5.073 | 39.268 | |
## | 0.870 | 0.130 | 0.438 |
## | 0.430 | 0.499 | |
## | 0.381 | 0.057 | |
## -------------------|-----------|-----------|-----------|
## 2 | 9359 | 1211 | 10570 |
## | 0.000 | 0.002 | |
## | 0.885 | 0.115 | 0.262 |
## | 0.262 | 0.263 | |
## | 0.232 | 0.030 | |
## -------------------|-----------|-----------|-----------|
## 3 | 4767 | 574 | 5341 |
## | 0.291 | 2.250 | |
## | 0.893 | 0.107 | 0.132 |
## | 0.134 | 0.124 | |
## | 0.118 | 0.014 | |
## -------------------|-----------|-----------|-----------|
## 4 | 2402 | 249 | 2651 |
## | 1.256 | 9.724 | |
## | 0.906 | 0.094 | 0.066 |
## | 0.067 | 0.054 | |
## | 0.060 | 0.006 | |
## -------------------|-----------|-----------|-----------|
## 5 | 1479 | 120 | 1599 |
## | 2.798 | 21.658 | |
## | 0.925 | 0.075 | 0.040 |
## | 0.041 | 0.026 | |
## | 0.037 | 0.003 | |
## -------------------|-----------|-----------|-----------|
## 6 | 904 | 75 | 979 |
## | 1.580 | 12.229 | |
## | 0.923 | 0.077 | 0.024 |
## | 0.025 | 0.016 | |
## | 0.022 | 0.002 | |
## -------------------|-----------|-----------|-----------|
## 7 | 591 | 38 | 629 |
## | 2.071 | 16.031 | |
## | 0.940 | 0.060 | 0.016 |
## | 0.017 | 0.008 | |
## | 0.015 | 0.001 | |
## -------------------|-----------|-----------|-----------|
## 8 | 383 | 17 | 400 |
## | 2.336 | 18.080 | |
## | 0.958 | 0.042 | 0.010 |
## | 0.011 | 0.004 | |
## | 0.009 | 0.000 | |
## -------------------|-----------|-----------|-----------|
## 9 | 266 | 17 | 283 |
## | 0.944 | 7.304 | |
## | 0.940 | 0.060 | 0.007 |
## | 0.007 | 0.004 | |
## | 0.007 | 0.000 | |
## -------------------|-----------|-----------|-----------|
## 10 | 213 | 12 | 225 |
## | 0.948 | 7.337 | |
## | 0.947 | 0.053 | 0.006 |
## | 0.006 | 0.003 | |
## | 0.005 | 0.000 | |
## -------------------|-----------|-----------|-----------|
## Column Total | 35706 | 4613 | 40319 |
## | 0.886 | 0.114 | |
## -------------------|-----------|-----------|-----------|
##
##
bank_data %>%
ggplot() +
geom_mosaic(aes(x = product(y, campaign), fill = y)) +
mosaic_theme +
xlab("Campaign") +
ylab(NULL)
There is an observable pattern across the values of campaign: the subscription rate declines steadily as the number of contacts grows, from 13.0% after one contact to 5.3% after ten.
How long ago were clients last contacted, and does that make a difference?
table(bank_data$pdays)
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 15 26 61 439 118 46 412 60 18 63 52 28
## 12 13 14 15 16 17 18 19 20 21 22 25
## 58 36 19 24 11 8 7 3 1 2 3 1
## 26 27 999
## 1 1 38806
mean(bank_data$pdays[bank_data$y == "no"])
## [1] 983.8182
mean(bank_data$pdays[bank_data$y == "yes"])
## [1] 790.8242
max(bank_data$pdays[bank_data$y == "no"])
## [1] 999
min(bank_data$pdays[bank_data$y == "yes"])
## [1] 0
min(bank_data$pdays[bank_data$y == "no"])
## [1] 0
max(bank_data$pdays[bank_data$y == "yes"])
## [1] 999
The general idea of prior contact with clients seems more important than the exact number of days passed.
The value 999 means the client wasn't previously contacted, so let's create a dummy out of it.
Clients who weren't contacted in a previous campaign will be labeled "0" in the pdays_dummy variable.
bank_data <- bank_data %>%
mutate(pdays_dummy = if_else(pdays == 999, "0", "1"))
Cross-tab of dummy with our dependent variable:
CrossTable(bank_data$pdays_dummy, bank_data$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 40319
##
##
## | bank_data$y
## bank_data$pdays_dummy | no | yes | Row Total |
## ----------------------|-----------|-----------|-----------|
## 0 | 35160 | 3646 | 38806 |
## | 18.340 | 141.956 | |
## | 0.906 | 0.094 | 0.962 |
## | 0.985 | 0.790 | |
## | 0.872 | 0.090 | |
## ----------------------|-----------|-----------|-----------|
## 1 | 546 | 967 | 1513 |
## | 470.386 | 3640.929 | |
## | 0.361 | 0.639 | 0.038 |
## | 0.015 | 0.210 | |
## | 0.014 | 0.024 | |
## ----------------------|-----------|-----------|-----------|
## Column Total | 35706 | 4613 | 40319 |
## | 0.886 | 0.114 | |
## ----------------------|-----------|-----------|-----------|
##
##
bank_data %>%
ggplot() +
geom_mosaic(aes(x = product(y, pdays_dummy), fill = y)) +
mosaic_theme +
xlab("pdays") +
ylab(NULL)
Recontacting a client after a previous campaign seems to highly increase the odds of subscription.
What is the number of contacts performed before this campaign, for each client?
table(bank_data$previous)
##
## 0 1 2 3 4 5 6 7
## 34703 4554 752 216 70 18 5 1
Cross-tab with our dependent variable:
CrossTable(bank_data$previous, bank_data$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 40319
##
##
## | bank_data$y
## bank_data$previous | no | yes | Row Total |
## -------------------|-----------|-----------|-----------|
## 0 | 31589 | 3114 | 34703 |
## | 23.868 | 184.745 | |
## | 0.910 | 0.090 | 0.861 |
## | 0.885 | 0.675 | |
## | 0.783 | 0.077 | |
## -------------------|-----------|-----------|-----------|
## 1 | 3587 | 967 | 4554 |
## | 49.315 | 381.711 | |
## | 0.788 | 0.212 | 0.113 |
## | 0.100 | 0.210 | |
## | 0.089 | 0.024 | |
## -------------------|-----------|-----------|-----------|
## 2 | 402 | 350 | 752 |
## | 104.624 | 809.824 | |
## | 0.535 | 0.465 | 0.019 |
## | 0.011 | 0.076 | |
## | 0.010 | 0.009 | |
## -------------------|-----------|-----------|-----------|
## 3 | 88 | 128 | 216 |
## | 55.771 | 431.681 | |
## | 0.407 | 0.593 | 0.005 |
## | 0.002 | 0.028 | |
## | 0.002 | 0.003 | |
## -------------------|-----------|-----------|-----------|
## 4 | 32 | 38 | 70 |
## | 14.510 | 112.309 | |
## | 0.457 | 0.543 | 0.002 |
## | 0.001 | 0.008 | |
## | 0.001 | 0.001 | |
## -------------------|-----------|-----------|-----------|
## 5 | 5 | 13 | 18 |
## | 7.509 | 58.121 | |
## | 0.278 | 0.722 | 0.000 |
## | 0.000 | 0.003 | |
## | 0.000 | 0.000 | |
## -------------------|-----------|-----------|-----------|
## 6 | 2 | 3 | 5 |
## | 1.331 | 10.305 | |
## | 0.400 | 0.600 | 0.000 |
## | 0.000 | 0.001 | |
## | 0.000 | 0.000 | |
## -------------------|-----------|-----------|-----------|
## 7 | 1 | 0 | 1 |
## | 0.015 | 0.114 | |
## | 1.000 | 0.000 | 0.000 |
## | 0.000 | 0.000 | |
## | 0.000 | 0.000 | |
## -------------------|-----------|-----------|-----------|
## Column Total | 35706 | 4613 | 40319 |
## | 0.886 | 0.114 | |
## -------------------|-----------|-----------|-----------|
##
##
bank_data %>%
ggplot() +
geom_mosaic(aes(x = product(previous), fill = y)) +
mosaic_theme +
xlab("Previous") +
ylab(NULL)
Even one prior contact improves the probability of a "yes" (from 9.0% to 21.2%).
Note: analyzing this variable can be tricky from a prediction standpoint; we cannot have a 2nd contact without a 1st, or a 3rd without a 2nd.
But creating a dummy would be the same as pdays_dummy, so we instead choose to create bins for this variable.
bank_data <- bank_data %>%
mutate(previous_binned = if_else(previous >= 2, "2+", if_else(previous == 1, "1", "0")))
Cross-tab on binned variable with our dependent variable:
CrossTable(bank_data$previous_binned, bank_data$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 40319
##
##
## | bank_data$y
## bank_data$previous_binned | no | yes | Row Total |
## --------------------------|-----------|-----------|-----------|
## 0 | 31589 | 3114 | 34703 |
## | 23.868 | 184.745 | |
## | 0.910 | 0.090 | 0.861 |
## | 0.885 | 0.675 | |
## | 0.783 | 0.077 | |
## --------------------------|-----------|-----------|-----------|
## 1 | 3587 | 967 | 4554 |
## | 49.315 | 381.711 | |
## | 0.788 | 0.212 | 0.113 |
## | 0.100 | 0.210 | |
## | 0.089 | 0.024 | |
## --------------------------|-----------|-----------|-----------|
## 2+ | 530 | 532 | 1062 |
## | 179.167 | 1386.804 | |
## | 0.499 | 0.501 | 0.026 |
## | 0.015 | 0.115 | |
## | 0.013 | 0.013 | |
## --------------------------|-----------|-----------|-----------|
## Column Total | 35706 | 4613 | 40319 |
## | 0.886 | 0.114 | |
## --------------------------|-----------|-----------|-----------|
##
##
It seems 2+ prior contacts increase the probability to about 50% (50.1%).
As the analysis of the pdays_dummy variable has shown, recontacting someone will increase the odds.
Can we say that long-term persistence works, unlike short-term harassment?
table(bank_data$poutcome)
##
## failure nonexistent success
## 4243 34703 1373
Cross-tab with our dependent variable:
CrossTable(bank_data$poutcome, bank_data$y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 40319
##
##
## | bank_data$y
## bank_data$poutcome | no | yes | Row Total |
## -------------------|-----------|-----------|-----------|
## failure | 3638 | 605 | 4243 |
## | 3.803 | 29.440 | |
## | 0.857 | 0.143 | 0.105 |
## | 0.102 | 0.131 | |
## | 0.090 | 0.015 | |
## -------------------|-----------|-----------|-----------|
## nonexistent | 31589 | 3114 | 34703 |
## | 23.868 | 184.745 | |
## | 0.910 | 0.090 | 0.861 |
## | 0.885 | 0.675 | |
## | 0.783 | 0.077 | |
## -------------------|-----------|-----------|-----------|
## success | 479 | 894 | 1373 |
## | 446.610 | 3456.897 | |
## | 0.349 | 0.651 | 0.034 |
## | 0.013 | 0.194 | |
## | 0.012 | 0.022 | |
## -------------------|-----------|-----------|-----------|
## Column Total | 35706 | 4613 | 40319 |
## | 0.886 | 0.114 | |
## -------------------|-----------|-----------|-----------|
##
##
bank_data %>%
ggplot() +
geom_mosaic(aes(x = product(poutcome), fill = y)) +
mosaic_theme +
xlab("Previous Outcome") +
ylab(NULL)
65.1% of people who subscribed to a term deposit after a previous contact accepted to do it again.
Even those who declined before are still more likely to accept (14.3%) than people who haven't been contacted before (9.0%).
So even if the previous campaign was a failure, recontacting people seems important.
These five continuous variables are social and economic indicators, so they're expected to be highly correlated. Let's compute the correlation matrix.
bank_data %>%
select(emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m, nr.employed) %>%
cor() %>%
corrplot(method = "number",
type = "upper",
tl.cex = 0.8,
tl.srt = 45,
tl.col = "black")
Three pairs show a correlation coefficient higher than 0.90. Let's figure out which variable(s) should be removed to lighten this correlation matrix.
emp.var.rate is the most redundant of the five, correlating above 0.90 with both euribor3m and nr.employed, and it isn't very meaningful on its own. We'll remove it to soften the correlations among our indicators.
euribor3m and nr.employed are also highly correlated (0.95), but we're keeping both features. This is probably a spurious correlation: the number of employees isn't directly reactive to the Euribor rate.
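As a quick check (a sketch only; nothing is overwritten), we can recompute the matrix without emp.var.rate; the euribor3m-nr.employed pair (0.95) should then be the only one left above 0.90:
# correlation matrix after excluding the most redundant indicator
bank_data %>%
  select(cons.price.idx, cons.conf.idx, euribor3m, nr.employed) %>%
  cor()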
So far, we’ve:
The data exploration ends here, let’s process to machine learning models.
Duplicating the data for safekeeping.
dup_bank_data <- bank_data
Removing and transforming "unknowns":
dup_bank_data <- dup_bank_data %>%
filter(job != "unknown")
dup_bank_data <- dup_bank_data %>%
filter(marital != "unknown")
dup_bank_data = dup_bank_data %>%
mutate(education = recode(education, "unknown" = "university.degree"))
Converting our variables to factors with ordered levels (ordinal variables)
dup_bank_data$contact <- factor(dup_bank_data$contact, order = TRUE, levels =c('telephone', 'cellular'))
dup_bank_data$education <- factor(dup_bank_data$education, order = TRUE, levels =c('illiterate','basic.4y', 'basic.6y','basic.9y', 'high.school','professional.course','university.degree'))
dup_bank_data$age_discrete <- factor(dup_bank_data$age_discrete, order = TRUE, levels =c('low', 'mid','high'))
dup_bank_data$job <- factor(dup_bank_data$job, order = TRUE, levels =c('blue-collar', 'services','entrepreneur', 'housemaid', 'self-employed','technician', 'management','admin.','unemployed', 'retired','student'))
dup_bank_data$marital <- factor(dup_bank_data$marital, order = TRUE, levels =c('married', 'divorced', 'single'))
dup_bank_data$month <- factor(dup_bank_data$month, order = TRUE, levels =c('mar', 'apr','may', 'jun','jul', 'aug', 'sep','oct', 'nov','dec'))
dup_bank_data$previous_binned <- factor(dup_bank_data$previous_binned, order = TRUE, levels =c('0', '1','2+'))
dup_bank_data$poutcome <- factor(dup_bank_data$poutcome, order = TRUE, levels =c('nonexistent', 'failure','success'))
Splitting the data into training and test datasets (80-20 split):
set.seed(1984)
training <- createDataPartition(dup_bank_data$y_binary, p = 0.8, list=FALSE)
train_data <- dup_bank_data[training,]
test_data <- dup_bank_data[-training,]
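A quick hedged check that the 80-20 split preserved the class imbalance (both partitions should show roughly 11% "yes"):
# target distribution in each partition
prop.table(table(train_data$y))
prop.table(table(test_data$y))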
In linear regression, we check adjusted R², the F statistic, MAE, and RMSE to evaluate model fit and accuracy. Logistic regression, however, employs a different set of metrics, since here we deal with probabilities and categorical values. One evaluation metric is the Akaike Information Criterion (AIC):
We can look at AIC as the counterpart of adjusted R² in multiple regression; it's an important indicator of model fit, following the rule "smaller is better". It is computed as AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L the maximized likelihood, so it penalizes an increasing number of coefficients: adding variables that don't improve the fit will increase AIC, which helps to avoid overfitting.
Looking at the AIC of one model in isolation isn't very informative; it is more useful for comparing models (model selection). So we build two or three logistic regression models and compare their AICs; the model with the lowest AIC is relatively better.
We will use this to evaluate multiple models.
In addition, we can also perform an ANOVA Chi-square test to check the overall effect of variables on the dependent variable. This will also help us choose the better model.
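For reference, once the candidate models below have been fit, base R can tabulate their AICs in one line (a sketch; the model names are the ones defined below):
# side-by-side AIC of the fitted candidates; lower is better
AIC(model, model_2, model_3)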
Running a Logit model:
model <- glm(y_binary ~ age_discrete + job + marital + education + contact + month + pdays_dummy + previous_binned + poutcome + cons.price.idx + cons.conf.idx + euribor3m + nr.employed,family=binomial(link='logit'),data=train_data)
summary(model)
##
## Call:
## glm(formula = y_binary ~ age_discrete + job + marital + education +
## contact + month + pdays_dummy + previous_binned + poutcome +
## cons.price.idx + cons.conf.idx + euribor3m + nr.employed,
## family = binomial(link = "logit"), data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2366 -0.3888 -0.3351 -0.2600 2.7452
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 96.324391 24.509174 3.930 8.49e-05 ***
## age_discrete.L 0.096720 0.093277 1.037 0.299773
## age_discrete.Q 0.217019 0.054479 3.984 6.79e-05 ***
## job.L 0.247700 0.092281 2.684 0.007270 **
## job.Q 0.045078 0.086339 0.522 0.601594
## job.C 0.093332 0.093516 0.998 0.318262
## job^4 -0.010552 0.092891 -0.114 0.909560
## job^5 -0.058395 0.099413 -0.587 0.556935
## job^6 -0.022088 0.089663 -0.246 0.805418
## job^7 -0.135317 0.104593 -1.294 0.195751
## job^8 0.019738 0.100293 0.197 0.843979
## job^9 -0.089681 0.099551 -0.901 0.367667
## job^10 -0.045361 0.085268 -0.532 0.594742
## marital.L 0.026581 0.035545 0.748 0.454572
## marital.Q -0.003398 0.054023 -0.063 0.949842
## education.L -0.635762 0.382902 -1.660 0.096839 .
## education.Q 0.680389 0.367675 1.851 0.064239 .
## education.C -0.416465 0.278719 -1.494 0.135121
## education^4 0.303541 0.172774 1.757 0.078941 .
## education^5 -0.216346 0.100229 -2.159 0.030888 *
## education^6 0.205799 0.069858 2.946 0.003220 **
## contact.L 0.389120 0.047967 8.112 4.97e-16 ***
## month.L -0.670112 0.147499 -4.543 5.54e-06 ***
## month.Q 0.499398 0.128248 3.894 9.86e-05 ***
## month.C -0.017095 0.135112 -0.127 0.899314
## month^4 0.648315 0.092718 6.992 2.70e-12 ***
## month^5 -0.339622 0.086926 -3.907 9.34e-05 ***
## month^6 -0.199858 0.085081 -2.349 0.018823 *
## month^7 0.586365 0.071002 8.258 < 2e-16 ***
## month^8 -0.223344 0.078503 -2.845 0.004440 **
## month^9 0.229071 0.070556 3.247 0.001168 **
## pdays_dummy1 1.141009 0.221608 5.149 2.62e-07 ***
## previous_binned.L -0.959180 0.155095 -6.184 6.23e-10 ***
## previous_binned.Q 0.524573 0.118350 4.432 9.32e-06 ***
## poutcome.L 1.121389 0.308288 3.637 0.000275 ***
## poutcome.Q NA NA NA NA
## cons.price.idx -0.248884 0.135328 -1.839 0.065898 .
## cons.conf.idx 0.012184 0.007800 1.562 0.118257
## euribor3m 0.212429 0.131534 1.615 0.106309
## nr.employed -0.014611 0.002517 -5.806 6.40e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 22824 on 31943 degrees of freedom
## Residual deviance: 17999 on 31905 degrees of freedom
## AIC: 18077
##
## Number of Fisher Scoring iterations: 6
We can see that older age, higher education, contact over cellphone, and previous contact all seem to be highly predictive and positively correlated with a 'yes'.
We also see that jobs do not seem to have a significant impact. Similarly, our socio-economic indicators do not turn out to be significant, except for nr.employed.
We run a few reduced models to see if we can improve our fit.
Running simplified models:
model_2 <- glm(y_binary ~ age_discrete + marital + education + contact + month + pdays_dummy + previous_binned + poutcome,family=binomial(link='logit'),data=train_data)
summary(model_2)
##
## Call:
## glm(formula = y_binary ~ age_discrete + marital + education +
## contact + month + pdays_dummy + previous_binned + poutcome,
## family = binomial(link = "logit"), data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6613 -0.4409 -0.3934 -0.2654 2.7330
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.08084 0.13756 -7.857 3.92e-15 ***
## age_discrete.L 0.65178 0.07406 8.800 < 2e-16 ***
## age_discrete.Q 0.65082 0.04472 14.555 < 2e-16 ***
## marital.L 0.10191 0.03334 3.057 0.002239 **
## marital.Q 0.03234 0.05238 0.617 0.537005
## education.L -0.55378 0.36645 -1.511 0.130739
## education.Q 0.84186 0.35306 2.384 0.017106 *
## education.C -0.49684 0.26762 -1.857 0.063380 .
## education^4 0.30277 0.16548 1.830 0.067300 .
## education^5 -0.18436 0.09566 -1.927 0.053948 .
## education^6 0.23451 0.06686 3.507 0.000453 ***
## contact.L 0.73630 0.04133 17.817 < 2e-16 ***
## month.L 0.01245 0.11977 0.104 0.917186
## month.Q 1.28020 0.12352 10.364 < 2e-16 ***
## month.C -0.74902 0.11727 -6.387 1.69e-10 ***
## month^4 0.46697 0.09156 5.100 3.39e-07 ***
## month^5 0.64652 0.08287 7.802 6.11e-15 ***
## month^6 1.17729 0.07407 15.894 < 2e-16 ***
## month^7 0.80761 0.06988 11.557 < 2e-16 ***
## month^8 -0.65996 0.07954 -8.297 < 2e-16 ***
## month^9 0.03842 0.06437 0.597 0.550610
## pdays_dummy1 1.49145 0.22772 6.549 5.77e-11 ***
## previous_binned.L -0.45252 0.15761 -2.871 0.004090 **
## previous_binned.Q 0.59443 0.12063 4.928 8.31e-07 ***
## poutcome.L 1.25060 0.31728 3.942 8.09e-05 ***
## poutcome.Q NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 22824 on 31943 degrees of freedom
## Residual deviance: 19037 on 31919 degrees of freedom
## AIC: 19087
##
## Number of Fisher Scoring iterations: 5
model_3 <- glm(y_binary ~ age_discrete + marital + education + contact + month + pdays_dummy + previous_binned + poutcome + euribor3m + nr.employed,family=binomial(link='logit'),data=train_data)
summary(model_3)
##
## Call:
## glm(formula = y_binary ~ age_discrete + marital + education +
## contact + month + pdays_dummy + previous_binned + poutcome +
## euribor3m + nr.employed, family = binomial(link = "logit"),
## data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1875 -0.3926 -0.3374 -0.2406 2.7628
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 53.0802742 4.4651459 11.888 < 2e-16 ***
## age_discrete.L 0.2163968 0.0733073 2.952 0.003158 **
## age_discrete.Q 0.3000977 0.0449700 6.673 2.50e-11 ***
## marital.L 0.0336598 0.0346374 0.972 0.331163
## marital.Q -0.0055407 0.0537092 -0.103 0.917835
## education.L -0.6021767 0.3754519 -1.604 0.108742
## education.Q 0.7513522 0.3617816 2.077 0.037819 *
## education.C -0.4519499 0.2741734 -1.648 0.099269 .
## education^4 0.2894968 0.1694195 1.709 0.087496 .
## education^5 -0.2080738 0.0976672 -2.130 0.033135 *
## education^6 0.2125854 0.0679289 3.130 0.001751 **
## contact.L 0.3458413 0.0450145 7.683 1.56e-14 ***
## month.L -0.2761345 0.1208047 -2.286 0.022266 *
## month.Q 0.5199370 0.1210153 4.296 1.74e-05 ***
## month.C -0.2200220 0.1174744 -1.873 0.061077 .
## month^4 0.7446388 0.0897319 8.298 < 2e-16 ***
## month^5 -0.2292005 0.0840959 -2.725 0.006421 **
## month^6 -0.1075966 0.0800427 -1.344 0.178871
## month^7 0.5416931 0.0689505 7.856 3.96e-15 ***
## month^8 -0.1699538 0.0778704 -2.183 0.029071 *
## month^9 0.2596190 0.0664003 3.910 9.23e-05 ***
## pdays_dummy1 1.1929479 0.2219852 5.374 7.70e-08 ***
## previous_binned.L -0.9735425 0.1550782 -6.278 3.44e-10 ***
## previous_binned.Q 0.4991260 0.1184196 4.215 2.50e-05 ***
## poutcome.L 1.0818211 0.3088849 3.502 0.000461 ***
## poutcome.Q NA NA NA NA
## euribor3m -0.0142966 0.0422466 -0.338 0.735056
## nr.employed -0.0106699 0.0008957 -11.913 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 22824 on 31943 degrees of freedom
## Residual deviance: 18043 on 31917 degrees of freedom
## AIC: 18097
##
## Number of Fisher Scoring iterations: 6
As you can see, we achieved a lower AIC value with our first, full model.
We can also compare the two models using an ANOVA (likelihood-ratio) test. The null hypothesis is that the reduced second model fits the data as well as the full first model; p < 0.05 rejects this null in favour of the full model, while p > 0.05 means we fail to reject it.
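Concretely, the test statistic is the drop in residual deviance between the nested models, referred to a chi-squared distribution with degrees of freedom equal to the number of parameters dropped. Using the residual deviances reported in the two summaries above:
$$\chi^2 = D_{\text{reduced}} - D_{\text{full}} \approx 19037 - 17999 = 1038, \qquad \mathrm{df} = 31919 - 31905 = 14$$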
anova(model,model_2,test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: y_binary ~ age_discrete + job + marital + education + contact +
## month + pdays_dummy + previous_binned + poutcome + cons.price.idx +
## cons.conf.idx + euribor3m + nr.employed
## Model 2: y_binary ~ age_discrete + marital + education + contact + month +
## pdays_dummy + previous_binned + poutcome
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 31905 17999
## 2 31919 19037 -14 -1038.2 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
With p < 0.05, the ANOVA test corroborates that the first model fits the data significantly better than our second model.
Precision attempts to answer the following question: what proportion of positive identifications was actually correct?
Recall attempts to answer the following question: what proportion of actual positives was identified correctly?
To fully evaluate the effectiveness of a model, you must examine both precision and recall. Unfortunately, the two are often in tension: improving precision typically reduces recall, and vice versa. Various metrics have been developed that rely on both; the F1 score is one such metric.
In statistical analysis of binary classification, the F1 score (also F-score or F-measure) is a measure of a test’s accuracy. It considers both the precision and the recall of the test to compute the score.
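In terms of true positives (TP), false positives (FP), and false negatives (FN):
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$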
Let’s look at the prediction of the first model on the test set (test_data):
pred.train <- predict(model,test_data)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type
## == : prediction from a rank-deficient fit may be misleading
pred.train <- ifelse(pred.train > 0.5,1,0)
# Accuracy: proportion of correct predictions
mean(pred.train == test_data$y_binary)
## [1] 0.8993238
t1 <- table(pred.train,test_data$y_binary)
# Precision and recall of the model
precision <- t1[1,1]/(sum(t1[1,]))
recall <- t1[1,1]/(sum(t1[,1]))
precision
## [1] 0.905855
recall
## [1] 0.9895745
F1 <- 2*precision*recall/(precision+recall)
F1
## [1] 0.9458659
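Two caveats on this calculation. First, predict() was called without type = “response”, so the predictions are on the log-odds (link) scale and the 0.5 cutoff corresponds to a predicted probability of roughly 0.62 rather than 0.5. Second, t1[1,1] counts correctly predicted ‘no’ outcomes, so the precision and recall above describe the majority ‘no’ class. A minimal sketch of the probability-scale, positive-class version (object names follow those above; the resulting numbers would differ from the ones reported):
prob.test <- predict(model, test_data, type = "response")  # predicted P(yes)
pred.pos <- ifelse(prob.test > 0.5, 1, 0)                  # cutoff on the probability scale
t1.pos <- table(pred.pos, test_data$y_binary)
precision_pos <- t1.pos[2,2]/sum(t1.pos[2,])  # correct 'yes' among predicted 'yes'
recall_pos <- t1.pos[2,2]/sum(t1.pos[,2])     # correct 'yes' among actual 'yes'
F1_pos <- 2*precision_pos*recall_pos/(precision_pos+recall_pos)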
Running the decision tree algorithm from the “rpart” library:
model_dt<- rpart(y ~ age_discrete + job + marital + education + contact + month + pdays_dummy + previous_binned + poutcome + cons.price.idx + cons.conf.idx + euribor3m + nr.employed, data = train_data, method="class")
rpart.plot(model_dt)
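As a complement to the plot, rpart stores a variable-importance vector on the fitted object; sorting it is a quick way to quantify which predictors drive the splits (a minimal sketch; the exact values depend on the fitted tree):
# Rank predictors by rpart's built-in importance measure
sort(model_dt$variable.importance, decreasing = TRUE)
# Complexity table with cross-validated error at each pruning level
printcp(model_dt)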
Although our model suggests that ‘nr.employed’ is an important classifier, it is a macroeconomic indicator: the bank cannot change it to suit its marketing campaigns.
Running a model without ‘nr.employed’:
model_dt<- rpart(y ~ age_discrete + job + marital + education + contact + month + pdays_dummy + previous_binned + poutcome + cons.price.idx + cons.conf.idx + euribor3m, data = train_data, method="class")
rpart.plot(model_dt)
Similarly removing ‘euribor3m’:
model_dt<- rpart(y ~ age_discrete + job + marital + education + contact + month + pdays_dummy + previous_binned + poutcome + cons.price.idx + cons.conf.idx, data = train_data, method="class")
rpart.plot(model_dt)
When removing all socio-economic indicators:
model_dt<- rpart(y ~ age_discrete + job + marital + education + contact + month + pdays_dummy + previous_binned + poutcome, data = train_data, method="class")
rpart.plot(model_dt)
We can see from all our previous decision tree models that previous contact with the customer, as indicated by ‘pdays_dummy’ and ‘poutcome’, has a very large effect on getting a yes (anywhere from a 65% chance to a 74% chance).
As we stated, the economic indicators are aspects that marketing cannot influence, but the splits in the various decision trees do give us a lot of economic information.
We see that the probability of a ‘yes’ increases when:
nr.employed is below 5088 (in the first quartile),
euribor3m is below 1.2 (in the first quartile), and
consumer confidence is above -47.0, and especially above -41.0 (again in the first quartile).
These cut-points can be checked against the data, as in the sketch below.
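A minimal check (assuming the three indicators are numeric columns of train_data):
# Empirical quartiles of the three socio-economic indicators
sapply(train_data[, c("nr.employed", "euribor3m", "cons.conf.idx")], quantile)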
This all indicates an economy moving towards recovery from its lowest point. Although we might not always have situations like this, it makes sense to try to get people to take up term deposits as the economy starts to improve. Take-up slows down as the indicators move towards the mean and median.
Let’s look at the prediction of this model on the test set (test_data):
pred.train.dt <- predict(model_dt,test_data,type = "class")
mean(pred.train.dt==test_data$y)
## [1] 0.8988229
t2<-table(pred.train.dt,test_data$y)
precision_dt <- t2[1,1]/(sum(t2[1,]))
recall_dt <- t2[1,1]/(sum(t2[,1]))
precision_dt
## [1] 0.9102531
recall_dt
## [1] 0.9830938
F1_dt <- 2*precision_dt*recall_dt/(precision_dt+recall_dt)
F1_dt
## [1] 0.9452723
Running the random forest algorithm from the “randomForest” library:
model_rf <- randomForest(y ~ age_discrete + job + marital + education + contact + month + pdays_dummy + previous_binned + poutcome + cons.price.idx + cons.conf.idx + euribor3m + nr.employed, data=train_data)
## Alternative specification without the socio-economic indicators (kept for reference):
## model_rf<-randomForest(y~ age_discrete + job + marital + education + contact + month + pdays_dummy + previous_binned + poutcome, data=train_data)
plot(model_rf)
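Beyond the out-of-bag error plot, randomForest records a Gini-based importance measure for each predictor; a quick look (a sketch, values depend on the fitted forest):
# Mean decrease in Gini impurity per predictor, as a plot and a table
varImpPlot(model_rf)
importance(model_rf)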
Let’s look at the prediction of this model on the test set (test_data):
pred.train.rf <- predict(model_rf,test_data)
mean(pred.train.rf==test_data$y)
## [1] 0.8993238
t3<-table(pred.train.rf,test_data$y)
precision_rf <- t3[1,1]/(sum(t3[1,]))
recall_rf <- t3[1,1]/(sum(t3[,1]))
precision_rf
## [1] 0.9140789
recall_rf
## [1] 0.9787264
F1_rf <- 2*precision_rf*recall_rf/(precision_rf+recall_rf)
F1_rf
## [1] 0.9452987
Running the C5.0 decision tree algorithm from the “C50” library:
model_c5 <- C5.0(x=test_data[, c(2,3,4,8,9,12,15,17,18,19,23,24,25)], y=test_data$y)
# Note: as written this fits the tree on test_data; fitting on train_data (as with
# the other models) would keep the evaluation below strictly out-of-sample.
## too resource intensive to plot.
## plot(model_c5)
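Since the full tree is too resource-intensive to plot, C50’s summary() method provides a text rendering instead (output length depends on the fitted tree):
# Text summary of the fitted C5.0 tree: rules, attribute usage,
# and the confusion matrix on the data it was fit to
summary(model_c5)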
Let’s look at the prediction of this model on the test set (test_data):
pred.train.c5 <- predict(model_c5,test_data)
mean(pred.train.c5==test_data$y)
## [1] 0.8990734
t4 <- table(pred.train.c5,test_data$y)
precision_c5 <- t4[1,1]/(sum(t4[1,]))
recall_c5 <- t4[1,1]/(sum(t4[,1]))
precision_c5
## [1] 0.9091027
recall_c5
## [1] 0.9849253
F1_c5 <- 2*precision_c5*recall_c5/(precision_c5+recall_c5)
F1_c5
## [1] 0.9454963
Averaging the accuracy (mean of correct predictions) and the F1 score for each algorithm, tabulated in the sketch below, we get:
0.923 for our Logit model,
0.922 for our Decision Tree model,
0.922 for our Random Forest model,
and 0.922 for our C5.0 Decision Tree model.
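A minimal sketch assembling that table from the objects computed above (accuracy is recomputed since it was not stored; pred.train.c5 is the C5.0 prediction vector):
# Accuracy and F1 per model, plus their average
scores <- data.frame(
  model = c("Logit", "Decision Tree", "Random Forest", "C5.0"),
  accuracy = c(mean(pred.train == test_data$y_binary),
               mean(pred.train.dt == test_data$y),
               mean(pred.train.rf == test_data$y),
               mean(pred.train.c5 == test_data$y)),
  F1 = c(F1, F1_dt, F1_rf, F1_c5))
scores$avg <- (scores$accuracy + scores$F1) / 2
scores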
All our models have high predictive power.
We can also conclude and suggest that the best approach for the bank is to contact clients, especially those it contacted during a previous campaign (usually within the 27-day window suggested by our dataset). Ideal candidates are highly educated and in higher age brackets. Contact on cellphone numbers should be preferred.
Our decision tree models suggest that take-up is better when the three-month European interbank rate (euribor3m) is below 1.2%.
This indicates that take-up is higher when bank rates are low, so banks should target their campaigns during these periods.
Other indications from our decision tree models suggest that the best time for such campaigns is whenever the economy starts an upswing.