1. INTRODUCTION

The objective of this project is to predict whether a client will subscribe (yes/no) to a term deposit at a Portuguese bank, using data collected from May 2008 to November 2010. The dataset is sourced from the UCI Machine Learning Repository and is based on the phone calls made during a direct marketing campaign; a client was often contacted more than once to assess whether the term deposit (the product and target feature) would be subscribed to. The dataset can be accessed at https://archive.ics.uci.edu/ml/datasets/bank+marketing [1]. This project has two phases. Phase I, covered in this report, focuses on data preprocessing and exploration, while Phase II covers model building and performance analysis. This report is compiled using R Markdown, contains both the narrative and the R code, and is organised as follows:

Section 1: Introduction

Section 2: Description of the dataset and their attributes

Section 3: Data Preprocessing

Section 4: Exploration of each attribute and their relationships.

The report ends with a summary.

2. DATASET

The UCI Machine Learning Repository provides two zip files, but only bank.zip is used for this project. It contains bank-full.csv (the full dataset), bank.csv (a 10% sample of the examples) and bank-names.txt (the readme). bank-full.csv is preferred for this project because it is the full, original dataset, with 45211 instances, 16 descriptive features and 1 target feature.

2.1. Target Feature

The desired target feature is “y” - “Has the client subscribed to a term deposit?”

According to the dataset, “y” has two classes, so the task is a binary classification problem.

Yes: The client has subscribed to the term deposit.
No: The client has not subscribed to the term deposit.

The goal is to predict whether a client has subscribed to the term deposit.

2.2. Descriptive Features

The 16 input features and the output variable contained in the dataset are:

Bank client:

1 - age : the client’s age (numeric)

2 - job : type of job (categorical:‘admin.’,‘blue-collar’,‘entrepreneur’,‘housemaid’,‘management’,‘retired’,‘self-employed’,‘services’,‘student’,‘technician’,‘unemployed’,‘unknown’)

3 - marital : marital status (categorical: ‘divorced’, ‘married’, ‘single’; note: ‘divorced’ means divorced or widowed)

4 - education (categorical: ‘unknown’, ‘primary’, ‘secondary’, ‘tertiary’)

5 - default: has credit in default? (categorical: ‘no’, ‘yes’)

6 - balance: average yearly balance, in euros (numeric)

7 - housing: has housing loan? (categorical: ‘no’, ‘yes’)

8 - loan: has personal loan? (categorical: ‘no’, ‘yes’)

Related with the last contact of the current campaign:

9 - contact: contact communication type (categorical:“unknown”,“telephone”,“cellular”)

10 - day: last contact day of the month (numeric)

11 - month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)

12 - duration: last contact duration, in seconds (numeric)

Other attributes:

13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)

15 - previous: number of contacts performed before this campaign and for this client (numeric)

16 - poutcome: outcome of the previous marketing campaign (categorical: “unknown”,“other”,“failure”,“success”)

Output variable (desired target):

17 - y - has the client subscribed to the term deposit? (binary: “yes”,“no”)

3. DATA PREPROCESSING

3.1. Preliminaries

The necessary libraries are loaded, and the downloaded dataset is imported with the base R read.csv() function and assigned to the object “bank”. Variables not used in this analysis are dropped before proceeding with data preprocessing.

library(readr)
library(knitr)
library(mlr)
library(ggplot2)
library(magrittr)
library(cowplot)
library(dplyr)
library(gridExtra)
library(GGally)

bank <- read.csv("bank-full.csv", sep = ";", header = TRUE, stringsAsFactors = FALSE)

bank <- bank %>% select(-c(contact, day, month)) #Dropping unnecessary variables by name (columns 9 to 11)

3.2. Data Cleaning and Transformation

To confirm that the feature types match the description in the documentation, the str() function is used.

#Check the class
str(bank)

## 'data.frame':    45211 obs. of  14 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : chr  "management" "technician" "entrepreneur" "blue-collar" ...
##  $ marital  : chr  "married" "single" "married" "married" ...
##  $ education: chr  "tertiary" "secondary" "secondary" "unknown" ...
##  $ default  : chr  "no" "no" "no" "no" ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : chr  "yes" "yes" "yes" "yes" ...
##  $ loan     : chr  "no" "no" "yes" "no" ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ y        : chr  "no" "no" "no" "no" ...

The dataset is then summarised feature by feature to obtain summary statistics, which are most informative for the numeric features.

summarizeColumns(bank) %>% knitr::kable( caption = 'Feature Summary before Data Preprocessing')

Feature Summary before Data Preprocessing
name	type	mean	disp	median	mad	min	max	nlevs
age	integer	40.9362102	10.6187620	39	10.3782	18	95	0
job	character	NA	0.7847427	NA	NA	288	9732	12
marital	character	NA	0.3980668	NA	NA	5207	27214	3
education	character	NA	0.4868063	NA	NA	1857	23202	4
default	character	NA	0.0180266	NA	NA	815	44396	2
balance	integer	1362.2720577	3044.7658292	448	664.2048	-8019	102127	0
housing	character	NA	0.4441618	NA	NA	20081	25130	2
loan	character	NA	0.1602265	NA	NA	7244	37967	2
duration	integer	258.1630798	257.5278123	180	137.8818	0	4918	0
campaign	integer	2.7638407	3.0980209	2	1.4826	1	63	0
pdays	integer	40.1978280	100.1287460	-1	0.0000	-1	871	0
previous	integer	0.5803234	2.3034410	0	0.0000	0	275	0
poutcome	character	NA	0.1825220	NA	NA	1511	36959	4
y	character	NA	0.1169848	NA	NA	5289	39922	2
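For the character features, the table values suggest that `disp` is the share of observations outside the most frequent level, and that `min` and `max` are the counts of the rarest and the most common level. This reading is inferred from the numbers above rather than taken from the package documentation, so treat it as an assumption. A minimal check for “y”, using only the counts shown in the table:

```r
# Counts of the two levels of y, taken from the min/max columns of the table
y_counts <- c(yes = 5289, no = 39922)

# Share of observations that are not in the most frequent level
disp_y <- 1 - max(y_counts) / sum(y_counts)
print(round(disp_y, 4))
```

The result is about 0.117, in line with the `disp` value reported for “y”, and it also shows that roughly 12% of the clients subscribed.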

3.2.1. Scanning for NAs

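The dataset is scanned for missing values using the is.na() function, and none are found, so the dataset contains no NA values. Note, however, that several categorical features (for example job, education and poutcome) use the label “unknown”, which effectively marks missing information.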

colSums(is.na(bank))

##       age       job   marital education   default   balance   housing 
##         0         0         0         0         0         0         0 
##      loan  duration  campaign     pdays  previous  poutcome         y 
##         0         0         0         0         0         0         0
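Because is.na() does not detect placeholder labels, a complementary check is to count how often “unknown” occurs in each character column. The sketch below is illustrative only: `count_placeholder` is a hypothetical helper and `toy` is made-up data standing in for the “bank” data frame.

```r
# Count occurrences of a placeholder label in every character column
count_placeholder <- function(df, label = "unknown") {
  char_cols <- vapply(df, is.character, logical(1))
  vapply(df[char_cols], function(x) sum(x == label, na.rm = TRUE), integer(1))
}

# Toy data standing in for the bank data frame (values are made up)
toy <- data.frame(
  job = c("management", "unknown", "technician"),
  education = c("unknown", "unknown", "primary"),
  age = c(58, 44, 33),
  stringsAsFactors = FALSE
)
unknown_counts <- count_placeholder(toy)
print(unknown_counts)
```

On the report's data the call would be `count_placeholder(bank)`.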

3.2.2. Scanning for white space

In order to avoid any discrepancy later, extra white space, if any, is removed for all the character features.

char_cols <- sapply(bank, is.character) #Identify the character columns
bank[char_cols] <- lapply(bank[char_cols], trimws) #Trim white space column by column

3.2.4. Scanning for case errors

Scanning for case errors means looking for inconsistencies in categorical attributes caused by stray upper-case or lower-case letters. The unique() function is applied to show all distinct values of each categorical variable, which makes any odd cases easy to spot.

unique(bank$job)

##  [1] "management"    "technician"    "entrepreneur"  "blue-collar"  
##  [5] "unknown"       "retired"       "admin."        "services"     
##  [9] "self-employed" "unemployed"    "housemaid"     "student"

unique(bank$marital)

## [1] "married"  "single"   "divorced"

unique(bank$education)

## [1] "tertiary"  "secondary" "unknown"   "primary"

unique(bank$default)

## [1] "no"  "yes"

unique(bank$housing)

## [1] "yes" "no"

unique(bank$loan)

## [1] "no"  "yes"

unique(bank$poutcome)

## [1] "unknown" "failure" "other"   "success"

The outputs confirm that there are no upper/lower-case inconsistencies and that every value appears among the levels listed in Section 2.2.
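The visual check above can also be automated: if lower-casing a column reduces the number of distinct values, the column contains labels that differ only by case. This is a sketch under that assumption; `has_case_variants` is a hypothetical helper and `toy` is made-up data.

```r
# TRUE if lower-casing merges two or more distinct values, i.e. the column
# contains labels that differ only by case
has_case_variants <- function(x) {
  length(unique(tolower(x))) < length(unique(x))
}

# Toy data with a deliberate case error in marital
toy <- data.frame(
  marital = c("married", "Married", "single"),
  loan = c("no", "yes", "no"),
  stringsAsFactors = FALSE
)
case_flags <- vapply(toy, has_case_variants, logical(1))
print(case_flags)
```

For the report's data, the same call would be applied to the character columns of “bank”.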

3.2.3. Renaming some variables’ values

Three descriptive features (“default”, “housing” and “loan”) have the same binary responses (“yes” or “no”) as the target “y”. In order to avoid confusion with the target during visualisation and exploration, the values of these descriptive features are relabelled as follows: (a) defaulter, no defaulter (b) housing loan, no housing loan (c) personal loan, no personal loan.

bank$default <-ifelse(bank$default =="yes", "defaulter","no defaulter")
bank$housing <-ifelse(bank$housing =="yes", "housing loan","no housing loan")
bank$loan <-ifelse(bank$loan =="yes", "personal loan","no personal loan")

4. DATA EXPLORATION

4.1. Univariate Visualization

For univariate visualization, bar charts are used for categorical features and box-histogram plots for numerical features. A bar chart places the categories on the x axis and the frequency (count) on the y axis, which is useful for presenting the count by category. A box-histogram plot combines a histogram and a box plot: the histogram shows the shape of the underlying distribution, while the box plot shows the range of the attribute and helps detect outliers.
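The box-histogram layout described above can be wrapped in a small helper so that it is written only once. The following is a sketch rather than the code used for the figures in this report: `box_hist_plot` is a hypothetical function, `toy` is made-up data, and the helper assumes a ggplot2 version that supports the `.data` pronoun inside aes().

```r
library(ggplot2)
library(cowplot)

# Histogram stacked above a horizontal box plot for one numeric column
box_hist_plot <- function(df, var, title = NULL, bins = 30) {
  hist_part <- ggplot(df, aes(x = .data[[var]])) +
    geom_histogram(colour = "white", bins = bins, alpha = 1/2) +
    labs(title = title, x = var)
  box_part <- ggplot(df, aes(x = factor(1), y = .data[[var]])) +
    geom_boxplot(width = 0.5) +
    coord_flip() +
    theme(axis.title.y = element_blank(),
          axis.text.y = element_blank(),
          axis.ticks.y = element_blank())
  # Stack the two panels, giving the histogram twice the height
  plot_grid(hist_part, box_part, ncol = 1, align = "v", rel_heights = c(2, 1))
}

# Toy data standing in for a numeric column of the bank data frame
toy <- data.frame(age = c(25, 31, 38, 39, 41, 47, 52, 60, 95))
fig <- box_hist_plot(toy, "age", title = "Toy example")
```

With the report's data, a call such as `box_hist_plot(bank, "duration", title = "Figure 11")` would produce the same kind of figure.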

4.1.1 Categorical Features

Figure 1 indicates that, of all the people contacted for the marketing campaign, close to 60% are blue-collar workers, management professionals or technicians. Around 27,000 of those contacted are married (Figure 2), and around 23,000 have secondary education (Figure 3). Almost all have no credit in default, with fewer than 1,000 defaulters (Figure 4). More than half have a housing loan (Figure 5), whereas comparatively few, around 7,000, have a personal loan (Figure 6). Figure 7 indicates that the outcome of the previous campaign is unknown for the large majority of individuals (about 37,000), so this feature is unlikely to be informative for most clients and is a candidate for exclusion during predictive modelling. The bar chart of the target feature (Figure 8) illustrates that a large proportion of individuals, about 88%, do not subscribe to the term deposit.
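Statements about the combined share of the most common categories can be checked numerically instead of being read off a bar chart. A minimal sketch, where `top_k_share` is a hypothetical helper and `toy_job` is made-up data:

```r
# Combined share of the k most frequent levels of a categorical vector
top_k_share <- function(x, k = 3) {
  shares <- sort(prop.table(table(x)), decreasing = TRUE)
  sum(head(shares, k))
}

# Toy job vector: 4 blue-collar, 3 management, 2 technician, 1 student
toy_job <- c(rep("blue-collar", 4), rep("management", 3),
             rep("technician", 2), "student")
share_top3 <- top_k_share(toy_job, k = 3)
print(share_top3)
```

For the report's data the call would be `top_k_share(bank$job, k = 3)`.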

4.1.1.1 Job

job_sum <- bank%>% group_by(job) %>% summarise(count = n())
job_sum$job <- job_sum$job %>% factor(levels = job_sum$job[order(-job_sum$count)]) 
ggplot(job_sum,aes(x = job, y = count)) + geom_bar(stat="identity") +theme(axis.text.x=element_text(angle=45,hjust=1)) + labs(title = "Figure 1")

4.1.1.2 Marital status

marital_sum <- bank%>% group_by(marital) %>% summarise(count = n())
marital_sum$marital <- marital_sum$marital %>% factor(levels = marital_sum$marital[order(-marital_sum$count)]) 
ggplot(marital_sum,aes(x = marital, y = count)) + geom_bar(stat="identity") + labs(title = "Figure 2")

4.1.1.3 Education

education_sum <- bank%>% group_by(education) %>% summarise(count = n())
education_sum$education <- education_sum$education %>% factor(levels = education_sum$education[order(-education_sum$count)]) 
ggplot(education_sum,aes(x = education, y = count)) + geom_bar(stat="identity") + labs(title = "Figure 3")

4.1.1.4 Default Credit

default_sum <- bank%>% group_by(default) %>% summarise(count = n())
default_sum$default <- default_sum$default %>% factor(levels = default_sum$default[order(-default_sum$count)]) 
ggplot(default_sum,aes(x = default, y = count)) + geom_bar(stat="identity") + labs(title = "Figure 4")

4.1.1.5 Housing Loan

housing_sum <- bank%>% group_by(housing) %>% summarise(count = n())
housing_sum$housing <- housing_sum$housing %>% factor(levels = housing_sum$housing[order(-housing_sum$count)]) 
ggplot(housing_sum,aes(x = housing, y = count)) + geom_bar(stat="identity") + labs(title = "Figure 5")

4.1.1.6 Personal loan

loan_sum <- bank%>% group_by(loan) %>% summarise(count = n())
loan_sum$loan <- factor(c(loan_sum$loan), levels = c("personal loan", "no personal loan"), ordered = TRUE)
ggplot(loan_sum,aes(x = loan, y = count)) + geom_bar(stat="identity") + labs(title = "Figure 6")

4.1.1.7 Previous Outcome

poutcome_sum <- bank%>% group_by(poutcome) %>% summarise(count = n())
poutcome_sum$poutcome <- poutcome_sum$poutcome %>% factor(levels = poutcome_sum$poutcome[order(-poutcome_sum$count)]) 
ggplot(poutcome_sum,aes(x = poutcome, y = count)) + geom_bar(stat="identity") + labs(title = "Figure 7")

4.1.1.8 Target Y

y_sum <- bank%>% group_by(y) %>% summarise(count = n())
y_sum$y <- y_sum$y %>% factor(levels = y_sum$y[order(-y_sum$count)]) 
ggplot(y_sum,aes(x = y, y = count)) + geom_bar(stat="identity") + labs(title = "Figure 8")

4.1.2. Numerical features

Figure 9 shows that most individuals are aged between 30 and 60. The box plot of the average yearly balance (Figure 10) shows a median of 448 euros, which is close to zero relative to the overall range (-8,019 to 102,127 euros), signifying that most of the people contacted have a low average yearly balance. Figure 11 shows that the vast majority of calls last less than 500 seconds (about 8 minutes), with a median of 180 seconds (3 minutes). Most clients were contacted only once or twice during this campaign, as illustrated in Figure 12. Figure 13 displays the number of days passed since the client was last contacted in a previous campaign; the median of -1 (the code for “not previously contacted”) with no interquartile range reflects that most individuals are contacted for the first time during this campaign. Figure 14 confirms this reading: the number of contacts performed before this campaign has a median of zero with no interquartile range, meaning that most of the clients had not been contacted previously.
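The medians and interquartile ranges discussed above can be computed directly, together with the share of clients carrying the -1 code in “pdays”. This is a sketch on made-up values; `spread_summary` is a hypothetical helper and `toy_pdays` only imitates the shape of the real column.

```r
# Quartiles, median and interquartile range of a numeric vector
spread_summary <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
  c(q1 = q[1], median = q[2], q3 = q[3], iqr = q[3] - q[1])
}

# Toy pdays-like vector: -1 marks clients with no earlier contact
toy_pdays <- c(-1, -1, -1, -1, -1, -1, -1, -1, 92, 180)
pdays_summary <- spread_summary(toy_pdays)
share_not_contacted <- mean(toy_pdays == -1)
print(pdays_summary)
print(share_not_contacted)
```

When most values equal -1, both quartiles collapse onto -1 and the interquartile range is zero, which is the pattern seen in Figure 13.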

4.1.2.1. Age

p <- ggplot(bank, aes(x = factor(1), y = age)) +   geom_boxplot(width = .50)
p1 <- ggplot(bank, aes(x = age)) +
  geom_density(fill = "dodgerblue", alpha = .2) +
  geom_histogram(colour="white",aes(y=..density..),alpha = 1/2) +
  geom_vline(xintercept= median(bank$age)) +
  annotate("text",label = "Median",x = 39, y = 0.045) +
  geom_vline(xintercept= mean(bank$age),linetype=2) +
  annotate("text",label = "Mean",x = 41, y = 0.05) + labs(title = "Figure 9")
plot_grid(p1, p + coord_flip() + theme(axis.title.y=element_blank(), 
                                        axis.text.y=element_blank(),
                                        axis.ticks.y = element_blank()), ncol=1, align="v",
          rel_heights = c(2,1))

4.1.2.2. Average Yearly Balance

p_balance <- ggplot(bank, aes(x = factor(1), y = balance)) +geom_boxplot(width = .50)
                  
p1_balance <- ggplot(bank, aes(x = balance)) + labs(title = "Figure 10") + geom_histogram(colour="white",aes(y=..count..), bins = 10)

plot_grid(p1_balance, p_balance + coord_flip() + theme(axis.title.y=element_blank(), 
                                       axis.text.y=element_blank(),
                                       axis.ticks.y = element_blank()), ncol=1, align="v",
          rel_heights = c(2,1))

4.1.2.3. Duration of last call

p_duration <- ggplot(bank, aes(x = factor(1), y = duration)) +   geom_boxplot(width = .50)
p1_duration <- ggplot(bank, aes(x = duration)) +
labs(title = "Figure 11") + geom_histogram(colour ="white", aes(y=..count..),alpha = 1/2)
plot_grid(p1_duration, p_duration + coord_flip() + theme(axis.title.y=element_blank(),
                                                       axis.text.y=element_blank(),
                                                       axis.ticks.y = element_blank()), ncol=1, align="v",
          rel_heights = c(2,1))

4.1.2.4. Campaign - Contacts made to a client during this Campaign

p_campaign <- ggplot(bank, aes(x = factor(1), y = campaign )) + labs(y ="Number of contact") + geom_boxplot(width = .50)
p1_campaign <- ggplot(bank, aes(x = campaign)) + labs(title = "Figure 12", x ="Number of contact") + geom_histogram(colour ="white", aes(y=..count..),alpha = 1/2)
plot_grid(p1_campaign, p_campaign + coord_flip() + theme(axis.title.y=element_blank(),
                                                         axis.text.y=element_blank(),
                                                         axis.ticks.y = element_blank()), ncol=1, align="v",
          rel_heights = c(2,1))

4.1.2.5. Pdays - Number of days passed after the client was last contacted

p_pdays <- ggplot(bank, aes(x = factor(1), y = pdays)) +  labs(y ="Number of days passed") + geom_boxplot(width = .50)
p1_pdays <- ggplot(bank, aes(x = pdays)) +
labs (title = "Figure 13",x ="Number of days passed") + geom_histogram(colour ="white", aes(y=..count..),alpha = 1/2)
plot_grid(p1_pdays, p_pdays + coord_flip() + theme(axis.title.y=element_blank(),
                                                         axis.text.y=element_blank(),
                                                         axis.ticks.y = element_blank()), ncol=1, align="v",
          rel_heights = c(2,1))

4.1.2.6. Previous - Number of contacts performed before this campaign

p_previous <- ggplot(bank, aes(x = factor(1), y = previous)) + labs(y = "Number of contacts") + geom_boxplot(width = .50)
p1_previous <- ggplot(bank, aes(x = previous)) +
  labs(title = "Figure 14", x= "Number of contacts") + geom_histogram(colour ="white", aes(y=..count..),alpha = 1/2)
plot_grid(p1_previous, p_previous + coord_flip() + theme(axis.title.y=element_blank(), 
                                                         axis.text.y=element_blank(),
                                                         axis.ticks.y = element_blank()), ncol=1, align="v",
          rel_heights = c(2,1))
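The box plots above mark as outliers the points that lie more than 1.5 times the interquartile range beyond the quartiles. The number of such points can be counted with a few lines of base R. This is a sketch: `count_outliers` is a hypothetical helper, `toy_balance` is made-up data, and because quantile() and the box plot compute the quartiles slightly differently, the counts may differ marginally from what the plot shows.

```r
# Number of values outside the box plot whiskers
# (more than 1.5 * IQR below the first or above the third quartile)
count_outliers <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence)
}

# Toy balance vector with one extreme value
toy_balance <- c(0, 120, 300, 450, 500, 700, 900, 1500, 102127)
n_out <- count_outliers(toy_balance)
print(n_out)
```

Applied to the report's data, `sapply(bank[c("age", "balance", "duration")], count_outliers)` would give one count per feature.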

4.2. Multivariate Visualization

Each feature of the dataset has been explored individually; the features are now explored in relation to the target feature “y” and its levels (yes/no). After this, the likely relationships between pairs of descriptive features in relation to the target feature are explored, followed by the correlation amongst the numerical features with the help of a scatter plot matrix.

In the chunk below, the target feature is converted to an ordered factor and the data are then divided into two subsets, one for “yes” and one for “no”, to make the multivariate visualizations easier to plot and interpret.

#Factorise the target feature y
bank$y <- factor(c(bank$y), levels = c("yes","no"), ordered = TRUE)

#Divide the data into two separate subsets by target value (yes and no)
bank_yes <- bank %>% filter(y == "yes")
bank_no <- bank %>% filter(y == "no")

4.2.1. Numeric Features Segregated by Target Y

First, the numeric features are explored in relation to the target feature, except “pdays” and “previous”: most of the clients contacted during this campaign were contacted for the first time, so comparisons on these features are unlikely to yield meaningful insights. Likewise, the majority of people were contacted only once or twice, so “campaign” is also not explored with respect to the target feature. The categorical features are explored afterwards.

4.2.1.1. Age

People from all age groups subscribe to the term deposit, but those between 30 and 40 do so the most. The same age group also has the highest count of people who did not subscribe. This suggests that the pattern mainly reflects the size of the group: it is the most contacted group because it makes up the largest share of the total.

ggplot(data=bank, aes(age, fill =y)) + geom_histogram(aes(y=..count..)) + facet_grid(~y)

4.2.1.2. Balance

A careful comparison of the two visualizations suggests that people with a near-zero or low average yearly balance are the least likely to subscribe: very few of them opt for the term deposit. Note that both plots are restricted to balances between 0 and 2,800 euros.

balance_yes <- ggplot(bank_yes,aes(balance)) + geom_histogram(binwidth = 10) + labs(title = "Term Deposits Yes by Balance", x="Balance", y="Count") + xlim(c(0,2800)) + ylim(c(0,1000))
balance_no <- ggplot(bank_no,aes(balance)) + geom_histogram(binwidth = 10) + labs(title = "Term Deposits No by Balance", x="Balance", y="Count") + xlim(c(0,2800)) +ylim(c(0,1000))
grid.arrange(balance_yes,balance_no)

4.2.1.3. Duration

The duration has an important bearing on the target outcome: when the duration is 0, the outcome is always “no”. The other notable finding is that almost all of the people who do not subscribe decide within the first 8 minutes, whereas people who subscribe sometimes take a little longer to be convinced and decide.

duration_yes <- ggplot(bank_yes,aes(duration)) + geom_histogram(binwidth = 10) + labs(title = "Term Deposits Yes by Duration", x="Duration", y="Count") + xlim(c(0,2800))
duration_no <- ggplot(bank_no,aes(duration)) + geom_histogram(binwidth = 10) + labs(title = "Term Deposits No by Duration", x="Duration", y="Count") + xlim(c(0,2800))
grid.arrange(duration_yes,duration_no)

4.2.2. Categorical Features Segregated by Target Y

4.2.2.1. Job

Of the top three jobs (by total count), people in management jobs take up the highest number of term deposits, followed by technicians. However, this is largely expected given the high share of these groups in the total count.

bank_yes_job <-bank_yes%>% group_by(job) %>% summarise(count = n())
bank_yes_job$y <- c(strrep("yes",1))
bank_no_job <-bank_no%>% group_by(job) %>% summarise(count = n())
bank_no_job$y <- c(strrep("no",1))
bank_job <- rbind(bank_yes_job, bank_no_job)
bank_job$y <- factor(c(bank_job$y), levels = c("yes","no"), ordered = TRUE)
ggplot(bank_job,aes(x = job, y = count, fill = y)) + geom_bar(stat="identity", position ="dodge") + theme(axis.text.x=element_text(angle=45,hjust=1))
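Since raw counts largely mirror the size of each group, a within-group subscription rate makes the categories directly comparable. The sketch below is illustrative only: `subscription_rate` is a hypothetical helper and `toy` is made-up data standing in for the “bank” data frame.

```r
# Share of "yes" outcomes within each level of a grouping variable
subscription_rate <- function(group, outcome) {
  tapply(outcome == "yes", group, mean)
}

# Toy data: management has more subscribers in absolute terms only if
# the group is large; the rate removes that size effect
toy <- data.frame(
  job = c("management", "management", "management", "management",
          "student", "student"),
  y = c("yes", "no", "no", "no", "yes", "no"),
  stringsAsFactors = FALSE
)
rates <- subscription_rate(toy$job, toy$y)
print(rates)
```

The same helper applies to the other categorical features, for example `subscription_rate(bank$marital, bank$y)`.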

4.2.2.2. Marital status

Given the high proportion of married people in the total count, it is not surprising that this group subscribes to the most term deposits compared with the other two groups.

bank_yes_marital <- bank_yes%>% group_by(marital) %>% summarise(count = n())
bank_no_marital <- bank_no  %>% group_by(marital) %>% summarise(count = n())
bank_yes_marital$y <-c(strrep("yes",1))
bank_no_marital$y <-c(strrep("no",1))
bank_marital <- rbind(bank_yes_marital,bank_no_marital)
bank_marital$y <- factor(c(bank_marital$y), levels = c("yes","no"), ordered = TRUE)
ggplot(bank_marital,aes(x = marital, y = count, fill = y)) + geom_bar(stat="identity", position ="dodge") + facet_grid(~y)

4.2.2.3. Education

Given the large proportion of people with secondary education in the total count, it is not surprising that they also top the chart in terms of subscribing to the term deposit.

bank_yes_education <-bank_yes%>% group_by(education) %>% summarise(count = n())
bank_no_education <-bank_no%>% group_by(education) %>% summarise(count = n())
bank_yes_education$y <- c(strrep("yes",1))
bank_no_education$y <- c(strrep("no",1))
bank_education <- rbind(bank_yes_education, bank_no_education)
bank_education$y <- factor(c(bank_education$y), levels = c("yes","no"), ordered = TRUE)
ggplot(bank_education,aes(x = education, y = count, fill = y)) + geom_bar(stat="identity", position ="dodge") + facet_grid(~y) + theme(axis.text.x=element_text(angle=45,hjust=1))

4.2.2.4. Default

It is expected that “no defaulters” account for most term deposits, given their high share of the total count. What is intriguing is that some defaulters also subscribe to the term deposit.

bank_yes_default <-bank_yes%>% group_by(default) %>% summarise(count = n())
bank_no_default <-bank_no%>% group_by(default) %>% summarise(count = n())
bank_yes_default$y <- c(strrep("yes",1))
bank_no_default$y <- c(strrep("no",1))
bank_default <- rbind(bank_yes_default, bank_no_default)
bank_default$y <- factor(c(bank_default$y), levels = c("yes","no"), ordered = TRUE)
ggplot(bank_default,aes(x = default, y = count, fill = y)) + geom_bar(stat="identity", position ="dodge") + facet_grid(~y)

4.2.2.5. Housing

Interestingly, although people with a housing loan make up a higher proportion of the total count than those without, fewer of them subscribe to the term deposit. It is plausible that having a housing loan reduces a person’s inclination to invest in a term deposit.

bank_yes_housing <-bank_yes%>% group_by(housing) %>% summarise(count = n())
bank_no_housing <-bank_no%>% group_by(housing) %>% summarise(count = n())
bank_yes_housing$y <- c(strrep("yes",1))
bank_no_housing$y <- c(strrep("no",1))
bank_housing <- rbind(bank_yes_housing, bank_no_housing)
bank_housing$y <- factor(c(bank_housing$y), levels = c("yes","no"), ordered = TRUE)
ggplot(bank_housing,aes(x = housing, y = count, fill = y)) + geom_bar(stat="identity", position ="dodge") + facet_grid(~y)

4.2.2.6. Loan

Because people without a personal loan make up the large majority of the total count, it is not surprising that they also top the chart in terms of subscribing to term deposits. It would therefore be interesting to find out whether a loan, be it housing or personal, affects a person’s likelihood of subscribing to a term deposit. This aspect is explored further later in this section together with other features.

bank_yes_loan <-bank_yes%>% group_by(loan) %>% summarise(count = n())
bank_no_loan <-bank_no%>% group_by(loan) %>% summarise(count = n())
bank_yes_loan$y <- c(strrep("yes",1))
bank_no_loan$y <- c(strrep("no",1))
bank_loan <- rbind(bank_yes_loan, bank_no_loan)
bank_loan$y <- factor(c(bank_loan$y), levels = c("yes","no"), ordered = TRUE)
ggplot(bank_loan,aes(x = loan, y = count, fill = y)) + geom_bar(stat="identity", position ="dodge") + facet_grid(~y)
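One way to look at both loan types at once is to compute the subscription rate for every combination of housing loan and personal loan. This is a sketch on made-up data; `rate_by_loans` is a hypothetical helper and is not part of the analysis reported here.

```r
# Subscription rate for every combination of housing loan and personal loan
rate_by_loans <- function(housing, loan, outcome) {
  tapply(outcome == "yes", list(housing = housing, loan = loan), mean)
}

# Toy data using the relabelled values from the preprocessing step
toy <- data.frame(
  housing = c("housing loan", "housing loan", "no housing loan",
              "no housing loan", "no housing loan", "housing loan"),
  loan = c("personal loan", "no personal loan", "no personal loan",
           "no personal loan", "personal loan", "no personal loan"),
  y = c("no", "no", "yes", "no", "no", "yes"),
  stringsAsFactors = FALSE
)
loan_rates <- rate_by_loans(toy$housing, toy$loan, toy$y)
print(loan_rates)
```

With the report's data the call would be `rate_by_loans(bank$housing, bank$loan, bank$y)`, giving a 2 x 2 table of rates.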

4.2.3 Interaction between Categorical and Numeric Features

4.2.3.1 Balance VS Default

It is difficult to draw firm conclusions about defaulters and balance because the box plot is compressed by extreme values. Nevertheless, people who pay their dues on time (i.e. no defaulters) appear to have a slightly higher average yearly balance, and people with a higher balance appear more likely to subscribe to term deposits, since the median balance of subscribers is higher than that of the other groups. In short, the figure suggests that no defaulters with a yearly average balance close to the median are more likely to take up a term deposit at the bank.

ggplot(data = bank, aes(x=default, y = balance, fill = y)) + geom_boxplot(width = .50) +scale_fill_manual(values = c("red","blue"))

4.2.3.2 Duration VS Housing

It is interesting to note from this figure that, whether or not people have housing loans, those who subscribe to term deposits spend more time on the marketing phone calls than those who do not. Also, those who already have a housing loan spend more time on the call in deciding about the subscription than those who have none.

ggplot(data = bank, aes(x=housing, y = duration, fill = y)) + geom_boxplot(width= .50) +scale_fill_manual(values = c("red","blue"))

4.2.3.3 Duration VS Loan

People who already have a personal loan appear to spend more time on the phone during the marketing campaign than those who have none, and they also appear more likely to subscribe to term deposits. However, these are assumptions only and need to be confirmed by further investigation.

ggplot(data = bank, aes(x=loan, y = duration, fill = y)) + geom_boxplot(width= .50) +scale_fill_manual(values = c("red","blue"))

4.2.4 Scatter plot matrix of Numerical Features by Target Y

The following scatter plot matrix for the selected numerical features is generated using the “GGally” package in R. A scatter plot matrix is a collection of scatter plots organised into a grid, where each scatter plot shows the relationship between a pair of variables. The matrix shows that none of the pairs of numerical features has a notable relationship. This is confirmed by the correlation coefficients: none of the pairs has a coefficient larger than 0.09 in absolute value, and some are as low as 0.02, so the features can be considered uncorrelated.

ggpairs(bank, columns = c(1,6,9,10,11),axisLabels = "internal")
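The correlation coefficients displayed in the matrix can also be obtained as a plain table with cor(). The sketch below uses made-up data and assumes Pearson correlation; `numeric_cor` is a hypothetical helper.

```r
# Pearson correlation matrix of the numeric columns of a data frame
numeric_cor <- function(df) {
  num_cols <- vapply(df, is.numeric, logical(1))
  round(cor(df[num_cols]), 2)
}

# Toy data: duration is an exact linear function of age, balance is not
toy <- data.frame(
  age = c(25, 35, 45, 55, 65),
  balance = c(100, 50, 400, 20, 300),
  duration = c(60, 120, 180, 240, 300),
  job = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)
cor_mat <- numeric_cor(toy)
print(cor_mat)
```

For the report's data, `numeric_cor(bank[c(1, 6, 9, 10, 11)])` would cover the same columns as the scatter plot matrix.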

5. SUMMARY

In Phase 1 of the project, the chosen dataset has 16 descriptive features and 1 target feature. All features are taken into account except “contact”, “day” and “month”, as they were judged not to contribute anything significant to the outcome. After exploring the features individually, it was found that most clients were contacted only once or twice and had not been contacted in previous campaigns, so “campaign”, “poutcome”, “pdays” and “previous” were judged to have little bearing on the target feature of the current campaign. Consequently, these were also omitted from the multivariate exploration. The data were found to have no missing values, typos, case errors or extra white space after thorough checking during data preprocessing. The dataset has outliers in almost all the numerical attributes, as can be seen in the individual visualizations. However, these were not removed or imputed, for two reasons. First, the outliers appear to be genuine observations rather than random errors, and removing them would have substantially altered the data. Second, handling outliers (whether or not they occur in significant proportion) requires subject matter expertise, and in the absence of such domain knowledge it was decided to proceed without handling them. The exploration provides insights that can be taken into account during model building in Phase 2.

6. CITATION

[1] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014.

Machine Learning Project Phase 1

Predicting subscription to term deposit using the Bank Marketing Data Set

Anh Phan - s3258110 and Neeraj Sehrawat - s3711712