1 Introduction

This report will analyze the Bank Marketing data from marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same customer was required, in order to access if the product (bank term deposit) would be (or not) subscribed.

2 Data Input & Library

#Entering data input from CSV file with ";" (semicolon) as separator.
bank <- read.csv("bank.csv", sep=";",header=T)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.1.2
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.1.2

2.1 Data Inspection

#Here's a quick look at the initial data set we're using.
head(bank)
##   age         job marital education default balance housing loan  contact day
## 1  30  unemployed married   primary      no    1787      no   no cellular  19
## 2  33    services married secondary      no    4789     yes  yes cellular  11
## 3  35  management  single  tertiary      no    1350     yes   no cellular  16
## 4  30  management married  tertiary      no    1476     yes  yes  unknown   3
## 5  59 blue-collar married secondary      no       0     yes   no  unknown   5
## 6  35  management  single  tertiary      no     747      no   no cellular  23
##   month duration campaign pdays previous poutcome  y
## 1   oct       79        1    -1        0  unknown no
## 2   may      220        1   339        4  failure no
## 3   apr      185        1   330        1  failure no
## 4   jun      199        4    -1        0  unknown no
## 5   may      226        1    -1        0  unknown no
## 6   feb      141        2   176        3  failure no
tail(bank)
##      age           job marital education default balance housing loan  contact
## 4516  32      services  single secondary      no     473     yes   no cellular
## 4517  33      services married secondary      no    -333     yes   no cellular
## 4518  57 self-employed married  tertiary     yes   -3313     yes  yes  unknown
## 4519  57    technician married secondary      no     295      no   no cellular
## 4520  28   blue-collar married secondary      no    1137      no   no cellular
## 4521  44  entrepreneur  single  tertiary      no    1136     yes  yes cellular
##      day month duration campaign pdays previous poutcome  y
## 4516   7   jul      624        5    -1        0  unknown no
## 4517  30   jul      329        5    -1        0  unknown no
## 4518   9   may      153        1    -1        0  unknown no
## 4519  19   aug      151       11    -1        0  unknown no
## 4520   6   feb      129        4   211        3    other no
## 4521   3   apr      345        2   249        7    other no
#Below are the dimension of the dataset.
dim(bank)
## [1] 4521   17

Bank Marketing data contains 4,521 rows and 17 columns.

Below are the columns name in this dataset.

#Below are the columns in the dataset.
names(bank)
##  [1] "age"       "job"       "marital"   "education" "default"   "balance"  
##  [7] "housing"   "loan"      "contact"   "day"       "month"     "duration" 
## [13] "campaign"  "pdays"     "previous"  "poutcome"  "y"

2.2 Data Cleansing & Coertions

First, we need to check any missing value.

colSums(is.na(bank))
##       age       job   marital education   default   balance   housing      loan 
##         0         0         0         0         0         0         0         0 
##   contact       day     month  duration  campaign     pdays  previous  poutcome 
##         0         0         0         0         0         0         0         0 
##         y 
##         0
anyNA(bank)
## [1] FALSE

Luckily, there is no missing value in the dataset.

After that we need to check the data type in each column.

#Inspecting data type for each column.
str(bank)
## 'data.frame':    4521 obs. of  17 variables:
##  $ age      : int  30 33 35 30 59 35 36 39 41 43 ...
##  $ job      : chr  "unemployed" "services" "management" "management" ...
##  $ marital  : chr  "married" "married" "single" "married" ...
##  $ education: chr  "primary" "secondary" "tertiary" "tertiary" ...
##  $ default  : chr  "no" "no" "no" "no" ...
##  $ balance  : int  1787 4789 1350 1476 0 747 307 147 221 -88 ...
##  $ housing  : chr  "no" "yes" "yes" "yes" ...
##  $ loan     : chr  "no" "yes" "no" "yes" ...
##  $ contact  : chr  "cellular" "cellular" "cellular" "unknown" ...
##  $ day      : int  19 11 16 3 5 23 14 6 14 17 ...
##  $ month    : chr  "oct" "may" "apr" "jun" ...
##  $ duration : int  79 220 185 199 226 141 341 151 57 313 ...
##  $ campaign : int  1 1 1 4 1 2 1 2 2 1 ...
##  $ pdays    : int  -1 339 330 -1 -1 176 330 -1 -1 147 ...
##  $ previous : int  0 4 1 0 0 3 2 0 0 2 ...
##  $ poutcome : chr  "unknown" "failure" "failure" "unknown" ...
##  $ y        : chr  "no" "no" "no" "no" ...

From the initial result above, we need to change some columns to correct type by using data coertion as seen below. Also to help further analysis, we need to add interval columns such as age and balance.

bank$age <- as.numeric(bank$age)
bank$job <- as.factor(bank$job)
bank$marital <- as.factor(bank$marital)
bank$education <- as.factor(bank$education)
bank$default <- as.factor(bank$default)
bank$balance <- as.numeric(bank$balance)
bank$housing <- as.factor(bank$housing)
bank$loan <- as.factor(bank$loan)
bank$contact <- as.factor(bank$contact)
bank$day <- as.factor(bank$day)
bank$month <- as.factor(bank$month)
bank$duration <- as.numeric(bank$duration)
bank$campaign <- as.numeric(bank$campaign)
bank$pdays <- as.numeric(bank$pdays)
bank$previous <- as.numeric(bank$previous)
bank$poutcome <- as.factor(bank$poutcome)
bank$y <- as.factor(bank$y)
str(bank)
## 'data.frame':    4521 obs. of  17 variables:
##  $ age      : num  30 33 35 30 59 35 36 39 41 43 ...
##  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
##  $ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ...
##  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ balance  : num  1787 4789 1350 1476 0 ...
##  $ housing  : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 2 2 2 2 ...
##  $ loan     : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 2 ...
##  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 3 1 ...
##  $ day      : Factor w/ 31 levels "1","2","3","4",..: 19 11 16 3 5 23 14 6 14 17 ...
##  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 9 9 9 1 ...
##  $ duration : num  79 220 185 199 226 141 341 151 57 313 ...
##  $ campaign : num  1 1 1 4 1 2 1 2 2 1 ...
##  $ pdays    : num  -1 339 330 -1 -1 176 330 -1 -1 147 ...
##  $ previous : num  0 4 1 0 0 3 2 0 0 2 ...
##  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
##  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Also to help further analysis, we need to add interval columns such as age and balance.For age interval, first we have to identify the unique value so we can have a glance for the age range.

age_unq <- data.frame(unique(bank$age))
age_u_sort <- age_unq[order(age_unq$unique.bank.age., decreasing=F),]
age_u_sort
##  [1] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
## [26] 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
## [51] 69 70 71 72 73 74 75 76 77 78 79 80 81 83 84 86 87

Based on the age data, we will group age in to following categories:

  • Young Adults (18-25)
  • Adults (26-40)
  • Middle Age Adults (41-60)
  • Older Adults (over 60)
#Below are functions to add age interval column.
age_func <- function(y){
  if(y>=18&y<=25){y <- "Young Adults"}
  else
    if(y>=26&y<=40){y <- "Adults"}
  else
    if(y>=41&y<=60){y <- "Middle-aged"}
  else
    if(y>60){y <- "Old Adults"}
}
bank$age_int <- sapply(X=bank$age, FUN=age_func)

After that we need to change the age interval column (age_int) as factor.

bank$age_int <- as.factor(bank$age_int)

3. Data Analysis

For introduction, we want to see percentage of customers response towards this campaign.

succ_tab <- data.frame(prop.table((table(bank$y)))*100)
colnames(succ_tab) <- c("response","perc(%)")
succ_tab
##   response perc(%)
## 1       no  88.476
## 2      yes  11.524

From the table above, we can see that 11,5% responded positively towards this campaign. Let’s look in the details that would affect this outcome.

3.1 Age

From the chart below, customer’s age are dominantly between. Below are the distribution of the customer age.

ggplot(bank, aes(x=age))+
  geom_bar()+
  labs(x="Age",y="Count of Customers", title="Age of Customers")+
  theme_classic()+
  scale_x_continuous(breaks = seq(0, 90, 5))

plot(bank$age_int, main="Age Interval Distribution")

age_tab <- data.frame(table(bank$age_int,bank$y))
colnames(age_tab) <- c("age","response","number_of_customers")
ggplot(data=age_tab, aes(x=age,y=number_of_customers, fill=response))+
  geom_bar(stat = 'identity', position = 'dodge')+
  labs(X="Number of customers",
       y=NULL,
       title="Campaign Result Age Distribution")

By looking at the chart above, most customers who responded positively are in the range of Adults or between 26-40 years old, though in term of percentage is lower than other age categories. The interesting part is the higher percentage positive response is in the Old-Adults category compared with category, as shown by chart below.

oldadl_tab <- age_tab[(age_tab=="Old Adults"),]
ggplot(data=oldadl_tab, aes(x=age,y=number_of_customers, fill=response))+
  geom_bar(stat = 'identity', position = 'dodge')+
  labs(X="Number of customers",
       y=NULL,
       title="Old Adults Response")

3.2 Marital

We want to see the proportion of marital status and response of the customer.

mar_tab = data.frame(table(bank$marital, bank$y))
colnames(mar_tab) <-
  c("marital_status","response","number_of_customers")

ggplot(data=mar_tab, aes(x=marital_status,y=number_of_customers, fill=response))+
  geom_bar(stat = 'identity', position = 'dodge')+
  labs(X="Number of customers",
       y=NULL,
       title="Campaign Marital Status Distribution")

From the chart above, the marketing target are mostly married customer. Also, we can see that married customers are likely to respond positively compared to divorced and single.

3.3 Education

We want to know education level of the customers and their response toward this campaign.

ed_tab <- data.frame(table(bank$education,bank$y))
colnames(ed_tab) <- c("education","response","number_of_customers")
ggplot(data=ed_tab, aes(x=education,y=number_of_customers, fill=response))+
  geom_bar(stat = 'identity', position = 'dodge')+
  labs(X="Number of customers",
       y=NULL,
       title="Campaign Result Education Distribution")

We can see that most customers are in the secondary level. Looking by the response from the chart above, the higher the education, more likely the customer to give positive response toward the campaign.

3.4 Default

def_tab <- data.frame(table(bank$default))
colnames(def_tab) <- c("default","number_of_customers")
ggplot(data=def_tab, aes(x=default,y=number_of_customers, fill=default))+
  geom_bar(stat = 'identity', position = 'dodge')+
  labs(X="Number of customers",
       y=NULL,
       title="Default Customer Distribution")

Looking at the chart above that most of the customers in the marketing campaign have no default histories, which is logical that customers who default are less likely have idle fund.

Now we want to see the response distribution among the default status.

def_res_tab <- data.frame(table(bank$default,bank$y))
colnames(def_res_tab) <- c("default","response","number_of_customers")
ggplot(data=def_res_tab, aes(x=default,y=number_of_customers, fill=response))+
  geom_bar(stat = 'identity', position = 'dodge')+
  labs(X="Number of customers",
       y=NULL,
       title="Default Status Distribution")

From the chart above we can see that the most customers who responded positively have no default in their credit history.

3.5 Call Duration

We want to map the call duration within each response gave by the customer.

cl_dur <- bank[,c("y","duration")]
ggplot(cl_dur, aes(x=y,y=duration))+
  geom_boxplot(fill= "red", color="black")+
  theme_light()

Looking from the chart above that most successful campaign has higher median in call duration. So, the longer call could increase the chances of getting successful campaign.

3.6 Month

We want to see the call performed each month.

bank$month <- factor(bank$month,levels = c("jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"))
ggplot(data=bank, aes(x=month, fill=month))+
  geom_bar()+
  labs(X="Number of customers",
       y=NULL,
       title="Call Performed each Month")+theme_light()

We can see, that the highest number of call performed happened in May. Next is we want to know the result of the call performed each month.

ggplot(data=bank, aes(x=month, fill=y))+
  geom_bar(position = 'dodge')+
  labs(y=NULL,
       title="Call Performed each Month")+theme_light()

As we can, more call performed doesn’t always mean more successful campaign. Many factors could affected the anomaly that happened in May.

3.7 Job

We want to map the customers’ job in this campaign.

job_tab <- data.frame(table(bank$job))
colnames(job_tab) <- c("job", "count")
ggplot(data=job_tab, aes(x=count, y=reorder(job,count), fill=job))+
  geom_bar(stat = 'identity')+
  labs(X=NULL,
       y=NULL,
       title="Customers' Job")+
  theme_light()

We can see from the chart above, most of the customers’ occupation are management and blue collar. Now we wan to see the response of each occupation.

job_y_tab <- data.frame(table(bank$job, bank$y))
colnames(job_y_tab) <- c("job","response","count")
ggplot(data=job_y_tab, aes(x=count,y=reorder(job,count), fill=response))+
  geom_bar(stat = 'identity', position = 'dodge')+
  labs(X="Number of customers",
       y=NULL,
       title="Campaign Result Job Distribution")

From the chart above, we can see that customers in management job are more likely to responded positively with the campaign.

We want to dig deeper with this job distribution. Usually, occupation is closely related to financial condition or their saving account. We would like to know average balance of each job category.

job_avg_bal <- aggregate(formula=balance~job, data=bank, FUN=mean)
job_avg_bal <- job_avg_bal[order(job_avg_bal$balance, decreasing = T),]
job_avg_bal
##              job  balance
## 6        retired 2319.191
## 4      housemaid 2083.804
## 5     management 1766.929
## 3   entrepreneur 1645.125
## 9        student 1543.821
## 12       unknown 1501.711
## 7  self-employed 1392.410
## 10    technician 1330.996
## 1         admin. 1226.736
## 8       services 1103.957
## 11    unemployed 1089.422
## 2    blue-collar 1085.162

Surprisingly, the highest average balance is in retired and housemaid category. But since the average is affected by outlier, we want to see how dispersed the balance in each job category.

ggplot(bank, aes(x=balance,y=job))+
  geom_boxplot(fill= "violet", color="black")+
   labs(y=NULL,title="Customer Job & Balance Distribution")+
  theme_light()

As we can see from the chart above, there are a lot of outliers in the balance data grouped by job, especially in the management job. Since the there are a lot of outliers, we decided to see the median, as shown in below table.

job_med_bal <- aggregate(formula=balance~job, data=bank, FUN=median)
job_med_bal <- job_med_bal[order(job_med_bal$balance, decreasing = T),]
job_med_bal
##              job balance
## 6        retired   672.5
## 12       unknown   655.5
## 5     management   577.0
## 7  self-employed   483.0
## 11    unemployed   473.5
## 10    technician   434.5
## 1         admin.   430.0
## 9        student   422.5
## 2    blue-collar   408.5
## 3   entrepreneur   365.5
## 4      housemaid   296.5
## 8       services   288.0
job_med_y_bal <- aggregate(formula=balance~job+y, data=bank, FUN=median)
job_med_y_bal <- job_med_y_bal[order(job_med_y_bal$balance, decreasing = T),]
job_med_y_bal
##              job   y balance
## 16     housemaid yes  1107.0
## 17    management yes  1064.0
## 18       retired yes  1029.5
## 23    unemployed yes   978.0
## 19 self-employed yes   780.5
## 24       unknown yes   732.0
## 14   blue-collar yes   664.0
## 12       unknown  no   634.0
## 13        admin. yes   579.0
## 22    technician yes   568.0
## 5     management  no   541.5
## 6        retired  no   511.0
## 7  self-employed  no   479.0
## 20      services yes   460.5
## 9        student  no   459.0
## 11    unemployed  no   415.0
## 10    technician  no   411.0
## 1         admin.  no   403.0
## 2    blue-collar  no   395.0
## 3   entrepreneur  no   376.0
## 21       student yes   331.0
## 8       services  no   266.0
## 4      housemaid  no   263.5
## 15  entrepreneur yes   170.0
ggplot(data=job_med_y_bal, aes(x=balance,y=reorder(job,balance), fill=y))+
  geom_bar(stat = 'identity', position = 'dodge')+
  labs(y=NULL,title="Campaign Job & Balances Distribution")

Based on the chart above, there is pattern that each customers’ job category who responded yes has higher median balance than the customer’s who responded no. Which is, make sense that customer with higher median balance tend to respond positively towards the campaign compared to other.

3.8 Loan & Housing

We want to analyze the customer who has loan and housing loan.

ggplot(data=bank, aes(x=loan, fill=loan))+
  geom_bar(position = 'dodge')+
  labs(X="Number of customers",
       y=NULL,
       title="Customer Loan")+theme_light()

From the customer loan chart, we can see that most of the customer has no loan.

ggplot(data=bank, aes(x=housing, fill=housing))+
  geom_bar(position = 'dodge')+
  labs(X="Number of customers",
       y=NULL,
       title="Customer Housing Loan")+theme_light()

More than half of the customer has housing loan.

Next we want to know their response towards this campaign by looking at their liability condition.

loan_tab <- data.frame(table(bank$loan,bank$y))
colnames(loan_tab) <- c("loan", "response","count")
ggplot(data=loan_tab, aes(x=loan,y=count,fill=response))+
  geom_bar(stat = 'identity',position = 'dodge')+
  labs(X="Number of customers",
       y=NULL,
       title="Loan Status & Customer Response")+theme_light()

Looking at the chart above, customer with no loan tend to give positive response towards this campaign compared to customer who has loan. Which plausible that customer with loan tend to use the money to repay their loan.

housing_tab <- data.frame(table(bank$housing,bank$y))
colnames(housing_tab) <- c("housing", "response","count")
ggplot(data=housing_tab, aes(x=housing,y=count, fill=response))+
  geom_bar(stat = 'identity', position = 'dodge')+
  labs(title="Housing Loan & Customer Response")

From the chart above, customer with no housing loan tend to give positive response towards this campaign compared to customer who has no housing loan.

From the conditions above, we can conclude that, customers with no liability, they have more positive response than the customers who are in debt.

3.9 Contact

We want to look at how the customers are contacted in this campaign and the result.

ggplot(data=bank, aes(x=contact, fill=contact))+
  geom_bar(position = 'dodge')+
  labs(X="Number of customers",
       y=NULL,
       title="Contact Methods")+theme_light()

Almost all of the customers are contacted by cellular.

contact_tab <- data.frame(table(bank$contact,bank$y))
colnames(contact_tab) <- c("contact", "response","count")
ggplot(contact_tab, aes(x=contact,y=count, fill=response))+
  geom_bar(stat = 'identity', position = 'dodge')+
  labs(title="Contact Response")

Based on the chart above that most customer who responded positively are contacted through cellular.

3.10 Previous Campaign.

ggplot(data=bank, aes(x=poutcome, fill=poutcome))+
  geom_bar(position = 'dodge')+
  labs(X="Number of customers",
       y=NULL,
       title="Contact Methods")+theme_light()

From the chart above, that we can see that most of the customers haven’t been contacted prior this campaign.

We want to check the result based on this previous marketing campaign.

p_out_tab <- data.frame(table(bank$poutcome,bank$y))
colnames(p_out_tab) <- c("prev_out", "response","count")
ggplot(p_out_tab, aes(x=prev_out,y=count, fill=response))+
  geom_bar(stat = 'identity', position = 'dodge')+
  labs(title="Contact Response")

Based on the chart above, we can see that:

  1. For previous campaign failure, customers are less likely to give positive responses on the campaign
  2. For previous campaign success, customers are more likely to give positive responses on the campaign
  3. For no previous campaign, the result is not substantial but most of the customers who gave positive result were not contacted prior this campaign.

3.11 Balances.

We want to look balances of the customer. First we want to map how disperse the balance data.

ggplot(bank, aes(x=balance))+
  geom_boxplot(fill= "violet", color="black")+
   labs(y=NULL,title="Customer Balance Distribution")+
  theme_light()

As we can see, it is quite hard to see at which point the median, so we try to see by using summary of balance.

summary(bank$balance)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -3313      69     444    1423    1480   71188

From the table above, the median of the balance is 444 and also with a lot of outliers, either minus balances or more than 1,500.

Previously, we already compared the balance with customers’ job, we want to see whether balance is related to education.

We will make filtered tables by using each education category, by subsetting each category as shown below.

prim_ed <- bank[(bank$education=="primary"),]
sec_ed <- bank[(bank$education=="secondary"),]
tert_ed <- bank[(bank$education=="tertiary"),]
unkn_ed <- bank[(bank$education=="unknown"),]
ggplot(prim_ed, aes(x=balance))+
  geom_boxplot(fill= "orange", color="red")+
   labs(y=NULL,title="Balance Distribution in Primary Education")+
  theme_light()

summary(prim_ed$balance)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  -966.00    65.25   404.50  1411.54  1394.75 71188.00
ggplot(sec_ed, aes(x=balance))+
  geom_boxplot(fill= "orange", color="red")+
   labs(y=NULL,title="Balance Distribution in Secondary Education")+
  theme_light()

summary(sec_ed$balance)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2082.0    57.0   385.5  1196.8  1269.8 26452.0
ggplot(tert_ed, aes(x=balance))+
  geom_boxplot(fill= "orange", color="red")+
   labs(y=NULL,title="Balance Distribution in Tertiary Education")+
  theme_light()

summary(tert_ed$balance)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -3313.0   102.0   583.5  1775.4  1840.0 42045.0
ggplot(unkn_ed, aes(x=balance))+
  geom_boxplot(fill= "orange", color="red")+
   labs(y=NULL,title="Balance Distribution in Secondary Education")+
  theme_light()

summary(unkn_ed$balance)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -715.0    88.5   574.0  1701.2  2066.0 21244.0

From each category summary and chart above, the higher the education, the higher the average and max balance in their account hence it could be argued that these are more likely to buy term deposits.

4. Conclusion

This marketing campaign resulted 11.5% of success campaign. From the chart above we would like to point key finding that might affect the result the marketing campaign:

  1. Higher education would have affected the perception of the investment offered in the campaign.
  2. Higher median balance would affect the decision whether to allocate their balance for investment.
  3. Customer who has loans and housing loans are less likely to invest their money.
  4. Call duration also affect how customer perceive the investment. Call duration around 300-400s with effective persuasion could increase the chance of campaign success.
  5. The campaign should concentrate on new customer as seen on chart previous campaign chart, that most customer who responded positively are customer who never been called prior of the campaign.

Reference

  • [Moro et al., 2011] S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM’2011, pp. 117-121, Guimarães, Portugal, October, 2011. EUROSIS.
  • Age Interval