This report will analyze the Bank Marketing data from marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same customer was required, in order to access if the product (bank term deposit) would be (or not) subscribed.
#Entering data input from CSV file with ";" (semicolon) as separator.
bank <- read.csv("bank.csv", sep=";",header=T)
library(ggplot2)## Warning: package 'ggplot2' was built under R version 4.1.2
library(ggthemes)## Warning: package 'ggthemes' was built under R version 4.1.2
#Here's a quick look at the initial data set we're using.
head(bank)## age job marital education default balance housing loan contact day
## 1 30 unemployed married primary no 1787 no no cellular 19
## 2 33 services married secondary no 4789 yes yes cellular 11
## 3 35 management single tertiary no 1350 yes no cellular 16
## 4 30 management married tertiary no 1476 yes yes unknown 3
## 5 59 blue-collar married secondary no 0 yes no unknown 5
## 6 35 management single tertiary no 747 no no cellular 23
## month duration campaign pdays previous poutcome y
## 1 oct 79 1 -1 0 unknown no
## 2 may 220 1 339 4 failure no
## 3 apr 185 1 330 1 failure no
## 4 jun 199 4 -1 0 unknown no
## 5 may 226 1 -1 0 unknown no
## 6 feb 141 2 176 3 failure no
tail(bank)## age job marital education default balance housing loan contact
## 4516 32 services single secondary no 473 yes no cellular
## 4517 33 services married secondary no -333 yes no cellular
## 4518 57 self-employed married tertiary yes -3313 yes yes unknown
## 4519 57 technician married secondary no 295 no no cellular
## 4520 28 blue-collar married secondary no 1137 no no cellular
## 4521 44 entrepreneur single tertiary no 1136 yes yes cellular
## day month duration campaign pdays previous poutcome y
## 4516 7 jul 624 5 -1 0 unknown no
## 4517 30 jul 329 5 -1 0 unknown no
## 4518 9 may 153 1 -1 0 unknown no
## 4519 19 aug 151 11 -1 0 unknown no
## 4520 6 feb 129 4 211 3 other no
## 4521 3 apr 345 2 249 7 other no
#Below are the dimension of the dataset.
dim(bank)## [1] 4521 17
Bank Marketing data contains 4,521 rows and 17 columns.
Below are the columns name in this dataset.
#Below are the columns in the dataset.
names(bank)## [1] "age" "job" "marital" "education" "default" "balance"
## [7] "housing" "loan" "contact" "day" "month" "duration"
## [13] "campaign" "pdays" "previous" "poutcome" "y"
First, we need to check any missing value.
colSums(is.na(bank))## age job marital education default balance housing loan
## 0 0 0 0 0 0 0 0
## contact day month duration campaign pdays previous poutcome
## 0 0 0 0 0 0 0 0
## y
## 0
anyNA(bank)## [1] FALSE
Luckily, there is no missing value in the dataset.
After that we need to check the data type in each column.
#Inspecting data type for each column.
str(bank)## 'data.frame': 4521 obs. of 17 variables:
## $ age : int 30 33 35 30 59 35 36 39 41 43 ...
## $ job : chr "unemployed" "services" "management" "management" ...
## $ marital : chr "married" "married" "single" "married" ...
## $ education: chr "primary" "secondary" "tertiary" "tertiary" ...
## $ default : chr "no" "no" "no" "no" ...
## $ balance : int 1787 4789 1350 1476 0 747 307 147 221 -88 ...
## $ housing : chr "no" "yes" "yes" "yes" ...
## $ loan : chr "no" "yes" "no" "yes" ...
## $ contact : chr "cellular" "cellular" "cellular" "unknown" ...
## $ day : int 19 11 16 3 5 23 14 6 14 17 ...
## $ month : chr "oct" "may" "apr" "jun" ...
## $ duration : int 79 220 185 199 226 141 341 151 57 313 ...
## $ campaign : int 1 1 1 4 1 2 1 2 2 1 ...
## $ pdays : int -1 339 330 -1 -1 176 330 -1 -1 147 ...
## $ previous : int 0 4 1 0 0 3 2 0 0 2 ...
## $ poutcome : chr "unknown" "failure" "failure" "unknown" ...
## $ y : chr "no" "no" "no" "no" ...
From the initial result above, we need to change some columns to correct type by using data coertion as seen below. Also to help further analysis, we need to add interval columns such as age and balance.
bank$age <- as.numeric(bank$age)
bank$job <- as.factor(bank$job)
bank$marital <- as.factor(bank$marital)
bank$education <- as.factor(bank$education)
bank$default <- as.factor(bank$default)
bank$balance <- as.numeric(bank$balance)
bank$housing <- as.factor(bank$housing)
bank$loan <- as.factor(bank$loan)
bank$contact <- as.factor(bank$contact)
bank$day <- as.factor(bank$day)
bank$month <- as.factor(bank$month)
bank$duration <- as.numeric(bank$duration)
bank$campaign <- as.numeric(bank$campaign)
bank$pdays <- as.numeric(bank$pdays)
bank$previous <- as.numeric(bank$previous)
bank$poutcome <- as.factor(bank$poutcome)
bank$y <- as.factor(bank$y)str(bank)## 'data.frame': 4521 obs. of 17 variables:
## $ age : num 30 33 35 30 59 35 36 39 41 43 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
## $ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ...
## $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ balance : num 1787 4789 1350 1476 0 ...
## $ housing : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 2 2 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 2 ...
## $ contact : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 3 1 ...
## $ day : Factor w/ 31 levels "1","2","3","4",..: 19 11 16 3 5 23 14 6 14 17 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 9 9 9 1 ...
## $ duration : num 79 220 185 199 226 141 341 151 57 313 ...
## $ campaign : num 1 1 1 4 1 2 1 2 2 1 ...
## $ pdays : num -1 339 330 -1 -1 176 330 -1 -1 147 ...
## $ previous : num 0 4 1 0 0 3 2 0 0 2 ...
## $ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
Also to help further analysis, we need to add interval columns such as age and balance.For age interval, first we have to identify the unique value so we can have a glance for the age range.
age_unq <- data.frame(unique(bank$age))
age_u_sort <- age_unq[order(age_unq$unique.bank.age., decreasing=F),]
age_u_sort## [1] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
## [26] 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
## [51] 69 70 71 72 73 74 75 76 77 78 79 80 81 83 84 86 87
Based on the age data, we will group age in to following categories:
#Below are functions to add age interval column.
age_func <- function(y){
if(y>=18&y<=25){y <- "Young Adults"}
else
if(y>=26&y<=40){y <- "Adults"}
else
if(y>=41&y<=60){y <- "Middle-aged"}
else
if(y>60){y <- "Old Adults"}
}bank$age_int <- sapply(X=bank$age, FUN=age_func)After that we need to change the age interval column (age_int) as factor.
bank$age_int <- as.factor(bank$age_int)For introduction, we want to see percentage of customers response towards this campaign.
succ_tab <- data.frame(prop.table((table(bank$y)))*100)
colnames(succ_tab) <- c("response","perc(%)")
succ_tab## response perc(%)
## 1 no 88.476
## 2 yes 11.524
From the table above, we can see that 11,5% responded positively towards this campaign. Let’s look in the details that would affect this outcome.
From the chart below, customer’s age are dominantly between. Below are the distribution of the customer age.
ggplot(bank, aes(x=age))+
geom_bar()+
labs(x="Age",y="Count of Customers", title="Age of Customers")+
theme_classic()+
scale_x_continuous(breaks = seq(0, 90, 5))plot(bank$age_int, main="Age Interval Distribution")age_tab <- data.frame(table(bank$age_int,bank$y))
colnames(age_tab) <- c("age","response","number_of_customers")ggplot(data=age_tab, aes(x=age,y=number_of_customers, fill=response))+
geom_bar(stat = 'identity', position = 'dodge')+
labs(X="Number of customers",
y=NULL,
title="Campaign Result Age Distribution")By looking at the chart above, most customers who responded positively are in the range of Adults or between 26-40 years old, though in term of percentage is lower than other age categories. The interesting part is the higher percentage positive response is in the Old-Adults category compared with category, as shown by chart below.
oldadl_tab <- age_tab[(age_tab=="Old Adults"),]ggplot(data=oldadl_tab, aes(x=age,y=number_of_customers, fill=response))+
geom_bar(stat = 'identity', position = 'dodge')+
labs(X="Number of customers",
y=NULL,
title="Old Adults Response")We want to see the proportion of marital status and response of the customer.
mar_tab = data.frame(table(bank$marital, bank$y))
colnames(mar_tab) <-
c("marital_status","response","number_of_customers")
ggplot(data=mar_tab, aes(x=marital_status,y=number_of_customers, fill=response))+
geom_bar(stat = 'identity', position = 'dodge')+
labs(X="Number of customers",
y=NULL,
title="Campaign Marital Status Distribution")From the chart above, the marketing target are mostly married customer. Also, we can see that married customers are likely to respond positively compared to divorced and single.
We want to know education level of the customers and their response toward this campaign.
ed_tab <- data.frame(table(bank$education,bank$y))
colnames(ed_tab) <- c("education","response","number_of_customers")ggplot(data=ed_tab, aes(x=education,y=number_of_customers, fill=response))+
geom_bar(stat = 'identity', position = 'dodge')+
labs(X="Number of customers",
y=NULL,
title="Campaign Result Education Distribution")We can see that most customers are in the secondary level. Looking by the response from the chart above, the higher the education, more likely the customer to give positive response toward the campaign.
def_tab <- data.frame(table(bank$default))
colnames(def_tab) <- c("default","number_of_customers")ggplot(data=def_tab, aes(x=default,y=number_of_customers, fill=default))+
geom_bar(stat = 'identity', position = 'dodge')+
labs(X="Number of customers",
y=NULL,
title="Default Customer Distribution")Looking at the chart above that most of the customers in the marketing campaign have no default histories, which is logical that customers who default are less likely have idle fund.
Now we want to see the response distribution among the default status.
def_res_tab <- data.frame(table(bank$default,bank$y))
colnames(def_res_tab) <- c("default","response","number_of_customers")ggplot(data=def_res_tab, aes(x=default,y=number_of_customers, fill=response))+
geom_bar(stat = 'identity', position = 'dodge')+
labs(X="Number of customers",
y=NULL,
title="Default Status Distribution")From the chart above we can see that the most customers who responded positively have no default in their credit history.
We want to map the call duration within each response gave by the customer.
cl_dur <- bank[,c("y","duration")]ggplot(cl_dur, aes(x=y,y=duration))+
geom_boxplot(fill= "red", color="black")+
theme_light()Looking from the chart above that most successful campaign has higher median in call duration. So, the longer call could increase the chances of getting successful campaign.
We want to see the call performed each month.
bank$month <- factor(bank$month,levels = c("jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"))ggplot(data=bank, aes(x=month, fill=month))+
geom_bar()+
labs(X="Number of customers",
y=NULL,
title="Call Performed each Month")+theme_light()We can see, that the highest number of call performed happened in May. Next is we want to know the result of the call performed each month.
ggplot(data=bank, aes(x=month, fill=y))+
geom_bar(position = 'dodge')+
labs(y=NULL,
title="Call Performed each Month")+theme_light()As we can, more call performed doesn’t always mean more successful campaign. Many factors could affected the anomaly that happened in May.
We want to map the customers’ job in this campaign.
job_tab <- data.frame(table(bank$job))
colnames(job_tab) <- c("job", "count")ggplot(data=job_tab, aes(x=count, y=reorder(job,count), fill=job))+
geom_bar(stat = 'identity')+
labs(X=NULL,
y=NULL,
title="Customers' Job")+
theme_light()We can see from the chart above, most of the customers’ occupation are management and blue collar. Now we wan to see the response of each occupation.
job_y_tab <- data.frame(table(bank$job, bank$y))
colnames(job_y_tab) <- c("job","response","count")ggplot(data=job_y_tab, aes(x=count,y=reorder(job,count), fill=response))+
geom_bar(stat = 'identity', position = 'dodge')+
labs(X="Number of customers",
y=NULL,
title="Campaign Result Job Distribution")From the chart above, we can see that customers in management job are more likely to responded positively with the campaign.
We want to dig deeper with this job distribution. Usually, occupation is closely related to financial condition or their saving account. We would like to know average balance of each job category.
job_avg_bal <- aggregate(formula=balance~job, data=bank, FUN=mean)
job_avg_bal <- job_avg_bal[order(job_avg_bal$balance, decreasing = T),]
job_avg_bal## job balance
## 6 retired 2319.191
## 4 housemaid 2083.804
## 5 management 1766.929
## 3 entrepreneur 1645.125
## 9 student 1543.821
## 12 unknown 1501.711
## 7 self-employed 1392.410
## 10 technician 1330.996
## 1 admin. 1226.736
## 8 services 1103.957
## 11 unemployed 1089.422
## 2 blue-collar 1085.162
Surprisingly, the highest average balance is in retired and housemaid category. But since the average is affected by outlier, we want to see how dispersed the balance in each job category.
ggplot(bank, aes(x=balance,y=job))+
geom_boxplot(fill= "violet", color="black")+
labs(y=NULL,title="Customer Job & Balance Distribution")+
theme_light()As we can see from the chart above, there are a lot of outliers in the balance data grouped by job, especially in the management job. Since the there are a lot of outliers, we decided to see the median, as shown in below table.
job_med_bal <- aggregate(formula=balance~job, data=bank, FUN=median)
job_med_bal <- job_med_bal[order(job_med_bal$balance, decreasing = T),]
job_med_bal## job balance
## 6 retired 672.5
## 12 unknown 655.5
## 5 management 577.0
## 7 self-employed 483.0
## 11 unemployed 473.5
## 10 technician 434.5
## 1 admin. 430.0
## 9 student 422.5
## 2 blue-collar 408.5
## 3 entrepreneur 365.5
## 4 housemaid 296.5
## 8 services 288.0
job_med_y_bal <- aggregate(formula=balance~job+y, data=bank, FUN=median)
job_med_y_bal <- job_med_y_bal[order(job_med_y_bal$balance, decreasing = T),]
job_med_y_bal## job y balance
## 16 housemaid yes 1107.0
## 17 management yes 1064.0
## 18 retired yes 1029.5
## 23 unemployed yes 978.0
## 19 self-employed yes 780.5
## 24 unknown yes 732.0
## 14 blue-collar yes 664.0
## 12 unknown no 634.0
## 13 admin. yes 579.0
## 22 technician yes 568.0
## 5 management no 541.5
## 6 retired no 511.0
## 7 self-employed no 479.0
## 20 services yes 460.5
## 9 student no 459.0
## 11 unemployed no 415.0
## 10 technician no 411.0
## 1 admin. no 403.0
## 2 blue-collar no 395.0
## 3 entrepreneur no 376.0
## 21 student yes 331.0
## 8 services no 266.0
## 4 housemaid no 263.5
## 15 entrepreneur yes 170.0
ggplot(data=job_med_y_bal, aes(x=balance,y=reorder(job,balance), fill=y))+
geom_bar(stat = 'identity', position = 'dodge')+
labs(y=NULL,title="Campaign Job & Balances Distribution")Based on the chart above, there is pattern that each customers’ job category who responded yes has higher median balance than the customer’s who responded no. Which is, make sense that customer with higher median balance tend to respond positively towards the campaign compared to other.
We want to analyze the customer who has loan and housing loan.
ggplot(data=bank, aes(x=loan, fill=loan))+
geom_bar(position = 'dodge')+
labs(X="Number of customers",
y=NULL,
title="Customer Loan")+theme_light()From the customer loan chart, we can see that most of the customer has no loan.
ggplot(data=bank, aes(x=housing, fill=housing))+
geom_bar(position = 'dodge')+
labs(X="Number of customers",
y=NULL,
title="Customer Housing Loan")+theme_light()More than half of the customer has housing loan.
Next we want to know their response towards this campaign by looking at their liability condition.
loan_tab <- data.frame(table(bank$loan,bank$y))
colnames(loan_tab) <- c("loan", "response","count")ggplot(data=loan_tab, aes(x=loan,y=count,fill=response))+
geom_bar(stat = 'identity',position = 'dodge')+
labs(X="Number of customers",
y=NULL,
title="Loan Status & Customer Response")+theme_light()Looking at the chart above, customer with no loan tend to give positive response towards this campaign compared to customer who has loan. Which plausible that customer with loan tend to use the money to repay their loan.
housing_tab <- data.frame(table(bank$housing,bank$y))
colnames(housing_tab) <- c("housing", "response","count")ggplot(data=housing_tab, aes(x=housing,y=count, fill=response))+
geom_bar(stat = 'identity', position = 'dodge')+
labs(title="Housing Loan & Customer Response")From the chart above, customer with no housing loan tend to give positive response towards this campaign compared to customer who has no housing loan.
From the conditions above, we can conclude that, customers with no liability, they have more positive response than the customers who are in debt.
We want to look at how the customers are contacted in this campaign and the result.
ggplot(data=bank, aes(x=contact, fill=contact))+
geom_bar(position = 'dodge')+
labs(X="Number of customers",
y=NULL,
title="Contact Methods")+theme_light()Almost all of the customers are contacted by cellular.
contact_tab <- data.frame(table(bank$contact,bank$y))
colnames(contact_tab) <- c("contact", "response","count")ggplot(contact_tab, aes(x=contact,y=count, fill=response))+
geom_bar(stat = 'identity', position = 'dodge')+
labs(title="Contact Response")Based on the chart above that most customer who responded positively are contacted through cellular.
ggplot(data=bank, aes(x=poutcome, fill=poutcome))+
geom_bar(position = 'dodge')+
labs(X="Number of customers",
y=NULL,
title="Contact Methods")+theme_light()From the chart above, that we can see that most of the customers haven’t been contacted prior this campaign.
We want to check the result based on this previous marketing campaign.
p_out_tab <- data.frame(table(bank$poutcome,bank$y))
colnames(p_out_tab) <- c("prev_out", "response","count")ggplot(p_out_tab, aes(x=prev_out,y=count, fill=response))+
geom_bar(stat = 'identity', position = 'dodge')+
labs(title="Contact Response")Based on the chart above, we can see that:
We want to look balances of the customer. First we want to map how disperse the balance data.
ggplot(bank, aes(x=balance))+
geom_boxplot(fill= "violet", color="black")+
labs(y=NULL,title="Customer Balance Distribution")+
theme_light()As we can see, it is quite hard to see at which point the median, so we try to see by using summary of balance.
summary(bank$balance)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3313 69 444 1423 1480 71188
From the table above, the median of the balance is 444 and also with a lot of outliers, either minus balances or more than 1,500.
Previously, we already compared the balance with customers’ job, we want to see whether balance is related to education.
We will make filtered tables by using each education category, by subsetting each category as shown below.
prim_ed <- bank[(bank$education=="primary"),]
sec_ed <- bank[(bank$education=="secondary"),]
tert_ed <- bank[(bank$education=="tertiary"),]
unkn_ed <- bank[(bank$education=="unknown"),]ggplot(prim_ed, aes(x=balance))+
geom_boxplot(fill= "orange", color="red")+
labs(y=NULL,title="Balance Distribution in Primary Education")+
theme_light()summary(prim_ed$balance)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -966.00 65.25 404.50 1411.54 1394.75 71188.00
ggplot(sec_ed, aes(x=balance))+
geom_boxplot(fill= "orange", color="red")+
labs(y=NULL,title="Balance Distribution in Secondary Education")+
theme_light()summary(sec_ed$balance)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2082.0 57.0 385.5 1196.8 1269.8 26452.0
ggplot(tert_ed, aes(x=balance))+
geom_boxplot(fill= "orange", color="red")+
labs(y=NULL,title="Balance Distribution in Tertiary Education")+
theme_light()summary(tert_ed$balance)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3313.0 102.0 583.5 1775.4 1840.0 42045.0
ggplot(unkn_ed, aes(x=balance))+
geom_boxplot(fill= "orange", color="red")+
labs(y=NULL,title="Balance Distribution in Secondary Education")+
theme_light()summary(unkn_ed$balance)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -715.0 88.5 574.0 1701.2 2066.0 21244.0
From each category summary and chart above, the higher the education, the higher the average and max balance in their account hence it could be argued that these are more likely to buy term deposits.
This marketing campaign resulted 11.5% of success campaign. From the chart above we would like to point key finding that might affect the result the marketing campaign: