This work presents an exploratory data analysis (EDA) of a dataset on direct phone-call marketing campaigns in which bank customers were encouraged to make a deposit. The data were collected by a Portuguese banking institution from May 2008 to November 2010. The primary purpose of the analysis is churn prevention (in our case, preventing rejections): to understand the current situation with existing customers and to suggest data-driven policies that enhance customer engagement (i.e. increase the number of customers making bank deposits).
The EDA identified a number of characteristics that best describe the most probable deposit maker. These should help both in understanding the existing customer base and in prioritizing customer segments (in terms of time, effort and money spent) to target during future marketing campaigns.
Generally, existing customers are mostly middle-aged and married, hold at least a secondary education diploma, and occupy blue-collar, technical or managerial positions.
Looking more closely at deposit makers, the probability of convincing a person to make a deposit is much higher for those who have already accepted a previous offer. Moreover, calls are especially effective in the first months of spring and autumn. Retired people (aged 60 and older) and students (aged 18-25) are more likely to accept a deposit offer, and the chances are even higher if they are currently single and have higher education. The average account balance of deposit makers is also considerably higher than that of customers who refuse, with a difference of about 30% between the two groups. In general, people are more willing to make a deposit if their balance exceeds 1000. Finally, a longer call gives a greater chance of convincing a client to make a deposit.
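# Packages used throughout (assumed to be installed); `bank` is assumed to be loaded already.
library(ggplot2) # plotting
library(dplyr) # %>%, filter(), group_by(), summarise()
library(scales) # percent()
library(gridExtra) # grid.arrange()
# Convert the categorical columns (selected by position) to factors.
for (var in names(bank)[c(3:6,8,9,11,16:18)]){
bank[[var]] <- as.factor(bank[[var]])
}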
a <- ggplot(bank) +
geom_histogram(aes(x = age), binwidth = 2, alpha = 0.2, fill = "blue", color = "black")+ # binwidth overrides bins, so only binwidth is kept
labs(title = "Age distribution of existing customers
", x = "", y = "Number of respondents") +
scale_x_continuous(breaks = 0:85*5)+
scale_y_continuous(breaks = 0:3500*500)+
ggplot2::annotate(x= 75, y=3000, label=paste("Mean age = ",
round(mean(bank$age),1)),geom="text", size= 4)+
ggplot2::annotate(x= 75, y=2500, label=paste("Median age = ",
round(median(bank$age),1)),geom="text", size= 4)+
theme_minimal()+
theme(text = element_text(size = 10),
plot.title = element_text(hjust = 0.5), title = element_text(size = 8))
bank$job <- ordered(bank$job,
levels = c("other","student","housemaid","unemployed",
"entrepreneur","self-employed","retired","services",
"admin.","technician","management","blue-collar"))
b <- ggplot(bank, aes(x = job, y = after_stat(count / sum(count)))) +
geom_bar(color = "black", alpha = 0.2, fill = "blue") +
geom_text(aes(label = percent(after_stat(count / sum(count)))),
size = 2.5, stat= "count", hjust = -0.1) +
# limits are set in the scale itself; a separate ylim() call would replace this scale and drop the percent labels
scale_y_continuous(labels = percent, limits = c(0, 0.25)) +
labs(title = "Customers' occupation distribution
", y = "", x = "") +
theme_minimal() +
coord_flip()+
theme(text = element_text(size = 11),
plot.title = element_text(hjust = 0.5), title = element_text(size = 8))
bank$marital <- ordered(bank$marital,
levels = c("married", "single","divorced"))
c <- ggplot(bank, aes(x = marital, y = after_stat(count / sum(count)))) +
geom_bar(color = "black", alpha = 0.2, fill = "blue") +
geom_text(aes(label = percent(after_stat(count / sum(count)))),
size = 4, stat= "count", position = position_stack(vjust = 0.5)) +
# limits are set in the scale itself; a separate ylim() call would replace this scale and drop the percent labels
scale_y_continuous(labels = percent, limits = c(0, 0.65)) +
labs(title = "Customers' marital status
", y = "", x = "") +
theme_minimal() +
theme(text = element_text(size = 12),
plot.title = element_text(hjust = 0.5), title = element_text(size = 8))
bank$education <- ordered(bank$education,
levels = c("secondary","tertiary","primary"))
d <- ggplot(bank, aes(x = education, y = after_stat(count / sum(count)))) +
geom_bar(color = "black", alpha = 0.2, fill = "blue") +
geom_text(aes(label = percent(after_stat(count / sum(count)))),
size = 4, stat= "count", position = position_stack(vjust = 0.5)) +
# limits are set in the scale itself; a separate ylim() call would replace this scale and drop the percent labels
scale_y_continuous(labels = percent, limits = c(0, 0.55)) +
labs(title = "Customers' educational status
", y = "", x = "") +
theme_minimal() +
theme(text = element_text(size = 12),
plot.title = element_text(hjust = 0.5), title = element_text(size = 8))
grid.arrange(a,c,d,b, nrow = 2)
Brief customer overview:
As can be seen, the majority of the clients are between 30 and 45 years old, with a mean age of approximately 41. Still, there are a number of relatively young customers as well as senior ones.
More than half of the clients are married (60%), a little less than a third are single, and about one in ten is divorced.
About half of the customers hold only a secondary education diploma, about a third are university graduates, and the remainder completed only primary education.
As for occupation, roughly every fifth customer has a blue-collar job, and about as many hold a managerial position. Slightly fewer work in a technical field. Every tenth client works in services, and about 5% of the clients have already retired.
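The shares quoted in this overview can be obtained directly from category counts. Below is a minimal sketch in base R, assuming a hypothetical toy data frame `toy` that stands in for the `marital` column of `bank`; the values are made up for illustration and are not taken from the dataset.

```r
# Hypothetical toy data standing in for the `marital` column of `bank`.
toy <- data.frame(
  marital = c("married", "married", "married", "single", "single", "divorced")
)

# Count each category, then convert the counts to percentages of all rows.
counts <- table(toy$marital)
shares <- round(100 * prop.table(counts), 1)

# Sort from the largest to the smallest group for easier reading.
shares <- sort(shares, decreasing = TRUE)
print(shares)
```

The same two steps (`table()` followed by `prop.table()`) should apply to any categorical column, such as job or education.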
# Share of each response within every outcome of the previous campaign ("unknown" excluded).
set = bank %>% filter(poutcome != "unknown") %>%
group_by(poutcome, response) %>%
summarise(count = n()) %>% mutate(perc = round((count/sum(count)*100),1)) %>% na.omit()
gr1 <-ggplot(data = na.omit(subset(set, select = c(poutcome, response, count, perc))),
aes(x = poutcome, y = count, fill = response)) +
geom_bar(stat = "identity", color = "black", alpha = 0.2, position = "fill") +
# one label per bar segment, built only from the rows of the matching response
geom_text(data = filter(set, response == "yes"), aes(label = paste0(perc, " %")), y = 0.07, size = 3) +
geom_text(data = filter(set, response == "no"), aes(label = paste0(perc, " %")), y = 0.8, size = 3) +
scale_y_continuous(labels = scales::percent)+
labs(title = "Relative proportion of deposit makers\nby the outcome of previous campaign
", y = "", x = "") +
theme_minimal() +
guides(fill = "none")+
#scale_fill_discrete(name="Type of customer:",
#breaks=c("yes", "no"),labels=c("Deposit maker", "Refuser"))+
theme(text = element_text(size = 12),
plot.title = element_text(hjust = 0.5), title = element_text(size = 8))
# Share of each response within every marital status.
mar = bank %>%
group_by(marital, response) %>%
summarise(count = n()) %>% mutate(perc = round((count/sum(count)*100),1)) %>% na.omit()
gr2 <- ggplot(data = na.omit(subset(mar, select = c(marital, response, count, perc))),
aes(x = marital, y = count, fill = response)) +
geom_bar(stat = "identity", color = "black", alpha = 0.2, position = "fill") +
# one label per bar segment, built only from the rows of the matching response
geom_text(data = filter(mar, response == "yes"), aes(label = paste0(perc, " %")), y = 0.05, size = 3) +
geom_text(data = filter(mar, response == "no"), aes(label = paste0(perc, " %")), y = 0.8, size = 3) +
scale_y_continuous(labels = scales::percent)+
labs(title = "Relative proportion of deposit makers\nby marital status
", y = "", x = "") +
theme_minimal()+
guides(fill = "none")+
#scale_fill_discrete(name="Type of customer:",
#breaks=c("yes", "no"),labels=c("Deposit maker", "Refuser"))+
theme(text = element_text(size = 12),
plot.title = element_text(hjust = 0.5), title = element_text(size = 8))
job = bank %>%
group_by(job, response) %>%
summarise(count = n()) %>% mutate(perc = round((count/sum(count)*100),1)) %>% na.omit()
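# order jobs by the share of deposit makers; the "yes" and "no" shares sum to about 100,
# so averaging perc over both responses would give roughly 50 for every job
gr4 <- ggplot(data = na.omit(subset(job, select = c(job, response, count, perc))),
aes(x = reorder(job, perc * (response == "yes"), FUN = max), y = count, fill = response)) +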
geom_bar(stat = "identity", color = "black", alpha = 0.2, position = "fill") +
#geom_text(aes(label = ifelse(response == "yes", perc, " %")), y = 0.03, size = 3) +
#geom_text(aes(label = ifelse(response == "no", perc, " %")), y = 0.6, size = 3) +
scale_y_continuous(labels = scales::percent)+
labs(title = "Relative proportion of deposit makers by job
", y = "", x = "") +
theme_minimal() +
#guides(fill = F)+
scale_fill_discrete(name="Type of customer:",
breaks=c("yes", "no"), labels=c("Deposit maker", "Refuser"))+
coord_flip()+
theme(text = element_text(size = 10),
plot.title = element_text(hjust = 0.2), title = element_text(size = 8))
mon = bank %>%
group_by(month, response) %>%
summarise(count = n()) %>% mutate(perc = round((count/sum(count)*100),1)) %>% na.omit()
mon$month <- ordered(mon$month,
levels = (c("dec","nov","oct","sep","aug","jul","jun","may","apr","mar","feb","jan")))
gr5 <- ggplot(data = na.omit(subset(mon, select = c(month, response, count, perc))),
aes(x = month, y = count, fill = response)) +
geom_bar(stat = "identity", color = "black", alpha = 0.2, position = "fill") +
#geom_text(aes(label = ifelse(response == "yes", perc, " %")), y = 0.03, size = 3) +
#geom_text(aes(label = ifelse(response == "no", perc, " %")), y = 0.6, size = 3) +
scale_y_continuous(labels = scales::percent)+
labs(title = "Relative proportion of deposit makers\nachieved by month
", y = "", x = "") +
theme_minimal() +
guides(fill = "none")+
#scale_fill_discrete(name="Type of customer:",
#breaks=c("yes", "no"),labels=c("Deposit maker", "Refuser"))+
coord_flip()+
theme(text = element_text(size = 10),
plot.title = element_text(hjust = 0.5), title = element_text(size = 8))
grid.arrange(gr1,gr4,gr5,gr2, nrow = 2)
Brief overview of response-related characteristics:
As might be expected, a successful outcome of a previous marketing campaign for a given customer increases the chances of convincing that customer to make a deposit now: the probability that a customer accepts the current offer is 6 times higher than for customers who refused the offer in previous campaigns.
It is also reasonable to consider a client's current occupation: retired people and students accept the offer most often, as do people in managerial positions and the unemployed.
Interestingly, the proportion of people successfully convinced to make a deposit varies over the calendar year. The chances of turning a client into a deposit maker are highest in the first month of winter, spring and autumn. More precisely, more than half of all clients called in March agreed to make a deposit, and approximately the same proportions were observed in September and December.
Last but not least, customers who are single or divorced are more likely to make a deposit than married ones.
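A statement such as "N times higher" is a ratio of two acceptance rates. The sketch below shows one way to compute it in base R, assuming a hypothetical toy data frame `toy` with the same column names as `bank` (`poutcome`, `response`); the counts are invented for illustration and do not reproduce the figures above.

```r
# Hypothetical toy data: 10 customers per previous-campaign outcome.
toy <- data.frame(
  poutcome = c(rep("success", 10), rep("failure", 10)),
  response = c(rep("yes", 5), rep("no", 5),   # 5 of 10 accepted after a success
               rep("yes", 1), rep("no", 9))   # 1 of 10 accepted after a failure
)

# Acceptance rate within each group: the mean of a logical vector is the share of TRUEs.
rate <- tapply(toy$response == "yes", toy$poutcome, mean)

# Ratio of the two rates: how many times more likely acceptance is after a success.
lift <- as.numeric(rate["success"]) / as.numeric(rate["failure"])
print(round(100 * rate, 1))
print(round(lift, 1))
```

Replacing `poutcome` with another grouping column (for example marital status or month) should give the corresponding rates for the other panels.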
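# Mean balance per job and response, plus the two response subsets used for the overall means.
bal = bank %>% group_by(job, response) %>% summarise(mean_balance = mean(balance))
bal2 = bank %>% filter(response == "yes") # deposit makers
bal3 = bank %>% filter(response == "no") # refusers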
gr6 <- ggplot(data = na.omit(subset(bal, select = c(mean_balance, response, job))))+
geom_bar(data = bal %>% filter(response == "yes"),aes(x = reorder(job, -mean_balance),
y = mean_balance), stat = "identity", color = "black", alpha = 0.4, fill = "lightblue")+
geom_bar(data = bal %>% filter(response == "no"), aes(x = reorder(job,-mean_balance),
y = mean_balance), stat = "identity", color = "black", alpha = 0.4, fill = "pink")+
# overall group means are constants, so they are passed directly rather than through aes()
geom_hline(yintercept = mean(bal2$balance), color = "darkgreen", linetype = "dashed", size = 0.6)+
geom_hline(yintercept = mean(bal3$balance), color = "red", linetype = "dashed", size = 0.6)+
labs(title = "Mean balance of deposit makers (blue) VS refusers (pink)\nby job
", y = "Mean balance", x = "") +
scale_y_continuous(breaks = 0:2100*200)+
ggplot2::annotate(x= 7.5, y=1500, label=paste("Mean balance of deposit maker = ",
round(mean(bal2$balance),1)),geom="text", size= 4)+
theme_minimal() +
coord_cartesian(ylim=c(800,2010))+
theme(text = element_text(size = 12),axis.text.x = element_text(angle = 20, hjust = 1),
plot.title = element_text(hjust = 0.5), title = element_text(size = 12))
gr6
Let's look a bit closer at customers' balances, taking occupation into account. As the graph shows, the highest mean balances belong to retired people and to those employed in “other” categories of jobs (probably freelance, etc.). However, two things are worth noticing:
On average, the balance of a typical deposit maker (about 1500) is higher than that of a customer who refused to make a deposit (about 1000), a gap of roughly 30% relative to the deposit makers' balance. So, on average, people across all occupations are willing to make a deposit if their account balance is about 1500 or higher.
The important pattern is that, for each job type, those who decided to make a deposit had a substantially higher mean balance than those who didn't. The only exception is people working in “other” job types: there the mean balance of a deposit maker is much lower than that of a refuser within the same job type, which is particularly interesting.
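Because a percentage gap depends on which group is taken as the baseline, it may help to compute it both ways. The following is a minimal sketch that uses only the rounded group means quoted above (about 1500 and about 1000), not values recomputed from the data; the variable names are illustrative.

```r
# Rounded group means as quoted in the text (approximate, not recomputed from `bank`).
mean_yes <- 1500  # deposit makers
mean_no  <- 1000  # refusers

gap <- mean_yes - mean_no

# Gap relative to the refusers' mean: how much higher deposit makers are.
pct_vs_no <- 100 * gap / mean_no

# Gap relative to the deposit makers' mean: how much lower refusers are.
pct_vs_yes <- 100 * gap / mean_yes

print(round(c(higher_than_refusers = pct_vs_no,
              lower_than_makers = pct_vs_yes), 1))
```

With the real data, `mean(bal2$balance)` and `mean(bal3$balance)` could be substituted for the two rounded constants.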
edu = bank %>%
group_by(education, response) %>%
summarise(count = n()) %>% mutate(perc = round((count/sum(count)*100),1)) %>% na.omit()
gr3 <- ggplot(data = na.omit(subset(edu, select = c(education, response, count, perc))),
aes(x = education, y = count, fill = response)) +
geom_bar(stat = "identity", color = "black", alpha = 0.2, position = "fill") +
#geom_text(aes(label = ifelse(response == "yes", perc, " %")), y = 0.05, size = 3) +
#geom_text(aes(label = ifelse(response == "no", perc, " %")), y = 0.8, size = 3) +
scale_y_continuous(labels = scales::percent)+
labs(title = "Relative proportion of deposit makers\nby educational status
", y = "", x = "") +
theme_minimal() +
scale_fill_discrete(name="Type of customer:",
breaks=c("yes", "no"),
labels=c("Deposit maker", "Refuser"))+
theme(text = element_text(size = 10), axis.text.x = element_text(angle = 20),
plot.title = element_text(hjust = 0.5), title = element_text(size = 8))
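# duration is assumed to be in minutes here; include.lowest keeps zero-length calls in the first bin
bank$duration_cut = cut(bank$duration, breaks = c(0,1,2,3,4,5,6,7,8,9,10,85), include.lowest = TRUE)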
dur = bank %>%
group_by(duration_cut, response) %>%
summarise(count = n()) %>% mutate(perc = round((count/sum(count)*100),1)) %>% na.omit()
gr9 <- ggplot(data = na.omit(subset(dur, select = c(duration_cut, response, count, perc))),
aes(x = duration_cut, y = count, fill = response)) +
geom_bar(stat = "identity", color = "black", alpha = 0.2, position = "fill") +
#geom_text(aes(label = ifelse(response == "yes", perc, " %")), y = 0.05, size = 3) +
#geom_text(aes(label = ifelse(response == "no", perc, " %")), y = 0.8, size = 3) +
scale_y_continuous(labels = scales::percent)+
labs(title = "Relative proportion of deposit makers achieved\nby call duration
", y = "", x = "") +
theme_minimal() +
guides(fill = "none")+
#scale_fill_discrete(name="Type of customer:",
#breaks=c("yes", "no"),
#labels=c("Deposit maker", "Refuser"))+
theme(text = element_text(size = 10), axis.text.x = element_text(angle = 20),
plot.title = element_text(hjust = 0.5), title = element_text(size = 8))
bank$balance = as.numeric(bank$balance)
# left-closed bins: [-Inf,0) is "in debt", [0,100) is "0-100", ..., [5000,Inf) is the last bin
bank$balance_cut <- cut(bank$balance,
breaks = c(-Inf, 0, 100, 500, 1000, 2000, 3000, 4000, 5000, Inf),
labels = c("in debt","0-100","100-500","500-1000","1000-2000","2000-3000","3000-4000","4000-5000","more than 5000"),
right = FALSE, ordered_result = TRUE)
# named bal_share rather than ls, so the base function ls() is not shadowed
bal_share = bank %>%
group_by(balance_cut, response) %>%
summarise(count = n()) %>% mutate(perc = round((count/sum(count)*100),1)) %>% na.omit()
gr10 <- ggplot(data = na.omit(subset(bal_share, select = c(balance_cut, response, count, perc))),
aes(x = balance_cut, y = count, fill = response)) +
geom_bar(stat = "identity", color = "black", alpha = 0.2, position = "fill") +
#geom_text(aes(label = ifelse(response == "yes", perc, " %")), y = 0.05, size = 3) +
#geom_text(aes(label = ifelse(response == "no", perc, " %")), y = 0.8, size = 3) +
scale_y_continuous(labels = scales::percent)+
labs(title = "Relative proportion of deposit makers achieved\nby balance
", y = "", x = "") +
theme_minimal() +
guides(fill = "none")+
#scale_fill_discrete(name="Type of customer:",
#breaks=c("yes", "no"),
#labels=c("Deposit maker", "Refuser"))+
theme(text = element_text(size = 10), axis.text.x = element_text(angle = 20),
plot.title = element_text(hjust = 0.5), title = element_text(size = 8))
bank$age_cut = cut(bank$age, breaks = c(15,20,25,30,35,40,45,50,55,60,65,95))
ag = bank %>%
group_by(age_cut, response) %>%
summarise(count = n()) %>% mutate(perc = round((count/sum(count)*100),1)) %>% na.omit()
gr11 <- ggplot(data = na.omit(subset(ag, select = c(age_cut, response, count, perc))),
aes(x = age_cut, y = count, fill = response)) +
geom_bar(stat = "identity", color = "black", alpha = 0.2, position = "fill") +
#geom_text(aes(label = ifelse(response == "yes", perc, " %")), y = 0.05, size = 3) +
#geom_text(aes(label = ifelse(response == "no", perc, " %")), y = 0.8, size = 3) +
scale_y_continuous(labels = scales::percent)+
labs(title = "Relative proportion of deposit makers achieved\nby age groups
", y = "", x = "") +
theme_minimal() +
guides(fill = "none")+
#scale_fill_discrete(name="Type of customer:",
#breaks=c("yes", "no"),
#labels=c("Deposit maker", "Refuser"))+
theme(text = element_text(size = 10), axis.text.x = element_text(angle = 20),
plot.title = element_text(hjust = 0.5), title = element_text(size = 8))
grid.arrange(gr9, gr3, gr10, gr11, nrow = 2)
Some further details about the clients' characteristics are also worth noting.
The longer the phone call, the higher the chances that a customer eventually agrees to make a deposit. For calls longer than 10 minutes, almost half of the clients called agreed to make a deposit.
As for educational level, clients with higher education are more likely to make a deposit, although the difference from clients with only secondary or primary education is not large.
Looking at the distribution of customers' account balances, a typical deposit maker most probably has 1000 to 5000 or more in his or her account, so this is the target audience to focus on. People in debt or with a balance near 0 are the least likely to make a deposit, so there is little reason to spend time and effort calling them during a marketing campaign.
Finally, the highest shares of deposit makers are found among young people aged 15 to 25 (most probably students, as discovered previously) and senior clients aged 60 to 95. This is important to keep in mind when running a new marketing campaign: students and senior citizens are more likely to agree to make a deposit.
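The binned comparisons above all follow the same pattern: cut a numeric variable into intervals, then compute the share of deposit makers inside each interval. Here is a minimal sketch in base R, assuming a hypothetical toy data frame `toy` with call durations in minutes; the values and the three coarse bins are invented for illustration.

```r
# Hypothetical toy data: call duration in minutes and the customer's response.
toy <- data.frame(
  duration = c(0.5, 1.5, 2.5, 4, 6, 8, 11, 12, 15, 20),
  response = c("no", "no", "no", "no", "yes", "no", "yes", "no", "yes", "yes")
)

# Bin the durations; include.lowest = TRUE keeps a value equal to the first break.
toy$duration_cut <- cut(toy$duration, breaks = c(0, 5, 10, Inf), include.lowest = TRUE)

# Share of "yes" responses inside each bin.
rate <- tapply(toy$response == "yes", toy$duration_cut, mean)
print(round(100 * rate, 1))
```

The same approach should carry over to the balance and age groupings by changing the column and the break points.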