We have a dataset of Portuguese Bank Marketing campaign.
The purpose of the campaign: promotion of term deposits among existing customers by direct phone call
Who: Portuguese banking institution
When: May 2008-November 2010
Our main goal: to increase the effectiveness of the marketing campaign on direct phone calls.
Our project will allow the bank to get a more detailed view of its customer base, to understand the current situation related to the promotion of the campaign and its success, to predict the reaction of customers to the campaign and to form the type of customer who will respond positively to the campaign and thereby increase the efficiency of the Bank, reducing the cost of unsuitable customers for calling.
By analyzing existing customer data and their specific characteristics, we plan to show how the bank is able to predict the behavior of its customers on the basis of certain features. We will also show what type of clients the Bank should pay attention to when implementing marketing plans to promote term deposits.
# loading libraries
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(descr)
library(SDMTools)
library(boot)
library(survival)
##
## Attaching package: 'survival'
## The following object is masked from 'package:boot':
##
## aml
library(rms)
## Loading required package: Hmisc
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':
##
## melanoma
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
## Loading required package: SparseM
##
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
##
## backsolve
library(ggrepel)
library(forcats)
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
##
## col_factor
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:Hmisc':
##
## is.discrete, summarize
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
library(viridis)
## Loading required package: viridisLite
##
## Attaching package: 'viridis'
## The following object is masked from 'package:scales':
##
## viridis_pal
# loading the dataset (cleaned) because it has already less variables without missing values
bank <- read.csv("~/BA2019/bank_cleaned.csv")
dim(bank)
## [1] 40841 18
head(bank)
## X age job marital education default balance housing loan day
## 1 0 58 management married tertiary no 2143 yes no 5
## 2 1 44 technician single secondary no 29 yes no 5
## 3 2 33 entrepreneur married secondary no 2 yes yes 5
## 4 5 35 management married tertiary no 231 yes no 5
## 5 6 28 management single tertiary no 447 yes yes 5
## 6 7 42 entrepreneur divorced tertiary yes 2 yes no 5
## month duration campaign pdays previous poutcome response response_binary
## 1 may 4.35 1 -1 0 unknown no 0
## 2 may 2.52 1 -1 0 unknown no 0
## 3 may 1.27 1 -1 0 unknown no 0
## 4 may 2.32 1 -1 0 unknown no 0
## 5 may 3.62 1 -1 0 unknown no 0
## 6 may 6.33 1 -1 0 unknown no 0
str(bank)
## 'data.frame': 40841 obs. of 18 variables:
## $ X : int 0 1 2 5 6 7 8 9 10 11 ...
## $ age : int 58 44 33 35 28 42 58 43 41 29 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 5 11 3 5 5 3 7 11 1 1 ...
## $ marital : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 1 2 3 1 3 ...
## $ education : Factor w/ 3 levels "primary","secondary",..: 3 2 2 3 3 3 1 2 2 2 ...
## $ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 1 1 1 ...
## $ balance : int 2143 29 2 231 447 2 121 593 270 390 ...
## $ housing : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ loan : Factor w/ 2 levels "no","yes": 1 1 2 1 2 1 1 1 1 1 ...
## $ day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ month : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ duration : num 4.35 2.52 1.27 2.32 3.62 6.33 0.83 0.92 3.7 2.28 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 3 levels "failure","success",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ response : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ response_binary: int 0 0 0 0 0 0 0 0 0 0 ...
The data contains 40841 objects of 18 variables:
resp <- prop.table(table(bank$response))
knitr::kable(resp)
| Var1 | Freq |
|---|---|
| no | 0.8864132 |
| yes | 0.1135868 |
Let’s look first at the percentage of people who agreed to subscribed a term deposit and those who did not. As we can see, the marketing strategy of banks is not quite successful, because over 88% of people have not subscribed.
bank4 <- bank[, c(2,7,12,11,13,15,18)]
bank4$month <- as.numeric(bank4$month)
source("http://www.sthda.com/upload/rquery_cormat.r")
rquery.cormat(bank4, type="full")
## corrplot 0.84 loaded
## $r
## campaign month previous duration response_binary age
## campaign 1.0000 -0.1200 -0.0520 -0.0820 -0.072 0.0036
## month -0.1200 1.0000 0.0320 0.0096 -0.023 -0.0460
## previous -0.0520 0.0320 1.0000 0.0048 0.130 0.0130
## duration -0.0820 0.0096 0.0048 1.0000 0.400 -0.0074
## response_binary -0.0720 -0.0230 0.1300 0.4000 1.000 0.0230
## age 0.0036 -0.0460 0.0130 -0.0074 0.023 1.0000
## balance -0.0260 0.0230 0.0370 0.0350 0.069 0.1000
## balance
## campaign -0.026
## month 0.023
## previous 0.037
## duration 0.035
## response_binary 0.069
## age 0.100
## balance 1.000
##
## $p
## campaign month previous duration response_binary
## campaign 0.0e+00 6.5e-125 2.0e-25 1.2e-61 1.0e-48
## month 6.5e-125 0.0e+00 5.1e-11 5.3e-02 4.7e-06
## previous 2.0e-25 5.1e-11 0.0e+00 3.3e-01 2.3e-149
## duration 1.2e-61 5.3e-02 3.3e-01 0.0e+00 0.0e+00
## response_binary 1.0e-48 4.7e-06 2.3e-149 0.0e+00 0.0e+00
## age 4.7e-01 3.2e-20 9.3e-03 1.3e-01 2.7e-06
## balance 8.7e-08 2.8e-06 1.1e-13 7.4e-13 7.4e-45
## age balance
## campaign 4.7e-01 8.7e-08
## month 3.2e-20 2.8e-06
## previous 9.3e-03 1.1e-13
## duration 1.3e-01 7.4e-13
## response_binary 2.7e-06 7.4e-45
## age 0.0e+00 1.8e-97
## balance 1.8e-97 0.0e+00
##
## $sym
## campaign month previous duration response_binary age
## campaign 1
## month 1
## previous 1
## duration 1
## response_binary . 1
## age 1
## balance
## balance
## campaign
## month
## previous
## duration
## response_binary
## age
## balance 1
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1
Correlation matrix shows us that response to the campaign has a strong correlation with duration of contacts, moderate correlation with previous number of contacts performed before this campaign and also moderate correlation with level of balance.
table1 <- bank %>% mutate(min = min(bank$age), max = max(bank$age), mean = mean(bank$age))
table2 <- table1[1, c(19,20,21)]
knitr::kable(table2)
| min | max | mean |
|---|---|---|
| 18 | 95 | 40.79068 |
m <- ggplot(bank, aes(x = age))+
geom_histogram(col = "#4D8802", fill = "#35D073", alpha = 0.8)+
xlab("Age")+
ylab("Count")
m + geom_vline(aes(xintercept=mean(age)),
color="red", linetype="dashed", size=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
As it can be seen from graph, our data is distributed normally. The minimum age of clients is 18. the maximum age is 95, the mean age is 40. Thre is not much eldery and youngest people who receive calls from the bank, but a lot of clients between 30 to 50 years.
table1 <- bank %>% mutate(min = min(bank$balance), max = max(bank$balance), mean = mean(bank$balance))
table2 <- table1[1, c(19,20,21)]
knitr::kable(table2)
| min | max | mean |
|---|---|---|
| -6847 | 10443 | 1073.982 |
a <- ggplot(bank, aes(x = balance))+
geom_histogram(col = "#4D8802", fill = "#00DC7D", alpha = 0.8)+
xlab("Balance")+
ylab("Count")
a + geom_vline(aes(xintercept=mean(balance)),
color="red", linetype="dashed", size=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
From the table we see that minimum balance of clients is - 6847 euros, maximum - 10443 euros, mean balance level - 1073 euros. Not many people have sub-zero balance level the same as very bug amount of euros on the cards.
table1 <- bank %>% mutate(min = min(bank$duration), max = max(bank$duration), mean = mean(bank$duration))
table2 <- table1[1, c(19,20,21)]
knitr::kable(table2)
| min | max | mean |
|---|---|---|
| 0.1 | 81.97 | 4.308949 |
p <- ggplot(bank, aes(x = duration))+
geom_histogram(col = "#4D8802", fill = "#77BD8B", alpha = 0.8)+
xlab("Duration")+
ylab("Count")
p + geom_vline(aes(xintercept=mean(duration)),
color="red", linetype="dashed", size=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The minimum duration of the contact is 0.1 minute, maximum - 81 minutes, mean duration of the contact - 4.3 minutes.
ggplot(bank, aes(x = age, y = balance))+
geom_point(col = "#41B619")
From the scatterplot we see that there is no clear correlation between age of clients and their balance level, but we can notice that the elder person the lower his balance level. A maximum balance level is seen of people from 25 to 60 years, but the retires’ balance is decreasing with their age.
ggplot(bank, aes(x=campaign, color=response)) +
geom_histogram(fill="green", alpha=0.5, position="identity")+
scale_x_continuous(limits = c(0, 20))+
xlab("Number of Contacts Performed During The Campaign")+
ylab("Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 223 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_bar).
The histogram shows that the more calls were made to the client during the promotion of the company, the probability of their positive response decreased. Despite the large variation in the data, it can be clearly seen that the fewer calls made by the bank, the more positive responses from customers were received.
b<-ggplot(bank, aes(x=balance))+
geom_histogram(color="#748700", fill="#A7E541", alpha = 0.8)+
facet_grid(response ~ .)+
xlab("Balance Level")+
ylab("Count")
mu <- ddply(bank, "response", summarise, grp.mean=mean(balance))
b+geom_vline(data=mu, aes(xintercept=grp.mean),
linetype="dashed", color = "red")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The graph shows that people who subscribed a term deposit on average have higher balance level than people who did not subscribe a term deposit.
c<-ggplot(bank, aes(x=duration))+
geom_histogram(color="#77BD8B", fill="#C9FFBF")+
facet_grid(response ~ .)+
scale_x_continuous(limits = c(0, 20))+
xlab("Duration")+
ylab("Count")
ma <- ddply(bank, "response", summarise, grp.mean=mean(duration))
c+geom_vline(data=ma, aes(xintercept=grp.mean),
linetype="dashed", color = "red")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 513 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_bar).
It is clearly seen from above that duration of the calls less than 5 minutes leaded to negative answer to the campaign, while the average minutes of calls of subscribed clients is around 10 minutes.
job <- bank
x <- table(job$job, job$response)
x
##
## no yes
## admin. 4142 563
## blue-collar 8168 637
## entrepreneur 1231 109
## housemaid 1048 101
## management 7440 1125
## other 143 15
## retired 1572 448
## self-employed 1281 162
## services 3469 332
## student 486 203
## technician 6200 754
## unemployed 1022 190
prop <- c(563/(4142+563)*100, 637/(8168+637)*100, 109/(1231+109)*100, 101/(1048+101)*100, 1125/(1125+7440)*100, 15/(15+143)*100, 448/(448+1572)*100, 162/(162+1281)*100, 332/(332+3469)*100, 203/(203+486)*100, 754/(754+6200)*100, 190/(190+1022)*100)
job <- c('admin', 'blue-collar','entrepreneur', 'housemaid','management','other','retired','self-employed','services','student','technician','unemployed')
proportion <- as.data.frame(prop, job) %>% summarise(job, prop) %>% arrange(desc(prop))
knitr::kable(proportion)
| job | prop |
|---|---|
| student | 29.462990 |
| retired | 22.178218 |
| unemployed | 15.676568 |
| management | 13.134851 |
| admin | 11.965994 |
| self-employed | 11.226611 |
| technician | 10.842681 |
| other | 9.493671 |
| housemaid | 8.790252 |
| services | 8.734543 |
| entrepreneur | 8.134328 |
| blue-collar | 7.234526 |
Thanks to the received proportion of positive responses to the campaign on various professions, we notice an important point. The highest percentage of positive responses is found in students and retired. Thus, we can conclude that of all the jobs, these groups of people were more likely to give a positive response to calls received from the bank.
job <- bank
x <- table(job$month, job$response)
x
##
## no yes
## apr 2033 496
## aug 5263 614
## dec 94 79
## feb 1868 390
## jan 1061 122
## jul 5968 552
## jun 4363 490
## mar 186 221
## may 11670 826
## nov 3129 354
## oct 333 272
## sep 234 223
prop1 <- c(496/(496+2033)*100,614/(5263+614)*100,79/(79+94)*100,390/(390+1868)*100,122/(122+1061)*100,552/(552+5968)*100,490/(490+4363)*100,221/(221+186)*100,826/(826+11670)*100,354/(354+3129)*100,272/(333+272)*100,223/(223+234)*100)
month <- c('April', 'August','December', 'February','January','July','June','March','May','November','October','September')
proportion <- as.data.frame(prop1, month) %>% summarise(month, prop1) %>% arrange(desc(prop1))
knitr::kable(proportion)
| month | prop1 |
|---|---|
| March | 54.299754 |
| September | 48.796499 |
| December | 45.664740 |
| October | 44.958678 |
| April | 19.612495 |
| February | 17.271922 |
| August | 10.447507 |
| January | 10.312764 |
| November | 10.163652 |
| June | 10.096847 |
| July | 8.466258 |
| May | 6.610115 |
Another table shows us that the majority of positive responses to the campaign were received in March, September, December and October. At this time more/about half of the people who received calls gave a positive response to the campaign.
outcome <- bank %>% filter(poutcome != "unknown")
ggplot() +
geom_bar(data = outcome, aes(x = response, fill=poutcome)) +
scale_fill_brewer(name = "Previous Outcome", palette = "Dark2") +
xlab("Response To The Campaign") +
ylab("Count")
This simple graph shows that people who subscribed to the previous campaign tend to answer positive again to the new marketing campaign, while those who rejected subscribing before are more likely to do the same next time.
ggplot(data = bank, aes(x = campaign, y = age)) +
geom_point(aes(color = age))+
scale_color_viridis(option = "D")+
theme_minimal() +
theme(legend.position = "right")+
xlab("Number of Contacts")+
ylab("Age")+
scale_x_continuous()+
scale_y_continuous()
Above we see that the younger the client the more calls the bank makes to offer its services, but the older the client, the fewer calls the bank makes. Based on the previous analysis, we can assume that the bank should call more often to retired people, as they are more likely to agree to a subscription.
The analysis showed that the marketing campaign should target older customers and students, and make more calls in March, December, October and September to increase the number of consenting customers. The bank should also not bother with their calls, because the more calls the lower the probability of receiving a positive response from the client. The bank should also be aware of the duration of the call, not making it too short and too long.