1. Backgroung Information and Our Goals

We have a dataset of Portuguese Bank Marketing campaign.

The purpose of the campaign: promotion of term deposits among existing customers by direct phone call

Who: Portuguese banking institution

When: May 2008-November 2010

Our main goal: to increase the effectiveness of the marketing campaign on direct phone calls.

Our project will allow the bank to get a more detailed view of its customer base, to understand the current situation related to the promotion of the campaign and its success, to predict the reaction of customers to the campaign and to form the type of customer who will respond positively to the campaign and thereby increase the efficiency of the Bank, reducing the cost of unsuitable customers for calling.

By analyzing existing customer data and their specific characteristics, we plan to show how the bank is able to predict the behavior of its customers on the basis of certain features. We will also show what type of clients the Bank should pay attention to when implementing marketing plans to promote term deposits.

2. More About The Dataset

# loading libraries 
library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(descr)
library(SDMTools)
library(boot)
library(survival)
## 
## Attaching package: 'survival'
## The following object is masked from 'package:boot':
## 
##     aml
library(rms)
## Loading required package: Hmisc
## Loading required package: lattice
## 
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':
## 
##     melanoma
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
## Loading required package: SparseM
## 
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
## 
##     backsolve
library(ggrepel)
library(forcats)
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
## 
##     col_factor
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:Hmisc':
## 
##     is.discrete, summarize
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
library(viridis)
## Loading required package: viridisLite
## 
## Attaching package: 'viridis'
## The following object is masked from 'package:scales':
## 
##     viridis_pal
# loading the dataset (cleaned) because it has already less variables without missing values
bank <- read.csv("~/BA2019/bank_cleaned.csv")
dim(bank)
## [1] 40841    18
head(bank)
##   X age          job  marital education default balance housing loan day
## 1 0  58   management  married  tertiary      no    2143     yes   no   5
## 2 1  44   technician   single secondary      no      29     yes   no   5
## 3 2  33 entrepreneur  married secondary      no       2     yes  yes   5
## 4 5  35   management  married  tertiary      no     231     yes   no   5
## 5 6  28   management   single  tertiary      no     447     yes  yes   5
## 6 7  42 entrepreneur divorced  tertiary     yes       2     yes   no   5
##   month duration campaign pdays previous poutcome response response_binary
## 1   may     4.35        1    -1        0  unknown       no               0
## 2   may     2.52        1    -1        0  unknown       no               0
## 3   may     1.27        1    -1        0  unknown       no               0
## 4   may     2.32        1    -1        0  unknown       no               0
## 5   may     3.62        1    -1        0  unknown       no               0
## 6   may     6.33        1    -1        0  unknown       no               0
str(bank)
## 'data.frame':    40841 obs. of  18 variables:
##  $ X              : int  0 1 2 5 6 7 8 9 10 11 ...
##  $ age            : int  58 44 33 35 28 42 58 43 41 29 ...
##  $ job            : Factor w/ 12 levels "admin.","blue-collar",..: 5 11 3 5 5 3 7 11 1 1 ...
##  $ marital        : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 1 2 3 1 3 ...
##  $ education      : Factor w/ 3 levels "primary","secondary",..: 3 2 2 3 3 3 1 2 2 2 ...
##  $ default        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 1 1 1 ...
##  $ balance        : int  2143 29 2 231 447 2 121 593 270 390 ...
##  $ housing        : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ loan           : Factor w/ 2 levels "no","yes": 1 1 2 1 2 1 1 1 1 1 ...
##  $ day            : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month          : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ duration       : num  4.35 2.52 1.27 2.32 3.62 6.33 0.83 0.92 3.7 2.28 ...
##  $ campaign       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays          : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome       : Factor w/ 3 levels "failure","success",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ response       : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ response_binary: int  0 0 0 0 0 0 0 0 0 0 ...

The data contains 40841 objects of 18 variables:

3. Exploratory Data Analysis

Table of Response

resp <- prop.table(table(bank$response))

knitr::kable(resp)
Var1 Freq
no 0.8864132
yes 0.1135868

Let’s look first at the percentage of people who agreed to subscribed a term deposit and those who did not. As we can see, the marketing strategy of banks is not quite successful, because over 88% of people have not subscribed.

Correlation matrix

bank4 <- bank[, c(2,7,12,11,13,15,18)]
bank4$month <- as.numeric(bank4$month)
source("http://www.sthda.com/upload/rquery_cormat.r")
rquery.cormat(bank4, type="full")
## corrplot 0.84 loaded

## $r
##                 campaign   month previous duration response_binary     age
## campaign          1.0000 -0.1200  -0.0520  -0.0820          -0.072  0.0036
## month            -0.1200  1.0000   0.0320   0.0096          -0.023 -0.0460
## previous         -0.0520  0.0320   1.0000   0.0048           0.130  0.0130
## duration         -0.0820  0.0096   0.0048   1.0000           0.400 -0.0074
## response_binary  -0.0720 -0.0230   0.1300   0.4000           1.000  0.0230
## age               0.0036 -0.0460   0.0130  -0.0074           0.023  1.0000
## balance          -0.0260  0.0230   0.0370   0.0350           0.069  0.1000
##                 balance
## campaign         -0.026
## month             0.023
## previous          0.037
## duration          0.035
## response_binary   0.069
## age               0.100
## balance           1.000
## 
## $p
##                 campaign    month previous duration response_binary
## campaign         0.0e+00 6.5e-125  2.0e-25  1.2e-61         1.0e-48
## month           6.5e-125  0.0e+00  5.1e-11  5.3e-02         4.7e-06
## previous         2.0e-25  5.1e-11  0.0e+00  3.3e-01        2.3e-149
## duration         1.2e-61  5.3e-02  3.3e-01  0.0e+00         0.0e+00
## response_binary  1.0e-48  4.7e-06 2.3e-149  0.0e+00         0.0e+00
## age              4.7e-01  3.2e-20  9.3e-03  1.3e-01         2.7e-06
## balance          8.7e-08  2.8e-06  1.1e-13  7.4e-13         7.4e-45
##                     age balance
## campaign        4.7e-01 8.7e-08
## month           3.2e-20 2.8e-06
## previous        9.3e-03 1.1e-13
## duration        1.3e-01 7.4e-13
## response_binary 2.7e-06 7.4e-45
## age             0.0e+00 1.8e-97
## balance         1.8e-97 0.0e+00
## 
## $sym
##                 campaign month previous duration response_binary age
## campaign        1                                                   
## month                    1                                          
## previous                       1                                    
## duration                                1                           
## response_binary                         .        1                  
## age                                                              1  
## balance                                                             
##                 balance
## campaign               
## month                  
## previous               
## duration               
## response_binary        
## age                    
## balance         1      
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1

Correlation matrix shows us that response to the campaign has a strong correlation with duration of contacts, moderate correlation with previous number of contacts performed before this campaign and also moderate correlation with level of balance.

Distribution of Age

table1 <- bank %>% mutate(min = min(bank$age), max = max(bank$age), mean = mean(bank$age))
table2 <- table1[1, c(19,20,21)]
knitr::kable(table2)
min max mean
18 95 40.79068
m <- ggplot(bank, aes(x = age))+
  geom_histogram(col = "#4D8802", fill = "#35D073", alpha = 0.8)+
  xlab("Age")+
  ylab("Count")

m + geom_vline(aes(xintercept=mean(age)),
            color="red", linetype="dashed", size=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As it can be seen from graph, our data is distributed normally. The minimum age of clients is 18. the maximum age is 95, the mean age is 40. Thre is not much eldery and youngest people who receive calls from the bank, but a lot of clients between 30 to 50 years.

Distribution of Balance

table1 <- bank %>% mutate(min = min(bank$balance), max = max(bank$balance), mean = mean(bank$balance))
table2 <- table1[1, c(19,20,21)]
knitr::kable(table2)
min max mean
-6847 10443 1073.982
a <- ggplot(bank, aes(x = balance))+
  geom_histogram(col = "#4D8802", fill = "#00DC7D", alpha = 0.8)+
  xlab("Balance")+
  ylab("Count")

a + geom_vline(aes(xintercept=mean(balance)),
            color="red", linetype="dashed", size=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

From the table we see that minimum balance of clients is - 6847 euros, maximum - 10443 euros, mean balance level - 1073 euros. Not many people have sub-zero balance level the same as very bug amount of euros on the cards.

Distribution of Duration

table1 <- bank %>% mutate(min = min(bank$duration), max = max(bank$duration), mean = mean(bank$duration))
table2 <- table1[1, c(19,20,21)]
knitr::kable(table2)
min max mean
0.1 81.97 4.308949
p <- ggplot(bank, aes(x = duration))+
  geom_histogram(col = "#4D8802", fill = "#77BD8B", alpha = 0.8)+
  xlab("Duration")+
  ylab("Count")

p + geom_vline(aes(xintercept=mean(duration)),
            color="red", linetype="dashed", size=1)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The minimum duration of the contact is 0.1 minute, maximum - 81 minutes, mean duration of the contact - 4.3 minutes.

Visualisation of Balance Level and Age

ggplot(bank, aes(x = age, y = balance))+
  geom_point(col = "#41B619")

From the scatterplot we see that there is no clear correlation between age of clients and their balance level, but we can notice that the elder person the lower his balance level. A maximum balance level is seen of people from 25 to 60 years, but the retires’ balance is decreasing with their age.

Visualisation of Number of Contacts Performed During The Campaign and Response To The Campaign

ggplot(bank, aes(x=campaign, color=response)) +
  geom_histogram(fill="green", alpha=0.5, position="identity")+
  scale_x_continuous(limits = c(0, 20))+
  xlab("Number of Contacts Performed During The Campaign")+
  ylab("Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 223 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_bar).

The histogram shows that the more calls were made to the client during the promotion of the company, the probability of their positive response decreased. Despite the large variation in the data, it can be clearly seen that the fewer calls made by the bank, the more positive responses from customers were received.

Visualisation of Balance and Response To The Campaign

b<-ggplot(bank, aes(x=balance))+
  geom_histogram(color="#748700", fill="#A7E541", alpha = 0.8)+
  facet_grid(response ~ .)+
  xlab("Balance Level")+
  ylab("Count")

mu <- ddply(bank, "response", summarise, grp.mean=mean(balance))

b+geom_vline(data=mu, aes(xintercept=grp.mean),
             linetype="dashed", color = "red")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The graph shows that people who subscribed a term deposit on average have higher balance level than people who did not subscribe a term deposit.

Visualisation of Duration and Response To The Campaign

c<-ggplot(bank, aes(x=duration))+
  geom_histogram(color="#77BD8B", fill="#C9FFBF")+
  facet_grid(response ~ .)+
  scale_x_continuous(limits = c(0, 20))+
  xlab("Duration")+
  ylab("Count")
  

ma <- ddply(bank, "response", summarise, grp.mean=mean(duration))

c+geom_vline(data=ma, aes(xintercept=grp.mean),
             linetype="dashed", color = "red")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 513 rows containing non-finite values (stat_bin).
## Warning: Removed 4 rows containing missing values (geom_bar).

It is clearly seen from above that duration of the calls less than 5 minutes leaded to negative answer to the campaign, while the average minutes of calls of subscribed clients is around 10 minutes.

**The Proportion of Different Jobs and Response Answer ‘Yes’

job <- bank 

x <- table(job$job, job$response)
x
##                
##                   no  yes
##   admin.        4142  563
##   blue-collar   8168  637
##   entrepreneur  1231  109
##   housemaid     1048  101
##   management    7440 1125
##   other          143   15
##   retired       1572  448
##   self-employed 1281  162
##   services      3469  332
##   student        486  203
##   technician    6200  754
##   unemployed    1022  190
prop <- c(563/(4142+563)*100, 637/(8168+637)*100, 109/(1231+109)*100, 101/(1048+101)*100, 1125/(1125+7440)*100, 15/(15+143)*100, 448/(448+1572)*100, 162/(162+1281)*100, 332/(332+3469)*100, 203/(203+486)*100, 754/(754+6200)*100, 190/(190+1022)*100)
job <- c('admin', 'blue-collar','entrepreneur', 'housemaid','management','other','retired','self-employed','services','student','technician','unemployed')

proportion <- as.data.frame(prop, job) %>% summarise(job, prop) %>% arrange(desc(prop))
knitr::kable(proportion)
job prop
student 29.462990
retired 22.178218
unemployed 15.676568
management 13.134851
admin 11.965994
self-employed 11.226611
technician 10.842681
other 9.493671
housemaid 8.790252
services 8.734543
entrepreneur 8.134328
blue-collar 7.234526

Thanks to the received proportion of positive responses to the campaign on various professions, we notice an important point. The highest percentage of positive responses is found in students and retired. Thus, we can conclude that of all the jobs, these groups of people were more likely to give a positive response to calls received from the bank.

**The Proportion of Months and Response Answer ‘Yes’

job <- bank 

x <- table(job$month, job$response)
x
##      
##          no   yes
##   apr  2033   496
##   aug  5263   614
##   dec    94    79
##   feb  1868   390
##   jan  1061   122
##   jul  5968   552
##   jun  4363   490
##   mar   186   221
##   may 11670   826
##   nov  3129   354
##   oct   333   272
##   sep   234   223
prop1 <- c(496/(496+2033)*100,614/(5263+614)*100,79/(79+94)*100,390/(390+1868)*100,122/(122+1061)*100,552/(552+5968)*100,490/(490+4363)*100,221/(221+186)*100,826/(826+11670)*100,354/(354+3129)*100,272/(333+272)*100,223/(223+234)*100)
month <- c('April', 'August','December', 'February','January','July','June','March','May','November','October','September')

proportion <- as.data.frame(prop1, month) %>% summarise(month, prop1) %>% arrange(desc(prop1))
knitr::kable(proportion)
month prop1
March 54.299754
September 48.796499
December 45.664740
October 44.958678
April 19.612495
February 17.271922
August 10.447507
January 10.312764
November 10.163652
June 10.096847
July 8.466258
May 6.610115

Another table shows us that the majority of positive responses to the campaign were received in March, September, December and October. At this time more/about half of the people who received calls gave a positive response to the campaign.

Visualisation of Previous Outcome and Response To The Campaign

outcome <- bank %>% filter(poutcome != "unknown")
ggplot() +
  geom_bar(data = outcome, aes(x = response, fill=poutcome)) +
  scale_fill_brewer(name = "Previous Outcome", palette = "Dark2") +
  xlab("Response To The Campaign") +
  ylab("Count")

This simple graph shows that people who subscribed to the previous campaign tend to answer positive again to the new marketing campaign, while those who rejected subscribing before are more likely to do the same next time.

Visualisation of Number of Contacts and Age

ggplot(data = bank, aes(x = campaign, y = age)) +
  geom_point(aes(color = age))+
  scale_color_viridis(option = "D")+
  theme_minimal() +
  theme(legend.position = "right")+
  xlab("Number of Contacts")+
  ylab("Age")+
  scale_x_continuous()+
  scale_y_continuous()

Above we see that the younger the client the more calls the bank makes to offer its services, but the older the client, the fewer calls the bank makes. Based on the previous analysis, we can assume that the bank should call more often to retired people, as they are more likely to agree to a subscription.

Suggestions

The analysis showed that the marketing campaign should target older customers and students, and make more calls in March, December, October and September to increase the number of consenting customers. The bank should also not bother with their calls, because the more calls the lower the probability of receiving a positive response from the client. The bank should also be aware of the duration of the call, not making it too short and too long.