The dataset contains information about marketing campaigns that were conducted via phone calls from a Portuguese banking institution to their clients. Purpose of these campaigns is to prompt their clients to subscribe for a specific financial product of the bank (term deposit). After each call was conducted, the client had to inform the institution about their intention of either subscribing to the product (indicating a successful campaign) or not (unsucessful campaign). The final output of this survey will be a binary result indicating if the client subscribed (‘yes’) to the product or not (‘no’).
The dataset has 41188 rows (instances of calls to clients) and 21 columns (variables) which are describing certain aspects of the call. Please note that there are cases where the same client was contacted multiple times - something that practically doesn’t affect the analysis as each call will be considered independent from another even if the client is the same.
In the StoryBoard pages we will see how successful the campaign was - plus, some other insights into the data so we can get a better understanding of what lies within.
1. Variables that describing attributes related directly to the client:
2. Variables related to the last contact of the current campaign:
3. Other variables related to the campaign(s):
4. Socioeconomic variables:
Note that May, June, July and November are successful periods to get people to subscribe. We can also deduce that the “married” marital status is more likely to get people to subscribe followed by single then divorced. Another clear indicator is that people who do not subscribe irrespective of marital status keep the call duration to a minimum.
Chart Helper
The red line is count of non subscribed and the blue line is count of subscribed (by month, and marital status).
The Boxplot is a statistical summary of call duration in seconds by month, maritaL status and subscribed (yes or no)
Here we can clearly see that the 20 to 30 years old single are the most likely to subscribe - followed by 20 to 40 years old married
job divorced married single unknown
admin. 132 (2.8448%) 652 (14.0517%) 566 (12.1983%) 2 (0.0431%)
blue-collar 53 (1.1422%) 421 (9.0733%) 161 (3.4698%) 3 (0.0647%)
entrepreneur 14 (0.3017%) 88 (1.8966%) 21 (0.4526%) 1 (0.0216%)
housemaid 16 (0.3448%) 74 (1.5948%) 16 (0.3448%) 0 (0.0000%)
management 39 (0.8405%) 226 (4.8707%) 63 (1.3578%) 0 (0.0000%)
retired 92 (1.9828%) 329 (7.0905%) 12 (0.2586%) 1 (0.0216%)
self-employed 16 (0.3448%) 82 (1.7672%) 51 (1.0991%) 0 (0.0000%)
services 33 (0.7112%) 166 (3.5776%) 124 (2.6724%) 0 (0.0000%)
student 3 (0.0647%) 8 (0.1724%) 264 (5.6897%) 0 (0.0000%)
technician 65 (1.4009%) 384 (8.2759%) 279 (6.0129%) 2 (0.0431%)
unemployed 10 (0.2155%) 86 (1.8534%) 48 (1.0345%) 0 (0.0000%)
unknown 3 (0.0647%) 16 (0.3448%) 15 (0.3233%) 3 (0.0647%)
job divorced married single unknown
admin. 132 (27.7311%) 652 (25.7504%) 566 (34.9383%) 2 (16.6667%)
blue-collar 53 (11.1345%) 421 (16.6272%) 161 (9.9383%) 3 (25.0000%)
entrepreneur 14 (2.9412%) 88 (3.4755%) 21 (1.2963%) 1 (8.3333%)
housemaid 16 (3.3613%) 74 (2.9226%) 16 (0.9877%) 0 (0.0000%)
management 39 (8.1933%) 226 (8.9258%) 63 (3.8889%) 0 (0.0000%)
retired 92 (19.3277%) 329 (12.9937%) 12 (0.7407%) 1 (8.3333%)
self-employed 16 (3.3613%) 82 (3.2385%) 51 (3.1481%) 0 (0.0000%)
services 33 (6.9328%) 166 (6.5561%) 124 (7.6543%) 0 (0.0000%)
student 3 (0.6303%) 8 (0.3160%) 264 (16.2963%) 0 (0.0000%)
technician 65 (13.6555%) 384 (15.1659%) 279 (17.2222%) 2 (16.6667%)
unemployed 10 (2.1008%) 86 (3.3965%) 48 (2.9630%) 0 (0.0000%)
unknown 3 (0.6303%) 16 (0.6319%) 15 (0.9259%) 3 (25.0000%)
job divorced married single unknown
admin. 132 (9.7633%) 652 (48.2249%) 566 (41.8639%) 2 (0.1479%)
blue-collar 53 (8.3072%) 421 (65.9875%) 161 (25.2351%) 3 (0.4702%)
entrepreneur 14 (11.2903%) 88 (70.9677%) 21 (16.9355%) 1 (0.8065%)
housemaid 16 (15.0943%) 74 (69.8113%) 16 (15.0943%) 0 (0.0000%)
management 39 (11.8902%) 226 (68.9024%) 63 (19.2073%) 0 (0.0000%)
retired 92 (21.1982%) 329 (75.8065%) 12 (2.7650%) 1 (0.2304%)
self-employed 16 (10.7383%) 82 (55.0336%) 51 (34.2282%) 0 (0.0000%)
services 33 (10.2167%) 166 (51.3932%) 124 (38.3901%) 0 (0.0000%)
student 3 (1.0909%) 8 (2.9091%) 264 (96.0000%) 0 (0.0000%)
technician 65 (8.9041%) 384 (52.6027%) 279 (38.2192%) 2 (0.2740%)
unemployed 10 (6.9444%) 86 (59.7222%) 48 (33.3333%) 0 (0.0000%)
unknown 3 (8.1081%) 16 (43.2432%) 15 (40.5405%) 3 (8.1081%)
---
title: "Banking Campaign by Phil Sivyer"
output:
flexdashboard::flex_dashboard:
orientation: columns
vertical_layout: fill
social: menu
source_code: embed
theme: readable
css: fd.css
keep_tex: yes
---
```{r setup, include=T}
library(flexdashboard)
library(tidyverse)
library(plotly)
library(dplyr)
library(lubridate)
library(plotly)
library(ggplot2)
library(modelr)
library(corrplot)
library(corrgram)
library(janitor)
library(DT)
library(knitr)
library(formattable)
library(rmarkdown)
library(sqldf)
library(party)
library(partykit)
library(caret)
library(packrat)
library(rsconnect)
library(skimr)
library(sjmisc)
library(kableExtra)
library(highcharter)
library(psych)
library(ggcorrplot)
library(visreg)
library(GGally)
library(scales)
library(themis)
library(tidymodels)
library(kknn)
library(tidymetrics)
library(ggforce)
library(tidyquant)
library(concaveman)
library(ggthemes)
bank.df <- read.table("C:/Reference Material/Machine_Learning/Bank_Campaign.csv", sep=";", header=T)
bank.df <- bank.df %>%
mutate(across(where(is.character),as.factor))
bank.df$month <- factor(bank.df$month, levels=c('jan','feb','mar','apr','may','jun',
'jul','aug','sep','oct','nov','dec'))
```
```{r setup-data, cache=F,include=F}
####################
#### Load Data ####
####################
```
About {data-navmenu="Explore"}
=====================================
Column {data-width=250}
-------------------------------------
### About this dashboard
<font style="font-size: 16px">The dataset contains information about marketing campaigns that were conducted via phone calls from a Portuguese banking institution to their clients. Purpose of these campaigns is to prompt their clients to subscribe for a specific financial product of the bank (term deposit). After each call was conducted, the client had to inform the institution about their intention of either subscribing to the product (indicating a successful campaign) or not (unsucessful campaign).
The final output of this survey will be a binary result indicating if the client subscribed ('yes') to the product or not ('no').
The dataset has 41188 rows (instances of calls to clients) and 21 columns (variables) which are describing certain aspects of the call. Please note that there are cases where the same client was contacted multiple times - something that practically doesn't affect the analysis as each call will be considered independent from another even if the client is the same.
**In the StoryBoard pages we will see how successful the campaign was - plus, some other insights into the data so we can get a better understanding of what lies within.**</font>
Column {.tabset}
-------------------------------------
### Dataset Content
**1. Variables that describing attributes related directly to the client:**
- a. age
- b. job: type of job (e.g. 'admin', 'technician', 'unemployed', etc)
- c. marital: marital status ('married', 'single', 'divorced', 'unknown')
- d. education: level of education ('basic.4y', 'high.school', 'basic.6y', 'basic.9y','professional.course', 'unknown','university.degree','illiterate')
- e. default: if the client has credit in default ('no', 'unknown', 'yes')
- f. housing: if the client has housing a loan ('no', 'unknown', 'yes')
- g. loan: if the client has a personal loan ? ('no', 'unknown', 'yes')
**2. Variables related to the last contact of the current campaign:**
- a. contact: type of communication ('telephone', 'cellular')
- b. month: month of last contact
- c. day_of_week: day of last contact
- d. duration: call duration (in seconds)
**3. Other variables related to the campaign(s):**
- a. campaign: number of contacts performed during this campaign and for this client
- b. pdays: number of days passed by after the client was last contacted from a previous campaign
- c. previous: number of contacts performed before this campaign and for this client
- d. poutcome: outcome of previous marketing campaign ('nonexistent', 'failure', 'success')
**4. Socioeconomic variables:**
- a. emp.var.rate: employement variation rate - quarterly indicator
- b. cons.price.idx: consumer price index - monthly indicator
- c. cons.conf.idx: consumer confidence index - monthly indicator
- d. euribor3m: euribor 3 month rate - daily indicator
- e. nr.employed: number of employees - quarterly indicator
```{r }
```
Storyboard {.storyboard data-navmenu="Explore"}
=========================================
### Subscribe success by month & marital status {data-commentary-width=210}
```{r}
subscribed <- bank.df %>%
filter(subscribed == 'yes',marital != 'unknown') %>%
group_by(month,marital) %>%
mutate(yes= sum(n())) %>%
ungroup() %>%
distinct(month,marital,yes)
subscribed$month <- factor(subscribed$month, levels=c('jan','feb','mar','apr','may','jun',
'jul','aug','sep','oct','nov','dec'))
#########################################################################################
not_subscribed <- bank.df %>%
filter(subscribed == 'no',marital != 'unknown') %>%
group_by(month,marital) %>%
mutate(no= sum(n())) %>%
ungroup() %>%
distinct(month,marital,no)
not_subscribed$month <- factor(not_subscribed$month, levels=c('jan','feb','mar','apr','may','jun',
'jul','aug','sep','oct','nov','dec'))
p1 <-ggplot() +
geom_boxplot(data = bank.df %>%
filter(marital != 'unknown'),
aes(x = month, y = duration,
group = interaction(month, subscribed),
fill = subscribed),
width = 10) +
coord_cartesian(ylim = c(0, 4000))+
facet_grid(.~marital)+
labs(x="month", y = "call duration in seconds & count of yes/no subscribed")+
theme(axis.text.x=element_text(angle=-90, vjust=0.4,hjust=1))+
geom_line(data = subscribed,
aes(x = month, y = yes, group = 1), size = 0.7, color = '#499FCA')+
geom_line(data = not_subscribed,
aes(x = month, y = no, group = 1), size = 0.7, color = '#E6565B')
ggplotly(p1)%>%
config(displayModeBar = FALSE)
```
***
Note that **May, June, July and November** are successful periods to get people to subscribe. We can also deduce that the "married" marital status is more likely to get people to subscribe followed by single then divorced. Another clear indicator is that people who do not subscribe irrespective of marital status keep the call duration to a minimum.
**Chart Helper**
- *The red line is count of non subscribed and the blue line is count of subscribed (by month, and marital status).*
- *The Boxplot is a statistical summary of call duration in seconds by month, maritaL status and subscribed (yes or no)*
### Histogram - Age (bin width = 10) {data-commentary-width=210}
```{r sm-1-jan22-2017,fig.width=10}
p <- ggplot(bank.df %>% filter(marital != 'unknown'), aes(x=age,fill = subscribed)) + geom_histogram(binwidth=10,colour="white")
p <- p + facet_grid(subscribed ~ marital, scales="free", space="free")
ggplotly(p)%>%
config(displayModeBar = FALSE)
```
***
Here we can clearly see that the **20 to 30 years old single** are the most likely to subscribe - followed by **20 to 40 years old married**
### Percent of subscribed success
```{r sm-2-jan22-2017,fig.width=10}
figaa <- plot_ly()
figaa <- figaa %>% add_pie(data = count(bank.df %>%
filter(subscribed == 'yes'), job), labels = ~job, values = ~n,
name = "Job", domain = list(x = c(0, 0.4), y = c(0.4, 1)))
figaa <- figaa %>% add_pie(data = count(bank.df%>%
filter(subscribed == 'yes'),marital), labels = ~marital, values = ~n,
name = "Marital", domain = list(x = c(0.6, 1), y = c(0.4, 1)))
figaa <- figaa %>% add_pie(data = count(bank.df%>%
filter(subscribed == 'yes'),education ), labels = ~education, values = ~n,
name = "Education", domain = list(x = c(0.25, 0.75), y = c(0, 0.6)))
figaa <- figaa %>% layout(title = "Left to right - Job - Education - Marital Status", showlegend = F,
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
figaa%>%
config(displayModeBar = FALSE)
```
### Decision tree
```{r sm-3-jan22-2017,fig.width=10}
d_tree <- bank.df %>%
mutate(education = case_when(education %in% c('university.degree','professional.course') ~ 'uni', TRUE ~ 'non_uni')) %>%
mutate(job = case_when(job %in% c('management','technician','entrepeneur','self-employed') ~ 'prof', TRUE ~ 'non_prof')) %>%
mutate(age = case_when(age > 60 ~ '>60'
,age > 40 ~ '41_60'
,age >= 20 ~ '20_40'
,age <= 20 ~ '<20'
,TRUE ~ 'NA'))%>%
select(subscribed,education,job,marital,age)
d_tree <- d_tree %>%
dplyr::mutate(across(where(is.character),as.factor))
pd <- sample(2,nrow(d_tree),replace = TRUE, prob = c(0.8,0.2))
train <- d_tree[pd == 1,]
validate <- d_tree[pd == 2,]
tree <- ctree(subscribed ~., data = d_tree
, maxdepth = 4, alpha = 0.5
# , controls = ctree_control(mincriterion = 0.9, minsplit = 50)
)
plot(tree, gp = gpar(fontsize = 7),inner_panel=node_inner(tree,pval = FALSE, id = FALSE, fill = "orange"), terminal_panel = node_barplot(tree,id = FALSE),
ip_args = list(id = FALSE, fill = "red"),
ep_args = list(fill = "yellow"))
```
### Heat map for Jobs v Marital status that have subscribed
```{r sm-4-jan22-2017,fig.width=10}
tile <- bank.df %>%
filter(subscribed == 'yes', marital != "unknown") %>%
group_by(job,marital) %>%
count(marital,name = 'n')
pl1<-tile %>%
ggplot(aes(job,marital,fill = n))+
geom_tile()+
geom_label(aes(label = n),fill = "white")+
scale_fill_viridis_c()+
theme_linedraw()+
labs(title = "Subscribed data")
ggplotly(pl1)
```
### Subscribed scatter plot - areas denoteing marital status and age
```{r sm-5-jan22-2017,fig.width=10}
library(ggforce)
library(tidyquant)
library(concaveman)
g1 <- bank.df %>%
filter(marital != 'unknown', subscribed == 'yes') %>%
ggplot(aes(duration,age))+
geom_point(aes(color = marital))+
geom_mark_hull(aes(fill= marital, label =marital),
concavity = 2.0)+
geom_smooth(se = TRUE, span = 0.3)+
expand_limits(y = 100) +
theme_fivethirtyeight()+
theme(axis.title = element_text(),text = element_text(family = "Rubik"))+
scale_fill_tq() +
labs(title = "Scatter Plot - Duration of call & Age of caller", subtitle = "",
y = "Duration of call",
x = "Age of caller",
caption = "Bank of Portugal")
g1
```
### Subscribed over time by marital status and by call duration
```{r sm-6-jan22-2017,fig.width=10}
b1 <- bank.df %>%
filter(subscribed != 'yes', marital != "unknown") %>%
group_by(month, marital) %>%
summarise(total_days = sum(duration)/86400)
hc <- b1 %>%
hchart('line', hcaes(x = 'month', y = 'total_days', group = "marital"),
style = list(fontFamily = "Rubik"))%>%
hc_plotOptions(line = list(
lineWidth = 4 # set the line width to 3 pixels
#, dashStyle = "LongDashDot"
))%>%
hc_colors(c("red", "green","black")) %>%
hc_chart(
backgroundColor = "lightgrey" # set the background color to light grey
) %>%
hc_xAxis(
gridLineWidth = 0 # remove the x-axis gridlines
) %>%
hc_yAxis(
gridLineWidth = 0 # remove the y-axis gridlines
)
hc
```
Polar chart {data-navmenu="Explore"}
=====================================
Column {data-width=600}
-------------------------------------
### Call duration by jobs and subscribed
```{r, eval=T}
b3 <- bank.df %>%
filter(subscribed == 'yes') %>%
group_by(job) %>%
summarise(duration = as.numeric(sprintf("%.2f",sum(duration)/86400)))
highchart() %>%
hc_chart(polar = TRUE) %>%
hc_title(text = "Which profession has held the longest phone call duration (days)",
style = list(fontWeight = "bold", fontSize = "30px"),
align = "center") %>%
hc_subtitle(text = "‘Data is subscribed only & taken from 1 years data source - red denotes higher than the overall mean’",
style = list(fontWeight = "bold"),
align = "center") %>%
hc_xAxis(categories = b3$job,
style = list(fontWeight = "bold")) %>%
hc_credits(enabled = TRUE,
text = "Data Source taken from 'KAGGLE'") %>%
hc_add_theme(hc_theme_ffx()) %>%
hc_legend(enabled = FALSE) %>%
hc_series(
list(
name = "Bars",
data = b3$duration,
colorByPoint = TRUE,
type = "column",
colors = ifelse(b3$duration < mean(b3$duration),"green","red")
),
list(
name = "line",
data = b3$duration,
pointPlacement = "on",
type = "line"))
```
Column {data-width=600}
-------------------------------------
### Call duration by education and subscribed
```{r,eval=T}
b4 <- bank.df %>%
filter(subscribed == 'yes') %>%
group_by(education) %>%
summarise(duration = as.numeric(sprintf("%.2f",sum(duration)/86400)))
highchart() %>%
hc_chart(polar = TRUE) %>%
hc_title(text = "Which education has held the longest phone call duration (days)",
style = list(fontWeight = "bold", fontSize = "30px"),
align = "center") %>%
hc_subtitle(text = "‘Data is subscribed only & taken from 1 years data source - red denotes higher than the overall mean’",
style = list(fontWeight = "bold"),
align = "center") %>%
hc_xAxis(categories = b4$education,
style = list(fontWeight = "bold")) %>%
hc_credits(enabled = TRUE,
text = "Data Source taken from 'KAGGLE'") %>%
hc_add_theme(hc_theme_ffx()) %>%
hc_legend(enabled = FALSE) %>%
hc_series(
list(
name = "Bars",
data = b4$duration,
colorByPoint = TRUE,
type = "column",
colors = ifelse(b4$duration < mean(b4$duration),"green","red")
),
list(
name = "line",
data = b4$duration,
pointPlacement = "on",
type = "line"))
```
# Subscribed / Job - Education - percentages by all, column and row
## Row {.tabset .tabset-fade}
-----------------------------------------------------------------------
### **Job percentages by marital status - all %**
```{r,eval=T}
perc_all <- bank.df %>%
filter(subscribed == 'yes') %>%
select(marital,job)
##
## GENDER RACE %
gr1 <-perc_all %>%
tabyl(job,marital) %>%
adorn_percentages("all") %>% ## CAN BE col, row, all
adorn_pct_formatting(digits = 4) %>%
adorn_ns("front")
print(gr1)
```
### **Job percentages by marital status - column %**
```{r,eval=T}
gr2 <-perc_all %>%
tabyl(job,marital) %>%
adorn_percentages("col") %>% ## CAN BE col, row, all
adorn_pct_formatting(digits = 4) %>%
adorn_ns("front")
print(gr2)
```
### **Job percentages by marital status - row %**
```{r,eval=T}
gr3 <-perc_all %>%
tabyl(job,marital) %>%
adorn_percentages("row") %>% ## CAN BE col, row, all
adorn_pct_formatting(digits = 4) %>%
adorn_ns("front")
print(gr3)
```